Transcript for:
Lecture on Groq AI Chip

Today I would like to talk about the new AI chip that is breaking all the speed records, and the best part is that it's fully designed and manufactured in the US. As a chip designer, I was very curious to know whether this chip is as good as it's cracked up to be, or just hype. Let's find out.

First of all, I've previously discussed the trend that the future of computing is in custom silicon, and here it is: the Groq chip is an ASIC, an application-specific integrated circuit, designed specifically for language processing. Have a look at this silicon die: it's born and raised in the US, a rival to Nvidia. While all the other AI chips designed by Nvidia, AMD, Intel, Google, and Tesla are highly dependent on manufacturing and packaging technology from TSMC, this one is entirely domestic in both design and manufacturing, so to say 100% American. It's manufactured at GlobalFoundries on a 14-nanometer process, and 14 nm is quite a mature technology, which makes it more robust and much cheaper to fabricate. However, Groq is already working on the next generation of their chip, which will be fabricated by Samsung in 4 nanometers at the new Samsung factory in Texas.

Now let me explain the Groq benchmarks everyone is talking about and how exactly they were achieved. Spoiler: the inference speed is truly fantastic. You know, when I'm having a conversation with ChatGPT, I often have to wait 3 to 5 seconds for a response, and that's because the GPT model runs on the Microsoft Azure cloud powered by Nvidia GPUs, so the speed at which I get a response mostly depends on the speed and latency of those GPUs. 3 to 5 seconds is very slow; it's a delay that everyone can easily notice. Just imagine for a moment if I started to talk that slowly, everyone would drop off immediately. Groq took the open-source Mixtral model and accelerated it on their hardware to make for a better user experience. I was testing it in chat mode, and I get a response in less than a quarter of a second.

These are the official benchmarks comparing different AI inference services running the same Mixtral model. Here is Perplexity running on Nvidia GPUs through the Amazon cloud: it costs about 25 cents per 1 million tokens and has a throughput of 150 tokens per second. We obviously want the highest throughput at the lowest price, and you can see that there is one big outlier, Groq, which costs about 30 cents per 1 million tokens and delivers about 430 tokens per second. If we take the average, the Groq system is four to five times faster than any of the inference services listed here. According to another official benchmark, Meta's Llama 2, a 70-billion-parameter model, runs up to 18 times faster on the Groq system than on GPU-based cloud providers!
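As a rough back-of-envelope sketch, here is what those quoted figures work out to. These are the rounded numbers from this lecture, not official vendor specs, and the four-to-five-times claim is an average over several GPU-based providers, only one of which (Perplexity) appears here.

```python
# Back-of-envelope comparison of the inference figures quoted above.
# Rounded numbers from the lecture, not official vendor specifications.

services = {
    # name: (USD per 1M tokens, tokens per second)
    "Perplexity (Nvidia GPUs, AWS)": (0.25, 150),
    "Groq LPU system":               (0.30, 430),
}

for name, (usd_per_million, tps) in services.items():
    tokens_per_dollar = 1_000_000 / usd_per_million
    print(f"{name}: {tps} tok/s, {tokens_per_dollar:,.0f} tokens per dollar")

# Head-to-head against this one GPU-backed service:
print(f"Throughput ratio: {430 / 150:.1f}x")  # ~2.9x; the 4-5x figure in the
# lecture is an average across several GPU-based inference providers.
```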
Clearly these benchmarks have generated a lot of excitement around Groq's hardware. Now it's time to understand how this was achieved and how it compares to other AI accelerators. So have a look at the layout, the layout of the Groq 1 chip. You may have noticed that there are many thousands of repeating blocks, which appear to resemble solar cells. Now why should you care, why does this chip layout deserve your attention? Because it has something that most modern AI accelerators don't have: all of its RAM is on the chip. This is similar in a way to the Cerebras chip design, although in the Cerebras chip the memory and the cores are intertwined, while Nvidia GPUs, on the other hand, always come with a huge off-chip memory.

There are two or three main advantages of having on-chip memory. First of all, this close coupling of the matrix unit and the memory helps to minimize latency, and this explains why everyone trying Groq hardware is seeing such amazing latency. As you can see in this figure, Groq's hardware is outstanding across the board: with 430 tokens per second at 0.3 s latency, it's clearly amazingly quick when it comes to responding to prompts. Let me know what you think.

The second advantage of having on-chip memory is that this chip doesn't require expensive and hard-to-get advanced packaging technology. Groq doesn't depend on memory chips from Korea's SK Hynix or on CoWoS packaging technology from TSMC. This makes it much cheaper for Groq to manufacture their silicon: a more mature process node means lower cost per wafer and per die, and no fancy packaging means they don't have to pay for that either, and they can stick to domestic manufacturing offerings. All of this gives them a rare flexibility to switch between fabs; they just migrated from GlobalFoundries to the Samsung fab to move to a smaller, more advanced process node.

On the left side of the chip you can see the matrix unit, which is the chip's main workhorse, and the same matrix unit is also on the right side of the chip. Each square millimeter of this chip is capable of one tera-operation per second. The matrix unit is followed by the memory and the vector unit, and these units pass results between each other, so you can stream data from the east to the west of the chip as well as the other way around. As much as I love the silicon part of the Groq chip, the overall performance is achieved by co-designing software and hardware, the complete stack.

Now, although Groq is selling their chips, their business model is mostly focused on inference as a service, and that's very interesting. You know, training an AI model is a one-time problem, and computing power is getting cheaper and cheaper, but inference is a constant problem, and it's an inherently larger market than training. Most importantly, it scales very well as more users, more businesses, and more people start to use generative AI.
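Before moving on, let me put the latency figures from that benchmark chart in concrete terms for a single chat reply. This is a minimal sketch: it assumes the 0.3 s figure is time-to-first-token, that throughput stays constant, and the 300-token reply length is a value I picked purely for illustration.

```python
# What 430 tok/s with 0.3 s latency means for one chat reply.
# Assumes 0.3 s is time-to-first-token and throughput stays constant;
# the 300-token reply length is an assumed, typical-ish value.

throughput_tps = 430          # tokens per second (Groq on Mixtral, as quoted)
first_token_latency_s = 0.3   # latency figure from the benchmark chart
reply_tokens = 300            # assumed length of a medium chat answer

time_per_token_ms = 1000 / throughput_tps                        # ~2.3 ms
total_reply_s = first_token_latency_s + reply_tokens / throughput_tps

print(f"{time_per_token_ms:.1f} ms per generated token")
print(f"~{total_reply_s:.1f} s for a {reply_tokens}-token reply")  # ~1.0 s
# versus the 3-5 s per reply described earlier for GPU-backed ChatGPT.
```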
So companies like Mistral and Meta are building open-source AI models, and then Groq accelerates these models on their hardware for other companies that want to use these open-source models in their applications. Talking about the prospects of this chip, we need to keep the market in mind: how big is it? Yes, there is the Amazon cloud, and there is the Microsoft Azure cloud, which OpenAI uses to run their models, and there is also Google Cloud, and Google DeepMind can run their inference on Google's infrastructure. But what about medium and small businesses? They need to run their models somewhere, right? That's the market Groq is addressing, and it's huge and it's growing.

However, Groq's inference services are reportedly not yet profitable, and to make them profitable they are planning to scale the throughput per chip and also to scale the number of chips to 1 million by the end of 2024. Groq claims they will be able to break even by the end of 2024. I think it's possible if they can scale up fast enough to keep up with the pace of AI development. In fact, right now is a perfect time for this chip, because this super fast speed and low latency might be game-changing for many applications. If we are talking about chatbots and voice assistants, this speed advantage can make a huge difference, because it can make the whole interaction feel more natural. So congratulations, now it's going to be even harder to distinguish an AI agent from a real person.

Now let's discuss some of the concerns. For a model like Mixtral, with about 50 billion parameters, the Groq chip architecture looks great, but that same model requires 578 Groq chips to run it; for comparison, the same Mixtral model can fit on two Nvidia H100 GPUs. And now you might be thinking: wait, how is this going to scale when we get to one-trillion- or ten-trillion-parameter models? Actually, this is a question that should be asked of any AI accelerator: how well can you scale? The thing is, Groq has on-chip memory only, which has a lot of pros, as we discussed, but it also has its cons. Fast forward into the future: for a 10-trillion-parameter model we would need tens of terabytes of memory, and if we calculate that back to the Groq chip, which has 220 MB of memory per chip, it would require Groq to scale to tens or hundreds of thousands of Groq chips networked together. Then just think about distributing the load between them and networking them in a way that achieves the same low latency. That's going to be challenging, if it's even feasible at all.
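Here is a rough sketch of the scaling math behind that concern. I'm assuming 2 bytes per parameter (fp16/bf16 weights) and ignoring activations, KV cache, and routing overhead, which is why the estimate for a Mixtral-class model comes out below the 578 chips quoted above.

```python
# Rough estimate of how many Groq chips are needed just to hold model weights
# in on-chip memory. Assumes fp16/bf16 weights (2 bytes per parameter) and
# ignores activations / KV cache, so real deployments need more chips.

SRAM_PER_CHIP_BYTES = 220e6   # on-chip memory per Groq chip, as quoted above

def chips_for_weights(num_params, bytes_per_param=2):
    return num_params * bytes_per_param / SRAM_PER_CHIP_BYTES

for label, params in [("~50B (Mixtral-class)", 50e9),
                      ("1T", 1e12),
                      ("10T", 10e12)]:
    print(f"{label:>22}: ~{chips_for_weights(params):,.0f} chips")

# ~50B (Mixtral-class): ~455 chips   (the quoted deployment uses 578)
#                   1T: ~9,091 chips
#                  10T: ~90,909 chips, all networked together
```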
Let me know what you think in the comments. And before we get to Groq's competition and my outlook, I would like to ask you: if you're enjoying this video, please consider subscribing to the channel and sharing this video with friends and colleagues who might be interested. This helps the channel more than you know. Thank you.

Speaking of scaling and competition, Groq's architecture is quite different from the AI accelerators made by Nvidia, Google, Tesla, Moore Threads, Huawei, and so on, but it somewhat resembles the Cerebras wafer-scale engine: their chip also has on-chip memory distributed around the cores. The thing is, a single Cerebras chip is 46,225 mm² of silicon; it occupies an entire 300 mm wafer. Just to give you a feeling, such a 300 mm wafer could accommodate 65 Groq chips. The Cerebras architecture seems to scale better, and although they provide inference services, they mostly focus on selling their hardware, on building infrastructure for other companies.

If we continue talking about Groq's competition, it's obviously today's main AI workhorse: Nvidia GPUs. From all the specs, benchmarks, and numbers I was able to find, Groq is able to outperform Nvidia GPUs like the H100 in latency and cost per million tokens, but not yet in throughput, though they're working on it. And keep in mind that very soon, in a month or so, Nvidia will be presenting their new B100 GPU, which will be taped out in 3 nanometers, and that's very exciting because we can expect double the performance of the H100 GPU.

In my opinion, Groq is a very promising startup, along with Cerebras, but their success depends on the development of their software stack and also on their next-generation 4-nanometer chip. Keep in mind that all the metrics we discussed today were achieved with their older 14 nm chip, which is from about two years ago, and it took them some time to build the whole system, the whole stack, to get it up and running. So with their next 4 nm design they should gain several X in speed and power efficiency, and we are definitely living in very exciting times. Some years ago we talked about CPUs and GPUs, then we talked about NPUs, neural processing units, and now Groq calls their chip an LPU, a language processing unit, specifically tailored for handling natural language processing tasks. ASIC everything is definitely the trend of this century. Thank you so much for watching and for supporting this channel! Ciao! <3