Today I would like to talk about the new AI chip that is breaking all the speed records, and the best part about it is that it's fully designed and manufactured in the US. As a chip designer, I was very curious to know whether this chip is as good as it's cracked up to be, or just hype. Let's find out. First of all, I've previously discussed the trend that the future of computing is in custom silicon, and here it is: the Groq chip is an ASIC, an application-specific integrated circuit, designed specifically for language processing.
Have a look at this silicon die: it's born and raised in the US, a rival to Nvidia. While all the other AI chips designed by Nvidia, AMD, Intel, Google, and Tesla are highly dependent on manufacturing and packaging technology from TSMC, this one is entirely domestic in design and manufacturing, so to say 100% American. It's manufactured at GlobalFoundries on a 14-nanometer process, and 14 nm is quite a mature technology, which makes it more robust and much cheaper to fabricate. However, Groq is already working on the next generation of their chip, which will be fabricated by Samsung in 4 nanometers at the new Samsung factory in Texas.
Now let me explain the Groq benchmarks everyone is talking about and how exactly they were achieved. Spoiler: the inference speed is truly fantastic. You know, when I'm having a conversation with ChatGPT, I often have to wait 3 to 5 seconds for a response. That's because the GPT model runs on Microsoft's Azure cloud, powered by Nvidia GPUs, and the speed at which I get the response mostly depends on the speed of those GPUs and their latency. 3 to 5 seconds is very slow; it's a delay that everyone can easily notice. Just imagine for a moment if I started to talk that slowly: everyone would drop off immediately. Groq took the open-source Mixtral model from Mistral AI and accelerated it on their hardware to make for a better user experience. I was testing it in chat mode, and I got responses in less than a quarter of a second.
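If you want to measure this kind of latency yourself, here is a minimal sketch (in Python, not Groq's official tooling) that times a streaming request against an OpenAI-compatible chat endpoint. The base URL and model id below are assumptions; substitute whatever your provider actually documents.

```python
# Minimal sketch: time a streaming request to an OpenAI-compatible chat endpoint.
# The base_url and model id are assumptions, not verified values.
import time
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
first_chunk_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # assumed model id
    messages=[{"role": "user", "content": "In two sentences, what is an LPU?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()  # moment the first token arrives
        n_chunks += 1  # each streamed chunk carries roughly one token

total = time.perf_counter() - start
if first_chunk_at is not None:
    ttft = first_chunk_at - start
    print(f"time to first token: {ttft:.2f} s")
    print(f"~{n_chunks / max(total - ttft, 1e-9):.0f} tokens/s after the first token")
```

Time to first token is the "pause" you perceive before the answer starts; tokens per second is how fast the text then streams in.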
These are the official benchmarks that compare different AI inference services running the same Mixtral model. Here is Perplexity, running on Nvidia GPUs through Amazon's cloud: it costs about 25 cents per 1 million tokens and has a throughput of 150 tokens per second. We obviously want the highest throughput at the lowest price. You can see that there is one big outlier, Groq, which costs about 30 cents per 1 million tokens and delivers about 430 tokens per second. If we take the average, the Groq system is four to five times faster than any of the inference services listed here. According to another official benchmark, Meta's Llama 2, a 70-billion-parameter model, runs up to 18 times faster on the Groq system than on GPU-based cloud providers!
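To put those quoted numbers side by side, here is a quick back-of-envelope comparison; it only reuses the figures mentioned above. Against Perplexity alone the gap is about 3x, while the four-to-five-times figure refers to the average across all the listed services.

```python
# Back-of-envelope comparison using only the numbers quoted above (rounded).
perplexity = {"tokens_per_s": 150, "usd_per_m_tokens": 0.25}  # Nvidia GPUs via AWS
groq       = {"tokens_per_s": 430, "usd_per_m_tokens": 0.30}

print(f"throughput vs Perplexity: {groq['tokens_per_s'] / perplexity['tokens_per_s']:.1f}x")  # ~2.9x
print(f"price vs Perplexity:      {groq['usd_per_m_tokens'] / perplexity['usd_per_m_tokens']:.1f}x")  # ~1.2x

# Cost and wall-clock time for 100k generated tokens at each quoted rate:
for name, p in (("Perplexity", perplexity), ("Groq", groq)):
    cost = 0.1 * p["usd_per_m_tokens"]            # 100k tokens = 0.1 million tokens
    minutes = 100_000 / p["tokens_per_s"] / 60
    print(f"{name}: ${cost:.3f} and ~{minutes:.0f} min of generation time")
```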
Clearly, these benchmarks have generated a lot of excitement around Groq's hardware. Now it's time to understand how this was achieved and how it compares to other AI accelerators. Have a look at the layout: that's the layout of the Groq 1 chip. You may have noticed that there are many thousands of repeating blocks, which appear to resemble solar cells. Now why should you care, and why does this chip layout deserve your attention? Because it has something that most modern AI accelerators don't have: all of its RAM is on the chip. This is similar in a way to the Cerebras chip design, although in the Cerebras chip the memory and the cores are intertwined, while Nvidia GPUs, on the other hand, always come with huge off-chip memory. There are two or three main advantages of having on-chip memory. First, the close coupling of the matrix unit and the memory helps to minimize latency, and this explains why everyone who tries Groq's hardware sees such amazing latency. As you can see in this figure, Groq's hardware is outstanding across the board: at 430 tokens per second with 0.3 s latency, it's clearly amazingly quick when it comes to responding to prompts. Let me know what you think.
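Here is a rough sketch of why keeping the weights in on-chip memory helps latency so much: at batch size 1, autoregressive decoding is roughly memory-bandwidth bound, because the active weights have to stream past the compute for every generated token. The bandwidth figures and the ~13B active parameters for Mixtral below are published or commonly cited numbers, not my own measurements.

```python
# Back-of-envelope bandwidth bound for single-stream decoding.
# All numbers are assumptions / rounded public figures.
BYTES_PER_PARAM = 2                 # FP16 weights
active_params   = 13e9              # Mixtral 8x7B activates ~13B params per token (assumption)
bytes_per_token = active_params * BYTES_PER_PARAM

hbm_bw_h100  = 3.35e12              # ~3.35 TB/s HBM3 on an H100 SXM (published spec)
sram_bw_groq = 80e12                # ~80 TB/s on-die SRAM per Groq chip (published figure)
n_groq_chips = 578                  # chips cited in this video for Mixtral

# Rough ceiling on tokens/sec if all active weights must be re-read for every token:
print(f"single H100:        ~{hbm_bw_h100 / bytes_per_token:.0f} tokens/s ceiling")
print(f"578-chip Groq rack: ~{sram_bw_groq * n_groq_chips / bytes_per_token:.0f} tokens/s ceiling")
```

The Groq ceiling is far above the observed ~430 tokens per second, so other factors (networking between chips, scheduling, the vector operations) clearly dominate in practice, but the huge SRAM bandwidth headroom is what makes the low latency possible in the first place.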
The second advantage of on-chip memory is that this chip doesn't require expensive and hard-to-get advanced packaging technology. Groq doesn't depend on memory chips from Korea's SK Hynix or on CoWoS packaging technology from TSMC. This makes it much cheaper for Groq to manufacture their silicon: a more mature process node means lower cost per wafer and per die, and no fancy packaging means they don't have to pay for it and can stick to domestic manufacturing offerings. All of this gives them a rare flexibility to switch between fabs; they just migrated from GlobalFoundries to the Samsung fab in order to move to a smaller, more advanced process node.
On the left side of the chip you can see the matrix unit, which is the chip's main workhorse, and the same matrix unit is also on the right side of the chip. Each square millimeter of this chip is capable of one tera-operation per second. The matrix unit is followed by the memory and the vector unit, and these units pass results between each other, so you can stream data from the east to the west of the chip as well as the other way around. As much as I love the silicon part of the Groq chip, the overall performance is achieved by co-designing software and hardware, so the complete stack.
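As a rough sanity check of that one-tera-operation-per-square-millimeter figure, we can multiply it by the die area. The die area isn't stated here, so I infer it (purely as an assumption) from the 65-dies-per-300-mm-wafer comparison that comes up later in this video.

```python
# Rough sanity check of the "1 TOPS per mm^2" claim, using only numbers from the video.
wafer_area_mm2 = 46_225                          # full 300 mm wafer, as quoted for Cerebras
dies_per_wafer = 65                              # Groq dies per wafer, as quoted later
die_area_mm2   = wafer_area_mm2 / dies_per_wafer # ~711 mm^2 (assumption, not an official figure)

tops_per_mm2 = 1.0                               # claim from the video
print(f"implied die area:  ~{die_area_mm2:.0f} mm^2")
print(f"implied peak perf: ~{die_area_mm2 * tops_per_mm2:.0f} TOPS per chip")
```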
Now, although Groq is selling their chips, their business model is mostly focused on inference as a service, and that's very interesting. You know, training an AI model is a one-time problem, and computing power is getting cheaper and cheaper, but inference is a constant problem; it's an inherently larger market than training, and, most importantly, it scales very well as more users, more businesses, and more people start using generative AI. So companies like Mistral and Meta build open-source AI models, and Groq accelerates these models on their hardware for other companies that want to use these open-source models in their applications.
Talking about the prospects of this chip, we need to keep the market in mind: how big is it? Yes, there is Amazon's cloud, there is Microsoft Azure, which OpenAI uses to run its models, and there is also Google Cloud, and Google DeepMind can run its inference on Google's own infrastructure. But what about medium and small businesses? They need to run their models somewhere, right? That's the market Groq is addressing, and it's huge and growing. However, Groq's inference services are reportedly not yet profitable, and to make them profitable they are planning to scale both the throughput per chip and the number of chips, to 1 million chips by the end of 2024. Groq claims they will be able to break even by the end of 2024. I think it's possible if they are able to scale up fast enough to keep up with the pace of AI development. In fact, right now is a perfect time for this chip, because this super-fast speed and low latency might be game-changing for many applications. If we are talking about chatbots and voice assistants, this speed advantage can make a huge difference, because it can make the whole interaction feel more natural. So, congratulations: now it's going to be even harder to distinguish an AI agent from a real person.
Now let's discuss some of the concerns. For a model like Mixtral, which has roughly 47 billion parameters, the Groq chip architecture looks great, but a large language model of that size requires 578 Groq chips to run, while for comparison the same Mixtral model can fit on two Nvidia H100 GPUs. Now you might be thinking: wait, how is this going to scale when we get to 1-trillion or 10-trillion-parameter models? Actually, this is a question that should be asked of any AI accelerator: how well can you scale? The thing is, Groq has on-chip memory only, which has a lot of pros, as we discussed, but it also has its cons. Fast-forward into the future: a 10-trillion-parameter model would need tens of terabytes of memory, and if we recalculate that back to the Groq chip, which has 230 MB of memory per chip, it would require Groq to scale to tens or hundreds of thousands of chips networked together. Then just think about distributing the load between them and networking them in a way that still achieves the same low latency. That's going to be challenging, if it's even feasible at all.
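To make that concern concrete, here is the arithmetic, assuming FP16 weights (2 bytes per parameter) and the 230 MB of on-chip SRAM per chip. This counts only the weights, ignoring KV cache, activations, and any duplication, which is why real deployments, like the 578 chips cited for Mixtral, need even more.

```python
# Sketch of the scaling concern: how many chips just to hold the weights?
# Assumes FP16 weights; 230 MB per chip is Groq's published SRAM figure.
MB, GB = 1e6, 1e9
sram_per_chip = 230 * MB

def chips_needed(params, bytes_per_param=2):
    weight_bytes = params * bytes_per_param
    return weight_bytes / sram_per_chip, weight_bytes / GB

for name, params in [("Mixtral 8x7B (~47B)", 47e9),
                     ("1T-parameter model", 1e12),
                     ("10T-parameter model", 10e12)]:
    chips, gbytes = chips_needed(params)
    print(f"{name}: ~{gbytes:,.0f} GB of weights -> at least ~{chips:,.0f} chips")
```

So a 10-trillion-parameter model already implies close to ninety thousand chips before you even account for KV cache and redundancy.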
Let me know what you think in the comments. And before we talk about Groq's competition and my outlook, I would like to ask you: if you're enjoying this video, please consider subscribing to the channel and sharing it with friends and colleagues who might be interested. This helps the channel more than you know. Thank you.
Speaking of scaling and competition, Groq's architecture is quite different from the rest of the AI accelerators made by Nvidia, Google, Tesla, Moore Threads, Huawei, and so on, but it somewhat resembles the Cerebras wafer-scale engine, whose chip also has on-chip memory distributed around the cores. The thing is, a single Cerebras chip is 46,225 mm² of silicon; it occupies an entire 300 mm wafer. Just to give you a feeling, such a 300 mm wafer could accommodate about 65 Groq chips. The Cerebras architecture seems to scale better, and although they do provide inference services, they're mostly focused on selling their hardware, so on building infrastructure for other companies.
And if we continue to talk about Groq's competition, the obvious one is the main AI workhorse of the world today: Nvidia GPUs. From all the specs, benchmarks, and numbers I was able to find, Groq can outperform Nvidia GPUs like the H100 in latency and in cost per million tokens, but not yet in throughput, though they're working on it. And keep in mind that very soon, in a month or so, Nvidia will be presenting their new B100 GPU, which is expected to move to a 3-nanometer-class process, and that's very exciting because we can expect roughly double the performance of the H100.
In my opinion, Groq is a very promising startup, along with Cerebras, but their success depends on the development of their software stack and on their next-generation 4-nanometer chip. Keep in mind that all the metrics we discussed today were achieved with their older 14 nm chip, which is about two years old, and it took them some time to build the whole system, the whole stack, and get it up and running. So with their next 4 nm design, they should gain several times in speed and power efficiency. We are definitely living in very exciting times, guys. Some years ago we talked about CPUs and GPUs, then we talked about NPUs, neural processing units, and now Groq calls theirs an LPU, a language processing unit, specifically tailored for handling natural-language-processing tasks. "ASIC everything" is definitely the trend of this century. Thank you so much for watching and for supporting this channel! Ciao! <3