Okay, I think we can start. Igor, thanks for joining. Igor is the Chief Architect at Groq, the AI chip company that's been building LPUs — language processing units — and showing some impressive results over the past few weeks on social media, probably before that too, but that's when it caught my attention at least. Before Groq, Igor was at Google running the TPU silicon customization effort, and before that he was a CTO at Marvell — I might be butchering the name, Igor, correct me. Awesome. So with that I'm going to hand it over to you, you can share the slides, and let's learn what Groq is about. I'm sure we'll have a ton of cool questions — I already saw some on Discord.

Hey guys, I want to give a huge shout-out to the Hyperstack folks, who have generously sponsored my compute over the past month or so. I got 16 H100s — that's two 8-GPU nodes — and the performance has been amazing, basically a 2x speedup compared to my A100 nodes. I want to quickly show you how you can get started yourself; it takes basically three steps. You go to Environments, create a new environment, give it a name, and pick between Canada and Norway — that's the first step. Second step, you go to SSH Keys, create a new key pair, pick the environment you just created, give it a name, and paste your public SSH key. That's it. Finally, go to Virtual Machines, deploy a new machine, give it a name, select the environment — let's select Canada here because they have more compute there — then select the hardware you want, for example H100s, select the image, select the SSH key you just created, and hit deploy. That's it — literally a couple of minutes to get started. It was really easy for me to create the two nodes I just mentioned and start running LLM trainings. The documentation is really good — I could solve all of my problems just by looking at their docs — and they additionally have a Slack channel where they were super helpful, so I can't recommend them enough. The thing they focus on: a lot of GPU providers cater to big enterprises, so oftentimes you can't get on-demand H100s, whereas NexGen focuses particularly on that — you can get top-of-the-line hardware on demand even if you're an individual or a smaller team. They also work with bigger companies, but that's their edge. So without further ado, I do suggest you check them out, and let's go back to the talk. Let me know if you can see my slides.

Can you guys see these? — Give it a sec, it's loading... yes, I can see "Groq's deterministic LPU inference engine". — Okay, perfect.

So what I want to do today is demystify how we're achieving these incredible numbers on large language models. At the core of our performance advantage is full vertical stack optimization — we call it vertical optimization from sand, in other words silicon, all the way through system and software and up through the cloud: from sand to cloud. Groq's approach is very unique in the industry. What we've built is a deterministic language processing unit inference engine, and it doesn't end at the silicon — it really spans into the system. We have a fully deterministic system, which is really not present anywhere else, and the system is entirely software-scheduled. In other words, the software understands exactly how the data is moving through the system; it can schedule, down to a nanosecond, down to a clock cycle, exactly how the different functional units — both at the chip level and the system level — are being utilized. Because of that, we've been able to achieve order-of-magnitude better performance than what is currently the leading platform, which is GPUs.

So we are moving into a new era. This slide shows how society moved from the agrarian era, where wood was used to produce energy, to coal in the Industrial Revolution, to the transportation revolution, and then to the internet revolution, where bytes were used to measure the value of a system — and now we're entering the AI revolution, especially generative AI, where tokens are going to be what drives a lot of the compute. What Groq has built, effectively, is a mega token factory, and I'll explain why we connect it to a factory and an assembly line as we move forward here.

I'm not going to go over these results — these are some of the posts that lit up the internet and, frankly, surprised us at Groq as well. We were trying to keep cranking toward our future and not have to explain why we're doing so well just yet, but this is exciting. We're seeing significant improvement in latency — almost an order-of-magnitude improvement in how many tokens per second we can produce — and you can also see the actual latency and throughput, where Groq sits in a quadrant of its own, while everything GPU-based sits in a different quadrant.

So what I want to do is explain how we got here over the next couple of slides — feel free to stop me at any point and ask questions and we can dive in. I'm not going to go over the demos; a lot of them are already online, and you can play with Groq chat on the Groq website as well. I'll jump straight into the technical details, starting with the full packaging hierarchy that Groq has enabled. On the left you see a Groq chip. This is a custom, purpose-built accelerator, and it was built in a very unusual way. When Groq started, we didn't start like many AI startups that build the silicon right away and then figure out how to program it — we started with a software-first approach. In fact, for the first six months of Groq we never touched RTL, not until we figured out that the software we were building would be easily mappable onto the hardware. What we produced is this very regular, structured chip that's integrated on a PCIe card; eight of those cards make up a Groq node, and nine of those nodes make up a Groq rack — eight Groq nodes plus a redundant node that we can swap in in case of a failure, so we have some redundancy in the system. And this system is really exceptional at processing anything that is sequential in nature. Large language models happen to be exactly that: the next token is always a function of all the previous tokens ahead of it.
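To make that sequential dependency concrete, here is a minimal sketch of autoregressive decoding — illustrative only; the toy `model` below is a stand-in, not anything Groq-specific:

```python
# Minimal sketch of autoregressive decoding: each new token depends on the
# entire prefix, so the steps cannot be parallelized across the sequence.
from typing import Callable, List

def generate(model: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = model(tokens)      # must see ALL previous tokens
        tokens.append(next_token)       # step t+1 cannot start before step t
    return tokens

# Toy "model": predicts the next value in an arithmetic sequence.
toy_model = lambda ts: ts[-1] + (ts[-1] - ts[-2] if len(ts) > 1 else 1)
print(generate(toy_model, [1, 2], 5))   # [1, 2, 3, 4, 5, 6, 7]
```

The point the talk keeps returning to is that per-token latency, not batch throughput, dominates this loop.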
So it's really sequential in nature, and that's where we really shine. And it's not just large language models — we cover a number of different applications; it's just that LLMs are the key highlight for Groq at this particular point in time. All right, so what I want to do next is start with the Groq chip and then move into the Groq system — we'll start with the silicon, the sand, and move all the way to the cloud.

Maybe a quick question before we go there. Given that you started in 2016, if I'm not wrong, did you have a pivot since then, or were you just visionary and knew that this specialization was going to pay off? What's the story there?

Yeah, there's definitely some luck in the process, but we started by really pushing for hardware that's easy to program. If you look at the biggest challenge for AI hardware today, it's not how many TOPS your chip is capable of — it's whether you can map these really well-behaved dataflow algorithms onto non-deterministic hardware, and I'm going to touch on that in a bit. So that was the initial push, Aleksa. The push was also for sequential processing, but LLMs we really stumbled into, and they just highlighted the value Groq can bring to these types of workloads.

Got it. So it just turned out that Transformers run really nicely on Groq hardware, as opposed to you waiting for 2017, seeing them start to become popular, and then focusing on supporting them better?

I wish we'd had that much foresight, but we definitely saw the need for hardware that's really accessible to everybody. We don't do large language models ourselves — we don't train them, we just run them fast. What we wanted to build is this really fast accelerator with easy-to-map software, and as I'll show in the next group of slides, we have over 800 of the best-known AI and HPC workloads already compiling onto our hardware. That's a huge number of models, and they're all multiple times better than on GPUs — so we can not only compile the software, we can really run it efficiently.

Awesome. Maybe one more question on my side: what was the value proposition? You started on the TPU team back at Google — you and Jonathan, if I'm not wrong. Did you just think the rest of the world had to have something similar to a TPU, or did you have some insight into how to make something much better and share it with the rest of the world? What was the initial insight and inspiration for starting Groq?

So Jonathan was the original founder of Groq — I think he was also the original inventor of the TPU at Google — and Jonathan's push was, first, to really enable AI for the rest of the world, not just big companies. But the other thing Jonathan noticed is that these devices are notoriously hard to program. So he really started from scratch and spent a lot of time figuring out this software-first approach, and that's what resulted in our kind of renaissance in the number of machine learning and HPC models we've been able to support. That came last year — we hit the tipping point last year where we could support all these models and we had efficient hardware to run them on — and then LLMs became open source with Meta releasing Llama, and that was really the second derivative of our movement forward.

Awesome, thanks a lot.

So I'm going to dive into the details of the chip — feel free to stop me if you have any questions. I think we're all aware of Moore's law. Moore's law basically said that we'd be able to double the number of devices per unit area every year and a half — or at least that the economics of doing so would make sense. But as we've moved forward we're seeing a significant slowdown, and we're now at the point where Moore's law gives us maybe 3% more devices per unit area per year. At the same time, compute demands are exploding. So the industry has tried multi-core approaches — which is also the path many AI companies have taken for AI acceleration, including GPUs with their many, many cores — but these are notoriously hard to program, and we've still seen a plateau in performance. One way to address this is to enter the "golden age of computer architecture", as David Patterson called it in his famous Turing award lecture: custom hardware for specific applications. So we have GPUs, which started as a gaming technology — roughly 25-year-old technology that's only recently been repurposed for AI acceleration — and then specific AI accelerators like TPUs, video processing units, data processing units, and so on; we've seen a whole alphabet soup of processing units. Groq's language processing unit is just one of those domain-specific architectures, but the way we have pursued the LPU is very different from most of these approaches, and I want to describe that in the next couple of slides. Was there a hand up — did anyone have a question?

You can just continue, Igor, and I'll curate the hands — that's easier.

Okay, sounds good. So, as I mentioned, we wanted to build a chip that we could easily map software algorithms onto, and we didn't touch RTL until we were comfortable with the approach we'd taken. Now, if you look at the landscape, these AI algorithms are dataflow-dominated. They're really well-behaved: you can statically schedule them, and they're highly parallel vector operations — they use vectors or matrices. On the other hand, we have really powerful hardware — you keep hearing about the number of TOPS, about the amount of DRAM bandwidth these new devices have — but the problem with this hardware is that it's unpredictable. For example, the big GPUs use many levels of memory hierarchy — caches, DRAM, and so on — and every time you try to hit these memories you get a non-deterministic response: did you find the data in the level-one cache, or the level-two cache, or did you have to pay the penalty of going to DRAM? When you try to schedule a well-behaved algorithm onto unpredictable hardware, the compilers really, really struggle — and this has been the biggest struggle for all of the AI hardware players to date. So what we did with Groq is flip this on the hardware side: from unpredictable to 100% predictable hardware.
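As a toy illustration of why fixed latencies matter to a compiler — purely illustrative, not Groq's compiler — here is a minimal static scheduler: when every operation's latency is known exactly, the schedule (and therefore utilization) can be computed entirely ahead of time.

```python
# Toy static scheduler: with fixed, known latencies, completion times and unit
# utilization are computed at compile time -- no runtime arbitration needed.
LATENCY = {"load": 4, "matmul": 8, "add": 1}   # cycles, fixed by construction

def schedule(ops):
    """ops: list of (name, kind, deps). Returns {name: (start, end)} in cycles."""
    times = {}
    for name, kind, deps in ops:                  # assumes topological order
        start = max((times[d][1] for d in deps), default=0)
        times[name] = (start, start + LATENCY[kind])
    return times

program = [("w", "load", []), ("x", "load", []),
           ("y", "matmul", ["w", "x"]), ("z", "add", ["y"])]
print(schedule(program))
# {'w': (0, 4), 'x': (0, 4), 'y': (4, 12), 'z': (12, 13)}
```

If a load could instead take anywhere from a few cycles to hundreds (a cache miss going out to DRAM), none of these start times could be pinned down — which is exactly the scheduling problem described above.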
I just want to do a comparison with the most well-known AI accelerator on the planet right now, which is the H100, and on the right here you see the language processing unit. The first thing you notice when you compare the two is that one is implemented in 4 nanometer, it has a die size of over 800 square millimeters for the compute die, and it sits on a big, expensive silicon interposer with six HBMs — it really is a marvel of silicon technology. On the right you see the LPU: it's really scrappy — 14 nanometer, so three technology nodes older — it doesn't have any HBMs, and it doesn't have a silicon interposer. It's a really simple implementation, and yet we're able to get orders of magnitude better performance out of these devices. So let me show you the magic behind the scenes — and the magic really comes down to architecture. The device on the left is made up of many cores, each executing on its own time: if a core finds the data in its level-one cache it can complete its task fairly quickly; if it doesn't, it needs more time, and if it hits an HBM the time until the data comes back is non-deterministic — anywhere from about 300 nanoseconds to over a microsecond. That non-determinism doesn't just make that one core wait; the entire chip has to wait for that core to complete. And if you have conflicts fighting over the same HBM, you have additional problems. That's just during the compute-kernel phase. Once you start networking — say you do a function like all-reduce between these GPUs — you pay an additional penalty: you have to spool up the HBM again to get the data you want to send, you have to talk to your NIC to establish a path, you send the data, and then you sit there and wait for an acknowledgment from the receiving GPU that the data arrived okay before you can continue to the next piece of work on the GPU. This adds up to a lot of latency and a lot of power burn, because every time you hit an HBM you pay a penalty of four to six picojoules per bit, and every time, that 300-nanosecond-to-over-a-microsecond delay — it really adds up.

On the right you see how the LPU works, and it's fully deterministic. At any point in time we know exactly which functional units are active. We don't have multiple levels of cache; we have direct memory — the software team can access the memory and decide which wordline and which bitline to activate on a specific SRAM. The natural question people ask at this point is: that's fantastic, but Llama 2 70B is a massive model — how do you load it into a device that doesn't have HBM? The beauty is that not only is our chip 100% deterministic, the entire system is deterministic. We have synchronized all these chips to act like one giant spatial processing device — it's almost like a mega-chip we've built — and within two or three microseconds we have access to terabytes worth of memory, which we get by joining all these chips together. And we can do so deterministically: we don't have a large spread in arrival times; we can get it within 500 nanoseconds — we can access any other device within that range.
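To get a feel for how those per-bit and per-access numbers add up, here is a rough back-of-the-envelope in code. The 4–6 pJ/bit and 300 ns–1 µs figures come from the talk; the model size and bytes-per-parameter are illustrative assumptions, not measurements of any specific system.

```python
# Back-of-the-envelope: energy just to stream model weights out of HBM once,
# using the per-bit figure quoted in the talk (4-6 pJ/bit). Illustrative only.
params          = 70e9          # assume a 70B-parameter model
bytes_per_param = 2             # assume FP16/BF16 weights
pj_per_bit      = 5             # midpoint of the 4-6 pJ/bit quoted above

bits_moved    = params * bytes_per_param * 8
energy_joules = bits_moved * pj_per_bit * 1e-12
print(f"~{energy_joules:.1f} J of DRAM-access energy per full pass over the weights")
# ~5.6 J -- and autoregressive decoding re-reads the weights for every token.

# Latency side: each dependent HBM round trip costs roughly 300 ns to 1 us,
# so a chain of N dependent accesses adds roughly N * 0.3e-6 .. N * 1e-6 seconds.
```

That re-read-per-token behavior is what the "assembly line in SRAM" framing later in the talk is meant to avoid.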
So this allows us to map algorithms easily, because it's a deterministic system we're mapping into: we know exactly which functional units are busy and which are idle, so we can assign work to the idle ones. We have very low latency because everything is pre-scheduled and orchestrated by software, and we can massively scale. This is really the next derivative for Groq: whereas large language models are growing at roughly 10x per year, we can scale our systems in a very strong way — and I'll touch on the networking component of that in a bit.

This is amazing. I had a couple of questions. First, you say "higher cost" for the GPU — when you say cost, what exactly do you mean? Is it monetary cost, or some other kind of cost?

Yeah — the device on the left is probably one of the most advanced devices you can build in that time frame, so it costs a lot. Supply-chain issues are difficult, because you have to get HBMs from Korea, you have to get CoWoS from Taiwan, you have to integrate all of this and have supply for each of these components, which has been a big problem for the industry — and the cost of the device reflects all that complexity. But the cost I was referring to is the cost in latency: every time you wait for an HBM you're running the clock on that next token; every time you communicate with another GPU for a function like all-reduce, again you're running that clock, and you're burning power, because you're communicating through switches and NICs and so on.

Got it — so all of these are financial, latency, and power costs that add up. And the next question — I guess this is the main one I have: it's obvious that the more you specialize hardware, the more efficiency you squeeze out. But there's a trade-off in how you remain future-proof. Looking at the current state of AI, you're clearly in just the right spot, but how do you think about being future-proof in general, if something changes for whatever reason? Even though the sequential nature of these models will probably remain invariant going forward, everything else could change — how do you guard against that?

So Groq has gotten a lot of the spotlight for large language models, but in fact a lot of the work we did before language models was on a range of different applications. We've done cybersecurity — we've shown almost a 600x improvement over GPUs in cybersecurity and anomaly detection, really fast anomaly detection. We've done real-time system control — Argonne National Labs did work on plasma stabilization in a tokamak reactor, which needs very low latency, and again we were seeing one to two orders of magnitude better performance than GPUs. We've done work in the financial space on fast trading and data processing. So we're really not just a large language model accelerator, even though that's been the main discussion on social media right now. There's a lot of work we've done — there are a lot of papers on the Groq website that dive into the details of what we're capable of well beyond that. So we're pretty versatile, and future-proof no matter how these things change down the road.

Nice. And when it comes to sparse computation — if you wanted to run graph neural networks, how does your hardware deal with that?

So we've done a lot there — there are papers on GNNs. I'm not the expert on that, I'm more on the silicon side, so I really encourage you to check out some of that work, and we could do another talk on it if that's interesting — I'll bring the experts.

Sounds great. Let me pick up a question or two from the chat: "Is 'LPU' just a change of branding, or is it different hardware from what was called the TSP?" I'm not sure what TSP is — you probably know.

Yeah — the TSP was the Tensor Streaming Processor. The LPU is different branding because it's such a big deal — you mean, is it a different product or just branding? The LPU inference engine: the way we've set up the system is definitely customized for language processing, absolutely, but the actual silicon behind it is the TSP, the Tensor Streaming Processor.

Okay. Let me see: "Is it possible to bring this approach to smaller chips? In your opinion, what's the best way to bring inference on-device?"

This is a fantastic question, and I have a whole section on it in the last portion of the talk, if we get there. But absolutely — and the beauty of it is that because this architecture is so regular, you can scale the processor down to much smaller devices that could either be part of chiplets or be embedded inside another chip. And because the software maps easily to any of these configurations, we can quickly compile for them — in fact, what we can do now is evaluate performance for a much smaller device directly out of our compiler. The compiler takes a number of configuration parameters and can compile software and model its performance down to nanosecond-level accuracy for these smaller deployments. Great question.

Okay, I think you can continue, and we can pick up more questions later.

So this is a spec chart — I'm going to jump over it; it has the TOPS and all that information, and it's here if somebody wants a look. I want to get to the more interesting stuff: how the chip is built. The chip is built out of these SIMD structures — basically 320-element vector SIMD structures with a very lightweight instruction dispatch; you can see that black box at the south portion of the slide. We create a class of these, so we can build a matrix version of the SIMD structure for matrix-vector or matrix-matrix multiplies, or a vector-vector version, or data-reshape units that do transposes and permutes, and then we have an SRAM memory tile. We arrange these by abutting them next to each other, we add the instruction dispatch at the bottom, on the south side, and the instructions flow from south to north while data flows from east to west. What this enables is a much easier way to program these devices: we no longer have a 2D bin-packing problem, we have a 1D bin-packing problem, which is much easier to solve — so the compiler has a much easier time programming these devices.
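To illustrate the difference in difficulty — my example, not Groq's compiler — packing work along a single ordered dimension can be done greedily, whereas 2D placement quickly becomes a hard search problem.

```python
# Toy 1D "bin packing" in the scheduling sense: ops only need to be placed
# along one axis (time slots on a lane), so a greedy first-fit works.
def first_fit_1d(op_lengths, lane_capacity):
    lanes = []                                # each lane is a list of op lengths
    for length in op_lengths:
        for lane in lanes:
            if sum(lane) + length <= lane_capacity:
                lane.append(length)
                break
        else:
            lanes.append([length])            # open a new lane
    return lanes

print(first_fit_1d([5, 3, 7, 2, 6, 4], lane_capacity=10))
# [[5, 3, 2], [7], [6, 4]]
```

A 2D version would also have to choose where each op sits physically while respecting contiguity and routing — roughly the placement-plus-time problem the talk contrasts this with.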
So, diving into a little more detail on the memory: as I mentioned, there are no multiple levels of cache — it's a flat memory, but it provides significant bandwidth to the compute: 80 terabytes per second. To give you a feel for what that is, it's about 100 times more bandwidth than you can get out of a single HBM; even if you gang up all the HBMs on an H100, you're still getting only about a twelfth of the bandwidth we have on-chip, in this 14-nanometer chip. We have a limited amount of memory — about 230 megabytes on chip — but as you scale up a system and build a factory of language processing units, you can share that memory to load much larger models, and we have very strong scaling, which means we can keep up as these models grow by 10x every year. The instructions, as I mentioned, move from south to north. The beauty of it is that we have this very wide instruction word with super-lightweight scheduling: if you look at something like a GPU, with all the NoCs and the extra overhead, you're burning maybe 20% of the chip on control logic alone — just deciding how to move data around the chip. In our case, the very light instruction-dispatch logic at the bottom takes up only about 3% of the chip area. That means the extra area can be devoted to things that actually make a difference, like more SRAM or more matrix and vector processing.

All right, this is where it gets really cool. If you look at the data, we have streaming registers that move east to west and west to east, and the data moves one hop every cycle. So the compiler team can reason about exactly where any piece of data will be: ten cycles from now, it will be ten hops away. They can schedule against that, know exactly which units are idle and which are busy, and get more and more utilization out of the hardware by issuing commands to those units. They don't have to worry about non-deterministic execution — about one of these functional units waiting on DRAM or anything like that. Everything moves in an orchestrated way, and the orchestration is done fully by the software — not just at the chip level but also at the network level, which I'll touch on shortly.

This is a detail slide of our instruction set — I'm going to skip over it. It's a very simple instruction set, about 50 instructions: we take PyTorch's ten thousand or so operations and map them into this reduced set, and that's what's used to compile onto the chip. It really is a push-button approach to moving PyTorch or TensorFlow or Keras onto our hardware — we also have custom APIs that some of the scientific community has used — but it's really meant to move quickly. Just to give you a feel: for a model like Llama 2 70B, we were able to deploy in less than five days — from getting the weights from Meta to deploying in the field was about five days — and that's roughly 20 times faster than deploying something on a GPU, where you have to write custom handcrafted kernels and so on to map it onto non-deterministic hardware. All right, so this is where I mentioned the tipping point.
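The "one hop per cycle" claim is what makes data location a pure function of time. A tiny illustration — mine, with made-up lane coordinates:

```python
# With data moving exactly one hop per cycle, the compiler can compute where a
# value will be at any future cycle -- no tags, no runtime tracking.
def position_at(start_slice: int, direction: int, cycles_elapsed: int) -> int:
    """direction: +1 for eastward, -1 for westward streaming registers."""
    return start_slice + direction * cycles_elapsed

# A value injected at slice 3, streaming east, is at slice 13 ten cycles later:
assert position_at(3, +1, 10) == 13

# So "consume this value at slice 13" can be scheduled at compile time for
# cycle t0 + 10, and the unit at slice 13 is known to be busy exactly then.
```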
Last year we went from 60 models compiling onto our hardware to 500 models within 45 days. That was really the tipping point where we saw we could compile more — we now support more than 700 or 800 models on our hardware, in a genuinely push-button way. And best of all, because the hardware is deterministic — predictable and easy to reason about — we've been able to do this with about 35 software engineers, versus what's publicly mentioned by Nvidia: over 50,000 kernel developers, inside Nvidia and outside, writing these handcrafted software-to-hardware mappings. That's pretty painful: once you've invested that much in the software, you have an inertia to stay with the same architecture and just keep scaling it as Moore's law slowly dies, because the software stack is so expensive.

Maybe a question here. First of all, you're focusing on inference, as far as I understand — correct? Do you have any plans to use your hardware for training LLMs and other models?

At this point the main focus is really inference. I think Nvidia is doing a good job at training; they can handle very large batch sizes well because they have the HBM, so for large batches they do well. But whenever it comes to low-latency applications, the cost just blows up — every time you need to swap data in and out of HBMs, every time you have to go through switches, you pay such a big penalty that it doesn't make sense for inference applications. So we're really focused on inference.

Got it, makes sense. And the second question: competing against Nvidia, which is obviously a formidable player in the market, to say the least — how do you see yourselves positioned against them? I guess the big issue is how you take the community Nvidia has and bring it to Groq. You said you're software-first, which I guess partially answers the question, but I'm curious to hear your thoughts.

When you offer an order of magnitude better performance and better power, I think it's a natural shift. At this point the big push for Groq is to get as much hardware out there as possible to satisfy the demand we're seeing right now. It's going to be a natural shift. Let's face it: the graphics processing unit is a technology developed 25 years ago, even before the most recent AI wave — it started for gaming purposes — and it's really reaching the end of its evolution; we're investing in the most advanced process technology just to get the next little bit of scaling. The LPU is at the beginning of its story. We've made this revolutionary leap, and there are a lot more gains to be had — as I mentioned, we're showing order-of-magnitude better performance with a 14-nanometer chip, so we haven't even pushed the silicon angle of the story yet, and that's coming. We've announced we're working with Samsung on a 4-nanometer chip; that's next, and it will give us another multiple-X improvement in performance, and as we go deeper into technology improvements we'll continue to see this evolutionary improvement. But the recent leap is a revolution in hardware, not just scaling and marching down the path of Moore's law.

Got it, makes sense, thanks. Somebody had their hand raised — but maybe one more follow-up from me first. You're 10x better on inference, which is kind of a no-brainer, but for startups there's also the logistics aspect: you want your training infra and your inference infra from the same provider, it's just easier. So where can people currently find LPUs? Is it only through you, or can one start renting them from hyperscalers? What's your plan there?

So we have a fully vertically integrated stack: we build our LPUs, we build our systems, we build the software that runs them. Currently you can buy hardware from us, and you can use tokens as a service — we have API access people can try out, and we're already working with a number of companies through that API in the tokens-as-a-service model. If somebody wants to buy chips, we might consider it — I don't know, we haven't really thought about it, because we're providing value across the entire stack.

Makes sense, thank you.

All right, so now I want to show you some cool stuff. This is probably the slide that pulled me away from the Google TPU to join Groq, and it's really cool. On the left you see a microphotograph of the chip; on the right you see the actual movement of data — cycle-accurate, down to a nanosecond. This is really software-hardware co-optimization 2.0: if, as a hardware designer, you know exactly what's happening on the chip, you can make a lot of really good optimizations for that chip. So let me show you a couple of slides on this. On the left here is a microphotograph of the die — you can see the different colors for the different functional units — and on the right you see, in matching colors, the power of those functional units as a function of time. Now, the cool part about this slide is not that it was generated by the post-silicon or silicon team; the cool part is that these waveforms come out of our compiler team. The compiler team is not only scheduling algorithms efficiently onto our hardware, they can also profile exactly how much power they're generating and burning at a specific x, y, z location on the chip. They have full awareness and control of that — and not only can they profile it, they can actually start controlling it. What you see on the right are four different waveforms in different colors. The baseline waveform is what happens if a workload is simply pushed to maximum performance — get the job done as quickly as possible — and you see the power peaking much higher. But we can tell our compiler: I don't want you to burn 100% of the power — compile this to 25% reduced power. That generates the red waveform, and you can see you're trading a 25% peak-power reduction for only a 0.2% performance loss. The yellow waveform reduces peak power by 50% for only a 7% reduction in performance, and the green waveform gives a 75% reduction in peak power for a 40% performance loss. This is really cool: it means I can deploy the same chip in an air-cooled or passively cooled environment, or go for the blue line and deploy that same chip in a liquid-cooled data center, and the only difference between them is a different compile. That's valuable for our current chip, but as more-than-Moore integration becomes bigger — we're looking at 2.5D integration and 3D stacking — you need to manage temperature on these devices. If I can manage power in a four-dimensional space — three dimensions plus time — I can control how much power I'm burning in a specific area and how much temperature I'm generating, and I can manage that at the compiler level. The compiler now has full insight into the silicon: they're feeling the pain the silicon feels when they do a compile, and they get that feedback instantaneously. Any questions on this slide?

That was impressive. I think Keon has one.

Okay, so you're talking about peak power, but isn't temperature a function of average power — or am I wrong?

No, you're very right. Peak power definitely affects things like L·di/dt — the change in current multiplied by the inductance of the power-delivery network — which governs the voltage droop you see on these chips, and that can be significant; it sometimes adds something like 20% extra power that data centers burn purely as margin for the changing current the chip experiences. As for temperature: we can manage instantaneous power, and we can also spread power over a longer period of time. If you look at the green line, we're taking 40% longer to complete the job, which means our average power effectively drops substantially as well — you've cut peak power by 75%, and even though the job takes 40% longer, you're burning less average power overall. So you can manage all of these parameters — thermals, L·di/dt, power, performance — and trade one against the other: you can sprint for very short distances and then run slower over longer distances. All of these are capabilities we have because we have a deterministic network; you couldn't do this with a non-deterministic one.

Nice, impressive. Is there any theoretical limitation to scaling LPUs?

When you say scaling, what do you mean — sticking multiple LPUs into a rack, or bigger systems in that respect?

Yeah, systems.

I think we'll be limited purely by technology capability. Ideally we would love to stack many of these chips on top of one another, and because we can manage power in that four-dimensional space, we can extract a lot more performance out of a 3D-stacked chip than anybody else can.

Got it. So, for example, if one wanted to run a 70B LLM right now, how many chips would you need, given that you have 230 megabytes per chip, if I'm not wrong?

Yeah — so if you just want to run a small appliance on this, it doesn't make sense.
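For intuition on that question, a rough back-of-the-envelope — my assumptions, not an official Groq configuration: FP16 weights and roughly 230 MB of usable SRAM per chip, ignoring activations, the KV cache, and any replication the compiler might add.

```python
# Rough chip-count estimate for holding a 70B-parameter model entirely in SRAM.
# Illustrative assumptions: FP16 weights, ~230 MB usable per chip, weights only.
params          = 70e9
bytes_per_param = 2                      # FP16
sram_per_chip   = 230e6                  # ~230 MB, the figure quoted in the talk

total_bytes = params * bytes_per_param   # 140 GB of weights
chips = total_bytes / sram_per_chip
print(f"~{chips:.0f} chips just to hold the weights")   # ~609 chips
```

Lower-precision weights or more on-chip SRAM would scale this down proportionally; the point of the answer that follows is that Groq treats this as an assembly line of many chips rather than a single-box appliance.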
What Groq does is build factories for tokens. You deploy a large number of chips, but your throughput per chip is equivalent to what you'd get from any other matrix-processing engine — you still have to do the same math, and you still have to do it on silicon. What we've created is almost an assembly line that you put together, and as long as you can load the model into SRAM, you can execute it very efficiently. And because we have this strong scaling — I'll touch on that in the next section — we can scale to very large models easily, beyond the current largest models. We want to chase that 10x-per-year growth in large language models.

Got it, thank you.

All right, I'll jump into the system. Just as we have a domain-specific chip, we also have a domain-specific network. You've seen some of these domain-specific networks: Google has their own TPU network, Dojo has a mesh network, Cerebras has a kind of mesh network on a wafer, and so on. All of these players have come up with ways to scale well — addressing your question: can you scale to a thousand chips, to hundreds of thousands of chips, and can you do that while achieving very low latency? Because latency is really critical in these networks. So let me show you what Groq has done on this front. On the left you see a conventional network — what you see with GPUs, for example — and it's really three separate layers that are disjoint from one another. You have the compute, which is non-deterministic, as I mentioned: the GPUs execute on their own time. You have the networking — switches and so on — which makes locally optimized decisions; what I mean is that each switch or router optimizes the traffic through that specific switch or router, and it's also non-deterministic. And then the software layer on this conventional network is left with the task of mapping well-behaved algorithms onto a very non-deterministic system — which is why the software is so difficult, and it also creates a lot of problems. You need per-hop router arbitration, which adds latency. You have adaptive routing decisions made in hardware, which the software knows nothing about; this creates congestion and results in weak scaling as the number of devices that need to work together increases — which is the natural next step in AI — so you start seeing significant slowdowns. And then you have congestion sensing, network back-pressure, and so on, which add non-deterministic delay: you don't know when the next calculation will complete.

What Groq has done, on the right, is introduce something called a software-controlled network. Each of our chips is not just a processor — it's a router and a processor in one. We don't have top-of-rack switches; we don't have any switches at all. We connect our LPUs directly to other LPUs, and those LPUs act as the switch that connects them to other local or global groups — you can see these groups laid out next to each other. This means we have no hardware arbitration, because the software schedules all the communication between all these chips a priori — it picks the most optimal connecting paths between them, so it decides the best way to move all the data from sources to destinations. That gives significantly lower latency, it allows us to scale to massive sizes — hundreds of thousands of chips can be ganged together — and we maintain deterministic delay and very low sensitivity to network load: as the load on the network increases, we can still optimize exactly how we move traffic across it.

In the real world it looks something like this video (I don't know if you can see it). On the left you have congestion at a router — the router is your traffic light; you try to move data, and as the network load increases you start having problems, collisions at intersections, and the data starts to trickle. On the right, everything is pre-scheduled. Imagine waking up tomorrow to go to work and somebody telling you: Aleksa, you need to leave at 7:30 a.m., drive at exactly 40 miles an hour, and don't stop at any point along the way — and everybody in the city gets a different departure time and a different speed — and you create this wave where everything arrives at just the right time at every point in the network.

This is impressive, but my question is: is there any significant trade-off you have to make to get this type of behavior in your systems?

No — the benefit is that these roads can now carry significantly higher capacity out of the same chip-to-chip link. Right now, Groq's C2C links on this first version of the silicon run at 30 gig; the H100's links run much higher than that, a couple of times higher, and yet we get significantly higher effective bandwidth between devices. I'll go through a couple of slides explaining this in a bit more detail.

Thanks.

So let me do that. For this to work, we need to create one very large network that's all joined together. We use software to synchronize all the chips in the system so they act as one large spatial processor — one large mega-chip, in a way. We do this in software: we figure out exactly what the clock offset is between any chip and any other chip, everybody carries that offset, and when communication happens, everybody knows exactly where they are in time. Once that's aligned we can start communicating, and we use a dragonfly configuration, where the chips in a local node are connected all-to-all. That becomes the building block, and many of these local groups are then connected into global groups. For example, if this chip wanted to communicate with that chip, it might have to make a local hop, then a global hop, then another local hop — but remember, this is a very low-diameter network, there are no network switches, and there's no congestion to wait out; everything is pre-scheduled. So not only can we communicate directly within the local group, we can also use non-minimal paths that bounce off other LPUs.
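A small sketch of why bouncing traffic off peers helps, assuming an all-to-all local group — my simplification of the dragonfly description, with made-up link numbers:

```python
# In an all-to-all group of n chips with per-link bandwidth b, two chips can use
# the direct link plus (n - 2) two-hop detours through the other chips, so a
# software scheduler can aggregate far more bandwidth than one link provides.
def pairwise_bandwidth(n_chips: int, link_bw_gbps: float) -> float:
    direct = link_bw_gbps
    non_minimal = (n_chips - 2) * link_bw_gbps   # one detour per other chip,
    return direct + non_minimal                  # assuming those links are idle

print(pairwise_bandwidth(n_chips=8, link_bw_gbps=30))   # 210.0 vs 30 direct
```

This is an idealized upper bound — it assumes the detour links are otherwise idle, which is exactly the kind of global knowledge a pre-scheduled network can exploit.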
So let's say two LPUs want to do an all-reduce where a large amount of data moves between them: we can bounce that traffic off multiple other LPUs and get a lot more bandwidth between those two devices — the direct path plus the non-direct paths. This extends to the entire network, and software controls it all. What's the other benefit? Since everything is globally synchronous — we effectively have one global clock — when we send information from one LPU to another we don't have to spool up an HBM, we don't have to negotiate with a NIC, and we don't have to wait for switches to route our packet to its destination and then wait for an acknowledgment before proceeding to the next calculation. We simply read out of SRAM, and within a handful of cycles we're already driving the chip-to-chip interconnect. When that data is received on the other side, there's no need for it to carry any source or destination information — there is no routing; it's all scheduled by the software. So within three microseconds we can access two terabytes worth of SRAM. That answers the question of how we scale to large models: we just add more chips, we have very strong scaling, and we can enable very large models.

Impressive.

And as I mentioned, each data packet doesn't need a source and destination — it just needs a few parameters that add about two and a half percent of overhead — so we can send packets quickly, because we're not waiting on HBM and the like, and our packets are very efficient, with only that 2.5% encoding overhead. So what does that mean? This is probably the most important slide in the presentation. On the x-axis you see tensor size — these are the packets we're exchanging; we're really sending tensors, not packets. On the y-axis you see bandwidth utilization, and here we're comparing against an A100, but the story is very similar for the H100. What you see is that at very large tensor sizes — tens or hundreds of megabytes, the sizes used for training — the GPUs do pretty well; they utilize their bandwidth well. But at inference-sized tensors, which are in the kilobyte range, their utilization is very poor, because of the big overhead every time you communicate between GPUs: you have to access the HBM, talk to the NIC, wait for acknowledgments, and so on. That overhead isn't a big deal when you're sending huge batches and large tensors for training, but for inference it's a big hit on performance. And you can see Groq here saturating its bandwidth very quickly, because we have very little overhead on each packet and can send packets immediately — as soon as you step on the gas, you move. That's really why Groq does so well on inference. Does that make sense?
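A toy model of that chart — my own, with invented overhead numbers, not measured values: if every transfer pays a fixed setup cost, effective bandwidth collapses for small tensors and only approaches the link rate for large ones.

```python
# Effective bandwidth with a fixed per-transfer setup cost vs. a tiny per-packet
# overhead. Numbers are invented for illustration, not measurements.
def effective_bw(tensor_bytes, link_gbps, setup_us=0.0, pct_overhead=0.0):
    bits = tensor_bytes * 8 * (1 + pct_overhead)
    time_s = setup_us * 1e-6 + bits / (link_gbps * 1e9)
    return tensor_bytes * 8 / time_s / 1e9          # achieved Gb/s of payload

for size in (4_096, 1_000_000, 100_000_000):        # 4 KB, 1 MB, 100 MB
    gpu_like  = effective_bw(size, link_gbps=600, setup_us=5.0)
    groq_like = effective_bw(size, link_gbps=100, pct_overhead=0.025)
    print(f"{size:>11} B  fixed-setup link {gpu_like:6.1f} Gb/s   "
          f"low-overhead link {groq_like:6.1f} Gb/s")
```

The invented 5 µs setup term stands in for the HBM/NIC/acknowledgment round trips described above; the 2.5% figure is the packet-encoding overhead quoted in the talk.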
Yes. Maybe a question: if I parse this correctly, your batch size can't be high, but you compensate because at these lower tensor sizes you're much more efficient with bandwidth. So, can someone use GPUs and compensate through batch size — try to compete against what you're offering?

No — the problem is that no matter how low a batch size the GPU uses, the overhead is really high. As they go down to lower batch sizes, their cost per token skyrockets, and their latency still hits that communication wall, because you're always paying for that wall — beyond a certain point they simply can't do better on latency, and that's where Groq does really well. Our architecture is efficient, and we're going to start pushing on cost as our throughput keeps moving higher — we just announced a 2x increase in throughput two days ago, and you'll see more of those 2x improvements as our software ramps up. We're early days at this point, so there are a lot of improvements coming, and as that happens we're going to push Nvidia harder and harder against that wall.

There was actually a question in the chat related to this chart: in the ISCA Groq paper, it looks like the bandwidth for large tensors is inferior to the A100 — which is what we just saw on the screen — so how do you get the higher throughput?

It's exactly what I described: software-scheduled communication. We don't spend time fighting with network switches, we don't spend time waiting for HBMs, we don't spend time waiting for acknowledgments. Even though we have less total bandwidth, we still get better performance — it's simply a more efficient use of our resources. And this is a really important point that often gets missed: we're hitting the Shannon limit on electrical links — things are saturating around 224 gig, and we're looking at optics beyond that — but what Groq is showing on this slide is that no matter what the silicon technology is, we're always going to have an advantage, because we make much more efficient use of our resources.

Super cool.

Okay, the last slide on the networking side is scalability. This is a very old slide — it's BERT-Large, and the team is so swamped that we haven't regenerated it — but it shows the strong scaling Groq achieves. You see the number of LPUs on the x-axis and normalized TOPS on the y-axis for BERT-Large — which is BERT-tiny these days — and you see an almost linear increase in performance as you add more LPUs. That's the strong-scaling story; we'll revisit this slide with some of the latest models, but I wanted to show how we look as models grow by 10x with every generation: we just add more LPUs and get this very strong scaling — we're not spending most of our time fighting the network or the DRAMs, so the additional LPUs actually help.

All right, one summary I wanted to do: GPU versus LPU for LLM processing. Both of these devices are matrix-processing engines, and they have roughly the same amount of compute, plus or minus whatever Moore's law gives you. The difference between the two approaches is that the GPUs are effectively a production shop — the DGX box is limited to eight devices that are nicely, tightly connected, still through a switch, but connected more tightly — whereas with the LPU we've built an assembly line for making tokens. The work just flows from one device to the next: we don't access HBMs and waste time there, we don't go through switches and fight with them to move data between devices — we do this in a very linear fashion, so we don't have to repeatedly re-run the same things. That allows us to be very low latency, but it also allows us to be green: the power is really good, and even at very high batch sizes on the Groq side we get good latency and good power. We've really built a factory for making tokens rather than many separate shops, which is what the DGX boxes are.

Nice. How hungry is this compared to GPUs in terms of total energy consumption? You mentioned the nice flexibility where you can trade off power and performance, but in terms of total power needed per whatever metric, how do you compare against GPUs?

The metric would be joules per token, and we're about 10x better on that. The explanation: we don't store the weights in HBM, so we don't pay that penalty — the data just flows through our assembly line. We don't access NICs, and argue with them, and pay the extra power for the NIC to talk to another GPU and back, so we don't pay for that in latency or power. And because of the strong scaling, we don't spend energy fighting back-pressure and adaptive routing and all the other things that happen in the GPU world. So it's about a 10x improvement in joules per token.

Nice. Humble — thanks.

And it's funny, because a lot of people say, well, 10x is unreal — but is it really unreal when you compare against a technology that's 25 years old and was never built for sequential processing? I don't think it is. I think it's a natural architectural, evolutionary jump, and we're going to see even more improvement as we evolve the LPU — and eventually another company will come up with something even better, and if their architecture is more efficient it might be 10x better than the LPU. But for now we're going to keep pushing and holding that lead.

What's preventing you, or Nvidia, from just reinventing the whole chip from scratch? It's not like you always have to do things incrementally — you can spin up a team that develops a completely new design. How feasible is that?

It's totally doable — it's a matter of money to invest in it. The problem is that these things take time, and there's also a lot of IP on the LPU side that protects many of the solutions we've implemented. But I think that is the right approach for the future: not to keep fighting Moore's law, but to think up new architectures that accelerate the workloads better. The big problem for everybody in the industry is their software stack. Once you've invested 50,000 kernel writers into your software stack, how likely are you to throw out that whole approach and adopt a different architecture that requires another 50,000 kernel writers to write code for it? That's an inertia that's hard to fight. In our case it's a really small investment — we have a very talented team of about 35 software engineers — and the determinism has given us that advantage. But yes, it's definitely something everybody should be thinking about.

Impressive, thanks.
language acceleration um like we've seen drug Discovery with ar National Labs uh 200x speed up not 200% 200x cyber security this was a published article with the US Army uh 600x speed up for anomaly detection uh Fusion reactor kind of control 600x and then Capital markets about 100x Improvement uh so I have a lot of slides on what's next I don't know if we want to keep going Alexa you said you had a stop um I I think you can you can continue a bit more as far as I'm concerned all right so mores law is continuing to slow down uh AI models are Contin to double every three and a half half months uh so we need to close this Gap right what's the next accelerator going to look like the lpu is improving we're going to continue to get another maybe 10x and and more the question is how do you come up with a custom Hardware solution for a specific workload and if you look at our chip what you're going to notice is a very regular chip it's not like a GPU where you have lots of cores but then you have like uh cashes and hpms and all kinds of stuff it's really uh super Lanes running from left to right these are the lanes along which the data travels and gets processed these are the assembly lines and then you have um then you have north to south which are instructions that are moving these are the simd structures that we can put together uh what you can do is you can actually configure this right now so we can compile um for any different version of our Hardware more SRAM less SRAM more mxm more vxm like you can do these things what if what happens if you added dam in the process how would that look like and we have a tool called design space exploration that plots a curve for a specific uh cost function and we could choose along that curve of what we want to build right so now you can pass in AI HPC models you can see here efficient net is our model you can configure different versions of these these options that we have and you can get a nice curve to evaluate across and once you pick uh a design spot that you want to do you can actually uh create either a chiplet or a chip or embedded IP and the software is going to be the same for all of these it's really we can configure the software uh kind of compiler for what version or what configuration of this core we're using and we can generate software that will compile into that into that core so this is kind of next push uh Beyond our next Generation silicon and this would be meant for somebody that's massive uh enabling massive llms and they want to deploy something that's super efficient for that specific workload or a group of workloads that are weighted in a specific fashion and we want to do this very quickly we want to have this turnaround time such that by the time we deploy silicon we can do this in a very fast manner as opposed to 18 months or more we can pull that into 12 months because we are fully deterministic a lot of the the hardware characterization can be done even before the Silicon comes back in other words we can actually test all these software um um models how they run how efficiently they run on the hardware and when we when the Silicon comes all we have to do is verify its functionality and reliability and quality and then deploy it out so so that's kind of what's next um but here the summary uh I listed kind some of the grock superpowers here chip determinism which is kind of unlike all the other non-deterministic Hardware out there which is really giving us both power efficiency but it's kind of positioning us really well for 
But here's the summary; I've listed some of the Groq superpowers here. Chip determinism, which, unlike all the other non-deterministic hardware out there, gives us power efficiency and positions us really well for more scaling, like 3D stacking and things like that. A low-diameter Dragonfly network: we don't have any top-of-rack switches, so we can connect along the fastest path between two LPUs and do all-reduce really well. Synchronous global communication: all these chips are synchronized together, so you don't carry all the extra overhead of communicating between one and another. And then the software-scheduled, deterministic compute and network, which gives software full access to the entire system and to how it gets scheduled, so it can extract the best performance out of it (a toy sketch of this static-scheduling idea appears at the end of this transcript). So that's all I had; I'm happy to answer any questions. Sorry for going over a little bit.

No, this is just impressive. Coming from an electronics background, I find this very inspiring and interesting. So, when do we start simulating universes? What's the limit here? Where do you see these chips going over the next decade? In ten years, how much more compute will we have compared to right now?

Yeah, so my team is pushing really hard on the next-generation chip, and we're taping that out this year, and then pushing on quickly after that: trying to saturate the amount of compute, saturate the chip-to-chip bandwidth and latency, and at the same time enable these very quick turnaround times for custom silicon, so that if the models evolve in the future we can quickly customize the hardware to match what's coming. I think the sky is the limit here. It's a really exciting time to be in our industry, and I'm super happy to be part of Groq. I think we're sitting at a really awesome tipping point.

One question from my side, given that I've been dabbling in the startup waters myself since last year: you were founded back in 2016, and it took a lot of conviction and time to get to the point where you suddenly had a breakthrough a couple of days ago on Twitter. Before that, nobody, including me, knew about Groq. So how do you push through that, eight years of just building something? And how do you make sure you don't get out-competed along those eight years, given there are a lot of players: SambaNova, Graphcore, Cerebras, Nvidia, AMD. Everybody's doing something; the field is getting very competitive.

Yeah, you really need to be convinced that what you're working on is unique and that it has a path out, and that's a difficult thing. For eight years you're kind of searching the desert, in some ways, and even if you have the best solution, if it doesn't meet the requirements of the day, you might perish. To be fair, we have really compelling technology, but we were also lucky that open-source models were released at this point, so we can showcase it. If we didn't have open-source large language models, I don't think I would be doing this talk with you today, because it would be hard to show the advantages we have brought to the market. But that all lined up, and here we are.

Yeah, amazing. You wander in the darkness and you try to see some light.

Exactly, exactly.

Awesome, good. I think there are a lot more questions from people in the chat. If you have some time, maybe you can reply to some of those on Discord later today or over the next couple of days, because we're already fairly over time. But this was really impressive.
I'm going to stop the recording now.
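To illustrate the "software-scheduled, deterministic" idea referenced in the summary above, here is a minimal static-scheduling sketch. It is not Groq's compiler or instruction set: the unit names borrow labels mentioned in the talk (MXM, VXM, SRAM) plus an assumed "C2C" chip-to-chip link, and all latencies are made up. The point is that when every operation has a fixed, data-independent latency and the compiler assigns it a concrete start cycle on a concrete unit, the whole schedule, including the total run time, is known before anything executes, with no runtime arbitration, caches, or adaptive routing involved.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    unit: str          # functional unit this op must run on, e.g. "MXM" or "VXM"
    latency: int       # fixed, data-independent latency in cycles
    deps: tuple = ()   # names of ops whose results this op consumes

def schedule(ops):
    """Greedy static schedule: every op gets a concrete start cycle at 'compile time'.
    With fixed latencies and no runtime arbitration, the table and the total cycle
    count are fully determined before execution."""
    by_name = {op.name: op for op in ops}
    start, unit_free = {}, {}            # unit_free: cycle at which each unit is free
    for op in ops:                       # assume ops are listed in dependency order
        ready = max((start[d] + by_name[d].latency for d in op.deps), default=0)
        t = max(ready, unit_free.get(op.unit, 0))
        start[op.name] = t
        unit_free[op.unit] = t + op.latency
    return start

program = [
    Op("load_w", "SRAM", 4),
    Op("load_x", "SRAM", 4),
    Op("matmul", "MXM",  8, deps=("load_w", "load_x")),
    Op("gelu",   "VXM",  2, deps=("matmul",)),
    Op("send",   "C2C",  3, deps=("gelu",)),   # deterministic chip-to-chip hop
]

starts = schedule(program)
for op in program:
    print(f"cycle {starts[op.name]:3d}: {op.unit:<4} {op.name}")
print("total cycles:", max(starts[op.name] + op.latency for op in program))
```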