Transcript for:
Lecture by Jonathan Ross

Okay, first of all, it's incredible to be here. I have a few notes so I make sure we cover everything. I want to take this opportunity to introduce Jonathan. Obviously a lot of you guys have heard of his company, but you may not know his origin story, which, quite honestly, having been in Silicon Valley for 25 years, is one of the most unique origin and founder stories you're going to hear. We're going to talk about some of the things that he's accomplished both at Google and Groq, we're going to compare Groq and Nvidia, because I think it's probably one of the most important technical considerations that people should know, we'll talk about the software stack, and we'll leave with a few data points and bullets which I think are pretty impressive.

So I want to start with something that you do every week, which is you typically tweet out some sort of developer metric. Where are you as of this morning, and why are developers so important?

Can you hear me? Try this one. Testing. Ah, perfect. So we are at 75,000 developers, and that is slightly over 30 days from launching our developer console. For comparison, it took Nvidia seven years to get to 100,000 developers, and we're at 75,000 in about 30-ish days. The reason this matters is that the developers, of course, are building all the applications, so every developer is a multiplicative effect on the total number of users that you can have. It's all about developers.

Let's go all the way back. With that backdrop, this is not an overnight success story; this is eight years of plodding through the wilderness, punctuated, frankly, with a lot of misfires, which is really the sign of a great entrepreneur. But I want people to hear this story. You guys have all heard about entrepreneurs who have dropped out of college to start billion-dollar companies; Jonathan may be the only high school dropout to have also started a billion-dollar company. So let's start with just two minutes of background on you, because it was a very circuitous path to being an entrepreneur.

So I dropped out of high school, as mentioned, and I ended up getting a job as a programmer, and my boss noticed that I was clever and told me that I should be taking classes at a university despite having dropped out. So, unmatriculated (I didn't actually get enrolled), I started going to Hunter College as a side thing, and then I sort of fell under the wing of one of the professors there, did well, transferred to NYU, and then I started taking PhD courses, but as an undergrad, and then I dropped out of that.

So do you technically have a high school diploma? No. Okay, this is perfect. And nor do I have an undergrad degree.

So from NYU, how did you end up at Google? Well, actually, if it hadn't been for NYU, I don't think I would have ended up at Google, and this is interesting even though I didn't have the degree. I happened to go to an event at Google, and one of the people at Google recognized me because they also went to NYU, and then they referred me. So you can make some great connections in university even if you don't graduate. It was one of the people that I was taking the PhD courses with.

And when you first went there, what kind of stuff were you working on? Ads, but testing. We were building giant test systems, and if you think that it's hard to build production systems, test systems have to test everything the production system does. We did it live, so on every single ads query we would run a hundred tests.
But we didn't have the budget of the production system, so we had to write our own threading library; we had to do all sorts of crazy stuff which you don't think of in ads. It was actually harder engineering than the production system itself.

And so, you know, Google is very famous for 20% time, where you kind of can do whatever you want. Is that what led to the birth of the TPU, which is now, I think, what most of you guys know as Google's leading custom silicon that they use internally?

So 20% time is famous. I called it MCI time, which probably isn't going to transfer as a joke here, but there were these advertisements for this phone company: free nights and weekends. So you could work on 20% time so long as it wasn't during your work time. But every single night I would go up and work with the speech team. This was separate from my main project, and they bought me some hardware, and I started what was called the TPU as a side project. It was funded out of what a VP referred to as his slush fund, or leftover money, and it was never expected to be successful; there were actually two other projects to build AI accelerators. That gave us the cover we needed to do some really counterintuitive and innovative things. Once it became successful, they brought in the adult supervision.

Okay, take a step back though. What problem were you trying to solve in AI, when those words weren't even being used, and what was Google trying to do at the time where you saw an opportunity to build something?

So this started in 2012, and at the time there had never been a machine learning model that outperformed a human being on any task, and the speech team trained a model that transcribed speech better than human beings. The problem was they couldn't afford to put it into production. This led to a very famous engineer, Jeff Dean, giving a presentation to the leadership team. It was just two slides. The first slide was the good news: machine learning works. The second slide was the bad news: we can't afford it. They were going to have to double or triple the entire global data center footprint of Google, at an average of a billion dollars per data center, 20 to 40 data centers, so 20 to 40 billion dollars, and that was just for speech recognition. If they wanted to do anything else, like search or ads, it was going to cost more. That was uneconomical, and that's been the history with inference: you train it, and then you can't afford to put it into production.

So against that backdrop, what did you do that was so unique that allowed the TPU to be the one of the three projects that actually won?

The biggest thing was that Jeff Dean noticed that the main algorithm consuming most of the CPU cycles at Google was matrix multiply, and we decided, okay, let's accelerate that, let's build something around that. So we built a massive matrix multiplication engine. When doing this, there were those two other competing teams; they took more traditional approaches to the same thing, and one of them was led by a Turing Award winner. What we did was come up with what's called a systolic array, and I remember when that Turing Award winner was talking about the TPU, he said whoever came up with this must have been really old, because systolic arrays have fallen out of favor. It was actually me; I just didn't know what a systolic array was, and someone had to explain the terminology to me. It was just kind of the obvious way to do it. So the lesson is: if you come at things knowing how to do them, you might know how to do them the wrong way. It's helpful to have people who don't know what should and should not be done.
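For readers who haven't seen one, the idea behind a systolic array is that operands flow through a grid of multiply-accumulate units one hop per clock, so each value is reused many times without going back to memory. The toy simulation below is only a sketch of that scheduling idea, not the actual TPU or Groq design; the array shape and skew are illustrative assumptions.

```python
# Toy, output-stationary systolic-array schedule for C = A @ B.
# The PE (processing element) at grid position (i, j) owns C[i, j]; the index
# k = t - i - j models operands arriving one hop per cycle with a row/column
# skew, the way data would propagate through real hardware.
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    # The last operand reaches the farthest PE at cycle (M - 1) + (N - 1) + (K - 1).
    for t in range(M + N + K - 2):
        for i in range(M):
            for j in range(N):
                k = t - i - j                      # which operand pair hits PE (i, j) this cycle
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]   # one multiply-accumulate per PE per cycle
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The point of the structure is that each operand is touched by a whole row or column of PEs as it streams past, which is why the approach maps so well onto a dedicated matrix multiplication engine.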
So as the TPU scaled, there was probably a lot of internal recognition at Google. How do you walk away from that, and why did you walk away from that?

Well, all big companies end up becoming political in the end, and when you have something that successful a lot of people want to own it, and there are always more senior people who start grabbing for it. I moved on to the Google X team, the Rapid Eval team, which is the team that comes up with all the crazy ideas at Google X, and I was having fun there, but nothing was turning into a production system; it was all a bunch of playing around. I wanted to go and do something real again from start to finish, I wanted to take something from concept to production, and so I started looking outside, and that's when we met.

Well, that is when we met, but the thing is you had two ideas. One was more of, let me build an image classifier that you thought could outperform ResNet, which at the time was the best thing in town, and then you had this hardware path.

Well, actually, I had zero intention of building a chip. What happened was I had also built the highest-performing image classifier, but I had noticed that all of the software was being given away for free: TensorFlow was being given away for free, the models were being given away for free. It was pretty clear that machine learning and AI were going to be open source, even back then, and that was 2016. So I just couldn't imagine building a business around that; it would just be hardscrabble. Chips take so long to build that if you build something innovative and you launch it, it's going to be four years before anyone can even copy it, let alone pull ahead of it. That just felt like a much better approach, and it's atoms; you can monetize that more easily.

So right around that time the TPU paper came out, my name was in it, people started asking about it, and you asked me what I would do differently.

Well, I was investing in public markets as well at the time, dabbling a little in the public markets, and Sundar goes out in a press release and starts talking about the TPU, and I was so shocked. I thought, there is no conceivable world in which Google should be building their own hardware; they must know something that the rest of us don't know, and we need to know that so that we can go and commercialize it for the rest of the world. I probably met you a few weeks afterwards, and that was probably the fastest investment I'd ever made. I remember the key moment: you did not have a company, right, and so we had to incorporate the company after the check was written, which is always either a sign of complete stupidity or, in 15 or 20 years, you'll look like a genius, but the odds of the latter are quite small.

Okay, so you start the business. Tell us about the design decisions you were making in Groq at the time, knowing what you knew then, because at the time it was very different from what it is now.

Well, again, when we started fundraising we actually weren't even 100% sure that we were going to do something in hardware, but it was something that you asked, which is what I would do differently, and my answer was the software. The big problem we had was that we could build these chips in Google, but programming them was the issue: every single team at Google had a dedicated person who was hand-optimizing the models.
And I'm like, this is absolutely crazy. Right around then we had started hiring some people from Nvidia, and they're like, no, no, no, you don't understand, this is just how it works, this is how we do it too: we've got these things called kernels, CUDA kernels, and we hand-optimize them, we just make it look like we're not doing that. But the scale... all of you understand algorithms and big-O complexity; that's linear complexity, for every application you need an engineer. Nvidia now has 50,000 people in their ecosystem, and these are really low-level kernel-writing, assembly-writing hackers who understand GPUs and everything. That's not going to scale. So we focused on the compiler. For the first six months we banned whiteboards at Groq because people kept trying to draw pictures of chips.

So why is it that LLMs prefer Groq? What was the design decision, or what happened in the design? Some part of it is skill, obviously, and some part of it was a little bit of luck, but what exactly happened that makes you so much faster than Nvidia, and why are there all of these developers? What is the crux of it?

We didn't know that it was going to be language, but the inspiration, the last thing that I worked on, was getting the AlphaGo software, the Go-playing software at DeepMind, working on TPU. Having watched that, it was very clear that inference was going to be a scaled problem. Everyone else had been looking at inference as: you take one chip, you run a model on it, it runs whatever. But what happened with AlphaGo was we ported the software over, and even though we had 170 GPUs versus 48 TPUs, the 48 TPUs won 99 out of 100 games with the exact same software. What that meant was that compute was going to result in better performance, and so the insight was: let's build scaled inference. We built in the interconnect, we built it for scale, and that's what we do now. When we're running one of these models we have hundreds or thousands of chips contributing, just like we did with AlphaGo, but it's built for this as opposed to cobbled together.

I think this is a good jumping-off point. A lot of people, and this company deserves a lot of respect, Nvidia has been toiling for decades and they have clearly built an incredible business, but in some ways, when you get into the details, the business is slightly misunderstood. So can you break down, first of all, where is Nvidia natively good and where is it more trying to be good?

So, natively good: the classic saying is you don't have to outrun the bear, you just have to outrun your friends. Nvidia outruns all of the other chip companies when it comes to software, but they're not a software-first company; they actually have a very expensive approach, as we discussed, but they have the ecosystem. It's a double-sided market: if you have a kernel-based approach, they've already won, there's no catching up, hence why we have a kernel-free approach. The other way that they're very good is vertical integration and forward integration. What happens is Nvidia, over and over again, decides that they want to move up the stack, and whatever their customers are doing, they start doing it. For example, I think it was Gigabyte or one of these other PCI board manufacturers who recently announced, even though 80% of their revenue came from the Nvidia boards that they were building, that they're exiting that market, because Nvidia moved up and started doing a much lower-margin thing. And you just see that over and over: they start building it themselves.
Yeah. I think the other thing is that Nvidia is incredible at training, and I think the design decisions that they made, including things like HBM, were really oriented around the world back then, where everything was about training. There weren't any real-world applications; none of you guys were really building anything in the wild where you needed super fast inference. And I think that's another...

Absolutely. What we saw over and over again was that you would spend 100% of your compute on training, you would get something that would work well enough to go into production, and then it would flip to about 5 to 10% training and 90 to 95% inference. The amount of training would stay the same; the inference would grow massively. So every time we would have a success at Google, all of a sudden we would have a disaster. We called it the success disaster: we can't afford to get enough compute for inference, because it goes 10 to 20x immediately, over and over. And if you take that 10 to 20x and multiply it by the cost of Nvidia's leading-class solutions, you're talking about just an enormous amount of money.

So maybe just explain to folks what HBM is, and why these systems, like what Nvidia just announced as the B200, the complexity and the cost, actually matter if you're trying to do something.

Yeah, the complexity spans every part of the stack, but there are a couple of components which are in very limited supply, and Nvidia has locked up the market on these. One of them is HBM. HBM is high-bandwidth memory, which is required to get performance, because the speed at which you can run these applications depends on how quickly you can read that memory, and this is the fastest memory. There's a finite supply, and it is only for data centers, so they can't reach into the supply for mobile or other things like you can with other parts. But also interposers. Also, Nvidia is the largest buyer of supercaps in the world, and all sorts of other components. Cables, the 400-gigabit cables, they've bought them all out. So if you want to compete, it doesn't matter how good of a product you design, they've bought out the entire supply chain for years.

So what do you do? You don't use the same things they do, right, and that's where we come in. So how do you design a chip, then, if you look at the leading solution and they're using certain things and they're clearly being successful? Is it just a technical bet to be totally orthogonal and different, or was it something very specific where you said, we cannot be reliant on the same supply chain because we'll just get forced out of business at some point?

It was actually a really simple observation at the beginning, which is that most chip architectures compete on small percentage differences in performance; 15% is considered amazing. What we realized was that if we were 15% better, no one was going to change to a radically different architecture. We needed to be 5 to 10x. Therefore the small percentages you get chasing the leading-edge technologies were irrelevant, so we used an older technology, 14 nanometer, which is underutilized; we didn't use external memory; we used older interconnect, because our architecture needed to provide the advantage, and it needed to be so overwhelming that we didn't need to be at the leading edge.

So how do you measure speed and value today? Just give us some comparisons for you versus some other folks, running what these guys are probably using: Llama, Mistral, etc.

So we compare on two sides of this: one is the tokens per dollar, and one is the tokens per second per user. Tokens per second per user is the experience, that's the differentiation, and tokens per dollar is the cost. And then also, of course, tokens per watt, because power is very limited at the moment.
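As a rough illustration of how those three metrics relate to one another, here is a back-of-the-envelope sketch. Every number is a placeholder made up for the example, not a Groq or Nvidia figure, and it ignores electricity and other operating costs.

```python
# Back-of-the-envelope relationship between the three metrics above.
# Every number here is a placeholder, not a real measurement.

tokens_per_second_per_user = 200                  # hypothetical per-user speed (the experience)
concurrent_users = 500                            # hypothetical number of simultaneous users
system_cost_usd = 1_000_000                       # hypothetical hardware cost
amortization_seconds = 3 * 365 * 24 * 3600        # amortize the hardware over ~3 years
system_power_watts = 50_000                       # hypothetical power draw

system_tokens_per_second = tokens_per_second_per_user * concurrent_users
lifetime_tokens = system_tokens_per_second * amortization_seconds

tokens_per_dollar = lifetime_tokens / system_cost_usd                        # the cost metric
tokens_per_second_per_watt = system_tokens_per_second / system_power_watts  # the power metric

print(f"experience: {tokens_per_second_per_user} tokens/s per user")
print(f"cost      : {tokens_per_dollar:,.0f} tokens per hardware dollar")
print(f"power     : {tokens_per_second_per_watt:.1f} tokens/s per watt")
```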
If you were to compare us to GPUs, we're typically 5 to 10x faster apples to apples, without using speculative decoding and other things. So right now, on a 180-billion-parameter model, we run about 200 tokens per second, which I think is less than 50 on the next-generation GPU that's coming out from Nvidia.

So your current generation is 4x better than the B200? Yeah. And then in total cost, we're about one-tenth the cost versus a modern GPU, per token. I want that to sink in for a moment: one-tenth of the cost.

I mean, I think the value of that really comes down to the fact that you guys are going to go and have ideas, and especially if you are part of the venture community and ecosystem and you raise money, folks like me who will give you money will expect you to be investing it wisely. Last decade we went into a very negative cycle where almost 50 cents of every dollar we would give a startup would go right back into the hands of Google, Amazon, and Facebook: you were spending it on compute and you were spending it on ads. This time around, the power of AI should be that you can build companies for one-tenth or one-hundredth of the cost, but that won't be possible if you're again just shipping the money back out, except now in this case to Nvidia versus somebody else. So we will be pushing to make sure that this is kind of the low-cost alternative that happens.

So Nvidia had a huge, splashy announcement a few weeks ago. They showed charts, things going up and to the right, they showed huge dies, they showed huge packaging. Tell us about the B200 and compare it to what you're doing right now.

Well, the first thing is that the B200 is a marvel of engineering: the level of complexity, the level of integration, the number of different components in silicon. They spent $10 billion developing it. But when it was announced, I got some pings from Nvidia engineers who said, you know, we were a little embarrassed that they were claiming 30x, because it's nowhere near that, and we as engineers felt that was hurting our credibility. The 30x claim, let's put it into perspective: there was this one image that showed a claim of up to 50 tokens per second for the user experience and 140 for throughput, which sort of gives you the value or the cost. If you were to compare that to the previous generation, that would be saying that the users, if you divide 50 by 30, are getting less than two tokens per second, which would be slow; there's nothing running that slow. And from a throughput perspective, that would make the cost so astronomical it would be unbelievable.

I mean, how many of you guys use any of these chat agents right now? Just raise your hand if you use them. And how many of you keep your hands raised if you're satisfied with the speed and performance? One hand or two? There's like two or three. My experience has been that with these things, if you want to actually make hallucinations go to zero and get the quality of these models really fine-tuned, you have to get back to something like a traditional web experience or a traditional mobile app experience, where you have a window of probably 300 milliseconds to get an answer back. In the absence of that, the user experience doesn't scale, and it kind of sucks.
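To make the connection between per-user speed and that window concrete, here is a purely illustrative calculation; the speeds and the 100-token answer length are assumptions for the example, not measurements of any particular system.

```python
# Purely illustrative: how per-user generation speed turns into wait time.

def wait_seconds(answer_tokens: int, tokens_per_second_per_user: float) -> float:
    """Time for one user to receive a complete answer of the given length."""
    return answer_tokens / tokens_per_second_per_user

for tps in (2, 50, 200):                          # hypothetical per-user speeds
    print(f"{tps:>3} tokens/s per user -> {wait_seconds(100, tps):5.2f} s for a 100-token answer")

# 2 tokens/s -> 50 s, 50 tokens/s -> 2 s, 200 tokens/s -> 0.5 s.
# Only the fastest of these even approaches the ~300 ms regime described above,
# which is why tokens per second per user is the experience metric to watch.
```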
Well, how much effort did you spend at Meta and Facebook getting latency down?

I mean, look, at Facebook at one point I had a team... I was so disgusted with the speed. On a cold cache we were approaching a thousand milliseconds, and I was so disgusted that I took a small team off to the side, rebuilt the entire website, and launched it in India, for the Indian market, just to prove that we could get it under 500 milliseconds. It was a huge technical feat that the team pulled off. It was also very poorly received by the mainline engineering team, because it was somewhat embarrassing. But that's the level of intensity we had to approach this problem with, and it wasn't just us: Google realized it, everybody has realized it. There's an economic equation where if you deliver an experience to users under about 250 to 300 milliseconds, you maximize revenue. So if you actually want to be successful, that is the number you have to get to. The idea that you can wait and fetch an answer in three or five seconds is completely ridiculous; it's a non-starter. Here's the actual number: every 100 milliseconds leads to 8% more engagement on desktop, 34% on mobile. We're talking about 100 milliseconds, which is one-tenth of a second. Right now these things take 10 seconds, so think about how much less engagement you're getting today than you otherwise could.

So why don't you now break down this difference, because I think this is a good place so that people leave really understanding: there's an enormous difference between training and inference and what is required. Why don't you define the differences, so that then we can contrast where things are going to go.

The biggest is that when you're training, the number of tokens that you're training on is measured in months: how many tokens can we train on this month? It doesn't matter if it takes a second, 10 seconds, 100 seconds in a batch; it's how many per month. In inference, what matters is how many tokens you can generate per millisecond, or a couple of milliseconds. It's not in seconds, it's not in months.

Is it fair to say, then, that Nvidia is the exemplar in training? Yes. And is it fair to say that there really isn't yet the equivalent scaled winner in inference? Not yet. And do you think that it will be Nvidia? I don't think it'll be Nvidia. But specifically, why do you not think it will work for that market, even though it's clearly working in training?

In order to get the latency down, what we had to do was design a completely new chip architecture, a completely new networking architecture, an entirely new system, an entirely new runtime, an entirely new compiler, an entirely new orchestration layer. We had to throw everything away, and it had to be compatible with PyTorch and what other people actually develop in. Now we're talking about the innovator's dilemma on steroids. It's hard enough to give up one of those, and if you were to do one of those successfully you would be a very valuable company, but to throw all six of those away is nearly impossible. And you also have to maintain what you have if you want to keep training, right, so now you have to have a completely different architecture for training versus inference, for your chip, for networking, for everything.

So let's say that the market today is 100 units of training, or 95 units of training and five units of inference; I'd just say that's roughly where most of the revenue and the dollars are being made. What does it look like four or five years from now?
Well, actually, Nvidia's latest earnings were 40% inference; it's already starting to climb. Where it's going to end is somewhere between 90 to 95%, or 90 to 95 units of inference. And that trajectory is going to take off rapidly now that we have these open-source models that everyone is giving away: you can download a model and run it, you don't need to train it.

Yeah, and one of the things about these open-source models is that to build useful applications you have to either understand or be able to work with CUDA; with you it doesn't even matter, because you can just port. So maybe explain to folks the importance, in the inference market, of being able to rip and replace these models, and where you think these models are going.

So for the inference market, every two weeks or so there is a completely new model that has to be run. It's important, it matters: either it's setting the best quality bar across the board, or it's good at a particular task. If you are writing kernels, it's almost impossible to keep up. In fact, when Llama 2 70B was launched, it officially had support for AMD; however, the first support we actually saw implemented was after about a week, and we had it in, I think, two days. Now, everyone develops for Nvidia hardware, so by default anything launched will work there, but if you want anything else to work, you can't be writing these kernels by hand, and remember, AMD had official support and it still took about a week. So if you're starting a company today, you clearly want to have the ability to swap from Llama to Mistral to Anthropic and back as often as possible, whatever's latest.
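In practice, on inference providers that expose an OpenAI-compatible endpoint (Groq is one), that swap can be as small as changing a model identifier string. The sketch below assumes such an endpoint; the base URL, environment variable, and model names are illustrative placeholders rather than a guaranteed catalog, and a hosted Anthropic model would live behind its own API rather than this one.

```python
# Sketch: swapping the underlying open-source model per request against an
# OpenAI-compatible endpoint. Model IDs and the base URL are placeholders.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",   # assumed OpenAI-compatible endpoint
    api_key=os.environ["GROQ_API_KEY"],          # illustrative environment variable
)

def ask(model_id: str, question: str) -> str:
    response = client.chat.completions.create(
        model=model_id,                           # the only thing that changes per model
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Rip and replace: same application code, different model each time.
for model_id in ("llama3-70b-8192", "mixtral-8x7b-32768"):
    print(model_id, "->", ask(model_id, "In one sentence, what is a systolic array?"))
```

The application code stays identical from model to model, which is the "no hand-written kernels" point made above carried up to the API layer.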
And just as somebody who sees these models run, do you have any comment on the quality of these models, and where you think some of these companies are going, or what you see some doing well versus others?

So they're all starting to catch up with each other; you're starting to see some leapfrogging. It started off with GPT-4 pulling ahead, and it had a lead for about a year over everyone else, and now of course Anthropic has caught up, and we're seeing some great stuff from Mistral. But across the board they're all starting to bunch up in quality. One of the interesting things is that Mistral in particular has been able to get close on quality with smaller, less expensive models to run, which I think gives them a huge advantage. I think Cohere has an interesting take on a sort of RAG-optimized model. So people are finding niches, and there are going to be a couple that are the best across the board at the highest end, but what we're seeing is a lot of complaints about the cost to run these models. They're just astronomical, and you're not going to be able to scale up applications for users with them.

OpenAI has disclosed, as has Meta, as has Tesla and a couple of others, the total quantum of GPU capacity that they're buying, and you can kind of work backwards to figure out how big the inference market can be, because it's really only supported by them. As you guys scale up, can you give people a sense of the scale of what folks are fighting for?

So I think Facebook announced that by the end of this year they're going to have the equivalent of 650,000 H100s. By the end of this year, Groq will have deployed 100,000 of our LPUs, which do outperform the H100s on a throughput and on a latency basis, so we will probably get pretty close to the equivalent of Meta ourselves. By the end of next year we're going to deploy 1.5 million LPUs. For comparison, last year Nvidia deployed a total of 500,000 H100s. So 1.5 million means that Groq will probably have more generative AI inference capacity than all of the hyperscalers and cloud service providers combined, probably about 50% of the inference compute in the world.

That's just great. Tell us about team building in Silicon Valley. How hard is it to get folks who are real AI folks, against the backdrop of: you could go work at Tesla, you could go work at Google, OpenAI, all these places? We are hearing about multi-million-dollar pay packages that rival, you know, playing professional sports. What is going on in finding the people? By the way, you had this interesting thing, because for your initial chip you were trying to find folks who knew Haskell. So just tell us, how hard is it to build a team in the Valley to do this?

Impossible. So if you want to know how to do it, you have to start getting creative, just like anything you want to do well: don't just compete directly. But yeah, these pay packages are astronomical, because everyone views this as a winner-take-all market. It's not about, am I going to be number two, am I going to be number three; they're all going, I've got to be number one, so if you don't have the best talent, you're out. Now, here's the mistake: a lot of these AI researchers are amazing at AI, but they're still kind of green, they're new, they're young, right, this is a new field. What I always recommend to people is to go hire the best, most grizzled engineers who know how to ship stuff on time, and let them learn AI, because they will be able to do that faster than you will be able to take the AI researchers and give them the 20 years of experience of deploying production code.

You were on stage in Saudi Arabia with Saudi Aramco a month ago and announced some big deal. What is going on with deals like that? Where is that market going? Is that you competing with Amazon and Google and Microsoft, is that what that is?

It's not competing, it's actually complementary. The announcement was that we are going to be doing a deal together with Aramco Digital, and we haven't announced how large exactly, but it will be large in terms of the amount of compute that we're going to deploy. In total we've done deals that get us past 10% of that 1.5 million LPU goal, and of course the hard part is the first deals; once we announced that, a lot of other deals are now coming through.

Go ahead. No, no, I was just signaling. So the scale of these deals is that these are larger than the amount of compute that Meta has, right? And a lot of these tech companies right now think that they have such an advantage because they've locked up the supply; they don't want it to be true that there is another alternative out there. So we're actually doing deals with folks where they're going to have more compute than a hyperscaler. That's a crazy idea.

Yeah. Last question: everybody's worried about what AI means. You've been in it for a very long time. Just end with your perspectives on what we should be thinking, and what your perspectives are on the future of AI, our future jobs, all of the typical stuff that people worry about.

So I get asked a lot, should we be afraid of AI?
My answer to that is: if you think back to Galileo, someone who got in a lot of trouble, the reason he got in trouble was that he improved the telescope, popularized it, and made some claims that we were much smaller than everyone wanted to believe. We were supposed to be the center of the universe, and it turns out we weren't, and the better the telescope got, the more obvious it became that we were small. In a large sense, large language models are the telescope for the mind. It's become clear that intelligence is larger than we are, and it makes us feel really, really small, and it's scary. But what happened over time was that, as we realized the universe was larger than we thought and we got used to that, we started to realize how beautiful it was, and our place in the universe. I think that's what's going to happen: we're going to realize intelligence is more vast than we ever imagined, and we're going to understand our place in it, and we're not going to be afraid of it.

That's a beautiful way to end. Jonathan Ross, everybody. Thanks, guys. Thank you very much. I was told Groq means to understand deeply, with empathy; that was embodying this definition.