Transcript for:
Elon Musk's Groundbreaking Supercomputer Achievement

Well, Elon Musk has apparently done it again, creating something that industry experts said was impossible to do, and the All In podcast takes a deep dive into it. So let's take a look. Hey y'all, it's Dr. Know-it-all. I have sliced and diced about an eight-and-a-half-minute clip out of the hour-and-a-half All In podcast, from the 20-minute segment featuring guest of the podcast Gavin Baker. I will leave a link to the original in the description. Highly recommend listening to the entire podcast, of course. But in this video, I want to focus specifically on coherence, coherence in a large supercluster: how the experts in the field thought this was impossible, and how Elon Musk apparently, we don't know the exact details of this, but apparently Elon Musk was able to come up with a solution, and it uses Ethernet of all things. So before I start in on this, I do want to say that the All In podcast basically gives 100% of the credit for this to Elon Musk, like basically he was sitting in a room, came up with the idea, and solved the problem, you know, instantly or something. That is not the case, I'm sure. He may have been the one who came up with the spark of the idea. We don't know at this point. Hopefully, we will find out in the future. And Elon, if you want to talk to me on my channel about all of this, I would be happy to do that, of course, so you can give me the inside scoop. But anyway, I'm sure even if he came up with the original spark of the idea, we're talking 100-plus engineers who had to sit and think about how to actually implement this, write the code, install the hardware, all of that kind of stuff. So XAI in this particular instance, but also Tesla and its Cortex supercomputer, which I'm sure is using basically the same architecture for how to do all of this stuff. They were the ones who came up with the actual solution, right, and implemented all of this stuff. So I just want to be clear that we can't give Elon Musk all the credit for this, because this is way too complicated for any single human being to do. So anyway, with that caveat in mind, let us proceed. A great next place to go would be to talk about the supercomputer being built by a friend of the pod, Elon. He's now got the world's largest supercomputer, and he's going to 10x it. And I would just say this is, I think, a very important moment for this entire AI trade in public and private markets. Everybody, I'm sure, who watches your podcast is very aware of scaling laws. And we have had a scaling law for training, where if you 10x the amount of compute used to train a model, you significantly improve the intelligence and capability of that model. And often there are these kind of emergent properties that appear alongside that higher IQ. No one thought it was possible to make more than 25,000, maybe 30,000, 32,000, pick a number, NVIDIA Hoppers coherent. And what coherent means is that in a training cluster each GPU, to kind of simplify it, knows what every other GPU is thinking. So every GPU in that 30,000 cluster knows what the other 29,999 are thinking. And you need a lot of networking to make that happen. So first of all, Pee-wee's word of the day is coherence. Coherence? Anyway, that is the ability of a large cluster of compute nodes to be able to talk to each other fast enough to maintain the coherence of a complex computation. And oddly enough, in this case, quantum mechanics might come in to actually help us understand this. Physicists, of course, use the term coherence as well.
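Quick aside from me before I get into the quantum analogy: the scaling law Gavin is referencing is usually written as an empirical power law. This is just the generic textbook form, and to be clear, the symbols here are illustrative placeholders, not xAI's or anyone else's actual fitted values:

```latex
% Generic empirical scaling law: training loss L as a function of compute C.
% L_\infty is the irreducible loss; a and b are fitted constants (illustrative).
L(C) = L_\infty + \frac{a}{C^{\,b}}
% Under this form, 10x-ing the compute shrinks the reducible part of the loss
% by a constant factor:
\frac{L(10C) - L_\infty}{L(C) - L_\infty} = 10^{-b}
```

The point of that functional form is that the gains are steady but diminishing: every 10x of compute buys the same multiplicative reduction in reducible loss, which is exactly why everybody keeps wanting the next 10x.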
OK, now the quantum analogy. You could think of two entangled particles, right? You have a couple of particles. This is actually how quantum computing works. But anyway, you have two electrons, two protons, whatever, and you entangle those two particles, which means their states are correlated, right? If one of them has spin up, the other one's going to have spin down. All of that kind of stuff comes along with that. That is coherence on the submicroscopic level, and those two entities have coherence with each other. They're connected to each other. The problem is, as you add a third electron or proton, and a fourth, and a fifth, and a thousandth, and a ten-millionth, they start to decohere. They can't maintain coherence anymore, and they become spread out, and they become, essentially, you know, us. They become macroscopic. They don't have coherence with each other anymore, so they're unable to sort of talk to each other instantaneously. So you could consider that an analogy to the way these compute clusters work. If you have a single computer like what I've got, even this is complicated. Let's dial it way back to the 1980s, where you had one CPU. It's easy for that CPU to talk to itself because there's only one of them. In my Mac Studio, with 10 CPU cores, 16 GPU cores, and a bunch of Neural Engine cores and such, we've already got a significant problem with all of these things talking to each other. Now, of course, multiply that out by a thousand and then a thousand again, and then you see the problem of building a gigantic supercluster. And basically, all of the industry experts thought that it was impossible to maintain coherence past about 25,000 or 30,000 of these GPU nodes. And this is where Elon's genius apparently came in. We don't know exactly what he did, but somehow he managed to figure out how to maintain coherence, not just at 50,000 or 100,000, but apparently at a planned 1 million of these nodes. Now, probably by then we'll be talking about Blackwell, so it will be a million H100 equivalents or something rather than an actual million H100s, but still we're talking about a massive number of these GPU nodes connected together in real time, over high-bandwidth links and things like that, able to talk to each other fast enough to maintain coherence as they're doing training and inference. And of course, if that works, we can get potentially incredible emergence. The way our brains work, apparently, is that we have relatively simple connections between neurons, but we have so many of them, and they maintain coherence across long periods of time, and that allows the emergence of complex behavior, including intelligence. And so what we might see is something this big and this complicated maintaining coherence, if it actually works. And Gavin talks about this. If it actually works, this could be a complete game changer for the nature of AI itself, and might actually enable something semi-conscious made out of silicon. That's the kind of thing we could be looking at if this actually works. And with that in mind, let's take a look at this Tom's Hardware article, which I will also leave a link to in the description: First in-depth look at Elon Musk's 100,000 GPU AI cluster: xAI Colossus reveals its secrets. I'm just going to touch on this one paragraph, because this is the really important part for our discussion here.
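One quick notation aside before I get to the article, for those who like to see the math: the entangled pair I described above is what physicists call a spin singlet. This is the standard textbook state, nothing specific to this story:

```latex
% Spin-singlet state of two entangled particles, A and B:
|\psi\rangle = \frac{1}{\sqrt{2}}\Big( |{\uparrow}\rangle_A |{\downarrow}\rangle_B - |{\downarrow}\rangle_A |{\uparrow}\rangle_B \Big)
% Measure A and get spin-up, and B is guaranteed to be spin-down, and vice
% versa: that perfect correlation is the "coherence" in the analogy.
```

Add a third, fourth, thousandth particle all interacting with the environment, and that clean correlation structure falls apart. That's decoherence, and it's the analogy for a cluster losing its ability to act as one machine. OK, on to the article.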
Because of the high bandwidth requirements of an AI supercluster constantly training models, and I will say also doing inference, because inference compute time is going to be important going forward (that means the amount of time the model can think about a problem, as well as the amount of training that went in), anyway, XAI went beyond overkill for its networking interconnectivity. It sounds like this was the plan that Elon had long-term, to be able to 10x this, you know, and then 10x it again, that kind of a thing. Each graphics card has a dedicated NIC, or network interface controller, at 400 gigabits per second, with an extra 400-gigabit NIC per server. That means that each HGX H100 server has 3.6 terabits per second of Ethernet: nine NICs at 400 gigabits each, eight for the GPUs plus one for the server, works out to 3,600 gigabits, or 3.6 terabits, per second. So that's the speed at which it's able to communicate with the outside world, with the rest of the cluster. In other words, if you think of this analogously as a neuron in the brain, that's the speed it can communicate with the rest of the neurons in the brain. And that is a game changer. It is radically faster than the nodes of other supercomputers can talk to each other. And that, I think, is the way that these things are able to communicate and maintain coherence across such a large number of GPUs. And then, interestingly enough, the last sentence here: yes, the entire cluster runs on Ethernet, Ethernet, the thing that I'm connected to right now, that you're connected to, all of that kind of stuff, rather than InfiniBand or other exotic connections, which are standard in the supercomputing space. And I will throw in here that Elon Musk's and Tesla's experience with doing Ethernet for the Cybertruck, and of course the CyberCab coming up, probably has a huge impact on their knowledge of Ethernet and their ability to use it for such a high-performance use case as building Colossus, and also, of course, Cortex in Austin at Tesla's Gigafactory. So this reminds us of the synergies between Elon Musk's companies. We're also going to talk about the synergy between XAI and X in just a couple of minutes. (After this next podcast segment, where Gavin walks through the hierarchy of communication speeds, I'll lay out the numbers in a little sketch.) Getting back to the podcast. Just to slow down for the audience here, Gavin, maybe explain why transporting information between the GPUs is important. That's what these H200s, H100s do particularly well. They'll move a couple of terabytes a second from one processor to the next processor. You know, picture a server, in the case of GPUs: it looks like maybe three pizza boxes stacked on top of each other, and it has eight GPUs together. You can think of the speed of communication: on chip is the fastest; chip to memory, next fastest; you know, chip to chip within a server, next fastest. And so you take those units of servers, where the GPUs are connected on the server with a technology called NVSwitch, and you stitch them together into a giant cluster. And each GPU has to be connected to every other GPU and know what they're thinking. They need to be coherent. They need to kind of share memory. For the compute to work, the GPUs need to work together. And no one thought it was possible to connect more than 30,000 of these with today's technology. From public reports, Elon, as he so often does, focused deeply on this, thought about it from first principles, and he came up with a very, very different way of designing a data center. And he was able to make over 100,000 GPUs coherent. No one thought it was possible, but, I would have said, there were all these articles being published in the summer
saying that no one believed he was going to be able to do it. It was hype. It was, you know, ridiculousness. And the reason the reporters felt comfortable writing those silly stories is because engineers at Meta and Google and other firms were saying, we can't do it, there's no way he can do it. And I think the world really only believed it when, you know, Jensen did that podcast and said, what Elon did was superhuman, no one else could have done it. Now we will see if someone else is able to do it, but it was really, really hard. And as a result of that, Grok 3 is in training now on this giant Colossus supercomputer, the biggest in the world, 100,000 GPUs, with a lot of Megapacks around it. And the city of Memphis is all in on supporting this. But you have not had a real test of scaling laws for training, arguably, since GPT-4. And this will be the first test. So the interesting upshot of this, of course, is we don't know, and that includes Elon Musk. We don't know if scaling this thing up to the kind of size that we're talking about, a 50,000 cluster right now building to 100,000, then eventually to a million, we don't know if this is actually going to work. There has been a lot of, you know, crosstalk back and forth about how scaling laws are actually breaking down, and how throwing more and more compute at these problems is getting tinier and tinier incremental gains. But maybe this is going to break the whole thing open, and it's going to work much better. I did a video recently on version 13 of Tesla's software, the full self-driving software; you can check it out up here. I haven't got it myself yet. I can't wait to do a first test drive in it, of course, so stay tuned for that. But the upshot of this is that version 13, which is the first one that's really been trained on the Cortex cluster in Texas, the big new cluster, is a step change. By all accounts, it is a step change from the quality of full self-driving 12, which was already very, very good. So we're talking about emergent properties, the ability to get much better with the appropriate kind of scale. So this is what Elon and XAI and also Tesla are banking on: if they can scale this compute, if they can scale the training and also the inference compute enough, they are going to leapfrog everybody else. And the proof is going to be in the pudding in January to February, when Grok 3 comes out. If that is significantly better than what anybody else can do, then everybody else is going to be playing catch-up. Gavin talks about the prisoner's dilemma at the end of this, so we'll get back to that later on in this episode. And if scaling laws for training hold, Grok 3 should be a significant advance in the state of the art. And, you know, from like a Bayesian way to look at the world, that is an immensely important data point. But if that doesn't work... and I think it is going to work, I think Grok 3 is going to be really good. Yeah, they've raised a tremendous amount of capital, a lot of it from the Middle East, and they're supposedly going to build Colossus to a million GPUs. If this data center gets 10 times bigger than it is currently, do we start to see a shift in how the architecture of these systems is run? Meaning, like, do we start to build models of models? And does that start to resolve into a higher-level architecture that unlocks new performance capabilities?
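Before Gavin answers the models-of-models question, here is the little numeric sketch I promised of the communication hierarchy he just walked through. Only the 400-gigabit NIC figure comes from the Tom's Hardware article; the on-chip, memory, and NVSwitch numbers are ballpark public figures for Hopper-class hardware that I'm supplying purely to show the ordering, so treat them as illustrative, not as Colossus's measured specs:

```python
# Rough per-GPU bandwidth hierarchy for an HGX H100-class server.
# Only the 400 Gb/s NIC figure is from the article; the rest are ballpark
# public numbers for Hopper-class hardware, used to illustrate the ordering.

GBITS_PER_NIC = 400      # Ethernet NIC speed per GPU, gigabits/s (article)
NICS_PER_SERVER = 9      # one per GPU (8) plus one extra per server (article)

# Aggregate Ethernet egress per server, in terabits/s:
server_egress_tbps = GBITS_PER_NIC * NICS_PER_SERVER / 1000
print(f"Per-server Ethernet: {server_egress_tbps:.1f} Tb/s")  # -> 3.6 Tb/s

# The hierarchy Gavin describes, fastest to slowest, in GB/s per GPU
# (approximate, illustrative values):
hierarchy_gb_per_s = {
    "on-chip SRAM":                 20_000,  # very rough ballpark
    "GPU <-> HBM memory":            3_350,  # H100 HBM3, public spec
    "GPU <-> GPU via NVSwitch":        900,  # H100 NVLink, public spec
    "GPU <-> network (400 Gb NIC)":     50,  # 400 gigabits/s = 50 GB/s
}
for level, bw in hierarchy_gb_per_s.items():
    print(f"{level:>30}: ~{bw:,} GB/s")
```

Each step down that hierarchy costs you roughly an order of magnitude, which is exactly why the cluster-wide interconnect is the bottleneck everyone thought capped coherence around 30,000 GPUs.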
So I would just say we're already building models of models, you know; lots of very clever things are being done. Every AI application company has what's called a router, so they can swap out the underlying model if another one is better for the task at hand. There's been a big debate that we were hitting a wall on these scaling laws, and that scaling laws were breaking down. And I just thought that was deeply silly, because no one had built a cluster bigger than, you know, 32,000 H100s, so nobody knew; it was a ridiculous debate. Grok 3 is the first new data point to show whether scaling laws are breaking or holding, because no one else thought you could make a hundred thousand Hoppers coherent. And I think, based on public reports, they're going to 200,000 Hoppers. And then the next tick is a million. It was reported they're going to be first in line for Blackwell. But Grok 3 is a big data point and will resolve this question of whether or not we're hitting a wall. I want to jump in here quick and say that I actually think version 13 of Tesla's full self-driving is the first data point, and that Grok 3 will be the second data point. Because even though it's a different company, we're talking about what's got to be more or less the same kind of structure that they've built out, the same architecture. Elon is famous for sharing, you know, for cross-pollinating between different companies. And for something this expensive and this cutting-edge, of course they're going to be talking to each other. But if version 13 of Tesla's full self-driving lives up to what it looks like it's going to live up to, then that clearly shows that that scalability, that next step in AI training compute, is going to have a massive effect, and we should see a huge step forward. Grok 3 should be significantly better than what we're seeing out of the competition if this is true. If it's not, then we have a data point that shows that scaling doesn't work quite so well, and that maybe people need to dial things back a little bit and not spend $100 billion on these data centers and everything like people are talking about. Of course, Gavin also talks about, and I've cut most of this stuff out, but he also talks about creating better models of models, creating better architectures, thinking about things in different ways. That is all going on at the same time. So even if these scaling laws don't work, we still have a lot of room to develop all of the technology that's been built in the last few years and kind of scale it out and make it work even better. But if this scaling law actually works and $100 billion is table stakes now, that just makes things all the more interesting. Who can afford to keep doing this stuff? Well, only the richest companies in the world, and maybe the richest individual in the world. By the way, we should note there is now a new axis of scaling. Some people call it test-time compute. Some people call it inference scaling. And basically the way this works, you can just think of these models as human: the more time you give one of these models to think, the way you'd give your, like, 17-year-old going off to take the SAT more time, the better it will do for you. We have been giving these models the same amount of time to think no matter how complicated the question was. What we've now learned is that if you let them think for longer about more complex questions, test-time compute, you can dramatically improve their IQ. So we're just at the beginning of this new scaling law.
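Two ideas from that segment, the application-level router and the test-time compute budget, fit together naturally, so here's a minimal sketch of both at once. To be clear, this is my own illustration: every name in it (route_request, MODELS, the token budgets, the difficulty heuristic) is hypothetical, and real routers use learned classifiers rather than anything this crude:

```python
# Minimal sketch: a model router that also scales test-time "thinking" budget
# with task difficulty. All names and numbers are hypothetical illustrations.

from dataclasses import dataclass

@dataclass
class ModelChoice:
    model: str              # which underlying model to call
    max_think_tokens: int   # test-time compute budget for reasoning

# Hypothetical model tiers, cheapest to most capable. Swapping an entry here
# swaps the underlying model, which is Gavin's point about routers.
MODELS = {
    "easy":   ModelChoice(model="small-fast-model", max_think_tokens=256),
    "medium": ModelChoice(model="mid-tier-model",   max_think_tokens=2_048),
    "hard":   ModelChoice(model="frontier-model",   max_think_tokens=16_384),
}

def estimate_difficulty(prompt: str) -> str:
    """Toy heuristic; a real router would use a trained classifier."""
    if "prove" in prompt.lower() or len(prompt) > 2_000:
        return "hard"
    if len(prompt) > 300:
        return "medium"
    return "easy"

def route_request(prompt: str) -> ModelChoice:
    """Pick a model tier and a thinking budget for this prompt."""
    return MODELS[estimate_difficulty(prompt)]

choice = route_request("Prove that the sum of two even integers is even.")
print(choice.model, choice.max_think_tokens)  # -> frontier-model 16384
```

The test-time compute insight is just that max_think_tokens should not be a constant: letting the hard tier think 64 times longer than the easy tier is the "new scaling law" Gavin is describing.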
So with test-time compute in mind, let's remember that Colossus might not only be used for training. Right now it's being used for training, but it might be used for inference as well. I mean, if you throw 100,000, 200,000, a million of these GPUs at a fundamental physics problem, like what is the nature of time, or how can you integrate general relativity and quantum field theory, if you can throw that kind of inference or test-time compute at a problem like that, can you actually solve the problem? Can you actually come up with a theory of everything? I always think about that because it's a problem that I want solved before I die. I really want to know what the answer to this question is, or at least our best guess at it. And I think that this is probably the best option we have at this point. Just imagine if you threw a million of these GPUs at a problem like that and gave it as long as it needed to come up with the answer. Would it be able to? That would be a fascinating, fascinating question. And it would be answered by test-time compute, not training compute. And there's a context window shift underway as well, which also creates a new kind of scaling axis, arguably, in terms of the potential set of applications. So networks of models, think time, context window: there are multiple dimensions upon which these tools ultimately kind of resolve to better performance. Can you kind of theorize on what the build-out that's being done with Colossus does to the advantage that OpenAI has today? How long till we kind of catch up there with XAI, and how much is going to be disrupted, and how quickly here? Well, if scaling laws hold, the best information I have is that the largest cluster Microsoft has, after panicking, is still smaller than XAI's cluster in Memphis. If you didn't believe it was possible, you weren't even working on it. Grok 3 should take the lead, if scaling laws hold, in January or February. I think there are a lot of reasons, if scaling laws hold, to be optimistic about Grok 3. Right now, you have like a friend in your pocket who has an IQ of 115, 110 maybe, but has all of the world's knowledge accessible to it, and that's what makes it amazing. You will have a friend in your pocket with an IQ of maybe 130 that knows everything, has more up-to-date knowledge of the world, and is more grounded in factual accuracy. Grok, because of the Twitter data set, knows what is happening in the world at this moment. So let's pause on that for a second. This is kind of the genius, the Trojan horse nature of Elon Musk's companies, at least Tesla and X and XAI: you think they're one thing, but they're actually another. Tesla is not a car company, at least not anymore. They are an AI company, and the cars are the way that they gather the data to train the AI, and the AI is the real product. That is the future of Tesla. If you don't believe that, you should probably not own Tesla stock. Again, not financial advice ever, because I suck at investing, or else I'd be retired, you know, living in Hawaii or something. But anyway, as far as I can tell, Tesla is not a car company. It is an AI company. That's what they are. And X slash XAI, and I know officially it's two separate companies, but they're basically rolled into one thing because they're both privately owned by Elon Musk. But X and XAI, that's not a social media company. It is an AI company. I don't know if Elon knew that when he purchased Twitter and turned it into X, but he knows it now.
X is just the way that you ingest the information that you need to train a large language model, or really a large multimodal model; I would assume Grok 3 is going to be able to do vision and sound and things like that as well. Otherwise, why would you have it? The next-generation frontier models can do that, and Grok needs to be able to do that as well. But effectively, as Jason said, we're talking about X, you know, Twitter, being the mouth that feeds this AI. And that's what you've got. You've got automobiles from Tesla that feed Tesla's AI. You've got all of us humans that interact with X just throwing tons and tons of information at it, training it up, not just on past historical information and how to communicate, but also on current events and everything like that. That is going to give Grok a massive leg up over other AIs, because it will have access to real-time data. That's something these other models lack: they train and then, you know, they're like, our cutoff was April of 2024 or something. And you're like, well, that's great, but what about the election? What about things that have happened in the last six months? It doesn't know. It might have a tool to go out and search the web, but it's not going to know as efficiently as Grok 3 will be able to know. So anyway, at least for X slash XAI and Tesla, think of them as AI-first companies. The other stuff is just a way of feeding the AI. And I'm sure both of you have come across these. There are lots of companies that are just these thin wrappers over a foundation model. And they go from zero to 40 million instantaneously. Yeah. And they're profitable. And for their customers, they're replacing labor budgets. Yes. I'm sure you guys are noticing this too, but startups today at a given size are employing fewer people than they would have three years ago. And, you know, it's funny, people were very skeptical. I would say 50% less. Yeah. And that's the ROI on AI. You're seeing real ROI on AI from startups, the same way they saw real ROI from cloud computing before anyone else. It's crazy. These companies are in a classic prisoner's dilemma. They all believe, to varying degrees, that whoever gets to artificial superintelligence first is going to create tens or hundreds of trillions of dollars of value. And I think they may be right, if they get there. And they think that if they lose the race, their company is at mortal risk. So as long as one person is spending, I think they will all spend, even if the ROI decelerates. It is a classic prisoner's dilemma. All right. So a lot to unpack there at the end. The first thing is that thin wrapper. He kind of throws that off like it's nothing. I've been working on something you might call a thin wrapper around a foundation model for about four or five months now. It is not easy. Now, on the other hand, at our little startup company, we have a team of six people working on something that easily would have taken double, triple that, call it 30 people probably, five years ago in 2019 or 2020. It's incredible how much AI has reduced the need for labor in a circumstance like this. Our team is a startup team. We're very, very small. We're very, very lean, but we can be much, much more lean than we used to be. And not just in human capital, but also in terms of compute and stuff. There would have been a time, before AWS and other entities like that, that we would have had
to build out a server farm just to do the work, to do the training, to be able to do the compute that we need. We don't have to do that right now. We can just throw that off onto a solution like AWS, and that takes care of it for us. Now, of course, they make a lot of money on this, and it costs us money ultimately to do that, but it's a much more efficient way to build out your ideas and your products than it would be to have to invest all of that money yourself. You'd need hundreds of thousands of dollars to build out your compute cluster, not to mention all of the extra people that you would have to hire, so everything would be much more expensive. So from experience, I will say, what Gavin is talking about here in terms of ROI, especially for startups, it's huge. Now, the prisoner's dilemma actually gets flipped for the larger companies. Of course, we small startups are not going to be investing billions of dollars in creating these gigantic frontier server clusters. That's for somebody else to do, and we just make use of it after it comes out, which is great for us, but very complicated for these larger companies that have to raise and then invest tens to potentially a hundred billion dollars to build these things over time. Very, very complicated. So circling back to these large clusters, the question, the crux of all of this, was how big can you build these and still have them act as a single entity? How long can they maintain coherence? The wisdom on the street about six months ago was that the limit was somewhere around 30,000 of these GPUs. Elon Musk apparently figured out a way to break through this. It sounds like it was very difficult to do, but they're now at 50 to 100,000. The goal is to go to 200,000, then to a million, perhaps more. And if he really has figured out how to break through this bottleneck, and scaling laws hold, in other words, if you can still get more performance out of larger scale, and both of those things we don't know the answer to yet... but Elon Musk, he's the bet-the-farm kind of guy. That's what he's always done. He's like, I get this stuff, I will go all in on this, I will try to make this work. And if I'm right, I will get the reward. If I'm wrong, of course, you know, I'll reap the reward of failure, which is not having your company be in business anymore. But if he's right, and he tends to be right, if those two things hold, then X and XAI and Tesla are going to be worth tens of trillions to $100 trillion. I don't know. It's difficult to know how big the market can get if these kinds of things hold. And at that point, a $10 or $100 billion investment, who cares? It's just table stakes. It's not really a big deal. So with that, I'll close this video. I would love to know what you all think about this in the comments. Please do let me know. And while you're at it, if you don't mind liking and subscribing, I would love to get to 100,000 subscribers before my birthday in late January. It's my 60th birthday, and it would be a fantastic birthday present. So thank you all in advance for helping me out with that. And in the meantime, I will see you in the next video. Bye bye.