Transcript for:
Grok 4 & Heavy AI Breakthroughs

Grok 4 is smarter than almost all graduate students in all disciplines simultaneously. It's actually just important to appreciate that. That's really something.
So yeah, Elon actually did it. Grok 4 is here, and it might just be the most powerful AI model on the planet. It's crushing benchmarks. It can sing. It can run a business. And honestly, that's just scratching the surface. But the wildest part of this release was when the xAI team straight up said they're running out of human tests and that their next benchmark will be reality itself. Let's get into it.
All right, so first of all, what do they mean by reality will be their next benchmark? Well, as you can see here, the first benchmark they showed was Humanity's Last Exam, a benchmark that consists of the hardest of the hardest problems from a whole range of subjects, created by top experts in those subjects. It's a 2,500-question exam that no human could possibly pass, and it's literally designed to be the last exam needed for AI, hence the name. And well, Grok 4 is already scoring 40% on it with tool usage. And Grok 4 Heavy, which we're going to talk about more in a minute, manages to get up to 50%. That's a stark difference from the single-digit percentages models were getting at the start of this year, when the benchmark was first introduced, and from the second-highest score right now, held by Gemini 2.5 Pro. Both without tool usage and with tool usage, Grok 4 is state-of-the-art.
Then, when you look at the more traditional benchmarks like GPQA, which consists of PhD-level science questions, or AIME, a challenging math benchmark, it has pretty much saturated them. I mean, Grok 4 Heavy literally scores 100% on AIME, which is actually insane. AIME has been solved. You've also got a few other benchmarks here that Grok 4 is showing significant improvements on over the previous state-of-the-art. So again, what do they mean by reality is becoming the new benchmark?
Well, clearly current benchmarks are being saturated, and new ones that were created with the intent of being extremely challenging are also getting solved at lightning speed. And so the next chapter is not only continuing to scale up post-training, or reinforcement learning with verifiable rewards, as they've done by 100x with Grok 4, but actually testing the AI on getting things done in the real world. Essentially, using real-world complex tasks to evaluate the AI, which in turn will train it to become more useful in the real world. Check this out.
Yeah, and we actually are running out of actual test questions to ask. So even questions that are ridiculously hard, if not essentially impossible, for humans, written-down questions, are swiftly becoming trivial for AI. But the one thing that is an excellent judge of things is reality. Because physics is the law; ultimately everything else is a recommendation. You can't break physics. So the ultimate reasoning test, I think, is reality. Yes. So you invent a new technology, like, say, improve the design of a car or a rocket, or create a new medication, and does it work? Yeah. Does the rocket get to orbit? Does the car drive? Does the medicine work? Whatever the case may be, reality is the ultimate judge here. So it's going to be reinforcement learning, closing the loop around reality.
So, yeah, pretty wild. Musk also claims that Grok may be able to invent new technologies that are actually useful as soon as later this year, and that it will potentially even discover new physics by next year.
Now, you might be wondering, what's the difference between Grok 4 and Grok 4 Heavy? I mean, besides the whopping $300 a month it's going to cost you to even use Grok 4 Heavy, how is it actually different?
Well, part of the reason it's likely so expensive is because Grok 4 Heavy isn't just a bigger model. It's a multi-agent system. When you ask it a question, it spawns multiple expert agents in parallel. They each try to solve your question individually, then compare notes like a high-IQ study group before giving you the best collective answer. So when you prompt Grok 4 Heavy, you're not just talking to one genius. You're talking to an entire panel of PhD-level experts working together instantly.
Now, here's maybe the most shocking benchmark from the whole presentation. Most of you have probably already heard of ARC-AGI. It's a benchmark that measures how well AI models can generalize. Basically, how well they can answer questions they've never seen before. And as you can see here, not only is Grok 4 state-of-the-art on the original ARC-AGI-1 benchmark, which is already starting to get saturated, but on the newer ARC-AGI-2 test, which is still semi-private, Grok 4 is nearly doubling the second-best model, Gemini 2.5 Pro. And it's doing this at a cost per task that is relatively similar to the other models. Actually insane.
They also gave us a quick demo of Grok Voice, which is now twice as fast, more natural, snappier, and can even sing. Take a listen.
You know, we have a lot to talk about, but I'll be honest, I'm feeling a little nervous. Can you whisper something soothing to calm me down?
Take a deep breath, love. You've got this. It's just you and me having a quiet chat, like we're tucked away in a cozy corner of a Yorkshire pub. The world's just a murmur out there. Feel that calm wash over you. Let's keep talking nice and easy. What's on your mind?
You know what? I'm actually feeling quite relaxed to try something crazy. So, can you give us an opera on Diet Coke?
Oh, an opera on Diet Coke. That's a riot. Picture this: a shimmering silver can glistening under stage lights. Oh, Diet Coke, thou elixir divine, with bubbles that dance in a sparkling line.
Thy crisp, cool kiss on lips so fine. Oh, aspartame ambrosia, forever mine. How's that for a mad little aria? Want me to belt out another verse or switch up the tune?
All right, now let's go back to the idea of using reality itself as the benchmark, because we actually got a glimpse of what that could look like. You may have heard of Vending-Bench. It simulates a scenario where an AI runs an entire vending machine business on its own, end to end. They even brought the creators of the benchmark on stage to explain it. But all you really need to know is that Grok 4 crushed it. As you can see here, Grok 4 sold over 4,500 units. That's nearly 3x what Claude Opus 4 managed and over 10x what the average human did. Someone on the xAI team joked about deploying Grok to run real vending machines in the real world to help cover compute costs. But honestly, that's not even that far-fetched.
This is literally where we're heading. Future benchmarks won't just be multiple-choice tests or logic puzzles. They'll be real-world tasks, especially when these models gain embodiment through humanoid robots. And if you take this even further, entire jobs and maybe even entire industries could be evaluated like this. Imagine, for instance, a benchmark where the model has to build an entire video game from scratch that is both good and that people actually want to play, or an entire movie that people actually want to watch. These could be real tests in the near future. And here's what Elon has to say about that.
Yeah. Now, the next step obviously is for Grok to be able to play the games. So it has to have very good video understanding, so it can play the games and interact with the games and actually assess whether a game is fun, and actually have good judgment for whether a game is fun or not. So with version seven of our foundation model, which finishes training this month, and then we'll go through post-training RL and whatnot.
That will have excellent video understanding. And with the video understanding and improved tool use, for example, for video games, you'd want to use, you know, Unreal Engine or Unity or one of the main graphics engines, and then generate the art, apply it to a 3D model, and then create an executable that someone can run on a PC or a console or a phone. We expect that to happen probably this year. And if not this year, certainly next year. So it's going to be wild. I would expect the first really good AI video game to be next year, and probably the first half hour of watchable TV this year, and probably the first watchable AI movie next year. Like, things are really moving at an incredible pace.
So, this brings us to xAI's future roadmap. Basically, what's next? As you can see here, they've laid out a pretty stacked lineup over the next few months. In August, we're getting a dedicated coding model. In September, a multimodal agent, which sounds like Grok with vision, voice, tool use, and possibly memory. And in October, a video generation model, which they say could rival Google's Veo 3. All these are already in the works. And if they actually ship everything on this roadmap, then by the end of the year, xAI, a company that didn't even exist 2 years ago, could be sitting right at the frontier of AI. Obviously, it will depend on what the other labs drop as well. I mean, we're still waiting on GPT-5, Gemini 3, and possibly even Claude 5, but it's clear that Grok is now a legit competitor.
Anyways, that was xAI's Grok 4 release. Let me know what you guys thought about this model in the comments. Personally, I would say it actually exceeded my expectations.
I didn't think we'd see performance at this level so soon, and the direction xAI seems to be heading, along with the speed at which they're moving, kind of shook up my internal ranking for who's leading the AI race right now. I would say Google and OpenAI are definitely still in the top two, but I now put xAI in third, above Anthropic. And I could see Meta slowly making its way up there with all their new acquisitions.
So, as always, if you want to stay up to date on future AI news just like this, make sure to drop a like, hit that subscribe button if you haven't already, and I'll be catching you guys in the next one.