So OpenAI have finally announced and released their new large language model, OpenAI o1, which is arguably the smartest model in the world. This is a model that has been highly anticipated, and it is one that is so smart that I think you're going to want to watch this entire video until the end, because some of the capabilities are truly remarkable. So let's take a look at everything that has happened, and I'll explain all the key details you'll want to know. Right here you can see that it says "Learning to Reason with LLMs": "We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning." And o1 is quite different to standard models like ChatGPT, as this model thinks before it answers, meaning that it can produce a long internal chain of thought before responding to the user: the model lays out a plan, walks through that plan, and then gives a final output. Now, one of the most incredible things about this model is that it currently exceeds human PhD-level performance on a variety of benchmarks. It clearly states here that OpenAI o1 ranks in the 89th percentile on competitive programming questions on Codeforces, which is absolutely insane, because this means it's at expert level, something that previously only a dedicated system from Google could do, and really only with huge amounts of compute. It also places among the top 500 students in the United States in a qualifier for the USA Math Olympiad and exceeds human PhD-level accuracy
on a benchmark of physics, biology, and chemistry problems. And it says that while the work needed to make this model as easy to use as current models is still ongoing, they are releasing an early version, o1-preview, for immediate use today in ChatGPT and the API. So if you're wondering whether this model is actually out today, yes, it is.
Now, if you're also wondering whether this is out in the EU just yet, it's likely going to be delayed by a few hours, something like six to eight hours.
So just be patient and eventually you will see the model appear in your menu. Now, one of the craziest things about this model is that it was trained with large-scale reinforcement learning, which teaches the model to think productively using its chain of thought in a highly data-efficient training process.
And they state: "We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them."
For those of you who are unsure what that means, they're basically stating that this approach scales really well, and right now they're figuring out how on earth they are going to continue scaling it, because this model keeps getting smarter with more train-time compute and keeps getting smarter when it's given more time to think, which is the test-time compute. What they're basically saying is that currently they don't really see any limits, apart from compute, on how smart these models are going to get. If we actually look at the graph, the implications are even more striking. This looks like a new kind of scaling law; as quoted above, the constraints on scaling this approach differ substantially from those of LLM pretraining, and they are continuing to investigate them. The crazy thing is that the graph shows us that as train-time compute goes up, accuracy goes up, and as test-time compute goes up on a log scale, the accuracy of o1 goes up as well. So, I mean, it doesn't take a genius to figure out what these models are probably going to be able to do given more compute and more resources. And the reason I find this so fascinating is that it may show we have actually entered a new paradigm in how these AI models are trained and how they're delivered to users.
We're seeing that train-time compute and test-time compute are both scaling remarkably well, and accuracy increases with compute. So for the many individuals who have been doubting that compute is all you need, this shows us that in this new paradigm, compute might be the most important lever for getting extra performance out of these models.
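To make the log-scale trend concrete, here is a toy illustration of accuracy rising roughly linearly with the logarithm of compute. The constants are made up for illustration and are not fit to OpenAI's actual plots:

```python
import math

def toy_accuracy(compute, base=0.20, slope=0.08):
    """Illustrative only: accuracy climbing linearly in log10(compute),
    capped at 1.0 -- the shape of the trend in o1's scaling plots."""
    return min(1.0, base + slope * math.log10(compute))

# Each 10x increase in compute buys roughly the same accuracy increment
for compute in (1e2, 1e3, 1e4):
    print(f"{compute:>8.0f} -> {toy_accuracy(compute):.2f}")
```

The point of the shape is that gains don't stop; they just get ten times more expensive per step, which is why OpenAI frames compute as the main constraint.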
Combine that with the chain of thought and reinforcement learning, and we have an unstoppable system; I truly can't fathom how smart these models are going to be in the future, considering that we're currently still limited by our levels of compute. Now, if we take a look at what this model is able to do, we can look at some of the evaluations: "To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and machine learning benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting." So what we can see here is three different models: o1-preview, which is largely the distilled version of o1, and then the full o1, which is the version that won't actually be available today. The o1-preview is going to be the distilled version of o1, or Strawberry, or Q*, whatever you want to call it.
And what this basically means is that o1-preview is the model available today; as for the full o1 model, considering that we are restricted by compute, it's likely that it might be available sometime next year or at some point in the future. Now, I think the most important thing here is the remarkable difference between GPT-4o and, of course, o1-preview. I mean, it's not even close.
When we look at these benchmarks, it's almost apples to oranges in terms of comparison. The o1-preview simply dwarfs GPT-4o in terms of raw performance on challenging tasks. And we can see that on competition math
it's almost a four-times increase; on Codeforces, it's almost a six-times increase; and on PhD-level science questions, GPQA Diamond, we can see a remarkable jump that even, shockingly, surpasses expert human levels, which is a whole new paradigm in terms of how we view ourselves on the scale of intelligence. So this is genuinely something that is groundbreaking.
These aren't the only benchmarks here. And trust me when I tell you that these are benchmarks that surprised even me, and I'm someone who definitely expected remarkable performance from these models.
But let's take a look at what else is going on. This is where we look at the GPT-4o versus o1 improvement. We can see four different areas here. On the machine learning benchmarks, there is quite the improvement in terms of the MMMU, the MMLU, and of course MATH-500 and MathVista. Noticeably, MATH-500 is at 94.8%, which is a remarkable jump. And the main thing to understand about this release is that the model mainly performs a lot better on math and other tasks that require long reasoning chains. We can also see the same in chemistry, physics, and biology, and across many of the AP exams.
Now, what's incredible here is that it says o1 rivals the performance of human experts. Recent frontier models do so well on math that the GSM8K and MATH benchmarks are no longer effective at differentiating models. Basically, what they're stating is that these models have essentially saturated those benchmarks, so they're no longer useful for determining how models perform.
So what they decided to do was evaluate math performance on AIME, an exam designed to challenge the brightest high school math students in America. On the 2024 AIME exams, GPT-4o only solved 12% of problems (1.8 out of 15). In comparison, o1 averaged 74% (11.1 out of 15) with a single sample per problem, 83% (12.5 out of 15) with consensus among 64 samples, and 93% when re-ranking 1,000 samples with a learned scoring function. Basically, what they're stating here is absolutely incredible.
Now, most people won't understand why this is so incredible, but I think getting 74% with a single sample is really remarkable, because what you have to understand is that this is one shot: you input a single prompt and the model outputs a single response. Of course, using a thousand different samples you're going to largely improve your score, but doing this with a single sample and getting such a dramatic result is absolutely incredible. And you can also see that this reaches 93%, which, compared to GPT-4o, is a stunning improvement. Now we can also see right here how it compares to PhDs. They also evaluated o1 on GPQA Diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics, and biology.
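The "consensus among 64 samples" figure refers to simple majority voting: sample many full solutions, extract each one's final answer, and report the most common answer. Here is a minimal sketch of that idea; the sampled answers below are made up for illustration:

```python
from collections import Counter

def consensus_answer(final_answers):
    """Majority vote over the final answers extracted from sampled solutions."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical final answers from 8 sampled solutions to one AIME problem
samples = ["113", "113", "042", "113", "808", "113", "042", "113"]
print(consensus_answer(samples))  # prints 113
```

The intuition is that independent reasoning errors tend to scatter across different wrong answers, while correct chains converge on the same one, so voting filters a lot of noise.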
In order to compare models to humans, they recruited experts with PhDs to answer GPQA Diamond questions, and they found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. Now, interestingly enough, they do state that these results do not imply that o1 is more capable than a PhD in all aspects, only that the model is more proficient in solving some problems that a PhD would be expected to solve. And you can also see that with its vision perception capabilities enabled, o1 scored 78.2% on the MMMU, making it the first model to be competitive with human experts.
Overall, what we can see here is once again incredible: this is the first model that has surpassed the performance of human experts on the GPQA benchmark, which is supposed to be remarkably difficult.
And not only that, but the vision perception capabilities are competitive with human experts, so we can expect these vision capabilities to be remarkably impressive once tested across a variety of different areas. Now we get to the coding section, and my oh my, is there a lot to cover. This is where they talk about how they did further fine-tuning on a version of o1, and that version managed to perform a lot better. You can see it says this model competed in the 2024 IOI under the same conditions as the human contestants: it had 10 hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
And then it goes on to state that with a relaxed submission constraint, they found that the model's performance improved significantly: when allowed 10,000 submissions per problem, the model achieved a score of 362.14, above the gold medal threshold, even without any test-time selection strategy. That is a remarkable statement, considering it was only a few months ago that Google demonstrated the ability to get silver at the International Mathematical Olympiad, so once again it seems OpenAI might be raising the bar even further. It goes on: "Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model's coding skill. Our evaluations closely matched the competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1, achieving an Elo rating of 1807, performing better than 93% of competitors." And a rating of 1807 actually puts this at the Candidate Master level, which is the highest rating for any AI system I've ever seen, and it makes this the current state of the art at coding, which is absolutely incredible. Now, for those of you wondering how this model actually works internally, and how they managed to get a model this smart, some of the tricks are in how the model has been trained.
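To put that Elo gap in perspective, here is the standard Elo expected-score formula (this is the general chess-rating formula, not something specific to OpenAI's evaluation): a gap of roughly a thousand points means the higher-rated player is expected to win almost every game.

```python
def elo_expected(r_a, r_b):
    """Expected score of player A (rating r_a) against player B (rating r_b)
    under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# The fine-tuned model (1807) versus GPT-4o's rating (808)
print(f"{elo_expected(1807, 808):.4f}")  # very close to 1.0
```

By construction, a 400-point gap already corresponds to about a 10-to-1 expected win ratio, so 1807 versus 808 is not a close contest at all.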
So the way this model was trained was, of course, with reinforcement learning, and it was trained to use a chain of thought when responding. Chain of thought is basically where the model reasons through a chain of intermediate steps before responding to a problem: unlike prior models, which respond immediately, it lays out the problem step by step, verifies along the way that the steps it's taking will eventually lead to a good solution, and then comes to a final answer based on those steps. Now, what we can see here is a striking example where GPT-4o is pitted against the OpenAI o1-preview, and both are tasked to decipher and decode a ciphertext using the example provided. So this is a one-shot example where you can see we've got this gibberish text, and then it's converted into the text
"Think step by step." Then it says to use the above example to decode this jumbled text, which I would have no clue how to do. And then you can see that o1 manages to get it correct: the final words are "there are three Rs in strawberry." GPT-4o, on the other hand, gives words as its answer that are just completely wrong.
And it asks for additional decoding rules for this cipher. Now, the really nice thing about this is that, although you won't be able to see it in the released model, here you can actually see the chain of thought, and this chain of thought is really, really long. If we click this button, you can see it starts with, first, what's going on here.
We are given, first, an example ("think step by step," and so on), and it says our task is to use the example above to decode this gibberish, so the first part is to figure out how the example was decoded. Now, if I scroll down here, the amount of work being done is absolutely incredible: this is a model working through many different steps, arguably even hundreds of steps, before coming to a final solution, and you can see that sometimes it manages to check its work before finally outputting the response. The final output we get is a very condensed rendition of the internal chain of thought, but I think showing it in this small demo is really powerful, because we get to see firsthand how much work is being done behind the scenes. We can also see this in the decoding section, where there is an extremely long chain of thought that we can show and hide. There's the same in the math section, where there is another long chain of thought for multi-step mathematical word problems, and the same in the crossword example.
There's also the same in science, and in the healthcare niche, which is rather fascinating because we can see how it uses step-by-step reasoning to come to a diagnosis, and I have no doubt that this is going to become remarkably effective at diagnosing individuals with remarkable accuracy. Now, continuing with coding, there are two videos that I would love to show you. All right.
So the example I'm going to show is writing code for a visualization. I sometimes teach a class on transformers, which is the technology behind models like ChatGPT, and when you give a sentence to ChatGPT, it has to understand the relationships between the words and so on.
So it's a sequence of words, and you have to model that, and transformers use what's called self-attention to do it. I always thought, okay, if I could visualize this self-attention mechanism with some interactive components, it would be really great. I just don't have the skills to do that, so let's ask our new model, o1-preview, to help me out.
So I just typed in this command; let's see how the model does. Unlike previous models like GPT-4o, it will think before outputting an answer. So it started thinking.
As it's thinking, let me show you some of these requirements. I'm giving it a bunch of requirements to think through. The first one is to use an example sentence, "the quick brown fox."
And the second one is that when hovering over a token, it should visualize edges whose thicknesses are proportional to the attention score; that just means if two words are more relevant to each other, the edge between them is thicker, and so on. One common failure in most existing models is that when you give a lot of instructions to follow, the model can miss one of them, just like humans can if you give too many at once.
So, because this reasoning model can think slowly and carefully, it can go through each requirement in depth, which reduces the chance of missing an instruction. The model output some code; let me copy and paste this into an editor, save it out as HTML, and open it up in the browser. You can see that when I hover over a token, it shows the edges to "quick," "brown," and so on, and when I hover away, they disappear, so that's rendered correctly. When I click on a token, it shows the attention scores, just as I asked for. Maybe there's a little overlap in the rendering, but other than that, it's actually much better than what I could have done. So this model did really nicely; I think this can be a really useful tool for coming up with a bunch of different visualizations for my teaching sessions. So this was a direct example of o1 performing a multi-step reasoning task that involves coding a web page with features that would prove quite difficult for current state-of-the-art systems, and it goes to show just how advanced o1-preview is. There is also this video that highlights more coding capabilities.
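The attention scores that demo visualizes come from scaled dot-product self-attention. Here is a minimal numpy sketch; it skips the learned query/key/value projections a real transformer would use and simply attends toy embeddings to themselves:

```python
import numpy as np

def self_attention_weights(X):
    """Scaled dot-product attention of a token sequence against itself.
    X has shape (tokens, dim); row i of the result gives how strongly
    token i attends to every token, and each row sums to 1."""
    d = X.shape[-1]
    logits = X @ X.T / np.sqrt(d)
    # softmax over each row, shifted for numerical stability
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up 2-d embeddings for "the quick brown fox"
tokens = ["the", "quick", "brown", "fox"]
X = np.array([[1.0, 0.1], [0.2, 1.0], [0.3, 0.9], [0.9, 0.3]])
A = self_attention_weights(X)
print(A.round(2))  # edge thickness in the demo would be proportional to these
```

In the hover visualization described above, the edge from token i to token j would be drawn with thickness proportional to `A[i, j]`.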
I want to show an example of a coding prompt that o1-preview is able to do, but previous models might struggle with. The coding prompt is to write the code for a very simple video game called Squirrel Finder. And the reason o1-preview is better at prompts like this is that when it wants to write a piece of code, it thinks before giving the final answer, so it can use the thinking process to plan out the structure of the code and make sure it fits the constraints.
So let's try pasting this in. To give a brief overview of the prompt: the game Squirrel Finder has a koala that you move using the arrow keys; strawberries spawn every second and bounce around, and you want to avoid them; after three seconds, a squirrel icon appears, and you want to find the squirrel to win. And there are a few other instructions, like putting "OpenAI" on the game screen and displaying instructions before the game starts, etc.
So first you can see that the model thought for 21 seconds before giving the final answer, and during its thinking process it was gathering details on the game's layout, mapping out the instructions, setting up the screen, etc. So here's the code it gave; I'll paste it into a window and we'll see if it works.
So you've seen there are instructions; let's try to play the game. Oh, the squirrel came very quickly, but oops, this time I was hit by a strawberry. Let's try again. You can see the strawberries appearing; let's see if I can win by finding the squirrel.
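The timing rules in that prompt can be sketched independently of any game library. To be clear, this is my paraphrase of the rules described in the demo, not OpenAI's generated code, and the function name is my own:

```python
def game_state(elapsed_seconds):
    """Squirrel Finder timing per the demo's description: one strawberry
    spawns per elapsed second, and the squirrel appears after 3 seconds."""
    strawberries = int(elapsed_seconds)        # one spawned each full second
    squirrel_visible = elapsed_seconds >= 3.0  # win condition becomes possible
    return strawberries, squirrel_visible

print(game_state(2.5))  # → (2, False)
print(game_state(3.0))  # → (3, True)
```

Keeping the spawn logic as a pure function of elapsed time like this is also what makes multi-constraint prompts easy to check requirement by requirement, which is exactly where the model's planning helps.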
Looks like I won. Now, if you're wondering about some of the other benchmarks, you can see right here that o1 completely dwarfs GPT-4o. And on the traditional benchmarks, while there aren't any ridiculous jumps in performance (and by that I mean relative to what is already state of the art), I think most people are underestimating how smart this model truly is, given that it can perform multi-step reasoning across a wide range of tasks. You can pause the video and look at these, but some of the most notable are, of course, competition math, competition code, and GPQA Diamond, which are some of the most difficult tasks for AI systems to perform. And for the standard benchmarks, these scores are all pass@1, which is remarkable considering that scores like these on MATH, the MMLU, and the MMMU were previously seemingly unattainable.
Now, what's also interesting about this model is that humans only prefer its answers for subjects that require a lot more calculation. For example, for mathematical calculation, the win rate versus GPT-4o is a lot higher, and the same goes for data analysis and computer programming. But in personal writing and editing text, the win rate versus GPT-4o doesn't exceed 50%, which means GPT-4o is most likely superior for personal writing, as rated by human voters.
Now, one of the most insane things about this model that you probably want to know is that it has a limit of 30 messages a week. Meaning that when this model is released in your chat window, depending on your region, understand that you get only 30 messages a week, which works out to about four messages a day (30 ÷ 7 ≈ 4.3) before you run out for the week. So this is quite limited in terms of how many messages you have. So if you're using this model and testing it out, remember this.
If you don't want to get rate-limited, understand that you only have 30 messages every single week. Now, there were also some scary things that I might make into a longer video, but one of the craziest things I saw was that this model's system card actually showed that the model instrumentally faked alignment during testing: it strategically manipulated task data in order to make its misaligned action look more aligned. Basically, the model was doing things it knew researchers might not want to see, and sort of covering its tracks, which is a scary sign for those in the AI safety niche, as we are starting to see more and more such capabilities emerge as these models get better.
In addition, the reason I say this model has pushed us into somewhat of a new paradigm is that with much smarter models, tricks like chain-of-thought prompting and asking the model to think in a certain way are no longer as effective as they previously were, since the internal thought process already extracts those raw capabilities. It also says here to limit additional context in retrieval-augmented generation: when providing additional context or documents, only include the most relevant information, to prevent the model from overcomplicating its response. So this is an entirely new type of system where old methods of prompt engineering are quite unlikely to work; for those of you thinking you're going to apply those previous techniques to this new model, that is quite unlikely to help. Now, if you enjoyed this video, hopefully you found it informative, and I will see you in the next one.
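That RAG guidance, passing only the most relevant retrieved chunks, can be sketched like this. The relevance scores are assumed to come from whatever retriever you already use, and the helper name is my own, not an OpenAI API:

```python
def trim_context(scored_chunks, k=3):
    """Keep only the k highest-scoring retrieved chunks before prompting,
    per the guidance to limit extra context for o1-style models."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, score in ranked[:k]]

retrieved = [("intro paragraph", 0.31), ("key definition", 0.92),
             ("tangential aside", 0.12), ("worked example", 0.77)]
print(trim_context(retrieved, k=2))  # → ['key definition', 'worked example']
```

The design choice is simply to rank by relevance and cut hard at k, rather than stuffing the prompt: the guidance above suggests the model reasons better over less, more relevant context.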