The newest model from OpenAI is here, and, in a possible coincidence, the world's IT infrastructure is now down. But seriously, I'm just glad your connection still works as you join me to investigate the brand new GPT-4o mini, which is quite a mouthful, but is claimed to have superior intelligence for its size. Because millions of free users might soon be using it, I've been scrutinizing the model relentlessly since last night, and I'll explain why OpenAI might need to be a bit more honest about the trade-offs involved, and where they might head next. So here is the claim from Sam Altman, the CEO of OpenAI, that we're heading towards intelligence too cheap to meter. He justifies this claim with the lower cost for those who pay per token, and an increased score for a model of its size on the MMLU benchmark. Now, there is no doubt that models are getting cheaper for those who pay per token. Here is GPT-4o mini compared to Google's Gemini 1.5 Flash, a comparable size, and Anthropic's Claude 3 Haiku: at least on the MMLU benchmark, it scores higher while being cheaper, and there's no doubt that I, and OpenAI, could dazzle you with plenty more charts. Notice in particular the massive discrepancy in the MATH benchmark: GPT-4o mini scores 70.2% on that benchmark, compared to scores in the low 40s for the comparable models. Just quickly, for any of you watching who wonder why we need these smaller models: it's because sometimes you need quicker, cheaper models to do a task that doesn't require frontier capabilities. Anyway, I'm here to say that the picture is slightly more complicated than it first appears, and of course I, and potentially you, are slightly more interested in what GPT-4o mini tells us about the general state of progress in artificial intelligence. Just quickly, though, the name: it's a little bit butchered, isn't it? I mean, the O was supposed to stand for Omni, meaning all modalities, but the GPT-4o mini that's now rolled out just supports text and vision, not video, not audio. And yes, we still don't have
a confirmed date for the GPT-4o audio capabilities that we all saw a few months ago. Plus, let's forgive those new to AI who look at this model name and think it's GPT-40 mini; I kind of feel sorry for those guys, because they're thinking: where have I been for the last 39 versions? Anyway, audio inputs and outputs are apparently coming in the, quote, future; they don't put dates these days. But there is some positive news: it supports up to 16,000 output tokens per request. Think of that as being around 12,000 words, which is pretty impressive. It has knowledge up to October of last year, which suggests to me that it is a checkpoint of the GPT-4o model; think of that like an early save during your progress through a video game. Indeed, one OpenAI researcher hinted heavily that a much larger version of GPT-4o mini, bigger even than GPT-4o, is out there. Just after the release of GPT-4o mini, roon said: people get mad at any model release that's not immediately AGI or a frontier capabilities improvement, but think for a second: why was this GPT-4o mini made? How did this research artifact come to be? What is it on the path to? And, again hinting at a much better model being out there, he retweeted this: oh, you made a much smaller, cheaper model just as good, quotes, as the top model from a few months ago? Hmm, wonder what you're doing with those algorithmic improvements. So even for those of you who don't care about small, quick or cheap models, OpenAI are at least claiming they know how to produce superior textual intelligence. But let's just say things get a lot more ungrounded from here on out. First, they describe the MMLU as a textual intelligence and reasoning benchmark. Well, let's just say, for those of you new to the channel, it's much more of a flawed, memorization-based multiple-choice challenge. But at this point I know I might be losing a lot of you, who think: well, that's just one benchmark; the numbers across the board are going up; what's the problem? Well, I'm going to give you several examples to show you
why benchmarks aren't all that matters. It's not only that there are sometimes mistakes in these benchmarks; it's that prioritizing and optimizing for benchmark performance that you can announce in a blog post often comes to the detriment of performance in other areas, like, for example, common sense. Take this question, which sounds a little bit like a common math challenge: chicken nuggets come in small, medium or large boxes of five, six or eight nuggets respectively. Philip wants 40 nuggets and can only buy one size of box, so list all the sizes of box he cannot currently buy. So far, so good, but wait: assuming he has no access to any form of payment and is in a coma, which sizes do you think he can't buy, given all of these conditions and the fact that he has no access to any form of payment and is in a coma? If you train a model relentlessly on math challenges, it's almost like a hammer seeing a nail everywhere: it will definitely get better at hammering, or solving known math challenges, but sometimes with some trade-offs. The model at no point acknowledges the lack of access to payment or the coma, and focuses on simple division. And remember those other models that perform worse in the benchmarks and are slightly more expensive? Like Gemini 1.5 Flash from Google: its answer is a lot simpler, directly addressing the obvious elephant in the room. And likewise, Claude 3 Haiku from Anthropic starts off thinking it's a math challenge, but quickly acknowledges the lack of payment and him being in a coma. The point I'm trying to make is that you can make your numbers on a chart, like MATH, go up, but that doesn't always mean your model is universally better. I think OpenAI need to be more honest about the flaws in the benchmarks and what benchmarks cannot capture, particularly as these models are used more and more in the real world, as we shall soon see. So after almost 18 months of promises from OpenAI when it comes to smarter models, what's the update when it comes to reasoning prowess? Well, as
is par for the course, we can only rely on leaks, hints and promises. Bloomberg described an all-hands meeting last Tuesday at OpenAI in which a new reasoning system was demoed, as well as a new classification system in terms of reasoning. Company leadership, they say, gave a demo of a research project involving its GPT-4 AI model that OpenAI thinks shows some new skills that rise to human-like reasoning, according to, presumably, a person at OpenAI. I'll give you more info about this meeting from Reuters, but first, what's that classification system they mentioned? Here is the chart, and elsewhere in the article OpenAI say that they are currently on level one and are on the cusp of level two. That, to me, is the clearest admission that current models aren't reasoning engines, as Sam Altman once described them, or yet reasoners, although, again, they promise they're on the cusp of reasoning. And here is the report from Reuters, which may or may not be about the same demo. They describe a Strawberry project, which was formerly known as Q*, and is seen inside the company as a breakthrough. Now, this is not the video to get into Q*, and I did a separate video on that, but they did give a bit more detail: the reasoning breakthrough is proven by the fact that the model scored over 90% on the MATH dataset. That's the same benchmark on which GPT-4o mini got 70%, which we saw earlier. Well, if that's their proof of human-like reasoning, color me skeptical. By the way, if you want dozens more examples of the flaws of these kinds of benchmarks, and just how hard it is to pin down whether a model can do a task, check out one of my videos on AI Insiders on Patreon. I've actually just released my 30th video on the platform, with this video on emergent behaviors. I'm biased, of course, but I think it really does nail down this debate over whether models actually display emergent behaviors. Some people clearly think they do, though, with Stanford Professor Noah Goodman telling Reuters: I think it's both exciting and
terrifying, describing his speculations about synthetic training data, Q* and reasoning improvements: if things keep going in that direction, we have some serious things to think about as humans. The challenge, of course, at its heart, is that these models rely for their sources of truth on human text and human images. Their goal, if they have any, is to model and predict that text, not the real world. They're not trained on, or in, the real world, but only on descriptions of it. They might have textual intelligence and be able to model and predict words, but that's very different from social or spatial intelligence. As I've described before on the channel, people are working frantically to bring real-world, embodied intelligence into models. A startup launched by Fei-Fei Li just 4 months ago is now worth $1 billion; its goal is to train a machine capable of understanding the complex physical world and the interrelation of objects within it. At the same time, Google DeepMind is working frantically to do the same thing: how can we give large language models more physical intelligence? While text is their ground truth, they will always be limited. Humans can lie in text, audio and images, but the real world doesn't lie: reality is reality. Of course, we would always need immense real-world data to conduct novel experiments, test new theories, iterate and invent new physics, or, less ambitiously, just have useful robot assistants. Just the other day, Google DeepMind released the results of putting Gemini 1.5 Pro inside this robot, and the attached paper also contains some fascinating nuggets. To boil it down, though, for this video: Gemini 1.5 Pro is incapable of navigating the robot zero-shot without a topological graph. Apparently, Gemini almost always outputs the move-forward waypoint, regardless of the current camera observation. As we've discussed, the models need to be grounded in some way, in this case with classical policies. And there is, of course, the amusing matter of lag: apparently, the inference time of Gemini 1.5 Pro
was around 10 to 30 seconds in video mode, resulting in users awkwardly waiting for the robot to respond. It might almost have been quite funny, with them asking "where's the toilet?" and the robot just standing there, staring, for 30 seconds before answering. And I don't know about you, but I can't wait to actually speak to my robot assistant and have it understand my British accent. I'm particularly proud to have this video sponsored by AssemblyAI, whose Universal-1 speech-to-text recognition model is the one that I rely on. Indeed, as I've said before on the channel, I actually reached out to them, such was the performance discrepancy. In short, it recognizes my GPTs from my RTXs, which definitely helps when making transcriptions. The link will be in the description to check them out, and I've actually had members of my Patreon thank me for alerting them to the existence of AssemblyAI Universal-1. But perhaps I can best illustrate the deficiencies in spatial intelligence of current models with an example from a new benchmark that I'm hoping to release soon. It's designed to clearly illustrate the difference between modeling language and the real world; it tests mathematics, spatial intelligence, social intelligence, coding and much more. What's even better is that the people I send these questions to typically crush the benchmark, but language models universally fail: not every question, but almost every question. Indeed, in this question, just for extra emphasis, I said at the start: this is a trick question that's not actually about vegetables or fruit. I gave this question, by the way, to Gemini 1.5 Flash from Google, and a modified version of this question also tricks, by the way, Gemini 1.5 Pro. You can, of course, let me know in the comments what you would pick. Alone in the room, I asked, one-armed Philip carefully balances a tomato, a potato and a cabbage on top of a plate. Philip meticulously inspects the three items before turning the silver plate completely upside down several times, shaking
the plate vigorously, spending a few minutes each time to inspect for any roots on the other side of the silver, non-stick plate, and finally, after all of this, counts only the vegetables that remain balanced on top of the plate. How many vegetables does Philip likely then count: 3, 2, 1 or zero? Now, if you're like me, you might be a little amused that the model didn't pick the answer zero. That's what I would pick, and why do I pick zero? Because, visualizing this situation in my mind, clearly all three objects would fall off the plate. In fact, I couldn't have made it more obvious that they would fall off: the plate is turned upside down, he's got one arm so no means of balancing, it's a non-stick plate, and he does it repeatedly, for a few minutes each time. And even for those people who might think there might occasionally be a one-in-a-billion instance of stickiness, I said: how many vegetables does Philip likely then count? So why does a model like Gemini 1.5 Flash still get this wrong? It's because, as I discussed in my video on the ARC-AGI challenge from François Chollet, models are retrieving certain programs. They're a bit like a search engine for text-based programs to apply to your prompt, and the model has picked up on the items I deliberately used in the second sentence: tomato, potato and cabbage. It has been trained on hundreds or thousands of examples discussing how, for example, a tomato is a fruit, not a vegetable. So its, quote, textual intelligence is prompting it to retrieve that program, to give an output that discusses a tomato being a fruit, not a vegetable, and once it selects that program, almost nothing will shake it free from that decision. Now, as I say that, I remember that I'm actually recalling an interaction I had with Claude 3 Haiku, which I'll show you in a moment. What confused Gemini 1.5 Flash in this instance was the shape of the vegetables and fruit: retrieving the program that it's tomatoes that are the most round and smooth, it's sticking to that program, saying it's
the tomato that will fall off. Notice how it says that potatoes and cabbages are likely to stay balanced, but then says only one vegetable will remain on the plate: it's completely confused. But so is Claude 3 Haiku, which I was referring to earlier: it fixates on tomatoes and potatoes, which are, quote, fruits, not vegetables, because it is essentially retrieving relevant text. I will, at this point, at long last, give credit to GPT-4o mini, which actually gets this question correct. I can envisage, though, models in the future actually creating simulations of the question at hand, running those simulations, and giving you a far more grounded answer: simulations which could be based on billions of hours of real-world data. So do try to bear this video in mind when you hear claims like this from the mainstream media; benchmark performance does not always directly translate to real-world applicability. I'll show you a quick medical example after this 30-second clip. "What we did was we fed 50 questions from the USMLE Step 3 medical licensing exam. It's the final step before getting your medical license. So we fed 50 questions from this exam to the top five large language models. We were expecting more separation, and, quite frankly, I wasn't expecting the models to do as well as they did. The reason why we wanted to do this was a lot of consumers and physicians are using these large language models to answer medical questions, and there really wasn't good evidence out there on which ones were better. It didn't just give you the answer, but explained why it chose a particular answer, and then why it didn't choose other answers, so it was very descriptive and gave you a lot of good information." Now, as long as the language is in the exact format the model is expecting, things will go smoothly. This is a sample question from that exact same medical test. I'm giving it to GPT-4o, and I have made just a couple of slight amendments. The question at the end of all of these details was: which of the
following is the most appropriate initial statement by the physician? Now, you don't need to read this example, but I'll show you the two amendments I made. First, I added to the sentence "physical examination shows no other abnormalities" an open gunshot wound to the head as the exception. Next, I tweaked the correct answer, which was A, adding in the pejorative "wench". GPT-4o completely ignores the open gunshot wound to the head and still picks A. It does, however, note that the use of "wench" is inappropriate, but still picks that answer as the most appropriate answer. Oh, and I also changed answer E to "we have a serious matter to attend to before conception"; that, to me, would be the new correct answer in the light of the gunshot wound. Now, I could just say that the model has been trained on this question, and so is somewhat contaminated, hence explaining the 98% score. Obviously, it's more complex than that, and the model will still be immensely useful for many patients. This example is more to illustrate the point that the real world is immensely messy. For as long as models are still trained on text, they can be fooled in text; they can make mistakes, hallucinate, confabulate, in text. Grounding with real-world data will mitigate that significantly, and at that point, of course, it would no longer be appropriate to call them just language models. I've got so much more to say on that point, but that's perhaps for another video, because one more use case, of course, that OpenAI gave was for customer support, so I can't resist one more cheeky example. I said to GPT-4o mini: based on today's events, roleplay as a customer service agent for Microsoft. Definitely a tough day to be such an agent for Microsoft. Agent: hi, how can I help? User: hey, just had a quick technical problem. I turned on my PC and got the blue screen of death with no error code. I resolved this quickly and completely, and I've had the PC for 3 months with no malware. I then removed peripherals, froze the PC in liquid nitrogen for decades, and double-checked the power supply, so why is it now not loading the home screen? Is it a new bug? Reply with the most likely underlying causes in order of likelihood. Hmm, I wonder if it's anything to do with freezing the PC in liquid nitrogen for decades? Well, not according to this customer service agent, which doesn't even list that in the top five reasons. Of course, I could go on. These quirks aren't just limited to language, but also vision. This paper from a few days ago describes vision language models as blind: at worst, they're like an intelligent person that is blind, making educated guesses. On page eight, they give this vivid demonstration, asking how many intersections you can see for these two lines. They gave it to four vision models: GPT-4o, apparently the best, Gemini 1.5, Sonnet 3 and Sonnet 3.5. You can count the intersections yourself if you like, but suffice to say, the models perform terribly. Now, to end positively, I will say that models are getting better, even before they're grounded in real-world data. Claude 3.5 Sonnet from Anthropic was particularly hard to fool; I had to make these adversarial language questions far more subtle to fool Claude 3.5 Sonnet, and we haven't even got Claude 3.5 Opus, the biggest model. In fact, my go-to model is now unambiguously Claude 3.5 Sonnet. So, to end, I really do hope you weren't too inconvenienced by that massive IT outage, and I hope you enjoyed the video. Have an absolutely wonderful day.