Transcript for:
Advancements and Challenges of GPT-o1 Models

It's very clever, it says, please note while diabetes was suspected, it was ruled out based on the normal HbA1c level and therefore does not constitute the diagnosis. I mean, this is amazing. Like, okay, it did a little blip on the inference bit, but it didn't fall into my traps. Because these models feel a bit more like they've also done it. Like they have learned.

Or during the training process, it was a bit more experimental in the sense that the model was able to try something, try again, try 10 times and then learn from those mistakes and the correct ways to do something. So a bit more experimental versus theoretical. When someone has bloody urine and has this characteristic kind of purplish, purpuric rash, we're immediately thinking of a rare diagnosis called vasculitis.

And yeah, so in this case, ChatGPT o1-preview thought for 13 seconds. So once again, it's clearly thinking about these things, or at least doing recursive chain-of-thought prompting. And the number one diagnosis: Henoch-Schönlein purpura. So it's a big step up from the previous model, as you can see.

Yeah, like all the suggestions are very good. Welcome back to Dev and Doc. It is 10pm on Saturday, 14th of September. Yeah, we weren't going to do a podcast episode, were we? But GPT-o1 just came out.

So here we are to share our thoughts. Exactly. So I think we have to do something.

I mean, everyone is... already talking about it. What is interesting about this version is that we don't know much about it. At least for the previous ones, we knew a bit about how they were trained; we didn't know what data was used, but we knew the algorithms that were used, or the way they were trained.

Now we are much more in the dark, so everyone is currently trying to guess what they did to make it better. Yeah, and there are some details, but obviously OpenAI now want to be the market leader, and they don't want to share their insights, so there are no longer transparent, reproducible papers published by the company. I'll be honest, on one side it makes sense. I get why they are not doing it. I assume they spent enormous effort on this, and so much money, and maybe don't want to be like, oh, here is everything we...

have learned, now you do it. So I get a bit why they are not sharing, but again, I would prefer that they share. I much prefer the approach from Meta, where they just share everything and give you almost everything. They have released three models; the new naming convention for them is o1, o1-preview and o1-mini. o1-mini is the smallest and the least smart of them all, but still better than GPT-4o and basically everything else.

The preview is somewhere in between the production-ready o1 and the mini, and of course o1 is the best one, but that one has not been released yet. So for now, at least for the general public, we have only seen o1-mini and o1-preview, which are basically accessible through OpenAI's ChatGPT interface.

This is probably their beta testing. It's probably like a chess game. They're saying, oh yeah, it's a preview, and then they'll use these outputs and how people interact with it to actually then release the real thing. There are a couple of interesting things.

So just to summarize the advances, or what we get with this model: it is a text-only model, so it's a pure large language model. We don't have vision, we don't have audio capabilities, none of that.

We are stuck with a pure text-based model, and basically the biggest feature they have added is what they are calling reasoning, or a more integrated chain of thought, built into the model itself. So now when you ask the model a question, it has a thinking phase. You ask it, oh, what's the day today, it thinks for seven seconds, and then it says, okay, it's Saturday. So that is, in short, what we have with these new models. Yeah, and just quickly on that, in a lot of our previous podcasts, even a year or more ago, we said reasoning is this capability that LLMs really lack and something they really need in order to be useful for healthcare and other high-risk domains. And just very quickly, would you mind summarizing, for the listeners who aren't familiar, what do you mean by chain of thought?

So in chain of thought, the way we have done it before this model is basically you tell the model, in your prompt, you ask it a question and then you tell it things like: think step by step, use chain-of-thought reasoning. And that simply means that when the model is responding, it responds with basically a chain of thought. If you're asking it what is seven times seven, it will say: first step, this is a mathematical query; second step, okay, the first number is seven, the second number is seven, I should multiply them; third step, 7 multiplied by 7 equals 49. So it's simply explaining all the reasoning steps it is taking instead of directly providing the answer.

So you have two options: either 7 times 7, 49, or 7 times 7 with the whole reasoning process, how it arrived at that answer. 7 times 7 is too simple, but imagine you give it a much more complicated question.

Yeah, one that requires five steps; that's where we want to see how it arrived at the answer, and not just the answer. Yeah, and we've all been there: in maths class in high school they ask us to show our workings, and sometimes, as you're showing your working for a complicated algebra or calculus question, you maybe then actually realise, oh crap, I actually can do this, and as you follow the steps it becomes easier and easier. Until now, we had to explicitly ask the model to do this. In this new model, that's happening behind the scenes, and we don't even exactly see the reasoning steps of the model.
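To make the explicit prompting approach described above concrete, here is a minimal sketch, assuming the OpenAI Python client (openai >= 1.0) with an API key in the environment; the model name and question are illustrative, not taken from the episode.

```python
# Minimal sketch of explicit chain-of-thought prompting (the "old" way
# described above): we ask the model to show its reasoning in the prompt
# itself, rather than relying on any built-in reasoning phase.
# Assumes the OpenAI Python client (openai >= 1.0) and OPENAI_API_KEY set;
# the model name is illustrative.
from openai import OpenAI

client = OpenAI()

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct answer: no instruction to reason.
direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought answer: explicitly ask for step-by-step reasoning.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": question + "\nThink step by step and show your working "
                              "before giving the final answer.",
    }],
)

print("Direct:", direct.choices[0].message.content)
print("Chain of thought:", cot.choices[0].message.content)
```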

So what happens with these models is you ask it a question. Like, for example, a patient arrived in the hospital, you have a big medical history. And you're asking like, what's the diagnosis?

And then the model will go through, it can be really hundreds of steps: going through the history, extracting the relevant diseases, understanding the risks, understanding interactions between medications, everything like that can happen. And then it basically provides you the answer plus a summary of the reasoning steps. So you don't see everything; you see a summary of the reasoning steps plus the answer, while the full process of how it arrived at the answer is, for now, hidden from us by OpenAI.

And that's interesting, right? To me, I think they're trying to hide what they're doing in the background, but on their website they try to justify this. They say they're not showing the actual chain of thought because they want to see how users interact with the model, and they want to be able to tell when the model is lying. So it's a very strange reason to give for not actually showing users what's happening.

Yeah, I think that's standard. I mean, the obvious reason is if they show us everything, we will understand how it works and how it's trained and everything else. If they only show us pieces, then we cannot really go back to how this was done.

There are a couple of theories that this is probably using something like Monte Carlo tree search, which simply means the model is exploring many, many different options. It's not just going one route; if it's a mathematical problem, maybe there are 10 ways to solve it, and it is basically trying all 10 ways, like a tree, and if it arrives at the correct solution it says, okay, this is the best way to solve it and this is the correct solution, and then it presents only that to the user. It doesn't tell you all the other things it has tried to get to the solution. Yeah, and I've read that paper as well; it's a bit like chain of thought, but with a difference: it's like a tree of chains of thought, and that was the Tree of Thoughts paper. Yes. And I think that's what's happening here, because when you pose a question, I realised that the explanation it gives differs quite a lot depending on what type of question it is.

So if it's a maths question, it will answer one way; and you can see on the internet people are giving it logic puzzles, and it will recognise, oh, this looks like a logic puzzle, therefore I'm going to follow this approach. So I think there is something that recognises what the question is and then follows a specific chain-of-thought path which they have pre-taught the model, either through reinforcement learning or some other way. So yeah, I think that is what's happening there. And what you just mentioned is basically what everyone considers the secret sauce, which is more reinforcement learning. There was a good quote by Andrej Karpathy, I think a year or maybe even more ago, where he was saying that what the current models are missing is the way humans reason to solve a problem. When you are faced with a problem, you have many, many thoughts inside your head that go into solving it: not just writing out the steps to solve it, but also all the other things that come to your mind while solving it. You don't want to just present the correct solution; you want to present all the things you thought about while solving the problem, all the things that are implicit to you, all the things that are explicit, and then have all of that presented to the model,

so that it can learn from that, so that it basically learns the way to reason. It's not just: here is the correct solution, now do that. And this is coupled a bit with reinforcement learning from human feedback.

We don't know exactly how the reinforcement learning was done. It is basically a mix of the model trying something, a human saying it's correct, or a human giving a piece of feedback and the model incorporating that, so it's more interactive than it was. Again, a lot of these are assumptions that people have; we are not a hundred percent sure, but this currently seems like the most likely thing they've done. Yeah, and using this approach they have cracked a lot of the reasoning benchmarks. They talk about MMLU and BIG-bench and things like that, which are all reasoning questions, and so far this has outperformed the existing models. And yeah, Jocos is showing the screenshots here. And this maths competition, which is very hard, it's like a maths olympiad question bank, right, which is astounding: you can see GPT-4o only had 13.4% accuracy, and then o1 had 83.3%. Looks too good to be true.

No, I agree. I think we only see these models shine on complex reasoning. So if you're asking simpler questions, or maybe more factual questions, these models will perform at a similar level to GPT-4 or basically any other of these large language models. But if you're asking something that requires...

quite a bit of reasoning, then they work much better. And there is one other thing that is important here: what we have now is the option of asking the model to think a bit more. They are showing that we can ask the model to think a bit more and then it basically gets better. You can see this on the math benchmark. Yeah, so this is where they show, on the math benchmark, the difference between the model thinking a bit and the model thinking a lot. On the x-axis they have the inference cost, and you can see that the more the model is thinking, we can even get giant jumps in accuracy: for the o1 model on the math benchmark we go from around 60 to close to 80 just by asking the model to think a bit more.

But is that actually what's happening? When they say the inference cost is higher to run the model, when they say think a bit more, are they running more iterations, more chains of thought recursively? Exactly, yeah.

They are just enabling the model to run longer. So instead of limiting the model to, let's say, exploring five paths in the tree, they tell the model, okay, now you have 10 paths in the tree, or they enable it to think longer and don't limit it to, say, 10,000 tokens; they now allow it to think for 100,000 tokens.
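To make the "explore more paths" idea concrete, here is a toy sketch of self-consistency-style voting: sample several candidate solution paths and keep the majority answer. The unreliable toy solver is an assumption for illustration; this is not OpenAI's actual, unpublished training or inference procedure.

```python
# Toy illustration of the "think longer / explore more paths" idea discussed
# above: spend more inference compute by sampling more candidate paths, then
# take a majority vote over their answers. NOT OpenAI's actual o1 method.
import random
from collections import Counter

random.seed(0)

def noisy_solver(x: int, y: int) -> int:
    """A deliberately unreliable 'reasoning path': right about 60% of the time."""
    return x * y if random.random() < 0.6 else x * y + random.choice([-2, -1, 1, 2])

def solve_with_budget(x: int, y: int, n_paths: int) -> int:
    """More inference budget = more sampled paths, then majority voting."""
    answers = [noisy_solver(x, y) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]

truth = 7 * 7
for budget in (1, 5, 25, 100):
    correct = sum(solve_with_budget(7, 7, budget) == truth for _ in range(200))
    print(f"paths={budget:>3}  accuracy={correct / 200:.2f}")
```

Accuracy climbs as the path budget grows, which is the same qualitative effect the hosts describe on the benchmark plot.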

And we didn't really have this before. Before, we didn't have the option for the model to just think a bit longer; you could write "think a bit more" in a prompt, but it did not really translate directly into performance. Yeah, it's amazing, and it shows you that these scaling laws still apply. So far everyone has talked about scaling laws for model training: the more time and compute you put into training, and the more data you give the model, the better it seems to get, and now you can see this applies even to the inference cost. So I'd be very happy if I were Nvidia right now. I think every single company is going to be pouring resources into compute now; it's just going to be a war to get the most GPUs. Most of the current models felt like they have never done anything.

Like in the sense that this model has learned a lot just by reading everything ever, but it has never done a task. It has never done the task itself. Very similar to when you are watching a tutorial: you watch an hour-long tutorial and you're like, oh man, this is easy. Like I get everything.

And then you go and try to do it yourself and you're completely lost. You have no idea what you need to do. And by doing it, you learn three times more. Like just by the exercise of you trying now to, I don't know if it's a... coding tutorial or something, just you now trying to implement something and do it, you learned three times more.

And I feel like these, we are now going a bit into that area, because these models feel a bit more like they've also done it. Like they have learned, or during the training process it was a bit more experimental in the sense that the model was able to try something, try again, try 10 times and then learn from those mistakes and the correct ways to do something. So a bit more experimental versus theoretical maybe.

Yeah, yeah, no, I like that analogy. But we still have quite a few things that the models are not good at. Yes.

So there is one very famous challenge. It is called the ARC challenge, and let me just show you how that looks. It turns out that for humans it is fairly easy. I don't think it's as easy as they are saying.

They're saying like 99% of people can solve this. I'm not too sure about that. But...

Oh, let's see. So how it works is that you get a couple of examples. In fact, you get four examples of a task. Not sorry, not four examples, two examples of a task. It is the input and the output and what you need to do is figure out what transformation was applied to the input to get the output and then do that in a test case.

So for our listeners, this is like a 7x7 grid, and there are Tetris-like blocks inside, and I guess you're trying to see how you can move the two or three blocks, transpose and overlap them on each other, to make the final output. It doesn't have to be just transposing or moving them; it can also be adding something to them. Okay.

So you can draw on the grid as well, basically, add blocks. Exactly. So like in this example, we have like...

a couple of L-shaped, or short L-shaped, blocks, and basically the goal is just to add the one missing dark blue block to make it a square, and the added block needs to be a slightly different colour. Yeah, exactly. So for me this is a very simple example, and there are very complex examples of these, but looking at this it is immediately clear what needs to be done.

This would be extra hard for an LLM, right? Because it's image based? Oh, there is a very simple way to show these via matrices.

And the models should be extremely good at matrix stuff. So there is a very simple representation. Instead of showing a 7x7 grid, you basically do a 7x7 matrix.

Each colour is represented by a number, 1 to 9 or so, I forget exactly. And that's basically it: a blank is represented by a zero, and then you can easily represent this as a matrix.

And then you ask the model: what do I need to do, what is the algorithm to solve this problem? There are a couple more complications to this.
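A minimal sketch of the matrix encoding just described: the grid becomes a 2-D array of small integers that can be pasted into a plain-text prompt. The grids and colour codes below are invented for illustration; they are not taken from the real ARC dataset.

```python
# ARC-style grids as matrices: 0 = blank, 1-9 = colours. The example grids
# here are made up for illustration only.
example_input = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 0],
]
example_output = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [0, 2, 1, 0],  # the missing corner filled in with a different colour (1)
    [0, 0, 0, 0],
]
test_input = [
    [3, 3, 0],
    [3, 0, 0],
    [0, 0, 0],
]

def grid_to_text(grid: list[list[int]]) -> str:
    """Render a grid as space-separated rows of digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

prompt = (
    "Each grid is a matrix of integers (0 = blank, 1-9 = colours).\n"
    "Example input:\n" + grid_to_text(example_input) + "\n"
    "Example output:\n" + grid_to_text(example_output) + "\n"
    "Test input:\n" + grid_to_text(test_input) + "\n"
    "Describe the transformation, then apply it to the test input."
)
print(prompt)
```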

But in essence, it is something like this. And this is a famous challenge because I don't think anyone has been able to go above 50 percent. And how are the humans doing, are even humans struggling? No, no, humans are good here; humans can get close to basically 100 percent. Wait, what are those lines going above humans? Oh, okay, so this is a graph of different benchmarks. Exactly, yeah, a lot of benchmarks have been cracked. So the ones that go above human performance... wait, sorry, that's the score relative to humans. Okay, so some AI models outperform humans, usually on question-answering datasets like SQuAD 2.0. Okay, but for ARC the AI has never come close to human performance. Yes, for now I think we are around 50% below human performance, and even the o1-mini or o1-preview model is not coming close, and we can in fact show the results. So this is the performance of all the different models. There are some models that are specifically made to solve this task; one of them is by MindsAI. I'm not going to go into the details of that model, we can do that in another episode, because I think it's a very interesting approach. But what's interesting to see here is that o1-preview and o1-mini are basically at around 21%, which is in the same rank as Claude 3.5 Sonnet. Sonnet is so good, man. Yeah, Sonnet is extremely good at this task.

It could also be an advantage for Sonnet because it knows more about images and spatial reasoning, which these models don't have, so I don't think it's a fair direct comparison. But it does show that there is still a bit left to do, and maybe these models are not yet suitable for all kinds of tasks; it's still not a fully general model. I'm also sure this will improve a bit over the next couple of weeks, just because this was really the first test that was ever done, and maybe the authors didn't try too hard to solve the problem, write nicer prompts, or do few-shot prompting and things like that. So I think there is some space for improvement that will happen over the next couple of weeks, but out of the box we are still not there. In this part, we basically fed GPT-o1 different healthcare questions which we know are pretty darn hard, and which we know other models have struggled with in the past, and we want to see how it fares.

So the first one is my favourite acid test that I like to give large language models. If you've been watching the podcast, you'll know it's a scenario that ChatGPT in its first year really struggled with, and actually a lot of models struggled with. It's a very simple question. It says: a 35-year-old lady with abdominal distension, amenorrhea, so loss of periods, and nausea. What is the diagnosis?

And it's slightly tricky, because you're tempted to go straight into the medical textbooks and reach for an esoteric diagnosis. But as you know, in a young lady with a big belly, a stopped period, and nausea, morning sickness, the most common diagnosis is actually pregnancy.

But initially, on Google or in the early iterations of ChatGPT, it would always go for the cancer option as the most likely diagnosis, so an ovarian tumour. So yeah, let's see what ChatGPT o1 says. So immediately, most likely diagnosis: pregnancy.

And the explanations? Oh my god, it even actually said nausea, morning sickness. Okay, so it's kind of cottoned on to our tricks and, yeah, it aced this question. Yeah, I remember this question from the initial models that we were testing, it was really ChatGPT or GPT-4, and it was not that long ago that most models were failing on this, basically springtime this year. Most models, or most commercial models, were not able to solve this question.

Yeah, and then they all had an update and somehow now they're starting to answer it correctly. But yeah, I don't know what happened in the background, but shall we look at another question? Yeah. Okay, so this is one where I was trying to be...

a little bit tricky. So as you know from our clinical coding episode, or medical billing if you're American, when patients come in you need to essentially dissect what happened in that admission, and the hospital gets paid and remunerated depending on what gets coded. So there are reams of people hired by the National Health Service to sit there, go through these notes and pick out the key things that happened.

And usually you map them to a classification code like ICD-10 or SNOMED. So I was very tricky here. I said, you know, you've seen a 40-year-old lady in the neurology clinic, and she's come in in view of unilateral motor and sensory loss in the right upper limb.

And I've used tricky abbreviations as well. I said, so PMH, past medical history: HTN, CCF, AF, so hypertension, congestive cardiac failure, atrial fibrillation. And I was extra sneaky here.

I said there's a family history of myocardial infarction, so the patient doesn't have it, it is the family history. And then I wanted to be even trickier: I said her GP tested her HbA1c on suspicion of diabetes, but it was within normal limits. And then I also said, look, I would not write this in a real note, but I was being super tricky, I said the GP also suspected radiculopathy, right, because of the limb motor and sensory loss; however, there is a clear left hemispheric stroke on the imaging which would explain her symptoms. Okay, so the model thought for 13 seconds, which, okay, we're making it think. And looking at this, based on the information provided, it has coded a left hemispheric cerebral infarct causing right upper limb unilateral motor weakness, ICD-10 code I63.4, which is cerebral infarction due to embolism of cerebral arteries.

So this is an interesting one, because technically it's correct, the patient has had the stroke, but at no point here did I actually state what the stroke was due to. For those of you who are interested in neurology and strokes: strokes can be due to hypertension, quite commonly, but also to atrial fibrillation, an irregular heartbeat. So here, the ICD-10 code, the stroke bit, is correct, but it has actually inferred the aetiology as atrial fibrillation causing an embolic stroke.

So it's kind of not fully right, and as a clinical coder you shouldn't infer from the notes. So I'll give it half a point: the diagnosis is correct, but it has actually inferred a bit more. It's coded hypertension, heart failure, atrial fibrillation, and, very clever here, it's coded for family history of ischaemic heart disease, so not for the patient but for the family. And lower down it also mentions the diabetes bit, doesn't it? Yeah.

So, yeah, it's very clever. It says, please note while diabetes was suspected, it was ruled out based on the normal HbA1c level and therefore does not constitute the diagnosis. I mean, this is amazing.

Like, okay, it did a little blip on the inference bit, but it didn't fall into my traps. Very nice. We also have one example that is from a paper we have published basically at the beginning of this year for a model called Foresight that can be used for differential diagnosis, prediction of the future and all that.

And at that time we had a set of questions that we were asking ChatGPT or GPT-4 or similar models, and in fact all of them failed on quite a few of them, and this here is one. Yeah, so these are medical scenarios created by myself alongside five or six other doctors, and usually we have unanimous agreement as to what the diagnoses should be in the top five and what big diagnoses were missed.

So, ChatGPT, GPT-4 for this scenario: a 21-year-old presents to the clinic with haematuria and a purpuric rash. He was diagnosed with Crohn's disease when he was 19, and he suffered from a bout of bloody diarrhoea at the age of 20. What's the differential here? And in the past, all the models would miss a vasculitis called Henoch-Schönlein purpura. It's kind of rare, but it's also something that...

As medical students and as doctors, we have it in our heads: when someone has bloody urine and this characteristic kind of purplish, purpuric rash, we're immediately thinking of a rare diagnosis called vasculitis. And yeah, so in this case, ChatGPT o1-preview thought for 13 seconds, so once again it's clearly thinking about these things, or at least doing recursive chain-of-thought prompting. And the number one diagnosis: Henoch-Schönlein purpura. So it's a big step up from the previous models, as you can see.

Yeah, like all the suggestions are very good. There is one other final question we can do just to test things out. And this one is in fact interesting because between two runs, basically, prompting the model with the same question in just two different tabs.

It gave two different answers. Yeah, so, as doctors we often have to convert opioid doses for patients with chronic pain, etc. So the scenario is as follows: a patient is on Oramorph, 10 milligrams twice a day, and a Butec patch, 5 micrograms per hour.

I wish to convert her to just oral oxycodone. What is the equivalent dose? You know, even for doctors, this is a little bit tricky to actually convert these things. But we know LLMs are not great at maths.

So I thought this would be an interesting one to see how the model responds. It thought for 10 seconds, and we can see here. So I've actually checked, I double-checked this with a palliative care consultant, because I know these conversions can be tricky. So the answer to this question is basically: Oramorph, that is oral morphine, 10 mg twice a day is equivalent to 20 mg of morphine a day.

And then the Butec patch, which contains buprenorphine, which is a transdermal opioid that's more potent. So it's well-known knowledge that one of these 5mg per hour patches is equivalent to a 12mg Oramorfe conversion. And this is... This is what the palliative care consultants do.

This is what's said in the British National Formulary. So the interesting thing here is that somewhere in its knowledge, GPT-o1 has gathered that one milligram of transdermal buprenorphine is 75 times stronger than oral morphine.

But actually the literature fluctuates a bit on this; it's not clear whether it's definitely 75 or somewhere between 50 and 100. So if you follow its chain of working, it's correct, but it has pulled this 75 from somewhat dubious data, whereas the convention and the formulary guidance for prescribing would be 12 milligrams per patch. So it has actually calculated this slightly wrong and slightly underdosed the patient, which is okay, better to underdose than to overdose.

You know, the interesting thing is, like Shaku said, we have done this on multiple runs, and on the first one it didn't do this little calculation bit; it simply said one of these patches is equal to 12 milligrams of morphine, and it got it right that time. Do you want to go through the rest? So after that it follows the working and does the conversion.

Typically, yeah, you divide by 1.5, I think that's correct. In the literature you divide by 1.5 to 2, depending on how safe you want to be, because there's this kind of cross-reactivity, so sometimes you go on the safer side and give a bit less. But nonetheless, it has calculated around 19 milligrams of oxycodone per day.
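The arithmetic behind the two routes can be laid out explicitly. This is a sketch of the calculation as described in the episode, for illustration only and not prescribing guidance; the 75x potency figure and the 12 mg-per-patch convention are the numbers the hosts mention above.

```python
# Worked arithmetic behind the two conversion routes discussed above.
# Illustrative only - not prescribing guidance; real practice should follow
# local formulary advice and palliative-care review.

oramorph_mg_per_day = 10 * 2          # 10 mg twice a day = 20 mg oral morphine

# Route A: the potency-factor route the model appears to have taken (75x).
patch_mcg_per_hr = 5
buprenorphine_mg_per_day = patch_mcg_per_hr * 24 / 1000   # 0.12 mg/day
morphine_equiv_a = oramorph_mg_per_day + buprenorphine_mg_per_day * 75  # ~29 mg

# Route B: the formulary convention (one 5 mcg/hr patch ~ 12 mg oral morphine/day).
morphine_equiv_b = oramorph_mg_per_day + 12                              # 32 mg

# Oral morphine -> oral oxycodone: divide by roughly 1.5 (up to 2 if cautious).
print(f"Route A: {morphine_equiv_a / 1.5:.1f} mg oxycodone/day")  # ~19.3 mg
print(f"Route B: {morphine_equiv_b / 1.5:.1f} mg oxycodone/day")  # ~21.3 mg
```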

And therefore we'd recommend the nearest practical dose, which is very nice of it to suggest, as 10 milligrams BD. So I think overall that's pretty darn good. I would say, maybe because I'm just cautious as a clinician, I wouldn't round up, I'd always round down. So here it says, oh, 19 milligrams per day, let's go up to 20; I'd probably actually start on 7.5 milligrams BD and then go up from there. But either way, I'm impressed, though there was a small error there which would slightly concern me; I wouldn't let it run wild yet, still, despite all this apparent improvement in reasoning. Yeah, I think what was interesting is that between different runs you get different answers, which was maybe a bit more pronounced in the older models; now it happens a bit less, but it still happens.

Again, I think this is mostly fixable, and it can be done by having a database of documents where you can do retrieval augmented generation, or where you can pull the facts from, so that it's not pulling this knowledge out of its internal memory, which might be imperfect. In other words, these conversion rates should really be pulled from a properly maintained medical knowledge base and not from the internet. I think overall it's impressive, and actually we tested a lot more diagnosis scenarios, and it was very hard to crack it, actually.
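A minimal sketch of the retrieval idea just mentioned, assuming a tiny hand-rolled lookup table in place of a real, clinically governed knowledge base: the conversion facts come from the curated source rather than from the model's memory.

```python
# Minimal sketch of the retrieval idea above: look conversion factors up in a
# curated knowledge base and use them directly (or hand them to the model),
# instead of relying on whatever the model recalls from training data.
# The tiny dictionary and figures below are placeholders for a proper,
# clinically governed source such as a local formulary.
CONVERSION_KB = {
    ("buprenorphine_patch_5mcg_hr", "oral_morphine_mg_per_day"): 12,
    ("oral_morphine", "oral_oxycodone_divisor"): 1.5,
}

def retrieve(source: str, target: str) -> float:
    """Pull a conversion fact from the curated knowledge base, or fail loudly."""
    try:
        return CONVERSION_KB[(source, target)]
    except KeyError:
        raise LookupError(f"No governed conversion for {source} -> {target}")

patch_as_morphine = retrieve("buprenorphine_patch_5mcg_hr", "oral_morphine_mg_per_day")
divisor = retrieve("oral_morphine", "oral_oxycodone_divisor")
total_morphine = 20 + patch_as_morphine            # 20 mg from Oramorph 10 mg BD
print(f"Oxycodone ~ {total_morphine / divisor:.1f} mg/day")   # grounded in the KB
```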

You know, we say you should never use it for diagnosis, LLMs are untrustworthy and overconfident, but bloody hell, it's getting the correct diagnoses very, very frequently, scarily so actually. Yeah, it's just hard to find something where it's completely wrong, like it made a giant mistake you should never make. Usually there is a bit of something wrong, a small piece missing, but not completely, like we have seen in the clinical coding example: it's a bit wrong, but it's still correct, in fact. Yeah, but I guess the important nuance here is that it actually picked a code that isn't billable. I mean, you can definitely improve on this, but this is what hospitals care about, right? If you want to build a business case, you can't have an LLM coding things that aren't billable, you can't use the wrong code, because that means the hospital will miss out on potentially millions of pounds per year just because of that slight difference, but it's important. No, I agree, I agree, but again, we are working with models that are not fine-tuned for this at all, and I think with a bit of guidance or a bit of fine-tuning for these use cases, or even in-context prompting or few-shot prompting or something like that, it would get much better. Yeah. I mean, to be continued, right? There are many startups trying to build on this, and we'll see who's first to the game and who actually does it well.

Okay, awesome. I think we have covered quite a bit today. We'll do more tests on these models as time goes on, do a bit more extensive and proper testing, and we'll see where we end up.

Thank you very much for listening and we'll see you next time. Yeah. Like and subscribe and see you soon.

Bye-bye. Bye.