Transcript for:
Lecture: Challenges and Real-World Applications of Machine Learning and Large Language Models

[Music] All right, so let's give a short round of applause to welcome Doug and Nico and let them take it away. [Applause]

Awesome, thanks so much. Just a quick sanity check: people can hear in the back? It's okay, not too loud, not too quiet? Fantastic. All right, thank you so much. A bit of background on us. As was mentioned, my name is Nico. I'm on the founding team of Comet. I started as a research scientist maybe five, six years ago, got into ML working on weather forecasting models, started a company, and then started building Comet. And then, Doug, do you want to give a quick intro?

Sure. I guess if you're watching us on video, go watch the previous talk from Dr. Doug Eck, which was about generative media. It's funny: my name is Dr. Doug Blank, and Doug and I were both, as he mentioned, in grad school together at Indiana University Bloomington in the '90s. There weren't that many of us who were committed to doing neural networks; Doug was one and I was another. Doug was focused on sequential problems and I was focused on making analogies, so it's funny to see where we are now. I didn't know that Doug was going to be talking earlier, so it's very funny to see how our lives have been very similar, and we'll talk a little bit about that as we go. So yeah, I'll hand this over to Nico. We'll talk about some case studies from the perspective of our company, Comet (comet.com), and then I'll talk a little bit about large language models, both fails and how you might deal with those in production at a company.

Awesome, thanks Doug. So just a bit of background on Comet. I think some of you may have been exposed to us this week in the notebooks associated with this intro course. We build MLOps software to hopefully make your lives a bit easier when building models. All of the founding team, Doug, myself, and the other co-founders, all
came to this basically building for the pain points that we wanted to have solved; the other co-founders, including Gideon, who was at Google at the time, built it for you. So I'm hoping that over the course of this week, as you maybe trained your first neural networks, that was a bit easier. We support some of the best research teams and best universities in the world as well: Mila, which Doug Eck mentioned, MIT, and many, many more.

What we want to focus on today (Doug and I, my Doug, not Doug Eck) as we prepared this talk: you've just spent a week diving really, really deep into machine learning and artificial intelligence. You've learned the basics, you understand how these systems work, you understand how to build them and how to code them. We're not going to give you any more information on those topics than you've already received. What we spend all of our time doing, and what we think might be interesting for all of you now, is working with businesses, companies, enterprises who are trying to take these models and do something useful with them: make money, save money, make things more efficient in the real world. And as beautiful and elegant as all of this math and code looks, in the real world things can get a little messy, for reasons that you might not expect. So we're going to do a couple of case studies, some interesting case studies from some of our customers: models they've built, in most cases frankly very, very simple models, where nevertheless things pop up that you wouldn't expect. I'm going to see if this first week of learning about deep, deep, deep learning will provide any of you ideas as to how to diagnose these issues. And then, as Doug mentioned, we'll talk about large language models, again with a specific focus on some of the interesting things that emerge around large language models when you try to take them into
production and do things with them. So that's our plan.

Just a quick primer. These are very, very simple; this is palm-to-forehead material for anyone who's built these things. But as we go through these case studies, think about whether it might be a data problem, a model problem, or maybe a system problem, and we'll go from there.

Awesome. Okay, so our first example is a customer of ours who we can't name, but you can probably pick one of the two or three companies you expect this to be, and it's one of them. A team building a user verification model. Basic problem: given a photo uploaded by any user, detect whether the photo matches the actual user's profile picture. The marketed purpose of this model is to minimize fraudulent accounts. Between you and me, there are a couple of other types of photos that commonly appear on dating apps that this was also being used to detect and minimize, but we won't talk about those.

Think about this from a system perspective; it's actually quite simple. What's our dataset here? Images from the app, and these are manually labeled: they have a huge team of manual labelers going through and labeling true photos from user accounts and false photos from known fraudulent accounts. A very, very simple neural classifier. Send it into production and you have a model running inference in real time in the application, minimizing fraudulent accounts. They deployed this model and all goes well. Fraudulent accounts are going down by a large margin, everyone's happy, the business is happy, they're spending way less money on manual work to identify these things. Happy, happy, happy, until all of a sudden it stops working. After a couple of months we start to see this deterioration in model performance: down 10%, down 15%. And to give you a little bit of a sense of the economic impact of this
degradation in model performance: a 10% decrease in model accuracy is an additional 10,000 manual moderations per day from their support team, which was in the hundreds of thousands of dollars a month. So it's an expensive problem. Anyone have a first guess as to what you would do first if you're working on this team of engineers and your model starts to get worse and worse? You can just shout it out and I'll repeat it. No ideas? Okay, we'll speed through.

The first attempt was just a retraining. (You know the answer is out there; you just have to yell it.) So the first idea is: look, we probably just have some data drift. There was some corpus of data which we used to train this model offline, we achieved high enough performance to deploy it, and now the data that our model is seeing out there in the wild is different; it's seeing new data. Let's gather the new data, build a new dataset, and retrain our model, and it should work, right? It makes little to no impact on the performance of this model. They're completely stuck. So retraining doesn't work. Think about the full system that's going into this model in production. Any other ideas at all as to what might be happening here? [Audience: People gaming the model?] It's a good idea, and it might have been happening. Anything else? Any other possibilities? Okay.

What they actually found out, and this was a bit of a palm-to-forehead moment for them, was that it was just a new iPhone. The new iPhone camera was taking photos with a significantly higher resolution than the ones the initial model had been trained on, and the existing model didn't have enough layers to actually capture that information, so it was doing horribly. The solution here was literally to add two layers to the same architecture and retrain, and everything worked. So, a very, very funny story.
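One practical takeaway is to monitor basic input statistics in production, not just model metrics. Here is a minimal sketch (not their actual pipeline; the baseline and threshold numbers are made up for illustration) of comparing the resolution distribution of incoming images against the training set:

```python
from statistics import mean

# Hypothetical baseline: mean pixel count of images in the training set.
TRAIN_MEAN_PIXELS = 1_000_000  # e.g., roughly 1000x1000 photos

def resolution_drift(batch_sizes, baseline=TRAIN_MEAN_PIXELS, tolerance=0.5):
    """Flag drift if the mean pixel count of a production batch deviates
    from the training baseline by more than `tolerance` (as a fraction
    of the baseline). Returns (drifted, batch_mean)."""
    batch_mean = mean(w * h for (w, h) in batch_sizes)
    deviation = abs(batch_mean - baseline) / baseline
    return deviation > tolerance, batch_mean

# Older-phone photos: close to the training distribution.
drifted, _ = resolution_drift([(1000, 1000), (960, 1280)])
print(drifted)  # False

# New-phone photos: much higher resolution, so the batch is flagged.
drifted, _ = resolution_drift([(3024, 4032), (3024, 4032)])
print(drifted)  # True
```

An alert like this would have pointed at the camera change long before accuracy had drifted 10 to 15%.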
I think the only lesson here for people, by the way (just as a show of hands, who here is interested in going to work in ML in industry after MIT, as a data scientist or machine learning engineer?), is that there's no silver bullet to the story. There's no algorithmic way to predict this; weird things are just going to happen. Think about the full system end to end, even, potentially, the machine assembling the data that you may be using to build your model on.

Okay, second case study: a leading e-commerce company, and our task here is listing and ad recommendations. This is a very similar use case to one you might experience every day on Netflix or Max.com: you log in, you have a curated set of columns and rows of titles. In that context their goal is to keep you in the app as long as possible, keep you happy watching movies, and have that monthly subscription fee continue to deposit into their bank account from yours. In this case they're trying to get you to buy clothes. So how do you build a good system that's going to effectively retrieve and list top candidate advertisements, or just listings from the application, in a way that fits a given user profile?

The specific problem, as formulated by this customer of ours: for a given user, maximize the likelihood of a click on a served advertisement or listing, with the goal of maximizing revenue. You can look at every single user session and build a labeled dataset based on the ones that led to a sale. In this case our dataset is embeddings of historic engagement data. This is a little bit interesting, and there are a handful of good papers on this if you want to learn more, but this dataset is actually much more than actual purchases; it includes other activities that a user might engage in during a session on a website that are directionally suggestive of leading to a buy. So
search queries, item favorites, views, add-to-carts, and purchases are all labeled as true samples in this case. You build an embedding model that retrieves the top-N listing candidates, whether advertisements or actual items, based on a given user profile. So again: build this model, train this model, you have a service running in production, you have a shopper, and this model works great. There's actually no issue with the performance of this model in production.

The peculiar thing here (oh, and let me say, I think I touched on this a second ago, but there's a bit more of a deep dive into the specific system for retrieval and re-ranking; we can come back to it if people have questions later): there's no degradation, no issue. It's just really hard to beat. The team wants to build a better model. They have a big team of great researchers testing new architectures and gathering a ton more data, and it's not working; the production model is just better, and they can't build something better. To put it a different way: they think they have breakthroughs. They see these offline performance metrics ("we have this new model, it's 8% better"), they ship it, and it doesn't do any better. So it's a bifurcated issue: you have amazing results offline leading to no effect online, and then also, even with lots more data and more sophisticated models, the online model keeps outperforming any new candidate.

So, any thoughts as to why it's so hard to beat the production model? [Audience: Is it just seasonal trends?] That's a really good idea. [Audience: In training they hit different local minima, so maybe the production model just happened to land in a really good one.] One possible, yeah. Local minima, it's a good
idea. And the first idea (sorry, I'm supposed to repeat these because we're on video) was that it's a seasonal model. Another good idea. Any other ideas as to what might be happening here? [Audience: Consumer segmentation?] Consumer segmentation is a very good idea. Anything else? [Audience: Distribution shifts of data over time; as you move along in time, you have distributions that are more accurate to that point in time.] Yeah, that's a very good idea as well. These are all completely plausible. [Audience: Are they just not deploying the models quickly enough? If someone searches for something, they might want it right now, not a week from now.] Yeah, another good idea. All of these are totally feasible explanations for what was happening to prevent this team from making a better model.

What was in fact happening (and this is again a bit of a palm-to-forehead moment; there's a theme here: whenever you go work for a big team of data scientists, oftentimes the solution is going to be a bit of a palm-to-forehead) was this: a researcher realized after a while that the historical data being used to generate the training dataset for a new model was retrieved using embeddings from the prod model itself. Basically, the data that you're using to build a better model is coming from embeddings that are being produced by the model in production. So the odds are pretty good that the prod model will do well compared to that data, and it was very, very hard to beat. The new model was basically being trained with a mask on its face; it was just not going to do better. And I want to make sure that I speed up here a little bit to leave the other Doug time to talk about his stuff. The interesting system solution that they devised to prevent this was that they actually built multiple retrieval engines in production.
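A setup like this can be sketched in a few lines: several models score the same request, but only one serves the user, while the others' predictions are just logged for offline comparison. This is a toy illustration, not their system; the model names and the ranking functions are hypothetical stand-ins:

```python
# Hypothetical stand-ins for a primary model and shadow candidates:
# each "model" maps a user id to a ranked list of item ids.
def make_model(offset):
    return lambda user_id: [(user_id + offset + i) % 100 for i in range(3)]

primary = make_model(0)
shadows = {"candidate_a": make_model(7), "candidate_b": make_model(13)}

shadow_log = []  # in a real system this would go to offline storage

def serve(user_id):
    """Serve the primary model's ranking; run the shadow models on the
    same request and log their outputs for later offline comparison."""
    for name, model in shadows.items():
        shadow_log.append((name, user_id, model(user_id)))
    return primary(user_id)  # only this ranking reaches the user

print(serve(42))        # [42, 43, 44], the primary's ranking
print(len(shadow_log))  # 2, one logged entry per shadow model
```

Crucially, to avoid the feedback loop described above, you would train a new candidate on embeddings logged from a different engine than the one you compare it against.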
And so this is also sometimes talked about as shadow models, where you might deploy four or five models in a production environment at once but only actually allow one of those models to serve predictions to an end user. That's a great way to build four or five different candidates at the same time, using real-time data from your user base, without over-complicating the system, and while still having a fantastic user experience. By having multiple retrieval engines in production, they had multiple models each generating different sets of embeddings based on user activity in the application, and they would take the embeddings from one of those retrieval engines to train a new model, and then just not compare it to the model that made those embeddings. Ultimately, somewhat simple.

So those are two case studies that Doug and I find very, very funny. Again, there's no silver bullet here for you all, but you should completely go work in industry; just be prepared for some non-standard or unexpected things to start happening once you do. I'm going to let Doug hop up here and talk about LLMs next.

All right, thank you, Nico. So Comet's been a company helping other companies do machine learning for about five or six years now, and those two case studies are ones that we actually know about. We have a lot of customers that don't give us insight into what isn't working, so those are two where we were able to discover what was going on. What I want to talk about now is a different kind of animal, and although we have more customers asking us about large language models than about anything else, we don't have those kinds of case studies from existing customers right now. However, we can probably guess what some of the issues are that
they're having. So what I want to do is first of all ask: how many of you, be honest now, have used ChatGPT or some other model? Raise your hand. Okay, I see a few people who didn't raise their hand; I'm not going to point you out, but almost everybody has used one of these models. Have you used it for anything useful and important? And this is not an implication of guilt; I know, being students, you perhaps have something due. All right, so people are actually using these, and that's great. And for those of you who have tried it: have you seen anything funny or weird or wrong? All right, so almost as many people as have used ChatGPT have seen something wrong or funny. Not unusual.

So what I want to do is think about how we could actually work with large language models in production. But in order to do that, and also just for you to think about those LLMs, I want to take a quick look at how they work, and this is really a 30,000-foot bird's-eye view. As you may know, systems like ChatGPT are based on Transformers. They're not exactly the same; there have been a lot of tweaks to Transformers over the years, and it's also true that we don't know exactly what goes into some of the later models. They started out being pretty open, with papers that described all the gory details. Not so much anymore. Why might that be? [Audience: Money.] Yeah, indeed, this is big business. If you have a system that's able to beat the competing ChatGPT model or chatbot, then you're ahead of the game.

The GPTs are often just decoder-only systems, but looking at this Transformer model, the one thing I wanted to talk about and stress is that the inputs are largely a block of things that get encoded all at once, and then in these decoders, as we saw in the last talk, you have things that get generated.
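The iterative decode loop can be sketched in a few lines. This is a toy stand-in: `next_token` here is a hypothetical placeholder for the real network, which would score the whole vocabulary given everything generated so far:

```python
def next_token(context):
    """Hypothetical stand-in for the decoder: in a real model this is a
    forward pass that scores every vocabulary token given the context."""
    continuations = {
        (): "the",
        ("the",): "cat",
        ("the", "cat"): "sat",
        ("the", "cat", "sat"): "<eos>",
    }
    return continuations[tuple(context)]

def generate(max_len=10):
    """Autoregressive loop: each output token is appended to the input
    for the next step, until an end marker (or a length cap)."""
    out = []
    while len(out) < max_len:
        tok = next_token(out)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

print(generate())  # ['the', 'cat', 'sat']
```

The key point is structural: the model's own previous output is fed back in as input, which is exactly the behavior discussed next.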
And you've seen this too on the web page, if you've used ChatGPT: you get a little bit, and then it generates more, and it generates more. So it's iterative; it builds. And here's a little example from an old Transformer problem, translation, but you can imagine the same kind of system used for ChatGPT. The inputs come in all as one block, and every token gets converted into a set of numbers through an embedding system (maybe those were learned previously, or maybe they're being learned as the entire model is trained). It outputs something, and then that output comes back in as an input to the decoder. So you've got that one big block of input, plus whatever you just outputted; the previous output from your own model comes back in as input. You end up with a complete translation, or in ChatGPT, an entire paragraph of output.

Now, this has been called by some "stochastic parrots," and it is very statistical. The words, the tokens, being output are based on the statistics, and you can ramp up the variability, the noise, in the system so you get more creativity, or you can make it more deterministic.

Also, starting around GPT-3, people realized that you can add a little bit more context before the actual prompt. You've probably seen the kinds of things you can add to a prompt, like "think about this problem step by step" or "you are a useful assistant." And it turns out (who would ever have guessed? I wouldn't have) that adding things like that can really affect the kinds of output and the quality of the output these systems give. This is amazing, and I think it has big implications for how these systems work. Of course, this is called prompt engineering now, and you can probably get a job being a prompt engineer, making very good money.
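That creativity-versus-determinism knob is usually a temperature applied to the model's output scores before sampling. A minimal sketch (the logits here are made up for illustration):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Softmax over logits scaled by 1/temperature, then sample one token.
    Low temperature: near-deterministic. High temperature: more varied."""
    scaled = [x / temperature for x in logits.values()]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(list(logits), weights=probs, k=1)[0]

logits = {"cat": 4.0, "dog": 2.0, "pizza": 0.5}
rng = random.Random(0)

# Near-zero temperature: the top-scoring token wins essentially always.
cold = [sample_with_temperature(logits, 0.1, rng) for _ in range(20)]
print(set(cold))  # {'cat'}

# High temperature: the distribution flattens, other tokens appear too.
hot = [sample_with_temperature(logits, 5.0, rng) for _ in range(20)]
print(len(set(hot)) > 1)  # True
```

Dividing the logits by a large temperature squashes the score differences, which is why the output gets more "creative" (and less predictable).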
So now that you have a glimpse of how those systems work, we can take a look at some actual fails, and we can think about them, reason about them: why they fail, why they don't work.

The first of these: the input is a coded sentence, and the system is supposed to output what the decoded sentence is. If you've played with simple codes, there's ROT13, "rotation 13": you just take each character and rotate it 13 positions. So they tried this task: give it the coded input, tell it it's ROT13 or ROT2, and let the system see if it can come up with the decoded answer. What do you think, is this something that ChatGPT can do or not? [Audience: Aren't the tokens groups of characters? It's not necessarily recognizing it as A, B, C.] Okay, so if I can paraphrase you: it's not a task that they typically see; it's not the kind of sequence they were trained on. Well, wait a minute, they actually are trained on that. If you train on the entire internet, you'll probably find examples of ROT13 in Wikipedia, and probably ROT2. And in fact, the people who did this study looked at how often ROT13 and ROT2 examples occur in the corpus, and it turns out ROT13 was 60 times more common. So there actually are examples. But will that disparity in probabilities play a role in how it does? Yeah, absolutely. If you take a look at the actual output, it's sort of amazing, crazy, that you can take a system that wasn't trained to do decoding, ask it to decode something, and it does a pretty good job. It makes mistakes, though, and it makes mistakes on the ones where it doesn't see as many examples. Okay, so that's a minor fail.

Here's one that makes me a little sad, I must admit. So here's an example, and they start out with a very nice prompt: "You're a critical and
unbiased writing professor" (you're setting the stage for ChatGPT to be in the right mode when it grades this essay), "output the following format: essay score: score out of 10," and then "here's the essay," followed by an essay. This is an example from Chip Huyen. So, is this going to work or not? It's a tricky question. It's going to work in that you're going to get output, but what does it mean?

Let's take a look at these; we don't have to read exactly what they say. The first one, on the left: "Essay score: 7 out of 10. This essay has some merit." Okay, that's good. If you're getting feedback from your professor and they start out with that, you're on the right track. Let's look at the other one: "Essay score: 4 out of 10." Okay, not so good. "While the essay captures..." Oh, this is not a good start if you're getting feedback. "While the essay captures some..., it lacks depth and analysis." Okay. So, thinking about the output, and knowing a little bit about how these systems work, why might I be sad about this? [Audience: Was it the exact same input for both?] The question is whether it was the exact same input for both of these, and yes, it was. Let's see if I can get some more. [Applause] [Audience: Was it trained to do this, some kind of stereotype?] So the question is whether this was trained to do this, and I think the answer is no; this is just regular ChatGPT working.

Let me give you a hint. How does the output come out of ChatGPT? Iteratively, right? And so what's the first thing that these two outputs have to do? They have to give you a score. "Oh, this one is a 4 out of 10," so the rest of the text that gets generated has to be driven by that. I doubt that there's really anything useful in this feedback; the rest of the text is going to make sure it matches that score. So if you got a 4 out
of 10, then the text is going to validate that 4 out of 10. That makes me a little bit sad, to know that that's the way these systems are working. It's consistent; it's giving you what you want. So it sort of works, because it looks sort of like that essay was graded. Would you say that essay was graded? I see some skeptical nods. Would you want your essay graded this way? No? Okay.

All right, here's another one, and the question is: who is Evelyn Hartwell? Can you answer this question? A good answer might be "I don't know." Evelyn Hartwell does not exist; that's a made-up name. So what do you think will happen if you ask this question to a chatbot? It's going to give you an answer, and a great answer at that. Here are four different samples. "Evelyn Hartwell is an American author." I'm sure it's seen that phrase before. "Evelyn Hartwell is a Canadian ballerina." That sounds very realistic. "Evelyn Hartwell is an American actress known for roles in the television series Girlfriends and The Parkers." I don't even know if those exist. "Evelyn Hartwell is an American television producer and philanthropist." She's a nice person.

Okay, so you may have seen these before, and actually, if you read the source down below, you see these are called hallucinations. I must admit that I hate that term, because it implies that sometimes ChatGPT is hallucinating; you know, maybe it got some bad input and started hallucinating. But no, that's not true. It's just outputting the way it normally does. So I thought, what is hallucination, what's the agreed-upon definition? I didn't want to cite somebody, so I asked ChatGPT, and I can just put that text up there. It's pretty good: "When we say that an AI model hallucinates, we're referring to situations where the model generates outputs that seem unrealistic, nonsensical, or diverge significantly..." But who knows that? ChatGPT doesn't
know that. You have to use your own knowledge, or maybe you have to go searching for Evelyn Hartwell to realize that she doesn't actually exist, and you can only do that through your own investigation. So how could you possibly put this into production when you're getting output that is all over the map?

You could ask the question: what's different about the model when it's hallucinating versus when it's not? There's actually an idea you might come up with. What might be happening inside the model? You don't know from the outside, but if you looked inside, what might be happening? Any ideas? If you look at the output, there are probably very low probabilities. "Evelyn Hartwell is": low probabilities, we'll pick one. And then, okay, if you say "an," you'd better go with a next word that starts with a vowel. "An American." Okay, an American what? Low probabilities, let's pick one: "ballerina." So if you looked at the internal system, you might find there are some ways to measure that internally. And then what should it do? There's actually some work showing that you can use the model itself to give you a pretty good estimation of how good an answer is, and that's pretty fascinating. There are some measurements you can do, but they're very expensive, very time-consuming to compute.

Here, I'll show you the code that's actually doing this. "Who is Evelyn Hartwell?" "I'm sorry, I don't have that specific information." "Who's Nicolas Cage?" "Nicolas Cage is an American actor..." Okay, that one is good. The code that does that is using the chatbot itself, and I've got a link to this code. It's some Python code: basically, it computes a little self-similarity score, asks ChatGPT what it thinks about the answer, comes back very quickly (within half a second), and then it's hardcoded: if the score is too low, say "Sorry, I don't know."
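The idea can be sketched without any model internals: sample several answers to the same question and measure how much they agree. Fabricated facts tend to vary from sample to sample, while grounded ones are stable. A toy illustration (the canned answers stand in for repeated chatbot calls, and the 0.75 threshold is arbitrary):

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency(samples):
    """Mean pairwise string similarity across sampled answers."""
    pairs = list(combinations(samples, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def answer(question, samples, threshold=0.75):
    """Return the first sample if the samples agree; otherwise refuse."""
    if consistency(samples) < threshold:
        return "Sorry, I don't have that specific information."
    return samples[0]

# Hallucinated: repeated sampling gives contradictory biographies.
fake = ["Evelyn Hartwell is an American author.",
        "Evelyn Hartwell is a Canadian ballerina.",
        "Evelyn Hartwell is an American television producer and philanthropist."]

# Grounded: repeated sampling is nearly identical.
real = ["Nicolas Cage is an American actor.",
        "Nicolas Cage is an American actor and producer.",
        "Nicolas Cage is an American actor."]

print(answer("Who is Evelyn Hartwell?", fake))  # low agreement across samples
print(answer("Who is Nicolas Cage?", real))     # high agreement across samples
```

A real implementation would call the model several times at a nonzero temperature and could use a stronger similarity measure than raw string matching, which is part of why these checks are slow and expensive.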
So I want to move on to some epic fails now. You may have heard about the lawyer who submitted a ChatGPT-written motion to the court. Did you hear about this? We'll make these slides available, or you can Google it and you'll probably find it. This actually happened in New York, in Manhattan. The lawyer, overworked I'm sure, used it (or one of his assistants did) to create a 10-page brief citing more than half a dozen court decisions, with names like Martinez v. Delta Airlines and Zicherman v. Korean Airlines. These cases don't exist. Completely fabricated, made up out of whole cloth, hallucinated. And the lawyer was in court explaining what happened, and showed some output from ChatGPT, and ChatGPT had ended with "I hope that helps." It did not. I don't know if he got disbarred or not, but it's a serious problem.

Here's another. I wouldn't say it's an epic fail, but you can see where this person is going with the prompt: "If one woman can make one baby in nine months..." Oh, I recognize this pattern, this is math! "...how many months does it take nine women to make one baby?" And of course ChatGPT doesn't know about babies and women and how you make one (or it does in one way, perhaps, but not in the way that would let it answer this question). Of course, ChatGPT did the math. And no: nine women cannot make one baby in one month. It doesn't work like that. So, an epic fail on that one.

So, in summary, and this is a little bit opinionated, if I can go with the first speaker, the other Dr. Doug: LLM text generation is a bit of a parlor trick, I would say. It's definitely doing something; it's learning patterns; it's amazing that it's able to create this text. But we're using that in a way that's really simplified and a little bit stupid. We're using it to flip a coin and generate a word, flip another coin and generate another
word. And it's all consistent. If it's music, it can actually sound really good, but when it's text, that means something very, very different. I actually heard somebody describe ChatGPT this way: "I went to ChatGPT and I gave it a question, and it went off and did some research, and then it came back with..." No. That is not what ChatGPT does. It does not do research. It doesn't do logic. It doesn't do encodings. It detects patterns and statistics.

LLM results may be racist, unethical, demeaning, and just plain weird. LLMs don't understand much of anything. One of my colleagues thought that ChatGPT would be good at the New York Times Connections. Do you know this game? It's very fun: you get a bunch of words and you have to figure out how they're connected into four categories. ChatGPT is not good at that; it's not the kind of thing it was trained on.

This is one that I've seen often: LLMs don't do well on non-textual inputs. And by non-textual I mean you can make them text, but take a chess diagram: you lay out a chess board, or even tic-tac-toe. You've got some X's and O's and you ask it what's your next move, or who won. That's not a very good task for ChatGPT, because it's not a textual sequence problem; they don't deal very well with those. They'll answer you ("yeah, you'll win"), and no, that's not right.

LLMs don't know what they don't know. There's no idea of truth, or of punching down, or of what's inappropriate ("don't talk about Nazis"); it just doesn't know that. LLMs do give you back output, and the kind of output that you expect, that you want, but what the output actually says is anybody's guess.

So my opinion is that LLMs are amazing; they're stunning. Dr. Eck showed examples that he did back in 2002 of music generation, the same kind of problem. It wasn't that good as music, but it's amazing that these systems can do that. So what are they good for today?
What do you use ChatGPT, or Bing, or the others for? Yes: coding. They write pretty good code. I've looked at the code; I've never used it directly (I write a lot of code for my work), but sometimes I think, "Let's see what ChatGPT would do," just to get a hint. And I thought, "Oh, that's not bad... though there's a bug there." What else? Inspiration, yes. I use ChatGPT a lot for just getting ideas. I created a blog on genealogy and needed a good name; it didn't give me good names, but it inspired me, so it was part of the creative process. And I do know one group (I lived in San Francisco and went to a meetup there) that named their internal team based on a name ChatGPT had given them. So: definitely inspiration. Helping you rate Christmas cards? Ah, helping you write Christmas cards, yes. My wife is a professional in education, and she has to write a lot of cover letters and personal letters, and sometimes she does fundraising. She uses ChatGPT for generating the structure, and then she goes through and tweaks it a little. So definitely, yes. What else? Summarizing long acts passed by Congress, like the Inflation Reduction Act; summarizing large blocks of text. I would be a little worried about that; I'm not sure I would trust it 100%. Custom GPTs? Okay, that's a different kettle of fish, indeed. So there are definitely lots of ideas for ways you can use ChatGPT usefully and successfully, and we can talk about that perhaps when the talk is over. But I wanted to end with this: there are some good uses for ChatGPT, ones that are ethical, that you can trust, that you can be inspired by, but not every task is appropriate to hand to ChatGPT. Grading assignments: I would argue, do not do that. If you know anybody who is grading by handing assignments to ChatGPT, I want to talk to them one-on-one over coffee. I don't think that's a good idea; we should talk about that more.

So, to wrap up everything Nico and I talked about today, the takeaways from this talk. Our suggestion would be: use a modular framework, perhaps from a company that gives you the ability to manage your experiments; keep track of them; log all of your metrics and hyperparameters; keep track of your model storage and your deployment; get a good handle on each model; have production monitors and alerts; and be willing to be flexible with whatever happens. Look at the data, look at the training, dig down, and maybe add additional parts to that pipeline. Nothing is static in this world. Dr. Doug Eck and Dr. Doug Blank have been doing this for 30 years, and I'm sure he would agree with me: we would never have imagined that these systems would be able to do what they do today. It's fast, it's interesting, it's exciting, and there is a job for you, no doubt. But you have to stay on top of the processes and the practices, and make sure you use these systems in appropriate, ethical ways. Nico, I think that's the end. [Applause]

Thanks for the great talk. I have a quick question on the grading problem. You mentioned that if ChatGPT gives the grade first, it then uses the rest of the paragraph to justify that grade. If you reverse it, so that it gives the grade last, will it just take what it has already written and then produce the grade from that?

I would love to read that paper at NeurIPS next year. Or, I hear you have some projects due tomorrow, is that true? That would be a great experiment to try: does forcing the output to come in one order versus another have a large impact? I would think it does, but you never know with these models; you really have to test them.

Hello, great talk. I just had a question: could you use corrected user output to train models to prevent hallucinations? So, like, very low
probabilities, and if you could tell ChatGPT "you were wrong," could that stop the hallucination from happening again?

Yes, there are definitely systems, custom models, that take ChatGPT and fine-tune it to give specific answers. One of the big problems, though, is that if you're trying to monitor this in production, there are too many outputs to review manually. You have to come up with automated measurements so that these cases pop out and you can go in and look at them. But once you do find them, you can definitely fine-tune models on them.

With the same prompt, why do you end up getting different outputs? Is the entropy of the system somehow disturbed by a previous run? What causes that variance?

You could let me answer one of these! Okay, so this is built into a lot of the models. There's typically a value, between, say, 0 and 1, or 0 and 10, and you can set it: set it to seven and you get a lot of variability; make it one and it's more deterministic. I don't know exactly what that number does in ChatGPT, for example, but I suspect that when the probabilities come out and the model is deciding which token to output next, a highly deterministic setting means it goes with the highest probability, just the max, and that's the output, so you get the same output every time. Move the setting up and it says, "Okay, maybe I'll go with a lower-probability word," to give you some more creativity and craziness. It's fun to play with, and I believe it's a setting you can change on the ChatGPT dashboard.

Hi. In the first set of case studies we saw that some models didn't perform well when there were higher-quality images. How would you suggest dealing with that problem? Because it's the kind of problem you would least expect in production. To restate the problem we saw in the first case, for a dating app: the models recognized some images, but when the camera was changed and a higher-quality camera was used, the models didn't perform well. So, for any scenario where you're working on your models, how would you go about solving that kind of problem? Because it comes from the area you least expect.

Can you hear me, by the way? Is this on? Okay. So, we spent a lot of time with this team while they were dealing with this. It's kind of like this: you have to monitor the entire system and be aware that at some cadence, maybe every year or two, something extraneous to the system will happen, like a new iPhone launching. Unless you have that coded in ("every 18 months, send myself an alert over email that we should probably add a layer to our architecture, because the iPhone 16 is probably out now"), it's hard to handle deterministically. The lesson this team took away was to do an annual architecture review from a philosophical perspective, where they do their best to think about all the possible external perturbations to a system that was totally bulletproof a year ago but might not be now.

I have a question about LLM business use cases right now. I'm curious what you're seeing, from the business point of view, of the market for LLMs, with all of these problems you mentioned. I assume the marketplace for LLMs is very much evolving, and I'm curious what you're seeing from the customer side of things. What is the market for LLMs today?

Yeah, I can take that. Without sharing specifics, there are some generalizable trends we're seeing, and you can probably infer all of this from the talk Doug just gave, but as a general rule, teams are more wary of deploying LLM services into applications where the model might be making an important decision with a
customer. If it's going to impact a buying motion, or anything else that hits top-line financial performance, it's less likely you'll see a model shipped. Teams are very eager to ship LLM models for internal use cases. Coding: we see a lot of teams building internal agents to speed up the productivity of their devs. We see a lot of teams shipping models to do note annotation from internal meetings, or to summarize videos, so if someone misses a meeting you just take the Zoom recording and email out a synopsis. So, a lot of things aimed at improving the efficacy and efficiency of internal teams. That's a very rough picture, but we're seeing it as a general rule.

It's very similar to where we were five or six years ago, when people were doing regular neural-network experimentation. Back then we would ask a team, "How are you keeping track of and logging your results?" "Well, I have an Excel spreadsheet, and I run this, and I copy it, and I paste it." There were so many teams doing that, and I think there are probably a lot of teams doing the same with LLMs now. So the first low-hanging fruit we're working on is just making it easy to log, and then to visualize, sort through, and find things: you take a model, it gives an output; you fine-tune the model, it gives slightly different output; and you can track that lineage of the model. All of that is what we now call MLOps, which is an analogy with DevOps, the software-engineering style of development systems, but for machine learning.

Thank you for your talk. I can imagine how, in production, it would be very useful if your LLM could ask you a question. Personally, I've been trying really, really hard to force ChatGPT to ask me a question. Is that failure due to it being trained on text? And would it be possible to change the neural net so that it does ask questions?

I don't know the details of how ChatGPT works internally, or its exact algorithm, but as mentioned before, you can fine-tune it to do anything you want. You can take a large language model and, if you wanted it to do something inappropriate, you could train it to do that; and if you want to train it to ask you questions, you can do that too. That's what fine-tuning is: you take a model and steer it in one direction or another.

I found the hallucination part very interesting: using the low probabilities to detect when it's making something up. I wonder if that also works for the opposite problem, when it spits out something verbatim from the training set. For example, the New York Times is suing because some of the articles ChatGPT produces are very similar to theirs, if not the same. How do you tackle that problem? Is the probability then essentially 100%, or does it need something different? What would the measurement be; how would you detect that it's using exactly the same text from the training set?

So the issue is that it's taking text from the training set and outputting it verbatim. That's a hard problem, because the training set is the entire internet. You mean, you ask a question and it outputs exactly the same thing, and you want it to be a little mixed up, a little different, to get different outputs? Yeah, not textually exactly what it has seen before. I think you could pump up the randomness on that number, and as you change that number you start getting wilder and wilder deviations: change it a little and you get some variability; change it a lot and you get a lot of variability. So that is built into these models.

Okay, maybe one more question. Okay, come on over. Sorry, she raised her hand first; the most enthusiastic hand up gets the last question of the night. Okay, thank you. Hi, thanks for the presentation.
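The idea raised above, using low token probabilities as a hallucination signal, can be sketched in a few lines. This assumes your model API can return per-token log-probabilities; the citation tokens, the log-prob values, and the -4.0 cutoff below are all invented for illustration and would need tuning per model and per use case.

```python
import math

def flag_low_confidence(tokens, logprobs, threshold=-4.0):
    """Flag tokens whose log-probability falls below a cutoff.

    tokens and logprobs are parallel lists; the threshold is an
    arbitrary example value, not a recommended setting.
    """
    flagged = [(tok, lp) for tok, lp in zip(tokens, logprobs) if lp < threshold]
    avg = sum(logprobs) / len(logprobs)
    # average per-token probability: a rough whole-response confidence score
    return {"avg_prob": math.exp(avg), "flagged": flagged}

# Invented per-token log-probs for a generated legal citation
tokens = ["Martinez", "v.", "Delta", "Air", "Lines", ",", "925", "F.3d", "1339"]
logprobs = [-0.4, -0.1, -0.6, -0.3, -0.2, -0.05, -6.2, -5.8, -7.1]

report = flag_low_confidence(tokens, logprobs)
# The fabricated-looking specifics (the volume and page numbers)
# are exactly the tokens that get flagged here.
print([tok for tok, _ in report["flagged"]])  # ['925', 'F.3d', '1339']
```

As mentioned in the answer about monitoring in production, you wouldn't eyeball these by hand; a check like this would run automatically over every response, surfacing only the flagged cases for human review.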
I'm curious about your company. Could you share what differentiates your company from your competitors?

Sure, yeah, happy to talk about it, though we were hoping not to make this a Comet sales pitch, so honestly I'll just say: we'd love for you to go test it out. Go use it for your own ML use cases, if you're trying to build models or trying to go work in industry. In general, whether you use us or not is less important in this context. What we're arguing is that these are useful ideas to incorporate into the way you build and manage models, as a general rule. So whether there are open-source tools you love, or you want to use Comet, or something else: use them all, see what makes the most sense for your use case and your model, and go with that. I'm not going to pretend I know what you're working on and tell you that ours is the best tool. But in industry, for all the reasons we mentioned, weird things are going to happen; keeping track of everything matters; building meaningful ML systems is hard, and it gets a lot easier when you're managing everything in this way. So I'm equivocating and avoiding the answer, but I would say: use what you want, test them all out, see what works best, but you should use something.

Thank you so much, Nico and Doug. Again, this was an awesome talk, and I think everyone really enjoyed it. Let's thank them once more. [Applause]