Hey everyone, my name is Diana Chan Morgan and I run all things community here at DeepLearning.AI. Today we are so lucky to have some special guests from Galileo to walk us through LLM hallucinations with a metrics-first evaluation framework. For everyone watching, the session will be recorded and available for replay afterwards, but if you have any questions, please check the link in the chat to our Slido, where you'll be able to vote on the questions the speakers will answer at the end. We're also fortunate to have two of their developer evangelists, Setu and Jonathan, in the chat, so feel free to ask additional questions there as well.

For today's event and workshop, what you can expect is a deep dive into research-backed metrics for evaluating the quality of the inputs — data quality, RAG context quality — and the outputs — hallucinations — in the context of LLM-powered applications. We'll also be talking about an evaluation and experimentation framework for prompt engineering with RAG as well as fine-tuning with your own data, and of course we have a demo led by our speakers that will help you implement this on your own. This event is inspired by DeepLearning.AI's generative AI short courses, created in collaboration with AI companies across the globe; our courses help you learn new skills, tools, and concepts efficiently within one hour.

Now to introduce our speakers. Our first speaker is Vikram Chatterji. Vikram is the co-founder and CEO at Galileo, an evaluation, experimentation, and observability platform for large language models. Prior to Galileo, Vikram led product management at Google AI, where his team leveraged language models to build models for the Fortune 2000 across retail, financial services, healthcare, and contact centers. He also led product for Google Pay in India, taking it from zero to 100 million monthly users — the most downloaded fintech app globally. He was also one of the early members of the Android OS team, and believes generative AI is ushering in a technological wave similar to mobile. Hey Vikram. Hi Diana, thank you so much for having us. Of course — and to introduce our second speaker: Atindriyo Sanyal is the co-founder and CTO at Galileo. Prior to Galileo, Atindriyo was an engineering leader at Uber AI, responsible for various machine learning initiatives at the company; he was one of the architects of Michelangelo, the world's first feature store, and an early engineer on Siri at Apple, building the foundational technology and infrastructure that democratized machine learning at Apple. We are so lucky to have them both here today, and without further ado, let's get started. Vikram, do you want to take it away?

Sounds good, Diana, thank you so much for that introduction, and welcome everybody — really excited to have everyone here for the workshop. Hopefully it will be an exciting one, and one you can take a lot of interesting nuggets away from. The topic of today's workshop is something that is super close to our hearts, and I feel it's very top of mind for the industry as we all grapple with large language models: how do you mitigate LLM hallucinations, and how do you do that with a metrics-first evaluation framework? To kick things off, I'll start with an agenda of what we'll be covering today.
The first thing we'll talk about is the need for LLM experimentation and evaluation frameworks — I feel like we're all hitting these brick walls and quickly realizing there's a need for this, so we'll go over exactly why and what that means. Second, we'll talk about different kinds of evaluation methods and when to use them. Third, we'll dive into the emerging hallucination detection metrics coming out in the ecosystem, and also how we can push the envelope further and build metrics specifically for the LLM era. And last, we'll go through a demo to show how you can bring this all together — doing experimentation the right way, with the right metrics in place — and walk you through a simple app we've built and how we could mitigate hallucinations, whether you're using RAG or fine-tuning with your own data.

To start, here's the 100,000-foot view. When you think about LLMs, you have a lot of inputs that go into the LLM and then you have the outputs — this is classic machine learning — and you also have a feedback loop from the outputs back into the inputs. That's basically it; it should be simple, but it's not, because as soon as you start dabbling with large language models you immediately realize there's an explosion on both sides, the inputs and the outputs.

On the input side, for instance, you realize you have a bunch of different large language model parameters to tweak — the temperature and a whole host of others. There's the prompt template which, as we all know, leads to all sorts of different results if you tweak it even slightly, so there's constant experimentation happening there. There are chains you can bring together to create agents and multi-agents — it becomes super complex super quickly. You have RAG in the mix, with different RAG parameters and all the context data you need to add to the vector store — is it good, is it bad? Then you start thinking about fine-tuning: the quality of the data you're putting in, whether you have sufficient data in the first place, and so much more. The inputs alone are mind-boggling.

When it comes to the outputs, it depends on your use case. You might have text summarization, text generation, chat, question answering, code generation, and many more — and for each of these kinds of outputs you expect from the large language model, the way you evaluate has to be different. You can't evaluate turn-by-turn chat the same way you evaluate text generation; it just doesn't work that way. So the complexity increases exponentially.

To solve for this, what we really need to think about on the input side is how you iteratively experiment. Today, when you look at how people experiment in our community, it's highly manual — mostly still done in notebooks, or we've seen people use Excel sheets, Notion documents, all sorts of things. When you think about it, that's not how data science should work: you should not be experimenting with all of these ingredients, which are essentially your IP, inside Notion documents or Google Docs.
On the output side, output evaluation has been reduced to essentially eyeballing. With earlier natural language processing we at least had our friendly neighborhood F1 score, which could be the North Star for whether you're moving in the right direction or not — it's debatable how accurate and good it is, but at least it was an evaluation metric we all gravitated around. With LLMs you don't have that, so we're seeing companies at the highest levels of sophistication still just eyeballing the outputs with large numbers of humans in the loop, which obviously increases the amount of bias and is far from accurate.

When we thought about this at Galileo, we built what we call the Galileo LLM Studio, which is essentially an experimentation, evaluation, and real-time observability bench for any data science team or developer working with large language models. From a framework perspective, there are three parts to this, and three modules in the product as well. The first is Prompt, which is all about accelerating prompt engineering: when it comes to the inputs from your vector store, your prompt template, or your LLM parameters, how do you experiment across all of that to make sure you're converging on the right set of inputs? The second is Fine-Tune: when you're fine-tuning your large language model, you have to make sure you're doing it with the highest quality data and constantly moving in the right direction — it's super important to experiment and be iterative in that process. The third is Monitor, because you can create all the guardrails you want before production, but we all know that when you put a model in production, given its probabilistic nature, things can go wrong, and you have to know immediately whether you need to tweak your prompt or add additional data for fine-tuning. All of this together is how we think about experimentation and observability, and we strongly urge anybody — whether you're using Galileo or not — to be thinking about experimentation and observability constantly across all three of these pieces.

Beyond that, experimentation is meaningless without good metrics; that's how every field of science works. For us, since day one of Galileo two and a half years ago, it has been all about language models and building metrics for evaluating large language models. We built what we call the Guardrail Metrics Store, which has a bunch of research-backed metrics — many designed within Galileo, and many built on top of open-source metrics as well — so that you can reduce hallucinations and build trustworthy LLM applications. These Guardrail Metrics Store metrics are super important from an output-evaluation perspective, so let's dive deeper into that for a second.

If you double-click on output evaluation metrics — and in general, as you think about building LLM-based applications — we urge you to think of them in three different ways: there are output quality metrics, there are custom metrics for your specific product and business use case, and there are output hallucination metrics. So what do I mean by that?
First, the quality metrics. There's a whole host of them, as you know. For instance, tone of voice: if it's a chat product, is the tone angry, happy, sad — what do you want it to be? Is there any PII present? For a lot of regulated industries, that's super important to be completely on top of. Is there toxicity? Is there any kind of bias being spewed, hatefulness, sexism, or many other quality attributes of the output that you need to be on top of?

The second kind is what we call custom metrics — for instance, AI policy adherence. There's the EU AI Act coming up, and some other data acts in the US itself; you have to be adherent to those if you want to launch credible models. There's compliance adherence, business-metric adherence, and many other customized metrics you need to think about.

The last one is what we call the output hallucination metrics: how do you know whether the LLM's output is correct or not? This is where there's a huge bottleneck today — how can you consistently and accurately measure the correctness of the LLM output? When we think about this, we divide it into two large buckets: one for teams working with RAG workflows, and one for non-RAG workflows, which includes fine-tuning (you can also combine fine-tuning with RAG) — essentially, do you have context involved or not? Let's go deeper, because this warrants its own discussion. So what we'll do next is go deeper on the LLM output hallucination metrics, the different categories, and some proposals around what you could use for detecting hallucinations with your LLMs. Atin, do you want to take it from here?

Yeah, thank you. We'll dive a little deeper into the third category of output metrics Vikram was talking about — essentially hallucinations — and the existing methods that help quantify some of them. Before we dive in, I just want to set context on what we mean by hallucinations. Essentially, it refers to instances where the LLM output might sound coherent or plausible but contains logical fallacies or factual validity issues, or — if context is provided — where the LLM does not adhere to the context or the instructions in the input. There are many reasons why hallucinations occur, which is a bit beyond the scope of this talk: limitations in training data, model quality, and so on.

There are a few methods that have been explored in the recent literature to quantify model hallucinations, and they broadly fall into three categories of techniques. The first is based on measuring the amount of overlap between the output of the LLM and a reference ground truth. Of course, this is only possible if a ground truth or reference data point exists, and a lot of LLM use cases these days typically work without a reference, so in those cases this n-gram-matching category does not apply. The second technique is based on asking an LLM whether it thinks an input-completion pair is hallucinated or not. This is the basis for a lot of the more widely adopted modern techniques to detect hallucinations; it has certain merits, but there are many limitations, which I'll get into.
The third category rests on the fact that most LLMs are essentially token-generating machines, and each token has a probability distribution over a vocabulary; when you look at the statistical properties of those probability distributions, you can get an intuitive sense of whether the model got confused and whether it hallucinated. Those are the high-level intuitions, and I'll go through each of them one by one.

For the first technique, a popular suite of metrics includes the BLEU and ROUGE scores, which essentially measure n-gram overlap: they look at phrases and try to see how much overlap there is between the output of the LLM and a reference ground truth, and that gets converted into a zero-to-one score. There's also METEOR, which is a different version of the same idea that takes into account things like word order and synonyms, so it bolsters BLEU and ROUGE a little, but they're all based on the underlying principle that measuring similarity to an existing ground truth can give you some semblance of hallucination. There's a decent amount of applicability, especially in historic use cases — these metrics were typically used for translation quality, not just in language technologies but in speech-based systems as well — but there are a lot of limitations to this approach. For one, it requires the presence of ground truth and reference data, which is increasingly uncommon as language tech has been adopted more broadly. These metrics also fail to adapt to stylistic changes, where the words might be different but the overall meaning is the same, so they can lead to a lot of false positives and false negatives. To summarize: comparing words and tokens to measure hallucinations is, for the most part, not a great idea.
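To make that first, reference-based idea concrete, here is a minimal sketch — not any particular library's implementation — of a ROUGE-1-style unigram-overlap F1 score. It also illustrates the failure mode just described: a faithful paraphrase scores low simply because the surface words differ.

```python
from collections import Counter

def unigram_overlap_f1(output: str, reference: str) -> float:
    """ROUGE-1-style F1: overlap of unigrams between the LLM output and a reference."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A faithful paraphrase scores low because the surface words differ --
# one reason n-gram metrics are a weak proxy for hallucination.
print(unigram_overlap_f1(
    "The event was pushed back because of bad weather",
    "Rain postponed the launch",
))
```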
The second category of technique revolves around asking a GPT model, or some other state-of-the-art LLM, whether it thinks the pair of input and output is hallucinated. There are many ways to go about this: you can directly ask an LLM, "hey, do you think this is hallucinated?", or you can do multiple reruns, get multiple generations, and check consistency across them. This category broadly falls under the SelfCheck method, which was published a few months ago, and it certainly has its merits — the foremost being that it's not restricted by a ground truth, and it's designed to capture some of these higher-order fallacies, which means the applicability is fairly broad and it can detect more higher-order hallucinations. But it is also quite restricted. Number one, it's a black-box technique — you're asking a magic ball whether something hallucinated — so it fails to capture the reasons behind false positives and false negatives. It can be prohibitively expensive if you use more state-of-the-art models like GPT-4 to tell you whether something is hallucinated. And, more importantly, it lacks the explainability that is critical for users who want to take corrective action: as you'll see, there are many different kinds of hallucinations — open-domain, closed-domain — and taking the right action relies heavily on being able to explain what the hallucination is about.

The third and final category of existing metrics for measuring hallucinations is based on doing math and statistics on top of the token probabilities. You essentially look at the token-level confidence of the model as it constructs the output, and the intuition is that this gives you a sense of the model's uncertainty in its own response, which may suggest some semblance of hallucination. But our various in-depth experiments over the last year have shown that while techniques like these are great secondary signals of model confusion, they are a bit too granular in their ability to capture higher-level signals of hallucination. Another restriction is that a lot of LLM APIs typically don't return log probs, and in those cases these techniques become limiting.

So we've talked about the three broad categories of metrics used today for hallucinations. We've run many internal experiments to identify the issues in these existing metrics, and we essentially figured there was a need for a new technique — one that not only takes the better principles of the techniques I just described, but also leverages the distinctive powers of these new LLMs, like their ability to respond to chain-of-thought prompting — to create a new category of metrics that is, number one, more accurate in its results; number two, applicable to real-world scenarios and able to scale to different kinds of tasks; and finally, cheap to compute and able to work at low latency, because we are all building practical metric systems here.

With that, we created a new technique to identify hallucinations called ChainPoll — a high-efficacy method to detect hallucinations. In ChainPoll, we break hallucination detection into two phases. The first phase is called chaining: we use a very detailed chain-of-thought prompt to make a secondary LLM request and gauge whether the model can reasonably break down the reasoning behind the completion it just made; we then collect that chain-of-thought reasoning and structure it into a specific schema. Then we do what we call polling, which is an ensembling technique where we try to gather evidence of fallacies in the chain of thought through voting. Voting is a common technique in machine learning — in classic models like random forests, ensembling is a general way to get really good results. Think of it as having domain experts in the room: you get structured, schematic responses from each of them, and then you make a claim about whether the model hallucinated or not.
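Here is a minimal sketch of that chain-then-poll idea as described above — not Galileo's actual implementation. The `call_llm` function is a placeholder for whatever completion API you use, the judge prompt is far simpler than the detailed chain-of-thought prompt the speakers describe, and the hallucination score is simply the fraction of "yes" verdicts across the polled reasoning traces.

```python
import re

def call_llm(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for your LLM completion call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

JUDGE_TEMPLATE = (
    "You are reviewing an answer for factual and logical errors.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Think step by step about whether the answer contains a hallucination, "
    "then finish with a single line: VERDICT: YES or VERDICT: NO."
)

def chain_then_poll(question: str, answer: str, n_votes: int = 5) -> tuple[float, list[str]]:
    """Ask a chain-of-thought judge n times and poll the verdicts.

    Returns (hallucination score in [0, 1], list of reasoning traces).
    """
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    traces, votes = [], []
    for _ in range(n_votes):  # in practice these calls would be batched/parallelized
        reasoning = call_llm(prompt)
        traces.append(reasoning)
        match = re.search(r"VERDICT:\s*(YES|NO)", reasoning, re.IGNORECASE)
        votes.append(1 if match and match.group(1).upper() == "YES" else 0)
    return sum(votes) / n_votes, traces
```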
Here's the high-level flow of ChainPoll. The user makes a query to the LLM and the LLM responds with a completion. We then check whether the LLM also provided a probability distribution over the response tokens — this goes back to the third category, where log probs act as a good secondary signal of hallucination. If the output has no log probs, we employ a special generation prompt to approximate the original log probs. Then we send the newly generated log probs, along with the original prompt and response, to what we call the detailed chain-of-thought module, which is basically our core ensembling algorithm; behind the scenes we use batch inferencing to reduce the number of model calls we make to just one. The key learning has been that the depth of the chain-of-thought prompt makes a significant difference in the quality of hallucination detection: the more detailed and organized the prompt, the smaller the model you can use to generate accurate scores. In fact, as you'll see a couple of slides ahead, we used models like DaVinci — fairly dated at this point — and still achieved over a 23% improvement compared to the next best hallucination technique, SelfCheck. We then take the output of the ensembling module, create an aggregate score along with an explanation from the chain-of-thought process, pass it through a cleaning and optimization module, and finally give the user a normalized hallucination score along with a clear, step-by-step explanation of what may have gone wrong.

Just to highlight a super high-level result: across all our experiments and all data sets, ChainPoll performed significantly better than the next best method, the SelfCheck-BERT technique, which was published this year in July in the UK, and you can see how much better it is compared to some of the more common metrics used today to detect model hallucinations.

There are many advantages to ChainPoll. Number one, it's designed to be as accurate as possible — the whole motivation behind the ensembling and the chaining is to increase accuracy across a breadth of possible real-world tasks; our metrics should apply to real-world scenarios and not just stick to academic data sets or textbook examples. Second, explainability is key: hallucinations in modern LLMs are very nuanced — sometimes the model spews opinions — so the devil is in the detail of what's right or wrong, and these LLMs sometimes act very much like humans, so it's important to give feedback that is understandable by a human being so they can take action on next steps. There's also a need for low latency: our team has done a significant amount of engineering to make sure the algorithm works in parallel and the different computations are asynchronous. And we want to minimize API calls and do batch inferencing to reduce cost, because cost is a very big factor here.

So that's ChainPoll in summary — that's the method. The outcome of the ChainPoll method is two metrics, which we apply to the two workflows we observe with LLMs. The first metric is Correctness, which applies to open-ended, open-domain use cases where the user passes a query with or without context; the Correctness metric looks at the consistency of the reasoning behind the LLM's completion and how correct the output is likely to be. Correctness, in particular, is agnostic to whether context was passed in or not.
The second metric is Context Adherence, which essentially measures how grounded the LLM's output is in the context that was provided. It's very helpful in RAG workflows, where you have context in the form of documents fetched from the vector store. I'll show a couple of simple examples of Correctness and Context Adherence on the next slide.

Here you can see a very simple example of Correctness in action. The user asked a factual question — when and where was Abraham Lincoln born — and the LLM's output was that Abraham Lincoln was born on that date near a certain county in Kentucky. In this case the model made a token-level mistake on the county name; the algorithm detected that Correctness was low and gave an explanation that Abraham Lincoln was born in LaRue County, not the county named in the response.

The next example is more complex, and it really shines a light on the efficacy of chaining and ensembling. This is a RAG use case where the prompt asks the LLM to describe the topic in the documents passed as context. If you look at the output of the LLM, on the surface it seems very coherent — it says the study is a descriptive study of hospitalized cases of influenza over the last five seasons — but if you really dive deep, the devil is in the detail: the algorithm found that almost nothing in the output was actually related to the context at all, and the Context Adherence score was low. You can see it going one level deeper, highlighting where the LLM may have picked up its reasoning from, how it got it wrong, and pointing to the specific areas of the document where the mistake might have been. The overall experience is like asking an expert — you're able to pinpoint the specific areas of the document where the model might be hallucinating.

Quickly shining a light on the results: looking at the mean AUROC scores for both open- and closed-domain use cases, you can see significant improvements from the ChainPoll methodology compared to SelfCheck and some of the other metrics used, like G-Eval, as well as entropy-based measures and GPT-score. And one extra set of results before we dive into the demo: this shows the four most challenging data sets we used in our experiments, across four different tasks. To shine a light on the experimental process, we got domain experts from around the world to annotate thousands and thousands of data points as hallucinated or not, and the 85% AUROC score on TriviaQA essentially shows that ChainPoll came closest to matching a human expert at determining hallucinations. With that, we'll dive into a demo.

Thanks, Atin. All right, it's demo time — hopefully that was educational and interesting. Please keep the questions flowing in; we're more than happy to take them at the end, and we'll try to leave at least five to ten minutes for Q&A. The demo is going to be twofold: the first part is about prompt engineering with a vector store — the classic RAG use case — and how you can experiment better with the inputs using some of the metrics we've talked about; the second part is about fine-tuning with your own data.
The use case for the first part — prompt engineering with RAG — is one where I'll walk you through how we built a simple Q&A app for kids: a child can come in and ask a question, and the LLM gives a response. The ingredients were a vector store with thousands of different Wikipedia articles for context, a data set of prompts, and a set of LLMs to experiment with — some closed-source and some open-source — just to see which combination of prompts and LLMs would work best.

For this first demo: typically, as data scientists, we like to work in Python notebooks, and that's where we've been doing our prompt engineering (you could also do this in an IDE, for instance). So in this demo I'll walk through how, as you're doing prompt engineering in a notebook, you could use Galileo to experiment faster — but the takeaways go far beyond Galileo; we're really talking through how to think about experimentation and how to apply different metrics to different pieces of the application.

To start building this Q&A app for kids, I go into my notebook and pip install promptquality, which is Galileo's Python client for prompt engineering. Once I do that, I load my prompts the way I usually do — in this case we just used the Stanford Question Answering Dataset, SQuAD, which is a fairly popular data set that provides a bunch of context passages and a bunch of questions, which become the prompts. Once I'm done with that, I get the relevant passages from Pinecone the way I usually do with these vector stores — I've thrown thousands of Wikipedia articles in there. Then I start creating a prompt run with Galileo. This is typically where you start thinking about what the prompt template should be. In this case we started with something simple: "You are a helpful assistant. Given the following context, answer the question. Provide an accurate and factual response." I pull the context from my vector store and the question from the prompts I loaded from SQuAD.
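As a rough sketch of that assembly step — the template wording is paraphrased from the one just described, and `retrieve_passages` is a stand-in for the actual Pinecone query:

```python
PROMPT_TEMPLATE = (
    "You are a helpful assistant. Given the following context, answer the question. "
    "Provide an accurate and factual response.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def retrieve_passages(question: str, top_k: int = 3) -> list[str]:
    """Stand-in for the vector-store lookup (in the demo, a Pinecone query over Wikipedia chunks)."""
    return ["<retrieved Wikipedia passage 1>", "<retrieved Wikipedia passage 2>"][:top_k]

def build_prompt(question: str) -> str:
    """Assemble the RAG prompt: retrieved context plus the user's question."""
    context = "\n---\n".join(retrieve_passages(question))
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Example question, echoing the median-temperature example discussed later in the demo.
prompts = [build_prompt("What is the median temperature in the winter?")]
```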
Once I'm done with this, I log in, and then I add the metrics I want to see — remember the Guardrail Metrics we talked about, which Atin mentioned as well. The question becomes: for your use case, for your product, what are the different metrics you want to see? It's important to start considering this before you even begin the prompt engineering process — which metrics are top of mind? For us here, because we're using Pinecone, we care a lot about the adherence of the output to the context provided by the vector store, so Context Adherence — the metric Atin talked about — is one we want to see; ideally it should be a one, meaning every single output is perfectly within context. We also add Context Relevance, which is the relevance of the input to the context; Correctness, which detects the level of confusion the model had as it came up with the response, based on the ChainPoll technique Atin described; and we're also curious about latency, and about sexism, since we want to avoid anything like that in the outputs.

The other piece is that, because it's an app for kids, we want to create some custom metrics: if certain words show up in the model's output, I want that flagged so I can see why they're coming out. In this case we just added a few words and called it the bag-of-words metric, but because it's pythonic you can make this as complex as you want — for some of our other apps we've introduced complex models for language detection, gender-bias detection, and so on, and registered those with Galileo. Here I just say: if any of these words are present, return a one, otherwise return a zero — as simple as that — and register it with Galileo's promptquality Python client.
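A minimal sketch of that kind of custom scorer, with an illustrative (hypothetical) word list; the exact registration step is promptquality's own, so check Galileo's docs for the call rather than the comment below:

```python
# Words we never want to see in a kids' app -- purely illustrative list.
FLAGGED_WORDS = {"damn", "stupid", "hate"}

def bag_of_words_metric(response: str) -> int:
    """Return 1 if any flagged word appears in the model's output, else 0."""
    tokens = {t.strip(".,!?").lower() for t in response.split()}
    return int(bool(tokens & FLAGGED_WORDS))

# Because it's plain Python, the scorer can be as complex as you like -- e.g. swap in a
# language-detection or gender-bias model. Register the function with the promptquality
# client (see Galileo's docs for the exact registration call) so the score shows up
# alongside the built-in guardrail metrics.
```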
Once that's done, I give this a project name and choose the model I want to use — in this case I'm going to go with ChatGPT's 16k-token LLM — and that's it. Once I run it, I get a link to the Galileo console: no more eyeballing inside your notebook or exporting to an Excel sheet — click the link and it takes you to the run.

When I do this, I land in a UI where — and I'd urge everyone to do this before going into the details of individual responses — you should take a step back and look at the metrics: what's the health of your prompt run? At a higher level, think of it in three categories. The first is output quality, which includes Correctness and lower-level metrics like BLEU and ROUGE. Then there are the RAG metrics: how good was the context you provided, and is the output actually adherent to that context or not? There are also often times when you need humans in the loop, and it's a good thing to combine automated, metrics-based experimentation with humans at this stage, so human ratings at a per-prompt-response level are super important. And then there are the custom metrics, like the bag of words I just added.

To start, I can see that the average Correctness is 0.8 out of 1 — it's decent, but not great. Imagine if your F1 score for an NLP model were 0.8: you can't ship that; it's probably going to be wrong 10 to 20% of the time. So I want to look into this more — why is that happening? It turns out the Context Adherence isn't great either: it's 0.789, and ideally it should be a one, meaning every single piece of output is adherent to the context. So at a high level I can quickly get a sense that maybe there's something wrong with the groundedness, or with the context I provided the model.

To dig in, I start looking at the responses, and across all of these metrics I have at my disposal, there are certain things I might want to look into a little more. For instance, for this particular response, the context that was provided talks about the weather in a certain part of Australia, where the average temperature exceeds 32 degrees Celsius in the summer and 15 degrees Celsius in the winter, and then it goes on about other aspects of the climate in that region. The question was: what is the median temperature in the winter? And the model's response was that the median temperature in the winter is about 15 degrees Celsius. At a glance, that's exactly what the context seems to say — 15 degrees Celsius in the winter — so as a human evaluator rapidly moving through this process, I'd mark it with a thumbs-up and move on with my life, then go to the next one and the one after that. But because I see that the LLM Correctness score is zero and the Context Adherence score is zero, I'm curious why.

Remember that Atin mentioned the technique behind Context Adherence also gets a response from a model that gives us its reasoning, which helps a lot with debugging. So instead of wondering why it's a zero, I can just hover to get a rationale — it's like an expert telling me why the output is not adherent to the context. If you look at the last two sentences, it says the context does not mention the median temperature; to confirm whether the median is also 15 we would need additional data, and therefore the claim made in the response cannot be fully supported by the documents. It's almost like an expert who has read through all the documents provided to Pinecone and says: what I'm being given from Pinecone is the average temperatures, but what I'm being asked for is the median temperature, and I can't answer that without additional information. That's a good takeaway for me as a developer: I might need to add more context about median temperatures in certain regions — maybe I'm building a weather app for kids — and based on that information I can mark this as a thumbs-down and move on to the next prompt. So that's a quick way, at a per-prompt, per-response level, to debug much faster using the metrics I mentioned.

All of this is for one prompt run. But remember, when you're working with these LLMs, how many different kinds of inputs you have to deal with: you can have ten different LLMs, multiple kinds of LLM parameters, different prompt template versions — you can't do this one by one in this manner. We urge you to think differently. One of the techniques you can use with the promptquality Python client is called a prompt sweep — think of it as throwing the kitchen sink into the Python client, as the sketch below shows.
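Conceptually, a sweep is just the cross-product of everything you want to vary. Here is a rough sketch of the idea — the template strings and the `run_prompt_evaluation` helper are hypothetical placeholders, and the promptquality client has its own helper that does this in a single call:

```python
from itertools import product

# Hypothetical template variations, candidate models, and temperatures to sweep over.
templates = ["template v0 ...", "template v1 ...", "template v2 ...", "template v3 ..."]
models = ["gpt-3.5-turbo-16k", "text-davinci-003", "gpt-4"]
temperatures = [1.0, 0.5, 0.1]

prompts: list[str] = []  # e.g. the SQuAD-derived prompts built in the earlier sketch

def run_prompt_evaluation(template: str, model: str, temperature: float, run_prompts: list[str]) -> None:
    """Placeholder for one evaluated run: build prompts, call the model, log the metrics."""
    print(f"run: model={model} temperature={temperature} template={template[:20]!r} "
          f"({len(run_prompts)} prompts)")

for template, model, temperature in product(templates, models, temperatures):
    run_prompt_evaluation(template, model, temperature, prompts)
```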
In this case, as a developer, I thought to myself: I tried the ChatGPT 16k-token model, but let me try text-davinci-003 as well, and let me also try GPT-4 and see whether it works better or worse than the others. Instead of one template, I actually have many different ideas, so I add four different templates, comma-separated, with slight nuances and differences between them. I give it the same project name, add a bunch of different temperatures — 1, 0.5, 0.1, so varying degrees of freedom for the model to come up with its responses — and execute. At the end I get a quick estimate, based on the API calls being made and the vector store we're using, that the total cost is going to be over $36 — not cheap, but I'm going to go for it — and again I get a link, which this time takes me to many different runs instead of one. All of my runs for this project can be seen there in one go.

This can be a lot. You can see all of the different prompt versions that were used — Galileo automatically version-controls them for you, so you don't have to worry about that; they get stored in what we call your prompt store, where they're automatically version-controlled — as well as the different models used and all of the metrics that came out of each run. Now, how do you figure out which is the best one? Again, if you have a metrics-based approach, it becomes super fast to figure out which of these runs is working out best for the metrics you care most about.

Even from there, it's always good, as a next step, to quickly do an A/B/n comparison between different kinds of models. For instance, here I keep the template version exactly the same — v0 — and try out different models: text-davinci-003, the GPT-4 8k-token model, and ChatGPT 16k. Now it gets interesting: with everything else kept the same, how do these three models behave against each other? As I look at the responses coming out, I can inspect the output and eyeball it, but I can also use the LLM Uncertainty metric that Galileo provides, which uses the log probabilities coming from the model, to see some really interesting trends — even with everything else kept the same, the models' responses are dramatically different from each other. That can be a very interesting insight as I start thinking about which model to go ahead with, based on its latency, its cost, its output, and potential hallucinations.

The takeaway from all of this: you should be able to do multiple prompt runs in one go, use a metrics-first approach to figure out which combination works best, and A/B test — it's super important. And there's a need to not do this in notebooks or Excel sheets; there should be a collaborative environment you can leverage, and Galileo is one of them. If you do all of this, you get a much faster way of figuring out whether you're moving in the right direction and which model is working best.
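On that Uncertainty metric: one simple way to turn token log probabilities into a single signal — not necessarily the exact formula behind the console's metric — is the average negative log probability of the generated tokens:

```python
def mean_token_uncertainty(token_logprobs: list[float]) -> float:
    """Average per-token 'surprise' (negative log probability) over a completion.

    Higher values mean the model was less confident while generating. As discussed
    earlier, treat this as a secondary signal of possible hallucination, not a verdict.
    """
    if not token_logprobs:
        return 0.0
    return -sum(token_logprobs) / len(token_logprobs)

# Example with log probs returned by an API that exposes them:
print(mean_token_uncertainty([-0.05, -0.2, -3.1, -0.6]))  # the -3.1 token drags confidence down
```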
Often, building these LLM applications is a team sport, so you need to have multiple creators. In this case it was just me creating these runs, but you could have many different people working on them at the same time.

All of this is on the prompt engineering side — coming up with the best combination of inputs so you can build the best version of your application quickly. Before I go into the fine-tuning piece: let's say you do all of this and you want to put it into production. When you put an app into production, it's a whole different ball game, like we talked about — you need a real-time understanding of where things are going wrong. I'm going to walk you through how this looks in Galileo, but again, if you're using any other observability solution, you could leverage Galileo's APIs or any of the metrics we talked about to power it. It becomes really important to track not just the high-level cost, latency, and API-failure metrics — those are super important — but also whether there was uncertainty and whether the LLM Correctness was in a good spot. You might have multiple models in your chain, so all of those models should be monitored at the same time, and you can start to map that back to business-level metrics like user session length. As an example, I can see here that as the factuality — the Correctness — started to drop, user session length started to drop as well, which is an interesting insight; I might want to dive deeper, check what exactly is going on in that particular window in near real time, and look at the specific data that was causing the problem. When you look at that data, the factuality score, and the model's explanation for it, you can quickly get a sense of where you need to tweak your prompts. At that point, instead of rummaging through notebooks to find where you wrote that prompt, because Galileo has a prompt store behind the scenes, you can just right-click, load it into the prompt playground, and see how this particular user input is doing with the prompt template you had and where you might want to tweak it.

We also strongly urge you to tweak your prompts constantly, and then A/B test different prompt versions — say 10% of traffic on prompt version one, 20% on prompt version two; that's how software development works as well. We did that behind the scenes for an application we built that's being used by analysts at banks, and we noticed some really interesting trends: we tweaked multiple prompts and compared all of them on live traffic to see which one actually works best. Rapid experimentation with powerful metrics lets you figure out which is the best LLM and which is the best prompt version, using your users as a constant feedback channel. All of this was to show that it's super important to experiment with the inputs, evaluate your outputs in real time, and have feedback loops constantly in place, so that as a team you're always moving in the right direction and improving the LLM application — it's a never-ending process of improvement as the world changes.
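For that live-traffic split — say 10% of requests on one prompt version and 20% on another — a minimal sketch of weighted routing could look like this; the version names and weights are hypothetical:

```python
import random

# Hypothetical rollout: 70% of traffic stays on the current prompt, 20% tries v2, 10% tries v1.
PROMPT_VERSION_WEIGHTS = {"v0": 0.7, "v2": 0.2, "v1": 0.1}

def pick_prompt_version(rng=random) -> str:
    """Route each live request to a prompt version according to the rollout weights."""
    versions, weights = zip(*PROMPT_VERSION_WEIGHTS.items())
    return rng.choices(versions, weights=weights, k=1)[0]

# Tag every production request/response with the chosen version so correctness,
# context adherence, and latency can be compared per version on live traffic.
version = pick_prompt_version()
```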
Now, switching gears to the second part of the demo, where I mentioned I would talk about fine-tuning as well. It's important to think about fine-tuning as a pretty cost-effective way of creating LLMs that can be fairly accurate. The big bottleneck tends to be whether you have the data or not, and if you're using synthetic data it tends to be low quality — so the problem of data quality becomes really important and top of mind again.

For the fine-tuning use case, we're generating headlines for news articles, and the ingredients include a corpus of news articles to fine-tune the LLM with — a fairly simple use case. I fine-tuned this model in my notebook and again added one line of Galileo code; in this case the Python client is called dataquality. It's a similar experience: in your notebook you get a link to visualize your data on the other side. In the Galileo UI you get to see all of your data in one place, along with a bunch of alerts about what was super hard for the model. This is where there's a metric our researchers came up with called the Data Error Potential (DEP) score — we're happy to share the math after this workshop, and how to apply it — and it's been super powerful as a quantifiable way to figure out which data points the model was having a hard time on.

Let's dive into that for a second, starting with the ground truth. First I'll talk about how you can inspect the ground truth with the Data Error Potential score to find incorrect ground truth, and the second part will be hallucinations in the output. For the ground truth, I go to the pre-trained embeddings — this is for my test set, by the way; there's also a training set, but I care a lot about my test set being perfect. Here I can see my embeddings colored by the DEP score, so I can immediately get a sense of where things might be going right or wrong. Let me remove the low-error-potential regions, and I can see clusters forming. There's one over here on the side; if I make a quick selection of it, I can see its average DEP is quite high — high error potential. Looking at the table view, the input news article was about hefty falls in the stock market, where 600,000 vehicles were sold in Japan, shares in the electronics giant Sony fell by about 1.7%, and this had a big impact on the Nikkei index and eventually on the S&P index as well. The target output — the headline the human labeler came up with — was "Shares in Japanese automaker Mitsubishi Motors plunged 13.5%." You can use the token-level DEP score to see exactly which words the model was having a hard time with: when I turn it on, I can see the model was struggling specifically with words like "automaker" and "Mitsubishi" and the number 13.5%.
What also happens is the model tells you about the other tokens it was considering: when you hover on "Mitsubishi", it tells you it probably should have been "Sony" instead — again, it's almost like an expert telling you how to fix things. So I can quickly edit this target in Galileo and change it to say Sony (I could do this, or my subject-matter expert could), and it gets added to what we call an edits card, where you're essentially improving the input data set so you can export and work with a better-quality data set for the next run.

The last thing I'll walk through, as promised: all of that was about the ground truth — figuring out whether the ground truth is good or bad, whether through Galileo's automatic clustering, through a very high DEP score, or just by eyeballing which clusters you want to inspect. The second thing is the output: how do you know if the model's output is hallucinating or not? That's where, among other metrics, the Uncertainty score becomes really interesting. If I look at the LLM Uncertainty score here, it's a typical distribution with a long tail of highly uncertain data points. When I select that tail, I can see certain examples show up where, when you click into them, you can quickly see where the model was having a hard time in its generated output — it's literally struggling to come up with the first name here, and it struggles when it comes to numbers as well. If I look at other outputs, again it's the first name, "Steve", where it's struggling, and again the numbers and the names. So quickly I can figure out that first names, names, ages, and numbers are where it's struggling a lot. That's important for me to know so I can fix the input data I'm working with next and make sure the model doesn't hallucinate on those — so it has more context from the data provided to it.

At a higher level, this is how you'd want to go through an experimentation process: keep track of all of it in one place over time, and compare all of these runs against each other. But to do any of this — and this goes back to the whole point of this workshop — it's super important to think about the metrics you're working with and to define them before you embark on prompt engineering with RAG, or fine-tuning, or both. I'll stop there; I know we have a few minutes remaining and would love to get into any questions people have.

Absolutely — and thank you so much, Vikram and Atindriyo, this has been great. I know there are a lot of questions in the chat; we only have time for a few, but I've summarized a couple of the important ones. The first: can we use adherence to check the text of a large document? If we summarize multiple call transcripts, can we point out where the LLM got the context from?
Maybe I can take a quick stab at that. The short answer is yes: irrespective of the size of the document, we detect adherence issues if there are any. It can depend on how you're chunking the data — we run adherence at a request level — so if you're chunking the data in a certain format but are able to cohesively bring the full context together in a single request, then most certainly yes. And as you saw in the slides, even for a large document we're able to pinpoint the specific area of the document the model used as the basis for its reasoning, and whether or not that reasoning was logically correct.

Absolutely. Let's take the next question: how do you mitigate hallucinations in situations where you're streaming live output to the user and therefore don't have the full text to analyze beforehand?

Do you want to take a stab, Vikram? Yeah, I was just thinking about this. It sounds like a situation where it's a production application with the output from the model going directly to the user. There are two parts to this, in my opinion. One is before you even deploy the model: it's almost like QA in software engineering, where you make sure you're thinking about every possible outcome before you launch — that's where experimentation on the prompt engineering side becomes really important, so that all those use cases and workflows are covered. Even if you do that, as with any application, things will go wrong when you're streaming live, and that's where real-time analytics and measurement with these metrics — our Correctness score and Adherence score — become super important. You will be slightly reactive in those cases, but at least it helps keep those situations from escalating further, and having effective alerts also helps quite a bit.

Just to add to that: in theory you can take your partial generation and measure hallucinations or correctness on it — in theory it works — but in a practical system you want the entire completion to finish before you make a judgment call on whether something hallucinated, for many reasons, including the fact that the answer might be in the last bit of the completion; otherwise you can get a lot of false negatives and false positives. If you do want to design an eager system that measures these kinds of metrics in real time, a better metric to measure is Uncertainty, because it gives you a soft signal of how the model is reacting or getting confused as it emits tokens. But if you want to measure hallucination, it's better practice to wait for the completion to finish and then make a call, which doesn't take too long.

Definitely. OK, I think this is our last question for today: how reliably can you reproduce results? I've seen LLMs be non-deterministic even with temperature equal to zero — how do you make sure a metrics-first approach is reliable if you can't always reproduce?
Yeah — to be honest, like any other metric, there is a certain dependence on the reliability of the models themselves, which is why we've built the ChainPoll system in a way that's very modular and plug-and-play. If you remember the boxes-and-arrows diagram, there are a couple of internal LLMs; in the default case they can be the source LLM itself, but you can swap in third-party models that might be more or less powerful. There are also the internal chain-of-thought prompts and the generation prompts, whose complexity can determine the efficacy of the output — so there are all these knobs you may have to tune. And you're right, these are all non-deterministic systems, so there's no perfectly reliable way to reproduce results, but our experiments were done at scale, across thousands and thousands of hallucination utterances and various tasks, and on aggregate the results you saw were at 85% accuracy. So our job is to make sure the metric works, at least in an amortized sense.

Yeah, and the last thing I'd add: you'll notice LLM output is pretty non-deterministic even when the temperature is zero, and that is exactly why you need these kinds of guardrails in place constantly — that's the boon and bane of these LLMs. I'd also urge people to think about the kind of multiple checks we build into creating these metrics today: a majority vote happens automatically as part of producing the metric output, for exactly this reason — behind the scenes, within one API call, you're essentially asking the same LLM five times, precisely to avoid this lack of stability and reliability.

Great. Well, thank you so much, Vikram and Atindriyo — this has been an amazing workshop, and I'm sure everyone learned so much. For anyone still here, thank you for coming; please take the survey in the chat to give feedback on how we should run our events in the future, and on what other topics you'd love to hear about, even from Galileo. We hope to see you next time — and keep learning. Take care, everyone. Bye. Thank you. Thank you.