Aparna: Hi, my name is Aparna, one of the founders here at Arize, and today we're really lucky to have Trevor, one of our solutions engineers on the ground, to share what people are doing in the real world. Trevor, do you want to introduce yourself?

Trevor: Yeah, thanks Aparna. Hey everyone, my name is Trevor. I'm a machine learning solutions engineer here at Arize, so I work very closely with a lot of our customers on pretty much all components of observability: understanding use cases and the types of models they're using, working through data ingestion, helping set up monitors and dashboards, and all the things that matter most for observability.

Aparna: Awesome. We have something really exciting for you all today. Trevor, let's jump into it.

Trevor: We're going to start with an overview of what hallucinations are, talk about how to detect them and why you should be measuring or evaluating to catch them, and then go into a more practical, hands-on Colab component afterward. Aparna, do you want me to share my screen, or did you want to share?

Aparna: Feel free to share, and I can pull up the Colab after.

Trevor: Okay, cool. Diving in: what are hallucinations, and what types of hallucinations can you see? I won't spend a ton of time on the broad topic, since I'm assuming a lot of people are already familiar with it, but at a really high level, hallucinations are when the model outputs text that isn't factually correct or doesn't make sense logically. Really, it's whenever the model makes up a response. In creative use cases that might actually be something you want the model to do, but in the cases where you need the LLM to respond accurately based on your data, it's something we definitely want to catch and mitigate.

With that, there are two different types of hallucinations. The first is hallucinations on public data, which is probably what everyone is already familiar with: the model hallucinates based on what the foundation model was trained on. For example, you ask the LLM "Who is Michael Jordan?" and it responds "Michael Jordan was the first US president," which is obviously not correct. The second type is hallucinations on private data — the data provided to the LLM in the context window. As an example, imagine we're a company that sells cars online and we have a chatbot where users can ask questions about the cars we're selling. A user asks about a specific car, context is retrieved, and the model responds with "the car gets 700 miles per gallon," which obviously isn't factually correct. People can probably see why it's important to catch and mitigate these types of hallucinations, but Aparna, I'm curious to hear your thoughts on why companies should care more about hallucinations on private data.

Aparna: I like the way it's laid out here: there's public data and there's private data. Most of what we're seeing in real-world applications is people doing retrieval on their own proprietary, private data.
The risk is what happens when the model hallucinates. Say you have an application that answers "what is the coverage on some product you bought," or it's a support bot answering questions about a policy. There's real risk if it hallucinates and says the company supports something it doesn't — is the company now liable to provide that support because the support bot said it was covered? You can imagine a healthcare company whose bot claims a policy covers something it doesn't. So one set of risks is about the application use case on proprietary data: is it guaranteeing things that aren't true? A second is whether it's pulling information that isn't relevant to what the user is actually asking — maybe they're asking about product X and it's pulling content about product Y. And in some cases, what people are really nervous about is whether any customer information could be retrieved. Customer reviews might be fine, but does the system have access to things like which specific customers bought a product, and is that now going into, and available to, these applications? So there's just more risk with proprietary data about what you want to expose, and about pulling in things that aren't relevant to the question.

Trevor: I completely agree. From a lot of the customers I've talked to, it's a big concern, and going back to what you touched on about providing wrong information about a specific product offering, it can have pretty detrimental impacts down the road.

Cool. So now, diving into what causes these types of hallucinations, especially on the private data side. From talking with a lot of companies, and from seeing a lot of this ourselves, one of the main causes is poor ordering of the retrieved chunks. When you provide the LLM with a lot of information in the context window — say four different chunks of relevant information — and the correct answer is buried somewhere in the middle, it's a lot harder for the LLM to identify that relevant piece of information. That leads to the downstream effect of the model hallucinating based on a different piece of information. Aparna, anything you want to add on this slide before we move on?

Aparna: The big one is obviously: what are the right chunks that get extracted? That's where everything downstream is pulling the answer from. The most common thing I've seen is the user asking about one thing, but the embeddings decide something irrelevant is actually closer, so you retrieve information that isn't the right answer, and that's what gets used to generate the response. So when you're thinking about evaluation, evaluating the chunks — how relevant are they to actually answering the question the user is asking — is really important, probably more important than even how the LLM answers on top of those chunks.
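One way to act on that chunk-ordering point directly, before any eval runs, is to reorder retrieved chunks so the strongest matches sit at the edges of the context rather than buried in the middle. This is a generic heuristic sketch rather than anything Arize-specific, and the relevance scores below are made up:

```python
from typing import List, Tuple

def reorder_chunks(scored_chunks: List[Tuple[str, float]]) -> List[str]:
    """Place the highest-scoring chunks at the edges of the context.

    scored_chunks: (chunk_text, relevance_score) pairs from your retriever.
    Returns chunk texts ordered so the best chunks come first and last,
    and the weakest chunks end up in the middle.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    front, back = [], []
    for i, (chunk, _score) in enumerate(ranked):
        # Alternate: best chunk to the front, second best to the back, and so on.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example usage with made-up retrieval scores
chunks = [
    ("chunk about fuel economy", 0.91),
    ("chunk about color options", 0.42),
    ("chunk about warranty", 0.77),
    ("chunk about financing", 0.55),
]
context = "\n\n".join(reorder_chunks(chunks))
```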
Trevor: Yep. Cool — so now, diving into how we can detect these. Obviously, in a perfect world we'd have someone checking every piece of content the LLM generates, but that isn't scalable. You can think of it like traditional ML: we don't have someone evaluating every single prediction the model makes either. So what can we do instead to automate this? We'll dive into evals in the next few slides in more depth, but at a high level, we can actually use LLMs to evaluate the performance of LLMs. This is an example prompt you could use to detect hallucinations on private data. Without going into a lot of depth here, we're passing in the user query, the reference text that was retrieved, and the LLM response, and then asking the evaluation LLM whether that answer is factual or hallucinated based on the information provided in the reference text.

Aparna: And in this example there are, say, four chunks, so each one of those could be passed in and asked about. Say you're a merchant selling products and someone asks, "Do your headphones support noise cancellation?" The retriever pulls different chunks — chunk one might be product details about AirPods, chunk two about some other headphones, chunk three about an irrelevant product — and for each of those chunks you ask: is it relevant, does it answer whether the headphones support noise cancellation? What you're seeing here is a scale-out: however many chunks were retrieved, that's how many times you'll run this evaluation.
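As a rough sketch of what one of those judge calls can look like — assuming the OpenAI Python client (v1+), an illustrative judge prompt of our own rather than the exact Arize template, and GPT-4 as the judge model:

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

JUDGE_TEMPLATE = """You are given a question, a reference text, and an answer.
Decide whether the answer is supported by the reference text.
Respond with a single word: "factual" or "hallucinated".

Question: {query}
Reference text: {reference}
Answer: {response}
"""

def judge_hallucination(query: str, reference: str, response: str, model: str = "gpt-4") -> str:
    prompt = JUDGE_TEMPLATE.format(query=query, reference=reference, response=response)
    result = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judge deterministic
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content.strip().lower()

# Placeholder values standing in for your RAG pipeline's output
retrieved_chunks = [
    "The X100 headphones offer active noise cancellation and 30-hour battery life.",
    "The A20 earbuds come in three colors and include a charging case.",
]
llm_answer = "Yes, the X100 supports active noise cancellation."

# One call per retrieved chunk: the eval fans out with the number of chunks.
labels = [
    judge_hallucination("Do your headphones support noise cancellation?", chunk, llm_answer)
    for chunk in retrieved_chunks
]
```

Temperature 0 keeps the judge deterministic, which matters when you want eval results to be reproducible from run to run.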
Trevor: Cool. Taking a step back and looking at evals at a high level: at their core, we're using a separate evaluation LLM to evaluate the output of another LLM. We have input data — a query or prompt — which gets passed to an LLM to generate a response. That's the typical chatbot situation up here: a user asks a question about a product, we retrieve relevant context from a vector store, we pass that information along with the prompt to the LLM, it generates a response, and that response gets shown to the user. We then take the input data and query or prompt, along with the response, and pass them to a separate eval LLM to generate the eval. One important distinction here is the difference between task-based evals and model-based evals. Aparna actually wrote a really great blog post about this, so I'll hand it over to her to go over the difference.

Aparna: This one is actually pretty nuanced. You're probably hearing "LLM evals" everywhere — it's definitely the hot thing right now — but people use the term in a lot of different ways, and maybe the biggest distinction is that on one side there are model evals. That means you have the same dataset and the same prompt, and what you're testing is: will GPT-4 do better than GPT-3.5 Instruct? Will Llama do better than Vicuna? The thing being tested is which model you should use for the use case. OpenAI Evals, for example, is a pretty good library for model-based evals, and Hugging Face has its own Open LLM Leaderboard with all sorts of metrics used to stack-rank LLMs. What we're going to talk about for the rest of what Trevor is presenting is task-based evals. That means we keep the model consistent. Trevor, feel free to chime in, but what we're seeing with folks actually deploying LLMs is that they make a decision — I'm going to use GPT-4, or GPT-3.5, whatever the LLM is — and once that decision is made, what they care about is how the whole application works. What they're really testing in that scenario is the prompt template. In this image you're seeing more of a model-based example — same prompt, different models — but a task-based eval is the opposite: different prompts, same model. You feed in different questions, change the instruction a bit, change what the context looks like or how much context there is, and evaluate how well the LLM is doing on the task given the things you control: the prompt, the context, the parameters you're sending in. That's a lot more common for the data scientists and AI engineers developing these LLM applications. You might spend one iteration on the model eval to pick which LLM you're going to use, but the majority of your time goes into a continuous way to benchmark your application — and as Trevor will get into, there are a lot of different things you can benchmark your application on.

Trevor: I also like to think of it as taking a page out of the traditional ML book. You start with an experimentation phase where you might do LLM model evals to figure out which model is best for your use case. Once you have that in place, you move on to the next iteration: how do we evaluate this on an ongoing basis in production, once it's deployed and interacting with real users?

Cool. So these are the task-based evals Aparna was talking about — different evals for different tasks, hallucinations being one of them. We actually pre-tested a lot of these evals with benchmark datasets for a few common eval tasks — for example retrieval, Q&A, hallucinations, user frustration — and for the hallucination evals we tested on a Q&A dataset and a RAG dataset, which are both available as part of the eval library.
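To make the task-based side of that distinction concrete, here is a minimal sketch that holds the model fixed and compares two candidate application prompt templates on a tiny labeled task set. The templates, the substring check, and the example row are all illustrative stand-ins — in practice you would score the answers with an eval LLM as discussed above:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # the model decision is already made; only the template varies

CANDIDATE_TEMPLATES = {
    "terse":   "Answer using only the context below.\nContext: {context}\nQuestion: {question}",
    "guarded": ("Answer using only the context below. If the context does not contain "
                "the answer, say 'I don't know'.\nContext: {context}\nQuestion: {question}"),
}

def answer(template: str, context: str, question: str) -> str:
    prompt = template.format(context=context, question=question)
    out = client.chat.completions.create(
        model=MODEL, temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content

# Tiny hand-labeled task set; in practice this is your golden dataset.
task_set = [
    {"context": "The X100 headphones offer active noise cancellation and 30-hour battery life.",
     "question": "Do the X100 headphones support noise cancellation?",
     "expected_substring": "noise cancellation"},
]

for name, template in CANDIDATE_TEMPLATES.items():
    hits = sum(
        row["expected_substring"].lower() in answer(template, row["context"], row["question"]).lower()
        for row in task_set
    )
    print(f"template={name}: {hits}/{len(task_set)} answers contained the expected information")
```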
Trevor: Before diving deeper into the hallucination evals themselves, Aparna, was there anything you wanted to add about this pre-tested eval suite as a whole?

Aparna: I just dropped the docs page for the pre-tested evals into the chat, so if folks want to try them out, feel free — it's all open source, so you can see the code underneath. The key things I'd highlight about the library: one, it supports pre-tested evals, but it also supports custom templates. The pre-tested templates are the most common things we've seen across our customer base. People want to test retrieval because they have a retrieval application; hallucinations are important; user frustration — some teams with chatbots want to measure how frustrated the user is getting; there's support for toxicity — for example, some of our users have applications exposed to children and want to make sure nothing toxic comes through; and there's Q&A and summarization. But it also supports custom templates, so if there are nuanced, specific things that matter to your application, you can absolutely go in and modify them. All of these evals are benchmarked with data-science rigor: the datasets are available and the results are reproducible, so if you want to evaluate how well the eval templates themselves are doing, you can rerun and reproduce the results. That's really important, because benchmarking on a golden dataset is how you build intuition and confidence that, on average, you can expect a given eval template to hit, say, precision targets in the 70–90% range or F1 targets in the 70–85% range, and that when you later use it on your real production application, those results carry over. A couple more things: it's designed to work not just for offline benchmarking but also in production. You want that out of whatever eval you're building — easy, quick testing on DataFrames, but also the ability to run it in your actual Python pipelines — and it's designed for throughput, so when you're doing hundreds or thousands of these predictions, your evals keep up. So check it out. There's a lot of nuance in how you want an eval library to look, but today we'll focus specifically on hallucinations.

Trevor: Cool. So this is the prompt template for the hallucination eval. For the sake of time I won't bore everyone by reading through the whole thing, but at a high level the template says you're presented with a query, a reference text, and a response; it describes what a hallucination is; and then there's a spot at the end where we provide the input — the user query, the reference text, and the LLM response — and a binary decision, asking the LLM whether the answer above was factual or hallucinated based on the reference text and the user question.
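Here is a sketch of running that pre-tested template through the Phoenix evals library over a small DataFrame. Import paths, the column names the template expects ({input}, {reference}, {output}), and the OpenAIModel argument names have shifted across Phoenix releases (older versions exposed these under phoenix.experimental.evals), so treat the exact names below as assumptions and check the docs linked in the chat:

```python
import pandas as pd
# Import paths differ across Phoenix versions; this assumes the phoenix.evals layout.
from phoenix.evals import (
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
    OpenAIModel,
    llm_classify,
)

# Each row is one LLM call from your application: the user question,
# the retrieved reference text, and the generated answer.
df = pd.DataFrame({
    "input":     ["Do your headphones support noise cancellation?"],
    "reference": ["The X100 headphones offer active noise cancellation and 30-hour battery life."],
    "output":    ["Yes, the X100 supports active noise cancellation."],
})

rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())  # the allowed labels, e.g. factual / hallucinated
evals_df = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4", temperature=0),  # argument name varies by version
    rails=rails,  # constrain the judge to the allowed labels
)
print(evals_df["label"].value_counts())
```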
Trevor: For these pre-tested evals, we tested this hallucination eval with GPT-3.5 as well as GPT-4, and here you can see the confusion matrices for both. The reason we took this approach, rather than just looking at a metric like accuracy, is that accuracy might not give you the full picture of what the benchmark should be. Again, taking a page out of the traditional ML book: with unbalanced classes, accuracy isn't going to tell you much. You can imagine that in a chatbot scenario we want the model to hallucinate rarely and most responses to be factual, so it's important to look at metrics like precision, recall, and F1 score to get a better picture of how the model is actually performing. Here on the left-hand side is the confusion matrix for GPT-4, and on the right the confusion matrix for GPT-3.5. In this case GPT-4 performs a little better on the hallucination evaluation task, given that prompt template and the golden dataset we were using. Aparna, anything else to add here?

Aparna: No, this is great.

Trevor: Is there something in the chat? Yes, I believe both of the first two sessions were recorded and saved.

Cool. So, diving in: okay, we've detected that hallucinations are happening — what do we do, how do we fix them? There are three main recommended fixes we've seen work in these scenarios. The first is improving the retrieval or ranking of relevant documents given a user query. Without diving super deep into the weeds here, Aparna, I'm curious to hear from you: what are some of the most common methods of document retrieval or ranking you've seen across companies?

Aparna: First, there are a lot of different ways people are doing it. Some are using the pre-connected options — LlamaIndex, LangChain — and using those to orchestrate the connection to the vector stores. I've also seen people who built all of this without any orchestration framework. They typically do benchmark, and I think this is really important: they benchmark things like how many chunks are relevant and how to chunk their data. For example, you can chunk by characters or tokens, or by the cohesiveness of what a chunk actually means. Some people, during retrieval, pull not only the chunk identified as closest but also the nearby chunks — if you're on a document or page about a specific product, they might pull some of the other relevant products near it, because they might be similar. Some people do a retrieval and then re-rank the documents: an intermediate step scoring how close the question is to each specific chunk, followed by a separate re-ranking. So there are a lot of ways to optimize: the chunk size, and how many chunks — that's something I've also seen people iterate on. With too many, what ends up happening is that you give the LLM so much information that the chunks start to compete or even disagree with each other, and now the LLM has to be the one making the decision. There's actually a really good script I'll link in the chat; if you're building your own retrieval system, it looks at a few different things — the K value, how many chunks, the size of the chunks — and metrics like MRR, precision@K, and NDCG. It has a number of metrics that are helpful for teams starting to benchmark this, and I'll go through some of it during the demo as well.
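A from-scratch sketch of the kind of retrieval benchmarking Aparna is describing — not the script she links — given, for each query, the ranked list of retrieved chunks and a relevance flag for each (the flags can come from human annotation or an LLM relevance eval):

```python
from typing import List

def precision_at_k(relevant: List[bool], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = relevant[:k]
    return sum(top_k) / k if k else 0.0

def reciprocal_rank(relevant: List[bool]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

# One list per query: True/False for each retrieved chunk, in ranked order.
retrieval_results = [
    [True, False, False, True],   # answer found at rank 1
    [False, False, True, False],  # answer buried at rank 3
]

k = 4
print("precision@%d:" % k, sum(precision_at_k(r, k) for r in retrieval_results) / len(retrieval_results))
print("MRR:", sum(reciprocal_rank(r) for r in retrieval_results) / len(retrieval_results))
```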
Trevor: And Aparna, we actually have a question in the chat: what about context windows? It obviously depends on the use case, but any general thoughts on long context windows?

Aparna: Really good question. It's absolutely true that we need to fit the retrieved context into the context window, and this relates to the size of K. If there were no context window, you could argue: just pull as much information as possible, retrieve as many documents as you can, and send it all — the more information you give it, the better the answer should be. But through various experiments — ones we've run, ones LlamaIndex has run, and quite a few that vector-store vendors have published — more context provided doesn't necessarily mean a better answer. So I think it's really important to benchmark, for your use case, the right K — people call this the K value, the number of chunks you retrieve — that actually improves your metrics.

And I saw one more question in the chat: how important would metadata be in fixing hallucinations? That's a really good question too. We are seeing this rise in popularity right now: adding a structured-data filter. A lot of the vector stores now support this — Pinecone has it, for instance — where you can restrict the surface area you have to search over. For example, for product-specific questions, you might restrict to products available in the region the user is asking from, or to products available to that user. You restrict the surface area, and what that does is give you more meaningful content to feed back to the LLM.

Cool — I'll hand it back to you, Trevor, before I go into the demo.
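To make that metadata-filtering idea concrete, here is a sketch of restricting the search surface before similarity search. It assumes the v3+ Pinecone Python client, a hypothetical index called product-docs whose chunk text is stored in metadata, and an OpenAI embedding model as an illustrative choice; your vector store's filter syntax may differ:

```python
from openai import OpenAI
from pinecone import Pinecone  # assumes the v3+ Pinecone client; older clients used pinecone.init()

oai = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("product-docs")  # hypothetical index name

query = "Do your headphones support noise cancellation?"
# Embed the query with whatever model was used to build the index
# (text-embedding-3-small here is an illustrative choice).
query_vector = oai.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding

# Restrict the search surface with a metadata filter before similarity search:
# only chunks for products in the user's region and product line are even considered.
results = index.query(
    vector=query_vector,
    top_k=4,
    filter={"region": {"$eq": "us"}, "product_line": {"$eq": "headphones"}},
    include_metadata=True,
)
retrieved_chunks = [m.metadata["text"] for m in results.matches]  # assumes chunk text lives in metadata
```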
Trevor: Cool. Two other quick ones — I think we already touched on the second a bit. One is improving the chunking strategy and document quality. You can imagine garbage in equals garbage out: if you provide the LLM with a lot of really low-quality documents, you're probably going to get a pretty low-quality response. The last one is prompt engineering. I won't spend too much time on this since I'm sure a lot of people are already familiar with it, but it's about continuing to iterate on your prompts and finding that sweet spot after running some of these evaluations, seeing how your model is doing in production, and figuring out what you need to iterate on.

The last slide is about research on the subject and where the industry is heading. There's a lot of research out there on hallucinations today. Much of the past work focused on hallucinations on public data — hallucinations from the foundation models based on their training data, what we've been calling public data here. We're starting to see a lot more research on hallucinations with private data too, and one interesting industry direction is the emergence of guardrails. Back in April of this year, NVIDIA came out with their NeMo Guardrails package, and it actually includes a guardrail specifically for hallucinations. We're definitely seeing some companies trying to test this out, but not really seeing a lot of them make it to production yet. Aparna, have you seen anything different in terms of companies using guardrails in production?

Aparna: I think it's a really exciting new area. I haven't been seeing it in production yet, and my hypothesis is that it's another call you have to make before you can return an answer in a production application. You either have to have an application so prone to mistakes that the guardrail is clearly worth the investment, or it's too risky to roll out without one. What I've been seeing more people do is evaluate offline. It's definitely growing, and if it gets faster, or can be made non-blocking, it's an interesting area.

Trevor: I agree. And it looks like we have another question in the chat: is there a way to determine the accuracy of the LLM's self-evaluation of its own hallucinations and of the relevance of retrieved docs — is that where the benchmarks come in?

Aparna: I think I'll actually get into that when I hop over to the demo, but really good question, Christos. Part of what you're asking is: how do I trust my own evals, and how do I know the eval isn't prone to its own issues? We're going to jump into that in a second.

Trevor: Cool — I think that's the last of my slides, so Aparna, I'll hand it over to you.

Aparna: Awesome, let me pull up what I'm going to go through. Can you all see my screen? I have a really wide view right now, so if you can't see it, please let me know. The first thing I'll talk about might actually be helpful before I even jump into the hallucination evals, because I think Christos's question gets at this: how do I trust my own evals?

Trevor: Can you zoom in a little?

Aparna: Yes — can you see this now? Awesome. What I'll do is spend a little time talking about the process of building your own evals. In covering that, we'll first see how an eval is built, and then we can jump into the specific RAG example of how to run a hallucination eval. All of this is available in the docs, and we'll post the link in the chat.
So the first thing is choosing the metric. Today, hallucination is one of our pre-tested evals, but say you wanted to build your own: you typically need to understand what question you're trying to ask. If we were rebuilding the hallucination eval, the question would be: is the answer a hallucination — is the answer pulling information only from this specific context? You first need to come up with some sort of metric. It can be a binary answer, or it can have multiple outputs — I've seen people use labels like "partially relevant," "fully relevant," and "not relevant at all" — but you first decide what question you're going to ask.

The next step is maybe the most important part: building a golden dataset. If you're used to traditional ML, the parallel to think about is a training dataset, except it's the training dataset for your eval. What it typically looks like, sticking with the hallucination example: the query might be "do you support international calling?" — maybe you're a support application for something like Zoom. There's some context that's retrieved — this is your private data, your documents — and then, instead of asking "is the context relevant," we ask: is the generated answer based only on this context? Was the information pulled only from this document, or did the model make up any new information? That column is really the metric from step one. Typically you want a dataset with these labels. There are common public datasets, but for your own use case what we've typically seen people do is build this themselves to get a really good eval.

Once you've built the golden dataset — this next part is a lighter step — you need to decide which LLM you're using for the evaluation. We've seen people use different LLMs here: they might use GPT-3.5 to answer the question in the application but a different LLM to do the eval itself. It really depends on your use case; we've seen people use the same one and we've seen people use different ones. The point is just that these are two separate LLM calls: one produces the response back to the user, and one produces the evaluation score for that response.
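As a sketch, a golden dataset for this eval can be as small as a labeled DataFrame with those same components; the rows below are made up, and the label column is whatever your metric from step one asks for:

```python
import pandas as pd

# Each row: what the user asked, what was retrieved, what the app answered,
# and a human label for whether that answer is grounded in the reference.
golden_df = pd.DataFrame([
    {"query": "Do you support international calling?",
     "reference": "Plans include unlimited domestic calls. International calling is a paid add-on.",
     "response": "Yes, international calling is available as a paid add-on.",
     "label": "factual"},
    {"query": "Do you support international calling?",
     "reference": "Plans include unlimited domestic calls. International calling is a paid add-on.",
     "response": "Yes, all plans include free international calling.",
     "label": "hallucinated"},
])
golden_df.to_csv("hallucination_golden_dataset.csv", index=False)
```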
Next, you build your eval template. This is really the most core component, and you'll probably iterate on it over and over. The things to be explicit about are: what's the input, what are you actually asking, and what are the possible output formats? For the hallucination eval Trevor was going through — let me go into the hallucination Colab and read this one out so you can see it. The hallucination template says: in this task you will be presented with a query, a reference text, and an answer; the answer was generated for the question based on the reference text. So those are the three components: the query, the reference text, and the answer. The answer may contain false information; you must use the reference text to determine whether the answer contains false information — whether the answer is a hallucination of facts. So in this specific one you have to be really clear about the inputs: this is the question from the user, this is the answer that was given to the user, and this is the reference text. The one I was walking you through in the docs a moment ago is a little different — it's more of a relevance eval, not a hallucination eval — so for that one I don't actually need the response, but for the hallucination eval you do need the actual response. Depending on what your metric is, you'll need different components to answer the question. Continuing with the template: your objective is to determine whether the answer is a hallucination — a hallucination in this context refers to an answer that is not based on the reference text or assumes information that is not available in the reference text. So it clearly states the inputs and the instruction, and then it defines what the output should look like: your response should be a single word, either "factual" or "hallucinated," and should not include any other text or characters. "Hallucinated" indicates the answer provides factually inaccurate information relative to the reference text; "factual" indicates the answer is correct, relevant to the reference text, and does not contain made-up information. So again: what's the input, what are you asking, what's the objective, and clearly state the possible output formats. Depending on your metric — the one in the docs is a retrieval one, where all I care about is whether the retrieved context is relevant to the user's question — you just need to determine what parts to feed into the eval template and what the output should look like.
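Pulling those pieces together, a custom eval template is just a string with explicit inputs and an explicitly constrained output. Here is an illustrative relevance-style template of our own (not one of the pre-tested Phoenix templates):

```python
CUSTOM_RELEVANCE_TEMPLATE = """In this task you will be given a question and a reference text.
Your objective is to decide whether the reference text contains information that helps
answer the question.

[BEGIN DATA]
Question: {query}
Reference text: {reference}
[END DATA]

Your response must be a single word, either "relevant" or "irrelevant", with no other
text or characters. "relevant" means the reference text contains information that can
answer the question. "irrelevant" means it does not.
"""

# Example of rendering the template for one row of a golden dataset
prompt = CUSTOM_RELEVANCE_TEMPLATE.format(
    query="Do you support international calling?",
    reference="Plans include unlimited domestic calls. International calling is a paid add-on.",
)
```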
We had two pretty good questions in the chat. The first one: are we generating the eval golden datasets with GPT-4, or manually, by humans? Good question — it depends on the metric. Some of these use public datasets: the retrieval eval, for instance, is tested on MS MARCO, and there's WikiQA, so some are really common public datasets. Some are a combination — fully human, or human labels with some additional GPT-4 modifications. Honestly, though, it's typically good to start with a human-labeled set or a public dataset. If your application isn't deployed and live yet, you might be forced to start by simulating data or building a synthetic dataset, and in that case you might start with GPT-4. If you have an application already live in production, you probably have real, human-generated traffic, and you can collect labels on that. So there's typically a preference for human-generated data — you just want it to be as close as possible to what's actually going to happen in the real world. That said, we do see a lot of people who, because their application isn't deployed yet, build synthetic datasets just to have a starting base. Great question.

Okay, let me keep going — actually, I see one more question in the chat: curious if we've seen anything like golden examples for retrieving few-shot examples that are most related to the user query. I think I need to understand the question a little more, so if you can add anything to it, that would be super helpful.

Okay, continuing. We went through building the golden dataset, we went through deciding which eval LLM, and we talked about building the actual eval template. You can use the pre-tested evals, but I recommend everyone build their own at least once, just to build intuition. Once you've built your eval template, you want to test its efficacy, so you run your eval on your golden dataset. The reason to run it on the golden dataset is that you already have labels there, so you can compare side by side what the actual answer should have been versus the response the eval LLM generated, and then run all sorts of metrics on how accurate your LLM-generated labels were versus the true ground truth. I highly recommend not just using accuracy — the blog post has really good explanations of this. Accuracy only gives you overall correctness, but what you really care about takes simple data-science rigor around precision and recall: does it commonly call something not a hallucination when it is one? Is it too strict, is it too loose? These are good ways of understanding the edges of your eval, and going back to Christos's original question, this is what builds confidence in your eval template: you've tested it against ground truth and you start to build intuition — it's doing pretty well on precision, maybe it needs improvement on recall — and then you can go back and tweak the template. For example, with a hallucination template, maybe it marks something as a hallucination whenever slightly different words or synonyms are used instead of the exact wording in the reference text. You can put that into the template explicitly: it's okay to use slightly different words as long as the information is equivalent to what's in the reference text. Or, if you want to be stricter: if the answer introduces any new concept that isn't in the reference text at all, mark it as a hallucination. So you can go back, iterate, build your intuition, and then use that eval on your actual application — which is great, because in the real world you won't have this kind of ground truth on all of your production data. Now that you've built your intuition, you can go run the eval on production data.
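A sketch of that benchmarking step with scikit-learn, using toy labels in place of a real golden dataset and a real eval run:

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Ground-truth labels from the golden dataset, and the labels the eval LLM
# produced for the same rows (e.g., the "label" column returned by llm_classify).
y_true = pd.Series(["factual", "hallucinated", "hallucinated", "factual"])
y_pred = pd.Series(["factual", "hallucinated", "factual", "factual"])

labels = ["factual", "hallucinated"]
print(confusion_matrix(y_true, y_pred, labels=labels))
# Per-class precision/recall/F1 show whether the template is too strict or too loose
# on the "hallucinated" class, which overall accuracy hides when classes are unbalanced.
print(classification_report(y_true, y_pred, labels=labels, digits=2))
```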
Okay — Christos has one more question: are there any strategies, such as adding a scalar weight of importance to human examples, or is that case-by-case engineering art? These are all really great questions, Christos. Yes — similar to what we've seen in the typical data-science world, you can have some samples that are more important or more heavily weighted. It all comes back down to the benchmark metrics you want to use; if you want to weight some examples more than others, that's totally up to you, your golden dataset, and your use case. The most important thing is knowing the right benchmark metrics. For most people, plain accuracy doesn't quite cut it, but precision and recall, or weighted precision and recall, are all good options.

So that's how you build an eval: build your golden dataset, decide what the eval template should be, and run the benchmark. And you'll notice — let me go to the hallucination one specifically now — if you want to try any of our pre-tested evals, you'll see the template, but you'll also see the benchmarked results on the datasets I showed on the previous page, like MS MARCO or WikiQA. You'll see here that we ran these with GPT-4 and also with GPT-3.5, and they're slightly different: on the hallucinated class, GPT-4 did slightly better, 92% precision versus 89% for GPT-3.5. Probably not a surprise to anyone that GPT-4 did slightly better. This is also useful when you're doing your own benchmarking, for knowing how accurate is accurate enough for you — because you also need to consider cost. Is it okay to use GPT-3.5, given that the margin of error isn't that big, on your production data? Anything you'd add to this, Trevor?

Trevor: No, I think that was a very comprehensive overview.

Aparna: Cool. The big thing I'd recommend: most people will probably start by running the pre-tested evals — they're already benchmarked and already supported on a number of major model types; for instance, they all work with GPT-4, PaLM, AWS Bedrock. But as your application gets more nuanced, if you notice a specific part where you're consistently seeing errors and you want to start generating evals for that, that would be a great time to build your own eval and your own datasets — and you can contribute back to the Phoenix library if you'd like.

So now, going specifically to the hallucination Colab: once you've actually built your eval template, this notebook shows you how to run it. You import Arize Phoenix and pull down the benchmark dataset — in this case we're just showing how to run it on our benchmark dataset.
Typically, though, this is a good illustration of what you want from your evals — and this is what Trevor was highlighting too: you want your evals to be able to run in different environments. Sorry, let me go back. You want to be able to use the eval when you're doing the benchmarking, when you're actually building it. You want to be able to use it when you're developing — now that you have an eval metric, you want to evaluate the application you're building, and that might be before it even launches; some people building chatbots on their proprietary data want to build their eval and test it in a notebook against the application. And lastly, you'd ideally like to run that same eval once the application is deployed in production. The same eval can run in all three of these environments, and for this example I'm just going to show how it runs on the benchmark dataset.

So we pull down the benchmark dataset — I'm running the hallucination eval dataset — and this is an example of what it looks like. Like the image from earlier, it has the context or reference here, the query over here, the response, and then the hallucination label, which is the ground truth we were talking about. Those are the four columns that really matter. We've generated our template, so I can print it out. You will need your OpenAI key, or the key for whatever LLM you want to use — in this case I'm using GPT-4. Then I run the eval itself, and once it has run, I display the resulting confusion matrix. That's how you all can test the hallucination template on your own datasets as well.

I think that's it for today — at least, that's everything in the example we had. If you have any questions — I know this is a super exciting and fast-growing space — feel free to drop them to us in the Slack group, check out the docs, check out the templates, and try it yourself. Any and all feedback is welcome. Anything else you want to add to close us out, Trevor?

Trevor: No — thanks so much, everyone, for joining. I had a great time going through all the eval material. Really appreciate everyone being here.

Aparna: Awesome, thanks everyone for joining. See you all at the next one.