In this episode we'll talk about LLM hallucinations and how to detect and mitigate them. This episode will be particularly useful if you're building retrieval-augmented generation systems, as we will cover the metrics from the RAGAS paper. These metrics can be used to evaluate different aspects of a RAG pipeline, such as faithfulness, answer relevance and context relevance, and we'll see how to use some of these metrics as effective real-time hallucination checks that you can deploy in production right now.

I think a good place to start is to reframe what we mean by hallucinations and why they happen in all LLMs. It's common to personify LLMs, saying that they hallucinate when they produce false or misleading information. But LLMs don't think or see the world as we do: they're just statistical models trained to generate likely completions based on their training data, and reality doesn't always follow smooth probability distributions. Some statements are true but unlikely, and some are false but appear plausible. We call the latter hallucinations.

For example, Air Canada had to issue a refund after its chatbot hallucinated the airline's bereavement policy. A customer seeking a discounted fare after their grandmother passed away was told by the LLM that they could book a flight immediately and request a refund later. Although this sounds very reasonable and like a likely policy for an airline, it turned out Air Canada's actual policy did not allow refunds after booking.

In a recent ask-me-anything on Reddit, even Mark Chen, SVP of Research at OpenAI, shared that hallucinations are indeed a hard problem to solve for LLMs. There is no magic solution on the horizon that will eliminate them completely. But in cases where we actually have access to some facts that we know to be true, there are ways to evaluate and mitigate hallucinations: we can essentially check whether the LLM's response is grounded in those facts. And this is exactly the situation we are in when we build RAG pipelines, as the context we retrieve can function as the ground truth against which to check the LLM's answer.

In a RAG setup, an application retrieves document chunks relevant to the user's query by looking into a knowledge base. The LLM is then prompted to use the retrieved context to generate a response to the query. By itself, this way of prompting already somewhat reduces the chances of hallucinations, as the LLM is less likely to produce a response that isn't in line with the provided documents.
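To make that concrete, here's a minimal sketch of the retrieve-then-prompt pattern just described. It assumes you already have a retriever that returns relevant chunks for a query and an OpenAI-style chat client; the model name, prompt wording, and the helper names are illustrative assumptions, not a specific framework's API.

```python
# Minimal retrieve-then-prompt sketch (illustrative assumptions, not a specific framework's API).
from openai import OpenAI

client = OpenAI()

def answer_with_context(question: str, retrieved_chunks: list[str]) -> str:
    """Prompt the LLM to answer using only the retrieved context."""
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any general-purpose chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage (my_retriever is a hypothetical retriever you already have):
# chunks = my_retriever.search(question, k=5)
# print(answer_with_context(question, chunks))
```

The "answer only from the context" instruction is what lets the retrieved documents serve as the ground truth for the checks we'll discuss next.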
However, if we want to further reduce hallucinations, we can look at the answer and check its faithfulness, or groundedness: whether the LLM's output is grounded in the context it was given. Essentially, we're asking if the response aligns with the retrieved documents. Faithfulness, or groundedness, is the most obvious property you can check to determine whether an LLM's answer contains hallucinations, but it's not the only thing to consider, as there are other factors that can influence the quality of a response.

The RAGAS paper introduced a broader framework for evaluating RAG pipelines. The authors defined metrics to assess different aspects of the system's performance: besides faithfulness, they define two other metrics for evaluating RAG systems, context relevance and answer relevance. Context relevance evaluates the effectiveness of the retrieval pipeline, asking whether the retrieved content is relevant to the query the user asked. It doesn't matter how good an LLM is: if the documents we give the LLM to answer a question are not relevant, the quality of the answer will be low. Likewise, answer relevance asks whether the generated answer directly addresses the user's question or whether it contains redundant or irrelevant information.

Just a quick note to avoid confusion: since the publication of the original RAGAS paper, some of these metrics have been renamed, and more metrics have been added and refined. Here we'll just cover the original ones, as these are the most useful and relevant.

Establishing what we want to measure is only one part of the challenge; figuring out how to measure it effectively is the other. For metrics like faithfulness, context relevance and answer relevance, we need methods to evaluate these aspects reliably. The main tools in our toolbox are sentence embedding models and LLMs themselves. If you think about it, you can prompt LLMs to act as judges, asking them to verify certain properties of an answer. Embedding models, on the other hand, allow us to measure the semantic similarity between text snippets, helping us check whether those snippets are similar in meaning.

Let's start with faithfulness. One way to assess it is with sentence embeddings: we embed the generated answer and the context the LLM was given to generate that answer, and then we calculate the similarity between these embeddings, typically using something like cosine distance. If the embeddings are similar, it suggests that the answer generated by the LLM is similar in meaning to the context we provided. This method is simple, but it has limitations, especially when dealing with subtle or complex deviations in meaning.

A more commonly used tool is an LLM as a judge. In this approach you prompt an LLM to score an answer's faithfulness to the provided context, and it is common to also prompt the LLM to generate natural-language reasoning for the score it gives. You don't have to use the same LLM that generated the answer: in theory, any general-purpose LLM, such as GPT-4, can be used as a judge. Large general-purpose LLMs perform reasonably well as judges, even when evaluating output they themselves produced, but you could also use a dedicated model like Lynx, which has been fine-tuned for hallucination detection. Lynx is designed to perform well in challenging domains like medicine and finance, where hallucinations can be harder to spot. According to benchmarks, Lynx outperforms general-purpose models like GPT-4 on these specialized tasks, making it a strong tool for hallucination detection.

In RAGAS, faithfulness checks use a similar LLM-as-a-judge approach, but with two steps. First, the LLM is prompted to break down the answer to verify into smaller factual statements. Then the LLM is asked to verify each statement against the provided context. This two-step method provides a more detailed evaluation of whether the answer is grounded in the context, giving us a clearer view of its faithfulness.

To check answer relevance, RAGAS uses a similar two-step strategy. First it prompts the LLM to generate multiple questions that could be addressed by the given answer; then embeddings for these generated questions are compared to the original question the user asked. If the average similarity score is high, it suggests that the answer was relevant to the original question. A low score might indicate that the LLM used to generate the answer is not capable enough for that task or domain, and a different LLM should be considered.

Finally, context relevance is checked by prompting an LLM to extract the key sentences from the retrieved context that are relevant to answering the question. The context relevance score is then calculated as the ratio of relevant sentences to the total number of sentences in the context, essentially asking how much of the retrieved content was actually relevant to answering the question. A low context relevance score could signal that the retrieval process needs adjustment. This might mean evaluating different embedding models to see if they better capture meaning or similarity within the specific domain, or it could suggest the need for a different chunking strategy, as overly large chunks might include too much irrelevant information.
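To make these checks concrete, here's a simplified sketch of the two faithfulness approaches described above: a cheap embedding-similarity signal and a two-step judge that breaks the answer into statements and verifies each one against the context. The prompts, model names, and parsing are illustrative assumptions; this is the general idea, not the actual RAGAS implementation.

```python
# Two hedged faithfulness checks: (1) embedding similarity, (2) a two-step LLM judge.
# Illustrative only; not the ragas library's implementation.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence-embedding model
JUDGE_MODEL = "gpt-4o-mini"  # assumption: any capable general-purpose model

def embedding_faithfulness(answer: str, context: str) -> float:
    """Cosine similarity between the answer and the context (rough, cheap signal)."""
    answer_emb, context_emb = embedder.encode([answer, context], convert_to_tensor=True)
    return util.cos_sim(answer_emb, context_emb).item()

def _judge(prompt: str) -> str:
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def faithfulness_score(answer: str, context: str) -> float:
    """Two-step check: decompose the answer into statements, verify each against the context."""
    # Step 1: break the answer into short, self-contained factual statements.
    raw = _judge(
        "Break the following answer into short, self-contained factual statements, "
        f"one per line:\n\n{answer}"
    )
    statements = [line.strip(" -*") for line in raw.splitlines() if line.strip()]
    if not statements:
        return 0.0
    # Step 2: ask the judge whether each statement is supported by the context.
    supported = 0
    for statement in statements:
        verdict = _judge(
            "Answer YES or NO only. Is the following statement supported by the context?\n\n"
            f"Context:\n{context}\n\nStatement: {statement}"
        )
        if verdict.strip().upper().startswith("YES"):
            supported += 1
    # Faithfulness = fraction of statements grounded in the retrieved context.
    return supported / len(statements)
```

The answer relevance and context relevance checks follow the same pattern: one or two extra judge prompts (generate questions from the answer, or extract the relevant sentences from the context) plus a similarity or ratio calculation on top.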
The metrics we saw are said to be reference-free, meaning that we can calculate them without needing a correct reference answer. This is important because, of course, in production, when we are answering a user's question, we don't already know the answer: that's why we're using an LLM to answer it in the first place.

Now, there are two main ways you can use these metrics in your RAG pipelines. The first is in real time, as we are generating answers for users; the second is during development, to fine-tune our RAG pipelines and improve them.

The most common metric to apply in real time is faithfulness, or groundedness. Whenever the system generates an answer, if the answer is faithful to the provided context, you show it to the user; if not, you display a message saying that the answer cannot be generated at this time. This check reduces the chances of showing a potentially hallucinated answer to the user, and it's a great safeguard for production systems; there's a small sketch of this guardrail at the end of the episode.

The second way to use these metrics is, of course, during development. You can evaluate your RAG pipeline by testing it with a set of expected user questions and then reviewing faithfulness, context relevance and answer relevance. This helps you find areas to improve, whether that means switching embedding models, trying different LLMs or adjusting the chunking strategy, before deploying the system.

So, in summary: hallucinations are a byproduct of how LLMs work, but they can be somewhat managed. Faithfulness checks, whether through LLMs acting as judges or through embedding-based methods, are essential tools for detecting hallucinations. What's more, we saw how frameworks like RAGAS provide a set of metrics beyond faithfulness, such as context relevance and answer relevance, that can be used to evaluate and improve the performance of a RAG pipeline. If you want to see more of these controls in practice, do subscribe to the channel so that you get notified of the next video.
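In the meantime, if you want to experiment with the real-time check described above, here's the promised minimal sketch of that guardrail. It assumes you already have a generation function and a faithfulness scorer, such as the ones sketched earlier; the threshold and fallback message are illustrative assumptions you'd tune for your own system.

```python
# A hedged sketch of the production guardrail: only show answers whose faithfulness
# score clears a threshold. The generate/score functions are passed in, e.g.
# answer_with_context and faithfulness_score from the earlier sketches; the 0.8
# threshold and fallback text are assumptions, not recommended values.
from typing import Callable

FAITHFULNESS_THRESHOLD = 0.8
FALLBACK_MESSAGE = "Sorry, I can't generate a reliable answer to that question right now."

def guarded_answer(
    question: str,
    retrieved_chunks: list[str],
    generate: Callable[[str, list[str]], str],
    score_faithfulness: Callable[[str, str], float],
) -> str:
    """Generate an answer, then only return it if it is grounded in the retrieved context."""
    answer = generate(question, retrieved_chunks)
    context = "\n\n".join(retrieved_chunks)
    score = score_faithfulness(answer, context)
    return answer if score >= FAITHFULNESS_THRESHOLD else FALLBACK_MESSAGE

# Example wiring, using the helpers from the earlier sketches:
# reply = guarded_answer(question, chunks, answer_with_context, faithfulness_score)
```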