Transcript for:
Detecting Hallucinations in AI Models

Hi, I'm Alex Thomas, the principal data scientist here at Wisecube, and today we're going to be talking about detecting hallucinations and the measurement systems we can use to do that.

A little bit about us: I'm the principal data scientist at Wisecube. We've been working together for a few years, and we've worked with various technologies across different domains. Today I'm going to be talking about an issue with AI, and specifically about how we actually detect this issue and how we know we're detecting it correctly.

LLMs have been a really big improvement in processing natural language and have allowed people to do a lot of really cool things, but there's a problem they can have: they can hallucinate. What we mean by hallucination here is that the model is basically making things up; we'll talk about that in more detail. That can lead to spreading false information, giving people the wrong information or bad advice, decreasing customers' trust in your system, and hurting your brand name. LLMs also have inherent biases, and there's a concept called framing, where the user can accidentally push the LLM into giving the wrong result by framing a question a particular way. These are all problems that can cause the LLM application you're building to produce bad or harmful results, and some of these mishaps have been expensive or very embarrassing. In the Air Canada example, their chatbot invented a refund policy and they were forced to honor it. There are a bunch of other examples, and we'll likely see more in the news.

These hallucinations are different from other generation problems in the past. Before the rise of LLMs, language generation problems were usually around disfluency, that is, producing poor-quality text. Hallucinations, by contrast, are not only fluent language; to someone who isn't an expert in the subject, or who otherwise wouldn't know better, a lot of these outputs look reasonable. That can mean serious brand damage and legal concerns.

So how do we measure this problem? First I need to establish some terminology, because we're going to be talking about measuring applications and then about measuring those measurement systems, and that could get a bit confusing. When I refer to the application, I'm referring to an LLM-based application: a chatbot, a summarization system, etc. When I talk about the measurement system, I'm talking about some system for measuring such an application.

Let's break the setting down into the different tasks that LLM applications can be doing. This is a simplification; there are more ways you could configure this and other things you could consider, but these are the canonical examples. We start with zero-context Q&A. That's the very basic chatbot experience: the person gives some small input, often a question, and usually gets a small output back, though the size can vary, and there's no external information used.
Then there's what we're calling a RAG Q&A system. This is question answering where some context is provided. RAG stands for retrieval-augmented generation: we use the user input to retrieve relevant text, whole documents or perhaps just passages, and that's passed in with the question to the LLM to produce the answer. This is more complex, but the idea is to make the answer more grounded and reduce the amount of hallucination compared to zero-context Q&A. Then we have summarization, which is a somewhat different task. In this case there isn't really a question or user input, though we'll talk about how you can optionally use one for some things. There's a large input and generally a smaller output, and it has a much more complex notion of accuracy.

Let's look at some of these use cases. In zero-context Q&A we just have a question and an answer. It's difficult to measure hallucination here because you need to find some external data source to check accuracy or factuality, and as I said before, it's very susceptible to this notion of user framing. A lot of the newer LLMs have actually been trained specifically to combat this. If someone asks, "why is South Carolina barbecue worse than Texas barbecue," that frames the question as if that were established, and the LLM might be inclined to give an agreeable answer, one that asserts that South Carolina barbecue is worse than Texas barbecue. To measure zero-context Q&A you need to find an external dataset. That can be done in a couple of ways: you can find some external structured dataset and try to match claims in the answer against that structured data, or you can essentially turn it back into a RAG-like situation, where you look for relevant documents and then measure how well the answers match those documents. It's one of the more difficult cases.

RAG Q&A is a very popular use case for LLM applications, but now we have three different sources of factual errors. You have LLM hallucination, where the LLM just makes up something that isn't true; you have errors in the retrieved data that the LLM faithfully used; and you have incorrectly retrieved data. For example, someone asks "what is a bank," intending the geographical term, the bank of a river, but the retrieval brings back a bunch of documents about financial institutions. The answer is then factually incorrect for the geographical concept of a bank, but the problem was incorrectly retrieved information. In this case, for detecting hallucinations, it's probably better to look for consistency with the retrieved data. Remember how RAG Q&A works: you get the question, a retrieval system fetches the contexts, and then the LLM generates the answer. If you want to find problems with your LLM, you should focus on what happens after that retrieval step, making sure the answer is correct with respect to the data that was retrieved.
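To make that concrete, here's a minimal sketch of a consistency check of a RAG answer against its retrieved context, using an LLM-as-judge prompt. The model name, prompt wording, and helper function are illustrative assumptions, not the exact system described in this talk.

```python
# A minimal sketch of checking a RAG answer against its retrieved context.
# Assumptions: the OpenAI Python client is installed and OPENAI_API_KEY is set;
# the model name and prompt wording are illustrative, not a specific product's prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking a question-answering system.
Given the retrieved context and the generated answer, decide whether every claim
in the answer is supported by the context. Reply with exactly one word:
SUPPORTED or UNSUPPORTED."""

def is_consistent_with_context(question: str, context: str, answer: str) -> bool:
    """Return True if the judge model says the answer is grounded in the context."""
    response = client.chat.completions.create(
        model="gpt-4o",  # the talk compares GPT-4o and Llama 3.1 70B as judge models
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nContext: {context}\n\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("SUPPORTED")
```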
The third scenario is summarization. Again, this is a larger one: the model takes in some data and attempts to reduce it to make the summary, so it can be difficult to measure in a different way from the previous cases. You also have a problem here where the user may provide something that is known to be factually wrong but that they want summarized anyway. For example, perhaps someone is analyzing news articles that contain misinformation, and they submit such an article. If you want to measure the quality of summarization for this task, you wouldn't want the evaluation to flag factually wrong statements that come from the article, because capturing those is exactly what the user is interested in. So you want to make sure the summarization is faithful to the references it was given.

This is also a case where we can use something to focus the evaluation, namely a question. As an example, say we have some patient documents and we want to analyze hallucination, but we're really concerned with what the summary says about drugs and prescriptions. Depending on your strategy, and I'll get to the concept of strategy soon, we can include a question, essentially saying "here's a question for the summary," and that can be used to focus the evaluation on that topic.
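Here's a minimal sketch of what building such a question-focused summarization evaluation prompt might look like. The wording, structure, and function name are illustrative assumptions, not the exact prompt used by any of the strategies compared in this talk.

```python
# A minimal sketch of building a question-focused summarization evaluation prompt.
# The wording and structure are illustrative assumptions.
def build_summary_eval_messages(reference: str, summary: str,
                                focus_question: str | None = None) -> list[dict]:
    """Build chat messages for a judge LLM. If a focus question is given
    (e.g. about drugs and prescriptions), the evaluation is focused on that topic."""
    instructions = (
        "Judge whether the summary is faithful to the reference text. "
        "Do not penalize claims that are wrong in the real world if they "
        "faithfully reflect the reference."
    )
    if focus_question:
        instructions += f" Focus your evaluation on this question: {focus_question}"
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": f"Reference:\n{reference}\n\nSummary:\n{summary}"},
    ]

messages = build_summary_eval_messages(
    reference="(patient documents...)",
    summary="(generated summary...)",
    focus_question="What does the summary say about the patient's drugs and prescriptions?",
)
```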
So we have these different use cases, and now we want to establish benchmarks and metrics for measuring these applications. This is still a new domain. It's not like the NLP, natural language processing, benchmarks of the past; we don't have decades of experience, research articles, and datasets that we can use to do this. There aren't even generally accepted metrics; there's no convention saying "this is the metric you use to measure summarization." There are such metrics from pre-LLM summarization work, but they don't really make sense for modern summarization, because so many of those metrics are also trying to capture fluency, and LLMs do such a great job at fluency that there's really no point in capturing it. You want to capture the factual accuracy. And if that's the state of measuring applications, you can imagine how much less convention there is around measuring the measurement systems themselves.

So let's talk about the measurement strategies and the models we're going to use. We've done a few experiments where we've taken different strategies, including one we built ourselves, and a couple of different models, and we're going to show how you can compare these to find which measurement system is best for which situation.

First, the Pythia strategy, which is ours. It's based on extracting claims in the form of triples and then evaluating those. The idea is that if you just give the inputs and outputs as raw text to be evaluated by an LLM, you get less actionable information than you do from triples that are then labeled as correct or incorrect. You can look at what sorts of hallucinations there are. If you're worried about your model and you have a domain-specific problem, this could point out that maybe you did not do sufficient fine-tuning in that domain, if the extracted claims that come out wrong are the domain-specific ones. So what you do is extract these claims in the form of triples. Here's an example: a 58-year-old man with a history of hypertension and atrial fibrillation is on Coumadin, and has NIDDM, which is an abbreviation that essentially means type 2 diabetes. We have two triples extracted from that: (patient, has diagnosis, type 2 diabetes), which is implied by that sentence, and (Coumadin, treats, atrial fibrillation).

Once we have these claims, we want to classify them, to categorize them. We have the reference, which we assign a certain level of confidence as being correct, and we have the response, the summary or answer that was generated. Entailments are claims we can match between the two that agree with each other. Contradictions are claims we can match on what they're talking about, between the response and the reference, but that contradict each other. Missing claims are things that were not in the output but were in the reference or context. Neutral claims are things that are in the generated response but not in the original reference. Hallucinations are going to be in the contradiction and neutral categories. Contradictions are obviously harmful to the quality of your system, but neutrals are not necessarily harmful: there may be factual sentences generated by the LLM in your application that are hallucinations in the sense that they're not from the reference, but they might still be correct. Those are a little harder to measure. So our metric right now focuses on entailment and contradiction, and optionally a notion of reliability. What we do is essentially take a harmonic mean of the entailment rate and the non-contradiction rate, that is, one minus the rate of contradiction. Reliability, if it's included, means we take those neutrals and try to verify them. There are different ways you can modify these metrics, but this is what we use when producing an evaluation of a generated response.

Another strategy we're going to compare against is the grading strategy. Here we just have a prompt asking the model to grade the response with an A through F grade. Right now we're ignoring the pluses and minuses, the A+ and B-, and then we map the grade onto a common scale to get a score. It's a simplistic approach that we want to compare against the value of claim extraction.
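Here's a minimal sketch of how those two scores could be computed. The harmonic mean follows the description above; the exact grade-to-number mapping is an illustrative assumption rather than the precise scale used in our experiments.

```python
# A minimal sketch of the two scoring approaches just described.
# The grade-to-number mapping is an illustrative assumption.
from collections import Counter

def claim_accuracy(labels: list[str]) -> float:
    """Harmonic mean of the entailment rate and the non-contradiction rate,
    given per-claim labels: 'entailment', 'contradiction', 'neutral', 'missing'."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entailment = counts["entailment"] / total
    non_contradiction = 1.0 - counts["contradiction"] / total
    if entailment + non_contradiction == 0:
        return 0.0
    return 2 * entailment * non_contradiction / (entailment + non_contradiction)

# Grading strategy: map an A-F letter grade onto a 0-1 scale (pluses/minuses ignored).
GRADE_SCALE = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}

def grade_to_score(letter: str) -> float:
    return GRADE_SCALE[letter.strip().upper()[0]]

print(claim_accuracy(["entailment", "entailment", "neutral", "contradiction"]))  # 0.6
print(grade_to_score("B"))  # 0.75
```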
We're also going to look at a popular, newer project out there called Lynx. They have a similarly simple prompt, but instead of giving a grade it just gives a pass or fail, and this notion of the granularity of the label, of the prediction, actually turns out to be important. They also use the term faithfulness. They specifically exclude the question from the prompt, which would hopefully minimize the effect of any priming done by the question, and it's built around their dataset, a dataset we use later on here, which has a binary label. The models we look at are GPT-4o and Llama 3.1 70B. There are other models we're looking at right now, and there's a list of some of them, but this presentation focuses on just those two.

So now that we've talked about how we detect hallucinations, we need to talk about how we measure that detection. This is the meta step, where we're looking to measure the measurement system. Again, comparing to old NLP benchmarking, there are pros and cons to how benchmarking was done. There are a lot of datasets out there that are made well. Some are specialized by task: named entity recognition, sentence segmentation, coreference resolution, with different datasets for each; sometimes they're built from the same text that's been annotated for these particular tasks. There are also datasets specialized by domain: medical, clinical, financial. And for these different tasks there are known metrics and conventions around them. The downside is that, because these datasets are specialized, and some of them are not specialized, if you have a different domain than what's available for the task you're looking to measure, if you can't find a dataset for your task and domain, you're not going to be sure whether the measurement of your system is accurate. If I have a dataset of social media posts but my text is academic articles, and the NLP system I'm building is for academic articles, then looking at how I perform on that generic social media set does not give me reliable information about how I'm going to perform on the scientific articles. Also, some of these datasets are really old, as in from the 70s or 80s, and most are very clean, in the sense that there's been a lot of pre-processing of the data. One of the worst things, as far as testing against natural data goes, is that some of these datasets come pre-tokenized, with extra spaces added in to make tokenization easier.

When we measure the measurement systems for the new LLM approaches, we run into things we didn't need to deal with for the older NLP processes, because there the conventions existed: the ways we had to measure were known, trusted, and academically established, and the tasks we were doing were much more clearly defined. You can very clearly define what named entity recognition is: find me all the phrases that represent entities of this type. Coreference resolution is a little more complicated to define, but it is clearly definable linguistically. When we're measuring LLM responses, we're talking about factuality, consistency, faithfulness, and these are much murkier concepts. So benchmarking is a very different thing in this new LLM world than it was in the old NLP world.

Also, when measuring your measurement system, there are some special requirements. You need all the inputs and outputs of the systems; generally that's not a problem, but there are some datasets out there that have features instead of the actual text, and you want the raw text, preferably none of those pre-tokenized ones. You also need both good and bad examples. This is where it's often hard to find a suitable dataset, because we're talking about generated outputs, and many datasets just have the correct answer for a given question.
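To make those requirements concrete, here's a minimal sketch of the kind of record such an evaluation dataset needs to contain. The field names are illustrative assumptions, not the schema of any particular dataset mentioned here.

```python
# A minimal sketch of an evaluation record for measuring a measurement system.
# Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalRecord:
    input_text: str              # the user question, or the document to summarize
    context: Optional[str]       # retrieved passages for RAG Q&A; None for zero-context
    response: str                # the generated answer or summary (raw text, not pre-tokenized)
    human_label: float           # human judgment; both good and bad examples are needed
    label_source: str = "expert" # e.g. "expert" vs. "crowd", which the talk treats differently
```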
For example, the BioASQ dataset has a bunch of really great questions in the biomedical domain, along with the ideal answers for them, but it doesn't have wrong answers, so we can't really use it to measure a measurement system. You also very much want human-generated labels, because if you have some system generating the labels for you, then the question becomes: how do we know that system is accurate? You start asking how to measure the labels that are used to measure the system that measures, and it gets too deep at that point and we have a stack overflow. So there aren't that many datasets that fit this bill, but more are being built, because this problem is starting to be recognized and there's likely legislation on the way, so the need for these datasets is strong.

Let's go through the datasets we're using in this talk. For summarization, there are the two QAGS sub-datasets. One is generated from CNN/Daily Mail news articles and labeled by Mechanical Turk workers; the other is from BBC articles (XSum). The summary generation is done in slightly different ways between the two, so what good and bad examples we have, and whether there's one example per article, differs, and we have the grades from the Mechanical Turk labelers. Then we have the SummEval dataset, which is itself also generated from the CNN/Daily Mail data, but the summaries are generated in various different ways, and the labels are done by both experts and Mechanical Turk workers. For the QAGS labeling, each labeler labels each sentence yes or no: is this a good sentence in the summary, is it accurate for the summarization or not. We then take a micro-average across the whole response. For the SummEval labeling, the expert and MTurk labels are recorded separately in the dataset, so we can distinguish them. They actually measure four things: consistency, coherence, fluency, and relevance; we're just using consistency here. We also found there's actually not a lot of correlation between the experts and the Mechanical Turk workers, so we just go with the experts.

We got these three datasets from a leaderboard for the AlignScore project, and they use Spearman correlation with the human labels of these datasets to make their leaderboard. There's a problem with Spearman correlation, though: these three datasets have very low granularity in their labels. In Spearman correlation, if there's a tie, say items 10, 11, and 12 are tied, they'll all be given rank 11. There are a couple of other tie-breaking conventions out there, but this is a pretty common one. What that means is that if you have a system that's more granular, then when you look at the correlation you will get penalized for being more granular, for not having the ties, especially when the ties are very extreme.
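Here's a toy illustration of that tie penalty. The scores are made up for the example; they're not numbers from the talk.

```python
# A toy illustration (with made-up numbers) of how heavy ties in the human labels
# penalize a more granular scorer under Spearman correlation.
from scipy.stats import spearmanr

# Human labels: 8 of 10 items are tied at 1.0 (very low granularity).
human = [1.0] * 8 + [0.5, 0.0]

# A coarse scorer that reproduces the same ties gets a perfect correlation.
coarse = [0.9] * 8 + [0.4, 0.1]

# A more granular scorer that still ranks the items sensibly gets penalized,
# because it has to impose an order on items the humans left tied.
granular = [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.4, 0.1]

rho_coarse, _ = spearmanr(human, coarse)
rho_granular, _ = spearmanr(human, granular)
print(rho_coarse)    # 1.0
print(rho_granular)  # noticeably lower, around 0.7
```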
A note about Lynx, which explains its place here: it doesn't actually support summarization, only RAG Q&A. But with this metric, it looks like Pythia is doing very well for XSum, while the simple grading prompt is doing the best for the others. When we look at the distribution of the labels, we notice there's some granularity for QAGS CNN/Daily Mail, but for XSum there are basically just four values, and for SummEval's label histogram there are even fewer distinct values, I believe around twelve, and overwhelmingly they're in the higher range. What that means is there's a huge number of ties; something like 80 to 90 percent of the data is tied. So the measurement of a more granular system, like what Pythia produces with the accuracy metric I talked about before, is going to be severely penalized compared to something much less granular, like the grading.

So let's pick a different metric and see if we get one that's fair to both. Let's look at mean absolute error. The reason this one is better for both is that we treat the human labels as the target and just measure the error from them; that doesn't excessively penalize more or less granular systems. In this case lower is better, and now the Pythia system, which from our previous work we expected to be very good for summarization, is doing much better. Of course, this metric works here, but it's not necessarily going to work for every use case.

So let's talk about our RAG Q&A dataset. This is the dataset the Lynx project uses; it's composed of several other datasets, and all the labels are just pass/fail, which is also what their system is built to produce. When we score this one with mean absolute error, it looks like Lynx is doing pretty well, but mean absolute error doesn't really make sense in the binary case: if we're comparing two binary variables, the error basically just comes down to the few items they disagree on, which is a strange way of measuring it. And the systems with more granularity are still going to be penalized, because their values sit more in the middle while the labels are at the extremes of zero and one. So mean absolute error is not the right metric for this, but we can treat it as a binary classification problem, in the sense that we can ask, for these scores, how well they separate the passes from the fails. So let's look at some ROC curves and precision-recall curves for Pythia. We see varying quality of performance across the different component datasets. This is very informative for classification, but again it doesn't really work for the grading or Lynx strategies, because their outputs have so little granularity. So we need something that works for all of them, and that means going with accuracy, which happens to be what the Lynx project uses. But in order to use it, we have to threshold Pythia and the grading strategy. The Lynx prompt works best overall for this dataset; the claim-based Pythia strategy does better on summaries, but the simple prompt-based ones do better on RAG Q&A. There are a couple of hypotheses about why. One is that in the Q&A setting there may not be enough text to extract a lot of useful claims.
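Here's a minimal sketch of the metric choices just discussed, using toy scores. The threshold value and the example numbers are illustrative assumptions.

```python
# A minimal sketch of the metric choices just discussed, using toy scores.
from sklearn.metrics import mean_absolute_error, roc_auc_score, accuracy_score

human_pass_fail = [1, 1, 0, 1, 0, 0, 1, 1]                            # binary pass/fail labels
granular_scores = [0.95, 0.80, 0.40, 0.50, 0.20, 0.55, 0.90, 0.85]    # e.g. a Pythia-style score

# Mean absolute error: awkward against binary labels, since mid-range scores
# are penalized even when they rank the examples correctly.
print("MAE:", mean_absolute_error(human_pass_fail, granular_scores))

# Treating it as binary classification: how well does the score separate pass from fail?
print("ROC AUC:", roc_auc_score(human_pass_fail, granular_scores))

# Accuracy requires thresholding the granular score first (the talk mentions 0.9 as an
# example threshold for Pythia; 0.6 here is purely for illustration).
threshold = 0.6
predicted = [1 if s >= threshold else 0 for s in granular_scores]
print("Accuracy:", accuracy_score(human_pass_fail, predicted))
```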
Another hypothesis, one that's not listed here, is that when we threshold the output of Pythia we do it in a very naive way. We're not taking the data into account; we just pick a reasonable threshold, and maybe tune it a little for performance. For example, here I just picked 0.9 as the threshold, with the idea that you probably want something with a small number of contradictions, so that seems like a good threshold. But with the grading, and especially with the Lynx strategy, the prompts effectively push that thresholding off to the LLM. Instead of picking one threshold for everything, they ask the LLM to make that choice about where a response sits, which allows the LLM to take the data into account when deciding pass or fail.

So, the conclusions. Unlike with previous generations of NLP, simply grabbing a dataset, running it, and concluding that you now know how your system is doing is not sufficient. You need a more sophisticated measurement system, and you need to understand how to use that system. You also need to understand the data and the metrics you're using: understand how close the data is to what you expect, both in terms of input data and output data. For example, if a dataset measures a bunch of generated summaries that were produced in a way completely different from how your system generates summaries, it's not going to tell you as much. You also need to understand the metrics, how they work, and how different datasets and different tasks may need different metrics. And finally, measuring an application is an ongoing process: you need to tune your measurement system, and this field is developing, so beyond that tuning, which will always be the case, you also want to keep up to date with current measurement practice as the field develops.

Here are some future items we're looking toward. Testing against more strategies: we have a new version of the Pythia strategy in the works, and there's also RAGAS out there, which is another extraction-based approach, but based on statements rather than triples. We want to test more models, and large models versus small models, for example GPT-4o versus GPT-4o mini, and the 70-billion-parameter Llama and Lynx versus the 8-billion versions, to see how those do. We also want to compare general versus fine-tuned models, for example Llama versus the Lynx model, which is based on Llama, to see what sort of gains come from that. And we want to try calibration, which I think is a very promising concept for measurement and for improving LLM outputs; with calibration we can try different ensembling methods. More on that in future talks.

Now we'd like to turn it over to you and your questions. Thank you.

So, how do we compare different systems on our own data? In this talk we were comparing different systems on these open datasets. What I would recommend is that you take comparisons like this and then run the systems against your own dataset and see what the distributions look like.
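Here's a minimal sketch of that kind of distribution comparison, with made-up scores; the numbers and the choice of summary statistics are illustrative assumptions.

```python
# A minimal sketch of comparing score distributions, with made-up scores.
# benchmark_scores: what the measurement system produced on an open dataset (e.g. SummEval);
# my_app_scores: what it produces on your own application's outputs.
import numpy as np

benchmark_scores = np.array([0.92, 0.88, 0.95, 0.81, 0.90, 0.86, 0.93, 0.79])
my_app_scores = np.array([0.55, 0.62, 0.48, 0.70, 0.58, 0.66])

def describe(name, scores):
    print(f"{name}: mean={scores.mean():.2f}, median={np.median(scores):.2f}, "
          f"p10={np.percentile(scores, 10):.2f}, p90={np.percentile(scores, 90):.2f}")

describe("benchmark", benchmark_scores)
describe("my app   ", my_app_scores)

# If the benchmark distribution skews high but your application's scores sit much lower,
# that gap is a signal worth investigating.
print("gap in means:", benchmark_scores.mean() - my_app_scores.mean())
```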
For example, we saw some of the distributions in the summarization data: SummEval is very skewed toward positive human judgments. So if you run your own dataset, and the models that were predicting more or less accurate scores for SummEval are predicting much lower scores for yours, that might be a sign of a problem. That's the advice: when you pick the open dataset that you think is a good analogue for your data, get the distributions on that, because there we have labels, and then compare them to the distributions when the measurement is run on your data.

As for which metric to use, this is a challenging thing, because the metric in question is either for the measurement system or for your own system. Let's start with measuring the measurement system. That's going to depend on the labels in the dataset. As we saw with the summarization datasets, people were using Spearman correlation, but that seemed inappropriate, so we switched to an error metric. Then when we went to HaluBench, the labels are binary, so neither Spearman correlation nor error really works; we treated it as a classification problem and just calculated accuracy. And if all the systems had been giving high-granularity answers, we probably would have used area under the ROC curve instead of just accuracy. So it depends on your dataset for measuring the measurement system.

For measuring your own system, you're probably going to want a few different metrics. The primary one, for the actual hallucination detection, will depend on your system and what it outputs. For example, if you pick the Pythia system, you'll want our accuracy metric, the one based on entailment and contradiction; that should give you a good idea of what's going on, and you may want to weight it depending on your interests. Beyond that, there are other validators you may also want to run. For example, in the Pythia system you can sign up for, we have a whole set of other validators: things like detecting the reading level, or even just monitoring the length of the data you're outputting. Why would you want to monitor these other metrics? If you are, say, producing summaries of medical documents for the patient, then you have highly technical language that needs to be put into less technical language, so you'd want to measure whether your system is doing that. And especially for length: your customers may be paying based on the length of the output you produce, and if you're using a service like OpenAI, you'll also be paying based on the number of tokens generated. So you want to keep track of that and make sure your system isn't producing overly long, or perhaps even overly short, responses.
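As a small illustration of that kind of operational check, here's a minimal length validator. The word-count bounds are made-up assumptions; a real setup might also track token counts for cost and a reading-level score, as mentioned above.

```python
# A minimal sketch of a length validator for generated responses.
# The word-count bounds are made-up assumptions.
def check_length(response: str, min_words: int = 30, max_words: int = 300) -> dict:
    """Flag responses that are suspiciously short or long."""
    n_words = len(response.split())
    return {
        "word_count": n_words,
        "too_short": n_words < min_words,
        "too_long": n_words > max_words,
    }

print(check_length("This summary is clearly too short."))
```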
If you don't have your own data, you can use one of these open datasets, but in a different way than for measuring the measurement system. When we measure the measurement system, we take the responses that are in the dataset and see whether our measurements match what's in the data. Instead, you can provide your own outputs: take some of the questions from HaluBench, run them through your own Q&A system, see what answers it provides, and then, similar to one of the previous questions, look at how the distribution of measurements on your answers compares to the distribution on the answers that came with the dataset. For example, a lot of the items in HaluBench are roughly 50/50, pretty balanced. If you run your responses through and you're getting a much higher rate of high scores, meaning the measurement system thinks your application is very good, that's probably a good sign; it means you probably do have a better set of responses than what came with the dataset. And obviously, if it's much lower, that would indicate a problem.

It's not limited to a specific industry, and that's a great question. With Pythia we've done some clinical work, we've worked with some fairly generic datasets and domains, and we've also done some work on the financial side, so there's nothing that limits it to an industry. I would say the future of this is building more specialization, especially in models but also in prompts, and for anything like RAG that uses another data source, having the specialized datasets it requires. But for right now, you can usually get started with a more generic setup and then begin tuning it to your problems, your use cases, your domain.

Yes, we do. Actually, I think we have a link to that; it should be available, and you can try it out and see how it works for your data. In the previous webinar we did, Alan and I talked about this, and Alan walked through how you set it up with your system; it's really easy to do.

Cool. Thank you everyone for attending this webinar, and stay tuned for our future developments. Bye.