Hello everyone, welcome to the AI Anytime channel. In this video we are going to explore how to evaluate a large language model. Nowadays, whenever you see a new release by an LLM provider or a research group working on large language models, they claim they have performed well on the leaderboards and surpassed GPT-4 on all the evaluation benchmarks. So the question is: what do we actually mean by these benchmarks? Most of you probably have not worked through evaluating a large language model on different benchmarks, and there are n number of benchmarks out there. To be honest, I'm not a huge fan of these obsolete evaluation benchmarks, but there have been requests to show how one can evaluate a large language model on them.

There are two ways of doing it. The manual way is to go and fetch the datasets yourself, for example TruthfulQA (truthful question answering, pretty much self-explanatory; the dataset is available on Hugging Face), and do everything in your own Python program: load the model with the Transformers library or however you like, loop over all the questions, collect the answers, and score them against the references. You can do it that way, but it's a tedious task to set up all of that code, classes, functions, and whatnot.

The other way is to use one of the open-source tools or libraries that let you evaluate these LLMs with just a few lines of code. They have already written all the lower-level code and expose a high-level API, high-level functions, for your evaluation journey. In this video we're going to look at the lm-evaluation-harness, and in the next video we'll look at Optimum-Benchmark; these are two of the most popular libraries for evaluating LLMs. So let's jump in and see how to do this.

You can see on my screen that I am on Google Colab, the paid Colab Pro version; you need around 15 GB of VRAM for today's task. Let me first show you the lm-evaluation-harness GitHub repository. It describes itself as the "Language Model Evaluation Harness", a framework for few-shot evaluation of language models, and it's by EleutherAI, who have done a fantastic job strengthening the open-source AI community. If you scroll down you'll find a bunch of things, and installation is pretty easy; I'm going to install it from source, so back in the notebook I do pip install with the git+https URL of the repository. I'm also installing bitsandbytes, because if time permits we can evaluate a quantized model too, and you can take any model from Hugging Face, or even a model stored locally if you have enough compute and infrastructure to hold the weights. I'm kicking off the install now; it will take a bit of time.
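For reference, the install cell looks roughly like this; it's a sketch that assumes you install straight from the GitHub repository rather than cloning it first, so check the repo README for the current instructions:

    # Install lm-evaluation-harness from source (the README also shows a git clone + pip install -e . route)
    !pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

    # bitsandbytes is only needed if we later want to try a quantized model
    !pip install bitsandbytes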
While that installs, let me tell you a bit about the evaluation harness. It ships with a whole catalogue of tasks, and we'll print that task list in a moment. Imagine you take a pre-trained model, any foundation LLM, and fine-tune it on your own dataset, which is very easy to do these days; I have 15+ fine-tuning videos, and you have libraries like Unsloth, LLaMA Factory, and Axolotl, or you can do it through Transformers itself. Once you have fine-tuned a large language model on your data, how do you evaluate it on different benchmarks, and then go and publish those evaluations on the Hugging Face Open LLM Leaderboard, or on LMSYS's arena, which is an Elo rating for chatbots? You should know about all of this if you're going a bit deeper into the generative AI space, and that's why I wanted to create this video.

Let's see where the install is, and I'll write the next piece of code. We're going to try Llama 3, or really any model. It's a gated repo, I believe, and I have already been granted access, but you still need an API key, so you have to log in with your Hugging Face account. I'm going to do a notebook login: from huggingface_hub import notebook_login, then call notebook_login(). I grab a key, a read key is enough, paste it here, and log in. Once it says "login successful", I'll be able to download the LLM into this Colab notebook even though it's a gated repo, as long as I've been granted access to the model.

Now that the harness is installed, let's check what tasks it offers. I run lm-eval with the tasks argument set to list, and what that does, guys, is print every task the evaluation harness provides that you can run against a large language model for benchmarking.
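In the notebook, the login and the task listing look roughly like this; note that the CLI entry point is spelled lm_eval (with lm-eval as an alias) in recent versions of the harness, so adjust if yours differs:

    # Log in so gated repos such as Llama 3 can be downloaded; paste a read token when prompted
    from huggingface_hub import notebook_login
    notebook_login()

    # Print every benchmark task the harness knows about
    !lm-eval --tasks list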
The listing takes a bit of time to load, but then you can see the full set, and it is a huge number of tasks, Yahoo Answers topics being one example. Let me point out some of the famous ones. Winogrande is one we may look at today, and we're definitely going to look at HellaSwag and TruthfulQA; HellaSwag is a fantastic dataset for benchmarking a large language model, and I'll talk about it in a bit. Then you have TriviaQA, translation tasks, and ToxiGen, which you can run if you want to probe the content-moderation side of a model's capability. Whether you're fine-tuning a financial model, a legal model, or working with a plain pre-trained one, you have to evaluate it on different tasks to find out whether it behaves well on a particular kind of data, and all of those tasks are available here. If you want to evaluate on the RACE dataset, for example, that task is there; we also have PROST, and The Pile, which has some explicit content in it as well, so be aware of that. Then there's MMLU, where Gemini recently scored, I forget exactly, around 90% or so; MMLU is a really diverse benchmark, and you can see all the different MMLU subsets listed here. HellaSwag is even available in other languages, Arabic among them. So I hope you now understand how to list the tasks, because these names are the parameter values we're going to pass; we're not too worried about the rest of the list.

Let me bring up my notes, because I want to talk about HellaSwag, and let me bring the research paper onto my screen. It's from the Allen Institute (with other contributors, of course); if you go to Papers with Code and open the paper, the title asks, "Can a machine really finish your sentence?" This is more of an NLI task, natural language inference, which gets at whether the model really understands the intent of the natural language it's given, and the paper presents it as a new dataset for commonsense NLI. It's one of the most important evaluation datasets we have.

My issue with most of these evaluation datasets is this: if you give me a book to read for two years, I will simply memorize it, and I will always perform better. These datasets need to be changed dynamically, with continuous updates, because when I fine-tune an LLM I can overtrain it on them. Of course there are scenarios where it just overfits, but most of the time the model sees enough of these patterns to learn them, and then it will always look good on the evaluation benchmarks. I don't think that's the right way forward, although we are now starting to see the datasets evolve as well.

That said, HellaSwag has some genuinely nice properties. The key one is a novel data-collection technique called adversarial filtering (AF), which uses a discriminator, so it's almost a GAN-like setup in the way the data is collected: the discriminator iteratively selects challenging machine-generated wrong answers, and this method helps create a dataset on which state-of-the-art models struggle, which is very important. The paper is by Rowan Zellers and the team at the Allen Institute for AI; we won't go any deeper into it here.
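If you want to eyeball what HellaSwag actually asks a model to do before benchmarking on it, you can pull a few rows with the datasets library. This is a minimal sketch that assumes the dataset is still hosted on the Hub under the hellaswag id; loading details (for example a trust_remote_code flag) can vary with your datasets version:

    from datasets import load_dataset

    # Each example has a context and four candidate endings; the model must pick the most plausible one
    hellaswag = load_dataset("hellaswag", split="validation")
    sample = hellaswag[0]
    print(sample["ctx"])       # the context to complete
    print(sample["endings"])   # the four candidate endings
    print(sample["label"])     # which ending is correct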
Now, the next thing: how do we actually evaluate a model? For that we use lm_eval again, and you can copy an example command straight from the project page, so let's do that quickly to save some time, and I'll walk you through the pieces. What the command says is lm_eval --model hf, meaning the model is available on Hugging Face, then pretrained= followed by the model repository name. I'm going to change that repository; I could take one of the models I've fine-tuned myself and covered in previous videos, but let's just take Llama 3 for this example. You go to the meta-llama/Meta-Llama-3-8B model page, copy the name, and drop it into the command. You can also define a dtype, the data type for the tensors, float16 for example.

Then come the tasks. Keep in mind, guys, that the more tasks you add, the longer the run takes; it can take multiple hours, and if you queue up four or five different benchmark datasets it can take an entire day, because each of them has a huge number of questions. So be a bit deliberate when selecting tasks and be clear about which datasets you actually need to evaluate on. For this run I'll use TruthfulQA and HellaSwag.

Let me show you TruthfulQA quickly; let me close everything else and open it. You can look at the dataset itself and also go through its research paper. Each row has a question, a best answer, correct answers, incorrect answers, a source, and so on, which makes it a very good dataset for probing the truthfulness of an LLM. It describes itself as a benchmark to measure whether a language model is truthful in generating answers to questions, comprising 817 questions spanning 38 categories, including health, law, finance, and politics. So imagine you're building a healthcare-related LLM that you've fine-tuned; TruthfulQA is a reasonable choice, and of course there are medical-specific datasets too, MedQA for example, which several medical LLMs have been evaluated on, so have a look at those as well. The authors crafted questions that some humans would answer falsely due to false beliefs or misconceptions, which is very good for looking at the factuality of a large language model and for catching a bit of hallucination in the way it answers. You can also see in the paper how models like Llama 2 and GPT-4 have been run on it.

So we're going to run this for meta-llama/Meta-Llama-3-8B on TruthfulQA and HellaSwag. You could keep going and add Winogrande, for example, but I'm not going to do that; HellaSwag and TruthfulQA are enough for today.
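Putting it all together, including the device, batch-size, and output flags I'll walk through next, the full run looks roughly like this; it's a sketch, so cross-check the task names with lm-eval --tasks list and the flag spellings with the README for your harness version:

    # Evaluate Llama 3 8B on TruthfulQA and HellaSwag and write a JSON report to ./results
    !lm_eval --model hf \
        --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype=float16 \
        --tasks truthfulqa,hellaswag \
        --device cuda:0 \
        --batch_size 6 \
        --output_path ./results \
        --log_samples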
A few notes on the remaining flags. --device cuda:0 is the device I'm running on. You can set a batch size; I'll use 6 here. You can also give an output path, which matters because the harness stores a JSON file of the results for you; you can later take that JSON and, with a few changes, push the numbers to the Open LLM Leaderboard and other places, since they have their own criteria. I'll point the output at a results directory. And you can pass log_samples to log the individual samples as well.

I think the command is done. Once you run it, it will take a long time, so I'll pause the video here; it can take up to a few hours, I'm not sure exactly how long, because it depends on the GPU you're using. I'll come back and resume once it's finished. I hope that so far you've understood how to install the evaluation harness, list the tasks, select a task, go through the research paper behind a dataset to decide whether it's really the task you need, and then write this argument-based command to run it in a notebook; you can also do it through the CLI. So let's run it and come back to see how it went.

As you can see, guys, our evaluation has completed for the two tasks, HellaSwag and TruthfulQA. Let's look at the results; you can find them in this results folder, so let me expand it. I'm going to download the HellaSwag and TruthfulQA files, so let's grab all of them; you can also save them to Drive if you want to keep them for later.

Now look at the output table. It shows the filter, the version, n-shot (which is zero here), and then the metrics: accuracy, normalized accuracy, and ROUGE, whose full form is Recall-Oriented Understudy for Gisting Evaluation. All of the evaluation tools you see in the market today, Ragas, TruLens, Athina, the Weights & Biases offerings and the rest, use ROUGE, BLEU, BERTScore, and similar algorithms under the hood; I've already covered these in previous videos. They wrap them in a very fancy-looking UI, and if you want to build your own product, evaluation, monitoring, and observability of LLM performance is absolutely something you can do yourself. It's not rocket science, and the companies that built these platforms aren't doing anything truly revolutionary; these are all well-known algorithms. ROUGE, for example, measures the similarity between a candidate document, say your generated response, and a reference document, which in this case is the correct answer from the dataset.
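To make that concrete, here is a minimal sketch of scoring a generated answer against a reference with the Hugging Face evaluate library; the two texts are made-up examples, and the rouge metric needs the rouge_score package installed:

    # !pip install evaluate rouge_score   # if they aren't already in your environment
    import evaluate

    prediction = ["The Eiffel Tower is located in Paris, France."]             # candidate (generated answer)
    reference = ["The Eiffel Tower stands in Paris, the capital of France."]   # reference (correct answer)

    rouge = evaluate.load("rouge")   # n-gram overlap, common for summarization
    bleu = evaluate.load("bleu")     # precision-oriented, common for translation

    print(rouge.compute(predictions=prediction, references=reference))
    print(bleu.compute(predictions=prediction, references=[reference]))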
When you build a RAG application it's the same idea: you have a context coming from retrieval and a generated answer, and you measure how similar the two texts are. That's how a ROUGE score is computed, and ROUGE is used to evaluate the quality of translation and summarization. ROUGE scores range from 0 to 1, with higher values indicating better summary quality; a near-perfect summary pushes the score toward 1. In the table you can see ROUGE-1 and ROUGE-2, and then BLEU, the Bilingual Evaluation Understudy, which, as I said, is used for machine translation tasks. You should know all of this if you really want to become an expert in the generative AI field, or in natural language processing generally: ROUGE, BLEU, BERTScore, and the other similarity-based evaluation metrics. It's pretty much the same machine learning as always, the same F1 score, recall, and precision; nothing has changed except the fancy words and the new-looking UIs for observability and monitoring, which all use the same algorithms under the hood even if you never read their code. Anyway, you can see the values of all these metrics here, which is what I wanted to show; you can open the reports in a VS Code setup, look through them, and use them to push results to a Hugging Face leaderboard as well.

You can also view the results we downloaded in the different JSON files, and those are going to be helpful if you want to publish the numbers you see here. The JSON is quite detailed: TruthfulQA and truthfulqa_gen are in there, and HellaSwag comes out at around 60% normalized accuracy, with the other metrics alongside. Everything you need is in these results: all the logs, how long the run took, the multiple-choice breakdowns, some sample outputs, and so on.

So how do you get your model listed? You go to the Open LLM Leaderboard on Hugging Face; it's a Space by HuggingFaceH4, a team at Hugging Face, and there are other leaderboards you can publish to as well. If you're fine-tuning your own large language models on your own data, you can publish the results there; these are the benchmarks people talk about, HellaSwag, MMLU, HumanEval if you're working on coding problems, and you should be aware of them. It's not rocket science, it's not something only experts can do, and anybody who wants to try it out can do it.
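Before you submit anywhere, it's worth sanity-checking the numbers in the JSON the harness wrote to the output path. A minimal sketch, assuming a downloaded file named results.json; the real file name carries a timestamp, and the metric key names differ a little between harness versions:

    import json

    # Load one of the downloaded result files (rename to match yours)
    with open("results.json") as f:
        report = json.load(f)

    # Print every metric the harness recorded for each task
    for task, metrics in report["results"].items():
        print(task)
        for name, value in metrics.items():
            print(f"  {name}: {value}")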
Here on the leaderboard you can see HellaSwag, TruthfulQA, Winogrande, everything is there, and you can also submit your own model. If you go to the Submit tab, it says there is an evaluation queue for the Open LLM Leaderboard, and models added there are automatically evaluated on the cluster. They have some instructions: the model you have fine-tuned has to be pushed to Hugging Face, because you only provide the model name; it should be loadable through the Transformers auto classes; it should be converted to safetensors, not .pt PyTorch checkpoints; it should have an open license; and the model card should be filled in. Then you simply submit it. Once you're satisfied that your model performs well on these evaluation benchmarks, TruthfulQA, HellaSwag, Winogrande, ARC, and so on, you can post it there and it will become visible. For example, you can see a model here scoring 91.15 on HellaSwag, and Mixtral 8x22B Instruct from Mistral at around 89.8 on HellaSwag and roughly 68 on TruthfulQA. This is fantastic, so have a look, and if you want to publish your results, feel free to do that.

That's what I wanted to cover in this video, guys: take any LLM that's on Hugging Face, install EleutherAI's lm-evaluation-harness (you'll need a GPU), list the tasks, select the ones you need, run the evaluation, and then publish your results on Hugging Face. If you have any questions, thoughts, or feedback, let me know in the comment box; you can also reach me through my social media channels, which you'll find on the channel banner and in the channel's About section. If you like the content I'm creating, please hit the like icon, and if you haven't subscribed to the channel yet, please do subscribe; that helps me create more such videos in the near future. Thank you so much for watching, and see you in the next one.