In this video, I'm going to show you how to use the brand new OpenAI framework called Evaluations which will help you battle test your prompts, whether they're for a voice application, a text application, or any form of AI agent application. And what's really cool about it is that whether or not you actually build your applications on OpenAI, you can still test the outputs and conversations that your agents have on this framework. Now using this framework as a beginner is a bit daunting at the beginning, which is why I spent my entire weekend to figure out how to use it.
so that you don't have to. If we haven't met before, my name is Mark, and I've been running my own AI automation agency called Prompt Advisors for the past few years. We work with companies in every vertical to better understand where best to use AI in their workflows.
I've been working in the AI and data science space for the past 10 years, so I can tell you firsthand how important testing is, and this framework is going to be super helpful in letting you do that. I'm first going to quickly show you how to actually access and use this feature, and then I want to jump into some slides to quickly show you all the different features you could use, which ones work really well, and which ones don't. Then I'm going to generate some synthetic data sets using some Python code I put together, to show you exactly how these simulations and runs actually function. And at the end, I'm going to show you how to take a subset of these testing features and create your own custom GPT, so you can use this a lot more easily and on the go.
All right, let's dive right in. All right, so where you first want to go is to platform.openai.com. And then you're going to see some screen like this, and you want to go to dashboard.
On dashboard, on the left-hand side, you're going to see this little compass icon that denotes evaluations. You want to click on that. If you're going here for the first time, you're just going to see a green button in the very middle center of the screen that says create.
In my case, this is evidence that I spent my entire weekend trying to figure this all out for you. You can then add your test data here by clicking on import, and it has to be in either JSONL or CSV format. And if you don't know what JSONL even means, I'm going to break that down for you later.
Don't worry about it. We'll come back to it. If we add any form of sample data here, so I'll just click on the CSV, you can import either a CSV or again, a JSONL file. And this is when you can click on adding a test criteria, which is where you tell OpenAI, what am I trying to test for?
Now I'm going to go into the slides to break down what is behind each one of these features and which ones seem to be the most useful from my testing. Alright, so if we jump into the slides here, you're going to see the different criteria you can test for: one is based on factuality; one on semantic similarity (if that term doesn't mean much to you, I'll explain it); then there's custom prompt; sentiment, which is basically the vibe of the text being analyzed; a string check; valid JSON or XML; matches schema; criteria match; and text quality. A few of these are unbelievably confusing to use, but once you learn how to use them, they become very potent. Now, if you're curious what a JSONL file actually looks like, and you're familiar with CSV, it pretty much looks like the following.
This is a JSON file, and if you're not technical and you've never seen this, it's pretty much, imagine what you'd have in a CSV, but structured into either line-by-line or nested entries. Meaning, if you look here, we have first name, last name, email address, and is alive, which is some feature. And then you have something like billing address, where under billing address you have some nested entries that all go under the concept or category of billing address, such as address, city, state, and postal or zip code. JSONL, instead of having these nested little features, keeps everything line by line.
So a JSONL file would look like this, where you have, again, first name, last name, email address, is alive is true, and then you'll notice here the billing and shipping address are now flattened into one line. The reason they're most likely asking for this format is that it pretty much guarantees a predictable structure, so it's probably easier to analyze. That's the extent of what you need to understand about what a JSONL file is, and the nice thing is you can ask either ChatGPT or Claude to take an existing piece of data and convert it into this format; nine times out of ten it works perfectly on the first try. So on to the features. The first one is called factuality, and the whole point of it is to grade the actual responses, referring to some form of ground truth answer that you provide.
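To make that concrete, a couple of rows in that shape might look like this; this is just an illustrative sketch, and the column names (input, output, reference) mirror the data sets I generate later in the video, not a format OpenAI forces on you:

```jsonl
{"input": "What is the capital of France?", "output": "Paris is the capital of France.", "reference": "Paris"}
{"input": "Who wrote Pride and Prejudice?", "output": "Jane Austen wrote Pride and Prejudice.", "reference": "Jane Austen"}
```

Each line is its own self-contained JSON object, which is what makes the file easy to load row by row.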
So the most tricky and annoying part of using this framework is that technically, you have to make sure that your data is structured in a way where your question is in one place, your answer is in another place, and you have your ground truth in another column. So you have to be a bit more mindful about how you pre-process your information to make sure that you can properly load it and properly test what you need to test. In this case, the question is whatever is posed by the assistant or by the user, depending on your use case.
And then the model response is what was actually outputted by the LLM, irrespective of which LLM that is. And the reference is what the ground truth answer should have been. When you go to the passing grades, you can pick from five options. The first says the response is a consistent subset of the reference, meaning it's not necessarily one to one. So let's say you had some form of internal support bot for your company where you could ask questions about your company policy.
Maybe it referenced half of the policy correctly, but it didn't reference the follow through or the final half. It's technically a subset of the right answer. So you can set that theoretically as a passing grade.
The next one says the response is a superset of the reference. I had no idea what this meant until I looked into it. It basically means you gave more information than what was asked of you.
But technically what you said was true; you just added more than expected. The next option, response matches all details of the reference, makes sense: it's just one to one, so it would have to be written exactly the same way, and from what I saw, capitalization, structure, etc. all have to match. The next one here is response disagrees with the reference, where the reference said one thing and the answer said another, and this is a case where you want it to pass because they didn't agree, for whatever reason. For me, it doesn't really make sense to use this as a criterion, but it's in there as well. And the last one is a bit more nuanced, which is the response differs from the reference but the factual accuracy is unaffected, which basically says you answered in a different way, but thematically and factually it's still basically the same thing. Now, full disclosure, out of all of them, this is the only one I couldn't get to work properly, even after spending hours on it. So that's my little embarrassing moment here: I can tell you what it wants to do, but I didn't get it to actually do it. One thing I tried while troubleshooting: in some of these features, there's this prompt that you can click down on and see the actual underlying prompt of how that feature works. So I tried to use factuality in a custom GPT by taking their prompt and frankensteining it with a few of the other prompts that are available.
So I didn't manage to get this one right. Sorry about that. On to the next one, which is semantic similarity.
And if this is gibberish to you, you can basically think of it as how similar two pieces of text are. Now, when we say similar, you and I can look at it and say it looks similar based on our experience with diction and vocabulary.
When it comes to AI, it needs to translate these words or sentences into what's called a vector, which is basically a bracket of numbers, let's say [-1, 1, 0], which is the AI's representation of that same sentence we're talking about. If that makes zero sense, I'll quickly show you an example. So I say: show me a hypothetical set of vector embeddings for the following sentences, "I love you" and "I love huskies," and show me the actual vectors for each, hypothetically.
And the reason I'm saying hypothetically is that technically, to determine this, you'd have to run it through some code, Python, JavaScript, whatever, and have it actually do the translation and come up with what's called the embedding of that original text. So in this case we'll send this over and imagine that "I love you" and "I love huskies" are abstracted using math, and this is your hypothetical embedding. You'll see "I love you" is 0.8, 0.7, 0.9, 0.2, 0.1, and then "I love huskies" is 0.8, 0.7, 0.5, 0.6, and then 0.3. Now you might look at this and be like, this doesn't make any sense to me.
This is how pretty much every chatbot that exists with some form of knowledge base actually looks up its information. You load information, and that information all gets converted into a format like this. So when you ask a question, it tries to look for, let's say, a vector that's as similar as possible to your question's vector.
And one way to do that is to use something called cosine similarity, which basically compares these two vectors and says how similar they are. So the fact that this has 0.9 versus 0.5, this has 0.2 versus 0.6, has a lot to do with how the vocabulary the AI is using to translate it works. So I love is the same in both, which is why you see 0.8 and 0.7 are both the same hypothetically.
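If you want to see that comparison spelled out, here's a tiny sketch of the cosine similarity math using those hypothetical vectors; the numbers are the made-up ones from the example, not real embeddings:

```python
import math

# Hypothetical 5-number embeddings from the example; real embeddings
# have hundreds or thousands of dimensions.
i_love_you = [0.8, 0.7, 0.9, 0.2, 0.1]
i_love_huskies = [0.8, 0.7, 0.5, 0.6, 0.3]

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(i_love_you, i_love_huskies))  # roughly 0.9: close, but not identical
```

A score of 1 would mean the two sentences point in exactly the same direction in vector space; the further apart the meanings, the lower the score.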
The fact that it's huskies versus you, where one is contextually referring to a person and the other to an actual animal, in this case a dog, means the way they're translated in vector space is very different, which is why these vectors are close but not the same. Let's go back. So to take an example to ground it: say a customer success chatbot gave "We don't process refunds" as your actual answer, but the reference answer is "We don't usually process refunds." Semantically, they are similar, but they're not the same. One has some variability, the "usually," meaning we don't typically do it.
One is a flat out, we don't process refunds. So on this quote unquote grade, if you want to say I need it to be four out of five similar on their bar of similar, then it would use that math I just explained to you to determine whether or not it was close enough. Now the black box nature of this is, you know, if you put a three versus a four out of five, It's hard to differentiate off the cuff if something that's kind of similar will be picked up or not. So the best way to experiment is to start at three as your North Star. And then if it's passing, just experiment with four.
Because ideally, you want your answers from ground truth to actual answer to be as similar as possible. All right, so stick with me. The next ones are more straightforward.
So with custom prompt, you literally write a custom prompt where you write in natural language how you want to grade the responses. And then you define what an A, B, C, or D is. So you have more autonomy here to actually create your own prompt and basically decide what is your criteria in layman's terms that you want to discriminate by. Now, in terms of sentiment, this is not a new concept.
It's existed for years, especially in the statistics space, where you upload some data and you say, hey, you know what, here are some reviews of our business on Google Reviews. Go through it and analyze the general sentiment. Instead of making it say positive, neutral, and negative, you just test for something you're looking for.
So if you're looking through reviews, typically, as a business owner, you want to look for the negative ones, because that's what you want to double down on and fix. So you could click on negative here and then assess the sentiment of a specific column, and then you're good to go. The "grade with" option here allows you to choose which LLM you want to use.
So that's one thing I haven't touched on yet. You can have it use GPT-4o mini, o1-mini, o1-preview, etc. Any model that exists in OpenAI, you can choose to grade with. So that adds one more level of flexibility, but also complexity.
In all my testing, the majority of the time, GPT-4o mini was a good choice. And you can imagine if you had a file of, let's say, a thousand different cases, and you used o1-preview or o1-mini, which are both more expensive models, you'd go bankrupt pretty quickly. All right, so the next one's super straightforward.
It's called String Check, where you literally upload a file and tell it: hey, check if this column is identical to that column as a string, which is literally just asking whether this text and all its characters and associated punctuation are the same as this other text. So you can look for equals, something that starts with the same thing, or something that ends with the same thing.
So one example use case that I'm going to show you is making sure that your chat agent always says thank you before it finishes the conversation. In that case, I can go and set the check to "ends with" and look for exactly the words "Thank you," with a capital T and a period at the end. That's an example use case where I would check this.
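Under the hood, a check like that is nothing fancier than a plain string comparison. Here's roughly the equivalent in Python, using a response similar to the made-up ones later in the video:

```python
response = "I hope I was able to help with your concern. Thank you."

# "ends with" check: case-sensitive, period included
print(response.endswith("Thank you."))  # True

# the other string-check style comparisons
print(response.startswith("I hope"))    # True
print(response == "Thank you.")         # False, an exact-equals check is much stricter
```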
Next one is valid JSON or XML. This is probably one you'd use only if you were more on the technical side. You have some form of application that has an output that needs to be read by another application. So in this case, you can basically just check is this JSON that's been produced by OpenAI or whatever LLM an actual valid JSON. On that same note of JSON, you can also have this matches schema feature where you say, hey, this is what I've determined for my JSON.
And if you don't remember what JSON looks like, it is this bad boy. So let's say your application is releasing this as an output. You can go back to matches schema and then say, hey, here's exactly what I'm going for.
I want first name, age, and then I want billing address, and then I want postal code, city, etc. and you can mandate that whatever the actual LLM outputted fits that exact structure. The second last one here is criteria match, where you give it, again, in natural language, some form of custom criteria. So unlike the custom prompt option where you had to create a grade, let's say A, B, C, or D for the output, you can just say, hey, the criteria is make sure that every single output here is either in French or English.
Let's say you have some form of bilingual bot. and you want to make sure that there's no Spanish, German, or extraneous languages being output. So your inputs here are the conversation column, the model response, your criteria in natural language as to what the rules are, and then how you'd like to grade it, like with which large language model you'd like to grade it with. And the last one was also a very confusing one for me. It's called text quality.
So in this case, you want to compare two pieces of text, and I think the primary use case for this one is doing a version of semantic similarity if you click on this, which I mentioned before, that cosine similarity metric. But these BLEU and ROUGE options are apparently good for translation. I tried testing it myself.
I struggled quite a bit. So personally, if I want to do some translation testing, I would use this criteria match or that custom prompt versus this. Again, feel free to test it out yourself. Let me know in the comments below if you succeed where I failed.
I want everyone to know how to use everything. So that would be great. Bottom line though, the only one I got to truly work here was the cosine similarity, because again, it goes back to that vector thing of the, I love you, I love Huskies analysis.
That one worked pretty well. And outside of that, I wouldn't really use this on a very recurring basis. All right. With all of that out of the way, I'm going to actually dive and create some data sets for us to actually use it and test it in action. All right.
So I'm going to jump into this Google Colab that I put together. And pretty much, again, you don't need any form of technical background to use this. I basically just put this together to create some data sets that you could start using to demo the evaluation framework without having to worry about, hey, I need this column formatted in this way.
with this title, et cetera. So I just programmatically, if I reveal kind of what's in these dropdowns, I'm creating data Pythonically for free, no need for an API key in all of these cases. And I'm telling you kind of what the use case I came up with is.
And as you go down here, you have custom prompt, you have sentiment, which I made into product reviews. You have string check, which is just looking for that thank you suffix I referred to before. And then you have validating JSON structure, where I come up with.
a hypothetical fake one, and then one where you want to match a schema, and in this case I put in an example schema that you can use. As you go to criteria match, I created a prompt with criteria as well as a data set you can use to test against that criteria. And then the last one here is text quality for cosine similarity, which was the last feature I just went over. This very last one here, text completions, don't worry about that for now; we'll go over it later in the video. So all you need to do, as a technical or non-technical user, is just click here, go to Runtime, and then go to Run before. This will run everything before this cell.
So you can see here this is all going to start churning. What it's installing at the very top is the OpenAI package, which you don't actually need unless you use the very last feature in the file. But this will take probably 10 seconds to install, and then running everything takes a very short amount of time. That install finished in under 10 to 15 seconds, and the rest ran in less than 2 to 3 seconds.
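If you're curious what those data-generating cells are doing under the hood, it boils down to something like this; a minimal sketch with made-up rows, and the file name is just an approximation of the ones the Colab produces:

```python
import json

# A few hypothetical rows in the input / output / reference shape
rows = [
    {"input": "What is the powerhouse of the cell?",
     "output": "The mitochondria is often referred to as the cell's powerhouse.",
     "reference": "Mitochondria"},
    {"input": "What is H2O commonly known as?",
     "output": "H2O, commonly known as water.",
     "reference": "Water"},
]

# JSONL is just one JSON object per line
with open("semantic_similarity_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```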
And the output, if you don't have anything open like this, is you can click on this folder file icon, and then you'll see a bunch of files that are generated. So one thing I made it do is automatically download it to your session. So you can just go and actually download all of these one by one, and then you'll see here if you open it up, I'll just open it up here with an application.
Let me just choose in this case my Brackets app. If you don't have this, if you don't know how to open it, you can just kind of upload it to GPT and say tell me what's in here. But this is what it looks like in practice.
So name a primary color that is also associated with the sky and everything's line by line like we went over before. So if we close that up, once you have all of your data sets, you're good to go and start actually testing it. So like I said before, factuality, I wasn't able to get this working. So that's a skill issue on my side.
Skipping to semantic here. Basically, I have an input, which is, let's say, what is the capital of France? And then the output is Paris is the capital of France.
And then the reference, which is our ground truth answer, is Paris. So that's an example of one input that we could test. And then as you keep going, you'll see more and more questions that are all made up. I basically had chat GPT create a hypothetical data set.
And if you want to create your own, what you'd have to do is just go to ChatGPT. I'll put this side by side. And you just say: create a hypothetical data set of 10 rows in JSONL format of zoo data, and make sure to have the following columns. Then I'll say input, which is, let's say, a user question about the zoo. Oh, I sent that too early, so let me just undo that mistake and edit it. There we go. I'm going to say output, the answer to the question, and then reference, the actual answer that should have been given. Okay, and if I click send again, it'll create it very quickly in this format, and you can see it's all line by line. In around 20 seconds it ran, and all we have to do now is ask: can you make it downloadable?
Because otherwise, you'd have to copy this, open a text editor, and remember to save the file with a .jsonl extension, and that's just an extra few steps you can avoid by asking it to make a downloadable file. So you can see here, if you click on download file, you should, fingers crossed, get something that actually has this information. Again, I'll go back and use this Brackets app just to double-check that it's actually working, and we can see here we have our 10 different questions, all line by line. So this is all good to go. Now, going back to our semantic testing, we'll just go back to evaluations, and I will create a brand new evaluation.
I'll do create, and let's call this "semantic evaluation." For my test data, I have a folder here that I've called JSON files. I'm going to import it, so I'll click on upload here. And what we're going to look for in the folder I put together is the name of this file, which I called semantic similarity dataset dot JSONL.
So we'll click on upload, we'll go to my JSON files. And then we're going to look for the one that says semantic. So that's this one.
And I'll just upload it; it should take a very short amount of time because it's so small. And if you import it, it will actually populate the table and basically take that JSONL file and structure it itself.
So in this case, we want to add testing criteria. In the testing criteria, we're going to click on semantic similarity. And then in this case, we're going to use the LLM GPT-4o mini just to make things easier. And then you basically want to say, this is my column that I want to compare against. So in our case, we have a question, which is, let's say, what is the capital of France?
And then you want to take the output, which is, let's say, "Paris is the capital of France," and compare that against the ground truth answer, which is just "Paris." So when I click on compare, this is why it's helpful to have hygienic naming for the actual columns, which is why I put together that Colab file for you.
You can click on output as the first thing you want to compare, and then you want to compare it against the reference, which in this case is the ground truth. So if you click that, I'm just going to use a passing grade of three to make things simple. And then if we click on add, one thing you can do is click on estimate here, and it will tell you how many tokens are actually going to be run with GPT-4o mini.
So it gives you an idea of cost. Now, for 40,000 tokens' worth of GPT-4o mini, it's a very, very negligible amount of cost. But it's very useful to use this, especially if you feel like experimenting with larger models like o1, so that you know exactly what you're getting into. Now, if I click on run here, it should run a job, and once it's done, it should basically show me how many of the cases passed the test. And you'll see here that only slightly more than half of the cases passed.
And what you can do is you can click on data here and then it will actually load that table and tell you exactly which ones failed and which ones passed. So based on our very thin criteria of three out of five, you can see here. What is the powerhouse of the cell? My favorite question from biology in high school. The mitochondria is often referred to as a cell's powerhouse.
And in this case, it wasn't close enough to this, which is mitochondria. Where, if you go back to the I love you and I love huskies vectors, you remember how they were still different, even though they were very similar. So, although this has mitochondria in both cases, it's different enough that for that passing grade, there's a cutoff.
But if you take something like this, "Paris is the capital of France" versus "Paris," it makes sense why it would be a perfect match. You'll notice that when you have a superset answer, meaning the answer is quote-unquote correct but adds so much extra detail that, at least semantically as quote-unquote vectors, they're not similar anymore.
You'll get these failure cases. Whereas if it's only slightly more verbose than the real answer like H2O versus H2O commonly known as water, it is close enough to pass in this case. But if we were to go back and quickly do this test again, and we call it semantic evaluation two, and then we import the same data that I just used. And then we add testing criteria. And this time we do output reference, but we make the passing grade four.
It's going to be more stringent. So even those cases that were slightly more verbose might be tagged as failure. So I would not expect it to be 53% pass rate again.
It should be a bit lower, if not a lot lower. And you can see here, there's a pass rate of 6%. So this is hyper important to be cognizant of when you're actually using this framework: with any change in the grading, there's a steep cutoff. So if you go to data again, there's only one that passes, which is this one, "H2O, commonly known as water." So it's saying that, of everything else, this is the most semantically similar and probably the closest related pair of vectors it could find based on that harsher criteria.
And naturally, if you put five out of five, you would need an exact match, meaning you'd need Paris to be Paris, and you'd need H2O to be H2O. You might have a case where you do want to look for the exact thing, so it's good to know that you can use 5 out of 5. But if it's something more freeform, where the answer is in more natural language, that's where a passing grade of 3 might make sense, if you're looking for things that are loosely similar rather than literally the same.
So we'll try the next testing criteria, and the next one on the list here was using a custom prompt, where in my case we have all of these questions and answers once again, and we have a data set called custom prompt dataset dot JSONL. So we'll go back and we'll call it custom prompt test. And then I will import this data.
So I'm going to go to upload. And I'm going to look for custom prompt data set. In this case, we have, again, the input, the output, and a reference column. We're going to click on testing criteria. We're going to click on custom prompt.
And I'm going to say I want you to grade it with, again, GPT-4o mini. And I want to say something along the lines of: if the answer is in English, then tag it as pass; if the answer is in French, then tag it as fail. If we add grades here, I can set what the labels are.
So in this case the labels are pass and fail, and you're generating the label with the prompt above, which is why it's important that these two parts build on each other. If the prompt said no or yes, my labels should be no and yes, so that the LLM knows what to look for when it decides whether to fail or pass. So if the prompt outputs pass, it should map to the passing label; otherwise, if it says fail, you should select fail here.
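So, to make that concrete, a grading prompt along these lines would line up with pass/fail labels; the wording here is purely illustrative, and as you'll see in a minute, you can also reference the output column directly with curly braces:

```
If the answer in the output column is in English, reply with "pass".
If the answer is in French or any other language, reply with "fail".
Reply with one of those two labels and nothing else.
```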
So once you have that going, assuming I wrote this properly and that most of these are in English, if I click on add, you'll see here, at least expectation-wise, that these are English; eight is a number, so it's not English technically, and the same for 100 degrees Celsius and 100. Because there's a number there, I'm not sure if it'll take that into account, but if we click on run, it will do its run as before. And this one, because it's using an LLM to go through every row, will take slightly more time, especially if you have, let's say, tens or hundreds of rows.
So in this case, it says that 0% of the rows passed, which to me is a bit confusing. So let's go back and create another test here. I'll call it custom prompt test two, and I'll do an import of the same data. Maybe my criteria wasn't the best.
So in this case, I'll use GPT-4o to be slightly smarter. And it does say here to use curly braces to insert variables. So in this case, I reference output inside curly braces: if the output is in English, then, you know, let's just say then say pass. Otherwise, if the output is not in English.
Then say fail. And then I'll do this again. Pass should be a pass.
Let me make sure it's exactly spelled the same. Pass and then fail is spelled fail. And I add that and I run.
In this case, it works. So if I go to data, it's saying here that all of them pass, which, again, is suspicious. So one thing to keep in mind when you're using this testing framework: when it comes to creating grades, and you know, the factuality thing, there's a bit of iffiness around how you actually create or designate them.
So it's one thing to keep in mind that even the slightest variation in the tagging could vary your results widely. Now, I'm not going to run three tests with all of these features. Otherwise, we'll be here till tomorrow.
But we'll run one more just to see how iffy it is. So we go and make a custom prompt test three, we import the exact same data, and then we add testing criteria and use GPT-4o. I'll say: if, in this case, the output is in English, then say English.
Otherwise, if output is a number, then say number. And then I'll add two grades. And this time I'll say it's pass if it's English.
And then I'll say it is a fail if it's a number. So this is a good way to test what are the different boundaries here. You'll see in this case, it says 100%, which I know is not correct. So conclusion from this is I think this is still obviously in beta. So there's still some bugs and kinks to actually work out.
So my most trustworthy features of this are the semantic similarity check and the custom criteria you can enter later on. So if we go back to create and go to a new evaluation, we'll import a new file and add test criteria. Criteria match was my favorite one to use; it made the most sense. Sentiment makes sense, and so do the valid JSON, the string check, and the matches schema. But text quality, custom prompt, and factuality were, to me, in the penalty box.
So. Anyway, it's good food for thought, and I wanted to show that on purpose because I found myself sometimes saying, oh, it works perfectly, and other times this is not working. So back to it, let's try the next one.
So I'll just call this new evaluation. If we go back to the Google Colab to quickly check what we're going to upload next, you'll see that this is product reviews from customers. So you have product, customer name, customer review, and reference review.
So in this case, we can just look at the customer review and assess how many negative ones do we have. So if you go back to evaluations, if we go and click upload, I'm going to upload this thing that's called product reviews dot JSONL. And then we'll upload that.
Click import. And you see here the product name, customer name, review and reference review. And then we're going to add testing criteria. And in this case, we're going to select the sentiment option.
And then we're going to assess the sentiment of the customer review column with GPT-4o mini. And I'm just going to look for whether there are any neutral or negative cases. So we'll do neutral first. We'll do add.
And then I'll do run once again. And you'll see here that it tagged 40% as neutral cases. And if we take a look at the data, you'll see here on the one that it says pass.
"Very sleek design, though the step counter seems inaccurate." Okay, so that one has a bit of positive and a bit of negative, which is probably why it cancels out and gets perceived as neutral. The next one, "Boils water quickly, but the handle gets a bit too hot," again follows that pattern of negative and positive slightly cancelling each other out. And the next one says "Fast and reliable, but the fan noise is a bit loud," so all of them follow the same pattern, and to me this tells me that this is a good feature; this was working as expected.
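As an aside, if you ever want to reproduce a sentiment check like this outside the dashboard, the grader is conceptually just one LLM call per row. Here's a rough sketch with the OpenAI Python SDK; the review is one of the made-up ones from the data set, and the prompt wording is mine, not OpenAI's internal grader:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in your environment

review = "Boils water quickly, but the handle gets a bit too hot."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Classify the sentiment of this product review as positive, "
                    "neutral, or negative. Reply with a single word."},
        {"role": "user", "content": review},
    ],
)

print(response.choices[0].message.content)  # e.g. "neutral"
```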
And if we go back and create a new evaluation, I'm just going to be lazy and not select it; I'll do import, and I'm going to click on sentiment. We're going to analyze the sentiment of the customer review, and then we're going to use GPT-4o mini.
And we're going to just look for purely negative and see what pops up. Now, it should be not larger than 60% since 40% was tagged as neutral. Again, unless we're having a hallucination.
So that's one other component to take into account when doing this: at the end of the day, this is all being done by a large language model, so there is a margin of error even when you're doing testing. In this case, it says 20%, which to me is a pretty fair result. And if we go to the data, we'll see "Makes great coffee, but cleaning is a hassle."
So what this tells me is that, unlike the other ones where there was a "but" and it was tagged as neutral, the fact that you have the word "hassle" means it might have really taken the valence (the valence being the emotional weight of that word) into consideration when tagging it as negative instead of neutral. If we look at the other case, "Very sleek design, though the step counter seems inaccurate," this one was just tagged as neutral in the other run.
So I used GPT-4o mini. If I go back and rerun this with GPT-4o, I'm not sure if it's going to tag the exact same thing. That's the whole point of experimenting with this: to understand the flow.
And what we're going to choose here is the customer review again, with GPT-4o mini. And you know what? Let's actually switch it to GPT-4o, and maybe with a more sophisticated model we'll see if we get the same result.
And if we click on negative here and we click on run, interestingly, you'll get 0% of cases tagged as negative. So if you go to the data, those two cases, like the "Makes great coffee, but cleaning is a hassle" one, that GPT-4o mini tagged as negative, a smarter model tagged as not negative, because technically both of those cases were indeed neutral, not negative.
So a good learning here is that when it comes to reasoning, and when it comes to actually assessing meaning and context, GPT-4o is probably the best model to choose right now. The o1 models don't work very well for this and they're very expensive, so I wouldn't use them at all; I think GPT-4o, if you need some reasoning, is your best bet. So if we go back, let's do the next one. I'm going to click on, let's see what we have next: we have a string check, which is very simplistic. We just have a data set here with customer support responses to questions. So an example is, how would you end a customer support interaction politely? And the output is, "I hope I was able to help with your concern."
Thank you. What I want to do is be able to make sure that we have a polite bot that always ends in thank you. So I basically asked ChatGPT, make me a data set that has some thank yous at the end and some that don't have thank you.
Well, you can see here it ended this one without a thank you. And then this is called the ends with thank you data set. And if we go here and upload, we go to the ends with thank you data set, upload it, and import, and you'll see here the reference, like the ideal thing we're looking for in the string, is the thank you, and then here are the cases, where most of these do have a thank you. This one doesn't.
This one doesn't. This one doesn't. So we should have at least five of them that don't meet this criteria. So if we go to test criteria and string check, we're going to check if the output ends with the string, and I'm going to say "Thank you."
It should be written like that. I don't think I have to use quotes. If you click on add, it'll run.
You'll see here it says 50% passed. So if we go to data, these five did not have it, like I said before. So they tagged them all correctly, while these all ended in thank you.
So they tagged that correctly. Now for this specific one, if you're more technically inclined, there might be some more useful cases where you do string checks. But... this is the most simplistic one that I could show. And let's try the next one.
So if we create, and let's see what we have next. So validating a general JSON structure, this will be pretty quick. This file is called JSON validation.
So we have some JSON on a row level that looks like this, where this is one JSON object, this is another, etc. And we just want to check the integrity of that payload, whether it's legitimate or not. So if we click on import, I'm going to upload the JSON validation set here, and we're going to click open and import.
And then you're going to get, on the left side, your input, and then the output JSON. For the test criteria, we'll click on valid JSON. And all we have to say is that we want to check the output, whether or not it's valid. So if we click on add, this one doesn't seem to use an LLM behind the scenes.
And from the cost standpoint, I wasn't able to click the estimate, which tells me it's determining this with something programmatic, probably Python or similar, rather than an LLM. So this one should be pretty quick to assess. You'll see here that only 50% passed.
If we go to data, it'll say that this one doesn't have valid JSON, this one doesn't have valid JSON, and this one doesn't. So if we take any one of these, and we take this, let me copy this if I can. All right, and then if we go to a JSON validator, and use an online version of this, just to double check, if I paste this... and I click on validate JSON, you'll see it says invalid JSON. So it is working as intended.
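That check is easy to reproduce locally, too: in Python, "is this valid JSON" just comes down to whether json.loads throws an error. The payloads below are made up:

```python
import json

def is_valid_json(text):
    # Valid JSON parses cleanly; invalid JSON raises a JSONDecodeError
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"first_name": "Alice", "age": 30}'))  # True
print(is_valid_json('{"first_name": "Alice", "age": 30'))   # False, missing the closing brace
```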
So if we go back here, I think we only have a couple more to go. The next one we're gonna do is validating a schema. So making sure that a schema for a JSON you came up with is exactly as expected. So we're gonna click on and copy this from the Google Colab to make your life a bit easier.
I'm going to go to test and then, let's see what we called the actual file here. I called it schema validation. We'll click on upload, choose schema validation, and import, and then we're going to add the testing criteria. We're going to look for matches schema, paste the schema we're looking for, and then check if the output matches that schema, where the output is this column here. So when we click on add and we click on run, this should check each output from an LLM to see whether or not it matches exactly what you're expecting, which for many applications, whether it be normal app development, web application development, mobile app development, or anything with AI, is super useful to do, especially if you're not using something like OpenAI, which has structured outputs where you can guarantee what the schema looks like.
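If you want that same kind of check in your own code, the jsonschema library does essentially this. A minimal sketch, with a hypothetical schema rather than the exact one from the Colab:

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["first_name", "age"],
}

llm_output = {"first_name": "Alice", "age": "thirty"}  # wrong type on purpose

try:
    validate(instance=llm_output, schema=schema)
    print("matches schema")
except ValidationError as error:
    print("does not match schema:", error.message)
```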
Back in the results, you'll see it hit a 50% score again. In this case, these are the ones that don't exactly match the schema I entered. All right, we're down to our last two here.
So if we click on create and then we go on test, the next one we're going to do is criteria match. So here I've written some criteria for you. Again, hypothetical. Based on the provided criteria, evaluate if the assistant's final response about the tenant is appropriate.
So this is creating a hypothetical tenant database. Let's imagine you're a landlord and you're trying to fill up your properties with occupants. These could be like...
tenant applications that you're using AI to assess. So the response should highlight the tenant's positive qualities, any concerns, if present, should be mentioned neutrally, it should check that the response highlights positive qualities and suitability, ensure any concerns are mentioned neutrally, and then the output should be pass or fail. So if I take this, and let's see what the data set name is, it's called simple tenant data set, let's go here.
And we'll upload the simple tenant data set and then click import. You'll see here examples: you have names, you have their tenant profile, like "good payment history, steady job, no issues" or "consistent payments but had a minor dispute last year," and then you'll see the assistant's assessment of that, so "Alice is a reliable tenant with a steady job and no issues reported." So what we want to do is then click on test criteria, click on criteria match, and then I'm going to say the conversation is the tenant profile, and the model response is the assistant's final response. And then the criteria, I'm going to add here; I'm going to go back, copy it, and paste it here. I'll use GPT-4o just because there might be some nuance here. And then I'm going to click on add.
I'm going to click run. It's going to do that assessment to see which ones pass the criteria that I've just created. And you'll see here that... all of them seem to have passed.
So if you go here, based on the criteria that I provided, all of the assistant's final responses, which are the judgments of those tenant profiles, seem to be fair and thorough. So according to that criteria, we should be good in that capacity. Again, just like the custom prompt criteria, this one will be iffy, because there is a level of subjectivity here and you have the moving part of the LLM that you're using.
So ideally, try to be super, super explicit, so that you can know whether this is working as expected. Last but not least, if you go to create here, the last one we have is using cosine similarity for customer support responses versus company policy. So let's say you had a customer support representative, and you had maybe an AI version of that, like a chat agent. You can see the response of the chat agent, and maybe you want to put it vis-a-vis the company policy manual to see how close those two different entries were. So you can see here an example: "My product arrived broken, what should I do?" And then the support response is, "We're sorry to hear that your product arrived late," yada yada yada.
And then the company manual says, for damaged products, apologize to the customer and request a photo as evidence. And you'll see the rest of them. So we want to look for customer support responses. So if we go back here and click on import, I'm going to go here to customer support responses, click on open. Click on import and you'll see these are the inquiries, the responses and the manuals.
So if you go to testing criteria, we're going to go to text quality. We're not going to use BLEU, we're not going to use ROUGE, we're going to use cosine. So I'm going to compare the customer support response against, in this version, the ground truth of the company manual, and then I'm going to grade it with, in this case, one of two options, if you have no idea what these mean. If you remember when I first described semantic similarity, where I had two vectors, "I love you" and "I love huskies," and you had a vector to represent each set of actual words, this is what actually does the converting: it's the translator that converts the words to vectors. So you have something called text-embedding-3-small, and another one that is large.
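As a quick aside, if you ever want to run this comparison yourself, the embedding call in the OpenAI Python SDK looks roughly like this; the two texts are loosely based on the made-up data set, and the cosine math is the same as the sketch from earlier:

```python
from openai import OpenAI
import math

client = OpenAI()  # expects OPENAI_API_KEY in your environment

support_response = "We're sorry your product arrived damaged. Please send us a photo so we can help."
policy = "For damaged products, apologize to the customer and request a photo as evidence."

# text-embedding-3-small is the cheaper option; -large just uses more dimensions
result = client.embeddings.create(model="text-embedding-3-small",
                                  input=[support_response, policy])
a, b = result.data[0].embedding, result.data[1].embedding

dot = sum(x * y for x, y in zip(a, b))
similarity = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
print(similarity)  # between 0 and 1 in practice; closer to 1 means more similar
```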
So for most cases, for at least 80% of your cases, you should be able to use small. The main difference between both is the large one takes a lot more dimensions into account. So it overthinks. So that vector about Huskies being only like a few things different from I love you, it might be even more complex than that, depending on different factors and dimensions it takes into account. If that makes zero sense to you, don't worry about it.
Small should work for 80% of your cases, and it's way cheaper than using large. So just use small. And then passing grade again. Cosine similarity is out of one. So one meaning it's an exact match.
And 50% is kind of like, it's a meh match. It's kind of a match. So if we did something like 0.6 instead of that, we'll do add, and then we'll click on run.
And you'll see here it says 100%. And if we go to data, you'll see that in all cases they seem to be close enough semantically that you get a perfect score. Just to battle test it out of pure paranoia, if we add a testing criteria where we do the same thing, but now we use cosine and make it, let's say, 0.9, and then use text-embedding-3-small, click add, click run, I don't expect that we should get a perfect score again.
If it does, I would be suspicious. So in this case, it was so extreme that none of them passed.
So this is a good example of extremes, right? Maybe 0.75 would be more fair, and it doesn't take too much time for me to test that hypothesis out. Now that we're getting the hang of this, we'll just go here, boom, and then let's do 0.75, add, and run. I would expect it's not zero and it's not a hundred, probably something in between. You'll see here it says this one is the most similar to the company manual with that grading of 0.75 out of one. So again, if you're trying to explain this to stakeholders on your team, cosine similarity could be a bit more confusing to explain, like why this result is closer. But if you're looking for, you know, some form of directional accuracy on how closely the responses of your agents or bots match your company policies, this will, again, give you something that is good enough to say, you know what? At more than 50%, we're doing well.
More than 60% we're doing well. But maybe at the 90% rate, we're not being exact or in detail enough with our agents. So that being said, we're not done yet.
I did mention and promise one more component here, which is this part of the code that you don't have to run. That's why we clicked on it and did "run before" this code, because this one is more superfluous. All this code does is show you how to send what's called a completion, which is basically like a test message to an LLM that comes back with a response.
If you run this, all that's going to happen is it's going to loop through all these predetermined questions I've put together. It will send each one and get a response, and you'll see why I've done that as soon as it actually finishes running. And this finished running after around 51 seconds.
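For reference, the core of that last cell is just a loop like this; a sketch where the model name, questions, and system prompt are illustrative (they echo the ones you'll see in a second):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in your environment

system_prompt = "You are a helpful customer assistant for an e-commerce company."
questions = [
    "Where's my order?",
    "How do I return an item?",
    # ...the Colab loops over ten questions like these
]

for question in questions:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    print(question, "->", completion.choices[0].message.content)
```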
So the whole point of this was I wanted to show you that once you send this to OpenAI and you go to create test criteria, you don't always have to necessarily upload data. If you've already had an interaction with the OpenAI service, you can go on completions and then you can click on which model you want to use the completions for and then what date. So I'm going to use today's date because I just submitted those.
I'm going to click on apply. It's going to say then I've had 10 samples or 10 different requests that have been sent, which is exactly what I sent here right now, these 10 different questions. And in this case, I can leave these all blank, click on import, and this will create a table importing the last pieces of interactions that I've had.
So this one, if you go to my thing, it says where's my order, and how do I return an item, etc. You'll see all of these questions responded to here.
So, where's my order? And then the system text is "You are a helpful customer assistant for an e-commerce company." And here's the output of that question based on that rule we've assigned it.
So this is an easy way to basically use what you've already submitted to the API, assuming it's maybe for customer purposes, and quickly check it that way. Now, the last magic trick I had up my sleeve: if you go back here, go to create, and add any form of data set here (it's not important which), a lot of these, three of them to be specific, have the underlying prompt that's used to power the actual testing. So if you go to factuality and click on this dropdown, you'll see the actual prompt they came up with for grading. So even if factuality isn't working for me, or maybe doesn't work for you either, you can actually take the system instructions to create your own mini prompting agent. That's one way you could do this. And if we go back, we can see the next one is sentiment, where this one also has a breakdown of how the sentiment is being calculated or tabulated. And then the last one that I saw was criteria match, where, even here, it has a full prompt for how it's creating that criteria and how it's grading it.
So even if these are not useful within this app itself, you can steal that prompt to make your own version. And naturally, that's what I did. I took the factuality, sentiment, and criteria match prompts and made them available here in the prompts document. Naturally, you're going to find this, the Google Colab, the assets, everything in this video, as well as the actual data files themselves, so you can test this without even running that code; it will all be in the Gumroad in the description below. And as usual, this stuff takes forever, so if you can support the channel, I would appreciate it.
Now, what I did is I took all of these together and created my own Frankenstein prompt using meta prompting to have it all in one. So I can create one custom GPT where, depending on what kind of analysis I'm trying to do, I don't have to upload a JSONL file; I can just upload a normal file into the GPT and go back and forth to do that assessment. So you'll see here I have a GPT called prompt testing, where if I click on, let's say, fact check (and you can tell I built this because I was very salty that I couldn't use factuality after spending most of my day trying to figure it out), it'll ask you what the desired grading scheme for the evaluation is. In this case, I just used the one they have in their prompt.
If I instead go to sentiment, it should also push back on what's needed. So in this case: the desired sentiment you expect; any specific words, phrases, or tone that should be prioritized or avoided; and, last one, the emotional expression expected.
And then if we hit criteria, that's my way of spelling criteria. It should again ask you and push you back with a few of the additional things that they came up with in their prompt, which in this case is only if any criteria is more critical than others, and whether the answer can include information beyond the criteria, or if it should strictly adhere to them. So I created a proxy for those three that bothered me, and I wanted some way to still be able to use that, even if it wasn't working for me in the platform itself. Now I know this was a super long video, probably longer than I expected, but I wanted to go...
deep into this framework to see what was useful and what wasn't, so you can decide, as a business owner or as an AI entrepreneur, what makes the most sense for you to use this for, and where this could possibly save you a lot of time. I expect that because this is in beta, all those bugs I was showing you, or those weird results, will resolve themselves over time as this gets better. But in the meantime, this is better than most Frankenstein solutions that are out there for doing some initial conversation testing and prompt testing, without needing someone technical on your team to do so. If you found this helpful, I'd super appreciate a comment down below.
It helps the video, it helps the YouTube channel, and if you like and subscribe, it won't hurt either. So thank you very much, and I'll see you next time.