Everyone talks about writing prompts, but not how to test them. This is actually the trickiest part, where entrepreneurs and organizations always seem to get stuck. It's one thing to write these prompts, and it's a completely different thing to actually know how to battle test them effectively.
But what if you could have AI not only write your prompts, but also battle test them for you? In this video, I'm going to show you four practical ways to do exactly that. I'm going to show you how to use a custom GPT for conversation simulation, leverage Google Sheets with a GPT add-on, and use Airtable with Make scenarios for both static and conversational prompt testing.
These are concrete and actionable steps that you can actually start using and adding to your prompt engineering arsenal today. And I can guarantee whether you're an experienced prompt engineer or just a beginner starting out, you're going to learn a valuable skill or at least a new tool in this video. And if you stick around till the end, I'm going to give you a bonus Easter egg, a tool I've spent weeks building, bringing everything together.
If you don't know who I am, my name is Mark, and I've been running my own AI automation agency called Prompt Advisors for the past two years. We help companies across all industries better understand where best to use AI in their workflows. Now, first, I'm going to show you every single technique in action, basically showing you the end result, and then we'll go back and understand how it's working and ultimately how you can build it for yourself.
So without further ado, let's dive right in. Let's start with level one, which is using a custom GPT that's been fine-tuned to take a prompt and generate a simulation of how it would expect a conversation to go if you used said prompt. The way I built this is I click on test my prompt, and it will ask me for my underlying prompt here.
And what I'll do is I'll take this sample prompt I put together, which pretty much takes on the role or persona of a mini therapist, trying to understand relationship challenges and giving actionable advice. So we're going to take this here and paste it in.
And then when we click enter, it's going to start actually simulating the potential back and forth between the assistant and some simulated fake user. And you'll see, when I actually break this down, how I'm creating this persona of the fake user. And you can see here, once it's done, you get an entire conversation between what it thinks the AI will say based on your prompt, and what a possible simulated user might respond to each different utterance. So as you go down, you have a different back and forth, and you can start to understand how your prompt might actually work in the field, especially if it's more of a conversational prompt than a "write me an SEO blog" kind of prompt.
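By the way, if you ever wanted to reproduce this same idea outside of ChatGPT, the core mechanic is just one loop that plays both sides of the conversation. Here's a minimal JavaScript sketch of that idea, not my actual GPT: callModel is a placeholder you'd swap for a real LLM API call, and its canned replies just stand in for model output.

```javascript
// Minimal sketch of the level-one idea outside ChatGPT: one loop that
// plays both sides of the conversation. `callModel` is a stub; a real
// version would send `history` to an LLM API and return its reply.
function callModel(role, history) {
  return role === "assistant"
    ? "Simulated assistant reply #" + history.length
    : "simulated user reply #" + history.length;
}

function simulateConversation(systemPrompt, exchanges) {
  const history = [{ role: "system", content: systemPrompt }];
  for (let i = 0; i < exchanges; i++) {
    // The same script speaks for the user first, then the assistant.
    history.push({ role: "user", content: callModel("user", history) });
    history.push({ role: "assistant", content: callModel("assistant", history) });
  }
  return history;
}

// Three exchanges, like the three-to-five the GPT is instructed to do.
const transcript = simulateConversation("You are a mini therapist.", 3);
```

The custom GPT does exactly this, just with its instructions instead of code.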
All right, for level two, we're going to go back to the Stone Ages and use a Google Sheet. And what we're going to use specifically is an add-on called GPT for Sheets, which is free to use for the most part. What you can do is just enter your API key for whatever LLM you want, and then you can say, write me a prompt that will generate an SEO-enriched blog about zoology.
What will happen is the first cell here has a GPT function, enabled by that add-on, that will execute this and create a meta prompt. And then column C is going to take that meta prompt and actually execute it to show you what the possible output could be. All right, and you'll see here, it's created a whole prompt
for an SEO-enriched blog about zoology, with all these instructions here. Obviously, if in your case you already have a prompt, you can use that. And in this case, it will actually execute that prompt here and create the SEO blog. So you have a very quick, very low-skill way to actually trial this out with zero code whatsoever.
You just have to enable the add-on, which I'll show you later on, and you're good to go. All right, level three, this is where we start getting super fancy. And I'm not going to talk too much about what's happening yet.
But we're going to say the same thing. Write me an email to my boss telling them that I'm quitting to become a prompt engineer. So I'm going to click on battle test here.
It's going to execute a script and then kick off a process in Make.com that executes this entire workflow and simulates three possible prompts and three associated results using different LLMs. So before we were using one LLM, and now we're using things like GPT-4o, GPT-4o mini, and Claude 3.5 Sonnet, and we're getting the results of those prompts simultaneously. So you'll see here it executed in literally under 40 seconds, and in some cases we got an actual prompt, while in other cases it tried to actually execute our command. So you'll see here that GPT-4o mini understood the assignment and created this prompt.
Claude also understood the assignment. But GPT-4o, for some reason, executed the task instead of actually writing a prompt for that underlying LLM. And if we go to the right-hand side, we can see the possible result of that prompt, all without ever having to go back and forth ourselves. So we can start to really iterate very quickly using this process.
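Conceptually, what the Make scenario is doing here is a parallel fan-out: one task goes out to several models at once, and every result comes back together. Here's a JavaScript sketch of that pattern; callModel is a stub standing in for the real GPT and Claude modules in Make, and the model names are just labels for illustration.

```javascript
// Sketch of the fan-out the Make router performs: send one task to
// several models concurrently and collect every result. `callModel`
// is a stub; a real version would hit each model's API.
async function callModel(model, task) {
  return `[${model}] generated prompt for: ${task}`;
}

async function battleTest(task) {
  const models = ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"];
  // Promise.all runs the three calls concurrently, like the router.
  const prompts = await Promise.all(models.map((m) => callModel(m, task)));
  // Map each model name to its generated prompt.
  return Object.fromEntries(models.map((m, i) => [m, prompts[i]]));
}
```

That concurrency is why the whole thing can finish in under 40 seconds: the three models aren't waiting on each other.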
All right, and for level four, we're going to go a little bit fancier. So in this case, we're not only going to enter a task that we want to create a prompt from, we're going to enter our actual prompt, and then we're going to enter a user prompt. So in this case, we're going to instruct the LLM to take on a certain persona, maybe the avatar of our user or a possible person in the organization that's going to be interacting with this LLM. So in this case, we'll take our old prompt here on that mini therapist and paste it in as the system prompt.
Then for the user prompt, I'm going to say: be a little sassy, you have problems, you're about to talk to a therapist, and use lowercase when you speak. And let's keep it at that. And then we click this simulate button.
And this is going to kick off a completely separate process in Make.com that will actually execute a back and forth simulation of this conversation. So you'll see here, it's executed here as well. And there's a back and forth conversation that ends up getting stored back in our Airtable. And if we go here, we'll look through, and you'll see for user response one, this is what the simulated user is going to say.
And this is what the assistant would say back. And then we have a back and forth conversation where you can see immediately, where do I even start? It's like everything just piles up, you know, work is a mess, et cetera, et cetera.
So now you can actually start to have a pulse for how your prompt might actually survive in the wild when it comes to being used by people that don't spell properly, or sometimes get sassy and try to break the AI, or whatever different scenario you might be expecting. Now, like I promised, I'm not going to show you level five until the very end, so we're going to go into some slides to break down how these four tools work conceptually, and then I'll break down how you can actually build them on your own. All right, so again, we've got a custom GPT, we have Google Sheets with an add-on, we have a static prompt Airtable with an automation, and then we have a simulated conversation Airtable with an automation. Now, if you don't even know what Airtable is, it's pretty much a Google Sheet on steroids, with a prettier UI and, honestly, in my opinion, a better back end to work with, especially when it comes to automations.
So if we take the custom GPT, at a high level this is how it works. The instructions, along with those for the other three use cases you're going to see, are all in the Gumroad link in the description below. So you can just sit back and try to actually synthesize this information, knowing that you'll be able to dig into the details on your own time. So if we go into the details here, instructions are provided to the GPT, and then it's told to simulate a back and forth conversation.
At times, depending on your prompt, it'll ask you: do you want me to adjust the tone based on the persona or topic, or address off-topic or irrelevant conversations? And then it's instructed, for both the user and the assistant, to simulate responses to each other, building on each other. And then you want to follow a natural flow of conversation.
And I tell it to try to simulate three to five different exchanges, unless otherwise instructed, because obviously it's a GPT. So if you want to go seven to ten, it can do seven to ten; it would just be an override. And then the last instruction is to end the conversation with a summary of the key takeaways of how that conversation actually went and the AI's observations of it.
Now, something like this is easier to do, but it's obviously imperfect. And there are multiple reasons why: the same way an LLM can hallucinate in general, it can hallucinate in how it would expect your prompt to work in the wild.
Not only that, but you have to remember that when you use something like ChatGPT, there's some form of prompt behind the curtains that controls how the entire experience works. So if you're using something like an API or a back end, it's not necessarily going to have that same tonality back and forth as you would get within ChatGPT.
So it's helpful if you want something directionally accurate to kind of check your instincts. But when it comes to actually deploying something in the world, like some form of mobile application or web application, it's not going to be as representative of what it looks like in reality. Now, for level two, we had that Google Sheet with the add-on, where pretty much all we have to do is install that add-on.
And then we enter the task goal in one column. And then we create a special formula that starts with =GPT, where you basically reference either a prompt or an instruction.
And I'll show you how to do that. Then you have that prompt automatically shown in the next column, which we saw. And if you make another cell dependent on that prompt, you can take that prompt as the only input for column three, which then acts as the instruction, and that's when you get that simulated actual response.
For level three, that static prompt Airtable, this one's a little bit more complicated. So we have that mini task in Airtable, where in our case, I think we said we're going to write an email to our boss saying we're resigning to be the prompt engineer of our dreams.
And in this case, we can click the battle test button. Behind that button is really all the magic: a JavaScript script that I wrote, which I'll also make available in the description below, that you can just copy and paste.
And you can literally throw it into GPT and say, hey, here's my webhook for Make. If you're not familiar with what Make is, it's like Zapier's brother, or in this case Zapier's rival: automation software. And you can just give it that link of where to send the message to actually trigger the automation. Then it flows through all the micro parts of that automation, and when the Make scenario is activated, the prompt is sent back through the API, we store all the responses generated from Make back in Airtable, and we're pretty much done at that point. All right, and for the second version of this automation, what's going to be different here is we have a task like we did before, but then we have a simulated user prompt, where we basically say to the AI in the Make scenario: hey, here's the persona you're taking on. This is very similar to what I just mentioned before, saying: hey, you're going to be sassy, you're going to be a bit difficult, you have some form of problems you want to talk about. So we try to simulate our possible end user. And for the simulate button, we also have a script that, when you execute it, pretty much takes those two pieces of information and throws them into Make via webhook. And if you don't know what a webhook is, imagine there's a listening ear.
This listening ear is listening for certain components. We are sending those two components and telling it: hey, here's this, here's this, take it, remember it, you're going to need to use it later on. One super important thing is that you need to create what's called a run ID, because Airtable in these automations needs some way to keep track of which row and record we're actually processing.
So imagine we had 10 different rows and we clicked simulate: it would have to have this distinct record ID that it sends to the automation so that it keeps track of where it should put the brand new rows for that specific run. Slightly complicated, but again, I'm going to give you that script. So all you have to do, if you want to configure anything or add a new column, is put that script into GPT-4o and say: hey, I want to make a couple changes to my Airtable so it does this, this, and this.
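To make that concrete, the heart of such a button script boils down to a small POST with the record ID riding along. This is just a sketch, not my exact script: the webhook URL and the field names are placeholders you'd swap for your own.

```javascript
// Sketch of an Airtable button script: POST the clicked row to Make.
// The URL and field names below are placeholders, not a real build.
const WEBHOOK_URL = "https://hook.make.com/REPLACE_ME";

// The record ID is the crucial part: it tells the Make scenario
// exactly which row to write the simulated conversation back into.
function buildPayload(recordId, systemPrompt, userPrompt) {
  return JSON.stringify({
    recordId: recordId,
    systemPrompt: systemPrompt,
    userPrompt: userPrompt,
  });
}

// In Airtable's scripting environment, you'd then send it like this:
async function simulate(record) {
  await fetch(WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildPayload(
      record.id,
      record.getCellValue("System Prompt"),
      record.getCellValue("User Prompt")
    ),
  });
}
```

Everything else in the script is plumbing; the record ID in the payload is what makes the round trip back to the right row possible.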
And then if you want to add something in the columns, you just have to accommodate for that in the underlying Airtable and in the Make automation. But once you see it, it'll make a lot more sense. And after that, it makes that simulated conversation back and forth, where pretty much how this scenario works is we're creating our own mini thread: we start with the simulated user persona saying, hey, I have some problems, and then the assistant responds.
And then when we come to the third step, fourth step, et cetera, each step is going to basically inherit the messages exchanged between the user, which I label, and the assistant, which I also label. And you keep building that as you wish until you get to the final one, where you basically have six different messages, and you're telling the Make automation: hey, you're going to answer based on all these other exchanges you've had, and this is what you're going to try to answer from the assistant. Now, even though I'm breaking it down in a diagram, it still might be confusing. But I'm a very visual learner, and hopefully you are too, so when we get to that automation, you'll see exactly how it works. All right, so what I'm going to do now is walk through each one of these builds step by step, so you can actually see the nuts and bolts of how these work.
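That growing "mini thread" can be pictured as an ordinary messages array that each step appends to. Here's an illustration of the structure only; the message text is made up, and in the real scenario each step's content comes from the GPT modules in Make.

```javascript
// Illustration of the growing "mini thread": every later step inherits
// all user/assistant messages exchanged so far. Text here is made up.
function addExchange(thread, userText, assistantText) {
  thread.push({ role: "user", content: userText });
  thread.push({ role: "assistant", content: assistantText });
  return thread;
}

let thread = [{ role: "system", content: "You are a mini therapist." }];
thread = addExchange(thread, "where do i even start", "Take your time.");
thread = addExchange(thread, "work is a mess", "That sounds heavy.");
thread = addExchange(thread, "and my sleep is awful", "Let's unpack that.");
// By the final step there are six messages plus the system prompt, and
// the last call answers with that whole history as its context.
```

That inheritance is the whole trick: the final assistant reply is grounded in every prior exchange, not just the latest message.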
For the prompt battle testing GPT, pretty much behind the scenes, if we go to here and edit GPT, and we blow up this prompt, it says you are tasked with simulating a full back and forth conversation based on the provided prompts. Follow these guidelines to create realistic, autonomous conversation flow. And then I say simulate both sides of the conversation, and that you should initiate the conversation using the prompts that you've created.
And then you want to try to respond on behalf of the user with realistic replies reflecting a conversational tone. And then I tell it to have three to five back and forths unless instructed otherwise. And then basically I'm talking about the different tone and persona it should be taking on. I give it a multi-shot prompt here, giving an example of what the simulation should look like.
This is super essential. Otherwise, it tries to actually ask you a question and expects an answer from you, the actual human user, to go back and forth with it. So we want to emphasize that it's the AI's job to simulate the conversation for us, not our job to simulate it with it.
And then I go down here to edge cases, because I noticed that when I built this GPT, sometimes when I'd put a prompt in, it would actually execute that prompt. That's something you would have seen right before in that Airtable example, where GPT-4o, which is the same model this custom GPT is using, took the prompt literally versus taking it in for analysis to create that simulated conversation. And then down here, we create another example of that simulated conversation as another piece of evidence as to what it should try to do when it's provided with a prompt. So this one's pretty straightforward, and you'll be able to find it in that link I provide. And then if we go to the next tool, this is the Google Sheet GPT.
This is level two, and how it works, like we said, is we have this extension. I'll click on open here just to show you what it looks like when you install it. It's called GPT for Sheets and Docs. So if I go into GPT for Sheets and Docs, you'll see that this is an extension; if I just zoom in a tad, you can install this assuming you have a Chrome browser.
This will only work in Chrome; I'm not sure if it's also available as an extension in Firefox or Microsoft Edge. But at least for Chrome, this is all you have to do.
You click install. And then when you actually log in and you go here, you enable GPT functions. If you're using it for the very first time, it should open this up. And then, let's just create a new spreadsheet to make it simpler.
And then you'll see a screen like this that says, welcome to GPT for Sheets. You click on continue, and then you can choose the large language model of your choice. A year ago, this only used to be 3.5 Turbo, and now they've made it much more advanced, with models even including Gemini and Claude models, which I haven't used myself. But if you want to use GPT-4o, you click on I understand. And then when you click on best practices, it'll tell you what to do and what not to do.
And then it'll show you a list of every single GPT function. So this is actually much more potent than the use case I've already shown you. You can use it to automatically split or format data in a Google Sheet, but we won't focus on that too much for now. When you go down to settings, ideally you want to click on safe mode.
A lot of these softwares and add-ons tend to time out, especially if you're dragging multiple rows at the same time. So if you use safe mode, it will take a longer time to actually load, like you'll see back in our sheet here. So if I say write me a blog about something else, let's say dogs, you'll see how it'll say, in two seconds: GPT loading, ETA less than one minute. It's actually delaying the execution to make sure there's a higher likelihood there won't be some form of timeout when it sends that API request. So it's by design; I would toggle on safe mode here. And then you want to sign in, ideally with your Google account. And if you haven't already, you want to provide it with your API key: you can go here, and then go to API keys. And this is where you configure it for all these different LLMs. In my case, I just put mine here.
And what's cool about it is that it'll remember it moving forward. And then, if it's properly enabled, when you type equals and then GPT, you should see all of these different functions actually pop up. So if you use something like this GPT function and reference the actual values here, say some value like tell me a story in brackets, then assuming you have your API key actually registered and put in, this should load and come back with some form of response.
And you'll see here, this is quite long winded, but it came back with the same type of response you'd get through any API request. So you can already see and start to imagine the different use cases you could use this for even beyond prompt testing. But for now, that's pretty much all you need to do to get it to actually work in terms of the formulas themselves. I'll make this available to you as well. And in the sheet, there's nothing crazy I'm doing here.
It's pretty much just referencing the value in this cell to say: hey, look at this thing and execute that thing, which just happens to be a mini meta prompt. So in our case, if we were building this manually, I'd call this, let's say, prompt one, then result one, and I'd say, write me a story about dinosaurs. Then we just double click this cell and type =GPT. All you've got to do is initially put some double quotes, then close the double quotes, and then you can reference one or more cells. In this case, I can reference this cell by typing an ampersand, the cell, then an ampersand and double quotes, double quotes again, and then clicking enter. And I do that because if you want to reference multiple cells at the same time, that whole ampersand-quote technique will help you do it much more easily. And you can see here, we get that executed prompt.
And that's pretty much all we're doing. So in this case, in our original sheet, we just asked this to generate whatever is here, which in this case is essentially "write a prompt". And then this one just looked at that value. So we just referenced, you can see here, B2, which is the cell here, to actually execute it and show the result.
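If you're curious what that add-on is roughly doing under the hood, a =GPT() cell is essentially a custom function that sends the cell's text to an LLM API and returns the reply. Here's a hypothetical Apps Script-style sketch of that idea; this is not the add-on's real code, and the model name and endpoint are just the standard OpenAI chat-completions values.

```javascript
// Hypothetical sketch of what a =GPT() custom function does under the
// hood. NOT the add-on's real code; model and endpoint are assumptions.
function buildChatRequest(promptText) {
  return {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: promptText }],
  };
}

// In Google Sheets, an Apps Script function like this could back a
// =GPT(A2) formula (commented out since it needs the Sheets runtime):
// function GPT(promptText) {
//   const res = UrlFetchApp.fetch("https://api.openai.com/v1/chat/completions", {
//     method: "post",
//     contentType: "application/json",
//     headers: { Authorization: "Bearer " + YOUR_API_KEY },
//     payload: JSON.stringify(buildChatRequest(promptText)),
//   });
//   return JSON.parse(res.getContentText()).choices[0].message.content;
// }
```

So when you chain B2 into C2's formula, you're really just chaining one API call's output into the next call's input.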
So I wouldn't run this for hundreds of requests at the same time. But if you want to test maybe five to ten different tasks and drag it down, you can totally do that. So let's drag this down and add a couple more inputs.
So let's say, write me an email, write me a note. And then we drag this down to both of these cells here, drag this one down to both of these cells, and zoom in just a tad. These will show an error until they've actually finished executing.
And in this case, it's asking me to clarify what I'm trying to execute. So the more descriptive you can be in your underlying prompt, the more likely this will actually execute the thing. So ideally you want to say, write me a prompt to create an email.
And this will change if we just update this and make it refer to this value: GPT, quote, quote, ampersand, the cell here, ampersand, quote, close. Oh, we need one more quote. All right, so in this case I get an error; in my case it's telling me I'm out of API credits, so I'll take care of that later. But this would have run just like the other one ran and created the prompt, and then you can drag over to actually test that prompt in practice. So that's the level two way to build this. All right, for level three, it's a bit fancier and obviously a bit more complicated. The easy part is you're just creating an Airtable, which again is just like Google Sheets but prettier, in my opinion, and easier to use, where you create these different columns. For all of these columns that say GPT-4o prompt, what we're going to do is set the field to be a long text field. So unlike Google Sheets, where anything goes unless you want to change the format of the underlying cells, you can actually say from the beginning: this is going to be a long text type of column, or short text, a date column, phone number, etc.
So in this case, all of them are long form. And all you have to do, technically, is create this task goal column.
And then this is what's called a button. So if you go to edit field, one of the different types of fields is going to be called a button. And then what you want to do is set the action to run a script when you actually click on it.
So when you go here, to edit code, you're going to see, if you're a developer, a not-scary piece of code; if you're a non-developer or non-techie, a scary little piece of code that I put here. The only thing you need to change, assuming you want to copy this exact build, is to replace my webhook with your webhook. Just make that swap and you'll be good to go. Otherwise, let's say you want to add an additional column or you want to switch the LLM. My advice, if you don't have the technical background, is to zoom out just a tad, take a screenshot of this table, and paste that into GPT-4o, not a model that doesn't accept images as of yet. Then, if we go to edit field and edit code, I would copy and paste what I have and say: hey, I have this code for an Airtable button. I want you to configure it so that I'm adding Claude 3.5 Sonnet or Claude 3 Opus as a new prompt column, as well as a result column. And then I'll just paste my code here.
And then Bob's your uncle: you'll get some form of script that you can copy and paste back, and then you can run it, assuming it works. If we click this and click run, what I wrote in the code includes a way to actually show that it worked. So you'll see here, I created this little feedback loop where it says: hey, this is the record ID of the actual row, and this is the task or goal that I'm going to be sending to the Make automation, which is a whole different story that we'll get to.
But this is the fundamental step on the Airtable side. So if we go to GPT, most likely this code will work the first time if that's the only change you want to make. If you want to be a lot more surgical in terms of what those updates are, you might have to go back and forth a few times with GPT until you see it runs properly and does exactly what you're looking for. Now, when it comes to Make, I'm going to record a separate, probably one-to-two-hour masterclass on how to use Make.com end to end, kind of for beginners to intermediate users.
But in lieu of having that knowledge, if you don't have it: Make, again, is automation software where you connect different modules together to create a mini symphony, all triggered by one main thing, or sometimes multiple, but usually one main trigger. In our case, our trigger here is this webhook, which is listening in for certain values. In this case, it's listening in for the record ID, and it's listening in for the task and goal.
I didn't actually have to tell it that's what it's looking for. All you have to do is add a module, say webhook, click on custom webhook, and then call it something like prompt testing table. Or you can create a new one where you say, my test webhook, click OK, and then it's just going to keep spinning. It'll stop spinning when it first gets a piece of information that it's listening for. So if we copy this address for the webhook, paste it in here, go to the code, purposely break my code (which I'll revert shortly) by doing something like this, then click run with that code and click this. When we go back to Make, it'll now understand: hey, this is the type of information I'm going to be receiving from now on, so this is what I'm going to use for the rest of the automation. So if we click OK here, we'll see that this must be a trigger for the first module in a scenario, but we know it's now going to receive the information we just showed you. When you have that, that's when you get something like this that shows you a bundle, which is basically that record. And now we can kick off the automation from here.
We're going to use this thing called a router, which pretty much says: hey, here's an input; go do all these different things at the same time with this input. At the very top here, we basically have a GPT step and the instruction for the GPT. I won't go through it in depth, because this is going to be available for you. But I'll say: you are a prompt engineer.
You write unbelievably detailed prompts and output them in Markdown. Take the following task and goal, and create a very detailed prompt based on the task goal. So that is this value right here. And let me just go back here and undo what I did to this little Make automation, just so I don't break it. Let's copy this address. We're going to go back.
I'm going to override this. Click OK. Save that. And we should be good.
And if we go back here, we'll see the rest is saying: do not output any other commentary, just because sometimes LLMs say, sure, I'll happily help you with that, which we don't really care for. And then I just say generate the prompt to accomplish the task, but don't actually execute it. And all this Tools module does, at least in this case, is store a variable for later; it's a very versatile type of module. So all I'm saying is: hey, this output from the GPT step, just hold on to that for a moment.
And I do that over and over again for the different versions. It's the identical prompt for GPT-4o and GPT-4o mini. For Anthropic, this step looks slightly different in terms of how it's broken down, but the task, or at least the prompt, is very close.
I would say I just made it a bit more emphatic about not outputting the triple-backtick Markdown fences. And even despite that, it still will output them sometimes. One super helpful thing is that when you click on task goal, assuming you've actually run this automation, you're going to see
the payloads that you have from Airtable. So you can always keep in mind how you should structure your prompt based on how this automation is actually designed. So that's all being stored, and then once these all run, I'm getting all of those variables we just stored above, and we have another router to kick off multiple other steps, which in this case means literally executing the prompt we just created.
So in this case, the core prompt here is: execute the following prompt, the prompt from above. And then my note is: don't output anything other than this, don't output any Markdown, don't ask me any clarifying questions.
Just do the thing. And then we do that for all three steps. We store those as variables.
Then all we do is once we have all those variables compiled, we go into Airtable. And this is where it's super important to have this record ID. We have to be able to tell it, hey, this is the row where you need to inject all these values.
And then it's more so mapping: I need you to take the output from above of the actual prompt we made using some meta prompting for GPT-4o, GPT-4o mini, and Sonnet, and then I want you to also store the prompt result, the mini prompt result, and the Sonnet prompt result. So once you inject it here, assuming I didn't break things: if I add a new row and say, write me a poem about prompt engineering, and click battle test, fingers crossed I didn't break anything, and we go back here... there we go.
And you'll see it is running the automation. It's going to do GPT-4o. If we zoom in here just a tad, it is now doing the Claude step.
After this, it will be going to the router. It's going to be creating the results of those prompts. And then it'll finish off by actually writing to that Airtable. And you'll see, once it's done loading, it'll do a write, which should take literally one to two seconds to actually run in practice.
And then we'll get a success here. There we go. We got that success. And one thing we can do is we can go to Airtable and see everything that was actually inputted into the Airtable. So this gives us an actual record of saying, hey, I did what you asked me to do.
So we should be good to go. So that being said, if we go to Airtable here, you'll see that we got all of those outputs. And you'll see something interesting, which is the first time when I asked it to do something, GPD 4.0 with that prompt actually executed the thing.
The second time, it actually wrote a proper prompt using meta prompting. So that's good. In this case, all of them worked perfectly. And you can already see how impactful this is, especially if you're in a company, or an entrepreneur, trying to generate tens of prompts and battle test the best way to do a thing, or explore different permutations of prompting and the results you might get from different LLMs. This is a really helpful way, in a non-coding manner, to actually make that happen. For level four, when it comes to the conversational prompt, we have the exact same design of the Airtable, except we have two different variables we're listening in for. So now we have the record ID, and we're listening in for the system prompt, which is ideally some detailed prompt you're actually using.
And then you create your fictitious user on each row. Each different user can be different, if you have different avatars: how would someone super rude interact with this system prompt?
How would someone that is young interact with it? How is someone that makes random sounds like saying quack the whole time? How would that actually look like as a conversation if we were to actually put this in the wild? So this is where you get a lot more variability. And then similar to my last code, if you click on this and you go to edit field, and then you go to edit code, you'll see here that in this case, we have user response, assistant response, user response one and two and three.
And we have that webhook again, listening for those three different components. You could easily take my code, put it back in GPT, and say, hey, I made three more columns for user response and assistant response three, four, and five; please incorporate that into the code. And then all you'd have to remember to do is, in my automation, when it goes to that Airtable, store two or three more conversations and make sure you update those rows with the outputs we're collecting as variables.
So a little note there: you might need a slightly different skill set or a couple hours on the weekend to figure that part out, but it's not difficult, especially using this as your base. So if we get into the actual Make automation itself, similar to the last scenario, we have a webhook listening for these variables, which are the record ID, the system prompt, which is some long variable, and then our simulated user prompt. And what's happening here is actually pretty straightforward, but a bit more nuanced.
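To make the webhook step concrete, here's a minimal sketch of receiving and validating those three fields. The field names (`record_id`, `system_prompt`, `simulated_user_prompt`) are my assumptions; match them to whatever your Make webhook actually sends.

```python
# Sketch of parsing the payload the Make webhook listens for.
# Field names are assumed -- align them with your own webhook setup.

REQUIRED_FIELDS = ("record_id", "system_prompt", "simulated_user_prompt")

def parse_webhook_payload(payload: dict) -> dict:
    """Validate the incoming payload and return only the fields we need."""
    missing = [f for f in REQUIRED_FIELDS if not payload.get(f)]
    if missing:
        raise ValueError(f"webhook payload missing fields: {missing}")
    return {f: payload[f] for f in REQUIRED_FIELDS}

payload = {
    "record_id": "recABC123",
    "system_prompt": "You are a travel agent...",
    "simulated_user_prompt": "You are a lost millennial...",
}
parsed = parse_webhook_payload(payload)
```

Failing fast on missing fields like this saves you from half-written Airtable rows when a column gets renamed.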
So our prompt here, using GPT-4o mini, and I'm doing this just because it's faster and cheaper, is: you are a user of a conversational app; have a full back-and-forth conversation, communicating and replying to the AI. You'll be speaking in a very human-like tone, embodying all the flaws, spelling mistakes, lowercase spelling, and nuances of how humans tend to respond. So this is a very important part that I came up with that's helpful for creating that fake user. And then what I say is, here's your goal when chatting.
So I reference simulated user prompt. And as I go down here, if I'm allowed to, let me just zoom out just a tad. There we go.
Come up with an initial question for the AI based on this context. So the first couple modules are going to be slightly more nuanced, and then you can copy-paste this pretty much endlessly. Then we collect the variable, which is the result, again so we can keep it for later and store it in the Airtable. The next GPT step is going to take the role of the assistant. So your assignment is to actually take the system prompt in Airtable, and then we're going to respond to the user's message, which was the output here.
That we stored, which is why I'm actually storing it, so I can keep feeding that stored response as an input for the next part of the chain. So naturally we'll store the first assistant's response, and then we're going to have another user response, which is the identical prompt, except at the bottom I now add a new component saying, conversation history up until now, where I show it the back-and-forth and say, hey, this is what you're responding to next. And then we store that, and so on and so on. And as you get to the end, the system prompt for the assistant now says, hey, here's the conversation up until now, and here's the latest user query you need to respond to.
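The chain described above (simulated user, then assistant, each seeing the growing history) can be sketched as a simple loop. This is an offline sketch, not the Make scenario itself: `call_llm` is a stub you'd replace with a real OpenAI call, and the prompt wording is illustrative.

```python
# Sketch of the user/assistant simulation chain. call_llm is a stub so the
# example runs offline; swap in an actual model call for real use.
def call_llm(system: str, user: str) -> str:
    return f"[reply to: {user[:30]}]"  # placeholder response

def simulate_conversation(system_prompt, persona_prompt, exchanges=3):
    history = []  # list of (role, text) tuples, oldest first
    for _ in range(exchanges):
        transcript = "\n".join(f"{r}: {t}" for r, t in history)
        # The simulated user sees its persona plus the conversation so far.
        user_msg = call_llm(
            persona_prompt,
            f"Conversation history up until now:\n{transcript}\n"
            "Write your next message to the AI.",
        )
        history.append(("user", user_msg))
        # The assistant replies under the system prompt being battle tested.
        assistant_msg = call_llm(
            system_prompt,
            f"Conversation history up until now:\n{transcript}\n"
            f"Latest user query to respond to:\n{user_msg}",
        )
        history.append(("assistant", assistant_msg))
    return history

convo = simulate_conversation(
    "You are a travel agent.", "You are a lost millennial.", exchanges=2
)
```

The key design point is the same one in the Make build: each turn gets the full transcript injected, so the conversation stays dynamic instead of each reply being generated in isolation.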
So this keeps it completely dynamic and nimble. And you keep this chain going until you get to the very end where you see that it's now taking all three different conversations into account. You click OK. You store that.
And then you bring together all the stored variables into this whole get-variables module, and that's what you store back into the Airtable. So the beauty of this is that once you finish building or configuring this, you can go back here and have your different prompts.
So: act like a travel agent; give advice on where to go and where to stay and eat. And then if we go to the user prompt, we'll say something like: you are a lost millennial hating your corporate job and you want to escape to Bali. You are going to speak to a travel agent. And then I'm going to say something like: write in lowercase, make spelling mistakes, and be super informal when asking for your travel help.
So I'll take something like this. We'll click on simulate. And assuming I did not break something, it should say success, which it does.
So we're sending the record ID, the system prompt, and the simulated user prompt. You can see here that the webhook is going to roll, and then the chain is actually executing step by step. We're almost done in pretty much under 30 seconds, because we're using GPT-4o mini. This is going to gather all the outputs into this one big variable and store it in this Airtable.
And if we go back here, you will see this conversation where we start off. And it says, hey, so I really want to escape this boring corporate life and head to Bali like ASAP. Can you help me find some cheap flights or whatever? Thanks. So you can see there like we made a very basic small prompt.
And it does a pretty good job of actually emulating human behavior. So if we go to assistant one: absolutely.
And then it gives tons of different recommendations. Then we say, it says, hi, thanks for your info. What are the dates I should look into? Yada, yada, yada. We respond.
And the last one is, yo, that's super helpful. Like if I'm going to in like May or September, do you think I could snack some good deals on flights? So you can imagine you can extend this conversation in perpetuity.
I'd say, from working with a lot of clients, the average duration of a conversation is 10 to 12 exchanges, assuming the user is satisfied with the result, or fewer if they're frustrated with it. So maybe you extend it out to 5, 10, 12. Just be conscious that as you extend this chain, one, you'll be spending more money, and two, you'll have more maintenance of all of these variables, to make sure you're keeping track of them as you extend that thread of conversation.
But that's pretty much how you build this component. And just to make your life easier, what you'll see is you're going to have something called a blueprint, which, when you export it, is a .json file. All you have to do is sign up for a free Make account, come in, add and import the blueprint, and then you'll get something like this by default. You'll just have to go in and make sure you're signed into GPT.
You sign in and add your API keys. You authenticate into something like Airtable or Google Sheets, whatever you want. And that'll be on your end to configure, but at least the prompts, all the kind of main build, will be built out for you.
It's more so making sure it's working as expected. All right, so those are the core four builds that I'm showing you. If you stuck around to this point of the video and made it through all those little technical hoops, then you deserve to see the final product that I put together over the last couple weeks.
And it looks something like this. I called it Prompt Battle Tester, and the whole way it works is: no need for Airtable, no need for all of that custom code and the Make automation. I basically built it all using JavaScript, Python, and tons of Claude and GPT to guide me through all these operations. So how it works is we have a prompt here.
So I have a default one saying, you are a helpful AI assistant; your goal is to provide informative and engaging responses. And on the simulated user prompt side, it says: you are analyzing a conversation between a user and an assistant; generate the next user message based on the following: ask follow-up questions, introduce a new topic, respond casually, yada yada yada.
And then I have a parameter here that's how many exchanges do you want. So if you remember before, if you wanted to add a new row or a new column for a different exchange, you'd have to do that manually, update the make automation, and then update the actual script to make sure it accounts for it. In this case, you can just make it, let's say, five exchanges.
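Those knobs (exchange count, temperature, model choice) can be captured in a small config object instead of hand-editing an automation each time. This is a hedged sketch; the parameter names, ranges, and model identifier are my assumptions, not the tool's actual internals.

```python
from dataclasses import dataclass

# Sketch of the simulation knobs the tool exposes; names and limits assumed.
@dataclass
class SimulationConfig:
    exchanges: int = 5          # how many user/assistant back-and-forths
    temperature: float = 0.3    # sampling temperature for the underlying LLM
    model: str = "gpt-4o-mini"  # one of the selectable models

    def validate(self) -> None:
        if not 1 <= self.exchanges <= 20:
            raise ValueError("keep exchanges small to control API cost")
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be between 0 and 2")

cfg = SimulationConfig(exchanges=5, temperature=0.3)
cfg.validate()
```

Validating the exchange count up front is the programmatic version of the cost warning earlier: longer chains mean more API calls per simulation.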
You can choose a temperature for the underlying LLM. So let's make that 0.3. And then I just made it so you could pick one of three models.
So let's say you choose GPT-4o mini. When you click on start simulation, the magic happens, where I used Claude to create some form of UI to actually show bubbles back and forth, to kind of visualize what the response would look like. Which is super helpful if you have something more, again, conversational; this is not for a static prompt.
You can start to see that when it outputs, it outputs in markdown. Maybe we don't want to include these little asterisks here if we don't plan on showing that in our UI, on our mobile app or web app or whatever. And then you'll be able to quickly see how the conversation can go off track or stay on track, and how verbose your assistant's responses might be. And just to be fancy, what I do at the very end is take the whole conversation and feed it to another LLM, which I think behind the scenes is GPT-4o. And I say, hey, grade the conversation on these different categories based on the transcript, and tell me your explanation for each score out of 10. So it'll say it handled ambiguous queries decently, the system provided helpful solutions, and then it'll give you an overall assessment.
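That final grading pass boils down to two pieces: a prompt asking the grader model for structured scores, and a parser that reads them back. Here's a sketch under stated assumptions: the category names, the JSON reply format, and the function names are all illustrative, not the tool's actual API.

```python
import json

# Illustrative grading categories -- pick whatever matters for your prompt.
CATEGORIES = ["clarity", "helpfulness", "staying_on_topic"]

def build_grading_prompt(transcript: str) -> str:
    """Prompt asking the grader LLM for scores out of 10 plus an assessment."""
    return (
        "Grade the conversation below on these categories, each out of 10, "
        f"and explain each score: {', '.join(CATEGORIES)}.\n"
        'Reply as JSON: {"scores": {...}, "overall_assessment": "..."}\n\n'
        f"Transcript:\n{transcript}"
    )

def parse_grades(raw: str) -> dict:
    """Parse the grader's JSON reply, clamping scores into the 0-10 range."""
    data = json.loads(raw)
    data["scores"] = {k: max(0, min(10, int(v))) for k, v in data["scores"].items()}
    return data

# Example with a hand-written grader reply (no API call made here).
example = parse_grades(
    '{"scores": {"clarity": 12, "helpfulness": 8, "staying_on_topic": 9},'
    ' "overall_assessment": "solid"}'
)
```

Asking for JSON and clamping out-of-range values keeps the grading step robust even when the model occasionally drifts from the requested format.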
But wait, there's more. If you click on generate PDF report and you go on simulation results, it'll give you the full transcript of that conversation. So you can go from idea to editing your prompt, whether it be through meta prompting or doing it yourself, to quickly seeing how that might look in the wild.
Now, this one I'm not making available in the Gumroad description. I'm actually going to put a Bitly link on the screen right now. And if you follow this Bitly link, you'll be able to access my Replit code behind the scenes.
And again, assuming you want to deploy this as is, all you'd have to do is go to the secrets portion here to add your API key for OpenAI. The rest should stay as is. And then we have the main.js and style.css. So we have some JavaScript.
We have CSS. We have a main.py file. We have the PDF generator file that we call in our main.py.
And then we have the actual crux of this, which builds the brain for how the simulated conversation goes. So this was my passion project that I built on the side, to try to not only iterate for our clients much more quickly. So when they say, hey, do you think this will work for us? I can quickly come to that answer in, as you saw, five different ways. But this also makes it playful, so that if I provide a prompt, or my team provides a prompt, they can tinker with it to see, you know, if we made some slight modifications here, maybe we get to our dream outcome this way.
So it lowers that back and forth. And it gives that much more confidence that if you have a sassy user, or a weird user, this is how your AI is going to actually work in the wild. Now, if you loved this action-packed tutorial: it literally took me probably like 15 hours to put together all the assets here, outside of this tool that I just showed you.
So if you could leave a like on the channel, a sub, and maybe a comment, I would super appreciate it, and I'd love your feedback.
And if you find all the assets I'm going to be giving you in the Gumroad link below valuable, I'd also appreciate it if you could leave some form of support for the channel. It always helps a lot. I literally reinvest it back into better editing and better images, so I would super appreciate it.
Thanks so much, and I'll see you next time.