Speed, cost, and reliability: these are the three things most organizations trying to use generative AI at scale struggle to balance. Remember that classic triangle from university or college where you had to pick two of sleep, a social life, or good grades? In the same vein, a lot of organizations are having to choose either cost and reliability or reliability and speed, and compromise on the third. The brand new Claude prompt caching feature is the first real light at the end of the tunnel for accomplishing all three. It's still in beta, but it's already showing that organizations may finally be able to have their AI cake and eat it too.

I'm going to walk you through what prompt caching is, when to use it, and most importantly how to actually leverage it, with a full code tutorial. Whether you're a developer, a product manager, or just Claude-curious, this tutorial will give you an A-to-Z conceptual picture of how it works, and if you're less technical, after the first half I'll dive into Google Colab and walk through examples step by step so you can leverage it to the fullest. If you don't know who I am, my name is Mark and I run my own AI automation agency called Prompt Advisors; for the past two years we've helped companies in every industry figure out where AI fits best in their workflows. I've prepped some quick slides to go through all the nitty-gritty of this brand new feature so you're fully informed on what it can and can't do. Let's dive right in.

First things first: what is prompt caching and why should you care? If your organization only uses ChatGPT and Claude through the web front end, with no API on the back end and no automations behind the scenes, this probably isn't that useful for you. But if you're running thousands upon millions of records through LLMs like Claude Haiku, Sonnet, or Opus every day, this has major implications for you and can save you a lot of time, money, and, most importantly, speed.

The big picture: instead of always taking the same prompt with a huge context or knowledge base and continually feeding it back in on every call, or resorting to something like a vector database to store that data and retrieve it when it makes sense to, you can now have your cake and eat it too, just like I said. All of that context and knowledge can essentially be memorized and cached for a certain amount of time while you send multiple prompts that reference it, and those prompts are cheaper because you don't need to re-feed the context over and over again.

As I said, this is still in beta, so right now you can only use it with two models: Claude 3.5 Sonnet and Claude 3 Haiku. Used properly, for the right use cases, it could potentially save you up to 90%; from what we've seen already, implementing it for some of our clients is saving around 40 to 60% depending on how long the prompts are, and I'll walk you through shortly what should drive your decision on whether this is even worth a look. In general, though, the first thing that comes to mind is a business like a law firm or a real estate brokerage, where you have monumental files you wish your prompts would just memorize and understand,
or where you have multiple chained prompts that all build on one major set of documents or examples. If that's you, this is a silver lining, especially from a cost perspective.

To visualize that last point: typically you have a huge context plus a set of examples, and if the task is more bespoke or complex you might need five to seven examples to really show the LLM what success or perfection looks like. Then you have a bunch of tasks where, in theory, you'd have to re-inject that same context and those same examples every single time to get the result you're after. The main difference with this feature is that you pay more up front to cache your context and examples than you would for the same tokens as normal input from a typical prompt, but every subsequent prompt that relies on that context becomes cheaper, and in aggregate, across everything you run daily or weekly for your organization, the cumulative savings add up to real money when you do the analysis.

So when does it make the most sense to consider prompt caching, and for which use cases? Anthropic lists six: conversational agents, coding assistants, large document processing, detailed instructions, agentic search, and knowledge-base Q&A. In this tutorial I'm going to focus primarily on conversational agents, large document processing, detailed instructions, and knowledge-base Q&A, and leave coding assistants and agentic search for another day; those are hyper-technical and I want to keep this balanced so that even if you're non-technical you can understand the implications for your organization and your dev team.

There are two details I really want to highlight. First, you can now give far more examples than before: where you might have squeezed in three examples for the LLM to learn from, you could go to seven, eight, even fifteen if you wanted to, without worrying about that context being forgotten, somehow overridden, or falling victim to the lost-in-the-middle problem. Lost-in-the-middle is when a very large prompt causes some instructions to be treated as noise and ignored as you go from prompt to prompt. In our initial testing with clients, that happens a lot less with caching and the model stays a lot more focused, because you're effectively partitioning your cache, the long-term memory, from the sub-instruction, the "here's what I want you to do right now." The balance of holding large prompts in mind while remembering things is noticeably better, without needing an external resource like a vector database.

That said, I don't agree with all the sensationalized titles I'm seeing on YouTube right now claiming this is the death of RAG. RAG stands for retrieval-augmented generation, which is a very complicated term for: you connect a database that converts documents into numbers a computer can understand, and then you have the AI talk to those numbers to query that knowledge in real time.
While RAG is still a painful process, I don't see it going anywhere until this feature is at least out of beta, has much larger context windows, and gets even cheaper, to the point where you genuinely don't need a vector database in the conversation. Until then, I still see RAG as part of your toolkit. But if all you have is, say, one book or a series of 50 pages that you go back to over and over, this might be a better alternative to that typical infrastructure.

Now let's look at how it actually works, at a very high level. A user sends a request. If that prompt is cached (I'll show you the criteria for when it gets cached), the cached prompt is used and a response is generated from it plus whatever query or question you want to ask. If not, the full prompt is processed as it usually would be, at the rates it usually charges, and you can then cache it for future use if you want to. When you think of "cache," think literally of the cache in your browser: the first time you visit a web page it takes slightly longer to load, but the second, third, fourth time, assuming you haven't cleared your history or cookies, it loads far faster. It's the exact same idea here: the first time you send a prompt, obviously nothing about the context or examples is cached yet; the second and third time you process it, it takes far less.

One super important point is the 5-minute lifetime. From the moment you choose to cache something, and you do that by setting a parameter that essentially translates to "cache this, please", you have five minutes to reuse that cache. You don't have to send the next request within five minutes of the first prompt, but if you stop sending requests that reference the cached information, after five minutes it expires and you'll have to redo that write. The idea is that if you have some form of batch processing, one big context with a constant stream of requests for different users or different rows of information, then as long as you're looping through them without a long gap, you should be able to keep that cache alive, at least from the testing we've done so far.

Now, a few big disclaimers about when to actually use this and whether you should even consider it. Number one: make sure everything you want to put in the cache is actually static, meaning it's a set of instructions that doesn't need to be nuanced or changed very often. That's where it makes the most sense. Say you're a real estate firm: you have an XYZ way you write your listings, six or seven types of listings, and five examples of each; all of that is a really good example of a use case where this makes sense.
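To make that batch-processing idea concrete, here's a minimal sketch of the keep-the-cache-warm pattern, under my own assumptions: `records`, the record fields, and `ask_with_cached_context` are placeholder names I'm inventing for illustration, and the real API call is the one I walk through later in the Colab.

```python
import time

# Sketch of the "keep the cache warm" loop described above. The helper
# ask_with_cached_context stands in for the real prompt-caching API call
# shown later in the Colab (the one marked with cache_control: ephemeral).
CACHE_TTL_SECONDS = 5 * 60  # Anthropic's stated 5-minute cache lifetime

def process_records(records, ask_with_cached_context):
    last_cache_touch = None
    for record in records:
        if last_cache_touch and time.time() - last_cache_touch > CACHE_TTL_SECONDS:
            # More than 5 minutes since the cache was last written or read:
            # the next request pays the full cache-write price again.
            print("Cache likely expired - the next call re-creates it")
        answer = ask_with_cached_context(record["question"])
        # From our testing, each use within the window keeps the cache alive.
        last_cache_touch = time.time()
        print(record["id"], answer[:80])
```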
The next disclaimer: the minimum prompt length for caching is 1,024 tokens for Claude 3.5 Sonnet and 2,048 for Claude 3 Haiku. I want to hyper-emphasize this, because when I was figuring out how to write the code I kept banging my head against the wall before I understood it. Whatever you try to cache has to be at least 1,024 tokens for Sonnet, a few solid paragraphs of text. If you don't meet that, it simply won't work: you'll look at your output and the caching metadata just won't be set, even if everything else is configured perfectly. My hack for verifying my context is long enough is to paste it into the old tokenizer tool from OpenAI to get a directionally accurate count and make sure it's above roughly 1,024 tokens, plus or minus. If you end up using the code I'm providing today and run into any problem, there's a 99% chance it's that your underlying prompt isn't long enough. Don't say I didn't warn you.

Next: you can't manually clear the cache for now. Again, this is in beta and was released literally about a week ago, so it's natural that not everything is figured out; as of right now you simply have to wait out the 5-minute expiration window before the cache is expired or cleared.

The last disclaimer, at least from the cost-savings analysis we've already run for clients over the weekend: it's really not worth moving well-grounded infrastructure for this unless you're doing it at the scale of thousands, if not millions, of records, where you can really realize the cost savings. We're not at the point yet where it's worth abandoning what you have just to jump on this. If you're starting from scratch, though, or you wanted an incentive to use Claude models and didn't have one before, this is probably a good reason, especially as I expect it to land on Amazon Bedrock in the coming weeks.

On pricing, I'll let you look at Anthropic's website and do the math for your own workload, but in general: writing to the cache costs 25% more than the same tokens at standard input pricing, while reading from the cache is 90% cheaper than those same input tokens. So there's a huge difference on the read side and a modest premium on the write side, and I'd say the cost-benefit analysis comes out strongly in favor, given all the criteria I just went over.
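Before we jump into the notebook, here's a rough back-of-the-envelope version of that cost-benefit math. It's only a sketch: the base price is Claude 3.5 Sonnet's published input rate of roughly $3 per million tokens at the time of recording (check Anthropic's pricing page for current numbers), with the 25% write premium and 90% read discount I just mentioned, and it ignores output tokens and assumes every query lands inside the five-minute window.

```python
# Back-of-the-envelope: is caching worth it for your workload?
BASE_INPUT = 3.00 / 1_000_000      # ~$3 per million input tokens (Sonnet)
CACHE_WRITE = BASE_INPUT * 1.25    # writing to the cache costs ~25% more
CACHE_READ = BASE_INPUT * 0.10     # reading from the cache is ~90% cheaper

def compare(context_tokens, question_tokens, num_queries):
    # Without caching: the full context rides along with every question.
    without_cache = num_queries * (context_tokens + question_tokens) * BASE_INPUT
    # With caching: one cache write, then cheap cache reads on later queries.
    with_cache = (context_tokens * CACHE_WRITE
                  + num_queries * question_tokens * BASE_INPUT
                  + (num_queries - 1) * context_tokens * CACHE_READ)
    return without_cache, with_cache

no_cache, cached = compare(context_tokens=10_000, question_tokens=200, num_queries=50)
print(f"Without caching: ${no_cache:.2f}  |  With caching: ${cached:.2f}")
```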
Now, if you're less technically inclined, or code scares you, this is the part of the video where I actually jump into Google Colab and walk through everything. If you choose to hold on, I promise I won't make it that painful to look at: there are some pretty diagrams above the code blocks walking through what's actually happening. If it scares you, feel free to click off the video right now.

This is the Google Colab I put together. I tried to comment everything and add headers where it makes sense to make it as readable as possible, and I created some Mermaid diagrams to walk through what's happening in certain blocks of code. You'll find the link in the description below; it's a Gumroad link where you can enter $0 for the code, or, if you wish to support the channel and help me make my content better over time, I'd super appreciate a couple of bucks for a coffee here and there. Either way you can get it for free and follow along as I go through everything.

First things first, we install the anthropic package, then import anthropic and time. While the rest of the code is running, we can start breaking down the first code block: what it looks like and what it's doing behind the scenes. We import the anthropic library, set up our API key (I'll use mine here and delete it after this tutorial; you can just swap in yours), initialize the Anthropic client, and create the prompt, essentially the context we're going to cache in memory, which I triple-checked is actually eligible for prompt caching based on its length. Then there's a function I wrote that wraps each response with metadata showing whether that specific query was cached or not. It'll be abundantly clear once I show you, but think of it as a readout of how many tokens were written to the cache and how many were read from it. That tells you whether it's working: if the cache-write number is greater than zero, you've committed something to memory; likewise, if you expect to be reading from the cache and the read-token count is zero, you know the cache isn't being used.

Diving into the actual code: we set the API key, then the business context, which describes pretty much everything about this fictitious company. If you want to look at the prompt in depth, I've put it in a Google Doc; nothing fancy, just a bunch of information about a company that does not exist, with lots of quarterly performance metrics, around two to three pages long. I'd say roughly two and a half to three pages is typically the minimum for this caching feature to be eligible, so you can use that as a rule of thumb. If you want to be more particular, like I am, copy and paste the text into the tokenizer from OpenAI: set it to GPT-3.5/GPT-4, paste it in, and it'll report around 1,178 tokens in this case. It won't be exact or perfect for Claude, but it's very directionally accurate; I've found it's within plus or minus about 50 tokens, so if it says roughly 1,200 you're really somewhere between about 1,150 and 1,250. Use that to your advantage.
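For reference, here's a minimal sketch of that setup block. The API key and business context are placeholders, and the tiktoken check is my own stand-in for pasting text into OpenAI's web tokenizer; it isn't Claude's tokenizer, so treat the count as directional only.

```python
import anthropic

# Swap in your own key - I delete mine after the tutorial.
client = anthropic.Anthropic(api_key="YOUR_API_KEY")

# Stand-in for the fictitious TechNovate Solutions context from the Google Doc.
# In the real notebook this is roughly 2-3 pages (~1,178 tokens by the OpenAI
# tokenizer), which is what makes it eligible for caching on Sonnet.
BUSINESS_CONTEXT = """
TechNovate Solutions is a (fictitious) cloud management company...
[paste your ~2-3 pages of static context and examples here]
"""

# Optional directional length check: not Claude's tokenizer, but close enough
# to tell you whether you're comfortably above Sonnet's 1,024-token minimum.
try:
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    print("Approximate token count:", len(enc.encode(BUSINESS_CONTEXT)))
except ImportError:
    print("pip install tiktoken for a rough token count")
```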
With that done, we set up the call itself: client.beta.prompt_caching.messages.create. We need that prompt-caching method, and we can point it at either Claude 3 Haiku or Claude 3.5 Sonnet; in this case I'm choosing Sonnet because it's a bit more sophisticated. In the first text block I set the system prompt. That matters less for this large-context caching example, but when we get to the conversational portion it's hyper-important to give that system prompt some love and care, because it guides whether the replies actually feel conversational. Unlike OpenAI's Assistants API, I haven't yet found a native way to make this back-and-forth like a normal chatbot conversation, so the system prompt does the heavy lifting.

The hyper-important parameter here is cache_control, set to "ephemeral", which is fancy English for temporary. It tells the underlying prompt infrastructure that we want this content held in temporary, time-limited cache, and that's what triggers the caching. In the next text block I reference the business context variable from above; I like doing that for cleanliness rather than shoving eighteen lines of text inline, and naturally you could reference some form of file or anything else instead. Our first test message is: "What are the key challenges TechNovate Solutions is currently facing?"

When you run it, the output is the prompt-caching beta message: it has an ID, and one thing you can do is scroll to the very end and take a peek at the usage block. If cache_creation_input_tokens is greater than zero, you've managed to successfully write something to the cache. I'm going to say it for the third or fourth time: if that number is zero and you've set everything up exactly like I have, or you've used my code with your own prompt dropped in, there's a 99% chance the prompt simply isn't long enough to be eligible, so when in doubt, triple- and quadruple-check that. On this first call we naturally won't be reading any cached tokens, since we've only just committed the context to memory, and the usage block also reports the ordinary input and output token counts, which is the default metadata we get.
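Putting that together, here's a minimal sketch of that first call, the one that writes the context to the cache, continuing from the setup sketch above. The system-prompt wording is simplified and the model string is the Sonnet snapshot that was current when I recorded this, but the shape is what matters: the beta prompt-caching endpoint, a text block carrying the big context marked with cache_control: ephemeral, and a check on cache_creation_input_tokens afterwards.

```python
# First request: this is the one that writes the context to the cache.
response = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20240620",   # or claude-3-haiku-20240307
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a business analyst. Answer questions about the company below.",
        },
        {
            "type": "text",
            "text": BUSINESS_CONTEXT,
            # "ephemeral" = temporary: everything up to and including this
            # block is cached for roughly five minutes.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user",
         "content": "What are the key challenges TechNovate Solutions is currently facing?"}
    ],
)

print(response.content[0].text[:300])
usage = response.usage
# On the very first call you want cache_creation_input_tokens > 0.
# If it stays at 0, the cached portion is almost certainly under the minimum length.
print("cache_creation_input_tokens:", usage.cache_creation_input_tokens)
print("cache_read_input_tokens:", usage.cache_read_input_tokens)
```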
Moving down the notebook, I took a stab at a poor man's cost calculator that tries to tell me how much I'm saving by sending an additional prompt after the initial write to the cache. It's directionally accurate; if you have a better version, let me know in the comments, I'm happy to be humbled, but I gave it my best shot. The diagram walks through the calculation: if it's the initial query, meaning the first write to the cache, it's naturally more expensive, so I price that in; otherwise I calculate the follow-up cost, which, as we said, uses cache reads that are 90% cheaper than the input tokens you'd spend otherwise. Below that is the main function that produces the extra metadata: it takes what Claude already returns and adds a few more components like the cache-creation cost, the total cost, and the initial cost savings. Again, directionally accurate rather than exact; I built it as a bit of a cheat sheet for myself.

For follow-up queries, the way it works, assuming the write succeeded, is this: if the cache is hit, the metadata at the bottom of my output (written by that same function) says so, and the response is generated from the cached context; otherwise the code can cache it at that point, though if everything was done correctly it should already be there. Follow-up query 2 asks about the company's European market share, and as long as the cache hasn't expired, it uses the cached context. If I send that request six minutes after the last cache use, it gets processed as normal and there are zero cost savings; both paths return a response, one is just cheaper than the other.

Back to the very first question, "What are the key challenges TechNovate Solutions is currently facing?": the response comes back as a long set of bullet points. I call that out because down below, in the conversational version of this code, it's really important to tell the system prompt not to answer in bullets and to write in shorter conversational prose, so we emulate a chat experience; for plain testing, bullets are perfectly fine. In the response metadata you can see the model, the input and output tokens, and cache read tokens of 1,305, exactly the same amount as our original cache write, so I can tell I'm now benefiting from the cost savings. Then there's the cache status I added, "cache hit", so I know it's working, plus the directionally accurate estimate of what this run cost, which is helpful if you're doing this at scale and want to know whether you're saving tens, hundreds, or thousands of dollars.

The second query, "What strategies should TechNovate Solutions focus on to overcome its competition in the cloud management space?", comes back with a response and again shows a cache hit, reading 1,305 tokens. Query number three is the same story: we hit the cache and bank some savings. At the end I ask the calculator to estimate the total savings from running the three queries; in this case it's not much, since we're only firing a couple of prompts at the API, but at grand scale this is where it gets genuinely useful. And this is a micro example of large-context caching; I didn't bother pasting in a huge book, but you could cache pretty much anything so long as the underlying model can handle the context window, which for Sonnet is around 200,000 tokens right now.
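Before we move on to the multi-turn section, here's what one of those follow-up calls looks like in sketch form, along with the kind of directionally accurate savings readout my calculator produces. Same caveats as before: simplified wording, the same assumed model string, and the dollar figures assume the Sonnet prices mentioned earlier.

```python
# Follow-up query within the 5-minute window: same cached system blocks,
# only the user question changes.
followup = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {"type": "text",
         "text": "You are a business analyst. Answer questions about the company below."},
        {"type": "text", "text": BUSINESS_CONTEXT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        {"role": "user",
         "content": "What is TechNovate Solutions' market share in Europe?"}
    ],
)

u = followup.usage
cache_status = "HIT" if u.cache_read_input_tokens > 0 else "MISS"
print("Cache status:", cache_status)

# Directionally accurate savings estimate (same price assumptions as before).
BASE_INPUT, CACHE_READ = 3.00 / 1e6, 0.30 / 1e6
saved = u.cache_read_input_tokens * (BASE_INPUT - CACHE_READ)
print(f"Approx. input-cost savings on this call: ${saved:.5f}")
```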
The next portion of the notebook is multi-turn conversations, which is framed much like the Assistants API: meant for back-and-forth conversation rather than one-off queries. I'll zoom in on the diagram I created, which is in two parts. The way this works is very similar to before: we import the libraries, set up the API key, and define the prompt, the business context. I also created a send_query function to make it a lot cleaner to send multiple messages, versus repeating a big block like the earlier call every single time you want to send a brand new message. That's one fault of the current API, or maybe a skill issue on my end, but I couldn't get the native API to give me a cleaner way of sending those messages without re-specifying the cache breakpoints over and over, so send_query is basically a SparkNotes version of the API request. Then we send a series of follow-up queries, and it works the same way: is it in the cache yet, yes or no; if so, use the cache; and we loop that until we're happy. In this case I think I send five or six messages to the API.

At the bottom I made a follow-up-queries diagram: queries two through six cover the steps being taken to tackle the challenges, whether those steps are sufficient, additional recommendations, the impact on strategic goals, and implementation challenges. One thing I did very deliberately is make sure my follow-up questions were not framed in a way the model could understand without its own previous answers. If that sounds confusing, here's an analogy: say I ask something vanilla like "What's the weather today?" and it says it's super rainy. Instead of asking something that carries its own context, like "What activities should I do on a rainy day?", I ask "What are some activities I should do given this forecast?" If the model doesn't remember telling me it was rainy, "this forecast" means nothing to it, so the response tells me whether it's genuinely using the cached and prior context or just guessing from my very well-worded questions. That was my hack, after hours of going back and forth, for confirming the cache is actually being used conversationally.

Jumping down, this section uses a different prompt: a fictitious solar energy company, which I just asked ChatGPT to generate and then threw into the tokenizer to make sure it was long enough. Then there's the send_query function, which you can study in more depth after this video if you pick up the code; it produces metadata very similar to before, but lets me simply call send_query with a question while it creates the cache breakpoint behind the scenes. The first question is easy and vanilla: "Can you summarize the key challenges this green energy company is facing?" The next is "Which of these challenges do you think is the most urgent?", which forces it to remember the challenges it just listed, so when I check the answer I know for a fact it's using what's in memory and what's been cached. That's my little hack there.
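Here's a simplified sketch of that send_query helper. The conversation-history handling is my own wrapper rather than anything native to the API, SOLAR_CONTEXT is a placeholder for the fictitious solar-company prompt, and the system-prompt wording is just one example of nudging it away from bullet points.

```python
# A simplified version of the send_query helper: it keeps the cached system
# blocks constant, appends each turn to a running history, and prints the
# cache metadata so you can confirm every turn is a cache hit.
conversation = []

def send_query(question, context, max_tokens=1024):
    conversation.append({"role": "user", "content": question})
    response = client.beta.prompt_caching.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=max_tokens,
        system=[
            {"type": "text",
             "text": "You are a helpful analyst. Answer conversationally, "
                     "in short prose rather than bullet points."},
            {"type": "text", "text": context,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=conversation,
    )
    answer = response.content[0].text
    conversation.append({"role": "assistant", "content": answer})
    u = response.usage
    print(f"[cache read: {u.cache_read_input_tokens} | "
          f"cache write: {u.cache_creation_input_tokens}]")
    return answer

# Deliberately dependent questions, so the answers only make sense if the
# model is really using the cached context and the prior turns.
print(send_query("Can you summarize the key challenges the company is facing?", SOLAR_CONTEXT))
print(send_query("Which of these challenges do you think is the most urgent?", SOLAR_CONTEXT))
```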
The next question, "Given that urgency, what steps is the company taking to tackle this challenge?", again depends entirely on the prior response, and after that: "Do you think these steps are sufficient, or should the company take more measures?" If you build these dependent questions, ones that make no sense taken in isolation, it's easy to confirm everything is working exactly as expected. So I send the series of follow-up queries, just like I showed you, and what I found is that once it mentioned the steps being taken, it carried that forward and answered each subsequent question in a way that made it very clear it was taking the last cached response into account.

You'll have to run this on your end and iterate to make sure it's working as expected. If from run to run the cache status isn't showing a hit, you'll want to do some troubleshooting: one, that your prompt is long enough, and two, that you're running everything within the same time window so the cache isn't expiring. One flaw I noticed with this API is that it has occasional timeouts; even right before recording this tutorial I re-ran the exact same code, got slightly different responses, and hit 500 errors that looked like timeouts. So I'd recommend adding delays between queries for now. I'm sure that within the next month or two it will become a lot more mature and enterprise-grade, and you'll be able to start reaping the rewards: saving hundreds, possibly thousands, of dollars per month and maybe finally making generative AI make sense for your business.

If you enjoyed this tutorial and love this type of content, please leave a like on the video, drop a comment below, and subscribe to the channel. Appreciate you, and I'll see you in the next one.