Transcript for:
Building Contextual Retrieval with Pinecone

Today we're going to talk about building contextual retrieval with Anthropic and Pinecone. I'm Arjun from Pinecone, a developer advocate here; all I think about every day is how to teach people to use vector databases in their retrieval-augmented generation applications. And I'm here with Alex from Anthropic. Hey, Alex.

Hey, everybody, super excited to be here. I'm Alex, I lead developer relations at Anthropic, so basically I just get to talk about Claude all day, every day. I can't imagine a better job.

That's awesome. It's a lot of fun. Let's hop into the agenda. First we'll introduce what RAG is, and where it works well and where it doesn't. Next we'll talk about a specific case study where you could apply retrieval-augmented generation techniques, in the context of video presentations. We'll talk about the benefits and strengths of this approach and how we can introduce contextual retrieval to make a robust RAG chatbot over video data using Pinecone, Anthropic, and AWS. We're not going to have this webinar without showing you a live demo, so we'll show you what's going on and how this works inside a SageMaker notebook. And finally we'll have Q&A, so you can ask us about anything you'd like to learn: contextual retrieval, the demo, applying Pinecone or Anthropic, and so on. We're hoping to allocate at least 20 or 30 minutes to Q&A, so if you have questions while we're working through the presentation, please drop them in the Q&A and we should have ample time to answer all of them.

All right, let's get into it. Just to level set: what is the goal of retrieval-augmented generation? Why do people even bother implementing this type of workflow in the first place? The idea is to increase the quality and accuracy of any response generated by the LLM you're using, specifically tied to a knowledge base, and the way you increase the accuracy or quality of that response is by giving the LLM the semantically relevant content it needs to generate an appropriate answer. Why does that matter? When you use an LLM without any context, especially without the context of your proprietary data stores, like knowledge at your company or data that isn't publicly available on the internet, it won't be able to answer with only the knowledge it was trained on, so you're more prone to hallucinations. The standard way of solving this is giving your LLM access to a knowledge base so it can answer questions and reason over that data.

The workflow should be familiar to many of you on the call. It's a standard retrieval-augmented generation workflow: you have your source data; an embedding model, which represents the source data in a vector space; you upsert that into a vector database, preferably Pinecone; the chatbot application embeds the user's query, retrieves the semantically relevant documents, passes all of that retrieved context to the LLM, and finally generates the answer, which allows your LLM to ground its generations with respect to whatever question you're asking. This is a pretty robust workflow. You can use it to answer lots of questions that are hard to answer without access to an internal knowledge base, especially with company data like customer support or specific verticals like medical, financial, and so on. But what if the type of data you're asking over is more complicated than just text-only data? What can you do in that situation?
Let's take video data as an example. If you're working with video data, you might run into a problem where the modalities you're working with live in lots of different spaces: you might have audio, you might have video, and you'll have slides, and it's not necessarily intuitive how to embed all of these in the same vector space. If you were to use a multimodal model, it could be extremely expensive to represent all of these ideas in the same vector space. Additionally, when you're working with video data, you might need a lot of context for the chunks you're trying to process: you might have videos that are hours long, and some segment in the middle of the video might matter to something that happens near the end, but the LLM you're working with might not be able to ingest and use all of that context. Finally, there can be a discrepancy between the visual information of the video and the spoken information: you might have something on screen that the speaker is not actively addressing, yet you want to be able to query both of those ideas at the same time. So the methodology you use to process this data has to keep all of that in mind.

One possible solution is to use Pinecone, Claude, and AWS to ingest and describe the video data as image and text pairs and then perform retrieval-augmented generation over them, and today we're going to show you how to do exactly that. But before we jump into what that actually looks like, I'm going to give you a real-world example. This is a slide from a presentation I gave that's publicly available on Pinecone's YouTube channel. Everybody, why don't you take a moment and try to guess what I'm talking about with the slide on screen? Put one sentence in the chat about what you think I'm trying to answer or what question I'm addressing. Take a second, think about it; if you don't want to write it into the chat, that's completely fine. For those who might not be looking at the screen: on the slide we see a table with a pro and con chart comparing full-text search and multilingual semantic search, and the pros and cons of each approach. Cool, we have some answers coming in: it could be about hybrid search, getting the correct context, what is the need for hybrid search. Great responses.

Okay, now I'm going to introduce a little more context. Instead of just having the slide on screen, we now have the transcript; this is what I said while the slide was up. I'll take another moment or two for everybody to read over that transcript and guess what I was talking about, what I was trying to get at at this point in time. Throw your answers into the chat or just think about what you think is happening here. Fun fact: in one of my past jobs I worked on speech recognition models, and a lot of my job was to sit and read transcripts of transcribed data, and it's actually really difficult to understand what is going on in a video or some audio based only on what has been transcribed.

Okay, let's go a little further. I'm going to give you the full summary of what was said in the entirety of the video, and now the picture becomes a little clearer.
What happened during this slide is that I was actually referring back to the slide after having presented it. So I was no longer really talking about full-text search or multilingual semantic search; I was talking about how difficult it is for multilingual models to handle cultural nuances, specifically in the context of weddings. The presentation was actually about multilingual semantic search in general, and we had gotten to a point where we were talking about where multilingual semantic search works, where it doesn't, and specifically where it might fail. So now you might be able to appreciate why it's important to have these three different pieces of context when you're trying to create a representation of what is happening at this point in the presentation.

Great. So how do we actually go about incorporating this information, and how are we using our stack? We're going to use Pinecone as our vector database, which is our knowledge base: the place where we put our metadata and where we do our semantic search for that first part of the RAG workflow. Secondly, we're going to use Claude, our state-of-the-art LLM. It has visual understanding, which helps us pre-process the data; it's our key generating LLM, which creates the final response in the RAG workflow; and it creates the context that's needed to embed the data in a textual manner. Finally, we're going to use AWS as the place where we keep our compute and our data. We're also using AWS for Titan Embeddings, as our way of accessing Claude via Bedrock, and as our dev environment through SageMaker notebooks. And yes, our dataset for this demo is going to be a webinar I gave recently on the magic of multilingual search, which covers a whole bunch of different concepts, from zero to building a multilingual semantic search application. We're going to build a RAG solution that lets us answer questions about the slides, the audio, and what is being said and shown throughout this video.

So first, why should you use Pinecone when you're building this sort of stack? Pinecone is a vector database that lets you index vectors and do fast semantic search at scale. You have your input query, you embed that query in the same vector space as all the documents you've already represented, and Pinecone lets you search over them extremely quickly at scale. What's nice about Pinecone is that it has a great developer experience and you don't have to think about provisioning resources when you scale up your data. You can imagine that when you go from one video to tens or hundreds of videos to an entire YouTube channel, you won't have to think about provisioning more resources, because Pinecone is inherently serverless. We also have a lot of integrations with popular data sources and frameworks like LangChain and LlamaIndex, so we can be where you already are. Pinecone serverless is great because, like I said, you don't have to provision any resources; you can scale your usage up or down independently, with your compute and storage costs charged separately, so it only scales up as much as you're using it.

Enough about Pinecone; let's talk about Anthropic and Claude. Alex, why don't you take it away?

Thanks, Arjun. Hello again, everybody. I want to start by laying the foundation for how we think about what we're building at Anthropic and about Claude more generally. I always like to stress to developers that it is our goal to always be on the frontier of model capabilities.
This has been proven multiple times in 2024 alone. This graph was from our initial launch of Claude 3.5 Sonnet, so we need to update it a bit, but as you can see we've already built the world's best LLM twice in 2024, and three times technically if you count the latest Sonnet release, which I'll get into on the next slide.

So what did we just launch? If you've been paying attention to the AI news, you might have seen that we had a big October launch where we released a new, upgraded version of the Claude 3.5 Sonnet model. The new 3.5 Sonnet is better than the old 3.5 Sonnet in a multitude of ways, as you can see in some of these benchmarks: the GPQA score went from 59.4 to 65, MMLU Pro bumped up by about 3%, coding bumped up by a large margin, and a lot of the agentic evaluations also increased significantly between the models. Next slide, please.

And that's really the most important takeaway I want to get across: we're making the transition from models that do single-action chains to models that can plan over long time horizons. As you can see in this chart, models are now categorized in terms of how they perform versus a human working under time limits. The new 3.5 Sonnet is by far and away the best model we've seen across the industry in terms of matching human performance, and we're now starting to exceed the 30-minute mark, as measured by METR, an organization that does model evaluation and threat research. So 3.5 Sonnet can take tasks that would take a human developer around 30 minutes of coding and do them in a matter of seconds. This is very important for things like retrieval: when you're doing agentic search, you want to loop through a ton of different options and make sure you have the right context and are gathering the right information, and the new 3.5 Sonnet is very good at this. Next slide.

Now, moving into what we're specifically thinking about when we do retrieval: contextual retrieval is a term we coined and pioneered in a blog post about two months ago, and that's what Arjun will be explaining more about shortly. But in order to enable contextual retrieval in an economical and latency-sensitive way, you need to start from the foundation, which is prompt caching. Prompt caching is an API feature that allows you to cache, or save, a previous prefix of your prompt so that you can reuse that prefix in a fraction of the time and at a fraction of the cost later on. This significantly reduces your processing time and your costs for any repetitive tasks or prompts with consistent elements at the start. For 3.5 Sonnet, this can reduce input costs by up to 90% and your time to first token by up to 80%. Next slide.

This is very helpful for a lot of agentic workflows or tool-use cases: as you're doing agentic search, you might keep appending context to your prompt, and that can keep getting cached as you go, so you save both the processing time and the cost. Next slide. If you want to learn more about this, because it does get a little tricky as you start to implement it, definitely check out the resources in our docs and in our cookbook; those are just search terms you can use to find them.
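To make that a bit more concrete, here's a minimal sketch of prompt caching with the Anthropic Python SDK. The file name and model alias are placeholders, and the exact caching behavior and limits are in the docs Alex mentioned; this is just the shape of the call, not code from the demo.

```python
# Minimal prompt-caching sketch using the Anthropic Python SDK (assumes
# ANTHROPIC_API_KEY is set; the file name and model alias are placeholders).
import anthropic

client = anthropic.Anthropic()
long_document = open("webinar_transcript.txt").read()  # hypothetical document

def ask_about_document(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        system=[
            {"type": "text", "text": "Answer questions about the document below."},
            {
                "type": "text",
                "text": long_document,
                # Mark the large, stable prefix as cacheable: later calls that
                # share this exact prefix read it from the cache instead of
                # reprocessing all of its tokens.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```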
Now, moving on to the bulk of how we think about the best way to implement retrieval, which is this new technique called contextual retrieval. We found that this reduces incorrect chunk retrieval rates by up to 67% compared to traditional embedding methods. Next slide.

Here's how it works. In a standard RAG pipeline, you often destroy context in the process of chunking up your documents for embedding. You might take, say, 1,024-character chunks and extract them right out of the document, but that creates an isolation problem: a chunk later in the document might not carry the references to information that was stated further up. Next slide. We solve this with contextual retrieval by adding relevant context to each chunk before we actually embed or index it, and to add that context we use Claude to generate chunk-specific explanatory context that preserves the critical information. One way I like to frame this is that it's basically like studying all of your data before you go and index it. Next slide.

Here is a quick diagram I pulled out of our blog post to explain how this works. You can see our corpus of information; we've selected one file out of the corpus and split it into chunks, just like you would traditionally. But now, instead of going straight to the embedding step, we put Claude in the middle of the loop: Claude runs through each chunk in relation to the entire document, extracts the relevant information, and appends it to that chunk. So now each chunk has the actual information you want to embed plus the context that might be missing from it, and then we can put it through our embedding model and our indexing and into our vector database, like Pinecone. Next slide.

Let's take a practical look at what this means for a real-life chunk. Maybe you're analyzing financial statements and you have this sentence: "The company's revenue grew by 3% over the previous quarter." If you embed that, you're missing a lot of information: you don't know what the company is, you don't know what the previous quarter was, you don't know the date or the time. When you're doing that search later on, you won't be able to find the right pieces of information; maybe within this document there are multiple companies whose revenue grew by 3%. So we use Claude to contextualize this chunk: we pass in the full document, we pass in the chunk, and we ask Claude to fill in the relevant information. Now you'll see that this chunk gets transformed into: "This chunk is from ACME Corp's Q2 2023 SEC filing; the previous quarter's revenue was X; revenue grew by 3% over the previous quarter." Next slide. As you might imagine, this is very powerful: now that we've added this context onto the chunk, it improves our search and it improves that retrieval.
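As a rough sketch of that contextualization step (not the exact prompt from the blog post or the demo), the call might look like this with the Anthropic SDK; the prompt wording, model alias, and helper name are illustrative assumptions.

```python
# Sketch of contextual retrieval's contextualization step. The prompt wording
# and model alias are illustrative, not the exact ones used in the blog post.
import anthropic

client = anthropic.Anthropic()

def contextualize_chunk(document: str, chunk: str) -> str:
    prompt = (
        f"<document>\n{document}\n</document>\n\n"
        "Here is a chunk we want to situate within the document above:\n"
        f"<chunk>\n{chunk}\n</chunk>\n\n"
        "Give a short, succinct context that situates this chunk within the "
        "overall document, for the purpose of improving search retrieval of "
        "the chunk. Answer only with the succinct context."
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=200,
        # With prompt caching, the document portion of this prompt can be
        # cached across the per-chunk calls (see the earlier sketch).
        messages=[{"role": "user", "content": prompt}],
    )
    context = response.content[0].text.strip()
    # What gets embedded and indexed is the context prepended to the raw chunk.
    return f"{context}\n\n{chunk}"
```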
The downside is that this is prohibitively expensive to implement naively, because you need to call Claude and pass in your full document for each chunk. But this ties back to prompt caching, which is why I started there: prompt caching lets this all be done in a uniquely cost-effective way. What you do is put that document at the top of your prompt, cache it on the first pass, and then for each chunk out of that document you just append the chunk to the end. So you're only really running Claude over each subsequent chunk instead of passing through the whole document every time. We've done some cost estimates, and for 3.5 Sonnet your estimated cost for contextualization is only about a dollar per million document tokens. So it's very powerful, at a low cost and with fast latency. Awesome, back to you, Arjun.

Awesome, thanks so much. I'm going to quickly touch on how we're using AWS for the demo. We're using Bedrock and SageMaker notebooks: Bedrock is great for accessing the Claude API, and SageMaker provides a really clean environment for a top-to-bottom educational example of this approach. We're also using the Titan text embedding models, because we can embed the contextualized text chunks, as Alex described, into Pinecone, our vector database.

But what does this application actually look like? I'll show you quickly what we're going to build and what each step is, and then we'll walk through the notebook and mess around with the application we've built to see contextual retrieval in action.

First, the pre-processing step. We have this video data and we need to convert it to a format that makes it easy for Claude to do the initial contextualization. We start with the video and extract the frames and the transcript: we use a transcription engine for this (we use Whisper, but you could of course use Amazon Transcribe), we cache the completed transcript, and we write out all of the frames. So now, instead of video data, we have image and text data. We walk over the video in 45-second intervals, which lets us take a sample every 45 seconds of what is going on inside the video, and with the transcript we look at which words are being spoken, with timestamps at the word level. Then we assign the words to the frames inside those 45-second intervals, and that gives us frame-transcript pairs, in addition to the complete transcript of the entire video, which we can then use for the Claude contextualization workflow.
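Here's a rough sketch of that pre-processing step, assuming ffmpeg and the open-source Whisper package; the paths, helper names, and the exact windowing logic are illustrative rather than the code from the notebook, and Amazon Transcribe could stand in for Whisper.

```python
# Sketch of the video pre-processing step: sample a frame every 45 seconds,
# transcribe with Whisper using word-level timestamps, and assign each spoken
# word to the frame whose 45-second window contains it.
import subprocess
import whisper

INTERVAL = 45  # seconds between sampled frames

def extract_frames(video_path: str, out_dir: str) -> None:
    # One JPEG every INTERVAL seconds: frame_0001.jpg, frame_0002.jpg, ...
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps=1/{INTERVAL}",
         f"{out_dir}/frame_%04d.jpg"],
        check=True,
    )

def transcribe_words(video_path: str) -> list[dict]:
    model = whisper.load_model("base")
    result = model.transcribe(video_path, word_timestamps=True)
    words = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            words.append({"word": w["word"], "start": w["start"], "end": w["end"]})
    return words

def frame_transcript_pairs(words: list[dict], num_frames: int) -> list[dict]:
    # Frame i covers the window [i * INTERVAL, (i + 1) * INTERVAL).
    pairs = []
    for i in range(num_frames):
        window = [w["word"] for w in words
                  if i * INTERVAL <= w["start"] < (i + 1) * INTERVAL]
        pairs.append({
            "frame_path": f"frames/frame_{i + 1:04d}.jpg",
            "timestamp": i * INTERVAL,
            "raw_transcript": "".join(window).strip(),
        })
    return pairs
```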
Now that we have the frame-word pairs and the transcript, we pull out the frame as an image, take the raw transcript for that frame, and take the summary of the entire transcript that we create with Claude, and we use all of that to create a contextual description of what is happening on screen, in addition to what is being spoken, in the context of the entire video. This is exactly what Alex was talking about earlier: you have some raw information, in this case the transcript and the image, and we're contextualizing it with what is actually on screen and what the entire video is about. We do this for every frame-transcript pair, which yields a set of frame-context pairs per video. These are the text-only records we're going to embed and put inside Pinecone.

Now that we have these contextual frame pairs, we create a bunch of metadata about them, which lets us reference important information that we're not necessarily embedding, for example the raw transcript or the path to the images we're storing locally. We use the Titan text embedding model to embed the contextual information, which gives us context, metadata, and vector together, and we upsert that into Pinecone; this is what we run our vector semantic search over.

Finally, once we've completed that workflow, we're ready to do the full retrieval-augmented generation, with one extra trick. We have our input query, which gets embedded by the Titan text embedding model, and we query Pinecone with it. That yields a set of matches containing the image-context pairs as metadata: the matches themselves are the contextual descriptions, but we can figure out what the full descriptions are and which image paths we should be looking at from the metadata attached to them. From disk, we read in the frames that were retrieved and pass them to a prompt builder, so we can hand the actual images and the contextual descriptions directly to Claude. That lets us leverage Claude's visual question answering capabilities once again to create the final generated response to the input query. This is a pretty clever trick, because it means we don't have to upload and download images all the time: we're just doing a text-only search and then referencing the retrieved images only when we need them to answer the question.

Awesome. Now that we've talked about our stack and what we want to do, I can show you what the demo looks like, so I'll go ahead and share my screen and pull up the notebook. Just a moment... a little bit of technical difficulties, we love Zoom... there we go. Can everybody see that? Yes? Great, give me a moment to rearrange my Zoom and we'll be ready to begin.

So this is our notebook. A lot of the computation in this notebook is pretty time-consuming, so I'm not going to, for example, transcribe the video or create the contextual chunks live, but we will do the querying live, so we can see what the results look like when we ask questions about what's going on in the video. Before we get started, we do some quick dependency cleanup, so I'll just walk over this, and then we start with the video pre-processing step: taking the frames, doing the transcription, and turning the video into frame-transcript pairs. After a little bit of cleanup, we walk over every single video in our video repository, create the transcript, extract the frames, and assign the words to the frames, and that yields a bunch of data about which frames were extracted and which words were transcribed. We use FFmpeg to make this a lot easier, and like I said we use Whisper for the transcription, but you can of course use Amazon Transcribe if that's easier for you.

In order to do the contextualization, we have to convert the images before passing them to Claude, so we need a helper function that takes in the images and converts them to base64 so Claude can interpret them. We use Claude Haiku because it's really fast and gives us that image-and-text understanding quickly, and we create a helper function that does a one-off call, one image and one piece of text, and gets a response back from Claude, which makes it easy to pre-process each frame-transcript pair.
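A sketch of what those two helpers might look like via Bedrock is below. The request body follows Bedrock's Anthropic Messages format and the Haiku model ID is a real Bedrock identifier, but the helper names and parameters are assumptions, not the notebook's exact code.

```python
# Sketch of the image helpers: base64-encode a frame and ask Claude (Haiku on
# Bedrock) a question about it. Helper names are illustrative.
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
HAIKU_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def image_to_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_claude(image_path: str, prompt: str, max_tokens: int = 1024) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {   # the frame is passed first, as an image block ...
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_to_base64(image_path),
                    },
                },
                # ... followed by the text prompt Claude should respond to
                {"type": "text", "text": prompt},
            ],
        }],
    }
    response = bedrock.invoke_model(modelId=HAIKU_ID, body=json.dumps(body))
    return json.loads(response["body"].read())["content"][0]["text"]
```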
We'll write another helper function to do the full-transcript summarization, which we can reuse as we create the contextual pairs. When we built this demo, prompt caching was not yet available on Bedrock, so we weren't able to leverage that capability, but of course when you're working on your own version you can use prompt caching to save on cost.

Now we're ready to do the contextualization step. Again, the idea is that we take the frame as an image, the transcript of what happens inside that 45-second window, and the summary of the entire video, and create the contextual description, giving us frame-context pairs. The easiest way to do this is to have a meta prompt and embed all of the information we need into that prompt in order to yield the contextual description. You can see that we're basically walking over each frame: we take the summary of the video and pass it in here, we take the raw transcript and pass it in here, and then we prompt Claude to interpret what is on screen, what is covered in the entire video, and what is being said in that snippet, in order to create the contextual description. I've added a few extra lines to adjust for webinars, where you might have two people talking on screen, and to ask for extra explanation if there's a code snippet or an important diagram. You might be wondering where the image actually comes into this: using our ask-Claude helper, we pass the image first and then this prompt for Claude to respond to, and then we wait for the response to come back.

Now that all of these functions are ready, we can iterate over all of the frames and the entire transcript to yield these new pairs, which contain not only the contextual frame description but also additional metadata that is helpful when we query over this in Pinecone: the transcript summary, the timestamp of when this happened, the frame path, and the actual words that were spoken. You can imagine this makes for some interesting applications; for example, if the videos you're searching over are on YouTube, you could use the timestamps to send somebody a clip of the exact moment once you've queried the database. We write all of that out, and you can see we've also written out the summary of the entire video here, which is what gets used in the meta prompt I described earlier. Like I said, we're doing this over the multilingual semantic search webinar I presented earlier, which is why you're going to see a lot of text about multilingualism and embedding models.
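Roughly, that meta-prompt step might look like the sketch below, reusing the `ask_claude` helper from the previous sketch and the `pairs` list from the pre-processing sketch; the prompt wording, field names, and the `video_summary` variable (assumed to come from the transcript-summarization helper) are all illustrative.

```python
# Sketch of the contextual-description step: combine the whole-video summary
# and the 45-second transcript snippet into a prompt, pass it to Claude along
# with the frame image, and keep the extra metadata for Pinecone.
CONTEXT_PROMPT = """You are describing one moment of a recorded webinar.

Summary of the entire video:
{summary}

Words spoken during this 45-second window:
{snippet}

Looking at the attached frame, write a self-contained contextual description of
what is shown on screen and what is being discussed at this point in the video.
If the frame shows a code snippet or an important diagram, explain it briefly."""

def contextualize_frame(pair: dict, video_summary: str) -> dict:
    prompt = CONTEXT_PROMPT.format(summary=video_summary,
                                   snippet=pair["raw_transcript"])
    description = ask_claude(pair["frame_path"], prompt)  # image first, then prompt
    return {
        "contextual_description": description,  # this is what gets embedded
        "transcript_summary": video_summary,
        "timestamp": pair["timestamp"],
        "frame_path": pair["frame_path"],
        "raw_transcript": pair["raw_transcript"],
    }

frame_context_pairs = [contextualize_frame(p, video_summary) for p in pairs]
```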
Okay, now we've created these contextual pairs and we're ready to embed them and put them inside Pinecone. We're going to use the AWS Bedrock Titan text embedding model to embed the contextual descriptions, and Pinecone to store all of these vectors. Importantly, we're not storing the raw images in Pinecone, just metadata about where they're located in this notebook environment, so we can access them later. It's pretty easy to get started with the text embedding model: we create a helper function to do the embeddings, run it over all of the contextual frame descriptions, and get a data frame that looks something like this. Using that data frame, we can use a handy function in the Pinecone library, upsert_from_dataframe, to handle the upserting for us: we create an index with a specified dimension, based on the text embedding model, and upsert directly into it. The nice thing about upsert_from_dataframe is that it handles rate limits for us, so we don't have to think too hard about that. The video we're working with is about 45 minutes long and we're going over it in 45-second intervals, which is why we see about 72 vectors being generated.

Finally, we set up a few more helper functions for the visual RAG workflow. We want a way to take the response back from Pinecone and feed it directly to Claude, and that's what this function does: it takes the list of matches Pinecone retrieves, reads them all in, processes them as images, creates the meta prompt, and passes that to Claude. Again, this is the architecture we're working with: we have our input query, we embed it with the Titan text embedding model, we query Pinecone and get our matches back, we load the local frames, put everything into the prompt builder, and generate our final response.

All right, let's put this to a test run. The first question I'm going to ask is: how much data does it really take to make a multilingual large language model? We embed that using Titan, query Pinecone, and get the Claude response using the matches from Pinecone in addition to the query text. Let's go ahead and run it; it takes a few seconds to retrieve the information and generate the response. And great, we get a response about the CC-100 dataset, which is what XLM-R uses to create a multilingual embedding model. Now, just having the response on its own might not be that enlightening, so let's use a helper function to visualize what was retrieved when the response was created. You can see we've retrieved a frame from the video with a graphic of the data being used, all of the languages that appear, and the actual dataset being used. I don't actually say "CC-100" while I'm talking about this in the video; that's something Claude was able to pick up just from the image of the frame on screen. We also get the transcript information back, so we can see what I was trying to say, plus the contextual description, and the contextual description is great: it notes that this is a frame from a presentation on multilingual search technologies, that the graph has to do with 88 different languages and the Wiki-100 and CC-100 corpora, and so on. So this was a great retrieval hit, and you can see we're getting a few other frames back relevant to the question we asked about multilingual semantic search.
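For the query side, a minimal sketch might look like this, assuming the contextual descriptions have already been embedded and upserted (for example with the Pinecone client's upsert_from_dataframe) with "frame_path" and "contextual_description" in the metadata, as in the earlier sketches. The index name, Titan model ID, and helper names are assumptions; `image_to_base64` is the helper from the earlier Bedrock sketch.

```python
# Sketch of the retrieval-plus-answer loop: embed the question with Titan on
# Bedrock, query Pinecone for contextual descriptions, then hand Claude the
# retrieved frames and descriptions to generate the final answer.
import json
import boto3
from pinecone import Pinecone

bedrock = boto3.client("bedrock-runtime")
pc = Pinecone()  # assumes PINECONE_API_KEY is set in the environment
index = pc.Index("contextual-video-frames")  # hypothetical index name

def embed_text(text: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # assumed Titan embedding model ID
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def answer_question(question: str, top_k: int = 5) -> str:
    matches = index.query(vector=embed_text(question), top_k=top_k,
                          include_metadata=True).matches
    # Build one multimodal message: each retrieved frame image plus its
    # contextual description, followed by the user's question.
    content = []
    for m in matches:
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg",
                       "data": image_to_base64(m.metadata["frame_path"])},
        })
        content.append({"type": "text",
                        "text": m.metadata["contextual_description"]})
    content.append({"type": "text",
                    "text": f"Using the frames and context above, answer: {question}"})
    body = {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 1024,
            "messages": [{"role": "user", "content": content}]}
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # or Haiku, as in the demo
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```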
Great. Now let's do some other, more interesting queries. I wrote a helper function to make this a little faster, and we'll go through a few different query types to show off the different capabilities of the demo we've created.

At some point in that video I talked about "Mad Libs for robots." You might remember that I said it, but not what it meant in context, so let's try to find where this was said and what was meant. We wait, we get the matches, we print the response, and we get a summary of what it means: it's based on the concept of masked language modeling, the idea that large language models drop some pieces of text and then try to predict what belongs in those gaps. We get the slide back from when "Mad Libs for robots" was written on screen, and also the transcript and the contextual description from when I was actually explaining it. So that's a great hit; that ended up working really well.

During the slideshow I showed a screen where I was talking about cultural nuances, which was a little dissonant from what was being presented, so let's see what happens with that search. We get that slide back, which is impressive because the slide isn't about cultural nuances around marriage; it's about when semantic search works and doesn't work. And we do get the section back where I was talking about the wedding specifically. That's also impressive, because we're handling something that wasn't shown on screen but was spoken, which shows why we're able to handle both of these modalities at the same time with this pre-processing technique.

Let's try a query that has to do with interpreting diagrams. I asked Claude: there was a diagram on screen with a bunch of arrows and something about paired training, can you explain this concept to me? This is a common thing with webinars: there's a diagram that's kind of confusing, and you might not remember what was meant by it or what you should do with that information. We can see we get a hit on the weakly supervised contrastive pre-training diagram, which has a bunch of arrows and explains this complicated paired-training concept, and we get the contextual description back. Finally, Claude is able to explain what the concept is based on the retrieved context, which is super useful.

Okay, let's do something a little more challenging. At some point in that other video I walk over a Jupyter notebook really similar to this one, so let's see if Claude can find that notebook and also answer a question about cross-lingual and multilingual search, and how I can think about doing that with Pinecone through code blocks. We get a hit back on semantic search with Pinecone, so not super great, but near the top five we do get a screen of the code I did some of that semantic search with, and we do get a hit on the specific type of cross-lingual search I was doing in the presentation. So that's awesome: not only were we able to get the screens that were relevant, we were also able to get the code that was relevant, even though Claude is just looking at a picture of that screen. That's pretty wild.

Okay, let's do one more, and then we'll head over to our Q&A.
I'm going to ask a question that's a little more complicated: what kinds of data are used to fine-tune multilingual LLMs, and what innovations did researchers have to come up with to create E5, which was the specific embedding model I used during that webinar? We get the screen back where we were talking about different kinds of datasets and their importance, and the interesting thing is that Claude was able to describe what the boxes on screen were and why they matter in context. So when we get this response back, we're able to talk about the datasets that appeared in the chart presented near the end of that webinar.

And there we have it. We've covered a lot of different capabilities that we get because we took this simplified contextual retrieval approach of slicing the video up into image and text pairs, which lets us use Pinecone and Claude to ask natural-language questions over this data. I'm going to go back to the presentation now... just a moment... I can't seem to re-enable my screen sharing, so give me a moment to troubleshoot what is happening with Zoom here. I'm unable to click the buttons I need on screen, so give me a second to fix that and we can get straight into the Q&A.

Arjun, I'm going to try sharing my screen, because I think I have your deck. Let's see if I can... Awesome, thank you. Sorry, my Zoom is just completely frozen and I'm unable to hit the buttons I need. Great, that's exactly where I was.

Awesome. So now that you have access to a really simple way of thinking about and doing contextual retrieval, what can you do with this information? We'll be releasing this demo publicly for people to use and mess around with. You'll need SageMaker, Bedrock, and Claude access to run the notebook, in addition to a Pinecone API key, but you can get an API key just by signing up for the service. You could try extending the demo to many different kinds of videos (we were just doing one here), or to a whole YouTube channel. You could add BM25 to do the full form of contextual retrieval that Alex was explaining earlier. You could create an evaluation set to understand how well you're retrieving the relevant sections of different kinds of videos based on the queries you're running. Or you could adjust the pre-processing parameters we were using for more fine-grained control; for example, I was sampling every 45 seconds, but you could go down or up depending on how much information is being presented in each video you're working with. Next slide, please.

And that's it, everyone. Thank you so much for coming and listening to our presentation. We're happy to answer any questions you might have about contextual retrieval, using Pinecone, or using Claude, so let's get into it.

I'll just give a quick note to the audience, because I've seen a handful of people use the raise-hand feature inside Zoom: since no one's microphones are enabled, please pop your questions in the Q&A box. The same goes for the chat; I tried to keep pace with the folks asking questions in the chat, but some of you I did not get to, so please copy those questions over to the Q&A and Alex and Arjun will take a look. Now, Arjun, Alex, do you have access to the Q&A box? Do you want me to feature some questions, or do you just want to start going through them?

Sorry, I can't click on the questions because my Zoom is frozen, so I can't quite see them. We'd be happy to have some fed to us, and then I'll go ahead and help answer them.
Sure, why don't I start with one that's for Alex. This is coming from Pru: why is Claude more suitable for RAG compared to other models like ChatGPT, Gemini, or Llama 3? You touched on some of this in your presentation slides.

Yeah, I think there are a few aspects to consider here. RAG is analogous to any other task when you're evaluating LLMs: there are a lot of factors. Context length for RAG is of course a very important one; you want to be able to reason over a large number of tokens in your context as these embeddings get passed in, as you're passing in documents, whatever it may be. I believe that Claude, especially the new 3.5 Sonnet, is particularly good at that; its context window right now is 200,000 tokens, and in the future we look to extend this even further. There are also the other factors I mentioned: increased intelligence, agentic planning, long horizons. I think RAG is increasingly going to look like a loop instead of a single pass, where you're interfacing with these LLMs back and forth so that they gather all the relevant information before providing their answer, rather than a single shot through. So there are a few different things. Just like with everything, I always encourage people to create evals beforehand and then test the models. I'm slightly biased when I say Claude is the best at this, but I always tell devs: hey, try it for yourself on your own use case and come to your own conclusions. Just scrolling through these other questions... Arjun, can you see these?

No, let me try to troubleshoot a bit so I can look at them a little more easily.

I see one: are there any plans to allow prompt caching through AWS Bedrock? I saw that it's only currently available when using the Anthropic API directly. That's correct, it's currently not available on AWS Bedrock, but it's coming very soon. So yes, there are plans; I don't have a date, but we are working on it.

There was a question that came through the chat before about how, if the context that is relevant to the broader document is attached to every single chunk, do you run the risk of the chunks all getting the same or very similar embedding values, and of this creating a problem for retrieval?

Awesome, I'm back. Hey, Arjun. Alex, I think that was a question for you.

Oh, okay. Yeah, I think that's a valid concern, but the models are smart enough to differentiate between chunks, and this can also be handled through prompting. The prompt we specify in that blog post is very simple: we pass in the document, we say here's a document, provide some relevant context for this chunk, and then we append the chunk. You can be a lot more clever with that; you can specify unique ways you want the information presented to keep it distinct. I'd always run these things on just a few test cases first to make sure you like your quality: hand-grade your outputs, take a look at what they look like when you're embedding them, and then once you're satisfied with the quality, run it across all of the documents in your corpus. So I think there are a lot of ways to get clever here, and the cool thing about how we're using LLMs is that they allow for that generality and steerability toward whichever contextual format you want; you're not really pinned down to anything in particular.
On getting access to the notebook: we'll be releasing it as a GitHub repo, so people will be able to pull it down and use it. We'll probably add a little more functionality to make it easier to use and modify, so don't worry about that; we'll make it easy for people to access.

Oh, what's up? I was just wondering whether you all can see the Q&A now and whether you need me to keep moderating. I can see the new questions but not the historical ones. And yes, the repo will be sent out via email.

I'm seeing a few questions related directly to the demo. Cish is asking about the workflow we're showing, and yes, that's roughly correct: we're converting the video into frame and text pairs, and then describing what is going on inside each frame with respect to the transcript for that window and the entire transcript of the video. What's stored inside the vector database is just the contextual description, not the images, because we're doing a text-only search there. Then, once we get to Claude at the end, we have the ability to pass Claude the images directly, because Claude has this awesome visual question answering capability. So at the end we pass Claude the images and the contextual pairs, in addition to the query being asked, and we say: hey, with all of this information we're giving you, can you try to answer the query the user asked? That's how the LLM is able to do that type of summarization.

John has a question about hybrid search and contextual retrieval. So, one of the ways to do contextual retrieval (Alex, please correct me if I'm wrong) is to incorporate something like TF-IDF or best match 25 (BM25) in order to get the benefits of keyword search in addition to Pinecone's semantic search. You should see an additional lift because you're doing both the keyword search and the semantic search. For this demo I chose to keep it really simple and just show what happens when you do contextual embeddings and how much that helps, but you could of course add BM25 or TF-IDF to make that work even better. Anything to add to that, Alex?

Yeah, if you check out our blog post, which I think somebody posted a link to in the chat earlier (if you just search "contextual retrieval Anthropic" you'll find it), we show that with each thing you add, you get an improvement in your chunk retrieval rate. So when you do the full combination of contextual embeddings plus contextual BM25, you get the best of both worlds to some degree: you get the semantic search but also the keywords and everything else.
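As a rough illustration of that idea (not something the demo itself does), you could fuse Pinecone's semantic results with BM25 scores computed over the same contextualized descriptions, for example with reciprocal rank fusion. The `rank_bm25` package, the `embed_text` and `index` objects, the `frame_context_pairs` list, and the assumption that vector IDs are the string positions of each description are all carried over from the earlier sketches.

```python
# Sketch of hybrid retrieval: semantic search (Pinecone) fused with keyword
# search (BM25) over the same contextualized descriptions, using reciprocal
# rank fusion. Assumes vector IDs are string positions in the list.
from rank_bm25 import BM25Okapi

contextual_texts = [p["contextual_description"] for p in frame_context_pairs]
bm25 = BM25Okapi([t.lower().split() for t in contextual_texts])

def hybrid_search(query: str, top_k: int = 5, k_rrf: int = 60) -> list[str]:
    # Semantic side: rank of each ID in the vector search results.
    semantic = index.query(vector=embed_text(query), top_k=20)
    semantic_rank = {m.id: r for r, m in enumerate(semantic.matches)}

    # Keyword side: rank of each ID by BM25 score over the same texts.
    scores = bm25.get_scores(query.lower().split())
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:20]
    keyword_rank = {str(i): r for r, i in enumerate(order)}

    # Reciprocal rank fusion: IDs near the top of either list float upward.
    fused: dict[str, float] = {}
    for ranks in (semantic_rank, keyword_rank):
        for doc_id, r in ranks.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k_rrf + r + 1)
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```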
Keep these questions coming, they're really fun to answer. David, I see you: "I heard you like video webinars, so let me throw a webinar inside your webinar." Yeah, I think it would be really fun to throw this webinar at this type of application and see what happens; it would totally work.

Toa, you had a question about the segmenting we're doing: why 45 seconds? Honestly, this was just a parameter I picked while messing around with the demo. If you pick a window that's too small, there isn't enough spoken text to create a helpful description; if you pick a window that's too big, too many slides may have been flipped through while you're only taking one snapshot within it. So I just chose 45 seconds because it correlated well with the rate at which I was speaking in that video, but of course you can adjust this per video, based on the videos you're working with, to make this easier to use.

Jeffrey has a question on prompt caching. Alex, do you think you could take this? It looks like it's about when the cached prompt is sent at inference time and how it's actually beneficial there. Yeah, I can't dive too much into the actual implementation details, but we can think about it at a high level: an LLM is an autoregressive model that has to attend back over every preceding token for each subsequent token, and as it does that it's computing this KV cache, which is basically a numerical matrix. Instead of having to recompute that every single time, we can checkpoint it at our cache breakpoint, and then we don't need to recompute everything before it; we just go from that point forward. So that's how it works at inference time, at a very high level.

Indes is asking: have you tried this technique over tens of thousands of documents? Our customers have a huge document base, and I'm curious how this technique holds up. Alex, do you want to talk about this with respect to contextual retrieval, scaling costs, and what that looks like? Yeah, the reason we developed this was primarily to target these large corpora of documents. You will of course need to do some plumbing on your end to make sure it works as you do this pre-processing, but what's great about contextual retrieval, and all the embedding work to some degree, is that as long as the information is static and not changing too much, it's a one-time cost. You just need to run your data through once, then you have it all embedded, and you can do the actual retrieval later on. So yes, this scales to however many documents you have, just as it would if you were doing it over ten documents.

Eli has an interesting question that's basically about reranking. Eli asks: how do you think about the trade-off between throwing a long tail of retrieved Pinecone results at Claude 3.5 versus trimming down to the most relevant, especially when it's hard to know which results contain the answer to the user's request? In my usage so far, I've seen answer quality diminish quickly above about 40 retrieved embeddings. So maybe we'll do a tag-team answer on this, Alex: I'll talk about reranking and how it's helpful, and maybe you can talk about Claude's long-context ability and how it deals with having tons of things thrown at it. With respect to this, it's important to rerank the data that's being retrieved when you get responses back. We lucked out here: when we were doing this type of generation, we didn't have to do too much tuning. But you're right that ideally you want to optimize the amount of information being sent so there isn't too much irrelevant context. I think Alex can speak to how well Claude deals with huge contexts, especially given the 200k context window the models can handle.
Yeah, definitely: with everything long-context, there's a balance and a trade-off you have to factor in. With Claude, that trade-off is usually in your favor, but you can't just keep throwing information at it; 200,000 tokens is roughly 150,000 to 160,000 words, which is a lot of information. The trade-off is that you're going to bump up your costs and your latency, and there might be marginal intelligence decreases as you approach the max of that context window. For the first two, prompt caching helps tremendously if you can utilize it as you append information; it's a little tougher in a retrieval case where information is inserted dynamically, but there are clever ways to get around that. On the intelligence front, that's something we're working on. We think Claude is already industry-leading in its ability to recall over its long context, as evidenced by tests like the needle-in-a-haystack test, which you might have seen with all those green dots on a grid, but there's still a lot of improvement to be done, so we're continuing to work on how Claude reasons over the entirety of its context window.

Awesome. A few more quick questions. John is asking when we'll get the repo: very soon, along with the webinar recording. Do we need to use LangChain for any of this? Not necessarily; if you're using another framework to orchestrate your LLM calls, you can do that, and you don't have to use one at all. I didn't use LangChain in this demo, so it's not needed. Vode is asking about the recommended way to rerank. It depends on the data and the reranker model you're working with. Pinecone has a reranker that is text-only, and what's nice about this workflow is that the retrieved results are text-only, so you could just run the reranker over them. Typically you want to increase the top-k you're retrieving, so you might use a top-k of 10 to 20, or maybe even 30, and then rerank back down to five. That way you're retrieving as many potentially relevant results as possible, you use the reranker to pick the most relevant ones, and you move forward from there.
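A sketch of that over-retrieve-then-rerank pattern is below. Pinecone's hosted text reranker could be used for the scoring step; to keep the sketch self-contained I use a generic open-source cross-encoder instead, and the `index` and `embed_text` objects are carried over from the earlier sketches. The model name and function names are assumptions, not code from the demo.

```python
# Sketch of over-retrieve-then-rerank: pull a wider top_k from Pinecone, rerank
# the text-only contextual descriptions, and keep only the best few before
# handing them to Claude. Uses an open-source cross-encoder as the reranker.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(question: str, wide_k: int = 20, final_k: int = 5):
    # Over-retrieve: cast a wide net with semantic search first.
    matches = index.query(vector=embed_text(question), top_k=wide_k,
                          include_metadata=True).matches
    # Rerank: score each (question, contextual description) pair directly.
    texts = [m.metadata["contextual_description"] for m in matches]
    scores = reranker.predict([(question, t) for t in texts])
    ranked = sorted(zip(scores, matches), key=lambda x: x[0], reverse=True)
    # Trim: pass only the most relevant few on to the generation step.
    return [m for _, m in ranked[:final_k]]
```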