Transcript for:
Understanding Contextual Retrieval Mechanism

Anthropic just released a new retrieval mechanism called contextual retrieval. Based on the results, it's the best-performing technique to date, and when combined with re-ranking it gives you state-of-the-art performance. I'd call it more of a chunking strategy than a new RAG technique, but the results are impressive.

To understand this, let's first look at how RAG works. You have your documents, you chunk them into sub-documents, and then for each of the chunks you compute embeddings. Those embeddings are stored in a vector store. At run time, or inference time, the user asks a question, we compute an embedding for that question, and then retrieve the most relevant chunks based on embedding similarity. Those chunks, along with the query, are fed into an LLM and you get a response. This technique is great for semantic similarity, but there are still a lot of failure points. That's why a lot of people combine this semantic search with traditional keyword-based search mechanisms. One such technique is BM25, which plays a critical role if the user query and the database involve specific keywords. Here is an example from Anthropic: suppose a user queries error code TS-999 in a technical support database. An embedding model, or semantic search, might find content about error codes in general but could miss the exact string TS-999. A keyword-based search mechanism such as BM25, however, looks for that specific text string to identify the relevant documentation, so it has a much higher probability of catching this specific error code.

The biggest problem with current standard RAG systems is that you lose a lot of contextual information. When chunks are returned based on their relevance to the user query, they don't carry any information about the rest of the document. Here is an example that Anthropic provided. Imagine you had a collection of financial information, say SEC filings, embedded in your knowledge base, and the user query is: what was the revenue growth for ACME Corporation in Q2 2023? The relevant chunk might be "the company's revenue grew by 3% over the previous quarter." However, this chunk on its own does not specify which company it's referring to or the relevant time period, so it's going to be very difficult for both the embedding model and the keyword-based search mechanism to retrieve this specific chunk, especially if there are multiple companies mentioned in the same document.

That is why Anthropic is recommending contextual retrieval: when you create chunks, you also include contextual information in each chunk. There are some issues associated with this approach, which we're going to discuss later in the video. But instead of the original chunk, your new chunk is going to look something like this: "This chunk is from an SEC filing on ACME Corporation's performance in Q2 2023; the previous quarter's revenue was..." followed by the actual chunk that your chunking strategy created. As you can see, they recommend adding this contextual information both to what goes into the embedding model and to your BM25 index.

So how do you implement this? They suggest using an LLM to do it automatically. They have provided a prompt for Haiku, which is one of their smallest and most cost-effective models. This is a general prompt that you will want to customize for your specific application. In the first part of the prompt, you provide the whole document; let's say you are working with a PDF file and you're chunking that PDF file, you would provide the content of that PDF here. Then the rest of the prompt says: here is the chunk we want to situate within the whole document, and you provide the specific chunk you are working with. After that, you ask Haiku to give you a short context that situates this specific chunk within the overall document, for the purpose of improving search retrieval of the chunk. So Haiku adds contextual information to the chunk, and as a result you will usually add 50 to 100 tokens to each of your chunks.
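To make this concrete, here is a minimal sketch of what that contextualization step can look like with the Anthropic Python SDK. The prompt follows the general template Anthropic published, but the exact wording, the model ID, and the token limit are illustrative choices you would tune for your own application.

```python
# Minimal sketch: generate a short context for one chunk with Claude Haiku and
# prepend it to the chunk. Assumes the `anthropic` SDK is installed and
# ANTHROPIC_API_KEY is set; model ID and max_tokens are illustrative choices.
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """<document>
{doc}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short, succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""


def contextualize_chunk(document: str, chunk: str) -> str:
    """Return `generated context + original chunk`, ready for indexing."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # small, cost-effective model
        max_tokens=200,                   # the context is usually 50-100 tokens
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(doc=document, chunk=chunk),
        }],
    )
    context = response.content[0].text.strip()
    return f"{context}\n\n{chunk}"
```

You would run this once per chunk at indexing time, which is where the one-time cost discussed below comes from.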
Now, in practice, this is how it's going to look. You take a single PDF file, or a single document from your corpus, and convert it into chunks. Then you take one chunk at a time and run it through the prompt I just showed you, along with the original document. This generates contextual information for each chunk. You combine each chunk with its contextual information and pass it through an embedding model, and those embeddings are stored in a standard vector database. In parallel, you also update the BM25 index, which builds on TF-IDF, term frequency-inverse document frequency; this is basically the keyword-based search mechanism. As a result, you are adding 50 to 100 tokens to each of the chunks.

I think you can already see some potential issues with this approach. One of them is the overhead: not only the extra tokens you're adding to each chunk, but also the fact that every chunk has to go through an LLM, and that adds up to a lot of tokens. However, you can use the prompt caching feature that Anthropic introduced a few months ago (Gemini has a very similar feature), and with its help you can reduce the cost. Here are their quick calculations: assuming 800-token chunks, 8,000-token documents, 50-token context instructions, and 100 tokens of context per chunk, the one-time cost to generate the contextualized chunks is about $1.02 per million document tokens. So it's a relatively small price, but keep in mind that it can add up if you have millions of documents. The rest of the retrieval pipeline is the same as the standard retrieval pipeline, and that's why I think this is more of a new chunking strategy than a completely new retrieval mechanism.

After doing all this, what kind of performance improvement can you expect? The Anthropic team ran retrieval benchmarks on a number of different datasets, or domain-specific benchmarks, and the summarized results are as follows. Contextual embeddings reduced the top-20-chunk retrieval failure rate by 35%, so it went from 5.7% to 3.7%, which is a big improvement. If you combine the contextual embeddings approach, which they recommend, with contextual BM25, the keyword-based search mechanism running on the improved chunks, you can reduce the top-20-chunk retrieval failure rate by 49%, so you go from 5.7% to 2.9%. There are other approaches you can combine with this; I cover a lot of them in my advanced RAG Beyond Basics course. In this specific experiment they used a dense embedding model, Gemini Text 004, which according to their experiments was the best-performing model among those they tested, so based on the results it's definitely worth checking out. They also added a re-ranker on top of their RAG pipeline, which showed further improvements.
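Here is a rough sketch of that indexing and hybrid search step, reusing the contextualize_chunk helper from the earlier sketch. The embedding model, the rank-fusion step, and the use of the rank_bm25 and sentence-transformers libraries are illustrative stand-ins rather than Anthropic's exact setup.

```python
# Sketch: index contextualized chunks in both a dense (embedding) index and a
# sparse (BM25) index, then combine the two result lists at query time.
# contextualize_chunk() is the helper from the earlier sketch.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any dense embedding model


def build_indexes(document: str, chunks: list[str]):
    # 1) Prepend the LLM-generated context to every chunk.
    ctx_chunks = [contextualize_chunk(document, c) for c in chunks]

    # 2) Dense index: embeddings over the contextualized chunks.
    embeddings = embedder.encode(ctx_chunks, normalize_embeddings=True)

    # 3) Sparse index: BM25 over the same contextualized text.
    bm25 = BM25Okapi([c.lower().split() for c in ctx_chunks])
    return ctx_chunks, embeddings, bm25


def hybrid_search(query: str, ctx_chunks, embeddings, bm25, k: int = 20):
    # Embeddings are normalized, so a dot product gives cosine similarity.
    q_emb = embedder.encode([query], normalize_embeddings=True)[0]
    dense_scores = embeddings @ q_emb
    sparse_scores = bm25.get_scores(query.lower().split())

    # Reciprocal rank fusion is one common way to merge the two result lists;
    # the constant 60 and equal weighting are illustrative choices.
    fused = np.zeros(len(ctx_chunks))
    for ranked in (np.argsort(-dense_scores), np.argsort(-sparse_scores)):
        for rank, idx in enumerate(ranked):
            fused[idx] += 1.0 / (60 + rank)
    top = np.argsort(-fused)[:k]
    return [ctx_chunks[i] for i in top]
```

Anthropic's own pipeline merges and de-duplicates the embedding and BM25 result lists before passing the top chunks along; the fusion shown here is just one simple way to do that merge.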
My recommendation is to have a keyword-based search mechanism, along with some sort of query rewriter plus a re-ranker, in any RAG implementation. Based on my experience, adding these three components can give a substantial boost in retrieval accuracy. With the addition of a re-ranker to the same pipeline, the average retrieval failure rate went from 5.7% to 1.9%.

Some other things they recommend keeping in mind when you're building RAG systems. First, the chunking strategy: it's very application dependent, so you want to look at your chunk size, chunk boundaries, and chunk overlap. I have a couple of videos on this specific topic and I'll put links to those in the video description. Another thing to consider is the embedding model. Most people use dense embedding models, such as OpenAI's embedding models or any number of open-source embedding models. In their experiments, Anthropic found that the Gemini and Voyage embeddings seem to be pretty effective for this particular task. I also highly recommend looking at something like ColBERT-based multi-vector representations; in theory, those should be more effective than dense embedding models. I have quite a few videos discussing those, so I'll put links to a couple of them in the video description. Another consideration is the custom contextualizer prompt. We looked at the very general prompt they included for Haiku; depending on the documents you are working with, you will need to customize that contextualizer prompt. And the last thing is the number of chunks you want to return. They experimented with 5, 10, and 20 chunks and found 20 to be the most performant, and I would suspect that if they increased the number of chunks beyond 20, they might see some further improvement. In general, you want to first return a large number of chunks during the retrieval process, then add a re-ranker step to narrow that down to the most relevant chunks that get fed to the LLM to generate a response.

Another thing to keep in mind is how they are measuring accuracy, or the error rate. They retrieve 20 chunks and then check whether the chunk relevant to a specific query is present among those 20 chunks or not. So they're not checking whether it comes back as the top-ranked chunk; if it's present anywhere in the top 20 returned chunks, they count it as a success. This is a standard approach, but I wanted to highlight it because people can get confused and assume the returned chunk is the topmost one.

I'm happy that they tried to address the long-context-versus-RAG question. They say that sometimes the simplest solution is the best: if your knowledge base is smaller than 200,000 tokens, which is about 500 pages of material, you can just include the entire knowledge base in the prompt you give the model, with no need for RAG or similar methods. But keep in mind this means that with every query you are sending 500 pages of material to the LLM, and the cost will add up quickly. That's why they recommend using their prompt caching feature for Claude; with prompt caching, the cost is reduced by up to 90% and the latency by more than 2x. I have a video on how prompt caching works, which I think is very important if you are building anything for production. And if your knowledge base is larger than 200,000 tokens, which is beyond the context window of the current generation of Claude models, then they recommend using RAG.
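Since prompt caching matters both for the long-context approach and for the contextualization step itself, here is a minimal sketch of what a cached prefix looks like with the Anthropic SDK. The cache_control block is the documented mechanism; the model ID, the token limit, and the idea of caching the whole knowledge base as a system block are illustrative choices, and depending on your SDK version prompt caching may still require the beta client or a beta header.

```python
# Minimal sketch: cache a large, repeated prefix (here, the whole knowledge
# base as a system block) so that follow-up queries reuse it at reduced cost.
# Assumes an `anthropic` SDK version where prompt caching is available on
# messages.create; older versions may need the beta client / beta header.
import anthropic

client = anthropic.Anthropic()


def ask_with_cached_knowledge_base(knowledge_base: str, question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",          # illustrative model choice
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": knowledge_base,                  # the large, repeated prefix
            "cache_control": {"type": "ephemeral"},  # mark it as cacheable
        }],
        messages=[{"role": "user", "content": question}],
    )
    # The usage object reports cache creation / cache read token counts
    # when caching is active, which is how you can verify it is working.
    print(response.usage)
    return response.content[0].text
```

The same idea applies to the contextualization step: the full document is the repeated prefix, and only the chunk changes from call to call.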
Let's look at the code example of contextual retrieval that Anthropic has provided. There is a notebook in their own repo; I'll put a link to that specific repo. They compare basic RAG with contextual embeddings plus contextual BM25, and then add re-ranking on top of it. The repo also contains the data that is used, so you can basically replicate their results. They start by installing all the different packages that are needed; for this one they're using the Voyage embedding model, and for re-ranking they're using the Cohere API (a small sketch of what that reranking call can look like follows at the end).

Okay, so here's how they create the vector DB. In the first step, they compute embeddings for the original chunks and do basic RAG. They have code that runs an evaluation on this basic RAG system. Based on the results for top-5, meaning you count it as a success if the most relevant chunk is retrieved within the top five chunks, the accuracy is 80%. If you extend it to top-10, it's 87%, and if you extend it to top-20, the accuracy of the standard RAG system is about 90%.

Now, if you use the contextualized embeddings: they provide a code snippet that takes each chunk plus the document the chunk comes from and runs it through the prompt I showed you at the beginning of the video. This way they create new chunks, again using the Haiku model, and you can see that the number of tokens is quite a bit higher than what's contained in the original chunks. They then run the same embedding model over these contextualized chunks: for top-5 this gives an improvement of about 6%, for top-10 there is again an improvement of a few percent, and for top-20 the improvement is about 3%. Next they create the contextualized BM25 index, which basically adds the keyword-based search mechanism, and that adds some further improvement, probably about 1%. And if you add the re-ranking step on top of it, you see about a 2% improvement, which is pretty significant, going from 92% to 94%.

Anyways, this was a quick video on their latest approach, contextual retrieval. It's actually great to see that companies like Anthropic are taking RAG seriously; it shows that RAG is still relevant even in the era of long-context LLMs, and based on my experience it's one of the most widely used applications of LLMs at the moment. If you are interested in learning more about RAG systems or agents, make sure to subscribe to the channel; I create a lot of content on RAG and LLM agents. Let me know if you want me to create more detailed tutorials on contextual RAG and how it can be incorporated into your own systems. I hope you found this video useful. Thanks for watching. As always, see you in the next one.
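As mentioned above, here is a minimal sketch of the reranking call and the accuracy-at-k check described in the notebook walkthrough. It assumes the `cohere` SDK; the rerank model name, the environment variable, and the shape of the evaluation set are illustrative choices, not the notebook's exact code.

```python
# Sketch: rerank a broad candidate set with Cohere, then measure accuracy@k
# (did the gold chunk show up anywhere in the top k retrieved chunks?).
import os
import cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])  # env var name is our choice


def rerank(query: str, candidates: list[str], final_k: int = 20) -> list[str]:
    """Retrieve broadly first (e.g. the top ~100 chunks), then keep the best final_k."""
    response = co.rerank(
        model="rerank-english-v3.0",   # illustrative reranker choice
        query=query,
        documents=candidates,
        top_n=final_k,
    )
    return [candidates[r.index] for r in response.results]


def retrieval_accuracy(eval_set: list[dict], retrieve_fn, k: int = 20) -> float:
    """Fraction of queries whose gold chunk appears anywhere in the top k results."""
    hits = sum(
        example["gold_chunk"] in retrieve_fn(example["query"], k)
        for example in eval_set
    )
    return hits / len(eval_set)
```

Running retrieval_accuracy with k set to 5, 10, and 20 reproduces the kind of top-5/top-10/top-20 numbers discussed above for whichever pipeline variant you plug in as retrieve_fn.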