Hi folks, welcome back. I hope you're all doing well. By now, the general idea of retrieval augmented generation, or RAG, is pretty well understood in LLM circles. In fact, it's one of the most common ways in which LLMs are used. The core idea of RAG is to let a general-purpose LLM, or foundation model, answer questions about your specific domain by bringing your own private corpus to the model and grounding its answers in that corpus. However, people have been trying to solve some of the shortcomings of RAG ever since it first became a common pattern, and one of those techniques is to use a knowledge graph. This paper from Microsoft Research brings a lot of those ideas together in one comprehensive paper, so I wanted to take a deeper look at it.

Let's start by asking: what's wrong with RAG? I have a previous video on this channel which explains the basics of how RAG works; I'll link that down in the description. The main shortcoming of RAG, as the authors point out, is that it fails on global questions about the entire corpus. The way RAG is implemented, it first retrieves the documents, or parts of documents, that are deemed relevant to a user's query. As the authors put it, the answer you're looking for must be contained within regions of text in those retrieved documents. Those snippets, or even the entire document if you put it into the context, provide the grounding for the generation task. By grounding, I mean you prompt the LLM with the query and these parts of your document as the source material to answer the query from.

But what if you ask a query that has to do with concepts that appear throughout the document? A simple example would be: tell me the major themes in this document. In scenarios like that, the semantic search that RAG depends on doesn't quite work, because there isn't going to be a great semantic match between a query like "what are the major themes in this corpus?" and the terms in the corpus. Answering this query requires some understanding of what the concepts in the corpus are. And it is this major weakness, the lack of a deeper level of understanding, that techniques such as GraphRAG try to address. The authors call this sensemaking. The idea of sensemaking is to understand the core connections among the entities your document or corpus talks about: entities like the people, places, events, or concepts that are mentioned or explained in your document. You want to extract what those entities are and how they're related to each other, and then use that knowledge to provide more sensible, more coherent, higher-level answers to a user's query.

There are several ways to do this, but the authors lay out a fairly general and comprehensive flow of steps for how to do it. A lot of these steps are performed offline, at indexing time; those are all the blue boxes. The information computed at index time is then used when answering queries, at lookup time, in the green boxes over here. The offline steps go something like this. The first couple of steps are going to be familiar: you chunk up your documents. After chunking is where the interesting new stuff happens: you extract element instances. This is where you pull out the main entities in your corpus and the relationships among them, and then you go on to summarize them.
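To make that extraction step a bit more concrete, here's a rough sketch of what chunking and prompting an LLM for entities and relationships could look like. The delimiter format, the `chunk_text` helper, and the `call_llm` function are placeholders I'm inventing for illustration; they're not the paper's actual prompts or code.

```python
# Sketch of the "element instance" extraction step: for each chunk, ask an LLM
# to emit entities and relationships in a simple delimited format, then parse them.
# `call_llm` stands in for whatever chat-completion client you use.
from typing import Callable, List

EXTRACTION_PROMPT = """Extract all entities (people, places, organizations, events, concepts)
and the relationships between them from the text below.

Output one item per line:
ENTITY|<name>|<type>|<short description>
RELATION|<source entity>|<target entity>|<short description of the relationship>

Text:
{chunk}
"""

def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> List[str]:
    """Naive word-based chunking; the paper experiments with several chunk sizes."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def extract_elements(chunk: str, call_llm: Callable[[str], str]):
    """Parse the LLM's delimited output into entity and relationship tuples."""
    raw = call_llm(EXTRACTION_PROMPT.format(chunk=chunk))
    entities, relations = [], []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if parts[0] == "ENTITY" and len(parts) == 4:
            entities.append(tuple(parts[1:]))   # (name, type, description)
        elif parts[0] == "RELATION" and len(parts) == 4:
            relations.append(tuple(parts[1:]))  # (source, target, description)
    return entities, relations
```

In a real pipeline you'd wire `call_llm` up to an actual LLM client and chunk by tokens rather than words, but the shape of the step is the same: chunk, prompt, and parse out the nodes and edges of the graph.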
Once you have the main nodes and edges of this graph, you go on to form clusters of those nodes. You group similar ideas and concepts together into clusters, which the authors call graph communities. Then you go one step further and summarize those communities as well. At query time, you start from this higher-level abstraction and use it to generate a global answer, one that draws on ideas from throughout the corpus.

Let's look at each of these steps in a bit more detail. The first one is fairly straightforward: you have to break your document up into chunks in order to process it. The authors experiment with various chunk sizes to see which is most effective. The next step, where you extract concepts, nodes, and edges from these chunks, is where a lot of the heavy lifting happens. The way they do this is, again, by prompting an LLM so that it can take these chunks and extract the concepts out of them. They use few-shot prompting for this, and they mention elsewhere in the paper that they will publish their code and prompts to a GitHub repo. The next step is to summarize all the elements, or nodes, you have extracted. So you've extracted all these concepts from your corpus, and now you generate summaries, or descriptions, of all the entities and of the relationships between them. Once again, they prompt an LLM to generate these summaries.

The step after that is to cluster these elements into what they call communities: clusters of nodes that are connected by strong relationship edges. This gives you a hierarchy in your graph. When you perform this community clustering, you cover all the nodes in your graph in a mutually exclusive, collectively exhaustive way. That just means every node in the graph belongs to a cluster, and each node belongs to exactly one cluster. Finally, they do one more level of summarization, where they summarize these clusters or communities as well. Note that an interesting benefit of this entire process is that the community summaries and node summaries are useful by themselves. Even if you never run a query, these summaries are a great abstraction of the underlying corpus, and just reading them might be enough to understand the main concepts in your corpus or document and how they're linked to each other.

Alright, so we did all this preprocessing to extract concepts and the connections between them. Now let's see how it's all used when it's time to answer a query. Query answering also happens in a few steps. It starts with the community summaries, which were the last thing we computed offline. They chunk and randomly shuffle these community summaries; the idea is to avoid concentrating all the relevant information in a single context window. They then make multiple calls to the LLM, answering the query against each of these chunks of community summaries, and for each chunk they generate an intermediate answer. Each of these answers is then scored, or ranked; once again, they use an LLM to rank the answers by how well they address the user's query. Finally, they take the top-ranked answers, concatenate them into the context window, and generate a final global answer for the user. So this is an iterative way to use the summaries: generate many candidate answers, rank them, and then take the top-ranked ones to produce a final answer.
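As a rough sketch of that query-time flow, here's how the map step (answer against shuffled batches of community summaries, with a self-reported helpfulness score) and the reduce step (combine the top-scoring partial answers) might fit together. The scoring format and the `call_llm` helper are my own illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of query-time answering: map over shuffled batches of community
# summaries, score each intermediate answer, then reduce the best ones into
# a single global answer. `call_llm` is again a placeholder.
import random
from typing import Callable, List

def global_answer(query: str,
                  community_summaries: List[str],
                  call_llm: Callable[[str], str],
                  batch_size: int = 8,
                  top_k: int = 5) -> str:
    summaries = community_summaries[:]
    random.shuffle(summaries)  # spread relevant summaries across batches

    intermediate = []
    for i in range(0, len(summaries), batch_size):
        context = "\n\n".join(summaries[i:i + batch_size])
        raw = call_llm(
            "Using only the context below, answer the question. Then on the "
            "last line write 'SCORE: <0-100>' for how helpful this partial "
            f"answer is.\n\nContext:\n{context}\n\nQuestion: {query}"
        )
        answer, score = raw, 0
        if "SCORE:" in raw:
            answer, _, score_text = raw.rpartition("SCORE:")
            try:
                score = int(score_text.strip())
            except ValueError:
                score = 0
        intermediate.append((score, answer.strip()))

    # Keep only the top-ranked partial answers and synthesize the final answer.
    best = [ans for _, ans in sorted(intermediate, reverse=True)[:top_k]]
    return call_llm(
        "Combine these partial answers into one comprehensive answer to the "
        f"question '{query}':\n\n" + "\n\n---\n\n".join(best)
    )
```

The shuffle matters here: it spreads potentially relevant summaries across batches, so no single context window has to carry everything.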
Now, how do you evaluate something like this? That itself is an interesting problem, because most benchmarks out there don't really target a scenario like this, one that requires a global understanding of a corpus. So the authors essentially had to figure out how to evaluate this themselves, bringing their own corpora and test data sets. They used a couple of different data sets: one is transcripts of podcasts, the other is a collection of news articles. Then the question becomes: how do you generate questions that exercise this kind of global understanding? Once again, they used an LLM to generate questions that require understanding of the entire corpus. This paper is a great illustration of a pattern that is now pretty common in LLM research: whenever you run into the problem of "how are we going to do this?", the first answer seems to be "let's ask the LLM to do it." Here they use an LLM to extract concepts, to extract the connections between them, to summarize those concepts, and then to summarize them one level higher. They also use LLMs to generate their eval questions.

They compare several conditions, depending on which level of community summaries is used to answer the questions. You can use root-level community summaries, or you can use summaries from lower, more fine-grained levels of the community hierarchy. And then there's the SS condition, which is naive, old-style RAG. Once again, how do you compare all these conditions? They use a head-to-head comparison with an LLM evaluator: they give the same query to all of the conditions and use an LLM to judge which answer is best. The things they look for in an answer are comprehensiveness; diversity, which means they want the answer to be varied and cover different ideas from the corpus; empowerment, a metric I hadn't heard of before, which basically asks whether the answer helps the reader understand the topic; and finally directness, which is how specifically and clearly the answer addresses the question. That last one sounds a lot like relevance.

Before we look at the numerical results, let's look at one example they highlight in the paper to compare GraphRAG with naive RAG. This is a question asked of their news article data set: which public figures are repeatedly mentioned across various entertainment articles? If you look at the naive RAG answer, it picks out a number of specific celebrities. Whereas if you look at the GraphRAG answer, it picks key individuals from different facets of public life: actors and directors, musicians, athletes, influencers, public figures who are currently being discussed, and so on. So rather than zeroing in on particular celebrities who happen to be mentioned often across the corpus, the GraphRAG answer shows that it can extract some deeper meaning and classify and cluster all these celebrities. Their LLM evaluator deems the GraphRAG answer better in this case, but I think you'll agree that even from a human point of view, the GraphRAG answer seems to show a deeper understanding of the underlying corpus. This diagram shows the win rates of the various conditions against each other, that is, how often one type of RAG beat another. The high-level result is that all the GraphRAG conditions outperformed the naive RAG conditions on both comprehensiveness and diversity.
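To give a feel for how that head-to-head LLM evaluation could work, here's a small sketch that asks a judge model to pick a winner per metric for a pair of answers. The metric descriptions are paraphrased and the judging prompt is my own wording, not the prompt from the paper.

```python
# Sketch of the head-to-head LLM evaluation: for each metric, ask a judge
# model which of two answers is better. `call_llm` is a placeholder judge.
from typing import Callable, Dict

METRICS = {
    "comprehensiveness": "How much detail does the answer provide to cover all aspects of the question?",
    "diversity": "How varied is the answer in offering different perspectives and insights?",
    "empowerment": "How well does the answer help the reader understand the topic and make judgements?",
    "directness": "How specifically and clearly does the answer address the question?",
}

def head_to_head(query: str, answer_a: str, answer_b: str,
                 call_llm: Callable[[str], str]) -> Dict[str, str]:
    """Return a winner ('A' or 'B') for each metric."""
    verdicts = {}
    for metric, description in METRICS.items():
        raw = call_llm(
            f"Question: {query}\n\nMetric: {metric} - {description}\n\n"
            f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
            "Which answer is better on this metric? Reply with exactly 'A' or 'B'."
        )
        verdicts[metric] = "A" if raw.strip().upper().startswith("A") else "B"
    return verdicts
```

Running every query through a judge like this, over many question and condition pairs, is what produces the win-rate comparison in the diagram.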
So that was a quick look at a paper that shows how to enhance naive or traditional RAG with higher-level summaries, with this notion of a graph of concepts and the connections between them, and how in their evaluation it produced much more comprehensive and much more diverse answers. It really does address the Achilles' heel of traditional or naive RAG, which is that it has a hard time with concepts or ideas that span the entire corpus. I see a lot of RAG moving in this direction. I hope you enjoyed that. If you like content like this, please consider subscribing, like the video, and I will see you all next time. Thank you very much.