Transcript for:
Building Knowledge Graphs with LLMs Overview

Right, do we have to wait until the countdown is done? Okay, can you just let me know when to start? Will I hear you introducing me, or how does it go? I'll say one sentence and then you take it over. All right, welcome everybody. We are kicking it off with Tomaz Bratanic, who is a graph data science expert, author, and all-around GenAI mastermind at Neo4j, with his session Building Knowledge Graphs with LLMs. Take it away, Tomaz.

Okay, thanks for the nice introduction. As Alex mentioned, today we'll be talking about building knowledge graphs with LLMs, and as he also mentioned, I'm writing a book with Oscar about knowledge graph RAG. But enough of that. Funnily enough, I watched the previous session for about five minutes and saw a nice continuation into this one, because we're not going to jump straight into knowledge graph construction. First we'll look at when the text embedding approach fails, and then at how to overcome it; the overcoming, as the title suggests, will use knowledge graphs.

So, what are the limitations of the text embedding approach? In a typical RAG pipeline (I've probably seen this diagram so many times now) you take a couple of PDFs, chunk them up, and index the chunks using a text embedding model. At query time you use the same embedding model to fetch the top four, five, or ten most relevant documents, pass them to an LLM, hope that the relevant information is in those documents, and the LLM generates the final answer. But there are use cases and domains where this approach doesn't work well. It works great for, say, documentation, but in other domains, like legal, it might not be the best approach for some questions, as we'll see.

Here I have a couple of example questions. You can ask "What are the company policies on remote work?", but you can also ask things like "What's the total value of our contracts with Neo4j?", or whichever company. Different types of questions require different approaches. Asking about remote-work policies is a documentation-type question: you just need to find the relevant text chunks that talk about remote work, pass them to an LLM, and you get your answer. That's all good. But what if you ask "What are the payment terms for a specific contract?" If you take the naive approach and just return the top four chunks from your vector index, you might get three or four text chunks from different, unrelated contracts, and the LLM can get quite confused because it sees, say, three different payment terms from three different contracts. So naive vector similarity search can already fail here. This can be solved with metadata filtering: if your text chunks carry structured metadata and you can filter by contract ID or company name, you narrow down the results and no longer mix chunks from different contracts. That's one scenario where you can use structured data to enhance your results.
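A minimal sketch of that metadata-filtering idea, assuming a Chroma vector store and OpenAI embeddings (the `contract_id` and `company` metadata keys are invented for illustration):

```python
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Each chunk carries structured metadata alongside its text.
chunks = [
    Document(
        page_content="Payment is due within 30 days of invoice.",
        metadata={"contract_id": "C-001", "company": "Acme Inc"},
    ),
    Document(
        page_content="Payment is due within 90 days of invoice.",
        metadata={"contract_id": "C-002", "company": "Globex"},
    ),
]

store = Chroma.from_documents(chunks, OpenAIEmbeddings())

# The metadata filter restricts retrieval to a single contract, so the
# LLM never sees payment terms from unrelated agreements.
docs = store.similarity_search(
    "What are the payment terms?",
    k=4,
    filter={"contract_id": "C-001"},
)
```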
But then, if you ask questions like "What's the total value of our contracts with Acme Inc?" or "How many contracts expire this month?", which are questions you might well ask a legal chatbot, the naive approach completely fails even with metadata filtering, because the vector index still just returns a few text chunks. Maybe, in an ideal world, you'd get four chunks from contracts with Acme that contain the contract values, and you'd rely on the LLM to do some arithmetic and add up the total. But what if you have more than four contracts? What if you have ten contracts with Acme Inc? Then it completely fails. And if you ask how many contracts expired this month, the top chunks from a vector index don't really help answer that question either. What we often see is that you get one or two contracts that expire this month, and the LLM answers "these two contracts expire this month", but that might be only a partial answer: maybe two hundred contracts expire this month, and the LLM doesn't know, because it only sees the top-k text chunks. So you can get inaccurate, partial results.

As mentioned, certain questions that require filtering, sorting, counting, and aggregating cannot be answered with text embeddings alone, because you need structured data to perform those operations. And there's one issue I didn't mention yet: when information is spread across multiple documents, the vector approach can also fail. Say your top-k is three but the information lives in five documents. Even if your vector similarity search is perfect and retrieves exactly the right material, it's still really hard to define the exact top-k, because it's dynamic: sometimes you need five documents, sometimes fifteen. That's mostly an unsolved problem at the moment, and structured data can also help with overcoming the top-k issue.
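To make the aggregation point concrete, here is a rough sketch of how those questions become single queries once contracts live as structured records in a graph. The schema is hypothetical, `(:Contract {value, end_date})-[:HAS_PARTY]->(:Organization {name})`, and the connection details are placeholders:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# "What's the total value of our contracts with Acme Inc?"
total_query = """
MATCH (c:Contract)-[:HAS_PARTY]->(:Organization {name: $name})
RETURN sum(c.value) AS total_value
"""

# "How many contracts expire this month?"
expiring_query = """
MATCH (c:Contract)
WHERE c.end_date >= date.truncate('month', date())
  AND c.end_date <  date.truncate('month', date()) + duration('P1M')
RETURN count(c) AS expiring_this_month
"""

with driver.session() as session:
    total = session.run(total_query, name="Acme Inc").single()["total_value"]
    expiring = session.run(expiring_query).single()["expiring_this_month"]
    print(total, expiring)
```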
This is my first meme, I guess. When people think or talk about RAG pipelines, mostly they just think about PDFs, documentation, and plain text. But there's this whole beautiful world of structured data that gets overlooked a lot, and it can help you answer these types of questions: the ones that need structured operations, or more complex multi-hop questions. This is where knowledge graphs come into play. You've probably heard a lot about knowledge graphs today, so we won't go too deep into them, but knowledge graphs are great at representing structured data. This is probably the most-used image to show what knowledge graphs are about: entities and the relationships between them. I think that picture is now slightly outdated, though, because in the world of LLMs and RAG pipelines, your data sources contain both structured and unstructured information, and the nice thing about knowledge graphs is that you can have both types of information in a single database.

For example, here we have a contract linked to two companies, and those companies are located in certain cities. But if you chunk your contracts by clauses, you can also link each clause to the contract: you embed the clauses separately and connect them to the structured part of the graph. So you can think of the clauses, your text chunks, as having quite complex metadata, because the structured part of the graph effectively acts as metadata attached to the unstructured part. That's how I see it now: in the world of LLMs, the nice thing is that you can keep both types of information together, and that's the power of knowledge graphs.

So now let's finally talk about building the actual knowledge graphs with LLMs, because that's the fun part everybody is here for. In practice, maybe 90% of the time the input is a PDF or a document, so we need some way to build a knowledge graph, to structure the information from that PDF, and this is what we'll be talking about. As I showed before, you can have multiple documents and then build a knowledge graph based on the information in those documents, and knowledge graphs are very good at combining information from multiple documents. Say we have four documents representing this information, and we ask: for previous OpenAI employees, which companies did they found? That turns into a very simple Cypher query, instead of us having to look through multiple documents, where maybe one document says that Dario and Daniela worked at OpenAI and another says that they founded Anthropic. These are the multi-hop types of questions where you would otherwise have to read several documents to find the whole truth, and combining that information in a nice, compact way is a very good use case for knowledge graphs.
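As a sketch of what that multi-hop question could look like once the facts are merged into one graph, assuming a hypothetical schema with `WORKED_AT` and `FOUNDED` relationships between `Person` and `Organization` nodes:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Facts extracted from separate documents answer a multi-hop question
# in a single query.
cypher = """
MATCH (p:Person)-[:WORKED_AT]->(:Organization {name: 'OpenAI'}),
      (p)-[:FOUNDED]->(company:Organization)
RETURN p.name AS founder, company.name AS company
"""

with driver.session() as session:
    for record in session.run(cypher):
        print(record["founder"], "founded", record["company"])
```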
Building knowledge graphs from text used to be called information extraction, and it used to be very complicated: you needed multiple models, and each model in the pipeline was very custom and domain-specific, so mostly academics did it, or billion-dollar companies, and nobody in between. Now you basically just ask an LLM: please extract the information in a structured way, where the structured way is mostly JSON, and that's it. Extracting structured information from text has become much more mainstream, and that opens up the possibility of building knowledge graphs from text. It's such a frequent use case with LLMs that you now have JSON modes, OpenAI has a structured output mode, and you can even repurpose tool calling to extract structured output. In short, extracting structured information with LLMs is so common that lots of tools have been built around LLMs to make it easier for developers to get consistent output. As mentioned, the whole magic of information extraction is that you take some text, some magic happens, and you get a nice JSON: nicely structured information.

How I see it, there's a spectrum of approaches to information extraction. It runs from the most generic extraction, where you have no idea what's in the text or what to extract and you just tell the LLM to extract as much information as possible, to the very domain-specific end, where you know exactly what's in the text, for example legal documents, and you can define every property in the JSON: what it should look like and what should be extracted.

So let's look at this spectrum of approaches and how they actually work in practice. The first is the most generic approach: we have no idea what's in the documents. A very typical scenario I see is that your boss gives you 10,000 PDFs and just says "build a chatbot", and you have no idea what's in them. In that case you can use the most generic approach: you just say "extract nodes, which have an ID and labels, and extract relationships between those nodes", and then you hope for the best. This is such a frequent scenario that you'll see a lot of frameworks supporting the generic, schema-free approach. Here I have the code for LangChain, because I like LangChain, because I implemented this part, so I like to show it off. The most generic approach with LangChain looks like this: you decide which LLM you want to use, pass in the documents, and build the knowledge graph.

Here I have one example with a contract between Mortgage Logic and TruLink Incorporated, where I ran the extraction on the same contract twice so we can compare the results. Because we didn't define any schema for what we want to extract, the results vary: some relationships might be the same, some might be different, and the extracted knowledge graph can look quite different from run to run. For example, in one run we get that TruLink hosts a website and Mortgage Logic uses it, while the first run on the left doesn't contain that information at all; or in one run Mortgage Logic provides "credit data and client content" as one item, and in the other "credit data" and "client content" appear separately. Some things stay the same and some change, but what we see is that the schema is very inconsistent, and so is the information being extracted. Generic extraction can be noisy, because the LLM decides what to extract, and there's no schema consistency: if you run this extraction on multiple contracts, there's no guarantee that node labels or relationship types will stay the same.

Then, as I see it, there's a middle-ground approach, where you define the node labels and relationship types you want to extract. This requires some upfront work, because you need to define what you want to extract, but the nice thing is that you can use LLMs to help with defining the schema: you just sample a couple of documents and ask the LLM which node labels and relationship types it would define for that text, and feed that back in. Again you can use LangChain (oops, I had the wrong code on the slide, but the idea is that you define the allowed node labels and the allowed relationship types). Here I again ran the extraction on the same contract twice in a row, and now the extraction is much more consistent between the two runs.
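A rough sketch of both variants using LangChain's `LLMGraphTransformer` (the allowed node labels and relationship types below are made-up examples, not the ones from the slide):

```python
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Most generic approach: no schema, the LLM decides which node labels
# and relationship types to invent.
generic = LLMGraphTransformer(llm=llm)

# Middle-ground approach: constrain extraction to an allowed schema.
constrained = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Organization", "Person", "Service", "Location"],
    allowed_relationships=["PROVIDES", "USES", "HOSTS", "LOCATED_IN"],
)

docs = [Document(page_content="...contract text goes here...")]
graph_documents = constrained.convert_to_graph_documents(docs)
print(graph_documents[0].nodes)
print(graph_documents[0].relationships)
```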
Because we guide the LLM about what information we actually want extracted, we get more consistent outputs. There's still some discrepancy in the depth of the extracted information between different calls, because that's just the nature of LLMs: sometimes they extract more information, sometimes less. But the nice thing is that we also get better retriever options: now that your schema is more defined, you know what information to expect in the knowledge graph, so you also know how to retrieve it in better ways, and in more ways.

And then the last approach, the one I like the most, is the domain-specific approach. Now you define every key in the JSON you're extracting, you give each one a description, you can give it predefined options, and for things like dates you define the format that should be used. This type of information extraction requires the most work, but it obviously gives the best results. Here's one example you could use to extract information from a contract: a contract type with predefined types, the parties (organizations), dates, scope, amounts, and so on. Since this is more low-level extraction, there isn't one specific tool you have to use; you can use frameworks like LangChain, Instructor, LlamaIndex, or OpenAI structured outputs, and you define the information you want to extract using Pydantic objects. With this approach, because you define exactly what you want and you're very descriptive and constrained about the information to be extracted, you get the best results. In this example, some information ends up as nodes, but some is stored as properties as well: for instance, this agreement has an effective date, a summarized contract scope, and so on. So in the domain-specific approach, the LLM is guided on exactly the type of information it should extract, and the schema is fixed. Unlike before, we're not telling the LLM "extract nodes and relationships"; we're telling it "extract the contract type, extract the effective date". The LLM doesn't do the graph modeling anymore; graph modeling becomes a separate step, so the graph schema is fixed and separated from the LLM. Because we can guide and instruct the LLM on exactly what we want extracted, we get the best results. And since the schema is fixed, you can build all sorts of retrieval tools: you know exactly what's in the graph and what format the properties have, so you get the best options for building a RAG application that can do all kinds of aggregations, filtering, and multi-hop questions. (A rough sketch of this style of extraction follows below.)
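A minimal sketch of the domain-specific style using Pydantic models with LangChain's `with_structured_output`; the fields and descriptions here are invented for illustration rather than the exact schema from the talk:

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class Party(BaseModel):
    """An organization that is a party to the contract."""
    name: str = Field(description="Legal name of the organization")
    location: str | None = Field(None, description="City where the party is based")

class Contract(BaseModel):
    """Structured representation of a single contract."""
    contract_type: str = Field(
        description="One of: service agreement, NDA, employment, other"
    )
    parties: list[Party] = Field(description="All parties to the contract")
    effective_date: str = Field(description="Effective date in YYYY-MM-DD format")
    total_value: float | None = Field(None, description="Total value, if stated")
    scope: str = Field(description="One-sentence summary of the contract scope")

llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(Contract)

contract_text = "..."  # the raw contract text goes here
contract = structured_llm.invoke(
    "Extract the contract details from the following text:\n\n" + contract_text
)
```

Because the schema is fixed in code rather than chosen by the LLM, loading the result into the graph becomes a deterministic, separate step.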
So those are the three types of approaches that I've seen in practice, and now some observations from using them in actual practice. Obviously, any time you build a knowledge graph from text, you'll have to do some cleanup afterwards, and the most frequent step is entity resolution: you can end up with multiple nodes representing a single real-world entity. In this example we have the company UTI Asset Management, and there are three nodes that all represent the same entity. So you want a post-processing step that identifies these duplicates and merges them; that way your graph will have much better structural integrity.

Then here are some observations from papers. This one is from Microsoft, who are building knowledge graphs. When you instruct the LLM to extract, say, all people and organizations from a text, will the LLM actually extract all the people and organizations? The answer is: mostly not. When working with LLMs, nothing is really perfect. They also observed that the extraction is quite dependent on the chunk size you use: the smaller the chunks, the more information you extract overall. And another technique they introduced is doing multiple passes over a single text chunk to extract more information. If a single pass won't extract everything, do two or three passes and try to get as much information as possible (a simplified sketch of this idea appears below).

This one is also quite interesting, as we've seen before: how consistent is the extraction? Even with a fully defined schema, the domain-specific approach, the consistency really depends on the types of documents. The less ambiguous the documents are, the better the consistency. In this example, CVs are the least ambiguous, so they have the most consistency. Scientific articles and websites are fairly similar, but scientific articles are still more consistent, because when you're doing science you shouldn't be as ambiguous as you can afford to be on a website.

One thing that's also quite interesting: this is an evolving space, and nobody really knows the best approach to extracting knowledge graphs or structured information. You have so many options: JSON mode, tools and function calling, or regular prompts with few-shot examples. And this is already model-specific, so it's really hard to say what works best, because it really depends on the model. And not just the model: even GPT-4o from January behaves differently than GPT-4o from April. That's also something to think about when you're building knowledge graphs.

And that's it from me, just on time. This was a theoretical talk, but if you want some code you can try out, go look at my blog posts and code examples and test it on your own.
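As a rough illustration of the multiple-pass idea, here is a simplified sketch that simply unions several independent extraction passes over the same chunk (the paper's version additionally re-prompts the model to glean what it missed); it reuses the LangChain transformer shown earlier:

```python
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

transformer = LLMGraphTransformer(llm=ChatOpenAI(model="gpt-4o", temperature=0))

def multi_pass_extract(text: str, passes: int = 3):
    """Union the nodes and relationships found across several passes.

    Each pass may surface facts the previous ones missed, at the cost
    of extra LLM calls; duplicates collapse on (id, type) keys.
    """
    nodes, rels = {}, set()
    for _ in range(passes):
        for doc in transformer.convert_to_graph_documents(
            [Document(page_content=text)]
        ):
            for n in doc.nodes:
                nodes[(n.id, n.type)] = n
            for r in doc.relationships:
                rels.add((r.source.id, r.type, r.target.id))
    return list(nodes.values()), rels
```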
Now, I don't know if we have time for questions? We can probably squeeze one in. Thank you, Tomaz. The one with the most votes comes from Rya, who asks: how does the knowledge graph update internally when I add a new document to the graph?

Yeah, that's basically up to you, but adding additional information is not a big problem, because it's just adding more nodes and relationships. The only thing you have to be careful about is entity resolution: if an entity mentioned in the new text is already in the graph, you want to use the same node, because, as mentioned, sometimes the text won't use exactly the same name, so the only real problem is dealing with that. This could also be done as post-processing: you update your knowledge graph and then search for duplicates. Other than that, it should be very straightforward.

Cool. Then there's one more, from David; maybe we can squeeze it in as well. We'll answer this one and then close, but if you need to jump to the next session, you can do so. What is your recommendation for performing graph disambiguation, that is, merging duplicated nodes, or different nodes that are actually the same thing?

What you see in practice really depends on the types of nodes you want to merge: it's very different when you're merging people than when you're merging organizations or locations. But for the most part, what we've seen is that you use text embeddings to find potential candidates that could be merged, and then you use some additional logic, something like a string-similarity measure such as edit distance, or an LLM as a judge, to decide whether two entities should be merged. So the generic approach is a combination of text embeddings plus additional logic that makes the final decision on whether two entities should be merged or not. (A small sketch of that combination follows below.)

Super cool. With that, I think we're at the end. Thank you very much, Tomaz. Thank you all for watching, keep going with the NODES presentations throughout the day, and see you at the next session, and somewhere in the future. Bye. See you later. Bye.
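For anyone who wants to try the recipe from that last answer, here is a minimal sketch of the embeddings-plus-LLM-judge combination (the entity names, the similarity threshold, and the judge prompt are all arbitrary illustrations):

```python
import numpy as np
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embedder = OpenAIEmbeddings()
judge = ChatOpenAI(model="gpt-4o", temperature=0)

names = ["UTI Asset Management", "UTI Asset Management Co. Ltd.", "Acme Inc"]
vectors = np.array(embedder.embed_documents(names))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit vectors for cosine

# Step 1: embedding similarity proposes merge candidates ...
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if vectors[i] @ vectors[j] > 0.9:  # threshold is a guess; tune per domain
            # Step 2: ... and an LLM acts as the judge for the final decision.
            verdict = judge.invoke(
                f"Do '{names[i]}' and '{names[j]}' refer to the same "
                "real-world entity? Answer yes or no."
            )
            if verdict.content.strip().lower().startswith("yes"):
                print("merge:", names[i], "<->", names[j])
```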