One of the first things you want to do when you're building AI agents is to give them access to your own data. These could be things like documents, PDFs, or websites, anything that gives your AI agents specific knowledge about your company or the problem you're trying to solve. Now, there are a lot of tools online that can help you do this, but most of them come at a cost, mainly that they are closed source, meaning you have to get an API key and send your data to their platform, where they do the parsing and you get the data back. But what if I told you that there are also open source alternatives available that work just as well? So in this video, I want to show you how you can build a fully open source document extraction pipeline in Python using a library called Docling.
So in this video, I will walk you through this GitHub repository, link in the description. We'll dive into some code examples to show you how you can parse PDF documents like this one, as well as websites, to eventually make them available in a chat application where we can browse through our vector database, search for relevant context, and then answer questions about it, retrieving sources as citations.
Now, throughout this video, we will cover some fundamental techniques like extraction, parsing, chunking, embedding, and retrieval to show you how you can create a knowledge system for your AI agents end-to-end. I'm going to show you some specific examples with a specific set of tools, vector databases, and AI models. But all of these concepts can be applied to all kinds of situations.
So it doesn't matter which vector database you're using, which embedding model you're using, or which AI model you're using. It will really form the foundation of building a knowledge system for your application. All right, now let's get things started. Within the readme file in the repository, which I have here in front of me, there is more information if you wanna dive a little bit deeper, but in order to follow along, you need to create an environment and then install the dependencies from the requirements.txt.
So those are available here in the project, and you need an OpenAI API key. Now, the document extraction part is going to be fully open source, but I'm still going to use OpenAI to create the embeddings and to chat with the data.
Now, this is optional; you can just as well use an open source model for this, just so you know. In order to go through this entire example on your own, which I will do in this video, we are going to execute five files.
So first we're going to extract document content, then we're going to perform chunking. Then we're going to create embeddings and put them into a vector database. Then we're going to test the search functionality.
And then we're going to bring this all together in the chat application that I demoed in the beginning. So let's dive in, starting with the extraction. I have this Python file over here, which I'm going to open and run in an interactive session. And we have a PDF over here from the library that we're using to perform this.
And the library is called Docling. Here is the technical report. This is a project from IBM, a really great company building great tools, and they are making this fully open source.
And based on my experiments and what I've heard from other AI engineers in the scene working with great companies, this is by far the open source document extraction library of choice right now. It's already amazing as is, and they still have a roadmap of items coming soon to make this library even better. So it's really straightforward to get started with it, and this simple file over here is going to show you exactly how to do that.
So we are going to start with the DocumentConverter that we import from the library. We can do this after a simple pip install docling, which is in the requirements. So when we have that up and running, we can convert a PDF. We can run this, and it will take some time, especially if you do it for the first time, because it will download some local extraction models to perform this.
But right now, in the background, it's going through the PDF, analyzing all the blocks and components and performing OCR, to give us back a data model that we will use in the next steps. Here you can see that it is now finished, and we can call the document attribute and have a look at what's inside. So we now have a DoclingDocument object.
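To give you a rough idea, that basic conversion looks something like this; this is a minimal sketch, and the arXiv link is simply the Docling technical report used here as an example input:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert a PDF (local path or URL); the first run downloads the local parsing models
result = converter.convert("https://arxiv.org/pdf/2408.09869")  # Docling technical report

# The unified DoclingDocument data model lives on the result
document = result.document
```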
So we took the whole PDF, we ran it through the system, and now we have this object. And why this is so awesome also comes back to the image that they have over here. What's so nice about this library is that you can throw all kinds of data files at it, whether that's PDF, PowerPoint, DOCX, or websites, and it will turn them into this specific data model, this specific data object that they have created called the DoclingDocument, which you can see over here. And why that is so nice is that we can now unify all kinds of data and create a pipeline or system where it doesn't really matter if you throw a PDF or a webpage at it; we can work with it either way.
So let's see what that looks like. Once we have that document, we can do various things with it.
So for example, we can export to Markdown or we can export to JSON, and we can have a look at what both of those look like. If you scroll all the way over here, you can see what's inside. But maybe to give a better visual view, we can print the Markdown output. And here you can browse through the document, where you see, hey, this is the Docling technical report.
So you can see the Docling technical report and the version, and you can see it did a very good job of extracting all of this information. But okay, so far there are plenty of other libraries that can do this as well. Where this library really excels is, let me scroll down, table extraction. A lot of open source Python libraries that parse PDFs struggle with this. For example, if I come down over here to this table, you can see we have a perfectly formatted Markdown table. Everything just looks super clean, no weird characters, and all the headings are in correct Markdown format.
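For reference, the export calls are a one-liner each; a small sketch, continuing from the result above:

```python
# Export the DoclingDocument to different representations
markdown_output = document.export_to_markdown()
json_output = document.export_to_dict()  # plain Python dict, easy to serialize as JSON

print(markdown_output)  # headings, paragraphs, and tables come out as clean Markdown
```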
So overall, out of the box, really great result. So that was a very simple example of parsing a PDF. Now let me clear this up and continue with HTML extraction.
Because what we can also do is, within that same converter, call the convert method, but instead of throwing a PDF at it, we just throw a website at it. So if I go over here to the Docling docs, you can see this page over here, and we run this to see what we get. This is really fast because it essentially just looks at the HTML and parses it.
So we can get the document and then the Markdown, and let's see what that looks like. Now we have an exact replica of this webpage, with all of the HTML correctly parsed into a Markdown representation that we can use. So that was just one page. What if we want an entire website, for example, and get all the pages on there?
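The code is the same call, just with a URL; a minimal sketch, where I'm assuming the Docling documentation site as the example page:

```python
# The same converter handles web pages: pass a URL instead of a file path
result = converter.convert("https://docling-project.github.io/docling/")
print(result.document.export_to_markdown())
```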
Well, to do that, we can use a trick by leveraging what is called the sitemap.xml, which most websites have. For any given website, we can test this by going to the browser and putting sitemap.xml behind the URL. This is an XML file that contains all of the pages and URLs of that website. And I created a simple helper function called get_sitemap_urls that tries to fetch that sitemap.xml and then returns all of the URLs that were found there.
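Under the hood, such a helper might look roughly like this; this is a sketch using requests and the standard library XML parser, so the actual helper in the repository may differ in its details:

```python
import requests
from xml.etree import ElementTree


def get_sitemap_urls(base_url: str) -> list[str]:
    """Fetch <base_url>/sitemap.xml and return every URL listed in it."""
    response = requests.get(f"{base_url.rstrip('/')}/sitemap.xml", timeout=10)
    response.raise_for_status()

    root = ElementTree.fromstring(response.content)
    # Sitemap entries live in the sitemaps.org namespace
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", namespace)]
```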
So if we then come over here and plug in that same URL, instead of getting the information from just a single page, we can first get the sitemap URLs. And here you can see all of the URLs that were found in the sitemap. Now, as you can imagine, we can loop over these and extract all of the pages one by one. And the Docling library also has a nice method for that: instead of calling convert, we can call convert_all, which creates an iterator that we can loop over to extract all of that information.
So I can simply create that over here. I create a new variable with an empty list called docs, and for all of the results in the iterator that we just created, I'm going to get the result.document and append it to the list. So let's run that.
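Put together, the whole-website extraction might look roughly like this; a sketch, reusing the get_sitemap_urls helper and the converter from above:

```python
# Collect every page URL from the sitemap, then convert them all in one go
sitemap_urls = get_sitemap_urls("https://docling-project.github.io/docling/")
conv_results_iter = converter.convert_all(sitemap_urls)

docs = []
for conv_result in conv_results_iter:
    if conv_result.document:  # keep only pages that converted successfully
        docs.append(conv_result.document)
```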
And now we can see the DoclingDocument objects for all of these specific pages. So this is already quite cool, right? We can throw PDFs at it, a single webpage, or an entire website, and get the exact same kind of data object back.
And those were just the basic examples. The Docling library also allows you to specify custom extraction parameters for situations where your data is a little bit more tricky. But this really forms the foundation and first step of your knowledge extraction system.
Being able to throw different kinds of documents at it, websites, PDFs, DOCX, whatever, and get them into a structured format that we can use in the next step within our system. Hey, and then real quick, if you want to learn how we help developers like you beyond these videos, then you can check out the links in the description. We have a lot of resources available, starting from how you can learn Python for AI in a free community and free course, all the way to our production framework that we use to build and deploy generative AI applications for our clients. And if you're already at the level where you consider freelancing, maybe taking on side projects, we can also help you land that first client. So if you're interested in that, make sure to check out all of the links there.
All right, so now we know how to extract data. The next step is chunking, and the Docling library can also help us with this.
And what chunking really is: instead of taking the entire document and putting it into a single record within our database, we split it up into what we call chunks. We do this in order to create logical splits, components that fit well together, so that when we query our AI system, we don't get the entire document or the entire book back, but just the specific parts that are relevant to our question. Now, that is not as simple as just splitting the text every X words or characters. But luckily, out of the box, Docling can also help us with chunking through two different methods.
And you can also combine them in the hybrid chunker. They first have the hierarchical chunker, which essentially looks at the document and splits it up based on logical components or groups that fit together well. These could be lists or paragraphs.
And it's going to create those groups with children that we can already use. So this is already a great starting point for chunking. And this is performed automatically. But we can take that one step further and also apply the hybrid chunker.
And how that works is essentially explained over here: it can split chunks that are too large for your embedding model. So remember the flow: we get the data and extract it.
Then we create the chunks. And then in the next step, we're going to send that to an embedding model to create the embedding, which we're going to store in the vector database. But all embedding models have a specific max input that we can use.
So for example, if we go to the OpenAI embeddings documentation, let me just pull this up over here, we can see the max input for text-embedding-3-small, text-embedding-3-large, and the other models. So there is a maximum amount of input that you can send to the model, and you want to keep your chunks below that specific number for the embedding model that you are using. Now, what Docling does is it first splits chunks that are too large for the embedding model. So if a chunk exceeds that input limit, it's going to create a split.
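To make that concrete, here is a small sketch of how you could check a chunk's token count against that limit yourself using tiktoken; the 8191-token cap is OpenAI's documented max input for its embedding models:

```python
import tiktoken

MAX_TOKENS = 8191  # max input size of OpenAI's text-embedding-3-large

# cl100k_base is the encoding used by OpenAI's embedding models
encoding = tiktoken.get_encoding("cl100k_base")

chunk_text = "Some chunk of text we want to embed..."
token_count = len(encoding.encode(chunk_text))
print(token_count <= MAX_TOKENS)  # chunks must stay below the model's limit
```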
Then it's also going to stitch together chunks that are too small. So you might have a chunk which is just one header or a short paragraph, and it can merge those together. And it works with your specific tokenizer.
So the chunks will perfectly fit the model that you are working with. Out of the box, that is just a great way to work with this. Within the code example here, we're going to use OpenAI, and we're going to use the text-embedding-3-large model, where we set the max tokens to the number that we see over here in the documentation.
And in order to do this, I created a simple OpenAI tokenizer wrapper, because if you look into the Docling documentation, they use an open source tokenizer that is available via Hugging Face. So I created this simple wrapper over here that follows the exact API specification you need to make this work.
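Just to give an idea of the shape of such a wrapper, here is a rough sketch built on tiktoken that mimics the Hugging Face tokenizer interface the chunker expects; the actual wrapper in the repository may be implemented differently, so treat this as an assumption-laden illustration rather than the real code:

```python
import tiktoken
from transformers import PreTrainedTokenizerBase


class OpenAITokenizerWrapper(PreTrainedTokenizerBase):
    """Minimal tiktoken-backed tokenizer exposing a Hugging Face-style interface (sketch)."""

    def __init__(self, encoding_name: str = "cl100k_base", max_length: int = 8191, **kwargs):
        super().__init__(model_max_length=max_length, **kwargs)
        self._encoding = tiktoken.get_encoding(encoding_name)

    def tokenize(self, text: str, **kwargs) -> list[str]:
        # The hybrid chunker mainly needs token counts, so string token ids are enough
        return [str(token) for token in self._encoding.encode(text)]

    @property
    def vocab_size(self) -> int:
        return self._encoding.max_token_value
```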
So let's fire this up and see what this looks like. So I'm going to get the same PDF again. And for the sake of simplicity, I'm just going to run this one more time in this file.
So this will take a couple of seconds over here. But this is the same action that we did in the previous step, getting the PDF. All right, so we have our result again.
So this is our parsed PDF. Now we are going to apply the hybrid chunker. What we can do is import the HybridChunker from the Docling library, and we can specify the tokenizer.
We can set it to the OpenAI tokenizer wrapper that I created. Then for max_tokens, we can use the model's max input tokens. And we can set merge_peers to true, which is the default option, so it's also allowed to stitch smaller chunks together.
That part is optional. What we can then do is put this into memory and run it. This is all syntax straight from the Docling documentation; I just followed the example and created the chunks.
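Roughly, the chunking step looks like this; a sketch, assuming the tokenizer wrapper and MAX_TOKENS value from the previous step:

```python
from docling.chunking import HybridChunker

tokenizer = OpenAITokenizerWrapper()  # the wrapper described above
MAX_TOKENS = 8191                     # max input of text-embedding-3-large

chunker = HybridChunker(
    tokenizer=tokenizer,
    max_tokens=MAX_TOKENS,
    merge_peers=True,  # stitch small neighbouring chunks back together (default)
)

# result.document is the parsed PDF from the extraction step
chunk_iter = chunker.chunk(dl_doc=result.document)
chunks = list(chunk_iter)
```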
So if I look at this list, I can now see that I took the entire document, ran it through the hybrid chunker, and I now have a list of 36 chunks. So the entire PDF is now condensed into 36 chunks, where we know for sure that all of these text blobs will fit into the context of the embedding model that we are using. So that is amazing, right?
That already covers a lot of steps that normally take a lot of work to do well. All right, so we now have the chunks that we can send to an embedding model to get the vectors, which we can store in a vector database. Now, in the next example, I'm going to use LanceDB.
And the specific vector database implementation that you're using doesn't really matter. Typically, within our projects, I use PostgreSQL and the pgvector extension, but LanceDB is really easy to work with, because the database is stored in persistent storage, just like a SQLite database.
So the file will literally just show up in your file system. And that's just easier to work with. They also have a really nice API.
But just so you know, I'm not going to dive really deep into how to work with LanceDB. If you want to know more about that, you can look into the documentation on their website. All right, so now moving to the 3-embedding.py file.
The beginning is just the same as what we already did, adding steps on top of it to now work with the vector database. So let me run all of this code one more time so we get all of the chunks back. Then I'm creating the database over here, which is going to live in the data folder.
Now, next, I'm going to specify a function, and this is really specific to LanceDB, so if you want to learn more about that, you can reference the docs. But what's nice about their API is that we can specify an embedding model as a function. In this case, we're using OpenAI and we want text-embedding-3-large.
Then in the next step, we can use a Pydantic model that inherits from LanceModel to specify what our table should look like. So we use Pydantic to define the structure of our vector database.
And we can use the function that we created to specify what the source field is, so the text that we need to send to the embedding model, and what the vector column is that the embeddings go into. By doing it like this, within this clean API, we don't have to bother with manually sending and retrieving embeddings; everything is managed from within the table. So that's just a nice thing about LanceDB. But let's look at the data model that we're using over here. So this is the main schema: from all of the documents that I extracted, I want a text field.
I want the vector field that we're going to use to perform the search. And I want another field called metadata where we plug in the following. So I want to put in the file name over there, the page numbers that the chunks were on, and also the title.
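Sketched out, that setup might look roughly like this; the structure mirrors what's described here (text, vector, and a metadata sub-model), though the exact field names in the repository may differ:

```python
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

# Local, file-based database, similar in spirit to SQLite
db = lancedb.connect("data/lancedb")

# Embedding function managed by LanceDB: it calls OpenAI for us on insert and search
func = get_registry().get("openai").create(name="text-embedding-3-large")


class ChunkMetadata(LanceModel):
    # Note: fields of a nested model are kept in alphabetical order here
    # (a gotcha mentioned a bit later in the video)
    filename: str | None
    page_numbers: list[int] | None
    title: str | None


class Chunks(LanceModel):
    text: str = func.SourceField()                      # what gets sent to the embedding model
    vector: Vector(func.ndims()) = func.VectorField()   # filled in automatically on add/search
    metadata: ChunkMetadata
```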
So next to the text, this should also give us some important metadata from the documents that we can then use. To look at this, we can dive a little bit deeper into the chunks to see what that looks like.
So here in the last step, we had the 36 chunks, remember? So let's take the first object in here and let's do a model dump on there to see the exact content. So here we can see we have the text and then there is a meta key in there as well.
And there's a lot of information in there, but we are just going to extract the file name, the page numbers, and the headers that are potentially available. So that is the chunk model that we are getting back, and we are simply defining Pydantic models around it in order to work with it. So let me put that in memory, and here you can see how we can create a table within our LanceDB database.
So we take the db that we initiated over here, we call create_table, and we name it docling. We specify the schema, which is this chunk schema over here, and we set the mode to overwrite, meaning that if the table is already there, we just overwrite it. Oh, I forgot to put the db here in memory; let me make sure we run everything.
And now we should be good to go. All right, perfect. So now we have a fresh new table in here within our project.
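For reference, that call is a one-liner, continuing the sketch from above:

```python
# mode="overwrite" drops and recreates the table if it already exists
table = db.create_table("docling", schema=Chunks, mode="overwrite")
```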
Now, next we can process the chunks. Here's a simple piece of code that loops over all of the chunks and gets them ready for our table. Let me just run that and show you what it looks like, because that is probably easier. So here you can see that for the 36 chunks that we had, it loops over all of them.
It's going to extract the text, and then it's going to set the metadata to the file name, page numbers, and title. And it's going to skip everything else. And here on the right, you can see the result.
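That loop might look roughly like this; it assumes Docling's chunk metadata layout (origin.filename, doc_items with their prov page numbers, and headings), so treat it as a sketch rather than the exact code from the file:

```python
# Reshape each Docling chunk into the structure of our Chunks table
processed_chunks = [
    {
        "text": chunk.text,
        "metadata": {
            "filename": chunk.meta.origin.filename,
            "page_numbers": sorted(
                {
                    prov.page_no
                    for item in chunk.meta.doc_items
                    for prov in item.prov
                }
            )
            or None,
            "title": chunk.meta.headings[0] if chunk.meta.headings else None,
        },
    }
    for chunk in chunks
]
```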
So this exactly matches our chunk model, and we can now send the data to our table. One quick note if you are using this approach with Pydantic: if you have a sub-model in your schema, you must order its fields alphabetically, otherwise you will get some weird errors. This is probably still a bug in the code over here, but I ran into it and it took me probably an hour or so to figure out what was wrong.
So with the chunks out of the way, we can now take our table, remember, this is the LanceDB table that we created, and we can add the chunks. So let's send them to our database. And the cool thing here is that, in the background, this add function performed the embeddings as well. That's really cool about the LanceDB API; it can save you a lot of work.
All of this is pretty straightforward to implement yourself, but because that embedding function is stored at the table level, we can just send these chunks over and not worry about the embeddings. So what we can now do is have a look at the table and see our text.
We can see our vector and we can see the metadata. The to_pandas method over here just returns the first 10 results, but we can also count the rows to check that we have exactly 36 records in total.
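In code, adding and inspecting the data is roughly this; a sketch, where the embedding calls happen inside add because of the embedding function registered on the schema:

```python
# Embeddings are computed automatically because the schema carries the OpenAI
# embedding function; we only pass text and metadata
table.add(processed_chunks)

print(table.to_pandas())    # text, vector, and metadata columns
print(table.count_rows())   # should match the number of chunks, 36 here
```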
All right. So now we have the parsed data with the embeddings in a vector database. And again, I showed you how to do this with LanceDB, but you can just as well follow the same principles with any vector database out there.
If you want to use PostgreSQL, for example, you can watch the other videos on my channel for that and just swap out the logic in order to create the embeddings and put it into the table. Our data is now ready for our AI system or agent to use. So let's first look at a very simple example to show you what that looks like and how we can use that information.
And then in the last step, we're going to bring this together in the application. So within this fourth file over here called search, I'm going to fire up the interactive session again and connect to the vector database by simply specifying the local path to it. I'm going to load the docling table that we created. And now, through the LanceDB API, I can use the simple search method over here with a query, which is essentially the user question.
So with what query, what question, do we want to search the vector database? I'm going to set the query type to vector, which is going to perform a similarity search using the embeddings. LanceDB also supports keyword search and a hybrid method.
Then I'm going to set the limit to five, which limits the number of results that I get back from the table. So I'm going to run this and then check out the results, which I can again convert to a pandas DataFrame. Here you can see the text, the vector, the metadata, and the distance.
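That search step, end to end, looks roughly like this; a sketch, where the query string is just an example:

```python
import lancedb

db = lancedb.connect("data/lancedb")
table = db.open_table("docling")

# Vector similarity search: the query text is embedded with the same OpenAI
# model registered on the table, then compared against the stored vectors
results = (
    table.search("what's docling", query_type="vector")
    .limit(5)
    .to_pandas()
)
print(results[["text", "metadata", "_distance"]])
```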
And we can have a look at the specific text chunks that are retrieved when we search for "PDF". Now, again, I can set the limit to three and it will return three. And I can also, for example, put in "what's docling" and run that, and you get different results back based on that query. All right.
So we now covered the parsing, the chunking, the embedding, and the retrieval, all the key components that we need in order to create a knowledge system for our AI system or agents. Now let's see what that looks like if we put it all together in an application that we can interact with. All right, so what I have in front of me over here, in the fifth file called chat, is a simple setup to create a Streamlit application. You can use this by pip installing the Streamlit library.
And in this video, I'm not going to dive into exactly how to set this up; you can check out the docs. We're going to use the chat elements, which make it really easy to create a simple, interactive chat application that we can spin up locally in pure Python. So this is great for demos and examples. What we're doing in this file, at a high level, without getting into all the details, is we create a connection with the database.
Then we have a function to search the database and get the relevant context, and then we have some specific Streamlit components to work with the chat messages and stream the answer to the user. You can walk through this code on your own, but in order to boot this up right now, make sure you are in the docling folder.
So that's the folder where the chat.py file is in. Then make sure you have the environment activated where Streamlit and all of the other requirements are installed. And then you can run the command streamlit run followed by the name of the file, so 5-chat.py.
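At its core, such a chat file is built around st.chat_input and st.chat_message. Here is a minimal sketch of that pattern, where get_context and answer_question are hypothetical stand-ins for the search and LLM-call logic in the real file:

```python
import streamlit as st


def get_context(query: str) -> str:
    # Hypothetical stand-in: in the real file this searches the LanceDB table
    return "...retrieved chunks..."


def answer_question(query: str, context: str) -> str:
    # Hypothetical stand-in: in the real file this calls the OpenAI chat API
    return f"Answer to '{query}' based on the retrieved context."


st.title("Document Q&A")

# Keep the conversation in session state so it survives Streamlit reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("Ask a question about the documents"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    context = get_context(prompt)
    answer = answer_question(prompt, context)

    with st.chat_message("assistant"):
        st.markdown(answer)
    st.session_state.messages.append({"role": "assistant", "content": answer})
```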
And when I run this, it's going to boot up the document Q&A. In here, I can ask questions like "what's docling", similar to what I showed you in the demo at the beginning. So it's first going to search all of the documents.
It's going to show you what we retrieved, and it's going to provide us with a section name, the name of the file, and also the page. So this is already really cool, right?
We now essentially have an agent system, an AI system that we can ask questions and that can pull up the right information.
And even though right now this is just one document, just one PDF, you can expand this by adding more and more data. So it's very dynamic and easy to work with. What we can do, for example, is come in here and make a live change; let's set the number of results to five.
We can just do that, save it, run this one more time, and ask, what's docling? It now has five chunks to base its answer on. So now you know how to build a knowledge extraction pipeline that can take various types of documents, convert them, extract their content, chunk them, put them into a vector database, and then make them available in your AI application. If you found this video helpful, please leave a like and also consider subscribing.
And if you wanna learn more about how to build effective AI agents, make sure to check out this video next.