Transcript for:
Pre-Processing Unstructured Data for AI

My name is Maria, my last name is unpronounceable, and I'm a developer advocate at Unstructured (unstructured.io). In this talk, I'm going to show you how you can pre-process different types of unstructured data and make them ready for building RAG applications, or really any kind of Gen AI application. Since everybody here is aware of what RAG is, I'm assuming that you have maybe tried some tutorials, built a naive RAG application, maybe used a dataset from Hugging Face or Kaggle.

And then if you got inspired to take it to the next level and use RAG to build your personal assistant, you discovered that your own documents, your own ground truth, are scattered across different types of documents. You have maybe your contracts or booking reservations in PDFs.

You have your emails with important information. You have maybe your personal blog in Markdown, maybe Slack messages, Discord messages. So your personal information that could be useful in building a personal assistant is scattered all over the place in different formats. And believe it or not, most companies these days that are trying to build AI-powered LLM bots have the same problem: about 80% of the data generated over the course of business is unstructured.

So think about maybe an insurance company with tons of contracts and policies in PDFs. If you have a software company, you typically have issues in Jira or GitHub. You have a company wiki in Notion or Confluence, Slack messages, files in Dropbox, files in an S3 bucket, files in blob storage. So many different silos containing different documents are spread out across every organization, no matter what industry it is.

In every organization, there will be an abundance of unstructured data. And this is one of the first challenges of using that data for any kind of application really, but in this case we're going to be talking about RAG specifically: just getting the data ingested from all of those sources is the first challenge. The second challenge of using that type of data is that it's stored in a variety of native formats, it has no clear organization, and every format requires its own approach to parsing it, to extracting the data from those formats.

And then on top of that, you have all kinds of different document layouts with tables, images, different text orientation, different structure. There's really no limit to creativity when people are creating PowerPoint presentations or PDF documents, reports of different kinds. That's yet another challenge of getting text or any information extracted from those documents. So what do you do if you want to use that kind of information as a ground truth in your application?

You have to pre-process these documents somehow to get them into a format that is usable, that you can store in your retrieval system, that you can actually search through using whatever search you prefer. And you could do this yourself; you can set up ETL pipelines yourself. But in doing so, you will discover that you need to build expertise in multiple different APIs for ingesting data from multiple different places.

You will need to build expertise in parsing those native formats and extracting text from those different types of files. At some point rule-based parsers will no longer be enough, because you will discover that PDFs, PowerPoints, and images that can contain scans of documents will require OCR models and document understanding models, and you will have to learn how to properly deploy those models, how to select them, maybe fine-tune them, and how to efficiently run inference with them.

And then of course there will be edge cases with all kinds of noise, all kinds of different layouts, weird structure. One of my favorite cases is when there's a scan of something but it's kind of sideways, and that's fun. On top of that, you have to think in advance about what your data model is going to look like at the end, because you will have to have consistency across different document types.

You need to standardize what your output is going to look like, no matter the document, no matter the source. And of course, on top of that, there are the classical things that you will need to take care of, like scalability, maintenance, and so on and so forth.

And lots and lots of organizations are facing the same issue over and over again. And so at Unstructured, the company where I work, the founders were building those ETL pipelines over and over and over, and then decided to just build a solution for that and start their own company.

And so now we have a set of tools that solve this exact, specific problem: making unstructured data usable. Nothing more than that. Unstructured offers three different tools to solve that same problem. The first one, which was already mentioned, is the open source library that we have.

You can just pip install it, and it supports 25 different document types, can connect to 20-plus different data sources to ingest the data from, and can connect to 20-plus different destinations to load the results into. And, you know, it's open source, so you can play with it as much as you want. You can deploy it in your own

environment with a Docker container; it's there for you to use. The next tool we have is the serverless API, which offers similar functionality. The major difference with the API is that we have our own fine-tuned OCR models and document understanding models.

They're going to perform better on complex, noisy, image-based documents, so if the open source library is not enough for those types of documents, then you will get better performance with the API. The API also has a couple of additional chunking strategies, we scale things automatically, and there's SOC 2 compliance and all those production features. And finally, there's an enterprise platform, a no-code solution with job scheduling and additional multimodal features. This one is only in beta right now and is going to be released later this year.
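To make the open-source route concrete, here is a minimal sketch, assuming the library is installed with the extras for the formats you need (for example, pip install "unstructured[pdf]") and using a placeholder file name:

```python
# Minimal sketch: partition a local file with the open-source library.
# The file type is detected automatically from the supported formats.
from unstructured.partition.auto import partition

elements = partition(filename="example.pdf")

# Each element carries a category (Title, NarrativeText, Table, ...) and its text.
for el in elements[:5]:
    print(el.category, "-", el.text[:60])
```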

So we're not going to be talking about the enterprise platform. Now, let's see how Unstructured works with unstructured data. This is going to be the same for both the open source library and the API. What we have on the left side are a bunch of different options, not all of them, but just some of the most

common options for the sources. We have source connectors in the library and on the API that you can use to connect to those sources and ingest the data from those. We support 25 different file types that we can pre-process.

Once we have those files, we have several different pipelines that use different partitioning strategies to process the documents. I'm going to be talking a little bit more about those partitioning strategies on the next slide. As a result, you get the text extracted from those documents in a JSON format, and it contains the text and additional metadata.

At this point, if you don't want to do RAG and you just want to pull the text out of those documents, you can already load the JSON into any of the destination connectors. Or, if RAG is your ultimate goal, then we also provide chunking options and embedding options through the providers that we integrate with, such as OpenAI, Hugging Face, and AWS Bedrock. And then you can load those embeddings into your destination of choice, Elasticsearch for example.

So let's take a closer look at the different partitioning strategies, and what they mean. When we partition documents, the first choice that you have is the fast strategy. So if your documents are text-based, such as Markdown, HTML, Word documents, or emails, we're going to be using rule-based parsers to extract the text very accurately and very fast.

But the thing is that this strategy can also be used with some kinds of PDFs. If your PDFs contain only text, no tables, no images, in some cases we are able to extract that text with rule-based parsers, and in this case you're going to get super fast performance.

How do you decide whether a PDF has... I mean, most of the PDFs I have are a mix of images and text. For those, we're going to go with the high-resolution strategy.

So you can decide the partitioning strategy yourself, but you can also let us decide which one to apply. The next one is the high-resolution strategy, and this is what we would apply to PDFs and image-based documents to extract text out of them. In this case we're going to apply a combination of OCR, document understanding models, and some traditional machine learning models, and we're going to get much more accurate results from those documents, but it comes with a trade-off: it's going to be slower than the fast strategy. And we offer the auto strategy, in which case we automatically select which partitioning strategy to apply depending on the document type and complexity. And now I wanted to show you this: what do you get when you actually partition a document? On the left side we have a page from an arXiv paper, a PDF, and as a result of partitioning this document you get a JSON with document elements.
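A rough sketch of that step with the open-source library (the file name and strategy value are just examples; to_dict() is the library's element-to-dict helper):

```python
# Sketch: partition a PDF with a chosen strategy and inspect the element JSON.
import json

from unstructured.partition.pdf import partition_pdf

# strategy can be "fast", "hi_res", or "auto", as described above.
elements = partition_pdf(filename="arxiv_paper.pdf", strategy="auto")

# Each element serializes to a dict with its type, element_id, text, and metadata.
print(json.dumps([el.to_dict() for el in elements[:2]], indent=2))
```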

So we don't give you a block of text, the whole thing in one chunk; instead, we split the document into logical units. Here I'm only showing the first two elements, because it's a very, very long JSON and you wouldn't be able to see anything. The first one has a type: it's a title, so we classify elements into the different structural elements of the document.

It has an ID, and it has the text that we have pulled out of it, "Who Validates the Validators", and it has some metadata. And then the next one is classified as narrative text, and it's this small piece of text about the first researcher. Now, we preserve structure in this way, and we classify the document elements into multiple different categories.

This gives you an understanding of the structure of the document, but also you have some filtering options. If you want to throw away all the footers from your documents and not use them ever, you can do so. If you're only interested in tables from all of your documents, again, you can just choose to use those parts of your documents.
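As a small sketch of that kind of filtering with the open-source library (the category strings like "Footer" and "Table" are the ones the library assigns; file name and strategy are placeholders):

```python
# Sketch: filter partitioned elements by their category.
from unstructured.partition.auto import partition

elements = partition(filename="report.pdf", strategy="hi_res")

# Throw away all footers...
without_footers = [el for el in elements if el.category != "Footer"]

# ...or keep only the tables.
tables = [el for el in elements if el.category == "Table"]
```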

And another way we preserve the document structure and hierarchy is by keeping the parent ID. The green highlight down here (the slide shifted a little) is supposed to point at the parent ID. As you can see, there is a title and then there's a paragraph; they are two different elements, but the paragraph has the parent ID of the title, so this way we know which section goes under which title. So we atomize those documents into elements to preserve the structure and the hierarchy of those documents. And additionally, we extract as much metadata as we can. For example, if the text contains links, we're going to drop those links into the metadata.

The page number, the section, the file name, the languages. If it's an email: the sender, the recipient, all that information is going to go into the metadata. So as much as we can pull out of the documents, we put all that information there.
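A short sketch of reading that metadata off the elements (which fields are populated depends on the document type; email-specific fields only appear for emails):

```python
# Sketch: inspect per-element metadata.
from unstructured.partition.auto import partition

elements = partition(filename="report.pdf")

for el in elements:
    meta = el.metadata
    # Common fields: filename, page_number, languages, parent_id, links, ...
    print(el.category, meta.filename, meta.page_number, meta.parent_id)
```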

And one of the most commonly used pieces of metadata that we see is this one. Let's say we have a table in a PDF document. We extract that element, we classify it as a table, we pull out the raw text into the text field, but also, in the metadata, we put that table as plain HTML to preserve the structure of that table. And if I print it out,

it matches the original table. So oftentimes you don't really want just the text from a table; you want to understand which cell a number was in, what the rows and the columns were. That information is very important for tables specifically.
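A sketch of that table case with the open-source library (text_as_html is the metadata field described here; the infer_table_structure flag turns table-structure extraction on for the hi_res strategy):

```python
# Sketch: pull out the HTML rendering of table elements.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",
    infer_table_structure=True,  # needed to get table structure in metadata
)

for el in elements:
    if el.category == "Table":
        print(el.text)                   # raw text of the table
        print(el.metadata.text_as_html)  # structure-preserving HTML
```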

So I mentioned that to extract text from PDFs and other image-based documents, we use our own fine-tuned OCR and document understanding models on the API side. And I wanted to show a couple of metrics that we track for those models when we fine-tune them, how we evaluate their quality.

The first one is pretty obvious: the text extraction accuracy. Next, we track whether we identified the table structure correctly, whether the rows and the columns are where they're supposed to be. And the last one is the text reading order, because depending on the layout and the different columns you may have, the reading order can differ, and if you mess it up, the results are not necessarily going to make sense. So, to bring it all together: after partitioning any document type, not just PDF, it could be a Markdown file, a PowerPoint, a Word document, a Google Doc, at the end you get this normalized JSON with extracted text that preserves the document structure and the hierarchy of elements, and is loaded with metadata.

So the next thing that you typically want to do with your text, if you're building a RAG application specifically, is chunking. We have published a paper on smart chunking strategies; if you're curious to read it, there's a QR code.

But essentially, if you've read about chunking strategies, the main issue people have is that they typically start with a large piece of text and need to split it into smaller chunks that fit into the context window of the embedding model. And then you have to figure out how to split the text semantically, because you don't want to mix content from different topics within the same chunk; when that chunk is retrieved, it's going to give mixed signals to the LLM, and it's not necessarily going to be the most relevant thing. So ideally, you want to preserve, I guess, the purity of topics in different chunks.

And the thing with Unstructured is that once we partition the file, we have already split it into logical elements of the document. So sections already do not overlap.

And so how do we approach chunking? Two different things. One: if you have a paragraph that doesn't fit into the context window of the model, we're going to split it into smaller chunks; there's nothing extraordinary there. But then we have small elements, like individual list items, which are typically quite small.

What we can do is merge smaller elements, but our strategies specify rules that guarantee what will never get merged. So it's a little bit backwards: instead of figuring out the semantics of the document, which we already have, we just make sure that things that you never want merged will never be merged. And so we have several strategies, like "by title", in which case content from different sections will never be merged into one chunk.

Even if, let's say, there's a tiny piece left and the next section starts, we'll never merge them. That's where the information about the hierarchy comes in super handy. The second strategy is called "by page". Sometimes documents have a structure in which every page represents a semantically independent unit, and you never want to merge content from different pages. We can ensure that never happens.

And then finally, if you have a very long document with no clear structure, but the topic changes throughout the document, what we can do is employ chunking by similarity. In this case, for the small chunks that we are about to merge, we're going to do a quick embedding, compare how similar they are, and you can set a threshold: if they're very, very dissimilar, we're not going to merge them. And is that done using some kind of classification model, or something that's built into the...? Yeah, we just choose a small embedding model, we quickly run that,

do the embeddings, and measure the similarity between the two. And then you can set the threshold at which point to no longer merge them. And you can choose the chunking strategy yourself. The default one is by title, because this is the most commonly used one.
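A minimal sketch of chunking by title with the open-source library (the size parameters here are illustrative, not the library defaults; by-page and by-similarity are the API-side additions mentioned earlier):

```python
# Sketch: chunk already-partitioned elements by title.
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.auto import partition

elements = partition(filename="report.pdf")

chunks = chunk_by_title(
    elements,
    max_characters=1000,             # hard cap; oversized elements get split
    combine_text_under_n_chars=200,  # merge small neighbors within a section
)

for chunk in chunks[:3]:
    print(chunk.text[:80])
```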

And once you have your chunks, you can do the embeddings. I don't have a slide on that because this is pretty straightforward. We just integrate with different providers.

You specify your key, you choose your model, and there you go. And now let's just look at an example before we run out of time. And I'm going to show an example where I have some PDFs in an S3 bucket.

I process them with Unstructured, I choose the auto partitioning strategy, and the resulting JSONs I'm going to chunk, embed, and load into Elasticsearch. If you don't like Python, I'm sorry, but you will need to install some libraries and some dependencies here. You can actually choose which dependencies to install for your use case.

Here I'm limiting the unstructured dependencies to the S3, PDF, Elasticsearch, and Hugging Face extras. This is setup, so this is not the pipeline yet: I am simply creating an index, and I specify the mapping that I want to have for that index.
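The index-creation slide isn't reproduced in this transcript, so here is a rough sketch of that setup step with the official Elasticsearch Python client (the index name and field names are placeholders; take the actual mapping from the Unstructured documentation):

```python
# Sketch: create an Elasticsearch index whose mapping matches the JSON
# documents that the pipeline will upload.
from elasticsearch import Elasticsearch

client = Elasticsearch("https://localhost:9200", api_key="...")

client.indices.create(
    index="unstructured-demo",
    mappings={
        "properties": {
            "element_id": {"type": "keyword"},
            "text": {"type": "text"},
            "embeddings": {
                "type": "dense_vector",
                "dims": 384,  # must match your embedding model's output size
            },
        }
    },
)
```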

We do have the mappings in the documentation, so essentially we want our JSONs to match what the index is going to expect. You can just copy-paste that mapping from the documentation; the only thing that you will have to pay attention to, and may need to change, is the number of dimensions for the embeddings. That depends on the embedding model you're going to use, so you may have to change that number, and that's the only thing in the mapping you may have to change. And now, the pipeline is going to be a bunch of configs. By the way, if you are allergic to Python, you can do the same thing with curl; it's just going to look a little bit different. And I'm going to show the full example later, it just did not fit in a reasonable font. But essentially, you assemble your pipeline with a bunch of configs.

And they don't have to be in this order; I just feel that this is the more logical order of configs. The first one I have here is the processor config. This is where you specify general behavior parameters like the verbosity, the number of processes you want to have, the error handling parameters, things like that.

Then I have three different configs that relate to the S3 bucket. In the S3 indexer config, I specify where my bucket is, so the bucket URL. The downloader config I usually leave as default, but you can use it to change the download path for the temporarily downloaded PDFs or other documents.

Then you have connection config, where you set up your authentication parameters. Once we ingest the documents, the next step is partitioning. So then you have the partitioner config.

This is where you specify your partitioning strategy, if you want to. If you choose the hi-res strategy, you can also specify which model you would like to use, from the ones that we have in the documentation. If you're using the API, this is also where you provide your key and the partition endpoint. Once the partitioning is done, the next step is chunking.

Chunking and embedding are optional; if you're going to do them, then you need these configs, but if you're just doing partitioning, you can skip them. So in the chunker config, you specify the chunking strategy, the maximum size of the chunks, whether to combine small chunks, things like that.

And in the embedder config, you choose your provider, provide your API key if needed, and choose the model. That's about it. And then finally, we're going to upload the results into Elasticsearch.

And so you need a few configs for Elastic. The stager is going to check that the documents match the index schema. The destination connection config is, again, your authentication parameters. And finally, the uploader: just give it the index name.

It's going to upload all the documents in there. And then you run the pipeline, and that's it. So this is the whole code it takes to ingest any kind of unstructured data, of the 25 different types that we support, from an S3 bucket, partition it, chunk it, embed it with a model from Hugging Face, and load it into Elasticsearch.
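The slide with the full code isn't captured in this transcript, so here is a rough reconstruction of a pipeline like the one described, based on the unstructured-ingest documentation. Import paths, config names, and parameters differ between library versions, so treat this as a sketch to check against the current docs rather than as exact code.

```python
# Sketch: S3 -> partition -> chunk -> embed -> Elasticsearch, assembled from configs.
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.fsspec.s3 import (
    S3AccessConfig, S3ConnectionConfig, S3DownloaderConfig, S3IndexerConfig,
)
from unstructured_ingest.v2.processes.connectors.elasticsearch import (
    ElasticsearchAccessConfig, ElasticsearchConnectionConfig,
    ElasticsearchUploaderConfig, ElasticsearchUploadStagerConfig,
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

Pipeline.from_configs(
    # General behavior: verbosity, number of processes, error handling.
    context=ProcessorConfig(verbose=True, num_processes=2),
    # S3 source: where the bucket is, where to download to, and credentials.
    indexer_config=S3IndexerConfig(remote_url="s3://my-bucket/pdfs/"),
    downloader_config=S3DownloaderConfig(),
    source_connection_config=S3ConnectionConfig(
        access_config=S3AccessConfig(
            key=os.environ["AWS_ACCESS_KEY_ID"],
            secret=os.environ["AWS_SECRET_ACCESS_KEY"],
        )
    ),
    # Partitioning via the serverless API with the auto strategy.
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.environ["UNSTRUCTURED_API_KEY"],
        partition_endpoint=os.environ["UNSTRUCTURED_API_URL"],
        strategy="auto",
    ),
    # Chunking and embedding are optional; drop these two configs if you
    # only want the partitioned JSON.
    chunker_config=ChunkerConfig(chunking_strategy="by_title"),
    embedder_config=EmbedderConfig(
        embedding_provider="huggingface",  # provider string varies by version
        embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    ),
    # Elasticsearch destination: connection/auth, schema-checking stager, uploader.
    destination_connection_config=ElasticsearchConnectionConfig(
        hosts=[os.environ["ELASTICSEARCH_URL"]],
        username="elastic",
        access_config=ElasticsearchAccessConfig(
            password=os.environ["ELASTIC_PASSWORD"]
        ),
    ),
    stager_config=ElasticsearchUploadStagerConfig(index_name="unstructured-demo"),
    uploader_config=ElasticsearchUploaderConfig(index_name="unstructured-demo"),
).run()
```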

If you want to experiment with the open source library first, you can just remove this partition-by-API flag (or set it to false), remove your API key and partition endpoint, and it's going to use the local installation of unstructured. However, if you do that, be mindful of the number of processes. I don't know how beefy your hardware is, but if your documents require OCR, you may want to go easier on the number of processes. And then if you do want to use the API, you can sign up on the website, there's a 14-day trial, and then you're going to get your personalized dashboard with the URL for the partition endpoint, and you can use this example as is.
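For that local variant, a sketch of the changed partitioner config (parameter names as in the reconstruction above, so equally version-dependent; the fast strategy also sidesteps the OCR dependencies discussed next):

```python
# Sketch: partition locally instead of via the serverless API.
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

partitioner_config = PartitionerConfig(
    partition_by_api=False,  # use the locally installed library
    strategy="fast",         # rule-based parsing only; no OCR models downloaded
)
```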

So this is configured right now to use your cloud service to do all of the work? Yeah, essentially, that's the partition-by-API option set to true, and then I authenticate myself with the API key and I say this is the endpoint that I want to use. But if I wanted, in theory, to run this locally...

Just drop these three things. And then... I have tried this, and I get lots of errors around Tesseract and trying to, you know... That's a good point, you do need some additional dependencies, yes. But those dependencies aren't pulled in natively just by installing the library...

You will need to brew install tesseract and brew install poppler, or the equivalents if you're not on a Mac. I think that's it, those are the two additional dependencies. And the first time you run it, it's going to take a while, because it's going to be downloading the model onto your machine to run OCR locally. Yeah.

Is there any choice? Can you just choose not to use OCR? Yes, because fast is a strategy that will not apply OCR at all. Well, anyway, I just had a lot of problems getting it running locally. No, maybe you didn't specify the strategy to be fast, because in that case it's not going to use OCR.

And if it did, then that's a bug that we should report. But I think you should be good if you just specify the fast strategy. Right, and so in the end, it's all loaded, it's all there. We have the text,

the embeddings, the element IDs; all the documents just get uploaded from the S3 bucket, through Unstructured, and into Elasticsearch. And that's it!