Transcript for:
Multimodal RAG Workflow

Check this out: I just built a multimodal RAG agent that can index and analyze text, images, and tables from complex PDFs at scale. The agent can respond to your questions with images and tables directly within its response. And it doesn't just index the images; it uses an AI vision model to really understand what's in those images. This could make your agents way more effective, and I'll be showing you how to build it within this video. We'll be storing our data within Supabase and we'll be chatting with that data using an n8n agent. But if you want to get set up faster, then check out the link in the description, where the blueprints will be shared in our community.

At the start of the process we have documents, in this case information-dense PDFs like product manuals. Instead of just extracting the text from those documents, we're going to use a powerful OCR API to extract the data and annotate the media within those documents, and this will work for both machine-readable and scanned PDFs. We're using Mistral's OCR for this. We can send data into their API in the form of PDF documents like this, and it will extract information in markdown format, which is LLM friendly; it provides the inline file names within that markdown, and it also responds with an array of elements like images and charts. But it's not just that: when it extracts those images, it uses a vision model to really look at and analyze them, and we can pass in a prompt to define how we want it to annotate the images, including the level of complexity and granularity. So we'll have deep context for the images as well as the actual image files themselves in base64 format. Then we can upload those files to our server, in this case using Supabase storage. We then pass these images and the markdown result from the OCR to our vector database, but we need to go through our standard embedding process before we do that: we chunk that data up into manageable pieces and use an embedding model to translate those chunks into vectors that we can then store in our vector database.

When we want to query that data, we move on to step two. We can pass in a question like so. This gets passed to our AI agent, which then queries the Supabase vector store. Here we also use an embedding model to transform the query into vectors, and then we query the vector database against those. The vector database responds with the top results, and this gets fed into a large language model like GPT-4.1, which then decides how to respond to the user based on the query and the data it got from the vector database. But the key part here is that within that vector database response we're also getting the image URLs that we can send back to the user, and we're specifically prompting the language model to render those images where available. This workflow is compatible with many of the other advanced RAG approaches that we've covered on our channel, so make sure to check those out as well.

Let's start by building a simple workflow using Mistral OCR, and then we can work from there. To keep things simple, I'm going to start with just retrieving one PDF. I'm going to use an HTTP Request node with the GET method, and I'm going to paste in a publicly accessible PDF URL. From there I'll press execute step, and I can see that it gets that data and stores it in binary format. This will be perfectly fine for us to build out and test our workflow, but if you want a far more fleshed-out data ingestion pipeline, then check out the RAG Masterclass on our channel.
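To make that first step concrete, here's a minimal sketch of what the HTTP Request node is doing: a GET request for a publicly accessible PDF, held as binary data. The URL below is a placeholder, not the manual used in the video.

```javascript
// Minimal sketch of the "retrieve one PDF" step (Node.js 18+, global fetch).
const res = await fetch("https://example.com/washing-machine-manual.pdf");
if (!res.ok) throw new Error(`Download failed: ${res.status}`);

// n8n keeps this as binary data on the item; here it's simply a Buffer.
const pdfBuffer = Buffer.from(await res.arrayBuffer());
console.log(`Downloaded PDF: ${pdfBuffer.length} bytes`);
```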
From there, you'll need to set up an account with Mistral. The pricing for their document AI and OCR is $1 per 1,000 pages, and their annotations cost $3 per 1,000 pages. The quality and speed of this API is really impressive, and it really simplifies a multimodal workflow like this, because we get back a lot of data from one single post of a PDF. So go set up an account, and once you have one you can add credits via the billing screen, then go to API keys and press create new key. I'm just going to create a temporary key here, select create new key, and copy out this API key.

Next we want to upload this file to Mistral and then run the OCR processing job on it. Now, if I go to this plus icon here and type Mistral, I see that there's a cloud chat model; unfortunately that's not relevant to our task. Instead, in the docs I see they have an "OCR with uploaded PDF" section, and this is the curl command we need to upload that PDF. So I'll click the copy button at the top right, then go to add a new node, type HTTP, and use the import curl option at the top and paste that in. I'll press the import button, and we have a lot of our parameters already filled out, which is great. Now, instead of leaving this authorization header as plain text, I'm going to delete it, select predefined credential type, and type in Mistral, because this is supported in n8n. I'll select this Mistral Cloud account; otherwise you can create a new credential and paste in the API key that we copied from the Mistral dashboard a few minutes ago. Finally, we need to map in the binary file: I just type in data, and that should map the binary data from the PDF directly into this request. Okay, now I'll go back to the canvas and select execute workflow, and we get back an ID and an object, so it looks like it was successfully uploaded to Mistral.

Now that we have our file uploaded, we need to get the OCR results. But when you look at the curl command for this, you see that you need a signed URL for the document you just uploaded, so let's do that first. We go to the signed URL curl command, copy that out, and add another HTTP Request node. Go into it, go to import curl, paste in this other curl command, and press import. I'll delete this authorization header again, and for authentication I'll go to predefined credential type and select Mistral Cloud again; here the Mistral Cloud account is automatically selected, and that should also be the case if you added one in the previous step. The only other thing you need to do is map the ID from the previous request, which I've now dragged in like so. Okay, that should be good to go. In order to test it, I'm going to select all of this and press P, which pins the previous data, so we do not need to make the first requests again. Then when I execute the workflow, it passes the previously retrieved data into this next request, and now we have a signed URL back from Mistral. I've also pressed P on this, so we can continue building this workflow without constantly querying these endpoints.
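If you'd rather see those two curl commands as code, here's a rough sketch of the same calls: upload the PDF to Mistral's files endpoint, then request a signed URL for it. The endpoints mirror Mistral's OCR documentation at the time of recording, so verify them against the current docs; the API key is read from an environment variable here rather than an n8n credential.

```javascript
// Sketch of the upload + signed-URL steps (Node.js 18+, global fetch/FormData/Blob).
const headers = { Authorization: `Bearer ${process.env.MISTRAL_API_KEY}` };

// 1. Upload the PDF with purpose "ocr" (multipart form, like the imported curl command).
const form = new FormData();
form.append("purpose", "ocr");
form.append("file", new Blob([pdfBuffer], { type: "application/pdf" }), "manual.pdf");
const uploaded = await fetch("https://api.mistral.ai/v1/files", {
  method: "POST",
  headers,
  body: form,
}).then((r) => r.json());

// 2. Request a signed URL for that file id; the OCR call consumes this next.
const signed = await fetch(`https://api.mistral.ai/v1/files/${uploaded.id}/url`, {
  headers,
}).then((r) => r.json());
console.log(signed.url);
```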
Finally, we want to get the OCR results. Go down to the get OCR results section, copy out the curl command, and again add another HTTP Request node. I'll import curl, paste that in, and press import, and now we have almost everything already set up. I'll select predefined credential type again and select Mistral Cloud, and I'll delete the authorization header parameter because we've already covered that with the authentication. Finally, within this JSON section I'm going to switch to expression mode and open it up in the edit window. I just need to replace this signed URL with the signed URL we got back from the previous node, so I'll delete it and drag in that URL, and you'll see on the right-hand side that the data now looks correct. So now we'll press execute and it goes through the OCR process. It generally does not take very long to both process and annotate the results: this is a pretty big document, but it only took about 20 seconds to actually process the OCR results for it.

So let's have a look at the results. Okay, I've run that and I see I'm getting a binary response at the end, whereas I'm expecting a JSON response that I can see on the right-hand side. If I go back to the curl command, I see that the output is set as a file; down at the very end, the response format is a file. So I'll change that to JSON, and then we should be good to go. I'll save that and re-execute the workflow. It's taken about 20 seconds so far, and there we go, we see the results. Now on the right-hand side we see a ton of responses: we have an array of pages, and there should be 39 pages or so in this particular OCR job, because that's how many pages were in the PDF, and we have a markdown response for every single one of them. Within each page object is an array of images, and you can see the actual base64 file for each, so we'll have the full file that we can then upload.

Within this request we did not actually ask for image annotations, which is where we get Mistral to use their vision model to analyze those images, so we need to update the request to include that; that's where the annotations will come from. We almost have all the data we need here; we just need to update this query to request image annotations, so let's do that right now (a sketch of what the updated request body can look like is included at the end of this section). Separate from the basic OCR documentation, there's an annotations page, and if you scroll down there's a specific schema we need to follow. Here we're providing the response format we're expecting, and we're essentially prompting this vision API. You can get really granular with this schema, but in my opinion the easiest way to get started is to copy and paste the example documentation into ChatGPT and have it generate the basic schema you need. In this case it should respond with a pretty simple, concise natural-language description of each of the images. To give a concrete example of how this kind of annotation works: you have an image like this, we pass in the image, and this is an example of a response format with an image type, description, and summary. In our case we just want a simple text response. Now that we've updated the request body, I'm going to go back, save the workflow, and execute it again, and hopefully we should get those annotations back. Okay, it completed processing in 1 minute and 5 seconds, so let's have a look at the data we got back. On the right-hand side we see the pages and their markdown, but if we look into the images, not only do we see the base64 file, we now see the image annotation as well. This is really digging into the content of those images. Our plan from here is to vectorize this data and upload it to the vector store.
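As mentioned above, here's a hedged sketch of what the updated OCR request body can look like once image annotations are requested. The field names (`bbox_annotation_format`, `include_image_base64`, the `json_schema` wrapper) follow Mistral's annotations documentation, but the video builds the schema via ChatGPT, so treat this as one possible shape rather than the exact body used; the description text is illustrative.

```javascript
// Sketch of the OCR call with a simple image-annotation schema attached.
// `signed.url` is the signed URL returned in the previous sketch.
const ocrResponse = await fetch("https://api.mistral.ai/v1/ocr", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.MISTRAL_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "mistral-ocr-latest",
    document: { type: "document_url", document_url: signed.url },
    include_image_base64: true,
    // Ask the vision model for a plain-text description of every extracted image.
    bbox_annotation_format: {
      type: "json_schema",
      json_schema: {
        name: "image_annotation",
        schema: {
          type: "object",
          properties: {
            image_annotation: {
              type: "string",
              description: "A concise natural-language description of the image",
            },
          },
          required: ["image_annotation"],
        },
      },
    },
  }),
}).then((r) => r.json());

// ocrResponse.pages[i].markdown and ocrResponse.pages[i].images[j].image_annotation
// are the fields the rest of the workflow relies on.
```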
But before we do that, it's definitely worth having a look at how this data is presented. For each of the pages we get a markdown response, and within it we have the image IDs, or image file names; alongside that we have an array of images, so for example image zero and then image one right below it. Ideally we'd get the image annotations directly within that markdown, because then we have much better context of what's in each image, and we'd be passing all of that as context to the vector database. So let's do that.

Right now I'm going to pin that data again, which means we don't have to wait a minute every time and keep paying via the API. I'm going to use a split out node, which turns this array of pages into individual items that are processed separately. I'll click execute step, and on the right-hand side we now have 39 items. Beforehand we had one item, which was the full response from Mistral OCR; after this, we have 39 items that are processed separately. From here I'm going to add a code node, because it's by far the easiest solution, and I'm going to get ChatGPT to generate the JavaScript code to plant the image annotations directly into this markdown. To give ChatGPT context on what it's looking at, I'm going to type const, add an equals sign, drag this field in, and put a semicolon at the end. Just having this by itself should hopefully guide ChatGPT in the right direction, so we don't have to keep asking it for updates to get this variable right later on. Here I'm opening up the snipping tool on Windows to take a little screenshot, and I'm going to copy that out. My prompt is: here is JavaScript code, I want you to add the image annotation directly inline in the markdown; the images will be referenced in the markdown by their file name, and here is an example. I've switched to the table view so I can better see the data, and I'll show an example of how the file names render within the markdown. I've pasted that in, so let's see how it turns out. Okay, it's now generating this JavaScript code. This is just using GPT-4o, so let's work with that for the moment. I'll paste it in and press execute step, and now let's look through the markdown response. Okay, we're having an issue here where it's duplicated this for every item. I see the node is set to run once for all items, but we want it to run once for each item. So I'll switch that and select execute step, and when I try to run it we see that this code only works when running once for all items, not once for each item. So back to ChatGPT: please write the code for run once for each item. Okay, let's copy that out and try it. I'll click execute step, and while that markdown is pretty difficult to read, we can see we have the image reference, and right after it the full description of that image, and then the full description of the next image. That's exactly what we want. Now we can chunk this up and upload it to the vector store. And by the way, we're not yet uploading the binary images; I'll be showing you how to do that afterwards.
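For reference, here's a code-node sketch along the lines of what ChatGPT produced for the "run once for each item" mode. The field names (`markdown`, `images`, `id`, `image_annotation`) match what the Mistral OCR response shows in the video, but treat it as an approximation rather than the exact generated code.

```javascript
// n8n Code node, mode: "Run Once for Each Item", language: JavaScript.
const item = $input.item;
let markdown = item.json.markdown;

for (const image of item.json.images ?? []) {
  // The OCR markdown references each image by file name, e.g. ![img-0.jpeg](img-0.jpeg),
  // so we append the vision annotation right underneath that reference.
  const ref = `![${image.id}](${image.id})`;
  if (image.image_annotation && markdown.includes(ref)) {
    markdown = markdown.replace(ref, `${ref}\n\n*Image description:* ${image.image_annotation}`);
  }
}

item.json.markdown = markdown;
return item;
```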
Next up we're going to add a Supabase vector store, so I'm typing in Supabase and selecting the Supabase vector store node. I've selected my Supabase account for this; if you do not have one, go to supabase.com and set up a free account, which will work for this workflow. In the SQL editor, go to quickstarts, select LangChain, and just press run, and that will create your documents table. I already have a documents table set up, so I'm just going to delete the existing rows and work from there. I've selected the insert documents operation, chosen the documents table, and I'm leaving the embedding batch size as it is. From there you need to define a data loader, so I'm going to select the default data loader. This defines what data we actually want to enter into the vector database; I'll select load specific data and drag the markdown in. Then we can select a text splitting node, which chunks up our data into individual segments that we then vectorize and upload to the vector database. I'm going to select a recursive character text splitter with a chunk size of 1,000 and a 200 overlap. You can define whatever chunking strategy you want, but I'm going with this as a default. Then we need an embedding model: I'll select OpenAI's embeddings and use text-embedding-3-small as a default. If you want to learn more about how all of this works, check out the link in the description to our RAG Masterclass. Excellent. I'm now going to pin that data and execute the workflow, and hopefully we should have vectors being uploaded to our database. Okay, that's currently in progress, and when we look at our Supabase table we now see a ton of vectors uploaded.

Next we can start chatting with that data directly. I'm going to press the plus icon, select add another trigger, and pick the chat message trigger at the bottom. From there I'll select AI agent, which adds a standard n8n agent. For the chat model we can pick anything here; I'm just going to use an OpenAI chat model and select GPT-4.1, using the base model rather than a fine-tuned one. One thing I will change is to add a sampling temperature and reduce it; this lowers the randomness of the output and makes the model more deterministic and less creative, which is exactly what we want from a RAG agent. From here we can already start sending messages, so I've just typed hello. It queries the agent, goes to the chat model, and responds with "hello, how can I help you today?", but it currently does not have memory and we've not hooked up the RAG element. So I'm going to press the plus icon under memory and select simple memory for the moment; this is just in-memory storage on the n8n server, so if you want something for production you would use a persistent data store, but we'll use this for now. Then under tool we're going to add the Supabase vector store, with the operation mode set to retrieve documents. I've written a basic description for it, and for the table I'm selecting the documents table. We can choose the limit; I'll leave it at four for the moment, but we can add more later on. For reranking, you can enable reranking (check out the link in the description if you want to learn about that), but we're leaving it disabled for the moment. And finally for the embedding we're selecting OpenAI and choosing the exact same model that we embedded the data with in the first place, which is very important.
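To illustrate why the embedding model has to match, here's a rough sketch of what the Supabase vector store tool effectively does on each question, assuming the LangChain quickstart schema (a documents table plus a match_documents function). The client setup and environment variable names are placeholders, not part of the n8n workflow itself.

```javascript
import { createClient } from "@supabase/supabase-js";
import OpenAI from "openai";

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_ROLE_KEY);
const openai = new OpenAI();

// 1. Embed the question with the SAME model used at ingestion (text-embedding-3-small).
const { data: [{ embedding }] } = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "What is the warranty for this product?",
});

// 2. Fetch the closest chunks; match_count mirrors the tool's limit of four.
const { data: matches, error } = await supabase.rpc("match_documents", {
  query_embedding: embedding,
  match_count: 4,
});
if (error) throw error;

// matches[i].content is the markdown chunk handed to GPT-4.1 as retrieved context.
console.log(matches.map((m) => m.content));
```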
Now, before we do anything else, I want to provide a system prompt. For the system message I've written: you are a washing machine expert (of course you can change that for your use case); you are tasked with answering a question using the information retrieved from the attached vector store; your goal is to provide an accurate answer based on this information only; if you cannot answer the question using the provided information, or if no information is returned from the vector store, say "sorry, I don't know". This is really, really important: you need to give the LLM permission to tell you when it doesn't know, because otherwise it will just make stuff up. Now I'm going to make this workflow active at the top, select got it, and for the chat trigger I'll make the chat publicly available. I've pressed save, opened the chat URL in my browser, and asked a simple question: what is the warranty for this product? Let's have a look at the execution history for that. If I go into the Supabase vector store, you can see on the right-hand side that the data we got back from Mistral relates to a table within the document, and it's in markdown format, or mostly markdown format, at least in a way the LLM will be able to understand. And the answer is based on that data.

From here we have most of our workflow set up, but we still need to upload the image files and make them available within the vector database, so they can be retrieved later and rendered to the user. This takes a bit more work within n8n, so I'm going to walk through the workflow I've previously created to show you exactly how it works. At the start of this workflow we have everything we've done previously, which is to get the content of a PDF file and upload it to Mistral for OCR. The one thing I have at the very start is that I'm providing my Supabase details: the Supabase base URL and the Supabase storage bucket name. If you go to your project overview and scroll down a little, you'll see your project URL; copy that out, and that's your Supabase base URL. Then you need your Supabase storage bucket name, and this is where we'll be uploading the images we extract from the PDFs. For that, go to storage on the left-hand side, go to new bucket, type in a name for your bucket, and set it as public. That means the images will be publicly accessible by their URLs. This could be fine for your use case; if so, just go with it, because it makes things a bit easier. Otherwise you can make the bucket private, but then you need to jump through more hoops in terms of generating signed URLs. So even if you plan to make it private later, you can start with public for initially testing and building out this workflow, and work from there. I'll press save, copy out the bucket name you created, and add that to this field here.
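For reference, once the bucket is public, each uploaded image ends up reachable at a predictable URL, which is what the agent will later embed in its markdown. A quick sketch with placeholder names:

```javascript
// Public Supabase Storage URL pattern (placeholder project URL, bucket, and file name).
const SUPABASE_BASE_URL = "https://your-project-ref.supabase.co";
const BUCKET = "manual-images";
const fileName = "a1b2c3d4.jpeg";

const publicUrl = `${SUPABASE_BASE_URL}/storage/v1/object/public/${BUCKET}/${fileName}`;
```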
From here, the next four nodes are the same as before: they get the OCR result from Mistral. After that is where things change a little. What we're doing is splitting out the pages and images so we can process them individually, and then uploading all of the relevant images to Supabase. There are many different ways of doing things within n8n; I've chosen to upload the images and then merge the results back with the main flow, but you could do it in other ways, or use code nodes, for example, to consolidate some of the logic.

To start with, let's look at the image upload piece. In our previous workflow, when we perform the post request to get the OCR results for the document, we get the list of pages with their images, markdown, and so on. If you look at the data types, pages is an array, and the images within each page is also a separate array, and this is all one item, because it was one single response from Mistral. So we break out this flow: I've added a split out node, and within it I've dragged in pages as the field to split out. If you look at the previous execution, one item went into that node and 39 items came out. Split out nodes are a great way to take an array and split it into separate items that are processed individually within n8n. So we split out each of the pages, and after that we split out further: we have each individual page, and now we split the images from those pages into their own separate data items that are processed individually. From there we have images one by one that we can upload to Supabase storage.

Before we do that, I've added a set fields node. It has a file name, where I got ChatGPT to generate a pseudo-random file name we can use for that particular file. Underneath that is the original ID, so image0.jpeg is the first image in this document; these are the image file names that appear inline within the document's markdown. We'll use this later to identify where in the markdown this image sits, so we can replace it with the uploaded Supabase URL. And then we have the image annotation, which is the vision OCR result for that image. I'm setting all of these here because it makes it much easier to merge everything back in with the results later. From here I'm breaking out the flow again; this is not a filter or a router, the results from this node simply flow both to this branch and to the one underneath, and I'll explain why in a minute.

Next I'm preparing the base64 string. If I click into that and look on the left-hand side at the split out images, scrolling down I see data:image/jpeg;base64, which is the MIME-type prefix of the image we got back from Mistral OCR; the actual base64 part is everything after it. By the way, base64 is a type of encoding for files: in this case Mistral responded with the file data in this encoded format rather than binary, so this long encoded string is the actual file data. In order to send it to Supabase we need to convert it to binary format, and to do that we need just the base64 string without the MIME-type prefix. That's exactly what's happening here: I got ChatGPT to generate an expression that takes everything after the first comma, and the field set here is image_base64. After that I have a convert to file node, which turns this base64 string into a binary file. I've dragged in the file name we set in the set fields node earlier, and the base64 field is the output of the node we just set. The output of that is a binary file.
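Here's a small sketch of the two pieces of string handling just described: a pseudo-random file name for the upload, and the raw base64 payload with the data-URI prefix stripped off. In the video these live in a set fields node and an expression; the code-node equivalent would look something like this, with the source field name being an assumption.

```javascript
// Pseudo-random file name (the video had ChatGPT generate something similar).
const fileName = `${Date.now()}-${Math.random().toString(36).slice(2, 10)}.jpeg`;

// Everything after the first comma is the actual base64 payload;
// the part before it ("data:image/jpeg;base64") is just the MIME-type prefix.
const raw = $json.image_base64; // e.g. "data:image/jpeg;base64,/9j/4AAQ..." (assumed field name)
const imageBase64 = raw.substring(raw.indexOf(",") + 1);

// The Convert to File node then turns imageBase64 into a binary file named fileName.
```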
Next we need to upload this file to Supabase, and for that we have an HTTP Request node with a post method and a URL (there's a rough sketch of this request at the end of this section). If you look into the URL expression and expand it out, the binary file name isn't actually showing here because I'm looking at a previous execution, but if you look at the result at the bottom we have the Supabase base URL, then the rest of the storage path, then the Supabase bucket, and after that the file name. For this you can either use a set fields node at the very start, as I have, or just hardcode the values into this particular URL. For authentication we use a predefined credential type and I'm selecting the Supabase API credential. Then I'm selecting send body, n8n binary file, and choosing the binary data field from the previous node, which shows up at the top left. When this was previously executed, we got a list of 45 images that were uploaded to Supabase. We get back the key, which is the file name with the storage bucket in front of it, as well as an ID.

We'll then want to merge these back with the original stream, which means we'll have our list of successfully uploaded images in Supabase. But before we do that, we want to prepare the file name, because the key has the storage bucket at the start, so I've added a simple expression to return everything after the first forward slash, which gives us just the file name by itself. Then we have a merge results node, and this has two inputs: the first is the results back from Supabase, and the second is the original stream. Remember, the original stream had the file name, the original ID, and the image annotation, whereas the Supabase response gives us just the file name, so we're going to merge these back together. Here we're using combine mode, which merges the matching items together, and we're matching on fields, so file name will be the field for that; actually, these are both the same field name, so we can uncheck the second one. Here we want to keep the matches. There are a bunch of different ways to merge result sets together; if you understand SQL joins, this will make a lot of sense. Keep matches is like an inner join, a match between the two sets, whereas keep everything is like an outer join, keeping both data sets regardless of matches. I'm selecting keep matches, which means it only returns the images that were successfully uploaded to Supabase. For output data we want to keep both inputs merged together, so we'll have the file name, original ID, and image annotation. That's how I've done it; there are plenty of other ways you could.

After that I'm aggregating everything together. The merge node merges the flows; the aggregation node takes the 45 individual data items within n8n and aggregates them into an array, so it's the exact opposite of the split out node: earlier on, the split out node took an array and split it into items, while the aggregation node takes a load of items and combines them into one array. In the aggregation I've selected all item data, and this all goes into the uploaded images array. Then we merge everything back together, so if I go into this node you can see on the right-hand side we have all of our uploaded images across the entire document, plus all of our pages. I'm using the combine mode, combining by position. So we now have our entire document with the Supabase uploaded images and annotations, ready to go.
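As referenced above, here's roughly what that upload request looks like outside of n8n. The storage path pattern follows Supabase's Storage REST API; the service-role key and the variable names are assumptions for illustration, since in the workflow the Supabase API credential handles authentication.

```javascript
// Sketch of the image upload to Supabase Storage (Node.js 18+, global fetch).
const SUPABASE_BASE_URL = "https://your-project-ref.supabase.co"; // placeholder
const BUCKET = "manual-images";                                   // placeholder

await fetch(`${SUPABASE_BASE_URL}/storage/v1/object/${BUCKET}/${fileName}`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SUPABASE_SERVICE_ROLE_KEY}`,
    apikey: process.env.SUPABASE_SERVICE_ROLE_KEY,
    "Content-Type": "image/jpeg",
  },
  body: imageBinary, // the binary file produced by the Convert to File step (assumed variable)
});
```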
And from there it's completely up to you how you want to chunk that data. I've merged everything back together, which may be slightly redundant if you're planning a page-by-page chunking strategy, but it keeps your options open depending on how you want to chunk your document. In this case I'm following a very similar approach to before: adding a vector store with a default data loader, splitting out the pages one by one, and processing them. Before that, we need a similar code node to the one we had previously, because previously we were updating the inline markdown to plant in the image data we got back from Mistral, so let's go into it. In split out pages we're just splitting out our pages field, and then within the code node we can look at the JavaScript. It's set to run once for each item and the language is JavaScript; I'll open the edit window to expand it out a bit. There's a bit going on here, and I got ChatGPT to generate it. What it does is construct the Supabase base URL, which is specific to your instance, and then iterate through each of the images that were uploaded to Supabase. If the file name is present within the markdown we got back from Mistral, it replaces that file name with the full uploaded Supabase URL in markdown image format, which means that when we pass it to our agent, the agent should automatically render that image directly in the chat window. Not only that, it also takes the image annotation, places it directly below the image, and returns the whole thing as JSON (a rough sketch of this code node is included at the very end of this transcript). So now if you look at the markdown on the right-hand side, which is ready to send to our vector store, instead of just the image name we had previously, we now provide the image name, the full Supabase URL for that image, and the AI vision description of the image. It's all ready to go to our vector store, and the vector store is set up in almost exactly the same way as before: in the default data loader we're just passing in the markdown that we want to chunk and embed.

Now we can use the exact same agent we created previously to chat with that data. I'll go to the when chat message received node, select the chat URL, and ask a question like: where do I put the fabric softener? Okay, the response is looking good, and it's even included an image from that part of the PDF. Here's another question: my washing machine is very noisy. This is great: it's given us a bunch of suggestions, and it's also included a few relevant images about leveling the washing machine and adjusting the feet using non-skid pads. So this is fantastic, and it took three of those images from the document. Once you're happy with this initial system, you can expand it however you want. You could build full RAG ingestion pipelines like we have in our RAG Masterclass on our channel, and you could use more advanced methods like hybrid RAG to improve retrieval accuracy and reranking to reduce noise. All of the blueprints are in our community, where we have an active discussion board, live calls, and access to all of our courses, with more on the way.
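Finally, as promised above, here's a rough sketch of that last code node: it swaps each inline image file name for its public Supabase URL and appends the vision annotation underneath. It assumes the merged item carries supabase_base_url, supabase_bucket_name, uploaded_images (each with file_name, original_id, and image_annotation), and the page markdown; your field names may differ.

```javascript
// n8n Code node, mode: "Run Once for Each Item", language: JavaScript.
const item = $input.item;
const base = `${item.json.supabase_base_url}/storage/v1/object/public/${item.json.supabase_bucket_name}`;
let markdown = item.json.markdown;

for (const img of item.json.uploaded_images ?? []) {
  // Only rewrite images that actually appear in this page's markdown.
  const ref = `![${img.original_id}](${img.original_id})`;
  if (!markdown.includes(ref)) continue;

  const publicUrl = `${base}/${img.file_name}`;
  markdown = markdown.replace(
    ref,
    `![${img.original_id}](${publicUrl})\n\n*Image description:* ${img.image_annotation}`
  );
}

item.json.markdown = markdown;
return item;
```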