Welcome to this video on building multimodal RAG systems. Most of the RAG systems we have seen so far are focused on text only. However, if you look at this Wikipedia entry on SpaceX, you can see that there are a lot of images which can be extremely helpful when you're trying to retrieve information from the page. Similarly, there are tables that contain a lot of useful information. So having the ability not only to ask questions about the text in a webpage or document, but also to retrieve the corresponding multimedia content, such as images or tables, is extremely helpful when building RAG systems.

In this video, I'm going to show you an approach that lets us retrieve images along with the text related to a query. This is going to be the first iteration; later on we're going to build a much more powerful system that will also include information extracted from tables. To start off, we're going to look at information contained in Wikipedia pages, but the same approach can be applied to other kinds of documents, including PDFs or Word documents.

We are going to look at three different approaches that you can use for multimodal RAG. The focus of this video is going to be on the first one.

The first approach is to embed all the different modalities into a single vector space. Let's assume the input data is a document: we extract text and images separately, and then we create embeddings with a model that can work across both images and text. For example, one option would be to use something like a CLIP model, which we're going to cover in a lot more detail later in the video. We create that unified vector space and put it in our vector store. When a user query comes in, we create embeddings for the query, do retrieval on this unified vector space, and then use the retrieved documents as our context. If there are images retrieved as part of the context, we pass everything through a multimodal LLM to generate the final response. This is one of the simplest approaches, but it requires a very capable multimodal embedding model.

The second approach is to ground all the different modalities into a primary modality, which is text in this case. Let me explain how this process works. We have our input data, and we extract text and images. For text, we create text embeddings, for example using the OpenAI text embedding models. For images, we pass them through a multimodal model, something like GPT-4o, Gemini Pro, or even the Claude models, to generate text descriptions of each image. We then take those descriptions along with the image data, create text embeddings of the image descriptions, and put everything into a unified vector store. When a user query comes in, retrieval happens on this unified text vector space, because we converted everything into a single modality. We get the retrieved context and then check whether each retrieved chunk is text or an image description. If it's text, we pass it directly to the LLM to generate a response. If it refers to images, based on the descriptions or chunks we created, we pass those through a multimodal model to generate the final response. This is a great approach, because we are unifying everything into a single modality; a minimal sketch of the captioning step follows below.
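To make that second approach a bit more concrete, here is a minimal sketch of the captioning step, assuming the OpenAI Python SDK with a GPT-4o-class model. The helper name, prompt, model choice, and file name are purely illustrative and are not what we use later in this video; the point is only that the resulting description can be embedded like any other text chunk.

```python
# Minimal sketch of approach 2's captioning step: turn an image into a text
# description with a multimodal LLM, then embed that description as plain text.
# Assumes OPENAI_API_KEY is set; model, prompt, and file names are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(path: str) -> str:
    """Hypothetical helper: ask a multimodal model for a short image description."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in two or three sentences."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The description is then embedded like any other text chunk.
description = describe_image("spacex_launch.jpg")  # hypothetical file
embedding = client.embeddings.create(
    model="text-embedding-3-small", input=description
).data[0].embedding
```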
The drawback is that, since everything is grounded in text, this approach can lose some of the nuance from the original images.

The third approach is to use separate vector stores for different modalities. Let me explain this with the help of this flowchart. For the text data, we create text embeddings and put them in a text vector store. For the images, we use a specialized model that creates embeddings directly from the images, so we encode those and keep a completely separate vector store for images. When a user query comes in, we do retrieval twice: once against the text embeddings, and once by converting the query into the corresponding image embedding space, for example using a CLIP model, and running retrieval on the image vector store as well. From both of them we get separate chunks, depending on how many we ask for; let's say we want the top three chunks from the text store and the top three from the image store. We then need a multimodal re-ranker to rank these chunks and figure out which are the most relevant; we take those and pass them through a multimodal model to get a final response. So in this case we need an extra multimodal re-ranker, which has to be a capable model that can understand whether the image chunks or the text chunks are more important for a specific query.

In this video our focus is going to be on the first approach, where we are going to use a CLIP model to generate a unified vector space. In later videos, we are going to look at these more complex solutions for multimodal RAG.

If you're not familiar with CLIP, it is a model that was released back in 2021 by OpenAI, and there have been a couple of open-source iterations of it since. CLIP stands for Contrastive Language-Image Pre-training. It's a neural network that accepts image-text pairs and creates embeddings in a shared space, so that text and images describing the same concepts end up close together; you'll find a short standalone sketch of this shared space right after this overview. Apart from the original CLIP model, there is a newer initiative called OpenCLIP, an open-source implementation of the original CLIP model, and I think it is trained on a lot more data than the original. I'll put a link to the original CLIP paper; let me know if you want a tutorial on the technical details, I can do that if there is interest.

All right, so with this technical background, let's look at an actual code implementation. A quick correction first: here's the flow that we are actually going to implement, which is a little different from option one that we saw. Our data is going to be in the form of Wikipedia web pages; the same flow applies to PDF files as well. We'll extract images and text chunks separately, use CLIP embeddings for the images and text embeddings for our text chunks, and create two vector stores that are combined into a multimodal vector store. For that, we're going to use the Qdrant vector store. Then, when a new user query comes in, we will do retrieval on top of the multimodal vector store that we created. The result is going to be up to the top three text chunks and up to five images that the CLIP model thinks are most similar to the provided user query.
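Here is the promised sketch of what that shared embedding space gives you, using OpenCLIP. The model and checkpoint names are common defaults rather than necessarily what the notebook uses, and the image file is hypothetical; the idea is simply that a text query and an image can be compared directly with cosine similarity.

```python
# Sketch: text and images land in the same CLIP embedding space, so cosine
# similarity between a query and an image is meaningful. Uses OpenCLIP with a
# commonly used checkpoint; model, checkpoint, and file names are illustrative.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("rocket.jpg")).unsqueeze(0)  # hypothetical file
text = tokenizer(["a photo of a rocket launch", "a photo of a cat"])

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)
    # Normalize so the dot product becomes cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarities = (image_emb @ text_emb.T).squeeze(0)

print(similarities)  # the rocket caption should score higher than the cat one
```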
In this first iteration we'll just display the retrieved text chunks and the corresponding images, so this video is limited to the retrieval part; we'll do the generation part in a later video. Okay, with this quick correction, back to the rest of the video.

Let's look at a code example. This is based on a notebook provided by LlamaIndex, and in this tutorial we are going to use LlamaIndex; I'll show you how to use LangChain as well in a subsequent video. As a source of information we will use some Wikipedia articles. The way this works is that we take the text separately, and then extract the images present in these different articles separately as well. We'll use the CLIP model to generate embeddings for the images, and the small text embedding model from OpenAI for the text chunks that we extract. CLIP is trained to understand and connect images and text in a shared embedding space, so we can use that shared space to ask questions when doing retrieval.

First we need to download the text and raw images from the Wikipedia articles, and we're going to use several different articles to show you what kind of extraction you can expect. So let's first set up the packages we will need: we install LlamaIndex and Qdrant, which is going to be our vector store in this specific case, and we also install the CLIP implementation from OpenAI.

Once the installation is done, the next step is to download the data from Wikipedia. We're going to look at four different articles: RoboCop, the Labour Party from the UK, SpaceX, and OpenAI. The script takes a list of topics and then downloads the corresponding Wikipedia articles for you, extracting the text portion of each article; a rough sketch of this download step follows at the end of this part. We can have a look here: for example, this is the OpenAI-related article on Wikipedia, this one is RoboCop, and this one is related to SpaceX, and here are the actual articles on Wikipedia.

The next step is to extract and download the images from those articles. You can do the same process with PDFs as well; for example, you can use something like Unstructured to extract images, tables, and text, and then partition them into separate files. That process works for PDF files or even Word files. In the second part of the script, we get the images from each of the Wikipedia articles: this loop goes through each article and downloads its images. There are cases in which you will not be able to download some of the images because of the way the Wikipedia pages are set up, so if you run this code and certain images cannot be downloaded, it will just tell you that no images were found on a given Wikipedia page. This one is, I think, related to the Labour Party article, and there are some others as well that are probably related to the RoboCop entry.

Now, we're going to use the OpenAI embedding model, so we need to set up the OpenAI API key. In my case, everything is set up in the secrets of Google Colab; if you want to run this locally, you'll need to set the API key as an environment variable. Okay.
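Here is the rough sketch of the download step mentioned above, covering both the article text and the images. The exact script in the notebook may differ; the folder name, article titles, and file naming here are illustrative, and some image downloads can fail depending on how the Wikipedia page is set up.

```python
# Rough sketch of the data-download step (text and images) for the four
# Wikipedia articles. Requires the `requests` and `wikipedia` packages;
# folder, titles, and file names are illustrative, not the notebook's exact ones.
import urllib.request
from pathlib import Path

import requests
import wikipedia

wiki_titles = ["RoboCop", "Labour Party (UK)", "SpaceX", "OpenAI"]
data_path = Path("mixed_wiki")
data_path.mkdir(exist_ok=True)

# 1) Download the plain-text body of each article via the Wikipedia API.
for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            "explaintext": True,  # plain text instead of HTML
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    (data_path / f"{title}.txt").write_text(page["extract"], encoding="utf-8")

# 2) Download the images referenced by each article; pages where this fails
#    are simply reported and skipped, as shown in the video.
image_uuid = 0
for title in wiki_titles:
    try:
        page = wikipedia.page(title)
        for url in page.images:
            if url.lower().endswith((".jpg", ".png")):
                image_uuid += 1
                urllib.request.urlretrieve(url, str(data_path / f"{image_uuid}.jpg"))
    except Exception:
        print(f"No images found for Wikipedia page: {title}")
```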
Next we need to set up our vector stores. This is going to be a little different from the flowchart I showed you: we compute embeddings separately for text and images and then put them into a multimodal vector store. In this example we're using Qdrant because Qdrant supports multiple modalities, so it can handle both image and text embeddings; some other options, like FAISS, don't have that ability, which is why we chose Qdrant in this specific case.

First we create a base Qdrant client, and then we create the different collections; a collection is basically a named subset of embedding vectors. The first one is the text embedding store, where we provide our client and the name text_collection_01. The second one is specifically for images, where we want the CLIP model to generate the embedding vectors. Then we set up a storage context, which is the way you configure vector stores in LlamaIndex. We create a single store for both images and text and wrap it inside the MultiModalVectorStoreIndex, an index specifically designed for storing multimodal embeddings. The way this works is that we read everything in the folder we downloaded, so it will pick up both images and text, chunk the text (by default, I think it's about 800 tokens per chunk, and you can change those values if need be), and create embeddings for the images as well.

The process took a little while, and the resulting vector store has a size of almost 300 megabytes, so it is pretty big. You can reduce the size of your vector store by using quantized embeddings, which give you a pretty good balance between performance and speed; I'm going to create a video on that, because I think it's a very important topic if you're trying to put these RAG systems into production with support for a large corpus of data.

After that, we use a simple function called plot_images to randomly sample some of the images present in the corpus. You can see there are a number of images of different individuals: here's an image of the original version of RoboCop, there's an image of Sam Altman that probably comes from the OpenAI article, and here's an image of Elon Musk that probably comes from the SpaceX article.

Now you can run queries on top of the vector store you created. For example, here's a query: "What is the Labour Party?" The way we do it is we take the multimodal vector store we created and ask for an image similarity top-k of five with a text similarity top-k of three; in other words, we want up to the three most similar text chunks and up to five images for the given query (the sketch below shows roughly what this looks like in code). You will see that the query results are not great in some cases, because the system would need a lot more context, and we would probably want to run these images through a vision model to generate text descriptions. But here are the results: we get the text chunks first, so here's the first chunk, the second chunk, and the third chunk, and at the end we have information about the specific images that the embedding model thinks are closely related to the query we provided.
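Here's roughly what that setup and retrieval look like in code, based on the LlamaIndex multimodal Qdrant example. Exact import paths and defaults can vary between llama-index versions, and the collection names are the ones mentioned in the video rather than required values, so treat this as a sketch rather than the notebook verbatim.

```python
# Sketch of the multimodal index setup and retrieval-only query, based on the
# LlamaIndex multimodal Qdrant example; import paths may vary across versions.
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.schema import ImageNode
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Local on-disk Qdrant client plus one collection per modality.
client = qdrant_client.QdrantClient(path="qdrant_mm_db")
text_store = QdrantVectorStore(client=client, collection_name="text_collection_01")
image_store = QdrantVectorStore(client=client, collection_name="image_collection_01")

storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Reads both the .txt files and the images from the download folder; text is
# chunked and embedded with the OpenAI text model, images with CLIP.
documents = SimpleDirectoryReader("./mixed_wiki/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Retrieval only: up to 3 text chunks and up to 5 images per query.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=5)
results = retriever.retrieve("What is the Labour Party?")

# Split the retrieved nodes into text chunks and image file paths for display.
retrieved_text, retrieved_images = [], []
for result in results:
    if isinstance(result.node, ImageNode):
        retrieved_images.append(result.node.metadata["file_path"])
    else:
        retrieved_text.append(result.node.get_content())
```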
To give you a better overview of what comes back, here are the text chunks. The first one says the Labour Party is a social democratic political party in the UK that has been described as an alliance of social democrats, democratic socialists, and trade unionists. It goes on and shows us the different chunks we retrieved, and then the corresponding images the embedding model was able to find for that query. You will notice that when the query is not very specific, it will just pick some images it thinks are closely related; if you're looking for very specific information, you would want to provide a much more targeted query.

Here's another one: "Who created RoboCop?" Again, it tells us that RoboCop is a 1987 American science fiction action film directed by Paul and written by Edward (I'm not even going to try to pronounce their last names). It returned three text chunks and five images that it thinks are closely related to the query we provided. Pretty interesting stuff.

Similarly, for OpenAI we again get three chunks of text and five images. In this case it also mixed in the Labour Party, probably because there are not enough images related to OpenAI in the article, so it added some images from the Labour Party article as well. When I asked which company makes Tesla, there is no dedicated article about Tesla, but the SpaceX article has some information related to Tesla and SpaceX's business, and there is probably an image of Elon Musk in that article, so it retrieved that image for us as well.

Okay, so in this case we only did the retrieval part. The next step is to build on top of this: take the chunks retrieved by the image embedding model as well as the text embedding model and combine them to generate a final response. If you are interested in that end-to-end system, I will be creating subsequent videos on this topic, where we're going to look at more advanced solutions, so make sure to subscribe to the channel so you don't miss them. Hope you found this video useful. Thanks for watching, and as always, see you in the next one.