Welcome to this video on
building multimodal RAG systems. Most of the RAG systems that we have seen
so far are primarily focused on text only. However, if you look at this entry on
Wikipedia related to SpaceX, you can see that there are a lot of different
images which can be extremely helpful when you're trying to retrieve
some information from this webpage. Similarly, there are tables as well,
which contain a lot of useful information. So having the ability to not only ask questions about the text that is present in a webpage or document, but also retrieve the corresponding multimedia content, for example images or tables, is extremely helpful when it comes to building RAG systems. In this video, I'm going to show you
an approach in which we are going to be able to retrieve images along
with the text related to a query. This is going to be the first iteration,
and later on we're going to build a very powerful system, which will also contain
information extracted from tables. To start off, we're going to just look
at information contained in Wikipedia pages, but the same approach can be
applied to any other sorts of documents, including PDFs or Word documents. We are going to look at three
different approaches that you can use for multimodal RAG. The focus of this video is
going to be on the first one. The first approach is to embed all the different modalities into a single vector space. So let's assume we have the input data, which is a document; we extract text and images from it separately, and then
we create embeddings which can work across both images and text. So, for example, one option would be to use something like a CLIP model, which we're going to go into in a lot more detail later in the video. We create that unified vector space and put it in our vector store. Then, when a user query comes in, we
create embeddings for our query, then do retrieval on this unified vector
space, and then use the retrieved documents as our context. If there are images retrieved as part of the context, we can pass this through a multimodal LLM to generate the final responses. This is one of the simplest approaches, but it will require a very capable multimodal embedding model. Now, the second approach is to ground
all the different modalities that we have into a primary modality,
which is text in this case. So, let me explain how this process works. We have our input data. We extract text and images. For text, we create text embeddings,
like using the OpenAI text embeddings. But for images, we are going to pass them through a multimodal model, something like GPT-4o, Gemini Pro, or even the Claude models, to generate text descriptions of the images that we are passing in. Then we take those text descriptions along with the image data, create text embeddings of the image descriptions, and put them into a unified vector store. So when a user query comes in now, the retrieval happens on this unified text vector space, because we converted everything into a single modality. We get the retrieved context, and within that context we check whether the content that was retrieved is text or images. If it's text, we directly pass it through the LLM to generate responses. However, if it contains images, based on the descriptions or chunks that we have created, then we pass those through a multimodal model to generate the final responses. Now, this is a great approach, because we are just unifying
everything into a single modality. However, since the focus is going to be on text, it can potentially lose some nuances from the original images.
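To make that grounding step concrete, here is a minimal sketch of generating a text description for one image with the OpenAI API; the model name, prompt, and file path are just placeholder choices, and any capable multimodal model would work the same way:

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def describe_image(image_path: str) -> str:
    # Send the image inline as base64 and ask for a searchable description
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail so it can be indexed for search."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The returned description is what gets embedded alongside the regular text chunks.
```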
The third approach is to use separate vector stores for different modalities. Let me explain this with the help of this flowchart. So for the text data, we
create text embeddings and put them in a text vector store. For the images, we use a
specialized model that is going to create embeddings from images. So we basically encode those
and have a completely separate vector store for images. Now, when the user query comes in,
we're going to do retrieval separately based on the text embeddings, and we'll also convert that query into the corresponding image embeddings, for example using a CLIP model, and then do retrieval on top of the image vector store as well. Now from both of them, we will
get separate chunks, depending on how many chunks we want. So let's say we want the top three chunks from the text and the top three chunks from the images. Now we need to use a multimodal re-ranker, because we want to rank these chunks that we get and figure out which are the most relevant. Then we take the most relevant chunks and pass them through the multimodal model to get a final response. Now in this case, we will need this extra multimodal re-ranker, which has to be a capable model that can understand whether the image or the text chunks are more important for a specific query. In this video, our focus is going to
be on the first approach, where we are going to be using a CLIP model to generate a unified vector space. But in later videos, we are going to be looking at those more complex solutions for multimodal RAG. If you're not familiar with
CLIP, it is a model that was released back in 2021 by OpenAI. There have been a couple of open-source iterations of this model. CLIP stands for Contrastive Language-Image Pre-training. It's a neural network that accepts images and text as pairs, and it creates embeddings in a shared space, where image and text embeddings are aligned, so they can describe the different concepts that are present in images. So apart from the original CLIP model,
there is a newer initiative called OpenCLIP. This is an open-source implementation of the original CLIP model, and I think it is trained on a lot more data than the original CLIP was. I'll put a link to the original CLIP paper. Let me know if you want a tutorial on the technical details; I can do that if there is interest. All right.
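As a quick illustration of that shared space, here is a small sketch using OpenAI's clip package; the image path and the candidate captions are made up, but the pattern of encoding both modalities and comparing them is the core idea:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical inputs: one local image and a few candidate captions
image = preprocess(Image.open("falcon_9_launch.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a rocket launch", "a political rally", "a movie poster"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)  # image embedding
    text_emb = model.encode_text(texts)    # text embeddings in the same space

# Cosine similarity in the shared space tells us which caption fits best
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).softmax(dim=-1))
```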
So, now with this technical background, let's look at an actual code implementation. A quick correction: here's the flow that we are actually going to be implementing. It's going to be a little different from the option one that we saw. Our data is going to be in the form of Wikipedia web pages; the same flow will apply to PDF files as well. So we'll extract the images
and text chunks separately. We'll use CLIP embeddings for the images and text embeddings for our text chunks, and we're going to create two vector stores that are combined into a multimodal vector store. For that, we're going to be using the Qdrant vector store. Then, when a new user query comes in, we will do retrieval on top of the multimodal vector store that we created. And the result is going to be up to the top three chunks from the text and up to five images that the CLIP model thinks are most similar to the provided user query. And we'll just display the text
chunks and the corresponding images. This is going to be just
limited to the retrieval part. We are going to do the
generation part in a later video. Okay. So with this quick correction,
back to the rest of the video. Let's look at a code example. This is based on a notebook provided by
LlamaIndex, and in this tutorial we are going to be using LlamaIndex. Later on, I'll show you how to use LangChain as well in a subsequent video. So, as a source of information, we will
be using some Wikipedia articles, and the way this is going to work is that we are going to take the text and the images present in these different articles and extract them separately. We'll use the CLIP model to generate embeddings for those images, and OpenAI embeddings, basically the small embedding model from OpenAI, for the text chunks that we are going to be extracting. Now, CLIP is trained to
understand and connect images and text in a shared embedding space, so we can use that shared embedding space to ask questions when we're doing retrieval. First we need to download the text and raw images from the Wikipedia articles, and we're going to use several different articles to show you what kind of extraction you can expect. So let's first set up the different
packages that we will need. So we're going to install LlamaIndex and Qdrant. Qdrant is going to be our vector store in this specific case. We'll also install the CLIP implementation from OpenAI.
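In a Colab cell, that install step looks roughly like this; the exact package names, especially the Qdrant integration for LlamaIndex, can vary between versions, so treat these as assumptions:

```python
# Notebook cell: core packages plus the Qdrant integration and OpenAI's CLIP
!pip install llama-index qdrant-client llama-index-vector-stores-qdrant
!pip install git+https://github.com/openai/CLIP.git
```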
Once we do the installation, the next step is to download the data from Wikipedia. So we're going to be looking
at four different articles: one is RoboCop, and the others are the Labour Party from the UK, SpaceX, and OpenAI. So this script will take a list of different topics and then download the corresponding Wikipedia articles for you. It basically extracts the text portion from those articles.
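That download script is essentially a loop over the article titles against the MediaWiki API; here is a sketch of that step (the `mixed_wiki` output folder is my placeholder name, and the exact page titles may need adjusting):

```python
import requests
from pathlib import Path

wiki_titles = ["RoboCop", "Labour Party (UK)", "SpaceX", "OpenAI"]
data_path = Path("mixed_wiki")   # assumed output folder
data_path.mkdir(exist_ok=True)

for title in wiki_titles:
    # Ask the MediaWiki API for the plain-text extract of each article
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    (data_path / f"{title}.txt").write_text(page["extract"], encoding="utf-8")
```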
And we can have a look here. So, for example, this is the OpenAI-related article from Wikipedia, this one is RoboCop, and this one is related to SpaceX. And here are the actual articles on Wikipedia. The next step is going to be to
download and extract the images from those articles. Now, you can do the same process with PDFs as well. So, for example, you can use something like Unstructured to extract images, tables, and text, and then partition
them into separate files. So that process is going to work for PDF files or even Word files as well. In this second step, what we are doing is getting the images from each of the Wikipedia articles. So this loop basically goes through each Wikipedia article and downloads the images from there. Now, there are cases in which you will not be able to download some of the images, because of the way I think the Wikipedia pages are set up; sometimes you just have trouble downloading specific images. So if you run this code and there are certain images that you are not able to download, it will just show you that no images were found on a given Wikipedia page.
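Here is a sketch of what that image loop can look like, again via the MediaWiki API; the folder name, the per-article image limit, and the file-type filter are my assumptions:

```python
import requests
from pathlib import Path

data_path = Path("mixed_wiki")  # same assumed folder as the text download
headers = {"User-Agent": "multimodal-rag-tutorial/0.1"}

for title in ["RoboCop", "Labour Party (UK)", "SpaceX", "OpenAI"]:
    # List the image files attached to the article and resolve their URLs
    pages = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "format": "json", "titles": title,
                "generator": "images", "gimlimit": "20",
                "prop": "imageinfo", "iiprop": "url"},
        headers=headers,
    ).json().get("query", {}).get("pages", {})

    if not pages:
        print(f"No images found on the {title} Wikipedia page")
        continue

    for i, page in enumerate(pages.values()):
        info = page.get("imageinfo", [])
        if not info:
            continue
        url = info[0]["url"]
        if not url.lower().endswith((".jpg", ".jpeg", ".png")):
            continue  # skip SVG logos, icons, etc.
        try:
            img = requests.get(url, headers=headers, timeout=10)
            (data_path / f"{title}_{i}.jpg").write_bytes(img.content)
        except requests.RequestException:
            # Some image URLs simply fail to download, as mentioned above
            print(f"Could not download {url}")
```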
This one is, I think, related to the Labour Party article. There are some other images as well, and I think these are probably related to the RoboCop Wikipedia entry. Now, we're going to be using
the OpenAI embedding model, so we need to set up the OpenAI API key. In my case, everything is set up in the secrets of Google Colab. If you want to run this locally, you'll need to set the API key as an environment variable.
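For reference, that looks something like this; the secret name `OPENAI_API_KEY` is just whatever you stored it under in Colab:

```python
import os

try:
    from google.colab import userdata   # Colab "Secrets" panel
    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
except ImportError:
    # Running locally: the key should already be exported as an environment variable
    assert "OPENAI_API_KEY" in os.environ, "Set OPENAI_API_KEY first"
```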
Okay. Next we need to set up our vector stores. This is going to be a little different from the flowchart I showed you: we're going to compute embeddings separately, both for text and images, and then put them in
a multimodal vector store. Now in this example, we're going to be using Qdrant, because Qdrant supports multiple modalities, so it can handle both image and text embeddings. Some of the other vector stores, like FAISS, don't have that ability, so that's why we chose Qdrant in this specific case. So first we need to create a base
Qdrant client, and then we need to create different collections. A collection is basically a subset of embedding vectors. The first one is going to be the text embedding store: we provide our client, and the name is "text_collection_01". The second one is going to be specifically for images, where we want to use the CLIP model to generate the vector embeddings for the images. And then we'll set up the storage context; this is the way you set up vector stores in LlamaIndex. So we create a single storage context holding both the image and the text vector stores, and we wrap this inside the MultiModalVectorStoreIndex, which is an index specifically designed for storing multimodal embeddings.
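In code, that setup looks roughly like this; the collection names and the local Qdrant path are placeholders, and the import paths can differ slightly between LlamaIndex versions:

```python
import qdrant_client
from llama_index.core import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Embedded (on-disk) Qdrant instance; the path is just a placeholder
client = qdrant_client.QdrantClient(path="qdrant_mm_db")

# One collection for text embeddings, one for CLIP image embeddings
text_store = QdrantVectorStore(client=client, collection_name="text_collection_01")
image_store = QdrantVectorStore(client=client, collection_name="image_collection_01")

# The storage context combines both stores for the multimodal index
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)
```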
The way this is set up is that we read everything that is in the folder which we downloaded, so it will read both the images as well as the text, and it will chunk the text. By default, I think it's about 800 tokens per chunk, right? You can set those values and change them if need be. And it will also create embeddings for those images as well.
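Building the index over the downloaded folder then looks something like this (a sketch; `mixed_wiki` is the assumed folder name, and the chunk size can be overridden through LlamaIndex settings if needed):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices import MultiModalVectorStoreIndex

# Reads both the .txt files and the downloaded images from the folder
documents = SimpleDirectoryReader("./mixed_wiki/").load_data()

# Chunks and embeds the text with the OpenAI embedding model, embeds the
# images with CLIP, and writes everything into the two Qdrant collections
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
```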
The process took a little bit of time, and the resulting vector store has a size of almost 300 megabytes, so it's pretty big. Now, you can actually reduce the size of your vector store by using quantized embeddings, which gives you a pretty good balance between performance and speed. I'm going to be creating a video on that, because I think it's a very important topic if you're trying to put these RAG systems in production with support for a large corpus of data. Okay, after that we are just using
a simple function called plot_images to randomly sample some of the images that are present in the corpus.
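The helper itself is just a few lines of matplotlib; here's a sketch of what a plot_images function like that can look like:

```python
import random
from pathlib import Path

import matplotlib.pyplot as plt
from PIL import Image

def plot_images(image_paths, max_images=9):
    """Show up to max_images images from the corpus on a 3x3 grid."""
    sampled = random.sample(list(image_paths), min(max_images, len(list(image_paths)) or 0) if False else 0) if False else None
    paths = list(image_paths)
    sampled = random.sample(paths, min(max_images, len(paths)))
    fig = plt.figure(figsize=(12, 12))
    for i, path in enumerate(sampled, start=1):
        ax = fig.add_subplot(3, 3, i)
        ax.imshow(Image.open(path))
        ax.axis("off")
    plt.show()

plot_images(Path("mixed_wiki").glob("*.jpg"))
```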
And you can see there are a number of images of different individuals. Here's an image of the original version of RoboCop, and I think there is an image of Sam Altman that is probably coming from the OpenAI article. Here's an image of Elon Musk that is probably coming from the SpaceX article in this case. Now, you can just run queries on top
of the vector store that you created. So, for example, here's a query: what is the Labour Party? The way we do it is we take that multimodal vector store that we created and say, okay, give me an image similarity top-k of five with a text similarity top-k of three. So we want the three most similar retrieved chunks from the text and up to five images based on this specific query, right?
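That query corresponds to a retriever call along these lines (a sketch; ImageNode is how LlamaIndex marks image results, which lets us separate them from the text chunks):

```python
from llama_index.core.schema import ImageNode

# Up to 3 text chunks and up to 5 images per query
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=5)
results = retriever.retrieve("What is the Labour Party?")

retrieved_images, retrieved_texts = [], []
for result in results:
    if isinstance(result.node, ImageNode):
        retrieved_images.append(result.node.image_path)
    else:
        retrieved_texts.append(result.node.get_content())

print(retrieved_texts)
plot_images(retrieved_images)
```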
Now, you will see that the query results in some cases are not great, because they would need a lot more context, and we probably want to run these images through a vision model to generate text descriptions. But here are the results. If you look at it, we get
the initial text chunks. So here's the first chunk, the second
chunk, then here's the third chunk. And then at the end we have information
regarding specific images that the embedding model thinks are closely related
to the query that we are providing. So just to kind of give you a much better
overview, here are the text chunks. It says the Labour Party is a social democratic political party in the UK that has been described as an alliance of social democrats, democratic socialists, and trade unions, right? So this basically goes on and shows us the different chunks that we have, and then the corresponding images that the embedding model was able to retrieve based on that query. Now you will see that the queries are
not really specific, so it will just pick some images that it thinks are closely
related, but if you're looking for very specific information, you could provide much more customized queries in that case. So here's another one: who created RoboCop? Again, it tells us that RoboCop is a 1987 American science fiction action film created by Paul, right? Written by Edward. So I'm not even going to try
to pronounce their last names. So it returned three different text chunks
and then it also returned five different images that it thinks are closely related
to the query that we have provided. So pretty interesting stuff. Now, similarly when it comes to OpenAI
again, we get three different chunks of text and five different images. Now, in this case it somehow also pulled in the Labour Party, because I think there are probably not enough images related to OpenAI for it to retrieve from that article, so it also added some images from the Labour Party article as well. When I asked which company makes Tesla,
there is not a specific article related to Tesla, but I think the article related to SpaceX has some information related to Tesla, since it talks about SpaceX's business, and I think there is probably an image of Elon Musk in that article, so it actually retrieved that image for us as well. Okay, so in this case, we only did the
retrieval part, but the next step is to build on top of this: look at the chunks that are being retrieved by the image embedding model as well as the text embedding model, and combine them together to generate a final response. So if you are interested in the end-to-end system, I will be creating subsequent videos on this topic, and we're going
to look at more advanced solutions. So make sure to subscribe to the
channel so you don't miss that video. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.