Transcript for:
Gemini 2.0 Flash and RAG Explained

Google dropped their most recent model, Gemini 2.0 Flash, and it's probably the best price-to-performance model that you can use today. So I made a post about it saying RAG might not be needed anymore, and a lot of people misunderstood what I meant, and some people got pretty upset. So in this video, I'm going to break down exactly what I meant. I'll go over RAG, why I don't think it's needed in the traditional sense anymore, how these new AI models might change the way you think about building AI products,

and what this means if you're a builder or someone who's just getting into AI. Let's dive in. First, let me quickly explain what RAG is for those of you who might not know. RAG stands for retrieval-augmented generation. It's been around for a while, and it's a very, very common technique used to help LLMs like ChatGPT bring in information that may not have existed in their training corpus. You'll see this a lot with companies like Perplexity, where they augment your questions when they search the web, or in ChatGPT, where you can add files to projects. What happens behind the scenes comes down to context windows. A context window is basically how much text you can give to an AI before it can't take any more. In early 2023, two years ago, which is eons in AI, models had very small token limits; if you look back, you had 4,000-token limits. So RAG at the time was very, very important. The common technique was: you would take your information, create embeddings, and store them in a vector database. If you had a large corpus of text, files, PDFs, you would do this thing called chunking, and this is important: you would take small pieces of text, roughly 256 to 512 tokens, so let's call it a paragraph, turn each one into an embedding, and put that in a vector database. Then, when a user asked a question, you would take their question and search for the most relevant tiny little chunks of information you could grab and give to a large language model to help it answer. Whereas if you have something like large context, which Gemini 2.0 has with its 1 million and 2 million token windows, you can just give the model all of that information and let it answer the user's question.
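To make that concrete, here's a minimal sketch of the chunk, embed, and store pipeline just described. The video doesn't name a specific stack, so the sentence-transformers model, the file name, and the plain in-memory list standing in for a vector database are my own illustrative assumptions:

```python
# Minimal sketch of traditional RAG indexing: chunk the text, embed each chunk,
# store the vectors. The embedding model and in-memory "vector store" here are
# illustrative choices, not anything the video prescribes.
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 512) -> list[str]:
    """Naive chunking: split into pieces of roughly chunk_size words,
    a rough stand-in for 256-512 token chunks ("a paragraph")."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

document = open("my_document.txt").read()     # hypothetical file
chunks = chunk_text(document)
embeddings = embedder.encode(chunks)          # one vector per chunk

# The "vector database", reduced to its essence: each vector stored next to its text.
vector_store = list(zip(embeddings, chunks))
```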

Now, it's important to know the difference. Back then we only had 4,000-token limits; nowadays, the token limits are very, very large.

Now you have things like Gemini, which has a million tokens, and some models have 2 million tokens. You can see the difference here in this chart: GPT-3.5 had 4,000 tokens, which is about six pages, whereas nowadays you can give Gemini 2.0 Flash over a million tokens, which is a lot of text, and it's actually still pretty accurate. If you look here, this is the hallucination rate. For those who don't know, hallucination is how likely the AI model is to make things up, and Google's newest Gemini model has the lowest hallucination rate. These models are very, very different from what they were in 2023. So when I made this post saying that RAG is dead, I meant that the traditional sense of taking a single document and turning it into little chunks no longer needs to exist.

If you look at the code snippets these days, this is about 20 lines of code. If you've used ChatGPT, or if you use anything like Cursor or Sonnet, it can write this for you. And the beautiful part about this is that you can actually just give it the link to the PDF, so it's so much easier now; you don't even need to go through the embedding process (a rough sketch of what that looks like is below). However, there are a lot of specifics I want to cover now that we understand RAG.
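For reference, here's roughly what that ~20-line "just give the model the whole document" snippet looks like, as a sketch using the google-generativeai Python SDK. The video doesn't show its exact code, so treat the calls, file name, and model name as assumptions and check the current SDK docs:

```python
# Sketch of the long-context approach: hand the whole PDF to the model,
# with no chunking, no embeddings, and no vector database.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")           # hypothetical key
model = genai.GenerativeModel("gemini-2.0-flash")

# Upload the full document via the File API (hypothetical local file).
report = genai.upload_file(path="earnings_call.pdf")

response = model.generate_content(
    [report, "Based on this earnings call, is the company's outlook improving?"]
)
print(response.text)
```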

So now let me explain to you why I said RAG, in quotes, is dead. Okay.

And I'll give you a couple of very clear use cases of when you would just give the model the context versus when you wouldn't. So take an example scenario: a transcript of an audio call. Okay.

So let's say we have a large transcript, a super large transcript, and maybe it's an earnings call. Now, in an earnings call there are a lot of things that are said; the CEO might say things like, "such-and-such finances worked out this fiscal year, we used this technology." There is a lot to unpack. Usually these calls are about an hour long, and to keep it simple let's say they're about 50,000 tokens. So now you have a 50,000-token, 50k-token, earnings call transcript. Now, what a lot of people would do is say, "oh wow, what I want to do is turn this into smaller chunks," let's say small, tiny little chunks of maybe 512 tokens. So, paragraphs.

And there are actually a lot of different chunking techniques you can use, which I won't get into, but let's say for simplicity we're doing this: we take this earnings call and we turn it into a bunch of chunks. At 512 tokens each, that's going to be about a hundred chunks.
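The back-of-the-envelope math behind "about a hundred chunks":

```python
# 50,000-token transcript split into 512-token chunks:
transcript_tokens = 50_000
chunk_size = 512
print(transcript_tokens / chunk_size)   # ~97.7, i.e. roughly 100 chunks
```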

Now, in this scenario, the issue you actually run into with naive chunking and RAG is that you cannot reason over this content. In RAG there are two main questions: can I find the facts and the information, and can I reason over the data?

The issue with RAG today is that when you are chunking and embedding, you cannot reason over the information, and this example is a perfect use case. Let's say I'm an analyst and I want to dig into this transcript. If I ask a very specific factual question, like "what was the net revenue?", that's a pretty easy number to find.

As a human, you could probably Ctrl+F "revenue" and find it very quickly, so that's not very valuable to you. Maybe two years ago it was, but not so much anymore. Instead, you might ask something like, "how does their earnings outlook compare to such-and-such?"

You're going to ask a very loaded question that requires reasoning over the transcript. So you might say, okay, I want to know what the outlook is. You want to ask the AI:

is this a good investment or a bad investment? Now, to do that, how would you translate that question into finding the right chunks? When you do this chunking process, generally what happens is you take the user's query, try to find the most similar chunks of text, and then give those to an LLM.
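Here's a minimal sketch of that query-time step, reusing the embedder and vector_store from the earlier indexing sketch; cosine similarity and the top-k cutoff are my illustrative choices, not a specific system's behavior:

```python
# Query-time half of traditional RAG: embed the question, rank the stored
# chunks by cosine similarity, and hand only the top few to the LLM.
import numpy as np

def retrieve(question: str, vector_store, embedder, top_k: int = 5) -> list[str]:
    q = embedder.encode([question])[0]
    scored = []
    for vec, chunk in vector_store:
        score = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# A narrow fact lookup maps nicely onto a few chunks...
context = "\n\n".join(retrieve("What was the net revenue?", vector_store, embedder))
# ...but a question like "is this a good investment?" rarely does, which is
# exactly the problem described next.
```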

For that kind of question, traditional RAG does not work. You cannot do that; you would need a lot of very complex agentic RAG, and there's a whole rabbit hole of different things you could do, but it just doesn't work. Now, let's assume for simplicity's sake that we take this transcript, use that one line of code, and just give it to Gemini. Gemini, with its reasoning, can look at the whole transcript: how the CEO said a specific thing, whatever numbers they might have mentioned at the beginning of the call, the end of the call, and the questions answered in between. It can reason over 50,000 tokens and give you a significantly better and more accurate answer to your question. And this concept is what I meant when I said "RAG killer": the traditional way of doing things that a lot of people assume is RAG, which is "I have a single document, a single PDF, or two PDFs, and I just want to answer some questions about it."

That method is obsolete. You do not need to do that anymore. You can go to Gemini, use Google's models, literally go to AI Studio, drag and drop all the documents you want, and just ask Google. I highly recommend that if you have these sorts of documents, you go to Google and upload them, because you're not going to get a better result. However, RAG, retrieval-augmented generation in the broader sense (because there are so many definitions), is not dead, and I'll give you a very good example. Sometimes people say in the comments, "well, what if I have a hundred thousand documents? What if I have the earnings calls for 2023 and 2024 and 2025, and maybe this transcript is about Apple but for some reason I also uploaded NVIDIA's, and a bunch of irrelevant documents?" In that use case, RAG does apply. Now, again, the difference here is: in that case where I have 100,000 documents, what are you probably going to want to do?

If I have 100,000 documents, I'm probably just going to search for them. I'm not going to overcomplicate things. I'm going to search.

I'm not going to name a particular app, but there are a lot of existing search systems that will help you find them. So now I say, okay, look, I found all of Apple's earnings reports. Okay, great. That's maybe 10 documents.
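As a sketch of that filtering step, here's one way to do it with the rank_bm25 package; the video only says "existing search systems," so the specific library and the toy corpus are my assumptions:

```python
# Keyword search over many whole documents to narrow down to a handful,
# without any chunking. rank_bm25 is just one off-the-shelf option.
from rank_bm25 import BM25Okapi

# Hypothetical corpus: the 100,000 documents, loaded however you store them.
corpus = [
    "Apple Q4 2024 earnings call transcript ...",
    "NVIDIA Q2 2025 earnings call transcript ...",
    "Some unrelated internal memo ...",
]

tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "apple earnings outlook".split()
# Keep the top-matching whole documents -- these go to Gemini intact.
top_docs = bm25.get_top_n(query, corpus, n=10)
```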

However, I still wouldn't go in and chunk those, because you lose the ability to reason over them. What I would do instead is throw all 10 documents at Google Gemini separately. So, okay, let's just go through an example. Let's say I have 10,000 documents and I've managed to filter down to three.

Now, what I like to do more recently, if you don't care too much about paying a little bit more, is parallelization. This is a technique I've seen on Twitter, and a lot of people have been talking about it more recently with the DeepSeek launch.

Pretty much, what you do is find your three, or maybe five, most relevant documents. Instead of chunking them into those little 512-token pieces, because the AI models are so cheap these days, what I would recommend is taking these documents and throwing all three of them in parallel to Google's Gemini. So instead of saying "hey, I'm going to take this and find that," I'm going to take them separately, by the way, separately, and ask each of them the same question. The difference you'll see when you're getting questions answered, and you can go ahead and try this on your own, will be monumental. So what I would do is look at this and say, okay, maybe the user asked some question, a user question.

I take that same question, right? And I give it to all three documents separately, say, as calls to Gemini.

I could call these Google one, Google two, and Google three. Then I take all of these, merge them, and finally answer the user's question or do whatever I want with it. This is not a new technique by any means; it's sort of like a map-reduce, where you're separating the relevant information from the irrelevant. Again, what I'm trying to get at here is that this embed-and-chunk approach, which is what traditional RAG systems are, just doesn't work anymore. Instead, I do a search, and then I take these three documents and I parallelize them.

I'll ask Google to look at all of them separately, take the answers, and finally give those answers to another AI to produce the final answer. Now, this is a significantly more robust system. And the models are so much cheaper today; by the way, Google's models are extremely cheap.
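Here's a minimal sketch of that parallel, map-reduce-style flow, again using the google-generativeai SDK and a thread pool as assumptions; the exact calls, model name, documents, and prompts are illustrative:

```python
# Parallel "map, then reduce": ask the same question of each filtered document
# separately, then merge the per-document answers with one final call.
from concurrent.futures import ThreadPoolExecutor
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")             # hypothetical key
model = genai.GenerativeModel("gemini-2.0-flash")

question = "How has the revenue outlook changed across these calls?"
documents = ["<2023 call text>", "<2024 call text>", "<2025 call text>"]  # hypothetical

def answer_for(doc: str) -> str:
    # Map step: the model sees one whole document, no chunks, no embeddings.
    return model.generate_content([doc, question]).text

with ThreadPoolExecutor() as pool:
    partial_answers = list(pool.map(answer_for, documents))

# Reduce step: merge the per-document answers into one final answer.
merge_prompt = (
    f"Question: {question}\n\n"
    "Answers derived from separate documents:\n"
    + "\n---\n".join(partial_answers)
    + "\n\nCombine these into a single final answer."
)
print(model.generate_content(merge_prompt).text)
```

If cost matters, the map calls and the final merge call don't have to use the same model.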

If you are looking for quality answers, this is the best way to do it, especially because of how low the hallucination rates are and how well these models can find the information you're looking for. So hopefully that explains the post, "RAG is dead" slash "RAG killer," a little bit more. And if you are looking at building RAG or anything like it, I just recommend starting simple.

I feel like people like to overcomplicate things. If you're a solo hobbyist, just keep it simple: do things like uploading the file directly to the model.

Only make it more complex when you need to. Again, I think a lot of people like to add complexity when it isn't needed.

A year from now, this might change. Maybe the models get a hundred times cheaper and a hundred times more efficient, and you might not even need this filtering step.

Who knows? But at the end of the day, I don't think the traditional RAG that most people picture, if you're not too familiar with it, is going to exist. I don't see the point, because you cannot reason over all of the data to get a really sufficient answer. So that's all for today's video. Hopefully that gives you guys a little more understanding of why RAG in the traditional sense doesn't work and why, again, Google's Gemini model is so strong and useful, especially for these kinds of use cases, whether you're building a product or you yourself are an analyst who wants to read all the documents.

The new model from Google, very, very useful. And guess what? One other thing to add.

When you do this step of giving the models all this stuff, you can mix and match the AI models. You could say, "look, I'm going to get a really, really cheap model, have it look at all these documents, and give us answers."

"And then I'm going to take those answers and give them to o1 Pro or o3." You can hand them to a much smarter model, as opposed to trying to get the AI model to pick the right chunks, which just doesn't work. So that's all. Hope you guys enjoyed the video.

Peace.