Welcome to No Priors. Today we're talking to Tengyu Ma, Assistant Professor of Computer Science at Stanford, and the co-founder and CEO of Voyage. Voyage trains state-of-the-art components for next-generation retrieval systems, including embedding models and rerankers. We're really excited to talk about his research and the RAG debate today.
Welcome, Tengyu. Yeah. Thanks so much. Thanks for having me here. We're looking forward to the debate.
Yeah. Why don't we start with just a little bit of an overview of your research agenda to date? Because I think uniquely it covers a broad range of fields within and around deep learning from theory to RL to embeddings and optimizers. So can you talk a little bit about how you pick the directions you have? Yeah.
So I think most of the papers I wrote have some theoretical thinking in them. I guess maybe that's the commonality. Besides that, I think I've worked on quite a few topics, as you mentioned, ranging from theoretical understanding, mathematical proofs about deep learning systems, all the way to practical large language models, reinforcement learning, deep reinforcement learning.
And these days, recently, I think what we are working on is more centered on the efficiency of training large language models and improving reasoning for large language models. So my vision is that in the future, efficiency is very important because we are running out of data and compute. So we have to either use the data much better or use the compute much better. And also, reasoning seems to be a pretty important direction.
And also, in some sense, kind of a risky direction, in the sense that we don't know exactly how fast we can solve those challenging reasoning questions yet. Can you mention a few of the key papers or work that you or students in your lab have done, just so our listeners can look them up? In the very early days, I think I worked on some of this matrix completion, optimization for matrix completion. That's like 10 years ago.
And then I moved on to embedding models, like sentence embeddings, vector embeddings. One of the papers we wrote is actually a very simple paper where we averaged the word embeddings to get sentence embeddings, and then we did some transformations using PCA to make the performance much better. That was even before Transformers came out.
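That averaging-plus-PCA recipe can be sketched in a few lines. This is only a rough illustration of the general idea described here, not the exact method from the paper (which includes additional weighting details), so treat the specifics as assumptions:

```python
import numpy as np

def sentence_embeddings(token_vectors_per_sentence):
    """Average each sentence's word vectors, then remove the projection onto
    the corpus's top principal component (rough sketch of the idea above)."""
    # Step 1: average word vectors within each sentence.
    embs = np.stack([vecs.mean(axis=0) for vecs in token_vectors_per_sentence])
    # Step 2: find the top principal direction and subtract it out.
    centered = embs - embs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc = vt[0]
    return embs - np.outer(embs @ pc, pc)

# Usage: one (num_tokens, dim) array of word vectors per sentence.
rng = np.random.default_rng(0)
sentences = [rng.normal(size=(n, 300)) for n in (5, 8, 3)]
print(sentence_embeddings(sentences).shape)  # (3, 300)
```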
And then I think I moved on to Transformers, large language models, and contrastive learning, which is the new way of training embedding models. That direction started with some of the papers on using contrastive learning for images, and we worked on improving those and understanding why contrastive learning works. And recently, we worked on optimizers for large language models. For example, one of the papers we wrote last year was Sophia, where we found a new optimizer that can improve training efficiency by 2x for pre-training.
This is great. Adam is very old at this point. Yeah, it's 10 years old now.
I think that's the interesting part about it. So optimizers... I think people have tried so many times in the last 10 years; there were so many papers published with improvements over Adam in various cases.
But so far, Adam is still the default algorithm for training large language models. And that's why we thought that it's the time to really... We spent a lot of time on this.
I think I started probably around 2018, 2019. And I asked a few students to work on this. And finally, we had one paper out after a few years, after a few failed projects and failed ideas. And recently, I think one of our friends at Facebook actually used this in their large-scale multimodal training.
And they found that on that scale, I don't know exactly how many parameters there are, but I assume it's more than 100 billion parameters. They found that on that scale, there is a 1.6x improvement in the efficiency of the training.
So that's like $10 million versus $16 million. That's super exciting. Yeah.
I think, you know, Sophia has an opportunity to be really, really impactful. You started a company last year taking leave from Stanford. Given your work has been like theoretical, but with practical applications, like what drove you to do that?
I think I... came to Stanford partly because there's a very strong industry connection here at Stanford compared to some of the other universities. And also, probably, entrepreneurship is just part of my career plan anyway. And in terms of the timing, I felt that this is the right timing in the sense that the technologies are more and more mature, so right now seems to be the right time for commercialization.
For example, I think... One story I have is that I looked up some of my slide decks for my lectures at Stanford CS229 seven years ago, when I started to teach at Stanford. At that point, we had a lecture with Chris Ré on applied machine learning.
So how do you apply machine learning in industry? There were seven steps there. The first step is you define your problem. The second step is you collect your data. And you choose the loss function, you train it, and you iterate, so on and so forth.
So it was pretty complicated at that point. And now the foundation models have risen to power. And in the new foundation model era, the only thing you have to do is, you know, someone will train the foundation model for you, and then you tune a prompt, and you add retrieval-augmented generation on top of it, and that's pretty much it.
So applying machine learning and AI to an industry environment is much, much easier than seven years ago. And that's why I felt that this is probably the right time to commercialize many of the technologies, because the technologies are more mature. Yeah, this is actually, I mean, a core premise even for the investing fund that I started, Conviction: that, you know, somebody's doing the bulk of the work for you in a more general way.
And so the application of AI in industry is just much, much cheaper, right? Because you only do the last few steps, or a different set, but the last few steps in essence.
So maybe you can talk about, just given a wide range of research, the problem you focus on with Voyage that you saw with customers. Yeah, so with Voyage, I think we are mostly building these two components, rerankers and embeddings, for improving the quality of the retrieval or the search system. So the reason why we focus on this is because we talked to so many customers, and we found that right now, for implementing RAG, the bottleneck is not that it's hard to implement.
You can just connect the components and have your RAG system ready very quickly. But the bottleneck seems to be the quality of the response. And the quality of the response is heavily affected, or almost bottlenecked, by the quality of the retrieval part. If the large language model sees very relevant documents, then it can synthesize very good answers. Even Llama 7B can do that very well.
Can you just give a general intuition for what a RAG system is and some of the applications of it? Yeah, so I guess just a little bit of background. So with retrieval-augmented generation, the idea is that there's a retrieval step and there's a generation step. So the main point here is that if you just use a large language model as a black box, as is, then the large language model wouldn't know anything about the proprietary information inside the company. And it doesn't know enough context about the use cases.
And the retrieval-augmented generation stack is about: you first retrieve some knowledge from, for example, inside a company, and then you give that knowledge to the large language model so that the large language model can generate or synthesize a good answer without hallucination. This has been found to be very, very useful in reducing the hallucination rate. And so there are two steps. The first step is to retrieve some relevant information given the query, and then this relevant information is given to the large language model.
The retrieval step is important because once the large language model sees the relevant information, it can reduce the hallucination rate dramatically, because it uses the relevant information as an anchor to refine the answers in some sense. And what we are doing here is that we want to improve the quality of the retrieval, or the relevancy or accuracy of the retrieved documents and information. And the way to do this is that there are two steps.
The first step is that you vectorize all of your documents or all of your knowledge base. So you turn the documents to vectors, you turn the videos into vectors. You turn your code into vectors. Code into vectors, everything into vectors.
And so the vectors are the representations of each piece of the knowledge or documents, and they also serve as the indices. And then you put these vectors into a vector database. And then you search for the relevant information using the vectors as indices. Where are you seeing RAG applications today? What are customers building?
What are the most common systems? Yeah, so we have a lot of users and they are all over the place. We even have a customer that is a chemistry company, building a RAG system to understand their chemistry documents or product descriptions. And I think it's almost everywhere: finance, legal, code retrieval, code generation, so on and so forth. I think it can be applied to almost any case.
And also even for individual users, where you have a lot of personal information and you want to have a RAG system on your phone so that you can access your past information in a much easier way. And you want to retrieve it. For example, we've all seen that when you search your documents on your laptop, it's actually pretty hard. You have to use the exact file name.
It will be much easier if this search can be semantic-based. RAG is a relatively new architecture. I think your average enterprise technology leader had not heard the term before the last year or so, and it became popularized in research over the last few years. But there is already a debate, I think, in terms of opinions from people at different large labs and in academia, about whether or not you need a RAG architecture to work on proprietary data. And just to describe some of the alternative views, I think there are kind of two alternative points of view given. One is a sort of agent chaining architecture, where you are inputting your data and knowledge, you know, chemistry, code, law, finance, whatever documents, into a series of LLMs that just operate with instructions on it, for example, to summarize or categorize it. Or you simply feed everything into LLMs with infinite context or actively managed context, versus explicitly vectorizing anything.
And so I would love to get your reaction to that as an alternative to RAG. Actually, there was also a debate last year about RAG versus fine-tuning. And I think that debate has kind of reached a consensus now. It sounds like RAG is much easier than fine-tuning. And fine-tuning in many cases doesn't work, because you need a lot of data to see results, and there are still hallucinations even after fine-tuning.
And now, as you said, the debate becomes RAG versus agent chaining or long context. So maybe let's talk about long context first. So I think there are probably two answers to this, from different angles, because long context right now is not practical yet, right? So we have to kind of anticipate what long-context transformers can do and then have the debate at a future time, in some sense, or anticipate the debate at a future time. In the near term, I think the long-context transformer where you just put all the proprietary data, 1 billion tokens, into the context of the transformer will be very, very expensive, right? If you use the prices right now, it's going to be just impossible to do it.
It's probably like five or ten orders of magnitude of difference, depending on how many documents you have in the context. Of course, you can bring the cost down; for example, one approach is to cache the activations of all of the internal operations over the documents you put in the context. So that will bring the cost down by a lot, but I think still, if you do the calculation, theoretically, it's still much more expensive than RAG.
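To make "much more expensive" concrete, here's a rough back-of-envelope comparison of caching activations for a 1-billion-token context versus storing embedding vectors for RAG. All of the model sizes, precisions, and dimensions below are illustrative assumptions, not measurements:

```python
def kv_cache_bytes(tokens, layers=80, hidden_dim=8192, bytes_per_value=2):
    # Two cached tensors (key and value) per layer, one hidden-dim vector
    # each, per token, in 16-bit precision (assumed model shape).
    return 2 * layers * hidden_dim * bytes_per_value * tokens

def embedding_store_bytes(tokens, chunk_tokens=512, dim=1024, bytes_per_value=4):
    # One dense vector per chunk of text (assumed chunk size and dimension).
    return (tokens // chunk_tokens) * dim * bytes_per_value

tokens = 1_000_000_000
print(f"Cached activations: ~{kv_cache_bytes(tokens) / 1e15:.1f} PB")        # ~2.6 PB
print(f"Embedding vectors:  ~{embedding_store_bytes(tokens) / 1e9:.1f} GB")  # ~8 GB
```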
So I think that's the more practical answer. So in terms of cost, it's going to be much more expensive than RAG, because you have to save all of these activations or intermediate computations, most likely in GPU memory, or maybe in CPU memory, for the whole 1-billion-token context. You know, you may argue that, okay, over time, everything will become cheaper and cheaper. But RAG will become cheaper as well, right? Because many of the technologies under RAG are neural-network-based, and the GPUs will become cheaper and the neural networks will become smaller. So my prediction is that RAG will be much cheaper than long context going forward. And another way to think about this, maybe just from first principles: my analogy for long context is that, in some sense, the context is the short-term memory, right?
And RAG is more like long-term memory, in some sense. So the question is, for example, when you answer any question, why do you have to go through the entire library every time, right? Like, put the entire library in your short-term memory to answer a single question, right?
It sounds like the right approach should be that for every single question, you retrieve some subset of the information and use that to answer the question. That seems to be the most efficient way to do it. There should be some kind of hierarchy, in some sense, in terms of how we
solve the problem, so that we can get the best efficiency. Even in computer architecture, the hardware stuff, you have different levels of caching. So you have disk, you have CPU cache, and so forth.
So in that sense, I feel like the more hierarchical, two-level kind of system like RAG is more cost-efficient. Yeah, I mean, the analogy certainly makes sense. I think there is another thread of discussion of, like, what does long-term memory for LLMs look like, where, you know, it is something managed by the LLM itself, but I do not think that is a well-answered question. And, like, RAG may just be a part of that answer.
So the embedding models that we run are, in some sense, the large language models that are managing the long-term memory. Of course, there might be variants and other ways to manage the long-term memory. But I think it will be somewhat similar. It's going to be, you know, the technology always evolves, right?
Gradually, right? So maybe two years later, Voyage or maybe other companies will have a new version of the long-term memory, which is based on, you know, embedding models, but kind of like extending the embedding model in some way. That's entirely possible. Yeah. I do think it's useful to sort of contextualize for people who are not working with...
sort of data sources for LLMs at scale every day, like what sort of token limitations there are, right? You know, we go from a few thousand tokens to something like Gemini 1.5 Pro, with a context window of a million tokens, right? And if you sort of think of that in word count, that's maybe five books, or like 25, 30,000 lines of code, and obviously a limited amount of video and audio. And so I think the ability to make reasoning decisions on more than that amount of data is obviously going to be needed. And the questions to me are really, like, you know, does efficiency matter, both from a cost perspective and a speed, like a latency, perspective?
Right. And how much can you push the context window? And like, you know, does hallucination management matter? And so I think there are lots of arguments for like RAG being very persistent here.
Yeah, yeah, exactly. And just to add a little bit on that.
So 1 million tokens, five books, right? But many companies have 100 million tokens. That's 100x difference, right? So 100x, you know, for cost is a big difference. That could be just, you know, $100K versus like $10 million, right?
$10 million is unacceptable, but 100K sounds okay. Yeah, I think that's probably what's going to happen. Like, so at least for many of the companies, right?
So right now, if they have 100 million tokens, I don't think they can use long-context transformers at all, because it's way too expensive. Right.
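For a sense of how figures like $100K versus $10 million could arise, here's one purely illustrative calculation. Both the per-token price and the number of queries are assumptions picked to show the 100x scaling, not quoted prices:

```python
# Assumed numbers for illustration only.
price_per_million_input_tokens = 10.0   # dollars per million input tokens (assumed)
num_queries = 10_000                    # queries that each re-read the full context (assumed)

def long_context_cost(context_tokens):
    return num_queries * (context_tokens / 1e6) * price_per_million_input_tokens

print(f"1M-token context:   ${long_context_cost(1_000_000):,.0f}")    # $100,000
print(f"100M-token context: ${long_context_cost(100_000_000):,.0f}")  # $10,000,000
```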
And the simplest thing for me is actually for a system to look at the entire code base or some representation of the entire code base versus the portion of it that could fit into context today. Yeah. What about the other piece, like the idea of agent chaining and using LLMs to manage the data in that form? So agent chaining, this is a growing area and many people are doing research on it. I think it's a little bit less well-defined in some sense.
At the first level, I would say that I think it's kind of orthogonal to embedding models and rerankers, to some degree. Because even when you have agent chaining, you still probably use embedding models as part of the chain. You probably do iterative retrieval as part of the chain.
And of course, you use large language models as part of the chain as well. In some sense, it's an orthogonal direction. So I would probably rephrase agent chaining as more like an iterative, multi-step retrieval, large-language-model-augmented system. So some part of the retrieval is probably done by a large language model, sometimes part of the system is done by a small large language model, and some part of the system is done by an embedding model, so on and so forth. So in that sense, I feel like it's somewhat orthogonal.
Yeah, and I feel like some of the motivation for agent chaining to begin with is the same efficiency motivation as RAG. Yep. Exactly. But if you use a very, very large language model to manage the system, the knowledge system, I think you again lose the efficiency. So it has to be a somewhat smaller model to manage the knowledge.
And then at that point, an embedding model might be the right thing to use in that agent chaining framework. Maybe another angle to look at this is whether we should do iterative retrieval versus just retrieving once. I think iterative retrieval is definitely useful, especially because right now there is still a lot of headroom in embedding model performance. So that's why sometimes you have to retrieve multiple times, because the models are not clever enough.
However, in the long run, my suspicion is that iterative retrieval will be useful, but it will be a bit less useful as the embedding models become more and more clever. So... Once the embedding model is more clever, then maybe one round or two rounds is going to be enough.
If we go ahead and just assume that RAG is at least a dominant architecture for enterprise use cases where you care about proprietary data that is large, with reliability, how do you go about improving a RAG system? Right. You can improve the LLM itself, but what are the other components that you guys are working on, or what are the challenges, from the user's, the builder's, perspective to improve retrieval quality? Yeah, so I guess there are a few ways, right? One way is that you improve the prompting of the large language models.
So, for example, you could tell the large language model to abstain if there's no relevant information in the retrieved documents. But because the large language models are so good these days, I think you don't need a lot of prompting anymore. They just respond to the instructions so well.
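As one concrete illustration of the kind of abstain instruction mentioned here, a RAG prompt might look something like the following. This is a generic sketch, not a template from any particular product, and the wording is just a placeholder:

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble a grounded prompt that asks the model to abstain when the
    retrieved context does not contain the answer (generic sketch)."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, reply \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_rag_prompt("What is our refund policy?", ["Refunds are issued within 30 days."]))
```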
And then the next thing is to improve the retrieval part, which is the bottleneck, in my opinion, because most of our users found that if they improve the retrieval quality, that directly affects the response quality. And improving the retrieval part, I think there are two ways. One way is you improve the embedding model. The other way is that you improve some of the other things on top of that. For example, how you chunk the data, whether you do iterative retrieval, whether you put some of the meta information in the data, so on and so forth. So basically, I would say there are two ways of improving.
One way is to improve the neural networks, either the embedding models or the rerankers, and the other is to improve the ways you use the networks with software engineering, right? Better chunking, iterations, or other kinds of heuristics or tricks on top of that. And what we are specialized in is improving the neural networks, because that requires a lot of heavy lifting. It's a very data-driven approach.
We train our neural networks on trillions of tokens, at least, and we fine-tune them for special use cases. And this is something that a company should probably do, instead of the end users optimizing it themselves.
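For readers who haven't seen it, the chunking layer being discussed is roughly this kind of preprocessing. A simple sketch, splitting on whitespace as a stand-in for a real tokenizer, with assumed chunk sizes:

```python
def chunk_tokens(text, chunk_size=512, overlap=64):
    """Split text into overlapping fixed-size chunks. With long-context
    embedding models, chunk_size can grow or chunking can be skipped."""
    tokens = text.split()  # placeholder for a real tokenizer
    step = chunk_size - overlap
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]

chunks = chunk_tokens("some long document " * 1000, chunk_size=512)
print(len(chunks), "chunks")  # 7 chunks of up to 512 whitespace tokens each
```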
And my long-term vision here is that some of the software engineering layers on top of the networks will be less and less needed as the networks become more and more clever. So, for example, right now we already see that chunking is becoming less needed, because the context windows become longer and longer and the long-context embedding models, or relatively long-context embedding models, are getting better. Long context here means like 10K, for example, maybe 16K, so that you can put a 50-page PDF into it.
Because these long-context embedding models have become much better, there's less of a need to chunk the documents into pieces of like 512 tokens. And I think this will happen in other dimensions as well, right? So maybe in the future, you don't have to turn your images into descriptions and then give them to the text embedding model. That's what people are doing right now: everything is turned into text, and they use a text embedding model. But when the embedding models are more clever and multimodal, then you don't have to do that anymore.
Can you talk a little bit about just, like, the intuition for how fine-tuning or domain-specific embeddings improve performance? Yeah, fine-tuning and domain-specific embedding models are what we at Voyage are very good at. So just to have some context here: what we do is that we start with a general-purpose base embedding model, which we trained from scratch. And from there, we first fine-tune, or continue pre-training, whatever you call it, on some domain-specific data. So for example, we fine-tune on two trillion tokens of code snippets, and then we get a code embedding model.
And we do the fine-tuning on one trillion legal tokens, and that's how we got the legal embedding model. And for these domain-specific embedding models, we didn't use any proprietary data, so that everyone can use them, but they really excel in one particular domain, and the performance in other domains is not changed much. And the reason why we do this is because the number of parameters in the embedding model is limited.
Because you only have a latency budget, something like maybe one second, sometimes like 200 milliseconds, some people even want 50 milliseconds, it's basically impossible to use more than 10 billion parameters for embedding models. So we have limited parameters.
And customization is very important, because customization means that you use the limited number of parameters on the right tasks, the right domain, so that you excel in that domain. There's no way that you can use these 10 billion parameters to excel at everything. Right. So that's why you have to specialize in one domain. And we have seen 5% to 20% improvements from this domain-specific fine-tuning, depending on the particular domain.
For code, we have seen 15% to 20% of improvement, partly because we have a lot of data there. And the headroom there is also bigger because code retrieval requires a lot of deep understanding of the algorithmic part of the code. And for legal domain, the baseline is a little better, so the headroom is slightly smaller. So that's why we see 5% to 15% improvement depending on the data sets.
For some of the very complex legal data sets, we have seen bigger improvements. Just to make sure that our listeners can picture exactly where the latency cost is coming from here: in a search system like this, your data, you know, has been vectorized by an embedding model, but then every query also needs to be translated into an embedding and then compared to the embeddings of your knowledge in order to feed that LLM for the generation that you want, right? And so there's inference-time latency here as well.
I just think that's not obvious if somebody hasn't built a RAG system. Yeah, exactly, exactly. So basically, at the inference time, you have to first turn a query into vectors and then...
do the search with the vector database. And actually, related to this, the dimension of the vectors you produce also affects the latency of the vector-based search. If the dimension of the embedding is only 100, then it's going to be much, much faster than when the dimension of the embeddings is 1,000. And actually, this is something we are very good at as well.
So we produce embeddings that have like a 3x, 4x smaller dimension than some of the competitors. Yep. I mean, intuitively, you are creating embedding models that use a limited number of parameters and dimensions, given the sort of latency budget that any application has, to create the best possible representation of proprietary data or domain-specific data.
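To make the dimension-latency point concrete: a brute-force similarity search does work proportional to the number of documents times the embedding dimension, so a 3x to 4x smaller dimension means roughly 3x to 4x fewer multiply-adds per query. A minimal sketch over unit-normalized vectors, not any particular vendor's implementation:

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=5):
    """Brute-force cosine search over unit-normalized vectors.
    Work per query is O(num_docs * dim), so halving dim halves the compute."""
    scores = doc_matrix @ query_vec          # one dot product per document
    idx = np.argpartition(-scores, k)[:k]    # top-k candidates, unordered
    return idx[np.argsort(-scores[idx])]     # sort those k by score

rng = np.random.default_rng(0)
docs = rng.normal(size=(100_000, 1024))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=1024)
query /= np.linalg.norm(query)
print(top_k(query, docs))
```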
Yeah, exactly. And going back to the domain specificity and fine-tuning, the second level of customization is that we can customize to a particular company, right? So we fine-tune on the proprietary data of a particular company, and we can see a 10% to 20% improvement on top of the domain-specific fine-tuning as well. So of course, there's a total budget in terms of how much additive improvement you have there. So if you start with like 50% accuracy, then you only have 50% headroom.
But if you start with 90%, you only have 10% headroom. So the improvement, the absolute improvement varies a little bit across the domains. Maybe just advice to people who are building RAG systems.
At what point do they begin to invest in some of these retrieval components? Yeah, I think they can do it even from day one, as long as they have a prototype available. So basically, my default suggestion for our users is that when they have the RAG system, first of all, of course, you want to connect the components and at least see some response.
And then probably do some kind of basic profiling in terms of the latency and the quality. So you can check the retrieval quality, meaning how often you retrieve relevant documents. There are some default ways to evaluate the retrieval quality. And then you also do the end-to-end evaluation of the responses.
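One common way to do that basic retrieval check is a hit-rate or recall-at-k metric over a small labeled set of query and relevant-document pairs. A minimal sketch, where the retrieval function and document ids are placeholders you would swap for your own stack:

```python
def recall_at_k(eval_pairs, retrieve, k=5):
    """eval_pairs: list of (query, set of relevant doc ids).
    retrieve(query, k) -> ranked list of doc ids from your retrieval stack."""
    hits = 0
    for query, relevant_ids in eval_pairs:
        retrieved = set(retrieve(query, k))
        if retrieved & relevant_ids:   # at least one relevant doc in the top k
            hits += 1
    return hits / len(eval_pairs)

# Example with a stubbed retriever (ids are purely illustrative).
pairs = [("refund policy?", {"doc_12"}), ("api rate limits?", {"doc_7", "doc_9"})]
fake_retrieve = lambda q, k: ["doc_12", "doc_3"] if "refund" in q else ["doc_1", "doc_2"]
print(recall_at_k(pairs, fake_retrieve, k=2))  # 0.5
```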
And then you can see which part is the bottleneck. And in many cases, people find that the retrieval quality is not good, so the final response is not good. And then you can swap some of the components. You can say, I'm going to try Voyage embeddings, I can try the Voyage rerankers, which we haven't discussed too much.
And you can try various different embeddings, and possibly various different large language models as well. Maybe just zooming out, you started by saying that in order to have the debate about RAG versus alternative architectures for working on proprietary data, you need to predict forward, right?
Any predictions for how these systems change as LLMs improve dramatically? If we look at the next generations of OpenAI, or GPT, and Claude and the Mistral models and Llama and such?
Yeah, so my prediction is that the system will become simpler and simpler. Maybe this is my biased view. At least this is something that we are working towards. So the idea would be that it's a very, very simple system. So you just have three components, like a large language model, a vector database, and an embedding model, and maybe four components, with another reranker. Okay. Which refines the retrieved results. And you connect all of these, and the neural networks do everything else.
You don't have to worry about anything like chunking, multimodality, or changing the data format, because the neural networks can do most of it, right? So seven years ago, if you talked to any of the so-called language models, you had to turn the data into a very, very clean format. And now you talk to GPT-4, you can have typos, you can have all kinds of weird formats.
You can even dump JSON files into it, right? So the same thing will happen for embedding models as well. So my vision is that in the future, AI will just be a very simple software engineering layer on top of a few very strong neural network components.
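Sketched as code, the three-or-four-component system described here might look like the following. The embed, search, rerank, and generate calls are hypothetical placeholders standing in for an embedding model, a vector database, a reranker, and an LLM, not any specific vendor's API:

```python
def answer(question, embed, vector_db, rerank, llm, k=20, top_n=5):
    """Minimal RAG loop: embed the query, fetch candidates from the vector
    database, rerank them, and let the LLM answer from the survivors."""
    query_vec = embed(question)                      # embedding model
    candidates = vector_db.search(query_vec, k=k)    # vector database lookup
    best = rerank(question, candidates)[:top_n]      # reranker refines the candidates
    context = "\n\n".join(doc.text for doc in best)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")  # generation
```

The thin software layer is just this glue; in the vision described above, the quality lives in the neural components being wired together.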
Yes, I think my bias that it is actually all going to be AI, versus complex, you know, discretized software systems, is clear, but I believe it's directionally right. Maybe zooming out to just get a little bit of your perspective as a founder: like, you know, what are one or two top learnings you have about starting the company as an academic, even despite your work with Google and other companies before? Yeah, I think it's very, very different. Founding a company is very different from doing research at big tech. Actually, it's a little bit closer to academia, because to run a university lab, I'm the CEO, CTO, CFO, and HR for the university lab.
So you touch on a little bit of everything, but at a slightly different scale. So I think one of the biggest things I learned actually came from one of our angel investors: that I should read some of the books. Even though, for a more experienced entrepreneur, many of the books are probably very basic, for me they are very, very useful, even the basic books, including Elad's book, by the way.
But his book is a little bit advanced, in the sense that it's talking about how to scale from 10 people to a thousand people. And I only read a few chapters of that, because we are about 10 people right now. So, yeah. And also talking to a lot of angel investors, talking to Sarah and my other lead investors. So I think all of this helped me a lot in reducing the unforced mistakes in this process.
To me, I think it's really about how to reduce the number of errors you make so that you can maximize the efficiency. At least this is what happens to me. And also how to correct the mistakes as fast as possible.
If you could correct mistakes one week after you made them, versus, like, one month after you made them, then that's a 4x efficiency improvement. Very theoretically consistent with your, you know, vein of research.
Last question: you know, you have been personally productive, you have a productive research lab, but you've started a company. What do you think the role of academia in AI is in this age of, like, scaling?
Because most of your former students, they essentially all work at OpenAI or Anthropic with a few professors and Citadel folks in the mix. And the ones working with you, right? Yes, yes.
In academia, this is a little bit of a controversial topic. I think different people have different views. My view is that academia probably should work on some questions that are different from what industry is good at. So if we are only working on how to scale up the system, then obviously the incentive is not right.
We don't have enough capital there. And even OpenAI, I guess, argues that you need a lot of capital to start to do this, in some sense. So at the very beginning, I think the point is that it cannot be non-profit, because if it's non-profit, then you don't have enough capital and you cannot scale up enough.
I think I kind of agree with that. And that's why in academia, it's very hard to scale up and have enough resources to do the large scale research. However, I think in academia, there are many, many other things that we can do on a smaller scale. And we probably should focus on more long-term innovations.
So what I told my students at the lab is that we should think about what will be the breakthrough in three to five years, as opposed to how you help OpenAI improve their large language models for GPT-5. So that's why we work on optimizers, which are like 10 years old; Adam is a 10-year-old optimizer.
And we say, okay, that sounds like a long-term project. Maybe in five years, we can improve the optimization efficiency by 5 to 10x. That's going to be a game changer for the whole landscape.
So if we improve the efficiency by 10x, I guess that's like $100 million versus $10 million for training GPT-5. Then I think that would change the landscape a lot in the industry. So efficiency is one of the things I spend a lot of time on. Another thing is reasoning tasks. I think the reason why I identified that as one of my lab's directions is because it's challenging and it requires a lot of very innovative research.
It's very unclear whether the scaling laws are really enough to get you to prove the Riemann hypothesis, or any of the math conjectures. And also, you have to reach superhuman performance in some sense, right?
So if you train on just the Common Crawl data on the web, can you be a good mathematician? It's kind of very hard to believe that. So we need more innovation there. So that's pretty much what we are doing at the university lab. We try to work on the three-to-five-year agenda and on a smaller scale.
I think that's an inspiring note to end on, and a very open-minded one about what is still to be figured out. Thanks so much for doing this, Tengyu. Thanks so much. Find us on Twitter at NoPriorsPod.
Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify or wherever you listen. That way you get a new episode every week.
And sign up for emails or find transcripts for every episode at No-Priors.com.