Transcript for:
Cost-effective LLM Routing with RouteLLM: A New Framework by LMSYS

Okay, so recently I've been talking a lot in videos about the whole idea of cheaper, faster models and using them instead of the more expensive models a lot of the time. The big reason for that is I'm seeing so many people waste money making calls to GPT-4 or Claude Opus for things that don't actually need that kind of model; they'd be much better off going with a faster, cheaper model. The one challenge with that has always been: what if some of your calls to the LLM actually do need the big, more powerful model?

So what I want to talk about in today's video is something that's been released by LMSYS. These are the people who run Chatbot Arena, who were some of the first to create models like Vicuna, and who for the past year-plus have been benchmarking all the different models and giving them Elo scores, that kind of thing. One of the cool things they just released earlier this week is a new open-source framework called RouteLLM, which is basically an open-source framework for cost-effective LLM routing.

The idea here is that sometimes you're going to be using your cheaper model, whether that's a Llama 3 8B, a Gemini Flash, a Claude Haiku, something like that, and at other times you're going to want your really powerful model, like GPT-4, Claude Opus, or Gemini Ultra. The challenge is: how do you decide when to use one versus the other? Up until now, for the projects I've been working on, what I've generally done is make certain calls to the cheaper models and certain calls to the more expensive models. What this framework proposes instead is a router in the middle that looks at the incoming prompt and decides: is this best suited to a really strong model like GPT-4, or can we just answer it with a weaker model like Gemini Flash or Llama 3? By doing this, the router decides which model gets used based on the query coming in, and overall that saves a lot of money.

If we look into their blog post a little, they're reporting cost savings of over 85% on various datasets they've tried, while still getting a very high level of accuracy on those benchmarks. It manages that by deciding that, for this particular question, we don't need the top model and can go for a cheaper one. It also shows that something like the GSM8K dataset is obviously a lot harder than some of the others: the cost savings are lower there, probably because the router is falling back to GPT-4 for a lot of those queries. And even with these cost savings, they're still able to achieve 95% of GPT-4's performance.

So this is a really nice idea that they've put out as a paper, but far more than that, they've released the open-source framework, the code, the datasets they used, and the trained router models that you can take and start putting into production.
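To give a sense of what using it looks like, here's a minimal sketch modeled on the RouteLLM README: the controller exposes an OpenAI-style chat-completions interface and silently routes each request to the strong or weak model. Treat the exact class name, the "mf" router id, the model strings, and the calibrated threshold in the model name as assumptions to verify against the repo:

```python
import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-..."  # provider key for the strong model

# One router ("mf" = matrix factorization) in front of a strong and a weak model.
# Model identifiers here follow the README's example and may have changed.
client = Controller(
    routers=["mf"],
    strong_model="gpt-4-1106-preview",
    weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# The number in the model string is a cost threshold calibrated to hit a target
# fraction of strong-model calls; 0.11593 is the README's example, not a constant.
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)
```

The nice part is that it's a drop-in replacement for a normal chat-completions call, so you don't have to restructure your app to try it out.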
And if you do have an app that's in production and you're looking at ways to save money, this matters: quite a few of the apps out there in production now are often borderline between being profitable and not, and a lot of that comes down to which models you're actually using. This kind of thing gives you the benefit that when 80% of your queries really could be handled by a cheap model, and you only need the really strong model for the other 20%, you get massive cost savings.

So let's look at how they actually do this. They start with existing datasets of human preference data: for this particular question, Model A said this, Model B said that, and people picked Model A over Model B 80% of the time, that sort of thing. From that, they build a number of different models to try and predict those preferences on prompts they haven't seen before. The obvious starting point is an embedding, and that's what they do; I think they use the OpenAI small embeddings here. Once they've got that representation, they look at different ways to train on it.

The first router is a similarity-weighted one. This is basically where they work out a weighted Elo calculation based on the similarity of the embeddings. It's a little more complicated than just taking a cosine score: they have to work out the similarity and then relate it to the different models involved.

The second is a matrix factorization model, which is a cool one. Imagine you've got a really big matrix where for some cells you know that this model is better than that model for this particular phrase, but for most of it you're missing data. What they do is come up with two matrices that they can multiply together to approximate the data they do have, and at the same time that fills in all the empty spots. Using that, they can predict, for a new prompt they haven't seen before, which model is most likely to work; and it turns out this one is probably the best router, which I'll come back to.

The third router uses a BERT model trained up as a classifier, and the fourth is similar but uses an LLM classifier instead.
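To make the similarity-weighted idea concrete, here's a rough sketch under my own assumptions; the exponential weighting, the function names, and the single strong/weak pair are illustrative, since the actual router fits Bradley-Terry coefficients across many model pairs:

```python
import numpy as np

def route_similarity_weighted(query_emb, train_embs, strong_won,
                              threshold=0.5, gamma=10.0):
    """Pick "strong" or "weak" for one incoming prompt.

    query_emb:  (d,)   unit-normalised embedding of the incoming prompt
    train_embs: (N, d) unit-normalised embeddings of past battle prompts
    strong_won: (N,)   1.0 where humans preferred the strong model, else 0.0
    """
    sims = train_embs @ query_emb      # cosine similarity to every past battle
    weights = np.exp(gamma * sims)     # sharpen so the most similar battles dominate
    # Weighted win rate ~= P(strong model wins on prompts like this one);
    # log(p / (1 - p)) would be the implied Elo-style score gap.
    p_strong = float(np.average(strong_won, weights=weights))
    return "strong" if p_strong > threshold else "weak"
```

Raising `threshold` pushes more traffic to the weak model; that is exactly the cost/quality dial swept to trace the curves in the paper.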
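And here's a toy version of the matrix factorization idea, just to show the mechanic of filling in a sparse preference matrix with two low-rank factors; the real router also learns bias terms and trains on battle outcomes, so treat everything here as illustrative:

```python
import numpy as np

def complete_preference_matrix(P, mask, k=8, lr=0.1, steps=2000, seed=0):
    """Approximate a sparse prompt-by-model score matrix P as U @ V.T.

    P:    (n_prompts, n_models) observed scores, e.g. win rates; zero elsewhere
    mask: same shape, 1 where an entry was actually observed, 0 where missing
    Returns a dense matrix whose formerly empty cells are now predictions.
    """
    rng = np.random.default_rng(seed)
    n, m = P.shape
    U = 0.1 * rng.standard_normal((n, k))   # latent factors per prompt
    V = 0.1 * rng.standard_normal((m, k))   # latent factors per model
    for _ in range(steps):
        E = mask * (U @ V.T - P)            # error on the observed cells only
        U -= lr * (E @ V) / m               # averaged gradient of the squared error
        V -= lr * (E.T @ U) / n
    return U @ V.T                          # dense: the gaps are now predictions
```

To route, you'd embed the new prompt, map it to (or near) a row of this matrix, and compare the predicted scores for the strong and the weak model.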
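For the classifier-style routers, here's a minimal sketch of the shape of approach three, using a generic BERT checkpoint; the model choice, label convention, and threshold are all placeholders, and it only becomes meaningful once fine-tuned on the preference labels:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # label 1 = "needs the strong model"

def route(prompt: str, threshold: float = 0.5) -> str:
    """Classify a prompt as needing the strong or the weak model.
    Random until the classifier is fine-tuned on the preference data."""
    inputs = tok(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        p_strong = torch.softmax(clf(**inputs).logits, dim=-1)[0, 1].item()
    return "strong" if p_strong > threshold else "weak"
```

Approach four has the same shape, with an LLM standing in as the classifier.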
Now, if you look at the results, what you're after is an efficient frontier: for a given cost, which router gets you the most quality, and that frontier tells you which model you should go for. When they train these routers just on the preference pairs they already have, they get one set of results; but they also worked out that they could augment this data, basically using GPT-4 as a judge in place of a human. That turns out to work quite nicely: when you train the routers on the augmented data as well, I think all of them improve, and the matrix factorization router does really well. Using it to predict, they end up calling GPT-4 only 26% of the time and the cheaper model the rest of the time, which works out to roughly half the cost of a random routing baseline.

Another thing they found that was really interesting: while they did their training on Mixtral-versus-GPT-4 comparisons, the routers they end up with still work well even when you swap out the models. When they change Mixtral for Llama 3 8B and GPT-4 for Claude Opus, the router still works out which model to actually use, and again you see the same kind of cost savings.

Finally, there are some commercial services out there that have been offering things like this, where you run your calls through them and they decide which model to use, and they show that what they're releasing as open source here does just as well as the commercial ones, while obviously being a lot cheaper as well.

So just to finish up, I would say: if you're doing a project where you're going to put an LLM into production, and you realize that sometimes you need a really strong model but a lot of the time you're perhaps not going to need one, this RouteLLM release from LMSYS is definitely worth considering and worth checking out. Not only have they released the code so you can run the routers they've made, they've also got everything you need to basically make your own routers, and they've released all the models as well as the datasets on Hugging Face. So this allows you to look at these yourself and try them out, and I imagine the open-source community will build on this and come up with routers comparing all the different popular models out there. In some ways this is perhaps not a super sexy topic, but if you do anything where you're spending quite a bit of money on tokens, this is definitely something you should be checking out for use in production.

All right, as always, please put any comments and questions in the comments below. If you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.