Transcript for:
LLM Size and Performance Overview

The first L in LLM stands for "large." But how large is large? Today's language models cover a huge range of sizes, from lightweight networks with maybe 300 million parameters that can run entirely on a smartphone, to titanic systems with hundreds of billions, perhaps even approaching a trillion, parameters that require racks of GPUs in a hyperscale data center. Size in this context is measured in parameters: the individual floating-point weights that a neural network adjusts as it trains. Collectively, these parameters encode everything the model can recall or reason about.

Let's look at some specific models. Mistral 7B is an example of a small model; the "7B" tells you it contains roughly 7 billion of those parameters. By comparison, Llama 3 from Meta is a much bigger model, with a roughly 400-billion-parameter version, which puts it in the large-LLM category. And some frontier models are bigger still, with room to push well beyond half a trillion parameters. In broad strokes, extra parameters buy extra capability: larger models have more room to memorize facts, support more languages, and carry out more intricate chains of reasoning. The trade-off, of course, is cost. They demand dramatically more compute, energy, and memory, both to train in the first place and to run in production. So the story isn't simply "bigger is better." Smaller models are catching up and punching far above their weight class. Let me give you an example.

We measure progress in language model capability with benchmarks, and one of the most enduring is MMLU: Massive Multitask Language Understanding. MMLU contains more than 15,000 multiple-choice questions across subjects like math, history, law, and medicine, so anyone taking the test needs to combine factual recall with problem solving across many fields. The test is a convenient, if somewhat imperfect, snapshot of broad general-purpose ability. If you took the MMLU and just guessed at random, you would score around 25%, since each question has four answer choices. A regular, non-expert human taking the test might score somewhere around 35%; it's a pretty hard test. A domain expert, though, would score far higher, something like 90% on questions within their specialty. So that's humans. What about AI models? When GPT-3, a 175-billion-parameter model, came out in 2020, it posted an MMLU score of 44%. That's pretty respectable, better than the average person, but far from mastery. Today's frontier models, the best we have, can score in the high 80s, around 88% on the test.
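To make the scoring concrete, here is a minimal sketch of how a multiple-choice benchmark like MMLU is turned into a single accuracy number. The two sample items and the ask_model stub are hypothetical placeholders, not real MMLU questions or a real model API; in an actual evaluation you would load the full dataset and call the model under test.

```python
# Minimal sketch of scoring an MMLU-style multiple-choice benchmark.
# The items and the ask_model() stub below are placeholders, not real
# MMLU data or a real model call.
import random

sample_items = [
    {"question": "What is 7 x 8?",
     "choices": {"A": "54", "B": "56", "C": "58", "D": "64"},
     "answer": "B"},
    {"question": "Which organ produces insulin?",
     "choices": {"A": "Liver", "B": "Kidney", "C": "Pancreas", "D": "Spleen"},
     "answer": "C"},
]

def ask_model(question, choices):
    # Stand-in for a real LLM call. Guessing uniformly at random is what
    # produces the ~25% floor on a four-option test.
    return random.choice(list(choices))

def accuracy(items):
    correct = sum(ask_model(i["question"], i["choices"]) == i["answer"] for i in items)
    return correct / len(items)

print(f"Accuracy: {accuracy(sample_items):.0%}")
```

The scores quoted above come from this kind of accuracy computation run over the full question set, with a real model in place of the stub and the usual prompting details (such as few-shot examples) handled by the evaluation harness.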
But let's use a different yardstick: 60%. We can treat that as a practical cutoff, because above that line a model begins to look like a competent generalist that can answer everyday questions. What's striking is how quickly that 60% barrier has fallen to ever smaller models. In February of 2023, the smallest model that could score above 60% was Llama 1 65B, meaning 65 billion parameters. Just a few months later, by July of the same year, Llama 2 34B could do it with barely half the parameters. Fast forward to September of the same year, and Mistral 7B, a 7-billion-parameter model, joined the club. Then in March of 2024, Qwen 1.5-MoE became the first model with fewer than 3 billion active parameters to clear 60%. In other words, month by month we are learning to squeeze competent generalist behavior into smaller and smaller footprints. Smaller models are getting smarter.

The natural next question is: which model should I put into production, large or small? The answer depends on your workload, your latency and privacy constraints, and, let's be honest, the size of your GPU budget. I'm generalizing here, and your case may be different, but certain tasks still reward sheer scale. Let's talk about some large-model use cases. The first comes down to broad-spectrum code generation. A small model can master a handful of programming languages, but a frontier model has room for dozens of ecosystems and can reason across multi-file projects, unfamiliar APIs, and weird edge cases. Another good example is document-heavy work: ingesting a very large contract, a medical guideline, or a technical standard. A large model's longer context window means it can keep more of the source text in mind, reducing hallucinations and improving citation quality. The same scale advantage appears in high-fidelity multilingual translation, where the extra parameters let the network carve out richer subspaces for each language, capturing idioms and nuance that smaller models tend to gloss over.

But there are cases where small models are not only good enough, they are outright preferable. One of those is on-device AI: keyboard prediction, voice commands, offline search. That stuff lives or dies by sub-100-millisecond latency and strict data privacy, and small models that run on device are great for that. Everyday summarization is another sweet spot. In a news summarization study, Mistral 7B Instruct achieved ROUGE and BERTScore metrics that were statistically indistinguishable from a much larger model, GPT-3.5 Turbo, despite running roughly 30 times cheaper and faster (a short sketch of how that kind of comparison is scored appears at the end of this transcript). Another good use case is enterprise chatbots: a business can fine-tune a 7- or 13-billion-parameter model on its own manuals and reach near-expert accuracy. IBM found that the Granite 13B family matched the performance of models five times larger on typical enterprise Q&A tasks.

So the rule of thumb: for expansive, open-ended reasoning, bigger still buys more headroom; for focused skills like summarizing and classifying, a carefully trained small model delivers perhaps 90% of the quality at a fraction of the cost. Go big or stay small: in the end, it's your use case that will drive the decision.
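As promised, here is a minimal sketch of how a ROUGE and BERTScore comparison like the one in that summarization study is typically scored. It assumes the open-source rouge-score and bert-score Python packages; the reference and candidate texts are made-up placeholders, and this is not the exact setup of the study mentioned above.

```python
# Minimal sketch of scoring a generated summary against a reference summary
# with ROUGE and BERTScore. Assumes `pip install rouge-score bert-score`;
# the texts are made-up placeholders, not data from the cited study.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The central bank held interest rates steady and signaled patience."
candidate = "Interest rates were left unchanged as the central bank urged patience."

# ROUGE measures n-gram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print("ROUGE-1 F1:", round(rouge["rouge1"].fmeasure, 3))
print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))

# BERTScore compares contextual embeddings, so it can credit paraphrases
# that surface-level n-gram overlap would miss (downloads a model on first use).
P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(F1.item(), 3))
```

In a study like the one described, each model's summaries would be scored this way against human-written references, and the per-model averages compared to see whether the difference is statistically significant.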