Overview
This lecture covers the size and performance of large language models (LLMs), examining current benchmarks, trends in model efficiency, and guidance on choosing between large and small models for different applications.
LLM Size and Parameters
- LLM size is measured by the number of parameters, which are the model's learned weights.
- Small LLMs can have as few as 300 million parameters and run on a smartphone.
- Frontier LLMs may approach or exceed a trillion parameters, needing substantial computing resources.
- More parameters generally mean higher capability, but also greater compute and energy costs (a rough memory estimate follows this list).
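To make the compute cost concrete, here is a back-of-the-envelope sketch (not from the lecture) of how much memory just the weights occupy at common numeric precisions; the bytes-per-parameter figures are standard (2 for fp16/bf16, 0.5 for 4-bit quantization), and the model list and rounding are illustrative.

```python
# Back-of-the-envelope estimate of the memory needed just to hold model weights.
# bytes_per_param: 2 for fp16/bf16, 1 for int8, 0.5 for 4-bit quantization.
# Real deployments also need memory for activations and the KV cache.

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory (GiB) to store the weights alone."""
    return num_params * bytes_per_param / 1024**3

for name, params in [
    ("300M on-device model", 300e6),
    ("Mistral 7B", 7e9),
    ("Llama 3 (~400B)", 400e9),
]:
    fp16 = weight_memory_gib(params, 2)
    int4 = weight_memory_gib(params, 0.5)
    print(f"{name:<22} ~{fp16:7.1f} GiB at fp16   ~{int4:7.1f} GiB at 4-bit")
```

The same arithmetic explains why a ~300M-parameter model fits comfortably on a smartphone while a ~400B-parameter model needs many accelerators just to hold its weights.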
Model Examples and Capabilities
- Mistral 7B: a smaller LLM with 7 billion parameters.
- Llama 3: a much larger model, with roughly 400 billion parameters in its biggest variant.
- More parameters allow models to memorize facts, handle more languages, and perform complex reasoning.
Measuring Progress with Benchmarks
- MMLU (Massive Multitask Language Understanding) tests generalist ability across roughly 15,000 multiple-choice questions spanning many fields.
- Random guessing on MMLU yields 25%; non-expert humans score ~35% and domain experts ~90% (a toy scoring sketch follows this list).
- GPT-3 (175B parameters, 2020) scored 44% on MMLU; today's best models can reach ~88%.
- A 60% MMLU score is a practical threshold for general competence.
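As a minimal illustration of how a benchmark like MMLU is scored, the sketch below computes simple multiple-choice accuracy over a toy dataset; the random-guessing stand-in for the model and the toy questions are assumptions for illustration, not the real evaluation harness.

```python
import random

# MMLU-style scoring: each question has four options (A-D), and the reported
# score is simply the fraction answered correctly. `model_answer` is a
# hypothetical stand-in (a random guesser); a real evaluation would prompt an
# LLM and parse its chosen letter.

CHOICES = ["A", "B", "C", "D"]

def model_answer(question: str, options: list[str]) -> str:
    """Stand-in model under test; random guessing lands near 25% accuracy."""
    return random.choice(CHOICES)

def mmlu_accuracy(dataset: list[dict]) -> float:
    correct = sum(model_answer(q["question"], q["options"]) == q["answer"]
                  for q in dataset)
    return correct / len(dataset)

# Tiny toy dataset standing in for the ~15,000 real questions.
toy = [{"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "B"}] * 1000
print(f"Random-guess baseline: {mmlu_accuracy(toy):.0%}  (expected ~25%)")
```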
Efficiency Gains in Smaller Models
- In February 2023, Llama 1-65B (65B parameters) was the smallest model to score over 60% on MMLU.
- By July 2023, Llama 2-34B (34B) had done so; by September 2023, Mistral 7B (7B); by March 2024, Qwen 1.5 MoE (under 3B).
- Competent performance now fits into increasingly smaller models; the sketch below quantifies the shrink.
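To quantify the trend, this small sketch replays the milestones from the list above and computes how much the "60% on MMLU" bar has shrunk; the dates and parameter counts are taken directly from these notes, and the rounding is illustrative.

```python
# Timeline of the smallest model to clear 60% on MMLU, per the notes above,
# and how much smaller each milestone is than the February 2023 baseline.
milestones = [
    ("2023-02", "Llama 1-65B",  65e9),
    ("2023-07", "Llama 2-34B",  34e9),
    ("2023-09", "Mistral 7B",    7e9),
    ("2024-03", "Qwen 1.5 MoE",  3e9),  # "under 3B" in the notes; 3B used here
]

baseline = milestones[0][2]
for date, name, params in milestones:
    print(f"{date}  {name:<14} {params/1e9:>5.0f}B  "
          f"(~{baseline/params:.0f}x smaller than the Feb 2023 baseline)")
```

In round numbers, the parameter count needed for "general competence" dropped by more than 20x in about a year.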
Choosing Between Large and Small Models
- Use large models for broad code generation, document-heavy processing, or high-fidelity multilingual translation.
- Large models handle large context windows, complex reasoning, and nuanced translations better.
- Small models excel for on-device AI, privacy-sensitive tasks, low-latency applications, and summarization.
- Small models can be fine-tuned for enterprise chatbots, offering near-expert accuracy at lower cost (a rough decision sketch follows this list).
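The guidance above can be summarized as a rough decision heuristic. The sketch below is an illustrative encoding of these bullet points, not a definitive selection algorithm; the field names and the order of the checks are assumptions.

```python
from dataclasses import dataclass

# Illustrative heuristic only: a rough encoding of the lecture's guidance.

@dataclass
class UseCase:
    needs_on_device: bool = False        # must run on a phone / edge hardware
    privacy_sensitive: bool = False      # data should not leave the device
    latency_critical: bool = False       # interactive, low-latency responses
    long_documents: bool = False         # document-heavy, large context windows
    complex_reasoning: bool = False      # broad code generation, multi-step reasoning
    multilingual_fidelity: bool = False  # nuanced, high-fidelity translation

def suggest_model_size(uc: UseCase) -> str:
    if uc.needs_on_device or uc.privacy_sensitive or uc.latency_critical:
        return "small model (possibly fine-tuned for the task)"
    if uc.long_documents or uc.complex_reasoning or uc.multilingual_fidelity:
        return "large model"
    return "start small; scale up only if quality is insufficient"

print(suggest_model_size(UseCase(privacy_sensitive=True)))   # small model ...
print(suggest_model_size(UseCase(complex_reasoning=True)))   # large model
```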
Key Terms & Definitions
- LLM (Large Language Model): an AI model trained to understand and generate human language, sized by the number of learned parameters.
- Parameter: a floating-point value the model learns during training; collectively, the parameters store the model's knowledge.
- MMLU (Massive Multitask Language Understanding): a benchmark testing model knowledge and reasoning across many subjects.
- Context window: the amount of text a model can consider at once when making predictions.
Action Items / Next Steps
- Review benchmark data on current LLMs.
- Assess your use case requirements (scale, latency, privacy) to select an appropriate model size.