Overview
This lecture provides an overview of how large language models (LLMs) are built, focusing on their training data, evaluation, scaling laws, and post-training alignment, with additional insights into practical systems considerations.
Key Components of LLM Training
- LLMs are large neural networks designed to generate and understand human language (e.g., ChatGPT, Claude, Gemini).
- Key aspects of LLM development include: architecture, training loss & algorithm, data, evaluation, and systems.
- While architecture and training algorithms receive most of the academic attention, data, evaluation, and systems matter most in practice.
Pre-Training and Tokenization
- Pre-training teaches a model to predict the next word in vast internet text (classical language modeling).
- Language models estimate the probability of a sequence of tokens (words or subwords).
- Auto-regressive models predict each token given the previous tokens using the chain rule of probability.
- Tokenizers map text to tokens and must cope with issues like typos and non-Latin scripts; the most common method is byte pair encoding (BPE; see the sketch after this list).
- Choice of tokenizer affects vocabulary size, evaluation metrics, and model flexibility.
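To make BPE concrete, here is a minimal training sketch on a hypothetical toy corpus (the word frequencies and number of merges are made up for illustration; production tokenizers such as GPT-2's operate on bytes and train on far larger corpora):

```python
# Minimal byte-pair-encoding (BPE) training sketch on a toy corpus.
from collections import Counter

def get_pair_counts(words):
    """Count frequencies of adjacent symbol pairs across the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical word frequencies, each word split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges = []
for _ in range(10):                   # learn 10 merge rules (arbitrary budget)
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)
print(merges)  # learned merge rules, starting with the most frequent pair
```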
Evaluation of LLMs
- Traditional metric: perplexity, the exponential of the average per-token loss; lower is better (a minimal calculation follows this list).
- Perplexity depends on the choice of tokenizer, which makes cross-model comparisons difficult, so it is now less used in academic benchmarks.
- Benchmarks like HELM or the Hugging Face Open LLM Leaderboard aggregate scores across many NLP tasks such as question answering and summarization.
- Evaluation of open-ended models is challenging due to the diversity of valid responses and prompt/policy variations.
- Data contamination (test set in training data) can inflate scores and is a known academic concern.
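For concreteness, a minimal perplexity calculation under the definition above (the per-token probabilities are made-up values, not from any real model):

```python
import math

# Hypothetical probabilities a model assigned to the observed tokens.
token_probs = [0.25, 0.60, 0.05, 0.40]

# Average per-token loss is the mean negative log-probability (cross-entropy).
avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of that average loss; lower is better.
perplexity = math.exp(avg_loss)
print(f"avg loss = {avg_loss:.3f}, perplexity = {perplexity:.2f}")
```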
Data Collection and Processing
- Training data is scraped from the internet (e.g., using Common Crawl), leading to massive, messy datasets.
- Key data cleaning steps: extract text, filter unsafe/harmful content, deduplicate, remove low-quality documents, and balance domains (a toy pipeline is sketched after this list).
- Training mixtures often upweight code and books, downweight entertainment, and finish training with extra passes over high-quality sources.
- Data processing is labor-intensive and often involves large interdisciplinary teams.
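As a rough illustration of these cleaning steps, here is a toy pass that normalizes whitespace, drops very short documents as a crude quality filter, and removes exact duplicates by hashing; real pipelines use trained quality classifiers and near-duplicate detection (e.g., MinHash):

```python
import hashlib
import re

def clean_corpus(docs):
    """Toy cleaning pass: drop very short / duplicate documents."""
    seen, kept = set(), []
    for text in docs:
        text = re.sub(r"\s+", " ", text).strip()       # normalize whitespace
        if len(text.split()) < 5:                       # crude low-quality filter
            continue
        digest = hashlib.sha1(text.lower().encode()).hexdigest()  # exact-dup hash
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "Hello   world, this is a sample document.",
    "hello world, this is a sample document.",   # duplicate up to case/spacing
    "Buy now!!!",                                 # too short, dropped
]
print(clean_corpus(docs))  # keeps a single copy of the duplicated document
```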
Scaling Laws
- Scaling laws show that increasing model size and data volume leads to predictable, improved performance (lower loss).
- Optimal resource allocation often follows the Chinchilla result of roughly 20 training tokens per parameter; when inference cost matters, a higher ratio (~150 tokens per parameter) is often used to keep the model smaller (see the back-of-the-envelope sketch after this list).
- Diminishing returns have not yet been clearly observed at current model and data scales.
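To make the token-per-parameter ratios concrete, a back-of-the-envelope sketch (the 70B parameter count is arbitrary, and C ≈ 6·N·D is a standard rule-of-thumb approximation of training FLOPs):

```python
# Rough token budgets under the Chinchilla heuristic (~20 tokens/parameter)
# vs. an inference-aware ratio (~150 tokens/parameter).
params = 70e9                                # hypothetical 70B-parameter model
chinchilla_tokens = 20 * params              # ~1.4e12 tokens (1.4T)
inference_aware_tokens = 150 * params        # ~1.05e13 tokens (10.5T)

# Approximate training compute: C ~= 6 * N * D floating-point operations.
chinchilla_flops = 6 * params * chinchilla_tokens
print(f"{chinchilla_tokens:.2e} tokens, {chinchilla_flops:.2e} FLOPs")
```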
Post-Training and Alignment
- Post-training (alignment) makes LLMs follow instructions and behave as AI assistants.
- Supervised Fine-Tuning (SFT) uses human-written question-answer pairs to steer the model, needing surprisingly little data.
- Reinforcement Learning from Human Feedback (RLHF) allows models to optimize for human preferences using reward models and methods like PPO or DPO.
- DPO (Direct Preference Optimization) is simpler and now widely used in place of RL-based approaches (see the loss sketch after this list).
- Using LLMs for synthetic dataset generation and evaluation can reduce costs and increase scalability, though introduces biases.
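A minimal sketch of the DPO loss referenced above (the β value and log-probabilities are hypothetical; in practice the log-probabilities come from the trained policy and a frozen reference model scored on chosen/rejected response pairs):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss sketch: inputs are summed log-probabilities of chosen/rejected
    responses under the policy being trained and under a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Hypothetical log-probabilities for a batch of three preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -14.2]),
                torch.tensor([-15.1, -11.0, -13.8]),
                torch.tensor([-12.5, -10.0, -14.0]),
                torch.tensor([-14.8, -10.8, -14.1]))
print(loss.item())
```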
Systems Optimizations
- GPUs are the scarce, expensive resource, and memory bandwidth and communication improve more slowly than raw compute, so keeping GPUs fully utilized is the main systems challenge.
- Low-precision computation (16-bit floats such as bfloat16) speeds up training and reduces memory usage.
- Operator fusion avoids repeatedly moving intermediate results between GPU memory and compute units, making execution faster; torch.compile applies such fusions automatically (see the sketch below).
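A minimal PyTorch sketch combining both ideas (the toy MLP, batch size, and learning rate are arbitrary; assumes a CUDA GPU with bfloat16 support):

```python
import torch

# Toy two-layer MLP standing in for a real model.
model = torch.nn.Sequential(torch.nn.Linear(1024, 4096),
                            torch.nn.GELU(),
                            torch.nn.Linear(4096, 1024)).cuda()
model = torch.compile(model)                 # fuse ops / generate faster kernels
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")
# Autocast runs selected ops (e.g., matmuls) in bfloat16;
# parameters and gradients stay in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
opt.step()
```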
Key Terms & Definitions
- LLM (Large Language Model) — A neural network trained on vast text to predict/generate language.
- Tokenizer — Algorithm separating text into units (tokens) for model processing.
- Perplexity — Exponential of average loss per token; measures model's uncertainty.
- Auto-regressive — Predicts next token based on preceding context.
- Supervised Fine-Tuning (SFT) — Aligning a model by training on human-labeled demonstrations.
- RLHF (Reinforcement Learning from Human Feedback) — Using human preferences to guide and improve model outputs.
- DPO (Direct Preference Optimization) — Optimizes the model directly on preference comparisons (human or LLM judgments) without training a separate reward model.
Action Items / Next Steps
- Review readings or coursework from CS224N, CS324, or CS336 for a deeper understanding of LLMs.
- Consider experimenting with tokenization techniques and metrics in small-scale projects.
- Practice evaluating models using both automated benchmarks and human/LLM preferences.