Overview
This lecture provides an overview of how large language models (LLMs) are built, focusing on their training data, evaluation, scaling laws, and post-training alignment, with additional insights into practical systems considerations.
Key Components of LLM Training
- LLMs are large neural networks designed to generate and understand human language (e.g., ChatGPT, Claude, Gemini).
- Key aspects of LLM development include: architecture, training loss & algorithm, data, evaluation, and systems.
- While architecture and training algorithms receive most of the academic attention, data, evaluation, and systems matter most in practice.
Pre-Training and Tokenization
- Pre-training teaches a model to predict the next word in vast internet text (classical language modeling).
- Language models estimate the probability of a sequence of tokens (words or subwords).
- Auto-regressive models predict each token given the previous tokens using the chain rule of probability.
- Tokenizers map text to tokens and must cope with issues like typos and non-Latin scripts; the most common method is byte pair encoding (BPE; see the sketch after this list).
- Choice of tokenizer affects vocabulary size, evaluation metrics, and model flexibility.
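To make BPE concrete, here is a minimal training sketch on a hypothetical toy corpus (the word frequencies and number of merges are made up for illustration; production tokenizers such as GPT-2's operate on bytes and train on far larger corpora):

```python
# Minimal byte-pair-encoding (BPE) training sketch on a toy corpus.
from collections import Counter

def get_pair_counts(words):
    """Count frequencies of adjacent symbol pairs across the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical word frequencies, each word split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges = []
for _ in range(10):                   # learn 10 merge rules (arbitrary budget)
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)
print(merges)  # learned merge rules, starting with the most frequent pair
```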
Evaluation of LLMs
- Traditional metric: perplexity, the exponential of the average per-token loss; lower is better (a minimal calculation follows this list).
- Perplexity depends on the choice of tokenizer, which makes cross-model comparisons difficult, so it is now less used in academic benchmarks.
- Benchmarks like HELM or the Hugging Face Open LLM Leaderboard aggregate scores across many NLP tasks such as question answering and summarization.
- Evaluation of open-ended models is challenging due to the diversity of valid responses and prompt/policy variations.
- Data contamination (test set in training data) can inflate scores and is a known academic concern.
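For concreteness, a minimal perplexity calculation under the definition above (the per-token probabilities are made-up values, not from any real model):

```python
import math

# Hypothetical probabilities a model assigned to the observed tokens.
token_probs = [0.25, 0.60, 0.05, 0.40]

# Average per-token loss is the mean negative log-probability (cross-entropy).
avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of that average loss; lower is better.
perplexity = math.exp(avg_loss)
print(f"avg loss = {avg_loss:.3f}, perplexity = {perplexity:.2f}")
```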
Data Collection and Processing
- Training data is scraped from the internet (e.g., using Common Crawl), leading to massive, messy datasets.
- Key data cleaning steps: extract text, filter unsafe/harmful content, deduplicate, remove low-quality documents, and balance domains (a toy pipeline is sketched after this list).
- Training mixtures often upweight code and books, downweight entertainment, and finish training with extra passes over high-quality sources.
- Data processing is labor-intensive and often involves large interdisciplinary teams.
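As a rough illustration of these cleaning steps, here is a toy pass that normalizes whitespace, drops very short documents as a crude quality filter, and removes exact duplicates by hashing; real pipelines use trained quality classifiers and near-duplicate detection (e.g., MinHash):

```python
import hashlib
import re

def clean_corpus(docs):
    """Toy cleaning pass: drop very short / duplicate documents."""
    seen, kept = set(), []
    for text in docs:
        text = re.sub(r"\s+", " ", text).strip()       # normalize whitespace
        if len(text.split()) < 5:                       # crude low-quality filter
            continue
        digest = hashlib.sha1(text.lower().encode()).hexdigest()  # exact-dup hash
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "Hello   world, this is a sample document.",
    "hello world, this is a sample document.",   # duplicate up to case/spacing
    "Buy now!!!",                                 # too short, dropped
]
print(clean_corpus(docs))  # keeps a single copy of the duplicated document
```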
Scaling Laws
- Scaling laws show that increasing model size and data volume leads to predictable, improved performance (lower loss).
- Optimal resource allocation often follows the Chinchilla result of roughly 20 training tokens per parameter; when inference cost matters, a higher ratio (~150 tokens per parameter) is often used to keep the model smaller (see the back-of-the-envelope sketch after this list).
- Diminishing returns have not yet been clearly observed at current model and data scales.
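To make the token-per-parameter ratios concrete, a back-of-the-envelope sketch (the 70B parameter count is arbitrary, and C ≈ 6·N·D is a standard rule-of-thumb approximation of training FLOPs):

```python
# Rough token budgets under the Chinchilla heuristic (~20 tokens/parameter)
# vs. an inference-aware ratio (~150 tokens/parameter).
params = 70e9                                # hypothetical 70B-parameter model
chinchilla_tokens = 20 * params              # ~1.4e12 tokens (1.4T)
inference_aware_tokens = 150 * params        # ~1.05e13 tokens (10.5T)

# Approximate training compute: C ~= 6 * N * D floating-point operations.
chinchilla_flops = 6 * params * chinchilla_tokens
print(f"{chinchilla_tokens:.2e} tokens, {chinchilla_flops:.2e} FLOPs")
```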
Post-Training and Alignment
- Post-training (alignment) makes LLMs follow instructions and behave as AI assistants.
- Supervised Fine-Tuning (SFT) uses human-written question-answer pairs to steer the model, needing surprisingly little data.
- Reinforcement Learning from Human Feedback (RLHF) allows models to optimize for human preferences using reward models and methods like PPO or DPO.
- DPO (Direct Preference Optimization) is simpler and now widely used in place of RL-based approaches (see the loss sketch after this list).
- Using LLMs for synthetic dataset generation and evaluation can reduce costs and increase scalability, though introduces biases.
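A minimal sketch of the DPO loss referenced above (the β value and log-probabilities are hypothetical; in practice the log-probabilities come from the trained policy and a frozen reference model scored on chosen/rejected response pairs):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss sketch: inputs are summed log-probabilities of chosen/rejected
    responses under the policy being trained and under a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Hypothetical log-probabilities for a batch of three preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -14.2]),
                torch.tensor([-15.1, -11.0, -13.8]),
                torch.tensor([-12.5, -10.0, -14.0]),
                torch.tensor([-14.8, -10.8, -14.1]))
print(loss.item())
```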
Systems Optimizations
- GPUs are the scarce, expensive resource, and memory bandwidth and communication improve more slowly than raw compute, so keeping GPUs fully utilized is the main systems challenge.
- Low-precision computation (16-bit floats such as bfloat16) speeds up training and reduces memory usage.
- Operator fusion avoids repeatedly moving intermediate results between GPU memory and compute units, making execution faster; torch.compile applies such fusions automatically (see the sketch below).
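A minimal PyTorch sketch combining both ideas (the toy MLP, batch size, and learning rate are arbitrary; assumes a CUDA GPU with bfloat16 support):

```python
import torch

# Toy two-layer MLP standing in for a real model.
model = torch.nn.Sequential(torch.nn.Linear(1024, 4096),
                            torch.nn.GELU(),
                            torch.nn.Linear(4096, 1024)).cuda()
model = torch.compile(model)                 # fuse ops / generate faster kernels
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")
# Autocast runs selected ops (e.g., matmuls) in bfloat16;
# parameters and gradients stay in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
opt.step()
```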
Key Terms & Definitions
- LLM (Large Language Model) — A neural network trained on vast text to predict/generate language.
- Tokenizer — Algorithm separating text into units (tokens) for model processing.
- Perplexity — Exponential of average loss per token; measures model's uncertainty.
- Auto-regressive — Predicts next token based on preceding context.
- Supervised Fine-Tuning (SFT) — Aligning a model by training on human-labeled demonstrations.
- RLHF (Reinforcement Learning from Human Feedback) — Using human preferences to guide and improve model outputs.
- DPO (Direct Preference Optimization) — Optimizes the model directly on preference comparisons (human or LLM judgments) without training a separate reward model.
Action Items / Next Steps
- Review readings or coursework from CS224N, CS324, or CS336 for a deeper understanding of LLMs.
- Consider experimenting with tokenization techniques and metrics in small-scale projects.
- Practice evaluating models using both automated benchmarks and human/LLM preferences.