LLM Development Overview

Jun 30, 2025

Overview

This lecture provides an overview of how large language models (LLMs) are built, focusing on their training data, evaluation, scaling laws, and post-training alignment, with additional insights into practical systems considerations.

Key Components of LLM Training

  • LLMs are large neural networks designed to generate and understand human language (e.g., ChatGPT, Claude, Gemini).
  • Key aspects of LLM development include: architecture, training loss & algorithm, data, evaluation, and systems.
  • While architecture and training algorithms receive most of the academic attention, data, evaluation, and systems matter most in practice.

Pre-Training and Tokenization

  • Pre-training teaches a model to predict the next word in vast internet text (classical language modeling).
  • Language models estimate the probability of a sequence of tokens (words or subwords).
  • Auto-regressive models factor the probability of a sequence via the chain rule, predicting each token given the previous tokens (a scoring sketch follows this list).
  • Tokenizers map text to tokens, dealing with issues like typos and non-Latin scripts; common methods include byte pair encoding (BPE).
  • Choice of tokenizer affects vocabulary size, evaluation metrics, and model flexibility.
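
The chain-rule view above can be made concrete with a short scoring script. The following is a minimal sketch, assuming the Hugging Face transformers library and the public "gpt2" checkpoint (neither is prescribed by the lecture); it sums per-token log-probabilities to score a sentence autoregressively.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # GPT-2's BPE tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "Language models estimate probabilities of token sequences."
ids = tokenizer(text, return_tensors="pt").input_ids         # shape (1, T)

with torch.no_grad():
    logits = model(ids).logits                               # shape (1, T, vocab_size)

# Chain rule: log p(x_1..x_T) = sum_t log p(x_t | x_<t).
# The logits at position t predict token t+1, so shift by one.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
next_token_logp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

print("log p(sequence) =", next_token_logp.sum().item())
```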

Evaluation of LLMs

  • Traditional metric: perplexity, the exponential of the average per-token loss (lower is better); a worked example appears after this list.
  • Perplexity depends on the tokenizer, which makes cross-model comparisons tricky, and is now less used in academic benchmarks.
  • Benchmarks such as HELM or the Hugging Face Open LLM Leaderboard aggregate scores across tasks like question answering and summarization.
  • Evaluating open-ended models is challenging because many different responses can be valid and results are sensitive to prompting and evaluation choices.
  • Data contamination (test set in training data) can inflate scores and is a known academic concern.
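
To make the perplexity definition concrete, here is a minimal sketch; the per-token log-probabilities are hypothetical numbers for illustration (in practice they would come from a scoring pass like the one above).

```python
import math

# Hypothetical per-token log-probabilities (natural log) on a held-out text.
token_log_probs = [-2.1, -0.4, -3.3, -1.0, -0.7]

avg_loss = -sum(token_log_probs) / len(token_log_probs)   # mean per-token cross-entropy
perplexity = math.exp(avg_loss)                           # exp of the average loss; lower is better
print(f"perplexity = {perplexity:.2f}")
```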

Data Collection and Processing

  • Training data is scraped from the internet (e.g., using Common Crawl), leading to massive, messy datasets.
  • Key data-cleaning steps: extract text from raw web pages, filter unsafe/harmful content, deduplicate, remove low-quality documents, and balance domains (a toy deduplication sketch follows this list).
  • Training mixtures often upweight code and books, downweight entertainment, and deliberately emphasize (even overfit to) the highest-quality sources at the end of training.
  • Data processing is labor-intensive and often involves large interdisciplinary teams.
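
As a concrete illustration of the deduplication step, the toy sketch below drops exact duplicates of normalized text; production pipelines instead rely on approximate near-duplicate detection (e.g., MinHash) at far larger scale.

```python
import hashlib

def deduplicate(docs):
    """Toy exact-match deduplication: hash normalized text, keep first occurrences."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world.", "hello   world.", "Something else entirely."]
print(deduplicate(docs))   # the second document is dropped as a duplicate
```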

Scaling Laws

  • Scaling laws show that increasing model size and data volume leads to predictable, improved performance (lower loss).
  • Optimal resource allocation roughly follows the Chinchilla result of ~20 training tokens per parameter for compute-optimal training; models meant for cheap inference are often trained on far more data (around 150 tokens per parameter), as illustrated after this list.
  • Diminishing returns have not yet been clearly observed at current model and data scales.
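
A quick back-of-the-envelope sketch of these token-per-parameter ratios; the 70B-parameter model size is only an illustrative assumption.

```python
def token_budget(n_params, tokens_per_param=20):
    """Rule-of-thumb training-token budget (Chinchilla: roughly 20 tokens per parameter)."""
    return n_params * tokens_per_param

# Example: a hypothetical 70B-parameter model.
print(token_budget(70e9))        # 1.4e12  -> ~1.4 trillion tokens (compute-optimal)
print(token_budget(70e9, 150))   # 1.05e13 -> ~10.5 trillion tokens (inference-oriented)
```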

Post-Training and Alignment

  • Post-training (alignment) makes LLMs follow instructions and behave as AI assistants.
  • Supervised Fine-Tuning (SFT) uses human-written question-answer pairs to steer the model, needing surprisingly little data.
  • Reinforcement Learning from Human Feedback (RLHF) allows models to optimize for human preferences using reward models and methods like PPO or DPO.
  • DPO (Direct Preference Optimization) is a simpler alternative that is now widely used in place of full RL-based approaches; its loss is sketched after this list.
  • Using LLMs for synthetic data generation and evaluation can reduce costs and increase scalability, though it introduces biases.
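
The DPO objective can be written in a few lines. The sketch below uses hypothetical summed log-probabilities for a single preference pair rather than a full training loop, so it only illustrates the loss itself.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss (sketch): widen the policy's margin for the chosen response over
    the rejected one, measured relative to a frozen reference (SFT) model."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Hypothetical summed log-probabilities for one preference pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```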

Systems Optimizations

  • GPUs are the scarce, expensive resource; memory bandwidth and communication have improved more slowly than raw compute, so keeping the compute units busy is the central systems challenge.
  • Low-precision computation (16-bit floats) is used for faster training and lower memory usage.
  • Operator fusion reduces data movement between GPU memory and compute units, speeding up execution (e.g., via torch.compile); a minimal sketch follows.
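
As a small illustration of both ideas, the sketch below combines 16-bit autocast with torch.compile; it assumes a CUDA GPU and PyTorch 2.x, and the toy MLP stands in for a real transformer.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
model = torch.compile(model)   # traces the model and fuses operators into faster kernels

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # 16-bit compute where safe
    y = model(x)
print(y.shape)   # torch.Size([8, 1024])
```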

Key Terms & Definitions

  • LLM (Large Language Model) — A neural network trained on vast text to predict/generate language.
  • Tokenizer — Algorithm separating text into units (tokens) for model processing.
  • Perplexity — Exponential of the average per-token loss; measures the model's uncertainty (lower is better).
  • Auto-regressive — Predicts next token based on preceding context.
  • Supervised Fine-Tuning (SFT) — Aligning a model by training on human-labeled demonstrations.
  • RLHF (Reinforcement Learning from Human Feedback) — Using human preferences to guide and improve model outputs.
  • DPO (Direct Preference Optimization) — Optimizes directly for preferred outputs based on human or LLM judgments.

Action Items / Next Steps

  • Review readings or coursework from CS224N, CS324, or CS336 for a deeper understanding of LLMs.
  • Consider experimenting with tokenization techniques and metrics in small-scale projects.
  • Practice evaluating models using both automated benchmarks and human/LLM preferences.