
Overview of LLM Training

Jul 9, 2025

Overview

This lecture provides an overview of how Large Language Models (LLMs) are trained and built, focusing on data, evaluation, scaling laws, tokenization, pre-training, post-training, and systems.

Core Components for Training LLMs

  • LLMs are large neural networks, specifically based on Transformers.
  • Key components: architecture, training loss/algorithm, data, evaluation, and systems for scaling and efficiency.
  • Industry focuses more on data, evaluation, and systems than on architecture.

Pre-Training LLMs

  • Pre-training involves modeling the distribution of tokens (words/subwords) using internet-scale data.
  • Language models assign probabilities to sequences; better models capture both syntax and semantics.
  • Most current models use autoregressive language modeling (predict next token given context).
  • Loss function: cross-entropy, equivalent to maximizing the log-likelihood of real text (see the sketch after this list).
  • Tokenization converts text into manageable chunks (tokens), typically using algorithms like Byte Pair Encoding (BPE).
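
A minimal sketch of the autoregressive objective in PyTorch; the tiny embedding-plus-linear "model" and all dimensions are illustrative stand-ins for a real Transformer.

```python
import torch
import torch.nn.functional as F

# Toy autoregressive setup: position t predicts token t+1.
# Dimensions and the tiny "model" are illustrative, not a real LLM.
vocab_size, d_model, seq_len, batch = 100, 32, 16, 4

embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# A real LLM would apply Transformer blocks between embed and lm_head.
hidden = embed(tokens)            # (batch, seq_len, d_model)
logits = lm_head(hidden)          # (batch, seq_len, vocab_size)

# Shift by one so each position predicts the next token, then apply
# cross-entropy, i.e. maximize log-likelihood of the observed text.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at 0..T-2
    tokens[:, 1:].reshape(-1),               # targets: tokens 1..T-1
)
print(loss.item())  # ~log(vocab_size) for an untrained model
```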

Tokenization

  • Tokens are subword units, sitting between characters and whole words in granularity.
  • Efficient tokenization balances vocabulary size against sequence length; BPE is the most common algorithm (toy trainer sketched after this list).
  • Keeping small (character-level) tokens in the vocabulary lets the tokenizer handle typos and rare words.
  • When encoding, tokenizers greedily select the longest matching token in their vocabulary.
  • Drawbacks: Poor for numbers, some code, and math; potential future shift toward character-level models.
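
A toy BPE trainer to make the merge procedure concrete; real tokenizers train byte-level on huge corpora, so this character-level version is only a sketch.

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Learn BPE merges on a tiny corpus (character-level toy)."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; nothing worth merging
        merges.append((a, b))
        # Replace every occurrence of the most frequent adjacent pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = bpe_train("low lower lowest", num_merges=5)
print(merges)  # learned merge rules, e.g. ('l', 'o') then ('lo', 'w')
print(tokens)  # corpus re-encoded with the merged vocabulary
```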

Evaluation of LLMs

  • Perplexity measures model uncertainty, intuitively the number of tokens the model is hesitating between; lower is better, but values depend on the tokenizer and evaluation data, so cross-model comparisons need care (worked example after this list).
  • Academic benchmarks now aggregate performance across tasks (e.g., HELM, Hugging Face leaderboards).
  • Multiple-choice QA is used for automatic evaluation; open-ended tasks are harder to assess.
  • Challenge: results depend heavily on the evaluation setup and on possible train-test contamination.
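
A worked example of the perplexity formula, the exponential of the average negative log-likelihood per token; the per-token probabilities here are made up for illustration.

```python
import math

# Probabilities a model assigned to each observed token (illustrative).
token_probs = [0.25, 0.10, 0.50, 0.05]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(f"avg NLL: {avg_nll:.3f}, perplexity: {perplexity:.1f}")

# Intuition: a model that is uniformly uncertain over k tokens at
# every step has perplexity exactly k ("effective branching factor").
```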

Data for Pre-Training

  • Raw internet data is massive and noisy; requires substantial filtering.
  • Steps: extract text from raw pages, filter undesirable content (NSFW, low-quality) with heuristic and model-based filters, and deduplicate (toy dedup sketch after this list).
  • Dataset composition is carefully balanced (upweighting code, downweighting entertainment).
  • Large models are trained on trillions of tokens (e.g., 15T for Llama 3).
  • Data collection and curation is a major practical challenge.
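
A minimal exact-match deduplication sketch via content hashing; production pipelines also use fuzzy methods (e.g., MinHash) to catch near-duplicates, which this toy version ignores.

```python
import hashlib

def dedup(documents):
    """Drop exact duplicates after light whitespace/case normalization."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world", "hello   world", "Something else"]
print(dedup(docs))  # ['Hello world', 'Something else']
```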

Scaling Laws

  • Performance improves predictably with more data, bigger models, and more compute (no overfitting observed yet).
  • Scaling laws inform the optimal split of a compute budget between model size and data (e.g., Chinchilla: roughly 20 tokens per parameter; see the sketch after this list).
  • Inference cost is significant; smaller models are often favored for deployment.
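
A back-of-envelope allocation sketch, assuming the common approximation that training costs about 6·N·D FLOPs for N parameters and D tokens, combined with the Chinchilla ~20 tokens-per-parameter rule; the FLOP budget is hypothetical.

```python
import math

def chinchilla_allocation(flops_budget):
    """Split a compute budget using C ≈ 6·N·D and D ≈ 20·N."""
    # C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
    params = math.sqrt(flops_budget / 120)
    return params, 20 * params

n, d = chinchilla_allocation(1e24)  # hypothetical 1e24-FLOP budget
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```

By this rule, a 1e24-FLOP budget suggests roughly a 90B-parameter model trained on ~1.8T tokens; runs like Llama 3's 15T tokens deliberately over-train relative to Chinchilla, trading extra training compute for cheaper inference.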

Post-Training & Alignment

  • Post-training (alignment) adapts pre-trained LLMs to follow instructions and avoid toxicity.
  • Supervised Fine-Tuning (SFT): fine-tune on human-written question-answer pairs; a relatively small dataset suffices.
  • LLM-generated synthetic data (e.g., Alpaca) can be effective for SFT.
  • Reinforcement Learning from Human Feedback (RLHF): models are trained against human preference rankings, using methods such as PPO or DPO.
  • DPO simplifies RLHF by directly optimizing on preference pairs, avoiding the full RL machinery (minimal sketch below).
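
A minimal sketch of the DPO loss; the inputs stand in for summed per-token log-probabilities of each response under the policy and the frozen reference model, and beta = 0.1 is a typical but illustrative value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: logistic loss on the margin between implicit rewards."""
    # Implicit reward = how far the policy moved from the reference.
    chosen_reward = policy_chosen - ref_chosen
    rejected_reward = policy_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

# Toy batch of 3 preference pairs (log-probs are illustrative).
loss = dpo_loss(torch.tensor([-5.0, -6.0, -4.0]),   # policy, chosen
                torch.tensor([-7.0, -5.5, -6.0]),   # policy, rejected
                torch.tensor([-5.5, -6.0, -4.5]),   # reference, chosen
                torch.tensor([-6.5, -5.0, -6.5]))   # reference, rejected
print(loss.item())
```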

Evaluation of Aligned LLMs

  • Open-ended tasks require side-by-side human or LLM comparisons (e.g., Chatbot Arena, AlpacaEval).
  • LLMs themselves can be used as "judges" to scale up evaluation (see the sketch after this list).
  • LLM judges tend to prefer longer outputs (length bias), which evaluations must correct for.
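
A toy win-rate computation over pairwise judge verdicts; querying the judge twice with the answer order swapped (to cancel position bias) mirrors common practice, but the verdict data and the `win_rate` helper are hypothetical. Length bias needs separate handling, e.g., AlpacaEval's length-controlled win rate.

```python
def win_rate(judgments):
    """Fraction of judge votes won by model A.

    Each judgment is (verdict_original_order, verdict_swapped_order);
    averaging over both orders cancels the judge's position bias."""
    wins = sum((orig == "A") * 0.5 + (swap == "A") * 0.5
               for orig, swap in judgments)
    return wins / len(judgments)

# Hypothetical verdicts for 4 prompts.
print(win_rate([("A", "A"), ("A", "B"), ("B", "B"), ("A", "A")]))  # 0.625
```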

Systems & Efficiency

  • GPU compute is the main bottleneck; memory and communication bandwidth are key constraints.
  • Mixed-precision training (computing in 16-bit floats while keeping 32-bit master weights) speeds up training with minimal accuracy loss (sketched after this list).
  • Operator fusion reduces memory transfers and increases throughput (e.g., torch.compile).
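
A minimal mixed-precision training loop using PyTorch's autocast and gradient scaling, assuming a CUDA GPU; the model and loss are dummies for illustration.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales loss so fp16 grads don't underflow

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    # Forward pass runs in float16 where safe; weights stay float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()  # dummy loss
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adapts the scale factor over time
```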

Key Terms & Definitions

  • LLM — Large Language Model; a neural network designed to model language at scale.
  • Token — A subword or word-level unit used for model input and output.
  • Autoregressive Model — Predicts the next token based on previous tokens.
  • Perplexity — A measure of language model uncertainty; lower means better prediction.
  • Supervised Fine-Tuning (SFT) — Training an LLM further on human-annotated data.
  • RLHF — Reinforcement Learning from Human Feedback; aligns models to human preferences.
  • DPO — Direct Preference Optimization; a simpler alternative to RL-based RLHF.
  • Scaling Laws — Empirical relationships predicting model performance as size/data/compute increases.

Action Items / Next Steps

  • Review tokenizer methods (e.g., BPE).
  • Explore LLM evaluation benchmarks (HELM, Hugging Face leaderboards).
  • For further study, consider courses: CS224n (NLP), CS324 (LLMs), CS336 (LLMs from scratch).