Overview
This lecture provides an overview of how Large Language Models (LLMs) are trained and built, focusing on data, evaluation, scaling laws, tokenization, pre-training, post-training, and systems.
Core Components for Training LLMs
- LLMs are large neural networks, specifically based on Transformers.
- Key components: architecture, training loss/algorithm, data, evaluation, and systems for scaling and efficiency.
- Industry focuses more on data, evaluation, and systems than on architecture.
Pre-Training LLMs
- Pre-training involves modeling the distribution of tokens (words/subwords) using internet-scale data.
- Language models assign probabilities to sequences; better models capture both syntax and semantics.
- Most current models use autoregressive language modeling (predict next token given context).
- Loss function: cross-entropy on the next token, which is equivalent to maximizing the log-likelihood of real text (a minimal sketch follows this list).
- Tokenization converts text into manageable chunks (tokens), typically using algorithms like Byte Pair Encoding (BPE).
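A minimal PyTorch-style sketch of the autoregressive cross-entropy objective described above, assuming a model that already produces per-position logits over the vocabulary; the function name and tensor shapes are illustrative, not from the lecture.

```python
# Sketch of the next-token cross-entropy loss (illustrative; shapes and names
# are placeholders, not the lecture's code).
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size); token_ids: (batch, seq_len)."""
    # Predict token t+1 from positions <= t: drop the last prediction,
    # drop the first target.
    shift_logits = logits[:, :-1, :]
    shift_targets = token_ids[:, 1:]
    # Cross-entropy = negative log-likelihood of the observed next token,
    # averaged over positions; minimizing it maximizes log p(text).
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```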
Tokenization
- Tokens are subword units; more general than words or characters.
- Efficient tokenization balances vocabulary size against sequence length; Byte Pair Encoding (BPE) is commonly used (a toy BPE trainer is sketched after this list).
- Keeping short subword tokens in the vocabulary lets the model fall back to smaller pieces for typos and rare words.
- When encoding text, tokenizers greedily select the longest matching token in their vocabulary.
- Drawbacks: tokenization handles numbers, some code, and math poorly; a future shift toward character- or byte-level models is possible.
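A toy BPE trainer, to make the merge procedure concrete: repeatedly find the most frequent adjacent symbol pair and merge it into a new token. Illustrative only; production tokenizers (e.g., tiktoken, Hugging Face tokenizers) operate on bytes and are far more optimized.

```python
# Toy Byte Pair Encoding (BPE) training loop (illustrative sketch).
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start from character-level symbols for each word.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the chosen pair with the merged symbol.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(train_bpe(["lower", "lowest", "newer", "wider"], num_merges=5))
```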
Evaluation of LLMs
- Perplexity measures model uncertainty (the exponentiated average per-token loss); lower is better, and values depend on the tokenizer and the evaluation data (a small computation is sketched after this list).
- Academic benchmarks now aggregate performance across many tasks (e.g., HELM, the Hugging Face Open LLM Leaderboard).
- Multiple-choice QA is used for automatic evaluation; open-ended tasks are harder to assess.
- Challenge: results depend heavily on the evaluation setup (prompting, answer extraction) and on possible train-test contamination.
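A small sketch of how perplexity relates to per-token log-likelihood, assuming we already have the log-probabilities the model assigned to each observed token; the helper name is illustrative.

```python
# Perplexity = exp(average negative log-likelihood per token).
import math

def perplexity(token_log_probs: list[float]) -> float:
    """token_log_probs: natural-log probabilities the model assigned to each
    observed token. Lower perplexity means the model was less 'surprised'."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Example: a model that assigns probability 0.25 to every token has
# perplexity 4 -- as uncertain as choosing uniformly among 4 options.
print(perplexity([math.log(0.25)] * 10))  # -> 4.0
```

Because the average is taken per token, the same text segmented by a different tokenizer yields a different perplexity, which is why perplexity comparisons require a fixed tokenizer and dataset.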
Data for Pre-Training
- Raw internet data is massive and noisy; requires substantial filtering.
- Steps: extract text from raw web crawls, filter undesirable content (NSFW, low-quality pages) with heuristic and model-based filters, and deduplicate (a toy deduplication sketch follows this list).
- Dataset composition is carefully balanced (upweighting code, downweighting entertainment).
- Large models are trained on trillions of tokens (e.g., 15T for Llama 3).
- Data collection and curation is a major practical challenge.
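A minimal exact-deduplication sketch for the step above: drop documents whose normalized text hashes to something already seen. Real pipelines also use fuzzy methods such as MinHash/LSH for near-duplicates; this is illustrative only.

```python
# Exact deduplication by hashing normalized document text (illustrative sketch).
import hashlib

def dedupe(documents: list[str]) -> list[str]:
    seen, kept = set(), []
    for doc in documents:
        # Cheap normalization so trivial whitespace/case differences collapse.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["Hello world!", "hello   WORLD!", "Something else."]
print(dedupe(docs))  # keeps the first copy and the distinct document
```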
Scaling Laws
- Performance improves predictably with more data, bigger models, and more compute (no overfitting observed yet).
- Scaling laws inform how to split a fixed compute budget between model size and training data (e.g., Chinchilla: roughly 20 tokens per parameter); a worked example follows this list.
- Inference cost is significant, so deployed models are often smaller than compute-optimal and trained on more tokens instead.
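A worked example of Chinchilla-style allocation, using the common rule of thumb that training costs roughly 6 FLOPs per parameter per token (C ≈ 6·N·D) together with the ~20 tokens-per-parameter ratio from the lecture. The compute budget and function name are illustrative.

```python
# Chinchilla-style compute allocation (rule-of-thumb sketch, not exact).
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e24)  # an illustrative large training budget
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
# -> on the order of 9e10 parameters trained on ~1.8e12 tokens
```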
Post-Training & Alignment
- Post-training (alignment) adapts pre-trained LLMs to follow instructions and avoid toxicity.
- Supervised Fine-Tuning (SFT): fine-tune on human-written question-answer (instruction-response) pairs; only a relatively small dataset is required.
- LLM-generated synthetic data (e.g., Alpaca) can be effective for SFT.
- Reinforcement Learning from Human Feedback (RLHF): a reward model is fit to human preference rankings, and the LLM is then optimized against it (typically with PPO).
- Direct Preference Optimization (DPO) simplifies this pipeline by optimizing the preference objective directly, without explicitly training a separate reward model or running full RL (a sketch of the DPO loss follows this list).
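A sketch of the DPO objective for a batch of preference pairs, assuming we already have summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; variable names are illustrative.

```python
# DPO loss sketch: push the policy's implicit reward margin between the
# chosen and rejected responses through a logistic loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Implicit "reward" of each response = beta * log(pi_theta / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```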
Evaluation of Aligned LLMs
- Open-ended tasks require side-by-side human or LLM comparisons (e.g., Chatbot Arena, AlpacaEval).
- LLMs can be used as "evaluators" to scale evaluation.
- Output length bias (LLM judges tend to favor longer answers) is a challenge for using LLMs as evaluators; a toy check is sketched below.
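A toy win-rate computation from pairwise judgments, plus a crude check for the length bias mentioned above (how often the judge's pick is simply the longer output). Purely illustrative; methods like AlpacaEval's length-controlled win rate are more involved, and the data here is made up.

```python
# Pairwise win rate and a rough length-bias check (illustrative sketch).
def win_rate(judgments: list[dict]) -> tuple[float, float]:
    wins = sum(j["winner"] == "model_a" for j in judgments)
    # Fraction of comparisons where the judge picked whichever output was longer.
    prefers_longer = sum(
        (j["winner"] == "model_a") == (j["len_a"] > j["len_b"]) for j in judgments
    )
    return wins / len(judgments), prefers_longer / len(judgments)

judgments = [
    {"winner": "model_a", "len_a": 420, "len_b": 180},
    {"winner": "model_b", "len_a": 150, "len_b": 390},
    {"winner": "model_a", "len_a": 300, "len_b": 310},
]
rate, longer_rate = win_rate(judgments)
print(f"model_a win rate: {rate:.2f}, judge-prefers-longer rate: {longer_rate:.2f}")
```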
Systems & Efficiency
- GPU compute is the main bottleneck; memory and communication bandwidth are key constraints.
- Mixed-precision training (running the forward/backward computation in 16-bit floats while keeping master weights in 32-bit) speeds up training with minimal accuracy loss.
- Operator fusion reduces data movement between GPU memory and compute units, increasing throughput (e.g., torch.compile).
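A minimal sketch combining both ideas in PyTorch 2.x, assuming a CUDA GPU; the tiny linear model, loss, and training-step function are stand-ins, not the lecture's setup.

```python
# Mixed-precision training step with torch.compile for operator fusion
# (illustrative sketch; requires PyTorch 2.x and a CUDA GPU).
import torch

model = torch.nn.Linear(1024, 1024).cuda()      # stand-in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
compiled_model = torch.compile(model)           # fuses ops into faster kernels

def train_step(batch: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in bfloat16 to use tensor cores and cut memory traffic;
    # the parameters themselves stay in float32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = compiled_model(batch)
        loss = torch.nn.functional.mse_loss(out, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```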
Key Terms & Definitions
- LLM — Large Language Model; a neural network designed to model language at scale.
- Token — A subword or word-level unit used for model input and output.
- Autoregressive Model — Predicts the next token based on previous tokens.
- Perplexity — A measure of language model uncertainty; lower means better prediction.
- Supervised Fine-Tuning (SFT) — Training an LLM further on human-annotated data.
- RLHF — Reinforcement Learning from Human Feedback; aligns models to human preferences.
- DPO — Direct Preference Optimization; a simpler alternative to the RL step of RLHF.
- Scaling Laws — Empirical relationships predicting model performance as size/data/compute increases.
Action Items / Next Steps
- Review tokenizer methods (e.g., BPE).
- Explore LLM evaluation benchmarks (HELM, Hugging Face leaderboards).
- For further study, consider courses: CS224n (NLP), CS324 (LLMs), CS336 (LLMs from scratch).