
Understanding Large Language Models (LLMs)

Apr 23, 2025

Lecture on Building Large Language Models (LLMs)

Overview

  • Focus: How LLMs work and the components needed to train them.
  • Key components: architecture, training loss and algorithm, data, evaluation, and systems.
  • LLMs are based on Transformers; this lecture focuses on non-architecture aspects such as data and systems.

Key Components of LLMs

Architecture

  • LLMs are neural networks with critical architecture considerations.
  • Not detailed in this lecture due to an abundance of existing resources.

Training

  • Involves choosing a loss function and an optimization algorithm.
  • The training loss, next-token cross-entropy for LLMs, is crucial for model optimization.

Data

  • Critical to model performance; involves collecting and processing vast amounts of internet data.
  • Importance of data quality and diversity.

Evaluation

  • Determines model efficacy through benchmarks and specific metrics like perplexity.

Systems

  • Optimization of LLMs on hardware is crucial due to size and computational demands.

Pre-Training vs. Post-Training

  • Pre-training: Classical language modeling paradigm.
    • Models probability distributions over token sequences.
  • Post-training: Recent trend focusing on AI assistants.
    • Example: Transition from GPT-3 to ChatGPT.

Language Modeling

  • Probabilistic model over sequences of tokens.
  • Generative models: once trained, the distribution can be sampled to generate new text.
  • Auto-regressive language models: predict the next token given all previous tokens, so the sequence probability factorizes by the chain rule (shown below).
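
As a compact statement, the autoregressive factorization is just the chain rule of probability (standard, not specific to this lecture):

```latex
p(x_1, \dots, x_L) \;=\; \prod_{i=1}^{L} p\!\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Training maximizes the log of this product over the corpus; generation samples one token at a time from the learned conditional.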

Tokenization

  • Converts text into tokens, which lets models handle diverse languages and typos.
  • BPE (Byte Pair Encoding): a common method that builds the vocabulary by repeatedly merging the most frequent pair of adjacent symbols (see the sketch below).
  • Each token maps to a unique ID in a fixed vocabulary; good tokenization is essential for effective model training.
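
A minimal sketch of BPE training, assuming a toy whitespace-split corpus; the function name and merge count are illustrative, not from the lecture:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes one token
        merges.append(best)
        # Apply the merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

print(train_bpe("low lower lowest new newer", num_merges=5))
```

Each merge adds one token to the vocabulary, so the merge count directly controls vocabulary size.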

Evaluation of LLMs

  • Perplexity: measures the model's uncertainty in next-token prediction; no longer standard in academic benchmarking because its value depends on the tokenizer, making cross-model comparison unfair (worked example below).
  • Academic benchmarks aggregate multiple NLP tasks.
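
As a worked example of the metric, perplexity is the exponentiated average negative log-likelihood per token; this small sketch assumes you already have per-token log-probabilities from some model:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).
    token_log_probs: log p(x_i | x_<i) for each token, from any LM."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model assigning probability 0.25 to every token has perplexity 4:
# it is "as uncertain" as a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 10))  # ≈ 4.0
```

Because the average is per token, a tokenizer that splits the same text into fewer, longer tokens changes the number, which is why comparing perplexities across models with different tokenizers is unfair.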

Data for LLMs

  • Involves crawling and cleaning internet data.
  • Challenges: undesirable content and massive duplication, addressed with deduplication and heuristic filtering (a minimal sketch follows this list).
  • Model-based filtering, i.e., training a classifier to keep documents that resemble known high-quality text, further improves data quality.
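
A minimal sketch of two common cleaning steps named above: exact deduplication by hashing, plus a crude heuristic filter. The thresholds are illustrative assumptions, not values from the lecture:

```python
import hashlib

def clean(documents):
    """Exact deduplication by content hash, then simple heuristic filters."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:  # drop verbatim duplicates
            continue
        seen.add(digest)
        words = doc.split()
        if len(words) < 5:  # too short to be useful training text (illustrative)
            continue
        mean_word_len = sum(map(len, words)) / len(words)
        if not (2 <= mean_word_len <= 12):  # crude gibberish filter (illustrative)
            continue
        kept.append(doc)
    return kept

docs = [
    "a real paragraph of reasonable English text goes here",
    "a real paragraph of reasonable English text goes here",  # duplicate
    "too short",
]
print(len(clean(docs)))  # -> 1
```

Production pipelines go much further (near-duplicate detection, language ID, toxicity filters), but the shape is the same: a cascade of cheap filters over billions of documents.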

Scaling Laws

  • Larger models trained on more data achieve lower loss, with remarkably predictable returns.
  • Predictive scaling laws guide resource allocation and model-size decisions (see the parametric form below).
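
As an illustration, the widely cited Chinchilla fit (Hoffmann et al., 2022) models final loss as a function of parameter count N and training tokens D; the symbolic form below is from that paper, with the fitted constants omitted:

```latex
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Minimizing this under a fixed compute budget (roughly C ≈ 6ND FLOPs) yields the familiar rule of thumb of about 20 training tokens per parameter.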

Post-Training

  • Align LLMs to follow instructions and reflect human preferences.
  • Supervised Fine-Tuning (SFT): fine-tunes the model on human-written demonstrations using the standard language-modeling loss.
  • Reinforcement Learning from Human Feedback (RLHF): optimizes the model against a reward signal derived from human preferences.
  • DPO (Direct Preference Optimization): a simpler alternative to RLHF that directly increases the relative likelihood of preferred over rejected outputs (loss sketch below).
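
A minimal PyTorch sketch of the DPO loss (Rafailov et al., 2023); the input names are illustrative, and each input is the summed per-token log-probability of one full response under the policy or the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: raise the policy's likelihood of the chosen response relative to
    the rejected one, measured against a frozen reference model.
    All inputs are (batch,) tensors of summed response log-probabilities."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen        # log pi/pi_ref (chosen)
    rejected_ratio = policy_logp_rejected - ref_logp_rejected  # log pi/pi_ref (rejected)
    # -log sigmoid(beta * margin), via logsigmoid for numerical stability
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss.item())  # ≈ 0.62
```

The appeal over RLHF is that this is a plain supervised objective on preference pairs: no reward model and no reinforcement-learning loop.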

Evaluation of Post-Training

  • Challenges in evaluating open-ended responses.
  • Use of human or LM judges to compare model outputs pairwise (see the win-rate sketch below).
  • Importance of unbiased and effective evaluation methods.
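
A minimal sketch of the pairwise-preference evaluation described above; judge(a, b) is a hypothetical stand-in for a human or LM judge, and the position randomization addresses one known judge bias:

```python
import random

def win_rate(model_outputs, baseline_outputs, judge):
    """Fraction of prompts on which the judge prefers the model's response.
    `judge(a, b)` returns True if it prefers `a` (hypothetical interface).
    Randomizing presentation order mitigates the judge's position bias."""
    wins = 0
    for ours, theirs in zip(model_outputs, baseline_outputs):
        if random.random() < 0.5:
            wins += judge(ours, theirs)
        else:
            wins += not judge(theirs, ours)
    return wins / len(model_outputs)

# Toy judge that always prefers the longer answer (a known LM-judge bias):
longer = lambda a, b: len(a) > len(b)
print(win_rate(["a long, detailed answer"], ["short"], longer))  # -> 1.0
```

The toy judge illustrates why bias matters: an LM judge with a preference for long answers will inflate the win rate of a verbose model, so careful evaluation must control for such effects.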

Systems and Computing

  • Efficient use of GPUs is crucial for cost-effective LLM training.
  • Low-precision computing: reduces memory use and increases throughput.
  • Operator fusion: combines multiple operations into a single kernel to minimize data movement between GPU memory and compute units (PyTorch sketch below).
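
A minimal PyTorch sketch of both ideas, run on CPU for portability (on a GPU one would pass device_type="cuda"); the toy model stands in for a Transformer block:

```python
import torch

# Toy model standing in for a Transformer block.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
)
x = torch.randn(32, 1024)

# Low-precision computing: autocast runs matmuls in bfloat16,
# cutting activation memory and speeding up matrix multiplies.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

# Operator fusion: torch.compile traces the model and fuses adjacent
# elementwise ops into fewer kernels, reducing memory round-trips.
compiled_model = torch.compile(model)
y = compiled_model(x)
```

Both tricks attack the same bottleneck: modern accelerators are usually limited by memory bandwidth rather than raw arithmetic, so anything that moves fewer bytes wins.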

Conclusion

  • Training and deploying LLMs is complex, involving significant systems and data considerations.
  • Ongoing research and development in data collection, scaling laws, and post-training improvements.

Additional Learning Resources

  • CS courses on natural language processing and LLMs are recommended for deeper understanding.