
Understanding Large Language Models (LLMs)

Apr 23, 2025

Lecture on Building Large Language Models (LLMs)

Overview

  • Focus: How LLMs work and the components needed to train them.
  • Key components: architecture, training loss and algorithm, data, evaluation, and systems.
  • LLMs are based on Transformers; this lecture focuses on non-architecture aspects such as data and systems.

Key Components of LLMs

Architecture

  • LLMs are neural networks with critical architecture considerations.
  • Not detailed in this lecture due to an abundance of existing resources.

Training

  • Involves choosing a loss function and an optimization algorithm.
  • The training loss, next-token cross-entropy for LLMs, is crucial for model optimization.

Data

  • Critical to model performance; involves collecting and processing vast amounts of internet data.
  • Importance of data quality and diversity.

Evaluation

  • Determines model efficacy through benchmarks and specific metrics like perplexity.

Systems

  • Optimization of LLMs on hardware is crucial due to size and computational demands.

Pre-Training vs. Post-Training

  • Pre-training: Classical language modeling paradigm.
    • Models probability distributions over token sequences.
  • Post-training: Recent trend focusing on AI assistants.
    • Example: Transition from GPT-3 to ChatGPT.

Language Modeling

  • Probabilistic model over sequences of tokens.
  • Generative models: once trained, the distribution can be sampled to generate new text.
  • Auto-regressive language models: predict the next token given all previous tokens, so the sequence probability factorizes by the chain rule (shown below).
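
As a compact statement, the autoregressive factorization is just the chain rule of probability (standard, not specific to this lecture):

```latex
p(x_1, \dots, x_L) \;=\; \prod_{i=1}^{L} p\!\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Training maximizes the log of this product over the corpus; generation samples one token at a time from the learned conditional.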

Tokenization

  • Converts text into tokens, which lets models handle diverse languages and typos.
  • BPE (Byte Pair Encoding): a common method that builds the vocabulary by repeatedly merging the most frequent pair of adjacent symbols (see the sketch below).
  • Each token maps to a unique ID in a fixed vocabulary; good tokenization is essential for effective model training.
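
A minimal sketch of BPE training, assuming a toy whitespace-split corpus; the function name and merge count are illustrative, not from the lecture:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes one token
        merges.append(best)
        # Apply the merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

print(train_bpe("low lower lowest new newer", num_merges=5))
```

Each merge adds one token to the vocabulary, so the merge count directly controls vocabulary size.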

Evaluation of LLMs

  • Perplexity: measures the model's uncertainty in next-token prediction; no longer standard in academic benchmarking because its value depends on the tokenizer, making cross-model comparison unfair (worked example below).
  • Academic benchmarks aggregate multiple NLP tasks.
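
As a worked example of the metric, perplexity is the exponentiated average negative log-likelihood per token; this small sketch assumes you already have per-token log-probabilities from some model:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).
    token_log_probs: log p(x_i | x_<i) for each token, from any LM."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model assigning probability 0.25 to every token has perplexity 4:
# it is "as uncertain" as a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 10))  # ≈ 4.0
```

Because the average is per token, a tokenizer that splits the same text into fewer, longer tokens changes the number, which is why comparing perplexities across models with different tokenizers is unfair.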

Data for LLMs

  • Involves crawling and cleaning internet data.
  • Challenges: undesirable content and massive duplication, addressed with deduplication and heuristic filtering (a minimal sketch follows this list).
  • Model-based filtering, i.e., training a classifier to keep documents that resemble known high-quality text, further improves data quality.
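
A minimal sketch of two common cleaning steps named above: exact deduplication by hashing, plus a crude heuristic filter. The thresholds are illustrative assumptions, not values from the lecture:

```python
import hashlib

def clean(documents):
    """Exact deduplication by content hash, then simple heuristic filters."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:  # drop verbatim duplicates
            continue
        seen.add(digest)
        words = doc.split()
        if len(words) < 5:  # too short to be useful training text (illustrative)
            continue
        mean_word_len = sum(map(len, words)) / len(words)
        if not (2 <= mean_word_len <= 12):  # crude gibberish filter (illustrative)
            continue
        kept.append(doc)
    return kept

docs = [
    "a real paragraph of reasonable English text goes here",
    "a real paragraph of reasonable English text goes here",  # duplicate
    "too short",
]
print(len(clean(docs)))  # -> 1
```

Production pipelines go much further (near-duplicate detection, language ID, toxicity filters), but the shape is the same: a cascade of cheap filters over billions of documents.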

Scaling Laws

  • Larger models trained on more data achieve lower loss, with remarkably predictable returns.
  • Predictive scaling laws guide resource allocation and model-size decisions (see the parametric form below).
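
As an illustration, the widely cited Chinchilla fit (Hoffmann et al., 2022) models final loss as a function of parameter count N and training tokens D; the symbolic form below is from that paper, with the fitted constants omitted:

```latex
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Minimizing this under a fixed compute budget (roughly C ≈ 6ND FLOPs) yields the familiar rule of thumb of about 20 training tokens per parameter.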

Post-Training

  • Align LLMs to follow instructions and reflect human preferences.
  • Supervised Fine-Tuning (SFT): fine-tunes the model on human-written demonstrations using the standard language-modeling loss.
  • Reinforcement Learning from Human Feedback (RLHF): optimizes the model against a reward signal derived from human preferences.
  • DPO (Direct Preference Optimization): a simpler alternative to RLHF that directly increases the relative likelihood of preferred over rejected outputs (loss sketch below).
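
A minimal PyTorch sketch of the DPO loss (Rafailov et al., 2023); the input names are illustrative, and each input is the summed per-token log-probability of one full response under the policy or the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: raise the policy's likelihood of the chosen response relative to
    the rejected one, measured against a frozen reference model.
    All inputs are (batch,) tensors of summed response log-probabilities."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen        # log pi/pi_ref (chosen)
    rejected_ratio = policy_logp_rejected - ref_logp_rejected  # log pi/pi_ref (rejected)
    # -log sigmoid(beta * margin), via logsigmoid for numerical stability
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss.item())  # ≈ 0.62
```

The appeal over RLHF is that this is a plain supervised objective on preference pairs: no reward model and no reinforcement-learning loop.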

Evaluation of Post-Training

  • Challenges in evaluating open-ended responses.
  • Use of human or LM judges to compare model outputs pairwise (see the win-rate sketch below).
  • Importance of unbiased and effective evaluation methods.
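
A minimal sketch of the pairwise-preference evaluation described above; judge(a, b) is a hypothetical stand-in for a human or LM judge, and the position randomization addresses one known judge bias:

```python
import random

def win_rate(model_outputs, baseline_outputs, judge):
    """Fraction of prompts on which the judge prefers the model's response.
    `judge(a, b)` returns True if it prefers `a` (hypothetical interface).
    Randomizing presentation order mitigates the judge's position bias."""
    wins = 0
    for ours, theirs in zip(model_outputs, baseline_outputs):
        if random.random() < 0.5:
            wins += judge(ours, theirs)
        else:
            wins += not judge(theirs, ours)
    return wins / len(model_outputs)

# Toy judge that always prefers the longer answer (a known LM-judge bias):
longer = lambda a, b: len(a) > len(b)
print(win_rate(["a long, detailed answer"], ["short"], longer))  # -> 1.0
```

The toy judge illustrates why bias matters: an LM judge with a preference for long answers will inflate the win rate of a verbose model, so careful evaluation must control for such effects.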

Systems and Computing

  • Efficient use of GPUs is crucial for cost-effective LLM training.
  • Low-precision computing: reduces memory use and increases throughput.
  • Operator fusion: combines multiple operations into a single kernel to minimize data movement between GPU memory and compute units (PyTorch sketch below).
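
A minimal PyTorch sketch of both ideas, run on CPU for portability (on a GPU one would pass device_type="cuda"); the toy model stands in for a Transformer block:

```python
import torch

# Toy model standing in for a Transformer block.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
)
x = torch.randn(32, 1024)

# Low-precision computing: autocast runs matmuls in bfloat16,
# cutting activation memory and speeding up matrix multiplies.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

# Operator fusion: torch.compile traces the model and fuses adjacent
# elementwise ops into fewer kernels, reducing memory round-trips.
compiled_model = torch.compile(model)
y = compiled_model(x)
```

Both tricks attack the same bottleneck: modern accelerators are usually limited by memory bandwidth rather than raw arithmetic, so anything that moves fewer bytes wins.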

Conclusion

  • Training and deploying LLMs is complex, involving significant systems and data considerations.
  • Ongoing research and development in data collection, scaling laws, and post-training improvements.

Additional Learning Resources

  • CS courses on natural language processing and LLMs are recommended for deeper understanding.