Understanding Large Language Models

Sep 16, 2024

Building Large Language Models (LLMs)

Overview

  • LLMs: Large Language Models are the models behind systems such as ChatGPT, Claude, and Gemini.
  • The lecture covers:
    • Key components in training LLMs.
    • Pre-training and post-training paradigms.
    • Basic understanding of language modeling.

Key Components for Training LLMs

  1. Architecture: Neural networks, in practice Transformers (a minimal attention sketch follows this list).
  2. Training Loss and Algorithm: The objective (typically next-token cross-entropy) and the optimizer used to minimize it.
  3. Data: The quality and quantity of data used for training.
  4. Evaluation: Metrics to assess model performance.
  5. System Components: Efficiently running models on modern hardware.
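
To make the architecture item concrete, here is a minimal single-head causal self-attention in NumPy. Shapes and names are illustrative only; real Transformers are multi-headed, batched, and wrap attention in residual and MLP blocks.

```python
# Minimal single-head causal self-attention, the core Transformer operation.
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head) projections."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # query/key/value projections
    scores = q @ k.T / np.sqrt(k.shape[-1])    # scaled dot-product similarities
    # Causal mask: position t may only attend to positions <= t.
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                         # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(x, Wq, Wk, Wv).shape)  # (4, 8)
```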

Pre-training vs. Post-training

  • Pre-training: Training a general language model on large amounts of internet text.
  • Post-training: Adapting the pre-trained LLM to downstream uses, such as AI assistants (e.g., ChatGPT).

Language Modeling Basics

  • Language models estimate a probability distribution over sequences of tokens (roughly, words or word pieces).
  • Generative Models: Sampling from the learned distribution produces new text.
  • Auto-regressive Language Models: Predict the next token given the preceding context.
    • The chain rule of probability factorizes a sequence's probability into these next-token predictions (see the worked factorization below).
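
As a worked form of that factorization (notation added here, not in the original notes), an auto-regressive model scores a token sequence x_1, ..., x_T as

$$
p(x_1, \ldots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}),
$$

and generates text by sampling each x_t from its conditional in turn.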

Tokenization

  • Tokenizers convert text into manageable pieces (tokens) for LLMs.
    • Tokens can be words, subwords, or characters.
    • Subword tokenizers handle typos, rare words, and multiple languages more gracefully than word-level ones.
    • Byte Pair Encoding (BPE): A common tokenization algorithm that starts from characters (or bytes) and repeatedly merges the most frequent adjacent pair of tokens (see the sketch below).
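
A toy version of the BPE merge loop, starting from character-level tokens. This is a simplified sketch; production tokenizers operate on bytes and are heavily optimized.

```python
# Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def train_bpe(text, num_merges):
    tokens = list(text)                            # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))   # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]        # most frequent pair
        merges.append((a, b))
        merged, i = [], 0                          # replace (a, b) with "ab"
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b); i += 2
            else:
                merged.append(tokens[i]); i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", num_merges=4)
print(merges)   # e.g. [('l', 'o'), ('lo', 'w'), ...]
print(tokens)
```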

Evaluation of LLMs

  • Perplexity, the exponentiated average negative log-likelihood per token, is a standard metric for evaluating LLM fit (see the sketch after this list).
  • Evaluation challenges include:
    • Different evaluation methodologies can yield inconsistent results.
    • Train-test contamination: test data leaking into the training set, which inflates scores.
  • Benchmarking: Common NLP benchmarks are used for evaluation (e.g., MMLU, HELM).
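
A minimal sketch of the perplexity computation; the per-token probabilities below are made up for illustration and would come from the model's softmax on held-out text.

```python
import math

# p(x_t | x_<t) assigned by the model to each token of a held-out sequence
token_probs = [0.20, 0.05, 0.50, 0.10]

nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)   # exp(average negative log-likelihood)
print(f"avg NLL = {nll:.3f}, perplexity = {perplexity:.2f}")
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as a uniform choice among k tokens.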

Data Collection for LLMs

  • Data is collected by crawling the internet (~250 billion pages).
  • Steps in data processing include (a toy sketch follows this list):
    • Text extraction from HTML.
    • Filtering undesirable content (e.g., toxic material and personally identifiable information, PII).
    • Deduplication of repeated content.
    • Heuristic filtering for low-quality documents.
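
A toy sketch of the deduplication and heuristic-filtering steps. The hash-based exact dedup and the length threshold are illustrative assumptions; real pipelines add fuzzy dedup (e.g., MinHash) and trained quality classifiers.

```python
import hashlib

def clean(documents, min_words=50):
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:                  # exact duplicate: drop
            continue
        seen.add(digest)
        if len(doc.split()) < min_words:    # heuristic: too short to keep
            continue
        kept.append(doc)
    return kept

docs = ["some longer article text here"] * 3 + ["too short"]
print(len(clean(docs, min_words=3)))        # 1: duplicates and short docs removed
```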

Scaling and Optimization

  • Scaling Laws: Loss falls predictably (roughly as a power law) as models and datasets grow, so more data and larger models generally perform better.
  • Computational efficiency is critical due to the size of data and models.
    • Low Precision Training: Computing in 16-bit floats (e.g., bfloat16) instead of 32-bit to cut memory traffic and speed up matrix multiplies.
    • Operator Fusion: Combining several operations into a single GPU kernel to avoid repeated round trips to GPU memory (see the sketch after this list).
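
A minimal PyTorch sketch of both techniques, assuming a CUDA GPU. The tiny model and random data are placeholders, not the lecture's setup; `torch.autocast` runs the forward pass in bfloat16, and `torch.compile` fuses operators where it can.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)).cuda()
model = torch.compile(model)                 # JIT-compiles and fuses kernels
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(32, 256, device="cuda")      # placeholder batch
target = torch.randn(32, 256, device="cuda")

optimizer.zero_grad()
# 16-bit forward pass: less memory traffic, faster matrix multiplies.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)
loss.backward()                              # grads stay in the params' float32
optimizer.step()
```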

Post-training Techniques

  1. Supervised Fine-tuning (SFT): Fine-tuning on human-written demonstrations of the desired behavior.
  2. Reinforcement Learning from Human Feedback (RLHF): Aligning model behavior with human preferences.
    • Collects human preference comparisons between model outputs and optimizes the model toward the preferred ones (typically via a learned reward model).
  3. DPO (Direct Preference Optimization): A simpler alternative to RLHF that directly raises the likelihood of preferred outputs relative to dispreferred ones, without an explicit reward model or RL loop (see the loss sketch after this list).
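
A minimal sketch of the DPO objective. Variable names are assumptions: `logp_*` are log-probabilities of the preferred (chosen) and dispreferred (rejected) completions summed over tokens, under the policy being trained and under a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # How much more the policy favors each completion than the reference does.
    chosen_margin = logp_chosen - ref_chosen
    rejected_margin = logp_rejected - ref_rejected
    # Push the preferred completion's margin above the dispreferred one's.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up log-probabilities for a batch of two comparisons.
print(dpo_loss(torch.tensor([-5.0, -6.0]), torch.tensor([-7.0, -6.5]),
               torch.tensor([-5.5, -6.2]), torch.tensor([-6.8, -6.4])))
```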

Challenges in Post-training

  • Ideal answers are slow and expensive for humans to write, and human labelers introduce biases into preference data (e.g., favoring longer answers).

Future Directions and Considerations

  • The lecture emphasizes the importance of systems, data, and effective architecture in building scalable LLMs.
  • It also touches upon various complexities and ethical considerations in deploying LLMs.

Recommended Courses

  • CS224N: Historical context of LLMs.
  • CS324: In-depth exploration of LLMs.
  • CS336: Hands-on experience building LLMs.