Lecture on Large Language Models by Sasha Rush

Jul 19, 2024

Introduction

  • Speaker: Sasha Rush, Associate Professor at Cornell Tech, also affiliated with Hugging Face
  • Introducer: David Parkes, Dean of the School of Engineering and Applied Sciences
  • Background:
    • AB in Computer Science from Harvard College
    • PhD from MIT
    • Postdoc at Facebook Research
    • Former Assistant Professor at Harvard (2015-2019)
    • Works on language models, algorithmic and hardware efficiency, and the security of AI systems

Overview

  • Title of the Talk: Large Language Models in Five Formulas
  • Structure: The talk is broken into five sections: Perplexity, Attention, GEMM, Chinchilla, and RASP
  • Mode: Interactive, open for questions
  • Sources: Draws on the speaker's own experience; a recorded version of the talk is also available on YouTube
  • Essence: Understanding large language models (LLMs) from a CS point of view, without overselling their capabilities

Section 1: Perplexity

  • Perplexity: Metric for evaluating language models
    • Assumptions: Documents of roughly 1,000 word tokens, drawn from a 10,000-word dictionary
    • Language Model: Probabilistic model of documents; chain rule using conditional probabilities
    • Auto-Regressive Models: Use previous words to predict the next word
  • Deriving the Metric (via a coding argument):
    • Encoding: Represent words as binary strings
    • Reduction: More probable words get shorter codes (bit length ≈ -log₂ p)
    • Metric Transformation: Convert average bits per word back to a readable scale, perplexity = 2^bits (see the sketch after this list)
    • Range: From 1 (perfectly predictable) to 10,000 (uniform over the dictionary)
  • Historical Insights: Markov models, simplified assumptions
  • Impact on Models: Lower perplexity tracks higher quality of generated text
  • Progress Over Time: Perplexity has fallen steadily, from older n-gram models down to modern models like GPT-3 (perplexity ~20.5)
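
To make the metric concrete, here is a minimal Python sketch (not code from the talk): perplexity is the exponentiated average negative log-likelihood per token, equivalently 2 raised to the average bits per token.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood).

    token_probs holds the probability the model assigned to each
    observed token, p(w_t | w_1 ... w_{t-1}).
    """
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Perfectly predictable text -> perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))    # 1.0
# Uniform guessing over a 10,000-word dictionary -> perplexity 10,000.
print(perplexity([1 / 10_000] * 3))   # ~10000.0
```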

Section 2: Memory and Attention

  • Memory: Using historical context to predict future tokens
  • Attention Mechanism: Key component, functioning as memory
    • Process: Query, Key, Value
    • Softmax Function: Replaces argmax to make operation differentiable
    • Embeddings: Vector representations of words, learned jointly across the attention layers
  • Transformers: Stacked attention layers that build up contextual memory
    • Examples: BERT, GPT-2, GPT-3
  • Efficiency: Attention reduces to matrix operations that run fast on parallel hardware (see the sketch below)
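
A minimal NumPy sketch of one attention head (an illustration of the query/key/value pattern, not the talk's exact notation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: a soft, differentiable lookup.

    Each query scores every key; softmax turns scores into weights
    (a "soft argmax"), and the output is a weighted sum of values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_queries, d_v)

# Toy example: 2 queries attending over 3 stored positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(attention(Q, K, V).shape)  # (2, 4)
```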

Section 3: GEMM and GPU Efficiency

  • Importance: Optimization of computation for training and inference
  • GPU Programming: Structure and efficiencies
    • Threads and Blocks: A hierarchy of fast shared memory and slow global memory, with different speeds and bottlenecks
    • Matrix Multiplications: The key operation, optimized as GEMM (general matrix multiply) kernels
    • Shared Memory: Reusing tiles of data to cut reads from global memory (see the sketch below)
  • Nvidia’s Role: Leading hardware improvements
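
The memory-reuse idea can be sketched in NumPy (a toy illustration of blocking; real GEMM kernels stage each tile in shared memory across a thread block):

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: work on TILE x TILE sub-blocks.

    On a GPU, each sub-block of A and B is loaded once into fast
    shared memory and reused many times, instead of re-reading the
    same elements from slow global memory.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # One tile's worth of multiply-accumulate work.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.rand(64, 96), np.random.rand(96, 48)
assert np.allclose(tiled_matmul(A, B), A @ B)
```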

Section 4: Scaling and Chinchilla

  • Scaling Laws: Balancing model size and training data
    • Key Insight: Compute budget allocation - model size vs. dataset size
    • Terminology: Parameters (N), tokens (D), FLOPs
  • Analyses: GPT-3 (2020) vs. newer models like PaLM
  • Efficiency Trade-offs: Compute cost (FLOPs) vs. reduction in perplexity
  • Formula: Compute-optimal scaling laws balance N (model parameters) against D (dataset size); see the sketch below
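
A hedged sketch of the Chinchilla trade-off in Python, using the parametric loss fit and constants reported by Hoffmann et al. (2022) and the standard approximation that training compute is C ≈ 6·N·D FLOPs; the budget below is an assumed example value:

```python
# Chinchilla-style loss as a function of model size N (parameters)
# and training data D (tokens). Constants are the fitted values
# reported in Hoffmann et al. (2022).
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

# Training compute is roughly C = 6 * N * D FLOPs, so a fixed budget
# forces a trade between more parameters and more tokens.
C = 1e23  # assumed compute budget in FLOPs, for illustration only
for N in [1e9, 10e9, 70e9, 400e9]:
    D = C / (6 * N)
    print(f"N={N:.0e} params, D={D:.1e} tokens -> loss {loss(N, D):.3f}")
```

Scanning N at a fixed budget shows an interior minimum, which is exactly the compute-optimal balance the scaling law is meant to locate.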

Section 5: Reasoning and RASP

  • RASP (Restricted Access Sequence Processing): A formal language for reasoning about what Transformers can compute
    • Purpose: Simplify understanding of Transformer operations
  • Example: Sequence manipulation tasks
  • Yields: Insights into how models handle contextual memory, including how backdoors could be hidden in model weights
    • Potential Applications: Security assessments, improved model transparency (see the sketch below)
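
RASP's two core primitives, select and aggregate, can be rendered informally in Python (a sketch following the semantics in Weiss et al., 2021, not the reference implementation):

```python
def select(keys, queries, predicate):
    """Build an attention-like selector: S[q][k] is True when
    predicate(keys[k], queries[q]) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """Average the selected values at each query position
    (RASP's abstraction of a uniform attention head)."""
    out = []
    for row in selector:
        picked = [v for v, sel in zip(values, row) if sel]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out

# Example: a running mean. At each position i, attend to all
# positions j <= i and average their values.
tokens = [3, 1, 4, 1, 5]
positions = list(range(len(tokens)))
sel = select(positions, positions, lambda k, q: k <= q)
print(aggregate(sel, tokens))  # [3.0, 2.0, 2.67, 2.25, 2.8] (approx.)
```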

Key Takeaways

  • Perplexity: Crucial for model performance assessment
  • Attention Mechanism: Integral for contextual memory and long-term dependencies
  • GPU Optimization: Essential for scaling, efficiency, and practical deployment
  • Scaling Laws (Chinchilla): Guide optimal resource allocation for training models
  • RASP: Provides theoretical framework for understanding Transformer mechanics and capabilities

Overall Impact: Understanding the intricacies of large language models helps refine their development, enhance scalability, and improve their applicability in real-world scenarios.