Lecture on Large Language Models by Sasha Rush
Introduction
- Speaker: Sasha Rush, Associate Professor at Cornell Tech, also affiliated with Hugging Face
- Introducer: David Parkes, Dean of the School of Engineering and Applied Sciences
- Background:
- AB in Computer Science from Harvard College
- PhD from MIT
- Postdoc at Facebook Research
- Former Asst. Professor at Harvard (2015-2019)
- Works on language models, the efficiency of algorithms and hardware, and the security of AI systems
Overview
- Title of the Talk: Large Language Models in Five Formulas
- Structure: Talk is broken into five sections: Perplexity, Attention, GEMM, Chinchilla, and RASP
- Mode: Interactive, open for questions
- Sources: The speaker's own experience and earlier publicly available talks (e.g., on YouTube)
- Essence: Understanding large language models (LLMs) from a computer-science point of view, without overselling their capabilities
Section 1: Perplexity
- Perplexity: Metric for evaluating language models
- Assumptions: Documents of roughly 1,000 word tokens, drawn from a 10,000-word vocabulary
- Language Model: Probabilistic model of documents; chain rule using conditional probabilities
- Auto-Regressive Models: Use previous words to predict the next word
- Perplexity Metric: Method
- Encoding: Represent each word as a binary code string
- Reduction: Higher-probability words get shorter codes, so the model's average bits per word measures how well it predicts
- Metric Transformation: Exponentiating the average bits per word (2^bits) converts bits into the more readable perplexity number (a minimal computation sketch follows this list)
- Examples: Perplexity ranges from 1 (perfectly predictable) to 10,000 (uniform over the 10,000-word vocabulary)
- Historical Insights: Markov models, simplified assumptions
- Impact on Models: Lower perplexity aligns closely with higher-quality generated text
- Scaling Metrics: Steady perplexity improvements over time, from classic n-gram (Markov) models to modern models like GPT-3 (perplexity ~20.5)
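A minimal sketch of the perplexity computation described above, in plain Python; the function name and the toy probabilities are illustrative, not from the talk. A token assigned probability p costs about -log2(p) bits under an optimal code, and perplexity is 2 raised to the average bits per token.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to each observed token.

    A token with probability p needs about -log2(p) bits under an optimal code;
    perplexity is 2 raised to the average number of bits per token.
    """
    avg_bits = sum(-math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_bits

# A perfectly confident model: perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))      # 1.0
# A uniform model over a 10,000-word vocabulary: perplexity ~10,000.
print(perplexity([1 / 10_000] * 3))     # ~10000.0
```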
Section 2: Memory and Attention
- Memory: Using historical context to predict future tokens
- Attention Mechanism: Key component, functioning as memory
- Process: Each position's query is compared against keys to retrieve a weighted mix of values, i.e. a soft lookup (sketched in code after this list)
- Softmax Function: Replaces the non-differentiable argmax so the lookup can be trained with gradient descent
- Embedding Layers: Word representations are learned and refined across multiple attention layers
- Transformers: Stacked layers utilizing attention for improved context memory
- Examples: BERT, GPT-2, GPT-3
- Efficiency: Achieved because attention is built from matrix multiplications, which GPUs compute very fast (the topic of Section 3)
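A minimal single-head scaled dot-product attention sketch in NumPy, following the query/key/value description above; the array shapes and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Each query is scored against every key; softmax turns the scores into
    weights (a differentiable stand-in for argmax), and the output is the
    corresponding weighted average of the values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # (n_queries, d_v)

# Toy example: 4 positions with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)           # (4, 8)
```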
Section 3: GEMM and GPU Efficiency
- Importance: Optimization of computation for training and inference
- GPU Programming: Structure and efficiencies
- Threads and Blocks: Work is organized into thread blocks; global memory is large but slow, shared memory is small but fast, so memory traffic is the main bottleneck
- Matrix Multiplications: GEMM (general matrix multiply) is the key kernel to optimize
- Shared Memory: Tiling reuses loaded blocks of the inputs to cut reads from global memory (see the tiled sketch after this list)
- Nvidia’s Role: Leading hardware improvements
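An illustrative blocked (tiled) matrix multiply in NumPy that mimics the reuse pattern a GEMM kernel gets from shared memory; real kernels are written in CUDA and tuned per GPU, and the tile size here is arbitrary.

```python
import numpy as np

def blocked_matmul(A, B, tile=32):
    """Blocked (tiled) matrix multiply, C = A @ B.

    Each (tile x tile) block of A and B is loaded once and reused for a whole
    block of C; a GPU GEMM kernel uses the same trick, staging tiles in fast
    shared memory to cut reads from slow global memory.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # On a GPU these three tiles would sit in shared memory.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(128, 64)
B = np.random.rand(64, 96)
assert np.allclose(blocked_matmul(A, B), A @ B)
```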
Section 4: Scaling and Chinchilla
- Scaling Laws: Balancing model size and training data
- Key Insight: Compute budget allocation - model size vs. dataset size
- Terminology: Parameters (N), training tokens (D), FLOPs
- Analyses: GPT-3 (2020) vs. newer models like PaLM and Chinchilla
- Efficiency Trade-offs: Cost in FLOPs vs. reduction in perplexity
- Formula: Compute-optimal scaling laws balance N (model parameters) against D (dataset size in tokens); the fitted form is written out after this list
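For reference, a sketch of the parametric loss form used in the Chinchilla analysis (Hoffmann et al., 2022), together with the standard approximation for training compute; the fitted constants are omitted here.

```latex
% Parametric loss fit from the Chinchilla analysis:
% N = model parameters, D = training tokens, C = training compute in FLOPs.
\[
  L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad C \approx 6 N D .
\]
% Minimizing L under a fixed compute budget C gives compute-optimal sizes
\[
  N_{\mathrm{opt}} \propto C^{a}, \qquad D_{\mathrm{opt}} \propto C^{b},
  \qquad a \approx b \approx 0.5,
\]
% i.e. parameters and training tokens should grow in roughly equal proportion
% as the compute budget grows.
```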
Section 5: Reasoning and RASP
- RASP Language: Restricted Access Sequence Processing, a formal language for thinking about what Transformer computations can express
- Purpose: Simplify understanding of Transformer operations
- Example: Sequence manipulation tasks (e.g., reversing a sequence; a RASP-style sketch follows this list)
- Yields: Insights into how models handle contextual memory, potential backdoors
- Potential Applications: Security assessments, improved model transparency
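A RASP-style sketch in plain Python that approximates the language's select/aggregate primitives and uses them to reverse a sequence; this is an illustrative re-implementation for intuition, not the official RASP interpreter, and the helper names simply mirror the paper's terminology.

```python
def select(keys, queries, predicate):
    """RASP-style 'select': build a boolean attention pattern.

    The entry for query position i and key position j is True when
    predicate(keys[j], queries[i]) holds, i.e. it marks which key positions
    each query position may attend to.
    """
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """RASP-style 'aggregate': average the selected values at each position."""
    out = []
    for row in selector:
        chosen = [v for v, keep in zip(values, row) if keep]
        out.append(sum(chosen) / len(chosen) if chosen else 0)
    return out

# Reverse a sequence: position i attends to position length - 1 - i.
tokens = [10, 20, 30, 40]
indices = list(range(len(tokens)))
flipped = [len(tokens) - 1 - i for i in indices]
sel = select(indices, flipped, lambda key, query: key == query)
print(aggregate(sel, tokens))   # [40.0, 30.0, 20.0, 10.0]
```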
Key Takeaways
- Perplexity: Crucial for model performance assessment
- Attention Mechanism: Integral for contextual memory and long-term dependencies
- GPU Optimization: Essential for scaling, efficiency, and practical deployment
- Scaling Laws (Chinchilla): Guide optimal resource allocation for training models
- RASP: Provides theoretical framework for understanding Transformer mechanics and capabilities
Overall Impact: Understanding the intricacies of large language models helps refine their development, enhance scalability, and improve their applicability in real-world scenarios.