Lecture on Large Language Models by Sasha Rush
Introduction
- Speaker: Sasha Rush, Associate Professor at Cornell Tech, also affiliated with Hugging Face
- Introducer: David Parkes, Dean of the School of Engineering and Applied Sciences
- Background:
- AB in Computer Science from Harvard College
- PhD from MIT
- Postdoc at Facebook Research
- Former Asst. Professor at Harvard (2015-2019)
- Works on language models, the efficiency of algorithms and hardware, and the security of AI systems
Overview
- Title of the Talk: Large Language Models in Five Formulas
- Structure: Talk is broken into five sections: Perplexity, Attention, GEMM, Chinchilla, and RASP
- Mode: Interactive, open for questions
- Sources: The speaker's own experience and earlier publicly available talks (e.g., on YouTube)
- Essence: Understanding large language models (LLMs) from a computer-science point of view, without overselling their capabilities
Section 1: Perplexity
- Perplexity: Metric for evaluating language models
- Assumptions: Documents of roughly 1,000 word tokens, drawn from a 10,000-word vocabulary
- Language Model: Probabilistic model of documents; chain rule using conditional probabilities
- Auto-Regressive Models: Use previous words to predict the next word
- Perplexity Metric: Method
- Encoding: Represent each word as a binary code string
- Reduction: Higher-probability words get shorter codes, so the model's average bits per word measures how well it predicts
- Metric Transformation: Exponentiating the average bits per word (2^bits) converts bits into the more readable perplexity number (a minimal computation sketch follows this list)
- Examples: Perplexity ranges from 1 (perfectly predictable) to 10,000 (uniform over the 10,000-word vocabulary)
- Historical Insights: Markov models, simplified assumptions
- Impact on Models: Lower perplexity aligns closely with higher-quality generated text
- Scaling Metrics: Steady perplexity improvements over time, from classic n-gram (Markov) models to modern models like GPT-3 (perplexity ~20.5)
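A minimal sketch of the perplexity computation described above, in plain Python; the function name and the toy probabilities are illustrative, not from the talk. A token assigned probability p costs about -log2(p) bits under an optimal code, and perplexity is 2 raised to the average bits per token.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to each observed token.

    A token with probability p needs about -log2(p) bits under an optimal code;
    perplexity is 2 raised to the average number of bits per token.
    """
    avg_bits = sum(-math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_bits

# A perfectly confident model: perplexity 1.
print(perplexity([1.0, 1.0, 1.0]))      # 1.0
# A uniform model over a 10,000-word vocabulary: perplexity ~10,000.
print(perplexity([1 / 10_000] * 3))     # ~10000.0
```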
Section 2: Memory and Attention
- Memory: Using historical context to predict future tokens
- Attention Mechanism: Key component, functioning as memory
- Process: Each position's query is compared against keys to retrieve a weighted mix of values, i.e. a soft lookup (sketched in code after this list)
- Softmax Function: Replaces the non-differentiable argmax so the lookup can be trained with gradient descent
- Embedding Layers: Word representations are learned and refined across multiple attention layers
- Transformers: Stacked layers utilizing attention for improved context memory
- Examples: BERT, GPT-2, GPT-3
- Efficiency: Achieved because attention is built from matrix multiplications, which GPUs compute very fast (the topic of Section 3)
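A minimal single-head scaled dot-product attention sketch in NumPy, following the query/key/value description above; the array shapes and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Each query is scored against every key; softmax turns the scores into
    weights (a differentiable stand-in for argmax), and the output is the
    corresponding weighted average of the values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # (n_queries, d_v)

# Toy example: 4 positions with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)           # (4, 8)
```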
Section 3: GEMM and GPU Efficiency
- Importance: Optimization of computation for training and inference
- GPU Programming: Structure and efficiencies
- Threads and Blocks: Work is organized into thread blocks; global memory is large but slow, shared memory is small but fast, so memory traffic is the main bottleneck
- Matrix Multiplications: GEMM (general matrix multiply) is the key kernel to optimize
- Shared Memory: Tiling reuses loaded blocks of the inputs to cut reads from global memory (see the tiled sketch after this list)
- Nvidia’s Role: Leading hardware improvements
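An illustrative blocked (tiled) matrix multiply in NumPy that mimics the reuse pattern a GEMM kernel gets from shared memory; real kernels are written in CUDA and tuned per GPU, and the tile size here is arbitrary.

```python
import numpy as np

def blocked_matmul(A, B, tile=32):
    """Blocked (tiled) matrix multiply, C = A @ B.

    Each (tile x tile) block of A and B is loaded once and reused for a whole
    block of C; a GPU GEMM kernel uses the same trick, staging tiles in fast
    shared memory to cut reads from slow global memory.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # On a GPU these three tiles would sit in shared memory.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(128, 64)
B = np.random.rand(64, 96)
assert np.allclose(blocked_matmul(A, B), A @ B)
```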
Section 4: Scaling and Chinchilla
- Scaling Laws: Balancing model size and training data
- Key Insight: Compute budget allocation - model size vs. dataset size
- Terminology: Parameters (N), training tokens (D), FLOPs
- Analyses: GPT-3 (2020) vs. newer models like PaLM and Chinchilla
- Efficiency Trade-offs: Cost in FLOPs vs. reduction in perplexity
- Formula: Compute-optimal scaling laws balance N (model parameters) against D (dataset size in tokens); the fitted form is written out after this list
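For reference, a sketch of the parametric loss form used in the Chinchilla analysis (Hoffmann et al., 2022), together with the standard approximation for training compute; the fitted constants are omitted here.

```latex
% Parametric loss fit from the Chinchilla analysis:
% N = model parameters, D = training tokens, C = training compute in FLOPs.
\[
  L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad C \approx 6 N D .
\]
% Minimizing L under a fixed compute budget C gives compute-optimal sizes
\[
  N_{\mathrm{opt}} \propto C^{a}, \qquad D_{\mathrm{opt}} \propto C^{b},
  \qquad a \approx b \approx 0.5,
\]
% i.e. parameters and training tokens should grow in roughly equal proportion
% as the compute budget grows.
```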
Section 5: Reasoning and RASP
- RASP Language: Restricted Access Sequence Processing, a formal language for thinking about what Transformer computations can express
- Purpose: Simplify understanding of Transformer operations
- Example: Sequence manipulation tasks (e.g., reversing a sequence; a RASP-style sketch follows this list)
- Yields: Insights into how models handle contextual memory, potential backdoors
- Potential Applications: Security assessments, improved model transparency
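A RASP-style sketch in plain Python that approximates the language's select/aggregate primitives and uses them to reverse a sequence; this is an illustrative re-implementation for intuition, not the official RASP interpreter, and the helper names simply mirror the paper's terminology.

```python
def select(keys, queries, predicate):
    """RASP-style 'select': build a boolean attention pattern.

    The entry for query position i and key position j is True when
    predicate(keys[j], queries[i]) holds, i.e. it marks which key positions
    each query position may attend to.
    """
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """RASP-style 'aggregate': average the selected values at each position."""
    out = []
    for row in selector:
        chosen = [v for v, keep in zip(values, row) if keep]
        out.append(sum(chosen) / len(chosen) if chosen else 0)
    return out

# Reverse a sequence: position i attends to position length - 1 - i.
tokens = [10, 20, 30, 40]
indices = list(range(len(tokens)))
flipped = [len(tokens) - 1 - i for i in indices]
sel = select(indices, flipped, lambda key, query: key == query)
print(aggregate(sel, tokens))   # [40.0, 30.0, 20.0, 10.0]
```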
Key Takeaways
- Perplexity: Crucial for model performance assessment
- Attention Mechanism: Integral for contextual memory and long-term dependencies
- GPU Optimization: Essential for scaling, efficiency, and practical deployment
- Scaling Laws (Chinchilla): Guide optimal resource allocation for training models
- RASP: Provides theoretical framework for understanding Transformer mechanics and capabilities
Overall Impact: Understanding the intricacies of large language models helps refine their development, enhance scalability, and improve their applicability in real-world scenarios.