CS 336: Language Models from Scratch - Lecture Notes
Apr 9, 2025
Introduction
Instructors: Percy Liang and Tatsu Hashimoto
Course Focus: Building language models from scratch, covering data, systems, and modeling.
Course Evolution: Class size increased by 50%; lectures are available on YouTube.
Course Rationale
Current Crisis: Researchers are becoming disconnected from the technology underlying language models.
Abstractions: Growing layers of abstraction (e.g., prompting proprietary models) can obscure understanding of how the models work.
Need for Understanding: Fundamental research requires tearing down these abstractions and understanding the full stack.
Challenges in Building Models
Industrial Scale: Frontier models (like GPT-4) involve enormous parameter counts and training costs.
Opacity: Many models are proprietary, with few public details.
Small vs. Large Models: Small models may not exhibit phenomena seen in large models (e.g., emergent behavior).
Knowledge Areas
Mechanics: Understanding the technical ingredients (e.g., transformers, GPUs).
Mindset: Squeezing the most out of the hardware and taking scaling seriously.
Intuitions: Data and modeling decisions; only partially teachable, since intuitions can change with scale.
The Bitter Lesson
Algorithm & Scale: Efficiency at scale is crucial; algorithmic advances remain significant alongside raw compute.
Resource Efficiency: Key to cost-effective model training and operation.
Course Logistics
Website: Central hub for class resources.
Assignments: Five assignments; no scaffolding code, but unit tests are provided.
Cluster Access: Together AI is providing H100 GPUs.
Student Participation: Students are encouraged to start assignments early due to cluster demand.
Course Structure
Basics: Implement the tokenizer, model architecture, and training loop.
Systems: Optimize for hardware efficiency (kernels, parallelism, inference).
Scaling Laws: Experiment at small scale to predict performance at large scale.
Data: The importance and processing of data for training.
Alignment: Model fine-tuning and instruction following.
Detailed Units
Basics
Components: Tokenizer, transformer architecture, training loop.
Assignment: Implement a BPE tokenizer, a transformer, and the AdamW optimizer (see the sketch below).
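A minimal sketch of the byte-pair-encoding (BPE) idea behind the tokenizer portion of this assignment, in plain Python. The function names and toy text are illustrative assumptions, not the assignment's required interface; a real implementation also handles pre-tokenization and vocabulary serialization.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most frequent one (or None)."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the new token id."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Toy training run: start from raw UTF-8 bytes and perform a few merges.
text = "low lower lowest"
ids = list(text.encode("utf-8"))
merges = {}
for new_id in range(256, 256 + 5):   # 5 merges, just for illustration
    pair = most_frequent_pair(ids)
    if pair is None:
        break
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
print(merges)
```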
Systems
Topics: Kernels, data and model parallelism, inference techniques.
Assignment: Implement kernels and parallelism methods, with a focus on benchmarking (see the sketch below).
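A hedged sketch of the kind of GPU benchmarking the systems assignment emphasizes, assuming PyTorch and an NVIDIA GPU with bfloat16 support (such as the course's H100s). The matrix sizes, iteration counts, and FLOP formula are illustrative choices.

```python
import time
import torch

def benchmark(fn, warmup=10, iters=50):
    """Time a GPU operation: warm up first, then synchronize around the timed region."""
    for _ in range(warmup):           # exclude one-time compilation/caching costs
        fn()
    torch.cuda.synchronize()          # drain queued kernels before starting the clock
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()          # CUDA kernels are asynchronous; wait before stopping the clock
    return (time.time() - start) / iters

if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    sec = benchmark(lambda: a @ b)
    flops = 2 * 4096 ** 3             # an n x n matmul costs roughly 2 * n^3 FLOPs
    print(f"{sec * 1e3:.2f} ms/iter, {flops / sec / 1e12:.1f} TFLOP/s")
```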
Scaling Laws
Goal: Determine the optimal model size for a given compute budget.
Assignment: Fit scaling laws and predict hyperparameters at larger scales (see the sketch below).
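A minimal sketch of fitting a power-law scaling curve by linear regression in log-log space. The (compute, loss) numbers are made up for illustration; the actual assignment fits richer functional forms and predicts hyperparameters, not just loss.

```python
import numpy as np

# Hypothetical (compute, loss) measurements from small-scale runs.
compute = np.array([1e15, 1e16, 1e17, 1e18])
loss = np.array([4.2, 3.6, 3.1, 2.7])

# Fit log(loss) = b * log(compute) + log(a), i.e. loss ~ a * compute**b (b will be negative).
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)

# Extrapolate to a larger compute budget.
big_c = 1e20
print(f"predicted loss at {big_c:.0e} FLOPs: {a * big_c ** b:.2f}")
```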
Data
Focus: Evaluate, curate, filter, and deduplicate data.
Assignment: Process raw data, train classifiers, and improve perplexity (see the sketch below).
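A minimal sketch of one step of a data pipeline: exact deduplication by hashing normalized text. Function names are illustrative assumptions; the assignment's full pipeline also involves quality classifiers and fuzzy deduplication (e.g., MinHash-style methods).

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())

def exact_dedup(documents):
    """Drop exact duplicates by hashing the normalized text of each document."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.md5(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello   world", "hello world", "A different document"]
print(exact_dedup(docs))   # the second copy of "hello world" is dropped
```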
Alignment
Concepts: Instruction following, model safety.
Phases: Supervised fine-tuning (SFT) and feedback-based learning.
Assignment: Implement SFT and preference-based learning methods (see the sketch below).
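A minimal sketch of the supervised fine-tuning objective: next-token cross-entropy computed only over response tokens, with the prompt masked out. The shapes, toy inputs, and masking convention (ignore_index=-100) are illustrative assumptions, not the assignment's required interface.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Next-token cross-entropy on the response tokens of one (prompt, response) example.

    logits:     (seq_len, vocab) model outputs
    input_ids:  (seq_len,) prompt tokens followed by response tokens
    prompt_len: number of prompt tokens to exclude from the loss
    """
    pred = logits[:-1]                    # position t predicts token t + 1
    target = input_ids[1:].clone()
    target[: prompt_len - 1] = -100       # mask prompt positions so only the response is supervised
    return F.cross_entropy(pred, target, ignore_index=-100)

# Toy example with random logits and tokens.
vocab, seq_len, prompt_len = 100, 12, 5
logits = torch.randn(seq_len, vocab)
input_ids = torch.randint(0, vocab, (seq_len,))
print(sft_loss(logits, input_ids, prompt_len))
```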
Conclusion
Efficiency as a Principle: All course design decisions reflect the goal of maximizing efficiency.
Current Constraints: The current compute-constrained environment dictates those decisions.
Future Considerations: As constraints shift, design decisions may change.
Next Lecture Preview
Topic: A dive into PyTorch, focusing on resource accounting and foundational building blocks (see the sketch below).
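A small sketch of the kind of resource accounting the next lecture covers, assuming PyTorch. The toy MLP and the "~2 FLOPs per parameter per token" forward-pass rule are rough illustrative approximations, not figures from the lecture.

```python
import torch.nn as nn

# A toy two-layer MLP standing in for a model whose resources we want to account for.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

num_params = sum(p.numel() for p in model.parameters())
bytes_fp32 = num_params * 4          # 4 bytes per float32 parameter
flops_per_token = 2 * num_params     # rough rule: ~2 forward FLOPs per parameter per token

print(f"parameters: {num_params:,}")
print(f"memory (fp32 weights only): {bytes_fp32 / 1e6:.1f} MB")
print(f"forward FLOPs per token: {flops_per_token:,}")
```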