CS 336 Language Models Overview

Apr 9, 2025

CS 336: Language Models from Scratch - Lecture Notes

Introduction

  • Instructors: Percy and Tatsu
  • Course Focus: Building language models from scratch, covering data, systems, and modeling.
  • Course Evolution: Class size has increased by 50%; lectures are available on YouTube.

Course Rationale

  • Current Crisis: Researchers are becoming increasingly disconnected from the technology underlying language models.
  • Abstractions: Growing layers of abstraction (e.g., prompting existing models) can obscure understanding of what happens underneath.
  • Need for Understanding: Fundamental research requires tearing down abstractions and understanding the full stack.

Challenges in Building Models

  • Industrial Scale: Frontier models (like GPT-4) have enormous parameter counts and training costs.
  • Opacity: Many models are proprietary with limited public details.
  • Small vs. Large Models: Small models may not exhibit phenomena seen only in large models (e.g., emergent behavior).

Knowledge Areas

  1. Mechanics: Understanding the technical ingredients (e.g., transformers, GPUs).
  2. Mindset: Maximizing hardware potential and taking scaling seriously.
  3. Intuitions: Which data and modeling decisions produce good models; only partially teachable, since what works varies with scale.

The Bitter Lesson

  • Algorithms & Scale: What matters is algorithms that work efficiently at scale; algorithmic advances are a major driver of progress.
  • Resource Efficiency: Getting the most out of available data and hardware is key to cost-effective model training and operation.

Course Logistics

  • Website: Central hub for class resources.
  • Assignments: Five assignments, no scaffolding code, unit tests provided.
  • Cluster Access: Together AI is providing H100 GPUs for the course cluster.
  • Student Participation: Students are encouraged to start assignments early due to cluster demand.

Course Structure

  1. Basics: Implement tokenizer, model architecture, and training loop.
  2. Systems: Optimize for hardware efficiency (kernels, parallelism, inference).
  3. Scaling Laws: Experiment at small scale to predict large-scale performance.
  4. Data: Importance and processing of data for training.
  5. Alignment: Model fine-tuning and instruction following.

Detailed Units

Basics

  • Components: Tokenizer, transformer architecture, training loop.
  • Assignment: Implement a BPE tokenizer, a transformer, and the AdamW optimizer (see the sketch below).
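
As a rough illustration of the byte-pair encoding idea behind the tokenizer assignment (a minimal sketch of a single merge step, not the assignment's required interface; the function names are illustrative):

```python
from collections import Counter

def most_frequent_pair(token_ids: list[int]) -> tuple[int, int]:
    """Count adjacent token-id pairs and return the most frequent one."""
    pairs = Counter(zip(token_ids, token_ids[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(token_ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with a single new token id."""
    merged, i = [], 0
    while i < len(token_ids):
        if i + 1 < len(token_ids) and (token_ids[i], token_ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(token_ids[i])
            i += 1
    return merged

# Toy usage: start from raw UTF-8 bytes and perform a single merge.
ids = list("hello hello".encode("utf-8"))
top_pair = most_frequent_pair(ids)
ids = merge_pair(ids, top_pair, new_id=256)  # first vocab entry beyond the 256 byte values
```

Training a full BPE tokenizer repeats this merge loop until the target vocabulary size is reached.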

Systems

  • Topics: Kernels, data and model parallelism, inference techniques.
  • Assignment: Implement kernels and parallelism methods, with a focus on benchmarking (see the timing sketch below).
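
A minimal sketch of the kind of GPU benchmarking this unit emphasizes (assumes PyTorch and a CUDA device; the warm-up and explicit synchronization are what make the timing honest):

```python
import time
import torch

def benchmark_matmul(n: int = 4096, iters: int = 10) -> float:
    """Average wall-clock time of an n x n matmul on the GPU."""
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(3):              # warm-up: kernel launches and caches
        a @ b
    torch.cuda.synchronize()        # make sure warm-up work has finished
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()        # wait for all queued kernels before stopping the clock
    return (time.time() - start) / iters

print(f"avg matmul time: {benchmark_matmul() * 1e3:.2f} ms")
```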

Scaling Laws

  • Goal: Determine optimal model size for given compute budget.
  • Assignment: Fit scaling laws and predict hyperparameters at larger scales (see the sketch below).
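
A minimal sketch of fitting a scaling law to toy (compute, loss) measurements, assuming a simple power-law form L(C) ≈ a · C^(−α); the numbers below are made up for illustration:

```python
import numpy as np

# Toy (compute in FLOPs, final loss) pairs from small training runs (made-up numbers).
compute = np.array([1e15, 1e16, 1e17, 1e18])
loss = np.array([4.2, 3.6, 3.1, 2.7])

# L(C) ~ a * C**(-alpha) is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a much larger compute budget.
big_c = 1e21
predicted = a * big_c ** (-alpha)
print(f"alpha = {alpha:.3f}, predicted loss at {big_c:.0e} FLOPs: {predicted:.2f}")
```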

Data

  • Focus: Evaluate, curate, filter, and deduplicate data.
  • Assignment: Process raw data, train classifiers, and improve perplexity (see the deduplication sketch below).
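
As one concrete piece of the pipeline above, a minimal sketch of exact deduplication via content hashing (real pipelines also use fuzzy deduplication such as MinHash plus quality and language filtering):

```python
import hashlib

def dedup_exact(documents: list[str]) -> list[str]:
    """Keep the first copy of each document; drop exact duplicates."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["the cat sat", "the dog ran", "the cat sat"]
assert dedup_exact(docs) == ["the cat sat", "the dog ran"]
```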

Alignment

  • Concepts: Instruction following, model safety.
  • Phases: Supervised fine-tuning (SFT) and feedback-based learning.
  • Assignment: Implement SFT and preference-based learning methods (see the loss sketch below).
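
A minimal sketch of the SFT objective (next-token cross-entropy computed only on response tokens, with prompt positions masked out); the tensor shapes and the `prompt_len` argument are illustrative, not the assignment's API:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Cross-entropy on response tokens only.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) token ids.
    """
    labels = labels.clone()
    labels[:, :prompt_len] = -100          # ignored by cross_entropy below
    # Shift so position t predicts token t+1 (standard causal LM objective).
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

# Toy usage with random tensors.
B, T, V = 2, 8, 50
loss = sft_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)), prompt_len=3)
```

Preference-based methods (e.g., DPO) replace this loss with one that contrasts chosen and rejected responses.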

Conclusion

  • Efficiency as a Principle: All course decisions reflect maximizing efficiency.
  • Current Constraints: The current compute-constrained environment dictates design decisions.
  • Future Considerations: As constraints shift, design decisions may change.

Next Lecture Preview

  • Topic: Dive into PyTorch, focusing on resource accounting and foundational building blocks.
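
As a small preview of the resource-accounting theme (a minimal sketch that counts parameters and their fp32 memory footprint for an arbitrary PyTorch module; the linear layer is just a placeholder):

```python
import torch.nn as nn

def param_stats(model: nn.Module) -> tuple[int, float]:
    """Return (parameter count, approximate memory in GB assuming fp32)."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params, n_params * 4 / 1e9   # 4 bytes per fp32 parameter

model = nn.Linear(4096, 4096)             # placeholder module
count, gb = param_stats(model)
print(f"{count:,} parameters ≈ {gb:.3f} GB in fp32")
```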