Overview
This lecture introduces CS336: Language Models from Scratch, covering the motivation for the course, its structure, and foundational concepts for building language models, with a particular focus on efficiency and tokenization.
Course Introduction and Motivation
- The course aims to teach the end-to-end pipeline for building language models, covering data, systems, and modeling.
- There is a growing disconnect between researchers and the underlying technology, driven by increasing abstraction and the industrialization of model training.
- Despite abstraction layers, understanding the full stack is vital for meaningful research and innovation.
- Modern frontier models are prohibitively large and costly to train, so students will work with smaller, representative models.
- The course emphasizes mechanics (how things work), mindset (prioritizing efficiency and scaling), and partial intuitions (which techniques work at scale).
Key Trends in Language Models
- Language models evolved from n-gram models to neural architectures with advances like sequence-to-sequence models, attention mechanisms, and transformers.
- The transformer (2017) is foundational for current models, with incremental improvements since.
- The AI community now distinguishes closed, open-weight, and open-source models, which differ in how much of the weights, data, and training code is released.
Course Structure and Logistics
- The class is divided into five units: basics, systems, scaling laws, data, and alignment.
- Assignments require building components from scratch, supported by unit tests but minimal scaffolding.
- Compute resources are limited; efficient prototyping is encouraged.
- There are leaderboards for some assignments to encourage optimization.
Tokenization Fundamentals
- Tokenization converts raw text (Unicode strings) into sequences of integers for model processing.
- Character-based tokenization produces a very large vocabulary (one entry per Unicode code point), while byte-based tokenization produces very long sequences; both are inefficient for model training.
- Word-based tokenization adapts to word frequencies but struggles with rare/new words and out-of-vocabulary issues.
- Byte Pair Encoding (BPE) is the preferred method: it repeatedly merges the most frequent adjacent token pair to create efficient, adaptive tokens (see the sketch after this list).
- BPE relies on training data statistics to balance vocabulary size and sequence length.
- Modern tokenizers typically pre-tokenize text (e.g., splitting with a regex) and then apply BPE within each pre-token for practical efficiency.
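To make the merge idea concrete, below is a minimal byte-level BPE training sketch in Python. It illustrates the general algorithm only, not the assignment's reference implementation; the function name `train_bpe`, the example string, the stopping rules, and the decision to skip pre-tokenization are assumptions made for brevity.

```python
# Minimal byte-level BPE training sketch (illustrative, not the course's reference code).
from collections import Counter

def train_bpe(text: str, num_merges: int):
    # Start from raw UTF-8 bytes: every token is initially a single byte (0-255).
    tokens = list(text.encode("utf-8"))
    merges = []        # learned (pair -> new token id) rules, in order
    next_id = 256      # new token ids start after the 256 byte values

    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens occurs.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        # Replace every occurrence of the best pair with a fresh token id.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best_pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        merges.append((best_pair, next_id))
        next_id += 1

    return merges, tokens

# Usage: frequent byte pairs become single tokens, shortening the sequence
# while keeping the vocabulary size bounded.
sample = "the cat in the hat sat on the mat"
merges, tokens = train_bpe(sample, num_merges=10)
print(len(sample.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

The trade-off this demonstrates is the one the bullets describe: each merge grows the vocabulary by one entry but shortens every sequence in which that pair appears, and the merges chosen depend entirely on training-data statistics.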
Key Terms & Definitions
- Tokenization — Process of converting text into a sequence of tokens (integers) for model input.
- Byte Pair Encoding (BPE) — Tokenization technique merging the most frequent adjacent pairs to form new tokens.
- Transformer — Neural network architecture based on attention mechanisms, core to contemporary language models.
- Perplexity — Metric for how well a language model predicts held-out text; the exponential of the average per-token negative log-likelihood, so lower is better (see the short example after this list).
- Alignment — Process of fine-tuning models to follow instructions and exhibit desired behaviors.
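As a quick illustration of the perplexity entry above, the sketch below computes perplexity as the exponential of the average per-token negative log-likelihood; the per-token probabilities are hypothetical values chosen only for the example.

```python
import math

def perplexity(token_probs):
    # token_probs: model-assigned probability of each token in a held-out sequence.
    # Perplexity = exp(average negative log-likelihood); lower is better.
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities for a 4-token sequence.
print(perplexity([0.25, 0.5, 0.1, 0.4]))  # ≈ 3.76
```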
Action Items / Next Steps
- Review the course website for logistics, assignments, and resources.
- Prepare for Assignment 1: implement a BPE tokenizer and basic transformer components.
- Explore online tokenizer demos for better intuition.
- Attend the next lecture on PyTorch fundamentals and resource accounting.