
Language Models Course Overview

Jul 31, 2025

Overview

This lecture introduces CS336: Language Models from Scratch, covering the motivation, course structure, and foundational concepts in building language models, with a special focus on efficiency and the tokenization process.

Course Introduction and Motivation

  • The course aims to teach the end-to-end pipeline for building language models, covering data, systems, and modeling.
  • There is a growing disconnect between researchers and the underlying technology due to increased abstraction and industrialization.
  • Despite abstraction layers, understanding the full stack is vital for meaningful research and innovation.
  • Modern frontier models are prohibitively large and costly to train, so students will work with smaller, representative models.
  • The course emphasizes mechanics (how things work), mindset (prioritizing efficiency and scaling), and intuitions (a partial sense of which techniques work at scale).

Key Trends in Language Models

  • Language models evolved from n-gram models to neural architectures with advances like sequence-to-sequence models, attention mechanisms, and transformers.
  • The transformer architecture (2017) remains the foundation of current models, with largely incremental improvements since.
  • The AI community now distinguishes between closed, open-weight, and open-source models, which differ in how much of the weights, data, and code is released.

Course Structure and Logistics

  • The class is divided into five units: basics, systems, scaling laws, data, and alignment.
  • Assignments require building components from scratch, supported by unit tests but minimal scaffolding.
  • Compute resources are limited; efficient prototyping is encouraged.
  • There are leaderboards for some assignments to encourage optimization.

Tokenization Fundamentals

  • Tokenization converts raw text (Unicode strings) into sequences of integers for model processing.
  • Character-based tokenization produces a very large vocabulary (one entry per Unicode code point), while byte-based tokenization produces very long sequences; both are inefficient.
  • Word-based tokenization adapts to word frequencies but struggles with rare/new words and out-of-vocabulary issues.
  • Byte Pair Encoding (BPE) is the preferred method: starting from bytes, it repeatedly merges the most frequent adjacent pair into a new token, yielding efficient, adaptive tokens (see the sketch after this list).
  • BPE relies on training data statistics to balance vocabulary size and sequence length.
  • Modern tokenizers typically pre-tokenize (e.g., splitting text into word-like chunks with a regular expression) and then apply BPE merges within each chunk for practical efficiency.
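
A minimal sketch of byte-level BPE training, assuming a toy corpus; the `train_bpe` helper and its parameters are made up for illustration, and the pre-tokenization step a production tokenizer would use is omitted.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Learn `num_merges` merge rules over the UTF-8 bytes of `text` (illustrative sketch)."""
    # Start from the byte-level representation: each token is one byte (0-255).
    tokens = list(text.encode("utf-8"))
    merges = {}          # (left, right) -> new token id
    next_id = 256        # ids 0-255 are reserved for raw bytes

    for _ in range(num_merges):
        # Count every adjacent pair in the current token sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges[best] = next_id

        # Replace every occurrence of the best pair with the new token id.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1

    return merges, tokens

corpus = "the cat in the hat sat on the mat"
merges, tokens = train_bpe(corpus, num_merges=10)
print(len(corpus.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Each merge trades a larger vocabulary for shorter sequences, which is the balance the bullet points above describe.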

Key Terms & Definitions

  • Tokenization — Process of converting text into a sequence of tokens (integers) for model input.
  • Byte Pair Encoding (BPE) — Tokenization technique merging the most frequent adjacent pairs to form new tokens.
  • Transformer — Neural network architecture based on attention mechanisms, core to contemporary language models.
  • Perplexity — Metric to evaluate how well a language model predicts samples: the exponentiated average negative log-likelihood per token, where lower is better (see the worked example after this list).
  • Alignment — Process of fine-tuning models to follow instructions and exhibit desired behaviors.
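
A small worked example of the perplexity definition above; the per-token probabilities are hypothetical and serve only to show the arithmetic.

```python
import math

# Hypothetical probabilities a model assigns to each token of a 4-token sequence.
token_probs = [0.25, 0.5, 0.1, 0.4]

# Perplexity = exp of the average negative log-likelihood per token.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")  # lower is better; 1.0 would mean perfect prediction
```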

Action Items / Next Steps

  • Review the course website for logistics, assignments, and resources.
  • Prepare for Assignment 1: implement a BPE tokenizer and basic transformer components.
  • Explore online tokenizer demos for better intuition.
  • Attend the next lecture on PyTorch fundamentals and resource accounting.