CS 336 Language Models Overview

Apr 9, 2025

CS 336: Language Models from Scratch - Lecture Notes

Introduction

  • Instructors: Percy and Tatsu
  • Course Focus: Building language models from scratch, covering data, systems, and modeling.
  • Course Evolution: Class size has increased by 50%; lectures are available on YouTube.

Course Rationale

  • Current Crisis: Researchers are becoming increasingly disconnected from the technology underlying language models.
  • Abstractions: Growing layers of abstraction (e.g., prompting existing models) can obscure understanding of what happens underneath.
  • Need for Understanding: Fundamental research requires tearing down abstractions and understanding the full stack.

Challenges in Building Models

  • Industrial Scale: Frontier models (like GPT-4) have enormous parameter counts and training costs.
  • Opacity: Many models are proprietary with limited public details.
  • Small vs. Large Models: Small models may not exhibit phenomena seen only in large models (e.g., emergent behavior).

Knowledge Areas

  1. Mechanics: Understanding the technical ingredients (e.g., transformers, GPUs).
  2. Mindset: Maximizing hardware potential and taking scaling seriously.
  3. Intuitions: Which data and modeling decisions produce good models; only partially teachable, since what works varies with scale.

The Bitter Lesson

  • Algorithms & Scale: What matters is algorithms that work efficiently at scale; algorithmic advances are a major driver of progress.
  • Resource Efficiency: Getting the most out of available data and hardware is key to cost-effective model training and operation.

Course Logistics

  • Website: Central hub for class resources.
  • Assignments: Five assignments, no scaffolding code, unit tests provided.
  • Cluster Access: Together AI is providing H100 GPUs for the course cluster.
  • Student Participation: Students are encouraged to start assignments early due to cluster demand.

Course Structure

  1. Basics: Implement tokenizer, model architecture, and training loop.
  2. Systems: Optimize for hardware efficiency (kernels, parallelism, inference).
  3. Scaling Laws: Experiment at small scale to predict large-scale performance.
  4. Data: Importance and processing of data for training.
  5. Alignment: Model fine-tuning and instruction following.

Detailed Units

Basics

  • Components: Tokenizer, transformer architecture, training loop.
  • Assignment: Implement a BPE tokenizer, a transformer, and the AdamW optimizer (see the sketch below).
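
As a rough illustration of the byte-pair encoding idea behind the tokenizer assignment (a minimal sketch of a single merge step, not the assignment's required interface; the function names are illustrative):

```python
from collections import Counter

def most_frequent_pair(token_ids: list[int]) -> tuple[int, int]:
    """Count adjacent token-id pairs and return the most frequent one."""
    pairs = Counter(zip(token_ids, token_ids[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(token_ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with a single new token id."""
    merged, i = [], 0
    while i < len(token_ids):
        if i + 1 < len(token_ids) and (token_ids[i], token_ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(token_ids[i])
            i += 1
    return merged

# Toy usage: start from raw UTF-8 bytes and perform a single merge.
ids = list("hello hello".encode("utf-8"))
top_pair = most_frequent_pair(ids)
ids = merge_pair(ids, top_pair, new_id=256)  # first vocab entry beyond the 256 byte values
```

Training a full BPE tokenizer repeats this merge loop until the target vocabulary size is reached.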

Systems

  • Topics: Kernels, data and model parallelism, inference techniques.
  • Assignment: Implement kernels and parallelism methods, with a focus on benchmarking (see the timing sketch below).
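
A minimal sketch of the kind of GPU benchmarking this unit emphasizes (assumes PyTorch and a CUDA device; the warm-up and explicit synchronization are what make the timing honest):

```python
import time
import torch

def benchmark_matmul(n: int = 4096, iters: int = 10) -> float:
    """Average wall-clock time of an n x n matmul on the GPU."""
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(3):              # warm-up: kernel launches and caches
        a @ b
    torch.cuda.synchronize()        # make sure warm-up work has finished
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()        # wait for all queued kernels before stopping the clock
    return (time.time() - start) / iters

print(f"avg matmul time: {benchmark_matmul() * 1e3:.2f} ms")
```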

Scaling Laws

  • Goal: Determine optimal model size for given compute budget.
  • Assignment: Fit scaling laws and predict hyperparameters at larger scales (see the sketch below).
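
A minimal sketch of fitting a scaling law to toy (compute, loss) measurements, assuming a simple power-law form L(C) ≈ a · C^(−α); the numbers below are made up for illustration:

```python
import numpy as np

# Toy (compute in FLOPs, final loss) pairs from small training runs (made-up numbers).
compute = np.array([1e15, 1e16, 1e17, 1e18])
loss = np.array([4.2, 3.6, 3.1, 2.7])

# L(C) ~ a * C**(-alpha) is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate to a much larger compute budget.
big_c = 1e21
predicted = a * big_c ** (-alpha)
print(f"alpha = {alpha:.3f}, predicted loss at {big_c:.0e} FLOPs: {predicted:.2f}")
```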

Data

  • Focus: Evaluate, curate, filter, and deduplicate data.
  • Assignment: Process raw data, train classifiers, and improve perplexity (see the deduplication sketch below).
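
As one concrete piece of the pipeline above, a minimal sketch of exact deduplication via content hashing (real pipelines also use fuzzy deduplication such as MinHash plus quality and language filtering):

```python
import hashlib

def dedup_exact(documents: list[str]) -> list[str]:
    """Keep the first copy of each document; drop exact duplicates."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["the cat sat", "the dog ran", "the cat sat"]
assert dedup_exact(docs) == ["the cat sat", "the dog ran"]
```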

Alignment

  • Concepts: Instruction following, model safety.
  • Phases: Supervised fine-tuning (SFT) and feedback-based learning.
  • Assignment: Implement SFT and preference-based learning methods (see the loss sketch below).
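
A minimal sketch of the SFT objective (next-token cross-entropy computed only on response tokens, with prompt positions masked out); the tensor shapes and the `prompt_len` argument are illustrative, not the assignment's API:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Cross-entropy on response tokens only.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) token ids.
    """
    labels = labels.clone()
    labels[:, :prompt_len] = -100          # ignored by cross_entropy below
    # Shift so position t predicts token t+1 (standard causal LM objective).
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

# Toy usage with random tensors.
B, T, V = 2, 8, 50
loss = sft_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)), prompt_len=3)
```

Preference-based methods (e.g., DPO) replace this loss with one that contrasts chosen and rejected responses.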

Conclusion

  • Efficiency as a Principle: All course decisions reflect maximizing efficiency.
  • Current Constraints: The current compute-constrained environment dictates design decisions.
  • Future Considerations: As constraints shift, design decisions may change.

Next Lecture Preview

  • Topic: Dive into PyTorch, focusing on resource accounting and foundational building blocks.
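
As a small preview of the resource-accounting theme (a minimal sketch that counts parameters and their fp32 memory footprint for an arbitrary PyTorch module; the linear layer is just a placeholder):

```python
import torch.nn as nn

def param_stats(model: nn.Module) -> tuple[int, float]:
    """Return (parameter count, approximate memory in GB assuming fp32)."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params, n_params * 4 / 1e9   # 4 bytes per fp32 parameter

model = nn.Linear(4096, 4096)             # placeholder module
count, gb = param_stats(model)
print(f"{count:,} parameters ≈ {gb:.3f} GB in fp32")
```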