CS25 Transformers United: Introductory Lecture

Jul 4, 2024

Welcome to CS25

  • Created and taught at Stanford in Fall 2021
  • Focus: Deep learning models, specifically transformers
  • Applications: Natural Language Processing (NLP), Computer Vision, Reinforcement Learning
  • Exciting videos and fantastic speakers lined up
  • Aim: Understand transformers, their applications, and inspire new ideas for research


Instructors

  • Advay: Software engineer at Applied Intuition, former master’s student at Stanford
  • Div: PhD student at Stanford, working on generative modeling, reinforcement learning, and robotics
  • Chetanya: ML engineer at Moveworks, former master’s student at Stanford specializing in NLP

Goals of the Course

  1. Understand how transformers work
  2. Learn applications of transformers beyond NLP
  3. Inspire new research directions and innovations

Transformers Overview

Prehistoric Era (Before 2017)

  • Models: RNNs, LSTMs, simpler attention mechanisms
  • Issues: Poor encoding of long sequences and context
    • Example: Sentence completion and correlation problems in older models

The Emergence of Transformers (2017)

  • Key Idea: Self-attention mechanism
  • Pioneering Paper: “Attention Is All You Need” (Vaswani et al., 2017)
  • Capabilities: Long-sequence problems, protein folding (AlphaFold), few-shot and zero-shot learning, text and image generation
  • Notable Talk: “LSTM is Dead. Long Live Transformers!”

Current and Future Applications

  • Genetic modeling, video understanding, finance applications
  • Challenges: Need for external memory units, computational complexity, and alignment with human values

Attention Mechanisms

Simple Attention

  • Inspired by human visual attention
  • Issues: Computationally expensive and non-differentiable

Global and Local Attention Models

  • Global: Calculate attention weight for the entire sequence
  • Local: Calculate attention over a small window
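The difference between the two variants can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: dot-product scoring, the function names, and the toy sizes are choices made here, not details given in the lecture.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def global_attention(query, keys):
    """Attention weights over every position in the sequence."""
    scores = keys @ query  # one dot-product score per position (assumed scoring)
    return softmax(scores)

def local_attention(query, keys, center, half_width):
    """Attention weights over the window [center - half_width, center + half_width]."""
    n = len(keys)
    lo = max(0, center - half_width)
    hi = min(n, center + half_width + 1)
    weights = np.zeros(n)
    # Only positions inside the window are scored; the rest keep weight 0.
    weights[lo:hi] = softmax(keys[lo:hi] @ query)
    return weights

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))  # 8 positions, 4-dimensional states (toy sizes)
query = rng.normal(size=4)

global_weights = global_attention(query, keys)
local_weights = local_attention(query, keys, center=3, half_width=1)
```

Global attention scores all 8 positions, while the local variant scores only the 3 positions in its window, which is what makes it cheaper on long sequences.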


Self-Attention

  • Introduced in “A Structured Self-Attentive Sentence Embedding” by Lin et al.
  • Basis of transformers
  • Simplified as a search & retrieval problem (Query, Key, Value)
  • Multi-head self-attention: Enables learning multiple representation subspaces
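The query/key/value view can be sketched as scaled dot-product attention in the style of “Attention Is All You Need”. The code below is a minimal sketch, assuming random weight matrices in place of learned ones; the function names and the toy dimensions are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of shape (n, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # how well each query matches each key
    weights = softmax(scores, axis=-1)         # each row is a distribution over positions
    return weights @ v, weights                # retrieve a weighted mix of values

def multi_head_self_attention(x, heads, w_o):
    """Run several heads in separate subspaces, concatenate, then mix with w_o."""
    outputs = [self_attention(x, w_q, w_k, w_v)[0] for (w_q, w_k, w_v) in heads]
    return np.concatenate(outputs, axis=-1) @ w_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2                  # toy sizes (assumptions)
d_head = d_model // n_heads
x = rng.normal(size=(n, d_model))
# Random projections stand in for learned parameters.
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
w_o = rng.normal(size=(d_model, d_model))

out = multi_head_self_attention(x, heads, w_o)
_, head0_weights = self_attention(x, *heads[0])
```

Each head has its own projections, so each can learn a different notion of which positions are relevant to one another.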

Transformer Architecture

Encoder Block

  1. Self-attention layer
  2. Feed-forward layer (to capture non-linearities)
  3. Layer normalization
  4. Residual connections
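One way these four components might fit together is sketched below. It assumes the post-norm ordering of the original paper (residual add, then layer normalization), a single attention head, a ReLU feed-forward layer, and layer normalization without learned scale and shift; all parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned gain and bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise two-layer MLP; the ReLU supplies the non-linearity.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder_block(x, p):
    # Sub-layer 1: self-attention, wrapped in a residual connection + layer norm.
    x = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    # Sub-layer 2: feed-forward, wrapped the same way.
    x = layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))
    return x

rng = np.random.default_rng(0)
n, d_model, d_ff = 6, 8, 32                    # toy sizes (assumptions)
params = {
    "w_q": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_k": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_v": rng.normal(size=(d_model, d_model)) * 0.1,
    "w1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "w2": rng.normal(size=(d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}
x = rng.normal(size=(n, d_model))
y = encoder_block(x, params)
```

Because the block maps an (n, d_model) input to an (n, d_model) output, identical blocks can be stacked to form a deeper encoder.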

Decoder Block

  • Similar to the encoder, but includes an extra layer for multi-head attention over the encoder's output
  • Masking: Prevents each position from attending to later (future) positions
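The masking step can be illustrated with a causal (upper-triangular) mask applied to the attention scores before the softmax. This is a sketch under assumptions: the queries, keys, and values are random toy data, and the helper names are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # True strictly above the diagonal marks "future" positions (column j > row i).
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_self_attention(q, k, v):
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Future positions get a score of -inf, so the softmax gives them weight 0.
    scores = np.where(causal_mask(n), -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 4, 8                                    # toy sizes (assumptions)
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out, weights = masked_self_attention(q, k, v)
```

The first row of `weights` puts all of its mass on position 0, since the first token has nothing earlier to attend to; the last row can attend to every position.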

Advantages and Drawbacks of Transformers


Advantages

  1. Constant path length between any two sequence positions
  2. Efficient parallelization


Drawbacks

  1. Self-attention takes time quadratic in the sequence length n, i.e. O(n^2)
    • Solutions: BigBird, Linformer, Reformer
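A back-of-the-envelope count shows why the quadratic term matters. The sketch below counts attention-score entries for full self-attention and, for contrast, for a fixed local window. The windowed variant is a deliberately simplified stand-in for sparse attention in general; it is not the actual algorithm of BigBird, Linformer, or Reformer, and both function names are hypothetical.

```python
def full_attention_entries(n):
    # Every position attends to every position: an n x n score matrix.
    return n * n

def windowed_attention_entries(n, window):
    # Each position attends to at most `window` positions: linear in n.
    return n * min(n, window)

for n in (1_024, 2_048, 4_096):
    full = full_attention_entries(n)
    local = windowed_attention_entries(n, window=128)
    print(f"n={n:>5}  full={full:>12,}  windowed={local:>10,}")
```

Doubling the sequence length quadruples the cost of full attention but only doubles the cost of the windowed variant.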

Applications of Transformers

GPT (Generative Pre-trained Transformer)

  • Model by OpenAI; GPT-3 was the latest version when the course was created (Fall 2021)
  • Decoder-only architecture
  • Uses: In-context learning, summarization, natural language generation, coding (e.g., GitHub Copilot in VS Code)
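In-context learning means the task is specified through examples placed in the prompt, with no weight updates. A minimal sketch of building such a few-shot prompt follows; the `Input:`/`Output:` template and the function name are assumptions made for illustration, not a format prescribed by any particular model.

```python
def build_few_shot_prompt(examples, query):
    """Format (input, output) demonstration pairs followed by an unanswered query."""
    lines = []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")                       # blank line between demonstrations
    lines.append(f"Input: {query}")
    lines.append("Output:")                    # the model is expected to continue from here
    return "\n".join(lines)

# Toy English-to-French demonstrations (illustrative data).
examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt(examples, "book")
print(prompt)
```

Passing an empty list of examples yields a zero-shot prompt with the same template.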

BERT (Bidirectional Encoder Representations from Transformers)

  • Encoder-only architecture
  • Training tasks: Masked Language Modeling, Next Sentence Prediction
  • Fine-tuning enables use for various downstream tasks
  • Evolution: ELECTRA and DeBERTa adopted new techniques for further improvements
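The masked language modeling objective can be sketched as an input-corruption step: hide some tokens and ask the model to predict them. The version below is a simplification that always substitutes `[MASK]`; the BERT paper selects about 15% of tokens and sometimes keeps the original or swaps in a random token instead. The function name and the seed are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels): masked inputs, plus the original token at each
    masked position (None where no prediction is required)."""
    rng = random.Random(seed)                  # seeded for reproducibility
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)                # hide the token from the model
            labels.append(tok)                 # the model must recover it
        else:
            inputs.append(tok)
            labels.append(None)                # no loss at this position
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.5)
print(inputs)
print(labels)
```

Because the encoder sees context on both sides of each masked position, the prediction is bidirectional, which is the property the model's name refers to.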


Conclusion

  • Exciting set of videos with remarkable speakers
  • Goals: Grasp how transformers work, understand their broad applications, and spark innovative research ideas