CS25 Transformers United: Introductory Lecture
Welcome to CS25
Created and taught at Stanford in Fall 2021
Focus: Deep learning models, specifically transformers
Applications: Natural Language Processing (NLP), Computer Vision, Reinforcement Learning
Exciting videos and fantastic speakers lined up
Aim: Understand transformers, their applications, and inspire new ideas for research
Instructors
Advay: Software engineer at Applied Intuition, former master’s student at Stanford
Div: PhD student at Stanford, involved in general modeling, reinforcement learning, and robotics
Chaitanya: ML engineer at Moveworks, former master’s student at Stanford specializing in NLP
Goals of the Course
Understand how transformers work
Learn applications of transformers beyond NLP
Inspire new research directions and innovations
Transformers Overview
Prehistoric Era (Before 2017)
Models: RNNs, LSTMs, simpler attention mechanisms
Issues: Poor encoding of long sequences and context
Example: Sentence completion and correlation problems in older models
The Emergence of Transformers (2017)
Key Idea: Self-attention mechanism
Pioneering Paper: “Attention is All You Need”
Capabilities: long-sequence problems, protein folding (AlphaFold), few-shot and zero-shot learning, text and image generation
Notable Talks: “LSTM is dead, long live Transformers”
Current and Future Applications
Genetic modeling, video understanding, finance applications
Challenges: Need for external memory units, computational complexity, and alignment with human values
Attention Mechanisms
Simple Attention
Inspired by human visual attention
Issues: expensive to compute and non-differentiable
Global and Local Attention Models
Global: Calculate attention weight for the entire sequence
Local: Calculate attention over a small window
Self-Attention
Introduced in “A Structured Self-Attentive Sentence Embedding” by Lin et al.
Basis of transformers
Simplified as a search & retrieval problem (Query, Key, Value)
Multi-head self-attention: Enables learning multiple representation subspaces
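To make the search & retrieval (Query, Key, Value) view concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; shapes and names are illustrative assumptions, not lecture code. Multi-head attention simply runs several such heads with separate projections and concatenates their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of values

# toy usage: 5 tokens, model dim 16, head dim 8
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)
```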
Transformer Architecture
Encoder Block
Self-attention layer
Feed-forward layer (to capture non-linearities)
Layer normalization
Residual connections
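A compact sketch of one encoder block under the same simplifying assumptions (single attention head, no learned layer-norm scale/shift, NumPy only); the point is the ordering of the sub-layers, residual connections, and layer normalization, not a production implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """One transformer encoder block (single head, simplified)."""
    # 1. self-attention sub-layer, then residual connection + layer norm
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    X = layer_norm(X + attn)
    # 2. position-wise feed-forward sub-layer (captures non-linearities),
    #    again followed by residual connection + layer norm
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2   # ReLU MLP
    return layer_norm(X + ffn)

# toy shapes: seq_len=4, d_model=8, d_ff=32
rng = np.random.default_rng(1)
d, d_ff = 8, 32
X = rng.normal(size=(4, d))
params = dict(
    Wq=rng.normal(size=(d, d)), Wk=rng.normal(size=(d, d)), Wv=rng.normal(size=(d, d)),
    W1=rng.normal(size=(d, d_ff)), b1=np.zeros(d_ff),
    W2=rng.normal(size=(d_ff, d)), b2=np.zeros(d),
)
print(encoder_block(X, **params).shape)          # (4, 8)
```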
Decoder Block
Similar to the encoder but adds a multi-head cross-attention layer over the encoder's output
Masking: Avoid looking into the future
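A small illustration of the decoder's causal mask: setting future positions to negative infinity before the softmax drives their attention weights to zero, so position i can only attend to positions up to i. The code is a NumPy sketch, not lecture material.

```python
import numpy as np

def causal_mask(seq_len):
    """Strictly upper-triangular -inf mask: blocks attention to future positions."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[-1])   # hide the future
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                  # uniform raw scores for 4 tokens
print(np.round(masked_softmax(scores), 2))
# row i has non-zero weight only on columns 0..i, e.g. row 0 -> [1, 0, 0, 0]
```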
Advantages and Drawbacks of Transformers
Advantages
Constant path length between sequence positions
Efficient parallelization
Drawbacks
Self-attention takes time quadratic in sequence length, O(n²)
Solutions: Big Bird, Linformer, Reformer
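As one illustration of how the quadratic cost can be reduced, the sketch below restricts each query to a sliding window of nearby keys, bringing the cost down to roughly O(n · window). This is only in the spirit of sparse-attention models such as Big Bird; it is not a reimplementation of any of the cited methods.

```python
import numpy as np

def local_attention(Q, K, V, window=2):
    """Sliding-window attention: each position attends only to neighbours
    within `window` positions, so cost grows as O(n * window), not O(n^2)."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()                      # softmax over the local window
        out[i] = w @ V[lo:hi]             # weighted sum of nearby values
    return out

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(10, 8))
print(local_attention(Q, K, V).shape)     # (10, 8)
```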
Applications of Transformers
GPT (Generative Pre-trained Transformer)
Models by OpenAI; the latest at the time of the lecture was GPT-3
Decoder-only architecture
Uses: in-context learning, summarization, natural language generation, code generation (e.g., GitHub Copilot in VS Code)
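A hypothetical example of in-context (few-shot) learning with a decoder-only model: the task is specified entirely in the prompt and the model is expected to continue the pattern. The prompt format and review texts below are made up for illustration.

```python
# Build a few-shot prompt: the task is "shown", not trained.
examples = [
    ("The movie was fantastic.", "positive"),
    ("I want my money back.", "negative"),
]
query = "Best purchase I've made all year."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"   # the model completes this line

print(prompt)
```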
BERT (Bidirectional Encoder Representations from Transformers)
Encoder-only architecture
Training tasks: Masked Language Modeling (toy example below), Next Sentence Prediction
Fine-tuning enables use for various downstream tasks
Evolution: ELECTRA and DeBERTa adapted new techniques for further improvements
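A toy sketch of the Masked Language Modeling corruption step mentioned above: roughly 15% of tokens are hidden and the originals become prediction targets. BERT's full recipe also sometimes keeps the token or swaps in a random one; that detail is omitted here.

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def mask_tokens(tokens, seed=0):
    """Hide ~15% of tokens; the originals become prediction targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            inputs.append(MASK)        # model sees the mask token...
            targets.append(tok)        # ...and must recover the original
        else:
            inputs.append(tok)
            targets.append(None)       # no prediction loss at this position
    return inputs, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(sentence))
```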
Conclusion
Exciting set of videos with remarkable speakers
Goals: Grasp the workings of transformers, understand their broad applications, and spark innovative research ideas