CS25 Transformers United: Introductory Lecture
Welcome to CS25
Created and taught at Stanford in Fall 2021
Focus: Deep learning models, specifically transformers
Applications: Natural Language Processing (NLP), Computer Vision, Reinforcement Learning
Exciting videos and fantastic speakers lined up
Aim: Understand transformers, their applications, and inspire new ideas for research
Instructors
Advay: Software engineer at Applied Intuition, former master’s student at Stanford
Div: PhD student at Stanford, involved in general modeling, reinforcement learning, and robotics
Chaitanya: ML engineer at Moveworks, former master’s student at Stanford specializing in NLP
Goals of the Course
Understand how transformers work
Learn applications of transformers beyond NLP
Inspire new research directions and innovations
Transformers Overview
Prehistoric Era (Before 2017)
Models: RNNs, LSTMs, simpler attention mechanisms
Issues: Poor encoding of long sequences and context
Example: Sentence completion and correlation problems in older models
The Emergence of Transformers (2017)
Key Idea: Self-attention mechanism
Pioneering Paper: “Attention is All You Need”
Capabilities: long-sequence problems, protein folding (AlphaFold), few-shot and zero-shot learning, text and image generation
Notable Talks: “LSTM is dead, long live Transformers”
Current and Future Applications
Genetic modeling, video understanding, finance applications
Challenges: Need for external memory units, computational complexity, and alignment with human values
Attention Mechanisms
Simple Attention
Inspired by human visual attention
Issues: expensive to compute and non-differentiable
Global and Local Attention Models
Global: Calculate attention weight for the entire sequence
Local: Calculate attention over a small window
Self-Attention
Introduced in “A Structured Self-Attentive Sentence Embedding” by Lin et al.
Basis of transformers
Simplified as a search & retrieval problem (Query, Key, Value)
Multi-head self-attention: Enables learning multiple representation subspaces
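To make the search & retrieval (Query, Key, Value) view concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; shapes and names are illustrative assumptions, not lecture code. Multi-head attention simply runs several such heads with separate projections and concatenates their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of values

# toy usage: 5 tokens, model dim 16, head dim 8
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)
```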
Transformer Architecture
Encoder Block
Self-attention layer
Feed-forward layer (to capture non-linearities)
Layer normalization
Residual connections
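A compact sketch of one encoder block under the same simplifying assumptions (single attention head, no learned layer-norm scale/shift, NumPy only); the point is the ordering of the sub-layers, residual connections, and layer normalization, not a production implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """One transformer encoder block (single head, simplified)."""
    # 1. self-attention sub-layer, then residual connection + layer norm
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    X = layer_norm(X + attn)
    # 2. position-wise feed-forward sub-layer (captures non-linearities),
    #    again followed by residual connection + layer norm
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2   # ReLU MLP
    return layer_norm(X + ffn)

# toy shapes: seq_len=4, d_model=8, d_ff=32
rng = np.random.default_rng(1)
d, d_ff = 8, 32
X = rng.normal(size=(4, d))
params = dict(
    Wq=rng.normal(size=(d, d)), Wk=rng.normal(size=(d, d)), Wv=rng.normal(size=(d, d)),
    W1=rng.normal(size=(d, d_ff)), b1=np.zeros(d_ff),
    W2=rng.normal(size=(d_ff, d)), b2=np.zeros(d),
)
print(encoder_block(X, **params).shape)          # (4, 8)
```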
Decoder Block
Similar to the encoder but adds a multi-head cross-attention layer over the encoder's output
Masking: Avoid looking into the future
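A small illustration of the decoder's causal mask: setting future positions to negative infinity before the softmax drives their attention weights to zero, so position i can only attend to positions up to i. The code is a NumPy sketch, not lecture material.

```python
import numpy as np

def causal_mask(seq_len):
    """Strictly upper-triangular -inf mask: blocks attention to future positions."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[-1])   # hide the future
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                  # uniform raw scores for 4 tokens
print(np.round(masked_softmax(scores), 2))
# row i has non-zero weight only on columns 0..i, e.g. row 0 -> [1, 0, 0, 0]
```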
Advantages and Drawbacks of Transformers
Advantages
Constant path length between sequence positions
Efficient parallelization
Drawbacks
Self-attention takes time quadratic in sequence length, O(n²)
Solutions: Big Bird, Linformer, Reformer
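As one illustration of how the quadratic cost can be reduced, the sketch below restricts each query to a sliding window of nearby keys, bringing the cost down to roughly O(n · window). This is only in the spirit of sparse-attention models such as Big Bird; it is not a reimplementation of any of the cited methods.

```python
import numpy as np

def local_attention(Q, K, V, window=2):
    """Sliding-window attention: each position attends only to neighbours
    within `window` positions, so cost grows as O(n * window), not O(n^2)."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()                      # softmax over the local window
        out[i] = w @ V[lo:hi]             # weighted sum of nearby values
    return out

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(10, 8))
print(local_attention(Q, K, V).shape)     # (10, 8)
```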
Applications of Transformers
GPT (Generative Pre-trained Transformer)
Models by OpenAI; the latest at the time of the lecture was GPT-3
Decoder-only architecture
Uses: in-context learning, summarization, natural language generation, code generation (e.g., GitHub Copilot in VS Code)
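A hypothetical example of in-context (few-shot) learning with a decoder-only model: the task is specified entirely in the prompt and the model is expected to continue the pattern. The prompt format and review texts below are made up for illustration.

```python
# Build a few-shot prompt: the task is "shown", not trained.
examples = [
    ("The movie was fantastic.", "positive"),
    ("I want my money back.", "negative"),
]
query = "Best purchase I've made all year."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"   # the model completes this line

print(prompt)
```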
BERT (Bidirectional Encoder Representations from Transformers)
Encoder-only architecture
Training tasks: Masked Language Modeling (toy example below), Next Sentence Prediction
Fine-tuning enables use for various downstream tasks
Evolution: ELECTRA and DeBERTa adapted new techniques for further improvements
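A toy sketch of the Masked Language Modeling corruption step mentioned above: roughly 15% of tokens are hidden and the originals become prediction targets. BERT's full recipe also sometimes keeps the token or swaps in a random one; that detail is omitted here.

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def mask_tokens(tokens, seed=0):
    """Hide ~15% of tokens; the originals become prediction targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            inputs.append(MASK)        # model sees the mask token...
            targets.append(tok)        # ...and must recover the original
        else:
            inputs.append(tok)
            targets.append(None)       # no prediction loss at this position
    return inputs, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(sentence))
```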
Conclusion
Exciting set of videos with remarkable speakers
Goals: Grasp the workings of transformers, understand their broad applications, and spark innovative research ideas