Lecture Notes on Transformers and Attention Mechanism
Jul 3, 2024
Introduction
Overview of transformers, key technology in AI and large language models.
Origin: 2017 paper "Attention is All You Need".
Aim: Understand the attention mechanism and how it processes data.
Key Context
Model goal: Take in text, predict the next word.
Text is broken into tokens, assumed to be words for simplicity.
Each token is associated with a high-dimensional vector called its embedding.
Directions in embedding space correspond to semantic meanings (e.g., gender).
Transformer adjusts embeddings for richer contextual meaning.
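A minimal sketch of the embedding step, assuming a toy four-word vocabulary and 8-dimensional vectors (GPT-3's real embedding matrix is learned and uses 12,288 dimensions); variable names here are illustrative:

```python
import numpy as np

# Toy sketch: each token id indexes a row of a learned embedding matrix,
# associating the token with a high-dimensional vector.
vocab = {"a": 0, "fluffy": 1, "blue": 2, "creature": 3}    # made-up mini vocabulary
d_embed = 8                                   # GPT-3 uses 12,288 dimensions
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_embed))  # stands in for learned weights

tokens = ["a", "fluffy", "blue", "creature"]
E = np.vstack([embedding_matrix[vocab[t]] for t in tokens])  # one vector per token
print(E.shape)                                # (4, 8): four tokens, 8 dims each
```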
Importance of Attention Mechanism
Complex and often confusing, but crucial for context-driven meaning.
Example Phrases
Multiple meanings of 'mole':
"American true mole", "one mole of carbon dioxide", "take a biopsy of the mole".
Contextual refinement:
Context helps refine generic vectors to specific meanings.
Embedding Process
Initial step: Break text into tokens and associate each token with a vector (its embedding).
Example: "Eiffel tower" vs. "miniature tower" — embedding updated by context.
Computational Details and Examples
Adjectives and nouns:
Imagine adjectives adjusting the meaning of the nouns they modify (e.g., "fluffy" and "blue" refining "creature").
Compute query, key, and value vectors for each token via matrix multiplications.
Queries and Keys
Compute the query vector q = Wq · e from the query matrix Wq and the embedding e.
Compute the key vector k = Wk · e from the key matrix Wk and the embedding e.
The dot product of q and k measures how well the key and query align, giving the attention score.
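A sketch of the query/key computation with made-up small dimensions (GPT-3 uses 12,288-dimensional embeddings and a 128-dimensional key/query space); the division by sqrt(d_key) is the standard scaling from "Attention Is All You Need":

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_embed, d_key = 4, 8, 4            # toy sizes
E = rng.normal(size=(n_tokens, d_embed))      # one embedding e per token

Wq = rng.normal(size=(d_embed, d_key))        # query matrix Wq (tunable weights)
Wk = rng.normal(size=(d_embed, d_key))        # key matrix Wk (tunable weights)

Q = E @ Wq                                    # query vector q for each token
K = E @ Wk                                    # key vector k for each token

# Dot product of every query with every key measures how well they align.
scores = Q @ K.T / np.sqrt(d_key)             # scaled dot-product attention scores
print(scores.shape)                           # (4, 4) grid, one score per pair
```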
Attention Pattern
Normalize scores via softmax to get weights between 0 and 1.
Create attention pattern grid representing relevant relationships.
Masking to prevent later tokens from influencing earlier ones.
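A sketch of the normalize-and-mask step, assuming a small grid of precomputed scores; it uses the common row-wise convention (each token's weights sum to 1), while the video draws the grid with columns normalized — the idea is the same:

```python
import numpy as np

def masked_softmax(scores: np.ndarray) -> np.ndarray:
    """Causal attention pattern: token i may only attend to tokens j <= i."""
    n = scores.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))          # allowed positions
    masked = np.where(causal, scores, -np.inf)             # block later tokens
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)   # rows sum to 1

scores = np.array([[0.2, 1.0, -0.5],
                   [0.7, 0.1,  0.3],
                   [0.0, 2.0,  1.0]])
print(masked_softmax(scores).round(2))        # weights in [0, 1]; zeros above diagonal
```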
Value Vectors and Updates
Use the value matrix Wv to produce a value vector v for each embedding.
Weight the value vectors by the attention pattern and sum them to get the update added to each embedding.
Example: "fluffy" updates the embedding of "creature" to reflect that attribute.
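A sketch of the value-and-update step with toy sizes; Wv is treated as a single square matrix for clarity (the factored version appears below under Implementation Details), and the attention pattern is a stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_embed = 4, 8                      # toy sizes
E = rng.normal(size=(n_tokens, d_embed))      # current embeddings

A = np.tril(np.ones((n_tokens, n_tokens)))    # stand-in causal attention pattern
A /= A.sum(axis=1, keepdims=True)             # each row sums to 1

Wv = rng.normal(size=(d_embed, d_embed)) * 0.1   # value matrix (conceptually square)
V = E @ Wv                                    # value vector v for each token

delta_E = A @ V                               # attention-weighted sum of values
E_updated = E + delta_E                       # e.g. "fluffy" nudging "creature"
print(E_updated.shape)                        # (4, 8): same shape, richer meaning
```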
Summary of a Single Head of Attention
Single attention head involves three matrices (query, key, value).
Whole process parameterized by matrices filled with tunable weights.
Parameter count (GPT-3 scale): query and key matrices ≈ 1.6M each; value (down + up) ≈ 3.1M; about 6.3M per attention head in total.
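A worked version of that count, assuming the GPT-3 dimensions quoted in the video (12,288-dimensional embeddings, 128-dimensional key/query space); the breakdown is simple arithmetic, not a quoted figure:

```python
d_embed, d_key = 12_288, 128          # GPT-3-scale dimensions

query = d_embed * d_key               # 1,572,864  ≈ 1.6M parameters
key   = d_embed * d_key               # 1,572,864  ≈ 1.6M parameters
value = 2 * d_embed * d_key           # down + up projections ≈ 3.1M parameters

per_head = query + key + value
print(f"{per_head:,}")                # 6,291,456 ≈ 6.3M parameters per head
```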
Multi-Headed Attention
Multiple heads in parallel, each with unique key, query, value maps (e.g., GPT-3 has 96 heads per block).
Outputs from heads are summed together and added to original embeddings.
Parameter count: ~600M per block for 96 heads.
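A sketch of multi-headed attention with toy sizes (3 heads instead of 96): every head has its own query, key, and value maps, runs independently, and the proposed changes are summed onto the original embeddings; the helper name `single_head` is mine:

```python
import numpy as np

def single_head(E, Wq, Wk, Wv):
    """One head: scores -> causal softmax -> attention-weighted values."""
    n, d_key = len(E), Wq.shape[1]
    scores = (E @ Wq) @ (E @ Wk).T / np.sqrt(d_key)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ (E @ Wv)                       # this head's proposed change

rng = np.random.default_rng(2)
n_tokens, d_embed, d_key, n_heads = 4, 8, 2, 3   # toy sizes; GPT-3 has 96 heads
E = rng.normal(size=(n_tokens, d_embed))

E_out = E.copy()
for _ in range(n_heads):                      # each head gets its own matrices
    Wq = rng.normal(size=(d_embed, d_key))
    Wk = rng.normal(size=(d_embed, d_key))
    Wv = rng.normal(size=(d_embed, d_embed)) * 0.1
    E_out += single_head(E, Wq, Wk, Wv)       # sum the heads' changes
print(E_out.shape)                            # (4, 8)
```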
Implementation Details
Value map factored into value down and value up matrices for efficiency.
Standard implementations fold every head's value-up matrix into one combined output matrix, which differs from the conceptual explanation but is more practical.
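A quick sketch of why the factoring helps, using the GPT-3-scale numbers above ("value down" / "value up" as named in the explanation; the counts are my arithmetic):

```python
d_embed, d_key = 12_288, 128

full_value     = d_embed * d_embed                  # one square map: ≈ 151M params
factored_value = d_embed * d_key + d_key * d_embed  # value down + value up: ≈ 3.1M
print(f"{full_value:,} vs {factored_value:,}")

# Conceptually: value(e) = value_up @ (value_down @ e),
# a low-rank (rank-128) stand-in for the full 12,288 x 12,288 map.
```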
Additional Notes
Data also passes through multi-layer perceptron (MLP) blocks between attention blocks.
Multiple attention layers allow embeddings to continuously refine context.
Training Facts
Masking:
Essential during training to prevent future tokens from leaking information.
Context Bottleneck:
Complexity scales with context size, since the attention grid needs one score per query–key pair (bigger contexts mean much bigger grids).
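To make the scaling concrete: the attention grid holds one score per query–key pair, so doubling the context quadruples the grid (the context lengths below are just illustrative values):

```python
for context_size in (2_048, 8_192, 32_768):
    scores_per_head = context_size ** 2       # one score per (query, key) pair
    print(f"{context_size:>6} tokens -> {scores_per_head:,} scores per head")
```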
Total Parameter Summary
GPT-3: 96 layers, 58 billion parameters for attention heads.
Attention is a large share, but only about a third of GPT-3's ~175 billion total parameters.
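The 58 billion figure follows from the per-head count above (my arithmetic, using the head and layer counts quoted for GPT-3):

```python
per_head = 6_291_456                  # ≈ 6.3M parameters per attention head
heads_per_block, n_blocks = 96, 96    # GPT-3

per_block = per_head * heads_per_block
print(f"{per_block:,} per block")     # 603,979,776 ≈ 600M
print(f"{per_block * n_blocks:,}")    # 57,982,058,496 ≈ 58 billion for attention
```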
Future Topics
Next chapter: More on multi-layer perceptrons and training process.
Benefits of parallelization and scale for deep learning.
Further Learning Resources
Andrej Karpathy and Chris Olah: highly recommended materials on AI.
Vivek's videos on history/motivation of transformers.
Brit Cruise's video on the history of large language models.
Conclusion
Importance of parallelizable architectures in scaling model performance.