Understanding Transformers and the Attention Mechanism
May 30, 2024
Lecture Notes on Transformers and the Attention Mechanism
Introduction
Topic: Understanding the transformer architecture and attention mechanism in AI models.
Reference: “Attention Is All You Need” (Vaswani et al., 2017).
Objective: To visualize and comprehend the internal workings of the attention mechanism in transformers.
Transformer Recap
Purpose: Predict the next word in a given piece of text.
Tokenization: Input text is broken into tokens (words or pieces of words).
Embedding: The first step is to associate each token with a high-dimensional vector (its embedding); see the sketch after this recap.
Embedding Space: Different directions can correspond to semantic meanings (e.g., gender, contextual meanings).
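A minimal sketch of these first two steps, tokenization and embedding lookup (the toy vocabulary, whitespace tokenizer, and 16-dimensional embeddings are made up for illustration; real models use learned subword tokenizers over vocabularies of tens of thousands of tokens, and GPT-3’s embeddings have 12,288 dimensions):

```python
import numpy as np

# Toy vocabulary and whitespace "tokenizer" (illustrative only).
vocab = {"a": 0, "fluffy": 1, "blue": 2, "creature": 3,
         "roamed": 4, "the": 5, "verdant": 6, "forest": 7}
d_embed = 16  # illustrative; GPT-3 uses 12,288

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_embed))  # one row per token

def embed(text):
    token_ids = [vocab[w] for w in text.split()]  # tokenization
    return embedding_matrix[token_ids]            # embedding lookup

E = embed("a fluffy blue creature roamed the verdant forest")
print(E.shape)  # (8, 16): one vector per token
```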
Attention Mechanism
Confusing Concept: Often perceived as complex; requires time to understand.
Behavioral Goal: Enable the model to distinguish various meanings of words based on context.
Examples of Contextual Meaning
Word “Mole”: Different meanings in “American shrew mole”, “one mole of carbon dioxide”, and “take a biopsy of the mole”, depending on context.
Word “Tower”: Generic embedding refined by context such as “Eiffel Tower” or “miniature tower”.
Attention Blocks
Contextual Meaning Refinement: Update embeddings based on surrounding contexts via attention blocks.
Final Vector: The prediction of the next token is made from the last vector in the sequence, which by then should encode all the relevant context.
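As a reminder of how that last vector is used, a minimal sketch (the unembedding matrix and final softmax come from the series’ earlier overview of transformers; all sizes and weights below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_embed = 8, 16
unembedding = rng.normal(size=(vocab_size, d_embed))  # one row per vocabulary token

final_vector = rng.normal(size=d_embed)  # last token's fully refined embedding
logits = unembedding @ final_vector      # one score per possible next token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # softmax: probability of each next token
print(probs.round(3))
```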
Single-Head Attention Example
Phrase: “a fluffy blue creature roamed the verdant forest” focusing on adjectives influencing nouns.
Embeddings and Queries: Initial embeddings encode both the word and its position; query vectors are computed from these embeddings.
Matrices Involved:
Query Matrix (Wq): Maps embeddings to query vectors.
Key Matrix (Wk): Produces key vectors that can potentially answer the queries.
Value Matrix (Wv): Produces value vectors that are added to refine the embeddings.
Dot Product: Measures alignment between queries and keys to form an attention pattern.
Softmax Normalization: Normalizes dot products to form a probability distribution.
Masking: Prevents later tokens from influencing earlier ones; their attention scores are set to negative infinity before the softmax, so training on next-token prediction does not leak future words.
Weighted Sum Update: Value vectors, weighted by the attention pattern, are summed and added to update each embedding (see the sketch after this list).
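Putting the pieces of this list together, a minimal single-head attention sketch in NumPy (the sizes are illustrative, the weight matrices would be learned in practice, and the convention here puts one query per row):

```python
import numpy as np

def single_head_attention(E, Wq, Wk, Wv):
    """E: (n_tokens, d_embed) embeddings; Wq, Wk: (d_embed, d_key); Wv: (d_embed, d_embed)."""
    Q = E @ Wq  # query vectors, one per token
    K = E @ Wk  # key vectors, one per token
    V = E @ Wv  # value vectors, one per token

    # Dot products measure query/key alignment (scaled by sqrt of key dimension).
    scores = Q @ K.T / np.sqrt(Wq.shape[1])

    # Masking: later tokens must not influence earlier ones, so set those
    # entries to negative infinity before the softmax.
    n = E.shape[0]
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf

    # Softmax each row into a probability distribution: the attention pattern.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of value vectors gives the update added to each embedding.
    return E + weights @ V

rng = np.random.default_rng(0)
n_tokens, d_embed, d_key = 8, 16, 4  # toy sizes
E = rng.normal(size=(n_tokens, d_embed))
out = single_head_attention(E,
                            rng.normal(size=(d_embed, d_key)),
                            rng.normal(size=(d_embed, d_key)),
                            rng.normal(size=(d_embed, d_embed)))
print(out.shape)  # (8, 16): same shape, contextually refined
```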
Multi-Head Attention
Multiple Heads: Multiple attention heads run in parallel, each with distinct key, query, and value maps.
Example: GPT-3 uses 96 attention heads in each block.
Compositional Updates: The updates proposed by all heads are summed and added to each embedding (a sketch follows this list).
Parameter Count: Each head carries its own key, query, and value matrices, so attention parameters add up quickly across heads and blocks.
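Building on the single-head sketch above, a multi-head version that sums the updates from several heads, each with its own matrices (head count and sizes are toy values; GPT-3 uses 96 heads per block):

```python
import numpy as np

def multi_head_attention(E, heads):
    """heads: list of (Wq, Wk, Wv) tuples, one per attention head."""
    update = np.zeros_like(E)
    n = E.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    for Wq, Wk, Wv in heads:
        Q, K, V = E @ Wq, E @ Wk, E @ Wv
        scores = Q @ K.T / np.sqrt(Wq.shape[1])
        scores[mask] = -np.inf
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        update += weights @ V  # each head proposes its own update
    return E + update          # summed updates applied to the embeddings

rng = np.random.default_rng(0)
n_tokens, d_embed, d_key = 8, 16, 4
E = rng.normal(size=(n_tokens, d_embed))
heads = [(rng.normal(size=(d_embed, d_key)),
          rng.normal(size=(d_embed, d_key)),
          rng.normal(size=(d_embed, d_embed))) for _ in range(4)]  # 4 toy heads
print(multi_head_attention(E, heads).shape)  # (8, 16)
```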
Efficiency and Scale
Parallelizability: Attention mechanism's high parallelizability suits GPU execution, enhancing model scalability.
Contribution to Model Size: Attention accounts for a substantial portion of GPT-3’s parameters, roughly 58 billion of 175 billion (about one-third); a rough tally follows.
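A back-of-the-envelope tally behind that figure, using GPT-3’s published sizes (embedding dimension 12,288; key/query dimension 128; 96 heads per block; 96 blocks) and following the source’s framing of the value map as a factored “value down” / “value up” pair:

```python
d_embed = 12_288       # embedding dimension
d_key = 128            # key/query (and value-down) dimension
heads_per_block = 96
n_blocks = 96

# Per head: query map + key map + value-down map + value-up map
per_head = 2 * (d_embed * d_key) + (d_embed * d_key) + (d_key * d_embed)
total_attention = per_head * heads_per_block * n_blocks

print(f"{total_attention:,} parameters")                    # ~58 billion
print(f"{total_attention / 175_000_000_000:.0%} of GPT-3")  # roughly one-third
```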
Further Learning
Recommended Resources: Andrej Karpathy, Chris Olah, Vivek, and Brit Cruse’s content on AI and transformers.
Conclusion
Next Steps: Future chapters will cover multi-layer perceptrons and the training process of transformers.