Transformers and Attention Mechanisms Overview
Aug 1, 2024
Lecture Notes: Understanding Transformers and Attention Mechanisms
Introduction to Transformers
Key technology in large language models (LLMs)
Originated from the 2017 paper "Attention is All You Need"
Focus on the attention mechanism and how it processes data
Goal of the Model
Objective: Predict the next word in a piece of text.
Tokens: Input text is divided into tokens (simplified as words for this explanation).
Embeddings: The initial step associates each token with a high-dimensional vector (its embedding).
High-Dimensional Embeddings
Directions in high-dimensional space correspond to semantic meanings.
Example: the direction from a masculine noun's embedding to the corresponding feminine noun's embedding is roughly the same across word pairs.
Aim: Adjust embeddings to encode richer contextual meanings beyond individual words.
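A minimal NumPy sketch of the embedding step, assuming a toy vocabulary with random vectors and a hypothetical embed helper (a real model learns one row per token in its vocabulary):

```python
import numpy as np

# Toy vocabulary and embedding table (random values; a real model learns these).
rng = np.random.default_rng(0)
vocab = ["mole", "tower", "eiffel", "miniature", "fluffy", "blue", "creature"]
d_model = 8                          # GPT-3 uses 12,288; 8 keeps the toy readable
embedding_table = rng.standard_normal((len(vocab), d_model))

def embed(token: str) -> np.ndarray:
    """Look up the context-free embedding for a token."""
    return embedding_table[vocab.index(token)]

# At this stage the lookup depends only on the token, not on its surroundings.
assert np.array_equal(embed("mole"), embed("mole"))

# Directions can carry meaning: in a learned space, the displacement from a
# masculine noun to its feminine counterpart tends to point the same way
# across word pairs (the classic man -> woman, king -> queen illustration).
```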
Attention Mechanism Overview
Many find the attention mechanism confusing; patience is key.
Example phrases:
"American shrew mole"
"One mole of carbon dioxide"
"Take a biopsy of the mole"
After the initial embedding, context does not yet influence a token's meaning, so "mole" gets the same representation in all three examples.
Contextual Influence of Words
Attention allows for moving information between embeddings based on context.
Example with "tower":
Preceded by "Eiffel": updated to represent the Eiffel Tower specifically.
Preceded by "miniature": updated to reflect a smaller scale.
Attention Block and Computation
After processing through multiple attention blocks, the final vector in the sequence must encode all the context needed to predict what comes next.
Simplified example: "A fluffy blue creature roamed the verdant forest."
Focus is on adjectives adjusting the embeddings of the nouns they describe via attention.
Attention Head Mechanics
Initial embedding: encodes a token's meaning but lacks context.
Query vector (q): created by mapping the noun's embedding into a smaller query space, effectively asking whether adjectives precede it. Generated by multiplying the embedding by a query matrix (W_Q).
Key vector (k): created similarly for the adjectives via a key matrix (W_K); keys act as potential answers to the queries.
Dot product: measures alignment between keys and queries to determine relevance; larger values indicate higher relevance.
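A rough sketch of the query/key computation (toy dimensions, random matrices; W_Q and W_K stand in for the learned query and key matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 8, 4               # GPT-3: 12,288 and 128; small numbers for readability
seq_len = 4                          # e.g. the tokens "a fluffy blue creature"

E   = rng.standard_normal((seq_len, d_model))   # context-free embeddings, one row per token
W_Q = rng.standard_normal((d_model, d_head))    # query matrix
W_K = rng.standard_normal((d_model, d_head))    # key matrix

Q = E @ W_Q        # each row asks a question, e.g. "are there adjectives in front of me?"
K = E @ W_K        # each row offers an answer, e.g. "I am an adjective"

scores = Q @ K.T   # scores[i, j]: how well token j's key aligns with token i's query
                   # (real implementations also divide by sqrt(d_head) before the softmax)
print(scores.shape)                  # (seq_len, seq_len) grid of relevance scores
```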
Attention Pattern and Softmax
Attention pattern: the grid of key-query dot-product scores, representing how relevant each token is to each other token.
Softmax: normalizes the scores into a probability distribution of weights.
Masking: prevents later tokens from influencing earlier ones by setting the corresponding entries to negative infinity before the softmax, so they become zero afterward.
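A small sketch of turning those scores into a causal attention pattern (the masked_softmax helper is illustrative, not a library function):

```python
import numpy as np

def masked_softmax(scores: np.ndarray) -> np.ndarray:
    """Turn raw query-key scores into a causal attention pattern."""
    seq_len = scores.shape[0]
    # Causal mask: token i may not attend to later tokens j > i, so those
    # entries are set to -inf and become exactly 0 after the softmax.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax each row into a probability distribution (subtract the row max for stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

pattern = masked_softmax(np.array([[0.0, 1.0, 2.0],
                                   [1.0, 0.0, 1.0],
                                   [2.0, 1.0, 0.0]]))
print(pattern.round(2))   # upper-triangular entries are 0; each row sums to 1
```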
Value Vectors and Updating Embeddings
A value matrix produces value vectors, which carry the information used to update other embeddings based on relevance.
Update mechanism: a weighted sum of value vectors (weighted by the attention pattern) is added to the original embedding so it reflects contextual meaning.
Final output: a sequence of refined embeddings after passing through the attention block.
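Putting the pieces together, a minimal single-head self-attention sketch; it assumes the value map goes straight back to embedding space (real models factor it into value-down and value-up matrices) and uses tiny made-up dimensions:

```python
import numpy as np

def attention_head(E, W_Q, W_K, W_V):
    """One self-attention head: returns the update proposed for each embedding."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # scaled key-query dot products
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)           # causal masking
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)              # attention pattern, rows sum to 1
    return A @ V                                       # weighted sum of value vectors

rng = np.random.default_rng(2)
seq_len, d_model, d_head = 4, 8, 4
E   = rng.standard_normal((seq_len, d_model))
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_model))          # value map back into embedding space
E_refined = E + attention_head(E, W_Q, W_K, W_V)       # each embedding absorbs its context
```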
Multi-Headed Attention
Multi-headed attention involves running several attention heads in parallel (e.g., GPT-3 has 96 heads).
Each head has its own key, query, and value matrices, producing its own attention pattern.
The proposed updates from all heads are summed and added to each embedding to refine it further.
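A sketch of the multi-headed version, reusing the attention_head function from the previous block and summing each head's proposed update (head count and sizes are toy values; GPT-3 uses 96 heads):

```python
import numpy as np

def multi_head_attention(E, heads):
    """Run several attention heads in parallel and sum their proposed updates."""
    update = np.zeros_like(E)
    for W_Q, W_K, W_V in heads:                      # one (query, key, value) triple per head
        update += attention_head(E, W_Q, W_K, W_V)   # attention_head from the sketch above
    return E + update

rng = np.random.default_rng(3)
seq_len, d_model, d_head, n_heads = 4, 8, 4, 3
E = rng.standard_normal((seq_len, d_model))
heads = [(rng.standard_normal((d_model, d_head)),    # W_Q
          rng.standard_normal((d_model, d_head)),    # W_K
          rng.standard_normal((d_model, d_model)))   # W_V
         for _ in range(n_heads)]
E_refined = multi_head_attention(E, heads)
```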
Parameter Count Overview
Each attention head consists of several matrices that contribute to the overall parameter count:
Key matrix: about 1.6 million parameters (12,288-dimensional embeddings mapped to a 128-dimensional key space).
Query matrix: about 1.6 million parameters, the same shape as the key matrix.
Value matrix: factored into "value down" and "value up" maps so that its parameter count matches the key and query matrices combined.
Roughly 6.3 million parameters per attention head, and about 600 million per multi-headed block of 96 heads.
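The bullet-point figures follow from GPT-3's published sizes (12,288-dimensional embeddings, 128-dimensional keys/queries, 96 heads per block); a quick back-of-the-envelope check:

```python
d_model, d_head, n_heads = 12_288, 128, 96

key_params   = d_model * d_head                       # 1,572,864  (~1.6 million)
query_params = d_model * d_head                       # 1,572,864  (~1.6 million)
# Value map factored into "value down" (d_model -> d_head) and
# "value up" (d_head -> d_model), matching key + query combined.
value_params = d_model * d_head + d_head * d_model    # 3,145,728

per_head  = key_params + query_params + value_params  # 6,291,456   (~6.3 million)
per_block = per_head * n_heads                        # 603,979,776 (~600 million)
print(f"{per_head:,} per head, {per_block:,} per block")
```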
Self-Attention vs. Cross-Attention
Self-attention: queries, keys, and values all come from the same sequence (e.g., a single piece of text).
Cross-attention: processes two distinct sequences or data types (e.g., a source sentence and its translation), with keys and values from one and queries from the other.
There is typically no masking in cross-attention, since there is no notion of later tokens needing to be kept from influencing earlier ones.
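A brief sketch of the cross-attention variant under the same toy setup: queries come from one sequence, keys and values from the other, and no causal mask is applied:

```python
import numpy as np

def cross_attention(E_query, E_context, W_Q, W_K, W_V):
    """Each query-side token gathers information from the whole context sequence."""
    Q = E_query   @ W_Q
    K = E_context @ W_K
    V = E_context @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # no masking here
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(4)
d_model, d_head = 8, 4
E_target = rng.standard_normal((3, d_model))   # e.g. the translation produced so far
E_source = rng.standard_normal((5, d_model))   # e.g. the sentence being translated
update = cross_attention(E_target, E_source,
                         rng.standard_normal((d_model, d_head)),
                         rng.standard_normal((d_model, d_head)),
                         rng.standard_normal((d_model, d_model)))
print(update.shape)                            # (3, d_model): one update per target token
```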
Final Insights on Transformers
Transformers consist of many layers, enabling deeper context understanding.
Scaling the context size is challenging because the attention pattern grows with the square of the context length; various methods are being explored to make larger contexts tractable.
The attention mechanism's parallelizability is crucial for efficiency, especially with GPUs.
Learning Resources
For further exploration, check out works by Andrej Karpathy, Chris Olah, or other recommended videos on the history and development of LLMs.