Transformers and Attention Mechanisms Overview

Aug 1, 2024

Lecture Notes: Understanding Transformers and Attention Mechanisms

Introduction to Transformers

  • Key technology in large language models (LLMs)
  • Originated from the 2017 paper "Attention is All You Need"
  • Focus on the attention mechanism and how it processes data

Goal of the Model

  • Objective: Predict the next word in a piece of text.
  • Tokens: Input text is divided into tokens (simplified as words for this explanation).
  • Embeddings: Initial step involves associating each token with a high-dimensional vector (embedding).
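
A minimal sketch of the tokenize-then-embed step, assuming a toy word-level vocabulary; the words, sizes, and random vectors here are made up purely for illustration:

```python
import numpy as np

# Toy word-level "tokenizer", matching the simplification in these notes
# (real models use subword tokenizers and far larger vocabularies).
vocab = {"the": 0, "eiffel": 1, "tower": 2, "is": 3, "tall": 4}

text = "the eiffel tower is tall"
token_ids = [vocab[word] for word in text.split()]      # [0, 1, 2, 3, 4]

# Embedding: a learned lookup table with one row per token in the vocabulary.
d_embed = 8                                             # tiny; GPT-3 uses 12,288
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_embed))
embeddings = embedding_table[token_ids]                 # one vector per token, shape (5, 8)
```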

High-Dimensional Embeddings

  • Directions in high-dimensional space correspond to semantic meanings.
  • Example: the direction from a masculine noun's embedding to its feminine counterpart is roughly the same across word pairs (illustrated in the sketch below).
  • Aim: Adjust embeddings to encode richer contextual meanings beyond individual words.
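
A hand-made illustration of "directions carry meaning", using made-up 3-D vectors chosen so the gender direction is explicit; real models learn such directions in thousands of dimensions:

```python
import numpy as np

# Made-up 3-D vectors: axis 0 loosely encodes gender, axis 1 royalty.
man   = np.array([ 1.0, 0.0, 0.3])
woman = np.array([-1.0, 0.0, 0.3])
king  = np.array([ 1.0, 1.0, 0.3])
queen = np.array([-1.0, 1.0, 0.3])

gender_direction = woman - man            # masculine -> feminine
shifted = king + gender_direction         # move "king" along that direction

cos = shifted @ queen / (np.linalg.norm(shifted) * np.linalg.norm(queen))
print(round(cos, 3))                      # ~1.0: the shifted vector lands on "queen"
```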

Attention Mechanism Overview

  • Many find the attention mechanism confusing; patience is key.
  • Example Phrases:
    • "American true mole"
    • "One mole of carbon dioxide"
    • "Take a biopsy of the mole"
  • After the initial embedding step, context plays no role: the lookup is the same regardless of surrounding words, so "mole" gets the same vector in all three phrases (see the sketch below).
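
A short sketch of this point: the initial embedding is a pure table lookup, so the same token always maps to the same vector (toy vocabulary and random vectors for illustration):

```python
import numpy as np

vocab = {"one": 0, "mole": 1, "of": 2, "take": 3, "a": 4, "biopsy": 5, "the": 6}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

def initial_embed(words):
    # Context-free lookup: each word maps to its row in the table.
    return embedding_table[[vocab[w] for w in words]]

chem = initial_embed("one mole of".split())
med  = initial_embed("take a biopsy of the mole".split())

# The row for "mole" is identical in both phrases: context plays no role yet.
print(np.array_equal(chem[1], med[5]))    # True
```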

Contextual Influence of Words

  • Attention allows for moving information between embeddings based on context.
  • Example with "tower":
    • Preceded by "Eiffel": Update to represent "Eiffel Tower".
    • Preceded by "miniature": Update to reflect smaller scale.

Attention Block and Computation

  • After the text flows through many attention blocks, the final vector in the sequence must encode context from the whole passage, since it is the one used to predict the next word.
  • Simplified example: "A fluffy blue creature roamed the verdant forest."
  • Focus is on adjectives adjusting corresponding nouns using attention.

Attention Head Mechanics

  1. Initial Embedding: Encodes meaning but lacks context.
  2. Query Vector (q): A smaller vector computed from each embedding; conceptually, a noun asks "are there any adjectives in front of me?"
    • Generated by multiplying the embedding by a query matrix (W_Q).
  3. Key Vector (k): Computed the same way with a key matrix (W_K); conceptually, keys answer the queries (an adjective signals "I describe what follows").
  4. Dot Product: Measure alignment between keys and queries to determine relevance.
    • Larger values indicate higher relevance.
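
A minimal NumPy sketch of steps 2–4, with tiny made-up dimensions (GPT-3 uses an embedding size of 12,288 and a key/query size of 128); the scores are also divided by the square root of the key dimension, a standard stabilising detail:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_embed, d_key = 4, 8, 4           # e.g. "a fluffy blue creature"

E   = rng.normal(size=(seq_len, d_embed))   # context-free embeddings, one row per token
W_Q = rng.normal(size=(d_embed, d_key))     # query matrix
W_K = rng.normal(size=(d_embed, d_key))     # key matrix

Q = E @ W_Q      # each row: "are there adjectives in front of me?" (for a noun)
K = E @ W_K      # each row: "I am an adjective describing what follows"

# Dot products measure how well each key aligns with each query;
# larger values mean that token is more relevant to the query's position.
scores = Q @ K.T / np.sqrt(d_key)
print(scores.shape)                         # (4, 4): one score per query/key pair
```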

Attention Pattern and Softmax

  • Attention Pattern: Grid of dot product scores representing relevance.
  • Softmax: Used to normalize scores into a probability distribution.
  • Masking: Prevents later tokens from influencing earlier ones by setting those entries to negative infinity before the softmax, so they become zero afterward.
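
A sketch of turning a grid of scores into an attention pattern, assuming rows are queries and columns are keys (random scores stand in for the real ones):

```python
import numpy as np

def attention_pattern(scores):
    # scores[i, j]: how relevant token j (key) is to token i (query).
    # Causal mask: token i must not be influenced by later tokens j > i, so
    # those entries are set to -inf and become exactly 0 after the softmax.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))   # stable softmax
    return weights / weights.sum(axis=1, keepdims=True)            # each row sums to 1

rng = np.random.default_rng(0)
pattern = attention_pattern(rng.normal(size=(4, 4)))
print(pattern.round(2))     # lower-triangular grid of weights, rows summing to 1
```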

Value Vectors and Updating Embeddings

  • A value matrix (W_V) maps each embedding to a value vector: the change that token would contribute to another embedding if it turns out to be relevant.
  • Update Mechanism: Weighted sum of value vectors modifies the original embedding to reflect contextual meaning.
  • Final Output: Sequence of refined embeddings after processing through attention.
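
Putting the pieces together for a single attention head, again with toy sizes and random matrices; the value matrix is left unfactored here for simplicity (the factored form appears under parameter counts below):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_embed, d_key = 4, 8, 4

E   = rng.normal(size=(seq_len, d_embed))    # context-free embeddings
W_Q = rng.normal(size=(d_embed, d_key))
W_K = rng.normal(size=(d_embed, d_key))
W_V = rng.normal(size=(d_embed, d_embed))    # value matrix (unfactored for simplicity)

def attention_pattern(scores):
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

pattern = attention_pattern((E @ W_Q) @ (E @ W_K).T / np.sqrt(d_key))
V       = E @ W_V            # value vector per token: what it would add if relevant
deltas  = pattern @ V        # weighted sum of value vectors for each position
E_out   = E + deltas         # refined embeddings now carry contextual meaning
```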

Multi-Headed Attention

  • Multi-headed attention involves running several attention heads in parallel (e.g., GPT-3 has 96 heads).
  • Each head has its own key, query, and value matrices, producing its own attention pattern.
  • Each head proposes a change to every embedding; these proposed changes are summed and added to the original embeddings.
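
A sketch of running several heads in parallel and summing their proposed changes, with toy sizes and the value map factored into "down" and "up" matrices as described under parameter counts:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_embed, d_key, n_heads = 4, 8, 2, 3      # GPT-3: 12,288 / 128 / 96 heads

E = rng.normal(size=(seq_len, d_embed))

def attention_pattern(scores):
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

def head_output(E):
    # Each head has its own key, query, and (factored) value matrices,
    # so each produces a distinct attention pattern and proposed change.
    W_Q  = rng.normal(size=(d_embed, d_key))
    W_K  = rng.normal(size=(d_embed, d_key))
    W_Vd = rng.normal(size=(d_embed, d_key))       # value-down
    W_Vu = rng.normal(size=(d_key, d_embed))       # value-up
    pattern = attention_pattern((E @ W_Q) @ (E @ W_K).T / np.sqrt(d_key))
    return pattern @ (E @ W_Vd) @ W_Vu

E_out = E + sum(head_output(E) for _ in range(n_heads))   # sum all heads' changes
```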

Parameter Count Overview

  • Each attention head consists of multiple matrices contributing to overall parameters:
    • Key Matrix: 1.5 million parameters.
    • Query Matrix: 1.5 million parameters.
    • Value Matrix: Factored into a "value down" and a "value up" matrix, about 3.1 million parameters (the same as key and query combined).
  • Approx. 6.3 million parameters per attention head; roughly 600 million across the 96 heads of one multi-headed block.
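
The arithmetic behind those figures, using the GPT-3 sizes quoted above (embedding dimension 12,288, key/query dimension 128, 96 heads):

```python
d_embed, d_key, n_heads = 12_288, 128, 96

key_params   = d_embed * d_key                       # 1,572,864  (~1.5 M)
query_params = d_embed * d_key                       # 1,572,864  (~1.5 M)
value_params = d_embed * d_key + d_key * d_embed     # value-down + value-up (~3.1 M)

per_head  = key_params + query_params + value_params # 6,291,456   (~6.3 M)
per_block = per_head * n_heads                       # 603,979,776 (~600 M)
print(f"{per_head:,} per head, {per_block:,} per multi-headed block")
```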

Self-Attention vs. Cross-Attention

  • Self-Attention: Processes single data type (e.g., text).
  • Cross-Attention: Attends between two distinct sequences or data types (e.g., a sentence and its translation).
  • No masking is typically applied in cross-attention, since there is no issue of later tokens influencing earlier ones.
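
A minimal cross-attention sketch: queries come from one sequence, keys and values from another, and no causal mask is applied (random toy vectors throughout):

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_key = 8, 4
source = rng.normal(size=(5, d_embed))   # e.g. embeddings of a French sentence
target = rng.normal(size=(3, d_embed))   # e.g. embeddings of the English output so far

W_Q = rng.normal(size=(d_embed, d_key))
W_K = rng.normal(size=(d_embed, d_key))
W_V = rng.normal(size=(d_embed, d_embed))

# Queries from the target, keys/values from the source; no mask is needed,
# because there is no notion of "later" source tokens to hide.
scores  = (target @ W_Q) @ (source @ W_K).T / np.sqrt(d_key)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
target_out = target + weights @ (source @ W_V)
```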

Final Insights on Transformers

  • Transformers consist of many layers, enabling deeper context understanding.
  • Scaling context size poses challenges; various methods are explored to enhance scalability.
  • The attention mechanism's parallelizability is crucial for efficiency, especially with GPUs.

Learning Resources

  • For further exploration, check out works by Andrej Karpathy, Chris Olah, or other recommended videos on the history and development of LLMs.