Lecture Notes: Understanding Transformers and Attention Mechanisms
Introduction
- Transformers are a key technology in modern AI, especially for large language models.
- Introduced in the 2017 paper "Attention Is All You Need."
- Focus: Understand the attention mechanism and its role in data processing.
Goal of Transformers
- Predict the next word in a sequence of text.
- Text is broken into tokens, which are often whole words or pieces of words.
- Tokens are associated with high-dimensional vectors, called embeddings.
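A minimal sketch of the token-to-embedding step in NumPy; the tiny vocabulary, the 8-dimensional embedding size, and the random embedding table are illustrative placeholders, not values from any real model.

```python
import numpy as np

# Toy vocabulary and embedding table; sizes and values are illustrative only.
vocab = {"the": 0, "Eiffel": 1, "tower": 2, "miniature": 3}
d_model = 8                                   # real models use thousands of dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "Eiffel", "tower"]
token_ids = [vocab[t] for t in tokens]
embeddings = embedding_table[token_ids]       # one d_model-dim vector per token
print(embeddings.shape)                       # (3, 8)
```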
Embeddings and Semantic Meaning
- Directions in high-dimensional embedding space can reflect semantic meanings (e.g., gender).
- Transformers progressively adjust these embeddings so they carry richer, context-dependent meaning (a toy illustration of a semantic direction follows below).
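A contrived low-dimensional illustration of a semantic direction. The 2-D vectors below are hand-picked for the example; real embeddings are learned and high-dimensional, and the "gender" and "royalty" axes are assumptions made purely to show the idea.

```python
import numpy as np

# Contrived 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
king  = np.array([1.0,  1.0])
queen = np.array([1.0, -1.0])
man   = np.array([0.0,  1.0])
woman = np.array([0.0, -1.0])

# Moving along the gender direction: king - man + woman lands on queen.
print(king - man + woman)   # [ 1. -1.]  -> equals queen
```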
Attention Mechanism
- Allows context to refine the meaning of a word (e.g., "mole" means different things in different sentences).
- Shifts each token's embedding based on the surrounding tokens.
- The attention block updates each embedding so that it folds in contextual information.
Examples
- Example: "Eiffel tower" vs. "miniature tower": the neighboring word shifts the direction of the "tower" vector.
- Attention allows rich information transfer beyond single words.
Computational Details
- Initial embeddings are vectors encoding word meaning and position.
- The process involves matrix-vector products whose weights are tunable, learned parameters.
- Queries and keys are derived from embeddings to form an attention pattern.
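A hedged sketch of the query/key projections, assuming NumPy and small illustrative sizes; the weight matrices here are random placeholders standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 3, 8, 4           # illustrative sizes, not GPT-3's

E   = rng.normal(size=(seq_len, d_model))    # one embedding per token
W_Q = rng.normal(size=(d_model, d_head))     # learned weights in practice,
W_K = rng.normal(size=(d_model, d_head))     # random placeholders here

Q = E @ W_Q    # each row: the query vector asked by one token
K = E @ W_K    # each row: the key vector offered by one token
```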
Attention Pattern
- Queries encode what each token is looking for in its context (e.g., a noun checking for adjectives in front of it).
- Keys encode what each token can offer in answer to those queries; alignment is checked via dot products.
- The dot products measure relevance between words, and the resulting grid of scores is the attention pattern (see the sketch below).
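A minimal sketch of turning queries and keys into an attention pattern via dot products; the vectors are random placeholders, and the sqrt(d_head) scaling follows the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_head = 4, 4
Q = rng.normal(size=(seq_len, d_head))   # query vectors, one per token
K = rng.normal(size=(seq_len, d_head))   # key vectors, one per token

# scores[i, j]: how well token j's key answers token i's query.
scores = Q @ K.T / np.sqrt(d_head)       # scaled dot products
print(scores.shape)                      # (4, 4) grid = the attention pattern
```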
Softmax and Masking
- Softmax normalizes the scores so that each set of attention weights is positive and sums to 1, like a probability distribution.
- Masking blocks later tokens from influencing earlier ones, so the model cannot peek at the future during training (sketched below).
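A small self-contained sketch of softmax with a causal mask, assuming NumPy; causal_softmax is a hypothetical helper name, not a library function.

```python
import numpy as np

def causal_softmax(scores):
    """Mask out future positions, then normalize each row to sum to 1."""
    n = scores.shape[0]
    allowed = np.tril(np.ones((n, n), dtype=bool))        # token i may see j <= i
    masked = np.where(allowed, scores, -np.inf)           # block "future" entries
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

pattern = causal_softmax(np.arange(9.0).reshape(3, 3))
print(pattern)   # rows sum to 1; entries above the diagonal are exactly 0
```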
Multi-headed Attention
- Multiple heads run in parallel, each with distinct key, query, and value matrices.
- GPT-3 uses 96 attention heads per block.
- Each head contributes its own contextual update to the embeddings (see the sketch below).
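A compact sketch of multi-head self-attention (masking omitted for brevity), assuming NumPy; multi_head_attention, softmax_rows, and all the weight shapes are illustrative choices, not GPT-3's.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(E, W_Q, W_K, W_V, W_O):
    """Minimal multi-head self-attention over embeddings E."""
    head_outputs = []
    for wq, wk, wv in zip(W_Q, W_K, W_V):             # one set of matrices per head
        Q, K, V = E @ wq, E @ wk, E @ wv
        pattern = softmax_rows(Q @ K.T / np.sqrt(wq.shape[-1]))
        head_outputs.append(pattern @ V)              # this head's contextual update
    return np.concatenate(head_outputs, axis=-1) @ W_O    # combine heads, project back

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 4, 8, 2, 4        # toy sizes
E = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(E, W_Q, W_K, W_V, W_O).shape)   # (4, 8)
```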
Parameter Count
- Key, query, and value matrices contribute to model parameters.
- GPT-3 devotes roughly 58 billion parameters to its attention heads (worked out below).
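A back-of-the-envelope check of the ~58 billion figure, assuming GPT-3's published sizes (96 layers, 96 heads per block, d_model = 12288, d_head = 128) and counting four d_model x d_head matrices' worth of parameters per head (query, key, value, and the head's share of the output projection).

```python
# Rough attention-parameter count using GPT-3's published sizes.
layers, heads, d_model, d_head = 96, 96, 12_288, 128
per_head = 4 * d_model * d_head          # query, key, value, output share
total = layers * heads * per_head
print(f"{total:,}")                      # 57,982,058,496  ~ 58 billion
```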
Practical Considerations
- The attention mechanism's parallelizability is a major advantage.
- Its computations map well onto GPUs, making large-scale training and inference feasible.
Conclusion
- Attention is what lets large models build context-aware representations efficiently.
- Further reading and resources are available for deeper understanding.
Resources
- Recommended: material by Andrej Karpathy and Chris Olah, Vivek's videos, and Brit Cruse's channel "The Art of the Problem."
Note:
Focus on understanding how transformers adjust word vectors through attention to process language with contextual nuances, leveraging parallel computing for efficiency.