Lecture Notes on Transformers and Attention Mechanism

Jul 3, 2024

Introduction

  • Overview of transformers, a key technology behind modern AI and large language models.
  • Origin: 2017 paper "Attention is All You Need".
  • Aim: understand the attention mechanism and how it processes data.

Key Context

  • Model goal: Take in text, predict the next word.
  • Text is broken into tokens, assumed to be words for simplicity.
  • Each token is associated with a high-dimensional vector called its embedding (see the sketch after this list).
  • Directions in embedding space correspond to semantic meanings (e.g., gender).
  • Transformer adjusts embeddings for richer contextual meaning.
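
As a rough illustration of two of these ideas (looking up a token's embedding, and semantic directions in embedding space), here is a minimal NumPy sketch; the vocabulary, dimensions, and "gender direction" are made up for the example, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (illustrative sizes, not GPT-3's).
vocab = ["the", "fluffy", "blue", "creature", "tower", "mole"]
d_embed = 8                                  # real models use thousands of dimensions
E = rng.normal(size=(len(vocab), d_embed))   # one row per token in the vocabulary

# Embedding a token means looking up its row in the table.
e_creature = E[vocab.index("creature")]

# A "semantic direction" is just a vector in this space; projecting an
# embedding onto it measures how strongly that meaning is present.
gender_direction = rng.normal(size=d_embed)  # stand-in for a learned direction
print(e_creature.shape)                      # (8,)
print(float(e_creature @ gender_direction))  # projection onto the direction
```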

Importance of Attention Mechanism

  • Complex and often confusing, but crucial for context-driven meaning.

Example Phrases

  • Multiple meanings of 'mole': the animal ("American true mole"), the chemistry unit ("one mole of carbon dioxide"), and the skin mark ("take a biopsy of the mole").
  • Contextual refinement: Context helps refine generic vectors to specific meanings.

Embedding Process

  • Initial step: Break text and associate each token with a vector (embedding).
  • Example: "Eiffel tower" vs. "miniature tower" — embedding updated by context.

Computational Details and Examples

  • Adjectives and nouns: imagine adjective tokens adjusting the embeddings of the nouns they modify (e.g., "fluffy" and "blue" adjusting "creature").
  • Compute query, key, and value vectors for each token via matrix multiplications.

Queries and Keys

  • Compute query vector q using query matrix Wq and embedding e.
  • Compute key vector k using key matrix Wk and embedding e.
  • The dot product of each query with each key measures how well the two align, i.e. how much one token should attend to another (see the sketch below).
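
A minimal NumPy sketch of this step, using toy dimensions rather than GPT-3's:

```python
import numpy as np

rng = np.random.default_rng(1)
d_embed, d_key, n_tokens = 8, 4, 5           # toy sizes; GPT-3 uses 12,288 and 128

E  = rng.normal(size=(n_tokens, d_embed))    # one embedding per token in the context
Wq = rng.normal(size=(d_key, d_embed))       # query matrix
Wk = rng.normal(size=(d_key, d_embed))       # key matrix

Q = E @ Wq.T                                 # q_i = Wq @ e_i, one query per token
K = E @ Wk.T                                 # k_i = Wk @ e_i, one key per token

# q_i . k_j measures how well key j aligns with query i, i.e. how relevant
# token j is for updating token i. (Real implementations also divide the
# scores by sqrt(d_key) before the softmax.)
scores = Q @ K.T                             # shape (n_tokens, n_tokens)
print(scores.shape)
```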

Attention Pattern

  • Normalize scores via softmax to get weights between 0 and 1.
  • Create attention pattern grid representing relevant relationships.
  • Masking: scores where a later token would influence an earlier one are set to negative infinity before the softmax, so they become zero weights (a sketch follows this list).
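
A minimal sketch of softmax normalization with causal masking, assuming the common rows-as-queries convention (the exact layout of the grid is a convention choice, not fixed by the mechanism):

```python
import numpy as np

def causal_softmax(scores):
    """Mask out future tokens, then normalize each row to sum to 1."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # key index > query index
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(masked)                             # exp(-inf) -> 0
    return weights / weights.sum(axis=1, keepdims=True)

scores = np.random.default_rng(2).normal(size=(5, 5))
attn = causal_softmax(scores)
print(np.round(attn, 2))   # lower-triangular grid; each row of weights sums to 1
```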

Value Vectors and Updates

  • Use value matrix to produce value vectors v for embeddings.
  • Weight the value vectors by the attention pattern and sum them to get the update added to each embedding.
  • Example: the value vector of "fluffy" is added (weighted by attention) to the embedding of "creature", so the updated embedding reflects that attribute (see the sketch below).
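
A minimal sketch of the update step, with toy sizes and a single unfactored value matrix for clarity; the uniform attention weights are placeholders standing in for the pattern computed above:

```python
import numpy as np

rng = np.random.default_rng(3)
d_embed, n_tokens = 8, 5

E    = rng.normal(size=(n_tokens, d_embed))            # current embeddings
Wv   = rng.normal(size=(d_embed, d_embed)) * 0.1       # value matrix (unfactored here)
attn = np.full((n_tokens, n_tokens), 1.0 / n_tokens)   # stand-in attention weights

V = E @ Wv.T                                           # v_j = Wv @ e_j for each token

# Each embedding receives an attention-weighted sum of value vectors:
# delta_e_i = sum_j attn[i, j] * v_j, then e_i <- e_i + delta_e_i.
# This is how the value vector of "fluffy" can nudge the embedding of "creature".
E_updated = E + attn @ V
print(E_updated.shape)                                 # context-adjusted embeddings
```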

Summary of a Single Head of Attention

  • Single attention head involves three matrices (query, key, value).
  • Whole process parameterized by matrices filled with tunable weights.
  • Parameter count (GPT-3 sizes): query and key matrices are roughly 1.5M parameters each; the value map (in its factored form, described below) adds roughly 3.1M; about 6.3M parameters per attention head in total (see the arithmetic below).
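
A quick check of those figures, using GPT-3's dimensions (12,288-dimensional embeddings, 128-dimensional key/query space) and counting the value map in the factored form described under Implementation Details below:

```python
d_embed, d_key = 12_288, 128         # GPT-3's embedding and key/query dimensions

query = d_embed * d_key              # 1,572,864  (~1.57M)
key   = d_embed * d_key              # 1,572,864  (~1.57M)
value = 2 * d_embed * d_key          # value-down + value-up: 3,145,728 (~3.1M)

print(f"{query + key + value:,}")    # 6,291,456, i.e. ~6.3M parameters per head
```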

Multi-Headed Attention

  • Multiple heads in parallel, each with unique key, query, value maps (e.g., GPT-3 has 96 heads per block).
  • Outputs from all heads are summed together and added to the original embeddings (see the sketch after this list).
  • Parameter count: ~600M per block for 96 heads.
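
A minimal sketch of the summing step, with toy sizes and placeholder heads (each head here is just a stand-in that returns some proposed update):

```python
import numpy as np

rng = np.random.default_rng(4)
d_embed, n_tokens, n_heads = 8, 5, 3          # toy sizes; GPT-3 uses 96 heads per block

E = rng.normal(size=(n_tokens, d_embed))

def head_update(E, rng):
    """Stand-in for one attention head: returns its proposed update to the embeddings."""
    Wv = rng.normal(size=(d_embed, d_embed)) * 0.1
    attn = np.full((len(E), len(E)), 1.0 / len(E))   # placeholder attention weights
    return attn @ (E @ Wv.T)

# Each head proposes its own update; the updates are summed and added back
# to the original embeddings.
E_out = E + sum(head_update(E, rng) for _ in range(n_heads))
print(E_out.shape)
```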

Implementation Details

  • Value map factored into value down and value up matrices for efficiency.
  • Standard practice deviates slightly from the single-value-matrix explanation above; the factored form gives the same kind of map with far fewer parameters (a sketch follows this list).
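
A minimal sketch of the factorization, with toy dimensions; the point is only the parameter count and the fact that the two small matrices compose into one low-rank map:

```python
import numpy as np

rng = np.random.default_rng(5)
d_embed, d_value = 12, 4                 # toy sizes; GPT-3 uses 12,288 and 128

# A full d_embed x d_embed value map needs d_embed**2 parameters. Factoring it
# into "value down" (into a small space) and "value up" (back out) needs only
# 2 * d_embed * d_value parameters.
W_down = rng.normal(size=(d_value, d_embed))
W_up   = rng.normal(size=(d_embed, d_value))

e = rng.normal(size=d_embed)
v = W_up @ (W_down @ e)                  # same as applying the product W_up @ W_down
print(d_embed**2, 2 * d_embed * d_value) # 144 vs 96 parameters in this toy case
```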

Additional Notes

  • Between attention blocks, the data also flows through multi-layer perceptrons (simple feed-forward networks).
  • Stacking many attention layers lets the embeddings be refined over and over, absorbing more and more context.

Training Facts

  • Masking: Essential during training to prevent future tokens from leaking information.
  • Context bottleneck: the attention pattern holds one score per query-key pair, so its size grows with the square of the context length (see the illustration below).
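
The context lengths below are just examples to illustrate the quadratic growth:

```python
for n in (1_024, 2_048, 8_192):          # example context lengths
    print(f"{n:>6} tokens -> {n * n:>12,} attention scores per head")
```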

Total Parameter Summary

  • GPT-3: 96 layers; roughly 58 billion parameters are devoted to attention heads (see the arithmetic below).
  • Attention accounts for a large share of the model's parameters, but far from all of them (GPT-3 has about 175 billion in total).
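
A quick sanity check of the 58 billion figure, reusing the same GPT-3 dimensions as the per-head count above:

```python
per_head  = 4 * 12_288 * 128             # ~6.3M (query, key, value-down, value-up)
per_block = 96 * per_head                # 96 heads per block  -> ~604M
total     = 96 * per_block               # 96 layers           -> ~58 billion
print(f"{per_head:,}  {per_block:,}  {total:,}")
```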

Future Topics

  • Next chapter: More on multi-layer perceptrons and training process.
  • Benefits of parallelization and scale for deep learning.

Further Learning Resources

  • Andrej Karpathy and Chris Olah: highly recommended materials on AI.
  • Vivek's videos on history/motivation of transformers.
  • Brit Cruise's video on the history of large language models.

Conclusion

  • Importance of parallelizable architectures in scaling model performance.