Understanding Transformers and the Attention Mechanism
May 30, 2024
Lecture Notes on Transformers and the Attention Mechanism
Introduction
Topic: Understanding the transformer architecture and attention mechanism in AI models.
Reference: “Attention Is All You Need” (Vaswani et al., 2017).
Objective: To visualize and comprehend the internal workings of the attention mechanism in transformers.
Transformer Recap
Purpose: Predict the next word in a given piece of text.
Tokenization: Input text is broken into tokens (words or pieces of words).
Embedding: The first step is to associate each token with a high-dimensional vector (its embedding); see the sketch after this recap.
Embedding Space: Different directions can correspond to semantic meanings (e.g., gender, contextual meanings).
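A minimal sketch of these first two steps, tokenization and embedding lookup (the toy vocabulary, whitespace tokenizer, and 16-dimensional embeddings are made up for illustration; real models use learned subword tokenizers over vocabularies of tens of thousands of tokens, and GPT-3’s embeddings have 12,288 dimensions):

```python
import numpy as np

# Toy vocabulary and whitespace "tokenizer" (illustrative only).
vocab = {"a": 0, "fluffy": 1, "blue": 2, "creature": 3,
         "roamed": 4, "the": 5, "verdant": 6, "forest": 7}
d_embed = 16  # illustrative; GPT-3 uses 12,288

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_embed))  # one row per token

def embed(text):
    token_ids = [vocab[w] for w in text.split()]  # tokenization
    return embedding_matrix[token_ids]            # embedding lookup

E = embed("a fluffy blue creature roamed the verdant forest")
print(E.shape)  # (8, 16): one vector per token
```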
Attention Mechanism
Confusing Concept: Often perceived as complex; requires time to understand.
Behavioral Goal: Enable the model to distinguish various meanings of words based on context.
Examples of Contextual Meaning
Word “Mole”: Different meanings in “American shrew mole”, “one mole of carbon dioxide”, and “take a biopsy of the mole”, depending on context.
Word “Tower”: Generic embedding refined by context such as “Eiffel Tower” or “miniature tower”.
Attention Blocks
Contextual Meaning Refinement: Update embeddings based on surrounding contexts via attention blocks.
Final Vector: The prediction of the next token is made from the last vector in the sequence, which by then should encode all the relevant context.
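As a reminder of how that last vector is used, a minimal sketch (the unembedding matrix and final softmax come from the series’ earlier overview of transformers; all sizes and weights below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_embed = 8, 16
unembedding = rng.normal(size=(vocab_size, d_embed))  # one row per vocabulary token

final_vector = rng.normal(size=d_embed)  # last token's fully refined embedding
logits = unembedding @ final_vector      # one score per possible next token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # softmax: probability of each next token
print(probs.round(3))
```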
Single-Head Attention Example
Phrase: “a fluffy blue creature roamed the verdant forest” focusing on adjectives influencing nouns.
Embeddings and Queries: Initial embeddings encode both the word and its position; query vectors are computed from these embeddings.
Matrices Involved:
Query Matrix (Wq): Maps embeddings to query vectors.
Key Matrix (Wk): Produces key vectors that can potentially answer the queries.
Value Matrix (Wv): Produces value vectors that are added to refine the embeddings.
Dot Product: Measures alignment between queries and keys to form an attention pattern.
Softmax Normalization: Normalizes dot products to form a probability distribution.
Masking: Prevents later tokens from influencing earlier ones; their attention scores are set to negative infinity before the softmax, so training on next-token prediction does not leak future words.
Weighted Sum Update: Value vectors, weighted by the attention pattern, are summed and added to update each embedding (see the sketch after this list).
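Putting the pieces of this list together, a minimal single-head attention sketch in NumPy (the sizes are illustrative, the weight matrices would be learned in practice, and the convention here puts one query per row):

```python
import numpy as np

def single_head_attention(E, Wq, Wk, Wv):
    """E: (n_tokens, d_embed) embeddings; Wq, Wk: (d_embed, d_key); Wv: (d_embed, d_embed)."""
    Q = E @ Wq  # query vectors, one per token
    K = E @ Wk  # key vectors, one per token
    V = E @ Wv  # value vectors, one per token

    # Dot products measure query/key alignment (scaled by sqrt of key dimension).
    scores = Q @ K.T / np.sqrt(Wq.shape[1])

    # Masking: later tokens must not influence earlier ones, so set those
    # entries to negative infinity before the softmax.
    n = E.shape[0]
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf

    # Softmax each row into a probability distribution: the attention pattern.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of value vectors gives the update added to each embedding.
    return E + weights @ V

rng = np.random.default_rng(0)
n_tokens, d_embed, d_key = 8, 16, 4  # toy sizes
E = rng.normal(size=(n_tokens, d_embed))
out = single_head_attention(E,
                            rng.normal(size=(d_embed, d_key)),
                            rng.normal(size=(d_embed, d_key)),
                            rng.normal(size=(d_embed, d_embed)))
print(out.shape)  # (8, 16): same shape, contextually refined
```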
Multi-Head Attention
Multiple Heads: Multiple attention heads run in parallel, each with distinct key, query, and value maps.
Example: GPT-3 uses 96 attention heads in each block.
Compositional Updates: The updates proposed by all heads are summed and added to each embedding (a sketch follows this list).
Parameter Count: Each head carries its own key, query, and value matrices, so attention parameters add up quickly across heads and blocks.
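Building on the single-head sketch above, a multi-head version that sums the updates from several heads, each with its own matrices (head count and sizes are toy values; GPT-3 uses 96 heads per block):

```python
import numpy as np

def multi_head_attention(E, heads):
    """heads: list of (Wq, Wk, Wv) tuples, one per attention head."""
    update = np.zeros_like(E)
    n = E.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    for Wq, Wk, Wv in heads:
        Q, K, V = E @ Wq, E @ Wk, E @ Wv
        scores = Q @ K.T / np.sqrt(Wq.shape[1])
        scores[mask] = -np.inf
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        update += weights @ V  # each head proposes its own update
    return E + update          # summed updates applied to the embeddings

rng = np.random.default_rng(0)
n_tokens, d_embed, d_key = 8, 16, 4
E = rng.normal(size=(n_tokens, d_embed))
heads = [(rng.normal(size=(d_embed, d_key)),
          rng.normal(size=(d_embed, d_key)),
          rng.normal(size=(d_embed, d_embed))) for _ in range(4)]  # 4 toy heads
print(multi_head_attention(E, heads).shape)  # (8, 16)
```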
Efficiency and Scale
Parallelizability: Attention mechanism's high parallelizability suits GPU execution, enhancing model scalability.
Contribution to Model Size: Attention accounts for a substantial portion of GPT-3’s parameters, roughly 58 billion of 175 billion (about one-third); a rough tally follows.
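A back-of-the-envelope tally behind that figure, using GPT-3’s published sizes (embedding dimension 12,288; key/query dimension 128; 96 heads per block; 96 blocks) and following the source’s framing of the value map as a factored “value down” / “value up” pair:

```python
d_embed = 12_288       # embedding dimension
d_key = 128            # key/query (and value-down) dimension
heads_per_block = 96
n_blocks = 96

# Per head: query map + key map + value-down map + value-up map
per_head = 2 * (d_embed * d_key) + (d_embed * d_key) + (d_key * d_embed)
total_attention = per_head * heads_per_block * n_blocks

print(f"{total_attention:,} parameters")                    # ~58 billion
print(f"{total_attention / 175_000_000_000:.0%} of GPT-3")  # roughly one-third
```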
Further Learning
Recommended Resources: Andrej Karpathy, Chris Olah, Vivek, and Brit Cruse’s content on AI and transformers.
Conclusion
Next Steps: Future chapters will cover multi-layer perceptrons and the training process of transformers.