Understanding Transformers and the Attention Mechanism

May 30, 2024

Introduction

  • Topic: Understanding the transformer architecture and attention mechanism in AI models.
  • Reference: “Attention Is All You Need” (Vaswani et al., 2017).
  • Objective: To visualize and comprehend the internal workings of the attention mechanism in transformers.

Transformer Recap

  • Purpose: Predict the next word in a given piece of text.
  • Tokenization: Input text is broken into tokens (words or pieces of words).
  • Embedding: First step is to associate each token with a high-dimensional vector (embedding).
  • Embedding Space: Different directions can correspond to semantic meanings (e.g., gender, contextual meanings).
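The tokenization and embedding steps above can be sketched in a few lines. This is a toy illustration with made-up vocabulary and random vectors, not any real model's tokenizer or embedding table:

```python
import numpy as np

# Toy sketch: map tokens to ids, then look up each id's embedding vector.
# Vocabulary and dimensions are illustrative; real models use vocabularies
# of tens of thousands of tokens and thousands of dimensions.
vocab = {"a": 0, "fluffy": 1, "blue": 2, "creature": 3}
d_model = 8

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["a", "fluffy", "blue", "creature"]
ids = [vocab[t] for t in tokens]
embeddings = embedding_table[ids]   # one d_model-dim vector per token
print(embeddings.shape)             # (4, 8)
```

In a real transformer, positional information is also folded into these initial vectors before any attention block sees them.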

Attention Mechanism

  • Confusing Concept: Often perceived as complex; requires time to understand.
  • Behavioral Goal: Enable the model to distinguish various meanings of words based on context.

Examples of Contextual Meaning

  • Word “Mole”: Context determines which meaning applies: the animal in “American true mole”, the chemistry unit in “one mole of carbon dioxide”, and the skin blemish in “take a biopsy of the mole”.
  • Word “Tower”: Generic embedding refined by context such as “Eiffel Tower” or “miniature tower”.

Attention Blocks

  • Contextual Meaning Refinement: Update embeddings based on surrounding contexts via attention blocks.
  • Final Vector: Prediction of the next token depends on the final refined vector, embedding all relevant context.

Single-Head Attention Example

  • Phrase: “a fluffy blue creature roamed the verdant forest” focusing on adjectives influencing nouns.
  • Embeddings and Queries: Initial embeddings encode word and position. Queries are vectors derived from embeddings.
  • Matrices Involved:
    • Query Matrix (Wq): Maps embeddings to query vectors.
    • Key Matrix (Wk): Produces key vectors that can match (answer) queries.
    • Value Matrix (Wv): Produces value vectors that are added to refine embeddings.
  • Dot Product: Measures alignment between queries and keys (scaled by the square root of the key dimension) to form an attention pattern.
  • Softmax Normalization: Normalizes dot products to form a probability distribution.
  • Masking: Prevents later tokens from influencing earlier ones; the corresponding scores are set to negative infinity before the softmax, so they become zero after normalization and the model cannot look ahead during training.
  • Weighted Sum Update: Value vectors weighted by attention pattern update embeddings.
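The steps above can be sketched end to end as a minimal single-head causal attention pass. All sizes and weights here are toy values, not from any real model:

```python
import numpy as np

# Minimal single-head causal attention, following the steps above.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 4

E = rng.normal(size=(n_tokens, d_model))    # initial token embeddings
W_q = rng.normal(size=(d_model, d_head))    # query matrix
W_k = rng.normal(size=(d_model, d_head))    # key matrix
W_v = rng.normal(size=(d_model, d_model))   # value matrix

Q, K, V = E @ W_q, E @ W_k, E @ W_v

# Dot products measure query/key alignment, scaled by sqrt(d_head).
scores = Q @ K.T / np.sqrt(d_head)

# Causal mask: entries above the diagonal become -inf, so each token
# attends only to itself and earlier tokens.
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax turns each row into a probability distribution
# (the -inf entries become exactly zero weight).
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Weighted sum of value vectors produces the update to each embedding.
updated = E + weights @ V
print(updated.shape)   # (4, 8)
```

Note that the first token's row of `weights` is just `[1, 0, 0, 0]`: with everything after it masked out, it can only attend to itself.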

Multi-Head Attention

  • Multiple Heads: Multiple attention heads run in parallel, each with distinct key, query, and value maps.
  • Example: GPT-3 uses 96 attention heads in each block.
  • Compositional Updates: Sum updates from all heads and apply to embeddings.
  • Parameter Count: Total parameters for attention heads contribute significantly to model size.
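The compositional update described above can be sketched as follows. Head count and dimensions are toy values (GPT-3 uses 96 heads per block), and for brevity this version omits the causal mask shown earlier:

```python
import numpy as np

# Sketch of multi-head attention: several heads, each with its own
# query/key/value maps, run in parallel and their updates are summed.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head, n_heads = 4, 8, 2, 4

E = rng.normal(size=(n_tokens, d_model))

def head_update(E):
    # Each head draws its own distinct query, key, and value maps.
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_model))
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ V

# Sum the proposed updates from every head, then apply them to the embeddings.
E_out = E + sum(head_update(E) for _ in range(n_heads))
print(E_out.shape)   # (4, 8)
```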

Efficiency and Scale

  • Parallelizability: Attention mechanism's high parallelizability suits GPU execution, enhancing model scalability.
  • Contribution to Model Size: Attention contributes a substantial portion of GPT-3’s parameters (around one-third of 175 billion).
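The rough one-third figure can be checked with back-of-the-envelope arithmetic using GPT-3's published sizes (embedding dimension 12,288; 96 heads per block; 96 blocks). The per-head grouping below, with the value map factored into down- and up-projections of size 128, is one common way to count:

```python
# Back-of-the-envelope count of GPT-3's attention parameters.
d_model, d_head = 12_288, 128
heads_per_block, n_blocks = 96, 96

# Per head: query and key maps (d_model x d_head each), plus a value map
# factored into a down-projection and an up-projection (d_model x d_head each).
per_head = 4 * d_model * d_head

attention_params = per_head * heads_per_block * n_blocks
print(f"{attention_params:,}")       # 57,982,058,496
print(attention_params / 175e9)      # roughly one third of 175 billion
```

About 58 billion of GPT-3's 175 billion parameters, consistent with the "around one-third" estimate above.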

Further Learning

  • Recommended Resources: Content on AI and transformers from Andrej Karpathy, Chris Olah, Vivek, and Brit Cruise.

Conclusion

  • Next Steps: Future chapters will cover multi-layer perceptrons, the remaining pieces of the transformer, and the training process.