Transformers and Attention Mechanisms Overview

Aug 1, 2024

Lecture Notes: Understanding Transformers and Attention Mechanisms

Introduction to Transformers

  • Key technology in large language models (LLMs)
  • Originated from the 2017 paper "Attention is All You Need"
  • Focus on the attention mechanism and how it processes data

Goal of the Model

  • Objective: Predict the next word in a piece of text.
  • Tokens: Input text is divided into tokens (simplified as words for this explanation).
  • Embeddings: Initial step involves associating each token with a high-dimensional vector (embedding).
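
A minimal sketch of the tokenize-then-embed step, assuming a toy word-level vocabulary; the words, sizes, and random vectors here are made up purely for illustration:

```python
import numpy as np

# Toy word-level "tokenizer", matching the simplification in these notes
# (real models use subword tokenizers and far larger vocabularies).
vocab = {"the": 0, "eiffel": 1, "tower": 2, "is": 3, "tall": 4}

text = "the eiffel tower is tall"
token_ids = [vocab[word] for word in text.split()]      # [0, 1, 2, 3, 4]

# Embedding: a learned lookup table with one row per token in the vocabulary.
d_embed = 8                                             # tiny; GPT-3 uses 12,288
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_embed))
embeddings = embedding_table[token_ids]                 # one vector per token, shape (5, 8)
```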

High-Dimensional Embeddings

  • Directions in high-dimensional space correspond to semantic meanings.
  • Example: the direction from a masculine noun's embedding to its feminine counterpart is roughly the same across word pairs (illustrated in the sketch below).
  • Aim: Adjust embeddings to encode richer contextual meanings beyond individual words.
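
A hand-made illustration of "directions carry meaning", using made-up 3-D vectors chosen so the gender direction is explicit; real models learn such directions in thousands of dimensions:

```python
import numpy as np

# Made-up 3-D vectors: axis 0 loosely encodes gender, axis 1 royalty.
man   = np.array([ 1.0, 0.0, 0.3])
woman = np.array([-1.0, 0.0, 0.3])
king  = np.array([ 1.0, 1.0, 0.3])
queen = np.array([-1.0, 1.0, 0.3])

gender_direction = woman - man            # masculine -> feminine
shifted = king + gender_direction         # move "king" along that direction

cos = shifted @ queen / (np.linalg.norm(shifted) * np.linalg.norm(queen))
print(round(cos, 3))                      # ~1.0: the shifted vector lands on "queen"
```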

Attention Mechanism Overview

  • Many find the attention mechanism confusing; patience is key.
  • Example Phrases:
    • "American true mole"
    • "One mole of carbon dioxide"
    • "Take a biopsy of the mole"
  • After the initial embedding step, context plays no role: the lookup is the same regardless of surrounding words, so "mole" gets the same vector in all three phrases (see the sketch below).
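
A short sketch of this point: the initial embedding is a pure table lookup, so the same token always maps to the same vector (toy vocabulary and random vectors for illustration):

```python
import numpy as np

vocab = {"one": 0, "mole": 1, "of": 2, "take": 3, "a": 4, "biopsy": 5, "the": 6}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

def initial_embed(words):
    # Context-free lookup: each word maps to its row in the table.
    return embedding_table[[vocab[w] for w in words]]

chem = initial_embed("one mole of".split())
med  = initial_embed("take a biopsy of the mole".split())

# The row for "mole" is identical in both phrases: context plays no role yet.
print(np.array_equal(chem[1], med[5]))    # True
```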

Contextual Influence of Words

  • Attention allows for moving information between embeddings based on context.
  • Example with "tower":
    • Preceded by "Eiffel": Update to represent "Eiffel Tower".
    • Preceded by "miniature": Update to reflect smaller scale.

Attention Block and Computation

  • After the text flows through many attention blocks, the final vector in the sequence must encode context from the whole passage, since it is the one used to predict the next word.
  • Simplified example: "A fluffy blue creature roamed the verdant forest."
  • Focus is on adjectives adjusting corresponding nouns using attention.

Attention Head Mechanics

  1. Initial Embedding: Encodes meaning but lacks context.
  2. Query Vector (q): A smaller vector computed from each embedding; conceptually, a noun asks "are there any adjectives in front of me?"
    • Generated by multiplying the embedding by a query matrix (W_Q).
  3. Key Vector (k): Computed the same way with a key matrix (W_K); conceptually, keys answer the queries (an adjective signals "I describe what follows").
  4. Dot Product: Measure alignment between keys and queries to determine relevance.
    • Larger values indicate higher relevance.
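
A minimal NumPy sketch of steps 2–4, with tiny made-up dimensions (GPT-3 uses an embedding size of 12,288 and a key/query size of 128); the scores are also divided by the square root of the key dimension, a standard stabilising detail:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_embed, d_key = 4, 8, 4           # e.g. "a fluffy blue creature"

E   = rng.normal(size=(seq_len, d_embed))   # context-free embeddings, one row per token
W_Q = rng.normal(size=(d_embed, d_key))     # query matrix
W_K = rng.normal(size=(d_embed, d_key))     # key matrix

Q = E @ W_Q      # each row: "are there adjectives in front of me?" (for a noun)
K = E @ W_K      # each row: "I am an adjective describing what follows"

# Dot products measure how well each key aligns with each query;
# larger values mean that token is more relevant to the query's position.
scores = Q @ K.T / np.sqrt(d_key)
print(scores.shape)                         # (4, 4): one score per query/key pair
```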

Attention Pattern and Softmax

  • Attention Pattern: Grid of dot product scores representing relevance.
  • Softmax: Used to normalize scores into a probability distribution.
  • Masking: Prevents later tokens from influencing earlier ones by setting those entries to negative infinity before the softmax, so they become zero afterward.
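
A sketch of turning a grid of scores into an attention pattern, assuming rows are queries and columns are keys (random scores stand in for the real ones):

```python
import numpy as np

def attention_pattern(scores):
    # scores[i, j]: how relevant token j (key) is to token i (query).
    # Causal mask: token i must not be influenced by later tokens j > i, so
    # those entries are set to -inf and become exactly 0 after the softmax.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))   # stable softmax
    return weights / weights.sum(axis=1, keepdims=True)            # each row sums to 1

rng = np.random.default_rng(0)
pattern = attention_pattern(rng.normal(size=(4, 4)))
print(pattern.round(2))     # lower-triangular grid of weights, rows summing to 1
```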

Value Vectors and Updating Embeddings

  • A value matrix (W_V) maps each embedding to a value vector: the change that token would contribute to another embedding if it turns out to be relevant.
  • Update Mechanism: Weighted sum of value vectors modifies the original embedding to reflect contextual meaning.
  • Final Output: Sequence of refined embeddings after processing through attention.
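
Putting the pieces together for a single attention head, again with toy sizes and random matrices; the value matrix is left unfactored here for simplicity (the factored form appears under parameter counts below):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_embed, d_key = 4, 8, 4

E   = rng.normal(size=(seq_len, d_embed))    # context-free embeddings
W_Q = rng.normal(size=(d_embed, d_key))
W_K = rng.normal(size=(d_embed, d_key))
W_V = rng.normal(size=(d_embed, d_embed))    # value matrix (unfactored for simplicity)

def attention_pattern(scores):
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

pattern = attention_pattern((E @ W_Q) @ (E @ W_K).T / np.sqrt(d_key))
V       = E @ W_V            # value vector per token: what it would add if relevant
deltas  = pattern @ V        # weighted sum of value vectors for each position
E_out   = E + deltas         # refined embeddings now carry contextual meaning
```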

Multi-Headed Attention

  • Multi-headed attention involves running several attention heads in parallel (e.g., GPT-3 has 96 heads).
  • Each head has its own key, query, and value matrices, producing its own attention pattern.
  • Each head proposes a change to every embedding; these proposed changes are summed and added to the original embeddings.
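
A sketch of running several heads in parallel and summing their proposed changes, with toy sizes and the value map factored into "down" and "up" matrices as described under parameter counts:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_embed, d_key, n_heads = 4, 8, 2, 3      # GPT-3: 12,288 / 128 / 96 heads

E = rng.normal(size=(seq_len, d_embed))

def attention_pattern(scores):
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

def head_output(E):
    # Each head has its own key, query, and (factored) value matrices,
    # so each produces a distinct attention pattern and proposed change.
    W_Q  = rng.normal(size=(d_embed, d_key))
    W_K  = rng.normal(size=(d_embed, d_key))
    W_Vd = rng.normal(size=(d_embed, d_key))       # value-down
    W_Vu = rng.normal(size=(d_key, d_embed))       # value-up
    pattern = attention_pattern((E @ W_Q) @ (E @ W_K).T / np.sqrt(d_key))
    return pattern @ (E @ W_Vd) @ W_Vu

E_out = E + sum(head_output(E) for _ in range(n_heads))   # sum all heads' changes
```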

Parameter Count Overview

  • Each attention head consists of multiple matrices contributing to overall parameters:
    • Key Matrix: 1.5 million parameters.
    • Query Matrix: 1.5 million parameters.
    • Value Matrix: Factored into a "value down" and a "value up" matrix, about 3.1 million parameters (the same as key and query combined).
  • Approx. 6.3 million parameters per attention head; roughly 600 million across the 96 heads of one multi-headed block.
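
The arithmetic behind those figures, using the GPT-3 sizes quoted above (embedding dimension 12,288, key/query dimension 128, 96 heads):

```python
d_embed, d_key, n_heads = 12_288, 128, 96

key_params   = d_embed * d_key                       # 1,572,864  (~1.5 M)
query_params = d_embed * d_key                       # 1,572,864  (~1.5 M)
value_params = d_embed * d_key + d_key * d_embed     # value-down + value-up (~3.1 M)

per_head  = key_params + query_params + value_params # 6,291,456   (~6.3 M)
per_block = per_head * n_heads                       # 603,979,776 (~600 M)
print(f"{per_head:,} per head, {per_block:,} per multi-headed block")
```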

Self-Attention vs. Cross-Attention

  • Self-Attention: Processes single data type (e.g., text).
  • Cross-Attention: Attends between two distinct sequences or data types (e.g., a sentence and its translation).
  • No masking is typically applied in cross-attention, since there is no issue of later tokens influencing earlier ones.
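
A minimal cross-attention sketch: queries come from one sequence, keys and values from another, and no causal mask is applied (random toy vectors throughout):

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_key = 8, 4
source = rng.normal(size=(5, d_embed))   # e.g. embeddings of a French sentence
target = rng.normal(size=(3, d_embed))   # e.g. embeddings of the English output so far

W_Q = rng.normal(size=(d_embed, d_key))
W_K = rng.normal(size=(d_embed, d_key))
W_V = rng.normal(size=(d_embed, d_embed))

# Queries from the target, keys/values from the source; no mask is needed,
# because there is no notion of "later" source tokens to hide.
scores  = (target @ W_Q) @ (source @ W_K).T / np.sqrt(d_key)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
target_out = target + weights @ (source @ W_V)
```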

Final Insights on Transformers

  • Transformers consist of many layers, enabling deeper context understanding.
  • Scaling context size poses challenges; various methods are explored to enhance scalability.
  • The attention mechanism's parallelizability is crucial for efficiency, especially with GPUs.

Learning Resources

  • For further exploration, check out works by Andrej Karpathy, Chris Olah, or other recommended videos on the history and development of LLMs.