Lecture Notes on Transformers and Attention Mechanism

Jul 3, 2024

Introduction

  • Overview of transformers, a key technology behind modern AI and large language models.
  • Origin: 2017 paper "Attention is All You Need".
  • Aim: understand the attention mechanism and how it processes data.

Key Context

  • Model goal: Take in text, predict the next word.
  • Text is broken into tokens, assumed to be words for simplicity.
  • Each token is associated with a high-dimensional vector called its embedding (see the sketch after this list).
  • Directions in embedding space correspond to semantic meanings (e.g., gender).
  • Transformer adjusts embeddings for richer contextual meaning.
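
As a rough illustration of two of these ideas (looking up a token's embedding, and semantic directions in embedding space), here is a minimal NumPy sketch; the vocabulary, dimensions, and "gender direction" are made up for the example, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (illustrative sizes, not GPT-3's).
vocab = ["the", "fluffy", "blue", "creature", "tower", "mole"]
d_embed = 8                                  # real models use thousands of dimensions
E = rng.normal(size=(len(vocab), d_embed))   # one row per token in the vocabulary

# Embedding a token means looking up its row in the table.
e_creature = E[vocab.index("creature")]

# A "semantic direction" is just a vector in this space; projecting an
# embedding onto it measures how strongly that meaning is present.
gender_direction = rng.normal(size=d_embed)  # stand-in for a learned direction
print(e_creature.shape)                      # (8,)
print(float(e_creature @ gender_direction))  # projection onto the direction
```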

Importance of Attention Mechanism

  • Complex and often confusing, but crucial for context-driven meaning.

Example Phrases

  • Multiple meanings of 'mole': the animal ("American true mole"), the chemistry unit ("one mole of carbon dioxide"), and the skin mark ("take a biopsy of the mole").
  • Contextual refinement: Context helps refine generic vectors to specific meanings.

Embedding Process

  • Initial step: Break text and associate each token with a vector (embedding).
  • Example: "Eiffel tower" vs. "miniature tower" — embedding updated by context.

Computational Details and Examples

  • Adjectives and nouns: imagine adjective tokens adjusting the embeddings of the nouns they modify (e.g., "fluffy" and "blue" adjusting "creature").
  • Compute query, key, and value vectors for each token via matrix multiplications.

Queries and Keys

  • Compute query vector q using query matrix Wq and embedding e.
  • Compute key vector k using key matrix Wk and embedding e.
  • The dot product of each query with each key measures how well the two align, i.e. how much one token should attend to another (see the sketch below).
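
A minimal NumPy sketch of this step, using toy dimensions rather than GPT-3's:

```python
import numpy as np

rng = np.random.default_rng(1)
d_embed, d_key, n_tokens = 8, 4, 5           # toy sizes; GPT-3 uses 12,288 and 128

E  = rng.normal(size=(n_tokens, d_embed))    # one embedding per token in the context
Wq = rng.normal(size=(d_key, d_embed))       # query matrix
Wk = rng.normal(size=(d_key, d_embed))       # key matrix

Q = E @ Wq.T                                 # q_i = Wq @ e_i, one query per token
K = E @ Wk.T                                 # k_i = Wk @ e_i, one key per token

# q_i . k_j measures how well key j aligns with query i, i.e. how relevant
# token j is for updating token i. (Real implementations also divide the
# scores by sqrt(d_key) before the softmax.)
scores = Q @ K.T                             # shape (n_tokens, n_tokens)
print(scores.shape)
```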

Attention Pattern

  • Normalize scores via softmax to get weights between 0 and 1.
  • Create attention pattern grid representing relevant relationships.
  • Masking: scores where a later token would influence an earlier one are set to negative infinity before the softmax, so they become zero weights (a sketch follows this list).
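
A minimal sketch of softmax normalization with causal masking, assuming the common rows-as-queries convention (the exact layout of the grid is a convention choice, not fixed by the mechanism):

```python
import numpy as np

def causal_softmax(scores):
    """Mask out future tokens, then normalize each row to sum to 1."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # key index > query index
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(masked)                             # exp(-inf) -> 0
    return weights / weights.sum(axis=1, keepdims=True)

scores = np.random.default_rng(2).normal(size=(5, 5))
attn = causal_softmax(scores)
print(np.round(attn, 2))   # lower-triangular grid; each row of weights sums to 1
```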

Value Vectors and Updates

  • Use value matrix to produce value vectors v for embeddings.
  • Weight the value vectors by the attention pattern and sum them to get the update added to each embedding.
  • Example: the value vector of "fluffy" is added (weighted by attention) to the embedding of "creature", so the updated embedding reflects that attribute (see the sketch below).
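
A minimal sketch of the update step, with toy sizes and a single unfactored value matrix for clarity; the uniform attention weights are placeholders standing in for the pattern computed above:

```python
import numpy as np

rng = np.random.default_rng(3)
d_embed, n_tokens = 8, 5

E    = rng.normal(size=(n_tokens, d_embed))            # current embeddings
Wv   = rng.normal(size=(d_embed, d_embed)) * 0.1       # value matrix (unfactored here)
attn = np.full((n_tokens, n_tokens), 1.0 / n_tokens)   # stand-in attention weights

V = E @ Wv.T                                           # v_j = Wv @ e_j for each token

# Each embedding receives an attention-weighted sum of value vectors:
# delta_e_i = sum_j attn[i, j] * v_j, then e_i <- e_i + delta_e_i.
# This is how the value vector of "fluffy" can nudge the embedding of "creature".
E_updated = E + attn @ V
print(E_updated.shape)                                 # context-adjusted embeddings
```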

Summary of a Single Head of Attention

  • Single attention head involves three matrices (query, key, value).
  • Whole process parameterized by matrices filled with tunable weights.
  • Parameter count (GPT-3 sizes): query and key matrices are roughly 1.5M parameters each; the value map (in its factored form, described below) adds roughly 3.1M; about 6.3M parameters per attention head in total (see the arithmetic below).
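
A quick check of those figures, using GPT-3's dimensions (12,288-dimensional embeddings, 128-dimensional key/query space) and counting the value map in the factored form described under Implementation Details below:

```python
d_embed, d_key = 12_288, 128         # GPT-3's embedding and key/query dimensions

query = d_embed * d_key              # 1,572,864  (~1.57M)
key   = d_embed * d_key              # 1,572,864  (~1.57M)
value = 2 * d_embed * d_key          # value-down + value-up: 3,145,728 (~3.1M)

print(f"{query + key + value:,}")    # 6,291,456, i.e. ~6.3M parameters per head
```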

Multi-Headed Attention

  • Multiple heads in parallel, each with unique key, query, value maps (e.g., GPT-3 has 96 heads per block).
  • Outputs from all heads are summed together and added to the original embeddings (see the sketch after this list).
  • Parameter count: ~600M per block for 96 heads.
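
A minimal sketch of the summing step, with toy sizes and placeholder heads (each head here is just a stand-in that returns some proposed update):

```python
import numpy as np

rng = np.random.default_rng(4)
d_embed, n_tokens, n_heads = 8, 5, 3          # toy sizes; GPT-3 uses 96 heads per block

E = rng.normal(size=(n_tokens, d_embed))

def head_update(E, rng):
    """Stand-in for one attention head: returns its proposed update to the embeddings."""
    Wv = rng.normal(size=(d_embed, d_embed)) * 0.1
    attn = np.full((len(E), len(E)), 1.0 / len(E))   # placeholder attention weights
    return attn @ (E @ Wv.T)

# Each head proposes its own update; the updates are summed and added back
# to the original embeddings.
E_out = E + sum(head_update(E, rng) for _ in range(n_heads))
print(E_out.shape)
```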

Implementation Details

  • Value map factored into value down and value up matrices for efficiency.
  • Standard practice deviates slightly from the single-value-matrix explanation above; the factored form gives the same kind of map with far fewer parameters (a sketch follows this list).
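
A minimal sketch of the factorization, with toy dimensions; the point is only the parameter count and the fact that the two small matrices compose into one low-rank map:

```python
import numpy as np

rng = np.random.default_rng(5)
d_embed, d_value = 12, 4                 # toy sizes; GPT-3 uses 12,288 and 128

# A full d_embed x d_embed value map needs d_embed**2 parameters. Factoring it
# into "value down" (into a small space) and "value up" (back out) needs only
# 2 * d_embed * d_value parameters.
W_down = rng.normal(size=(d_value, d_embed))
W_up   = rng.normal(size=(d_embed, d_value))

e = rng.normal(size=d_embed)
v = W_up @ (W_down @ e)                  # same as applying the product W_up @ W_down
print(d_embed**2, 2 * d_embed * d_value) # 144 vs 96 parameters in this toy case
```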

Additional Notes

  • Between attention blocks, the data also flows through multi-layer perceptrons (simple feed-forward networks).
  • Stacking many attention layers lets the embeddings be refined over and over, absorbing more and more context.

Training Facts

  • Masking: Essential during training to prevent future tokens from leaking information.
  • Context bottleneck: the attention pattern holds one score per query-key pair, so its size grows with the square of the context length (see the illustration below).
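
The context lengths below are just examples to illustrate the quadratic growth:

```python
for n in (1_024, 2_048, 8_192):          # example context lengths
    print(f"{n:>6} tokens -> {n * n:>12,} attention scores per head")
```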

Total Parameter Summary

  • GPT-3: 96 layers; roughly 58 billion parameters are devoted to attention heads (see the arithmetic below).
  • Attention accounts for a large share of the model's parameters, but far from all of them (GPT-3 has about 175 billion in total).
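
A quick sanity check of the 58 billion figure, reusing the same GPT-3 dimensions as the per-head count above:

```python
per_head  = 4 * 12_288 * 128             # ~6.3M (query, key, value-down, value-up)
per_block = 96 * per_head                # 96 heads per block  -> ~604M
total     = 96 * per_block               # 96 layers           -> ~58 billion
print(f"{per_head:,}  {per_block:,}  {total:,}")
```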

Future Topics

  • Next chapter: More on multi-layer perceptrons and training process.
  • Benefits of parallelization and scale for deep learning.

Further Learning Resources

  • Andrej Karpathy and Chris Olah: highly recommended materials on AI.
  • Vivek's videos on history/motivation of transformers.
  • Brit Cruise's video on the history of large language models.

Conclusion

  • Importance of parallelizable architectures in scaling model performance.