Understanding Transformers and Attention Mechanisms

Aug 25, 2024

Lecture Notes on the Transformer and Attention Mechanism

Introduction

  • Transformers are a key technology behind large language models and modern AI tools.
  • Originated in the 2017 paper "Attention Is All You Need."
  • Goal: Predict the next word in a given text.

Tokenization and Embedding

  • Text is divided into tokens (simplified as words for this discussion).
  • Each token is associated with a high-dimensional vector called an embedding.
  • Directions in embedding space can correspond to semantic meanings (e.g., a gender direction); see the sketch after this list.
  • Aim: Adjust embeddings to encode richer contextual meaning, not just individual words.
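
A minimal numpy sketch of the lookup step, using a made-up toy vocabulary and a randomly initialized embedding table (a real model learns the table, and GPT-3's embedding dimension is 12,288 rather than the tiny size used here):

```python
import numpy as np

# Toy vocabulary and a randomly initialized embedding table (one row per token).
vocab = {"fluffy": 0, "blue": 1, "creature": 2, "roamed": 3,
         "the": 4, "verdant": 5, "forest": 6}
d_model = 8                                   # tiny dimension for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))

tokens = ["fluffy", "blue", "creature"]
embeddings = E[[vocab[t] for t in tokens]]    # look up one vector per token
print(embeddings.shape)                       # (3, 8)

# In trained embeddings, directions carry meaning: differences such as
# E[woman] - E[man] tend to point along a "gender" direction, so that
# E[king] + (E[woman] - E[man]) lands near E[queen]. The random table above
# will not show this; it only illustrates the lookup.
```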

Understanding Attention Mechanism

  • Attention enables the model to account for context when interpreting words.
  • Example phrases:
    • "American True Mole"
    • "One Mole of Carbon Dioxide"
    • "Take a Biopsy of the Mole"
  • The initial embedding of "mole" is identical in all three phrases; context is only incorporated in the next step, attention.
  • Well-trained attention can adjust embeddings based on surrounding context.

Attention Process Explained

Contextual Updates

  • Example: The word "tower" can imply different meanings based on context.
  • Attention blocks adjust embeddings based on the relevance of surrounding words.

Example Phrase

  • Phrase: "a fluffy blue creature roamed the verdant forest."
  • Goal: Have adjectives adjust the meanings of their corresponding nouns.
  • Initial embeddings encode a token's identity and position but not yet its context; see the sketch below.
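
A small sketch of what "identity and position" can look like concretely, assuming (as in GPT-style models) that the initial vector is a learned token embedding plus a learned positional embedding; all sizes and values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, context_len, vocab_size = 8, 16, 50

token_emb = rng.normal(size=(vocab_size, d_model))    # normally learned
pos_emb = rng.normal(size=(context_len, d_model))     # normally learned

token_ids = np.array([7, 3, 3, 12])                   # note the repeated token 3
x = token_emb[token_ids] + pos_emb[: len(token_ids)]

# The two occurrences of token 3 now get different vectors because their
# positions differ, but neither vector yet reflects the surrounding words;
# that refinement is the job of attention.
```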

Attention Heads

  • A single head of attention updates embeddings through matrix-vector products.
  • Each embedding is multiplied by a query matrix to produce a query vector in a smaller space; conceptually, a noun's query asks whether adjectives precede it.
  • Each embedding is also mapped to a key vector; keys "answer" queries, and a key that aligns closely with a query marks that word as relevant (see the sketch after this list).
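
A hypothetical single-head sketch of the query/key step, with randomly initialized matrices standing in for the learned W_Q and W_K (GPT-3 uses an embedding dimension of 12,288 and a query/key dimension of 128; tiny sizes are used here):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head, n_tokens = 8, 4, 5
X = rng.normal(size=(n_tokens, d_model))     # one row per token embedding

W_Q = rng.normal(size=(d_model, d_head))     # learned in a real model
W_K = rng.normal(size=(d_model, d_head))

Q = X @ W_Q    # queries: e.g. a noun "asking" whether adjectives precede it
K = X @ W_K    # keys: e.g. an adjective "answering" that it is relevant
print(Q.shape, K.shape)                      # (5, 4) (5, 4)
```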

Attention Scores and Weights

  • Dot products between keys and queries measure relevance.
  • The scores are normalized with a softmax to form the attention pattern, a grid of weights that sum to 1.
  • Masking sets the scores for later positions to negative infinity before the softmax, so later words cannot influence earlier ones; a sketch follows this list.
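
A sketch of the scoring step under the common row-per-query convention: dot products between queries and keys, a causal mask that blocks attention to later positions, and a softmax over each row so the weights sum to 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d_head = 5, 4
Q = rng.normal(size=(n_tokens, d_head))
K = rng.normal(size=(n_tokens, d_head))

scores = Q @ K.T / np.sqrt(d_head)           # scaled dot-product scores
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)     # later positions set to -infinity

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
print(weights.round(2))   # upper triangle is 0: no token "looks ahead"
```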

Updating Embeddings

  • A value matrix maps each embedding to a value vector; the attention pattern weights these value vectors for each position.
  • The resulting weighted sums are added to the original embeddings to refine their meanings.
  • These refined embeddings are the output of the attention head; a sketch of the full single-head computation follows.
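
Putting the pieces together, a sketch of one full attention head. For simplicity the value map below is a single d_model × d_model matrix; the lecture's parameter count assumes it is factored into a "value down" and "value up" pair, which does not change the shape of the output:

```python
import numpy as np

def single_head_attention(X, W_Q, W_K, W_V):
    """Return refined embeddings: the originals plus attention-weighted values."""
    d_head = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)          # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # attention pattern
    delta = weights @ V          # each token's update: weighted sum of values
    return X + delta             # add the adjustment to the original embedding

rng = np.random.default_rng(4)
d_model, n_tokens = 8, 5
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, 4))
W_K = rng.normal(size=(d_model, 4))
W_V = rng.normal(size=(d_model, d_model))
print(single_head_attention(X, W_Q, W_K, W_V).shape)   # (5, 8)
```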

Multi-Headed Attention

  • A full attention block runs many heads in parallel (GPT-3 uses 96 heads per block), each capturing a different kind of contextual relationship.
  • Each head has its own query, key, and value matrices, so each proposes its own update to the embeddings.
  • The final output adds the sum of all heads' adjustments to the original embeddings, as sketched below.
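
A sketch of how multiple heads combine: each head has its own matrices and proposes its own adjustment, and the adjustments are summed into the embeddings (GPT-3 uses 96 such heads per block; three tiny ones are used here):

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, d_head, n_tokens, n_heads = 8, 4, 5, 3

def head_delta(X, W_Q, W_K, W_V):
    """One head's proposed adjustment (before it is added to the embeddings)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = rng.normal(size=(n_tokens, d_model))
out = X.copy()
for _ in range(n_heads):                       # each head: its own parameters
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_model))
    out += head_delta(X, W_Q, W_K, W_V)        # sum of all heads' adjustments
print(out.shape)                               # (5, 8)
```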

Parameter Count

  • Each attention head has around 6.3 million parameters.
  • One multi-headed attention block (96 heads) therefore has roughly 600 million parameters.
  • Across GPT-3's 96 layers, attention accounts for roughly 58 billion of the model's ~175 billion total parameters; the arithmetic is sketched below.
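
The parameter arithmetic behind those figures, using GPT-3's sizes (embedding dimension 12,288, key/query dimension 128, 96 heads per block, 96 layers) and counting the value map as two 12,288 × 128 factors, as the ~6.3 million per-head figure implies:

```python
d_model, d_head, heads, layers = 12_288, 128, 96, 96

per_head = 2 * d_model * d_head     # query matrix + key matrix
per_head += 2 * d_model * d_head    # "value down" + "value up" factors
print(f"{per_head:,}")              # 6,291,456       -> about 6.3 million per head

per_block = per_head * heads
print(f"{per_block:,}")             # 603,979,776     -> about 600 million per block

total = per_block * layers
print(f"{total:,}")                 # 57,982,058,496  -> about 58 billion overall
```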

Contextual Learning

  • Data flows through multiple attention blocks and layers, enhancing the embedding refinement.
  • The model aims to capture higher-level ideas, sentiments, and contextual nuances.

Conclusion

  • Attention is a crucial aspect of the transformer architecture, leveraging parallel computation for efficiency.
  • The majority of the model's parameters actually live in the multilayer perceptron blocks that sit between the attention layers.

Further Learning

  • Recommended resources:
    • Andrej Karpathy and Chris Olah for in-depth technical material.
    • Videos by Vivek and Brit Cruise on the history of language models.