Understanding Transformers and Attention Mechanisms

Aug 25, 2024

Lecture Notes on the Transformer and Attention Mechanism

Introduction

  • Transformers are a key technology behind large language models and modern AI tools.
  • Originated in the 2017 paper "Attention Is All You Need."
  • Goal: Predict the next word in a given text.

Tokenization and Embedding

  • Text is divided into tokens (simplified as words for this discussion).
  • Each token is associated with a high-dimensional vector called an embedding.
  • Directions in embedding space can correspond to semantic meanings (e.g., a gender direction); see the sketch after this list.
  • Aim: Adjust embeddings to encode richer contextual meaning, not just individual words.
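
A minimal numpy sketch of the lookup step, using a made-up toy vocabulary and a randomly initialized embedding table (a real model learns the table, and GPT-3's embedding dimension is 12,288 rather than the tiny size used here):

```python
import numpy as np

# Toy vocabulary and a randomly initialized embedding table (one row per token).
vocab = {"fluffy": 0, "blue": 1, "creature": 2, "roamed": 3,
         "the": 4, "verdant": 5, "forest": 6}
d_model = 8                                   # tiny dimension for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))

tokens = ["fluffy", "blue", "creature"]
embeddings = E[[vocab[t] for t in tokens]]    # look up one vector per token
print(embeddings.shape)                       # (3, 8)

# In trained embeddings, directions carry meaning: differences such as
# E[woman] - E[man] tend to point along a "gender" direction, so that
# E[king] + (E[woman] - E[man]) lands near E[queen]. The random table above
# will not show this; it only illustrates the lookup.
```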

Understanding Attention Mechanism

  • Attention enables the model to account for context when interpreting words.
  • Example phrases:
    • "American True Mole"
    • "One Mole of Carbon Dioxide"
    • "Take a Biopsy of the Mole"
  • The initial embedding of "mole" is identical in all three phrases; context is only incorporated in the next step, attention.
  • Well-trained attention can adjust embeddings based on surrounding context.

Attention Process Explained

Contextual Updates

  • Example: The word "tower" can imply different meanings based on context.
  • Attention blocks adjust embeddings based on the relevance of surrounding words.

Example Phrase

  • Phrase: "a fluffy blue creature roamed the verdant forest."
  • Goal: Have adjectives adjust the meanings of their corresponding nouns.
  • Initial embeddings encode a token's identity and position but not yet its context; see the sketch below.
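
A small sketch of what "identity and position" can look like concretely, assuming (as in GPT-style models) that the initial vector is a learned token embedding plus a learned positional embedding; all sizes and values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, context_len, vocab_size = 8, 16, 50

token_emb = rng.normal(size=(vocab_size, d_model))    # normally learned
pos_emb = rng.normal(size=(context_len, d_model))     # normally learned

token_ids = np.array([7, 3, 3, 12])                   # note the repeated token 3
x = token_emb[token_ids] + pos_emb[: len(token_ids)]

# The two occurrences of token 3 now get different vectors because their
# positions differ, but neither vector yet reflects the surrounding words;
# that refinement is the job of attention.
```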

Attention Heads

  • A single head of attention updates embeddings through matrix-vector products.
  • Each embedding is multiplied by a query matrix to produce a query vector in a smaller space; conceptually, a noun's query asks whether adjectives precede it.
  • Each embedding is also mapped to a key vector; keys "answer" queries, and a key that aligns closely with a query marks that word as relevant (see the sketch after this list).
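
A hypothetical single-head sketch of the query/key step, with randomly initialized matrices standing in for the learned W_Q and W_K (GPT-3 uses an embedding dimension of 12,288 and a query/key dimension of 128; tiny sizes are used here):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head, n_tokens = 8, 4, 5
X = rng.normal(size=(n_tokens, d_model))     # one row per token embedding

W_Q = rng.normal(size=(d_model, d_head))     # learned in a real model
W_K = rng.normal(size=(d_model, d_head))

Q = X @ W_Q    # queries: e.g. a noun "asking" whether adjectives precede it
K = X @ W_K    # keys: e.g. an adjective "answering" that it is relevant
print(Q.shape, K.shape)                      # (5, 4) (5, 4)
```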

Attention Scores and Weights

  • Dot products between keys and queries measure relevance.
  • The scores are normalized with a softmax to form the attention pattern, a grid of weights that sum to 1.
  • Masking sets the scores for later positions to negative infinity before the softmax, so later words cannot influence earlier ones; a sketch follows this list.
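
A sketch of the scoring step under the common row-per-query convention: dot products between queries and keys, a causal mask that blocks attention to later positions, and a softmax over each row so the weights sum to 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d_head = 5, 4
Q = rng.normal(size=(n_tokens, d_head))
K = rng.normal(size=(n_tokens, d_head))

scores = Q @ K.T / np.sqrt(d_head)           # scaled dot-product scores
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)     # later positions set to -infinity

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
print(weights.round(2))   # upper triangle is 0: no token "looks ahead"
```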

Updating Embeddings

  • A value matrix maps each embedding to a value vector; the attention pattern weights these value vectors for each position.
  • The resulting weighted sums are added to the original embeddings to refine their meanings.
  • These refined embeddings are the output of the attention head; a sketch of the full single-head computation follows.
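
Putting the pieces together, a sketch of one full attention head. For simplicity the value map below is a single d_model × d_model matrix; the lecture's parameter count assumes it is factored into a "value down" and "value up" pair, which does not change the shape of the output:

```python
import numpy as np

def single_head_attention(X, W_Q, W_K, W_V):
    """Return refined embeddings: the originals plus attention-weighted values."""
    d_head = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)          # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # attention pattern
    delta = weights @ V          # each token's update: weighted sum of values
    return X + delta             # add the adjustment to the original embedding

rng = np.random.default_rng(4)
d_model, n_tokens = 8, 5
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, 4))
W_K = rng.normal(size=(d_model, 4))
W_V = rng.normal(size=(d_model, d_model))
print(single_head_attention(X, W_Q, W_K, W_V).shape)   # (5, 8)
```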

Multi-Headed Attention

  • A full attention block runs many heads in parallel (GPT-3 uses 96 heads per block), each capturing a different kind of contextual relationship.
  • Each head has its own query, key, and value matrices, so each proposes its own update to the embeddings.
  • The final output adds the sum of all heads' adjustments to the original embeddings, as sketched below.
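
A sketch of how multiple heads combine: each head has its own matrices and proposes its own adjustment, and the adjustments are summed into the embeddings (GPT-3 uses 96 such heads per block; three tiny ones are used here):

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, d_head, n_tokens, n_heads = 8, 4, 5, 3

def head_delta(X, W_Q, W_K, W_V):
    """One head's proposed adjustment (before it is added to the embeddings)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = rng.normal(size=(n_tokens, d_model))
out = X.copy()
for _ in range(n_heads):                       # each head: its own parameters
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_model))
    out += head_delta(X, W_Q, W_K, W_V)        # sum of all heads' adjustments
print(out.shape)                               # (5, 8)
```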

Parameter Count

  • Each attention head has around 6.3 million parameters.
  • One multi-headed attention block (96 heads) therefore has roughly 600 million parameters.
  • Across GPT-3's 96 layers, attention accounts for roughly 58 billion of the model's ~175 billion total parameters; the arithmetic is sketched below.
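
The parameter arithmetic behind those figures, using GPT-3's sizes (embedding dimension 12,288, key/query dimension 128, 96 heads per block, 96 layers) and counting the value map as two 12,288 × 128 factors, as the ~6.3 million per-head figure implies:

```python
d_model, d_head, heads, layers = 12_288, 128, 96, 96

per_head = 2 * d_model * d_head     # query matrix + key matrix
per_head += 2 * d_model * d_head    # "value down" + "value up" factors
print(f"{per_head:,}")              # 6,291,456       -> about 6.3 million per head

per_block = per_head * heads
print(f"{per_block:,}")             # 603,979,776     -> about 600 million per block

total = per_block * layers
print(f"{total:,}")                 # 57,982,058,496  -> about 58 billion overall
```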

Contextual Learning

  • Data flows through multiple attention blocks and layers, enhancing the embedding refinement.
  • The model aims to capture higher-level ideas, sentiments, and contextual nuances.

Conclusion

  • Attention is a crucial aspect of the transformer architecture, leveraging parallel computation for efficiency.
  • The majority of the model's parameters actually live in the multilayer perceptron blocks that sit between the attention layers.

Further Learning

  • Recommended resources:
    • Andrej Karpathy and Chris Olah for in-depth technical material.
    • Videos by Vivek and Brit Cruise on the history of language models.