Understanding Transformers and Attention Mechanisms
Aug 25, 2024
Lecture Notes on the Transformer and Attention Mechanism
Introduction
Transformers are a key technology behind large language models and modern AI tools.
Originated from the 2017 paper "Attention is All You Need."
Goal: Predict the next word in a given text.
Tokenization and Embedding
Text is divided into tokens (simplified as words for this discussion).
Each token is associated with a high-dimensional vector called an embedding.
Directions in embedding space can correspond to semantic meanings (e.g., gender).
Aim: Adjust embeddings to encode richer contextual meaning, not just individual words.
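A minimal NumPy sketch of this lookup step, using a made-up vocabulary and a tiny embedding dimension (real models like GPT-3 use tens of thousands of tokens and 12,288-dimensional embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table; both the vocabulary and the
# embedding dimension are made up for illustration.
vocab = {"a": 0, "fluffy": 1, "blue": 2, "creature": 3, "roamed": 4,
         "the": 5, "verdant": 6, "forest": 7}
d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(words):
    """Map each token to its (context-independent) embedding vector."""
    ids = [vocab[w] for w in words]
    return embedding_table[ids]          # shape: (sequence_length, d_model)

tokens = ["a", "fluffy", "blue", "creature"]
E = embed(tokens)
print(E.shape)  # (4, 8): one embedding per token, no context yet
```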
Understanding Attention Mechanism
Attention enables the model to account for context when interpreting words.
Example phrases:
"American True Mole"
"One Mole of Carbon Dioxide"
"Take a Biopsy of the Mole"
Initial embeddings do not capture context; only the next step (attention) does.
Well-trained attention can adjust embeddings based on surrounding context.
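A toy illustration of that point, assuming a hypothetical mini-vocabulary: before any attention runs, the word "mole" maps to exactly the same vector in all three phrases above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-vocabulary; the words come from the example phrases above,
# the embedding values are random stand-ins for learned parameters.
vocab = {"american": 0, "shrew": 1, "mole": 2, "one": 3, "of": 4,
         "carbon": 5, "dioxide": 6, "take": 7, "a": 8, "biopsy": 9, "the": 10}
embedding_table = rng.normal(size=(len(vocab), 6))

phrases = [
    ["american", "shrew", "mole"],
    ["one", "mole", "of", "carbon", "dioxide"],
    ["take", "a", "biopsy", "of", "the", "mole"],
]

# Look up the embedding of "mole" inside each phrase: identical every time,
# because the lookup ignores context entirely.
mole_vecs = [embedding_table[[vocab[w] for w in phrase]][phrase.index("mole")]
             for phrase in phrases]
print(all(np.array_equal(v, mole_vecs[0]) for v in mole_vecs))  # True
```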
Attention Process Explained
Contextual Updates
Example: The word "tower" can imply different meanings based on context.
Attention blocks adjust embeddings based on the relevance of surrounding words.
Example Phrase
Phrase: "a fluffy blue creature roamed the verdant forest."
Goal: Have adjectives adjust the meanings of their corresponding nouns.
Initial embeddings encode each word's identity and position in the text, but still need context to refine their meaning.
Attention Heads
A single head of attention updates embeddings through matrix-vector products.
Each token's embedding is multiplied by a query matrix to produce a lower-dimensional query vector (e.g., a noun asking "are there adjectives in front of me?").
Key vectors, produced by a separate key matrix, act as potential answers to those queries; keys that align closely with a query mark the corresponding words as relevant (a sketch follows below).
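A rough sketch of the query and key projections for a single head; the names W_Q and W_K and all dimensions here are illustrative toy choices (GPT-3's key/query space has 128 dimensions).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4          # toy sizes; GPT-3 uses 12,288 and 128
seq_len = 5

E = rng.normal(size=(seq_len, d_model))   # embeddings from the previous step

# One attention head's query and key matrices (learned in practice; random here).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

Q = E @ W_Q     # each row: "what am I looking for?" (e.g., nearby adjectives)
K = E @ W_K     # each row: "what do I offer?" (e.g., "I am an adjective")
print(Q.shape, K.shape)   # (5, 4) (5, 4): lower-dimensional than the embeddings
```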
Attention Scores and Weights
Dot products between keys and queries measure relevance.
Scores are normalized using softmax to create an attention pattern.
Masking is applied to prevent later words from influencing earlier words.
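A sketch of how those scores can be computed, masked, and normalized; the scaling by the square root of the key dimension follows the original paper, and rows correspond to queries here (some presentations draw the pattern with queries along columns instead).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_head = 5, 4
Q = rng.normal(size=(seq_len, d_head))
K = rng.normal(size=(seq_len, d_head))

# Relevance scores: dot product of every query with every key,
# scaled by sqrt(d_head).
scores = Q @ K.T / np.sqrt(d_head)             # shape: (seq_len, seq_len)

# Causal mask: entries where a later token would inform an earlier one are
# set to -inf so they become 0 after the softmax.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# Softmax along each row turns the scores into an attention pattern
# whose rows sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
```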
Updating Embeddings
A value matrix maps each token's embedding to the change it would contribute to other tokens; these value vectors are weighted by the attention pattern.
The resulting weighted sums are added to the original embeddings to refine their meanings.
This process outputs refined embeddings from the attention block.
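A sketch of the update step under the same toy setup; the uniform attention pattern is only a placeholder, and a single value matrix stands in for the factored value-down/value-up pair used in GPT-3.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
E = rng.normal(size=(seq_len, d_model))               # original embeddings
weights = np.full((seq_len, seq_len), 1.0 / seq_len)  # placeholder attention pattern

# Value matrix: maps each embedding to the adjustment it would cause in others.
# (GPT-3 factors this into "value-down" and "value-up" matrices; a single
# matrix is used here for simplicity.)
W_V = rng.normal(size=(d_model, d_model)) * 0.1

V = E @ W_V           # value vector for every token
delta = weights @ V   # weighted sum of value vectors for each position
E_refined = E + delta # add the adjustment to the original embedding
print(E_refined.shape)  # (5, 8): refined, context-aware embeddings
```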
Multi-Headed Attention
The full attention block runs many heads in parallel (GPT-3 uses 96 heads per block), each capturing a different kind of contextual relationship.
Each head has distinct parameter matrices contributing to the overall embedding update.
Final output is a sum of adjustments from all heads.
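A compact sketch of a multi-headed block assembled from the pieces above, with toy sizes and randomly initialized matrices standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head, n_heads = 5, 8, 4, 3   # toy sizes; GPT-3 uses 96 heads
E = rng.normal(size=(seq_len, d_model))

def single_head(E, rng):
    """One attention head with its own W_Q, W_K, W_V, producing one adjustment."""
    W_Q, W_K = rng.normal(size=(2, E.shape[1], d_head))
    W_V = rng.normal(size=(E.shape[1], E.shape[1])) * 0.1
    scores = (E @ W_Q) @ (E @ W_K).T / np.sqrt(d_head)
    scores = np.where(np.triu(np.ones_like(scores, dtype=bool), k=1), -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (E @ W_V)

# Each head proposes its own adjustment; the block sums them all and adds
# the total to the original embeddings.
E_out = E + sum(single_head(E, rng) for _ in range(n_heads))
print(E_out.shape)   # (5, 8)
```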
Parameter Count
Each attention head has around 6.3 million parameters.
Total parameters in multi-headed attention block: approximately 600 million.
Across its 96 layers, GPT-3 devotes roughly 58 billion parameters to attention heads, about a third of its 175 billion total.
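The arithmetic behind those figures, using the dimensions quoted in the lecture (12,288-dimensional embeddings, 128-dimensional key/query/value space, 96 heads per block, 96 layers):

```python
# Back-of-the-envelope parameter count for GPT-3's attention layers.
d_model, d_head, n_heads, n_layers = 12_288, 128, 96, 96

# Per head: query + key matrices, plus the value map factored into
# "value-down" and "value-up" matrices of the same size (4 matrices total).
per_head = 4 * d_model * d_head
per_block = per_head * n_heads
total = per_block * n_layers

print(f"{per_head:,} per head")        # 6,291,456      (~6.3 million)
print(f"{per_block:,} per block")      # 603,979,776    (~600 million)
print(f"{total:,} across all layers")  # 57,982,058,496 (~58 billion)
```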
Contextual Learning
Data flows through multiple attention blocks and layers, enhancing the embedding refinement.
The model aims to capture higher-level ideas, sentiments, and contextual nuances.
Conclusion
Attention is a crucial aspect of the transformer architecture, leveraging parallel computation for efficiency.
The majority of the model's parameters actually sit in the blocks between attention layers (the multilayer perceptron blocks).
Further Learning
Recommended resources:
Andrej Karpathy and Chris Olah for in-depth understanding.
Videos by Vivek and Brit Cruise on the history of language models.