Transformer Architecture Overview
Jun 21, 2024
Transformer Architecture in AI
Introduction
Transformers power recent AI breakthroughs (e.g., ChatGPT, Vision Transformers, AlphaFold 2).
Versatile: Work with text, images, speech, etc., as long as data is represented as vectors.
This remastered explanation places particular emphasis on self-attention.
Data Representation
Text
Tokenization: a tokenizer converts text into subwords.
Each subword is assigned a unique vector (randomly initialized or taken from word embeddings).
Word Embeddings: Represent semantic similarity.
Precomputed by analyzing word co-occurrences.
Neural network assigns similar vectors to semantically similar words.
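A minimal sketch of this step, using a made-up subword vocabulary and a randomly initialized embedding table (a real system would use a trained tokenizer such as BPE or WordPiece and learned or pretrained embeddings):

```python
import numpy as np

# Toy subword vocabulary (a real tokenizer learns these from data).
vocab = {"trans": 0, "former": 1, "##s": 2, "are": 3, "fun": 4}
embedding_dim = 8

# Embedding table: one vector per subword, randomly initialized here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def embed(subwords):
    """Map a list of subwords to their vectors (sequence_length x embedding_dim)."""
    ids = [vocab[s] for s in subwords]
    return embeddings[ids]

tokens = ["trans", "former", "##s", "are", "fun"]   # "transformers are fun"
x = embed(tokens)
print(x.shape)  # (5, 8): five tokens, each an 8-dimensional vector
```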
Images
Images as matrices (RGB channels, intensity values).
Convert to vectors by row-wise concatenation (inefficient for transformers).
Instead, split the image into patches and apply a linear neural network layer to project each patch down to a vector (see the sketch below).
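A rough sketch of the patch approach (ViT-style), assuming a 32×32 RGB image, 8×8 patches, and a random matrix standing in for the learned linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # toy RGB image (height, width, channels)
patch = 8                                 # patch size
embed_dim = 16                            # target vector dimensionality

# Cut the image into non-overlapping 8x8 patches and flatten each one.
patches = image.reshape(32 // patch, patch, 32 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)  # (16, 192)

# A learned linear layer (random here) projects each flattened patch to a token vector.
W = rng.normal(size=(patch * patch * 3, embed_dim))
tokens = patches @ W                      # (16, 16): 16 patch tokens of dimension 16
print(tokens.shape)
```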
Transformer Processing
Neural networks process vectors into better representations across layers.
Task examples: predicting next token, sentiment classification.
Transformer Layer
Takes input sequence as vectors, outputs vectors of same number and dimensionality.
Special tokens (e.g., classification token) added for specific tasks.
Special token output used by classification layer for prediction/probability assignment.
Training: compare predictions to expected results, backpropagate loss, update parameters.
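A toy illustration of how the output at a classification token could feed a classification layer and a cross-entropy loss; the shapes, weights, and label are made up, and the transformer layers themselves are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_classes = 16, 2

# Pretend these are the transformer's output vectors for [CLS] + 4 word tokens.
outputs = rng.normal(size=(5, d))
cls_vector = outputs[0]                      # output at the special classification token

# Classification layer: linear projection followed by softmax.
W, b = rng.normal(size=(d, num_classes)), np.zeros(num_classes)
logits = cls_vector @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Training compares the prediction to the expected label via a loss (cross-entropy here),
# whose gradient is backpropagated to update the parameters.
label = 1
loss = -np.log(probs[label])
print(probs, loss)
```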
Transformer Components
Feed Forward Network (MLP Sublayer)
A dense layer with GeLU activation expands the dimension (e.g., doubles it).
A second dense layer projects back down to the original dimension.
Each token is processed independently (see the sketch below).
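A sketch of the feed-forward sublayer under these assumptions (random placeholder weights, a hidden size that doubles the model dimension, GeLU between the two dense layers):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GeLU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise MLP: expand, apply GeLU, project back down.
    x has shape (sequence_length, d); every token row is processed independently."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, d_hidden = 16, 32                       # hidden size doubles the model dimension here
x = rng.normal(size=(5, d))
W1, b1 = rng.normal(size=(d, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d)), np.zeros(d)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (5, 16): same shape as the input
```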
Self-Attention Sublayer
Self-attention enables information flow within sequence.
Computes importance of each token relative to others.
Attention weights used to combine token representations.
Attention vs. self-attention: self-attention relates tokens within the same sequence, while attention in general can relate tokens across different sequences.
Attention Computation
Input vectors are transformed into queries, keys, and values by multiplying them with learned query, key, and value matrices (initialized randomly).
Attention scores are computed via the dot product of queries and keys, followed by a softmax.
The final token representation is a weighted sum of the value vectors (see the sketch after this list).
Multi-head self-attention: multiple sets of attention patterns.
Number of attention heads: a hyperparameter balancing complexity and memory usage.
Research on approximating/replacing attention to reduce resource consumption.
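A minimal single-head self-attention sketch following the steps above; the query, key, and value matrices are random stand-ins for learned weights:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (sequence_length, d). Returns new token representations of the same shape."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv               # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # scaled dot products
    weights = softmax(scores, axis=-1)             # attention weights per token
    return weights @ V                             # weighted sum of value vectors

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)         # (5, 16)
```

Multi-head attention repeats this with several independent sets of matrices and concatenates the results.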
Sequence Information
Positional Embeddings
Order matters: Positional embeddings added to input embeddings.
Identify position using rule-based or learnable vectors.
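One common rule-based choice is the sinusoidal positional embedding from the original Transformer paper, sketched below; a learnable variant would simply be another trainable table of position vectors:

```python
import numpy as np

def sinusoidal_positions(seq_len, d):
    """Fixed positional embeddings: sin/cos at geometrically spaced frequencies."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(0, d, 2)[None, :]                         # even dimensions
    angles = positions / np.power(10000, dims / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_vectors = np.zeros((5, 16))               # stand-in for the input embeddings
x = token_vectors + sinusoidal_positions(5, 16) # position info is simply added
print(x.shape)
```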
Residual Connections
The input of the attention layer is added to its output (layer normalization keeps values in a consistent range).
Facilitates training by focusing on transforming inputs incrementally.
Stacked transformer layers enable deeper problem-solving.
Prevents gradient signal loss in deep networks.
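A sketch of the residual pattern, assuming a post-norm layout; `sublayer` is a placeholder for the attention or feed-forward sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    """Add the sublayer's input to its output, then normalize."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
toy_sublayer = lambda t: 0.1 * t              # placeholder for attention / feed-forward
print(residual_block(x, toy_sublayer).shape)  # (5, 16)
```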
Training Procedures
Masked Language Modeling (BERT)
A classification token is added, used e.g. for sentence-pair classification.
Randomly mask 15% of tokens with [MASK] token.
Objective: adapt weights to correctly predict masked words.
Great for training classification transformers/encoders.
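A toy illustration of the masking step only: roughly 15% of token ids are replaced with a mask id, and the model is trained to recover the originals (BERT additionally leaves some selected tokens unchanged or swaps in random tokens, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103                                   # placeholder id for the [MASK] token
token_ids = rng.integers(1000, 2000, size=20)   # pretend sentence of 20 token ids

# Choose ~15% of positions and replace them with the mask token.
mask_positions = rng.random(token_ids.shape) < 0.15
masked_input = np.where(mask_positions, MASK_ID, token_ids)

# Training objective: predict token_ids at mask_positions given masked_input.
print(masked_input)
print("targets:", token_ids[mask_positions])
```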
Predicting Next Word (GPT)
Decoder transformers are trained to predict the next word, which enables text generation (see the sketch below).
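A sketch of the next-word objective: targets are the inputs shifted by one position, and a causal mask keeps each token from attending to later ones; the token ids are made up:

```python
import numpy as np

token_ids = np.array([5, 17, 42, 8, 99])     # toy token id sequence

# Next-word prediction: inputs are positions 0..n-2, targets are positions 1..n-1.
inputs, targets = token_ids[:-1], token_ids[1:]

# Causal mask: position i may only attend to positions <= i.
n = len(inputs)
causal_mask = np.tril(np.ones((n, n), dtype=bool))
print(inputs, targets)
print(causal_mask.astype(int))
```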
Transformers vs. RNNs
Transformers: Parallel processing of tokens via attention.
RNNs: Sequential processing with dependencies, slower training.
Transformer’s parallelism enables training on large datasets (e.g., entire internet).
Conclusion
Transformers revolutionized NLP with parallel token processing.
Learning resources: Jay Alammar's The Illustrated Transformer and Luis Serrano's series.
Thanks to supporters and viewers.