
Transformer Architecture Overview

Jun 21, 2024

Transformer Architecture in AI

Introduction

  • Transformers power recent AI breakthroughs (e.g., ChatGPT, Vision Transformers, AlphaFold 2).
  • Versatile: Work with text, images, speech, etc., as long as data is represented as vectors.
  • This is a remastered explanation of the architecture, with emphasis on self-attention.

Data Representation

Text

  • Tokenization: A tokenizer converts the text into subwords.
  • Each subword is assigned a unique vector (randomly initialized or taken from word embeddings; see the sketch after this list).
  • Word Embeddings: Represent semantic similarity.
    • Precomputed by analyzing word co-occurrences.
    • A neural network assigns similar vectors to semantically similar words.
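
A minimal sketch of the idea in PyTorch (the toy vocabulary and sizes below are made up for illustration; real systems use a trained subword tokenizer):

```python
import torch
import torch.nn as nn

# Toy "tokenizer": a hypothetical subword vocabulary for illustration only.
vocab = {"trans": 0, "former": 1, "##s": 2, "are": 3, "great": 4}
tokens = ["trans", "former", "##s", "are", "great"]
token_ids = torch.tensor([vocab[t] for t in tokens])

# Embedding table: each subword ID gets its own learnable vector.
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embed(token_ids)          # shape: (5, 8) -- one vector per subword
print(vectors.shape)
```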

Images

  • Images are matrices of intensity values (one per RGB channel).
  • Flattening an image into a single vector by row-wise concatenation is too inefficient for transformers.
  • Instead, split the image into patches and apply a linear neural network layer to each patch to reduce it to a vector (see the sketch after this list).
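
A rough ViT-style patch-embedding sketch in PyTorch, assuming a 224×224 RGB image and 16×16 patches (all sizes are illustrative):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)        # (batch, channels, height, width)

patch_size, embed_dim = 16, 256
# A Conv2d with stride == kernel size splits the image into non-overlapping
# 16x16 patches and applies one linear projection per patch.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

patches = to_patches(image)                 # (1, 256, 14, 14)
tokens = patches.flatten(2).transpose(1, 2) # (1, 196, 256): 196 patch vectors
print(tokens.shape)
```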

Transformer Processing

  • Neural networks process the input vectors into progressively better representations across layers.
  • Task examples: predicting next token, sentiment classification.

Transformer Layer

  • Takes an input sequence of vectors and outputs the same number of vectors with the same dimensionality.
  • Special tokens (e.g., a classification token) are added for specific tasks.
  • The output of the special token is passed to a classification layer for prediction/probability assignment (sketched after this list).
  • Training: compare predictions to expected results, backpropagate loss, update parameters.
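
A sketch of the classification setup, assuming PyTorch's built-in encoder layer and illustrative sizes (not the exact model from the video):

```python
import torch
import torch.nn as nn

embed_dim, num_classes, seq_len = 64, 2, 10
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable [CLS] vector
encoder = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
classifier = nn.Linear(embed_dim, num_classes)

x = torch.randn(1, seq_len, embed_dim)     # token vectors
x = torch.cat([cls_token, x], dim=1)       # prepend the [CLS] token
out = encoder(x)                           # same number and size of vectors
logits = classifier(out[:, 0])             # use only the [CLS] output vector
probs = logits.softmax(dim=-1)             # class probabilities
print(probs)
```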

Transformer Components

Feed Forward Network (MLP Sublayer)

  • A dense layer expands the dimension (e.g., doubles it) and applies a GeLU activation.
  • A second dense layer projects back down to the original dimension.
  • Each token is processed independently by the same MLP (sketched after this list).
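
A minimal sketch of the MLP sublayer in PyTorch (the 2× expansion factor is illustrative; 4× is also common):

```python
import torch
import torch.nn as nn

embed_dim = 64
ffn = nn.Sequential(
    nn.Linear(embed_dim, 2 * embed_dim),  # expand the dimension
    nn.GELU(),                            # non-linearity
    nn.Linear(2 * embed_dim, embed_dim),  # project back to the original size
)

x = torch.randn(1, 10, embed_dim)   # (batch, tokens, dim)
y = ffn(x)                          # applied to every token independently
print(y.shape)                      # torch.Size([1, 10, 64])
```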

Self-Attention Sublayer

  • Self-attention enables information flow within sequence.
  • Computes importance of each token relative to others.
  • Attention weights used to combine token representations.
  • Attention vs. Self-Attention: self-attention relates tokens within the same sequence; (cross-)attention relates tokens across different sequences.

Attention Computation

  • Input vectors are transformed into queries, keys, and values.
  • Each input vector is multiplied by a query, a key, and a value matrix; these matrices are initialized randomly and learned during training.
  • Attention scores are calculated via the scalar product of queries and keys, followed by a softmax.
  • Final token representation: a weighted sum of the value vectors (see the sketch after this list).
  • Multi-head self-attention: multiple sets of attention patterns.
  • Number of attention heads: a hyperparameter balancing complexity and memory usage.
  • Research on approximating/replacing attention to reduce resource consumption.
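
A compact single-head self-attention sketch in PyTorch following the steps above (dimensions are illustrative; multi-head attention repeats this with several independent projections and concatenates the results):

```python
import torch
import torch.nn as nn

embed_dim = 64
x = torch.randn(1, 10, embed_dim)          # 10 token vectors

# Randomly initialized (and normally learned) projection matrices.
W_q = nn.Linear(embed_dim, embed_dim, bias=False)
W_k = nn.Linear(embed_dim, embed_dim, bias=False)
W_v = nn.Linear(embed_dim, embed_dim, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)

# Attention scores: scaled dot product of every query with every key.
scores = Q @ K.transpose(-2, -1) / embed_dim ** 0.5   # (1, 10, 10)
weights = scores.softmax(dim=-1)                      # each row sums to 1

# New token representations: weighted sums of the value vectors.
out = weights @ V                                     # (1, 10, 64)
print(out.shape)
```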

Sequence Information

Positional Embeddings

  • Order matters: positional embeddings are added to the input embeddings.
  • Positions are encoded either with rule-based vectors (e.g., sinusoidal) or with learnable vectors (rule-based variant sketched below).
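
One rule-based option is the sinusoidal encoding from the original Transformer paper; a sketch with illustrative sizes:

```python
import torch

seq_len, embed_dim = 10, 64
pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (10, 1)
i = torch.arange(0, embed_dim, 2, dtype=torch.float)          # even dimensions
angle = pos / (10000 ** (i / embed_dim))                      # (10, 32)

pe = torch.zeros(seq_len, embed_dim)
pe[:, 0::2] = torch.sin(angle)   # sine on even dimensions
pe[:, 1::2] = torch.cos(angle)   # cosine on odd dimensions

token_vectors = torch.randn(seq_len, embed_dim)
x = token_vectors + pe           # positional info added to the input embeddings
print(x.shape)
```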

Residual Connections

  • The input of the attention sublayer is added to its output; layer normalization keeps the values in a stable range (sketched after this list).
  • Facilitates training: each layer only has to transform its input incrementally.
  • Stacked transformer layers enable deeper problem-solving.
  • Prevents gradient signal loss in deep networks.
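
A sketch of the residual pattern around the attention sublayer (post-norm variant; pre-norm ordering is also common):

```python
import torch
import torch.nn as nn

embed_dim = 64
attention = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
norm = nn.LayerNorm(embed_dim)

x = torch.randn(1, 10, embed_dim)
attn_out, _ = attention(x, x, x)   # self-attention: queries, keys, values from x
x = norm(x + attn_out)             # residual connection, then normalization
print(x.shape)
```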

Training Procedures

Masked Language Modeling (BERT)

  • A classification token ([CLS]) is added, used e.g. for sentence-pair classification.
  • Randomly mask 15% of the tokens by replacing them with a [MASK] token (toy sketch after this list).
  • Objective: adapt the weights to correctly predict the masked words.
  • Well suited for training encoder transformers used for classification.
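
A toy sketch of the masking step (vocabulary size and [MASK] id are made up; BERT additionally replaces some selected tokens with random or unchanged tokens):

```python
import torch

vocab_size, mask_id = 1000, 999                          # hypothetical vocab and [MASK] id
token_ids = torch.randint(0, vocab_size - 2, (1, 20))    # stand-in for real token ids

# Pick roughly 15% of positions at random and replace them with [MASK].
mask = torch.rand(token_ids.shape) < 0.15
labels = token_ids.clone()                               # the model must predict these
corrupted = token_ids.masked_fill(mask, mask_id)

# Training: feed `corrupted` into the transformer and compute the loss only
# at the masked positions, comparing predictions against `labels`.
print(corrupted[mask], labels[mask])
```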

Predicting Next Word (GPT)

  • Decoder transformers are trained to predict the next word; a causal mask ensures each position only attends to previous tokens (sketched below).
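
A sketch of the causal mask that makes next-word prediction possible: each position may only attend to itself and earlier positions (sizes are illustrative):

```python
import torch

seq_len = 5
# Upper-triangular mask: position i may not attend to positions j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)                   # raw attention scores
scores = scores.masked_fill(causal_mask, float("-inf"))  # block future positions
weights = scores.softmax(dim=-1)                         # future tokens get weight 0
print(weights)
# The output at position i is then used to predict token i+1.
```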

Transformers vs. RNNs

  • Transformers: Parallel processing of tokens via attention.
  • RNNs: Sequential processing, where each step depends on the previous one; training is slower.
  • The transformer’s parallelism enables training on very large datasets (e.g., large parts of the internet).

Conclusion

  • Transformers revolutionized NLP with parallel token processing.
  • Learning resources: Jay Alammar’s Illustrated Transformer, Luis Serrano’s series.
  • Thanks to supporters and viewers.