Understanding Transformer Model Architecture
Jan 18, 2025
Notes on Transformer Explainer: LLM Transformer Model Visually Explained
What is a Transformer?
The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need".
It's a foundational architecture for AI, used in text-generative models such as OpenAI's GPT, Meta's Llama, and Google's Gemini.
It is also applied across various domains including audio generation, image recognition, protein structure prediction, and game playing.
Operates on the principle of next-word prediction using a self-attention mechanism.
GPT-2 models exemplify text-generative Transformers; the Transformer Explainer uses a GPT-2 (small) model with 124 million parameters.
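A minimal sketch of this next-word-prediction loop, assuming the Hugging Face transformers library and PyTorch rather than the Explainer's in-browser ONNX model; the prompt text and the choice of greedy decoding are illustrative only:

```python
# Minimal next-word prediction loop with GPT-2 (small), using Hugging Face's
# GPT-2 weights rather than the Explainer's in-browser ONNX model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")       # 124M-parameter GPT-2 (small)
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Data visualization empowers users to"           # illustrative prompt
input_ids = tokenizer.encode(text, return_tensors="pt")

with torch.no_grad():
    for _ in range(5):                                   # generate five tokens greedily
        logits = model(input_ids).logits                 # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()                 # most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```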
Transformer Architecture
Composed of three key components:
Embedding: Converts text input into numerical vectors (embeddings).
Transformer Block: Processes input data through attention mechanisms and MLP layers.
Output Probabilities: Converts the processed embeddings into next-token predictions.
Embedding
Involves tokenization, obtaining token embeddings, adding positional information, and combining token and position encodings.
Step 1: Tokenization
Breaks input into tokens, which can be words or subwords, each with a unique ID.
GPT-2 uses a vocabulary of 50,257 tokens.
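A small tokenization sketch, assuming the Hugging Face GPT-2 tokenizer (the example string and printed token split are illustrative):

```python
# Tokenization sketch: GPT-2's byte-pair-encoding tokenizer splits text into
# subword tokens, each mapped to an ID in a 50,257-entry vocabulary.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Transformers are powerful")   # illustrative input
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)                # e.g. ['Transform', 'ers', 'Ġare', 'Ġpowerful']
print(ids)                   # corresponding token IDs
print(tokenizer.vocab_size)  # 50257
```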
Step 2: Token Embedding
Each token is represented by a 768-dimensional vector, stored in a large matrix.
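A sketch of the embedding lookup, using a randomly initialized PyTorch embedding table in place of GPT-2's trained weights; the token IDs are hypothetical:

```python
# Token-embedding lookup: each of the 50,257 token IDs indexes one row of a
# (vocab_size x 768) embedding matrix. Weights here are random placeholders.
import torch

vocab_size, d_model = 50257, 768
token_embedding = torch.nn.Embedding(vocab_size, d_model)

ids = torch.tensor([[8291, 687, 364]])   # hypothetical token IDs, shape (1, 3)
vectors = token_embedding(ids)           # shape (1, 3, 768)
print(vectors.shape)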
Step 3: Positional Encoding
Captures the position of tokens in the input, using a positional encoding matrix.
Step 4: Final Embedding
Combines token and positional encodings to form the final embedding.
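A combined sketch of steps 3 and 4, assuming GPT-2's learned position embeddings (represented here by a randomly initialized table) added elementwise to the token embeddings:

```python
# Positional encoding + final embedding: GPT-2 adds a learned position
# embedding to each token embedding. Weights below are random placeholders.
import torch

d_model, max_positions = 768, 1024
token_embedding = torch.nn.Embedding(50257, d_model)
position_embedding = torch.nn.Embedding(max_positions, d_model)

ids = torch.tensor([[8291, 687, 364]])               # hypothetical token IDs
positions = torch.arange(ids.size(1)).unsqueeze(0)   # [[0, 1, 2]]

final_embedding = token_embedding(ids) + position_embedding(positions)
print(final_embedding.shape)                         # torch.Size([1, 3, 768])
```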
Transformer Block
Core consists of multi-head self-attention and MLP layers.
Multiple blocks are stacked to build complex token representations.
Multi-Head Self-Attention
Query, Key, and Value Matrices: Used to calculate attention scores.
Masked Self-Attention: Prevents tokens from attending to future tokens; softmax turns the scores into probability distributions.
Output: Uses the self-attention scores to produce the final output, computed across multiple attention heads (see the sketch below).
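A single-head sketch of masked self-attention (GPT-2 small actually splits this across 12 heads per block); the input tensor is random and the dimensions simply match GPT-2's 768-dimensional embeddings:

```python
# Single-head masked self-attention sketch with a toy input.
import math
import torch

d_model = 768
x = torch.randn(1, 5, d_model)            # (batch, seq_len, d_model) toy input

W_q = torch.nn.Linear(d_model, d_model)   # Query projection
W_k = torch.nn.Linear(d_model, d_model)   # Key projection
W_v = torch.nn.Linear(d_model, d_model)   # Value projection

Q, K, V = W_q(x), W_k(x), W_v(x)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)   # scaled attention scores

# Causal mask: each position may only attend to itself and earlier positions.
seq_len = x.size(1)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)   # probability distribution per row
output = weights @ V                      # (1, 5, 768)
print(output.shape)
```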
MLP: Multi-Layer Perceptron
Enhances the model's representational capacity through linear transformations and a GELU activation.
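A sketch of this feed-forward sub-layer, using GPT-2 small's actual dimensions (768 expanded to 3072 and projected back) but randomly initialized weights:

```python
# MLP (position-wise feed-forward) sketch: expand, apply GELU, project back.
import torch

d_model = 768
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),   # expand 768 -> 3072
    torch.nn.GELU(),                         # non-linearity
    torch.nn.Linear(4 * d_model, d_model),   # project back to 768
)

x = torch.randn(1, 5, d_model)   # toy input
print(mlp(x).shape)              # torch.Size([1, 5, 768])
```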
Output Probabilities
The final embeddings are passed through a linear layer that produces a logit for each vocabulary token.
Softmax function used to create a probability distribution for token selection.
Temperature: Adjusts the randomness of the model's output.
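A sketch of temperature-scaled softmax on made-up logits, showing how a lower temperature sharpens the distribution and a higher one flattens it:

```python
# Softmax with temperature on toy logits.
import torch

logits = torch.tensor([4.0, 2.0, 1.0, 0.5])   # illustrative logits from the final linear layer

for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])
```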
Advanced Architectural Features
Layer Normalization: Stabilizes training and improves convergence.
Dropout: Prevents overfitting by randomly deactivating neurons during training.
Residual Connections: Mitigate the vanishing-gradient problem; used extensively in Transformer blocks.
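A simplified sketch of how these three features fit into one block: layer normalization before each sub-layer, dropout on each sub-layer's output, and a residual connection around it. This is a GPT-2-like pre-norm layout, not the Explainer's exact code, and the causal mask is omitted for brevity:

```python
# Toy Transformer block showing LayerNorm, Dropout, and residual connections.
import torch

class ToyBlock(torch.nn.Module):
    def __init__(self, d_model=768, n_heads=12, p_drop=0.1):
        super().__init__()
        self.ln1 = torch.nn.LayerNorm(d_model)
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = torch.nn.LayerNorm(d_model)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d_model, 4 * d_model),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d_model, d_model),
        )
        self.drop = torch.nn.Dropout(p_drop)

    def forward(self, x):
        h = self.ln1(x)                                # pre-norm before attention
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted
        x = x + self.drop(attn_out)                    # residual around attention
        x = x + self.drop(self.mlp(self.ln2(x)))       # residual around MLP
        return x

block = ToyBlock()
x = torch.randn(1, 5, 768)
print(block(x).shape)   # torch.Size([1, 5, 768])
```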
Interactive Features
Allows users to input their own text and control the model's predictions via temperature adjustment.
Visualizes attention weights and computations.
Implementation
The GPT-2 model runs in-browser using ONNX Runtime; the UI and visualizations are built with Svelte and D3.js.
Developers
Developed by a team at Georgia Institute of Technology including Aeree Cho, Grace C. Kim, and others.
🔗 View note source: https://poloclub.github.io/transformer-explainer/