Understanding Transformer Model Architecture

Jan 18, 2025

Notes on Transformer Explainer: LLM Transformer Model Visually Explained

What is a Transformer?

  • The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need".
  • It's a foundational architecture for AI, used in text-generative models such as OpenAI's GPT, Meta's Llama, and Google's Gemini.
  • It is also applied across various domains including audio generation, image recognition, protein structure prediction, and game playing.
  • Operates on the principle of next-word prediction using a self-attention mechanism.
  • GPT-2 models exemplify text-generative Transformers, with the Transformer Explainer using a GPT-2 (small) model with 124 million parameters.

Transformer Architecture

  • Composed of three key components:
    1. Embedding: Converts text input into numerical vectors (embeddings).
    2. Transformer Block: Processes input data through attention mechanisms and MLP layers.
    3. Output Probabilities: Converts processed embeddings into next-token predictions.
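
A minimal PyTorch-style sketch of how these three components fit together (illustrative only, not the Transformer Explainer's actual code; the hyperparameters follow GPT-2 small, but the block internals are simplified):

```python
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Illustrative composition of the three components (not the GPT-2 reference code)."""
    def __init__(self, vocab_size=50257, d_model=768, n_blocks=12, max_len=1024):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)   # 1. token embeddings
        self.wpe = nn.Embedding(max_len, d_model)      # 1. learned positional embeddings
        self.blocks = nn.ModuleList([                  # 2. stacked Transformer blocks
            nn.TransformerEncoderLayer(d_model, nhead=12, dim_feedforward=4 * d_model,
                                       activation="gelu", norm_first=True,  # pre-LayerNorm, as in GPT-2
                                       batch_first=True)
            for _ in range(n_blocks)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # 3. output projection

    def forward(self, token_ids):                      # token_ids: (batch, seq)
        seq_len = token_ids.size(1)
        pos = torch.arange(seq_len, device=token_ids.device)
        x = self.wte(token_ids) + self.wpe(pos)        # embedding step
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        for block in self.blocks:                      # Transformer blocks
            x = block(x, src_mask=causal_mask)
        return self.lm_head(x)                         # logits over the 50,257-token vocabulary

model = TinyTransformerLM()
logits = model(torch.randint(0, 50257, (1, 8)))        # 8 random token IDs
print(logits.shape)                                    # torch.Size([1, 8, 50257])
```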

Embedding

  • Involves four steps: tokenizing the input, looking up token embeddings, adding positional information, and combining the two into the final embedding.

Step 1: Tokenization

  • Breaks input into tokens, which can be words or subwords, each with a unique ID.
  • GPT-2 uses a vocabulary of 50,257 tokens.
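
To see GPT-2's byte-pair-encoding tokenizer in action, a quick check using the tiktoken package (assuming it is installed):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")              # GPT-2's BPE tokenizer
print(enc.n_vocab)                               # 50257
ids = enc.encode("Transformers are expressive")  # text -> token IDs
print(ids)
print([enc.decode([i]) for i in ids])            # each ID maps back to a word or subword piece
```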

Step 2: Token Embedding

  • Each token is represented by a 768-dimensional vector, stored as a row of a large embedding matrix (50,257 × 768 in GPT-2 small).
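
A sketch of the lookup itself, with a randomly initialized stand-in for the learned embedding matrix:

```python
import numpy as np

vocab_size, d_model = 50257, 768
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, d_model))   # stand-in for the learned token embedding matrix

token_ids = [464, 6382, 318]                    # placeholder IDs from the tokenizer
token_embeddings = wte[token_ids]               # embedding = simple row lookup by token ID
print(token_embeddings.shape)                   # (3, 768)
```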

Step 3: Positional Encoding

  • Captures each token's position in the input sequence using a positional encoding matrix; GPT-2 learns this matrix during training, with one row per position.

Step 4: Final Embedding

  • Sums the token and positional encodings element-wise to form the final embedding passed to the Transformer blocks.
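
Continuing the sketch above: the positional encoding is another learned matrix, indexed by position, and the final embedding is the element-wise sum (the weights here are placeholders):

```python
max_len = 1024                                   # GPT-2's maximum context length
wpe = rng.normal(size=(max_len, d_model))        # stand-in for the learned positional matrix

positions = np.arange(len(token_ids))            # positions 0, 1, 2, ...
final_embeddings = token_embeddings + wpe[positions]  # token embedding + positional encoding
print(final_embeddings.shape)                    # (3, 768), ready for the Transformer blocks
```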

Transformer Block

  • Core consists of multi-head self-attention and MLP layers.
  • Multiple blocks are stacked to build complex token representations.

Multi-Head Self-Attention

  • Query, Key, and Value Matrices: Learned linear projections of the input embeddings, used to calculate attention scores.
  • Masked Self-Attention: Prevents each token from attending to future tokens; softmax turns the masked scores into a probability distribution over the visible tokens.
  • Output: The attention weights produce a weighted sum of the Value vectors for each token; multiple attention heads run in parallel and their outputs are combined.
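
A NumPy sketch of one masked self-attention head; the weight matrices are random placeholders, and the head size of 64 follows GPT-2 small (12 heads × 64 = 768):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention_head(x, Wq, Wk, Wv):
    """One head of masked self-attention over x of shape (seq, d_model)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                     # query/key/value projections
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)                   # scaled dot-product scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -np.inf, scores)             # hide future tokens
    weights = softmax(scores, axis=-1)                   # probability distribution per token
    return weights @ V                                   # weighted sum of the Value vectors

seq, d_model, d_head = 5, 768, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = causal_self_attention_head(x, Wq, Wk, Wv)
print(out.shape)   # (5, 64); 12 such head outputs are concatenated and projected back to 768
```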

MLP: Multi-Layer Perceptron

  • Enhances the model's representational capacity: each token's embedding is expanded by a linear layer (768 → 3,072 in GPT-2 small), passed through a GELU activation, and projected back down to 768 dimensions.
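
A NumPy sketch of this layer with placeholder weights (GPT-2 uses the tanh approximation of GELU and a 4× expansion):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer: expand, apply nonlinearity, project back."""
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff = 768, 4 * 768                      # 768 -> 3072 -> 768
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(5, d_model))                 # 5 tokens
print(mlp(x, W1, b1, W2, b2).shape)               # (5, 768)
```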

Output Probabilities

  • The final embeddings are passed through a linear layer that maps them to a score (logit) for each of the 50,257 vocabulary tokens.
  • Softmax function used to create a probability distribution for token selection.
  • Temperature: Scales the logits before the softmax; lower values make the output more deterministic, higher values make it more random.
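
A small numeric example of how temperature reshapes the softmax distribution (the logits are made up):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])          # placeholder scores from the linear layer

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)          # lower T sharpens, higher T flattens
    print(temperature, np.round(probs, 3))
```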

Advanced Architectural Features

  • Layer Normalization: Stabilizes training and improves convergence.
  • Dropout: Prevents overfitting by deactivating neurons during training.
  • Residual Connections: Mitigate the vanishing gradient problem; applied around both the attention and MLP sublayers in every Transformer block.
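
A sketch of how these three features wrap each sublayer in a GPT-2-style (pre-LayerNorm) block; the stand-in MLP sublayer mirrors the section above:

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Wraps a sublayer with layer normalization, dropout, and a residual connection."""
    def __init__(self, d_model, sublayer, p_drop=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)    # stabilizes training
        self.sublayer = sublayer
        self.dropout = nn.Dropout(p_drop)    # deactivates activations randomly during training

    def forward(self, x):
        return x + self.dropout(self.sublayer(self.norm(x)))  # residual shortcut around the sublayer

# Usage with a stand-in MLP sublayer:
d_model = 768
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
block_half = PreNormResidual(d_model, mlp)
x = torch.randn(2, 5, d_model)
print(block_half(x).shape)                   # torch.Size([2, 5, 768])
```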

Interactive Features

  • Lets users input their own text and steer the model's predictions by adjusting the temperature.
  • Visualizes attention weights and computations.

Implementation

  • The GPT-2 model runs directly in the browser via ONNX Runtime; the UI and visualizations are built with Svelte and D3.js.

Developers

  • Developed by a team at Georgia Institute of Technology including Aeree Cho, Grace C. Kim, and others.