Lecture: Zero to Hero Series - Reproducing GPT-2 Model (124 Million Version)

Jul 18, 2024

Introduction

  • Series Name: Zero to Hero
  • Focus: Reproducing the GPT-2 model (124 million parameter version)
  • Published by: OpenAI in 2019
    • Released with a blog post, paper, and code on GitHub (openai/gpt-2)
  • Note: The lecture reproduces the GPT-2 miniseries (several sizes), focusing on the 124 million parameter model
  • Plot Analysis: Model sizes (x-axis) vs. Downstream metrics like translation, summarization, etc. (y-axis)
  • Paper: Mentions four models in GPT-2 miniseries from 124M to 1558M
    • GitHub repo clarifies parameter counts

Model Architecture

  • Parameters: 124 Million Model
    • 12 layers in Transformer
    • 768 channels (dimensionality) in Transformer
  • Target Metric: Validation loss
    • Measures how well the model predicts the next token on held-out data
    • Goal: Train from scratch and beat the validation loss of OpenAI's released 124M model
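
For concreteness, a minimal sketch of this configuration as a nanoGPT-style dataclass (an illustrative sketch, not the lecture's exact code; the vocab_size, block_size, and n_head values are the standard GPT-2 small settings, which are not spelled out in the bullets above):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length (absolute positions)
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size
    n_layer: int = 12        # 12 Transformer blocks (124M model)
    n_head: int = 12         # attention heads per block (GPT-2 small)
    n_embd: int = 768        # channel dimensionality
```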

Optimizing Training

  • Duration and Cost (2019 vs. Today)
    • In 2019: Training required a large compute cluster (pod) and was complicated and costly
    • Modern training: ~1 hour, ~$10 on cloud compute

Training Data and Weights Release

  • OpenAI: Released weights, not detailed training settings
  • GPT-3 Paper: Provides more detailed hyperparameters and optimization settings
  • Objective: Reference both the GPT-2 and GPT-3 papers to pin down the training details

Practical Implementation

Setup for Sampling from GPT-2

  • Original OpenAI Codebase: Uses TensorFlow
  • Preferred: PyTorch via Hugging Face Transformers (see the sketch after this list)
    • Model import: from transformers import GPT2LMHeadModel
    • Model loading: model = GPT2LMHeadModel.from_pretrained('gpt2') (the 'gpt2' checkpoint is the 124M model)
    • Inspecting the loaded weights:
      • Token Embeddings: Weight matrix of size 50257 x 768
      • Positional Embeddings: Learned vector for each absolute position up to 1024
      • Transformer Weights: Per-block attention and MLP weight/bias tensors of various shapes
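
A minimal sketch of this loading and inspection step, assuming the Hugging Face Transformers GPT-2 implementation (the transformer.wte.weight / transformer.wpe.weight key names come from that library, not from the notes above):

```python
from transformers import GPT2LMHeadModel

# load the pretrained 124M-parameter checkpoint ('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
sd = model.state_dict()

# token embeddings: vocab_size x n_embd
print(sd['transformer.wte.weight'].shape)   # torch.Size([50257, 768])
# learned absolute position embeddings: block_size x n_embd
print(sd['transformer.wpe.weight'].shape)   # torch.Size([1024, 768])

# the remaining entries are the per-block attention/MLP weights and biases
for name, tensor in list(sd.items())[:10]:
    print(name, tuple(tensor.shape))
```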

Component Analysis

  • Visualization of Position Embeddings: Plotting individual channels of the 1024 x 768 matrix over positions shows that the learned embeddings develop smooth, roughly sinusoidal structure
  • Mechanistic Interpretability: Informal inspection of the structure of these weight matrices
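
As an illustration of the visualization described above, a small matplotlib sketch that plots a few channels of the learned position-embedding matrix (sd is the state dict from the previous sketch; the channel indices are arbitrary picks for illustration):

```python
import matplotlib.pyplot as plt

wpe = sd['transformer.wpe.weight']  # (1024, 768) learned position embeddings

plt.figure(figsize=(8, 3))
for ch in (150, 200, 250):  # arbitrary channels chosen for illustration
    plt.plot(wpe[:, ch].detach().numpy(), label=f'channel {ch}')
plt.xlabel('position (0..1023)')
plt.ylabel('embedding value')
plt.legend()
plt.show()  # the learned curves come out smooth and roughly sinusoidal
```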

Custom GPT-2 Implementation

  • Class Initialization: GPTConfig dataclass and the GPT module
    • Token/position embeddings, MLP, and causal self-attention blocks
    • LayerNorm placed before attention/MLP (pre-norm), plus a final LayerNorm after the last block (see the sketch below)
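
A minimal sketch of one Transformer block with this pre-norm ordering (an illustrative sketch under the GPTConfig assumed earlier, not the lecture's exact code; the c_attn/c_fc/c_proj names mirror the GPT-2 checkpoint naming, and F.scaled_dot_product_attention assumes PyTorch 2.x):

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # q, k, v in one matmul
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # reshape to (B, n_head, T, head_dim) for multi-head attention
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')  # GPT-2 uses the tanh-approximate GELU
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    """One GPT-2 block: residual connections around pre-norm attention and pre-norm MLP."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # LayerNorm before attention (pre-norm)
        x = x + self.mlp(self.ln_2(x))   # LayerNorm before the MLP (pre-norm)
        return x
```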

Training Loop and Evaluation

  • Gradient Accumulation for Large Batch Sizes: Accumulate gradients over multiple micro-batches before updating model weights (see the training-loop sketch after this list)
  • Optimizer: AdamW with tuned beta parameters and weight decay
    • Gradient clipping to mitigate large gradient updates
  • Learning Rate Scheduling: Linear warmup followed by cosine decay
  • Utilizing Multiple GPUs: Distributed data parallel training for faster runs
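
A minimal training-loop sketch tying the points above together: gradient accumulation, AdamW, gradient clipping, and the warmup-plus-cosine schedule. Here model and get_batch are assumed placeholders, the model is assumed to return (logits, loss), and the specific step counts and learning rates are illustrative rather than the lecture's exact values (the betas of 0.9/0.95 and clipping at 1.0 follow the GPT-3 paper's published settings):

```python
import math
import torch

# assumed placeholders: a GPT model and a get_batch() helper returning (x, y) token tensors
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

max_lr, min_lr = 6e-4, 6e-5           # illustrative values
warmup_steps, max_steps = 100, 10000  # illustrative values
grad_accum_steps = 8                  # simulate a larger effective batch size

def get_lr(step):
    # linear warmup, then cosine decay from max_lr down to min_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)

for step in range(max_steps):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        x, y = get_batch()                     # one micro-batch of token ids
        logits, loss = model(x, targets=y)     # assumed forward signature
        (loss / grad_accum_steps).backward()   # accumulate averaged gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip large gradients
    lr = get_lr(step)
    for group in optimizer.param_groups:
        group['lr'] = lr                       # apply the scheduled learning rate
    optimizer.step()
```

For the multi-GPU point, the same loop is typically wrapped in torch.nn.parallel.DistributedDataParallel and launched with torchrun, with each process consuming its own shard of the data.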

Metrics and Advanced Execution

  • Evaluation: Validation loss, HellaSwag accuracy, and sample token completions (the tiny Shakespeare text appears for early sanity checks)
    • Training improves steadily and compares well against the released GPT-2 checkpoints
  • torch.compile Optimization: Compiles the model for faster PyTorch execution (though some compatibility issues were noted)
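
As a usage note for the torch.compile point, a one-line sketch (model is the GPT module assumed in the earlier sketches; this requires PyTorch 2.x):

```python
import torch

# JIT-compile the model's forward/backward into fused kernels for faster steps;
# some evaluation/generation code paths may need to run uncompiled
model = torch.compile(model)
```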

Conclusion and Beyond

  • FineWeb-Edu Dataset: A filtered web dataset emphasizing high-quality educational content, used as the training data
  • LM Optimization: Steady improvement in training and validation metrics, with increasingly coherent generated samples
  • Further Research Prospects: Hyperparameter tuning, experiments with data ordering and shuffling, and longer runs to secure competitive performance metrics

Observations: Model quality is validated against public datasets and benchmarks. The combination of optimizations and evaluations shows a robust reproduction of GPT-2, demonstrates how efficiently such a model can now be trained, and suggests clear directions for further scaling.