
Reproducing the GPT-2 Model Steps

Sep 5, 2024

Zero to Hero Series: Reproducing the GPT-2 Model

Overview

  • Aim: reproduce the GPT-2 124M-parameter model.
  • Released by OpenAI in 2019 with a blog post, paper, and GitHub code.
  • Focus on the 124M model, the smallest in a series of four sizes ranging up to 1558M (1.5B) parameters.
  • Scaling laws observed: larger models improve downstream metrics (translation, summarization, etc.).

GPT-2 Architecture

  • The 124M-parameter model has 12 transformer layers, 12 attention heads, and a 768-dimensional embedding (see the config sketch below).
  • The original GPT-2 code was released in TensorFlow; the Hugging Face Transformers library provides a PyTorch port that is easier to work with.
  • The learned positional embeddings, when plotted, show a wavy sinusoid-like structure with visible noise, a hint that the model is somewhat undertrained.
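
For reference, the 124M ("GPT-2 small") hyperparameters as a config sketch; the field names follow common PyTorch reimplementations rather than OpenAI's TensorFlow code, but the values are the published settings.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # maximum context length
    vocab_size: int = 50257   # 50,000 BPE merges + 256 byte tokens + <|endoftext|>
    n_layer: int = 12         # number of transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # embedding / residual-stream width
```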

Implementation Steps

  1. Loading GPT-2:

    • Use the Hugging Face library to load and understand model weights.
    • Inspect token and position embeddings.
  2. Reproducing the Architecture:

    • Develop from scratch in PyTorch.
    • Create the transformer container: token and position embeddings, an nn.ModuleList of blocks, and a final LayerNorm.
    • Implement each block with pre-normalization, multi-headed causal self-attention, and an MLP using the GELU activation (Gaussian Error Linear Unit); see the sketch after this list.
    • Keep the attention efficient by computing all heads with batched matrix operations rather than looping over heads.
  3. Weight Initialization:

    • Follow GPT-2's initialization scheme: weights drawn from a normal distribution with standard deviation 0.02, with projections into the residual stream scaled down so activations do not grow with depth.
    • Implement weight tying so the input token embedding and the output language-model head share the same matrix (see the initialization sketch after this list).
  4. Training Process:

    • Tokenize the data with the GPT-2 BPE tokenizer (tiktoken), starting with a small dataset such as Tiny Shakespeare.
    • Implement a light data loader that serves batches of input/target token sequences.
    • Run the training loop with the AdamW optimizer and a warmup-plus-cosine-decay learning rate schedule.
    • Clip the global gradient norm and decay the learning rate over training (see the training-loop sketch after this list).
  5. Performance Optimization:

    • Use the GPU's Tensor Cores by switching float32 matrix multiplies to TF32 precision.
    • Transition the forward pass to bfloat16 (BF16) mixed precision for a further speed-up.
    • Employ torch.compile for kernel fusion and reduced Python overhead.
    • Integrate FlashAttention (via the fused scaled-dot-product-attention kernel) to optimize the attention computation.
    • Pad tensor dimensions to "nice" numbers, e.g. growing the vocabulary from 50257 to 50304, so kernels run on well-shaped matrices (see the sketch after this list).
  6. Multi-GPU Training:

    • Use Distributed Data Parallel (DDP) for multi-GPU setups.
    • Shard the data loader across processes and average gradients across ranks (see the DDP sketch after this list).
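
Step 2, sketched in PyTorch. The module names (c_attn, c_fc, c_proj, wte, wpe, h, ln_f, lm_head) follow the Hugging Face GPT-2 port so pretrained weights can be matched by name; the attention uses PyTorch's fused scaled_dot_product_attention, which is also how FlashAttention enters in step 5. This is a minimal sketch (dropout and the loss computation are omitted) and it reuses the GPTConfig from the architecture section above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # query, key, value projections for all heads, computed in a single matmul
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # fold the head dimension into the batch: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused causal attention; dispatches to a FlashAttention-style kernel when available
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # re-assemble the heads
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')   # GPT-2 used the tanh approximation of GELU
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    """Pre-normalization block: LayerNorm sits before attention and MLP, inside the residual path."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # residual connection around attention
        x = x + self.mlp(self.ln_2(x))    # residual connection around the MLP
        return x

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),   # token embeddings
            wpe=nn.Embedding(config.block_size, config.n_embd),   # learned position embeddings
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),                     # final layer norm
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx):                                        # idx: (B, T) token ids
        B, T = idx.size()
        pos = torch.arange(T, device=idx.device)
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)  # sum token and position embeddings
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        return self.lm_head(x)                                     # logits over the vocabulary
```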
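
Step 3, sketched as an init method to hang off the GPT module above. The 0.02 standard deviation follows the GPT-2 code; the down-scaling of residual projections follows the GPT-2 paper's note about scaling residual layers at initialization, and RESIDUAL_SCALE_INIT is just an illustrative marker attribute that the c_proj layers would set on themselves.

```python
import torch.nn as nn

# inside GPT.__init__, after the modules above are created:
#     self.transformer.wte.weight = self.lm_head.weight   # weight tying: shared input/output embedding
#     self.apply(self._init_weights)

def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # projections that feed the residual stream are scaled down so the
        # residual-stream variance does not grow with depth
        if getattr(module, 'RESIDUAL_SCALE_INIT', False):     # illustrative marker attribute
            std *= (2 * self.config.n_layer) ** -0.5
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
```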
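
Step 4 in one condensed sketch, under these assumptions: the data is a single pre-tokenized stream (e.g. Tiny Shakespeare encoded with tiktoken's GPT-2 BPE), input.txt is an illustrative filename, and the step counts and learning rates are toy values rather than a real training budget. The shape of the loop (AdamW, linear warmup then cosine decay, gradient-norm clipping) is what the step above describes.

```python
import math
import torch
import torch.nn.functional as F
import tiktoken

class DataLoaderLite:
    """Serves contiguous (x, y) token batches; y is x shifted one position to the right."""
    def __init__(self, tokens, B, T):
        self.tokens, self.B, self.T = tokens, B, T
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)                  # inputs
        y = buf[1:].view(B, T)                   # targets: the next token at every position
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0                         # wrap around at the end of the data
        return x, y

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=10, max_steps=50):
    """Linear warmup followed by cosine decay down to min_lr (toy step counts)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * ratio)) * (max_lr - min_lr)

# toy-scale usage
enc = tiktoken.get_encoding("gpt2")
with open("input.txt") as f:                     # e.g. Tiny Shakespeare
    tokens = torch.tensor(enc.encode(f.read()))
loader = DataLoaderLite(tokens, B=4, T=1024)

model = GPT(GPTConfig())                         # from the sketches above
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

for step in range(50):
    x, y = loader.next_batch()
    optimizer.zero_grad()
    logits = model(x)                            # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    for group in optimizer.param_groups:
        group['lr'] = get_lr(step)               # learning-rate schedule
    optimizer.step()
```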
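
Step 5, sketched for an Ampere-or-newer GPU: TF32 matmuls, bfloat16 autocast around the forward pass, torch.compile, and padding the vocabulary from 50,257 to 50,304 (a multiple of 128) so the large matmuls land on well-shaped kernels. FlashAttention is already handled by scaled_dot_product_attention in the attention sketch above.

```python
import torch
import torch.nn.functional as F

torch.set_float32_matmul_precision('high')      # allow TF32 on Tensor Cores for float32 matmuls

model = GPT(GPTConfig(vocab_size=50304))        # pad 50257 -> 50304, a "nice" multiple of 128
model.to('cuda')
model = torch.compile(model)                    # kernel fusion, less Python overhead

x, y = loader.next_batch()                      # loader from the training sketch above
x, y = x.to('cuda'), y.to('cuda')
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    logits = model(x)                           # forward pass runs mostly in BF16
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()                                 # backward stays outside the autocast context
```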
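
Step 6, sketched with PyTorch Distributed Data Parallel, assuming the script is launched via torchrun (which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables); the launch command and the data-sharding note are illustrative.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# launched with, e.g.:  torchrun --standalone --nproc_per_node=8 train_gpt2.py
dist.init_process_group(backend='nccl')
ddp_rank = int(os.environ['RANK'])               # global rank of this process
ddp_local_rank = int(os.environ['LOCAL_RANK'])   # GPU index on this machine
ddp_world_size = int(os.environ['WORLD_SIZE'])   # total number of processes
device = f'cuda:{ddp_local_rank}'
torch.cuda.set_device(device)
master_process = (ddp_rank == 0)                 # only this rank logs and writes checkpoints

model = GPT(GPTConfig(vocab_size=50304)).to(device)
model = DDP(model, device_ids=[ddp_local_rank])  # backward() averages gradients across ranks

# data loading: each rank starts at a different offset and strides by
# ddp_world_size * B * T, so the processes read disjoint slices of the token stream
# (the DataLoaderLite above would take ddp_rank and ddp_world_size as extra constructor arguments)

# ... training loop as before, using the wrapped model and the sharded loader ...

dist.destroy_process_group()
```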

Dataset and Evaluations

  • Transition to a larger, high-quality dataset, FineWeb-Edu, for more realistic pre-training.
  • Introduce HellaSwag, a sentence-completion benchmark, as an evaluation (see the scoring sketch below).
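
HellaSwag is a multiple-choice completion benchmark: each example has a context and four candidate endings, and the model is scored by which ending it assigns the lowest average per-token loss. A minimal sketch of that scoring rule (function and variable names here are illustrative, not from any particular evaluation harness):

```python
import torch
import torch.nn.functional as F

def candidate_loss(model, context_tokens, ending_tokens):
    """Average cross-entropy of the candidate ending given the context (lower is better)."""
    tokens = torch.cat([context_tokens, ending_tokens]).unsqueeze(0)    # (1, T)
    with torch.no_grad():
        logits = model(tokens)                                          # (1, T, vocab_size)
    shift_logits = logits[0, :-1]                 # position t predicts token t+1
    shift_targets = tokens[0, 1:]
    losses = F.cross_entropy(shift_logits, shift_targets, reduction='none')
    return losses[len(context_tokens) - 1:].mean()                      # score only the ending tokens

# the predicted answer is the ending with the lowest average loss:
# predicted = min(range(4), key=lambda i: candidate_loss(model, ctx_tokens, ending_tokens[i]))
```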

Evaluation and Results

  • Model evaluation using validation loss and HellaSwag accuracy.
  • Comparison with OpenAI's released GPT-2 124M checkpoint on validation loss and HellaSwag accuracy.

Conclusion

  • Successfully reproduced the GPT-2 124M model with efficient training techniques.
  • Matched or exceeded OpenAI's model performance on some evaluations using fewer tokens.
  • The process included significant optimizations in code execution and data handling.
  • Future improvements could focus on further data handling enhancements and evaluation strategies.