
Reproducing the GPT-2 Model Steps

Sep 5, 2024

Zero to Hero Series: Reproducing the GPT-2 Model

Overview

  • Aim: reproduce the GPT-2 124M-parameter model.
  • Released by OpenAI in 2019 with a blog post, paper, and GitHub code.
  • Focus on the 124M model, the smallest in a series of four sizes ranging up to 1558M (1.5B) parameters.
  • Scaling laws observed: larger models improve downstream metrics (translation, summarization, etc.).

GPT-2 Architecture

  • The 124M-parameter model has 12 transformer layers, 12 attention heads, and a 768-dimensional embedding (see the config sketch below).
  • The original GPT-2 code was released in TensorFlow; the Hugging Face Transformers library provides a PyTorch port that is easier to work with.
  • The learned positional embeddings, when plotted, show a wavy sinusoid-like structure with visible noise, a hint that the model is somewhat undertrained.
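
For reference, the 124M ("GPT-2 small") hyperparameters as a config sketch; the field names follow common PyTorch reimplementations rather than OpenAI's TensorFlow code, but the values are the published settings.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # maximum context length
    vocab_size: int = 50257   # 50,000 BPE merges + 256 byte tokens + <|endoftext|>
    n_layer: int = 12         # number of transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # embedding / residual-stream width
```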

Implementation Steps

  1. Loading GPT-2:

    • Use the Hugging Face library to load and understand model weights.
    • Inspect token and position embeddings.
  2. Reproducing the Architecture:

    • Develop from scratch in PyTorch.
    • Create the transformer container: token and position embeddings, an nn.ModuleList of blocks, and a final LayerNorm.
    • Implement each block with pre-normalization, multi-headed causal self-attention, and an MLP using the GELU activation (Gaussian Error Linear Unit); see the sketch after this list.
    • Keep the attention efficient by computing all heads with batched matrix operations rather than looping over heads.
  3. Weight Initialization:

    • Follow GPT-2's initialization scheme: weights drawn from a normal distribution with standard deviation 0.02, with projections into the residual stream scaled down so activations do not grow with depth.
    • Implement weight tying so the input token embedding and the output language-model head share the same matrix (see the initialization sketch after this list).
  4. Training Process:

    • Tokenize the data with the GPT-2 BPE tokenizer (tiktoken), starting with a small dataset such as Tiny Shakespeare.
    • Implement a light data loader that serves batches of input/target token sequences.
    • Run the training loop with the AdamW optimizer and a warmup-plus-cosine-decay learning rate schedule.
    • Clip the global gradient norm and decay the learning rate over training (see the training-loop sketch after this list).
  5. Performance Optimization:

    • Use the GPU's Tensor Cores by switching float32 matrix multiplies to TF32 precision.
    • Transition the forward pass to bfloat16 (BF16) mixed precision for a further speed-up.
    • Employ torch.compile for kernel fusion and reduced Python overhead.
    • Integrate FlashAttention (via the fused scaled-dot-product-attention kernel) to optimize the attention computation.
    • Pad tensor dimensions to "nice" numbers, e.g. growing the vocabulary from 50257 to 50304, so kernels run on well-shaped matrices (see the sketch after this list).
  6. Multi-GPU Training:

    • Use Distributed Data Parallel (DDP) for multi-GPU setups.
    • Shard the data loader across processes and average gradients across ranks (see the DDP sketch after this list).
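
Step 2, sketched in PyTorch. The module names (c_attn, c_fc, c_proj, wte, wpe, h, ln_f, lm_head) follow the Hugging Face GPT-2 port so pretrained weights can be matched by name; the attention uses PyTorch's fused scaled_dot_product_attention, which is also how FlashAttention enters in step 5. This is a minimal sketch (dropout and the loss computation are omitted) and it reuses the GPTConfig from the architecture section above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # query, key, value projections for all heads, computed in a single matmul
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # fold the head dimension into the batch: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused causal attention; dispatches to a FlashAttention-style kernel when available
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # re-assemble the heads
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')   # GPT-2 used the tanh approximation of GELU
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    """Pre-normalization block: LayerNorm sits before attention and MLP, inside the residual path."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # residual connection around attention
        x = x + self.mlp(self.ln_2(x))    # residual connection around the MLP
        return x

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),   # token embeddings
            wpe=nn.Embedding(config.block_size, config.n_embd),   # learned position embeddings
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),                     # final layer norm
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx):                                        # idx: (B, T) token ids
        B, T = idx.size()
        pos = torch.arange(T, device=idx.device)
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)  # sum token and position embeddings
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        return self.lm_head(x)                                     # logits over the vocabulary
```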
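
Step 3, sketched as an init method to hang off the GPT module above. The 0.02 standard deviation follows the GPT-2 code; the down-scaling of residual projections follows the GPT-2 paper's note about scaling residual layers at initialization, and RESIDUAL_SCALE_INIT is just an illustrative marker attribute that the c_proj layers would set on themselves.

```python
import torch.nn as nn

# inside GPT.__init__, after the modules above are created:
#     self.transformer.wte.weight = self.lm_head.weight   # weight tying: shared input/output embedding
#     self.apply(self._init_weights)

def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # projections that feed the residual stream are scaled down so the
        # residual-stream variance does not grow with depth
        if getattr(module, 'RESIDUAL_SCALE_INIT', False):     # illustrative marker attribute
            std *= (2 * self.config.n_layer) ** -0.5
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
```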
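
Step 4 in one condensed sketch, under these assumptions: the data is a single pre-tokenized stream (e.g. Tiny Shakespeare encoded with tiktoken's GPT-2 BPE), input.txt is an illustrative filename, and the step counts and learning rates are toy values rather than a real training budget. The shape of the loop (AdamW, linear warmup then cosine decay, gradient-norm clipping) is what the step above describes.

```python
import math
import torch
import torch.nn.functional as F
import tiktoken

class DataLoaderLite:
    """Serves contiguous (x, y) token batches; y is x shifted one position to the right."""
    def __init__(self, tokens, B, T):
        self.tokens, self.B, self.T = tokens, B, T
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)                  # inputs
        y = buf[1:].view(B, T)                   # targets: the next token at every position
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0                         # wrap around at the end of the data
        return x, y

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=10, max_steps=50):
    """Linear warmup followed by cosine decay down to min_lr (toy step counts)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * ratio)) * (max_lr - min_lr)

# toy-scale usage
enc = tiktoken.get_encoding("gpt2")
with open("input.txt") as f:                     # e.g. Tiny Shakespeare
    tokens = torch.tensor(enc.encode(f.read()))
loader = DataLoaderLite(tokens, B=4, T=1024)

model = GPT(GPTConfig())                         # from the sketches above
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

for step in range(50):
    x, y = loader.next_batch()
    optimizer.zero_grad()
    logits = model(x)                            # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    for group in optimizer.param_groups:
        group['lr'] = get_lr(step)               # learning-rate schedule
    optimizer.step()
```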
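
Step 5, sketched for an Ampere-or-newer GPU: TF32 matmuls, bfloat16 autocast around the forward pass, torch.compile, and padding the vocabulary from 50,257 to 50,304 (a multiple of 128) so the large matmuls land on well-shaped kernels. FlashAttention is already handled by scaled_dot_product_attention in the attention sketch above.

```python
import torch
import torch.nn.functional as F

torch.set_float32_matmul_precision('high')      # allow TF32 on Tensor Cores for float32 matmuls

model = GPT(GPTConfig(vocab_size=50304))        # pad 50257 -> 50304, a "nice" multiple of 128
model.to('cuda')
model = torch.compile(model)                    # kernel fusion, less Python overhead

x, y = loader.next_batch()                      # loader from the training sketch above
x, y = x.to('cuda'), y.to('cuda')
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    logits = model(x)                           # forward pass runs mostly in BF16
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()                                 # backward stays outside the autocast context
```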
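
Step 6, sketched with PyTorch Distributed Data Parallel, assuming the script is launched via torchrun (which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables); the launch command and the data-sharding note are illustrative.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# launched with, e.g.:  torchrun --standalone --nproc_per_node=8 train_gpt2.py
dist.init_process_group(backend='nccl')
ddp_rank = int(os.environ['RANK'])               # global rank of this process
ddp_local_rank = int(os.environ['LOCAL_RANK'])   # GPU index on this machine
ddp_world_size = int(os.environ['WORLD_SIZE'])   # total number of processes
device = f'cuda:{ddp_local_rank}'
torch.cuda.set_device(device)
master_process = (ddp_rank == 0)                 # only this rank logs and writes checkpoints

model = GPT(GPTConfig(vocab_size=50304)).to(device)
model = DDP(model, device_ids=[ddp_local_rank])  # backward() averages gradients across ranks

# data loading: each rank starts at a different offset and strides by
# ddp_world_size * B * T, so the processes read disjoint slices of the token stream
# (the DataLoaderLite above would take ddp_rank and ddp_world_size as extra constructor arguments)

# ... training loop as before, using the wrapped model and the sharded loader ...

dist.destroy_process_group()
```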

Dataset and Evaluations

  • Transition to a larger, high-quality dataset, FineWeb-Edu, for more realistic pre-training.
  • Introduce HellaSwag, a sentence-completion benchmark, as an evaluation (see the scoring sketch below).
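
HellaSwag is a multiple-choice completion benchmark: each example has a context and four candidate endings, and the model is scored by which ending it assigns the lowest average per-token loss. A minimal sketch of that scoring rule (function and variable names here are illustrative, not from any particular evaluation harness):

```python
import torch
import torch.nn.functional as F

def candidate_loss(model, context_tokens, ending_tokens):
    """Average cross-entropy of the candidate ending given the context (lower is better)."""
    tokens = torch.cat([context_tokens, ending_tokens]).unsqueeze(0)    # (1, T)
    with torch.no_grad():
        logits = model(tokens)                                          # (1, T, vocab_size)
    shift_logits = logits[0, :-1]                 # position t predicts token t+1
    shift_targets = tokens[0, 1:]
    losses = F.cross_entropy(shift_logits, shift_targets, reduction='none')
    return losses[len(context_tokens) - 1:].mean()                      # score only the ending tokens

# the predicted answer is the ending with the lowest average loss:
# predicted = min(range(4), key=lambda i: candidate_loss(model, ctx_tokens, ending_tokens[i]))
```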

Evaluation and Results

  • Model evaluation using validation loss and HellaSwag accuracy.
  • Comparison with OpenAI's released GPT-2 124M checkpoint on validation loss and HellaSwag accuracy.

Conclusion

  • Successfully reproduced the GPT-2 124M model with efficient training techniques.
  • Matched or exceeded OpenAI's model performance on some evaluations using fewer tokens.
  • The process included significant optimizations in code execution and data handling.
  • Future improvements could focus on further data handling enhancements and evaluation strategies.