Intro to Language Modeling

Jul 18, 2024


Course Overview

  • Goal: Build your own large language model (LLM) from scratch.
  • Elements: Data handling, math, transformers.
  • Created by: Elliot Arledge.
  • No advanced calculus or linear algebra prerequisites required.
  • Inspired by Andrej Karpathy’s GPT lecture.
  • Basic Python (3 months experience) recommended.
  • All computation and data handling are done locally.

Tools & Setup

  • Using Jupyter Notebooks and Anaconda Prompt.
  • Setting up a virtual environment with CUDA for GPU acceleration.
  • Essential Python libraries: matplotlib, numpy, pylzma, ipykernel, jupyter.
  • Install PyTorch built with CUDA support (a quick check follows this list).
  • Use a compatible Python version (the course recommends 3.10.9).
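
A minimal sanity check that the CUDA-enabled PyTorch install is working (this snippet is illustrative, not part of the course materials):

```python
import torch

# Confirm the CUDA build of PyTorch can see the GPU; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(torch.__version__, device)
```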

Data and Environment Preparation

  • Uses a Wizard of Oz text file as a small sample dataset.
  • Load, read, and clean text data.
  • Tokenization: Character-level and subword-level explained.
  • Conversion of text into integer tokens.
  • Tokens are loaded into PyTorch tensors for efficient computation (see the sketch below).
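
A character-level tokenizer in the spirit of the course (the filename wizard_of_oz.txt is a placeholder; the exact name used in the course may differ):

```python
import torch

# Build a character-level vocabulary and encode the text as integer tokens.
with open("wizard_of_oz.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [string_to_int[c] for c in s]
decode = lambda ids: "".join(int_to_string[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)
print(data[:20])
```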

Language Model Basics

Bigram Language Model

  • Train and validation splits (80% train, 20% validation).
  • Explained with a “Hello” example (given “H”, predict “E”).
  • PyTorch tensors are used to hold the data (a small sketch follows this list).
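
A sketch of the split and of how context/target pairs arise, assuming `data` is the integer tensor built earlier:

```python
# 80/20 train/validation split of the encoded text.
n = int(0.8 * len(data))
train_data, val_data = data[:n], data[n:]

# For each position, the model sees the characters so far and must predict the next one.
block_size = 8
x = train_data[:block_size]
y = train_data[1:block_size + 1]
for t in range(block_size):
    context, target = x[:t + 1], y[t]
    print(f"when input is {context.tolist()} the target is {target.item()}")
```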

Training Loop

  • Hyperparameters: block_size, batch_size.
  • Train/validation split makes it possible to detect memorization (overfitting).
  • Batching to exploit GPU parallelism and scale to larger data.
  • Training uses the AdamW optimizer over many iterations.
  • Loss is computed and reported with PyTorch (a loop sketch follows this list).
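
A minimal training-loop sketch, assuming a model whose forward pass returns (logits, loss) and a get_batch(split) helper as in the course's code; the hyperparameter values are placeholders:

```python
import torch

max_iters, eval_interval, learning_rate = 3000, 250, 3e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    xb, yb = get_batch("train")            # one batch of contexts and targets
    logits, loss = model(xb, yb)           # forward pass
    optimizer.zero_grad(set_to_none=True)  # clear old gradients
    loss.backward()                        # backpropagate
    optimizer.step()                       # update parameters
    if step % eval_interval == 0:
        print(step, loss.item())           # report training loss periodically
```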

Transformer Models

  • Inspired by the “Attention Is All You Need” paper (Vaswani et al., 2017).

Architecture Overview

  • Inputs, Embeddings, Positional Encodings.
  • Multi-head self-attention mechanism.
  • Feed-forward networks within transformer blocks.
  • Residual connections and layer normalization (a block sketch follows this list).
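
A sketch of one transformer block tying these pieces together; MultiHeadAttention and FeedForward are assumed to be defined as in the course, and the pre-norm ordering shown here is one common variant (the course may place LayerNorm after the residual instead):

```python
import torch.nn as nn

class Block(nn.Module):
    # One decoder block: self-attention + feed-forward, each with a residual connection.
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual around multi-head attention
        x = x + self.ffwd(self.ln2(x))  # residual around feed-forward network
        return x
```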

Multi-head Attention

  • Each head attends to different aspects of the input (different “perspectives”).
  • Scaled Dot-Product Attention with keys, queries, values.
  • Parallel execution for efficiency on GPUs.
  • Attention scores come from the query–key dot product, followed by scaling, a causal masked fill, and softmax (see the sketch below).
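
A single attention head showing scaled dot-product attention with a causal mask; n_embd, block_size, and dropout are assumed to be hyperparameters defined elsewhere, as in the course:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: each position may only attend to earlier positions.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape                                              # batch, time, channels
        k, q = self.key(x), self.query(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5            # scaled dot product
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))   # causal masked fill
        wei = self.dropout(F.softmax(wei, dim=-1))                     # attention weights
        return wei @ self.value(x)                                     # weighted sum of values
```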

Key Classes in PyTorch

  • nn.Module subclasses: GPTLanguageModel, Block, FeedForward, MultiHeadAttention (how the attention and feed-forward pieces fit together is sketched below).
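
A sketch of MultiHeadAttention and FeedForward consistent with the class names above; the Head class, n_embd, and the 4× hidden expansion are assumptions in line with common GPT implementations:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # heads run in parallel, then concatenate
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # project back
        )

    def forward(self, x):
        return self.net(x)
```

GPTLanguageModel then stacks several Block modules on top of token and positional embedding tables, with a final LayerNorm and linear layer producing vocabulary logits.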

Model Training

Parameter Initialization

  • Weights are initialized from a normal distribution with a small standard deviation.
  • Good initialization speeds up training and helps avoid exploding/vanishing gradients.
  • PyTorch initialization functions (e.g. nn.init.normal_) matter in practice (see the sketch below).
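
A weight-initialization sketch in the GPT-2 style; std=0.02 is a common convention, and the exact value used in the course may differ:

```python
import torch.nn as nn

def _init_weights(module):
    # Normal initialization with a small standard deviation for linear and embedding layers.
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# Typically applied once in the model's __init__ via: self.apply(_init_weights)
```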

Data Handling and Processing

  • Loading a very large corpus (OpenWebText) with memory mapping instead of reading it all into RAM.
  • The corpus is split into separate training and validation files (a sampling sketch follows this list).
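
A sketch of sampling a random chunk from a huge text file via memory mapping; the filename and chunk size are placeholders for however the split files are named:

```python
import mmap
import random

def random_chunk(path="train_split.txt", chunk_size=8 * 1024):
    # Memory-map the file so only the requested slice is actually read from disk.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = random.randint(0, len(mm) - chunk_size)
            block = mm[start:start + chunk_size]
    return block.decode("utf-8", errors="ignore")

print(random_chunk()[:200])
```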

Model Evaluation and Saving

  • Save the model using pickle for serialization.
  • Loading models for continued training or inference.
  • Timing runs with the time module to gauge efficiency (see the sketch below).
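
A sketch of pickle-based save/load plus simple timing; the filename is a placeholder and `model` is assumed to be the trained GPTLanguageModel:

```python
import pickle
import time

# Serialize the trained model to disk.
with open("model-01.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload it later for continued training or inference.
with open("model-01.pkl", "rb") as f:
    model = pickle.load(f)

start = time.time()
# ... run a generation or a few training iterations here ...
print(f"elapsed: {time.time() - start:.2f}s")
```

In wider PyTorch practice, torch.save(model.state_dict(), path) is the more common serialization route, but pickling the whole model keeps this course setup simple.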

Extensions & Tips

  • Fine-tuning explained (different from pre-training: uses prompt-completion pairs).
  • Quantization to reduce memory usage: e.g. 16-bit and 4-bit weights.
  • Gradient accumulation to simulate larger batches when memory is limited (a sketch follows this list).
  • Hugging Face Integration: Explore pre-trained models and datasets.
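
A minimal gradient-accumulation sketch, reusing the model, optimizer, and get_batch assumed in the training-loop sketch above:

```python
accumulation_steps = 4  # effective batch size = accumulation_steps * batch_size

optimizer.zero_grad(set_to_none=True)
for micro_step in range(accumulation_steps):
    xb, yb = get_batch("train")
    logits, loss = model(xb, yb)
    (loss / accumulation_steps).backward()  # scale so accumulated gradients average correctly
optimizer.step()  # one parameter update after several small batches
```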

Key Points and Best Practices

  • Understanding the base architecture is vital for debugging and efficiency.
  • Experiment with learning rates and other hyperparameters as training progresses.
  • Always ensure your environment configurations are compatible with the models.
  • Document your training process and understand each phase’s output for further adjustments.