
How to Build a Large Language Model from Scratch

Jul 22, 2024


Introduction

  • Presenter: Sha
  • Sixth video in a series about large language models (LLMs)
  • Overview of building LLMs from scratch

Changing Landscape of LLMs

  • A year ago, building an LLM was a niche activity, mostly confined to AI research
  • Since ChatGPT, interest from businesses and enterprises has grown sharply
  • Example: BloombergGPT, an LLM built from scratch for finance
  • Key Point: For most use cases, prompt engineering or fine-tuning an existing model is more practical than building from scratch

Financial Costs

  • Llama 2 Example:
    • 7 billion parameters: 180,000 GPU hours
    • 70 billion parameters: 1.7 million GPU hours
  • Cost Estimation (a back-of-the-envelope sketch follows this list):
    • Renting GPUs: $1 to $2 per A100 GPU hour
      • 10 billion parameter model: ~$150,000
      • 100 billion parameter model: ~$1.5 million
    • Buying hardware: 1,000 A100 GPUs cost ~$10 million
    • Additional energy costs: ~$100,000 for large models
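
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch that multiplies the Llama 2 GPU-hour figures by an assumed rental price of $1 to $2 per A100-hour. The prices and hardware figures are illustrative assumptions, not vendor quotes.

```python
# Back-of-the-envelope cost estimate: GPU hours x assumed rental price.
# All dollar figures are illustrative assumptions, not vendor quotes.
llama2_gpu_hours = {"7B": 180_000, "70B": 1_700_000}   # reported GPU hours
price_per_a100_hour = (1.0, 2.0)                        # assumed $/A100-hour range

for size, hours in llama2_gpu_hours.items():
    low, high = (hours * p for p in price_per_a100_hour)
    print(f"Llama 2 {size}: ~${low:,.0f} to ${high:,.0f} in rented compute")

# Buying instead of renting: ~1,000 A100s at roughly $10,000 each is on the
# order of $10 million up front, plus roughly $100,000 in energy for a large run.
```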

Technical Aspects

1. Data Curation

  • Importance: Quality of data dictates quality of model
  • Challenges: Requires large datasets (trillions of tokens)
  • Sources:
    • Internet (web pages, Wikipedia, forums)
    • Public datasets (e.g., Common Crawl, C4, The Pile)
    • Private datasets (e.g., Bloomberg's FinPile)
    • Using an LLM to generate training data (e.g., the Alpaca model)
  • Data Diversity: Diverse datasets enable models to perform well in various tasks
  • Data Preparation Steps:
    • Quality Filtering: Removing non-helpful text (e.g., gibberish, hate speech)
    • Deduplication: Avoiding biases from duplicate texts
    • Privacy Redaction: Removing sensitive information
    • Tokenization: Translating text into numbers using methods like byte-pair encoding (see the tokenizer sketch after this list)
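
As a small illustration of the tokenization step, here is a sketch using the tiktoken library as one off-the-shelf BPE implementation; any BPE tokenizer would show the same idea of mapping text to integer token IDs.

```python
# Minimal byte-pair-encoding (BPE) tokenization example using the tiktoken
# library (pip install tiktoken). Text is mapped to a sequence of integer
# token IDs, which is what the model actually consumes during training.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one publicly available BPE vocabulary

text = "Building an LLM starts with turning text into numbers."
token_ids = enc.encode(text)

print(token_ids)              # integer IDs, one per sub-word token
print(enc.decode(token_ids))  # decoding round-trips back to the original string
```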

2. Model Architecture

  • Transformer Architecture:
    • Types: Encoder-only, Decoder-only, Encoder-Decoder
    • Popularity: Decoder-only is most common for LLMs
  • Design Considerations (a minimal decoder block is sketched after this list):
    • Residual Connections: Intermediate values bypass hidden layers
    • Layer Normalization: Pre-layer vs. post-layer normalization (pre-layer norm is the more common choice in modern LLMs)
    • Activation Functions: Common choices include GeLU, Swish, and GLU
    • Position Embeddings: Absolute vs. relative positional encoding
    • Size: Balancing underfitting and overfitting; a common rule of thumb (from the Chinchilla scaling results) is roughly 20 training tokens per model parameter
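
To make these design choices concrete, below is a minimal PyTorch sketch of a decoder-style block with a causal attention mask, pre-layer normalization, residual connections, and a GeLU feed-forward layer. The dimensions are arbitrary placeholders, and position embeddings and the surrounding model are omitted.

```python
# Minimal decoder-only transformer block: pre-layer norm, residual connections,
# causal masking, and a GeLU feed-forward layer. Illustrative sketch only.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)  # pre-layer norm before attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)  # pre-layer norm before the MLP
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),                    # GeLU activation
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may only attend to itself and earlier positions.
        T = x.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        # Residual connection around self-attention.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        # Residual connection around the feed-forward network.
        x = x + self.mlp(self.ln2(x))
        return x

# Example: a batch of 2 sequences, 16 tokens each, embedding size 512.
block = DecoderBlock()
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```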

3. Training the Model

  • Challenges: Managing computational costs and training stability
  • Techniques (several are combined in the training-loop sketch after this list):
    • Mixed Precision Training: Using both 32-bit and 16-bit floating-point numbers
    • 3D Parallelism: Pipeline, Model, and Data Parallelism
    • Zero Redundancy Optimizer (ZeRO): Shards optimizer states, gradients, and parameters across data-parallel workers
  • Stability Strategies:
    • Checkpointing: Regular snapshots of model state
    • Weight Decay: Regularization to penalize large parameters
    • Gradient Clipping: Rescaling gradients to prevent the exploding gradient problem
  • Hyperparameters:
    • Batch size: Typically large, can be dynamic
    • Learning rate: Often follows a dynamic schedule
    • Optimizer: Adam-based optimizers (e.g., AdamW) are common
    • Dropout: Typical values range from 0.2 to 0.5
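
The sketch below combines several of these techniques in a simplified single-GPU training loop: mixed-precision training, AdamW with weight decay, gradient clipping, a cosine learning-rate schedule, and periodic checkpointing. The model, data loader, and hyperparameter values are assumed placeholders; 3D parallelism and ZeRO are omitted because they require a distributed framework such as DeepSpeed or Megatron-LM.

```python
# Simplified single-GPU training loop illustrating mixed precision, weight decay,
# gradient clipping, a dynamic learning-rate schedule, and checkpointing.
# `model` and `data_loader` are assumed to be provided by the caller.
import torch

def train(model, data_loader, steps=10_000, lr=3e-4):
    model.cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)  # weight decay
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

    for step, (inputs, targets) in enumerate(data_loader):
        if step >= steps:
            break
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad(set_to_none=True)

        # Mixed precision: run the forward pass in fp16 where it is safe.
        with torch.cuda.amp.autocast():
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1)
            )

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()  # dynamic learning-rate schedule

        # Checkpointing: periodic snapshots so a failed run can be resumed.
        if step % 1_000 == 0:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, f"checkpoint_{step}.pt")
```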

4. Evaluating the Model

  • Benchmark Datasets: ARC, HellaSwag, MMLU, TruthfulQA
  • Evaluation Techniques:
    • Multiple-choice tasks: Using prompt templates and comparing the probabilities the model assigns to each candidate answer (see the scoring sketch after this list)
    • Open-ended tasks: Human evaluation, NLP metrics, auxiliary fine-tuned models
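
As a minimal illustration of multiple-choice evaluation, the sketch below formats a question with a simple prompt template, scores each candidate answer by the log-probability a causal language model assigns to it, and picks the highest-scoring choice. The model name, question, and template are placeholders; benchmarks such as ARC or MMLU apply the same idea with task-specific templates.

```python
# Score each multiple-choice answer by the log-probability the model assigns to
# its tokens given the prompt, then pick the argmax. "gpt2" is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "What color is a clear daytime sky?"
choices = ["Blue", "Green", "Red", "Yellow"]
prompt = f"Question: {question}\nAnswer:"

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the answer tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = full_ids[0, 1:]
    start = prompt_ids.size(1) - 1                          # where the answer tokens begin
    return log_probs[start:].gather(1, targets[start:].unsqueeze(1)).sum().item()

scores = {c: choice_logprob(prompt, c) for c in choices}
print(max(scores, key=scores.get))  # the model's predicted answer
```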

Post-Training Considerations

  • Practical Applications:
    • Prompt Engineering: Customizing model outputs for specific tasks
    • Model Fine-Tuning: Adapting the pre-trained model for specific use-cases
  • Resources: Links to previous videos for more details on prompt engineering and fine-tuning

Conclusion

  • Importance of understanding financial and technical aspects of building LLMs
  • For most applications, prompt engineering or fine-tuning an existing model is more practical than training from scratch
  • Encouragement to check out prior videos and provide feedback