How to Build a Large Language Model from Scratch
Jul 22, 2024
Introduction
Presenter: Sha
Sixth video in a series about large language models (LLMs)
Overview of building LLMs from scratch
Changing Landscape of LLMs
One year ago, building LLMs was a niche activity, mainly confined to AI research
Post-ChatGPT, interest from businesses and enterprises has increased significantly
Example: BloombergGPT for finance
Key Point: Usually, using prompt engineering or fine-tuning is more practical than building from scratch
Financial Costs
Llama 2 Example:
7 billion parameters: 180,000 GPU hours
70 billion parameters: 1.7 million GPU hours
Cost Estimation:
Renting GPUs: $1 to $2 per A100 GPU hour
10 billion parameter model: ~$150,000
100 billion parameter model: ~$1.5 million
Buying hardware: 1,000 A100 GPUs cost ~$10 million
Additional energy costs: ~$100,000 for large models
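A rough back-of-the-envelope check on these figures (my own illustration, multiplying the Llama 2 GPU-hour counts by the quoted $1 to $2 A100 rental rate):

```python
# Back-of-the-envelope training cost: GPU hours x rental rate per GPU hour.
# GPU-hour figures are the Llama 2 numbers above; $1-$2 is the quoted A100 rate.
gpu_hours = {"Llama 2 7B": 180_000, "Llama 2 70B": 1_700_000}

for model_name, hours in gpu_hours.items():
    low, high = hours * 1.0, hours * 2.0
    print(f"{model_name}: ~${low:,.0f} to ~${high:,.0f}")

# Llama 2 7B: ~$180,000 to ~$360,000
# Llama 2 70B: ~$1,700,000 to ~$3,400,000
```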
Technical Aspects
1. Data Curation
Importance: Quality of data dictates the quality of the model
Challenges: Requires very large datasets (trillions of tokens)
Sources:
Internet (web pages, Wikipedia, forums)
Public datasets (e.g., Common Crawl, C4, The Pile)
Private datasets (e.g., Bloomberg's FinPile)
Using an LLM to generate training data (e.g., the Alpaca model)
Data Diversity: Diverse datasets enable models to perform well across a variety of tasks
Data Preparation Steps:
Quality Filtering: Removing non-helpful text (e.g., gibberish, hate speech)
Deduplication: Avoiding biases from duplicate texts
Privacy Redaction: Removing sensitive information
Tokenization: Translating text into numbers using methods like byte-pair encoding (see the sketch below)
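To make the tokenization step concrete, here is a minimal sketch that encodes text into integer token IDs with a pretrained byte-pair-encoding vocabulary via the tiktoken library (my choice of library; the video does not prescribe one):

```python
# Minimal tokenization example using a pretrained byte-pair-encoding (BPE)
# vocabulary from the tiktoken library (assumed installed: pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a widely used BPE vocabulary

text = "Building an LLM starts with turning text into token IDs."
token_ids = enc.encode(text)        # text -> list of integers
decoded = enc.decode(token_ids)     # integers -> original text

print(token_ids)                    # a short list of integer IDs
print(len(token_ids), "tokens")
assert decoded == text
```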
2. Model Architecture
Transformer Architecture:
Types: Encoder-only, Decoder-only, Encoder-Decoder
Popularity: Decoder-only is the most common for LLMs
Design Considerations:
Residual Connections: Let intermediate values bypass one or more layers, which helps stabilize training
Layer Normalization: Pre-layer vs. post-layer normalization placement
Activation Functions: Common choices include GeLU, Swish, and GLU
Position Embeddings: Absolute vs. relative positional encoding
Size: Balancing underfitting and overfitting; rule of thumb is ~20 training tokens per model parameter
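As a sketch of how several of these design choices fit together, the block below implements a single decoder-only transformer layer in PyTorch with pre-layer normalization, residual connections, and a GeLU feed-forward sublayer; the dimensions are illustrative defaults, not values from any particular model:

```python
# One decoder-only transformer block (illustrative sketch, not a specific model).
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # pre-layer norm before attention
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)   # pre-layer norm before feed-forward
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),                     # GeLU activation
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Causal mask: each position may only attend to earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                   # residual connection around attention
        x = x + self.ff(self.ln2(x))       # residual connection around feed-forward
        return x

block = DecoderBlock()
x = torch.randn(4, 128, 512)               # batch of 4 sequences, 128 tokens each
print(block(x).shape)                      # torch.Size([4, 128, 512])
```

Stacking many such blocks, plus token and position embeddings and an output head, gives the full decoder-only architecture; the 20-tokens-per-parameter rule of thumb then suggests the data budget (e.g., roughly 200 billion training tokens for a 10-billion-parameter model).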
3. Training the Model
Challenges: Managing computational costs and training stability
Techniques:
Mixed Precision Training: Using both 32-bit and 16-bit floating-point numbers (see the sketch after this list)
3D Parallelism: Combining pipeline, model, and data parallelism
Zero Redundancy Optimizer (ZeRO)
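A minimal sketch of mixed precision training with PyTorch's automatic mixed precision (AMP) utilities, assuming a CUDA GPU; the model, data, and loss are placeholders:

```python
# Mixed precision training sketch: forward/backward largely in 16-bit,
# master weights and optimizer state in 32-bit. Requires a CUDA GPU.
import torch
from torch import nn

model = nn.Linear(512, 512).cuda()                 # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()               # scales the loss to avoid fp16 underflow

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    target = torch.randn(32, 512, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                # ops run in reduced precision where safe
        loss = nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()                  # backward pass on the scaled loss
    scaler.step(optimizer)                         # unscales gradients, then steps
    scaler.update()
```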
Stability Strategies:
Checkpointing: Regular snapshots of the model state
Weight Decay: Regularization that penalizes large parameter values
Gradient Clipping: Rescaling gradients to prevent the exploding gradient problem (all three are sketched below)
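All three strategies can be seen in a simple, illustrative PyTorch training step: weight decay set on the optimizer, gradient clipping before each update, and periodic checkpoints of model and optimizer state (file names and intervals are my own placeholders):

```python
# Illustrative training step showing checkpointing, weight decay, and gradient clipping.
import torch
from torch import nn

model = nn.Linear(512, 512)                        # placeholder model
# Weight decay: L2-style penalty on parameters, configured on the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

for step in range(1, 3001):
    x, target = torch.randn(32, 512), torch.randn(32, 512)
    optimizer.zero_grad(set_to_none=True)
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()

    # Gradient clipping: rescale gradients so their global norm never exceeds 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    # Checkpointing: snapshot model and optimizer state so training can resume
    # from the last good point after a crash or a loss spike.
    if step % 1000 == 0:
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"checkpoint_{step}.pt")
```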
Hyperparameters:
Batch size: Typically large, can be dynamic
Learning rate: Often follows a dynamic schedule (see the sketch after this list)
Optimizer: Adam-based optimizers are common
Dropout: Typical values range from 0.2 to 0.5
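As an example of a dynamic learning-rate schedule, the sketch below pairs an Adam-family optimizer with a linear warmup followed by cosine decay, built from standard PyTorch schedulers; the step counts and peak learning rate are illustrative values, not taken from the video:

```python
# Linear warmup followed by cosine decay, using built-in PyTorch schedulers.
import torch
from torch import nn
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(512, 512)                                 # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # peak learning rate

warmup_steps, total_steps = 1_000, 10_000
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),  # warmup
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),    # cosine decay
    ],
    milestones=[warmup_steps],
)

for step in range(total_steps):
    loss = nn.functional.mse_loss(model(torch.randn(32, 512)), torch.randn(32, 512))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()                                # advance the learning-rate schedule
```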
4. Evaluating the Model
Benchmark Datasets: ARC, HellaSwag, MMLU, TruthfulQA
Evaluation Techniques:
Multiple-choice tasks: Using prompt templates and comparing the model's probabilities across answer choices (see the sketch after this list)
Open-ended tasks: Human evaluation, NLP metrics, or auxiliary fine-tuned models
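A sketch of the multiple-choice approach: wrap the question in a prompt template, score each candidate answer by the model's log-probability of the answer tokens, and pick the highest. GPT-2 from Hugging Face transformers is used here purely as a small stand-in model; the template and choices are made up for illustration:

```python
# Multiple-choice evaluation by comparing answer log-probabilities under the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]

def answer_logprob(prompt, answer):
    """Sum of log-probabilities of the answer tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row i predicts token i+1
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    targets = full_ids[0, prompt_len:]
    return sum(log_probs[pos, tok] for pos, tok in zip(positions, targets)).item()

scores = {c: answer_logprob(prompt, c) for c in choices}
print(max(scores, key=scores.get))  # highest-probability choice, expected " Paris"
```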
Post-Training Considerations
Practical Applications:
Prompt Engineering: Customizing model outputs for specific tasks
Model Fine-Tuning: Adapting the pre-trained model for specific use cases
Resources: Links to previous videos for more details on prompt engineering and fine-tuning
Conclusion
Importance of understanding financial and technical aspects of building LLMs
Using a pre-existing model with prompt engineering or fine-tuning is usually more practical than building from scratch
Encouragement to check out prior videos and provide feedback