Understanding Large Language Models
Sep 16, 2024
Building Large Language Models (LLMs)
Overview
LLMs: Large Language Models are AI models like ChatGPT, Claude, and Gemini.
The lecture covers:
Key components in training LLMs.
Pre-training and post-training paradigms.
Basic understanding of language modeling.
Key Components for Training LLMs
Architecture: Neural networks, particularly Transformers, are used.
Training Loss and Algorithm: The objective and optimization procedure used to train the model.
Data: The quality and quantity of data used for training.
Evaluation: Metrics to assess model performance.
Systems: Running models efficiently on modern hardware.
Pre-training vs. Post-training
Pre-training: General language modeling to understand internet text.
Post-training: Adapting LLMs to specific tasks, such as AI assistants (e.g., ChatGPT).
Language Modeling Basics
Language models estimate a probability distribution over sequences of tokens (roughly, words).
Generative Models: They can generate new sentences by sampling from the learned distribution.
Auto-regressive Language Models: Predict the next token based on the previous context, using the chain rule of probability for sequential predictions.
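The chain-rule factorization can be sketched in a few lines of Python; the conditional probabilities below are invented for illustration.

```python
import math

# Chain rule: p(w1, ..., wn) = prod_t p(w_t | w_1, ..., w_{t-1}).
# Toy conditional probabilities (invented for illustration).
cond_prob = {
    ("<s>",): {"the": 0.5, "a": 0.5},
    ("<s>", "the"): {"cat": 0.4, "dog": 0.6},
    ("<s>", "the", "cat"): {"sat": 1.0},
}

def sequence_log_prob(tokens):
    """Sum log p(w_t | prefix) over the sequence."""
    logp = 0.0
    context = ("<s>",)
    for tok in tokens:
        logp += math.log(cond_prob[context][tok])
        context = context + (tok,)
    return logp

p = math.exp(sequence_log_prob(["the", "cat", "sat"]))
# p = 0.5 * 0.4 * 1.0 = 0.2
```

Working in log space, as above, avoids numerical underflow when sequences get long.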
Tokenization
Tokenizers convert text into manageable pieces (tokens) for LLMs.
Tokens can be words, subwords, or characters.
Important for handling typos and different languages.
Byte Pair Encoding (BPE): A common tokenization algorithm that repeatedly merges the most frequent pair of adjacent tokens.
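One BPE merge step can be sketched as follows; the toy corpus and frequencies are invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("l", "o"): 1, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)   # ("l", "o"), occurring 8 times
corpus = merge_pair(corpus, pair)   # "l" and "o" now form one token "lo"
```

A real tokenizer repeats this loop until a target vocabulary size is reached, recording the merges in order so they can be replayed on new text.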
Evaluation of LLMs
Perplexity is a standard metric to evaluate LLM performance.
Evaluation challenges include:
Different evaluation methodologies can yield inconsistent results.
Train-test contamination: test data that was inadvertently seen during training.
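Perplexity, mentioned above, is the exponentiated average negative log-likelihood per token; a minimal sketch (the per-token probabilities are invented):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum_t log p(token_t | context))."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# Probabilities the model assigned to each observed token (invented values).
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs))  # 4.0: equivalent to a uniform choice among 4 tokens
```

Intuitively, perplexity is the effective branching factor: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.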
Benchmarking: Common NLP benchmarks are used for evaluation (e.g., MMLU, HELM).
Data Collection for LLMs
Data is collected by crawling the internet (~250 billion pages).
Steps in data processing include:
Text extraction from HTML.
Filtering undesirable content (e.g., toxic content and personally identifiable information, PII).
Deduplication of repeated content.
Heuristic filtering for low-quality documents.
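The deduplication and heuristic-filtering steps above can be sketched as a small pipeline; the exact-hash dedup and length threshold here are simplified placeholders, not the filters used in practice.

```python
import hashlib

def dedup_and_filter(documents, min_words=5):
    """Exact deduplication via content hashing plus a simple length heuristic."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        if len(doc.split()) < min_words:
            continue  # heuristic filter: too-short documents are low quality
        kept.append(doc)
    return kept

docs = ["the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",  # exact duplicate
        "click here"]                                   # too short
print(dedup_and_filter(docs))  # keeps only the first document
```

Production pipelines additionally use near-duplicate detection (e.g., MinHash) and learned quality classifiers, which this sketch omits.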
Scaling and Optimization
Scaling Laws: More data and larger models lead to better performance in a predictable, power-law fashion.
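A common functional form for such scaling laws (Chinchilla-style) writes the loss as a constant plus power-law terms in parameter count and token count; the coefficients below are invented for illustration, not fitted values.

```python
# L(N, D) = E + A / N**alpha + B / D**beta
# where N = number of parameters and D = number of training tokens.
# All coefficients are invented placeholders for illustration.
E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

small = loss(1e9, 2e10)     # 1B params trained on 20B tokens
large = loss(7e10, 1.4e12)  # 70B params trained on 1.4T tokens
assert large < small  # more data and a larger model -> lower predicted loss
```

Fitting such a curve on small training runs lets practitioners predict the loss of much larger runs before spending the compute.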
Computational efficiency is critical due to the size of data and models.
Low Precision Training: Using 16-bit instead of 32-bit floating point to speed up computation and reduce memory traffic.
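The precision cost of 16-bit floats can be seen by round-tripping a value through IEEE-754 half precision, which Python's `struct` module exposes via the `"e"` format:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE-754 half precision (16 bits)."""
    return struct.unpack("e", struct.pack("e", x))[0]

x = 3.141592653589793
print(to_fp16(x))  # ~3.140625: half precision keeps only ~3 decimal digits
# Half the bits means half the memory traffic and faster GPU arithmetic,
# at the cost of precision; in practice, mixed-precision training keeps
# sensitive operations (e.g., loss accumulation) in 32 bits.
```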
Operator Fusion: Combines consecutive GPU operations into a single kernel, reducing round trips between GPU memory and compute units.
Post-training Techniques
Supervised Fine-tuning (SFT): Fine-tuning on specific human-generated examples.
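SFT reuses the same next-token cross-entropy loss as pre-training, but computed only on the response tokens of human-written (prompt, response) pairs; a toy sketch with invented probabilities:

```python
import math

# Toy (prompt, response) example; loss is computed only on response tokens.
tokens      = ["<user>", "2+2?", "<assistant>", "4",  "<eos>"]
is_response = [False,    False,  False,         True, True]
# Probability the model assigned to each token (invented values).
model_probs = [0.9,      0.8,    0.9,           0.6,  0.7]

# Mean negative log-likelihood over response tokens only (prompt is masked out).
nll = [-math.log(p) for p, r in zip(model_probs, is_response) if r]
sft_loss = sum(nll) / len(nll)
```

Masking the prompt means the model is trained to produce the assistant's answer, not to reproduce the user's question.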
Reinforcement Learning from Human Feedback (RLHF): Aligning model behavior with human preferences by collecting human judgments comparing model outputs.
DPO (Direct Preference Optimization): A simpler alternative to RLHF that directly maximizes the likelihood of preferred outputs over dispreferred ones.
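The DPO objective can be sketched as a logistic loss on the policy's preference margin relative to a frozen reference model; the log-probabilities below are invented for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where w is the preferred (winning) response and l the dispreferred one."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Invented log-probabilities: the policy prefers the chosen answer more
# strongly than the frozen reference model does, so the loss drops below
# the no-preference baseline of log(2).
loss_good = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
loss_flat = dpo_loss(logp_w=-6.0, logp_l=-8.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
assert loss_good < loss_flat  # widening the margin lowers the loss
```

Because the loss depends only on log-probabilities, DPO needs no separate reward model or reinforcement-learning loop, which is what makes it simpler than RLHF.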
Challenges in Post-training
It is difficult to generate ideal reference answers, and human labelers introduce biases into preference data.
Future Directions and Considerations
The lecture emphasizes the importance of systems, data, and effective architecture in building scalable LLMs.
It also touches upon various complexities and ethical considerations in deploying LLMs.
Recommended Courses
CS224N: Historical context of LLMs.
CS324: In-depth exploration of LLMs.
CS336: Hands-on experience building LLMs.