Understanding Large Language Models
Sep 16, 2024
Building Large Language Models (LLMs)
Overview
LLMs: Large Language Models are AI models like ChatGPT, Claude, and Gemini.
The lecture covers:
Key components in training LLMs.
Pre-training and post-training paradigms.
Basic understanding of language modeling.
Key Components for Training LLMs
Architecture: Neural networks, particularly Transformers, are used.
Training Loss and Algorithm: The objective and optimization procedure used to train the model.
Data: The quality and quantity of data used for training.
Evaluation: Metrics to assess model performance.
Systems: Running models efficiently on modern hardware.
Pre-training vs. Post-training
Pre-training: General language modeling to understand internet text.
Post-training: Adapting LLMs to specific tasks, such as AI assistants (e.g., ChatGPT).
Language Modeling Basics
Language models estimate a probability distribution over sequences of tokens (roughly, words).
Generative Models: They can generate new sentences by sampling from the learned distribution.
Auto-regressive Language Models: Predict the next token based on the previous context, using the chain rule of probability for sequential predictions.
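The chain-rule factorization can be sketched in a few lines of Python; the conditional probabilities below are invented for illustration.

```python
import math

# Chain rule: p(w1, ..., wn) = prod_t p(w_t | w_1, ..., w_{t-1}).
# Toy conditional probabilities (invented for illustration).
cond_prob = {
    ("<s>",): {"the": 0.5, "a": 0.5},
    ("<s>", "the"): {"cat": 0.4, "dog": 0.6},
    ("<s>", "the", "cat"): {"sat": 1.0},
}

def sequence_log_prob(tokens):
    """Sum log p(w_t | prefix) over the sequence."""
    logp = 0.0
    context = ("<s>",)
    for tok in tokens:
        logp += math.log(cond_prob[context][tok])
        context = context + (tok,)
    return logp

p = math.exp(sequence_log_prob(["the", "cat", "sat"]))
# p = 0.5 * 0.4 * 1.0 = 0.2
```

Working in log space, as above, avoids numerical underflow when sequences get long.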
Tokenization
Tokenizers convert text into manageable pieces (tokens) for LLMs.
Tokens can be words, subwords, or characters.
Important for handling typos and different languages.
Byte Pair Encoding (BPE): A common tokenization algorithm that repeatedly merges the most frequent pair of adjacent tokens.
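One BPE merge step can be sketched as follows; the toy corpus and frequencies are invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("l", "o"): 1, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)   # ("l", "o"), occurring 8 times
corpus = merge_pair(corpus, pair)   # "l" and "o" now form one token "lo"
```

A real tokenizer repeats this loop until a target vocabulary size is reached, recording the merges in order so they can be replayed on new text.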
Evaluation of LLMs
Perplexity is a standard metric to evaluate LLM performance.
Evaluation challenges include:
Different evaluation methodologies can yield inconsistent results.
Train-test contamination: test data that was inadvertently seen during training.
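Perplexity, mentioned above, is the exponentiated average negative log-likelihood per token; a minimal sketch (the per-token probabilities are invented):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum_t log p(token_t | context))."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# Probabilities the model assigned to each observed token (invented values).
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs))  # 4.0: equivalent to a uniform choice among 4 tokens
```

Intuitively, perplexity is the effective branching factor: a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.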
Benchmarking: Common NLP benchmarks are used for evaluation (e.g., MMLU, HELM).
Data Collection for LLMs
Data is collected by crawling the internet (~250 billion pages).
Steps in data processing include:
Text extraction from HTML.
Filtering undesirable content (e.g., toxic content and personally identifiable information, PII).
Deduplication of repeated content.
Heuristic filtering for low-quality documents.
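The deduplication and heuristic-filtering steps above can be sketched as a small pipeline; the exact-hash dedup and length threshold here are simplified placeholders, not the filters used in practice.

```python
import hashlib

def dedup_and_filter(documents, min_words=5):
    """Exact deduplication via content hashing plus a simple length heuristic."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        if len(doc.split()) < min_words:
            continue  # heuristic filter: too-short documents are low quality
        kept.append(doc)
    return kept

docs = ["the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",  # exact duplicate
        "click here"]                                   # too short
print(dedup_and_filter(docs))  # keeps only the first document
```

Production pipelines additionally use near-duplicate detection (e.g., MinHash) and learned quality classifiers, which this sketch omits.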
Scaling and Optimization
Scaling Laws: More data and larger models lead to better performance in a predictable, power-law fashion.
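A common functional form for such scaling laws (Chinchilla-style) writes the loss as a constant plus power-law terms in parameter count and token count; the coefficients below are invented for illustration, not fitted values.

```python
# L(N, D) = E + A / N**alpha + B / D**beta
# where N = number of parameters and D = number of training tokens.
# All coefficients are invented placeholders for illustration.
E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

small = loss(1e9, 2e10)     # 1B params trained on 20B tokens
large = loss(7e10, 1.4e12)  # 70B params trained on 1.4T tokens
assert large < small  # more data and a larger model -> lower predicted loss
```

Fitting such a curve on small training runs lets practitioners predict the loss of much larger runs before spending the compute.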
Computational efficiency is critical due to the size of data and models.
Low Precision Training: Using 16-bit instead of 32-bit floating point to speed up computation and reduce memory traffic.
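The precision cost of 16-bit floats can be seen by round-tripping a value through IEEE-754 half precision, which Python's `struct` module exposes via the `"e"` format:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE-754 half precision (16 bits)."""
    return struct.unpack("e", struct.pack("e", x))[0]

x = 3.141592653589793
print(to_fp16(x))  # ~3.140625: half precision keeps only ~3 decimal digits
# Half the bits means half the memory traffic and faster GPU arithmetic,
# at the cost of precision; in practice, mixed-precision training keeps
# sensitive operations (e.g., loss accumulation) in 32 bits.
```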
Operator Fusion: Combines consecutive GPU operations into a single kernel, reducing round trips between GPU memory and compute units.
Post-training Techniques
Supervised Fine-tuning (SFT): Fine-tuning on specific human-generated examples.
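SFT reuses the same next-token cross-entropy loss as pre-training, but computed only on the response tokens of human-written (prompt, response) pairs; a toy sketch with invented probabilities:

```python
import math

# Toy (prompt, response) example; loss is computed only on response tokens.
tokens      = ["<user>", "2+2?", "<assistant>", "4",  "<eos>"]
is_response = [False,    False,  False,         True, True]
# Probability the model assigned to each token (invented values).
model_probs = [0.9,      0.8,    0.9,           0.6,  0.7]

# Mean negative log-likelihood over response tokens only (prompt is masked out).
nll = [-math.log(p) for p, r in zip(model_probs, is_response) if r]
sft_loss = sum(nll) / len(nll)
```

Masking the prompt means the model is trained to produce the assistant's answer, not to reproduce the user's question.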
Reinforcement Learning from Human Feedback (RLHF): Aligning model behavior with human preferences by collecting human judgments comparing model outputs.
DPO (Direct Preference Optimization): A simpler alternative to RLHF that directly maximizes the likelihood of preferred outputs over dispreferred ones.
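The DPO objective can be sketched as a logistic loss on the policy's preference margin relative to a frozen reference model; the log-probabilities below are invented for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where w is the preferred (winning) response and l the dispreferred one."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Invented log-probabilities: the policy prefers the chosen answer more
# strongly than the frozen reference model does, so the loss drops below
# the no-preference baseline of log(2).
loss_good = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
loss_flat = dpo_loss(logp_w=-6.0, logp_l=-8.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
assert loss_good < loss_flat  # widening the margin lowers the loss
```

Because the loss depends only on log-probabilities, DPO needs no separate reward model or reinforcement-learning loop, which is what makes it simpler than RLHF.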
Challenges in Post-training
It is difficult to generate ideal reference answers, and human labelers introduce biases into preference data.
Future Directions and Considerations
The lecture emphasizes the importance of systems, data, and effective architecture in building scalable LLMs.
It also touches upon various complexities and ethical considerations in deploying LLMs.
Recommended Courses
CS224N: Historical context of LLMs.
CS324: In-depth exploration of LLMs.
CS336: Hands-on experience building LLMs.