
Overview of DeepSeek AI Models

Feb 28, 2025

Lecture Notes on DeepSeek AI Models

Introduction to DeepSeek

  • DeepSeek is a startup based in China.
  • Made headlines by overtaking OpenAI's ChatGPT as the most downloaded free app on the US Apple App Store.
  • Known for releasing an open-source model that rivals industry-leading models at a fraction of the cost.

Key Model: DeepSeek R1

  • DeepSeek R1: A reasoning model, indicated by "R" in the name.
  • Competes with OpenAI's reasoning model, o1.
  • Excels in AI benchmarks for math and coding tasks.
  • Approximately 96% cheaper to run than o1.

Features of Reasoning Models

  • They solve complex problems by breaking them into steps.
  • Utilize a process called "chain of thought" for step-by-step analysis.
  • Differ from earlier models by exposing their step-by-step reasoning in the output (a minimal sketch follows below).
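
For illustration, here is a minimal Python sketch of what exposed chain-of-thought output can look like and how the reasoning can be separated from the final answer. The `<think>` tag convention and the sample text are assumptions for this example (R1-style deployments commonly wrap the reasoning in such tags, but the exact format varies):

```python
import re

# Hypothetical raw output from a reasoning model; the <think> wrapper is an
# assumed convention for this example, not a documented DeepSeek format.
raw_output = """<think>
The train travels 120 km in 2 hours.
Speed = distance / time = 120 / 2 = 60 km/h.
</think>
The train's average speed is 60 km/h."""

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the visible chain-of-thought steps from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

steps, answer = split_reasoning(raw_output)
print("Reasoning steps:\n" + steps)
print("Final answer:", answer)
```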

Evolution of DeepSeek Models

DeepSeek Model Timeline

  1. DeepSeek V1 (Jan 2024)
    • 67 billion parameters.
    • Traditional transformer model.
  2. DeepSeek V2 (June 2024)
    • 236 billion parameters.
    • Featured multi-head latent attention (MLA) and the DeepSeekMoE mixture-of-experts architecture.
  3. DeepSeek V3 (Dec 2024)
    • 671 billion parameters.
    • Introduced reinforcement learning and load balancing across GPUs.
  4. DeepSeek R1-Zero (Jan 2025)
    • First DeepSeek reasoning model, trained exclusively with reinforcement learning (no supervised fine-tuning).
  5. DeepSeek R1
    • Combines reinforcement learning with supervised fine-tuning.

Distilled Models

  • Distilled Models: Smaller student models derived from larger teacher models.
  • Serve as a form of model compression: the student learns to reproduce the teacher's behavior, and knowledge can be transferred even between different architectures (a minimal sketch follows below).
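
The teacher/student idea can be illustrated with the classic logit-matching distillation loss below, written in PyTorch. This is a generic sketch, not DeepSeek's recipe; their distilled R1 variants are reported to be produced by fine-tuning smaller open models on reasoning data generated by R1:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Knowledge-distillation loss: train the student to match the teacher's
    softened output distribution (a KL-divergence term)."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2

# Toy usage: logits over a 10-token vocabulary for a batch of 4 positions.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```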

Efficiency of DeepSeek

  • Utilizes fewer specialized Nvidia chips compared to competitors.
  • Example: roughly 2,000 GPUs were used to train DeepSeek V3, compared with more than 100,000 reportedly used by Meta for Llama 4.

Architectural Efficiencies

  • Mixture of Experts (MoE) architecture is resource-efficient:
    • Divides the model into sub-networks (experts).
    • Activates only necessary experts for tasks.
    • Reduces computational cost and speeds up inference (see the routing sketch after this list).
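
A minimal PyTorch sketch of top-k expert routing is shown below. The sizes, the linear experts, and the router are illustrative assumptions; DeepSeek's actual DeepSeekMoE design (shared experts plus many fine-grained routed experts with load balancing) is more elaborate:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal mixture-of-experts layer: a router picks the top-k experts for
    each token, and only those experts run, so most parameters stay inactive."""

    def __init__(self, dim=16, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):             # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 16)
print(TinyMoE()(tokens).shape)                  # torch.Size([4, 16])
```

Only `top_k` of the `num_experts` sub-networks run for each token, which is why a very large total parameter count can still be cheap to execute per token.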

Reinforcement Learning

  • Rewards the model when it reaches a correct, verifiable answer, letting it discover its own reasoning pathways (a toy sketch follows below).
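
A toy sketch of that reward idea is given below, assuming answers can be checked against a known ground truth. DeepSeek's papers describe a group-relative policy-optimization (GRPO) objective; this snippet only illustrates scoring a group of sampled answers and favoring the ones that beat the group average:

```python
import torch

ground_truth = "60"
sampled_answers = ["60", "45", "60", "120"]   # hypothetical samples for one prompt

# Verifiable reward: 1 for a correct final answer, 0 otherwise.
rewards = torch.tensor([1.0 if a == ground_truth else 0.0 for a in sampled_answers])
# Group-relative advantage: how much better each sample is than the group average.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Stand-in log-probabilities that a real policy model would supply per sample.
log_probs = torch.tensor([-1.2, -2.3, -0.9, -3.1], requires_grad=True)
loss = -(advantages * log_probs).mean()       # REINFORCE-style surrogate loss
loss.backward()                               # gradients favor above-average answers
print(rewards.tolist(), [round(a, 2) for a in advantages.tolist()])
```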

Conclusion

  • DeepSeek R1 matches industry-leading models in performance at lower costs.
  • Reflects exciting progress in AI reasoning models.