Chances are you've heard about the newest entrant to the very crowded and very competitive realm of AI models: DeepSeek. It's a startup based in China, and it caught everyone's attention by taking over OpenAI's coveted spot for most downloaded free app in the US on Apple's App Store. How? By releasing an open source model that it claims can match or even surpass the performance of other industry-leading models, and at a fraction of the cost.

Now, the specific model that's really making a splash from DeepSeek is called DeepSeek R1, and the R here stands for reasoning, because this is a reasoning model. DeepSeek R1 performs as well as some of the other models, including OpenAI's own reasoning model, called o1, and it can match or even outperform it across a number of AI benchmarks for math and coding tasks. That's all the more remarkable because, according to DeepSeek, DeepSeek R1 was trained with far fewer chips and is approximately 96% cheaper to run than o1.

Now, unlike previous AI models, which produce an answer without explaining the why, a reasoning model solves complex problems by breaking them down into steps. So before answering a user query, the model spends time thinking, "thinking" in air quotes here, and that thinking time could be a few seconds or even minutes. During this time, the model is performing step-by-step analysis through a process known as chain of thought. And unlike other reasoning models, R1 shows the user that chain of thought process as it breaks the problem down, as it generates insights, as it backtracks when it needs to, and as it ultimately arrives at an answer.

Now, I'm going to get into how this model works. But before that, let's talk about how it came to be, because DeepSeek R1 seems to have come out of nowhere. There are in fact many DeepSeek models that brought us to this point, a model avalanche, if you like. And my colleague Aaron can help dig us out.
Well, thanks, Martin. There is certainly a lot to dig out here; there are a lot of these models. But let's start from the very top, at the beginning of all this. So we begin with DeepSeek Version 1, a 67-billion-parameter model that was released in January of 2024. Now, this is a traditional transformer with a focus on the feedforward neural networks.

That gets us down to DeepSeek Version 2, which really put DeepSeek on the map. This is a very large 236-billion-parameter model that was released not that long after the original, in June 2024. To put this into perspective, there are really two novel aspects of this model. The first one was multi-head latent attention, and the second was the DeepSeek mixture of experts. Together they made the model really fast and performant, and they set us up for success with DeepSeek Version 3, which was released in December of 2024.

Now, this one is even bigger: 671 billion parameters. But this is where we began to see the introduction of reinforcement learning with that model. Some other contributions of this model are that it was able to balance load across many GPUs, because they used a lot of H800s within their infrastructure, and it was also built on top of DeepSeek V2. So all these models accumulate and build on top of each other, which gets us down to DeepSeek R1-Zero, released in January of 2025.

So this is the first of the reasoning models now, right? It is, yeah. And it's really neat how they began to train these types of models. It's a type of fine tuning, but on this one they exclusively used reinforcement learning, which is a way where you have policies and you want to reward or penalize the model for some action it has taken or output it has produced, and it learns over time. And it was very performant.
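That reward-and-penalize loop can be sketched in a few lines of Python. This is a toy illustration, not DeepSeek's actual training code: the reward here is outcome-only (a correct final answer earns +1), and the group-mean normalization is loosely in the spirit of the group-relative scoring DeepSeek has described; all function names and sample values below are invented for illustration.

```python
def correctness_reward(answer: str, reference: str) -> float:
    # Outcome-only reward: the model earns +1 for a correct final answer,
    # 0 otherwise, no matter how it reasoned its way there.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def group_advantages(sampled_answers, reference):
    # Score a group of sampled answers to the same question, then normalize
    # each reward against the group mean. Answers that beat the group
    # average get a positive advantage and are reinforced; the rest are
    # penalized, and the policy improves over time.
    rewards = [correctness_reward(a, reference) for a in sampled_answers]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four sampled answers to "What is 17 * 3 + 9?" -- two right, two wrong.
print(group_advantages(["60", "59", "60", "61"], "60"))
# -> [0.5, -0.5, 0.5, -0.5]
```

The key point the sketch captures is that the reward never inspects the reasoning steps, only the final answer, which leaves the model free to discover its own way of thinking.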
It did well, but it got even better with DeepSeek R1, right, which was again built on top of R1-Zero. This one used a combination of reinforcement learning and supervised fine tuning, the best of both worlds, so that it could be even better, and its performance on many standard benchmarks is very close to some of the OpenAI models we have now.

And this gets us down to distilled models, which is like a whole other paradigm. Distilled models. Okay, so tell me what that is all about. Yeah, great question. So first of all, a distilled model is where you have a student model, which is very small, and a teacher model, which is very big, and you want to distill, or extract, knowledge from the teacher model down into the student model. In some respects you could think of it as model compression. But one interesting aspect of this is that it's not just compression or transferring knowledge; it's model translation, because we're going from R1-Zero, right, which is one of those mixture of experts models, down into, for example, a Llama series model, which is not a mixture of experts but a traditional transformer. So you're going from one architecture type to another. And we do the same with Qwen, right? So there are different series of models that form the foundation we then distill into from R1-Zero.

Well, thanks. It's really interesting to get the history behind all this. It didn't come from nowhere, but with all of these distilled models coming, I think you might need your shovel back to dig your way out of those. Thank you very much. There are going to be a lot of distilled models, so you're exactly right, I think you're going to go dig. Thanks.

So R1 didn't come from nowhere; it's an evolution of other models. But how does DeepSeek operate at such comparatively low cost?
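Before we get to that, a quick aside on the teacher-student idea Aaron described. A common way to transfer knowledge is to train the student to match the teacher's softened output distribution over next tokens. Here's a minimal sketch of that soft-label loss term; this is a generic illustration of distillation, not DeepSeek's published recipe, and the function names and temperature value are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the teacher's softened distribution and the
    # student's: zero when the student already matches the teacher,
    # positive (and worth minimizing) when it does not.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(np.sum(p_teacher * np.log(p_teacher / p_student)))

print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0: already matched
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))  # positive: mismatch
```

Because the loss only depends on output distributions, it doesn't care whether teacher and student share an architecture, which is what lets a mixture of experts teacher distill into a dense Llama- or Qwen-style student.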
Well, by using a fraction of the highly specialized Nvidia chips used by their American competitors to train their systems. Let me illustrate this in a graph, where we consider different types of model and the number of GPUs they use. DeepSeek engineers, for example, said they needed only about 2,000 GPUs, that's graphics processing units, to train the DeepSeek V3 model. Now, in isolation, what does that mean? Is that good? Is that bad? Well, by contrast, Meta said that the company was training its latest open source model, Llama 4, using a computing cluster with over 100,000 Nvidia GPUs.

So that brings up the question: how is it so efficient? Well, DeepSeek R1 combines chain of thought reasoning with a process called reinforcement learning. This is the capability Aaron mentioned just now, which arrived with the V3 model of DeepSeek. Here, an autonomous agent learns to perform a task through trial and error, without any instructions from a human user. Now, traditionally, models improve their ability to reason by being trained on labeled examples of correct or incorrect behavior, that's known as supervised learning, or by extracting information from hidden patterns, that's known as unsupervised learning. But the key hypothesis with reinforcement learning is to reward the model for correctness, no matter how it arrived at the right answer, and let the model discover the best way to think all on its own.

Now, DeepSeek R1 also uses a mixture of experts architecture, or MoE, and a mixture of experts architecture is considerably less resource intensive to train. The MoE architecture divides an AI model up into separate subnetworks, which we can think of as individual experts. So in my little neural network here, I'm going to create three experts, and a real MoE architecture would probably have quite a bit more than that.
But each one of these is specialized in a subset of the input data, and the model only activates the specific experts needed for a given task. So a request comes in, we activate the experts that we need, and we only use those rather than activating the entire neural network. Consequently, the MoE architecture reduces computational costs during pre-training and achieves faster performance at inference time.

And look, that MoE architecture isn't unique to models from DeepSeek. There are models from the French AI company Mistral that also use it, and in fact the IBM Granite models are also built on a mixture of experts architecture. So it's a commonly used architecture.

So that is DeepSeek R1. It's an AI reasoning model that is matching other industry-leading models on reasoning benchmarks, while being delivered at a fraction of the cost in both training and inference. All of which makes me think this is an exciting time for AI reasoning models.
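To close out, the three-expert routing idea from the diagram can be sketched in code: a small gating network scores the experts for each input, and only the top-scoring experts actually run. Everything below, the dimensions, random weights, and names, is invented for illustration; it's a toy router, not DeepSeek's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 3, 4, 2   # three experts, activate only two

# Each "expert" stands in for a feed-forward subnetwork (here, one matrix).
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
gate = rng.normal(size=(d_model, n_experts))   # the router's weights

def moe_forward(x):
    # Score every expert, keep only the top-k, and run just those.
    # The inactive experts cost nothing, which is what makes MoE cheap
    # at inference time.
    scores = x @ gate
    chosen = np.argsort(scores)[-top_k:]       # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights = weights / weights.sum()          # softmax over chosen experts only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

y = moe_forward(rng.normal(size=d_model))      # a d_model-sized output vector
```

Only two of the three expert matrices are ever multiplied per request, and that saving scales: in a large MoE model with hundreds of experts, most of the parameters sit idle on any given token.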