Transcript for:
DeepSeek Coder V2 Overview

What happens when you take a model that's already pretty good at coding, give it 6 trillion more tokens, do some more work with pre-training, and see what happens? Well, that's what we got with DeepSeek Coder V2 from DeepSeek AI, which I believe is one of the most impressive large language models to come out of China so far. It's not only great at coding, it's also great at just being an LLM, and right now it looks like, with the benchmarks we have so far, it's basically beating GPT-4 Turbo, Claude 3 Opus, Gemini 1.5 from Google, and, most impressively, Codestral from Mistral AI. So I want to get into what makes this model so impressive, how it works, and how these improvements were made, and we're even going to try it out at the end of this video. So welcome to AI Flux, let's get into it.

DeepSeek Coder V2 kind of came out of nowhere. We saw a lot of updates to coding models this past week, and frankly, just so you guys know, as a professional software engineer, DeepSeek Coder was kind of my go-to; it's what I actually have hooked up to my text editor, although I do use a little bit of GPT-4 and Claude 3 in between. So why is this model so interesting? What's interesting is that it's not doing a ton of wildly new things. DeepSeek Coder was one of the better models from back when Mixture-of-Experts was still a novel way to get to a state-of-the-art model. Just to get the basics out of the way: this is a 236-billion-parameter Mixture-of-Experts model with 21 billion of those parameters active at any given point. It supports 338 programming languages (I'm going to try a few of the more exotic ones to see how far we get), and the context length has been extended from 16K to 128K tokens compared with the first version of DeepSeek Coder. Again, the biggest win here is that it's beating GPT-4o and GPT-4 Turbo on coding and math, which is kind of crazy, and the benchmarks really bear that out.

Curiously enough, if we look at the benchmarks DeepSeek AI has released (we'll obviously get a few more as the week progresses), you can see that DeepSeek Coder V2, in the dashed blue, has a significant margin over a lot of these models. The most curious thing to me is that Codestral is actually on the lower end of many of these; I was surprised how low Codestral scored in a lot of cases. Claude 3 is probably one of the better measures of the state of the art. GPT-4 Turbo is interesting, I think, only because it's much more reactive and responsive, so if you're going to use it in a GitHub Copilot-esque application, it's much easier to use and, in my opinion, much more productive; that's also something I'm curious to see whether they publish numbers on for DeepSeek Coder V2. You can see that DeepSeek Coder V2 has quite a margin on most of these models, with GPT-4 Turbo probably being within margin of error in most cases, and in some cases DeepSeek Coder V2 is actually quite a bit better. Llama 3 70B is also surprisingly low; I did not think that would be the case. Obviously HumanEval has to be taken with a grain of salt, but GSM8K and MBPP+ are ones I like quite a bit. SWE-bench was a pretty interesting announcement a few weeks ago, and I'm not really someone who thinks that benchmark has a lot of value, but it's interesting nonetheless.
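As background on what a score like HumanEval actually reports: these coding benchmarks are usually quoted as pass@k, the probability that at least one of k sampled solutions passes every test case. Here's a minimal sketch of the standard unbiased estimator from the Codex paper; the numbers in the example are made up:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), computed in a numerically stable product form.
    n = total samples drawn, c = samples that passed all tests, k = budget."""
    if n - c < k:
        return 1.0  # fewer failing samples than k, so a passing one is always drawn
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# e.g. 200 samples per problem, 37 of them correct, reported as pass@1
print(pass_at_k(n=200, c=37, k=1))  # 0.185
```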
So how did these advancements actually happen? Some people think this was released earlier than DeepSeek AI wanted, and there have been some interesting comments: it would be really cool to see how fast this would run on Groq, or maybe this could have been a larger model if they'd waited a bit longer, but DeepSeek wanted it out before Meta's Llama 3 400B. Basically, what DeepSeek AI says they did here is put a lot of work into finding additional tokens to train on and into the pre-training itself, and pre-training is something that gave a big performance boost to Llama 3 at Meta, so it's interesting that DeepSeek took a similar approach.

What's important is what actually made up those additional 6 trillion tokens. 60% of it was just raw source code, so not internal documentation or code notes or git history, which is interesting, because training on raw data can sometimes be quite a bit harder if you don't already have a lot of direction defined in the model itself. 10% was a math corpus, and the remaining 30% was a natural language corpus; obviously that was oriented toward Chinese, but the model works fine in English. They say the source code consists of 1.2 trillion code-related tokens sourced from GitHub and Common Crawl, using the same pipeline as DeepSeekMath. I'm sure Microsoft probably isn't too happy that this was all pulled straight out of GitHub, but if we know anything about China, they just don't care; they just want to make the best models, and I can't necessarily argue with that in a lot of cases.

They say that after pre-training, the model goes through supervised fine-tuning on code, math, and general instruction data, then reinforcement learning with Group Relative Policy Optimization, or GRPO, which is quite different from the more commonly used DPO (there's a short sketch of the idea below). This algorithm is used to further optimize its responses for correctness and human preference on coding tasks, using test-case feedback and a learned reward model. What I find cool here is that they're actually looping in test-case feedback, as opposed to just run-of-the-mill RLHF. It's also important to note that the human-preference side is a bit of a gray area; we don't know a lot about their process. Apple actually shared a lot about this kind of thing in the benchmark data for their new models, but curiously we're not getting much of that from DeepSeek AI. What I also think is cool is that the use of a learned reward model is really similar to the massive model NVIDIA just released, whose entire purpose is to create data, which is kind of interesting.

In my opinion, DeepSeek Coder is obviously one of the best open-source coding models. My usage shows very little that this model can't do, especially on more complex tasks like standing up an entire page, or really more realistic tasks like looking at a lot of code and telling me where I can simplify things or pull things out to make them more readable. Because in most cases, if I can't read the code, or if I can't revisit it in a few months and still know what's going on, it's kind of useless. Frankly, that's what I use a lot of this for: I'll feed in large chunks of code and say, hey, where do you think this happens? Of course I could do it myself, but it saves me time by understanding code more so than necessarily writing code, which I think is also an approach Google has taken with Gemini.
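To make that GRPO step a bit more concrete: rather than training a separate value network the way PPO does, GRPO samples a group of completions for each prompt and scores every completion against the group's own statistics. Here's a minimal sketch of just that advantage computation, assuming the reward is something like the fraction of test cases a completion passes; this is an illustration of the idea, not DeepSeek's actual training code:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Core of GRPO: normalize each completion's reward against the group
    of completions sampled for the same prompt, so no separate value or
    critic network is needed (unlike PPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # all completions tied: no learning signal
    return [(r - mean) / std for r in rewards]

# Hypothetical test-case pass rates for 4 sampled solutions to one prompt.
print(group_relative_advantages([1.0, 0.5, 0.5, 0.0]))
# -> [1.414..., 0.0, 0.0, -1.414...]
```

Solutions that pass more tests than their siblings get a positive advantage and are reinforced; the rest are pushed down, which is why the test-case feedback matters so much here.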
It's available both on Hugging Face and on DeepSeek AI's GitHub, and we're going to take a look at the Hugging Face page here to see what's going on. What I think is cool is they have DeepSeek Coder V2 in both Instruct and Chat variants, which is something we haven't seen from many other models before; Mistral did this as well with Codestral, but it's been a trend with DeepSeek for some time. This is basically the same data from the announcement tweet, with a few other ways to download the model, which is pretty cool.

I do want to look at the full list of programming languages, because I was curious what we actually get here. First off, one of the really interesting ones is AMDGPU, which is actually a markup language that predates OpenCL, which is pretty cool. There are also some older ones in here, like AmbientTalk and ActionScript, and some of these might be useful if you want to understand how to fix something in a really old application. Of course we have CUDA. I also want to see if there are any VHDL-style entries; these are languages used to codify how you create integrated circuits. It's cool to see Elixir; Emacs Lisp, that's kind of funny; FP and Fortran, and what's cool there is these are really, really old languages, and it's surprising some of this code even made it onto GitHub, because GitHub didn't exist until long after these languages were created. All right, we have Java, JavaScript, all the standards; Jupyter notebooks, so you can create some great Jupyter notebooks with this; something called Moocode, and MoonScript, which I've never heard of. Here's a big one: NGINX configuration files, which is pretty funny. OpenSCAD is very cool to see; I actually know a lot of people who are using DeepSeek and LLMs to accelerate OpenSCAD as a service. And they do have VHDL and some other silicon markup languages, which is pretty cool, and J. So that's actually a really interesting list; there's a lot in there I wasn't expecting, and it covered all the ones I did expect. Of course, with 338 languages "fully supported," I'm not sure I entirely believe that; I'd bet some of those are weighted a little more heavily than others, just because you can go and find more source code for them. But it's pretty cool, and it's also fully open-sourced, with two versions behind the API: the API exposes the full 236B model and a much smaller, and I would bet faster, 16-billion-parameter version. We're going to go try one of those now, and you can find more information about these on the DeepSeek Hugging Face page, not just the cards for these new models.
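If you'd rather run the model locally than hit the API, loading the smaller instruct variant through the Hugging Face transformers library looks roughly like this. The checkpoint id is an assumption based on DeepSeek's naming, so verify it against the model card (and your hardware budget) before running:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint id assumed from DeepSeek's Hugging Face naming; verify on the hub.
model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the dtype stored in the checkpoint
    device_map="auto",       # spread layers across available GPUs
    trust_remote_code=True,  # the MoE architecture may ship custom modeling code
)

messages = [{"role": "user",
             "content": "Write a Python function to check if a number is prime."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```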
So I'm going to chat with Coder V2, not just DeepSeek V2, and let's see what we can get up to. First off, I'm actually going to start with a non-programming question: I'm going to ask it to give me ten sentences that have something to do with a glass and a peach. Some models have difficulty with this, and I'm curious to see whether DeepSeek Coder can do it on its own. It's an instruct model, so it should be pretty used to giving us numbered lists. Let's see what we get here: "a ripe peach" and then "glass bowl." The reason I didn't specify the glass as something like a mug or a container is that I wanted to see what it did here, and it actually did quite well; it passed this test, and a lot of LLMs have quite a bit of difficulty with it. Granted, it does get a little bit confused where it says "a glass window of the greenhouse," but the greenhouse and the peach do make sense together, so I'll give it some credit there. So I'm going to clear the context and actually get into some coding questions now.

We're going to start with some easy ones, in Python. First I ask it to write a basic Python function that will "estimate" the Mandelbrot set. Obviously you just generate these, you don't estimate them, but I want to see what it gives us. First it tells us what that is: a fractal defined in the complex plane, where the set is defined by iterating a simple function over and over. What's cool is it gave us a very well-written, concise function, formatted to PEP 8, which is great, and it's very easy to follow. Sometimes Codestral would really struggle with this; it would get caught up over itself on things as simple as variable names and what was being fed in. This gave us something really useful, and it gave us test cases without even being asked. What I like about that, specifically given the way they train with test-case feedback, is that it understands what an engineer would want to look for. Frankly, I wasn't really that impressed with Devin, but this is something I'm actually really impressed with. Wow, this looks really good.

Now for my second question: I say "great," and ask it to store the previously generated values, and it says cool, yeah, we can cache those. It gives us a really concise, now updated comment telling us what the function does, and we end up with a really simple function. This looks pretty good.

Now, a lot of you have asked me to have these models generate a Snake game, and one thing that's interesting with a lot of these models is that they have a really hard time reasoning on a geometric plane. For instance, if you ask for a function that has to understand something on a two-dimensional plane or a number line, they sometimes struggle. So this new question of mine asks it to write a Snake game, but have it work on a radial plane, so around a circle, using radians. Some models can do this; GPT-4 definitely cannot. I'm also going to say "use the language you see as a good fit for this game," so I didn't tell it to use Python; we're just going to see what it picks, and hopefully it gives us code while still letting us know why this version is interesting and why it's different.

All right, it chose Python even though we cleared the context, which kind of makes sense. I think this model probably picks languages based on what it thinks is most readable, but I'd also guess Python makes up most of the training data, because right now, if you go on GitHub, most of it is Python and C, and Python is directly downstream of C. Well, it still used a relatively normal coordinate system, but it did understand that movement needs to work differently, and it understood that we wanted to move around a circle, so that's actually still pretty impressive.
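To illustrate what "working on a radial plane" means here, a tiny sketch of the movement logic, where the snake's position is an angle in radians that wraps around the circle; this is my own illustration of the idea, not the model's actual output:

```python
import math

STEP = math.pi / 16  # angular step per tick (hypothetical game constant)

def advance(theta: float, direction: int) -> float:
    """Move one step around the circle: +1 counter-clockwise, -1 clockwise.
    The modulo keeps the angle wrapped into [0, 2*pi)."""
    return (theta + direction * STEP) % (2 * math.pi)

def to_screen(theta: float, radius: float = 100.0) -> tuple[float, float]:
    """Project the angular position back to x/y pixels for drawing."""
    return (radius * math.cos(theta), radius * math.sin(theta))

theta = 0.0
for _ in range(4):
    theta = advance(theta, +1)
    print(f"theta={theta:.3f} rad -> xy={tuple(round(v, 1) for v in to_screen(theta))}")
```

The point is that "position" collapses to a single angle, and only the rendering step converts back to Cartesian coordinates, which is the part the model got right even while keeping a conventional grid.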
Now I'd like to ask something a little bit more complex: I'm going to ask it to build a really simple ASIC with VHDL. This is something NVIDIA has done internally, and I want to see if DeepSeek Coder will let us do it. Again, VHDL is a language used to define combinational logic for a chip, so it's pretty similar to programming, except it's programming that then gets etched into a chip and used everywhere. So basically, I say I want to create an ASIC for Bitcoin mining, and we'll see if I remember enough of my EE degree. Let's see what it gives us; it might just give up, because this is a really complex task, the one I'm picking to try to break this model. (This model has also been jailbroken, and if you want to see that, I'm going to start doing videos like that on Rumble, so let me know in the comments if you want to see those, because I definitely can't do them here.)

All right, so it's starting out right: it's giving us an entity, and it understands basically what that is. The architecture is basically just a hardware implementation of the SHA-256 algorithm, which is right, and the entity declaration just defines the interface of the hardware block you're describing in this code. Now it's giving us those base constants; it understands that we want some kind of standard logic vector. It's interesting that it's giving us the initial hash values and round constants, which are part of SHA-256, so let's see where it goes; I'm going to wait around just a few minutes. What's cool is it's actually getting through a lot of the signal declarations, which, frankly, as a student was one of the harder things to understand. Now it's actually showing us the logic: it's setting the initial hash values and those gates, and it's sort of explaining what's going on here. We got sort of a partial implementation, because std_logic_vector comes from a library, and like I told you, it's using a few different libraries to kind of cheat. But I'm still very impressed that it basically gave us a working version of this, and it gave us a bit more information after the fact: it brings up power efficiency (because I mentioned Bitcoin mining), area efficiency, reliability, and testing. Interesting. And the funny thing is that the step here that says "fabricate a prototype and test in a real-world environment" would cost about $4 million, so I'm not going to be doing that.

So I'm curious how many of you will be using this as your new coding model, and how many of you use coding models locally in general, not from closed-source LLM companies. I'm really curious to see what you guys have to say. I think this is pretty impressive; I can't wait to see what comes downstream of this, and I'm definitely going to start using this locally as an upgrade to DeepSeek Coder V1. As always, I hope you learned something, and if you liked this video, please like, subscribe, and share, and we'll see you in the next one.