Hey, this is Andrew Brown, and in this crash course I'm going to show you the basics of DeepSeek. First, we're going to look at the DeepSeek website, where you can use it just like you'd use ChatGPT.
After that, we will download it using Ollama and get an idea of its capabilities there. Then we'll use another tool called LM Studio, which will allow us to run the model locally but with a bit of agentic behavior. We're going to use an AI PC and also a modern graphics card, my RTX 4080, and I'm going to show you some troubleshooting skills along the way, because we do run into issues with both machines.
But it gives you an idea of what we can do with DeepSeek and where it's not going to work. I also show you how to work with it via Hugging Face with Transformers and do local inference. So hopefully you're excited to learn that. But we will have a bit of a primer just before we jump into it,
so we know what DeepSeek is. I'll see you there in one second. Before we jump into DeepSeek, let's learn a little bit about it.
So DeepSeek is a Chinese company that creates open-weight LLMs. The company has a longer proper name that I can't pronounce. DeepSeek has many open-weight models.
So we have R1, R1-Zero, DeepSeek-V3, Math, Coder, and MoE. MoE stands for mixture of experts, and DeepSeek-V3 is itself a mixture-of-experts model. I would tell you more about those, but I never remember the details; they're somewhere in my GenAI Essentials course. The one we're going to be focusing on is mostly R1. We will look at V3 initially, because that is what is used on deepseek.com,
and I want to show you the AI-powered assistant there. But let's talk more about R1. And before we can talk about R1, we need to know a little bit about R1-Zero. There is a paper where you can read all about how DeepSeek works. DeepSeek-R1-Zero is a model trained via large-scale reinforcement learning without supervised fine-tuning, and it demonstrates remarkable reasoning capabilities.
R1-Zero has problems like poor readability and language mixing, so R1 was trained further to mitigate those issues, and it can achieve performance comparable to OpenAI o1. They show a bunch of benchmarks across the board, with DeepSeek in blue and OpenAI's results next to it, and most of the time they're suggesting that DeepSeek performs better.
And I need to point out that DeepSeek R1 is just text generation; it doesn't do anything else, but it supposedly does that really, really well. They're probably comparing the 671 billion parameter model, the one we cannot run locally but that large organizations could afford to run. The reason DeepSeek is such a big deal is the speculated 95 to 97% reduction in cost compared to OpenAI. That is the big deal here, because training and running these models normally costs hundreds of millions of dollars, and they said they trained and built this model for about $5 million, which is nothing compared to the others. With all the talk about DeepSeek R1, we saw chip manufacturers' stocks drop, because companies are asking why they need all this expensive compute when clearly these models can be optimized further. So we are going to explore DeepSeek R1, see how we can get it to run, where we can run it, and where we hit its limits. I do want to talk about what hardware I'm going to be using, because a lot of this really depends on your local hardware.
We could run this in the cloud, but it's not really worth it; you really should be investing some money into local hardware and learning what you can and can't run within your limitations. What I have is an Intel Lunar Lake AI PC dev kit. Its proper name is the Core Ultra 200V series. It came out in September 2024 and it is a mobile chip. The chip is special because it has an iGPU, an integrated GPU,
and that's what the LLM is going to use. It also has an NPU, a neural processing unit, which is intended for smaller models. But that's the hardware I'm going to run it on.
The other machine that we're going to run it on is my Precision 3680 tower workstation. I just got this machine; it's okay. It has a 14th generation Intel Core i9, and I have a GeForce RTX 4080. So I ran this model on both of them.
I would say that the dedicated graphics card did do better, because they generally do. But from a cost perspective, the Lunar Lake AI PC dev kit is cheaper. You cannot buy the one on the left-hand side, because this is something that Intel sent me.
There are equivalent kits out there. If you just want an AI PC dev kit, Intel, AMD, and Qualcomm all make them; I just prefer to use Intel hardware. Use whichever one you want; even the Mac M4 would be in the same kind of lineup of things you could use. But I found that we could run about a 7 to 8 billion parameter model on either.
There were cases, though, where I used specific things, the models weren't optimized, and I hadn't tweaked them, and it would literally hang the computer and shut it down — both machines. So there is some finessing here and understanding how your hardware works. If you want to run this stuff, you'd probably want a dedicated computer on your network.
So my AI PC is on my network, or you might want a dedicated computer with multiple graphics cards. But I kind of feel like if I really wanted decent performance, I'd probably need two AI PCs, distributing the LLM across them with something like Ray Serve, or another graphics card with distributed compute, because having just one of either feels like a little too little. But you can run this stuff and get some interesting results. We'll jump into that right now. Okay.
So before we try to work with DeepSeek programmatically, let's go ahead and use the deepseek.com AI-powered assistant. This is supposed to be the equivalent of ChatGPT, Claude Sonnet, Mistral, or Llama via Meta AI.
As far as I understand, this is completely free. It could be limited in the future, because this is a product coming out of China, and for whatever reason it might not work in North America at some point. If that happens, you can just skip on to the other videos in this crash course, which show you how to programmatically download the open-weight model and run it on your local compute. But this one in particular is running DeepSeek V3.
And then up here we have DeepSeek R1, which everyone is talking about, and that's the one we're going to try to run locally. But DeepSeek V3 is going to be more capable, because there's a lot more going on in the background there. So what we'll do is click Start Now.
Now, I got logged in right away because I connected with my Google account; that is something that's really easy to do. The use case that I like to test these things on is a prompt document I created for helping me learn Japanese. Basically, in this prompt document I tell it: you are a Japanese language teacher, and you're going to help me work through a translation. I have versions I did for Meta AI, Claude, and ChatGPT.
So we're just going to take this and try to apply it to DeepSeek. The most advanced one is the Claude one. You can click into it and see I have a role, a language, teaching instructions, an agent flow (so it's handling state), very specific instructions, and examples. So hopefully what I can do is give it these documents and it will act appropriately.
This is in my GitHub and it's completely open for you to access, at omenking/free-genai-bootcamp-2025 in the sentence-constructor folder. What I'm going to do is, while I'm in GitHub and logged in, press period, which will open this up in github.dev.
And so now I can download those files and make it a little bit easier to work with them. So that's what I'm doing. I'm just opening this in github.dev.
And we'll get a VS Code-like editor, so we'll just give it a moment here. The next thing I'm going to do is go into the Claude folder.
Basically, you could use these other ones too; they're just very simple prompts. But what I did over time is make the Claude one more advanced, so that's the one we really want to test out. So I have these, and I want this one here; this is the teaching one.
That's fine. I have examples. And I have consideration examples.
Okay, so I'm just carefully reading this and trying to decide which ones I want. I actually want almost all of these.
I was going to download the folder, so I'm going to go ahead and download this folder to my desktop. Okay, and it doesn't like it unless it goes into a folder.
So I'm going to hit download again. I think I actually made a folder on my desktop called... nope, maybe not download — we'll just make a new one called download. Okay, I'm going to go in here, select it, and save changes.
And that's going to download those files there. So if I go to my desktop and into download, we now have the same files. Okay, so what I want to do next is go back over to DeepSeek.
And it appears that we can attach files. It says text extraction only — upload docs or images. So it looks like we can upload multiple documents.
So these are very small documents. And so I want to grab this one, this one, this one, this one and this one. And I'm going to go ahead and drag it on in here.
Okay. Actually, I'm going to take out the prompt.md, and I'm just going to copy its contents in here, because the prompt.md tells it to look at those other files.
So go ahead and copy this, paste it in here, hit enter, and then we'll see how it performs.
Another thing we should check is its vision ability. But we'll go here — it says let's break down a sentence, an example of sentence structure — and it looks really, really good. Next: possible answers, try formatting the first clue. So I'm going to try to tell it to give me the answer. Just give me the answer.
I want to see if I can subvert my own instructions. Okay, and it's giving me the answer, which it's not supposed to be doing. Did I tell it not to give me the answer in my prompt document?
Let's see if it knows. "My apologies for providing the answer." Clearly it's already failed on that. But I mean, it's still really powerful. And the consideration is, even if it's not as capable as Claude or ChatGPT, there's the cost factor.
But it really depends on what these models are doing. Because when you look at Meta AI, or you look at Mistral 7B, these models aren't necessarily working alongside a bunch of other models. There might be additional steps that Claude or ChatGPT performs to make sure it actually follows your prompt.
So far, I've run this on those ones as well, and this is more comparable to the simpler setups that don't do all those extra checks. So it's probably fairer to compare it to something like Mistral 7B or Llama in terms of its reasoning. But here you can see it already made a mistake, and we were able to correct it.
But still, this is pretty good. So that's fine. But let's go test its vision capabilities, because I believe it does have vision capabilities. So I'm going to go ahead and look for some kind of image.
I'm going to search for Japanese text and go to images here. We'll say Japanese menu. Japanese, again — even if you don't care about the language, it's a very good test language, as the model really has to work hard to figure it out. So I'm trying to find a Japanese menu in Japanese.
So what I'm going to do is say translate... maybe we'll just go to a Japanese website. We'll say Japanese hotel, or maybe, you know, a Japanese newspaper — that might be better. This is probably one, the Mainichi. Okay.
And I want it actually in Japanese; that's the struggle here today. So I'm looking for the Japanese version.
I don't want it in English. Let's try this Japan Times one. I do not want it in English. I want it in Japanese.
And so I'm just looking for that here; just give me a second. Okay, I went back to this first one — in the top right corner it says Japanese — and I'll click this. So now we have some Japanese text. Now, if this model was built in China, I would imagine it's probably really good with Chinese characters, and Japanese borrows Chinese characters.
And so it should perform really well. So what I'm going to do is I'm going to go ahead, I have no idea what this is about. We'll go ahead and grab this image here.
And so now that's saved, I'm going to go back over to DeepSeek and start a new chat. I'm going to paste this image in and say: can you transcribe the Japanese text in this image? Because this is what we want to find out — can it do this? If it can, that makes it a very capable model.
Transcribing means extracting the text. Now, I didn't tell it to produce the translation, but it gave one anyway, saying this text discusses a scandal involving a former talent, etc. Then: can you translate the text and break down the grammar? What we're trying to do is have it break it down so we can see what it says.
The formatting is not the best. Oh, here we go. Here, this is what we want.
So, carefully looking at this — possessives, how to ask a question, voices — yeah, it looks like it's doing what it's supposed to be doing. So it can do vision, and that's a really big deal. But it is V3,
and that makes sense; this is DeepSeek V3 on deepseek.com. The question will be what we can actually run locally, as there have been claims that this thing does not require serious GPUs.
And I have the hardware to test that out on, so we'll do that in the next video. This was just showing you how to use the AI-powered assistant, if you didn't know where it was. Okay.
Alright, so in this video we're going to start learning how to download the model locally, because imagine if DeepSeek is not available one day for whatever reason. And again, it's supposed to run really well on computers that do not have expensive GPUs, so that's what we're going to find out here. The computer that I'm on right now — I'm actually remoted in, connected over my network to my Intel developer kit.
If you bought this thing brand new, it would probably be between $500 and $1,000. The key fact is that it's built around a mobile chip. I call it Lunar Lake, but it's actually called the Core Ultra 200V series of mobile processors, and this is the kind of processor you could imagine being in your phone in the next year or two.
What's so special about these new types of chips is that when you think of a chip, you think of a CPU, and then you hear about GPUs being an extra graphics card. But these things have a built-in graphics unit called an iGPU (integrated GPU), an NPU (neural processing unit), and a bunch of other capabilities. Basically, they've crammed a bunch of stuff onto a single chip, and it's supposed to allow you to download and run ML models.
So this is something you might want to invest in. You could probably do this on a Mac M4 as well, or something similar, but this is just the hardware that I have,
and I do recommend it. Anyway, one of the easiest ways we can work with the model is by using Ollama. Ollama is something I already have installed; you just download and install it, and once it's installed it usually appears over here — mine is over here. The way Ollama works is that you do everything via the terminal.
I'm on Windows 11 here, so I'm going to open up a terminal. If you're on a Mac, it's the same process: open up a terminal. Now that I'm in here, I can type the word ollama.
Okay, so Ollama is here. And if it's running, it shows a little llama icon somewhere on your computer. So what I want to do is go over to the Ollama site.
And you can see it's showing us R1. Notice here there's a dropdown, and we have 1.5 billion, 7 billion, 8 billion, 14 billion, 32 billion, 70 billion, and 671 billion parameter versions.
So when they're talking about DeepSeek R1 being as good as ChatGPT, they're usually comparing the top one, the 671 billion parameter model, which is a 404-gigabyte download — I don't even have enough room for that on my computer. You have to understand that this would require actual datacenter GPUs or a more complex setup. There's a video that circulates around of somebody who bought a bunch of Mac minis and stacked them; let me see if I can find that for you quickly.
Alright, so I found the video, and here is the person running it — they have one, two, three, four, five, six, seven Mac minis, and it says they're running DeepSeek R1. You can see that they're M4 Mac minis, and it says the total unified memory is 496 gigabytes.
So that's a lot of memory, first of all. And it is kind of using GPUs, because these M4 chips are just like the Lunar Lake chip that I have: they have integrated graphics units and NPUs. But you see that they need a lot of them, so if you have a bunch of these you can technically run it.
And again, whatever you want to invest in, you really only need one of these, whether it's the Intel Lunar Lake, the Mac M4, or whatever AMD's equivalent is. But the point is, even if you were to stack them all, network them together, and do distributed compute — which you'd use something like Ray Serve for — look at the token output speed: it is not fast.
It's like clunk, clunk, clunk. So understand that you can do it, but you're not going to get that experience at home unless the hardware improves, or you buy seven of these.
But that doesn't mean we can't run some of the other models, right? You do need to invest in something like this machine and add it to your network, because if you buy a graphics card you then have to buy a whole computer around it, and it gets really expensive. So I really do believe in AI PCs. But we'll go back over to here.
We're not running that big one — there's no way we're able to run that one. But we can probably run the 7 billion parameter one easily; I think that one is doable. We can definitely do the 1.5 billion one. So this is really what we're targeting,
probably the 7 billion parameter model. To download it, all I have to do is copy this command here — I already have Ollama installed — and it will download the model for me. The workflow looks roughly like the commands below.
So it's now pulling it, probably from Hugging Face. If we go to Hugging Face and search DeepSeek R1, this is the kind of thing it's grabbing — probably this one, though there are some variants under here and I'm not 100% certain. You can see there are distillations of other models underneath, which is kind of interesting.
But this is probably the one being downloaded right now — I think it is. Normally what you're looking for here is the safetensors files, and there are a bunch of them. So yeah, I'm not exactly sure; we'll figure that out in a little bit. The point is that we are downloading it right now — if we go back over here, you can see it's almost downloaded, so it doesn't take that long.
The files are a little bit large, but I should have enough RAM on this computer. I'm not sure how much it comes with — just give me a moment. What I did is open up System Information, and down below here it's saying I have 32 gigabytes of RAM.
The RAM matters because you have to have enough memory to hold this stuff, enough disk space to be able to download it, and then you also want the GPU for inference. But you can see this is almost done, so I'm just going to pause here until it's 100% done.
Once it's done, it should automatically just start working, and we'll see that in a moment. Okay, just showing that it's still pulling.
So it downloaded, and now it's pulling some additional layers — I'm not exactly sure what it's doing — but now it is ready. It didn't take that long, just a few minutes.
And we'll just say, "Hello, how are you?" And that's pretty decent — it's going at an okay pace. Could I download a more intensive one?
That is the question we have here, because we're at the 7 billion; we could have done the 8 billion. Why did I do seven when I could have done eight? The question is where it starts to chug — it might be at the 14 billion parameter model. We'll just test this again.
So, "hello," and we'll try this again. You can see we're getting pretty decent results. The thing is, even with a smaller model, through fine-tuning we can get better performance for very specific tasks if that's what we want to do. But this one seems okay.
I'd actually be kind of curious to push it further — I can hear the computer, the Lunar Lake dev kit, spinning up from here — so I'm going to type /bye to exit. Then I want to delete that model, so I'll use remove on deepseek-r1. First, let's list the models here, because we want to be careful about the space we have. This model is fine, I just want to run the 8 billion parameter one or something larger.
So we'll remove it — okay, it's deleted. I'm pretty confident it can run the 8 billion, so let's do the 14 billion parameter one; this is where it might struggle. The question is, how large is this?
It's 10 gigabytes, and I definitely have room for that. So we're going to go ahead and download this one, and once we have it, we'll decide what we want to do with it. I'll be back here when it's done downloading. Okay.
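For reference, the housekeeping I just did maps to a few Ollama commands — again a sketch, with the tags taken from the Ollama library page:

```sh
ollama list                  # show downloaded models and their sizes on disk
ollama rm deepseek-r1:7b     # delete the 7B model to free up space
ollama run deepseek-r1:14b   # pull (~9-10 GB) and chat with the 14B distill
```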
Alright, so we now have this model running. And I'm just going to go ahead and type Hello. And surprisingly, it's doing okay. Now, you can't hear it.
But as soon as I typed, I could hear my little Intel developer kit going. And I just want you to know, if you were to buy an AI PC — the one I have is not for sale — but if you look up one with a Lunar Lake chip in it, the Core Ultra 200V series, from whichever provider Intel is partnered with, like Asus, you can get the same thing.
It's the same hardware; Intel just doesn't sell these direct, they always go through a partner. But you can see here that we can actually work with it. I'm not sure how long this would keep working — it might quit at some point — but at least we have some way to work with it.
So Ollama is one way that we can get this model, but there are other ways too. I'm going to go back to Ollama here, and I just want to delete that model now, because we're done with it. But there's another way that we can work with it —
I think it's called LM Studio (I always mix it up with NotebookLM) — which we'll do in the next video. That will give you more of an AI-powered-assistant experience: not working with it programmatically, but closer to the end result we want. I'm not going to delete the model just yet here;
if you want to, I've already shown you how to do that. We're going to look at the next tool in the next video, because it might require Ollama as the way you download the model — we'll find out. Okay, so see you in the next one.
Alright, so here we are at LM Studio. I've actually never used this product before; I usually use Open WebUI, which hooks up to Ollama. But I've heard really good things about this one, so I figured we'd open it up
and see if we can get something similar to a ChatGPT-style experience. Here they have downloads for Mac (the M-series, which are the latest ones), Windows, and Linux. So you can see they're suggesting you want one of these new AI PC chips, as is usually the case.
If you have GPUs, you can probably use them — I actually do have a really good GPU, an RTX 4080 — but I want to show you what you can use locally.
So we'll just wait for this to download. Okay, and now let's install it. But I'm really curious how we're going to plug this in — how are we going to download the model? Does it plug into Ollama?
Does it download the model separately? That's what we're going to find out here just shortly when it's done installing. So we'll just wait a moment here.
Okay. Alright, so now we have "Completing the LM Studio Setup." LM Studio has been installed on your computer; click Finish to close Setup. So go ahead and hit Finish.
Okay, this will just open up here; we'll give it a moment. I think in the last video we stopped Ollama — even if it's still running, I'm just going to close it out here. Again, it might require Ollama; we'll find out in a moment. So: "get your first LLM" — here it says Llama 3.2. That's not what we want.
Down below here it says "enable local LLM service on login." It sounds like we might need to log in and make an account, but I don't see a login.
I don't. So we'll go back over to here. And they have this onboarding step.
So I'm going to go ahead and skip onboarding, and let's see if we can figure out how to install this. I'm noticing at the top here we have "select a model to load — no LLMs yet, download one to get started." I mean, yes, Llama 3.1 is cool,
But it's not the model that I want. Right? I want that specific one.
And so this is what I'm trying to figure out. In the bottom left corner we have some options — I know it's hard to read, and I apologize, but there's no way I can make the font larger, unfortunately. They link to lmstudio.ai, so we'll go over there and into the model catalog. We're looking for DeepSeek: we have DeepSeek Math 7B, which is fine,
but I just want the normal DeepSeek model. We have DeepSeek Coder V2 — that'd be cool if we wanted to do some coding — and we have distilled ones: R1 Distill Llama 8B and Qwen 7B.
So I would think we probably want the Llama 8B distill. Okay, it says "Use in LM Studio," so I'm going to go ahead and click it.
And we'll click Open. Okay, now it's going to download it — it's also 4.9 gigabytes — so we'll go ahead and do that. The model is now downloading.
So we'll wait for that to finish. Okay, it looks like we don't need Ollama at all — this is an all-inclusive, one-stop tool. Though I do want to point out: notice that it's a GGUF file. That makes me think it's built on llama.cpp (I first said llama index, but GGUF is actually llama.cpp's model format), and the same is true of Ollama. If you had just the GGUF file yourself, you could even run it directly with llama.cpp, roughly like the sketch below.
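A minimal llama.cpp invocation would look something like this — the binary name and the quantized filename are assumptions (they vary by llama.cpp release and by which quantization you download), so treat this as illustrative:

```sh
# Hypothetical quantized file name; use whatever GGUF you actually downloaded
llama-cli -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
  -p "Hello, how are you?" \
  -n 256   # limit the number of generated tokens
```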
So they might be sharing the same underlying stack, because they both use GGUF files. This is still downloading, but while I'm here I might as well talk about what a distilled model is. You'll notice it says R1 distill Llama 8B, or Qwen 7B.
Distillation is where you take a larger model's knowledge and do a knowledge transfer to a smaller model, so it runs more efficiently but keeps similar capabilities. The process is complicated — I explain it in my GenAI Essentials course, which this crash course will probably get rolled into later on — but basically it's a technique to transfer that knowledge.
There are a lot of ways to do it, so I can't fully summarize it here, but that's why you're seeing distilled versions of these models.
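To make the idea concrete, here is a generic knowledge-distillation sketch in PyTorch. This is not DeepSeek's actual recipe (their paper describes distilling by fine-tuning smaller models on reasoning data generated by R1); it just shows the classic soft-label version of knowledge transfer.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Push the student's output distribution toward the (softened) teacher distribution."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 as in the original distillation formulation
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Training-loop sketch: run the frozen teacher and the small student on the same
# batch, then minimize this loss (usually combined with the normal cross-entropy
# loss on the ground-truth next tokens).
```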
So basically, they figured out a way to transfer the knowledge — probably they have a bunch of evaluation-style queries they run against the larger model, look at the results, and then train the smaller model to do the same thing so it performs nearly as well. Anyway, the model is done downloading, so we're going to go ahead and load it. I'm just going to get my head out of the way, because I'm blocking the screen. Now we have an experience that is more like what we expected.
At the top here, I'm glad there's a way I can bring the font size up. I'm not sure if there is a dark mode; the light mode is okay, but a dark mode would be nicer.
There's a lot of options around here. So just open settings in the bottom right corner. And here we do have some themes.
There we go. That's a little bit easier. And I do apologize for the small fonts.
There's not much I can do about it. I even told it to go larger. This is one way we can do it.
So let's see if we can interact with this. We'll say: I am learning Japanese. Can you act as my Japanese teacher?
Let's see how it does. Now, this is R1, which does not mean it has vision capabilities — I believe that's a different model. And again, I'm hearing my computer spinning up in the background. But here you can see that it's thinking: okay, so I'm trying to learn Japanese, and I came across the problem where I have to translate "I'm eating sushi" into Japanese.
"First, I know that in Japanese the word order is different..." So it's really interesting — it's going through a thought process. Normally, when you use something like Open WebUI, you're using the model directly, almost like a playground, but this one actually surfaces the reasoning, which is really interesting.
I didn't know that it had that. So there literally is agent-style thinking capability. This is not specific to DeepSeek; I think if we brought in another model like this, it would do the same. And it's showing us the reasoning it's doing as it works through this.
So we're going to let it think and wait until it finishes. It's really cool to see its reasoning, where normally you wouldn't. So when ChatGPT says it's thinking, this is the kind of stuff it's actually doing in the background that it doesn't fully show you. We'll let it work here; back in just a moment.
Okay. Alright, so it looks like I lost my connection. This sometimes happens, because when you're running a heavy computational task it can consume all the resources on your machine. This model is a bit smaller, but I was still running Ollama in the background.
So what I'm going to do is go to my Intel machine — I can see it rebooting in the background here. I'm going to give it a moment to reboot, reconnect, make sure Ollama is not running, and then we'll try that again.
Okay, so, back in just a moment. You know what — the computer decided to do Windows updates, so it didn't actually crash.
But this can happen when you're working with LLMs: they can exhaust all your resources. So I'm going to wait until the update is done and get my screen back up here in just a moment.
Okay. Alright, so I'm reconnected to my machine. I do actually have some tools here that can probably tell me my usage — let me just open them up and see if any of them will actually tell me my memory usage. Yeah, I wouldn't call that very useful.
Maybe there's some kind of tool I can download to monitor memory usage. Well, I guess the activity monitor can just do it, right?
Or what's it called — I'm trying to remember the hotkey for it. There we go.
And we go to task manager. So maybe I just have Task Manager open here, we can kind of keep track of our memory usage. Obviously, Chrome likes to consume quite a bit here.
I'm actually not running OBS — I'm not sure why it shows as launched here. Oh, you know what, I didn't open it on this computer.
Okay, so what I'll do is I'll just hit Task Manager. That was my task manager in the background. There we go.
And so here we can get an idea — this computer just restarted, so it's getting itself in order. We can see our memory usage is at 21%.
That's what we really want to keep track of. So I'm going to go back over to LM Studio and open it up. This is the kind of thing that really happens to me: you're using local LLMs and things crash.
And it's not a big deal. It just happens. But we came back here and it actually did do it.
It said it thought for three minutes and four seconds, and you can see its reasoning here. It says the translation of "I like eating sushi" into Japanese is "watashi wa sushi o taberu no ga suki desu," which is true — the structure places everything correctly.
One thing I'd like to ask is whether it can give me Japanese characters. So: can you show me the sentence in Japanese using Japanese characters, e.g. kanji and hiragana? Okay, we'll go ahead and do that.
It doesn't have a model selected, so we'll go to the top here. What's kind of interesting is that maybe you can switch between different models as you're working.
We do have GPU offload of discrete model layers — I don't know how to configure any of these things right now. Flash attention would be really good: it decreases memory usage and generation time on some models.
That requires a model and runtime that support flash attention, which we don't seem to have here right now. But I'm going to go ahead and load the Llama distill model, and we'll ask if it can do this for us, because that would make it a little more useful. Okay, so I'm going to run that, and we'll be back in just a moment to see the results.
Alright, we are back and we can take a look at the results. We'll just give it a moment — I'm going to scroll up. What's really interesting is that it does work every time I do this, but the computer restarts, and I think the reason is that it's exhausting all available resources. Now, the size of the model is not large; it's the 8 billion parameter one, at least I think that's what we're running here.
It's a bit hard to tell because it just says 8B distilled, so we'd have to take a closer look — it says 8 billion, so it's the 8 billion parameter one.
But the thing is, it's the reasoning happening behind the scenes that I think is exhausting it, whereas when we're using Ollama it's less of an issue. And it might just be that LM Studio, the way the agent works, doesn't have ways — or at least I don't know how to configure it — to make sure it doesn't exhaust the machine when it runs.
Because you'll notice here that we can set the context length, so maybe I should reduce that. There's also "keep model in memory" — reserve system memory for the model even when offloading to GPU; it improves performance but requires more RAM. So here, you know, we might
toggle this off and get better behavior. Right now when I run it, it is restarting — but the thing is, it is working.
So you can see here it thought for 21 seconds, it says, of course, I'd like to help you. And so here's some examples. And it's producing pretty good code, or like output, I should say. But anyway, what we've done here is we've just changed a few options. So I'm saying don't keep it in memory.
Okay, because that might be an issue. We'll bring the context window down, and it says CPU threads to allocate — that seems fine to me.
Again, I'm not sure about any of these other options. We're going to reload this model with those options. I want to try one more time; if my computer restarts, it's not a big deal.
But again, it might be just LM studio that's causing us these issues here. And so I'm just going to click into this one. I think it's set up with those settings. We'll go ahead and just say, okay.
So I'm going to ask: how do I say, in Japanese, "Where is the movie theater?" Okay — it doesn't matter if you know Japanese; we're just trying to tax it with something hard.
So here it's running again, and it's going to start thinking; we'll give it a moment. As it's doing that, I'm going to open up Task Manager. And... I notice it may have made my machine restart again.
Yeah, it did. So yeah, this is just the experience. Again, it has nothing to do with the Intel machine. It's just this is what happens when your resources get exhausted. And so it's going to restart again.
But this is the best I can demonstrate it here. Now, I can try to run this on my main machine using the RTX 4080, so that might be another option, where I actually have a dedicated GPU. That machine has a 14th generation Intel chip — I think it's Raptor Lake.
So maybe we'll try that as well in a separate video, just to see what happens. But that was the example there. I could definitely see how having more of those computers stacked would make this a lot easier — even just a second one would still be more cost-effective than buying a completely new computer outright,
or two more small mini PCs. But I'll be back here in just a moment. Okay, so I'm going to get this installed on my main machine. My main machine, as I'm recording here, is already using my GPU, so it's going to have to share it.
So I'm just going to stop this video, and then we're going to treat this next part as LM Studio using the RTX 4080, and we'll see if the experience is the same or different. Okay. Alright, so I'm back here.
And now I'm on my main computer, and we're going to use LM Studio. I'm going to skip the onboarding. I remember there's a way to change the theme — maybe the cog in the bottom right corner — and we'll change it to dark mode so it's a little easier on the eyes; I also want to bump up the font a little bit.
To select the model, I'm going to go to "select a model" at the top — we do not want that default one. So I'm going to go to... maybe the left-hand side? No, not there — it was in the bottom left corner. We're going to go to lmstudio.ai and make our way over to the model catalog at the top right.
I'm looking for DeepSeek R1 Distill Llama 8B. I'll click that, and we'll say "Use in LM Studio," which is now going to download it locally.
Okay, so now we are downloading this model, and I'll be back here in just a moment. Alright, so I've downloaded the model and I'm going to go ahead and load it. Again, I'm a little bit concerned, because I feel like it might cause this computer to restart too, but because it's offloading to the GPU, I'm hoping that will be less of an issue.
But here you can see it's loading the model into memory. Okay. And we really should look at our options that we have here. It doesn't make it very easy to select them. But oh, here it is right here.
Okay, so we have some options here. And this one actually is offloading to the GPU. So you see it has GPU offload.
I'm always wondering if I should have set GPU offload on the AI PC, because it technically has an iGPU, and maybe that's where we were running into issues. Whereas when we were using Ollama, maybe it was already utilizing the GPU.
I don't know. Anyway, what I want to do is ask the same kind of thing, so I'm going to say: can you teach me Japanese at the JLPT N5 level? Go ahead and do that.
We'll hit enter. Again, I love how it shows us the thinking it does here. I'm assuming that it's using the RTX 4080 I have in this computer, and this is going decently fast — it's not causing my computer to cry.
This is very good. This is actually reasonably good. And so yeah, it's performing really well.
So the question is — I'd like to go try the developer kit again, because I remember the GPU was not offloading there, so maybe it didn't detect the iGPU. But this thing is going pretty darn quick, and that was really, really good. It's giving me a bunch of stuff.
So I say: okay, but give me example sentences in Japanese. Okay, that's what I want. We'll give it a moment. Yep —
that looks good. It is producing really good output. This model, again, is just the Llama 8 billion parameter distill. I'm going to eject this model and go back into LM Studio, into the model catalog, because there are other DeepSeek models.
So we take a look at DeepSeek: we have Coder V2, described as the younger sibling of GPT-4 — the DeepSeek Coder V2 model — but that sounds like DeepSeek 2, right? So I'm not sure that's really the latest one, because we only want to focus on R1.
So yeah, I don't think we really care about those other ones; we only care about the R1 models. But you can see we're getting really good performance. The question is: what's the compute, or the TOPS, difference between these two machines? Maybe we can ask the model itself. I'm going to start a new conversation here and say: how many TOPS (I think it's called TOPS) does the RTX 4080 have? Okay, we'll see if it can do it.
We'll just use this same model here — yeah, we'll load the model and run that. We'll give it a moment.
And while that's thinking — I mean, obviously we could just use Google for this, we don't really need the model — but I want to do a comparison to see how many TOPS they each have. So let that run in the background.
I'm also just going to search and find out very quickly. Oh, here it goes: the RTX 4080 does not have an officially specified TOPS number; Nvidia focuses on metrics like CUDA cores and memory bandwidth,
so any number would be speculative. Okay, but then how do I compare TOPS for, say, Lunar Lake versus the RTX 4080? I know there are lots of ways to do it,
but if I can't compare them directly, how do I do it? While the model is trying to figure it out, I'm going to go over to Perplexity and see if we can get a concrete answer, because I'm trying to understand how much my discrete GPU does compared to the integrated one. So I'll search "Lunar Lake versus RTX 4080 TOPS performance" and see what we get.
So Lunar Lake has around 120 TOPS. And the RTX 4080 is marketed for gaming rather than AI workloads, so Nvidia doesn't typically advertise its TOPS — it talks about things like maintaining 60 FPS.
Okay, but then what could it be? How many TOPS could the RTX 4080 be? It kind of makes this hard, because we don't know how many TOPS it is, so we don't know what kind of expectation we should have for it.
Okay, fair enough. So we can't really compare — it's apples to oranges, I guess — and it's just not going to give us a clean number here.
But here it is going through a comparison: if you run MLPerf-style benchmarks on both GPUs, with a model like ResNet, you can directly compare across the different architectures, and that's basically the only way to do it. So it's apples to oranges. I want to attempt to run this one more time on the Lunar Lake, and I want to see if I can set the GPU offload.
If we can't set the GPU offload, then I think it's always going to have that issue with LM Studio specifically. But we will use the Lunar Lake with Hugging Face and other things like that. So, back in just a moment. Okay. Alright, so I'm back, and I just did a little bit of exploration on my other computer, because I want to understand: I have this AI PC, it's very easy to run this on my RTX 4080, but when I run it on the Lunar Lake it is shutting down — and I think I understand why.
And this, I think, is really important: when you are working on local machines, you have to have a better understanding of the hardware. So I'm just going to RDP back into this machine here. Just give me a moment.
Okay, I have it running again. And it probably will crash again. But at least I know why.
So there's a program called Core Temp, and what Core Temp does is let you monitor your CPU temperatures. This is for Windows; on a Mac I don't know exactly what you'd use, probably just Activity Monitor. But here I can see that none of these CPU cores are being overloaded.
But this is just showing us the CPUs. If we open up Task Manager here, okay, and now the computer is running perfectly fine. It's not even spinning its fans.
On the left-hand side here we can see we have CPU, NPU, and GPU. Now, the NPU is the thing we'd ideally want to use, because an NPU is specifically designed to run models.
However, a lot of the frameworks, like PyTorch and TensorFlow, were originally optimized for CUDA, so normally you have to go through an optimization or conversion step. I don't know, at this time, if there is an optimized conversion for Intel hardware, because DeepSeek is so new, but I would imagine that's something the Intel team is working on. And this isn't specific to Intel: AMD or whoever else also wants optimizations that leverage their different kinds of compute, like their NPUs. It also depends on the tool we're using.
So we're using that tool over here — oh yeah, that window is just Core Temp showing us all the temperatures. What we can do is keep an eye on what's going on, so I'm going to bring this over so we can see what's happening.
We'd like to use the NPU, but it's not going to happen, because this setup isn't built for that. If I drop this down and click into it, we have our options — before, we didn't have any GPU offload. We can go here and offload to the GPU; I don't know how many layers you can offload, but I'll set something like 24. There's a CPU thread count, which might be something we want to increase; we can reduce our context window; and we might not want to keep the model in memory.
But the point is, if it exhausts the GPU — because it's all one integrated chip — I have a feeling it's going to end up restarting the machine. Here, again, you can see usage is very low. We'll go ahead and load the model.
And the next thing I'll do is type in something like: I want to learn Japanese. Can you provide me a lesson on Japanese sentence structure? Okay, we'll go ahead and do that. Actually, I've noticed that when a prompt doesn't require a long thought process, it works perfectly and doesn't cause any issues with the computer. We'll go ahead and run it.
Let's pay attention to the left-hand side here. Now we can see that it's utilizing the GPU — before, it was at zero, not using the GPU at all. Notice it's at around 50%, and it's doing pretty well. Our CPU usage is higher than usual.
When I ran this earlier off screen, the CPU was really low and it was the GPU that was working hard. So again, you really have to understand your settings as you go. But this is not exhausting the machine so far; we're just watching these numbers and also our core temps.
And you can see we're not running into any issues — it's not even spinning up its fans or making any complaints right now. The other challenge is that I have a developer kit, which is something they don't sell.
So if there were an issue with the BIOS, I'd have to update it myself, and all I can get is Intel's help. But if I were to buy a commercial version of this from whoever they're partnered with — Asus, or Lenovo, or whatever — I would probably have fewer issues, because those vendors maintain the BIOS updates. But so far we're not having issues; we're just monitoring here at 46, 47, 41 percent.
I'm watching it, and you can see cores at 84%, 89%. We're just carefully watching this stuff, but I might have picked the perfect settings here.
And maybe that was the thing: what did we change in the options? I turned the GPU offload down, and I also told it not to keep the model in memory, and now it's not crashing.
Okay, there we go. It's not as fast as the RTX 4080. But you know what, this is my old graphics card here. I actually bought this not even long ago before I got my new computer.
This is an RTX 3060. This is not that old — it's a couple of years old, from 2022. And I would say that when I used to use it to run models, my computer would crash. So the point is that these newer chips, whether it's the M4, the Intel Lunar Lake, or whatever AMD's equivalent is, are roughly as strong as discrete graphics cards from a couple of years ago, which is crazy to me.
But anyway, I think I might have found the sweet spot. I'm just really, really lucky. But you can see the memory usage here and stuff like that.
And you just have to monitor it, and you'll figure out which settings work for you. Or, you know, you buy a really expensive GPU and it'll run perfectly fine.
But here it's going. And we'll just give it a moment. We'll be back in just a moment. Okay. Anyway, it's going a little bit slow.
So, you know, I just decided we'll move on here. But my point was made clear: if you dial in the right settings, you can make this stuff work on machines where you don't have a dedicated graphics card. If you do have a dedicated graphics card, you can see it's pretty good — this runs fine on the RTX 4080 — so if you have that, you're going to be in good shape.
Now that we've shown how to do this with AI-powered assistants, let's take a look at how we can actually get these models from Hugging Face next and work with them programmatically. So I'll see you in the next one. Alright, so what I want to do in this video is see if we can download the model from Hugging Face and then work with it programmatically, since that's going to give you the most flexibility with these models. Of course, if you just want to consume them, then using LM Studio, as I showed you, would be the easiest way to do it. But having a better understanding of these models and how we can use them directly is useful.
I think for the rest of this I'm just going to use the RTX 4080, because I realized that to really make use of AI PCs, you have to wait until they have optimizations for the model. With Intel, again, you have this toolkit called OpenVINO, and OpenVINO is an optimization framework.
If we go down, I think they have a bunch of examples here — quick start examples, maybe over here, maybe not — but we go back to the notebooks
and scroll on down. Yeah, they have this page here, and on it they have different LLMs that are optimized specifically so that you can leverage the NPU, or at least make them run better on CPUs. But until that exists for this model, we're stuck on the GPU.
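To give a sense of what that would look like once a conversion is available: the optimum-intel integration can export a Hugging Face model to OpenVINO and run it on Intel hardware. This is a sketch under the assumption that the R1 Llama 8B distill exports cleanly; the model ID and device choice are illustrative, and NPU support in particular is not something I'm asserting here.

```python
# pip install optimum[openvino] transformers
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # assumed to be exportable
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch weights to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
# model.to("GPU")  # device selection (CPU / GPU / NPU) varies by OpenVINO version

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```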
So for now we're not going to get the best performance we could. Maybe in a month or so I can revisit that, start using it, and it might be as fast as my RTX 4080 — but for now we'll stick with the RTX 4080. Let's go look at DeepSeek on Hugging Face, because they have more than just R1. You can see there is a collection of models, and when we click into it we have R1 and R1-Zero, which I wasn't sure about at first.
We also have R1 Distill Llama 70B, R1 Distill Qwen 32B, and Qwen 14B.
And so we have some variants here that we can utilize. Just give me a moment, I want to see what zero is. So to me, it sounds like zero is the precursor to R1.
The card says it's a model trained via reinforcement learning without supervised fine-tuning. Okay, so I don't think we want to use R1-Zero; we want to use the R1 model, or one of these distilled versions, which give similar capabilities. If we go over to it, it's not 100% clear how we can run this,
but down below we can see total parameters is 671 billion. So this one literally is the big one, the really, really big one, and that would be too much for us to run on this machine — we can't run 671 billion parameters. You saw the person stacking all those
Apple M4s; I have an RTX 4080, but I'd need a bunch of them to do it. Down below we have the distilled models, and this is probably what we were pulling when we were using Ollama. So this is where I'd focus my attention: these distilled models.
When we're using Hugging Face, it will show us how we can deploy the models up here. Notice over here we have vLLM. I covered this in my GenAI Essentials course, I believe. There are different ways to serve models — just as websites have web server software underneath, these machine learning models have serving software — and vLLM is one you want to pay attention to, because it can work with the Ray framework.
And Ray is important because — I'll just search for Ray here — this framework has a product within it called Ray Serve (it's not showing me the graphic here).
Ray Serve allows you to take vLLM and distribute it across compute. So when we saw that video of the Mac M4s stacked on top of each other, that was probably using something like Ray Serve with vLLM, or a similar tool, to scale it out. If you were going to run the full model, you might want to invest time in vLLM; the Hugging Face Transformers library is fine as well. But again, we're not going to be able to run the full model on my computer, and probably not on yours. A single-node vLLM sketch is shown below.
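For reference, here's roughly what single-node vLLM usage looks like. The distributed, multi-machine setup with Ray Serve is a bigger topic and isn't shown; the model ID here is one of the distills small enough to fit on one GPU, used purely as an illustration.

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Illustrative: a distilled checkpoint that fits on a single consumer GPU.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(
    ["How do I say 'where is the movie theater?' in Japanese?"],
    params,
)
print(outputs[0].outputs[0].text)
```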
So we're going to go back here for a moment. There's also V3, which has been very popular as well, and that is actually what we were using on the DeepSeek website. If we go into DeepSeek-V3 — yeah, this one is a mixture-of-experts model.
This would be a really interesting one to deploy as well, but it's also a 671 billion parameter model, so it's another one we can't deploy locally. If we could, we'd get vision tasks and all these other things it might be able to do.
So we're really going to have to stick with R1, and it's going to be one of these distills. I'm going to go with the Llama 8 billion parameter one — I don't know why we don't see the other ones there —
but 8 billion is something we know we can reliably run, whether it's on the Lunar Lake or on the RTX 4080. Over on the right-hand side we have Transformers and vLLM; Transformers is probably the easiest way to run it, and we can see that we have some code. So I'm going to get set up here and open up VS Code.
I already have a repo — I'm going to put this in my GenAI Essentials course, because I figured if we're going to do it, we might as well put it there. So I'm going to open that folder here, and I need to go up a directory; I might not even have this cloned.
So I'm going to grab this directory really quickly — just cd back. And I do not have it cloned.
So I'm going to go over to GitHub. This repo is completely open, so if you want to do the same thing you can. We'll search for GenAI Essentials, and I'm going to copy the clone URL.
So, git clone, and then I'm going to open it up. I'm going to open this with Windsurf for fun, because I really like Windsurf.
I've been using it quite a bit — if I have it installed here... yeah, I do. I have a paid version of Windsurf,
so I have full access to it; if you don't, you can just copy and paste the code, but I'm trying to save myself some time here. So we're going to open this up, go into the GenAI Essentials repo, and make a new folder in here called deepseek. Inside that one I'll make another called r1-transformers, because we're going to use the Transformers library to do this.
I'm going to select that folder, we're going to say yes. I'm going to make a new file here, and I probably want to make this an IPython notebook file.
I'm not sure if I'm set up for that, but we'll give it a go. So what we'll do is we'll type in basic.ipynb, which is the extension for Jupyter notebooks. And you'd have to already have Jupyter installed; if you don't know how, in my Gen AI Essentials course I show you how to set this stuff up.
So you can learn it that way if you want. I'm going to go over to WSL here, and yeah, I'll install that extension if it wants to. And I'm going to see if I have conda installed; I should have it installed.
There it is, and we have a base environment. Anytime you are setting up one of these environments, you should really make a new one, because that way you'll run into fewer conflicts. So I need to set up a new environment. I can't remember the exact instructions, but I'm pretty certain I show them somewhere here, under local developments in this folder.
And if I go to conda, and I go into setup, I think I explained it here. So for Linux, which is what I'm using right now with Windows Subsystem for Linux 2, conda is already installed, so I just want to create a new environment.
I probably want to use Python 3.10.0; if it's the future, you might want to use 3.12, but this version seems to give me the least amount of problems. So I want this command, but I want to change it a little bit: I don't want the environment to be called hello, I want to call it deepseek.
So we'll go back over here and paste it in, and now it's setting up Python 3.10 and installing some stuff. Okay, now we're good, and I need to activate it.
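Roughly, the terminal commands are these; the environment name deepseek is just what I picked:

    conda create -n deepseek python=3.10.0 -y
    conda activate deepseek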
So it's conda activate deepseek. Now we are using the deepseek environment. I'm going to go back over here on the left-hand side, and what I want to do is get some code set up.
So if we go back over here, into the 8-billion distilled model, and we go to Transformers, we have some code. And if it doesn't work, that's totally fine; we will tweak it from there.
I also have example code lying around, so if for whatever reason this doesn't work... sorry, just pause there for a second.
If it doesn't work, we can grab from my code base here, because I don't always remember how to do this stuff. Even though I've done a lot of this, I don't remember half the stuff that I do. So we're going to go ahead here and copy this over and put it up here.
But I'm not sure how well Windsurf works with IPython notebooks; I've actually never tried that before. So it's asking us to start something: we need to select a kernel. And I'm going to say... oh, it's not seeing the kernels that I want.
Hmm. But you know, one thing I don't think we did is install the IPython kernel. There's an extra step we're supposed to do to get it to work with Jupyter, and it might be under our Jupyter instructions here. Yes, it's this: we need to make sure we install ipykernel.
Otherwise, it might not show up here. So I'm going to just go ahead here, and I'm going to do conda... whoops... conda hyphen f conda-forge.
So we're saying download from conda-forge. And I think it's conda install. So it's conda install -f conda-forge, and then we paste in ipykernel, and now it should install ipykernel.
I'm not sure if that worked or not; we'll go up here and take a look. The following packages are not available for installation. Oh, it's hyphen c, not hyphen f.
Okay, so we'll go here. The -c just means to use the conda-forge channel, and this should resolve our issue.
So we're going to install ipykernel, right? Give it a second, and we'll say yes.
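So the corrected command is roughly this; the second line is an optional extra step that registers the environment as a named Jupyter kernel, in case the editor still can't see it:

    conda install -c conda-forge ipykernel -y
    python -m ipykernel install --user --name deepseek --display-name "deepseek"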
Okay. So I'm hoping that will let us actually select the kernel. We might have to close Windsurf and reopen it; we can do the same thing in VS Code, it's the same interface, right? But I'm not seeing it show up here.
So I'm just going to close Windsurf. It would have been nice to use Windsurf, but if we can't, that's totally fine. I'm going to go ahead and open this again; we're going to open up the Gen AI Essentials repo.
So I'm just going to say open. I'm not using a coding assistant here, so we're just going to work through it the old-fashioned way. And somewhere in here we have a deepseek folder. We're going to make a new terminal, and I want to make sure that I'm in WSL, which I am. I'm going to say conda activate deepseek, because that's where I need to be. So I now have that activated, and I'm going to go into the deepseek folder, into the r1-transformers folder. I'm looking for the deepseek folder... there it is, we'll click into it.
And I did not save any of the code, which is totally fine. It's not like it's too far away to get this code again. And so I'm going to go back over to here.
And we're going to grab this code. Okay, I'm going to paste it in. And we'll make a new code block.
And I want to grab this and put this below. Okay, now normally we install PyTorch and some other things, but I'm going to just try the most bare-bones thing first; it's going to tell me transformers isn't installed.
And that's totally fine. I'm just trying to look... there we go, do this. So we'll run that. And so I'm going to go here to install Jupyter.
Oh, it's installing Jupyter, I see. Okay, so we do need that; maybe the kernel would have worked.
So I'm going to go to Python environments, and now we have deepseek. So maybe we could have gotten it to work with Windsurf, but that's fine. And we don't have transformers installed: No module named 'transformers'.
I know we've done this before, so we might as well go leverage that code and see what we did. Here we have hugging-face-basic, and yeah, we do a pip install of transformers.
So that's really all we need. There's also python-dotenv; we might need that as well, because we might need to pass in our Hugging Face API token to download the model. I'm not sure at this point.
But I'll go ahead and just install that up at the top. Okay, we'll give that a moment to install; it shouldn't take too long. We might also need to install PyTorch or TensorFlow, or both.
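Just so you can see it, that install cell at the top of the notebook is roughly this; the %pip magic just installs into the notebook's environment:

    %pip install transformers python-dotenv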
That's very common when you're working with open-source models: they may be in one format or another and need to be converted over. Sometimes you don't need to do it at all, but we'll see.
So now it's saying to restart. We'll just do a restart here; we should only have to do that once. And I'm going to go ahead and import it now, so we have less of an issue. Here, it's showing us this model.
So basically, this will download it directly from Hugging Face. If we grab this address here and go back over to the page I had open a moment ago, it should match this address, right? So if I just delete this out here and put it in here...
It's the same address, right? And so that's how it knows what model it's grabbing. But we'll go back over to here.
And it doesn't look like we need our Hugging Face API token, but we'll find out in just a moment. So it should download it, and we'll get a message here: we'll load transformers, we'll have the tokenizer, and then we'll have the model. The messages list here is what gets passed in, and it also mentions you can point it at a local model directory directly.
Okay, so I think there are just two different approaches shown here: one loads the model directly with the pretrained loader. Yes, there are two ways we can do it, and I think we cover this: you either use the model directly or you use a pipeline.
So let's go ahead and see if we can just use the pipeline. If I don't remember how to do this, we can go over here and take a look; I don't remember everything that I do.
But yeah, this is the one we just had open a moment ago, the basic one. It has a pipeline, and then we just use it. So this, in a sense, should just work.
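For reference, the pipeline version ends up looking roughly like this; the model ID is the Llama 8B distill we picked on Hugging Face, and the prompt and max_new_tokens are just example values I'm choosing:

    # Rough sketch of the pipeline approach from the model card.
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    )
    messages = [{"role": "user", "content": "Who are you?"}]
    print(pipe(messages, max_new_tokens=256))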
So let's go ahead and see if that works. So I'm just going to separate this out. So I don't have to continually run this. We'll cut this out.
Okay, we'll run that, and then we'll run this. And down below it says: at least one of TensorFlow 2.0 or PyTorch should be installed; to install TensorFlow, do this.
And so this is what I figured we were going to run into, where it's complaining like, hey, you need PyTorch or TensorFlow. I don't know which one it needs; I would guess TensorFlow, because I saw that mentioned. So I'm going to just go ahead and make a new cell up here.
I'm really just guessing: I'm going to say tensorflow, and I'm also going to say pytorch. Let's just install both, because it'll need one or the other and one of them will work.
Hopefully I spelled those right. They're two competing frameworks. I learned TensorFlow first, and I kind of regret that, because PyTorch is now the most popular, even though I really like TensorFlow, or specifically Keras. But we'll give this a moment to install.
And once we do that, we'll run it again and see what happens. Okay, so it says PyTorch failed to build, and I hope that doesn't matter.
Because if it uses TensorFlow, that's fine, but it failed to build installable wheels. Just a moment here; that's my twin sister calling me, she doesn't know I'm recording right now. So I'm going to go ahead and restart this, even though we might not have PyTorch. Or it might be installed, I'm not sure.
We're going to go ahead and just try it again anyway, because sometimes this stuff just works. And we'll run it, and it is still complaining, saying at least one of TensorFlow 2.0 or PyTorch should be installed; to install TensorFlow 2.0, do this.
To install PyTorch, read the instructions here. Okay, so this shouldn't be such a huge issue. Let's use DeepSeek, since we are big DeepSeek fans here today.
I'm going to go over to the DeepSeek website, which is running V3, whereas I've been using R1 locally. I'll log in here, we'll give it a moment, and we'll say: you know, I need to install TensorFlow 2.0 and PyTorch to run a Transformers pipeline model.
So we'll give that a go and see what we get. So here it's specifically saying to use 2.0. Yeah, and it's always a little bit tricky.
So I'm going to go back up here, and maybe we can pin the version with ==... I mean, it did install TensorFlow 2 already, so we don't need to tell it to install 2.0 again. So let's go down below here and carefully look at the error.
So: at least one of TensorFlow 2.0 or PyTorch should be installed. And then there's something about selecting the framework, TensorFlow or PyTorch, to use for the model. Oh, so it's asking which framework to use, since it doesn't know. Okay, I'm going to go back over here and give it this error, and see if it can figure it out.
And it's not exactly what I want, so I'm going to stop it here. I'm just saying: I am using Transformers, how do I specify the framework? I'm surprised I have to specify the framework; it usually just picks it up.
Okay, so here we have PyTorch or TensorFlow. I think TensorFlow successfully installed, but I'm not sure if it's just guessing, because this thing could be hallucinating. We don't know. But we'll go ahead and just give this a try.
And we'll run this here. And we're still getting that error, right? So I'm going to go over here; this is probably a common Hugging Face issue for TensorFlow. Somebody has commented here that you need to have PyTorch installed.
So say deep seek, I don't know if there's anyone that's actually told us how to do this yet. Give me a second. Let me see if I can figure it out.
Alright, so I went over and asked Claude instead. Maybe Claude will do better here, because it's not just the model itself, it's the reasoning behind it, and V3 didn't really get us very far.
It was supposed to be a really good model. But here Claude is suggesting that PyTorch is generally what's used, and maybe my install instructions were incorrect. We have TensorFlow, which is fine, but here it's suggesting that we install torch and accelerate.
Okay, so I'm going to go ahead and run this. So maybe the package is just called torch, and I just forgot; I don't know why I wrote pytorch.
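So the install cell ends up being roughly this; torch is the PyPI package name for PyTorch, and accelerate is what Transformers uses for device placement:

    %pip install torch accelerate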
We'll give that a moment and see what happens. The other thing is that it's saying we probably don't need the framework specified, because for Llama in particular it normally uses PyTorch. I'm not sure if that's the case here. Another thing we could do is go take a look at Hugging Face... or sorry, not Hugging Face... yeah, Hugging Face.
And look at the files here. And I'm seeing a TensorFlow file, so it makes me think that it is using TensorFlow.
But maybe it needs to be converted over to PyTorch, I don't know. We should have both installed anyway.
So even though I removed it from the top there, TensorFlow is still installed, and we can just leave it there as a separate pip install tensorflow line. This is half the battle in getting these things to work: dealing with these conflicts.
And you will get something completely different than me, and you'll have to work through it. But we'll wait for this. It would be interesting to see whether we could serve this via vLLM.
But we'll just first work this way. Okay. Alright, so that's now installed.
I'm going to go to the top here and give it a restart. Now we should have those installed, so we'll go ahead and import transformers and the pipeline, and we'll run this next. And now it's working.
So that's really good. Is it utilizing my GPUs? I would think so.
Sometimes there's some configurations here that you have to set, but I didn't set anything here. I think right now it's just downloading the model. So we're going to wait for the model to download. And then we just want to see if it infers. I'm not sure why it's not getting here, but maybe it'll take a moment to get going.
We didn't provide any Hugging Face API token, so maybe that's the issue. It's kind of hanging here.
So it really makes me think that I need my Hugging Face API token. What I'm going to do is grab this code over here, because I assume that it wants it. That's probably what it is.
And sorry, I'm going to just pull this up here. Oops, we'll paste this in here as such. I'm going to drag this on up here.
And I'm going to just make a new .env file. I'm also going to gitignore it, because I don't want it to end up in the repo. And it needs the Hugging Face API key in it.
I never remember what the variable is called, but we'll go take a look; I'm just doing this off screen here. So search for the Hugging Face API key environment variable.
Okay. So, key, where are you, key... I'm having a hard time finding the name of the environment variable right now. Oh, it's HF_TOKEN.
That's what it is. So I need HF_TOKEN. And I'm going to go back here and see if it's actually downloaded at all.
Did it move at all? No, it hasn't. So I don't think it's going to move, and I think it's because it needs the Hugging Face API token.
So I'm over here on Hugging Face. I have an account; you go down below, you go to Access Tokens, and I've got to log in once. Alright, so I'm going to create a new token, it's going to be read-only, and this will be for DeepSeek. There were no settings that I had to accept to be able to download the model, so I think it's going to work. I'm going to get rid of my key later on, so I don't care if you see it.
I'm in this file here. So that variable was called HF_TOKEN, I believe. And now we have our token supposedly set. We'll go back over here, I'm going to scroll up, and I'm going to run this.
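If it turns out the token does need to be passed explicitly, a minimal sketch would look something like this; I'm assuming the variable in the .env file is named HF_TOKEN, and I'm using the login helper from the huggingface_hub library:

    # Minimal sketch: read HF_TOKEN from a local .env file and authenticate
    # the Hugging Face downloads (assumes python-dotenv and huggingface_hub).
    import os
    from dotenv import load_dotenv
    from huggingface_hub import login

    load_dotenv()                        # loads variables from .env into the environment
    login(token=os.environ["HF_TOKEN"])  # authenticate hub/transformers downloads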
And now it should know about my token; I shouldn't even have to set it explicitly, I don't think. So maybe it'll download now, I'm not sure. Go back over to this one and notice we're not passing the token in anywhere. I'm just going to bring this down by one cell as well.
This is acting a little bit funny here today. I'm not sure why. Like, why is it going all the way down there?
It's probably just the way the messaging works here. I'm gonna cut this here and paste it down below. So I'm really just trying to get this to trigger.
And I mean, there's this other one here, but it's not doing anything. Another way we could do it is to just download the model directly; I don't like doing it that way.
But we could also do it that way. I'm just looking for the Hugging Face token environment variables.
Yeah, it's HF_TOKEN. So I have it right, but why it's not downloading, I don't know.
Let's go take a look at that page and just make sure that there wasn't anything that we had to accept. Sometimes that's a requirement where it's like, hey, if you don't accept the things, they won't give you access to it. So if I go over here to the model card, it doesn't show anything that I have to select to download this. Yeah, there's nothing here whatsoever. Right.
So again, just carefully looking here, we have some safetensors files, that's fine. Oh, here it goes. Okay, so we just had to be a little bit patient.
It's probably a really popular model right now. And that's probably why it's so hard to download. But I'm just going to wait till this is done downloading.
I'll be back here in just a moment. It's downloading and running the pipeline. Okay, I did put the print down below here, so it might execute here, or it might execute up there.
We'll find out here in a moment. This one might be redundant because I took it out while it was running live here. But we'll wait for this to finish. Okay.
It's taking a significant time to download. Oh, maybe it's just almost done here. But yeah, it's downloading the shards and getting the checkpoints.
Now it's starting to run, and it says cuda:0. I think that means it's going to utilize my GPU; cuda:0 is the first GPU device, whereas the CPU would show up as cpu (or device -1 in a pipeline). But now it appears to be running.
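If you want to double-check that PyTorch actually sees the GPU, a quick sanity check in a cell looks like this:

    import torch

    print(torch.cuda.is_available())          # True if PyTorch can see a CUDA GPU
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # device "cuda:0" is the first GPU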
Okay, so we'll just wait a little bit longer. Now the thing is, once this model is downloaded, we can just call pipe every time and it'll be a lot faster, right? We'll wait a little bit longer. Okay.
All right, I'm back here, and I mean, I ran the first part of the pipeline, which is fine, but I guess I didn't run this line here. So we'll run it. And since we separated it out, I think this one's already defined; hopefully it is. We'll run this, and it should work.
It's probably now just doing its thing trying to run. But we'll give it a moment. And we'll see what happens here.
Okay. Yeah, I don't think it should take this long to run. I'm going to stop this, and we're going to run it again; I think it will be faster this time. It's not really working, because the video I'm recording here is kind of struggling.
That's why I like to use a separate machine for this, because now my computer is hanging. So what I might need to do here is pause, if I can. All right, I'm kind of back.
My computer almost crashed. Again, I'm telling you, it's not the Lunar Lake; it's that these things can exhaust all your resources. And that's why it's really good to have a separate computer that's specifically dedicated, like an AI PC, or a dedicated PC with GPUs, not your main machine.
But there is a tool called nvidia-smi, and it will actually show us the usage. It's probably not going to tell us much right now because it's already running, but as this is running, we can use it to figure out how much of the GPU is being used.
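The command is literally just nvidia-smi, and wrapping it in watch refreshes it every second so you can keep an eye on it while the model runs:

    nvidia-smi
    watch -n 1 nvidia-smi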
But I'm going to go back up here for a moment and take a look. So it says CUDA went out of memory.
And CUDA kernel errors might be asynchronously reported at some other API call. So this is what I mean about this being a little bit challenging. And again, we downloaded the other models, but those other models that we saw... and by the way, I'll bring my head back in here, since we stopped seeing the EOS Webcam Utility... the thing we saw was that when we used Ollama to download, it was using GGUF, which is a format that is optimized to run on CPUs.
Right, it can utilize GPUs as well. So that one was already optimized, whereas the model we're downloading here is not optimized, I don't think. And apparently I just don't have enough memory to run the 8-billion-parameter one. But the question is whether it's downloading the correct one.
So if we go back over here, this one is the distilled 8-billion-parameter one; it has to be, right, because of that there. So we might actually not even be able to run this, at least not in that format. Okay, so you can see where the challenges are coming in.
So go over to the files and take a look: we can see we have a bunch of safetensors, and that's not going to help us that much. We've got to go back into DeepSeek here and look at the ones they have. Well, here's the question: we did do the 8-billion-parameter one.
So we go in here: 8 billion. There's the Qwen 7-billion one, which is a bit smaller, and there's also the 1.5-billion one, which isn't going to be as useful for us. But you know what, I'm kind of exhausting my resources here.
So we can run that as an example, and then if you have more resources, like more RAM, you'll have fewer problems. So I'm going to go ahead and copy this over here, and we're going to paste it in here as such.
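So the only change in the notebook is the model ID; this points at the 1.5-billion Qwen distill from the same DeepSeek collection:

    # Same pipeline as before, just pointed at the 1.5B Qwen distill, which
    # needs far less GPU memory than the Llama 8B one.
    from transformers import pipeline  # already imported above; repeated for completeness

    pipe = pipeline(
        "text-generation",
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    )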
Okay, so now we are literally just using a smaller model, because I don't think I have enough memory to run the bigger one, especially while I'm recording at the same time. And if we go over here and just type clear: so, fan, temperature, performance. You can see the GPU isn't being used right now.
If it were, we'd see the GPU usage showing up over here, right? And right now I think it's just trying to download the model, because we swapped out the model. So at some point here it should say, hey, we're downloading the model. It's not, for some reason, but we'll give it a moment, because the other one took a bit of time to get going. So I'm going to pause until I see something. Alright, so after waiting a while, this one ran, and it says CUDA out of memory; CUDA kernel errors might be asynchronously reported at some other API call, along with a stack trace.
So it keeps running out of memory, and I think that's more of an issue with this computer. So I might have to restart and run this again. I'm going to stop the video, restart, and be back.
That's the easiest way to dump the memory, because I don't know any other way to do it. But, you know, if I look here, it shows no memory usage, so I'm not really sure what the issue is.
But I'm going to restart. I'm also going to close OBS, run it offline, and then I'll show you the results. Okay, be back in just a moment. All right, I'm back, and I just went ahead and ran it.
And this time it worked much faster. So I'm not sure; maybe it was holding on to the cache of the old one that was in here. But giving my computer a nice restart really did help it out. And you can see that we are getting the model to run. I don't need to run the pipeline every single time.
I'm not sure why I ran that twice, but I should be able to run this again. Again, I'm recording, so maybe this won't work as well, since the recording is also using the GPU.
We'll see. So now it's struggling, but literally, when I ran this offline it was almost instantaneous, it was that fast. So yeah, I think it might be fighting for resources, and that is a little bit tricky for me here.
We'll go back over here to nvidia-smi. I'm not seeing any of the processes listed, so it's kind of hard to tell what's going on here.
But I'm going to go ahead and stop this, because it clearly works. So even though I can't show you... yeah, see over here, this is Volatile GPU-Util, 100%, and then down here it says 33%. I thought that these cores would start spinning up so we could make sense of it. And then here, I guess, is the memory usage.
So over here, you can see we have 790 of 818, and here we can see kind of the limits of it. But if I run it again, you can see that just me recording this video is using up the memory, and that makes it a bit of a challenge.
The only way around that would maybe be to use onboard graphics for the recording, which isn't working for me, because I don't know if I even have onboard graphics. But that's okay. So anyway, that's our example that we got working; it clearly does work.
I would like to try to do another video where we use vLLM, but I'm not sure if that is possible. We'll consider this part done, and if there's a video after this, then you know that I was able to get vLLM to work.
See you in the next one. Alright, that's my crash course on DeepSeek. I want to give you some of my thoughts about how I think our crash course went and what we learned as we were working through it. One thing I realized is that in order to run these models, you really do need optimized models.
And when we were using Ollama, if you remember, it had the GGUF extension. That's the file format that is more optimized to run on CPUs; I know that from the exploration I did with LlamaIndex in my Gen AI Essentials course. So optimized models are going to make these things a lot more accessible. When we were using Notebook LM, or whatever it was called... it wasn't Notebook LM, it was LM Studio.
Notebook LM is a Google product, but LM Studio was adding that extra thought process, and so more things were happening there; it was exhausting the machine. Even on my main machine, where I have an RTX 4080, which is really good, you could see that it ran well. But then when we were trying to work with it directly, where we didn't have an optimized model we were downloading, my computer was restarting.
So it was exhausting both my machines trying to run it. Though I think on this machine it was partly because I was using OBS, which uses a lot of my resources. But there's a video that I did not add to this where I was trying to run it on vLLM, and I was even trying to use the 1.5-billion Qwen distilled model, and it was saying I was running out of memory. So you can see this stuff is really, really tricky.
And even with an RTX 4080 and with my Lunar Lake, there were challenges, but there are areas where we can utilize it. I don't think we're exactly there yet for a full AI-powered assistant with thought and reasoning, but the RTX 4080 kind of handled it, if that's all you're using it for, you're restarting those conversations, and you're tuning some of those settings down. And the Lunar Lake could do it if we tuned it down. One thing that I did say, and that I realized after doing a bit more research (because I forget all the stuff that I learn), is that NPUs are not really designed to run LLMs. I was saying earlier, maybe there's a way to optimize for them or something, but NPUs are designed to run smaller models alongside your LLMs, so you can distribute a more complex AI workload.
So maybe you have an LLM, and alongside it a smaller model that does something like images, I don't know, something. And maybe that's where you can utilize the NPU. But we're not going to see anything utilizing NPUs to run LLMs, at least not in the next couple of years; it's really the GPUs. And so we are really fixed on the iGPU on the Lunar Lake and then what the RTX 4080 can do.
So, you know, maybe if I had another graphics card... and I actually do, I have a 3060, but unfortunately the computer I bought doesn't allow me to slot it in. If there was a way I could distribute the compute across this computer and my old computer, or even the Lunar Lake as well, then I bet I could run something that is a little bit better.
But you know, you probably want a home-built computer with two graphics cards in it, or multiple AI PCs that are stacked with distributed compute. And just like we saw in that video where the person was running the 671-billion-parameter model: if you paid close attention to the post, it actually said that it was running with four-bit quantization. So that wasn't the model running at its full precision.
It was running it highly quantized. And so quantization can be good. But if it's at four-bit, that's really small.
And it was chugging along. So, you know, the question really is: okay, even if you had seven or eight of those machines, you still have to quantize it, which is not easy, and it's still slow.
And would the results be any good? As an example, it was cool, but I think that 671-billion-parameter model is really far out of reach.
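For what it's worth, if you wanted to experiment with 4-bit on the Transformers side (rather than a GGUF build), a sketch with bitsandbytes would look something like this; I'm not saying this is what that person was running, it's just the general idea of trading precision for memory:

    # Sketch of loading the 8B distill in 4-bit with bitsandbytes to cut memory
    # use roughly to a quarter of fp16 (assumes bitsandbytes and accelerate are
    # installed and you're on a CUDA GPU).
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
    bnb_config = BitsAndBytesConfig(load_in_4bit=True)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # let accelerate place layers on the GPU
    )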
But that means we can try to target one of these other ones, like the 70-billion-parameter model. Or maybe we just want to reliably run the 7- or 8-billion-parameter model by having one extra computer. And so you're looking at, if you're smart about it, maybe $1,000 to $1,500. And then you can run a model; it's not going to be as good as ChatGPT or Claude, but it definitely paves the way there.
We'll just have to continue to wait for these models to be optimized, and for the hardware to improve or the cost to go down. But maybe we're just two computers away, or two graphics cards away. But yeah, that's my two cents, and I'll see you in the next one. Okay, ciao.