Transcript for:
Mac Studio M3 Ultra Performance

This is Apple's most overlooked M3 Ultra Mac Studio. Wa! How'd you do that so fast? Magic. Everyone chases its 512 GB big brother, but I couldn't pass up the deal I got on this one. And the value proposition actually surprised me. See, when running large language models locally, RAM is everything. The more you have, the more of your own data and code you can keep off the cloud. A few months back, I tested the 512 GB Ultra in a cluster and then I sold it. An eye-watering price tag will do that. But I've wondered ever since if I made a mistake. Instead, I built a workstation around a 96 GB VRAM RTX Pro 6000, chasing raw power at any cost. Now I've found this base-spec Ultra, also with 96 GB of memory, but for about a third of the money, and I have a feeling this is actually the one people should be looking at. So today we're pitting the frugal Ultra against its bloated sibling and my power-hungry RTX rig to see which one truly owns the sweet spot for local AI, code, and whatever else you might want to do with it. Just don't game, okay? You can only game on that one. Nice. Well, you could game on the Ultra, but we'll leave that to Android. I want to show you why it's still important to have the M3 Ultra even though it's one generation behind. We've got M4s here now. Here's the M4, the M4 Pro, and there's the M4 Max. This is the most important thing right here for LLMs: memory bandwidth. This right here is the GPU bandwidth, 102.4 GB per second. That's just the M4. The M4 Pro is quite a bit more, 153 GB per second. M4 Max. Sorry, I don't have a Mac Studio with an M4 Max. Let's check it out. Here we go. This number right here, 400-something gigabytes per second. This is already getting into serious territory, and it's more than what the new Nvidia DGX Spark is going to have. It's more bandwidth than the new AMD chip, the Ryzen AI Max+ 395. I hope I said that right. But just wait: M3 Ultra. Check this out. Bam. 819 GB per second. That's crazy. That's the highest bandwidth on any Mac ever so far. But there is a but. Nvidia's top-of-the-line GPUs have a much higher bandwidth, 1.8 terabytes per second on the 5090 and the RTX Pro 6000; the 5080 and 5070 all have lower bandwidths. I may do a separate video comparing the speeds of all of those, but what does that mean for actual real-world performance? Let's do a quick comparison here and see how the M3 Ultra stands out against the M4 Pro, for example. Just a quick one. Let's do this DeepSeek R1 Distill Qwen 7B Q4. Bam. And here's the huge difference between the M4 Pro and the M3 Ultra. We have the model here, and we have pp512, which is prompt processing speed, right here: 456 tokens per second. And then token generation is 46 tokens per second for that model. Both are really good speeds. Both are really acceptable. But pay attention to that prompt processing speed, because it's very important when it comes to actual real-world performance. For example, if you're using your model for code completion in a code editor, you're sending a bunch of context over. Each time you type a keystroke or a set of keystrokes, it sends that off to the LLM, a bunch of text around whatever you're typing, the context. You want to make sure that gets processed very quickly. And look how quickly the M3 Ultra processes it. pp512 on the M3 Ultra for the same model: 1,118 tokens per second. The output is 85 tokens per second. That's what the M3 Ultra gets you.
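Those pp512 and token-generation numbers are the kind of thing llama.cpp's llama-bench tool reports. If you want to reproduce that table on your own hardware, here's a minimal sketch that drives llama-bench from Python; the binary location and the GGUF model path are placeholders for your own setup.

```python
# Minimal sketch: run llama.cpp's llama-bench to get pp512 (prompt processing)
# and tg128 (token generation) throughput for a local GGUF model.
# LLAMA_BENCH and MODEL are placeholder paths for your own build and model file.
import subprocess

LLAMA_BENCH = "./llama-bench"                       # built from the llama.cpp repo
MODEL = "models/qwen2.5-coder-14b-q4_k_m.gguf"      # hypothetical local model path

result = subprocess.run(
    [
        LLAMA_BENCH,
        "-m", MODEL,
        "-p", "512",   # prompt-processing test with a 512-token prompt (pp512)
        "-n", "128",   # token-generation test producing 128 tokens (tg128)
        "-ngl", "99",  # offload all layers to the GPU (Metal on Apple silicon, CUDA on Nvidia)
    ],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # llama-bench prints a table with a tokens-per-second column per test
```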
And that's why people want that machine. The question, though, is do we need 512 GB, or is 96 enough? So these days, I'm constantly flipping between models: GPT-4 for notes and email, Claude for code refactors, Flux for image generation, Kling for video. Four tabs, four bills, and counting. Enter ChatLLM Teams. It's one dashboard that houses every top LLM, and RouteLLM picks the right one for a given task: o4-mini-high for fast answers, Claude Sonnet 3.7 for coding, Gemini 2.5 Pro for big context, and it even adds GPT-4.1 before ChatGPT has it. Chat with PDFs and PowerPoints, then generate decks and docs and do deep research, all in the same chat. Need human-sounding copy? The humanize toggle rewrites text to beat AI detectors. Spin up agents and run code with AI Engineer. I built my first bot in just minutes. Track artifacts, create GitHub pull requests, and debug from the same interface. Need visuals? No problem. Use Flux, Ideogram, and Recraft for images; Kling, Luma, and Runway for video, all built in. And the kicker: it's just $10 a month, less than one premium model. Head over to chatllm.abacus.ai or click the link in the description and level up with ChatLLM Teams. Now, I have a theory. There are fewer GPU cores here, 60 versus the 80 in the full-blown 512 GB model. My theory is that it's not going to affect the LLM speeds much. I may be wrong on that, because there is a lot of compute that happens especially in the prompt processing stage. So I'm going to send in some long prompts. At the same time, I want to pick a fairer comparison from the Nvidia line of GPUs, and the 5080 comes the closest. It has higher memory bandwidth for sure, but it's the closest one I found without going under the 819 that the Mac Studio has. So that would be the RTX 5080. Let's say you have a somewhat long prompt like this one right here. It's a more detailed prompt: you're a highly experienced full-stack software architect, DevOps engineer, and here are basically the requirements of what you're looking for. It's a pretty decent-sized prompt. And I'm running Qwen 2.5 Coder because there is no Qwen 3 Coder yet, but Qwen 2.5 Coder has been a really good model based on word of mouth, my own experience with it so far, and what I've read about it. There's a 32 billion parameter version and a 14 billion, and I stuck with the 14 billion so that it can fit inside both of these machines, because the RTX 5080 has 16 gigs of memory and Qwen 2.5 Coder 14B is 9 GB in size. So that's a good sizing for that. This is the prompt.json file, which I'm going to send over the network to these machines: one to the Mac Studio, one to the other machine. Because let's say you're running on a less powerful laptop, say a MacBook Air or something, but you have a dedicated machine for your AI processing, for your LLM loads, somewhere else, or you have it on a shelf and you want to connect to it and that's how you want to use it. You can definitely do that, so you don't need to carry a chunky, big MacBook Pro or whatever laptop you have. But to get a baseline, I'm going to run it locally first on this MacBook Pro. This will also give you an idea of how it would be if you're traveling and you just have one machine with you. And I'm using LM Studio for this. You could use Ollama or whatever else you want. And .85 is my local IP, so I'm going to change that to 85. So let's launch that. What this is doing right now is actually querying the local version of LM Studio.
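Under the hood, "launching" that prompt.json at a given IP is just an HTTP POST to LM Studio's OpenAI-compatible server. Here's a minimal sketch, assuming the server is listening on its default port 1234; the IP address and the shape of prompt.json are illustrative, not taken from the video.

```python
# Minimal sketch: send prompt.json to an LM Studio (or any OpenAI-compatible)
# server on the local network and time the round trip.
# HOST is a placeholder; swap in the .85 / .215 machines as needed.
import json
import time
import urllib.request

HOST = "192.168.1.85"  # placeholder LAN IP of the machine running the model
URL = f"http://{HOST}:1234/v1/chat/completions"  # LM Studio's default server port

with open("prompt.json") as f:
    payload = json.load(f)  # assumed to hold {"model": ..., "messages": [...]}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

start = time.time()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.time() - start

print(body["choices"][0]["message"]["content"])
print(f"Total time: {elapsed:.1f} s")  # includes model load time on a cold run
```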
And that's going to load the model up. If the model sits idle past its timeout, it gets unloaded automatically, so there's that initial hit. Oh, I'm hearing it. The fans are kicking up here. And there we go. There's our response from the model. This took 58 seconds total. Now the model is loaded, so I'm going to run this one more time while it's loaded, just to get the actual speed of the prompt processing and the generation by themselves, without the loading time. Wow. You can probably hear that. This MacBook Pro is not happy right now. 47 seconds. Wow. Now, let's go to the Mac Studio. That's a different IP address, so let's go to that one. Boom. And as you can see, that triggered the model to load. It didn't take that long, but now it's processing, and you'll see from the GPU history chart that it's processing on the GPU. Right now, we're using about 25 GB out of the 96 available on the machine. Plenty of room to go for larger models if you want to run extra models on top of Qwen Coder. And that's going to be important if you're running, for example, a chat model as well as an autocomplete model while you're coding. You can load both of them at the same time. All right, 38 seconds on this one, 20 seconds faster than the M4 Max MacBook Pro, but that's with the load. So let's run it again while the model is loaded. Boom. 13 seconds. Nice. Nice. Now, let's do the RTX 5080 machine. The IP is 215. Boom. And there it goes, loading up. This part is generally a little slower on the Windows or Linux machines, because it has to load the model into system memory first and then copy it to the GPU memory. That step is avoided on the Apple silicon machines. But let's see if it makes up for that. By the way, this card has 16 GB of VRAM, and we're using 9.7 of that right now. Utilization is 95%, which means we're doing pretty well here. Look at that GPU chart. It's using it to the fullest, and it's done. 39 seconds. What? 1 second slower than the Mac Studio. It's that load time, I'm telling you. Let's see if we can beat the 13 seconds we got while the model was actually loaded. So we're going to run this one more time. It's already loaded, so it doesn't need to do that again. And there it goes. 27 seconds. That's kind of embarrassing, considering this thing has higher bandwidth compared to the Mac Studio's 819. Let's run the Mac Studio one again. There are a couple of other benefits to having the Mac Studio. It's a lot more portable than the whole AI rig I've got down there, and it uses a lot less power. I already have other videos on this, so I'm not going to dig into that too much, but this one is taking quite a bit longer. 55 seconds this time around. It's the same exact prompt. Let's run it one more time. It's getting warm in here. Okay, 45 seconds this time around. And let's give another chance to the RTX 5080. 27 seconds that time. Pretty consistent for the RTX card, but very inconsistent for the Mac Studio. That was the MLX model on Apple silicon, and MLX provides performance optimizations for running models on Apple silicon. Now I'm trying the GGUF version, which is the exact same one that's available for machines without MLX capability, which is going to be our RTX machine here. Whoa. 1 minute and 5 seconds for that one. Holy cow. Let's try that one more time. 1 minute and 4 seconds. Now we're getting some consistency from the Apple silicon machine, and it's showing us a couple of things.
One is that MLX models are faster, but they're also not very consistent. The GGUF models are consistent, but they're much slower than on the Nvidia 5080. Now, when you're chatting, that's the experience you're going to get with that kind of model. What if you're doing code completions, which are a slightly different animal? With code completions, the editor is constantly sending your context and prompts to the LLM, so the model needs to be available immediately and always there. And this is where smaller models come in handy, smaller models that are good. I made an exploratory video about this with the members of the channel, just comparing a few different ones, but let's ignore which one is better for now. We're going to be using the same model, and I'm going to test parallelism here. I've sent off four of these requests, all at the same time, to the Mac Studio: the exact same request, but all happening simultaneously, therefore dividing that processing up. And here, LM Studio is actually telling us that it's queuing some of these up, so it's not going to be able to do them in parallel very well. Let's see how long this takes. One of these took 1 minute and 3 seconds, another took 2 minutes and 5 seconds, then 3 minutes and 8 seconds, and 4 minutes and 9 seconds. This demonstrates that the libraries we have available for Apple silicon right now, at least through llama.cpp, are not that great at this kind of parallel processing. Though I shouldn't blame llama.cpp, because I ran the exact same test on the RTX 5080, and that one has slightly different results: 27 seconds for one of them to complete, which, notice, is pretty close to the time we got when running just one request, then 53 seconds for the second, 1 minute 19 for the third, and 1 minute 44 for the fourth. So overall, it took a lot less time to do those, and on a smaller model it's going to be even quicker. There were no queued-up requests waiting to be completed on the Windows machine. Now, if you're using a library like vLLM instead of llama.cpp, which LM Studio doesn't offer, you're going to see the parallelism at work even more, because that library does it really well. And that's something the RTX Pro 6000 is really good at. I'm doing a separate video on that, so stay tuned to find out more. But given what we've seen so far and the speeds that are actually practical, that's why using a smaller model is going to be better when you're doing local development. So I want to do a quick little test here. We're going to do a simple prompt: hi. Boom. And we're getting 81 tokens per second and 49 tokens per second on the Mac Studio. Both are pretty reasonable if you're doing chat, and of course your prompts are going to be longer than that. But let's say we were to get a larger model, like Qwen 2.5 Coder 32B Instruct. First of all, we're going to exceed the memory limit on the 5080, but not on the Apple silicon machine. So if I do something like "write a story" here, I know a lot of you are going to be mad at me for that prompt, but we're not doing any coding here, so it doesn't really matter. This will just give us an idea of what speed we're getting. You can see it right there. Look at that speed difference. The M3 Ultra Mac Studio is so much faster because everything is running on the GPU. We're at 23 tokens per second here, and we've got 2.95 tokens per second for a 32 billion parameter model on the RTX 5080.
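To reproduce that four-requests-at-once test, something like the sketch below works against the same OpenAI-compatible endpoint: fire identical requests from a small thread pool and print each one's wall-clock time. Widely spread times (1, 2, 3, 4 minutes) mean the server is queuing; tightly stepped times mean it's overlapping work. Host, port, and payload are placeholders.

```python
# Minimal sketch: send four identical requests simultaneously and time each one
# to see whether the backend runs them in parallel or just queues them.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

HOST = "192.168.1.85"  # placeholder IP of the LLM server
URL = f"http://{HOST}:1234/v1/chat/completions"

with open("prompt.json") as f:
    payload = json.dumps(json.load(f)).encode("utf-8")

def timed_request(i: int) -> str:
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()  # we only care about timing here, not the generated text
    return f"request {i}: {time.time() - start:.0f} s"

with ThreadPoolExecutor(max_workers=4) as pool:
    for line in pool.map(timed_request, range(1, 5)):
        print(line)
```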
So at this point, and this is just my opinion here, I think we've reached sort of a processing limit with this generation of the M3 Ultra, and if you go higher than that, you're just going to struggle. So let's take a look at another one that's bigger: Llama 3.3 70 billion. Can we offload everything to the GPU? I sure as heck am going to try. And don't even try to load this on the 5080. There's no point. Write a story. Failed to send the message. Let's try that again. Hi. The message failed to send. What's happening here is we're out of memory. This is the only place where I think the 512 GB machine comes in: if you want to run larger models, you can. But just because you can, should you? Now, of course, I can make it fit, and here's how I would do it. See, by default, not all 96 GB is allocated to the GPU. Some of it is reserved for the system, and there's a certain limit that you can set. Right now, it's set to 72 GB for the GPU. You can change that. I can set it to, let's say, something safe. If it's 96 total, maybe we can do 90 GB for the GPU. So allocate 90, and let's double-check to make sure it was set. Yeah, it's set to 90 now. Now, if we go back to LM Studio and I say hi, there it goes. Now it's working. It's not crashing. "It's nice to meet you. Is there something I can help you with?" But 9.3 tokens per second is what we're getting, and this is the 8-bit quant, by the way. Now I imagine if you're running anything larger than that, it's going to be even slower, and at that point it's probably not even useful. You let me know in the comments down below what you think. I think there's a certain limit. Yeah, it's nice to be able to run DeepSeek R1, but you're getting such low tokens per second that it isn't even worth it. The 96 GB model is actually pretty decent. Now, I also paid $600 less than the list price because I got it refurbished, and as for buying refurbished from Apple, I've bought four refurbished things from them so far and they've always been pretty much like new. So I think I got a pretty decent deal on it. Anyway, just to conclude this experiment and this rant: I believe we're going to be running smaller models, especially as developers, because we want almost instantaneous results, and this machine is very capable of that. Even if you're running multiple small models, one for completions, one for chat, another one for editing, anything around 14 billion parameters works; 32 might be pushing it, but it's still doable, and we're getting pretty decent results for chat at 32 billion. But for code completions, you want to go 8 billion or lower. And all of that will fit on a 96 GB machine. You don't need the 512 GB model. Anyway, that's it. By the way, if you missed my video about clustering a bunch of Mac Studios, check out this video right over here. Thanks for watching and I'll see you in the next one.
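A footnote on that GPU memory-limit change: on recent versions of macOS for Apple silicon this is typically done with the iogpu.wired_limit_mb sysctl, although the exact key name has varied between macOS releases, so treat this as an assumption to verify on your own machine. A minimal sketch:

```python
# Minimal sketch: raise the GPU wired-memory limit on an Apple silicon Mac so a
# larger model fits, roughly matching the 90 GB example above.
# Assumes the iogpu.wired_limit_mb sysctl (recent macOS); requires sudo, and the
# value resets to the default on reboot.
import subprocess

LIMIT_MB = 90 * 1024  # 90 GB, leaving ~6 GB of the 96 GB for the system

# Apply the new limit (sudo will prompt for your password).
subprocess.run(["sudo", "sysctl", f"iogpu.wired_limit_mb={LIMIT_MB}"], check=True)

# Double-check that the new value took effect.
subprocess.run(["sysctl", "iogpu.wired_limit_mb"], check=True)
```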