Mac Studio M3 Ultra Performance

Jul 20, 2025

Overview

The speaker compares the value and performance of the 96 GB Apple M3 Ultra Mac Studio against its higher-end 512 GB sibling and an RTX 5080-based AI workstation. Tests focus on running large language models (LLMs) locally, covering memory capacity, generation speed, and practical considerations for local AI development.

Mac Studio M3 Ultra Overview

  • The 96 GB M3 Ultra offers a strong value proposition at one-third the cost of the 512 GB model.
  • RAM capacity is critical for running LLMs locally and keeping data off the cloud.
  • The M3 Ultra has the highest memory bandwidth of any Mac to date (819 GB/s).
  • Top Nvidia GPUs such as the RTX Pro 6000 offer higher bandwidth, but at much higher cost.
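Memory bandwidth matters because LLM token generation is largely memory-bound: producing each token requires streaming essentially all model weights through the GPU. A back-of-envelope sketch of the resulting ceiling, using the 819 GB/s figure above; the ~4.7 GB weight size for a 7B model at Q4 quantization is an assumption for illustration, not a figure from the source.

```shell
# Rough upper bound on tokens/sec for a memory-bound model:
#   tokens/sec ≈ memory bandwidth / bytes read per token (≈ model size)
# 819 GB/s bandwidth; ~4.7 GB assumed for a 7B model at Q4.
awk 'BEGIN { printf "%.0f tokens/sec upper bound\n", 819 / 4.7 }'
# → 174 tokens/sec upper bound
```

The measured 85 tokens/sec reported below sits comfortably under this theoretical ceiling, as expected once compute and overhead are accounted for.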

Performance Comparison: M3 Ultra, M4 Pro, RTX 5080

  • Prompt processing speed on the M3 Ultra (Qwen 7B Q4): 1,118 tokens/sec vs. 456 tokens/sec on the M4 Pro.
  • Output generation is also faster on the M3 Ultra (85 tokens/sec) than on the M4 Pro (46 tokens/sec).
  • RTX 5080 (16 GB VRAM) handles models up to 14B parameters; struggles with larger ones due to VRAM limits.
  • Once the model is loaded, the M3 Ultra processes long prompts consistently faster (13 sec) than an M4 Max MacBook Pro (47 sec), and matches or beats the RTX 5080 (27 sec).
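The throughput figures above translate directly into wall-clock time for a typical coding session. A sketch using the quoted M3 Ultra rates; the 10,000-token prompt and 500-token response sizes are assumed for illustration.

```shell
# Wall-clock time at the quoted M3 Ultra rates:
# 1,118 tok/s prompt processing, 85 tok/s output generation.
awk 'BEGIN {
  prompt = 10000 / 1118;   # seconds to process a 10,000-token prompt
  output = 500 / 85;       # seconds to generate a 500-token reply
  printf "prompt %.1fs + output %.1fs = %.1fs total\n", prompt, output, prompt + output
}'
# → prompt 8.9s + output 5.9s = 14.8s total
```

Note that for long prompts, prompt-processing speed dominates the wait, which is why the comparison above emphasizes it.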

Practical Developer Use: Model Sizes and Parallelism

  • For code tasks and chat, running multiple small LLMs (14B or lower) is quick and efficient on 96 GB.
  • The M3 Ultra can keep multiple models loaded simultaneously (e.g., one for code completion, one for chat), outperforming a MacBook Pro.
  • Parallel processing is less efficient on Apple Silicon with current libraries (requests queue rather than run concurrently); Nvidia handles this scenario better.
  • MLX-optimized models on Apple Silicon are fast but less consistent; GGUF format is slower but more stable.

Limitations and Memory Allocation

  • M3 Ultra (96 GB) cannot run the largest models (e.g., Llama 3 70B) without memory adjustments.
  • System memory allocation can be tuned (e.g., GPU set to 90 GB) to fit larger models, but generation speed drops (e.g., 9.3 tokens/sec for 70B model).
  • Models beyond 32B struggle or become impractically slow on 96 GB.
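The "GPU set to 90 GB" adjustment above can be done on Apple Silicon by raising the GPU wired-memory cap via `sysctl`. A minimal sketch, assuming the `iogpu.wired_limit_mb` sysctl name (present on recent Apple Silicon macOS; the value is in MiB and reverts on reboot):

```shell
# Fit a larger model by raising the GPU wired-memory cap (value in MiB).
# Assumption: iogpu.wired_limit_mb is the sysctl on Apple Silicon macOS;
# the setting is not persistent and resets on reboot.
LIMIT_MB=$((90 * 1024))                              # 90 GB -> 92160 MiB
echo sudo sysctl iogpu.wired_limit_mb="$LIMIT_MB"    # drop 'echo' to apply
```

Leave several GB for the OS and running apps; starving system memory trades one failure mode for another, and as noted above, a 70B model still only reaches ~9.3 tokens/sec even once it fits.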

Cost and Value Considerations

  • The refurbished M3 Ultra was purchased for $600 below list price, with quality matching a new device.
  • Most developers will benefit more from the 96 GB model than the costlier 512 GB variant.
  • 96 GB suffices for multiple smaller models, offering good speed and efficiency for local AI development.

Decisions

  • The 96 GB M3 Ultra is the most practical and cost-effective choice for local LLM and AI development, over higher-RAM or Nvidia alternatives.

Recommendations / Advice

  • Use 14B parameter models or smaller for code completions; 32B is feasible but slower.
  • Developers should prioritize models that deliver near-instant results for productivity.
  • Adjust GPU memory allocation for larger models if necessary, but expect speed trade-offs.
  • Buy refurbished Mac Studios for significant savings with little compromise on quality.