
DeepSeek-LLM Local Inference Setup

Sep 15, 2025

Summary

  • This meeting provided a comprehensive technical walkthrough and performance assessment of running the 671B-parameter DeepSeek model (404GB at 4-bit quantization) on a $2,000 local AI inference server, with detailed hardware and BIOS recommendations, software setup, and performance tuning tips.
  • The host demonstrated reasoning and parsing test cases, gave actionable advice for setting up both CPU- and GPU-based inference, and highlighted key configuration and troubleshooting steps.
  • BIOS and system settings were discussed for optimal performance, along with model architecture choices and practical context/prompt strategies.
  • Several model limitations, use-case considerations, and next steps for improving local LLM workflows were noted.

Action Items

  • Ongoing: Update linked article with all performance metrics and configuration details discussed in the session.
  • Ongoing: Test extended context window and memory map settings; share findings.
  • Ongoing: Monitor performance impacts of BIOS and SVM (virtualization) mode changes and report in a follow-up.
  • Ongoing: Gather user feedback on CPU and context window performance for future tuning.
  • Ongoing: Explore further containerization (LXC, Docker, Proxmox VM) and update guides when viable.
  • Ongoing: Investigate llama.cpp and alternative models (e.g., Llama 3.3, distilled variants, 70B-class models) for local deployment and report comparative findings.
  • Ongoing: Solicit GPU/CPU tuning tips and memory interleaving insights from the community; update recommendations accordingly.

Hardware and System Setup for DeepSeek 671B Local Inference

  • Running the 404GB, 671B-parameter DeepSeek model locally requires an enterprise/server-class machine with at least 512GB of RAM, ideally built around AMD EPYC CPUs for cost-effective memory bandwidth.
  • The $2,000 rig (without GPUs) achieves 3.5–4 tokens/second on the full model at 4-bit quantization (Q4), with further gains available via BIOS and system-level optimizations; a rough bandwidth estimate follows this list.
  • Server motherboards with 16 DIMM slots (e.g., the Gigabyte MZ32-AR0) are preferred for affordable memory expansion; using 32GB DIMMs offers cost advantages.
  • Noise levels and form factors differ greatly between workstation and server builds; remote management via BMC was recommended for headless operation.
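
As a rough sanity check on the quoted 3.5–4 tokens/second, the sketch below estimates the memory-bandwidth ceiling for CPU-only decoding. The per-token active-parameter figure (~37B, from DeepSeek's published mixture-of-experts design) and the eight-channel DDR4-3200 bandwidth figure are assumptions for illustration, not numbers from the session.

```python
# Back-of-envelope estimate of CPU-only decode speed for a 4-bit DeepSeek build.
# Assumptions (not from the session): ~37B active parameters per token (MoE),
# and a single EPYC socket with 8 channels of DDR4-3200 (~25.6 GB/s each).

MODEL_SIZE_GB = 404            # quantized model size on disk
TOTAL_PARAMS_B = 671           # total parameters, in billions
ACTIVE_PARAMS_B = 37           # assumed active parameters per token (MoE)
MEM_BW_GBPS = 8 * 25.6         # theoretical peak memory bandwidth, GB/s

bytes_per_param = MODEL_SIZE_GB / TOTAL_PARAMS_B        # ~0.60 bytes/param at Q4
gb_read_per_token = ACTIVE_PARAMS_B * bytes_per_param   # ~22 GB streamed per token
ceiling_tps = MEM_BW_GBPS / gb_read_per_token           # ~9 tok/s upper bound

print(f"~{gb_read_per_token:.1f} GB read per token, "
      f"bandwidth ceiling ~{ceiling_tps:.1f} tok/s")
# Real-world overheads (NUMA effects, cache misses, attention compute) bring this
# down toward the observed 3.5-4 tok/s range.
```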

Software Installation, Configuration & Performance

  • Full installation, configuration, and troubleshooting steps for bare-metal Ubuntu 24.04 systems are documented and emphasized as critical; prior Linux experience is important.
  • LLM deployment was tested using Ollama and Open WebUI, establishing a baseline for future containerized/virtualized deployments.
  • Disabling simultaneous multithreading (SMT) and manually tuning BIOS/power policies significantly improved tokens/second throughput.
  • Guidance was provided for setting static IPs, Docker Compose network settings, and connecting Open WebUI to external LLM endpoints.
  • For best CPU-only performance, set the context window explicitly and manage parallelism (num_parallel=1 maximizes the supported context size).
  • GPU offloading increases supported context window size but does not dramatically improve throughput for this model size.
  • Key environment variables for tuning (num_parallel, host interface, keep_alive, GPU-specific settings) were detailed for both CPU and GPU inference; a configuration sketch follows this list.
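
As an illustration of the endpoint and context-window settings above, here is a minimal client sketch for querying a remote Ollama instance the way Open WebUI would. The IP address, model tag, and context size are placeholders; the server-side variables shown in the comments (OLLAMA_HOST, OLLAMA_NUM_PARALLEL, OLLAMA_KEEP_ALIVE) belong in the Ollama service environment on the server, not in this client.

```python
# Minimal client sketch for a remote Ollama endpoint (as Open WebUI would use).
# Server-side tuning (set in the ollama service environment, not here):
#   OLLAMA_HOST=0.0.0.0:11434     # listen on all interfaces, not just localhost
#   OLLAMA_NUM_PARALLEL=1         # single session -> largest usable context
#   OLLAMA_KEEP_ALIVE=-1          # keep the model resident between requests
import requests

OLLAMA_URL = "http://192.168.1.50:11434"   # placeholder static IP of the server
MODEL = "deepseek-r1:671b"                 # placeholder model tag

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": MODEL,
        "prompt": "Summarize the rules of Flappy Bird in three sentences.",
        "stream": False,
        "options": {"num_ctx": 16384},     # request an explicit context window
    },
    timeout=600,                           # CPU-only decode is slow; be generous
)
resp.raise_for_status()
print(resp.json()["response"])
```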

Reasoning and Model Performance Demonstrations

  • The DeepSeek model handled a range of reasoning, parsing, and memory tasks (e.g., Flappy Bird code generation, logic word problems, spatial reasoning about cat positions, math reasoning, reciting 100 digits of Pi, SVG generation).
  • Parsing and chain-of-thought style queries were accurately processed, with noted performance degradation during long context or multi-turn reasoning.
  • Some limitations in code review, context recall speed, and complex spatial/geographical reasoning were observed.
  • Throughput (tokens/second) ranged from 2.5 to 4, generally decreasing on longer or more complex tasks; a measurement sketch follows this list.
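
For reference, those tokens/second figures can be read directly from Ollama's response metadata rather than timed by hand. The sketch below (same placeholder endpoint and model tag as earlier) derives decode throughput from the eval_count and eval_duration fields returned by /api/generate.

```python
# Measure decode throughput from Ollama's response metadata.
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
import requests

OLLAMA_URL = "http://192.168.1.50:11434"   # placeholder, same server as above

data = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": "deepseek-r1:671b",       # placeholder model tag
        "prompt": "List the first 20 digits of pi.",
        "stream": False,
    },
    timeout=600,
).json()

tokens_per_second = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"decode throughput: {tokens_per_second:.2f} tok/s")
```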

BIOS Optimization and Power Management

  • Detailed BIOS settings changes were shared: SMT disabled; NPS (NUMA nodes per socket) set to 1; determinism control set to Manual with the power/determinism policy set to Performance; boost clock and TDP values set according to board/CPU limits. A verification sketch follows this list.
  • Best practices for memory interleaving, PCIe, and CPU/CCD control were noted, with community input invited for further tuning.
  • Firmware upgrade paths for boards (e.g., V1->V3 for MZ32-AR0) were explained for compatibility with newer CPUs like Milan.
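
A quick way to confirm the SMT and NPS changes took effect is to check the kernel's view after a reboot. The sketch below assumes the standard Linux sysfs layout on the Ubuntu install (not something covered in the session) and simply reports SMT state and NUMA node count, which should read 0 and a single node with SMT off and NPS=1.

```python
# Verify BIOS changes from inside Ubuntu after a reboot.
# Assumes standard Linux sysfs paths; nothing here is specific to this build.
from pathlib import Path

# SMT: "0" expected when SMT is disabled in the BIOS.
smt_active = Path("/sys/devices/system/cpu/smt/active").read_text().strip()
print(f"SMT active: {smt_active}  (expect 0 with SMT disabled)")

# NUMA nodes: with NPS=1 on a single-socket EPYC there should be exactly one.
nodes = sorted(p.name for p in Path("/sys/devices/system/node").glob("node[0-9]*"))
print(f"NUMA nodes: {nodes}  (expect ['node0'] with NPS=1)")
```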

Model Use Cases and Limitations

  • DeepSeek 671B is not recommended as a daily-driver LLM due to its speed and hardware demands; Llama 3.3 or similar is recommended for general-purpose use.
  • GPU investment is best reserved for scenarios needing larger context windows; VRAM is not a cost-effective substitute for system RAM at this scale.
  • Anticipation for rapid LLM improvements and new models (e.g., vision-capable) in the coming months was expressed.

Decisions

  • Continue using AMD EPYC with server motherboards for cost-efficient local LLM hosting — Server platforms offer greater scalability for memory and bandwidth, critical for current and next-generation large models.
  • Prefer Linux/bare metal for local LLM deployment with future migration to containers/VMs as performance allows — Establish a well-documented baseline for further experimentation.
  • Set num_parallel=1 for larger context windows on high-memory servers — Avoids OOM errors and maximizes single-user session performance.

Open Questions / Follow-Ups

  • Does memory-mapped inference offer substantial gains in stability or speed for multi-service/multi-container environments?
  • Can Ollama or other frameworks explicitly assign model layers/partitions to specific GPUs for further optimization?
  • Are there BIOS/memory interleaving settings that can further increase throughput for these workloads?
  • Feedback requested from community: experiences with various CPUs, context window configurations, and alternative hosting solutions.
  • Will Proxmox/LXC/Docker-based deployments materially impact tokens/second and stability versus bare metal? More testing needed.