
DeepSeek-LLM Local Inference Setup

Sep 15, 2025

Summary

  • This meeting provided a comprehensive technical walkthrough and performance assessment of running the 671B-parameter DeepSeek model (404GB at 4-bit quantization) on a $2,000 local AI inference server, with detailed hardware and BIOS recommendations, software setup, and performance tuning tips.
  • The host demonstrated reasoning and parsing test cases, gave actionable advice for setting up both CPU- and GPU-based inference, and highlighted key configuration and troubleshooting steps.
  • BIOS and system settings were discussed for optimal performance, along with model architecture choices and practical context/prompt strategies.
  • Several model limitations, use-case considerations, and next steps for improving local LLM workflows were noted.

Action Items

  • Ongoing: Update linked article with all performance metrics and configuration details discussed in the session.
  • Ongoing: Test extended context window and memory map settings; share findings.
  • Ongoing: Monitor performance impacts of BIOS and SVM (virtualization) mode changes and report in a follow-up.
  • Ongoing: Gather user feedback on CPU and context window performance for future tuning.
  • Ongoing: Explore further containerization (LXC, Docker, Proxmox VM) and update guides when viable.
  • Ongoing: Investigate llama.cpp and alternative models (e.g., Llama 3.3, distilled variants, 70B-class models) for local deployment and report comparative findings.
  • Ongoing: Solicit GPU/CPU tuning tips and memory interleaving insights from the community; update recommendations accordingly.

Hardware and System Setup for DeepSeek 671B Local Inference

  • Running the 404GB, 671B-parameter DeepSeek model locally requires an enterprise/server-class machine with at least 512GB of RAM, ideally built around AMD EPYC CPUs for cost-effective memory bandwidth.
  • The $2,000 rig (without GPUs) achieves 3.5–4 tokens/second on the full model at 4-bit quantization (Q4), with further gains available via BIOS and system-level optimizations; a rough bandwidth estimate follows this list.
  • Server motherboards with 16 DIMM slots (e.g., the Gigabyte MZ32-AR0) are preferred for affordable memory expansion; using 32GB DIMMs offers cost advantages.
  • Noise levels and form factors differ greatly between workstation and server builds; remote management via BMC was recommended for headless operation.
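
As a rough sanity check on the quoted 3.5–4 tokens/second, the sketch below estimates the memory-bandwidth ceiling for CPU-only decoding. The per-token active-parameter figure (~37B, from DeepSeek's published mixture-of-experts design) and the eight-channel DDR4-3200 bandwidth figure are assumptions for illustration, not numbers from the session.

```python
# Back-of-envelope estimate of CPU-only decode speed for a 4-bit DeepSeek build.
# Assumptions (not from the session): ~37B active parameters per token (MoE),
# and a single EPYC socket with 8 channels of DDR4-3200 (~25.6 GB/s each).

MODEL_SIZE_GB = 404            # quantized model size on disk
TOTAL_PARAMS_B = 671           # total parameters, in billions
ACTIVE_PARAMS_B = 37           # assumed active parameters per token (MoE)
MEM_BW_GBPS = 8 * 25.6         # theoretical peak memory bandwidth, GB/s

bytes_per_param = MODEL_SIZE_GB / TOTAL_PARAMS_B        # ~0.60 bytes/param at Q4
gb_read_per_token = ACTIVE_PARAMS_B * bytes_per_param   # ~22 GB streamed per token
ceiling_tps = MEM_BW_GBPS / gb_read_per_token           # ~9 tok/s upper bound

print(f"~{gb_read_per_token:.1f} GB read per token, "
      f"bandwidth ceiling ~{ceiling_tps:.1f} tok/s")
# Real-world overheads (NUMA effects, cache misses, attention compute) bring this
# down toward the observed 3.5-4 tok/s range.
```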

Software Installation, Configuration & Performance

  • Full installation, configuration, and troubleshooting steps for bare-metal Ubuntu 24.04 systems are documented and emphasized as critical; prior Linux experience is important.
  • LLM deployment was tested using Ollama and Open WebUI, establishing a baseline for future containerized/virtualized deployments.
  • Disabling simultaneous multithreading (SMT) and manually tuning BIOS/power policies significantly improved tokens/second throughput.
  • Guidance was provided for setting static IPs, Docker Compose network settings, and connecting Open WebUI to external LLM endpoints.
  • For best CPU-only performance, set the context window explicitly and manage parallelism (num_parallel=1 maximizes the supported context size).
  • GPU offloading increases supported context window size but does not dramatically improve throughput for this model size.
  • Key environment variables for tuning (num_parallel, host interface, keep_alive, GPU-specific settings) were detailed for both CPU and GPU inference; a configuration sketch follows this list.
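
As an illustration of the endpoint and context-window settings above, here is a minimal client sketch for querying a remote Ollama instance the way Open WebUI would. The IP address, model tag, and context size are placeholders; the server-side variables shown in the comments (OLLAMA_HOST, OLLAMA_NUM_PARALLEL, OLLAMA_KEEP_ALIVE) belong in the Ollama service environment on the server, not in this client.

```python
# Minimal client sketch for a remote Ollama endpoint (as Open WebUI would use).
# Server-side tuning (set in the ollama service environment, not here):
#   OLLAMA_HOST=0.0.0.0:11434     # listen on all interfaces, not just localhost
#   OLLAMA_NUM_PARALLEL=1         # single session -> largest usable context
#   OLLAMA_KEEP_ALIVE=-1          # keep the model resident between requests
import requests

OLLAMA_URL = "http://192.168.1.50:11434"   # placeholder static IP of the server
MODEL = "deepseek-r1:671b"                 # placeholder model tag

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": MODEL,
        "prompt": "Summarize the rules of Flappy Bird in three sentences.",
        "stream": False,
        "options": {"num_ctx": 16384},     # request an explicit context window
    },
    timeout=600,                           # CPU-only decode is slow; be generous
)
resp.raise_for_status()
print(resp.json()["response"])
```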

Reasoning and Model Performance Demonstrations

  • The DeepSeek model handled a range of reasoning, parsing, and memory tasks (e.g., Flappy Bird code generation, logic word problems, spatial reasoning about cat positions, math reasoning, reciting 100 digits of Pi, SVG generation).
  • Parsing and chain-of-thought style queries were accurately processed, with noted performance degradation during long context or multi-turn reasoning.
  • Some limitations in code review, context recall speed, and complex spatial/geographical reasoning were observed.
  • Throughput (tokens/second) ranged from 2.5 to 4, generally decreasing on longer or more complex tasks; a measurement sketch follows this list.
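
For reference, those tokens/second figures can be read directly from Ollama's response metadata rather than timed by hand. The sketch below (same placeholder endpoint and model tag as earlier) derives decode throughput from the eval_count and eval_duration fields returned by /api/generate.

```python
# Measure decode throughput from Ollama's response metadata.
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
import requests

OLLAMA_URL = "http://192.168.1.50:11434"   # placeholder, same server as above

data = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": "deepseek-r1:671b",       # placeholder model tag
        "prompt": "List the first 20 digits of pi.",
        "stream": False,
    },
    timeout=600,
).json()

tokens_per_second = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"decode throughput: {tokens_per_second:.2f} tok/s")
```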

BIOS Optimization and Power Management

  • Detailed BIOS settings changes were shared: SMT disabled; NPS (NUMA nodes per socket) set to 1; determinism control set to Manual with the power/determinism policy set to Performance; boost clock and TDP values set according to board/CPU limits. A verification sketch follows this list.
  • Best practices for memory interleaving, PCIe, and CPU/CCD control were noted, with community input invited for further tuning.
  • Firmware upgrade paths for boards (e.g., V1->V3 for MZ32-AR0) were explained for compatibility with newer CPUs like Milan.
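
A quick way to confirm the SMT and NPS changes took effect is to check the kernel's view after a reboot. The sketch below assumes the standard Linux sysfs layout on the Ubuntu install (not something covered in the session) and simply reports SMT state and NUMA node count, which should read 0 and a single node with SMT off and NPS=1.

```python
# Verify BIOS changes from inside Ubuntu after a reboot.
# Assumes standard Linux sysfs paths; nothing here is specific to this build.
from pathlib import Path

# SMT: "0" expected when SMT is disabled in the BIOS.
smt_active = Path("/sys/devices/system/cpu/smt/active").read_text().strip()
print(f"SMT active: {smt_active}  (expect 0 with SMT disabled)")

# NUMA nodes: with NPS=1 on a single-socket EPYC there should be exactly one.
nodes = sorted(p.name for p in Path("/sys/devices/system/node").glob("node[0-9]*"))
print(f"NUMA nodes: {nodes}  (expect ['node0'] with NPS=1)")
```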

Model Use Cases and Limitations

  • DeepSeek 671B is not recommended as a daily-driver LLM due to its speed and hardware demands; Llama 3.3 or similar is recommended for general-purpose use.
  • GPU investment is best reserved for scenarios needing larger context windows; VRAM is not a cost-effective substitute for system RAM at this scale.
  • Anticipation for rapid LLM improvements and new models (e.g., vision-capable) in the coming months was expressed.

Decisions

  • Continue using AMD EPYC with server motherboards for cost-efficient local LLM hosting — Server platforms offer greater scalability for memory and bandwidth, critical for current and next-generation large models.
  • Prefer Linux/bare metal for local LLM deployment with future migration to containers/VMs as performance allows — Establish a well-documented baseline for further experimentation.
  • Set num_parallel=1 for larger context windows on high-memory servers — Avoids OOM errors and maximizes single-user session performance.

Open Questions / Follow-Ups

  • Does memory-mapped inference offer substantial gains in stability or speed for multi-service/multi-container environments?
  • Can Ollama or other frameworks explicitly assign model layers/partitions to specific GPUs for further optimization?
  • Are there BIOS/memory interleaving settings that can further increase throughput for these workloads?
  • Feedback requested from community: experiences with various CPUs, context window configurations, and alternative hosting solutions.
  • Will Proxmox/LXC/Docker-based deployments materially impact tokens/second and stability versus bare metal? More testing needed.