
GB200 Hardware Architecture Overview

Dec 4, 2024

GB200 Hardware Architecture and Component Supply Chain Notes

Introduction to GB200

  • Nvidia's GB200 offers significant performance advances, but deployment is considerably more complex.
  • It ships as rack-scale systems with multiple deployment variants, each with its own trade-offs.
  • The architecture reshapes the component supply chain for datacenter deployers, clouds, and server OEMs/ODMs.

GB200 Form Factors

Rack Scale Form Factors

  • Four major form factors, each allowing further customization:
    • GB200 NVL72: ~120kW per rack, requiring liquid cooling.
    • GB200 NVL36x2: Two interconnected racks at 66kW each (132kW total). The most common deployment.
    • GB200 NVL36x2 (Ariel): Higher CPU-to-GPU ratio, used primarily by Meta.
    • x86 B200 NVL72/NVL36x2 (Miranda): Uses x86 CPUs instead of Nvidia's Grace CPUs for lower upfront cost.

Power Budget

  • Max TDP of each compute tray: 6.3kW.
  • NVL72 total power draw: 123.6kW.
  • NVL36x2 total: 132kW (66kW per rack); see the sanity check below.
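
A quick sanity check on these figures, as a minimal sketch in Python. The tray counts (18 compute trays per NVL72 rack, 9 per NVL36x2 rack) are assumptions not stated in these notes, and the gap to the quoted rack totals is attributed here to NVSwitch trays, power shelves, fans, and other rack overhead.

```python
# Back-of-the-envelope rack power check.
# Assumed tray counts: 18 compute trays per NVL72 rack, 9 per NVL36x2 rack.
# The remainder of the quoted rack totals is attributed to NVSwitch trays,
# power shelves, fans, and other rack overhead (an assumption, not a sourced split).

TRAY_TDP_KW = 6.3  # max TDP per compute tray (from the notes)

racks = {
    "NVL72":              {"compute_trays": 18, "quoted_total_kw": 123.6},
    "NVL36x2 (per rack)": {"compute_trays": 9,  "quoted_total_kw": 66.0},
}

for name, r in racks.items():
    compute_kw = r["compute_trays"] * TRAY_TDP_KW
    overhead_kw = r["quoted_total_kw"] - compute_kw
    print(f"{name}: compute trays {compute_kw:.1f} kW, "
          f"other rack components ~{overhead_kw:.1f} kW, "
          f"total {r['quoted_total_kw']:.1f} kW")
```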

Compute Tray Architecture

  • Bianca board: Contains two Blackwell B200 GPUs and one Grace CPU.
  • Grace connects to the GPUs over NVLink-C2C, reducing the need for PCIe switches or retimers between CPU and GPU.
  • ConnectX-7/8 NICs sit on a mezzanine board on top of the Bianca board (rack-level composition sketched below).
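
A rough composition model tying the Bianca board to rack-level GPU and CPU counts. The figures of two Bianca boards per compute tray and 18 compute trays per NVL72 rack are assumptions added for illustration, not taken from these notes.

```python
# Rough composition model for an NVL72 rack.
GPUS_PER_BIANCA = 2   # two Blackwell GPUs per Bianca board (from the notes)
CPUS_PER_BIANCA = 1   # one Grace CPU per Bianca board (from the notes)
BIANCA_PER_TRAY = 2   # assumed
TRAYS_PER_RACK = 18   # assumed for NVL72

gpus = TRAYS_PER_RACK * BIANCA_PER_TRAY * GPUS_PER_BIANCA
cpus = TRAYS_PER_RACK * BIANCA_PER_TRAY * CPUS_PER_BIANCA
print(f"NVL72 rack: {gpus} GPUs, {cpus} Grace CPUs")  # 72 GPUs, 36 CPUs
```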

Networking Fabrics

NVLink Fabric

  • NVL72: A single NVSwitch hop connects any two GPUs within the rack.
  • NVL36x2: One hop within a rack, two hops between the interconnected racks (hop-count sketch below).
  • Copper cabling is used for NVLink because of its cost and reliability advantages over optics.
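
A toy hop-count helper to make the topology concrete. The function nvswitch_hops is a hypothetical illustration of the one-hop vs. two-hop behavior described above, not a description of Nvidia's actual switch wiring.

```python
# Toy hop-count model for the NVLink fabric: NVL72 keeps all 72 GPUs behind
# one NVSwitch layer in a single rack; NVL36x2 needs a second NVSwitch hop
# for GPU pairs that span the two interconnected racks.

def nvswitch_hops(gpu_a_rack: int, gpu_b_rack: int, topology: str) -> int:
    """Return the number of NVSwitch hops between two GPUs (illustrative)."""
    if topology == "NVL72":
        return 1
    if topology == "NVL36x2":
        return 1 if gpu_a_rack == gpu_b_rack else 2
    raise ValueError(f"unknown topology: {topology}")

print(nvswitch_hops(0, 0, "NVL72"))    # 1
print(nvswitch_hops(0, 0, "NVL36x2"))  # 1 (same rack)
print(nvswitch_hops(0, 1, "NVL36x2"))  # 2 (across the rack pair)
```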

Backend Networking

  • Initial shipments use ConnectX-7 NICs, later moving to ConnectX-8.
  • Switch options include Nvidia Quantum-2 (InfiniBand), Spectrum-X, and Broadcom-based Ethernet.
  • A transition from 400G SR4 to 800G DR4 optical transceivers (per-GPU bandwidth sketched below).
  • Amazon uses a custom 400G NIC on its backend network.
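
A small sketch of what the NIC transition means for per-GPU and per-domain backend bandwidth, assuming one backend NIC per GPU (an assumption; the notes only name the NIC generations and transceiver types).

```python
# Backend bandwidth per GPU across the NIC transition, assuming one NIC per GPU.
NIC_SPEED_GBPS = {
    "ConnectX-7": 400,   # pairs with 400G SR4 optics
    "ConnectX-8": 800,   # pairs with 800G optics such as DR4
}

for nic, gbps in NIC_SPEED_GBPS.items():
    per_domain_tbps = gbps * 72 / 1000   # aggregate for a 72-GPU NVLink domain
    print(f"{nic}: {gbps} Gb/s per GPU, ~{per_domain_tbps:.1f} Tb/s per 72 GPUs")
```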

Frontend Networking

  • Frontend networking is typically overprovisioned; roughly 200G of frontend bandwidth per GPU is the focus.
  • Bluefield-3 DPUs are used mainly by Oracle.

Networking Cables and Transceivers BOM

  • ConnectX-8 enables cost savings through greater use of DAC/ACC copper in place of optics (illustrative comparison below).
  • The optics supply chain is expanding, with Eoptolink and Broadcom among the suppliers.
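
An illustrative-only view of where the copper savings come from. Every unit price below is a hypothetical placeholder, as is the one-link-per-GPU count; the sketch shows the structure of the comparison, not sourced BOM figures.

```python
# Hypothetical copper vs. optics link-cost comparison for a 72-GPU backend.
# All prices are placeholders for illustration only.

links_per_rack = 72            # assume one backend link per GPU (assumption)
cost_per_link = {
    "DAC/ACC copper": 100,                       # hypothetical $/link, no transceivers needed
    "Optical (2x transceivers + fiber)": 1500,   # hypothetical $/link
}

for media, unit_cost in cost_per_link.items():
    print(f"{media}: ${unit_cost * links_per_rack:,} per rack")
```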

Hyperscaler Customization

  • Hyperscaler customization affects both the design and the supply chain.
  • Major hyperscalers such as Amazon substitute their own custom NICs.

Liquid Cooling Architectures

  • Essential for high-density racks such as the NVL72.
  • Two main approaches: liquid-to-air (L2A) vs. liquid-to-liquid (L2L).
  • Data center infrastructure must be redesigned to accommodate these cooling needs (coolant flow estimate below).
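
A rough estimate of the coolant flow a single ~120kW rack implies, using Q = m_dot * c_p * dT. The 10 K loop temperature rise is an assumed design point, not a figure from these notes.

```python
# Why liquid cooling: water flow needed to carry away a full NVL72 rack's heat.
RACK_HEAT_W = 120_000        # ~120 kW per NVL72 rack (from the notes)
CP_WATER = 4186              # J/(kg*K), specific heat of water
DELTA_T = 10                 # K, assumed loop temperature rise (design-point assumption)

m_dot = RACK_HEAT_W / (CP_WATER * DELTA_T)   # kg/s, roughly equal to L/s for water
print(f"~{m_dot:.2f} L/s (~{m_dot * 60:.0f} L/min) of water per rack")
```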

Power Delivery Network

  • Power is distributed through the rack at 48V DC and stepped down to 12V DC for the compute trays (current estimate below).
  • RapidLock connectors and a power distribution board handle power management.
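
A quick look at the currents implied by 48V distribution and the 12V step-down, using I = P / V and ignoring conversion losses; it illustrates why conversion happens close to the load.

```python
# Currents implied by the rack power figures in these notes (I = P / V).
TRAY_POWER_W = 6_300    # max compute-tray TDP from the notes
RACK_POWER_W = 123_600  # NVL72 rack total from the notes

print(f"Per tray at 48V: {TRAY_POWER_W / 48:.0f} A")          # ~131 A
print(f"Per tray at 12V: {TRAY_POWER_W / 12:.0f} A")          # ~525 A
print(f"Whole-rack busbar at 48V: {RACK_POWER_W / 48:.0f} A")  # ~2575 A
```

At 12V the same tray power means roughly four times the current of the 48V feed, which is why the step-down sits on the tray itself rather than upstream.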

Mechanical Components

  • Detailed discussion of the various mechanical components is not covered here.

Conclusion

  • GB200's architecture is driving an ongoing shift in the component supply chain.
  • Power, cooling, and networking choices are important considerations for data center deployment.