Deep Learning Hardware and Frameworks

Dec 6, 2025

Overview

  • Lecture 9: Deep Learning Hardware and Software — practical focus with code examples.
  • Covers hardware (CPUs, GPUs, TPUs), GPU internals (SMs, FP32 cores, Tensor Cores), and software frameworks (PyTorch, TensorFlow).
  • Emphasizes computation vs. memory trade-offs, dynamic vs. static computation graphs, and framework features.

Deep Learning Hardware: Key Concepts

  • CPUs vs GPUs:
    • CPUs: fewer powerful cores, high clock speeds, advanced caching/branch prediction.
    • GPUs: many simpler cores, lower clock speeds, massively parallel for data-parallel tasks.
  • Flops-per-dollar trend:
    • GPUs have become dramatically more cost-effective since ~2012, enabling large-scale deep learning.
  • GPU device is a mini-computer:
    • Includes its own memory modules and cooling; has hierarchical compute units.

GPU Internal Architecture

  • Streaming Multiprocessors (SMs):
    • NVIDIA GPUs have many SMs (example: 72 SMs in the Titan RTX).
    • Each SM contains many FP32 cores (example: 64 FP32 units per SM).
    • Each FP32 core can perform one fused multiply-accumulate (counted as two floating-point ops) per clock cycle.
  • Tensor Cores:
    • Specialized hardware for mixed-precision matrix multiply: each Tensor Core performs a small matrix multiply-accumulate (e.g., 4x4 GEMM plus add) per clock cycle.
    • Use mixed precision: multiplications in FP16, accumulations in FP32.
    • Much higher throughput for matrix multiply/convolution (example: Tensor Cores yield large TFLOPS improvements).
    • To use Tensor Cores in PyTorch: switch input data type to FP16 (mixed precision); may require tuning for numerical stability.
  • Memory differences:
    • Consumer GPUs use GDDR6 (lower memory bandwidth and capacity).
    • Compute GPUs use High-Bandwidth Memory (HBM) with greater bandwidth and capacity.
    • Memory bandwidth often bottlenecks many GPU workloads; larger/faster memory enables larger models and faster training.
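The mixed-precision scheme described above (FP16 multiplies, FP32 accumulation) can be sketched numerically in plain NumPy. This illustrates only the arithmetic idea, not the Tensor Core hardware itself; the function name is made up for the example:

```python
import numpy as np

def mixed_precision_dot(a, b):
    """Dot product with FP16 multiplies accumulated in FP32,
    mimicking the Tensor Core numeric scheme (illustration only)."""
    # Each elementwise product is rounded to FP16...
    products = (a.astype(np.float16) * b.astype(np.float16)).astype(np.float32)
    # ...but the running sum is kept in FP32, so accumulation error
    # does not grow with the vector length.
    return products.sum(dtype=np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)

reference = np.dot(a, b)           # pure FP32
mixed = mixed_precision_dot(a, b)  # FP16 multiply, FP32 accumulate
```

The per-product FP16 rounding costs a little accuracy, but the FP32 accumulator keeps the result close to the full-precision reference; this is why mixed precision is usually safe for matmuls but may still need tuning (e.g., loss scaling) elsewhere.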

Multi-GPU and Data-Center Scale

  • Typical servers: often 8 GPUs per server; data centers link many servers for distributed training.
  • Google TPUs:
    • TPU v2 (~180 TFLOPS per board), TPU v3 (~420 TFLOPS per board).
    • TPUs excel when assembled into TPU Pods (many chips aggregated): petaflop-scale compute.
    • TPUs historically tied to TensorFlow; accessible via Google Cloud (rental pricing varies by type).
    • TPUs also use low/mixed-precision matrix hardware; the design is specialized for ML workloads.

Why GPUs (and TPUs) Are Efficient For DL

  • Many NN primitives are data-parallel (matrix multiply, convolutions) and map well to GPUs/TPUs.
  • Matrix multiplication: each output element is an inner product — trivially parallelizable.
  • Tensor Cores and specialized matrix hardware allow chunked matrix operations to be executed extremely quickly.
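To make the data-parallelism claim concrete, here is a small NumPy sketch showing that every output element of a matrix product is an independent inner product (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 3))

# Each output element C[i, j] is the inner product of row i of A with
# column j of B. No element depends on any other, which is why a GPU
# can hand them to thousands of threads simultaneously.
C = np.empty((4, 3))
for i in range(4):
    for j in range(3):
        C[i, j] = np.dot(A[i, :], B[:, j])
```

The sequential double loop here computes exactly what `A @ B` computes; the hardware simply runs all of those independent inner products at once.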

Deep Learning Software: Framework Goals

  • Main expectations from frameworks:
    • Rapid prototyping (layers, utilities, pretrained models).
    • Automatic differentiation (compute gradients via backpropagation).
    • Efficient execution on accelerators (GPUs, TPUs) without deep hardware knowledge.

Major Frameworks Overview

  • Historical zoo: Caffe, Torch, Theano, Chainer, MXNet, CNTK, Paddle, JAX.
  • Current mainstream: PyTorch and TensorFlow.
  • Differences in design philosophy historically:
    • PyTorch: dynamic computation graphs (eager execution).
    • TensorFlow 1.x: static graphs (define then run).
    • Recent convergence: TensorFlow 2.0 adopts dynamic graphs as default and adds easier static-graph tooling; PyTorch added JIT/static options.

PyTorch: Three Abstraction Levels

  • Tensor API:
    • Low-level multidimensional arrays (like NumPy), runs on CPU/GPU.
    • Assign device to tensors to run on GPU.
  • Autograd:
    • Automatic differentiation via dynamic graph tracing when tensors have requires_grad=True.
    • Call loss.backward() to compute gradients; gradients accumulate by default.
    • Must zero gradients each iteration (common bug source).
    • Use torch.no_grad() context for parameter updates to avoid building graphs during updates.
    • Can define custom autograd Functions for numerically stable or fused backward logic.
  • nn.Module (higher-level):
    • Object-oriented layers, containers (nn.Sequential), predefined layers (nn.Linear), loss functions.
    • Create custom modules by subclassing nn.Module.
    • Built-in optimizers (torch.optim) handle parameter updates (call optimizer.step(), optimizer.zero_grad()).
    • Pretrained models in torchvision with one-line downloads (e.g., torchvision.models.alexnet(pretrained=True)).
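The three abstraction levels above can be seen side by side in a few lines. A minimal sketch, assuming PyTorch is installed; shapes and layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Level 1 -- Tensor API: NumPy-like arrays, optionally placed on a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(16, 10, device=device)

# Level 2 -- Autograd: requires_grad=True makes ops build a dynamic
# graph; backward() fills .grad on the leaf tensors.
w = torch.randn(10, 5, device=device, requires_grad=True)
loss = (x @ w).pow(2).sum()
loss.backward()

# Level 3 -- nn.Module + torch.optim: layers, containers, optimizers.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
y = torch.randn(16, 5, device=device)

opt.zero_grad()                       # gradients accumulate: zero each step
out = model(x)
nn.functional.mse_loss(out, y).backward()
opt.step()                            # update parameters in-place
```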

Dynamic vs Static Computation Graphs

  • Dynamic graphs (PyTorch style):
    • Graph built per forward pass and discarded after backward.
    • Allows normal Python control flow to change graph structure per iteration.
    • Easier to debug; natural Python semantics.
    • Useful when computation depends on inputs (variable-length RNNs, recursive structures), or when model structure depends on intermediate outputs.
  • Static graphs:
    • Define graph once, then run repeatedly.
    • Enable global graph optimizations (operator fusion), serialization for deployment without Python.
    • Harder to debug (errors surface at runtime when running graph).
    • PyTorch provides JIT/torch.jit.trace or scripting to create static graphs for optimization and deployment.
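A minimal sketch of the static-graph path in PyTorch via tracing (assumes PyTorch is installed; the tiny model is purely illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
example = torch.randn(1, 4)

# torch.jit.trace runs the model once on the example input and records
# the executed ops as a static TorchScript graph. Python control flow
# is "baked in" at trace time, so input-dependent branching needs
# torch.jit.script instead.
traced = torch.jit.trace(model, example)

# The traced module computes the same function and can be serialized
# and later loaded without the original Python code, e.g.:
# traced.save("model.pt")
```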

PyTorch Examples & Notes (two-layer FC network)

  • Tensor-only implementation:
    • Manual forward, manual backward (compute gradients by formula), update weights.
  • Autograd implementation:
    • Set requires_grad=True on parameters.
    • Forward: compute loss; call loss.backward() to compute grads.
    • Update under torch.no_grad() and zero grads after step.
  • Custom autograd.Function:
    • Implement forward(ctx, input) and backward(ctx, grad_output) for fused/numerically stable ops (e.g., stable sigmoid backward).
  • Modular nn.Module:
    • Define init to create submodules; forward() to connect them.
    • Compose blocks (e.g., residual blocks or custom parallel blocks) inside containers or custom modules.
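The autograd variant of the two-layer network might look like the following sketch (dimensions and learning rate are arbitrary; assumes PyTorch):

```python
import torch

torch.manual_seed(0)

# Random data and parameters for a two-layer fully connected net.
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

lr = 1e-6
losses = []
for step in range(50):
    # Forward: autograd records the dynamic graph as these ops run.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)   # ReLU via clamp
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())

    loss.backward()                         # grads accumulate into .grad

    with torch.no_grad():                   # don't trace the update itself
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
        w1.grad.zero_()                     # common bug: forgetting this
        w2.grad.zero_()
```

Note the two patterns flagged in the bullets above: the update happens inside `torch.no_grad()`, and gradients are zeroed every iteration.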

TensorFlow 2.0 Highlights

  • TensorFlow 1.x:
    • Static graph first; two-phase (build graph, run in Session).
    • Debugging can be difficult due to indirection between building and execution.
  • TensorFlow 2.0:
    • Defaults to eager (dynamic) execution; similar style to PyTorch.
    • tf.GradientTape allows automatic differentiation (similar to PyTorch autograd).
    • Use @tf.function to compile a Python function into a static graph for optimization/serialization.
    • Keras (tf.keras) provides high-level API (layers, models, losses, optimizers) similar to torch.nn.
    • TensorBoard: powerful visualization/logging tool for metrics, widely used; PyTorch can also log to TensorBoard.
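A minimal sketch of the TF 2.x style described above, assuming TensorFlow 2.x is installed (values are arbitrary):

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Eager autodiff: GradientTape records ops as they execute,
# playing the role PyTorch's autograd plays.
with tf.GradientTape() as tape:
    y = x * x
grad = tape.gradient(y, x)          # dy/dx = 2x

# @tf.function compiles the Python function into a static graph
# that can be optimized and serialized.
@tf.function
def square(v):
    return v * v

out = square(tf.constant(4.0))      # executes the compiled graph
```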

Framework Trade-Offs & Practical Points

  • PyTorch:
    • Pros: intuitive dynamic graph debugging, flexible Python control flow, easy to prototype.
    • Cons: historically limited TPU support (improving), mobile deployment less mature.
  • TensorFlow:
    • Pros: strong static-graph tooling, serialization for deployment, TensorBoard, TPU support.
    • Cons: TF 1.x API complexity and debugging pain; TF 2.0 aims to fix usability but ecosystem transition creates mixed resources online.
  • Both frameworks are converging: both support dynamic and static modes, JIT/compilation, and high-level APIs.

Key Terms and Definitions

| Term | Definition |
| --- | --- |
| CPU | Central Processing Unit; few powerful cores, high clock rates, general purpose. |
| GPU | Graphics Processing Unit; many simple cores, high parallel throughput for data-parallel ops. |
| SM (Streaming Multiprocessor) | A grouping of GPU compute units; contains FP cores and other specialized units. |
| FP32 / FP16 | 32-bit / 16-bit floating-point numeric formats. |
| Tensor Core | NVIDIA specialized unit performing mixed-precision small-matrix multiply-and-accumulate. |
| Autograd | Automatic differentiation via computational graphs (PyTorch term). |
| Dynamic Computation Graph | Graph built on-the-fly per forward pass; supports Python control flow. |
| Static Computation Graph | Graph defined once, optimized and executed repeatedly; supports graph-level optimizations and serialization. |
| TPU | Tensor Processing Unit (Google); specialized accelerator for ML workloads, used in pods for scale. |
| TorchScript / tf.function | Mechanisms to convert Python/model code into static graph representations for optimization/serialization. |

Action Items / Next Steps (if applying lecture content)

  • When training on GPUs:
    • Consider mixed precision (FP16 with FP32 accumulation) to utilize Tensor Cores; watch numerical stability.
    • Check GPU memory and memory bandwidth limits when increasing model or batch sizes.
  • When coding models:
    • Use autograd / high-level modules to reduce boilerplate and errors.
    • Always zero gradients each iteration; use torch.no_grad() for parameter updates.
  • For deployment or optimization:
    • Explore static graph compilation (torch.jit, tf.function) for improved performance and serialization.
    • Use TensorBoard or equivalent logging for monitoring training; integrate PyTorch logging to TensorBoard.
  • Choose framework based on needs:
    • Need TPU or deep Google Cloud integration: consider TensorFlow (TPU compatibility).
    • Need flexible research prototyping or dynamic model structures: PyTorch is recommended.