Overview
- Lecture 9: Deep Learning Hardware and Software — practical focus with code examples.
- Covers hardware (CPUs, GPUs, TPUs), GPU internals (SMs, FP32 cores, Tensor Cores), and software frameworks (PyTorch, TensorFlow).
- Emphasizes computation vs. memory trade-offs, dynamic vs. static computation graphs, and framework features.
Deep Learning Hardware: Key Concepts
- CPUs vs GPUs:
- CPUs: fewer powerful cores, high clock speeds, advanced caching/branch prediction.
- GPUs: many simpler cores, lower clock speeds, massively parallel for data-parallel tasks.
- Flops-per-dollar trend:
- GPUs became dramatically more cost-effective since ~2012, enabling large-scale deep learning.
- GPU device is a mini-computer:
- Includes its own memory modules and cooling; has hierarchical compute units.
GPU Internal Architecture
- Streaming Multiprocessors (SMs):
- NVIDIA GPUs have many SMs (example: 72 SMs in the Titan RTX).
- Each SM contains many FP32 cores (example: 64 FP32 units per SM).
- Each FP32 core can perform one fused multiply-accumulate (two floating-point ops) per clock cycle.
- Tensor Cores:
- Specialized hardware for mixed-precision matrix multiply (e.g., a 4x4 matrix multiply-accumulate per clock cycle).
- Use mixed precision: multiplications in FP16, accumulations in FP32.
- Much higher throughput for matrix multiply/convolution (example: Tensor Cores yield large TFLOPS improvements).
- To use Tensor Cores in PyTorch: cast data to FP16 or use automatic mixed precision (torch.cuda.amp); may require loss scaling for numerical stability.
- Memory differences:
- Consumer GPUs use GDDR6 (lower memory bandwidth and capacity).
- Compute GPUs use High-Bandwidth Memory (HBM) with greater bandwidth and capacity.
- Memory bandwidth often bottlenecks many GPU workloads; larger/faster memory enables larger models and faster training.
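The FP16-multiply/FP32-accumulate split matters because long sums in FP16 lose precision as the running total grows. A minimal NumPy sketch (emulating the accumulation on the CPU — this is an illustration of the numerics, not real Tensor Core execution):

```python
import numpy as np

# Emulate: products computed in FP16 (as on a Tensor Core), then
# accumulated either in FP16 (lossy) or in FP32 (what Tensor Cores do).
rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

products = a * b                              # FP16 multiplies
acc_fp16 = np.float16(0.0)
for p in products:                            # naive all-FP16 accumulation
    acc_fp16 = np.float16(acc_fp16 + p)
acc_fp32 = products.astype(np.float32).sum()  # FP32 accumulation

reference = (a.astype(np.float64) * b.astype(np.float64)).sum()
err16 = abs(float(acc_fp16) - reference)      # error of FP16 accumulation
err32 = abs(float(acc_fp32) - reference)      # error of FP32 accumulation
```

With 4096 terms, the FP32-accumulated result stays far closer to the FP64 reference than the all-FP16 sum, which is why mixed precision keeps accumulations in FP32.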
Multi-GPU and Data-Center Scale
- Typical servers: often 8 GPUs per server; data centers link many servers for distributed training.
- Google TPUs:
- TPU v2 (~180 TFLOPS per board), TPU v3 (~420 TFLOPS per board).
- TPUs excel when assembled into TPU Pods (many chips aggregated): petaflop-scale compute.
- TPUs historically tied to TensorFlow; accessible via Google Cloud (rental pricing varies by type).
- TPUs also use low/mixed-precision matrix hardware; the design is specialized for ML workloads.
Why GPUs (and TPUs) Are Efficient For DL
- Many NN primitives are data-parallel (matrix multiply, convolutions) and map well to GPUs/TPUs.
- Matrix multiplication: each output element is an inner product — trivially parallelizable.
- Tensor Cores and specialized matrix hardware allow chunked matrix operations to be executed extremely quickly.
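The inner-product view can be sketched in a few lines of NumPy; the explicit loops below stand in for the parallel threads a GPU would launch, since no output element depends on any other:

```python
import numpy as np

# Each output element of C = A @ B is an independent inner product,
# so all M*N elements can in principle be computed in parallel.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

C = np.empty((3, 5))
for i in range(3):                 # every (i, j) pair is independent work;
    for j in range(5):             # a GPU would assign these to threads
        C[i, j] = A[i, :] @ B[:, j]   # one inner product per output element
```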
Deep Learning Software: Framework Goals
- Main expectations from frameworks:
- Rapid prototyping (layers, utilities, pretrained models).
- Automatic differentiation (compute gradients via backpropagation).
- Efficient execution on accelerators (GPUs, TPUs) without deep hardware knowledge.
Major Frameworks Overview
- Historical zoo: Caffe, Torch, Theano, Chainer, MXNet, CNTK, Paddle, JAX.
- Current mainstream: PyTorch and TensorFlow.
- Differences in design philosophy historically:
- PyTorch: dynamic computation graphs (eager execution).
- TensorFlow 1.x: static graphs (define then run).
- Recent convergence: TensorFlow 2.0 adopts dynamic graphs as default and adds easier static-graph tooling; PyTorch added JIT/static options.
PyTorch: Three Abstraction Levels
- Tensor API:
- Low-level multidimensional arrays (like NumPy), runs on CPU/GPU.
- Assign device to tensors to run on GPU.
- Autograd:
- Automatic differentiation via dynamic graph tracing when tensors have requires_grad=True.
- Call loss.backward() to compute gradients; gradients accumulate by default.
- Must zero gradients each iteration (common bug source).
- Use torch.no_grad() context for parameter updates to avoid building graphs during updates.
- Can define custom autograd Functions for numerically stable or fused backward logic.
- nn.Module (higher-level):
- Object-oriented layers, containers (nn.Sequential), predefined layers (nn.Linear), loss functions.
- Create custom modules by subclassing nn.Module.
- Built-in optimizers (torch.optim) handle parameter updates (call optimizer.step(), optimizer.zero_grad()).
- Pretrained models in torchvision with one-line downloads (e.g., torchvision.models.alexnet(pretrained=True)).
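The three abstraction levels compose in a typical training loop. A minimal sketch with toy, made-up shapes (64 examples, 100 features, 10 outputs) and a plain SGD regression objective:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 100)          # toy inputs (hypothetical shapes)
y = torch.randn(64, 10)           # toy regression targets

model = nn.Sequential(            # nn.Module level: layers as objects
    nn.Linear(100, 32),
    nn.ReLU(),
    nn.Linear(32, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

losses = []
for step in range(50):
    loss = nn.functional.mse_loss(model(x), y)  # forward pass builds the graph
    losses.append(loss.item())
    opt.zero_grad()               # clear accumulated grads (common bug source)
    loss.backward()               # autograd fills .grad on each parameter
    opt.step()                    # update runs without building a graph
```

optimizer.step() wraps the torch.no_grad() parameter update from the lower-level style, and optimizer.zero_grad() replaces manually zeroing each .grad.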
Dynamic vs Static Computation Graphs
- Dynamic graphs (PyTorch style):
- Graph built per forward pass and discarded after backward.
- Allows normal Python control flow to change graph structure per iteration.
- Easier to debug; natural Python semantics.
- Useful when computation depends on inputs (variable-length RNNs, recursive structures), or when model structure depends on intermediate outputs.
- Static graphs:
- Define graph once, then run repeatedly.
- Enable global graph optimizations (operator fusion), serialization for deployment without Python.
- Harder to debug (errors surface at runtime when running graph).
- PyTorch provides torch.jit.trace (record one execution) and torch.jit.script (compile Python source, preserving control flow) to create static graphs for optimization and deployment.
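A small sketch of scripting (the `clipped_relu` function and its cap threshold are made up for illustration): `torch.jit.script` captures the branch itself in the graph, where tracing would bake in only the path taken on the example input.

```python
import torch

@torch.jit.script
def clipped_relu(x: torch.Tensor, cap: float) -> torch.Tensor:
    # The if-statement is compiled into the TorchScript graph,
    # so the data-dependent branch survives serialization.
    y = torch.relu(x)
    if bool(y.max() > cap):       # data-dependent control flow
        y = torch.clamp(y, max=cap)
    return y

out = clipped_relu(torch.tensor([-1.0, 0.5, 3.0]), 1.0)
```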
PyTorch Examples & Notes (two-layer FC network)
- Tensor-only implementation:
- Manual forward, manual backward (compute gradients by formula), update weights.
- Autograd implementation:
- Set requires_grad=True on parameters.
- Forward: compute loss; call loss.backward() to compute grads.
- Update under torch.no_grad() and zero grads after step.
- Custom autograd.Function:
- Implement forward(ctx, input) and backward(ctx, grad_output) for fused/numerically stable ops (e.g., stable sigmoid backward).
- Modular nn.Module:
- Define __init__() to create submodules and forward() to connect them.
- Compose blocks (e.g., residual blocks or custom parallel blocks) inside containers or custom modules.
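The custom-Function pattern above can be sketched with a sigmoid whose backward reuses the saved forward output, via the identity sigmoid'(x) = s * (1 - s):

```python
import torch

class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        s = torch.sigmoid(x)
        ctx.save_for_backward(s)           # stash output for the backward pass
        return s

    @staticmethod
    def backward(ctx, grad_output):
        s, = ctx.saved_tensors
        return grad_output * s * (1 - s)   # fused, numerically stable gradient

x = torch.randn(5, dtype=torch.float64, requires_grad=True)
Sigmoid.apply(x).sum().backward()          # fills x.grad via the custom backward
```

Computing the gradient from the saved output avoids re-evaluating exp(-x), which is the kind of fusion/stability win the lecture motivates.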
TensorFlow 2.0 Highlights
- TensorFlow 1.x:
- Static graph first; two-phase workflow (build the graph, then run it in a tf.Session).
- Debugging can be difficult due to indirection between building and execution.
- TensorFlow 2.0:
- Defaults to eager (dynamic) execution; similar style to PyTorch.
- tf.GradientTape allows automatic differentiation (similar to PyTorch autograd).
- Use @tf.function to compile a Python function into a static graph for optimization/serialization.
- Keras (tf.keras) provides high-level API (layers, models, losses, optimizers) similar to torch.nn.
- TensorBoard: powerful visualization/logging tool for metrics; widely used. PyTorch also can log to TensorBoard.
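A minimal sketch combining tf.GradientTape and @tf.function (the least-squares loss and the shapes are made up for illustration):

```python
import tensorflow as tf

@tf.function                       # compile this Python function into a graph
def loss_and_grad(w, x, y):
    with tf.GradientTape() as tape:
        tape.watch(w)              # track a plain tensor (Variables are watched automatically)
        loss = tf.reduce_mean((tf.matmul(x, w) - y) ** 2)
    return loss, tape.gradient(loss, w)

x = tf.ones((4, 3))
y = tf.ones((4, 1))
w = tf.zeros((3, 1))
loss, grad = loss_and_grad(w, x, y)
```

The tape records eager ops much like PyTorch's autograd records a dynamic graph, while @tf.function recovers static-graph optimization and serialization.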
Framework Trade-Offs & Practical Points
- PyTorch:
- Pros: intuitive dynamic graph debugging, flexible Python control flow, easy to prototype.
- Cons: historically limited TPU support (improving), mobile deployment less mature.
- TensorFlow:
- Pros: strong static-graph tooling, serialization for deployment, TensorBoard, TPU support.
- Cons: TF 1.x API complexity and debugging pain; TF 2.0 aims to fix usability but ecosystem transition creates mixed resources online.
- Both frameworks are converging: both support dynamic and static modes, JIT/compilation, and high-level APIs.
Key Terms and Definitions
| Term | Definition |
| --- | --- |
| CPU | Central Processing Unit; few powerful cores, high clock rates, general purpose. |
| GPU | Graphics Processing Unit; many simple cores, high parallel throughput for data-parallel ops. |
| SM (Streaming Multiprocessor) | A grouping of GPU compute units; contains FP cores and other specialized units. |
| FP32 / FP16 | 32-bit / 16-bit floating-point numeric formats. |
| Tensor Core | NVIDIA specialized unit performing mixed-precision small-matrix multiply-and-accumulate. |
| Autograd | Automatic differentiation via computational graphs (PyTorch term). |
| Dynamic Computation Graph | Graph built on-the-fly per forward pass; supports Python control flow. |
| Static Computation Graph | Graph defined once, optimized and executed repeatedly; supports graph-level optimizations and serialization. |
| TPU | Tensor Processing Unit (Google); specialized accelerator for ML workloads, used in pods for scale. |
| TorchScript / tf.function | Mechanisms to convert Python/model code into static graph representations for optimization/serialization. |
Action Items / Next Steps (if applying lecture content)
- When training on GPUs:
- Consider mixed precision (FP16 with FP32 accumulation) to utilize Tensor Cores; watch numerical stability.
- Check GPU memory and memory bandwidth limits when increasing model or batch sizes.
- When coding models:
- Use autograd / high-level modules to reduce boilerplate and errors.
- Always zero gradients each iteration; use torch.no_grad() for parameter updates.
- For deployment or optimization:
- Explore static graph compilation (torch.jit, tf.function) for improved performance and serialization.
- Use TensorBoard or equivalent logging for monitoring training; integrate PyTorch logging to TensorBoard.
- Choose framework based on needs:
- Need TPU or deep Google Cloud integration: consider TensorFlow (TPU compatibility).
- Need flexible research prototyping or dynamic model structures: PyTorch is recommended.