Overview
This lecture focused on the architecture, operation, and evolution of Graphics Processing Units (GPUs), emphasizing their parallel processing paradigms and their relation to SIMD (Single Instruction Multiple Data) concepts covered previously.
SIMD and Vector Processing Recap
- SIMD exploits data parallelism by applying the same instruction to multiple data points simultaneously.
- Array processors replicate processing elements, each executing the same instruction on a different data element in parallel.
- Vector processors pipeline operations (like load, add, multiply, store) across functional units to boost throughput.
- Efficient SIMD/vector processing requires high memory and register file bandwidths due to parallel data access.
- Performance benefits depend on regular, vectorizable code; irregular parallelism reduces efficiency.
- SIMD instructions can be added to scalar ISAs by partitioning registers (e.g., Intel MMX, SSE, AVX).
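The core SIMD idea above can be sketched in a few lines. This is a toy software model, not real hardware: one "instruction" (an operation) is applied lane-by-lane across all elements of the operand vectors. The function name `simd_apply` and the four-lane vectors are illustrative choices, not anything from the lecture.

```python
# Toy model of SIMD execution: a single operation is applied to every
# lane of the operand vectors, mirroring how one SIMD opcode processes
# multiple data elements at once.

def simd_apply(op, a, b):
    """Apply one operation across all lanes; lane count = vector length."""
    assert len(a) == len(b), "SIMD operands must have the same lane count"
    return [op(x, y) for x, y in zip(a, b)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(simd_apply(lambda x, y: x + y, a, b))  # one ADD over four lanes
```

In real hardware all lanes execute in the same cycle, which is why the recap stresses high register-file and memory bandwidth: every lane needs its operands delivered simultaneously.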
Introduction to GPUs and Programming Models
- GPUs evolved to handle massive parallel computation, primarily using SIMD/SIMT pipelines.
- Modern applications often require more memory and compute power than a single GPU can provide, leading to multi-GPU systems.
- Programming model (how you write code) is distinct from execution model (how hardware runs code).
- GPUs implement SPMD (Single Program Multiple Data), running the same program on many data elements using threads.
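The SPMD point above can be made concrete with a toy sketch: the programmer writes one kernel function from the perspective of a single thread, and the runtime launches that same function once per data element, with only the thread index differing. The kernel body and names here are hypothetical; a sequential loop stands in for the parallel hardware.

```python
# Toy SPMD model: one program (the kernel), many data elements.
# The kernel is written for a single thread; tid selects its element.

def kernel(tid, x, y, out):
    out[tid] = x[tid] + y[tid]

n = 8
x = list(range(n))
y = [10] * n
out = [0] * n
for tid in range(n):       # hardware would run these "threads" in parallel
    kernel(tid, x, y, out)
print(out)
```

This separation is exactly the programming-model/execution-model distinction: the code is SPMD, while the hardware is free to pack these threads into SIMD lanes.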
GPU Execution Model: SIMT & Warps
- SIMT (Single Instruction Multiple Threads) is Nvidia's term for SPMD on SIMD hardware.
- Threads are grouped into warps (typically 32 threads) that execute together on SIMD pipelines.
- Flexible scheduling of warps enables latency hiding and efficient parallel execution.
- Programmer writes code for one thread; hardware handles mapping threads to warps and scheduling.
- Warp scheduling uses fine-grained multithreading and context-switching to maximize pipeline utilization.
Handling Divergence & Warp Compaction
- Threads in the same warp may follow different control flow paths (branch divergence), leading to underutilization.
- Hardware can dynamically regroup threads at same program counter to form more efficient warps (dynamic warp formation), subject to constraints like register file conflicts and block boundaries.
- Efficient use of warps is crucial for maximizing GPU performance.
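The two mechanisms above can be sketched together. This is a toy model under stated assumptions: a warp of 4 lanes (real warps are typically 32), a hypothetical even/odd branch condition, and a simplistic regrouping rule standing in for dynamic warp formation (which in real hardware is further constrained by register-file layout and block boundaries, as noted above).

```python
# Toy divergence model: when lanes branch differently, the SIMD pipeline
# executes both paths serially, masking off the inactive lanes on each pass.

def run_warp(values):
    taken = [v % 2 == 0 for v in values]   # per-lane branch outcome
    out = [None] * len(values)
    for i, v in enumerate(values):         # "then" path: even lanes active
        if taken[i]:
            out[i] = v // 2
    for i, v in enumerate(values):         # "else" path: odd lanes active
        if not taken[i]:
            out[i] = 3 * v + 1
    passes = 2 if any(taken) and not all(taken) else 1
    return out, passes

# Toy dynamic warp formation: regroup threads that reached the same branch
# outcome into new, fully populated warps instead of two half-empty ones.
def form_warps(threads, warp_size=4):
    taken = [t for t in threads if t % 2 == 0]
    not_taken = [t for t in threads if t % 2 != 0]
    warps = []
    for group in (taken, not_taken):
        for i in range(0, len(group), warp_size):
            warps.append(group[i:i + warp_size])
    return warps

print(run_warp([1, 2, 3, 4]))        # diverged warp needs 2 serial passes
print(form_warps(list(range(8))))    # regrouped into two full warps
```

In the diverged case every instruction runs twice at half utilization; after regrouping, each new warp follows a single path at full utilization.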
GPU Architecture and Case Studies
- GPUs consist of many shader cores (streaming multiprocessors, or SMs), each with a large register file and multiple SIMD functional units.
- Tensor cores were added in recent architectures (e.g., Nvidia Volta) for machine learning acceleration.
- Support for mixed precision and sparsity is increasingly important in modern GPUs.
- Multi-GPU systems require advanced interconnects (e.g., NVLink) to scale memory and compute capabilities for large workloads.
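The tensor-core primitive mentioned above is, at its core, a small fused matrix multiply-accumulate. The sketch below is a functional toy model only: real tensor cores perform the tile operation in a single hardware instruction, multiplying in reduced precision (e.g., FP16) and accumulating in FP32, none of which plain Python floats capture. The 4x4 tile size matches the granularity described for Volta; the function name is illustrative.

```python
# Toy model of the tensor-core primitive: fused multiply-accumulate
# D = A @ B + C on a 4x4 tile (one hardware operation on real tensor cores).

def mma_4x4(A, B, C):
    n = 4
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
Z = [[0] * 4 for _ in range(4)]
print(mma_4x4(I, I, Z))   # identity @ identity + 0 = identity
```

Large matrix multiplies are tiled into many such operations, which is why dense linear-algebra workloads like ML training map so well onto these units.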
Key Terms & Definitions
- SIMD — Single Instruction Multiple Data: executes one instruction on multiple data elements.
- SIMT — Single Instruction Multiple Threads: Nvidia's model for running threads on SIMD hardware.
- SPMD — Single Program Multiple Data: programming model in which the same program runs on different data elements.
- Warp — A group of threads (typically 32) scheduled together on a GPU.
- Fine-grained multithreading — Rapidly switching between threads to hide latency.
- Branch divergence — Situation where threads in a warp take different branches in code.
- Tensor Core — Specialized GPU unit for accelerating matrix operations, especially for AI.
- Systolic array — A parallel architecture used by some accelerators (e.g., Google TPU) for matrix operations.
Action Items / Next Steps
- Review assigned readings on SIMD, vector processing, and GPU architectures.
- Consider watching suggested lectures on GPU programming and systolic arrays.
- Practice examining and programming simple GPU kernels in CUDA or similar framework.