
[Lecture 30] Understanding SIMD and GPU Architectures

Apr 11, 2025

Lecture Notes: SIMD and GPUs

Introduction to SIMD

  • Paradigm of Single Instruction Multiple Data (SIMD):
    • Influences modern computer architectures.
    • Present in virtually all architectures today, whether as GPUs or as SIMD functional units inside CPUs.

Execution Paradigms

  • Current and Upcoming Topics:
    • SIMD architectures and Graphics Processing Units (GPUs).
    • Decoupled Access/Execute (optional lecture).
  • Previous topic: Systolic Arrays.

Systolic Arrays

  • Specialized architectures for accelerating computations.
  • Example: Google's TPU is a systolic array designed for matrix multiplication.
  • Google TPU Evolution: TPU, TPU2, TPU3:
    • The first TPU targeted inference; later generations also support training.
    • Each generation improves memory bandwidth, compute capability, and the number of systolic arrays.
    • A research collaboration with Google revealed memory bottlenecks in TPUs.

SIMD Architectures

  • Key Concepts:
    • Exploits data parallelism: the same operation is applied independently to many data elements.
    • Appears as SIMD functional units in CPUs and as the execution model of GPUs.
    • Well suited to operations such as convolution and matrix multiplication.

Taxonomy of Computer Architectures

  • Flynn's Taxonomy:
    • SISD: Single Instruction, Single Data (e.g., scalar machines).
    • SIMD: Single Instruction, Multiple Data (e.g., vector and array processors).
    • MISD: Multiple Instruction, Single Data (e.g., systolic arrays).
    • MIMD: Multiple Instruction, Multiple Data (e.g., multiprocessors).

Comparison: Array vs. Vector Processors

  • Array Processors:
    • Execute the same operation on all elements at the same time, using many parallel functional units (parallelism in space).
  • Vector Processors:
    • Execute the same operation on one element per cycle, reusing the same functional unit (parallelism in time).
    • Can be more hardware efficient due to deep pipelining.

SIMD Limitations

  • Works well with regular data parallelism (e.g., operations over large vectors).
  • Inefficient with irregular parallelism:
    • Loops with loop-carried dependencies are hard to vectorize (see the sketch after this list).
  • Memory bandwidth can become a bottleneck, especially with irregular access patterns or large strides.
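
A minimal C++ sketch of the difference, with hypothetical function names: the first loop's iterations are independent and vectorize cleanly, while the second carries a dependency between iterations.

```cpp
// Data parallel: every iteration is independent, so a SIMD machine can
// process many elements per instruction.
void vectorizable(float* c, const float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Loop-carried dependency: a[i] needs a[i-1], so iterations cannot
// execute in lockstep and the loop resists vectorization.
void not_vectorizable(float* a, const float* b, int n) {
    for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
}
```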

Vector Processing Details

  • Vector Registers:
    • Each holds multiple data elements.
    • A Vector Length Register (VLEN) sets how many elements an instruction operates on; a Vector Stride Register (VSTR) sets the distance in memory between consecutive elements (see the sketch after this list).
  • Chaining and Forwarding:
    • Forward each result element to a dependent instruction as soon as it is produced, overlapping dependent vector operations to improve performance.
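
A scalar emulation of a strided vector load, illustrative only; the function name and the array-based "vector register" are assumptions, not real hardware or a real API.

```cpp
// Gather vlen elements from memory, vstr elements apart, into a
// "vector register" modeled as a plain array. Accessing column j of a
// row-major n x n matrix, for example, uses a stride of n.
void vector_load(float* vreg, const float* base, int vlen, int vstr) {
    for (int i = 0; i < vlen; ++i)
        vreg[i] = base[i * vstr];  // element i comes from base + i*VSTR
}
```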

Memory Systems

  • Memory Banking:
    • Splits memory into independent banks so multiple accesses can proceed concurrently.
    • Sustains throughput as long as accesses map to different banks; bank conflicts serialize accesses (see the mapping sketch after this list).
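
A sketch of an interleaved bank mapping; the mapping function and bank count are assumptions for illustration.

```cpp
// With interleaving, consecutive elements go to consecutive banks.
// For a stride-s access, element i lands in bank (i * s) % NUM_BANKS:
// with 16 banks, stride 1 touches all banks (full throughput), while
// stride 16 maps every element to bank 0 (fully serialized).
constexpr int NUM_BANKS = 16;

int bank_of(int element_index, int stride) {
    return (element_index * stride) % NUM_BANKS;
}
```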

Conditional Operations

  • Vector Masking:
    • Enables conditional execution over vector elements.
    • A mask register holds one bit per element; an operation takes effect only where the mask bit is set (see the sketch after this list).
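
A scalar sketch of masked execution, assuming a maximum vector length of 64: vectorizing "if (a[i] != 0) b[i] = b[i] / a[i];" needs a mask so that only some elements are updated.

```cpp
void masked_divide(float* b, const float* a, int vlen) {
    bool mask[64];                    // one mask bit per element (vlen <= 64)
    for (int i = 0; i < vlen; ++i)
        mask[i] = (a[i] != 0.0f);     // set mask: VMASK = (A != 0)
    for (int i = 0; i < vlen; ++i)
        if (mask[i])
            b[i] = b[i] / a[i];       // divide only where the mask is set
}
```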

Modern SIMD Extensions

  • SIMD Instructions in Modern CPUs:
    • Examples: Intel's MMX, SSE, and AVX extensions.
    • Accelerate multimedia, graphics, and machine learning workloads.
    • Operate on packed data types, e.g., adding many narrow elements with one instruction (see the AVX sketch after this list).
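
A minimal AVX sketch using intrinsics from immintrin.h; the function name is made up, and n is assumed to be a multiple of 8.

```cpp
#include <immintrin.h>

// One 256-bit packed add processes eight 32-bit floats at once.
void add_avx(float* c, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);   // 8 additions in one instruction
        _mm256_storeu_ps(c + i, vc);         // store 8 results
    }
}
```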

GPU Architectures

  • SIMT (Single Instruction, Multiple Thread):
    • Combines SIMD execution with fine-grained multithreading.
    • Groups threads into warps; threads in a warp execute the same instruction in lockstep on different data.
  • Programming Model:
    • SPMD (Single Program, Multiple Data): every thread runs the same program on its own portion of the data.
    • More flexible than traditional SIMD.

GPU Execution

  • Fine-Grain Multithreading:
    • Interleaves warps cycle by cycle to keep the pipeline full.
    • Tolerates long memory latencies by switching to ready warps while others wait on memory.
  • Dynamic Warp Formation:
    • Regroups threads that are executing the same instruction into new warps, improving SIMD utilization after branch divergence (see the divergence sketch after this list).
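
A CUDA sketch of branch divergence with a hypothetical kernel: threads in the same warp take different paths, so the hardware runs both paths with part of the warp masked off, lowering SIMD utilization; dynamic warp formation tries to regroup threads that follow the same path.

```cuda
__global__ void divergent(float* out, const float* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;  // even threads take this path
    else
        out[i] = in[i] + 1.0f;  // odd threads take the other path
}
```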

GPU Programming

  • Use of Blocks and Threads:
    • The programmer specifies the number of threads per block and the number of blocks.
    • Each thread executes the same kernel code on different data, selected by its block and thread indices (see the kernel sketch after this list).
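
A minimal CUDA vector-add sketch with made-up names: each thread derives a global index from its block and thread IDs and handles one element.

```cuda
__global__ void vecAdd(float* c, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard: n need not be a multiple of blockDim
        c[i] = a[i] + b[i];    // same code, different data per thread
}

// Host-side launch: 256 threads per block, enough blocks to cover n.
// vecAdd<<<(n + 255) / 256, 256>>>(d_c, d_a, d_b, n);
```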

Conclusion and Future Directions

  • GPUs and Machine Learning:
    • GPU architectures continue to evolve, e.g., adding tensor cores for specialized matrix operations.
  • Balancing General-Purpose and Specialized Processing:
    • GPUs combine general-purpose SIMT execution with specialized units to stay efficient across different applications.

Additional Readings

  • See the lectures on heterogeneous systems for a deeper treatment of GPU programming and architecture.