
[Lecture 30] Understanding SIMD and GPU Architectures

Apr 11, 2025

Lecture Notes: SIMD and GPUs

Introduction to SIMD

  • Paradigm of Single Instruction Multiple Data (SIMD):
    • Influences modern computer architectures.
    • Present in virtually all architectures today, whether as GPUs or as SIMD functional units inside CPUs.

Execution Paradigms

  • Current and Upcoming Topics:
    • SIMD architectures and Graphics Processing Units (GPUs).
    • Decoupled Access/Execute (optional lecture).
  • Previous topic: Systolic Arrays.

Systolic Arrays

  • Specialized architectures for accelerating computations.
  • Example: Google's TPU is a systolic array designed for matrix multiplication.
  • Google TPU Evolution: TPU, TPU2, TPU3:
    • The first TPU targeted inference; later generations also support training.
    • Each generation improves memory bandwidth, compute capability, and the number of systolic arrays.
    • A research collaboration with Google revealed memory bottlenecks in TPUs.

SIMD Architectures

  • Key Concepts:
    • Exploits data parallelism: the same operation is applied independently to many data elements.
    • Appears as SIMD functional units in CPUs and as the execution model of GPUs.
    • Well suited to operations such as convolution and matrix multiplication.

Taxonomy of Computer Architectures

  • Flynn's Taxonomy:
    • SISD: Single Instruction, Single Data (e.g., scalar machines).
    • SIMD: Single Instruction, Multiple Data (e.g., vector and array processors).
    • MISD: Multiple Instruction, Single Data (e.g., systolic arrays).
    • MIMD: Multiple Instruction, Multiple Data (e.g., multiprocessors).

Comparison: Array vs. Vector Processors

  • Array Processors:
    • Execute the same operation on all elements at the same time, using many parallel functional units (parallelism in space).
  • Vector Processors:
    • Execute the same operation on one element per cycle, reusing the same functional unit (parallelism in time).
    • Can be more hardware efficient due to deep pipelining.

SIMD Limitations

  • Works well with regular data parallelism (e.g., operations over large vectors).
  • Inefficient with irregular parallelism:
    • Loops with loop-carried dependencies are hard to vectorize (see the sketch after this list).
  • Memory bandwidth can become a bottleneck, especially with irregular access patterns or large strides.
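
A minimal C++ sketch of the difference, with hypothetical function names: the first loop's iterations are independent and vectorize cleanly, while the second carries a dependency between iterations.

```cpp
// Data parallel: every iteration is independent, so a SIMD machine can
// process many elements per instruction.
void vectorizable(float* c, const float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Loop-carried dependency: a[i] needs a[i-1], so iterations cannot
// execute in lockstep and the loop resists vectorization.
void not_vectorizable(float* a, const float* b, int n) {
    for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + b[i];
}
```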

Vector Processing Details

  • Vector Registers:
    • Each holds multiple data elements.
    • A Vector Length Register (VLEN) sets how many elements an instruction operates on; a Vector Stride Register (VSTR) sets the distance in memory between consecutive elements (see the sketch after this list).
  • Chaining and Forwarding:
    • Forward each result element to a dependent instruction as soon as it is produced, overlapping dependent vector operations to improve performance.
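
A scalar emulation of a strided vector load, illustrative only; the function name and the array-based "vector register" are assumptions, not real hardware or a real API.

```cpp
// Gather vlen elements from memory, vstr elements apart, into a
// "vector register" modeled as a plain array. Accessing column j of a
// row-major n x n matrix, for example, uses a stride of n.
void vector_load(float* vreg, const float* base, int vlen, int vstr) {
    for (int i = 0; i < vlen; ++i)
        vreg[i] = base[i * vstr];  // element i comes from base + i*VSTR
}
```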

Memory Systems

  • Memory Banking:
    • Splits memory into independent banks so multiple accesses can proceed concurrently.
    • Sustains throughput as long as accesses map to different banks; bank conflicts serialize accesses (see the mapping sketch after this list).
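
A sketch of an interleaved bank mapping; the mapping function and bank count are assumptions for illustration.

```cpp
// With interleaving, consecutive elements go to consecutive banks.
// For a stride-s access, element i lands in bank (i * s) % NUM_BANKS:
// with 16 banks, stride 1 touches all banks (full throughput), while
// stride 16 maps every element to bank 0 (fully serialized).
constexpr int NUM_BANKS = 16;

int bank_of(int element_index, int stride) {
    return (element_index * stride) % NUM_BANKS;
}
```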

Conditional Operations

  • Vector Masking:
    • Enables conditional execution over vector elements.
    • A mask register holds one bit per element; an operation takes effect only where the mask bit is set (see the sketch after this list).
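
A scalar sketch of masked execution, assuming a maximum vector length of 64: vectorizing "if (a[i] != 0) b[i] = b[i] / a[i];" needs a mask so that only some elements are updated.

```cpp
void masked_divide(float* b, const float* a, int vlen) {
    bool mask[64];                    // one mask bit per element (vlen <= 64)
    for (int i = 0; i < vlen; ++i)
        mask[i] = (a[i] != 0.0f);     // set mask: VMASK = (A != 0)
    for (int i = 0; i < vlen; ++i)
        if (mask[i])
            b[i] = b[i] / a[i];       // divide only where the mask is set
}
```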

Modern SIMD Extensions

  • SIMD Instructions in Modern CPUs:
    • Examples: Intel's MMX, SSE, and AVX extensions.
    • Accelerate multimedia, graphics, and machine learning workloads.
    • Operate on packed data types, e.g., adding many narrow elements with one instruction (see the AVX sketch after this list).
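
A minimal AVX sketch using intrinsics from immintrin.h; the function name is made up, and n is assumed to be a multiple of 8.

```cpp
#include <immintrin.h>

// One 256-bit packed add processes eight 32-bit floats at once.
void add_avx(float* c, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);   // 8 additions in one instruction
        _mm256_storeu_ps(c + i, vc);         // store 8 results
    }
}
```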

GPU Architectures

  • SIMT (Single Instruction, Multiple Thread):
    • Combines SIMD execution with fine-grained multithreading.
    • Groups threads into warps; threads in a warp execute the same instruction in lockstep on different data.
  • Programming Model:
    • SPMD (Single Program, Multiple Data): every thread runs the same program on its own portion of the data.
    • More flexible than traditional SIMD.

GPU Execution

  • Fine-Grain Multithreading:
    • Interleaves warps cycle by cycle to keep the pipeline full.
    • Tolerates long memory latencies by switching to ready warps while others wait on memory.
  • Dynamic Warp Formation:
    • Regroups threads that are executing the same instruction into new warps, improving SIMD utilization after branch divergence (see the divergence sketch after this list).
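
A CUDA sketch of branch divergence with a hypothetical kernel: threads in the same warp take different paths, so the hardware runs both paths with part of the warp masked off, lowering SIMD utilization; dynamic warp formation tries to regroup threads that follow the same path.

```cuda
__global__ void divergent(float* out, const float* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;  // even threads take this path
    else
        out[i] = in[i] + 1.0f;  // odd threads take the other path
}
```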

GPU Programming

  • Use of Blocks and Threads:
    • The programmer specifies the number of threads per block and the number of blocks.
    • Each thread executes the same kernel code on different data, selected by its block and thread indices (see the kernel sketch after this list).
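
A minimal CUDA vector-add sketch with made-up names: each thread derives a global index from its block and thread IDs and handles one element.

```cuda
__global__ void vecAdd(float* c, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard: n need not be a multiple of blockDim
        c[i] = a[i] + b[i];    // same code, different data per thread
}

// Host-side launch: 256 threads per block, enough blocks to cover n.
// vecAdd<<<(n + 255) / 256, 256>>>(d_c, d_a, d_b, n);
```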

Conclusion and Future Directions

  • GPUs and Machine Learning:
    • GPU architectures continue to evolve, e.g., adding tensor cores for specialized matrix operations.
  • Balancing General-Purpose and Specialized Processing:
    • GPUs combine general-purpose SIMT execution with specialized units to stay efficient across different applications.

Additional Readings

  • See the lectures on heterogeneous systems for a deeper treatment of GPU programming and architecture.