[Lecture 30] Understanding SIMD and GPU Architectures
Apr 11, 2025
Lecture Notes: SIMD and GPUs
Introduction to SIMD
Paradigm of Single Instruction Multiple Data (SIMD):
Influences modern computer architectures.
Present in virtually all modern architectures, either as GPUs or as SIMD functional units in CPUs.
Execution Paradigms
Current and Future Topics:
SIMD architectures and Graphics Processing Units (GPUs).
Decoupled Access/Execute (optional lecture).
Previous topic: Systolic Arrays.
Systolic Arrays
Specialized architectures for accelerating computations.
Example: Google's TPU is a systolic array designed for matrix multiplication.
Google TPU Evolution: TPU, TPU2, TPU3:
The first TPU targeted inference; later generations added support for training.
Improvements in memory bandwidth, computational power, and number of systolic arrays over generations.
Research collaboration with Google revealed memory bottlenecks in TPUs.
SIMD Architectures
Key Concepts:
Exploits data parallelism.
Examples include SIMD functional units in CPUs and GPUs.
SIMD is beneficial for operations like convolution and matrix multiplication.
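The data-parallelism idea above can be sketched in a few lines of Python (used here as illustrative pseudocode; the function names and the `lanes` parameter are not from the notes). One "instruction" conceptually applies the same operation to a whole group of elements:

```python
def scalar_add(a, b):
    # SISD view: one add per loop iteration.
    out = []
    for x, y in zip(a, b):
        out.append(x + y)
    return out

def simd_add(a, b, lanes=4):
    # SIMD view: conceptually, each group of `lanes` elements is
    # processed by a single instruction.
    out = []
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
    return out
```

Both produce the same result; the SIMD version just issues one operation per group of lanes instead of one per element.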
Taxonomy of Computer Architectures
Flynn's Taxonomy:
SISD: Single Instruction, Single Data (e.g., scalar machines).
SIMD: Single Instruction, Multiple Data (e.g., vector and array processors).
MISD: Multiple Instruction, Single Data (e.g., systolic arrays).
MIMD: Multiple Instruction, Multiple Data (e.g., multiprocessors).
Comparison: Array vs. Vector Processors
Array Processors:
Perform operations on multiple elements in parallel.
Vector Processors:
Stream elements through deeply pipelined functional units, one element per unit per cycle, reusing the same hardware.
Can be more hardware-efficient than array processors due to deep pipelining.
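A rough cycle-count model (a sketch with illustrative names and simplifying assumptions, not a real machine model) makes the trade-off concrete: an array processor finishes in ceil(n / lanes) cycles at the cost of `lanes` copies of the hardware, while a vector processor pays a one-time pipeline startup latency and then retires one element per cycle through a single pipelined unit:

```python
def array_processor_cycles(n, lanes):
    # All `lanes` elements are processed in parallel each cycle.
    return -(-n // lanes)  # ceil(n / lanes)

def vector_processor_cycles(n, pipeline_depth):
    # One element enters the pipeline per cycle; the first result
    # appears after `pipeline_depth` cycles (startup latency).
    return pipeline_depth + (n - 1)
```

For example, 64 elements take 4 cycles on a 16-lane array processor, versus 69 cycles through a single 6-stage pipelined vector unit that uses far less hardware.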
SIMD Limitations
Works well with regular data parallelism (e.g., large vectors).
Inefficiency with Irregular Parallelism:
Loops with loop-carried dependencies or data-dependent control flow are hard to vectorize.
Memory bandwidth can be a bottleneck, especially with irregular data access or high stride values.
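The dependency limitation can be illustrated with two loops (a minimal Python sketch; the function names are illustrative). The first has independent iterations and maps directly onto SIMD lanes; the second is a prefix sum, where each iteration reads the previous result, so it cannot be naively vectorized:

```python
def independent(a):
    # Vectorizable: every iteration is independent of the others.
    return [x * 2 for x in a]

def loop_carried(a):
    # Not naively vectorizable: out[i] depends on out[i - 1].
    out = [a[0]]
    for x in a[1:]:
        out.append(out[-1] + x)
    return out
```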
Vector Processing Details
Vector Registers:
Store multiple data elements.
The vector length register sets how many elements an instruction operates on; the vector stride register sets the distance between consecutive elements in memory.
Chaining and Forwarding:
Chaining forwards each result element of one vector instruction directly to a dependent instruction, overlapping their execution.
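A strided vector load can be emulated in a few lines (an illustrative sketch; `memory` here is just a Python list standing in for main memory). The base address, stride, and vector length together determine exactly which elements are gathered:

```python
def strided_load(memory, base, stride, vlen):
    # Gather `vlen` elements starting at `base`, `stride` apart,
    # as a vector load with these length/stride settings would.
    return [memory[base + i * stride] for i in range(vlen)]
```

With stride 1 this is a contiguous load; a larger stride picks up, for instance, one column of a row-major matrix.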
Memory Systems
Memory Banking:
Allows concurrent memory accesses.
Ensures throughput by minimizing bank conflicts.
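Why stride interacts with banking can be shown with a small model (a sketch; the modulo bank-mapping and the example bank count are illustrative assumptions). With stride 1, consecutive addresses fall into different banks and can be accessed concurrently; when the stride equals the number of banks, every access hits the same bank and serializes:

```python
from collections import Counter

def bank_of(addr, num_banks):
    # Simple low-order interleaving: address modulo bank count.
    return addr % num_banks

def max_bank_load(addresses, num_banks):
    # Accesses to the most heavily loaded bank must serialize,
    # so this number bounds the access time.
    counts = Counter(bank_of(a, num_banks) for a in addresses)
    return max(counts.values())
```

With 8 banks, eight stride-1 accesses spread one per bank, while eight stride-8 accesses all collide in bank 0.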
Conditional Operations
Vector Masking:
Allows conditional execution on vectors.
Uses a mask register to determine which operations to execute.
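Masked execution can be emulated per lane (a minimal sketch; names are illustrative). To vectorize a loop like `if a[i] > 0: a[i] += b[i]`, the condition becomes a mask, and lanes where the mask is off keep their old value:

```python
def masked_add(a, b, mask):
    # Lanes where the mask is False keep the original value of `a`.
    return [x + y if m else x for x, y, m in zip(a, b, mask)]

a = [1, -2, 3]
b = [10, 10, 10]
mask = [x > 0 for x in a]   # the vectorized condition
result = masked_add(a, b, mask)
```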
Modern SIMD Extensions
SIMD Instructions in Modern CPUs:
Examples: Intel's MMX, SSE, AVX.
Accelerate multimedia, graphics, and machine learning tasks.
Operate on packed data types (e.g., add multiple bytes in parallel).
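What "packed" means can be emulated in software (a sketch of the concept, not real intrinsics): four unsigned bytes live in one 32-bit word, and a packed add sums each byte lane independently, with no carry crossing lane boundaries:

```python
def packed_add_u8(x, y):
    # Emulate a packed add of four unsigned bytes stored in one
    # 32-bit word: each lane wraps modulo 256, independently.
    out = 0
    for lane in range(4):
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        out |= ((a + b) & 0xFF) << (8 * lane)
    return out
```

Note how the low lane (0xFF + 0x01) wraps to 0x00 without disturbing its neighbor, which is exactly what distinguishes a packed add from an ordinary 32-bit add.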
GPU Architectures
SIMT (Single Instruction Multiple Threads):
Combines SIMD and multithreading.
Executes threads in lockstep groups (warps) on SIMD hardware.
Programming Models:
SPMD (Single Program, Multiple Data).
More flexible than traditional SIMD.
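The SPMD model can be sketched as follows (illustrative names; real GPU threads run concurrently, while this emulation runs them one after another): every "thread" executes the same program, and its thread id selects which elements it touches:

```python
def spmd_program(tid, nthreads, data, out):
    # The same program runs in every thread; only `tid` differs,
    # so each thread processes a different subset of the data.
    for i in range(tid, len(data), nthreads):
        out[i] = data[i] * 2

data = list(range(8))
out = [0] * len(data)
for tid in range(4):          # emulate 4 threads sequentially
    spmd_program(tid, 4, data, out)
```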
GPU Execution
Fine-Grain Multithreading:
Interleaves warps to keep pipeline filled.
Tolerates memory latencies by switching between warps.
Dynamic Warp Formation:
Combines threads executing the same instruction to improve SIMD utilization.
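The regrouping idea can be sketched like this (a simplified model; real dynamic warp formation must also respect register-file lane constraints, which this ignores): threads stalled at the same program counter are gathered into new, fuller warps:

```python
from collections import defaultdict

def form_warps(threads, warp_size):
    # `threads` is a list of (thread_id, pc) pairs. Group threads
    # at the same PC, then cut each group into warp-sized chunks.
    by_pc = defaultdict(list)
    for tid, pc in threads:
        by_pc[pc].append(tid)
    warps = []
    for pc, tids in by_pc.items():
        for i in range(0, len(tids), warp_size):
            warps.append((pc, tids[i:i + warp_size]))
    return warps
```

After a divergent branch leaves six threads at one PC and two at another, regrouping yields one full warp plus two partial ones instead of many mostly-idle warps.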
GPU Programming
Use of Blocks and Threads:
Programmer specifies number of threads and blocks.
Each thread executes the same code on different data.
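The block/thread indexing scheme can be emulated sequentially (a sketch; `launch` and the kernel name are illustrative, and the flat-index formula mirrors the CUDA-style `blockIdx.x * blockDim.x + threadIdx.x`):

```python
def global_thread_id(block_id, threads_per_block, thread_id):
    # Flat index: block offset plus position within the block.
    return block_id * threads_per_block + thread_id

def launch(num_blocks, threads_per_block, kernel, *args):
    # Sequential emulation of a grid launch: run the kernel once
    # per thread, passing each thread its global id.
    for b in range(num_blocks):
        for t in range(threads_per_block):
            kernel(global_thread_id(b, threads_per_block, t), *args)

data = list(range(8))
out = [0] * len(data)

def double_kernel(gid, data, out):
    # Guard against surplus threads in the last block.
    if gid < len(data):
        out[gid] = data[gid] * 2

launch(2, 4, double_kernel, data, out)
```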
Conclusion and Future Directions
GPUs and Machine Learning:
Ongoing enhancements in GPU architecture, including tensor cores for specialized operations.
Balancing General Purpose and Specialized Processing:
GPUs incorporate both SIMD and specialized units for efficiency in different applications.
Additional Readings
Lectures on heterogeneous systems for deeper understanding of GPU programming and architecture.