Introduction to CUDA Programming

Jul 9, 2024

GPU Architectures and Programming: Introduction to CUDA Programming

What is CUDA?

  • CUDA: Compute Unified Device Architecture
  • Programming language extension of C, developed by NVIDIA for their GPUs
  • Supports parallel programming constructs
  • Commonly used to program GPUs as accelerator devices working alongside high-performance CPUs

Perspectives of a CUDA Programmer

  • CPU (Host): Runs the main orchestration program
  • GPU (Device): Executes parallel jobs dispatched by the CPU
  • Multiple GPUs can be attached to a single CPU
  • CUDA programs consist of host code and device code
  • Host Code: Executes on CPU
  • Device Code: Executes on GPU
  • NVCC (NVIDIA CUDA Compiler): Dedicated compiler required to build CUDA programs

Basic CUDA Program Structure

  • Host code (runs on CPU) and device code (runs on GPU)
  • Compiling with NVCC
    • NVCC compiles CUDA program into host code and device code
    • Host code is further compiled by the host C compiler
    • Device code is compiled to an intermediate form (PTX) that can be JIT-compiled for the target GPU at run time

Execution Flow

  1. Host code executes serially on the CPU
  2. Host code launches device code (GPU parallel kernel)
  3. Device code executes on GPU and returns results to host code
  4. Host code may process results and launch more kernels if needed
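
A minimal sketch of this flow is shown below. The kernel name, the data size, and the per-element work are illustrative assumptions, not part of the original example:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Device code: executed in parallel by many GPU threads.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;          // trivial per-element work
}

int main(void) {
    const int n = 1024;
    float *d_data;

    // 1. Host code runs serially on the CPU (allocation, setup, ...).
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 2. Host launches the parallel kernel on the GPU.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // 3. Host waits until the device has finished before using the results.
    cudaDeviceSynchronize();

    // 4. Host may process results and launch further kernels if needed.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    printf("done\n");
    return 0;
}
```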

Example: Vector Addition

CPU-Only Vector Addition

  • Main function declares and allocates float arrays
  • vecAdd function adds two vectors and stores the result in a third array
  • Function arguments: first two arrays are inputs, the third is output
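
A plain C version along these lines would look roughly as follows (the exact names and array sizes in the original code may differ):

```cuda
#include <stdlib.h>

// Sequential vector addition: h_C[i] = h_A[i] + h_B[i] for all i.
void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    for (int i = 0; i < n; ++i)
        h_C[i] = h_A[i] + h_B[i];
}

int main(void) {
    int n = 1000;
    float *h_A = (float *)malloc(n * sizeof(float));
    float *h_B = (float *)malloc(n * sizeof(float));
    float *h_C = (float *)malloc(n * sizeof(float));
    // ... initialize h_A and h_B, then:
    vecAdd(h_A, h_B, h_C, n);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```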

CPU-GPU Vector Addition with CUDA

  1. Include CUDA Header Files: cuda.h and cuda_runtime.h
  2. Host Program (vecAdd): Similar to the CPU-only version, but launches a GPU kernel
  3. Device Memory Allocation: Using cudaMalloc for memory on GPU
  4. Data Transfer to GPU: Using cudaMemcpy from host (CPU) to device (GPU)
  5. Launching GPU Kernel: Special syntax for calling GPU functions
  6. CUDA Kernel Definition: Parallel computation, each thread adds corresponding elements of two vectors
  7. Error Handling: Using cudaGetLastError to check for errors
  8. Copying Results Back to Host: Using cudaMemcpy from device to host
  9. Memory Deallocation: Using cudaFree to free device memory
  10. Compilation and Execution: Compile with NVCC, run the binary, and check results
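
A hedged sketch of the host side of these steps is shown below; the variable names (d_A, d_B, d_C), the block size, and the error-handling style are illustrative assumptions, and the kernel itself is sketched under "Example Kernel" further down:

```cuda
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int n);  // defined below

void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 3. Allocate device memory on the GPU.
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // 4. Copy the input vectors from host memory to device memory.
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 5. Launch the kernel with one thread per element.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    // 7. Check for kernel launch errors.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));

    // 8. Copy the result vector back from the device to the host.
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // 9. Free device memory.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```

Step 10 would then be a compile-and-run along the lines of `nvcc vecadd.cu -o vecadd && ./vecadd` (file name assumed).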

Detailed Memory Operations

  • cudaMalloc allocates memory on the GPU
    • A generic pointer (void **) is required
    • The returned status is checked against cudaSuccess
  • cudaMemcpy transfers data between host and device
    • Directions: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
    • Device memory pointers cannot be dereferenced in host code, only in device code
  • Data cannot be copied directly between different GPU devices with a plain cudaMemcpy; such transfers are staged through the host (or use explicit peer-to-peer copy APIs such as cudaMemcpyPeer)
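
For example, the cudaMalloc return status can be checked against cudaSuccess like this (a minimal, self-contained sketch):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000;
    float *d_A = NULL;

    // cudaMalloc takes a generic (void **) pointer to the device pointer.
    cudaError_t err = cudaMalloc((void **)&d_A, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // d_A may be passed to kernels and to cudaMemcpy,
    // but it must not be dereferenced in host code.
    cudaFree(d_A);
    return 0;
}
```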

Calling a CUDA Kernel

  • Launches multiple threads organized in a two-level hierarchy
  • Syntax: kernel<<<blocksPerGrid, threadsPerBlock>>>(args)
    • blocksPerGrid: Number of blocks in the grid
    • threadsPerBlock: Number of threads in each block (max 1024)
  • Example hierarchy for n threads: ceil(n/256) blocks of 256 threads each (rounded up so every element is covered)
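
In code, the rounded-up grid size is typically computed with an integer ceiling division; the fragment below assumes n, the device pointers, and the vectorAdd kernel from the sketch above are in scope:

```cuda
int threadsPerBlock = 256;                                          // at most 1024 per block
int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256)
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
```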

Example Kernel: vectorAdd Function

  • Computes a per-thread index from the block and thread IDs
  • Adds elements in parallel using multiple threads
  • Each thread computes C[i] = A[i] + B[i]
  • Return type: Kernel functions executed on the GPU must return void; errors are instead checked with cudaGetLastError
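
A minimal version of such a kernel might look like this; the __global__ qualifier and the void return type are required, while the names match the earlier sketch:

```cuda
// Each thread computes one element: C[i] = A[i] + B[i].
__global__ void vectorAdd(const float *A, const float *B, float *C, int n) {
    // Global index of this thread within the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard: the grid may contain more threads than there are elements.
    if (i < n)
        C[i] = A[i] + B[i];
}
```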

Observations and Best Practices

  • CUDA includes functions for memory allocation, copying, and error checking
  • Kernels are defined for parallel execution of computations
  • Threads are the basic units of computation, executed by the scalar processors (CUDA cores) of the GPU
  • Hierarchical arrangement of threads allows efficient parallel computation