Introduction to CUDA Programming

Jul 9, 2024

GPU Architectures and Programming: Introduction to CUDA Programming

What is CUDA?

  • CUDA: Compute Unified Device Architecture
  • Programming language extension of C, developed by NVIDIA for their GPUs
  • Supports parallel programming constructs
  • Commonly used to program GPUs as accelerator devices working alongside high-performance CPUs

Perspectives of a CUDA Programmer

  • CPU (Host): Runs the main orchestration program
  • GPU (Device): Executes parallel jobs dispatched by the CPU
  • Multiple GPUs can be attached to a single CPU
  • CUDA programs consist of host code and device code
  • Host Code: Executes on CPU
  • Device Code: Executes on GPU
  • NVCC (NVIDIA CUDA Compiler): Dedicated compiler required to build CUDA programs

Basic CUDA Program Structure

  • Host code (runs on CPU) and device code (runs on GPU)
  • Compiling with NVCC
    • NVCC compiles CUDA program into host code and device code
    • Host code is further compiled by the host C compiler
    • Device code is compiled to an intermediate form (PTX) that can be JIT-compiled for the target GPU at run time

Execution Flow

  1. Host code executes serially on the CPU
  2. Host code launches device code (GPU parallel kernel)
  3. Device code executes on GPU and returns results to host code
  4. Host code may process results and launch more kernels if needed
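
A minimal sketch of this flow is shown below. The kernel name, the data size, and the per-element work are illustrative assumptions, not part of the original example:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Device code: executed in parallel by many GPU threads.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;          // trivial per-element work
}

int main(void) {
    const int n = 1024;
    float *d_data;

    // 1. Host code runs serially on the CPU (allocation, setup, ...).
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 2. Host launches the parallel kernel on the GPU.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // 3. Host waits until the device has finished before using the results.
    cudaDeviceSynchronize();

    // 4. Host may process results and launch further kernels if needed.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    printf("done\n");
    return 0;
}
```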

Example: Vector Addition

CPU-Only Vector Addition

  • Main function declares and allocates float arrays
  • vecAdd function adds two vectors and stores the result in a third array
  • Function arguments: first two arrays are inputs, the third is output
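
A plain C version along these lines would look roughly as follows (the exact names and array sizes in the original code may differ):

```cuda
#include <stdlib.h>

// Sequential vector addition: h_C[i] = h_A[i] + h_B[i] for all i.
void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    for (int i = 0; i < n; ++i)
        h_C[i] = h_A[i] + h_B[i];
}

int main(void) {
    int n = 1000;
    float *h_A = (float *)malloc(n * sizeof(float));
    float *h_B = (float *)malloc(n * sizeof(float));
    float *h_C = (float *)malloc(n * sizeof(float));
    // ... initialize h_A and h_B, then:
    vecAdd(h_A, h_B, h_C, n);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```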

CPU-GPU Vector Addition with CUDA

  1. Include CUDA Header Files: cuda.h and cuda_runtime.h
  2. Host Program (vecAdd): Similar to the CPU-only version, but launches a GPU kernel
  3. Device Memory Allocation: Using cudaMalloc for memory on GPU
  4. Data Transfer to GPU: Using cudaMemcpy from host (CPU) to device (GPU)
  5. Launching GPU Kernel: Special syntax for calling GPU functions
  6. CUDA Kernel Definition: Parallel computation, each thread adds corresponding elements of two vectors
  7. Error Handling: Using cudaGetLastError to check for errors
  8. Copying Results Back to Host: Using cudaMemcpy from device to host
  9. Memory Deallocation: Using cudaFree to free device memory
  10. Compilation and Execution: Compile with NVCC, run the binary, and check results
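
A hedged sketch of the host side of these steps is shown below; the variable names (d_A, d_B, d_C), the block size, and the error-handling style are illustrative assumptions, and the kernel itself is sketched under "Example Kernel" further down:

```cuda
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int n);  // defined below

void vecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 3. Allocate device memory on the GPU.
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // 4. Copy the input vectors from host memory to device memory.
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // 5. Launch the kernel with one thread per element.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    // 7. Check for kernel launch errors.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));

    // 8. Copy the result vector back from the device to the host.
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // 9. Free device memory.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
```

Step 10 would then be a compile-and-run along the lines of `nvcc vecadd.cu -o vecadd && ./vecadd` (file name assumed).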

Detailed Memory Operations

  • cudaMalloc allocates memory on the GPU
    • A generic pointer (void **) is required
    • The returned status is checked against cudaSuccess
  • cudaMemcpy transfers data between host and device
    • Directions: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
    • Device memory pointers cannot be dereferenced in host code, only in device code
  • Data cannot be copied directly between different GPU devices with a plain cudaMemcpy; such transfers are staged through the host (or use explicit peer-to-peer copy APIs such as cudaMemcpyPeer)
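
For example, the cudaMalloc return status can be checked against cudaSuccess like this (a minimal, self-contained sketch):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000;
    float *d_A = NULL;

    // cudaMalloc takes a generic (void **) pointer to the device pointer.
    cudaError_t err = cudaMalloc((void **)&d_A, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // d_A may be passed to kernels and to cudaMemcpy,
    // but it must not be dereferenced in host code.
    cudaFree(d_A);
    return 0;
}
```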

Calling a CUDA Kernel

  • Launches multiple threads organized in a two-level hierarchy
  • Syntax: kernel<<<blocksPerGrid, threadsPerBlock>>>(args)
    • blocksPerGrid: Number of blocks in the grid
    • threadsPerBlock: Number of threads in each block (max 1024)
  • Example hierarchy for n threads: ceil(n/256) blocks of 256 threads each (rounded up so every element is covered)
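
In code, the rounded-up grid size is typically computed with an integer ceiling division; the fragment below assumes n, the device pointers, and the vectorAdd kernel from the sketch above are in scope:

```cuda
int threadsPerBlock = 256;                                          // at most 1024 per block
int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256)
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
```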

Example Kernel: vectorAdd Function

  • Computes a per-thread index from the block and thread IDs
  • Adds elements in parallel using multiple threads
  • Each thread computes C[i] = A[i] + B[i]
  • Return type: Kernel functions executed on the GPU must return void; errors are instead checked with cudaGetLastError
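
A minimal version of such a kernel might look like this; the __global__ qualifier and the void return type are required, while the names match the earlier sketch:

```cuda
// Each thread computes one element: C[i] = A[i] + B[i].
__global__ void vectorAdd(const float *A, const float *B, float *C, int n) {
    // Global index of this thread within the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard: the grid may contain more threads than there are elements.
    if (i < n)
        C[i] = A[i] + B[i];
}
```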

Observations and Best Practices

  • CUDA includes functions for memory allocation, copying, and error checking
  • Kernels are defined for parallel execution of computations
  • Threads are the basic units of computation, executed by the scalar processors (CUDA cores) of the GPU
  • Hierarchical arrangement of threads allows efficient parallel computation