CUDA Performance Optimization
Chapter 7 — Memory coalescing, occupancy, and profiling
Global Memory Coalescing: Maximizing Bandwidth
The single biggest performance win in CUDA is coalesced memory access. When consecutive threads access consecutive memory locations, the GPU hardware combines those accesses into a single wide transaction:
#include <stdio.h> // ✅ COALESCED — thread i writes element i // Hardware merges all 32 warp threads into one 128-byte transaction __global__ void coalescedWrite(float *out, float *in, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = in[i] * 2.0f; } // ❌ STRIDED — thread i writes element i*stride // Each thread touches a separate cache line → 32 separate transactions per warp __global__ void stridedWrite(float *out, float *in, int n, int stride) { int i = blockIdx.x * blockDim.x + threadIdx.x; if ((long long)i * stride < n) out[i * stride] = in[i * stride] * 2.0f; } int main() { const int N = 1 << 22; // 4M floats = 16 MB float *d_a, *d_b; cudaMalloc(&d_a, N * sizeof(float)); cudaMalloc(&d_b, N * sizeof(float)); cudaEvent_t t0, t1; cudaEventCreate(&t0); cudaEventCreate(&t1); float ms; // Coalesced cudaEventRecord(t0); coalescedWrite<<<(N+255)/256, 256>>>(d_b, d_a, N); cudaEventRecord(t1); cudaEventSynchronize(t1); cudaEventElapsedTime(&ms, t0, t1); printf("Coalesced : %.3f ms\n", ms); // Strided (stride=32 → 32 cache-line hops per warp) cudaEventRecord(t0); stridedWrite<<<(N/32+255)/256, 256>>>(d_b, d_a, N, 32); cudaEventRecord(t1); cudaEventSynchronize(t1); cudaEventElapsedTime(&ms, t0, t1); printf("Strided x32: %.3f ms (expect ~32x slower)\n", ms); cudaFree(d_a); cudaFree(d_b); cudaEventDestroy(t0); cudaEventDestroy(t1); return 0; }
Structure your data so that threadIdx.x maps to the innermost dimension of your arrays — this gives you coalesced access. For 2D arrays stored row-major, each thread should process one column.
Struct of Arrays vs Array of Structs
For coalesced access, prefer Struct of Arrays (SoA) overArray of Structs (AoS):
// ❌ AoS — interleaved, poor coalescing struct Particle_AoS { float x, y, z; float vx, vy, vz; }; Particle_AoS particles[1024]; // thread i reads particles[i].x → stride of 6 floats // ✅ SoA — contiguous arrays, perfect coalescing struct Particles_SoA { float x[1024], y[1024], z[1024]; }; Particles_SoA particles; // thread i reads particles.x[i] → stride of 1 float
Maximizing Thread Occupancy: Registers and Shared Memory
Occupancy is the ratio of active warps to the maximum number of warps a Streaming Multiprocessor (SM) can host. Higher occupancy generally means better latency hiding.
Factors that limit occupancy:
- Registers per thread — more registers per thread → fewer threads fit on an SM
- Shared memory per block — more shared memory per block → fewer blocks per SM
- Block size — if block size doesn't divide evenly into SM warp slots, resources are wasted
// Query optimal block size for a kernel int minGridSize, blockSize; cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, myKernel, 0, 10000); printf("Optimal block size: %d\n", blockSize); printf("Min grid size: %d\n", minGridSize);
Higher occupancy doesn't always mean faster kernels. Sometimes reducing occupancy to use more registers or shared memory per thread yields better performance. Always profile!
Loop Unrolling
Unrolling loops reduces branch overhead and enables instruction-level parallelism:
// Manual unrolling __global__ void reduceUnrolled(float *data, float *result, int n) { __shared__ float sdata[256]; int tid = threadIdx.x; int i = blockIdx.x * blockDim.x * 2 + threadIdx.x; // Load 2 elements per thread (reduces blocks by half) sdata[tid] = (i < n ? data[i] : 0) + (i + blockDim.x < n ? data[i + blockDim.x] : 0); __syncthreads(); // Unrolled reduction for (int s = blockDim.x / 2; s > 32; s >>= 1) { if (tid < s) sdata[tid] += sdata[tid + s]; __syncthreads(); } // Warp-level reduction (no sync needed within a warp) if (tid < 32) { volatile float *vmem = sdata; vmem[tid] += vmem[tid + 32]; vmem[tid] += vmem[tid + 16]; vmem[tid] += vmem[tid + 8]; vmem[tid] += vmem[tid + 4]; vmem[tid] += vmem[tid + 2]; vmem[tid] += vmem[tid + 1]; } if (tid == 0) result[blockIdx.x] = sdata[0]; }
CUDA Profiling Tools: Nsight Systems and Nsight Compute
Never optimize blindly — always profile first. NVIDIA provides several tools:
- Nsight Compute — detailed kernel-level analysis (memory throughput, occupancy, stalls)
- Nsight Systems — system-wide timeline view (CPU/GPU interaction, stream overlap)
- nvprof (legacy) — command-line profiler, still useful for quick checks
# Quick profiling with Nsight Systems nsys profile ./my_cuda_app # Detailed kernel analysis with Nsight Compute ncu --set full ./my_cuda_app # Legacy profiler nvprof ./my_cuda_app
Optimization Checklist
- Coalesce global memory accesses — use SoA layout, align data
- Use shared memory — cache frequently reused data
- Minimize host↔device transfers — batch operations, use pinned memory
- Overlap compute & transfer — use multiple CUDA streams
- Choose the right block size — use
cudaOccupancyMaxPotentialBlockSize - Reduce thread divergence — avoid conditionals that split a warp
- Unroll critical loops — fewer instructions, better ILP
- Profile, don't guess — use Nsight Compute / Nsight Systems
Summary
- Coalesced memory access is the #1 optimization — consecutive threads should access consecutive memory
- SoA layout outperforms AoS on GPUs for this reason
- Occupancy affects latency hiding — balance registers, shared memory, and block size
- Always profile with Nsight tools before and after optimizing
Congratulations! You've completed the core CUDA curriculum. Return to the Tutorials Indexto review or check for new chapters.
Common Mistakes in CUDA Optimization
- Optimizing Blindly: Spending hours on a kernel that only accounts for 1% of total application time.
- Over-occupancy: Shoving too many threads onto an SM, leading to register spills or cache thrashing.
- Pointer Aliasing: Not using
__restrict__, which prevents the compiler from optimizing memory loads.
Practice Exercises
- Profile your vector addition from Chapter 2. What is the achieved memory bandwidth?
- Implement a Matrix Transpose kernel. Compare the performance with and without Shared Memory for coalesced writes.
Further Reading
Practice These Concepts
Reinforce what you just learned with hands-on GPU coding challenges.
Write a kernel that sums all elements of an array using parallel reduction. The array size is 1024 elements. Use shared memory for efficiency.
Implement a simple 2D box blur filter for a 64x64 grayscale image. Each output pixel should be the average of itself and its 8 neighbors (3x3 kernel). Handle edge pixels by only averaging valid neighbors.
Sign in to track your progress across challenges.