CUDA Programming Model
Chapter 2 — Threads, blocks, grids and kernel functions
The CUDA Thread Hierarchy: Threads, Blocks, and Grids
CUDA organizes parallel execution into a three-level hierarchy:
- Thread — the smallest unit of execution. Each thread runs the kernel function independently.
- Block — a group of threads that can cooperate via shared memory and synchronization.
- Grid — a collection of blocks that together execute the kernel.
A Grid is a city. Each Block is a building. Each Thread is a worker inside a building. Workers in the same building can talk to each other (shared memory), but workers across buildings communicate only through global memory.
CUDA Thread Indexing: How to Map Data to Threads
Every thread has built-in variables to identify itself:
threadIdx.x,threadIdx.y,threadIdx.z— thread index within its blockblockIdx.x,blockIdx.y,blockIdx.z— block index within the gridblockDim.x,blockDim.y,blockDim.z— number of threads per block in each dimensiongridDim.x,gridDim.y,gridDim.z— number of blocks in the grid in each dimension
The global thread index is computed as:
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
CUDA Kernel Qualifiers: __global__, __device__, __host__
CUDA extends C/C++ with three function qualifiers:
__global__— runs on the GPU, called from the CPU (or from another kernel). Must returnvoid.__device__— runs on the GPU, callable only from other GPU functions.__host__— runs on the CPU. This is the default for all normal C/C++ functions.
You can combine __host__ __device__ to compile a function for both CPU and GPU.
Example: Vector Addition
Vector addition is the "Hello World" of GPU computing. Each thread adds one element:
#include <stdio.h> __global__ void vecAdd(float *a, float *b, float *c, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { c[i] = a[i] + b[i]; } } int main() { int n = 1024; size_t bytes = n * sizeof(float); // Allocate host memory float *h_a = (float*)malloc(bytes); float *h_b = (float*)malloc(bytes); float *h_c = (float*)malloc(bytes); // Initialize vectors for (int i = 0; i < n; i++) { h_a[i] = i * 1.0f; h_b[i] = i * 2.0f; } // Allocate device memory float *d_a, *d_b, *d_c; cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); // Copy data to GPU cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice); // Launch kernel: 4 blocks × 256 threads = 1024 threads int threadsPerBlock = 256; int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock; vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n); // Copy result back to CPU cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost); // Verify printf("c[0] = %f, c[1023] = %f\n", h_c[0], h_c[1023]); // Cleanup cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); free(h_a); free(h_b); free(h_c); return 0; }
Key Takeaways from this Example
- The boundary check
if (i < n)is critical — we may launch more threads than data elements. - The formula
(n + threadsPerBlock - 1) / threadsPerBlockrounds up the number of blocks so every element is covered. - Memory must be explicitly allocated on the GPU with
cudaMallocand copied withcudaMemcpy.
Choosing the Optimal CUDA Block Size
Block size (threads per block) is a performance-critical parameter:
- Must be a multiple of 32 (the warp size). Common choices: 128, 256, 512.
- Maximum is typically 1024 threads per block.
- Larger blocks use more shared memory and registers per block, which can limit occupancy.
Start with 256 threads per block. Profile and adjust from there.
Using Multi-Dimensional Grids and Blocks for 2D/3D Data
For 2D problems (like image processing), you can use 2D blocks and grids:
dim3 threadsPerBlock(16, 16); // 16×16 = 256 threads dim3 blocksPerGrid( (width + 15) / 16, (height + 15) / 16 ); myKernel<<<blocksPerGrid, threadsPerBlock>>>(...); // Inside the kernel: int col = blockIdx.x * blockDim.x + threadIdx.x; int row = blockIdx.y * blockDim.y + threadIdx.y;
Summary
- CUDA threads are organized into Blocks inside a Grid
- Use
blockIdx,threadIdx, andblockDimto compute each thread's unique work index __global__functions (kernels) run on the GPU and are launched from the CPU- Always check array bounds inside kernels to avoid out-of-range writes
Now that you understand the thread hierarchy, the next step is to masterCUDA Memory Managementto move data efficiently between the CPU and GPU.
Common Mistakes in CUDA Programming Model
- Integer Division Pitfall: When calculating
blocksPerGrid, usingn / threadsPerBlockwithout rounding up will skip the last few elements. - Hardcoding 1,024: Assuming 1,024 threads is always best. 256 is often a safer starting point for occupancy.
- Ignoring Warp Divergence: Writing complex
if/elselogic inside a kernel (detailed in Chapter 7).
Practice Exercises
- Write a 1D kernel that fills an array with the square of each index (i.e.,
c[i] = i * i). - Modify the vector addition example to use a 2D grid/block configuration for a 1D array. Why might you do this?
Further Reading
Practice These Concepts
Reinforce what you just learned with hands-on GPU coding challenges.
Implement the BLAS SAXPY operation: y = a*x + y, where a is a scalar and x, y are vectors. This is one of the most fundamental GPU operations. Vector size N = 8192. Each thread handles one element.
Sign in to track your progress across challenges.