What is a CUDA thread block?

A CUDA thread block is a group of threads that execute the same kernel and can communicate via shared memory and synchronize their execution.

How many threads can be in a CUDA block?

In modern NVIDIA GPUs, the maximum number of threads per block is usually 1024.

CUDA Programming Model

Chapter 2 — Threads, blocks, grids and kernel functions

The CUDA Thread Hierarchy: Threads, Blocks, and Grids

CUDA organizes parallel execution into a three-level hierarchy:

Thread — the smallest unit of execution. Each thread runs the kernel function independently.
Block — a group of threads that can cooperate via shared memory and synchronization.
Grid — a collection of blocks that together execute the kernel.

Think of it this way

A Grid is a city. Each Block is a building. Each Thread is a worker inside a building. Workers in the same building can talk to each other (shared memory), but workers across buildings communicate only through global memory.

CUDA Thread Indexing: How to Map Data to Threads

Every thread has built-in variables to identify itself:

threadIdx.x, threadIdx.y, threadIdx.z — thread index within its block
blockIdx.x, blockIdx.y, blockIdx.z — block index within the grid
blockDim.x, blockDim.y, blockDim.z — number of threads per block in each dimension
gridDim.x, gridDim.y, gridDim.z — number of blocks in the grid in each dimension

The global thread index is computed as:

cuda
1
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

CUDA Kernel Qualifiers: global, device, host

CUDA extends C/C++ with three function qualifiers:

__global__ — runs on the GPU, called from the CPU (or from another kernel). Must return void.
__device__ — runs on the GPU, callable only from other GPU functions.
__host__ — runs on the CPU. This is the default for all normal C/C++ functions.

You can combine __host__ __device__ to compile a function for both CPU and GPU.

Example: Vector Addition

Vector addition is the "Hello World" of GPU computing. Each thread adds one element:

vec_add.cu
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950
#include <stdio.h>

__global__ void vecAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    int n = 1024;
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    // Initialize vectors
    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data to GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: 4 blocks × 256 threads = 1024 threads
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy result back to CPU
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Verify
    printf("c[0] = %f, c[1023] = %f\n", h_c[0], h_c[1023]);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Key Takeaways from this Example

The boundary check if (i < n) is critical — we may launch more threads than data elements.
The formula (n + threadsPerBlock - 1) / threadsPerBlock rounds up the number of blocks so every element is covered.
Memory must be explicitly allocated on the GPU with cudaMalloc and copied with cudaMemcpy.

Choosing the Optimal CUDA Block Size

Block size (threads per block) is a performance-critical parameter:

Must be a multiple of 32 (the warp size). Common choices: 128, 256, 512.
Maximum is typically 1024 threads per block.
Larger blocks use more shared memory and registers per block, which can limit occupancy.

Rule of Thumb

Start with 256 threads per block. Profile and adjust from there.

Using Multi-Dimensional Grids and Blocks for 2D/3D Data

For 2D problems (like image processing), you can use 2D blocks and grids:

cuda
1234567891011
dim3 threadsPerBlock(16, 16);     // 16×16 = 256 threads
dim3 blocksPerGrid(
    (width + 15) / 16,
    (height + 15) / 16
);

myKernel<<<blocksPerGrid, threadsPerBlock>>>(...);

// Inside the kernel:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;

Summary

CUDA threads are organized into Blocks inside a Grid
Use blockIdx, threadIdx, and blockDim to compute each thread's unique work index
__global__ functions (kernels) run on the GPU and are launched from the CPU
Always check array bounds inside kernels to avoid out-of-range writes

Now that you understand the thread hierarchy, the next step is to masterCUDA Memory Managementto move data efficiently between the CPU and GPU.

Common Mistakes in CUDA Programming Model

Integer Division Pitfall: When calculating blocksPerGrid, using n / threadsPerBlock without rounding up will skip the last few elements.
Hardcoding 1,024: Assuming 1,024 threads is always best. 256 is often a safer starting point for occupancy.
Ignoring Warp Divergence: Writing complex if/else logic inside a kernel (detailed in Chapter 7).

Practice Exercises

Write a 1D kernel that fills an array with the square of each index (i.e., c[i] = i * i).
Modify the vector addition example to use a 2D grid/block configuration for a 1D array. Why might you do this?

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Implement the BLAS SAXPY operation: y = a*x + y, where a is a scalar and x, y are vectors. This is one of the most fundamental GPU operations. Vector size N = 8192. Each thread handles one element.