CUDA Synchronization

Chapter 5 — Thread synchronization, barriers, and atomic operations

Why Synchronization Matters

When thousands of threads run in parallel, you often need to coordinate them. Without proper synchronization, threads can read stale data or produce non-deterministic results (race conditions).

Intra-Block Synchronization: Using __syncthreads()

__syncthreads() is a barrier — all threads in a block must reach it before any thread can proceed past it. Use it whenever threads in the same block need to share data via shared memory.

parallel_reduction.cu
12345678910111213141516171819202122232425262728293031323334353637383940414243
#include <stdio.h>

__global__ void blockSum(float *input, float *output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load into shared memory
    sdata[tid] = (gid < n) ? input[gid] : 0.0f;
    __syncthreads();

    // Parallel reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();  // sync after each step!
    }

    // Thread 0 writes the block's sum
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

int main() {
    const int N = 256;
    float h_in[N], h_out;
    for (int i = 0; i < N; i++) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));

    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<1, N>>>(d_in, d_out, N);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    printf("Sum of 256 ones (should be 256): %f\n", h_out);
    
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
Deadlock danger

Never place __syncthreads() inside a conditional branch where some threads in the block won't reach it. All threads in the block must execute the same __syncthreads() call, or the program will deadlock.

Warp-Level Execution and Lockstep Synchronization

A warp is a group of 32 threads that execute instructions in lockstep (SIMT — Single Instruction, Multiple Threads). Threads within the same warp are inherently synchronized — they execute the same instruction at the same time.

Modern CUDA (compute capability 7.0+) provides warp-level primitives:

cuda
1234567891011121314151617181920
#include <stdio.h>

__global__ void warpShuffleExample() {
    int val = threadIdx.x;
    
    // Sum reduction within a warp (no shared memory!)
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    
    if (threadIdx.x == 0) {
        printf("Warp sum of 0..31: %d\n", val);
    }
}

int main() {
    warpShuffleExample<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
The mask parameter 0xFFFFFFFF

The first argument to __shfl_*_sync is a 32-bit mask indicating which threads participate. 0xFFFFFFFF means all 32 threads in the warp.

CUDA Atomics: Safe Multi-Thread Writes to Memory

When multiple threads need to update the same memory location, use atomic operations to prevent race conditions:

atomics.cu
123456789101112131415161718192021222324252627282930
#include <stdio.h>

__global__ void atomicHistogram(int *data, int *bins, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        int val = data[gid];
        atomicAdd(&bins[val], 1);  // safe concurrent increment
    }
}

int main() {
    const int N = 1000;
    int h_data[N], h_bins[10] = {0};
    for (int i = 0; i < N; i++) h_data[i] = i % 10;

    int *d_data, *d_bins;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMalloc(&d_bins, 10 * sizeof(int));

    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, 10 * sizeof(int));

    atomicHistogram<<<4, 256>>>(d_data, d_bins, N);

    cudaMemcpy(h_bins, d_bins, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("Bin 0 count (should be 100): %d\n", h_bins[0]);

    cudaFree(d_data); cudaFree(d_bins);
    return 0;
}
Atomics are slow

Atomic operations serialize access, creating contention. To build a fast histogram, first accumulate in shared memory per-block, then use one atomic per block to update global memory.

Using CUDA Cooperative Groups for Flexible Synchronization

CUDA 9+ introduced Cooperative Groups — a flexible API for synchronization at various granularities:

cuda
12345678910111213141516
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    // Get the thread block group
    cg::thread_block block = cg::this_thread_block();
    block.sync();  // equiv to __syncthreads()

    // Get this thread's warp
    cg::coalesced_group active = cg::coalesced_threads();

    // Get a tile of 16 threads within a warp
    cg::thread_block_tile<16> tile =
        cg::tiled_partition<16>(block);
    tile.sync();
}

Summary

  • __syncthreads() synchronizes all threads within a block — required when sharing data via shared memory
  • Warps (32 threads) execute in lockstep; warp shuffle functions exchange data without shared memory
  • Atomic operations (atomicAdd, atomicCAS, etc.) prevent race conditions on shared data
  • Cooperative Groups provide modern, flexible synchronization primitives

Common Mistakes in CUDA Synchronization

  • Conditional Deadlock: Calling __syncthreads() inside an if block that only some threads enter.
  • Forgetting to Sync: Reading shared memory that was written by another thread without a barrier first.
  • Atomic Contention: Having 10,000 threads all atomicAdd to the same global counter simultaneously.

Practice Exercises

  1. Implement a simple "sliding window" average filter using shared memory and __syncthreads().
  2. Rewrite a global memory atomic addition to use shared memory for local accumulation within each block first.

Further Reading

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Parallel Histogram Computationmedium

Compute a histogram of 256 bins for an array of 10,000 random byte values (0-255). Use atomic operations to avoid race conditions when multiple threads update the same bin.

Sign in to track your progress across challenges.