What does __syncthreads() do in CUDA?

__syncthreads() acts as a barrier synchronization point where every thread in the block must wait until all threads in that same block have reached it before any can proceed.

Are atomic operations in CUDA slow?

Atomic operations can be slower than normal writes because they serialize access to memory locations to prevent race conditions. Performance depends on the level of contention among threads.

CUDA Synchronization

Chapter 5 — Thread synchronization, barriers, and atomic operations

Why Synchronization Matters

When thousands of threads run in parallel, you often need to coordinate them. Without proper synchronization, threads can read stale data or produce non-deterministic results (race conditions).

Intra-Block Synchronization: Using __syncthreads()

__syncthreads() is a barrier — all threads in a block must reach it before any thread can proceed past it. Use it whenever threads in the same block need to share data via shared memory.

parallel_reduction.cu
12345678910111213141516171819202122232425262728293031323334353637383940414243
#include <stdio.h>

__global__ void blockSum(float *input, float *output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load into shared memory
    sdata[tid] = (gid < n) ? input[gid] : 0.0f;
    __syncthreads();

    // Parallel reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();  // sync after each step!
    }

    // Thread 0 writes the block's sum
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

int main() {
    const int N = 256;
    float h_in[N], h_out;
    for (int i = 0; i < N; i++) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));

    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<1, N>>>(d_in, d_out, N);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    printf("Sum of 256 ones (should be 256): %f\n", h_out);
    
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Deadlock danger

Never place __syncthreads() inside a conditional branch where some threads in the block won't reach it. All threads in the block must execute the same __syncthreads() call, or the program will deadlock.

Warp-Level Execution and Lockstep Synchronization

A warp is a group of 32 threads that execute instructions in lockstep (SIMT — Single Instruction, Multiple Threads). Threads within the same warp are inherently synchronized — they execute the same instruction at the same time.

Modern CUDA (compute capability 7.0+) provides warp-level primitives:

cuda
1234567891011121314151617181920
#include <stdio.h>

__global__ void warpShuffleExample() {
    int val = threadIdx.x;
    
    // Sum reduction within a warp (no shared memory!)
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    
    if (threadIdx.x == 0) {
        printf("Warp sum of 0..31: %d\n", val);
    }
}

int main() {
    warpShuffleExample<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}

The mask parameter 0xFFFFFFFF

The first argument to __shfl_*_sync is a 32-bit mask indicating which threads participate. 0xFFFFFFFF means all 32 threads in the warp.

CUDA Atomics: Safe Multi-Thread Writes to Memory

When multiple threads need to update the same memory location, use atomic operations to prevent race conditions:

atomics.cu
123456789101112131415161718192021222324252627282930
#include <stdio.h>

__global__ void atomicHistogram(int *data, int *bins, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        int val = data[gid];
        atomicAdd(&bins[val], 1);  // safe concurrent increment
    }
}

int main() {
    const int N = 1000;
    int h_data[N], h_bins[10] = {0};
    for (int i = 0; i < N; i++) h_data[i] = i % 10;

    int *d_data, *d_bins;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMalloc(&d_bins, 10 * sizeof(int));

    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, 10 * sizeof(int));

    atomicHistogram<<<4, 256>>>(d_data, d_bins, N);

    cudaMemcpy(h_bins, d_bins, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("Bin 0 count (should be 100): %d\n", h_bins[0]);

    cudaFree(d_data); cudaFree(d_bins);
    return 0;
}

Atomics are slow

Atomic operations serialize access, creating contention. To build a fast histogram, first accumulate in shared memory per-block, then use one atomic per block to update global memory.

Using CUDA Cooperative Groups for Flexible Synchronization

CUDA 9+ introduced Cooperative Groups — a flexible API for synchronization at various granularities:

cuda
12345678910111213141516
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    // Get the thread block group
    cg::thread_block block = cg::this_thread_block();
    block.sync();  // equiv to __syncthreads()

    // Get this thread's warp
    cg::coalesced_group active = cg::coalesced_threads();

    // Get a tile of 16 threads within a warp
    cg::thread_block_tile<16> tile =
        cg::tiled_partition<16>(block);
    tile.sync();
}

Summary

__syncthreads() synchronizes all threads within a block — required when sharing data via shared memory
Warps (32 threads) execute in lockstep; warp shuffle functions exchange data without shared memory
Atomic operations (atomicAdd, atomicCAS, etc.) prevent race conditions on shared data
Cooperative Groups provide modern, flexible synchronization primitives

Coordination is critical for correctness. Now that you can sync threads, exploreStreams & Concurrencyto overlap computation and data transfers for maximum efficiency.

Common Mistakes in CUDA Synchronization

Conditional Deadlock: Calling __syncthreads() inside an if block that only some threads enter.
Forgetting to Sync: Reading shared memory that was written by another thread without a barrier first.
Atomic Contention: Having 10,000 threads all atomicAdd to the same global counter simultaneously.

Practice Exercises

Implement a simple "sliding window" average filter using shared memory and __syncthreads().
Rewrite a global memory atomic addition to use shared memory for local accumulation within each block first.

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Compute a histogram of 256 bins for an array of 10,000 random byte values (0-255). Use atomic operations to avoid race conditions when multiple threads update the same bin.