CUDA Streams & Concurrency

Chapter 6 — Overlapping computation with data transfers

Introduction to CUDA Streams: Asynchronous Execution

A CUDA stream is a sequence of operations (kernel launches, memory copies) that execute in order. Operations in different streams can overlap and run concurrently, enabling massive throughput gains.

By default, all operations go into stream 0 (the default stream), which is synchronous with respect to the host.

Why streams matter

Without streams, the GPU sits idle while data is being transferred. With streams, you can overlap kernel execution on one chunk of data with the transfer of the next chunk — utilizing the full bandwidth of both the compute engine and the copy engine simultaneously.

Creating and Using Multiple CUDA Streams

streams_basic.cu
1234567891011121314151617181920
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Async copy and kernel in stream1
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
streamKernel<<<4, 256, 0, stream1>>>(d_a, N);
cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream1);

// Async copy and kernel in stream2
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream2);
streamKernel<<<4, 256, 0, stream2>>>(d_b, N);
cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, stream2);

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
Pinned memory required

cudaMemcpyAsync only achieves true asynchronous transfer withpinned (page-locked) host memory. Use cudaMallocHostinstead of malloc for the host buffers.

Stream Concurrency: Overlapping Data Transfers and Kernels

The most common stream pattern is to split data into chunks and process them in a pipeline:

pipeline.cu
123456789101112131415161718192021222324252627282930313233343536373839
const int nStreams = 4;
const int chunkSize = N / nStreams;
cudaStream_t streams[nStreams];

// Allocate pinned host memory
float *h_data;
cudaMallocHost(&h_data, N * sizeof(float));

for (int i = 0; i < nStreams; i++) {
    cudaStreamCreate(&streams[i]);
}

for (int i = 0; i < nStreams; i++) {
    int offset = i * chunkSize;
    size_t chunkBytes = chunkSize * sizeof(float);

    // 1. Copy chunk to GPU
    cudaMemcpyAsync(&d_data[offset], &h_data[offset],
                    chunkBytes, cudaMemcpyHostToDevice,
                    streams[i]);

    // 2. Process chunk
    int blocks = (chunkSize + 255) / 256;
    process<<<blocks, 256, 0, streams[i]>>>(
        &d_data[offset], chunkSize);

    // 3. Copy result back
    cudaMemcpyAsync(&h_data[offset], &d_data[offset],
                    chunkBytes, cudaMemcpyDeviceToHost,
                    streams[i]);
}

// Wait for all streams
cudaDeviceSynchronize();

for (int i = 0; i < nStreams; i++) {
    cudaStreamDestroy(streams[i]);
}
cudaFreeHost(h_data);

Synchronizing CUDA Streams: events and Waiters

Events let you measure timing and create inter-stream dependencies:

events.cu
1234567891011121314151617
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
myKernel<<<blocks, threads, 0, stream>>>(args);
cudaEventRecord(stop, stream);

// Wait for the stop event
cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel took %.3f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Inter-Stream Dependencies

You can make one stream wait for an event recorded in another:

cuda
12345678910
cudaEvent_t event;
cudaEventCreate(&event);

// Record event after kernel in stream1
kernelA<<<..., stream1>>>(args);
cudaEventRecord(event, stream1);

// Make stream2 wait for the event
cudaStreamWaitEvent(stream2, event, 0);
kernelB<<<..., stream2>>>(args);  // runs only after kernelA

Complete Example: Measuring Stream Overlap

This program processes data in chunks using 4 streams and measures the wall-clock time — compare it with cudaDeviceSynchronize() after each step to see the speedup from overlap:

streams_overlap.cu
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455
#include <stdio.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N       = 1 << 20;   // 1M floats total
    const int NSTREAM = 4;
    const int CHUNK   = N / NSTREAM;

    // Pinned host memory — required for async transfers
    float *h_data;
    cudaMallocHost(&h_data, N * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t streams[NSTREAM];
    for (int s = 0; s < NSTREAM; s++) cudaStreamCreate(&streams[s]);

    cudaEvent_t tStart, tStop;
    cudaEventCreate(&tStart);
    cudaEventCreate(&tStop);

    // ----- pipelined (overlapped) run -----
    cudaEventRecord(tStart);
    for (int s = 0; s < NSTREAM; s++) {
        int off   = s * CHUNK;
        size_t nb = CHUNK * sizeof(float);
        cudaMemcpyAsync(d_data + off, h_data + off, nb,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK+255)/256, 256, 0, streams[s]>>>(d_data + off, 2.0f, CHUNK);
        cudaMemcpyAsync(h_data + off, d_data + off, nb,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    cudaEventRecord(tStop);
    cudaEventSynchronize(tStop);

    float ms;
    cudaEventElapsedTime(&ms, tStart, tStop);
    printf("Pipelined (4 streams): %.3f ms\n", ms);
    printf("h_data[0]=%.1f  h_data[N-1]=%.1f  (expect 0.0, %.1f)\n",
           h_data[0], h_data[N-1], (float)(N-1)*2.0f);

    for (int s = 0; s < NSTREAM; s++) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    cudaEventDestroy(tStart);
    cudaEventDestroy(tStop);
    return 0;
}

Summary

  • Streams enable concurrent execution of kernels and memory transfers on different data chunks
  • Use cudaMemcpyAsync with pinned host memory for true async transfers
  • The chunked pipeline pattern overlaps copy↔compute for maximum throughput
  • CUDA Events provide precise GPU timing and inter-stream synchronization

Common Mistakes in CUDA Streams

  • Using Pageable Memory: Asynchronous copies revert to synchronous behavior if host memory isn't pinned.
  • Implicit Synchronization: Certain commands (like cudaMemset on some hardware) can implicitly sync all streams.
  • Race Conditions across Streams: Updating the same global memory from two different streams without careful event-based synchronization.

Practice Exercises

  1. Modify the pipeline example to use 8 streams instead of 4. Does the performance improve? Why or why not?
  2. Implement a timing script that uses CUDA Events to measure exactly how much time is saved by overlapping transfers.

Further Reading

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Dot Productmedium

Compute the dot product of two vectors on the GPU. The dot product is the sum of element-wise products: result = sum(a[i] * b[i]). Use parallel reduction with shared memory for efficiency. Array size N = 4096.

Sign in to track your progress across challenges.