What is a CUDA stream?

A CUDA stream is a sequence of operations that execute in issue order on the GPU. Operations in different streams can run concurrently and overlap.

How do you overlap data transfer and kernel execution in CUDA?

By using multiple non-default streams and asynchronous memory copies (cudaMemcpyAsync) with pinned host memory, you can overlap the transfer of one data chunk with the computation of another.

CUDA Streams & Concurrency

Chapter 6 — Overlapping computation with data transfers

Introduction to CUDA Streams: Asynchronous Execution

A CUDA stream is a sequence of operations (kernel launches, memory copies) that execute in order. Operations in different streams can overlap and run concurrently, enabling massive throughput gains.

By default, all operations go into stream 0 (the default stream), which is synchronous with respect to the host.

Why streams matter

Without streams, the GPU sits idle while data is being transferred. With streams, you can overlap kernel execution on one chunk of data with the transfer of the next chunk — utilizing the full bandwidth of both the compute engine and the copy engine simultaneously.

Creating and Using Multiple CUDA Streams

streams_basic.cu
1234567891011121314151617181920
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Async copy and kernel in stream1
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
streamKernel<<<4, 256, 0, stream1>>>(d_a, N);
cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream1);

// Async copy and kernel in stream2
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream2);
streamKernel<<<4, 256, 0, stream2>>>(d_b, N);
cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, stream2);

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Pinned memory required

cudaMemcpyAsync only achieves true asynchronous transfer withpinned (page-locked) host memory. Use cudaMallocHostinstead of malloc for the host buffers.

Stream Concurrency: Overlapping Data Transfers and Kernels

The most common stream pattern is to split data into chunks and process them in a pipeline:

pipeline.cu
123456789101112131415161718192021222324252627282930313233343536373839
const int nStreams = 4;
const int chunkSize = N / nStreams;
cudaStream_t streams[nStreams];

// Allocate pinned host memory
float *h_data;
cudaMallocHost(&h_data, N * sizeof(float));

for (int i = 0; i < nStreams; i++) {
    cudaStreamCreate(&streams[i]);
}

for (int i = 0; i < nStreams; i++) {
    int offset = i * chunkSize;
    size_t chunkBytes = chunkSize * sizeof(float);

    // 1. Copy chunk to GPU
    cudaMemcpyAsync(&d_data[offset], &h_data[offset],
                    chunkBytes, cudaMemcpyHostToDevice,
                    streams[i]);

    // 2. Process chunk
    int blocks = (chunkSize + 255) / 256;
    process<<<blocks, 256, 0, streams[i]>>>(
        &d_data[offset], chunkSize);

    // 3. Copy result back
    cudaMemcpyAsync(&h_data[offset], &d_data[offset],
                    chunkBytes, cudaMemcpyDeviceToHost,
                    streams[i]);
}

// Wait for all streams
cudaDeviceSynchronize();

for (int i = 0; i < nStreams; i++) {
    cudaStreamDestroy(streams[i]);
}
cudaFreeHost(h_data);

Synchronizing CUDA Streams: events and Waiters

Events let you measure timing and create inter-stream dependencies:

events.cu
1234567891011121314151617
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
myKernel<<<blocks, threads, 0, stream>>>(args);
cudaEventRecord(stop, stream);

// Wait for the stop event
cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel took %.3f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Inter-Stream Dependencies

You can make one stream wait for an event recorded in another:

cuda
12345678910
cudaEvent_t event;
cudaEventCreate(&event);

// Record event after kernel in stream1
kernelA<<<..., stream1>>>(args);
cudaEventRecord(event, stream1);

// Make stream2 wait for the event
cudaStreamWaitEvent(stream2, event, 0);
kernelB<<<..., stream2>>>(args);  // runs only after kernelA

Complete Example: Measuring Stream Overlap

This program processes data in chunks using 4 streams and measures the wall-clock time — compare it with cudaDeviceSynchronize() after each step to see the speedup from overlap:

streams_overlap.cu
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455
#include <stdio.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N       = 1 << 20;   // 1M floats total
    const int NSTREAM = 4;
    const int CHUNK   = N / NSTREAM;

    // Pinned host memory — required for async transfers
    float *h_data;
    cudaMallocHost(&h_data, N * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t streams[NSTREAM];
    for (int s = 0; s < NSTREAM; s++) cudaStreamCreate(&streams[s]);

    cudaEvent_t tStart, tStop;
    cudaEventCreate(&tStart);
    cudaEventCreate(&tStop);

    // ----- pipelined (overlapped) run -----
    cudaEventRecord(tStart);
    for (int s = 0; s < NSTREAM; s++) {
        int off   = s * CHUNK;
        size_t nb = CHUNK * sizeof(float);
        cudaMemcpyAsync(d_data + off, h_data + off, nb,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK+255)/256, 256, 0, streams[s]>>>(d_data + off, 2.0f, CHUNK);
        cudaMemcpyAsync(h_data + off, d_data + off, nb,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    cudaEventRecord(tStop);
    cudaEventSynchronize(tStop);

    float ms;
    cudaEventElapsedTime(&ms, tStart, tStop);
    printf("Pipelined (4 streams): %.3f ms\n", ms);
    printf("h_data[0]=%.1f  h_data[N-1]=%.1f  (expect 0.0, %.1f)\n",
           h_data[0], h_data[N-1], (float)(N-1)*2.0f);

    for (int s = 0; s < NSTREAM; s++) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    cudaEventDestroy(tStart);
    cudaEventDestroy(tStop);
    return 0;
}

Summary

Streams enable concurrent execution of kernels and memory transfers on different data chunks
Use cudaMemcpyAsync with pinned host memory for true async transfers
The chunked pipeline pattern overlaps copy↔compute for maximum throughput
CUDA Events provide precise GPU timing and inter-stream synchronization

Concurrency is the ultimate performance tool. In our final chapter,Performance Optimization, we'll pull everything together to squeeze every last flop out of your GPU.

Common Mistakes in CUDA Streams

Using Pageable Memory: Asynchronous copies revert to synchronous behavior if host memory isn't pinned.
Implicit Synchronization: Certain commands (like cudaMemset on some hardware) can implicitly sync all streams.
Race Conditions across Streams: Updating the same global memory from two different streams without careful event-based synchronization.

Practice Exercises

Modify the pipeline example to use 8 streams instead of 4. Does the performance improve? Why or why not?
Implement a timing script that uses CUDA Events to measure exactly how much time is saved by overlapping transfers.

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Compute the dot product of two vectors on the GPU. The dot product is the sum of element-wise products: result = sum(a[i] * b[i]). Use parallel reduction with shared memory for efficiency. Array size N = 4096.