CUDA Streams & Concurrency
Chapter 6 — Overlapping computation with data transfers
Introduction to CUDA Streams: Asynchronous Execution
A CUDA stream is a sequence of operations (kernel launches, memory copies) that execute in order. Operations in different streams can overlap and run concurrently, enabling massive throughput gains.
By default, all operations go into stream 0 (the default stream), which is synchronous with respect to the host.
Without streams, the GPU sits idle while data is being transferred. With streams, you can overlap kernel execution on one chunk of data with the transfer of the next chunk — utilizing the full bandwidth of both the compute engine and the copy engine simultaneously.
Creating and Using Multiple CUDA Streams
cudaStream_t stream1, stream2; cudaStreamCreate(&stream1); cudaStreamCreate(&stream2); // Async copy and kernel in stream1 cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1); streamKernel<<<4, 256, 0, stream1>>>(d_a, N); cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream1); // Async copy and kernel in stream2 cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream2); streamKernel<<<4, 256, 0, stream2>>>(d_b, N); cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, stream2); cudaStreamSynchronize(stream1); cudaStreamSynchronize(stream2); // Cleanup cudaStreamDestroy(stream1); cudaStreamDestroy(stream2);
cudaMemcpyAsync only achieves true asynchronous transfer withpinned (page-locked) host memory. Use cudaMallocHostinstead of malloc for the host buffers.
Stream Concurrency: Overlapping Data Transfers and Kernels
The most common stream pattern is to split data into chunks and process them in a pipeline:
const int nStreams = 4; const int chunkSize = N / nStreams; cudaStream_t streams[nStreams]; // Allocate pinned host memory float *h_data; cudaMallocHost(&h_data, N * sizeof(float)); for (int i = 0; i < nStreams; i++) { cudaStreamCreate(&streams[i]); } for (int i = 0; i < nStreams; i++) { int offset = i * chunkSize; size_t chunkBytes = chunkSize * sizeof(float); // 1. Copy chunk to GPU cudaMemcpyAsync(&d_data[offset], &h_data[offset], chunkBytes, cudaMemcpyHostToDevice, streams[i]); // 2. Process chunk int blocks = (chunkSize + 255) / 256; process<<<blocks, 256, 0, streams[i]>>>( &d_data[offset], chunkSize); // 3. Copy result back cudaMemcpyAsync(&h_data[offset], &d_data[offset], chunkBytes, cudaMemcpyDeviceToHost, streams[i]); } // Wait for all streams cudaDeviceSynchronize(); for (int i = 0; i < nStreams; i++) { cudaStreamDestroy(streams[i]); } cudaFreeHost(h_data);
Synchronizing CUDA Streams: events and Waiters
Events let you measure timing and create inter-stream dependencies:
cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, stream); myKernel<<<blocks, threads, 0, stream>>>(args); cudaEventRecord(stop, stream); // Wait for the stop event cudaEventSynchronize(stop); float milliseconds = 0; cudaEventElapsedTime(&milliseconds, start, stop); printf("Kernel took %.3f ms\n", milliseconds); cudaEventDestroy(start); cudaEventDestroy(stop);
Inter-Stream Dependencies
You can make one stream wait for an event recorded in another:
cudaEvent_t event; cudaEventCreate(&event); // Record event after kernel in stream1 kernelA<<<..., stream1>>>(args); cudaEventRecord(event, stream1); // Make stream2 wait for the event cudaStreamWaitEvent(stream2, event, 0); kernelB<<<..., stream2>>>(args); // runs only after kernelA
Complete Example: Measuring Stream Overlap
This program processes data in chunks using 4 streams and measures the wall-clock time — compare it with cudaDeviceSynchronize() after each step to see the speedup from overlap:
#include <stdio.h> __global__ void scale(float *data, float factor, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) data[i] *= factor; } int main() { const int N = 1 << 20; // 1M floats total const int NSTREAM = 4; const int CHUNK = N / NSTREAM; // Pinned host memory — required for async transfers float *h_data; cudaMallocHost(&h_data, N * sizeof(float)); for (int i = 0; i < N; i++) h_data[i] = (float)i; float *d_data; cudaMalloc(&d_data, N * sizeof(float)); cudaStream_t streams[NSTREAM]; for (int s = 0; s < NSTREAM; s++) cudaStreamCreate(&streams[s]); cudaEvent_t tStart, tStop; cudaEventCreate(&tStart); cudaEventCreate(&tStop); // ----- pipelined (overlapped) run ----- cudaEventRecord(tStart); for (int s = 0; s < NSTREAM; s++) { int off = s * CHUNK; size_t nb = CHUNK * sizeof(float); cudaMemcpyAsync(d_data + off, h_data + off, nb, cudaMemcpyHostToDevice, streams[s]); scale<<<(CHUNK+255)/256, 256, 0, streams[s]>>>(d_data + off, 2.0f, CHUNK); cudaMemcpyAsync(h_data + off, d_data + off, nb, cudaMemcpyDeviceToHost, streams[s]); } cudaDeviceSynchronize(); cudaEventRecord(tStop); cudaEventSynchronize(tStop); float ms; cudaEventElapsedTime(&ms, tStart, tStop); printf("Pipelined (4 streams): %.3f ms\n", ms); printf("h_data[0]=%.1f h_data[N-1]=%.1f (expect 0.0, %.1f)\n", h_data[0], h_data[N-1], (float)(N-1)*2.0f); for (int s = 0; s < NSTREAM; s++) cudaStreamDestroy(streams[s]); cudaFree(d_data); cudaFreeHost(h_data); cudaEventDestroy(tStart); cudaEventDestroy(tStop); return 0; }
Summary
- Streams enable concurrent execution of kernels and memory transfers on different data chunks
- Use
cudaMemcpyAsyncwith pinned host memory for true async transfers - The chunked pipeline pattern overlaps copy↔compute for maximum throughput
- CUDA Events provide precise GPU timing and inter-stream synchronization
Concurrency is the ultimate performance tool. In our final chapter,Performance Optimization, we'll pull everything together to squeeze every last flop out of your GPU.
Common Mistakes in CUDA Streams
- Using Pageable Memory: Asynchronous copies revert to synchronous behavior if host memory isn't pinned.
- Implicit Synchronization: Certain commands (like
cudaMemseton some hardware) can implicitly sync all streams. - Race Conditions across Streams: Updating the same global memory from two different streams without careful event-based synchronization.
Practice Exercises
- Modify the pipeline example to use 8 streams instead of 4. Does the performance improve? Why or why not?
- Implement a timing script that uses CUDA Events to measure exactly how much time is saved by overlapping transfers.
Further Reading
Practice These Concepts
Reinforce what you just learned with hands-on GPU coding challenges.
Compute the dot product of two vectors on the GPU. The dot product is the sum of element-wise products: result = sum(a[i] * b[i]). Use parallel reduction with shared memory for efficiency. Array size N = 4096.
Sign in to track your progress across challenges.