CUDA Memory Management
Chapter 3 — Global, shared, constant, and local memory
The GPU Memory Hierarchy: Latency and Bandwidth
Understanding memory is the single most important factor for writing fast CUDA code. The GPU has multiple memory types, each with different size, speed, and scope:
- Registers — fastest, private to each thread (a few KB)
- Local memory — per-thread, backed by DRAM (slow), used when registers spill
- Shared memory — on-chip scratchpad shared by all threads in a block (~48–164 KB, very fast)
- Global memory — large DRAM accessible by all threads (GBs, slowest)
- Constant memory — read-only, cached, 64 KB
- Texture memory — read-only with a specialized cache for spatial locality
Registers: ~1 cycle | Shared memory: ~5 cycles | Global memory: ~200–800 cycles. That's a 100× difference! Optimizing memory access is critical.
Global Memory: cudaMalloc and Data Transfers
Global memory is the primary way to move data between host and device. You allocate with cudaMalloc and transfer with cudaMemcpy:
#include <stdio.h> int main() { int N = 1024; float *h_data = (float*)malloc(N * sizeof(float)); float *d_data; size_t bytes = N * sizeof(float); // Allocate on GPU cudaMalloc(&d_data, bytes); // Host → Device cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // Device → Host cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); printf("Successfully allocated and copied %d bytes\n", (int)bytes); // Free GPU memory cudaFree(d_data); free(h_data); return 0; }
Shared Memory: Fast Inter-Thread Communication
Shared memory is an on-chip scratchpad that threads within the same block can use to communicate and collaborate. It's declared with __shared__:
#include <stdio.h> __global__ void sharedMemReverse(float *input, float *output, int n) { // Dynamically sized shared memory or static __shared__ float cache[256]; int tid = threadIdx.x; int gid = blockIdx.x * blockDim.x + threadIdx.x; // Load from global memory into shared memory if (gid < n) { cache[tid] = input[gid]; } // Synchronize — ensure all threads have loaded their data __syncthreads(); // Now every thread in the block can read any cache[] element // Example: reverse order within block if (gid < n) { output[gid] = cache[blockDim.x - 1 - tid]; } } int main() { const int N = 256; float h_in[N], h_out[N]; for (int i = 0; i < N; i++) h_in[i] = i; float *d_in, *d_out; cudaMalloc(&d_in, N * sizeof(float)); cudaMalloc(&d_out, N * sizeof(float)); cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice); sharedMemReverse<<<1, N>>>(d_in, d_out, N); cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost); printf("h_out[0] (should be 255): %f\n", h_out[0]); cudaFree(d_in); cudaFree(d_out); return 0; }
If you read shared memory written by other threads without calling__syncthreads() first, you will get undefined (garbage) values.
Constant Memory: Read-Only Caching for Parameters
Use constant memory for small, read-only data that every thread reads (e.g., lookup tables, filter coefficients). It's cached and broadcast-optimized:
#include <stdio.h> __constant__ float coefficients[8]; // stored in fast constant cache __global__ void applyFilter(float *data, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { data[i] *= coefficients[i % 8]; // all threads broadcast same coeff } } int main() { // Simple box-filter weights float h_coeffs[8] = {0.5f, 0.75f, 1.0f, 0.75f, 0.5f, 0.25f, 0.125f, 0.0625f}; cudaMemcpyToSymbol(coefficients, h_coeffs, sizeof(h_coeffs)); const int N = 64; float h_data[N], h_out[N]; for (int i = 0; i < N; i++) h_data[i] = 1.0f; float *d_data; cudaMalloc(&d_data, N * sizeof(float)); cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice); applyFilter<<<1, N>>>(d_data, N); cudaMemcpy(h_out, d_data, N * sizeof(float), cudaMemcpyDeviceToHost); printf("h_out[0]=%.4f (coeff[0]=0.5), h_out[2]=%.4f (coeff[2]=1.0)\n", h_out[0], h_out[2]); cudaFree(d_data); return 0; }
Unified Memory (Managed Memory)
CUDA 6+ introduced Unified Memory, which simplifies programming by creating a single address space accessible from both CPU and GPU:
float *data; cudaMallocManaged(&data, N * sizeof(float)); // Use on CPU for (int i = 0; i < N; i++) data[i] = i; // Use on GPU — no explicit copy needed! myKernel<<<blocks, threads>>>(data, N); cudaDeviceSynchronize(); // Read results on CPU — automatically migrated back printf("result: %f\n", data[0]); cudaFree(data);
Great for prototyping and simple programs. For maximum performance, explicitcudaMemcpy with pinned memory and streams gives you more control.
Pinned (Page-Locked) Host Memory for Faster Transfers
Regular malloc memory can be paged out by the OS, slowing transfers. Pinned memory stays in physical RAM and enables faster DMA transfers:
float *h_pinned; cudaMallocHost(&h_pinned, bytes); // pinned allocation // Transfers with pinned memory are 2-3× faster cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice); cudaFreeHost(h_pinned);
Summary
- Global memory is large but slow — minimize accesses and use coalesced patterns
- Shared memory is fast on-chip memory for intra-block communication — always sync before reading
- Constant memory is ideal for small, read-only, broadcast data
- Unified Memory simplifies coding but explicit management gives more performance control
- Pinned memory on the host accelerates host↔device transfers
Mastering memory is key to performance. In the next chapter, we'll look atKernels & Executionto understand how to launch and configure your kernels optimally.
Common Mistakes in CUDA Memory Management
- Memory Leakage: Forgetting to call
cudaFreeon device pointers. - Uncoalesced Access: Accessing global memory in a non-contiguous pattern (explained in Chapter 7).
- Race Conditions: Multiple threads writing to the same shared memory location without synchronization.
Practice Exercises
- Try to implement a kernel that swaps two halves of an array using Shared Memory.
- Compare the performance of
cudaMallocManaged(Unified Memory) vs. explicitcudaMallocandcudaMemcpyfor a large vector addition.
Further Reading
Practice These Concepts
Reinforce what you just learned with hands-on GPU coding challenges.
Use CUDA constant memory to store a sine lookup table. Compute sin(x) for 1024 values using the lookup table instead of calling sin() directly. This demonstrates constant memory caching for read-only data accessed by all threads.
Sign in to track your progress across challenges.