CUDA Memory Management

Chapter 3 — Global, shared, constant, and local memory

The GPU Memory Hierarchy: Latency and Bandwidth

Understanding memory is the single most important factor for writing fast CUDA code. The GPU has multiple memory types, each with different size, speed, and scope:

  • Registers — fastest, private to each thread (a few KB)
  • Local memory — per-thread, backed by DRAM (slow), used when registers spill
  • Shared memory — on-chip scratchpad shared by all threads in a block (~48–164 KB, very fast)
  • Global memory — large DRAM accessible by all threads (GBs, slowest)
  • Constant memory — read-only, cached, 64 KB
  • Texture memory — read-only with a specialized cache for spatial locality
Speed comparison

Registers: ~1 cycle | Shared memory: ~5 cycles | Global memory: ~200–800 cycles. That's a 100× difference! Optimizing memory access is critical.

Global Memory: cudaMalloc and Data Transfers

Global memory is the primary way to move data between host and device. You allocate with cudaMalloc and transfer with cudaMemcpy:

cuda
123456789101112131415161718192021222324
#include <stdio.h>

int main() {
    int N = 1024;
    float *h_data = (float*)malloc(N * sizeof(float));
    float *d_data;
    size_t bytes = N * sizeof(float);

    // Allocate on GPU
    cudaMalloc(&d_data, bytes);

    // Host → Device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Device → Host
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("Successfully allocated and copied %d bytes\n", (int)bytes);

    // Free GPU memory
    cudaFree(d_data);
    free(h_data);
    return 0;
}

Shared Memory: Fast Inter-Thread Communication

Shared memory is an on-chip scratchpad that threads within the same block can use to communicate and collaborate. It's declared with __shared__:

shared_example.cu
123456789101112131415161718192021222324252627282930313233343536373839404142
#include <stdio.h>

__global__ void sharedMemReverse(float *input, float *output, int n) {
    // Dynamically sized shared memory or static
    __shared__ float cache[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from global memory into shared memory
    if (gid < n) {
        cache[tid] = input[gid];
    }

    // Synchronize — ensure all threads have loaded their data
    __syncthreads();

    // Now every thread in the block can read any cache[] element
    // Example: reverse order within block
    if (gid < n) {
        output[gid] = cache[blockDim.x - 1 - tid];
    }
}

int main() {
    const int N = 256;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    sharedMemReverse<<<1, N>>>(d_in, d_out, N);
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_out[0] (should be 255): %f\n", h_out[0]);
    
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
Always sync before reading shared memory

If you read shared memory written by other threads without calling__syncthreads() first, you will get undefined (garbage) values.

Constant Memory: Read-Only Caching for Parameters

Use constant memory for small, read-only data that every thread reads (e.g., lookup tables, filter coefficients). It's cached and broadcast-optimized:

cuda
123456789101112131415161718192021222324252627282930313233
#include <stdio.h>

__constant__ float coefficients[8];  // stored in fast constant cache

__global__ void applyFilter(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= coefficients[i % 8];  // all threads broadcast same coeff
    }
}

int main() {
    // Simple box-filter weights
    float h_coeffs[8] = {0.5f, 0.75f, 1.0f, 0.75f, 0.5f, 0.25f, 0.125f, 0.0625f};
    cudaMemcpyToSymbol(coefficients, h_coeffs, sizeof(h_coeffs));

    const int N = 64;
    float h_data[N], h_out[N];
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    applyFilter<<<1, N>>>(d_data, N);

    cudaMemcpy(h_out, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_out[0]=%.4f (coeff[0]=0.5), h_out[2]=%.4f (coeff[2]=1.0)\n",
           h_out[0], h_out[2]);

    cudaFree(d_data);
    return 0;
}

Unified Memory (Managed Memory)

CUDA 6+ introduced Unified Memory, which simplifies programming by creating a single address space accessible from both CPU and GPU:

cuda
1234567891011121314
float *data;
cudaMallocManaged(&data, N * sizeof(float));

// Use on CPU
for (int i = 0; i < N; i++) data[i] = i;

// Use on GPU — no explicit copy needed!
myKernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();

// Read results on CPU — automatically migrated back
printf("result: %f\n", data[0]);

cudaFree(data);
When to use Unified Memory

Great for prototyping and simple programs. For maximum performance, explicitcudaMemcpy with pinned memory and streams gives you more control.

Pinned (Page-Locked) Host Memory for Faster Transfers

Regular malloc memory can be paged out by the OS, slowing transfers. Pinned memory stays in physical RAM and enables faster DMA transfers:

cuda
1234567
float *h_pinned;
cudaMallocHost(&h_pinned, bytes);  // pinned allocation

// Transfers with pinned memory are 2-3× faster
cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);

cudaFreeHost(h_pinned);

Summary

  • Global memory is large but slow — minimize accesses and use coalesced patterns
  • Shared memory is fast on-chip memory for intra-block communication — always sync before reading
  • Constant memory is ideal for small, read-only, broadcast data
  • Unified Memory simplifies coding but explicit management gives more performance control
  • Pinned memory on the host accelerates host↔device transfers

Common Mistakes in CUDA Memory Management

  • Memory Leakage: Forgetting to call cudaFree on device pointers.
  • Uncoalesced Access: Accessing global memory in a non-contiguous pattern (explained in Chapter 7).
  • Race Conditions: Multiple threads writing to the same shared memory location without synchronization.

Practice Exercises

  1. Try to implement a kernel that swaps two halves of an array using Shared Memory.
  2. Compare the performance of cudaMallocManaged (Unified Memory) vs. explicit cudaMalloc and cudaMemcpy for a large vector addition.

Further Reading

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Constant Memory Lookup Tablehard

Use CUDA constant memory to store a sine lookup table. Compute sin(x) for 1024 values using the lookup table instead of calling sin() directly. This demonstrates constant memory caching for read-only data accessed by all threads.

Sign in to track your progress across challenges.