CUDA Kernels & Execution

Chapter 4 — Writing, launching, and debugging CUDA kernels

CUDA Kernel Launch Syntax: The <<< >>> Triple Angle Brackets

When you call a kernel with the <<<>>> syntax, you specify the execution configuration — how many blocks and threads to use:

cuda
12345678
// Syntax:
kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args...);

// Most common usage:
kernel<<<numBlocks, threadsPerBlock>>>(args...);

// With dynamic shared memory:
kernel<<<numBlocks, threadsPerBlock, sharedSize>>>(args...);

The first two arguments are required. The third (dynamic shared memory size) and fourth (CUDA stream) are optional.

Optimal Launch Configurations: Blocks, Grids, and Warps

The execution configuration directly impacts performance:

launch_config.cu
1234567891011121314151617181920212223
#include <stdio.h>

__global__ void myKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i * 2;
}

int main() {
    int N = 10000;
    int *d_data;
    cudaMalloc(&d_data, N * sizeof(int));

    int threadsPerBlock = 256;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    myKernel<<<numBlocks, threadsPerBlock>>>(d_data, N);
    cudaDeviceSynchronize();

    printf("Launched %d blocks with %d threads each\n", numBlocks, threadsPerBlock);
    
    cudaFree(d_data);
    return 0;
}
The rounding formula

(N + T - 1) / T is the standard way to compute the number of blocks. For N=1000 and T=256, this gives 4 blocks = 1024 threads (24 threads idle, but the boundary check if (i < N) prevents invalid access).

CUDA Error Handling: cudaGetLastError and cudaDeviceSynchronize

CUDA functions return error codes, but kernel launches do not return errors immediately. You must check explicitly:

error_handling.cu
12345678910111213141516171819202122232425262728293031323334
#include <stdio.h>

__global__ void myKernel(int *data) {
    // Intentional out of bounds or error could be tested here
}

int main() {
    int *d_data;
    size_t bytes = 1024;

    // Check CUDA API calls
    cudaError_t err = cudaMalloc(&d_data, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Check kernel launch errors
    myKernel<<<1, 1>>>(d_data);

    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
    }

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
    }
    
    printf("Error checking complete. Status: %s\n", cudaGetErrorString(err));
    cudaFree(d_data);
    return 0;
}

A common pattern is a helper macro:

cuda
123456789101112
#define CUDA_CHECK(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while(0)

// Usage:
CUDA_CHECK(cudaMalloc(&d_data, bytes));
CUDA_CHECK(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice));

Kernel Limitations

Keep these constraints in mind when writing kernels:

  • No recursion on older architectures (compute capability < 2.0); limited stack on newer ones
  • No dynamic memory allocation inside kernels (except malloc on newer GPUs with limited heap)
  • printf works from kernels but uses a limited buffer — great for debugging, not for production
  • Kernel calls are asynchronous — the CPU continues immediately after launch

How to Query GPU Properties with cudaGetDeviceProperties

You can query GPU capabilities at runtime. This is useful for choosing optimal kernel parameters:

device_query.cu
1234567891011121314151617181920212223
#include <stdio.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA device(s)\n\n", deviceCount);

    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  SM count           : %d\n", prop.multiProcessorCount);
        printf("  Max threads/block  : %d\n", prop.maxThreadsPerBlock);
        printf("  Shared memory/block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Global memory      : %.1f GB\n", prop.totalGlobalMem / 1.0e9);
        printf("  Memory bandwidth   : %.1f GB/s\n",
               2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
        printf("\n");
    }
    return 0;
}

Summary

  • Use <<<gridDim, blockDim>>> to configure kernel launches
  • Always round up block count with (N + T - 1) / T and use boundary checks
  • Check errors with cudaGetLastError() and cudaDeviceSynchronize()
  • Kernel launches are asynchronous — the CPU doesn't wait unless you synchronize
  • Query device properties to write adaptive, portable code

Common Mistakes in CUDA Kernel Execution

  • Forgetting cudaDeviceSynchronize(): The CPU may try to read data before the GPU is finished writing it.
  • Launch Config Errors: Launching more blocks/threads than the hardware supports (e.g., >1024 threads per block).
  • Silent Failures: Not checking cudaGetLastError() leads to hard-to-debug "nothing happened" scenarios.

Practice Exercises

  1. Implement the CUDA_CHECK macro in a small script and intentionally trigger an error (e.g., allocate too much memory).
  2. Write a script that queries your GPU and prints only the Compute Capability and Max Threads per SM.

Further Reading

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Matrix Multiplicationmedium

Implement a simple 2D matrix multiplication kernel. Multiply two NxN matrices where N = 64. Each thread computes one output element.

Sign in to track your progress across challenges.