CUDA Kernels & Execution
Chapter 4 — Writing, launching, and debugging CUDA kernels
CUDA Kernel Launch Syntax: The <<< >>> Triple Angle Brackets
When you call a kernel with the <<<>>> syntax, you specify the execution configuration — how many blocks and threads to use:
// Syntax: kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args...); // Most common usage: kernel<<<numBlocks, threadsPerBlock>>>(args...); // With dynamic shared memory: kernel<<<numBlocks, threadsPerBlock, sharedSize>>>(args...);
The first two arguments are required. The third (dynamic shared memory size) and fourth (CUDA stream) are optional.
Optimal Launch Configurations: Blocks, Grids, and Warps
The execution configuration directly impacts performance:
#include <stdio.h> __global__ void myKernel(int *data, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) data[i] = i * 2; } int main() { int N = 10000; int *d_data; cudaMalloc(&d_data, N * sizeof(int)); int threadsPerBlock = 256; int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock; myKernel<<<numBlocks, threadsPerBlock>>>(d_data, N); cudaDeviceSynchronize(); printf("Launched %d blocks with %d threads each\n", numBlocks, threadsPerBlock); cudaFree(d_data); return 0; }
(N + T - 1) / T is the standard way to compute the number of blocks. For N=1000 and T=256, this gives 4 blocks = 1024 threads (24 threads idle, but the boundary check if (i < N) prevents invalid access).
CUDA Error Handling: cudaGetLastError and cudaDeviceSynchronize
CUDA functions return error codes, but kernel launches do not return errors immediately. You must check explicitly:
#include <stdio.h> __global__ void myKernel(int *data) { // Intentional out of bounds or error could be tested here } int main() { int *d_data; size_t bytes = 1024; // Check CUDA API calls cudaError_t err = cudaMalloc(&d_data, bytes); if (err != cudaSuccess) { fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err)); return 1; } // Check kernel launch errors myKernel<<<1, 1>>>(d_data); err = cudaGetLastError(); if (err != cudaSuccess) { fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err)); } err = cudaDeviceSynchronize(); if (err != cudaSuccess) { fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err)); } printf("Error checking complete. Status: %s\n", cudaGetErrorString(err)); cudaFree(d_data); return 0; }
A common pattern is a helper macro:
#define CUDA_CHECK(call) do { \ cudaError_t err = call; \ if (err != cudaSuccess) { \ fprintf(stderr, "CUDA error at %s:%d: %s\n", \ __FILE__, __LINE__, cudaGetErrorString(err)); \ exit(1); \ } \ } while(0) // Usage: CUDA_CHECK(cudaMalloc(&d_data, bytes)); CUDA_CHECK(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice));
Kernel Limitations
Keep these constraints in mind when writing kernels:
- No recursion on older architectures (compute capability < 2.0); limited stack on newer ones
- No dynamic memory allocation inside kernels (except
mallocon newer GPUs with limited heap) printfworks from kernels but uses a limited buffer — great for debugging, not for production- Kernel calls are asynchronous — the CPU continues immediately after launch
How to Query GPU Properties with cudaGetDeviceProperties
You can query GPU capabilities at runtime. This is useful for choosing optimal kernel parameters:
#include <stdio.h> int main() { int deviceCount; cudaGetDeviceCount(&deviceCount); printf("Found %d CUDA device(s)\n\n", deviceCount); for (int i = 0; i < deviceCount; i++) { cudaDeviceProp prop; cudaGetDeviceProperties(&prop, i); printf("Device %d: %s\n", i, prop.name); printf(" Compute capability : %d.%d\n", prop.major, prop.minor); printf(" SM count : %d\n", prop.multiProcessorCount); printf(" Max threads/block : %d\n", prop.maxThreadsPerBlock); printf(" Shared memory/block: %zu bytes\n", prop.sharedMemPerBlock); printf(" Global memory : %.1f GB\n", prop.totalGlobalMem / 1.0e9); printf(" Memory bandwidth : %.1f GB/s\n", 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6); printf("\n"); } return 0; }
Summary
- Use
<<<gridDim, blockDim>>>to configure kernel launches - Always round up block count with
(N + T - 1) / Tand use boundary checks - Check errors with
cudaGetLastError()andcudaDeviceSynchronize() - Kernel launches are asynchronous — the CPU doesn't wait unless you synchronize
- Query device properties to write adaptive, portable code
Executing kernels correctly is just the start. Next, learnCUDA Synchronizationto coordinate threads and prevent race conditions.
Common Mistakes in CUDA Kernel Execution
- Forgetting cudaDeviceSynchronize(): The CPU may try to read data before the GPU is finished writing it.
- Launch Config Errors: Launching more blocks/threads than the hardware supports (e.g., >1024 threads per block).
- Silent Failures: Not checking
cudaGetLastError()leads to hard-to-debug "nothing happened" scenarios.
Practice Exercises
- Implement the
CUDA_CHECKmacro in a small script and intentionally trigger an error (e.g., allocate too much memory). - Write a script that queries your GPU and prints only the Compute Capability and Max Threads per SM.
Further Reading
Practice These Concepts
Reinforce what you just learned with hands-on GPU coding challenges.
Implement a simple 2D matrix multiplication kernel. Multiply two NxN matrices where N = 64. Each thread computes one output element.
Sign in to track your progress across challenges.