What does the symbol >> mean in CUDA?

The triple angle brackets >> are part of the CUDA kernel launch syntax, used to specify the execution configuration (grid and block dimensions).

Are CUDA kernels asynchronous?

Yes, CUDA kernel launches are asynchronous with respect to the host CPU, meaning the CPU will continue executing the next line of code without waiting for the kernel to finish.

CUDA Kernels & Execution

Chapter 4 — Writing, launching, and debugging CUDA kernels

CUDA Kernel Launch Syntax: The <<< >>> Triple Angle Brackets

When you call a kernel with the <<<>>> syntax, you specify the execution configuration — how many blocks and threads to use:

cuda
12345678
// Syntax:
kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args...);

// Most common usage:
kernel<<<numBlocks, threadsPerBlock>>>(args...);

// With dynamic shared memory:
kernel<<<numBlocks, threadsPerBlock, sharedSize>>>(args...);

The first two arguments are required. The third (dynamic shared memory size) and fourth (CUDA stream) are optional.

Optimal Launch Configurations: Blocks, Grids, and Warps

The execution configuration directly impacts performance:

launch_config.cu
1234567891011121314151617181920212223
#include <stdio.h>

__global__ void myKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i * 2;
}

int main() {
    int N = 10000;
    int *d_data;
    cudaMalloc(&d_data, N * sizeof(int));

    int threadsPerBlock = 256;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    myKernel<<<numBlocks, threadsPerBlock>>>(d_data, N);
    cudaDeviceSynchronize();

    printf("Launched %d blocks with %d threads each\n", numBlocks, threadsPerBlock);
    
    cudaFree(d_data);
    return 0;
}

The rounding formula

(N + T - 1) / T is the standard way to compute the number of blocks. For N=1000 and T=256, this gives 4 blocks = 1024 threads (24 threads idle, but the boundary check if (i < N) prevents invalid access).

CUDA Error Handling: cudaGetLastError and cudaDeviceSynchronize

CUDA functions return error codes, but kernel launches do not return errors immediately. You must check explicitly:

error_handling.cu
12345678910111213141516171819202122232425262728293031323334
#include <stdio.h>

__global__ void myKernel(int *data) {
    // Intentional out of bounds or error could be tested here
}

int main() {
    int *d_data;
    size_t bytes = 1024;

    // Check CUDA API calls
    cudaError_t err = cudaMalloc(&d_data, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Check kernel launch errors
    myKernel<<<1, 1>>>(d_data);

    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
    }

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
    }
    
    printf("Error checking complete. Status: %s\n", cudaGetErrorString(err));
    cudaFree(d_data);
    return 0;
}

A common pattern is a helper macro:

cuda
123456789101112
#define CUDA_CHECK(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while(0)

// Usage:
CUDA_CHECK(cudaMalloc(&d_data, bytes));
CUDA_CHECK(cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice));

Kernel Limitations

Keep these constraints in mind when writing kernels:

No recursion on older architectures (compute capability < 2.0); limited stack on newer ones
No dynamic memory allocation inside kernels (except malloc on newer GPUs with limited heap)
printf works from kernels but uses a limited buffer — great for debugging, not for production
Kernel calls are asynchronous — the CPU continues immediately after launch

How to Query GPU Properties with cudaGetDeviceProperties

You can query GPU capabilities at runtime. This is useful for choosing optimal kernel parameters:

device_query.cu
1234567891011121314151617181920212223
#include <stdio.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA device(s)\n\n", deviceCount);

    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  SM count           : %d\n", prop.multiProcessorCount);
        printf("  Max threads/block  : %d\n", prop.maxThreadsPerBlock);
        printf("  Shared memory/block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Global memory      : %.1f GB\n", prop.totalGlobalMem / 1.0e9);
        printf("  Memory bandwidth   : %.1f GB/s\n",
               2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
        printf("\n");
    }
    return 0;
}

Summary

Use <<<gridDim, blockDim>>> to configure kernel launches
Always round up block count with (N + T - 1) / T and use boundary checks
Check errors with cudaGetLastError() and cudaDeviceSynchronize()
Kernel launches are asynchronous — the CPU doesn't wait unless you synchronize
Query device properties to write adaptive, portable code

Executing kernels correctly is just the start. Next, learnCUDA Synchronizationto coordinate threads and prevent race conditions.

Common Mistakes in CUDA Kernel Execution

Forgetting cudaDeviceSynchronize(): The CPU may try to read data before the GPU is finished writing it.
Launch Config Errors: Launching more blocks/threads than the hardware supports (e.g., >1024 threads per block).
Silent Failures: Not checking cudaGetLastError() leads to hard-to-debug "nothing happened" scenarios.

Practice Exercises

Implement the CUDA_CHECK macro in a small script and intentionally trigger an error (e.g., allocate too much memory).
Write a script that queries your GPU and prints only the Compute Capability and Max Threads per SM.

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Implement a simple 2D matrix multiplication kernel. Multiply two NxN matrices where N = 64. Each thread computes one output element.