Introduction to CUDA

Chapter 1 — Understand GPU computing and why CUDA exists

What is GPU Computing? Parallel Architecture Explained

A GPU (Graphics Processing Unit) was originally designed to render pixels on screen. Over time, engineers discovered that the massively parallel architecture of GPUs could accelerate many workloads beyond graphics — scientific simulations, machine learning, image processing, and more.

While a typical CPU has 8 to 64 powerful cores optimized for sequential tasks, a modern GPU has thousands of smaller cores designed to execute the same operation on many data points simultaneously. This is called data parallelism.

Key Insight

CPUs are like a team of expert workers — each can handle complex tasks independently. GPUs are like an army of simple workers — each does one small job, but together they finish massive workloads incredibly fast.

What is CUDA? The Parallel Computing Platform

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. It lets you write C/C++ code that runs on NVIDIA GPUs.

With CUDA you can:

  • Write functions (called kernels) that execute on the GPU
  • Move data between CPU (host) memory and GPU (device) memory
  • Coordinate thousands of parallel threads

Your First CUDA Kernel: Hello World Example

Let's look at the simplest possible CUDA program — a "Hello from GPU" kernel. Don't worry about understanding every line yet; we'll cover each concept in depth in the following chapters.

hello.cu
1234567891011121314151617
#include <stdio.h>

// A kernel function — runs on the GPU
__global__ void helloKernel() {
    printf("Hello from GPU thread %d!\n", threadIdx.x);
}

int main() {
    // Launch the kernel with 1 block of 8 threads
    helloKernel<<<1, 8>>>();

    // Wait for GPU to finish
    cudaDeviceSynchronize();

    printf("Done!\n");
    return 0;
}

Breaking it down

  • __global__ — tells the compiler this function runs on the GPU and is called from the CPU.
  • <<<1, 8>>> — the execution configuration: launch 1 block containing 8 threads.
  • threadIdx.x — a built-in variable that gives each thread its unique ID within its block.
  • cudaDeviceSynchronize() — waits for all GPU threads to finish before the CPU continues.

CPU vs GPU: When to Accelerate with CUDA

CUDA shines when your problem can be broken into many independent, similar tasks:

  • Matrix math — linear algebra, neural network layers
  • Image processing — filters, transforms applied to every pixel
  • Simulations — physics, molecular dynamics, Monte Carlo
  • Data processing — parallel reductions, sorting, searching
When NOT to use CUDA

If your workload is inherently sequential (e.g., parsing a complex recursive data structure) or involves very little data, the overhead of moving data to the GPU may outweigh the performance gain.

Setting Up CUDA

To write and compile CUDA programs locally, you need:

  1. An NVIDIA GPU (GeForce, Quadro, Tesla, or datacenter GPUs)
  2. The NVIDIA CUDA Toolkit — includes the nvcc compiler, runtime libraries, and tools
  3. A C/C++ compiler (gcc, MSVC, or clang)

Download the CUDA Toolkit from the official NVIDIA Developer site. Or use the Playground to write and run CUDA code directly in your browser — no setup required.

terminal
1234567891011
# Compile a .cu file with nvcc
nvcc hello.cu -o hello

# Run the program
./hello
# Output:
# Hello from GPU thread 0!
# Hello from GPU thread 1!
# ...
# Hello from GPU thread 7!
# Done!

Summary

  • GPUs have thousands of cores designed for parallel execution
  • CUDA lets you write C/C++ code that runs on NVIDIA GPUs
  • Kernel functions use __global__ and are launched with <<<blocks, threads>>>
  • Use CUDA when your problem is data-parallel and compute-heavy

Common Mistakes for CUDA Beginners

  • Assuming GPU is always faster: For small data, CPU is often faster due to PCI-e transfer overhead.
  • Ignoring error codes: Most CUDA API calls return an error code. Always check them (see Chapter 8).
  • Sync issues: Forgetting that kernels run asynchronously with the CPU.

Practice Exercises

  1. Try changing the launch parameters <<<1, 8>>> to <<<2, 16>>>. What happens to the output?
  2. Modify the kernel to print "Hello from block %d" using blockIdx.x.

Further Reading

Practice These Concepts

Reinforce what you just learned with hands-on GPU coding challenges.

Vector Additioneasy

Write a CUDA kernel that adds two arrays element-wise. Each thread should handle one element. The arrays are of size N = 1024.

Sign in to track your progress across challenges.