Introduction to CUDA
Chapter 1 — Understand GPU computing and why CUDA exists
What is GPU Computing? Parallel Architecture Explained
A GPU (Graphics Processing Unit) was originally designed to render pixels on screen. Over time, engineers discovered that the massively parallel architecture of GPUs could accelerate many workloads beyond graphics — scientific simulations, machine learning, image processing, and more.
While a typical CPU has 8 to 64 powerful cores optimized for sequential tasks, a modern GPU has thousands of smaller cores designed to execute the same operation on many data points simultaneously. This is called data parallelism.
CPUs are like a team of expert workers — each can handle complex tasks independently. GPUs are like an army of simple workers — each does one small job, but together they finish massive workloads incredibly fast.
What is CUDA? The Parallel Computing Platform
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. It lets you write C/C++ code that runs on NVIDIA GPUs.
With CUDA you can:
- Write functions (called kernels) that execute on the GPU
- Move data between CPU (host) memory and GPU (device) memory
- Coordinate thousands of parallel threads
Your First CUDA Kernel: Hello World Example
Let's look at the simplest possible CUDA program — a "Hello from GPU" kernel. Don't worry about understanding every line yet; we'll cover each concept in depth in the following chapters.
#include <stdio.h> // A kernel function — runs on the GPU __global__ void helloKernel() { printf("Hello from GPU thread %d!\n", threadIdx.x); } int main() { // Launch the kernel with 1 block of 8 threads helloKernel<<<1, 8>>>(); // Wait for GPU to finish cudaDeviceSynchronize(); printf("Done!\n"); return 0; }
Breaking it down
__global__— tells the compiler this function runs on the GPU and is called from the CPU.<<<1, 8>>>— the execution configuration: launch 1 block containing 8 threads.threadIdx.x— a built-in variable that gives each thread its unique ID within its block.cudaDeviceSynchronize()— waits for all GPU threads to finish before the CPU continues.
CPU vs GPU: When to Accelerate with CUDA
CUDA shines when your problem can be broken into many independent, similar tasks:
- Matrix math — linear algebra, neural network layers
- Image processing — filters, transforms applied to every pixel
- Simulations — physics, molecular dynamics, Monte Carlo
- Data processing — parallel reductions, sorting, searching
If your workload is inherently sequential (e.g., parsing a complex recursive data structure) or involves very little data, the overhead of moving data to the GPU may outweigh the performance gain.
Setting Up CUDA
To write and compile CUDA programs locally, you need:
- An NVIDIA GPU (GeForce, Quadro, Tesla, or datacenter GPUs)
- The NVIDIA CUDA Toolkit — includes the
nvcccompiler, runtime libraries, and tools - A C/C++ compiler (
gcc,MSVC, orclang)
Download the CUDA Toolkit from the official NVIDIA Developer site. Or use the Playground to write and run CUDA code directly in your browser — no setup required.
# Compile a .cu file with nvcc nvcc hello.cu -o hello # Run the program ./hello # Output: # Hello from GPU thread 0! # Hello from GPU thread 1! # ... # Hello from GPU thread 7! # Done!
Summary
- GPUs have thousands of cores designed for parallel execution
- CUDA lets you write C/C++ code that runs on NVIDIA GPUs
- Kernel functions use
__global__and are launched with<<<blocks, threads>>> - Use CUDA when your problem is data-parallel and compute-heavy
Next, we'll dive into the CUDA Programming Modelto understand how threads work in grids and blocks.
Common Mistakes for CUDA Beginners
- Assuming GPU is always faster: For small data, CPU is often faster due to PCI-e transfer overhead.
- Ignoring error codes: Most CUDA API calls return an error code. Always check them (see Chapter 8).
- Sync issues: Forgetting that kernels run asynchronously with the CPU.
Practice Exercises
- Try changing the launch parameters
<<<1, 8>>>to<<<2, 16>>>. What happens to the output? - Modify the kernel to print "Hello from block %d" using
blockIdx.x.
Further Reading
Practice These Concepts
Reinforce what you just learned with hands-on GPU coding challenges.
Write a CUDA kernel that adds two arrays element-wise. Each thread should handle one element. The arrays are of size N = 1024.
Sign in to track your progress across challenges.