GPU Challenges

Solve GPU programming challenges in CUDA C++ or Python (PyTorch).

Challenges

Write a CUDA kernel that adds two arrays element-wise. Each thread should handle one element. The arrays are of size N = 1024.
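
As a starting point, here is a minimal sketch of one way to approach this. The kernel name vectorAdd, the host helper vectorAddOnGpu, the float element type, and the block size of 256 are assumptions made for illustration; the challenge does not prescribe them. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds exactly one pair of elements.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard in case n is not a multiple of the block size
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: copies inputs to the GPU, launches the kernel,
// and copies the result back. Assumes a and b have the same length.
std::vector<float> vectorAddOnGpu(const std::vector<float>& a,
                                  const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // 4 blocks when n = 1024
    vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Calling vectorAddOnGpu with two 1024-element vectors launches 4 blocks of 256 threads, so every element gets its own thread.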

Matrix Multiplication

medium

Implement a simple 2D matrix multiplication kernel. Multiply two NxN matrices where N = 64. Each thread computes one output element.
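
One possible shape for a solution is sketched below. It assumes row-major float matrices, a 16x16 thread block, and the illustrative names matMul and matMulOnGpu, none of which come from the challenge text. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread computes one element C[row][col] of the n x n product.
// Matrices are assumed to be stored in row-major order.
__global__ void matMul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

// Hypothetical host helper for n x n matrices given as flat vectors.
std::vector<float> matMulOnGpu(const std::vector<float>& A,
                               const std::vector<float>& B, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(16, 16);  // assumed block shape: 256 threads per block
    dim3 blocks((n + threads.x - 1) / threads.x,
                (n + threads.y - 1) / threads.y);  // 4 x 4 blocks when n = 64
    matMul<<<blocks, threads>>>(dA, dB, dC, n);

    std::vector<float> C(static_cast<size_t>(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

This is the naive version: every thread reads a full row of A and a full column of B from global memory. A tiled variant that stages sub-blocks in shared memory would reduce that traffic, but it is not needed to meet the stated requirements.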

Parallel Sum Reduction

hard

Write a kernel that sums all elements of an array using parallel reduction. The array size is 1024 elements. Use shared memory for efficiency.
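
The sketch below shows one common form of shared-memory reduction: each block reduces its slice with a tree of pairwise additions, and the host adds up the per-block partial sums. The names blockSum and sumOnGpu, the float element type, and the block size of 256 are assumptions for illustration. The kernel requires the launch block size to equal THREADS and to be a power of two. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS = 256;  // assumed block size; must be a power of two

// Each block reduces THREADS elements in shared memory and writes one partial sum.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;  // pad out-of-range slots with zero
    __syncthreads();

    // Tree reduction: halve the number of active threads on every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        partial[blockIdx.x] = tile[0];
    }
}

// Hypothetical host helper: launches the kernel and sums the partials on the CPU.
float sumOnGpu(const std::vector<float>& data) {
    int n = static_cast<int>(data.size());
    int blocks = (n + THREADS - 1) / THREADS;  // 4 blocks when n = 1024

    float *dIn, *dPartial;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS>>>(dIn, dPartial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Finishing the last few partial sums on the host keeps the sketch short. A second kernel launch or an atomic add would keep the whole reduction on the GPU instead.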

Dot Product

medium

Compute the dot product of two vectors on the GPU. The dot product is the sum of element-wise products: result = sum(a[i] * b[i]). Use parallel reduction with shared memory for efficiency. Array size N = 4096.
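
A possible approach, sketched under a few assumptions: each thread multiplies one pair of elements, each block reduces its products in shared memory, and thread 0 of each block folds the block total into a single result with atomicAdd. The names dotKernel and dotOnGpu, the float type, and the block size of 256 are illustrative choices, not part of the challenge. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int DOT_THREADS = 256;  // assumed block size; must be a power of two

// Computes per-block partial dot products in shared memory, then
// accumulates them into *result, which must be zeroed before launch.
__global__ void dotKernel(const float* a, const float* b, float* result, int n) {
    __shared__ float tile[DOT_THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? a[i] * b[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        atomicAdd(result, tile[0]);  // one atomic per block
    }
}

// Hypothetical host helper. Assumes a and b have the same length.
float dotOnGpu(const std::vector<float>& a, const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);

    float *dA, *dB, *dResult;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, sizeof(float));  // all-zero bytes represent 0.0f

    int blocks = (n + DOT_THREADS - 1) / DOT_THREADS;  // 16 blocks when n = 4096
    dotKernel<<<blocks, DOT_THREADS>>>(dA, dB, dResult, n);

    float result = 0.0f;
    cudaMemcpy(&result, dResult, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dResult);
    return result;
}
```

Because floating-point addition is not associative and the order of the atomic adds is not fixed, results for general inputs can differ slightly from a sequential CPU sum, so comparisons should use a tolerance.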

Image Blur Filter

medium

Implement a simple 2D box blur filter for a 64x64 grayscale image. Each output pixel should be the average of itself and its 8 neighbors (3x3 kernel). Handle edge pixels by only averaging valid neighbors.
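
The edge rule above can be sketched as follows: each thread visits the 3x3 window around its pixel, skips coordinates that fall outside the image, and divides by the number of pixels it actually summed. The float pixel type, the row-major layout, and the names boxBlur and boxBlurOnGpu are assumptions; a real grayscale image might use 8-bit pixels instead. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread produces one output pixel as the mean of its valid 3x3 neighborhood.
__global__ void boxBlur(const float* in, float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx;
            int ny = y + dy;
            // Only average neighbors that lie inside the image.
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += in[ny * width + nx];
                ++count;
            }
        }
    }
    out[y * width + x] = sum / count;  // count is 4, 6, or 9
}

// Hypothetical host helper for a row-major width x height image.
std::vector<float> boxBlurOnGpu(const std::vector<float>& image,
                                int width, int height) {
    size_t bytes = static_cast<size_t>(width) * height * sizeof(float);

    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, image.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(16, 16);  // assumed block shape
    dim3 blocks((width + threads.x - 1) / threads.x,
                (height + threads.y - 1) / threads.y);  // 4 x 4 blocks for 64x64
    boxBlur<<<blocks, threads>>>(dIn, dOut, width, height);

    std::vector<float> result(static_cast<size_t>(width) * height);
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With this rule a corner pixel averages 4 values, an edge pixel 6, and an interior pixel 9. Writing to a separate output buffer matters: blurring in place would let some threads read neighbors that other threads have already overwritten.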

SAXPY: Single Precision A*X Plus Y

easy

Implement the BLAS SAXPY operation: y = a*x + y, where a is a scalar and x, y are vectors. This is one of the most fundamental GPU operations. Vector size N = 8192. Each thread handles one element.
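
A minimal sketch of the operation follows. The names saxpy and saxpyOnGpu and the block size of 256 are illustrative assumptions, and the helper returns the updated y rather than modifying the caller's vector in place. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread updates one element: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical host helper. Assumes x and y have the same length.
std::vector<float> saxpyOnGpu(float a, const std::vector<float>& x,
                              const std::vector<float>& y) {
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);

    float *dX, *dY;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dY, bytes);
    cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dY, y.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // 32 blocks when n = 8192
    saxpy<<<blocks, threads>>>(n, a, dX, dY);

    std::vector<float> result(n);
    cudaMemcpy(result.data(), dY, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dX);
    cudaFree(dY);
    return result;
}
```

The scalar a is passed by value as a kernel argument, so it needs no device allocation of its own.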

Parallel Histogram Computation

medium

Compute a histogram of 256 bins for an array of 10,000 random byte values (0-255). Use atomic operations to avoid race conditions when multiple threads update the same bin.
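
The simplest form of this idea is sketched below: one thread per input byte, with atomicAdd on global-memory bins so that concurrent increments to the same bin are not lost. The names histogramKernel and histogramOnGpu and the block size of 256 are assumptions for illustration. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int NUM_BINS = 256;  // one bin per possible byte value

// Each thread reads one byte and atomically increments the matching bin.
__global__ void histogramKernel(const unsigned char* data, unsigned int* bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[data[i]], 1u);
    }
}

// Hypothetical host helper: returns the 256 bin counts.
std::vector<unsigned int> histogramOnGpu(const std::vector<unsigned char>& data) {
    int n = static_cast<int>(data.size());

    unsigned char* dData;
    unsigned int* dBins;
    cudaMalloc(&dData, n * sizeof(unsigned char));
    cudaMalloc(&dBins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(dData, data.data(), n * sizeof(unsigned char), cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, NUM_BINS * sizeof(unsigned int));  // bins start at zero

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // 40 blocks when n = 10,000
    histogramKernel<<<blocks, threads>>>(dData, dBins, n);

    std::vector<unsigned int> bins(NUM_BINS);
    cudaMemcpy(bins.data(), dBins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dBins);
    return bins;
}
```

A common refinement is to accumulate a private histogram per block in shared memory and merge it into the global bins at the end, which reduces contention on global atomics. For 10,000 inputs the plain version above is likely adequate.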

Constant Memory Lookup Table

hard

Use CUDA constant memory to store a sine lookup table. Compute sin(x) for 1024 values using the lookup table instead of calling sin() directly. This demonstrates constant memory caching for read-only data accessed by all threads.
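
One way this could look is sketched below. The table size of 256 entries, the nearest-entry lookup, the restriction of inputs to the range [0, 2π), and the names sineTable, sinLookup, and sinLookupOnGpu are all assumptions made for illustration; the challenge does not specify them. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TABLE_SIZE = 256;           // assumed table resolution
constexpr float TWO_PI = 6.28318530718f;

// Lookup table in constant memory: read-only from kernels, written by the host.
__constant__ float sineTable[TABLE_SIZE];

// Each thread maps its input to the nearest table entry.
// Assumes every x[i] lies in [0, 2*pi).
__global__ void sinLookup(const float* x, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int idx = static_cast<int>(x[i] / TWO_PI * TABLE_SIZE + 0.5f) % TABLE_SIZE;
        out[i] = sineTable[idx];
    }
}

// Hypothetical host helper: fills the table, uploads it, and runs the lookup.
std::vector<float> sinLookupOnGpu(const std::vector<float>& x) {
    float table[TABLE_SIZE];
    for (int k = 0; k < TABLE_SIZE; ++k) {
        table[k] = std::sin(TWO_PI * k / TABLE_SIZE);
    }
    cudaMemcpyToSymbol(sineTable, table, sizeof(table));

    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    float *dX, *dOut;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // 4 blocks when n = 1024
    sinLookup<<<blocks, threads>>>(dX, dOut, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dX);
    cudaFree(dOut);
    return out;
}
```

With 256 entries and nearest-entry lookup, the result can be off by up to about π/256 (roughly 0.012) from the true sine; interpolating between adjacent entries or enlarging the table would tighten that. Constant memory is fastest when the threads of a warp read the same address, so a table indexed differently by each thread may not see the full benefit of the constant cache.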


