Browse 14 CUDA code posts from the community.
Single-sphere ray tracer in CuPy — every pixel's ray traced in parallel via NumPy-style array ops. Lambert shading from a point light. Renders to ASCII.
Custom row-wise softmax in Triton with numerically-stable max-subtract. Verified against torch.softmax: results agree to within fp32 epsilon.
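The max-subtract trick in the softmax post can be checked on the CPU. What follows is a plain-Python reference sketch, not the post's Triton kernel; the function names `softmax_row` and `softmax_rows` are made up for illustration.

```python
import math

def softmax_row(row):
    """Numerically stable softmax of one row (a list of floats)."""
    m = max(row)                              # row maximum
    exps = [math.exp(x - m) for x in row]     # largest exponent is exp(0) = 1, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

def softmax_rows(matrix):
    """Apply softmax_row independently to each row, as a row-wise kernel would."""
    return [softmax_row(r) for r in matrix]

if __name__ == "__main__":
    # A naive exp(1000.0) would overflow; the shifted version does not.
    print(softmax_row([1000.0, 1001.0, 1002.0]))
```

Subtracting the row maximum leaves the result unchanged mathematically, because the common factor exp(-m) cancels between numerator and denominator.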
PyTorch 2.x scaled_dot_product_attention auto-dispatches to FlashAttention on supported GPUs. 4x16x1024x64 fp16, causal mask. Reports measured TFLOPS.
Reduces 16M floats with a tree reduction in shared memory, then warp shuffles over the final 32 elements (no __syncthreads needed at that stage). Reports measured memory bandwidth.
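The two stages of the reduction post can be emulated on the CPU to see the access pattern. This is a hedged plain-Python sketch: `block_reduce`, `reduce_sum`, and the block size of 256 are assumptions for illustration, and the list slicing only imitates what `__shfl_down_sync` does within a warp.

```python
def block_reduce(values, block_size=256):
    """Sum up to block_size values in the order a GPU block would."""
    assert block_size >= 64 and block_size & (block_size - 1) == 0
    assert len(values) <= block_size
    shared = list(values) + [0.0] * (block_size - len(values))  # pad with the identity
    # Stage 1: shared-memory tree. On the GPU each pass ends with __syncthreads().
    stride = block_size // 2
    while stride >= 32:
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2
    # Stage 2: 32 partial sums sit in one warp. Emulate shuffle-down with offsets 16..1.
    # Out-of-range lanes read 0.0 here; only lane 0's result is used.
    lanes = shared[:32]
    offset = 16
    while offset > 0:
        lanes = [lanes[i] + (lanes[i + offset] if i + offset < 32 else 0.0)
                 for i in range(32)]
        offset //= 2
    return lanes[0]

def reduce_sum(data, block_size=256):
    """One partial sum per block, then a final sum over the partials."""
    partials = [block_reduce(data[i:i + block_size], block_size)
                for i in range(0, len(data), block_size)]
    return sum(partials)
```

The warp stage needs no block-wide barrier on the GPU because all 32 lanes belong to the same warp and exchange registers directly.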
Direct O(N²) gravity integrator: every body pulls on every other. 4096 bodies, 50 timesteps, using the rsqrtf intrinsic. 4096² × 50 ≈ 0.84G pairwise interactions in total.
Two-tier histogram: per-block shared-memory atomics, then a single global atomic per bin per block. Avoids contention on hot bins. 1M bytes processed.
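The two-tier scheme in the histogram post can be sketched sequentially. This plain-Python version is an illustration only: `two_tier_histogram` and the block size of 256 are assumed names and values, and ordinary `+=` stands in for the atomic adds a real kernel would use.

```python
def two_tier_histogram(data, block_size=256, bins=256):
    """Histogram of byte values, accumulated per block and then merged."""
    global_hist = [0] * bins
    for start in range(0, len(data), block_size):
        local = [0] * bins                    # stands in for a block's shared-memory histogram
        for byte in data[start:start + block_size]:
            local[byte] += 1                  # tier 1: shared-memory atomicAdd on the GPU
        for b, count in enumerate(local):
            if count:                         # skipping empty bins is an optional shortcut
                global_hist[b] += count       # tier 2: one global atomicAdd per bin per block
    return global_hist

if __name__ == "__main__":
    hot = bytes([7] * 1000 + list(range(256)))  # bin 7 is deliberately "hot"
    print(two_tier_histogram(hot)[7])
```

With a hot bin, the per-element increments collide only inside a block's local copy; global memory sees at most one update per bin from each block.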
Classic 3x3 Sobel filter for edge detection. Each thread computes one output pixel from a 9-pixel neighbourhood. Synthetic disk image used for verification.
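A CPU reference is handy for verifying the Sobel post's output. The sketch below is plain Python under stated assumptions: the image size, disk radius, zeroed border, and the names `make_disk`, `sobel_pixel`, and `sobel` are illustrative and not taken from the post.

```python
import math

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal-gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical-gradient kernel

def make_disk(size=32, radius=10):
    """Synthetic test image: 1.0 inside a centred disk, 0.0 outside."""
    c = size // 2
    return [[1.0 if (x - c) ** 2 + (y - c) ** 2 <= radius ** 2 else 0.0
             for x in range(size)] for y in range(size)]

def sobel_pixel(img, x, y):
    """Gradient magnitude at (x, y); the work one GPU thread would do."""
    gx = gy = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            p = img[y + dy][x + dx]
            gx += GX[dy + 1][dx + 1] * p
            gy += GY[dy + 1][dx + 1] * p
    return math.hypot(gx, gy)

def sobel(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]      # border pixels stay 0 (an assumption)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sobel_pixel(img, x, y)
    return out
```

On a disk image the response should be zero in the flat interior and exterior and nonzero only along the rim, which makes mistakes easy to spot.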
1024x1024 float matrix multiply using 16x16 tiles in shared memory. ~10x speedup over the naive global-memory version. Times itself with cudaEvents and reports GFLOPS.
Classic cellular automaton on the GPU — toroidal 64x64 grid, glider seed, runs 30 generations. Double-buffered with pointer swap. Each cell evaluates its 8 neighbours in parallel.
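The update rule and buffer swap from the cellular automaton post can be sketched in plain Python. The names `step` and `run` and the seed position are assumptions for illustration; the nested loops do sequentially what the GPU version does with one thread per cell.

```python
def step(src, dst):
    """Write the next generation of src into dst on a toroidal grid."""
    n = len(src)
    for y in range(n):
        for x in range(n):
            # Count the 8 neighbours; the modulo wraps around the grid edges.
            neighbours = sum(src[(y + dy) % n][(x + dx) % n]
                             for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                             if (dy, dx) != (0, 0))
            alive = src[y][x]
            dst[y][x] = 1 if neighbours == 3 or (alive and neighbours == 2) else 0

def run(n=64, generations=30):
    cur = [[0] * n for _ in range(n)]
    nxt = [[0] * n for _ in range(n)]
    for y, x in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:   # glider seed
        cur[y][x] = 1
    for _ in range(generations):
        step(cur, nxt)
        cur, nxt = nxt, cur          # swap buffers instead of copying
    return cur
```

Because `step` overwrites every cell of `dst`, the stale contents of the spare buffer never leak into the next generation.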
Renders the Mandelbrot set as ASCII art. A CUDA kernel computes the iteration count for each pixel, and points that take longer to escape map to progressively denser characters. The whole frame is computed on the GPU.
ASCII-rendered Mandelbrot set, parallelized per pixel. Each thread computes one (cx, cy) point's escape time independently — no shared memory needed.
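The per-pixel escape-time computation can be written as a small CPU sketch. The viewport bounds, iteration cap, palette, and the names `escape_time` and `render` are assumptions chosen for illustration; on the GPU each call to `escape_time` would be one thread.

```python
def escape_time(cx, cy, max_iter=100):
    """Iterations of z -> z*z + c before |z| exceeds 2, capped at max_iter."""
    x = y = 0.0
    for i in range(max_iter):
        if x * x + y * y > 4.0:
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return max_iter

def render(width=60, height=24, max_iter=100, palette=" .:-=+*#%@"):
    """Map each pixel's escape time to a character; denser means slower escape."""
    rows = []
    for py in range(height):
        cy = -1.2 + 2.4 * py / (height - 1)
        row = ""
        for px in range(width):
            cx = -2.0 + 3.0 * px / (width - 1)
            n = escape_time(cx, cy, max_iter)
            row += palette[n * (len(palette) - 1) // max_iter]
        rows.append(row)
    return "\n".join(rows)

if __name__ == "__main__":
    print(render())
```

Each pixel depends only on its own (cx, cy), so no thread needs data from any other, which is why the parallel version needs no shared memory.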