
High-Throughput Reduction with Warp Shuffles

by dankim_99 · Apr 28, 2026 · 2967 views

Reduces 16M floats with a tree reduction in shared memory, then finishes the last 32 elements with warp shuffles (no __syncthreads() is needed within a single warp). Reports measured memory bandwidth.
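The original main.cu is not reproduced on this page, so the following is only a minimal sketch of the approach described above, not the author's code. The block size `kBlock`, the kernel name `reduceSum`, the helper `reduceOnGpu`, the all-ones input, and the use of `atomicAdd` to combine per-block results are all assumptions made for illustration. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // assumed block size: a power of two, at least 64

__global__ void reduceSum(const float* in, float* out, size_t n) {
    __shared__ float smem[kBlock];
    const unsigned tid = threadIdx.x;
    const size_t i = (size_t)blockIdx.x * (2 * kBlock) + tid;

    // Each thread loads up to two elements, so a block covers 2 * kBlock inputs.
    float v = 0.0f;
    if (i < n) v = in[i];
    if (i + kBlock < n) v += in[i + kBlock];
    smem[tid] = v;
    __syncthreads();

    // Shared-memory tree: halve the active range until 64 partials remain.
    for (unsigned s = kBlock / 2; s > 32; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }

    // Final stage in warp 0: fold 64 -> 32, then shuffle inside the warp.
    // No __syncthreads() here; the *_sync shuffle synchronizes the lanes.
    if (tid < 32) {
        v = smem[tid] + smem[tid + 32];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (tid == 0) atomicAdd(out, v);  // assumed way to combine block results
    }
}

// Hypothetical helper: returns the sum and writes the kernel time in ms.
float reduceOnGpu(const std::vector<float>& h_in, float* ms) {
    const size_t n = h_in.size();
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const unsigned blocks = (unsigned)((n + 2 * kBlock - 1) / (2 * kBlock));
    cudaEventRecord(start);
    reduceSum<<<blocks, kBlock>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    float sum = 0.0f;
    cudaMemcpy(&sum, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return sum;
}

int main() {
    const size_t n = 1 << 24;       // 16M floats
    std::vector<float> h(n, 1.0f);  // all ones, so the exact sum equals n
    float ms = 0.0f;
    const float sum = reduceOnGpu(h, &ms);
    // Effective read bandwidth: bytes read divided by kernel time.
    const double gbps = (n * sizeof(float)) / (ms * 1.0e6);
    printf("sum = %.0f (expected %zu)\n", sum, n);
    printf("time = %.3f ms, bandwidth = %.1f GB/s\n", ms, gbps);
    return 0;
}
```

Timing a single launch includes one-time startup overhead, so a warm-up launch and an average over several runs would likely give a steadier bandwidth figure. The `atomicAdd` at the end keeps the sketch short; a second kernel pass over per-block partial sums is a common alternative when a deterministic summation order matters.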

#reduction #warp-primitives #bandwidth #performance
