High-Throughput Reduction with Warp Shuffles
by dankim_99 • Apr 28, 2026 • 👁 2967 views
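Reduces 16M floats with a tree reduction in shared memory, then switches to warp shuffles for the final 32 elements of each block. No __syncthreads() is needed in that last stage because the remaining threads all belong to a single warp. Reports the measured memory bandwidth.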
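The original source is not shown on this page, so the following is only a minimal sketch of the technique described above, not the author's code. The names `reduceKernel`, `reduceOnGpu`, and `kBlockSize`, the block size of 256, the grid size of 1024, and the reading of "16M" as 2^24 elements are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size: a power of two, at least 64. Kernels must be launched with it.
constexpr int kBlockSize = 256;

// Each block writes one partial sum to out[blockIdx.x].
__global__ void reduceKernel(const float* in, float* out, size_t n) {
    __shared__ float smem[kBlockSize];
    unsigned tid = threadIdx.x;

    // Grid-stride loop: every thread accumulates a private sum first.
    float sum = 0.0f;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + tid; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        sum += in[i];
    }
    smem[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory until 32 partial sums remain.
    for (unsigned s = blockDim.x / 2; s >= 32; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }

    // Final 32 elements: warp 0 reduces them in registers with shuffles.
    // All 32 lanes of the warp take this branch, so the full mask is valid
    // and no __syncthreads() is required.
    if (tid < 32) {
        float v = smem[tid];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (tid == 0) out[blockIdx.x] = v;
    }
}

// Hypothetical host helper: reduces h on the GPU and reports kernel time in ms.
float reduceOnGpu(const std::vector<float>& h, float* elapsedMs) {
    const size_t n = h.size();
    const int blocks = 1024;  // assumed grid size for the first pass
    float *dIn = nullptr, *dPartial = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    reduceKernel<<<blocks, kBlockSize>>>(dIn, dPartial, n);    // per-block sums
    reduceKernel<<<1, kBlockSize>>>(dPartial, dOut, blocks);   // sum of sums
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(elapsedMs, start, stop);

    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dOut);
    return result;
}

int main() {
    const size_t n = size_t(1) << 24;  // assumption: "16M" means 2^24 floats
    std::vector<float> h(n, 1.0f);     // all ones, so the sum should equal n

    float ms = 0.0f;
    float sum = reduceOnGpu(h, &ms);

    // Effective bandwidth counts only the bytes read from the input array.
    double gbPerSec = (double)(n * sizeof(float)) / (ms * 1e-3) / 1e9;
    printf("sum = %.0f (expected %zu)\n", sum, n);
    printf("time = %.3f ms, bandwidth = %.1f GB/s\n", ms, gbPerSec);
    return 0;
}
```

The timing in this sketch covers a single un-warmed run of both kernel launches, so the reported bandwidth will likely be lower than what a warmed-up, averaged measurement would show.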
#reduction #warp-primitives #bandwidth #performance