Tiled GEMM with Shared Memory

by mikebrown_88 • Apr 19, 2026 • 2362 views

Multiplies two 1024x1024 float matrices using 16x16 tiles staged in shared memory. About 10x faster than the naive global-memory version. Times itself with CUDA events (cudaEvent_t) and reports GFLOPS.

#gemm #linear-algebra #performance #shared-memory
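
The snippet's source is not reproduced in this text, so what follows is only a minimal sketch of the approach the description names, not the author's code. The 1024 size, the 16x16 tile, and the use of CUDA events come from the description. The kernel name gemmTiled, the CHECK macro, the constant-filled inputs, and the warm-up launch are assumptions made for illustration. It also assumes the matrix size is a multiple of the tile size, so the kernel has no bounds checks, and it does not include the naive baseline needed to reproduce the ~10x figure.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 1024;   // matrix dimension (from the description)
constexpr int TILE = 16;  // tile edge (from the description)

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s (line %d)\n",                \
                    cudaGetErrorString(err_), __LINE__);                 \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Each block computes one TILE x TILE tile of C = A * B.
// Matrices are square, row-major, and n must be a multiple of TILE.
__global__ void gemmTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Stage one tile of A and one tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // both tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next overwrite
    }
    C[row * n + col] = acc;
}

int main() {
    const size_t count = size_t(N) * N;
    const size_t bytes = count * sizeof(float);
    // Constant inputs make the result easy to verify: every C element is 2 * N.
    std::vector<float> hA(count, 1.0f), hB(count, 2.0f), hC(count);

    float *dA, *dB, *dC;
    CHECK(cudaMalloc(&dA, bytes));
    CHECK(cudaMalloc(&dB, bytes));
    CHECK(cudaMalloc(&dC, bytes));
    CHECK(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice));

    dim3 block(TILE, TILE);
    dim3 grid(N / TILE, N / TILE);

    // Warm-up launch so one-time startup cost is not included in the timing.
    gemmTiled<<<grid, block>>>(dA, dB, dC, N);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    gemmTiled<<<grid, block>>>(dA, dB, dC, N);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost));
    float maxErr = 0.0f;
    for (float v : hC) maxErr = fmaxf(maxErr, fabsf(v - 2.0f * N));

    // A GEMM does 2 * N^3 floating-point operations (one multiply, one add).
    double gflops = 2.0 * N * N * N / (ms * 1e6);
    printf("time: %.3f ms, %.1f GFLOPS, max error: %g\n", ms, gflops, maxErr);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaFree(dC));
    return 0;
}
```

Assuming the file is saved as gemm_tiled.cu (a hypothetical name), it should build with nvcc -O2 gemm_tiled.cu -o gemm_tiled. In this sketch the speedup would come from data reuse: each value staged in shared memory is read by 16 threads, so global-memory loads drop by roughly a factor of TILE compared with a kernel that reads A and B directly for every multiply.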

