
Tiled GEMM with Shared Memory

by mikebrown_88 · Apr 19, 2026 · 2181 views

Multiplies two 1024x1024 float matrices using 16x16 tiles staged in shared memory. Roughly 10x faster than the naive global-memory version. Times itself with CUDA events (cudaEvent_t) and reports GFLOPS.
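The author's main.cu is not reproduced on this page, so the following is only a minimal sketch of the technique the description names, not the original code. The names `gemm_tiled` and `gemm_gflops`, the all-ones and all-twos test inputs, and the assumption that the matrix size is a multiple of the tile size are illustrative choices. Error checking is reduced to a single `cudaGetLastError` call for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 1024;   // matrix dimension, from the description
constexpr int TILE = 16;  // tile edge, from the description

// Hypothetical helper: an n x n GEMM does 2*n^3 floating-point operations
// (one multiply and one add per inner-product step).
double gemm_gflops(int n, float ms) {
    return 2.0 * n * n * n / (ms * 1e-3) / 1e9;
}

// C = A * B for row-major square matrices.
// Assumption: n is a multiple of TILE, so no bounds checks are needed.
__global__ void gemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    C[row * n + col] = acc;
}

int main() {
    size_t bytes = size_t(N) * N * sizeof(float);
    std::vector<float> hA(size_t(N) * N, 1.0f), hB(size_t(N) * N, 2.0f), hC(size_t(N) * N);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(N / TILE, N / TILE);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    gemm_tiled<<<grid, block>>>(dA, dB, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    // With A all 1s and B all 2s, every entry of C should equal 2 * N.
    bool ok = true;
    for (size_t i = 0; i < hC.size(); ++i)
        if (hC[i] != 2.0f * N) { ok = false; break; }

    printf("result %s, %.3f ms, %.1f GFLOPS\n",
           ok ? "OK" : "MISMATCH", ms, gemm_gflops(N, ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok ? 0 : 1;
}
```

Each block reads every A and B tile from global memory once and reuses it 16 times from shared memory, which is where a speedup over a naive kernel would come from. A single timed launch also includes one-time warm-up cost, so a more careful benchmark would likely run the kernel once untimed and then average several timed runs.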

#gemm #linear-algebra #performance #shared-memory
main.cu