Tiled GEMM with Shared Memory
by mikebrown_88 • Apr 19, 2026
A 1024x1024 single-precision (float) matrix multiply that stages 16x16 tiles of the inputs in shared memory, giving roughly a 10x speedup over the naive global-memory version. The program times its own kernel with CUDA events (cudaEvent_t) and reports GFLOPS.
#gemm #linear-algebra #performance #shared-memory
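The page does not show the posted source, so what follows is only a minimal sketch of a 16x16 shared-memory tiled GEMM with CUDA-event timing, not the author's code. It assumes square, row-major matrices filled with random values in [0, 1]. The names `tiled_gemm`, `gemm_flops`, `gemm_max_abs_error_vs_cpu`, `TILE` and `CUDA_CHECK` are hypothetical. GFLOPS is computed as 2n³ floating-point operations (one multiply and one add per term) divided by the kernel time. For n = 1024 that is 2,147,483,648 operations per multiply. The sketch does not reproduce the ~10x figure; it only verifies the result against a CPU reference.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // tile edge; each thread block is TILE x TILE threads

// Abort with a message on any CUDA runtime error (hypothetical helper macro).
#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    cudaError_t err_ = (call);                                            \
    if (err_ != cudaSuccess) {                                            \
      std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",                    \
                   cudaGetErrorString(err_), __FILE__, __LINE__);         \
      std::exit(EXIT_FAILURE);                                            \
    }                                                                     \
  } while (0)

// C = A * B for row-major n x n matrices. Each TILE x TILE thread block computes
// one tile of C, walking across A's rows and B's columns one tile at a time.
__global__ void tiled_gemm(const float* A, const float* B, float* C, int n) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];
  const int tx = threadIdx.x, ty = threadIdx.y;
  const int row = blockIdx.y * TILE + ty, col = blockIdx.x * TILE + tx;
  float acc = 0.0f;
  for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
    const int a_col = t * TILE + tx, b_row = t * TILE + ty;
    // Each thread loads one element of each tile; out-of-range slots get 0.
    As[ty][tx] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
    Bs[ty][tx] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
    __syncthreads();  // tiles fully loaded before any thread reads them
    for (int k = 0; k < TILE; ++k) acc += As[ty][k] * Bs[k][tx];
    __syncthreads();  // all reads done before the next iteration overwrites
  }
  if (row < n && col < n) C[row * n + col] = acc;
}

// Floating-point operations in an n x n x n GEMM: one multiply + one add per term.
double gemm_flops(int n) { return 2.0 * n * n * n; }

// Runs the kernel on random inputs, times one launch with CUDA events (stored in
// *ms if non-null), and returns the max abs error against a CPU reference.
float gemm_max_abs_error_vs_cpu(int n, float* ms = nullptr) {
  const size_t count = size_t(n) * n, bytes = count * sizeof(float);
  std::vector<float> hA(count), hB(count), hC(count);
  for (size_t i = 0; i < count; ++i) {
    hA[i] = std::rand() / (float)RAND_MAX;
    hB[i] = std::rand() / (float)RAND_MAX;
  }
  float *dA, *dB, *dC;
  CUDA_CHECK(cudaMalloc(&dA, bytes));
  CUDA_CHECK(cudaMalloc(&dB, bytes));
  CUDA_CHECK(cudaMalloc(&dC, bytes));
  CUDA_CHECK(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice));
  dim3 block(TILE, TILE);
  dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
  tiled_gemm<<<grid, block>>>(dA, dB, dC, n);  // warm-up launch, not timed
  CUDA_CHECK(cudaGetLastError());
  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));
  CUDA_CHECK(cudaEventRecord(start));
  tiled_gemm<<<grid, block>>>(dA, dB, dC, n);
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));
  float elapsed = 0.0f;
  CUDA_CHECK(cudaEventElapsedTime(&elapsed, start, stop));
  if (ms) *ms = elapsed;
  CUDA_CHECK(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost));
  // Double-precision CPU reference on a strided subset of rows (all rows if n <= 64).
  float max_err = 0.0f;
  const int stride = n > 64 ? n / 64 : 1;
  for (int i = 0; i < n; i += stride)
    for (int j = 0; j < n; ++j) {
      double ref = 0.0;
      for (int k = 0; k < n; ++k) ref += (double)hA[size_t(i) * n + k] * hB[size_t(k) * n + j];
      max_err = std::fmax(max_err, (float)std::fabs(ref - hC[size_t(i) * n + j]));
    }
  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
  CUDA_CHECK(cudaFree(dA)); CUDA_CHECK(cudaFree(dB)); CUDA_CHECK(cudaFree(dC));
  return max_err;
}

int main() {
  const int n = 1024;
  float ms = 0.0f;
  const float err = gemm_max_abs_error_vs_cpu(n, &ms);
  const double gflops = gemm_flops(n) / (ms * 1e-3) / 1e9;
  std::printf("n=%d  kernel %.3f ms  %.1f GFLOPS  max abs err %g\n", n, ms, gflops, err);
  return err < 1e-3f * n ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Something like `nvcc -O3 tiled_gemm.cu -o tiled_gemm` should build it. The design choice behind tiling: in the naive kernel every thread reads 2n floats from global memory. Here each thread reads only 2n/TILE, one element of an A tile and one of a B tile per phase. The remaining reads hit shared memory, which cuts global-memory traffic by a factor of TILE (16). The two `__syncthreads()` calls are required, because each tile is read by threads other than the one that loaded it.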