Triton GPU Programming
Write custom GPU kernels in Python — faster than PyTorch, simpler than CUDA C++
What is Triton?
Triton is an open-source GPU kernel language developed at OpenAI. Instead of writing CUDA C++ to program NVIDIA GPUs, you write Python with a small set of GPU-aware primitives. On NVIDIA hardware, Triton compiles your code to PTX (NVIDIA's low-level GPU assembly), and well-tuned kernels often match the performance of hand-optimized CUDA at a fraction of the complexity.
PyTorch's torch.compile uses Triton internally to fuse operations into single kernels. When you write a Triton kernel, you're writing the same kind of code that powers modern deep learning frameworks under the hood.
As LLMs and transformer models dominate AI workloads, engineers increasingly need custom kernels for operations like Flash Attention, fused softmax, and quantized matrix multiplications, which stock PyTorch operators don't always cover or fuse efficiently. Triton is a practical tool for this: it's Python, it's fast, and it runs on real GPUs.
Triton vs PyTorch vs CUDA C++
| Aspect | Triton | PyTorch | CUDA C++ |
|---|---|---|---|
| Language | Python | Python | C++ |
| Abstraction level | Tile-based | Tensor-based | Thread-based |
| Control over memory | High | Low | Full |
| Learning curve | Medium | Easy | Hard |
| Custom ops | Easy | Painful | Full control |
| Used for | Custom kernels, LLM inference | General ML | Low-level GPU programming |
The Core Mental Model: Programs, Not Threads
CUDA C++ asks you to think about individual threads: each thread has a unique ID and typically operates on one or a few elements. You're responsible for computing memory offsets, reasoning about warps, and managing shared memory manually.
Triton asks you to think about programs — each program instance operates on a tile (a contiguous block) of data. You load a block, operate on it, store it back. Triton handles thread organization and vectorization automatically.
The most important parameter in most Triton kernels is the block size, conventionally named BLOCK_SIZE: how many elements each program instance handles. It's declared as a tl.constexpr so its value is known at compile time, which lets the compiler optimize memory access patterns and vectorization for that tile size.
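The program model can be mimicked on the CPU to see which elements each instance touches. The sketch below is plain Python, not Triton: `cdiv`, `add_program`, `launch`, and the tiny `BLOCK_SIZE` of 4 are hypothetical names and values chosen for illustration, and the loop over `pid` stands in for program instances that a GPU would run in parallel.

```python
def cdiv(a, b):
    """Ceiling division: how many tiles of size b are needed to cover a elements."""
    return (a + b - 1) // b


def add_program(pid, x, y, out, n, BLOCK_SIZE):
    """Emulate one program instance: add one tile of x and y into out."""
    # Offsets of the tile owned by this instance.
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    # The last tile may run past the end of the data, so mask those lanes off.
    mask = [off < n for off in offsets]
    for off, ok in zip(offsets, mask):
        if ok:
            out[off] = x[off] + y[off]
    return offsets, mask


def launch(x, y, BLOCK_SIZE=4):
    """Run every program instance in turn (a GPU would run them in parallel)."""
    n = len(x)
    out = [0] * n
    num_programs = cdiv(n, BLOCK_SIZE)
    for pid in range(num_programs):
        add_program(pid, x, y, out, n, BLOCK_SIZE)
    return out, num_programs


x = list(range(10))
y = [10 * v for v in x]
result, num_programs = launch(x, y, BLOCK_SIZE=4)
print(num_programs)  # 3 instances cover 10 elements in tiles of 4
print(result)
```

With 10 elements and tiles of 4, the third instance owns offsets 8 through 11, and the mask disables the two lanes that fall outside the data. A real kernel needs the same guard whenever the input length is not a multiple of the block size.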
Triton Language Primitives
Every Triton kernel uses a small, focused set of operations:
tl.program_id(axis): which program instance is this (like blockIdx in CUDA)
tl.arange(0, N): create a range of integers [0, 1, ..., N-1]
tl.load(ptr + offsets, mask): load a tile of memory into registers
tl.store(ptr + offsets, value, mask): write a tile back to memory
tl.dot(a, b): matrix multiply using Tensor Cores (fp16/bf16)
tl.max / tl.sum / tl.exp: reductions and element-wise ops
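To show how these primitives combine, here is a CPU-only emulation of a row-wise softmax in plain Python. It is a sketch, not Triton code: `softmax_program` and `next_power_of_2` are hypothetical helpers, one "program instance" is assumed per matrix row, and each comment names the Triton primitive that the line stands in for. Padded lanes read as negative infinity, mirroring the `other` argument of `tl.load`, so they contribute nothing to the max or the sum.

```python
import math


def next_power_of_2(n):
    """Smallest power of two >= n (tl.arange expects a power-of-two length)."""
    p = 1
    while p < n:
        p *= 2
    return p


def softmax_program(pid, matrix, out, n_cols, BLOCK_SIZE):
    """Emulate one program instance computing softmax over row `pid`."""
    row = matrix[pid]                                   # pid ~ tl.program_id(0)
    offsets = list(range(BLOCK_SIZE))                   # ~ tl.arange(0, BLOCK_SIZE)
    mask = [off < n_cols for off in offsets]
    # ~ tl.load(..., mask=mask, other=-inf): masked lanes read -inf
    tile = [row[off] if ok else -math.inf for off, ok in zip(offsets, mask)]
    row_max = max(tile)                                 # ~ tl.max
    numer = [math.exp(v - row_max) for v in tile]       # ~ tl.exp; exp(-inf) is 0.0
    denom = sum(numer)                                  # ~ tl.sum
    for off, ok in zip(offsets, mask):                  # ~ tl.store(..., mask=mask)
        if ok:
            out[pid][off] = numer[off] / denom


matrix = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
n_rows, n_cols = len(matrix), len(matrix[0])
out = [[0.0] * n_cols for _ in range(n_rows)]
BLOCK_SIZE = next_power_of_2(n_cols)  # 4 here, so one lane per row is masked
for pid in range(n_rows):             # one program instance per row
    softmax_program(pid, matrix, out, n_cols, BLOCK_SIZE)
print(out)
```

Subtracting the row maximum before exponentiating is the usual numerical-stability step. In a real kernel the whole row sits in registers between the load and the store, which is what makes a fused softmax cheaper than a chain of separate operations.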
What You'll Build in This Series
Vector addition kernel
Tiled GEMM with tl.dot
One-pass softmax kernel
Tiling + online softmax
Every code example in these tutorials has a ▶ Run button. Click it to open the code in the Python playground and execute it on a real NVIDIA T4 GPU instantly — no setup required.