Triton GPU Programming

Write custom GPU kernels in Python: often faster than stock PyTorch ops, simpler than CUDA C++

What is Triton?

Triton is an open-source language and compiler for GPU kernels, originally created by Philippe Tillet and now developed at OpenAI. Instead of writing CUDA C++ to program NVIDIA GPUs, you write Python with a small set of GPU-aware primitives. On NVIDIA hardware, Triton compiles your code to PTX (NVIDIA's low-level virtual instruction set) and, for many workloads, achieves performance comparable to hand-optimized CUDA kernels, with a fraction of the complexity.

PyTorch's torch.compile, through its default TorchInductor backend, generates Triton kernels on GPUs to fuse chains of operations into fewer kernels. When you write a Triton kernel, you're writing the same kind of code that powers modern deep learning frameworks under the hood.

Why Triton Matters in 2026

As LLMs and transformer models dominate AI workloads, engineers need custom kernels for operations like Flash Attention, fused softmax, and quantized matrix multiplications, which stock PyTorch operators don't always cover or fuse efficiently. Triton is a practical tool for this: kernels are written in Python, yet compile to GPU code that can approach hand-tuned performance.

Triton vs PyTorch vs CUDA C++

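| Aspect              | Triton                        | PyTorch      | CUDA C++                  |
|---------------------|-------------------------------|--------------|---------------------------|
| Language            | Python                        | Python       | C++                       |
| Abstraction level   | Tile-based                    | Tensor-based | Thread-based              |
| Control over memory | High                          | Low          | Full                      |
| Learning curve      | Medium                        | Easy         | Hard                      |
| Custom ops          | Easy                          | Painful      | Full control              |
| Used for            | Custom kernels, LLM inference | General ML   | Low-level GPU programming |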

The Core Mental Model: Programs, Not Threads

CUDA C++ asks you to think about individual threads — each thread has a unique ID and operates on a single element. You're responsible for computing memory offsets, handling warps, and managing shared memory manually.

Triton asks you to think about programs — each program instance operates on a tile (a contiguous block) of data. You load a block, operate on it, store it back. Triton handles thread organization and vectorization automatically.
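The contrast between the two models can be sketched in plain Python. This is a CPU-only illustration, not Triton or CUDA code: the function names and the tiny sizes are made up for the example, and the loops stand in for work that a GPU would run in parallel.

```python
# Plain-Python model of the two indexing styles (runs on CPU, no GPU needed).
# All names are illustrative; none of them belong to the Triton or CUDA APIs.

def cuda_style_add(x, y, threads_per_block=4):
    """Thread view: every (block, thread) pair computes ONE element."""
    n = len(x)
    out = [0.0] * n
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # per-thread offset
            if i < n:  # bounds check written by hand for each thread
                out[i] = x[i] + y[i]
    return out


def triton_style_add(x, y, block_size=4):
    """Program view: every program instance computes ONE TILE."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):
        start = pid * block_size
        stop = min(start + block_size, n)  # the last tile may be shorter
        # Load a tile, operate on the whole tile, store it back.
        x_tile = x[start:stop]
        y_tile = y[start:stop]
        out[start:stop] = [a + b for a, b in zip(x_tile, y_tile)]
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0] * 6
print(cuda_style_add(x, y) == triton_style_add(x, y))
```

Both functions produce the same result; the difference is the unit of work you reason about. In the thread view the innermost statement touches one element, while in the program view it touches a whole tile.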

Key Concept: BLOCK_SIZE

A central tuning parameter in most Triton kernels is the block size, conventionally named BLOCK_SIZE: how many elements each program instance handles. It is declared as a tl.constexpr so the compiler knows its value at compile time and can optimize memory access patterns and vectorization accordingly.
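The block size also determines how many program instances a launch needs. The sketch below works through that arithmetic in plain Python. The helper names cdiv and launch_plan are hypothetical (cdiv mirrors the ceiling division that triton.cdiv performs), and the power-of-two check reflects the requirement that tl.arange imposes on its length in current Triton releases.

```python
def cdiv(a, b):
    """Ceiling division, the same arithmetic triton.cdiv performs."""
    return (a + b - 1) // b


def launch_plan(n_elements, block_size):
    """Return (number of program instances, lanes masked off in the last tile)."""
    # A positive power of two has exactly one bit set.
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    num_programs = cdiv(n_elements, block_size)
    # Lanes that fall past the end of the data and must be masked.
    masked_lanes = num_programs * block_size - n_elements
    return num_programs, masked_lanes


print(launch_plan(98432, 1024))  # (97, 896)
print(launch_plan(1000, 256))    # (4, 24)
```

A larger block size means fewer program instances with more work each; a smaller one means more instances and finer-grained parallelism. Whenever the element count is not a multiple of the block size, the last instance has lanes past the end of the data, which is why Triton loads and stores take a mask.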

Triton Language Primitives

Every Triton kernel uses a small, focused set of operations:

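  • tl.program_id(axis): index of the current program instance along a grid axis (like blockIdx in CUDA)
  • tl.arange(0, N): create a range of integers [0, 1, ..., N-1]; N must be a power of two
  • tl.load(ptr + offsets, mask=mask): load a tile of memory into registers, skipping lanes where the mask is false
  • tl.store(ptr + offsets, value, mask=mask): write a tile back to memory, skipping masked-off lanes
  • tl.dot(a, b): matrix multiply of two tiles, using Tensor Cores where the hardware and data type allow (for example fp16/bf16)
  • tl.max / tl.sum / tl.exp: reductions (tl.max, tl.sum) and element-wise ops (tl.exp)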
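To see how these primitives combine, here is a plain-Python model of one program instance computing a row-wise softmax. The helpers arange, load, and store are hypothetical stand-ins that mimic the masked behavior of their tl counterparts on Python lists. A real Triton kernel would be decorated with @triton.jit and operate on pointers and GPU tiles, so treat this as a sketch of the data flow rather than working kernel code. It assumes block_size is at least n_cols.

```python
import math

NEG_INF = float("-inf")


def arange(start, end):
    """Stand-in for tl.arange: the integers start .. end-1."""
    return list(range(start, end))


def load(memory, offsets, mask, other=0.0):
    """Stand-in for tl.load: masked-off lanes get `other` and are never read."""
    return [memory[o] if m else other for o, m in zip(offsets, mask)]


def store(memory, offsets, values, mask):
    """Stand-in for tl.store: masked-off lanes are never written."""
    for o, v, m in zip(offsets, values, mask):
        if m:
            memory[o] = v


def softmax_program(pid, x, out, n_cols, block_size):
    """Work of ONE program instance: softmax over row `pid` of a flat row-major buffer."""
    cols = arange(0, block_size)                    # lane indices within the tile
    offsets = [pid * n_cols + c for c in cols]      # flat addresses for this row
    mask = [c < n_cols for c in cols]               # lanes past the row end are off
    tile = load(x, offsets, mask, other=NEG_INF)    # padding cannot win the max
    row_max = max(tile)                             # plays the role of tl.max
    numer = [math.exp(v - row_max) for v in tile]   # tl.exp; exp(-inf) is 0.0
    denom = sum(numer)                              # plays the role of tl.sum
    store(out, offsets, [v / denom for v in numer], mask)


def softmax_rows(x, n_rows, n_cols, block_size=4):
    out = [0.0] * len(x)
    for pid in range(n_rows):  # a GPU would run these instances in parallel
        softmax_program(pid, x, out, n_cols, block_size)
    return out


print(softmax_rows([0.0, 0.0, 0.0, 1.0, 2.0, 3.0], n_rows=2, n_cols=3))
```

The mask does two jobs: it keeps row 0 from reading the first element of row 1 as padding, and it keeps row 1 from touching memory past the end of the buffer. Loading masked lanes as negative infinity means they contribute zero after the exponential, so the sum is unaffected.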

What You'll Build in This Series

1. Introduction: Vector addition kernel
2. Matrix Multiply: Tiled GEMM with tl.dot
3. Fused Softmax: One-pass softmax kernel
4. Flash Attention: Tiling + online softmax

Interactive — Run Every Example on a Real GPU

Every code example in these tutorials has a ▶ Run button. Click it to open the code in the Python playground and execute it on a real NVIDIA T4 GPU instantly — no setup required.