Triton is an open-source GPU programming language developed by OpenAI that lets you write high-performance GPU kernels in Python. On NVIDIA hardware it compiles to PTX (NVIDIA's low-level GPU assembly) and can achieve performance comparable to hand-written CUDA C++.

Is Triton faster than PyTorch?

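It depends on the operation. For custom or fused operations, a Triton kernel can significantly outperform the equivalent chain of PyTorch ops, because you control tiling, memory access patterns, and fusion, and intermediate results never have to be written back to GPU memory. For a single standard op such as a plain matrix multiply, PyTorch already calls heavily tuned vendor libraries, so the gain is usually much smaller. PyTorch's torch.compile uses Triton internally for kernel fusion.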

Do I need to know CUDA to learn Triton?

No. Triton abstracts away much of CUDA's complexity. You write Python functions that take pointers to tensor data and use primitives such as tl.load and tl.store. Understanding the GPU memory hierarchy helps but is not required to start.

Triton GPU Programming

Write custom GPU kernels in Python — faster than PyTorch, simpler than CUDA C++

What is Triton?

Triton is an open-source GPU kernel language created by OpenAI. Instead of writing CUDA C++ to program NVIDIA GPUs, you write Python with a small set of GPU-aware primitives. Triton compiles your code to PTX (NVIDIA's low-level GPU assembly) and achieves performance that often matches hand-optimized CUDA kernels — with a fraction of the complexity.

PyTorch's torch.compile uses Triton internally to fuse operations into single kernels. When you write a Triton kernel, you're writing the same kind of code that powers modern deep learning frameworks under the hood.

Why Triton Matters in 2026

As LLMs and transformer models dominate AI workloads, engineers need custom kernels for operations like Flash Attention, fused softmax, and quantized matrix multiplications, which stock PyTorch ops do not always cover or fuse optimally. Triton is the practical tool for this: it's Python, it's fast, and it compiles to native GPU code.

Triton vs PyTorch vs CUDA C++

Aspect	Triton	PyTorch	CUDA C++
Language	Python	Python	C++
Abstraction level	Tile-based	Tensor-based	Thread-based
Control over memory	High	Low	Full
Learning curve	Medium	Easy	Hard
Custom ops	Easy	Painful	Full control
Used for	Custom kernels, LLM inference	General ML	Low-level GPU programming

The Core Mental Model: Programs, Not Threads

CUDA C++ asks you to think about individual threads: each thread has a unique ID and typically operates on one element or a small handful of them. You're responsible for computing memory offsets, reasoning about warps, and managing shared memory manually.

Triton asks you to think about programs — each program instance operates on a tile (a contiguous block) of data. You load a block, operate on it, store it back. Triton handles thread organization and vectorization automatically.
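To make the contrast concrete, here is a minimal plain-Python sketch of the two indexing styles. It is a CPU emulation only, not CUDA or Triton code, and the function names and sizes are illustrative assumptions.

```python
# CPU emulation of the two mental models; nothing here runs on a GPU.

def thread_style_indices(n_elements, threads_per_block):
    """CUDA-style view: one index per thread, built from block and thread IDs."""
    n_blocks = (n_elements + threads_per_block - 1) // threads_per_block
    touched = []
    for block_idx in range(n_blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # per-thread offset
            if i < n_elements:                              # per-thread bounds check
                touched.append(i)
    return touched


def program_style_tiles(n_elements, block_size):
    """Triton-style view: one tile of offsets per program instance."""
    n_programs = (n_elements + block_size - 1) // block_size
    tiles = []
    for pid in range(n_programs):
        offsets = [pid * block_size + k for k in range(block_size)]
        tile = [o for o in offsets if o < n_elements]  # mask off out-of-range lanes
        tiles.append(tile)
    return tiles


print(program_style_tiles(n_elements=10, block_size=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Both views cover the same elements. The difference is the unit you reason about: a single index in the first case, a whole tile in the second.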

Key Concept: BLOCK_SIZE

The most important tuning parameter in most Triton kernels is the block size, conventionally named BLOCK_SIZE: how many elements each program instance handles. It's declared as a tl.constexpr so the compiler knows its value at compile time and can optimize memory access patterns and vectorization accordingly. Because it sets the length passed to tl.arange, it must be a power of two.
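The block size also determines how many program instances get launched. The sketch below works through that arithmetic in plain Python. The helper names cdiv and launch_plan are illustrative (Triton ships its own triton.cdiv for the same ceiling division), and the input size of one million elements is an arbitrary assumption.

```python
def cdiv(a, b):
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def launch_plan(n_elements, block_size):
    """Return (number of program instances, masked-off lanes in the last tile)."""
    # Assumption: block_size is a power of two, as tl.arange requires.
    assert block_size > 0 and block_size & (block_size - 1) == 0
    n_programs = cdiv(n_elements, block_size)
    masked_lanes = n_programs * block_size - n_elements
    return n_programs, masked_lanes


for block_size in (256, 1024, 4096):
    print(block_size, launch_plan(1_000_000, block_size))
# 256 (3907, 192)
# 1024 (977, 448)
# 4096 (245, 3520)
```

Larger blocks mean fewer program instances but more work and more registers per instance, so the best value is usually found by measurement rather than by rule.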

Triton Language Primitives

Every Triton kernel uses a small, focused set of operations:

tl.program_id(axis) — which program instance is this (like blockIdx in CUDA)
tl.arange(0, N) — create a range of integers [0, 1, ..., N-1]
tl.load(ptr + offsets, mask) — load a tile of memory into registers
tl.store(ptr + offsets, value, mask) — write a tile back to memory
tl.dot(a, b) — matrix multiply using Tensor Cores (fp16/bf16)
tl.max / tl.sum / tl.exp — reductions and element-wise ops
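The way these primitives fit together can be sketched with a vector addition written in plain Python. This is a CPU emulation under stated assumptions, not Triton's API: the helpers load and store only mimic the masked behaviour of tl.load and tl.store, and the loop over pid stands in for program instances that would run in parallel on a GPU.

```python
# Plain-Python emulation of a masked, tiled vector add.
# Each helper mimics the Triton primitive named in its comment.

def load(buf, offsets, mask, other=0.0):
    # Like tl.load(ptr + offsets, mask): masked-off lanes read `other` instead.
    return [buf[o] if m else other for o, m in zip(offsets, mask)]


def store(buf, offsets, values, mask):
    # Like tl.store(ptr + offsets, value, mask): masked-off lanes are skipped.
    for o, v, m in zip(offsets, values, mask):
        if m:
            buf[o] = v


def add_kernel(pid, x, y, out, n_elements, block_size):
    # pid plays the role of tl.program_id(0).
    offsets = [pid * block_size + k for k in range(block_size)]  # like tl.arange
    mask = [o < n_elements for o in offsets]  # guard the partial last tile
    xs = load(x, offsets, mask)
    ys = load(y, offsets, mask)
    store(out, offsets, [a + b for a, b in zip(xs, ys)], mask)


def vector_add(x, y, block_size=4):
    n = len(x)
    out = [0.0] * n
    n_programs = (n + block_size - 1) // block_size
    for pid in range(n_programs):  # sequential here, parallel on a GPU
        add_kernel(pid, x, y, out, n, block_size)
    return out


print(vector_add([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]))
# [11.0, 22.0, 33.0, 44.0, 55.0]
```

With five elements and a block size of four, the second program instance has three masked-off lanes, which is exactly the case the mask argument exists for.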

What You'll Build in This Series

1. Introduction: Vector addition kernel

2. Matrix Multiply: Tiled GEMM with tl.dot

3. Fused Softmax: One-pass softmax kernel

4. Flash Attention: Tiling + online softmax
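As a preview of the idea behind the softmax and Flash Attention parts, here is a minimal plain-Python sketch of the generic online-softmax recurrence: the row is consumed one tile at a time while a running maximum and a rescaled running sum are maintained. It is not the kernel built in this series, and the function names and the block size are illustrative assumptions.

```python
import math


def softmax_reference(row):
    """Standard numerically stable softmax: needs the whole row up front."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def softmax_online(row, block_size=4):
    """Softmax whose max and sum are accumulated tile by tile."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(row), block_size):
        tile = row[start:start + block_size]
        new_max = max(running_max, max(tile))
        # Rescale the old sum to the new max, then add this tile's terms.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(v - new_max) for v in tile))
        running_max = new_max
    # Final normalization uses the statistics gathered above.
    return [math.exp(v - running_max) / running_sum for v in row]
```

The rescaling step is what lets each tile be processed without knowing the global maximum in advance, and the result should agree with softmax_reference up to floating-point rounding.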

Interactive — Run Every Example on a Real GPU

Every code example in these tutorials has a ▶ Run button. Click it to open the code in the Python playground and execute it on a real NVIDIA T4 GPU instantly — no setup required.