This article is a summary of the learning from @sotadeeplearningtutorials9598's YouTube tutorial. Thanks to the teacher's clear guidance, I, a novice who has never been exposed to GPU programming, was able to write my first effective kernel.
Softmax is a commonly used activation function, typically used in the output layer of neural networks for multi-class tasks. It converts a real-valued input vector into a probability distribution, ensuring that all output values are between 0 and 1 and that their sum equals 1. Karpathy describes it as squashing logits into a probability distribution between 0 and 1.
Formula:
$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
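As a quick sanity check of the formula, here is a minimal sketch in plain Python (standard library only). The function name `softmax` and the sample logits are chosen for illustration and are not part of the original tutorial.

```python
import math

def softmax(z):
    """Apply the formula literally: exponentiate each entry, divide by the sum."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)       # three values in (0, 1); the largest logit gets the largest share
print(sum(probs))  # approximately 1.0
```

The outputs preserve the ordering of the inputs, which is why the largest logit always receives the highest probability.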
Why Implement Softmax on GPU?#
GPUs excel at handling parallel computing tasks. Deep learning models often need to process large amounts of data and computations, and using a GPU can significantly increase computation speed.
Why Choose Triton?#
Triton is a compiler and programming language developed by OpenAI, designed to make it easier for developers to write high-performance kernels. It offers a high-level syntax similar to Python, reducing the complexity of GPU programming compared to CUDA, and PyTorch's support for Triton provides more opportunities for developers willing to contribute to the ecosystem.
The following image illustrates how Triton can balance performance and efficiency in kernel development:
Image source: Yang Jun's answer on Zhihu, "Some Understandings of OpenAI Triton"
The following operations were performed on WSL2 with Ubuntu 20.04 and Python 3.10:
import torch
import triton
import triton.language as tl
triton: The main library for Triton.
triton.language as tl: The programming language module of Triton, containing the functions and operations needed to write Triton kernels.
Basic Knowledge of GPU#
In GPU programming, a Kernel is a special function that defines the computational tasks to be executed in parallel. To efficiently utilize the GPU's parallel processing capabilities, this Kernel is broken down into multiple execution units called Blocks. This structure allows the GPU to process large amounts of data in a highly parallel manner, achieving significant performance improvements by breaking down a large computational task into many small parallel tasks.
Kernel: The core algorithm written by the programmer, describing the operations each parallel execution unit should perform. This code is designed to execute the same operation across many Threads.
Block: The GPU divides this Kernel task into multiple Blocks, each containing many Threads. These Threads run simultaneously, each processing a portion of the data while executing the same Kernel code.
In short, the Kernel defines "what to do," while Block and Thread determine "how to do it in parallel." This approach fully utilizes the hardware features of the GPU, achieving efficient parallel computation.
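The Kernel/Block/Thread split can be sketched in plain Python. This is a sequential emulation meant only to show the structure, not real parallel execution; the names `launch` and `scale_kernel` are hypothetical and do not come from Triton or CUDA.

```python
def launch(kernel, grid, *args):
    """Sequential stand-in for a GPU launch: run the same kernel once per block id."""
    for block_id in range(grid):
        kernel(block_id, *args)

def scale_kernel(block_id, src, dst, block_size):
    """'What to do': each block doubles its own slice of the data."""
    start = block_id * block_size
    for thread_id in range(block_size):   # threads inside one block
        i = start + thread_id
        if i < len(src):                  # guard the ragged last block
            dst[i] = 2 * src[i]

src = list(range(10))
dst = [0] * len(src)
block_size = 4
grid = (len(src) + block_size - 1) // block_size  # ceiling division
launch(scale_kernel, grid, src, dst, block_size)
print(grid, dst)
```

On a real GPU the two loops disappear: the blocks, and the threads within each block, run concurrently, and each one only needs its own id to find its share of the data.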
Softmax Implementation#
Eager Mode#
First, implement Softmax with ordinary PyTorch operations (eager mode) to serve as a reference for verifying the correctness of the other implementations:
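def naive_softmax(x: torch.Tensor) -> torch.Tensor:
    # Row-wise maximum; max() returns (values, indices), so take element [0]
    x_max = x.max(dim=1, keepdim=True)[0]
    # Subtract the maximum for numerical stability
    safe_x = x - x_max
    numerator = torch.exp(safe_x)
    denominator = numerator.sum(dim=1, keepdim=True)
    softmax_out = numerator / denominator
    return softmax_out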
- Each line of code is executed immediately, and its result is available right away, just like ordinary Python code.
- The computation graph is constructed dynamically: it is created and executed each time the code runs.
- In contrast, Graph Mode prebuilds a static computation graph and executes it afterwards, instead of running each operation immediately.
Numerical Stability#
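It's worth mentioning the step safe_x = x - x_max, which subtracts the row maximum so that all values become non-positive, preventing overflow in the computation of $e^x$ and improving numerical stability. The core idea is that subtracting the same constant $c$ from every input leaves the result unchanged, so the following equality holds (here with $c = \max_j z_j$):
$\text{Softmax}(z_i) = \frac{e^{z_i - c}}{\sum_{j=1}^{K} e^{z_j - c}} = \frac{e^{-c} e^{z_i}}{e^{-c} \sum_{j=1}^{K} e^{z_j}} = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$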
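To see the overflow concretely, here is a small sketch in plain Python. The function names `unsafe_softmax` and `safe_softmax` are illustrative. Note one difference from tensor libraries: Python's `math.exp` raises `OverflowError` for large inputs, whereas floating-point tensor code typically produces `inf` and then `nan` after the division.

```python
import math

def unsafe_softmax(z):
    exps = [math.exp(v) for v in z]          # overflows for large v
    total = sum(exps)
    return [e / total for e in exps]

def safe_softmax(z):
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]

try:
    unsafe_softmax(large)
    overflowed = False
except OverflowError:
    overflowed = True

print(overflowed)
print(safe_softmax(large))
print(safe_softmax(small))
```

Both inputs differ only by a constant shift of 999, so the stable version returns the same distribution for each, which is exactly the shift invariance used by the subtraction step.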
Triton Implementation#
The development of the kernel is actually divided into two parts: the kernel itself and the driver that enables parallelization, allowing it to handle a large number of instances simultaneously.

Driver Program: This is the Python code running on the CPU, used to prepare data, configure kernel parameters, and call the Triton kernel.

Operator: This is the GPU kernel written in Triton that performs the actual Softmax computation.
Driver Program#
Here, using a top-down learning approach, you first need a driver program that sets a lot of meta information, such as block size, shared memory allocation, etc.
def softmax(x: torch.Tensor) -> torch.Tensor:
    """Softmax implemented in Triton, forward pass only."""
    # Validate the shape before unpacking it
    assert x.dim() == 2, f"Expected 2D input, got {x.dim()}D input"
    rows, cols = x.shape
    # Calculate block_size, the smallest power of 2 greater than or equal to cols
    block_size = triton.next_power_of_2(cols)
    # Dynamically adjust num_warps based on block_size
    num_warps = 4  # Each warp has 32 threads
    if block_size > 2047:
        num_warps = 8
    if block_size > 4095:
        num_warps = 16
    # Define grid size, each thread block (Block) processes one row of data
    grid = (rows,)  # This creates a tuple containing only rows
    # Create an empty tensor with the same shape as the input tensor to store the output
    sm_out = torch.empty_like(x)
    # Call the Triton kernel (using square brackets to pass in grid, then passing parameters to the kernel)
    _softmax_fwd_kernel[grid](
        sm_out,
        sm_out.stride(0),
        x,
        x.stride(0),
        cols,
        block_size=block_size,
        num_warps=num_warps,
    )
    return sm_out
You may notice that the driver passes more arguments to the kernel than the kernel function declares; the reason is explained in the Meta Parameters section.
A clever point here is that GPUs typically perform best when processing data blocks whose sizes are powers of 2. Using next_power_of_2 rounds the column count up to the nearest power of 2, which helps optimize memory access patterns and alignment; tl.arange also requires its range length to be a power of 2. Additionally, num_warps is adjusted according to block_size: fewer warps for smaller problems to avoid wasting resources, and more warps for larger ones to fully utilize the GPU's parallel capabilities.
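The rounding and the warp heuristic can be tried out without a GPU. The sketch below uses a pure-Python stand-in for `triton.next_power_of_2` (valid for inputs of at least 1) and a hypothetical helper `pick_num_warps` that mirrors the thresholds used in the driver.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of 2 >= n, for n >= 1 (stand-in for triton.next_power_of_2)."""
    return 1 << (n - 1).bit_length()

def pick_num_warps(block_size: int) -> int:
    """Same thresholds as the driver: 4 warps by default, 8 from 2048, 16 from 4096."""
    if block_size > 4095:
        return 16
    if block_size > 2047:
        return 8
    return 4

for cols in (1, 5, 1000, 2048, 5000):
    block_size = next_power_of_2(cols)
    print(cols, block_size, pick_num_warps(block_size))
```

For example, a row of 1000 columns is padded to a block of 1024 lanes, and the 24 extra lanes are the ones the mask later switches off.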
Operator (Triton Kernel)#
The kernel performs the actual computation on the GPU.
Decorator#
To develop a kernel in Triton, you need to use the @triton.jit decorator to enter the Triton compiler.
@triton.jit
def _softmax_fwd_kernel():
    pass
Kernel Parameters#
output_ptr: The starting address of the output tensor in memory.
stride_output_row: The stride of the output tensor in the row direction (i.e., the distance in memory between consecutive rows).
input_ptr: The starting address of the input tensor in memory.
stride_input_row: The stride of the input tensor in the row direction.
num_cols: The number of columns in the input tensor.
block_size: tl.constexpr: Block size, a compile-time constant that determines the number of elements processed by each thread block.
Get the Row Index of the Current Thread Block#
Get the ID of the current thread block in the 0th dimension (row dimension), i.e., the row index being processed.
row_index = tl.program_id(0)
Calculate the Data Pointer for the Current Row#
row_start_ptr = input_ptr + row_index * stride_input_row
col_offsets = tl.arange(0, block_size)
input_ptrs = row_start_ptr + col_offsets
row_start_ptr: The starting address of the current row in memory.
col_offsets: Generates a sequence from 0 to block_size - 1, representing the column offsets.
input_ptrs: The address of each element in the current row in memory.
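The pointer arithmetic can be mimicked with a flat Python list standing in for GPU memory. This is a sketch under the assumption of a contiguous row-major tensor; integer indices play the role of addresses, and the sizes are made up for illustration.

```python
rows, cols = 3, 5
block_size = 8                     # next power of 2 >= cols
stride_row = cols                  # contiguous row-major: rows are `cols` elements apart
buffer = list(range(rows * cols))  # flat "memory": element (r, c) sits at r * stride_row + c

row_index = 1                                     # what tl.program_id(0) would return
row_start = 0 + row_index * stride_row            # base address 0 plus the row offset
col_offsets = list(range(block_size))             # analogue of tl.arange(0, block_size)
addresses = [row_start + c for c in col_offsets]  # one address per lane

print(addresses)
valid = [a for a, c in zip(addresses, col_offsets) if c < cols]
print([buffer[a] for a in valid])
```

Because block_size exceeds the number of columns, the last three addresses spill into the next row (and, for the final row, past the end of the buffer), which is the problem the mask in the next step addresses.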
Create a Mask#
A mask is used to avoid out-of-bounds access during parallel computation. When the number of elements being processed is not a multiple of the thread block size, a mask is used to shield invalid threads.
Here, when the number of columns is less than block_size, a mask is needed to avoid accessing out-of-bounds memory addresses.
mask = col_offsets < num_cols
Load Data from Global Memory to Shared Memory (SRAM)#
row = tl.load(input_ptrs, mask=mask, other=float("-inf"))
tl.load: The API for loading data from memory.
mask: Indicates which addresses are valid.
other=float("-inf"): For invalid addresses, fill with negative infinity so that the padding does not affect the maximum computed in subsequent steps and contributes exp(-inf) = 0 to the sum.
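Why negative infinity is a safe fill value can be checked with a pure-Python emulation of a masked load. The helper `masked_load` is hypothetical and only imitates the behaviour described above; it is not Triton's implementation.

```python
import math

def masked_load(buffer, addresses, mask, other):
    """Return buffer[a] where the mask is set and `other` elsewhere (never touching buffer)."""
    return [buffer[a] if m else other for a, m in zip(addresses, mask)]

buffer = [0.5, -1.0, 2.0]                 # one row with 3 valid columns
block_size = 4                            # padded to the next power of 2
col_offsets = list(range(block_size))
mask = [c < len(buffer) for c in col_offsets]

row = masked_load(buffer, col_offsets, mask, other=-math.inf)
row_max = max(row)                        # padding can never win the max
exps = [math.exp(v - row_max) for v in row]
print(row)
print(row_max, exps[-1])                  # the padded lane contributes 0.0 to the sum
```

With positive infinity as the fill value instead, the row maximum itself would become infinite and the subsequent subtraction would no longer produce meaningful values.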
Softmax Calculation#
Use Triton's parallel reduction and element-wise APIs: subtract the row maximum, exponentiate, sum the exponentials, and divide the numerator by the denominator element-wise to obtain the Softmax output.
row_max = tl.max(row, axis=0)
safe_row = row - row_max
numerator = tl.exp(safe_row)
denominator = tl.sum(numerator, axis=0)
sm_output = numerator / denominator
Write Results Back to Global Memory#
output_row_ptr = output_ptr + row_index * stride_output_row
output_ptrs = output_row_ptr + col_offsets
tl.store(output_ptrs, sm_output, mask=mask)
output_row_ptr: The starting address of the current row in the output tensor.
output_ptrs: The address of each element in the current row of the output tensor.
tl.store: Writes the results back to memory, using the same mask as the load to ensure only valid data is written back.
Overall, our kernel looks like this:
@triton.jit
def _softmax_fwd_kernel(
    output_ptr,
    stride_output_row,
    input_ptr,
    stride_input_row,
    num_cols,
    block_size: tl.constexpr,
):
    # Get the ID of the current program (row index)
    row_index = tl.program_id(0)
    # Calculate the starting pointer for the current row
    row_start_ptr = input_ptr + (row_index * stride_input_row)
    col_offsets = tl.arange(0, block_size)
    input_pointers = row_start_ptr + col_offsets
    # Create a mask to prevent out-of-bounds access
    row_mask = col_offsets < num_cols
    # Load data from global memory to shared SRAM; padded lanes get -inf
    row = tl.load(input_pointers, mask=row_mask, other=float("-inf"))
    # Softmax calculation
    safe_row = row - tl.max(row, axis=0)
    numerator = tl.exp(safe_row)
    denominator = tl.sum(numerator, axis=0)
    sm_out = numerator / denominator
    # Write results back to global memory
    output_row_ptr = output_ptr + (row_index * stride_output_row)
    output_pointers = output_row_ptr + col_offsets
    tl.store(output_pointers, sm_out, mask=row_mask)
Interaction Between Driver Program and Operator#
Grid & Block#
In our driver program code:
grid = (rows,): Defines the grid size, which is one-dimensional with rows Blocks, each processing one row of the input tensor.
Parameter Passing#
When we call the kernel, we actually pass the following parameters to enable the kernel to correctly locate and process the input and output data:
_softmax_fwd_kernel[grid](
    sm_out,            # Pointer to the output tensor
    sm_out.stride(0),  # Stride of the output tensor in the row direction
    x,                 # Pointer to the input tensor
    x.stride(0),       # Stride of the input tensor in the row direction
    cols,              # Number of columns in the input tensor
    # Kernel configuration parameters
    block_size=block_size,
    num_warps=num_warps,
)
Kernel Execution#
Each thread block processes one row of data. By using row_index = tl.program_id(0), each thread block knows which row it should process.
Multiple thread blocks on the GPU execute simultaneously, allowing multiple rows of data to be processed in parallel, greatly speeding up computation.
Special API Review#
tl.arange(start, end): Generates a sequence from start to end - 1, used to create column offsets.
tl.program_id(axis): Gets the ID of the current thread block in the specified dimension.
tl.constexpr: Indicates a constant known at compile time, used for optimization.
Benchmark#
See the complete code at: triton_kernels_for_fun_and_profit/demos/demo_softmax.py
Performance on 3090 Ti (GB/s)
In the original video, the Triton kernel was nearly three times faster than Torch Native. As of September 2024, Triton is still slightly faster than Torch Native and is very stable.
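For readers who want to reproduce a GB/s figure, one common way to derive it is sketched below. It assumes that a softmax pass reads every element once and writes every element once; the helper name `softmax_gbps` and the sample timing are hypothetical and are not measurements from this article.

```python
def softmax_gbps(rows: int, cols: int, element_size_bytes: int, elapsed_ms: float) -> float:
    """Effective memory bandwidth, assuming one read and one write per element."""
    bytes_moved = 2 * rows * cols * element_size_bytes
    seconds = elapsed_ms * 1e-3
    return bytes_moved / seconds / 1e9

# Hypothetical example: a 4096 x 1024 float32 tensor processed in 0.1 ms
print(softmax_gbps(4096, 1024, 4, 0.1))
```

Softmax is memory-bound, so a bandwidth figure like this is usually a more informative yardstick than raw elapsed time when comparing kernels across tensor sizes.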
Meta Parameters#
Do you remember that we mentioned earlier that the parameters passed to the kernel in the driver program exceed the number declared in the kernel function?
# Driver
_softmax_fwd_kernel[grid](
    sm_out,
    sm_out.stride(0),
    x,
    x.stride(0),
    cols,
    block_size=block_size,
    num_warps=num_warps,
)
# Kernel
@triton.jit
def _softmax_fwd_kernel(
    output_ptr,
    stride_output_row,
    input_ptr,
    stride_input_row,
    num_cols,
    block_size: tl.constexpr,
):
    ...  # body omitted
You can see that the driver passes 7 arguments, while the kernel declares only 6.
The reason is that some of these arguments are reserved keywords in Triton, also known as meta-parameters.
triton/python/triton/runtime/interpreter.py shows that there are actually 6 reserved keywords.
Upon further research, it can be found that these keywords are filtered out from the parameters during the subsequent GridExecutor call:
class GridExecutor:
    """Omitted initialization and other parts"""

    def __call__(self, *args_dev, **kwargs):
        # removes reserved keywords from kwargs
        kwargs = {k: v for k, v in kwargs.items() if k not in RESERVED_KWS}
The Triton compiler absorbs these parameters, which is why the parameter count does not match.
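The filtering step can be reproduced in isolation. In the sketch below, the contents of `RESERVED_KWS` are an assumption assembled from the keyword names discussed in this article (the real list lives in Triton's source and may differ between versions), and `split_launch_kwargs` is a hypothetical helper, not a Triton API.

```python
# Assumed for illustration; check Triton's source for the list in your version.
RESERVED_KWS = ["num_warps", "num_stages", "num_ctas", "enable_fp_fusion", "grid", "maxnreg"]

def split_launch_kwargs(**kwargs):
    """Separate arguments meant for the kernel from launch options consumed by the runtime."""
    kernel_kwargs = {k: v for k, v in kwargs.items() if k not in RESERVED_KWS}
    launch_options = {k: v for k, v in kwargs.items() if k in RESERVED_KWS}
    return kernel_kwargs, launch_options

kernel_kwargs, launch_options = split_launch_kwargs(block_size=1024, num_warps=4)
print(kernel_kwargs)   # only block_size reaches the kernel signature
print(launch_options)  # num_warps is consumed as a launch option
```

This is why block_size must appear in the kernel's parameter list while num_warps must not.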
Triton Reserved Keywords#
Num Warps: The number of warps (bundles of threads) used by the kernel on the GPU; each warp has 32 threads.
Num Stages: Determines the number of stages allocated by the compiler for software-pipelining loops. Mainly used for operations like matrix multiplication on SM80+ (Ampere) architecture GPUs. Pipelining allows multiple loop iterations to execute simultaneously, with each iteration partially overlapping to improve computational performance (a memory from CSAPP resurfaces).
Num CTAS: The number of thread blocks (CTAs) in a block cluster; only relevant on SM90+ (Hopper) GPUs.
Warps Specialization (bool) (now deprecated): Also known as Spatial Partitioning, a technique that allows Warps to perform independent computations. When enabled, multiple Warps can execute different tasks in parallel without synchronizing to execute the same instructions, as used in producer/consumer patterns. This has now been replaced by the three keywords below in Triton.
enable_fp_fusion: Enables floating-point operation fusion, merging multiple floating-point operations to execute in the same pipeline, further enhancing performance and reducing the overhead of multiple executions.
grid: Controls the grid structure of the Triton kernel.
maxnreg: Used to control the maximum number of registers that each thread can use.
References#
Thanks to:
- The tutorial this article is based on: SOTA Deep Learning Tutorials - YouTube
- Yang Jun's Some Understandings of OpenAI Triton
- Some other excellent articles on Zhihu
- Triton's documentation
- o1-preview language model