POWERED BY TRITON

Our compiler brings Triton to your chips

A wave of NPU hardware specialized for LLM inference is challenging GPUs. We are building a platform based on the Triton language and compiler that will work across all of them.


BENEFITS

Open by design

Built on open, industry-standard AI infrastructure, with deep compiler expertise and strong roots in the Triton ecosystem.

Industry Standard

Open-source code built on industry-standard AI infrastructure prevents lock-in and falling behind.

Compiler Experts

Our team has decades of experience building compilers for GPU and NPU AI hardware.

Triton Community

Strong roots in the Triton ecosystem keep our work close to the open-source community behind the language and compiler.

HOW IT WORKS

How Triton works

Triton generates kernels that integrate with your existing kernels or internal representations. Some models may not need any Triton.

Matmul → SoftMax → Matmul

SoftMax has been replaced with a Triton-generated operator to improve performance.

The Triton compiler can perform complex operator fusion to further improve performance.

The Triton language allows complex kernels to be written by hand that are portable to any device supported by the Triton compiler.
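
To make the last point concrete, here is a minimal sketch of a hand-written, row-wise SoftMax kernel in the style of the upstream Triton tutorials. It is illustrative only: the kernel name, parameters, and BLOCK_SIZE choice are assumptions for this example, not the operator our compiler emits.

import triton
import triton.language as tl

@triton.jit
def softmax_kernel(
    out_ptr, in_ptr,
    in_row_stride, out_row_stride,
    n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    # Each program normalizes one row of the input matrix
    row_idx = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols

    # Load the row, padding out-of-bounds columns with -inf
    row = tl.load(in_ptr + row_idx * in_row_stride + col_offsets,
                  mask=mask, other=float('-inf'))

    # Numerically stable softmax: subtract the row max before exponentiating
    row = row - tl.max(row, axis=0)
    numerator = tl.exp(row)
    result = numerator / tl.sum(numerator, axis=0)

    # Write the normalized row back
    tl.store(out_ptr + row_idx * out_row_stride + col_offsets, result, mask=mask)

Because the kernel is expressed in Triton rather than a vendor-specific language, the same source can be compiled for any device the Triton compiler supports.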


WHY KERNELIZE & TRITON

The LLM Optimization Layer

Triton is foundational for LLMs, combining the accessibility and optimization that are critical for all inference hardware.

EXAMPLE: MATRIX MULTIPLICATION

import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_SIZE_M: tl.constexpr,
    BLOCK_SIZE_N: tl.constexpr,
    BLOCK_SIZE_K: tl.constexpr,
):
    # Program ID: map the 1D launch grid onto a 2D grid of output tiles
    pid = tl.program_id(0)
    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n

    # Offsets for the rows of A and columns of B handled by this program
    offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
    offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
    offs_k = tl.arange(0, BLOCK_SIZE_K)

    # Pointers to the first K-block of A and B
    a_ptrs = a_ptr + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)
    b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)

    # Initialize accumulator
    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)

    # Compute: iterate over K one block at a time
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
        accumulator += tl.dot(a, b)
        # Advance pointers to the next K-block
        a_ptrs += BLOCK_SIZE_K * stride_ak
        b_ptrs += BLOCK_SIZE_K * stride_bk

    # Store result, masking rows and columns that fall outside C
    offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    c_ptrs = c_ptr + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, accumulator, mask=c_mask)

Complete Triton kernel example showing matrix multiplication.
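
For completeness, here is one way the kernel above might be launched from PyTorch. This host-side wrapper is a sketch: the matmul function name, the fixed block sizes, and the assumption that the inputs already live on a Triton-supported device are illustrative choices, and in practice the block sizes would be tuned or autotuned per target.

import torch
import triton

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a is (M, K), b is (K, N); the result c is (M, N)
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)

    # Launch one program per (BLOCK_SIZE_M x BLOCK_SIZE_N) tile of c
    grid = lambda meta: (
        triton.cdiv(M, meta['BLOCK_SIZE_M']) * triton.cdiv(N, meta['BLOCK_SIZE_N']),
    )
    matmul_kernel[grid](
        a, b, c,
        M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_SIZE_M=64, BLOCK_SIZE_N=64, BLOCK_SIZE_K=32,
    )
    return c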


Model innovation

Triton is essential for rapid experimentation and iteration for new model concepts and research prototypes.

Great compatibility

Easily add Triton-generated kernels only where they are needed. Complements your existing software.

Optimized kernels

Modern LLMs embed optimized Triton kernels for maximum inference performance and efficiency.

Specialized for LLMs

Projects from Liger Kernel to vLLM use Triton to create specialized kernels, maximizing LLM operation performance.

Accessible Parallelism

Triton’s Python-like syntax makes parallel programming accessible without hardware expertise (see the sketch after these cards).

Hardware Agnostic Compiler

An open-source compiler that generates optimized code across architectures and is future-proof by design.
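
The sketch below shows roughly what that accessibility looks like in practice: the canonical vector-add kernel from the Triton tutorials, written entirely in Python-like code. It is a generic illustration, not Kernelize-specific code.

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program handles one contiguous block of elements
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    # Load both inputs, add them, and store the result, masking out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)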

COMPARISON

AI Inference, Simplified

Kernelize removes GPU lock-in and backend complexity so teams can deploy AI inference faster across more hardware.

Before Kernelize

Locked to GPU-centric inference stacks.

Slow, manual support for new models and hardware.

Fragile backends that are costly to maintain.

With Kernelize

Run AI models across CPUs, GPUs, NPUs, and accelerators.

Support new models on new hardware from day one.

Built on open, industry-standard infrastructure.

Get optimized, production-grade performance.

Reduce engineering effort and infrastructure cost.

Get Started

Talk to the Kernelize team

Tell us about your inference stack and hardware needs. We’ll help you evaluate how Kernelize can support your models across more hardware, faster.

Kernelize

Copyright Kernelize 2025. All rights reserved.