The NPU Software Gap: How to Build Compatibility for NPU Migration

The Path to Efficient AI, Part 2

June 18, 2025 · 10 min read · By Simon & Bryan
[Image: The AI Hardware Kingdoms - CPU, GPU, and NPU castles]

Introduction

The AI hardware market has many kingdoms, and each kingdom is shouting “democratize over here!” Search for “Democratizing AI” and you will find hundreds of blog posts and product pitches from the past few years. Few of those posts talk about compatibility across a wide range of models and devices. In many cases, the only compatibility between AI hardware vendors is the list of overlapping models in their model zoos.

In the previous post, we discussed how the new hardware architectures in NPUs will dramatically improve efficiency for AI workloads. Many NPUs are already available and show huge improvements for a small subset of AI models. For example, Amazon claims a 70% cost reduction with Inferentia over GPUs of similar performance. But migration to new NPU hardware is slow, because AI models are difficult to port from GPUs and must be customized for each NPU solution.

In this post, we will focus on adding compatibility to the AI software ecosystem without major disruption to the existing compilers and runtimes. The first challenge is recognizing what truly needs to be compatible. The second challenge, which we will cover in the next post, is finding an open-source community to support standardization at those points.

Without compatibility, most companies are taking a conservative approach. Porting models to new hardware is risky and expensive, so many companies are choosing to pay more for inference for now and wait to see which NPU companies survive.

Mass migration to NPUs requires compatibility at only three key points: model language, abstract kernel language and runtime API. Compatibility at these three points does not require the core hardware or compilers to be the same. More importantly, standard open-source software already exists in the GPU space that can be used to drive NPU programming standardization and migration.

Each NPU Requires Specialized Software

Building one software stack for all AI hardware is not going to work!

NPU hardware is, by design, better at running AI workloads. Depending on the vendor, there can be custom hardware for memory-heavy operations like embedding lookups, specialized data movement between processing elements, or hardware focused on matrix multiplication. Each piece of custom hardware adds more complexity and requires specialization in the software.

The goal for each NPU software team is to find the most efficient way to target their hardware. As NPU companies focus on winning customers, software is built to optimize for those customers' workloads. NPUs need custom software, low-level kernel libraries, and runtimes to unlock the very best efficiency.

Graph compilers are based on one of the many open-source frameworks (PyTorch, TVM, IREE, XLA), or they can be custom, possibly built on top of LLVM or MLIR. Common open-source compiler frameworks allow dozens of small compiler teams to leverage and share common work. However, common code in the compilers does not lead to one monolithic compiler, or even to compatibility between compilers.

NPU-specific kernel languages are also essential to unlock the very best performance. Framework compilers, like TorchInductor, select the best kernel implementation by comparing high-level Triton kernels against device-specific kernels. Some compilers might instead build their low-level kernels in an internal representation or directly in the compiler itself. We will discuss the need for abstract kernels in the next section.
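For readers who want to see where this selection happens, here is a minimal sketch using PyTorch's public torch.compile entry point. The function and shapes are placeholders; the Triton-versus-library comparison described above happens inside TorchInductor, mainly under its "max-autotune" mode.

```python
import torch

def fused_op(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # A matmul followed by an activation: a pattern Inductor can fuse.
    return torch.nn.functional.relu(x @ w)

# TorchInductor traces the function; on GPU backends it generates Triton
# kernels, and in "max-autotune" mode it benchmarks candidate Triton
# kernels against device-specific library kernels before picking one.
compiled_op = torch.compile(fused_op, mode="max-autotune")

x = torch.randn(256, 512)
w = torch.randn(512, 128)
out = compiled_op(x, w)  # the first call triggers compilation and autotuning
```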

Close-to-the-metal runtime software is also specialized based on NPU features. We are not aware of open-source libraries that help in the development of the lowest level of runtimes; perhaps there is no need for common libraries at something so close to the hardware.

There is no need for every compiler team to come together on one unified platform or universal compiler.

Only Three Compatibility Points are Required for Broad NPU Migration

Each NPU hardware device has unique features, and each NPU's software team needs to support those unique features. Software fragmentation is required to give the best hardware support. Easy migration of AI models between devices does not require changing existing software architectures; it only requires adding compatibility at three points.

Three Compatible Software Stacks

  • Each stack can remain optimized for its specific hardware
  • Compatible interfaces allow models to run across different architectures
  • Reduces migration risk and enables hardware diversity

The first compatibility point is the model language. Whatever languages or standards are developed for describing models need to be supported by all of the compiler toolchains. The front-end languages (PyTorch, ONNX, TensorFlow) of the AI compiler toolchain are open source and easily accessible. If there is interest, we can cover model languages and storage formats in a future post.
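As a quick concrete illustration of this compatibility point, the sketch below exports a small PyTorch model to ONNX; the model, file name, and shapes are placeholders. Once a model is expressed in a shared model language like this, any toolchain that accepts ONNX can compile it without the original framework being present.

```python
import torch
import torch.nn as nn

# A toy model standing in for any production network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

example_input = torch.randn(1, 128)

# Export to ONNX, a vendor-neutral model language. An NPU vendor's
# compiler can now consume model.onnx directly.
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size
)
```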

The second compatibility point is an abstract kernel language. Even the largest companies could be disrupted by a new kind of model. Attention revolutionized LLMs, and the pace of model innovation has only accelerated over the past seven years. Even if the fundamental models do not change, the way models are mapped onto hardware has a huge impact on performance. Model innovation cannot be limited by the speed of compiler development or by traditional hardware architectures.

Abstract kernel languages, like Triton, were created to fill the model-innovation gap. Instead of building all the intelligence into the compiler, Triton provides a hardware-agnostic kernel language for user-level optimization. The advantage of abstract kernel languages for NPUs is that they are portable across architectures.
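To show what “hardware agnostic” means here, below is the canonical Triton vector-addition kernel. The kernel body describes blocks of work with masked loads and stores but names no specific device; that is exactly what would let an NPU backend map the same source onto its own hardware. (Mainline Triton targets GPUs today; NPU backends are the point of this section, not an existing product.)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block; how blocks map onto the
    # hardware's parallelism is the backend compiler's decision.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Usage (a current Triton install requires CUDA tensors):
# a = torch.rand(4096, device="cuda"); b = torch.rand(4096, device="cuda")
# assert torch.allclose(add(a, b), a + b)
```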

Once an abstract kernel has been evaluated on an NPU, teams can move forward knowing that the change works at scale. The low-level kernel team or compiler team for each NPU can then add to or enhance their own flow. There are many examples in the GPU world, such as Stream-K, where the community has driven development of new optimizations that are now part of device-specific libraries and kernels.

The third compatibility point is the runtime API. Just as with kernels, each vendor needs a runtime environment that is unique to its hardware; a common runtime API then provides compatibility between inference engines or ML frameworks and the different architectures. The closest examples today, such as llama.cpp for LLMs and whisper.cpp for speech recognition (ASR), are specific to model types and require new backend code for every vendor they support.
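To make the idea concrete, here is a minimal, entirely hypothetical sketch of what a vendor-neutral runtime API could look like. The Runtime class and its methods are our invention for illustration, not an existing standard; the point is that the inference engine codes against the interface, not against any single vendor's SDK.

```python
from abc import ABC, abstractmethod
from typing import Sequence
import numpy as np

class Runtime(ABC):
    """Hypothetical vendor-neutral runtime API. Each NPU vendor would
    ship its own implementation behind this interface."""

    @abstractmethod
    def load(self, compiled_model: bytes) -> None:
        """Load a model already compiled by the vendor's own toolchain."""

    @abstractmethod
    def execute(self, inputs: Sequence[np.ndarray]) -> Sequence[np.ndarray]:
        """Run one inference and return the outputs."""

# An inference engine written against Runtime runs unchanged on any
# device whose vendor provides a conforming implementation.
def run_once(rt: Runtime, model_path: str, x: np.ndarray) -> np.ndarray:
    with open(model_path, "rb") as f:
        rt.load(f.read())
    return rt.execute([x])[0]
```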

These three compatibility points need constant attention, because the software community must keep pace with a rapidly changing ML environment. A few competing standards for compatibility are likely, but each vendor cannot have its own. In our next post, we will discuss why a strong open-source community is needed at each compatibility point.

Conclusion

NPU hardware has arrived, but there is a software compatibility gap that is blocking broad migration. Software that can give the best efficiency for each NPU is an essential part of the solution, but efficiency is not enough. Each NPU software stack needs to be compatible with other software stacks at three compatibility points: model language, abstract kernel language and runtime API. More importantly, each NPU needs to be compatible with GPUs at the same three points.

There is an often-overlooked risk: because models are developed based on the hardware that is available, revolutionary model changes will be delayed by limited hardware options in a consolidated market. A diverse set of compatible hardware and software is needed to truly move all of AI forward.

AI innovation will outrun any team that works in isolation. The hardware and software being developed and released today will be outdated tomorrow. We are at a time in AI where one great idea can push everything in a whole new direction. In our next post we will discuss how community beats code in the rapidly evolving world of AI.

Previous: Beyond CPUs and GPUs
Next: Community Beats Code (Coming June 25, 2025)