ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants
Summary: arXiv:2604.18616v1 Announce Type: cross
Large language model (LLM)-based coding agents can now generate functionally correct GPU kernels. However, these kernels often fall short of hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinating several tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling. Existing agents rely on sparse pass/fail feedback, which leaves them unable to diagnose global constraint violations.
To address this, we present Argus, an agentic framework built on data-flow invariants: compile-time specifications of how data must be choreographed throughout GPU kernel execution. Argus introduces a tile-based, Pythonic domain-specific language (DSL) that exposes hardware instructions and compiler policies while hiding low-level representations. The DSL offers:
- Tag Functions: These propagate symbolic annotations through both data and control flow.
- Tag Assertions: These enforce relational constraints at specific use sites.
When a violation occurs, the compiler returns a concrete counterexample identifying the thread, data element, and program point involved, giving the agent dense, structured feedback for targeted fixes. Invariants are verified at compile time through abstract interpretation over a layout algebra and SMT solving, so verification adds no runtime overhead. An in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques.
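The idea of tag propagation and tag assertions can be illustrated with a small Python model. This is a minimal sketch under stated assumptions, not the Argus DSL: every name in it (`stage_to_shared`, `check_reads`, `Counterexample`, `row_major`, `col_major`) is hypothetical, and exhaustive enumeration over a tiny tile stands in for the abstract interpretation and SMT solving described above. Each element is tagged with its (row, col) origin, the tag travels with the element into a simulated shared-memory buffer, and an assertion at the load site checks that each thread reads the element it expects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Tag = Tuple[int, int]  # (row, col) of the element in the source tile
AddrFn = Callable[[int, int], int]  # maps (row, col) to a flat address


@dataclass(frozen=True)
class Counterexample:
    """Hypothetical report: which thread, where, and which element was wrong."""
    thread: int
    program_point: str
    expected: Tag
    found: Tag


def stage_to_shared(rows: int, cols: int, store_addr: AddrFn) -> Dict[int, Tag]:
    """Tag function (illustrative): label each element with its origin and
    carry that label through the copy into simulated shared memory."""
    shared: Dict[int, Tag] = {}
    for r in range(rows):
        for c in range(cols):
            shared[store_addr(r, c)] = (r, c)
    return shared


def check_reads(shared: Dict[int, Tag], rows: int, cols: int,
                load_addr: AddrFn) -> Optional[Counterexample]:
    """Tag assertion (illustrative): thread r at load step c must read the
    element tagged (r, c). Returns the first violation found, or None."""
    for r in range(rows):
        for c in range(cols):
            found = shared[load_addr(r, c)]
            if found != (r, c):
                return Counterexample(thread=r, program_point=f"load step {c}",
                                      expected=(r, c), found=found)
    return None


ROWS, COLS = 2, 4


def row_major(r: int, c: int) -> int:
    return r * COLS + c


def col_major(r: int, c: int) -> int:
    return c * ROWS + r


shared = stage_to_shared(ROWS, COLS, row_major)
ok = check_reads(shared, ROWS, COLS, row_major)   # store and load layouts agree
bad = check_reads(shared, ROWS, COLS, col_major)  # load layout disagrees with store
print(ok)
print(bad)
```

In this toy model the mismatched layout yields a report naming a thread, a program point, and the wrongly read element, which is the kind of structured feedback a pass/fail test cannot give. A real checker would reason symbolically rather than enumerate every thread and element.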
We evaluate Argus on the AMD MI300X GPU using GEMM, flash attention, and MoE kernels, which account for over 90% of GPU time in LLM inference. The kernels generated by Argus achieve:
- 99-104% of the throughput of state-of-the-art hand-optimized assembly.
- Speedups of 2x to 1543x over existing agentic systems.
Argus also generalizes across 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems.
