ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants
arXiv:2604.18616v1
Large Language Model (LLM)-based coding agents have made significant strides in generating functionally correct GPU kernels. However, these kernels often fall short of hand-optimized libraries, particularly on crucial computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires combining tightly coupled optimizations such as tiling, shared-memory staging, software pipelining, and instruction scheduling. Current agents typically rely on sparse pass/fail feedback during this process, which hampers their ability to identify and resolve global constraint violations.
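To make tiling and staging concrete, the following is a minimal CPU-side sketch in plain Python. It is not Argus output or GPU code: the function names and the `tile` parameter are assumptions chosen for illustration, and the local `a_tile`/`b_tile` copies merely stand in for shared-memory staging.

```python
def matmul_naive(a, b):
    """Reference product of two matrices given as lists of rows."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)]
            for row in a]


def matmul_tiled(a, b, tile=2):
    """Same product, computed one tile at a time.

    Each (i0, j0, p0) step copies a block of `a` and a block of `b` into
    small local buffers before using them, mimicking how a GPU kernel
    stages tiles in shared memory and reuses them across many multiplies.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Stage the two input tiles (slices clamp at the edges).
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                # Accumulate the partial product of the staged tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0
                        for p, a_val in enumerate(a_row):
                            acc += a_val * b_tile[p][j]
                        c[i0 + i][j0 + j] += acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = matmul_tiled(a, b, tile=2)
```

Even in this toy form the coupling is visible: the tile size fixes the staging buffer shapes and the loop bounds at once, so changing one of them without the others breaks correctness.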
In response to these challenges, we introduce Argus, an innovative agentic framework that leverages data-flow invariants. These compile-time specifications define how data should be orchestrated throughout the execution of GPU kernels. Argus features a tile-based, Pythonic Domain-Specific Language (DSL) that exposes hardware instructions and compiler policies while abstracting away low-level representations. This DSL includes:
- Tag Functions: These functions allow for the propagation of symbolic annotations through both data and control flow.
- Tag Assertions: These assertions enforce relational constraints at various use sites, ensuring that data flow adheres to the specified invariants.
When an invariant is violated, the compiler returns a concrete counterexample that pinpoints the thread, data element, and program point involved. This enables dense, structured feedback that supports targeted corrections. Invariants are verified at compile time using abstract interpretation over a layout algebra and SMT solving, so verification incurs zero runtime overhead.
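The idea of tags, assertions, and counterexamples can be illustrated with a toy checker in plain Python. This is only a sketch under stated assumptions: it does not use the Argus DSL, every name here (`check_transpose`, `Counterexample`, `store_slot`, `load_slot`) is hypothetical, and exhaustive enumeration over a small tile stands in for the abstract interpretation and SMT solving described above. The scenario is a transpose staged through a shared buffer: each thread stores one element tagged with its logical coordinate, and at the load site the tag it reads back must be the transposed coordinate.

```python
from typing import Callable, Dict, NamedTuple, Optional, Tuple

Tag = Tuple[int, int]  # logical (row, col) of a data element


class Counterexample(NamedTuple):
    thread: int
    found: Optional[Tag]   # tag actually present in the slot that was read
    expected: Tag          # tag the assertion requires at the use site
    program_point: str


def check_transpose(
    n: int,
    store_slot: Callable[[int, int], int],
    load_slot: Callable[[int, int], int],
) -> Optional[Counterexample]:
    """Check an n x n transpose staged through a flat buffer.

    Returns None if every thread reads the element it should, otherwise
    the first violating thread with the offending tag.
    """
    # "Tag function": propagate each element's coordinate to its slot.
    tags: Dict[int, Tag] = {}
    for t in range(n * n):
        r, c = divmod(t, n)
        tags[store_slot(r, c)] = (r, c)

    # "Tag assertion": thread (r, c) must read the element tagged (c, r).
    for t in range(n * n):
        r, c = divmod(t, n)
        found = tags.get(load_slot(r, c))
        expected = (c, r)
        if found != expected:
            return Counterexample(t, found, expected, "load_from_shared")
    return None


N = 4
row_major = lambda r, c: r * N + c
col_major = lambda r, c: c * N + r

ok = check_transpose(N, row_major, col_major)   # correct layout pair
bug = check_transpose(N, row_major, row_major)  # load forgot to swap indices
```

Here `ok` is `None`, while `bug` names thread 1, which read the element tagged `(0, 1)` where `(1, 0)` was required. A pass/fail test would only report that the output is wrong; the counterexample says which thread read which element at which point.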
Argus also integrates an in-context reinforcement learning planner that learns to select the most effective optimizations and to synthesize robust invariants. The planner draws on a curated knowledge base of GPU optimization techniques.
We evaluate Argus on the AMD MI300X GPU using General Matrix Multiplication (GEMM), flash attention, and MoE kernels, which together account for over 90% of GPU time in LLM inference. Kernels generated by Argus achieve 99% to 104% of the throughput of state-of-the-art hand-optimized assembly and are 2 to 1543 times faster than those produced by existing agentic systems.
Argus also demonstrates its versatility by generalizing to 200 KernelBench tasks, successfully solving 100% of Level 1 and 90% of Level 2 problems. This capability highlights the framework’s potential to significantly enhance GPU optimization processes and contribute to more efficient LLM implementations.
