ARGUS: Advanced GPU Optimization via Data-Flow Invariants

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

Summary: arXiv:2604.18616v1 (cross-listed)

Large language model (LLM)-based coding agents can now generate functionally correct GPU kernels. Even so, these kernels often fall short of hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinating several tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling. Existing agents rely on sparse pass/fail feedback, which leaves them unable to diagnose global constraint violations.

To address this, the authors present Argus, an agentic framework built around data-flow invariants. These invariants are compile-time specifications that describe how data must be choreographed throughout the execution of a GPU kernel. Argus introduces a tile-based, Pythonic domain-specific language (DSL) that exposes hardware instructions and compiler policies while abstracting away low-level representations. The DSL offers:

  • Tag Functions: These propagate symbolic annotations through both data and control flow.
  • Tag Assertions: These enforce relational constraints at specific use sites.
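
To make the two constructs above concrete, here is a minimal sketch in plain Python. It is not the Argus DSL, whose syntax this summary does not show: the names `Tagged`, `load_tile`, `checked_mul`, and `tile_matmul` are hypothetical, and the check runs at run time here, whereas Argus verifies its invariants at compile time. The sketch only illustrates the idea of tags travelling with data and a relational constraint being checked where the data is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A value paired with a symbolic tag recording its source coordinates."""
    value: int
    tag: tuple  # (matrix name, row, column)


def load_tile(name, matrix, row0, col0, size):
    """Hypothetical tag function: label every loaded element with its origin."""
    return [
        [Tagged(matrix[row0 + i][col0 + j], (name, row0 + i, col0 + j))
         for j in range(size)]
        for i in range(size)
    ]


def transpose(tile):
    """Data movement: the tags travel with the values."""
    return [list(column) for column in zip(*tile)]


def checked_mul(a, b):
    """Hypothetical tag assertion at the multiply use site.

    The column index of the A operand must equal the row index of the
    B operand (the shared contraction index k).
    """
    if a.tag[2] != b.tag[1]:
        raise AssertionError(f"operand mismatch: {a.tag} paired with {b.tag}")
    return a.value * b.value


def tile_matmul(a_tile, b_tile):
    """Multiply two square tiles, asserting the invariant at every multiply."""
    n = len(a_tile)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i][j] += checked_mul(a_tile[i][k], b_tile[k][j])
    return out
```

Passing an accidentally transposed B tile, as in `tile_matmul(a, transpose(b))`, raises an `AssertionError` that names the mismatched operand pair instead of silently producing wrong numbers.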

When a violation occurs, the compiler produces a concrete counterexample that identifies the specific thread, data element, and program point, giving the agent dense, structured feedback for targeted fixes. Invariants are verified at compile time through abstract interpretation over a layout algebra and SMT solving, so they add zero runtime overhead. In addition, an in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques.
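
The kind of counterexample described above can be illustrated with a small sketch. It assumes a toy shared-memory staging step in which each thread stores one tile element and later loads the transposed element; the function `check_staging`, the `Counterexample` record, and the slot-mapping parameters are all hypothetical. Exhaustive enumeration over a tiny thread count stands in for the abstract interpretation and SMT solving that Argus is reported to use.

```python
from typing import Callable, NamedTuple, Optional


class Counterexample(NamedTuple):
    thread: int
    expected_element: tuple
    actual_element: Optional[tuple]
    program_point: str


def check_staging(
    num_threads: int,
    width: int,
    write_slot: Callable[[int], int],
    read_slot: Callable[[int], int],
    program_point: str,
) -> Optional[Counterexample]:
    """Check a transposing store/load pair; return the first violation found."""
    # Store phase: thread t writes tile element (t // width, t % width).
    slots = {}
    for t in range(num_threads):
        slots[write_slot(t)] = (t // width, t % width)

    # Load phase: thread t must receive the transposed element.
    for t in range(num_threads):
        expected = (t % width, t // width)
        actual = slots.get(read_slot(t))
        if actual != expected:
            return Counterexample(t, expected, actual, program_point)
    return None  # invariant holds for every thread


# A row-major store paired with an untransposed load violates the invariant.
bug = check_staging(16, 4, lambda t: t, lambda t: t, "shared_load")
# A load that swaps row and column satisfies it.
ok = check_staging(16, 4, lambda t: t, lambda t: (t % 4) * 4 + t // 4, "shared_load")
```

Here `bug` reports thread 1, which expected element (1, 0) but received (0, 1) at the `shared_load` point, while `ok` is `None`. A real verifier would reason symbolically rather than enumerate threads, but the shape of the feedback is the same: one thread, one data element, one program point.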

Argus was evaluated on the AMD MI300X GPU on GEMM, flash attention, and MoE kernels, which account for over 90% of GPU time in LLM inference. The kernels generated by Argus achieve:

  • 99-104% of the state-of-the-art hand-optimized assembly throughput.
  • Speed improvements ranging from 2x to 1543x compared to existing agentic systems.

Argus also generalizes across 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems. Together, these results suggest that compile-time data-flow invariants can give coding agents the structured feedback they need to approach hand-optimized GPU performance.


Lazarus Omolua (https://richlyai.com/blog)