Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance
arXiv:2603.29010v1
Abstract
Optimizing GPU kernels with large language model (LLM) agents is an iterative search over a vast design space. Every candidate must be generated, compiled, validated, and profiled, so reducing the number of trials directly lowers both runtime and cost. Two observations motivate this work:
- Abstraction Level: The abstraction level at which the agent operates matters. If it is too low, the LLM spends its reasoning on low-impact details. If it is too high, the agent can miss important optimization choices.
- Diminishing Returns: Agents cannot easily tell when the search has reached diminishing returns, so they continue to spend trials that yield little improvement.
These observations motivate two design principles:
- Domain-Specific Language (DSL): We propose a compact DSL that can be learned in context, enabling the model to operate at a higher level of reasoning while still preserving crucial optimization levers.
- Speed-of-Light (SOL) Guidance: First-principles performance bounds steer the optimization search and set its budget.
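To make the idea of a first-principles bound concrete, the sketch below computes a roofline-style lower bound on the runtime of a matrix multiply. It is an illustration only: the source does not specify how its SOL bounds are derived, and the `Device` class, the `gemm_sol_seconds` function, and the peak rates are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Peak rates for a hypothetical GPU (illustrative numbers, not a real part)."""
    peak_flops: float      # floating-point operations per second
    peak_bandwidth: float  # bytes per second of DRAM bandwidth


def gemm_sol_seconds(m: int, n: int, k: int, bytes_per_elem: int, dev: Device) -> float:
    """Roofline-style lower bound on the runtime of C = A @ B."""
    flops = 2.0 * m * n * k                             # one multiply and one add per term
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A and B once, write C once
    compute_time = flops / dev.peak_flops
    memory_time = traffic / dev.peak_bandwidth
    # The kernel can be no faster than the slower of the two limits.
    return max(compute_time, memory_time)


dev = Device(peak_flops=100e12, peak_bandwidth=1e12)
sol = gemm_sol_seconds(4096, 4096, 4096, bytes_per_elem=2, dev=dev)
print(f"SOL bound: {sol * 1e3:.3f} ms")
```

With these assumed peak rates the square problem is compute-bound, while a very thin problem such as `k = 1` becomes memory-bound. Comparing a measured runtime against such a bound gives an estimate of how much speedup remains available.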
Implementation of $\mu$CUTLASS
We implement these principles in $\mu$CUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels. The DSL covers:
- Kernel configuration
- Epilogue fusion
- Multi-stage pipelines
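The three features above can be read as the optimization levers that remain visible to the agent. The sketch below is not $\mu$CUTLASS syntax, which the source does not show. It is a hypothetical Python description of one candidate (`KernelSpec` and `enumerate_specs` are invented names) meant only to illustrate how a few high-level choices span a sizeable design space.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class KernelSpec:
    """Hypothetical high-level description of one GEMM candidate (not muCUTLASS syntax)."""
    tile_m: int
    tile_n: int
    tile_k: int
    stages: int    # depth of the multi-stage software pipeline
    epilogue: str  # elementwise work fused after the matrix multiply


def enumerate_specs(tiles, stage_counts, epilogues):
    """Cartesian product of the three levers."""
    return [KernelSpec(tm, tn, tk, s, e)
            for (tm, tn, tk), s, e in product(tiles, stage_counts, epilogues)]


specs = enumerate_specs(
    tiles=[(128, 128, 32), (128, 256, 32), (256, 128, 64)],
    stage_counts=[2, 3, 4],
    epilogues=["none", "bias", "bias_relu"],
)
print(len(specs))  # 27 candidates from three short lists
```

Even three options per lever give 27 candidates, each of which would need to be compiled, validated, and profiled, which is why limiting the number of trials matters.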
Performance Results
SOL guidance estimates the performance headroom of each problem and uses it to steer optimization trials. Problems already close to the speed-of-light bound are deprioritized, and kernels that appear to game the benchmark are flagged.
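One way to picture headroom-based steering is as a budget split. The sketch below is an assumption, not the policy used in the source: the `allocate_trials` function, the proportional weighting, and the `near_sol` threshold of 1.1 are all hypothetical.

```python
def headroom(measured: float, sol: float) -> float:
    """Ratio of measured runtime to the SOL bound; 1.0 means the kernel sits at the bound."""
    return measured / sol


def allocate_trials(problems, total_trials, near_sol=1.1):
    """Split a trial budget across problems in proportion to their excess over the bound.

    `problems` maps a name to (measured_runtime, sol_runtime) in the same time unit.
    Problems within `near_sol` of the bound receive no further trials.
    """
    weights = {}
    for name, (measured, sol) in problems.items():
        h = headroom(measured, sol)
        weights[name] = 0.0 if h <= near_sol else h - 1.0
    total_weight = sum(weights.values())
    if total_weight == 0.0:
        return {name: 0 for name in problems}
    # Floor each share; any remainder is simply left unspent in this sketch.
    return {name: int(total_trials * w / total_weight) for name, w in weights.items()}


# Runtimes in microseconds (made-up numbers).
problems = {
    "gemm": (1050.0, 1000.0),
    "softmax": (4000.0, 1000.0),
    "conv": (2000.0, 1000.0),
}
plan = allocate_trials(problems, total_trials=40)
print(plan)
```

In this example the kernel that is within 5% of its bound receives no trials, and the remaining budget goes mostly to the kernel that is furthest from its bound.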
We evaluated 59 KernelBench problems under identical iteration budgets:
- With GPT-5-mini, switching from low-level code generation to DSL code turned a 0.40x geometric-mean regression into a 1.27x speedup over PyTorch.
- Adding SOL-guided steering raised the speedup to 1.56x.
- Across various model tiers, $\mu$CUTLASS combined with SOL guidance enabled weaker models to outperform stronger baseline agents while incurring lower token costs.
- SOL-guided budgeting saved 19-43% of tokens while retaining at least 95% of the geometric-mean speedup; the best policy reached a 1.68x efficiency gain.
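The results above are reported as geometric means and as a trade-off between tokens saved and speedup retained. The sketch below shows how those two quantities can be computed. All numbers in it are made up for illustration, and it does not attempt to reproduce the source's efficiency-gain metric, whose definition is not given here.

```python
import math


def geomean(values):
    """Geometric mean of positive ratios, computed in log space for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Illustrative per-problem speedups over a baseline (made-up numbers).
full_budget = [0.5, 2.0, 4.0, 1.0]
reduced_budget = [0.5, 2.0, 3.6, 1.0]

# Fraction of the geometric-mean speedup kept after cutting the budget.
retained = geomean(reduced_budget) / geomean(full_budget)

# Illustrative token counts for the two runs (made-up numbers).
tokens_full, tokens_reduced = 1_000_000, 700_000
savings = 1.0 - tokens_reduced / tokens_full

print(f"retained {retained:.1%} of geomean speedup, saved {savings:.0%} of tokens")
```

The geometric mean treats a 0.5x slowdown and a 2.0x speedup as cancelling out, which is why a value below 1.0, such as the 0.40x figure above, indicates a net regression across the benchmark.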
Conclusion
SOL analysis also helps detect benchmark-gaming cases, in which a kernel appears fast but does not perform the intended computation. Taken together, a compact DSL and SOL guidance let an agent reach a given speedup with fewer trials and fewer tokens.
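A simple way to see why a performance bound helps here: if the bound is a valid lower limit on runtime, a kernel that measures well below it cannot have done all of the required work. The check below is a minimal sketch under that assumption. The function names and the 5% tolerance are hypothetical and are not taken from the source.

```python
def exceeds_sol(measured: float, sol: float, tolerance: float = 0.05) -> bool:
    """True when the measured runtime beats the lower bound by more than `tolerance`."""
    return measured < sol * (1.0 - tolerance)


def triage(results, tolerance=0.05):
    """Return the names of kernels whose timings fall below their SOL bound."""
    return [name for name, (measured, sol) in results.items()
            if exceeds_sol(measured, sol, tolerance)]


# (measured runtime, SOL bound) in microseconds; made-up numbers.
results = {
    "matmul_relu": (1200.0, 1000.0),  # slower than the bound: plausible
    "reduce_sum": (980.0, 1000.0),    # within timing noise of the bound
    "conv_bias": (300.0, 1000.0),     # far below the bound: needs a correctness audit
}
suspects = triage(results)
print(suspects)
```

A flagged kernel is only a candidate for inspection. The bound itself may be wrong for the problem, so a check like this would complement output validation rather than replace it.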
