Boost GPU Kernel Optimization with DSL & SOL Guidance

Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

Source: arXiv:2603.29010v1 (cross-listed)

Abstract

Optimizing GPU kernels with large language model (LLM) agents is an iterative search over a vast design space. Every candidate must be generated, compiled, validated, and profiled, so reducing the number of trials directly lowers both runtime and cost. Two observations motivate our approach:

Abstraction Level: The level of abstraction at which an agent operates matters. If it is too low, the LLM spends its reasoning on low-impact details. If it is too high, the agent may miss important optimization choices.
Diminishing Returns: Agents cannot easily tell when their search has reached the point of diminishing returns, so they keep spending trials and tokens on problems with little left to gain.

These observations motivate two design principles for making optimization agents more efficient:

Domain-Specific Language (DSL): We propose a compact DSL that can be learned in context, enabling the model to operate at a higher level of reasoning while still preserving crucial optimization levers.
Speed-of-Light (SOL) Guidance: First-principles performance bounds are used both to steer the optimization search and to budget how much effort each problem receives.
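
As a rough illustration of what a first-principles bound can look like, the sketch below computes a roofline-style lower bound on the runtime of a matrix multiplication: the kernel can be no faster than the larger of its compute time and its memory-traffic time. The function name, the hardware peak figures, and the choice of a roofline model are assumptions made for this example; the SOL model used in the paper may differ.

```python
def sol_time_gemm(m, n, k, peak_flops, peak_bytes_per_s, bytes_per_elem=2):
    """Roofline-style lower bound (seconds) for C[m,n] = A[m,k] @ B[k,n]."""
    # One multiply and one add per inner-product term.
    flops = 2.0 * m * n * k
    # Minimum traffic: read A and B once, write C once.
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bytes_per_s
    # The kernel is limited by whichever resource takes longer.
    return max(compute_time, memory_time)


# Hypothetical hardware peaks, chosen only for the example.
PEAK_FLOPS = 1.0e15  # 1 PFLOP/s
PEAK_BW = 3.0e12     # 3 TB/s

t_sol = sol_time_gemm(4096, 4096, 4096, PEAK_FLOPS, PEAK_BW)
print(f"SOL bound: {t_sol * 1e6:.1f} us")
```

With these assumed peaks the 4096-cubed problem is compute-bound, so the bound is set by the FLOP count rather than by memory traffic.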

Implementation of $\mu$CUTLASS

We implement these principles in $\mu$CUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels. The DSL covers three areas:

Kernel configuration
Epilogue fusion
Multi-stage pipelines
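
To make these three areas concrete, here is a hypothetical Python sketch of the kind of choices a compact kernel description might expose. It is not $\mu$CUTLASS syntax: the `KernelSpec` class, its field names, and the shared-memory limit are all assumptions for illustration. The one piece of reasoning it encodes is that a multi-stage pipeline buffers one A tile and one B tile per stage, so deeper pipelines and larger tiles compete for shared memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelSpec:
    """Hypothetical high-level kernel description (not muCUTLASS syntax)."""
    tile_m: int
    tile_n: int
    tile_k: int
    stages: int              # depth of the multi-stage load pipeline
    epilogue: tuple = ()     # ops fused after the matmul, e.g. ("bias", "relu")
    bytes_per_elem: int = 2  # assumes 16-bit operands

    def smem_bytes(self):
        # Each pipeline stage buffers one A tile and one B tile.
        per_stage = (self.tile_m * self.tile_k
                     + self.tile_k * self.tile_n) * self.bytes_per_elem
        return self.stages * per_stage

    def fits(self, smem_limit_bytes):
        # Reject configurations that cannot fit in shared memory.
        return self.smem_bytes() <= smem_limit_bytes


# Assumed shared-memory budget, chosen only for the example.
SMEM_LIMIT = 164 * 1024

spec = KernelSpec(128, 128, 64, stages=4, epilogue=("bias", "relu"))
print(spec.smem_bytes(), spec.fits(SMEM_LIMIT))
```

A description at this level leaves tile shapes, pipeline depth, and fused epilogue operations as explicit choices while hiding the lower-level details an agent would otherwise have to write by hand.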

Performance Results

SOL guidance estimates how much performance headroom remains for each problem and uses that estimate to steer optimization trials. Problems already close to the speed-of-light limit are deprioritized, and kernels that appear to game the benchmark are flagged.
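
A minimal sketch of how such guidance could be applied, assuming each kernel has a measured runtime and an SOL lower bound. Headroom is taken as the ratio of measured time to the bound: near-SOL kernels are deprioritized, and a kernel measured faster than its bound is flagged, since a correct kernel cannot beat a valid lower bound. The function name, labels, and threshold are hypothetical, and a real policy would also allow for measurement noise.

```python
def triage(measured_s, sol_s, near_sol_ratio=1.1):
    """Classify a kernel by its headroom over the SOL lower bound."""
    headroom = measured_s / sol_s
    if headroom < 1.0:
        # Faster than the bound: the kernel likely skipped required work.
        return "suspicious"
    if headroom <= near_sol_ratio:
        # Little left to gain, so spend the budget elsewhere.
        return "deprioritize"
    return "optimize"


# Made-up (measured, SOL) timings in seconds, for illustration only.
kernels = {
    "gemm_a": (105e-6, 100e-6),
    "gemm_b": (400e-6, 100e-6),
    "gemm_c": (20e-6, 100e-6),
}
labels = {name: triage(t, sol) for name, (t, sol) in kernels.items()}
print(labels)
```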

We evaluated 59 KernelBench problems under identical iteration budgets, with the following results:

With GPT-5-mini, switching from generating low-level code to generating DSL code turned a 0.40x geometric-mean regression into a 1.27x speedup over PyTorch.
Adding SOL-guided steering raised this further, to a 1.56x speedup.
Across model tiers, $\mu$CUTLASS combined with SOL guidance let weaker models outperform stronger baseline agents at lower token cost.
SOL-guided budgeting saved 19-43% of tokens while retaining at least 95% of the geometric-mean speedup, with the best policy reaching a 1.68x efficiency gain.
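
For readers unfamiliar with the metric, the geometric mean of per-problem speedups is the exponential of the average log speedup, so a 2x win and a 0.5x loss cancel out exactly. The short sketch below shows the calculation; the speedup values are made up for illustration and are not results from the paper.

```python
import math


def geomean(speedups):
    """Geometric mean of per-problem speedups (each relative to a baseline)."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Made-up per-problem speedups; values below 1.0 are regressions.
example = [0.5, 2.0, 4.0]
g = geomean(example)
print(f"geomean speedup: {g:.3f}x")
```

A geometric mean below 1.0, such as the 0.40x figure reported for low-level code generation, means the generated kernels were on balance slower than the baseline.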

Conclusion

SOL analysis also helps identify benchmark-gaming cases, in which a kernel appears fast while failing to perform the intended computation. Together, the DSL and SOL guidance let agents reach larger speedups with fewer trials and lower token cost.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
