AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization
Source: arXiv:2603.23566v1
Abstract: AscendC (Ascend C) operator optimization on Huawei Ascend neural processing units (NPUs) faces a two-fold knowledge bottleneck: unlike the CUDA ecosystem, there are few public reference implementations to learn from, and performance hinges on a coupled two-part artifact – a host-side tiling program that orchestrates data movement and a kernel program that schedules and pipelines instructions.
To address this bottleneck, the authors propose AscendOptimizer, an episodic agent for optimizing AscendC operators. Its core idea is to turn execution into experience. This article summarizes the agent's main components and its reported results.
Key Features of AscendOptimizer
- Profiling-in-the-Loop Evolutionary Search: On the host side, AscendOptimizer runs an evolutionary search with profiling in the loop. Candidate tiling and data-movement configurations are evaluated on the hardware itself, so valid, high-performing configurations are discovered directly from hardware feedback.
- Kernel Optimization Motifs: On the kernel side, AscendOptimizer mines transferable optimization motifs by rewinding optimized kernels. It systematically de-optimizes them to synthesize instructive “bad-to-good” trajectories, which are distilled into a retrievable experience bank that guides kernel rewriting.
- Closed-Loop Optimization: By alternating between host tuning and kernel rewriting in a closed loop, AscendOptimizer progressively expands the set of feasible optimizations and reduces operator latency.
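The host-side search in the first bullet can be illustrated with a minimal sketch. This is not the authors' implementation: the search space (tile size, buffer count), the buffer capacity, the cost model, and the function names `profile`, `mutate`, and `evolve` are all assumptions made for illustration, and `profile` merely simulates what would be a real on-device measurement. It shows only the general pattern: measure candidates, discard invalid ones, keep the fastest, and mutate them.

```python
import random
from typing import Dict, Optional, Tuple

Config = Tuple[int, int]  # (tile_elems, num_buffers): a hypothetical search space

# All constants are illustrative assumptions, not real Ascend hardware figures.
BUFFER_BYTES = 192 * 1024   # assumed on-chip buffer capacity
TOTAL_ELEMS = 1 << 20       # assumed problem size
ELEM_BYTES = 2              # assumed 2-byte elements
TILE_CHOICES = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]
BUFFER_CHOICES = [1, 2, 4]


def profile(config: Config) -> Optional[float]:
    """Stand-in for an on-device run: returns a latency, or None if invalid."""
    tile, buffers = config
    if tile * ELEM_BYTES * buffers > BUFFER_BYTES:
        return None  # tiling does not fit in the assumed buffer
    n_tiles = TOTAL_ELEMS // tile
    compute, transfer = tile * 0.001, tile * 0.002
    # With more than one buffer, transfer is assumed to overlap with compute.
    step = max(compute, transfer) if buffers > 1 else compute + transfer
    return n_tiles * (5.0 + step)  # 5.0 = assumed fixed per-tile overhead


def random_config(rng: random.Random) -> Config:
    return (rng.choice(TILE_CHOICES), rng.choice(BUFFER_CHOICES))


def mutate(config: Config, rng: random.Random) -> Config:
    """Resample one of the two fields of a configuration."""
    tile, buffers = config
    if rng.random() < 0.5:
        return (rng.choice(TILE_CHOICES), buffers)
    return (tile, rng.choice(BUFFER_CHOICES))


def evolve(generations: int = 30, population_size: int = 8,
           seed: int = 0) -> Tuple[Config, float]:
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    measured: Dict[Config, Optional[float]] = {}
    for _ in range(generations):
        # Profile each new candidate once; cached results are reused.
        for cfg in population:
            if cfg not in measured:
                measured[cfg] = profile(cfg)
        valid = [c for c in measured if measured[c] is not None]
        if valid:
            # Keep the fastest valid configurations seen so far as parents.
            parents = sorted(valid, key=lambda c: measured[c])[: population_size // 2]
        else:
            parents = [random_config(rng)]  # nothing valid yet: restart
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    valid = [c for c in measured if measured[c] is not None]
    if not valid:
        raise RuntimeError("no valid configuration found")
    best = min(valid, key=lambda c: measured[c])
    return best, measured[best]


if __name__ == "__main__":
    best_config, best_latency = evolve()
    print("best (tile, buffers):", best_config, "simulated latency:", best_latency)
```

In a real setting, `profile` would compile and run the operator on the device, and a failed compile or run would play the role of the `None` return here. Caching measurements matters because each on-device evaluation is expensive compared with generating a candidate.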
Performance Achievements
AscendOptimizer was evaluated on a benchmark of 127 real AscendC operators. It achieves a 1.19x geometric-mean speedup over the open-source baseline, and 49.61% of operators outperform their references. The authors also report that it outperforms strong agent and search baselines.
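For readers unfamiliar with the two metrics, the sketch below shows how a geometric-mean speedup and a "fraction of operators faster than reference" figure are typically computed from per-operator latencies. The function names and the latency values are made up for illustration. The summary above does not say whether the "baseline" and the "references" are the same implementations, so the sketch simply uses one reference list for both.

```python
import math
from typing import Sequence


def geometric_mean_speedup(reference_ms: Sequence[float],
                           optimized_ms: Sequence[float]) -> float:
    """Geometric mean of per-operator speedups (reference / optimized)."""
    if not reference_ms or len(reference_ms) != len(optimized_ms):
        raise ValueError("need two non-empty latency lists of equal length")
    log_sum = sum(math.log(r / o) for r, o in zip(reference_ms, optimized_ms))
    return math.exp(log_sum / len(reference_ms))


def fraction_faster(reference_ms: Sequence[float],
                    optimized_ms: Sequence[float]) -> float:
    """Share of operators whose optimized latency is strictly below the reference."""
    if not reference_ms or len(reference_ms) != len(optimized_ms):
        raise ValueError("need two non-empty latency lists of equal length")
    wins = sum(1 for r, o in zip(reference_ms, optimized_ms) if o < r)
    return wins / len(reference_ms)


# Made-up latencies in milliseconds for three hypothetical operators.
reference = [4.0, 2.0, 1.0]
optimized = [2.0, 2.0, 2.0]

if __name__ == "__main__":
    print("geomean speedup:", geometric_mean_speedup(reference, optimized))
    print("fraction faster:", fraction_faster(reference, optimized))
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel out, as they do in the toy data. As a sanity check on the reported figures, 49.61% of 127 operators is consistent with 63 operators (63/127 ≈ 0.4961).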
Conclusion
AscendOptimizer targets the knowledge bottleneck that makes AscendC operators hard to optimize on Ascend NPUs. It combines profiling-in-the-loop evolutionary search on the host side with experience-guided kernel rewriting, alternating between the two in a closed loop. On the reported benchmark of 127 operators, this yields a 1.19x geometric-mean speedup over the open-source baseline.
