Demystifying the Silence of Correctness Bugs in PyTorch Compiler
Summary: arXiv:2604.08720v1
Optimizing the performance of AI infrastructure is crucial for the rapid adoption of large language models (LLMs). The PyTorch compiler, torch.compile, is a core optimization tool for deep learning (DL) models, including LLMs. Despite its importance, torch.compile is prone to correctness bugs: a compiled DL model silently produces incorrect outputs, with no exception, crash, or warning. Such bugs seriously threaten the reliability of downstream LLM applications.
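Because a silent correctness bug raises nothing, it only becomes visible when the optimized output is compared against a trusted reference. The sketch below illustrates that comparison in plain Python; it is not taken from the paper. The names find_silent_mismatches, reference_op, and buggy_optimized_op are hypothetical, and the "optimized" function contains a deliberately planted bug. In a PyTorch setting, the reference would typically be the eager model and the candidate the result of torch.compile(model), with tensor outputs compared under a tolerance.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def _close(a: float, b: float, rel_tol: float, abs_tol: float) -> bool:
    """Tolerance comparison that treats two NaNs as equal."""
    if math.isnan(a) and math.isnan(b):
        return True
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


def find_silent_mismatches(
    reference: Callable[[Vector], Vector],
    candidate: Callable[[Vector], Vector],
    inputs: Sequence[Vector],
    rel_tol: float = 1e-5,
    abs_tol: float = 1e-8,
) -> List[Tuple[Vector, Vector, Vector]]:
    """Return (input, expected, actual) for every input where outputs diverge."""
    mismatches = []
    for x in inputs:
        expected = reference(x)
        actual = candidate(x)
        same = len(expected) == len(actual) and all(
            _close(e, a, rel_tol, abs_tol) for e, a in zip(expected, actual)
        )
        if not same:
            mismatches.append((x, expected, actual))
    return mismatches


def reference_op(xs: Vector) -> Vector:
    """Trusted behavior: clamp negatives to zero, then double."""
    return [max(x, 0.0) * 2.0 for x in xs]


def buggy_optimized_op(xs: Vector) -> Vector:
    """Hypothetical 'optimized' variant with a planted bug: wrong for negatives.

    It never raises, so the error is silent unless outputs are compared.
    """
    return [abs(x) * 2.0 for x in xs]


test_inputs = [[1.0, 2.0], [-1.0, 3.0]]
mismatches = find_silent_mismatches(reference_op, buggy_optimized_op, test_inputs)
for x, expected, actual in mismatches:
    print(f"input={x} expected={expected} actual={actual}")
```

Only the input containing a negative value is reported; the all-positive input agrees on both paths, which is why such bugs can survive tests that never exercise the affected case.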
Data from the PyTorch community shows that 19.2% of high-priority issues are incorrect outputs caused by bugs in torch.compile. This makes them the second-most-common bug category, just behind program crashes at 19.57%. Yet no systematic study has specifically characterized these correctness bugs or how to detect them.
First Empirical Study on Correctness Bugs
In response to this gap, we present the first empirical study focused on correctness bugs within torch.compile. Our study examines the characteristics of these bugs and evaluates the effectiveness of existing fuzzers in detecting them. Through this investigation, we aim to enhance the understanding of how these bugs arise and their implications for AI development.
Introducing AlignGuard
Based on our findings, we introduce a proof-of-concept testing technique named AlignGuard, which is specifically designed to detect correctness bugs in torch.compile. AlignGuard leverages the bug characteristics identified in our study and applies LLM-based test mutation to existing test cases for effective bug detection.
- AlignGuard has successfully identified 23 new correctness bugs in the latest versions of torch.compile.
- All of these bugs have been confirmed or resolved by the PyTorch development team.
- Notably, over half (14 out of 23) of the detected bugs are classified as high-priority, emphasizing the significance of our technique.
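The general idea of mutating existing test cases and checking each variant for divergence can be sketched as follows. This is an illustrative assumption, not AlignGuard's implementation: the source does not describe its mutation operators or prompts, so a simple rule-based function (rule_based_mutations, hypothetical) stands in for the LLM-based mutator, and the reference and candidate functions are toy stand-ins for eager and compiled execution.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def rule_based_mutations(seed: Vector) -> List[Vector]:
    """Hypothetical stand-in for an LLM mutator: derive variants of a seed input."""
    return [
        [-x for x in seed],        # flip signs
        seed + [0.0],              # change the length and add a boundary value
        [x * 1e6 for x in seed],   # scale to large magnitudes
    ]


def mutate_and_check(
    seeds: Sequence[Vector],
    mutate: Callable[[Vector], List[Vector]],
    reference: Callable[[Vector], Vector],
    candidate: Callable[[Vector], Vector],
    tol: float = 1e-6,
) -> List[Vector]:
    """Run each seed and its mutants through both paths; return diverging inputs."""
    failures = []
    for seed in seeds:
        for variant in [seed] + mutate(seed):
            expected = reference(variant)
            actual = candidate(variant)
            diverged = len(expected) != len(actual) or any(
                abs(e - a) > tol for e, a in zip(expected, actual)
            )
            if diverged:
                failures.append(variant)
    return failures


def eager_like(xs: Vector) -> Vector:
    """Toy reference: clamp negatives to zero, then double."""
    return [max(x, 0.0) * 2.0 for x in xs]


def compiled_like(xs: Vector) -> Vector:
    """Toy candidate with a planted silent bug on negative inputs."""
    return [abs(x) * 2.0 for x in xs]


failures = mutate_and_check([[1.0, 2.0]], rule_based_mutations, eager_like, compiled_like)
print(failures)
```

Here the original seed passes, and only the sign-flipped mutant exposes the planted bug, which illustrates why mutating existing tests can reach cases the original suite does not cover.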
The successful detection of these bugs not only underscores the potential of AlignGuard but also highlights the pressing need for more robust testing methodologies in the development of AI tools. As the reliance on large language models continues to grow, ensuring the correctness of underlying frameworks like PyTorch becomes increasingly critical.
Conclusion
The study of correctness bugs in torch.compile lays the groundwork for more reliable deep learning applications. AlignGuard, our proof-of-concept technique, shows the value of proactive testing in finding and fixing these silent but impactful bugs. As artificial intelligence continues to evolve, such techniques will be important for building trust in AI infrastructure.
