AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection
arXiv:2604.11950v1
Abstract
Large language model (LLM)-based agents have recently shown promise in identifying potential bugs in source code. However, these agents typically produce static hypotheses that require manual validation, which limits how far bug detection can be automated. We recast this validation problem as test generation: synthesizing executable proof-of-concept tests (PoCs), such as scripts, command sequences, or crafted inputs, that trigger the suspected defects. Automated PoC generation then serves as a scalable validation oracle, enabling end-to-end autonomous bug detection backed by concrete execution evidence.
The Challenge of Naive LLM Agents
Naive LLM agents, however, are unreliable validators. They are biased toward reporting “success” and may engage in reward hacking, producing plausible but non-functional PoCs or even hallucinated execution traces. To counter this, we introduce AnyPoC, a general multi-agent framework designed to:
- Analyze and validate candidate bug reports.
- Iteratively synthesize and execute PoCs while gathering execution traces.
- Independently re-execute and scrutinize PoCs to minimize hallucinations and reward hacking.
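The independent re-execution step can be illustrated with a minimal Python sketch. It is not AnyPoC's implementation: the `PoCClaim` structure, the `reexecute` function, and the acceptance rule (exit code plus an output marker) are hypothetical names and choices made only for illustration. The idea shown is that a checker never trusts the trace an agent reports; it reruns the PoC in a fresh process and compares what actually happened with what was claimed.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PoCClaim:
    """What a generator agent claims about its PoC (hypothetical structure)."""
    script: str             # Python source of the PoC
    claimed_exit_code: int  # exit code the agent says it observed
    claimed_marker: str     # text the agent says appeared in the output


def reexecute(claim: PoCClaim, timeout: float = 10.0) -> dict:
    """Run the PoC in a fresh process and compare reality with the claim."""
    with tempfile.TemporaryDirectory() as workdir:
        poc_path = Path(workdir) / "poc.py"
        poc_path.write_text(claim.script)
        try:
            result = subprocess.run(
                [sys.executable, str(poc_path)],
                capture_output=True, text=True, timeout=timeout, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"accepted": False, "reason": "timeout"}
    output = result.stdout + result.stderr
    # Reject any claim the real execution does not reproduce.
    if result.returncode != claim.claimed_exit_code:
        return {"accepted": False, "reason": "exit code mismatch"}
    if claim.claimed_marker not in output:
        return {"accepted": False, "reason": "marker missing from real output"}
    return {"accepted": True, "reason": "claim reproduced"}


if __name__ == "__main__":
    # A toy PoC that really fails the way it claims to.
    honest = PoCClaim(
        script="import sys\nprint('BUG: negative length accepted')\nsys.exit(1)\n",
        claimed_exit_code=1,
        claimed_marker="BUG: negative length accepted",
    )
    # A toy PoC whose claimed failure never happens when it is actually run.
    hallucinated = PoCClaim(
        script="print('all good')\n",
        claimed_exit_code=1,
        claimed_marker="BUG",
    )
    print(reexecute(honest))
    print(reexecute(hallucinated))
```

In this sketch a hallucinated trace is rejected simply because the rerun does not match it. A real validator would also need to judge whether the observed failure actually demonstrates the reported bug, which this toy check does not attempt.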
Continuous Knowledge Base Evolution
AnyPoC also continuously extracts and evolves a PoC knowledge base so that it can handle heterogeneous tasks. The framework operates on candidate bug reports regardless of their source, so it can be paired with different bug reporters.
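As a rough illustration of what an evolving PoC knowledge base could look like, the sketch below keeps per-project notes that later PoC attempts can look up. The abstract does not describe the real data model, so the class name `PoCKnowledgeBase`, its methods, and the example notes are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PoCKnowledgeBase:
    """Toy store of reusable PoC lessons, keyed by target project (hypothetical)."""
    _notes: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, project: str, lesson: str) -> None:
        """Add a lesson learned from a PoC attempt, skipping exact duplicates."""
        if lesson not in self._notes[project]:
            self._notes[project].append(lesson)

    def lookup(self, project: str) -> list:
        """Return lessons for a project, most recently added first."""
        return list(reversed(self._notes.get(project, [])))


if __name__ == "__main__":
    kb = PoCKnowledgeBase()
    # Example notes are invented for illustration only.
    kb.record("sqlite", "drive the PoC through the command-line shell")
    kb.record("sqlite", "use a debug build so assertions are enabled")
    kb.record("sqlite", "use a debug build so assertions are enabled")  # duplicate, ignored
    print(kb.lookup("sqlite"))
    print(kb.lookup("redis"))  # nothing recorded yet
```

Keying notes by project is just one possible design; notes could equally be indexed by bug class or by PoC type (script, command sequence, crafted input).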
Practical Application and Results
To demonstrate the practicality and generality of AnyPoC, we paired it with a simple agentic bug reporter and applied the combination to 12 critical software systems spanning multiple programming languages and domains, including:
- Firefox
- Chromium
- LLVM
- OpenSSL
- SQLite
- FFmpeg
- Redis
Many of these systems comprise millions of lines of code. Compared with leading coding agents such as Claude Code and Codex, AnyPoC produced 1.3 times more valid PoCs for true-positive bug reports and rejected 9.8 times more false-positive bug reports.
Impact of AnyPoC
To date, AnyPoC has uncovered 122 new bugs, of which 105 have been confirmed and 86 have already been fixed. Notably, 45 of the generated PoCs have been adopted as official regression tests, underscoring the framework’s potential to enhance software reliability and automate the bug detection process.
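For readers who prefer rates to raw counts, the reported numbers can be converted with a few lines of Python. Only the three counts come from the text above; the function name `triage_rates` and the choice of ratios are illustrative assumptions.

```python
def triage_rates(found: int, confirmed: int, fixed: int) -> dict:
    """Convert raw bug counts into percentages, rounded to one decimal place."""
    return {
        "confirmed_of_found": round(100 * confirmed / found, 1),
        "fixed_of_found": round(100 * fixed / found, 1),
        "fixed_of_confirmed": round(100 * fixed / confirmed, 1),
    }


if __name__ == "__main__":
    # Counts as reported: 122 new bugs, 105 confirmed, 86 fixed.
    print(triage_rates(found=122, confirmed=105, fixed=86))
    # → {'confirmed_of_found': 86.1, 'fixed_of_found': 70.5, 'fixed_of_confirmed': 81.9}
```

The 45 PoCs adopted as regression tests are left out because the text does not say which set of PoCs they should be measured against.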
Conclusion
AnyPoC represents a significant advancement in automated bug detection, addressing key limitations of existing LLM-based agents. By providing a robust framework for PoC generation and validation, AnyPoC paves the way for more efficient and reliable software development processes.
