AnyPoC: Scalable LLM-Based Bug Detection with PoC Tests

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

Source: arXiv:2604.11950v1 (cross-listed)

Abstract

Recent large language model (LLM)-based agents show promise at identifying potential bugs in source code. However, their output is usually a static hypothesis that a human must validate by hand, which limits how far bug detection can be automated. We recast validation as a test generation task: synthesize an executable proof-of-concept test (PoC), such as a script, a command sequence, or a crafted input, that triggers the suspected defect. Automated PoC generation then serves as a scalable validation oracle, enabling end-to-end autonomous bug detection backed by concrete execution evidence.
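To make the idea of a PoC as a validation oracle concrete, here is a minimal sketch in Python. It runs a candidate PoC command and accepts it only when the expected evidence appears in the output of the real process. The names `run_poc` and `PocResult`, and the rule of matching a marker string in the output, are assumptions made for illustration; the source does not say how AnyPoC judges an execution trace.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PocResult:
    returncode: int
    output: str      # combined stdout and stderr of the real run
    triggered: bool  # True when the expected evidence was observed


def run_poc(command: Sequence[str], marker: str, timeout: float = 30.0) -> PocResult:
    """Execute a PoC command and check its real output for `marker`.

    `marker` is a hypothetical stand-in for whatever evidence the bug
    report predicts (a crash message, a sanitizer report, wrong output).
    """
    proc = subprocess.run(
        list(command), capture_output=True, text=True, timeout=timeout
    )
    output = proc.stdout + proc.stderr
    return PocResult(proc.returncode, output, marker in output)


if __name__ == "__main__":
    # Toy "defect": a division by zero in a one-line Python program.
    result = run_poc([sys.executable, "-c", "print(1 / 0)"], "ZeroDivisionError")
    print(result.triggered, result.returncode)
```

The point of the sketch is that the verdict comes from an actual execution rather than from a model's description of one.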

The Challenge of Naive LLM Agents

Despite their capabilities, naive LLM agents are unreliable validators. They are biased toward reporting “success” and may engage in reward hacking, producing plausible yet non-functional PoCs or even hallucinated execution traces. To counter this, we introduce AnyPoC, a general multi-agent framework designed to:

Analyze and validate candidate bug reports.
Iteratively synthesize and execute PoCs while gathering execution traces.
Independently re-execute and scrutinize PoCs to minimize hallucinations and reward hacking.
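The three steps above can be sketched as a small control loop. This is an illustrative outline under stated assumptions, not AnyPoC's implementation: the callables `analyze`, `synthesize`, `execute`, and `shows_bug`, the `Verdict` type, and the `max_rounds` limit are all hypothetical names introduced here. The sketch focuses on one idea: the trace claimed by the generator is never trusted, and only a fresh run of the PoC decides the verdict.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    accepted: bool
    poc: Optional[str] = None
    rejected_pocs: List[str] = field(default_factory=list)


def validate_report(
    report: str,
    analyze: Callable[[str], bool],
    synthesize: Callable[[str, List[str]], Tuple[str, str]],
    execute: Callable[[str], str],
    shows_bug: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> Verdict:
    # Step 1: drop reports that analysis already considers implausible.
    if not analyze(report):
        return Verdict(accepted=False)

    rejected: List[str] = []
    for _ in range(max_rounds):
        # Step 2: the generator proposes a PoC and *claims* a trace.
        poc, _claimed_trace = synthesize(report, rejected)
        # Step 3: ignore the claimed trace and re-run the PoC independently.
        real_trace = execute(poc)
        if shows_bug(report, real_trace):
            return Verdict(True, poc, rejected)
        rejected.append(poc)
    return Verdict(False, None, rejected)


if __name__ == "__main__":
    # Stub agents: the first PoC only "works" in the generator's claim.
    proposals = iter([("poc_a", "CRASH"), ("poc_b", "CRASH")])
    verdict = validate_report(
        "suspected crash in parser",
        analyze=lambda r: True,
        synthesize=lambda r, rejected: next(proposals),
        execute=lambda poc: "CRASH" if poc == "poc_b" else "exit 0",
        shows_bug=lambda r, trace: "CRASH" in trace,
    )
    print(verdict)
```

In the stubbed run, `poc_a` is rejected even though its claimed trace says `CRASH`, because the independent re-execution does not reproduce it.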

Continuous Knowledge Base Evolution

AnyPoC continuously extracts and evolves a PoC knowledge base, which lets it handle heterogeneous tasks. The framework is agnostic to where a candidate bug report comes from, so it can be paired with different bug reporters.
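One simple way to picture such a knowledge base is a store of lessons keyed by target system, written after each PoC attempt and read before the next one. The sketch below is a guess at the general shape only: `PocKnowledgeBase`, `record`, `lookup`, and the example notes are hypothetical, and the source does not describe the actual schema or how entries are extracted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PocKnowledgeBase:
    # Maps a target name to the lessons learned while writing PoCs for it.
    notes: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, target: str, note: str) -> None:
        """Store a lesson for a target, skipping exact duplicates."""
        bucket = self.notes.setdefault(target, [])
        if note not in bucket:
            bucket.append(note)

    def lookup(self, target: str) -> List[str]:
        """Return a copy of the lessons known for a target."""
        return list(self.notes.get(target, []))


if __name__ == "__main__":
    kb = PocKnowledgeBase()
    # Placeholder target and notes, not taken from the paper.
    kb.record("example-project", "build in debug mode before running PoCs")
    kb.record("example-project", "build in debug mode before running PoCs")
    kb.record("example-project", "inputs are read from a file argument")
    print(kb.lookup("example-project"))
```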

Practical Application and Results

To demonstrate the practicality and generality of AnyPoC, we paired it with a simple agentic bug reporter and applied it to 12 critical software systems spanning diverse programming languages and domains. These systems include:

Firefox
Chromium
LLVM
OpenSSL
SQLite
FFmpeg
Redis

Many of these systems comprise millions of lines of code. Compared with state-of-the-art coding agents such as Claude Code and Codex, AnyPoC produces 1.3 times more valid PoCs for true-positive bug reports and rejects 9.8 times more false-positive bug reports.

Impact of AnyPoC

To date, AnyPoC has uncovered 122 new bugs, of which 105 have been confirmed and 86 have already been fixed. In addition, 45 of the generated PoCs have been adopted as official regression tests, which suggests that the PoCs are useful to maintainers beyond the initial bug report.

Conclusion

AnyPoC represents a significant advancement in automated bug detection, addressing key limitations of existing LLM-based agents. By providing a robust framework for PoC generation and validation, AnyPoC paves the way for more efficient and reliable software development processes.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
