AnyPoC: Scalable LLM-Based Bug Detection with PoC Tests

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

Source: arXiv:2604.11950v1 (cross-listed)

Abstract

Recent large language model (LLM)-based agents have shown promise in identifying potential bugs in source code. However, these agents often produce static hypotheses that require manual validation, which limits how far bug detection can be automated. To turn validation into an actionable task, we propose a test generation approach: synthesizing executable proof-of-concept tests (PoCs), such as scripts, command sequences, or crafted inputs, that trigger the suspected defects. Automated PoC generation serves as a scalable validation oracle and enables end-to-end autonomous bug detection by providing concrete execution evidence.
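As a minimal sketch of what "concrete execution evidence" could look like, the snippet below runs a PoC command and records its exit code and output. The names `ExecutionEvidence`, `run_poc`, and `triggers_defect`, and the choice of a non-zero exit code as the trigger signal, are assumptions made for illustration; they are not taken from the AnyPoC paper.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExecutionEvidence:
    """Hypothetical record of one PoC run (not AnyPoC's actual format)."""
    command: list
    exit_code: int
    stdout: str
    stderr: str


def run_poc(command, timeout=30.0):
    """Run a PoC command and capture what actually happened."""
    completed = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return ExecutionEvidence(
        command=list(command),
        exit_code=completed.returncode,
        stdout=completed.stdout,
        stderr=completed.stderr,
    )


def triggers_defect(evidence):
    """Assumed oracle: a non-zero exit code counts as triggering the defect."""
    return evidence.exit_code != 0


# Toy stand-in for a real PoC: a one-line script that fails an assertion.
toy_poc = [sys.executable, "-c", "assert 1 + 1 == 3, 'suspected defect'"]
evidence = run_poc(toy_poc)
```

A real oracle would likely need a richer signal than the exit code, for example a sanitizer report or a specific wrong output, but the idea is the same: the verdict comes from an actual run rather than from the agent's description of one.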

The Challenge of Naive LLM Agents

Despite their capabilities, naive LLM agents are unreliable validators. They tend to favor “successful” outcomes and may engage in reward hacking, producing plausible yet non-functional PoCs or even hallucinated execution traces. To counter this, we introduce AnyPoC, a general multi-agent framework designed to:

  • Analyze and validate candidate bug reports.
  • Iteratively synthesize and execute PoCs while gathering execution traces.
  • Independently re-execute and scrutinize PoCs to minimize hallucinations and reward hacking.
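The iterate-then-recheck idea in the list above can be sketched as follows. This is a toy under stated assumptions, not AnyPoC's implementation: the LLM agent is replaced by a plain function (`toy_synthesizer`), the report analysis step is omitted, and "the bug was triggered" is approximated by a marker string appearing in the re-executed output. All names here are hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class PocAttempt:
    """A PoC plus the trace the generating agent claims it produced."""
    command: list
    claimed_output: str


def execute(command):
    """Run a command and return its combined stdout and stderr."""
    done = subprocess.run(command, capture_output=True, text=True, timeout=30)
    return done.stdout + done.stderr


def independently_verify(attempt, expected_marker):
    """Re-run the PoC from scratch instead of trusting the claimed trace."""
    actual = execute(attempt.command)
    return expected_marker in actual


def validate_report(report, synthesize, expected_marker, max_rounds=3):
    """Request PoCs until one survives re-execution or the budget runs out."""
    feedback = ""
    for _ in range(max_rounds):
        attempt = synthesize(report, feedback)
        if independently_verify(attempt, expected_marker):
            return attempt
        feedback = "re-execution did not show: " + expected_marker
    return None


def toy_synthesizer(report, feedback):
    """Stand-in for an LLM agent; its first answer carries a made-up trace."""
    if not feedback:
        # Claims a crash, but the command itself does nothing.
        return PocAttempt([sys.executable, "-c", "pass"], "ZeroDivisionError")
    return PocAttempt([sys.executable, "-c", "1 / 0"], "ZeroDivisionError")


result = validate_report(
    "hypothetical report: division by zero",
    toy_synthesizer,
    "ZeroDivisionError",
)
```

The first attempt is rejected because the claimed trace never appears when the command is actually run; only the second attempt, which really raises the error, is accepted. A synthesizer that only ever hallucinates would exhaust the round budget and yield `None`, which corresponds to rejecting the report.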

Continuous Knowledge Base Evolution

AnyPoC continuously extracts and evolves a PoC knowledge base, which lets it handle a variety of tasks. The framework operates on candidate bug reports regardless of their origin, so it can be integrated with different bug reporters.
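The summary does not say how the knowledge base is structured, so the following is only one plausible shape: a store of short lessons keyed by target project that later PoC attempts can look up. The class name, its methods, and the example notes are all invented for illustration.

```python
from collections import defaultdict


class PocKnowledgeBase:
    """Hypothetical store of reusable PoC lessons, keyed by target project."""

    def __init__(self):
        self._notes = defaultdict(list)

    def record(self, project, note):
        """Add a lesson from a finished PoC attempt, skipping duplicates."""
        if note not in self._notes[project]:
            self._notes[project].append(note)

    def lookup(self, project):
        """Return the lessons recorded so far for a project, oldest first."""
        return list(self._notes.get(project, []))


kb = PocKnowledgeBase()
# Placeholder notes; real entries would be extracted from actual PoC runs.
kb.record("example-project", "placeholder: how the test binary was built")
kb.record("example-project", "placeholder: which input format the tool accepts")
kb.record("example-project", "placeholder: how the test binary was built")

notes = kb.lookup("example-project")
```

Here `notes` holds two entries, because the repeated lesson is stored only once. An evolving knowledge base would presumably also revise or retire entries over time, which this sketch does not attempt.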

Practical Application and Results

To demonstrate the practicality and versatility of AnyPoC, we paired it with a simple agentic bug reporter and applied both to 12 critical software systems spanning various programming languages and domains. These systems include:

  • Firefox
  • Chromium
  • LLVM
  • OpenSSL
  • SQLite
  • FFmpeg
  • Redis

Many of these systems comprise millions of lines of code. Compared with leading coding agents such as Claude Code and Codex, AnyPoC produced 1.3 times more valid PoCs for true-positive bug reports and rejected 9.8 times more false-positive bug reports.

Impact of AnyPoC

To date, AnyPoC has uncovered 122 new bugs, with 105 confirmed and 86 already fixed. Notably, 45 of the generated PoCs have been adopted as official regression tests, underscoring the framework’s potential to improve software reliability and automate bug detection.

Conclusion

AnyPoC addresses a key limitation of existing LLM-based bug detection agents: their hypotheses need manual validation. By generating PoCs and independently re-executing them, it replaces that manual step with concrete execution evidence and makes end-to-end automated bug detection more reliable.


Lazarus Omolua, https://richlyai.com/blog