Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
Large language models (LLMs) have become capable tools for detecting software defects. However, the proliferation of plausible-but-wrong reports has created a precision crisis that overwhelms maintainers and undermines confidence in genuine findings. Refute-or-Promote is a methodology designed to address this problem.
Overview of the Methodology
The Refute-or-Promote methodology combines several techniques intended to make defect discovery more reliable. Its key components include:
- Stratified Context Hunting (SCH): A candidate generation technique that systematically explores various contexts to identify potential defects.
- Adversarial Kill Mandates: Agents tasked with disproving candidates at each promotion gate, so that only the most credible findings advance.
- Context Asymmetry: A strategy that leverages different perspectives to identify blind spots that may be overlooked in traditional reviews.
- Cross-Model Critic (CMC): A mechanism that employs multiple models to critique candidates, enhancing the thoroughness of the review process.
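The way kill mandates and promotion gates fit together can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate`, `Gate`, and `refute_or_promote` names, the two-gate layout, and the toy string-matching refuters are hypothetical stand-ins, not the pipeline described in the source. In a real setup each refuter would be an LLM agent (possibly from a different model family) instructed to disprove the finding.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Candidate:
    title: str
    evidence: str
    history: List[str] = field(default_factory=list)


# A refuter tries to kill a candidate: it returns a reason string on success,
# or None when it cannot refute the finding.
Refuter = Callable[[Candidate], Optional[str]]


@dataclass
class Gate:
    name: str
    refuters: List[Refuter]


def run_gate(gate: Gate, candidate: Candidate) -> bool:
    """Return True if the candidate survives every refuter at this gate."""
    for refuter in gate.refuters:
        reason = refuter(candidate)
        if reason is not None:
            candidate.history.append(f"{gate.name}: killed ({reason})")
            return False
    candidate.history.append(f"{gate.name}: promoted")
    return True


def refute_or_promote(candidates: List[Candidate], gates: List[Gate]) -> List[Candidate]:
    """Pass each candidate through the gates in order; stop at the first kill."""
    return [c for c in candidates if all(run_gate(g, c) for g in gates)]


# Hypothetical toy refuters standing in for adversarial reviewer agents.
def unreachable_code_refuter(c: Candidate) -> Optional[str]:
    return "code path is unreachable" if "unreachable" in c.evidence else None


def missing_reproducer_refuter(c: Candidate) -> Optional[str]:
    return "no reproducer supplied" if "reproducer" not in c.evidence else None


gates = [
    Gate("triage", [unreachable_code_refuter]),
    Gate("cross-model critique", [missing_reproducer_refuter]),
]
candidates = [
    Candidate("overflow in parser", "reproducer attached, reachable from public API"),
    Candidate("use-after-free in cleanup", "flagged path is unreachable in practice"),
    Candidate("timing leak in compare", "static reasoning only"),
]

survivors = refute_or_promote(candidates, gates)
for c in candidates:
    print(c.title, "->", c.history[-1])
```

The design point the sketch tries to capture is that promotion is the default only in the absence of a successful refutation: each gate is staffed by agents whose job is to kill the candidate, and a single successful kill ends its progress.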
Operational Insights and Results
Over a 31-day campaign covering seven targets, including security libraries and the ISO C++ standard, the pipeline eliminated approximately 79% of 171 candidates before they reached disclosure. In a targeted subset involving two libraries, the prospective kill rate reached 83%.
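To give a sense of scale, the reported rate can be converted into approximate counts. This is a back-of-envelope sketch: the rounded numbers it prints are estimates derived from the "approximately 79%" figure, not counts stated in the source.

```python
total_candidates = 171
kill_rate = 0.79  # "approximately 79%", so the counts below are estimates

killed = round(total_candidates * kill_rate)
survivors = total_candidates - killed

print(f"killed before disclosure: about {killed}")
print(f"survived all gates: about {survivors}")
```

In other words, roughly one candidate in five made it through every gate.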
Notable Achievements
The outcomes of the Refute-or-Promote methodology include significant contributions to the field:
- Four Common Vulnerabilities and Exposures (CVEs) were identified, three of which have been made public.
- LWG 4549 was accepted into the C++ working paper.
- Five editorial pull requests (PRs) were merged into the C++ project.
- Three compiler conformance bugs were identified and addressed.
- Eight security-related fixes were implemented without resulting in CVEs.
- An erratum related to RFC 9000 was filed and is currently under committee review.
- One or more normative compliance issues under FIPS 140-3 were identified and are currently undergoing coordinated disclosure.
Lessons Learned from Failures
Among the key lessons learned was the importance of empirical testing. A particularly instructive failure occurred when ten dedicated reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL’s CMS module. A single empirical test exposed the false positive, which led to the establishment of a mandatory empirical gate within the methodology.
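The lesson above can be expressed as a simple rule: reviewer consensus is necessary but never sufficient. The following is a minimal sketch of such a gate. The function names are hypothetical, and `oracle_is_observable` is a placeholder for a real experiment against the target; it does not call or model any actual OpenSSL API.

```python
from typing import Callable, List


def passes_empirical_gate(reviewer_votes: List[bool],
                          empirical_test: Callable[[], bool]) -> bool:
    """Promote only if reviewers endorse the finding AND a test confirms it."""
    if not reviewer_votes or not all(reviewer_votes):
        return False
    # Unanimous endorsement alone is not enough: the test has the final say.
    return empirical_test()


def oracle_is_observable() -> bool:
    # Placeholder for running the suspected attack against a real build and
    # checking whether the claimed behavior can actually be observed.
    return False


votes = [True] * 10  # ten reviewers, all endorsing the finding
promoted = passes_empirical_gate(votes, oracle_is_observable)
print(promoted)  # False: the empirical test overrides the unanimous vote
```

Making the empirical check the last step means no amount of agreement among reviewers can promote a finding that fails to reproduce.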
Broader Implications
The Refute-or-Promote methodology is not limited to defect discovery. As a preliminary transfer test, a simplified cross-family critique variant resolved five previously unsolved instances on SWE-bench Verified and one challenging task from SWE-rebench. This suggests the methodology may apply more broadly across software engineering, although the transfer evidence is still preliminary.
In conclusion, Refute-or-Promote offers a structured framework for LLM-assisted defect discovery that filters out persistent false positives and makes the findings that survive more reliable.
