FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
arXiv:2604.04074v1
Abstract: Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper’s own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification.
Introduction
Peer review in machine learning is under strain: submission volume keeps rising while reviewer time stays limited. FactReview is one response to that pressure. Rather than judging a manuscript on its own narrative alone, it grounds each review in evidence drawn from related work and, where available, the released code.
Overview of FactReview
FactReview combines three components:
- Claim Extraction: The system identifies major claims and reported results within the submitted manuscript.
- Literature Positioning: FactReview retrieves and analyzes related work to clarify the technical position of the paper in the broader research context.
- Execution-Based Claim Verification: When code is available, the system executes the released code under defined parameters to verify central empirical claims.
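The three components above can be read as a small pipeline. The sketch below is only an illustration of that flow, not FactReview's actual code: `Claim`, `EvidenceRecord`, and `gather_evidence` are hypothetical names, and the retrieval and execution stages are passed in as plain callables so that stubs can stand in for them.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Claim:
    """A major claim extracted from the manuscript (hypothetical structure)."""
    text: str


@dataclass
class EvidenceRecord:
    """Evidence gathered for one claim by the two later stages."""
    claim: Claim
    related_work: List[str] = field(default_factory=list)
    execution_note: Optional[str] = None  # stays None when no code is released


def gather_evidence(
    claims: List[Claim],
    retrieve: Callable[[Claim], List[str]],
    execute: Optional[Callable[[Claim], str]] = None,
) -> List[EvidenceRecord]:
    """Run literature positioning for every claim; run execution only if available."""
    records = []
    for claim in claims:
        record = EvidenceRecord(claim=claim, related_work=retrieve(claim))
        if execute is not None:  # execution-based verification needs released code
            record.execution_note = execute(claim)
        records.append(record)
    return records


# Stub stages standing in for real claim extraction and literature retrieval.
claims = [Claim("The method performs well across tasks")]
records = gather_evidence(claims, retrieve=lambda c: ["related paper (placeholder)"])
print(records[0].execution_note)  # None, because no executor was supplied
```

Making the executor optional mirrors the description above: literature positioning always runs, while execution-based verification only applies when code is available.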
Review Process
Upon receiving a manuscript, FactReview generates a concise review along with an evidence report. Each major claim is assigned one of five labels:
- Supported
- Supported by the paper
- Partially supported
- In conflict
- Inconclusive
The five-way scheme records how well each claim holds up against the available evidence, rather than forcing a binary verdict.
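One way to keep the five labels consistent across an evidence report is to treat them as a closed set of values. The snippet below is a minimal sketch under that assumption; the `Label` enum and the `tally` helper are hypothetical names rather than part of FactReview, and the example labels are made up for illustration, not taken from any real review.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List


class Label(Enum):
    """The five claim labels listed above."""
    SUPPORTED = "Supported"
    SUPPORTED_BY_PAPER = "Supported by the paper"
    PARTIALLY_SUPPORTED = "Partially supported"
    IN_CONFLICT = "In conflict"
    INCONCLUSIVE = "Inconclusive"


def tally(labels: List[Label]) -> Dict[str, int]:
    """Count how many claims received each label, including labels with zero claims."""
    counts = Counter(label.value for label in labels)
    return {label.value: counts.get(label.value, 0) for label in Label}


# Illustrative labels only; these are not results from the paper.
example = [Label.SUPPORTED, Label.PARTIALLY_SUPPORTED, Label.SUPPORTED]
summary = tally(example)
for name, count in summary.items():
    print(f"{name}: {count}")
```

Using an enum means a misspelled or unknown label fails at lookup time instead of silently creating a sixth category.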
Case Study: CompGCN
In a case study on CompGCN, FactReview reproduced results that closely matched those reported for link prediction and node classification. The paper’s broader performance claim across tasks did not fully hold, however: on MUTAG graph classification, the reproduced result was 88.4%, while the strongest baseline reported in the paper was 92.6%, a gap of 4.2 percentage points. FactReview therefore labeled the claim as only partially supported.
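The MUTAG comparison comes down to simple arithmetic on the two figures quoted above. The sketch below works through it; `baseline_gap` is a hypothetical helper, and only the two reported numbers (88.4% and 92.6%) come from the case study.

```python
def baseline_gap(reproduced: float, strongest_baseline: float) -> float:
    """Return the baseline's lead in percentage points (positive if the baseline is ahead)."""
    return round(strongest_baseline - reproduced, 1)


# Figures quoted in the case study for MUTAG graph classification.
mutag_reproduced = 88.4
mutag_strongest_baseline = 92.6

gap = baseline_gap(mutag_reproduced, mutag_strongest_baseline)
beats_baseline = gap < 0  # the reproduced result wins only if the gap is negative

print(f"MUTAG gap: {gap:.1f} points; beats strongest baseline: {beats_baseline}")
```

Because the reproduced result trails the strongest baseline on this task while the other two tasks reproduce closely, the broad claim ends up only partially supported.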
Implications for AI in Peer Review
The CompGCN case suggests that AI is most useful in peer review not as a final arbiter but as a tool for gathering evidence. By helping reviewers produce better-grounded assessments, a system like FactReview can improve the reliability of the review process.
Conclusion
As demands on peer review grow, systems like FactReview offer a promising way to make evaluation in machine learning more efficient and better grounded. Its contribution is to pair literature positioning with execution-based claim verification, so that a review rests on evidence beyond the manuscript’s own narrative.
For more information and access to the code, see the FactReview GitHub repository.
