FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
Summary: arXiv:2604.04074v1
Rising submission volumes in machine learning research have put traditional peer review under growing pressure. With reviewer time and resources limited, many large language model (LLM)-based reviewing systems have emerged. However, these systems rely predominantly on the manuscript’s own narrative to generate feedback. As a result, their outputs are sensitive to presentation quality and overlook evidence that lies in related work or released code.
To address these shortcomings, the authors introduce FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Its goal is to ground the evaluation of a submission in external evidence rather than in the manuscript's narrative alone.
Key Features of FactReview
- Claim Extraction: FactReview systematically identifies major claims and reported results within a submitted manuscript.
- Literature Positioning: The system retrieves and analyzes nearby work to clarify the technical position of the submitted paper in the context of existing literature.
- Execution-Based Claim Verification: When code is available, FactReview executes the released repository under controlled conditions to test the central empirical claims made in the manuscript.
- Concise Review Generation: After evaluating the submission, FactReview produces a concise review and an evidence report, labeling each significant claim with one of five categories: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive.
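The five labels and the evidence report described in the list above can be pictured as a small data structure. The following is a minimal sketch, not FactReview's actual implementation: the label strings come from the summary, while the names `Label`, `ClaimRecord`, and `format_report`, the report layout, and the placeholder claim text are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Label(Enum):
    """The five claim labels named in the summary."""
    SUPPORTED = "Supported"
    SUPPORTED_BY_PAPER = "Supported by the paper"
    PARTIALLY_SUPPORTED = "Partially supported"
    IN_CONFLICT = "In conflict"
    INCONCLUSIVE = "Inconclusive"


@dataclass
class ClaimRecord:
    """One extracted claim with its label and evidence notes (hypothetical schema)."""
    claim: str
    label: Label
    evidence: List[str] = field(default_factory=list)


def format_report(records: List[ClaimRecord]) -> str:
    """Render records as a plain-text evidence report (layout is an assumption)."""
    lines = []
    for record in records:
        lines.append(f"[{record.label.value}] {record.claim}")
        for note in record.evidence:
            lines.append(f"  - {note}")
    return "\n".join(lines)


# Placeholder record for illustration only; it is not taken from the paper.
example = ClaimRecord(
    claim="placeholder claim",
    label=Label.INCONCLUSIVE,
    evidence=["placeholder evidence note"],
)
report = format_report([example])
```

Keeping the label as an enum rather than free text means a report can only carry one of the five categories, which makes downstream counting and filtering straightforward.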
Case Study: CompGCN
The authors applied FactReview to the CompGCN paper as a case study. The system reproduced results that closely matched the reported numbers for link prediction and node classification. However, it also showed that the paper's broader claim of strong performance across tasks was not fully substantiated. On the MUTAG graph classification task, the reproduced result was 88.4%, while the strongest baseline reported in the original paper was 92.6%, a gap of 4.2 percentage points in the baseline's favor. Because two of the three task-level checks held and one did not, the broader claim was labeled "Partially supported."
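The reasoning in the case study can be worked through in a few lines. The sketch below is illustrative only: the two MUTAG figures are the ones quoted above, the two `True` entries encode the statement that the link prediction and node classification results were reproduced, and the aggregation rule in `label_broad_claim` is an assumption, not the rule FactReview applies.

```python
def gap_in_points(reproduced: float, baseline: float) -> float:
    """Percentage-point gap between a baseline and a reproduced result."""
    return round(baseline - reproduced, 1)


def label_broad_claim(checks: dict) -> str:
    """Aggregate per-task checks into one label (hypothetical rule)."""
    if not checks:
        return "Inconclusive"          # nothing could be checked
    held = sum(1 for ok in checks.values() if ok)
    if held == len(checks):
        return "Supported"             # every check held
    if held == 0:
        return "In conflict"           # no check held
    return "Partially supported"       # mixed outcome


# Figures quoted in the case study.
reproduced_mutag = 88.4
strongest_baseline_mutag = 92.6

checks = {
    # Reproduced results closely matched the reported ones.
    "link prediction": True,
    "node classification": True,
    # The broad claim would need the method to at least match the baseline.
    "graph classification (MUTAG)": reproduced_mutag >= strongest_baseline_mutag,
}

gap = gap_in_points(reproduced_mutag, strongest_baseline_mutag)
label = label_broad_claim(checks)
print(f"MUTAG gap: {gap} points; broad claim: {label}")
```

Under this assumed rule, a single failed check is enough to move a broad claim from "Supported" to "Partially supported", which mirrors the outcome reported for CompGCN.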
Implications for Peer Review
The CompGCN case study suggests that AI is most useful in peer review as an evidence-gathering aid rather than as a final decision-maker. Tools like FactReview can collect evidence and help human reviewers produce more evidence-grounded assessments. By checking claims systematically against related work and released code, FactReview aims to improve the quality and reliability of peer review in machine learning.
Access and Future Work
The code for FactReview is publicly available at https://github.com/DEFENSE-SEU/Review-Assistant. As demand for efficient and accurate peer review grows, systems like FactReview may inform future tooling for scholarly review.
