FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
In recent years, peer review in machine learning has come under strain: submission volumes have surged while the pool of available reviewers remains limited. Existing automated reviewing systems, particularly those powered by large language models (LLMs), often rely solely on the content of the submitted manuscript to generate their assessments. As a result, they account for neither the broader context of related work nor the evidence available in supplementary materials such as released code.
To address these issues, researchers have introduced FactReview, an evidence-grounded reviewing system. FactReview combines claim extraction, literature positioning, and execution-based claim verification, so that a review rests on the related literature and on executed code as well as on the manuscript.
Key Features of FactReview
- Claim Extraction: FactReview identifies and extracts major claims and results reported in the paper, allowing for focused analysis.
- Literature Positioning: The system retrieves related work to provide context and clarify the technical positioning of the submission within the existing body of literature.
- Execution-Based Claim Verification: When code is available, FactReview executes the provided repository under controlled conditions to empirically test the central claims made by the authors.
Upon completing its analysis, FactReview generates a concise review along with an evidence report. Each major claim is assigned one of five labels:
- Supported
- Supported by the paper
- Partially supported
- In conflict
- Inconclusive
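As an illustration of how such an evidence report could be represented, here is a minimal Python sketch. Only the five label strings come from the description above; the names `Verdict`, `ClaimRecord`, and `format_report`, and the report layout, are hypothetical and are not taken from the FactReview codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """The five labels listed above; the enum itself is an illustrative assumption."""
    SUPPORTED = "Supported"
    SUPPORTED_BY_PAPER = "Supported by the paper"
    PARTIALLY_SUPPORTED = "Partially supported"
    IN_CONFLICT = "In conflict"
    INCONCLUSIVE = "Inconclusive"


@dataclass
class ClaimRecord:
    """One entry of a hypothetical evidence report."""
    claim: str        # the claim as extracted from the paper
    verdict: Verdict  # one of the five labels
    evidence: str     # short note on what the label is based on


def format_report(records):
    """Render one line per claim: [label] claim (evidence)."""
    return "\n".join(
        f"[{r.verdict.value}] {r.claim} ({r.evidence})" for r in records
    )


if __name__ == "__main__":
    # Placeholder claim text, used only to show the output shape.
    demo = [ClaimRecord("Example claim", Verdict.INCONCLUSIVE, "no code available")]
    print(format_report(demo))
```

Keeping the labels in an enum rather than as free-form strings means a report can only ever carry one of the five values.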
Case Study: CompGCN
FactReview was demonstrated in a case study on the CompGCN paper. The system reproduced link prediction and node classification results that closely matched those reported by the authors. It also exposed a weakness in the paper's broader performance claim: on the MUTAG graph classification task, FactReview reproduced a result of 88.4%, whereas the strongest baseline reported in the original paper reached 92.6%. Because the claim held on some tasks but not on this one, FactReview labeled the broader performance claim as only partially supported.
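The arithmetic behind this verdict can be sketched in a few lines of Python. The two accuracy figures come from the case study above; the function names and the aggregation rule (all tasks hold, some hold, none hold) are assumptions made for illustration and may differ from how FactReview actually assigns labels.

```python
def accuracy_gap(reproduced, best_baseline):
    """Gap in percentage points between the strongest baseline and the reproduced result."""
    return round(best_baseline - reproduced, 1)


def label_broad_claim(task_outcomes):
    """Hypothetical aggregation rule, not FactReview's documented logic.

    task_outcomes maps a task name to True if the claim held on that task.
    """
    if not task_outcomes:
        return "Inconclusive"
    held = sum(task_outcomes.values())
    if held == len(task_outcomes):
        return "Supported"
    if held == 0:
        return "In conflict"
    return "Partially supported"


# Figures from the CompGCN case study.
reproduced_mutag = 88.4
strongest_baseline = 92.6

outcomes = {
    "link prediction": True,       # reproduced close to the reported results
    "node classification": True,   # reproduced close to the reported results
    "graph classification (MUTAG)": reproduced_mutag >= strongest_baseline,
}

if __name__ == "__main__":
    print(accuracy_gap(reproduced_mutag, strongest_baseline))
    print(label_broad_claim(outcomes))
```

Under these assumptions, the reproduced MUTAG result trails the strongest baseline by 4.2 percentage points, and the mix of two tasks that hold and one that does not yields "Partially supported".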
The Role of AI in Peer Review
The case study suggests that AI should not be treated as the final decision-maker in peer review. Instead, it is most useful as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. Systems like FactReview can thereby improve the quality and rigor of peer review.
For those interested in exploring FactReview further, the code is publicly available in the project's GitHub repository.
