Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning
arXiv:2603.28651v1
Abstract
With the rapid progress of multimodal large language models (MLLMs), AI already performs well at literature retrieval and certain reasoning tasks, serving as a capable assistant to human researchers. However, it remains far from achieving autonomous research capabilities. The fundamental reason for this limitation is that current efforts in academic paper reasoning are largely confined to a search-oriented paradigm, which is centered on pre-specified targets. This paradigm primarily focuses on relevance retrieval and struggles to support researcher-style full-document understanding, reasoning, and verification.
Introducing ScholScan
To address these challenges, we propose ScholScan, a new benchmark for academic paper reasoning. ScholScan introduces a scan-oriented task setting that requires models to read and cross-check entire papers, akin to how human researchers operate. The goal is to enable AI systems to scan documents to identify consistency issues and validate information effectively.
Benchmark Composition
The ScholScan benchmark comprises:
- 1,800 carefully annotated questions: These questions are drawn from nine error categories across 13 natural-science domains.
- 715 academic papers: The source documents from which the questions are drawn, spanning the same 13 natural-science domains.
- Detailed annotations: Annotations are provided for evidence localization and reasoning traces.
- A unified evaluation protocol: This ensures consistent assessment across different models and configurations.
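To make the composition above concrete, here is a minimal sketch of how one annotated question might be represented in code. The summary does not describe the actual data format, so every class name, field name, and placeholder value below is an assumption for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class EvidenceSpan:
    """Hypothetical pointer to where supporting evidence sits in a paper."""
    page: int
    description: str


@dataclass(frozen=True)
class ScanItem:
    """One annotated question. All field names are illustrative assumptions."""
    question_id: str
    paper_id: str
    domain: str            # one of the 13 natural-science domains
    error_category: str    # one of the nine error categories
    question: str
    evidence: Tuple[EvidenceSpan, ...] = ()   # evidence localization
    reasoning_trace: Tuple[str, ...] = ()     # annotated reasoning steps


def count_by(items: Iterable[ScanItem], attribute: str) -> Counter:
    """Tally items by an attribute such as 'error_category' or 'domain'."""
    return Counter(getattr(item, attribute) for item in items)


# Placeholder items; the category and domain labels are not from the paper.
items = [
    ScanItem("q1", "paper-001", "domain_a", "category_x", "placeholder question",
             evidence=(EvidenceSpan(page=3, description="placeholder"),),
             reasoning_trace=("step 1", "step 2")),
    ScanItem("q2", "paper-001", "domain_a", "category_y", "placeholder question"),
    ScanItem("q3", "paper-002", "domain_b", "category_x", "placeholder question"),
]
per_category = count_by(items, "error_category")
```

Keeping evidence locations and reasoning traces on the record itself would let an evaluator check not only the final answer but also where a model looked, which is one plausible reading of why both annotation types are provided.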
Model Assessment and Findings
In our comprehensive evaluation, we assessed 15 models across 24 input configurations. The analysis focused on the capabilities of MLLMs across all error categories included in the ScholScan benchmark. Notably, we observed that:
- Retrieval-Augmented Generation (RAG) methods: These techniques showed no significant improvements in performance when applied to the scan-oriented tasks.
- Systematic deficiencies: Current MLLMs exhibited significant shortcomings in handling the complexities associated with scan-oriented tasks.
- Challenge of ScholScan: The findings underscore the challenges posed by the ScholScan benchmark, highlighting the need for further advancements in MLLM capabilities.
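An evaluation of many models under many input configurations, broken down by error category, can be sketched as follows. This is only an illustration under stated assumptions: the summary does not specify the scoring rule of the unified protocol, so exact-match scoring, the `Example` tuple layout, and the function names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Assumed layout: (error_category, model_input, gold_answer).
Example = Tuple[str, str, str]
Predictor = Callable[[str], str]


def accuracy_by_category(examples: Iterable[Example],
                         predict: Predictor) -> Dict[str, float]:
    """Exact-match accuracy per error category (scoring rule is an assumption)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, model_input, gold in examples:
        total[category] += 1
        if predict(model_input) == gold:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


def run_grid(examples: Iterable[Example],
             systems: Dict[str, Predictor]) -> Dict[str, Dict[str, float]]:
    """Score every named (model, input configuration) pair on the same examples."""
    examples = list(examples)  # reuse the same data for every system
    return {name: accuracy_by_category(examples, predict)
            for name, predict in systems.items()}


# Toy data and stand-in predictors; a real run would call actual models.
examples = [
    ("category_x", "input 1", "yes"),
    ("category_x", "input 2", "no"),
    ("category_y", "input 3", "yes"),
]
systems = {
    "always_yes/config_a": lambda _input: "yes",
    "always_no/config_a": lambda _input: "no",
}
report = run_grid(examples, systems)
```

Scoring every system on an identical example list is what makes the per-category numbers comparable across models and configurations, including comparisons with and without retrieval augmentation.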
Conclusion
We believe that ScholScan will emerge as a leading and representative work within the new scan-oriented task paradigm. By shifting the focus from search-oriented methods to full-document scanning, we aim to strengthen the research capabilities of MLLMs and better support researchers. The path toward autonomous research may be long, but benchmarks like ScholScan are essential in driving the field forward.
