ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
arXiv:2604.15994v1
Abstract: Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams.
These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions.
Key Findings
Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.
Introduction to ReactBench
ReactBench targets a gap left by existing benchmarks, which focus primarily on semantic comprehension. It tests structural reasoning directly, using chemical reaction diagrams as the test material.
Why Chemical Reaction Diagrams?
Chemical reaction diagrams are real-world scientific diagrams that naturally span a wide range of topologies, which makes them a demanding test of MLLM reasoning. These structures include:
- Linear chains
- Cyclic graphs
- Branching paths
- Converging flows
Each of these structures requires both precise recognition of individual components and a coherent understanding of how those components relate across the whole diagram.
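To make these structure types concrete, here is a minimal Python sketch that treats a diagram as a directed graph and reports its endpoints, branching nodes, converging nodes, and whether it contains a cycle. The edge-list representation and the function name `describe_topology` are assumptions made for illustration; they are not taken from ReactBench or its annotation format.

```python
from collections import defaultdict


def describe_topology(edges):
    """Summarize a directed graph given as (source, target) pairs.

    Hypothetical helper for illustration only; not part of ReactBench.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        successors[src].add(dst)
        predecessors[dst].add(src)
        nodes.update((src, dst))

    # Endpoints: nodes with no outgoing edge (for example, final products).
    endpoints = sorted(n for n in nodes if not successors[n])
    # Branching: a node that feeds more than one successor.
    branching = sorted(n for n in nodes if len(successors[n]) > 1)
    # Converging: a node reached from more than one predecessor.
    converging = sorted(n for n in nodes if len(predecessors[n]) > 1)

    # Cycle check via Kahn's algorithm: repeatedly remove nodes whose
    # in-degree is zero. If some nodes can never be removed, the graph
    # contains at least one cycle.
    in_degree = {n: len(predecessors[n]) for n in nodes}
    queue = [n for n in nodes if in_degree[n] == 0]
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    return {
        "endpoints": endpoints,
        "branching": branching,
        "converging": converging,
        "has_cycle": removed < len(nodes),
    }


# Toy graph: A branches to B and C, which converge on D.
summary = describe_topology([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
print(summary)
```

Even a task as basic as counting endpoints depends on the whole edge set rather than on any single node, which is the kind of global reasoning the benchmark is meant to probe.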
Benchmark Composition
ReactBench consists of 1,618 expert-annotated question-answer pairs organized into four hierarchical task dimensions:
- Basic recognition tasks
- Intermediate reasoning tasks
- Complex structural reasoning tasks
- Holistic understanding tasks
Evaluation Results
Across the 17 MLLMs evaluated, models performed markedly better on anchor-based tasks than on holistic structural reasoning tasks. The gap exceeds 30%, and controlled ablations attribute it to reasoning rather than perception, which points to structural understanding as the capability most in need of improvement.
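As a rough illustration of how such a gap can be measured, the sketch below computes per-group accuracy from graded answers and reports the difference in percentage points. The record format, the group labels, and the function names are hypothetical, and the toy records are invented for the example rather than drawn from ReactBench results. It also assumes the reported "30%" is an absolute difference in accuracy, which the summary does not state explicitly.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy per task group from (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def gap_in_points(accuracy, higher, lower):
    """Absolute accuracy difference between two groups, in percentage points."""
    return 100.0 * (accuracy[higher] - accuracy[lower])


# Invented toy records, NOT benchmark data: (task group, answer was correct).
toy_records = [
    ("anchor", True), ("anchor", True), ("anchor", True),
    ("anchor", True), ("anchor", False),
    ("holistic", True), ("holistic", True), ("holistic", False),
    ("holistic", False), ("holistic", False),
]

accuracy = accuracy_by_group(toy_records)
print(accuracy)
print(gap_in_points(accuracy, "anchor", "holistic"))
```

Reporting the gap in percentage points rather than as a relative change keeps the figure comparable across models with different baseline accuracies.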
Conclusion
ReactBench exposes the limits of current MLLMs in structural reasoning and gives future work a concrete target for closing the gap. By focusing on the topological structure of chemical reaction diagrams, we aim to advance visual reasoning and improve how MLLMs interpret scientific diagrams.
