ReactBench: Benchmarking Structural Reasoning in MLLMs

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Summary: arXiv:2604.15994v1

Abstract: Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams.

These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions.

Key Findings

Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.

Introduction to ReactBench

ReactBench targets a gap left by existing benchmarks, which focus primarily on semantic comprehension. It tests structural reasoning instead, using chemical reaction diagrams as the test material.

Why Chemical Reaction Diagrams?

Chemical reaction diagrams are inherently complex, featuring a variety of structures that challenge the reasoning capabilities of MLLMs. These structures include:

Linear chains
Cyclic graphs
Branching paths
Converging flows

Handling each of these structures requires a model not only to identify individual components but also to understand how they relate to one another across the whole diagram.
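To make these structure types concrete, here is a minimal sketch that treats a diagram as a directed graph and reports its endpoints, branching points, converging points, and whether it contains a cycle. The function name `summarize_topology`, the edge-list representation, and the toy node labels are assumptions made for illustration; they are not taken from the ReactBench paper or its data format.

```python
from collections import defaultdict


def summarize_topology(edges):
    """Summarize a directed graph given as (source, target) pairs.

    Hypothetical helper for illustration only; not part of ReactBench.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        successors[src].add(dst)
        predecessors[dst].add(src)
        nodes.update((src, dst))

    # Endpoints have no outgoing edge; branch points have several outgoing
    # edges; merge points have several incoming edges.
    endpoints = sorted(n for n in nodes if not successors.get(n))
    branch_points = sorted(n for n in nodes if len(successors.get(n, ())) > 1)
    merge_points = sorted(n for n in nodes if len(predecessors.get(n, ())) > 1)

    # Kahn's algorithm: repeatedly remove nodes with in-degree zero.
    # Any node that can never be removed lies on or behind a cycle.
    indegree = {n: len(predecessors.get(n, ())) for n in nodes}
    queue = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for nxt in successors.get(node, ()):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    return {
        "endpoints": endpoints,
        "branch_points": branch_points,
        "merge_points": merge_points,
        "has_cycle": removed < len(nodes),
    }


# Toy diagram: A -> B, then B splits into C and D, which both lead to E.
branching = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")]
print(summarize_topology(branching))

# Toy cyclic diagram: X -> Y -> Z -> X.
cyclic = [("X", "Y"), ("Y", "Z"), ("Z", "X")]
print(summarize_topology(cyclic))
```

For a program with the graph already in hand, counting endpoints is a one-line filter. The benchmark's point is that a model has to recover the same answer from the rendered image alone.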

Benchmark Composition

ReactBench consists of 1,618 expert-annotated question-answer pairs organized across four hierarchical task dimensions:

Basic recognition tasks
Intermediate reasoning tasks
Complex structural reasoning tasks
Holistic understanding tasks

Evaluation Results

In the evaluation across 17 MLLMs, models performed significantly better on anchor-based tasks than on holistic structural reasoning tasks. The gap, exceeding 30%, points to a need for methods that improve structural understanding in MLLMs.
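As a rough illustration of how such a gap could be computed from per-question results, the sketch below averages correctness within each task category and takes the difference. The record layout, the category labels `anchor` and `holistic`, and the toy records are all invented for this example; they are not ReactBench data or results. The source does not say whether the 30% figure is in percentage points or relative terms, so the sketch assumes percentage points.

```python
def accuracy(records, category):
    """Fraction of correct answers among records in the given category."""
    outcomes = [r["correct"] for r in records if r["category"] == category]
    if not outcomes:
        raise ValueError(f"no records for category {category!r}")
    return sum(outcomes) / len(outcomes)


def performance_gap(records, easy="anchor", hard="holistic"):
    """Accuracy difference between two categories, in percentage points."""
    return 100 * (accuracy(records, easy) - accuracy(records, hard))


# Invented toy records, for illustration only.
toy_records = [
    {"category": "anchor", "correct": True},
    {"category": "anchor", "correct": True},
    {"category": "anchor", "correct": True},
    {"category": "anchor", "correct": False},
    {"category": "holistic", "correct": True},
    {"category": "holistic", "correct": False},
    {"category": "holistic", "correct": False},
    {"category": "holistic", "correct": False},
]

print(performance_gap(toy_records))  # 50.0 on this toy data
```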

Conclusion

ReactBench not only highlights the limitations of current MLLMs in structural reasoning but also sets the stage for future research aimed at bridging this gap. By focusing on complex topological structures within chemical reaction diagrams, we aim to advance the field of visual reasoning and improve the capabilities of MLLMs in understanding complex scientific data.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
