Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
arXiv:2602.11635v2
Abstract
Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy, but we find that most leading MLLMs fail to reach even 60% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models.
Introduction
MLLM performance has drawn significant attention in recent years, particularly on perception and natural language understanding tasks. Mathematical spatial reasoning poses a different challenge: the model must parse and manipulate two- and three-dimensional relations rather than simply recognize what an image contains. This article examines where current MLLMs fall short in this area.
Introducing MathSpatial
To investigate the gap in spatial reasoning capabilities, we present MathSpatial, the first large-scale, systematic dataset dedicated to mathematical spatial reasoning in MLLMs. MathSpatial provides two complementary subsets:
- MathSpatial-Bench: A rigorously curated evaluation set of 2,000 problems spanning 3 categories and 11 subtypes, designed to isolate spatial reasoning from perceptual noise.
- MathSpatial-Corpus: A training set of 8,000 problems equipped with verified solutions and structured reasoning traces.
Quality Control Measures
All problems in MathSpatial are sourced from authentic educational materials and undergo multi-stage quality control, which includes:
- Deduplication
- Geometric consistency checking
- Cross-validated solution verification
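As an illustration of the first stage, deduplication can be sketched as exact matching on normalized problem text. The source does not describe how MathSpatial actually deduplicates problems, so the approach, the `question` field, and the helper names below are assumptions made for this example only; the sample records are toy data.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not hide duplicates."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(problems: list[dict]) -> list[dict]:
    """Keep the first occurrence of each problem, comparing hashes of the normalized question text.

    Assumption: each record has a "question" string (hypothetical schema).
    """
    seen: set[str] = set()
    unique: list[dict] = []
    for problem in problems:
        key = hashlib.sha256(normalize(problem["question"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(problem)
    return unique


# Toy records, not taken from the dataset.
sample = [
    {"id": 1, "question": "How many faces does a cube have?"},
    {"id": 2, "question": "  how many faces   does a cube have? "},
    {"id": 3, "question": "How many edges does a cube have?"},
]
kept = deduplicate(sample)
```

Exact matching only catches copies that differ in case or spacing. Paraphrased or re-drawn problems would need a fuzzier comparison, which this sketch does not attempt.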
Benchmarking MLLMs
We benchmarked 16 leading MLLMs on MathSpatial-Bench and found that spatial reasoning remains a fundamental bottleneck. Even GPT-5 lags behind human performance by over 35 percentage points, with particularly poor results on abstract deduction tasks. This highlights the critical need for advancements in spatial reasoning capabilities within MLLMs.
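The comparison above amounts to per-category accuracy plus a gap to the human baseline, measured in percentage points. The sketch below shows that arithmetic under stated assumptions: the record fields (`category`, `correct`), the category labels, and the toy results are all hypothetical, and 0.95 is used as the human baseline only because the abstract reports human accuracy of over 95%.

```python
from collections import defaultdict

# Assumption: the "over 95%" human accuracy from the abstract, used as a lower bound.
HUMAN_ACCURACY = 0.95


def accuracy_by_category(results: list[dict]) -> dict[str, float]:
    """Return the fraction of correct answers per category."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        hits[record["category"]] += int(record["correct"])
    return {category: hits[category] / totals[category] for category in totals}


def gap_in_points(model_accuracy: float, human_accuracy: float = HUMAN_ACCURACY) -> float:
    """Human-minus-model accuracy, expressed in percentage points."""
    return round((human_accuracy - model_accuracy) * 100, 1)


# Toy results with placeholder category labels, not real benchmark output.
results = [
    {"category": "abstract_deduction", "correct": True},
    {"category": "abstract_deduction", "correct": False},
    {"category": "category_a", "correct": True},
    {"category": "category_a", "correct": True},
]
per_category = accuracy_by_category(results)
gaps = {category: gap_in_points(acc) for category, acc in per_category.items()}
```

Reporting the gap in percentage points rather than as a relative change keeps it directly comparable to the figure of over 35 points quoted for GPT-5.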
Implications for Future Research
Our findings indicate that training on MathSpatial-Corpus yields consistent improvements across model families. This demonstrates the dataset’s practical value for advancing spatial reasoning capabilities in MLLMs.
Conclusion
The evaluation of MLLMs through the lens of mathematical spatial reasoning reveals significant limitations. As we move forward, the development of datasets like MathSpatial will be essential for enhancing the capabilities of MLLMs and bridging the gap between human and machine reasoning.
Public Access
MathSpatial is publicly available at https://shuolucs.github.io/MathSpatial.
