Evaluating MLLMs' Mathematical Spatial Reasoning Skills

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

Source: arXiv:2602.11635v2

Abstract

Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy, but we find that most leading MLLMs fail to reach even 60% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models.

Introduction

MLLMs have drawn considerable attention for their performance on perception and natural language understanding tasks. Mathematical spatial reasoning poses a different challenge: the model must parse and manipulate two- and three-dimensional relations, not just describe what it sees. This article examines how current MLLMs fall short in this area.

Introducing MathSpatial

To investigate this gap in spatial reasoning, we present MathSpatial, the first large-scale, systematic dataset dedicated to mathematical spatial reasoning in MLLMs. MathSpatial provides two complementary subsets:

MathSpatial-Bench: A rigorously curated evaluation set of 2,000 problems spanning 3 categories and 11 subtypes, designed to isolate spatial reasoning from perceptual noise.
MathSpatial-Corpus: A training set of 8,000 problems equipped with verified solutions and structured reasoning traces.

Quality Control Measures

All problems in MathSpatial are sourced from authentic educational materials and undergo multi-stage quality control, which includes:

Deduplication
Geometric consistency checking
Cross-validated solution verification
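
The source does not describe how each stage is implemented. As a rough illustration of the first stage only, the sketch below removes duplicate problem statements by hashing a normalized form of the text. The function names `normalize` and `deduplicate`, and the normalization rules themselves, are assumptions made for this example and are not taken from MathSpatial.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def deduplicate(problems: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized problem statement.

    Hypothetical helper: the real pipeline may use a different (for example,
    fuzzy or image-aware) notion of a duplicate.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for problem in problems:
        key = hashlib.sha256(normalize(problem).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(problem)
    return unique


if __name__ == "__main__":
    # Toy inputs for illustration; not drawn from the dataset.
    sample = [
        "Find the volume of the cube.",
        "find the  volume of the cube",
        "Count the faces of the prism.",
    ]
    print(deduplicate(sample))
```

Exact-match hashing only catches near-verbatim repeats. Problems that differ in wording or only in their figures would need a looser similarity measure.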

Benchmarking MLLMs

We benchmarked 16 leading MLLMs on MathSpatial-Bench and found that spatial reasoning remains a fundamental bottleneck. Even GPT-5 lags behind human performance by over 35 percentage points, with particularly poor results on abstract deduction tasks. This highlights the critical need for advancements in spatial reasoning capabilities within MLLMs.
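
To make the percentage-point comparison concrete, here is a minimal sketch of how per-category accuracy and the gap to a human baseline could be computed. The names `accuracy_by_category` and `gap_to_human`, the `(category, is_correct)` input format, and the toy data are all assumptions for illustration. The only figures taken from the source are the bounds quoted in the abstract: humans above 95%, most models below 60%.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Map each category to its accuracy in percent.

    `results` is an iterable of (category, is_correct) pairs; this input
    format is an assumption, not the benchmark's actual output schema.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


def gap_to_human(model_accuracy: float, human_accuracy: float) -> float:
    """Gap in percentage points (a difference of percentages, not a ratio)."""
    return human_accuracy - model_accuracy


if __name__ == "__main__":
    # Toy graded answers with placeholder category names; not real results.
    toy_results = [("category_a", True), ("category_a", False), ("category_b", True)]
    print(accuracy_by_category(toy_results))

    # Bounds from the abstract: a model at 60% against humans at 95%.
    print(gap_to_human(60.0, 95.0))
```

Because the abstract gives 95% as a lower bound for humans and 60% as an upper bound for most models, the gap for those models is at least 35 percentage points.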

Implications for Future Research

Our findings indicate that training on MathSpatial-Corpus yields consistent improvements across model families. This demonstrates the dataset’s practical value for advancing spatial reasoning capabilities in MLLMs.

Conclusion

The evaluation of MLLMs through the lens of mathematical spatial reasoning reveals significant limitations. As we move forward, the development of datasets like MathSpatial will be essential for enhancing the capabilities of MLLMs and bridging the gap between human and machine reasoning.

Public Access

MathSpatial is publicly available at https://shuolucs.github.io/MathSpatial.
