TableVision Benchmark for Spatial Reasoning in Hierarchical Tables

TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

Summary: arXiv:2604.03660v1

Abstract

Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal “Perceptual Overload,” where MLLMs struggle to maintain accurate spatial attention during implicit generation.

Introduction

To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels:

Perception
Reasoning
Analysis

These three levels are further divided into 13 sub-categories, giving a finer-grained view of the reasoning tasks associated with tables.
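As an illustration of how a stratified benchmark like this might be scored, the sketch below groups per-example results by cognitive level. The three level names come from the text above; the `accuracy_by_level` helper, the tuple layout, and any sub-category labels passed in are assumptions made for this example, since the summary does not list the 13 sub-categories or the scoring code.

```python
from collections import defaultdict
from enum import Enum


class Level(Enum):
    # The three cognitive levels named in the text.
    PERCEPTION = "Perception"
    REASONING = "Reasoning"
    ANALYSIS = "Analysis"


def accuracy_by_level(results):
    """Return accuracy per level.

    `results` is an iterable of (Level, sub_category, is_correct) tuples.
    The sub_category string is a hypothetical label; it is carried along
    but not used in this per-level aggregation.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, _sub_category, is_correct in results:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in totals}
```

Reporting accuracy per level, rather than a single aggregate, makes it easier to see whether errors concentrate in perception or in later reasoning steps.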

Methodology

Using a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths. It comprises 6,799 high-fidelity reasoning trajectories intended for rigorous evaluation of MLLMs.
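To make the idea of a spatially grounded trajectory concrete, here is a minimal sketch of what one record could look like and how a predicted region might be compared with its ground truth. The `Step` and `Trajectory` classes, their field names, and the use of intersection-over-union (IoU) as the overlap measure are all assumptions for illustration; the summary does not specify the dataset schema or the grounding metric.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Axis-aligned box in pixels: (x_min, y_min, x_max, y_max). Assumed layout.
Box = Tuple[float, float, float, float]


@dataclass
class Step:
    rationale: str  # one logical deduction in the chain
    region: Box     # table region that the deduction relies on


@dataclass
class Trajectory:
    question: str
    steps: List[Step]
    answer: str


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Because the ground-truth boxes come from rendering rather than manual annotation, a comparison like this could in principle be applied at every step of a trajectory, not only to the final answer.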

Results

Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. This gain indicates that explicit spatial grounding helps MLLMs handle complex hierarchical tables.
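One plausible reading of a two-stage decoupled framework is sketched below: a first stage that localizes the relevant table regions and a second stage that reasons over only those regions. This split, the function name `two_stage_answer`, and the `locate` and `reason` callables are assumptions for illustration; the summary does not describe how the authors' stages are actually defined.

```python
from typing import Callable, List, Tuple

# Axis-aligned box in pixels: (x_min, y_min, x_max, y_max). Assumed layout.
Box = Tuple[int, int, int, int]


def two_stage_answer(
    image: object,
    question: str,
    locate: Callable[[object, str], List[Box]],
    reason: Callable[[object, str, List[Box]], str],
) -> str:
    """Run perception and reasoning as separate calls.

    Stage 1 (`locate`) returns the regions judged relevant to the question.
    Stage 2 (`reason`) answers using those regions as explicit spatial
    constraints. Both callables are placeholders for model calls.
    """
    regions = locate(image, question)
    return reason(image, question, regions)
```

Passing the two stages in as callables keeps the sketch independent of any particular model, so either stage can be swapped or probed on its own.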

Conclusion

TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding. The benchmark addresses an existing limitation of MLLMs and offers a reference point for future research in spatially grounded reasoning. By focusing on the relationship between perception and reasoning, it aims to support progress in AI applications across sectors such as finance, healthcare, and scientific research.
