DecompSR: A Dataset for Decomposed Analyses of Compositional Multihop Spatial Reasoning
Published on: arXiv:2511.02627v2
Understanding and reasoning about spatial relationships is a core capability for AI systems. A new dataset, DecompSR, provides a framework for evaluating the compositional spatial reasoning of Large Language Models (LLMs). This article covers the features, construction, and implications of the DecompSR dataset.
Overview of DecompSR
DecompSR, or decomposed spatial reasoning, is a benchmark dataset of over five million datapoints. It serves two purposes: evaluating the reasoning depth and compositionality of LLMs, and supporting fine-grained analysis of their spatial reasoning abilities. The dataset’s structure lets researchers vary several aspects of compositionality independently, exposing the strengths and weaknesses of individual models.
Key Features of DecompSR
The unique characteristics of DecompSR facilitate an in-depth analysis of compositional reasoning. The dataset allows users to independently vary the following elements:
- Productivity: The depth of reasoning, which lets researchers assess how well models handle tasks that require multiple reasoning steps (hops).
- Substitutivity: Entity and linguistic variability, which tests how well models cope when entity names or phrasing change.
- Overgeneralisation: Factors such as input order and the presence of distractors, which can significantly influence performance.
- Systematicity: How well models generalize to novel linguistic elements, which gives insight into their adaptability.
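To make the idea of independently variable axes concrete, here is a minimal sketch of a toy generator with one knob per axis. It is not the DecompSR generator: the `Config` fields, the `generate` function, the four relation phrases, and the sentence template are all assumptions made for illustration, and the real dataset's format may differ.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical relation vocabulary; the real dataset's wording may differ.
RELATIONS = ("left of", "right of", "above", "below")


@dataclass
class Config:
    depth: int = 3                           # productivity: hops in the chain
    entities: tuple = tuple("ABCDEFGHIJKL")  # substitutivity: entity names
    shuffle: bool = False                    # overgeneralisation: input order
    n_distractors: int = 0                   # overgeneralisation: irrelevant facts
    renames: Optional[dict] = None           # systematicity: novel relation words


def generate(cfg: Config, seed: int = 0) -> dict:
    """Build one toy instance: a chain of facts plus a question about its ends."""
    rng = random.Random(seed)
    needed = cfg.depth + 1 + cfg.n_distractors
    if cfg.depth < 1 or needed > len(cfg.entities):
        raise ValueError("depth must be >= 1 and entities must cover the instance")
    names = rng.sample(list(cfg.entities), needed)
    chain, extras = names[: cfg.depth + 1], names[cfg.depth + 1 :]
    renames = cfg.renames or {}

    def fact(subject: str, anchor: str) -> str:
        rel = rng.choice(RELATIONS)
        return f"{subject} is {renames.get(rel, rel)} {anchor}."

    # One fact per hop, linking consecutive entities in the chain.
    facts = [fact(b, a) for a, b in zip(chain, chain[1:])]
    # Distractors hang off the chain but are never on the queried path.
    facts += [fact(d, rng.choice(chain)) for d in extras]
    if cfg.shuffle:
        rng.shuffle(facts)
    question = f"Where is {chain[-1]} relative to {chain[0]}?"
    return {"facts": facts, "question": question}


if __name__ == "__main__":
    print(generate(Config(depth=4, n_distractors=2, shuffle=True), seed=1))
```

Because each field changes only one property of the instance, a drop in accuracy can be attributed to a single axis, for example by holding `depth` fixed while toggling `shuffle`.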
Methodology and Verification
One of the standout features of DecompSR is its procedural generation method, which ensures that the dataset is “correct by construction.” Each generated instance is also verified with a symbolic solver, which independently confirms its correctness. This makes the dataset more reliable and gives future work on spatial reasoning a verifiable reference point.
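The verification step can be pictured as follows. This is a minimal sketch of one way a symbolic check might work, not the solver used for DecompSR: it assumes facts are `(subject, relation, object)` triples over four grid directions, and the names `STEP`, `solve`, and `describe` are hypothetical.

```python
from collections import deque

# Assumed encoding: "B right A" means B sits one unit to the right of A.
STEP = {"left": (-1, 0), "right": (1, 0), "above": (0, 1), "below": (0, -1)}


def solve(facts, target, reference):
    """Return the (dx, dy) offset of `target` relative to `reference`."""
    edges = {}
    for subj, rel, obj in facts:
        dx, dy = STEP[rel]
        # Store each fact in both directions so the search can walk either way.
        edges.setdefault(obj, []).append((subj, dx, dy))
        edges.setdefault(subj, []).append((obj, -dx, -dy))

    pos = {reference: (0, 0)}
    queue = deque([reference])
    while queue:
        cur = queue.popleft()
        for nxt, dx, dy in edges.get(cur, []):
            new = (pos[cur][0] + dx, pos[cur][1] + dy)
            if nxt in pos:
                # Two paths to the same entity must agree on its position.
                if pos[nxt] != new:
                    raise ValueError(f"inconsistent facts about {nxt}")
            else:
                pos[nxt] = new
                queue.append(nxt)
    if target not in pos:
        raise ValueError(f"{target} is not connected to {reference}")
    return pos[target]


def describe(offset):
    """Turn an offset into a coarse direction label."""
    dx, dy = offset
    horiz = "right" if dx > 0 else "left" if dx < 0 else ""
    vert = "above" if dy > 0 else "below" if dy < 0 else ""
    return "-".join(p for p in (vert, horiz) if p) or "same position"


if __name__ == "__main__":
    story = [("B", "right", "A"), ("C", "above", "B"), ("D", "left", "A")]
    print(describe(solve(story, "C", "A")))
```

A generator that records its intended answer can compare it against `describe(solve(...))` and reject any instance where the two disagree. The distractor fact about `D` does not affect the result, since `D` is not on the path between `A` and `C`.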
Benchmarking and Insights
In initial benchmarking, DecompSR was used to test a range of LLMs. The models proved resilient to linguistic variation but struggled significantly with productive generalization (deeper reasoning chains) and systematic generalization (novel linguistic elements) in spatial reasoning. These results point to areas for improvement in LLM architecture and training methodologies.
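An analysis of productivity typically reports accuracy as a function of reasoning depth. The sketch below shows one way to compute that breakdown. The function name `accuracy_by_depth` and the record layout `(depth, predicted, gold)` are assumptions, and the sample records are placeholders rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_depth(records):
    """Map each reasoning depth to the fraction of correct answers.

    `records` is an iterable of (depth, predicted, gold) tuples; answers are
    compared case-insensitively after stripping surrounding whitespace.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for depth, predicted, gold in records:
        totals[depth] += 1
        correct[depth] += int(predicted.strip().lower() == gold.strip().lower())
    return {d: correct[d] / totals[d] for d in sorted(totals)}


if __name__ == "__main__":
    # Placeholder records for illustration only, not benchmark results.
    sample = [
        (1, "left", "left"),
        (1, "Right ", "right"),
        (2, "above", "below"),
        (2, "below", "below"),
    ]
    print(accuracy_by_depth(sample))
```

Grouping by any other controlled factor, such as the number of distractors, works the same way: replace the first element of each record with that factor.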
Conclusion
DecompSR represents a significant advancement in the evaluation of compositional spatial reasoning within artificial intelligence. By providing a robust, verifiable, and multifaceted dataset, it opens new avenues for research and development in LLMs. As AI continues to integrate into various applications, understanding and improving spatial reasoning capabilities will be essential for creating more intelligent and versatile systems.
For more information, refer to the full paper available on arXiv.
