Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym
arXiv:2604.09338v1
Abstract: Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking.
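To make the "sequential decision task with optional backtracking" concrete, here is a minimal sketch of what such an environment could look like. It is not the Spatial-Gym implementation. The class name `GridPathEnv`, the action names, the wall-only grid encoding, and the reward of 1.0 on reaching the goal are all assumptions made for illustration. The sketch mirrors the Gymnasium calling convention (`reset` returns an observation and an info dict; `step` returns observation, reward, terminated, truncated, info) in plain Python so it runs without extra dependencies.

```python
from dataclasses import dataclass, field

# Hypothetical action names; the real Spatial-Gym action space may differ.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
BACKTRACK = "back"


@dataclass
class GridPathEnv:
    """Toy 2D-grid pathfinding puzzle with a Gymnasium-style interface."""

    grid: list  # list of equal-length strings: '.' is free, '#' is a wall
    start: tuple
    goal: tuple
    allow_backtracking: bool = True
    max_steps: int = 50
    path: list = field(default_factory=list)
    steps: int = 0

    def reset(self):
        # Start a new episode with the path containing only the start cell.
        self.path = [self.start]
        self.steps = 0
        return self._obs(), {}

    def _obs(self):
        return {"position": self.path[-1], "path": tuple(self.path)}

    def _free(self, cell):
        # A cell is enterable if it is in bounds, not a wall, and unvisited.
        r, c = cell
        in_bounds = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0])
        return in_bounds and self.grid[r][c] != "#" and cell not in self.path

    def step(self, action):
        self.steps += 1
        terminated, reward = False, 0.0
        if action == BACKTRACK:
            # Undo the last move, if backtracking is enabled for this setting.
            if self.allow_backtracking and len(self.path) > 1:
                self.path.pop()
        else:
            dr, dc = MOVES[action]
            r, c = self.path[-1]
            nxt = (r + dr, c + dc)
            if self._free(nxt):
                self.path.append(nxt)
                if nxt == self.goal:
                    terminated, reward = True, 1.0
            # Illegal moves leave the state unchanged in this sketch.
        truncated = not terminated and self.steps >= self.max_steps
        return self._obs(), reward, terminated, truncated, {}
```

Setting `allow_backtracking=False` gives the plain step-by-step setting; the one-shot setting would instead submit a whole action sequence at once and replay it through `step`.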
We evaluate eight models in three settings (one-shot, step-by-step, and step-by-step with backtracking) against human, random, and A* baselines on 500 episodes.
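For readers unfamiliar with the A* baseline, the following is a generic sketch of A* search on a 4-connected grid with a Manhattan-distance heuristic. It is an illustration under simplifying assumptions, not the baseline used in the study: the function name `astar` and the wall-only grid encoding are hypothetical, and the actual puzzles may carry additional spatial constraints that this sketch does not model.

```python
import heapq


def astar(grid, start, goal):
    """Return a shortest 4-connected path from start to goal, or None.

    grid is a list of equal-length strings: '.' is free, '#' is a wall.
    """

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            # Walk parent links back to the start to rebuild the path.
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale queue entry; a cheaper route was already found
        r, c = cur
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]] == "#":
                continue
            new_cost = g + 1
            if new_cost < cost.get(nxt, float("inf")):
                cost[nxt] = new_cost
                came_from[nxt] = cur
                heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None
```

A random baseline, by contrast, would simply sample one of the legal actions uniformly at each step.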
Key Findings
- Reasoning Effort: Models fail to scale reasoning effort with difficulty. This suggests that the current architectures do not adequately adjust their processing capabilities in response to increasing task complexity.
- Impact of Visual Input: Vision models that receive images of the spatial environment show a 73% reduction in solve rate. Visual input therefore does not help as expected and may add noise to the reasoning process.
- Chain-of-Thought Reasoning: Extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference, even in the step-by-step setting. This emphasizes the importance of structured reasoning in improving model performance.
Model Evaluation
The best-performing model, GPT-OSS 120B, achieved a solve rate of 16.0%, which is 82 percentage points below the human baseline of 98.0%. The step-by-step format helped weaker models, improving their performance by up to 5.4% by removing formatting errors. Stronger models, by contrast, lost up to 5.6%, likely because acting one step at a time constrains global planning.
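The gap quoted above is an absolute difference in percentage points, which is easy to confuse with a relative change. The short sketch below works through both readings using only the two solve rates reported in the text; the helper names are hypothetical.

```python
def gap_in_points(baseline_pct, model_pct):
    """Absolute gap between two rates, in percentage points."""
    return baseline_pct - model_pct


def fraction_of_baseline(baseline_pct, model_pct):
    """Model rate expressed as a fraction of the baseline rate."""
    return model_pct / baseline_pct


human_pct = 98.0  # human baseline solve rate from the text
best_pct = 16.0   # GPT-OSS 120B solve rate from the text

points = gap_in_points(human_pct, best_pct)
ratio = fraction_of_baseline(human_pct, best_pct)
print(f"gap: {points:.1f} percentage points")
print(f"best model reaches {ratio:.1%} of the human solve rate")
```

In other words, the best model solves roughly one puzzle for every six that humans solve.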
Backtracking, a feature that allows models to revise their decisions, was found to improve episode completion rates but only enhanced the solve rate for weaker models. Stronger models demonstrated a tendency to avoid backtracking, indicating a potential area for further exploration in model training and architecture.
Conclusion
Spatial-Gym enables diagnosis of model limitations and offers a framework for improving spatial reasoning through reinforcement learning. By assessing models in a structured, step-by-step environment, we can see more precisely where spatial reasoning breaks down and use that to improve the design of agents that navigate complex environments.
Understanding the gap between spatial reasoning and acting will matter for building more capable agents, and the Spatial-Gym evaluations give future research in this area a concrete starting point.
