Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks
arXiv:2604.10690v1
Abstract: Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled testing context requiring multi-step planning and spatial abstraction.
Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities.
Key Findings
- Strong results with adjacency input: Using chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5×5 to 7×7 grids) with tokenized adjacency representations.
- Collapse with visual grids: Accuracy drops to 16-34% with visual grid formats, a 2-5x difference that suggests representation-dependent rather than format-invariant spatial reasoning.
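To make the contrast between the two input formats concrete, here is a minimal sketch that renders one toy maze both ways. The exact prompt formats used in the paper are not given in this summary, so the `(r,c) -> neighbours` text layout, the `#`/`.` grid layout, and all names (`PASSAGES`, `to_adjacency_text`, `to_visual_grid`) are illustrative assumptions rather than the authors' encoding.

```python
from typing import FrozenSet, List, Set, Tuple

Cell = Tuple[int, int]          # (row, col)
Passage = FrozenSet[Cell]       # an open connection between two adjacent cells


def passage(a: Cell, b: Cell) -> Passage:
    return frozenset((a, b))


# Hypothetical 2x2 maze: a single corridor (0,0) - (0,1) - (1,1) - (1,0).
PASSAGES: Set[Passage] = {
    passage((0, 0), (0, 1)),
    passage((0, 1), (1, 1)),
    passage((1, 1), (1, 0)),
}


def neighbours(cell: Cell, n: int, passages: Set[Passage]) -> List[Cell]:
    """Cells reachable from `cell` in one step (checked up, down, left, right)."""
    r, c = cell
    found = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nb = (r + dr, c + dc)
        if 0 <= nb[0] < n and 0 <= nb[1] < n and passage(cell, nb) in passages:
            found.append(nb)
    return found


def to_adjacency_text(n: int, passages: Set[Passage]) -> str:
    """Assumed 'tokenized adjacency' style: one line per cell listing its neighbours."""
    lines = []
    for r in range(n):
        for c in range(n):
            nbs = " ".join(f"({a},{b})" for a, b in neighbours((r, c), n, passages))
            lines.append(f"({r},{c}) -> {nbs}")
    return "\n".join(lines)


def to_visual_grid(n: int, passages: Set[Passage]) -> str:
    """Assumed 'visual grid' style: '#' for walls, '.' for cells and open passages."""
    size = 2 * n + 1
    grid = [["#"] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            grid[2 * r + 1][2 * c + 1] = "."
    for p in passages:
        (r1, c1), (r2, c2) = tuple(p)
        # The wall slot between two adjacent cells sits at the midpoint of their grid positions.
        grid[r1 + r2 + 1][c1 + c2 + 1] = "."
    return "\n".join("".join(row) for row in grid)


if __name__ == "__main__":
    print(to_adjacency_text(2, PASSAGES))
    print(to_visual_grid(2, PASSAGES))
```

Both outputs describe the same graph. The adjacency text states connectivity explicitly, while the grid requires the reader to infer it from character positions, which is one plausible reason the two formats could differ in difficulty.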
Further Analysis
To probe spatial understanding more deeply, we used sequential proximity questions and compositional distance comparisons. Although the models' reasoning traces reach 96-99% semantic coverage, the models fail to turn that coverage into consistent spatial computations.
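As an illustration of what a ground-truth answer to a distance comparison might look like, the sketch below computes shortest-path distances with breadth-first search and picks the closer of two targets. The question wording and scoring used in the paper are not described here, so the toy maze `ADJ` and the helper names `bfs_distances` and `closer_target` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col)

# Hypothetical 2x2 maze as an adjacency map: a corridor (0,0) - (0,1) - (1,1) - (1,0).
ADJ: Dict[Cell, List[Cell]] = {
    (0, 0): [(0, 1)],
    (0, 1): [(0, 0), (1, 1)],
    (1, 1): [(0, 1), (1, 0)],
    (1, 0): [(1, 1)],
}


def bfs_distances(adj: Dict[Cell, List[Cell]], start: Cell) -> Dict[Cell, int]:
    """Number of steps from `start` to every reachable cell."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nb in adj[cur]:
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    return dist


def closer_target(adj: Dict[Cell, List[Cell]], source: Cell, a: Cell, b: Cell) -> Optional[Cell]:
    """Return whichever of `a` or `b` is fewer steps from `source`, or None on a tie."""
    dist = bfs_distances(adj, source)
    da = dist.get(a, float("inf"))  # unreachable cells count as infinitely far
    db = dist.get(b, float("inf"))
    if da == db:
        return None
    return a if da < db else b


if __name__ == "__main__":
    print(bfs_distances(ADJ, (0, 0)))
    print(closer_target(ADJ, (0, 0), (1, 0), (0, 1)))
```

In this toy maze, cell (1,0) sits directly below (0,0) on the grid but is three steps away through the passages, so a comparison based on grid position alone would give the wrong answer. Questions of this kind separate path-aware reasoning from surface proximity.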
Independent Question Treatment
Our analysis indicates that the models tend to treat each question independently, failing to build cumulative spatial knowledge. This limitation raises critical questions about the robustness of LLMs in developing effective spatial world models.
Implications
The findings from our maze-solving tasks suggest that large language models do not develop robust spatial world models. Instead, they exhibit representation-specific, prompting-dependent reasoning that succeeds only under narrow conditions.
Conclusion
These results bear directly on the deployment of foundation models in applications that require spatial abstraction. As large language models continue to evolve, understanding the limits of their spatial reasoning will be important for applying them effectively in real-world scenarios.
Future Directions
Future research should consider enhancing the spatial reasoning capabilities of LLMs through improved training methodologies and representation techniques. Exploring alternative approaches to spatial abstraction may also yield valuable insights for the development of more versatile AI systems.
