Toward Memory-Aided World Models: Benchmarking via Spatial Consistency
Summary: arXiv:2505.22976v2
The ability to simulate the world in a spatially consistent manner is essential for effective world models. Spatial consistency is vital not only for high-quality visual generation but also for the reliability of world models in downstream tasks such as simulation and planning. A key part of achieving it is an effective memory module, which must retain long-horizon observational information and support the construction of explicit or implicit internal spatial representations.
Despite the importance of these features, there has been a noticeable absence of datasets specifically aimed at promoting the development of memory modules by enforcing spatial consistency constraints. Most existing benchmarks primarily focus on visual coherence and generation quality, often overlooking the critical requirement for long-range spatial consistency.
To address this gap, researchers have constructed a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft. They collected approximately 250 hours (around 20 million frames) of loop-based navigation videos with recorded actions.
Dataset and Benchmark Design
Key features of the newly developed dataset include:
- Diverse Sampling: The dataset encompasses 150 unique locations within Minecraft, providing a rich environment for testing various world models.
- Extensive Video Collection: Researchers gathered about 250 hours of loop-based navigation video, roughly 20 million frames, each paired with the corresponding action.
- Curriculum Design: The dataset follows a structured curriculum design of sequence lengths, enabling models to learn spatial consistency through increasingly complex navigation trajectories.
- Extensibility: The data collection pipeline is designed to be easily extensible to new Minecraft environments and modules, promoting further research and application.
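The curriculum over sequence lengths can be pictured with a minimal sketch. It assumes that each episode is a record with a frame count and that stages are defined by ascending length bounds. The function name `curriculum_stages`, the `num_frames` key, and the bounds 64, 256, and 1024 are hypothetical and chosen for illustration only; they are not taken from the released dataset or code.

```python
from typing import Dict, List


def curriculum_stages(episodes: List[Dict], length_bounds: List[int]) -> List[List[Dict]]:
    """Group episodes into stages by frame count, shortest stage first.

    `episodes` are hypothetical records with a "num_frames" key.
    `length_bounds` are hypothetical ascending upper bounds, one per stage.
    Episodes longer than the last bound are left out.
    """
    stages: List[List[Dict]] = [[] for _ in length_bounds]
    for ep in episodes:
        for i, bound in enumerate(length_bounds):
            if ep["num_frames"] <= bound:
                # Assign to the first (shortest) stage that can hold the episode.
                stages[i].append(ep)
                break
    return stages


# Toy episodes; the ids and frame counts are made up for illustration.
episodes = [
    {"id": "a", "num_frames": 40},
    {"id": "b", "num_frames": 300},
    {"id": "c", "num_frames": 120},
    {"id": "d", "num_frames": 900},
]
stages = curriculum_stages(episodes, length_bounds=[64, 256, 1024])
stage_ids = [[ep["id"] for ep in stage] for stage in stages]
print(stage_ids)  # [['a'], ['c'], ['b', 'd']]
```

A training loop could then iterate over `stages` in order, so that a model sees short loops before long ones.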
Evaluation of World Model Baselines
To validate the dataset and benchmark, the researchers evaluated four representative world model baselines and measured their performance on the dataset.
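One way to picture what a loop-based consistency check could look like is sketched below. This is an illustrative assumption, not the benchmark's actual metric: it assumes each generated frame comes with a discrete camera pose `(x, z, yaw)`, finds steps where the agent returns to an earlier pose, and compares the two generated frames with PSNR. The names `revisit_pairs` and `loop_consistency` and the `min_gap` parameter are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Pose = Tuple[int, int, int]   # hypothetical discrete pose: (x, z, yaw)
Frame = Sequence[float]       # flattened pixel intensities in [0, 1]


def revisit_pairs(poses: List[Pose], min_gap: int) -> List[Tuple[int, int]]:
    """Return (first_visit, revisit) step pairs that share the same pose."""
    first_seen = {}
    pairs = []
    for t, pose in enumerate(poses):
        if pose in first_seen and t - first_seen[pose] >= min_gap:
            pairs.append((first_seen[pose], t))
        first_seen.setdefault(pose, t)
    return pairs


def psnr(a: Frame, b: Frame, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two equally sized frames."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def loop_consistency(poses: List[Pose], frames: List[Frame], min_gap: int = 2) -> Optional[float]:
    """Mean PSNR between frames generated at a pose and at its later revisits."""
    scores = [psnr(frames[i], frames[j]) for i, j in revisit_pairs(poses, min_gap)]
    return sum(scores) / len(scores) if scores else None


# Toy square loop that ends where it started; the frame values are made up.
poses = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 0)]
frames = [
    [0.2, 0.4, 0.6, 0.8],
    [0.1, 0.1, 0.1, 0.1],
    [0.5, 0.5, 0.5, 0.5],
    [0.9, 0.9, 0.9, 0.9],
    [0.2, 0.4, 0.6, 0.7],  # revisit frame differs slightly from step 0
]
score = loop_consistency(poses, frames)
print(round(score, 2))
```

A model with a working memory module would be expected to reproduce the earlier view when the loop closes, which shows up here as a higher score on revisit pairs.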
Open Source Contribution
In a move to support future research and development in this critical area, the dataset, benchmark, and code have been open-sourced. This initiative is expected to facilitate collaboration and innovation among researchers aiming to enhance memory modules within world models.
As the field of artificial intelligence continues to evolve, benchmarks of this kind can help advance the capabilities of world models, particularly their ability to remain spatially consistent over long horizons.
