PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models
arXiv:2603.29281v1
A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. To address it, researchers have introduced PRISM, a dataset for adapting embodied vision-language models (VLMs) to retail settings.
PRISM is a corpus of 270,000 multi-view video samples designed for supervised fine-tuning (SFT) of VLMs. It is motivated by the observation that physical AI systems fail not primarily because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in real-world scenarios.
Key Features of PRISM
The PRISM dataset is built on a novel knowledge ontology with three dimensions:
- Spatial Knowledge: Understanding the arrangement and relationships of objects in space.
- Temporal and Physical Knowledge: Recognizing the dynamics of movement and changes over time.
- Embodied Action Knowledge: Comprehending the actions that a physical agent must perform in various contexts.
PRISM covers over 20 capability probes across four evaluation dimensions:
- Embodied Reasoning (ER): The ability to reason about actions and their consequences in a physical space.
- Common Sense (CS): Utilizing everyday knowledge to make inferences about scenarios.
- Spatial Perception (SP): Interpreting spatial relationships and distances between objects.
- Intuitive Physics (IP): Understanding physical laws and how they govern interactions in the environment.
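The four evaluation dimensions above lend themselves to a simple grouping scheme when reporting probe results. The following is a minimal sketch, not the authors' evaluation code: the `ProbeResult` record, the probe names, and the counts are all hypothetical, and only the four dimension names and abbreviations come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The four evaluation dimensions named in the text."""
    EMBODIED_REASONING = "ER"
    COMMON_SENSE = "CS"
    SPATIAL_PERCEPTION = "SP"
    INTUITIVE_PHYSICS = "IP"


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical record holding one capability probe's score."""
    probe: str            # hypothetical probe name, not from the dataset
    dimension: Dimension  # which evaluation dimension the probe belongs to
    correct: int          # number of correctly answered questions
    total: int            # number of questions asked


def accuracy_by_dimension(results):
    """Pool correct/total counts per dimension and return accuracy per dimension."""
    totals = {}
    for r in results:
        c, t = totals.get(r.dimension, (0, 0))
        totals[r.dimension] = (c + r.correct, t + r.total)
    return {d: c / t for d, (c, t) in totals.items() if t > 0}


# Made-up numbers, purely for illustration.
results = [
    ProbeResult("shelf_distance", Dimension.SPATIAL_PERCEPTION, 8, 10),
    ProbeResult("object_layout", Dimension.SPATIAL_PERCEPTION, 6, 10),
    ProbeResult("next_action", Dimension.EMBODIED_REASONING, 3, 4),
]
scores = accuracy_by_dimension(results)
```

Pooling counts rather than averaging per-probe accuracies weights each question equally; averaging per probe would be an equally defensible choice.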
Data Collection and Scope
According to its authors, PRISM is the first dataset to integrate all three knowledge dimensions within a single real-world deployment domain. It captures egocentric, exocentric, and 360-degree viewpoints across five distinct supermarket locations, so the data reflects a range of real store layouts and camera perspectives.
Sampled at 4 fps, PRISM comprises approximately 11.8 million video frames and around 730 million tokens, which places it among the largest domain-specific video SFT corpora available today. Supervision is provided in open-ended, chain-of-thought, and multiple-choice formats.
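The headline figures above imply some rough per-sample averages. The sketch below only divides the approximate numbers quoted in the text; it assumes every frame was sampled at 4 fps and that frames and tokens are spread evenly over samples, neither of which the source confirms. Because frames from multiple views are counted together, the hour figure is total footage, not wall-clock recording time.

```python
# Approximate figures quoted in the text.
FPS = 4
FRAMES = 11_800_000
TOKENS = 730_000_000
SAMPLES = 270_000


def corpus_stats(frames, fps, tokens, samples):
    """Derive rough averages from the headline dataset figures."""
    seconds = frames / fps  # total footage duration across all views
    return {
        "video_hours": seconds / 3600,
        "frames_per_sample": frames / samples,
        "seconds_per_sample": seconds / samples,
        "tokens_per_sample": tokens / samples,
    }


stats = corpus_stats(FRAMES, FPS, TOKENS, SAMPLES)
for name, value in stats.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the corpus works out to roughly 820 hours of footage, with an average sample spanning about 44 frames (about 11 seconds) and about 2,700 tokens.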
Impact on Fine-Tuning and Performance
The reported evaluations indicate that fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% relative to the pre-trained baseline. The gains are especially large in embodied action understanding, where accuracy improves by 36.4%. These findings suggest that ontology-structured, domain-specific SFT can substantially strengthen embodied VLMs in real-world applications.
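To make the 66.6% figure concrete, the sketch below treats it as a relative reduction in error rate, which is an assumption; the text does not say whether the figure is relative or absolute. The baseline error of 0.60 is a hypothetical value chosen for illustration and is not a reported result.

```python
def relative_error_reduction(baseline_error, finetuned_error):
    """Fraction of the baseline error removed by fine-tuning."""
    if not 0 < baseline_error <= 1:
        raise ValueError("baseline_error must be in (0, 1]")
    return (baseline_error - finetuned_error) / baseline_error


def error_after_reduction(baseline_error, reduction):
    """Error rate remaining after a given relative reduction."""
    return baseline_error * (1 - reduction)


baseline = 0.60  # hypothetical baseline error rate, not from the paper
after = error_after_reduction(baseline, 0.666)
print(f"error: {baseline:.3f} -> {after:.4f}")
print(f"accuracy: {1 - baseline:.3f} -> {1 - after:.4f}")
```

Under this reading, a model that starts at 60% error would end near 20% error, that is, roughly one third of its original mistakes remain.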
For more information on the PRISM dataset and to access it, visit https://dreamvu.ai/prism.
