PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
Multimodal large language models (MLLMs) still struggle to understand spatial relationships in indoor environments, and the difficulty is most pronounced for small object-centric spatial understanding in indoor videos. Although this capability matters for applications such as object search and assistive technologies, existing benchmarks do not adequately evaluate whether a model can localize a target object in a video and express its position precisely enough for downstream use.
To address this gap, researchers have introduced PinpointQA, the first dataset and benchmark designed specifically for small object-centric spatial understanding in indoor videos. Built on ScanNet++ and ScanNet200, it comprises 1,024 scenes and 10,094 question-answer (QA) pairs. The QA pairs are organized into four progressively challenging tasks, each testing a different aspect of spatial reasoning:
- Target Presence Verification (TPV): Assessing whether a specified object is present in a video frame.
- Nearest Reference Identification (NRI): Identifying the nearest reference object in relation to a target object.
- Fine-Grained Spatial Description (FSD): Providing detailed spatial descriptions of a target object’s position.
- Structured Spatial Prediction (SSP): Predicting spatial relationships and configurations of multiple objects.
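To make the four-task organization concrete, the following is a minimal sketch of how QA pairs might be grouped and scored per task. The `QAPair` fields, the `per_task_accuracy` helper, and the exact-match scoring rule are all assumptions made for illustration; the source does not describe the dataset schema or the official metrics, and the open-ended tasks (FSD, SSP) would likely need a more tolerant comparison than exact match.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """The four PinpointQA tasks, in order of increasing difficulty."""
    TPV = "Target Presence Verification"
    NRI = "Nearest Reference Identification"
    FSD = "Fine-Grained Spatial Description"
    SSP = "Structured Spatial Prediction"


@dataclass(frozen=True)
class QAPair:
    # Hypothetical record layout; the real dataset fields may differ.
    scene_id: str
    task: Task
    question: str
    answer: str


def per_task_accuracy(pairs, predictions):
    """Exact-match accuracy per task (an assumed, simplified metric).

    `predictions` is a list of strings aligned with `pairs` by index.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, pred in zip(pairs, predictions):
        total[pair.task] += 1
        # Compare case-insensitively, ignoring surrounding whitespace.
        if pred.strip().lower() == pair.answer.strip().lower():
            correct[pair.task] += 1
    return {task: correct[task] / total[task] for task in total}
```

Reporting accuracy separately for each task, rather than as one pooled number, is what lets a gap between the easier and harder tasks show up at all.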
PinpointQA is constructed by deriving intermediate spatial representations from the video data, generating QA pairs automatically, and then refining those pairs through quality control. The resulting dataset is suitable for both training and evaluating MLLMs.
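As an illustration of automatic QA generation from an intermediate spatial representation, here is a small sketch that builds one NRI-style pair. It assumes the representation is a set of named 3D object centroids in meters and that "nearest" means smallest Euclidean distance between centroids. The source does not specify either, and the function names, question template, and output fields are hypothetical.

```python
import math


def nearest_reference(target, references):
    """Return (name, distance) of the reference closest to the target centroid.

    `target` is an (x, y, z) tuple; `references` maps names to (x, y, z) tuples.
    """
    if not references:
        raise ValueError("at least one reference object is required")
    name, centroid = min(
        references.items(), key=lambda item: math.dist(target, item[1])
    )
    return name, math.dist(target, centroid)


def make_nri_pair(target_name, target_centroid, references):
    """Build a hypothetical NRI question-answer record from scene geometry."""
    name, distance = nearest_reference(target_centroid, references)
    return {
        "task": "NRI",
        "question": f"Which object is closest to the {target_name}?",
        "answer": name,
        # Kept so a later quality-control pass could reject ambiguous cases.
        "distance_m": round(distance, 2),
    }


pair = make_nri_pair(
    "mug",
    (1.2, 0.1, 0.8),
    {"desk": (1.0, 0.0, 0.8), "sofa": (3.0, 2.0, 0.4)},
)
```

Keeping the computed distance alongside the answer is one way a refinement step could filter out pairs where two references are nearly equidistant from the target.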
Initial experiments on representative MLLMs reveal a consistent capability gap along the task progression. Models show some proficiency on the easier tasks, but the Structured Spatial Prediction (SSP) task remains especially difficult, which points to the need for specialized training.
Supervised fine-tuning on PinpointQA yields substantial performance improvements, particularly on the more difficult tasks. PinpointQA therefore serves not only as a diagnostic benchmark for assessing model capabilities but also as an effective training resource for improving the spatial reasoning of MLLMs.
The PinpointQA dataset and project page are available at https://rainchowz.github.io/PinpointQA. The benchmark is a step toward better spatial understanding in indoor videos, which object search and assistive applications depend on.
