PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Understanding and reasoning about video remains a hard problem for AI systems. PerceptionComp is a new benchmark for evaluating perception-centric video reasoning. It tests how well models can parse intricate video content when a question depends on several perceptual subtasks at once.
Overview of PerceptionComp
PerceptionComp is a manually annotated benchmark for long-horizon video reasoning built on complex perceptual tasks. Its core idea is that answering each question requires integrating information from multiple moments in the video, so a model must track several visual elements and how they relate to one another.
Key Features of PerceptionComp
- Comprehensive Annotation: The benchmark comprises 1,114 complex questions over 279 videos from diverse domains, including city walk tours, indoor villa tours, video games, and extreme outdoor sports. Every question was annotated manually to ensure quality and reliability.
- Multifaceted Reasoning: Each question combines several perceptual subtasks, such as recognizing objects, attributes, relations, locations, actions, and events. Answering it draws on semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning.
- Test-Time Thinking: PerceptionComp questions demand sustained effort at answer time. In human studies, participants take substantially longer to answer them than questions from prior benchmarks. When participants are not allowed to rewatch the videos, accuracy falls to 18.97%, which is close to the 20% expected from random guessing among five choices.
- Performance of State-of-the-Art Models: Current state-of-the-art multimodal large language models (MLLMs) perform poorly on PerceptionComp. The best-performing model, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, and many open-source models remain below 40%.
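To put the reported numbers in context, the following minimal sketch scores five-choice predictions and compares the result with the 20% random-guessing baseline. It is an illustration only: the `Item` record, its field names, and the toy data are assumptions made for this example, not the benchmark's actual data format or evaluation code.

```python
from dataclasses import dataclass

NUM_CHOICES = 5                    # the article reports a five-choice setting
CHANCE_ACCURACY = 1 / NUM_CHOICES  # 0.20 under uniform random guessing


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one multiple-choice question (assumed schema)."""
    question_id: str
    video_id: str
    answer: str  # gold option label, e.g. "A" through "E"


def accuracy(items, predictions):
    """Return the fraction of items whose predicted label matches the gold label.

    `predictions` maps question_id to a predicted label; a missing
    prediction counts as wrong.
    """
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for it in items if predictions.get(it.question_id) == it.answer)
    return correct / len(items)


def margin_over_chance(acc):
    """Accuracy minus the random-guessing baseline (negative means below chance)."""
    return acc - CHANCE_ACCURACY


# Toy data, invented for illustration only.
items = [
    Item("q1", "v1", "A"),
    Item("q2", "v1", "C"),
    Item("q3", "v2", "E"),
    Item("q4", "v2", "B"),
]
predictions = {"q1": "A", "q2": "B", "q3": "E"}  # q4 left unanswered

acc = accuracy(items, predictions)
print(f"toy accuracy: {acc:.2%}, margin over chance: {margin_over_chance(acc):+.2%}")

# Figures quoted in the article, expressed relative to the 20% baseline.
print(f"best model:       {margin_over_chance(0.4596):+.2%}")
print(f"humans, no rewatch: {margin_over_chance(0.1897):+.2%}")
```

Read this way, the best reported model sits about 26 percentage points above guessing, while humans who cannot rewatch the video land roughly one point below it.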
Implications for the Future
PerceptionComp shows that perception-centric, long-horizon video reasoning remains an open problem. The gap between model accuracy and what the questions demand points to the need for further advances in AI methods and models. The creators of PerceptionComp hope it will serve as a catalyst for future research on perceptual reasoning and lead to AI systems that understand and interpret video content more effectively.
Conclusion
Benchmarks like PerceptionComp help push the boundaries of video reasoning. By offering a comprehensive and challenging evaluation framework, PerceptionComp aims to drive improvements in the perceptual capabilities of AI systems, so that they can understand the world in a richer, more nuanced way.
