Are Video Reasoning Models Ready to Go Outside?
arXiv:2603.10652v2
Abstract: In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness.
Introduction to ROVA
To address these limitations, the authors propose ROVA, a training framework that improves the robustness of video reasoning models by modeling a consistency reward that is sensitive to spatio-temporal corruptions. The goal is to close the gap between performance on clean, controlled benchmarks and performance under unpredictable real-world conditions.
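To make the idea of a consistency reward concrete, here is a minimal sketch. It is not the reward defined in the paper: the function name `consistency_reward`, the `weight` parameter, and the exact-match comparison are all assumptions chosen for illustration. The sketch combines correctness on the perturbed video with agreement between the answers given for the clean and perturbed versions of the same clip.

```python
def _normalize(answer: str) -> str:
    """Lowercase and trim so trivial formatting differences do not count."""
    return answer.strip().lower()


def consistency_reward(clean_answer: str,
                       perturbed_answer: str,
                       gold_answer: str,
                       weight: float = 0.5) -> float:
    """Hypothetical reward in [0, 1] (an assumption, not the paper's formula).

    (1 - weight) goes to answering the perturbed clip correctly,
    weight goes to agreeing with the answer given on the clean clip.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    correct = 1.0 if _normalize(perturbed_answer) == _normalize(gold_answer) else 0.0
    consistent = 1.0 if _normalize(perturbed_answer) == _normalize(clean_answer) else 0.0
    return (1.0 - weight) * correct + weight * consistent
```

With the default weight, a model that answers "left" on both versions of a clip whose gold answer is "left" receives the full reward, while one that flips to a wrong answer under perturbation receives none. A real system would likely compare reasoning traces as well as final answers; exact string matching is used here only to keep the sketch short.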
Key Features of ROVA
- Robustness-Aware Consistency Reward: ROVA rewards the model for staying consistent under spatio-temporal corruptions such as weather, occlusion, and camera motion.
- Difficulty-Aware Online Training: Training prioritizes the samples that are most informative given the model’s current capabilities.
- Self-Reflective Evaluation: By continuously re-evaluating sample difficulty, ROVA enables adaptive training that adjusts to the evolving strengths and weaknesses of the model.
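The second and third features can be illustrated with a small sampler. This is a sketch under stated assumptions, not the authors' procedure: the class `DifficultyAwareSampler`, the `p * (1 - p)` priority, and the moving-average update are hypothetical choices. The idea shown is that samples the model solves about half the time are treated as most informative, and that each sample's difficulty estimate is refreshed after every attempt.

```python
import random


class DifficultyAwareSampler:
    """Hypothetical sampler that favors samples of intermediate difficulty."""

    def __init__(self, sample_ids, momentum=0.8, seed=0):
        # Estimated success rate per sample; 0.5 means "difficulty unknown".
        self.success = {sid: 0.5 for sid in sample_ids}
        self.momentum = momentum
        self.rng = random.Random(seed)

    def priority(self, sid):
        # p * (1 - p) peaks at p = 0.5 and vanishes for samples that are
        # always solved or never solved. The small floor keeps every
        # sample reachable so its difficulty can be re-evaluated later.
        p = self.success[sid]
        return p * (1.0 - p) + 1e-3

    def sample(self, k):
        # Draw k sample ids with replacement, weighted by priority.
        ids = list(self.success)
        weights = [self.priority(sid) for sid in ids]
        return self.rng.choices(ids, weights=weights, k=k)

    def update(self, sid, solved):
        # Exponential moving average of the observed outcome.
        outcome = 1.0 if solved else 0.0
        m = self.momentum
        self.success[sid] = m * self.success[sid] + (1.0 - m) * outcome
```

In a training loop, one would call `sample` to build a batch, run the model, and call `update` with each outcome, so that samples the model has mastered gradually drop out of the batches while still being revisited occasionally.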
PVRBench: A New Benchmark
In addition to the ROVA framework, the researchers introduced PVRBench, a benchmark designed specifically to test video reasoning models under real-world disturbances. This new benchmark incorporates realistic perturbations into embodied video datasets, allowing for a comprehensive assessment of both accuracy and reasoning quality.
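As a rough illustration of what injecting perturbations into a video can look like, the sketch below applies an occluding patch and horizontal jitter to a list of grayscale frames. It does not reproduce PVRBench's actual pipeline, which the text above does not detail; the functions `occlude`, `shift`, and `perturb_video` and their parameters are assumptions, and frames are plain nested lists to keep the example dependency-free.

```python
import random
from typing import List

Frame = List[List[int]]  # grayscale frame, pixel values 0..255


def occlude(frame: Frame, top: int, left: int,
            height: int, width: int, fill: int = 0) -> Frame:
    """Return a copy of the frame with a rectangular patch set to `fill`."""
    out = [row[:] for row in frame]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out


def shift(frame: Frame, dx: int, fill: int = 0) -> Frame:
    """Shift the frame horizontally by dx pixels (crude camera motion)."""
    w = len(frame[0])
    out = []
    for row in frame:
        if dx >= 0:
            out.append([fill] * min(dx, w) + row[:max(w - dx, 0)])
        else:
            out.append(row[min(-dx, w):] + [fill] * min(-dx, w))
    return out


def perturb_video(frames: List[Frame], max_shift: int = 2,
                  seed: int = 0) -> List[Frame]:
    """Apply one fixed occluder plus per-frame jitter to every frame."""
    rng = random.Random(seed)
    h, w = len(frames[0]), len(frames[0][0])
    # The occluder position is drawn once so it persists across frames.
    top, left = rng.randrange(h), rng.randrange(w)
    patch_h, patch_w = max(h // 2, 1), max(w // 2, 1)
    out = []
    for frame in frames:
        dx = rng.randint(-max_shift, max_shift)  # new jitter per frame
        out.append(shift(occlude(frame, top, left, patch_h, patch_w), dx))
    return out
```

The input frames are never modified, so the same clip can be evaluated in both its clean and perturbed form, which is what a clean-versus-perturbed comparison requires.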
Evaluation and Results
ROVA was evaluated against baseline models on three benchmarks: PVRBench, UrbanVideo, and VisBench. The results reveal a substantial performance gap under realistic perturbations: both open-source and proprietary models lose up to 35% in accuracy and up to 28% in reasoning quality.
ROVA mitigates these degradations. Compared to baseline models such as Qwen2.5/3-VL, InternVL2.5, and Embodied-R, it improves relative accuracy by at least 24% and reasoning quality by over 9%. The gains are not limited to perturbed inputs: they also transfer to clean standard benchmarks, where ROVA yields consistent improvements.
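Because the accuracy gain is reported as a relative improvement, it helps to see how that differs from an absolute gain in percentage points. The helper below is a generic illustration; the numbers in the usage note are hypothetical and are not results from the paper.

```python
def relative_change(baseline: float, new: float) -> float:
    """Change of `new` relative to `baseline`, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


def absolute_change(baseline: float, new: float) -> float:
    """Plain difference, e.g. in accuracy points when inputs are accuracies."""
    return new - baseline
```

For example, a hypothetical baseline accuracy of 0.50 that rises to 0.62 is an absolute gain of 12 points but a relative gain of 24%, since (0.62 - 0.50) / 0.50 = 0.24. The same relative figure would correspond to a much smaller absolute gain for a weaker baseline, so the two should not be compared directly.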
Conclusion
ROVA and PVRBench target a specific weakness of video reasoning models: their sensitivity to real-world disturbances. By improving both robustness and reasoning quality under perturbation, ROVA could make vision-language models more usable in practical deployments.
