V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views
Multimodal large language models (MLLMs) have emerged as promising tools for interpreting complex driving environments. However, current evaluation benchmarks focus predominantly on the ego-centric (vehicle-side) perspective, so they cannot show how a model performs when infrastructure-side or cooperative views are available.
This article introduces V2X-QA, a dataset and benchmark designed to evaluate MLLMs from three viewpoints: vehicle-side, infrastructure-side, and cooperative. It uses a view-decoupled evaluation protocol, which scores each viewpoint separately and so allows controlled comparisons across them.
Key Features of V2X-QA
- View-Decoupled Evaluation Protocol: MLLMs are evaluated under three distinct view conditions: vehicle-only, infrastructure-only, and cooperative. Keeping the conditions separate makes it possible to attribute a difference in performance to the available viewpoint.
- Multiple-Choice Question Answering (MCQA) Framework: V2X-QA is organized as a unified MCQA benchmark. Because every item has a fixed set of answer options, responses can be scored consistently and compared directly across models.
- Twelve-Task Taxonomy: The benchmark defines twelve tasks spanning perception, prediction, reasoning, and planning, so results can be broken down by capability rather than reported as a single score.
- Expert-Verified Annotations: The MCQA annotations are verified by experts, which supports fine-grained analysis of viewpoint-dependent capabilities.
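The view-decoupled MCQA scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual evaluation code: the `MCQAItem` record layout, the view and task strings, and the `always_a` baseline are all hypothetical, and the real V2X-QA schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout; the real V2X-QA schema may differ.
@dataclass(frozen=True)
class MCQAItem:
    view: str        # "vehicle", "infrastructure", or "cooperative"
    task: str        # placeholder task name, not an official V2X-QA label
    question: str
    options: tuple   # answer choices shown to the model
    answer: str      # gold option letter


def accuracy_by_view(items, predict):
    """Score each view condition separately; results are never pooled across views."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.view] += 1
        if predict(item) == item.answer:
            correct[item.view] += 1
    return {view: correct[view] / total[view] for view in total}


def always_a(item):
    """Trivial stand-in for a model: always picks option A."""
    return "A"


# Toy items invented for illustration only.
items = [
    MCQAItem("vehicle", "perception", "q1", ("A", "B", "C"), "A"),
    MCQAItem("vehicle", "prediction", "q2", ("A", "B", "C"), "B"),
    MCQAItem("infrastructure", "perception", "q3", ("A", "B", "C"), "A"),
    MCQAItem("cooperative", "planning", "q4", ("A", "B", "C"), "C"),
]

scores = accuracy_by_view(items, always_a)
```

Replacing `always_a` with a function that queries a real model would yield one accuracy figure per view condition; grouping by `item.task` instead of `item.view` would give the per-task breakdown.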
Benchmark Results and Insights
The authors evaluated ten representative state-of-the-art proprietary and open-source models on V2X-QA. The findings indicate that:
- Viewpoint accessibility substantially affects model performance, which shows the importance of evaluating infrastructure-centric views alongside ego-centric ones.
- Infrastructure-side reasoning supports meaningful understanding of macroscopic traffic dynamics.
- Cooperative reasoning remains challenging: additional visual input alone is not enough, because the model must also align the views and integrate evidence across them.
Introducing V2X-MoE
To address the challenges identified in cooperative reasoning, the authors propose V2X-MoE, a benchmark-aligned baseline that incorporates explicit view routing and viewpoint-specific LoRA experts. Its strong performance suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving.
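To make the idea of explicit view routing with viewpoint-specific LoRA experts concrete, the following is a minimal sketch on a single linear layer. It assumes hard routing by a known view label and the standard LoRA update (a frozen weight `W` plus a scaled low-rank product `B A`); the class names, matrix values, and routing rule are illustrative assumptions, not the V2X-MoE implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRAExpert:
    """Low-rank update delta(x) = (alpha / r) * B @ (A @ x), as in standard LoRA."""

    def __init__(self, A, B, alpha=1.0):
        self.A = A                    # shape r x d_in
        self.B = B                    # shape d_out x r
        self.scale = alpha / len(A)   # len(A) is the rank r

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class ViewRoutedLinear:
    """Frozen base weight plus one LoRA expert per view (hypothetical layout)."""

    def __init__(self, W, experts):
        self.W = W
        self.experts = experts        # maps view label -> LoRAExpert

    def forward(self, x, view):
        # Assumption: routing is a hard lookup on the known view label.
        base = matvec(self.W, x)
        update = self.experts[view].delta(x)
        return [b + u for b, u in zip(base, update)]


# Toy weights invented for illustration only.
layer = ViewRoutedLinear(
    W=[[1, 0], [0, 1]],
    experts={
        "vehicle": LoRAExpert(A=[[1, 0]], B=[[1], [0]]),
        "infrastructure": LoRAExpert(A=[[0, 1]], B=[[0], [1]]),
        "cooperative": LoRAExpert(A=[[1, 1]], B=[[1], [1]]),
    },
)

x = [2, 3]
out_vehicle = layer.forward(x, "vehicle")
out_infra = layer.forward(x, "infrastructure")
out_coop = layer.forward(x, "cooperative")
```

The same input produces a different output for each view because only the selected expert's low-rank update is added to the shared base projection. A learned or soft router could replace the label lookup, but the source does not specify which variant V2X-MoE uses.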
Conclusion
V2X-QA lays the groundwork for future research in multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available in the V2X-QA repository on GitHub.
