Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets
arXiv:2603.25946v1 (announce type: cross)
Abstract
High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction.
To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks.
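The abstract names a Multiple Instance Learning (MIL) formulation but does not give VLAAD's objective. As a rough illustration only, the sketch below shows a common MIL ranking setup from weakly supervised video anomaly detection: each video is a "bag" of per-segment scores, only the video-level label is known, and the top-scoring segment of a collision video is pushed above the top-scoring segment of a normal video. The function names, the margin, the smoothness term, and the example scores are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Optional, Sequence


def mil_ranking_loss(pos_bag: Sequence[float],
                     neg_bag: Sequence[float],
                     margin: float = 1.0) -> float:
    """Hinge ranking loss between the highest-scoring segment of each bag.

    pos_bag: per-segment scores of a video labeled "contains a collision".
    neg_bag: per-segment scores of a video labeled "normal".
    Only video-level labels are needed; no segment is labeled individually.
    """
    return max(0.0, margin - max(pos_bag) + max(neg_bag))


def temporal_smoothness(scores: Sequence[float]) -> float:
    """Sum of squared differences between adjacent segment scores.

    Adding a term like this to the loss is one (assumed) way to encourage
    stable scores over time.
    """
    return sum((a - b) ** 2 for a, b in zip(scores, scores[1:]))


def first_alert_index(scores: Sequence[float],
                      threshold: float = 0.5) -> Optional[int]:
    """Index of the first segment at or above the threshold, or None."""
    for i, score in enumerate(scores):
        if score >= threshold:
            return i
    return None


# Hypothetical per-segment scores (illustrative values, not paper data).
collision_clip = [0.125, 0.25, 0.875, 0.75]
normal_clip = [0.125, 0.25, 0.125]

print(mil_ranking_loss(collision_clip, normal_clip))  # 0.375
print(first_alert_index(collision_clip))              # 2
print(first_alert_index(normal_clip))                 # None
```

Taking the maximum over each bag is what makes the signal temporally localized: the loss only rewards the model for raising the score of whichever segment it considers most collision-like, so a threshold on the per-segment scores can then serve as an early alert.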
Key Developments
- Introduction of VLAAD: VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models.
- Enhanced Driving Performance: Integrating VLAAD into a pretrained TransFuser++ agent yields a 14.12% relative increase in driving score with minimal fine-tuning.
- Generalization Capability: The effectiveness of VLAAD is further assessed in an open-loop setting using real-world driving data.
- Launch of Real-Collide: This new multimodal dataset features diverse dashcam videos paired with semantically rich annotations for collision detection and prediction.
- Performance Benchmark: With only 0.6 billion parameters, VLAAD outperforms a multi-billion-parameter vision-language model, improving AUC (area under the curve) by 23.3%.
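The two headline figures above are a relative increase and an AUC gain. The sketch below shows how such quantities are typically computed, assuming the AUC is the area under the ROC curve (the summary does not say which curve is meant, nor whether the 23.3% is relative or absolute). The helper names and all input numbers are hypothetical and are not results from the paper.

```python
from typing import Sequence


def relative_increase(baseline: float, improved: float) -> float:
    """Percentage change of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


def pairwise_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Equivalent to the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up driving scores: a baseline of 40.0 rising to 50.0.
print(relative_increase(40.0, 50.0))  # 25.0

# Made-up collision labels (1 = collision) and detector scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))   # 0.75
```

A relative increase depends on the baseline: the same absolute gain in driving score is a larger percentage when the starting score is low, which is worth keeping in mind when comparing figures across agents.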
Conclusion
In summary, this work contributes the VLAAD module and two multimodal datasets: CARLA-Collide for closed-loop simulation and Real-Collide for real-world dashcam video. By targeting collision-related infractions, the dominant failure mode in closed-loop evaluation, VLAAD raises the driving score of a pretrained TransFuser++ agent by 14.12% (relative) with minimal fine-tuning. With only 0.6 billion parameters, it also outperforms a multi-billion-parameter vision-language model by 23.3% in AUC.
These results suggest that collision-aware representation learning, supported by multimodal infraction data, is a promising route to lowering infraction rates in E2E driving.
