DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
arXiv:2604.00813v1
Abstract
End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making.
Introduction
Existing geometry reconstruction methods often rely on computationally expensive batch processing of multi-frame inputs. This makes them a poor fit for online planning, where a decision is needed as each new frame arrives.
Introducing DVGT-2
To address these challenges, we introduce the Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. Because each frame is handled as it arrives, planning does not have to wait for a batch of frames to be collected.
Key Features of DVGT-2
- Temporal Causal Attention: Attention across time is causal, so the features for the current frame depend only on the current and earlier frames, never on future ones. This is what allows the model to run frame by frame.
- Historical Feature Caching: Features computed for past frames are cached and reused, which supports on-the-fly inference and avoids recomputing history at every step.
- Sliding-Window Streaming Strategy: The model streams over a sliding window of frames, so only the frames inside the window are processed and the cost per step does not grow with the length of the drive.
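The three mechanisms above can be sketched together in a few lines. This is a minimal toy, not the DVGT-2 implementation: the class name `StreamingCausalAttention`, the `window` parameter, and the use of one raw feature vector per frame (with no learned query, key, or value projections) are all assumptions made for illustration.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class StreamingCausalAttention:
    """Toy single-head attention over per-frame feature vectors.

    Hypothetical illustration only: each frame is one feature vector that
    serves as query, key, and value, with no learned projections.
    """

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest cached frame automatically,
        # which gives the sliding-window behaviour.
        self.cache = deque(maxlen=window)

    def step(self, feature):
        """Process one new frame and return its attended feature."""
        # Cache the current frame; features of past frames are reused as-is.
        self.cache.append(list(feature))
        scale = 1.0 / math.sqrt(len(feature))
        # Causal by construction: the cache holds only current and past frames.
        scores = [
            scale * sum(q * k for q, k in zip(feature, key))
            for key in self.cache
        ]
        weights = softmax(scores)
        return [
            sum(w * value[i] for w, value in zip(weights, self.cache))
            for i in range(len(feature))
        ]


attn = StreamingCausalAttention(window=3)
for frame in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]):
    attended = attn.step(frame)  # one output per incoming frame
```

After the fourth frame the cache holds only the last three frames, so the work per step stays constant no matter how long the stream runs. A real model would cache per-layer keys and values rather than raw features.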
Performance and Efficiency
Despite running faster, DVGT-2 achieves superior geometry reconstruction performance across various datasets. This combination of speed and accuracy matters in real-world driving, where rapid response times are critical.
Versatility Across Configurations
The same trained model can be applied directly to planning across diverse camera configurations, without fine-tuning. This is demonstrated on two benchmarks:
- Closed-loop NAVSIM: A simulation-based benchmark in which the planned trajectory is simulated and scored on the resulting driving behavior.
- Open-loop nuScenes: A real-world driving dataset on which planned trajectories are compared against recorded logs, with no feedback from the plan to the scene.
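To make the open-loop setting concrete, the sketch below scores a planned trajectory against a logged one. The source does not say which metrics DVGT-2 reports. Average L2 displacement is used here as an assumption because it is a common open-loop measure, and the function name `average_l2_error` and the (x, y) waypoint format are hypothetical.

```python
import math


def average_l2_error(planned, logged):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    if len(planned) != len(logged) or not planned:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [
        math.hypot(px - lx, py - ly)
        for (px, py), (lx, ly) in zip(planned, logged)
    ]
    return sum(distances) / len(distances)


# The plan drifts 0.5 m sideways at the second waypoint only.
planned = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)]
logged = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
error = average_l2_error(planned, logged)
```

Because the logged trajectory is fixed, nothing the planner outputs changes what it sees next. That is the key difference from closed-loop evaluation.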
Conclusion
DVGT-2 treats dense 3D geometry as the foundation for driving decisions. It processes frames online while maintaining strong geometry reconstruction performance, and the same trained model carries over to different camera configurations without fine-tuning.
