DVGT-2: Real-Time 3D Geometry Model for Autonomous Driving

DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Summary: arXiv:2604.00813v1 (announce type: cross)

Abstract

End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making.

Introduction

Existing geometry reconstruction methods often rely on computationally expensive batch processing of multi-frame inputs: to produce geometry for the newest frame, they process a whole set of frames together. That cost makes them hard to use for online planning, where the vehicle must decide from each new frame as it arrives.
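A rough count shows why batch processing sits poorly with online planning. The sketch below is an illustration under stated assumptions, not a measurement from the paper: it counts only how many frames must be encoded per planning step, and the function names, the step count, and the window size are all hypothetical.

```python
def frames_encoded_batch(num_steps: int, window: int) -> int:
    """Frames encoded if every planning step re-processes its whole window."""
    # At step t (0-based) only t + 1 frames exist, so the batch is capped there.
    return sum(min(t + 1, window) for t in range(num_steps))


def frames_encoded_streaming(num_steps: int) -> int:
    """Frames encoded if each new frame is encoded once and its features reused."""
    return num_steps


if __name__ == "__main__":
    steps, window = 100, 8  # hypothetical values, chosen only for illustration
    print("batch:    ", frames_encoded_batch(steps, window))
    print("streaming:", frames_encoded_streaming(steps))
```

The count ignores the cost of attending over stored features, which a streaming model still pays, so it should be read as an upper bound on the saving rather than a speed-up figure.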

Introducing DVGT-2

To address these challenges, we introduce the Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. Because each frame is handled as it arrives, planning does not have to wait for a batch of frames to be collected.

Key Features of DVGT-2

Temporal Causal Attention: Attention over time is causal, meaning each frame attends only to itself and to earlier frames, never to future ones. This is what lets the model run online.
Historical Feature Caching: The model caches features computed for past frames, so on-the-fly inference reuses them instead of recomputing them for every new frame.
Sliding-Window Streaming Strategy: Only the frames inside a sliding window are kept, which keeps the computation for each new frame bounded however long the drive lasts.
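The three mechanisms above can be sketched together in a few lines. This is a minimal illustration under loud assumptions, not DVGT-2's implementation: the class and function names are hypothetical, features are plain lists of floats, and the cached features serve directly as keys and values, with no learned projections.

```python
import math
from collections import deque


def causal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


class StreamingCache:
    """Hypothetical helper: keeps features of the most recent `window` frames."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically: a sliding window.
        self.features = deque(maxlen=window)

    def step(self, frame_feature):
        # Append first, so the current frame attends to itself and cached history.
        self.features.append(frame_feature)
        history = list(self.features)
        return attend(frame_feature, history, history)


if __name__ == "__main__":
    cache = StreamingCache(window=2)
    for feature in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
        print(cache.step(feature))
```

In streaming use the causal constraint holds by construction, because the cache never contains a future frame. The explicit `causal_mask` shows the equivalent constraint when a whole sequence is processed at once.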

Performance and Efficiency

Despite its faster inference, DVGT-2 achieves superior geometry reconstruction performance across various datasets. The combination matters in real-world driving, where rapid response times are critical.

Versatility Across Configurations

DVGT-2 is also versatile. The same trained model can be applied to planning across diverse camera configurations without fine-tuning. This is demonstrated on two benchmarks:

Closed-loop NAVSIM: A simulation-based benchmark in which the model's planned trajectory is rolled out and scored in simulation, rather than only compared against a recorded drive.
Open-loop nuScenes: A dataset-based benchmark in which the model's planned trajectory is compared against the logged human trajectory, without the plan being executed.
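To make the open-loop setting concrete, the sketch below computes an average L2 displacement error between a planned and a logged trajectory, a metric commonly reported for open-loop planning. It is an assumption that this metric applies here, since the text above does not say which metrics were used, and the function name and the waypoint format (a list of (x, y) positions) are hypothetical.

```python
import math


def average_l2_error(predicted, logged):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(predicted) != len(logged) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, logged)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    planned = [(0.0, 0.0), (3.0, 4.0)]    # hypothetical planned waypoints
    recorded = [(0.0, 0.0), (0.0, 0.0)]   # hypothetical logged waypoints
    print(average_l2_error(planned, recorded))
```

A closed-loop score cannot be computed this way, because it depends on what happens after the plan is actually followed.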

Conclusion

DVGT-2 is a step forward for autonomous driving because it treats dense 3D geometry as the foundation of decision-making. Its ability to process frames online while maintaining high reconstruction and planning performance opens new directions for research and application in autonomous vehicles.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
