GOLD-BEV: Ground and Aerial Data for Dense Semantic BEV Mapping of Dynamic Scenes
arXiv:2604.19411v1
Abstract: Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird’s-eye-view (BEV) semantic environment maps—including dynamic agents—from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training.
Introduction
The development of autonomous vehicles and advanced mapping technologies has underscored the need for precise and reliable scene understanding. Traditional methods often struggle with the dynamic nature of real-world environments. GOLD-BEV addresses this by learning BEV maps from ego-centric (ground) sensors, with time-synchronized aerial imagery serving as the supervisory signal during training.
Key Features of GOLD-BEV
- Dense BEV Semantic Environment Maps: GOLD-BEV predicts dense semantic BEV maps from ego-centric sensors, and these maps include dynamic agents.
- Time-Synchronized Aerial Imagery: The framework uses time-synchronized aerial imagery as supervision during training only.
- Minimal Manual Annotation: By utilizing BEV-aligned aerial crops, the system significantly reduces the need for extensive manual labeling efforts, thereby streamlining the mapping process.
- Overhead Observation Supervision: Because aerial and ground data are strictly synchronized, overhead observations can supervise moving traffic participants and reduce the temporal inconsistencies that arise with unsynchronized data sources.
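To make the idea of a BEV-aligned aerial crop concrete, here is a minimal sketch of how such a crop could be extracted with NumPy. It is not the paper's implementation. The function name `bev_aligned_crop`, its parameters, the nearest-neighbor sampling, and the yaw convention (heading expressed in aerial image axes) are all assumptions made for illustration. Scaling between meters and pixels is left out, so the crop is taken at the aerial image's native resolution.

```python
import numpy as np


def bev_aligned_crop(aerial, center_xy, yaw, size_px, fill=0):
    """Nearest-neighbor crop of `aerial`, centered on the ego vehicle and
    rotated so that the ego heading points to the top of the crop.

    All names and conventions here are hypothetical:
    aerial    : (H, W) or (H, W, C) array
    center_xy : ego position in aerial pixel coordinates (x = column, y = row)
    yaw       : heading in radians, measured in image axes, so the heading
                direction is (cos yaw, sin yaw) in (x, y)
    """
    cx, cy = center_xy
    half = (size_px - 1) / 2.0
    rows, cols = np.mgrid[0:size_px, 0:size_px]
    forward = half - rows  # pixels ahead of the ego (row 0 is farthest ahead)
    right = cols - half    # pixels to the right of the ego

    # Map ego-frame offsets to aerial pixel coordinates.
    src_x = cx + forward * np.cos(yaw) - right * np.sin(yaw)
    src_y = cy + forward * np.sin(yaw) + right * np.cos(yaw)
    xi = np.rint(src_x).astype(int)
    yi = np.rint(src_y).astype(int)

    # Cells that fall outside the aerial image keep the fill value.
    inside = (xi >= 0) & (xi < aerial.shape[1]) & (yi >= 0) & (yi < aerial.shape[0])
    crop = np.full((size_px, size_px) + aerial.shape[2:], fill, dtype=aerial.dtype)
    crop[inside] = aerial[yi[inside], xi[inside]]
    return crop
```

With this convention, a vehicle heading toward the top of the aerial image (yaw of -pi/2) yields an unrotated crop, and a vehicle heading toward the right of the image (yaw of 0) yields the aerial patch rotated 90 degrees counterclockwise. A real pipeline would likely use interpolated sampling and a calibrated ground resolution.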
Innovative Approaches
GOLD-BEV incorporates several innovative techniques that set it apart from existing methodologies:
- Domain-Adaptive Aerial Teachers: BEV pseudo-labels are generated by aerial teacher models that have been adapted to the target domain, so supervision can scale beyond manually labeled data.
- Joint Training for Segmentation and Reconstruction: The system simultaneously trains on BEV segmentation and optional pseudo-aerial BEV reconstruction, which enhances the interpretability of the mapping process.
- Synthesis of Pseudo-Aerial BEV Images: GOLD-BEV extends its capabilities by learning to synthesize pseudo-aerial BEV images from ego sensors, facilitating lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.
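The pseudo-labeling and joint-training ideas above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' method: the max-probability threshold used as the uncertainty criterion, the L1 reconstruction term, the `IGNORE` label id, the `recon_weight` value, and all function names are hypothetical choices.

```python
import numpy as np

IGNORE = 255  # hypothetical label id for BEV cells excluded from the loss


def confident_pseudo_labels(teacher_probs, min_conf=0.8):
    """Turn teacher softmax output (C, H, W) into (H, W) pseudo-labels,
    marking cells whose top probability is below `min_conf` as IGNORE."""
    labels = teacher_probs.argmax(axis=0)
    conf = teacher_probs.max(axis=0)
    return np.where(conf >= min_conf, labels, IGNORE)


def masked_cross_entropy(student_logits, labels):
    """Mean cross-entropy of (C, H, W) logits over cells not marked IGNORE."""
    shifted = student_logits - student_logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    valid = labels != IGNORE
    if not valid.any():
        return 0.0
    safe = np.where(valid, labels, 0)  # placeholder class for ignored cells
    picked = np.take_along_axis(log_probs, safe[None], axis=0)[0]
    return float(-picked[valid].mean())


def joint_loss(student_logits, labels, recon, aerial_crop, recon_weight=0.5):
    """Segmentation loss plus a weighted L1 pseudo-aerial reconstruction loss."""
    seg = masked_cross_entropy(student_logits, labels)
    rec = float(np.abs(recon - aerial_crop).mean())
    return seg + recon_weight * rec
```

Setting `recon_weight` to 0 recovers segmentation-only training, which mirrors the description of reconstruction as optional. Other uncertainty measures, such as predictive entropy, could replace the confidence threshold without changing the rest of the sketch.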
Conclusion
GOLD-BEV is a step forward in semantic mapping for dynamic scenes. By supervising ground-sensor models with time-synchronized aerial imagery, the framework aims to improve BEV map quality while reducing reliance on manual annotation. As the demand for sophisticated mapping solutions grows, GOLD-BEV is a promising tool for autonomous systems and urban planning applications.
For further details, please refer to the full paper on arXiv (arXiv:2604.19411v1).
