ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models
arXiv:2603.25766v1
Abstract
The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the necessity to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, primarily driven by the quadratic complexity of self-attention mechanisms in Large Language Models (LLMs).
Introduction
Vision-Language-Action (VLA) models combine visual perception, language understanding, and action generation in a single framework. In autonomous driving, they interpret complex visual scenes and translate them into control commands for the vehicle.
Challenges in VLA Models
Despite their promise, VLA models face a significant computational-efficiency challenge. Accurate temporal reasoning requires processing historical multi-view frames, and every additional frame adds visual tokens to the input of the Large Language Model (LLM). Because the cost of self-attention grows quadratically with sequence length, the computational demand rises steeply as more frames are included.
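The quadratic growth can be made concrete with a rough back-of-the-envelope calculation. The sketch below is only illustrative: the number of views, tokens per image, text tokens, and hidden dimension are hypothetical values that do not come from the paper, and the cost model counts only the attention score and mixing steps of a single layer.

```python
def visual_token_count(num_frames: int, num_views: int, tokens_per_image: int) -> int:
    """Total visual tokens when every view of every frame is tokenized."""
    return num_frames * num_views * tokens_per_image


def attention_flops(seq_len: int, hidden_dim: int) -> int:
    """Approximate FLOPs for one self-attention layer's score and mixing steps.

    QK^T and the attention-weighted sum over V each cost about
    2 * seq_len^2 * hidden_dim FLOPs (one multiply-add counted as 2).
    Projections and MLP blocks, which scale linearly, are ignored here.
    """
    return 4 * seq_len ** 2 * hidden_dim


# Hypothetical configuration, chosen only for illustration.
NUM_VIEWS = 8
TOKENS_PER_IMAGE = 256
TEXT_TOKENS = 64
HIDDEN_DIM = 4096

baseline = attention_flops(
    visual_token_count(1, NUM_VIEWS, TOKENS_PER_IMAGE) + TEXT_TOKENS, HIDDEN_DIM
)
for frames in (1, 2, 4, 8):
    seq_len = visual_token_count(frames, NUM_VIEWS, TOKENS_PER_IMAGE) + TEXT_TOKENS
    ratio = attention_flops(seq_len, HIDDEN_DIM) / baseline
    print(f"frames={frames} seq_len={seq_len} relative attention cost={ratio:.1f}x")
```

Under these assumed numbers, moving from 1 to 4 frames makes the sequence roughly 4 times longer and the attention cost roughly 15 times higher, which is why reducing the number of visual tokens is an attractive lever.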
Proposed Solution: ETA-VLA
To address these challenges, we introduce ETA-VLA, an Efficient Token Adaptation framework for VLA models. ETA-VLA processes the past n frames of multi-view images and introduces an Intra-LLM Sparse Aggregator (ILSA). Inspired by how human drivers allocate their attention, ILSA dynamically identifies and prunes redundant visual tokens, guided by textual queries and temporal consistency.
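One way to picture temporal redundancy is to compare each visual token with the token at the same position in the previous frame. The sketch below is a simplified stand-in, not the ILSA algorithm: the function names, the same-position comparison, and the 0.95 threshold are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporally_redundant(
    prev_frame: List[Sequence[float]],
    curr_frame: List[Sequence[float]],
    threshold: float = 0.95,  # hypothetical cutoff, not taken from the paper
) -> List[int]:
    """Indices of current-frame tokens that barely changed since the previous frame."""
    return [
        i
        for i, (prev_tok, curr_tok) in enumerate(zip(prev_frame, curr_frame))
        if cosine_similarity(prev_tok, curr_tok) >= threshold
    ]


# Toy frames with two tokens each: the first token is almost unchanged,
# the second one changes completely between frames.
prev = [[1.0, 0.0], [0.0, 1.0]]
curr = [[1.0, 0.01], [1.0, 0.0]]
print(temporally_redundant(prev, curr))
```

In this toy case only the first token would be flagged as a pruning candidate, since the second token carries new information relative to the previous frame.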
Key Features of ETA-VLA
- Text-Guided Scoring Mechanism: Scores the importance of each visual token with respect to the textual query, so that the most relevant tokens are retained for processing.
- Diversity-Preserving Sparsification Strategy: Selects a sparse subset of critical tokens that still covers the driving scene broadly, reducing computational overhead without collapsing onto near-duplicate tokens.
- Evaluation on NAVSIM v2: Experiments on the NAVSIM v2 benchmark show that ETA-VLA achieves driving performance on par with state-of-the-art baselines.
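The first two features can be sketched together as a score-then-select procedure. The code below is a minimal illustration under stated assumptions, not the paper's method: it uses cosine similarity to a single text-query vector as the score and a greedy, maximal-marginal-relevance-style rule for diversity. All function names and the `diversity_weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_guided_scores(tokens: List[Sequence[float]], query: Sequence[float]) -> List[float]:
    """Assumed scoring rule: relevance of each visual token to the text query."""
    return [cosine(tok, query) for tok in tokens]


def select_diverse_tokens(
    tokens: List[Sequence[float]],
    query: Sequence[float],
    keep: int,
    diversity_weight: float = 1.0,  # hypothetical trade-off knob
) -> List[int]:
    """Greedily keep `keep` tokens, penalizing similarity to tokens already kept."""
    scores = text_guided_scores(tokens, query)
    selected: List[int] = []
    candidates = set(range(len(tokens)))
    while candidates and len(selected) < keep:
        best_idx, best_val = -1, -math.inf
        for i in sorted(candidates):
            # Redundancy is the highest similarity to any token kept so far.
            redundancy = max((cosine(tokens[i], tokens[j]) for j in selected), default=0.0)
            value = scores[i] - diversity_weight * redundancy
            if value > best_val:
                best_idx, best_val = i, value
        selected.append(best_idx)
        candidates.remove(best_idx)
    return sorted(selected)


# Toy example: tokens 0 and 1 are duplicates, token 2 is distinct.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [2.0, 1.0]
print(select_diverse_tokens(tokens, query, keep=2, diversity_weight=0.0))
print(select_diverse_tokens(tokens, query, keep=2, diversity_weight=1.0))
```

With the diversity penalty switched off, the two duplicate tokens are kept because they have the highest scores. With the penalty on, the second slot goes to the distinct token instead, which is the behavior a diversity-preserving strategy is meant to encourage.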
Results
ETA-VLA reduces computational FLOPs by approximately 32% while achieving driving performance comparable to state-of-the-art baselines. Notably, when 85% of visual tokens are pruned, inference FLOPs fall by 61% while the model still retains 94% of the original accuracy on the NAVSIM v2 benchmark.
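The share of tokens pruned and the share of FLOPs saved are different quantities. A toy cost model shows one way they can diverge. It rests on assumptions that are not taken from the paper: cost is treated as linear in the number of tokens per layer, text tokens are never pruned, and pruning happens at a single layer inside the LLM so that earlier layers still see the full sequence. The parameter values are hypothetical, and the model is not meant to reproduce the reported figures.

```python
def flops_reduction(
    visual_fraction: float,  # share of the input sequence made up of visual tokens
    prune_ratio: float,      # share of visual tokens removed
    prune_layer: int,        # number of layers that run before pruning takes effect
    num_layers: int,         # total number of LLM layers
) -> float:
    """Fractional FLOP saving under a linear-in-tokens cost model (an assumption)."""
    kept = 1.0 - visual_fraction * prune_ratio  # sequence fraction left after pruning
    pruned_layers = num_layers - prune_layer
    relative_cost = (prune_layer * 1.0 + pruned_layers * kept) / num_layers
    return 1.0 - relative_cost


# Hypothetical settings: 90% of the sequence is visual, 85% of it is pruned.
for layer in (0, 4, 8, 16):
    saving = flops_reduction(0.9, 0.85, prune_layer=layer, num_layers=32)
    print(f"prune after layer {layer}: approx. {saving:.0%} fewer FLOPs")
```

In this simplified model the saving is always smaller than the pruning ratio whenever some tokens are unprunable or some layers run before pruning, which is consistent in direction with a FLOP reduction below the 85% token reduction.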
Conclusion
ETA-VLA combines VLA models with efficient token adaptation for autonomous driving. By reducing computational demands while largely preserving accuracy, the framework could make VLA-based driving systems more practical to deploy.
