ETA-VLA: Efficient Token Adaptation for Vision-Language-Action Models

ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models

Source: arXiv:2603.25766v1 (announce type: cross)

Abstract

The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the necessity to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, primarily driven by the quadratic complexity of self-attention mechanisms in Large Language Models (LLMs).

Introduction

Vision-Language-Action (VLA) models combine visual perception, language understanding, and action generation in a single framework. In autonomous driving, such a model interprets the visual scene around the vehicle and translates it into control commands.

Challenges in VLA Models

The main obstacle for VLA models is computational cost. Accurate temporal reasoning requires historical multi-view frames, so the number of visual tokens grows with both the number of past frames and the number of camera views. Because self-attention in Large Language Models (LLMs) scales quadratically with sequence length, this longer input sharply increases the compute required.
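The scaling argument can be made concrete with a rough count. The sketch below is illustrative only: the frame count, view count, tokens per image, and hidden size are hypothetical values, not figures from the paper, and the cost formula counts only the two attention matrix products (roughly 4 * n^2 * d FLOPs per layer for n tokens of width d).

```python
def visual_token_count(frames: int, views: int, tokens_per_image: int) -> int:
    """Total visual tokens fed to the LLM (hypothetical tokenization)."""
    return frames * views * tokens_per_image


def attention_flops(seq_len: int, hidden_dim: int) -> int:
    """Approximate FLOPs of the two attention matmuls in one layer.

    Q @ K^T costs about 2 * n^2 * d and weights @ V costs about 2 * n^2 * d.
    """
    return 4 * seq_len * seq_len * hidden_dim


# Hypothetical setup: 6 cameras, 256 tokens per image, hidden size 1024.
single_frame = visual_token_count(frames=1, views=6, tokens_per_image=256)
four_frames = visual_token_count(frames=4, views=6, tokens_per_image=256)

# Going from 1 to 4 frames multiplies the token count by 4
# and the attention cost by 4^2 = 16.
cost_ratio = attention_flops(four_frames, 1024) / attention_flops(single_frame, 1024)
print(single_frame, four_frames, cost_ratio)
```

Under these assumptions, adding history is far more expensive than the linear growth in tokens suggests, which is the motivation for pruning visual tokens.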

Proposed Solution: ETA-VLA

To address these challenges, the authors introduce ETA-VLA, an Efficient Token Adaptation framework for VLA models. ETA-VLA processes the past n frames of multi-view images and adds a novel Intra-LLM Sparse Aggregator (ILSA). Inspired by how human drivers allocate attention, ILSA dynamically identifies and prunes redundant visual tokens, guided by the textual query and by temporal consistency.

Key Features of ETA-VLA

  • Text-Guided Scoring Mechanism: Scores each visual token by its importance to the textual query, so that only the most relevant tokens are retained for processing.
  • Diversity-Preserving Sparsification Strategy: Selects a sparse subset of critical tokens that still covers the driving scene broadly, reducing computational overhead without discarding scene context.
  • Experimental Validation: Experiments on the NAVSIM v2 benchmark show that ETA-VLA achieves driving performance on par with state-of-the-art baselines.
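The first two features can be illustrated with a small sketch. This is not the paper's ILSA implementation: it assumes cosine similarity to a text query vector as the relevance score and uses a greedy, maximal-marginal-relevance style rule as a stand-in for diversity-preserving selection. The function names, the `diversity_weight` parameter, and the toy vectors are all hypothetical, and temporal consistency is not modeled.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tokens(
    visual_tokens: List[Sequence[float]],
    text_query: Sequence[float],
    keep: int,
    diversity_weight: float = 0.5,
) -> List[int]:
    """Greedily keep `keep` token indices, trading relevance against redundancy."""
    # Text-guided score: similarity of each visual token to the query.
    relevance = [cosine(tok, text_query) for tok in visual_tokens]
    selected: List[int] = []
    candidates = set(range(len(visual_tokens)))
    while candidates and len(selected) < keep:
        best_idx, best_score = -1, -math.inf
        for i in sorted(candidates):
            # Redundancy: highest similarity to any token already kept.
            redundancy = max(
                (cosine(visual_tokens[i], visual_tokens[j]) for j in selected),
                default=0.0,
            )
            score = (1.0 - diversity_weight) * relevance[i] - diversity_weight * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return sorted(selected)  # restore original token order


# Toy example: tokens 0 and 1 are near-duplicates, token 2 is distinct.
tokens = [[1.0, 0.0], [1.0, 0.05], [0.6, 0.8]]
query = [1.0, 0.2]
print(select_tokens(tokens, query, keep=2, diversity_weight=0.0))
print(select_tokens(tokens, query, keep=2, diversity_weight=0.5))
```

With `diversity_weight=0.0` the rule reduces to plain top-k by relevance and keeps both near-duplicates; with a positive weight it swaps one of them for the distinct token, which is the behavior a diversity-preserving strategy is meant to encourage.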

Results

ETA-VLA matches the driving performance of state-of-the-art baselines while reducing computational FLOPs by approximately 32%. Notably, when pruning 85% of visual tokens, it reduces inference FLOPs by 61% while retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
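A token-pruning ratio and a FLOPs reduction are not the same number, because text tokens are kept and any layers before the pruning point still see the full sequence. The toy model below sketches that relationship. It is not a reproduction of the reported results: the layer count, pruning layer, token counts, and hidden size are hypothetical, and the per-layer cost uses the common estimate of 24 * n * d^2 for the linear projections and MLP plus 4 * n^2 * d for attention.

```python
def layer_flops(seq_len: int, hidden_dim: int) -> int:
    """Rough FLOPs for one transformer layer (standard 4x MLP expansion)."""
    linear = 24 * seq_len * hidden_dim * hidden_dim   # QKV/out projections + MLP
    attention = 4 * seq_len * seq_len * hidden_dim    # Q @ K^T and weights @ V
    return linear + attention


def flops_saving(
    total_layers: int,
    prune_after_layer: int,
    visual_tokens: int,
    text_tokens: int,
    keep_ratio: float,
    hidden_dim: int,
) -> float:
    """Fraction of FLOPs saved when visual tokens are pruned inside the LLM."""
    full_len = visual_tokens + text_tokens
    pruned_len = round(visual_tokens * keep_ratio) + text_tokens
    baseline = total_layers * layer_flops(full_len, hidden_dim)
    pruned = (
        prune_after_layer * layer_flops(full_len, hidden_dim)
        + (total_layers - prune_after_layer) * layer_flops(pruned_len, hidden_dim)
    )
    return 1.0 - pruned / baseline


# Hypothetical model: 32 layers, 6144 visual + 128 text tokens, keep 15%.
early = flops_saving(32, 4, 6144, 128, 0.15, 4096)
late = flops_saving(32, 16, 6144, 128, 0.15, 4096)
none = flops_saving(32, 32, 6144, 128, 0.15, 4096)
print(f"prune after layer 4: {early:.1%}, after layer 16: {late:.1%}")
```

In this simplified model, pruning earlier in the network saves more compute, and pruning after the last layer saves nothing, so the same 85% pruning ratio can correspond to quite different FLOPs reductions depending on where it is applied.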

Conclusion

ETA-VLA combines VLA models with efficient token adaptation for autonomous driving. By cutting computational demands while preserving most of the original accuracy, the framework could make VLA-based driving systems more practical to deploy.


Lazarus Omolua (https://richlyai.com/blog)