DUET-VLM: Efficient Dual-Stage Token Reduction for VLMs

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Summary: arXiv:2602.18846v2

Vision-Language Models (VLMs) have demonstrated strong multimodal understanding and reasoning, but they remain computationally expensive because each image or video is encoded as a dense sequence of visual tokens. Existing efficiency methods either merge redundant visual tokens or progressively drop them inside the language backbone, and both approaches often trade accuracy for speed.

To address these issues, we introduce DUET-VLM, a novel dual compression framework designed for Vision-Language Models. This framework is versatile and serves as a plug-and-play solution, incorporating two key components:

  • Vision-only, redundancy-aware compression: The first stage compresses the output of the vision encoder into a smaller set of information-preserving tokens, before any text is involved.
  • Layer-wise, salient text-guided dropping: The second stage prunes visual tokens inside the language backbone, gradually removing the tokens that are least informative with respect to the salient text tokens.
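The two stages above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: this summary does not say how DUET-VLM measures redundancy or salience, so pairwise cosine-similarity merging stands in for stage one and cosine similarity to a single text query vector stands in for stage two. All function names, the toy vectors, and the keep schedule are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_redundant(tokens, target):
    """Stage 1 stand-in: average the most similar pair until `target` tokens remain."""
    groups = [(list(t), 1) for t in tokens]  # (mean vector, member count)
    while len(groups) > target:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = cosine(groups[i][0], groups[j][0])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        (vi, ci), (vj, cj) = groups[i], groups[j]
        # Count-weighted mean keeps the merged token at the centroid of its members.
        merged = [(a * ci + b * cj) / (ci + cj) for a, b in zip(vi, vj)]
        groups[i] = (merged, ci + cj)
        del groups[j]
    return [v for v, _ in groups]

def drop_by_text(tokens, text_query, keep):
    """Stage 2 stand-in: keep the `keep` tokens most similar to the text query."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: cosine(tokens[i], text_query),
                    reverse=True)
    kept = sorted(ranked[:keep])  # restore original token order
    return [tokens[i] for i in kept]

def two_stage(tokens, text_query, merge_target, keep_schedule):
    """Merge once after the vision encoder, then drop at each scheduled layer."""
    tokens = merge_redundant(tokens, merge_target)
    history = [len(tokens)]
    for keep in keep_schedule:
        tokens = drop_by_text(tokens, text_query, min(keep, len(tokens)))
        history.append(len(tokens))
    return tokens, history

# Toy input: two near-duplicate pairs plus two distinct tokens (hypothetical values).
visual = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [0.7, 0.7], [-1.0, 0.2]]
query = [0.0, 1.0]
final, history = two_stage(visual, query, merge_target=4, keep_schedule=[3, 2])
print(history)  # [4, 3, 2]
```

In this toy run the two near-duplicate pairs are merged first, and the later drops keep the tokens that point in roughly the same direction as the query. A real system would score tokens with attention weights inside the model rather than with raw cosine similarity.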

Coordinating token management across the two stages enables aggressive compression while preserving the essential semantics of the visual input. In experiments with LLaVA-1.5-7B, DUET-VLM maintains over 99% of baseline accuracy with a 67% reduction in visual tokens, and it still retains over 97% of baseline accuracy with an 89% reduction.
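To put these percentages in concrete terms, the short calculation below converts a reduction percentage into a token count and back. It assumes 576 visual tokens per image, which is what LLaVA-1.5 produces from its 24 x 24 patch grid. The budgets of 192 and 64 tokens are an assumption: they are common choices in token-pruning comparisons and round to the same percentages, but this summary does not state the exact budgets used. The helper names are hypothetical.

```python
LLAVA15_VISUAL_TOKENS = 576  # 24 x 24 patches per image in LLaVA-1.5

def tokens_kept(total, reduction_pct):
    """Number of visual tokens left after removing `reduction_pct` percent."""
    return round(total * (1 - reduction_pct / 100))

def reduction_pct(total, kept):
    """Percentage of visual tokens removed when only `kept` remain."""
    return 100 * (1 - kept / total)

for pct in (67, 89):
    print(pct, tokens_kept(LLAVA15_VISUAL_TOKENS, pct))
# 67 190
# 89 63

# Assumed budgets (not stated in the summary) that round to the same figures.
for kept in (192, 64):
    print(kept, round(reduction_pct(LLAVA15_VISUAL_TOKENS, kept), 1))
# 192 66.7
# 64 88.9
```

Under these assumptions, the 89% setting leaves the language backbone with roughly one ninth of the original visual tokens.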

The results carry over to video. When DUET-VLM is integrated into Video-LLaVA-7B, the compressed model surpasses the baseline, reaching over 100% of baseline accuracy while using 53.1% fewer tokens. Under an extreme 93.4% reduction, it still retains 97.6% of baseline accuracy. These findings highlight the value of end-to-end training with DUET-VLM, which lets the model adapt robustly to reduced visual input, whether image or video, with little loss of accuracy.

At a matched computational budget, DUET-VLM produces compact yet semantically rich representations, and the authors report that it outperforms prior visual token reduction methods across multiple benchmarks.

For those interested in exploring the technical details and implementation of DUET-VLM, our code is publicly available at https://github.com/AMD-AGI/DUET-VLM.

As the demand for efficient AI models continues to grow, innovations like DUET-VLM provide promising pathways toward enhancing the capabilities and efficiency of Vision-Language Models, paving the way for more accessible and powerful AI applications.



Lazarus Omolua