DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
Summary: arXiv:2602.18846v2 Announce Type: replace-cross
In recent years, Vision-Language Models (VLMs) have demonstrated exceptional capabilities in multimodal understanding and reasoning. However, their computational demands remain a significant challenge due to the dense visual tokenization processes involved. Traditional methods aimed at enhancing efficiency often involve either merging redundant visual tokens or progressively dropping them in the language backbone, which can lead to a trade-off between accuracy and processing speed.
To address these issues, we introduce DUET-VLM, a novel dual compression framework designed for Vision-Language Models. This framework is versatile and serves as a plug-and-play solution, incorporating two key components:
- Vision-only redundancy aware compression: This initial stage focuses on compressing the output of the vision encoder into a smaller set of information-preserving tokens.
- Layer-wise, salient text-guided dropping: In this second stage, visual tokens within the language backbone are pruned based on their informativeness, allowing for gradual reduction of less critical tokens.
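The two stages above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the released implementation: the anchor-and-merge scheme in stage one, the cosine-similarity saliency score in stage two, and all function names (`compress_vision_tokens`, `drop_by_text_saliency`, `duet_sketch`) are hypothetical stand-ins chosen for clarity. A real model would use attention scores and run transformer layers between pruning steps.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress_vision_tokens(tokens, budget):
    """Stage 1 (assumed scheme): merge redundant tokens into `budget` tokens."""
    budget = max(1, min(budget, len(tokens)))
    # Greedily pick anchors that are least similar to the anchors chosen so far.
    anchors = [0]
    while len(anchors) < budget:
        candidates = [i for i in range(len(tokens)) if i not in anchors]
        best = min(
            candidates,
            key=lambda i: max(cosine(tokens[i], tokens[a]) for a in anchors),
        )
        anchors.append(best)
    anchors.sort()  # preserve the original token order
    # Each group starts with its anchor; other tokens join the most similar anchor.
    groups = {a: [tokens[a]] for a in anchors}
    for i, tok in enumerate(tokens):
        if i in groups:
            continue
        nearest = max(anchors, key=lambda a: cosine(tok, tokens[a]))
        groups[nearest].append(tok)
    # Average each group so merged tokens still contribute information.
    return [[sum(col) / len(groups[a]) for col in zip(*groups[a])] for a in anchors]


def drop_by_text_saliency(visual, text, keep_ratio):
    """Stage 2 (assumed scoring): keep the visual tokens most similar to the text."""
    keep = max(1, math.ceil(len(visual) * keep_ratio))
    scores = [max(cosine(v, t) for t in text) for v in visual]
    top = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)[:keep]
    return [visual[i] for i in sorted(top)]  # preserve token order


def duet_sketch(visual, text, budget, keep_ratios):
    """Run stage 1 once, then one stage-2 pruning step per entry in `keep_ratios`."""
    tokens = compress_vision_tokens(visual, budget)
    for ratio in keep_ratios:
        # A real backbone would apply its transformer layers here before pruning.
        tokens = drop_by_text_saliency(tokens, text, ratio)
    return tokens
```

Calling `duet_sketch(visual, text, budget=3, keep_ratios=[0.5, 0.5])` on six toy 2-D vectors first merges them into three tokens and then prunes in two steps, leaving the single token closest to the text vector. The point of the split is that stage one needs no text at all, while stage two can use the prompt to decide what is safe to drop.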
The coordinated management of tokens across both stages enables aggressive compression while preserving the essential semantics of the visual input. In our experiments with LLaVA-1.5-7B, DUET-VLM maintains over 99% of baseline accuracy with a 67% reduction in visual tokens, and still retains over 97% of baseline accuracy with an 89% reduction.
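To make the reduction percentages concrete, the short calculation below converts a reduction ratio into a remaining token count. The figure of 576 visual tokens per image is an assumption not stated in this summary: it corresponds to a 24 x 24 patch grid from a ViT-L/14 encoder at 336 px input, the configuration commonly used with LLaVA-1.5. The helper name `tokens_kept` is hypothetical.

```python
def tokens_kept(total_tokens, reduction):
    """Visual tokens remaining after removing a `reduction` fraction of them."""
    return round(total_tokens * (1.0 - reduction))


# Assumed per-image token count (24 x 24 patches); not stated in the summary.
ASSUMED_IMAGE_TOKENS = 576

for reduction in (0.67, 0.89):
    kept = tokens_kept(ASSUMED_IMAGE_TOKENS, reduction)
    print(f"{reduction:.0%} reduction leaves {kept} of {ASSUMED_IMAGE_TOKENS} tokens")
```

Under that assumption, the two settings leave roughly 190 and 63 visual tokens per image. Because the reported percentages are rounded, the exact budgets used in the experiments may differ by a few tokens.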
When integrated into Video-LLaVA-7B, DUET-VLM even surpasses the baseline, reaching over 100% of baseline accuracy while reducing the token count by 53.1%. Under an extreme 93.4% reduction, the model still retains 97.6% of baseline accuracy. These findings underscore the value of end-to-end training with DUET-VLM, which lets the model adapt robustly to reduced visual input, whether images or videos, without sacrificing accuracy.
Furthermore, DUET-VLM produces compact yet semantically rich representations within the same computational budget. Across multiple benchmarks, this dual-stage approach sets a new state of the art among visual token reduction methods.
For those interested in exploring the technical details and implementation of DUET-VLM, our code is publicly available at https://github.com/AMD-AGI/DUET-VLM.
As the demand for efficient AI models continues to grow, innovations like DUET-VLM provide promising pathways toward enhancing the capabilities and efficiency of Vision-Language Models, paving the way for more accessible and powerful AI applications.
