Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
A recent paper, available on arXiv as 2604.06912v1, introduces Q-Zoom, a framework for multimodal large language models (MLLMs). It targets a specific inefficiency: the cost of processing the high-resolution visual inputs that tasks such as document understanding and dense scene perception require.
Current Challenges in MLLMs
MLLMs have traditionally relied on global resolution scaling, which floods the quadratic self-attention mechanism with visually redundant tokens. The result is a severe bottleneck in inference throughput. The approach also ignores spatial sparsity and the specific intent of the query, so the whole image is processed at full resolution even when only a small part of it matters. Fine-grained tasks, which depend on high-resolution input, are affected most.
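To make the quadratic cost concrete, here is a rough back-of-the-envelope sketch. It assumes one visual token per 28×28 pixel tile and counts token pairs in full self-attention; the tile size, the image sizes, and the function names are illustrative assumptions, not figures from the paper.

```python
def visual_tokens(width: int, height: int, pixels_per_token: int = 28) -> int:
    """Number of visual tokens, assuming one token per square tile (hypothetical tile size)."""
    return (width // pixels_per_token) * (height // pixels_per_token)


def attention_pairs(num_tokens: int) -> int:
    """Token-pair interactions in full self-attention, which grow quadratically."""
    return num_tokens * num_tokens


low = visual_tokens(448, 448)      # coarse global view
high = visual_tokens(1792, 1792)   # 4x larger on each side

print(low, high)                                        # 256 4096
print(attention_pairs(high) // attention_pairs(low))    # 256
```

Under these assumptions, a 4x increase in side length gives 16x more tokens and 256x more attention pairs, which is why scaling the whole image is so expensive.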
Introducing Q-Zoom
To address these challenges, the authors propose Q-Zoom, a query-aware adaptive high-resolution perception framework that works in a highly efficient coarse-to-fine manner. The framework consists of two primary components:
- Dynamic Gating Network: This lightweight network safely bypasses high-resolution processing when coarse global features are sufficient, optimizing the inference process.
- Self-Distilled Region Proposal Network (SD-RPN): For queries requiring fine-grained perception, the SD-RPN localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces, so that only that region is processed at high resolution.
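The control flow of the two components can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gate score, the relevance grid, both thresholds, and every name below are hypothetical stand-ins for what the gating network and SD-RPN would produce.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (row_start, col_start, row_end, col_end) on the feature grid, end exclusive
Box = Tuple[int, int, int, int]


@dataclass
class Decision:
    used_high_res: bool
    roi: Optional[Box]


def roi_from_relevance(relevance: List[List[float]], threshold: float) -> Optional[Box]:
    """Tightest box around grid cells whose relevance exceeds the threshold."""
    hits = [(r, c) for r, row in enumerate(relevance)
            for c, score in enumerate(row) if score > threshold]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def route(gate_score: float, relevance: List[List[float]],
          gate_threshold: float = 0.5, roi_threshold: float = 0.6) -> Decision:
    """Coarse-to-fine routing: skip the fine pass when the gate says coarse is enough."""
    if gate_score < gate_threshold:
        return Decision(used_high_res=False, roi=None)
    roi = roi_from_relevance(relevance, roi_threshold)
    # If no cell is relevant enough, fall back to the coarse answer.
    return Decision(used_high_res=roi is not None, roi=roi)


# Hypothetical 4x4 relevance grid for one query.
relevance = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.7, 0.2, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]

print(route(0.2, relevance))  # Decision(used_high_res=False, roi=None)
print(route(0.9, relevance))  # Decision(used_high_res=True, roi=(1, 1, 3, 3))
```

In a real system the returned box would be mapped back to pixel coordinates and only that crop would be re-encoded at high resolution.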
Optimizing Efficiency
Each module has its own training recipe. The gating network is trained on routing labels produced by a consistency-aware generation strategy, and the SD-RPN is trained with a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then integrate the dense local RoI with the coarse global layout.
Experimental Results
According to the paper's experiments, Q-Zoom establishes a dominant Pareto frontier, trading off accuracy against inference cost. With Qwen2.5-VL-7B as the primary testbed, the framework speeds up inference by:
- 2.52× on Document & OCR benchmarks
- 4.39× in High-Resolution scenarios
At these speeds, Q-Zoom matches the baseline's peak accuracy. When configured for maximum perceptual fidelity, it exceeds that peak by:
- 1.1% on Document & OCR benchmarks
- 8.1% on High-Resolution benchmarks
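As a quick reading aid, the reported speedups can be converted into the share of baseline inference time that remains. The sketch below assumes each speedup is the ratio of baseline time to Q-Zoom time; the helper name is illustrative.

```python
def relative_latency(speedup: float) -> float:
    """Fraction of baseline inference time remaining after a given speedup."""
    return 1.0 / speedup


for name, speedup in [("Document & OCR", 2.52), ("High-Resolution", 4.39)]:
    print(f"{name}: {relative_latency(speedup):.1%} of baseline time")
# Document & OCR: 39.7% of baseline time
# High-Resolution: 22.8% of baseline time
```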
Broader Impact
The improvements are not limited to a single model. According to the authors, the framework's advantages transfer to other models, including Qwen3-VL, LLaVA, and emerging reinforcement learning-based thinking-with-image models. This suggests Q-Zoom is a broadly applicable way to improve both the efficiency and the accuracy of multimodal large language models.
Further Information
For more details, see the official Q-Zoom project page.
