Q-Zoom: Efficient Query-Aware Perception for MLLMs

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

A recent paper, Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models (arXiv identifier 2604.06912v1), targets the cost of processing high-resolution visual inputs in multimodal large language models (MLLMs). Tasks such as document understanding and dense scene perception need that resolution, and the paper argues that current models spend it inefficiently.

Current Challenges in MLLMs

MLLMs have traditionally relied on global resolution scaling: the whole image is encoded at higher resolution, which floods the quadratic self-attention mechanism with visually redundant tokens. This creates a severe bottleneck in inference throughput, and it ignores both the spatial sparsity of the relevant content and the specific intent of the query. The cost is felt most in fine-grained tasks, where high-resolution input cannot simply be dropped.
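To make the quadratic cost concrete, here is a small back-of-the-envelope sketch. It assumes one visual token per 28x28-pixel patch, which is an illustrative value and not a figure taken from the paper, and it ignores text tokens. The function names are hypothetical.

```python
import math

def visual_tokens(height: int, width: int, pixels_per_token: int = 28) -> int:
    """Visual tokens for an image, assuming one token per square patch."""
    return math.ceil(height / pixels_per_token) * math.ceil(width / pixels_per_token)

def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens

low = visual_tokens(448, 448)    # 16 * 16 = 256 tokens
high = visual_tokens(896, 896)   # 32 * 32 = 1024 tokens

print(high // low)                                     # 4
print(attention_pairs(high) // attention_pairs(low))   # 16
```

Under these assumptions, doubling the side length of the image quadruples the token count and multiplies the attention work by sixteen, regardless of how much of the image the query actually concerns.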

Introducing Q-Zoom

To address these challenges, the authors propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in a coarse-to-fine manner. The framework consists of two primary components:

Dynamic Gating Network: This lightweight network bypasses high-resolution processing when coarse global features are sufficient to answer the query, so easy queries pay no extra cost.
Self-Distilled Region Proposal Network (SD-RPN): For queries that require fine-grained perception, the SD-RPN localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces, so that only that region is processed at high resolution.
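The control flow implied by these two components can be sketched as follows. This is a minimal illustration of the coarse-to-fine routing idea, not the authors' implementation: every name here (`coarse_to_fine`, `gate`, `propose_roi`, the `threshold` value, and the toy stand-in functions) is hypothetical, and the real modules operate on model features rather than on raw strings.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

@dataclass
class ZoomResult:
    used_high_res: bool
    roi: Optional[Box]
    answer: str

def coarse_to_fine(
    image: Any,
    query: str,
    gate: Callable[[Any, str], float],
    propose_roi: Callable[[Any, str], Box],
    answer_coarse: Callable[[Any, str], str],
    answer_with_roi: Callable[[Any, str, Box], str],
    threshold: float = 0.5,  # assumed cutoff, not a value from the paper
) -> ZoomResult:
    # Stage 1: the gate scores how likely the query needs high-resolution detail.
    if gate(image, query) < threshold:
        return ZoomResult(False, None, answer_coarse(image, query))
    # Stage 2: localize the query-relevant region and answer using that region.
    roi = propose_roi(image, query)
    return ZoomResult(True, roi, answer_with_roi(image, query, roi))

# Toy stand-ins so the sketch runs end to end.
def toy_gate(image, query):
    return 0.9 if "small print" in query else 0.1

def toy_rpn(image, query):
    return (100, 200, 300, 260)

def toy_coarse(image, query):
    return "coarse answer"

def toy_fine(image, query, roi):
    return f"fine answer from {roi}"

easy = coarse_to_fine(None, "what color is the sky?", toy_gate, toy_rpn, toy_coarse, toy_fine)
hard = coarse_to_fine(None, "read the small print", toy_gate, toy_rpn, toy_coarse, toy_fine)
print(easy.used_high_res, hard.used_high_res)  # False True
```

The point of the structure is that the expensive branch runs only when the gate asks for it, and even then only on the proposed region.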

Optimizing Efficiency

The two modules are trained in different ways. Routing labels for the gating network come from a consistency-aware generation strategy, while the SD-RPN is trained with a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning are then used to integrate the dense local RoI with the coarse global layout.
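The post does not spell out how routing labels are generated. As one plausible reading only, and not the authors' procedure, the sketch below labels a training sample by comparing repeated low-resolution and high-resolution attempts and keeps the sample only when the outcomes are consistent. The function `routing_label` and its rule are assumptions made for illustration.

```python
from typing import Optional, Sequence

def routing_label(
    low_res_correct: Sequence[bool],
    high_res_correct: Sequence[bool],
) -> Optional[int]:
    """Return 0 (coarse is enough), 1 (needs high-res), or None (skip sample).

    Each sequence holds the correctness of several independent attempts.
    This rule is an illustrative assumption, not the paper's definition.
    """
    if not low_res_correct or not high_res_correct:
        return None  # no attempts recorded, so no label
    if all(low_res_correct):
        return 0     # the coarse pass is reliably right
    if not any(low_res_correct) and all(high_res_correct):
        return 1     # coarse reliably fails, high-res reliably succeeds
    return None      # mixed outcomes give no clean routing signal

print(routing_label([True, True, True], [True, True, True]))     # 0
print(routing_label([False, False, False], [True, True, True]))  # 1
print(routing_label([True, False, True], [True, True, True]))    # None
```

Dropping inconsistent samples trades training-set size for cleaner supervision, which matters for a gate whose mistakes either waste compute or cost accuracy.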

Experimental Results

According to the paper's experiments, Q-Zoom establishes a dominant Pareto frontier between accuracy and inference speed. Using Qwen2.5-VL-7B as the primary testbed, the framework speeds up inference by:

2.52 times on Document & OCR benchmarks
4.39 times in High-Resolution scenarios

These speedups are achieved while matching the baseline’s peak accuracy. When configured for maximum perceptual fidelity instead, Q-Zoom exceeds the baseline’s peak performance by:

1.1% on Document & OCR benchmarks
8.1% on High-Resolution benchmarks
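To read the reported speedups as time per query, the short calculation below converts each factor into a fraction of baseline time. It assumes the speedup is the ratio of baseline inference time to Q-Zoom inference time; the post does not state how the factor is measured, and `relative_latency` is a hypothetical helper.

```python
def relative_latency(speedup: float) -> float:
    """Fraction of baseline time remaining, assuming speedup = t_baseline / t_new."""
    return 1.0 / speedup

for name, speedup in [("Document & OCR", 2.52), ("High-Resolution", 4.39)]:
    print(f"{name}: {relative_latency(speedup):.1%} of baseline time")
# Document & OCR: 39.7% of baseline time
# High-Resolution: 22.8% of baseline time
```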

Broader Impact

The improvements are not limited to a single model. According to the authors, the framework’s advantages transfer to other models such as Qwen3-VL and LLaVA, as well as to emerging reinforcement learning-based thinking-with-image models. This suggests that query-aware, coarse-to-fine perception is a general way to improve both the efficiency and the accuracy of multimodal large language models.

Further Information

For more details, see the official Q-Zoom project page.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
