V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
Summary: arXiv:2604.03307v1 Announce Type: cross
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states.
To overcome this limitation, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a “think-then-look” visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence. Our approach employs a two-stage distillation strategy built on two modules:
- Box-Guided Compression (BCM) Module: This module establishes stable pixel-to-latent targets through explicit spatial grounding.
- Dynamic Autoregressive Compression (DAC) Module: This component maps the model’s hidden states into dynamic probes that interrogate the global visual feature map.
By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence effectively. During inference, both modules remain entirely inactive, maintaining a purely end-to-end autoregressive decoding in the latent space with optimal efficiency.
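The probing and distillation idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the attention form, the teacher's pooling rule, or the loss, so the scaled dot-product attention, the mean pooling inside a box, the mean-squared-error objective, and all function names (`probe_attention`, `box_pooled_target`, `mse`) are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def probe_attention(probe, visual_feats):
    """Student side (assumed form): one hidden-state probe attends over
    all patch features with scaled dot-product attention.
    Returns the attention weights and the attention-pooled feature."""
    scale = math.sqrt(len(probe))
    weights = softmax([dot(probe, f) / scale for f in visual_feats])
    dim = len(visual_feats[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, visual_feats))
              for d in range(dim)]
    return weights, pooled


def box_pooled_target(visual_feats, grid_w, box):
    """Teacher side (assumed form): mean of the patch features inside a
    box (x0, y0, x1, y1), given in inclusive patch coordinates on a
    row-major grid that is grid_w patches wide."""
    x0, y0, x1, y1 = box
    inside = [f for i, f in enumerate(visual_feats)
              if x0 <= i % grid_w <= x1 and y0 <= i // grid_w <= y1]
    dim = len(visual_feats[0])
    return [sum(f[d] for f in inside) / len(inside) for d in range(dim)]


def mse(a, b):
    """Assumed distillation loss: mean squared error between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy 2x2 patch grid (row-major) with 2-dimensional features.
feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
probe = [4.0, 0.0]  # hypothetical probe derived from a hidden state

weights, student = probe_attention(probe, feats)
teacher = box_pooled_target(feats, grid_w=2, box=(0, 0, 0, 0))
loss = mse(student, teacher)
```

In this toy setup the probe puts most of its attention on the patches whose features align with it, and the loss measures how far the student's attention-pooled feature is from the box-derived target. At inference time only the trained model would run; the teacher target and the loss exist only during training in this sketch.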
Extensive experiments across six perception-intensive benchmarks demonstrate the effectiveness of V-Reflection, which significantly narrows the fine-grained perception gap typically observed in MLLMs. The results indicate that the approach not only improves accuracy but also improves interpretability, since the latent reasoning autonomously localizes task-critical visual evidence.
Key Findings
- V-Reflection transforms MLLMs from passive observers into active interrogators that can re-examine visual details as their reasoning evolves.
- The “think-then-look” mechanism lets latent reasoning states interrogate the visual input dynamically, which significantly improves performance on fine-grained tasks.
- The two-stage distillation strategy transfers the BCM teacher’s spatial expertise to the DAC student, so the model learns to localize task-critical evidence on its own.
- Experiments on six perception-intensive benchmarks show that the framework significantly narrows the fine-grained perception gap.
- Visualizations reveal that the latent reasoning can accurately identify relevant visual details, supporting the model’s conclusions.
In conclusion, V-Reflection addresses a key limitation of earlier Multimodal Large Language Models, namely treating visual input as a static preamble, by letting the model’s latent reasoning states query visual features as reasoning proceeds. The framework holds promise both for future research and for practical applications where accurate interpretation of visual information is critical.
