Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
arXiv:2604.03556v1
Abstract
Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency.
In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase.
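One way to make the three-phase picture concrete is to track how concentrated attention is at each encoder layer. The sketch below is an illustration under stated assumptions, not the paper's procedure: it assumes one attention distribution over image tokens is available per layer (for example, head-averaged attention), uses Shannon entropy as the concentration measure, and treats the lowest-entropy layers as the focus phase. The function names and the `margin` parameter are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def find_focus_layers(per_layer_attention, margin=0.1):
    """Return (focus_layer_indices, entropies).

    per_layer_attention: one attention distribution over image tokens per
    encoder layer (assumed input, not specified by the source).
    A layer counts as "focus" when its entropy lies within `margin` nats of
    the minimum entropy across layers (an assumed criterion).
    """
    entropies = [attention_entropy(a) for a in per_layer_attention]
    lowest = min(entropies)
    focus = [i for i, h in enumerate(entropies) if h - lowest <= margin]
    return focus, entropies


# Toy example: attention is spread out, sharpens at layer 2, then spreads again.
layers = [
    [0.25, 0.25, 0.25, 0.25],  # diffuse
    [0.40, 0.30, 0.20, 0.10],
    [0.85, 0.05, 0.05, 0.05],  # concentrated
    [0.30, 0.30, 0.20, 0.20],  # diffuse again
]
focus_layers, layer_entropies = find_focus_layers(layers)
```

In this toy input, high entropy corresponds to the diffusion and rediffusion phases, and the entropy dip marks the focus phase.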
Key Findings
The following key findings emerge from our study:
- Attention Dynamics: We identify three consistent phases of visual information processing in the vision encoders of LVLMs: diffusion, focus, and rediffusion.
- Token Sensitivity: Hallucination behavior is particularly sensitive to low-attention tokens during the focus phase.
- Lightweight Intervention: Our proposed method suppresses low-attention tokens during inference without requiring retraining.
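The intervention described in the findings above can be sketched as a simple masking step. This is a minimal sketch, assuming that "suppression" means zeroing the features of the least-attended tokens at a focus-phase layer. The source does not specify the exact operation, and `suppress_low_attention` and `keep_ratio` are hypothetical names.

```python
def suppress_low_attention(token_features, attention, keep_ratio=0.75):
    """Zero out the features of the tokens that receive the least attention.

    token_features: list of per-token feature vectors (lists of floats).
    attention: attention received by each token at a focus-phase layer.
    keep_ratio: fraction of tokens left untouched (assumed hyperparameter).
    """
    n = len(attention)
    n_keep = max(1, int(n * keep_ratio))
    # Rank token indices from most to least attended.
    ranked = sorted(range(n), key=lambda i: attention[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [
        list(feat) if i in keep else [0.0] * len(feat)
        for i, feat in enumerate(token_features)
    ]


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
attention = [0.85, 0.05, 0.06, 0.04]
suppressed = suppress_low_attention(features, attention)
```

With `keep_ratio=0.75` and four tokens, only the single least-attended token (index 3) is zeroed. No weights are updated, which is what makes the step usable at inference time.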
Methodology
Our approach is training-free and relies only on statistics gathered from a single forward pass. A Determinantal Point Process (DPP) filters redundant tokens while preserving diverse visual cues. Because no per-input iterative optimization is required, hallucinations are suppressed without significant additional inference latency.
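To illustrate how a DPP trades off redundancy against diversity, the sketch below runs a naive greedy MAP selection over a small kernel. It is an assumption-laden illustration rather than the paper's implementation: the kernel form (quality times cosine similarity times quality), the use of attention as the quality score, and the function names are all hypothetical, and the brute-force determinant is only practical for small token sets.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def determinant(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, det = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det


def greedy_dpp(features, quality, k):
    """Greedily pick up to k tokens maximizing det of the DPP kernel submatrix.

    Kernel (assumed form): L[i][j] = quality[i] * cos(f_i, f_j) * quality[j].
    Near-duplicate tokens make the submatrix nearly singular, so they are
    rarely selected together.
    """
    n = len(features)
    kernel = [
        [quality[i] * quality[j] * cosine(features[i], features[j]) for j in range(n)]
        for i in range(n)
    ]
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = determinant([[kernel[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining token adds volume
            break
        selected.append(best)
    return selected


# Tokens 0 and 1 are near-duplicates; token 2 points in a different direction.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
quality = [1.0, 0.9, 0.8]
chosen = greedy_dpp(features, quality, k=2)  # selects tokens 0 and 2
```

Although token 1 has a higher quality score than token 2, it is almost identical to the already-selected token 0, so the greedy step prefers the more diverse token 2.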
Results and Discussion
We ran experiments across multiple LVLM backbones and decoding strategies. Our approach consistently reduces hallucination metrics while maintaining competitive caption quality. Compared to adversarial uncertainty estimation methods, it achieves comparable hallucination mitigation with negligible additional inference latency.
Conclusion
Our study highlights the role of attention dynamics within the vision encoders of LVLMs. Phase-aware suppression of low-attention tokens during the focus phase reduces hallucinations while maintaining competitive caption quality, without retraining or per-input optimization. This makes it a practical step toward more reliable vision-language models in real-world applications.
