Phase-Aware Suppression to Reduce Hallucinations in LVLMs

Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

Source: arXiv:2604.03556v1 (cross-listed)

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency.

In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase.
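To make the three-phase idea concrete, here is a minimal sketch of one way the phases could be made visible. It rests on an assumption that is not stated in the abstract: that a phase can be read off the entropy of each layer's attention distribution, with high entropy meaning spread-out (diffuse) attention and low entropy meaning concentrated (focused) attention. The names `attention_entropy` and `label_phases` and the `margin` threshold are hypothetical and chosen only for illustration.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over tokens."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def label_phases(per_layer_attention, margin=0.9):
    """Label each layer as 'diffusion', 'focus', or 'rediffusion'.

    Assumption (not from the paper): a layer counts as focused when its
    entropy falls below `margin` times the highest entropy of any layer.
    Layers before the first focused layer are 'diffusion', layers after the
    last focused layer are 'rediffusion'.
    """
    entropies = [attention_entropy(w) for w in per_layer_attention]
    threshold = margin * max(entropies)
    focused = [i for i, h in enumerate(entropies) if h < threshold]

    labels = []
    for i in range(len(entropies)):
        if not focused or i < focused[0]:
            labels.append("diffusion")
        elif i <= focused[-1]:
            labels.append("focus")
        else:
            labels.append("rediffusion")
    return labels, entropies


# Toy example: attention over 4 tokens at 5 layers (made-up numbers).
toy_attention = [
    [0.25, 0.25, 0.25, 0.25],
    [0.30, 0.30, 0.20, 0.20],
    [0.85, 0.05, 0.05, 0.05],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
labels, entropies = label_phases(toy_attention)
print(labels)
```

In a real vision encoder the per-layer weights would come from the model's own attention maps, for example averaged over heads; the toy numbers above only illustrate the shape of the computation.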

Key Findings

The following key findings emerge from our study:

Attention Dynamics: We identify three consistent phases in how the vision encoders of LVLMs process visual information: diffusion, focus, and rediffusion.
Token Sensitivity: Hallucination behavior is particularly sensitive to tokens that receive low attention during the focus phase.
Lightweight Intervention: Our proposed method selectively suppresses these low-attention tokens during the focus phase at inference time, without requiring retraining.
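The intervention described in the findings above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the function name `suppress_low_attention`, the `ratio` of tokens to suppress, and the `scale` factor are all hypothetical, and the sketch assumes that each visual token already has a scalar score for the attention it received during the focus phase.

```python
def suppress_low_attention(token_features, attention_received, ratio=0.25, scale=0.0):
    """Scale down the features of the least-attended tokens.

    token_features:     list of per-token feature vectors (lists of floats)
    attention_received: one focus-phase attention score per token
    ratio:              fraction of tokens to suppress (hypothetical knob)
    scale:              multiplier applied to suppressed tokens (0.0 zeroes them)
    """
    n = len(token_features)
    if len(attention_received) != n:
        raise ValueError("need exactly one attention score per token")

    num_suppressed = int(n * ratio)
    # Token indices ordered from least to most attended.
    order = sorted(range(n), key=lambda i: attention_received[i])
    suppressed = set(order[:num_suppressed])

    output = [
        [x * scale for x in feat] if i in suppressed else list(feat)
        for i, feat in enumerate(token_features)
    ]
    return output, sorted(suppressed)


# Toy example: 4 tokens with 2-dimensional features (made-up numbers).
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
scores = [0.40, 0.05, 0.50, 0.02]
new_features, dropped = suppress_low_attention(features, scores, ratio=0.5)
print(dropped)
```

Because the scores come from a pass the model performs anyway, a step like this adds only a sort and a scaling per image, which is consistent with the low-latency goal described in the abstract.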

Methodology

Our approach is training-free and uses only statistics gathered from a single forward pass. A Determinantal Point Process (DPP) filters redundant tokens while preserving diverse visual cues. Because no per-input iterative optimization is required, hallucinations are suppressed without significant additional inference latency.
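To illustrate how a DPP can favor diverse tokens over redundant ones, the sketch below runs a simple greedy selection on a small kernel. Everything specific here is an assumption: the abstract does not say how the kernel is built or how the DPP is solved, so the quality-times-cosine-similarity kernel, the greedy determinant maximization, and the names `build_kernel` and `greedy_dpp` are illustrative only.

```python
import math


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def build_kernel(features, quality):
    """Assumed kernel: L[i][j] = quality_i * cosine(f_i, f_j) * quality_j."""
    n = len(features)
    return [
        [quality[i] * cosine(features[i], features[j]) * quality[j] for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp(kernel, k):
    """Greedily pick up to k items, maximizing the kernel submatrix determinant."""
    selected = []
    remaining = list(range(len(kernel)))

    def gain(i):
        idx = selected + [i]
        return determinant([[kernel[r][c] for c in idx] for r in idx])

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)
        if gain(best) <= 1e-12:
            break  # every remaining item is redundant with the selection
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: tokens 0 and 1 are near-duplicates, token 2 is different.
features = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
quality = [1.0, 0.9, 0.8]
kept = greedy_dpp(build_kernel(features, quality), k=2)
print(kept)
```

In the toy example, token 1 has higher quality than token 2 but is nearly parallel to the already selected token 0, so the determinant criterion prefers token 2. This naive version recomputes determinants from scratch and is only suitable for small token sets.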

Results and Discussion

Experiments across multiple LVLM backbones and decoding strategies show that our approach consistently and significantly reduces hallucination metrics while maintaining competitive caption quality. Compared with adversarial uncertainty estimation methods, our intervention achieves comparable hallucination mitigation with negligible additional inference latency.

Conclusion

Our study shows that the attention dynamics inside the vision encoders of LVLMs matter for hallucination. By suppressing low-attention tokens during the focus phase, the phase-aware method reduces hallucinations while preserving caption quality. This is a step toward more reliable vision-language models in real-world applications.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
