Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
Object hallucination in Large Vision-Language Models (LVLMs) undermines their reliability and is a critical barrier to deployment in high-stakes scenarios such as autonomous driving and medical image analysis. In a recent arXiv preprint (arXiv:2603.24058v1), researchers report systematic empirical investigations into the factors that contribute to this phenomenon.
Understanding Object Hallucination
Object hallucination occurs when an LVLM identifies or describes objects that are not present in the visual input. This mismatch between what the model sees and what it says can cause significant errors, particularly in applications where accuracy is paramount. The research finds that imbalanced attention allocation, both across modalities (vision versus language) and within a modality (among individual tokens), correlates strongly with the occurrence of object hallucination.
Introducing Attention Imbalance
The study introduces the concept of attention imbalance, which quantifies the degree of attention disparity in LVLMs. Beyond measuring how attention is allocated, it can be visualized to expose the underlying patterns that contribute to object hallucination. Specifically, it identifies:
- Over-attentiveness to irrelevant language tokens
- Under-attentiveness to discriminative visual features
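To make the two patterns above concrete, the following is a minimal sketch of how a modality-wise disparity could be computed from a single row of attention weights. It is not the paper's definition of attention imbalance, which this summary does not reproduce. The function names `modality_shares` and `per_token_ratio`, the `"vision"`/`"text"` labels, and the toy weights are all assumptions made for illustration.

```python
from typing import Dict, List


def modality_shares(attn: List[float], modality: List[str]) -> Dict[str, float]:
    """Fraction of one query token's attention mass that lands on each modality."""
    total = sum(attn)
    shares: Dict[str, float] = {}
    for weight, kind in zip(attn, modality):
        shares[kind] = shares.get(kind, 0.0) + weight / total
    return shares


def per_token_ratio(attn: List[float], modality: List[str]) -> float:
    """Mean attention per text token divided by mean attention per vision token.

    Illustrative score only (an assumption, not the paper's metric). Values well
    above 1 mean a typical text token receives more attention than a typical
    vision token. Assumes both modalities are present and vision attention > 0.
    """
    shares = modality_shares(attn, modality)
    n_text = modality.count("text")
    n_vision = modality.count("vision")
    return (shares["text"] / n_text) / (shares["vision"] / n_vision)


# Toy attention row: three vision tokens followed by two text tokens.
attn_row = [0.05, 0.05, 0.10, 0.30, 0.50]
kinds = ["vision", "vision", "vision", "text", "text"]
shares = modality_shares(attn_row, kinds)
ratio = per_token_ratio(attn_row, kinds)
print(shares, ratio)
```

In this toy row the text tokens collect most of the attention mass even though they are outnumbered by vision tokens, which is the kind of disparity the bullets describe. In a real model the row would come from a decoder attention head rather than a hand-written list.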
Proposed Solution: Attention Imbalance Rectification (AIR)
To address the issue of object hallucination, the researchers propose a novel intervention method called Attention Imbalance Rectification (AIR). This lightweight approach is implemented during the decoding phase of the model and focuses on reallocating attention weights and adjusting attention distributions. The goal is to rectify both modality-wise and token-wise imbalances that lead to hallucinations.
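The idea of reallocating attention weights at decoding time can be illustrated with a small sketch. This is not the AIR algorithm itself, whose exact update rule is not given in this summary. It shows only one plausible modality-wise rectification: if vision tokens receive less than a chosen share of the attention mass, the row is rescaled so they receive that share, while the relative weights inside each modality are preserved. The function name `reallocate_to_vision` and the parameter `min_vision_share` are hypothetical, and token-wise rectification is not covered.

```python
from typing import List


def reallocate_to_vision(
    attn: List[float],
    is_vision: List[bool],
    min_vision_share: float = 0.5,
) -> List[float]:
    """Rescale one attention row so vision tokens get at least `min_vision_share`.

    Hypothetical rule for illustration: proportions within each modality are
    kept, and the returned row sums to 1.
    """
    total = sum(attn)
    normalized = [w / total for w in attn]
    vision_mass = sum(w for w, flag in zip(normalized, is_vision) if flag)
    text_mass = 1.0 - vision_mass

    # Nothing to fix if a modality is absent or vision already has enough mass.
    if vision_mass == 0.0 or text_mass == 0.0 or vision_mass >= min_vision_share:
        return normalized

    vision_scale = min_vision_share / vision_mass
    text_scale = (1.0 - min_vision_share) / text_mass
    return [
        w * (vision_scale if flag else text_scale)
        for w, flag in zip(normalized, is_vision)
    ]


# Toy attention row: three vision tokens followed by two text tokens.
row = [0.05, 0.05, 0.10, 0.30, 0.50]
flags = [True, True, True, False, False]
rectified = reallocate_to_vision(row, flags, min_vision_share=0.5)
print(rectified)
```

Because the adjustment touches only attention weights during generation, a scheme like this needs no retraining, which is consistent with the description of AIR as a lightweight decoding-time intervention.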
Evaluation and Results
Extensive evaluations were conducted on four mainstream LVLMs across three benchmarks: CHAIR, POPE, and MM-Vet. The results show that AIR significantly reduces object hallucination rates, by up to 35.1% compared to existing baselines. AIR also improved the general capabilities of LVLMs by up to 15.9% across various vision-language tasks.
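For readers unfamiliar with CHAIR, the sketch below computes its two standard scores: the instance-level score (hallucinated object mentions divided by all object mentions) and the sentence-level score (captions containing at least one hallucinated object divided by all captions). It is simplified on purpose. It assumes object mentions have already been extracted from each caption and deduplicated into sets, and it skips the synonym mapping that the full metric applies. The function name `chair_scores` and the toy captions are illustrative and are not taken from the study.

```python
from typing import List, Set, Tuple


def chair_scores(
    mentioned: List[Set[str]],
    present: List[Set[str]],
) -> Tuple[float, float]:
    """Return (instance-level, sentence-level) CHAIR scores.

    mentioned[i]: objects named in caption i (assumed already extracted).
    present[i]:   objects actually in image i according to the annotations.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for said, truth in zip(mentioned, present):
        wrong = said - truth  # objects named in the caption but absent from the image
        total_mentions += len(said)
        hallucinated_mentions += len(wrong)
        if wrong:
            captions_with_hallucination += 1

    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned) if mentioned else 0.0
    return chair_i, chair_s


# Toy data: the second caption mentions a "remote" that is not in the image.
captions = [{"dog", "frisbee"}, {"cat", "sofa", "remote"}]
annotations = [{"dog", "frisbee", "grass"}, {"cat", "sofa"}]
chair_i, chair_s = chair_scores(captions, annotations)
print(chair_i, chair_s)
```

Lower values are better for both scores, so a reduction in hallucination rate corresponds to these numbers moving toward zero.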
Conclusion
The findings from this research provide insight into the mechanisms behind object hallucination in LVLMs and propose a viable solution in AIR. Addressing attention imbalance at decoding time offers a path toward more reliable LVLMs in high-stakes environments, and the introduction of attention imbalance as a concept is a significant step forward in understanding and mitigating object hallucination.
