Revealing Multi-View Hallucination in Large Vision-Language Models
Large vision-language models (LVLMs) are increasingly applied to multi-view image inputs, that is, sets of images captured from diverse angles. In this setting the models frequently exhibit a failure mode known as multi-view hallucination. This article summarizes the findings of the research paper Revealing Multi-View Hallucination in Large Vision-Language Models (arXiv:2603.23934v1), which systematically investigates the issue.
Understanding Multi-View Hallucination
Multi-view hallucination occurs when an LVLM confuses visual information that originates from different instances or viewpoints. The model then ties a question to the wrong visual evidence, which reduces its accuracy and reliability in practical applications.
The MVH-Bench Benchmark
To measure multi-view hallucination, the researchers constructed a benchmark called MVH-Bench. The benchmark consists of:
- 4.8k question-answer pairs
- Two targeted types of hallucination: cross-instance and cross-view
By systematically analyzing the performance of various LVLMs on this benchmark, the researchers found that recent models often struggle to associate visual evidence with the correct instance or viewpoint.
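As an illustration of how results on such a benchmark might be broken down by hallucination type, here is a minimal Python sketch. The record layout (the `type`, `answer`, and `prediction` keys) and the toy records are assumptions made for this example; the source does not describe the actual MVH-Bench data format or scoring script.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return accuracy per hallucination type.

    Each record is a dict with hypothetical keys:
    "type" (e.g. "cross-instance"), "answer", and "prediction".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["type"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["type"]] += 1
    return {kind: correct[kind] / total[kind] for kind in total}


# Toy records, made up for illustration only (not benchmark data).
toy_records = [
    {"type": "cross-instance", "answer": "A", "prediction": "A"},
    {"type": "cross-instance", "answer": "B", "prediction": "C"},
    {"type": "cross-view", "answer": "left", "prediction": "left"},
    {"type": "cross-view", "answer": "right", "prediction": "right"},
]

scores = accuracy_by_type(toy_records)
print(scores)
```

Reporting the two types separately, rather than one pooled score, makes it visible when a model handles one kind of confusion but not the other.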
Proposed Solution: Reference Shift Contrastive Decoding (RSCD)
To address these limitations, the authors propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding method designed to suppress visual interference. RSCD generates negative logits through attention masking, which helps the model focus on the relevant visual context.
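The general idea of contrasting regular logits against negative logits can be sketched in a few lines of Python. This is not the authors' implementation: the combination rule `(1 + alpha) * original - alpha * negative` is a form commonly used in contrastive decoding and is assumed here, as are the function names, the `alpha` parameter, and the toy numbers. How RSCD actually builds its attention mask and negative pass is not specified in this summary, so the negative logits are simply given as input.

```python
import math


def contrastive_logits(original, negative, alpha=1.0):
    """Combine two logit vectors as (1 + alpha) * original - alpha * negative.

    Assumed, generic contrastive-decoding rule; not taken from the paper.
    """
    if len(original) != len(negative):
        raise ValueError("logit vectors must have the same length")
    return [(1.0 + alpha) * o - alpha * n for o, n in zip(original, negative)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_token(vocab, logits):
    """Return the vocabulary entry with the highest logit."""
    return vocab[max(range(len(vocab)), key=logits.__getitem__)]


# Toy example with made-up numbers.
vocab = ["red", "blue", "green"]
original = [2.0, 2.2, 0.1]  # regular forward pass: "blue" narrowly wins
negative = [0.5, 2.5, 0.1]  # stand-in for a masked pass dominated by interference

adjusted = contrastive_logits(original, negative, alpha=1.0)
before = argmax_token(vocab, original)
after = argmax_token(vocab, adjusted)
probs = softmax(adjusted)
print(before, "->", after)
```

In this toy case the token favored mainly by the interfering pass is penalized, so the prediction changes. Larger values of the assumed `alpha` apply a stronger penalty.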
Experimental Results
RSCD was evaluated on MVH-Bench with two prominent LVLMs, Qwen2.5-VL and LLaVA-OneVision. The reported gains are:
- Up to 21.1 points improvement over existing hallucination mitigation methods with Qwen2.5-VL
- Up to 34.6 points improvement with LLaVA-OneVision
These findings suggest that RSCD can improve the performance of LVLMs in multi-view scenarios by reducing multi-view hallucination.
Conclusion
The research highlights an open problem in the development of large vision-language models: keeping visual evidence correctly attributed across different instances and viewpoints. As the field continues to evolve, approaches like Reference Shift Contrastive Decoding may pave the way for more robust and reliable LVLMs in multi-view applications.
