Fixing Multi-View Hallucination in Vision-Language Models

Revealing Multi-View Hallucination in Large Vision-Language Models

Large vision-language models (LVLMs) are increasingly used in applications that process multi-view image inputs captured from diverse angles. In this setting they frequently run into a failure mode known as multi-view hallucination. This article summarizes the findings of the research paper Revealing Multi-View Hallucination in Large Vision-Language Models (arXiv:2603.23934v1), which investigates the issue systematically.

Understanding Multi-View Hallucination

Multi-view hallucination occurs when LVLMs confuse visual information that originates from different instances or viewpoints. This misalignment can lead to incorrect associations between visual evidence and the corresponding questions, ultimately diminishing the performance and reliability of these models in practical applications.

The MVH-Bench Benchmark

To study multi-view hallucination systematically, the researchers constructed a benchmark called MVH-Bench. It comprises:

4.8k question-answer pairs
Two targeted types of hallucination: cross-instance and cross-view

By systematically analyzing the performance of various LVLMs on this benchmark, the researchers discovered that recent models often struggled to accurately correlate visual evidence with its respective instance or viewpoint.
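
To make the evaluation idea concrete, here is a minimal sketch of how accuracy could be reported separately for the two hallucination types. It is not the MVH-Bench evaluation code: the `QAItem` fields, the `accuracy_by_kind` helper, the exact-match scoring, and the toy questions are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    """One benchmark entry (hypothetical schema, not the paper's format)."""
    question: str
    answer: str  # ground-truth answer, e.g. a multiple-choice letter
    kind: str    # "cross-instance" or "cross-view"


def accuracy_by_kind(items, predictions):
    """Return {kind: accuracy} using case-insensitive exact match."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.kind] += 1
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total}


# Toy data, invented for this example.
items = [
    QAItem("In the side view, is the door open?", "B", "cross-view"),
    QAItem("What colour is the second person's jacket?", "A", "cross-instance"),
    QAItem("In the top-down view, where is the cup?", "A", "cross-view"),
    QAItem("Which object is the first person holding?", "D", "cross-instance"),
]
predictions = ["B", "C", "a", "D"]  # stand-ins for model outputs

scores = accuracy_by_kind(items, predictions)
for kind, acc in sorted(scores.items()):
    print(f"{kind}: {acc:.2f}")
```

Splitting the score by type shows whether a model's errors come mainly from confusing instances or from confusing viewpoints, which a single aggregate accuracy would hide.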

Proposed Solution: Reference Shift Contrastive Decoding (RSCD)

To address these limitations, the authors propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding method designed to suppress visual interference. It generates negative logits through attention masking and uses them during decoding, which helps the model focus on the relevant visual context.
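
The general idea of contrastive decoding can be sketched as follows. This is not the authors' implementation. It assumes the widely used form in which the final scores are `(1 + alpha) * reference - alpha * negative`, and it treats both logit vectors as given. In RSCD the negative logits would come from a forward pass with an attention mask; how that mask is built is not covered here. The function names, the `alpha` parameter, and the toy numbers are all hypothetical.

```python
def contrastive_logits(reference_logits, negative_logits, alpha=1.0):
    """Combine two logit vectors in the common contrastive-decoding form.

    reference_logits: logits from the normal forward pass.
    negative_logits:  logits from a pass meant to expose the interference
                      (assumed here to come from an attention-masked pass).
    alpha:            contrast strength; alpha = 0 returns the reference logits.
    """
    if len(reference_logits) != len(negative_logits):
        raise ValueError("logit vectors must have the same length")
    return [
        (1 + alpha) * ref - alpha * neg
        for ref, neg in zip(reference_logits, negative_logits)
    ]


def greedy_token(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy three-word vocabulary with invented logits.
vocab = ["red", "blue", "green"]
reference = [2.0, 2.2, 0.1]  # normal pass slightly prefers "blue"
negative = [0.5, 2.5, 0.1]   # negative pass strongly prefers "blue"

plain_choice = vocab[greedy_token(reference)]
contrastive_choice = vocab[greedy_token(contrastive_logits(reference, negative))]
print(plain_choice, contrastive_choice)
```

In this toy case the token that the negative pass favours most is penalized, so the choice moves from "blue" to "red". Because the adjustment happens only on the output logits at decoding time, no model weights are updated, which is what makes this family of methods training-free.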

Experimental Results

RSCD was evaluated on MVH-Bench with two prominent LVLMs: Qwen2.5-VL and LLaVA-OneVision. Compared with existing hallucination mitigation methods, it improved performance by:

Up to 21.1 points with Qwen2.5-VL
Up to 34.6 points with LLaVA-OneVision

These findings underscore the potential of RSCD to enhance the performance of LVLMs in multi-view scenarios, effectively addressing the challenges posed by multi-view hallucination.

Conclusion

The research highlights a critical area in the development of large vision-language models, emphasizing the need for improved techniques to manage visual inconsistencies across different instances and viewpoints. As the field continues to evolve, approaches like Reference Shift Contrastive Decoding may pave the way for more robust and reliable LVLMs, ensuring their effectiveness in diverse applications.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
