Benchmarking Deflection & Hallucination in Vision-Language AI

Benchmarking Deflection and Hallucination in Large Vision-Language Models

Source: arXiv:2604.12033v1 (announce type: cross)

Abstract: Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., "Sorry, I cannot answer…") when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions.

Contributions Overview

Dynamic Data Curation Pipeline: We propose a method that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples.
VLM-DeflectionBench: We introduce a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to examine model behavior under conflicting or insufficient evidence.
Fine-Grained Evaluation Protocol: We define four evaluation scenarios that disentangle parametric memorization from retrieval robustness.
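The abstract does not describe how the curation pipeline decides that a sample is retrieval-dependent. As a rough illustration only, one plausible reading is to drop any question that a model already answers correctly without retrieved context. The sketch below assumes that reading; the names Sample, normalize, and filter_retrieval_dependent are hypothetical, the substring match is a crude stand-in for a real answer checker, and image inputs are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical benchmark item (the image input is omitted in this sketch)."""
    question: str
    gold_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a crude string comparison."""
    return " ".join(text.lower().split())


def filter_retrieval_dependent(
    samples: Iterable[Sample],
    closed_book_models: List[Callable[[str], str]],
) -> List[Sample]:
    """Keep samples that no model answers correctly WITHOUT retrieval.

    Assumption: a sample is "retrieval-dependent" if the gold answer does not
    appear in any closed-book response. This is not the paper's stated rule.
    """
    kept = []
    for sample in samples:
        gold = normalize(sample.gold_answer)
        answered = any(
            gold in normalize(model(sample.question))
            for model in closed_book_models
        )
        if not answered:
            kept.append(sample)
    return kept


if __name__ == "__main__":
    # Stub standing in for an LVLM queried without retrieved context.
    def stub_model(question: str) -> str:
        return "Paris" if "capital of France" in question else "I am not sure."

    pool = [
        Sample("What is the capital of France?", "Paris"),
        Sample("Who designed the building in this photo?", "Jane Doe"),
    ]
    for sample in filter_retrieval_dependent(pool, [stub_model]):
        print(sample.question)
```

Re-running such a filter against newer models is one way a benchmark could stay difficult as training sets grow, since questions that models have since memorized would drop out.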

Methodology and Findings

Experiments across 20 state-of-the-art LVLMs reveal that models generally fail to deflect when faced with noisy or misleading evidence. This exposes a gap in current evaluation practice: benchmarks need to assess not only what models know but also how they respond when knowledge is missing or unclear.
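To make the distinction between deflection and hallucination concrete, the following sketch labels a single model response. It is an assumption-laden illustration, not the paper's evaluation protocol: the marker phrases, the four label names, and the functions is_deflection, classify, and summarize are all hypothetical, and a real evaluation would likely use a stronger judge than substring matching.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical phrases treated as signs that the model declined to answer.
DEFLECTION_MARKERS = (
    "i cannot answer",
    "i can't answer",
    "i don't know",
    "not enough information",
)


def is_deflection(response: str) -> bool:
    """Return True if the response contains any deflection marker."""
    text = response.lower()
    return any(marker in text for marker in DEFLECTION_MARKERS)


def classify(response: str, gold_answer: str, evidence_sufficient: bool) -> str:
    """Assign one of four illustrative labels to a response.

    evidence_sufficient says whether the retrieved context actually
    supports the gold answer for this sample.
    """
    if is_deflection(response):
        # Declining is only appropriate when the evidence cannot support an answer.
        return "over_deflection" if evidence_sufficient else "appropriate_deflection"
    if gold_answer.lower() in response.lower():
        return "correct"
    # A confident answer that does not match the gold answer.
    return "hallucination"


def summarize(records: Iterable[Tuple[str, str, bool]]) -> Counter:
    """Count labels over (response, gold_answer, evidence_sufficient) triples."""
    return Counter(classify(*record) for record in records)


if __name__ == "__main__":
    demo = [
        ("Sorry, I cannot answer that.", "Paris", False),
        ("It is Rome.", "Paris", False),
        ("It is Paris.", "Paris", True),
    ]
    print(summarize(demo))
```

Under this labeling, the failure mode reported above corresponds to a high count of "hallucination" on samples whose evidence is insufficient or misleading.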

Need for Improved Evaluation Metrics

These results underscore the need for benchmarks that test not only accuracy but also how reliably models behave when evidence is noisy, conflicting, or incomplete. As LVLM training sets grow, models can answer more benchmark questions without retrieval, so evaluation must keep targeting genuinely retrieval-dependent questions to remain informative.

Public Availability of Resources

In line with our commitment to fostering open research, all resources related to this study will be made publicly available upon publication. We believe that transparency and accessibility are crucial for advancing the field of AI and improving model robustness.

Conclusion

Our research advances the assessment of LVLMs in retrieval-augmented multimodal settings. By evaluating deflection and hallucination together, we aim to give a fuller picture of how models behave when evidence is conflicting or insufficient.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
