Benchmarking Deflection and Hallucination in Large Vision-Language Models
Summary: arXiv:2604.12033v1 (cross-listed)
Abstract: Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook two things: conflicts between visual and textual evidence, and the importance of generating deflections (e.g., "Sorry, I cannot answer…") when retrieved knowledge is incomplete. These benchmarks also become obsolete quickly, because growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions.
Contributions Overview
- Dynamic Data Curation Pipeline: We propose a method that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples.
- VLM-DeflectionBench: This new benchmark includes 2,775 samples spanning diverse multimodal retrieval settings, specifically designed to examine model behavior under conflicting or insufficient evidence.
- Fine-Grained Evaluation Protocol: The protocol defines four scenarios that disentangle parametric memorization from retrieval robustness.
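The filtering idea behind the curation pipeline can be illustrated with a minimal sketch. It assumes a sample counts as retrieval-dependent when no closed-book model (a model queried without retrieved evidence) answers it correctly. The names `Sample`, `is_correct`, `filter_retrieval_dependent`, and `max_solved`, as well as the substring matching rule, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    question: str
    answer: str


def is_correct(prediction: str, reference: str) -> bool:
    # Hypothetical matching rule: case-insensitive substring match.
    return reference.strip().lower() in prediction.strip().lower()


def filter_retrieval_dependent(
    samples: List[Sample],
    closed_book_models: List[Callable[[str], str]],
    max_solved: int = 0,
) -> List[Sample]:
    """Keep samples that at most `max_solved` closed-book models answer correctly.

    Each model is a callable that maps a question to an answer string
    without access to any retrieved evidence (an assumption of this sketch).
    """
    kept = []
    for sample in samples:
        solved = sum(
            is_correct(model(sample.question), sample.answer)
            for model in closed_book_models
        )
        if solved <= max_solved:
            kept.append(sample)
    return kept
```

Re-running such a filter with newer models would drop questions that have become answerable from parametric memory, which is one way a benchmark could keep its difficulty over time.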
Methodology and Findings
Experiments across 20 state-of-the-art LVLMs show that models generally fail to deflect appropriately when the evidence is noisy or misleading. This exposes a gap in current evaluation practice: benchmarks need to assess not only what models know but also how they respond when knowledge is missing or unclear.
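To make the distinction between deflection and hallucination concrete, here is a rough scoring sketch. It is not the paper's protocol: the label names, the `DEFLECTION_MARKERS` phrases, the substring matching, and the `evidence_sufficient` flag are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical phrases treated as deflections; a real evaluation would
# likely need a more robust detector.
DEFLECTION_MARKERS = ("sorry, i cannot answer", "i cannot answer", "i don't know")


def classify(response: str, reference: str, evidence_sufficient: bool) -> str:
    """Label one response given whether the retrieved evidence was sufficient."""
    text = response.strip().lower()
    if any(marker in text for marker in DEFLECTION_MARKERS):
        # Deflecting is only appropriate when the evidence cannot support an answer.
        return "unnecessary_deflection" if evidence_sufficient else "appropriate_deflection"
    if reference.strip().lower() in text:
        return "correct"
    # A confident answer that does not match the reference.
    return "hallucination"


def summarize(labels: List[str]) -> Dict[str, float]:
    """Return the fraction of responses carrying each label."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}
```

Reporting the rate of each label separately, rather than a single accuracy number, shows whether a model fails by answering wrongly or by refusing when it should answer.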
Need for Improved Evaluation Metrics
These results argue for benchmarks that measure reliability under realistic retrieval conditions in addition to accuracy. As LVLM training sets grow, more questions become answerable without retrieval, so evaluation sets must be kept retrieval-dependent to remain informative.
Public Availability of Resources
All resources related to this study will be made publicly available upon publication, to support open and reproducible research on model robustness.
Conclusion
This work advances the assessment of LVLMs in retrieval-augmented multimodal settings. By evaluating deflection and hallucination together, it gives a fuller picture of how models behave under conflicting or insufficient evidence.
