Don’t Blink: Evidence Collapse during Multimodal Reasoning
arXiv:2604.04207v1
A recent study of reasoning Vision Language Models (VLMs) identifies a failure mode called evidence collapse: the models can become more accurate while progressively losing visual grounding as they reason. This creates task-conditional danger zones in which low-entropy predictions are confident but visually ungrounded, a failure mode that text-only monitoring cannot detect.
Key Findings
The study evaluates three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro and finds evidence collapse to be pervasive. The main results are:
- Attention to annotated evidence regions significantly diminishes as reasoning progresses.
- Models often lose more than half of their evidence mass over the course of reasoning.
- Full-response entropy is the most reliable text-only uncertainty signal, especially under cross-dataset transfer.
- Adding vision features through a single global linear rule is brittle and often degrades transfer performance.
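The two quantities named in these findings, evidence mass and full-response entropy, can be illustrated with a minimal sketch. This summary does not give the study's exact definitions, so the code assumes that evidence mass is the fraction of image-token attention falling on annotated evidence tokens and that full-response entropy is the mean Shannon entropy over all generated tokens. The function names and the toy numbers are hypothetical.

```python
import math

def evidence_mass(attention, evidence_mask):
    """Fraction of image-token attention that lands on annotated evidence tokens.

    Assumed definition: attention on masked tokens divided by total attention.
    """
    total = sum(attention)
    if total == 0:
        return 0.0
    on_evidence = sum(a for a, m in zip(attention, evidence_mask) if m)
    return on_evidence / total

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def full_response_entropy(step_distributions):
    """Mean token entropy over every generated token (assumed aggregation)."""
    if not step_distributions:
        return 0.0
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# Toy attention over four image patches; the first two are annotated evidence.
mask = [True, True, False, False]
early = [0.4, 0.3, 0.2, 0.1]   # early reasoning step
late = [0.1, 0.1, 0.5, 0.3]    # late reasoning step

early_mass = evidence_mass(early, mask)
late_mass = evidence_mass(late, mask)
# Relative loss of evidence mass between the two steps.
relative_drop = (early_mass - late_mass) / early_mass
```

In this toy case the relative drop exceeds one half, which is the kind of loss the second finding describes. With a real model, the attention vectors and token distributions would come from the model's own outputs.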
Task-Conditional Regime
Further analysis of the interaction between entropy and visual engagement uncovers a task-conditional regime:
- Low-entropy, visually disengaged predictions are hazardous in tasks that require sustained visual reference.
- The same disengagement appears benign in symbolic tasks that do not rely heavily on visual grounding.
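As a rough illustration of this regime, the check below flags a prediction only when all three conditions hold: the task needs sustained visual reference, the prediction is confident, and the model has visually disengaged. It is a sketch under stated assumptions, not the study's model: the function name, the `needs_vision` flag, and both threshold values are hypothetical.

```python
def in_danger_zone(entropy, engagement, needs_vision,
                   entropy_ceiling=0.3, engagement_floor=0.2):
    """Return True for confident, visually disengaged predictions on visual tasks.

    entropy:       text-only uncertainty of the response (lower is more confident)
    engagement:    visual engagement score in [0, 1], e.g. evidence mass
    needs_vision:  whether the task requires sustained visual reference
    Thresholds are placeholder values chosen for illustration only.
    """
    confident = entropy < entropy_ceiling
    disengaged = engagement < engagement_floor
    return needs_vision and confident and disengaged

# Same entropy and engagement, different task type.
visual_case = in_danger_zone(0.1, 0.05, needs_vision=True)
symbolic_case = in_danger_zone(0.1, 0.05, needs_vision=False)
```

The rule is a conjunction conditioned on task type rather than a single weighted sum of entropy and engagement, which reflects the finding that one global linear rule transfers poorly.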
Mitigation Strategies
Guided by the entropy-vision interaction model, the study proposes a targeted vision veto. It reports that this veto:
- Reduces selective risk by up to 1.9 percentage points at 90% coverage.
- Avoids degrading performance on tasks where visual disengagement is expected.
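Selective risk at a fixed coverage can be sketched as the error rate among the most confident fraction of predictions. The example below compares an entropy-only ranking with a veto that demotes visually disengaged predictions on visual tasks. The record fields, the penalty form of the veto, and the toy data are all assumptions made for illustration; the numbers it produces are unrelated to the results reported in the study.

```python
def selective_risk(records, coverage, score):
    """Error rate among the `coverage` fraction of records with the highest score."""
    keep = max(1, round(coverage * len(records)))
    kept = sorted(records, key=score, reverse=True)[:keep]
    return sum(not r["correct"] for r in kept) / len(kept)

def entropy_score(r):
    # Text-only baseline: lower entropy means higher confidence.
    return -r["entropy"]

def veto_score(r, engagement_floor=0.2, penalty=10.0):
    # Hypothetical veto: push disengaged predictions to the bottom of the
    # ranking, but only on tasks that need sustained visual reference.
    score = -r["entropy"]
    if r["needs_vision"] and r["engagement"] < engagement_floor:
        score -= penalty
    return score

records = [
    # Confident but ungrounded answer on a visual task (wrong).
    {"entropy": 0.05, "engagement": 0.05, "needs_vision": True, "correct": False},
    # Confident, disengaged answer on a symbolic task (correct, benign).
    {"entropy": 0.06, "engagement": 0.05, "needs_vision": False, "correct": True},
    # Least confident answer, grounded and correct.
    {"entropy": 0.90, "engagement": 0.60, "needs_vision": True, "correct": True},
]
# Seven grounded, correct answers with moderate entropy.
records += [
    {"entropy": 0.2 + 0.1 * i, "engagement": 0.6, "needs_vision": True, "correct": True}
    for i in range(7)
]

baseline_risk = selective_risk(records, 0.9, entropy_score)
veto_risk = selective_risk(records, 0.9, veto_score)
```

On this toy data the entropy-only ranking keeps the confident, ungrounded error and abstains on a correct answer, while the veto abstains on the error instead. The symbolic-task record is untouched by the veto, matching the goal of not penalizing disengagement where it is expected.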
Conclusion
These findings underscore the need for task-aware multimodal monitoring to deploy reasoning VLMs safely, particularly under distribution shift. Because text-only monitoring cannot detect evidence collapse, monitoring frameworks need to account for how much visual grounding each task requires.
