DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
Vision-Language Models (VLMs) can often describe an image correctly and still fail to reason about it. The paper DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs examines this gap between what these models perceive and how they use that perception when reasoning.
Understanding the Perception-Integration Gap
The study describes a phenomenon it calls the perception-integration gap: a VLM can accurately identify visual elements, such as a molecular structure, yet fail to use that information in downstream reasoning. For example, when asked to describe a molecular diagram, a VLM may correctly identify it as “a benzene ring with an -OH group,” but then struggle with reasoning tasks about that same diagram. Existing benchmarks conflate perception with reasoning in their evaluations, so they often mask these integration failures.
Introducing DISSECT Benchmark
To systematically expose these failures, the authors introduce the DISSECT benchmark, which consists of 12,000 diagnostic questions split across two fields, Chemistry and Biology:
- Chemistry: 7,000 questions focused on molecular structures and chemical reasoning.
- Biology: 5,000 questions aimed at biological concepts and reasoning.
Evaluating VLMs through Diverse Input Modes
Each question within the DISSECT benchmark is evaluated under five distinct input modes:
- Vision+Text: Combining both visual and textual inputs.
- Text-Only: Relying solely on textual information.
- Vision-Only: Using only visual inputs without text.
- Human Oracle: Utilizing human expertise for accurate reasoning.
- Model Oracle: A novel mode in which the VLM first verbalizes the image and then reasons from its own description.
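The Model Oracle mode can be sketched as a two-stage call. This is a minimal illustration, not the authors' implementation: the `VLM` callable interface, the prompt wording in `DESCRIBE_PROMPT`, and both function names are assumptions made for the example. The only detail taken from the study is the order of operations, where the model first describes the image and then answers from that description.

```python
from typing import Callable, Optional

# Hypothetical model interface (an assumption for this sketch):
# takes a text prompt and an optional image, returns the model's text output.
VLM = Callable[[str, Optional[bytes]], str]

# Hypothetical verbalization prompt; the study's actual wording is not given here.
DESCRIBE_PROMPT = "Describe everything you see in this image in detail."


def vision_text_answer(model: VLM, image: bytes, question: str) -> str:
    """Vision+Text mode: the model sees the image and the question together."""
    return model(question, image)


def model_oracle_answer(model: VLM, image: bytes, question: str) -> str:
    """Model Oracle mode: verbalize the image first, then reason from the text alone."""
    # Stage 1: the model produces its own description of the image.
    description = model(DESCRIBE_PROMPT, image)
    # Stage 2: the same model answers from its description, with no image attached.
    prompt = f"Image description: {description}\n\nQuestion: {question}"
    return model(prompt, None)
```

Because both stages use the same model, any improvement over the Vision+Text answer cannot come from better perception. It points instead at how the model combines what it sees with its reasoning.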
Key Findings from the Evaluation
The evaluation of 18 VLMs yielded three main findings:
- Lower Language-Prior Exploitability in Chemistry: Chemistry questions were significantly less exploitable through language priors than Biology questions, indicating that molecular visual content is a more demanding test of genuine visual reasoning.
- Integration Bottleneck in Open-Source Models: Open-source models demonstrated higher performance when reasoning from their own verbalized descriptions rather than raw images, highlighting a systematic integration bottleneck in visual reasoning.
- Closed-Source Models: In contrast, closed-source models did not show this gap, suggesting that the ability to integrate perception into reasoning is a key differentiator between open-source and closed-source multimodal capabilities.
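The comparisons behind these findings can be expressed as simple accuracy differences between input modes. The sketch below is illustrative only. The mode keys, the function names, and both formulas are assumptions, and the paper may define its metrics differently. Here the integration gap is taken as Model Oracle accuracy minus Vision+Text accuracy, and Text-Only accuracy serves as a rough proxy for language-prior exploitability.

```python
from typing import Dict, List


def accuracy(correct: List[bool]) -> float:
    """Fraction of correct answers; 0.0 for an empty list."""
    return sum(correct) / len(correct) if correct else 0.0


def integration_gap(results: Dict[str, List[bool]]) -> float:
    """Assumed metric: Model Oracle accuracy minus Vision+Text accuracy.

    A positive value means the model does better reasoning from its own
    description than from the raw image, which is the bottleneck pattern
    reported for open-source models.
    """
    return accuracy(results["model_oracle"]) - accuracy(results["vision_text"])


def text_only_accuracy(results: Dict[str, List[bool]]) -> float:
    """Assumed proxy for language-prior exploitability: accuracy with no image."""
    return accuracy(results["text_only"])
```

With per-question correctness recorded for each mode, a model whose `integration_gap` is near zero would match the closed-source pattern, while a clearly positive value would match the open-source one.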
Conclusion
The Model Oracle protocol introduced in this study is both model-agnostic and benchmark-agnostic, so it can be applied post hoc to any VLM evaluation to diagnose integration failures.
