The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
arXiv:2603.28387v1
Abstract: Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal.
Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect.
Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.
Introduction
Vision-language models (VLMs) are increasingly being evaluated for clinical use. That makes it important to know whether a reported performance gain reflects genuine integration of imaging evidence or a surface-level artifact of how the task is prompted.
Methodology
We evaluated 12 open-weight VLMs on binary classification across two clinical neuroimaging cohorts:
- FOR2107: Focused on affective disorders.
- OASIS-3: Concentrated on cognitive decline.
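Each model’s performance was measured by F1 score, with and without neuroimaging context in the prompt. Both cohorts include structural MRI data that carries no reliable individual-level diagnostic signal, so a gain from adding that context cannot be attributed to genuine evidence integration. This design separates the effect of prompt framing from the effect of the imaging data itself.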
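The per-condition comparison can be sketched in a few lines of Python. This is a minimal illustration, not the study's evaluation code: the condition names `text_only` and `with_mri_context`, the helper functions, and the toy labels and predictions are all assumptions made up for the example.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Convention: return 0.0 when there are no positives and no positive predictions.
    return 2 * tp / denom if denom else 0.0


def f1_by_condition(y_true, preds_by_condition):
    """Score the same ground truth under each prompt condition."""
    return {name: f1_score(y_true, preds)
            for name, preds in preds_by_condition.items()}


# Toy data (hypothetical): one model, eight subjects, two prompt conditions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
preds_by_condition = {
    "text_only":        [0, 0, 1, 0, 0, 1, 0, 0],
    "with_mri_context": [1, 1, 1, 1, 0, 1, 1, 0],
}

scores = f1_by_condition(y_true, preds_by_condition)
# Absolute F1 difference between the two conditions.
gain = scores["with_mri_context"] - scores["text_only"]

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
print(f"absolute F1 gain = {gain:.3f}")
```

The sketch reports the gain as an absolute difference in F1. Whether a published figure is an absolute difference or a relative change has to be read from the source.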
Findings
Smaller VLMs showed gains of up to 58% F1 once neuroimaging context was introduced, and distilled models became competitive with counterparts an order of magnitude larger. Two observations qualify these gains:
- Scaffold Effect: A contrastive confidence analysis showed that merely mentioning MRI availability in the task prompt accounts for 70-80% of the observed shift, independent of whether imaging data is actually present.
- Fabrication of Justifications: Expert evaluation found fabricated neuroimaging-grounded justifications across all conditions.
Moreover, preference alignment eliminated MRI-referencing behavior but collapsed both conditions toward the random baseline, which highlights how fragile the models’ apparent capabilities are.
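One way to picture the scaffold effect is to ask what share of a confidence shift is already explained by mentioning MRI in the prompt. The sketch below is an assumption about how such a share could be computed, not the contrastive confidence analysis used in the study. It assumes three hypothetical conditions per subject (no MRI mention, MRI mentioned without an image, MRI mentioned with an image), and the confidence values are invented toy numbers.

```python
from statistics import mean


def mean_shift(baseline, treatment):
    """Average per-subject change in model confidence between two conditions."""
    return mean(t - b for b, t in zip(baseline, treatment))


def scaffold_share(conf_text_only, conf_mention_only, conf_mention_plus_image):
    """Fraction of the total confidence shift reproduced by the mention alone.

    total shift   = (mention + image) - text only
    mention shift = (mention, no image) - text only
    """
    total = mean_shift(conf_text_only, conf_mention_plus_image)
    mention = mean_shift(conf_text_only, conf_mention_only)
    if total == 0:
        raise ValueError("no overall shift to decompose")
    return mention / total


# Toy confidences for the positive class (hypothetical, four subjects).
conf_text_only          = [0.50, 0.40, 0.60, 0.50]
conf_mention_only       = [0.65, 0.55, 0.75, 0.65]  # prompt mentions MRI, no image attached
conf_mention_plus_image = [0.70, 0.60, 0.80, 0.70]  # prompt mentions MRI, image attached

share = scaffold_share(conf_text_only, conf_mention_only, conf_mention_plus_image)
print(f"share of shift explained by the mention alone: {share:.0%}")
```

A share close to 1 would mean the attached image adds little beyond the wording of the prompt, which is the pattern the scaffold effect describes.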
Conclusion
Our findings show that surface evaluations are inadequate indicators of multimodal reasoning. A headline F1 gain can arise from prompt framing alone, so evaluations of clinical VLMs need controls that test whether the model actually uses the imaging data.
As clinical AI continues to evolve, understanding and addressing these underlying issues will be essential for developing trustworthy systems that can genuinely assist in healthcare settings.
