Scaffold Effect: Prompt Framing Impacts Clinical VLM Performance

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Source: arXiv:2603.28387v1

Abstract: Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal.

Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect.

Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

Introduction

Vision-language models (VLMs) are increasingly proposed for clinical use, which makes it important to know whether a reported performance gain reflects genuine use of the imaging evidence or only the way the task was framed.

Methodology

We evaluated 12 open-weight VLMs on binary classification across two clinical neuroimaging cohorts:

FOR2107: Focused on affective disorders.
OASIS-3: Concentrated on cognitive decline.

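Performance was measured by F1 score, with and without neuroimaging context in the prompt. Because the structural MRI data in both cohorts carries no reliable individual-level diagnostic signal, a gain from adding that context cannot come from the images themselves. This lets the analysis separate the effect of prompt framing from the content of the data.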
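A minimal sketch of this kind of comparison, assuming per-patient binary labels and predictions are available for each prompt condition. The function name f1_binary, the toy labels, and the two condition names are hypothetical and not taken from the paper. The source does not say whether the reported gains are absolute or relative F1 differences, so the sketch reports the absolute difference.

```python
from typing import Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1) in a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # With no true or predicted positives F1 is undefined; return 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


# Toy data, invented for illustration only (not results from the study).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
pred_text_only = [0, 0, 1, 0, 0, 1, 0, 0]         # prompt without neuroimaging context
pred_with_mri_context = [1, 0, 1, 1, 0, 1, 0, 0]  # prompt with neuroimaging context

f1_base = f1_binary(y_true, pred_text_only)
f1_ctx = f1_binary(y_true, pred_with_mri_context)
gain = f1_ctx - f1_base  # absolute F1 difference between the two conditions

print(f"text only: {f1_base:.3f}  with context: {f1_ctx:.3f}  gain: {gain:.3f}")
```

Keeping the labels fixed and varying only the prompt condition is what makes the difference attributable to the prompt rather than to the sample.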

Findings

Smaller VLMs gained up to 58% F1 once neuroimaging context was introduced, and distilled models became competitive with counterparts an order of magnitude larger. Two observations show why these gains are only apparent:

Scaffold Effect: A contrastive confidence analysis shows that merely mentioning MRI availability in the task prompt accounts for 70-80% of the shift, whether or not imaging data is actually present.
Fabrication of Justifications: Expert evaluation found fabricated neuroimaging-grounded justifications across all conditions.
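One way to read the 70-80% figure is as a ratio of confidence shifts across three prompt conditions. The sketch below assumes that reading; the source does not spell out how the contrastive confidence analysis is computed, so the function scaffold_share, its three inputs, and the example numbers are all hypothetical.

```python
def scaffold_share(conf_baseline: float,
                   conf_mention_only: float,
                   conf_full: float) -> float:
    """Fraction of the total confidence shift reproduced by the prompt mention alone.

    conf_baseline:     mean confidence with no MRI mention and no image
    conf_mention_only: mean confidence when the prompt mentions MRI but no image is given
    conf_full:         mean confidence when the prompt mentions MRI and the image is given
    (All three condition definitions are assumptions made for this sketch.)
    """
    total_shift = conf_full - conf_baseline
    if total_shift == 0:
        raise ValueError("no shift between baseline and full condition")
    return (conf_mention_only - conf_baseline) / total_shift


# Illustrative numbers only, not values reported in the paper.
share = scaffold_share(conf_baseline=0.50, conf_mention_only=0.65, conf_full=0.70)
print(f"share of shift explained by the mention alone: {share:.0%}")
```

A share close to 1 would mean the image adds almost nothing beyond what the wording of the prompt already produced.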

Moreover, preference alignment eliminated MRI-referencing behavior but collapsed both conditions toward the random baseline, highlighting the fragility of these models’ perceived capabilities.
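To judge whether a score sits at random level, it helps to know what F1 a label-independent guesser would obtain. The sketch below derives this from expected confusion counts: with positive-class prevalence p and a guesser that predicts positive with probability q, the expected counts give F1 = 2pq / (p + q). The function name chance_f1 and the example prevalences are hypothetical; the source does not state the class balance of either cohort.

```python
def chance_f1(prevalence: float, positive_rate: float) -> float:
    """F1 from expected counts for a guesser that ignores the input.

    prevalence:    fraction of cases that are truly positive (p)
    positive_rate: probability the guesser predicts positive (q)
    Expected TP ~ p*q, FP ~ (1-p)*q, FN ~ p*(1-q), so F1 = 2pq / (p + q).
    """
    if not (0.0 <= prevalence <= 1.0 and 0.0 <= positive_rate <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    if prevalence + positive_rate == 0:
        return 0.0
    return 2 * prevalence * positive_rate / (prevalence + positive_rate)


# Illustrative class balances only.
balanced = chance_f1(prevalence=0.5, positive_rate=0.5)
imbalanced = chance_f1(prevalence=0.2, positive_rate=0.5)
print(f"balanced: {balanced:.3f}  imbalanced: {imbalanced:.3f}")
```

Because the chance-level F1 depends on class balance, a model's F1 is only interpretable next to the baseline computed for the same cohort.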

Conclusion

Our findings show that surface-level metrics are inadequate indicators of multimodal reasoning in clinical VLMs. The scaffold effect means a score can rise because of prompt framing rather than imaging evidence, so evaluations should check whether a gain persists when the image is absent or uninformative.

As clinical AI continues to evolve, understanding and addressing these underlying issues will be essential for developing trustworthy systems that can genuinely assist in healthcare settings.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
