Scaffold Effect: Prompt Framing Impacts Clinical VLM Performance

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Source: arXiv:2603.28387v1

Abstract: Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal.

Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect.

Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

Introduction

Vision-language models (VLMs) are increasingly proposed for clinical use, which makes it important to know whether their reported gains come from the evidence they are given or from the way the task is phrased. This study tests that question in a setting where the imaging data cannot legitimately improve individual predictions.

Methodology

We evaluated 12 open-weight VLMs on binary classification using two clinical neuroimaging cohorts:

  • FOR2107: Focused on affective disorders.
  • OASIS-3: Concentrated on cognitive decline.

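Performance was measured by F1 score, with and without neuroimaging context in the prompt. Both cohorts include structural MRI data that carries no reliable individual-level diagnostic signal, so any gain from adding that context cannot come from genuine use of the images. The analysis was designed to separate the effect of prompt framing from the effect of the data itself.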
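The comparison described above can be sketched as a small script that scores the same labels under several prompt conditions. This is a minimal illustration, not the study's evaluation code: the condition names (text_only, mri_mention_only, mri_attached), the helper functions, and the toy labels and predictions are all assumptions made for the example.

```python
from typing import Dict, List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_condition(
    y_true: List[int], preds_by_condition: Dict[str, List[int]]
) -> Dict[str, float]:
    """Score every prompt condition against the same ground-truth labels."""
    return {
        name: f1_score(y_true, preds)
        for name, preds in preds_by_condition.items()
    }


# Toy data invented for illustration; not taken from FOR2107 or OASIS-3.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "text_only":        [1, 0, 0, 0, 0, 1, 0, 0],  # no imaging context
    "mri_mention_only": [1, 0, 1, 1, 0, 1, 0, 0],  # prompt mentions MRI, no image
    "mri_attached":     [1, 0, 1, 1, 0, 1, 1, 0],  # prompt mentions MRI, image given
}

scores = f1_by_condition(labels, predictions)
for condition, score in scores.items():
    print(f"{condition}: F1 = {score:.3f}")
```

Holding the labels fixed and varying only the prompt is what lets a change in F1 be attributed to framing. In this toy data most of the improvement already appears in the mention-only condition, which is the pattern the study reports.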

Findings

Smaller VLMs gained up to 58% F1 when neuroimaging context was introduced, and distilled models became competitive with counterparts an order of magnitude larger. Two observations show that these gains are not what they appear to be:

  • Scaffold Effect: A contrastive confidence analysis showed that merely mentioning MRI availability in the task prompt accounts for 70-80% of the shift, whether or not imaging data is actually present.
  • Fabrication of Justifications: Expert evaluation found that models fabricated neuroimaging-grounded justifications across all conditions.

Moreover, preference alignment eliminated MRI-referencing behavior but collapsed both conditions toward the random baseline, highlighting how fragile these models’ apparent capabilities are.

Conclusion

Our findings underscore the importance of scrutinizing the evaluation metrics used for VLMs in clinical applications. The observed scaffold effect raises concerns about the reliability of surface-level performance indicators, necessitating a more nuanced approach to model evaluation that genuinely reflects multimodal reasoning capabilities.

As clinical AI continues to evolve, understanding and addressing these underlying issues will be essential for developing trustworthy systems that can genuinely assist in healthcare settings.


Lazarus Omolua