Trustworthy Medical VQA: Auditing Vision-Language Models

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Vision-language models (VLMs) show significant promise for medical visual question answering (VQA). Deploying them in clinical settings, however, requires knowing how reliable they are under realistic failure conditions. A new preprint, arXiv:2604.27720v1, addresses this gap by auditing five leading VLMs on medical VQA.

Key Findings from the Audit

The study focuses on five frontier and grounding-aware VLMs: Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, and Qwen 2.5 VL. The evaluation revolves around two primary axes of trust: perception and pipeline integration.

  • Perception: All models struggle to localize anatomical and pathological targets accurately. The best-performing model reaches a mean Intersection over Union (IoU) of only 0.23 and 19.1% accuracy at an IoU threshold of 0.5. The models also show dangerous levels of laterality confusion (mixing up left and right), which can have severe implications in clinical settings.
  • Pipeline Integration: A self-grounding pipeline, in which the model first localizes the target and then answers the question, degrades VQA accuracy for every model. The degradation stems from both inaccurate localization and format-compliance failures during two-step prompting. Parse failures reach 70% to 99% for Gemini and GPT-5 on VQA-RAD.
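
To make the perception numbers concrete, here is a minimal Python sketch of how mean IoU and accuracy at a 0.5 threshold are typically computed for bounding boxes. The box format (x_min, y_min, x_max, y_max), the function names, and the half-image laterality check are assumptions for illustration; the preprint's exact evaluation code and its definition of laterality confusion may differ.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_scores(
    preds: List[Box], gts: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mean IoU, fraction of predictions with IoU >= threshold)."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and equal in length")
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    accuracy = sum(v >= threshold for v in ious) / len(ious)
    return mean_iou, accuracy


def opposite_halves(pred: Box, gt: Box, image_width: float) -> bool:
    """Hypothetical laterality check: True if the two box centers fall on
    opposite sides of the image's vertical midline."""
    mid = image_width / 2
    pred_cx = (pred[0] + pred[2]) / 2
    gt_cx = (gt[0] + gt[2]) / 2
    return (pred_cx < mid) != (gt_cx < mid)
```

Under this definition, a mean IoU of 0.23 means the average predicted box overlaps its target only slightly, and 19.1% accuracy at 0.5 means roughly one prediction in five overlaps the target by at least half.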

When the predicted bounding boxes are replaced with ground-truth annotations, VQA accuracy improves significantly. This indicates that the problem lies primarily in perception, that is, in the quality of the predicted boxes, rather than in the model's ability to decompose the task.
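
The two failure sources in the pipeline, format compliance and localization quality, can be separated with a strict parser and an oracle switch. The sketch below is illustrative only: the reply format "[x1, y1, x2, y2]", the regular expression, and every function name are assumptions, not the prompts or parsing rules used in the study.

```python
import re
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]

# Assumed reply format for the localization step: "[x1, y1, x2, y2]".
_NUM = r"(-?\d+(?:\.\d+)?)"
_BOX_RE = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def parse_box(reply: str) -> Optional[Box]:
    """Extract the first well-formed box from a model reply, or None."""
    match = _BOX_RE.search(reply)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(v) for v in match.groups())
    if x2 <= x1 or y2 <= y1:  # degenerate or inverted box counts as a failure
        return None
    return (x1, y1, x2, y2)


def parse_failure_rate(replies: List[str]) -> float:
    """Fraction of localization replies that yield no usable box."""
    if not replies:
        raise ValueError("replies must be non-empty")
    failures = sum(parse_box(r) is None for r in replies)
    return failures / len(replies)


def box_for_answer_step(reply: str, gt_box: Box, use_oracle: bool) -> Optional[Box]:
    """Pick the box passed to the answering step.

    With use_oracle=True the ground-truth box replaces the model's own
    prediction, which isolates answer quality from localization quality.
    """
    return gt_box if use_oracle else parse_box(reply)
```

Comparing VQA accuracy with `use_oracle=False` and `use_oracle=True` on the same questions is one way to reproduce the kind of ablation described above: if accuracy recovers with oracle boxes, the predicted boxes are the weak link.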

Identifying Trustworthiness Bottlenecks

The findings emphasize grounding quality as a crucial bottleneck in the trustworthiness of VLMs in medical applications. The study highlights that while these models exhibit advanced capabilities in other domains, their performance in specialized medical contexts requires further scrutiny and improvement.

Future Directions: Domain Adaptation

As a follow-up to the audit, the researchers applied supervised fine-tuning to Qwen 2.5 VL on a combined set of medical VQA training data. The fine-tuned model reached 85.5% open-ended recall on SLAKE, the highest reported among comparable methods. The results suggest that the gap in VQA performance may be addressable through domain adaptation.
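
The article does not define open-ended recall. A common convention in medical VQA evaluation is token-level recall: the share of reference-answer tokens that appear in the model's free-form answer. The sketch below assumes that convention, with a simple lowercase alphanumeric tokenizer chosen for illustration; the preprint's exact scoring may differ.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def open_ended_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also occur in the prediction."""
    ref_tokens = tokenize(reference)
    if not ref_tokens:
        return 0.0
    pred_tokens = set(tokenize(prediction))
    hits = sum(tok in pred_tokens for tok in ref_tokens)
    return hits / len(ref_tokens)
```

Under this assumed definition, a verbose answer that mentions every reference token scores 1.0, so recall rewards coverage and does not penalize extra or incorrect words.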

However, the study leaves open the question of whether domain adaptation can also mitigate the perception and trustworthiness challenges identified in the audit. Future research will need to explore these avenues further to ensure that VLMs can be safely and effectively deployed in clinical environments.

Conclusion

The audit of frontier VLMs underscores the necessity of rigorous evaluation and improvement of these technologies before their integration into medical practice. As the field advances, ensuring the reliability and trustworthiness of VLMs will be paramount for their successful application in healthcare settings.

Lazarus Omolua (https://richlyai.com/blog)