Trustworthy Medical VQA: Auditing Vision-Language Models

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Recent advancements in vision-language models (VLMs) hold significant promise for enhancing medical visual question answering (VQA). However, the deployment of these models in clinical settings necessitates an understanding of their reliability under realistic failure conditions. A new study, detailed in the preprint arXiv:2604.27720v1, addresses this critical gap by auditing five leading VLMs on their performance in medical VQA scenarios.

Key Findings from the Audit

The study focuses on five frontier and grounding-aware VLMs: Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, and Qwen 2.5 VL. The evaluation revolves around two primary axes of trust: perception and pipeline integration.

Perception: The audit reveals that all models struggle to localize anatomical and pathological targets accurately. The best-performing model achieves a mean Intersection over Union (IoU) of only 0.23 and 19.1% accuracy at an IoU threshold of 0.5. The models also exhibit dangerous levels of laterality confusion, that is, mixing up the left and right sides of the body, which can have severe implications in clinical settings.
Pipeline Integration: A self-grounding pipeline, in which the model first localizes the target and then answers the question, degrades VQA accuracy for every model. The degradation stems from both inaccurate localization and format-compliance failures during the two-step prompting process. Notably, parse-failure rates reach 70% to 99% for Gemini and GPT-5 on VQA-RAD.
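
To make the perception numbers above concrete, here is a minimal sketch of how mean IoU and accuracy at an IoU threshold of 0.5 are typically computed for bounding boxes. The box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions for illustration; the study's own evaluation code may differ.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(pred: Box, gt: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(pred[0], gt[0])
    iy_min = max(pred[1], gt[1])
    ix_max = min(pred[2], gt[2])
    iy_max = min(pred[3], gt[3])

    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def mean_iou_and_acc(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mean IoU, fraction of predictions with IoU >= threshold)."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    if not scores:
        return 0.0, 0.0
    mean_iou = sum(scores) / len(scores)
    acc = sum(1 for s in scores if s >= threshold) / len(scores)
    return mean_iou, acc
```

Under this definition, a mean IoU of 0.23 means the predicted and annotated regions overlap, on average, on less than a quarter of their combined area.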

Interestingly, when the predicted bounding boxes are replaced with ground-truth annotations, VQA accuracy improves significantly, indicating that the issues primarily lie within the perception module rather than the model’s ability to decompose the task.
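
Separating format failures from localization errors requires a strict parser between the two prompting steps. The sketch below assumes a hypothetical output format in which the model is asked to emit a JSON object such as `{"bbox": [x1, y1, x2, y2]}`; the actual prompt format used in the study is not specified here, and all function names are illustrative.

```python
import json
import re
from typing import List, Optional, Sequence


def parse_box(text: str) -> Optional[List[float]]:
    """Extract a box from model output; return None on any format violation.

    Assumes (hypothetically) that the model was asked to emit
    {"bbox": [x1, y1, x2, y2]} somewhere in its response.
    """
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    box = payload.get("bbox")
    if not isinstance(box, list) or len(box) != 4:
        return None
    # Reject non-numeric entries (bool is a subclass of int, so exclude it).
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box):
        return None
    x1, y1, x2, y2 = (float(v) for v in box)
    if x2 <= x1 or y2 <= y1:
        return None  # degenerate or inverted box
    return [x1, y1, x2, y2]


def parse_failure_rate(outputs: Sequence[str]) -> float:
    """Fraction of model outputs from which no valid box can be parsed."""
    if not outputs:
        return 0.0
    failures = sum(1 for text in outputs if parse_box(text) is None)
    return failures / len(outputs)


def box_for_answer_step(
    raw_output: str, oracle_box: Optional[List[float]] = None
) -> Optional[List[float]]:
    """Pick the box passed to the answering step.

    Supplying oracle_box mimics the ground-truth substitution described above;
    otherwise the model's own (possibly unparseable) prediction is used.
    """
    return oracle_box if oracle_box is not None else parse_box(raw_output)
```

Counting unparseable outputs separately from low-IoU boxes keeps the two failure modes distinguishable when comparing the self-grounded and ground-truth-box conditions.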

Identifying Trustworthiness Bottlenecks

The findings emphasize grounding quality as a crucial bottleneck in the trustworthiness of VLMs in medical applications. The study highlights that while these models exhibit advanced capabilities in other domains, their performance in specialized medical contexts requires further scrutiny and improvement.

Future Directions: Domain Adaptation

As a follow-up to the audit, the researchers fine-tuned Qwen 2.5 VL with supervision on combined medical VQA training data. The fine-tuned model reached an open-ended recall of 85.5% on SLAKE, the highest reported among comparable methods. The results suggest that the gap in VQA performance may be addressable through domain adaptation.
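
For readers unfamiliar with the metric, open-ended recall in medical VQA is often computed as the fraction of reference-answer tokens that appear in the model's response. The sketch below illustrates that common token-level definition; it is an assumption that the study uses this exact variant, and the tokenizer here is deliberately simple.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def open_ended_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also occur in the prediction."""
    ref_tokens = tokenize(reference)
    if not ref_tokens:
        return 0.0
    pred_tokens = set(tokenize(prediction))
    hits = sum(1 for tok in ref_tokens if tok in pred_tokens)
    return hits / len(ref_tokens)
```

Because this variant rewards any response that contains the reference tokens, long answers are not penalized, which is one reason recall alone says little about grounding quality.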

However, the study leaves open the question of whether domain adaptation can also mitigate the perception and trustworthiness challenges identified in the audit. Future research will need to explore these avenues further to ensure that VLMs can be safely and effectively deployed in clinical environments.

Conclusion

The audit of frontier VLMs underscores the necessity of rigorous evaluation and improvement of these technologies before their integration into medical practice. As the field advances, ensuring the reliability and trustworthiness of VLMs will be paramount for their successful application in healthcare settings.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
