VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise
Large language models (LLMs) for medicine score well on standardized benchmarks, but those evaluations rarely reflect real clinical interactions, in which patients contend with memory lapses, limited health literacy, anxiety, and other barriers to effective communication. To address this shortcoming, researchers have introduced VeriSim, a truth-preserving patient simulation framework that injects controllable, clinically grounded noise into patient responses while strictly adhering to the medical ground truth.
Introducing VeriSim
VeriSim operationalizes six noise dimensions derived from peer-reviewed medical communication literature. Together they capture clinical phenomena including:
- Patient recall limitations
- Health literacy barriers
- Stigma-driven non-disclosure
- Emotional and psychological influences on communication
- Variability in patient responses
- Context-dependent understanding of medical information
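To make the idea of controllable, truth-preserving noise concrete, the following is a minimal sketch of how the dimensions listed above might be exposed as settings. It is not taken from the VeriSim codebase: the class name NoiseConfig, its field names, the 0-to-1 intensity scale, and the helper is_truth_preserving are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class NoiseConfig:
    """Hypothetical per-dimension noise intensities in [0.0, 1.0].

    Field names mirror the dimensions listed in the article; they are
    illustrative and are not VeriSim's actual parameter names.
    """
    recall_limitation: float = 0.0
    health_literacy_barrier: float = 0.0
    stigma_nondisclosure: float = 0.0
    emotional_influence: float = 0.0
    response_variability: float = 0.0
    context_dependence: float = 0.0

    def __post_init__(self):
        # Reject intensities outside the assumed [0, 1] scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def active_dimensions(self):
        """Return the names of dimensions with non-zero intensity."""
        return [f.name for f in fields(self) if getattr(self, f.name) > 0.0]


def is_truth_preserving(ground_truth: dict, stated: dict) -> bool:
    """Assumed invariant: a noisy patient may omit facts, but every fact
    that is stated must match the underlying medical record."""
    return all(
        key in ground_truth and ground_truth[key] == value
        for key, value in stated.items()
    )


# Hypothetical patient record and two candidate noisy responses.
record = {"smoker": True, "fever_days": 3}
config = NoiseConfig(recall_limitation=0.5, stigma_nondisclosure=0.3)
print(config.active_dimensions())
print(is_truth_preserving(record, {"fever_days": 3}))   # omission only
print(is_truth_preserving(record, {"smoker": False}))   # contradiction
```

Under this reading, withholding the smoking history is acceptable noise, while denying it would violate the truth-preserving constraint.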
Research Findings
Experiments across seven open-weight LLMs showed a significant decline in model performance under realistic patient noise. Key findings include:
- Diagnostic accuracy decreased by 15-25% under noise conditions.
- Conversation length increased by 34-55% as models struggled to handle noisy patient interactions.
- Smaller models (7B parameters) degraded 40% more than larger models (70B+ parameters).
- Medical fine-tuning on standard corpora provided limited robustness against patient communication noise.
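The percentages above can be read in more than one way, since a drop may be reported in absolute points or relative to the clean baseline, and the summary does not say which is meant. The sketch below shows both calculations, along with one possible reading of "40% greater degradation". The function names and all input numbers are hypothetical and chosen only to illustrate the arithmetic; they are not results from the study.

```python
def absolute_drop(clean: float, noisy: float) -> float:
    """Drop in accuracy expressed in the metric's own units."""
    return clean - noisy


def relative_drop(clean: float, noisy: float) -> float:
    """Drop in accuracy as a fraction of the clean baseline."""
    if clean <= 0:
        raise ValueError("clean baseline must be positive")
    return (clean - noisy) / clean


def excess_degradation(small_model_drop: float, large_model_drop: float) -> float:
    """How much larger one drop is than another, as a fraction.

    A value of 0.40 corresponds to "40% greater degradation".
    """
    if large_model_drop <= 0:
        raise ValueError("reference drop must be positive")
    return small_model_drop / large_model_drop - 1.0


# Hypothetical accuracies: 0.80 without noise, 0.64 with noise.
print(f"absolute drop: {absolute_drop(0.80, 0.64):.2f}")
print(f"relative drop: {relative_drop(0.80, 0.64):.2f}")

# Hypothetical drops for a small and a large model.
print(f"excess degradation: {excess_degradation(0.21, 0.15):.2f}")
```

With these illustrative inputs, the same change is a 16-point absolute drop but a 20% relative drop, which is why the reporting convention matters when comparing models.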
Evaluation by Clinicians
To validate simulation quality, board-certified clinicians evaluated the simulations produced by VeriSim. They rated the simulations as high quality, with strong inter-annotator agreement (kappa values exceeding 0.80). In addition, an LLM-as-a-Judge setup served as a validated auxiliary evaluator, achieving comparable reliability and making assessment scalable.
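For readers unfamiliar with kappa, the following sketch computes Cohen's kappa, a standard chance-corrected agreement statistic for two raters. The article does not state which kappa variant the study used, so Cohen's kappa is an assumption here, and the rating labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of ten simulated dialogues by two clinicians.
clinician_1 = ["ok", "ok", "bad", "ok", "bad", "bad", "ok", "ok", "bad", "ok"]
clinician_2 = ["ok", "ok", "ok", "ok", "bad", "bad", "ok", "ok", "bad", "ok"]
print(f"kappa = {cohens_kappa(clinician_1, clinician_2):.3f}")
```

In this made-up example the raters agree on nine of ten items, yet kappa is lower than 0.9 because part of that agreement would be expected by chance. Values above 0.80 are conventionally read as very strong agreement.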
Addressing the Sim-to-Real Gap
These results point to a critical Sim-to-Real gap in medical AI. As the medical community increasingly relies on AI tools to assist in clinical decision-making, these systems must perform well in real-world settings, where patient communication is often incomplete, imprecise, or guarded.
Open Source Release
To support further research in this area, the creators of VeriSim have released the framework as an open-source noise-injection tool. It provides a rigorous testbed for evaluating the clinical robustness of medical AI systems, with the aim of improving patient outcomes and healthcare delivery.
