MedDialBench: Testing LLM Diagnostic Robustness in Medical Dialogue

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

Source: arXiv:2604.06846v1 (cross-listed)

Abstract

Interactive medical dialogue benchmarks have shown that the diagnostic accuracy of large language models (LLMs) degrades significantly when they interact with non-cooperative patients. However, existing approaches either apply adversarial behaviors without graded severity or case-specific grounding, or reduce patient non-cooperation to a single ungraded axis. Moreover, none of these approaches analyzes interactions across multiple dimensions of patient behavior.

In response to these limitations, we introduce MedDialBench, a benchmark designed to enable controlled, dose-response characterization of how individual patient behavior dimensions affect LLM diagnostic robustness. This benchmark decomposes patient behavior into five distinct dimensions:

Logic Consistency
Health Cognition
Expression Style
Disclosure
Attitude

Each of these dimensions is equipped with graded severity levels and case-specific behavioral scripts. This controlled factorial design enables researchers to perform graded sensitivity analysis, dose-response profiling, and cross-dimension interaction detection.
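As a rough illustration of what such a factorial design could look like in code, the sketch below represents a patient configuration as one severity level per dimension. The class name, the integer encoding (0 for a cooperative baseline), and the helper functions are assumptions made for illustration; they are not taken from the benchmark, and the level counts and combinations do not reproduce its actual configurations.

```python
from dataclasses import dataclass, fields, replace
from itertools import combinations


@dataclass(frozen=True)
class PatientBehavior:
    """Hypothetical encoding: one severity level per dimension, 0 = cooperative."""
    logic_consistency: int = 0
    health_cognition: int = 0
    expression_style: int = 0
    disclosure: int = 0
    attitude: int = 0


# Dimension names are read from the dataclass so the two cannot drift apart.
DIMENSIONS = [f.name for f in fields(PatientBehavior)]


def single_dimension_sweep(dimension, max_level):
    """Dose-response sweep: raise one dimension while the rest stay at baseline."""
    baseline = PatientBehavior()
    return [replace(baseline, **{dimension: level})
            for level in range(1, max_level + 1)]


def pairwise_configs(level):
    """Perturb two dimensions together, as needed for interaction detection.

    This enumerates every pair of dimensions; the benchmark itself reports
    a smaller set of combinations.
    """
    baseline = PatientBehavior()
    return [replace(baseline, **{a: level, b: level})
            for a, b in combinations(DIMENSIONS, 2)]
```

Holding all other dimensions at baseline is what lets an accuracy change be attributed to the one dimension being varied.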

Evaluation and Findings

We evaluated five frontier LLMs across a total of 7,225 dialogues (85 cases × 17 configurations × 5 models). The results revealed a fundamental asymmetry in model performance: information pollution (the fabrication of symptoms) led to accuracy drops 1.7 to 3.4 times greater than those caused by information deficit (the withholding of information).
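The dialogue count follows directly from the full crossing of cases, configurations, and models. A minimal check, using placeholder integer indices in place of real case, configuration, and model identifiers:

```python
from itertools import product

N_CASES, N_CONFIGS, N_MODELS = 85, 17, 5

# Every (case, configuration, model) triple corresponds to one dialogue.
grid = list(product(range(N_CASES), range(N_CONFIGS), range(N_MODELS)))
total_dialogues = len(grid)
print(total_dialogues)  # 7225
```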

Notably, fabricating symptoms was the only configuration that reached statistical significance across all five models (McNemar test, p < 0.05). Among the six combinations of behavioral dimensions tested, fabricating symptoms emerged as the sole driver of super-additive interactions. Specifically, all three pairs that involved fabrication produced observed/expected (O/E) ratios of 0.70 to 0.81, with 35 to 44 percent of eligible cases failing under the combination despite succeeding under each dimension alone.
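For readers unfamiliar with the McNemar test, the sketch below shows how a paired comparison of this kind can be computed from per-case correctness under a baseline and a perturbed configuration. It uses the exact binomial form of the test; whether the paper used the exact or the chi-square variant is not stated here, so treat that choice, and the function and argument names, as assumptions.

```python
from math import comb


def mcnemar_exact(baseline_correct, perturbed_correct):
    """Two-sided exact McNemar p-value for paired binary outcomes.

    Only discordant pairs matter: cases the model got right under one
    condition and wrong under the other.
    """
    if len(baseline_correct) != len(perturbed_correct):
        raise ValueError("both lists must cover the same cases")

    pairs = list(zip(baseline_correct, perturbed_correct))
    b = sum(1 for base, pert in pairs if base and not pert)  # right -> wrong
    c = sum(1 for base, pert in pairs if not base and pert)  # wrong -> right
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference

    # Under the null hypothesis, discordant pairs split 50/50 (Binomial(n, 0.5)).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A paired test suits this setting because each case is diagnosed under both conditions, so the comparison is within-case instead of between two independent samples.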

In contrast, all non-fabricating pairs demonstrated purely additive effects, with O/E ratios around 1.0. Additionally, our findings indicated that inquiry strategy moderates the impact of information deficit but not that of information pollution. While exhaustive questioning can help recover withheld information, it cannot compensate for inputs that are fabricated. The models exhibited distinct vulnerability profiles, with worst-case accuracy drops ranging from 38.8 to 54.1 percentage points.
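The eligible-case failure rate quoted for the fabrication pairs can be sketched as follows: a case is eligible if the model diagnoses it correctly under each dimension alone, and it counts as an interaction failure if the model is then wrong under the combination. This is a minimal sketch with hypothetical function and argument names. It does not compute the O/E ratio itself, whose exact definition is not given in this summary.

```python
def eligible_failure_rate(correct_a, correct_b, correct_ab):
    """Fraction of cases that succeed under A alone and B alone but fail under A+B.

    Each argument is a per-case list of booleans in the same case order.
    Returns None when no case is eligible.
    """
    if not (len(correct_a) == len(correct_b) == len(correct_ab)):
        raise ValueError("all lists must cover the same cases")

    # Keep the combined outcome only for cases that pass both single-dimension runs.
    eligible = [ab for a, b, ab in zip(correct_a, correct_b, correct_ab)
                if a and b]
    if not eligible:
        return None

    failures = sum(1 for ab in eligible if not ab)
    return failures / len(eligible)
```

Restricting to eligible cases separates failures caused by the combination from failures that either dimension already produces on its own.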

Conclusion

MedDialBench provides a controlled way to study how individual patient behaviors, and their combinations, affect LLM diagnostic robustness. By evaluating multiple graded dimensions of patient behavior in a structured way, the benchmark lays the groundwork for future research aimed at improving the reliability and accuracy of medical dialogue systems.
