MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
arXiv:2604.06846v1
Abstract
Interactive medical dialogue benchmarks have shown that the diagnostic accuracy of large language models (LLMs) degrades significantly when patients are non-cooperative. However, existing approaches either apply adversarial behaviors without graded severity or case-specific grounding, or reduce patient non-cooperation to a single ungraded axis, and none analyzes interactions across multiple dimensions of patient behavior.
In response to these limitations, we introduce MedDialBench, a benchmark designed to enable controlled, dose-response characterization of how individual patient behavior dimensions affect LLM diagnostic robustness. This benchmark decomposes patient behavior into five distinct dimensions:
- Logic Consistency
- Health Cognition
- Expression Style
- Disclosure
- Attitude
Each of these dimensions is equipped with graded severity levels and case-specific behavioral scripts. This controlled factorial design enables researchers to perform graded sensitivity analysis, dose-response profiling, and cross-dimension interaction detection.
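One way to picture the factorial design is as a severity level per dimension, with a cooperative baseline at level 0. The sketch below is illustrative only: the five dimension names come from the description above, but `PatientConfig`, `single_dimension_sweep`, `pair_config`, and the integer severity scale are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass

# Dimension names follow the benchmark description; everything else here
# (class name, integer severity levels) is an assumption for illustration.
DIMENSIONS = (
    "logic_consistency",
    "health_cognition",
    "expression_style",
    "disclosure",
    "attitude",
)


@dataclass(frozen=True)
class PatientConfig:
    """One severity level per dimension; 0 means cooperative behavior."""
    levels: tuple  # one integer per entry in DIMENSIONS, same order

    def active_dimensions(self):
        """Names of the dimensions set to a non-cooperative level."""
        return [name for name, lvl in zip(DIMENSIONS, self.levels) if lvl > 0]


# Fully cooperative patient, used as the reference condition.
BASELINE = PatientConfig((0,) * len(DIMENSIONS))


def single_dimension_sweep(dimension, max_level):
    """Vary one dimension from level 1 to max_level (dose-response sweep)."""
    idx = DIMENSIONS.index(dimension)
    sweep = []
    for lvl in range(1, max_level + 1):
        levels = [0] * len(DIMENSIONS)
        levels[idx] = lvl
        sweep.append(PatientConfig(tuple(levels)))
    return sweep


def pair_config(dim_a, level_a, dim_b, level_b):
    """Activate two dimensions at once (cross-dimension interaction probe)."""
    levels = [0] * len(DIMENSIONS)
    levels[DIMENSIONS.index(dim_a)] = level_a
    levels[DIMENSIONS.index(dim_b)] = level_b
    return PatientConfig(tuple(levels))
```

Holding every other dimension at 0 while sweeping one is what makes a dose-response reading possible; `pair_config` corresponds to the combinations used for interaction detection.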
Evaluation and Findings
We evaluated five frontier LLMs across 7,225 dialogues (85 cases × 17 configurations × 5 models). The results revealed a fundamental asymmetry: information pollution (fabricating symptoms) produced accuracy drops 1.7 to 3.4 times larger than information deficit (withholding information).
Notably, fabricating symptoms was the only configuration whose accuracy drop reached statistical significance for all five models (McNemar p < 0.05). Among six dimension combinations, fabricating symptoms was also the sole driver of super-additive interactions. All three pairs that involved fabrication produced observed/expected (O/E) ratios of 0.70 to 0.81, with 35 to 44 percent of eligible cases failing under the combination despite succeeding under each dimension alone.
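For readers unfamiliar with McNemar's test, it compares paired outcomes (the same case diagnosed under a baseline and under an adversarial configuration) using only the discordant pairs. The following is a minimal sketch of the exact two-sided variant; whether the benchmark uses the exact or the chi-square form is not stated here, and the function name `mcnemar_exact` is an assumption.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases correct at baseline but wrong under the adversarial configuration
    c: cases wrong at baseline but correct under the adversarial configuration
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Cases that are right under both conditions or wrong under both do not enter the statistic, which is why the test fits a design where every case is run under every configuration.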
In contrast, all non-fabricating pairs demonstrated purely additive effects, with O/E ratios around 1.0. Additionally, our findings indicated that inquiry strategy moderates the impact of information deficit but not that of information pollution. While exhaustive questioning can help recover withheld information, it cannot compensate for inputs that are fabricated. The models exhibited distinct vulnerability profiles, with worst-case accuracy drops ranging from 38.8 to 54.1 percentage points.
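The O/E ratio can be read as observed accuracy under a combination divided by the accuracy expected if the two single-dimension effects simply added up. The exact formula for the expected value is not given here, so the sketch below assumes additive accuracy drops relative to baseline; both function names and that additive model are assumptions, not the benchmark's definition.

```python
def expected_additive_accuracy(baseline, acc_a, acc_b):
    """Accuracy expected if the two single-dimension drops simply add.

    ASSUMPTION: expected = baseline - drop_a - drop_b, floored at 0.
    """
    drop_a = baseline - acc_a
    drop_b = baseline - acc_b
    return max(0.0, baseline - drop_a - drop_b)


def oe_ratio(observed, baseline, acc_a, acc_b):
    """Observed / expected accuracy for a two-dimension combination.

    A ratio near 1.0 indicates additive effects; a ratio below 1.0 means
    the combination hurts more than the sum of its parts (super-additive).
    """
    expected = expected_additive_accuracy(baseline, acc_a, acc_b)
    if expected == 0:
        raise ValueError("expected accuracy is zero; ratio is undefined")
    return observed / expected
```

With made-up numbers (not from the paper), a baseline of 0.80, single-dimension accuracies of 0.70 and 0.60, and an observed combined accuracy of 0.40 give an expected accuracy of about 0.50 and an O/E ratio of about 0.8.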
Conclusion
MedDialBench decomposes patient non-cooperation into graded, case-grounded dimensions, making it possible to measure how each behavior, alone and in combination, affects LLM diagnostic robustness. By evaluating multiple dimensions of patient behavior in a controlled way, the benchmark lays the groundwork for future research on improving the reliability and accuracy of medical dialogue systems.
