MedDialBench: Testing LLM Diagnostic Robustness in Medical Dialogue


MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

Source: arXiv:2604.06846v1 (announce type: cross)

Abstract

Interactive medical dialogue benchmarks have shown that the diagnostic accuracy of large language models (LLMs) degrades significantly when they interact with non-cooperative patients. However, existing approaches either apply adversarial behaviors without graded severity or case-specific grounding, or reduce patient non-cooperation to a single ungraded axis. Moreover, none of them analyzes interactions across multiple dimensions of patient behavior.

In response to these limitations, we introduce MedDialBench, a benchmark designed to enable controlled, dose-response characterization of how individual patient behavior dimensions affect LLM diagnostic robustness. This benchmark decomposes patient behavior into five distinct dimensions:

  • Logic Consistency
  • Health Cognition
  • Expression Style
  • Disclosure
  • Attitude

Each of these dimensions is equipped with graded severity levels and case-specific behavioral scripts. This controlled factorial design enables researchers to perform graded sensitivity analysis, dose-response profiling, and cross-dimension interaction detection.
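To make the factorial design concrete, the following is a minimal sketch of how one patient configuration and a single-dimension severity sweep could be represented. Only the five dimension names come from the text above. The three-step Severity scale, the PatientConfig class, and the single_dimension_sweep helper are hypothetical names chosen for illustration, and the benchmark's actual levels and data format may differ.

```python
from dataclasses import dataclass
from enum import IntEnum

# Dimension names are taken from the article; everything else is hypothetical.
DIMENSIONS = (
    "logic_consistency",
    "health_cognition",
    "expression_style",
    "disclosure",
    "attitude",
)


class Severity(IntEnum):
    # Assumed three-step scale; the benchmark's real grading may differ.
    NONE = 0
    MILD = 1
    SEVERE = 2


@dataclass(frozen=True)
class PatientConfig:
    # One Severity per dimension, in DIMENSIONS order.
    levels: tuple

    def active(self):
        """Return only the dimensions that deviate from cooperative baseline."""
        return {d: s for d, s in zip(DIMENSIONS, self.levels) if s != Severity.NONE}


def single_dimension_sweep(dimension):
    """Vary one dimension across all severities, holding the others at NONE."""
    idx = DIMENSIONS.index(dimension)
    configs = []
    for sev in Severity:
        levels = [Severity.NONE] * len(DIMENSIONS)
        levels[idx] = sev
        configs.append(PatientConfig(tuple(levels)))
    return configs


sweep = single_dimension_sweep("disclosure")
```

Holding every other dimension at baseline while stepping one through its severities is what allows an accuracy change to be attributed to that dimension alone, which is the idea behind a dose-response profile.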

Evaluation and Findings

We evaluated five frontier LLMs on a total of 7,225 dialogues (85 cases × 17 configurations × 5 models). The results revealed a fundamental asymmetry in model performance: information pollution (fabricating symptoms) caused accuracy drops 1.7 to 3.4 times larger than information deficit (withholding information).
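The dialogue total follows directly from crossing every case with every configuration and every model. The short sketch below only checks that arithmetic; the variable names are illustrative and the integer indices stand in for the real cases, configurations, and models.

```python
from itertools import product

# Counts reported in the article.
n_cases, n_configs, n_models = 85, 17, 5

# Full crossing: one dialogue per (case, configuration, model) triple.
grid = list(product(range(n_cases), range(n_configs), range(n_models)))

total_dialogues = len(grid)
dialogues_per_model = n_cases * n_configs
```

Here total_dialogues evaluates to 7,225 and dialogues_per_model to 1,445.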

Notably, fabricating symptoms was the only configuration that reached statistical significance across all five models (McNemar test, p < 0.05). Among six combinations of behavioral dimensions, fabricating symptoms emerged as the sole driver of super-additive interactions. Specifically, all three pairs that involved fabrication produced observed/expected (O/E) ratios of 0.70 to 0.81, with 35 to 44 percent of eligible cases failing under the combination despite succeeding under each dimension individually.
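For readers unfamiliar with the McNemar test mentioned above, here is a minimal sketch of its exact (binomial) two-sided form for paired outcomes, written with the standard library only. The article does not say whether the exact or the chi-square variant was used, so this choice is an assumption, as are the function name and the example counts.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value for paired binary outcomes.

    b: cases correct at baseline but wrong under the adversarial configuration
    c: cases wrong at baseline but correct under the adversarial configuration
    Cases with the same outcome in both conditions do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up counts, purely for illustration (not taken from the paper).
p_value = mcnemar_exact(20, 4)
```

The test looks only at discordant cases, those whose diagnosis flips between the baseline and the adversarial configuration, which suits a design where the same 85 cases are replayed under every configuration.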

In contrast, all non-fabricating pairs demonstrated purely additive effects, with O/E ratios around 1.0. Additionally, our findings indicated that inquiry strategy moderates the impact of information deficit but not that of information pollution. While exhaustive questioning can help recover withheld information, it cannot compensate for inputs that are fabricated. The models exhibited distinct vulnerability profiles, with worst-case accuracy drops ranging from 38.8 to 54.1 percentage points.
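The article does not spell out how the O/E ratio is computed. One plausible reading, sketched here as an assumption, takes the expected accuracy of a two-dimension combination to be the baseline accuracy minus the two individual drops, and divides the observed combined accuracy by that expectation. Under this reading a ratio near 1.0 means the drops simply add, while a ratio below 1.0 means the combination hurts more than the sum of its parts. The function names and example accuracies are hypothetical.

```python
def expected_additive(acc_base, acc_a, acc_b):
    """Expected combined accuracy if the two individual drops simply add."""
    drop_a = acc_base - acc_a
    drop_b = acc_base - acc_b
    return acc_base - drop_a - drop_b


def oe_ratio(acc_observed, acc_base, acc_a, acc_b):
    """Observed / expected accuracy under the additive assumption above."""
    expected = expected_additive(acc_base, acc_a, acc_b)
    if expected <= 0:
        raise ValueError("additive expectation is non-positive; ratio undefined")
    return acc_observed / expected


# Made-up accuracies, purely for illustration (not taken from the paper).
additive_case = oe_ratio(0.60, acc_base=0.90, acc_a=0.80, acc_b=0.70)
super_additive_case = oe_ratio(0.45, acc_base=0.90, acc_a=0.80, acc_b=0.70)
```

With these illustrative inputs the first ratio comes out at about 1.0 and the second at about 0.75, the pattern the article describes for non-fabricating and fabricating pairs respectively.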

Conclusion

MedDialBench offers significant advancements in understanding the complex interactions of patient behaviors and their effects on LLM diagnostic robustness. By employing a structured approach to evaluate multiple dimensions of patient behavior, this benchmark lays the groundwork for future research aimed at improving the reliability and accuracy of medical dialogue systems.



Lazarus Omolua
https://richlyai.com/blog