IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
Summary: arXiv:2604.07709v3
Abstract: Ask a frontier model how to taper six milligrams of alprazolam for a patient (psychiatrist retired, ten days of pills left, abrupt cessation causes seizures) and it tells her to call the psychiatrist she just explained does not exist. Change one word (“I’m a psychiatrist; a patient presents with…”) and the same model, same weights, same inference pass produces a textbook Ashton Manual taper with diazepam equivalence, anticonvulsant coverage, and monitoring thresholds. The knowledge was there; the model withheld it. IatroBench measures this gap.
The benchmark covers sixty pre-registered clinical scenarios and six frontier models. Its 3,600 responses were scored on two axes, commission harm (CH, 0-3) and omission harm (OH, 0-4), through a structured-evaluation pipeline validated against physician scoring (weighted kappa, kappa_w = 0.571; within-1 agreement 96%). The central finding is identity-contingent withholding: pose the same clinical question in physician and layperson framings, and all five testable models give better guidance to the physician.
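The two validation statistics quoted above can be illustrated with a minimal sketch. It assumes linear disagreement weights (the summary does not say whether kappa_w is linearly or quadratically weighted), and the function names and toy scores are hypothetical, not taken from the study.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's weighted kappa with linear weights (an assumption here)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length score lists")
    n = len(rater_a)
    max_dist = n_categories - 1
    joint = Counter(zip(rater_a, rater_b))   # observed score pairs
    marg_a = Counter(rater_a)                # marginal counts, rater A
    marg_b = Counter(rater_b)                # marginal counts, rater B

    # Mean weighted disagreement actually observed.
    observed = sum(abs(i - j) / max_dist * c / n for (i, j), c in joint.items())
    # Mean weighted disagreement expected if the raters were independent.
    expected = sum(
        abs(i - j) / max_dist * marg_a[i] * marg_b[j] / (n * n)
        for i in range(n_categories)
        for j in range(n_categories)
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1 - observed / expected


def within_one_agreement(rater_a, rater_b):
    """Fraction of items where the two scores differ by at most one point."""
    return sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / len(rater_a)


# Hypothetical omission-harm scores (0-4) from a pipeline and a physician.
pipeline = [0, 1, 2, 3, 4, 2]
physician = [0, 2, 2, 3, 3, 0]
print(weighted_kappa(pipeline, physician, n_categories=5))
print(within_one_agreement(pipeline, physician))
```

Within-1 agreement is the more lenient of the two: it counts near misses as agreement, which is why it can sit at 96% while the chance-corrected kappa is much lower.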
The study reveals several critical insights:
- Decoupling Gap: The same question receives significantly better guidance in physician framing than in layperson framing (decoupling gap +0.38, p = 0.003).
- Safety-Colliding Actions: Binary hit rates on safety-colliding actions drop by 13.1 percentage points in layperson framing (p < 0.0001), while non-colliding actions show no change.
- Model Performance: The gap is widest for the model with the heaviest safety investment, Opus, which shows a decoupling gap of +0.65.
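To make the quantities in this list concrete, here is a rough sketch of how a paired framing gap and a hit-rate drop could be computed. The summary does not give the exact definition of the decoupling gap, so the code assumes it is the mean paired difference in omission-harm score (layperson minus physician) over matched scenarios; all names and numbers are hypothetical.

```python
def decoupling_gap(oh_layperson, oh_physician):
    """Mean paired OH difference, layperson minus physician (assumed definition).

    A positive value means the layperson framing received more omission harm,
    i.e. worse guidance, on the same clinical question.
    """
    if len(oh_layperson) != len(oh_physician) or not oh_layperson:
        raise ValueError("need matched, non-empty score lists")
    diffs = [lay - phys for lay, phys in zip(oh_layperson, oh_physician)]
    return sum(diffs) / len(diffs)


def hit_rate_drop_pp(physician_hits, layperson_hits):
    """Drop in binary hit rate from physician to layperson framing, in points."""
    phys_rate = sum(physician_hits) / len(physician_hits)
    lay_rate = sum(layperson_hits) / len(layperson_hits)
    return 100 * (phys_rate - lay_rate)


# Hypothetical matched scenarios: OH scores (0-4) under each framing.
print(decoupling_gap([2, 3, 1], [1, 1, 1]))            # 1.0
# Hypothetical binary hits (1 = critical action mentioned) per framing.
print(hit_rate_drop_pp([1, 1, 1, 0], [1, 0, 0, 0]))    # 50.0
```

Pairing by scenario matters: comparing each question against itself under the other framing removes scenario difficulty as a confound.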
Three distinct failure modes were identified:
- Trained Withholding: Opus has the relevant clinical knowledge, as its physician-framed answers show, but withholds it in layperson framing.
- Incompetence: Llama 4 showed notable deficiencies in understanding and generating appropriate medical responses.
- Indiscriminate Content Filtering: GPT-5.2’s post-generation filter strips physician responses at a rate nine times higher than layperson responses, primarily because they contain denser pharmacological tokens.
Additionally, the standard language model judge assigns an omission harm (OH) score of 0 to 73% of responses that physicians score with OH >= 1 (kappa = 0.045). This indicates that the evaluation apparatus shares the same blind spot as the training apparatus, underscoring a significant flaw in current AI safety measures.
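The 73% figure is a conditional rate: among responses that physicians scored OH >= 1, the share the judge scored OH = 0. A minimal sketch of that calculation follows, with hypothetical names and toy scores rather than study data.

```python
def judge_miss_rate(physician_oh, judge_oh):
    """Share of physician-flagged responses (OH >= 1) that the judge scored 0."""
    if len(physician_oh) != len(judge_oh):
        raise ValueError("score lists must be the same length")
    # Keep only the responses where the physician saw omission harm.
    flagged = [j for p, j in zip(physician_oh, judge_oh) if p >= 1]
    if not flagged:
        raise ValueError("no responses with physician OH >= 1")
    return sum(j == 0 for j in flagged) / len(flagged)


# Hypothetical scores for five responses.
physician = [1, 2, 0, 3, 1]
judge = [0, 0, 0, 1, 0]
print(judge_miss_rate(physician, judge))  # 0.75
```

Conditioning on the physician's score isolates the blind spot: responses with no omission harm are excluded, so the rate reflects only the cases where there was something for the judge to miss.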
Every clinical scenario in the study targets individuals who have already exhausted standard referrals, which is why accurate guidance matters in these cases. The findings suggest that while AI models hold significant potential for improving healthcare, safety measures carry their own risk when they inadvertently lead a model to withhold crucial information.
As AI continues to evolve, ensuring that these models can provide reliable and safe medical advice becomes paramount. Future research should focus on refining these models to minimize the risks of iatrogenic harm while maximizing their benefits in clinical settings.
