Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
Source: arXiv:2603.24481v1
Abstract: Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering.
Introduction
The integration of artificial intelligence (AI) into clinical environments has been met with enthusiasm and skepticism alike. A primary concern is whether the confidence scores these systems report can be trusted. If a model reports high confidence whether or not it is right, clinicians cannot tell which predictions to act on and which to refer for human review. This article describes a multi-agent framework for improving uncertainty calibration in medical multiple-choice question answering (MCQA).
Methodology
Our framework employs four specialist agents focused on different medical domains:
- Respiratory
- Cardiology
- Neurology
- Gastroenterology
Each agent independently generates a diagnosis using the Qwen2.5-7B-Instruct model. Each diagnosis then goes through Two-Phase Verification, a self-verification process that assesses the diagnosis's internal consistency and produces a Specialist Confidence Score (S-score).
Results
The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the confidence the system reports. Our evaluation spans four experimental settings: 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA.
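The weighted fusion step can be illustrated with a minimal sketch. The source does not give the fusion formula, so the rule below is an assumption: each specialist's answer is weighted by its S-score, the option with the largest total weight wins, and the reported confidence is that option's share of the total weight. The function name `fuse_answers` and the example values are hypothetical.

```python
from collections import defaultdict


def fuse_answers(votes):
    """Fuse specialist answers by S-score (assumed rule, not the paper's exact formula).

    votes: list of (option, s_score) pairs, one per specialist agent.
    Returns (final_option, confidence), where confidence is the winning
    option's share of the total S-score weight.
    """
    totals = defaultdict(float)
    for option, s_score in votes:
        totals[option] += s_score

    weight_sum = sum(totals.values())
    if weight_sum == 0:
        raise ValueError("all S-scores are zero; nothing to fuse")

    best = max(totals, key=totals.get)
    return best, totals[best] / weight_sum


# Hypothetical S-scores from four specialists on one question.
votes = [("B", 0.9), ("C", 0.4), ("B", 0.6), ("A", 0.1)]
answer, confidence = fuse_answers(votes)
print(answer, round(confidence, 3))  # B 0.75
```

Under this rule, two specialists that agree with high S-scores outweigh a dissenting specialist with a low one, and disagreement among the agents lowers the reported confidence.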
Key findings from our research include:
- Calibration improves in every setting, with Expected Calibration Error (ECE) reduced by 49-74%.
- In the more challenging MedMCQA benchmark, calibration gains persist even when overall accuracy is limited by knowledge-intensive recall demands.
- On the MedQA-250 setting, the complete system achieved an ECE of 0.091, a 74.4% reduction compared to the single-specialist baseline, and an Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.630, a gain of 0.056, at 59.2% accuracy.
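For readers unfamiliar with the two metrics reported above, the sketch below computes them from per-question confidences and correctness labels. It uses the common equal-width binned form of ECE and the pairwise-ranking form of AUROC. The number of bins and the toy data are assumptions for illustration, not values from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins [k/n_bins, (k+1)/n_bins); conf == 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an
    incorrect one (ties count as half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data: four questions, the third answered incorrectly.
confs = [0.95, 0.85, 0.75, 0.65]
labels = [1, 1, 0, 1]
ece = expected_calibration_error(confs, labels)
auc = auroc(confs, labels)
print(round(ece, 3), round(auc, 3))  # 0.325 0.667
```

ECE measures calibration (how closely confidence tracks accuracy), while AUROC measures discrimination (how well confidence separates correct from incorrect answers). A system can improve one without the other, which is why both are reported.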
Conclusion
Ablation analysis revealed that Two-Phase Verification is the primary driver of calibration improvement, while multi-agent reasoning plays a crucial role in enhancing accuracy. Our findings underscore the importance of consistency-based verification in producing more reliable uncertainty estimates across diverse medical question types. The result is a practical confidence signal that can support deferral decisions in safety-sensitive clinical AI applications.
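One way a calibrated confidence signal could be used for deferral is a simple threshold rule: answer when confidence is at or above a cutoff and refer the question to a human otherwise. The sketch below is an illustration under that assumption. The source does not specify a deferral policy, and the function name, threshold, and data are hypothetical.

```python
def defer_by_confidence(confidences, correct, threshold):
    """Apply a threshold deferral rule (an assumed policy, not from the paper).

    Returns (coverage, selective_accuracy):
      coverage           - fraction of questions the system answers itself
      selective_accuracy - accuracy on the answered questions, or None if
                           every question was deferred
    """
    if not confidences:
        raise ValueError("no predictions given")
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    selective_accuracy = sum(answered) / len(answered) if answered else None
    return coverage, selective_accuracy


# Toy data: five questions with their reported confidence and correctness.
confs = [0.95, 0.85, 0.75, 0.65, 0.40]
labels = [1, 1, 0, 1, 0]

for threshold in (0.0, 0.8):
    coverage, accuracy = defer_by_confidence(confs, labels, threshold)
    print(threshold, coverage, accuracy)
```

Raising the threshold trades coverage for accuracy on the answered questions. That trade only works when confidence separates correct from incorrect answers, which is what the AUROC figure measures.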
