Medical Reasoning with Large Language Models: A Survey and MR-Bench
Source: arXiv:2604.08559v1
Type: Cross
Abstract: Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning.
Introduction
This survey reviews medical reasoning with large language models. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and we organize existing methods into seven major technical routes spanning both training-based and training-free approaches.
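To make the iterative view concrete, here is a minimal toy sketch of one possible abduction, deduction, and induction loop. It is not the survey's formalization. The knowledge base, the condition and finding names, the scoring rule, and the functions `abduce`, `deduce`, `induce`, and `reason` are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Set

# Toy knowledge base (hypothetical, for illustration only):
# each placeholder condition maps to the findings it is expected to produce.
KB: Dict[str, Set[str]] = {
    "condition_a": {"fever", "cough", "infiltrate_on_xray"},
    "condition_b": {"fever", "dysuria"},
    "condition_c": {"cough", "wheeze"},
}


def abduce(present: Set[str]) -> List[str]:
    """Abduction: propose every condition that explains at least one confirmed finding."""
    return sorted(c for c, expected in KB.items() if expected & present)


def deduce(hypothesis: str, checked: Set[str]) -> List[str]:
    """Deduction: list findings the hypothesis predicts that have not been checked yet."""
    return sorted(KB[hypothesis] - checked)


def induce(hypotheses: List[str], present: Set[str], checked: Set[str]) -> Dict[str, float]:
    """Induction: score each hypothesis from the evidence gathered so far.

    Assumed scoring rule: (confirmed predictions - refuted predictions) / all predictions.
    """
    scores = {}
    for h in hypotheses:
        confirmed = KB[h] & present
        refuted = (KB[h] & checked) - present
        scores[h] = (len(confirmed) - len(refuted)) / len(KB[h])
    return scores


def reason(initial: Set[str], patient: Set[str], max_rounds: int = 10) -> Dict[str, float]:
    """Iterate the three steps until the leading hypothesis has no unchecked predictions."""
    present = set(initial)  # findings confirmed present
    checked = set(initial)  # findings examined so far, present or absent
    for _ in range(max_rounds):
        hypotheses = abduce(present)
        if not hypotheses:
            break
        scores = induce(hypotheses, present, checked)
        leading = max(hypotheses, key=lambda h: scores[h])
        to_check = deduce(leading, checked)
        if not to_check:
            break
        finding = to_check[0]
        checked.add(finding)          # simulate ordering a test or asking a question
        if finding in patient:        # `patient` stands in for the true clinical picture
            present.add(finding)
    return induce(abduce(present), present, checked)


if __name__ == "__main__":
    print(reason({"fever"}, {"fever", "cough", "infiltrate_on_xray"}))
```

In this toy run the loop first pursues `condition_b`, finds its predicted `dysuria` absent, and then shifts to `condition_a`, which illustrates why the process is iterative and not a single pass.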
Methodology
We conducted a unified cross-benchmark evaluation of representative medical reasoning models under a single, consistent experimental setting. Holding the setting fixed makes results comparable across models and benchmarks, enabling a more systematic assessment of the empirical impact of existing methods in medical reasoning.
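A minimal sketch of what a consistent cross-benchmark evaluation might look like in code follows. It assumes exact-match accuracy on short answers and a single shared configuration. The names `Item`, `EvalConfig`, `accuracy`, and `evaluate_all`, as well as the prompt template and decoding values, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Item:
    question: str
    answer: str  # gold label, e.g. an option letter


@dataclass(frozen=True)
class EvalConfig:
    # One shared configuration for every model and benchmark (hypothetical values).
    prompt_template: str = "Question: {question}\nAnswer:"
    temperature: float = 0.0
    max_tokens: int = 512


# A model is any callable that takes (prompt, config) and returns answer text.
Model = Callable[[str, EvalConfig], str]


def normalize(text: str) -> str:
    """Strip whitespace and lowercase so that ' A ' and 'a' compare equal."""
    return text.strip().lower()


def accuracy(model: Model, items: List[Item], cfg: EvalConfig) -> float:
    """Exact-match accuracy of one model on one benchmark."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = 0
    for item in items:
        prompt = cfg.prompt_template.format(question=item.question)
        prediction = model(prompt, cfg)
        correct += normalize(prediction) == normalize(item.answer)
    return correct / len(items)


def evaluate_all(
    models: Dict[str, Model],
    benchmarks: Dict[str, List[Item]],
    cfg: EvalConfig,
) -> Dict[str, Dict[str, float]]:
    """Score every model on every benchmark with the same config."""
    return {
        model_name: {
            bench_name: accuracy(model, items, cfg)
            for bench_name, items in benchmarks.items()
        }
        for model_name, model in models.items()
    }
```

Because every model receives the same prompt template and decoding settings, differences in the resulting table are less likely to come from the evaluation setup itself.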
Key Technical Routes
The seven technical routes identified in our survey are:
- Data-driven Training: Utilizing large datasets for model training to enhance reasoning capabilities.
- Transfer Learning: Adapting pre-trained models for specialized medical tasks.
- Prompt Engineering: Crafting specific prompts to guide model responses in medical contexts.
- Interactive Learning: Incorporating real-time feedback mechanisms to improve decision-making.
- Explainable AI: Implementing methods that make model reasoning transparent and interpretable.
- Domain Adaptation: Tailoring models to better fit specific clinical environments or specialties.
- Hybrid Approaches: Combining two or more of the above methodologies for enhanced reasoning capabilities.
Introducing MR-Bench
To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench reveal a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. This discrepancy underscores the challenges faced when deploying LLMs in actual clinical settings.
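One simple way to quantify such a gap is as an absolute and a relative drop in accuracy. The sketch below assumes both scores are accuracies in [0, 1]. The function `performance_gap` is hypothetical, and the numbers in the usage lines are placeholders, not results reported for MR-Bench.

```python
import math


def performance_gap(exam_accuracy: float, clinical_accuracy: float) -> dict:
    """Return the absolute and relative accuracy drop from exam-style to clinical tasks."""
    for name, value in (
        ("exam_accuracy", exam_accuracy),
        ("clinical_accuracy", clinical_accuracy),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    absolute = exam_accuracy - clinical_accuracy
    # The relative drop is undefined when exam accuracy is zero.
    relative = absolute / exam_accuracy if exam_accuracy > 0 else math.nan
    return {"absolute": absolute, "relative": relative}


# Placeholder inputs for illustration only.
gap = performance_gap(exam_accuracy=0.75, clinical_accuracy=0.50)
print(gap)
```

The relative figure is useful when comparing models whose exam-level accuracy differs, since the same absolute drop means more for a weaker model.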
Findings and Implications
Our survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices. The findings highlight key gaps between current model performance and the requirements of real-world clinical reasoning.
As the deployment of LLMs in healthcare continues to grow, these gaps must be addressed to ensure that AI-assisted clinical decision-making is both reliable and safe. The insights from this work therefore bear on both research and the practical integration of AI in healthcare.
Conclusion
In summary, while large language models show promise in medical reasoning tasks, significant challenges remain in aligning their performance with the complexities of real-world clinical decision-making. Ongoing research and development in this area are crucial for the future of AI in healthcare.
