Agentic AI Outperforms RAG Baselines in Myeloma Clinical Reasoning

Agentic Clinical Reasoning Over Longitudinal Myeloma Records: A Retrospective Evaluation Against Expert Consensus

A study recently posted on arXiv (2604.24473v1) examines how well large language model (LLM)-based systems synthesize clinical evidence from extensive longitudinal records of multiple myeloma patients. The question is whether such systems can match expert oncologists when decisions depend on complex clinical histories that span years.

Background

Multiple myeloma, a type of blood cancer, is managed with sequential lines of therapy over many years. Each treatment decision depends on the cumulative disease history, which is often spread across a large number of clinical records. The challenge lies in synthesizing this information accurately to guide treatment.

Study Overview

The study was a retrospective evaluation of longitudinal clinical records from 811 myeloma patients treated at a tertiary center between 2001 and 2026. The dataset included:

44,962 clinical documents
1,334,677 laboratory values

External data from the MIMIC-IV database were also used for validation. The researchers compared an agentic reasoning system against several baseline approaches, including:

Single-pass retrieval-augmented generation (RAG)
Iterative RAG
Full-context input

The evaluation focused on 469 patient-question pairs derived from 48 templates categorized into three complexity levels. Reference labels were established through double annotation by four oncologists, with adjudication from a senior hematologist.
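
As a rough illustration of the concordance metric, the sketch below computes the share of system answers that agree with the adjudicated reference label. It assumes concordance means exact agreement with the reference label; the function name `concordance` and the toy labels are hypothetical and do not come from the study.

```python
def concordance(system_answers, reference_labels):
    """Fraction of answers that exactly match the reference label."""
    if len(system_answers) != len(reference_labels):
        raise ValueError("answers and labels must have the same length")
    if not reference_labels:
        raise ValueError("at least one patient-question pair is required")
    matches = sum(a == r for a, r in zip(system_answers, reference_labels))
    return matches / len(reference_labels)

# Toy data (invented): four patient-question pairs, one mismatch.
reference = ["yes", "no", "yes", "yes"]
system = ["yes", "no", "no", "yes"]
print(concordance(system, reference))  # 0.75
```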

Key Findings

The main results were:

Iterative RAG and full-context input reached near-identical concordance ceilings of 75.4% and 75.8%, respectively (p = 1.00).
The agentic reasoning system outperformed both baselines, reaching a concordance rate of 79.6% (95% CI 76.4-82.8), a statistically significant improvement of +3.8 and +4.2 percentage points (p = 0.006 and 0.007).
The gains grew with question complexity, with a +9.4 percentage-point gain on criteria-based synthesis (p = 0.032).
For the longest records, the agentic system gained +13.5 percentage points in the top decile of record length (n = 10).
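
The percentage-point gaps follow directly from the concordance figures quoted above, and a paired test is one way such gaps can be checked for significance. The sketch below recomputes the gaps and implements a two-sided exact McNemar test. The article does not name the test the authors used, so McNemar is an assumption here, and the discordant counts in the last line are hypothetical.

```python
from math import comb

# Concordance figures quoted above, in percent.
AGENTIC, ITERATIVE_RAG, FULL_CONTEXT = 79.6, 75.4, 75.8

def gap_pp(system, baseline):
    """Difference between two percentages, in percentage points."""
    return round(system - baseline, 1)

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: questions only the first system answered concordantly
    c: questions only the second system answered concordantly
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Binomial(n, 0.5) probability of a split at least as uneven as observed.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(gap_pp(AGENTIC, ITERATIVE_RAG))  # 4.2
print(gap_pp(AGENTIC, FULL_CONTEXT))   # 3.8
print(mcnemar_exact(12, 30))           # hypothetical counts, not from the study
```

The test only looks at questions where the two systems disagree, which is why two systems with nearly equal concordance can yield a p-value of 1.00.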

The system's error rate was 12.2%, comparable to the expert disagreement rate of 13.6%. The clinical significance of those errors differed, however: 57.8% of the system's errors were deemed clinically significant, compared with 18.8% of expert disagreements.
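
Multiplying each pair of percentages gives a rough sense of how often a clinically significant discrepancy occurs overall. This is back-of-the-envelope arithmetic on the figures quoted above, assuming each rate and its clinically significant share refer to the same denominator, which the article does not confirm. The function name is hypothetical.

```python
def significant_share_of_all(rate_pct, significant_pct):
    """Percent of all cases that are errors AND clinically significant."""
    return round(rate_pct * significant_pct / 100, 2)

system = significant_share_of_all(12.2, 57.8)   # agentic system errors
experts = significant_share_of_all(13.6, 18.8)  # expert disagreements

print(system)   # 7.05
print(experts)  # 2.56
```

Under that assumption, clinically significant system errors would affect about 7% of answers, against roughly 2.6% for clinically significant expert disagreements, even though the headline error and disagreement rates are similar.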

Implications

The findings suggest that agentic reasoning can outperform retrieval-based and full-context baselines, particularly on complex questions and long records. Because the system's remaining errors are more often clinically significant than expert disagreements, prospective evaluation in routine care is needed before such systems are integrated into patient management. The potential to support clinical decision-making in oncology is promising, but it has to be weighed against patient safety.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
