L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification
Clinical text classification is typically approached in one of two ways: with specialized fine-tuned models, such as BERT variants, or with general-purpose large language models (LLMs). However, research indicates that neither approach consistently outperforms the other across all instances.
A recent study introduced Learning to Defer for clinical text (L2D-Clinical), a framework designed to improve the accuracy of clinical text classification. The framework learns when a BERT classifier should defer to an LLM, using uncertainty signals and text characteristics to make that decision.
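The deferral decision described above can be sketched as a small gate over per-instance features. This is a minimal illustration, not the study's implementation: the feature set (maximum probability, margin, entropy, token count), the logistic form of the gate, and the weight values are all assumptions chosen for readability.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def deferral_features(probs, text):
    """Hypothetical features: uncertainty signals plus one text characteristic."""
    top = sorted(probs, reverse=True)
    return {
        "max_prob": top[0],              # confidence in the predicted class
        "margin": top[0] - top[1],       # gap between the top two classes
        "entropy": entropy(probs),       # overall uncertainty
        "n_tokens": len(text.split()),   # crude text-length signal
    }

def should_defer(features, weights, bias):
    """Logistic gate: sigmoid(w.x + b) >= 0.5 is equivalent to w.x + b >= 0."""
    score = bias + sum(weights[name] * value for name, value in features.items())
    return score >= 0

# Illustrative weights only (assumption); a real gate would be fitted on
# held-out data where both the BERT and the LLM predictions are known.
WEIGHTS = {"max_prob": -4.0, "margin": -2.0, "entropy": 3.0, "n_tokens": 0.01}
BIAS = 1.0

confident = deferral_features([0.97, 0.03], "Patient developed rash after amoxicillin.")
uncertain = deferral_features([0.55, 0.45], "Symptoms possibly related to recent dose change.")

print(should_defer(confident, WEIGHTS, BIAS))  # BERT is confident: keep its answer
print(should_defer(uncertain, WEIGHTS, BIAS))  # BERT is unsure: send to the LLM
```

With these made-up weights, the gate keeps the confident prediction and defers the uncertain one. The general idea is that the gate learns where the LLM is likely to be right when BERT is not, rather than assuming the LLM is always better.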
Key Features of L2D-Clinical
- Adaptive Deferral: Previous learning-to-defer methods deferred to human experts, who were assumed to be universally superior. L2D-Clinical instead adapts its deferral strategy to the context, improving accuracy when the LLM complements BERT.
- Evaluation on Clinical Tasks: The framework was evaluated on two English clinical tasks. The first task was Adverse Drug Event (ADE) detection using the ADE Corpus V2, where BioBERT achieved an F1 score of 0.911, well above the LLM's 0.765.
- Treatment Outcome Classification: The second task used the MIMIC-IV dataset, with a multi-LLM consensus as the ground truth. Here the ranking was reversed: GPT-5-nano achieved an F1 score of 0.967, surpassing ClinicalBERT, which scored 0.887.
Performance Insights
The evaluation results show that selective deferral helps even when the LLM is the weaker model overall. In the ADE detection task, L2D-Clinical achieved an F1 score of 0.928, an improvement of 1.7 points over BioBERT alone (0.911). It did so by deferring 7% of instances to the LLM, whose high recall compensated for the cases BioBERT missed.
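To make this mechanism concrete, the sketch below routes a handful of predictions through a deferral mask and compares F1 scores. The labels, predictions, and mask are toy values invented for illustration. They are not drawn from the ADE Corpus V2, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def route(bert_pred, llm_pred, defer):
    """Use the LLM prediction where defer is True, otherwise keep BERT's."""
    return [l if d else b for b, l, d in zip(bert_pred, llm_pred, defer)]

# Toy data (assumption): BERT is precise but misses two positives;
# the LLM has perfect recall but raises two false alarms.
y_true    = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
bert_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
llm_pred  = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
# The gate defers only the two instances BERT gets wrong.
defer     = [False, False, False, True, False, False, False, False, True, False]

combined_pred = route(bert_pred, llm_pred, defer)

print("BERT alone:", f1_score(y_true, bert_pred))
print("LLM alone: ", f1_score(y_true, llm_pred))
print("Deferred:  ", f1_score(y_true, combined_pred))
```

On this toy data, BERT alone scores 0.75 and the LLM alone about 0.83, while the routed system recovers both missed positives without inheriting the LLM's false alarms. A real gate will not be this clean, but the same effect, at a smaller scale, is what the reported 1.7-point gain reflects.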
Similarly, in the treatment outcome classification task, L2D-Clinical reached an F1 score of 0.980, an increase of 9.3 points over ClinicalBERT alone (0.887) and 1.3 points over GPT-5-nano alone (0.967). It did so while deferring only 16.8% of cases to the LLM, which shows that the framework can draw on the LLM's strengths while limiting the cost of API usage.
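The cost argument follows directly from the deferral rates. The short calculation below estimates how many LLM calls each reported rate implies. The batch size of 10,000 documents is a hypothetical figure chosen for illustration; only the two deferral rates come from the study.

```python
def llm_calls(n_documents, deferral_rate):
    """Number of documents sent to the LLM at a given deferral rate."""
    return round(n_documents * deferral_rate)

def relative_savings(deferral_rate):
    """Fraction of LLM calls avoided versus sending every document to the LLM."""
    return 1.0 - deferral_rate

N_DOCS = 10_000  # hypothetical batch size (assumption)

# Deferral rates as reported for the two evaluation tasks.
REPORTED_RATES = {"ADE detection": 0.07, "treatment outcome": 0.168}

for task, rate in REPORTED_RATES.items():
    calls = llm_calls(N_DOCS, rate)
    saved = relative_savings(rate)
    print(f"{task}: {calls} LLM calls, {saved:.1%} of calls avoided")
```

Under that assumption, the ADE task would need 700 LLM calls and the treatment outcome task 1,680, compared with 10,000 if every document were sent to the LLM.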
Conclusion
The L2D-Clinical framework is a promising advance in clinical text classification. By learning when the LLM is likely to correct the BERT classifier, rather than assuming that either model is always the stronger one, it improves classification accuracy while keeping LLM usage, and therefore cost, low. As the demand for precise and efficient clinical text processing continues to grow, frameworks like L2D-Clinical may play an important role in healthcare AI.
