When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models
In recent years, Large Language Models (LLMs) have been increasingly integrated into medical settings, offering the potential to enhance diagnostic processes and patient care. However, one significant factor in their effectiveness, sensitivity to prompt formatting, remains largely unexplored. Researchers have sought to bridge this gap by evaluating MedGemma, a medical language model available in 4-billion- and 27-billion-parameter variants, on two prominent medical question-answering datasets: MedMCQA and PubMedQA. The evaluation used a suite of robustness tests designed to assess how different prompting strategies affect model performance.
Key Findings
The results of the study led to several concerning conclusions regarding the performance of MedGemma when subjected to different prompting techniques:
- Chain-of-Thought (CoT) Prompting: This method, often credited with improving reasoning, decreased accuracy by 5.7% compared to direct answering.
- Few-shot Examples: Utilizing few-shot examples resulted in an 11.9% degradation in performance, while also increasing position bias from 0.14 to 0.47.
- Shuffling Answer Options: Altering the order of answer choices led to the model changing predictions 59.1% of the time, with accuracy dropping by as much as 27.4 percentage points.
- Truncation of Context: When the context was front-truncated to 50%, accuracy fell below the no-context baseline. In contrast, back-truncation preserved 97% of full-context accuracy.
- Cloze Scoring: This technique, which involves selecting the highest log-probability option token, achieved accuracy rates of 51.8% for the 4B model and 64.5% for the 27B model, surpassing all other prompting strategies. This finding suggests that the models possess knowledge that is not always reflected in their generated text.
- Permutation Voting: This method recovered 4 percentage points of accuracy relative to single-ordering inference, which suggests it can partly offset the model's sensitivity to answer order.
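To make three of the ideas above concrete (cloze scoring, the prediction-flip rate under shuffled options, and permutation voting), here is a minimal Python sketch. It is not the study's code. The `Predictor` signature, the function names, the number of orderings, and the exact definition of "flip" are assumptions made for illustration; a real predictor would wrap a model call, which is left out here.

```python
import random
from collections import Counter
from typing import Callable, List, Mapping, Sequence

# Hypothetical predictor interface (an assumption, not from the study):
# takes a question and an ordered list of options, returns the index it picks.
Predictor = Callable[[str, Sequence[str]], int]


def cloze_pick(option_logprobs: Mapping[str, float]) -> str:
    """Cloze scoring: return the option label with the highest log-probability."""
    return max(option_logprobs, key=lambda label: option_logprobs[label])


def _shuffled_orders(n_options: int, n_orderings: int, seed: int) -> List[List[int]]:
    """Build reproducible random orderings of the option indices."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_orderings):
        order = list(range(n_options))
        rng.shuffle(order)
        orders.append(order)
    return orders


def permutation_vote(question: str, options: Sequence[str], predict: Predictor,
                     n_orderings: int = 8, seed: int = 0) -> int:
    """Ask the predictor under several option orderings and majority-vote.

    Each pick is mapped back to the option's ORIGINAL index before counting,
    so votes for the same option under different orderings accumulate.
    Ties go to whichever tied option received its first vote earliest.
    """
    votes: Counter = Counter()
    for order in _shuffled_orders(len(options), n_orderings, seed):
        shuffled = [options[i] for i in order]
        picked_position = predict(question, shuffled)
        votes[order[picked_position]] += 1
    return votes.most_common(1)[0][0]


def flip_rate(question: str, options: Sequence[str], predict: Predictor,
              n_orderings: int = 8, seed: int = 0) -> float:
    """Fraction of shuffled orderings whose answer differs from the
    answer given for the original ordering (assumed definition of a flip)."""
    baseline = predict(question, list(options))
    orders = _shuffled_orders(len(options), n_orderings, seed)
    flips = sum(
        order[predict(question, [options[i] for i in order])] != baseline
        for order in orders
    )
    return flips / len(orders)
```

A predictor that always returns position 0 regardless of content will typically show a high flip rate under this definition, while one that tracks the option text shows a flip rate of zero. Voting over orderings helps only when the model's errors are tied to position rather than to the option content itself.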
Implications for Medical Language Models
The findings of this evaluation carry significant implications for the deployment of medical LLMs. They demonstrate that traditional prompt engineering techniques, which may work effectively for general-purpose models, do not necessarily translate to domain-specific medical applications. This discrepancy underscores the necessity for tailored approaches in the development and evaluation of medical language models.
As the medical field continues to embrace AI technologies, understanding the nuances of prompt sensitivity will be crucial for ensuring reliable and effective performance. Future research should focus on developing robust prompting techniques that enhance the accuracy and reliability of medical LLMs, ultimately contributing to better patient outcomes and more efficient healthcare delivery.
