This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA
In recent years, patients have increasingly turned to large language models (LLMs) for answers to complex medical questions. However, the way a question is phrased can significantly influence the response a model gives. A new study (arXiv:2604.05051v1) investigates how different question framings affect the consistency of LLM responses in medical question answering (QA).
Understanding the Study
The researchers conducted a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting. Instead of retrieving documents automatically, the setup supplied expert-selected documents, so that differently phrased versions of a question were answered against the same underlying evidence. The study focused on two key dimensions of patient query variation:
- Question Framing: Positive vs. Negative
- Language Style: Technical vs. Plain Language
Methodology
A dataset comprising 6,614 query pairs was constructed, grounded in clinical trial abstracts. The researchers evaluated the consistency of responses across eight different LLMs. This evaluation aimed to determine if the framing of the questions impacted the conclusions drawn by the models.
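The pairwise consistency check described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the `PairResult` record, the `"same_framing"` / `"cross_framing"` labels, the three-way conclusion labels, and the rule that a contradiction means one "yes" and one "no" are all assumptions made for the example, and the data is a toy sample.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """One model's conclusions for the two questions in a query pair (hypothetical schema)."""
    model: str
    pair_type: str      # assumed labels: "same_framing" or "cross_framing"
    conclusion_a: str   # assumed labels: "yes", "no", or "uncertain"
    conclusion_b: str


def is_contradiction(a: str, b: str) -> bool:
    # Assumed rule: only a direct yes/no disagreement counts as a contradiction.
    return {a, b} == {"yes", "no"}


def contradiction_rates(results):
    """Return {(model, pair_type): fraction of pairs with contradictory conclusions}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [contradictions, total pairs]
    for r in results:
        key = (r.model, r.pair_type)
        counts[key][1] += 1
        if is_contradiction(r.conclusion_a, r.conclusion_b):
            counts[key][0] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


# Toy data for a single hypothetical model.
results = [
    PairResult("model-x", "cross_framing", "yes", "no"),
    PairResult("model-x", "cross_framing", "yes", "yes"),
    PairResult("model-x", "same_framing", "yes", "yes"),
    PairResult("model-x", "same_framing", "no", "uncertain"),
]
rates = contradiction_rates(results)
for key in sorted(rates):
    print(key, rates[key])
```

Keying the rates by model and pair type keeps the same-framing pairs available as a baseline, which is the comparison the study's findings rest on.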
Key Findings
The study reports three main findings:
- Pairs that mixed a positively framed and a negatively framed question were significantly more likely to yield contradictory conclusions than pairs with the same framing.
- The inconsistency in responses was amplified in multi-turn conversations, where sustained persuasion led to greater variability in answers.
- No significant interaction was observed between framing and language style, suggesting that the framing effect is similar whether a question is written in technical or plain language.
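To make "significantly more likely" concrete, one common way to compare two contradiction rates is a two-proportion z-test. The sketch below is a generic illustration under that assumption; the summary above does not say which statistical test the authors used, and the counts in the example are invented for demonstration, not taken from the paper.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for the difference between proportions x1/n1 and x2/n2.

    Returns (z, p_value). Uses the pooled standard error under the null
    hypothesis that both groups share the same underlying rate.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 30 of 100 cross-framing pairs contradict,
# versus 10 of 100 same-framing pairs.
z, p = two_proportion_z_test(30, 100, 10, 100)
print(f"z = {z:.3f}, p = {p:.5f}")
```

A test like this treats pairs as independent. With several models answering the same query pairs, a real analysis would likely need a method that accounts for that repeated structure.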
Implications for Medical QA
The results underscore a critical issue in the deployment of LLMs for medical inquiries. The study highlights that LLM responses can be systematically influenced by the phrasing of queries, even when grounded in the same underlying evidence. This raises important questions about the robustness of LLMs in high-stakes medical settings, where the accuracy and reliability of information can have profound implications for patient care.
Conclusion
As patients increasingly rely on LLMs for medical advice, it is crucial to ensure that these models provide consistent and reliable information. The study calls for enhanced evaluation criteria for RAG-based systems in medical QA, emphasizing the need for phrasing robustness. Future research should focus on developing methodologies that mitigate the risks associated with ambiguous or variable question framings, ultimately leading to improved decision-making in healthcare.
