WHBench: Evaluating Frontier LLMs with Expert-in-the-Loop Validation on Women’s Health Topics
Large language models (LLMs) are increasingly used for medical guidance, yet their performance on women’s health has received little systematic evaluation. To address this gap, researchers have introduced the Women’s Health Benchmark (WHBench), an evaluation suite designed to assess LLMs across a range of women’s health topics.
Understanding WHBench
WHBench comprises 47 expert-crafted scenarios spanning 10 key topics in women’s health. Its primary aim is to surface clinically meaningful failure modes that can arise when LLMs are deployed in real-world settings, including:
- Outdated clinical guidelines
- Unsafe omissions of critical information
- Dosing errors that could impact patient safety
- Equity-related blind spots that may affect marginalized groups
Evaluation Methodology
WHBench scores each response against a 23-criterion rubric. The rubric covers dimensions that include:
- Clinical accuracy
- Completeness of information
- Safety of recommendations
- Quality of communication
- Adherence to instructions
- Equity considerations
- Handling of uncertainty
- Guideline adherence
To prioritize safety, scoring applies safety-weighted penalties, and scores are recalculated on the server side.
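The combination of a weighted rubric, safety penalties, and server-side recalculation can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three criteria, their weights, the `safety_critical` flag, the flat penalty of 0.25 per failed safety criterion, and the names `Criterion`, `RUBRIC`, and `recompute_score` are hypothetical and are not taken from WHBench, whose actual rubric has 23 criteria and whose penalty scheme is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float          # hypothetical relative weight
    safety_critical: bool  # hypothetical flag marking safety criteria


# Illustrative three-item rubric only; the real rubric has 23 criteria.
RUBRIC = [
    Criterion("clinical_accuracy", 2.0, False),
    Criterion("safety_of_recommendations", 3.0, True),
    Criterion("communication_quality", 1.0, False),
]


def recompute_score(ratings, rubric=RUBRIC, safety_penalty=0.25):
    """Recompute a percentage score from per-criterion ratings in [0, 1].

    Any total supplied by the client is ignored; only the raw ratings
    are trusted, which is the point of recalculating on the server.
    """
    total_weight = sum(c.weight for c in rubric)
    weighted = sum(c.weight * ratings[c.name] for c in rubric) / total_weight
    # Count safety-critical criteria that were failed outright.
    failed = sum(
        1 for c in rubric if c.safety_critical and ratings[c.name] == 0.0
    )
    # Assumed penalty scheme: flat deduction per failure, floored at zero.
    penalized = max(0.0, weighted - safety_penalty * failed)
    return 100.0 * penalized


# A submission that claims a high total but fails the safety criterion.
submission = {
    "claimed_total": 95.0,
    "ratings": {
        "clinical_accuracy": 1.0,
        "safety_of_recommendations": 0.0,
        "communication_quality": 1.0,
    },
}
score = recompute_score(submission["ratings"])
```

In this sketch the claimed total never enters the calculation, and a single failed safety criterion lowers the score by more than its weight alone would, which is the general effect a safety-weighted penalty is meant to have.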
Findings from the Evaluation
The benchmark was used to evaluate 22 models, yielding 3,102 attempted responses, of which 3,100 were scored. No model achieved a mean score above 75 percent; the best-performing model reached 72.1 percent. This leaves substantial room for improvement in how LLMs handle women’s health.
Even the top-performing models had low rates of fully correct responses, and harm rates varied substantially. These results underscore the need for expert involvement in any clinical deployment of these models to ensure patient safety.
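The reported counts can be checked with a few lines of arithmetic. The figures below are the ones stated above; the reading that the total corresponds to a fixed number of attempts per model-scenario pair is an inference from the numbers, not something the source states.

```python
models = 22
scenarios = 47
attempted = 3102
scored = 3100

# Number of distinct model-scenario pairs.
pairs = models * scenarios

# Attempts per pair, if attempts were spread evenly (an assumption).
attempts_per_pair = attempted / pairs

# Share of attempted responses that received a score, in percent.
scored_percent = round(100 * scored / attempted, 2)

print(pairs, attempts_per_pair, scored_percent)
```

The total divides evenly into three attempts per pair, which is consistent with each model answering each scenario three times, and only two attempted responses went unscored.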
Implications for Future AI Development
Inter-rater reliability was moderate at the level of individual response labels but high for model ranking. This suggests that WHBench is well suited to comparing systems against one another, while judgments on individual responses still call for expert oversight. The ultimate goal of WHBench is to provide a publicly accessible, failure-mode-aware benchmark for tracking progress toward safer and more equitable AI in women’s health.
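The difference between label-level and ranking-level agreement can be made concrete with two common statistics. This is a sketch under assumptions: the source does not say which reliability measures were used, so Cohen's kappa (for labels) and Spearman's rank correlation (for rankings) are stand-ins, and the label names and sample data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on categorical labels.

    Undefined (division by zero) if both raters use a single identical label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(
        count_a[label] * count_b[label]
        for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    return (observed - expected) / (1 - expected)


def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical labels from two raters for the same four responses.
rater_1 = ["correct", "harmful", "partial", "correct"]
rater_2 = ["correct", "partial", "partial", "correct"]
kappa = cohens_kappa(rater_1, rater_2)

# Hypothetical model rankings implied by each rater's scores.
ranking_1 = [1, 2, 3, 4]
ranking_2 = [1, 3, 2, 4]
rho = spearman_rho(ranking_1, ranking_2)
```

Raters can disagree on a noticeable share of individual labels while still ordering the models almost identically, because per-response disagreements tend to average out when scores are aggregated per model.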
As AI in medicine continues to evolve, benchmarks like WHBench help ensure that advances serve all populations equitably and safely. Rigorous evaluation combined with expert involvement aims to improve the quality of AI-provided medical guidance, particularly in women’s health.
