VERT: Reliable LLM Judges for Radiology Report Evaluation
arXiv:2604.03376v1
Abstract: Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation?
To address these questions, we conduct a thorough correlation analysis between expert and LLM-based ratings. This study compares three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric. Our evaluation uses both open- and closed-source models (reasoning and non-reasoning) of various sizes across two expert-annotated datasets, RadEval and RaTE-Eval, which span multiple modalities and anatomies.
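As a rough illustration of what a correlation analysis between expert and LLM-based ratings involves, the sketch below computes Kendall's tau-b between two lists of per-report scores. The choice of Kendall's tau is an assumption; the text above does not say which correlation coefficient is used, and the scores shown are hypothetical placeholders, not data from RadEval or RaTE-Eval.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(expert_scores, judge_scores):
    """Kendall's tau-b rank correlation between two equal-length score lists."""
    if len(expert_scores) != len(judge_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = ties_expert_only = ties_judge_only = 0
    for (e1, j1), (e2, j2) in combinations(zip(expert_scores, judge_scores), 2):
        de, dj = e1 - e2, j1 - j2
        if de == 0 and dj == 0:
            continue  # pair tied in both lists: excluded from every count
        elif de == 0:
            ties_expert_only += 1
        elif dj == 0:
            ties_judge_only += 1
        elif de * dj > 0:
            concordant += 1  # both raters order the pair the same way
        else:
            discordant += 1  # the raters order the pair in opposite ways

    untied = concordant + discordant
    denom = sqrt((untied + ties_expert_only) * (untied + ties_judge_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical per-report error counts (illustrative only).
expert = [0, 1, 2, 3, 4]
judge = [0, 2, 1, 3, 5]
print(f"tau-b = {kendall_tau_b(expert, judge):.2f}")
```

A value near 1 means the judge ranks reports in nearly the same order as the experts; a value near 0 means the two rankings are unrelated.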
Methodology
To ensure a comprehensive analysis, we further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using the RaTE-Eval dataset. Our approach also includes a systematic error detection and categorization study that assesses how well these metrics align with expert judgments and identifies where agreement is lower or higher.
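The text does not specify how ensembling is performed. One common option, shown here purely as an assumed sketch, is to average the scores that several judges assign to the same report. The function name `ensemble_scores` and the judge names are hypothetical.

```python
from statistics import mean


def ensemble_scores(scores_by_judge):
    """Average per-report scores across judges.

    scores_by_judge maps a judge name to its list of scores, one per report,
    with all lists in the same report order.
    """
    lengths = {len(scores) for scores in scores_by_judge.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one judge, all with the same number of scores")
    # zip(*...) groups the scores that different judges gave to the same report.
    return [mean(per_report) for per_report in zip(*scores_by_judge.values())]


# Hypothetical scores from three judges on four reports (illustrative only).
scores = {
    "judge_a": [1.0, 3.0, 2.0, 5.0],
    "judge_b": [2.0, 3.0, 1.0, 4.0],
    "judge_c": [3.0, 3.0, 3.0, 3.0],
}
print(ensemble_scores(scores))
```

The averaged scores can then be correlated with expert ratings in the same way as the scores of any single judge.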
Findings
- Our results indicate that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN.
- Fine-tuning the Qwen3 30B model yields gains of up to 25% using only 1,300 training samples.
- The fine-tuned model reduces inference time by a factor of up to 37.2.
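To make the units in these findings concrete, the sketch below shows the arithmetic behind a relative gain (in percent) and a speedup factor. It assumes the percentages are relative improvements over a baseline, which the text suggests for the GREEN comparison but does not state for the fine-tuning gain. All input numbers are hypothetical and are not taken from the study.

```python
def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


def speedup_factor(baseline_seconds, improved_seconds):
    """How many times faster the improved system is than the baseline."""
    if improved_seconds <= 0:
        raise ValueError("improved_seconds must be positive")
    return baseline_seconds / improved_seconds


# Hypothetical correlation values and timings, chosen only to show the arithmetic.
print(f"{relative_gain_percent(0.50, 0.55):.1f}% relative gain")
print(f"{speedup_factor(60.0, 2.0):.1f}x faster")
```

Under this reading, a 37.2-fold reduction means the baseline takes 37.2 times as long per evaluation as the fine-tuned model.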
Implications
These findings highlight the effectiveness of LLM-based judges in radiology report evaluation. The ability to achieve reliable evaluations with lightweight adaptations opens new avenues for incorporating advanced AI technologies in clinical settings. With the potential to make radiology report assessment more efficient, models like VERT mark a meaningful step toward integrating AI in healthcare.
Conclusion
As radiology continues to evolve, robust evaluation metrics become increasingly critical. Our research demonstrates that LLM-based metrics, particularly VERT, can substantially improve correlation with expert judgments, making radiology report evaluation more reliable across modalities and anatomies. This advance contributes to the academic literature and has practical implications for clinical practice, paving the way for more effective use of AI in medicine.
