VERT: Advanced LLM Judges for Accurate Radiology Reports

VERT: Reliable LLM Judges for Radiology Report Evaluation

Summary: arXiv:2604.03376v1

Abstract: Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation?

To address these questions, we conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) with VERT, our proposed LLM-based metric. The evaluation uses open- and closed-source models, both reasoning and non-reasoning and of various sizes, on two expert-annotated datasets, RadEval and RaTE-Eval, which span multiple modalities and anatomies.
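
To make the idea of a correlation analysis concrete, here is a minimal sketch of a rank correlation between expert ratings and judge scores. The summary does not say which correlation coefficient the paper uses, so Spearman's rho is an assumption here, and the function names and sample values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if an input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one expert rating and one judge score per report.
expert_ratings = [0, 1, 3, 2, 5]
judge_scores = [0.1, 0.9, 2.5, 2.6, 4.0]
rho = spearman(expert_ratings, judge_scores)
print(f"Spearman rho: {rho:.2f}")
```

In practice a library routine such as `scipy.stats.spearmanr` would normally replace the hand-written version; the pure-Python form is used only to keep the sketch dependency-free.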

Methodology

For a more thorough analysis, we further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning on the RaTE-Eval dataset. We also run a systematic error detection and categorization study to assess how well these metrics align with expert judgments and to identify where agreement is lower or higher.
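
As an illustration of ensembling, the sketch below averages per-report scores from several judges. The summary does not describe how the paper combines judges, so simple mean aggregation is an assumption, and the judge names and scores are hypothetical.

```python
from statistics import mean


def ensemble_scores(scores_by_judge):
    """Average each report's score across judges.

    `scores_by_judge` maps a judge name to a list of scores, one per report,
    with all lists in the same report order.
    """
    score_lists = list(scores_by_judge.values())
    if not score_lists:
        raise ValueError("need at least one judge")
    n_reports = len(score_lists[0])
    if any(len(scores) != n_reports for scores in score_lists):
        raise ValueError("every judge must score the same reports")
    # zip(*...) groups the scores that belong to the same report.
    return [mean(report_scores) for report_scores in zip(*score_lists)]


# Hypothetical scores from three judges on four reports.
scores = {
    "judge_a": [1.0, 3.0, 0.0, 2.0],
    "judge_b": [2.0, 3.0, 1.0, 2.0],
    "judge_c": [3.0, 3.0, 2.0, 5.0],
}
combined = ensemble_scores(scores)
print(combined)
```

Averaging is only one option; a median or a weighted mean would fit the same interface if some judges are known to be noisier than others.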

Findings

  • Our results indicate that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN.
  • Fine-tuning the Qwen3 30B model yields gains of up to 25% using only 1,300 training samples.
  • The fine-tuned model also cuts inference time by a factor of up to 37.2.
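
The figures above are relative quantities, so it helps to see how such numbers are computed. The sketch below shows a relative gain and a speedup factor; the helper names and all input values are hypothetical and are not taken from the paper.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


def speedup(baseline_seconds, improved_seconds):
    """How many times faster the improved system is than the baseline."""
    if improved_seconds <= 0:
        raise ValueError("improved_seconds must be positive")
    return baseline_seconds / improved_seconds


# Hypothetical correlations: a 0.05 absolute increase on a 0.50 baseline
# is a 10% relative gain, not a 5% one.
gain = relative_gain(0.50, 0.55)
# Hypothetical timings: 60 s per batch before, 2 s after.
factor = speedup(60.0, 2.0)
print(f"relative gain: {gain:.1%}, speedup: {factor:.1f}x")
```

Read this way, a relative gain reported over a baseline metric depends on that baseline's own correlation, so the same percentage can correspond to different absolute changes.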

Implications

These findings highlight the effectiveness of LLM-based judges for radiology report evaluation and show that reliable evaluation can be achieved with lightweight adaptation. By making report assessment more efficient, metrics like VERT mark a meaningful step toward integrating AI into clinical settings.

Conclusion

As radiology continues to evolve, robust evaluation metrics become increasingly important. Our research shows that LLM-based metrics, particularly VERT, can improve correlation with expert judgments, making radiology report evaluation more reliable across modalities and anatomies. Beyond its contribution to the literature, this has practical implications for clinical practice and for more effective use of AI in medicine.


Lazarus Omolua