A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation
Summary: arXiv:2604.00493v1 Announce Type: cross
Abstract: Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions.
The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality.
Key Features of CheXOne
- Jointly generates diagnostic predictions and reasoning traces.
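- Is trained with a two-stage framework that combines instruction tuning with reinforcement learning.
- Draws on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks.
- Uses the reinforcement learning stage to improve reasoning quality.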
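As a rough illustration of what pairing a prediction with a reasoning trace could look like as data, the Python sketch below models one output as a chain from visual evidence to radiographic findings to a prediction. The class and field names (`ReasoningStep`, `CxrInterpretation`, `visual_evidence`, `finding`) are hypothetical and are not taken from CheXOne; the model's actual output format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningStep:
    """One link in a trace: what was seen, and which finding it supports."""
    visual_evidence: str  # hypothetical field: an observation in the image
    finding: str          # hypothetical field: the radiographic finding it supports


@dataclass
class CxrInterpretation:
    """Hypothetical container pairing a prediction with its reasoning trace."""
    prediction: str
    trace: list[ReasoningStep] = field(default_factory=list)

    def render(self) -> str:
        """Format the trace and the prediction as plain text for review."""
        lines = [
            f"{i}. {step.visual_evidence} -> {step.finding}"
            for i, step in enumerate(self.trace, start=1)
        ]
        lines.append(f"Prediction: {self.prediction}")
        return "\n".join(lines)


# Illustrative values only; this is not model output.
example = CxrInterpretation(
    prediction="pleural effusion",
    trace=[
        ReasoningStep(
            visual_evidence="blunted right costophrenic angle",
            finding="right pleural fluid",
        )
    ],
)
print(example.render())
```

Keeping each step as an explicit evidence-to-finding pair is one way to let a reviewer check every link in the trace separately from the final prediction.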
Evaluation and Performance
We evaluate CheXOne in zero-shot settings across the following task types:
- Visual Question Answering
- Report Generation
- Visual Grounding
- Reasoning Assessment
The evaluation covers 17 settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. In a clinical reader study, CheXOne-drafted reports were comparable to or better than resident-written reports in 55% of cases. The drafted reports also addressed clinical indications effectively and improved the efficiency of both report writing and CXR interpretation.
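To make the "comparable to or better than" figure concrete, the following Python sketch shows one way such a share could be computed from per-case reader ratings. The three-level rating scale and the function name `share_comparable_or_better` are assumptions for illustration; the study's actual rating protocol is not described here, and the toy ratings are not study data.

```python
from collections import Counter

# Hypothetical rating labels for a pairwise reader study.
BETTER, COMPARABLE, WORSE = "better", "comparable", "worse"


def share_comparable_or_better(ratings: list[str]) -> float:
    """Fraction of cases where the model draft was rated comparable or better."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    return (counts[BETTER] + counts[COMPARABLE]) / len(ratings)


# Toy ratings for illustration only; not data from the study.
toy_ratings = [BETTER, COMPARABLE, WORSE, COMPARABLE, WORSE]
print(share_comparable_or_better(toy_ratings))
```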
Clinical Implications
Further analyses involving radiologists show that the generated reasoning traces have high clinical factuality and provide causal support for the final predictions, which offers a plausible explanation for the observed performance gains. These findings suggest that explicit reasoning can improve:
- Model performance.
- Interpretability of AI-assisted CXR interpretation.
- Clinical utility in CXR interpretation.
In conclusion, CheXOne advances AI-assisted chest X-ray interpretation by making explicit how visual evidence leads to radiographic findings and diagnostic predictions. The reported results suggest that explicit reasoning can improve performance while giving radiologists reasoning traces they can inspect, which may in turn support their decision-making.
