Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data
arXiv:2603.23515v1 (cross-listed)
Abstract
Accurate, reliable medical coding reduces clinician burnout and improves revenue cycle processes, which in turn frees healthcare providers to spend more time on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains difficult because medical records are heterogeneous, coding guidelines are complex, and clinical data follow long-tail distributions.
Large language models (LLMs) have been proposed as tools to assist or even automate certain medical coding tasks. However, existing foundation models have not been explicitly trained for medical coding, and attempts at zero-shot coding have yielded subpar results. In this study, we investigate the potential of adapting a modern open-weight foundation model for expert-level medical coding tasks, utilizing privacy-preserving synthetic training data derived from electronic health records (EHRs).
Methodology
To explore this, we fine-tuned the Llama 3-70B model using pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies. Following this, we evaluated the model’s ability to predict exact codes for ICD-10-CM and CPT.
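As an illustration of what template-derived training pairs might look like, the following is a minimal sketch and not the authors' pipeline. The `Template` class, the two templates, and the `make_corpus` helper are all hypothetical; the codes (E11.9 for type 2 diabetes mellitus without complications, I10 for essential hypertension) are used only as examples. The idea is that the gold codes are fixed by the template, so no real patient record ever enters the training set.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical note template paired with the codes a coding policy assigns to it."""
    text: str      # note text with {placeholders}
    codes: tuple   # gold codes for any note rendered from this template


# Illustrative templates only; a real system would derive these from EHR statistics.
TEMPLATES = [
    Template(
        "{age}-year-old {sex} with type 2 diabetes mellitus without "
        "complications, seen for routine follow-up.",
        ("E11.9",),
    ),
    Template(
        "{age}-year-old {sex} with essential hypertension; blood pressure "
        "reviewed and medication continued.",
        ("I10",),
    ),
]


def make_pair(rng):
    """Render one synthetic (note, gold codes) pair from a randomly chosen template."""
    template = rng.choice(TEMPLATES)
    note = template.text.format(
        age=rng.randint(40, 90),
        sex=rng.choice(["man", "woman"]),
    )
    return {"note": note, "codes": list(template.codes)}


def make_corpus(n, seed=0):
    """Build n synthetic pairs; a fixed seed makes the corpus reproducible."""
    rng = random.Random(seed)
    return [make_pair(rng) for _ in range(n)]


if __name__ == "__main__":
    for pair in make_corpus(3):
        print(pair["codes"], "|", pair["note"])
```

Each resulting pair can then be formatted as a prompt (the note) and a target (the code list) for supervised fine-tuning.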
Results
- A zero-shot baseline using the unadapted model achieved an F1 score of 0.18 for exact code matching.
- After fine-tuning on the synthetic training corpus, the exact-match F1 score exceeded 0.70, a large absolute gain over the zero-shot baseline across both coding systems.
- Performance remained strong even in complex coding categories that typically require multi-step clinical reasoning and code composition.
- In particular, categories such as Advanced Illness and Frailty showed sustained high performance.
- The model also retained its performance on medical comprehension tasks, suggesting that fine-tuning for coding did not erode its general medical knowledge.
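The exact-match F1 reported above can be computed along the lines of the sketch below. The summary does not say whether the score is micro- or macro-averaged, so micro-averaging over all notes is an assumption here, and the function name `exact_match_f1` is hypothetical. A predicted code counts as correct only if it is string-identical to a gold code, so a near miss within the same code family earns no credit.

```python
def exact_match_f1(gold, pred):
    """Micro-averaged F1 over per-note code sets (averaging scheme is an assumption).

    gold, pred: lists of code lists, one entry per clinical note.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same notes")

    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, pred):
        g, p = set(gold_codes), set(pred_codes)
        tp += len(g & p)   # codes predicted exactly right
        fp += len(p - g)   # predicted codes not in the gold set
        fn += len(g - p)   # gold codes the model missed

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative codes only: two ICD-10-CM codes and two CPT codes.
    gold = [["E11.9", "I10"], ["99213"]]
    pred = [["E11.9"], ["99213", "99214"]]
    print(round(exact_match_f1(gold, pred), 3))  # 0.667
```

In the usage example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and F1 is 2/3.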
Conclusion
The findings suggest that synthetic, policy-aware data can teach a general-purpose large language model to assist in precise medical coding while protecting sensitive health information. This approach offers a practical and safe way to train coding agents iteratively on specific tasks that mirror real-world population needs.
In summary, adapting Llama 3-70B for medical coding with privacy-preserving synthetic data points toward further advances in automated coding, with potential benefits for healthcare delivery and clinician workload.
