CDMT-EHR: A Continuous-Time Diffusion Framework for Generating Mixed-Type Time-Series Electronic Health Records
arXiv:2603.23719v1
Abstract
Electronic health records (EHRs) are invaluable for clinical research, yet privacy concerns severely restrict data sharing. Synthetic data generation offers a promising solution, but EHRs present unique challenges: they contain both numerical and categorical features that evolve over time. While diffusion models have demonstrated strong performance in EHR synthesis, existing approaches predominantly rely on discrete-time formulations, which suffer from finite-step approximation errors and coupled training-sampling step counts.
Introduction
In the field of healthcare, the ability to access and utilize electronic health records (EHRs) is crucial for advancing clinical research and improving patient care. However, the sensitive nature of health data poses significant privacy challenges, leading to restrictions on data sharing among researchers and institutions. To address these challenges, synthetic data generation has emerged as a viable alternative, allowing researchers to work with data that reflects real-world scenarios without compromising patient privacy.
Challenges in EHR Synthesis
Generating synthetic EHR data is particularly challenging because these records combine numerical and categorical features that evolve over time. Existing diffusion-based approaches predominantly rely on discrete-time formulations, which suffer from finite-step approximation errors and tie the number of sampling steps to the number of steps used in training. These limitations motivate a formulation that does not fix the step count in advance.
Proposed Framework
We propose a continuous-time diffusion framework for generating mixed-type time-series EHRs, which introduces several key innovations:
- Continuous-Time Diffusion: The diffusion process is formulated in continuous time, and a bidirectional gated recurrent unit (GRU) backbone captures the temporal dependencies inherent in EHR data.
- Unified Gaussian Diffusion: Categorical variables are mapped to learnable continuous embeddings, so numerical and categorical features share a single Gaussian diffusion process. This enables joint cross-feature modeling and more coherent synthetic EHRs.
- Factorized Learnable Noise Schedule: This component adapts to the learning difficulties associated with each feature at each time step, improving the overall efficacy of the generation process.
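To make the second and third components concrete, the following is a minimal NumPy sketch of the forward (noising) step only, not the authors' implementation. It assumes a variance-preserving parameterization in which the log signal-to-noise ratio is a shared base curve plus one offset per feature and one per time step. The linear base curve, the offset values, the toy dimensions, the nearest-embedding decoding rule, and all names (`log_snr`, `forward_noise`, `decode_categories`) are illustrative assumptions. In a trained model the embedding table and offsets would be learned, and a bidirectional GRU denoiser (omitted here) would predict the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 patients, 6 time steps, 2 numerical features,
# 1 categorical feature with 3 classes embedded in 2 dimensions.
B, S, N_NUM, N_CLASSES, EMB_DIM = 4, 6, 2, 3, 2

x_num = rng.normal(size=(B, S, N_NUM))          # numerical features
x_cat = rng.integers(0, N_CLASSES, size=(B, S))  # categorical codes

# Embedding table: learnable in a real model, randomly initialised here.
emb_table = rng.normal(size=(N_CLASSES, EMB_DIM))

# Unified continuous representation: numerical values next to embeddings.
x0 = np.concatenate([x_num, emb_table[x_cat]], axis=-1)  # (B, S, D)
D = x0.shape[-1]

# Assumed factorization: log-SNR = base(t) + feature offset + step offset.
# These offsets stand in for learnable parameters.
feat_offset = np.array([0.5, -0.5, 0.0, 0.0])
step_offset = np.linspace(-0.3, 0.3, S)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def log_snr(t, gamma_max=6.0, gamma_min=-6.0):
    """Log-SNR for diffusion times t in [0, 1]; returns shape (B, S, D)."""
    base = gamma_max + (gamma_min - gamma_max) * t  # (B,)
    return (base[:, None, None]
            + step_offset[None, :, None]
            + feat_offset[None, None, :])


def forward_noise(x0, t):
    """Sample z_t = alpha * x0 + sigma * eps with alpha^2 + sigma^2 = 1."""
    lam = log_snr(t)
    alpha = np.sqrt(sigmoid(lam))
    sigma = np.sqrt(sigmoid(-lam))
    eps = rng.normal(size=x0.shape)
    return alpha * x0 + sigma * eps, eps, alpha, sigma


def decode_categories(z_cat):
    """Map embedding-space values back to the nearest class (an assumption)."""
    dists = ((z_cat[..., None, :] - emb_table) ** 2).sum(axis=-1)  # (B, S, C)
    return dists.argmin(axis=-1)


t = rng.uniform(size=B)  # one continuous diffusion time per record
z_t, eps, alpha, sigma = forward_noise(x0, t)
```

Because every feature lives in the same continuous space, a single denoiser can see numerical and categorical channels together, while the offsets let each channel and each time step be noised at its own rate.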
Experimental Results
We conducted extensive experiments on two large-scale intensive care unit datasets to evaluate the performance of our proposed framework. The results demonstrate that our method outperforms existing approaches in several critical areas:
- Downstream Task Performance: Our framework consistently yields better results in predictive tasks compared to baseline methods.
- Distribution Fidelity: The synthetic data generated by our method closely resembles the distribution of real EHR data.
- Discriminability: Records generated by our method are harder to distinguish from real EHR data than those produced by baseline methods.
- Efficiency: Notably, our method requires only 50 sampling steps, significantly fewer than the 1,000 steps needed by baseline methods.
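The efficiency result rests on a property of continuous-time formulations: the noise schedule is defined for every t in [0, 1], so the number of sampling steps can be chosen at inference time. The sketch below illustrates this on a toy problem and is not the paper's sampler. It assumes one-dimensional Gaussian data, for which the optimal noise prediction has a closed form (`oracle_eps` stands in for a trained network), a linear log-SNR schedule, and a deterministic DDIM-style update. All names and constants are hypothetical.

```python
import numpy as np

MU, S0 = 2.0, 0.5  # toy data distribution N(MU, S0^2), chosen for illustration


def alpha_sigma(t, gamma_max=6.0, gamma_min=-6.0):
    """Variance-preserving coefficients for any continuous time t in [0, 1]."""
    lam = gamma_max + (gamma_min - gamma_max) * t  # assumed linear log-SNR
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-lam)))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(lam)))
    return alpha, sigma


def oracle_eps(z, t):
    """Closed-form optimal noise prediction for Gaussian data.

    Stands in for a trained denoiser; only valid for this toy distribution.
    """
    alpha, sigma = alpha_sigma(t)
    var_t = alpha ** 2 * S0 ** 2 + sigma ** 2
    x_hat = MU + alpha * S0 ** 2 / var_t * (z - alpha * MU)
    return (z - alpha * x_hat) / sigma


def sample(n_steps, n_samples=4000, seed=0):
    """Deterministic DDIM-style sampler on a time grid chosen at call time."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n_samples)            # start from pure noise at t = 1
    ts = np.linspace(1.0, 0.0, n_steps + 1)   # grid is independent of training
    for t, s in zip(ts[:-1], ts[1:]):
        a_t, s_t = alpha_sigma(t)
        a_s, s_s = alpha_sigma(s)
        eps_hat = oracle_eps(z, t)
        x_hat = (z - s_t * eps_hat) / a_t     # implied clean-data estimate
        z = a_s * x_hat + s_s * eps_hat       # move to the next time s < t
    return z


z_50 = sample(n_steps=50)
z_1000 = sample(n_steps=1000)
```

In this toy setting both grids recover approximately the target mean and standard deviation, with the coarser grid incurring a somewhat larger discretization error. Whether 50 steps suffice on real EHR data is an empirical finding of the experiments, not something this sketch demonstrates.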
Conclusion
The CDMT-EHR framework represents a significant advancement in the generation of mixed-type time-series EHRs. By addressing the limitations of existing approaches, our method not only enhances data utility for clinical research but also preserves patient privacy. Furthermore, the implementation of classifier-free guidance facilitates effective conditional generation, particularly in class-imbalanced clinical scenarios.
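For readers unfamiliar with classifier-free guidance, the sketch below shows its two standard ingredients: randomly dropping the condition during training, and blending conditional and unconditional noise predictions at sampling time. It is a generic illustration under stated assumptions, not the paper's code. The null-label value, the drop probability, the guidance weight convention, and the function names are all hypothetical.

```python
import numpy as np

NULL_LABEL = -1  # hypothetical token meaning "no condition"


def drop_labels(labels, p_uncond, rng):
    """Training side: replace each label with NULL_LABEL with prob. p_uncond,
    so one network learns both conditional and unconditional predictions."""
    mask = rng.uniform(size=labels.shape) < p_uncond
    return np.where(mask, NULL_LABEL, labels)


def guided_eps(eps_cond, eps_uncond, w):
    """Sampling side: extrapolate from the unconditional prediction toward
    the conditional one. w = 0 is unconditional, w = 1 is purely conditional,
    and w > 1 strengthens the condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 0, 0, 0, 0, 1])  # imbalanced toy outcome labels
train_labels = drop_labels(labels, p_uncond=0.1, rng=rng)

# At sampling time the rare class can simply be requested as often as needed.
requested = np.ones(8, dtype=int)

# Stand-ins for the denoiser's two predictions on the same noisy input.
eps_cond = rng.normal(size=(8, 4))
eps_uncond = rng.normal(size=(8, 4))
eps_hat = guided_eps(eps_cond, eps_uncond, w=2.0)
```

Because the label is an input rather than a property of the training distribution, a minority outcome can be generated in any desired proportion, which is why this mechanism suits class-imbalanced clinical scenarios.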
