CDMT-EHR: Continuous-Time Diffusion for Synthetic EHR Data

CDMT-EHR: A Continuous-Time Diffusion Framework for Generating Mixed-Type Time-Series Electronic Health Records

Source: arXiv:2603.23719v1 (cross-listed)

Abstract

Electronic health records (EHRs) are invaluable for clinical research, yet privacy concerns severely restrict data sharing. Synthetic data generation offers a promising solution, but EHRs present unique challenges: they contain both numerical and categorical features that evolve over time. While diffusion models have demonstrated strong performance in EHR synthesis, existing approaches predominantly rely on discrete-time formulations, which suffer from finite-step approximation errors and coupled training-sampling step counts.

Introduction

Access to electronic health records (EHRs) is crucial for advancing clinical research and improving patient care. However, health data is sensitive, and the resulting privacy constraints restrict data sharing among researchers and institutions. Synthetic data generation has emerged as a viable alternative: researchers work with data that reflects real-world patterns without exposing actual patient records.

Challenges in EHR Synthesis

Generating synthetic EHR data is challenging because the records mix numerical and categorical features that change over time. Existing diffusion-based approaches predominantly rely on discrete-time formulations, which incur finite-step approximation errors and tie the number of sampling steps to the number of steps used in training. These limitations motivate a formulation in which diffusion time is treated as continuous.

Proposed Framework

We propose a continuous-time diffusion framework for generating mixed-type time-series EHRs, which introduces several key innovations:

  • Continuous-Time Diffusion: The diffusion process is defined in continuous time, and a bidirectional gated recurrent unit (GRU) backbone captures the temporal dependencies inherent in EHR data.
  • Unified Gaussian Diffusion: By employing learnable continuous embeddings for categorical variables, we enable joint cross-feature modeling, enhancing the model’s ability to generate realistic and coherent synthetic EHRs.
  • Factorized Learnable Noise Schedule: This component adapts to the learning difficulties associated with each feature at each time step, improving the overall efficacy of the generation process.
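
The second and third components can be illustrated with a minimal sketch. It is not the paper's implementation: the embedding table `EMBED`, the feature names, and the additive form of the log-SNR (a base curve plus a per-feature offset plus a per-position offset) are all assumptions made for illustration. In a real model the embeddings and offsets would be learned, and the exact factorization may differ.

```python
import math
import random

# Hypothetical 2-d embedding table for one categorical feature (heart rhythm).
# In a trained model these vectors would be learnable parameters.
EMBED = {"sinus": [1.0, 0.0], "afib": [0.0, 1.0], "paced": [-1.0, -1.0]}


def encode_step(numeric, category):
    """Concatenate numeric values with the category's continuous embedding."""
    return list(numeric) + list(EMBED[category])


def decode_category(vec):
    """Map a (denoised) embedding back to the nearest category."""
    def sq_dist(name):
        return sum((a - b) ** 2 for a, b in zip(vec, EMBED[name]))
    return min(EMBED, key=sq_dist)


def alpha_sigma(t, feature_offset=0.0, position_offset=0.0):
    """Variance-preserving coefficients from an assumed factorized log-SNR.

    Assumed form: log_snr = base(t) + feature_offset + position_offset,
    where base(t) falls linearly from 10 to -10 over t in [0, 1].
    """
    log_snr = (10.0 - 20.0 * t) + feature_offset + position_offset
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-log_snr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(log_snr)))
    return alpha, sigma


def noise_sequence(x0, t, feature_offsets, position_offsets, rng):
    """Apply Gaussian noise at continuous time t to every entry of x0."""
    noisy = []
    for pos, row in enumerate(x0):
        out = []
        for feat, value in enumerate(row):
            alpha, sigma = alpha_sigma(
                t, feature_offsets[feat], position_offsets[pos]
            )
            out.append(alpha * value + sigma * rng.gauss(0.0, 1.0))
        noisy.append(out)
    return noisy


# Two positions in a record, each with one numeric value and one category.
x0 = [encode_step([0.3], "sinus"), encode_step([1.2], "afib")]
rng = random.Random(0)
x_t = noise_sequence(
    x0,
    t=0.0,
    feature_offsets=[0.0, 0.5, 0.5],
    position_offsets=[0.0, -0.5],
    rng=rng,
)
```

Because categorical values live in the same continuous space as numerical ones, a single Gaussian noising step covers both, and the offsets let each feature and position follow its own noise level at a given diffusion time.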

Experimental Results

We conducted extensive experiments on two large-scale intensive care unit datasets to evaluate the performance of our proposed framework. The results demonstrate that our method outperforms existing approaches in several critical areas:

  • Downstream Task Performance: Our framework consistently yields better results in predictive tasks compared to baseline methods.
  • Distribution Fidelity: The synthetic data generated by our method closely resembles the distribution of real EHR data.
  • Discriminability: The synthetic records are harder to distinguish from real ones, which is the usual sense of this metric in synthetic-data evaluation.
  • Efficiency: Our method requires only 50 sampling steps, compared with 1,000 for baseline methods, a 20-fold reduction.
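
The efficiency point follows from the continuous-time formulation: the noise schedule is a function of any t in [0, 1], so the number of sampling steps can be chosen at inference time. The sketch below illustrates this on a toy one-dimensional problem and is not the paper's sampler. It assumes a cosine schedule, a deterministic DDIM-style update, and a closed-form denoiser for Gaussian data that stands in for a trained network; `MU`, `S`, and all function names are hypothetical.

```python
import math

MU, S = 2.0, 0.5  # toy data distribution N(MU, S^2); hypothetical values


def schedule(t):
    """Continuous-time variance-preserving schedule for any t in [0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def denoise(x_t, t):
    """Closed-form E[x0 | x_t] for Gaussian data (stand-in for a network)."""
    alpha, sigma = schedule(t)
    var_t = alpha ** 2 * S ** 2 + sigma ** 2
    return MU + alpha * S ** 2 / var_t * (x_t - alpha * MU)


def sample(z, n_steps):
    """Deterministic DDIM-style sampler from t=1 to t=0 in n_steps steps."""
    x = z  # at t=1 the state is essentially pure noise
    for i in range(n_steps, 0, -1):
        t, s = i / n_steps, (i - 1) / n_steps
        alpha_t, sigma_t = schedule(t)
        alpha_s, sigma_s = schedule(s)
        x0_hat = denoise(x, t)                      # predicted clean value
        eps_hat = (x - alpha_t * x0_hat) / sigma_t  # implied noise
        x = alpha_s * x0_hat + sigma_s * eps_hat    # move to the earlier time s
    return x


# The same model is sampled on a coarse grid and on a fine grid.
coarse = sample(1.0, n_steps=50)
fine = sample(1.0, n_steps=1000)
```

For this toy case a noise draw of z = 1 should map to roughly `MU + S`. Both grids come close, and the discretization error shrinks as the step count grows, so the step count becomes a trade-off between accuracy and cost rather than a value fixed during training.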

Conclusion

The CDMT-EHR framework advances the generation of mixed-type time-series EHRs. By addressing the finite-step approximation errors and coupled step counts of discrete-time approaches, it improves data utility for clinical research while supporting privacy-preserving data sharing. Classifier-free guidance additionally enables conditional generation, which is particularly useful in class-imbalanced clinical scenarios.
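
Classifier-free guidance can be sketched in a few lines. This is a generic illustration of the technique, not code from the paper: the function names, the label-dropping helper, and the scale convention (0 gives the unconditional prediction, 1 the conditional one, larger values extrapolate towards the condition) are assumptions.

```python
import random


def guided_prediction(pred_cond, pred_uncond, guidance_scale):
    """Blend conditional and unconditional model predictions element-wise."""
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(pred_cond, pred_uncond)
    ]


def maybe_drop_label(label, drop_prob, rng):
    """Training-time step: hide the label at random so that a single
    network learns both the conditional and unconditional predictions."""
    return None if rng.random() < drop_prob else label


rng = random.Random(0)
# During training, a fraction of examples would see no label at all.
labels = [maybe_drop_label("mortality=1", 0.1, rng) for _ in range(5)]
# During sampling, the two predictions are combined with a chosen scale.
blended = guided_prediction([1.0, 2.0], [0.0, 0.0], guidance_scale=2.0)
```

A scale above 1 pushes samples further towards the requested class, which is one way to obtain more examples of a rare outcome in a class-imbalanced dataset.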


Lazarus Omolua