EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks
arXiv:2511.08206v4
Structured Electronic Health Record (EHR) data stores patient information in relational tables and is essential for clinical decision-making. The rise of large language models (LLMs) has raised interest in whether they can process this structured data effectively. Recent studies have applied LLMs to a range of clinical applications, but one obstacle remains: without standardized evaluation frameworks and well-defined tasks, it is hard to assess and compare LLM performance on structured EHR data systematically.
To address this gap, we present EHRStruct, a benchmark designed to evaluate LLMs on structured EHR tasks. EHRStruct defines 11 representative tasks that cover a wide range of clinical needs. It also includes 2,200 task-specific evaluation samples drawn from two widely used EHR datasets.
Key Features of EHRStruct
- Representative Tasks: EHRStruct defines 11 tasks that reflect the diverse needs of clinical practice, ensuring a thorough evaluation of LLM capabilities.
- Evaluation Samples: The benchmark comprises 2,200 task-specific evaluation samples drawn from two widely used EHR datasets.
- Model Evaluation: We employed EHRStruct to assess 20 advanced LLMs, which include both general and medical models, thus offering a comparative analysis across different architectures.
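To make the evaluation setup concrete, here is a minimal sketch of how per-task scores might be aggregated over benchmark samples. The task names, the record layout, and the exact-match metric are assumptions made for illustration; the summary above does not specify the metrics EHRStruct uses.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute exact-match accuracy per task.

    records: iterable of (task_name, prediction, gold) triples.
    Exact match after trimming and lowercasing is an assumed metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, pred, gold in records:
        total[task] += 1
        if str(pred).strip().lower() == str(gold).strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Hypothetical predictions for two made-up task names.
records = [
    ("filtering", "3", "3"),
    ("filtering", "5", "4"),
    ("aggregation", "72", "72"),
]
scores = per_task_accuracy(records)
print(scores)
```

Reporting scores per task rather than as one pooled number keeps easy and hard tasks from masking each other when models are compared.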
Analysis and Findings
In our evaluation, we examined several critical factors that influence model performance, including:
- Input Formats: Different input representations can significantly affect how well LLMs understand and process EHR data.
- Few-shot Generalization: The ability of models to generalize from limited examples was a focal point, shedding light on their adaptability to varied clinical scenarios.
- Finetuning Strategies: We explored various finetuning approaches to assess their impact on enhancing model performance in structured data reasoning.
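The first two factors can be illustrated with a short sketch: the same table rows serialized in three ways, then wrapped in a zero-shot or few-shot prompt. The specific formats (Markdown table, JSON lines, natural-language sentences), the column names, and the prompt template are assumptions for illustration, not the formats the benchmark necessarily uses.

```python
import json

# Hypothetical lab-result rows; column names are illustrative only.
rows = [
    {"patient_id": 1, "lab": "glucose", "value": 90},
    {"patient_id": 2, "lab": "glucose", "value": 110},
]


def to_markdown(rows):
    """Serialize rows as a Markdown table."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for r in rows:
        lines.append("| " + " | ".join(str(r[h]) for h in headers) + " |")
    return "\n".join(lines)


def to_json_lines(rows):
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in rows)


def to_sentences(rows):
    """Serialize rows as natural-language sentences."""
    return "\n".join(
        f"Patient {r['patient_id']} has {r['lab']} value {r['value']}."
        for r in rows
    )


def build_prompt(table_text, question, examples=()):
    """Build a prompt; each example is a (table, question, answer) triple."""
    parts = []
    for ex_table, ex_question, ex_answer in examples:
        parts.append(
            f"Table:\n{ex_table}\nQuestion: {ex_question}\nAnswer: {ex_answer}\n"
        )
    parts.append(f"Table:\n{table_text}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)


question = "How many patients have a glucose value above 100?"
zero_shot = build_prompt(to_markdown(rows), question)
one_shot = build_prompt(
    to_markdown(rows),
    question,
    examples=[(to_markdown(rows[:1]), question, "0")],
)
print(one_shot)
```

Holding the question fixed while swapping the serializer isolates the effect of input format, and varying the number of `examples` does the same for few-shot behavior.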
We also compared against 11 state-of-the-art LLM-based enhancement methods for structured data reasoning. The results show that many structured EHR tasks remain difficult and place high demands on the understanding and reasoning abilities of LLMs.
Introducing EHRMaster
Based on these findings, we propose EHRMaster, a code-augmented method that achieves state-of-the-art performance and offers practical insights to guide future research. It illustrates how pairing LLMs with code can strengthen their reasoning over structured EHR data.
As healthcare data systems evolve, benchmarks like EHRStruct and methods such as EHRMaster can help direct LLM research toward reliable processing of structured clinical data.
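The general idea behind code augmentation can be sketched as follows: instead of asking the model to compute an answer directly from serialized text, the model writes a small program that is then executed over the table. This is a generic illustration under stated assumptions, not EHRMaster's actual pipeline, which the summary does not describe. The function `fake_model_generate_code` is a hypothetical stand-in for an LLM call.

```python
# Hypothetical lab-result rows; column names are illustrative only.
rows = [
    {"patient_id": 1, "lab": "glucose", "value": 90},
    {"patient_id": 2, "lab": "glucose", "value": 110},
    {"patient_id": 2, "lab": "sodium", "value": 140},
]


def fake_model_generate_code(question):
    """Stand-in for an LLM call (assumption): returns source code that
    defines answer(rows). A real system would prompt a model here."""
    return (
        "def answer(rows):\n"
        "    values = [r['value'] for r in rows if r['lab'] == 'glucose']\n"
        "    return sum(values) / len(values)\n"
    )


def run_generated_code(code, rows):
    """Execute generated code with a reduced set of builtins.

    Limiting builtins is NOT a security sandbox; real deployments
    would need proper isolation before running model-written code.
    """
    env = {"__builtins__": {"sum": sum, "len": len}}
    exec(code, env)
    return env["answer"](rows)


code = fake_model_generate_code("What is the mean glucose value?")
result = run_generated_code(code, rows)
print(result)
```

Delegating the arithmetic to executed code means the numeric result is computed exactly from the table, so errors can only come from the generated program's logic rather than from the model's mental arithmetic.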
