ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents
arXiv:2604.02834v1
Abstract: Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events – yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1-5 year trajectory comprising a health profile, a multi-phase narrative plan, daily device measurements, periodic exam records, and an event log with explicit per-indicator impact parameters.
The framework addresses these evaluation challenges by generating synthetic data in a controlled setting. Each health indicator follows a baseline stochastic process that is perturbed by discrete events; each event's effect is modeled with a sigmoid-onset, exponential-decay kernel, subject to saturation and projection constraints. This yields realistic health trajectories with structured ground truth, without requiring access to real-world data.
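The indicator dynamics described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's actual formulation: the function names, the parameter names and defaults, the use of `tanh` as the saturation function, simple clipping as the projection step, and i.i.d. Gaussian noise around a constant as the baseline process.

```python
import math
import random


def event_kernel(t, onset_mid=3.0, onset_steepness=1.5,
                 decay_start=10.0, decay_rate=0.2):
    """Hypothetical impact kernel: sigmoid rise followed by exponential decay.

    t is the number of days since the event started. Returns a value in [0, 1).
    """
    if t < 0:
        return 0.0  # the event has not started yet
    rise = 1.0 / (1.0 + math.exp(-onset_steepness * (t - onset_mid)))
    decay = math.exp(-decay_rate * max(0.0, t - decay_start))
    return rise * decay


def saturate(total, cap):
    """Assumed saturation: the summed event effect can never exceed +/- cap."""
    return cap * math.tanh(total / cap)


def project(value, lo, hi):
    """Assumed projection: clip the value into a plausible range [lo, hi]."""
    return min(hi, max(lo, value))


def simulate_indicator(days, baseline, noise_sd, events, cap, lo, hi, seed=0):
    """Simulate one indicator as baseline + saturated event effects + noise."""
    rng = random.Random(seed)
    series = []
    for day in range(days):
        raw = sum(e["amplitude"] * event_kernel(day - e["start_day"])
                  for e in events)
        value = baseline + saturate(raw, cap) + rng.gauss(0.0, noise_sd)
        series.append(project(value, lo, hi))
    return series


# Example: a resting heart rate perturbed by one event starting on day 10.
resting_hr = simulate_indicator(
    days=60, baseline=62.0, noise_sd=1.0,
    events=[{"start_day": 10, "amplitude": 8.0}],
    cap=12.0, lo=40.0, hi=120.0,
)
```

Because the event parameters are stored explicitly, the contribution of each event to the indicator on any day can be recomputed exactly, which is what makes attribution questions answerable in a setup like this.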
Key Features of ESL-Bench
- Synthetic User Profiles: ESL-Bench includes 100 synthetic users, each characterized by a unique health profile and trajectory ranging from 1 to 5 years.
- Multi-Phase Narrative Plans: Each user has a comprehensive narrative plan that outlines their health journey, integrating various medical and life events.
- Data Measurement Logs: Daily device measurements and periodic exam records are included, providing a holistic view of the user’s health status.
- Event Logs: Each user has an event log detailing explicit per-indicator impact parameters, facilitating in-depth analysis of health changes over time.
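The components listed above could be organized as a simple record per user. The sketch below is one possible layout, not the benchmark's published schema: all class names, field names, and example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EventImpact:
    """Hypothetical per-indicator impact parameters for one event."""
    indicator: str      # e.g. "resting_hr"
    amplitude: float    # peak change in the indicator's units
    onset_days: float   # time scale of the sigmoid onset
    decay_days: float   # time scale of the exponential decay


@dataclass
class Event:
    name: str
    start_day: int
    impacts: List[EventImpact]


@dataclass
class UserTrajectory:
    user_id: str
    profile: Dict[str, str]                 # health profile
    narrative_phases: List[str]             # multi-phase narrative plan
    device_daily: Dict[str, List[float]]    # indicator -> daily measurements
    exams: List[Dict[str, float]]           # periodic exam records
    events: List[Event] = field(default_factory=list)

    def events_affecting(self, indicator: str) -> List[str]:
        """Names of logged events with an explicit impact on `indicator`."""
        return [e.name for e in self.events
                if any(i.indicator == indicator for i in e.impacts)]


# Illustrative values only.
user = UserTrajectory(
    user_id="user_001",
    profile={"age_group": "40-49"},
    narrative_phases=["baseline", "illness", "recovery"],
    device_daily={"resting_hr": [62.0, 63.0, 70.0]},
    exams=[{"day": 0, "systolic_bp": 120.0}],
    events=[
        Event("flu", start_day=2,
              impacts=[EventImpact("resting_hr", 8.0, 1.0, 5.0)]),
        Event("new_job", start_day=1,
              impacts=[EventImpact("sleep_hours", -1.0, 2.0, 30.0)]),
    ],
)
```

Keeping the impact parameters attached to each event, rather than only storing the resulting measurements, is what allows a query about why an indicator changed to be checked against the log.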
Evaluation Framework
ESL-Bench pairs each user with 100 evaluation queries spanning five dimensions: Lookup, Trend, Comparison, Anomaly, and Explanation. Queries are further stratified into Easy, Medium, and Hard tiers, so agents are assessed across both query type and difficulty.
- Lookup: Retrieving specific data points from the health records.
- Trend: Identifying patterns in data points over time.
- Comparison: Evaluating differences between various health indicators.
- Anomaly: Identifying outliers or unexpected health events.
- Explanation: Providing insights into the reasons behind certain health outcomes.
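With synthetic data, reference answers for several of these query types can be computed directly from the stored series. The sketch below shows hypothetical reference-answer functions for three of the five dimensions; the function names, the least-squares slope criterion, and the tolerance are assumptions for illustration, not the benchmark's actual scoring logic. Anomaly and Explanation queries are omitted here because they would also depend on the event log.

```python
from statistics import mean


def lookup(series, day):
    """Lookup: the recorded value on a given day."""
    return series[day]


def trend(series, start, end, tol=0.01):
    """Trend: label the least-squares slope over days [start, end)."""
    xs = list(range(start, end))
    ys = series[start:end]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    if slope > tol:
        return "rising"
    if slope < -tol:
        return "falling"
    return "flat"


def compare_windows(series, first, second):
    """Comparison: mean of the second window minus mean of the first.

    Each window is a (start, end) pair of day indices, end exclusive.
    """
    return mean(series[second[0]:second[1]]) - mean(series[first[0]:first[1]])


# Illustrative series: stable for four days, then a steady increase.
series = [60, 60, 60, 60, 62, 64, 66, 68]
```

For example, `trend(series, 0, 4)` labels the first four days as flat, while `trend(series, 3, 8)` labels the later window as rising.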
Performance Insights
In an evaluation of 13 methods spanning LLMs with tools, database-native agents, and memory-augmented RAG (retrieval-augmented generation), database agents substantially outperform memory RAG baselines: 48-58% accuracy versus 30-38%. The gap is concentrated in Comparison and Explanation queries, which require multi-hop reasoning and accurate evidence attribution.
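A per-dimension breakdown like the one described above amounts to grouping graded answers by query dimension. The sketch below assumes a hypothetical input format of `(dimension, is_correct)` pairs; the toy outcomes are made up purely to exercise the function and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_dimension(results):
    """Map each dimension to the fraction of its queries answered correctly.

    `results` is an iterable of (dimension, is_correct) pairs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, is_correct in results:
        total[dimension] += 1
        correct[dimension] += int(is_correct)
    return {d: correct[d] / total[d] for d in total}


# Toy, made-up outcomes for illustration only; NOT results from the paper.
toy_results = [
    ("Lookup", True), ("Lookup", True),
    ("Comparison", True), ("Comparison", False),
    ("Explanation", False), ("Explanation", False),
]
scores = accuracy_by_dimension(toy_results)
```

Reporting accuracy per dimension, rather than a single overall number, is what makes a concentrated gap on particular query types visible.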
Conclusion
ESL-Bench gives researchers a structured, scalable way to evaluate longitudinal health agents and to compare health reasoning methods. Because it relies on synthetic data, it avoids the limitations of real-world data access, supporting future work in health informatics and AI-driven health assessment.
