ESL-Bench: Synthetic Benchmark for Longitudinal Health AI

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

Summary: arXiv:2604.02834v1

Abstract: Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events – yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1-5 year trajectory comprising a health profile, a multi-phase narrative plan, daily device measurements, periodic exam records, and an event log with explicit per-indicator impact parameters.

The framework addresses these evaluation challenges by generating synthetic data in a controlled setting where the ground truth is recorded. Each health indicator follows a baseline stochastic process, and discrete events perturb it through sigmoid-onset, exponential-decay kernels subject to saturation and projection constraints. The result is a set of plausible health trajectories that can be released without the access restrictions that apply to real-world data.
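To make the kernel idea concrete, here is a minimal Python sketch of one way such a process could look. It is not the authors' implementation: the functional forms, the parameter names (`amplitude`, `onset_days`, `decay_days`, `steepness`), the use of `tanh` for saturation, clipping for projection, and independent Gaussian noise for the baseline are all assumptions made for illustration, since the summary does not give the exact formulas.

```python
import math
import random


def event_kernel(t, amplitude, onset_days=2.0, decay_days=14.0, steepness=2.0):
    """Hypothetical impact of one event, t days after it starts.

    A logistic (sigmoid) term ramps the effect up around `onset_days`,
    and an exponential term fades it out with time constant `decay_days`.
    """
    if t < 0:
        return 0.0  # the event has not happened yet
    onset = 1.0 / (1.0 + math.exp(-steepness * (t - onset_days)))
    decay = math.exp(-t / decay_days)
    return amplitude * onset * decay


def saturate(total_effect, cap):
    """Assumed saturation: keep the combined effect within (-cap, cap)."""
    return cap * math.tanh(total_effect / cap)


def project(value, lo, hi):
    """Assumed projection: clip the indicator to a plausible range."""
    return max(lo, min(hi, value))


def simulate(days, baseline, noise_sd, events, cap, lo, hi, seed=0):
    """Daily series = noisy baseline + saturated sum of event effects.

    `events` is a list of (start_day, amplitude) pairs.
    """
    rng = random.Random(seed)
    series = []
    for day in range(days):
        effect = sum(event_kernel(day - start, amp) for start, amp in events)
        value = baseline + rng.gauss(0.0, noise_sd) + saturate(effect, cap)
        series.append(project(value, lo, hi))
    return series


# Illustrative values only: two overlapping events raising resting heart rate.
resting_hr = simulate(days=60, baseline=62.0, noise_sd=1.0,
                      events=[(10, 8.0), (12, 6.0)],
                      cap=10.0, lo=40.0, hi=120.0)
```

In this sketch, overlapping events add up before saturation is applied, so two simultaneous events cannot push the indicator more than `cap` away from its baseline.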

Key Features of ESL-Bench

Synthetic User Profiles: ESL-Bench includes 100 synthetic users, each with a health profile and a trajectory spanning 1 to 5 years.
Multi-Phase Narrative Plans: Each user has a multi-phase narrative plan that outlines their health journey and ties together medical and life events.
Measurement Logs: Daily device measurements and periodic exam records capture each user’s health status at two time scales, continuous and sparse.
Event Logs: Each user’s event log records explicit per-indicator impact parameters, so a change in an indicator can be traced back to the events that drove it.
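The components listed above could be held in a record like the following. This is a hypothetical layout for illustration only: the class names, field names, and impact-parameter keys are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class Event:
    name: str
    start: date
    # indicator name -> impact parameters (keys here are hypothetical)
    impacts: Dict[str, Dict[str, float]]


@dataclass
class SyntheticUser:
    user_id: str
    profile: Dict[str, str]
    narrative_phases: List[str]
    # indicator name -> {day: value} for daily device measurements
    device_daily: Dict[str, Dict[date, float]] = field(default_factory=dict)
    # one dict per periodic exam record
    exams: List[dict] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)

    def events_affecting(self, indicator: str) -> List[Event]:
        """Events whose impact parameters mention the given indicator."""
        return [e for e in self.events if indicator in e.impacts]


# Made-up example user with a single event.
user = SyntheticUser(
    user_id="u001",
    profile={"age_band": "40-49"},
    narrative_phases=["baseline", "illness", "recovery"],
)
user.events.append(
    Event(name="flu", start=date(2024, 1, 10),
          impacts={"resting_hr": {"amplitude": 8.0, "decay_days": 14.0}})
)
```

Keeping the impact parameters on the event itself is what lets an attribution question ("which event raised this indicator?") be answered by lookup rather than inference.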

Evaluation Framework

ESL-Bench pairs each user with 100 evaluation queries across five dimensions: Lookup, Trend, Comparison, Anomaly, and Explanation. Queries are further stratified into Easy, Medium, and Hard tiers, so results can be broken down by both query type and difficulty.

Lookup: Retrieve specific data points from the health records.
Trend: Identify patterns in the data over time.
Comparison: Evaluate differences between health indicators.
Anomaly: Identify outliers or unexpected health events.
Explanation: Give the reasons behind a health outcome.
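To give a feel for the five dimensions, the sketch below computes one toy answer of each kind over a made-up daily series. The functions, thresholds, and data are illustrative assumptions; the summary does not say how the benchmark defines or scores its queries.

```python
from statistics import mean, pstdev

# Made-up daily resting heart rate and event log (name, start_day).
series = [62, 61, 63, 62, 64, 75, 63, 62, 61, 62]
events = [("flu", 4), ("vacation", 0)]


def lookup(values, day):
    """Lookup: the value recorded on a given day."""
    return values[day]


def trend(values):
    """Trend: least-squares slope in units per day."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def compare(a, b):
    """Comparison: difference in means between two series or windows."""
    return mean(a) - mean(b)


def anomalies(values, z=2.0):
    """Anomaly: days more than z standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sd > z]


def explain(day, event_log, window=3):
    """Explanation: events that began within `window` days before `day`."""
    return [name for name, start in event_log if 0 <= day - start <= window]


spike_days = anomalies(series)                     # days flagged as outliers
candidates = explain(spike_days[0], events)        # events near the first spike
```

Only the Explanation step needs the event log; the other four can be answered from the measurements alone, which is one reason attribution questions are harder to evaluate without structured ground truth.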

Performance Insights

In an evaluation of 13 methods spanning LLMs with tools, database-native agents, and memory-augmented retrieval-augmented generation (RAG), database agents substantially outperform memory RAG baselines: 48-58% accuracy versus 30-38%. Taking those ranges at face value, even the weakest database agent is about 10 percentage points ahead of the strongest memory RAG baseline. The gap is most pronounced on Comparison and Explanation queries, which require multi-hop reasoning and accurate evidence attribution.
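A per-dimension or per-tier breakdown like the one described here can be computed with a small grouping helper. The sketch below assumes each graded query is a dict with hypothetical `dimension`, `tier`, and `correct` keys; the four records are toy data, not results from the paper.

```python
from collections import defaultdict


def accuracy_by(results, key):
    """Fraction of correct answers, grouped by the given key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        correct[r[key]] += int(r["correct"])
    return {k: correct[k] / totals[k] for k in totals}


# Toy graded answers for one method (illustrative only).
results = [
    {"dimension": "Lookup", "tier": "Easy", "correct": True},
    {"dimension": "Lookup", "tier": "Hard", "correct": True},
    {"dimension": "Explanation", "tier": "Hard", "correct": False},
    {"dimension": "Explanation", "tier": "Easy", "correct": True},
]

by_dimension = accuracy_by(results, "dimension")
by_tier = accuracy_by(results, "tier")
```

Running the same helper for each method and comparing the per-dimension dictionaries is enough to see where two families of agents diverge.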

Conclusion

ESL-Bench gives researchers a structured, scalable way to evaluate longitudinal health agents. Because the data are synthetic and each event's impact on each indicator is recorded explicitly, the benchmark sidesteps the access limits of real-world data while still supporting attribution questions that have definitive answers.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
