HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
arXiv:2604.09937v1
Abstract: Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments (an EHR, two payer portals, and a fax system) and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing.
Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows.
HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
Introduction
The healthcare sector is inundated with administrative tasks, which consume a significant portion of resources and time. With the rise of large language models (LLMs), there is an increasing interest in leveraging artificial intelligence to streamline these processes. However, the lack of standardized benchmarks to evaluate computer-use agents in this domain has hampered progress.
HealthAdminBench Overview
HealthAdminBench aims to fill this void by providing a comprehensive framework for assessing CUAs across a variety of healthcare administration tasks. This benchmark includes:
- Four realistic GUI environments: an Electronic Health Record (EHR) system, two payer portals, and a fax system.
- 135 expert-defined tasks, grouped into three types:
  - Prior Authorization
  - Appeals and Denials Management
  - Durable Medical Equipment (DME) Order Processing
- 1,698 evaluation points: Each task is broken down into subtasks that can be independently verified.
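Because each task is decomposed into independently verifiable subtasks, results can be scored at two levels: per subtask and per task. The following is a minimal Python sketch of that roll-up. The `TaskResult` class, its field names, and the example task IDs are hypothetical, and the rule that a task succeeds only when every subtask passes is an assumption made for illustration, not the benchmark's documented scoring.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record of one task run and its subtask checks."""
    task_id: str
    task_type: str
    subtask_passed: list[bool] = field(default_factory=list)

    @property
    def succeeded(self) -> bool:
        # Assumption: a task counts as successful only if it has at least
        # one subtask and every subtask check passed.
        return bool(self.subtask_passed) and all(self.subtask_passed)


def subtask_success_rate(results: list[TaskResult]) -> float:
    """Fraction of all subtask checks that passed, pooled across tasks."""
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total if total else 0.0


def task_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks in which every subtask passed."""
    if not results:
        return 0.0
    return sum(r.succeeded for r in results) / len(results)


# Made-up results for three tasks, one per task type.
results = [
    TaskResult("pa-001", "Prior Authorization", [True, True, True, True]),
    TaskResult("ap-001", "Appeals and Denials Management", [True, True, False, True]),
    TaskResult("dme-001", "DME Order Processing", [True, True, True, False]),
]

print(f"subtask success: {subtask_success_rate(results):.1%}")
print(f"task success:    {task_success_rate(results):.1%}")
```

In this toy data, 10 of 12 subtask checks pass but only 1 of 3 tasks completes cleanly, which shows how the two metrics can diverge.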
Evaluation Findings
Across the seven agent configurations, evaluated under multiple prompting and observation settings, subtask performance was strong but end-to-end reliability remained low. Key findings include:
- The best-performing agent, Claude Opus 4.6 CUA, achieved only 36.3% task success.
- GPT-5.4 CUA had the highest subtask success rate at 82.8%. Even so, strong subtask performance did not translate into reliable end-to-end completion.
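One way to see why a high subtask success rate can coexist with low task success is a back-of-the-envelope compounding calculation. The sketch below assumes, purely for illustration, that every task has the benchmark-wide average number of subtasks, that a task succeeds only if all of its subtasks pass, and that subtasks pass independently with the same probability. None of these assumptions comes from the paper, and real failures are likely correlated, so the result is a rough intuition rather than a prediction of any agent's score.

```python
# Figures reported for the benchmark.
TOTAL_TASKS = 135
TOTAL_SUBTASKS = 1698

# Average number of verifiable subtasks per task.
avg_subtasks = TOTAL_SUBTASKS / TOTAL_TASKS


def compounded_task_success(p_subtask: float, n_subtasks: float) -> float:
    """Probability that all n subtasks pass, under the (assumed)
    independence and all-subtasks-must-pass rules described above."""
    return p_subtask ** n_subtasks


# Highest reported subtask success rate (82.8%).
estimate = compounded_task_success(0.828, avg_subtasks)

print(f"average subtasks per task: {avg_subtasks:.1f}")
print(f"compounded task success at 82.8% per subtask: {estimate:.1%}")
```

Under these assumptions, with about 12.6 subtasks per task, an 82.8% per-subtask pass rate compounds to roughly 9% task success. The point is only that small per-step error rates accumulate quickly over long workflows.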
Conclusion
HealthAdminBench gives researchers and developers a way to measure how well computer-use agents handle healthcare administration workflows. The findings show a substantial gap between current agent capabilities and the demands of real-world administrative work, particularly in completing tasks end to end. Benchmarks of this kind will be needed to track progress toward safe and reliable automation in this domain.
