HealthAdminBench: Benchmarking AI in Healthcare Admin Tasks

HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

arXiv:2604.09937v1

Abstract: Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments (an EHR, two payer portals, and a fax system) and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing.

Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows.

HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.

Introduction

Healthcare administration accounts for over $1 trillion in annual spending, according to the paper. With the rise of large language models (LLMs), interest is growing in using LLM-based computer-use agents (CUAs) to streamline this work. Until now, however, no benchmark has existed for evaluating such agents on end-to-end administrative workflows, which makes progress hard to measure.

HealthAdminBench Overview

HealthAdminBench addresses this gap with a benchmark for assessing CUAs on end-to-end healthcare administration workflows. The benchmark includes:

  • Four realistic GUI environments: an Electronic Health Record (EHR) system, two payer portals, and a fax system.
  • 135 expert-defined tasks: Tasks are categorized into three types:
    • Prior Authorization
    • Appeals and Denials Management
    • Durable Medical Equipment (DME) Order Processing
  • 1,698 evaluation points: Each task is broken down into subtasks that can be independently verified.
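
To make the task and subtask structure concrete, here is a minimal sketch of how a task with independently verifiable subtasks could be represented. The class names, state keys, and checks are all hypothetical and chosen for illustration; they are not the benchmark's actual schema. The last lines only divide the two reported totals to get the average number of evaluation points per task.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical structures for illustration only; not the benchmark's real schema.
State = Dict[str, str]


@dataclass
class Subtask:
    description: str
    check: Callable[[State], bool]  # verifier applied to the final environment state


@dataclass
class Task:
    task_type: str
    subtasks: List[Subtask] = field(default_factory=list)

    def evaluate(self, state: State) -> List[bool]:
        """Return one pass/fail result per subtask (one evaluation point each)."""
        return [s.check(state) for s in self.subtasks]


# Toy Prior Authorization task with three invented checks.
prior_auth = Task(
    task_type="Prior Authorization",
    subtasks=[
        Subtask("Opened the patient chart in the EHR",
                lambda s: s.get("chart_opened") == "yes"),
        Subtask("Entered the procedure code in the payer portal",
                lambda s: s.get("procedure_code") == "12345"),
        Subtask("Submitted the authorization request",
                lambda s: s.get("request_status") == "submitted"),
    ],
)

# Invented final state: the agent filled in the form but never submitted it.
final_state = {"chart_opened": "yes", "procedure_code": "12345", "request_status": "draft"}
results = prior_auth.evaluate(final_state)
print(results)  # [True, True, False]

# Average evaluation points per task, from the totals reported above.
avg_subtasks_per_task = 1698 / 135
print(round(avg_subtasks_per_task, 1))  # 12.6
```

In this toy run the agent passes two of three checks yet leaves the request unsubmitted, which is the kind of partial completion that subtask-level scoring is meant to expose.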

Evaluation Findings

The paper evaluates seven agent configurations under multiple prompting and observation settings. Agents often performed well on individual subtasks but rarely completed entire workflows. Key findings include:

  • The best-performing agent, Claude Opus 4.6 CUA, achieved only 36.3% end-to-end task success.
  • GPT-5.4 CUA had the highest subtask success rate at 82.8%, so the leading agent differs depending on whether individual subtasks or whole tasks are scored.
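
The two metrics in these findings can diverge sharply, and a small sketch shows why. The code below assumes that a task counts as successful only when every one of its subtasks passes; the source does not spell out the exact scoring rule, so treat that as an assumption. The function names and the toy data are invented for illustration.

```python
from typing import List


def subtask_success_rate(results: List[List[bool]]) -> float:
    """Fraction of all subtask checks that passed, pooled across tasks."""
    total = sum(len(task) for task in results)
    passed = sum(sum(task) for task in results)
    return passed / total


def task_success_rate(results: List[List[bool]]) -> float:
    """Fraction of tasks in which every subtask passed.

    Assumption: all-or-nothing scoring at the task level.
    """
    return sum(all(task) for task in results) / len(results)


# Toy data: 4 tasks with 5 subtasks each; three tasks have a single failure.
toy_results = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [True, True, True, True, False],
    [False, True, True, True, True],
]

print(subtask_success_rate(toy_results))  # 0.85
print(task_success_rate(toy_results))     # 0.25
```

Under this assumption a single failed step sinks the whole task, so a high subtask rate can coexist with a much lower task rate, and the gap tends to widen as tasks contain more subtasks.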

Conclusion

HealthAdminBench gives researchers and developers a way to measure how well computer-use agents handle healthcare administration. The findings show that current agents are not yet reliable enough for end-to-end administrative workflows, even when most individual subtasks succeed. Benchmarks like HealthAdminBench offer a foundation for tracking progress toward safe and reliable automation of this work.


Lazarus Omolua (https://richlyai.com/blog)