GUIDE: Accurate GUI Agent Evaluation with Hierarchical Diagnosis

GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

Source: arXiv:2604.04399v1

Abstract: Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence—a strategy that proves unreliable on long-horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development. We introduce GUIDE (GUI Understanding and Interpretable Diagnostic Evaluation), a framework that decomposes trajectory assessment into three sequential stages mirroring the compositional structure of GUI tasks.

Framework Overview

GUIDE splits the assessment of a GUI agent's trajectory into smaller, bounded components instead of judging the whole trajectory at once. The framework has three stages, applied in sequence:

Trajectory Segmentation: The full trajectory is partitioned into semantically coherent subtask units, so that agent performance can be analyzed at a finer granularity than the whole task.
Subtask Diagnosis: Each unit is evaluated in context. The evaluator assigns a completion verdict and produces a structured error analysis with corrective recommendations, which shows where and why the agent failed.
Overall Summary: The subtask diagnoses are aggregated into a task-level judgment. The summary gives an overall assessment of the agent’s performance while retaining the subtask-level findings.
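
The three stages above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the authors' implementation: the names `Step`, `Diagnosis`, `segment_trajectory`, `diagnose_subtasks`, `summarize`, and `toy_judge` are hypothetical, the per-step `subgoal` label stands in for whatever segmentation method the paper uses, and the aggregation rule (a task succeeds only if every subtask is completed) is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    action: str       # e.g. "click(add_to_cart)"
    observation: str  # e.g. a short description of the resulting screen
    subgoal: str      # hypothetical label; stands in for a real segmenter


@dataclass
class Diagnosis:
    subtask: str
    completed: bool
    error: Optional[str] = None           # what went wrong, if anything
    recommendation: Optional[str] = None  # suggested corrective action


# A judge sees the overall task (its context) plus one bounded segment.
Judge = Callable[[str, List[Step]], Diagnosis]


def segment_trajectory(steps: List[Step]) -> List[List[Step]]:
    """Stage 1: group consecutive steps that share a subgoal label."""
    segments: List[List[Step]] = []
    for step in steps:
        if segments and segments[-1][-1].subgoal == step.subgoal:
            segments[-1].append(step)
        else:
            segments.append([step])
    return segments


def diagnose_subtasks(task: str, segments: List[List[Step]],
                      judge: Judge) -> List[Diagnosis]:
    """Stage 2: judge each segment separately, passing the task as context."""
    return [judge(task, segment) for segment in segments]


def summarize(diagnoses: List[Diagnosis]) -> dict:
    """Stage 3: aggregate. Assumed rule: success only if all subtasks completed."""
    failed = [d for d in diagnoses if not d.completed]
    return {
        "success": not failed,
        "failed_subtasks": [d.subtask for d in failed],
        "recommendations": [d.recommendation for d in failed if d.recommendation],
    }


def toy_judge(task: str, segment: List[Step]) -> Diagnosis:
    """Stand-in for a model-based judge: a segment fails if its final
    observation mentions an error. The task string is unused here."""
    name = segment[0].subgoal
    last = segment[-1].observation
    if "error" in last.lower():
        return Diagnosis(name, False, error=last,
                         recommendation=f"Re-check the page state during '{name}'.")
    return Diagnosis(name, True)


task = "Buy a USB cable"
trajectory = [
    Step("type('usb cable')", "search results shown", "search"),
    Step("click(first_result)", "product page shown", "search"),
    Step("click(add_to_cart)", "Error: item out of stock", "add_to_cart"),
]
segments = segment_trajectory(trajectory)
report = summarize(diagnose_subtasks(task, segments, toy_judge))
print(report)
```

Keeping the judge behind a plain callable means the toy rule could be swapped for a model call without touching the segmentation or aggregation code.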

Benefits of GUIDE

By focusing on bounded subtask segments rather than entire trajectories, GUIDE mitigates the context overload that often hampers existing evaluators as task complexity increases. This approach allows for:

Improved accuracy in evaluations, particularly for long-horizon tasks.
Enhanced interpretability of the evaluation results, offering insights into specific areas of failure.
Structured diagnostic reports that directly inform agent improvement, making the evaluation process a valuable tool for developers.
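
To make the context-overload point concrete, the sketch below compares how much text a single holistic judgment would have to read with how much the largest subtask segment contains. The trajectory is invented for illustration, `context_size` is a hypothetical helper, and a whitespace word count is only a crude proxy for real prompt length.

```python
from typing import List


def context_size(steps: List[str]) -> int:
    """Crude proxy for prompt length: total whitespace-separated words."""
    return sum(len(step.split()) for step in steps)


# Hypothetical trajectory, already split into subtask segments.
segments = [
    ["open shop homepage", "type 'usb cable' in search box", "press enter"],
    ["click first result", "select colour black"],
    ["click add to cart", "open cart", "click checkout"],
]

# A holistic evaluator reads every step at once.
holistic = context_size([step for seg in segments for step in seg])

# A segment-level evaluator never reads more than the largest segment.
largest_segment = max(context_size(seg) for seg in segments)

print(f"holistic context: {holistic} words")
print(f"largest segment:  {largest_segment} words")
```

The holistic figure grows with the length of the whole trajectory, while the per-segment figure depends only on how long an individual subtask is.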

Validation and Performance

GUIDE was validated on three distinct benchmarks:

An industrial e-commerce dataset comprising 932 trajectories.
AGENTREWARDBENCH, which includes five web agent tasks with a total of 1302 trajectories.
AndroidBench, designed for mobile device control evaluation.

Across all testing environments, GUIDE demonstrated substantial performance improvements over existing evaluators, achieving up to 5.35 percentage points higher accuracy than the strongest baseline. The structured diagnostic reports produced by GUIDE not only enhance the evaluation process but also provide actionable insights for agent improvement.
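
A gain stated in percentage points is the absolute difference between two accuracies, not a relative improvement. The sketch below shows the arithmetic on invented verdicts. The labels and both evaluators' outputs are toy data, not results from the paper, and `accuracy` is a hypothetical helper.

```python
from typing import Sequence


def accuracy(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of trajectories where the evaluator agrees with the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data: gold success labels for ten trajectories and two evaluators' verdicts.
gold      = [True, False, True, True,  False, True, False, True, True,  False]
baseline  = [True, True,  True, False, False, True, False, True, False, False]
candidate = [True, False, True, True,  False, True, False, True, False, False]

baseline_acc = accuracy(baseline, gold)
candidate_acc = accuracy(candidate, gold)

# Absolute gap between the two accuracies, in percentage points.
gap_pp = (candidate_acc - baseline_acc) * 100

print(f"baseline:  {baseline_acc:.1%}")
print(f"candidate: {candidate_acc:.1%}")
print(f"gap:       {gap_pp:.2f} percentage points")
```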

Conclusion

In summary, GUIDE replaces a single holistic verdict with segment-level diagnosis followed by task-level aggregation. This targets the two weaknesses the authors identify in existing evaluators: unreliable judgments on long-horizon tasks and binary verdicts that do not show where or why an agent failed. The resulting diagnostic reports are meant to feed directly into agent development.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
