GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
arXiv:2604.04399v1
Abstract: Evaluating GUI agents presents a distinct challenge: trajectories are long, visually grounded, and open-ended, yet evaluation must be both accurate and interpretable. Existing approaches typically apply a single holistic judgment over the entire action-observation sequence—a strategy that proves unreliable on long-horizon tasks and yields binary verdicts offering no insight into where or why an agent fails. This opacity limits the utility of evaluation as a diagnostic tool for agent development. We introduce GUIDE (GUI Understanding and Interpretable Diagnostic Evaluation), a framework that decomposes trajectory assessment into three sequential stages mirroring the compositional structure of GUI tasks.
Framework Overview
GUIDE replaces a single holistic judgment over the whole trajectory with three sequential stages:
- Trajectory Segmentation: This stage partitions the full trajectory into semantically coherent subtask units, so that agent performance can be analyzed unit by unit rather than all at once.
- Subtask Diagnosis: This stage evaluates each unit in context, assigning a completion verdict and generating a structured error analysis that includes corrective recommendations. It is the stage that shows where and why an agent fails.
- Overall Summary: The final stage aggregates the findings from the subtask diagnoses into a comprehensive task-level judgment. This summary provides an overall assessment of the agent’s performance while retaining the detailed insights from earlier evaluations.
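The three stages above can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: the names `Step`, `SubtaskReport`, `TaskReport`, `boundary_fn`, and `judge_fn` are hypothetical, the two callbacks stand in for whatever model performs segmentation and diagnosis, and the aggregation rule (the task succeeds only if every subtask completed) is an assumption, since the source says only that the findings are aggregated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    action: str
    observation: str


@dataclass
class SubtaskReport:
    goal: str
    steps: List[Step]
    completed: bool
    error_analysis: str = ""
    recommendation: str = ""


@dataclass
class TaskReport:
    success: bool
    subtasks: List[SubtaskReport]


def segment(trajectory: List[Step],
            boundary_fn: Callable[[Step, Step], bool]) -> List[List[Step]]:
    """Stage 1: split the trajectory into contiguous subtask units.

    boundary_fn(previous, current) returns True when `current` starts a new unit.
    """
    segments: List[List[Step]] = []
    for step in trajectory:
        if not segments or boundary_fn(segments[-1][-1], step):
            segments.append([step])
        else:
            segments[-1].append(step)
    return segments


def diagnose(segments: List[List[Step]],
             judge_fn: Callable[[List[Step], List[SubtaskReport]], SubtaskReport]
             ) -> List[SubtaskReport]:
    """Stage 2: judge each unit, passing earlier reports as context."""
    reports: List[SubtaskReport] = []
    for seg in segments:
        reports.append(judge_fn(seg, list(reports)))
    return reports


def summarize(reports: List[SubtaskReport]) -> TaskReport:
    """Stage 3: aggregate subtask verdicts into a task-level judgment.

    Assumed rule: success requires at least one subtask and no failed subtask.
    """
    success = bool(reports) and all(r.completed for r in reports)
    return TaskReport(success=success, subtasks=reports)


def evaluate(trajectory, boundary_fn, judge_fn) -> TaskReport:
    return summarize(diagnose(segment(trajectory, boundary_fn), judge_fn))


# Toy stand-ins for the model calls (illustrative heuristics only).
def toy_boundary(previous: Step, current: Step) -> bool:
    return current.action.startswith("open")


def toy_judge(seg: List[Step], prior: List[SubtaskReport]) -> SubtaskReport:
    failed = [s for s in seg if "error" in s.observation.lower()]
    return SubtaskReport(
        goal=seg[0].action,
        steps=seg,
        completed=not failed,
        error_analysis=failed[0].observation if failed else "",
        recommendation="retry the failing step" if failed else "",
    )


trajectory = [
    Step("open search", "search page"),
    Step("type shoes", "results shown"),
    Step("open cart", "error: cart failed to load"),
]
report = evaluate(trajectory, toy_boundary, toy_judge)
for sub in report.subtasks:
    print(sub.goal, sub.completed, sub.error_analysis)
```

Because each call to `judge_fn` sees only one unit plus the earlier reports, the input to any single judgment stays bounded regardless of total trajectory length, and the task-level report keeps the per-subtask findings rather than collapsing them into one verdict.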
Benefits of GUIDE
By focusing on bounded subtask segments rather than entire trajectories, GUIDE mitigates the context overload that often hampers existing evaluators as task complexity increases. This approach allows for:
- Improved accuracy in evaluations, particularly for long-horizon tasks.
- Enhanced interpretability of the evaluation results, offering insights into specific areas of failure.
- Structured diagnostic reports that directly inform agent improvement, making the evaluation process a valuable tool for developers.
Validation and Performance
GUIDE was validated on three distinct benchmarks:
- An industrial e-commerce dataset comprising 932 trajectories.
- AGENTREWARDBENCH, which includes five web agent tasks with a total of 1302 trajectories.
- AndroidBench, designed for mobile device control evaluation.
Across all three benchmarks, GUIDE outperformed existing evaluators, with accuracy up to 5.35 percentage points higher than the strongest baseline. It also produces structured diagnostic reports that can directly inform agent improvement.
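A percentage-point gap is the absolute difference between two accuracies, not a relative improvement. The sketch below shows the arithmetic, assuming evaluator accuracy is measured as agreement with reference success labels. The function names and the eight toy labels are made up for illustration and are not data or results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of trajectories where the evaluator's verdict matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled trajectory is required")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def percentage_point_gap(acc_a: float, acc_b: float) -> float:
    """Absolute accuracy difference, expressed in percentage points."""
    return (acc_a - acc_b) * 100.0


# Toy data: hypothetical reference labels and two evaluators' verdicts.
labels    = [True, False, True, True, False, True, False, False]
baseline  = [True, True, True, False, False, True, False, True]
candidate = [True, False, True, True, False, True, False, True]

acc_base = accuracy(baseline, labels)    # 5 of 8 verdicts match
acc_cand = accuracy(candidate, labels)   # 7 of 8 verdicts match
gap_pp = percentage_point_gap(acc_cand, acc_base)
relative_gain = (acc_cand - acc_base) / acc_base * 100.0

print(f"baseline={acc_base:.3f} candidate={acc_cand:.3f}")
print(f"gap={gap_pp:.2f} percentage points, relative gain={relative_gain:.1f}%")
```

In this toy case the gap is 25 percentage points while the relative gain is 40 percent, which is why the two figures should not be used interchangeably when comparing evaluators.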
Conclusion
In summary, GUIDE evaluates GUI agents by segmenting each trajectory into subtasks, diagnosing every subtask in context, and aggregating the results into a task-level judgment. On the benchmarks above, this improves accuracy over existing evaluators, particularly on long-horizon tasks, and yields reports that show where and why an agent failed, which makes evaluation usable as a diagnostic tool during agent development.
