IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
Summary: arXiv:2604.05157v1
Abstract: Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems.
Introduction
Computer-Use Agents (CUAs) use large language models to carry out operations in graphical user interfaces (GUIs). A significant challenge remains: these agents generate actions without assessing their quality, so a single irreversible error can cascade through subsequent steps. IntentScore addresses this issue with a reward model that evaluates the candidate actions a CUA proposes.
Overview of IntentScore
IntentScore is a plan-aware reward model that scores candidate actions. It is trained on 398,000 offline GUI interaction steps spanning three operating systems, using two complementary training objectives:
- Contrastive Alignment: This objective targets state-action relevance, so that the model associates each GUI state with the actions that fit that context.
- Margin Ranking: This objective targets action correctness, so that the model separates actions that look similar but differ significantly in their appropriateness.
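The two objectives can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an InfoNCE-style form for contrastive alignment and a hinge form for margin ranking, and the function names, the `temperature` and `margin` values, and the toy vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_alignment_loss(state_vecs, action_vecs, temperature=0.1):
    """InfoNCE-style loss (assumed form): state i should match action i
    better than it matches any other action in the batch."""
    total = 0.0
    for i, state in enumerate(state_vecs):
        logits = [dot(state, action) / temperature for action in action_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]  # negative log-softmax of the true pair
    return total / len(state_vecs)


def margin_ranking_loss(score_correct, score_incorrect, margin=1.0):
    """Hinge loss (assumed form): zero once the correct action outscores
    the incorrect one by at least `margin`."""
    return max(0.0, margin - (score_correct - score_incorrect))


# Toy embeddings: each state is aligned with the action at the same index.
states = [[1.0, 0.0], [0.0, 1.0]]
aligned_actions = [[1.0, 0.0], [0.0, 1.0]]
swapped_actions = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_alignment_loss(states, aligned_actions)
loss_swapped = contrastive_alignment_loss(states, swapped_actions)
```

In this toy batch the aligned pairing yields a much lower contrastive loss than the swapped pairing, and the ranking loss stays positive until the correct action leads by the full margin.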
Architectural Innovation
A distinguishing feature of IntentScore is its architecture, which embeds each candidate’s planning intent within the action encoder. This allows the model to differentiate between candidates that perform similar actions for distinct rationales, a distinction that a score based on the action alone could not make.
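To make the idea concrete, here is a toy sketch of an intent-conditioned action representation. It is an assumption-laden illustration only: the source does not describe the encoder's internals, so the hashed bag-of-words `embed_text`, the concatenation in `encode_candidate`, the `DIM` size, and the example strings are all hypothetical stand-ins for learned components.

```python
import hashlib

DIM = 8  # toy embedding size (hypothetical)


def embed_text(text, dim=DIM):
    """Deterministic hashed bag-of-words vector; a stand-in for a learned
    text encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0  # bucket the token by its first hash byte
    return vec


def encode_candidate(action, intent):
    """Concatenate action and intent features, so the intent becomes part
    of the representation that a scorer would see."""
    return embed_text(action) + embed_text(intent)


# Same low-level action, two different rationales.
action = "click(420, 310)"
enc_a = encode_candidate(action, "open the settings menu")
enc_b = encode_candidate(action, "dismiss the popup")
```

The two encodings share their action half but differ in their intent half, so a downstream scorer can assign them different values even though the click itself is identical.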
Performance Metrics
On held-out evaluation data, IntentScore achieved 97.5% pairwise discrimination accuracy. This indicates that the model reliably separates competing action candidates based on intent and quality.
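The source does not spell out how pairwise discrimination accuracy is computed. A common reading, assumed here, is the fraction of (correct, incorrect) candidate pairs in which the correct action receives the higher score. The function name and the example scores below are hypothetical.

```python
def pairwise_discrimination_accuracy(pairs):
    """Fraction of pairs where the correct action outscores the incorrect one.

    `pairs` is an iterable of (score_of_correct, score_of_incorrect) tuples.
    Ties count as failures under this assumed definition.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for good, bad in pairs if good > bad)
    return wins / len(pairs)


# Four toy pairs; the second is ranked the wrong way round.
example_pairs = [(0.9, 0.2), (0.4, 0.7), (0.8, 0.1), (0.6, 0.5)]
accuracy = pairwise_discrimination_accuracy(example_pairs)  # 3 of 4 -> 0.75
```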
Real-World Application
IntentScore has been deployed as a re-ranker for Agent S3 in OSWorld, an environment that was entirely unseen during the model’s training. The deployment improved task success rate by 6.9 points. This improvement suggests that reward estimation learned from diverse offline trajectories generalizes to new agents and tasks.
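Re-ranking in this sense can be sketched as follows: the agent proposes several candidates, a scorer rates each one against the current state, and the top-scoring candidate is executed. This is a simplified illustration and not the Agent S3 integration; `rerank`, `toy_score`, the `Candidate` type, and the example state and candidates are hypothetical, and the word-overlap scorer merely stands in for a learned reward model.

```python
from typing import Callable, Sequence, Tuple

Candidate = Tuple[str, str]  # (planning intent, action); hypothetical format


def rerank(state: str,
           candidates: Sequence[Candidate],
           score_fn: Callable[[str, Candidate], float]) -> Candidate:
    """Return the candidate with the highest score for the given state."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda cand: score_fn(state, cand))


def toy_score(state: str, candidate: Candidate) -> float:
    """Stand-in scorer: counts words shared by the state description and
    the candidate's intent. A real reward model would replace this."""
    intent, _action = candidate
    return float(len(set(state.lower().split()) & set(intent.lower().split())))


state = "settings dialog is open and the task is to enable dark mode"
candidates = [
    ("close the dialog", "click(812, 64)"),
    ("enable dark mode toggle", "click(420, 310)"),
]
best = rerank(state, candidates, toy_score)
```

Here the second candidate wins because its intent shares more words with the state description. Keeping the scorer behind a plain function argument means the agent's proposal step does not have to change when the scorer does.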
Conclusion
IntentScore brings intent-awareness to action evaluation for Computer-Use Agents. By scoring each candidate action together with its planning intent, the model improves the quality of the actions CUAs execute and reduces the risk of errors that cascade into subsequent steps. Its gains on an unseen environment suggest that reward models of this kind can make automated GUI agents more accurate and reliable.
