CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents
arXiv:2603.24440v1
Abstract
Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video.
Introduction to CUA-Suite
To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.
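The frame and duration figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes a constant 30 fps throughout. That rate is stated for VideoCUA, but applying it to the ScaleCUA screenshot count is an assumption made here for illustration. The function names are hypothetical.

```python
FPS = 30  # frame rate stated for VideoCUA; assumed here for the ScaleCUA comparison too


def frames_to_hours(frames: int, fps: int = FPS) -> float:
    """Duration in hours of `frames` frames played back at `fps`."""
    return frames / fps / 3600


def hours_to_frames(hours: float, fps: int = FPS) -> int:
    """Number of frames in `hours` of video recorded at `fps`."""
    return round(hours * 3600 * fps)


# About 55 hours at 30 fps is roughly 6 million frames.
videocua_frames = hours_to_frames(55)

# 2 million screenshots treated as 30 fps video come to under 20 hours.
scalecua_hours = frames_to_hours(2_000_000)

# Average recording length per task, given roughly 10,000 tasks.
seconds_per_task = 55 * 3600 / 10_000

print(videocua_frames, round(scalecua_hours, 1), seconds_per_task)
```

Under these assumptions, the 55 hours correspond to 5.94 million frames and the 2 million screenshots to about 18.5 hours, consistent with the rounded figures in the text. The average demonstration would last about 20 seconds.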
Benefits of Continuous Video
Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction. The recordings therefore form a superset of the information in sparse formats and can be losslessly transformed into the formats required by existing agent frameworks. Because the cursor path, its timing, and the intermediate screen states between clicks are all retained, the data gives a finer-grained view of user behavior than click-only logs, which may in turn benefit the training and performance of CUAs.
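To illustrate the superset relationship, here is a minimal sketch of projecting a continuous cursor trace down to the sparse (screenshot, click) records that many agent frameworks consume. The schema is hypothetical: `CursorSample`, `ClickAction`, and `trace_to_clicks` are names invented for this example and do not reflect the released VideoCUA data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CursorSample:
    t: float       # seconds since the start of the recording
    x: int         # cursor position in screen pixels
    y: int
    pressed: bool  # True while the primary button is held down


@dataclass(frozen=True)
class ClickAction:
    frame_index: int  # video frame showing the screen just before the press
    x: int
    y: int


def trace_to_clicks(trace: list[CursorSample], fps: int = 30) -> list[ClickAction]:
    """Keep only button-press onsets, paired with the frame preceding each one."""
    clicks = []
    was_pressed = False
    for sample in trace:
        if sample.pressed and not was_pressed:  # rising edge marks a new click
            frame = max(int(sample.t * fps) - 1, 0)
            clicks.append(ClickAction(frame, sample.x, sample.y))
        was_pressed = sample.pressed
    return clicks


trace = [
    CursorSample(0.0, 10, 10, False),
    CursorSample(0.5, 120, 80, False),
    CursorSample(1.0, 200, 150, True),
    CursorSample(1.5, 200, 150, False),
    CursorSample(2.0, 400, 300, True),
]
clicks = trace_to_clicks(trace)
print(clicks)
```

The projection only works in one direction. The click records can be derived from the trace, but the path and timing between clicks cannot be recovered from the click records alone.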
Complementary Resources
CUA-Suite further provides two complementary resources:
- UI-Vision: A rigorous benchmark for evaluating grounding and planning capabilities in CUAs.
- GroundCUA: A large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.
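As a rough illustration of how grounding is often scored, the sketch below counts a prediction as correct when the predicted click point falls inside the annotated bounding box of the target UI element. This point-in-box criterion is an assumption based on common practice. The source does not specify the metric used by UI-Vision, and `Box`, `hit`, and `grounding_accuracy` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def hit(point: tuple[float, float], box: Box) -> bool:
    """True if the predicted point lies inside the box (edges inclusive)."""
    x, y = point
    return box.left <= x <= box.right and box.top <= y <= box.bottom


def grounding_accuracy(predictions: list[tuple[float, float]], targets: list[Box]) -> float:
    """Fraction of predicted points that land inside their target element."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(hit(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)


targets = [Box(0, 0, 100, 50), Box(200, 200, 260, 230)]
predictions = [(50, 25), (300, 300)]  # first lands inside its box, second misses
accuracy = grounding_accuracy(predictions, targets)
print(accuracy)
```

For scale, 3.6 million element annotations over 56K screenshots works out to roughly 64 annotated elements per screenshot on average, so a target box typically competes with many nearby candidates.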
Preliminary Evaluation
Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications, failing approximately 60% of tasks. This result underscores the need for training datasets that capture the complexity of user interactions in desktop environments.
Future Research Directions
Beyond evaluation, CUA-Suite’s rich multimodal corpus supports emerging research directions, including:
- Generalist screen parsing
- Continuous spatial control
- Video-based reward modeling
- Visual world models
Public Release
All data and models from CUA-Suite are publicly released to support further research and development on computer-use agents. The release aims to narrow the gap between advances in AI research and practical automation in desktop environments.
