CUA-Suite: Large Human-Annotated Videos for Computer Agents

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

arXiv:2603.24440v1 Announce Type: cross

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video.

Introduction to CUA-Suite

To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.
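The hour and frame counts quoted so far can be sanity-checked with simple arithmetic. The sketch below assumes that every screenshot or frame corresponds to one frame of 30 fps video; the source states 30 fps only for VideoCUA, so applying it to ScaleCUA is an assumption. The helper names are illustrative and not part of any CUA-Suite tooling.

```python
# Rough consistency check of the dataset sizes quoted above.
# Assumption: one screenshot/frame == one frame of 30 fps video.

FPS = 30.0


def frames_to_hours(frames: int, fps: float = FPS) -> float:
    """Convert a frame count to hours of video at the given frame rate."""
    return frames / fps / 3600.0


def hours_to_frames(hours: float, fps: float = FPS) -> int:
    """Convert hours of video to a frame count at the given frame rate."""
    return round(hours * fps * 3600)


# ScaleCUA: 2 million screenshots
scalecua_hours = frames_to_hours(2_000_000)

# VideoCUA: approximately 55 hours of 30 fps recordings
videocua_frames = hours_to_frames(55)

print(f"ScaleCUA: {scalecua_hours:.1f} hours")
print(f"VideoCUA: {videocua_frames:,} frames")
```

Under this assumption, 2 million screenshots come to about 18.5 hours, consistent with "less than 20 hours", and 55 hours come to 5.94 million frames, consistent with "approximately 6 million".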

Benefits of Continuous Video

Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction. The recordings therefore form a superset of the information in sparse datasets and can be losslessly transformed into the formats required by existing agent frameworks. Because cursor motion and intermediate screen states are retained, the data shows how a user reaches a target and not only where the click lands, which may improve the training and performance of CUAs.
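To illustrate how a dense recording can be reduced to the sparse format that many agent frameworks expect, here is a minimal sketch. The `CursorSample` and `ClickEvent` schemas, the `pressed` flag, and the `extract_clicks` function are hypothetical; the source does not describe the actual VideoCUA file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CursorSample:
    """One sample of a dense cursor trace (hypothetical schema)."""
    t: float        # seconds since the start of the recording
    x: int          # cursor x position in pixels
    y: int          # cursor y position in pixels
    pressed: bool   # True while the primary mouse button is held down


@dataclass
class ClickEvent:
    """Sparse record: the frame index and position of a click."""
    frame: int
    x: int
    y: int


def extract_clicks(trace: List[CursorSample], fps: float = 30.0) -> List[ClickEvent]:
    """Keep only the samples where the button goes from released to pressed."""
    clicks: List[ClickEvent] = []
    was_pressed = False
    for sample in trace:
        if sample.pressed and not was_pressed:
            # Map the timestamp to the index of the video frame it falls in.
            clicks.append(ClickEvent(frame=int(sample.t * fps), x=sample.x, y=sample.y))
        was_pressed = sample.pressed
    return clicks


# Toy trace: the cursor moves, clicks, releases, then clicks again.
trace = [
    CursorSample(0.0, 100, 100, False),
    CursorSample(0.5, 180, 140, False),
    CursorSample(1.0, 250, 200, True),
    CursorSample(1.1, 250, 200, False),
    CursorSample(2.0, 400, 320, True),
]
clicks = extract_clicks(trace)
for click in clicks:
    print(click)
```

The projection only goes one way: the click list can be derived from the trace, but the cursor path between clicks cannot be recovered from the click list. That asymmetry is the sense in which the continuous data is a superset.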

Complementary Resources

CUA-Suite further provides two complementary resources:

UI-Vision: A rigorous benchmark for evaluating grounding and planning capabilities in CUAs.
GroundCUA: A large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.
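The GroundCUA figures imply a fairly dense labeling of each screen. The short calculation below treats the quoted counts as exact; since the source says "over 3.6 million", the result is best read as an approximate lower bound.

```python
# Average annotation density implied by the quoted GroundCUA counts.
screenshots = 56_000            # "56K annotated screenshots"
element_annotations = 3_600_000  # "over 3.6 million UI element annotations"

avg_elements_per_screenshot = element_annotations / screenshots
print(f"about {avg_elements_per_screenshot:.0f} annotated UI elements per screenshot")
```

That works out to roughly 64 annotated UI elements per screenshot.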

Preliminary Evaluation

Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications, failing roughly 60% of tasks. This result underscores the need for robust training datasets that capture the complexity of user interactions in desktop environments.

Future Research Directions

Beyond evaluation, CUA-Suite’s rich multimodal corpus supports emerging research directions, including:

Generalist screen parsing
Continuous spatial control
Video-based reward modeling
Visual world models

Public Release

All data and models from CUA-Suite are publicly released to encourage further research and development on computer-use agents. The release aims to narrow the gap between advances in AI research and practical automation in desktop environments.
