CUA-Suite: Large Human-Annotated Videos for Computer Agents

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

arXiv:2603.24440v1 (announce type: cross)

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video.

Introduction to CUA-Suite

To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.
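The frame and hour figures quoted above can be sanity-checked with a little arithmetic. The sketch below assumes that the ScaleCUA comparison also uses a 30 fps conversion; the source states 30 fps only for VideoCUA, so that part is an assumption. The helper names are illustrative.

```python
FPS = 30  # frame rate stated for VideoCUA; assumed for the ScaleCUA conversion as well


def frames_to_hours(frames: int, fps: int = FPS) -> float:
    """Convert a frame count to hours of video at a fixed frame rate."""
    return frames / fps / 3600


def hours_to_frames(hours: float, fps: int = FPS) -> int:
    """Convert hours of video to a frame count at a fixed frame rate."""
    return round(hours * 3600 * fps)


# 2 million screenshots treated as consecutive 30 fps frames
scalecua_hours = frames_to_hours(2_000_000)

# roughly 55 hours of 30 fps recording
videocua_frames = hours_to_frames(55)

print(f"ScaleCUA equivalent: {scalecua_hours:.1f} hours")
print(f"VideoCUA frames: {videocua_frames:,}")
```

Under these assumptions the ScaleCUA figure comes out a little under 20 hours and the VideoCUA figure close to 6 million frames, in line with the numbers quoted in the abstract.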

Benefits of Continuous Video

Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction. The recordings therefore form a superset of the information in sparse datasets and can be losslessly transformed into the formats required by existing agent frameworks. Because cursor motion and intermediate screen states are retained, the data may also support a more detailed view of user behavior when training CUAs.
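To make the "superset" idea concrete, here is a minimal sketch of how a continuous cursor trace could be reduced to the sparse click-level steps that screenshot-based frameworks expect. The source does not describe the VideoCUA file schema, so `CursorSample`, `ClickStep`, and their fields are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CursorSample:
    """One sample of a cursor trace (hypothetical schema)."""
    t: float        # seconds since the start of the recording
    x: int          # cursor x position in pixels
    y: int          # cursor y position in pixels
    pressed: bool   # True while the primary button is held down


@dataclass
class ClickStep:
    """Sparse step: the video frame to use as a screenshot plus click coordinates."""
    frame_index: int
    x: int
    y: int


def to_sparse_clicks(trace: List[CursorSample], fps: int = 30) -> List[ClickStep]:
    """Keep only the samples where the button goes from released to pressed."""
    steps: List[ClickStep] = []
    was_pressed = False
    for sample in trace:
        if sample.pressed and not was_pressed:
            # Map the timestamp to the video frame at or just before the click.
            steps.append(ClickStep(int(sample.t * fps), sample.x, sample.y))
        was_pressed = sample.pressed
    return steps


trace = [
    CursorSample(0.0, 10, 10, False),
    CursorSample(0.5, 120, 80, False),
    CursorSample(1.0, 200, 150, True),
    CursorSample(1.1, 200, 150, False),
    CursorSample(2.0, 400, 300, True),
]
clicks = to_sparse_clicks(trace)
print(clicks)
```

The reduction only goes one way: the click list can be derived from the trace, but the cursor path between clicks cannot be recovered from the click list.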

Complementary Resources

CUA-Suite further provides two complementary resources:

  • UI-Vision: A rigorous benchmark for evaluating grounding and planning capabilities in CUAs.
  • GroundCUA: A large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.
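As a rough illustration of what a grounding evaluation can look like, the sketch below scores predicted click points against annotated element bounding boxes. Point-in-box accuracy is a common choice for this kind of task, but the source does not state which metric UI-Vision uses, and the box format and function names here are assumptions.

```python
from typing import List, Tuple

# Hypothetical annotation format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]
Point = Tuple[int, int]


def hit(point: Point, box: Box) -> bool:
    """Return True if the predicted point lies inside the target element's box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(preds: List[Point], targets: List[Box]) -> float:
    """Fraction of predictions that land inside their target box."""
    if not targets:
        return 0.0
    hits = sum(hit(p, b) for p, b in zip(preds, targets))
    return hits / len(targets)


# Toy data, not taken from GroundCUA or UI-Vision.
preds = [(50, 50), (300, 40), (10, 500)]
targets = [(40, 40, 80, 60), (100, 30, 200, 50), (0, 480, 20, 520)]
accuracy = grounding_accuracy(preds, targets)
print(f"grounding accuracy: {accuracy:.2f}")
```

In this toy example the first and third predictions fall inside their boxes and the second does not.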

Preliminary Evaluation

Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications, recording a task failure rate of roughly 60%. This highlights the need for robust training datasets that capture the complexity of user interactions in desktop environments.

Future Research Directions

Beyond evaluation, CUA-Suite’s rich multimodal corpus supports emerging research directions, including:

  • Generalist screen parsing
  • Continuous spatial control
  • Video-based reward modeling
  • Visual world models

Public Release

All data and models from CUA-Suite are publicly released to encourage further research and development on computer-use agents. The release is intended to help connect advances in AI research with practical automation in desktop environments.


Lazarus Omolua (https://richlyai.com/blog)