A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
Agents for Computer Use (ACUs) are systems that execute complex tasks on digital devices such as desktops, mobile phones, and web platforms by following instructions given in natural language. They automate tasks through low-level actions such as mouse clicks and touchscreen gestures, which could substantially change how users interact with technology. Despite significant advances, however, they are not yet ready for everyday use.
This article surveys the current state of ACUs, highlighting trends, challenges, and research gaps that must be addressed to improve their functionality and usability.
Survey Overview
Our review organizes ACUs in a unifying taxonomy with three dimensions:
- Domain Perspective: This dimension characterizes the contexts in which agents operate, such as desktop, mobile, and web environments.
- Interaction Perspective: This aspect describes the modalities of observation (e.g., screenshots, HTML) and action (e.g., mouse, keyboard, code execution).
- Agent Perspective: This focuses on how agents perceive, reason, and learn from their environments.
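One way to make the three dimensions concrete is to record each agent as a small structured entry. The sketch below is illustrative only: the class names, fields, and example agents are hypothetical and are not taken from the survey, and the enumerated values mirror only the examples mentioned in this article. The single `uses_foundation_model` flag is a deliberate simplification of the agent perspective.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List


class Domain(Enum):
    DESKTOP = "desktop"
    MOBILE = "mobile"
    WEB = "web"


class Observation(Enum):
    SCREENSHOT = "screenshot"
    HTML = "html"


class Action(Enum):
    MOUSE = "mouse"
    KEYBOARD = "keyboard"
    CODE_EXECUTION = "code_execution"


@dataclass(frozen=True)
class AgentProfile:
    """One agent described along the three taxonomy dimensions (hypothetical schema)."""
    name: str
    domains: FrozenSet[Domain]            # domain perspective
    observations: FrozenSet[Observation]  # interaction perspective: what the agent sees
    actions: FrozenSet[Action]            # interaction perspective: how the agent acts
    uses_foundation_model: bool           # agent perspective, reduced to a single flag


def with_observation(agents: Iterable[AgentProfile],
                     modality: Observation) -> List[AgentProfile]:
    """Return the agents that observe their environment through `modality`."""
    return [agent for agent in agents if modality in agent.observations]


# Placeholder entries for illustration; these are not real surveyed agents.
EXAMPLES = [
    AgentProfile(
        name="hypothetical-web-agent",
        domains=frozenset({Domain.WEB}),
        observations=frozenset({Observation.HTML}),
        actions=frozenset({Action.MOUSE, Action.KEYBOARD}),
        uses_foundation_model=True,
    ),
    AgentProfile(
        name="hypothetical-desktop-agent",
        domains=frozenset({Domain.DESKTOP}),
        observations=frozenset({Observation.SCREENSHOT}),
        actions=frozenset({Action.MOUSE, Action.KEYBOARD, Action.CODE_EXECUTION}),
        uses_foundation_model=False,
    ),
]

screenshot_agents = with_observation(EXAMPLES, Observation.SCREENSHOT)
```

Storing the modalities as sets makes it straightforward to group or filter agents along any one dimension, which is the kind of comparison a taxonomy is meant to support.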
Through our taxonomy, we analyzed 87 ACUs and 33 datasets, comparing foundation model-based approaches with classical methods. This analysis led to the identification of six major research gaps that hinder the progress of ACUs:
- Insufficient Generalization: Many ACUs struggle to generalize their learning across various tasks and environments.
- Inefficient Learning: Current learning methods are often static and do not adapt effectively to new information or contexts.
- Limited Planning: Many agents lack robust planning capabilities, which are crucial for executing complex tasks.
- Low Task Complexity in Benchmarks: Existing benchmarks do not adequately reflect real-world task complexity, limiting the applicability of research findings.
- Non-Standardized Evaluation: There is a lack of standardized metrics for evaluating agent performance, making comparisons difficult.
- Disconnect Between Research and Practical Conditions: There is often a gap between academic research and real-world implementation challenges, which can stifle innovation.
Proposed Directions for Improvement
To address these identified gaps, we recommend several strategies:
- Vision-Based Observations and Low-Level Control: Observing the screen visually and acting through low-level inputs such as mouse and keyboard can improve agents’ ability to generalize.
- Adaptive Learning: Moving beyond static prompting to incorporate adaptive learning techniques will allow agents to respond dynamically to new inputs.
- Effective Planning and Reasoning Methods: Developing models that enhance agents’ planning and reasoning abilities is essential for tackling complex tasks.
- Real-World Task Complexity Benchmarks: Creating benchmarks that reflect the complexities of real-world tasks will improve the relevance of research outputs.
- Standardized Evaluation Metrics: Establishing consistent evaluation criteria based on task success will facilitate better comparisons across studies.
- Alignment with Real-World Constraints: Designing agents with real-world deployment in mind will ensure greater practical applicability.
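A metric based on task success, as recommended in the list above, can be as simple as the fraction of benchmark tasks an agent completes. The following is a minimal sketch under the assumption that each episode is reduced to a binary success flag; `EpisodeResult` and `task_success_rate` are hypothetical names, and the survey does not prescribe this exact formula.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one agent run on one benchmark task (hypothetical schema)."""
    task_id: str
    succeeded: bool


def task_success_rate(results: Sequence[EpisodeResult]) -> float:
    """Fraction of episodes in which the agent completed the task."""
    if not results:
        raise ValueError("at least one episode result is required")
    return sum(1 for result in results if result.succeeded) / len(results)


# Illustrative data only; not results from any real benchmark.
example_results = [
    EpisodeResult("task-1", True),
    EpisodeResult("task-2", False),
    EpisodeResult("task-3", True),
    EpisodeResult("task-4", True),
]
example_rate = task_success_rate(example_results)
```

Reporting one shared quantity of this kind is what makes results comparable across studies; how success is judged for each task remains a benchmark-specific decision that this sketch does not address.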
In conclusion, our taxonomy and analysis provide a framework for advancing ACU research toward general-purpose agents capable of robust and scalable computer use, provided that the challenges identified above are systematically addressed.
