CocoaBench: Evaluating Unified Digital Agents in the Wild
The rapid advancement of Large Language Model (LLM) agents has improved their performance in fields such as software engineering, deep research, and graphical user interface (GUI) automation. Evaluation methodology has not kept pace. Current assessments often test individual capabilities in isolation, so they miss real-world tasks that require agents to combine several skills.
To bridge this gap, researchers have introduced CocoaBench, a benchmark designed specifically for unified digital agents. CocoaBench is built around human-designed, long-horizon tasks that require a flexible composition of vision, search, and coding skills. Each task is specified by a simple instruction paired with an automatic evaluation function that assesses the final output. Because scoring depends on that final output, evaluation stays reliable and scalable across diverse agent infrastructures.
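The task format described above can be illustrated with a minimal Python sketch. It is not CocoaBench's actual schema: the `Task` class, the `normalized_match` helper, the `score` function, and the example task are all hypothetical, chosen only to show how an instruction can be paired with a checker that sees nothing but the final output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an instruction plus a checker for the final output."""
    instruction: str
    evaluate: Callable[[str], bool]  # receives only the agent's final output


def normalized_match(expected: str) -> Callable[[str], bool]:
    """Build a checker that compares answers ignoring case and surrounding whitespace."""
    target = expected.strip().lower()
    return lambda output: output.strip().lower() == target


# Illustrative task, not taken from the benchmark.
example_task = Task(
    instruction="Report the label of the tallest bar in the attached chart.",
    evaluate=normalized_match("Q3"),
)


def score(task: Task, final_output: str) -> bool:
    """Score a run from its final output alone, regardless of how the agent produced it."""
    return task.evaluate(final_output)
```

Since `score` never inspects intermediate steps, the same checker would apply whether an agent reached its answer through a browser, a code interpreter, or both.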
Key Features of CocoaBench
- Unified Task Design: Tasks are crafted to require the integration of multiple skills, reflecting the complexities faced in real-world applications.
- Instruction-Based Specifications: Each task is presented to the agent as a simple instruction, allowing agents to demonstrate their understanding and execution capabilities.
- Automatic Evaluation: An automatic evaluation function scores each final output objectively, making the evaluation process both reliable and scalable.
- CocoaAgent Framework: Alongside CocoaBench, the CocoaAgent framework serves as a lightweight scaffold that facilitates controlled comparisons across various model backbones, providing a standardized platform for evaluation.
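A controlled comparison across model backbones can be sketched as follows. This is not the CocoaAgent API: the `Backbone` and `TaskSpec` types, the `success_rate` and `compare` functions, and the toy backbones are assumptions made for illustration. The point is only that every backbone runs the same task list under the same scoring, so the backbone is the one variable that changes.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a backbone maps an instruction to a final output string,
# and a task is an (instruction, checker) pair.
Backbone = Callable[[str], str]
TaskSpec = Tuple[str, Callable[[str], bool]]


def success_rate(backbone: Backbone, tasks: List[TaskSpec]) -> float:
    """Fraction of tasks whose final output passes its checker."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    passed = sum(1 for instruction, check in tasks if check(backbone(instruction)))
    return passed / len(tasks)


def compare(backbones: Dict[str, Backbone], tasks: List[TaskSpec]) -> Dict[str, float]:
    """Run every backbone on the same task list so only the backbone varies."""
    return {name: success_rate(fn, tasks) for name, fn in backbones.items()}


# Toy stand-ins for real models and real tasks.
tasks: List[TaskSpec] = [
    ("What is 2 + 2?", lambda out: out.strip() == "4"),
    ("Name the capital of France.", lambda out: out.strip().lower() == "paris"),
]
backbones: Dict[str, Backbone] = {
    "always_four": lambda instruction: "4",
    "always_paris": lambda instruction: "Paris",
}
results = compare(backbones, tasks)
```

Each toy backbone passes one of the two tasks, so both entries in `results` are 0.5. A benchmark-level success rate is the same ratio computed over the full task set.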
Experimental Findings
Initial experiments on CocoaBench show that current digital agents are still far from reliable. The best evaluated system reached a success rate of only 45.1% on the benchmark. This result highlights the need for further progress in several critical areas:
- Reasoning and Planning: Many agents struggle with complex reasoning tasks that require advanced planning capabilities.
- Tool Use and Execution: Many systems still cannot reliably use tools and carry tasks through to completion with them.
- Visual Grounding: Agents often have difficulty understanding and interacting with visual inputs, which is crucial for tasks that involve vision.
Conclusion
CocoaBench represents a significant step forward in the evaluation of unified digital agents. By focusing on tasks that require a blend of capabilities, it provides a more realistic assessment of an agent’s performance in practical applications. As the field evolves, CocoaBench and the CocoaAgent framework can help guide research and development toward more competent and reliable digital agents. Insights from ongoing evaluations should inform improvements in reasoning, planning, tool use, and visual grounding, the areas where current agents fall short.
