GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
The development of general-purpose agents is shifting from executing simple instructions to completing intricate, real-world productivity workflows. Current tool-use benchmarks, however, remain misaligned with the demands of real-world applications: they often rely on AI-generated queries and dummy tools, and they lack comprehensive system-level coordination.
To address these shortcomings, researchers have proposed GTA-2, a hierarchical benchmark designed for General Tool Agents (GTA). The benchmark spans both atomic tool use and open-ended workflows, and it stays close to real-world conditions by using genuine user queries, deployed tools, and multimodal contexts.
Key Components of GTA-2
- GTA-Atomic: This component is derived from the previous GTA benchmark and focuses on evaluating short-horizon, closed-ended tool-use precision. It provides a solid foundation for assessing basic tool capabilities in a controlled setting.
- GTA-Workflow: In contrast, this segment introduces long-horizon, open-ended tasks that require realistic end-to-end completion. This aspect is crucial for evaluating how well agents can navigate complex workflows that mimic real-life scenarios.
Evaluation Methodology
To assess open-ended deliverables, GTA-2 uses a recursive checkpoint-based evaluation mechanism that decomposes an overarching objective into verifiable sub-goals. This enables a unified evaluation of both model capabilities and agent execution frameworks, often referred to as execution harnesses.
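The idea of decomposing an objective into verifiable sub-goals can be illustrated with a minimal sketch. This is not the GTA-2 implementation: the `Checkpoint` class, the `score` function, the example checks, and the equal-weight averaging of sub-goals are all assumptions made for illustration, since the summary above does not state how sub-goal results are aggregated.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical node in a checkpoint tree (not the GTA-2 schema)."""
    name: str
    # Leaf nodes carry a verifier; internal nodes carry sub-goals instead.
    check: Optional[Callable[[dict], bool]] = None
    children: List["Checkpoint"] = field(default_factory=list)


def score(cp: Checkpoint, deliverable: dict) -> float:
    """Recursively score a deliverable against a checkpoint tree.

    Assumption: a leaf scores 1.0 if its check passes and 0.0 otherwise,
    and an internal node scores the unweighted mean of its children.
    """
    if not cp.children:
        if cp.check is None:
            raise ValueError(f"leaf checkpoint {cp.name!r} has no check")
        return 1.0 if cp.check(deliverable) else 0.0
    return sum(score(child, deliverable) for child in cp.children) / len(cp.children)


# Hypothetical objective: "produce a report with a chart and a summary".
root = Checkpoint(
    name="report",
    children=[
        Checkpoint("has_chart", check=lambda d: bool(d.get("chart"))),
        Checkpoint(
            name="summary",
            children=[
                Checkpoint("has_title", check=lambda d: bool(d.get("title"))),
                Checkpoint("mentions_total", check=lambda d: "total" in d.get("text", "")),
            ],
        ),
    ],
)

# A deliverable that has a chart and a title but never mentions the total.
deliverable = {"chart": True, "title": "Q3", "text": "no numbers"}
result = score(root, deliverable)
print(result)  # 0.75: the chart passes (1.0), the summary is half complete (0.5)
```

Under these assumptions, partial progress on a long-horizon task earns partial credit, and the same tree can score outputs from different models or harnesses. A real evaluator might instead weight sub-goals or require every checkpoint to pass.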
Findings from Experiments
Initial experiments reveal a pronounced capability cliff among the evaluated models. Leading models already struggle with atomic tasks, achieving less than 50% success, and fare far worse on workflows, where the top models manage only a 14.39% success rate. This gap underscores how difficult the transition from simple tool use to complex workflows remains.
Further analysis shows that checkpoint-guided feedback notably improves performance. Moreover, advanced frameworks such as Manus and OpenClaw substantially raise workflow completion rates, which underscores the importance of execution harness design beyond the capabilities of the underlying models.
Implications for Future Development
The insights gained from the GTA-2 benchmark provide valuable guidance for the ongoing development of reliable personal and professional assistants. By focusing on both atomic and open-ended task performance, researchers and developers can better understand the requirements for creating more sophisticated AI agents capable of functioning effectively in real-world contexts.
Access to Dataset and Code
For those interested in exploring the GTA-2 framework further, the associated dataset and code will be made available at https://github.com/open-compass/GTA.
