The Amazing Agent Race: Strong Tool Users, Weak Navigators
arXiv:2604.10261v1
Abstract: Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or “legs”) with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer.
The Amazing Agent Race targets a gap in current benchmarks, which are predominantly linear. Linear benchmarks do not test whether an agent can handle branching, interdependent steps, an ability that is critical for real-world applications. AAR's structure supports a finer-grained evaluation of agent performance in both tool use and navigation.
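To make the notion of a "linear" instance concrete, the sketch below checks whether a task's step graph is a single simple chain. The edge-list representation and the function names `is_linear_chain` and `linear_fraction` are assumptions made for illustration; they are not the analysis code used by the AAR authors.

```python
from collections import defaultdict


def is_linear_chain(num_steps: int, edges: list[tuple[int, int]]) -> bool:
    """Return True if steps 0..num_steps-1 form one chain s0 -> s1 -> ... -> sk."""
    # A chain over n steps has exactly n - 1 edges.
    if num_steps == 0 or len(edges) != num_steps - 1:
        return False

    in_deg: dict[int, int] = defaultdict(int)
    out_deg: dict[int, int] = defaultdict(int)
    successor: dict[int, int] = {}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        successor[src] = dst

    # Any fork (out-degree > 1) or merge (in-degree > 1) breaks linearity.
    if any(d > 1 for d in out_deg.values()) or any(d > 1 for d in in_deg.values()):
        return False

    # There must be a single starting step with no predecessor.
    starts = [s for s in range(num_steps) if in_deg[s] == 0]
    if len(starts) != 1:
        return False

    # Walk from the start; a chain visits every step exactly once.
    visited = 0
    node = starts[0]
    while node is not None:
        visited += 1
        node = successor.get(node)
    return visited == num_steps


def linear_fraction(instances: list[tuple[int, list[tuple[int, int]]]]) -> float:
    """Fraction of (num_steps, edges) instances that are simple chains."""
    if not instances:
        return 0.0
    return sum(is_linear_chain(n, e) for n, e in instances) / len(instances)


# Hypothetical instances: a 3-step chain and a 4-step fork-merge diamond.
chain = (3, [(0, 1), (1, 2)])
diamond = (4, [(0, 1), (0, 2), (1, 3), (2, 3)])
share = linear_fraction([chain, diamond])
```

Under this definition the diamond is rejected because it has more edges than a chain over four steps allows, which is the kind of fork-merge structure the compositional AAR legs are described as containing.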
Key Features of The Amazing Agent Race
The AAR benchmark has four main features:
- Directed Acyclic Graph (DAG) Puzzles: Each puzzle, or leg, is a DAG whose tool chains fork and merge, so agents must decide which tools to use and in what order.
- Multi-Step Tool Chains: Agents must execute a series of tools and combine their results in non-obvious ways.
- Difficulty Levels: Instances are procedurally generated from Wikipedia seeds and grouped into four difficulty levels.
- Live-API Validation: Each leg is validated against live APIs, so that answers can be checked and verified.
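A fork-merge leg can be pictured as a small dependency graph that is executed in topological order. The following is a minimal sketch under stated assumptions: the node names, the `run_leg` helper, and the stand-in "tools" (plain Python functions operating on a made-up seed record) are all hypothetical and do not reflect AAR's actual tool set or data format.

```python
from graphlib import TopologicalSorter


def run_leg(nodes, seed):
    """Execute a DAG of steps.

    nodes maps a step name to (function, list of dependency names).
    The special dependency "seed" refers to the starting input.
    """
    # Build the predecessor map expected by TopologicalSorter.
    graph = {name: set(deps) for name, (_, deps) in nodes.items()}
    results = {"seed": seed}
    for name in TopologicalSorter(graph).static_order():
        if name == "seed":
            continue  # the seed is an input, not a step to run
        fn, deps = nodes[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


# Hypothetical leg: two branches fork from the seed and merge in "density".
leg = {
    "population": (lambda page: page["population"], ["seed"]),
    "area": (lambda page: page["area_km2"], ["seed"]),
    "density": (lambda pop, area: round(pop / area), ["population", "area"]),
}

seed_page = {"population": 1000, "area_km2": 50}  # made-up values
results = run_leg(leg, seed_page)
final_answer = results["density"]
```

The merge step cannot run until both branches have produced a value, so an agent that skips or mis-navigates one branch has no way to compute the final answer.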
Evaluation Metrics
AAR evaluates agents with three complementary metrics:
- Finish-Line Accuracy: Measures the overall correctness of the final answer provided by the agent.
- Pit-Stop Visit Rate: Assesses how effectively agents navigate to the necessary information on Wikipedia.
- Roadblock Completion Rate: Evaluates the agents’ ability to overcome challenges posed by the DAG structure.
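The three metrics could be aggregated from per-leg records roughly as follows. This is only a sketch: the source does not give exact formulas, so the `LegResult` fields, the reading of "pit stops" as required Wikipedia pages and "roadblocks" as required tool steps, and the choice to pool counts across legs are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class LegResult:
    correct: bool               # final answer matched the reference
    pit_stops_required: int     # assumed: pages the leg requires visiting
    pit_stops_visited: int      # assumed: required pages the agent reached
    roadblocks_required: int    # assumed: tool steps the leg requires
    roadblocks_completed: int   # assumed: tool steps completed correctly


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def summarize(results: list[LegResult]) -> dict[str, float]:
    """Aggregate the three metrics, pooling counts across all legs."""
    return {
        "finish_line_accuracy": _ratio(
            sum(r.correct for r in results), len(results)
        ),
        "pit_stop_visit_rate": _ratio(
            sum(r.pit_stops_visited for r in results),
            sum(r.pit_stops_required for r in results),
        ),
        "roadblock_completion_rate": _ratio(
            sum(r.roadblocks_completed for r in results),
            sum(r.roadblocks_required for r in results),
        ),
    }


# Made-up records for two legs, for illustration only.
summary = summarize([
    LegResult(True, 3, 3, 2, 2),
    LegResult(False, 4, 2, 3, 2),
])
```

Reporting the three numbers side by side is what lets a failure be attributed to navigation (low pit-stop visit rate) rather than to tool execution (low roadblock completion rate).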
Findings and Implications
When three agent frameworks were evaluated on the 1,400 legs, the best-performing agent reached only 37.2% accuracy. Navigation errors were the dominant failure mode, ranging from 27% to 52% across trials, while tool-use errors stayed below 17%. Agents therefore struggle more with navigation than with executing tools.
Agent architecture also mattered as much as model scale. For instance, Claude Code matched Codex CLI at 37% accuracy while using six times fewer tokens, which suggests that agent design can strongly influence performance, particularly on complex navigation tasks.
Conclusion
The Amazing Agent Race evaluates large language model agents on non-linear, DAG-structured tasks and shows that navigation, rather than tool execution, is their main weakness. This finding points to navigation as a priority for future agent development.
For more information, please visit the project page: The Amazing Agent Race.
