ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
Summary: arXiv:2604.06111v1 (announce type: new)
Abstract: Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench, built around a unified grid-based planning task in which agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints.
Our benchmark offers fine-grained control through two orthogonal axes:
- Scalable Horizons: Controlled by the number of hidden slots H.
- Controllable Difficulty: Governed by a decoy budget B that determines the number of globally misleading decoy candidates.
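To make the two axes concrete, here is a minimal sketch of how a task instance could be parameterized by H and B. It is not the authors' generator: the function `make_instance`, the dictionary layout, and the `decoy_pool` argument are all hypothetical, and the sketch assumes the decoy pool never contains a correct answer. It only shows that H fixes the number of hidden slots while B fixes the total number of decoy candidates spread across them.

```python
import random

def make_instance(full_schedule, decoy_pool, H, B, seed=0):
    """Hide H slots and scatter B decoy candidates across them (illustrative only)."""
    if not 0 <= H <= len(full_schedule):
        raise ValueError("H must be between 0 and the number of slots")
    if B > 0 and H == 0:
        raise ValueError("decoys need at least one hidden slot")
    rng = random.Random(seed)  # seeded so the same (H, B, seed) gives the same task
    hidden = sorted(rng.sample(range(len(full_schedule)), H))
    # None marks a slot the agent has to fill.
    visible = [None if i in hidden else v for i, v in enumerate(full_schedule)]
    # Every hidden slot starts with its correct answer as a candidate.
    candidates = {i: [full_schedule[i]] for i in hidden}
    # Spend the decoy budget: each decoy lands on a randomly chosen hidden slot.
    for _ in range(B):
        slot = rng.choice(hidden)
        candidates[slot].append(rng.choice(decoy_pool))
    # Shuffle so the correct answer is not always listed first.
    for options in candidates.values():
        rng.shuffle(options)
    return {"visible": visible, "hidden": hidden, "candidates": candidates}

# Hypothetical five-slot schedule with three hidden slots and four decoys.
instance = make_instance(["run", "read", "cook", "nap", "code"], ["golf", "ski"], H=3, B=4, seed=1)
```

In this toy version a decoy is just an extra candidate; in the benchmark as described, decoys are globally misleading, which would require checking them against the global constraints when they are generated.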
Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that H and B provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability.
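The idea of resolving tool calls from static JSON files can be sketched as a simple lookup. The class `StaticToolEnv`, the one-file-per-tool layout, the tool name `get_slot_candidates`, and the error format are assumptions made for illustration; the abstract does not specify the file schema.

```python
import json
import tempfile
from pathlib import Path

class StaticToolEnv:
    """Answers tool calls by looking them up in pre-generated JSON files."""

    def __init__(self, root):
        self.root = Path(root)
        self._cache = {}  # tool name -> parsed JSON table

    def call(self, tool, key):
        # Load each tool's table once; there is no live service to start or reset.
        if tool not in self._cache:
            with open(self.root / f"{tool}.json", encoding="utf-8") as f:
                self._cache[tool] = json.load(f)
        table = self._cache[tool]
        if key not in table:
            return {"error": f"no entry for {key!r}"}
        return table[key]

# Usage with a throwaway directory standing in for a task's data folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "get_slot_candidates.json").write_text(
        json.dumps({"slot_3": ["yoga", "swim"]}), encoding="utf-8"
    )
    env = StaticToolEnv(tmp)
    hit = env.call("get_slot_candidates", "slot_3")
    miss = env.call("get_slot_candidates", "slot_9")
```

Because every response is a file read, the same call always returns the same result, which is what makes this style of environment cheap to run and easy to reproduce.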
We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.
Key Features of ACE-Bench
- Unified Grid-Based Planning Task: Every task shares one format, in which the agent fills hidden slots in a partially completed schedule while satisfying local slot constraints and global constraints.
- Reduced Evaluation Overhead: Tool calls are resolved from static JSON files, which removes the environment setup and interaction overhead that the authors report can reach 41% of total evaluation time in existing benchmarks.
- Dynamic Control Over Difficulty: The decoy budget B sets the number of globally misleading decoy candidates, so researchers can raise or lower task difficulty for targeted evaluations.
- Reproducibility: Because tool responses come from static JSON files, runs are repeatable and fast enough for training-time validation.
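One practical consequence of controllable axes is that results can be reported per (H, B) setting instead of as a single aggregate. The sketch below assumes a hypothetical result format of `(H, B, solved)` tuples; it is not taken from the paper.

```python
from collections import defaultdict

def accuracy_by_cell(results):
    """Map each (H, B) setting to the fraction of tasks solved at that setting."""
    totals = defaultdict(lambda: [0, 0])  # (H, B) -> [solved, attempted]
    for H, B, solved in results:
        totals[(H, B)][0] += int(solved)
        totals[(H, B)][1] += 1
    return {cell: s / n for cell, (s, n) in sorted(totals.items())}

# Hypothetical outcomes: short, decoy-free tasks versus long tasks with decoys.
results = [(2, 0, True), (2, 0, True), (8, 4, False), (8, 4, True)]
table = accuracy_by_cell(results)
```

Reading the table cell by cell shows where accuracy drops as H or B grows, which a single average over an imbalanced task mix would hide.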
Implications for Future Research
ACE-Bench gives researchers independent control over task horizon (H) and difficulty (B), so scores can be read per setting rather than through a single aggregate over an imbalanced task mix. This should give a clearer picture of where agents succeed and where they break down as horizons lengthen or decoys increase.
In conclusion, ACE-Bench targets both limitations identified in the abstract: high environment interaction overhead and imbalanced horizon and difficulty distributions. Its lightweight, configurable design makes it practical for routine use, including validation during training.
