ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
Ensuring the safety of large language model (LLM)-based agents has become a critical concern. Real-world agent interactions unfold over many steps and introduce risks that cannot be adequately assessed through isolated prompts or single responses. ATBench was introduced to provide an evaluation framework for these multi-step interactions.
Introduction to ATBench
ATBench is a trajectory-level benchmark designed for structured, diverse, and realistic evaluation of agent safety. It aims to fill gaps in existing benchmarks, which often suffer from limited interaction diversity, coarse observability of safety failures, and weak long-horizon realism. ATBench organizes agentic risks along three dimensions (risk source, failure mode, and real-world harm), so a safety failure can be described by where the risk came from, how the agent failed, and what harm resulted, rather than by a single safe/unsafe verdict.
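To make the three-dimensional labeling concrete, here is a minimal sketch of how one labeled trajectory might be represented. The class names, field names, and role strings are assumptions chosen for illustration; they are not ATBench's actual schema, and the category values in any real record would come from the benchmark's own taxonomy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    role: str      # assumed role names, e.g. "user", "agent", or "tool"
    content: str


@dataclass
class Trajectory:
    """Hypothetical record for one trajectory (not the official schema)."""
    trajectory_id: str
    turns: list[Turn] = field(default_factory=list)
    is_safe: bool = True
    # The three taxonomy dimensions named in the text.
    # Assumed to be filled in only for unsafe trajectories.
    risk_source: Optional[str] = None
    failure_mode: Optional[str] = None
    real_world_harm: Optional[str] = None

    def taxonomy_key(self) -> Optional[tuple[str, str, str]]:
        """Return (risk_source, failure_mode, real_world_harm), or None if safe."""
        if self.is_safe:
            return None
        labels = (self.risk_source, self.failure_mode, self.real_world_harm)
        if any(label is None for label in labels):
            raise ValueError("unsafe trajectory is missing a taxonomy label")
        return labels
```

Keeping the three labels as separate fields, rather than one combined category, is what would let results be grouped along any single dimension later.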
Key Features of ATBench
- Diverse Trajectories: The benchmark comprises 1,000 trajectories, of which 503 are classified as safe and 497 as unsafe, giving a nearly balanced label distribution. Each trajectory averages 9.01 turns and 3.95k tokens.
- Heterogeneous Tool Pools: The trajectories draw on a pool of 2,084 available tools and invoke 1,954 tools in total. This diversity enhances the realism of the interactions.
- Delayed-Trigger Protocol: The benchmark employs a long-context delayed-trigger protocol, which effectively captures the emergence of risks across multiple stages of interaction.
- Data Quality Assurance: The quality of the data is upheld through a combination of rule-based and LLM-based filtering processes, complemented by a thorough human audit.
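The figures in the list above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text. The function name is hypothetical, and two interpretations are assumptions: that the 1,954 invoked tools are distinct tools from the 2,084-tool pool, and that dividing average tokens by average turns gives a useful rough per-turn figure (it is a ratio of averages, not a true per-turn mean).

```python
# Figures quoted in the text above.
SAFE, UNSAFE = 503, 497
TOOL_POOL, TOOLS_INVOKED = 2084, 1954
AVG_TURNS, AVG_TOKENS = 9.01, 3950  # 3.95k tokens, as rounded in the text


def summarize() -> dict:
    """Derive simple summary ratios from the reported dataset statistics."""
    total = SAFE + UNSAFE
    return {
        "total": total,
        "safe_share": SAFE / total,
        # Assumes the invoked tools are distinct members of the pool.
        "tool_coverage": TOOLS_INVOKED / TOOL_POOL,
        # Ratio of averages: only a rough indication of tokens per turn.
        "approx_tokens_per_turn": AVG_TOKENS / AVG_TURNS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these assumptions, the safe share is 50.3% and roughly 94% of the tool pool is exercised at least once.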
Experimental Findings
Initial experiments evaluated a range of frontier LLMs, open-source models, and specialized guard systems on ATBench. The results show that the benchmark is challenging even for strong evaluators. Because risks are organized along the taxonomy, ATBench also supports taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
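As a rough illustration of what taxonomy-stratified analysis could look like in practice, the sketch below groups an evaluator's verdicts by one taxonomy dimension and reports accuracy per category. The record keys (`label_unsafe`, `predicted_unsafe`, `failure_mode`) and the function name are hypothetical and not part of any published ATBench tooling; the toy categories "A" and "B" are placeholders, not real taxonomy entries.

```python
from collections import defaultdict


def stratified_accuracy(records: list[dict], dimension: str) -> dict[str, float]:
    """Accuracy of safe/unsafe verdicts, grouped by one taxonomy dimension.

    Each record is assumed to hold the ground-truth label, the evaluator's
    prediction, and a category string under the key named by `dimension`.
    """
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for record in records:
        category = record[dimension]
        totals[category] += 1
        if record["predicted_unsafe"] == record["label_unsafe"]:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Toy data with placeholder categories, for illustration only.
    toy = [
        {"failure_mode": "A", "label_unsafe": True, "predicted_unsafe": True},
        {"failure_mode": "A", "label_unsafe": True, "predicted_unsafe": False},
        {"failure_mode": "B", "label_unsafe": True, "predicted_unsafe": True},
    ]
    print(stratified_accuracy(toy, "failure_mode"))
```

Running the same function with a different `dimension` key would slice identical predictions by risk source or by real-world harm, which is the kind of breakdown that can expose failure patterns hidden by a single aggregate score.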
Conclusion
ATBench advances the safety evaluation of LLM-based agents by providing a structured and realistic framework for assessing agentic risks at the trajectory level. It gives developers and researchers a way to understand and mitigate potential risks in real-world applications, not only to measure them. As reliance on AI systems continues to grow, benchmarks like ATBench will play an important role in supporting the responsible deployment of intelligent agents.
