HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark
arXiv:2604.13954v1
Abstract
Existing agent-safety evaluation has concentrated primarily on externally induced risks. Agents can nonetheless drift into unsafe trajectories under entirely benign conditions. This article examines that complementary, underexplored setting through the lens of intrinsic risk. Intrinsic failures can remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes.
Introduction
To address this issue, we introduce non-attack intrinsic risk auditing and present HINTBench, a benchmark of 629 agent trajectories: 523 risky and 106 safe, averaging 33 steps each. HINTBench supports three tasks:
- Risk Detection: Identifying whether a trajectory is risky or safe.
- Risk-step Localization: Pinpointing the specific steps within a trajectory that make it risky.
- Intrinsic Failure-type Identification: Classifying the type of intrinsic failure that occurs in the trajectory.
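As an illustration only, the three tasks can be read as three views of one annotated trajectory. The sketch below is a minimal Python example under stated assumptions: the class names, field names, the 0-based step indexing, the toy trajectory, and the `"placeholder_type"` label are all hypothetical and are not taken from the HINTBench release, whose actual schema and taxonomy labels are not described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    index: int             # position in the trajectory (0-based by assumption)
    action: str            # what the agent did at this step
    observation: str = ""  # what the environment returned, if anything


@dataclass
class AnnotatedTrajectory:
    steps: List[Step]
    risky: bool                                          # trajectory-level label
    risk_steps: List[int] = field(default_factory=list)  # indices of risky steps
    failure_type: Optional[str] = None                   # hypothetical taxonomy label


def detection_label(t: AnnotatedTrajectory) -> str:
    """Task 1 (Risk Detection): risky or safe."""
    return "risky" if t.risky else "safe"


def localization_label(t: AnnotatedTrajectory) -> List[int]:
    """Task 2 (Risk-step Localization): which steps make the trajectory risky."""
    return sorted(t.risk_steps) if t.risky else []


def failure_type_label(t: AnnotatedTrajectory) -> Optional[str]:
    """Task 3 (Intrinsic Failure-type Identification): the failure category."""
    return t.failure_type if t.risky else None


# A made-up three-step trajectory, far shorter than the 33-step average.
example = AnnotatedTrajectory(
    steps=[
        Step(0, "read config"),
        Step(1, "delete backup without confirmation"),
        Step(2, "report success"),
    ],
    risky=True,
    risk_steps=[1],
    failure_type="placeholder_type",  # stands in for one of the five taxonomy entries
)

print(detection_label(example), localization_label(example), failure_type_label(example))
```

Keeping all three labels on one record makes the nesting of the tasks explicit: a safe trajectory has no risk steps and no failure type, so localization and identification only apply once detection says "risky".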
Methodology
Annotations in HINTBench are organized under a unified five-constraint taxonomy, which gives all trajectories a common scheme for analyzing and comparing agent behavior across scenarios.
Findings
Our experiments reveal a substantial capability gap in current models. Strong large language models (LLMs) perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization. Fine-grained failure diagnosis proves even harder.
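To make the localization score concrete, here is a minimal sketch of a step-level F1. The exact definition of Strict-F1 is not given in this summary, so the code assumes one plausible reading: a predicted step counts as correct only if its index exactly matches an annotated risk step. The function name `step_f1` and the convention for empty sets are assumptions, not the benchmark's official scorer.

```python
from typing import Iterable


def step_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """Set-based F1 over exact step indices (an assumed reading of Strict-F1)."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        # Assumption: predicting no risky steps for a trajectory with none is perfect.
        return 1.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# One of two predicted steps is correct, and one of two gold steps is found.
print(step_f1([3, 7], [7, 12]))  # 0.5
```

Under this reading, a model that correctly flags a trajectory as risky but points at neighboring steps scores zero on localization, which is one way a strong detection result and a weak localization result can coexist.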
Challenges Ahead
Existing guard models also transfer poorly to this setting. Together, these findings establish intrinsic risk auditing as an open challenge for agent safety.
Conclusion
HINTBench is a step toward a more complete understanding of agent safety. By focusing on intrinsic risks, it targets failures that can arise in real-world deployments even when no attack is present. Future work should aim to close the capability gap identified in this study and develop more effective methods for assessing and mitigating intrinsic risks in autonomous systems.
