STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems
Source: arXiv:2604.10286v1
Abstract: Autonomous language-model agents increasingly rely on installable skills and tools to complete user tasks. Static skill auditing can expose capability surface before deployment, but it cannot determine whether a particular invocation is unsafe under the current user request and runtime context. We therefore study skill invocation auditing as a continuous-risk estimation problem: given a user request, candidate skill, and runtime context, predict a score that supports ranking and triage before a hard intervention is applied.
Introduction to STARS
STARS treats skill invocation auditing as a runtime problem. Static auditing inspects a skill once, before deployment. STARS instead scores each invocation against the user request and runtime context in which it occurs, so the same skill can be rated differently from one request to the next.
Components of STARS
STARS combines three components:
- Static Capability Prior: A baseline estimate based on a skill's capability surface, established before deployment and independent of any particular request.
- Request-Conditioned Invocation Risk Model: A model that scores the risk of each skill invocation given the specific user request and runtime context.
- Calibrated Risk-Fusion Policy: A policy that combines the static prior and the request-conditioned score into a single risk score used for ranking and triage.
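As a rough illustration of how the three components listed above could fit together, the sketch below fuses a static prior and a request-conditioned score in logit space and then ranks invocations for triage. This is not the authors' method: the function names, the weights, and the choice of a weighted logit average are all assumptions made for the example.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Map a probability to log-odds, clipping away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fuse_risk(static_prior: float, contextual_risk: float,
              w_static: float = 0.4, w_context: float = 0.6,
              bias: float = 0.0) -> float:
    """Hypothetical fusion: weighted sum of the two scores in logit space.

    The weights and bias are illustrative placeholders; in practice they
    would be fitted on held-out data so that the output is calibrated.
    """
    z = w_static * logit(static_prior) + w_context * logit(contextual_risk) + bias
    return sigmoid(z)


def triage(scored: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Rank invocations from highest to lowest fused risk for review."""
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical invocations: (name, static prior, contextual risk).
    candidates = [
        ("read_calendar", 0.05, 0.10),
        ("send_email", 0.30, 0.85),
        ("delete_files", 0.70, 0.20),
    ]
    ranked = triage([(name, fuse_risk(s, c)) for name, s, c in candidates])
    for name, risk in ranked:
        print(f"{name}: {risk:.3f}")
```

Working in logit space keeps the fused value a valid probability. It also lets a single bias term shift the overall risk level, which is one simple way to recalibrate the combined score.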
Benchmarking with SIA-Bench
To evaluate STARS, the authors built SIA-Bench, a benchmark of 3,000 invocation records. Key features of SIA-Bench include:
- Group-safe splits, which keep related records in the same partition so that they do not leak between training and test data.
- Lineage metadata for tracking skill origins and usage.
- Runtime context information describing the situation in which each invocation occurs.
- Canonical action labels that give each record a standardized action category.
- Derived continuous-risk targets that serve as numeric risk scores for each record.
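To make the listed features concrete, here is a minimal sketch of what one invocation record and a group-safe split might look like. The field names, the default label value, and the hash-based assignment are hypothetical; the actual SIA-Bench schema and splitting procedure may differ.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InvocationRecord:
    """Hypothetical record layout; field names are illustrative only."""
    record_id: str
    group_id: str                      # assumed grouping key, e.g. shared lineage
    request: str
    skill: str
    runtime_context: dict = field(default_factory=dict)
    action_label: str = "allow"        # placeholder label value
    risk_target: float = 0.0           # continuous risk in [0, 1]


def assign_split(group_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a whole group to one split via a hash bucket."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def group_safe_split(records, test_fraction: float = 0.2):
    """Split records so that no group appears in both train and test."""
    splits = {"train": [], "test": []}
    for rec in records:
        splits[assign_split(rec.group_id, test_fraction)].append(rec)
    return splits
```

Because the split is decided per group rather than per record, near-duplicate invocations that share a group cannot end up on both sides of the train/test boundary.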
Results and Findings
The headline results concern indirect prompt injection attacks. There, the calibrated risk-fusion approach achieved an Area Under the Precision-Recall Curve (AUPRC) of 0.439 on the high-risk class, ahead of the contextual scorer’s 0.405 and the strongest static baseline’s 0.380. The contextual scorer remained better calibrated, however, with an expected calibration error (ECE) of 0.289.
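For readers unfamiliar with the two metrics, the sketch below computes a step-wise AUPRC (average precision) and an equal-width-bin expected calibration error from binary high-risk labels and predicted risk scores. It shows the standard definitions only. The paper's exact evaluation code, binning scheme, and tie handling are not given here and may differ.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 0/1 values, where 1 marks a high-risk invocation.
    scores: predicted risk, higher meaning riskier.
    Tied scores are ordered arbitrarily in this simplified version.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            ap += hits / rank  # precision at each recalled positive
    return ap / total_pos


def expected_calibration_error(labels, scores, n_bins=10):
    """Weighted mean gap between predicted risk and observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, scores):
        idx = min(int(p * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_score = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        ece += (len(members) / n) * abs(frac_pos - avg_score)
    return ece
```

The two metrics answer different questions: AUPRC measures how well high-risk invocations are ranked above the rest, while calibration error measures whether a predicted score of, say, 0.3 corresponds to roughly 30% of such invocations being high-risk. A scorer can therefore lead on one and trail on the other.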
On a locked in-distribution test split, the gains were smaller and static priors remained useful. Request-conditioned auditing is therefore best viewed as an invocation-time layer that complements static screening rather than replacing it.
Conclusion and Future Work
STARS provides a framework for estimating skill-invocation risk at invocation time, intended to support ranking and triage before a hard intervention is applied. Future work will focus on refining the models and extending the framework to a broader range of user requests and contexts. The code is available at https://github.com/123zgj123/STARS.
