Policy-Invisible Violations in LLM-Based Agents
Summary: arXiv:2604.12177v1
Abstract: LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent’s visible context.
Introduction
As large language models (LLMs) are deployed more widely in organizational settings, compliance with organizational policy has become a concern. LLM-based agents can perform tasks that align with user intentions yet still breach established policies because they lack the contextual information needed to judge compliance. This article explains policy-invisible violations and summarizes two contributions of the paper: the PhantomPolicy benchmark, which measures the problem, and the Sentinel enforcement framework, which addresses it.
Understanding Policy-Invisible Violations
Policy-invisible violations occur when an LLM-based agent acts without access to information that determines whether the action is compliant. Such information might include:
- Entity attributes
- Contextual state
- Session history
These facts determine whether an action adheres to organizational policy. When they are missing from the agent’s context, the agent may take actions that look valid on the surface but do not conform to the organization’s rules or regulations.
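The failure mode can be illustrated with a minimal sketch. Everything in it is hypothetical and not taken from the paper: the `Action` type, the `share_file` tool, the entity names, and the rule "confidential documents must not be shared with external recipients" are assumptions chosen only to show how the same check gives different answers depending on which facts are visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical tool call proposed by the agent."""
    tool: str
    document: str
    recipient: str


# What the agent can see: clean business data with no policy metadata.
visible_context = {
    "doc-17": {"title": "Q3 forecast"},
    "dana@partner.example": {"name": "Dana"},
}

# Entity attributes that exist in the organization but never reach the agent.
hidden_attributes = {
    "doc-17": {"classification": "confidential"},
    "dana@partner.example": {"employment": "external"},
}


def violates_policy(action: Action, world: dict) -> bool:
    """Assumed rule: confidential documents must not go to external recipients."""
    doc = world.get(action.document, {})
    recipient = world.get(action.recipient, {})
    return (
        doc.get("classification") == "confidential"
        and recipient.get("employment") == "external"
    )


# Merge visible and hidden facts to obtain the full world state.
full_state = {
    key: {**visible_context[key], **hidden_attributes.get(key, {})}
    for key in visible_context
}

action = Action("share_file", "doc-17", "dana@partner.example")
verdict_from_agent_view = violates_policy(action, visible_context)
verdict_from_full_state = violates_policy(action, full_state)

print(verdict_from_agent_view)  # False: nothing visible suggests a problem
print(verdict_from_full_state)  # True: the hidden attributes reveal the violation
```

The action and the rule are identical in both calls; only the available facts differ, which is what makes the violation invisible from the agent’s side.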
The PhantomPolicy Benchmark
To measure policy-invisible violations, the researchers developed PhantomPolicy, a benchmark that categorizes violations into eight distinct types. It includes a balanced set of violation and safe-control cases, and all tool responses are derived from clean business data without any policy metadata, so tool output alone does not reveal the policy-relevant facts. The researchers manually reviewed 600 model traces generated by five leading models. The review found that:
- 32 labels (5.3%) changed relative to the original annotations.
This result indicates that trace-level human review is needed to obtain accurate compliance labels.
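The reported percentage follows directly from the two counts given above. The short check below uses only those figures; the variable names are illustrative.

```python
total_traces = 600    # model traces reviewed manually
changed_labels = 32   # labels that differed after review

relabel_rate = changed_labels / total_traces
unchanged_labels = total_traces - changed_labels
unchanged_rate = unchanged_labels / total_traces

print(f"changed:   {changed_labels} of {total_traces} = {relabel_rate:.1%}")
print(f"unchanged: {unchanged_labels} of {total_traces} = {unchanged_rate:.1%}")
# changed:   32 of 600 = 5.3%
# unchanged: 568 of 600 = 94.7%
```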
Introducing Sentinel: An Enforcement Framework
To enforce compliance, the researchers present Sentinel, a framework built on counterfactual graph simulation. Sentinel treats each action taken by an agent as a proposed change to an organizational knowledge graph. It speculatively executes the action to simulate the state of the graph that would result, which allows it to:
- Verify graph-structural invariants
- Decide on Allow, Block, or Clarify actions
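The propose, simulate, verify, decide loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not Sentinel’s actual implementation: the triple-based graph, the `lookup` helper, the single invariant, and the rule that unknown facts lead to Clarify are all hypothetical choices made for the example.

```python
from enum import Enum
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Graph = FrozenSet[Triple]       # hypothetical knowledge-graph encoding


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CLARIFY = "clarify"


# An invariant returns True (holds), False (violated), or None (facts missing).
Invariant = Callable[[Graph], Optional[bool]]


def lookup(graph: Graph, subject: str, predicate: str) -> Optional[str]:
    """Return the object of the first matching triple, or None if absent."""
    for s, p, o in graph:
        if s == subject and p == predicate:
            return o
    return None


def no_confidential_to_external(graph: Graph) -> Optional[bool]:
    """Assumed invariant: no confidential document is shared externally."""
    unknown = False
    for s, p, o in graph:
        if p != "shared_with":
            continue
        classification = lookup(graph, s, "classification")
        employment = lookup(graph, o, "employment")
        if classification is None or employment is None:
            unknown = True
        elif classification == "confidential" and employment == "external":
            return False
    return None if unknown else True


def decide(graph: Graph, proposed: Iterable[Triple],
           invariants: Iterable[Invariant]) -> Decision:
    # Speculative execution: build the post-action graph without
    # modifying the real one, then check invariants on the copy.
    speculative = graph | frozenset(proposed)
    results = [invariant(speculative) for invariant in invariants]
    if any(r is False for r in results):
        return Decision.BLOCK
    if any(r is None for r in results):
        return Decision.CLARIFY
    return Decision.ALLOW


graph: Graph = frozenset({
    ("doc-17", "classification", "confidential"),
    ("dana", "employment", "external"),
    ("lee", "employment", "internal"),
})
rules = [no_confidential_to_external]

to_external = decide(graph, [("doc-17", "shared_with", "dana")], rules)
to_internal = decide(graph, [("doc-17", "shared_with", "lee")], rules)
to_unknown = decide(graph, [("doc-17", "shared_with", "sam")], rules)

print(to_external, to_internal, to_unknown)
# Decision.BLOCK Decision.ALLOW Decision.CLARIFY
```

Because the check runs on a copy of the graph, a blocked action never touches the real state, and a missing fact surfaces as a request for clarification rather than a silent pass.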
Evaluated against human-reviewed trace labels, Sentinel achieved:
- 93.0% accuracy, compared with 68.8% for a content-only Data Loss Prevention (DLP) baseline.
Although Sentinel shows high precision, there remains potential for improvement in certain violation categories.
Conclusion
The findings indicate that enforcement mechanisms for LLM-based agents need access to policy-relevant world state. By addressing policy-invisible violations, organizations can improve their compliance and security posture and deploy AI agents more responsibly.
