The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment
arXiv:2604.12116v1
Abstract: Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two.
Models are evaluated across four normative regimes: Control, Gray, Dilemma, and Malicious, and three autonomy configurations: direct execution, planning, and reflection. Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth.
Introduction
Deploying large language models (LLMs) as tool-using agents raises questions about both operational effectiveness and safety. Existing benchmarks focus on textual alignment and task completion; this study instead examines how what an agent says relates to what it actually executes. It uses the A-R space to profile agent behavior across normative regimes and autonomy scaffolds.
Methodology
The A-R space is defined by two dimensions: Action Rate (A) and Refusal Signal (R). Action Rate measures how often a model executes a requested operation, while Refusal Signal measures how often the model signals refusal in its language output. Because the two are treated as separate dimensions, a model can signal refusal and still execute, or decline to execute without signaling refusal. Divergence (D) captures the coordination between the two dimensions, that is, how closely the linguistic signal tracks the executed behavior in a given context.
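The definitions above can be made concrete with a small sketch. The summary does not give formulas, so everything here is an assumption: the `Trial` fields are hypothetical, A and R are computed as simple per-trial proportions, and D is replaced by an illustrative stand-in (the share of trials in which the model signals refusal but still executes). The study's actual definition of D may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical record of one agent episode (field names are assumptions).
    executed: bool  # the agent issued the tool call
    refused: bool   # the response text contained a refusal signal


def ar_profile(trials):
    """Return (A, R, D) for a non-empty list of trials."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    action_rate = sum(t.executed for t in trials) / n
    refusal_signal = sum(t.refused for t in trials) / n
    # Assumed stand-in for D: trials that refuse in text yet still execute.
    divergence = sum(t.executed and t.refused for t in trials) / n
    return action_rate, refusal_signal, divergence


# Toy input for illustration only, not data from the study.
trials = [
    Trial(executed=True, refused=False),
    Trial(executed=True, refused=True),
    Trial(executed=False, refused=True),
    Trial(executed=False, refused=True),
]
print(ar_profile(trials))
```

Keeping `executed` and `refused` as separate flags, rather than deriving one from the other, is what lets the two rates vary independently.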
Evaluation Regimes
The models were tested under four normative regimes:
- Control: Standard operational parameters.
- Gray: Ambiguous situations requiring nuanced judgment.
- Dilemma: Scenarios presenting ethical or practical conflicts.
- Malicious: Environments where harmful actions are possible.
Autonomy Configurations
Additionally, the models were assessed under three autonomy configurations:
- Direct Execution: Immediate action without prior deliberation.
- Planning: Consideration of multiple steps before execution.
- Reflection: Analysis of past actions before decision-making.
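Crossing the four regimes with the three autonomy configurations gives twelve evaluation cells. The sketch below shows one way such a grid could be tabulated. The identifiers `REGIMES`, `SCAFFOLDS`, and `cell_rates`, and the record layout, are assumptions made for illustration and are not taken from the study.

```python
from itertools import product

# Labels follow the lists above; the lowercase spellings are an assumption.
REGIMES = ("control", "gray", "dilemma", "malicious")
SCAFFOLDS = ("direct", "planning", "reflection")


def cell_rates(records):
    """Aggregate (A, R) per (regime, scaffold) cell.

    records: iterable of (regime, scaffold, executed, refused) tuples.
    Cells with no records map to None. Unknown labels raise KeyError.
    """
    grid = {cell: [] for cell in product(REGIMES, SCAFFOLDS)}
    for regime, scaffold, executed, refused in records:
        grid[(regime, scaffold)].append((executed, refused))

    rates = {}
    for cell, outcomes in grid.items():
        if not outcomes:
            rates[cell] = None
            continue
        n = len(outcomes)
        action_rate = sum(e for e, _ in outcomes) / n
        refusal_signal = sum(r for _, r in outcomes) / n
        rates[cell] = (action_rate, refusal_signal)
    return rates


# Toy records for illustration only, not data from the study.
records = [
    ("malicious", "direct", True, False),
    ("malicious", "direct", False, True),
    ("malicious", "reflection", False, True),
]
rates = cell_rates(records)
print(len(rates), rates[("malicious", "direct")])
```

Reporting one (A, R) pair per cell, instead of averaging over the grid, matches the stated aim of describing how behavior redistributes rather than assigning an aggregate score.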
Findings
Empirical results indicate that execution and refusal are distinct behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection-based scaffolding tends to increase refusal in risk-laden contexts, but the pattern of redistribution differs from model to model.
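One way to read "redistribution" is as a displacement in the A-R plane when the scaffold changes within a fixed regime. The sketch below computes that displacement. The function name `scaffold_shift`, the profile layout, and all numbers are placeholders chosen for illustration; they are not results from the study.

```python
def scaffold_shift(profile, regime, baseline="direct", scaffold="reflection"):
    """Return (delta_A, delta_R) between two scaffolds in one regime.

    profile maps (regime, scaffold) -> (A, R).
    """
    a0, r0 = profile[(regime, baseline)]
    a1, r1 = profile[(regime, scaffold)]
    return a1 - a0, r1 - r0


# Placeholder profiles for two hypothetical models.
model_x = {
    ("malicious", "direct"): (0.5, 0.25),
    ("malicious", "reflection"): (0.25, 0.75),
}
model_y = {
    ("malicious", "direct"): (0.5, 0.25),
    ("malicious", "reflection"): (0.5, 0.75),
}

# Both hypothetical models refuse more under reflection, but only model_x
# also executes less: the same rise in R with a different shift in A.
print(scaffold_shift(model_x, "malicious"))
print(scaffold_shift(model_y, "malicious"))
```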
Conclusion
This research argues for execution-layer characterization when selecting and deploying tool-enabled LLM agents. Instead of collapsing behavior into a scalar safety score, the A-R representation shows how execution and refusal shift across contextual framing and scaffold depth, which gives organizations a more specific basis for assessing deployment risk in a given operational context.
