Profiling Tool-Using Language Model Agents in Organizations

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

Summary: arXiv:2604.12116v1

Abstract: Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two.

Models are evaluated across four normative regimes: Control, Gray, Dilemma, and Malicious, and three autonomy configurations: direct execution, planning, and reflection. Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth.

Introduction

Deploying large language models (LLMs) as tool-using agents raises questions about both operational effectiveness and safety, because an agent's output can trigger system-level operations rather than only produce text. Existing benchmarks focus on textual alignment and task completion. This research instead examines how what a model says relates to what it executes. Using the A-R space framework, the study profiles how LLM agents behave across normative regimes and autonomy scaffolds.

Methodology

The A-R space is defined by two primary dimensions: Action Rate (A) and Refusal Signal (R). The Action Rate measures how often a model actually executes a requested operation, while the Refusal Signal measures how often its output signals refusal. Because one is observed in executed behavior and the other in language, the two can disagree. Divergence (D) captures how well the two are coordinated, which gives insight into the agent's decision-making in different contexts.
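To make the three quantities concrete, here is a minimal Python sketch. It is not the paper's implementation: the Trial record, the profile function, and in particular the formula for D are assumptions made for illustration. The summary above does not give the exact definition of Divergence, so D is taken here to be the share of trials where language and execution disagree (the agent refuses yet executes, or neither refuses nor executes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One agent run (hypothetical record layout)."""
    executed: bool  # the agent actually issued the tool call
    refused: bool   # the agent's text signaled refusal


def profile(trials):
    """Return Action Rate, Refusal Signal, and an assumed Divergence."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    action_rate = sum(t.executed for t in trials) / n
    refusal_signal = sum(t.refused for t in trials) / n
    # ASSUMPTION: divergence = fraction of trials where words and actions
    # are uncoordinated (refuse-and-execute, or no refusal and no execution).
    divergence = sum(t.executed == t.refused for t in trials) / n
    return {"A": action_rate, "R": refusal_signal, "D": divergence}


trials = [
    Trial(executed=True, refused=False),   # complies and says so
    Trial(executed=False, refused=True),   # refuses and does not act
    Trial(executed=True, refused=True),    # says no, acts anyway
    Trial(executed=False, refused=False),  # neither acts nor refuses
]
result = profile(trials)
```

Keeping A and R as separate numbers, instead of folding them into one score, is what lets a profile expose cases such as the third trial, where the text refuses but the tool call still runs.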

Evaluation Regimes

The models were tested under four normative regimes:

Control: Standard operational parameters.
Gray: Ambiguous situations requiring nuanced judgment.
Dilemma: Scenarios presenting ethical or practical conflicts.
Malicious: Environments where harmful actions are possible.

Autonomy Configurations

Additionally, the models were assessed under three autonomy configurations:

Direct Execution: Immediate action without prior deliberation.
Planning: Consideration of multiple steps before execution.
Reflection: Analysis of past actions before decision-making.
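Crossing the four regimes with the three autonomy configurations gives twelve evaluation conditions. The sketch below shows one way such a grid could be enumerated and aggregated per cell. The lowercase labels, the record tuple layout, and the function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import product

# Labels follow the regime and configuration names in the text;
# the lowercase spellings are an assumption of this sketch.
REGIMES = ("control", "gray", "dilemma", "malicious")
SCAFFOLDS = ("direct", "planning", "reflection")


def conditions():
    """All (regime, scaffold) pairs in the evaluation grid."""
    return list(product(REGIMES, SCAFFOLDS))


def rates_by_cell(records):
    """Aggregate A and R per grid cell.

    records: iterable of (regime, scaffold, executed, refused) tuples.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # trials, executed, refused
    for regime, scaffold, executed, refused in records:
        cell = counts[(regime, scaffold)]
        cell[0] += 1
        cell[1] += int(executed)
        cell[2] += int(refused)
    return {
        key: {"A": e / n, "R": r / n}
        for key, (n, e, r) in counts.items()
    }


# Toy records for illustration only, not data from the paper.
records = [
    ("malicious", "direct", True, False),
    ("malicious", "direct", False, True),
    ("control", "direct", True, False),
]
rates = rates_by_cell(records)
```

Reporting one (A, R) pair per cell, rather than a single aggregate, matches the stated aim of characterizing how behavior redistributes across framing and scaffold depth.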

Findings

Empirical results highlight that execution and refusal behaviors are distinct dimensions. Their joint distribution varies systematically across different regimes and autonomy levels. Notably, reflection-based scaffolding tends to increase refusal rates in risk-laden contexts, yet the patterns of redistribution differ among the models analyzed.
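One simple way to express "redistribution" is as the change in a model's (A, R) point when the scaffold changes while the regime stays fixed. The following sketch assumes that reading. The shift function and the numbers fed to it are made-up placeholders, not values reported in the paper.

```python
def shift(baseline, scaffolded):
    """Change in (A, R) when moving from one scaffold to another."""
    return {key: scaffolded[key] - baseline[key] for key in ("A", "R")}


# Placeholder profiles for one hypothetical model in one regime.
direct = {"A": 0.75, "R": 0.25}
reflection = {"A": 0.5, "R": 0.5}
delta = shift(direct, reflection)
```

A positive change in R paired with a negative change in A would correspond to the pattern described above for reflection in risk-laden contexts. Computing the shift per model makes it possible to compare how differently models redistribute.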

Conclusion

This research underscores the importance of execution-layer characterization when selecting and deploying tool-enabled LLM agents. Because the A-R representation reports execution and refusal separately instead of collapsing them into a scalar safety score, organizations can see how a given model's behavior shifts with context and scaffold, and manage deployment risk accordingly.
