Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Summary: arXiv:2604.19457v1 Announce Type: new
Abstract: Long-horizon enterprise agents make high-stakes decisions in various sectors such as loan underwriting, claims adjudication, clinical review, and prior authorization. These decisions are often made under conditions of lossy memory, multi-step reasoning, and binding regulatory constraints. The current evaluation methods predominantly report a single task-success scalar, which tends to conflate different failure modes and obscures the extent to which an agent aligns with the standards required in its deployment environment.
In response, we propose a framework that decomposes long-horizon decision behavior into four orthogonal alignment axes, each of which can be measured independently and can fail independently:
- Factual Precision (FRP): This axis measures the accuracy of the factual information utilized by the agent.
- Reasoning Coherence (RCS): This axis evaluates the logical consistency of the agent’s decision-making process.
- Compliance Reconstruction (CRR): A new regulatory-grounded axis that assesses an agent’s adherence to established guidelines.
- Calibrated Abstention (CAR): This axis separates decision coverage (how often the agent commits to a decision) from accuracy on the decisions it commits to.
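The separation of coverage from accuracy can be illustrated with a minimal Python sketch. The abstract does not give the exact CAR formula, so this uses the standard selective-prediction quantities as an assumption; the `Decision` type and the function name are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Decision:
    """One case outcome (hypothetical record type)."""
    predicted: Optional[str]  # None means the agent abstained
    truth: str


def coverage_and_selective_accuracy(
    decisions: Sequence[Decision],
) -> Tuple[float, Optional[float]]:
    """Return (coverage, accuracy on committed cases).

    Coverage is the fraction of cases where the agent committed.
    Selective accuracy is None when the agent abstained on every case.
    """
    if not decisions:
        raise ValueError("need at least one decision")
    committed = [d for d in decisions if d.predicted is not None]
    coverage = len(committed) / len(decisions)
    if not committed:
        return coverage, None
    correct = sum(d.predicted == d.truth for d in committed)
    return coverage, correct / len(committed)


# Toy cases invented for illustration only.
toy = [
    Decision("approve", "approve"),
    Decision("deny", "approve"),
    Decision(None, "deny"),
    Decision("approve", "approve"),
]
print(coverage_and_selective_accuracy(toy))
```

An agent that commits on every case has coverage 1.0 by construction, so its selective accuracy equals its plain accuracy and carries no information about whether it knows when to abstain.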
We exercise these axes on a controlled benchmark, LongHorizon-Bench, which covers scenarios such as loan qualification and insurance claims adjudication and uses deterministic ground-truth construction for rigorous evaluation. Experiments with six memory architectures yield several findings:
- Aggregate accuracy conceals axis-level failures; for instance, retrieval collapses on factual precision.
- Schema-anchored architectures incur a scaffolding tax that affects their performance.
- Plain summarization under a fact-preservation prompt emerges as a robust baseline across multiple axes, including FRP, RCS, EDA, and CRR.
- All six architectures committed to a decision on every case and never abstained, which exposes a decisional-alignment axis that existing literature has not adequately addressed.
The decomposition also overturned a pre-registered prediction: we expected summarization to fail on factual recall, but the results contradicted this expectation by a large margin. An aggregate accuracy measure would have concealed this axis-level reversal.
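How a single scalar can hide an axis-level reversal is easy to see in a small sketch. The scores below are invented for illustration and are not results from the paper; the unweighted mean is only a stand-in for an aggregate metric, and the agent names are hypothetical.

```python
from statistics import mean
from typing import Dict

AXES = ("FRP", "RCS", "CRR", "CAR")


def aggregate(scores: Dict[str, float]) -> float:
    """Collapse per-axis scores into one scalar (unweighted mean, an assumption)."""
    return mean(scores[axis] for axis in AXES)


def axis_gaps(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Per-axis difference a - b, which the scalar above discards."""
    return {axis: a[axis] - b[axis] for axis in AXES}


# Invented toy scores, not measurements.
agent_x = {"FRP": 0.9, "RCS": 0.5, "CRR": 0.7, "CAR": 0.5}
agent_y = {"FRP": 0.5, "RCS": 0.9, "CRR": 0.5, "CAR": 0.7}

print(aggregate(agent_x), aggregate(agent_y))
print(axis_gaps(agent_x, agent_y))
```

The two toy agents receive the same aggregate score while differing in sign on every axis, which is the kind of reversal only a per-axis report makes visible.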
We note that both institutional alignment (related to regulatory reconstruction) and decisional alignment (associated with calibrated abstention) are under-represented in the current alignment literature. These dimensions become crucial once decisions extend beyond laboratory settings. Our proposed framework is adaptable to any regulated decision-making domain through a straightforward two-step process: first, construct a fact schema, and second, calibrate the CRR auditor prompt.
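The two-step adaptation could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `LoanFacts` fields, the placeholder rules, and the prompt wording are hypothetical and do not come from the paper or from any actual regulation.

```python
from dataclasses import dataclass, fields
from typing import Sequence


# Step 1: a fact schema for the domain (field names are hypothetical).
@dataclass(frozen=True)
class LoanFacts:
    annual_income: float
    monthly_debt: float
    credit_score: int


# Step 2: a CRR auditor prompt built from the schema and a rule list.
def build_auditor_prompt(schema: type, rules: Sequence[str]) -> str:
    """Render an auditor prompt that names the schema's facts and the rules to check."""
    fact_names = ", ".join(f.name for f in fields(schema))
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    return (
        "You are a compliance auditor.\n"
        f"Facts available: {fact_names}\n"
        "Rules to check:\n"
        f"{rule_lines}\n"
        "For each rule, state whether the decision record satisfies it."
    )


# Placeholder rules, invented for illustration.
placeholder_rules = [
    "The decision cites the income figure it relied on.",
    "The decision states the reason for any denial.",
]
print(build_auditor_prompt(LoanFacts, placeholder_rules))
```

Keeping the schema as a typed record means the same field list can drive both ground-truth construction and the auditor prompt, so the two steps cannot drift apart.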
In conclusion, our research provides a comprehensive approach to understanding and measuring decision alignment in long-horizon enterprise AI agents, paving the way for more reliable and compliant AI systems in high-stakes environments.
