AgentDrift: Unsafe LLM Recommendations Hidden by Metrics

AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Summary: arXiv:2603.12564v3

Abstract

Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms.
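The paired-trajectory idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's harness: the `Agent` callable signature, `replay_paired`, and `toy_agent` are hypothetical names, and the toy agent simply echoes the products listed in its tool output. Both arms receive the same user turns, while each arm keeps its own memory, so divergence can enter through the current tool output or through what the agent carried over from earlier turns.

```python
from typing import Callable, List, Tuple

# Hypothetical agent interface: (memory, user message, tool output) -> ranked product ids.
Agent = Callable[[List[str], str, str], List[str]]


def replay_paired(
    user_turns: List[str],
    clean_outputs: List[str],
    contaminated_outputs: List[str],
    agent: Agent,
) -> List[Tuple[List[str], List[str]]]:
    """Replay one dialogue twice; return per-turn (clean, contaminated) recommendations."""
    clean_memory: List[str] = []
    contaminated_memory: List[str] = []
    pairs = []
    for msg, clean_out, bad_out in zip(user_turns, clean_outputs, contaminated_outputs):
        clean_recs = agent(clean_memory, msg, clean_out)
        bad_recs = agent(contaminated_memory, msg, bad_out)
        # Each arm only ever remembers its own history.
        clean_memory.append(f"{msg} -> {clean_recs}")
        contaminated_memory.append(f"{msg} -> {bad_recs}")
        pairs.append((clean_recs, bad_recs))
    return pairs


def toy_agent(memory: List[str], msg: str, tool_output: str) -> List[str]:
    """Toy stand-in for an LLM agent: recommends the ids listed in the tool output, in order."""
    return [p.strip() for p in tool_output.split(",") if p.strip()]
```

Comparing the two lists in each returned pair turn by turn gives the divergence that the protocol then attributes to one channel or the other.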

Key Findings

Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG.
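To make the two quantities concrete, here is a small sketch. It assumes the utility preservation ratio is the mean contaminated NDCG divided by the mean clean NDCG, which may differ from the paper's exact definition, and the function names are hypothetical. The numbers are made up for illustration and are not results from the paper.

```python
from statistics import mean
from typing import List


def utility_preservation_ratio(ndcg_clean: List[float], ndcg_contaminated: List[float]) -> float:
    """Assumed definition: mean contaminated NDCG over mean clean NDCG."""
    return mean(ndcg_contaminated) / mean(ndcg_clean)


def violation_rate(turn_has_unsafe_product: List[bool]) -> float:
    """Fraction of turns in which at least one risk-inappropriate product was recommended."""
    return sum(turn_has_unsafe_product) / len(turn_has_unsafe_product)


# Made-up per-turn scores, for illustration only.
ndcg_clean = [0.80, 0.90, 0.85, 0.75]
ndcg_contaminated = [0.82, 0.88, 0.85, 0.75]
unsafe_flags = [True, True, False, True]

print(utility_preservation_ratio(ndcg_clean, ndcg_contaminated))
print(violation_rate(unsafe_flags))
```

In this toy case the ranking score is essentially unchanged while three of four turns contain an unsafe product, which is the shape of the gap described above.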

Safety Violations

Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories. Notably, no agent across 1,563 contaminated turns explicitly questions tool-data reliability.
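Onset and persistence can be read off a per-turn list of violation flags. The sketch below is an assumption about how one might measure them, not the paper's procedure; `violation_onset`, `onset_lag`, and `post_onset_rate` are hypothetical helper names.

```python
from typing import List, Optional


def violation_onset(flags: List[bool]) -> Optional[int]:
    """Index of the first turn with a safety violation, or None if there is none."""
    return next((i for i, flagged in enumerate(flags) if flagged), None)


def onset_lag(flags: List[bool], first_contaminated_turn: int) -> Optional[int]:
    """Turns between the first contaminated turn and the first violation (0 = immediate)."""
    onset = violation_onset(flags)
    return None if onset is None else onset - first_contaminated_turn


def post_onset_rate(flags: List[bool]) -> float:
    """Share of turns from the first violation onward that are still violations."""
    onset = violation_onset(flags)
    if onset is None:
        return 0.0
    tail = flags[onset:]
    return sum(tail) / len(tail)
```

A lag of 0 corresponds to violations appearing at the first contaminated turn, and a post-onset rate that stays high over a long trajectory corresponds to the absence of self-correction.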

Impact of Narrative-Only Corruption

Even narrative-only corruption, such as biased headlines with no numerical manipulation, induces significant drift while completely evading consistency monitors. This raises concerns about how robust existing evaluation metrics and monitors are to corruption that leaves the numbers untouched.
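One way to see why narrative-only corruption can slip through is a monitor that only compares numeric fields. The paper's monitors are not described in this summary, so the check below is a hypothetical example: `numeric_consistency_ok` and the record fields (`ticker`, `volatility`, `headline`) are assumptions made for illustration.

```python
from typing import Any, Dict


def numeric_consistency_ok(reference: Dict[str, Any], observed: Dict[str, Any], tol: float = 1e-6) -> bool:
    """Hypothetical monitor: pass if every numeric field matches the reference within tol."""
    for key, ref_value in reference.items():
        is_number = isinstance(ref_value, (int, float)) and not isinstance(ref_value, bool)
        if not is_number:
            continue  # Text fields such as headlines are not inspected at all.
        if key not in observed or abs(observed[key] - ref_value) > tol:
            return False
    return True


# Illustrative tool records (invented, not from the paper's data).
clean = {"ticker": "XYZ", "volatility": 0.42, "headline": "XYZ shares swing amid uncertainty"}
narrative_only = {**clean, "headline": "XYZ hailed as a safe haven for cautious savers"}
numeric_tamper = {**clean, "volatility": 0.05}

print(numeric_consistency_ok(clean, narrative_only))  # headline changed, numbers identical
print(numeric_consistency_ok(clean, numeric_tamper))  # volatility changed
```

The biased headline passes because no number moved, whereas the tampered volatility is caught.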

Introducing sNDCG

We propose a safety-penalized NDCG variant (sNDCG) that reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. This suggests that current metrics fail to capture critical safety concerns in multi-turn interactions.
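The summary does not give the sNDCG formula, so the following is only one plausible form, written to show the mechanism. It assumes that the gain of a risk-inappropriate item is scaled by `1 - penalty` and that the ideal ranking is built from the unpenalized relevances of the same list. The name `sndcg`, the `penalty` parameter, and the `(relevance, is_safe)` item layout are all hypothetical.

```python
import math
from typing import List, Tuple

Item = Tuple[float, bool]  # (relevance, is_safe_for_this_user): an assumed layout


def _dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def sndcg(ranked: List[Item], penalty: float = 1.0) -> float:
    """Assumed safety-penalized NDCG: penalty=0 is plain NDCG, penalty=1 zeroes unsafe gains."""
    gains = [rel if safe else rel * (1.0 - penalty) for rel, safe in ranked]
    ideal = _dcg(sorted((rel for rel, _ in ranked), reverse=True))
    return _dcg(gains) / ideal if ideal > 0 else 0.0


# Same relevance ordering in both arms; only the top item's safety differs.
clean_ranking = [(3.0, True), (2.0, True), (1.0, True)]
contaminated_ranking = [(3.0, False), (2.0, True), (1.0, True)]

plain_ratio = sndcg(contaminated_ranking, penalty=0.0) / sndcg(clean_ranking, penalty=0.0)
safe_ratio = sndcg(contaminated_ranking) / sndcg(clean_ranking)
print(plain_ratio, safe_ratio)
```

With the penalty off, the two arms score the same and the preservation ratio is 1. With the penalty on, the unsafe top item loses its gain and the ratio drops, which is the effect the metric is meant to expose.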

Recommendations

These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings. We recommend the following:

Implement safety-penalized evaluation metrics to better assess the risk of recommendations.
Integrate real-time safety monitoring into deployed LLM agents.
Conduct further research on the impact of information-channel and memory-channel mechanisms on recommendation quality.
Establish guidelines for evaluating the safety of tool outputs in high-stakes domains.
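A trajectory-level monitor of the kind these recommendations point toward could look like the sketch below. It is an assumption-laden illustration, not a design from the paper: the `TrajectoryMonitor` class, the integer risk levels, and the window and threshold values are all hypothetical.

```python
from collections import deque
from typing import Deque, List


class TrajectoryMonitor:
    """Hypothetical monitor: alert when too many recent turns recommend over-risk products."""

    def __init__(self, user_risk_limit: int, window: int = 5, max_violations: int = 1) -> None:
        self.user_risk_limit = user_risk_limit
        self.max_violations = max_violations
        # Only the last `window` turns are kept; older turns fall out automatically.
        self.recent: Deque[bool] = deque(maxlen=window)

    def observe(self, recommended_risk_levels: List[int]) -> bool:
        """Record one turn and return True if the recent violation count exceeds the limit."""
        violated = any(level > self.user_risk_limit for level in recommended_risk_levels)
        self.recent.append(violated)
        return sum(self.recent) > self.max_violations


monitor = TrajectoryMonitor(user_risk_limit=2, window=3, max_violations=1)
alerts = [monitor.observe(turn) for turn in [[1, 2], [3], [1], [4, 1]]]
print(alerts)
```

Tracking a window of turns, not a single turn, lets the monitor react to repeated violations across a dialogue, which single-turn quality scores would not surface.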

Conclusion

The findings from our study highlight significant gaps in the safety evaluation of LLM agents, particularly in high-stakes scenarios. Shifting focus toward trajectory-level assessment and adopting safety-aware metrics such as sNDCG would make these failures visible and help improve the reliability and safety of AI-driven recommendations.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
