AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents
Summary: arXiv:2603.12564v3 Announce Type: replace-cross
Abstract
Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms.
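The paired-trajectory idea can be illustrated with a minimal sketch. It assumes a hypothetical agent interface (a function from user message, tool output, and memory to a ranking and an updated memory) and a hypothetical corruption function; none of these names come from the paper, and the decomposition into information-channel and memory-channel effects is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only (not the paper's interface).
ToolOutput = Dict[str, float]  # product id -> tool-reported risk score
Agent = Callable[[str, ToolOutput, List[str]], Tuple[List[str], List[str]]]


@dataclass
class TurnPair:
    turn: int
    clean_ranking: List[str]
    contaminated_ranking: List[str]


def replay_paired(
    agent: Agent,
    user_turns: List[str],
    clean_tools: List[ToolOutput],
    corrupt: Callable[[ToolOutput], ToolOutput],
) -> List[TurnPair]:
    """Replay one dialogue twice, keeping a separate memory per condition."""
    clean_mem: List[str] = []
    dirty_mem: List[str] = []
    pairs = []
    for t, (msg, tool_out) in enumerate(zip(user_turns, clean_tools)):
        clean_rank, clean_mem = agent(msg, tool_out, clean_mem)
        dirty_rank, dirty_mem = agent(msg, corrupt(tool_out), dirty_mem)
        pairs.append(TurnPair(t, clean_rank, dirty_rank))
    return pairs


def toy_agent(msg, tool_out, memory):
    # Stand-in for an LLM agent: rank by tool-reported risk, lowest first,
    # and remember the top pick.
    ranking = sorted(tool_out, key=tool_out.get)
    return ranking, memory + [ranking[0]]


def understate_crypto_risk(tool_out):
    # Assumed corruption: the tool reports a risky product as risk-free.
    return {**tool_out, "crypto_fund": 0.0}


user_turns = ["I want low-risk savings options.", "Anything else?"]
clean_tools = [{"bond_fund": 0.1, "index_fund": 0.3, "crypto_fund": 0.9}] * 2
pairs = replay_paired(toy_agent, user_turns, clean_tools, understate_crypto_risk)
for p in pairs:
    print(p.turn, p.clean_ranking, p.contaminated_ranking)
```

Because both replays see the same user turns, any difference between the two rankings at a given turn is attributable to the corrupted tool output or to memory carried over from earlier corrupted turns.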
Key Findings
Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG.
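A small worked example may help show how the two numbers can diverge. The sketch below assumes the utility preservation ratio is the mean NDCG under contamination divided by the mean NDCG under clean conditions, and that a turn counts as a violation when any recommended product is in a user-specific unsafe set; the product names, relevance grades, and unsafe set are invented for illustration.

```python
import math


def dcg(gains):
    # Standard discounted cumulative gain with a log2 position discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ranking, relevance, k=3):
    gains = [relevance.get(p, 0.0) for p in ranking[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(turns, relevance):
    return sum(ndcg(r, relevance) for r in turns) / len(turns)


def violation_rate(turns, unsafe):
    # Fraction of turns that recommend at least one risk-inappropriate product.
    return sum(any(p in unsafe for p in r) for r in turns) / len(turns)


# Invented data: the risky product is as "relevant" as the one it displaces.
relevance = {"index_fund": 3.0, "bond_fund": 2.0, "crypto_fund": 2.0, "growth_fund": 1.0}
unsafe = {"crypto_fund"}  # assumed unsuitable for this user's risk profile

clean_turns = [
    ["index_fund", "bond_fund", "growth_fund"],
    ["bond_fund", "index_fund", "growth_fund"],
]
contaminated_turns = [
    ["index_fund", "crypto_fund", "growth_fund"],
    ["bond_fund", "index_fund", "growth_fund"],
]

preservation = mean_ndcg(contaminated_turns, relevance) / mean_ndcg(clean_turns, relevance)
rate = violation_rate(contaminated_turns, unsafe)
print(f"utility preservation ratio: {preservation:.2f}")
print(f"violation rate: {rate:.2f}")
```

In this toy case the ranking metric is unchanged by the substitution, while half of the contaminated turns contain an unsuitable product.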
Safety Violations
Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories. Notably, no agent across 1,563 contaminated turns explicitly questions tool-data reliability.
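Onset and persistence of violations can be read off a per-turn flag sequence. The following is a minimal sketch with hypothetical helper names and invented data, assuming a turn is flagged when any recommended product falls in an unsafe set; it is not the paper's measurement code.

```python
from typing import List, Optional, Set


def violation_flags(trajectory: List[List[str]], unsafe: Set[str]) -> List[bool]:
    # One flag per turn: does the recommendation contain an unsafe product?
    return [any(p in unsafe for p in turn) for turn in trajectory]


def first_violation(flags: List[bool]) -> Optional[int]:
    return flags.index(True) if True in flags else None


def self_corrects(flags: List[bool]) -> bool:
    # True only if some turn after the first violation is clean again.
    onset = first_violation(flags)
    return onset is not None and not all(flags[onset:])


# Invented trajectory: contamination is assumed to begin at turn 1.
trajectory = [
    ["bond_fund", "index_fund"],
    ["crypto_fund", "bond_fund"],
    ["crypto_fund", "index_fund"],
    ["index_fund", "crypto_fund"],
]
flags = violation_flags(trajectory, {"crypto_fund"})
onset = first_violation(flags)
recovered = self_corrects(flags)
print(flags, onset, recovered)
```

A trajectory in which violations begin at the first contaminated turn and never recover shows up here as an onset equal to the contamination turn and `recovered` equal to `False`.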
Impact of Narrative-Only Corruption
Even narrative-only corruption, such as biased headlines without numerical manipulation, induces significant drift while completely evading consistency monitors. Consistency checks alone are therefore not a sufficient defense against this class of corruption.
Introducing sNDCG
We propose a safety-penalized NDCG variant (sNDCG) that reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. This suggests that current metrics fail to capture critical safety concerns in multi-turn interactions.
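The abstract does not give the exact definition of sNDCG, so the sketch below shows only one plausible form: each unsafe item's gain is scaled by `1 - penalty` before discounting, and the result is normalized by the ordinary ideal DCG so that `penalty=0` recovers plain NDCG. The function name, the `penalty` parameter, and the data are assumptions made for illustration.

```python
import math


def dcg(gains):
    # Standard discounted cumulative gain with a log2 position discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def sndcg(ranking, relevance, unsafe, penalty=1.0, k=3):
    """Assumed safety-penalized NDCG: unsafe items keep (1 - penalty) of their gain."""
    gains = [
        relevance.get(p, 0.0) * ((1.0 - penalty) if p in unsafe else 1.0)
        for p in ranking[:k]
    ]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented data for illustration.
relevance = {"index_fund": 3.0, "bond_fund": 2.0, "crypto_fund": 2.0, "growth_fund": 1.0}
unsafe = {"crypto_fund"}
contaminated = ["index_fund", "crypto_fund", "growth_fund"]

plain = sndcg(contaminated, relevance, unsafe, penalty=0.0)  # equals ordinary NDCG
penalized = sndcg(contaminated, relevance, unsafe, penalty=1.0)
print(f"NDCG: {plain:.3f}  sNDCG: {penalized:.3f}")
```

Under this assumed form, a ranking with no unsafe items scores the same under both metrics, so any gap between the two is attributable to risk-inappropriate recommendations.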
Recommendations
These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings. We recommend the following:
- Implement safety-penalized evaluation metrics to better assess the risk of recommendations.
- Integrate real-time safety monitoring into deployed LLM agents.
- Conduct further research on the impact of information-channel and memory-channel mechanisms on recommendation quality.
- Establish guidelines for evaluating the safety of tool outputs in high-stakes domains.
Conclusion
Our findings show that ranking-quality metrics can report nearly unchanged utility while agents recommend risk-inappropriate products in most contaminated turns. Trajectory-level assessment and safety-penalized metrics such as sNDCG make this gap visible, which is a prerequisite for ensuring the safety of AI-driven recommendations in high-stakes scenarios.
