Why Retrying Fails: Context Contamination in LLM Agent Pipelines
Source: arXiv:2605.08563v1
Abstract: Context contamination is a significant barrier to reliable outcomes in multi-step, tool-augmented tasks with Large Language Models (LLMs). When an LLM agent fails and then retries a task, the context left over from the failed attempt raises the per-step error rate on the retry. We introduce the Context-Contaminated Restart Model (CCRM), a theoretical framework for quantifying and analyzing this effect.
Understanding Context Contamination
Context contamination occurs when an LLM agent carries information from a failed attempt into its next try. The retained context can mislead the model and raise the probability that the retry also fails. CCRM formalizes this effect and analyzes its consequences for retry strategies.
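The difference between a contaminated retry and a clean restart can be sketched as a retry loop. This is a minimal illustration, not code from the paper: the `Agent` interface, `run_with_retries`, and the `clean_restart` flag are hypothetical names introduced here.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical agent interface (an assumption for illustration): it receives the
# task and the context carried over from earlier attempts, and returns
# (succeeded, transcript_of_this_attempt).
Agent = Callable[[str, List[str]], Tuple[bool, str]]


def run_with_retries(agent: Agent, task: str, max_attempts: int,
                     clean_restart: bool) -> Optional[int]:
    """Return the 1-based number of the attempt that succeeded, or None."""
    context: List[str] = []
    for attempt in range(1, max_attempts + 1):
        succeeded, transcript = agent(task, context)
        if succeeded:
            return attempt
        if clean_restart:
            # Clean restart: discard everything produced by the failed attempt.
            context = []
        else:
            # Contaminated retry: the failed transcript stays in the context.
            context = context + [transcript]
    return None
```

The only difference between the two strategies is what happens to `context` after a failure, which is the quantity the model treats as the source of elevated error on retries.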
Key Results of the Context-Contaminated Restart Model
Our analysis yields five main results:
- Result 1 (R1): We present an exact closed-form formula for the probability of succeeding within a maximum of K attempts, incorporating the effects of context contamination.
- Result 2 (R2): A cascade-overhead theorem quantifying the additional attempts, ΔK, required due to contamination, compared to a clean-restart baseline.
- Result 3 (R3): An optimal budget-allocation theorem that identifies the pipeline depth T* that maximizes success probability for a fixed total budget B = KT. We derive the closed form T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))), with K* = B/T*.
- Result 4 (R4): An information-theoretic lower bound established via Le Cam’s method, demonstrating that K_CCRM remains tight up to O(1).
- Result 5 (R5): A clean-restart dominance theorem that quantifies the benefits of clearing context before a retry attempt.
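To make the quantities in these results concrete, here is a small sketch under stated assumptions. The summary does not reproduce the exact R1 formula, so `success_within_k` uses a simplified stand-in: the first attempt runs T independent steps with per-step error epsilon_0, and every attempt after a failure runs with per-step error epsilon_1. That reading of epsilon_0 and epsilon_1 is an assumption, and the function names are hypothetical. `optimal_depth` only transcribes the T* expression stated in R3; it is not derived from the simplified stand-in.

```python
import math


def attempt_success(eps: float, depth: int) -> float:
    """Probability that all `depth` steps succeed at per-step error `eps`."""
    return (1.0 - eps) ** depth


def success_within_k(eps0: float, eps1: float, depth: int, k: int) -> float:
    """Simplified stand-in for R1 (an assumption, not the paper's formula):
    clean first attempt at eps0, contaminated retries at eps1."""
    if k <= 0:
        return 0.0
    p_first = attempt_success(eps0, depth)
    p_retry = attempt_success(eps1, depth)
    # Fail the first attempt, then fail each of the k - 1 contaminated retries.
    return 1.0 - (1.0 - p_first) * (1.0 - p_retry) ** (k - 1)


def optimal_depth(budget: float, eps0: float, eps1: float) -> float:
    """T* = sqrt(B * log(1/(1-eps1)) / log(1/(1-eps0))), as stated in R3."""
    if not (0.0 < eps0 < 1.0 and 0.0 < eps1 < 1.0):
        raise ValueError("error rates must lie strictly between 0 and 1")
    return math.sqrt(
        budget * math.log(1.0 / (1.0 - eps1)) / math.log(1.0 / (1.0 - eps0))
    )
```

Setting `eps1 == eps0` recovers the clean-restart (IID) baseline, so comparing the two calls gives a rough sense of the gap that R2 and R5 quantify.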
Empirical Validation of CCRM
To validate our model, we applied CCRM to data from the SWE-bench Verified dataset. The IID model substantially overestimates the pass rate at three attempts, projecting a success rate of 98.6%, whereas CCRM fits the observed performance with an error of less than 0.001. This discrepancy implies a cascade ratio of epsilon_1/epsilon_0 = 7.1, which indicates that context contamination has a large effect on retry performance.
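The IID projection can be unpacked with a short calculation. Assuming the IID model is the usual independent-attempts formula 1 - (1 - p)^K with a constant single-attempt pass rate p (an assumption; the summary does not spell it out), the sketch below backs out the p that a 98.6% projection at three attempts corresponds to. The function names are hypothetical.

```python
def iid_pass_at_k(p: float, k: int) -> float:
    """Pass rate within k attempts if every attempt succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** k


def implied_single_attempt_rate(pass_at_k: float, k: int) -> float:
    """Invert the IID formula: the p for which iid_pass_at_k(p, k) equals pass_at_k."""
    return 1.0 - (1.0 - pass_at_k) ** (1.0 / k)


p_single = implied_single_attempt_rate(0.986, 3)
print(round(p_single, 3))  # about 0.759
```

Under that assumption, the 98.6% figure follows from treating each retry as being as good as the first attempt, which is the premise contamination undermines.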
Monte Carlo Experiments
We ran a series of Monte Carlo simulations to corroborate the theoretical predictions. These experiments consistently showed that the contamination effects modeled by CCRM align closely with observed behavior in LLM pipelines, which supports both the theoretical soundness of the model and its practical relevance for LLM agents.
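A simulation of this kind can be sketched in a few lines. The setup below is a simplified assumption rather than the paper's experimental protocol: each attempt consists of `depth` independent steps, the first attempt uses per-step error `eps0`, and every attempt after a failure uses `eps1`. The function names and the parameter values in the usage lines are illustrative only.

```python
import random


def simulate_attempt(rng: random.Random, eps: float, depth: int) -> bool:
    """One attempt succeeds only if every step avoids an error."""
    return all(rng.random() >= eps for _ in range(depth))


def simulate_success_rate(eps0: float, eps1: float, depth: int, k: int,
                          trials: int, seed: int = 0) -> float:
    """Estimate the probability of succeeding within k attempts."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    wins = 0
    for _ in range(trials):
        eps = eps0  # the first attempt starts from a clean context
        for _ in range(k):
            if simulate_attempt(rng, eps, depth):
                wins += 1
                break
            eps = eps1  # after any failure, retries use the elevated error rate
    return wins / trials


# Illustrative values only (not taken from the paper).
contaminated = simulate_success_rate(0.05, 0.20, depth=5, k=3, trials=20000)
clean = simulate_success_rate(0.05, 0.05, depth=5, k=3, trials=20000)
print(f"contaminated retries: {contaminated:.3f}, clean restarts: {clean:.3f}")
```

Passing `eps1 == eps0` models a clean restart, so the two estimates can be compared directly to see how much contamination costs under these assumptions.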
Conclusion
As LLM agents are deployed more widely, understanding and addressing context contamination becomes increasingly important. The Context-Contaminated Restart Model provides a framework for recognizing the limits of current retry mechanisms and points to ways of optimizing LLM agent performance, such as clearing context before a retry. Future work should refine the model and develop strategies to mitigate the adverse effects of context contamination.
