AgentFixer: From Failure Detection to Fix Recommendations in LLM Agentic Systems
arXiv:2603.29848v1
Abstract: We introduce a comprehensive validation framework for LLM-based agentic systems that provides systematic diagnosis and improvement of reliability failures.
The framework includes fifteen failure-detection tools and two root-cause analysis modules that jointly uncover weaknesses across input handling, prompt design, and output generation. It integrates lightweight rule-based checks with LLM-as-a-judge assessments to support structured incident detection, classification, and repair. The goal is to make LLM-based agentic systems more reliable in complex applications.
Key Features of the Framework
- Fifteen Failure-Detection Tools: Diagnostic tools that identify failures across input handling, prompt design, and output generation.
- Two Root-Cause Analysis Modules: Modules that trace detected failures back to their underlying causes, so that remediation can target the cause rather than the symptom.
- Rule-Based Checks Combined with LLM-as-a-Judge: Lightweight rule-based checks provide quick assessments alongside the more expensive LLM-as-a-judge evaluations.
- Structured Incident Classification: Detected incidents are classified systematically, which makes them easier to manage and repair.
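As an illustration of how a rule-based check and an LLM-as-a-judge assessment might be combined, here is a minimal sketch. It is not the framework's actual code: the `Incident` record, the `rule_check_output` and `detect` functions, the category names, and the `stub_judge` stand-in for a real LLM call are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Incident:
    stage: str      # hypothetical stage label: "input", "prompt", or "output"
    category: str   # hypothetical category label, e.g. "schema_violation"
    detail: str
    source: str     # "rule" or "judge"


def rule_check_output(output: dict, required: dict) -> Optional[Incident]:
    """Cheap structural check: each required key exists and has the expected type."""
    for key, expected_type in required.items():
        if key not in output:
            return Incident("output", "schema_violation", f"missing key: {key}", "rule")
        if not isinstance(output[key], expected_type):
            return Incident("output", "schema_violation", f"wrong type for: {key}", "rule")
    return None


def detect(output: dict, required: dict,
           judge: Callable[[dict], Optional[str]]) -> Optional[Incident]:
    """Run the rule check first; call the judge only if the rules pass."""
    incident = rule_check_output(output, required)
    if incident is not None:
        return incident
    verdict = judge(output)  # None means the judge found no problem
    if verdict:
        return Incident("output", "judge_flagged", verdict, "judge")
    return None


def stub_judge(output: dict) -> Optional[str]:
    # Stand-in for an LLM-as-a-judge call; a real judge would prompt a model.
    return "plan has no steps" if not output.get("steps") else None


schema = {"steps": list, "goal": str}
print(detect({"steps": ["open app"]}, schema, stub_judge))
print(detect({"steps": [], "goal": "book a flight"}, schema, stub_judge))
```

Running the cheap check first means the judge is only invoked on outputs that are structurally valid, which is one plausible reason to pair the two kinds of checks.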
Application and Results
The framework was applied to IBM CUGA, an LLM-based agentic system, and evaluated on the AppWorld and WebArena benchmarks. This analysis uncovered several recurrent issues, including:
- Planner misalignments that led to inconsistent outputs.
- Schema violations that compromised data integrity.
- Brittle prompt dependencies that affected the system’s responsiveness.
Based on these insights, the team refined both prompting and coding strategies. This process successfully maintained CUGA’s benchmark results while allowing mid-sized models such as Llama 4 and Mistral Medium to achieve notable accuracy gains. These advancements significantly narrowed the performance gap with frontier models.
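To make the idea of a "recurrent" issue concrete, the sketch below counts how many distinct benchmark runs each incident category appears in and keeps those above a threshold. This is an assumption about how recurrence could be measured, not the paper's procedure; the function name `recurrent_categories`, the `min_runs` threshold, the run identifiers, and the category labels are all hypothetical.

```python
from collections import Counter


def recurrent_categories(incidents, min_runs=2):
    """Return (category, run_count) pairs for categories seen in at least
    min_runs distinct runs, most frequent first.

    `incidents` is an iterable of (run_id, category) pairs.
    """
    runs_per_category = Counter()
    seen = set()
    for run_id, category in incidents:
        # Count each category at most once per run, so one noisy run
        # cannot make a category look recurrent on its own.
        if (run_id, category) not in seen:
            seen.add((run_id, category))
            runs_per_category[category] += 1
    return [(c, n) for c, n in runs_per_category.most_common() if n >= min_runs]


# Hypothetical incident log from three runs.
log = [
    ("run1", "planner_misalignment"),
    ("run1", "planner_misalignment"),
    ("run2", "planner_misalignment"),
    ("run2", "schema_violation"),
    ("run3", "schema_violation"),
    ("run3", "planner_misalignment"),
    ("run3", "brittle_prompt_dependency"),
]
print(recurrent_categories(log))
```

Counting per run rather than per incident is a design choice: it favors issues that show up across many tasks over issues that repeat many times within a single trajectory.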
Exploratory Study and Future Directions
In addition to quantitative validation, an exploratory study fed the framework’s diagnostic outputs and agent descriptions into an LLM for self-reflection. This interactive analysis yielded actionable insights on recurring failure patterns and suggested areas for improvement.
The findings demonstrate how validation processes can evolve into an agentic, dialogue-driven approach. This shift not only enhances the quality assurance of LLM systems but also promotes adaptive validation processes that can be scaled in production environments.
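One way to picture the exploratory setup is as a prompt assembled from an agent description and a list of diagnostic findings, then handed to an LLM. The sketch below only builds such a prompt; the function name `build_reflection_prompt`, the `category` and `detail` fields, and the prompt wording are assumptions made for illustration and do not come from the paper.

```python
def build_reflection_prompt(agent_description: str, diagnostics: list[dict]) -> str:
    """Assemble a self-reflection prompt from an agent description and findings.

    Each diagnostic is a dict with hypothetical keys "category" and "detail".
    """
    lines = [
        "You are reviewing an agent for recurring failure patterns.",
        "",
        "Agent description:",
        agent_description.strip(),
        "",
        "Diagnostic findings:",
    ]
    for i, finding in enumerate(diagnostics, start=1):
        lines.append(f"{i}. [{finding['category']}] {finding['detail']}")
    lines += ["", "Identify recurring patterns and suggest concrete fixes."]
    return "\n".join(lines)


prompt = build_reflection_prompt(
    "Planner agent that decomposes a user task into API calls.",
    [
        {"category": "schema_violation", "detail": "missing key: goal"},
        {"category": "planner_misalignment", "detail": "plan skips login step"},
    ],
)
print(prompt)
```

The returned string would then be sent to whichever model serves as the reviewer; follow-up questions in the same conversation would give the dialogue-driven style of analysis described above.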
Conclusion
The results of this study point to a promising path toward more robust, interpretable, and self-improving agentic architectures. By applying the AgentFixer framework, organizations may be able to improve the reliability of their LLM-based systems in real-world applications.
