Reasoning Graphs: Self-Improving, Deterministic RAG through Evidence-Centric Feedback
arXiv:2604.07595v2
Abstract
Language model agents typically reason from scratch on every query and discard their chain of thought once a run finishes. The result is lower accuracy and high variance: similar queries can succeed or fail unpredictably. To address this, we introduce reasoning graphs, a graph structure that keeps a persistent record of the agent's chain of thought for each piece of evidence. The reasoning is stored as structured edges connected to the specific evidence items they evaluate.
Key Innovations
Unlike prior memory mechanisms, which retrieve distilled strategies by query similarity, reasoning graphs support evidence-centric feedback. For each piece of evidence, the system traverses all incoming evaluation edges across all previous runs. This surfaces how that specific item has been assessed in the past, so the agent can draw on those earlier judgements when it deliberates.
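A minimal sketch of this lookup in Python may help make the idea concrete. It is an illustration under stated assumptions, not the paper's implementation: the names `EvaluationEdge`, `ReasoningGraph`, `evidence_profile`, and `coverage`, the edge fields, and the definition of coverage as "share of evidence items with at least one prior evaluation" are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationEdge:
    """One recorded judgement about one evidence item (hypothetical schema)."""
    run_id: str       # which past run produced the judgement
    evidence_id: str  # the evidence item this edge points at
    verdict: str      # e.g. "supports", "refutes", "irrelevant"
    rationale: str    # the persisted reasoning behind the verdict


class ReasoningGraph:
    """Stores evaluation edges indexed by the evidence item they evaluate."""

    def __init__(self):
        # evidence_id -> list of incoming evaluation edges
        self._incoming = defaultdict(list)

    def add_evaluation(self, edge):
        self._incoming[edge.evidence_id].append(edge)

    def evidence_profile(self, evidence_id):
        """Return every past evaluation of this evidence item, across all runs."""
        # .get avoids creating empty entries for unseen evidence
        return list(self._incoming.get(evidence_id, []))

    def coverage(self, evidence_ids):
        """Assumed definition: share of items with at least one prior evaluation."""
        if not evidence_ids:
            return 0.0
        seen = sum(1 for e in evidence_ids if self._incoming.get(e))
        return seen / len(evidence_ids)


graph = ReasoningGraph()
graph.add_evaluation(EvaluationEdge("run-1", "doc-7", "supports", "States the date directly."))
graph.add_evaluation(EvaluationEdge("run-2", "doc-7", "supports", "Consistent with doc-3."))
graph.add_evaluation(EvaluationEdge("run-2", "doc-9", "irrelevant", "Different entity."))

profile = graph.evidence_profile("doc-7")
cov = graph.coverage(["doc-7", "doc-9", "doc-12", "doc-13"])
```

The key design point is that the index is keyed by evidence item rather than by query, so a new query benefits from past judgements whenever it retrieves the same evidence, even if the query text itself is unlike anything seen before.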
Complementary Structures
We also introduce retrieval graphs, a complementary structure that feeds a pipeline planner and narrows the candidate funnel over successive runs. Together, reasoning and retrieval graphs form a self-improving feedback loop in which accuracy rises and verdict-level variance falls. This improvement requires no retraining: the base model remains unchanged, and all gains come from context engineering via graph traversal.
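The summary does not describe how the planner uses the retrieval graph, so the following Python sketch is only one plausible reading. It assumes a hypothetical `RetrievalGraph` that counts how often an evidence item proved useful for a cluster of similar queries, and a hypothetical `plan_candidates` function that uses those counts to shrink the candidate list.

```python
from collections import Counter


class RetrievalGraph:
    """Hypothetical record of which evidence items were useful per query cluster."""

    def __init__(self):
        # (cluster, evidence_id) -> number of past runs in which the item was used
        self._useful = Counter()

    def record(self, cluster, evidence_id, was_used):
        if was_used:
            self._useful[(cluster, evidence_id)] += 1

    def useful_count(self, cluster, evidence_id):
        # Counter returns 0 for unseen keys without inserting them
        return self._useful[(cluster, evidence_id)]


def plan_candidates(graph, cluster, candidates, keep):
    """Keep the `keep` candidates with the strongest usage history.

    sorted() is stable, so items with equal history stay in retriever order.
    """
    ranked = sorted(candidates, key=lambda e: -graph.useful_count(cluster, e))
    return ranked[:keep]


rgraph = RetrievalGraph()
rgraph.record("founding-dates", "c", was_used=True)
rgraph.record("founding-dates", "c", was_used=True)
rgraph.record("founding-dates", "d", was_used=True)
rgraph.record("founding-dates", "a", was_used=False)

warm = plan_candidates(rgraph, "founding-dates", ["a", "b", "c", "d"], keep=2)
cold = plan_candidates(rgraph, "unseen-cluster", ["a", "b", "c", "d"], keep=2)
```

With no history the planner falls back to the retriever's own ordering, and with history it passes fewer, better-supported candidates downstream, which is one way a narrower funnel could lower cost and latency without touching the base model.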
Evaluation and Results
We evaluated the system on two benchmark datasets, MuSiQue and HotpotQA, using a sequential cluster protocol that simulates high-reuse deployment, together with a determinism experiment. The main results:
- At 50%+ evidence profile coverage, the system made 47% fewer errors than vanilla RAG on identical questions (controlled dose-response, p < 0.0001).
- On 4-hop questions, accuracy improved by 11.0pp (p=0.0001).
- In high-reuse settings, the system achieved Pareto dominance: the highest accuracy together with a 47% reduction in cost and a 46% reduction in latency.
- Evidence profiles increased verdict consistency by 7-8 percentage points (p=0.007, Wilcoxon); the complete system reached perfect consistency on all 11 hard probes at both temperature 0 and temperature 0.5 (p=0.004).
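The summary does not define how verdict consistency is measured. One common way to quantify it, shown here as an assumption rather than the paper's metric, is the share of repeated runs on the same probe that agree with the most frequent verdict. The function name `verdict_consistency` and the sample verdicts are hypothetical.

```python
from collections import Counter


def verdict_consistency(verdicts):
    """Share of repeated runs that agree with the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return modal_count / len(verdicts)


# Four repeated runs of one probe (made-up verdicts, for illustration only)
runs = ["yes", "yes", "no", "yes"]
score = verdict_consistency(runs)
```

Under this definition, perfect consistency means every repeated run of a probe returns the same verdict, giving a score of 1.0.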
Conclusion
Reasoning graphs and retrieval graphs give language model agents evidence-centric feedback and a self-improving feedback loop. In the reported evaluations, these structures improve accuracy while reducing errors, cost, and latency, all without changing the base model. This points toward more reliable and efficient language model applications.
