Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance
Summary: arXiv:2603.27343v1
Abstract: Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, 13 families) against a released deterministic 10-task agent battery. In a pre-specified, Bonferroni-corrected analysis, WMF-AM predicts agent performance with Kendall’s tau = 0.612 (p < 0.001, 95% CI [0.360, 0.814]); exploratory partial-tau analyses suggest this signal persists after controlling for completion score and model scale. Three construct-isolation ablations (K = 1 control, non-arithmetic ceiling, yoked cancellation) support the interpretation that cumulative state tracking under load, rather than single-step arithmetic or entity tracking alone, is the primary difficulty source. K-calibration keeps the probe in a discriminative range where prior fixed-depth benchmarks become non-discriminative; generalization beyond this open-weight sample remains open.
Introduction
Task-completion rate is the standard proxy for LLM agent capability. The paper argues that this metric is incomplete: models with identical completion scores can differ substantially in their ability to track intermediate state during multi-step tasks.
Introducing WMF-AM
The paper introduces Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated, no-scratchpad probe of cumulative arithmetic state tracking. The probe is evaluated on 20 open-weight models, ranging from 0.5 billion to 35 billion parameters across 13 families.
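The abstract does not give the probe's item format, so the following is only a minimal sketch of what a no-scratchpad cumulative-arithmetic item could look like. The names `make_probe_item` and `score`, the prompt wording, and the number ranges are all assumptions made for illustration; `k` stands in for the depth K mentioned in the paper.

```python
import random


def make_probe_item(k, seed=0):
    """Build one hypothetical item: a start value followed by k sequential updates.

    Returns (prompt, expected_final_value). The format is an assumption,
    not the paper's actual WMF-AM item format.
    """
    rng = random.Random(seed)  # seeded so the same item can be regenerated
    value = rng.randint(1, 20)
    lines = [f"Start with {value}."]
    for _ in range(k):
        n = rng.randint(1, 9)
        if rng.choice(["add", "subtract"]) == "add":
            value += n
            lines.append(f"Add {n}.")
        else:
            value -= n
            lines.append(f"Subtract {n}.")
    # No-scratchpad condition: the model is asked for the final value only.
    lines.append("Reply with the final value only, without showing intermediate steps.")
    return "\n".join(lines), value


def score(model_answer, expected):
    """Exact-match scoring on the final integer; unparsable answers count as wrong."""
    try:
        return int(model_answer.strip()) == expected
    except ValueError:
        return False
```

With `k = 1` the same generator produces a single-step item, which is one way a K = 1 control could be built under these assumptions.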
Methodology
Probe scores were compared against performance on a released, deterministic 10-task agent battery. In a pre-specified, Bonferroni-corrected analysis, WMF-AM predicts agent performance with Kendall’s tau = 0.612 (p < 0.001, 95% CI [0.360, 0.814]).
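For readers unfamiliar with the statistic, here is a minimal sketch of Kendall's tau (the tau-b variant, which handles ties) together with a Bonferroni adjustment. The abstract does not say which tau variant or how many comparisons were used, so treat both choices as assumptions; the function names and the sample scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # the pair is ordered the same way in x and y
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical probe and agent scores for four models (not data from the paper).
probe_scores = [0.2, 0.4, 0.6, 0.8]
agent_scores = [0.1, 0.5, 0.3, 0.9]
tau = kendall_tau_b(probe_scores, agent_scores)
```

In this toy example five of the six model pairs are ordered the same way by both scores and one is reversed, so tau is (5 - 1) / 6.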
Key Findings
- The rank correlation indicates that cumulative state tracking, as measured by WMF-AM, is associated with LLM agent performance; it does not by itself establish a causal link.
- Exploratory partial-tau analyses suggest that this predictive signal persists after controlling for completion score and model scale. These analyses are exploratory, unlike the pre-specified main test.
- Three construct-isolation ablations (K = 1 control, non-arithmetic ceiling, yoked cancellation) support the interpretation that cumulative state tracking under load, rather than single-step arithmetic or entity tracking alone, is the primary source of difficulty.
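The partial-tau analysis mentioned above can be illustrated with the standard partial rank correlation formula, which removes the part of a tau between x and y that is explained by a third variable z. The abstract does not say which estimator the authors used, so this is an assumption, and the function name and input values are hypothetical.

```python
from math import sqrt


def partial_tau(tau_xy, tau_xz, tau_yz):
    """Partial Kendall's tau between x and y, controlling for z.

    Uses the standard formula
        (tau_xy - tau_xz * tau_yz) / sqrt((1 - tau_xz**2) * (1 - tau_yz**2)).
    """
    denom = sqrt((1 - tau_xz ** 2) * (1 - tau_yz ** 2))
    if denom == 0:
        raise ValueError("undefined when x or y is perfectly rank-correlated with z")
    return (tau_xy - tau_xz * tau_yz) / denom


# Hypothetical values: x = probe score, y = agent score, z = model scale.
adjusted = partial_tau(tau_xy=0.5, tau_xz=0.5, tau_yz=0.5)
```

With these made-up inputs the adjusted value is 0.25 / 0.75, about 0.33: lower than the raw 0.5, but still positive, which is the pattern "the signal persists after controlling for z" describes.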
Implications
For evaluation practice, the findings suggest that completion rate alone can hide differences between models, and that a state-tracking probe adds information about agent performance. K-calibration matters here: it keeps the probe in a discriminative range where prior fixed-depth benchmarks become non-discriminative. Whether the result generalizes beyond this open-weight sample remains open.
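The abstract does not describe how K is calibrated. As a rough illustration of why a fixed depth can stop discriminating, the sketch below picks, from hypothetical accuracy data, the depth at which models are most spread out. The selection rule (maximum between-model variance), the function name, and the numbers are all assumptions, not the paper's procedure.

```python
from statistics import pvariance


def pick_discriminative_k(accuracy_by_k):
    """Return the depth K whose per-model accuracies have the largest variance.

    accuracy_by_k maps each candidate K to a list of per-model accuracies in [0, 1].
    Ties are broken in favour of the smallest K.
    """
    if not accuracy_by_k:
        raise ValueError("no candidate depths given")
    return max(sorted(accuracy_by_k), key=lambda k: pvariance(accuracy_by_k[k]))


# Hypothetical accuracies for three models at three depths (not data from the paper).
accuracy_by_k = {
    1: [1.0, 1.0, 0.9],   # near ceiling: almost every model succeeds
    4: [0.9, 0.5, 0.2],   # mid-range: models are clearly separated
    16: [0.1, 0.0, 0.0],  # near floor: almost every model fails
}
best_k = pick_discriminative_k(accuracy_by_k)
```

In this toy data the shallow depth sits near ceiling and the deep one near floor, so only the middle depth separates the models.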
Conclusion
WMF-AM offers a complement to completion-based evaluation of LLM agents: a calibrated probe of cumulative state tracking whose scores rank-correlate with performance on a deterministic agent battery. The result is limited to 20 open-weight models, and the partial-tau analyses are exploratory.
