Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
Summary: arXiv:2603.29231v1
Abstract
Existing benchmarks measure capability (whether a model succeeds on a single attempt), but production deployments require reliability: consistent success across repeated attempts on tasks of varying duration. The paper shows that these two properties diverge systematically as task duration increases, and that pass@1 on short tasks is structurally blind to this divergence.
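To make the gap between capability and reliability concrete, the minimal sketch below contrasts a pass@1-style average with a stricter "succeeds on every attempt" rate. The task names, the outcomes, and the all-attempts metric are illustrative assumptions, not data or definitions taken from the paper.

```python
def pass_at_1(outcomes):
    """Mean single-attempt success rate, pooled over all episodes."""
    flat = [ok for runs in outcomes.values() for ok in runs]
    return sum(flat) / len(flat)


def all_attempts_success_rate(outcomes):
    """Fraction of tasks that succeed on every repeated attempt."""
    return sum(all(runs) for runs in outcomes.values()) / len(outcomes)


# Hypothetical outcomes: four tasks, four repeated attempts each.
outcomes = {
    "short_task_a": [True, True, True, True],
    "short_task_b": [True, True, True, True],
    "long_task_a": [True, False, True, True],
    "long_task_b": [True, True, False, True],
}

capability = pass_at_1(outcomes)                   # 14 of 16 episodes succeed
reliability = all_attempts_success_rate(outcomes)  # 2 of 4 tasks never fail
print(f"pass@1-style rate: {capability:.3f}")
print(f"all-attempts rate: {reliability:.3f}")
```

In this made-up data the pooled success rate looks healthy, yet only the short tasks succeed every time, which is the kind of divergence a single-attempt score cannot reveal.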
Introduction to Reliability Science Framework
To address the limitations of current evaluation methods, a reliability science framework has been introduced specifically for long-horizon Large Language Model (LLM) agents. This framework consists of four key metrics:
- Reliability Decay Curve (RDC): This metric assesses how reliability decreases as task duration increases.
- Variance Amplification Factor (VAF): This factor quantifies how variance in performance amplifies over longer tasks.
- Graceful Degradation Score (GDS): This score evaluates how well a model maintains performance as conditions become less favorable.
- Meltdown Onset Point (MOP): This point identifies when a model begins to fail dramatically during extended tasks.
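This summary does not give formulas for the four metrics. As a rough sketch only, the code below shows one plausible way to compute a decay curve (mean success rate per duration bucket) and a variance ratio between a long and a short bucket. Both definitions, the bucket names, and the sample episodes are assumptions made for illustration and may differ from the paper's actual RDC and VAF.

```python
from statistics import mean, pvariance


def reliability_decay_curve(episodes, bucket_order):
    """Mean success rate per duration bucket, listed in bucket order.

    `episodes` is a list of (bucket, success) pairs; every bucket in
    `bucket_order` must have at least one episode.
    """
    by_bucket = {bucket: [] for bucket in bucket_order}
    for bucket, success in episodes:
        by_bucket[bucket].append(1.0 if success else 0.0)
    return [(bucket, mean(by_bucket[bucket])) for bucket in bucket_order]


def variance_ratio(episodes, short_bucket, long_bucket):
    """Outcome variance in the long bucket divided by that in the short one.

    Undefined (division by zero) if the short bucket has no variance.
    """
    def outcome_variance(bucket):
        return pvariance([1.0 if s else 0.0 for b, s in episodes if b == bucket])

    return outcome_variance(long_bucket) / outcome_variance(short_bucket)


# Hypothetical episodes: two buckets shown for brevity (the benchmark has four).
episodes = (
    [("short", True)] * 9 + [("short", False)]
    + [("long", True)] * 5 + [("long", False)] * 5
)

curve = reliability_decay_curve(episodes, ["short", "long"])
ratio = variance_ratio(episodes, "short", "long")
print(curve)
print(f"variance ratio (long / short): {ratio:.2f}")
```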
Evaluation Methodology
The framework was applied to ten models across 23,392 episodes on a 396-task benchmark that spans four duration buckets and three domains. Covering several durations and domains allows results to be broken down by task length and task type.
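As a quick sanity check on the scale of the evaluation, the arithmetic below estimates the average number of episodes per model-task pair. It assumes every model was run on every task, which the summary does not state explicitly.

```python
# Figures reported in the text.
models, tasks, episodes = 10, 396, 23_392

# Assumption: each of the ten models attempts each of the 396 tasks.
pairs = models * tasks
avg_episodes_per_pair = episodes / pairs

print(f"model-task pairs: {pairs}")
print(f"average episodes per pair: {avg_episodes_per_pair:.2f}")
```

Under that assumption the average is just under six repeated attempts per pair. Because it is not a whole number, the number of attempts cannot be identical for every pair.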
Key Findings
- Reliability Decay is Domain-Stratified: The GDS for structured environments (SE) drops significantly from 0.90 to 0.44, whereas document processing tasks show nearly flat performance, ranging from 0.74 to 0.71.
- VAF Bifurcation by Capability Tier: VAF splits along capability tiers. High VAF is a signature of capability rather than an indicator of instability.
- Divergence in Capability and Reliability Rankings: Ranking models by capability and ranking them by reliability give substantially different orders, with multi-rank inversions at longer task durations.
- High Meltdown Rates in Frontier Models: Frontier models exhibit higher meltdown rates, reaching up to 19%, due to their attempts at ambitious multi-step strategies that can lead to spiraling failures.
- Negative Impact of Memory Scaffolds: Interestingly, the introduction of memory scaffolds consistently detracted from long-horizon performance across all ten evaluated models.
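To put the two GDS ranges from the first finding on a common scale, the short sketch below computes the relative decline for each domain. The helper name `relative_drop` is hypothetical; only the four GDS values come from the text.

```python
def relative_drop(start, end):
    """Fractional decline from `start` to `end`."""
    return (start - end) / start


# GDS values reported in the text.
se_drop = relative_drop(0.90, 0.44)
doc_drop = relative_drop(0.74, 0.71)

print(f"SE relative GDS drop: {se_drop:.1%}")
print(f"Document processing relative GDS drop: {doc_drop:.1%}")
```

On these figures the SE score loses roughly half of its starting value, while document processing loses only a few percent.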
Conclusion
The findings from this research highlight the necessity of incorporating reliability as a critical evaluation dimension alongside traditional capability metrics. This reliability science framework not only enhances the assessment of long-horizon LLM agents but also sets a foundation for future studies aimed at improving model performance in practical, real-world applications.
