Improving Agentic Eval Accuracy: Handling Randomness

On Randomness in Agentic Evals

Summary: arXiv:2602.07150v3

Abstract: Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2–3 percentage points may reflect evaluation noise rather than genuine algorithmic progress.

Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices:

Estimate pass@1 from multiple independent runs per task, especially when measuring small improvements.
Use statistical power analysis to determine the number of runs needed to detect expected effect sizes.
Consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope.

While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.

Introduction

Agentic systems are usually scored on benchmarks where an agent interacts with an environment to solve tasks, and most papers report a pass@1 score computed from a single run per task. This study tests whether that single run is a reliable performance estimate and finds that it is not.

Findings

Our research analyzes 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. Single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, and the standard deviation exceeds 1.5 percentage points even at temperature 0. Reported improvements of 2 to 3 percentage points may therefore reflect evaluation noise rather than genuine algorithmic progress.
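To make the run-to-run spread concrete, here is a minimal sketch of how it could be measured from repeated runs. It assumes results are stored as a task-by-run matrix of booleans; the names `outcomes`, `single_run_pass_at_1`, and `summarize` are illustrative and not taken from the paper, and the toy data is invented purely to show the arithmetic.

```python
import statistics


def single_run_pass_at_1(outcomes):
    """Return one pass@1 score per run.

    outcomes[t][r] is True if run r solved task t (assumed layout).
    """
    num_tasks = len(outcomes)
    num_runs = len(outcomes[0])
    return [
        sum(task[r] for task in outcomes) / num_tasks
        for r in range(num_runs)
    ]


def summarize(outcomes):
    """Mean, max-min spread, and sample standard deviation across runs."""
    per_run = single_run_pass_at_1(outcomes)
    return {
        "mean": statistics.mean(per_run),
        "spread": max(per_run) - min(per_run),
        "stdev": statistics.stdev(per_run),
    }


# Toy data (made up): 4 tasks, 3 independent runs each.
outcomes = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [False, False, False],
]
summary = summarize(outcomes)
print(summary)
```

Reporting the mean together with the spread or standard deviation shows how much a single-run score could have moved had a different run been selected.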

Early Divergence and Its Implications

Token-level analysis indicates that trajectories begin to diverge early in the execution process. This divergence, often occurring within the first few percent of tokens, cascades into different solution strategies. Because two runs of the same model on the same task can end up pursuing different approaches, a single run can misrepresent the model's effectiveness and add noise to evaluation metrics.
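One simple way to locate an early divergence point is to compare two trajectories token by token and record where they first differ. The sketch below is an assumption about how such a measurement could be done, not the paper's actual procedure; the function names and the choice to normalize by the longer trajectory are hypothetical.

```python
def first_divergence(tokens_a, tokens_b):
    """Index of the first position where the sequences differ.

    Returns None if the sequences are identical. If one sequence is a
    strict prefix of the other, the divergence is at the shorter length.
    """
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))
    return None


def divergence_fraction(tokens_a, tokens_b):
    """First divergence as a fraction of the longer trajectory (assumed normalization)."""
    idx = first_divergence(tokens_a, tokens_b)
    if idx is None:
        return None
    return idx / max(len(tokens_a), len(tokens_b))


# Toy trajectories (made up); real inputs would be model token IDs.
run_a = ["ls", "cat", "a.py", "edit", "test"]
run_b = ["ls", "cat", "b.py", "grep", "edit", "test", "fix", "test"]
print(first_divergence(run_a, run_b), divergence_fraction(run_a, run_b))
```

Aggregating this fraction over many run pairs for the same task would indicate how early trajectories typically split.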

Recommendations for Improved Evaluation

To enhance the reliability of evaluations in agentic systems, we propose the following recommendations:

Implement multiple independent runs for each task to calculate pass@1, particularly when tracking small performance changes.
Conduct statistical power analyses to ascertain the necessary number of runs to confidently detect meaningful effect sizes.
Report additional metrics such as pass@k (an optimistic bound) and pass^k (a pessimistic bound) with k>1 to characterize the full performance envelope.
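The third recommendation can be sketched with the commonly used combinatorial estimators: given n runs of a task with c successes, pass@k is the probability that at least one of k runs drawn without replacement succeeds, and pass^k is the probability that all k succeed. This post does not state which estimators the paper uses, so treat the formulas below as an assumption; `pass_hat_k` and `benchmark_score` are illustrative names.

```python
from math import comb


def pass_at_k(n, c, k):
    """P(at least one success among k of the n runs), given c successes."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # comb(n - c, k) is 0 when fewer than k runs failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """P(all k of the drawn runs succeed), given c successes out of n."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return comb(c, k) / comb(n, k)


def benchmark_score(successes_per_task, n, k, metric):
    """Average a per-task metric over all tasks in the benchmark."""
    scores = [metric(n, c, k) for c in successes_per_task]
    return sum(scores) / len(scores)


# Toy example (made up): 3 tasks, 10 runs each, with 10, 5, and 0 successes.
successes = [10, 5, 0]
print(benchmark_score(successes, n=10, k=2, metric=pass_at_k))
print(benchmark_score(successes, n=10, k=2, metric=pass_hat_k))
```

For k=1 both estimators reduce to c/n, the multi-run pass@1. As k grows, pass@k can only rise and pass^k can only fall, so the gap between them shows how much of the score depends on tasks the agent solves only some of the time.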

Though these methodologies may incur additional costs, they are paramount for ensuring that scientific advancements in AI are accurately recognized and validated.
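To get a rough sense of that cost, the power analysis in the second recommendation can be approximated with the standard sample-size formula for comparing two proportions. The sketch below is a deliberate simplification and not the paper's procedure: it treats every task-run outcome as an independent Bernoulli trial, ignores differences in task difficulty and any paired design, and uses illustrative baseline rates. The 500-task figure is the size of SWE-Bench-Verified.

```python
from math import ceil
from statistics import NormalDist


def outcomes_needed(p_baseline, p_improved, alpha=0.05, power=0.8):
    """Outcomes per system for a two-sided two-proportion z-test.

    Normal approximation with unpooled variance; assumes independent trials.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_improved * (1 - p_improved)
    delta = p_improved - p_baseline
    return ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


def runs_per_task(p_baseline, p_improved, num_tasks, alpha=0.05, power=0.8):
    """Spread the required outcomes evenly over the benchmark's tasks."""
    total = outcomes_needed(p_baseline, p_improved, alpha, power)
    return ceil(total / num_tasks)


# Illustrative numbers: detect a 2-point gain over a 50% baseline on 500 tasks.
n = outcomes_needed(0.50, 0.52)
runs = runs_per_task(0.50, 0.52, num_tasks=500)
print(n, runs)
```

Under these assumptions, the number of runs grows with the inverse square of the effect size, so halving the improvement to be detected roughly quadruples the evaluation budget.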

Conclusion

As the field of AI continues to evolve, the need for robust evaluation metrics becomes increasingly evident. By acknowledging and addressing the issues associated with randomness in agentic evaluations, researchers can foster a more accurate understanding of algorithmic progress and contribute to the responsible development of intelligent systems.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!