Improving Agentic Eval Accuracy: Handling Randomness

On Randomness in Agentic Evals

Source: arXiv:2602.07150v3

Abstract: Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2–3 percentage points may reflect evaluation noise rather than genuine algorithmic progress.

Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices:

  • Estimate pass@1 from multiple independent runs per task, especially when measuring small improvements.
  • Use statistical power analysis to determine the number of runs needed to detect expected effect sizes.
  • Consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope.

While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
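
The two bounds in the third recommendation can be estimated from n recorded runs per task. The following is a minimal sketch, assuming the commonly used combinatorial estimators: pass@k as the chance that at least one of k sampled runs succeeds, and pass^k as the chance that all k succeed. The abstract does not state which estimators the authors use, and the function names here are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k runs, drawn without replacement
    from n recorded runs with c successes, solves the task."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k runs, drawn without replacement from n
    recorded runs with c successes, solve the task (pass^k)."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)


# Toy example (numbers invented): one task, 10 runs, 5 successes.
n, c = 10, 5
for k in (1, 2, 5):
    print(k, round(pass_at_k(n, c, k), 4), round(pass_hat_k(n, c, k), 4))
```

A benchmark-level score is the average of the per-task values. At k=1 both functions reduce to c/n, and for larger k they move apart, which is what makes them useful as upper and lower bounds.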

Introduction

Agentic systems are usually compared on benchmark scores, and the number most often reported is pass@1 computed from a single run per task. The study summarized here tests whether a single run gives a reliable estimate, and finds that it does not.

Findings

The study collects 60,000 agentic trajectories on SWE-Bench-Verified, covering three models and two scaffolds. Pass@1 scores vary substantially from run to run: depending on which single run is selected, the estimate shifts by 2.2 to 6.0 percentage points, and the standard deviation exceeds 1.5 percentage points even at temperature 0. Improvements of 2–3 percentage points reported from single runs may therefore reflect evaluation noise rather than genuine progress.
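
To make the run-to-run spread concrete, here is a minimal sketch of how per-run pass@1 scores and their spread could be computed from a table of outcomes. The data layout (`results[r][t]` is True when run r solved task t), the function names, and the toy data are assumptions for illustration, not the authors' code or results.

```python
import statistics


def pass_at_1_per_run(results):
    """Return one pass@1 score per run: the fraction of tasks solved."""
    return [sum(run) / len(run) for run in results]


def summarize(results):
    """Mean, range, and standard deviation of pass@1 across runs.

    Spread and standard deviation are reported in percentage points.
    """
    scores = pass_at_1_per_run(results)
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "spread_pp": (max(scores) - min(scores)) * 100,
        "stdev_pp": statistics.stdev(scores) * 100,
    }


# Toy data (invented): 3 runs over the same 4 tasks.
toy = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
print(summarize(toy))
```

Reporting the mean together with the spread shows how much a single-run score could have moved had a different run been picked.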

Early Divergence and Its Implications

Token-level analysis shows that trajectories diverge early, often within the first few percent of tokens. These small differences cascade into different solution strategies, so two runs of the same model on the same task can end with different outcomes, which adds noise to single-run scores.
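
One simple way to quantify early divergence is to locate the first position where two trajectories differ and express it as a fraction of the shorter trajectory. This is a sketch of that idea only: the exact divergence measure used in the paper is not given in the abstract, and the function name and toy token lists are hypothetical.

```python
def first_divergence_fraction(tokens_a, tokens_b):
    """Fraction of the shorter trajectory shared before the first differing token.

    Returns 1.0 when one trajectory is a prefix of the other (no
    differing position), and 0.0 when either trajectory is empty.
    """
    shorter = min(len(tokens_a), len(tokens_b))
    if shorter == 0:
        return 0.0
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i / shorter
    return 1.0


# Toy trajectories (invented): identical first step, then different choices.
run_a = ["ls", "cat", "edit", "test"]
run_b = ["ls", "grep", "edit", "test"]
print(first_divergence_fraction(run_a, run_b))
```

A small value means the two runs share only a short prefix before taking different paths.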

Recommendations for Improved Evaluation

To make evaluations of agentic systems more reliable, the authors recommend three practices:

  • Estimate pass@1 from multiple independent runs per task, particularly when tracking small performance changes.
  • Use statistical power analysis to determine how many runs are needed to detect the expected effect size.
  • Report complementary metrics such as pass@k (an optimistic bound) and pass^k (a pessimistic bound) with k>1 to characterize the full performance envelope.

These practices raise evaluation cost, but they are needed to distinguish genuine progress from statistical noise.
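
The power-analysis recommendation can be illustrated with a textbook sample-size formula. The sketch below assumes a two-sided comparison of two systems, independent runs, roughly normal benchmark scores, and the same run-to-run standard deviation for both systems. It is not the authors' procedure, and `runs_needed` and its parameters are illustrative names.

```python
from math import ceil
from statistics import NormalDist


def runs_needed(sigma_pp, delta_pp, alpha=0.05, power=0.8):
    """Runs per system needed to detect a difference of delta_pp.

    sigma_pp: run-to-run standard deviation of the benchmark score,
              in percentage points (assumed equal for both systems).
    delta_pp: smallest score difference worth detecting, in
              percentage points.
    Uses the normal-approximation formula
        n = 2 * (z_{1 - alpha/2} + z_{power})^2 * sigma^2 / delta^2.
    """
    if sigma_pp <= 0 or delta_pp <= 0:
        raise ValueError("sigma_pp and delta_pp must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * (sigma_pp / delta_pp) ** 2
    return ceil(n)


# Illustrative inputs: a 1.5 pp standard deviation and a 2 pp target gain.
print(runs_needed(sigma_pp=1.5, delta_pp=2.0))
```

Under these assumptions, detecting a 2 percentage point gain with a 1.5 percentage point standard deviation takes about nine runs per system, which shows why a single run cannot resolve differences of that size.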

Conclusion

Single-run pass@1 scores carry enough run-to-run variance that small reported gains cannot be taken at face value. Running each task multiple times, sizing the number of runs with power analysis, and reporting pass@k and pass^k alongside pass@1 give a more accurate picture of algorithmic progress.

Lazarus Omolua, https://richlyai.com/blog