AgentLens: Fixing Lucky Pass Issues in SWE-Agent Evaluation

Evaluation of software engineering (SWE) agents has long been dominated by a single binary signal: whether the final patch passes all tests. This outcome-centric approach treats every passing run as equivalent, so it cannot distinguish a principled solution from chaotic trial and error. Recent research shows that this equivalence does not hold empirically.

The study, described in the paper arXiv:2605.12925v1, examines 2,614 OpenHands trajectories produced by eight model backends on 60 SWE-bench Verified tasks. Only 47 of these tasks proved robust enough to support task-level process references, which leaves an evaluation subset of 1,815 trajectories. Within this subset, 10.7% of the passing trajectories are “Lucky Passes”: runs that reach a passing patch while showing problematic behaviors such as regression cycles, blind retries, missing verification, and temporally disordered exploration, implementation, and verification.
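As a rough illustration of what two of these signals could look like in code, here is a minimal Python sketch. The article does not give the paper's definitions, so both heuristics are assumptions of ours: a "blind retry" is taken to mean the same action repeated several times in a row, and "missing verification" to mean that no Verification step follows the last Implementation step. The `Step` representation and the `min_repeats` threshold are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: (intent label, action text) for one agent step.
Step = Tuple[str, str]


def has_blind_retry(steps: List[Step], min_repeats: int = 3) -> bool:
    """True if the same action text occurs min_repeats times in a row.

    Assumed heuristic, not the paper's definition; expects min_repeats >= 2.
    """
    run = 1
    for prev, cur in zip(steps, steps[1:]):
        run = run + 1 if cur[1] == prev[1] else 1
        if run >= min_repeats:
            return True
    return False


def missing_verification(steps: List[Step]) -> bool:
    """True if no Verification step follows the last Implementation step.

    Assumed heuristic, not the paper's definition.
    """
    last_impl = max(
        (i for i, step in enumerate(steps) if step[0] == "Implementation"),
        default=None,
    )
    if last_impl is None:
        return False  # nothing was implemented, so nothing needed verifying
    return all(step[0] != "Verification" for step in steps[last_impl + 1:])


# Made-up trajectory: the agent edits again after its last test run.
unverified = [
    ("Exploration", "grep foo"),
    ("Implementation", "edit a.py"),
    ("Verification", "pytest"),
    ("Implementation", "edit a.py"),
]
# Made-up trajectory: the same test command is issued three times in a row.
retrying = [("Verification", "pytest")] * 3
```

A trajectory like `unverified` can still end with a passing patch, which is exactly why the final test result alone does not reveal the problem.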

To address these issues, the researchers introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories. Alongside it they release AgentLens-Bench, a dataset of 1,815 annotated trajectories that includes quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references.

  • Quality Assessment: AgentLens builds each PTA reference by merging multiple passing solutions for the same task, so a trajectory is assessed against several known-good paths and not against a single one.
  • Context-Sensitive Labeling: A context-sensitive intent labeler assigns each action one of four intents: Exploration, Implementation, Verification, or Orchestration. The label depends on the trajectory’s history, not merely on which tool was used.
  • Tiered Quality Scoring: AgentLens-Bench separates passing trajectories into three tiers: Lucky, Solid, and Ideal. It further breaks Lucky Passes down into five recurring mechanisms, giving a more detailed view of agent behavior.
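To make the PTA idea concrete, the sketch below builds a prefix tree acceptor (a trie whose accepting states mark where a reference sequence ends) from several passing solutions, then reports where a new trajectory first leaves the tree. It is a minimal illustration under our own assumptions: the article does not say which symbols AgentLens feeds into the PTA or how it computes divergence points, so using the four intent labels as the alphabet and the `divergence_point` method are hypothetical choices.

```python
from typing import Dict, List, Optional, Sequence


class PrefixTreeAcceptor:
    """Trie over label sequences; a state accepts if a reference ends there."""

    def __init__(self) -> None:
        self.transitions: List[Dict[str, int]] = [{}]  # state 0 is the root
        self.accepting: List[bool] = [False]

    def add(self, sequence: Sequence[str]) -> None:
        """Merge one passing solution into the tree, sharing common prefixes."""
        state = 0
        for label in sequence:
            nxt = self.transitions[state].get(label)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.accepting.append(False)
                self.transitions[state][label] = nxt
            state = nxt
        self.accepting[state] = True

    def divergence_point(self, sequence: Sequence[str]) -> Optional[int]:
        """Index of the first step that leaves the tree, or None if none does."""
        state = 0
        for i, label in enumerate(sequence):
            nxt = self.transitions[state].get(label)
            if nxt is None:
                return i
            state = nxt
        return None

    def accepts(self, sequence: Sequence[str]) -> bool:
        """True if the sequence exactly matches one of the merged references."""
        state = 0
        for label in sequence:
            nxt = self.transitions[state].get(label)
            if nxt is None:
                return False
            state = nxt
        return self.accepting[state]


# Two made-up passing solutions for the same task, as intent-label sequences.
pta = PrefixTreeAcceptor()
pta.add(["Exploration", "Implementation", "Verification"])
pta.add(["Exploration", "Exploration", "Implementation", "Verification"])

# A made-up candidate that verifies before implementing anything.
candidate = ["Exploration", "Verification", "Implementation"]
first_divergence = pta.divergence_point(candidate)  # 1: step 1 leaves the tree
```

Because both references share the initial Exploration step, they share a state in the tree; the candidate diverges at index 1, where it attempts Verification from a state that only allows Exploration or Implementation.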

Lucky Pass rates vary widely across the eight model backends, from 0.5% to 23.2%. When models are ranked by quality score instead of pass rate, some shift by as many as five positions, which shows how much a pass-rate leaderboard can hide.
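The rank-shift comparison can be sketched in a few lines of Python. The model names and scores below are made-up placeholders, not figures from the paper, and `ranks` and `rank_shifts` are hypothetical helpers; the article does not describe how the paper computes its rankings or breaks ties.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based rank, highest score first.

    Ties are broken by insertion order because sorted() is stable.
    """
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def rank_shifts(pass_rate: Dict[str, float],
                quality: Dict[str, float]) -> Dict[str, int]:
    """Positions gained (positive) or lost (negative) when ranking by quality."""
    by_pass = ranks(pass_rate)
    by_quality = ranks(quality)
    return {name: by_pass[name] - by_quality[name] for name in pass_rate}


# Placeholder numbers for illustration only.
pass_rate = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.60}
quality = {"model_a": 0.55, "model_b": 0.62, "model_c": 0.58}

shifts = rank_shifts(pass_rate, quality)
# model_a drops from 1st to 3rd (-2); model_b and model_c each move up one place.
```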

To support transparency and further research, the researchers have made the anonymized project repository publicly available. It includes the AgentLens-Bench dataset and the AgentLens SDK, and can be accessed at https://github.com/microsoft/code-agent-state-trajectories/.

AgentLens marks a step toward more rigorous evaluation of SWE agents: one that scores how a passing patch was reached, not only whether the tests passed.

Lazarus Omolua