AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
In software engineering (SWE), agents have long been evaluated on a single binary signal: whether the final patch passes all tests. This outcome-centric approach cannot distinguish a principled solution from chaotic trial and error, because both are counted as the same success. Recent research shows that treating the two as equivalent is empirically wrong.
The study, described in the paper arXiv:2605.12925v1, examines 2,614 OpenHands trajectories from eight model backends across 60 SWE-bench Verified tasks. Of these tasks, 47 were robust enough to support task-level process references, which yields an evaluation subset of 1,815 trajectories. Within this subset, 10.7% of the passing trajectories are “Lucky Passes”: runs that pass the tests despite regression cycles, blind retries, missing verification, or exploration, implementation, and verification steps that occur out of order.
To address these issues, the researchers introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories. They also release AgentLens-Bench, a dataset of 1,815 annotated trajectories with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references.
- Quality Assessment: AgentLens builds a PTA reference for each task by merging multiple passing solutions to that task. A trajectory can then be compared against this process reference instead of being judged on the test outcome alone.
- Context-Sensitive Labeling: The framework uses a context-sensitive intent labeler that assigns each action one of four intents: Exploration, Implementation, Verification, or Orchestration. The label depends on the trajectory’s history, not merely on which tool was called.
- Tiered Quality Scoring: AgentLens-Bench separates passing trajectories into three tiers: Lucky, Solid, and Ideal. It also breaks Lucky Passes down into five recurring mechanisms.
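To make the PTA idea concrete, here is a minimal Python sketch of a generic Prefix Tree Acceptor built over sequences of intent labels. It is an illustration of the data structure only, not the AgentLens SDK: the single-letter labels and the names `PTANode`, `build_pta`, `accepts`, and `divergence_point` are hypothetical, and the paper's actual merging and scoring rules may differ.

```python
from dataclasses import dataclass, field

# Hypothetical single-letter shorthand for the four intents named in the article.
EXPLORE, IMPLEMENT, VERIFY, ORCHESTRATE = "E", "I", "V", "O"


@dataclass
class PTANode:
    children: dict = field(default_factory=dict)  # label -> PTANode
    accepting: bool = False  # True if a reference solution ends here


def build_pta(sequences):
    """Merge label sequences into one prefix tree; shared prefixes share nodes."""
    root = PTANode()
    for seq in sequences:
        node = root
        for label in seq:
            node = node.children.setdefault(label, PTANode())
        node.accepting = True
    return root


def accepts(root, seq):
    """True only if seq exactly matches one of the merged sequences."""
    node = root
    for label in seq:
        node = node.children.get(label)
        if node is None:
            return False
    return node.accepting


def divergence_point(root, seq):
    """Index of the first step that leaves the tree, or None if none does."""
    node = root
    for i, label in enumerate(seq):
        nxt = node.children.get(label)
        if nxt is None:
            return i
        node = nxt
    return None


# Two made-up passing solutions for one task, merged into a reference.
reference = build_pta([
    [EXPLORE, EXPLORE, IMPLEMENT, VERIFY],
    [EXPLORE, IMPLEMENT, VERIFY],
])
```

With this reference, `accepts(reference, ["E", "I", "V"])` is true, while a trajectory labeled `["E", "I", "I", "V"]` leaves the tree at step index 2, the second implementation step. A real process reference would presumably tolerate more variation than exact prefix matching; this sketch only shows how merged solutions yield a shared structure and a natural notion of a divergence point.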
Lucky Pass rates vary widely across the eight model backends, from 0.5% to 23.2%. When models are ranked by quality score instead of pass rate, some shift by as many as five rank positions, which shows how much a pass-rate leaderboard can hide.
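The rank-shift comparison can be sketched in a few lines of Python. The model names and scores below are invented placeholders, not figures from the paper, and `rank_by` and `rank_shifts` are hypothetical helper names. The sketch only shows how ranking the same models by two different metrics produces per-model shifts.

```python
def rank_by(scores):
    """Map each model to its 1-based rank, highest score first.

    Ties keep insertion order because Python's sort is stable.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_shifts(pass_rate, quality):
    """Positive values mean the model moves up when ranked by quality."""
    by_pass = rank_by(pass_rate)
    by_quality = rank_by(quality)
    return {model: by_pass[model] - by_quality[model] for model in pass_rate}


# Placeholder numbers for illustration only.
pass_rate = {"model_a": 0.62, "model_b": 0.58, "model_c": 0.55}
quality = {"model_a": 0.41, "model_b": 0.52, "model_c": 0.47}

shifts = rank_shifts(pass_rate, quality)
print(shifts)  # {'model_a': -2, 'model_b': 1, 'model_c': 1}
```

In this toy data, `model_a` leads on pass rate but drops two positions once quality is considered, which is the kind of reordering the study reports at larger scale.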
To support transparency and further research, the researchers have made the anonymized project repository publicly available. It includes the AgentLens-Bench dataset and the AgentLens SDK, and can be accessed at https://github.com/microsoft/code-agent-state-trajectories/.
AgentLens marks a step toward more rigorous evaluation of SWE agents, one that scores how an agent reached a passing patch and not only whether it did.
