Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
arXiv:2604.05955v1 (cross-list)
Abstract
Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces design-aware issue resolution and presents bench, a benchmark that makes such implicit design constraints explicit and measurable.
Introduction
As software development relies more heavily on large language models (LLMs), the way their issue-resolution ability is evaluated has come under scrutiny. The dominant metric, test pass rate, shows whether a patch works but not whether it meets the quality and design expectations of the project it targets.
Design Constraints and Their Importance
In software development, design constraints play a crucial role in ensuring that code not only functions correctly but also adheres to the architectural and stylistic guidelines of a project. These constraints can include:
- Architectural conventions
- Error-handling policies
- Maintainability requirements
Unfortunately, many of these constraints are not explicitly defined in tests. Instead, they are often implied through discussions in code reviews, making it challenging to ensure compliance when evaluating LLM-generated patches.
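To make the gap concrete, here is a small hypothetical Python illustration. It is not taken from the paper: the function names, the `ConfigError` type, and the "never swallow malformed input silently" policy are all assumptions chosen for the example. Both loaders pass the same functional check, but only the second would satisfy such an error-handling policy.

```python
import json


class ConfigError(ValueError):
    """Hypothetical project-specific error type (an assumption for this example)."""


def load_config_v1(text):
    # Passes the functional check below, but hides malformed input
    # by returning an empty dict instead of reporting the problem.
    try:
        return json.loads(text)
    except Exception:
        return {}


def load_config_v2(text):
    # Same behavior on valid input, but malformed input is reported
    # through the project's own error type, as the assumed policy requires.
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"invalid config: {exc}") from exc


def check_loads_valid(loader):
    # Stand-in for a unit test: it only exercises the happy path,
    # so it cannot tell the two implementations apart.
    return loader('{"debug": true}') == {"debug": True}
```

A test suite that covers only valid input accepts both versions, which is the situation the article describes: the design rule lives in reviewer expectations rather than in the tests.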
Introducing Bench
The paper introduces a new benchmark called bench, aimed at addressing this gap. Bench is constructed by:
- Mining design constraints from real-world pull requests.
- Validating these constraints and linking them to specific issue instances.
- Automatically checking patch compliance using an LLM-based verifier.
The resulting benchmark contains 495 issues and 1,787 validated constraints across six repositories, and it is aligned with the existing benchmarks SWE-bench-Verified and SWE-bench-Pro.
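The linking and checking steps above can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the class names, the fields, and especially `toy_verifier`, which is a trivial string check standing in for the LLM-based verifier and does not reflect how the paper's verifier works.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DesignConstraint:
    constraint_id: str
    category: str      # e.g. "error-handling" or "architecture"
    description: str   # natural-language rule mined from a pull request


@dataclass
class IssueInstance:
    issue_id: str
    constraints: List[DesignConstraint] = field(default_factory=list)


# A verifier receives the patch text and one constraint and returns
# True when it judges the patch to comply with that constraint.
Verifier = Callable[[str, DesignConstraint], bool]


def check_patch(patch: str, issue: IssueInstance, verifier: Verifier) -> Dict:
    """Check a patch against every constraint linked to the issue."""
    verdicts = {c.constraint_id: verifier(patch, c) for c in issue.constraints}
    return {
        "issue_id": issue.issue_id,
        "verdicts": verdicts,
        # Fully compliant only if no linked constraint is violated.
        "fully_compliant": all(verdicts.values()),
    }


def toy_verifier(patch: str, constraint: DesignConstraint) -> bool:
    # Placeholder for an LLM call: flags bare `except:` clauses for
    # error-handling constraints and accepts everything else.
    if constraint.category == "error-handling":
        return "except:" not in patch
    return True
```

Keeping the verifier as a plain callable means the same checking loop could be driven by a model-backed judge, a linter, or a human label without changing the data model.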
Findings
Experiments conducted with state-of-the-art LLM-based agents reveal concerning trends. Notably:
- Test-based correctness substantially overestimates patch quality.
- Fewer than half of the resolved issues are fully compliant with design constraints.
- Design violations are widespread, and functional correctness shows negligible association with design satisfaction.
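The quantities behind these findings can be sketched in a few lines of Python. This is an illustration under stated assumptions: each record is a hypothetical pair `(passed_tests, fully_compliant)`, the data in `TOY_RECORDS` is made up, and the phi coefficient is one common way to measure association between two binary outcomes, not necessarily the statistic the paper reports.

```python
from math import sqrt
from typing import List, Tuple

# (passed_tests, fully_compliant) for one patch; both fields are assumptions.
Record = Tuple[bool, bool]

# Made-up data for illustration only, not results from the paper.
TOY_RECORDS: List[Record] = [
    (True, True), (True, False), (True, False),
    (False, True), (False, False), (True, True),
]


def pass_rate(records: List[Record]) -> float:
    """Fraction of patches that pass the tests."""
    return sum(1 for passed, _ in records if passed) / len(records)


def compliance_among_resolved(records: List[Record]) -> float:
    """Fraction of test-passing patches that satisfy every design constraint."""
    resolved = [compliant for passed, compliant in records if passed]
    return sum(resolved) / len(resolved) if resolved else 0.0


def phi_coefficient(records: List[Record]) -> float:
    """Phi coefficient of the 2x2 table (tests passed vs. design compliant)."""
    n11 = sum(1 for p, c in records if p and c)
    n10 = sum(1 for p, c in records if p and not c)
    n01 = sum(1 for p, c in records if not p and c)
    n00 = sum(1 for p, c in records if not p and not c)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # An empty row or column makes the coefficient undefined; report 0.0.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```

On the toy data, four of six patches pass their tests, only half of those are fully compliant, and the phi coefficient is 0, which mirrors the pattern of a healthy pass rate carrying no information about design compliance.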
Implications for Future Research
Providing issue-specific design guidance reduces violations, but substantial non-compliance remains. This points to a fundamental gap in the capabilities of current LLM-based agents and motivates:
- Design-aware evaluations that go beyond mere functional correctness.
- The development of more sophisticated metrics for assessing design compliance.
Conclusion
As AI tooling becomes part of everyday software development, evaluation needs to account for design constraints as well as functional behavior. Benchmarks like bench are a step toward ensuring that LLM-generated patches not only pass tests but also meet the design standards that keep software maintainable.
