Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
arXiv:2604.05100v1 (announce type: cross)
Abstract: Instructed code editing, where an LLM modifies existing code based on a natural language instruction, accounts for roughly 19% of real-world coding assistant interactions. Yet very few benchmarks directly evaluate this capability.
From a survey of over 150 code-related benchmarks, we find that only two, CanItEdit and EDIT-Bench, target instructed code editing with human-authored instructions and test-based evaluation. We audit both by comparing their programming languages, edit intents, and application domains against distributions observed in the wild (Copilot Arena, AIDev, GitHub Octoverse), and by measuring test counts, statement coverage, and test scope across all 213 problems.
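One way to picture the distribution comparison described above is a small sketch that turns category labels into shares and measures how far a benchmark's mix sits from an in-the-wild mix. The distance used here (total variation) and the helper names `shares` and `total_variation` are assumptions chosen for illustration; the abstract does not say which statistic the authors use. The label lists are toy placeholders, not figures from the paper.

```python
from collections import Counter


def shares(labels):
    """Turn a list of category labels into a {category: fraction} mapping."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means identical shares; 1.0 means no overlap at all.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy placeholder labels (hypothetical, NOT data from the audit).
benchmark_langs = ["python"] * 9 + ["javascript"]
wild_langs = ["python"] * 3 + ["typescript"] * 4 + ["javascript"] * 3

tvd = total_variation(shares(benchmark_langs), shares(wild_langs))

# Categories seen in the wild that the benchmark never exercises.
missing = sorted(set(wild_langs) - set(benchmark_langs))

print(f"total variation distance: {tvd:.2f}")
print(f"absent from benchmark: {missing}")
```

The same two helpers would apply unchanged to edit intents or application domains, since they only depend on categorical labels.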
Key Findings of the Audit
- Language Concentration: Both benchmarks concentrate over 90% of evaluation on Python. TypeScript, which is GitHub’s most-used language, is notably absent.
- Domain Coverage: Backend and frontend development, which together constitute 46% of real-world editing activity, are largely missing from the benchmarks.
- Documentation and Maintenance Edits: Edits related to documentation, testing, and maintenance, which account for 31.4% of human pull requests (PRs), have no representation in either benchmark.
- Test Count and Coverage: Both benchmarks have modest test counts: CanItEdit has a median of 13 tests, while EDIT-Bench has a median of 4. CanItEdit compensates with near-complete whole-file coverage and fail-before/pass-after validation, in which the tests fail on the original code and pass on the reference edit.
- Test Suite Limitations: 59% of EDIT-Bench’s low-coverage suites would not detect modifications made outside the designated edit region.
- Unsolved and Redundant Problems: 15 EDIT-Bench problems are not solved by any of the 40 LLMs tested, and 11 of these failures trace to poor benchmark artifacts rather than to limitations of the models. Additionally, 29% of EDIT-Bench problems and 6% of CanItEdit problems share a codebase with at least one other problem within the same benchmark.
Implications for Future Benchmarking
Taken together, these benchmarks measure a narrower construct than deployment decisions require: their scores say little about languages other than Python, about backend and frontend work, or about documentation, testing, and maintenance edits.
To address these limitations, we propose six empirically grounded desiderata for developing better benchmarks. Furthermore, we are releasing all audit artifacts to the community. This initiative aims to facilitate the creation of instructed code-editing benchmarks whose scores can reliably reflect real-world editing capabilities.
Conclusion
Code-editing benchmarks are the foundation for assessing how large language models perform in practical scenarios. Benchmarks that better match real-world editing would make those assessments a more reliable guide to how coding assistants behave in real-world applications.
