A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation
arXiv:2603.26137v1
Type: cross
Abstract
The evaluation of repository-aware software engineering systems faces several challenges, including synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. To address these issues, we introduce a time-consistent benchmark methodology that captures a snapshot of a repository at time T0. This methodology constructs repository-derived code knowledge using only artifacts available before T0 and evaluates engineering tasks derived from pull requests merged in the future interval (T0, T1].
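The time split can be illustrated with a minimal sketch. The `Artifact` and `PullRequest` types, their field names, and the `split_by_time` helper are hypothetical and introduced here only for illustration; the paper's actual data model is not described in this summary. The sketch also assumes that "available before T0" means a strictly earlier timestamp.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Artifact:
    """Hypothetical repository artifact (file, commit, document, ...)."""
    path: str
    created_at: datetime


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical merged pull request."""
    number: int
    merged_at: datetime


def split_by_time(artifacts, pull_requests, t0, t1):
    """Return (knowledge_sources, task_sources) for the snapshot time t0.

    Knowledge uses only artifacts created before t0 (assumed strict).
    Tasks use only pull requests merged in the half-open interval (t0, t1].
    """
    if not t0 < t1:
        raise ValueError("t0 must precede t1")
    knowledge_sources = [a for a in artifacts if a.created_at < t0]
    task_sources = [p for p in pull_requests if t0 < p.merged_at <= t1]
    return knowledge_sources, task_sources
```

Because the two filters select disjoint time ranges, no pull request used as a task can also have contributed to the knowledge base, which is the contamination the methodology is designed to rule out.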
Methodology Overview
Each historical pull request is transformed into a natural-language task through a large language model (LLM)-assisted prompt-generation pipeline. The benchmark is formalized as a matched A/B comparison: the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant, so that any difference in outcome can be attributed to the repository knowledge itself.
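One way to read the matched A/B design is as a paired comparison over identical tasks. The sketch below is an assumption about how such a comparison could be scored, not the paper's procedure; the `paired_effect` name and the task identifiers are hypothetical.

```python
from statistics import mean


def paired_effect(scores_without, scores_with):
    """Compare two arms of a matched A/B run.

    Both arguments map a task id to a score for the same agent:
    one run without repository knowledge, one run with it.
    Returns (mean_delta, per_task_delta), where delta = with - without.
    """
    if scores_without.keys() != scores_with.keys():
        raise ValueError("both arms must cover exactly the same tasks")
    if not scores_without:
        raise ValueError("no tasks to compare")
    per_task_delta = {
        task: scores_with[task] - scores_without[task]
        for task in scores_without
    }
    return mean(per_task_delta.values()), per_task_delta
```

Requiring the same task set in both arms is what makes the comparison matched: a task missing from one arm would otherwise shift the average for reasons unrelated to repository knowledge.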
Results and Analysis
We conducted a baseline characterization study on two prominent open-source repositories, DragonFly and React. The study applied three Claude-family models across four prompt granularities. File-level F1 scores improved consistently as prompt granularity increased:
- DragonFly: Achieved a maximum F1 score of 0.8081 with the strongest tested model.
- React: Reached an F1 score of 0.8078 with the same model.
These findings suggest that prompt construction is itself a significant benchmark variable: changing prompt granularity alone changes the measured outcome.
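The summary does not define file-level F1. A common reading, assumed here, is the F1 score between the set of files an agent proposes to change and the set of files changed in the original pull request. The `file_level_f1` function below is a hypothetical sketch under that assumption.

```python
def file_level_f1(predicted_files, actual_files):
    """F1 between predicted and actual sets of changed file paths.

    Assumed definition: precision = |P ∩ A| / |P|, recall = |P ∩ A| / |A|,
    F1 = harmonic mean of the two. Returns 0.0 when there is no overlap
    (including when either set is empty), by convention.
    """
    predicted, actual = set(predicted_files), set(actual_files)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting two files of which one is correct, against a pull request that touched three files, gives precision 1/2, recall 1/3, and F1 of 0.4.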
Key Insights
The benchmark methodology emphasizes two core aspects that are essential for valid repository-aware software engineering evaluations:
- Temporal Consistency: Repository knowledge is built only from artifacts available before T0, while tasks come only from pull requests merged in (T0, T1], so the knowledge cannot contain the changes being evaluated.
- Prompt Control: Prompt design and granularity materially affect the measured performance of software engineering agents, so they must be controlled as benchmark variables.
Conclusion
This time-consistent benchmark advances the evaluation of repository-aware software engineering systems. By addressing temporal contamination and prompt leakage, the methodology provides a framework for assessing what repository knowledge contributes to these systems. The results of our study underscore the importance of prompt construction and lay a foundation for further research in this area.
As the field of software engineering continues to evolve, benchmarks that uphold temporal integrity and control prompt construction will be essential for building more reliable and effective repository-aware systems.
