Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development
Summary: arXiv:2604.06793v1 Announce Type: cross
Software documentation is crucial for repository comprehension. Large Language Models (LLMs) have advanced automated documentation generation from code snippets to entire repositories, but existing benchmarks for evaluating the generated documentation have two main shortcomings:
- The absence of a comprehensive, repository-level assessment.
- Reliance on evaluation strategies that are often unreliable, such as using LLMs as judges, which can be hindered by vague criteria and limited repository-level knowledge.
To tackle these challenges, we introduce SWD-Bench, a novel benchmark for evaluating repository-level software documentation. Inspired by documentation-driven development, the benchmark measures documentation quality by assessing an LLM’s ability to understand and implement functionalities using that documentation, rather than by scoring the documentation directly.
The evaluation is structured around functionality-driven Question Answering (QA) tasks. SWD-Bench is composed of three interconnected QA tasks:
- Functionality Detection: This task assesses whether a given functionality is adequately described within the documentation.
- Functionality Localization: This task evaluates how accurately the files related to the functionality can be identified.
- Functionality Completion: This task measures how comprehensively the implementation details are documented.
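The three tasks can be pictured with a small scoring sketch. Everything below is an assumption made for illustration: the function names, the yes/no answer format for detection, the F1 overlap for localization, and the share-of-details measure for completion are hypothetical and are not taken from SWD-Bench itself.

```python
def detection_score(predicted_present: bool, actually_present: bool) -> float:
    """Functionality Detection: 1.0 if the yes/no answer matches the label."""
    return 1.0 if predicted_present == actually_present else 0.0


def localization_score(predicted_files: set[str], gold_files: set[str]) -> float:
    """Functionality Localization: F1 overlap between predicted and gold files."""
    hits = len(predicted_files & gold_files)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted_files)
    recall = hits / len(gold_files)
    return 2 * precision * recall / (precision + recall)


def completion_score(recovered_details: set[str], gold_details: set[str]) -> float:
    """Functionality Completion: share of gold implementation details recovered."""
    if not gold_details:
        return 0.0
    return len(recovered_details & gold_details) / len(gold_details)


# Hypothetical example: two files predicted, only one of them is correct.
print(localization_score({"src/cache.py", "src/api.py"}, {"src/cache.py"}))
```

In this example precision is 0.5 and recall is 1.0, so the localization score is roughly 0.67. Keeping one score per task makes it possible to see whether documentation fails at describing a functionality, at pointing to its files, or at conveying its implementation details.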
To construct SWD-Bench, we curated a dataset of 4,170 entries sourced from high-quality Pull Requests and enriched with repository-level context. This dataset allows for an in-depth evaluation of documentation quality across various repositories.
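One way to picture an entry is as a record that pairs a Pull Request with context drawn from the rest of the repository. The sketch below is an assumption: the field names, the `build_entry` helper, and the choice to treat files in the same directory as context are hypothetical and do not describe the actual SWD-Bench schema.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class BenchmarkEntry:
    repository: str                  # e.g. "owner/project" (hypothetical format)
    functionality: str               # description of what the Pull Request adds
    changed_files: tuple[str, ...]   # files touched by the Pull Request
    context_files: tuple[str, ...]   # nearby files added as repository context


def build_entry(repository: str, functionality: str,
                changed_files: list[str], all_files: list[str]) -> BenchmarkEntry:
    """Attach repository-level context: files sharing a directory with a change."""
    changed = set(changed_files)
    changed_dirs = {PurePosixPath(path).parent for path in changed}
    context = sorted(
        path for path in all_files
        if path not in changed and PurePosixPath(path).parent in changed_dirs
    )
    return BenchmarkEntry(repository, functionality,
                          tuple(sorted(changed)), tuple(context))


# Hypothetical usage with made-up paths.
entry = build_entry(
    "owner/project",
    "Add response caching",
    changed_files=["src/cache.py"],
    all_files=["src/cache.py", "src/api.py", "docs/index.md"],
)
print(entry.context_files)
```

The record is frozen so that an entry cannot be altered after construction, which keeps repeated evaluation runs comparable.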
Initial experiments with SWD-Bench reveal several limitations of current documentation generation methods and indicate that the source code itself provides complementary value that can enhance documentation quality. Notably, documentation produced by the best-performing method increased the issue-solving rate of SWE-Agent by 20.00%. This finding underscores the practical value of high-quality documentation for documentation-driven development.
In conclusion, SWD-Bench evaluates software documentation at the repository level by testing whether functionalities can be understood and implemented from it. By addressing the limitations of existing benchmarks and focusing on practical implementation, it supports both the assessment and the improvement of software documentation.
