Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
Large Language Models (LLMs) have drawn attention in the software engineering community for their ability to perform a range of tasks, including question answering (QA). Prior studies, however, have focused mostly on isolated functions or single-file code snippets. That narrow scope overlooks what real-world program comprehension usually requires: understanding multiple files and the dependencies between them.
To bridge this gap, a recent study introduces StackRepoQA, the first multi-project, repository-level question answering dataset. The dataset is built from 1,318 real developer questions and their accepted answers, drawn from 134 open-source Java projects. The study evaluates two prominent LLMs, Claude 3.5 Sonnet and GPT-4o, under different prompting configurations, including direct prompting and agentic setups.
Methodology and Evaluation
The study systematically assesses the LLMs’ performance in repository-level QA by comparing baseline accuracy with advanced methods that utilize retrieval-augmented generation. These advanced methods leverage file-level retrieval and graph-based representations of structural dependencies to improve comprehension and accuracy.
- Dataset Creation: StackRepoQA collects 1,318 real developer questions with accepted answers from 134 open-source Java projects.
- Model Evaluation: The study evaluates Claude 3.5 Sonnet and GPT-4o under direct prompting and agentic setups.
- Methods Utilized: Retrieval-augmented generation, using file-level retrieval and graph-based representations of structural dependencies, is compared against the baseline.
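To make the retrieval-augmented setup more concrete, the following is a minimal sketch of how file-level retrieval and a dependency graph could be combined to build a prompt. It is not the study's implementation: the token-overlap scoring, the one-hop graph expansion, and every name here (`rank_files`, `expand_with_dependencies`, `build_prompt`, the example file names) are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase identifier-like tokens; a crude stand-in for a real code tokenizer."""
    return re.findall(r"[A-Za-z_]\w*", text.lower())


def rank_files(question, files, top_k=1):
    """File-level retrieval: score each file by token overlap with the question.

    Assumption: simple bag-of-words overlap; a real system would more likely
    use embeddings or BM25.
    """
    q = Counter(tokenize(question))
    scores = {}
    for path, source in files.items():
        f = Counter(tokenize(source))
        scores[path] = sum(min(q[t], f[t]) for t in q)
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [p for p in ranked[:top_k] if scores[p] > 0]


def expand_with_dependencies(seeds, deps, hops=1):
    """Graph-based expansion: add files reachable from the seeds within `hops` edges.

    `deps` is a hypothetical adjacency map: file -> files it depends on.
    """
    selected = list(seeds)
    frontier = list(seeds)
    for _ in range(hops):
        next_frontier = []
        for path in frontier:
            for dep in deps.get(path, []):
                if dep not in selected:
                    selected.append(dep)
                    next_frontier.append(dep)
        frontier = next_frontier
    return selected


def build_prompt(question, files, deps, top_k=1, hops=1):
    """Concatenate retrieved files and the question into one prompt string."""
    seeds = rank_files(question, files, top_k)
    context = expand_with_dependencies(seeds, deps, hops)
    parts = [f"// File: {p}\n{files[p]}" for p in context if p in files]
    return "\n\n".join(parts + [f"Question: {question}"])


# Hypothetical three-file repository used only for this example.
files = {
    "OrderService.java": "class OrderService { PaymentGateway gateway; "
                         "void placeOrder() { gateway.charge(); } }",
    "PaymentGateway.java": "class PaymentGateway { void charge() {} }",
    "Logger.java": "class Logger { void log() {} }",
}
deps = {"OrderService.java": ["PaymentGateway.java"]}
question = "Why does placeOrder fail in OrderService?"
prompt = build_prompt(question, files, deps)
```

In this toy repository, retrieval alone selects only `OrderService.java`; the dependency edge is what brings `PaymentGateway.java` into the context, which illustrates why structural signals might help when the relevant code is not lexically similar to the question.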
Findings and Implications
The results indicate that while LLMs achieve moderate accuracy in repository-level QA tasks at baseline, their performance improves significantly when structural signals are integrated into the reasoning process. However, the overall accuracy remains limited, highlighting the challenges inherent in repository-scale comprehension.
Notably, the analysis reveals a concerning trend: high scores often stem from the verbatim reproduction of Stack Overflow answers, rather than from genuine reasoning capabilities. This finding emphasizes the need for further research to disentangle memorization from authentic understanding in LLMs.
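One way to probe for verbatim reproduction is to measure how much of a model's answer consists of word sequences copied from the accepted Stack Overflow answer. The sketch below uses word n-gram overlap for this; the source does not describe the study's actual metric, so the function names, the choice of n = 5, and the example strings are all assumptions.

```python
def ngrams(text, n):
    """Return the list of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def verbatim_overlap(model_answer, reference_answer, n=5):
    """Fraction of the model answer's n-grams that also occur in the reference.

    Values near 1.0 suggest copying; values near 0.0 suggest the answer was
    worded independently. Returns 0.0 when the answer is shorter than n words.
    """
    candidate = ngrams(model_answer, n)
    if not candidate:
        return 0.0
    reference = set(ngrams(reference_answer, n))
    copied = sum(1 for gram in candidate if gram in reference)
    return copied / len(candidate)


# Hypothetical answers used only for this example.
reference = ("you need to register the bean in the application context "
             "before calling refresh")
copied_answer = reference
paraphrased_answer = "the bean must be added to the context first"

copied_score = verbatim_overlap(copied_answer, reference)
paraphrased_score = verbatim_overlap(paraphrased_answer, reference)
```

A high overlap score does not prove memorization on its own, since short or formulaic answers can match by chance, but comparing accuracy against such a score can help separate copied answers from independently reasoned ones.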
Future Directions
The introduction of StackRepoQA aims to encourage ongoing research into benchmarking, evaluation protocols, and augmentation strategies that can improve LLM performance in repository-level QA. By addressing the limitations identified in this study, researchers can work towards advancing LLMs into reliable tools for repository-scale program comprehension.
- Further exploration of augmentation strategies.
- Development of more robust evaluation protocols.
- Encouragement of interdisciplinary collaboration to enhance LLM capabilities.
In conclusion, StackRepoQA gives software engineering and LLM researchers a benchmark that shows both what current LLMs can do when reasoning over complex codebases and where they still fall short, and it provides a basis for future work on repository-scale comprehension.
