ReCUBE: Evaluating Repository-Level Context Utilization in Code Generation
Summary: arXiv:2603.25770v1
Abstract
Large Language Models (LLMs) have recently emerged as capable coding assistants that operate over large codebases through either agentic exploration or full-context generation. Existing benchmarks capture a broad range of coding capabilities, such as resolving GitHub issues, but none of them directly isolate and measure how effectively LLMs leverage repository-level context during code generation. To address this, we introduce ReCUBE, a benchmark in which LLMs reconstruct a masked file within a real-world repository, using all remaining source files, dependency specifications, and documentation as their only source of context.
Introduction to ReCUBE
ReCUBE evaluates reconstructed code with usage-aware test cases that simulate both internal module logic and external cross-file integration, reflecting real-world software usage patterns. Because the masked file must be rebuilt from the rest of the repository alone, the resulting score isolates how well a model uses that context rather than mixing it with other coding skills.
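The masking setup can be illustrated with a small sketch. This is not the benchmark's actual code: `MaskedFileTask`, `build_task`, and the toy repository below are hypothetical names chosen for illustration, and a repository is assumed to be a simple mapping from file path to file text.

```python
from dataclasses import dataclass


@dataclass
class MaskedFileTask:
    """Hypothetical container for one masked-file reconstruction task."""
    target_path: str          # file the model must reconstruct
    context: dict[str, str]   # every other file, keyed by path
    reference: str            # original content, held out for evaluation


def build_task(repo: dict[str, str], target_path: str) -> MaskedFileTask:
    """Hide one file and expose the rest of the repository as context."""
    if target_path not in repo:
        raise KeyError(f"{target_path} is not in the repository snapshot")
    context = {path: text for path, text in repo.items() if path != target_path}
    return MaskedFileTask(target_path, context, repo[target_path])


# Toy repository (assumed layout): source files, a dependency spec, and docs.
repo = {
    "pkg/utils.py": "def add(a, b):\n    return a + b\n",
    "pkg/main.py": "from pkg.utils import add\nprint(add(1, 2))\n",
    "README.md": "Toy project.\n",
    "requirements.txt": "",
}
task = build_task(repo, "pkg/utils.py")
```

In this sketch the model would see only `task.context`; its output for `task.target_path` would then be run against the tests instead of being compared textually with `task.reference`.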
Key Features of ReCUBE
- Repository-Level Context: Unlike traditional benchmarks, ReCUBE specifically assesses the ability of LLMs to leverage the context from the entire repository.
- Usage-Aware Test Cases: Test cases are designed to reflect realistic software usage, ensuring that the evaluation is grounded in practical scenarios.
- Open Source Release: The benchmark, code, and evaluation framework are released as open source for the NLP research community.
Caller-Centric Exploration (CCE) Toolkit
In conjunction with ReCUBE, we propose the Caller-Centric Exploration (CCE) toolkit, a set of dependency graph-based tools that can be integrated into agentic frameworks. This toolkit guides agents toward the most relevant caller files during repository exploration, enhancing their ability to navigate complex codebases.
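As a rough illustration of the caller-centric idea, the sketch below finds which files import a masked module by parsing import statements. It is an assumption-laden toy, not the CCE toolkit itself: `imported_modules`, `module_name`, and `find_callers` are hypothetical helpers, only absolute imports are handled, and paths are mapped to module names by a naive convention.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return dotted names that a piece of Python source imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
            # "from pkg import utils" may refer to the submodule pkg.utils.
            modules.update(f"{node.module}.{alias.name}" for alias in node.names)
    return modules


def module_name(path: str) -> str:
    """Map 'pkg/utils.py' to 'pkg.utils' (toy convention, no __init__ handling)."""
    return path[:-3].replace("/", ".")


def find_callers(files: dict[str, str], target_path: str) -> list[str]:
    """List the Python files that import the module at target_path."""
    target = module_name(target_path)
    callers = []
    for path, source in files.items():
        if path == target_path or not path.endswith(".py"):
            continue
        if target in imported_modules(source):
            callers.append(path)
    return sorted(callers)


# Toy context: the masked file pkg/utils.py is absent, as it would be for an agent.
files = {
    "pkg/main.py": "from pkg.utils import add\nprint(add(1, 2))\n",
    "pkg/cli.py": "from pkg import utils\n",
    "pkg/other.py": "import os\n",
    "README.md": "Toy project.\n",
}
callers = find_callers(files, "pkg/utils.py")
```

An agent could read the files in `callers` first, since they show how the masked module is expected to be used. A real dependency graph would also need relative imports, re-exports, and call-site resolution.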
Experimental Results
Experiments with eight models across four settings show that repository-level context utilization remains highly challenging, even for state-of-the-art models. Notably, GPT-5 achieves only a 37.57% strict pass rate in the full-context setting. Agents augmented with our CCE toolkit, however, consistently outperform all baselines, with improvements of up to 7.56% in strict pass rate.
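The text above does not define the strict pass rate. The sketch below assumes one plausible reading: a reconstructed file counts as passing only if every one of its test cases passes, and the rate is the percentage of such files. Both this definition and the name `strict_pass_rate` are assumptions for illustration, not the benchmark's documented metric.

```python
def strict_pass_rate(results: list[list[bool]]) -> float:
    """Percentage of instances whose test cases ALL pass (assumed definition)."""
    if not results:
        raise ValueError("no instances to score")
    # An instance with no tests is treated as a failure in this sketch.
    passed = sum(1 for tests in results if tests and all(tests))
    return 100.0 * passed / len(results)


# Four hypothetical instances, each with its own per-test outcomes.
outcomes = [
    [True, True],    # all tests pass
    [True, False],   # one failure, so the instance does not count
    [True],          # all tests pass
    [False, False],  # all tests fail
]
rate = strict_pass_rate(outcomes)
```

Under this reading a single failing test disqualifies the whole instance, which makes the metric considerably harsher than a per-test average.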
Conclusion
ReCUBE provides a benchmark that isolates how LLMs use repository-level context during code generation, and the CCE toolkit gives agents dependency graph-based tools for finding the most relevant caller files. We encourage the NLP research community to use our open-source resources to study and improve this capability.
Future Work
Moving forward, we aim to refine the ReCUBE benchmark and expand the CCE toolkit to further enhance the capabilities of LLMs in code generation tasks. Continued research in this area will contribute to more effective coding assistants, ultimately benefiting software development processes across various industries.
