ReCUBE: Benchmarking Repo-Level Context in Code Generation

ReCUBE: Evaluating Repository-Level Context Utilization in Code Generation

Source: arXiv:2603.25770v1 (cross-listed)

Abstract

Large Language Models (LLMs) have recently emerged as capable coding assistants that operate over large codebases through either agentic exploration or full-context generation. Existing benchmarks capture a broad range of coding capabilities, such as resolving GitHub issues, but none of them directly isolate and measure how effectively LLMs leverage repository-level context during code generation. To address this, we introduce ReCUBE, a benchmark in which LLMs reconstruct a masked file within a real-world repository, using all remaining source files, dependency specifications, and documentation as their only source of context.
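The masked-file setup described above can be illustrated with a minimal sketch. It is not the benchmark's actual harness: the function names `build_masked_context` and `render_prompt`, the suffix filter, and the prompt layout are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical filter: which file types count as "context" is an assumption.
TEXT_SUFFIXES = (".py", ".md", ".txt", ".toml", ".cfg")


def build_masked_context(repo_root, masked_rel_path, suffixes=TEXT_SUFFIXES):
    """Return {relative_path: text} for every text file except the masked one."""
    root = Path(repo_root)
    masked = (root / masked_rel_path).resolve()
    context = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if path.resolve() == masked:
            continue  # the target file stays hidden from the model
        rel = path.relative_to(root).as_posix()
        context[rel] = path.read_text(encoding="utf-8", errors="replace")
    return context


def render_prompt(context, masked_rel_path):
    """Concatenate the visible files and append the reconstruction task."""
    parts = [f"### {name}\n{text}" for name, text in context.items()]
    parts.append(f"### Task\nReconstruct the full contents of {masked_rel_path}.")
    return "\n\n".join(parts)
```

A full-context setting would pass the rendered prompt to a model in one shot, while an agentic setting would instead let the model read these files on demand.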

Introduction to ReCUBE

ReCUBE evaluates reconstructed code with usage-aware test cases that simulate both internal module logic and external cross-file integration, reflecting real-world software usage patterns. Because the model's only inputs are the rest of the repository, the tests measure how well that context was used, in isolation from other coding skills.

Key Features of ReCUBE

Repository-Level Context: Unlike benchmarks that test broad coding skills such as resolving GitHub issues, ReCUBE specifically assesses how well LLMs leverage context from the entire repository.
Usage-Aware Test Cases: Test cases cover both the internal logic of the masked module and its integration with other files, so the evaluation reflects realistic software usage.
Open Source Release: The benchmark, code, and evaluation framework have been released to the NLP research community.

Caller-Centric Exploration (CCE) Toolkit

In conjunction with ReCUBE, we propose the Caller-Centric Exploration (CCE) toolkit, a set of dependency graph-based tools that can be integrated into agentic frameworks. This toolkit guides agents toward the most relevant caller files during repository exploration, enhancing their ability to navigate complex codebases.
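To make the idea of steering an agent toward caller files concrete, here is a minimal sketch that uses Python's standard `ast` module to find files importing a target module and to rank them by how often they use the imported names. It is an illustration only and not the CCE toolkit's implementation: the function name `caller_scores` and the usage-count ranking are assumptions.

```python
import ast


def caller_scores(files, target_module):
    """Rank files that import `target_module`.

    files: dict mapping a file path to its Python source text.
    Returns a list of (path, use_count) pairs, most uses first.
    Assumption: only absolute imports are handled; relative imports are skipped.
    """
    scores = {}
    for path, source in files.items():
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        bound = set()  # local names bound to the target module or its members
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if alias.name == target_module:
                        bound.add(alias.asname or alias.name.split(".")[0])
            elif (isinstance(node, ast.ImportFrom) and node.level == 0
                  and node.module == target_module):
                bound.update(alias.asname or alias.name for alias in node.names)
        if not bound:
            continue  # this file never imports the target
        uses = sum(isinstance(n, ast.Name) and n.id in bound
                   for n in ast.walk(tree))
        scores[path] = uses
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

An agent could call such a tool with the masked file's module name and read the top-ranked callers first, since those files show how the missing code is expected to behave.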

Experimental Results

Experiments involving eight models across four settings show that repository-level context utilization remains a highly challenging task, even for state-of-the-art models. Notably, GPT-5 achieved only a 37.57% strict pass rate in the full-context setting. However, agents augmented with our CCE toolkit consistently outperformed all baseline models, with improvements of up to 7.56% in strict pass rate.
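The summary above does not define the strict pass rate. The sketch below assumes one common reading, in which a reconstructed file counts as passing only if every one of its test cases passes; the function name `strict_pass_rate` and the input layout are hypothetical.

```python
def strict_pass_rate(results):
    """Percentage of instances whose test cases all pass.

    results: dict mapping an instance id to a list of boolean test outcomes.
    Assumption: an instance with no recorded outcomes counts as a failure.
    """
    if not results:
        return 0.0
    passed = sum(1 for outcomes in results.values()
                 if outcomes and all(outcomes))
    return 100.0 * passed / len(results)
```

Under this reading, a single failing integration test is enough to fail an instance, which is one reason a strict metric can stay low even when most individual tests pass.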

Conclusion

ReCUBE gives researchers a direct measure of how well LLMs use repository-level context during code generation, and the results show that even state-of-the-art models leave substantial room for improvement. The CCE toolkit offers one way to narrow that gap by steering agents toward the most relevant caller files. We encourage the NLP research community to build on the released benchmark, code, and evaluation framework.

Future Work

Moving forward, we aim to refine the ReCUBE benchmark and expand the CCE toolkit, with the goal of coding assistants that make better use of the repositories they work in.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
