Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP
Summary: arXiv:2603.27277v1
Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community discovery.
Large Language Models (LLMs) have changed how developers interact with code, but coding agents built on them still explore repositories mostly by reading files and running grep-style searches. That approach consumes thousands of tokens per query and gives the agent no structural view of the codebase. This article introduces Codebase-Memory, a system that addresses the problem by giving LLM agents a Tree-Sitter-based knowledge graph to query.
Overview of Codebase-Memory
Codebase-Memory is an open-source system that addresses these limitations. It builds a persistent knowledge graph from syntax trees produced by Tree-Sitter, a parser generator and incremental parsing library that builds concrete syntax trees for many programming languages. The system parses 66 languages, and the resulting graph is made available to LLM agents through the Model Context Protocol (MCP), so an agent can query code structure instead of re-reading files.
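To make the idea of a code knowledge graph concrete, here is a minimal sketch that extracts functions (nodes) and direct calls (edges) from a source string. It is not Codebase-Memory's implementation: it uses Python's standard-library `ast` module as a stand-in for Tree-Sitter so that it runs without extra dependencies, and the names `SOURCE` and `build_call_graph` are hypothetical.

```python
import ast

# Hypothetical sample input; a real indexer would read files from a repository.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(path):
    return load(path).split()

def main():
    parse("a.txt")
    load("b.txt")
'''


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the top-level functions it calls directly.

    Assumption: `ast` stands in for Tree-Sitter here. Only plain-name calls
    such as `load(...)` are resolved; method calls are ignored.
    """
    tree = ast.parse(source)
    functions = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    defined = {fn.name for fn in functions}

    graph: dict[str, set[str]] = {}
    for fn in functions:
        called = {
            node.func.id
            for node in ast.walk(fn)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        # Keep only edges that point at functions defined in this source.
        graph[fn.name] = called & defined
    return graph


call_graph = build_call_graph(SOURCE)
for caller, callees in call_graph.items():
    print(caller, "->", sorted(callees))
```

A persistent system would store these nodes and edges in a database rather than an in-memory dictionary, so the graph survives across agent sessions.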
Key Features
- Multi-phase Pipeline: Source code is parsed and turned into the knowledge graph through a multi-phase pipeline.
- Parallel Worker Pools: The system utilizes parallel worker pools to speed up processing and analysis tasks.
- Call-Graph Traversal: It incorporates call-graph traversal techniques to understand the relationships between different code components.
- Impact Analysis: The system performs impact analysis to evaluate how changes in one part of the codebase may affect others.
- Community Discovery: Codebase-Memory can identify communities in the graph, that is, groups of closely connected code entities, which gives insight into how the codebase is organized and where its dependencies cluster.
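Call-graph traversal and impact analysis can be illustrated with a small sketch. Assuming a call graph is already available as a mapping from caller to callees, the functions that may be affected by a change are those that can reach the changed function, which a breadth-first search over reversed edges finds. The graph `CALLS` and the helper names are hypothetical and are not taken from Codebase-Memory.

```python
from collections import deque

# Hypothetical call graph: caller -> direct callees.
CALLS = {
    "main": ["parse", "report"],
    "parse": ["load"],
    "report": ["format_row"],
    "load": [],
    "format_row": [],
}


def reverse_edges(calls: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert the graph so each function maps to its direct callers."""
    callers: dict[str, list[str]] = {name: [] for name in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    return callers


def impacted_by(changed: str, calls: dict[str, list[str]]) -> dict[str, int]:
    """Return every transitive caller of `changed` with its hop distance."""
    callers = reverse_edges(calls)
    distance = {changed: 0}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in distance:  # visit each function once
                distance[caller] = distance[current] + 1
                queue.append(caller)
    del distance[changed]  # the changed function itself is not a dependent
    return distance


impact = impacted_by("load", CALLS)
print(impact)
```

The hop distance gives a rough ordering: direct callers are the most likely to need attention, while distant callers are affected only through intermediate functions.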
Performance Evaluation
Codebase-Memory was evaluated across 31 real-world repositories. It achieves 83% answer quality versus 92% for a file-exploration agent, 9 percentage points lower, while using ten times fewer tokens and 2.1 times fewer tool calls. The result is a trade-off: somewhat lower answer quality in exchange for a large reduction in cost.
For graph-native queries such as hub detection and caller ranking, Codebase-Memory matches or exceeds the file-exploration agent on 19 of the 31 languages tested. This points to the value of knowledge graphs for structural code-exploration queries.
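The two graph-native queries named above are cheap to answer once call edges are stored. The sketch below assumes a simple definition of each: caller ranking orders functions by how many call edges point at them, and hub detection selects functions whose total degree meets a threshold. The source does not specify how Codebase-Memory defines or computes these, so the edge list `EDGES`, the threshold, and the function names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical call edges as (caller, callee) pairs.
EDGES = [
    ("main", "parse"),
    ("main", "report"),
    ("parse", "load"),
    ("report", "load"),
    ("report", "format_row"),
    ("cli", "main"),
    ("test_load", "load"),
]


def rank_by_callers(edges: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Rank functions by number of incoming call edges, ties broken by name."""
    fan_in = Counter(callee for _, callee in edges)
    return sorted(fan_in.items(), key=lambda item: (-item[1], item[0]))


def find_hubs(edges: list[tuple[str, str]], min_degree: int) -> list[str]:
    """Return functions whose in-degree plus out-degree is at least min_degree."""
    degree: Counter[str] = Counter()
    for caller, callee in edges:
        degree[caller] += 1
        degree[callee] += 1
    return sorted(name for name, d in degree.items() if d >= min_degree)


ranking = rank_by_callers(EDGES)
hubs = find_hubs(EDGES, min_degree=3)
print(ranking[0])
print(hubs)
```

An agent limited to file reading and grep would have to locate and count call sites across the repository to answer the same questions, which is one plausible reason structural queries favor a graph.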
Conclusion
Codebase-Memory shows that a Tree-Sitter-based knowledge graph, exposed to agents through the Model Context Protocol, can make LLM code exploration much cheaper. In the reported evaluation it reaches 83% answer quality against 92% for a file-exploration agent while using ten times fewer tokens and 2.1 times fewer tool calls, and it matches or exceeds that agent on 19 of 31 languages for graph-native queries. Systems of this kind could help coding agents work with large codebases at a fraction of the token cost.
