AgentForge: Execution-Grounded Multi-Agent LLM Framework for Autonomous Software Engineering
Summary: arXiv:2604.13120v1 (announce type: cross)
Abstract: Large language models generate plausible code but cannot verify correctness. Existing multi-agent systems simulate execution or leave verification optional. We introduce execution-grounded verification as a first-class principle: every code change must survive sandboxed execution before propagation. We instantiate this principle in AGENTFORGE, a multi-agent framework where Planner, Coder, Tester, Debugger, and Critic agents coordinate through shared memory and a mandatory Docker sandbox.
We formalize software engineering with LLMs as an iterative decision process over repository states, where execution feedback provides a stronger supervision signal than next-token likelihood. AGENTFORGE achieves 40.0% resolution on SWE-BENCH Lite, outperforming single-agent baselines by 26–28 points. Ablations confirm that execution feedback and role decomposition each independently drive performance. The framework is open-source at https://github.com/raja21068/AutoCodeAI.
Introduction to AgentForge
Large language models (LLMs) can generate plausible code, but they cannot verify on their own that the code is correct. AgentForge is a multi-agent framework built to close this gap: it treats execution-grounded verification as a first-class principle, so every code change must survive sandboxed execution before it propagates.
Key Features of AgentForge
AgentForge rests on three design choices:
- Execution-Grounded Verification: Existing multi-agent systems simulate execution or leave verification optional. AgentForge instead requires every code change to pass execution in a mandatory Docker sandbox before it propagates.
- Multi-Agent Coordination: Five specialized agents (Planner, Coder, Tester, Debugger, and Critic) coordinate through shared memory.
- Iterative Decision Process: AgentForge formalizes software engineering with LLMs as an iterative decision process over repository states, in which execution feedback provides a stronger supervision signal than next-token likelihood.
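The gating idea behind these points can be illustrated with a minimal sketch. It is not AgentForge's code: `SharedMemory`, `run_in_sandbox`, `resolve`, and `toy_coder` are hypothetical names, a throwaway directory plus a subprocess stands in for the Docker sandbox, and only the Coder role is modeled (Planner, Tester, Debugger, and Critic are omitted). The point it shows is that a candidate change replaces the accepted repository state only after its tests actually run and pass, and that the execution output feeds the next attempt.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Tuple


@dataclass
class SharedMemory:
    """Hypothetical shared state that agents read and write."""
    repo: Dict[str, str]                          # accepted repository state: file name -> source
    log: List[str] = field(default_factory=list)  # history of execution outcomes


def run_in_sandbox(files: Dict[str, str], test_code: str,
                   timeout: float = 30.0) -> Tuple[bool, str]:
    """Stand-in for the Docker sandbox (assumption): write the candidate
    files to a temporary directory and run the tests in a subprocess."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, source in files.items():
            Path(tmp, name).write_text(source)
        Path(tmp, "run_tests.py").write_text(test_code)
        try:
            proc = subprocess.run([sys.executable, "run_tests.py"], cwd=tmp,
                                  capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stdout + proc.stderr


def resolve(memory: SharedMemory,
            coder: Callable[[SharedMemory, str], Dict[str, str]],
            test_code: str, max_iters: int = 3) -> bool:
    """Iterate over repository states; accept a change only if it executes cleanly."""
    feedback = ""
    for attempt in range(1, max_iters + 1):
        candidate = {**memory.repo, **coder(memory, feedback)}
        ok, output = run_in_sandbox(candidate, test_code)
        memory.log.append(f"attempt {attempt}: {'pass' if ok else 'fail'}")
        if ok:
            memory.repo = candidate  # propagate only after execution succeeds
            return True
        feedback = output            # execution feedback drives the next attempt
    return False


def toy_coder(memory: SharedMemory, feedback: str) -> Dict[str, str]:
    """Toy stand-in for an LLM Coder: buggy first, fixed once it sees feedback."""
    if not feedback:
        return {"calc.py": "def add(a, b):\n    return a - b\n"}
    return {"calc.py": "def add(a, b):\n    return a + b\n"}


TESTS = "from calc import add\nassert add(2, 3) == 5\n"

memory = SharedMemory(repo={"calc.py": "def add(a, b):\n    raise NotImplementedError\n"})
resolved = resolve(memory, toy_coder, TESTS)
print(resolved, memory.log)
```

In this sketch the buggy first patch never reaches `memory.repo`; only the second patch, which passes the tests, is accepted. A real deployment would replace the subprocess with an isolated container and the toy coder with model calls.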
Performance and Results
AgentForge achieves a 40.0% resolution rate on the SWE-BENCH Lite benchmark, outperforming single-agent baselines by 26 to 28 percentage points. Ablations confirm that execution feedback and role decomposition each independently drive performance.
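The reported margin implies a range for the single-agent baselines. The short calculation below uses only the two figures quoted above; the variable names are illustrative, and the derived baseline rates are an inference from those figures, not numbers stated in the abstract.

```python
# Figures quoted from the abstract.
agentforge_rate = 40.0                 # percent resolved on SWE-BENCH Lite
margin_low, margin_high = 26.0, 28.0   # percentage-point lead over single-agent baselines

# Implied baseline resolution rates (derived, not reported directly).
baseline_high = agentforge_rate - margin_low
baseline_low = agentforge_rate - margin_high

print(f"implied single-agent baselines: {baseline_low}% to {baseline_high}%")
# prints "implied single-agent baselines: 12.0% to 14.0%"
```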
Open Source Initiative
The AgentForge framework is open source. Developers and researchers can access it on GitHub at https://github.com/raja21068/AutoCodeAI and build on it in their own work on autonomous software engineering.
Conclusion
AgentForge makes execution-grounded verification a core principle of autonomous software engineering rather than an optional step. Its results on SWE-BENCH Lite indicate that combining role-specialized agents with mandatory sandboxed execution yields higher resolution rates than single-agent baselines.
