RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets
Summary: arXiv:2604.04347v1
As of 2026, interest in using large language models (LLMs) to evolve agentic artifacts has grown rapidly. Systems such as GEPA and Autoresearch have shown that LLMs can iteratively improve prompts, code, and agent architectures across many domains. This adoption raises a practical question: given the same information, seed agent, and objective, which optimization algorithm produces the best results under a strict evaluation budget? The question matters most when evaluations are expensive, for example when they require human judgment or multiple LLM calls.
We compare three optimization paradigms: Elo tournament selection (RoboPhD), Pareto-based selection (GEPA), and greedy hill-climbing (Autoresearch). The comparison covers four benchmarks:
- Abstract reasoning
- Cloud scheduling
- SQL generation
- Financial question answering
Every run is limited to a fixed budget of 1,500 evaluations. RoboPhD introduces validation-free evolution: instead of dividing the budget between training and validation, it runs Elo competition on training data, which both evaluates the agents and drives their evolution.
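To make the idea of Elo competition on training data concrete, here is a minimal sketch of Elo-rated selection under a fixed evaluation budget. It is not RoboPhD's implementation: the K-factor of 32, the starting rating of 1000, the random pairing, and the rule that one head-to-head on one training example costs two evaluations are all assumptions, and the names `run_tournament` and `evaluate` are hypothetical.

```python
import random

def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b); score_a is 1.0 win, 0.5 draw, 0.0 loss."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

def run_tournament(agents, examples, evaluate, budget, k=32.0, seed=0):
    """Rate agents by head-to-head matches on training examples.

    agents:   dict mapping name -> agent (hypothetical container)
    evaluate: callable(agent, example) -> numeric score (assumed interface)
    budget:   maximum number of evaluate() calls
    """
    rng = random.Random(seed)
    ratings = {name: 1000.0 for name in agents}  # assumed starting rating
    used = 0
    while used + 2 <= budget:  # assumption: each match costs two evaluations
        a, b = rng.sample(sorted(agents), 2)
        example = rng.choice(examples)
        score_a = evaluate(agents[a], example)
        score_b = evaluate(agents[b], example)
        used += 2
        outcome = 1.0 if score_a > score_b else 0.0 if score_a < score_b else 0.5
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome, k)
    return ratings, used
```

In a sketch like this, the ratings double as the selection signal: the same training-set matches that rank the agents can also decide which ones are kept as parents, so no evaluations need to be reserved for a separate validation split.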
All three systems start from seed agents that contain diagnostic print() statements. Evolution can extend these statements, producing self-instrumenting agents whose increasingly informative diagnostics benefit their evolutionary successors. With a single default configuration, RoboPhD outperforms both GEPA and Autoresearch on three of the four benchmarks. The only exception is the simplest task, where the winning solution (adapted from Autoresearch) required fewer than 90 lines of code.
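As an illustration of a seed agent with diagnostic print() statements, the sketch below pairs a toy agent with a harness that captures its printed output. The source does not describe how diagnostics are collected or passed on, so the `[diag]` prefix, the toy task, and the names `seed_agent` and `run_with_diagnostics` are all hypothetical.

```python
import io
from contextlib import redirect_stdout

def seed_agent(question):
    """Toy seed agent (hypothetical task: normalize the input text)."""
    print(f"[diag] input length: {len(question)}")
    answer = question.strip().upper()
    print(f"[diag] answer: {answer!r}")
    return answer

def run_with_diagnostics(agent, question):
    """Run the agent and return its answer plus the lines it printed."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        answer = agent(question)
    return answer, buffer.getvalue().splitlines()

answer, diagnostics = run_with_diagnostics(seed_agent, " hi ")
```

Under this assumption, the captured lines could be shown to the LLM that proposes the next agent, which is one way richer diagnostics might help later generations.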
On the ARC-AGI benchmark, RoboPhD evolves a 22-line seed agent into a 1,013-line multi-strategy system, raising accuracy from 27.8% to 65.8% (a gain of 38.0 percentage points) with Gemini 3.1 Flash Lite as the solver. This result illustrates how much an efficient evolutionary process can improve an agentic system.
To support further research, we release RoboPhD as a toolkit under the MIT license. It provides a simple optimize_anything() API for evolving diverse complex agents.
Taken together, these results support RoboPhD as an effective optimization paradigm and point to further work on LLM-driven agent evolution.
