Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations
Published on: arXiv:2603.26458v1
Summary: This article explores whether an expensive AI model can effectively direct a cheaper model to solve software engineering tasks through a two-agent pipeline called ManagerWorker.
Abstract
Can an expensive AI model effectively direct a cheap one to solve software engineering tasks? We investigate this question by introducing ManagerWorker, a two-agent pipeline where an expensive “manager” model (text-only, no code execution) analyzes issues, dispatches exploration tasks, and reviews implementations, while a cheap “worker” model (with full repository access) executes code changes. Our evaluation is based on 200 instances from SWE-bench Lite across five configurations that vary the manager-worker relationship, pipeline complexity, and model pairing.
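The control flow described above can be sketched in a few lines of orchestration code. This is a minimal illustration, not the paper's implementation: the prompt tags (ANALYZE, EXPLORE, PLAN, IMPLEMENT, REVIEW), the "APPROVE" verdict convention, the max_rounds limit, and the function name run_manager_worker are all hypothetical. Each model is stood in for by a plain text-in, text-out callable; a real worker would additionally have repository access.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: each model is represented as a function from prompt text to
# response text. The manager is text-only; a real worker would also be able
# to read and modify the repository.
ModelFn = Callable[[str], str]


@dataclass
class PipelineResult:
    patch: str      # last patch produced by the worker
    approved: bool  # whether the manager approved it
    rounds: int     # number of implement/review rounds used


def run_manager_worker(
    issue: str,
    manager: ModelFn,
    worker: ModelFn,
    max_rounds: int = 3,  # hypothetical limit, not taken from the paper
) -> PipelineResult:
    # 1. The manager analyzes the issue and dispatches an exploration task.
    exploration_task = manager(f"ANALYZE\n{issue}")
    findings = worker(f"EXPLORE\n{exploration_task}")

    # 2. The manager turns the findings into instructions for the worker.
    instructions = manager(f"PLAN\n{issue}\n{findings}")

    patch = ""
    for round_no in range(1, max_rounds + 1):
        # 3. The worker executes the code change.
        patch = worker(f"IMPLEMENT\n{instructions}")

        # 4. The manager reviews the result without running any code.
        verdict = manager(f"REVIEW\n{issue}\n{patch}")
        if verdict.startswith("APPROVE"):
            return PipelineResult(patch=patch, approved=True, rounds=round_no)

        # Otherwise the review feedback becomes the next round's instructions.
        instructions = verdict

    return PipelineResult(patch=patch, approved=False, rounds=max_rounds)
```

Note that the sequencing of analysis, exploration, implementation, and review lives in ordinary code rather than in either model's prompt, so each callable only ever has to answer a single, narrowly scoped request.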
Key Findings
- Effective Direction: A strong manager directing a weak worker achieves 62%, comparable to a strong single agent’s 60%, while using only a fraction of the strong model’s tokens. This indicates that the expensive model’s high-level reasoning can substitute for having it carry out costly code execution itself.
- Genuine Capability Gap: A weak manager directing a weak worker achieves 42%, below the 44% of the weak agent working alone. The directing relationship therefore requires a genuine capability gap; without one, the structure is mere overhead.
- Active Direction Matters: The manager’s value lies in active direction, not just in reviewing outputs. A minimal review-only loop adds only 2 percentage points over the baseline, whereas structured exploration and planning contribute a further 11 percentage points.
- Training Limitations: The observed behaviors stem from a single root cause: current AI models are predominantly trained as monolithic agents, so splitting them into director and worker roles pushes them outside their training distribution. The pipeline succeeds because it is designed around this mismatch: each model stays within its trained mode (text generation for the manager, tool use for the worker), and the organizational structure is externalized into the pipeline code rather than left to the models.
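To put the percentages in the findings above in absolute terms, the short calculation below converts them into instance counts. It assumes that each configuration was scored on all 200 SWE-bench Lite instances, which the summary implies but does not state outright; the configuration labels and helper names are illustrative.

```python
TOTAL_INSTANCES = 200  # evaluation set size stated in the abstract

# Reported scores in percent. The labels are shorthand for this sketch,
# not the paper's own configuration names.
SCORES_PCT = {
    "strong manager + weak worker": 62,
    "strong single agent": 60,
    "weak manager + weak worker": 42,
    "weak single agent": 44,
}


def instance_count(score_pct: int, total: int = TOTAL_INSTANCES) -> int:
    """Convert a percentage score into a number of instances."""
    return score_pct * total // 100


def gap_in_instances(config_a: str, config_b: str) -> int:
    """Difference in instance counts between two configurations (a minus b)."""
    return instance_count(SCORES_PCT[config_a]) - instance_count(SCORES_PCT[config_b])


if __name__ == "__main__":
    for name, pct in SCORES_PCT.items():
        print(f"{name}: {pct}% = {instance_count(pct)} of {TOTAL_INSTANCES}")
    print(gap_in_instances("strong manager + weak worker", "strong single agent"))
    print(gap_in_instances("weak manager + weak worker", "weak single agent"))
```

Under that assumption, each 2-point difference corresponds to 4 instances out of 200, which is why the strong-manager pipeline and the strong single agent are best described as comparable rather than ranked.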
Implications for Future Research
This study points to concrete training gaps in current AI models, particularly in the areas of:
- Delegation: Assigning tasks to another agent effectively.
- Scoped Execution: Carrying out an assigned task without straying beyond its defined scope.
- Mode Switching: Moving between operational modes as the task requires.
Conclusion
Our findings suggest that a two-agent pipeline holds significant promise for software engineering tasks, but that its effectiveness hinges on the manager’s ability to actively direct the worker. Future research should address the identified training limitations to strengthen the collaborative potential of AI models.
