Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
Summary: arXiv:2604.11465v1 Announce Type: new
Abstract: Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments.
Operating on a single 24 GB GPU, we evaluate Qwen3-8B under both full-precision (FP16, 12K context) and 4-bit quantized (AWQ, 32K context) configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion.
Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles:
- Summarization Model: This role preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history.
- Main Agent Model: This role reasons over the compressed context and decides on the next action.
- Correction Model: This isolated role reviews and revises the agent’s code output without access to the conversation history, which breaks repetitive failure loops.
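The three roles above can be sketched as a single loop around one text-generation function. This is a minimal illustration, not the authors' implementation: the `ScaffoldedAgent` class, the prompt templates, the character-based compression threshold, and the `generate` callable are all hypothetical names introduced here. The only properties taken from the abstract are that the same frozen model serves all three roles and that the corrector sees the agent's code but not the conversation history.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to generated text.
# In practice this would wrap a single frozen model; here it is left abstract.
Generate = Callable[[str], str]

# Illustrative prompt templates (assumptions, not taken from the paper).
SUMMARIZER_PROMPT = (
    "[SUMMARIZE] Compress the dialogue. Keep tokens, credentials and API "
    "responses verbatim.\nPrevious summary:\n{summary}\nDialogue:\n{transcript}"
)
AGENT_PROMPT = (
    "[AGENT] Task: {task}\nSummary so far:\n{summary}\nRecent turns:\n{recent}\n"
    "Write the next code action."
)
CORRECTOR_PROMPT = (
    "[CORRECT] Review the code below and return a corrected version.\n{code}"
)


@dataclass
class ScaffoldedAgent:
    generate: Generate              # the same frozen model backs all three roles
    max_history_chars: int = 2000   # assumed trigger for compression
    history: List[str] = field(default_factory=list)
    summary: str = ""

    def _maybe_summarize(self) -> None:
        """Role 1: compress the history once it grows past the threshold."""
        transcript = "\n".join(self.history)
        if len(transcript) <= self.max_history_chars:
            return
        prompt = SUMMARIZER_PROMPT.format(summary=self.summary, transcript=transcript)
        self.summary = self.generate(prompt)
        self.history.clear()

    def step(self, task: str, observation: str) -> str:
        self.history.append(f"OBSERVATION: {observation}")
        self._maybe_summarize()

        # Role 2: the main agent reasons over the compressed context.
        draft = self.generate(
            AGENT_PROMPT.format(
                task=task, summary=self.summary, recent="\n".join(self.history)
            )
        )

        # Role 3: the corrector sees only the drafted code, never the history.
        revised = self.generate(CORRECTOR_PROMPT.format(code=draft))

        self.history.append(f"ACTION: {revised}")
        return revised
```

To try the sketch, pass any callable as `generate`, for example a stub that returns canned strings, and call `step(task, observation)` once per environment turn. Keeping the corrector prompt free of history is the design choice that matters here: a reviewer that cannot see earlier failed attempts has no earlier mistakes to copy.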
Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, a relative gain of about 1.6x for FP16 and nearly 2x for AWQ. The gains are largest on difficulty-1 tasks, which improve from 15.8% to 26.3% for FP16 and from 5.3% to 14.0% for AWQ.
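The size of these gains can be checked with a few lines of arithmetic. The sketch below uses only the percentages quoted in the abstract; the dictionary keys and the `relative_gain` helper are illustrative names, not part of the paper.

```python
# Task goal completion (%) before and after scaffolding, as quoted in the abstract.
results = {
    "FP16 overall": (5.4, 8.9),
    "AWQ overall": (3.0, 5.9),
    "FP16 difficulty-1": (15.8, 26.3),
    "AWQ difficulty-1": (5.3, 14.0),
}


def relative_gain(before: float, after: float) -> float:
    """Return the multiplicative improvement (after divided by before)."""
    return after / before


def absolute_gain(before: float, after: float) -> float:
    """Return the improvement in percentage points."""
    return after - before


for name, (before, after) in results.items():
    print(
        f"{name}: +{absolute_gain(before, after):.1f} points, "
        f"{relative_gain(before, after):.2f}x"
    )
```

Reporting both the multiplicative and the percentage-point gain is useful here because the baselines are low: a small absolute change can look large in relative terms.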
Under full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) as reported in the original AppWorld evaluation. This demonstrates that structured inference-time interventions can make small models competitive with systems roughly four times their size.
We formalize the approach as a scaffolded policy over a frozen base model, with three invocations of the same weights under different conditioning. This draws connections to test-time compute scaling and action-space shaping in reinforcement learning, and shows that performance can improve without additional training resources.
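One way to write such a scaffolded policy down is sketched here. The notation is an assumption made for illustration and may differ from the paper's: $\pi_\theta$ is the frozen base model, $h_t$ the raw interaction history, $c_t$ the compressed context, $\tilde{a}_t$ the agent's draft action, $a_t$ the executed action, and $p_{\mathrm{sum}}$, $p_{\mathrm{agent}}$, $p_{\mathrm{corr}}$ the three role prompts.

```latex
\begin{align}
  c_t         &\sim \pi_\theta(\,\cdot \mid p_{\mathrm{sum}},\, h_t)
              && \text{(summarization role)} \\
  \tilde{a}_t &\sim \pi_\theta(\,\cdot \mid p_{\mathrm{agent}},\, c_t)
              && \text{(main agent role)} \\
  a_t         &\sim \pi_\theta(\,\cdot \mid p_{\mathrm{corr}},\, \tilde{a}_t)
              && \text{(correction role, no access to } h_t\text{)} \\
  \pi^{\mathrm{scaf}}(a_t \mid h_t)
              &= \sum_{c_t,\, \tilde{a}_t}
                 \pi_\theta(c_t \mid p_{\mathrm{sum}}, h_t)\,
                 \pi_\theta(\tilde{a}_t \mid p_{\mathrm{agent}}, c_t)\,
                 \pi_\theta(a_t \mid p_{\mathrm{corr}}, \tilde{a}_t)
\end{align}
```

The parameters $\theta$ are identical in every factor, so only the conditioning changes between roles. The last factor depends on $h_t$ only through the draft $\tilde{a}_t$, which expresses the corrector's isolation from the conversation history.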
In conclusion, our findings show that role orchestration at inference time can substantially narrow the performance gap between smaller and larger models. As demand for efficient AI solutions grows, these insights point toward resource-efficient agents that can perform complex tasks on modest hardware.
