AgentFloor Benchmark: Small Open-Weight Models' Tool Use Limits

AgentFloor: How Far Up the Tool Use Ladder Can Small Open-Weight Models Go?

A recent arXiv paper (2605.00334v1) introduces AgentFloor, a benchmark for evaluating how well AI models of different sizes handle agentic work. The paper starts from an observation about production agentic systems: they often make many model calls per user request, and much of that work is short, structured, and routine.

The Core Question

A practical question follows: which parts of an agent’s workflow actually need large frontier models, and which can be handled by smaller open-weight models? AgentFloor addresses this with a structured evaluation of 16 open-weight models, ranging from 0.27 billion to 32 billion parameters, alongside GPT-5.

Benchmark Overview

AgentFloor is a deterministic benchmark of 30 tasks organized into a six-tier capability ladder. The tiers are:

Instruction Following
Tool Use
Multi-Step Coordination
Long-Horizon Planning
Persistent Constraints
Routine Action Management

The evaluation comprised 16,542 scored runs, giving a detailed picture of each model’s performance across the tasks.
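One way to picture a capability ladder is as a roll-up of scored runs into per-tier pass rates for each model. The sketch below is illustrative only: the record format `(model, tier, passed)`, the function name `tier_pass_rates`, and the sample records are assumptions made for this example and are not taken from the AgentFloor harness or its results.

```python
from collections import defaultdict

# Tier names as listed in this article; treating the list order as the
# ladder order is an assumption.
TIERS = [
    "Instruction Following",
    "Tool Use",
    "Multi-Step Coordination",
    "Long-Horizon Planning",
    "Persistent Constraints",
    "Routine Action Management",
]

def tier_pass_rates(runs):
    """Return {model: {tier: pass_rate}} from (model, tier, passed) tuples."""
    # counts[model][tier] holds [number passed, number of runs]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for model, tier, passed in runs:
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        cell = counts[model][tier]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        model: {tier: p / n for tier, (p, n) in tiers.items()}
        for model, tiers in counts.items()
    }

# Made-up records purely for illustration; these are not results from the paper.
example_runs = [
    ("small-model", "Tool Use", True),
    ("small-model", "Tool Use", True),
    ("small-model", "Long-Horizon Planning", False),
    ("small-model", "Long-Horizon Planning", True),
]
rates = tier_pass_rates(example_runs)
```

Because the benchmark is described as deterministic, a roll-up like this would be reproducible from the same run records, which is what makes per-tier comparisons between models meaningful.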

Key Findings

The results show where model size matters and where it does not. Small and mid-sized open-weight models proved adequate for most of the short-horizon, structured tool use tasks that characterize real-world agent pipelines. The strongest open-weight model performed comparably to GPT-5 on the AgentFloor benchmark while being notably cheaper and faster to run.

Long-Horizon Planning Challenges

Larger models do hold a distinct advantage in long-horizon planning. These tasks require sustained coordination and reliable tracking of constraints over extended sequences of actions, and frontier models still lead here. Even so, neither category of models achieved strong reliability in these more complex scenarios.

Insights on Model Performance

The study also finds that the boundary between model capabilities cannot be attributed to scale alone. Some failures responded to targeted interventions, which suggests that such fixes may be model-specific rather than universally applicable. This points to a possible design principle for agentic systems:

Use smaller open-weight models for the broad range of routine actions.
Reserve larger, frontier models for specialized tasks requiring deeper planning and control.
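The principle above can be sketched as a simple per-step router. This is a minimal illustration under stated assumptions: the model identifiers, the `Step` fields, the `choose_model` function, and the threshold of five tool calls are all hypothetical choices made for this example and are not values or methods reported by the AgentFloor paper.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your deployment uses.
SMALL_MODEL = "small-open-weight"
FRONTIER_MODEL = "frontier"

@dataclass
class Step:
    description: str
    expected_tool_calls: int  # rough estimate of the step's horizon
    has_persistent_constraints: bool = False

def choose_model(step: Step, horizon_threshold: int = 5) -> str:
    """Route a step to the small model unless it looks like long-horizon planning.

    The threshold and the two signals used here are illustrative assumptions.
    """
    if step.expected_tool_calls > horizon_threshold or step.has_persistent_constraints:
        return FRONTIER_MODEL
    return SMALL_MODEL

routine = Step("look up an order status", expected_tool_calls=1)
planning = Step(
    "plan a multi-day migration",
    expected_tool_calls=12,
    has_persistent_constraints=True,
)

print(choose_model(routine))
print(choose_model(planning))
```

In practice the routing signal would need to be tuned per model, since the article notes that some failures respond to model-specific interventions rather than to scale alone.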

Conclusion and Future Directions

The AgentFloor findings bear directly on how agentic systems are built and deployed. By using small models for routine work and reserving large models for tasks where planning depth demands them, developers can improve efficiency while managing costs. The benchmark, along with the harness, sweep configurations, and the complete run corpus, has been made available for further research.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
