CostBench: A New Benchmark for Evaluating Multi-Turn Cost-Optimal Planning
Large Language Models (LLMs) are being deployed across a growing range of domains, yet most evaluations focus on task completion rather than resource efficiency and adaptability. This leaves a significant gap in understanding how well these models can formulate cost-optimal plans and adjust them when the environment changes.
Introducing CostBench
To address this critical need, researchers have introduced CostBench, a scalable benchmark specifically designed to assess the economic reasoning and replanning capabilities of LLM agents. CostBench situates its evaluation framework in the travel-planning domain, which is inherently complex and subject to various unpredictable factors.
Key Features of CostBench
CostBench comprises a series of tasks that can be solved using multiple sequences of both atomic and composite tools, each with unique and customizable costs. The benchmark also incorporates four distinct types of dynamic blocking events, including:
- Tool failures
- Cost changes
- Resource availability fluctuations
- Time constraints
These features are designed to recreate the unpredictability of real-world scenarios, challenging agents to adapt their strategies in real time.
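To make the setup concrete, here is a minimal sketch of cost-optimal planning over atomic and composite tools, with replanning after a blocking event. It is an illustration only, not CostBench's actual implementation: the `Tool` class, the tool names, the states, and the costs are all hypothetical, and the search is a plain Dijkstra shortest-path over task states.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool: moves the task from state `src` to state `dst` at a cost."""
    name: str
    src: str
    dst: str
    cost: float


def cheapest_plan(tools, start, goal):
    """Dijkstra over task states. Returns (total_cost, [tool names]) or None."""
    frontier = [(0.0, 0, start, [])]  # (cost so far, tie-breaker, state, path)
    settled = set()
    counter = 1
    while frontier:
        cost, _, state, path = heapq.heappop(frontier)
        if state == goal:
            return cost, path
        if state in settled:
            continue
        settled.add(state)
        for tool in tools:
            if tool.src == state:
                entry = (cost + tool.cost, counter, tool.dst, path + [tool.name])
                heapq.heappush(frontier, entry)
                counter += 1
    return None  # goal unreachable with the available tools


# Illustrative toolset: two atomic tools and one composite tool that covers both steps.
tools = [
    Tool("search_flights", "start", "flights_found", 3.0),
    Tool("book_flight", "flights_found", "booked", 4.0),
    Tool("search_and_book", "start", "booked", 6.0),  # composite
]

static_plan = cheapest_plan(tools, "start", "booked")

# Blocking event 1 (tool failure): the composite tool becomes unavailable.
after_failure = cheapest_plan(
    [t for t in tools if t.name != "search_and_book"], "start", "booked"
)

# Blocking event 2 (cost change): the composite tool's price rises from 6 to 9.
repriced = [
    Tool(t.name, t.src, t.dst, 9.0) if t.name == "search_and_book" else t
    for t in tools
]
after_cost_change = cheapest_plan(repriced, "start", "booked")
```

In this toy setting the composite tool is cheapest at first (6 versus 7 for the two atomic calls), and either blocking event makes the atomic sequence the optimal plan. An agent that keeps following its original plan would end up blocked or overpaying, which is the kind of failure the benchmark is meant to expose.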
Performance Evaluation of Leading Models
Researchers evaluated several leading open-source and proprietary models on CostBench. The results revealed a significant gap in cost-aware planning. Notably, even an advanced model like GPT-5 achieved less than a 75% exact match rate on the most challenging tasks in static settings, and performance dropped by a further 40% or so once conditions became dynamic.
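For readers unfamiliar with the metric, the sketch below shows one plausible way to compute an exact match rate and the drop between static and dynamic settings. It rests on assumptions: the article does not define the metric, so "exact match" is taken here to mean that the agent's tool sequence is identical to the cost-optimal one, and the roughly 40% drop is treated as a relative decrease, which the article does not specify. The function names and the toy data are hypothetical and are not results from the benchmark.

```python
def exact_match_rate(predicted, optimal):
    """Fraction of tasks where the predicted tool sequence equals the optimal one."""
    if not optimal or len(predicted) != len(optimal):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(p == o for p, o in zip(predicted, optimal))
    return matches / len(optimal)


def relative_drop(static_rate, dynamic_rate):
    """Relative decrease from the static score to the dynamic score."""
    if static_rate <= 0:
        raise ValueError("static_rate must be positive")
    return (static_rate - dynamic_rate) / static_rate


# Toy data (made up for illustration): four tasks, each plan a list of tool names.
optimal = [["a", "b"], ["c"], ["a", "d"], ["e"]]
static_predictions = [["a", "b"], ["c"], ["a", "d"], ["f"]]   # 3 of 4 match
dynamic_predictions = [["a", "b"], ["c"], ["d", "a"], ["f"]]  # 2 of 4 match

static_rate = exact_match_rate(static_predictions, optimal)
dynamic_rate = exact_match_rate(dynamic_predictions, optimal)
drop = relative_drop(static_rate, dynamic_rate)
```

Under this definition, a plan that reaches the goal with the right tools in the wrong order, or at a higher cost, still counts as a miss, which is why exact match is a strict measure of cost-optimal planning.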
Implications and Future Directions
These findings underscore the need to improve the economic reasoning of LLMs. By diagnosing these weaknesses, CostBench lays a foundation for developing future agents that are both robust and economically rational, which is essential for real-world applications where resource management and adaptability are paramount.
As the landscape of AI continues to evolve, benchmarks like CostBench will play a vital role in shaping the next generation of LLM agents, ensuring that they can navigate complex, dynamic environments effectively while maintaining cost efficiency.
