Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
Summary of arXiv:2603.26796v1 (cross-listed)
Abstract
Routing queries efficiently across large language models (LLMs) is difficult under constraints on cost, GPU resources, and concurrency. Per-query routing methods in particular struggle to control batch-level cost, especially under non-uniform or adversarial batching.
Introduction
This article explores a batch-level, resource-aware routing framework that optimizes model assignment for each batch while respecting cost and model capacity limits. The goal is to improve LLM performance in real-world applications where these constraints apply.
Proposed Framework
- Batch-Level Routing: Unlike traditional methods that focus on individual queries, our framework evaluates the collective requirements of a batch, leading to more efficient resource utilization.
- Resource Awareness: The framework takes into account the specific GPU resources available and allocates them accordingly to maximize throughput without exceeding cost limits.
- Robustness Against Uncertainty: We introduce a robust variant of the framework that factors in uncertainties in the predicted performance of LLMs, allowing for more adaptable routing decisions.
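The three ideas above can be illustrated on a toy batch. The sketch below is not the paper's algorithm: it assumes a hypothetical estimator that supplies a predicted score and an error bound for every query-model pair, and it finds the best assignment by exhaustive search, which is only viable for a handful of queries. A realistic system would more likely hand the same objective and constraints to an integer programming solver. All names (`route_batch`, `scores`, `bounds`, and so on) are illustrative.

```python
from itertools import product


def route_batch(scores, costs, budget, capacity, uncertainty=None):
    """Pick one model per query for a whole batch by exhaustive search.

    scores[i][j]      predicted quality of model j on query i (assumed estimator output)
    costs[i][j]       cost of answering query i with model j
    budget            maximum total cost allowed for the batch
    capacity[j]       maximum number of queries model j may take in this batch
    uncertainty[i][j] optional error bound on scores[i][j]; when given, the
                      worst-case score (score minus bound) is maximized
    """
    n_queries, n_models = len(scores), len(capacity)
    if uncertainty is None:
        uncertainty = [[0.0] * n_models for _ in range(n_queries)]

    best_value, best_assignment = float("-inf"), None
    # Enumerate every query-to-model assignment (toy batches only).
    for assignment in product(range(n_models), repeat=n_queries):
        total_cost = sum(costs[i][j] for i, j in enumerate(assignment))
        if total_cost > budget:
            continue  # violates the batch-level cost budget
        if any(assignment.count(j) > capacity[j] for j in range(n_models)):
            continue  # violates a per-model capacity limit
        # Worst-case quality: subtract the error bound from each prediction.
        value = sum(scores[i][j] - uncertainty[i][j] for i, j in enumerate(assignment))
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_assignment, best_value


# Toy batch: 3 queries; model 0 is small and cheap, model 1 is large and expensive.
scores = [[0.90, 0.95], [0.40, 0.90], [0.50, 0.80]]
costs = [[1, 4], [1, 4], [1, 4]]
bounds = [[0.00, 0.00], [0.05, 0.50], [0.05, 0.05]]  # large model is unreliable on query 1

nominal, _ = route_batch(scores, costs, budget=6, capacity=[3, 2])
robust, _ = route_batch(scores, costs, budget=6, capacity=[3, 2], uncertainty=bounds)
print(nominal)  # (0, 1, 0)
print(robust)   # (0, 0, 1)
```

The budget of 6 allows only one query to go to the large model. The nominal router spends it on query 1, where the predicted gain is largest. The robust variant moves it to query 2 because the prediction for query 1 carries a wide error bound.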
Offline Instance Allocation Procedure
To further improve efficiency, we developed an offline instance allocation procedure that balances response quality against throughput across multiple models. Optimizing how instances are allocated lets each model operate at its capacity while staying within the established cost constraints.
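A minimal sketch of what such an allocation step might look like follows. It is an assumption-laden illustration, not the paper's procedure: each model is described by hypothetical fields for GPUs per instance, queries served per instance per batch, and mean predicted quality, and the search simply enumerates instance counts that fit a GPU budget. Among allocations with equal quality it prefers the one that uses fewer GPUs.

```python
from itertools import product


def allocate_instances(models, gpu_budget, batch_size):
    """Choose how many instances of each model to deploy (toy exhaustive search).

    Each entry of `models` is a dict with illustrative fields:
      'gpus'        GPUs needed per instance
      'throughput'  queries one instance can serve per batch
      'quality'     mean predicted accuracy of the model
    Returns (instance counts, mean quality), or (None, 0.0) if no allocation
    can serve the full batch within the GPU budget.
    """
    best_key, best_counts = None, None
    max_counts = [gpu_budget // m["gpus"] for m in models]
    # Serve queries with the highest-quality models first.
    by_quality = sorted(range(len(models)), key=lambda j: -models[j]["quality"])

    for counts in product(*(range(c + 1) for c in max_counts)):
        gpus_used = sum(n * m["gpus"] for n, m in zip(counts, models))
        capacity = [n * m["throughput"] for n, m in zip(counts, models)]
        if gpus_used > gpu_budget or sum(capacity) < batch_size:
            continue  # over the GPU budget, or unable to serve the whole batch
        remaining, quality = batch_size, 0.0
        for j in by_quality:
            served = min(remaining, capacity[j])
            quality += served * models[j]["quality"]
            remaining -= served
        key = (quality, -gpus_used)  # higher quality first, then fewer GPUs
        if best_key is None or key > best_key:
            best_key, best_counts = key, counts

    if best_counts is None:
        return None, 0.0
    return best_counts, best_key[0] / batch_size


models = [
    {"name": "small", "gpus": 1, "throughput": 40, "quality": 0.6},
    {"name": "large", "gpus": 4, "throughput": 20, "quality": 0.8},
]
counts, mean_quality = allocate_instances(models, gpu_budget=8, batch_size=100)
print(counts)  # (2, 1)
```

In this toy setting two large instances would use all 8 GPUs but serve only 40 of the 100 queries, so the search settles on one large instance plus two small ones. This is the quality-versus-throughput tension the procedure is meant to resolve.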
Experimental Results
To validate our approach, we ran experiments on two multi-task LLM benchmarks:
- The robust variant improved accuracy by 1% to 14% over its non-robust counterparts, depending on the performance estimator used.
- Batch-level routing outperformed per-query methods by up to 24% under adversarial batching conditions.
- Optimized instance allocation yielded additional accuracy gains of up to 3% compared to non-optimized allocation strategies, all while strictly controlling costs and GPU resource use.
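One way to see why per-query routing can lose control of batch-level cost is a small, purely illustrative example. It assumes a common style of per-query router that trades predicted score against cost with a fixed weight `lam`. The scores, costs, and budget are made up and are not taken from the paper's experiments.

```python
def per_query_route(scores, costs, lam):
    """Route each query on its own: maximize score minus lam times cost."""
    return [
        max(range(len(s)), key=lambda j: s[j] - lam * c[j])
        for s, c in zip(scores, costs)
    ]


def batch_cost(assignment, costs):
    """Total cost of a batch under a given query-to-model assignment."""
    return sum(costs[i][j] for i, j in enumerate(assignment))


easy = [0.90, 0.92]  # small model is nearly as good as the large one
hard = [0.30, 0.90]  # only the large model does well
cost = [1, 4]        # cost of the small and large model per query
budget = 10          # batch-level cost budget (illustrative)

mixed = [easy, easy, hard, hard]
adversarial = [hard, hard, hard, hard]

mixed_cost = batch_cost(per_query_route(mixed, [cost] * 4, lam=0.1), [cost] * 4)
adversarial_cost = batch_cost(per_query_route(adversarial, [cost] * 4, lam=0.1), [cost] * 4)
print(mixed_cost)        # 10
print(adversarial_cost)  # 16
```

With the weight tuned so that a typical mixed batch lands exactly on the budget, a batch made up entirely of hard queries sends everything to the large model and overshoots. A batch-level router sees the whole batch before assigning models and can enforce the budget directly.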
Conclusion
Our batch-level query routing framework improves how large language models are used under strict cost and capacity constraints. It does so by optimizing at the batch level and by accounting for uncertainty in predicted performance. Future work will refine the approach and explore its applicability across other modeling scenarios.
References
For further reading, see the full paper on arXiv: 2603.26796v1.
