Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
Summary: arXiv:2604.09613v1
Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures — OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch.
Introduction
Large language models (LLMs) now sit behind many production applications, but serving them efficiently is hard when a fleet's configuration does not match the traffic it receives. That mismatch raises cost and wastes GPU capacity. The work summarized here proposes token-budget-aware pool routing, a dispatch scheme that aims to cut LLM inference cost by sending each request to a pool configured for its size.
Understanding the Problem
The current state of production vLLM fleets involves provisioning every instance based on the worst-case context length. This leads to:
- Wasted concurrency: 80-95% of requests are short, so sizing every instance for the longest context gives up 4-8x of the concurrency those requests could otherwise get.
- KV-cache failures that result in out-of-memory (OOM) crashes, preemption storms, and request rejections.
Both problems share one root cause: a mismatch between how instances are configured and the traffic they actually serve. The new approach targets that mismatch directly.
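The concurrency loss can be illustrated with a deliberately simplified model. The sketch below assumes an operator caps concurrent sequences so that every admitted request could grow to the configured maximum context length. The function name and all numbers are hypothetical and are not taken from the paper.

```python
def safe_concurrency(kv_cache_tokens: int, max_context_tokens: int) -> int:
    """Sequences that fit if each one may grow to max_context_tokens."""
    return kv_cache_tokens // max_context_tokens


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
KV_CACHE_TOKENS = 524_288   # KV-cache capacity of one instance, in tokens
LONG_CONTEXT = 32_768       # worst-case context length
SHORT_CONTEXT = 4_096       # context length that covers the short requests

long_slots = safe_concurrency(KV_CACHE_TOKENS, LONG_CONTEXT)
short_slots = safe_concurrency(KV_CACHE_TOKENS, SHORT_CONTEXT)
print(long_slots, short_slots, short_slots / long_slots)
```

Under these assumed numbers, an instance sized for the worst case admits 16 sequences, while the same hardware sized for short requests admits 128. That factor of 8 is at the upper end of the 4-8x range quoted in the abstract.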
Proposed Solution
The method estimates each request's total token budget from a self-calibrating, per-category bytes-per-token ratio, then dispatches the request to one of two vLLM pools:
- High-throughput short pool: Designed for handling requests with shorter token budgets.
- High-capacity long pool: Serves requests that need a larger token capacity.
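The estimate-then-dispatch flow might look roughly like the sketch below. It is not the paper's implementation. The class name, the starting ratio of 4.0 bytes per token, the smoothing weight, and the choice to define the budget as estimated prompt tokens plus the requested maximum output tokens are all assumptions made for illustration. Calibration is shown as an exponential moving average over the prompt token counts that the server reports back.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Hypothetical two-pool router driven by a bytes-per-token estimate."""
    short_pool_limit: int            # largest token budget the short pool accepts
    ema_weight: float = 0.1          # weight of each new observation (assumed)
    default_ratio: float = 4.0       # initial bytes-per-token guess (assumed)
    ratios: dict = field(default_factory=dict)  # category -> learned ratio

    def estimate_budget(self, category: str, prompt: str, max_output_tokens: int) -> int:
        # No tokenizer is used: prompt tokens are inferred from the byte length.
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        return math.ceil(prompt_bytes / ratio) + max_output_tokens

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt: str, actual_prompt_tokens: int) -> None:
        # Fold the observed bytes-per-token ratio into the running average.
        if actual_prompt_tokens <= 0:
            return
        observed = len(prompt.encode("utf-8")) / actual_prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        w = self.ema_weight
        self.ratios[category] = (1 - w) * previous + w * observed


router = TokenBudgetRouter(short_pool_limit=4096)
small = router.route("chat", "Hello, world!", max_output_tokens=256)
large = router.route("code", "x" * 40_000, max_output_tokens=1024)
router.observe("chat", "Hello, world!", actual_prompt_tokens=4)
```

Here the 13-byte chat prompt goes to the short pool and the 40,000-byte prompt goes to the long pool. After one feedback observation, the "chat" ratio moves slightly from the starting guess toward the observed value.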
Methodology
The bytes-per-token ratio is learned online as an exponential moving average of usage feedback, so no tokenizer is needed. A closed-form cost model predicts fleet-level GPU savings:
savings = alpha * (1 - 1/rho)
In this formula, alpha represents the short-traffic fraction and rho signifies the throughput gain ratio.
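One way to read the formula: if a fraction alpha of the traffic can be served rho times more efficiently, that share needs only 1/rho of its original GPUs, so the fleet shrinks from 1 to (1 - alpha) + alpha/rho. The short sketch below evaluates the expression. The function name and the input values are illustrative assumptions, not figures reported in the paper.

```python
def fleet_savings(alpha: float, rho: float) -> float:
    """Fraction of GPU instances saved: alpha * (1 - 1/rho)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a fraction between 0 and 1")
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return alpha * (1.0 - 1.0 / rho)


# Illustrative inputs only: half the traffic is short, and the short pool
# delivers twice the throughput per instance.
example = fleet_savings(alpha=0.5, rho=2.0)
print(f"{example:.0%}")
```

Two limiting cases serve as a sanity check: with no short traffic (alpha = 0) or no throughput gain (rho = 1), the predicted savings are zero.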
Results and Savings
On traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, token-budget routing reduced the number of GPU instances by 17-39%. That corresponds to roughly $1.2-2.0 million per year at 1,000 requests per second. A self-contained discrete-event simulator verified these savings. A separate case study, projecting the Qwen3-235B-A22B model on AMD MI300X at 10,000 requests per second, estimated savings of $15.4 million per year.
Conclusion
The token-budget-aware pool routing algorithm introduces minimal dispatch overhead while self-calibrating across various content types without requiring a tokenizer. Its compatibility with existing technologies, including PagedAttention, continuous batching, and prefill-decode disaggregation, makes it a promising solution for enhancing the efficiency of LLM inference in production environments.
