Cost-Efficient LLM Inference with Token-Budget Routing

Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

Source: arXiv:2604.09613v1 (announce type: cross)

Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures — OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch.

Introduction

Large language models (LLMs) now underpin a wide range of applications, but serving them efficiently is often hampered by a mismatch between how instances are configured and the traffic they actually receive. The result is higher cost and wasted resources. The paper proposes token-budget-aware pool routing, an approach aimed at reducing LLM inference cost.

Understanding the Problem

Production vLLM fleets currently provision every instance for the worst-case context length. This leads to:

Wasted concurrency: instances sized for the longest context give up 4-8x concurrency on the 80-95% of requests that are short.
KV-cache failures that result in out-of-memory (OOM) crashes, preemption storms, and request rejections.

Both issues share a single root cause: a mismatch between configuration and traffic, which the new approach seeks to address.

Proposed Solution

Token-budget-aware pool routing estimates each request's total token budget using a self-calibrating per-category bytes-per-token ratio. The system then dispatches the request to one of two vLLM pools:

High-throughput short pool: Designed for handling requests with shorter token budgets.
High-capacity long pool: Serves requests that need a larger token capacity.
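
The routing idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name, the short-pool limit, the starting ratio, and the smoothing weight are all hypothetical, and the sketch assumes that the total token budget is the estimated prompt tokens plus the requested maximum output tokens. The ratio update uses an exponential moving average over the prompt token counts reported back by the server, so no tokenizer is needed.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Illustrative router; every name and default here is an assumption."""

    short_pool_limit: int = 8192   # hypothetical max token budget the short pool accepts
    default_ratio: float = 4.0     # hypothetical starting bytes-per-token guess
    ema_weight: float = 0.1        # hypothetical smoothing factor for the moving average
    ratios: dict = field(default_factory=dict)  # category -> learned bytes-per-token

    def estimate_budget(self, category: str, prompt: str, max_output_tokens: int) -> int:
        """Estimate prompt tokens from byte length, then add the output allowance."""
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        return int(prompt_bytes / ratio) + max_output_tokens

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        """Return the pool name for a request: 'short' or 'long'."""
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt: str, actual_prompt_tokens: int) -> None:
        """Fold the server-reported prompt token count into the category's ratio."""
        if actual_prompt_tokens <= 0:
            return  # nothing to learn from an empty or malformed usage record
        observed = len(prompt.encode("utf-8")) / actual_prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.ema_weight) * previous + self.ema_weight * observed
```

In this sketch a caller would invoke `route(...)` before dispatch and `observe(...)` once the response's usage data is available, so each category's ratio drifts toward what the model's tokenizer actually produces for that kind of content.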

Methodology

The bytes-per-token ratio is learned online with an exponential moving average over usage feedback, so no tokenizer is required. A closed-form cost model predicts fleet-level GPU savings:

savings = alpha * (1 - 1/rho)

Here alpha is the short-traffic fraction and rho is the throughput gain ratio.
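
One way to read the formula, offered here as an interpretation rather than the paper's derivation: if a fraction alpha of traffic moves to a pool that handles rho times the throughput per instance, that share needs only 1/rho of its former instances, so the fleet shrinks by alpha * (1 - 1/rho). The Python sketch below evaluates the expression; the function name and the sample inputs are hypothetical and are not values reported in the paper.

```python
def fleet_savings(alpha: float, rho: float) -> float:
    """Fraction of GPU instances saved, per savings = alpha * (1 - 1/rho).

    alpha: short-traffic fraction, between 0 and 1.
    rho:   throughput gain ratio of the short pool, must be positive.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return alpha * (1.0 - 1.0 / rho)


# Hypothetical inputs, chosen only to illustrate the arithmetic:
# half the traffic is short and the short pool doubles throughput.
example = fleet_savings(alpha=0.5, rho=2.0)  # 0.5 * (1 - 0.5) = 0.25
```

Two limiting cases follow directly from the expression: with rho = 1 (no throughput gain) the savings are zero regardless of alpha, and as rho grows the savings approach alpha, the share of traffic that is short.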

Results and Savings

On traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, token-budget routing reduced GPU instances by 17-39%. This translates to savings of approximately $1.2-2.0 million per year at 1,000 requests per second. A self-contained discrete-event simulator verified these savings. In addition, a case study projecting the Qwen3-235B-A22B model on AMD MI300X at 10,000 requests per second estimated savings of $15.4 million annually.

Conclusion

The token-budget-aware pool routing algorithm introduces minimal dispatch overhead while self-calibrating across various content types without requiring a tokenizer. Its compatibility with existing technologies, including PagedAttention, continuous batching, and prefill-decode disaggregation, makes it a promising solution for enhancing the efficiency of LLM inference in production environments.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
