Cost-Efficient LLM Inference with Token-Budget Routing

Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

Source: arXiv:2604.09613v1 (cross-listed)

Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures — OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch.

Introduction

Large language models (LLMs) now sit behind many production applications, but serving them efficiently is hard. When a fleet's configuration does not match the traffic it actually receives, costs rise and GPU capacity goes unused. The paper summarized here proposes token-budget-aware pool routing, which the authors report reduces the number of GPU instances by 17-39% on the traces they tested.

Understanding the Problem

Production vLLM fleets provision every instance for the worst-case context length. This causes two problems:

  • Wasted concurrency: 4-8x of it is lost on the 80-95% of requests that are short.
  • KV-cache failures: out-of-memory (OOM) crashes, preemption storms, and request rejections.

Both problems share a single root cause, a mismatch between instance configuration and actual traffic, which the new approach sets out to address.
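To make the concurrency loss concrete, here is a deliberately simplified sketch. It assumes each request slot is sized for the provisioned context length and that the KV cache holds a fixed number of tokens; it is not a description of how vLLM actually schedules memory. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def concurrent_slots(kv_cache_tokens: int, provisioned_context: int) -> int:
    """Slots that fit if every slot is sized for `provisioned_context` tokens.

    Simplified model (assumption): one slot reserves its full context length.
    """
    if provisioned_context <= 0:
        raise ValueError("provisioned_context must be positive")
    return kv_cache_tokens // provisioned_context


# Hypothetical numbers, not taken from the paper.
KV_CACHE_TOKENS = 262_144   # total tokens the KV cache can hold
WORST_CASE_CONTEXT = 32_768  # context length the instance is provisioned for
SHORT_CONTEXT = 4_096        # context length that would cover short requests

worst_case = concurrent_slots(KV_CACHE_TOKENS, WORST_CASE_CONTEXT)
short_only = concurrent_slots(KV_CACHE_TOKENS, SHORT_CONTEXT)
print(worst_case, short_only, short_only // worst_case)  # 8 64 8
```

Under these assumed numbers, sizing for the worst case leaves an eighth of the concurrency that a short-only configuration would offer, which is the kind of gap the article describes.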

Proposed Solution

Token-budget-aware pool routing estimates each request's total token budget using a self-calibrating, per-category bytes-per-token ratio. The system then dispatches the request to one of two vLLM pools:

  • High-throughput short pool: handles requests with small token budgets.
  • High-capacity long pool: handles requests that need a larger token capacity.
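The estimate-then-dispatch flow can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the class name, the threshold, the smoothing factor, the starting ratio, and the choice to define the budget as estimated prompt tokens plus the requested output limit are all assumptions. The ratio update uses an exponential moving average, which is the calibration method the article describes under Methodology.

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenBudgetRouter:
    """Hypothetical two-pool router driven by a bytes-per-token estimate."""

    short_pool_limit: int = 4096  # assumed max token budget for the short pool
    ema_weight: float = 0.1       # assumed smoothing factor for the moving average
    default_ratio: float = 4.0    # assumed starting bytes-per-token guess
    ratios: Dict[str, float] = field(default_factory=dict)

    def estimate_budget(self, category: str, prompt: str, max_output_tokens: int) -> int:
        # Estimate prompt tokens from byte length alone; no tokenizer involved.
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_tokens = math.ceil(len(prompt.encode("utf-8")) / ratio)
        # Assumption: total budget = estimated prompt tokens + output limit.
        return prompt_tokens + max_output_tokens

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt: str, actual_prompt_tokens: int) -> None:
        # Feed back the token count reported after serving to recalibrate.
        if actual_prompt_tokens <= 0:
            return
        observed = len(prompt.encode("utf-8")) / actual_prompt_tokens
        old = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.ema_weight) * old + self.ema_weight * observed


router = TokenBudgetRouter()
print(router.route("chat", "hello", 100))        # short
print(router.route("chat", "x" * 40_000, 100))   # long
router.observe("code", "x" * 300, 100)           # observed ratio: 3.0 bytes/token
```

Keeping one ratio per category lets content types with different byte densities (for example prose versus code) calibrate separately, and routing itself is a dictionary lookup plus a comparison.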

Methodology

The bytes-per-token ratio is learned online with an exponential moving average over usage feedback, so no tokenizer is needed. A closed-form cost model predicts fleet-level GPU savings:

savings = alpha * (1 - 1/rho)

Here, alpha is the short-traffic fraction and rho is the throughput gain ratio.
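One way to read the formula, offered as an interpretation rather than the paper's derivation: if a baseline fleet needs N instances and a fraction alpha of the load moves to a pool that is rho times more efficient, the fleet needs roughly (1 - alpha) * N + alpha * N / rho instances, so the fraction saved is alpha * (1 - 1/rho). The sketch below evaluates this; the function names and the input values are hypothetical and are not results from the paper.

```python
import math


def fleet_savings(alpha: float, rho: float) -> float:
    """Fraction of GPU instances saved: alpha * (1 - 1/rho)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return alpha * (1.0 - 1.0 / rho)


def instances_after_routing(baseline_instances: int, alpha: float, rho: float) -> int:
    """Instances still needed, rounded up to a whole instance."""
    return math.ceil(baseline_instances * (1.0 - fleet_savings(alpha, rho)))


# Hypothetical inputs for illustration only.
print(fleet_savings(0.5, 2.0))                 # 0.25
print(instances_after_routing(100, 0.5, 2.0))  # 75
print(fleet_savings(0.9, 1.0))                 # 0.0 (no throughput gain, no savings)
```

The formula makes two limits easy to see: with rho equal to 1 there is nothing to save regardless of the traffic mix, and as rho grows the savings approach alpha but never exceed it.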

Results and Savings

On traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, with Llama-3-70B served on A100 GPUs, token-budget routing reduced the number of GPU instances by 17-39%. That corresponds to approximately $1.2-2.0 million per year at 1,000 requests per second. A self-contained discrete-event simulator verified these savings. A separate case study projecting the Qwen3-235B-A22B model on AMD MI300X at 10,000 requests per second estimated savings of $15.4 million per year.

Conclusion

The token-budget-aware pool routing algorithm adds minimal dispatch overhead and self-calibrates across content types without a tokenizer. It is also compatible with PagedAttention, continuous batching, and prefill-decode disaggregation, so it can be adopted alongside techniques that production fleets already use.


Lazarus Omolua
https://richlyai.com/blog