Irminsul: Efficient Position-Independent Caching for Agentic LLMs

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

A recent arXiv paper, “Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving,” proposes a caching design for large language model (LLM) serving that targets a specific weakness of prefix caches under agentic workloads.

Agentic LLM workloads are a poor fit for traditional prefix caching. From one turn to the next, they place bit-identical tokens at shifted positions, and a prefix cache stops matching at the first byte that diverges, so everything after that point is recomputed even when it has not changed. Operators have reported cache-hit regressions and the slowdowns that follow, with time-to-first-token (TTFT) spikes of 10-16 seconds on unchanged content.
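The failure mode is easy to reproduce on a toy example. The sketch below is only an illustration: the token IDs and the `shared_prefix_len` helper are hypothetical, and a real serving stack matches prefixes inside a radix tree rather than with a linear scan.

```python
def shared_prefix_len(cached, incoming):
    """Count the leading tokens that two sequences have in common."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token IDs: a short header followed by a large, unchanged document.
document = list(range(1000, 1200))   # 200 tokens that never change
turn_1 = [1, 2, 3] + document
turn_2 = [1, 9, 2, 3] + document     # one token inserted into the header

reused = shared_prefix_len(turn_1, turn_2)
print(f"prefix cache reuses {reused} of {len(turn_2)} tokens")
# prints: prefix cache reuses 1 of 204 tokens
```

All 200 document tokens are bit-identical in both turns, but because they sit one position later in the second turn, exact-prefix matching cannot reuse any of them.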

Challenges with Existing Caching Systems

Existing position-independent caching systems address this by correcting the Rotary Position Embedding (RoPE) on the full $d_K$-dimensional key. According to the paper, the cost of that correction is not inherent to position-independent caching itself; it stems from the underlying attention architecture, Grouped-Query Attention (GQA).
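To make "correcting RoPE" concrete: in the standard RoPE formulation, each pair of key dimensions is rotated by an angle proportional to the token position, and planar rotations compose additively. As a sketch (the symbols $R$, $\theta_i$, $p$, $\delta$, and $k^{(i)}$ are notation introduced here for illustration, not taken from the paper):

$$
R(\phi) = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}, \qquad R(\delta\,\theta_i)\,R(p\,\theta_i)\,k^{(i)} = R\big((p+\delta)\,\theta_i\big)\,k^{(i)}, \quad i = 1, \dots, d_K/2 .
$$

A key cached at position $p$ can therefore be moved to position $p+\delta$ without recomputing it, but the correction has to touch all $d_K/2$ pairs of every cached key it relocates.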

Introducing Multi-Head Latent Attention

The paper builds on Multi-Head Latent Attention (MLA), an attention design already deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3. MLA factors each key-value (KV) row into a position-free component, $c_{KV}$, and a 64-dimensional correctable key, $k_r$. Because positional information is confined to $k_r$, content-addressed caching becomes a natural fit for the architecture rather than a workaround for GQA limitations.
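A minimal sketch of why this factoring helps, under stated assumptions: `CachedRow`, `rotate`, and `relocate` are hypothetical names, the 8-dimensional latent is a toy size, and the interleaved pairing of dimensions and base of 10000 are common RoPE conventions rather than details confirmed by the paper. The point is only that moving a cached row to a new position rewrites the 64-dimensional $k_r$ and leaves $c_{KV}$ untouched.

```python
import math
from dataclasses import dataclass

ROPE_DIM = 64  # size of the correctable key k_r, as stated in the text


@dataclass(frozen=True)
class CachedRow:
    c_kv: tuple  # position-free latent (toy size in this sketch)
    k_r: tuple   # position-dependent key carrying the rotary embedding


def rotate(vec, delta, base=10000.0):
    """Rotate each (2i, 2i+1) pair of vec by delta positions, RoPE-style."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = delta * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out.append(vec[i] * c - vec[i + 1] * s)
        out.append(vec[i] * s + vec[i + 1] * c)
    return tuple(out)


def relocate(row, delta):
    """Move a cached row by delta positions; only k_r is rewritten."""
    return CachedRow(c_kv=row.c_kv, k_r=rotate(row.k_r, delta))


# Assumed example: a row cached at position 5 is reused at position 12.
raw_key = tuple(math.sin(i + 1) for i in range(ROPE_DIM))
row = CachedRow(c_kv=(0.1,) * 8, k_r=rotate(raw_key, 5))
moved = relocate(row, 7)
```

Since rotations compose, `moved.k_r` agrees with a key rotated directly for position 12 up to floating-point error, while `moved.c_kv` is the same object content as before.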

The Irminsul Framework

The authors present Irminsul, a caching framework that extends SGLang’s radix cache in two ways. First, it keys the cache by content hash over segments produced by content-defined chunking (CDC), so a segment is looked up by what it contains rather than by where it sits in the prompt. Second, it applies a $\delta$-rotation rule to the correctable key $k_r$ when a segment is reused at a different position.
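The content-hash keying idea can be illustrated with a toy chunker. This is a sketch under assumptions, not Irminsul's or SGLang's implementation: the window size, the boundary mask, the use of SHA-256, and every function name below are hypothetical choices made for the example. Boundaries are chosen only from the last few tokens, so they fall in the same places in unchanged content no matter how far that content has shifted.

```python
import hashlib

WINDOW = 4    # tokens inspected per boundary decision (illustrative value)
MASK = 0x0F   # cut when the low 4 hash bits are zero, about 1 in 16 positions


def _digest(tokens):
    """SHA-256 over token IDs serialized as 4-byte little-endian integers."""
    data = b"".join(t.to_bytes(4, "little") for t in tokens)
    return hashlib.sha256(data).digest()


def cdc_chunks(tokens):
    """Split tokens at boundaries chosen only from the local window content."""
    chunks, start = [], 0
    for end in range(WINDOW, len(tokens) + 1):
        if _digest(tokens[end - WINDOW:end])[0] & MASK == 0:
            chunks.append(tuple(tokens[start:end]))
            start = end
    if start < len(tokens):
        chunks.append(tuple(tokens[start:]))  # trailing partial chunk
    return chunks


def chunk_key(chunk):
    """Content-hash key for one chunk, independent of its position."""
    return _digest(chunk).hex()


def reusable_tokens(cached, incoming):
    """Count incoming tokens that fall in chunks already present in the cache."""
    known = {chunk_key(c) for c in cdc_chunks(cached)}
    return sum(len(c) for c in cdc_chunks(incoming) if chunk_key(c) in known)


# Hypothetical token IDs: one token is inserted ahead of an unchanged document.
document = list(range(1000, 3000))
turn_1 = [1, 2, 3] + document
turn_2 = [1, 9, 2, 3] + document
print(reusable_tokens(turn_1, turn_2), "of", len(turn_2), "tokens match cached chunks")
```

Only the chunks around the edit miss; once the chunker reaches its first boundary inside the unchanged document, every later chunk hashes to a key that is already cached. Production CDC schemes typically also enforce minimum and maximum chunk sizes, which this sketch omits to keep boundary decisions purely local.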

Evaluating Irminsul’s Performance

To validate Irminsul, the authors evaluated it on three native MLA Mixture-of-Experts (MoE) deployments: DeepSeek-V2-Lite (16B/2.4B), Kimi Moonlight-16B-A3B, and JoyAI-Flash (48B/3B). The results indicated consistent output across all three deployments, with recovery metrics particularly strong on the two endpoints.

Irminsul recovered up to ~83% of prompt tokens beyond what exact-prefix matching recovers on agentic traffic.
It also saved 63% of prefill energy per cache hit.

Conclusion: A Paradigm Shift in Caching

The Irminsul paper argues that content-addressed caching belongs in the LLM serving stack as a first-class primitive, rather than as a retrofit over prefix matching. In the authors' view, agentic workloads call for this kind of foundational change. As such workloads become more common, approaches like Irminsul could prove important for keeping LLM serving efficient in real-world applications.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
