CASK: Efficient KV Compression for Long-Form Reasoning

CASK: Core-Aware Selective KV Compression for Reasoning Traces

Summary: arXiv:2604.10900v1

Abstract: In large language models performing long-form reasoning, the KV cache grows rapidly with decode length, creating bottlenecks in memory and inference stability. Existing reasoning-oriented KV compression has mostly followed an eviction-centered view: estimate token importance more accurately, then discard lower-ranked entries. Our analysis suggests that scorer refinement alone often fails to substantially reorganize the actual keep-set and may therefore not be the main lever for preserving reasoning behavior. We instead frame reasoning KV compression as a behavior-preserving structured consolidation problem.

Introduction

Large language models increasingly rely on long chains of intermediate reasoning, and that puts pressure on memory during inference. The main cost is the Key-Value (KV) cache: every generated token adds key and value entries at every layer, so the cache grows with decode length and becomes a bottleneck for both memory usage and inference stability.
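To make the growth concrete, here is a back-of-the-envelope sketch of KV cache size for a standard decoder-only transformer. The model shape in `config` is hypothetical and chosen only for illustration; it is not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The factor 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (assumption, for illustration only).
config = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

for tokens in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(tokens, **config) / 2**20
    print(f"{tokens:>7} tokens -> {mib:,.0f} MiB")
```

The cost is linear in sequence length, so a reasoning trace ten times longer needs ten times the cache unless something is compressed.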

The Limitations of Current Approaches

Existing reasoning-oriented KV compression methods have mostly taken an eviction-centered approach: estimate the importance of each token more accurately, then discard the lower-ranked entries. Our analysis suggests that refining the scorer alone often fails to substantially reorganize the actual keep-set, that is, the set of entries that survive eviction. Scorer refinement may therefore not be the main lever for preserving the model's reasoning behavior.
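A minimal sketch of what "failing to reorganize the keep-set" can look like. It assumes eviction means keeping the top-`budget` positions by score, and it measures keep-set overlap with Jaccard similarity. The scores are toy values for two hypothetical scorers, not measurements from the paper, and Jaccard overlap is our choice of metric here rather than the paper's.

```python
def keep_set(scores, budget):
    """Eviction-style selection: keep the `budget` highest-scoring positions."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:budget])


def jaccard(a, b):
    """Overlap between two keep-sets; 1.0 means they are identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy importance scores over 8 cached tokens (illustrative assumption).
baseline_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
refined_scores = [0.95, 0.05, 0.6, 0.25, 0.85, 0.2, 0.7, 0.3]

kept_a = keep_set(baseline_scores, budget=4)
kept_b = keep_set(refined_scores, budget=4)
print(sorted(kept_a), sorted(kept_b), jaccard(kept_a, kept_b))
```

In this toy case every individual score changes, yet both scorers retain the same four positions, so the compressed cache is identical.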

Introducing CASK

To tackle these challenges, we introduce CASK (Core-Aware Selective KV Compression). Instead of asking which tokens to evict, CASK treats reasoning KV compression as a behavior-preserving structured consolidation problem. It partitions the reasoning trace into two components:

Protected Core: This segment anchors the answer formation and intermediate state, ensuring that crucial elements of the reasoning process remain intact.
Mergeable Scratch: This segment is highly redundant, so it is the part that gets selectively consolidated.
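The partition can be illustrated with a small sketch. Everything specific here is an assumption: the `is_core` flags are supplied by the caller (the text above does not say how the core is identified), and scratch entries are merged by averaging consecutive keys and values in groups of `group_size`, which stands in for whatever consolidation rule CASK actually uses.

```python
def consolidate(entries, is_core, group_size=2):
    """Keep core entries as-is; average runs of scratch entries in groups.

    entries: list of (key, value) pairs, each a list of floats.
    is_core: list of bools, True for positions in the protected core.
    Averaging is an illustrative merge rule, not the paper's.
    """
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    out, pending = [], []

    def flush():
        # Merge whatever scratch entries are waiting into one entry.
        if pending:
            keys, values = zip(*pending)
            out.append((mean(keys), mean(values)))
            pending.clear()

    for entry, core in zip(entries, is_core):
        if core:
            flush()  # never merge across a core entry
            out.append(entry)
        else:
            pending.append(entry)
            if len(pending) == group_size:
                flush()
    flush()
    return out


entries = [([1.0, 0.0], [1.0]), ([0.0, 1.0], [3.0]),
           ([5.0, 5.0], [9.0]), ([2.0, 2.0], [4.0])]
is_core = [False, False, True, False]
compressed = consolidate(entries, is_core)
print(len(entries), "->", len(compressed))
```

The core entry passes through untouched, while the scratch entries around it shrink. That is the difference from eviction: scratch information is folded together rather than dropped.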

Two-Stage Design

CASK uses a two-stage design, which matters most in prompt-heavy scenarios. The first stage is prefix eviction, which keeps the prompt prefix from exhausting the available budget before decode-stage compression is engaged. The second stage is decode-stage consolidation, which selectively consolidates the scratch component while preserving the core.
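A rough sketch of how the two stages might share one budget. The split rule is an assumption: `prefix_fraction` is a hypothetical knob that caps how much of the total budget the prompt may occupy, and `needs_consolidation` is a hypothetical trigger for the second stage. Neither name comes from the paper.

```python
def plan_budget(prompt_len, total_budget, prefix_fraction=0.5):
    """Stage 1: cap the prefix so decode always retains some budget.

    Returns (prefix_kept, decode_budget). `prefix_fraction` is a
    hypothetical parameter, not one taken from the paper.
    """
    prefix_cap = int(total_budget * prefix_fraction)
    prefix_kept = min(prompt_len, prefix_cap)
    return prefix_kept, total_budget - prefix_kept


def needs_consolidation(decode_entries, decode_budget):
    """Stage 2 trigger: consolidate scratch once decode entries exceed budget."""
    return decode_entries > decode_budget


prefix_kept, decode_budget = plan_budget(prompt_len=2000, total_budget=512)
print(prefix_kept, decode_budget)
```

Without the first stage, a 2000-token prompt would consume a 512-entry budget before the model generated a single reasoning token, leaving nothing for the decode stage to work with.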

Performance Analysis

In empirical evaluations on the H100 reasoning gate, CASK shows higher full-KV continuation fidelity than TriAttention at matched budgets on both AIME24 and AIME25. There are also recurring crossings in which cask@384 outperforms triattention@512: CASK at the smaller budget beats TriAttention at the larger one.
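For readers who want to check for crossings in their own measurements, a small helper like the following would do. It assumes each method's results are stored as a mapping from budget to a fidelity score where higher is better; the function name and data layout are ours, not the paper's.

```python
def find_crossings(fidelity_a, fidelity_b):
    """Return (budget_a, budget_b) pairs where method A at a smaller
    budget scores strictly higher than method B at a larger budget.

    Each argument maps budget -> fidelity score (higher is better).
    """
    return [
        (ba, bb)
        for ba, fa in sorted(fidelity_a.items())
        for bb, fb in sorted(fidelity_b.items())
        if ba < bb and fa > fb
    ]
```

Under this layout, a result such as cask@384 beating triattention@512 would appear in the output as the pair `(384, 512)`.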

Conclusion

Our findings point to one insight: effective reasoning KV compression depends less on more elaborate scorer engineering and more on combining core preservation with selective scratch consolidation. In the evaluations above, this combination lowers the budget needed to reach a given level of fidelity to full-KV behavior.

CASK is a promising direction for KV compression in long-form reasoning tasks.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
