CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
arXiv:2604.08584v1
Abstract
Long-context large language models (LLMs) increasingly rely on long, reusable prefill prompts for applications such as agents and domain-specific question answering. As a result, the attention mechanism and the key-value cache (KV-cache) have become the primary bottlenecks during decoding. Sparse attention techniques aim to reduce computation and data-transfer costs, but they often lose accuracy at high sparsity levels, primarily because of the inherent distribution shift between queries and keys.
To address these challenges, we present Centroid-Scoring Attention (CSAttention), a training-free sparse attention method optimized for high-throughput serving of reusable contexts. CSAttention adopts a storage-for-computation strategy designed for settings with offline prefill and online decoding: it front-loads computation into a one-time offline prefill phase whose cost is amortized across multiple queries, while aggressively optimizing per-step decoding latency.
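The amortization argument can be written as a simple cost model. The symbols here are assumptions introduced for illustration and are not notation from the paper: $C_{\text{prefill}}$ is the one-time offline cost, $C_{\text{decode}}$ is the online decoding cost of one query, and $N$ is the number of queries that reuse the same context.

```latex
\[
  C_{\text{per-query}}(N) \;=\; \frac{C_{\text{prefill}}}{N} \;+\; C_{\text{decode}},
  \qquad
  \lim_{N \to \infty} C_{\text{per-query}}(N) \;=\; C_{\text{decode}}.
\]
```

Under this model, a more expensive prefill is worthwhile whenever it lowers $C_{\text{decode}}$ and the context is reused often enough for the first term to become small.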
Methodology
During the offline prefill phase, CSAttention constructs query-centric lookup tables whose size remains fixed throughout decoding. This design lets the online decoding phase replace full-context scans with efficient table lookups, coupled with GPU-friendly score accumulation.
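To make the offline/online split concrete, here is a minimal NumPy sketch of a generic centroid-based sparse attention step. It is not the paper's algorithm: this summary does not describe how the query-centric tables are built, so the sketch substitutes plain k-means over the cached keys as an assumed stand-in. The function names (`build_centroid_table`, `sparse_decode_step`) and parameters (`n_clusters`, `top_clusters`) are hypothetical.

```python
import numpy as np


def build_centroid_table(keys, n_clusters, n_iters=10, seed=0):
    """Offline step (assumed stand-in): cluster cached keys with plain k-means."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct keys.
    centroids = keys[rng.choice(len(keys), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Squared distance from every key to every centroid.
        dist = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment, consistent with the final centroids.
    dist = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(axis=1)


def sparse_decode_step(q, keys, values, centroids, assign, top_clusters):
    """Online step: score centroids, then attend only to keys in the best clusters."""
    centroid_scores = centroids @ q                      # one score per cluster
    chosen = np.argsort(centroid_scores)[-top_clusters:]
    idx = np.flatnonzero(np.isin(assign, chosen))
    if idx.size == 0:                                    # all chosen clusters empty
        idx = np.arange(len(keys))                       # fall back to full attention
    logits = keys[idx] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())              # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[idx], idx


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    K = rng.standard_normal((256, 16))
    V = rng.standard_normal((256, 16))
    q = rng.standard_normal(16)
    centroids, assign = build_centroid_table(K, n_clusters=8)
    out, idx = sparse_decode_step(q, K, V, centroids, assign, top_clusters=2)
    print(f"attended to {idx.size} of {len(K)} keys")
```

The point of the sketch is only the division of labour: the clustering runs once per context, while each decode step scores a small, fixed number of centroids before touching any individual key. Selecting every cluster reduces the step to ordinary full attention, which gives a convenient correctness check.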
Experimental Results
We conducted extensive experiments comparing CSAttention with existing sparse attention methods. The results show that CSAttention achieves accuracy nearly identical to that of full attention. Notably, under high sparsity (95%) and long contexts (32K to 128K tokens), CSAttention consistently outperforms state-of-the-art sparse attention methods in both model accuracy and inference speed.
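For a sense of scale, the short calculation below shows how many cached keys remain in play per decode step at 95% sparsity. It rests on two assumptions not stated in this summary: that "sparsity" means the fraction of context keys skipped at each step, and that "K" denotes 1024 tokens. The helper name `kept_tokens` is hypothetical.

```python
def kept_tokens(context_len: int, sparsity_pct: int) -> int:
    """Keys still attended per decode step, assuming sparsity = percent of keys skipped."""
    return context_len * (100 - sparsity_pct) // 100


if __name__ == "__main__":
    for k in (32, 64, 128):
        n = k * 1024  # assumes "K" means 1024 tokens
        print(f"{k}K context: {kept_tokens(n, 95)} of {n} keys attended per step")
```

Under these assumptions, a 128K context leaves roughly 6.5K keys per step, which is why the accuracy of the selection rule matters so much at this sparsity level.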
Key Findings
- CSAttention delivers up to a 4.6x inference speedup over the most accurate baseline at a context length of 128K.
- The method’s design allows for effective amortization of computation across multiple queries, enhancing overall efficiency.
- CSAttention maintains high accuracy levels even in high sparsity settings, addressing a common issue among sparse attention techniques.
Conclusion
CSAttention advances LLM inference for reusable long contexts. By mitigating the accuracy loss that sparse attention methods suffer at high sparsity, it improves computational efficiency while preserving accuracy, making it a promising approach for more efficient and scalable LLM applications.
