HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
arXiv:2603.28458v2
Abstract: Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical key for each query through a lightweight indexer, then computing attention only on the selected subset. While the downstream sparse attention itself scales favorably, the indexer must still scan the entire prefix for every query, introducing a per-layer bottleneck that grows prohibitively with context length.
Introduction to HISA
The efficiency of the attention mechanism limits how far models can scale to long contexts. HISA (Hierarchical Indexed Sparse Attention) targets the indexer bottleneck described above. It is a plug-and-play replacement for the existing indexer in token-level sparse attention models such as DSA, and it changes only how candidate tokens are searched, not what the downstream attention consumes.
How HISA Works
HISA replaces the flat token scan with a two-stage hierarchical procedure:
- Coarse Filtering Stage: Pooled block representations are scored to prune irrelevant regions, which reduces the number of tokens that need token-level scoring.
- Token-Level Refinement Stage: The original indexer is applied only within the candidate blocks retained by the first stage, so token-level computation is spent where it is most needed.
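The two stages above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's kernel: the indexer score is taken to be a plain dot product, block representatives are mean-pooled keys, and the names `flat_topk`, `hierarchical_topk`, `block_size`, and `num_blocks` are hypothetical. The source does not specify the pooling operator or the scoring function.

```python
import numpy as np

def flat_topk(q, keys, k):
    """Baseline flat scan: score every key in the prefix, keep the k best."""
    scores = keys @ q                                  # one score per token, shape (L,)
    k = min(k, len(scores))
    idx = np.argpartition(-scores, k - 1)[:k]          # unordered top-k positions
    return np.sort(idx)

def hierarchical_topk(q, keys, k, block_size, num_blocks):
    """Two-stage selection: coarse block filter, then token-level refinement.

    Assumptions (not from the source): dot-product scoring, mean pooling.
    """
    L = keys.shape[0]
    starts = np.arange(0, L, block_size)               # first token of each block

    # Stage 1: score one pooled representative per block, keep the best blocks.
    pooled = np.stack([keys[s:s + block_size].mean(axis=0) for s in starts])
    block_scores = pooled @ q
    m = min(num_blocks, len(starts))
    kept = np.argpartition(-block_scores, m - 1)[:m]

    # Stage 2: run the token-level scorer only inside the retained blocks.
    cand = np.concatenate(
        [np.arange(s, min(s + block_size, L)) for s in starts[kept]]
    )
    cand_scores = keys[cand] @ q
    kk = min(k, len(cand))
    top = np.argpartition(-cand_scores, kk - 1)[:kk]
    return np.sort(cand[top])                          # token indices, same format as flat_topk
```

In this sketch the flat scan scores all L tokens per query, while the hierarchical path scores one representative per block plus `num_blocks * block_size` candidate tokens. When every block is retained, the two functions select the same tokens. With fewer blocks, the result can differ from the flat scan whenever a relevant token sits in a block whose pooled score is low.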
Benefits of HISA
HISA preserves the identical token-level top-k sparse pattern required by downstream Sparse MLA operators. Because the output format of the indexer is unchanged, HISA stays compatible with existing systems and needs no additional training or fine-tuning.
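To illustrate why an unchanged output format makes the indexer swappable, the sketch below shows a downstream operator that consumes nothing but an array of selected token positions. It is a simplified stand-in using ordinary single-head softmax attention, not the Sparse MLA operator itself, and the name `sparse_attention` is hypothetical.

```python
import numpy as np

def sparse_attention(q, K, V, idx):
    """Softmax attention restricted to the token positions in idx.

    The operator only sees the index array, so it does not depend on
    whether a flat or a hierarchical indexer produced it.
    """
    scores = K[idx] @ q / np.sqrt(q.shape[0])   # scaled scores for selected keys only
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ V[idx]                     # weighted sum of selected values
```

Any indexer that returns k token positions in this form can feed the operator directly, which is the property that lets the indexer be replaced without touching the attention computation.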
Performance Metrics
The reported results cover kernel-level benchmarks and model-level quality evaluations:
- Kernel level: HISA achieves a speedup at 64K context length.
- Needle-in-a-Haystack and LongBench: HISA replaced the indexer in DeepSeek-V3.2 and GLM-5 with minimal adjustments and without any fine-tuning.
- Quality: HISA closely matches the original DeepSeek Sparse Attention while significantly outperforming block-sparse baselines.
Conclusion
HISA addresses the indexer bottleneck in token-level sparse attention by replacing the flat full-prefix token scan with a hierarchical search. It keeps the token-level selection format that existing models expect, so it can be adopted without retraining. On the reported benchmarks it closely matches the quality of the original indexer, which makes it a practical option for running sparse attention models at long context lengths.
