MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
Efficient long-context decoding remains a pressing challenge for large language models (LLMs). The paper “MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation” (arXiv:2604.00235v1) proposes a method that speeds up attention computation during decoding while preserving the fidelity of full attention.
Understanding the Problem
Long-context decoding in LLMs is primarily input/output (IO)-bound: every decoded token re-reads a key-value (KV) cache that grows with the context, so the cost per token rises as generation proceeds. Previous efforts to accelerate decoding have typically relied on compressing the cache or on selecting and evicting entries from it. However, these methods often compromise output quality, degrading delayed recall and long-form generation.
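To make the IO-bound claim concrete, the following sketch estimates how many bytes of keys and values full attention reads to decode a single token. The model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) is a hypothetical example chosen for illustration, not a setting taken from the paper.

```python
def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of cached keys and values read to decode one token with full attention."""
    # Factor 2: one key vector and one value vector per cached position.
    return 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical configuration (an assumption, not from the paper).
for n in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(n, num_layers=32, num_kv_heads=8, head_dim=128) / 2**30
    print(f"{n:>7} cached tokens -> {gib:.1f} GiB read per decoded token")
```

The traffic grows linearly with the number of cached tokens, which is why approaches that avoid re-reading most of the cache can pay off at long context lengths.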
The MAC-Attention Approach
The MAC-Attention framework introduces a fidelity- and access-preserving alternative to existing methods. It accelerates the decoding process by reusing prior attention computations for semantically similar recent queries. The MAC-Attention scheme is structured into three main stages:
- Match Stage: This initial stage performs pre-RoPE L2 matching, that is, it compares queries by L2 distance before rotary position embeddings are applied, over a short local window to identify a previous computation that can be reused.
- Amend Stage: This stage rectifies the reused attention by recomputing a small band close to the match boundary.
- Complete Stage: Finally, this stage fuses the rectified results with fresh attention computed on the KV tail. This is achieved through a numerically stable merge process.
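The Match and Complete stages can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the distance threshold, and the toy vectors are all hypothetical, score scaling is omitted, and the Amend stage is left out because the summary above does not specify it. The merge uses the standard log-sum-exp trick for combining softmax attention computed over two disjoint KV segments.

```python
import math


def l2_match(query, recent_queries, threshold):
    """Index of the closest recent query within `threshold`, else None.

    Ties go to the later (more recent) entry. The threshold is an assumed parameter.
    """
    best_idx, best_dist = None, threshold
    for i, past in enumerate(recent_queries):
        dist = math.dist(query, past)  # Euclidean (L2) distance
        if dist <= best_dist:
            best_idx, best_dist = i, dist
    return best_idx


def partial_attention(q, keys, values):
    """Softmax attention over one KV segment: (output, log-sum-exp of scores)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if softmax had covered both segments."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    merged = [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]
    return merged, m + math.log(wa + wb)


# Toy data: the first three KV entries play the "reused" prefix, the last is the tail.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full, _ = partial_attention(q, keys, values)
prefix = partial_attention(q, keys[:3], values[:3])
tail = partial_attention(q, keys[3:], values[3:])
merged, _ = merge(*prefix, *tail)
assert all(math.isclose(x, y) for x, y in zip(full, merged))
```

In this sketch the prefix result is computed with the same query, so the merge reproduces full attention exactly. In MAC-Attention the reused result comes from an earlier, similar query, which is the discrepancy the Amend stage is described as correcting.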
Performance Benefits
One of the most significant advantages of MAC-Attention is that, on a match hit, its compute and bandwidth cost stays constant regardless of context length. The approach is model-agnostic and compatible with IO-aware kernels, paged-KV managers, and multi-query and grouped-query attention (MQA/GQA).
In benchmarks on LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), MAC-Attention is reported to have:
- Reduced KV accesses by up to 99%.
- Cut token generation latency by over 60% at 128K.
- Achieved attention-phase speedups of more than 14.3x and end-to-end speedups of up to 2.6x.
Importantly, these gains are reported without loss of full-attention quality. By reusing computation rather than discarding cached entries, MAC-Attention offers long-context inference that is both fast and faithful to full attention.
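The gap between the attention-phase and end-to-end figures above is what Amdahl's law would predict, since only the attention share of each decoding step is accelerated. The sketch below is a back-of-the-envelope illustration, not a result from the paper: it assumes both figures describe the same configuration, which the summary does not confirm, and the function names are hypothetical.

```python
def end_to_end_speedup(attention_fraction, attention_speedup):
    """Amdahl's law: only the attention share of each decode step gets faster."""
    return 1.0 / ((1.0 - attention_fraction) + attention_fraction / attention_speedup)


def implied_attention_fraction(end_to_end, attention_speedup):
    """Invert Amdahl's law for the share of baseline time spent in attention."""
    return (1.0 - 1.0 / end_to_end) / (1.0 - 1.0 / attention_speedup)


# Assumption: the 14.3x and 2.6x figures are treated as one matched pair.
f = implied_attention_fraction(2.6, 14.3)
print(f"implied attention share of baseline decode time: {f:.2f}")
```

Under that assumption the implied attention share comes out at roughly two thirds of baseline decoding time, with the remainder spent in parts of the model that MAC-Attention does not touch.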
Conclusion
MAC-Attention is a notable advance in attention computation for large language models. By improving efficiency without sacrificing fidelity, it promises to make long-context decoding faster and more reliable. For readers who want to explore the approach further, the code is available in the MAC-Attention GitHub repository.
