Why Attend to Everything? Focus is the Key
arXiv:2604.03260v1
Abstract: We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks, from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting.
Introduction to Focus
Focus is an efficient attention method that learns which token pairs matter instead of approximating attention over all of them. Approximating every pair spends computation on pairs that contribute little. Focus instead uses learnable centroids to assign tokens to groups, and those groups determine which distant tokens may attend to one another.
Mechanism of Action
The core innovation of Focus lies in its dual attention mechanism:
- Distant Attention: Restricted to token pairs that fall in the same centroid-assigned group, so distant pairs in different groups are skipped.
- Local Attention: Operates at full resolution, allowing for a comprehensive analysis of closely related tokens.
This combination keeps full-resolution attention where tokens are close together and cuts the number of distant pairs that must be computed, which is where the savings over full attention come from.
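The two rules above can be illustrated as a boolean attention mask. The following is a minimal sketch, not the authors' implementation: it assumes hard nearest-centroid assignment by dot product (the method itself uses soft routing during training), a causal setting, and a hypothetical `window` parameter for the local range. The function names `assign_groups` and `focus_mask` are made up for this example.

```python
def assign_groups(keys, centroids):
    """Assign each token to the centroid with the highest dot-product score.

    Assumption: hard argmax assignment, used here only for illustration.
    """
    groups = []
    for k in keys:
        scores = [sum(a * b for a, b in zip(k, c)) for c in centroids]
        groups.append(scores.index(max(scores)))
    return groups


def focus_mask(groups, window):
    """Causal mask: query i may attend to key j if j is local or shares i's group."""
    n = len(groups)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only the current and earlier positions
            is_local = (i - j) < window          # local attention, full resolution
            same_group = groups[i] == groups[j]  # distant attention, same group only
            mask[i][j] = is_local or same_group
    return mask


# Toy example: 6 tokens with 2-d features and 2 hypothetical centroids.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0], [1.0, 0.2]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
groups = assign_groups(keys, centroids)
mask = focus_mask(groups, window=2)
```

In this toy case the last token attends to its two local positions plus the earlier tokens in its own group, and skips the distant tokens assigned to the other centroid.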
Performance Benchmarking
Focus is reported to match or beat full attention on perplexity (lower is better) at two scales:
- At 124M parameters, Focus reaches a perplexity of 30.3, compared to 31.4 for full attention.
- When trained from scratch at 7B scale on 2B tokens, Focus reaches a perplexity of 13.82, compared to 13.89 for full attention.
These results indicate that restricting distant attention to same-group pairs does not cost accuracy at either scale. In the retrofit setting, the only trained parameters are the centroids (as few as 148K), with all original model weights frozen.
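To put the two comparisons above on a common scale, the relative perplexity reduction can be computed directly from the reported numbers. This is plain arithmetic on the figures quoted in this section; the helper name `relative_reduction` is made up for the example.

```python
def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100.0


# Perplexities quoted above: full attention vs. Focus.
small_scale = relative_reduction(31.4, 30.3)    # 124M parameters
large_scale = relative_reduction(13.89, 13.82)  # 7B parameters, trained from scratch

print(f"124M: {small_scale:.1f}% lower perplexity")
print(f"7B:   {large_scale:.1f}% lower perplexity")
```

The gap is roughly 3.5% at 124M and roughly 0.5% at 7B, so the larger-scale result is better read as parity with full attention than as a large gain.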
Inference and Speed Improvements
During inference, Focus restricts each token to its top-k highest-scoring groups, which discretizes the soft routing into a hard sparsity pattern. The reported results are:
- A 2x speedup over the pretrained baseline, with a perplexity of 41.3 compared to 42.8.
- An 8.6x wall-clock speedup at 1M tokens, obtained by decomposing the sparsity pattern into two standard FlashAttention calls, with no custom kernels required.
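The sketch below illustrates two ideas from this section under stated assumptions. First, top-k routing: each token keeps only its k highest-scoring groups. Second, one well-known way to combine two attention calls over disjoint key sets, by merging the partial results with their log-sum-exp values. Whether Focus merges its two FlashAttention calls this way is an assumption; the source does not say. All names (`top_k_groups`, `attend`, `merge`) and all scores are hypothetical.

```python
import math


def top_k_groups(scores, k):
    """Return the indices of the k highest-scoring groups for one token."""
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return set(ranked[:k])


def attend(q, keys, values):
    """Softmax attention for one query; returns (output, log-sum-exp of logits)."""
    logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results computed over disjoint key sets."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


# Hypothetical routing scores of 3 tokens against 4 groups; keep the top 2.
routing = [[0.1, 0.7, 0.2, 0.0], [0.6, 0.3, 0.05, 0.05], [0.25, 0.5, 0.1, 0.15]]
selected = [top_k_groups(s, k=2) for s in routing]

# One query attending to 4 keys, split into a "distant" and a "local" call.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]

full_out, _ = attend(q, keys, values)
distant_out, distant_lse = attend(q, keys[:2], values[:2])
local_out, local_lse = attend(q, keys[2:], values[2:])
merged = merge(distant_out, distant_lse, local_out, local_lse)
```

Here `merged` equals `full_out` up to floating-point error, which is why a sparsity pattern that splits into two disjoint key sets can be served by two standard attention calls rather than a custom kernel.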
Advantages Over Existing Methods
Focus also preserves model alignment after adaptation. Instruction-tuned models retain their scores on benchmarks such as TruthfulQA, whereas LoRA can degrade those scores across a range of learning rates and ranks. This consistency is attributed to centroid routing, which trains only the centroids and leaves all original model weights frozen.
Conclusion
In conclusion, Focus shifts efficient attention from approximating every token pair to learning which pairs matter. Learnable centroids restrict distant attention to same-group pairs while local attention stays at full resolution, and because the base weights are frozen, the method can be retrofitted to existing models from 124M to 70B parameters. The reported results show equal or better perplexity alongside inference speedups of 2x over the pretrained baseline and 8.6x at 1M tokens.
