MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
The efficiency of Vision Transformers (ViTs) is a primary concern in computer vision, especially as the number of input tokens grows. The paper “MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis,” published on arXiv, proposes a new approach to this problem.
Introduction
Token compression addresses the quadratic complexity of self-attention in ViTs by reducing the number of tokens the model has to process. Existing methods such as ToMe rely on GPU-inefficient operations, including sorting and scattered writes, and the resulting overhead limits their overall effectiveness. In contrast, the authors present MaMe, a training-free and differentiable token merging technique that uses matrix operations, which makes it GPU-friendly and able to accelerate ViTs.
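To make the quadratic-cost argument concrete, here is a rough back-of-the-envelope sketch. It counts only the two matrix products inside attention (QK^T and the weighted sum of V) and ignores projections and MLP layers, which scale linearly with token count. The token count (196) and width (768) are illustrative assumptions, not figures from the paper.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for the QK^T and attention-times-V products."""
    # QK^T costs N * N * d, and the weighted sum over V costs another N * N * d.
    return 2 * num_tokens * num_tokens * dim


full = attention_flops(196, 768)    # assumed: 14 x 14 patch tokens, width 768
halved = attention_flops(98, 768)   # the same layer after merging half the tokens
print(halved / full)                # 0.25
```

Halving the token count cuts this part of the cost to a quarter, which is why merging tokens can pay off as long as the merging step itself is cheap.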
Key Contributions
The paper introduces two significant components:
- MaMe: A training-free, matrix-based token merging method that improves the efficiency of ViTs.
- MaRe: An inverse operation designed for token restoration, which together with MaMe forms a comprehensive pipeline aimed at improving image synthesis.
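The general idea of merging and restoring tokens with dense matrix products can be sketched as follows. This is not the paper's formulation: the evenly spaced destination tokens, the softmax assignment, the temperature value, and all function names are assumptions made for illustration. The point is only that both directions reduce to matrix multiplications, with no sorting or scattered writes.

```python
import numpy as np


def merge_matrix(x: np.ndarray, num_keep: int, temperature: float = 0.1) -> np.ndarray:
    """Build a soft assignment matrix of shape (N, K) from cosine similarity."""
    n = x.shape[0]
    # Hypothetical choice: evenly spaced tokens act as merge destinations.
    dst_idx = np.linspace(0, n - 1, num_keep).round().astype(int)
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x_unit @ x_unit[dst_idx].T               # (N, K) cosine similarities
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)    # stabilise the softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1


def merge(x: np.ndarray, assign: np.ndarray) -> np.ndarray:
    """Weighted average of source tokens per destination: (K, N) @ (N, d)."""
    col_norm = assign / assign.sum(axis=0, keepdims=True)
    return col_norm.T @ x


def restore(merged: np.ndarray, assign: np.ndarray) -> np.ndarray:
    """Map merged tokens back to the original length: (N, K) @ (K, d)."""
    return assign @ merged


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))       # 16 tokens of width 8
assign = merge_matrix(tokens, num_keep=8)
merged = merge(tokens, assign)
restored = restore(merged, assign)
print(merged.shape, restored.shape)         # (8, 8) (16, 8)
```

Because every step is a differentiable dense operation, a scheme of this kind can be dropped into a pre-trained model without retraining, and the same assignment matrix that merges tokens can be reused to restore the original sequence length for synthesis tasks.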
Performance Metrics
When applied to pre-trained Vision Transformer models, MaMe is reported to achieve the following:
- ViT-B: doubles throughput with only a 2% drop in accuracy.
- Fine-tuning: fine-tuning the last layer with MaMe yields a 1.0% accuracy boost at a 1.1x speedup.
- Zero-shot classification: on SigLIP2-B@512, MaMe achieves a 1.3x acceleration with negligible performance degradation.
- Video: MaMe accelerates VideoMAE-L by 48.5% on the Kinetics-400 dataset with only a 0.84% accuracy loss.
Image Synthesis Enhancement
For image synthesis, the combined MaMe+MaRe pipeline is reported to deliver two benefits:
- Enhanced quality of generated images.
- Reduced latency in the generation process of Stable Diffusion v2.1 by 31%.
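Latency reductions and speedup factors are easy to confuse, so the small helper below converts between them. It assumes the 31% figure is a reduction in wall-clock time per generation; the function names are made up for this example.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """A latency cut of `reduction` (as a fraction) expressed as a speedup factor."""
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup: float) -> float:
    """A speedup factor expressed as a fractional cut in latency."""
    return 1.0 - 1.0 / speedup


print(round(speedup_from_latency_reduction(0.31), 2))  # 1.45
print(latency_reduction_from_speedup(2.0))             # 0.5
```

Under that assumption, a 31% latency reduction corresponds to generating images roughly 1.45x faster.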
Conclusion
Taken together, the results reported in the paper indicate that MaMe and MaRe can accelerate vision models while maintaining, and in some tasks improving, performance. The approach addresses the computational cost of self-attention in ViTs and offers a basis for further work on efficient visual perception and synthesis.
For those interested, the code for MaMe is available at https://github.com/cominder/mame.
