MaMe & MaRe: Efficient Token Merging for Vision Transformers

MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

Vision Transformers (ViTs) become expensive as the number of input tokens grows, which makes their efficiency a central concern in computer vision. The paper “MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis,” published on arXiv, addresses this problem with a matrix-based approach to token merging.

Introduction

Token compression addresses the quadratic complexity of self-attention in ViTs by reducing the number of tokens the model has to process. Existing methods such as ToMe rely on GPU-inefficient operations, including sorting and scattered writes, and the resulting overhead limits the speedup they deliver in practice. In contrast, the authors propose MaMe, a training-free and differentiable token merging technique built on matrix operations, which makes it GPU-friendly and able to accelerate ViTs.
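
To make the contrast with sort-and-scatter merging concrete, here is a minimal NumPy sketch of token merging expressed as a single matrix product. It is an illustration under stated assumptions, not the paper's algorithm: the alternating source/destination split, the softmax assignment, the `temperature` parameter, and the function name `soft_merge_matrix` are all hypothetical choices made for this example.

```python
import numpy as np

def soft_merge_matrix(x, temperature=0.1, eps=1e-8):
    """Build a (k, n) merge matrix with rows summing to 1 from tokens x of shape (n, d).

    Hypothetical scheme: even-indexed tokens are sources that get merged away,
    odd-indexed tokens are destinations that are kept.
    """
    n = x.shape[0]
    src_idx = np.arange(0, n, 2)  # fixed indices, not data-dependent
    dst_idx = np.arange(1, n, 2)
    k = dst_idx.size

    # Cosine similarity between every source and every destination token.
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    sim = xn[src_idx] @ xn[dst_idx].T  # shape (n_src, k)

    # Softmax over destinations: each source spreads a total weight of 1.
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)

    # Each destination keeps itself and receives its share of the sources.
    m = np.zeros((k, n))
    m[:, dst_idx] = np.eye(k)
    m[:, src_idx] = assign.T
    m /= m.sum(axis=1, keepdims=True)  # normalize rows into weighted averages
    return m

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))  # 8 tokens, 4 features (made-up data)
m = soft_merge_matrix(x)
merged = m @ x               # 4 merged tokens, no sorting involved
```

Every step here is a dense matrix operation over fixed indices, so nothing is sorted and no write location depends on the data. The same steps would also remain differentiable if written in an autodiff framework.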

Key Contributions

The paper introduces two significant components:

  • MaMe: A matrix-based token merging method that can be applied to pre-trained ViTs without additional training.
  • MaRe: The inverse operation, which restores merged tokens. Together with MaMe it forms a merge-and-restore pipeline aimed at improving image synthesis.
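
The merge-then-restore idea can be sketched with two small matrices. This is a toy example under assumptions, not MaRe as defined in the paper: the fixed pairwise-averaging merge matrix `M` and the choice of a row-normalized transpose as the restoration matrix `U` are hypothetical stand-ins picked to keep the arithmetic easy to follow.

```python
import numpy as np

# Four tokens with two features each (made-up values).
x = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [0.0, 4.0],
              [2.0, 6.0]])

# Hypothetical merge matrix: average tokens (0, 1) and tokens (2, 3).
M = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

merged = M @ x  # shape (2, 2): two tokens instead of four

# Assumed restoration matrix: transpose of M, rescaled so each row sums to 1.
U = M.T / M.T.sum(axis=1, keepdims=True)  # shape (4, 2)

restored = U @ merged  # shape (4, 2): back to the original sequence length
```

In this toy version each restored token is a copy of the merged token it was folded into, so the sequence returns to length four and the per-feature mean is preserved, but the detail lost during merging is not recovered.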

Performance Metrics

Applied to pre-trained Vision Transformer models, MaMe achieves the following results:

  • On ViT-B, MaMe doubles throughput with only a 2% drop in accuracy.
  • Fine-tuning the last layer with MaMe yields a 1.0% accuracy boost at a 1.1x speedup.
  • In zero-shot classification with SigLIP2-B@512, MaMe achieves a 1.3x acceleration with negligible performance degradation.
  • For video processing, MaMe accelerates VideoMAE-L by 48.5% on the Kinetics-400 dataset with only a 0.84% accuracy loss.

Image Synthesis Enhancement

For image synthesis, the MaMe+MaRe pipeline delivers two improvements:

  • Enhanced quality of generated images.
  • Reduced latency in the generation process of Stable Diffusion v2.1 by 31%.
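
A latency reduction and a speedup factor are two views of the same measurement, so converting one into the other makes the 31% figure easier to compare with the 1.1x and 1.3x numbers reported for perception tasks. The short calculation below assumes the 31% refers to end-to-end generation latency; the helper name `speedup_from_latency_reduction` is made up for this example.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0 <= reduction < 1) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    # New latency is (1 - reduction) of the old one, so speedup is its inverse.
    return 1.0 / (1.0 - reduction)

sd_speedup = speedup_from_latency_reduction(0.31)
print(f"{sd_speedup:.2f}x")  # prints "1.45x"
```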

Conclusion

Taken together, the results show that MaMe and MaRe accelerate vision models while maintaining, and in certain tasks improving, performance. The approach addresses the computational cost of self-attention in ViTs and offers a basis for further work on efficient visual perception and synthesis.

For those interested, the code for MaMe is available at https://github.com/cominder/mame.


Lazarus Omolua
https://richlyai.com/blog