MacTok: Efficient Continuous Tokenization for Image Generation

MacTok: Robust Continuous Tokenization for Image Generation

Source: arXiv:2603.29634v1

Abstract

Continuous image tokenizers enable efficient visual generation. Those built on variational frameworks learn smooth, structured latent representations through Kullback-Leibler (KL) regularization. However, they often suffer from posterior collapse when the number of tokens is reduced: the encoder fails to encode informative features into the compressed latent space.
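To make the KL term concrete, here is a minimal sketch that assumes a diagonal Gaussian posterior and a standard normal prior, the usual choice in variational autoencoders. The source does not give MacTok's exact objective, so the function name and the example values are illustrative only.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

# A collapsed posterior equals the prior (mean 0, variance 1), so its KL is zero
# and the latent carries no information about the input image.
collapsed = kl_to_standard_normal([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# An informative posterior moves away from the prior and pays a positive KL cost.
# These numbers are made up for illustration.
informative = kl_to_standard_normal([1.5, -0.8, 0.3], [-2.0, -1.0, -0.5])

print(collapsed, informative)
```

A KL term that stays near zero during training is a common sign of collapse: the regularizer is satisfied, but only because the encoder has stopped encoding anything.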

Introducing MacTok

To address posterior collapse, we present MacTok, a Masked Augmenting 1D Continuous Tokenizer. MacTok combines image masking with representation alignment to prevent collapse while learning compact, robust representations. It employs two masking strategies:

  • Random Masking: Regularizes the latent learning process.
  • DINO-guided Semantic Masking: Emphasizes informative regions of the image, compelling the model to encode robust semantics from incomplete visual information.
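The two strategies can be sketched as index selection over image patches. This is a hedged illustration, not the authors' implementation: the function names, the mask ratio, and the per-patch saliency scores (standing in for whatever signal DINO provides) are all assumptions, as is the choice to hide the most salient patches.

```python
import random

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted patch indices to hide, chosen uniformly at random."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))

def saliency_guided_mask(saliency, mask_ratio):
    """Return sorted indices of the highest-saliency patches (assumed policy)."""
    num_masked = int(len(saliency) * mask_ratio)
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:num_masked])

rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_idx = random_mask(num_patches=16, mask_ratio=0.5, rng=rng)

# Hypothetical saliency scores for 8 patches; a real pipeline would derive
# them from a pretrained model rather than hard-code them.
scores = [0.1, 0.9, 0.8, 0.05, 0.3, 0.7, 0.2, 0.6]
semantic_idx = saliency_guided_mask(scores, mask_ratio=0.5)

print(random_idx)
print(semantic_idx)  # [1, 2, 5, 7]
```

Hiding the highest-scoring patches forces the encoder to infer semantics from the remaining context, which is one plausible reading of "encoding robust semantics from incomplete visual information".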

Enhanced Representation Alignment

MacTok also applies global and local representation alignment. This dual alignment preserves rich discriminative information in a highly compressed 1D latent space. MacTok requires only 64 or 128 tokens.
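One way to picture global and local alignment is as two cosine-similarity losses between the tokenizer's features and those of a pretrained teacher. The source does not specify the loss, so the cosine form, the mean pooling, and all names below are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Average a list of equal-length vectors into one pooled vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def global_alignment_loss(student, teacher):
    """Compare one pooled summary vector per image (assumed: mean pooling)."""
    return 1.0 - cosine_similarity(mean_vector(student), mean_vector(teacher))

def local_alignment_loss(student, teacher):
    """Compare features position by position and average the result."""
    sims = [cosine_similarity(s, t) for s, t in zip(student, teacher)]
    return 1.0 - sum(sims) / len(sims)

# Toy features: two positions, two channels each.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]   # same directions, different scale
swapped = [[0.0, 1.0], [1.0, 0.0]]   # same pooled summary, wrong positions

print(global_alignment_loss(aligned, teacher), local_alignment_loss(aligned, teacher))
print(global_alignment_loss(swapped, teacher), local_alignment_loss(swapped, teacher))
```

The `swapped` case shows why both terms can be useful: its pooled summary matches the teacher, so the global loss is near zero, yet every position is wrong, which only the local loss detects.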

Performance Metrics

On ImageNet, MacTok achieves a competitive generative Fréchet Inception Distance (gFID) of 1.44 at 256×256 resolution and a state-of-the-art gFID of 1.52 at 512×512 with the SiT-XL model, while reducing token usage by up to 64× compared with previous methods.
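The token-reduction figure is a simple ratio. The source does not say which baseline the 64× refers to, so the baseline below (a 512×512 image encoded to a 2D latent grid with 8× spatial downsampling) is a hypothetical chosen only to show the arithmetic.

```python
def reduction_factor(baseline_tokens, compact_tokens):
    """How many times fewer tokens the compact tokenizer uses."""
    return baseline_tokens / compact_tokens

# Hypothetical baseline: 512x512 image, 8x downsampling -> 64x64 latent grid.
grid_side = 512 // 8
baseline_tokens = grid_side * grid_side  # 4096 tokens

factor_64 = reduction_factor(baseline_tokens, 64)    # 64.0
factor_128 = reduction_factor(baseline_tokens, 128)  # 32.0

print(baseline_tokens, factor_64, factor_128)
```

Under this assumed baseline, the 64-token setting gives the 64× reduction and the 128-token setting gives 32×.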

Conclusion

Extensive experiments confirm that combining masking with semantic guidance prevents posterior collapse. MacTok thereby achieves efficient, high-fidelity tokenization for image generation.


Lazarus Omolua, https://richlyai.com/blog