MacTok: Robust Continuous Tokenization for Image Generation
Source: arXiv:2603.29634v1 (cross-listed)
Abstract
Continuous image tokenizers enable efficient visual generation. Those built on variational frameworks learn smooth, structured latent representations through Kullback-Leibler (KL) regularization. However, they often suffer from posterior collapse when the number of tokens is small: the encoder fails to encode informative features into the compressed latent space.
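To make the collapse condition concrete, the KL term for a diagonal Gaussian posterior has a well-known closed form. The sketch below assumes a standard normal prior and a diagonal Gaussian posterior, which is the usual variational setup; the source does not state MacTok's exact parameterization, and the function name and example values are illustrative only.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

# A collapsed posterior equals the prior for every input, so the KL term is zero
# and the latent carries no information about the image.
collapsed = kl_to_standard_normal([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])

# An informative posterior shifts its mean and shrinks its variance, so KL is positive.
# These values are made up for illustration.
informative = kl_to_standard_normal([1.5, -0.8, 0.3, 2.0], [-2.0, -1.0, -0.5, -3.0])

print(f"collapsed KL:   {collapsed:.4f}")
print(f"informative KL: {informative:.4f}")
```

A KL term that stays near zero during training is therefore a common warning sign that the encoder is being ignored.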
Introducing MacTok
To address posterior collapse, we present MacTok, a Masked Augmenting 1D Continuous Tokenizer. MacTok combines image masking with representation alignment to prevent collapse while learning compact, robust representations. It uses two masking strategies:
- Random Masking: Hides randomly chosen parts of the image to regularize latent learning.
- DINO-guided Semantic Masking: Emphasizes informative regions in images, forcing the model to encode robust semantics from incomplete visual information.
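The two strategies can be sketched at the patch level as follows. This is a minimal illustration, not MacTok's implementation: the source does not say how DINO features are turned into a mask, so the sketch assumes a per-patch saliency score is already available and that the highest-scoring patches are the ones hidden. The function names, the 4x4 grid, the saliency values, and the 25% ratio are all hypothetical.

```python
import random

def random_mask(num_patches, mask_ratio, rng):
    """Pick patches to mask uniformly at random. True means masked."""
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def semantic_mask(saliency, mask_ratio):
    """Mask the highest-saliency patches first (assumed policy). True means masked."""
    num_masked = int(len(saliency) * mask_ratio)
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    masked = set(ranked[:num_masked])
    return [i in masked for i in range(len(saliency))]

rng = random.Random(0)

# Hypothetical per-patch saliency for a 4x4 patch grid (16 patches, row-major).
# In practice these scores would come from a pretrained model such as DINO.
saliency = [0.1, 0.2, 0.1, 0.0,
            0.3, 0.9, 0.8, 0.1,
            0.2, 0.7, 0.6, 0.1,
            0.0, 0.1, 0.2, 0.0]

rand = random_mask(len(saliency), 0.25, rng)
sem = semantic_mask(saliency, 0.25)
print("random masked indices  :", [i for i, m in enumerate(rand) if m])
print("semantic masked indices:", [i for i, m in enumerate(sem) if m])
```

Under this assumed policy the semantic mask removes the central, most salient patches, so the encoder has to infer the object from its surroundings, while the random mask spreads the same budget uniformly.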
Enhanced Representation Alignment
MacTok also applies global and local representation alignment, which preserves rich discriminative information in a highly compressed 1D latent space. It needs only 64 or 128 tokens.
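One common way to express such an alignment objective is a cosine-distance loss against features from a frozen pretrained encoder, computed per position (local) and on mean-pooled features (global). The source does not specify which features are aligned or the loss form, so everything below is an assumption: the `student` and `teacher` features are hypothetical and are taken to be already projected to the same dimension and number of positions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool(features):
    """Average a list of feature vectors into one vector."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def alignment_loss(student, teacher):
    """Return (local, global) cosine-distance losses.

    student, teacher: lists of feature vectors, one per position,
    with matching length and dimension (assumed).
    """
    local = 1.0 - sum(cosine(s, t) for s, t in zip(student, teacher)) / len(student)
    global_ = 1.0 - cosine(mean_pool(student), mean_pool(teacher))
    return local, global_

# Hypothetical 2-dimensional features at three positions.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]    # same directions, different scales
misaligned = [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]

print("aligned    (local, global):", alignment_loss(aligned, teacher))
print("misaligned (local, global):", alignment_loss(misaligned, teacher))
```

In this sketch the local term constrains each position individually, while the global term only constrains the pooled summary, so the two can disagree and are typically weighted and summed.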
Performance Metrics
On ImageNet, MacTok achieves a competitive generative Fréchet Inception Distance (gFID) of 1.44 at 256×256 and a state-of-the-art gFID of 1.52 at 512×512 with SiT-XL, while reducing token usage by up to 64× compared with previous methods.
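The source does not say which baseline the 64× figure is measured against. As illustrative arithmetic only, the sketch below assumes a conventional 2D grid tokenizer with a spatial downsampling factor of 8 (a hypothetical baseline) and compares its token count with the 64 and 128 tokens quoted above.

```python
def grid_token_count(resolution, downsample):
    """Tokens produced by a 2D grid tokenizer on a square image."""
    side = resolution // downsample
    return side * side

for resolution in (256, 512):
    grid = grid_token_count(resolution, downsample=8)  # assumed baseline, not from the source
    for tokens in (64, 128):
        print(f"{resolution}px: {grid} grid tokens vs {tokens} -> {grid // tokens}x fewer")
```

Under that assumption, a 512×512 image yields 4096 grid tokens, so 64 tokens would correspond to a 64× reduction.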
Conclusion
These results confirm that combining masking with semantic guidance prevents posterior collapse, and they show that MacTok can deliver efficient, high-fidelity tokenization for image generation.
