DuoTok: Source-Aware Dual-Track Tokenization for Multi-Track Music Language Modeling
arXiv:2511.20224v2
Audio tokenization bridges continuous waveforms and multi-track music language models. In dual-track modeling, tokens should support three properties: high-fidelity reconstruction, strong predictability under a language model, and cross-track correspondence. These goals pull against one another, so a tokenizer has to trade them off. DuoTok, a source-aware dual-track tokenizer, addresses this trade-off through staged disentanglement.
Understanding DuoTok
DuoTok is built in stages, each with a distinct role:
- Semantic Encoder Pretraining: A semantic encoder is pretrained first.
- Multi-Task Supervision: The pretrained encoder is then regularized with multi-task supervision.
- Freezing the Encoder: The encoder is then frozen, so the later quantization stage cannot alter its learned representation.
- Hard Dual-Codebook Routing: With the encoder frozen, quantization uses hard dual-codebook routing. "Hard" routing generally means each input is assigned to exactly one of the two codebooks rather than blended across both.
- Auxiliary Objectives: Auxiliary objectives are kept on the quantized codes during this stage.
- Diffusion Decoder: Finally, a diffusion decoder reconstructs high-frequency details, allowing tokens to concentrate on structured information necessary for effective sequence modeling.
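The hard-routing step can be illustrated with a minimal sketch. It assumes that routing is decided by track identity and that quantization is a plain nearest-neighbor lookup. Neither detail is confirmed by this summary. The track names, the toy 2-D codebooks, and the helper names `nearest_code` and `route_and_quantize` are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def route_and_quantize(
    frames: Sequence[Vector],
    track: str,
    codebooks: Dict[str, Sequence[Vector]],
) -> List[int]:
    """Hard routing: every frame of a track is quantized with that track's codebook only."""
    codebook = codebooks[track]  # exactly one codebook is selected, never a mixture
    return [nearest_code(frame, codebook) for frame in frames]


# Hypothetical toy codebooks; real codebooks would be learned and far larger.
CODEBOOKS = {
    "vocals": [(0.0, 0.0), (1.0, 1.0)],
    "accompaniment": [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)],
}

# Stand-ins for frozen-encoder outputs of each track.
vocal_frames = [(0.1, -0.1), (0.9, 1.2)]
accomp_frames = [(0.9, 0.1), (0.4, 0.6)]

vocal_tokens = route_and_quantize(vocal_frames, "vocals", CODEBOOKS)
accomp_tokens = route_and_quantize(accomp_frames, "accompaniment", CODEBOOKS)
print(vocal_tokens, accomp_tokens)
```

Because each track only ever indexes its own codebook, the two token streams stay separable by construction, which is the property the word "hard" is meant to convey here.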
Performance Insights
On standard benchmarks, DuoTok achieves a favorable predictability-fidelity trade-off: it reaches the lowest cnBPT, the predictability metric reported, while maintaining competitive reconstruction quality at 0.75 kbps. The tokens are therefore easier for a language model to predict without giving up competitive reconstruction at that bitrate.
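To put 0.75 kbps in perspective, the bitrate of a discrete token stream is commonly computed as frame rate times number of parallel token streams times bits per token. The sketch below uses that standard formula with hypothetical settings. They were chosen only because they multiply out to 750 bits per second and are not DuoTok's actual configuration, which this summary does not state.

```python
import math


def token_bitrate_bps(frame_rate_hz: float, num_streams: int, codebook_size: int) -> float:
    """Bits per second carried by a token stream.

    Each token costs log2(codebook_size) bits; num_streams tokens are
    emitted per frame, and frame_rate_hz frames are emitted per second.
    """
    return frame_rate_hz * num_streams * math.log2(codebook_size)


# Hypothetical configuration, NOT taken from the paper.
bitrate = token_bitrate_bps(frame_rate_hz=25.0, num_streams=2, codebook_size=32768)
print(f"{bitrate / 1000:.2f} kbps")
```

The same budget can be reached in many ways, for example with a higher frame rate and a smaller codebook, so a bitrate on its own does not pin down the tokenizer's design.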
Implications for Language Modeling
Under a held-constant dual-track language modeling protocol, enBPT also improves, indicating that DuoTok’s gains go beyond codebook size effects. Controlled diagnostics add two observations: predictability costs are larger under cross-track corruption, and gains from longer context are larger. Together these suggest that models trained on DuoTok tokens use cross-track structure and non-local history.
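This summary does not define cnBPT or enBPT. As background, the sketch below shows how a generic bits-per-token score can be computed from the probabilities a language model assigns to the true tokens, along with one common normalization by codebook size. The function names and the normalization are assumptions made for illustration and may differ from the paper's definitions.

```python
import math
from typing import Sequence


def bits_per_token(true_token_probs: Sequence[float]) -> float:
    """Mean negative log2-probability assigned to the tokens that actually occurred."""
    return -sum(math.log2(p) for p in true_token_probs) / len(true_token_probs)


def size_normalized_bpt(true_token_probs: Sequence[float], codebook_size: int) -> float:
    """Bits per token divided by log2(codebook_size).

    A value of 1.0 matches uniform guessing over the codebook; lower is better.
    This normalization is an assumption, not necessarily the paper's.
    """
    return bits_per_token(true_token_probs) / math.log2(codebook_size)


# Toy probabilities a model assigned to four ground-truth tokens.
probs = [0.5, 0.25, 0.125, 0.125]
bpt = bits_per_token(probs)  # (1 + 2 + 3 + 3) / 4 bits
norm = size_normalized_bpt(probs, codebook_size=1024)
print(bpt, norm)
```

Dividing by the log of the codebook size is one way to compare tokenizers with different vocabularies on a common scale, which is the kind of confound the held-constant protocol above is meant to rule out.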
Conclusion
In summary, DuoTok uses staged disentanglement to target the three properties that dual-track tokens need: high-fidelity reconstruction, strong predictability, and cross-track correspondence. The reported results show a favorable predictability-fidelity trade-off at 0.75 kbps, along with evidence that language models trained on its tokens use cross-track structure and longer context.
