DASB — Discrete Audio and Speech Benchmark
arXiv:2406.14294v3
Abstract
Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies.
Introduction
To address these challenges, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework designed for benchmarking discrete audio tokens across various domains including speech, general audio, and music. This framework is aimed at facilitating the evaluation of discrete audio representations on a range of discriminative and generative tasks.
Key Features of DASB
- Comprehensive Evaluation: DASB provides a standardized platform to assess the performance of discrete audio tokens in diverse audio contexts.
- Discriminative and Generative Tasks: The framework supports a variety of tasks that enable researchers to evaluate both the understanding and generation capabilities of models using discrete audio tokens.
- Public Accessibility: The DASB code, evaluation setup, and leaderboards are publicly available on the DASB website.
Findings
Our results reveal several important insights regarding the performance of discrete audio representations:
- Discrete representations were less robust than their continuous counterparts.
- Performance depended heavily on model architecture, data size, learning rate, and model capacity.
- Semantic tokens generally outperformed acoustic tokens, yet a noticeable performance gap persists between discrete tokens and continuous features.
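One intuition for the gap between discrete tokens and continuous features is that quantization discards detail. The following is a minimal sketch of that idea, not the DASB code or any specific tokenizer: it assumes a simple nearest-neighbour codebook lookup, and the function names, the codebook, and the feature values are all hypothetical and chosen only for illustration.

```python
import math

def quantize(frames, codebook):
    """Map each continuous frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        distances = [math.dist(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

def reconstruction_error(frames, codebook, tokens):
    """Mean Euclidean distance between each frame and the vector its token stands for."""
    total = sum(math.dist(f, codebook[t]) for f, t in zip(frames, tokens))
    return total / len(frames)

# Hypothetical 2-D "features" and a 3-entry codebook (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, 0.0), (0.9, 1.2), (0.2, 1.8), (0.0, 0.1)]

tokens = quantize(frames, codebook)
error = reconstruction_error(frames, codebook, tokens)

print("tokens:", tokens)
print("mean reconstruction error:", error)
```

Each frame is replaced by a single integer, so any variation among frames that share a token is lost. The nonzero reconstruction error is a toy stand-in for the information a downstream model no longer sees when it is given tokens instead of continuous features.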
Implications for Future Research
The findings from DASB point to the need for further research on discrete audio tokens, particularly on improving their robustness and performance across tasks. We encourage the research community to use the DASB framework to study audio tokenization and its role in multimodal learning.
Conclusion
The Discrete Audio and Speech Benchmark (DASB) provides a unified framework for evaluating discrete audio tokens. By replacing the inconsistent evaluation settings of earlier studies with a common setup, DASB aims to foster innovation in audio processing and its intersection with language models. Researchers and practitioners are invited to engage with the benchmark and contribute to this growing field.
