WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
Decoder-only autoregressive text-to-speech (AR-TTS) models can produce high-fidelity speech, but they are expensive to run: because they use full self-attention, their memory and computational costs scale quadratically with sequence length.
WAND (Windowed Attention and Knowledge Distillation) is a framework that addresses this problem. It adapts pretrained AR-TTS models to operate with constant computational and memory complexity, improving efficiency without compromising synthesis quality.
Key Features of WAND
WAND combines three techniques:
- Separation of Attention Mechanism: WAND splits attention into persistent global attention over the conditioning tokens and local sliding-window attention over the generated tokens. Each generated token therefore still sees all of the conditioning input, while the amount of generated context it attends to stays bounded.
- Curriculum Learning Strategy: To keep fine-tuning stable, WAND progressively tightens the attention window during training, so the model adapts to the restricted context gradually rather than all at once.
- Knowledge Distillation: WAND distills knowledge from a full-attention teacher model into the windowed model. This recovers high-fidelity synthesis quality with high data efficiency, so the more efficient model retains the teacher's quality at a lower resource cost.
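The three techniques above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names (`wand_mask`, `window_schedule`, `kd_loss`), the linear shape of the window schedule, the choice of a KL-divergence distillation loss, and the convention that the window includes the current token are all assumptions made here for illustration.

```python
import math


def wand_mask(num_cond, total_len, window):
    """Return mask[q][k], True if query position q may attend to key position k.

    Assumed layout: positions 0..num_cond-1 are conditioning tokens and the
    remaining positions are generated tokens. Conditioning keys stay visible
    to every later query (global attention); other keys are visible only
    within the last `window` positions (sliding-window attention).
    """
    mask = []
    for q in range(total_len):
        row = []
        for k in range(total_len):
            causal = k <= q              # never attend to future positions
            is_cond = k < num_cond       # persistent global attention
            in_window = q - k < window   # window counts the current token
            row.append(causal and (is_cond or in_window))
        mask.append(row)
    return mask


def window_schedule(step, total_steps, start_window, final_window):
    """Hypothetical curriculum: shrink the window linearly over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    shrunk = int(start_window - frac * (start_window - final_window))
    return max(final_window, shrunk)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one next-token distribution (assumed loss)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Each query sees at most num_cond + window keys, however long the sequence is.
mask = wand_mask(num_cond=2, total_len=6, window=2)
print(mask[5])  # the last query's row: conditioning keys plus recent keys
```

Under these assumptions, the number of keys any query attends to is capped at `num_cond + window`, which is what keeps the attention span from growing with the length of the generated sequence.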
Performance Evaluation
WAND has been evaluated on three modern AR-TTS models. The results show that it preserves the original synthesis quality while improving efficiency: it achieves up to a 66.2% reduction in KV cache memory usage, and per-step latency stays near-constant regardless of sequence length. This makes text-to-speech applications more scalable and responsive.
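To see why windowing bounds the KV cache, it helps to count cached positions. The sketch below is a back-of-the-envelope calculation under assumptions: the helper names and the model dimensions are hypothetical, and the numbers are not taken from the WAND evaluation. It assumes the cache holds one key and one value vector per layer and KV head for every retained position.

```python
def kv_cache_positions(num_cond, num_generated, window=None):
    """Number of token positions kept in the KV cache.

    With full attention (window=None) every position is kept. With a
    window, all conditioning positions are kept plus at most `window`
    generated positions.
    """
    if window is None:
        return num_cond + num_generated
    return num_cond + min(num_generated, window)


def kv_cache_bytes(positions, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Cache size in bytes; the leading 2 accounts for keys and values."""
    return 2 * positions * num_layers * num_kv_heads * head_dim * bytes_per_value


def cache_reduction(num_cond, num_generated, window):
    """Fraction of KV cache saved relative to full attention."""
    full = kv_cache_positions(num_cond, num_generated)
    windowed = kv_cache_positions(num_cond, num_generated, window)
    return 1.0 - windowed / full


# Illustrative (made-up) sizes: 100 conditioning tokens, 900 generated
# tokens, and a 200-token window.
print(cache_reduction(num_cond=100, num_generated=900, window=200))
```

With these illustrative sizes the windowed cache holds 300 positions instead of 1,000, a 70% reduction. The saving depends on the ratio of window size to sequence length, so it grows as utterances get longer, while the windowed cache itself stops growing once the window is full.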
Conclusion
WAND addresses the memory and computation costs of autoregressive text-to-speech models. By combining windowed attention with knowledge distillation, it makes TTS models more efficient while maintaining the high-quality speech output that users expect. As demand for resource-efficient AI systems grows, this makes WAND a promising approach for future text-to-speech development.
