Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
arXiv:2409.18512v2
Abstract
Recent advancements in speech synthesis have enabled large language model (LLM)-based systems to perform zero-shot generation with controllable content, timbre, speaker identity, and emotion through input prompts. As a result, these models heavily rely on prompt design to guide the generation process. However, existing prompt selection methods often fail to ensure that prompts contain sufficiently stable speaker identity cues and appropriate emotional intensity indicators, which are crucial for expressive speech synthesis.
Introduction
With the rapid evolution of artificial intelligence technologies, the field of text-to-speech (TTS) has witnessed remarkable progress. Recent models leverage large language models to achieve zero-shot generation capabilities, allowing for a wide range of emotional and stylistic variations in speech synthesis. However, this innovation brings forth challenges, particularly in prompt design, which heavily influences the quality and expressiveness of the generated speech.
Challenges in Current Approaches
Existing methods for prompt selection often overlook critical factors that contribute to expressive speech synthesis. Key challenges include:
- Stable Speaker Identity Cues: Many prompts fail to provide consistent indicators of speaker identity, leading to variability in synthesized output.
- Emotional Intensity Indicators: Current techniques may not adequately reflect the desired emotional intensity, resulting in less engaging speech.
- Dependence on Prompt Design: Output quality depends heavily on well-crafted prompts, yet many existing approaches do not systematically evaluate prompt efficacy.
Proposed Two-Stage Prompt Selection Strategy
To tackle these challenges, we introduce a two-stage prompt selection strategy tailored for expressive speech synthesis. The strategy consists of two stages:
- Static Stage:
  - Evaluating prompt candidates using pitch-based prosodic features.
  - Assessing perceptual audio quality and text-emotion coherence scores, as evaluated by an LLM.
  - Measuring character error rate, speaker similarity, and emotional similarity between synthesized and prompt speech using a specific TTS model.
- Dynamic Stage:
  - Employing a textual similarity model to select prompts that best align with the current input text during the synthesis process.
Experimental Results
Our experiments indicate that the proposed strategy selects more effective prompts, producing synthesized speech with:
- High-intensity emotional expression.
- Robust speaker identity consistency.
- Overall improved quality and stability in zero-shot TTS performance.
Conclusion
The two-stage prompt selection strategy addresses the shortcomings of existing prompt selection methods for expressive speech synthesis: it screens prompts for stable speaker identity and adequate emotional intensity, then matches them to the input text, supporting more engaging and human-like speech generation. For audio samples and related code, please visit this link.
