PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
In recent years, advances in artificial intelligence have significantly improved automated dubbing (AD), which converts the source speech in a video into target speech in another language. However, achieving a natural dubbing experience remains challenging because of two synchronization problems: matching the duration of the source speech, and lip synchronization (lip-sync). Both are vital for maintaining viewer engagement.
A new study, documented in the paper “PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing,” proposes a novel synchronization method aimed at improving these critical aspects of automated dubbing. This method comprises two main steps: isochrony for timing constraints and phonetic synchronization (PS) to ensure effective lip-sync.
Methodology Overview
The proposed approach consists of the two steps introduced above, followed by an extension:
- Isochrony: The first step achieves isochrony by paraphrasing the translated text with a language model, so that the duration of the target speech closely matches that of the source speech.
- Phonetic Synchronization (PS): The second step applies dynamic time warping (DTW) to the source and target vowels, using vowel distances measured from training data as the local costs. The target text is chosen so that its vowels are pronounced similarly to those in the source speech, which improves lip-sync.
- PS-Comet Extension: Building on these methods, the study further extends its approach to PS-Comet, which considers both semantic and phonetic similarity. This dual focus enhances the preservation of meaning while ensuring accurate lip-sync.
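
To make the phonetic synchronization step more concrete, the following is a minimal sketch of DTW over two vowel sequences with a vowel-distance local cost. It is not the paper's implementation: the `VOWEL_DIST` table, its values, the single-letter vowel symbols, and the function names are all hypothetical placeholders, whereas the study measures its vowel distances from training data.

```python
from typing import Dict, FrozenSet, Sequence

# Hypothetical vowel distances in [0, 1]. These numbers are illustrative
# placeholders only; the study derives its distances from training data.
VOWEL_DIST: Dict[FrozenSet[str], float] = {
    frozenset(("a", "e")): 0.4,
    frozenset(("a", "o")): 0.5,
    frozenset(("e", "i")): 0.3,
    frozenset(("o", "u")): 0.3,
}


def vowel_cost(u: str, v: str) -> float:
    """Local cost between two vowels; unknown pairs get the maximum cost 1.0."""
    if u == v:
        return 0.0
    return VOWEL_DIST.get(frozenset((u, v)), 1.0)


def dtw_distance(src: Sequence[str], tgt: Sequence[str]) -> float:
    """Classic DTW: cumulative cost of the cheapest monotonic alignment."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # dp[i][j] = minimal cost of aligning src[:i] with tgt[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = vowel_cost(src[i - 1], tgt[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]


if __name__ == "__main__":
    source_vowels = ["a", "i"]
    for candidate in (["e", "i"], ["o", "u"]):
        print(candidate, dtw_distance(source_vowels, candidate))
```

Under these assumptions, a candidate paraphrase whose vowel sequence yields a lower DTW distance to the source vowels would be the better lip-sync match. The standard three-way recurrence lets one vowel align with several vowels on the other side, so sequences of different lengths can still be compared.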
Performance Evaluation
The proposed methods were evaluated on Korean and English lip-reading datasets and on a voice-actor dubbing dataset. The results showed that both the PS-TTS and PS-Comet TTS systems significantly outperform text-to-speech (TTS) systems without phonetic synchronization. Notably, these systems also surpassed human voice actors in both Korean-to-English and English-to-Korean dubbing.
Cross-Linguistic Applicability
To further validate the robustness of the proposed methods, the experiments were extended to include French, testing all language pairs to assess cross-linguistic applicability. Across all tested language pairs, PS-Comet consistently delivered superior performance, achieving an optimal balance between lip-sync accuracy and semantic preservation. These findings confirm that PS-Comet not only excels in maintaining accurate lip-sync but also preserves the semantic integrity of the dialogue better than the PS method alone.
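
The trade-off between lip-sync accuracy and semantic preservation can be illustrated with a small candidate-ranking sketch. This is an assumption-laden illustration rather than the study's scoring rule: the linear weighting, the `weight` parameter, the function names, and the example scores are all hypothetical, and the semantic similarity and phonetic distance are taken as already computed by some other component.

```python
from typing import List, Tuple

# Each candidate: (text, semantic_similarity, phonetic_distance).
# Higher semantic_similarity is better; lower phonetic_distance is better.
Candidate = Tuple[str, float, float]


def combined_score(semantic_sim: float, phonetic_dist: float, weight: float = 0.5) -> float:
    """Hypothetical linear trade-off; higher is better.

    weight = 1.0 ranks by meaning only, weight = 0.0 by phonetic match only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * semantic_sim - (1.0 - weight) * phonetic_dist


def pick_best(candidates: List[Candidate], weight: float = 0.5) -> str:
    """Return the text of the candidate with the highest combined score."""
    best = max(candidates, key=lambda c: combined_score(c[1], c[2], weight))
    return best[0]


if __name__ == "__main__":
    # Made-up scores for two hypothetical paraphrases.
    candidates = [
        ("paraphrase A", 0.9, 0.8),  # closer in meaning, worse vowel match
        ("paraphrase B", 0.7, 0.2),  # slightly looser meaning, better vowel match
    ]
    for w in (0.0, 0.5, 1.0):
        print(w, pick_best(candidates, weight=w))
```

With a purely phonetic criterion (`weight=0.0`) the sketch selects the paraphrase with the better vowel match even if its meaning drifts, which is the limitation a combined semantic and phonetic criterion is meant to address.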
Conclusion
The results presented in this study highlight the potential of phonetic synchronization for improving automated dubbing. By addressing both synchronization and semantic preservation, the proposed PS-TTS and PS-Comet TTS systems point the way toward more natural and engaging multilingual content delivery.
