AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
arXiv:2604.16056v1 (cross-list)
Abstract
Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework.
Key Features of AST
AST is built from three components:
- Latent Recomposition: Selectively stitches preserved source segments together with newly synthesized target segments, so that unedited speech is carried over from the source rather than regenerated.
- Precise Style Editing: Extends the same latent manipulation to style edits that apply only to targeted speech segments.
- Adaptive Weak Fact Guidance (AWFG): To prevent artifacts at edit boundaries, AST incorporates AWFG, which dynamically modulates a mel-space guidance signal, enforcing structural constraints only where necessary.
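The first and third components can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: latents are assumed to be plain arrays of shape (frames, dims), the function names `recompose_latents` and `boundary_guidance_weights` are hypothetical, and the linear decay with a fixed `width` is a static stand-in for AWFG's dynamic modulation, whose actual form is not given in this summary.

```python
import numpy as np


def recompose_latents(source, synthesized, edit_start, edit_end):
    """Stitch preserved source frames around a newly synthesized segment.

    source:      (T_src, D) latent frames of the original utterance
    synthesized: (T_new, D) latent frames generated for the edited span
    edit_start, edit_end: frame range [start, end) in `source` to replace
    """
    if not 0 <= edit_start <= edit_end <= len(source):
        raise ValueError("edit span must lie inside the source sequence")
    prefix = source[:edit_start]  # preserved as-is
    suffix = source[edit_end:]    # preserved as-is
    return np.concatenate([prefix, synthesized, suffix], axis=0)


def boundary_guidance_weights(total_frames, boundaries, width=4, max_weight=1.0):
    """Per-frame guidance weight that is non-zero only near edit boundaries.

    ASSUMPTION: the weight decays linearly from `max_weight` at a boundary
    frame to 0 at `width` frames away. The real AWFG schedule is adaptive.
    """
    frames = np.arange(total_frames)
    weights = np.zeros(total_frames)
    for b in boundaries:
        dist = np.abs(frames - b)
        local = max_weight * np.clip(1.0 - dist / width, 0.0, 1.0)
        weights = np.maximum(weights, local)
    return weights


# Toy example: replace source frames 3..5 with a 5-frame synthesized segment.
source = np.zeros((10, 2))
synthesized = np.ones((5, 2))
mixed = recompose_latents(source, synthesized, edit_start=3, edit_end=6)
# The new segment occupies frames 3..7, so the seams sit at frames 3 and 8.
weights = boundary_guidance_weights(len(mixed), boundaries=[3, 8], width=2)
```

In this toy setup the guidance weight is zero in the interior of both the preserved and the synthesized regions, which mirrors the idea of enforcing structural constraints only where necessary.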
Addressing Existing Limitations
Current speech editing methods depend on task-specific training, which is costly in data and time and still struggles with temporal fidelity in unedited regions. AST avoids that cost by being training-free, and it significantly improves temporal consistency in unedited regions. Because it builds on a pre-trained autoregressive TTS model, AST also resolves the controllability-quality trade-off that arises when TTS models are adapted for editing.
Benchmark Introduction: LibriSpeech-Edit
To address the lack of publicly accessible speech editing benchmarks, the authors introduce LibriSpeech-Edit, a new and larger speech editing dataset intended as a shared resource for evaluating speech editing systems.
Innovative Evaluation Metric: Word-level Dynamic Time Warping (WDTW)
Existing evaluation metrics do a poor job of assessing temporal consistency in unedited regions. To address this, the authors propose Word-level Dynamic Time Warping (WDTW), a metric designed to measure how well the unedited regions of an edited utterance stay aligned with the source.
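The exact definition of WDTW is not given in this summary. The sketch below shows one plausible reading only: run classic dynamic time warping separately on each unedited word and average the costs. The function names, the Euclidean frame distance, the length normalization, and the assumption that word boundaries are already available as frame ranges (for example from a forced aligner) are all illustrative choices, not the paper's specification.

```python
import numpy as np


def dtw_cost(a, b):
    """Classic DTW between feature sequences a (n, D) and b (m, D).

    Returns the accumulated Euclidean cost of the best alignment path,
    divided by n + m (an ASSUMED normalization).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m] / (n + m))


def word_level_dtw(src_feats, edit_feats, src_words, edit_words):
    """Average per-word DTW cost over words that should be unchanged.

    src_words / edit_words: lists of (start_frame, end_frame) pairs for the
    same unedited words in the source and edited audio, in matching order.
    """
    if len(src_words) != len(edit_words):
        raise ValueError("word lists must be paired one-to-one")
    costs = [
        dtw_cost(src_feats[s0:s1], edit_feats[e0:e1])
        for (s0, s1), (e0, e1) in zip(src_words, edit_words)
    ]
    return float(np.mean(costs)) if costs else 0.0
```

Under this reading, a lower score means the unedited words in the edited audio track the source more closely, and identical audio in those regions yields a score of zero.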
Experimental Results
Extensive experiments show that AST resolves the trade-off between controllability and quality without additional training. Compared to the previous leading baseline, AST improves temporal consistency and reduces Word Error Rate by nearly 70%. Applying AST to a foundation TTS model also reduces WDTW by 27%, setting a new standard for speaker preservation and temporal fidelity in speech editing.
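Percentages like these are most naturally read as relative reductions against a baseline, although the summary does not say so explicitly. As a small worked example of that arithmetic, with placeholder numbers that are not taken from the paper:

```python
def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline` (0.7 means 70% lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Placeholder values chosen only to illustrate the arithmetic.
# They are NOT results reported for AST or any baseline.
hypothetical_baseline_wer = 10.0
hypothetical_new_wer = 3.0
reduction = relative_reduction(hypothetical_baseline_wer, hypothetical_new_wer)
print(f"{reduction:.0%} relative reduction")
```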
Conclusion
AST shows that precise speech editing does not have to rely on task-specific training. By recomposing latents from a pre-trained TTS model and applying guidance only where it is needed, it improves temporal fidelity in unedited regions while keeping edit quality. Together with the LibriSpeech-Edit benchmark and the WDTW metric, the work contributes both an editing method and the means to evaluate it.
