DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech
Voice conversion has advanced considerably in recent years, particularly with the introduction of Differentiable Digital Signal Processing (DDSP) pipelines. A new study, arXiv:2604.09246v1, proposes enhancements to the existing DDSP-QbE framework that aim to improve the quality of synthesised speech for atypical speech.
The existing DDSP-QbE framework is built on subtractive synthesis: a periodic excitation signal is shaped by a learned spectral envelope to reconstruct the target voice. Its excitation stage, however, generates a sawtooth-like waveform by phase accumulation, and the abrupt discontinuity at each phase wrap is not band-limited. The resulting aliasing artefacts are perceived as buzziness and spectral distortion, particularly at higher fundamental frequencies.
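To make the source of the problem concrete, here is a minimal sketch of a phase-accumulated sawtooth in plain Python. It is not the paper's code: the function name `naive_sawtooth`, the 440 Hz pitch, and the 16 kHz sample rate are illustrative assumptions. It only shows that the waveform jumps by almost its full range in a single sample at every phase wrap, which is the discontinuity that causes aliasing.

```python
def naive_sawtooth(f0_hz, sample_rate, num_samples):
    """Hypothetical helper: phase-accumulated sawtooth in [-1, 1), no anti-aliasing."""
    phase = 0.0
    dt = f0_hz / sample_rate  # phase increment per sample, in cycles
    out = []
    for _ in range(num_samples):
        out.append(2.0 * phase - 1.0)  # map phase [0, 1) to amplitude [-1, 1)
        phase += dt
        if phase >= 1.0:  # phase wrap: the waveform drops abruptly here
            phase -= 1.0
    return out


# Illustrative values only (assumed, not taken from the paper).
saw = naive_sawtooth(f0_hz=440.0, sample_rate=16000, num_samples=200)

# Largest sample-to-sample change: tiny on the ramp, close to 2 at a wrap.
largest_jump = max(abs(b - a) for a, b in zip(saw, saw[1:]))
```

On the ramp, consecutive samples differ by `2 * dt`; at a wrap they differ by `2 - 2 * dt`, so `largest_jump` sits just below the full peak-to-peak range.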
Proposed Improvements
The researchers introduce two modifications to the excitation stage of the DDSP-QbE subtractive synthesiser:
- Explicit Voicing Detection: The first modification adds explicit voicing detection, which gates the harmonic excitation so that the periodic component is suppressed in unvoiced regions of speech. Filtered noise is used in those regions instead of a periodic signal. This avoids the aliased harmonic content that would otherwise degrade the synthesised speech.
- Polynomial Band-Limited Step (PolyBLEP) Correction: The second modification applies a PolyBLEP correction to the phase-accumulated oscillator. It replaces the hard waveform discontinuity at each phase wrap with a smooth polynomial residual, which cancels the alias-generating components without requiring oversampling or spectral truncation. The result is a cleaner harmonic roll-off and a significant reduction in high-frequency artefacts.
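The two modifications can be sketched together in plain Python. This is an illustration under stated assumptions, not the authors' implementation: `poly_blep` uses the commonly cited two-sample polynomial residual, which may differ from the paper's exact formulation. The names `polyblep_sawtooth` and `gated_excitation`, the hard boolean gate, and the unfiltered uniform noise (a stand-in for the paper's filtered noise) are all hypothetical.

```python
import random


def poly_blep(t, dt):
    """Two-sample polynomial residual around a phase wrap (common textbook form)."""
    if t < dt:  # first sample after the wrap
        x = t / dt
        return x + x - x * x - 1.0
    if t > 1.0 - dt:  # last sample before the wrap
        x = (t - 1.0) / dt
        return x * x + x + x + 1.0
    return 0.0  # away from the discontinuity no correction is applied


def polyblep_sawtooth(f0_hz, sample_rate, num_samples):
    """Hypothetical helper: phase-accumulated sawtooth with PolyBLEP correction."""
    phase = 0.0
    dt = f0_hz / sample_rate  # phase increment per sample, in cycles
    out = []
    for _ in range(num_samples):
        out.append(2.0 * phase - 1.0 - poly_blep(phase, dt))
        phase += dt
        if phase >= 1.0:
            phase -= 1.0
    return out


def gated_excitation(harmonic, voiced, noise_gain=0.1, seed=0):
    """Hypothetical gate: harmonic signal where voiced, low-level noise elsewhere.

    Assumption: plain uniform noise stands in for the paper's filtered noise.
    """
    rng = random.Random(seed)
    out = []
    for h, v in zip(harmonic, voiced):
        noise = noise_gain * rng.uniform(-1.0, 1.0)
        out.append(h if v else noise)
    return out


# Illustrative values only: 440 Hz at 16 kHz, first half voiced, second half unvoiced.
harmonic = polyblep_sawtooth(440.0, 16000, 200)
voiced = [i < 100 for i in range(200)]
excitation = gated_excitation(harmonic, voiced)

# Crude time-domain proxy for smoothness: largest sample-to-sample change.
blep_jump = max(abs(b - a) for a, b in zip(harmonic, harmonic[1:]))
```

With the correction, the step at each wrap is spread over the two neighbouring samples, so the largest sample-to-sample change stays below 1.5. The uncorrected ramp jumps by nearly 2. A differentiable training pipeline would more likely use a soft voicing probability than the hard boolean gate shown here; that choice is an assumption of this sketch.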
Results and Impact
Combining the two modifications substantially improves the perceptual naturalness of the generated speech, as measured by Mean Opinion Score (MOS) evaluations. Listeners hear a cleaner signal with fewer high-frequency artefacts.
Notably, DDSP-QbE++ is designed to be lightweight and differentiable, so it integrates into the existing DDSP-QbE training pipeline without additional learnable parameters. This simplifies implementation and benefits the efficiency of training.
In conclusion, DDSP-QbE++ is a meaningful step forward for voice conversion, particularly for speech anonymisation of atypical speech. By addressing the aliasing introduced by the original DDSP-QbE excitation stage, it has the potential to improve the quality and usability of synthesised speech in real-world applications.
