Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck
Summary: arXiv:2604.05526v1
This paper presents the S4 team's submission to the Singing Voice Conversion Challenge 2025 (SVCC2025): a singing style conversion system aimed at fine-grained style conversion and control in in-domain settings. The work targets three challenges in singing voice conversion: style leakage, dynamic rendering, and high-fidelity generation with limited data.
Key Innovations
The S4 team introduces three key innovations to improve the singing style conversion process:
- Boundary-aware Whisper Bottleneck: This component pools phoneme-span representations to suppress residual source style while preserving the linguistic content. Collapsing frame-level detail within each phoneme is intended to limit style leakage from the source.
- Explicit Frame-Level Technique Matrix: Combined with targeted F0 processing during inference, this matrix yields stable and distinct dynamic style rendering. Specifying techniques frame by frame gives explicit control over where each one is applied.
- Perceptually Motivated High-Frequency Band Completion Strategy: This strategy uses an auxiliary standard 48 kHz SVC model to augment the high-frequency spectrum. It works around data scarcity without overfitting while keeping the output high-fidelity.
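The phoneme-span pooling behind the boundary-aware bottleneck can be illustrated with a minimal sketch. The summary only says that representations are "pooled" over phoneme spans, so the mean pooling, the broadcast of the pooled vector back to every frame, and the names `pool_phoneme_spans`, `frames`, and `spans` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def pool_phoneme_spans(
    frames: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Replace every frame with the mean feature vector of its phoneme span.

    `frames` is a T x D list of frame-level features; `spans` holds
    (start, end) frame indices per phoneme, with `end` exclusive.
    Frames not covered by any span are copied through unchanged.
    """
    pooled = [list(frame) for frame in frames]
    for start, end in spans:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid span ({start}, {end})")
        length = end - start
        dim = len(frames[start])
        # Assumption: mean pooling. The source does not name the pooling operator.
        mean = [
            sum(frames[t][d] for t in range(start, end)) / length
            for d in range(dim)
        ]
        # Broadcast the pooled vector back so the sequence length is preserved.
        for t in range(start, end):
            pooled[t] = list(mean)
    return pooled


frames = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
spans = [(0, 2), (2, 3)]  # two phonemes: frames 0-1 and frame 2
pooled = pool_phoneme_spans(frames, spans)
print(pooled)
```

After pooling, all frames inside one phoneme share a single vector, so any variation within the phoneme (a plausible carrier of source style) is no longer visible to later stages, while the phoneme identity and its duration are kept.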
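To make the idea of a frame-level technique matrix concrete, the sketch below builds a binary matrix with one row per frame and one column per technique from time-stamped annotations. The technique labels, the `(name, start, end)` annotation format, the binary encoding, and the function name `build_technique_matrix` are hypothetical; the summary does not describe how the matrix is constructed, and the F0 processing step is not modeled here.

```python
from typing import List, Sequence, Tuple

# Hypothetical technique labels; the source does not list them.
TECHNIQUES = ["vibrato", "falsetto", "breathy"]


def build_technique_matrix(
    segments: Sequence[Tuple[str, float, float]],
    num_frames: int,
    hop_seconds: float,
    techniques: Sequence[str] = TECHNIQUES,
) -> List[List[int]]:
    """Return a num_frames x len(techniques) matrix of 0/1 activations.

    Each segment is (technique_name, start_seconds, end_seconds); the
    end time is treated as exclusive after conversion to frame indices.
    """
    column = {name: k for k, name in enumerate(techniques)}
    matrix = [[0] * len(techniques) for _ in range(num_frames)]
    for name, start_s, end_s in segments:
        k = column[name]  # raises KeyError for an unknown technique
        first = max(0, round(start_s / hop_seconds))
        last = min(num_frames, round(end_s / hop_seconds))
        for t in range(first, last):
            matrix[t][k] = 1
    return matrix


# Two overlapping annotations on a 4-frame clip with a 0.5 s hop.
segments = [("vibrato", 0.5, 1.5), ("breathy", 1.0, 2.0)]
matrix = build_technique_matrix(segments, num_frames=4, hop_seconds=0.5)
for row in matrix:
    print(row)
```

Because each technique has its own column, several techniques can be active on the same frame, and editing a column changes where that technique is rendered without touching the others.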
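One simple way to picture high-frequency band completion is as a splice in the spectral domain: low-frequency bins come from the main system and bins above a cutoff come from the auxiliary 48 kHz model. The sketch below assumes both spectrograms share the same frame count and frequency-bin layout; the cutoff bin, the optional linear crossfade, and the name `complete_high_band` are illustrative assumptions, and the paper's actual procedure may differ.

```python
from typing import List, Sequence


def complete_high_band(
    main_spec: Sequence[Sequence[float]],
    aux_spec: Sequence[Sequence[float]],
    cutoff_bin: int,
    fade_bins: int = 0,
) -> List[List[float]]:
    """Blend two T x F magnitude spectrograms along the frequency axis.

    Bins below `cutoff_bin` come from `main_spec`. With `fade_bins == 0`
    bins at or above the cutoff come entirely from `aux_spec`; otherwise
    the auxiliary weight rises linearly from 0 at `cutoff_bin` to 1 at
    `cutoff_bin + fade_bins`.
    """
    if len(main_spec) != len(aux_spec):
        raise ValueError("spectrograms must have the same number of frames")
    merged = []
    for main_frame, aux_frame in zip(main_spec, aux_spec):
        if len(main_frame) != len(aux_frame):
            raise ValueError("frames must have the same number of bins")
        out = []
        for b, (m, a) in enumerate(zip(main_frame, aux_frame)):
            if fade_bins > 0:
                w = min(1.0, max(0.0, (b - cutoff_bin) / fade_bins))
            else:
                w = 1.0 if b >= cutoff_bin else 0.0
            out.append((1.0 - w) * m + w * a)
        merged.append(out)
    return merged


main_spec = [[1.0, 1.0, 1.0, 1.0]]  # one frame, four frequency bins
aux_spec = [[9.0, 9.0, 9.0, 9.0]]
hard = complete_high_band(main_spec, aux_spec, cutoff_bin=2)
soft = complete_high_band(main_spec, aux_spec, cutoff_bin=2, fade_bins=2)
print(hard)
print(soft)
```

The appeal of this kind of design is that the style-converting model does not have to learn the high band from scarce data; only the band it models well is kept, and the rest is borrowed from a model trained for full-band output.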
Performance and Evaluation
In the official SVCC2025 subjective evaluation, the S4 team’s system achieved the best naturalness among all submissions. It also remained competitive in speaker similarity and technique control while using significantly less extra singing data than other top-performing systems.
Conclusion
The S4 team’s system advances controllable singing style conversion by addressing style leakage, dynamic style rendering, and high-fidelity generation under limited data. Audio samples demonstrating the system’s natural-sounding conversions are available online.
