X-Voice: Zero-Shot Voice Cloning in 30 Languages

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Researchers have introduced X-Voice, a 0.4-billion-parameter multilingual zero-shot voice cloning model. The system clones arbitrary voices and lets the cloned voice speak any of 30 languages.

According to the research paper published on arXiv (arXiv:2605.05611v1), X-Voice is trained on a 420,000-hour multilingual corpus and uses the International Phonetic Alphabet (IPA) as a unified text representation. Because every language is transcribed into the same symbol set, the model works from one shared input vocabulary. This is intended to make voice cloning more accurate and consistent across languages, and to avoid some of the complex preprocessing traditionally used in voice synthesis.
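To make the idea of a unified representation concrete, the sketch below builds one shared symbol table from IPA transcriptions of three words in three languages. The tiny lexicon, the function names, and the character-level splitting are illustrative assumptions, not details from the paper; a real pipeline would obtain transcriptions from a grapheme-to-phoneme tool.

```python
# Toy lexicon: (language, word) -> IPA transcription.
# Hand-written for illustration only; not taken from the X-Voice paper.
TOY_IPA_LEXICON = {
    ("en", "see"): "siː",
    ("es", "sí"): "si",
    ("de", "sie"): "ziː",
}

def to_ipa_symbols(lang, word):
    """Look up a word and split its IPA transcription into single symbols."""
    return list(TOY_IPA_LEXICON[(lang, word)])

def build_shared_vocab(lexicon):
    """Collect every IPA symbol used by any language into one index table."""
    symbols = sorted({ch for ipa in lexicon.values() for ch in ipa})
    return {sym: idx for idx, sym in enumerate(symbols)}

def encode(lang, word, vocab):
    """Map a word to token IDs drawn from the shared vocabulary."""
    return [vocab[sym] for sym in to_ipa_symbols(lang, word)]

vocab = build_shared_vocab(TOY_IPA_LEXICON)
for lang, word in TOY_IPA_LEXICON:
    print(lang, word, encode(lang, word, vocab))
```

In this toy case the English and Spanish words start with the same two token IDs, which is the kind of cross-language sharing a single phonetic alphabet makes possible.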

Two-Stage Training Paradigm

The development of X-Voice involves a novel two-stage training paradigm designed to enhance the model’s efficiency and output quality:

Stage 1: The researchers established X-Voice$_{\text{s1}}$ using standard conditional flow-matching training. This stage produced approximately 10,000 hours of speaker-consistent segments, which serve as audio prompts for the next stage.
Stage 2: The model was then fine-tuned on these audio pairs with the prompt text masked, yielding X-Voice$_{\text{s2}}$. Because the prompt's text is hidden during fine-tuning, the resulting model can perform zero-shot voice cloning without a transcript of the audio prompt.
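The two stages can be sketched in a few lines of plain Python. The article does not spell out the exact objective or masking scheme, so this sketch assumes the straight-line probability path commonly used in flow-matching TTS, and it assumes masking means replacing prompt text tokens with a placeholder. All function names and the `<mask>` token are hypothetical.

```python
import random

def cfm_training_pair(x1, rng):
    """Build one flow-matching training example from a clean feature vector x1.

    Assumes the straight-line path x_t = (1 - t) * x0 + t * x1, whose
    velocity target is x1 - x0. Returns (t, x_t, target_velocity).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return t, x_t, target_velocity

def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def mask_prompt_text(prompt_tokens, target_tokens, mask_token="<mask>"):
    """Stage-2 idea (assumed form): hide the prompt transcript, keep the target text."""
    return [mask_token] * len(prompt_tokens) + list(target_tokens)

rng = random.Random(0)
clean_features = [0.5, -1.0, 2.0]           # stand-in for one acoustic feature frame
t, x_t, v_target = cfm_training_pair(clean_features, rng)
v_pred = [0.0 for _ in v_target]            # stand-in for the network's output
print("loss:", mse(v_pred, v_target))
print(mask_prompt_text(["h", "i"], ["b", "y", "e"]))
```

In a real system the velocity prediction would come from a neural network conditioned on the audio prompt and the text, and the loss would be averaged over many frames and time steps.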

Architectural Innovations

To support multilingual speech synthesis, X-Voice extends existing models such as F5-TTS. Key innovations include:

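Dual-Level Injection of Language Identifiers: Language identifiers are injected into the model at two levels, telling it which language to speak so that it can switch languages while keeping the speaker's voice consistent.
Decoupling and Scheduling of Classifier-Free Guidance: Classifier-free guidance steers sampling by combining the model's conditional and unconditional predictions. It is a sampling technique and does not involve a separate classifier. X-Voice decouples this guidance and schedules it, with the aim of improving the quality of the generated speech.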
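As a rough illustration of the second item, the sketch below shows one generic way to decouple classifier-free guidance across two conditions and to vary the guidance weight across sampling steps. The article does not say which conditions X-Voice decouples or which schedule it uses, so the text/language split, the linear schedule, and every name here are hypothetical.

```python
def guidance_weight(step, num_steps, w_start, w_end):
    """Linearly interpolate the guidance weight across sampling steps."""
    if num_steps <= 1:
        return w_start
    frac = step / (num_steps - 1)
    return w_start + frac * (w_end - w_start)

def decoupled_cfg(v_uncond, v_text, v_full, w_text, w_lang):
    """Combine three velocity predictions with one weight per condition.

    v_uncond: prediction with no conditioning
    v_text:   prediction conditioned on text only
    v_full:   prediction conditioned on text and language identifier
    """
    return [
        u + w_text * (t - u) + w_lang * (f - t)
        for u, t, f in zip(v_uncond, v_text, v_full)
    ]

# Stand-in predictions for a single two-dimensional frame.
v_uncond, v_text, v_full = [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]
num_steps = 5
for step in range(num_steps):
    w = guidance_weight(step, num_steps, w_start=3.0, w_end=1.0)
    guided = decoupled_cfg(v_uncond, v_text, v_full, w_text=w, w_lang=1.0)
    print(step, w, guided)
```

Setting both weights to 1 recovers the fully conditioned prediction and setting both to 0 recovers the unconditional one, so separate weights let each condition be emphasized independently.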

Performance Evaluation

Subjective and objective evaluations indicate that X-Voice significantly outperforms existing flow-matching-based multilingual systems such as LEMAS-TTS. Its zero-shot cross-lingual cloning also rivals that of much larger models at the billion-parameter scale, including Qwen3-TTS.

Commitment to Research Transparency

To foster research transparency and community advancement, the developers of X-Voice have made all related resources open-source. This is expected to encourage further work on voice synthesis and to broaden access to multilingual communication tools.

By combining a 0.4-billion-parameter model covering 30 languages with openly released resources, X-Voice could make it easier for people to communicate across language barriers.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
