Gelina: Unified Speech & Gesture Synthesis with Interleaved Tokens

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

Summary: arXiv:2510.12834v3

Introduction

Human communication is inherently multimodal, integrating both verbal and non-verbal elements such as speech and gestures. Despite the natural coupling of these modalities, most existing computational methods for generating speech and gestures tend to synthesize them sequentially. This approach often results in reduced synchrony and misalignment in prosody, which can hinder the overall effectiveness of communication. To address this challenge, we introduce Gelina, a novel framework designed to jointly synthesize speech and co-speech gestures from text using interleaved token sequences.

Overview of Gelina

Gelina is built on a discrete autoregressive backbone with modality-specific decoders. The backbone predicts speech and gesture tokens as a single interleaved sequence, so each new token is conditioned on the preceding tokens of both modalities. The aim is for the timing and expression of gestures to follow the spoken words more closely than in pipelines that synthesize the two modalities one after the other.
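To make the idea of an interleaved token sequence concrete, here is a minimal sketch in Python. It is not Gelina's actual tokenization: the fixed block layout (two speech tokens followed by one gesture token), the modality tags, and the function names are all assumptions chosen for illustration.

```python
from typing import List, Tuple

SPEECH, GESTURE = "S", "G"
Token = Tuple[str, int]  # (modality tag, codebook index); a hypothetical layout


def interleave(speech: List[int], gesture: List[int],
               speech_per_block: int = 2, gesture_per_block: int = 1) -> List[Token]:
    """Merge two token streams into one sequence of fixed-size blocks.

    The 2:1 block ratio is an assumption, not a value taken from the paper.
    """
    out: List[Token] = []
    s = g = 0
    while s < len(speech) or g < len(gesture):
        # Emit the speech tokens of this block, then its gesture tokens.
        for tok in speech[s:s + speech_per_block]:
            out.append((SPEECH, tok))
        s += speech_per_block
        for tok in gesture[g:g + gesture_per_block]:
            out.append((GESTURE, tok))
        g += gesture_per_block
    return out


def deinterleave(seq: List[Token]) -> Tuple[List[int], List[int]]:
    """Split an interleaved sequence back into per-modality streams."""
    speech = [tok for modality, tok in seq if modality == SPEECH]
    gesture = [tok for modality, tok in seq if modality == GESTURE]
    return speech, gesture


speech_tokens = [11, 12, 13, 14]
gesture_tokens = [7, 8]
mixed = interleave(speech_tokens, gesture_tokens)
```

Because both streams share one sequence, an autoregressive model trained on `mixed` sees recent speech tokens whenever it predicts a gesture token, and vice versa. Splitting the sequence with `deinterleave` would then hand each stream to its own decoder.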

Key Features

Joint Synthesis: Gelina synthesizes speech and gestures together, promoting better synchrony and alignment.
Interleaved Token Sequences: The framework utilizes a unique token generation strategy that interlaces speech and gesture tokens.
Modality-Specific Decoders: Dedicated decoders for speech and gestures ensure high-quality outputs for both modalities.
Multi-Speaker Cloning: Gelina supports the cloning of multiple speakers, enhancing its versatility in various applications.
Gesture-Only Synthesis: Given speech as input, the framework can also generate the accompanying gestures on their own.
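One plausible way an interleaved model could support gesture-only synthesis is to copy the known speech tokens into their slots and let the model fill only the gesture slots. The sketch below illustrates that idea under stated assumptions: the block layout, `gesture_only_decode`, and `toy_predictor` (a stand-in for a trained model) are hypothetical and are not described in the source.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, int]  # (modality tag, codebook index); a hypothetical layout


def gesture_only_decode(
    speech: List[int],
    predict_gesture: Callable[[List[Token]], int],
    speech_per_block: int = 2,
) -> List[int]:
    """Force known speech tokens into the sequence; predict only gesture slots."""
    context: List[Token] = []
    gestures: List[int] = []
    for start in range(0, len(speech), speech_per_block):
        # Speech slots are copied from the input rather than sampled.
        for tok in speech[start:start + speech_per_block]:
            context.append(("S", tok))
        # The gesture slot is the only position the model has to fill.
        g = predict_gesture(context)
        context.append(("G", g))
        gestures.append(g)
    return gestures


def toy_predictor(context: List[Token]) -> int:
    """Stand-in for a trained model: latest speech token modulo 4."""
    last_speech = [tok for modality, tok in context if modality == "S"][-1]
    return last_speech % 4


gestures = gesture_only_decode([11, 12, 13, 14], toy_predictor)
```

In a real system `predict_gesture` would be a call into the trained backbone, and the predicted gesture tokens would be passed to the gesture decoder to produce motion.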

Evaluation and Results

To validate Gelina, we conducted both subjective and objective evaluations. In the subjective evaluation, participants assessed the quality of the synthesized speech and gestures. The results indicate that Gelina achieves competitive speech quality while improving gesture generation over unimodal baselines, which supports joint synthesis as a viable approach to multimodal communication.

Conclusion

Gelina represents a pioneering step towards more natural and effective multimodal communication systems. By integrating speech and gesture synthesis into a unified framework, it not only enhances the quality of generated outputs but also sets the stage for future advancements in AI-driven communication technologies. The implications of this research extend beyond simple speech synthesis, paving the way for more immersive and interactive experiences in various domains, including virtual reality, animation, and human-computer interaction.

Future Directions

As we look to the future, further research will focus on refining the underlying algorithms and expanding the framework’s capabilities. This includes exploring additional modalities, increasing the diversity of synthesized gestures, and enhancing the real-time performance of the system. The ultimate goal is to create a comprehensive and adaptable synthesis solution that truly mirrors the richness of human communication.
