Gelina: Unified Speech & Gesture Synthesis with Interleaved Tokens

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

Source: arXiv:2510.12834v3

Introduction

Human communication is inherently multimodal, integrating verbal and non-verbal elements such as speech and gestures. Despite the natural coupling of these modalities, most existing computational methods generate them sequentially, which weakens synchrony and prosody alignment between the two. To address this, we introduce Gelina, a framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences.

Overview of Gelina

Gelina is built on a discrete autoregressive backbone with modality-specific decoders. Speech and gesture tokens are interleaved in a single sequence, so the backbone predicts both modalities together rather than one after the other. Interleaving is intended to keep the timing and expression of gestures closely aligned with the spoken words, which makes the output more natural than synthesizing each modality separately.
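To make the idea of an interleaved token sequence concrete, here is a minimal sketch. It is not Gelina's actual scheme: the vocabulary sizes, the fixed ratio of two speech tokens per gesture token, and the helper names `interleave` and `deinterleave` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout (an assumption, not taken from the paper):
# speech ids come first, gesture ids are shifted by SPEECH_VOCAB so that
# both modalities share a single token space.
SPEECH_VOCAB = 1024


def interleave(speech: List[int], gesture: List[int],
               speech_per_gesture: int = 2) -> List[int]:
    """Merge two token streams into one sequence of fixed-size frames.

    Each frame holds `speech_per_gesture` speech tokens followed by one
    gesture token shifted into its own id range.
    """
    if len(speech) != len(gesture) * speech_per_gesture:
        raise ValueError("streams must cover the same number of frames")
    merged: List[int] = []
    for i, g in enumerate(gesture):
        start = i * speech_per_gesture
        merged.extend(speech[start:start + speech_per_gesture])
        merged.append(SPEECH_VOCAB + g)
    return merged


def deinterleave(merged: List[int]) -> Tuple[List[int], List[int]]:
    """Split a merged sequence back into per-modality streams by id range."""
    speech = [t for t in merged if t < SPEECH_VOCAB]
    gesture = [t - SPEECH_VOCAB for t in merged if t >= SPEECH_VOCAB]
    return speech, gesture


merged = interleave([10, 11, 12, 13], [3, 4])
print(merged)                # [10, 11, 1027, 12, 13, 1028]
print(deinterleave(merged))  # ([10, 11, 12, 13], [3, 4])
```

In a layout like this, splitting by id range is what would let each stream be routed to its own modality-specific decoder after the backbone has produced the merged sequence.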

Key Features

  • Joint Synthesis: Gelina generates speech and gestures together rather than in sequence, which promotes better synchrony and alignment.
  • Interleaved Token Sequences: Speech and gesture tokens are interleaved in a single sequence, so each modality is predicted with the other in context.
  • Modality-Specific Decoders: Dedicated decoders for speech and for gestures turn the predicted tokens into each modality's output.
  • Multi-Speaker Cloning: Gelina supports cloning across multiple speakers rather than being tied to a single voice.
  • Gesture-Only Synthesis: Given speech input alone, the framework can generate the accompanying gestures.
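One way gesture-only synthesis could work in an interleaved autoregressive model is to copy the known speech tokens into their slots and sample only the gesture slots. The sketch below illustrates that idea under stated assumptions: the frame pattern (two speech tokens, then one gesture token), the id offset, and the `toy_predictor` stand-in for a trained backbone are all hypothetical, and the source does not describe Gelina's actual decoding procedure.

```python
from typing import Callable, List

SPEECH_VOCAB = 1024  # hypothetical: ids below this are speech, the rest gesture


def gesture_only_decode(
    speech: List[int],
    predict_gesture: Callable[[List[int]], int],
    speech_per_gesture: int = 2,
) -> List[int]:
    """Force the known speech tokens and predict only the gesture slots."""
    context: List[int] = []
    gestures: List[int] = []
    for start in range(0, len(speech), speech_per_gesture):
        # Speech slots: copy the given tokens instead of sampling them.
        context.extend(speech[start:start + speech_per_gesture])
        # Gesture slot: query the model, conditioned on everything so far.
        g = predict_gesture(context)
        gestures.append(g)
        context.append(SPEECH_VOCAB + g)
    return gestures


def toy_predictor(context: List[int]) -> int:
    """Stand-in for a trained backbone: a deterministic function of context."""
    return sum(context) % 512


print(gesture_only_decode([10, 11, 12, 13], toy_predictor))  # [21, 67]
```

Because every gesture token is predicted with the preceding speech tokens in context, the generated motion stays tied to the timing of the input speech, which is the property the interleaved design is meant to provide.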

Evaluation and Results

To validate Gelina, we conducted both subjective and objective evaluations. In the subjective evaluation, participants assessed the quality of the synthesized speech and gestures. The results indicate that Gelina achieves competitive speech quality while improving gesture generation over unimodal baselines, which suggests that joint synthesis can improve gestures without sacrificing speech quality.

Conclusion

Gelina is a step towards more natural multimodal communication systems. By integrating speech and gesture synthesis into a unified framework, it improves gesture generation while keeping speech quality competitive. The approach is relevant beyond speech synthesis alone, with potential applications in virtual reality, animation, and human-computer interaction.

Future Directions

Further research will focus on refining the underlying algorithms and expanding the framework's capabilities. This includes exploring additional modalities, increasing the diversity of synthesized gestures, and improving real-time performance. The long-term goal is an adaptable synthesis system that captures more of the richness of human communication.


Lazarus Omolua (https://richlyai.com/blog)