OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
Summary: arXiv:2512.00234v2
Open-source text-only translation large language models (LLMs) have improved substantially in both language coverage and translation quality. For speech translation (ST), however, they are typically used in cascaded pipelines: automatic speech recognition (ASR) runs first, and its transcript is then translated. The extra stage adds latency, which is especially harmful in simultaneous speech translation (SimulST), where output must be produced while the speaker is still talking.
Cascaded pipelines also cannot use multimodal context, such as images, that could help disambiguate meaning and intent. Pretrained multimodal foundation models (MMFMs), by contrast, have strong perception and reasoning capabilities across modalities, but they often lack the multilingual coverage and specialized translation quality of dedicated translation LLMs.
The Solution: OmniFusion
To address these limitations, OmniFusion integrates an MMFM with a translation LLM to form a single multimodal translation system that covers multiple languages and multiple input modalities.
Fusion Strategy
The fusion strategy in OmniFusion connects hidden states from multiple layers of a pretrained MMFM to a translation LLM. Because the two models are joined at the level of hidden states, the combined system can be trained jointly end to end and can translate from inputs in different modalities together.
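As an illustration of what connecting hidden states from several layers can look like, here is a minimal sketch in plain Python. It is not the OmniFusion implementation: this summary does not specify the fusion operator, so the softmax-weighted layer mix, the single linear projection, and every name used (`fuse_layers`, `project`, `build_llm_inputs`, `layer_logits`, `proj`) are assumptions made for the example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_states, layer_logits):
    """Mix hidden states from several MMFM layers into one sequence.

    layer_states: list over layers, each [seq_len][d_mmfm].
    layer_logits: one scalar per layer (hypothetical learnable parameter).
    """
    weights = softmax(layer_logits)
    seq_len = len(layer_states[0])
    d = len(layer_states[0][0])
    fused = [[0.0] * d for _ in range(seq_len)]
    for w, states in zip(weights, layer_states):
        for t in range(seq_len):
            for j in range(d):
                fused[t][j] += w * states[t][j]
    return fused


def project(fused, proj):
    """Map fused states [seq_len][d_mmfm] to the LLM width via proj [d_mmfm][d_llm]."""
    d_llm = len(proj[0])
    return [
        [sum(vec[i] * proj[i][j] for i in range(len(vec))) for j in range(d_llm)]
        for vec in fused
    ]


def build_llm_inputs(layer_states, layer_logits, proj, text_embeddings):
    """Prepend projected multimodal states to the translation LLM's text embeddings."""
    prefix = project(fuse_layers(layer_states, layer_logits), proj)
    return prefix + text_embeddings


# Toy dimensions, chosen only for the demonstration.
random.seed(0)
n_layers, seq_len, d_mmfm, d_llm = 3, 4, 6, 5
layer_states = [
    [[random.gauss(0, 1) for _ in range(d_mmfm)] for _ in range(seq_len)]
    for _ in range(n_layers)
]
layer_logits = [0.0] * n_layers  # equal weights before any training
proj = [[random.gauss(0, 0.1) for _ in range(d_llm)] for _ in range(d_mmfm)]
text_embeddings = [[0.0] * d_llm for _ in range(2)]  # stand-in for prompt tokens

inputs = build_llm_inputs(layer_states, layer_logits, proj, text_embeddings)
print(len(inputs), len(inputs[0]))  # 6 5
```

In a trained system the layer weights and the projection would be learned together with the rest of the model, which is what joint end-to-end training makes possible. The sketch only shows the data flow from multi-layer MMFM states into the input sequence of the translation LLM.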
Model Architecture
The OmniFusion model is built on two key components:
- Omni 2.5-7B: This serves as the MMFM, providing strong perceptual capabilities across audio and visual inputs.
- SeedX PPO-7B: This is the translation LLM, designed specifically for high-quality multilingual translation.
Performance and Results
OmniFusion supports several translation tasks, including:
- Speech-to-text translation
- Speech-and-image-to-text translation
- Text-and-image-to-text translation
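The three input combinations listed above can be made concrete with a small request type. This is a hypothetical sketch, not the OmniFusion API: `TranslationRequest`, its fields, and `task_type` are names invented for the example, and the rule that speech and source text are mutually exclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranslationRequest:
    """Hypothetical container for one translation input."""
    target_lang: str
    text: Optional[str] = None
    audio_path: Optional[str] = None
    image_path: Optional[str] = None


def task_type(req: TranslationRequest) -> str:
    """Classify a request into one of the three modes listed in this summary."""
    has_speech = req.audio_path is not None
    has_text = req.text is not None
    has_image = req.image_path is not None

    # Assumption: a request carries either speech or source text, not both.
    if has_speech and has_text:
        raise ValueError("provide either audio or text as the source, not both")
    if has_speech:
        return "speech-and-image-to-text" if has_image else "speech-to-text"
    if has_text and has_image:
        return "text-and-image-to-text"
    # Text-only and image-only inputs are not among the listed modes,
    # so this sketch does not classify them.
    raise ValueError("input combination is outside the three listed modes")


print(task_type(TranslationRequest("de", audio_path="talk.wav")))
print(task_type(TranslationRequest("de", audio_path="talk.wav", image_path="slide.png")))
print(task_type(TranslationRequest("de", text="Hello", image_path="photo.png")))
```

Making the image optional in the speech case mirrors the list: the same speech input can be translated with or without visual context.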
Experimental results indicate that OmniFusion makes effective use of both audio and visual inputs. In SimulST, it reduces latency by 1 second compared to cascaded pipelines, and it also improves overall translation quality.
Further Research and Development
Combining MMFMs with translation LLMs is a promising direction for multilingual multimodal translation. Further gains may come from improvements to the fusion architecture and to the training methods.
The code and further details about OmniFusion are available on GitHub.
