Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model
Summary: arXiv:2604.19635v1
Generative models have recently advanced Target Speaker Extraction (TSE) and set new benchmark results. However, these models rely on global context, which makes real-time deployment difficult. When adapted directly to streaming, they often suffer catastrophic degradation at inference time, primarily because the conditions seen during training differ from those encountered during streaming inference.
To address this gap for autoregressive (AR) models in TSE, the authors propose an approach designed specifically for streaming: the Chunk-wise Interleaved Splicing Paradigm, which enables efficient and stable streaming inference.
Key Innovations
- Chunk-wise Interleaved Splicing Paradigm: This paradigm processes speech in chunks, which allows smoother transitions and keeps the output coherent across segments.
- Historical Context Refinement Mechanism: This mechanism uses historical information from previous chunks to reduce boundary discontinuities and keep the extracted speech segments cohesive.
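As a rough illustration of the chunk-wise interleaving idea, the sketch below splices fixed-size chunks of two token sequences into one alternating sequence, so that an autoregressive model would see each mixture chunk before the matching target chunk. This summary does not give the exact sequence layout, so the function name `interleave_chunks`, the chunk size, and the integer tokens are all assumptions made for illustration. It is not the authors' implementation.

```python
from typing import List, Sequence


def interleave_chunks(
    mixture: Sequence[int], target: Sequence[int], chunk_size: int
) -> List[int]:
    """Splice two equal-length token sequences chunk by chunk.

    Assumed (hypothetical) layout:
    [mixture chunk 1, target chunk 1, mixture chunk 2, target chunk 2, ...]
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(mixture) != len(target):
        raise ValueError("mixture and target must have the same length")

    spliced: List[int] = []
    for start in range(0, len(mixture), chunk_size):
        end = start + chunk_size
        # Append the mixture chunk first, then the aligned target chunk.
        spliced.extend(mixture[start:end])
        spliced.extend(target[start:end])
    return spliced


# Toy integer tokens standing in for audio tokens (illustrative only).
mixture_tokens = [1, 2, 3, 4, 5]
target_tokens = [10, 20, 30, 40, 50]
print(interleave_chunks(mixture_tokens, target_tokens, chunk_size=2))
# prints [1, 2, 10, 20, 3, 4, 30, 40, 5, 50]
```

With this layout, a smaller chunk size means the model can start emitting target tokens after seeing less of the mixture, which is the latency trade-off a streaming setup has to manage.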
The authors evaluated the approach in extensive experiments on the Libri2Mix dataset. The results show a clear contrast between the AR generative baseline and the proposed method:
- The AR generative baseline exhibited notable performance degradation at low latencies.
- In contrast, the new approach maintained 100% stability and superior intelligibility, demonstrating its robustness in real-time applications.
- Furthermore, the streaming results achieved by this method are comparable to, and in some cases even surpass, those obtained from traditional offline baselines.
The model also achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs, which indicates its potential for latency-sensitive applications.
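For readers unfamiliar with the metric, RTF is conventionally defined as processing time divided by the duration of the audio processed, so values below 1 mean the system runs faster than real time. The short sketch below works through what an RTF of 0.248 implies. The helper names and the 10-second clip length are assumptions chosen for illustration, not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means audio is processed faster than it is played."""
    return rtf < 1.0


reported_rtf = 0.248   # value quoted in the summary
audio_seconds = 10.0   # hypothetical clip length, for illustration only

# Compute time implied by the reported RTF for the hypothetical clip.
implied_processing_seconds = reported_rtf * audio_seconds
print(f"{implied_processing_seconds:.2f} s of compute for {audio_seconds:.0f} s of audio")
print(is_faster_than_real_time(reported_rtf))
```

At this RTF, roughly a quarter of each second of audio is spent on computation, leaving headroom for buffering and other pipeline stages.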
Conclusion
This work provides empirical evidence that autoregressive generative backbones are viable for real-time Target Speaker Extraction. The Chunk-wise Interleaved Splicing Paradigm improves streaming performance and opens avenues for further work on streaming audio processing.
