AFSS: Artifact-Focused Self-Synthesis for Mitigating Bias in Audio Deepfake Detection
Source: arXiv:2603.26856v1 (cross-listed)
Rapid advances in generative models have made it possible to create highly realistic audio deepfakes, which poses significant challenges for detection systems. Current audio deepfake detectors suffer from a bias problem that leads to poor generalization to unseen datasets. To address this, the authors propose Artifact-Focused Self-Synthesis (AFSS), a method aimed at mitigating bias and improving the reliability of audio deepfake detection.
Introduction to AFSS
AFSS introduces two mechanisms for generating pseudo-fake audio samples from authentic recordings: self-conversion and self-reconstruction. Both rest on the core idea of AFSS, which is to enforce same-speaker constraints so that each generated pseudo-fake sample keeps the same speaker identity and semantic content as the original recording. With speaker and content held fixed, the detector is directed to focus on generation artifacts rather than on irrelevant confounding factors that may skew results.
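The pairing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: `convert` and `reconstruct` are hypothetical callables standing in for a voice conversion model and a vocoder or codec, the toy versions below merely perturb the waveform, and treating self-conversion as conversion toward a reference utterance from the same speaker is an assumed reading of the summary.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

Wave = List[float]  # toy waveform: a list of samples


@dataclass
class Sample:
    wave: Wave
    speaker: str
    label: int   # 0 = real, 1 = pseudo-fake
    source: str  # "real", "self_reconstruction", or "self_conversion"


def build_training_set(
    real: Dict[str, List[Wave]],
    convert: Callable[[Wave, Wave], Wave],   # hypothetical: (source, reference) -> wave
    reconstruct: Callable[[Wave], Wave],     # hypothetical: wave -> resynthesized wave
    seed: int = 0,
) -> List[Sample]:
    """Derive pseudo-fakes from real audio while keeping the speaker fixed."""
    rng = random.Random(seed)
    samples: List[Sample] = []
    for speaker, waves in real.items():
        for wave in waves:
            samples.append(Sample(wave, speaker, 0, "real"))
            # Self-reconstruction: resynthesize the same utterance.
            samples.append(Sample(reconstruct(wave), speaker, 1, "self_reconstruction"))
            # Self-conversion (assumed): the reference comes from the SAME speaker,
            # so speaker identity and content do not change.
            reference = rng.choice(waves)
            samples.append(Sample(convert(wave, reference), speaker, 1, "self_conversion"))
    return samples


# Toy stand-ins so the sketch runs without any audio model.
def toy_reconstruct(wave: Wave) -> Wave:
    return [round(x, 1) for x in wave]  # coarse quantization as a fake "artifact"


def toy_convert(wave: Wave, reference: Wave) -> Wave:
    peak = max(abs(r) for r in reference) or 1.0
    return [x * peak for x in wave]


real_audio = {"spk_a": [[0.12, -0.34], [0.5, 0.25]], "spk_b": [[0.9, -0.9]]}
dataset = build_training_set(real_audio, toy_convert, toy_reconstruct)
```

Because every pseudo-fake is derived from a real utterance of the same speaker, the only systematic difference between the two classes in such a set is whatever the synthesis step leaves behind.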
Key Features of AFSS
- Same-Speaker Constraints: By ensuring that real and pseudo-fake samples share the same speaker identity and semantic content, AFSS allows detectors to concentrate on the artifacts generated during synthesis.
- Learnable Reweighting Loss: This loss function dynamically emphasizes synthetic samples during training, with the degree of emphasis learned rather than fixed by hand.
- Comprehensive Dataset Testing: AFSS has been tested across seven diverse datasets, showcasing its versatility and robustness in various scenarios.
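To make the reweighting idea from the list above concrete, here is a minimal sketch of one plausible form: a weighted binary cross-entropy in which pseudo-fake samples receive a weight controlled by a scalar parameter. The summary does not give the actual formulation, so the parameter name `alpha`, the `1 + softplus(alpha)` weight, and the normalization are all assumptions made for illustration.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always positive."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def bce(p: float, y: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def reweighted_loss(probs: Sequence[float], labels: Sequence[int], alpha: float) -> float:
    """Weighted BCE where label 1 (pseudo-fake) gets weight 1 + softplus(alpha).

    `alpha` is a plain float here; in a training loop it would be a trainable
    parameter updated together with the detector (assumption).
    """
    w_fake = 1.0 + softplus(alpha)  # never below the real-sample weight of 1
    weights = [w_fake if y == 1 else 1.0 for y in labels]
    total = sum(w * bce(p, y) for w, p, y in zip(weights, probs, labels))
    return total / sum(weights)  # normalize so the scale stays comparable


probs = [0.1, 0.6]   # predicted probability of "fake"
labels = [0, 1]      # 0 = real, 1 = pseudo-fake
low_emphasis = reweighted_loss(probs, labels, alpha=-50.0)  # weight is about 1
high_emphasis = reweighted_loss(probs, labels, alpha=2.0)   # weight is about 3.13
```

In this example the pseudo-fake sample is the harder one, so raising `alpha` increases the loss and with it the pressure to fit synthetic samples.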
Performance and Results
Extensive experiments show that AFSS achieves state-of-the-art performance in audio deepfake detection. The method reaches an average Equal Error Rate (EER) of 5.45%, including 1.23% EER on WaveFake and 2.70% on In-the-Wild. Notably, AFSS achieves these results without requiring pre-collected fake datasets.
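For readers unfamiliar with the metric, EER is the error rate at the decision threshold where the false acceptance rate and the false rejection rate are equal. The sketch below is a simple threshold sweep in plain Python, not the evaluation code used in the paper. It assumes higher scores mean "more likely fake" and, since the two rates rarely match exactly on finite data, it reports their average at the closest threshold.

```python
from typing import Sequence


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER. labels: 1 = fake, 0 = real; higher score = more fake."""
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    if n_fake == 0 or n_real == 0:
        raise ValueError("need at least one real and one fake sample")

    best_gap = float("inf")
    eer = 1.0
    # Candidate thresholds: every observed score, plus one above all scores.
    for threshold in sorted(set(scores)) + [float("inf")]:
        # A sample is flagged as fake when its score is >= threshold.
        false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        misses = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        far = false_alarms / n_real  # real audio wrongly flagged as fake
        frr = misses / n_fake        # fake audio wrongly accepted as real
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])    # 0.0
overlapping = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])  # 0.5
```

An EER of 5.45% therefore means that, at the balanced threshold, roughly one in eighteen real clips is flagged and roughly one in eighteen fakes is missed.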
Conclusion
Artifact-Focused Self-Synthesis is a step toward more reliable audio deepfake detection systems. By addressing the biases present in current detectors and focusing on generation artifacts, AFSS improves generalization to unseen datasets and offers a basis for future research in this area. The code is publicly available on GitHub.
