Unsupervised Evaluation of Deep Audio Embeddings for Music Structure Analysis
Source: arXiv:2603.27218v1
Type: Cross
Abstract
Music Structure Analysis (MSA) aims to uncover the high-level organization of musical pieces. State-of-the-art methods are often based on supervised deep learning, but these methods are bottlenecked by the need for heavily annotated data and by inherent structural ambiguities. In this paper, we propose an unsupervised evaluation of nine open-source, generic pre-trained deep audio models on MSA.
Key Findings
The findings rest on the following evaluation setup for audio embeddings in MSA:
- Barwise embeddings were extracted from each model and segmented using three unsupervised segmentation algorithms.
- The segmentation algorithms used include:
  - Foote’s checkerboard kernels
  - Spectral clustering
  - Correlation Block-Matching (CBM)
- The evaluation focuses exclusively on boundary retrieval, that is, locating the points where one section ends and the next begins.
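The first steps of this pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes frame-level embeddings, frame timestamps, and bar (downbeat) times are already available. Mean pooling per bar, cosine similarity, and the kernel half-size are choices made for the sketch, and all function and parameter names are hypothetical.

```python
import numpy as np

def barwise_pool(frames, frame_times, bar_times):
    """Average frame-level embeddings inside each bar.

    Assumption: mean pooling, and every bar contains at least one frame.
    The source does not state how barwise embeddings are pooled.
    """
    bars = []
    for start, end in zip(bar_times[:-1], bar_times[1:]):
        mask = (frame_times >= start) & (frame_times < end)
        bars.append(frames[mask].mean(axis=0))
    return np.stack(bars)

def self_similarity(bars):
    """Cosine self-similarity matrix between bar embeddings."""
    unit = bars / np.linalg.norm(bars, axis=1, keepdims=True)
    return unit @ unit.T

def checkerboard_kernel(half_size):
    """+1 on the two diagonal blocks, -1 on the two off-diagonal blocks."""
    sign = np.concatenate([-np.ones(half_size), np.ones(half_size)])
    return np.outer(sign, sign)

def novelty_curve(ssm, half_size=4):
    """Slide the kernel along the main diagonal of the zero-padded matrix.

    novelty[i] is high when bars i-half_size..i-1 and bars i..i+half_size-1
    are each internally similar but dissimilar to one another, which
    suggests that a new section starts at bar i.
    """
    kernel = checkerboard_kernel(half_size)
    padded = np.pad(ssm, half_size, mode="constant")
    width = 2 * half_size
    return np.array([
        np.sum(padded[i:i + width, i:i + width] * kernel)
        for i in range(ssm.shape[0])
    ])
```

Peaks of the novelty curve are then taken as boundary candidates. The peak-picking strategy is left out here because the source does not describe it.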
Performance Comparison
The results indicate that modern, generic deep embeddings generally outperform traditional spectrogram-based baselines, although not every model does so. Furthermore, the unsupervised boundary estimation methodology used in the study outperformed recent linear probing baselines.
Most Effective Techniques
Among the evaluated techniques, the Correlation Block-Matching (CBM) algorithm emerged as the most effective downstream segmentation method, highlighting its potential utility in MSA tasks.
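To give an intuition for block-based segmentation, the sketch below uses dynamic programming to pick the boundaries that maximize a sum of per-segment scores computed on a barwise self-similarity matrix. It is a simplified illustration of the general idea, not the published CBM algorithm. The scoring function (off-diagonal block sum divided by segment length) and the names `block_score`, `segment_dp`, and `max_size` are assumptions made for this example.

```python
import numpy as np

def block_score(ssm, start, end):
    """Score the segment covering bars [start, end).

    Assumption: off-diagonal similarity inside the block, divided by the
    segment length. This is a stand-in chosen for the sketch, not the
    scoring used by the published algorithm.
    """
    block = ssm[start:end, start:end]
    return (block.sum() - np.trace(block)) / (end - start)

def segment_dp(ssm, max_size=32):
    """Return bar indices of boundaries maximizing the total segment score."""
    n = ssm.shape[0]
    best = np.full(n + 1, -np.inf)   # best[e]: best total score for bars [0, e)
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)  # prev[e]: start of the last segment
    for end in range(1, n + 1):
        for start in range(max(0, end - max_size), end):
            score = best[start] + block_score(ssm, start, end)
            if score > best[end]:
                best[end] = score
                prev[end] = start
    # Backtrack from the end of the piece to recover the boundaries.
    boundaries = [n]
    while boundaries[-1] > 0:
        boundaries.append(int(prev[boundaries[-1]]))
    return boundaries[::-1]
```

On a matrix made of two homogeneous 8-bar blocks, this sketch places a single internal boundary at bar 8. Excluding the diagonal matters: without it, splitting a homogeneous block into single bars would score as well as keeping it whole.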
Standard Evaluation Metrics
One of the critical points raised in the paper is the artificial inflation of standard evaluation metrics in music structure analysis. The authors advocate for a systematic adoption of “trimming,” or even “double trimming,” annotations to establish more rigorous MSA evaluation standards.
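The effect of trimming can be illustrated with a small boundary hit-rate computation. The sketch below assumes that trimming means discarding the outermost boundaries (typically the start and end of the track, which are trivial to find) from both the annotation and the estimate before scoring. The source does not define "double trimming", so treating it as `n_trim=2` is only one possible reading. The greedy matching is a simplification of the bipartite matching used by standard toolkits, and all names are hypothetical.

```python
import numpy as np

def trim_boundaries(boundaries, n_trim=0):
    """Drop the n_trim outermost boundaries on each side (assumed meaning of trimming)."""
    boundaries = np.sort(np.asarray(boundaries, dtype=float))
    if n_trim == 0:
        return boundaries
    return boundaries[n_trim:-n_trim]

def boundary_f_measure(reference, estimate, tolerance=0.5, n_trim=0):
    """Hit-rate F-measure with a +/- tolerance window, in the same unit as the inputs."""
    ref = trim_boundaries(reference, n_trim)
    est = trim_boundaries(estimate, n_trim)
    if len(ref) == 0 or len(est) == 0:
        return 0.0
    # Greedy one-to-one matching over the two sorted lists.
    hits, i, j = 0, 0, 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1
        else:
            i += 1
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical boundaries in seconds: only the track start and end are found.
reference = [0, 10, 20, 30, 40]
estimate = [0, 12, 25, 40]
untrimmed = boundary_f_measure(reference, estimate)          # about 0.44
trimmed = boundary_f_measure(reference, estimate, n_trim=1)  # 0.0
```

In this made-up example, an estimate that finds no internal boundary still scores about 0.44 without trimming, purely from the track start and end, and 0.0 once they are removed.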
Conclusion
Taken together, the results support unsupervised evaluation as a practical way to compare pre-trained audio embeddings for MSA: generic deep embeddings generally, though not uniformly, outperform spectrogram-based baselines, and CBM is the strongest of the three segmentation algorithms tested. The authors' call for trimmed annotations also shows that more rigorous evaluation standards are needed before reported MSA scores can be compared with confidence.
