ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
Contact: {pablogj, ortega, amiguel, lleida}@unizar.es
DOI: https://doi.org/10.21437/Interspeech.2021-309
This paper presents a study on the use of new unsupervised representations obtained through wav2vec models, seeking to jointly model speech and music fragments of audio signals in a multiclass audio segmentation task. Previous studies have already described the capabilities of deep neural networks in binary and multiclass audio segmentation tasks. In particular, the separation of speech, music and noise signals through audio segmentation yields competitive results when a combination of perceptual and musical features is used as input to a neural network. Wav2vec representations have been successfully applied to several speech processing applications. In this study, they are considered for the multiclass audio segmentation task presented in the Albayzín 2010 evaluation.
We compare the use of different representations obtained through unsupervised learning with our previous results on this database, obtained with a traditional set of features, under different conditions. Experimental results show that wav2vec representations can improve the performance of audio segmentation systems for classes containing speech, while degrading the segmentation of isolated music. This trend is consistent across all the experiments conducted. On average, the use of unsupervised representation learning leads to a relative improvement close to 6.8% on the segmentation task.
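As a rough illustration of the pipeline described above (frame-level wav2vec representations feeding a multiclass segmentation classifier), the following is a minimal sketch, not the system evaluated in the paper. It assumes the publicly available HuggingFace "facebook/wav2vec2-base" checkpoint, and the linear classification head and class set are hypothetical placeholders.

```python
# Minimal sketch: wav2vec 2.0 frame embeddings feeding a multiclass
# segmentation head. Checkpoint, head and class labels are assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

NUM_CLASSES = 4  # e.g. speech, music, noise, speech-over-music (assumption)

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
classifier = nn.Linear(encoder.config.hidden_size, NUM_CLASSES)  # hypothetical head

def segment(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Return one class label per wav2vec frame (~20 ms hop at 16 kHz)."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate,
                       return_tensors="pt")
    with torch.no_grad():
        frames = encoder(**inputs).last_hidden_state  # (1, T, hidden_size)
        logits = classifier(frames)                   # (1, T, NUM_CLASSES)
    return logits.argmax(dim=-1).squeeze(0)           # per-frame class indices

# Usage example: 10 s of silence stands in for a real audio signal.
labels = segment(torch.zeros(160000))
print(labels.shape)
```

In such a setup, the pretrained encoder replaces the hand-crafted perceptual and musical features mentioned above, while the downstream classifier would be trained on the segmentation labels of the target database.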