ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
Contact: {pablogj, ortega, amiguel, lleida}@unizar.es
DOI: https://doi.org/10.21437/Interspeech.2021-309
In this paper, we describe the ViVoLab speech activity detection (SAD) system submitted to the Fearless Steps Challenge Phase III. Over the last few years, this series of challenges has proposed a number of speech processing tasks dealing with audio from the Apollo space missions. This edition focuses on the generalisation capabilities of the submitted systems, with new evaluation data drawn from different channels.
Our submission builds on the unsupervised representation learning paradigm, seeking a more discriminative audio representation than traditional perceptual features such as log Mel-filterbank energies. These learned features are used to train several variants of a convolutional recurrent neural network (CRNN). Experimental results show that the features obtained via unsupervised learning provide a considerably more robust representation, significantly reducing the mismatch observed between results on the development and evaluation partitions. Our results largely outperform the organisers' baseline, achieving a DCF of 2.98% on the evaluation set and ranking third among all participating teams.
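To make the model family concrete, the following is a minimal PyTorch sketch of a generic CRNN for frame-level SAD: a convolutional front-end over the input features, a recurrent layer for longer temporal context, and a per-frame sigmoid output. The class name, layer sizes, and 64-band input are illustrative assumptions, not the architecture actually submitted.

    import torch
    import torch.nn as nn

    class CRNNSad(nn.Module):
        # Hypothetical CRNN for speech activity detection; sizes are
        # placeholders, not those of the ViVoLab submission.
        def __init__(self, n_feats=64, hidden=128):
            super().__init__()
            # Convolutional front-end: local time-frequency patterns.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),
                nn.BatchNorm2d(32),
                nn.ReLU(),
                nn.MaxPool2d((1, 2)),   # pool along frequency only
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),
                nn.ReLU(),
                nn.MaxPool2d((1, 2)),
            )
            # Recurrent back-end: temporal context across frames.
            self.rnn = nn.GRU(64 * (n_feats // 4), hidden,
                              batch_first=True, bidirectional=True)
            # Per-frame speech/non-speech posterior.
            self.out = nn.Linear(2 * hidden, 1)

        def forward(self, x):           # x: (batch, frames, n_feats)
            x = x.unsqueeze(1)          # -> (batch, 1, frames, n_feats)
            x = self.conv(x)            # -> (batch, 64, frames, n_feats // 4)
            x = x.permute(0, 2, 1, 3).flatten(2)  # -> (batch, frames, feat)
            x, _ = self.rnn(x)
            return torch.sigmoid(self.out(x)).squeeze(-1)  # (batch, frames)

    # Example: per-frame speech posteriors for a 500-frame utterance.
    probs = CRNNSad()(torch.randn(1, 500, 64))

As a reference point for the reported figure, earlier Fearless Steps editions scored SAD with a detection cost function of the form DCF = 0.75 * P_miss + 0.25 * P_FA, where lower is better; we assume here that Phase III retains the same weighting.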