LIUM - Laboratoire d'Informatique de l'Université du Mans
ViVoLab, Aragon Institute for Engineering Research (I3A)
Contact: martin.lebourdais[@]rit.fr ; {marie.tahon, anthony.larcher}@univ-lemans.fr
When processing audio data, multiple challenges arise, one of them being the diversity of information present in the audio signal. Several audio segmentation subtasks have emerged, including voice activity detection (VAD), overlapped speech detection (OSD), and music or noise detection. These tasks are usually addressed by separate models trained on different datasets, which increases computational costs and limits their usage to specific datasets.
We first show that a multiclass VAD and OSD model outperforms state-of-the-art models. We then propose 3MAS, a novel deep learning-based audio segmentation model capable of handling multiple datasets and addressing multiple tasks simultaneously as a multilabel segmentation problem.
3MAS reaches performance similar to that of specialized models with a comparable architecture and can be trained on partial and unbalanced annotations from different datasets. 3MAS reduces computational time and opens new opportunities to include new labels.
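As an illustration of how a multilabel segmentation model could be trained on corpora with partial annotations, the sketch below shows a frame-level multilabel head with a masked binary cross-entropy loss. This is not the authors' implementation: the label set, module names, and masking scheme are assumptions made for the example.

```python
# Hypothetical sketch (not the 3MAS code): a multilabel frame-level
# segmentation head trained with a masked binary cross-entropy, so that a
# batch mixing datasets with partial annotations (e.g. a corpus labelled
# only for VAD and OSD) only contributes gradients for the labels it annotates.
import torch
import torch.nn as nn

LABELS = ["speech", "overlap", "music", "noise"]  # assumed label set

class MultiLabelSegHead(nn.Module):
    """Maps frame-level features to one independent logit per label."""
    def __init__(self, feat_dim: int, n_labels: int = len(LABELS)):
        super().__init__()
        self.linear = nn.Linear(feat_dim, n_labels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) -> logits: (batch, frames, n_labels)
        return self.linear(feats)

def masked_bce_loss(logits, targets, label_mask):
    """BCE averaged only over labels that the sample's source corpus annotates.

    logits, targets: (batch, frames, n_labels)
    label_mask:      (batch, n_labels), 1 if the corpus annotates that label.
    """
    per_frame = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    mask = label_mask.unsqueeze(1).expand_as(per_frame)
    return (per_frame * mask).sum() / mask.sum().clamp(min=1.0)

if __name__ == "__main__":
    batch, frames, feat_dim = 2, 100, 64
    head = MultiLabelSegHead(feat_dim)
    feats = torch.randn(batch, frames, feat_dim)
    targets = torch.randint(0, 2, (batch, frames, len(LABELS))).float()
    # Second sample comes from a corpus annotated for speech/overlap only.
    label_mask = torch.tensor([[1., 1., 1., 1.], [1., 1., 0., 0.]])
    loss = masked_bce_loss(head(feats), targets, label_mask)
    loss.backward()
    print(loss.item())
```

The masking step is what allows partially annotated corpora to be mixed in a single training set: unlabelled classes are simply excluded from the loss rather than treated as negatives.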