Overlapped speech occurs when multiple speakers are simultaneously active. It can severely degrade the performance of automatic speech processing systems such as speaker diarization. Overlapped speech detection (OSD) aims at detecting the time segments in which several speakers are simultaneously active. Recent deep neural network architectures have shown impressive results in the close-talk scenario. However, performance tends to deteriorate in the context of distant speech. Under these conditions, microphone arrays are often used because the multichannel signals they record carry spatial information.
This paper investigates the use of the self-attention channel combinator (SACC) as a feature extractor for OSD. The model is also extended to the complex domain (cSACC) to improve the interpretability of the approach. Results show that, with self-attentive models, distant OSD performance approaches that of the near-field condition. A detailed analysis of the cSACC combination weights further shows that the self-attention module focuses on the speakers' direction.
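
To make the idea of attention-based channel combination concrete, the following is a minimal sketch and not the architecture evaluated in this paper. It assumes that per-channel scores (and, for the complex variant, per-channel phases) are already available; in a SACC-like model they would be produced by a learned self-attention module, which is omitted here. The function names `combine_magnitudes` and `combine_complex`, the inputs `scores` and `phases`, and the toy frame are all hypothetical and chosen only for illustration.

```python
import cmath
import math


def softmax(scores):
    """Numerically stable softmax over a list of real scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_magnitudes(channel_spectra, scores):
    """Weighted sum of per-channel magnitude spectra for one time frame.

    channel_spectra: list (channels) of lists (frequency bins) of complex values.
    scores: one real-valued score per channel (assumed given; a real model
            would compute them with self-attention).
    """
    weights = softmax(scores)
    n_bins = len(channel_spectra[0])
    return [
        sum(w * abs(spec[f]) for w, spec in zip(weights, channel_spectra))
        for f in range(n_bins)
    ]


def combine_complex(channel_spectra, scores, phases):
    """Complex-weighted sum: softmax magnitudes with a per-channel phase term.

    This is one possible way to form complex weights, used here as an
    assumption; the phase term can compensate an inter-channel delay.
    """
    weights = [w * cmath.exp(1j * p) for w, p in zip(softmax(scores), phases)]
    n_bins = len(channel_spectra[0])
    return [
        sum(w * spec[f] for w, spec in zip(weights, channel_spectra))
        for f in range(n_bins)
    ]


# Toy frame: 2 channels, 2 frequency bins.
# Channel 1 equals channel 0 rotated by a quarter cycle (multiplied by 1j).
frame = [[1 + 0j, 0 + 2j], [0 + 1j, -2 + 0j]]

magnitudes = combine_magnitudes(frame, [0.0, 0.0])
aligned = combine_complex(frame, [0.0, 0.0], [0.0, -math.pi / 2])
```

In this toy case, a phase of `-pi/2` on the second channel undoes its quarter-cycle rotation, so the complex combination adds the two channels coherently. This is the kind of behaviour that makes complex weights easier to relate to a spatial direction than real-valued weights alone.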