Deep Neural Networks and Model-Based Approaches for Robust Speaker Diarization in Naturalistic Audio Streams




Speaker diarization is an unsupervised task that determines "who spoke and when" within an input audio stream. A diarization system consists of four sub-systems: (i) speech activity detection (SAD); (ii) speaker segmentation and modeling; (iii) speaker clustering; and (iv) re-segmentation. Previous diarization systems have addressed telephone and/or meeting recordings captured under relatively clean conditions, but fail on naturalistic audio streams. Naturalistic audio, such as the CRSS-PLTL corpus, consists of short speaker turns and distortions including noise, reverberation, overlapped speech, and other miscellaneous human non-speech vocalizations. These factors pose significant challenges for speaker diarization in naturalistic audio. This dissertation formulates several systems to enhance speaker diarization, resulting in four contributions.

The first contribution advances SAD based on frequency-dependent kernel (FDK-SAD) features and three alternate decision backends, namely: (i) variable model-size GMM (VMGMM); (ii) Hartigan dip test based robust feature clustering (DipSAD); and (iii) cumulative density based linear curve (D-SAD). Evaluations employ open-source corpora such as NIST OpenSAD-2015, NIST OpenSAT-2017, redDots, and the CRSS-PLTL corpus. CRSS-PLTL contains multi-stream audio recordings from a student-led STEM teaching model at UT Dallas.

Second, novel architectures based on the SincNet convolutional neural network are developed for speaker identification and diarization. The proposed models generalize well with smaller training data, making them an attractive choice for transfer learning (TL). The standard SincNet architecture is expanded by introducing both additive margin (AM)-Softmax and Center loss, which leads to four architectures for speaker diarization, namely: (i) standard SincNet; (ii) AM-SincNet; (iii) AM-CL-SincNet; and (iv) CL-SincNet. We leverage TL for training SincNet models on two datasets: (1) TIMIT and (2) the LibriSpeech corpus.
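The AM-Softmax objective mentioned above subtracts a fixed margin from the target class's cosine similarity before the softmax, forcing embeddings of the same speaker to cluster more tightly. The following is a minimal NumPy sketch of that loss; the scale `s` and margin `m` values are illustrative assumptions, not the settings used in the dissertation:

```python
import numpy as np

def am_softmax_loss(embeddings, weights, labels, s=30.0, m=0.35):
    """Additive margin (AM)-Softmax loss sketch.

    embeddings: (N, D) speaker embeddings
    weights:    (C, D) class weight vectors
    labels:     (N,) integer speaker labels
    s, m:       scale and additive margin (illustrative values)
    """
    # L2-normalize rows so dot products become cosine similarities
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                   # (N, C) cosines
    # subtract the additive margin m from the target-class cosine only
    cos_m = cos.copy()
    cos_m[np.arange(len(labels)), labels] -= m
    logits = s * cos_m
    # numerically stable log-softmax, then mean negative log-likelihood
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Because the margin lowers the target-class logit, the loss with m > 0 is strictly larger than plain scaled softmax on the same inputs, which is what drives the tighter intra-speaker clusters.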
Diarization evaluations are conducted on the UT Dallas CRSS-PLTL and AMI meeting corpora.

Third, three novel algorithms are proposed for speaker clustering: (i) mixture of von Mises-Fisher distributions (movMF); (ii) normalized fuzzy C-means clustering (NFCM); and (iii) Toeplitz inverse covariance-based speaker clustering (TIC). While TIC is more computationally complex than movMF and NFCM, it outperforms both.

Finally, speech systems are proposed for knowledge extraction and interaction analysis of Peer-Led Team Learning (PLTL) sessions using unsupervised or pre-trained models. We leverage the CRSS Speech Profiler for detecting four low-level attributes, namely: (i) emotion; (ii) Lombard effect; (iii) whispered speech; and (iv) physical task stress. These low-level attributes are used for unsupervised PLTL interaction analysis aimed at assessing student engagement. Evaluations on both publicly available corpora and UT Dallas PLTL data confirm the impact of the proposed algorithmic advancements for diarization in naturalistic audio streams. Taken collectively, these dissertation contributions advance a number of processing sub-tasks to achieve effective, robust speaker diarization in naturalistic streams.
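To illustrate the movMF clustering contribution above: von Mises-Fisher mixtures model unit-norm speaker embeddings on the hypersphere, and in the hard-assignment limit (ignoring concentration parameters) the fit reduces to spherical k-means driven by cosine similarity. The sketch below is that simplified limit, not the dissertation's full EM formulation; the farthest-point initialization is an assumption added for determinism:

```python
import numpy as np

def spherical_kmeans(X, k, iters=50):
    """Hard-assignment movMF clustering (spherical k-means sketch).

    X: (N, D) speaker embeddings; rows are L2-normalized onto the
    unit hypersphere, where cosine similarity drives assignment.
    Returns integer cluster labels and unit-norm mean directions.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # farthest-point initialization: start from the first embedding,
    # then repeatedly add the point least similar to chosen centers
    mu = [X[0]]
    for _ in range(1, k):
        sims = np.max(X @ np.array(mu).T, axis=1)
        mu.append(X[np.argmin(sims)])
    mu = np.array(mu)
    for _ in range(iters):
        labels = np.argmax(X @ mu.T, axis=1)       # cosine assignment
        for j in range(k):
            members = X[labels == j]
            if len(members):
                v = members.sum(axis=0)
                mu[j] = v / np.linalg.norm(v)      # renormalized mean direction
    return labels, mu
```

In a diarization pipeline, X would hold per-segment speaker embeddings and the returned labels would group segments by speaker; the renormalized means play the role of the movMF mean directions.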



Supervised learning (Machine learning), Automatic speech recognition, Analysis of covariance, Neural networks (Computer science), Streaming audio


©2019 Harishchandra Dubey