Deep Neural Networks and Model-Based Approaches for Robust Speaker Diarization in Naturalistic Audio Streams

Dubey, Harishchandra

dc.contributor.ORCID	0000-0003-0476-3884 (Dubey, H)
dc.contributor.advisor	Hansen, John H. L.
dc.creator	Dubey, Harishchandra
dc.date.accessioned	2019-10-13T03:41:44Z
dc.date.available	2019-10-13T03:41:44Z
dc.date.created	2019-08
dc.date.issued	2019-08
dc.date.submitted	August 2019
dc.date.updated	2019-10-13T03:43:55Z
dc.description.abstract	Speaker diarization is an unsupervised task that determines "who spoke and when" within an input audio stream. It consists of four sub-systems: (i) speech activity detection (SAD); (ii) speaker segmentation and modeling; (iii) speaker clustering; and (iv) re-segmentation. Previous diarization systems have addressed telephone and/or meeting recordings in cleaner conditions, but fail in naturalistic audio streams. Naturalistic audio, such as the CRSS-PLTL corpus, consists of short speaker turns and distortions including noise, reverberation, overlapped speech, and other miscellaneous human non-speech vocalizations. These factors pose challenges for speaker diarization in naturalistic audio. This dissertation formulates several systems to enhance speaker diarization, resulting in four contributions. The first contribution advances SAD based on frequency-dependent kernel (FDK-SAD) features and three alternate decision backends, namely: (i) Variable model-size GMM (VMGMM); (ii) Hartigan dip test based robust feature clustering (DipSAD); and (iii) Cumulative density based linear curve (D-SAD). Evaluations employ open-source corpora such as NIST OpenSAD-2015, NIST OpenSAT-2017, RedDots, and the CRSS-PLTL corpus. CRSS-PLTL contains multi-stream audio recordings from the UT Dallas student-led STEM teaching model. Second, novel architectures are developed based on the SincNet convolutional neural network for speaker identification and diarization. The proposed models generalize well with smaller training data and are hence an attractive choice for transfer learning (TL). The standard SincNet architecture is expanded by introducing both additive margin (AM)-Softmax and Center loss, which leads to four novel architectures, namely (i) standard SincNet, (ii) AM-SincNet, (iii) AM-CL-SincNet, and (iv) CL-SincNet, for speaker diarization. We leverage TL for training SincNet models on two training datasets: (1) TIMIT and (2) the LibriSpeech corpus. 
Diarization evaluations are conducted on the UT Dallas CRSS-PLTL and AMI meeting corpora. Third, three novel algorithms are proposed for speaker clustering: (i) Mixture of von Mises-Fisher distributions (movMF); (ii) Normalized Fuzzy C-means clustering (NFCM); and (iii) Toeplitz Inverse Covariance-based speaker clustering (TIC). While TIC is computationally more complex than movMF and NFCM, it outperforms both. Finally, speech systems are proposed for knowledge extraction and interaction analysis of Peer-Led Team Learning (PLTL) sessions using unsupervised or pre-trained models. We leverage the CRSS Speech Profiler for detecting four low-level attributes, namely: (i) Emotion recognition; (ii) Lombard effect; (iii) Whisper detection; and (iv) Physical task stress. These low-level attributes are used for unsupervised PLTL interaction analysis aimed at assessing student engagement. The resulting evaluations on both publicly available corpora and UT Dallas PLTL data confirm the impact of the proposed algorithmic advancements for diarization in naturalistic audio streams. Taken collectively, the dissertation contributions advance a number of processing sub-tasks to achieve effective, robust speaker diarization in naturalistic streams.
dc.format.mimetype	application/pdf
dc.identifier.uri	https://hdl.handle.net/10735.1/7002
dc.language.iso	en
dc.rights	©2019 Harishchandra Dubey
dc.subject	Supervised learning (Machine learning)
dc.subject	Automatic speech recognition
dc.subject	Analysis of covariance
dc.subject	Neural networks (Computer science)
dc.subject	Streaming audio
dc.title	Deep Neural Networks and Model-Based Approaches for Robust Speaker Diarization in Naturalistic Audio Streams
dc.type	Dissertation
dc.type.material	text
thesis.degree.department	Electrical Engineering
thesis.degree.grantor	The University of Texas at Dallas
thesis.degree.level	Doctoral
thesis.degree.name	PHD

Files

Original bundle

Name: ETD-5608-013-DUBEY-260268.93.pdf
Size: 31.83 MB
Format: Adobe Portable Document Format
Description: Dissertation

License bundle

Name: LICENSE.txt
Size: 1.85 KB
Format: Plain Text
Description:

Name: PROQUEST_LICENSE.txt
Size: 5.85 KB
Format: Plain Text
Description:

Collections

UTD Theses and Dissertations