Deep Neural Network Based Representation Learning and Modeling for Robust Speaker Recognition








Automatic Speaker Verification (ASV) involves determining a person’s identity from audio streams. ASV provides a natural and efficient way to perform biometric identity authentication. Text-independent speaker verification, which does not require a fixed input phrase, can significantly help verify or retrieve a target speaker. Speaker recognition has many applications today, including audio surveillance, computer access control, and voice authentication. Smart home devices such as Google Home, Amazon Alexa, and Apple HomePod can also benefit from ASV for personalized voice applications. This dissertation addresses four related research problems. First, we investigate the impact of non-linear distortion caused by waveform peak clipping on automatic speech-based systems. We begin by defining various forms of clipping and then explore their potential impact on practical speech systems and speech corpora. Next, we present an overview of audio quality assessment, illustrate the effect that clipping has on automatic speaker recognition systems, and report findings from an investigation into the occurrence of clipping in a variety of data sets used by the speech community. Second, we present an unsupervised Adversarial Discriminative Domain Adaptation (ADDA) method for speaker verification when training and testing data have mismatched conditions. ADDA requires only the source data and unlabeled target-domain data to learn an asymmetric mapping that adapts the target-domain feature encoder to the source domain. Experimental results demonstrate that ADDA-trained speaker embeddings perform well on speaker classification for target-domain data and are less sensitive to language mismatch. In the third topic, a generalized global context modeling framework is proposed for speaker recognition.
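As an illustration of the peak-clipping distortion studied in the first topic, hard clipping can be simulated by flattening any sample whose magnitude exceeds a threshold, and a crude detector can count how many samples sit at or beyond that threshold. This is a minimal sketch, not the dissertation's method; the function names and thresholds are assumptions for illustration only.

```python
import math

def hard_clip(samples, threshold=0.5):
    """Simulate hard peak clipping: samples beyond +/-threshold are flattened."""
    return [max(-threshold, min(threshold, s)) for s in samples]

def clipped_fraction(samples, threshold=0.999):
    """Crude clipping detector: fraction of samples at or beyond the threshold.

    A high fraction suggests the waveform was clipped at (or near) full scale.
    """
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= threshold) / len(samples)

# Example: a 440 Hz sine at 16 kHz, clipped at half of full scale.
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
clipped = hard_clip(signal, threshold=0.5)
```

In practice the detector threshold is set just below digital full scale, since clipped recordings show an abnormal concentration of samples pinned at the maximum amplitude.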
We first present a data-driven, attention-based global time-frequency context model that better captures long-range time-frequency dependencies and channel variances. It combines the Non-local block and the Squeeze-and-Excitation block to adaptively recalibrate the learned feature map and direct time-frequency attention to specific regions. Further, we propose a data-independent, Discrete Cosine Transform (DCT) based global context model, with a multi-DCT attention mechanism that improves modeling power by using different DCT bases. We also use global context information to enhance important channels and recalibrate salient time-frequency locations by computing the similarity between the global context and local features. We show that the proposed global context modeling method can be easily incorporated into a CNN model with little additional computational cost and improves speaker verification performance by a large margin. Lastly, in topic four, we investigate the effects of reverberation and noise on self-supervised speaker verification. To normalize extrinsic variations between two random segments taken from one spoken utterance, several alternative training data augmentation strategies are investigated. We systematically simulate different levels and types of reverberation and noise on the test data for performance comparison. The experiments show a clear correlation between microphone distance, reverberation time, signal-to-noise ratio, and verification performance. Taken collectively, these investigative studies contribute to a more comprehensive understanding of speaker recognition and advance algorithm robustness for real-world speaker systems.
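The channel-recalibration idea behind the global context models can be sketched in the Squeeze-and-Excitation style: pool the time-frequency plane into one global context vector per channel, pass it through a small bottleneck network, and use the resulting per-channel gates to rescale the feature map. This NumPy sketch is illustrative only; the layer shapes, reduction ratio `r`, and function names are assumptions, not the dissertation's exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_recalibrate(feature_map, w1, w2):
    """Squeeze-and-Excitation style channel recalibration (illustrative sketch).

    feature_map: (channels, time, freq) array.
    w1, w2: bottleneck weights of shapes (channels // r, channels)
            and (channels, channels // r).
    """
    # Squeeze: global average pooling over the time-frequency plane -> (channels,)
    context = feature_map.mean(axis=(1, 2))
    # Excitation: bottleneck MLP (ReLU then sigmoid) gives one gate per channel in (0, 1)
    gates = sigmoid(w2 @ np.maximum(w1 @ context, 0.0))
    # Recalibrate: scale each channel of the feature map by its gate
    return feature_map * gates[:, None, None]

rng = np.random.default_rng(0)
C, T, F, r = 8, 10, 12, 2
x = rng.standard_normal((C, T, F))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = se_recalibrate(x, w1, w2)
```

Because each gate lies in (0, 1), the operation can only attenuate channels relative to the input; learned weights let the network suppress uninformative channels while preserving speaker-discriminative ones.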



Engineering, Electronics and Electrical, Computer Science