Deep Neural Network Based Speaker Verification Under Domain Mismatched Conditions
Speaker verification (SV), which offers a natural and flexible solution for biometric authentication, has been actively studied over the past decades. Machine learning models usually achieve high performance under the assumption that training and test sets are independent and identically distributed (I.I.D.); however, their performance degrades dramatically when they are tested on non-I.I.D. samples. In speaker detection/verification research, this is called domain mismatch. Although recent advances have greatly improved the reliability of SV systems, they are still prone to performance degradation under many domain-mismatched conditions. Generally, domain mismatch can be categorized into two classes: (i) intrinsic-speaker mismatch, which is the variability introduced by changes within the speaker her/himself (e.g., duration, stress, emotion, and age); and (ii) extrinsic-speaker mismatch, which is usually caused by external environmental factors such as noise or channel distortion. In this dissertation, we aim to develop SV systems that are robust against domain mismatch. In the first part of this dissertation, novel i-vector and deep neural network (DNN) based non-neutral speech detection systems are investigated. As a preprocessing stage for actual SV systems, this portion of the study aims to identify when the SV systems are at risk, which in turn can improve their robustness. Specifically, for non-neutral speech detection (i.e., physical stress detection and spoofing detection in this dissertation), different benchmarks, such as i-vector systems and several deep learning architectures, are examined, and a novel system that simultaneously employs a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) is proposed. 
Features including the previously formulated Teager Energy Operator Critical Band Autocorrelation Envelope (TEO-CB-Auto-Env), Perceptual Minimum Variance Distortionless Response (PMVDR), and a more general spectrogram are also incorporated as inputs to the proposed frameworks. Experiments on non-neutral speech corpora show that the proposed methods achieve improved system performance. In the second part of this dissertation, we focus on the development of novel speaker embedding systems for SV tasks. We propose a novel text-independent SV framework based on the triplet loss and a deep CNN architecture, in which a fixed-dimension embedding is extracted as an alternative speaker representation to replace the previously dominant i-vector model. On various standard SV corpora with intrinsic-speaker or extrinsic-speaker domain mismatch, our proposed approaches achieve significant performance improvements over traditional frameworks. Lastly, transfer learning based domain adaptation methods are employed to further improve the performance of the triplet loss based speaker embedding system on the UTScopePhysical corpus. Overall, the proposed individual components result in strong systems for DNN based speaker verification.
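To illustrate the triplet loss on which the speaker embedding framework is built, a minimal sketch is given here. It is not the dissertation's implementation: the cosine distance, the margin value of 0.2, and the function names `cosine_distance` and `triplet_loss` are assumptions chosen for illustration, and the embeddings are plain lists standing in for the output of a CNN. The loss is zero when the anchor is closer to the same-speaker (positive) embedding than to the different-speaker (negative) embedding by at least the margin.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on speaker embeddings.

    anchor and positive are embeddings from the same speaker; negative is
    from a different speaker. The margin (an assumed value) sets how much
    closer the positive must be than the negative before the loss is zero.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings (hypothetical values, not from any corpus).
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])  # well separated
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])  # violated triplet
```

In this toy case the well-separated triplet contributes no loss, while the violated triplet is penalized by the distance gap plus the margin, which is what pushes same-speaker embeddings together during training.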