Deep Neural Network Based Speaker Verification Under Domain Mismatched Conditions
Abstract
Speaker verification (SV), which offers a natural and flexible solution for biometric authentication, has been actively studied in the past decades. Machine learning models usually
achieve high performance with the independent and identically distributed (I.I.D.) train/test
set assumption; however, their performance degrades dramatically when tested on non-I.I.D.
samples. In speaker detection/verification research, this is called domain mismatch. Although recent advancements have greatly improved the reliability of SV systems,
they are still prone to performance degradation under many domain mismatched conditions.
Generally, domain mismatch can be categorized into two classes: (i) intrinsic-speaker mismatch, the variability introduced by changes within the speaker
(e.g., duration, stress, emotion, and age); and (ii) extrinsic-speaker mismatch, which is
usually caused by external environmental factors such as noise or channel distortion. In
this dissertation, we aim to develop SV systems which are robust against domain mismatch.
In the first part of this dissertation, novel i-vector and deep neural network (DNN) based
non-neutral speech detection systems are investigated. As a preprocessing stage for actual SV
systems, this portion of the study aims to identify when an SV system is at risk, which in turn
can improve its robustness. Specifically, for non-neutral speech detection (i.e.,
physical stress detection and spoofing detection in this dissertation), different benchmark systems,
such as i-vector systems and several deep learning architectures, are examined, and a
novel system that simultaneously employs a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) is proposed. Features including the previously formulated Teager
Energy Operator Critical Band Autocorrelation Envelope (TEO-CB-Auto-Env), Perceptual
Minimum Variance Distortionless Response (PMVDR) and a more general spectrogram are
also incorporated as the input to the proposed frameworks. Experiments on non-neutral
speech corpora show that the proposed methods achieve improved system performance. In
the second part of this dissertation, we focus on the development of novel speaker embedding systems for SV tasks. We propose a novel text-independent SV framework based on the
triplet loss and a deep CNN architecture, where a fixed-dimension embedding is extracted as
an alternative speaker representation to replace the previously dominant i-vector model. On
various standard SV corpora with intrinsic-speaker or extrinsic-speaker domain mismatch, our
proposed approaches achieve significant performance improvements over traditional frameworks. Lastly, transfer learning based domain adaptation methods are employed to further
improve the performance of the triplet loss based speaker embedding system on the UTScopePhysical corpus. Overall, the proposed individual components result in strong systems for
DNN based speaker verification.
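
As a rough illustration of the triplet loss objective mentioned above, the following minimal sketch computes the loss for a single (anchor, positive, negative) triplet of speaker embeddings. The abstract does not state the distance measure, the margin, or whether embeddings are length-normalized, so the squared Euclidean distance, the margin of 0.2, the L2 normalization, and all function names here are assumptions made purely for illustration, not the dissertation's actual configuration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one triplet of embeddings.

    anchor and positive are assumed to come from the same speaker,
    negative from a different speaker. The loss is zero once the
    negative is farther from the anchor than the positive by at
    least `margin` (a hypothetical value, not taken from the source).
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)


# Toy 3-dimensional embeddings, made up for illustration only.
anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]   # close to the anchor (same speaker)
negative = [0.0, 1.0, 0.0]   # far from the anchor (different speaker)

easy = triplet_loss(anchor, positive, negative)  # constraint already satisfied
hard = triplet_loss(anchor, negative, positive)  # roles swapped: constraint violated
```

In this toy case the first triplet already satisfies the margin and contributes no loss, while the swapped triplet is penalized. In training, a loss of this form would be averaged over many triplets and minimized with respect to the embedding network's parameters.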