Learning Based Algorithms for Speech Applications Under Domain Mismatch Conditions
Abstract
Recent years have seen tremendous growth in the use of voice based virtual assistants. Communicating with the digital world via voice is becoming an easy and pleasant experience, since this mode of interaction is far more natural than text entry through a keyboard. To this end, many virtual assistants can now accomplish much more than answering specific questions. For instance, today's virtual assistants can order a taxi or arrange food delivery based solely on a user's voice commands. A core component of all these assistants is an effective Automatic Speech Recognition (ASR) engine that can recognize the words spoken by a user and act on them. Another key component is a Speaker Identification (SID) based voice authentication stage that grants access to the device only to the intended user.
With the widespread use of commercial speech products across the world, it has become critical to also identify the language of the speaker as a front-end task, since almost all ASR models currently in use are language-specific. To deliver customized content to any particular user, it has also become increasingly important to identify speaker-specific traits such as gender, age, stress, and emotional state.
Far-field ASR has also received considerable attention recently, since removing the constraint that the subject be close to the microphone capturing the audio allows more natural
interaction with the virtual assistant. In this dissertation, we propose novel learning based
algorithms to address several critical problems that impact various speech applications.
Specifically, we address the issue of robust speaker and gender identification for severely
degraded audio. We also propose a deep neural network (DNN) based approach to language
identification. Furthermore, we propose a novel approach to improve the performance
of far-field ASR systems. The specific contributions of this dissertation are described next.
First, we present an unsupervised domain adaptation strategy for noise robust gender identification on utterances from the DARPA RATS corpora, where a relative reduction in Equal Error Rate (EER) of 14.75% is achieved using a state-of-the-art i-Vector Probabilistic Linear Discriminant Analysis (PLDA) back-end.
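To illustrate the kind of adaptation involved, the sketch below re-centers and re-whitens i-Vectors using statistics estimated from unlabeled target-domain data before PLDA scoring. This is a minimal sketch, not the exact procedure of the dissertation; the array shapes and the `plda_score` interface are assumptions made for illustration.

```python
import numpy as np

def estimate_domain_stats(ivectors):
    """Estimate mean and whitening transform from unlabeled i-Vectors.

    ivectors: (n_utterances, dim) array of target-domain i-Vectors.
    """
    mu = ivectors.mean(axis=0)
    cov = np.cov(ivectors, rowvar=False)
    # Eigendecomposition of the covariance yields a symmetric whitening matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, 1e-10))) @ eigvecs.T
    return mu, whitener

def adapt(ivectors, mu, whitener):
    """Re-center and re-whiten i-Vectors, then length-normalize."""
    shifted = (ivectors - mu) @ whitener
    norms = np.linalg.norm(shifted, axis=1, keepdims=True)
    return shifted / np.maximum(norms, 1e-10)

# Usage (hypothetical arrays): statistics come only from unlabeled
# degraded (target-domain) audio, never from labeled source data.
# mu, W = estimate_domain_stats(target_ivecs)
# scores = plda_score(adapt(enroll_ivecs, mu, W), adapt(test_ivecs, mu, W))
```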
Next, we propose a novel 2-step DNN based solution to language identification (LID), where an initial DNN trained on the in-set languages
is used to create an augmented set with out-of-set examples to train a second DNN that can
explicitly detect out-of-set languages. Our proposed approach is shown to offer significant
improvements over a baseline system for the NIST 2015 Language Recognition Evaluation
i-Vector Machine Learning Challenge, and reduces a NIST defined cost function by up to
32.91% (relative).
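A minimal sketch of this two-step idea is given below, using scikit-learn MLPs in place of the dissertation's DNNs; the confidence threshold, layer sizes, and feature shapes are assumptions, and the actual out-of-set augmentation in the dissertation may differ.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_two_step_lid(X_inset, y_inset, X_unlabeled, threshold=0.5):
    """Two-step LID: an in-set DNN flags likely out-of-set (OOS) utterances,
    which then augment the training set for a second DNN with an OOS class."""
    # Step 1: train a DNN on the known (in-set) languages only.
    dnn1 = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=200)
    dnn1.fit(X_inset, y_inset)

    # Utterances the first DNN is unsure about become candidate OOS examples.
    probs = dnn1.predict_proba(X_unlabeled)
    X_oos = X_unlabeled[probs.max(axis=1) < threshold]

    # Step 2: train a second DNN with an explicit OOS class appended.
    oos_label = int(np.max(y_inset)) + 1
    X_aug = np.vstack([X_inset, X_oos])
    y_aug = np.concatenate([y_inset, np.full(len(X_oos), oos_label)])
    dnn2 = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=200)
    dnn2.fit(X_aug, y_aug)
    return dnn2
```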
This dissertation also advances noise robust speaker identification. To this end, we propose a Curriculum Learning (CL) based approach to noise robust SID that is shown to reduce the EER by up to 20.07% (relative) on severely degraded utterances of the DARPA RATS SID task. Specifically, we propose CL based PLDA estimation and CL based training of the i-Vector extractor matrix.
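The core curriculum idea, ordering training material from least to most degraded and growing the training pool in stages, can be sketched as follows. The SNR difficulty proxy, the stage schedule, and the `train_plda` trainer are illustrative assumptions, not the dissertation's exact recipe.

```python
import numpy as np

def curriculum_stages(features, labels, snr, n_stages=4):
    """Yield progressively larger training sets, easiest (highest SNR) first."""
    order = np.argsort(-snr)              # highest SNR = easiest, comes first
    for stage in range(1, n_stages + 1):
        cutoff = int(len(order) * stage / n_stages)
        idx = order[:cutoff]
        yield features[idx], labels[idx]

# Usage: re-estimate the PLDA model at each stage, warm-starting from the
# previous stage, so noisier data is introduced gradually.
# plda = None
# for X_stage, y_stage in curriculum_stages(X, y, snr_estimates):
#     plda = train_plda(X_stage, y_stage, init=plda)   # hypothetical trainer
```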
Finally, we improve the performance of end-to-end far-field ASR systems by proposing a novel CL based approach using Bidirectional Long Short-Term Memory (BLSTM) networks that is shown to reduce the Word Error Rate (WER) by 10.1% (relative) on the AMI corpus.
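For reference, a minimal BLSTM acoustic encoder with curriculum-ordered training might look like the sketch below. The layer sizes, the utterance-length difficulty proxy, and the training loop are assumptions for illustration, not the dissertation's configuration.

```python
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Bidirectional LSTM acoustic encoder with a per-frame output layer."""
    def __init__(self, n_feats=80, hidden=320, n_layers=3, n_outputs=500):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, num_layers=n_layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_outputs)  # 2x for both directions

    def forward(self, x):                 # x: (batch, time, n_feats)
        out, _ = self.blstm(x)
        return self.proj(out)             # (batch, time, n_outputs)

# Curriculum ordering: present easier utterances first within each epoch
# (shorter utterances, used here as an assumed difficulty proxy).
# utterances = sorted(dataset, key=lambda u: u["num_frames"])
# for utt in utterances:
#     logits = model(utt["feats"].unsqueeze(0))
#     ...  # compute the loss (e.g., CTC) and update parameters
```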
Taken collectively, the contributions of this dissertation constitute effective steps towards improving the robustness of voice based interactive systems.