Learning Based Algorithms for Speech Applications Under Domain Mismatch Conditions
Recent years have seen tremendous growth in the use of voice based virtual assistants. Communicating with the digital world via voice is becoming an easy and pleasant experience, since this mode of interaction is far more natural than text based keyboard entry. Many virtual assistants can now accomplish much more than just providing answers to specific questions. For instance, a user can order a taxi or have food delivered to their home using voice commands alone. A core component of all of these assistants is an effective Automatic Speech Recognition (ASR) engine that can recognize words spoken by humans and act on them. Another key component is a voice authentication phase, based on a Speaker Identification (SID) system, that allows only the intended user to access the device. With the widespread use of commercial speech products across the world, it has also become critical to identify the language of the speaker as a front-end task, since almost all ASR models currently in use are language specific. To deliver customized content to a particular user, it has become increasingly important to identify speaker-specific traits such as gender, age, stress, and emotional state, among others. Far-field ASR has also received considerable attention recently, since removing the constraint that the speaker be in close proximity to the microphone allows more natural interaction with the virtual assistant. In this dissertation, we propose novel learning based algorithms to address several critical problems that impact various speech applications. Specifically, we address the issue of robust speaker and gender identification for severely degraded audio. We also propose a deep neural network (DNN) based approach to language identification.
Furthermore, we propose a novel approach to improve the performance of far-field ASR systems. The specific contributions of this dissertation are described next. First, we present an unsupervised domain adaptation based strategy for noise robust gender identification on utterances from the DARPA RATS corpora, where a relative reduction in Equal Error Rate (EER) of +14.75% is achieved using a state-of-the-art i-Vector Probabilistic Linear Discriminant Analysis (PLDA) back-end. Next, we propose a novel 2-step DNN based solution to language identification (LID), where an initial DNN trained on in-set languages is used to create an augmented set with out-of-set examples, which is then used to train a second DNN that can explicitly detect out-of-set languages. Our proposed approach is shown to offer significant improvements over a baseline system for the NIST 2015 Language Recognition Evaluation i-Vector Machine Learning Challenge, and reduces a NIST defined cost function by up to 32.91% (relative). We also advance noise robust speaker identification in this dissertation. To this end, we propose a Curriculum Learning (CL) based approach to noise robust SID, comprising CL based PLDA estimation and CL based training of the i-Vector extractor matrix, that is shown to reduce the EER by up to +20.07% (relative) on severely degraded utterances of the DARPA RATS SID task. Finally, we improve the performance of end-to-end far-field ASR systems by proposing a novel CL based approach using Bidirectional Long Short-Term Memory (BLSTM) networks that is shown to reduce the Word Error Rate (WER) by 10.1% (relative) on the AMI corpus. Taken collectively, these investigations take effective steps towards improving the robustness of voice based interactive systems.
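The results above are stated as relative reductions in EER, WER, or a cost function. As a minimal sketch of how such figures are commonly computed, and not the scoring tools used in this dissertation, the Python below approximates the EER from two lists of trial scores and expresses an improvement relative to a baseline. The function names, the threshold sweep, and all scores and error values are hypothetical and chosen only for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative reduction, in percent, of an error metric against a baseline."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical scores, not taken from any corpus.
targets = [0.9, 0.8, 0.7, 0.35]
nontargets = [0.6, 0.4, 0.3, 0.2]
print(equal_error_rate(targets, nontargets))   # 0.25

# Hypothetical error rates in percent: 20.0 (baseline) versus 15.0 (improved).
print(relative_reduction(20.0, 15.0))          # 25.0
```

Under this convention, a relative figure says how much of the baseline error was removed, so the same relative reduction corresponds to a smaller absolute gain when the baseline error is already low.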