Advancements in Domain Adaptation for Speaker Recognition and Effective Speaker De-Identification
Abstract
Recent advancements in machine learning and artificial intelligence have significantly changed the way humans interact with machines, and voice-assistant-based solutions are a prominent example. Since speech is the most natural form of human communication, voice assistant devices have gained wide user acceptance and have become a pleasant way to address everyday needs, including access to current news and events. These voice-based technologies have been made possible by advanced, robust processing of speech signals.
Depending on the application, various speech processing techniques are required to achieve an effective and robust overall solution. Speech recognition is required when the text content of spoken words is needed, for example when adding text captions to broadcast news or YouTube videos. If a service should become available based on who is interacting with the device, speaker recognition becomes a required step; for example, if an individual gains access to a data account (e.g., music, voice-mail, health or financial records), effective speaker recognition is needed for that service. A single request may therefore call for a range of speech processing solutions. Other areas of speech processing that benefit human-machine interaction include language/dialect recognition, speech enhancement, machine translation, speech synthesis, voice conversion, and general diarization.
The environment in which a person interacts with a device and the input tools employed (such as a phone or microphone) can affect performance. Intrinsic/extrinsic mismatch between training data and application data is common; in other words, the data used to train speech processing systems often differs from the data seen at test time. These variations need to be considered when developing effective speech systems, especially when mismatched conditions degrade performance significantly. In this dissertation, we study the problem of speaker recognition under domain mismatch. Recognizing the identity of a speaker is an important task in speaker-dependent applications, and it is very important to provide robust performance regardless of how the training data was captured and of environmental/extrinsic changes during the application phase.
In this dissertation, we propose two categories of solutions to address the mismatch problem in speaker recognition: discriminant-analysis-based adaptation methods (generalized discriminant analysis, GDA, and support vector discriminant analysis, SVDA) and a deep-learning-based adaptation technique (a-vector speaker embeddings). The proposed solutions are evaluated on the NIST SRE-10, NIST SRE-16, and NIST SRE-18 tasks. GDA and SVDA achieved 20% and 32% improvements, respectively, in equal error rate (EER) on the SRE-10 task. A-vectors combined with SVDA achieved up to 18% improvement over the previous best-performing solution on the SRE-16 task. In addition, we propose a solution for the speaker de-identification task.
In more detail, the first category of solutions we propose is based on domain mismatch compensation with discriminant analysis methods. Traditional speaker recognition systems use linear discriminant analysis to reduce the dimensionality of speaker embeddings and to provide more discriminative feature representations for speaker classes. We propose non-linear discriminant analysis, in the form of generalized discriminant analysis, to compensate for variabilities introduced during recording. In addition, domain adaptation is incorporated through our proposed support vector discriminant analysis method, which also improves discrimination by considering the boundary structure of speaker classes.
The second category of solutions is based on domain mismatch compensation with deep learning approaches. We propose a deep-learning-based technique that compensates for unwanted directions and information included in speaker embeddings and provides domain-invariant speaker representations.
Finally, we address advancements in speaker de-identification to help protect confidential speaker or text content within a given audio stream. Taken collectively, these three areas represent technological advancements that strengthen speaker recognition and make it more useful in commercial, personal, and governmental/societal applications that incorporate human-speech engagement.
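For readers unfamiliar with the EER metric used to report the results above, the following is a minimal illustrative sketch, not the evaluation code used in the dissertation. It approximates the equal error rate by sweeping each observed score as a decision threshold, and it computes a relative improvement under the assumption (not stated in the abstract) that the reported percentages are relative EER reductions. The function names and the toy scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by using every observed score as a threshold."""
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss rate: target trials rejected (score below the threshold).
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False-alarm rate: non-target trials accepted (score at or above it).
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction, expressed as a fraction of the baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores (hypothetical, for illustration only).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
print(relative_improvement(0.10, 0.08))
```

Under this reading, a system that lowers the EER from 10% to 8% would be described as a 20% improvement; the abstract does not specify the baseline values behind its reported figures.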