Robust Back-End Processing for Speaker Verification under Language and Acoustic Mismatch Conditions







Recently, due to the availability of large amounts of data and computational power, there has been a significant rise in Machine Learning/Artificial Intelligence technology. Today, a huge amount of digital data is being generated thanks to smart devices and the Internet of Things. Furthermore, Moore's law has ensured that current hardware can reliably store, analyze and perform massive amounts of computation in a reasonable amount of time. Machine Learning has many applications across different domains, such as image processing, video processing, data mining and finance. Among all these applications, the integration of speech technology into mobile and online services has been a major area of research in recent times. Speech, being the primary means of human-to-human interaction, is one of the preferred methods for human-to-machine interaction. Since the beginning of the computer era, scientists, scholars and artists have dreamed of computers that can hold a natural conversation with humans. Turing's test of computational intelligence and HAL 9000 in the film 2001: A Space Odyssey are examples of this futuristic vision.

A speech signal contains multiple levels of information, conveying what is being spoken and who has spoken it, as well as information about the acoustic conditions of the environment in which the speech was uttered. Moreover, speech can be conveniently acquired remotely using a telephone or over the internet. Due to these properties, speech technology has been in increasing demand over the past few years. In this study, we focus on the ``who has spoken it'' part of the speech signal, commonly known as speaker recognition.

There have been significant advancements in the field of speaker recognition in recent years. However, robustness across mismatched conditions remains a difficult bottleneck to resolve. The mismatch can occur between enrollment and test conditions as well as between development and evaluation data. We define evaluation data as the enrollment and test speech utterances, while development data is the data used to train system parameters. The mismatch can occur for a variety of reasons, such as background noise, communication channel, or different languages spoken by multi-lingual speakers. In this study, we propose three methods to compensate for acoustic and language mismatch in a speaker recognition system. The first two methods attempt to reduce mismatch between enrollment and test utterances, while the third attempts to suppress mismatch between the development and evaluation data of a speaker recognition system:

i) The first method focuses on the language mismatch scenario between enrollment and test conditions. We propose a method called Within-Class Covariance Correction (WCC) that yields significant improvements under the language mismatch condition of a speaker recognition system.

ii) The second method addresses the issue of multi-modality in the development data-set caused by variation in the spoken languages and channels used by the speakers. We show that if a multi-lingual speaker speaks different languages or uses different microphones, speaker recognition performance suffers. We propose a method called Locally Weighted Linear Discriminant Analysis (LWLDA) to compensate for this drop in performance.

iii) The third method enables us to employ unlabeled out-of-domain development data to evaluate speaker recognition trials. We show that when the development data-set closely matches the evaluation trials, we obtain excellent speaker recognition performance.

This kind of development data-set is known as in-domain data. However, when there is an acoustic or language mismatch between the development and evaluation data, a sharp drop in performance is observed. This kind of development data-set is known as out-of-domain data. We propose a method called Unsupervised Probabilistic Feature Transformation (UPFT) to transform out-of-domain data towards in-domain data. Our proposed method has the added advantage of not requiring data-set labeling, which saves time, money and resources.
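To make the within-class covariance idea behind the first method concrete, the following is a minimal sketch, not the exact WCC formulation of this study: it computes the ordinary within-speaker covariance of a set of embedding vectors, then adds a correction term built from per-language mean vectors, so that language-induced shifts are treated as within-speaker variability and suppressed by downstream normalization (e.g. WCCN or LDA). The weight `alpha` is a hypothetical scaling parameter introduced here for illustration.

```python
import numpy as np

def within_class_covariance(X, labels):
    """Average covariance of vectors around their own class mean."""
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(labels):
        Xc = X[labels == c] - X[labels == c].mean(axis=0)
        W += Xc.T @ Xc
    return W / len(X)

def wcc_corrected_covariance(X, spk_labels, lang_labels, alpha=1.0):
    """Within-speaker covariance plus a between-language correction term.

    The correction is the covariance of per-language mean vectors; adding
    it folds language shifts into the within-speaker variability that a
    back-end later normalizes away.  `alpha` is a hypothetical weight.
    """
    W = within_class_covariance(X, spk_labels)
    lang_means = np.stack([X[lang_labels == l].mean(axis=0)
                           for l in np.unique(lang_labels)])
    centered = lang_means - lang_means.mean(axis=0)
    B_lang = centered.T @ centered / len(lang_means)
    return W + alpha * B_lang
```

The corrected matrix stays symmetric and positive semi-definite, so it can be used wherever a within-class covariance estimate is expected.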
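The second method's locality weighting can be illustrated with a small sketch, again under stated assumptions rather than the exact LWLDA derivation: same-class pairwise scatter contributions are down-weighted for distant pairs via a Gaussian affinity, so a multi-modal class (one speaker recorded in several languages or over several channels) is not forced into a single tight cluster. The bandwidth `sigma` is a hypothetical locality parameter.

```python
import numpy as np

def lwlda(X, labels, n_components=2, sigma=1.0):
    """Locally weighted LDA sketch.

    Within-class scatter is built from pairwise differences weighted by a
    Gaussian affinity, so distant same-class pairs (different modes of a
    multi-modal class) contribute little.  Projection directions are the
    leading generalized eigenvectors of the between/within scatter pair.
    """
    d = X.shape[1]
    Sw = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        diff = Xc[:, None, :] - Xc[None, :, :]                   # pairwise differences
        w = np.exp(-np.sum(diff**2, axis=-1) / (2 * sigma**2))   # locality weights
        Sw += np.einsum('ij,ijk,ijl->kl', w, diff, diff) / (2 * len(Xc))
    mu = X.mean(axis=0)
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        mc = X[labels == c].mean(axis=0) - mu
        Sb += (labels == c).sum() * np.outer(mc, mc)
    # leading eigenvectors of Sw^{-1} Sb (small ridge for stability)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_components]].real
```

With `sigma` large, the weights approach 1 and the sketch reduces to ordinary LDA, which is the design intent: locality only changes behavior when a class is genuinely spread across modes.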
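The goal of the third method, moving unlabeled out-of-domain data towards in-domain data, can be illustrated with a simple unsupervised moment-matching transform. This is a CORAL-style stand-in, not the probabilistic UPFT formulation itself: it whitens out-of-domain features with their own statistics and re-colors them with in-domain statistics, and, like UPFT, it needs no speaker labels on either data-set.

```python
import numpy as np

def moment_matching_transform(X_ood, X_ind):
    """Map out-of-domain features towards in-domain statistics.

    Whiten with the out-of-domain mean/covariance, then re-color with the
    in-domain mean/covariance.  Purely unsupervised: only first- and
    second-order statistics of each set are used, no labels.
    """
    def sqrtm_psd(C):
        # matrix square root of a symmetric PSD matrix via eigendecomposition
        vals, vecs = np.linalg.eigh(C)
        vals = np.clip(vals, 1e-10, None)
        return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

    mu_o, mu_i = X_ood.mean(axis=0), X_ind.mean(axis=0)
    eye = np.eye(X_ood.shape[1])
    C_o = np.cov(X_ood, rowvar=False) + 1e-6 * eye   # ridge for invertibility
    C_i = np.cov(X_ind, rowvar=False) + 1e-6 * eye
    W = sqrtm_psd(C_i) @ np.linalg.inv(sqrtm_psd(C_o))
    return (X_ood - mu_o) @ W.T + mu_i
```

After the transform, the out-of-domain set has (approximately) the in-domain mean and covariance, so it can stand in as development data for training back-end parameters.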



Automatic speech recognition, Multilingual computing, Big data, Machine learning


Copyright ©2017 is held by the author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.