Advancements in Acoustic Based Language Identification/Recognition







With over 6,000 languages spoken worldwide, effective language recognition (LR) is needed before any speech technology can be employed. Language identification (LID) is an essential speech pre-processing step, typically followed by automatic speech recognition or target speech post-processing. LID tasks are either closed-set or open-set, depending on the test condition. In real scenarios, robust closed-set language identification is usually hindered by mismatch factors such as background noise, channel, and speech duration. In addition, rejection of unknown/out-of-set (OOS) languages is another major challenge for open-set LID because of the increased cost and resources needed to collect effective OOS data.

To address the closed-set LID problem, this dissertation focuses on advancements based on diverse acoustic features and back-ends, and their influence on LID system fusion. A set of distinct acoustic features is considered, grouped into three categories: classical features, innovative features, and extensional features. In addition, both front-end concatenation and back-end fusion are considered. The results suggest that no single feature type is universally vital across all LID tasks and that a fusion of a diverse set is needed to ensure sustained LID performance in challenging scenarios. More specifically, the proposed hybrid fusion method improves LID system performance by +38.5% and +46.2% on the highly noisy DARPA RATS dataset and the large-scale NIST LRE-09 dataset, respectively.

To address a related scenario, closely spaced dialect identification, two types of unsupervised deep learning methods are introduced for feature extraction. First, an unsupervised bottleneck feature extraction scheme is proposed, which is derived from the traditional bottleneck structure but trained with estimated phonetic label knowledge. Second, two types of latent variable learning algorithms based on generative auto-encoder modeling are introduced for speech feature processing. Compared with the baseline MFCC i-Vector system, the proposed methods achieve up to a 58% relative performance improvement on a 4-way Chinese dialect corpus.

For open-set LID, we propose three effective and flexible OOS candidate selection methods to boost OOS language rejection and improve overall classification performance. Two selection strategies are proposed at the front-end feature level: (i) k-means clustering selection and (ii) complementary candidate selection with a minimum Kullback-Leibler divergence versus the closed set as a baseline. In addition, (iii) a general candidate selection method is proposed based on language relationships viewed from an engineering perspective, which are explored using the back-end score vectors of each language. With these selection methods, data enhancement is more effective and efficient than with the alternative baseline of random selection. To the best of our knowledge, this is the first major effort on effective OOS language selection to improve OOS rejection in open-set LID.

As speech technology is employed in more diverse consumer, commercial, government, social, and global human engagement scenarios, advancing effective LR is needed as individual language diversity expands for voice engagement and communication/electronic interaction.
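
As a rough illustration of how a Kullback-Leibler divergence can be used to compare candidate OOS languages against the closed set at the feature level, the following minimal sketch may help. It is not the dissertation's implementation: modeling each language with a single diagonal Gaussian, the function names, the synthetic data, and the candidate labels "near" and "far" are all assumptions made for illustration, and the abstract does not state which end of the ranking the author selects.

```python
import numpy as np

def fit_diag_gaussian(features):
    """Return per-dimension mean and variance of a (frames, dims) array."""
    x = np.asarray(features, dtype=float)
    # A small floor keeps the variance strictly positive (assumed safeguard).
    return x.mean(axis=0), x.var(axis=0) + 1e-8

def kl_diag_gaussians(p, q):
    """KL(p || q) for two diagonal Gaussians given as (mean, variance) pairs."""
    mu_p, var_p = p
    mu_q, var_q = q
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def rank_candidates(in_set_features, candidates):
    """Sort hypothetical candidate languages by KL divergence from the in-set data.

    candidates maps a language name to its (frames, dims) feature array.
    The result is a list of (name, divergence) pairs, smallest divergence first.
    """
    in_set = fit_diag_gaussian(in_set_features)
    scores = {
        name: kl_diag_gaussians(fit_diag_gaussian(feats), in_set)
        for name, feats in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Synthetic stand-ins for acoustic features (for example, 13-dimensional frames).
rng = np.random.default_rng(0)
in_set_features = rng.normal(0.0, 1.0, size=(2000, 13))
candidates = {
    "near": rng.normal(0.1, 1.0, size=(2000, 13)),  # close to the in-set data
    "far": rng.normal(3.0, 1.0, size=(2000, 13)),   # far from the in-set data
}
ranking = rank_candidates(in_set_features, candidates)
```

In this sketch, `ranking` lists candidates from closest to farthest from the pooled closed-set distribution. A real system would likely use richer density models than a single Gaussian, so the ranking here should be read only as a demonstration of the divergence computation.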



Automatic speech recognition, Language and languages, Computer sound processing


Copyright ©2017 is held by the author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.