Browsing by Author "Yu, Chengzhu"
Now showing 1 - 2 of 2
Item: Robust I-Vector Extraction for Neural Network Adaptation in Noisy Environment (International Speech Communication Association) — Yu, Chengzhu; Ogawa, A.; Delcroix, M.; Yoshioka, T.; Nakatani, T.; Hansen, John H. L.
In this study, we explore an i-vector-based adaptation of deep neural networks (DNNs) in noisy environments. We first demonstrate the importance of encapsulating environment and channel variability into i-vectors for DNN adaptation in noisy conditions. To obtain robust i-vectors without losing noise and channel variability information, we investigate the use of parallel-feature-based i-vector extraction for DNN adaptation. Specifically, different types of features are used separately during the two stages of i-vector extraction: universal background model (UBM) state alignment and i-vector computation. To capture noise- and channel-specific feature variation, conventional MFCC features are still used for i-vector computation. However, much more robust features, such as Vector Taylor Series (VTS)-enhanced features as well as bottleneck features, are exploited for UBM state alignment. Experimental results on Aurora-4 show that the parallel-feature-based i-vectors yield performance gains of up to 9.2% relative compared to a baseline DNN-HMM system and 3.3% compared to a system using conventional MFCC-based i-vectors.

Item: Robust Speaker Modeling in Non-Neutral Environments with Application to Large Scale Multi-Speaker Audio Streams (2017-08) — Yu, Chengzhu; Hansen, John H. L.
With an explosive increase in the amount of multimedia content available worldwide and through the web, automatically detecting who spoke when in an audio stream is an important technique that has many practical applications.
The task of automatically annotating speech segments with speaker labels can be framed as either a speaker recognition or a speaker diarization problem, depending on whether voice samples of the speakers are available as a priori knowledge. Despite the differences, the success of both speaker recognition and speaker diarization hinges on accurate and robust modeling of speaker voice characteristics. Over the past several decades, statistical speaker modeling technology has advanced significantly. However, real-world applications of this technology, through speaker recognition and speaker diarization, still achieve considerably limited performance. In this dissertation, we investigate the applications of speaker recognition and speaker diarization on The National Aeronautics and Space Administration (NASA) Apollo-11 mission audio corpus to advance their performance in practical settings. In the first part of this dissertation, we focus on understanding the problems and challenges of applying speaker recognition techniques to a subset of the Apollo-11 space-to-ground audio corpus to automatically recognize all three astronauts. Specifically, we investigate the variations in the astronauts' voice characteristics across different phases of the lunar mission and their impact on speaker recognition performance. In the second part of this dissertation, we focus on the development of robust speaker recognition and diarization algorithms. We illustrate the challenge of applying speaker diarization techniques to multi-speaker naturalistic audio streams such as the Apollo-11 mission control center (MCC) audio corpus, and propose active-learning-based algorithms to effectively incorporate limited human effort into the current speaker diarization process. Moreover, we propose several robust speaker modeling techniques that improve speaker recognition in generally mismatched or noisy environments.
Lastly, the application of speaker recognition and speaker diarization to conversation analysis on the Apollo-11 MCC audio corpus is discussed. This dissertation therefore advances speech and language technology to address diarization of multi-speaker naturalistic audio streams for real, task-oriented teams. It is expected that these advancements will contribute significantly to research on human-to-human voice interaction for team-oriented tasks in business, social, government, and security applications.
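The parallel-feature i-vector extraction described in the first item above can be sketched in code. This is a minimal, hedged illustration, not the authors' implementation: it assumes diagonal-covariance UBMs and toy random data, and all variable names (`bn_means`, `extract_ivector`, etc.) are invented for illustration. Frame posteriors (UBM state alignment) come from a UBM over robust features (e.g. bottleneck features), while the Baum-Welch statistics that feed the standard i-vector posterior-mean formula are accumulated from MFCC frames.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: C mixture components, MFCC/bottleneck dims, i-vector rank R, frames
C, D_mfcc, D_bn, R, T_frames = 8, 13, 10, 4, 200

# UBM over robust (bottleneck) features -- used ONLY for frame alignment
bn_means = rng.normal(size=(C, D_bn))
bn_vars = np.full((C, D_bn), 1.0)
weights = np.full(C, 1.0 / C)

# UBM over MFCCs (supplies means/variances for the statistics) and a
# total-variability matrix T; both would be trained in practice
mfcc_means = rng.normal(size=(C, D_mfcc))
mfcc_vars = np.full((C, D_mfcc), 1.0)
T = rng.normal(scale=0.1, size=(C * D_mfcc, R))

def posteriors(x, means, var, w):
    """Per-frame component posteriors under a diagonal-covariance GMM."""
    # log N(x_t | mu_c, diag(var_c)) for every frame t and component c
    ll = -0.5 * (((x[:, None, :] - means) ** 2) / var
                 + np.log(2 * np.pi * var)).sum(-1)
    ll += np.log(w)
    ll -= ll.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(ll)
    return p / p.sum(axis=1, keepdims=True)

def extract_ivector(mfcc, robust):
    # Stage 1: UBM state alignment from the robust feature stream
    gamma = posteriors(robust, bn_means, bn_vars, weights)   # (T, C)
    # Stage 2: Baum-Welch statistics from the MFCC stream
    N = gamma.sum(axis=0)                                    # zeroth order, (C,)
    F = gamma.T @ mfcc - N[:, None] * mfcc_means             # centered first order
    # Standard i-vector posterior mean:
    #   w = (I + T' Sigma^-1 N T)^-1  T' Sigma^-1 F
    Sinv = (1.0 / mfcc_vars).reshape(-1)                     # (C*D,)
    Nrep = np.repeat(N, D_mfcc)                              # (C*D,)
    L = np.eye(R) + T.T @ ((Sinv * Nrep)[:, None] * T)
    b = T.T @ (Sinv * F.reshape(-1))
    return np.linalg.solve(L, b)

# Two parallel front-ends over the same frames of one utterance
mfcc = rng.normal(size=(T_frames, D_mfcc))
robust = rng.normal(size=(T_frames, D_bn))
w = extract_ivector(mfcc, robust)
print(w.shape)  # (4,) -- one R-dimensional i-vector per utterance
```

Using conventional MFCC-based i-vectors instead amounts to calling `posteriors` on the MFCC stream with the MFCC UBM; the parallel-feature variant changes only the alignment stage, which is what lets the statistics retain noise and channel variability while the alignment stays robust.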