Robust Speaker Modeling in Non-Neutral Environments with Application to Large Scale Multi-Speaker Audio Streams
With the explosive increase in the amount of multimedia content available worldwide and through the web, automatically detecting who spoke when in an audio stream is an important technique with many practical applications. The task of automatically annotating speech segments with speaker labels can be treated as either a speaker recognition or a speaker diarization problem, depending on whether voice samples of the speakers are available as a priori knowledge. Despite the differences, the success of both speaker recognition and speaker diarization hinges on accurate and robust modeling of speaker voice characteristics. Over the past several decades, the technology of statistical speaker modeling has achieved significant advancements. However, real-world applications of speaker modeling technology, in the form of speaker recognition and speaker diarization, still have considerably limited performance. In this dissertation, we investigate the application of speaker recognition and speaker diarization to the National Aeronautics and Space Administration (NASA) Apollo-11 mission audio corpus in order to advance their performance in practical applications. In the first part of this dissertation, we focus on understanding the problems and challenges of applying speaker recognition techniques to a subset of the Apollo-11 space-to-ground audio corpus to automatically recognize all three astronauts. Specifically, we investigate the variations in the astronauts' voice characteristics across different phases of the lunar mission and their impact on speaker recognition performance. In the second part of this dissertation, we focus on the development of robust speaker recognition and diarization algorithms.
We illustrate the challenge of applying speaker diarization techniques to multi-speaker naturalistic audio streams such as the Apollo-11 mission control center (MCC) audio corpus, and propose active-learning-based algorithms that effectively incorporate limited human effort into the current speaker diarization process. Moreover, we propose several robust speaker modeling techniques that improve speaker recognition in generally mismatched or noisy environments. Lastly, the application of speaker recognition and speaker diarization to conversation analysis on the Apollo-11 MCC audio corpus is discussed. This dissertation therefore advances speech and language technology to address diarization of multi-speaker naturalistic audio streams for real task-oriented teams. It is expected that these advancements will contribute significantly to research on human-to-human voice interaction for team-oriented tasks in business, social, government, and security applications.