Automatic Speaker Recognition and Diarization in Co-Channel Speech
This study investigates various aspects of multi-speaker interference and its impact on speaker recognition. Single-channel multi-speaker speech signals (also known as co-channel speech) comprise a significant portion of speech processing data; examples include recordings of multiple speakers in meetings, conversations, and debates. The nuisances of co-channel speech are two-fold: 1) overlapped speech, and 2) non-overlapping speaker interference. In overlap, the superposition of two stochastically similar, non-stationary signals disrupts speech processing algorithms originally developed for single-speaker audio. For example, in speaker recognition, identifying speakers in overlapped segments is more difficult than in single-speaker signals; analyses in this study show that introducing overlapped speech increases speaker recognition error rates by an order of magnitude. Beyond this direct impact, overlap has a secondary effect: one speaker forces the other to change his/her speech characteristics. Different forms of co-channel data are investigated in this study. In scenarios where the focus is on overlap, independent cross-talk is used: the summation of independent audio signals from different speakers to simulate overlap. The alternative form of data used in this study is real conversation recordings. Although conversations contain both overlapped and non-overlapped speech, independent cross-talk is a better source of overlap, because overlaps in typical conversations are scarce and non-uniform. Independent cross-talk is obtained from the GRID corpus, which was used in the speech separation challenge as a source of overlapped speech. Real conversations are obtained from the Switchboard telephone conversation corpus.
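The independent cross-talk described above amounts to summing two single-speaker recordings, typically after scaling the interfering signal to a chosen signal-to-interference ratio (SIR). The sketch below illustrates that idea in Python/NumPy; the function name, the fixed 6 dB SIR, and the random stand-in signals are illustrative assumptions, not details from the study.

```python
import numpy as np

def mix_cross_talk(target, interferer, sir_db=0.0):
    """Sum two single-speaker signals to simulate co-channel overlap.

    The interferer is scaled so the mixture has the requested
    signal-to-interference ratio (in dB) relative to `target`.
    """
    n = min(len(target), len(interferer))          # align lengths
    target, interferer = target[:n], interferer[:n]
    p_t = np.mean(target ** 2)                     # target power
    p_i = np.mean(interferer ** 2)                 # interferer power
    gain = np.sqrt(p_t / (p_i * 10 ** (sir_db / 10.0)))
    return target + gain * interferer

# Example: two synthetic stand-in "utterances" mixed at 6 dB SIR.
rng = np.random.default_rng(0)
a = rng.standard_normal(16000)   # stand-in for speaker A's audio
b = rng.standard_normal(16000)   # stand-in for speaker B's audio
mixture = mix_cross_talk(a, b, sir_db=6.0)
```

Because the interferer gain is computed from the two signal powers, the achieved SIR matches the requested value by construction, which makes simulated overlap easy to control uniformly across a corpus.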
Other real conversational data used throughout this study include the AMI meeting corpus, Prof-lifelog, and UTDrive data. These datasets provide a perspective on environmental noise and co-channel interference in day-to-day speech, and categorizing them allows for a meticulous analysis of different aspects of co-channel speech. The analysis of overlap is presented primarily in the form of overlap detection techniques. This study proposes two overlap detection methods: 1) Pyknogram-based detection, and 2) Gammatone sub-band frequency modulation (GSFM). Both methods exploit the harmonic structure of speech to detect overlaps: Pyknograms enhance speech harmonics and evaluate their dynamics across time, while GSFM magnifies the presence of multiple harmonics in different sub-bands. The other advancements proposed in this study use back-end modeling techniques to compensate for co-channel speech in real conversational data. These techniques reduce the impact of interfering speech on speaker-dependent models. Several methods are investigated, each proposing a different modification to the probabilistic linear discriminant analysis (PLDA) used in state-of-the-art speaker recognition systems. In addition to model compensation techniques, this study presents CRSS-SpkrDiar, a speaker diarization research platform aimed at conversational co-channel speech data; it was developed during this study to facilitate end-to-end co-channel speech analysis. Taken collectively, the speech analyses, proposed features, and algorithmic advancements developed in this study contribute to an improved understanding of, and measurable performance gains in, speech/speaker technology for the co-channel speech problem.
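The harmonic-structure cue that both detectors exploit can be illustrated with a toy example: overlapped speech from two talkers carries two unrelated harmonic series, so a frame's autocorrelation shows more than one distinct pitch-period peak. The sketch below counts such candidates; it is only a demonstration of the underlying principle, not the Pyknogram or GSFM algorithm, and all function names, thresholds, and the synthetic test signals are hypothetical.

```python
import numpy as np

def pitch_candidates(frame, fs=16000, f_lo=80.0, f_hi=200.0, thresh=0.3):
    """Count distinct pitch-period candidates in one analysis frame.

    Voiced single-speaker speech yields one dominant periodicity peak
    (plus integer multiples, which are discarded); overlapped speech
    from two talkers tends to yield two unrelated peaks.
    """
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    r = r / r[0]                                   # normalize by energy
    lo, hi = int(fs / f_hi), int(fs / f_lo)        # lag search range
    peaks = [t for t in range(lo + 1, hi)          # local maxima above threshold
             if r[t] > thresh and r[t] > r[t - 1] and r[t] >= r[t + 1]]
    kept = []
    for lag in peaks:                              # drop near-integer multiples
        ratios = (lag / k for k in kept)
        if not any(round(q) >= 2 and abs(q - round(q)) < 0.05 for q in ratios):
            kept.append(lag)
    return len(kept)

# Toy check on synthetic harmonic signals standing in for two talkers.
t = np.arange(8000) / 16000.0
harm = lambda f0: sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, 6))
a, b = harm(100.0), harm(170.0)    # "speakers" at 100 Hz and 170 Hz
```

On these signals, each single "speaker" produces one candidate while the sum produces two, which is the basic evidence an overlap detector built on harmonicity looks for.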