Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams
Abstract
Speech activity detection (SAD) is the front end of most speech systems, e.g., speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications like zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, it might not be feasible to collect SAD annotations. SAD is challenging for naturalistic audio streams containing multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. FDK provides an enhanced spectral decomposition from which several statistical descriptors are derived. The FDK statistical descriptors are combined by principal component analysis into one-dimensional FDKSAD features. We further propose two decision backends: (i) a variable model-size Gaussian mixture model (VMGMM); and (ii) Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We use both backends for comparative evaluations in two phases: (1) standalone SAD performance; and (2) the effect of SAD on text-dependent speaker verification using RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We establish two CRSS corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise (CLDNN) corpus. The CRSS corpora facilitate standalone SAD evaluations on naturalistic audio streams. We perform comparative studies of the proposed approaches against multiple baselines, including SohnSAD, rSAD, the semi-supervised Gaussian mixture model (SSGMM), and Gammatone spectrogram features.
IEEE
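The general shape of the pipeline described in the abstract (several per-frame statistical descriptors reduced by principal component analysis to a one-dimensional feature, followed by an unsupervised speech/non-speech decision) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the descriptors are synthetic rather than FDK-derived, the function names are hypothetical, the projection is oriented by assuming the first descriptor is larger for speech, and the decision step is a plain one-dimensional two-means split standing in for a backend. It is not VMGMM or DipSAD.

```python
import math
import random


def project_to_first_component(rows, iters=200):
    """Project descriptor vectors onto their first principal axis (1-D feature).

    Uses power iteration on the sample covariance matrix. The starting vector
    is uniform, which is adequate for this toy data but is not guaranteed to
    converge for every input.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Assumption: descriptor 0 grows with speech, so orient the axis that way.
    if v[0] < 0:
        v = [-x for x in v]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]


def two_means_threshold(values, iters=50):
    """Return the midpoint between two 1-D cluster centres (stand-in backend)."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        low = [x for x in values if x <= mid]
        high = [x for x in values if x > mid]
        if not low or not high:
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


# Synthetic stand-in descriptors: 50 "non-speech" frames, then 50 "speech" frames.
rng = random.Random(0)
non_speech = [[rng.gauss(0.0, 0.3) for _ in range(3)] for _ in range(50)]
speech = [[rng.gauss(m, 0.3) for m in (3.0, 2.0, 1.0)] for _ in range(50)]

feature = project_to_first_component(non_speech + speech)
threshold = two_means_threshold(feature)
labels = [x > threshold for x in feature]  # True marks a frame flagged as speech
```

On real audio the one-dimensional feature is rarely this cleanly bimodal, which is presumably why the abstract pairs it with more robust backends than a fixed two-cluster split.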