Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams

Dubey, Harishchandra; Sangwan, Abhijeet; Hansen, John H. L.
Dates: 2019-07-26; 2019-07-26; 2018-07-02
ISSN: 2329-9290
https://hdl.handle.net/10735.1/6725

Full text access from Treasures at UT Dallas is restricted to current UTD affiliates (use the provided Link to Article). All others may find the web address for this item in the full item record as "dc.relation.uri" metadata.

Speech activity detection (SAD) is the front end of most speech systems, e.g., speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications such as zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, it might not be feasible to collect SAD annotations. SAD is challenging for naturalistic audio streams that contain multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. FDK provides an enhanced spectral decomposition from which several statistical descriptors are derived. The FDK statistical descriptors are combined by principal component analysis into one-dimensional FDKSAD features. We further propose two decision backends: (i) a variable model-size Gaussian mixture model (VMGMM); and (ii) Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We use both backends for comparative evaluations in two phases: (1) standalone SAD performance; and (2) the effect of SAD on text-dependent speaker verification using RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We establish two CRSS corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise (CLDNN) corpus. The CRSS corpora facilitate standalone SAD evaluations on naturalistic audio streams. We perform comparative studies of the proposed approaches against multiple baselines, including SohnSAD, rSAD, semi-supervised Gaussian mixture model (SSGMM), and Gammatone spectrogram features.

Publisher: IEEE
Language: en
©2018 IEEE
Subjects: Speech perception; Noise; Hearing; Gaussian distribution; Image segmentation; Principal components analysis; Speech processing systems
Type: article
Citation: Dubey, H., A. Sangwan, and J. H. Hansen. 2018. "Leveraging frequency-dependent kernel and dip-based clustering for robust speech activity detection in naturalistic audio streams." IEEE/ACM Transactions on Audio Speech and Language Processing 26(11): 2056-2071, doi:10.1109/TASLP.2018.2848698
Volume 26, Issue 11