Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams

Dubey, Harishchandra; Sangwan, Abhijeet; Hansen, John H. L.

Files

JECS-3626-279828.07-LINK.pdf (165.82 KB)

Authors

Dubey, Harishchandra

Sangwan, Abhijeet

Hansen, John H. L.

Publisher

Institute of Electrical and Electronics Engineers Inc.

URI

https://hdl.handle.net/10735.1/6725

Abstract

Speech activity detection (SAD) is the front end of most speech systems, e.g., speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications such as zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, it might not be feasible to collect SAD annotations. SAD is challenging for naturalistic audio streams that contain multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. FDK provides an enhanced spectral decomposition from which several statistical descriptors are derived. The FDK statistical descriptors are combined by principal component analysis into one-dimensional FDKSAD features. We further propose two decision backends: (i) a variable model-size Gaussian mixture model (VMGMM); and (ii) Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We use both backends for comparative evaluations in two phases: (1) standalone SAD performance; and (2) the effect of SAD on text-dependent speaker verification using RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We establish two CRSS corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise (CLDNN) corpus. The CRSS corpora facilitate standalone SAD evaluations on naturalistic audio streams. We perform comparative studies of the proposed approaches against multiple baselines, including SohnSAD, rSAD, semi-supervised Gaussian mixture model (SSGMM), and Gammatone spectrogram features.
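
The step in which per-frame statistical descriptors are combined by principal component analysis into a one-dimensional feature, followed by an unsupervised speech/non-speech decision, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the FDK descriptors, so the input rows are placeholder numbers, and every function name is hypothetical. The decision step uses a plain two-means split of the 1-D feature as a simple stand-in; it is neither VMGMM nor DipSAD.

```python
def first_principal_component(rows, iters=100):
    """Project each row onto the first principal component of the data.

    rows: list of equal-length lists (one list of descriptors per frame,
    at least two frames). Returns one float per frame.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the descriptors (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration for the dominant eigenvector. The start vector is an
    # arbitrary choice; a start orthogonal to the answer would not converge.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]


def two_means_1d(values, iters=50):
    """Split 1-D values into two groups (stand-in for the paper's backends).

    Returns (labels, threshold); label 1 marks values above the threshold.
    """
    lo, hi = min(values), max(values)
    for _ in range(iters):
        thr = (lo + hi) / 2.0
        low = [x for x in values if x <= thr]
        high = [x for x in values if x > thr]
        if not low or not high:
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    thr = (lo + hi) / 2.0
    return [1 if x > thr else 0 for x in values], thr


def sketch_decisions(descriptors):
    """Reduce descriptors to a 1-D feature, then split frames into two groups."""
    feature = first_principal_component(descriptors)
    return two_means_1d(feature)
```

The sign of a principal component is arbitrary, so this sketch only separates frames into two groups; deciding which group corresponds to speech would need an extra rule that the abstract does not describe.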

Description

Full text access from Treasures at UT Dallas is restricted to current UTD affiliates (use the provided Link to Article). All others may find the web address for this item in the full item record as "dc.relation.uri" metadata.

Keywords

Speech perception, Noise, Hearing, Gaussian distribution, Image segmentation, Principal components analysis, Speech processing systems

Sponsorship

AFRL under Grant FA8750-15-1-0205

Collections

Hansen, John H. L.
