Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams

Dubey, Harishchandra; Sangwan, Abhijeet; Hansen, John H. L.

dc.contributor.VIAF	19968651 (Hansen, JHL)
dc.contributor.author	Dubey, Harishchandra
dc.contributor.author	Sangwan, Abhijeet
dc.contributor.author	Hansen, John H. L.
dc.contributor.utdAuthor	Dubey, Harishchandra
dc.contributor.utdAuthor	Sangwan, Abhijeet
dc.contributor.utdAuthor	Hansen, John H. L.
dc.date.accessioned	2019-07-26T16:52:43Z
dc.date.available	2019-07-26T16:52:43Z
dc.date.created	2018-07-02
dc.description	Full text access from Treasures at UT Dallas is restricted to current UTD affiliates (use the provided Link to Article). All others may find the web address for this item in the full item record as "dc.relation.uri" metadata.
dc.description.abstract	Speech activity detection (SAD) is a front-end in most speech systems, e.g., speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications like zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, it might not be feasible to collect SAD annotations. SAD is challenging for naturalistic audio streams containing multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. FDK provides enhanced spectral decomposition from which several statistical descriptors are derived. The FDK statistical descriptors are combined by principal component analysis into one-dimensional FDKSAD features. We further propose two decision backends: (i) a variable model-size Gaussian mixture model (VMGMM); and (ii) Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We use both backends for comparative evaluations in two phases: (1) standalone SAD performance; and (2) the effect of SAD on text-dependent speaker verification using RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We establish two CRSS corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise (CLDNN) corpus. The CRSS corpora facilitate standalone SAD evaluations on naturalistic audio streams. We performed comparative studies of the proposed approaches against multiple baselines, including SohnSAD, rSAD, semi-supervised Gaussian mixture model (SSGMM), and Gammatone spectrogram features.
dc.description.department	Erik Jonsson School of Engineering and Computer Science
dc.description.sponsorship	AFRL under Grant FA8750-15-1-0205
dc.identifier.bibliographicCitation	Dubey, H., A. Sangwan, and J. H. Hansen. 2018. "Leveraging frequency-dependent kernel and dip-based clustering for robust speech activity detection in naturalistic audio streams." IEEE/ACM Transactions on Audio Speech and Language Processing 26(11): 2056-2071, doi:10.1109/TASLP.2018.2848698
dc.identifier.issn	2329-9290
dc.identifier.issue	11
dc.identifier.uri	https://hdl.handle.net/10735.1/6725
dc.identifier.volume	26
dc.language.iso	en
dc.publisher	Institute of Electrical and Electronics Engineers Inc.
dc.relation.uri	http://dx.doi.org/10.1109/TASLP.2018.2848698
dc.rights	©2018 IEEE
dc.source	IEEE/ACM Transactions on Audio Speech and Language Processing
dc.subject	Speech perception
dc.subject	Noise
dc.subject	Hearing
dc.subject	Gaussian distribution
dc.subject	Image segmentation
dc.subject	Principal components analysis
dc.subject	Speech processing systems
dc.title	Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams
dc.type.genre	article

Files

Original bundle

Name: JECS-3626-279828.07-LINK.pdf
Size: 165.82 KB
Format: Adobe Portable Document Format
Description: Link to Article

Collections

Hansen, John H. L.