Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams

dc.contributor.VIAF: 19968651 (Hansen, JHL)
dc.contributor.author: Dubey, Harishchandra
dc.contributor.author: Sangwan, Abhijeet
dc.contributor.author: Hansen, John H. L.
dc.contributor.utdAuthor: Dubey, Harishchandra
dc.contributor.utdAuthor: Sangwan, Abhijeet
dc.contributor.utdAuthor: Hansen, John H. L.
dc.description: Full text access from Treasures at UT Dallas is restricted to current UTD affiliates (use the provided Link to Article). All others may find the web address for this item in the full item record as "dc.relation.uri" metadata.
dc.description.abstract: Speech activity detection (SAD) is the front-end in most speech systems, e.g., speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications like zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, it might not be feasible to collect SAD annotations. SAD is challenging for naturalistic audio streams containing multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. FDK provides enhanced spectral decomposition from which several statistical descriptors are derived. FDK statistical descriptors are combined by principal component analysis into one-dimensional FDKSAD features. We further propose two decision backends: (i) variable model-size Gaussian mixture model (VMGMM); and (ii) Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We used both backends for comparative evaluations in two phases: (1) standalone SAD performance; (2) effect of SAD on text-dependent speaker verification using RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We establish two CRSS corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise (CLDNN) corpus. The CRSS corpora facilitate standalone SAD evaluations on naturalistic audio streams. We performed comparative studies of the proposed approaches with multiple baselines, including SohnSAD, rSAD, semi-supervised Gaussian mixture model (SSGMM), and Gammatone spectrogram features.
dc.description.department: Erik Jonsson School of Engineering and Computer Science
dc.description.sponsorship: AFRL under Grant FA8750-15-1-0205
dc.identifier.bibliographicCitation: Dubey, H., A. Sangwan, and J. H. Hansen. 2018. "Leveraging frequency-dependent kernel and dip-based clustering for robust speech activity detection in naturalistic audio streams." IEEE/ACM Transactions on Audio Speech and Language Processing 26(11): 2056-2071, doi:10.1109/TASLP.2018.2848698
dc.publisher: Institute of Electrical and Electronics Engineers Inc.
dc.rights: ©2018 IEEE
dc.source: IEEE/ACM Transactions on Audio Speech and Language Processing
dc.subject: Speech perception
dc.subject: Gaussian distribution
dc.subject: Image segmentation
dc.subject: Principal components analysis
dc.subject: Speech processing systems
dc.title: Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams
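
Illustrative sketch (not part of the record): the abstract describes combining per-frame statistical descriptors into a one-dimensional feature with principal component analysis and then making a speech/non-speech decision with a mixture-model backend. The Python below is a minimal sketch of that general idea only. It does not implement the authors' FDK decomposition, VMGMM, or DipSAD: a fixed two-component 1-D Gaussian mixture fitted by EM stands in for the backend, the function names pca_1d and gmm2_labels are hypothetical, and the assumption that column 0 of the descriptor matrix is larger for speech frames is made here only to fix the sign of the projection.

```python
import numpy as np


def pca_1d(descriptors):
    """Project an (n_frames, n_descriptors) matrix onto its first principal component."""
    centered = descriptors - descriptors.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    # The sign of a principal component is arbitrary. Assumption: column 0 is
    # larger for speech, so orient the projection to correlate positively with it.
    if np.dot(proj, centered[:, 0]) < 0:
        proj = -proj
    return proj


def gmm2_labels(x, n_iter=50):
    """Fit a two-component 1-D GMM by EM; True marks the higher-mean component."""
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.full(2, x.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each frame.
        log_p = -0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
        log_p += np.log(w)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        w = nk / nk.sum()
    return resp.argmax(axis=1) == mu.argmax()


if __name__ == "__main__":
    # Synthetic stand-in descriptors: 200 "noise" frames and 100 "speech" frames.
    rng = np.random.default_rng(0)
    frames = np.vstack([rng.normal(0.0, 1.0, (200, 3)),
                        rng.normal(5.0, 1.0, (100, 3))])
    decisions = gmm2_labels(pca_1d(frames))
    print("frames labelled as speech:", int(decisions.sum()))
```

The synthetic data in the demo is deliberately well separated; real naturalistic audio would need actual descriptors and a more careful backend than this fixed two-component model.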


Original bundle

165.82 KB
Adobe Portable Document Format
Link to Article