Browsing by Author "Sangwan, Abhijeet"
Now showing 1 - 2 of 2
Item: Laughter and Filler Detection in Naturalistic Audio (International Speech Communication Association)
Authors: Kaushik, Lakshmish; Sangwan, Abhijeet; Hansen, John H. L.

Laughter and fillers are common phenomena in speech and play an important role in communication. In this study, we present Deep Neural Network (DNN) and Convolutional Neural Network (CNN) based systems to classify non-verbal cues (laughter and fillers) from verbal speech in naturalistic audio. We propose improvements over the deep learning system proposed in [1]. In particular, we propose a simple method to combine spectral features with pitch information to capture prosodic and spectral cues for fillers and laughter. Additionally, we propose using a wider time context for feature extraction so that the time evolution of the spectral and prosodic structure can also be exploited for classification. Furthermore, we propose using a CNN for classification. The new method is evaluated on conversational telephony speech (CTS, drawn from Switchboard and Fisher) data and the UT-Opinion corpus. Our results show that the new system improves the AUC (area under the curve) metric over the baseline system by 8.15% and 11.9% absolute for laughter, and 4.85% and 6.01% absolute for fillers, on CTS and UT-Opinion data, respectively. Finally, we analyze the results to explain the difference in performance between traditional CTS data and naturalistic audio (UT-Opinion), and identify challenges that need to be addressed to make systems perform better on practical data.

Item: Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams (Institute of Electrical and Electronics Engineers Inc.)
Authors: Dubey, Harishchandra; Sangwan, Abhijeet; Hansen, John H. L.

Speech activity detection (SAD) is a front-end component in most speech systems, e.g., speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications such as zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, it might not be feasible to collect SAD annotations. SAD is also challenging for naturalistic audio streams containing multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. The FDK provides an enhanced spectral decomposition from which several statistical descriptors are derived. The FDK statistical descriptors are combined by principal component analysis into one-dimensional FDK-SAD features. We further propose two decision backends: (i) a variable model-size Gaussian mixture model (VMGMM); and (ii) Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We use both backends for comparative evaluations in two phases: (1) standalone SAD performance; and (2) the effect of SAD on text-dependent speaker verification using RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We establish two CRSS corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise (CLDNN) corpus, which facilitate standalone SAD evaluations on naturalistic audio streams. We perform comparative studies of the proposed approaches against multiple baselines, including SohnSAD, rSAD, the semi-supervised Gaussian mixture model (SSGMM), and Gammatone spectrogram features.
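The first item's feature scheme, appending pitch to spectral features and widening the time context, can be sketched as below. This is a minimal illustration only: the feature dimensions, context width of +/-15 frames, and the function name `splice_context` are assumptions for the sketch, not the paper's exact configuration.

```python
import numpy as np

def splice_context(features, left=15, right=15):
    """Stack each frame with its left/right context frames (edges padded by repetition)."""
    T, D = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    # entry c of the stack holds padded[c : c + T], so [i, c, :] is frame i + c of the padded signal
    return np.stack([padded[c:c + T] for c in range(left + right + 1)], axis=1).reshape(T, -1)

# hypothetical frame-level inputs: 40-dim spectral features plus a 1-dim pitch track
T = 100
spectral = np.random.default_rng(0).normal(size=(T, 40))
pitch = np.random.default_rng(1).normal(size=(T, 1))

# simple feature-level fusion of spectral and prosodic cues: (T, 41)
frames = np.concatenate([spectral, pitch], axis=1)

# widen the time context so the classifier sees feature evolution: (T, 41 * 31)
spliced = splice_context(frames, left=15, right=15)
```

The spliced matrix is what a frame-level DNN/CNN classifier would consume, one row per frame with its temporal neighborhood attached.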
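For the second item, the paper's exact VMGMM and DipSAD procedures are not reproduced here. As a rough illustration of a model-based decision backend over a one-dimensional SAD feature, the sketch below fits a two-component 1-D Gaussian mixture with EM and labels frames assigned to the higher-mean component as speech; the actual VMGMM varies the model size, whereas this sketch fixes it at two components. The synthetic data and all names are illustrative.

```python
import numpy as np

def gmm1d_em(x, k=2, iters=50, seed=0):
    """Fit a k-component 1-D Gaussian mixture to x via EM; returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k, replace=False)          # initialize means at random samples
    var = np.full(k, x.var() + 1e-6)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: per-frame component responsibilities (log-domain for stability)
        log_p = -0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var)) + np.log(w)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        n = r.sum(axis=0)
        w = n / n.sum()
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    return w, mu, var

# synthetic 1-D feature stream: nonspeech frames near 0, speech frames near 3
nonspeech = np.random.default_rng(1).normal(0.0, 0.5, 200)
speech_frames = np.random.default_rng(2).normal(3.0, 0.5, 200)
x = np.concatenate([nonspeech, speech_frames])

w, mu, var = gmm1d_em(x)
speech_comp = np.argmax(mu)  # component with the larger mean is taken as speech

# frame decision: assign each frame to its most responsible component
log_p = -0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var)) + np.log(w)
labels = log_p.argmax(axis=1) == speech_comp
```

In practice the input would be the one-dimensional FDK-SAD feature stream rather than synthetic samples, and model selection would choose the number of components.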