Item: Speech and Language Processing for Assessing Child–Adult Interaction Based on Diarization and Location (Springer New York LLC, 2019-06-05) Hansen, John H. L.; Najafian, Maryam; Lileikyte, Rasa; Irvin, D.; Rous, B.; 0000-0003-1382-9929 (Hansen, JHL)
Understanding and assessing child verbal communication patterns is critical in facilitating effective language development. Typically, speaker diarization is performed to explore children’s verbal engagement. Understanding which activity areas stimulate verbal communication can help promote more efficient language development. In this study, we present a two-stage child vocal engagement prediction system that consists of (1) a near real-time, noise-robust system that measures the duration of child-to-adult and child-to-child conversations and tracks the number of conversational turn-takings, and (2) a novel child location tracking strategy that determines in which activity areas a child spends the most and the least time. The proposed child–adult turn-taking solution relies exclusively on vocal cues observed during the interaction between a child and other children and/or classroom teachers. By employing a threshold-optimized speech activity detection using a linear combination of voicing measures (TO-COMBO-SAD), it is possible to achieve effective speech/non-speech segment detection prior to conversation assessment. This TO-COMBO-SAD reduces classification error rates for adult–child audio by 21.34% and 27.3% compared to baseline i-Vector and standard Bayesian Information Criterion diarization systems, respectively. In addition, this study presents a unique adult–child location tracking system that helps determine the quantity of child–adult communication in specific activity areas, and which activities stimulate voice communication engagement in a child–adult education space.
We observe that our proposed location tracking solution offers unique opportunities to assess speech and language interaction for children, and to quantify the location context that would contribute to improved verbal communication. © 2019, Springer Science+Business Media, LLC, part of Springer Nature.

Item: Enhancement of Consonant Recognition in Bimodal and Normal Hearing Listeners (Sage Publications Inc., 2019-05-15) Yoon, Y.-S.; Riley, B.; Patel, H.; Frost, Amanda; Fillmore, P.; Gifford, R.; Hansen, John H. L.
Objectives: The present study investigated the effects of 3-dimensional deep search (3DDS) signal processing on the enhancement of consonant perception in bimodal and normal hearing listeners. Methods: Using an articulation-index gram and 3DDS signal processing, consonant segments that greatly affected performance were identified and intensified with a 6-dB gain. Consonant recognition was then measured unilaterally and bilaterally before and after 3DDS processing, in both quiet and noise. Results: The 3DDS signal processing provided a benefit to both groups, with greater benefit occurring in noise than in quiet. The benefit rendered by 3DDS was greatest in the binaural listening condition. The ability to integrate acoustic features across ears was also enhanced with 3DDS processing. In listeners with normal hearing, manner and place of articulation were improved in the binaural listening condition. In bimodal listeners, voicing, manner, and place of articulation were also improved in the bimodal and hearing aid ear–alone conditions. Conclusions: Consonant recognition was improved with 3DDS in both groups. This observed benefit suggests 3DDS can be used as an auditory training tool for improved integration and for bimodal users who receive little or no benefit from their current bimodal hearing. © The Author(s) 2019.
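As a worked illustration of the 6-dB gain mentioned in the Methods above, the sketch below converts a decibel gain to a linear amplitude factor and applies it to a marked segment of samples. The function names and the sample values are hypothetical, and the 3DDS procedure that selects which consonant segments to intensify is not reproduced here.

```python
def db_to_amplitude(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def boost_segment(samples, start, end, gain_db=6.0):
    """Return a copy of samples with samples[start:end] scaled by gain_db."""
    factor = db_to_amplitude(gain_db)
    return [s * factor if start <= i < end else s
            for i, s in enumerate(samples)]


# Hypothetical waveform; indices 1..2 stand in for a selected consonant segment.
signal = [0.10, 0.20, -0.10, 0.05]
boosted = boost_segment(signal, 1, 3)
print(round(db_to_amplitude(6.0), 3))  # 1.995
```

A 6-dB gain therefore roughly doubles the amplitude of the selected segment while leaving the rest of the signal unchanged.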
Item: Leveraging Frequency-Dependent Kernel and Dip-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams (Institute of Electrical and Electronics Engineers Inc.) Dubey, Harishchandra; Sangwan, Abhijeet; Hansen, John H. L.; 19968651 (Hansen, JHL)
Speech activity detection (SAD) is the front-end in most speech systems, e.g., speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications such as zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, it might not be feasible to collect SAD annotations. SAD is challenging for naturalistic audio streams that contain multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. FDK provides an enhanced spectral decomposition from which several statistical descriptors are derived. The FDK statistical descriptors are combined by principal component analysis into one-dimensional FDK-SAD features. We further propose two decision backends: (i) a variable model-size Gaussian mixture model (VMGMM); and (ii) Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We used both backends for comparative evaluations in two phases: (1) standalone SAD performance; and (2) the effect of SAD on text-dependent speaker verification using RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We establish two CRSS corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise (CLDNN) corpus. The CRSS corpora facilitate standalone SAD evaluations on naturalistic audio streams. We performed comparative studies of the proposed approaches against multiple baselines, including SohnSAD, rSAD, the semi-supervised Gaussian mixture model (SSGMM), and Gammatone spectrogram features.
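To make the idea of an unsupervised decision backend on a one-dimensional SAD feature concrete, here is a minimal sketch that fits a two-component Gaussian mixture with EM and labels each frame by the more likely component. This is a simplified stand-in, not the authors' method: the component count is fixed at two (it is not the variable model-size VMGMM), Hartigan dip-based clustering is not implemented, the feature values are made up, and the assumption that the higher-mean component corresponds to speech is ours.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D GMM with EM; returns (weights, means, variances)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability of each component for every value.
        resp = []
        for x in values:
            p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and (floored) variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)
    return weights, means, variances


def label_frames(values):
    """Return True for frames assigned to the higher-mean (assumed speech) component."""
    weights, means, variances = fit_two_gaussians(values)
    speech = 0 if means[0] > means[1] else 1
    labels = []
    for x in values:
        p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(2)]
        labels.append(p[speech] >= p[1 - speech])
    return labels


# Hypothetical one-dimensional feature values, one per frame.
features = [0.10, 0.20, 0.15, 0.05, 2.00, 2.10, 1.90, 2.20, 0.12, 1.95]
labels = label_frames(features)
```

No annotated data is needed here: the decision boundary falls out of the fitted mixture, which is the property that matters when SAD labels cannot be collected.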
Item: Speech Enhancement for Cochlear Implant Recipients. Wang, Dongmei; Hansen, John H. L.; 0000 0001 1604 5383 (Hansen, JHL); 19968651 (Hansen, JHL)
In this study, a single-microphone speech enhancement algorithm is proposed to improve speech intelligibility for cochlear implant recipients. The proposed algorithm combines harmonic structure estimation with a subsequent statistical speech enhancement stage. Traditional minimum mean square error (MMSE) based speech enhancement methods typically focus on statistical characteristics of the noise and track the noise variance along the time dimension. The MMSE method is usually effective for stationary noise, but less useful for non-stationary noise. To address both stationary and non-stationary noise, the proposed method not only tracks noise over time, but also estimates the noise structure along the frequency dimension by exploiting the harmonic structure of the target speech. Next, the estimated noise is employed in the traditional MMSE framework for speech enhancement. To evaluate the proposed speech enhancement solution, a formal listener evaluation was performed with 6 cochlear implant recipients. The results suggest that a substantial improvement in speech intelligibility can be gained for cochlear implant recipients in noisy environments.

Item: A Study of Voice Production Characteristics of Astronaut Speech During Apollo 11 for Speaker Modeling in Space (Acoustical Society of America, 2017-03-08) Yu, C.; Hansen, John H. L.; 19968651 (Hansen, JHL)
Human physiology has evolved to accommodate environmental conditions, including temperature, pressure, and air chemistry, unique to Earth. However, the environment in space differs significantly from that on Earth and, therefore, variability is expected in astronauts' speech production mechanism.
In this study, the variations of astronaut voice characteristics during the NASA Apollo 11 mission are analyzed. Specifically, acoustical features such as fundamental frequency and phoneme formant structure that are closely related to the speech production system are studied. For a further understanding of astronauts' vocal tract spectrum variation in space, a maximum likelihood frequency warping based analysis is proposed to detect the vocal tract spectrum displacement during space conditions. The results from fundamental frequency, formant structure, and vocal spectrum displacement indicate that astronauts change their speech production mechanism when in space. Moreover, the experimental results for astronaut voice identification tasks indicate that current speaker recognition solutions are highly vulnerable to astronaut voice production variations in space conditions. This study suggests that successful applications of speaker recognition during extended space missions require robust speaker modeling techniques that can effectively adapt to voice production variation caused by diverse space conditions.

Item: The Lombard Effect Observed in Speech Produced by Cochlear Implant Users in Noisy Environments: A Naturalistic Study (Acoustical Society of America) Lee, J.; Ali, H.; Ziaei, A.; Tobey, Emily A.; Hansen, John H. L.
The Lombard effect is an involuntary response speakers experience in the presence of noise during voice communication. This phenomenon is known to cause changes in speech production, such as changes in intensity, pitch structure, and formant characteristics, for enhanced audibility in noisy environments. Although well studied for normal hearing listeners, the Lombard effect has received little, if any, attention in the field of cochlear implants (CIs).
The objective of this study is to analyze the speech production of CI users who are postlingually deafened adults with respect to environmental context. A total of six adult CI users were recruited to produce spontaneous speech in various realistic environments. Acoustic-phonetic analysis was then carried out to characterize their speech production in these environments. The Lombard effect was observed in adverse listening environments in the speech production of all CI users who participated in this study. The results indicate that both suprasegmental (e.g., F0, glottal spectral tilt, and vocal intensity) and segmental (e.g., F1 for /i/ and /u/) features were altered in such environments. The analysis from this study suggests that modification of the speech production of CI users under the Lombard effect may contribute, to some degree, to intelligible communication in adverse noisy environments. © 2017 Acoustical Society of America.

Item: Analysis of Human Scream and Its Impact on Text-Independent Speaker Verification (Acoustical Society of America, 2018-08-20) Hansen, John H. L.; Nandwana, M. K.; Shokouhi, N.; 19968651 (Hansen, JHL)
A scream is defined as a sustained, high-energy vocalization that lacks phonological structure. This lack of phonological structure is how a scream is distinguished from other forms of loud vocalization, such as a "yell." This study investigates the acoustic aspects of screams and addresses those that are known to prevent standard speaker identification systems from recognizing the identity of screaming speakers. It is well established that speaker variability due to changes in vocal effort and the Lombard effect contributes to degraded performance in automatic speech systems (e.g., speech recognition, speaker identification, and diarization). However, previous research in the general area of speaker variability has concentrated on human speech production, whereas less is known about non-speech vocalizations.
The UT-NonSpeech corpus is developed here to investigate speaker verification from scream samples. This study presents a detailed analysis in terms of fundamental frequency, spectral peak shift, frame energy distribution, and spectral tilt. It is shown that traditional speaker recognition based on the Gaussian mixture model-universal background model framework is unreliable when evaluated with screams. © 2017 Author(s).

Item: Robust I-Vector Extraction for Neural Network Adaptation in Noisy Environment (International Speech and Communication Association) Yu, Chengzhu; Ogawa, A.; Delcroix, M.; Yoshioka, T.; Nakatani, T.; Hansen, John H. L.; 19968651 (Hansen, JHL)
In this study, we explore i-vector based adaptation of deep neural networks (DNNs) in noisy environments. We first demonstrate the importance of encapsulating environment and channel variability into i-vectors for DNN adaptation in noisy conditions. To obtain robust i-vectors without losing noise and channel variability information, we investigate the use of parallel feature based i-vector extraction for DNN adaptation. Specifically, different types of features are used separately during two different stages of i-vector extraction, namely universal background model (UBM) state alignment and i-vector computation. To capture noise and channel-specific feature variation, the conventional MFCC features are still used for i-vector computation. However, much more robust features, such as Vector Taylor Series (VTS) enhanced features and bottleneck features, are exploited for UBM state alignment. Experimental results on Aurora-4 show that the parallel feature-based i-vectors yield performance gains of up to 9.2% relative compared to a baseline DNN-HMM system and 3.3% compared to a system using conventional MFCC-based i-vectors.
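The two-stream idea described above can be sketched as follows: component posteriors (the UBM alignment) are computed from one feature stream, and the zeroth- and first-order sufficient statistics that feed i-vector computation are accumulated from a second, time-aligned stream. This is an illustrative sketch under stated assumptions, not the authors' implementation: the UBM is a toy diagonal-covariance mixture given as `(weight, mean, variance)` tuples, all names and values are hypothetical, and the i-vector estimation step itself (which needs a total variability matrix) is omitted.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def posteriors(frame, ubm):
    """Component posteriors of one frame under the UBM (log-sum-exp normalized)."""
    logs = [math.log(w) + log_gauss_diag(frame, m, v) for w, m, v in ubm]
    peak = max(logs)
    exps = [math.exp(l - peak) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_stats(align_frames, stat_frames, ubm):
    """Accumulate statistics from stat_frames using alignments from align_frames."""
    dim = len(stat_frames[0])
    zeroth = [0.0] * len(ubm)
    first = [[0.0] * dim for _ in ubm]
    for a, s in zip(align_frames, stat_frames):
        gamma = posteriors(a, ubm)  # alignment from the robust stream
        for c, g in enumerate(gamma):
            zeroth[c] += g
            for d in range(dim):
                first[c][d] += g * s[d]  # statistics from the MFCC stream
    return zeroth, first


# Toy UBM over a 1-D "robust" feature; values are made up for illustration.
ubm = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
robust = [[0.1], [3.9], [4.2]]                 # alignment stream
mfcc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # statistics stream
zeroth, first = parallel_stats(robust, mfcc, ubm)
```

The design point is that the two streams only need to share frame timing: the alignment can come from noise-robust features while the accumulated statistics still carry the noise and channel variation present in the MFCCs.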
Item: I-Vector Based Physical Task Stress Detection with Different Fusion Strategies (International Speech and Communication Association) Zhang, C.; Liu, G.; Yu, C.; Hansen, John H. L.; 19968651 (Hansen, JHL)
It is common for subjects to produce speech while performing a physical task where speech technology may be used. Variabilities are introduced into speech because a physical task can influence human speech production. These variabilities degrade the performance of most speech systems. It is vital to detect speech under physical stress variabilities for subsequent algorithm processing. This study presents a method for detecting physical task stress from speech. Inspired by the fact that i-vectors can generally model total factors from speech, a state-of-the-art i-vector framework is investigated with MFCCs and our previously formulated TEO-CB-Auto-Env features for neutral/physical task stress detection. Since MFCCs are derived from a linear speech production model and TEO-CB-Auto-Env features employ a nonlinear operator, these two features are believed to have complementary effects on physical task stress detection. Two alternative fusion strategies (feature-level and score-level fusion) are investigated to validate this hypothesis. Experiments over the UT-Scope Physical Corpus demonstrate that a relative accuracy gain of 2.68% is obtained when fusing different feature-based i-vectors. An additional relative performance boost of 6.52% in accuracy is achieved using score-level fusion.

Item: Frequency Offset Correction in Single Sideband (SSB) Speech by Deep Neural Network for Speaker Verification (International Speech and Communication Association) Xing, Hua; Liu, Gang; Hansen, John H. L.; 19968651 (Hansen, JHL)
Communication system mismatch represents a major cause of loss in speaker recognition performance.
This paper considers a type of nonlinear communication system mismatch: modulation/demodulation (Mod/DeMod) carrier drift in single sideband (SSB) speech signals. We focus on the problem of estimating the frequency offset in SSB speech in order to improve speaker verification performance on the drifted speech. Based on a two-step framework from previous work, we propose using a multi-layered neural network architecture, the stacked denoising autoencoder (SDA), to determine the unique interval of the offset value in the first step. Experimental results demonstrate that the SDA-based system can produce up to a +16.1% relative improvement in frequency offset estimation accuracy. A speaker verification evaluation shows a +65.9% relative improvement in EER when the SSB speech signal is compensated with the frequency offset value estimated by the proposed method.

Item: An Unsupervised Visual-Only Voice Activity Detection Approach Using Temporal Orofacial Features (International Speech and Communication Association) Tao, Fei; Hansen, John H. L.; Busso, Carlos
Detecting the presence or absence of speech is an important step toward building robust speech-based interfaces. While previous studies have made progress on voice activity detection (VAD), the performance of these systems degrades significantly when subjects employ challenging speech modes that deviate from normal acoustic patterns (e.g., whisper speech), or in noisy/adverse conditions. An appealing approach under these conditions is visual voice activity detection (VVAD), which detects speech using features characterizing the orofacial activity. This study proposes an unsupervised approach that relies only on visual features and is, therefore, insensitive to vocal style or time-varying background noise.
We estimate optical flow variance and geometrical features around the lips, extracting the short-time zero crossing rates, short-time variances, and delta features over a small temporal window. These variables are fused using principal component analysis (PCA) to obtain a "combo" feature, which displays a bimodal distribution (speech versus silence). A threshold is automatically determined using the expectation-maximization (EM) algorithm. The approach can be easily transformed into a supervised VVAD, if needed. We evaluate the system on neutral and whisper speech. While speech-based VADs generally fail to detect speech activity in whisper speech, given its important acoustic differences, the proposed VVAD achieves near 80% accuracy in both neutral and whisper speech, highlighting the benefits of the system.

Item: A New Front-End for Classification of Non-Speech Sounds: A Study on Human Whistle (International Speech and Communication Association) Nandwana, Mahesh Kumar; Bořil, Hynek; Hansen, John H. L.; 19968651 (Hansen, JHL)
Speech/non-speech sound classification is an important problem in audio diarization, audio document retrieval, and advanced human interfaces. The focus of this study is on the development of spectral and temporal acoustic features for speech/non-speech sound classification based on production differences in speech versus whistle. Seven time- and frequency-domain based features are investigated. Performance of the proposed feature set for the task of speech/whistle classification is evaluated at the frame level. This evaluation utilizes support vector machine (SVM) models and Gaussian mixture models (GMM) as back-end classifiers. At the frame level, the proposed front-end fusion gives an absolute performance gain of +15.0% and +3.1% over MFCC with SVM and GMM based classifiers, respectively.
This research will benefit the development of intelligent speech interfaces for identification, recognition, and speech coding, as a preprocessing step for real-world audio streams.

Item: Evaluation and Calibration of Short-Term Aging Effects in Speaker Verification (International Speech and Communication Association) Kelly, Finnian; Hansen, John H. L.
A speaker verification evaluation is presented on the Multisession Audio Research Project (MARP) corpus, for which speakers were recorded at regular intervals, in consistent conditions, over a period of three years. It is observed that the performance of an i-vector system with probabilistic linear discriminant analysis (PLDA) modelling decreases progressively, in terms of both discrimination and calibration, as the time intervals between train and test sessions increase. For male speakers, the equal error rate (EER) increases from 2.4% to 4.4% when the interval between sessions grows from several months to three years. An extension to conventional linear score calibration is proposed, whereby short-term aging information is incorporated as an additional factor in the score transformation. This new approach improves discrimination and calibration performance in the presence of increasing time intervals between train and test sessions, compared with score-only calibration.

Item: Anti-Spoofing System: An Investigation of Measures to Detect Synthetic and Human Speech (International Speech and Communication Association) Misra, A.; Ranjan, S.; Zhang, C.; Hansen, John H. L.; 19968651 (Hansen, JHL)
Automatic Speaker Verification (ASV) systems are prone to spoofing attacks of various kinds. In this study, we explore the effects of different features and spoofing algorithms on a state-of-the-art i-vector speaker verification system. Our study is based on the standard dataset and evaluation protocols released as part of the ASVspoof 2015 challenge.
We compare how different features perform in detecting both genuine and spoofed speech. We observe that features that contain phase information (Modified Group Delay based features) are better at detecting synthetic speech, and give performance comparable to standard MFCCs. We report an anti-spoofing system that performs well on both known and unknown spoofing attacks.

Item: Laughter and Filler Detection in Naturalistic Audio (International Speech and Communication Association) Kaushik, Lakshmish; Sangwan, Abhijeet; Hansen, John H. L.; 19968651 (Hansen, JHL)
Laughter and fillers are common phenomena in speech and play an important role in communication. In this study, we present Deep Neural Network (DNN) and Convolutional Neural Network (CNN) based systems to classify non-verbal cues (laughter and fillers) from verbal speech in naturalistic audio. We propose improvements over a deep learning system proposed in [1]. In particular, we propose a simple method to combine spectral features with pitch information to capture prosodic and spectral cues for fillers and laughter. Additionally, we propose using a wider time context for feature extraction so that the time evolution of the spectral and prosodic structure can also be exploited for classification. Furthermore, we propose to use a CNN for classification. The new method is evaluated on conversational telephony speech (CTS, drawn from Switchboard and Fisher) data and the UT-Opinion corpus. Our results show that the new system improves the AUC (area under the curve) metric by 8.15% and 11.9% absolute for laughter, and 4.85% and 6.01% absolute for fillers, over the baseline system, for CTS and UT-Opinion data, respectively.
Finally, we analyze the results to explain the difference in performance between traditional CTS data and naturalistic audio (UT-Opinion), and identify challenges that need to be addressed to make systems perform better on practical data.

Item: Physical Task Stress and Speaker Variability in Voice Quality (Springer International Publishing) Godin, Keith W.; Hansen, John H. L.
The presence of physical task stress induces changes in the speech production system, which in turn produce changes in speaking behavior. This results in measurable acoustic correlates, including changes to formant center frequencies, breath pause placement, and fundamental frequency. Many of these changes are due to the subject’s internal competition between speaking and breathing during the performance of the physical task, which has a corresponding impact on muscle control and airflow within the glottal excitation structure as well as the vocal tract articulatory structure. This study considers the effect of physical task stress on voice quality. Three signal processing-based measures are used to assess voice quality: (i) the normalized amplitude quotient (NAQ), (ii) the harmonic richness factor (HRF), and (iii) the fundamental frequency. The effects of physical stress on voice quality depend on the speaker as well as the specific task. While some speakers do not exhibit changes in voice quality, a subset exhibits changes in NAQ and HRF measures of similar magnitude to those observed in studies of soft, loud, and pressed speech. For those speakers demonstrating voice quality changes, the observed changes tend toward breathy or soft voicing, as observed in other studies. The effect of physical stress on the fundamental frequency is correlated with the effect of physical stress on the HRF (r = −0.34) and the NAQ (r = −0.53). Also, the inter-speaker variation in baseline NAQ is significantly higher than the variation in NAQ induced by physical task stress.
The results illustrate systematic changes in speech production under physical task stress, which in theory will impact subsequent speech technology such as speech recognition, speaker recognition, and voice diarization systems.

Item: Compensation of SNR and Noise Type Mismatch using an Environmental Sniffing Based Speech Recognition Solution. Chung, Y.; Hansen, John H. L.; 0000 0001 1604 5383 (Hansen, JHL); 92101568 (Hansen, JHL)
Multiple-model based speech recognition (MMSR) has been shown to be quite successful in noisy speech recognition. Since it employs multiple hidden Markov model (HMM) sets that correspond to various noise types and signal-to-noise ratio (SNR) values, the selected acoustic model can be closely matched with the test noisy speech, which leads to improved performance when compared with other state-of-the-art speech recognition systems that employ a single HMM set. However, as the number of HMM sets is usually limited due to practical considerations as well as effective model selection, acoustic mismatch can still be a problem in MMSR. In this study, we proposed methods to improve recognition performance by mitigating the mismatch in SNR and noise type for an MMSR solution. For the SNR mismatch, an optimal SNR mapping between the test noisy speech and the HMM was determined by experimental investigation. Improved performance was demonstrated by employing the SNR mapping instead of using the estimated SNR of the test noisy speech directly. We also proposed a novel method to reduce the effect of noise type mismatch by compensating the test noisy speech in the log-spectrum domain. We first derived the relation between the log-spectrum vectors in the test and training noisy speech. Since the relation is a non-linear function of the speech and noise parameters, the statistical information regarding the testing log-spectrum vectors was obtained by approximation using the vector Taylor series (VTS) algorithm.
Finally, the minimum mean square error estimation of the training log-spectrum vectors was used to reduce the mismatch between the training and test noisy speech. By employing the proposed methods in the MMSR framework, relative word error rate reductions of 18.7% and 21.3% were achieved on the Aurora 2 task when compared to a conventional MMSR and a multi-condition training (MTR) method, respectively.
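The model-selection step of a multiple-model recognizer, with and without an SNR mapping, can be sketched as below. This is only an illustration under assumptions: the abstract states that the mapping was determined experimentally but does not give its form, so the constant 3 dB offset used here is purely hypothetical, as are the model bank, its names, and the function `select_hmm_set`.

```python
def select_hmm_set(noise_type, estimated_snr_db, model_bank, snr_map=None):
    """Pick the HMM set whose training SNR is closest to the (mapped) test SNR.

    model_bank maps (noise_type, training_snr_db) -> model name.
    snr_map is an optional function applied to the estimated test SNR.
    """
    mapped = snr_map(estimated_snr_db) if snr_map else estimated_snr_db
    candidates = [(snr, name) for (ntype, snr), name in model_bank.items()
                  if ntype == noise_type]
    if not candidates:
        raise KeyError("no HMM set for noise type: " + noise_type)
    return min(candidates, key=lambda c: abs(c[0] - mapped))[1]


# Hypothetical bank of HMM sets indexed by noise type and training SNR (dB).
bank = {
    ("babble", 0): "hmm_babble_0dB",
    ("babble", 10): "hmm_babble_10dB",
    ("babble", 20): "hmm_babble_20dB",
    ("car", 10): "hmm_car_10dB",
}

direct = select_hmm_set("babble", 13.0, bank)
mapped = select_hmm_set("babble", 13.0, bank, snr_map=lambda snr: snr + 3.0)
print(direct, mapped)  # hmm_babble_10dB hmm_babble_20dB
```

With the estimated SNR used directly, the 10 dB set is chosen; after the hypothetical mapping, the 20 dB set is chosen instead, which shows how a mapping can change which acoustic model is matched to the test speech.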