Browsing by Author "Busso, Carlos A."
Now showing 1 - 4 of 4
Item: A Multimodal Analysis of Synchrony During Dyadic Interaction Using a Metric Based on Sequential Pattern Mining (2017-05)
Jakkam, Anil Kumar; Busso, Carlos A.

In human-human interaction, people tend to temporally adapt to each other as the conversation progresses, changing their intonation, speech rate, fundamental frequency, word selection, hand gestures, and head movements. This phenomenon is known as synchrony, convergence, entrainment, or adaptation. Recent studies have investigated this phenomenon at different dimensions and levels for single modalities. However, studying synchrony as an interplay between modalities at a local level between conversational partners remains an open question. This study explores synchrony in dyadic conversations using a multimodal approach based on sequential pattern mining. The analysis considers acoustic, text, and video-based features at the turn level. The proposed data-driven framework identifies frequent sequences containing events from multiple modalities that can quantify the synchrony between conversational partners (e.g., a speaker reduces speech rate when the other utters disfluencies). The evaluation relies on 90 sessions from the Fisher corpus, which comprises telephone conversations between two people, and 54 sessions of audio-visual recordings of dyadic interactions from the MAHNOB MHI-Mimicry database. We develop a multimodal metric to quantify synchrony between conversational partners using this framework. We report results on this metric by comparing actual dyadic conversations with pseudo-interactions, which are artificially created by randomly pairing the speakers. Our results show that the proposed metric captures the temporal evolution of synchrony, identifying non-trivial sequences of events across multimodal features. (A minimal illustrative sketch of this framework appears after the next entry.)

Item: Expressive Speech-Driven Lip Movements with Multitask Learning (Institute of Electrical and Electronics Engineers Inc.)
Sadoughi, Najmeh; Busso, Carlos A.

The orofacial area conveys a range of information, including speech articulation and emotions. These two factors add constraints to the facial movements, creating non-trivial integrations and interplays. To generate more expressive and naturalistic movements for conversational agents (CAs), the relationship between these factors should be carefully modeled. Data-driven models are more appropriate for this task than rule-based systems. This paper provides two deep learning speech-driven structures to integrate speech articulation and emotional cues. The proposed approaches rely on multitask learning (MTL) strategies, where related secondary tasks are jointly solved when synthesizing orofacial movements. In particular, we evaluate emotion recognition and viseme recognition as secondary tasks. The approach creates shared representations that generate behaviors that not only are closer to the original orofacial movements, but also are perceived as more natural than the results from single-task learning.
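As a rough illustration of the sequential-pattern-mining entry above (not the authors' implementation), the sketch below counts cross-speaker event bigrams at the turn level and compares their support in real dyads against pseudo-interactions formed by re-pairing speakers at random. The event labels, function names, and scoring rule are hypothetical.

```python
# Illustrative sketch (not the authors' code): a toy turn-level synchrony
# measure based on frequent cross-speaker event bigrams, compared against
# pseudo-interactions built by re-pairing speakers at random.
import random
from collections import Counter

def cross_speaker_bigrams(turns_a, turns_b):
    """Count pairs (event of speaker A at turn t, event of speaker B at turn t+1).

    Each turns_* list holds one discretized multimodal event per turn,
    e.g. 'speech_rate_down', 'disfluency', 'head_nod' (hypothetical labels).
    """
    counts = Counter()
    for t in range(min(len(turns_a), len(turns_b)) - 1):
        counts[(turns_a[t], turns_b[t + 1])] += 1
    return counts

def pattern_score(counts, min_support=2):
    """Total support of patterns occurring at least `min_support` times."""
    return sum(c for c in counts.values() if c >= min_support)

def synchrony_metric(dyads, num_pseudo=100):
    """Compare real dyads against randomly re-paired pseudo-interactions."""
    real = sum(pattern_score(cross_speaker_bigrams(a, b)) for a, b in dyads)
    pseudo_scores = []
    for _ in range(num_pseudo):
        shuffled = [b for _, b in dyads]
        random.shuffle(shuffled)
        pseudo_scores.append(
            sum(pattern_score(cross_speaker_bigrams(a, b))
                for (a, _), b in zip(dyads, shuffled)))
    baseline = sum(pseudo_scores) / len(pseudo_scores)
    return real / baseline if baseline else float('inf')

# Toy usage: two dyads with per-turn event labels
dyads = [
    (['disfluency', 'fast', 'fast'], ['slow', 'slow', 'slow']),
    (['disfluency', 'fast', 'slow'], ['slow', 'slow', 'fast']),
]
print(synchrony_metric(dyads, num_pseudo=50))
```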
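The multitask idea in the lip-movement entry can be pictured with a small sketch: a shared speech encoder feeds a primary head that regresses orofacial movements and auxiliary heads for emotion and viseme recognition, trained with a weighted multitask loss. The architecture, layer sizes, and loss weights are assumptions for illustration, not the published model.

```python
# Illustrative sketch (assumed architecture, not the published model): a
# speech-driven network with a shared encoder, a primary head regressing
# orofacial (lip) movements, and auxiliary heads for emotion and viseme
# recognition, trained with a weighted multitask loss.
import torch
import torch.nn as nn

class MultitaskLipModel(nn.Module):
    def __init__(self, n_acoustic=40, n_landmarks=30, n_emotions=4, n_visemes=16):
        super().__init__()
        self.encoder = nn.LSTM(n_acoustic, 128, batch_first=True)  # shared representation
        self.lip_head = nn.Linear(128, n_landmarks)      # primary task: lip movements
        self.emotion_head = nn.Linear(128, n_emotions)   # secondary task: emotion
        self.viseme_head = nn.Linear(128, n_visemes)     # secondary task: viseme

    def forward(self, acoustic_frames):
        h, _ = self.encoder(acoustic_frames)              # (batch, time, 128)
        return self.lip_head(h), self.emotion_head(h), self.viseme_head(h)

# Weighted multitask loss: regression for lips, cross-entropy for the side tasks
model = MultitaskLipModel()
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()
x = torch.randn(8, 100, 40)                  # 8 utterances, 100 frames, 40 features
lips = torch.randn(8, 100, 30)
emotions = torch.randint(0, 4, (8, 100))
visemes = torch.randint(0, 16, (8, 100))
lip_pred, emo_pred, vis_pred = model(x)
loss = (mse(lip_pred, lips)
        + 0.3 * ce(emo_pred.reshape(-1, 4), emotions.reshape(-1))
        + 0.3 * ce(vis_pred.reshape(-1, 16), visemes.reshape(-1)))
loss.backward()
```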
Item: Lexical Dependent Emotion Detection Using Synthetic Speech Reference (IEEE-Inst Electrical Electronics Engineers Inc, 2019-02-08)
Lotfian, Reza; Busso, Carlos A.

This paper aims to create neutral reference models from synthetic speech to contrast the emotional content of a speech signal. Modeling emotional behaviors is a challenging task due to the variability in perceiving and describing emotions. Previous studies have indicated that relative assessments are more reliable than absolute assessments. These studies suggest that having a reference signal with known emotional content (e.g., neutral emotion) against which to compare a target sentence may produce more reliable metrics to identify emotional segments. Ideally, we would like to have an emotionally neutral sentence with the same lexical content as the target sentence, with the two aligned in time. In this fictitious scenario, we would be able to identify localized emotional cues by contrasting, frame by frame, the acoustic features of the target and reference sentences. This paper explores the idea of building these reference sentences by leveraging the advances in speech synthesis. The paper builds a synthetic speech signal that conveys the same lexical information and is time-aligned with the target sentence in the database. Since a single synthetic speech signal is not expected to capture the full range of variability observed in neutral speech, we build multiple synthetic sentences using various voices and text-to-speech approaches. The paper analyzes whether the synthesized signals provide valid template references to describe neutral speech using feature analysis and perceptual evaluations. Finally, we demonstrate how this framework can be used in emotion recognition, achieving improvements over classifiers trained with state-of-the-art features in detecting low versus high levels of arousal and valence.
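One way to picture the contrast against synthetic neutral references described above: given frame-level features of the target sentence and of several time-aligned synthetic renditions of the same text, score each frame by its deviation from the neutral distribution. The function name and the z-score-style contrast below are assumptions, not the paper's method.

```python
# Illustrative sketch (an assumption about how the contrast could be computed,
# not the paper's implementation): describe each frame of the target sentence
# by how far it deviates from several time-aligned synthetic neutral
# renditions of the same text.
import numpy as np

def neutral_contrast_features(target, references):
    """target: (frames, dims); references: (num_refs, frames, dims),
    already time-aligned with the target (e.g., via forced alignment)."""
    refs = np.asarray(references)
    mu = refs.mean(axis=0)                      # frame-wise neutral mean
    sigma = refs.std(axis=0) + 1e-6             # frame-wise neutral spread
    return (np.asarray(target) - mu) / sigma    # z-score-like deviation per frame

# Toy example: 200 frames, 2 features (e.g., F0 and energy), 3 synthetic voices
rng = np.random.default_rng(0)
references = rng.normal(size=(3, 200, 2))
target = rng.normal(loc=0.5, size=(200, 2))     # emotional target deviates from neutral
contrast = neutral_contrast_features(target, references)
print(contrast.shape)                            # (200, 2) localized emotional cues
```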
Item: Speech Analysis and Single Channel Enhancement to Improve Speech Intelligibility for Cochlear Implant Recipients (2017-05)
Wang, Dongmei; Hansen, John H. L.; Panahi, Issa M. S.; Busso, Carlos A.; Assmann, Peter F.

Cochlear implant (CI) devices help deaf individuals recover hearing ability through electrode arrays surgically inserted into the inner ear, which stimulate the auditory nerve and transmit sound to the auditory cortex in the brain. CI listeners achieve high speech intelligibility in quiet environments; however, their speech perception degrades dramatically in noisy backgrounds. This is especially true in fluctuating noise, such as competing-speaker or babble noise, where CI users have difficulty understanding speech. One of the reasons is that the low spectral resolution provided by CI encoding strategies is insufficient to distinguish speech components from noise. In this dissertation, we propose a new speech enhancement solution to improve speech intelligibility for CI recipients in noise. Speech energy is primarily carried in the harmonic structure located at integer multiples of the fundamental frequency. To combat noise, we propose to use the harmonic structure as a frequency-domain cue to estimate the noise that degrades the speech. The proposed speech enhancement combines harmonic structure estimation with a traditional statistical model-based solution. This dissertation investigates robust fundamental frequency estimation in noise and integrates this information into the formulation to improve harmonic-based speech enhancement in both stationary and non-stationary noise scenarios.

Noise-robust pitch estimation is proposed based on the temporal harmonic structure in local time-frequency (TF) segments. To reduce the effect of noise, we take advantage of the sparsity of the speech signal, so that only high signal-to-noise ratio (SNR) TF segments are used for pitch estimation. Robust harmonic features are investigated for neural network classification-based pitch estimation. The harmonic features map the pitch candidates into a more separable space for classification. Experimental results show that the proposed pitch estimation method reduces the gross pitch error in noise. Next, harmonic structure estimation is combined with the traditional statistical method for speech enhancement. Noise estimation is performed based on the harmonic structure. The estimated noise variance is employed in a traditional MMSE framework for a priori and a posteriori SNR estimation to obtain a gain function for the target speech. Listening experiments with CI subjects demonstrate improved speech intelligibility for both stationary and non-stationary noise.
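A simplified sketch of the enhancement pipeline described in this abstract: a harmonic mask derived from an estimated F0 separates harmonic from non-harmonic bins, the noise floor is estimated from the non-harmonic bins, and a gain driven by a priori and a posteriori SNR estimates is applied. The 20 Hz tolerance, smoothing factor, and Wiener-style gain are illustrative choices, not the dissertation's exact algorithm.

```python
# Illustrative sketch (simplified, not the dissertation's algorithm): use the
# harmonic structure implied by an estimated F0 to separate harmonic from
# non-harmonic frequency bins, estimate the noise variance from the
# non-harmonic bins, and apply a gain driven by a priori / a posteriori SNR.
import numpy as np

def enhance_frame(noisy_mag, f0_hz, sr=16000, n_fft=512, prev_clean=None, alpha=0.98):
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Mark bins within 20 Hz of any harmonic of the estimated F0 (assumed tolerance)
    harmonic = np.zeros_like(noisy_mag, dtype=bool)
    k = 1
    while k * f0_hz < freqs[-1]:
        harmonic |= np.abs(freqs - k * f0_hz) < 20.0
        k += 1
    # Noise variance estimated from the non-harmonic bins
    noise_var = np.mean(noisy_mag[~harmonic] ** 2) if np.any(~harmonic) else 1e-8
    snr_post = noisy_mag ** 2 / noise_var                        # a posteriori SNR
    prev = prev_clean if prev_clean is not None else noisy_mag
    # Decision-directed a priori SNR estimate
    snr_prio = alpha * prev ** 2 / noise_var + (1 - alpha) * np.maximum(snr_post - 1, 0)
    gain = snr_prio / (1.0 + snr_prio)                           # Wiener-like gain
    return gain * noisy_mag

# Toy usage: one noisy magnitude spectrum frame with an estimated F0 of 150 Hz
frame = np.abs(np.random.randn(257)) + 1.0
print(enhance_frame(frame, f0_hz=150.0).shape)
```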