Busso-Recabarren, Carlos A.
Permanent URI for this collection: https://hdl.handle.net/10735.1/6770
Carlos Busso-Recabarren is a Professor of Electrical Engineering and Principal Investigator of the MSP (Multimodal Signal Processing) Laboratory. His research interests include:
- Modeling and synthesis of human behavior
- Affective state recognition
- Multimodal interfaces
- Sensing participant interaction
- Digital signal processing
- Speech and video processing
Browsing Busso-Recabarren, Carlos A. by Subject "Speech"
Now showing 1 - 3 of 3
Item
Expressive Speech-Driven Lip Movements with Multitask Learning (Institute of Electrical and Electronics Engineers Inc.)
Sadoughi, Najmeh; Busso, Carlos A.
The orofacial area conveys a range of information, including speech articulation and emotions. These two factors add constraints to the facial movements, creating non-trivial integrations and interplays. To generate more expressive and naturalistic movements for conversational agents (CAs), the relationship between these factors should be carefully modeled. Data-driven models are more appropriate for this task than rule-based systems. This paper provides two deep-learning speech-driven structures that integrate speech articulation and emotional cues. The proposed approaches rely on multitask learning (MTL) strategies, where related secondary tasks are jointly solved while synthesizing orofacial movements. In particular, we evaluate emotion recognition and viseme recognition as secondary tasks. The approach creates shared representations that generate behaviors that are not only closer to the original orofacial movements but are also perceived as more natural than the results from single-task learning.
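As a rough illustration of the multitask setup this abstract describes (not the authors' implementation; the layer sizes, feature dimensions, viseme inventory, and loss weight below are assumptions), a shared speech encoder can feed both a primary lip-trajectory regression head and a secondary viseme-classification head, with the two losses summed:

```python
import torch
import torch.nn as nn

class MTLLipModel(nn.Module):
    """Shared speech encoder with a primary lip-movement head and a
    secondary viseme-recognition head (illustrative sizes only)."""
    def __init__(self, n_acoustic=40, n_lip=30, n_visemes=14, hidden=256):
        super().__init__()
        # Shared representation learned from per-frame acoustic features
        self.encoder = nn.LSTM(n_acoustic, hidden, num_layers=2, batch_first=True)
        # Primary task: regress orofacial/lip trajectories frame by frame
        self.lip_head = nn.Linear(hidden, n_lip)
        # Secondary task: per-frame viseme classification
        self.viseme_head = nn.Linear(hidden, n_visemes)

    def forward(self, speech):            # speech: (batch, time, n_acoustic)
        shared, _ = self.encoder(speech)  # (batch, time, hidden)
        return self.lip_head(shared), self.viseme_head(shared)

model = MTLLipModel()
speech = torch.randn(8, 100, 40)               # dummy acoustic features
lips_true = torch.randn(8, 100, 30)            # dummy target trajectories
visemes_true = torch.randint(0, 14, (8, 100))  # dummy viseme labels

lips_pred, viseme_logits = model(speech)
# Joint MTL objective: primary regression loss plus a weighted secondary loss;
# the 0.1 weight is an arbitrary placeholder, not a value from the paper
loss = nn.functional.mse_loss(lips_pred, lips_true) \
     + 0.1 * nn.functional.cross_entropy(viseme_logits.transpose(1, 2), visemes_true)
loss.backward()
```

The intended effect is that gradients from the viseme task shape the shared encoder, which is one common way to realize the "shared representations" the abstract refers to.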
Item
Speech-Driven Animation with Meaningful Behaviors (Elsevier B.V., 2019-04-05)
Sadoughi, Najmeh; Busso, Carlos
Conversational agents (CAs) play an important role in human-computer interaction (HCI). Creating believable movements for CAs is challenging, since the movements have to be meaningful and natural, reflecting the coupling between gestures and speech. Past studies have mainly relied on rule-based or data-driven approaches. Rule-based methods focus on creating meaningful behaviors that convey the underlying message, but the gestures cannot be easily synchronized with speech. Data-driven approaches, especially speech-driven models, can capture the relationship between speech and gestures, but they create behaviors that disregard the meaning of the message. This study proposes to bridge the gap between these two approaches, overcoming their limitations. The approach builds a dynamic Bayesian network (DBN) in which a discrete variable is added to condition the behaviors on an underlying constraint. The study implements and evaluates the approach with two constraints: discourse functions and prototypical behaviors. By constraining on discourse functions (e.g., questions), the model learns the characteristic behaviors associated with a given discourse class, learning the rules from the data. By constraining on prototypical behaviors (e.g., head nods), the approach can be embedded in a rule-based system as a behavior realizer, creating trajectories that are synchronized in time with speech. The study proposes a DBN structure and a training approach that (1) model the cause-effect relationship between the constraint and the gestures, and (2) capture the differences in behaviors across constraints by enforcing sparse transitions between shared and exclusive states per constraint. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained baseline model. © 2019 Elsevier B.V.

Item
Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks (Institute of Electrical and Electronics Engineers Inc., 2019-05-07)
Sadoughi, Najmeh; Busso, Carlos
Articulation, emotion, and personality play strong roles in orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important to carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion, lexical content, and lip movements in a principled manner. The model uses a set of spectral and emotional speech features extracted directly from the speech signal as conditioning inputs, generating realistic movements. A key feature of the approach is that it is a speech-driven framework that does not require transcripts. Our experiments show the superiority of this model over three state-of-the-art baselines in objective and subjective evaluations. When the target emotion is known, we propose to create emotion-dependent models by either adapting the base model with the target emotional data (CSG-Emo-Adapted) or adding emotional conditions as an input to the model (CSG-Emo-Aware). Objective evaluations show improvements for CSG-Emo-Adapted over the CSG model, as its trajectory sequences are closer to the original sequences. Subjective evaluations show significantly better results for this model compared with the CSG model when the target emotion is happiness.
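To make the conditional-GAN idea above concrete, here is a minimal sketch, assuming illustrative feature dimensions and recurrent generator/discriminator pairs (the actual CSG architecture, feature set, and training procedure are those in the paper, not this code): the generator consumes per-frame speech conditioning plus noise and emits lip trajectories, while the discriminator scores (speech, lips) pairs.

```python
import torch
import torch.nn as nn

class CSGGenerator(nn.Module):
    """Sketch of a speech-conditioned sequential generator: per-frame
    speech/emotion features plus noise drive a recurrent lip output."""
    def __init__(self, n_cond=42, n_noise=16, n_lip=30, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_cond + n_noise, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_lip)

    def forward(self, cond, noise):
        # cond:  (batch, time, n_cond)  spectral + emotional speech features
        # noise: (batch, time, n_noise) sampling noise for output diversity
        h, _ = self.rnn(torch.cat([cond, noise], dim=-1))
        return self.out(h)                # (batch, time, n_lip) trajectories

class CSGDiscriminator(nn.Module):
    """Judges whether a lip sequence is real given the conditioning speech."""
    def __init__(self, n_cond=42, n_lip=30, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_cond + n_lip, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, cond, lips):
        h, _ = self.rnn(torch.cat([cond, lips], dim=-1))
        return torch.sigmoid(self.out(h[:, -1]))  # realness score, final state

G, D = CSGGenerator(), CSGDiscriminator()
cond = torch.randn(4, 100, 42)                    # dummy speech conditioning
fake = G(cond, torch.randn(4, 100, 16))           # generated lip trajectories
score = D(cond, fake)                             # adversarial training would
                                                  # alternate G and D updates
```

Conditioning both networks on the speech features is what ties the generated movements to the utterance; an emotion-aware variant in the spirit of CSG-Emo-Aware would simply widen `n_cond` with emotion inputs.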