Synthesizing Naturalistic and Meaningful Speech-Driven Behaviors







Nonverbal behaviors externalized through head, face and body movements for conversational agents (CAs) play an important role in human-computer interaction (HCI). Believable movements for CAs have to be meaningful and natural. Previous studies mainly relied on rule-based or speech-driven approaches. We propose to bridge the gap between these two approaches, overcoming their limitations. We build a dynamic Bayesian network (DBN) with a discrete variable to constrain the behaviors. We implement and evaluate the approach with discourse functions as constraints (e.g., questions). The model learns the characteristic behaviors associated with a given discourse class, learning the rules from the data. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained model. Another problem with speech-driven models is that they require all the potential utterances of the CA to be recorded. Using existing text-to-speech (TTS) systems scales the applications of these methods by providing the flexibility of using text instead of pre-recorded speech. However, training the models with natural speech and testing them with TTS creates a mismatch that affects the performance of the system. We propose a novel strategy to address this mismatch. It starts by creating a parallel corpus with synthetic speech aligned with the original speech for which we have motion capture recordings. This parallel corpus is used to retrain the models from scratch, or to adapt the models built with natural speech. Subjective and objective evaluations show the effectiveness of this solution in reducing the mismatch. In addition to head movements, the face conveys a blend of verbal and nonverbal information, playing an important role in daily interaction. While speech articulation mostly affects the orofacial area, emotional behaviors are externalized across the entire face.
Furthermore, facial muscles connect areas across the face, creating principled relationships and dependencies between the movements that have to be taken into account. Using multi-task learning (MTL), we create speech-driven models that jointly capture the relationship not only between speech and facial movements, but also across facial movements. We build our models with bidirectional long short-term memory (BLSTM) units, which have been shown to be very successful in modeling dependencies for sequential data. Within the face, the orofacial area conveys information including speech articulation and emotions. These two factors add constraints to the facial movements, creating non-trivial integrations and interplays. The relationship between these factors should be modeled to generate more naturalistic movements for CAs. We provide deep learning speech-driven structures to integrate these factors. We use MTL, where related secondary tasks are jointly solved when synthesizing orofacial movements. In particular, we evaluate emotion and viseme recognition as secondary tasks. The approach creates orofacial movements with better objective and subjective performance than baseline models. Taken collectively, this dissertation has made algorithmic advancements in the sequential modeling of speech and body movements, leveraging knowledge extracted from speech to characterize nonverbal behaviors over time.
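
The multi-task idea described above, in which a primary synthesis task is solved jointly with secondary recognition tasks, can be illustrated with a small sketch of a combined training objective. This is only an illustration under stated assumptions: the function names, the use of mean squared error for the movement task, cross-entropy for the emotion and viseme tasks, and the weights `w_emotion` and `w_viseme` are all hypothetical, since the abstract does not specify the loss functions or how the tasks are weighted.

```python
import math


def mse(predicted, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cross_entropy(probabilities, true_class):
    """Negative log-likelihood of the true class under a probability vector."""
    return -math.log(probabilities[true_class])


def multitask_loss(pred_motion, true_motion,
                   emotion_probs, true_emotion,
                   viseme_probs, true_viseme,
                   w_emotion=0.1, w_viseme=0.1):
    """Primary regression loss plus weighted secondary classification losses.

    The weights are assumed hyperparameters, not values from the dissertation.
    """
    primary = mse(pred_motion, true_motion)            # orofacial movement task
    secondary = (w_emotion * cross_entropy(emotion_probs, true_emotion)
                 + w_viseme * cross_entropy(viseme_probs, true_viseme))
    return primary + secondary


if __name__ == "__main__":
    loss = multitask_loss(
        pred_motion=[0.2, 0.4], true_motion=[0.0, 0.5],
        emotion_probs=[0.7, 0.3], true_emotion=0,
        viseme_probs=[0.1, 0.9], true_viseme=1,
    )
    print(f"combined loss: {loss:.4f}")
```

In a setup like this, the secondary terms act as a regularizer on the shared representation: a model that must also recognize emotions and visemes is pushed to encode the articulatory and affective cues that shape the orofacial movements.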



Speech synthesis, Conversation analysis, Sequential processing (Computer science), Computer multitasking, Nonverbal communication, Human-computer interaction, Text-to-speech software