Talking Human Synthesis: Learning Photorealistic Co-speech Motions and Visual Appearances From Videos

Date

2023-05

ORCID

Journal Title

Journal ISSN

Volume Title

Publisher

DOI

Abstract

Talking video synthesis is a cutting-edge technology that enables the creation of highly realistic video sequences of individuals speaking. It has a wide range of applications in fields such as film-making, advertising, gaming, entertainment, and social media, and it is likely to remain an active area of research in the coming years. However, many open questions and challenges remain. In 3D talking face generation, most existing methods can only generate 3D faces with a static head pose, which is inconsistent with how humans perceive faces; only a few works focus on head pose generation, and even these ignore person-specific characteristics. In realistic talking face generation, it is still very challenging to produce photo-realistic talking faces that are indistinguishable from real captured videos, i.e., videos that not only contain synchronized lip motions but also exhibit personalized, natural head movements and eye blinks. In full-body speech video synthesis, although substantial progress has been made in audio-driven talking video synthesis, two major difficulties remain: existing works 1) need a long training sequence (>1 h) to synthesize co-speech gestures, which significantly limits their applicability, and 2) usually fail to generate long sequences, or can generate long sequences only without sufficient diversity.

To address these limitations, my research will be developed in a progressive manner, focusing on three main aspects. First, we will delve into the generation of personalized head poses for 3D talking faces. Second, for realistic 2D talking faces, we propose a generation method that takes an audio signal as input and a short target video clip as a reference, and synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are synchronized with the input audio. Last, we propose a data-efficient ReMix learning method that can be trained on monocular “in-the-wild” short videos to synthesize photo-realistic talking videos with full-body gestures.

To generate personalized head poses for 3D talking faces, we propose a unified audio-driven approach to endow 3D talking faces with personalized pose dynamics. To achieve this goal, we establish an original person-specific dataset that provides corresponding head poses and face shapes for each video. To model implicit face attributes from the input audio, we propose a FACe Implicit Attribute Learning Generative Adversarial Network (FACIAL-GAN), which integrates phonetics-aware, context-aware, and identity-aware information to synthesize 3D face animation with realistic motions of lips, head poses, and eye blinks.

Finally, we make an audio-pose remixed latent space assumption to encourage unpaired audio and pose combinations, which results in diverse “one-to-many” mappings in pose generation. We also develop a dual-function inference scheme that regularizes both the start pose and the general appearance of the next sequence, enabling long-term video generation with full continuity and diversity.

Experimental results indicate that our methods can generate 1) person-specific head pose sequences that are in sync with the input audio and best match human perception of talking heads, 2) realistic talking face videos with not only synchronized lip motions but also natural head movements and eye blinks, and 3) realistic, synchronized full-body talking videos trained with high data efficiency and with better quality than state-of-the-art methods.
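As a rough illustration of the FACIAL-GAN idea summarized above, the following PyTorch-style sketch fuses phonetics-aware (per-frame audio), context-aware (audio window), and identity-aware (per-subject) embeddings into lip expression, head pose, and eye-blink predictions. The module names, encoder choices, and dimensions are illustrative assumptions and are not taken from the dissertation.

import torch
import torch.nn as nn

class ImplicitAttributeGenerator(nn.Module):
    # Conceptual sketch: fuses the three kinds of information FACIAL-GAN is
    # described as integrating (phonetics-aware, context-aware, identity-aware).
    # All layer sizes and output splits below are assumptions.
    def __init__(self, audio_dim=29, context_dim=128, identity_dim=64,
                 expr_dim=64, pose_dim=6, blink_dim=1):
        super().__init__()
        self.phonetic_enc = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU())
        self.context_enc = nn.GRU(audio_dim, context_dim, batch_first=True)
        self.identity_enc = nn.Sequential(nn.Linear(identity_dim, 64), nn.ReLU())
        self.head = nn.Linear(128 + context_dim + 64, expr_dim + pose_dim + blink_dim)
        self.expr_dim, self.pose_dim = expr_dim, pose_dim

    def forward(self, audio_frame, audio_window, identity_code):
        # audio_frame: (B, audio_dim) current-frame features (phonetics-aware)
        # audio_window: (B, T, audio_dim) surrounding frames (context-aware)
        # identity_code: (B, identity_dim) per-subject embedding (identity-aware)
        ph = self.phonetic_enc(audio_frame)
        _, ctx = self.context_enc(audio_window)   # final hidden state: (1, B, context_dim)
        idn = self.identity_enc(identity_code)
        out = self.head(torch.cat([ph, ctx.squeeze(0), idn], dim=-1))
        expr = out[:, :self.expr_dim]                               # lip/expression coefficients
        pose = out[:, self.expr_dim:self.expr_dim + self.pose_dim]  # head pose parameters
        blink = out[:, self.expr_dim + self.pose_dim:]              # eye-blink signal
        return expr, pose, blink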
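Similarly, the sketch below only mimics the two stated ideas for full-body synthesis: remixing unpaired audio and pose clips during training to encourage “one-to-many” mappings, and, at inference, carrying the end pose and appearance of one generated window into the next so that long sequences stay continuous. The helper names and the model interface are hypothetical, not the dissertation's implementation.

import random
import torch

def remix_batch(audio_clips, pose_clips):
    # Pair each audio clip with a pose clip drawn from a (generally) different
    # sample, so training sees unpaired audio-pose combinations.
    idx = list(range(len(pose_clips)))
    random.shuffle(idx)
    return [(audio, pose_clips[j]) for audio, j in zip(audio_clips, idx)]

def generate_long_sequence(model, audio_windows, init_pose, init_appearance):
    # Rollout in the spirit of the dual-function inference scheme: the end pose
    # and appearance of each generated window regularize the start of the next,
    # keeping long sequences continuous. The model interface is assumed.
    pose, appearance, windows = init_pose, init_appearance, []
    for audio in audio_windows:
        with torch.no_grad():
            out = model(audio, start_pose=pose, appearance=appearance)
        windows.append(out["frames"])
        pose = out["poses"][-1]          # carry over end pose
        appearance = out["appearance"]   # carry over appearance code
    return torch.cat(windows, dim=0)     # one long, continuous frame sequence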

Description

Keywords

Computer Science

Sponsorship

Rights

Citation