GENEFACE: GENERALIZED AND HIGH-FIDELITY AUDIO-DRIVEN 3D TALKING FACE SYNTHESIS

Abstract

Generating photo-realistic video portraits driven by arbitrary speech audio is a crucial problem for film-making and virtual reality. Recently, several works have explored neural radiance fields (NeRF) for this task to improve 3D realness and image fidelity. However, the generalizability of previous NeRF-based methods to out-of-domain audio is limited by the small scale of training data. In this work, we propose GeneFace, a generalized and high-fidelity NeRF-based talking face generation method, which can generate natural results corresponding to various out-of-domain audio. Specifically, we learn a variational motion generator on a large lip-reading corpus and introduce a domain-adaptive post-net to calibrate the result. Moreover, we learn a NeRF-based renderer conditioned on the predicted facial motion. A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem. Extensive experiments show that our method achieves more generalized and higher-fidelity talking face generation than previous methods.

1. INTRODUCTION

Audio-driven face video synthesis is an important and challenging problem with several applications such as digital humans, virtual reality (VR), and online meetings. Over the past few years, the community has exploited Generative Adversarial Networks (GAN) as the neural renderer and pushed the frontier from only predicting lip movement (Prajwal et al., 2020; Chen et al., 2019) to generating the whole face (Zhou et al., 2021; Lu et al., 2021). However, GAN-based renderers suffer from several limitations such as unstable training, mode collapse, difficulty in modelling delicate details (Suwajanakorn et al., 2017; Thies et al., 2020), and fixed static head pose (Pham et al., 2017; Taylor et al., 2017; Cudeiro et al., 2019). Recently, the Neural Radiance Field (NeRF) (Mildenhall et al., 2020) has been explored in talking face generation. Compared with GAN-based rendering techniques, NeRF renderers can preserve more details and provide better 3D naturalness since they model a continuous 3D scene in the hidden space. Recent NeRF-based works (Guo et al., 2021; Liu et al., 2022; Yao et al., 2022) manage to learn an end-to-end audio-driven talking face system with only a few-minutes-long video. However, the current end-to-end framework faces two challenges. 1) The first challenge is weak generalizability due to the small scale of training data, which consists of only a few thousand audio-image pairs. This deficiency of training data makes the trained model not robust to out-of-domain (OOD) audio in many applications (such as cross-lingual speech (Guo et al., 2021; Liu et al., 2022) or singing voices). 2) The second challenge is the so-called "mean face" problem. Note that the mapping from audio to its corresponding facial motion is one-to-many, which means the same audio input may have several correct motion patterns.
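The averaging effect of a regression objective on such a one-to-many mapping can be illustrated with a toy example (our own illustration, not part of the paper): fitting a least-squares model to data where the identical audio feature is paired with both a closed mouth (0.0) and an open mouth (1.0) yields the implausible half-open average.

```python
import numpy as np

# Toy one-to-many dataset: the identical audio feature x = 1.0 is paired
# with two valid mouth-opening targets, fully closed (0.0) and fully open (1.0).
x = np.array([[1.0], [1.0]])
y = np.array([0.0, 1.0])

# Least-squares (MSE) regression: w = argmin ||x w - y||^2
w, *_ = np.linalg.lstsq(x, y, rcond=None)

prediction = float(x[0] @ w)
print(prediction)  # 0.5 -> the "mean face": a half-opened mouth
```

A probabilistic model that can sample from the conditional distribution (such as a VAE) avoids collapsing the two modes into their mean.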
Learning such a mapping with a regression-based model leads to over-smoothed and blurry results (Ren et al., 2021); specifically, for complicated audio with several potential outputs, it tends to generate an image with a half-opened, blurry mouth, which harms both image quality and lip synchronization. To summarize, current NeRF-based methods are challenged by the weak generalizability problem due to the lack of audio-to-motion training data and by "mean face" results due to the one-to-many mapping. In this work, we develop a talking face generation system called GeneFace to address these two challenges. To handle the weak generalizability problem, we devise an audio-to-motion model to predict 3D facial landmarks given the input audio. We utilize hundreds of hours of audio-motion pairs from a large-scale lip-reading dataset (Afouras et al., 2018) to learn a robust mapping. As for the "mean face" problem, instead of using a regression-based model, we adopt a variational auto-encoder (VAE) with a flow-based prior as the architecture of the audio-to-motion model, which helps generate accurate and expressive facial motions. However, due to the domain shift between the generated landmarks (in the multi-speaker domain) and the training set of NeRF (in the target person domain), we found that the NeRF-based renderer fails to generate high-fidelity frames given the predicted landmarks. Therefore, a domain adaptation process is proposed to rig the predicted landmarks into the target person's distribution. To summarize, our system consists of three stages:

1) Audio-to-motion. We present a variational motion generator to generate accurate and expressive facial landmarks given the input audio.
2) Motion domain adaptation. To overcome the domain shift, we propose a semi-supervised adversarial training pipeline to train a domain-adaptive post-net, which refines the predicted 3D landmarks from the multi-speaker domain into the target person domain.
3) Motion-to-image. We design a NeRF-based renderer to render high-fidelity frames conditioned on the predicted 3D landmarks.

The main contributions of this paper are summarized as follows:

• We present a three-stage framework that enables the NeRF-based talking face system to benefit from a large-scale lip-reading corpus and achieve high generalizability to various OOD audio. We propose an adversarial domain adaptation pipeline to bridge the domain gap between the large corpus and the target person video.
• We are the first work to analyze the "mean face" problem induced by the one-to-many audio-to-motion mapping in the talking face generation task. To handle this problem, we design a variational motion generator that produces accurate facial landmarks with rich details and expressiveness.
• Experiments show that GeneFace outperforms state-of-the-art GAN-based and NeRF-based baselines on both objective and subjective metrics.
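The three-stage pipeline described above can be sketched as follows. All class names, dimensions, and the linear maps inside each stage are placeholder assumptions for illustration only; the actual system uses a VAE with a flow prior, an adversarially trained post-net, and a NeRF renderer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions (assumptions, not from the paper):
AUDIO_DIM, LM_DIM, IMG_SIDE = 80, 68 * 3, 64

class VariationalMotionGenerator:
    """Stage 1: audio -> 3D landmarks. Stands in for the VAE with flow prior;
    it samples around a projected mean rather than regressing a single output."""
    def __init__(self):
        self.proj = rng.standard_normal((AUDIO_DIM, LM_DIM)) * 0.01
    def __call__(self, audio):
        mean = audio @ self.proj
        return mean + 0.01 * rng.standard_normal(LM_DIM)

class PostNet:
    """Stage 2: domain adaptation, pulling multi-speaker landmarks
    toward the target person's landmark distribution."""
    def __init__(self, target_mean):
        self.target_mean = target_mean
    def __call__(self, landmarks):
        return 0.5 * landmarks + 0.5 * self.target_mean

class NeRFRenderer:
    """Stage 3: landmark-conditioned rendering (here just a linear map)."""
    def __init__(self):
        self.proj = rng.standard_normal((LM_DIM, IMG_SIDE * IMG_SIDE)) * 0.01
    def __call__(self, landmarks):
        return (landmarks @ self.proj).reshape(IMG_SIDE, IMG_SIDE)

audio = rng.standard_normal(AUDIO_DIM)           # one audio frame
target_mean = rng.standard_normal(LM_DIM) * 0.1  # target-person statistics

frame = NeRFRenderer()(PostNet(target_mean)(VariationalMotionGenerator()(audio)))
print(frame.shape)  # (64, 64)
```

The key design choice this sketch captures is the interface between the stages: only landmarks cross the boundary, so the motion generator can be trained on a large multi-speaker corpus while the renderer is trained on the target-person video alone.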

2. RELATED WORK

Our approach is a 3D talking face system that utilizes a generative model to predict a 3DMM-based motion representation given the driving audio and employs a neural radiance field to render the corresponding images of a human head. It is related to recent approaches to audio-driven talking head generation and scene representation networks for the human portrait. Audio-driven Talking Head Generation With the progress of audio synthesis technology (Ye et al., 2022; Huang et al., 2022a; b; c) and generative models (Zhang et al., 2022; 2023), generating talking faces in line with input audio has attracted much attention in the computer vision community. Earlier works focus on synthesizing the lip motions on a static facial image (Jamaludin et al., 2019; Tony Ezzat & Poggio, 2002; Vougioukas et al., 2020). Then the frontier was pushed to synthesizing the full head (Yu et al., 2020; Zhou et al., 2019; 2020). However, free pose control is not feasible in these methods due to the lack of 3D modeling. With the development of 3D face reconstruction techniques (Deng et al., 2019), many works explore extracting a 3D Morphable Model (3DMM) (Paysan et al., 2009) from the monocular video to represent the facial movement (Tero Karras & Lehtinen, 2017; Yi et al., 2020) in the talking face system; these are known as model-based methods. With 3DMM, a coarse 3D face mesh M can be represented as an affine model of facial expression and identity codes: M = M̄ + B_id · i + B_exp · e, where M̄ is the average face shape, B_id and B_exp are the PCA bases of identity and expression, and i and e are the corresponding identity and expression codes.
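The affine 3DMM formula above can be written out directly. The basis matrices and codes below are random placeholders for shape-checking; in practice B_id and B_exp come from a learned morphable model such as the Basel Face Model, and the codes are regressed from images.

```python
import numpy as np

rng = np.random.default_rng(0)

N_VERTS, N_ID, N_EXP = 1000, 80, 64  # placeholder sizes, not from the paper

M_bar = rng.standard_normal(3 * N_VERTS)           # mean face shape (flattened xyz)
B_id = rng.standard_normal((3 * N_VERTS, N_ID))    # identity PCA basis
B_exp = rng.standard_normal((3 * N_VERTS, N_EXP))  # expression PCA basis

i = rng.standard_normal(N_ID) * 0.1   # identity code for one person
e = rng.standard_normal(N_EXP) * 0.1  # expression code for one frame

# M = M_bar + B_id i + B_exp e  (affine 3DMM)
M = M_bar + B_id @ i + B_exp @ e
vertices = M.reshape(N_VERTS, 3)      # per-vertex xyz positions
print(vertices.shape)  # (1000, 3)
```

Because the model is affine, identity and expression can be controlled independently: holding i fixed and varying e animates the same person's face over time.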

