GENEFACE: GENERALIZED AND HIGH-FIDELITY AUDIO-DRIVEN 3D TALKING FACE SYNTHESIS

Abstract

Generating photo-realistic video portraits synchronized with arbitrary speech audio is a crucial problem for film-making and virtual reality. Recently, several works have explored the use of neural radiance fields (NeRF) in this task to improve 3D realness and image fidelity. However, the generalizability of previous NeRF-based methods to out-of-domain audio is limited by the small scale of the training data. In this work, we propose GeneFace, a generalized and high-fidelity NeRF-based talking face generation method that produces natural results for various out-of-domain audio. Specifically, we learn a variational motion generator on a large lip-reading corpus and introduce a domain adaptive post-net to calibrate the result. Moreover, we learn a NeRF-based renderer conditioned on the predicted facial motion. A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem. Extensive experiments show that our method achieves more generalized and higher-fidelity talking face generation than previous methods.

1. INTRODUCTION

Audio-driven face video synthesis is an important and challenging problem with applications such as digital humans, virtual reality (VR), and online meetings. Over the past few years, the community has exploited Generative Adversarial Networks (GANs) as the neural renderer and pushed the frontier from predicting only the lip movement (Prajwal et al., 2020; Chen et al., 2019) to generating the whole face (Zhou et al., 2021; Lu et al., 2021). However, GAN-based renderers suffer from several limitations, such as unstable training, mode collapse, difficulty in modelling delicate details (Suwajanakorn et al., 2017; Thies et al., 2020), and a fixed static head pose (Pham et al., 2017; Taylor et al., 2017; Cudeiro et al., 2019).

Recently, the Neural Radiance Field (NeRF) (Mildenhall et al., 2020) has been explored for talking face generation. Compared with GAN-based rendering techniques, NeRF renderers preserve more details and provide better 3D naturalness, since they model a continuous 3D scene in a latent space. Recent NeRF-based works (Guo et al., 2021; Liu et al., 2022; Yao et al., 2022) manage to learn an end-to-end audio-driven talking face system from only a few-minutes-long video. However, the current end-to-end framework faces two challenges. 1) The first challenge is weak generalizability due to the small scale of training data, which consists of only a few thousand audio-image pairs. This deficiency of training data makes the trained model not robust to out-of-domain (OOD) audio in many applications (such as cross-lingual speech (Guo et al., 2021; Liu et al., 2022) or singing voices). 2) The second challenge is the so-called "mean face" problem. Note that the mapping from audio to its corresponding facial motion is one-to-many, which means the same audio input may have several correct motion patterns. Learning such a mapping with a regression-based model therefore tends to average over the plausible motion modes, yielding over-smoothed results.
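To make the conditioned-NeRF idea above concrete: a radiance field maps a 3D point, a viewing direction, and a driving condition (here an audio or motion code) to a color and a volume density. The sketch below is a hypothetical, untrained stand-in with random weights and an arbitrary 16-dimensional condition code; it illustrates only the shape of such a field, not the architecture of GeneFace or any cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

def positional_encoding(x, n_freqs=4):
    # Standard NeRF-style encoding: raw coords plus sin/cos at octave frequencies.
    feats = [x]
    for k in range(n_freqs):
        feats.append(np.sin(2.0 ** k * x))
        feats.append(np.cos(2.0 ** k * x))
    return np.concatenate(feats, axis=-1)

# Hypothetical tiny MLP F(x, d, a) -> (rgb, sigma). Input size:
# encoded 3D point (3 * (1 + 2*4) = 27) + view direction (3) + condition code (16).
IN_DIM = 27 + 3 + 16
W1 = rng.normal(0.0, 0.1, (IN_DIM, 64))
W2 = rng.normal(0.0, 0.1, (64, 4))

def field(point, view_dir, cond_code):
    h = np.concatenate([positional_encoding(point), view_dir, cond_code])
    h = np.tanh(h @ W1)
    out = h @ W2
    rgb = 1.0 / (1.0 + np.exp(-out[:3]))  # sigmoid keeps color in (0, 1)
    sigma = np.log1p(np.exp(out[3]))      # softplus keeps density non-negative
    return rgb, sigma

rgb, sigma = field(np.zeros(3), np.array([0.0, 0.0, 1.0]), np.zeros(16))
```

Because the condition code enters the field directly, changing the driving audio/motion deforms the rendered scene without any per-frame mesh or texture, which is the property the NeRF-based works above exploit.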
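The "mean face" effect described above can be shown with a toy example: when one input is paired with several equally valid targets, the L2-optimal prediction is their conditional mean, which matches none of them. The dataset below is illustrative, not from any talking-face corpus.

```python
import numpy as np

# Toy one-to-many dataset: the same audio feature (x = 0) appears with two
# equally valid motion targets, e.g. mouth slightly open (-1.0) vs. wide
# open (+1.0). Labels are hypothetical, for illustration only.
y = np.concatenate([np.full(100, -1.0), np.full(100, 1.0)])

# A regressor trained with an L2 loss converges to the conditional mean
# E[y | x]; for this input the optimal constant prediction is:
pred = y.mean()
print(pred)  # 0.0 -- the "mean face": neither of the two correct motions
```

This is why the abstract adopts a variational motion generator: a probabilistic model can sample one of the valid modes instead of averaging them into an over-smoothed, implausible motion.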

