PV3D: A 3D GENERATIVE MODEL FOR PORTRAIT VIDEO GENERATION

Abstract

Recent advances in generative adversarial networks (GANs) have demonstrated the capability of generating stunning photo-realistic portrait images. While some prior works have applied such image GANs to unconditional 2D portrait video generation and static 3D portrait synthesis, few works have successfully extended GANs to generating 3D-aware portrait videos. In this work, we propose PV3D, the first generative framework that can synthesize multi-view consistent portrait videos. Specifically, our method extends a recent static 3D-aware image GAN to the video domain by generalizing the 3D implicit neural representation to model the spatio-temporal space. To introduce motion dynamics into the generation process, we develop a motion generator that stacks multiple motion layers to synthesize motion features via modulated convolution. To alleviate motion ambiguities caused by camera/human motions, we propose a simple yet effective camera condition strategy for PV3D, enabling temporally and multi-view consistent video generation. Moreover, PV3D introduces two discriminators that regularize the spatial and temporal domains to ensure the plausibility of the generated portrait videos. These elaborated designs enable PV3D to generate 3D-aware, motion-plausible portrait videos with high-quality appearance and geometry, significantly outperforming prior works. As a result, PV3D is able to support downstream applications such as static portrait animation and view-consistent motion editing.

1. INTRODUCTION

Recent progress in generative adversarial networks (GANs) has led human portrait generation to unprecedented success (Karras et al., 2020; 2021; Skorokhodov et al., 2022) and has spawned many industrial applications (Tov et al., 2021; Richardson et al., 2021). Generating portrait videos has emerged as the next challenge for deep generative models, with wider applications such as video manipulation (Abdal et al., 2022) and animation (Siarohin et al., 2019). A long line of work has been proposed to either learn a direct mapping from a latent code to a portrait video (Vondrick et al., 2016; Saito et al., 2017) or decompose portrait video generation into two stages, i.e., content synthesis and motion generation (Tian et al., 2021; Tulyakov et al., 2018; Skorokhodov et al., 2022). Despite offering plausible results, such methods only produce 2D videos without considering the underlying 3D geometry, which is the most desirable attribute for broad applications such as portrait reenactment (Doukas et al., 2021), talking face animation (Siarohin et al., 2019), and VR/AR (Cao et al., 2022). Current methods typically create 3D portrait videos through classical graphics techniques (Wang et al., 2021b; Ma et al., 2021; Grassal et al., 2022), which require multi-camera systems, well-controlled studios, and heavy artist work. In this work, we aim to alleviate the effort of creating high-quality 3D-aware portrait videos by learning from 2D monocular videos only, without the need for any 3D or multi-view annotations.

Figure 1: Our PV3D can generate photo-realistic portrait videos with diverse motions and dynamic 3D geometry. We render surfaces extracted by marching cubes. The video frames and shape (normal map) can be rendered from arbitrary viewpoints. Please see our project page for video results.
Recent 3D-aware portrait generative methods have witnessed rapid advances (Schwarz et al., 2020; Gu et al., 2022; Chan et al., 2021; Niemeyer & Geiger, 2021; Or-El et al., 2022; Chan et al., 2022). By integrating implicit neural representations (INRs) (Sitzmann et al., 2020; Mildenhall et al., 2020) into GANs (Karras et al., 2019; 2020), they can produce photo-realistic and multi-view consistent results. However, such methods are limited to static portrait generation and can hardly be extended to portrait video generation due to several challenges: 1) it remains unclear how to effectively model dynamic 3D human portraits in a generative framework; 2) learning dynamic 3D geometry without 3D supervision is highly under-constrained; 3) entanglement between camera movements and human motions/expressions introduces ambiguities into the training process. To this end, we propose a 3D Portrait Video generation model (PV3D), the first method that can generate high-quality 3D portrait videos with diverse motions while learning purely from monocular 2D videos. PV3D enables 3D portrait video modeling by extending the 3D tri-plane representation (Chan et al., 2022) to the spatio-temporal domain. In this paper, we comprehensively analyze various design choices and arrive at a set of novel designs, including decomposing latent codes into appearance and motion components, a temporal tri-plane based motion generator, proper camera pose sequence conditioning, and camera-conditioned video discriminators, which significantly improve the video fidelity and geometry quality of 3D portrait video generation. As shown in Figure 1, despite being trained on only monocular 2D videos, PV3D can generate a large variety of photo-realistic portrait videos under arbitrary viewpoints with diverse motions and high-quality 3D geometry.
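To make the motion-layer design above concrete, the following is a minimal NumPy sketch of a StyleGAN2-style modulated 1x1 convolution of the kind the motion generator stacks to synthesize motion features. This is not the authors' implementation; the function name, shapes, and the per-sample `style` input (e.g., derived from a motion latent code) are illustrative assumptions.

```python
import numpy as np

def modulated_conv1x1(x, weight, style, eps=1e-8):
    """Sketch of a modulated 1x1 convolution (StyleGAN2-style).

    x:      (N, C_in, H, W) input feature maps
    weight: (C_out, C_in)   shared 1x1 conv weight
    style:  (N, C_in)       per-sample modulation, e.g. from a motion code
    """
    # Modulate: scale the input channels of the weight per sample.
    w_mod = weight[None] * style[:, None, :]                # (N, C_out, C_in)
    # Demodulate: renormalize each output filter to unit norm.
    demod = 1.0 / np.sqrt((w_mod ** 2).sum(axis=2) + eps)   # (N, C_out)
    w_mod = w_mod * demod[:, :, None]
    # A 1x1 convolution is per-pixel channel mixing.
    return np.einsum("noi,nihw->nohw", w_mod, x)            # (N, C_out, H, W)

# Tiny smoke test with random tensors (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8, 8))
w = rng.standard_normal((6, 4))
s = rng.standard_normal((2, 4))
y = modulated_conv1x1(x, w, s)
print(y.shape)  # → (2, 6, 8, 8)
```

The key property is that the motion code steers the convolution through its weights rather than through concatenated inputs, which is why stacking such layers can inject motion dynamics into an otherwise static generator.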
Comprehensive experiments on various datasets, including VoxCeleb (Nagrani et al., 2017), CelebV-HQ (Zhu et al., 2022), and TalkingHead-1KH (Wang et al., 2021a), demonstrate the superiority of PV3D over previous state-of-the-art methods, both qualitatively and quantitatively. Notably, it achieves 29.1 FVD on VoxCeleb, improving upon a concurrent work, 3DVidGen (Bahmani et al., 2022), by 55.6%. PV3D can also generate high-quality 3D geometry, achieving the best multi-view identity similarity and warping error across all datasets. Our contributions are three-fold. 1) To the best of our knowledge, PV3D is the first method capable of generating a large variety of 3D-aware portrait videos with high-quality appearance, motions, and geometry. 2) We propose a novel temporal tri-plane based video generation framework that can synthesize 3D-aware portrait videos by learning from 2D videos only. 3) We demonstrate state-of-the-art 3D-aware portrait video generation on three datasets. Moreover, our PV3D supports several downstream applications, i.e., static image animation, monocular video reconstruction, and multi-view consistent motion editing.
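For readers unfamiliar with the tri-plane representation that PV3D extends to the spatio-temporal domain, the following NumPy sketch shows the basic static operation: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the results are summed. This is a generic illustration under assumed shapes and names, not the paper's code; the temporal generalization and rendering pipeline are omitted.

```python
import numpy as np

def sample_triplane(planes, pts):
    """Sample features for 3D points from a tri-plane representation (sketch).

    planes: dict with keys "xy", "xz", "yz", each a (C, R, R) feature grid
    pts:    (N, 3) points with coordinates in [-1, 1]
    Returns (N, C): sum of bilinear samples from the three planes.
    """
    def bilinear(grid, uv):
        c, r, _ = grid.shape
        # Map [-1, 1] to continuous grid coordinates [0, R-1].
        xy = (uv + 1.0) * 0.5 * (r - 1)
        x0 = np.clip(np.floor(xy).astype(int), 0, r - 2)
        fu, fv = (xy - x0)[:, 0], (xy - x0)[:, 1]
        u0, v0 = x0[:, 0], x0[:, 1]
        f00, f10 = grid[:, v0, u0], grid[:, v0, u0 + 1]
        f01, f11 = grid[:, v0 + 1, u0], grid[:, v0 + 1, u0 + 1]
        top = f00 * (1 - fu) + f10 * fu
        bot = f01 * (1 - fu) + f11 * fu
        return (top * (1 - fv) + bot * fv).T  # (N, C)

    x, y, z = pts[:, 0:1], pts[:, 1:2], pts[:, 2:3]
    feat = bilinear(planes["xy"], np.concatenate([x, y], axis=1))
    feat += bilinear(planes["xz"], np.concatenate([x, z], axis=1))
    feat += bilinear(planes["yz"], np.concatenate([y, z], axis=1))
    return feat

# Smoke test: constant planes of ones give a summed feature of 3 everywhere.
planes = {k: np.ones((8, 16, 16)) for k in ("xy", "xz", "yz")}
pts = np.array([[0.0, 0.0, 0.0], [0.5, -0.3, 0.9]])
feats = sample_triplane(planes, pts)
print(feats.shape)  # → (2, 8)
```

The appeal of this factorization is that three 2D grids replace a dense 3D volume, which is what makes generating the representation with 2D convolutional backbones tractable.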

2. RELATED WORK

2D video generation. Early video generation works (Vondrick et al., 2016; Saito et al., 2017) propose learning a video generator that transforms random vectors directly into video clips. More recent video generation works adopt a similar paradigm for designing the video generator but disentangle the video content from the motion (trajectory), controlling them with different random noises. For the video

Availability: https://showlab.github.io/pv3d

