PV3D: A 3D GENERATIVE MODEL FOR PORTRAIT VIDEO GENERATION

Abstract

Recent advances in generative adversarial networks (GANs) have demonstrated the capability of generating stunning photo-realistic portrait images. While some prior works have applied such image GANs to unconditional 2D portrait video generation and static 3D portrait synthesis, few works have successfully extended GANs to generate 3D-aware portrait videos. In this work, we propose PV3D, the first generative framework that can synthesize multi-view consistent portrait videos. Specifically, our method extends a recent static 3D-aware image GAN to the video domain by generalizing the 3D implicit neural representation to model the spatio-temporal space. To introduce motion dynamics into the generation process, we develop a motion generator that stacks multiple motion layers to synthesize motion features via modulated convolution. To alleviate motion ambiguities caused by camera/human motions, we propose a simple yet effective camera condition strategy for PV3D, enabling both temporally and multi-view consistent video generation. Moreover, PV3D introduces two discriminators that regularize the spatial and temporal domains to ensure the plausibility of the generated portrait videos. These designs enable PV3D to generate 3D-aware, motion-plausible portrait videos with high-quality appearance and geometry, significantly outperforming prior works. As a result, PV3D is able to support downstream applications such as static portrait animation and view-consistent motion editing.
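The abstract mentions that motion features are synthesized via modulated convolution. As a point of reference, the sketch below shows the StyleGAN2-style modulate/demodulate operation that such motion layers typically build on; the function name, NumPy implementation, and the 1x1-kernel demo are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def modulated_conv2d(x, weight, style, demodulate=True, eps=1e-8):
    """StyleGAN2-style modulated convolution (naive NumPy sketch).

    x:      input features, shape (C_in, H, W)
    weight: convolution kernel, shape (C_out, C_in, k, k)
    style:  per-input-channel modulation vector, shape (C_in,)
    """
    # 1) Modulate: scale each input channel of the kernel by its style value.
    w = weight * style[None, :, None, None]
    # 2) Demodulate: normalize each output filter to unit L2 norm,
    #    which stabilizes activation magnitudes across layers.
    if demodulate:
        d = 1.0 / np.sqrt((w ** 2).sum(axis=(1, 2, 3)) + eps)
        w = w * d[:, None, None, None]
    # 3) Naive "valid" convolution (no padding); a k=1 kernel keeps H, W.
    c_out, c_in, k, _ = w.shape
    h, wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(c_in):
            for u in range(k):
                for v in range(k):
                    out[o] += w[o, i, u, v] * x[i, u:u + h, v:v + wd]
    return out
```

In a motion generator, `style` would come from a motion latent code via a learned affine mapping, so the same kernel produces different motion features per frame; the loops above are for clarity only and would be a batched grouped convolution in practice.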

1. INTRODUCTION

Recent progress in generative adversarial networks (GANs) has led human portrait generation to unprecedented success (Karras et al., 2020; 2021; Skorokhodov et al., 2022) and has spawned numerous industrial applications (Tov et al., 2021; Richardson et al., 2021). Generating portrait videos has emerged as the next challenge for deep generative models, with wider applications such as video manipulation (Abdal et al., 2022) and animation (Siarohin et al., 2019). A long line of work has been proposed to either learn a direct mapping from a latent code to a portrait video (Vondrick et al., 2016; Saito et al., 2017) or decompose portrait video generation into two stages, i.e., content synthesis and motion generation (Tian et al., 2021; Tulyakov et al., 2018; Skorokhodov et al., 2022). Despite offering plausible results, such methods only produce 2D videos without considering the underlying 3D geometry, which is the most desirable attribute for broad applications such as portrait reenactment (Doukas et al., 2021), talking face animation (Siarohin et al., 2019), and VR/AR (Cao et al., 2022). Current methods typically create 3D portrait videos through classical graphics techniques (Wang et al., 2021b; Ma et al., 2021; Grassal et al., 2022), which require multi-camera systems, well-controlled studios, and heavy manual artist work. In this work, we aim to alleviate the effort of creating high-quality 3D-aware portrait videos by learning from 2D monocular videos only, without the need for any 3D or multi-view annotations.

Code availability: https://showlab.github.io/pv3d.

