3D-AWARE VIDEO GENERATION

Abstract

Generative models have emerged as an essential building block for many image synthesis and editing tasks. Recent advances in this field have also enabled high-quality 3D and video content to be generated with multi-view or temporal consistency. In this work, we explore 4D generative adversarial networks (GANs) that learn unconditional generation of 3D-aware videos. By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D video content supervised only by monocular videos. We show that our method learns a rich embedding of decomposable 3D structures and motions that enables new visual effects of spatio-temporal rendering while producing imagery with quality comparable to that of existing 3D or video GANs.



Figure 1: 3D-aware video generation. We show multiple frames and viewpoints of two 3D videos generated using our model trained on the FaceForensics dataset (Rössler et al., 2019). Our 4D GAN generates high-quality 3D content while permitting control over time and camera extrinsics. Video results can be viewed in our supplementary HTML.

1. INTRODUCTION

Recent advances in generative adversarial networks (GANs) (Goodfellow et al., 2014) have led to the artificial synthesis of photorealistic images (Karras et al., 2019; 2020; 2021). These methods have been extended to enable unconditional generation of high-quality videos (Chen et al., 2021a; Yu et al., 2022) and multi-view-consistent 3D scenes (Gu et al., 2022; Chan et al., 2022; Or-El et al., 2022). However, despite important applications in visual effects, computer vision, and other fields, no generative model to date has been demonstrated to successfully synthesize 3D videos.

We propose the first 4D GAN that learns to generate multi-view-consistent video data from single-view videos. For this purpose, we develop a 3D-aware video generator that synthesizes 3D content animated with learned motion priors and permits viewpoint manipulation. Two key elements of our framework are a time-conditioned 4D generator that leverages emerging neural implicit scene representations (Park et al., 2019; Mescheder et al., 2019; Mildenhall et al., 2020) and a time-aware video discriminator. Our generator takes as input two latent codes, for 3D identity and motion respectively, and outputs a 4D neural field that can be queried continuously at any spatio-temporal xyzt coordinate. The generated 4D field can then be used to render realistic video frames from arbitrary camera viewpoints. To train the 4D GAN, we use a discriminator that takes two randomly sampled video frames from the generator (or from real videos), along with their time difference, to score the realism of the motion. Our model is trained with an adversarial loss in which the discriminator encourages the generator to render realistic videos across all sampled camera viewpoints.

Our contributions are the following: i) We introduce the first 4D GAN, which generates 3D-aware videos supervised only by single-view 2D videos. ii) We develop a framework combining implicit fields with a time-aware discriminator that can be rendered continuously at any xyzt coordinate. iii) We evaluate the effectiveness of our approach on challenging, unstructured video datasets, showing that the trained 4D GAN synthesizes plausible videos that allow viewpoint changes, with visual and motion quality competitive with state-of-the-art 2D video GANs.
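The generator–discriminator interface described above can be sketched as follows. This is a minimal NumPy toy, not the paper's actual architecture: the network sizes, weight initialization, and all function names here are illustrative placeholders. It only shows the two key interfaces, a 4D field queried at continuous xyzt coordinates conditioned on identity and motion codes, and a discriminator scoring a frame pair together with its time difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the paper's networks (dimensions are illustrative).
DIM_ID, DIM_MOTION, HIDDEN = 8, 8, 32
W1 = rng.normal(0, 0.1, (4 + DIM_ID + DIM_MOTION, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, 4))  # outputs (r, g, b, sigma)

def generator_field(xyzt, z_id, z_motion):
    """Query the 4D field at continuous (x, y, z, t) coordinates.

    xyzt: (N, 4) array; returns (N, 3) colors and (N,) densities.
    """
    h = np.concatenate([xyzt,
                        np.tile(z_id, (len(xyzt), 1)),
                        np.tile(z_motion, (len(xyzt), 1))], axis=1)
    out = np.maximum(h @ W1, 0) @ W2          # tiny two-layer MLP
    rgb = 1 / (1 + np.exp(-out[:, :3]))       # colors in [0, 1]
    sigma = np.log1p(np.exp(out[:, 3]))       # non-negative density
    return rgb, sigma

def discriminator(frame_a, frame_b, dt):
    """Score the realism of a frame pair given its time difference.

    A real discriminator is a convolutional network; this stub only
    illustrates the (frame, frame, delta-t) input interface.
    """
    feat = np.concatenate([frame_a.ravel(), frame_b.ravel(), [dt]])
    return float(np.tanh(feat.mean()))

# One "training" query: sample two timestamps, evaluate the field at both
# using the same latent codes, and score the pair with its time difference.
z_id, z_motion = rng.normal(size=DIM_ID), rng.normal(size=DIM_MOTION)
t1, t2 = 0.1, 0.6
pts = rng.uniform(-1, 1, (64, 3))             # toy ray sample points
frames = []
for t in (t1, t2):
    xyzt = np.concatenate([pts, np.full((64, 1), t)], axis=1)
    rgb, sigma = generator_field(xyzt, z_id, z_motion)
    frames.append(rgb)                        # stand-in for a rendered frame
score = discriminator(frames[0], frames[1], t2 - t1)
```

In the actual method, the field outputs would be volume-rendered into images before reaching the discriminator, and both networks would be trained adversarially; this sketch omits rendering and optimization to keep the data flow visible.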

2. RELATED WORK

In this section, we discuss the most relevant literature on image and video synthesis, as well as neural implicit representations in the context of 2D and 3D content generation. See the supplementary document for a more complete list of references.

GAN-based Image Synthesis

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have demonstrated impressive results on multiple synthesis tasks such as image generation (Brock et al., 2019; Karras et al., 2019; 2020), image editing (Wang et al., 2018; Shen et al., 2020; Ling et al., 2021), and image-to-image translation (Isola et al., 2017; Zhu et al., 2017; Choi et al., 2018). To allow for increased controllability during the image synthesis process, several recent works have proposed to disentangle the underlying factors of variation (Reed et al., 2014; Chen et al., 2016; Lee et al., 2020; Shoshan et al., 2021) or to rely on pre-defined templates (Tewari et al., 2020a; b). However, since most of these methods operate on 2D images, they often lack physically-sound control in terms of viewpoint manipulation. In this work, we advocate modelling both the image and the video generation process in 3D in order to ensure controllable generation.

Neural Implicit Representations

Neural Implicit Representations (NIRs) (Mescheder et al., 2019; Park et al., 2019; Chen & Zhang, 2019) have been extensively employed in various generation tasks due to their continuous, efficient, and differentiable nature. These tasks include 3D reconstruction of objects and scenes, novel-view synthesis of static and dynamic scenes, inverse graphics, and video representations (Jiang et al., 2020; Chibane et al., 2020; Sitzmann et al., 2020; Barron et al., 2021; Sajjadi et al., 2022; Park et al., 2021a; Li et al., 2021; Niemeyer et al., 2020; Chen et al., 2021a). Among the most widely used NIRs are Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020), which combine NIRs with volumetric rendering to enforce 3D consistency during novel view synthesis. In this work, we employ a generative variant of NeRF (Gu et al., 2022) and combine it with a time-aware discriminator to learn a generative model from unstructured videos.
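The volumetric rendering that NeRF uses to enforce 3D consistency reduces, per ray, to a standard alpha-compositing quadrature. Below is a minimal NumPy sketch of that quadrature (the function name and argument layout are ours, not from any particular codebase):

```python
import numpy as np

def composite(rgb, sigma, deltas):
    """NeRF-style volume rendering quadrature along one ray.

    rgb:    (N, 3) sampled colors c_i
    sigma:  (N,)   sampled densities sigma_i
    deltas: (N,)   distances between adjacent samples delta_i

    Returns the pixel color
        C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    where T_i = exp(-sum_{j<i} sigma_j * delta_j) is the transmittance.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)            # per-sample opacity
    # Accumulated transmittance: product of (1 - alpha) over earlier samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)
```

In the generative setting used here, `rgb` and `sigma` would come from an MLP conditioned on latent codes (and, for our 4D fields, on time t) rather than from a per-scene fit.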
Closely related to our work are recent approaches that control the motion and pose of scenes (Lin et al., 2022; Liu et al., 2021a; Zhang et al., 2021; Ren et al., 2021). These methods focus on transferring or controlling the motion of their target objects rather than automatically generating plausible motions. They often use networks overfitted to a single reconstructed scene (Chen et al., 2021b; Zhang et al., 2021) or rely on pre-defined templates of human faces or bodies (Liu et al., 2021a; Ren et al., 2021). In contrast, we build a 4D generative model that automatically generates diverse 3D content along with plausible motion, without using any pre-defined templates.

3D-Aware Image Generation

Another line of research investigates how 3D representations can be incorporated into generative settings to improve image quality (Park et al., 2017; Nguyen-Phuoc et al., 2018) and to increase control over various aspects of the image formation process (Gadelha et al., 2017; Chan et al., 2021; Henderson & Ferrari, 2019). Towards this goal, several works (Henzler et al., 2019; Nguyen-Phuoc et al., 2019; 2020) proposed to train 3D-aware GANs from a set of unstructured images using voxel-based representations. However, due to low voxel resolution and inconsistent view control stemming from pseudo-3D structures that rely on non-physically-based 2D-to-3D conversions, these methods tend to generate images with artifacts and struggle to generalize to real-world scenarios. More recent approaches rely on volume rendering to generate 3D objects (Schwarz et al., 2020; Chan et al., 2021; Niemeyer & Geiger, 2021a). Similarly, Zhou et al. (2021), Chan et al. (2021), and DeVries et al. (2021) explored combining NeRF with GANs to design 3D-aware image generators.
Likewise, StyleSDF (Or-El et al., 2022) and StyleNeRF (Gu et al., 2022) proposed to combine an MLP-based volume renderer with a style-based generator (Karras et al., 2020) to produce high-resolution 3D-aware images. Deng et al. (2022) explored learning a generative radiance field on 2D manifolds, and Chan et al. (2022) introduced a 3D-aware architecture that exploits both implicit and explicit representations. In contrast to this line of research, which focuses primarily on 3D-aware image generation, we are interested in 3D-aware video generation. In particular, we build on top of StyleNeRF (Gu et al., 2022) to allow control on

