A GOOD IMAGE GENERATOR IS WHAT YOU NEED FOR HIGH-RESOLUTION VIDEO SYNTHESIS

Abstract

Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it is also an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our method over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.

* Work done while at Snap Inc.
¹ We estimate that training a model such as DVD-GAN (Clark et al., 2019) once costs > $30K.

1. INTRODUCTION

Video synthesis seeks to generate a sequence of moving pictures from noise. While its closely related counterpart, image synthesis, has seen substantial advances in recent years, allowing for synthesizing at high resolutions (Karras et al., 2017), rendering images often indistinguishable from real ones (Karras et al., 2019), and supporting multiple classes of image content (Zhang et al., 2019), contemporary improvements in the domain of video synthesis have been comparatively modest. Due to the statistical complexity of videos and larger model sizes, video synthesis produces relatively low-resolution videos, yet requires longer training times. For example, scaling the image generator of Brock et al. (2019) to generate 256 × 256 videos requires a substantial computational budget¹. Can we use a similar method to attain higher resolutions? We believe a different approach is needed. There are two desired properties for generated videos: (i) each individual frame should be of high quality, and (ii) the frame sequence should be temporally consistent, i.e. depict the same content with plausible motion. Previous works (Tulyakov et al., 2018; Clark et al., 2019) attempt to achieve both goals with a single framework, making such methods computationally demanding when high resolution is desired.

We suggest a different perspective on this problem. We hypothesize that, given an image generator that has learned the distribution of video frames as independent images, a video can be represented as a sequence of latent codes from this generator. The problem of video synthesis can then be framed as discovering a latent trajectory that renders temporally consistent images. Hence, we demonstrate that (i) can be addressed by a pre-trained and fixed image generator, and (ii) can be achieved using the proposed framework to create appropriate image sequences.
To discover the appropriate latent trajectory, we introduce a motion generator, implemented via two recurrent neural networks, that operates on the initial content code to obtain the motion representation. We model motion as a residual between consecutive latent codes, which are passed to the image generator to render the individual frames. Such a residual representation also facilitates disentangling motion and content. The motion generator is trained using the image discriminator with a contrastive loss, to force the content to be temporally consistent, and a patch-based multi-scale video discriminator, to learn motion patterns. Our framework supports contemporary image generators such as StyleGAN2 (Karras et al., 2020) and BigGAN (Brock et al., 2019).

We name our approach MoCoGAN-HD (Motion and Content decomposed GAN for High-Definition video synthesis), as it features several major advantages over traditional video synthesis pipelines. First, it transcends the limited resolutions of existing techniques, allowing for the generation of high-quality videos at resolutions up to 1024 × 1024. Second, as we search for a latent trajectory in an image generator, our method is computationally more efficient, requiring an order of magnitude less training time than previous video-based works (Clark et al., 2019). Third, as the image generator is fixed, it can be trained on a separate high-quality image dataset. Due to the disentangled representation of motion and content, our approach can learn motion from a video dataset and apply it to an image dataset, even when the two datasets belong to different domains. It thus unleashes the power of an image generator to synthesize high-quality videos when a domain (e.g., dogs) contains many high-quality images but no corresponding high-quality videos (see Fig. 4).
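As a rough illustration of the residual latent trajectory described above, the following is a minimal NumPy sketch. The single vanilla RNN cell, the dimensions, and the per-step noise scheme are simplified stand-ins (the actual motion generator uses two LSTM networks), and the image generator is omitted; this sketches the latent-space recurrence only, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, HIDDEN_DIM, N_FRAMES = 512, 384, 16

# Hypothetical stand-in for the recurrent motion generator: a single
# vanilla RNN cell that, starting from the initial content code z_0,
# emits a residual eps_t at each step.
W_h = rng.standard_normal((HIDDEN_DIM, HIDDEN_DIM)) * 0.01
W_n = rng.standard_normal((HIDDEN_DIM, LATENT_DIM)) * 0.01
W_out = rng.standard_normal((LATENT_DIM, HIDDEN_DIM)) * 0.01

def motion_trajectory(z0, n_frames=N_FRAMES, step=1.0):
    """Unroll the recurrent motion model: z_{t+1} = z_t + step * eps_t."""
    h = np.zeros(HIDDEN_DIM)
    z, traj = z0.copy(), [z0.copy()]
    for _ in range(n_frames - 1):
        noise = rng.standard_normal(LATENT_DIM)  # per-step motion noise
        h = np.tanh(W_h @ h + W_n @ noise)       # recurrent state update
        eps = W_out @ h                          # residual motion code
        z = z + step * eps                       # residual trajectory
        traj.append(z.copy())
    return np.stack(traj)                        # (n_frames, LATENT_DIM)

z0 = rng.standard_normal(LATENT_DIM)             # initial content code
codes = motion_trajectory(z0)
# Each codes[t] would be fed to the *fixed* image generator G
# to render frame t of the output video.
```

Because motion is represented only as residuals applied to z0, swapping z0 while keeping the residual sequence fixed changes the content but preserves the motion, which is what enables the content/motion manipulation applications mentioned above.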
In this manner, our method can generate realistic videos of objects it has never seen moving during training (such as generating realistic pet-face videos using motion extracted from videos of talking people). We refer to this new video generation task as cross-domain video synthesis. Finally, we quantitatively and qualitatively evaluate our approach, attaining state-of-the-art performance on each benchmark and establishing a challenging new baseline for video synthesis methods.

2. RELATED WORK

Video Synthesis. Approaches to image generation and translation using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have demonstrated the ability to synthesize high-quality images (Radford et al., 2016; Zhang et al., 2019; Brock et al., 2019; Donahue & Simonyan, 2019; Jin et al., 2021). Built upon image translation (Isola et al., 2017; Wang et al., 2018b), works on video-to-video translation (Bansal et al., 2018; Wang et al., 2018a) are capable of converting an input video to a high-resolution output in another domain. However, the task of high-fidelity video generation in the unconditional setting is still a difficult and unresolved problem. Without strong conditional inputs such as segmentation masks (Wang et al., 2019) or human poses (Chan et al., 2019; Ren et al., 2020), which are employed by video-to-video translation works, generating videos that follow the distribution of training video samples is challenging. Earlier works on GAN-based video modeling, including MDPGAN (Yushchenko et al., 2019), VGAN (Vondrick et al., 2016), TGAN (Saito et al., 2017), MoCoGAN (Tulyakov et al., 2018), ProgressiveVGAN (Acharya et al., 2018), and TGANv2 (Saito et al., 2020), show promising results on low-resolution datasets. Recent efforts demonstrate the capacity to generate more realistic videos, but with significantly more computation (Clark et al., 2019; Weissenborn et al., 2020). In this paper, we focus on generating realistic videos using manageable computational resources. LDVDGAN (Kahembwe & Ramamoorthy, 2020) uses low-dimensional discriminators to reduce model size and can generate videos at resolutions up to 512 × 512, whereas we decrease the training cost by utilizing a pre-trained image generator: high-quality generation is achieved by the pre-trained image generator, while the motion trajectory is modeled within its latent space.
Additionally, learning motion in the latent space allows us to easily adapt the video generation model to the task of video prediction (Denton et al., 2017), in which the starting frame is given (Denton & Fergus, 2018; Zhao et al., 2018; Walker et al., 2017; Villegas et al., 2017b;a; Babaeizadeh et al., 2017; Hsieh et al., 2018; Byeon et al., 2018), by inverting the initial frame through the generator (Abdal et al., 2020), instead of training an extra image encoder (Tulyakov et al., 2018; Zhang et al., 2020).

Interpretable Latent Directions. The latent space of GANs is known to contain semantically meaningful directions for image manipulation. Both supervised methods, relying on human annotations or pre-trained image classifiers (Goetschalckx et al., 2019; Shen et al., 2020), and unsupervised methods (Jahanian et al., 2020; Plumerault et al., 2020) are able to find interpretable directions for image editing, such as directions for image rotation or background removal (Voynov &
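The video-prediction variant above hinges on inverting the given first frame through the image generator to recover its latent code, i.e. minimizing a reconstruction loss over the latent by gradient descent. A minimal sketch of that recipe follows; the linear "generator" is a hypothetical toy stand-in for a real pre-trained network (with a real network one would backpropagate through it rather than use this closed-form gradient):

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM, IMG_DIM = 8, 32

# Toy linear "generator" standing in for a pre-trained image generator.
G = rng.standard_normal((IMG_DIM, LATENT_DIM))
z_true = rng.standard_normal(LATENT_DIM)
x_target = G @ z_true                      # the given "first frame" to invert

def invert(x, steps=2000, lr=0.005):
    """Minimize ||G z - x||^2 over z by plain gradient descent."""
    z = np.zeros(LATENT_DIM)
    for _ in range(steps):
        residual = G @ z - x
        z -= lr * 2.0 * (G.T @ residual)   # analytic gradient of the loss
    return z

z_hat = invert(x_target)
recon_err = np.linalg.norm(G @ z_hat - x_target)
# The recovered z_hat plays the role of z_0; the motion generator then
# unrolls a latent trajectory from it to predict the subsequent frames.
```

The recovered latent serves as the initial content code, so no separate image encoder needs to be trained for prediction.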

