A GOOD IMAGE GENERATOR IS WHAT YOU NEED FOR HIGH-RESOLUTION VIDEO SYNTHESIS

Abstract

Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it is also an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.

* Work done while at Snap Inc.

1 We estimate that training a model such as DVD-GAN (Clark et al., 2019) once costs more than $30K.

1. INTRODUCTION

Video synthesis seeks to generate a sequence of moving pictures from noise. While its closely related counterpart, image synthesis, has seen substantial advances in recent years, allowing for synthesizing at high resolutions (Karras et al., 2017), rendering images often indistinguishable from real ones (Karras et al., 2019), and supporting multiple classes of image content (Zhang et al., 2019), contemporary improvements in the domain of video synthesis have been comparatively modest. Due to the statistical complexity of videos and larger model sizes, video synthesis produces relatively low-resolution videos, yet requires longer training times. For example, scaling the image generator of Brock et al. (2019) to generate 256 × 256 videos requires a substantial computational budget (see footnote 1). Can we use a similar method to attain higher resolutions?

We believe a different approach is needed. There are two desired properties for generated videos: (i) each individual frame should be of high quality, and (ii) the frame sequence should be temporally consistent, i.e., it should depict the same content with plausible motion. Previous works (Tulyakov et al., 2018; Clark et al., 2019) attempt to achieve both goals with a single framework, making such methods computationally demanding when high resolution is desired.

We suggest a different perspective on this problem. We hypothesize that, given an image generator that has learned the distribution of video frames as independent images, a video can be represented as a sequence of latent codes from this generator. The problem of video synthesis can then be framed as discovering a latent trajectory that renders temporally consistent images. Hence, we demonstrate that (i) can be addressed by a pre-trained and fixed image generator, and (ii) can be achieved using the proposed framework to create appropriate image sequences.
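To make the latent-trajectory view concrete, the following is a minimal sketch rather than the method proposed in this paper. It assumes a toy stand-in for the pre-trained image generator (a fixed random linear map followed by tanh) and a placeholder motion model (a small random walk in latent space). All names and sizes (`make_frozen_generator`, `latent_trajectory`, `synthesize_video`, `LATENT_DIM`, `FRAME_PIXELS`) are hypothetical and chosen only for illustration. The only point it shows is that the generator stays fixed while the video is determined entirely by a sequence of latent codes.

```python
import math
import random

LATENT_DIM = 8      # hypothetical latent size
FRAME_PIXELS = 16   # hypothetical flattened frame size (4 x 4)


def make_frozen_generator(seed=0):
    """Return a toy stand-in for a pre-trained image generator with fixed weights."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
               for _ in range(FRAME_PIXELS)]

    def generate(z):
        # Map one latent code to one flattened frame with values in [-1, 1].
        return [math.tanh(sum(w * x for w, x in zip(row, z))) for row in weights]

    return generate


def latent_trajectory(z0, num_frames, step_size=0.05, seed=1):
    """Placeholder motion model (an assumption): a small random walk from z0."""
    rng = random.Random(seed)
    codes = [list(z0)]
    for _ in range(num_frames - 1):
        prev = codes[-1]
        codes.append([x + step_size * rng.gauss(0.0, 1.0) for x in prev])
    return codes


def synthesize_video(generate, z0, num_frames):
    """Render a video by applying the frozen generator to each code on the trajectory."""
    return [generate(z) for z in latent_trajectory(z0, num_frames)]


rng = random.Random(42)
z0 = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # initial (content) code
video = synthesize_video(make_frozen_generator(), z0, num_frames=5)
print(len(video), len(video[0]))  # number of frames, pixels per frame
```

In this sketch the random walk merely keeps consecutive codes close to each other. A learned motion generator, as described in the text, would replace `latent_trajectory` while the image generator remains untouched.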

