SSW-GAN: SCALABLE STAGE-WISE TRAINING OF VIDEO GANS

Abstract

Current state-of-the-art generative models for video have high computational requirements that impede the generation of high-resolution videos beyond a few frames. In this work we propose a stage-wise strategy for training Generative Adversarial Networks (GANs) for video. We decompose the generative process to first produce a downsampled video that is then spatially upscaled and temporally interpolated by subsequent stages. To manage computational complexity, upsampling stages are applied locally on temporal chunks of previous outputs. Each stage is itself a GAN, and stages are trained sequentially and independently. We validate our approach on Kinetics-600 and BDD100K, for which we train a three-stage model capable of generating 128x128 videos with 100 frames.

1. INTRODUCTION

The field of generative modeling has seen rapid developments over the past few years. Current models such as GPT-3 (Brown et al., 2020) or BigGAN (Brock et al., 2018) are capable of generating coherent long paragraphs and detailed high-resolution images. Generative models for video, however, have high memory requirements that scale quickly with the output resolution and length. Prior works have therefore restricted the video dimensions by operating at low spatial resolution or by generating only a small number of frames (Ranzato et al., 2014; Vondrick et al., 2016a; Tulyakov et al., 2018; Kalchbrenner et al., 2017). In this work we investigate an approach to reduce the computational cost of generating long, high-resolution videos in the context of Generative Adversarial Networks (GANs). Current GAN approaches require large batch sizes and high-capacity models (Clark et al., 2019; Brock et al., 2018). We propose to break the generative process down into a set of smaller generative problems, or stages, each with reduced computational requirements. The first stage produces a downsampled low-resolution video that is then spatially upscaled and temporally interpolated by subsequent upscaling stages. Each stage is modeled as a GAN problem, and stages are trained sequentially and independently, with each stage only considering a lower-dimensional view of the video during training. The first stage is trained to produce full-length videos at a reduced spatiotemporal resolution, while the upscaling stages are trained to upsample partial temporal windows of the previous generations. At inference time, the upscaling stages are applied to the full first-stage output in a convolutional fashion to generate full-resolution videos. Learning the upscaling stages on local views of the data reduces their computational requirements: by keeping a fixed temporal window size, these requirements scale only with output resolution, independently of the final video length.
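The inference pipeline described above can be sketched as follows. The shapes match the three-stage model mentioned in the abstract (100 frames at 128x128), but the generators here are illustrative placeholders, not the paper's learned GANs, and all function names are hypothetical:

```python
import numpy as np

def stage1_generator(seed, t=25, h=32, w=32):
    """First stage: a full-length video at reduced spatiotemporal resolution.
    Placeholder for a learned GAN generator."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((t, h, w, 3))

def upscale_window(window):
    """One upscaling stage applied to a temporal chunk: 2x temporal
    interpolation and 2x spatial upscaling, here by naive repetition."""
    up = window.repeat(2, axis=0)                 # 2x frames
    return up.repeat(2, axis=1).repeat(2, axis=2) # 2x height and width

def apply_stage_convolutionally(video, window_size=5):
    """Slide the upscaling stage over temporal chunks of the previous output,
    so memory cost depends on window_size, not on the full video length."""
    chunks = [video[i:i + window_size] for i in range(0, len(video), window_size)]
    return np.concatenate([upscale_window(c) for c in chunks], axis=0)

low_res = stage1_generator(seed=0)            # (25, 32, 32, 3)
stage2 = apply_stage_convolutionally(low_res) # (50, 64, 64, 3)
stage3 = apply_stage_convolutionally(stage2)  # (100, 128, 128, 3)
```

Note how each application of an upscaling stage only ever sees a fixed-size temporal window, which is the source of the constant memory cost in video length.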
However, the upsampling stages consider a limited field of view in time, which could negatively impact the temporal consistency of the full-size generation. To address this problem, we rely on the first low-resolution generation to capture long-term temporal information, and use it to condition the upscaling stages. In particular, we introduce a novel matching discriminator that ensures that outputs remain grounded in the low-resolution generation. Our approach, named SSW-GAN, offers a novel way to decompose the training of large GAN models, inspired by other coarse-to-fine methods that have been explored in the context of images and videos (Denton et al., 2015; Karras et al., 2017; Acharya et al., 2018). In contrast to previous methods, we do not train the upscaling stages on full-resolution inputs, and instead impose global consistency by conditioning on the low-resolution generation.

