SSW-GAN: SCALABLE STAGE-WISE TRAINING OF VIDEO GANS

Abstract

Current state-of-the-art generative models for video have computational requirements that preclude generating more than a few frames at high resolution. In this work we propose a stage-wise strategy to train Generative Adversarial Networks (GANs) for videos. We decompose the generative process to first produce a downsampled video that is then spatially upscaled and temporally interpolated by subsequent stages. Upsampling stages are applied locally on temporal chunks of previous outputs to manage the computational complexity. Stages are defined as Generative Adversarial Networks, which are trained sequentially and independently. We validate our approach on Kinetics-600 and BDD100K, training a three-stage model capable of generating 128x128 videos with 100 frames.

1. INTRODUCTION

The field of generative modeling has seen rapid developments over the past few years. Current models such as GPT-3 (Brown et al., 2020) or BigGAN (Brock et al., 2018) are capable of generating coherent long paragraphs and detailed high-resolution images. Generative models for videos, however, have high memory requirements that quickly scale with the output resolution and length. Prior works have therefore restricted the video dimensions by operating at low spatial resolution or by only generating a small number of frames (Ranzato et al., 2014; Vondrick et al., 2016a; Tulyakov et al., 2018; Kalchbrenner et al., 2017).

In this work we investigate an approach to reduce the computational cost of generating long high-resolution videos in the context of Generative Adversarial Networks (GANs). Current GAN approaches require large batch sizes and high-capacity models (Clark et al., 2019; Brock et al., 2018). We propose to break down the generative process into a set of smaller generative problems, or stages, each with reduced computational requirements. The first stage produces a downsampled low-resolution video that is then spatially upscaled and temporally interpolated by subsequent upscaling stages. Each stage is modeled as a GAN problem, and stages are trained sequentially and independently, with each stage only considering a lower-dimensional view of the video during training. The first stage is trained to produce full-length videos at a reduced spatiotemporal resolution, while the upscaling stages are trained to upsample partial temporal windows of the previous stage's generations. At inference time, the upscaling stages are applied to the full first-stage output in a convolutional fashion to generate full-resolution videos. Learning the upscaling stages on local views of the data reduces their computational requirements: by keeping a fixed temporal window size, these requirements scale only with the output resolution, independently of the final video length.
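The inference pipeline described above can be sketched end-to-end with toy stand-ins for the learned stages. All function names below are illustrative assumptions, and nearest-neighbour repetition stands in for the learned spatial upscaling and temporal interpolation; the actual stages are GAN generators.

```python
import numpy as np

def first_stage(num_frames=25, res=32, channels=3, seed=0):
    """Stand-in for the stage-1 generator: returns a full-length,
    low-resolution video of shape (T, H, W, C). Toy random output."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_frames, res, res, channels))

def upscale_chunk(chunk, s_factor=2, t_factor=2):
    """Stand-in for an upscaling stage: frame repetition as temporal
    interpolation plus nearest-neighbour spatial upsampling."""
    chunk = np.repeat(chunk, t_factor, axis=0)  # time
    chunk = np.repeat(chunk, s_factor, axis=1)  # height
    chunk = np.repeat(chunk, s_factor, axis=2)  # width
    return chunk

def apply_stage_convolutionally(video, window=5, s_factor=2, t_factor=2):
    """Apply an upscaling stage over fixed-size temporal windows and
    concatenate the results, so that memory scales with the window
    size rather than with the full video length."""
    chunks = [video[t:t + window] for t in range(0, len(video), window)]
    return np.concatenate(
        [upscale_chunk(c, s_factor, t_factor) for c in chunks], axis=0)

low_res = first_stage()                        # (25, 32, 32, 3)
stage2 = apply_stage_convolutionally(low_res)  # (50, 64, 64, 3)
stage3 = apply_stage_convolutionally(stage2)   # (100, 128, 128, 3)
```

With these toy factors, three stages take a 25-frame 32x32 video to 100 frames at 128x128, matching the output size reported in the abstract, while each upscaling call only ever holds a 5-frame window of its input in memory.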
However, upsampling stages consider a limited field of view in time, which could negatively impact the temporal consistency of the full-size generation. To address this problem, we rely on the first low-resolution generation to capture long-term temporal information, and use it to condition the upscaling stages. In particular, we introduce a novel matching discriminator that ensures that outputs are grounded in the low-resolution generation. Our approach, named SSW-GAN, offers a novel way to decompose the training of large GAN models, inspired by other coarse-to-fine methods that have been explored in the context of images and videos (Denton et al., 2015; Karras et al., 2017; Acharya et al., 2018). In contrast to previous methods, we do not train on full-resolution inputs in the upscaling stages, and instead impose global temporal consistency by conditioning on the complete but low-resolution first-stage output. Our model thus provides significant computational savings, which allow for high-quality, high-resolution generators capable of producing hundreds of frames. Our contributions can be summarized as follows:

• We define a stage-wise approach to train GANs for video in which stages are trained sequentially, and show that solving this multi-stage problem is equivalent to modeling the joint data probability of samples and their corresponding downsampled views.
• We empirically validate our approach on Kinetics-600 and BDD100K, two large-scale datasets with complex videos in real-world scenarios. Our approach matches or outperforms state-of-the-art approaches while requiring significantly fewer computational resources.
• We use our approach to generate videos with 100 frames at high resolutions with high-capacity models. To the best of our knowledge, our method is the first to produce such generations.

2. RELATED WORK

We propose a model for class-conditional video generation, and therefore our setup is closely related to stochastic generative video models.

Autoregressive models (Larochelle & Murray, 2011; Dinh et al., 2016; Kalchbrenner et al., 2017; Reed et al., 2017; Weissenborn et al., 2020) approximate the joint data distribution in pixel space without introducing latent variables. These models capture complex pixel dependencies without independence assumptions. However, inference in autoregressive models often requires a full model forward pass for each output pixel, making them slow and not scalable to long high-resolution videos, with state-of-the-art models requiring multiple minutes to generate a single batch of samples (Weissenborn et al., 2020).
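The per-pixel inference cost of autoregressive models noted above can be made concrete with a quick count of sequential network evaluations for one video at our target output size. This is illustrative arithmetic only, not tied to any specific model:

```python
# Back-of-the-envelope cost of per-pixel autoregressive sampling:
# one network forward pass per output value, run strictly in order.
frames, height, width, channels = 100, 128, 128, 3
sequential_steps = frames * height * width * channels
print(sequential_steps)  # 4915200 sequential forward passes
```

Roughly five million inherently sequential steps for a single 100-frame 128x128 RGB video makes clear why such models are hard to scale to this regime.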

Variational AutoEncoders (VAEs) define latent variable models and use variational inference methods to optimize a lower bound on the empirical data likelihood (Kingma & Welling, 2013; Rezende et al., 2014; Babaeizadeh et al., 2017). Models based on VRNNs (Chung et al., 2015; Denton & Fergus, 2018; Castrejon et al., 2019) use per-frame latent variables and have greater modeling capacity. Normalizing flows (NFs) define bijective functions that map a probability distribution over a latent variable to a tractable distribution over data (Rezende & Mohamed, 2015; Kingma & Dhariwal, 2018; Kumar et al., 2019). NFs are trained to directly maximize the data likelihood. The main disadvantage of NFs is that their latent dimensionality has to match that of the data, often resulting in slow and memory-intensive models. Autoregressive models, VAEs and NFs are trained by maximizing the data likelihood (or a bound on it) under the generative distribution. It has been empirically observed that such models often produce blurry results. Generative Adversarial Networks (GANs), on the other hand, optimize a min-max game between a generator and a discriminator trained to tell real and generated data apart (Goodfellow et al., 2014). Empirically, GANs usually produce better samples but might suffer from mode collapse. TGANv2 (Saito et al., 2020) builds on TGAN (Saito et al., 2017) and proposes a video GAN trained on data windows, similar to our approach. However, unlike TGANv2, our model is composed of multiple stages which are not trained jointly. MoCoGAN (Tulyakov et al., 2018) first introduced a dual discriminator architecture for video, with DVD-GAN (Clark et al., 2019) scaling up this approach to high-resolution videos in the wild. DVD-GAN outperforms MoCoGAN and TGANv2, and is arguably the current state of the art in adversarial video generation. We propose a multi-stage generative approach in which each stage defines an adversarial game, with a model architecture based on DVD-GAN.

Recent work (Xiong et al., 2018; Zhao et al., 2020) also proposes multi-stage models, but unlike our approach, their stages model different semantic aspects of the generation, such as producing a motion outline.
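As one way to picture how the matching discriminator introduced in Section 1 grounds upscaled outputs in their low-resolution conditioning, the sketch below scores a candidate high-resolution chunk against the conditioning video. The average-pooling downsampler and the negative-squared-error score are illustrative stand-ins only; the actual matching discriminator is a learned network trained adversarially.

```python
import numpy as np

def downsample(video, s=2, t=2):
    """Average-pool a (T, H, W, C) video by t in time and s in space.
    Assumes T, H, W are divisible by the pooling factors."""
    T, H, W, C = video.shape
    v = video.reshape(T // t, t, H, W, C).mean(axis=1)
    v = v.reshape(T // t, H // s, s, W, C).mean(axis=2)
    v = v.reshape(T // t, H // s, W // s, s, C).mean(axis=3)
    return v

def matching_score(candidate, low_res):
    """Stand-in for the matching discriminator's realness score:
    rewards candidates whose downsampled view agrees with the
    low-resolution conditioning video. A learned discriminator would
    compare features adversarially; plain negative squared error is
    used here purely for illustration."""
    return -float(np.mean((downsample(candidate) - low_res) ** 2))

rng = np.random.default_rng(1)
low = rng.standard_normal((4, 8, 8, 3))
# A candidate that is actually grounded in `low` (nearest-neighbour upsampling)
grounded = np.repeat(np.repeat(np.repeat(low, 2, axis=0), 2, axis=1), 2, axis=2)
# An unrelated candidate of the same shape
unrelated = rng.standard_normal((8, 16, 16, 3))
print(matching_score(grounded, low) > matching_score(unrelated, low))
```

A grounded candidate recovers the conditioning video exactly under this downsampler and so scores strictly higher than an unrelated one, which is the behaviour the matching discriminator is meant to enforce on the upscaling stages.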