VIDEOGEN: GENERATIVE MODELING OF VIDEOS USING VQ-VAE AND TRANSFORMERS

Anonymous

Abstract

We present VideoGen: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGen uses a VQ-VAE that learns downsampled discrete latent representations of a video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite its simple formulation, ease of training, and light compute requirements, our architecture generates samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generates coherent action-conditioned samples based on experiences gathered from the ViZDoom simulator. We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models that does not require industry-scale compute resources. Samples are available at https://sites.google.com/view/videogen.

1. INTRODUCTION

Growth in available compute (Amodei & Hernandez, 2018) has enabled deep generative models that are far more powerful than they were a few years ago. However, one notable modality that has not seen the same level of progress in generative modeling is high-fidelity natural video. The complexity of natural videos requires modeling correlations across both space and time at much higher input dimensions, presenting a natural next challenge for current deep generative models. This complexity also demands more compute resources, which can be considered one important reason for the slow progress in generative modeling of videos.

It is useful to build generative models of videos, both conditional and unconditional, because they implicitly solve the problem of video prediction and forecasting. Video prediction (Kalchbrenner et al., 2017; Sønderby et al., 2020) can be seen as learning a generative model of future frames conditioned on past frames. Architectures developed for video generation can be useful in forecasting applications for autonomous driving, such as predicting the future in denser, more semantic abstractions like segmentation masks (Luc et al., 2017). Finally, building generative models of the world around us is considered one way to measure understanding of physical common sense (Lake et al.).

Different generative model families come with different tradeoffs: sampling speed, sample diversity, sample quality, ease of training, compute requirements, and ease of evaluation. To build a generative model for videos, we first make a choice between likelihood-based and adversarial models. Likelihood-based models are convenient to train since the objective is well understood, easy to optimize across a range of batch sizes, and easy to evaluate. Given that videos already present a hard modeling challenge due to the nature of the data, we believe likelihood-based models present fewer difficulties in optimization and evaluation, allowing us to focus on architecture design.
Among likelihood-based models, autoregressive models over discrete data in particular have shown great success and have well-established training recipes and modeling architectures. Second, we consider the following question: is it better to perform autoregressive modeling in a downsampled latent space free of spatio-temporal redundancies, or at the atomic level of all pixels across space and time? Below, we present our reasons for choosing the former. Natural images and videos contain substantial spatial and temporal redundancy, which is why we use image compression tools such as JPEG (Wallace, 1992) and video codecs such as MPEG (Le Gall, 1991) every day. These redundancies can be removed by learning a denoised, downsampled encoding of the high-resolution input. For example, 4x downsampling along each of the spatial and temporal dimensions yields a 64x reduction in resolution, so that the computation of powerful deep generative models is spent on fewer, more useful bits. As shown in VQ-VAE (Van Den Oord et al., 2017), even a lossy decoder can transform the latents into sufficiently realistic samples. Furthermore, modeling in a latent space downsampled across space and time, instead of pixel space, improves sampling speed and compute requirements due to the reduced dimensionality. This line of reasoning leads us to our proposed model: VideoGen, a simple video generation architecture that is a minimal adaptation of the VQ-VAE and GPT architectures to videos. Our results are highlighted below:



Modeling long sequences is a challenge for transformer-based architectures due to the quadratic memory complexity of the attention matrix (Child et al., 2019).
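As a rough illustration of why latent-space modeling helps here, the back-of-the-envelope arithmetic below compares the number of positions an autoregressive transformer must model, and the size of the resulting attention matrix, in pixel space versus a latent space downsampled 4x per axis. The clip and latent shapes are hypothetical placeholders, not values fixed by the paper:

```python
# Back-of-the-envelope arithmetic for latent-space autoregressive modeling.
# The clip shape (16 frames of 64x64) is illustrative, not prescriptive.

def downsampled(shape, factor):
    """Downsample every spatio-temporal dimension by the same factor."""
    return tuple(d // factor for d in shape)

clip = (16, 64, 64)                                # frames x height x width
latents = downsampled(clip, 4)                     # -> (4, 16, 16)

pixel_seq = clip[0] * clip[1] * clip[2]            # 65536 positions
latent_seq = latents[0] * latents[1] * latents[2]  # 1024 positions
print(pixel_seq // latent_seq)                     # 64x fewer positions

# Self-attention memory grows quadratically with sequence length,
# so the saving on the attention matrix is the square of that:
print((pixel_seq ** 2) // (latent_seq ** 2))       # 4096x smaller matrix
```

The quadratic term is what makes full attention over raw pixels impractical at these resolutions, and why the 64x reduction in sequence length buys a 4096x reduction in attention memory.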



Figure 1: 64 × 64 samples for BAIR and ViZDoom environments generated by VideoGen

Multiple classes of generative models have been shown to produce strikingly good samples, such as autoregressive models (van den Oord et al., 2016b;c; Menick & Kalchbrenner, 2018; Radford et al., 2019; Chen et al., 2020), generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015), variational autoencoders (VAEs) (Kingma & Welling, 2013; Kingma et al., 2016; Vahdat & Kautz, 2020), flows (Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018), vector quantized VAEs (VQ-VAE) (Van Den Oord et al., 2017; Razavi et al., 2019), and lately diffusion and score matching models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020).

VideoGen employs 3D convolutions and transposed convolutions (Tran et al., 2015), along with axial attention (Clark et al., 2019; Ho et al., 2019b), in the VQ-VAE autoencoder in order to learn a downsampled set of discrete latents. These latents are then generated autoregressively by a GPT-like architecture (Radford et al., 2019; Child et al., 2019; Chen et al., 2020), and finally decoded back to videos at the original resolution by the decoder of the VQ-VAE.
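The quantization step that turns encoder outputs into discrete latents can be sketched in a few lines of NumPy: each encoder output vector is replaced by its nearest codebook entry, and the resulting indices are the tokens the GPT-like prior models. The codebook size (512) and latent shape (4 x 16 x 16, 64-dim embeddings) below are illustrative placeholders, not the values used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))      # K codes, D-dim embeddings (illustrative)
z_e = rng.normal(size=(4 * 16 * 16, 64))   # flattened encoder outputs, shape (N, D)

# Squared distances via the expansion ||z - c||^2 = ||z||^2 - 2 z.c + ||c||^2,
# which avoids materializing an (N, K, D) broadcast.
d = ((z_e ** 2).sum(1, keepdims=True)
     - 2.0 * z_e @ codebook.T
     + (codebook ** 2).sum(1))             # shape (N, K)

indices = d.argmin(axis=1)                 # discrete latent tokens, shape (1024,)
z_q = codebook[indices]                    # quantized vectors passed to the decoder
print(indices.shape, z_q.shape)            # (1024,) (1024, 64)
```

In the full model, a straight-through estimator copies gradients from the quantized vectors back to the encoder during training; sampling then amounts to having the transformer generate `indices` one token at a time and running them through the decoder.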

