VIDEOGEN: GENERATIVE MODELING OF VIDEOS USING VQ-VAE AND TRANSFORMERS

Anonymous

Abstract

We present VideoGen: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGen uses a VQ-VAE that learns downsampled discrete latent representations of a video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity of its formulation, ease of training, and light compute requirements, our architecture generates samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generates coherent action-conditioned samples based on experiences gathered from the ViZDoom simulator. We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models that does not require industry-scale compute resources. Samples are available at https://sites.google.com/view/videogen.
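The core operation in the first stage of the pipeline described above is vector quantization: each continuous encoder output is replaced by its nearest entry in a learned codebook, and the resulting integer codes are what the transformer prior later models autoregressively. The following NumPy sketch illustrates only this quantization step; the codebook size, latent dimension, and grid shape are illustrative placeholders, not the paper's actual hyperparameters.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook embedding.

    latents:  (N, D) encoder outputs, flattened over the
              spatio-temporal latent grid
    codebook: (K, D) learned embedding vectors
    returns:  (N,) integer codes and (N, D) quantized vectors
    """
    # Squared Euclidean distance between every latent and every code
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)       # discrete tokens for the transformer prior
    return codes, codebook[codes]      # quantized latents fed to the decoder

# Illustrative shapes: K=8 codes of dimension D=4, a flattened 16-position grid
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
latents = rng.normal(size=(16, 4))
codes, quantized = quantize(latents, codebook)
```

In the full model, gradients flow through this non-differentiable lookup via the straight-through estimator, and the `codes` array (one integer per downsampled spatio-temporal position) forms the token sequence the GPT-like prior is trained on.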

1. INTRODUCTION

Figure 1: 64 × 64 samples for BAIR and ViZDoom environments generated by VideoGen

Deep generative models of multiple types (Goodfellow et al., 2014; van den Oord et al., 2016b; Dinh et al., 2016) have seen incredible progress in the last few years across multiple modalities, including natural images (van den Oord et al., 2016c; Zhang et al., 2019; Brock et al., 2018; Kingma & Dhariwal, 2018; Ho et al., 2019a; Karras et al., 2017; 2019; Van Den Oord et al., 2017; Razavi et al., 2019; Vahdat & Kautz, 2020; Ho et al., 2020; Chen et al., 2020), audio waveforms conditioned on language features (van den Oord et al., 2016a; Oord et al., 2017; Bińkowski et al., 2019), natural language in the form of text (Radford et al., 2019; Brown et al., 2020), and music (Dhariwal et al., 2020). These results have been made possible by fundamental advances in deep learning architectures (He et al., 2015; van den Oord et al., 2016b; c; Vaswani et al., 2017; Zhang et al., 2019; Menick & Kalchbrenner, 2018) as well as the availability of compute resources far more powerful than those of a few years ago (Jouppi et al., 2017; Amodei & Hernandez, 2018). However, one notable modality that has not seen the same level of progress in generative modeling is high-fidelity natural video. The complexity of natural videos requires modeling correlations across both space and time at much higher input dimensionality, presenting a natural next challenge for current deep generative models. This complexity also demands greater compute resources, which is one important reason for the slower progress in generative modeling of videos. Building generative models of videos, both conditional and unconditional, is useful because it implicitly solves the problems of video prediction and forecasting. Video prediction (Kalchbrenner et al., 2017;

