PREDICTING VIDEO WITH VQVAE

Abstract

In recent years, the task of video prediction, forecasting future video given past video frames, has attracted attention in the research community. In this paper we propose a novel approach to this problem using Vector Quantized Variational AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables. Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video. In contrast to previous work, which has largely emphasized highly constrained datasets, we focus on very diverse, large-scale datasets such as Kinetics-600. We predict video at a higher resolution, 256 × 256, than any previous method of which we are aware. We further validate our approach against prior work via a crowdsourced human evaluation.

1. INTRODUCTION

When it comes to real-world image data, deep generative models have made substantial progress. With advances in computational efficiency and improvements in architectures, it is now feasible to generate high-resolution, realistic images from vast and highly diverse datasets (Brock et al., 2019; Razavi et al., 2019; Karras et al., 2017). Beyond the domain of images, deep generative models have also shown promise in other data domains such as music (Dieleman et al., 2018; Dhariwal et al., 2020), speech synthesis (Oord et al., 2016), 3D voxels (Liu et al., 2018; Nash & Williams, 2017), and text (Radford et al., 2019). One particular fledgling domain is video. While some work in the area of video generation (Clark et al., 2020; Vondrick et al., 2016; Saito & Saito, 2018) has explored video synthesis, i.e., generating videos with no prior frame information, many approaches focus instead on the task of video prediction conditioned on past frames (Ranzato et al., 2014; Srivastava et al., 2015; Patraucean et al., 2015; Mathieu et al., 2016; Lee et al., 2018; Babaeizadeh et al., 2018; Oliu et al., 2018; Xiong et al., 2018; Xue et al., 2016; Finn et al., 2016; Luc et al., 2020). It can be argued that video synthesis is a combination of image generation and video prediction: one could decouple the problem of video synthesis into unconditional image generation followed by conditional video prediction from a generated image. We therefore focus specifically on video prediction in this paper. Potential computer vision applications of video forecasting include interpolation, anomaly detection, and activity understanding. More generally, video prediction has broader implications for intelligent systems, namely the ability to anticipate the dynamics of the environment. The problem is thus also relevant for robotics and reinforcement learning (Finn et al., 2016; Ebert et al., 2017; Oh et al., 2015; Ha & Schmidhuber, 2018; Racanière et al., 2017).
Approaches to video prediction have largely skewed toward variations of generative adversarial networks (Mathieu et al., 2016; Lee et al., 2018; Clark et al., 2020; Vondrick et al., 2016; Luc et al., 2020). In comparison, we are aware of only a relatively small number of approaches based on variational autoencoders (Babaeizadeh et al., 2018; Xue et al., 2016; Denton & Fergus, 2018), autoregressive models (Kalchbrenner et al., 2017; Weissenborn et al., 2020), or flow-based models (Kumar et al., 2020). There may be several reasons for this situation. One is the explosion in the dimensionality of the input space: a generative model of video must coherently model not just one image but tens of them, which makes it difficult to scale such models to large datasets or high resolutions. In addition, previous work (Clark et al., 2020) suggests that video prediction may be fundamentally more difficult than video synthesis; a synthesis model can generate simple samples from the dataset, while prediction potentially forces the model to forecast conditioned on videos that are outliers in the distribution. Furthermore, most prior work has focused on datasets with low scene diversity such as Moving MNIST (Srivastava et al., 2015), KTH (Schuldt et al., 2004), or robotic arm datasets (Finn et al., 2016; Ebert et al., 2017). While there have been attempts to synthesize video at high resolution (Clark et al., 2020), we know of no attempt, excluding flow-based approaches, to predict video beyond resolutions of 64 × 64. In this paper we address the high dimensionality of video data through compression. Using Vector Quantized Variational AutoEncoders (VQ-VAE) (van den Oord et al., 2017), we compress video into a space requiring only 1.3% of the bits needed to express it in pixels. While this compressed encoding is lossy, we can still reconstruct the original video from the latent representation with a high degree of fidelity.
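To make this kind of bit accounting concrete, the helper below computes the fraction of bits occupied by discrete latent codes relative to raw 8-bit RGB pixels. The latent grid sizes and codebook size in the example are illustrative assumptions for a two-level hierarchy, not the exact configuration used in this paper:

```python
import math

def compression_ratio(frame_hw, latent_grids, codebook_size,
                      bits_per_channel=8, channels=3):
    """Fraction of bits used by discrete latents vs. raw pixels, per frame.

    frame_hw: (height, width) of the original frame.
    latent_grids: list of (h, w) latent grids, one per hierarchy level.
    codebook_size: number of codebook entries K; each code costs log2(K) bits.
    """
    pixel_bits = frame_hw[0] * frame_hw[1] * channels * bits_per_channel
    latent_bits = sum(h * w for h, w in latent_grids) * math.log2(codebook_size)
    return latent_bits / pixel_bits

# Hypothetical example: a 256x256 frame with 32x32 and 16x16 latent grids
# and a 512-entry codebook (9 bits per code).
ratio = compression_ratio((256, 256), [(32, 32), (16, 16)], 512)
print(f"{ratio:.2%}")  # well under 1% of the pixel bits
```

The actual ratio achieved depends on the chosen grid sizes, codebook size, and any temporal downsampling, so this is a bookkeeping sketch rather than a reproduction of the 1.3% figure.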
Furthermore, we can leverage the modularity of VQ-VAE and decompose our latent representation into a hierarchy of encodings, separating high-level, global information from details such as fine texture or small motions. Instead of training a generative model directly on pixel space, we can model this much more tractable discrete representation, allowing us to train far more powerful models, use large, diverse datasets, and generate at high resolution. While most prior work has focused on GANs, this discrete representation can also be modeled by likelihood-based models, which in principle do not suffer from the mode collapse, training instability, and lack of sample diversity often observed in GANs (Denton & Fergus, 2018; Babaeizadeh et al., 2018; Razavi et al., 2019). In this paper, we propose a PixelCNN augmented with causal convolutions in time and spatiotemporal self-attention to model this space of latents. Moreover, because the latent representation is decomposed into a hierarchy, we can exploit this decomposition and train separate specialized models at different levels of the hierarchy.

Our paper makes four contributions. First, we demonstrate the novel application of VQ-VAE to video data. Second, we propose a set of spatiotemporal PixelCNNs to predict video by utilizing the latent representation learned with VQ-VAE. Third, we explicitly predict video at a higher resolution than any previous method. Finally, we demonstrate the competitive performance of our model with a crowdsourced human evaluation.
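The causality constraint behind the temporal convolutions can be illustrated with a toy 1-D sketch (not the full spatiotemporal PixelCNN architecture): left-padding the sequence ensures that the output at time t depends only on inputs at times ≤ t, so no information leaks from future frames.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution over the time axis.

    x: (T,) input sequence; w: (k,) kernel.
    The output at time t is a weighted sum of x[t-k+1 .. t] only,
    enforced by left-padding with zeros (no access to future steps).
    """
    k = len(w)
    x_pad = np.concatenate([np.zeros(k - 1), x])
    return np.array([x_pad[t:t + k] @ w for t in range(len(x))])

y = causal_conv1d(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0]))
print(y)  # [1. 3. 5.] — y[0] uses only x[0], never x[1] or x[2]
```

Changing a future input leaves all earlier outputs untouched, which is exactly the property that lets an autoregressive model be trained in parallel over time while sampling frame by frame.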

2. BACKGROUND

2.1 VECTOR QUANTIZED VARIATIONAL AUTOENCODERS

VQ-VAEs (van den Oord et al., 2017) are autoencoders that learn a discrete latent encoding for input data x. First, the output of a non-linear encoder z_e(x), implemented by a neural network, is passed through a discretization bottleneck. z_e(x) is mapped via nearest-neighbor lookup into a quantized codebook e ∈ R^{K×D}, where D is the dimensionality of each codebook vector e_j and K is the number of categories in the codebook. The discretized representation is thus given by:

z_q(x) = e_k, where k = argmin_j ||z_e(x) - e_j||_2    (1)
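Eq. 1 amounts to a nearest-neighbor lookup into the codebook, which can be sketched as follows (a minimal NumPy illustration; the function name and array shapes are our own, not the paper's implementation):

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbor quantization of encoder outputs (Eq. 1).

    z_e: (N, D) array of encoder outputs z_e(x).
    codebook: (K, D) array of codebook vectors e_j.
    Returns (z_q, k): the quantized vectors e_k and their indices k.
    """
    # Squared L2 distance from each encoder output to every codebook vector,
    # computed via broadcasting: (N, 1, D) - (1, K, D) -> (N, K).
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    k = dists.argmin(axis=1)  # k = argmin_j ||z_e(x) - e_j||_2
    return codebook[k], k

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_e = np.array([[0.1, 0.2], [0.9, 1.1]])
z_q, k = quantize(z_e, codebook)
print(k)  # [0 1]: each output snaps to its nearest codebook entry
```

The forward pass thus replaces each continuous vector with its closest codebook entry; in training, the non-differentiable argmin is typically handled with a straight-through gradient estimator.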



Figure 1: In this paper we predict video at a high resolution (256 × 256) using a compressed latent representation. The first 4 frames are given as conditioning. We predict the next 12 frames, two of which (the 9th and 16th) we show on the right. All frames shown here have been compressed by VQ-VAE. Videos are licensed under CC-BY; attribution for the videos in this paper can be found in the supplementary material. Best viewed as video in the supplementary material.

