PREDICTING VIDEO WITH VQ-VAE

Abstract

In recent years, the task of video prediction (forecasting future video given past video frames) has attracted attention in the research community. In this paper we propose a novel approach to this problem using Vector Quantized Variational AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables. Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video. In contrast to previous work, which has largely emphasized highly constrained datasets, we focus on very diverse, large-scale datasets such as Kinetics-600. We predict video at 256 × 256, a higher resolution than any previous method to our knowledge. We further validate our approach against prior work via a crowdsourced human evaluation.
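The core of the compression step described above is vector quantization: each continuous encoder output is replaced by its nearest entry in a learned codebook, yielding a discrete index. The following is a minimal NumPy sketch of that lookup only (the shapes, codebook size, and function name are illustrative assumptions, not the paper's implementation, which also involves training the codebook and a hierarchical encoder):

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Nearest-neighbour lookup of encoder outputs in a discrete codebook.

    z_e:      (N, D) array of continuous encoder vectors.
    codebook: (K, D) array of K learned embedding vectors.
    Returns the discrete indices and the corresponding quantized vectors.
    """
    # Squared Euclidean distance from every encoder vector to every code.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = d.argmin(axis=1)   # (N,) discrete latent codes
    z_q = codebook[indices]      # (N, D) quantized vectors
    return indices, z_q

# Toy example: 4 encoder vectors, a codebook of 8 entries, dimension 16.
rng = np.random.default_rng(0)
z_e = rng.normal(size=(4, 16))
codebook = rng.normal(size=(8, 16))
idx, z_q = vector_quantize(z_e, codebook)
```

After quantization, a video is represented by grids of integer indices rather than pixels, and it is these discrete grids that a scalable autoregressive model can be trained to predict.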

1. INTRODUCTION

When it comes to real-world image data, deep generative models have made substantial progress. With advances in computational efficiency and improvements in architectures, it is now feasible to generate high-resolution, realistic images from vast and highly diverse datasets (Brock et al., 2019; Razavi et al., 2019; Karras et al., 2017). Beyond the domain of images, deep generative models have also shown promise in other data domains such as music (Dieleman et al., 2018; Dhariwal et al., 2020), speech synthesis (Oord et al., 2016), 3D voxels (Liu et al., 2018; Nash & Williams, 2017), and text (Radford et al., 2019). One particular fledgling domain is video. While some work in the area of video generation (Clark et al., 2020; Vondrick et al., 2016; Saito & Saito, 2018) has explored video synthesis (generating videos with no prior frame information), many approaches instead focus on the task of video prediction conditioned on past frames (Ranzato et al., 2014; Srivastava et al., 2015; Patraucean et al., 2015; Mathieu et al., 2016; Lee et al., 2018; Babaeizadeh et al., 2018; Oliu et al., 2018; Xiong et al., 2018; Xue et al., 2016; Finn et al., 2016; Luc et al., 2020). It can be argued that video synthesis is a combination of image generation and video prediction: one could decouple video synthesis into unconditional image generation followed by conditional video prediction from the generated image. Therefore, we specifically focus on video prediction in this paper. Potential computer vision applications of video forecasting include interpolation, anomaly detection, and activity understanding. More broadly, video prediction has general implications for intelligent systems: the ability to anticipate the dynamics of the environment. The problem is thus also relevant for robotics and reinforcement learning (Finn et al., 2016; Ebert et al., 2017; Oh et al., 2015; Ha & Schmidhuber, 2018; Racanière et al., 2017).
Approaches to video prediction have largely skewed toward variations of generative adversarial networks (Mathieu et al., 2016; Lee et al., 2018; Clark et al., 2020; Vondrick et al., 2016; Luc et al., 2020). In comparison, we are aware of only a relatively small number of approaches based on variational autoencoders (Babaeizadeh et al., 2018; Xue et al., 2016; Denton & Fergus, 2018), autoregressive models (Kalchbrenner et al., 2017; Weissenborn et al., 2020), or flow-based models (Kumar et al., 2020). There may be a number of reasons for this situation. One is the explosion in the dimensionality of the input space: a generative model of video must coherently model not one image but tens of them, which makes it difficult to scale such models to large datasets or high resolutions. In addition, previous work (Clark et al., 2020) suggests that video prediction may be fundamentally more difficult than video synthesis; a synthesis model can generate simple samples from the dataset, while prediction potentially forces the model to forecast conditioned on videos that are outliers in the distribution. Furthermore, most prior work has focused on datasets with low scene

