VIDEO PREDICTION WITH VARIATIONAL TEMPORAL HIERARCHIES

Abstract

Deep learning has shown promise for accurately predicting high-dimensional video sequences. Existing video prediction models have succeeded in generating sharp but often short video sequences. Toward improving long-term video prediction, we study hierarchical latent variable models whose levels operate at different time scales. To gain insight into the representations of such models, we study the information stored at each level of the hierarchy via the KL divergence, predictive entropy, datasets of varying speed, and generative distributions. Our analysis confirms that faster-changing details are generally captured by lower levels, while slower-changing facts are remembered by higher levels. On synthetic datasets where common methods fail after 25 frames, we show that temporally abstract latent variable models can make accurate predictions for up to 200 frames.

1. INTRODUCTION

Deep learning has enabled predicting video sequences from large datasets (Chiappa et al., 2017; Oh et al., 2015; Vondrick et al., 2016). For high-dimensional inputs such as video, there likely exists a more compact representation of the scene that facilitates long-term prediction. Instead of learning dynamics in pixel space, latent dynamics models predict ahead in a more compact feature space (Doerr et al., 2018; Buesing et al., 2018; Karl et al., 2016; Hafner et al., 2019). This has the added benefits of computational efficiency and a lower memory footprint, allowing thousands of sequences to be predicted in parallel using a large batch size. Much work in deep learning has focused on spatial abstraction, following the advent of convolutional networks (LeCun et al., 1989); for example, the Variational Ladder Autoencoder (Zhao et al., 2017) learns a hierarchy of image features using networks of different capacities, an idea that has also played an important role in video prediction models (Castrejón et al., 2019). Recent sequential models have incorporated temporal abstraction to learn dependencies between temporally distant observations (Koutník et al., 2014; Chung et al., 2016). Kim et al. (2019) proposed Variational Temporal Abstraction (VTA), which explores one level of temporal abstraction above the latent states, with transitions modeled by a Bernoulli random variable. In this paper, we work in a more controlled setup than VTA to analyze temporally abstract latent variable models both qualitatively and quantitatively. We study the benefits of temporal abstraction using a hierarchical latent dynamics model trained with a variational objective, in which each level of the hierarchy temporally abstracts the level below by an adjustable factor.
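The update schedule of such a hierarchy can be sketched in a few lines. Below is a minimal illustration, assuming that level l of the hierarchy updates every factor**l frames (the function name and exact schedule are illustrative, not the paper's implementation):

```python
import numpy as np

def update_schedule(num_levels, factor, horizon):
    """Return a boolean array [horizon, num_levels]: True where a level ticks.

    Level 0 updates at every frame; level l updates every factor**l frames,
    so higher levels change more slowly (temporal abstraction).
    """
    ticks = np.zeros((horizon, num_levels), dtype=bool)
    for level in range(num_levels):
        ticks[::factor**level, level] = True
    return ticks

# 3 levels with abstraction factor 2 over 8 frames:
schedule = update_schedule(num_levels=3, factor=2, horizon=8)
# Level 0 ticks at every frame; level 1 at frames 0, 2, 4, 6; level 2 at 0 and 4.
```

Under this schedule, a level only has to model changes that occur on its own time scale, which is the intuition behind storing slow-changing facts at higher levels.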
This model can perform long-horizon video prediction of 200 frames, predicting accurate low-level information for six times longer than the baseline model. We study the information stored at different levels of the hierarchy via the KL divergence, predictive entropy, datasets of varying speed, and generative distributions. In our experiments, this information amounts to object locations and identities for the Moving MNIST dataset, and to wall or floor patterns for the GQN mazes dataset (Eslami et al., 2018), stored at different levels.
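One common way to quantify the information a level stores is the KL divergence between its approximate posterior and its prior, which for diagonal Gaussian latents has a closed form. The sketch below computes this quantity; the function name is illustrative, and averaging it per level over a sequence indicates how many nats of new information each level encodes per step:

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# When the posterior matches the prior, the level encodes no new information:
mu, var = np.zeros(4), np.ones(4)
print(gaussian_kl(mu, var, mu, var))  # 0.0
```

A level whose posterior rarely deviates from its prior contributes little information at that time scale, which is how KL profiles separate fast-changing from slow-changing content.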



[Figure 1 legend: 3-level models with temporal abstraction factors 8, 6, 4, and 2; RSSM; SVG-LP; random.]



Figure 1: Mean SSIM over a test set of 100 sequences of open-loop prediction with Moving MNIST. All 3-level latent dynamics models, with temporal abstraction factors 2, 4, 6, and 8, have the same number of model parameters.

