VIDEO PREDICTION WITH VARIATIONAL TEMPORAL HIERARCHIES

Abstract

Deep learning has shown promise for accurately predicting high-dimensional video sequences. Existing video prediction models succeed at generating sharp but often short video sequences. Toward improving long-term video prediction, we study hierarchical latent variable models whose levels operate at different time scales. To gain insight into the representations of such models, we study the information stored at each level of the hierarchy via the KL divergence, predictive entropy, datasets of varying speed, and generative distributions. Our analysis confirms that faster-changing details are generally captured by lower levels, while slower-changing facts are remembered by higher levels. On synthetic datasets where common methods fail after 25 frames, we show that temporally abstract latent variable models can make accurate predictions for up to 200 frames.

1. INTRODUCTION

Deep learning has enabled predicting video sequences from large datasets (Chiappa et al., 2017; Oh et al., 2015; Vondrick et al., 2016). For high-dimensional inputs such as video, there likely exists a more compact representation of the scene that facilitates long-term prediction. Instead of learning dynamics in pixel space, latent dynamics models predict ahead in a more compact feature space (Doerr et al., 2018; Buesing et al., 2018; Karl et al., 2016; Hafner et al., 2019). This has the added benefits of increased computational efficiency and a lower memory footprint, allowing thousands of sequences to be predicted in parallel using a large batch size. Much work in deep learning has focused on spatial abstraction, following the advent of convolutional networks (LeCun et al., 1989), such as the Variational Ladder Autoencoder (Zhao et al., 2017), which learns a hierarchy of features in images using networks of different capacities and has also played an important role in video prediction models (Castrejón et al., 2019). Recent sequential models have incorporated temporal abstraction for learning dependencies between temporally distant observations (Koutník et al., 2014; Chung et al., 2016). Kim et al. (2019) proposed Variational Temporal Abstraction (VTA), which explores one level of temporal abstraction above the latent states, with transitions modeled using a Bernoulli random variable. In this paper, we work in a more controlled setup than VTA to enable a qualitative and quantitative analysis of temporally abstract latent variable models. We study the benefits of temporal abstraction using a hierarchical latent dynamics model trained with a variational objective, in which each level of the hierarchy temporally abstracts the level below by an adjustable factor.
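The clocked update scheme described above can be illustrated with a toy rollout. The sketch below is not the paper's architecture: the transition is a fixed random linear map rather than a learned stochastic network, and the function and parameter names are our own. It only demonstrates the scheduling idea, namely that level l updates every factor^l steps, so higher levels tick more slowly.

```python
import numpy as np

def temporally_abstract_rollout(num_levels, abstraction_factor, horizon, state_dim=8, seed=0):
    """Toy rollout of a hierarchy where level l updates every abstraction_factor**l steps.

    Illustrative sketch only: the transition here is a fixed random linear map,
    whereas the actual model uses learned stochastic transitions that are also
    conditioned on the level above.
    """
    rng = np.random.default_rng(seed)
    # One fixed linear transition per level (stand-in for a learned network).
    transitions = [rng.standard_normal((state_dim, state_dim)) * 0.1 for _ in range(num_levels)]
    states = [np.zeros(state_dim) for _ in range(num_levels)]
    update_counts = [0] * num_levels
    for t in range(horizon):
        for level in range(num_levels):
            # Level l ticks only when t is a multiple of factor**l, so higher
            # levels change more slowly and can retain slower-changing facts.
            if t % abstraction_factor ** level == 0:
                states[level] = transitions[level] @ states[level] + rng.standard_normal(state_dim)
                update_counts[level] += 1
    return update_counts

# With factor 6 over 216 steps, level 0 updates 216 times, level 1 updates
# 36 times, and level 2 only 6 times.
print(temporally_abstract_rollout(num_levels=3, abstraction_factor=6, horizon=216))
```

With an abstraction factor of 6, the top level of a 3-level hierarchy thus performs 36 times fewer transitions than the bottom level, which is what allows it to carry slowly changing context over long horizons.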
This model can perform long-horizon video prediction of 200 frames, predicting accurate low-level information for 6 times longer than the baseline model. We study the information stored at different levels of the hierarchy via KL divergence, predictive entropy, datasets of varying speeds, and generative distributions. In our experiments, we show that this amounts to object locations and identities for the Moving MNIST dataset, and the wall and floor patterns for the GQN mazes dataset (Eslami et al., 2018), stored at different levels. Our key contributions are summarized as follows:

• Temporal Abstract Latent Dynamics (TALD) We introduce a simple model with different clock speeds at every level to study the properties of variational hierarchical dynamics.
• Accurate long-term predictions Our form of temporal abstraction substantially extends how long the model can accurately predict video frames into the future.
• Adaptation to sequence speed We demonstrate that our model automatically adapts the amount of information processed at each level to the speed of the video sequence.
• Separation of information We visualize the content represented at each level of the hierarchy, finding location information in lower levels and object identity in higher levels.
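The per-level KL diagnostic mentioned above has a simple closed form when the posterior and the temporal prior at each level are diagonal Gaussians, as is common in such models; the exact parameterization below is illustrative and the variable names are our own.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions (in nats)."""
    return float(np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    ))

# Hypothetical usage: compare the posterior of one hierarchy level against its
# temporal prior at a single step. A level whose posterior stays close to its
# prior (low KL) is extracting little new information from the observations.
posterior_mean, posterior_std = np.array([0.4, -0.2]), np.array([0.9, 1.1])
prior_mean, prior_std = np.zeros(2), np.ones(2)
print(gaussian_kl(posterior_mean, posterior_std, prior_mean, prior_std))
```

Aggregating this quantity per level over a validation set gives a nats-per-level profile, which is one way to quantify how much information each clock speed absorbs from the video.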

2. RELATED WORK

Generative video models A variety of methods have successfully approached video prediction using large datasets (Chiappa et al., 2017; Oh et al., 2015; Vondrick et al., 2016; Babaeizadeh et al., 2017; Gemici et al., 2017; Ha & Schmidhuber, 2018). Denton & Fergus (2018) proposed a stochastic video generation model with a learned prior that transitions in time and is conditioned on past observations.

Latent dynamics models Latent dynamics models have evolved from latent space models that had access to low-dimensional features (Deisenroth & Rasmussen, 2011; Higuera et al., 2018) to models that can build a compact representation of visual scenes and facilitate video prediction purely in the latent space (Doerr et al., 2018; Buesing et al., 2018; Karl et al., 2016; Franceschi et al., 2020). The Variational RNN (Chung et al., 2015) uses an auto-regressive state transition that takes inputs from observations, making it computationally expensive to use as an imagination module. Hafner et al. (2019) proposed a latent dynamics model with a combination of deterministic and stochastic states, enabling the model to deterministically remember all previous states and filter that information to obtain a distribution over the current state.

Hierarchical latent variables Learning per-frame hierarchical structures has proven helpful for generating videos over long horizons (Wichers et al., 2018). Zhao et al. (2017) proposed the Variational Ladder Autoencoder (VLAE), which uses networks of different capacities at different levels of the hierarchy, encouraging the model to store high-level image features at the top level and simple features at the bottom. Other recently proposed hierarchical models use a purely bottom-up inference approach with no interaction between the inference and generative models (Kingma & Welling,



Figure 1: Mean SSIM over a test set of 100 sequences of open-loop prediction with Moving MNIST. All 3-level latent dynamics models, with temporal abstraction factors 2, 4, 6, and 8, have the same number of model parameters.
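The mean SSIM reported in the figure can be sketched with the structural similarity index; for brevity the version below computes a single global window per frame, whereas reported numbers presumably use the standard local sliding-window variant, and the function names here are our own.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM computed over the whole image as one window (simplified sketch)."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stability constants
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def mean_open_loop_ssim(pred_frames, true_frames):
    # Mean SSIM over a batch of predicted vs. ground-truth frames, as one
    # would average over test sequences at a fixed prediction horizon.
    return float(np.mean([global_ssim(p, t) for p, t in zip(pred_frames, true_frames)]))
```

Identical frames score 1.0, and the score decreases as predictions diverge from the ground truth, which is how curves such as Figure 1 quantify open-loop prediction quality over the horizon.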

Figure 2: Long-horizon open-loop video prediction for the MineRL Navigate dataset (Guss et al., 2019), using our 3-level TALD model with temporal abstraction factor 6, compared with VTA, RSSM, and SVG-LP. TALD accurately predicts the movement of the scene from the ocean to the forest and maintains that context for 420 frames. In contrast, VTA predicts an implausible scene after 240 steps, with blue and black skies in the same frame. RSSM predicts a plausible future in which the player stays in the ocean as the distant forest moves out of the scene, whereas SVG-LP copies the initial frame indefinitely and does not predict any new events.

Lee et al. (2018) proposed to use an adversarial loss with a variational latent variable model to produce naturalistic images, while Kumar et al. (2019) used flow-based generative modeling to directly optimize the likelihood of a video generation model. Recently, Weissenborn et al. (2020) scaled autoregressive models for video prediction using a three-dimensional self-attention mechanism and showed competitive results on real-world video datasets. Along similar lines, Xu et al. (2018) proposed an entirely CNN-based architecture for modeling dependencies between sequential inputs.

