LONG-HORIZON VIDEO PREDICTION USING A DYNAMIC LATENT HIERARCHY

Anonymous authors
Paper under double-blind review

Abstract

The task of video prediction and generation is notoriously difficult, with research in this area largely limited to short-term predictions. Though plagued by noise and stochasticity, videos consist of features organised in a spatiotemporal hierarchy, with different features possessing different temporal dynamics. In this paper, we introduce Dynamic Latent Hierarchy (DLH), a deep hierarchical latent model that represents videos as a hierarchy of latent states evolving over separate and fluid timescales. Each latent state is a mixture distribution with two components, representing the immediate past and the predicted future, causing the model to learn transitions only between sufficiently dissimilar states while clustering temporally persistent states closer together. Using this unique property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modelling the temporal dynamics of a video, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, better represents stochasticity, and dynamically adjusts its hierarchical and temporal structure. Our paper shows, among other things, how progress in representation learning can translate into progress in prediction tasks.

1. INTRODUCTION

Video data is considered one of the most difficult modalities for generative modelling and prediction, characterised by high levels of noise, complex temporal dynamics, and inherent stochasticity. Modelling long-horizon videos poses an even greater challenge due to sequential error accumulation, largely restricting research on this topic to short-term predictions. Deep learning has given rise to generative latent-variable models with the capability to learn rich latent representations, making it possible to model high-dimensional data by means of more efficient, lower-dimensional states (Kingma & Welling, 2014; Higgins et al., 2022; Vahdat & Kautz, 2020; Rasmus et al., 2015). Of particular interest here are hierarchical latent models, which possess a higher degree of representational power and expressivity. Employing hierarchies has so far proved to be an effective method for generating high-fidelity visual data, as well as for concurrently producing more meaningful and disentangled latent representations in both static (Vahdat & Kautz, 2020) and temporal (Zakharov et al., 2022) datasets.

Unlike images, videos possess a spatiotemporal structure, in which a collection of spatial features adhere to the intrinsic temporal dynamics of a dataset, often evolving at different and fluid timescales. For instance, consider the simplistic example shown in Figure 1, in which the features of a video sequence evolve within a strict temporal hierarchy: from the panda continuously changing its position to the background elements remaining static over the entire duration of the video. Discovering such a temporal structure in videos complements nicely the research into hierarchical generative models, which have been shown capable of extracting and disentangling features across a hierarchy of latent states.
Relying on this notion of inherent spatiotemporal organisation of features, several hierarchical architectures have been proposed to either enforce a generative temporal hierarchy explicitly (Saxena et al., 2021), or discover it in an unsupervised fashion (Kim et al., 2019; Zakharov et al., 2022). In general, these architectures consist of a collection of latent states that evolve over different timescales.

Building upon these notions, we propose an architecture of a hierarchical generative model for long-horizon video prediction, Dynamic Latent Hierarchy (DLH). The principal ideas underlying this work are two-fold. First, we posit that learning disentangled hierarchical representations and their separate temporal dynamics increases the model's expressivity and breaks down the problem of video modelling into simpler sub-problems, thus benefiting prediction quality. As such, our model is capable of discovering the appropriate hierarchical spatiotemporal structure of the dataset, seamlessly adapting its generative structure to a dataset's dynamics. Second, the existence of a spatiotemporal hierarchy, in which some features can remain static for an arbitrary period of time (e.g. background in Fig. 1), implies that predicting the next state at every timestep may introduce unnecessary accumulation of error and computational complexity. Instead, our model learns to transition between states only if a change in the represented features has been observed (e.g. airplane in Fig. 1). Conversely, if no change in the features has been detected, the model clusters such temporally-persistent states closer together, thus building a more organised latent space.
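The change-gated transition idea above can be sketched as a toy update rule. This is only an illustration of the principle, not the paper's actual mechanism: the L2 change measure, the fixed `threshold`, and `transition_fn` are all illustrative assumptions.

```python
import numpy as np

def gated_update(prev_state, new_feature, transition_fn, threshold=0.5):
    """Update a latent state only when the observed features have changed
    sufficiently; otherwise keep (cluster with) the previous state.

    Returns the (possibly unchanged) state and whether a transition fired.
    """
    change = np.linalg.norm(new_feature - prev_state)
    if change > threshold:
        # Sufficiently dissimilar: transition to a new predicted state.
        return transition_fn(prev_state), True
    # Temporally persistent: no transition, no accumulated prediction error.
    return prev_state, False

# A static feature (e.g. the background) triggers no transition,
# while a changing feature (e.g. the panda's position) does.
static = gated_update(np.zeros(3), np.zeros(3), lambda s: s + 1.0)
moving = gated_update(np.zeros(3), np.ones(3), lambda s: s + 1.0)
```

Gating transitions this way means higher, slower-changing levels of a hierarchy are updated rarely, which is one intuition for why error accumulation over long horizons is reduced.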
Our contributions are summarised as follows:

• A novel architecture of a hierarchical latent-variable generative model employing temporal Gaussian mixtures (GM) for representing latent states and their dynamics;

• Incorporation of a non-parametric inference method for estimating the discrete posterior distribution over the temporal GM components;

• The resulting superior long-horizon video prediction performance, emergent hierarchical disentanglement properties, and improved stochasticity representation.

2. DYNAMIC LATENT HIERARCHY

We propose an architecture of a hierarchical latent model for video prediction -Dynamic Latent Hierarchy. DLH consists of a hierarchy of latent states that evolve over different and flexible timescales. Each latent state is a mixture of two Gaussian components that represent the immediate past and the predicted future in a single belief state. Using this formalisation, the model dynamically assigns every newly inferred posterior state to one of these clusters, and thus implicitly learns the temporal hierarchy of the data in an unsupervised fashion.
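The assignment of a newly inferred state to the "past" or "predicted future" component of a two-component mixture can be illustrated with a standard Bayes-rule responsibility computation. This sketch assumes isotropic Gaussian components with shared variance and a uniform prior over components, which are simplifications for illustration rather than the model's actual parameterisation.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log-density of an isotropic Gaussian with shared scalar variance."""
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def responsibilities(x, past_mean, future_mean, var=1.0, prior=0.5):
    """Posterior probability that state `x` belongs to the 'past' vs. the
    'predicted future' component of a two-component Gaussian mixture."""
    log_p = np.array([
        np.log(prior) + log_gaussian(x, past_mean, var),
        np.log(1.0 - prior) + log_gaussian(x, future_mean, var),
    ])
    log_p -= log_p.max()   # subtract max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()     # [P(past | x), P(future | x)]

# A state close to the predicted-future mean is assigned to that component,
# i.e. a change in the represented features has been observed.
r = responsibilities(np.array([2.0, 2.0]),
                     past_mean=np.zeros(2),
                     future_mean=2.0 * np.ones(2))
```

Under this view, assignment to the "past" component corresponds to a temporally persistent state being clustered with its predecessor, while assignment to the "future" component corresponds to a transition.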

2.1. GENERATIVE MODEL

We consider sequences of observations, {o_1, ..., o_T}, modelled by a hierarchical generative model with a joint distribution of the form (Fig. 2):



Figure 1: Videos can be viewed as a collection of features organised in a spatiotemporal hierarchy. This graphic illustrates a sequence of frames, in which the components of the video possess different temporal dynamics (white circles indicate feature changes). Notice the irregularities in their dynamics: the panda continuously changes its position, the airplane is seen only for a few timesteps, while the background remains static throughout. Similarly, our model learns hierarchically disentangled representations of video features with the ability to model their unique dynamics.

