LONG-HORIZON VIDEO PREDICTION USING A DYNAMIC LATENT HIERARCHY

Anonymous authors
Paper under double-blind review

Abstract

Video prediction and generation are notoriously difficult tasks, with research in this area largely limited to short-term predictions. Though plagued with noise and stochasticity, videos consist of features that are organised in a spatiotemporal hierarchy, with different features evolving under different temporal dynamics. In this paper, we introduce Dynamic Latent Hierarchy (DLH), a deep hierarchical latent model that represents videos as a hierarchy of latent states evolving over separate and fluid timescales. Each latent state is a mixture distribution with two components, representing the immediate past and the predicted future; this causes the model to learn transitions only between sufficiently dissimilar states, while clustering temporally persistent states closer together. Using this property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modelling the temporal dynamics of a video, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, better represents stochasticity, and dynamically adjusts its hierarchical and temporal structure. Our paper shows, among other things, how progress in representation learning can translate into progress in prediction tasks.

1. INTRODUCTION

Video data is considered one of the most difficult modalities for generative modelling and prediction, characterised by high levels of noise, complex temporal dynamics, and inherent stochasticity. Modelling long-term videos poses an even greater challenge due to sequential error accumulation, largely restricting research in this area to short-term predictions. Deep learning has given rise to generative latent-variable models capable of learning rich latent representations, allowing high-dimensional data to be modelled by means of more efficient, lower-dimensional states (Kingma & Welling, 2014; Higgins et al., 2022; Vahdat & Kautz, 2020; Rasmus et al., 2015). Of particular interest here are hierarchical latent models, which possess a higher degree of representational power and expressivity. Employing hierarchies has so far proved to be an effective method for generating high-fidelity visual data, as well as for producing more meaningful and disentangled latent representations in both static (Vahdat & Kautz, 2020) and temporal (Zakharov et al., 2022) datasets.

Unlike images, videos possess a spatiotemporal structure, in which a collection of spatial features adheres to the intrinsic temporal dynamics of a dataset, often evolving at different and fluid timescales. For instance, consider the simplistic example shown in Figure 1, in which the features of a video sequence evolve within a strict temporal hierarchy: from the panda continuously changing its position to the background elements remaining static over the entire duration of the video. Discovering such a temporal structure in videos complements the research into hierarchical generative models, which have been shown to be capable of extracting and disentangling features across a hierarchy of latent states.
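As a rough illustration of the idea of features evolving at separate, data-driven timescales, the sketch below shows a two-level hierarchy in which each level transitions only when its predicted next state is sufficiently dissimilar from its current state. This is our own toy construction, not the DLH architecture; the dimensions, transition maps, and dissimilarity thresholds are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class Level:
    """One level of a toy latent hierarchy with a dissimilarity-gated update."""
    def __init__(self, dim, threshold):
        self.state = np.zeros(dim)
        self.W = rng.normal(scale=0.5, size=(dim, dim))  # toy transition map
        self.threshold = threshold
        self.updates = 0  # how many transitions this level has made

    def step(self, bottom_up):
        # Predict the next state from the current state and bottom-up input.
        predicted = np.tanh(self.W @ self.state + bottom_up)
        # Transition only if the prediction differs enough from the past
        # state; otherwise the state persists, so this level ticks slowly.
        if np.linalg.norm(predicted - self.state) > self.threshold:
            self.state = predicted
            self.updates += 1
        return self.state

fast = Level(dim=4, threshold=0.1)  # low threshold: frequent transitions
slow = Level(dim=4, threshold=1.5)  # high threshold: rare transitions

for t in range(50):
    obs = rng.normal(size=4)        # stand-in for an encoded video frame
    h = fast.step(obs)              # lower level tracks the observation
    slow.step(h)                    # upper level sees the lower level's state

print(fast.updates, slow.updates)
```

Run over 50 steps, the lower level transitions at nearly every step while the upper level persists through most of them, mimicking how static background features would occupy slowly-updating levels of the hierarchy and fast-moving objects the quickly-updating ones.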
Relying on this notion of inherent spatiotemporal organisation of features, several hierarchical architectures have been proposed to either enforce a generative temporal hierarchy explicitly (Saxena et al., 2021), or discover it in an unsupervised fashion (Kim et al., 2019; Zakharov et al., 2022). In general, these architectures consist of a collection of latent states that

