REVISITING HIERARCHICAL APPROACH FOR PERSISTENT LONG-TERM VIDEO PREDICTION

Abstract

Learning to predict the long-term future of video frames is notoriously challenging due to inherent ambiguities in the distant future and dramatic amplification of prediction error through time. Despite recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), and extrapolating to a longer future quickly leads to destruction of structure and content. In this work, we revisit hierarchical models in video prediction. Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels via video-to-video translation. Despite its simplicity, we show that modeling structures and their dynamics in the discrete semantic structure space with a stochastic recurrent estimator leads to surprisingly successful long-term prediction. We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands of frames), setting a new standard of video prediction with orders-of-magnitude longer prediction time than existing approaches.

1. INTRODUCTION

Video prediction aims to generate future frames conditioned on a short video clip. It has received much attention in recent years, as forecasting the future of a visual sequence is critical for improving planning in model-based reinforcement learning (Finn et al., 2016; Hafner et al., 2019; Ha & Schmidhuber, 2018) and for forecasting future events (Hoai & Torre, 2013), actions (Lan et al., 2014), and activities (Lan et al., 2014; Ryoo, 2011). To make it truly beneficial for these applications, video prediction should be capable of forecasting the long-term future.

Many previous approaches have formulated video prediction as a conditional generation task by recursively synthesizing future frames conditioned on the previous frames (Vondrick et al., 2016; Tulyakov et al., 2018; Denton & Fergus, 2018; Babaeizadeh et al., 2018; Castrejon et al., 2019; Villegas et al., 2019). Despite their success in short-term forecasting, however, none of these approaches have been successful in synthesizing a convincing long-term future, due to the challenges in modeling complex dynamics and extrapolating from short sequences to a much longer future. As prediction errors easily accumulate and amplify through time, the quality of the predicted frames quickly degrades.

One way to reduce this error propagation is to extrapolate in a low-dimensional structure space instead of directly estimating pixel-level dynamics in a video. To this end, many hierarchical modeling approaches have been proposed (Villegas et al., 2017b; Wichers et al., 2018; Liang et al., 2017; Yan et al., 2018; Walker et al., 2017; Kim et al., 2019). These approaches first generate a sequence in a low-dimensional structure representation, and subsequently generate appearance conditioned on the predicted structures. Hierarchical approaches are potentially promising for long-term prediction, since learning structure-aware dynamics allows the model to generate semantically accurate motion and content in the future.
However, previous approaches often employed overly specific and insufficiently general structures, such as human body joints (Villegas et al., 2017b; Yan et al., 2018; Yang et al., 2018; Walker et al., 2017; Kim et al., 2019) or face landmarks (Yan et al., 2018; Yang et al., 2018).

Figure 1: The overall framework of the proposed hierarchical approach. Given the context frames and label maps extracted by the segmentation network, our model predicts the future frames by estimating the semantic label maps with a stochastic sequence estimator (Section 2.1) and converting the predicted labels to RGB frames with a conditional image sequence generator (Section 2.2).

Moreover, they made oversimplified assumptions about the future by using a deterministic loss or assuming homogeneous content. We therefore argue that the benefit of hierarchical models has been underestimated and that their impact on long-term video prediction has not been properly demonstrated.

In this paper, we propose a hierarchical model with a general structure representation (i.e., a dense semantic label map) for long-term video prediction on complex scenes. We abstract the scene as categorical semantic labels for each pixel, and predict the motion and content change in this label space using a variational sequence model. Given the context frames and the predicted label maps, we generate the textures by translating the sequence of label maps to RGB frames. As dense label maps are generic and universal, we can learn comprehensive scene dynamics, from object motion to even dramatic scene changes. We can also capture the multi-modal dynamics in the label space through the stochastic prediction of the variational sequence model.

Our experiments demonstrate that we can generate a surprisingly long-term future of videos, from driving scenes to human dancing, including the complex motion of multiple objects and even an evolution of the content in the distant future. We also show that the predicted frame quality is preserved through time, which enables persistent future prediction over a virtually infinite time horizon.
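The dense semantic label map described above is typically stored as an H×W array of per-pixel category indices and expanded into an N×H×W one-hot tensor before being fed to a sequence model. The following minimal numpy sketch illustrates this representation; the tiny 2×2 map over three categories is purely illustrative:

```python
import numpy as np

def to_onehot(seg: np.ndarray, num_classes: int) -> np.ndarray:
    """Expand an H×W map of per-pixel class indices into an N×H×W one-hot tensor."""
    onehot = np.zeros((num_classes, *seg.shape), dtype=np.float32)
    # Scatter a 1 into the channel given by each pixel's class index.
    np.put_along_axis(onehot, seg[None], 1.0, axis=0)
    return onehot

def from_onehot(onehot: np.ndarray) -> np.ndarray:
    """Recover per-pixel class indices by taking the argmax over the category axis."""
    return onehot.argmax(axis=0)

seg = np.array([[0, 2], [1, 1]])        # toy 2×2 label map over N=3 categories
onehot = to_onehot(seg, num_classes=3)  # shape (3, 2, 2)
assert (from_onehot(onehot) == seg).all()
```

The round trip through `from_onehot` is exact, since each pixel activates exactly one category channel.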
To evaluate long-term prediction at this scale, we also propose a novel metric called shot-wise FVD, which measures spatio-temporal prediction quality without requiring ground-truth frames and is consistent with human perception.
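The precise windowing scheme of shot-wise FVD is defined later in the paper; at its core, FVD is the Fréchet distance between Gaussians fitted to spatio-temporal features (e.g., from a pretrained I3D network) of two video sets. The sketch below computes that distance under the assumption that feature vectors have already been extracted, using the identity Tr((Σ₁Σ₂)^{1/2}) = Σᵢ √λᵢ(Σ₁Σ₂) to avoid an explicit matrix square root:

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((Σa Σb)^{1/2}) equals the sum of square roots of the eigenvalues of Σa Σb,
    # which are real and non-negative because Σa Σb is similar to a PSD matrix.
    eig = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eig, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

The distance is zero for identical feature distributions and grows with both mean shift and covariance mismatch, which is what makes it usable without per-frame ground truth.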

2. METHOD

Given the context frames x_{1:C} = {x_1, x_2, ..., x_C}, our goal is to synthesize the future frames x_{C+1:T} = {x_{C+1}, x_{C+2}, ..., x_T} up to an arbitrarily long-term future T. Let s_t ∈ R^{N×H×W} denote a dense label map of the frame x_t defined over N categories, which is inferred by a pre-trained semantic segmentation model¹. Then, given the context frames x_{1:C} and the label maps s_{1:C}, our hierarchical framework synthesizes the future frames x̂_{C+1:T} by the following steps:

• A structure generator takes the context label maps as inputs and produces a sequence of future label maps by ŝ_{C+1:T} ∼ G_struct(s_{1:C}, z_{1:T}), where z_t denotes the latent variable encoding the stochasticity of the structure.
• Given the context frames and the predicted structures, an image generator produces RGB frames by x̂_{C+1:T} ∼ G_image(x_{1:C}, {s_{1:C}, ŝ_{C+1:T}}).

Figure 1 illustrates the overall pipeline. Note that there are various factors beyond motion that cause spatio-temporal variations in the label maps, such as the emergence of new objects, partial observability, or even dramatic scene changes due to global camera motion (e.g., panning). By learning to model these dynamics in the semantic labels with the stochastic sequence estimator and conditioning the video generation on the estimated labels, the proposed hierarchical model can synthesize convincing frames into the very long-term future. Below, we describe each component.
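The two-stage, recursive pipeline above can be sketched as the following loop. `g_struct` and `g_image` are hypothetical stand-ins (simple stubs) for the trained networks G_struct and G_image, and all shapes and constants are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

N, H, W, C_CTX, T = 3, 8, 8, 2, 6   # categories, spatial size, context length, horizon
rng = np.random.default_rng(0)
PALETTE = rng.uniform(size=(N, 3))  # fixed stub color per semantic category

def g_struct(context_labels, z):
    """Stub for G_struct: predict the next one-hot label map from past maps and latent z.
    A real model would be a stochastic recurrent network over the label space."""
    logits = context_labels[-1] + 0.1 * z  # perturb the last map with the latent
    return np.eye(N, dtype=np.float32)[logits.argmax(axis=0)].transpose(2, 0, 1)

def g_image(context_frames, labels):
    """Stub for G_image: translate the latest label map into an RGB frame."""
    return np.tensordot(PALETTE.T, labels[-1], axes=1)  # (3, H, W)

# Context: one-hot label maps s_{1:C} (N×H×W) and RGB frames x_{1:C} (3×H×W).
labels = [np.eye(N, dtype=np.float32)[rng.integers(0, N, (H, W))].transpose(2, 0, 1)
          for _ in range(C_CTX)]
frames = [rng.uniform(size=(3, H, W)) for _ in range(C_CTX)]

# Recursive prediction: extrapolate the structure first, then render pixels.
for t in range(C_CTX, T):
    z_t = rng.normal(size=(N, H, W))     # latent encoding the structure's stochasticity
    labels.append(g_struct(labels, z_t))  # structure step: s-hat_t from s_{1:t-1}, z_t
    frames.append(g_image(frames, labels))  # image step: x-hat_t from context and labels

assert len(frames) == T and frames[-1].shape == (3, H, W)
```

The key design point the loop illustrates is that error accumulation happens in the low-dimensional label space, while pixels are re-rendered fresh at every step from the predicted structure.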



¹ We employ the network of Zhu et al. (2019), pre-trained on still images, to obtain segmentations in videos.




Availability: https://1konny.github.io/HVP/

