DIVERSE VIDEO GENERATION USING A GAUSSIAN PROCESS TRIGGER

Abstract

Generating future frames given a few context (or past) frames is a challenging task. It requires modeling the temporal coherence of videos and the multi-modality of potential future states. Current variational approaches for video generation tend to marginalize over multi-modal future outcomes. Instead, we propose to explicitly model the multi-modality of future outcomes and leverage it to sample diverse futures. Our approach, Diverse Video Generator, uses a Gaussian Process (GP) to learn priors on future states given the past and maintains a probability distribution over possible futures given a particular sample. In addition, we leverage the changes in this distribution over time to control the sampling of diverse future states by estimating the end of on-going sequences. That is, we use the variance of the GP over the output function space to trigger a change in an action sequence. We achieve state-of-the-art results on diverse future frame generation in terms of both reconstruction quality and diversity of the generated sequences.

1. INTRODUCTION

Humans can often imagine multiple possible ways that a scene might change over time. Modeling and generating diverse futures, however, is an incredibly challenging problem. The challenge stems from the inherent multi-modality of the task: given a sequence of past frames, there can be multiple possible outcomes for the future frames. For example, given the image of a "person holding a cup" in Figure 1, most would predict that the next few frames correspond to either the action "drinking from the cup" or "placing the cup on the table." This challenge is exacerbated by the lack of real training data with diverse outputs: every real-world training video comes with a single real future and no "other" potential futures. Similar-looking past frames can have completely different futures (e.g., Figure 1). In the absence of any priors or explicit supervision, current methods struggle to model this diversity. Given similar-looking past frames with different futures in the training data, variational methods, which commonly build on variational autoencoders (Kingma & Welling, 2013), tend to average over the different futures (Denton & Fergus, 2018; Babaeizadeh et al., 2017; Gao et al., 2018; Lee et al., 2018; Oliu et al., 2017). We hypothesize that explicitly modeling future diversity is essential for high-quality, diverse future frame generation. In this paper, we model the diversity of future states, given past context, using Gaussian Processes (GPs) (Rasmussen, 2006), which have several desirable properties. In a Bayesian formulation, they learn a prior on potential futures given past context. This allows us to update the distribution over possible futures as more context frames are provided as evidence, and to maintain a set of potential futures (underlying functions in the GP). Finally, our formulation provides a property that is crucial to generating future frames: the ability to estimate when to generate a diverse output versus continue an on-going action, and a way to control the predicted futures. In particular, we use the variance of the GP at any given time step as an indicator of whether an action sequence is on-going or finished. An illustration of this mechanism is presented in Figure 2. When we observe a frame (say at time t) that can have several possible futures, the variance of the GP model is high (Figure 2, left). Different functions represent potential action sequences that can be generated starting from this particular frame. Once we select the next frame (at t+2), the GP variance over the future states is relatively low (Figure 2, center), indicating that an action sequence
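The variance-as-trigger idea above can be sketched with a toy Gaussian Process. This is a minimal illustration, not the paper's actual model: it uses a 1-D scalar per time step as a stand-in for a learned frame embedding, a fixed RBF kernel from scikit-learn, and a hypothetical `variance_trigger` helper with an arbitrary threshold. The point it demonstrates is only the mechanism: predictive variance stays low near observed context (an on-going action) and grows once the context no longer constrains the future, at which point a diverse next state could be sampled.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy stand-in: one scalar "latent state" per time step instead of a
# real frame embedding. (Hypothetical data, for illustration only.)
t_past = np.arange(5, dtype=float).reshape(-1, 1)  # observed steps 0..4
z_past = np.sin(t_past).ravel()                    # their latent codes

# Fixed kernel (optimizer=None) keeps the sketch deterministic.
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-4),
    optimizer=None,
    normalize_y=True,
)
gp.fit(t_past, z_past)

def variance_trigger(t_query, threshold=0.5):
    """Fire when the GP predictive std at t_query exceeds a threshold,
    signalling that the on-going action is likely over and a new,
    diverse future should be sampled."""
    _, std = gp.predict(np.array([[t_query]]), return_std=True)
    return bool(std[0] > threshold)

# Uncertainty just after the observed context vs. far beyond it.
_, std_near = gp.predict(np.array([[4.5]]), return_std=True)
_, std_far = gp.predict(np.array([[12.0]]), return_std=True)
# std_far exceeds std_near: far from the context, the GP reverts toward
# its prior, and the high variance would fire the trigger.
```

In the paper's setting the GP is defined over learned frame representations rather than raw time indices, but the control logic is the same: monitor the posterior variance at each step and switch to sampling a different future (a different underlying function) once it spikes.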

