DIVERSE VIDEO GENERATION USING A GAUSSIAN PROCESS TRIGGER

Abstract

Generating future frames given a few context (or past) frames is a challenging task. It requires modeling the temporal coherence of videos as well as multi-modality in terms of diversity of the potential future states. Current variational approaches for video generation tend to marginalize over multi-modal future outcomes. Instead, we propose to explicitly model the multi-modality of the future outcomes and leverage it to sample diverse futures. Our approach, Diverse Video Generator, uses a Gaussian Process (GP) to learn priors on future states given the past and maintains a probability distribution over possible futures given a particular sample. In addition, we leverage changes in this distribution over time to control the sampling of diverse future states by estimating the end of on-going sequences. That is, we use the variance of the GP over the output function space as a trigger to change the on-going action sequence. We achieve state-of-the-art results on diverse future frame generation in terms of reconstruction quality and diversity of the generated sequences.

1. INTRODUCTION

Humans are often able to imagine multiple possible ways that a scene can change over time. Modeling and generating diverse futures is an incredibly challenging problem. The challenge stems from the inherent multi-modality of the task, i.e., given a sequence of past frames, there can be multiple possible outcomes of the future frames. For example, given the image of a "person holding a cup" in Figure 1, most would predict that the next few frames correspond to either the action "drinking from the cup" or "keeping the cup on the table." This challenge is exacerbated by the lack of real training data with diverse outputs: all real-world training videos come with a single real future and no "other" potential futures. Similar-looking past frames can have completely different futures (e.g., Figure 1). In the absence of any priors or explicit supervision, current methods struggle with modeling this diversity. Given similar-looking past frames with different futures in the training data, variational methods, which commonly utilize variational autoencoders (Kingma & Welling, 2013), tend to average the results to better match all the different futures (Denton & Fergus, 2018; Babaeizadeh et al., 2017; Gao et al., 2018; Lee et al., 2018; Oliu et al., 2017). We hypothesize that explicit modeling of future diversity is essential for high-quality, diverse future frame generation. In this paper, we model the diversity of the future states, given past context, using Gaussian Processes (GP) (Rasmussen, 2006), which have several desirable properties. They learn a prior on potential futures given past context, in a Bayesian formulation. This allows us to update the distribution of possible futures as more context frames are provided as evidence and to maintain a set of potential futures (underlying functions in the GP). Finally, our formulation provides an interesting property that is crucial to generating future frames: the ability to estimate when to generate a diverse output vs.
continue an on-going action, and a way to control the predicted futures. In particular, we utilize the variance of the GP at any specific time step as an indicator of whether an action sequence is on-going or finished. An illustration of this mechanism is presented in Figure 2. When we observe a frame (say at time t) that can have several possible futures, the variance of the GP model is high (Figure 2 (left)). Different functions represent potential action sequences that can be generated starting from this particular frame. Once we select the next frame (at t+2), the GP variance of the future states is relatively low (Figure 2 (center)). Now that we have a good way to model diversity, the next step is to generate future frames. Even after tremendous advances in the field of generative models for image synthesis (Denton & Fergus, 2018; Babaeizadeh et al., 2017; Lee et al., 2018; Vondrick & Torralba, 2017; Lu et al., 2017; Vondrick et al., 2016; Saito et al., 2017; Tulyakov et al., 2018; Hu & Wang, 2019), the task of generating future frames (not necessarily diverse) conditioned on past frames remains hard. As opposed to independent images, the future frames need to obey the video dynamics that might be on-going in the past frames, follow world knowledge (e.g., how humans and objects interact), etc. We utilize a fairly straightforward process to generate future frames, built on two modules: a frame auto-encoder and a dynamics encoder. The frame auto-encoder learns to encode a frame into a latent representation and to reconstruct the original frame from it. The dynamics encoder learns to model dynamics between past and future frames. We learn two independent dynamics encoders: an LSTM encoder (similar to (Srivastava et al., 2015)), utilized to model on-going actions, and a GP encoder, utilized to model transitions to new actions. The variance of this GP encoder can be used as a trigger to decide when to sample new actions.
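The core quantity the trigger relies on can be illustrated with a minimal standard GP-regression sketch (numpy, with an assumed RBF kernel over scalar latent values; the actual model operates on learned latent codes, so this is illustrative only): the GP's predictive variance at a query time step shrinks near the observed context and grows far from it.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between two sets of 1-D inputs.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(t_past, y_past, t_future, noise=1e-4):
    """Posterior mean and per-point variance of a GP over future values,
    conditioned on past observations (t_past, y_past)."""
    K = rbf_kernel(t_past, t_past) + noise * np.eye(len(t_past))
    K_s = rbf_kernel(t_past, t_future)
    K_ss = rbf_kernel(t_future, t_future)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_past
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, np.diag(cov)

# Past frames' latent values observed at t = 0..4; query t = 5 and t = 10.
t_past = np.arange(5, dtype=float)
y_past = np.sin(t_past)
mean, var = gp_posterior(t_past, y_past, np.array([5.0, 10.0]))
# Variance grows with distance from the observed context:
# the far future is more uncertain than the next step.
assert var[1] > var[0]
```

In the paper's setting, it is this predictive variance (over the output function space) that is thresholded to decide between continuing an on-going action and sampling a new one.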
We train this framework end-to-end. We first provide an overview of the GP formulation and scalable training techniques in §3, and then describe our approach in §4. Comprehensively evaluating diverse future frame generation is still an open research problem. Following the recent state of the art, we evaluate different aspects of the approach independently. The quality of generated frames is quantified using per-frame image synthesis/reconstruction metrics: SSIM (Wang et al., 2019; Sampat et al., 2009), PSNR, and LPIPS (Zhang et al., 2018; Dosovitskiy & Brox, 2016; Johnson et al., 2016). The temporal coherence and quality of a short video clip (16 neighboring frames) are jointly evaluated using the FVD metric (Unterthiner et al., 2018). However, high-quality, temporally coherent frame synthesis does not evaluate diversity of the predicted frames. Therefore, since there are no multiple ground-truth futures, we propose an alternative evaluation strategy for diversity, inspired by (Villegas et al., 2017b): utilizing action classifiers to evaluate whether an action switch has occurred. A change in action indicates that the method was able to sample a diverse future. Together, these metrics evaluate whether an approach can generate multiple high-quality frames that are temporally coherent and diverse. Details of these metrics and baselines, and extensive quantitative and qualitative results, are provided in §5. To summarize, our contributions are: (a) modeling the diversity of future states using a GP, which maintains priors on future states given the past frames in a Bayesian formulation, and (b) leveraging the changing GP distribution over time (given newly observed evidence) to estimate when an on-going action sequence completes, and using the GP variance to control the triggering of a diverse future state. This yields state-of-the-art results on future frame generation.
We also quantify the diversity of the generated sequences using action classifiers as a proxy metric.
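Of the per-frame metrics listed above, PSNR has a closed form that is simple enough to state exactly. A minimal reference implementation (assuming frames normalized to [0, 1]; SSIM, LPIPS, and FVD require the respective learned or windowed models and are not reproduced here):

```python
import numpy as np

def psnr(ref, pred, max_val=1.0):
    """Peak signal-to-noise ratio between a ground-truth frame and a
    generated frame, both arrays with values in [0, max_val]."""
    mse = np.mean((ref.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)   # uniform error of 0.1 -> MSE = 0.01
print(round(psnr(ref, pred), 1))  # 20.0
```

Higher PSNR indicates closer reconstruction; like SSIM it is computed per frame and averaged over the sequence.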

2. RELATED WORK

Understanding and predicting the future, given the observed past, is a fundamental problem in video understanding. The future states are inherently multi-modal, and capturing their diversity finds direct use in many safety-critical settings (e.g., autonomous vehicles), where it is critical to account for multiple possible future outcomes.



Figure 1: Given "person holding cup," humans can often predict multiple possible futures (e.g., "drinking from the cup" or "keeping the cup on the table").

When the GP variance over future states is low, the current action is on-going, and the model should continue it as opposed to trying to sample a diverse future. After the completion of the on-going sequence, the GP variance over potential future states becomes high again. This implies that we can either continue this action (i.e., pick the mean function, represented by the black line in Figure 2 (center)) or sample a potentially diverse future (i.e., one of the functions that contributes to the high variance). This illustrates how we can use the GP to decide when to trigger diverse actions. An example of using the GP trigger is shown in Figure 2 (right), where after every few frames, we trigger a different action.
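The trigger mechanism described above can be sketched as a simple rollout loop. All names here (`gp_variance`, `lstm_step`, `sample_new_action`) are hypothetical placeholders standing in for the paper's learned components, not its actual API:

```python
import numpy as np

def generate(z_context, n_frames, threshold,
             gp_variance, lstm_step, sample_new_action, rng):
    """Roll out future latent codes, switching to a freshly sampled
    action whenever the GP's predictive variance exceeds the threshold."""
    zs = list(z_context)
    for _ in range(n_frames):
        var = gp_variance(zs)  # GP predictive variance at the next step
        if var > threshold:
            # High variance: the on-going sequence has ended;
            # trigger a diverse future.
            z_next = sample_new_action(zs, rng)
        else:
            # Low variance: continue the on-going action with the LSTM.
            z_next = lstm_step(zs)
        zs.append(z_next)
    return zs[len(z_context):]

# Toy usage with stub components (alternating high/low variance).
rng = np.random.default_rng(0)
out = generate(
    z_context=[0.0, 0.1],
    n_frames=3,
    threshold=0.5,
    gp_variance=lambda zs: 0.9 if len(zs) % 2 == 0 else 0.1,
    lstm_step=lambda zs: zs[-1] + 0.1,
    sample_new_action=lambda zs, rng: rng.normal(),
    rng=rng,
)
assert len(out) == 3
```

Raising the threshold makes the generator more conservative (longer continuations); lowering it triggers diverse futures more often, matching the controllability claim above.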

