CONTRASTIVE VIDEO TEXTURES

Abstract

Existing methods for video generation struggle to generate more than a short sequence of frames. We introduce a non-parametric approach for infinite video generation based on learning to resample frames from an input video. Our work is inspired by Video Textures, a classic method relying on pixel similarity to stitch sequences of frames, which performs well for videos with a high degree of regularity but fails in less constrained settings. Our method learns a distance metric to compare frames in a manner that scales to more challenging dynamics and allows for conditioning on heterogeneous data, such as audio. We learn representations for video frames and probabilities of transitioning by fitting a video-specific bi-gram model trained using contrastive learning. To synthesize the texture, we represent the video as a graph where the nodes are frames and edges are transitions with probabilities predicted by our video-specific model. By randomly traversing edges with high transition probabilities, we generate diverse, temporally smooth videos with novel sequences and transitions. With no additional training, the model naturally extends to the task of audio-conditioned video synthesis. Our model outperforms baselines on human perceptual scores, handles a diverse range of input videos, and combines semantic and audio-visual cues to synthesize videos that synchronize well with an audio signal.

1. INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Variational Autoencoders (VAEs) (Kingma & Welling, 2013) have achieved great success in generating images "from scratch". While one might have hoped that video generation would be a simple extension of image-generation methods, this has not been the case. A major reason is that videos are much higher dimensional than images, and producing correct transitions between frames is a difficult problem. While video generation (Vondrick et al., 2016; Mallya et al., 2020; Lee et al., 2019; Wang et al., 2018a;b) has shown some success, videos generated using such methods are relatively short and are unable to match the realism of actual videos. In comparison, classic non-parametric video synthesis methods from two decades ago, most notably Video Textures (Schödl et al., 2000), are much simpler and can often produce videos of arbitrary length. In these models, a new plausible video is generated by stitching together snippets of an existing video. While video textures have been very successful on simple videos with a high degree of regularity, they use simple Euclidean pixel distance as a similarity metric between frames, which causes them to fail on less constrained videos containing irregularities and chaotic movements, such as dance or playing a musical instrument. They are also sensitive to subtle changes in brightness and often produce jarring transitions.

In this work, we propose Contrastive Video Textures, a non-parametric learning-based approach for video texture synthesis that overcomes the limitations of classic video textures. As in Schödl et al. (2000), we synthesize textures by resampling frames from the input video. However, as opposed to using pixel similarity, we learn feature representations and a distance metric to compare frames by training a deep model on a single input video.
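The learned distance metric described above can be trained with a standard contrastive (InfoNCE-style) objective: pull a frame's embedding toward that of its true successor and push it away from other frames of the same video. The sketch below is a minimal NumPy illustration of such an objective; the function names, embedding shapes, and temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def infonce_logits(query, pos, negs, temperature=0.1):
    """Cosine-similarity logits for one (query, positive, negatives) tuple.

    query: (d,) embedding of the current frame segment.
    pos:   (d,) embedding of the true next segment (the positive).
    negs:  (n, d) embeddings of other segments (the negatives).
    Shapes and the temperature of 0.1 are illustrative choices.
    """
    cands = np.vstack([pos[None, :], negs])                    # (n+1, d)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return cands @ q / temperature                             # scaled cosine sims

def infonce_loss(query, pos, negs, temperature=0.1):
    """Cross-entropy over the candidates, with the positive at index 0."""
    logits = infonce_logits(query, pos, negs, temperature)
    return -logits[0] + np.log(np.exp(logits).sum())
```

Minimizing this loss makes the true successor score higher than the negatives, which is exactly the behavior a learned frame-to-frame distance metric needs.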


To synthesize the video texture, we use the video-specific model to compute probabilities of transitioning between frames of the same video. We represent the video as a graph whose nodes are the individual frames and whose edges carry the transition probabilities predicted by our video-specific model. We generate output videos (or textures) by randomly traversing edges with high transition probabilities. We additionally incorporate deep video interpolation into our contrastive video textures framework to suppress visual discontinuities and to allow for large transitions. Our proposed method synthesizes realistic, smooth, and diverse output textures across a variety of domains, including dance and music videos, as shown at this website. Fig. 1 illustrates the distinction between video generation/prediction, video textures, and our contrastive model.

We also extend our model to an audio-conditioned video synthesis task. Given a source video with associated audio and a new conditioning audio not in the source, we synthesize a new video that approximately matches the conditioning audio. A demonstration of this task is shown at this link. We modify the inference algorithm to include an additional constraint that the predicted frame's audio should match the conditioning audio. We trade off between temporal coherence (frames predicted by the contrastive video texture model) and audio similarity (frames predicted by the audio matching algorithm) to generate videos that align well with the conditioning audio and are also temporally smooth.

We assess the perceptual quality of the synthesized textures by conducting human perceptual evaluations comparing our method to a number of baselines. For unconditional video texture synthesis, we compare to the classic video texture algorithm (Schödl et al., 2000) and variations of it that we describe in Sec. 4. For the audio-conditioned setting, we compare to three baselines: classic video textures with audio conditioning, visual rhythm and beat (Davis & Agrawala, 2018), and a random baseline. Our results confirm that our method is perceptually better than all the baselines.
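The trade-off between temporal coherence and audio similarity can be pictured as blending two per-candidate score vectors before sampling the next frame. The snippet below is a hypothetical sketch using a simple linear blend; the paper's exact combination rule may differ, and `alpha`, the normalizations, and the function name are assumptions made for illustration.

```python
import numpy as np

def blended_scores(transition_probs, audio_sims, alpha=0.5):
    """Blend temporal-coherence and audio-match scores for candidate frames.

    transition_probs: (n,) texture-model scores for each candidate frame.
    audio_sims:       (n,) similarity of each candidate's audio to the
                      conditioning audio.
    alpha trades off coherence (alpha=1) against audio match (alpha=0).
    A linear blend is an illustrative choice, not the paper's exact rule.
    """
    p = transition_probs / transition_probs.sum()   # normalize each signal
    a = audio_sims / audio_sims.sum()
    s = alpha * p + (1 - alpha) * a                 # convex combination
    return s / s.sum()                              # renormalize to a distribution
```

At `alpha=1` this reduces to the unconditional texture model; at `alpha=0` the walk ignores temporal coherence and simply chases the best audio match.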

2. CONTRASTIVE VIDEO TEXTURES

We propose a non-parametric learning-based approach for video texture synthesis. At a high level, we fit an example-specific bi-gram model (i.e. a Markov chain) and use it to re-sample input frames, producing a diverse and temporally coherent video. In the following, we first define the bi-gram model, and then describe how to train and sample from it.
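Sampling from the fitted bi-gram model amounts to a random walk over the frame-transition graph. The sketch below shows one simple way to realize "randomly traversing edges with high transition probabilities": at each step, keep only the top-k outgoing edges and sample among them. The top-k rule, the renormalization, and the function signature are illustrative assumptions, not the paper's exact inference procedure.

```python
import numpy as np

def synthesize_texture(P, start, length, top_k=3, rng=None):
    """Random walk over the frame-transition graph.

    P: (n, n) matrix where P[i, j] scores playing frame j right
       after frame i (rows need not be normalized).
    At each step we restrict to the top_k highest-scoring edges
    and sample among them in proportion to their scores.
    """
    rng = np.random.default_rng(rng)
    seq = [start]
    for _ in range(length - 1):
        row = P[seq[-1]]
        top = np.argsort(row)[-top_k:]        # indices of the k best edges
        w = row[top] / row[top].sum()         # renormalize over the top-k
        seq.append(int(rng.choice(top, p=w)))
    return seq
```

Because every step stays on high-probability edges, the walk can run for an arbitrary number of steps, which is what makes the synthesized texture effectively infinite.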



Figure 1: Video Texture Synthesis. Prior video prediction (Xu et al., 2020) and generation (Vondrick et al., 2016; Lee et al., 2019; Mallya et al., 2020) methods fail to generate long sequences of high-resolution frames. Classic video textures (Schödl et al., 2000) (middle) can generate infinite sequences by resampling frames, but rely on fixed representations that are not robust across domains. Our method (right) learns a representation and a non-parametric method for infinite video generation based on resampling frames from an input video. The network is trained using contrastive learning to fit an example-specific bi-gram model (i.e. a Markov chain).

