CONTRASTIVE VIDEO TEXTURES

Abstract

Existing methods for video generation struggle to generate more than a short sequence of frames. We introduce a non-parametric approach for infinite video generation based on learning to resample frames from an input video. Our work is inspired by Video Textures, a classic method relying on pixel similarity to stitch sequences of frames, which performs well for videos with a high degree of regularity but fails in less constrained settings. Our method learns a distance metric to compare frames in a manner that scales to more challenging dynamics and allows for conditioning on heterogeneous data, such as audio. We learn representations for video frames and transition probabilities by fitting a video-specific bi-gram model trained with contrastive learning. To synthesize the texture, we represent the video as a graph whose nodes are frames and whose edges are transitions with probabilities predicted by our video-specific model. By randomly traversing edges with high transition probabilities, we generate diverse, temporally smooth videos with novel sequences and transitions. With no additional training, the model naturally extends to the task of audio-conditioned video synthesis. Our model outperforms baselines on human perceptual scores, handles a diverse range of input videos, and combines semantic and audio-visual cues to synthesize videos that synchronize well with an audio signal.

1. INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Variational Autoencoders (VAEs) (Kingma & Welling, 2013) have achieved great success in generating images "from scratch". While one might have hoped that video generation would be a simple extension of image-generation methods, this has not been the case. A major reason is that videos are much higher dimensional than images, and producing correct transitions between frames is a difficult problem. While video generation (Vondrick et al., 2016; Mallya et al., 2020; Lee et al., 2019; Wang et al., 2018a;b) has shown some success, videos generated using such methods are relatively short and are unable to match the realism of actual videos. In comparison, classic non-parametric video synthesis methods from two decades ago, most notably Video Textures (Schödl et al., 2000), are much simpler and can often produce videos of arbitrary length. In these models, a new plausible video is generated by stitching together snippets of an existing video. While video textures have been very successful on simple videos with a high degree of regularity, they use simple Euclidean pixel distance as a similarity metric between frames, which causes them to fail for less constrained videos containing irregularities and chaotic movements, such as dance or playing a musical instrument. They are also sensitive to subtle changes in brightness and often produce jarring transitions.

In this work, we propose Contrastive Video Textures, a non-parametric learning-based approach for video texture synthesis that overcomes the limitations of classic video textures. As in Schödl et al. (2000), we synthesize textures by resampling frames from the input video. However, as opposed to using pixel similarity, we learn feature representations and a distance metric to compare frames by training a deep model on a single input video.
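To make the resampling idea concrete, the core synthesis step can be sketched as a random walk over a frame-transition graph. The sketch below is only illustrative: it assumes a precomputed matrix of transition probabilities (in the actual method these are predicted by the learned video-specific model), and the function name `synthesize_texture` and the top-k sampling heuristic are hypothetical choices, not details taken from the paper.

```python
import numpy as np

def synthesize_texture(trans_probs, start=0, length=50, top_k=3, seed=0):
    """Random walk over a frame-transition graph.

    trans_probs[i, j] is the probability of jumping from frame i to
    frame j (here assumed precomputed; the paper's model predicts these).
    At each step we keep only the top-k most likely transitions and
    sample among them, trading off diversity against temporal smoothness.
    """
    rng = np.random.default_rng(seed)
    n = trans_probs.shape[0]
    seq = [start]
    for _ in range(length - 1):
        p = trans_probs[seq[-1]]
        # Zero out all but the top-k transitions, then renormalize.
        keep = np.argsort(p)[-top_k:]
        masked = np.zeros(n)
        masked[keep] = p[keep]
        masked /= masked.sum()
        seq.append(int(rng.choice(n, p=masked)))
    return seq  # indices into the input video's frames
```

Because the output is just a sequence of frame indices into the input video, the walk can be continued indefinitely, which is what makes arbitrary-length synthesis possible in this non-parametric setting.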

