TOWARDS SMOOTH VIDEO COMPOSITION

Abstract

Video generation, which aims to produce a sequence of frames, requires synthesizing consistent and persistent dynamic content over time. This work investigates how to model the temporal relations for composing a video with an arbitrary number of frames, from a few to even infinitely many, using generative adversarial networks (GANs). First, towards composing adjacent frames, we show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without harming the per-frame quality. Second, by incorporating a temporal shift module (TSM), originally designed for video understanding, into the discriminator, we manage to advance the generator in synthesizing more reasonable dynamics. Third, we develop a novel B-Spline based motion representation to ensure temporal smoothness, and hence achieve infinite-length video generation, going beyond the number of frames used in training. We evaluate our approach on a range of datasets and show substantial improvements over baselines on video generation. Code and models are publicly available at https://genforce.github.io/StyleSV.

1. INTRODUCTION

Synthesizing images using a generative adversarial network (GAN) (Goodfellow et al., 2014; Radford et al., 2016; Karras et al., 2019; 2020b; 2021; 2018) usually requires composing diverse visual concepts, with fine details of a single object and a plausible spatial arrangement of different objects. Recent advances in GANs have enabled many appealing applications such as customized editing (Goetschalckx et al., 2019; Shen et al., 2020; Jahanian et al., 2020; Yang et al., 2021) and animation (Qiu et al., 2022; Alaluf et al., 2022). However, employing GANs for video generation remains challenging considering the additional requirement on the temporal dimension. In fact, a video is not simply a stack of images. Instead, the contents in video frames should have a smooth transition over time, and the video may last arbitrarily long. Thus, compared to image synthesis, the crux of video synthesis lies in modeling the temporal relations across frames.

We argue that the temporal relations fall into three regimes regarding the time scale. First, when looking at a transient dynamic, we focus more on the subtle change between neighboring frames and expect decent local motions, such as facial muscle movement and cloud drifting. As the duration grows longer, say to a segment, more contents within the frame may vary. In such a case, learning a consistent global motion is vital. For example, in a video of first-person driving, trees and buildings along the street should move backward together as the car runs forward. Finally, for extremely long videos, the objects inside are not immutable. It therefore requires the motion to be generalizable along the time axis in a continuous and rational sense. This work targets smooth video composition through modeling multi-scale temporal relations with GANs.
First, we confirm that, as in image synthesis, the texture sticking problem (i.e., some visual concepts are bound to their coordinates) also exists in video generation, interrupting the smooth flow of frame contents. To tackle this obstacle, we borrow the alias-free technique (Karras et al., 2021) from single image generation and preserve the frame quality via appropriate pre-training. Then, to assist the generator in producing reasonable dynamics, we introduce a temporal shift module (TSM) (Lin et al., 2019) into the discriminator as an inductive bias. In this way, the discriminator can capture more information from the temporal perspective for real/fake classification, providing better guidance to the generator. Furthermore, we observe that the motion representation in previous work (Skorokhodov et al., 2022) suffers from undesired content jittering (see Sec. 2.4 for details) in super-long video generation. We identify the cause of this phenomenon as the first-order discontinuity that arises when interpolating motion embeddings. To address this problem, we design a novel B-Spline based motion representation that generalizes soundly and continuously over time. A low-rank strategy is further proposed to alleviate the issue that frame contents may repeat cyclically. We evaluate our approach on various video generation benchmarks, including the YouTube Driving dataset (Zhang et al., 2022), SkyTimelapse (Xiong et al., 2018), and Taichi-HD (Siarohin et al., 2019b), and observe consistent and substantial improvements over existing alternatives. Given its simplicity and efficiency, our approach sets up a simple yet strong baseline for the task of video generation.
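As background on the inductive bias mentioned above: a temporal shift module exchanges a fraction of channels between neighboring frames, so that ordinary 2D convolutions applied afterwards can mix temporal information at no extra parameter cost. The following is a minimal PyTorch sketch following the common TSM recipe (the function name and the `shift_div=8` default are illustrative choices, not taken from this paper's code):

```python
import torch

def temporal_shift(x: torch.Tensor, n_frames: int, shift_div: int = 8) -> torch.Tensor:
    """Shift 1/shift_div of the channels forward in time, another
    1/shift_div backward, and leave the remaining channels untouched.

    x: feature map of shape (B * T, C, H, W), frames stacked along batch.
    """
    bt, c, h, w = x.shape
    b = bt // n_frames
    x = x.view(b, n_frames, c, h, w)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out.view(bt, c, h, w)
```

Because the shift is a pure memory operation, inserting it before convolutions in the discriminator adds temporal awareness with zero learnable parameters.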

2. METHOD

We introduce the improvements made on the prior art StyleGAN-V (Skorokhodov et al., 2022) to set a new baseline for video generation. We first introduce the default configuration (Config-A) in Sec. 2.1 and then conduct a comprehensive overhaul of it. Concretely, in Sec. 2.2 we confirm that the alias-free technique (Config-B) from single image generation, together with adequately pre-learned knowledge (Config-C), results in a smooth transition between adjacent frames. Sec. 2.3 shows that when temporal information is explicitly modeled in the discriminator through the temporal shift module (Lin et al., 2019) (Config-D), the generator can produce significantly better dynamic content across frames. Although prior arts can already generate arbitrarily long videos, cyclic jittering is observed as time goes by. In Sec. 2.4, we therefore propose a B-Spline based motion representation (Config-E) to ensure continuity, which together with a low-rank temporal modulation (Config-F) produces much more realistic and natural long videos.

2.1 PRELIMINARY

StyleGAN-V (Skorokhodov et al., 2022) introduces continuous motion representations and a holistic discriminator for video generation. Specifically, a continuous frame I_t can be obtained by feeding a continuous timestamp t into a generator G(·):

I_t = G(u, v_t),

where u and v_t denote the content code and the continuous motion representation, respectively. The content code u is sampled from a standard Gaussian distribution, while the motion representation v_t consists of two embeddings: a time positional embedding v_t^pe and an interpolated motion embedding v_t^me. To obtain the time positional embedding v_t^pe, we first randomly sample N_A codes as a set of discrete time anchors A_i, i ∈ [0, · · · , N_A - 1], that share an equal interval (256 frames in practice). A convolution with a 1D kernel is then applied to the anchors A_i for temporal modeling, producing the corresponding features a_i with timestamps t_i.
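The anchor construction described above can be sketched as follows. This is a hypothetical simplification: the class name, embedding width, and kernel size are illustrative and not taken from the StyleGAN-V code.

```python
import torch
import torch.nn as nn

class MotionAnchors(nn.Module):
    """Sample equally spaced time anchors A_i and run a 1D convolution
    over the anchor sequence for temporal modeling, yielding features a_i
    with timestamps t_i. Dimensions here are illustrative only."""

    def __init__(self, dim: int = 512, anchor_interval: int = 256, kernel_size: int = 3):
        super().__init__()
        self.dim = dim
        self.anchor_interval = anchor_interval
        # 1D convolution along the anchor (time) axis
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, n_anchors: int):
        # A_i ~ N(0, I), i = 0 .. N_A - 1, with timestamps t_i = i * interval
        anchors = torch.randn(1, self.dim, n_anchors)           # (1, C, N_A)
        feats = self.conv(anchors).squeeze(0).t()               # (N_A, C) features a_i
        times = torch.arange(n_anchors, dtype=torch.float32) * self.anchor_interval
        return feats, times
```

Sharing one convolution across all anchors keeps the representation usable for any number of anchors, which matters once videos extend beyond the training length.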
For an arbitrary continuous t, its corresponding interval is first located by finding the nearest left and right anchor features a_l and a_{l+1}, at times t_l and t_{l+1}, such that t_l ≤ t < t_{l+1}, l ∈ [0, · · · , N_A - 2]. For the time positional embedding v_t^pe,
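To make the lookup-and-blend step above concrete, here is a minimal sketch that locates the interval [t_l, t_{l+1}) and blends the two anchor features linearly. The linear blend is an assumption for illustration; it is precisely this kind of piecewise-linear interpolation, with its first-order discontinuity at the anchors, that motivates the B-Spline representation introduced later in Sec. 2.4.

```python
import torch

def interpolate_motion(feats: torch.Tensor, times: torch.Tensor, t: float) -> torch.Tensor:
    """feats: (N_A, C) anchor features a_i; times: (N_A,) sorted,
    equally spaced timestamps t_i. Returns the feature at continuous t."""
    # Find l such that t_l <= t < t_{l+1}
    l = int(torch.searchsorted(times, torch.tensor(float(t)), right=True).item()) - 1
    l = max(0, min(l, times.numel() - 2))
    t_l, t_r = times[l].item(), times[l + 1].item()
    w = (t - t_l) / (t_r - t_l)           # linear weight within the interval
    return (1.0 - w) * feats[l] + w * feats[l + 1]
```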

