A GOOD IMAGE GENERATOR IS WHAT YOU NEED FOR HIGH-RESOLUTION VIDEO SYNTHESIS

Abstract

Image and video synthesis are closely related areas, both aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, it is also an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our method over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.

¹ We estimate that the cost of training a model such as DVD-GAN (Clark et al., 2019) once exceeds $30K.

1. INTRODUCTION

Video synthesis seeks to generate a sequence of moving pictures from noise. While its closely related counterpart, image synthesis, has seen substantial advances in recent years, allowing for synthesis at high resolutions (Karras et al., 2017), rendering images often indistinguishable from real ones (Karras et al., 2019), and supporting multiple classes of image content (Zhang et al., 2019), contemporary improvements in the domain of video synthesis have been comparatively modest. Due to the statistical complexity of videos and larger model sizes, video synthesis produces relatively low-resolution videos, yet requires longer training times. For example, scaling the image generator of Brock et al. (2019) to generate 256 × 256 videos requires a substantial computational budget¹. Can we use a similar method to attain higher resolutions? We believe a different approach is needed.

There are two desired properties for generated videos: (i) each individual frame should be of high quality, and (ii) the frame sequence should be temporally consistent, i.e. depict the same content with plausible motion. Previous works (Tulyakov et al., 2018; Clark et al., 2019) attempt to achieve both goals with a single framework, making such methods computationally demanding when high resolution is desired. We suggest a different perspective on this problem. We hypothesize that, given an image generator that has learned the distribution of video frames as independent images, a video can be represented as a sequence of latent codes from this generator. The problem of video synthesis can then be framed as discovering a latent trajectory that renders temporally consistent images. Hence, we demonstrate that (i) can be addressed by a pre-trained and fixed image generator, while (ii) can be achieved using the proposed framework to create appropriate image sequences.
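As a toy illustration of this framing, the sketch below decodes a latent trajectory with a frozen generator to produce a "video". The generator here is a hypothetical NumPy stand-in, not the paper's actual model; any real pre-trained generator would take its place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen stand-in for a pre-trained image generator G: R^8 -> 16x16 frame.
# A real generator (e.g. StyleGAN2) would slot in here unchanged.
W_fixed = rng.standard_normal((16 * 16, 8))

def generator(z):
    """Render one 16x16 'frame' from a latent code (toy mapping)."""
    return np.tanh(W_fixed @ z).reshape(16, 16)

# A video is then a trajectory z_1, ..., z_n in the generator's latent
# space: small latent steps yield temporally nearby frames.
n_frames, d = 8, 8
steps = 0.1 * rng.standard_normal((n_frames, d))
trajectory = np.cumsum(steps, axis=0)
video = np.stack([generator(z) for z in trajectory])
print(video.shape)  # (8, 16, 16): one frame per latent code
```

Note that only the trajectory varies; the generator itself is never updated, which is what makes property (i) come for free.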
To discover the appropriate latent trajectory, we introduce a motion generator, implemented via two recurrent neural networks, that operates on the initial content code to obtain the motion representation. We model motion as a residual between consecutive latent codes, which are passed to the image generator for individual frame generation. Such a residual representation also facilitates disentangling motion from content. The motion generator is trained using the chosen image discriminator with a contrastive loss, which forces the content to be temporally consistent, and a patch-based multi-scale video discriminator, which learns motion patterns. Our framework supports contemporary image generators such as StyleGAN2 (Karras et al., 2019) and BigGAN (Brock et al., 2019).

We name our approach MoCoGAN-HD (Motion and Content decomposed GAN for High-Definition video synthesis), as it features several major advantages over traditional video synthesis pipelines. First, it transcends the limited resolutions of existing techniques, allowing for the generation of high-quality videos at resolutions up to 1024 × 1024. Second, as we search for a latent trajectory in an image generator, our method is computationally more efficient, requiring an order of magnitude less training time than previous video-based works (Clark et al., 2019). Third, as the image generator is fixed, it can be trained on a separate high-quality image dataset. Due to the disentangled representation of motion and content, our approach can learn motion from a video dataset and apply it to an image dataset, even when the two datasets belong to different domains. It thus unleashes the power of an image generator to synthesize high-quality videos when a domain (e.g., dogs) contains many high-quality images but no corresponding high-quality videos (see Fig. 4).
In this manner, our method can generate realistic videos of objects it has never seen moving during training (such as realistic pet-face videos driven by motion extracted from videos of talking people). We refer to this new video generation task as cross-domain video synthesis. Finally, we quantitatively and qualitatively evaluate our approach, attaining state-of-the-art performance on each benchmark and establishing a challenging new baseline for video synthesis methods.
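The residual motion representation can be sketched as follows. This is a minimal toy recurrence standing in for the paper's two LSTMs; the weights and the scaling factor are illustrative, not trained.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # latent dimension

# Toy recurrent cell in place of the LSTM-based motion generator;
# weights are random for illustration only.
Wh = 0.1 * rng.standard_normal((d, d))
Wz = 0.1 * rng.standard_normal((d, d))
We = 0.1 * rng.standard_normal((d, d))

def motion_step(h, z, eps):
    """Update the hidden state from (previous state, current code, noise)
    and emit a motion residual (here simply the new hidden state)."""
    h_new = np.tanh(Wh @ h + Wz @ z + We @ eps)
    return h_new, h_new

# Motion is modeled as a residual between consecutive latent codes:
# z_{t+1} = z_t + lam * r_t, keeping content and motion disentangled.
lam = 0.1
z1 = rng.standard_normal(d)      # initial content code
h = np.zeros(d)
trajectory = [z1]
for _ in range(7):               # 7 further codes -> an 8-frame trajectory
    h, r = motion_step(h, trajectory[-1], rng.standard_normal(d))
    trajectory.append(trajectory[-1] + lam * r)
trajectory = np.stack(trajectory)
print(trajectory.shape)  # (8, 8)
```

Because the content code z_1 fixes the appearance and only small residuals are added, swapping z_1 changes what the video depicts while the residual sequence can be reused as the motion.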

2. RELATED WORK

Video Synthesis. Approaches to image generation and translation using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have demonstrated the ability to synthesize high-quality images (Radford et al., 2016; Zhang et al., 2019; Brock et al., 2019; Donahue & Simonyan, 2019; Jin et al., 2021). Built upon image translation (Isola et al., 2017; Wang et al., 2018b), works on video-to-video translation (Bansal et al., 2018; Wang et al., 2018a) are capable of converting an input video to a high-resolution output in another domain. However, the task of high-fidelity video generation in the unconditional setting is still a difficult and unresolved problem. Without strong conditional inputs such as the segmentation masks (Wang et al., 2019) or human poses (Chan et al., 2019; Ren et al., 2020) employed by video-to-video translation works, generating videos that follow the distribution of training video samples is challenging. Earlier works on GAN-based video modeling, including MDPGAN (Yushchenko et al., 2019), VGAN (Vondrick et al., 2016), TGAN (Saito et al., 2017), MoCoGAN (Tulyakov et al., 2018), ProgressiveVGAN (Acharya et al., 2018), and TGANv2 (Saito et al., 2020), show promising results on low-resolution datasets. Recent efforts demonstrate the capacity to generate more realistic videos, but at significantly higher computational cost (Clark et al., 2019; Weissenborn et al., 2020). In this paper, we focus on generating realistic videos using manageable computational resources. LDVDGAN (Kahembwe & Ramamoorthy, 2020) uses low-dimensional discriminators to reduce model size and can generate videos at resolutions up to 512 × 512, whereas we decrease training cost by utilizing a pre-trained image generator: high-quality generation is handled by the pre-trained image generator, while the motion trajectory is modeled within its latent space.
Additionally, learning motion in the latent space allows us to easily adapt the video generation model to the task of video prediction (Denton et al., 2017), in which the starting frame is given (Denton & Fergus, 2018; Zhao et al., 2018; Walker et al., 2017; Villegas et al., 2017b;a; Babaeizadeh et al., 2017; Hsieh et al., 2018; Byeon et al., 2018), by inverting the initial frame through the generator (Abdal et al., 2020) instead of training an extra image encoder (Tulyakov et al., 2018; Zhang et al., 2020).

Interpretable Latent Directions. The latent space of GANs is known to consist of semantically meaningful vectors for image manipulation. Both supervised methods, using either human annotations or pre-trained image classifiers (Goetschalckx et al., 2019; Shen et al., 2020), and unsupervised methods (Jahanian et al., 2020; Plumerault et al., 2020) are able to find interpretable directions for image editing, such as directions for image rotation or background removal (Voynov & Babenko, 2020; Shen & Zhou, 2020). We further consider motion vectors in the latent space. By disentangling the motion trajectories in an unsupervised fashion, we are able to transfer motion information from a video dataset to an image dataset in which no temporal information is available.

[Figure 1: Framework overview. The motion generator (LSTM_enc and LSTM_dec, driven by noise inputs ε_2, ..., ε_n) maps the initial content code z_1 to a motion trajectory z_1, ..., z_n in the latent space of the pre-trained image generator G_I, which renders the frames x_1, ..., x_n of the generated video; a video discriminator D_V distinguishes generated videos from real ones.]

Contrastive Representation Learning is widely studied in unsupervised learning tasks (He et al., 2020; Chen et al., 2020a;b; Hénaff et al., 2020; Löwe et al., 2019; Oord et al., 2018; Misra & Maaten, 2020). Related inputs, such as images (Wu et al., 2018) or latent representations (Hjelm et al., 2019), which can vary during training due to data augmentation, are forced to be close by minimizing differences in their representations.
Recent work (Park et al., 2020) applies noise-contrastive estimation (Gutmann & Hyvärinen, 2010) to image generation tasks by learning the correspondence between image patches, achieving performance superior to that attained with cycle-consistency constraints (Zhu et al., 2017; Yi et al., 2017). In contrast, we learn an image discriminator that enforces coherent content across frames by combining a contrastive loss (Hadsell et al., 2006) with an adversarial loss (Goodfellow et al., 2014).

3. METHOD

In this section, we introduce our method for high-resolution video generation. Our framework is built on top of a pre-trained image generator (Karras et al., 2020a;b; Zhao et al., 2020a;b), which helps to generate high-quality image frames and boosts training efficiency under manageable computational resources. In addition, with the image generator fixed during training, we can disentangle video motion from image content, enabling video synthesis even when the image content and the video motion come from different domains. More specifically, our inference framework includes a motion generator G_M and an image generator G_I. G_M is implemented with two LSTM networks (Hochreiter & Schmidhuber, 1997) and predicts the latent motion trajectory Z = {z_1, z_2, ..., z_n}, where n is the number of frames in the synthesized video. The image generator G_I then synthesizes each individual frame from the motion trajectory. The generated video sequence is ṽ = {x̃_1, x̃_2, ..., x̃_n}, where each synthesized frame is x̃_t = G_I(z_t) for t = 1, 2, ..., n. We also define a real video clip as v = {x_1, x_2, ..., x_n} and the training video distribution as p_v. To train the motion generator G_M to discover the desired motion trajectory, we apply a video discriminator that constrains the generated motion patterns to be similar to those of the training videos, and an image discriminator that forces the frame content to be temporally consistent. Our framework is illustrated in Fig. 1. We describe each component in more detail in the following sections.
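The two-stage inference pipeline above can be sketched with toy stand-ins for G_M and G_I (the real models are LSTMs and a pre-trained StyleGAN2/BigGAN; `motion_generator`, `image_generator`, and all dimensions here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, N_FRAMES = 8, 4  # illustrative sizes

def motion_generator(z1, n):
    """Toy stand-in for G_M: returns a latent trajectory Z = {z_1, ..., z_n}.
    Each step adds a small residual; the real G_M predicts it with LSTMs."""
    traj = [z1]
    for _ in range(n - 1):
        traj.append(traj[-1] + 0.1 * rng.standard_normal(LATENT_DIM))
    return traj

def image_generator(z):
    """Toy stand-in for the fixed, pre-trained G_I: maps a latent code to a
    'frame'; deterministic, so the same z always yields the same frame."""
    return np.tanh(np.outer(z, z))

z1 = rng.standard_normal(LATENT_DIM)
video = [image_generator(z) for z in motion_generator(z1, N_FRAMES)]
print(len(video), video[0].shape)
```

The key property the sketch preserves is that only the trajectory carries temporal information: the frame renderer is frozen and stateless.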

3.1. MOTION GENERATOR

The motion generator G_M predicts consecutive latent codes from an input code z_1 ∈ Z, where the latent space Z is shared with the image generator. For BigGAN (Brock et al., 2019), we sample z_1 from the normal distribution p_z. For StyleGAN2 (Karras et al., 2020b), p_z is the distribution after the multi-layer perceptron (MLP), as latent codes from this distribution can be semantically disentangled better than those from the normal distribution (Shen et al., 2020; Zhu et al., 2020). Formally, G_M includes an LSTM encoder LSTM_enc, which encodes z_1 to obtain the initial hidden state, and an LSTM decoder LSTM_dec, which estimates n − 1 continuous hidden states recursively:

h_1, c_1 = LSTM_enc(z_1),   h_t, c_t = LSTM_dec(ε_t, (h_{t−1}, c_{t−1})),   t = 2, 3, ..., n,   (1)

where h and c denote the hidden state and cell state respectively, and ε_t is a noise vector sampled from the normal distribution to model motion diversity at timestamp t.

Motion Disentanglement. Prior work (Tulyakov et al., 2018) uses h_t as the motion code for the frame to be generated, while the content code is fixed for all frames. However, such a design requires the recurrent network to estimate the motion while preserving consistent content from the latent vector, which is difficult to learn in practice. Instead, we propose to use a sequence of motion residuals for estimating the motion trajectory. Specifically, we model the motion residual as a linear combination of a set of interpretable directions in the latent space (Shen & Zhou, 2020; Härkönen et al., 2020). We first conduct principal component analysis (PCA) on m randomly sampled latent vectors from Z to obtain the basis V. Then, we estimate the motion direction from the previous frame z_{t−1} to the current frame z_t using h_t and V as follows:

z_t = z_{t−1} + λ · h_t · V,   t = 2, 3, ..., n,   (2)

where the hidden state h_t ∈ [−1, 1], and λ controls the step given by the residual. With Eqn. 1 and Eqn.
2, we have G_M(z_1) = {z_1, z_2, ..., z_n}, and the generated video is given by ṽ = G_I(G_M(z_1)).

Motion Diversity. In Eqn. 1, we introduce a noise vector ε_t to control the diversity of motion. However, we observe that the LSTM decoder tends to neglect ε_t, resulting in motion mode collapse: G_M fails to capture the diverse motion patterns of the training videos and, from one initial latent code, generates videos with similar motion regardless of the sequence of noise vectors. To alleviate this issue, we introduce a mutual information loss L_m to maximize the mutual information between the hidden vector h_t and the noise vector ε_t. With sim(u, v) = uᵀv / (‖u‖‖v‖) denoting the cosine similarity between vectors u and v, we define L_m as follows:

L_m = 1/(n−1) Σ_{t=2}^{n} sim(H(h_t), ε_t),   (3)

where H is a 2-layer MLP that serves as a mapping function.

Learning. To learn the appropriate parameters for the motion generator G_M, we apply a multi-scale video discriminator D_V to tell whether a video sequence is real or synthesized. The discriminator is based on the architecture of PatchGAN (Isola et al., 2017). However, we use 3D convolutional layers in D_V, as they model temporal dynamics better than 2D convolutional layers. We divide the input video sequence into small 3D patches, classify each patch as real or fake, and average the local responses to produce the final output. Additionally, each frame in the input video sequence is conditioned on the first frame, which falls into the distribution of the pre-trained image generator, for more stable training. We thus optimize the following adversarial loss to learn G_M and D_V:

L_{D_V} = E_{v∼p_v}[log D_V(v)] + E_{z_1∼p_z}[log(1 − D_V(G_I(G_M(z_1))))].   (4)
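The two building blocks of Sec. 3.1 — the PCA-residual step (Eqn. 2) and the mutual information loss (Eqn. 3) — can be sketched in NumPy as follows. All sizes, the toy mapping head `H`, and the stand-in hidden states are illustrative assumptions; the paper's H is a 2-layer MLP and h_t comes from the LSTM decoder:

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM, N_COMPONENTS, M_SAMPLES = 16, 4, 512
LAMBDA = 0.5  # step size lambda (the paper uses 0.5, or 0.2 for cross-domain)

# Motion residual (Eqn. 2): PCA basis V from m randomly sampled latent vectors.
Z = rng.standard_normal((M_SAMPLES, LATENT_DIM))
_, _, Vt = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
V = Vt[:N_COMPONENTS]  # top principal directions, one per row

def motion_step(z_prev, h_t, lam=LAMBDA):
    """z_t = z_{t-1} + lam * h_t @ V, with h_t in [-1, 1]^N_COMPONENTS."""
    return z_prev + lam * (h_t @ V)

# Mutual information loss (Eqn. 3): cosine similarity between H(h_t) and eps_t.
def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mutual_info_loss(hidden_states, noises, H):
    """L_m = 1/(n-1) * sum_{t=2}^{n} sim(H(h_t), eps_t)."""
    sims = [cosine_sim(H(h), e) for h, e in zip(hidden_states, noises)]
    return sum(sims) / len(sims)

z1 = rng.standard_normal(LATENT_DIM)
h2 = np.tanh(rng.standard_normal(N_COMPONENTS))  # stand-in LSTM hidden state
z2 = motion_step(z1, h2)

W = rng.standard_normal((LATENT_DIM, N_COMPONENTS))
H = lambda h: np.tanh(W @ h)  # toy mapping head; the paper uses a 2-layer MLP
hs = [np.tanh(rng.standard_normal(N_COMPONENTS)) for _ in range(3)]  # h_2..h_4
eps = [rng.standard_normal(LATENT_DIM) for _ in range(3)]            # eps_2..eps_4
print(z2.shape, round(mutual_info_loss(hs, eps, H), 3))
```

Note that the residual z_t − z_{t−1} lies, by construction, in the span of the top PCA directions, which is what restricts the trajectory to semantically meaningful changes.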

3.2. CONTRASTIVE IMAGE DISCRIMINATOR

As our image generator is pre-trained, we may use an image generator trained on a given domain, e.g. images of animal faces (Choi et al., 2020), and learn the motion generator parameters using videos from a different domain, such as videos of human facial expressions (Nagrani et al., 2017). With Eqn. 4 alone, however, we lack the ability to explicitly constrain the generated images x̃_t|t>1 to possess similar quality and content as the first image x̃_1, which is sampled from the image space of the image generator and thus has high fidelity. Hence, we introduce a contrastive image discriminator D_I, illustrated in Fig. 1, to match both image quality and content between x̃_1 and x̃_t|t>1.

Quality Matching. To increase perceptual quality, we train D_I and G_M adversarially, using x̃_1 as the real sample and x̃_t|t>1 as the fake samples:

L_{D_I} = E_{z_1∼p_z}[log D_I(G_I(z_1))] + E_{z_1∼p_z, z_t∼G_M(z_1)|t>1}[log(1 − D_I(G_I(z_t)))].   (5)

Content Matching. To learn content similarity between frames within a video, we use the image discriminator as a feature extractor and train it with a form of contrastive loss known as InfoNCE (Oord et al., 2018). The goal is that pairs of images with the same content should be close together in embedding space, while images containing different content should be far apart. Given a minibatch of N generated videos {ṽ^(1), ṽ^(2), ..., ṽ^(N)}, we randomly sample one frame t from each video, {x̃_t^(1), x̃_t^(2), ..., x̃_t^(N)}, and apply two augmentations a and b to each sampled frame. The two augmented views of the same frame, (x̃_t^(i,a), x̃_t^(i,b)), form a positive pair, while views from different videos, (x̃_t^(i,·), x̃_t^(j,·)), are negative pairs for i ≠ j. Let F be an encoder network, which shares the weights and architecture of D_I but excludes its last layer and adds a 2-layer MLP as a projection head that produces the representation of the input images.
We have a contrastive loss function L_contr, which is the cross-entropy computed across the 2N augmented views:

L_contr = − Σ_{i=1}^{N} Σ_{α∈{a,b}} log [ exp(sim(F(x̃_t^(i,a)), F(x̃_t^(i,b))) / τ) / Σ_{j=1}^{N} Σ_{β∈{a,b}} 1_{[j≠i]} exp(sim(F(x̃_t^(i,α)), F(x̃_t^(j,β))) / τ) ],   (6)

where sim(·, ·) is the cosine similarity function defined in Eqn. 3, 1_{[j≠i]} ∈ {0, 1} is an indicator equal to 1 iff j ≠ i, and τ is a temperature parameter empirically set to 0.07. We use a momentum mechanism similar to that of MoCo (He et al., 2020), maintaining a memory bank in which the oldest negative examples are replaced by new ones. We apply augmentations including translation, color jittering, and cutout (DeVries & Taylor, 2017) to the synthesized images. With the positive and negative pairs generated on-the-fly during training, the discriminator can effectively focus on the content of the input samples. The choice of positive pairs in Eqn. 6 is specifically designed for cross-domain video synthesis, in which videos with arbitrary content from the image domain are not available. When images and videos come from the same domain, positive and negative pairs are easier to obtain: we randomly select and augment two frames from a real video to create positive pairs sharing the same content, while negative pairs contain augmented images from different real videos. Aside from L_contr, we also adopt the feature matching loss (Wang et al., 2018b) L_f between the generated first frame and subsequent frames, replacing the L_1 regularization with cosine similarity.

Full Objective. The overall loss for training the motion generator, video discriminator, and image discriminator is defined as:

min_{G_M} (max_{D_V} L_{D_V} + max_{D_I} L_{D_I}) + max_{G_M} (λ_m L_m + λ_f L_f) + min_{D_I} (λ_contr L_contr),   (7)

where λ_m, λ_contr, and λ_f are hyperparameters that balance the losses.
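A minimal NumPy sketch of the InfoNCE objective in Eqn. 6, simplified so that each video contributes one positive pair and all cross-video views act as negatives (the encoder F, the memory bank, and the momentum discriminator are omitted; `info_nce` and the toy features are assumptions for illustration only):

```python
import numpy as np

def info_nce(feats_a, feats_b, tau=0.07):
    """Simplified InfoNCE over N videos: feats_a[i] and feats_b[i] are the
    features of two augmented views of a frame from video i. Pair (i, i) is
    positive; pairs (i, j) with j != i are negatives."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = (a @ b.T) / tau                     # pairwise cosine / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())      # cross-entropy, label i == i

rng = np.random.default_rng(3)
fa = rng.standard_normal((4, 16))
fb = fa + 0.01 * rng.standard_normal((4, 16))  # second view, same content
loss_aligned = info_nce(fa, fb)
loss_random = info_nce(fa, rng.standard_normal((4, 16)))
print(round(loss_aligned, 4), round(loss_random, 4))
```

When the two views share content, the loss approaches zero; with unrelated features it is typically much larger, which is the gradient signal that pulls same-content frames together in embedding space.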

4. EXPERIMENTS

In this section, we evaluate the proposed approach on several benchmark datasets for video generation. We also demonstrate cross-domain video synthesis for various image and video datasets.

4.1. VIDEO GENERATION

We conduct experiments on three datasets, UCF-101 (Soomro et al., 2012), FaceForensics (Rössler et al., 2018), and Sky Time-lapse (Xiong et al., 2018), for unconditional video synthesis. We use StyleGAN2 as the image generator. Training details can be found in Appx. B. We report the Inception Score (IS) and Fréchet Video Distance (FVD), computing IS with a C3D model (Tran et al., 2015) that is trained on the Sports-1M dataset (Karpathy et al., 2014) and fine-tuned on UCF-101, the same model used in previous works (Saito et al., 2020; Clark et al., 2019). The quantitative results are shown in Tab. 1. Our method achieves state-of-the-art results for both IS and FVD, outperforming existing works by a large margin. Interestingly, this result indicates that a well-trained image generator has learned to represent rich motion patterns, and can therefore be used to synthesize high-quality videos when paired with a well-trained motion generator. FaceForensics is a dataset containing news videos featuring various reporters. We use all the images from the 704 training videos, at a resolution of 256 × 256, to learn the image generator, and sequences of 16 consecutive frames to train the motion generator. Note that our network can generate even longer continuous sequences, e.g. 64 frames (Fig. 12 in Appx.), though only 16 frames are used for training. We show the FVD between generated and real video clips (16 frames in length) for different methods in Tab. 2. Additionally, we use the Average Content Distance (ACD) from MoCoGAN (Tulyakov et al., 2018) to evaluate identity consistency in these human face videos. We calculate ACD values over 256 videos. We also report both metrics for ground truth (GT) videos; to obtain the FVD of GT videos, we randomly sample two groups of real videos and compute the score between them. Our method achieves better results than TGANv2 (Saito et al., 2020). Both methods have low FVD values and can generate complex motion patterns close to the real data.
However, the much lower ACD value of our approach, which is close to that of GT, demonstrates that the videos it synthesizes have much better identity consistency than those from TGANv2. Qualitative examples in Fig. 2 illustrate different motion patterns learned from the dataset. Furthermore, we perform perceptual experiments on Amazon Mechanical Turk (AMT) by presenting users with a pair of videos from the two methods and asking them to select the more realistic one. Results in Tab. 2 indicate that our method outperforms TGANv2 in 73.6% of the comparisons. Sky Time-Lapse is a video dataset consisting of dynamic sky scenes, such as moving clouds. The number of video clips for training and testing is 35,392 and 2,815, respectively. We resize images to 128 × 128 and train the model to generate 16 frames. We compare our method with two recent approaches, MDGAN (Xiong et al., 2018) and DTVNet (Zhang et al., 2020), which are specifically designed for this dataset. In Tab. 3, we report the FVD for all three methods. It is clear that our approach significantly outperforms the others. Example sequences are shown in Fig. 3. Following DTVNet (Zhang et al., 2020), we also evaluate the proposed model on the task of video prediction. We use the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) (Wang et al., 2004) as evaluation metrics to measure frame quality at the pixel level and the structural similarity between synthesized and real video frames. Evaluation is performed on the testing set. We select the first frame x_1 from each video clip and project it into the latent space of the image generator (Abdal et al., 2020) to get ẑ_1. We use ẑ_1 as the starting latent code for the motion generator to obtain 16 latent codes, and interpolate them to get 32 latent codes for synthesizing a video sequence, where the first frame is given by G_I(ẑ_1). For a fair comparison, we also use G_I(ẑ_1) as the starting frame for MDGAN and DTVNet when computing the metrics against ground-truth videos. In addition, we calculate the PSNR and SSIM between x_1 and G_I(ẑ_1) as an upper bound for all methods, denoted Up-B. Tab. 3 shows the video prediction results, which demonstrate that our method's performance is superior to that of MDGAN and DTVNet. Interestingly, by simply interpolating the motion trajectory, we can easily generate longer video sequences, e.g. from 16 to 32 frames, while retaining high quality.
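The trajectory interpolation used above (16 latent codes extended to 32) can be sketched as per-dimension linear interpolation. The paper does not specify the interpolation scheme, so linear is an assumption, and `interpolate_trajectory` is a hypothetical helper:

```python
import numpy as np

def interpolate_trajectory(codes, factor=2):
    """Linearly interpolate a latent trajectory to factor-times as many codes
    (e.g. 16 -> 32), so G_I can render a longer, smoother video."""
    codes = np.asarray(codes)                    # (n, latent_dim)
    n = len(codes)
    t_old = np.arange(n)
    t_new = np.linspace(0, n - 1, factor * n)    # denser time grid
    out = np.empty((len(t_new), codes.shape[1]))
    for d in range(codes.shape[1]):              # interpolate each dimension
        out[:, d] = np.interp(t_new, t_old, codes[:, d])
    return out

traj16 = np.random.default_rng(4).standard_normal((16, 8))
traj32 = interpolate_trajectory(traj16)
print(traj32.shape)
```

The endpoints are preserved, so the first generated frame G_I(ẑ_1) is unchanged by the interpolation.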

4.2. CROSS-DOMAIN VIDEO GENERATION

To demonstrate how our approach can disentangle motion from image content and transfer motion patterns from one domain to another, we perform several experiments on various datasets. More specifically, we use StyleGAN2 models pre-trained on the FFHQ (Karras et al., 2019), AFHQ-Dog (Choi et al., 2020), AnimeFaces (Branwen, 2019), and LSUN-Church (Yu et al., 2015) datasets as the image generators. We learn human facial motion from VoxCeleb (Nagrani et al., 2020) and time-lapse transitions in outdoor scenes from TLVDB (Shih et al., 2013). In these experiments, a pair such as (FFHQ, VoxCeleb) indicates that we synthesize videos with image content from FFHQ and motion patterns from VoxCeleb. We generate videos at resolutions up to 1024 × 1024 (per-dataset resolutions are listed with Fig. 4). We also demonstrate how motion and content are disentangled in Fig. 5 and Fig. 6, which portray generated videos with the same identity performing diverse motion patterns, and the same motion applied to different identities, respectively. We show results from (AFHQ-Dog, VoxCeleb) (first two rows) and (AnimeFaces, VoxCeleb) (last two rows) in these two figures.

4.3. ABLATION ANALYSIS

We first report IS and FVD in Tab. 4 for UCF-101 using the following methods: w/o Eqn. 2 uses z_t = h_t instead of estimating the residual as in Eqn. 2; w/o D_I omits the contrastive image discriminator D_I and uses only the video discriminator D_V for learning the motion generator; w/o D_V omits D_V during training; and Full-128 and Full-256 indicate that we generate videos with our full method at resolutions of 128 × 128 and 256 × 256, respectively. We resize frames from all methods to 128 × 128 when calculating IS and FVD. The full method outperforms all others, demonstrating the importance of each module for learning temporally consistent and high-quality videos. We perform further analysis of cross-domain video generation on (FFHQ, VoxCeleb). We compare our full method (Full) with two variants: w/o L_contr omits the contrastive loss (Eqn. 6) from D_I, and w/o L_m omits the mutual information loss (Eqn. 3) for the motion generator. The results in Tab. 5 demonstrate that L_contr is beneficial for learning videos with coherent content, as employing L_contr results in lower ACD values and higher human preference. L_m also contributes to generating higher-quality videos by mitigating motion mode collapse. To validate motion diversity, we show users pairs of 9 randomly generated videos from the two methods and ask them to choose the one with superior motion diversity, including rotations and facial expressions. User preference suggests that using L_m increases motion diversity. Due to the limitation of computational resources, we train MoCoGAN-HD to synthesize 16 consecutive frames. However, we can generate longer video sequences at inference time in the following two ways. Motion Generator Unrolling. We can run the LSTM decoder for more steps to synthesize longer video sequences. In Fig. 7, we show a synthesized 64-frame video from the model trained on the FaceForensics dataset. Our method is thus capable of synthesizing videos with more frames than were used for training.

5. CONCLUSION

In this work, we present a novel approach to video synthesis. Building on contemporary advances in image synthesis, we show that a good image generator, combined with our framework, is sufficient to boost video synthesis fidelity and resolution. The key is to find a meaningful trajectory in the image generator's latent space. This is achieved with the proposed motion generator, which produces a sequence of motion residuals, together with the contrastive image discriminator and the video discriminator. This disentangled representation further extends the applications of video synthesis to content and motion manipulation and cross-domain video synthesis. The framework achieves superior results on a variety of benchmarks and reaches resolutions unattainable by prior state-of-the-art techniques.

A MORE DETAILS FOR CONTRASTIVE LEARNING

• Cutout (DeVries & Taylor, 2017). We set the pixels in a random subregion of each image to 0. Each subregion starts at a random point and has size (α_m H, α_m W), where α_m ∼ U(0, 0.25) and (H, W) is the image resolution. • Flipping. We horizontally flip the image with probability 0.5. Memory Bank. It has been shown that contrastive learning benefits from large batch sizes and many negative pairs (Chen et al., 2020b). To increase the number of negative pairs, we incorporate the memory mechanism from MoCo (He et al., 2020), which designates a memory bank to store negative examples. More specifically, we keep an exponential moving average of the image discriminator, whose outputs on fake video frames are buffered as negative examples. We use a memory bank with a dictionary size of 4,096.
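The cutout augmentation described above can be sketched as follows (`cutout` is a hypothetical helper; the region placement beyond "starts at a random point" is an assumption):

```python
import numpy as np

def cutout(img, rng):
    """Zero out a random subregion of size (a*H, a*W), a ~ U(0, 0.25)."""
    H, W = img.shape[:2]
    a = rng.uniform(0, 0.25)
    h, w = int(a * H), int(a * W)
    y = rng.integers(0, H - h + 1)       # random top-left corner
    x = rng.integers(0, W - w + 1)
    out = img.copy()
    out[y:y + h, x:x + w] = 0            # mask the subregion to 0
    return out

rng = np.random.default_rng(5)
img = np.ones((64, 64, 3))
aug = cutout(img, rng)
print(aug.shape)
```

Applied to synthesized frames, such augmentations produce the varying positive views that the contrastive discriminator must map to nearby embeddings.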

B MORE DETAILS FOR EXPERIMENTS

Image Generators. We train unconditional StyleGAN2 models from scratch on the UCF-101, FaceForensics, Sky Time-lapse, and AFHQ-Dog datasets. We train the image generators with the official TensorFlow code¹ and select the checkpoints that obtain the best Fréchet inception distance (FID) (Heusel et al., 2017) scores as the image generators. The FID score of each image generator is shown in Table 7. For FFHQ, AnimeFaces, and LSUN-Church, we simply use the released pre-trained models. We also train an unconditional BigGAN model on the FFHQ dataset using the public PyTorch code², at a resolution of 128 × 128, and select the last checkpoint as the image generator. Training Time. We train each image generator for UCF-101, FaceForensics, Sky Time-lapse, and AFHQ-Dog in less than 2 days using 8 Tesla V100 GPUs. For FFHQ, AnimeFaces, and LSUN-Church, we use the released models with no training cost. The training time for the video generators ranges from 1.5 to 3 days depending on the dataset (due to memory constraints, training for videos with a resolution of 1,024 × 1,024 was done on 8 Quadro RTX 8000 GPUs and took 5 days). The total training time for all datasets is 1.5 to 5 days, and the estimated cost of training on Google Cloud is $0.7K∼$2.3K. Implementation Details. We implement our experiments with PyTorch 1.3.1 and have also tested them with PyTorch 1.6. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0001 for G_M, D_V, and D_I in all experiments. In Eqn. 2, we set λ = 0.5 for conventional video generation tasks and use a smaller λ = 0.2 for cross-domain video generation, as it improves content consistency. In Eqn. 7, we set λ_m = λ_contr = λ_f = 1. A grid search over these hyperparameters could potentially yield a further performance boost. For TGANv2, we use the released code³ to train models on UCF-101 and FaceForensics using 8 Tesla V100 GPUs with 16GB of memory each. Video Prediction.
For video prediction, we predict consecutive frames given the first frame x_1 of a test video clip as input. We find the inverse latent code ẑ_1 for x_1 by minimizing the following objective:

ẑ_1 = argmin_{ẑ_1} ‖x_1 − G_I(ẑ_1)‖_2^2 + λ_vgg ‖F_vgg(x_1) − F_vgg(G_I(ẑ_1))‖_2^2,   (8)

where λ_vgg is the weight for the perceptual loss (Johnson et al., 2016) and F_vgg is the VGG feature extraction model (Simonyan & Zisserman, 2014). We set λ_vgg = 1 and optimize Eqn. 8 for 20,000 iterations. We take ẑ_1 as the input to our model for video prediction. AMT Experiments. We present more details on the AMT experiments for the different experimental settings and datasets. For each experiment, we run 5 iterations to get an averaged score. • FaceForensics, Ours vs TGANv2. We randomly select 300 videos from each method and ask users to select the better one from each pair of videos. • Sky Time-lapse, Ours vs DTVNet. We compare our method with DTVNet on the video prediction task. The testing set of the Sky Time-lapse dataset includes 2,815 short video clips. Considering that many of these clips share similar content and are sampled from 148 long videos, we select 148 short videos with different content for testing. For these videos, we perform inversion (Eqn. 8) on the first frame to get the latent code and generate videos. For DTVNet, we use the first frame directly as input to produce their results. We ask users to choose the video with better quality from each pair generated by our method and DTVNet. The results shown in Tab. 8 demonstrate the clear advantage of our approach. • FFHQ, Full vs w/o L_contr. We randomly sample 200 videos generated by each method and ask users to select the more realistic one from each pair. • FFHQ, Full vs w/o L_m. For each method, we use the same content code z_1 to generate 9 videos with different motion trajectories, and organize them into a 3 × 3 grid.
To conduct AMT experiments, we randomly generate 50 3 × 3 videos for each method and ask users to choose the one with higher motion diversity from a pair of videos. Cross-Domain Video Generation. We provide more details on the image and video datasets. 
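Eqn. 8 is a standard optimization-based GAN inversion: gradient descent on the latent code until the rendered frame matches the target. A minimal sketch with a toy linear generator (so the gradient is analytic) and without the perceptual term; `A`, the step size, and the iteration count are illustrative assumptions, whereas the real objective adds λ_vgg times the VGG feature loss and runs for 20,000 iterations:

```python
import numpy as np

rng = np.random.default_rng(6)
LATENT, PIX = 8, 32
A = rng.standard_normal((PIX, LATENT)) / np.sqrt(LATENT)  # toy linear "G_I"

def invert(x1, steps=2000):
    """Gradient descent on ||x1 - G_I(z)||^2 (Eqn. 8 without the VGG term)."""
    z = np.zeros(LATENT)
    lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)  # stable step size
    for _ in range(steps):
        residual = A @ z - x1
        z -= lr * 2.0 * (A.T @ residual)          # gradient of the squared error
    return z

z_true = rng.standard_normal(LATENT)
x1 = A @ z_true            # a "frame" that is exactly invertible here
z_hat = invert(x1)
print(np.linalg.norm(A @ z_hat - x1))
```

The recovered ẑ_1 then seeds the motion generator, exactly as in the video-prediction protocol above.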

C MORE VIDEO RESULTS

In this section, we provide more qualitative video results generated by our approach. We show a thumbnail from each video in the figures; full-resolution videos are in the supplementary material. We also provide an HTML page to visualize these videos. UCF-101. In Fig. 9, we show videos generated by our approach on the UCF-101 dataset. FaceForensics. In Fig. 10, we show the generated videos for FaceForensics. In Fig. 11 and Fig. 12, we show that our approach can generate long consecutive results of 32 and 64 frames, respectively.

D MORE ABLATION ANALYSIS FOR MUTUAL INFORMATION LOSS L m

In addition to Tab. 5, we perform another ablation experiment to show how the mutual information loss L_m improves motion diversity. We randomly sample a content code z_1 ∈ Z and use it as input to synthesize 100 videos of 16 frames each. We average the 100 generated videos (they share the same first frame) to obtain one mean-video of 16 frames; for example, the last frame of the mean-video is obtained by averaging the last frames of all 100 generated videos. We also calculate the per-pixel standard deviation (std) for each averaged frame in the mean-video. Blurrier frames and a higher per-pixel std indicate that the 100 synthetic videos contain more diverse motion. We evaluate the Full and w/o L_m (without the mutual information loss) settings by running the above experiment 50 times, i.e., sampling 50 different z_1. Across the 50 trials, the mean and std of the per-pixel std for the 16th frame (the last frame of a generated video) is 0.233 ± 0.036 for the Full model, which is significantly higher than that of the w/o L_m model (0.126 ± 0.025). In Fig. 23, we show 8 examples of the last frame of the mean-video together with the per-pixel std images (see the supplementary material for the whole videos). Our Full model produces more diverse motion, as the averaged frame is blurrier and the per-pixel std is higher. Note that while StyleGAN2 supports noise inputs for extra randomness, we disable them in this ablation study.
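The mean-video diversity probe above can be sketched as follows (toy arrays stand in for generated videos; `motion_diversity` is a hypothetical helper implementing the per-pixel std of the averaged last frame):

```python
import numpy as np

def motion_diversity(videos):
    """Average many videos that share the same first frame and report the mean
    per-pixel std of the last frame; a higher value means more diverse motion
    (the averaged frame is blurrier)."""
    stack = np.asarray(videos)             # (n_videos, n_frames, H, W)
    last_frames = stack[:, -1]             # the 16th frame of each video
    return float(last_frames.std(axis=0).mean())

rng = np.random.default_rng(7)
base = rng.random((16, 8, 8))              # shared "content"
diverse = [base + 0.5 * rng.standard_normal((16, 8, 8)) for _ in range(100)]
similar = [base + 0.01 * rng.standard_normal((16, 8, 8)) for _ in range(100)]
print(motion_diversity(diverse) > motion_diversity(similar))
```

Videos whose motion barely varies collapse onto one trajectory, driving the per-pixel std toward zero, which is exactly the signature of motion mode collapse the metric is designed to expose.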

E LIMITATIONS

Our framework requires a well-trained image generator for frame synthesis. To synthesize high-quality and temporally coherent videos, an ideal image generator should satisfy two requirements: R1. The image generator should synthesize high-quality images; otherwise, the video discriminator can easily identify generated videos, as their image quality differs from that of real videos. R2. The image generator should be able to generate diverse image content, so as to cover enough motion modes for sequence modeling. Example of R1. UCF-101 is a challenging dataset even for training an image generator. In Tab. 7, the StyleGAN2 model trained on UCF-101 has an FID of 45.63, which is much worse than the others. We hypothesize that this is because the UCF-101 dataset has many categories, but each category contains relatively few videos, and those videos share very similar content. A similar observation is discussed in DVD-GAN (Clark et al., 2019). Moreover, we visualize video synthesis results obtained by moving along the top 20 PCA components. Let V_i denote the i-th PCA component. Given a content code z_1, we synthesize a 5-frame video clip using the following sequence as input: {z_1 − 2V_i, z_1 − V_i, z_1, z_1 + V_i, z_1 + 2V_i}. In Fig. 26, we show the video synthesis results for the top 20 PCA directions. It can be seen that: 1) changing the later components (the 8th and later rows) of BAIR makes only small changes; 2) the first 7 components of BAIR have entangled semantic meaning, while the components for FFHQ are more disentangled (2nd row, rotation; 20th row, smile). This indicates that the image generator trained on BAIR may not cover enough (disentangled) motion modes, making it hard for the motion generator to fully disentangle content and motion with only a few dominating PCA components, whereas for the image generator trained on FFHQ, disentangling foreground and background is much easier.
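The PCA-direction sweep described above — a 5-frame clip from {z_1 − 2V_i, ..., z_1 + 2V_i} — can be sketched as follows (sizes and the sampled latent set are illustrative; `pca_sweep` is a hypothetical helper):

```python
import numpy as np

def pca_sweep(z1, V, i, steps=(-2, -1, 0, 1, 2)):
    """Latent codes for a 5-frame clip along the i-th PCA direction V_i:
    {z1 - 2*V_i, z1 - V_i, z1, z1 + V_i, z1 + 2*V_i}."""
    return [z1 + s * V[i] for s in steps]

rng = np.random.default_rng(8)
samples = rng.standard_normal((256, 16))  # randomly sampled latent vectors
_, _, V = np.linalg.svd(samples - samples.mean(axis=0), full_matrices=False)
z1 = rng.standard_normal(16)
clip = pca_sweep(z1, V, i=1)              # sweep along the 2nd component
print(len(clip))
```

Rendering each code with G_I makes visible what a single component controls, which is how the rotation and smile directions in Fig. 26 were identified.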



¹ https://github.com/NVlabs/stylegan2
² https://github.com/ajbrock/BigGAN-PyTorch
³ https://github.com/pfnet-research/tgan2




Figure 1: Left: Given an initial latent code z_1, a sequence of noise vectors ε_t, and a PCA basis V, the motion generator G_M encodes z_1 with LSTM_enc to obtain the initial hidden state and uses LSTM_dec to estimate the hidden states for future frames. The image generator G_I synthesizes images from the predicted latent codes. The discriminator D_V is trained on both real and generated video sequences. Right: For each generated video, the first and subsequent frames are sent to an image discriminator D_I. An encoder-like network F computes features of the synthesized images, which are used to compute the contrastive loss L_contr with positive pairs (same image content, different augmentation; shown in blue) and negative pairs (different image content and augmentation; shown in red).
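The trajectory construction in the left panel can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: `lstm_step` is a toy recurrent cell standing in for LSTM_dec, the encoder is replaced by a simple projection of z_1, and all weights are random placeholders. Only the shapes (512-dimensional codes, 384-dimensional hidden states, a 384 × 512 PCA basis) follow the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_h, n_frames = 512, 384, 16

# Hypothetical PCA basis: each row is a principal direction in latent space.
V = rng.standard_normal((d_h, d_z))

# Toy recurrent cell standing in for LSTM_dec (assumption, not the real cell).
W = rng.standard_normal((d_h, d_h + d_z)) * 0.01
def lstm_step(h, x):
    return np.tanh(W @ np.concatenate([h, x]))

z1 = rng.standard_normal(d_z)       # initial latent code
h = np.tanh(z1[:d_h])               # stand-in for LSTM_enc(z1)
trajectory = [z1]
for t in range(1, n_frames):
    eps = rng.standard_normal(d_z)  # per-step noise vector
    h = lstm_step(h, eps)           # hidden state for frame t
    z_t = z1 + h @ V                # motion residual lives in the PCA subspace
    trajectory.append(z_t)

codes = np.stack(trajectory)        # (n_frames, d_z) latent trajectory for G_I
```

Each code in `codes` would then be fed to the (frozen) image generator to render one frame, so temporal consistency is carried entirely by the recurrent hidden state.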

Figure 2: Example generated videos from a model trained on FaceForensics. We can generate natural and photo-realistic videos with various motion patterns, such as eye blinks and talking. For each of the four examples, frames 2, 7, 11, and 16 are shown.

Figure 3: Sample generated frames at several time steps (t) for the Sky Time-lapse dataset.

Figure 4: Example sequences for cross-domain video generation. First row: (FFHQ, VoxCeleb). Second row: (LSUN-Church, TLVDB). Third row: (AFHQ-Dog, VoxCeleb). Fourth row: (AnimeFaces, VoxCeleb). Images in the first and second rows have a resolution of 256 × 256, while the third and fourth rows have a resolution of 512 × 512.

256 × 256 and 1024 × 1024 for FFHQ, 512 × 512 for AFHQ-Dog and AnimeFaces, and 256 × 256 for LSUN-Church. Qualitative examples for (FFHQ, VoxCeleb), (LSUN-Church, TLVDB), (AFHQ-Dog, VoxCeleb), and (AnimeFaces, VoxCeleb) are shown in Fig. 4, depicting high-quality and temporally consistent videos (more videos, including results with BigGAN as the image generator, are shown in the Appendix).

Figure 5: The first and second rows (likewise the third and fourth rows) share the same initial content code but use different motion codes.

Figure 7: Generation of a 64-frame video using a model trained with 16-frame clips on FaceForensics.

Motion Interpolation. We can interpolate the motion trajectory directly to synthesize longer videos. Fig. 8 shows a 32-frame interpolation example on the (AFHQ-Dog, VoxCeleb) dataset.

Figure 8: Generation of a 32-frame video on (AFHQ-Dog, VoxCeleb) by interpolating the motion trajectory.

Figure 13: Each row is synthesized using the same content code to generate diverse motion patterns. Please see the corresponding supplementary video for a better illustration.

Figure 14: Each row is synthesized with the same motion trajectory but different content codes. Please see the corresponding supplementary video for a better illustration.

Figure 15: Example videos generated by our approach on the Sky Time-lapse dataset. The videos have a resolution of 128 × 128.

Figure 16: Cross-domain video generation for (FFHQ, Vox). The videos have a resolution of 128 × 128.

Figure 17: Cross-domain video generation for (FFHQ, Vox). The videos have a resolution of 256 × 256.

Figure 18: Cross-domain video generation for (FFHQ, Vox). The videos have a resolution of 1024 × 1024.

Figure 19: Cross-domain video generation for (AFHQ-Dog, Vox). The videos have a resolution of 512 × 512.

Figure 20: Cross-domain video generation for (AFHQ-Dog, Vox). We interpolate every two frames to get 32 sequential frames. The videos have a resolution of 512 × 512.

Figure 21: Cross-domain video generation for (AnimeFaces, Vox). The videos have a resolution of 512 × 512.

Figure 22: Cross-domain video generation for (LSUN-Church, TLVDB). The videos have a resolution of 256 × 256.

Figure 23: Rows 1 and 3: the last frame of the mean video and the per-pixel standard deviation for the model trained without L_m. Rows 2 and 4: the same for the full model. The full model produces a blurrier mean video and higher per-pixel standard deviation, indicating more diverse motion.




FVD, ACD, and human preference on FaceForensics.

UCF-101 is widely used in video generation. The dataset includes 13,320 videos spanning 101 sport categories; each video has a resolution of 320 × 240. To process the data, we crop a 240 × 240 square from each frame and resize it to 256 × 256. We train the motion generator to predict 16 frames. For evaluation, we report the Inception Score (IS) (Saito et al., 2020) on 10,000 generated videos and the Fréchet Video Distance (FVD) (Unterthiner et al., 2018) on 2,048 videos. The classifier used to calculate IS is a C3D network.
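The UCF-101 preprocessing above (crop a 240 × 240 square, resize to 256 × 256) can be sketched as follows. This is a minimal numpy version under stated assumptions: the paper does not say where the square is taken, so a center crop is assumed, and nearest-neighbour sampling stands in for whatever resize filter the real pipeline uses.

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, crop: int = 240, out: int = 256) -> np.ndarray:
    """Center-crop a crop x crop square from an HxWx3 frame (center crop is
    an assumption), then resize to out x out with nearest-neighbour sampling
    (a stand-in for a real resize filter)."""
    h, w, _ = frame.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    square = frame[top:top + crop, left:left + crop]
    idx = (np.arange(out) * crop / out).astype(int)   # nearest source rows/cols
    return square[np.ix_(idx, idx)]

frame = np.zeros((240, 320, 3), dtype=np.uint8)       # a UCF-101 frame: 320 x 240
processed = preprocess_frame(frame)                    # -> 256 x 256 x 3
```

In practice a library resize (e.g. bilinear) would replace the index trick; the crop-then-resize order is the part that matters here.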

Evaluation on Sky Time-lapse for video synthesis and prediction.

Ablation study on UCF-101.

FID of our trained StyleGAN2 models on different datasets.

Human evaluation experiments on Sky Time-lapse dataset.

Image Datasets:
- FFHQ (Karras et al., 2019) consists of 70,000 high-quality face images at 1024 × 1024 resolution with considerable variation in age, ethnicity, and background.
- AFHQ-Dog (Choi et al., 2020) contains 5,239 high-quality dog images at 512 × 512 resolution, with both training and testing sets.
- AnimeFaces (Branwen, 2019) includes 2,232,462 anime face images at 512 × 512 resolution.
- LSUN-Church (Yu et al., 2015) includes 126,227 in-the-wild church images at 256 × 256 resolution.

Video Datasets:
- VoxCeleb (Nagrani et al., 2020) consists of 22,496 short clips of human speech, extracted from interview videos uploaded to YouTube.
- TLVDB (Shih et al., 2013) includes 463 time-lapse videos covering a wide range of landscapes and cityscapes.

For the video datasets, we randomly select 32 consecutive frames from each training video and keep every other frame to form a 16-frame sequence for training.
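The video-dataset sampling described above (a random 32-frame window, keeping every other frame) is simple enough to sketch directly; the function name and signature here are illustrative, not from the paper.

```python
import numpy as np

def sample_training_clip(video_len: int, window: int = 32, stride: int = 2,
                         rng: np.random.Generator = np.random.default_rng()):
    """Pick a random `window`-frame span from a video of `video_len` frames
    and keep every `stride`-th frame, yielding the 16-frame training clip."""
    start = int(rng.integers(0, video_len - window + 1))
    return list(range(start, start + window, stride))

idx = sample_training_clip(100)   # frame indices into a 100-frame video
```

Subsampling every other frame doubles the effective temporal span seen by the motion generator without increasing the sequence length it must model.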

A ADDITIONAL DETAILS FOR THE FRAMEWORK

A.1 ADDITIONAL DETAILS FOR THE MOTION GENERATOR

To use StyleGAN2 (Karras et al., 2020b) as the image generator, we randomly sample 1,000,000 latent codes from the input space Z and send them through the 8-layer MLP (the mapping network) to obtain latent codes in the space W. Each latent code is a 512-dimensional vector. We perform PCA on these 1,000,000 latent codes and select the top 384 principal components to form the matrix V ∈ R^{384×512}, which is used to model the motion residuals in Eqn. 2. The LSTM encoder and the LSTM decoder in the motion generator both have an input size of 512 and a hidden size of 384. The noise vector ε_t in Eqn. 1 is also a 512-dimensional vector, and the network H in Eqn. 3 is a 2-layer MLP with 512 hidden units in each of its two fully-connected layers. For BigGAN (Brock et al., 2019), we sample the latent code directly from the space Z.
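The PCA step above can be sketched with numpy's SVD. This is a scaled-down illustration: `mlp` is an arbitrary fixed nonlinear map standing in for StyleGAN2's mapping network, and only 10,000 samples are drawn instead of the paper's 1,000,000.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 512, 10_000, 384   # latent dim, #samples (paper uses 1e6), #components

# Stand-in for the 8-layer mapping network: any fixed nonlinear Z -> W map
# illustrates the procedure (assumption, not the real network).
A = rng.standard_normal((d, d)) / np.sqrt(d)
mlp = lambda z: np.tanh(z @ A)

Z = rng.standard_normal((n, d))
W = mlp(Z)                                   # sampled codes in W space

W_centered = W - W.mean(axis=0)
# SVD of the centered data: rows of Vt are the principal directions.
_, _, Vt = np.linalg.svd(W_centered, full_matrices=False)
V = Vt[:k]                                   # top-384 components, shape (384, 512)
```

The motion generator then expresses each frame's residual as a 384-dimensional coefficient vector multiplied by `V`, which constrains motion to directions along which the W-space actually varies.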

A.2.1 VIDEO DISCRIMINATOR

The input images for the video discriminator D_V are processed at two scales: we downsample the output images from the image generator to resolutions of 128 × 128 and 64 × 64. For in-domain video synthesis, the input sequences for D_V have shapes of 6 × (n−1) × 128 × 128 and 6 × (n−1) × 64 × 64, where n is the sequence length used for training. For each of the (n−1) subsequent frames, we concatenate the RGB channels of the first frame with those of that subsequent frame, resulting in a 6-channel input. For cross-domain video synthesis, the input sequences for D_V have shapes of 3 × n × 128 × 128 and 3 × n × 64 × 64, since concatenating the first frame would make the discriminator aware of the domain gap. Details for D_V are shown in Tab. 6.

The image discriminator D_I has an architecture based on that of the BigGAN discriminator, except that we remove the self-attention layer. The feature extractor F used for contrastive learning has the same architecture as D_I, except that it omits the last layer of D_I and adds two fully connected (FC) layers as a projection head; each of the two FC layers has 256 hidden units.

Here we describe in more detail the image augmentation and memory bank techniques used for contrastive learning.

Image Augmentation. We perform data augmentation on images to create positive and negative pairs. We normalize the images to [−1, 1] and apply the following augmentation techniques:
- Affine. We augment each image with an affine transformation defined by three random parameters: rotation α_r ∈ U(−180, 180), translation α_t ∈ U(−0.1, 0.1), and scale α_s ∈ U(0.95, 1.05).
- Brightness. We add a random value α_b ∼ U(−0.5, 0.5) to all channels of each image.
- Color. We add a random value α_c ∼ U(−0.5, 0.5) to one randomly-selected channel of each image.

Our model generalizes to longer sequences even when trained with 16-frame clips. In Fig. 13, we demonstrate that our approach can generate diverse motion patterns using the same content code. In Fig. 14, we apply the same motion codes with different content codes to obtain the synthesized videos.

Sky Time-lapse. Fig. 15 shows generated videos for the Sky Time-lapse dataset.

(FFHQ, VoxCeleb). Fig. 16, Fig. 17, and Fig. 18 show cross-domain results for (FFHQ, VoxCeleb). While our method achieves state-of-the-art performance on the UCF-101 dataset, the quality of the generated videos is not as good as on other datasets (Fig. 9), and the synthesized videos are still not close to real videos.

Example of R2. We test our method on the BAIR Robot Pushing dataset (Ebert et al., 2017). We train a 64 × 64 StyleGAN2 image generator using frames from BAIR videos; the image generator achieves an FID of 6.12. Based on this image generator, we train a video generation model that synthesizes 16 frames. An example synthesized video is shown in Fig. 24 (more videos are in the supplementary materials). Our method successfully models shadow changes and the motion of the robot arm, but it struggles to decouple the robot arm from some small objects in the background, which we analyze below. Inspired by previous work (Härkönen et al., 2020), we further investigate the latent space of the image generator by considering the information contained in each PCA component. Fig. 25 shows the percentage of total variance captured by the top PCA components. The image generator trained on BAIR compresses most of the information into a few components: the top 20 PCA components capture 85% of the variance. In contrast, the latent space of the image generator trained on FFHQ (and FFHQ 1024 for high-resolution image synthesis) requires 100 PCA components to capture 85% of the variance. This implies that the BAIR generator models the dataset in a low-dimensional space, which makes it harder to fully disentangle all the objects in an image for manipulation.
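The 6-channel input construction for the in-domain video discriminator can be sketched directly from the shape description: for each of the n−1 subsequent frames, the first frame's RGB channels are stacked with that frame's. The function name below is illustrative, not from the paper.

```python
import numpy as np

def video_disc_input(frames: np.ndarray) -> np.ndarray:
    """Build the in-domain D_V input: given a clip of shape (n, 3, H, W),
    pair the first frame with each of the n-1 subsequent frames along the
    channel axis, returning a tensor of shape (6, n-1, H, W)."""
    n, c, h, w = frames.shape
    first = np.broadcast_to(frames[0], (n - 1, c, h, w))   # repeat frame 0
    pairs = np.concatenate([first, frames[1:]], axis=1)    # (n-1, 6, H, W)
    return pairs.transpose(1, 0, 2, 3)                     # (6, n-1, H, W)

clip = np.zeros((16, 3, 128, 128), dtype=np.float32)       # a 16-frame clip
dv_in = video_disc_input(clip)                              # (6, 15, 128, 128)
```

Conditioning every pair on the first frame lets D_V judge temporal consistency relative to a fixed content reference; for cross-domain synthesis this pairing is dropped, as the text explains, to keep D_V blind to the domain gap.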

