COGVIDEO: LARGE-SCALE PRETRAINING FOR TEXT-TO-VIDEO GENERATION VIA TRANSFORMERS

Abstract

Large-scale pretrained transformers have reached milestones in text generation (GPT-3) and text-to-image generation (DALL-E and CogView). However, applying them to video generation still faces several challenges: the potentially huge computation cost, and the scarcity and weak relevance of text-video datasets. In this work, we present CogVideo, a 9B-parameter transformer for text-to-video generation. CogVideo is trained by inheriting a pretrained text-to-image model, CogView2, which significantly reduces the training cost and alleviates the problems of data scarcity and weak relevance. We also propose a multi-frame-rate training strategy to better align text and video clips. CogVideo achieves state-of-the-art performance in machine evaluation and outperforms publicly available models by a large margin in human evaluation. The code and model are publicly available at https://github.com/THUDM/CogVideo.



[Figure 1 sample prompts: "A lion is drinking water." "A woman is riding a horse on the sea." "A man is skiing." "A girl is dancing, anime." "Nightfall in a metropolis."]

1. INTRODUCTION

Autoregressive transformers, e.g. DALL-E (Ramesh et al., 2021) and CogView (Ding et al., 2021), have revolutionized text-to-image generation. A few works have followed this framework to develop text-to-video transformers (Wu et al., 2021b; Ge et al., 2022), e.g. VideoGPT (Yan et al., 2021), and demonstrated its superiority over GAN-based methods (Clark et al., 2019; Tulyakov et al., 2018). However, the performance is still far from satisfactory. Diffusion probabilistic models, e.g. Imagen (Saharia et al., 2022) and DALL-E 2 (Ramesh et al., 2022), represent another line of research for text-to-image generation and video generation (Ho et al., 2022). However, how to better incorporate temporal information for text-to-video generation remains a challenge. In this paper, we focus on designing an autoregressive model for text-to-video generation.

The critical challenge in previous work is that the generated video frames tend to gradually deviate from the text prompt. This makes vanilla autoregressive models good only at synthesizing videos with regular patterns (e.g. cars moving forward) or random patterns (e.g. randomly moving lips while speaking), but they fail on text prompts such as "a lion is drinking water". The main reason is that in the former cases the first frame already provides sufficient information for the subsequent changes, while in the latter the model has to precisely understand the action "drink" in order to generate the desired sequence: the lion lifts the glass to its lips, drinks, and then puts the glass down.

Why can autoregressive transformers understand text-image alignment well, yet struggle with text-action alignment in videos? One reason is that the duration of videos varies a lot. Previous models split each video into many clips with a fixed number of frames for training (Wu et al., 2021b; Ge et al., 2022). Such treatment destroys the alignment between the text and its temporal counterparts in the video.
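The contrast between fixed-length clip splitting and frame-rate-aware sampling can be illustrated with a toy sketch. The helper names, the 16-frame clip budget, and the frame-rate-prefix format below are illustrative assumptions, not the paper's actual implementation:

```python
# Toy sketch: why fixed-length clip splitting breaks text-action alignment,
# and how picking a frame rate per sample (multi-frame-rate training) helps.
# All names and numbers here are illustrative, not from the CogVideo codebase.

CLIP_LEN = 16  # fixed number of frames the model consumes per sample

def fixed_split(num_frames: int, caption: str):
    """Split a video into consecutive fixed-length clips, all sharing one caption."""
    return [(caption, start, start + CLIP_LEN)
            for start in range(0, num_frames - CLIP_LEN + 1, CLIP_LEN)]

def frame_rate_sample(num_frames: int, native_fps: int, caption: str):
    """Pick a stride so that CLIP_LEN sampled frames span the whole video,
    and prepend the resulting frame rate to the text for conditioning."""
    stride = max(1, num_frames // CLIP_LEN)   # sample every `stride`-th frame
    fps = max(1, native_fps // stride)        # effective frame rate of the clip
    frame_ids = list(range(0, stride * CLIP_LEN, stride))
    return f"{fps} fps. {caption}", frame_ids

# A 4-second "drinking" video at 16 fps = 64 frames.
clips = fixed_split(64, "a lion is drinking water")
# -> 4 clips, each misleadingly labeled with the full caption.

text, frame_ids = frame_rate_sample(64, 16, "a lion is drinking water")
# -> one clip whose 16 sampled frames span the whole action,
#    with the frame rate prepended to the prompt.
```

Under fixed splitting, every sub-clip inherits the full caption even though it depicts only a fragment of the action; under the frame-rate-aware scheme, a single clip covers the complete action and the frame-rate prefix tells the model how densely time is sampled.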
If a "drinking" video is split into four individual clips of "holding a glass", "lifting", "drinking" and "putting down" with the same text "drinking", the model will be confused to learn the precise meaning of drinking. The other challenge is that the perfect aligned text-video data is scarce, compared to the easy-tocollect billions of text-image pairs (Ramesh et al. 



Figure 1: Samples generated by CogVideo. The actual text inputs are in Chinese. Each sample is a 4-second clip of 32 frames, and here we sample 8 frames uniformly for display purposes.

, 2021). VATEX (Wang et al., 2019) is probably the largest annotated text-video dataset, but it contains only 41,250 videos. Retrieval-based text-video pairs, e.g. HowTo100M (Miech et al., 2019), are only weakly relevant, and most captions describe the scene without temporal information.

Present Work. Here we present CogVideo, a large-scale pretrained text-to-video generative model with 9.4 billion parameters, trained on 5.4 million text-video pairs. To reduce the computational cost, CogVideo inherits the knowledge learned by the pretrained text-to-image model CogView2 (Ding et al., 2022). To ensure alignment between text and its temporal counterparts in the video, we propose multi-frame-rate training: the flexibility of the textual condition makes it possible to simply prepend a piece of text describing the frame rate to the original text prompt, so that different frame rates can be modeled. To keep the text-video alignment, we choose for each training sample a frame rate whose clip covers the complete action. The frame-rate token also controls the intensity of change across consecutive frames during generation. We train a sequential generation model and a frame interpolation model: the former generates key frames according to the text, and the latter recursively fills in the intermediate frames by varying the frame rate, making the video coherent. As shown in Figure 1, CogVideo can generate high-resolution (480×480) videos. Human evaluation demonstrates that CogVideo outperforms most publicly available models by a large margin. Our main contributions include:

• We present CogVideo, the largest open-source pretrained transformer for general text-to-video generation. CogVideo demonstrates state-of-the-art FVD on the UCF-101 benchmark.
• We propose multi-frame-rate training to better align text-clip pairs, which significantly improves generation accuracy, in particular for movements with complex semantics.
This training strategy also gives CogVideo the capacity to control the intensity of changes during generation.
• We design dual-channel attention to elegantly and efficiently finetune a pretrained text-to-image generative model for text-to-video generation, avoiding expensive full-parameter pretraining from scratch.

2. RELATED WORK

2.1 VIDEO GENERATION

Video generation is a long-standing research topic. Most previous works focus on the next-frame prediction task: forecasting future frames given the first video frame. Early works, e.g.

