PHENAKI: VARIABLE LENGTH VIDEO GENERATION FROM OPEN DOMAIN TEXTUAL DESCRIPTIONS

Abstract

We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited amount of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations that compresses videos into a small set of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to produce the actual video. To address the data issues, we demonstrate how joint training on a large corpus of image-text pairs and a smaller number of video-text examples can yield generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of open-domain prompts (i.e., time-variable text, or a story). To the best of our knowledge, this is the first paper to study generating videos from open-domain, time-variable prompts. In addition, compared to per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video yet achieves better spatio-temporal consistency.
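For concreteness, the sketch below illustrates the generation pipeline summarized above: a bidirectional masked transformer iteratively fills in discrete video tokens conditioned on pre-computed text tokens, and the resulting token grids are de-tokenized back into frames. All names, shapes, and hyper-parameters (e.g., CODEBOOK_SIZE, TOKENS_PER_FRAME, the number of decoding steps) are illustrative assumptions, with dummy stand-ins for the learned networks; this is a structural sketch, not the model's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 8192          # assumed size of the discrete video-token vocabulary
TOKENS_PER_FRAME = 16 * 16    # assumed spatial token grid per frame

def masked_transformer_step(tokens, mask, text_embedding):
    """Stand-in for one pass of the bidirectional masked transformer:
    fills every masked position with a sampled token id."""
    filled = tokens.copy()
    filled[mask] = rng.integers(0, CODEBOOK_SIZE, size=int(mask.sum()))
    return filled

def generate_video_tokens(text_embedding, num_frames, steps=12):
    """Masked-token iterative decoding: start fully masked and unmask a
    growing fraction of token positions at every step."""
    tokens = np.zeros((num_frames, TOKENS_PER_FRAME), dtype=np.int64)
    mask = np.ones_like(tokens, dtype=bool)
    for step in range(steps):
        tokens = masked_transformer_step(tokens, mask, text_embedding)
        keep_fraction = (step + 1) / steps      # the real unmasking schedule differs
        mask = rng.random(tokens.shape) > keep_fraction
    return tokens

def detokenize(tokens, height=128, width=128):
    """Stand-in for the tokenizer's decoder: token grids -> RGB frames."""
    return rng.random(size=(tokens.shape[0], height, width, 3))

text_embedding = rng.random(512)                # stand-in for pre-computed text tokens
video_tokens = generate_video_tokens(text_embedding, num_frames=11)
video = detokenize(video_tokens)
print(video_tokens.shape, video.shape)          # (11, 256) (11, 128, 128, 3)
```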

1. INTRODUCTION

It is now possible to generate realistic high-resolution images given a description [38, 39, 36, 42, 65], but generating high-quality videos from text remains challenging. In essence, videos are just sequences of images, but this does not mean that generating a long, coherent video is easy. In practice, it is a significantly harder task: there is far less high-quality data available, and the computational requirements are much more severe [11]. For image generation, there are datasets with billions of image-text pairs (such as LAION-5B [45] and JFT-4B [67]), while text-video datasets are substantially smaller, e.g., WebVid [4] with ~10M videos, which is not enough given the higher complexity of open-domain videos. As for computation, training current state-of-the-art image generation models already pushes the limits of available computational capabilities [65], leaving little to no room for generating videos, particularly videos of variable length.

To make matters worse, one can argue that a single short text prompt is not sufficient to completely describe a video (except for short clips); instead, a generated video must be conditioned on a sequence of prompts, or a story, which narrates what happens over time. Ideally, a video generation model should be able to generate videos of arbitrary length while conditioning the generated frames at time t on prompts at time t, i.e., prompts that can vary over time. Such a capability clearly distinguishes a video from a "moving image" and opens the way toward real-world creative applications.
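As a rough illustration of this usage pattern, the sketch below drives variable-length generation from a story, i.e., a time-ordered list of prompts: each segment is generated conditioned on its prompt and, in a real model, on the token representation of the frames generated so far, so the video can be extended indefinitely. The helper, its interface, and the example prompts are hypothetical stand-ins, not the actual model or its API.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, TOKENS_PER_FRAME = 8192, 256   # assumed token vocabulary and per-frame grid

def extend_video_tokens(past_tokens, prompt, num_new_frames):
    """Dummy stand-in: generate token grids for `num_new_frames` conditioned on
    `prompt` and (in a real model) on the last few frames of `past_tokens`."""
    new = rng.integers(0, CODEBOOK_SIZE, size=(num_new_frames, TOKENS_PER_FRAME))
    return new if past_tokens is None else np.concatenate([past_tokens, new], axis=0)

story = [
    "a camera pans over a snowy forest",      # illustrative prompts, not from the paper
    "a deer walks between the trees",
    "the deer runs toward the camera",
]

tokens = None
for prompt in story:
    tokens = extend_video_tokens(tokens, prompt, num_new_frames=11)

print(tokens.shape)   # (33, 256): one token grid per generated frame, ready to de-tokenize
```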

