PHENAKI: VARIABLE LENGTH VIDEO GENERATION FROM OPEN DOMAIN TEXTUAL DESCRIPTIONS

Abstract

We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited quantity of high quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations that compresses a video into a small set of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e. time-variable text, or a story) in the open domain. To the best of our knowledge, this is the first paper to study generating videos from open domain, time-variable prompts. In addition, compared to per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency.
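For concreteness, the token-generation step described above can be illustrated with a minimal MaskGIT-style iterative decoding loop. The sketch below is an illustrative reconstruction, not the paper's implementation: the `transformer(tokens, text_tokens)` callable, the sentinel mask id, and the cosine unmasking schedule are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

MASK = -1  # sentinel id for a masked token position (illustrative choice)

def generate_video_tokens(transformer, text_tokens, num_tokens, steps=12):
    """Iteratively unmask all video token positions in a fixed number of steps."""
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(1, steps + 1):
        # Hypothetical bidirectional transformer: per-position logits over the
        # video-token vocabulary, conditioned on pre-computed text tokens.
        probs = softmax(transformer(tokens, text_tokens))  # (num_tokens, vocab)
        sampled = np.array([np.random.choice(probs.shape[-1], p=p) for p in probs])
        confidence = probs[np.arange(num_tokens), sampled]
        confidence[tokens != MASK] = np.inf      # committed tokens stay fixed
        # Cosine schedule: number of positions left masked after this step.
        n_masked = int(num_tokens * np.cos(0.5 * np.pi * step / steps))
        keep = np.argsort(-confidence)[: num_tokens - n_masked]
        new_tokens = np.full(num_tokens, MASK, dtype=np.int64)
        new_tokens[keep] = np.where(tokens[keep] == MASK, sampled[keep], tokens[keep])
        tokens = new_tokens
    return tokens  # fully unmasked; de-tokenize with the video decoder
```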

1. INTRODUCTION

It is now possible to generate realistic high resolution images given a description [38, 39, 36, 42, 65], but generating high quality videos from text remains challenging. In essence, videos are just sequences of images, but this does not mean that generating a long coherent video is easy. In practice, it is a significantly harder task: there is far less high quality data available, and the computational requirements are much more severe [11]. For image generation, there are datasets with billions of image-text pairs (such as LAION-5B and JFT4B [67]), while text-video datasets are substantially smaller, e.g. WebVid [4] with ~10M videos, which is not enough given the higher complexity of open domain videos. As for computation, training current state-of-the-art image generation models already pushes the limits of available computational capabilities [65], leaving little to no room for generating videos, particularly videos of variable length.

To make matters worse, one can argue that a single short text prompt is not sufficient to describe a video completely (except for very short clips); instead, a generated video must be conditioned on a sequence of prompts, or a story, which narrates what happens over time. Ideally, a video generation model must be able to generate videos of arbitrary length, all the while conditioning the generated frames on prompts at time t that can vary over time. Such a capability clearly distinguishes a video from a "moving image" and opens the way to real-world creative applications in art, design and content creation.

To the best of our knowledge, story-based conditional video generation in the open domain has never been explored before, and this is the first paper to take early steps towards that goal. A traditional deep learning approach of simply learning this task from data is not possible, since there is no story-based dataset to learn from. Instead, to achieve this we rely on a model that is designed specifically with this capability in mind.

In this paper, we introduce Phenaki, a text-to-video model trained on both text-to-video and text-to-image data that can:

- Generate temporally coherent and diverse videos conditioned on open domain prompts, even when the prompt is a new composition of concepts (Fig. 3). The videos can be long (minutes) even though the model is trained on 1.4-second videos (at 8 fps).
- Generate videos conditioned on a story, i.e. a sequence of prompts, e.g. Fig. 1 and Fig. 5 (a minimal sketch of this generation loop follows the list).
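The prompt-switching auto-regressive loop behind story conditioning can be sketched as follows. The helper callables `encode_video`, `generate_chunk`, and `decode_tokens` are hypothetical stand-ins for the video tokenizer and the text-conditional masked transformer; their names, signatures, and the context length are illustrative assumptions, not the paper's API.

```python
def generate_story_video(prompts, frames_per_prompt,
                         encode_video, generate_chunk, decode_tokens,
                         context_frames=5):
    """Generate one continuous video by switching prompts over time."""
    video = []  # list of decoded frames
    for prompt in prompts:
        # Re-encode the tail of the video so far as conditioning context,
        # so each new chunk stays temporally coherent with what came before.
        context = encode_video(video[-context_frames:]) if video else None
        tokens = generate_chunk(prompt, context)   # masked-transformer sampling
        video.extend(decode_tokens(tokens)[:frames_per_prompt])
    return video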
To enable these capabilities, we could not rely on current video encoders, because they either can only decode fixed-size videos or they encode frames independently. Hence, we introduce C-ViViT, a novel encoder-decoder architecture that:

- Exploits temporal redundancy in videos to improve reconstruction quality over a per-frame model, while compressing the number of video tokens by 40% or more.
- Allows encoding and decoding of variable-length videos, given its causal structure (illustrated in the sketch below).
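To make the causal structure concrete, the sketch below builds the kind of temporal attention mask such a tokenizer would use. This is an illustrative reconstruction from the description above, not the model's actual code.

```python
import numpy as np

def causal_time_mask(num_frames):
    """Boolean mask where entry (t, s) is True iff frame t may attend to frame s <= t."""
    t = np.arange(num_frames)
    return t[:, None] >= t[None, :]

# Spatial attention within each frame remains full; only the temporal
# transformer is masked. Tokens for frame t therefore never depend on
# future frames, so appending frames leaves earlier tokens unchanged:
# the property that lets the tokenizer handle variable-length videos.
print(causal_time_mask(4)[2])   # [ True  True  True False]
```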



Figure 1. Time-variable text (i.e. story) conditional video generation. The entire figure is one continuous video generated auto-regressively. We start by generating the video conditioned on the first prompt and then, after a couple of frames, we change the prompt to the next one. Each row contains selected frames (in order, left to right) generated while the model was conditioned on that particular prompt. The model manages to preserve the temporal coherence of the video while adapting to the new prompt, usually taking the shortest path for the adaptation (notice the morphing of the teddy bear into the panda). Note that the generated video has complex visual features such as reflections, occlusions, interactions and scene transitions. The full video is available at phenaki.github.io.

