TEMPORALLY CONSISTENT VIDEO TRANSFORMER FOR LONG-TERM VIDEO PREDICTION

Abstract

Generating long, temporally consistent video remains an open challenge in video generation. Primarily due to computational limitations, most prior methods limit themselves to training on a small subset of frames that are then extended to generate longer videos in a sliding-window fashion. Although these techniques may produce sharp videos, they have difficulty retaining long-term temporal consistency due to their limited context length. In this work, we present Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics video prediction model that learns compressed representations to efficiently condition on long videos of hundreds of frames during both training and generation. We use a MaskGit prior for dynamics prediction, which enables both sharper and faster generations compared to prior work. Our experiments show that TECO outperforms SOTA baselines on a variety of video prediction benchmarks, ranging from simple mazes in DMLab and large 3D worlds in Minecraft to complex real-world videos from Kinetics-600. In addition, to better understand the capabilities of video prediction models in modeling temporal consistency, we introduce several challenging video prediction tasks consisting of agents randomly traversing 3D scenes of varying difficulty. This presents a challenging benchmark for video prediction in partially observable environments, where a model must understand which parts of a scene to re-create and which to invent, depending on its past observations or generations.

1. INTRODUCTION

Recent work in video prediction has seen tremendous progress (Ho et al., 2022; Clark et al., 2019; Yan et al., 2021; Le Moing et al., 2021; Ge et al., 2022; Tian et al., 2021; Luc et al., 2020) in producing high-fidelity and diverse samples on complex video data. This can largely be attributed to a combination of increased computational resources and more compute-efficient, high-capacity neural architectures. However, much of this progress has focused on generating short videos, where models can perform well by basing their predictions on only a handful of previous frames. Video prediction models with short context windows can generate long videos in a sliding-window fashion. While the resulting videos can look impressive at first sight, they lack temporal consistency. We would like models to predict temporally consistent videos, where the same content is generated if a camera pans back to a previously observed location. On the other hand, the model should imagine a new part of the scene for locations that have not yet been observed, and future predictions should remain consistent with this newly imagined part of the scene.

Prior work has investigated techniques for modeling long-term dependencies, such as temporal hierarchies (Saxena et al., 2021) and strided sampling with frame-wise interpolation (Ge et al., 2022; Hong et al., 2022). Other methods train on sparse sets of frames selected out of long videos (Harvey et al., 2022; Skorokhodov et al., 2021; Clark et al., 2019; Saito & Saito, 2018; Yu et al., 2022), or model videos via compressed representations (Yan et al., 2021; Rakhimov et al., 2020; Le Moing et al., 2021; Seo et al., 2022; Gupta et al., 2022; Walker et al., 2021). Refer to Appendix M for a more detailed discussion of related work. Despite this progress, many methods still have difficulty scaling to datasets with many long-range dependencies.
While Clockwork-VAE (Saxena et al., 2021) trains on long sequences, it is limited by training time (due to its recurrent architecture) and difficult to scale to more complex data. On the other hand, transformer-based methods over latent spaces (Yan et al., 2021) scale poorly to long videos due to the quadratic complexity of attention, with long videos containing tens of thousands of tokens. Methods that train on subsets of tokens are limited by truncated backpropagation through time (Hutchins et al., 2022; Rae et al., 2019; Dai et al., 2019) or naive temporal operations (Hawthorne et al., 2022).

In this paper, we introduce Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics model that effectively models long-term dependencies in a compact representation space using efficient transformers. The key contributions are summarized as follows:

• We introduce TECO, an efficient and scalable video prediction model that learns a set of compressed VQ-latents to allow for efficient training and generation.

• We propose several long-length video prediction datasets centered around 3D scenes in DMLab (Beattie et al., 2016), Minecraft (Guss et al., 2019), and Habitat (Szot et al., 2021; Savva et al., 2019) to help better evaluate temporal consistency in video predictions.

• We show that TECO performs strongly on a variety of difficult video prediction tasks, and is able to leverage long-term temporal context to generate high-quality, consistent videos.

• We provide several ablations that offer intuition for why TECO generates more temporally consistent predictions, and for how these insights can extend to future work in long-term video prediction.
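The MaskGit prior used for dynamics prediction generates latent tokens in parallel over a few refinement rounds rather than one token at a time. The following is a minimal numpy sketch of that iterative masked decoding loop; `predict_logits` is a hypothetical stand-in for the trained bidirectional transformer, and greedy argmax replaces MaskGit's confidence-based sampling for simplicity:

```python
import numpy as np

MASK = -1  # sentinel id for a masked (not yet generated) token slot

def maskgit_decode(predict_logits, seq_len, steps=4):
    """Parallel iterative decoding in the spirit of MaskGit.

    predict_logits: fn(tokens) -> (seq_len, vocab_size) logits; a stand-in
                    for the bidirectional transformer prior.
    Starts fully masked; each round commits the most confident predictions
    and re-masks the rest, following a cosine unmasking schedule.
    """
    tokens = np.full(seq_len, MASK)
    for t in range(steps):
        logits = predict_logits(tokens)
        # Softmax to get per-position confidence (greedy variant).
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        conf[tokens != MASK] = np.inf              # committed slots stay fixed
        # Cosine schedule: how many positions remain masked after this round.
        n_masked = int(seq_len * np.cos(np.pi / 2 * (t + 1) / steps))
        order = np.argsort(conf, kind="stable")    # least confident first
        new = np.where(tokens == MASK, pred, tokens)
        new[order[:n_masked]] = MASK               # re-mask low-confidence slots
        tokens = new
    return tokens                                  # fully committed at the end
```

Because every masked position is predicted in parallel, the number of model calls is the (small) number of refinement steps rather than the sequence length, which is the source of the faster generation compared to an autoregressive prior.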

2. PRELIMINARIES

2.1 VQ-GAN

VQ-GAN (Esser et al., 2021; Van Den Oord et al., 2017) is an autoencoder that learns to compress data into a set of discrete latents, consisting of an encoder E, decoder G, codebook C, and discriminator D. Given an image x ∈ R^{H×W×3}, the encoder E maps x to its latent representation h ∈ R^{H'×W'×D}, which is quantized by nearest-neighbor lookup in a codebook of embeddings C = {e_i}_{i=1}^{K} to produce z ∈ R^{H'×W'×D}. The discretized latent z is fed through the decoder G to reconstruct the image, while the discriminator D provides an adversarial loss that encourages sharp reconstructions.
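The nearest-neighbor quantization step can be sketched in a few lines of numpy. This is an illustrative stand-alone version only: the actual model additionally needs a straight-through gradient estimator and the codebook/commitment losses, which are omitted here.

```python
import numpy as np

def quantize(h, codebook):
    """Nearest-neighbor vector quantization as in VQ-GAN.

    h:        (H', W', D) continuous encoder output
    codebook: (K, D) embedding table C = {e_i}
    Returns the quantized latents z of shape (H', W', D) and the
    (H', W') grid of code indices.
    """
    flat = h.reshape(-1, h.shape[-1])                       # (H'*W', D)
    # Squared Euclidean distance from each position to every codebook entry.
    d = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)                                  # nearest code id
    z = codebook[idx].reshape(h.shape)                      # replace with embeddings
    return z, idx.reshape(h.shape[:-1])
```

The index grid is what a latent dynamics model actually predicts: each frame is reduced from H×W×3 pixels to H'×W' discrete tokens, which is what makes conditioning on hundreds of frames tractable.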



Figure 1: TECO generates sharp and consistent video predictions for hundreds of frames on challenging datasets. The figure shows evenly spaced frames of the 264-frame predictions, after conditioning on 36 context frames. From top to bottom, the datasets are DMLab, Minecraft, Habitat, and Kinetics-600.

Videos available at https://sites.google.com/view/iclr23-teco

