PATCHBLENDER: A MOTION PRIOR FOR VIDEO TRANSFORMERS

Abstract

Transformers have become one of the dominant architectures in the field of computer vision. However, several challenges remain when applying such architectures to video data. Most notably, these models struggle to model the temporal patterns of video data effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method is successful at enabling vision transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, we show that our method improves the baseline performance of video Transformers. PatchBlender has the advantage of being compatible with almost any Transformer architecture, and since it is learnable, the model can adaptively turn the prior on or off. It is also extremely lightweight, adding only 0.005% of the GFLOPs of a ViT-B.

1. INTRODUCTION

The Transformer (Vaswani et al., 2017) has become one of the dominant architectures in many fields of machine learning (Brown et al., 2020; Devlin et al., 2019; Dosovitskiy et al., 2020). Initially proposed for natural language processing (Vaswani et al., 2017), it has since been shown to outperform convolutional neural networks in the image domain (Dosovitskiy et al., 2020). Adapting such vision models to the video domain has been straightforward and has produced new state-of-the-art results (Arnab et al., 2021). Since then, multiple Transformer-based methods have been proposed (Bertasius et al., 2021; Fan et al., 2021; Liu et al., 2021), making steady progress on a variety of challenges in the video domain.

Despite these advances, many challenges remain when applying Transformers to video data. One such challenge is the lack of a strong inductive bias in the Transformer architecture with respect to temporal patterns. This challenge is most evident in the attention mechanism, where it can be difficult for patches to attend to relevant patches across time. Picture, for example, multiple frames of a blue sky with birds: how is a given patch supposed to attend to its corresponding spatial location in each frame? If it cannot do so, how would it know whether a bird was present in that patch at some point in time? This issue makes it difficult for Transformers to properly model video data.

To address the challenge of modeling temporal information in video data, we propose a new temporal prior for video Transformer architectures. This novel prior, called PatchBlender, is a learnable smoothing layer introduced in the Transformer. The layer allows the model to blend tokens in the latent space of video frames along the temporal dimension. We show that this simple technique provides a strong inductive bias with respect to video data, as it allows for easier mapping of relevant spatio-temporal patches in the attention mechanism.
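To make the idea concrete, the temporal blending described above can be sketched as a learnable matrix of mixing weights over frames, applied at every spatial location. This is a minimal illustrative sketch, not the paper's implementation: the class name, the near-identity initialization, and the softmax row normalization are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class PatchBlenderSketch:
    """Hypothetical sketch of a temporal blending layer.

    Holds a learnable T x T matrix of blending logits. Each output
    frame's patch embedding is a convex combination (softmax over
    source frames) of the input embeddings at the same spatial
    location across all frames.
    """

    def __init__(self, num_frames, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed near-identity initialization: the layer starts
        # close to a no-op, so the model can learn to turn the
        # temporal prior on or off.
        self.logits = 5.0 * np.eye(num_frames) \
            + 0.01 * rng.standard_normal((num_frames, num_frames))

    def __call__(self, x):
        # x: (T, N, D) = frames, patches per frame, embedding dim
        w = softmax(self.logits, axis=-1)      # each row sums to 1
        # Blend across time, independently per patch position.
        return np.einsum("ts,snd->tnd", w, x)
```

Because each row of the softmaxed matrix sums to one, every blended token is a weighted average of the same patch position across frames, which is what lets attention in later layers see temporally smoothed content at a fixed spatial location.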
We evaluate our method on three video benchmarks. Experiments on MOVi-A (Greff et al., 2022) show that Vision Transformers (ViT) (Dosovitskiy et al., 2020) with PatchBlender are more accurate at predicting the position and velocity of falling objects than the baseline. On Something-Something v2 (Goyal et al., 2017), we also find that PatchBlender improves the baseline performance of ViT and MViTv2 (Li et al., 2021). We further include results on Kinetics400 and show that PatchBlender learns to weakly exploit the temporal aspect of that dataset. Specifically, unlike Something-Something v2 and MOVi-A, it has been shown that one can achieve competitive performance on Kinetics400 even without temporal information (Fan et al., 2021; Sevilla-Lara et al., 2019). Interestingly, we provide further evidence for this, as we show that our PatchBlender

