PATCHBLENDER: A MOTION PRIOR FOR VIDEO TRANSFORMERS

Abstract

Transformers have become one of the dominant architectures in computer vision. However, several challenges remain when applying them to video data; most notably, these models struggle to model the temporal patterns of video effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method enables vision Transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, our method improves the baseline performance of video Transformers. PatchBlender has the advantage of being compatible with almost any Transformer architecture, and since it is learnable, the model can adaptively turn the prior on or off. It is also extremely lightweight, adding only 0.005% to the GFLOPs of a ViT-B.

1. INTRODUCTION

The Transformer (Vaswani et al., 2017) has become one of the dominant architectures across many fields of machine learning (Brown et al., 2020; Devlin et al., 2019; Dosovitskiy et al., 2020). Initially proposed for natural language processing (Vaswani et al., 2017), it has since been shown to outperform convolutional neural networks in the image domain (Dosovitskiy et al., 2020). Adapting such vision models to the video domain has been straightforward and has yielded new state-of-the-art results (Arnab et al., 2021). Since then, multiple Transformer-based methods have been proposed (Bertasius et al., 2021; Fan et al., 2021; Liu et al., 2021), making steady progress on a variety of challenges in the video domain.

Despite these advances, many challenges remain when applying Transformers to video data. One such challenge is the lack of a strong inductive bias in the Transformer architecture with respect to temporal patterns. This is most evident in the attention mechanism, where it can be difficult for patches to attend to relevant patches across time. Picture, for example, multiple frames of a blue sky with birds: how is a given patch supposed to attend to its relevant spatial location in each frame? And if it cannot, how would it know whether a bird was present in that patch at some point in time? This issue makes it difficult for Transformers to properly model video data.

To address the challenge of modeling temporal information in video data, we propose a new temporal prior for video Transformer architectures. This novel prior, called PatchBlender, is a learnable smoothing layer introduced in the Transformer. The layer allows the model to blend tokens in the latent space of video frames along the temporal dimension. We show that this simple technique provides a strong inductive bias for video data, as it allows for easier mapping of relevant spatio-temporal patches in the attention mechanism.
We evaluate our method on three video benchmarks. Experiments on MOVi-A (Greff et al., 2022) show that Vision Transformers (ViT) (Dosovitskiy et al., 2020) with PatchBlender are more accurate at predicting the position and velocity of falling objects than the baseline. On Something-Something v2 (Goyal et al., 2017), we also find that PatchBlender improves the baseline performance of ViT and MViTv2 (Li et al., 2021). We further include results on Kinetics400 and show that PatchBlender learns to only weakly exploit the temporal aspect of that dataset. This is consistent with prior findings: unlike Something-Something v2 and MOVi-A, competitive performance can be achieved on Kinetics400 even without temporal information (Fan et al., 2021; Sevilla-Lara et al., 2019). Interestingly, we provide further evidence of this, as PatchBlender learns an identity-like function as the optimal smoothing operation on Kinetics400. Finally, our method is very lightweight compute-wise, adding only 0.005% to the total GFLOPs of a ViT-B.

In contrast with prior methods, we neither restructure the Transformer into a hierarchical architecture nor change the attention mechanism itself. Instead, we introduce a new layer which can be inserted almost anywhere in the Transformer. It takes the latent representation of the frames at the step where the layer is inserted and blends them along the temporal dimension. The layer is thus compatible with almost any Transformer variant and attention mechanism: as long as there is a latent representation for each frame or each patch, they can be blended.
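The core operation can be sketched as a learnable matrix of blending ratios applied along the temporal axis of the latent tensor. Below is a hypothetical NumPy rendition; the class name, shapes, and the near-identity initialisation are our illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class PatchBlender:
    """Sketch of a learnable temporal blending layer (assumed details)."""

    def __init__(self, num_frames, seed=0):
        rng = np.random.default_rng(seed)
        # Initialise near the identity so the layer starts close to a
        # no-op; training can then strengthen or switch off the prior.
        self.logits = (5.0 * np.eye(num_frames)
                       + 0.01 * rng.standard_normal((num_frames, num_frames)))

    def __call__(self, x):
        # x: (T, P, D) = (frames, patches per frame, embedding dim)
        w = softmax(self.logits, axis=-1)  # each row sums to 1
        # Every output frame is a convex combination of all T frames.
        return np.einsum('ts,spd->tpd', w, x)

# Toy usage: 4 frames, 196 patches each, 768-dim embeddings (ViT-B-like).
blender = PatchBlender(num_frames=4)
latents = np.random.default_rng(1).standard_normal((4, 196, 768))
blended = blender(latents)
```

Because the blending weights are parameters rather than fixed coefficients, the model can learn anything from an identity mapping (no blending) to heavy temporal smoothing, which is how the prior can be adaptively turned on or off.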



Figure 1: Illustrative example of PatchBlender. On the left is an example of learned blending ratios, where the diagonal represents the frame being blended and the other values in a given row correspond to the blending ratios of the other frames. We then apply these ratios to a sequence of four frames, which yields a pattern blending each frame with its past frames. The final result is a sequence of four frames, each lightly blended with past frames, depicting the motion leading to that point in time. Note that while this example uses RGB frames, our method is actually applied to the latent representation of these frames within the Transformer.
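To make the caption concrete, here is a toy numeric version of the pattern Figure 1 depicts; the ratio values and the tiny 2x2 "frames" are invented for illustration. A lower-triangular ratio matrix blends each frame with its past while leaving the first frame untouched:

```python
import numpy as np

# Hypothetical blending ratios in the style of Figure 1: the diagonal
# holds the weight of the frame itself; earlier columns hold the
# weights of past frames. Each row sums to 1.
ratios = np.array([
    [1.00, 0.00, 0.00, 0.00],
    [0.15, 0.85, 0.00, 0.00],
    [0.10, 0.15, 0.75, 0.00],
    [0.05, 0.10, 0.15, 0.70],
])

frames = np.arange(16, dtype=float).reshape(4, 2, 2)  # four toy 2x2 frames
blended = np.einsum('ts,shw->thw', ratios, frames)
# Frame 0 is unchanged; each later frame carries a faint trace of its past.
```

In the actual method the same contraction is applied to latent patch embeddings rather than RGB pixels, and the ratios are learned rather than hand-picked.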

2. RELATED WORK

Several prior works have explored how best to handle temporal data in machine learning, e.g., using convolutional neural networks (Ji et al., 2013). Notably, Feichtenhofer et al. (2018) propose a combination of two networks, one operating at a low frame rate to model long-term dependencies, while the other operates at a fast frame rate to model local dependencies. Liu et al. (2020) also propose adaptive temporal kernels to better model complex temporal dynamics. With respect to Transformers, one popular approach for incorporating a spatio-temporal bias into the model has been to process the data in a hierarchical manner (Arnab et al., 2021; Chen et al., 2021; Fan et al., 2021; Li et al., 2022; Liu et al., 2021; Yan et al., 2022; Zha et al., 2021). Other works have explored modifying the attention mechanism in order to enforce a temporal bias (Bertasius et al., 2021; Bulat et al., 2021; Guo et al., 2021; He et al., 2020; Patrick et al., 2021; Zhang et al., 2021). Another type of approach has been to incorporate motion information into the model in an explicit form. Chen & Ho (2021) process all the information available from raw video data, which includes motion and audio. Wang & Torresani (2022) propose to use motion information in order to determine where to attend in the Transformer's attention mechanism. Other non-Transformer works which make use of motion typically consist of a two-stream network, with the RGB and motion data handled separately (Diba et al., 2016; Feichtenhofer et al., 2016b; Girdhar et al., 2017; Gkioxari & Malik, 2014; Simonyan & Zisserman, 2014) or jointly throughout the network (Feichtenhofer et al., 2016a; 2017; Jiang et al., 2019; Wang et al., 2019; Zhang et al., 2016).

