LINEAR VIDEO TRANSFORMER WITH FEATURE FIXATION

Abstract

Vision Transformers have achieved impressive performance in video classification, but they suffer from the quadratic complexity of the Softmax attention mechanism. Some studies reduce the computational cost by limiting the number of tokens involved in the attention calculation, but the complexity remains quadratic. Another promising approach is to replace Softmax attention with linear attention, which has linear complexity but shows a clear performance drop. We find that this drop results from the lack of attention concentration on critical features. Therefore, we propose a feature fixation module that reweights the feature importance of the query and key before computing linear attention. Specifically, we regard the query, key, and value as different latent representations of the input token, and learn the feature fixation ratio by aggregating Query-Key-Value information, which helps measure feature importance comprehensively. Furthermore, we enhance feature fixation with neighbourhood association, which leverages additional guidance from spatially and temporally neighbouring tokens. The proposed method significantly improves the linear attention baseline and achieves state-of-the-art performance among linear video Transformers on three popular video classification benchmarks. With fewer parameters and higher efficiency, our performance is even comparable to some Softmax-based quadratic Transformers.

1. INTRODUCTION

Vision Transformers [19, 86, 49, 82, 15] have been successfully applied in video processing. Recent video Transformers [57, 4, 7, 6, 20] have achieved remarkable performance on challenging video classification benchmarks, e.g., Something-Something V2 [25] and . However, they suffer from quadratic computational complexity, which is caused by the Softmax operation in the attention module [77, 93]. This quadratic complexity severely constrains the application of video Transformers, since video processing requires handling a huge number of input tokens across both the spatial and temporal dimensions. Most existing efficient designs for video Transformers attempt to reduce the number of tokens attended to in the attention calculation. For example, [4, 6] factorize spatial and temporal tokens with different encoders or multi-head self-attention modules, so that each module deals with only a subset of the input tokens. [7] calculates the attention only from the target frame, i.e., space-only. [57] restricts tokens to a spatial neighbourhood that can reflect dynamic motions implicitly. Although these designs reduce the computational cost, they retain the inherent quadratic complexity, which prevents them from scaling up to longer inputs, e.g., more video frames or larger resolution [6, 57]. Another practical approach, widely used in the Natural Language Processing (NLP) community, is to decompose Softmax with certain kernel functions and linearize the attention via the associative property of matrix multiplication [60, 35, 14]. Recent linear video Transformers [57, 93] achieve higher efficiency, but show a clear performance drop compared to Softmax attention. We find that the degraded performance is mainly attributable to the lack of attention concentration on critical features in linear attention. This observation is consistent with concurrent works in NLP [60, 85].
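The linearization mentioned above can be illustrated with a minimal NumPy sketch. It is not the implementation of any cited method: the feature map elu(x) + 1 and all function names are assumptions chosen for illustration. The point it shows is that, once Softmax is replaced by a kernel feature map, regrouping the matrix product yields the same output without ever forming the N x N attention matrix.

```python
import numpy as np


def feature_map(x):
    """Positive kernel feature map phi(x) = elu(x) + 1 (an assumed choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def softmax_attention(q, k, v):
    """Standard attention: builds an N x N score matrix, so cost is O(N^2 d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention_quadratic(q, k, v):
    """Kernelized attention computed the slow way: (phi(Q) phi(K)^T) V."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T                      # N x N
    return (scores @ v) / scores.sum(axis=-1, keepdims=True)


def linear_attention(q, k, v):
    """Same result via associativity: phi(Q) (phi(K)^T V), cost O(N d^2)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                              # d x d_v, independent of N
    normalizer = phi_q @ phi_k.sum(axis=0)        # row sums of the N x N scores
    return (phi_q @ kv) / normalizer[:, None]
```

For N tokens of dimension d, `linear_attention` only stores a d x d summary of the keys and values, which is why its cost grows linearly with the number of video tokens. Its output matches `linear_attention_quadratic` up to floating-point error, but it generally differs from `softmax_attention`, since the kernel only approximates Softmax.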
Accordingly, we propose to concentrate linear attention on critical features through feature fixation. To this end, we reweight the feature importance of the query and key prior to computing linear attention. Inspired by classic Gaussian pyramids [53, 43] and modern contrastive learning [64, 72], we regard the query, key, and value as latent representations of the input token. Latent space projection does not destroy the image structure, i.e., salient features are expected to be discriminative across all spaces [71]. Therefore, we aggregate Query-Key-Value features when generating the fixation ratio, which helps measure feature importance comprehensively. Meanwhile, salient feature activations are usually locally accumulative [11, 67]. In a continuous video, critical information contained in the target token may be shared by its spatial and/or temporal neighbour tokens. We hence propose to use neighbourhood association as extra guidance for feature fixation. Specifically, we reconstruct each key and value vector (these are responsible for information exchange between tokens [29, 7]) by sequentially mixing the key/value features of nearby tokens in the spatial and temporal domains. This is efficiently realized with the feature shift technique [84, 91]. Experimental results demonstrate the effectiveness of feature fixation and neighbourhood association in improving classification accuracy. Our primary contributions are summarized as follows:
1) We discover that the performance drop of linear attention in linear video Transformers results from not concentrating the attention distribution on discriminative features.
2) We develop a novel linear video Transformer with feature fixation to concentrate salient attention. It is further enhanced by neighbourhood association, which leverages information from nearby tokens to better measure feature importance. Our method is simple and effective, and can be seamlessly applied to other linear attention variants to improve their performance.
3) Our model achieves state-of-the-art results among linear video Transformers on popular video classification benchmarks (see Fig. 1), and its performance is even comparable to some Softmax-based quadratic Transformers with fewer parameters and higher efficiency.
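To make the two ideas above more concrete, here is a rough NumPy sketch of feature fixation guided by neighbourhood association. It is only an illustration under stated assumptions, not the paper's actual module: the sigmoid gate, the single shared ratio for query and key, the hypothetical projection `w_gate`, the channel grouping, and the zero padding of the shift are all choices made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def shift_one(x, axis, step):
    """Shift x by `step` (+1 or -1) along `axis`; vacated slots become zero."""
    out = np.zeros_like(x)
    src = [slice(None)] * x.ndim
    dst = [slice(None)] * x.ndim
    if step > 0:
        src[axis], dst[axis] = slice(None, -step), slice(step, None)
    else:
        src[axis], dst[axis] = slice(-step, None), slice(None, step)
    out[tuple(dst)] = x[tuple(src)]
    return out


def shift_mix(x, group=1):
    """Mix each token with its spatial and temporal neighbours.

    x: (T, H, W, C). Six disjoint channel groups of size `group` are shifted
    by +1 / -1 along H, W and then T; the remaining channels are untouched.
    (Assumed layout; the grouping is a hypothetical choice.)
    """
    assert 6 * group <= x.shape[-1], "not enough channels to shift"
    out = x.copy()
    start = 0
    for axis in (1, 2, 0):              # spatial axes first, then temporal
        for step in (1, -1):
            ch = slice(start, start + group)
            out[..., ch] = shift_one(x[..., ch], axis, step)
            start += group
    return out


def feature_fixation(q, k, v, w_gate):
    """Reweight query/key channels with a ratio computed from [q; k; v].

    q, k, v: (T, H, W, C). w_gate: (3C, C), a hypothetical learnable matrix.
    Keys and values are first mixed with neighbouring tokens so that the
    ratio also reflects the local neighbourhood.
    """
    k_n, v_n = shift_mix(k), shift_mix(v)
    qkv = np.concatenate([q, k_n, v_n], axis=-1)   # (T, H, W, 3C)
    ratio = sigmoid(qkv @ w_gate)                  # (T, H, W, C), in (0, 1)
    return q * ratio, k * ratio
```

The reweighted query and key would then be passed to a linear attention layer in place of the originals. Because the shift only moves existing channels between neighbouring tokens, it adds no parameters and no multiplications.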

2. RELATED WORK

Video classification by CNNs. CNN-based video networks are normally built upon 2D image classification models. To aggregate temporal information, current works either fuse the features of each video frame [80, 45], or build spatial-temporal relations through 3D convolutions [28, 62, 73] or RNNs [18, 40, 42]. Although the latter have achieved state-of-the-art performance, they are significantly more compute- and memory-intensive. Several techniques have been proposed to improve efficiency, e.g., temporal shift [45, 51], adaptive frame sampling [37, 3, 88, 87], and spatial/temporal factorization [74, 32, 75, 21].

Figure 1: Video classification performance of linear Transformers. We report Top-1 accuracy and GFLOPs of the state-of-the-art linear video Transformers on (a) SSv2 and (b) K400 datasets. Our model achieves the best accuracy among its counterparts.

Transformers. Vision Transformers [19, 86, 49, 82, 15] have achieved great success in computer vision tasks. It is straightforward to extend the idea of vision Transformers to video processing, considering both the spatial and temporal dependencies. However, full spatial-temporal attention is computationally expensive, and hence several video Transformers try to reduce the cost via factorization. TimeSformer [6] explores five types of factorization, and empirically finds that divided attention, where spatial and temporal features are processed separately, leads to the best accuracy. A similar conclusion is drawn in ViViT [4], which separates spatial and temporal features with different encoders or multi-head self-attention (MHSA) modules. [7] introduces a local temporal window where features at the same spatial position are mixed. Other methods have also explored dimension reduction [24, 34], token selection [78, 2], local-global attention stratification [44], deformable attention [77], temporal down-sampling [65, 92], locality strengthening [50, 23], hierarchical learning [90, 89], multi-scale fusion [20, 26, 83], and so on.

