LEARNING SELF-SIMILARITY IN SPACE AND TIME AS A GENERALIZED MOTION FOR ACTION RECOGNITION

Abstract

Spatio-temporal convolution often fails to learn motion dynamics in videos, and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed method is implemented as a neural block, dubbed SELFY, that can be easily inserted into neural architectures and learned end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling, as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks, Something-Something V1 & V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.

1. INTRODUCTION

Learning spatio-temporal dynamics is the key to video understanding. To this end, extending convolutional neural networks (CNNs) with spatio-temporal convolution has been actively investigated in recent years (Tran et al., 2015; Carreira & Zisserman, 2017; Tran et al., 2018). The empirical results so far indicate that spatio-temporal convolution alone is not sufficient for grasping the whole picture; it often learns irrelevant context bias rather than motion information (Materzynska et al., 2020), and thus the additional use of optical flow turns out to boost the performance in most cases (Carreira & Zisserman, 2017; Lin et al., 2019). Motivated by this, recent action recognition methods learn to extract explicit motion, i.e., flow or correspondence, between feature maps of adjacent frames, and they do improve the performance (Li et al., 2020c; Kwon et al., 2020). But is it essential to extract such an explicit form of flows or correspondences? How can we learn a richer and more robust form of motion information for videos in the wild?

Figure 1: Spatio-temporal self-similarity (STSS) representation learning. STSS represents each spatio-temporal position (query) as its similarities (STSS tensor) with its neighbors in space and time (neighborhood). STSS allows us to take a generalized, far-sighted view on motion, i.e., both short-term and long-term, both forward and backward, as well as spatial self-motion. Our method learns to extract a rich and effective motion representation from STSS without additional supervision.

In this paper, we propose to learn spatio-temporal self-similarity (STSS) representation for video understanding. Self-similarity is a relational descriptor for an image that effectively captures internal structures by representing each local region as similarities to its spatial neighbors (Shechtman & Irani, 2007). Given a sequence of frames, i.e., a video, it extends along the temporal dimension and thus represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, STSS enables a learner to better recognize structural patterns in space and time. For neighbors at the same frame it computes a spatial self-similarity map, while for neighbors at a different frame it extracts a motion likelihood map. If we fix our attention to the similarity map to the very next frame within STSS and attempt to extract a single displacement vector to the most likely position at that frame, the problem reduces to optical flow, which is a particular type of motion information. In contrast, we leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it in an end-to-end manner. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. We introduce a neural block for STSS representation, dubbed SELFY, that can be easily inserted into neural architectures and learned end-to-end without additional supervision. Our experimental analysis demonstrates its superiority over previous methods for motion modeling, as well as its complementarity to spatio-temporal features from direct convolutions. On the standard benchmarks for action recognition, Something-Something V1&V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.

2. RELATED WORK

Video action recognition. Video action recognition is the task of categorizing videos into pre-defined action classes. One of the central topics in action recognition is capturing temporal dynamics in videos. In deep learning, many approaches attempt to learn temporal dynamics in different ways: two-stream networks with external optical flow (Simonyan & Zisserman, 2014; Wang et al., 2016), recurrent networks (Donahue et al., 2015), and 3D CNNs (Tran et al., 2015; Carreira & Zisserman, 2017). Recent approaches have introduced advanced 3D CNNs (Tran et al., 2018; 2019; Feichtenhofer, 2020; Lin et al., 2019; Fan et al., 2020) and show the effectiveness of capturing spatio-temporal features, so that 3D CNNs have now become a de facto standard for learning temporal dynamics in video. However, spatio-temporal convolution is vulnerable unless relevant features are well aligned across frames within the fixed-size kernel. To address this issue, a few methods adaptively translate the kernel offsets with deformable convolutions (Zhao et al., 2018; Li et al., 2020a), while several methods (Feichtenhofer et al., 2019; Li et al., 2020b) modulate other hyper-parameters, e.g., a higher frame rate or larger spatial receptive fields. Unlike these methods, we address the problem of spatio-temporal convolution with a sufficient volume of STSS, capturing far-sighted spatio-temporal relations.

Learning motion features. Since external optical flow helps 3D CNNs improve action recognition accuracy (Carreira & Zisserman, 2017; Zolfaghari et al., 2018; Tran et al., 2018), several approaches try to learn frame-by-frame motion features from RGB sequences inside neural architectures. Fan et al. (2018) and Piergiovanni & Ryoo (2019) internalize TV-L1 (Zach et al., 2007) optical flow into the CNN. Frame-wise feature differences (Sun et al., 2018b; Lee et al., 2018; Jiang et al., 2019; Li et al., 2020c) are also utilized as motion features.
Recent correlation-based methods (Wang et al., 2020; Kwon et al., 2020) adopt a correlation operator (Dosovitskiy et al., 2015; Sun et al., 2018a; Yang & Ramanan, 2019) to learn motion features between adjacent frames. However, these methods compute frame-by-frame motion features between two adjacent frames and then rely on stacked spatio-temporal convolutions to capture long-range motion dynamics. We propose to learn STSS features, as generalized motion features, that capture both short-term and long-term interactions in the video.

Self-similarity. Self-similarity represents the internal geometric layout of images. It is widely used in many computer vision tasks, such as object detection (Shechtman & Irani, 2007), image retrieval (Hörster & Lienhart, 2008), and semantic correspondence matching (Kim et al., 2015; 2017). In the video domain, Shechtman & Irani (2007) first introduced the concept of STSS and transformed it into a hand-crafted local descriptor for action detection. Inspired by this work, early methods adopt self-similarities for capturing view-invariant temporal patterns (Junejo et al., 2008; 2010; Körner & Denzler, 2013), but they use temporal self-similarities only, due to computational costs. Recently, several non-local approaches (Wang et al., 2018; Liu et al., 2019) utilize STSS for capturing long-range dynamics of videos. However, they use STSS for re-weighting or aligning visual features, which is an indirect way of using STSS. Different from these methods, our method leverages the full STSS directly as generalized motion information and learns an effective representation for action recognition within a video-processing architecture. To the best of our knowledge, our work is the first to learn STSS representation using modern CNNs.

The contributions of our paper can be summarized as follows. First, we revisit the notion of self-similarity and propose to learn a generalized, far-sighted motion representation from STSS. Second, we implement STSS representation learning as a neural block, dubbed SELFY, that can be integrated into existing neural architectures. Third, we provide a comprehensive evaluation of SELFY, achieving the state of the art on the Something-Something V1&V2, Diving-48, and FineGym benchmarks.

3. OUR APPROACH

In this section, we first revisit the notion of self-similarity and discuss its relation to motion. We then introduce our method for learning an effective spatio-temporal self-similarity representation, which can be easily integrated into video-processing architectures and learned end-to-end.

3.1. SELF-SIMILARITY REVISITED

Self-similarity is a relational descriptor that suppresses variations in appearance and reveals structural patterns in images or videos (Shechtman & Irani, 2007). Given an image feature map I ∈ R^{X×Y×C}, the self-similarity transformation of I results in a 4D tensor S ∈ R^{X×Y×U×V}, whose elements are defined as

S_{x,y,u,v} = sim(I_{x,y}, I_{x+u,y+v}),

where sim(·,·) is a similarity function, e.g., cosine similarity. Here, (x, y) is a query coordinate while (u, v) is a spatial offset from it. To impose locality, the offset is restricted to a neighborhood: (u, v) ∈ [−d_U, d_U] × [−d_V, d_V], so that U = 2d_U + 1 and V = 2d_V + 1, respectively. By converting the C-dimensional appearance feature I_{x,y} into the UV-dimensional relational feature S_{x,y}, it suppresses variations in appearance and reveals spatial structures in the image. Note that the self-similarity transformation is closely related to the conventional cross-similarity (or correlation) across two different feature maps (I, I′ ∈ R^{X×Y×C}), which can be defined as S_{x,y,u,v} = sim(I_{x,y}, I′_{x+u,y+v}). Given two images of a moving object, the cross-similarity transformation effectively captures motion information and is thus commonly used in optical flow and correspondence estimation (Dosovitskiy et al., 2015; Sun et al., 2018a; Yang & Ramanan, 2019).

For a sequence of frames, i.e., a video, one can naturally extend the spatial self-similarity along the temporal axis. Let V ∈ R^{T×X×Y×C} denote a feature map of the video with T frames. The spatio-temporal self-similarity (STSS) transformation of V results in a 6D tensor S ∈ R^{T×X×Y×L×U×V}, whose elements are defined as

S_{t,x,y,l,u,v} = sim(V_{t,x,y}, V_{t+l,x+u,y+v}),   (1)

where (t, x, y) is a spatio-temporal coordinate and (l, u, v) is a spatio-temporal offset from it. In addition to the locality of the spatial offsets above, the temporal offset l is also restricted to a temporal neighborhood: l ∈ [−d_L, d_L], so that L = 2d_L + 1.
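To make the transformation concrete, here is a minimal, naive NumPy sketch of the STSS transformation in Eq. (1) with cosine similarity. The function name, the loop-based implementation, and the choice of zero similarity at out-of-bounds neighbors are our own illustrative assumptions; a practical implementation would use vectorized or CUDA kernels.

```python
import numpy as np

def stss(V, dL=1, dU=2, dV=2, eps=1e-8):
    """Spatio-temporal self-similarity (Eq. 1) with cosine similarity.

    V: video feature map of shape (T, X, Y, C).
    Returns S of shape (T, X, Y, L, U, V), where L = 2*dL + 1,
    U = 2*dU + 1, V = 2*dV + 1; out-of-bounds neighbors get 0.
    """
    T, X, Y, C = V.shape
    # L2-normalize channel vectors so the dot product is cosine similarity.
    Vn = V / (np.linalg.norm(V, axis=-1, keepdims=True) + eps)
    S = np.zeros((T, X, Y, 2*dL+1, 2*dU+1, 2*dV+1), dtype=V.dtype)
    for li, l in enumerate(range(-dL, dL + 1)):
        for ui, u in enumerate(range(-dU, dU + 1)):
            for vi, v in enumerate(range(-dV, dV + 1)):
                for t in range(T):
                    for x in range(X):
                        for y in range(Y):
                            t2, x2, y2 = t + l, x + u, y + v
                            if 0 <= t2 < T and 0 <= x2 < X and 0 <= y2 < Y:
                                S[t, x, y, li, ui, vi] = Vn[t, x, y] @ Vn[t2, x2, y2]
    return S
```

At the zero offset (l, u, v) = (0, 0, 0), each entry is the cosine similarity of a feature with itself, which is 1 up to numerical precision.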
What types of information does STSS describe? Interestingly, for each time t, the STSS tensor S can be decomposed along the temporal offset l into a single spatial self-similarity tensor (when l = 0) and 2d_L spatial cross-similarity tensors (when l ≠ 0); the partial tensors with a small offset (e.g., l = −1 or +1) collect motion information from adjacent frames, and those with larger offsets capture it from further frames, both forward and backward in time. Unlike previous approaches to learning motion (Dosovitskiy et al., 2015; Wang et al., 2020; Kwon et al., 2020), which rely on cross-similarity between adjacent frames, STSS allows us to take a generalized, far-sighted view on motion, i.e., both short-term and long-term, both forward and backward, as well as spatial self-motion.

3.2. SPATIO-TEMPORAL SELF-SIMILARITY REPRESENTATION LEARNING

By leveraging the rich information in STSS, we propose to learn a generalized motion representation for video understanding. To achieve this goal without additional supervision, we design a neural block, dubbed SELFY, which can be inserted into video-processing architectures and learned end-to-end. The overall structure is illustrated in Fig. 2. It consists of three steps: self-similarity transformation, feature extraction, and feature integration. Given the input video feature tensor V, the self-similarity transformation step converts it to the STSS tensor S as in Eq. (1). In the following, we describe the feature extraction and integration steps.

3.2.1. FEATURE EXTRACTION

From the STSS tensor S ∈ R^{T×X×Y×L×U×V}, we extract a C_F-dimensional feature for each spatio-temporal position (t, x, y) and temporal offset l, so that the resultant tensor is F ∈ R^{T×X×Y×L×C_F}, which is equivariant to translation in space, time, and temporal offset. The L dimension is preserved to extract motion information across different temporal offsets in a consistent manner. While there exist many design choices, we introduce three methods for feature extraction in this work.

Soft-argmax. The first method is to compute explicit displacement fields from S, as previous motion learning methods do using spatial cross-similarity (Dosovitskiy et al., 2015; Sun et al., 2018a; Yang & Ramanan, 2019). One may extract the displacement field by indexing the position with the highest similarity value via arg max_{(u,v)}, but this is not differentiable. We instead use soft-argmax (Chapelle & Wu, 2010), which aggregates displacement vectors with softmax weighting (Fig. 3a). The soft-argmax feature extraction can be formulated as

F_{t,x,y,l} = Σ_{u,v} [ exp(S_{t,x,y,l,u,v}/τ) / Σ_{u′,v′} exp(S_{t,x,y,l,u′,v′}/τ) ] [u; v],

which results in a feature tensor F ∈ R^{T×X×Y×L×2}. The temperature factor τ adjusts the softmax distribution, and we set τ = 0.01 in our experiments.
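A minimal NumPy sketch of this soft-argmax extraction (the function name and the max-subtraction for numerical stability are our additions; the computation follows the formula above):

```python
import numpy as np

def soft_argmax(S, tau=0.01):
    """Soft-argmax feature extraction from an STSS tensor.

    S: similarity tensor of shape (T, X, Y, L, U, V).
    Returns F of shape (T, X, Y, L, 2): the expected (u, v) displacement
    under the softmax distribution over each (U, V) similarity map.
    """
    T, X, Y, L, U, V = S.shape
    dU, dV = (U - 1) // 2, (V - 1) // 2
    logits = S.reshape(T, X, Y, L, U * V) / tau
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over (u, v)
    u, v = np.meshgrid(np.arange(-dU, dU + 1),
                       np.arange(-dV, dV + 1), indexing="ij")
    disp = np.stack([u.ravel(), v.ravel()], axis=-1)   # (U*V, 2)
    return w @ disp                                    # (T, X, Y, L, 2)
```

With a low temperature such as τ = 0.01, the softmax is sharply peaked, so the output approaches the hard arg-max displacement while remaining differentiable.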

Multi-layer perceptron (MLP).

The second method is to learn an MLP that converts self-similarity values into a feature. For this, we flatten the (U, V) volume into UV-dimensional vectors and apply an MLP to them (Fig. 3b). For the reshaped tensor S ∈ R^{T×X×Y×L×UV}, a perceptron f(·) can be expressed as f(S) = ReLU(S ×_5 W_φ), where ×_n denotes the n-mode tensor product, W_φ ∈ R^{C′×UV} is the perceptron parameter, and the output is f(S) ∈ R^{T×X×Y×L×C′}. The MLP feature extraction can thus be formulated as F = (f_n ∘ f_{n−1} ∘ ⋯ ∘ f_1)(S), which produces a feature tensor F ∈ R^{T×X×Y×L×C_F}. This method is more flexible and may also be more effective than soft-argmax: not only can it encode displacement information, but it can also directly access the similarity values, which may be helpful for learning the motion distribution.

Convolution. The third method is to learn convolution kernels over the (L, U, V) volume of S (Fig. 3c). When we regard S as a 7D tensor S ∈ R^{T×X×Y×L×U×V×C} with C = 1, a convolution layer g can be expressed as S′ = g(S), whose elements are computed by

S′_{t,x,y,l,u,v,c′} = ReLU( Σ_{l_κ,u_κ,v_κ,c} K_{l_κ,u_κ,v_κ,c,c′} S_{t,x,y,l+l̂_κ,u+û_κ,v+v̂_κ,c} ),   (5)

where K ∈ R^{L_κ×U_κ×V_κ×C×C′} is a multi-channel convolution kernel and l̂_κ = l_κ − ⌊L_κ/2⌋, û_κ = u_κ − ⌊U_κ/2⌋, v̂_κ = v_κ − ⌊V_κ/2⌋. Starting from R^{T×X×Y×L×U×V×1}, we gradually downsample (U, V) and expand channels via multiple convolutions with strides, finally resulting in R^{T×X×Y×L×1×1×C_F}; we preserve the L dimension, since maintaining fine temporal resolution has been shown to be effective for capturing detailed motion information (Lin et al., 2019; Feichtenhofer et al., 2019). The convolutional feature extraction with n layers can thus be formulated as F = (g_n ∘ g_{n−1} ∘ ⋯ ∘ g_1)(S), which results in a feature tensor F ∈ R^{T×X×Y×L×C_F}. This method is effective in learning structural patterns with its convolution kernels, thus outperforming the former methods, as will be seen in our experiments.
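The MLP variant can be sketched in a few lines of NumPy: once the (U, V) volume is flattened, the 5-mode tensor product reduces to an ordinary matrix product on the last axis (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def mlp_extract(S, Ws):
    """MLP feature extraction: flatten (U, V), then stack ReLU perceptrons.

    S:  STSS tensor of shape (T, X, Y, L, U, V).
    Ws: list of weight matrices; Ws[0] has shape (C1, U*V) and
        Ws[i] has shape (C_{i+1}, C_i). Each layer realizes
        f(S) = ReLU(S x_5 W) as a matmul over the last axis.
    """
    T, X, Y, L, U, V = S.shape
    F = S.reshape(T, X, Y, L, U * V)
    for W in Ws:
        F = np.maximum(F @ W.T, 0.0)   # 5-mode product + ReLU
    return F                            # (T, X, Y, L, C_F)
```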

3.2.2. FEATURE INTEGRATION

In this step, we integrate the extracted STSS features F ∈ R^{T×X×Y×L×C_F} and feed them back to the original input stream with (T, X, Y, C) volume. We first apply 3 × 3 spatial convolution kernels along the (x, y) dimensions of F. A spatial convolution layer h can be expressed as F′ = h(F), whose elements are computed by

F′_{t,x,y,l,c′_F} = ReLU( Σ_{x_κ,y_κ,c_F} K_{x_κ,y_κ,c_F,c′_F} F_{t,x+x̂_κ,y+ŷ_κ,l,c_F} ),

where K ∈ R^{X_κ×Y_κ×C_F×C′_F} is a multi-channel convolution kernel, (x_κ, y_κ) are the kernel parameter indices, and (c_F, c′_F) are the channel indices; (x̂_κ, ŷ_κ) is centered as x̂_κ = x_κ − ⌊X_κ/2⌋, ŷ_κ = y_κ − ⌊Y_κ/2⌋. These spatial convolutions integrate the features by extending receptive fields along the (x, y) dimensions, and the resultant feature F* ∈ R^{T×X×Y×L×C_F} is defined as F* = (h_n ∘ h_{n−1} ∘ ⋯ ∘ h_1)(F). We then flatten the (L, C_F) volume into LC_F-dimensional vectors to obtain F* ∈ R^{T×X×Y×LC_F} and apply a 1 × 1 × 1 convolution layer to obtain the final output. This layer integrates features from different temporal offsets and also adjusts the channel dimension to match that of the original input V. We adopt an identity mapping of the input for residual learning (He et al., 2016). The final output tensor Z is expressed as

Z = ReLU(F* ×_5 W_θ) + V,

where ×_n is the n-mode tensor product and W_θ ∈ R^{C×LC_F} is the weight of the convolution layer.
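The final integration step, i.e., flattening the (L, C_F) volume, applying the 1 × 1 × 1 convolution as a matrix product, and adding the residual, can be sketched as follows. This omits the preceding spatial convolutions h and uses illustrative names:

```python
import numpy as np

def integrate(F, V, W_theta):
    """Feature integration (a sketch): flatten (L, C_F), apply a
    1x1x1 convolution as a matmul over the last axis, add a residual.

    F:       extracted STSS features, shape (T, X, Y, L, C_F).
    V:       original input features, shape (T, X, Y, C).
    W_theta: weight matrix of shape (C, L * C_F).
    Returns Z of shape (T, X, Y, C).
    """
    T, X, Y, L, CF = F.shape
    flat = F.reshape(T, X, Y, L * CF)           # flatten (L, C_F)
    out = np.maximum(flat @ W_theta.T, 0.0)     # ReLU(F x_5 W_theta)
    return out + V                              # residual connection
```

Because the output shape matches the input V, the block can be dropped between any two stages of a backbone, which is what makes SELFY easy to insert.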

4. EXPERIMENTS

4.1. DATASETS & IMPLEMENTATION DETAILS

For evaluation, we use benchmarks that contain fine-grained spatio-temporal dynamics in videos. Something-Something V1 & V2 (SS-V1 & V2) (Goyal et al., 2017b), which are both large-scale action recognition datasets, contain ∼108k and ∼220k video clips, respectively. Both datasets share the same 174 action classes with labels such as 'pretending to put something next to something.' Diving-48 (Li et al., 2018), which contains ∼18k videos with 48 different diving action classes, is an action recognition dataset that minimizes contextual biases, i.e., scenes or objects. FineGym (Shao et al., 2020) is a fine-grained action dataset built on top of gymnastics videos. We adopt the Gym288 and Gym99 sets for experiments, which contain 288 and 99 classes, respectively.

Action recognition architecture. We employ TSN ResNets (Wang et al., 2016) as 2D CNN backbones and TSM ResNets (Lin et al., 2019) as 3D CNN backbones. TSM obtains the effect of spatio-temporal convolutions from spatial convolutions by shifting a part of the input channels along the temporal axis before the convolution operation; it is added into each residual block of the ResNet. We adopt ImageNet pre-trained weights for our backbones. To transform the backbones into self-similarity networks (SELFYNet), we insert a single SELFY block after the third stage of the backbones. For the SELFY block, we use the convolution method as the default feature extraction method and use multi-channel 1 × 3 × 3 convolution kernels. For more details, please refer to Appendices A and B.

Training & testing. For training, we sample a clip of 8 or 16 frames from each video using segment-based sampling (Wang et al., 2016). The spatio-temporal matching region (L, U, V) of the SELFY block is set to (5, 9, 9) or (9, 9, 9) when using 8 or 16 frames, respectively. For testing, we sample one or two clips from a video, crop their centers, and evaluate the averaged prediction of the sampled clips. For more details, please refer to Appendix A.
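Segment-based sampling (Wang et al., 2016) divides a video into equal-length segments and draws one frame per segment. A simplified sketch under the assumption that the video has at least as many frames as segments (function and variable names are ours):

```python
import random

def segment_sample(num_frames, num_segments, train=True):
    """TSN-style segment-based frame sampling: split the video into
    num_segments equal chunks and pick one frame index per chunk
    (random within the chunk during training, chunk center at test time).

    Assumes num_frames >= num_segments.
    """
    seg_len = num_frames / num_segments
    idx = []
    for s in range(num_segments):
        lo = int(seg_len * s)
        hi = max(lo, int(seg_len * (s + 1)) - 1)
        idx.append(random.randint(lo, hi) if train else (lo + hi) // 2)
    return idx
```

Sampling one frame per segment covers the whole temporal extent of the video with a fixed-size clip, which suits the long-range matching window of the SELFY block.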

4.2. COMPARISON WITH THE STATE-OF-THE-ART METHODS

For a fair comparison, we compare our model with other models that are not pre-trained on additional large-scale video datasets, e.g., Kinetics (Kay et al., 2017) or Sports1M (Karpathy et al., 2014), in the following experiments. Table 1 summarizes the results on SS-V1&V2. The first and second compartments of the table show the results of other 2D CNN and (pseudo-)3D CNN models, respectively; the last part of each compartment shows the results of SELFYNet. SELFYNet with TSN-ResNet (SELFYNet-TSN-R50) achieves 50.7% and 62.7% top-1 accuracy, respectively, outperforming other 2D models while using only 8 frames. When we adopt TSM ResNet (TSM-R50) as our backbone and use 16 frames, our method (SELFYNet-TSM-R50) achieves 54.3% and 65.7% top-1 accuracy, respectively, which is the best among the single models. Compared to TSM-R50, a single SELFY block obtains significant gains of 7.0%p and 4.5%p top-1 accuracy, respectively; our method is more accurate than TSM-R50 Two-stream on both datasets. Finally, our ensemble model (SELFYNet-TSM-R50 EN) with 2-clip evaluation sets a new state of the art on both datasets by achieving 56.6% and 67.7% top-1 accuracy, respectively.

Table 2 summarizes the results on Diving-48 & FineGym. For Diving-48, TSM-R50 using 16 frames shows 38.8% top-1 accuracy in our implementation. SELFYNet-TSM-R50 outperforms TSM-R50 by 2.8%p, setting a new state-of-the-art top-1 accuracy of 41.6% on Diving-48. For FineGym, SELFYNet-TSM-R50 achieves 49.5% and 87.7% on the 288-class and 99-class sets, respectively, surpassing all the other models reported in Shao et al. (2020).

4.3. ABLATION STUDIES

We conduct ablation experiments to demonstrate the effectiveness of the proposed method. All experiments are performed on SS-V1 using 8 frames. Unless otherwise specified, we use ImageNet pre-trained TSM ResNet-18 (TSM-R18) with a single SELFY block with (L, U, V) = (5, 9, 9) as our default SELFYNet.

Types of similarity. In Table 3a, we investigate the effect of different types of similarity by varying the set of temporal offsets l on both TSN-ResNet-18 (TSN-R18) and TSM-R18. Interestingly, learning spatial self-similarity ({0}) improves accuracy on both backbones, which implies that self-similarity features help capture structural patterns of visual features. Learning cross-similarity with a short temporal range ({1}) shows a noticeable gain in accuracy on both backbones, indicating the significance of motion features. Learning STSS outperforms the other types of similarity, and the accuracy of SELFYNet increases as the temporal range becomes longer, i.e., as STSS takes a more far-sighted view.

Feature extraction and integration methods. In Table 3b, we compare the performance of different combinations of feature extraction and integration methods. From the 2nd to the 4th rows, different feature extraction methods are compared while fixing the feature integration method to a single fully-connected (FC) layer. Compared to the baseline, the use of soft-argmax, which extracts spatial displacement features, improves top-1 accuracy by 1.0%p. Replacing soft-argmax with MLP provides an additional gain of 1.9%p top-1 accuracy, showing the effectiveness of directly using similarity values. When using the convolution method for feature extraction, we achieve 46.7% top-1 accuracy; the multi-channel convolution kernel is more effective in capturing structural patterns along the (u, v) dimensions than MLP. From the 4th to the 6th rows, different feature integration methods are compared while fixing the feature extraction method to convolution. Replacing the single FC layer with MLP improves top-1 accuracy by 0.6%p. Replacing MLP with convolution layers further improves the result, achieving 48.4% top-1 accuracy. These results demonstrate that our design choice of using convolutions along the (u, v) and (x, y) dimensions is the most effective in learning geometry-aware STSS representation. For more ablation experiments, please refer to Appendix B.

4.4. COMPLEMENTARITY BETWEEN SPATIO-TEMPORAL FEATURES AND STSS FEATURES

We conduct experiments to analyze the different roles of spatio-temporal features and STSS features. We organize two basic blocks representing the two different features: a spatio-temporal convolution block (STCB) that consists of several spatio-temporal convolutions (Fig. 4a), and the SELFY-s block, a lightweight version of the SELFY block obtained by removing the spatial convolution layers (Fig. 4b). Both blocks have the same receptive fields and a similar number of parameters for a fair comparison. Different combinations of the basic blocks are inserted after the third stage of TSN-ResNet-18. Table 5 summarizes the results on SS-V1. STSS features (Fig. 4b, 4d) are more effective than spatio-temporal features (Fig. 4a, 4c) in top-1 and top-5 accuracy when the same number of blocks is inserted. Interestingly, the combination of the two different features (Fig. 4e, 4f) shows better results in top-1 and top-5 accuracy than the single-feature cases (Fig. 4c, 4d), which demonstrates that the two features complement each other. We conjecture that this complementarity comes from the different characteristics of the two features: while spatio-temporal features are obtained by directly encoding appearance features, STSS features are obtained by suppressing variations in appearance and focusing on relational features in space and time.

4.5. IMPROVING ROBUSTNESS WITH STSS

In this section, we demonstrate that STSS representation helps video-processing models to be more robust to video corruptions. We test two corruptions that are likely to happen in real-world videos: occlusion and motion blur. To induce the corruptions, we either cut out a rectangular patch of a particular frame or generate a motion blur (Hendrycks & Dietterich, 2019). We corrupt a single center frame for every clip of SS-V1 at the testing phase and gradually increase the severity of the corruptions. We compare the results of TSM-R18 and the SELFYNet variants of Table 3a. Figs. 5a and 5b summarize the results of the two corruptions, respectively. The top-1 accuracy of TSM-R18 and the SELFYNets with short temporal ranges ({0}, {1}, and {−1, 0, 1}) drops significantly as the severity of the corruptions increases. We conjecture that the features of the corrupted frame propagate through the stacked TSMs, confusing the entire network. In contrast, the SELFYNets with long temporal ranges ({−2, ..., 2} and {−3, ..., 3}) show more robust performance than the other models. As shown in Figs. 5a and 5b, the accuracy gap between the SELFYNets with long temporal ranges and the others widens as the severity of the corruptions increases, indicating that a larger volume of STSS features can improve the robustness of action recognition. We also present qualitative results (Fig. 5c) where two SELFYNets with different temporal ranges, {1} and {−3, ..., 3}, both answer correctly without corruption, while the SELFYNet with {1} fails on the corrupted input.

5. CONCLUSION

In this paper, we have proposed to learn a generalized, far-sighted motion representation from STSS for video understanding. Our comprehensive analyses of STSS demonstrate that STSS features effectively capture both short-term and long-term interactions, complement spatio-temporal features, and improve the robustness of video-processing models. Our method outperforms other state-of-the-art methods on the three benchmarks for video action recognition.

A IMPLEMENTATION DETAILS

Architecture details. We use TSN-ResNet and TSM-ResNet as our backbones (see Table 6) and initialize them with ImageNet pre-trained weights. We insert a single SELFY block right after res_3 and use the convolution method as the default feature extraction method. We set the spatio-temporal matching region (L, U, V) of the SELFY block to (5, 9, 9) or (9, 9, 9) when using 8 or 16 input frames, respectively. We stack four 1 × 3 × 3 convolution layers along the (l, u, v) dimensions for feature extraction, and use four 3 × 3 convolution layers along the (x, y) dimensions for feature enhancement. We reduce the spatial resolution of the video feature tensor V to 14×14 for computational efficiency before the self-similarity transformation. After the feature enhancement, we upsample the enhanced feature tensor to 28×28 for the residual connection.

Training. We sample a clip of 8 or 16 frames from each video using segment-based sampling (Wang et al., 2016). We resize the sampled clips into 240 × 320 images and apply random scaling and horizontal flipping for data augmentation. When applying horizontal flipping on SS-V1&V2 (Goyal et al., 2017b), we do not flip clips whose class labels include the words 'left' or 'right', e.g., 'pushing something from left to right.' We fit the augmented clips into a spatial resolution of 224 × 224. For SS-V1&V2, we set the initial learning rate to 0.01 and the number of training epochs to 50; the learning rate is decayed by 1/10 after the 30th and 40th epochs. For Diving-48 (Li et al., 2018) and FineGym (Shao et al., 2020), we use a cosine learning rate schedule (Loshchilov & Hutter, 2016) with the first 10 epochs for gradual warm-up (Goyal et al., 2017a). We set the initial learning rate to 0.01 and the number of training epochs to 30 and 40, respectively.

Testing. Given a video, we sample 1 or 2 clips, resize them into 240 × 320 images, and crop their centers to 224 × 224. We evaluate the average prediction of the sampled clips.
We report top-1 and top-5 accuracy for SS-V1&V2 and Diving-48, and mean class accuracy for FineGym.

Frame corruption details. We adopt two corruptions, occlusion and motion blur, to test the robustness of SELFYNet. We corrupt only a single center frame for every validation clip of SS-V1, i.e., the 4th frame among the 8 input frames. For occlusion, we cut out a rectangular region from the center of the frame. For motion blur, we adopt the ImageNet-C implementation, which is available online. We set 6 levels of severity for each corruption: the side length of the occluded region is set to 40px, 80px, 120px, 160px, 200px, and 224px from level 1 to 6, and the (radius, sigma) arguments of the motion blur are set to (15, 5), (10, 8), (15, 12), (20, 15), (25, 20), and (30, 25), respectively. Directly learning STSS features is more effective for action recognition than the indirect ways of using STSS, e.g., re-weighting visual-semantic features or learning correspondences.

Comparison with correlation-based methods. We also compare our method with correlation-based methods (Kwon et al., 2020; Wang et al., 2020). While correlation-based methods extract motion features between two adjacent frames only and are thus limited to short-term motion, our method effectively captures bi-directional and long-term motion information by learning with a sufficient volume of STSS. Our method can also exploit richer information from the self-similarity values than other methods. The MS module (Kwon et al., 2020) focuses only on the maximal similarity value over the (u, v) dimensions to extract flow information, and the Correlation block (Wang et al., 2020) uses a 1 × 1 convolution layer to extract motion features from the similarity values. In contrast to these two methods, we introduce a generalized motion learning framework using the self-similarity tensor in Sec. 3.2 of our main paper. The differences between our method and the correlation-based methods are illustrated in Fig. 6.
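The center-frame occlusion corruption amounts to a simple center cutout; the side lengths 40-224px above set the severity. The function name and the zero-fill value in this sketch are our own illustrative assumptions:

```python
import numpy as np

def occlude_center(frame, side):
    """Occlusion corruption (a sketch): zero out a centered square
    of the given side length in a single frame.

    frame: array of shape (H, W, C); side: side length in pixels.
    Returns a corrupted copy; the input is left unmodified.
    """
    H, W = frame.shape[:2]
    side = min(side, H, W)
    y0, x0 = (H - side) // 2, (W - side) // 2
    out = frame.copy()
    out[y0:y0 + side, x0:x0 + side] = 0
    return out
```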
We have conducted experiments to compare our method with MSNet (Kwon et al., 2020), one of the correlation-based methods. For an apples-to-apples comparison, we apply the kernel soft-argmax and max-pooling operations (KS + CM in Kwon et al. (2020)) to our feature extraction method, following their official code. Note that, when we restrict the temporal offset l to {1}, the SELFY block using KS + CM is equivalent to the MS module whose feature transformation layers are standard 2D convolutional layers. Table 7b summarizes the results. The KS + CM method achieves 46.1% top-1 accuracy. As we enlarge the temporal window L to 5, we obtain an additional gain of 1.3%p. The learnable convolution layers improve top-1 accuracy by 1.0%p in both cases. These results demonstrate the effectiveness of learning geometric patterns within a sufficient volume of the STSS tensor for learning abundant motion features.

Multi-channel 3 × 3 × 3 kernel for feature extraction. We investigate the effect of the convolution method for STSS feature extraction when we use multi-channel 3 × 3 × 3 kernels. For this experiment, we stack four 3 × 3 × 3 convolution layers followed by the feature integration step, which is the same as in Section 3.2.2. Table 7c summarizes the results. Note that we do not report models whose temporal window L = 1, e.g., {0} and {1}. As shown in the table, the long temporal range indeed gives higher accuracy. However, the effect of the 3 × 3 × 3 kernel is comparable to that of the 1 × 3 × 3 kernel in Table 3a in terms of accuracy. Considering the accuracy-computation trade-off, we fix the kernel size L_κ × U_κ × V_κ to 1 × 3 × 3 for STSS feature extraction.

Spatial matching region.

In Table 7d, we compare a single SELFY block with different spatial matching regions (U, V). As expected, a larger spatial matching region leads to better accuracy. Considering the accuracy-computation trade-off, we set our spatial matching region (U, V) to (9, 9) by default.

Fusing STSS with visual features. We evaluate SELFYNet purely based on STSS features to see how much the ordinary visual feature V contributes to the final prediction. That is, we pass the STSS features, ReLU(F ×5 W_θ), into the downstream layers without the visual features V (Eq. 9 in our main paper). For brevity, we denote the relational feature ReLU(F ×5 W_θ) by R. Table 7e compares the results of using different cases of the output tensor Z (Z = V, Z = R, and Z = R + V) on SS-V1. Interestingly, SELFYNet using only R achieves 45.5% top-1 accuracy, which is 2.5%p higher than the baseline. As we add V to R, we obtain an additional gain of 2.9%p. This indicates that the STSS features and the visual features are complementary to each other.

Block position. From the 2nd to the 6th row of Table 7f, we identify the effect of different positions of the SELFY block in the backbone. We resize the spatial resolution (X, Y) of the video tensor to 14 × 14, and fix the matching region (L, U, V) to (5, 9, 9) for all cases, maintaining a similar computational cost. SELFY after res_3 shows the best trade-off, achieving the highest accuracy among the cases. The last row of Table 7f shows that multiple SELFY blocks improve accuracy over a single block.
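The three output variants compared in the fusion ablation (Z = V, Z = R, and Z = R + V) can be sketched as follows. A minimal numpy sketch: the tensor shapes and the random projection standing in for the learned integration weights W_θ are illustrative assumptions.

```python
import numpy as np

# Illustrative shapes: (T, X, Y, C) visual features and a C_F-channel
# STSS feature tensor F produced by the SELFY feature extraction.
rng = np.random.default_rng(0)
T, X, Y, C, C_F = 2, 4, 4, 8, 16
V = rng.standard_normal((T, X, Y, C))      # visual feature tensor
F = rng.standard_normal((T, X, Y, C_F))    # extracted STSS features
W_theta = rng.standard_normal((C_F, C))    # feature-integration weights

R = np.maximum(F @ W_theta, 0.0)  # relational features ReLU(F x_5 W_theta)
Z_visual = V                       # Z = V     : baseline, visual only
Z_stss = R                         # Z = R     : STSS features only
Z_fused = R + V                    # Z = R + V : complementary fusion
```

The residual form Z = R + V lets the block be dropped into a pretrained backbone without disturbing the visual pathway, matching the complementarity observed in Table 7e.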
Comparison with local self-attention. The local self-attention methods (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020) and our method both build on the self-similarity tensor, but use it in very different ways and for different purposes. The local self-attention mechanism aims to aggregate local context features, and thus uses the self-similarity values as attention weights for feature aggregation. Our method, in contrast, aims to learn a generalized motion representation from the local STSS, so the final STSS representation is fed directly into the network instead of being multiplied with local context features. For an empirical comparison, we conduct the following ablation experiment. We extend the local self-attention layer (Ramachandran et al., 2019) to the temporal dimension, and then add the spatio-temporal local self-attention layer, followed by feature integration layers, after res_3. All experimental details are the same as in Appendix A, except that we reduce the channel dimension C' of the appearance feature V to 32. Table 8 summarizes the results on SS-V1. The spatio-temporal local self-attention layer achieves 43.8% top-1 accuracy, and the SELFY blocks using the embedded Gaussian and the cosine similarity both outperform it, achieving 47.6% and 47.8% top-1 accuracy, respectively. These results align with prior work (Liu et al., 2019), which reveals that the self-attention mechanism hardly captures motion features in video.

D VISUALIZATIONS

In Fig. 7, we visualize qualitative results of two different SELFYNet-TSM-R18 models ({1} and {-3, · · · , 3}) on SS-V1. We show the different predictions of the two models with 8 input frames. We also overlay Grad-CAMs (Selvaraju et al., 2017) on the input frames to see whether a larger volume of STSS helps capture long-term interactions in videos. We take Grad-CAMs of the features right before the global average pooling layer. As shown in the figure, STSS with a sufficient volume helps learn a richer context of temporal dynamics in the video; in Fig. 7a, for example, SELFYNet with the range {-3, · · · , 3} focuses not only on the regions where the action occurs but also on the white stain after the action, to verify whether the stain is wiped off or not.

Figure 6: Comparison with non-local approaches (Wang et al., 2018; Liu et al., 2019) and correlation-based methods (Kwon et al., 2020; Wang et al., 2020). From the top: non-local block, CP module, MS module, Correlation block, and SELFY block.
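The Grad-CAM overlays referenced above follow the standard recipe of Selvaraju et al. (2017): weight each channel of the last convolutional activations by its spatially averaged gradient, sum, and rectify. A minimal numpy sketch (the function name `grad_cam` and the [0, 1] normalization are our choices):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM: channel weights are the spatially averaged
    gradients; the map is the ReLU of the weighted activation sum,
    normalized to [0, 1].

    `activations` and `gradients` have shape (C, H, W), taken from the
    layer right before global average pooling.
    """
    weights = gradients.mean(axis=(1, 2))                         # (C,)
    cam = np.maximum((weights[:, None, None] * activations).sum(0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting (H, W) map is upsampled to the input resolution and overlaid on the corresponding frame.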



https://github.com/hendrycks/robustness
https://github.com/xingyul/cpnet
https://github.com/facebookresearch/video-nonlocal-net
https://github.com/arunos728/MotionSqueeze



Figure 2: Overview of our self-similarity representation block (SELFY). SELFY block takes as input a video feature tensor V, transforms it to a STSS tensor S, and extracts a feature tensor F from S. It then converts F to the same size as the input V via the feature integration, and combines it with the input V to produce the final STSS representation Z. See text for details.

Figure 3: Feature extraction from STSS. For a spatio-temporal position (t, x, y), each method transforms the (L, U, V) volume of the STSS tensor S into (L, C_F). See text for details.


Figure 4: Basic blocks and their combinations: (a) spatio-temporal convolution block (STCB), (b) SELFY-s block, and (c-f) their combinations.
Table 5: Spatio-temporal features vs. STSS features. Results of different combinations of the two blocks ((a)-(f) from Fig. 4) are shown.
(a) Corruption: occlusion. (b) Corruption: motion blur. (c) Qualitative examples of the corruptions.

Figure 5: Results of the robustness experiment. (a) and (b) show top-1 accuracy of SELFYNet variants (Table 3a) with different occlusions and motion blurs, respectively. (c) shows qualitative examples that SELFYNet ({-3, • • • , 3}) answers correct, while SELFYNet ({1}) fails.

(a) Non-local block (Wang et al., 2018) (b) CP module (Liu et al., 2019) (c) MS module (Kwon et al., 2020) (d) Correlation block (Wang et al., 2020) (e) SELFY block (ours)

Performance comparison on SS-V1&V2. Top-1, 5 accuracy (%) and FLOPs (G) are shown.

Performance comparison on Diving-48 & FineGym.

Ablations on SS-V1. Top-1 & 5 accuracy (%) are shown. Types of similarity. Performance comparison with different sets of temporal offsets in the SELFY block. {•} denotes a set of temporal offsets l.

TSN-ResNet & TSM-ResNet backbone.

Performance comparison with the local self-attention mechanisms. R.P.E. is an abbreviation for relative positional embeddings. Top-1, 5 accuracy (%) are shown.

B ADDITIONAL EXPERIMENTS

We conduct additional ablation experiments to identify the behaviors of the proposed method. All experiments are performed on SS-V1 using 8 frames. Unless otherwise specified, we use ImageNet pre-trained TSM ResNet-18 (TSM-R18) with a single SELFY block of (L, U, V) = (5, 9, 9) as our default SELFYNet.

Comparison with non-local methods. We compare our method with popular non-local methods (Wang et al., 2018; Liu et al., 2019), which capture the long-range dynamics of videos. While computing self-similarity values as we do, both methods use them as attention weights for feature aggregation, either by multiplying them with the visual features (Wang et al., 2018) or by aligning the top-K corresponding features (Liu et al., 2019); neither uses STSS itself as a relational representation. In contrast, our method does, and learns a more powerful relational feature from STSS. The difference between our method and non-local methods is illustrated in Fig. 6. We have conducted experiments for performance comparison, and the results are shown in Table 7a. We re-implement the non-local block and the CP module in PyTorch based on their official code 23 . For a fair comparison, we insert a single block or module at the same position (after res_3 of ResNet-18). Note that our method downsamples the spatial resolution of V to 14 × 14 before the STSS transformation, whereas the others do not. Compared to the non-local block and the CP module, the SELFY block improves top-1 accuracy by 4.4%p and 1.5%p, while computing fewer floating-point operations (by 7.5 GFLOPs and 8.3 GFLOPs, respectively). It demonstrates that the direct integration of STSS features is more effective for action recognition than indirect ways of using STSS, e.g., re-weighting visual-semantic features or learning correspondences.
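The contrast drawn above, i.e. similarities used as attention weights versus similarities used directly as a representation, can be sketched in a few lines. A minimal numpy sketch; the shapes, variable names, and the random projection standing in for learned transformation layers are illustrative assumptions, not the papers' implementations.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# N query positions, each with K spatio-temporal neighbors of dim C.
rng = np.random.default_rng(0)
N, K, C = 10, 7, 4
sim = rng.standard_normal((N, K))            # self-similarity values
neighbors = rng.standard_normal((N, K, C))   # neighbor appearance features

# (a) Attention-style use (non-local / local self-attention):
#     similarities become weights that re-aggregate appearance features.
attn = softmax(sim)
attn_out = (attn[..., None] * neighbors).sum(axis=1)   # (N, C)

# (b) SELFY-style use: the similarity values themselves are the
#     representation, transformed by learnable layers (a projection here).
W = rng.standard_normal((K, C))
stss_out = np.maximum(sim @ W, 0.0)                    # (N, C)
```

In (a) the output lives in the span of the appearance features, so motion structure survives only implicitly through the weights; in (b) the relational values are the features, which is the direct integration the results above favor.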

