BRINGING SACCADES AND FIXATIONS INTO SELF-SUPERVISED VIDEO REPRESENTATION LEARNING

Abstract

In this paper, we propose a self-supervised video representation learning (video SSL) method by taking inspiration from cognitive science and neuroscience on human visual perception. Different from previous methods that focus on the inherent properties of videos, we argue that humans learn to perceive the world through the self-awareness of the semantic changes or consistency in the input stimuli in the absence of labels, accompanied by representation reorganization during the post-learning rest periods. To this end, we first exploit the presence of saccades as an indicator of semantic changes in a contrastive learning framework, mimicking the self-awareness in human representation learning. The saccades are generated artificially without eye-tracking data. Second, we model the semantic consistency in eye fixation by minimizing the prediction error between the predicted and the true state of another time point. Finally, we incorporate prototypical contrastive learning to reorganize the learned representations to enhance the associations among perceptually similar ones. Compared to previous video SSL solutions, our method can capture finer-grained semantics from video instances, and the associations among similar ones are further strengthened. Experiments show that the proposed bio-inspired video SSL method significantly improves the Top-1 video retrieval accuracy on UCF101 and achieves superior performance on downstream tasks such as action recognition under comparable settings.

1. INTRODUCTION

Learning without labels is the most common way for humans to get to know the world (DiCarlo et al., 2012), and it has also been widely studied in machine learning for developing intelligent agents. In particular, many researchers focus on self-supervised learning (SSL) from dynamic visual input data, i.e., videos (Hurri & Hyvärinen, 2003; Mobahi et al., 2009; Srivastava et al., 2015), which comes closest to the natural data perceived by humans. Recently, deep learning based video SSL methods have shown superior performance over traditional non-deep learning methods (Wang et al., 2021a; Duan et al., 2022). However, there is still much room for improvement considering the gap between the unsupervised learning abilities of deep learning models and those of humans. One notable difference is that deep video SSL methods typically learn discriminative representations by considering the inherent data properties, such as the clip order (Misra et al., 2016), the spatiotemporal coherence (Vondrick et al., 2018), and the transformations exerted (Jenni et al., 2020), and propose various pretext tasks accordingly. For humans, in contrast, the self-awareness of the semantic change or consistency in the input stimuli is essential for learning without labels (Melcher & Colby, 2008). Besides, the representations encoded in the brain are not left unchanged but are continually reorganized to yield a representation structure with strengthened associations among perceptually similar representations (Diekelmann & Born, 2010). Fig. 1 shows an overall comparison. This discrepancy inspired us to propose a new video SSL method by taking inspiration from cognitive science and neuroscience on human visual perception. Recently, Illing et al. (2021) proposed a bio-inspired unsupervised learning rule that treats the presence of saccades as a global synaptic modulator.
However, it is less powerful due to the inherent difficulty in optimizing deep networks with layer-wise optimization. Human visual perception is mainly accomplished by alternating saccades and fixations when the head is relatively still. The former is the rapid foveal motion from one target of interest to another, while the latter is the period in which the eye is kept aligned with one target for processing visual details. We further encourage the semantic consistency within a fixation duration by minimizing the prediction error (PE) when using the current state to predict that of another time point in the same fixation duration. In this way, PE can serve as an extra supervision signal to avoid semantic discrepancy during semantic-change-aware contrastive learning. This is also biologically plausible, as PE is known as an important modulator in perception, attention, and motivation control (Den Ouden et al., 2012). To enhance the association among the previously learned finer-grained semantics, inspired by the reorganization in human representation learning (Diekelmann & Born, 2010), we incorporate prototypical contrastive learning (Li et al., 2020) to gradually redistribute the representations. The learned representations are pulled towards their corresponding prototypes and pushed away from other prototypes. Such post-learning reorganization facilitates grouping unseen input stimuli into meaningful categories based on similarity, which leads to improved Top-1 retrieval accuracy compared with previous contrastive-based video SSL methods. In summary, we propose a video SSL framework by taking inspiration from cognitive science and neuroscience on human visual perception. We first exploit the presence of saccades as an indicator of semantic change in a contrastive learning framework for modeling the role of self-awareness in human representation learning. Then, we model the semantic consistency in the input by minimizing the PE between the predicted and the true states of different time points during a fixation.
Third, we incorporate prototypical contrastive learning to reorganize the learned representations such that the associations among perceptually similar ones would be strengthened after redistribution. Experiments show that the proposed bio-inspired video SSL method significantly improves the Top-1 video retrieval accuracy on UCF101, and achieves superior performance on downstream tasks such as action recognition. The code and the pre-trained models will be released.

2. RELATED WORK

Self-supervised learning has been studied with various data formats, including image (Wu et al., 2018; Grill et al., 2020; He et al., 2020; Li et al., 2020), video (Xu et al., 2019; Benaim et al., 2020; Qian et al., 2021a; Duan et al., 2022), and multi-modal data (Alayrac et al., 2020; Patrick et al., 2021). In this section, we focus on self-supervised representation learning on videos.

Non-contrastive video SSL methods. Previous video SSL methods mainly learn discriminative representations by designing various pretext tasks based on the analysis of the inherent spatiotemporal properties of video data. The pretext tasks include figuring out the correct order of the clips (Misra et al., 2016; Fernando et al., 2017; Lee et al., 2017; Wei et al., 2018; Xu et al., 2019), tracking contents across adjacent frames (Wang & Gupta, 2015; Vondrick et al., 2016; Pathak et al., 2017; Vondrick et al., 2018; Wang et al., 2019b), studying foreground and background robustness (Luo et al., 2017; Wang et al., 2021b; c; Ding et al., 2022), predicting future frames (Vondrick et al., 2016; Luo et al., 2017; Villegas et al., 2017; Han et al., 2020a; Behrmann et al., 2021), solving spatiotemporal puzzles (Kim et al., 2019) or video cloze (Luo et al., 2020), learning the spatiotemporal statistics of the videos (Wang et al., 2019a; 2021a), recognizing transformations exerted on the video (Jenni et al., 2020; Duan et al., 2022), or determining whether a video is played at its intrinsic speed (Benaim et al., 2020; Wang et al., 2020; Yao et al., 2020; Chen et al., 2021).

Contrastive video SSL methods. Another line of research adapts spatiotemporal properties into the contrastive learning framework (Hadsell et al., 2006) by constructing the positive and negative pairs based on various spatiotemporal cues.
More specifically, some methods extend the instance discrimination methods in image SSL (Wu et al., 2018), and directly use clips randomly sampled from the same video as a positive pair (Han et al., 2020b; Lin et al., 2021; Pan et al., 2021; Qian et al., 2021b; Yao et al., 2021). Some methods consider the spatiotemporal consistency of video and treat the predicted and the ground-truth features at the same spatiotemporal location as a positive pair (Han et al., 2019; 2020a). Some methods relate video understanding to pace reasoning ability, and construct positive pairs by sampling clips with different sampling rates from the same video or sampling clips with the same pace from different videos (Huang et al., 2021). Some methods construct the positive pair from both frame-level and video-level representations (Kong et al., 2020; Kuang et al., 2021). Some others construct positive pairs through spatiotemporal data augmentations (Qian et al., 2021b; Sun et al., 2021), or exploit motion information for data augmentation (Dwibedi et al., 2019; Li et al., 2021; Wang et al., 2021b). Our method also belongs to this line of research. However, we construct positive and negative pairs based on the semantic change indicated by saccades, resulting in a finer-grained distinction of the semantics from the same video instance.

Comparison with CLAPP. Both the CLAPP model (Illing et al., 2021) and our work exploit the self-awareness of saccades for self-supervised representation learning, but they differ significantly in several aspects. First, CLAPP aims to propose a local learning rule for building deep representations without back-propagation, where each layer is trained independently to predict whether a saccade happens, whereas our method utilizes the presence of saccades to construct the positive and negative pairs in a contrastive learning framework for end-to-end learning. Second, for each layer, CLAPP restricts the predictions to be similar to its responses to future inputs and as different as possible from its responses to fake inputs. In contrast, our method minimizes the discrepancies between the predictions and the future responses in the absence of a saccade for semantic consistency modeling. Third, besides constructing a saccade by switching the network inputs from one video to another as in CLAPP, we also consider intra-video saccades, which are realized by intentionally changing the fixation area on the same video to capture finer-grained semantics. As shown in §4, we significantly improve over CLAPP for the video recognition task on UCF101.

3. METHOD

As shown in Fig. 2, our method consists of three parts. First, we explore saccades as the indicator of semantic change in a contrastive learning process (§3.1), where the negative pairs consist of features before and after a saccade occurs. Second, we model the semantic consistency during fixation by minimizing the prediction error (PE) between the predicted and the true features of different time points (§3.2). Third, we perform a post-learning reorganization to strengthen the associations among perceptually similar representations (§3.3).

3.1. CAPTURING SEMANTIC CHANGE VIA CONTRASTIVE LEARNING

Preparing saccades. Ground-truth gaze data are generally collected using eye-tracking devices, which requires considerable manual effort, and the exact fixation locations usually vary from person to person. To mitigate these problems, we propose to construct artificial saccades, and only consider a representative set of coarse fixation locations. We simulate the receptive field of the fovea by exerting fixation masks $\{m_i\}_{i=1}^{N_p}$ on the input stimuli. The artificial saccades are constructed by intentionally alternating the fixation masks. To balance performance and efficiency, we set $N_p = 5$. The spatial size of the fixation area is 1/4 of the whole input, as shown in Fig. 3 (a), and the temporal length is the same as the clip length. More details are presented in §A.1.

Semantic-change-aware contrastive learning. To capture the semantic change in the presence of a saccade, we make the representations before and after a saccade distinct from each other. To this end, we treat the representations before and after a saccade as negative pairs, and those not separated by a saccade as positive pairs. The latent feature space is thus optimized by minimizing the contrastive loss. Formally, given a training set $X = \{x_1, x_2, \ldots, x_N\}$ of $N$ videos without category labels, we aim to learn an embedding function $f_\theta$ that maps $X$ to features $V = \{v_1, v_2, \ldots, v_N\}$, where $v_i = f_\theta(x_i)$ and $v_i \in \mathbb{R}^D$ is expected to capture the semantics of $x_i$. The semantic-change-aware contrastive loss is inspired by InfoNCE (Oord et al., 2018; He et al., 2020), and is calculated as follows:

$$L_{cl} = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(v_{ik} \cdot v'_{ik} / \tau)}{\exp(v_{ik} \cdot v'_{ik} / \tau) + \sum_{j \in I^-} \exp(v_{ik} \cdot v'_{jh} / \tau)}, \quad |I^-| = N_{neg}. \tag{1}$$

Here, $v'_{ik} = f_\theta(x'_i \odot m_k)$ is a positive sample for $v_{ik} = f_\theta(x_i \odot m_k)$ ($\odot$ is the Hadamard product), $x'_i$ is obtained by applying commonly used data augmentations to $x_i$, $v'_{jh} \neq v'_{ik}$ is a negative sample, $I^-$ is the set of indices of the $N_{neg}$ selected negative samples, $\tau$ is the temperature, and $k, h \in \{1, 2, \ldots, N_p\}$ are indices of the mask $m$. Note that a sample is considered positive if and only if $j = i$ and $h = k$, i.e., it comes from the same fixation region of the same video. By minimizing Eq. (1), the embedding function $f_\theta$ is trained to distinguish between finer-grained semantics within a video. Perceptually similar finer-grained semantics will be further associated together through a reorganization process.

Memory bank. As previously revealed in (Oord et al., 2018; Wu et al., 2018), a large number of negative pairs is essential for training with the InfoNCE loss, but this number is typically restricted by the batch size. To alleviate this issue, we follow (Wu et al., 2018) and maintain a memory bank $\bar{V} = \{\bar{v}_i\}_{i=1}^{N \cdot N_p}$ for the different fixation locations of all the videos in the training dataset. Here, $N$ is the number of training videos, and $N_p$ is the number of masks. Similar to (Wu et al., 2018), we initialize $\bar{V}$ with random $D$-dimensional unit vectors and update the slot $\bar{v}_i$ with the latest feature $v_i$ as follows:

$$\bar{v}_i \leftarrow (1 - m)\,\bar{v}_i + m\,v_i,$$

where $m \in [0, 1]$ is a momentum value (not to be confused with the fixation masks $m_k$). With $\bar{V}$, we can rewrite the contrastive learning procedure and Eq. (1) by replacing the negative samples $v'$ with their memory bank representations $\bar{v}$.
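The loss in Eq. (1) and the memory-bank update can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the helper names (`info_nce`, `update_slot`), the toy sizes, the temperature 0.1, and the momentum 0.5 are all assumptions chosen for illustration, and the random vectors stand in for backbone features.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchor, positive, negatives, tau):
    """-log( exp(a.p/tau) / (exp(a.p/tau) + sum_j exp(a.n_j/tau)) ), as in Eq. (1)."""
    logits = [dot(anchor, positive) / tau] + [dot(anchor, n) / tau for n in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]

def update_slot(slot, feature, momentum):
    """Memory-bank update: slot <- (1 - m) * slot + m * feature."""
    return [(1 - momentum) * s + momentum * f for s, f in zip(slot, feature)]

# Toy setup (assumed sizes): N videos, N_P fixation masks, D-dim features.
rng = random.Random(0)
D, N, N_P = 8, 3, 5
bank = {(i, k): l2_normalize([rng.gauss(0, 1) for _ in range(D)])
        for i in range(N) for k in range(N_P)}

i, k = 0, 2                                                   # video 0 seen through mask 2
v = l2_normalize([rng.gauss(0, 1) for _ in range(D)])         # stands in for f(x_i * m_k)
v_aug = l2_normalize([x + 0.1 * rng.gauss(0, 1) for x in v])  # augmented view, same mask
# Negatives: every other slot, i.e. other masks of the same video and all other videos.
negatives = [slot for key, slot in bank.items() if key != (i, k)]
loss = info_nce(v, v_aug, negatives, tau=0.1)
bank[(i, k)] = update_slot(bank[(i, k)], v, momentum=0.5)
```

Note that slots of the same video under a different mask are treated as negatives here, which is what distinguishes this loss from plain instance discrimination.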

3.2. MODELING SEMANTIC CONSISTENCY VIA MINIMIZING PE

To encourage the semantic consistency between the states of two time points within a fixation, given a video $x_i$, we minimize the PE when using $c_i^t$ to predict $c_i^{t+\Delta t}$, where $c_i^* \in \mathbb{R}^{C_l \times (T_l H_l W_l)}$ is reshaped from the $l$-th level feature of the backbone $f_\theta$, whose size is $C_l \times T_l \times H_l \times W_l$, and $C_l$, $T_l$, $H_l$, $W_l$ are the number of feature channels, the temporal resolution, the height, and the width, respectively. Note that $t$ is the time point of a clip from $x_i$, and $\Delta t$ can be either positive or negative.

The predictor module $p_\theta$ is designed as a variant of the self-attention module (Bahdanau et al., 2015; Xie et al., 2021) to capture the spatiotemporal correlations. It first obtains the spatiotemporal weights based on the element-wise cosine similarity, and then calculates each predicted element as the weighted sum of the transformed version of all other spatiotemporal elements. An illustration is shown in Fig. 3 (b). The $u$-th element of the predicted feature $(c_i^{t+\Delta t})_{pred}$ is calculated as:

$$(c_{iu}^{t+\Delta t})_{pred} = \sum_{v=1}^{T_l H_l W_l} a(c_{iu}^{t}, c_{iv}^{t}) \cdot g(c_{iv}^{t}),$$

where $a(\cdot, \cdot)$ calculates the attention weights as:

$$a(c_{iu}^{t}, c_{iv}^{t}) = \mathrm{ReLU}(\cos(c_{iu}^{t}, c_{iv}^{t})),$$

and the transform function $g(\cdot)$ is a linear layer that maps a $C_l$-dim input to a $C_l$-dim output. The semantic consistency is optimized by minimizing the prediction loss defined as follows:

$$L_{pred} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T_l H_l W_l} \sum_{u=1}^{T_l H_l W_l} \left| c_{iu}^{t+\Delta t} - (c_{iu}^{t+\Delta t})_{pred} \right|. \tag{5}$$
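The predictor and the prediction loss can be sketched in plain Python as follows. This is an illustrative reading of the formulas above, not the authors' code: the state is stored as a list of $T_l H_l W_l$ elements, each a $C_l$-dim list (the transpose of the layout in the text), `weight` and `bias` are hypothetical parameters of the linear layer $g$, and $|\cdot|$ is assumed to be the L1 norm over channels.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def linear(c, weight, bias):
    """g(c): a C_l -> C_l linear map (weight is C_l x C_l, bias is C_l)."""
    return [sum(w * x for w, x in zip(row, c)) + b for row, b in zip(weight, bias)]

def predict(state, weight, bias):
    """Predict every element as a ReLU(cosine)-weighted sum of transformed elements."""
    transformed = [linear(c, weight, bias) for c in state]
    channels = len(bias)
    predicted = []
    for c_u in state:
        attn = [max(0.0, cosine(c_u, c_v)) for c_v in state]  # ReLU(cos(c_u, c_v))
        predicted.append([sum(a * g[ch] for a, g in zip(attn, transformed))
                          for ch in range(channels)])
    return predicted

def prediction_loss(target, predicted):
    """Mean over elements of the L1 distance between target and prediction."""
    per_element = [sum(abs(t - p) for t, p in zip(tu, pu))
                   for tu, pu in zip(target, predicted)]
    return sum(per_element) / len(per_element)

# Toy example: two spatiotemporal elements with two channels, identity transform.
identity = [[1.0, 0.0], [0.0, 1.0]]
state_t = [[1.0, 0.0], [0.0, 1.0]]      # state at time t
state_next = [[1.0, 0.0], [0.0, 2.0]]   # state at time t + dt (same fixation)
pred = predict(state_t, identity, [0.0, 0.0])
loss = prediction_loss(state_next, pred)
```

Because the two toy elements are orthogonal, each one attends only to itself, so the prediction equals the input state and the loss reduces to the L1 change between the two time points.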

3.3. REORGANIZING VIA PROTOTYPICAL CONTRASTIVE LEARNING

Inspired by the gradual redistribution and reorganization of memory representations during the post-learning rest periods (Diekelmann & Born, 2010), after training the whole framework with $L_{cl}$ (Eq. (1)) and $L_{pred}$ (Eq. (5)) to convergence, we further perform reorganization through prototypical contrastive learning (Li et al., 2020) to strengthen the associations among similar representations. Specifically, we first cluster the representations $\{v_{ik}\}$ $R$ times to obtain $R$ distinct clustering results $\{G^{(r)}\}_{r=1}^{R}$, where $G^{(r)}$ contains $Q^{(r)}$ clusters. Then, we randomly pick $Q = \min\{Q^{(r)}, N_{neg}\}$ clusters from each $G^{(r)}$ to form $G^{(r)\prime} = \{G_q^{(r)}\}_{q=1}^{Q}$, where the centroid of $G_q^{(r)}$ is $o_q^{(r)}$. The reorganization loss is calculated as follows:

$$L_{reorg} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_p} \sum_{k=1}^{N_p} \frac{1}{R} \sum_{r=1}^{R} -\log \frac{\exp(v_{ik} \cdot o_s^{(r)} / \phi_s^{(r)})}{\sum_{j=0}^{Q} \exp(v_{ik} \cdot o_j^{(r)} / \phi_j^{(r)})}.$$

Here, $G_s^{(r)}$ is the cluster to which $v_{ik}$ is assigned, $o_s^{(r)}$ is the centroid of $G_s^{(r)}$, and $\phi_*^{(r)}$ denotes the concentration estimation of the cluster $G_*^{(r)}$ as in (Li et al., 2020). In this way, the associations among previously learned finer-grained semantics are further strengthened, which can facilitate similarity-based categorization for unknown input stimuli.
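For a single feature, the reorganization loss above amounts to a softmax cross-entropy over prototypes, averaged over the $R$ clusterings. The sketch below assumes that the clustering itself has already been run and that each clustering is passed in as a hypothetical tuple (sampled centroids including the assigned one, their concentrations, index of the assigned centroid); the toy numbers are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prototype_nll(v, centroids, concentrations, assigned):
    """-log softmax over prototypes, each logit scaled by its own concentration."""
    logits = [dot(v, o) / phi for o, phi in zip(centroids, concentrations)]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[assigned]

def reorg_loss(v, clusterings):
    """Average the per-clustering loss over the R clustering results."""
    return sum(prototype_nll(v, *c) for c in clusterings) / len(clusterings)

# Toy example with R = 2 clusterings of different granularity.
v = [1.0, 0.0]
clusterings = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5], 0),
    ([[0.8, 0.6], [-1.0, 0.0], [0.0, 1.0]], [0.5, 0.5, 0.5], 0),
]
loss = reorg_loss(v, clusterings)
```

Minimizing this quantity pulls `v` towards its assigned prototype and pushes it away from the other sampled prototypes, with tighter clusters (smaller concentration) producing sharper logits.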

3.4. LOSS FUNCTION

The overall loss function is a combination of the three loss terms introduced above: $L = L_{cl} + L_{pred} + L_{reorg}$, where the third term only takes effect at the late stage of the training process.

Datasets. We conduct experiments on two representative video datasets, namely UCF101 (Soomro et al., 2012) and HMDB51 (Kuehne et al., 2011). UCF101 consists of 13K videos of 101 action classes. HMDB51 contains 7K videos from 51 action classes. Both UCF101 and HMDB51 have three official train/test splits. We pre-train on the first train split of UCF101 and use the first train/test split of UCF101 and HMDB51 for evaluation, following (Wang et al., 2020; Qian et al., 2021a). For the ablation study, we use the first train/test split of UCF101.

Network architectures. We experiment on two backbones that are commonly used in previous video SSL methods, namely R3D-18 (Hara et al., 2018) and R(2+1)D-18 (Tran et al., 2018). Given $N_p$ randomly sampled 16-frame clips of resolution 112 × 112 from a video, the backbone outputs $N_p$ $D$-dim feature vectors, where $D = 512$, and $N_p = 5$ is the number of types of fixation masks. We sample $N_p$ clips for a video and ensure that all the $N_p$ memory slots for the video can be updated.

Pre-training. The batch size for R3D is 16, and the batch size for R(2+1)D is 14. The two backbones are trained for 300 epochs on the UCF101 training set using SGD with a momentum of 0.9 and a weight decay of $10^{-4}$. The initial learning rate is 0.1 and is decayed by a factor of 5 at epochs 90, 180, and 240. The number of negative samples is $N_{neg} = 1024$. For PE minimization, we set $l = 5$, i.e., the state $c_i^*$ is from the 5-th level of the backbone network, which shows the best balance between accuracy and efficiency. For reorganization, we perform unsupervised clustering with Faiss (Johnson et al., 2019), and set $R = 3$, $Q^{(r)} = 1500$ for $r = 1, \ldots, R$, and $Q = 1024$ based on the ablation study.
We train for another 60 epochs after incorporating L reorg using SGD with a learning rate of 8 -4 . Action recognition. We initialize the backbone using the model parameters obtained in the pretraining stage except for the last linear layer. We consider two settings: i) linear probe, where only the last linear layer is trained with cross-entropy loss, and ii) finetune, where the entire network is finetuned with cross-entropy loss. For linear probe, when training on UCF101, we use SGD optimizer and trained for 200 epochs with an initial learning rate of 0.1, which is further decayed by 10 at epoch 60, 120 and 180. For HMDB51, we use Adam with a learning rate of 0.001 and trained to converge. We use batch size 32 for both R3D and R(2+1)D backbones with input resolution 112× 112. For finetune, we use an SGD optimizer and trained for 200 epochs with a large initial learning rate of 0.1 following (Duan et al., 2022) , which is further decayed by 10 at epoch 60, 120, and 180. The batch size is 32 for both backbones. During training, we apply the same data augmentation as in (Han et al., 2020b) . For evaluation, we uniformly sample 10 clips from one testing video, perform the center crop, resize them to 112×112, and average the predicted probabilities as the final prediction, following (Wang et al., 2020; Qian et al., 2021a) . 
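The step schedules described above can be written down as a small helper. This is a sketch under two stated assumptions: "decayed by 5" (or "by 10") is read as dividing the learning rate by that factor, and the drop is assumed to take effect from the milestone epoch onward with 1-indexed epochs. The function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, milestones=(90, 180, 240), factor=5.0):
    """Learning rate at a 1-indexed epoch under a step schedule.

    Assumption: the rate is divided by `factor` once for every milestone
    that has been reached (epoch >= milestone).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** drops

# Pre-training schedule: 0.1, divided by 5 at epochs 90, 180, and 240.
pretrain_lrs = [step_lr(e) for e in (1, 90, 180, 240, 300)]

# Linear-probe / finetune schedule on UCF101: 0.1, divided by 10 at 60, 120, 180.
probe_lrs = [step_lr(e, milestones=(60, 120, 180), factor=10.0) for e in (1, 60, 120, 180)]
```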

4.2. COMPARISON WITH THE STATE-OF-THE-ART

To evaluate the representations learned in the self-supervised pre-training stage, we perform video retrieval and action recognition and compare our results with other state-of-the-art methods.

Video retrieval. We query the k-nearest neighbors of the testing set video clips from the training set. The pre-trained backbone model (R3D or R(2+1)D) is directly used as a feature extractor without further finetuning. For each query video, we obtain one 512-d feature vector by extracting and averaging the features of 10 uniformly sampled clips. We use cosine similarity to measure the distance between features when determining the k-nearest neighbors. A correct retrieval is counted when the k nearest neighbors contain at least one video of the same class as the query video. We report Top-k retrieval accuracy in Table 1, where k = 1, 5, 10, 20, and compare it with other video SSL methods pre-trained on the RGB modality of UCF101, including VCOP (Xu et al., 2019), VCP (Luo et al., 2020), PRP (Yao et al., 2020), Pace (Wang et al., 2020), STS (Wang et al., 2021a), and TransRank (Duan et al., 2022). As can be observed, our method achieves the best Top-1 retrieval accuracy on UCF101, and comparable Top-1 performance on HMDB51. Though our method is slightly inferior for the other values of k, we argue that the Top-1 metric matters the most in safety-critical real-world applications such as autonomous driving. Primates including humans are also better at giving one promising guess than several candidates (Freedman et al., 2001). Compared to previous methods, our method captures finer-grained semantics, which are reorganized to yield a better representation structure. To see the change of the learned finer-grained semantics before and after reorganization, we compare the per-category Top-1 retrieval accuracy on UCF101 before and after reorganization across the 101 categories. As shown in Fig. 4, the post-learning reorganization improves the Top-1 retrieval accuracy for 57.4% of the 101 action categories. For 22.8% of the categories, the Top-1 retrieval accuracy remains the same. This shows that the redistributed finer-grained semantics can facilitate similarity-based categorization. We further visualize video retrieval examples that are corrected by reorganization, or conversely degraded by it, in Fig. 5, marked with black and red backgrounds. For the PlayingCello query clip, the model before reorganization retrieves a PlayingSitar clip that resembles it in both appearance and motion. After representation reorganization, such ambiguity is removed, and clips with the correct action labels are found. However, we also notice that, for clips with complicated backgrounds, such as PommelHorse in the third row of Fig. 5, reorganization tends to make them easier to mix up with clips having a similar appearance.

Action recognition. We report the Top-1 action recognition accuracy of linear probe (Frozen ✓) and finetune (Frozen ✗) in Table 2. For a fair comparison, we exclude methods based on much deeper backbones, with larger input resolution, using multi-modal data, or pre-trained on much larger video datasets such as Kinetics (Carreira & Zisserman, 2017). For linear probe experiments, our method outperforms all previous video SSL methods pre-trained on UCF101, in particular CLAPP (Illing et al., 2021), a bio-inspired unsupervised representation learning method without back-propagation. This demonstrates that our method is a competitive way of exploring cognitive inspirations in deep self-supervised representation learning. For the finetune setting, our method achieves the highest Top-1 accuracy compared with other video SSL methods pre-trained on the RGB modality of UCF101, showing a good generalization ability of the learned representations.
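The Top-k video retrieval protocol described earlier in this section (average the clip features of a video, rank training videos by cosine similarity, count a hit if any of the k nearest neighbors shares the query's class) can be sketched as below. The function names and the toy 2-d features are hypothetical; real features would be the 512-d backbone outputs.

```python
import math

def mean_feature(clip_features):
    """Average the features of the clips sampled from one video."""
    n = len(clip_features)
    return [sum(col) / n for col in zip(*clip_features)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def topk_hit(query, query_label, gallery, gallery_labels, k):
    """True if any of the k most similar gallery videos has the query's class."""
    ranked = sorted(range(len(gallery)),
                    key=lambda j: cosine(query, gallery[j]), reverse=True)
    return any(gallery_labels[j] == query_label for j in ranked[:k])

def topk_accuracy(queries, query_labels, gallery, gallery_labels, k):
    hits = sum(topk_hit(q, y, gallery, gallery_labels, k)
               for q, y in zip(queries, query_labels))
    return hits / len(queries)

# Toy gallery (training set) and one query video built from two clip features.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
gallery_labels = ["a", "b", "b"]
query = mean_feature([[1.0, 0.0], [1.0, 0.1]])   # averages to [1.0, 0.05]
top1 = topk_accuracy([query], ["b"], gallery, gallery_labels, k=1)
top2 = topk_accuracy([query], ["b"], gallery, gallery_labels, k=2)
```

In this toy case the nearest neighbor belongs to class "a", so the query is a Top-1 miss but a Top-2 hit, which illustrates why Top-1 is the strictest of the reported metrics.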

4.3. ABLATION STUDY

In this section, we assess the effectiveness of the framework design regarding three components: the semantic-change-aware contrastive learning that utilizes saccades as an indicator (§3.1), the semantic consistency learning by minimizing the prediction error (PE) (§3.2), and the reorganization via prototypical contrastive learning (§3.3). We report Top-1 video retrieval accuracy on UCF101 to evaluate the learned video representations without further finetuning.

Overall framework design. As shown in Table 3, although increasing the size of the memory bank benefits the retrieval performance, the major improvements are brought by incorporating the three components. Specifically, incorporating saccades for constructing negative pairs achieves an absolute improvement of 1.7 points, and further including PE minimization or post-learning reorganization leads to an absolute improvement of 0.8 and 2.4 points, respectively. The full model is 4.3 points higher than the baseline with a memory bank of the same size. The results demonstrate the effectiveness of each component of the framework design.

Alternatives of artificial saccades. Besides constructing artificial saccades for training, we tried to incorporate real gaze data in our scheme by resorting to current video saliency prediction datasets such as DHF1K (Wang et al., 2018), Hollywood-2, and UCF sports (Mathe & Sminchisescu, 2015). However, those datasets typically contain no more than 1.5K training videos, which is much smaller than video SSL datasets such as HMDB51 and UCF101. Besides, the distribution of the videos in the saliency prediction datasets is typically not the same as that of the videos used for SSL training. Thus, it is hard for our SSL method trained on video saliency prediction datasets to achieve competitive performance when evaluated on downstream tasks. To mitigate the above-mentioned problems, we resort to saliency models pre-trained on real gaze data for egocentric (Huang et al., 2018) or third-person videos (Droste et al., 2020), which can provide an approximation of real gaze data. Considering that UCF101 contains third-person videos, we utilize UNISAL (Droste et al., 2020) to predict visual saliency maps for UCF101, and then use these maps to guide the construction of saccades during training. The fixation mask of a video clip is determined by the majority of the corresponding visual saliency maps. Since more than 90% of the resulting fixation masks are center masks, we mitigate this center bias by randomly perturbing the fixation mask labels with a probability of 0.5. To assess the effectiveness of these saliency-guided saccades, we train the baseline with saccades for 300 epochs on UCF101. The model achieves 45.1% Top-1 retrieval accuracy on UCF101, which is slightly better than 44.2%, the Top-1 accuracy of the one trained with artificial saccades as reported in Table 3. This suggests that a comparable amount of real gaze data could also bring such benefits in our video SSL framework.

Reorganization parameters. To better reorganize the learned finer-grained semantics, we experiment on two key parameters introduced in §3.3, namely the number of clustering results $R$ and the number of clusters in each result $Q^{(r)}$, as well as the clustering frequency and the number of warmup epochs. All the baselines are trained on UCF101 for 100 epochs using SGD. The learning rate is 0.1 and is decayed by a factor of 5 at epochs 30, 60, and 80. The results are shown in Table 4. For the first ten baselines, the reorganization loss $L_{reorg}$ is incorporated at epoch 61, and the prototypes are updated every 5 epochs. For baselines 11 and 12, $L_{reorg}$ is introduced at epoch 2 and 61, respectively, and the prototypes are updated every epoch. As can be observed, reorganization can consistently improve the Top-1 retrieval accuracy for a wide range of $R$ and $Q^{(r)}$. However, it is recommended to start prototypical learning later, when the finer-grained semantics are relatively better captured, and to update the prototypes less frequently. In our full experiments, where the models are trained for 300 epochs, we set $R = 3$ and $Q^{(r)} = 1500$ for $r = 1, \ldots, R$, and update the prototypes every 5 epochs starting from epoch 181.

Table 4: Reorganization parameters and the resulting Top-1 video retrieval accuracy on UCF101.

No.   R   Q^(r)               Top-1
1     0   -                   40.8
2     1   500                 41.5
3     1   1000                42.1
4     1   1500                42.0
5     1   2000                41.9
6     1   2500                41.3
7     3   500, 1000, 1500     41.7
8     3   1000, 1000, 1000    42.2
9     3   1500, 1500, 1500    42.3
10    3   1000, 1500, 2000    41.8
11    3   1500, 1500, 1500    41.0 *
12    3   1500, 1500, 1500    41.4 *
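The saliency-guided choice of a fixation mask described in this section (majority vote over the saliency maps of a clip, followed by a random perturbation with probability 0.5) could look roughly like the sketch below. How a single saliency map is turned into a mask index is not specified in the text, so the per-frame votes are taken as given; the function names are hypothetical, and perturbing to a different mask chosen uniformly at random is an assumption.

```python
import random
from collections import Counter

def clip_mask_label(frame_votes):
    """Majority vote over the mask indices suggested by each frame's saliency map."""
    return Counter(frame_votes).most_common(1)[0][0]

def perturb_label(label, num_masks, prob, rng):
    """With probability `prob`, replace the label by a different mask index
    drawn uniformly at random (assumed form of the perturbation)."""
    if rng.random() < prob:
        return rng.choice([k for k in range(num_masks) if k != label])
    return label

rng = random.Random(0)
frame_votes = [0, 0, 2, 0, 1]           # hypothetical per-frame votes, 0 = center mask
label = clip_mask_label(frame_votes)    # majority label for the clip
final_label = perturb_label(label, num_masks=5, prob=0.5, rng=rng)
```

The perturbation step is what counteracts the reported center bias: without it, more than 90% of the clips would share the same mask and few intra-video saccades could be formed.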

5. CONCLUSION

In this work, we propose a video SSL method by taking inspiration from cognitive science and neuroscience on human visual perception. Instead of designing pretext tasks based on the inherent properties of videos, we explore the presence of saccades as an indicator of semantic change in a contrastive learning framework to mimic the role of self-awareness in human perception. To achieve semantic consistency in the absence of a saccade, we minimize the prediction error when using the state of a time point to predict that of another time point during a fixation. Finally, we strengthen the associations among similar representations through a post-learning reorganization process. Compared to previous contrastive learning based video SSL methods, our method learns more powerful representations by first making finer-grained distinctions for semantics in a video instance, and then associating similar semantics across different video instances through a reorganization process. The proposed bio-inspired video SSL method achieves superior Top-1 video retrieval accuracy on UCF101 and outperforms other methods on the action recognition tasks.

Table 5: Ablation study on fixation mask configurations. We report Top-1 video retrieval results on UCF101 split 1. The best values are boldfaced. The values of the default settings are underlined. See §A.2 for more discussions.

Spatial size of fixation area   5%     11%    15%    20%    25%    30%    35%    40%    45%    50%
N_p = 5                         43.3   44.7   44.3   44.4   44.2   45.0   45.9   44.8   44.9   44.8
N_p = 9                         43.5   44.3   44.7   45.2   43.9   -      -      -      -      -

[...] value is 1 at every point within the fixation area and 0 otherwise. The fixation area, i.e., the receptive field of the fovea, is represented using a rectangle centered at a fixation location that has the same aspect ratio as the input.
In our experiments, we set the spatial size of the fixation area as 25% of the entire image size. We also explore the impact of different spatial sizes and numbers of fixations in §A.2. Exemplar fixation masks are shown in Fig. 6 . Finally, we construct an artificial saccade by manually assigning two different fixation masks to two video clips, which guarantees that the two fixation masks before and after an artificial saccade capture different visual receptive fields in the input scene. The two different fixation masks are randomly picked from N p pre-defined fixation masks.
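A minimal sketch of how such binary fixation masks and an artificial saccade could be constructed is given below. The rectangle keeps the input's aspect ratio, so covering 25% of the area means scaling each side by sqrt(0.25) = 0.5. The five fixation centers (image center plus the centers of the four quadrants) are an assumption made for illustration, as are the function names.

```python
import math
import random

def fixation_mask(height, width, center, area_ratio=0.25):
    """Binary mask: 1 inside a rectangle with the input's aspect ratio, 0 elsewhere.

    `center` is (row, col) given as fractions of the height and width.
    """
    scale = math.sqrt(area_ratio)
    mask_h, mask_w = round(height * scale), round(width * scale)
    top = min(max(round(center[0] * height - mask_h / 2), 0), height - mask_h)
    left = min(max(round(center[1] * width - mask_w / 2), 0), width - mask_w)
    return [[1 if top <= y < top + mask_h and left <= x < left + mask_w else 0
             for x in range(width)] for y in range(height)]

# Hypothetical fixation locations for N_p = 5: center and the four quadrant centers.
CENTERS = [(0.5, 0.5), (0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]

def sample_saccade(num_masks, rng):
    """An artificial saccade: two different mask indices (before, after)."""
    before, after = rng.sample(range(num_masks), 2)
    return before, after

masks = [fixation_mask(112, 112, c) for c in CENTERS]
before, after = sample_saccade(len(masks), random.Random(0))
```

Since the temporal length of a mask equals the clip length, the same 2-d mask would be applied to every frame of a clip by element-wise multiplication.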

A.2 FURTHER STUDY ON FIXATION MASKS

In this section, we study the effect of the spatial size of the fixation areas and the number of fixations $N_p$ used for constructing artificial saccades. We train the baseline with saccades only, instead of the full model, on UCF101 split 1 for 100 epochs. The Top-1 retrieval accuracies on UCF101 for $N_p = 5$ and $N_p = 9$ with various spatial sizes are shown in Table 5. As can be observed, the Top-1 retrieval performance first increases and then decreases as the spatial size of the fixation area becomes larger. The optimal spatial sizes are 35% and 20% for $N_p = 5$ and $N_p = 9$, respectively. This is because the information falling in the visual receptive field increases with the fixation area, and more information is beneficial for representation learning. However, when the overlap among different fixation areas enlarges, unintentional perturbations are introduced, which impede finer-grained semantic learning. Thus, it is crucial to determine the optimal overlap for different $N_p$ values.



Figure 1: Overall comparison. (a) Previous video SSL methods design pretext tasks based on the inherent properties of videos, while (b) our method exploits the presence of saccades as an indicator of semantic change to mimic the role of self-awareness in human perception. Since it is relatively expensive to collect real gaze data, we construct artificial saccades for training (§3.1).

A fixation is the period during which the eye stays aligned with one target to process visual details. To capture the semantic change in a video, we exploit the presence of saccades as an indicator of semantic change and propose a semantic-change-aware contrastive learning framework. This is inspired by the fact that humans perform a saccade when a semantic change occurs in the fixation area. Specifically, positive pairs are formed by features of the same fixation location in a video, and negative pairs are formed by features of different fixation locations in the same video or features from different videos. Compared to previous contrastive-based video SSL methods, our method captures finer-grained semantics within the same video. Note that we construct saccades manually by applying different fixation masks to the input without using real gaze data, which makes our method applicable to any video data without extra supervision.
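The pairing rule above can be written as a small contrastive objective. The sketch below is a generic InfoNCE-style loss with multiple positives, offered as an assumption about one reasonable form and not as the loss used in the paper. The function name `saccade_contrastive_loss`, the cosine similarity, and the temperature value are all hypothetical choices.

```python
import numpy as np


def saccade_contrastive_loss(features, video_ids, mask_ids, temperature=0.1):
    """InfoNCE-style loss: same video and same fixation mask => positive."""
    features = np.asarray(features, dtype=np.float64)
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    video_ids = np.asarray(video_ids)
    mask_ids = np.asarray(mask_ids)
    n = len(features)

    not_self = ~np.eye(n, dtype=bool)
    same_video = video_ids[:, None] == video_ids[None, :]
    same_mask = mask_ids[:, None] == mask_ids[None, :]
    # Positives: same fixation location in the same video. Every other
    # sample (other fixation or other video) acts as a negative.
    positives = same_video & same_mask & not_self
    if not positives.any():
        raise ValueError("the batch contains no positive pair")

    # Cosine-similarity logits, with self-comparisons excluded.
    logits = features @ features.T / temperature
    logits = np.where(not_self, logits, -np.inf)
    # Numerically stable log-softmax over each row.
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(
        np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denominator

    # Average the log-probability of the positives for each anchor
    # that has at least one positive, then average over those anchors.
    has_positive = positives.any(axis=1)
    positive_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    positive_count = positives.sum(axis=1)
    per_anchor = -positive_sum[has_positive] / positive_count[has_positive]
    return float(per_anchor.mean())
```

Here `video_ids[i]` and `mask_ids[i]` label the video and the fixation mask that produced `features[i]`. Two clips of one video seen through different masks therefore repel each other, which is the finer-grained, within-video distinction the text describes.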

Figure 2: Framework of the proposed bio-inspired video SSL method. Given a video x_i, the positive and negative pairs for semantic-change-aware contrastive learning can be constructed by applying the same or different fixation masks m to different clips of the video (§3.1), respectively. Further, the semantic consistency is modeled by minimizing the prediction error between the predicted state (c t * +∆t ik

Figure 3: Method details. (a) Exemplar fixation masks for N_p = 5, which cover the major region of the visual field. See §3.1 for more details. (b) Prediction module introduced in §3.2, which is designed as a variant of the self-attention module to capture spatiotemporal correlations.

Figure 5: Qualitative results for video retrieval before and after reorganization. The first two examples are improved by reorganization, and the third example is a failure case, where the retrieval result after the reorganization is more similar in appearance while different in semantics. See §4.

The first two examples are rectified by reorganization. For the JugglingBalls query clip, it retrieves a BoxingPunchingBag clip with similar foreground and background color, i.e., a person

Figure 6: Exemplar fixation masks of various spatial sizes for (a) N_p = 5 and (b) N_p = 9, respectively. See §A.1 for more details.

Video retrieval results on UCF101 and HMDB51. All the models are pre-trained on UCF101. Larger values are better. See §4.2 for details.

Action recognition results on UCF101 and HMDB51. The models are pre-trained on UCF101 with RGB only. * means learning without back-propagation. See §4.2 for details.

Ablation study on framework design. We report Top-1 video retrieval results on UCF101 split 1. Here, N is #training samples and N_p is #fixation locations considered. See §4.3 for details.

Ablation study on reorganization parameters assessed by Top-1 retrieval accuracy on UCF101 split 1. Here, R is #cluster results, and Q (r) is #clusters in the r-th result. * denotes updating prototypes every epoch. See §4.3.

The reorganization process improves the Top-1 retrieval performance on 57.4% of all classes and yields comparable performance on 22.8% of all classes. See §4 for more discussions.

A APPENDIX

In this section, we present more details on the preparation of saccades in §A.1 and study further fixation mask configurations in §A.2.

A.1 DETAILS OF PREPARING SACCADES

The presence of a saccade indicates a semantic change in the stimuli that fall in the receptive field of the fovea (denoted as "fixation area" in our paper). In view of this, we propose to manually determine such receptive fields using fixation masks. Thus, a change of the fixation mask, i.e., a "saccade" in our design, reasonably indicates a semantic change of the stimuli in the fixation area, which inspires the design of our semantic-change-aware contrastive learning framework. The procedure for generating artificial saccades is detailed as follows.

We first determine the fixation locations by considering two aspects: i) the fixation locations should be evenly distributed, since humans may attend to anywhere in the scene, and ii) the central region of the input should be covered, considering the center bias in free-viewing visual saliency (Tseng et al., 2009). To balance performance and efficiency, in our experiments we divide the image into a 2×2 grid and take the centroids of its four cells together with the centroid of the entire image, which gives N_p = 5 locations in total. We also experiment with 9 fixation masks, which correspond to 9 fixation locations at the cell centers of a 3×3 grid; the results are reported in §A.2.

Then, since the human eye can be viewed as an optical imaging system, we consider the generalized pupil function of an ideal imaging system and design the fixation mask as a binary mask where the

