BRINGING SACCADES AND FIXATIONS INTO SELF-SUPERVISED VIDEO REPRESENTATION LEARNING

Abstract

In this paper, we propose a self-supervised video representation learning (video SSL) method inspired by cognitive science and neuroscience findings on human visual perception. Different from previous methods that focus on the inherent properties of videos, we argue that humans learn to perceive the world through self-awareness of the semantic changes or consistency in the input stimuli in the absence of labels, accompanied by representation reorganization during post-learning rest periods. To this end, we first exploit the presence of saccades as an indicator of semantic changes in a contrastive learning framework, mimicking the self-awareness in human representation learning. The saccades are generated artificially without eye-tracking data. Second, we model the semantic consistency within an eye fixation by minimizing the prediction error between the predicted and the true state at another time point. Finally, we incorporate prototypical contrastive learning to reorganize the learned representations and enhance the associations among perceptually similar ones. Compared to previous video SSL solutions, our method captures finer-grained semantics from video instances and further strengthens the associations among similar instances. Experiments show that the proposed bio-inspired video SSL method significantly improves the Top-1 video retrieval accuracy on UCF101 and achieves superior performance on downstream tasks such as action recognition under comparable settings.

1. INTRODUCTION

Learning without labels is the most common way for humans to get to know the world (DiCarlo et al., 2012), and it has also been widely studied in machine learning for developing intelligent agents. In particular, many researchers focus on self-supervised learning (SSL) from dynamic visual input data, i.e., videos (Hurri & Hyvärinen, 2003; Mobahi et al., 2009; Srivastava et al., 2015), which come closest to the natural data perceived by humans. Recently, deep-learning-based video SSL methods have also shown superior performance over traditional non-deep-learning methods (Wang et al., 2021a; Duan et al., 2022). However, there is still large room for improvement considering the gap between the unsupervised learning abilities of deep learning models and humans.

One notable difference between deep video SSL methods and human unsupervised learning is that the former typically learn discriminative representations by considering inherent data properties, such as the clip order (Misra et al., 2016), the spatiotemporal coherence (Vondrick et al., 2018), the transformations exerted (Jenni et al., 2020), etc., and propose various pretext tasks accordingly. For humans, in contrast, the self-awareness of semantic change or consistency in the input stimuli is essential for learning without labels (Melcher & Colby, 2008). Besides, the encoded representations in the brain are not left unchanged but keep being reorganized to yield a representation structure with strengthened associations among perceptually similar representations (Diekelmann & Born, 2010).

To this end, we first exploit the presence of saccades as an indicator of semantic change in a contrastive learning framework. We further encourage the semantic consistency within a fixation duration by minimizing the prediction error (PE) when using the current state to predict that of another time point in the same fixation duration. In this way, PE serves as an extra supervision signal that avoids semantic discrepancy during semantic-change-aware contrastive learning.
This is also biologically plausible, as PE is known to be an important modulator in perception, attention, and motivation control (Den Ouden et al., 2012). To enhance the associations among the previously learned finer-grained semantics, inspired by the reorganization in human representation learning (Diekelmann & Born, 2010), we incorporate prototypical contrastive learning (Li et al., 2020) to gradually redistribute the representations: each learned representation is pulled towards its corresponding prototype and pushed away from the other prototypes. Such post-learning reorganization facilitates grouping unseen input stimuli into meaningful categories based on similarity, which leads to improved Top-1 retrieval accuracy compared with previous contrastive-based video SSL methods.

In summary, we propose a video SSL framework that takes inspiration from cognitive science and neuroscience on human visual perception. First, we exploit the presence of saccades as an indicator of semantic change in a contrastive learning framework for modeling the role of self-awareness in human representation learning. Second, we model the semantic consistency in the input by minimizing the PE between the predicted and the true states at different time points during a fixation. Third, we incorporate prototypical contrastive learning to reorganize the learned representations such that the associations among perceptually similar ones are strengthened after redistribution. Experiments show that the proposed bio-inspired video SSL method significantly improves the Top-1 video retrieval accuracy on UCF101 and achieves superior performance on downstream tasks such as action recognition. The code and the pre-trained models will be released.
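The two auxiliary objectives above can be sketched as simple loss functions. The code below is an illustrative sketch, not the paper's exact implementation: `prediction_error` assumes a cosine-distance PE between a predicted state and the true state of another time point, and `proto_nce` assumes a ProtoNCE-style objective (Li et al., 2020) with a single temperature; all function and variable names are our own.

```python
import numpy as np

def prediction_error(pred_state, true_state):
    """Cosine-distance PE between the predicted and the true state of
    another time point in the same fixation (0 when they agree)."""
    p = pred_state / np.linalg.norm(pred_state)
    t = true_state / np.linalg.norm(true_state)
    return 1.0 - float(p @ t)

def proto_nce(z, prototypes, assignment, temperature=0.1):
    """ProtoNCE-style loss: pull representation z towards its assigned
    prototype and push it away from the other prototypes."""
    z = z / np.linalg.norm(z)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = c @ z / temperature          # similarity to every prototype
    logits -= logits.max()                # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -float(log_prob[assignment])
```

A representation aligned with its assigned prototype incurs a lower loss than one assigned to a distant prototype, which is the redistribution effect described above.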

2. RELATED WORK

Self-supervised learning has been studied with various data formats, including image (Wu et al., 2018; Grill et al., 2020; He et al., 2020; Li et al., 2020), video (Xu et al., 2019; Benaim et al., 2020; Qian et al., 2021a; Duan et al., 2022), and multi-modal data (Alayrac et al., 2020; Patrick et al., 2021). In this section, we focus on self-supervised representation learning on videos.

Non-contrastive video SSL methods. Previous video SSL methods mainly learn discriminative representations by designing various pretext tasks based on the analysis of the inherent spatiotemporal properties of video data. The pretext tasks include figuring out the correct order of



Figure 1: Overall comparison. (a) Previous video SSL methods design pretext tasks based on the inherent properties of videos, while (b) our method explores the presence of saccades as an indicator of semantic change to mimic the role of self-awareness in human perception. Since it is relatively expensive to collect real gaze data, we construct artificial saccades for training (§3.1).

A saccade is a rapid eye movement that shifts the gaze between targets, while a fixation is the period where the eye is kept aligned with one target for processing visual details. To capture the semantic change in a video, we propose to exploit the presence of saccades as an indicator of semantic change and build a semantic-change-aware contrastive learning framework. This is inspired by the fact that humans perform a saccade when a semantic change occurs in the fixation area. Specifically, positive pairs are formed by features of the same fixation location in a video, and negative pairs are formed by features of different fixation locations in the same video or by features from different videos. Compared to previous contrastive-based video SSL methods, our method captures finer-grained semantics within the same video. Note that we manually construct saccades by exerting different fixation masks on the input without using real gaze data, making our method applicable to any video data without extra supervision.
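As a concrete illustration of the pairing scheme, the sketch below constructs an artificial fixation mask and scores one anchor feature against a positive (same fixation location) and negatives (other fixation locations or other videos) with an InfoNCE loss. The square mask shape, the feature dimensionality, and the temperature are our own assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def fixation_mask(frames, center, half_size):
    """Artificial fixation: keep a square window around `center` and zero
    out the rest, mimicking a fixation location without real gaze data."""
    t, h, w = frames.shape
    y, x = center
    out = np.zeros_like(frames)
    y0, y1 = max(0, y - half_size), min(h, y + half_size)
    x0, x1 = max(0, x - half_size), min(w, x + half_size)
    out[:, y0:y1, x0:x1] = frames[:, y0:y1, x0:x1]
    return out

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE over feature vectors: the positive shares the anchor's
    fixation location; negatives come from other locations or videos."""
    def unit(v):
        return v / np.linalg.norm(v)
    a = unit(anchor)
    sims = np.array([a @ unit(positive)] + [a @ unit(n) for n in negatives])
    logits = sims / temperature
    logits -= logits.max()                # numerical stability
    return -float(logits[0] - np.log(np.exp(logits).sum()))
```

A positive that matches the anchor drives the loss towards zero, while a mismatched one is heavily penalized, so features of the same fixation location are pulled together and those of different locations or videos are pushed apart.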

