BRINGING SACCADES AND FIXATIONS INTO SELF-SUPERVISED VIDEO REPRESENTATION LEARNING

Abstract

In this paper, we propose a self-supervised video representation learning (video SSL) method inspired by findings in cognitive science and neuroscience on human visual perception. Different from previous methods that focus on the inherent properties of videos, we argue that humans learn to perceive the world, in the absence of labels, through self-awareness of semantic changes or consistency in the input stimuli, accompanied by representation reorganization during post-learning rest periods. To this end, we first exploit the presence of saccades as an indicator of semantic change in a contrastive learning framework, mimicking the self-awareness in human representation learning; the saccades are generated artificially, without eye-tracking data. Second, we model the semantic consistency within an eye fixation by minimizing the error between the predicted and the true state at another time point. Finally, we incorporate prototypical contrastive learning to reorganize the learned representations and strengthen the associations among perceptually similar ones. Compared to previous video SSL solutions, our method captures finer-grained semantics from video instances and further strengthens the associations among similar instances. Experiments show that the proposed bio-inspired video SSL method significantly improves Top-1 video retrieval accuracy on UCF101 and achieves superior performance on downstream tasks such as action recognition under comparable settings.
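The saccade-as-boundary idea above can be illustrated with a standard InfoNCE-style contrastive objective, a common choice in contrastive video SSL. The following is a minimal sketch under assumptions of ours, not the paper's exact formulation: embeddings from the same fixation segment (between two artificial saccade boundaries) are treated as positives, and embeddings from across a saccade boundary as negatives. All names and the pairing scheme here are illustrative.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: pull the positive close, push negatives away.

    This is the generic contrastive loss, shown here only to illustrate how a
    saccade boundary could define positive/negative pairs.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Logit 0 is the positive pair; the rest are negatives.
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits /= temperature
    # Cross-entropy with the positive at index 0.
    return -logits[0] + np.log(np.sum(np.exp(logits)))

# An artificial "saccade" splits a clip into two fixation segments (hypothetical
# embeddings; a real model would produce these with a video encoder).
rng = np.random.default_rng(0)
seg_a = rng.normal(size=(2, 8))  # two views within one fixation segment
seg_b = rng.normal(size=(2, 8))  # views from after the saccade boundary
loss = info_nce_loss(seg_a[0], seg_a[1], negatives=list(seg_b))
```

Minimizing this loss pulls representations within a fixation segment together while pushing apart those separated by a saccade, which is one plausible way to operationalize "saccade as an indicator of semantic change."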

1. INTRODUCTION

Learning without labels is the most common way for humans to get to know the world (DiCarlo et al., 2012), and it has also been widely studied in machine learning for developing intelligent agents. In particular, many researchers focus on self-supervised learning (SSL) from dynamic visual input data, i.e., videos (Hurri & Hyvärinen, 2003; Mobahi et al., 2009; Srivastava et al., 2015), which come closest to the natural data perceived by humans. Recently, deep-learning-based video SSL methods have also shown superior performance over traditional non-deep-learning methods (Wang et al., 2021a; Duan et al., 2022). However, there is still large room for improvement considering the gap between the unsupervised learning abilities of deep learning models and humans. One notable difference between deep video SSL methods and human unsupervised learning is that the former typically learn discriminative representations by considering inherent data properties, such as the clip order (Misra et al., 2016), the spatiotemporal coherence (Vondrick et al., 2018), the transformations exerted (Jenni et al., 2020), etc., and propose various pretext tasks accordingly. For humans, by contrast, self-awareness of semantic change or consistency in the input stimuli is essential for learning without labels (Melcher & Colby, 2008). Moreover, the encoded representations in the brain are not left unchanged but are continually reorganized to yield a representation structure with strengthened associations among perceptually similar representations (Diekelmann & Born, 2010). Fig. 1 shows an overall comparison. This discrepancy inspired us to propose a new video SSL method that takes inspiration from cognitive science and neuroscience on human visual perception. Recently, Illing et al. (2021) proposed a bio-inspired unsupervised learning rule that treats the presence of saccades as a global synaptic modulator.
However, it is less powerful due to the inherent difficulty of optimizing deep networks layer by layer. Human visual perception is mainly accomplished by alternating saccades and fixations when the head is relatively still. The former is the rapid foveal motion from one target of interest to another,

