SELF-SUPERVISED VIDEO PRETRAINING YIELDS GENERAL IMAGE REPRESENTATIONS

Abstract

Videos contain far more information than still images and hold the potential for learning rich representations of the visual world. Yet pretraining on image datasets has remained the dominant paradigm for learning representations that capture spatial information. Prior attempts at video pretraining made progress towards solving video-based tasks, but did so at the cost of their image understanding capabilities. In this work we revisit self-supervised learning of image representations from the dynamic evolution of video frames. To that end, we propose a procedure for data curation that addresses the domain mismatch between video and image datasets, and develop a contrastive learning framework which handles the complex transformations present in natural videos. This simple paradigm for distilling knowledge from videos to image representations, called VITO, far outperforms all prior video pretraining methods on object detection and semantic segmentation tasks, and for the first time, closes the gap with ImageNet pretraining. Furthermore, VITO remains effective when transferring to video understanding tasks such as DAVIS segmentation and UCF-101 action recognition. Together, these results suggest that video pretraining is now strictly more general than image pretraining and could become the new default for learning visual representations.



1. INTRODUCTION

Pretraining on large image datasets has been the dominant paradigm for learning representations that understand the visual world (Krizhevsky et al., 2012; He et al., 2016). In particular, self-supervised methods which learn representations that are invariant to specific image transformations have proven very powerful, surpassing supervised pretraining on a variety of downstream tasks (He et al., 2020; Hénaff et al., 2019; Chen et al., 2020; Caron et al., 2021). Although these synthetic augmentations capture important image priors such as scale-, color-, and translation-invariance, they pale in comparison to the complex changes in pose and viewpoint that arise in natural videos. Therefore, one would expect that learning from videos, as opposed to images, should produce strictly more general visual representations. However, while self-supervised video representation learning has seen a variety of recent successful applications when evaluating on video-based tasks (Qian et al., 2021; Feichtenhofer et al., 2021; Toering et al., 2022; Dave et al., 2022; Feichtenhofer et al., 2022; Ni et al., 2022), it has typically done so by sacrificing performance on image classification (Gordon et al., 2020; Wu & Wang, 2021) relative to image pretraining. Furthermore, the specifics of video-representation architectures make their comparison with image-based architectures difficult, obfuscating the role of the underlying data and learning paradigm in the quality of the resulting representations.

In this work we perform a systematic comparison of image- and video-based learning of image representations. Starting from a strong, self-supervised contrastive baseline, we find the spatial content of standard video datasets to have a detrimental effect on the quality of the resulting representations, as measured by their performance on canonical scene understanding tasks.
We therefore introduce a straightforward video curation procedure, VideoNet, which aligns the class distribution of video datasets with that of ImageNet, and which partially redresses the imbalance between image and video learning. Additionally, we propose three simple modifications to the standard contrastive paradigm to account for the particularities of video data: less aggressive crop augmentation, multi-scale attention pooling, and enriching view generation with natural temporal deformations. Together, these improvements yield large gains over prior video pretraining efforts on semantic segmentation on PASCAL and ADE20K and object detection on COCO and LVIS, closing the gap between image- and video-based representation learning for the first time. Notably, we maintain the expected benefits of transfer to video-based tasks (DAVIS segmentation and UCF-101 action recognition). This gives new life to the promise of video pretraining as a general-purpose means of learning visual representations.
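The third modification, enriching view generation with natural temporal deformations, builds on the standard contrastive objective: rather than contrasting two synthetic augmentations of one image, positive pairs come from different frames of the same video. The paper's exact loss is not spelled out in this section; the following is a minimal numpy sketch of the generic InfoNCE objective under that view-sampling scheme, with all function and variable names being hypothetical illustrations rather than the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss between two batches of embeddings.

    z_a[i] and z_b[i] embed two views of the same video (e.g. crops
    taken from temporally separated frames); every other pairing in
    the batch serves as a negative.
    """
    z_a = l2_normalize(z_a)
    z_b = l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature           # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives lie on the diagonal

# Toy usage: embeddings of two frames from each of 4 videos, where frames
# of the same video share underlying content plus a small deformation.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))
view_a = base + 0.05 * rng.normal(size=(4, 16))  # frame at time t
view_b = base + 0.05 * rng.normal(size=(4, 16))  # frame at time t + dt
loss = info_nce_loss(view_a, view_b)
```

In practice the embeddings would come from an image encoder applied to augmented frame crops, and the loss would be symmetrized over the two views; the sketch only shows how temporally separated frames slot into the positive-pair role usually played by synthetic augmentations.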

2. RELATED WORK

Video-based pretraining. Many prior works have considered the problem of self-supervised representation learning for capturing spatio-temporal invariances. These span a wide range of approaches, beginning with traditional methods that leveraged temporal coherence, optical flow, and object tracking (Wiskott & Sejnowski, 2002; Hurri & Hyvärinen, 2003; Agrawal et al., 2015; Wang & Gupta, 2015; Pathak et al., 2017; Goroshin et al., 2015; Misra et al., 2016; Srivastava et al., 2015; Kulkarni et al., 2019). More recently, there have been many successful examples of approaches that leverage contrastive learning, masked autoencoding, and other self-supervised pretext tasks to learn strong video representations (Sermanet et al., 2018; Recasens et al., 2021; Qian et al., 2021; Dave et al., 2022; Dorkenwald et al., 2022; Feichtenhofer et al., 2021; 2022). However, most of these methods employ specialized video architectures and transfer to video-based tasks (action recognition, motion segmentation, object tracking, etc.) to measure the quality of the learned representations. Natural motion-induced deformations are powerful learning signals that should allow for learning better image representations as well, and recent works (Gordon et al., 2020; Alayrac et al., 2020; Wu & Wang, 2021; Tschannen et al., 2020; Xu & Wang, 2021; Jabri et al., 2020; Bian et al., 2022; Xiong et al., 2021) have made attempts in this direction. One family of successful recent methods uses cycle-consistency-based objectives that encourage learning correspondences between temporally ordered image patches via graph random walks (Jabri et al., 2020; Bian et al., 2022). However, this approach is generally computationally expensive to train, and has had noted issues scaling to larger model architectures (Xu & Wang, 2021). Alternate approaches have used optical flow to supervise correspondence learning (Sharma et al., 2022; Xiong et al., 2021), which can be quite powerful, but is limited to capturing dynamics over short time-scales. The most similar works to ours are Gordon et al. (2020), Xu & Wang (2021), and Wu & Wang (2021), which utilize simple variants of contrastive learning to learn global frame-level representations from videos. Our approach differs in its curation of video datasets and its ability to handle temporal deformations in the contrastive learning framework via learned attention. Although these works report gains on video-centric tasks such as object tracking and video segmentation, we find them to underperform state-of-the-art ImageNet-pretrained models on the canonical scene understanding tasks used to evaluate image representations.

Contrastive learning for fine-grained scene understanding. In this work, we specifically focus on evaluations that assess real-world scene understanding, namely semantic segmentation and object detection (Van Gansbeke et al., 2021; Hénaff et al., 2021; Xie et al., 2021a). Self-supervised learning has greatly benefited fine-grained scene understanding tasks, and there has been significant progress using dense contrastive losses that choose positive pairs for local features by spatial proximity and/or feature affinity across two views (Xie et al., 2021b; Bai et al., 2022; O. Pinheiro et al., 2020; Wang et al., 2021b). However, as described in Sharma et al. (2022), dense contrastive losses fail when ground-truth correspondences cannot be easily obtained across views, as is the case when views are sampled from temporally distant frames of a video.

Figure 1: Closing the gap between video- and image-pretraining. Prior work learning image representations from video has lagged behind ImageNet pretraining. We propose a method for data curation, VideoNet, and a self-supervised learning objective, VITO, which together close this gap.
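To make the cycle-consistency family concrete: the random-walk objective of Jabri et al. (2020) treats patches of consecutive frames as nodes of a graph, walks forward and then backward through time via softmax-normalized patch affinities, and asks each patch to return to itself. A rough numpy sketch of that idea follows; the function names and toy feature construction are illustrative assumptions, not the original implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistency_loss(frames, temperature=0.07):
    """Palindrome random walk over patch features (Jabri et al., 2020 style).

    frames: list of (num_patches, dim) patch-feature arrays, one per frame.
    The walk goes forward through the frames and back again; a patch should
    return to itself, so the target transition matrix is the identity.
    """
    path = frames + frames[-2::-1]               # t0 ... tN ... t0
    n = path[0].shape[0]
    walk = np.eye(n)
    for a, b in zip(path[:-1], path[1:]):
        affinity = a @ b.T                       # patch-to-patch similarity
        walk = walk @ softmax(affinity / temperature, axis=1)
    # Cross-entropy between each walk row and its identity target.
    return -np.mean(np.log(np.diag(walk) + 1e-8))

# Toy usage: 4 distinct "patches" that persist (with slight jitter) across
# 3 frames, so every round-trip walk should land back where it started.
rng = np.random.default_rng(0)
frames = [np.eye(4) + 0.01 * rng.normal(size=(4, 4)) for _ in range(3)]
loss = cycle_consistency_loss(frames)
```

The chained matrix products over every frame pair are what makes this objective comparatively expensive, consistent with the scaling concerns noted above.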

