SELF-SUPERVISED VIDEO PRETRAINING YIELDS GENERAL IMAGE REPRESENTATIONS

Abstract

Videos contain far more information than still images and hold the potential for learning rich representations of the visual world. Yet pretraining on image datasets has remained the dominant paradigm for learning representations that capture spatial information. Prior attempts at video pretraining made progress on video-based tasks, but did so at the cost of image understanding. In this work we revisit self-supervised learning of image representations from the dynamic evolution of video frames. To that end, we propose a data curation procedure that addresses the domain mismatch between video and image datasets, and develop a contrastive learning framework that handles the complex transformations present in natural videos. This simple paradigm for distilling knowledge from videos into image representations, called VITO, far outperforms all prior video pretraining methods on object detection and semantic segmentation tasks and, for the first time, closes the gap with ImageNet pretraining. Furthermore, VITO remains effective when transferring to video understanding tasks such as DAVIS segmentation and UCF-101 action recognition. Together, these results suggest that video pretraining is now strictly more general than image pretraining and could become the new default for learning visual representations.



Introduction

Pretraining on large image datasets has been the dominant paradigm for learning representations that understand the visual world (Krizhevsky et al., 2012; He et al., 2016). In particular, self-supervised methods which learn representations that are invariant to specific image transformations have proven very powerful, surpassing supervised pretraining on a variety of downstream tasks (He et al., 2020; Hénaff et al., 2019; Chen et al., 2020; Caron et al., 2021). Although the synthetic augmentations used by these methods capture important image priors such as scale-, color-, and translation-invariance, they pale in comparison to the complex changes in pose and viewpoint that arise in natural videos. One would therefore expect that learning from videos, as opposed to images, should produce strictly more general visual representations. However, while self-supervised video representation learning has seen a variety of recent successes when evaluated on video-based tasks (Qian et al., 2021; Feichtenhofer et al., 2021; Toering et al., 2022; Dave et al., 2022; Feichtenhofer et al., 2022; Ni et al., 2022), it has typically done so by sacrificing performance on image classification (Gordon et al., 2020; Wu & Wang, 2021) relative to image pretraining. Furthermore, the specifics of video-representation ar-
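To make the contrastive-invariance idea concrete, the following is a minimal NumPy sketch of a generic InfoNCE objective, where positive pairs could be two augmentations of an image or, in the video setting, embeddings of two frames from the same clip. This is an illustrative example only, not the paper's actual objective; the function name and batch shapes are assumptions for the sketch.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE loss between two views.

    z1, z2: (N, D) arrays of embeddings. Row i of z1 and row i of z2
    form a positive pair (e.g. two frames of the same video); every
    other row in the batch acts as a negative.
    """
    # L2-normalize so dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) similarity matrix
    # Cross-entropy with the diagonal as targets: each embedding must
    # identify its positive counterpart among all candidates in the batch.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pulls positive pairs together while pushing apart embeddings of different clips, which is the sense in which the learned representation becomes invariant to the transformation relating the two views.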



Figure 1: Closing the gap between video- and image-pretraining. Prior work on learning image representations from video has lagged behind ImageNet pretraining. We propose a method for data curation, VideoNet, and a self-supervised learning objective, VITO, which together close this gap.

