WATCHING THE WORLD GO BY: REPRESENTATION LEARNING FROM UNLABELED VIDEOS

Abstract

Recent unsupervised representation learning techniques show remarkable success on many single-image tasks by using instance discrimination: learning to differentiate between two augmented versions of the same image and a large batch of unrelated images. Prior work relies on artificial data augmentations such as cropping and color jitter, which alter the image only superficially and do not reflect how objects actually change, e.g., through occlusion, deformation, or viewpoint change. We argue that videos offer this natural augmentation for free: a video can provide entirely new views of objects, show deformation, and even connect semantically similar but visually distinct concepts. We propose Video Noise Contrastive Estimation (VINCE), a method for using unlabeled video to learn strong, transferable, single-image representations. We demonstrate improvements over recent unsupervised single-image techniques, as well as over fully supervised ImageNet pretraining, across temporal and non-temporal tasks.

1. INTRODUCTION

The world seen through our eyes is constantly changing. As we move through the world, we see much more than a single static image: objects rotate to reveal occluded regions and deform, the surroundings change, and we ourselves move. Our visual system constantly sees temporally coherent images. Yet many popular computer vision models learn representations that are limited to inference on single images, lacking temporal context. Representations learned from static images are inherently limited to an understanding of the world as many unrelated static snapshots. This is especially true of recent unsupervised learning techniques (Bachman et al., 2019; Chen et al., 2020a; He et al., 2020; Hénaff et al., 2019; Hjelm et al., 2019; Misra & Maaten, 2020; Tian et al., 2019; Wu et al., 2018), all of which train on a highly curated, well-balanced dataset: ImageNet (Deng et al., 2009). Scaling up these techniques to larger, less-curated datasets like Instagram-1B (Mahajan et al., 2018) has not provided large improvements in performance (He et al., 2020). Only so much can be learned from a single image: no amount of artificial augmentation can show a new view of an object or what might happen next in a scene. This dichotomy can be seen in Figure 1. To move beyond this limitation, we argue that video supplies significantly more semantically meaningful content than a single image. With video, we can see how the world changes, find connections between images, and more directly observe the underlying scene. Prior work using temporal cues has shown success in learning from unlabeled videos (Misra et al., 2016; Wang et al., 2019; Srivastava et al., 2015), but has not been able to surpass supervised pretraining. On the other hand, single-image techniques have improved over the state of the art by using Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010).
In this work, we merge the two concepts with Video Noise Contrastive Estimation (VINCE), a method for using unlabeled videos as a basis for learning visual representations. Instead of predicting whether two feature vectors come from the same underlying image, we task our network with predicting whether two images originate from the same video. Not only does this allow our method to learn how a single object might change, it also enables learning which things tend to appear in a scene together, e.g., cats are more likely to appear in videos with dogs than with sharks. Additionally, we generalize the NCE technique to operate on multiple positive pairs from a single source.
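The video-level pairing described above can be sketched as a simple batch sampler in which frames drawn from the same video share a label (positive pair) and frames from different videos do not. This is an illustrative sketch only; the function and variable names are not from the paper's released code:

```python
import random

def sample_video_batch(video_frames, num_videos=4, frames_per_video=2, seed=0):
    """Sample a contrastive batch: frames from the same video are positives,
    frames from different videos serve as negatives.

    video_frames maps a video id to its list of frame identifiers.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(video_frames), num_videos)
    batch, labels = [], []
    for label, vid in enumerate(chosen):
        frames = rng.sample(video_frames[vid], frames_per_video)
        batch.extend(frames)
        labels.extend([label] * frames_per_video)  # shared label => positive pair
    return batch, labels

# Toy example: three videos with a few frames each.
videos = {"v0": ["v0_f0", "v0_f1", "v0_f2"],
          "v1": ["v1_f0", "v1_f1"],
          "v2": ["v2_f0", "v2_f1", "v2_f2", "v2_f3"]}
batch, labels = sample_video_batch(videos, num_videos=2)
```

With frames_per_video greater than two, every same-video pair in the batch can be treated as a positive, which is the multi-positive generalization discussed above.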

2.1. NOISE CONTRASTIVE ESTIMATION (NCE)

The NCE loss (Gutmann & Hyvärinen, 2010) is at the center of many recent representation learning methods (Bachman et al., 2019; Chen et al., 2020a; He et al., 2020; Hénaff et al., 2019; Hjelm et al., 2019; Misra & Maaten, 2020; Tian et al., 2019; Wu et al., 2018). Similar to the triplet loss (Chechik et al., 2010), the basic principle behind NCE is to maximize the similarity between an anchor data point and a positive data point while minimizing similarity to all other (negative) points. A challenge for using NCE in an unsupervised fashion is devising a way to construct positive pairs. Pairs should be different enough that a network learns a non-trivial representation, but structured enough that the learned representation is useful for downstream tasks. A standard approach (Bachman et al., 2019; Chen et al., 2020a; He et al., 2020) is to treat two artificially augmented versions of the same image as a positive pair; Hénaff et al. (2019) instead uses crops of an image as "context" and predicts features for the unseen portions of the image. We provide a more natural data augmentation by using multiple frames from a single video. As a video progresses, the objects in the scene, the background, and the camera itself may move, providing new views. Whereas augmentations of an image are constrained to a single snapshot in time, different frames from a single video give entirely new information about the scene. Additionally, rather than restricting our method to only two frames per video, we generalize the NCE technique to use many images from a single video, resulting in more computational reuse and a better final representation (Bachman et al. (2019) similarly makes multiple comparisons per pair, but each anchor has only one positive).
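To make the objective concrete, the following is a minimal NumPy sketch of an NCE-style contrastive loss in which each anchor may have several positives, as in the multi-positive generalization above. This is a hedged illustration, not the paper's implementation; the similarity function, temperature value, and interface are assumptions:

```python
import numpy as np

def info_nce_loss(anchors, others, positives, temperature=0.07):
    """Contrastive (NCE-style) loss over cosine similarities.

    anchors: (N, D) embeddings; others: (M, D) embeddings to compare against;
    positives[i] lists the indices in `others` that are positives for anchor i.
    Multiple indices per anchor give the multi-positive generalization.
    """
    # L2-normalize so the dot product is cosine similarity, then scale.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    b = others / np.linalg.norm(others, axis=1, keepdims=True)
    logits = a @ b.T / temperature                               # (N, M)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average -log p(positive) over each anchor's positives, then over anchors.
    per_anchor = [-log_prob[i, positives[i]].mean() for i in range(len(anchors))]
    return float(np.mean(per_anchor))

# Toy check: with each anchor's own embedding as its positive, the loss is
# much lower than when an unrelated embedding is labeled positive.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
loss_matched = info_nce_loss(emb, emb, [[0], [1], [2], [3]])
loss_mismatched = info_nce_loss(emb, emb, [[1], [0], [3], [2]])
```

In the video setting, `anchors` and `others` would be embeddings of frames, and `positives[i]` would index the other frames drawn from the same video as anchor i.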



Code and the Random Related Video Views dataset will be made available.



Figure 1: NCE vs. VINCE. The standard contrastive setup learns to separate artificial augmentations of the same image. Our method uses novel views and temporal consistency which single images cannot provide.

To facilitate this learning, we construct Random Related Video Views (R2V2), a set of 960,000 frames from 240,000 uncurated videos. Using our learning technique, we achieve across-the-board improvements over the recent Momentum Contrast method (He et al., 2020) as well as over a network pretrained on supervised ImageNet, on diverse tasks such as scene classification, activity recognition, and object tracking.

