FAST TRAINING OF CONTRASTIVE LEARNING WITH INTERMEDIATE CONTRASTIVE LOSS

Abstract

Recently, representations learned by self-supervised approaches have significantly reduced the gap with their supervised counterparts in many computer vision tasks. However, these self-supervised methods are computationally demanding. In this work, we focus on accelerating contrastive learning algorithms with little or no loss of accuracy. Our insight is that contrastive learning concentrates on optimizing the similarity (dissimilarity) between pairs of inputs, and the similarity at intermediate layers is a good surrogate for the final similarity. We exploit this observation by introducing additional intermediate contrastive losses. In this way, we can truncate back-propagation and update only part of the parameters in each gradient descent step. Additionally, we perform selection based on the intermediate losses to filter out easy regions of each image, which further reduces the computational cost. We apply our method to the recently proposed MOCO (He et al., 2020), SimCLR (Chen et al., 2020a), and SwAV (Caron et al., 2020), and find that we can reduce the computational cost with little loss in performance on ImageNet linear classification and other downstream tasks.

1. INTRODUCTION

Recently, self-supervised learning has been shown to be a promising approach for unsupervised and semi-supervised learning in computer vision (Oord et al., 2018; Kolesnikov et al., 2019; Zhai et al., 2019; He et al., 2020; Chen et al., 2020b;a; Grill et al., 2020). These methods learn unsupervised representations that perform well on both ImageNet (Deng et al., 2009) linear classification and other downstream tasks, e.g. pose estimation, detection, and semantic segmentation (He et al., 2020; Caron et al., 2020). A major family of self-supervised learning methods is contrastive learning, which constructs similar and dissimilar pairs over the dataset and minimizes a contrastive loss to learn a mapping that assigns similar (resp. dissimilar) representations to similar (resp. dissimilar) pairs. Despite these recent successes, contrastive learning has been found in practice to incur longer training times and higher computational cost than supervised learning (He et al., 2020; Grill et al., 2020). For example, He et al. (2020) and Chen et al. (2020a) require about 5× the training time of standard supervised learning on ImageNet. This enormous time and computational cost puts large-scale contrastive learning out of reach for many researchers and applications.

This work focuses on speeding up contrastive learning. Our key observation is that, because contrastive learning optimizes the similarity (dissimilarity) between pairs of inputs, the similarity at intermediate layers provides a good surrogate for the final similarity, and computing the intermediate representations requires less computation. This contrasts with supervised learning, which must match the output of the final layer to a label, making it essential to compute the final outputs. We test the proposed method on several recent self-supervised learning algorithms: MOCO (He et al., 2020), SwAV (Caron et al., 2020), SimCLR (Chen et al., 2020a), and MOCO v2 (Chen et al., 2020b).
We empirically show that our method saves training time with almost no loss in the final performance of downstream tasks, e.g. ImageNet linear classification and PASCAL VOC object detection and segmentation. Our method reduces the training cost of contrastive learning methods by over 30%, and can serve as an alternative to the standard self-supervised training pipeline when computational resources are limited.

2. METHOD

We first give a brief introduction to contrastive learning in Sec. 2.1, and then introduce our method in Sec. 2.2 and Sec. 2.3. Our method is composed of two major components: (1) randomly stopping back-propagation early at an intermediate layer for each mini-batch, and (2) using random crops and selecting hard regions based on the hidden states.

2.1. CONTRASTIVE LEARNING

Given an unlabeled set of images D = {x_i}, we want to learn a representation map f that extracts useful low-dimensional representations from the high-dimensional images x. In contrastive learning, for each x ∼ D, we use data augmentation or other techniques to construct a positive example x^+ that is similar to x and a set of negative examples {x^-_k}_{k=1}^K that are less similar to x than x^+. We then train the map f to maximize the similarity between f(x) and f(x^+), while minimizing the similarity between f(x) and each f(x^-_k). A popular choice of contrastive loss is InfoNCE (Oord et al., 2018),

L_info(f(x), f(x^+)) = -\log \frac{\exp\left(\frac{1}{\tau} f(x)^\top f(x^+)\right)}{\sum_{k=1}^{K} \exp\left(\frac{1}{\tau} f(x)^\top f(x^-_k)\right) + \exp\left(\frac{1}{\tau} f(x)^\top f(x^+)\right)},

where τ is a temperature hyper-parameter. The encoder aims to minimize the InfoNCE loss over all images in the dataset,

L(f) = E_{x∼D} [ L_info(f(x), f(x^+)) ].

Intuitively, the InfoNCE loss is the log loss of a (K+1)-way softmax-based classifier that tries to classify f(x) into the same class as f(x^+). There are many different ways to construct the positive and negative examples. In practice, prior work uses context (Oord et al., 2018), data augmentation (Chen et al., 2020a; He et al., 2020; Hénaff et al., 2019), colorization (Oord et al., 2018), clustering (Caron et al., 2020; 2018), etc., to construct the positive pair {x, x^+}. More generally, each image x can have multiple positive examples (Tian et al., 2019). Previous works have used in-batch data (Chen et al., 2020a), a memory bank (He et al., 2020), or different regions of one given image (Oord et al., 2018) to generate {x^-_k}_{k=1}^K for each x.
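The InfoNCE loss above can be sketched in a few lines. This is a minimal illustration using plain Python lists as stand-ins for the encoder outputs f(x), f(x^+), and f(x^-_k); the helper names (`dot`, `info_nce`) are ours, not from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(q, pos, negs, tau=0.07):
    """InfoNCE: -log of the softmax probability assigned to the positive pair."""
    pos_logit = dot(q, pos) / tau
    neg_logits = [dot(q, n) / tau for n in negs]
    # log-sum-exp over the positive and all K negatives (the denominator),
    # with the max subtracted for numerical stability
    logits = [pos_logit] + neg_logits
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_logit - log_denom)

q    = [1.0, 0.0]
pos  = [0.9, 0.1]                      # similar to q -> small loss
negs = [[-1.0, 0.0], [0.0, 1.0]]       # dissimilar to q
loss = info_nce(q, pos, negs)
```

The small temperature τ sharpens the softmax, so a positive pair that is clearly more similar than the negatives yields a loss near zero, while a dissimilar "positive" yields a large loss.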

2.2. ACCELERATED TRAINING BY PARTIAL BACK-PROPAGATION

Our approach stems from a fundamental property of contrastive representation learning: the contrastive loss measures the similarity between representations. Due to the hierarchical nature of deep neural networks, representation pairs that are similar in early layers typically remain similar after being processed by later layers. Therefore, we argue that full back-propagation is not necessary for contrastive representation learning; rather, partial back-propagation suffices to learn useful representations in the final layer. Intermediate losses have been widely used in deep learning, from the Inception network (Szegedy et al., 2015) to DARTS (Liu et al., 2019). We consider a deep neural network f = f_n ∘ f_{n-1} ∘ ... ∘ f_1, where f_i denotes the i-th building block of the network. For feed-forward neural networks, a building block can be one hidden layer. For more complex networks, e.g. ResNet (He et al., 2016), a building block can be a sequence of convolutional layers, batch normalization layers (Ioffe & Szegedy, 2015), and activation functions.
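The block decomposition f = f_n ∘ ... ∘ f_1 can be made concrete with a toy sketch (our own, not the paper's code): run the blocks in order and keep every intermediate representation, so that a contrastive loss can later be attached to each one.

```python
def run_blocks(blocks, x):
    """Apply blocks in order; return [f_1(x), f_2(f_1(x)), ..., f_n(...)(x)]."""
    intermediates = []
    for f in blocks:
        x = f(x)
        intermediates.append(x)
    return intermediates

# three stand-in "blocks" acting on a scalar input, for illustration only
blocks = [lambda v: v + 1, lambda v: 2 * v, lambda v: v - 3]
inters = run_blocks(blocks, 5)   # -> [6, 12, 9]
```

In a real network each `f_i` would be a module (e.g. a ResNet stage) and each intermediate output a feature map rather than a scalar.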



We exploit this observation by introducing additional contrastive losses at intermediate layers of the neural network. Instead of measuring the contrastive loss only on the last-layer representation, we also compute contrastive losses at intermediate blocks. These intermediate losses enable the following two strategies that accelerate contrastive learning. (1) Partial Back-propagation: we start back-propagation from one of the contrastive losses chosen at random. Compared with performing full back-propagation at every optimization step, an intermediate starting point only requires computing gradients for a subset of the parameters. (2) Block-wise Hard Pair Selection: the intermediate contrastive losses serve as indicators of the similarity between early-layer representations. These indicators can be used to filter out easy pairs, reducing unnecessary computation.
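The two strategies can be sketched as follows, under our own simplifications: blocks are numbered 1..n, a contrastive loss sits after each block, and back-propagating from the loss at block i touches only the parameters of blocks 1..i. The uniform choice of stopping block and the fixed loss threshold are illustrative assumptions, not the paper's exact schedule.

```python
import random

def partial_backprop_blocks(n_blocks, rng):
    """Pick an intermediate loss uniformly at random; only the blocks
    below it (inclusive) receive gradient updates this step."""
    stop = rng.randint(1, n_blocks)      # index of the chosen loss
    return list(range(1, stop + 1))      # blocks whose parameters are updated

def select_hard_pairs(intermediate_losses, threshold):
    """Keep only pairs whose intermediate loss is still high (hard pairs);
    easy pairs are dropped before the more expensive later blocks."""
    return [i for i, l in enumerate(intermediate_losses) if l >= threshold]

rng = random.Random(0)
updated = partial_backprop_blocks(4, rng)
hard = select_hard_pairs([0.1, 2.3, 0.05, 1.8], threshold=0.5)  # -> [1, 3]
```

In expectation, the random stopping point updates only about half of the blocks per step, which is where the computational saving comes from; the hard-pair filter compounds this by shrinking the batch seen by deeper blocks.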


