FAST TRAINING OF CONTRASTIVE LEARNING WITH INTERMEDIATE CONTRASTIVE LOSS

Abstract

Recently, representations learned by self-supervised approaches have significantly reduced the gap with their supervised counterparts on many computer vision tasks. However, these self-supervised methods are computationally expensive. In this work, we focus on accelerating contrastive learning algorithms with little or no loss of accuracy. Our insight is that contrastive learning concentrates on optimizing the similarity (dissimilarity) between pairs of inputs, and that the similarity at intermediate layers is a good surrogate for the final similarity. We exploit this observation by introducing additional intermediate contrastive losses. In this way, we can truncate back-propagation and update only a subset of the parameters in each gradient descent step. Additionally, we use the intermediate losses to filter out easy regions of each image, which further reduces the computational cost. We apply our method to the recently proposed MoCo (He et al., 2020), SimCLR (Chen et al., 2020a), and SwAV (Caron et al., 2020), and find that we can reduce the computational cost with little loss in performance on ImageNet linear classification and other downstream tasks.

1. INTRODUCTION

Recently, self-supervised learning has emerged as a promising approach for unsupervised and semi-supervised learning in computer vision (Oord et al., 2018; Kolesnikov et al., 2019; Zhai et al., 2019; He et al., 2020; Chen et al., 2020b;a; Grill et al., 2020). These methods learn unsupervised representations that perform well on both ImageNet (Deng et al., 2009) linear classification and other downstream tasks, e.g., pose estimation, detection, and semantic segmentation (He et al., 2020; Caron et al., 2020). A major family of self-supervised methods is contrastive learning, which constructs similar and dissimilar pairs over the dataset and minimizes a contrastive loss to learn a mapping that gives similar (resp. dissimilar) pairs similar (resp. dissimilar) representations.

Despite these recent successes, contrastive learning has been found in practice to incur longer training time and higher computational cost than supervised learning (He et al., 2020; Grill et al., 2020). For example, He et al. (2020) and Chen et al. (2020a) require about 5× the training time of standard supervised learning on ImageNet. This enormous time and computational cost puts large-scale contrastive learning out of reach for many researchers and applications.

This work focuses on speeding up contrastive learning. Our key observation is that, because contrastive learning optimizes the similarity (dissimilarity) between pairs of inputs, the similarity at intermediate layers provides a good surrogate for the final similarity, and computing an intermediate representation costs less. This is in contrast with supervised learning, which must match the output of the final layer with a label, so computing the final output is essential.
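To make the loss being discussed concrete, the following is a minimal NumPy sketch of an InfoNCE-style contrastive loss of the kind used by MoCo and SimCLR. The function name and the batch layout (row i of one view is the positive pair of row i of the other) are our own illustrative choices, not the exact implementation from any of the cited papers.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss.

    z1, z2: (N, D) embeddings of two views; row i of z1 and row i of z2
    form a positive pair, all other rows serve as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; maximize their log-probability.
    return -np.mean(np.diag(log_probs))
```

Aligned views (each row matched with itself) should give a much smaller loss than mismatched views, which is exactly the signal contrastive training descends on.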

We exploit this observation by introducing additional contrastive losses at intermediate layers of the neural network. Instead of measuring the contrastive loss only on the representation of the last layer, we also compute contrastive losses at intermediate blocks. The intermediate losses enable the following two strategies that accelerate contrastive learning. (1) Partial Back-propagation: In each optimization step, we start back-propagation from one of the contrastive losses, chosen at random. Compared with performing full back-propagation in every step, an intermediate starting point only requires computing gradients for a subset of the parameters. (2) Block-wise Hard Pair Selection: The intermediate losses let us identify and filter out easy regions of each image, so that later blocks process only the remaining hard regions, further reducing the computational cost.
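The partial back-propagation idea can be sketched as follows. This is a toy NumPy illustration under simplifying assumptions of our own: the "blocks" are plain linear layers rather than ResNet stages, and the intermediate contrastive loss is replaced by a simple mean-squared alignment loss between the two views' representations. The point of the sketch is the control flow: a stopping block k is sampled at random, and gradients are computed and applied only for the blocks at or below k.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder: a stack of linear "blocks" (a stand-in for ResNet stages).
dims = [32, 32, 32, 32]
weights = [rng.normal(scale=0.1, size=(dims[i], dims[i + 1])) for i in range(3)]

def forward(x, weights):
    """Return the representation after every block (acts[k] = block-k output)."""
    acts = [x]
    for W in weights:
        acts.append(acts[-1] @ W)
    return acts

def partial_backprop_step(x1, x2, weights, lr=0.01):
    """One step: pick a random intermediate loss and update only the
    blocks below it (truncated back-propagation)."""
    a1, a2 = forward(x1, weights), forward(x2, weights)
    k = int(rng.integers(1, len(weights) + 1))   # random stopping block
    # Simplified alignment loss between the two views at block k
    # (a stand-in for the paper's intermediate contrastive loss).
    diff = a1[k] - a2[k]
    loss = np.mean(diff ** 2)
    g1 = 2 * diff / diff.size                    # dL/d a1[k]
    g2 = -g1                                     # dL/d a2[k]
    for j in range(k - 1, -1, -1):               # back-prop only through blocks <= k
        grad_W = a1[j].T @ g1 + a2[j].T @ g2
        g1, g2 = g1 @ weights[j].T, g2 @ weights[j].T
        weights[j] -= lr * grad_W                # blocks above k are never touched
    return loss, k
```

Because each step stops at a random block, the expected cost of the backward pass drops compared with always back-propagating through the full network, while repeated steps still pull the two views' final representations together.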

