WHITENING FOR SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

Most self-supervised representation learning methods are based on the contrastive loss and the instance-discrimination task, where augmented versions of the same image instance ("positives") are contrasted with instances extracted from other images ("negatives"). For the learning to be effective, many negatives should be compared with a positive pair, which is computationally demanding. In this paper, we propose a different direction and a new loss function for self-supervised representation learning which is based on the whitening of the latent-space features. The whitening operation has a "scattering" effect on the batch samples, which compensates for the absence of negatives, avoiding degenerate solutions where all the sample representations collapse to a single point. Our Whitening MSE (W-MSE) loss does not require special heuristics (e.g. additional networks) and it is conceptually simple. Since negatives are not needed, we can extract multiple positive pairs from the same image instance. We empirically show that W-MSE is competitive with respect to popular, more complex self-supervised methods. The source code of the method and all the experiments is included in the Supplementary Material.

1. INTRODUCTION

One of the current main bottlenecks in deep network training is the dependence on large annotated training datasets, and this motivates the recent surge of interest in unsupervised methods. Specifically, in self-supervised representation learning, a network is (pre-)trained without any form of manual annotation, thus providing a means to extract information from unlabeled-data sources (e.g., text corpora, videos, images from the Internet, etc.). In self-supervision, label information is replaced by a prediction problem using some form of context or using a pretext task. Pioneering work in this direction was done in Natural Language Processing (NLP), in which the co-occurrence of words in a sentence is used to learn a language model (Mikolov et al., 2013a;b; Devlin et al., 2019). In Computer Vision, typical contexts or pretext tasks are based on: (1) the temporal consistency in videos (Wang & Gupta, 2015; Misra et al., 2016; Dwibedi et al., 2019), (2) the spatial order of patches in still images (Noroozi & Favaro, 2016; Misra & van der Maaten, 2019; Hénaff et al., 2019) or (3) simple image transformation techniques (Ji et al., 2019; He et al., 2019; Wu et al., 2018). The intuitive idea behind most of these methods is to collect pairs of positive and negative samples: two positive samples should share the same semantics, while negatives should be perceptually different. A triplet loss (Sohn, 2016; Schroff et al., 2015; Hermans et al., 2017; Wang & Gupta, 2015; Misra et al., 2016) can then be used to learn a metric space which should represent the human perceptual similarity. However, most of the recent studies use a contrastive loss (Hadsell et al., 2006) or one of its variants (Gutmann & Hyvärinen, 2010; van den Oord et al., 2018; Hjelm et al., 2019), while Tschannen et al. (2019) show the relation between the triplet loss and the contrastive loss.
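For reference, the two loss families mentioned above can be sketched as follows. This is only an illustrative sketch, not the formulation of any specific cited paper: the squared Euclidean distance, the margin value, the temperature value and the function names are assumptions made here, and the contrastive loss is written in an InfoNCE-style form in the spirit of van den Oord et al. (2018).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (distance to positive) - (distance to negative) + margin.

    All inputs have shape (N, d). The margin value is an assumption.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.mean(np.maximum(0.0, d_pos - d_neg + margin))

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: each anchor has one positive and K negatives.

    anchor, positive: (N, d); negatives: (N, K, d).
    The temperature value is an assumption.
    """
    pos = np.sum(anchor * positive, axis=1, keepdims=True) / temperature  # (N, 1)
    neg = np.einsum("nd,nkd->nk", anchor, negatives) / temperature        # (N, K)
    logits = np.concatenate([pos, neg], axis=1)                            # (N, 1 + K)
    logits -= logits.max(axis=1, keepdims=True)                            # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -np.mean(log_prob_pos)

# Toy unit-norm embeddings: 4 anchors, 16 negatives each, dimension 8.
rng = np.random.default_rng(0)
unit = lambda a: a / np.linalg.norm(a, axis=-1, keepdims=True)
anchors = unit(rng.normal(size=(4, 8)))
positives = unit(anchors + 0.1 * rng.normal(size=(4, 8)))
negatives = unit(rng.normal(size=(4, 16, 8)))

print(triplet_loss(anchors, positives, negatives[:, 0]))
print(contrastive_loss(anchors, positives, negatives))
```

In both functions the negatives enter the loss explicitly, so their number and quality directly shape the training signal.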
It is worth noting that the success of both kinds of losses is strongly affected by the number and the quality of the negative samples. For instance, in the case of the triplet loss, a common practice is to select hard/semi-hard negatives (Schroff et al., 2015; Hermans et al., 2017). On the other hand, Hjelm et al. (2019) have shown that the contrastive loss needs a large number of negatives to be competitive. This implies using batches with a large size, which is computationally demanding, especially with high-resolution images. In order to alleviate this problem, Wu et al. (2018) use a memory bank of negatives, which is composed of feature-vector representations of all the training samples. He et al. (2019) conjecture that the use of large and fixed-representation vocabularies is one of the keys to the success of self-supervision in NLP. The solution proposed by He et al. (2019) extends Wu et al. (2018) using a memory-efficient queue of the last visited negatives, together with a momentum encoder which preserves the intra-queue representation consistency. Chen et al. (2020) have performed large-scale experiments confirming that a large number of negatives (and therefore a large batch size) is required for the contrastive loss to be efficient. Concurrently with our work, Grill et al. (2020) have suggested that it is not necessary to rely on the contrastive scheme, introducing a high-performing alternative based on bootstrapping.

In this paper we propose a new self-supervised loss function which first scatters all the sample representations in a spherical distribution (here and in the following, by "spherical distribution" we mean a distribution with zero mean and an identity covariance matrix) and then penalizes the positive pairs which are far from each other. In more detail, given a set of samples V = {v_i}, corresponding to the current minibatch of images B = {x_i}, we first project the elements of V onto a spherical distribution using a whitening transform (Siarohin et al., 2019). The whitened representations {z_i}, corresponding to V, are normalized and then used to compute a Mean Squared Error (MSE) loss which accumulates the error taking into account only positive pairs (z_i, z_j). We do not need to contrast positives against negatives as in the contrastive loss or in the triplet loss because the optimization process leads to shrinking the distance between positive pairs and, indirectly, scatters the other samples to satisfy the overall spherical-distribution constraint.

In summary, our contributions are the following:
• We propose a new loss function, Whitening MSE (W-MSE), for self-supervised training. W-MSE constrains the batch samples to lie in a spherical distribution and it is an alternative to positive-negative instance contrasting methods.
• Our loss does not rely on negatives, thus including more positive samples in the batch can be beneficial; we indeed demonstrate that multiple positive pairs extracted from one image improve the performance.
• We empirically show that our W-MSE loss outperforms the commonly adopted contrastive loss when measured using different standard classification protocols. We show that W-MSE is competitive with respect to state-of-the-art self-supervised methods.

2. BACKGROUND AND RELATED WORK

A typical self-supervised method is composed of two main components: a pretext task, which exploits some a-priori knowledge about the domain to automatically extract supervision from data, and a loss function. In this section we briefly review both aspects, and we additionally analyse the recent literature concerning feature whitening.

Pretext Tasks. The temporal consistency in a video provides an intuitive form of self-supervision: temporally-close frames usually contain a similar semantic content (Wang & Gupta, 2015; van den Oord et al., 2018). Misra et al. (2016) extended this idea using the relative temporal order of 3 frames, while Dwibedi et al. (2019) used a temporal cycle consistency for self-supervision, which is based on comparing two videos sharing the same semantics and computing inter-video frame-to-frame nearest neighbour assignments. When dealing with still images, the most common pretext task is instance discrimination (Wu et al., 2018): from a training image x, a composition of data-augmentation techniques is used to extract two different views of x (x_i and x_j). Commonly adopted transformations are: image cropping, rotation, color jittering, Sobel filtering, etc. The learner is then required to discriminate (x_i, x_j) from other views extracted from other samples (Wu et al., 2018; Ji et al., 2019; He et al., 2019; Chen et al., 2020).
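To make the two-view setup concrete, and to connect it with the whitening-based loss outlined in the Introduction, the following sketch draws two augmented views per image, embeds them, whitens the batch of embeddings, normalizes them and computes an MSE over positive pairs only. It is a minimal illustration under stated assumptions, not the implementation used in our experiments: the specific augmentations (random crop, horizontal flip, brightness jitter), the random linear projection standing in for the encoder, the Cholesky-based whitening and all function names are choices made here for illustration.

```python
import numpy as np

def random_view(image, rng, crop=24):
    """One augmented view: random crop, random horizontal flip, brightness jitter."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    view = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        view = view[:, ::-1]                      # flip along the width axis
    view = view * rng.uniform(0.6, 1.4)           # brightness jitter (assumed range)
    return np.clip(view, 0.0, 1.0)

def whiten(v, eps=1e-6):
    """Map embeddings v of shape (N, d) to zero mean and ~identity covariance."""
    centered = v - v.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(v) - 1)
    # Assumed whitening choice: cov + eps*I = L L^T, z = L^{-1} (v - mean).
    chol = np.linalg.cholesky(cov + eps * np.eye(v.shape[1]))
    return np.linalg.solve(chol, centered.T).T

def w_mse_sketch(v, pair_index):
    """MSE between whitened, L2-normalized embeddings of positive pairs only."""
    z = whiten(v)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    i, j = pair_index[:, 0], pair_index[:, 1]
    return np.mean(np.sum((z[i] - z[j]) ** 2, axis=1))

rng = np.random.default_rng(0)
images = rng.random((64, 32, 32, 3))              # toy stand-in for a minibatch B

# Two views per image: rows k and k + 64 of the batch form a positive pair.
views_a = np.stack([random_view(x, rng) for x in images])
views_b = np.stack([random_view(x, rng) for x in images])
views = np.concatenate([views_a, views_b])        # (128, 24, 24, 3)

# Hypothetical encoder: a fixed random linear projection to d = 8 dimensions.
proj = rng.normal(size=(24 * 24 * 3, 8)) / np.sqrt(24 * 24 * 3)
v = views.reshape(len(views), -1) @ proj          # embeddings V, shape (128, 8)

pair_index = np.stack([np.arange(64), np.arange(64) + 64], axis=1)
loss = w_mse_sketch(v, pair_index)
print(loss)
```

No negatives appear in the loss: the whitening step fixes the batch covariance to (approximately) the identity, so the embeddings cannot all collapse to a single point while the positive pairs are pulled together.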

Denoising auto-encoders (Vincent et al., 2008) add random noise to the input image and try to recover the original image. More sophisticated pretext tasks consist in predicting the spatial order of image patches (Noroozi & Favaro, 2016; Misra & van der Maaten, 2019) or in reconstructing large masked regions of the image (Pathak et al., 2016). Hjelm et al. (2019); Bachman et al. (2019) compare the holistic representation of an input image with a patch of the same image. Hénaff et al.

