TEMPERATURE SCHEDULES FOR SELF-SUPERVISED CONTRASTIVE METHODS ON LONG-TAIL DATA

Abstract

Most approaches for self-supervised learning (SSL) are optimised on curated, balanced datasets, e.g. ImageNet, despite the fact that natural data usually exhibits long-tail distributions. In this paper, we analyse the behaviour of one of the most popular variants of SSL, i.e. contrastive methods, on long-tail data. In particular, we investigate the role of the temperature parameter τ in the contrastive loss, by analysing the loss through the lens of average distance maximisation, and find that a large τ emphasises group-wise discrimination, whereas a small τ leads to a higher degree of instance discrimination. While τ has thus far been treated exclusively as a constant hyperparameter, in this work, we propose to employ a dynamic τ and show that a simple cosine schedule can yield significant improvements in the learnt representations. Such a schedule results in a continual 'task switching' between an emphasis on instance discrimination and group-wise discrimination, thereby ensuring that the model learns both group-wise features and instance-specific details. Since frequent classes benefit from the former, while infrequent classes require the latter, we find this method to consistently improve class separation in long-tail data without any additional computational cost.

1. INTRODUCTION

Deep Neural Networks have shown remarkable capabilities at learning representations of their inputs that are useful for a variety of tasks. Especially since the advent of recent self-supervised learning (SSL) techniques, rapid progress towards learning universally useful representations has been made. Currently, however, SSL on images is mainly carried out on benchmark datasets that have been constructed and curated for supervised learning (e.g. ImageNet (Deng et al., 2009), CIFAR (Krizhevsky et al., 2009), etc.). Although the labels of curated datasets are not explicitly used in SSL, the structure of the data still follows the predefined set of classes. In particular, the class-balanced nature of curated datasets could result in a learning signal for unsupervised methods. As such, these methods are often not evaluated in the settings they were designed for, i.e. learning from truly unlabelled data. Moreover, some methods (e.g. Asano et al., 2019; Caron et al., 2020) even explicitly enforce a uniform prior over the embedding or label space, which cannot be expected to hold for uncurated datasets. Since uncurated, real-world data tends to follow long-tail distributions (Reed, 2001), in this paper we analyse SSL methods on long-tailed data. Specifically, we analyse the behaviour of contrastive learning (CL) methods, which are among the most popular learning paradigms for SSL. In CL, models are trained such that embeddings of different samples are repelled, while embeddings of different 'views' (i.e. augmentations) of the same sample are attracted. The strength of these attractive and repelling forces between samples is controlled by a temperature parameter τ, which has been shown to play a crucial role in learning good representations (Chen et al., 2020c;a). To the best of our knowledge, τ has thus far almost exclusively been treated as a constant hyperparameter.
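The role of τ in the attraction/repulsion mechanism can be illustrated with the InfoNCE loss. The following is a minimal sketch (not the authors' code; the function name is ours, and it simplifies to in-batch negatives with positives on the diagonal): dividing the cosine similarities by a small τ sharpens the softmax, so each sample is pushed away from all others individually (instance discrimination), while a large τ flattens it.

```python
import numpy as np

def info_nce_loss(z1, z2, tau=0.5):
    """Simplified InfoNCE loss: positives on the diagonal, all other
    samples in the batch act as negatives.

    z1, z2: (N, D) arrays holding embeddings of two views of N samples.
    tau: temperature; a small tau sharpens the softmax (stronger instance
    discrimination), a large tau flattens it (group-wise discrimination).
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy of the positives
```

Note that practical implementations such as SimCLR use 2N−1 negatives per anchor and learnable encoders; the sketch only exposes how τ rescales the similarity logits.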
In contrast, we employ a dynamic τ during training and show that this has a strong effect on the learned embedding space for long-tail distributions. In particular, by introducing a simple schedule for τ we consistently improve the representation quality across a wide range of settings. Crucially, these gains are obtained without additional costs and only require oscillating τ with a cosine schedule. This mechanism is grounded in our novel understanding of the effect of temperature on the contrastive loss. In particular, we analyse the contrastive loss from an average distance maximisation perspective, which gives intuitive insights as to why a large temperature emphasises group-wise discrimination, whereas a small temperature leads to a higher degree of instance discrimination and more uniform distributions over the embedding space. Varying τ during training ensures that the model learns both group-wise and instance-specific features, resulting in better separation between head and tail classes. Overall, our contributions are summarised as follows:
• we carry out an extensive analysis of the effect of τ on imbalanced data;
• we analyse the contrastive loss from an average distance perspective to understand the emergence of semantic structure;
• we propose a simple yet effective temperature schedule that improves the performance across different settings;
• we show that the proposed τ scheduling is robust and consistently improves the performance for different hyperparameter choices.
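The oscillation described above can be sketched as follows (a minimal illustration, not the authors' code; the bounds tau_min, tau_max and the period are placeholder values, not the paper's experimental settings):

```python
import numpy as np

def cosine_tau(step, period, tau_min=0.1, tau_max=1.0):
    """Cosine schedule oscillating tau between tau_max (at step 0) and
    tau_min (at step period/2), repeating every `period` steps.

    Small tau phases emphasise instance discrimination; large tau phases
    emphasise group-wise discrimination.
    """
    return tau_min + 0.5 * (tau_max - tau_min) * (1 + np.cos(2 * np.pi * step / period))
```

The schedule adds no computation beyond evaluating one cosine per training step; τ simply replaces the constant temperature in the contrastive loss.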

2. RELATED WORK

Self-supervised representation learning (SSL) from visual data is a quickly evolving field. Recent methods are based on various forms of comparing embeddings between transformations of input images. We divide current methods into two categories: contrastive learning (He et al., 2020; Chen et al., 2020c;a; Oord et al., 2018) and non-contrastive learning (Grill et al., 2020; Zbontar et al., 2021; Chen & He, 2021; Bardes et al., 2022; Wei et al., 2022; Gidaris et al., 2021; Asano et al., 2019; Caron et al., 2020; He et al., 2022). Our analysis concerns the structure and the properties of the embedding space of contrastive methods when training on imbalanced data. Consequently, this section focuses on contrastive learning methods, their analysis, and their application to imbalanced training datasets.

Contrastive Learning employs instance discrimination (Wu et al., 2018) to learn representations by forming positive pairs of images through augmentations and a loss formulation that maximises their similarity while simultaneously minimising the similarity to other samples. Methods such as MoCo (He et al., 2020; Chen et al., 2020c), SimCLR (Chen et al., 2020a;b), SwAV (Caron et al., 2020), CPC (Oord et al., 2018), CMC (Tian et al., 2020a), and Whitening (Ermolov et al., 2021) have shown impressive representation quality and down-stream performance using this learning paradigm. CL has also found applications beyond SSL pre-training, such as multi-modal learning (Shvetsova et al., 2022), domain generalisation (Yao et al., 2022), semantic segmentation (Van Gansbeke et al., 2021), 3D point cloud understanding (Afham et al., 2022), and 3D face generation (Deng et al., 2020).

Negatives. The importance of negatives for contrastive learning has been noted in many prior works (Wang et al., 2021; Yeh et al., 2021; Zhang et al., 2022; Iscen et al., 2018; Kalantidis et al., 2020; Robinson et al., 2020; Khaertdinov et al., 2022). Yeh et al. (2021) propose decoupled learning by removing the positive term from the denominator, Robinson et al. (2020) develop an unsupervised hard-negative sampling technique, Wang et al. (2021) propose to employ a triplet loss, and Zhang et al. (2022); Khaertdinov et al. (2022) propose to improve negative mining with the help of different temperatures for positive and negative samples, defined as input-independent or input-dependent functions, respectively.

In contrast to explicitly choosing a specific subset of negatives, we discuss the Info-NCE loss (Oord et al., 2018) through the lens of an average distance perspective with respect to all negatives and show that the temperature parameter can be used to implicitly control the effective number of negatives.

Imbalanced Self-Supervised Learning. Learning on imbalanced data instead of curated balanced datasets is an important application, since natural data commonly follows long-tailed distributions (Reed, 2001; Liu et al., 2019; Wang et al., 2017). In recent work, Kang et al. (2020), Yang & Xu (2020), Liu et al. (2021), Zhong et al. (2022), and Gwilliam & Shrivastava (2022) find that self-supervised learning generally learns a more robust embedding space than its supervised counterpart. Tian et al. (2021) explore the down-stream performance of contrastive learning on standard benchmarks based on large-scale uncurated pre-training and propose a multi-stage distillation framework to overcome the shift in the distribution of image classes. Jiang et al. (2021); Zhou et al. (2022) propose to address the data imbalance by identifying and then emphasising tail samples during training in an unsupervised manner. For this, Jiang et al. (2021) compare the outputs of the trained model before and after pruning, assuming that tail samples are more easily 'forgotten' by the pruned model and can thus be identified. Zhou et al. (2022) use the loss value for each input to identify tail samples and then use stronger augmentations for those. Instead of modifying the architecture

