OPENCOS: CONTRASTIVE SEMI-SUPERVISED LEARNING FOR HANDLING OPEN-SET UNLABELED DATA

Abstract

Modern semi-supervised learning methods conventionally assume that labeled and unlabeled data share the same class distribution. In practice, however, unlabeled data may include out-of-class samples, i.e., samples that cannot be assigned one-hot encoded labels from the closed set of classes in the labeled data: the unlabeled data is an open set. In this paper, we introduce OpenCoS, a method for handling this realistic semi-supervised learning scenario based on a recent framework of contrastive learning. One of our key findings is that out-of-class samples in the unlabeled dataset can be identified effectively via (unsupervised) contrastive learning. OpenCoS utilizes this information to overcome the failure modes of existing state-of-the-art semi-supervised methods, e.g., ReMixMatch or FixMatch. In particular, we propose to assign soft-labels to out-of-class samples using the representation learned from contrastive learning. Our extensive experimental results show the effectiveness of OpenCoS, fixing the state-of-the-art semi-supervised methods to suit diverse scenarios involving open-set unlabeled data. The code will be released.

1. INTRODUCTION

Despite the recent success of deep neural networks with large-scale labeled data, many real-world scenarios suffer from expensive data acquisition and labeling costs. This has motivated the community to develop semi-supervised learning (SSL; Grandvalet & Bengio 2004; Chapelle et al. 2009), i.e., methods that further incorporate unlabeled data for training. Indeed, recent SSL works (Berthelot et al., 2019; 2020; Sohn et al., 2020) demonstrate promising results on several benchmark datasets, even approaching the performance of fully supervised learning with only a small number of labels, e.g., 93.73% accuracy on CIFAR-10 with 250 labeled samples (Berthelot et al., 2020). However, SSL methods often fail to generalize when there is a mismatch between the class distributions of labeled and unlabeled data (Oliver et al., 2018; Chen et al., 2020c; Guo et al., 2020), i.e., when the unlabeled data contains out-of-class samples whose ground-truth labels are not contained in the labeled dataset (as illustrated in Figure 1(a)). In this scenario, the label-guessing techniques used in existing SSL methods may label those out-of-class samples incorrectly, which in turn significantly harms the overall training through their inner processes of entropy minimization (Grandvalet & Bengio, 2004; Lee, 2013) or consistency regularization (Xie et al., 2019; Sohn et al., 2020). This problem may largely hinder existing SSL methods from being used in practice, considering the open-set nature of unlabeled data collected in the wild (Bendale & Boult, 2016).

Contribution. In this paper, we focus on a realistic SSL scenario, where unlabeled data may contain unknown out-of-class samples, i.e., there is a class-distribution mismatch between labeled and unlabeled data (Oliver et al., 2018).
Compared to prior approaches that have bypassed this problem by simply filtering such samples out with heuristic detection scores (Nair et al., 2019; Chen et al., 2020c), the unique characteristic of our approach is to further leverage the information in out-of-class samples by assigning soft-labels to them: they may still contain useful features for the in-classes. Somewhat surprisingly, we found that a recent technique of contrastive unsupervised learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020a) can play a key role toward this goal. More specifically, we show that a representation pre-trained via contrastive learning, namely SimCLR (Chen et al., 2020a), on both labeled and unlabeled data enables us to design (a) an effective score for detecting out-of-class samples in unlabeled data, and (b) a systematic way to assign soft-labels to the detected out-of-class samples, by modeling class-conditional likelihoods from labeled data. Finally, we found that (c) auxiliary batch normalization layers (Xie et al., 2020) could further help to mitigate the class-distribution mismatch by decoupling the batch normalization statistics of labeled and unlabeled data. Based on these techniques, we propose OpenCoS, a generic SSL framework for handling open-set unlabeled data that can be integrated with any existing SSL method.

We verify the effectiveness of the proposed method on a wide range of SSL benchmarks based on the CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009) datasets, assuming the presence of various out-of-class data, e.g., the SVHN (Netzer et al., 2011) and TinyImageNet datasets. Our experimental results demonstrate that OpenCoS greatly improves existing state-of-the-art SSL methods (Berthelot et al., 2019; 2020; Sohn et al., 2020), not only by discarding out-of-class samples, but also by further leveraging them in training. We also compare our method to other recent works (Nair et al., 2019; Chen et al., 2020c; Guo et al., 2020) addressing the same class-distribution mismatch problem in SSL, and again confirm the effectiveness of our framework: e.g., we achieve an accuracy of 68.37% with 40 labels (just 4 labels per class) on CIFAR-10 with TinyImageNet as out-of-class data, compared to 56.32% for DS3L (Guo et al., 2020).
Overall, our work highlights the benefit of unsupervised representations in (semi-)supervised learning: such a label-free representation turns out to enhance model generalization due to its robustness to novel, out-of-class samples.
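To make ideas (a) and (b) concrete, the following NumPy sketch scores unlabeled embeddings by their similarity to class prototypes computed from labeled embeddings. The prototype-based score, the threshold, and the temperature `tau` are simplifying assumptions for exposition, not the exact formulation used by OpenCoS:

```python
import numpy as np

def class_prototypes(z_l, y_l, num_classes):
    """Mean (L2-normalized) contrastive embedding of each in-class,
    computed from the labeled data."""
    protos = np.stack([z_l[y_l == c].mean(axis=0) for c in range(num_classes)])
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def detect_and_softlabel(z_u, protos, threshold=0.5, tau=0.1):
    """Score each unlabeled embedding by its max cosine similarity to any
    class prototype; samples below `threshold` are flagged as out-of-class.
    Soft-labels over the in-classes come from a tempered softmax over the
    prototype similarities."""
    z_u = z_u / np.linalg.norm(z_u, axis=1, keepdims=True)
    sims = z_u @ protos.T                      # (N_u, C) cosine similarities
    score = sims.max(axis=1)                   # in-class confidence score
    is_out = score < threshold
    logits = sims / tau
    logits -= logits.max(axis=1, keepdims=True)
    soft = np.exp(logits)
    soft /= soft.sum(axis=1, keepdims=True)    # soft-labels over in-classes
    return is_out, soft
```

In this toy version, in-class unlabeled samples would be passed to the base SSL method as usual, while detected out-of-class samples receive the soft-labels instead of hard pseudo-labels.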

2.1. SEMI-SUPERVISED LEARNING

The goal of semi-supervised learning for classification is to train a classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ from a labeled dataset $\mathcal{D}_l = \{(x_l^{(i)}, y_l^{(i)})\}_{i=1}^{N_l}$, where each label $y_l^{(i)}$ is from a set of classes $\mathcal{Y} := \{1, \cdots, C\}$, and an unlabeled dataset $\mathcal{D}_u = \{x_u^{(i)}\}_{i=1}^{N_u}$, where each label $y_u^{(i)}$ exists but is assumed to be unknown. In an attempt to leverage the extra information in $\mathcal{D}_u$, a number of techniques have been proposed, e.g., entropy minimization (Grandvalet & Bengio, 2004; Lee, 2013) and consistency regularization (Sajjadi et al., 2016). In general, recent approaches in semi-supervised learning can be distinguished by the prior they adopt for the representation of unlabeled data: for example, the consistency regularization technique (Sajjadi et al., 2016) attempts to minimize the cross-entropy loss between any two predictions of different augmentations $t_1(x_u)$ and $t_2(x_u)$ of a given unlabeled sample $x_u$, jointly with the standard training for a labeled sample $(x_l, y_l)$:

$$\mathcal{L}_{\mathrm{SSL}}(x_l, x_u) := H(y_l, f(x_l)) + \beta \cdot H(f(t_1(x_u)), f(t_2(x_u))), \quad (1)$$

where $H$ is the standard cross-entropy loss for labeled data, and $\beta$ is a hyperparameter. Recently, several "holistic" approaches combining various techniques (Zhang et al., 2018; Cubuk et al., 2019) have shown remarkable performance in practice, e.g., MixMatch (Berthelot et al., 2019), ReMixMatch (Berthelot et al., 2020), and FixMatch (Sohn et al., 2020), which we mainly consider in this paper. We note that our scheme can be integrated with any recent semi-supervised learning method.
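A minimal NumPy sketch of the loss in Eq. (1), assuming the classifier `f` returns a probability vector and `t1`, `t2` are stochastic augmentations (all three are placeholders supplied by the caller):

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) for probability vectors; target may be one-hot or soft."""
    return -np.sum(target * np.log(pred + eps))

def ssl_loss(f, x_l, y_l, x_u, t1, t2, beta=1.0, num_classes=10):
    """Eq. (1): supervised cross-entropy on the labeled sample, plus a
    consistency term between two augmented views of the unlabeled sample."""
    y_onehot = np.eye(num_classes)[y_l]
    supervised = cross_entropy(y_onehot, f(x_l))
    consistency = cross_entropy(f(t1(x_u)), f(t2(x_u)))
    return supervised + beta * consistency
```

Note that the consistency term is minimized when the two augmented views receive identical confident predictions, which is exactly why mislabeled out-of-class samples can reinforce their own wrong pseudo-labels.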



Figure 1: (a) Illustration of open-set unlabeled data under class-distribution mismatch in semi-supervised learning, i.e., unlabeled data may contain unknown out-of-class samples. (b) Comparison of median test accuracy under varying proportions of out-of-class samples on the CIFAR-10 + TinyImageNet benchmark with 25 labels per class.
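The auxiliary batch normalization idea discussed above (Xie et al., 2020) keeps separate normalization statistics for labeled and unlabeled batches. A toy 1-D NumPy sketch of such decoupled statistics, under the simplifying assumptions of no learnable affine parameters and a fixed branch chosen per batch:

```python
import numpy as np

class SplitBatchNorm:
    """Toy 1-D batch norm with separate running statistics per data source.
    Branch "main" normalizes labeled batches; branch "aux" normalizes
    unlabeled (possibly out-of-class) batches, so that out-of-class samples
    do not corrupt the statistics used for in-class data."""
    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.stats = {b: [np.zeros(dim), np.ones(dim)] for b in ("main", "aux")}
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, branch="main", training=True):
        mean, var = self.stats[branch]
        if training:
            batch_mean, batch_var = x.mean(axis=0), x.var(axis=0)
            mean += self.momentum * (batch_mean - mean)  # update running stats in place
            var += self.momentum * (batch_var - var)
            mean, var = batch_mean, batch_var            # normalize with batch stats
        return (x - mean) / np.sqrt(var + self.eps)
```

At inference time only the "main" branch would be used, mirroring how auxiliary batch norms are typically discarded after training.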

