OPENCOS: CONTRASTIVE SEMI-SUPERVISED LEARNING FOR HANDLING OPEN-SET UNLABELED DATA

Abstract

Modern semi-supervised learning methods conventionally assume that labeled and unlabeled data share the same class distribution. In practice, however, unlabeled data may include out-of-class samples, i.e., samples that cannot be assigned a one-hot encoded label from the closed set of classes in the labeled data; in other words, the unlabeled data is an open-set. In this paper, we introduce OpenCoS, a method for handling this realistic semi-supervised learning scenario based on a recent framework of contrastive learning. One of our key findings is that out-of-class samples in the unlabeled dataset can be identified effectively via (unsupervised) contrastive learning. OpenCoS utilizes this information to overcome the failure modes of existing state-of-the-art semi-supervised methods, e.g., ReMixMatch or FixMatch. In particular, we propose to assign soft-labels to out-of-class samples using the representation learned from contrastive learning. Our extensive experimental results show the effectiveness of OpenCoS, adapting state-of-the-art semi-supervised methods to diverse scenarios involving open-set unlabeled data. The code will be released.

1. INTRODUCTION

Despite the recent success of deep neural networks with large-scale labeled data, many real-world scenarios suffer from expensive data acquisition and labeling costs. This has motivated the community to develop semi-supervised learning (SSL; Grandvalet & Bengio 2004; Chapelle et al. 2009), i.e., further incorporating unlabeled data for training. Indeed, recent SSL works (Berthelot et al., 2019; 2020; Sohn et al., 2020) demonstrate promising results on several benchmark datasets, even approaching the performance of fully supervised learning using only a small number of labels, e.g., 93.73% accuracy on CIFAR-10 with 250 labeled samples (Berthelot et al., 2020). However, SSL methods often fail to generalize when there is a mismatch between the class distributions of labeled and unlabeled data (Oliver et al., 2018; Chen et al., 2020c; Guo et al., 2020), i.e., when the unlabeled data contains out-of-class samples, whose ground-truth labels are not contained in the labeled dataset (as illustrated in Figure 1(a)). In this scenario, the label-guessing techniques used in existing SSL methods may label those out-of-class samples incorrectly, which in turn significantly harms the overall training through their inner processes of entropy minimization (Grandvalet & Bengio, 2004; Lee, 2013) or consistency regularization (Xie et al., 2019; Sohn et al., 2020). This problem may largely hinder existing SSL methods from being used in practice, considering the open-set nature of unlabeled data collected in the wild (Bendale & Boult, 2016).

Contribution. In this paper, we focus on a realistic SSL scenario, where unlabeled data may contain some unknown out-of-class samples, i.e., there is a class distribution mismatch between labeled and unlabeled data (Oliver et al., 2018).
Compared to prior approaches that bypass this problem by simply filtering such samples out with heuristic detection scores (Nair et al., 2019; Chen et al., 2020c), the unique characteristic of our approach is to further leverage the information in out-of-class samples by assigning soft-labels to them: they may still contain features useful for the in-classes. Somewhat surprisingly, we found that a recent technique of contrastive unsupervised learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020a) can play a key role toward this goal. More specifically, we show that a representation pre-trained via contrastive learning, namely SimCLR (Chen et al., 2020a), on both labeled and unlabeled data enables us to design (a) an effective score for detecting out-of-class samples in unlabeled data, and (b) a systematic way to assign soft-labels to the detected out-of-class samples, by modeling class-conditional likelihoods from labeled data. Finally, we found (c) auxiliary
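To make points (a) and (b) concrete, the following is a minimal, illustrative sketch of how a contrastively learned embedding space could support both out-of-class detection and soft-labeling. It is not the paper's exact formulation: the cosine-similarity detection score, the fixed threshold, the class-mean similarity, and the temperature `tau` are all simplifying assumptions introduced here for illustration.

```python
import numpy as np

def l2_normalize(z):
    """Project embeddings onto the unit sphere, so dot products are cosine similarities."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def detect_out_of_class(z_unlabeled, z_labeled, threshold=0.5):
    """Flag unlabeled samples whose maximum cosine similarity to any
    labeled embedding falls below a threshold (hypothetical score)."""
    zu, zl = l2_normalize(z_unlabeled), l2_normalize(z_labeled)
    sim = zu @ zl.T                # (n_unlabeled, n_labeled)
    score = sim.max(axis=1)        # similarity to nearest labeled sample
    return score < threshold, score

def assign_soft_labels(z_unlabeled, z_labeled, y_labeled, n_classes, tau=0.1):
    """Soft labels from mean per-class similarity, passed through a
    temperature-scaled softmax (an illustrative stand-in for modeling
    class-conditional likelihoods)."""
    zu, zl = l2_normalize(z_unlabeled), l2_normalize(z_labeled)
    sim = zu @ zl.T
    class_sim = np.stack(
        [sim[:, y_labeled == c].mean(axis=1) for c in range(n_classes)], axis=1)
    logits = class_sim / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

In this sketch, a sample far from every labeled embedding is flagged as out-of-class, yet still receives a soft label that distributes probability mass over in-classes according to its relative similarities, rather than being discarded outright.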

