OPENCOS: CONTRASTIVE SEMI-SUPERVISED LEARNING FOR HANDLING OPEN-SET UNLABELED DATA

Abstract

Modern semi-supervised learning methods conventionally assume that labeled and unlabeled data share the same class distribution. In practice, however, unlabeled data may include out-of-class samples, i.e., samples that cannot be assigned a one-hot label from the closed set of classes in the labeled data; in other words, the unlabeled data is an open set. In this paper, we introduce OpenCoS, a method for handling this realistic semi-supervised learning scenario based on a recent framework of contrastive learning. One of our key findings is that out-of-class samples in the unlabeled dataset can be identified effectively via (unsupervised) contrastive learning. OpenCoS utilizes this information to overcome the failure modes of existing state-of-the-art semi-supervised methods, e.g., ReMixMatch or FixMatch. In particular, we propose to assign soft-labels to out-of-class samples using the representation learned from contrastive learning. Our extensive experimental results show the effectiveness of OpenCoS, fixing state-of-the-art semi-supervised methods so that they remain suitable for diverse scenarios involving open-set unlabeled data. The code will be released.

1. INTRODUCTION

Despite the recent success of deep neural networks with large-scale labeled data, many real-world scenarios suffer from expensive data acquisition and labeling costs. This has motivated the community to develop semi-supervised learning (SSL; Grandvalet & Bengio 2004; Chapelle et al. 2009), i.e., by further incorporating unlabeled data for training. Indeed, recent SSL works (Berthelot et al., 2019; 2020; Sohn et al., 2020) demonstrate promising results on several benchmark datasets, as they can even approach the performance of fully supervised learning using only a small number of labels, e.g., 93.73% accuracy on CIFAR-10 with 250 labeled data (Berthelot et al., 2020). However, SSL methods often fail to generalize when there is a mismatch between the class distributions of labeled and unlabeled data (Oliver et al., 2018; Chen et al., 2020c; Guo et al., 2020), i.e., when the unlabeled data contains out-of-class samples, whose ground-truth labels are not contained in the labeled dataset (as illustrated in Figure 1(a)). In this scenario, the various label-guessing techniques used in existing SSL methods may label those out-of-class samples incorrectly, which in turn significantly harms the overall training through their inner processes of entropy minimization (Grandvalet & Bengio, 2004; Lee, 2013) or consistency regularization (Xie et al., 2019; Sohn et al., 2020). This problem may largely hinder the existing SSL methods from being used in practice, considering the open-set nature of unlabeled data collected in the wild (Bendale & Boult, 2016).

Contribution. In this paper, we focus on a realistic SSL scenario where unlabeled data may contain some unknown out-of-class samples, i.e., there is a class distribution mismatch between labeled and unlabeled data (Oliver et al., 2018).
Compared to prior approaches that have bypassed this problem by simply filtering such samples out with heuristic detection scores (Nair et al., 2019; Chen et al., 2020c), the unique characteristic of our approach is to further leverage the information in out-of-class samples by assigning soft-labels to them: they may still contain some useful features for the in-classes. Somewhat surprisingly, we found that a recent technique of contrastive unsupervised learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020a) can play a key role for this goal. More specifically, we show that a representation pre-trained via contrastive learning, namely SimCLR (Chen et al., 2020a), on both labeled and unlabeled data enables us to design (a) an effective score for detecting out-of-class samples in unlabeled data, and (b) a systematic way to assign soft-labels to the detected out-of-class samples by modeling class-conditional likelihoods from labeled data. Finally, we found that (c) auxiliary batch normalization layers for out-of-class samples further improve training by mitigating the distribution mismatch they introduce.

We verify the effectiveness of the proposed method on a wide range of SSL benchmarks based on the CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009) datasets, assuming the presence of various out-of-class data, e.g., the SVHN (Netzer et al., 2011) and TinyImageNet datasets. Our experimental results demonstrate that OpenCoS greatly improves existing state-of-the-art SSL methods (Berthelot et al., 2019; 2020; Sohn et al., 2020), not only by discarding out-of-class samples, but also by further leveraging them in training. We also compare our method to other recent works (Nair et al., 2019; Chen et al., 2020c; Guo et al., 2020) addressing the same class distribution mismatch problem in SSL, and again confirm the effectiveness of our framework: e.g., we achieve an accuracy of 68.37% with 40 labels (just 4 labels per class) on CIFAR-10 with TinyImageNet as out-of-class data, compared to 56.32% for DS3L (Guo et al., 2020).
Overall, our work highlights the benefit of unsupervised representations in (semi-)supervised learning: such a label-free representation turns out to enhance model generalization due to its robustness to novel, out-of-class samples.

2.1. SEMI-SUPERVISED LEARNING

The goal of semi-supervised learning for classification is to train a classifier f : X → Y from a labeled dataset D_l = {(x_l^(i), y_l^(i))}_{i=1}^{N_l}, where each label y_l is from a set of classes Y := {1, ..., C}, and an unlabeled dataset D_u = {x_u^(i)}_{i=1}^{N_u}, where each y_u exists but is assumed to be unknown. In an attempt to leverage the extra information in D_u, a number of techniques have been proposed, e.g., entropy minimization (Grandvalet & Bengio, 2004; Lee, 2013) and consistency regularization (Sajjadi et al., 2016). In general, recent approaches in semi-supervised learning can be distinguished by the prior they adopt for the representation of unlabeled data: for example, the consistency regularization technique (Sajjadi et al., 2016) attempts to minimize the cross-entropy loss between any two predictions of different augmentations t_1(x_u) and t_2(x_u) of a given unlabeled sample x_u, jointly with the standard training for a labeled sample (x_l, y_l):

L_SSL(x_l, x_u) := H(y_l, f(x_l)) + β · H(f(t_1(x_u)), f(t_2(x_u))),   (1)

where H is the standard cross-entropy loss for labeled data and β is a hyperparameter. Recently, several "holistic" approaches combining various techniques (Zhang et al., 2018; Cubuk et al., 2019) have shown remarkable performance in practice, e.g., MixMatch (Berthelot et al., 2019), ReMixMatch (Berthelot et al., 2020), and FixMatch (Sohn et al., 2020), which we mainly consider in this paper. We note that our scheme can be integrated with any recent semi-supervised learning method.
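As a toy illustration, the combined objective (1) can be written in a few lines of NumPy; the function names and probability vectors below are our own illustrative choices, not from the paper:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_c p_c * log q_c for two probability vectors."""
    return float(-np.sum(p * np.log(q + eps)))

def ssl_loss(y_onehot, pred_l, pred_aug1, pred_aug2, beta=1.0):
    """L_SSL = H(y_l, f(x_l)) + beta * H(f(t1(x_u)), f(t2(x_u))).

    pred_aug1 / pred_aug2 are the model's predictions on two different
    augmentations of the same unlabeled sample; the second term pushes
    them toward agreement (consistency regularization)."""
    supervised = cross_entropy(y_onehot, pred_l)
    consistency = cross_entropy(pred_aug1, pred_aug2)
    return supervised + beta * consistency

# Toy example with C = 3 classes.
y = np.array([1.0, 0.0, 0.0])       # one-hot label of x_l
p_l = np.array([0.7, 0.2, 0.1])     # f(x_l)
p_u1 = np.array([0.6, 0.3, 0.1])    # f(t1(x_u))
p_u2 = np.array([0.5, 0.4, 0.1])    # f(t2(x_u))
loss = ssl_loss(y, p_l, p_u1, p_u2, beta=0.5)
```

In a real SSL pipeline these probability vectors would come from a neural network over batches; the sketch only shows how the supervised and consistency terms combine.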

2.2. CONTRASTIVE REPRESENTATION LEARNING

Contrastive learning (Oord et al., 2018; Hénaff et al., 2019; He et al., 2020; Chen et al., 2020a) defines an unsupervised task for an encoder f_e : X → R^{d_e} from a set of samples {x_i}: assume a "query" sample x_q is given and there is a positive "key" x_+ ∈ {x_i} that x_q matches. The contrastive loss is then defined to let f_e extract the information necessary to identify x_+ from x_q:

L_con(f_e, x_q, x_+; {x_i}) := −log [ exp(h(f_e(x_q), f_e(x_+))/τ) / Σ_i exp(h(f_e(x_q), f_e(x_i))/τ) ],   (2)

where h(·, ·) is a pre-defined similarity score and τ is a temperature hyperparameter. In this paper, we primarily focus on SimCLR (Chen et al., 2020a), a particular form of contrastive learning: for a given {x_i}_{i=1}^N, SimCLR first samples two separate data augmentation operations from a pre-defined family T, namely t_1, t_2 ∼ T, and matches (x̃_i, x̃_{i+N}) := (t_1(x_i), t_2(x_i)) as a query-key pair interchangeably. The actual loss is then defined as follows:

L_SimCLR(f_e; {x_i}_{i=1}^N) := (1/2N) Σ_{q=1}^{2N} L_con(f_e, x̃_q, x̃_{(q+N) mod 2N}; {x̃_i}_{i=1}^{2N} \ {x̃_q}),   (3)

h_SimCLR(v_1, v_2) := CosineSimilarity(g(v_1), g(v_2)) = g(v_1) · g(v_2) / (||g(v_1)||_2 ||g(v_2)||_2),   (4)

where g : R^{d_e} → R^{d_p} is a 2-layer neural network called the projection header. In other words, the SimCLR loss defines a task to identify a "semantically equivalent" sample to x_q up to the set of data augmentations T.
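A minimal NumPy sketch of the contrastive loss (2) may make the notation concrete; the helper name `contrastive_loss` and the toy 2-D embeddings are hypothetical, and a real SimCLR implementation would operate on batches of augmented views through the encoder and projection header:

```python
import numpy as np

def contrastive_loss(q, pos, keys, tau=0.5):
    """L_con: cross-entropy of identifying the positive key `pos`
    among all candidate embeddings `keys` via cosine similarity.
    `q` and `pos` are 1-D embeddings; `keys` includes `pos` but not `q`."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = np.array([cos(q, k) for k in keys])      # denominator terms
    pos_sim = cos(q, pos)                           # numerator term
    return float(-np.log(np.exp(pos_sim / tau) / np.sum(np.exp(sims / tau))))

# Toy 2-D embeddings: the positive key is nearly aligned with the query.
query = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negatives = [np.array([-1.0, 0.2]), np.array([0.0, 1.0])]
loss = contrastive_loss(query, positive, [positive] + negatives)
```

The loss shrinks as the query-positive similarity grows relative to the similarities with the other keys, which is exactly the instance-discrimination task SimCLR optimizes.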

3. OPENCOS: A FRAMEWORK FOR OPEN-SET SEMI-SUPERVISED LEARNING

We consider semi-supervised classification problems involving C classes. In addition to the standard assumptions of semi-supervised learning (SSL), we assume that the unlabeled dataset D_u is open-set, i.e., the hidden label y_u of x_u may not be in Y := {1, ..., C}. In this scenario, existing semi-supervised learning techniques may degrade the classification performance, possibly due to incorrect label-guessing procedures for those out-of-class samples. In this respect, we introduce OpenCoS, a generic method for detecting and labeling out-of-class unlabeled samples in semi-supervised learning. Overall, our key intuition is to utilize the unsupervised representation from contrastive learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020a) to leverage such out-of-class samples in an appropriate manner. We present a brief overview of our method in Section 3.1, and describe how our approach, OpenCoS, handles out-of-class samples in Sections 3.2 and 3.3.

3.1. OVERVIEW OF OPENCOS

Recall that our goal is to train a classifier f : X → Y from a labeled dataset D_l and an open-set unlabeled dataset D_u. Overall, OpenCoS aims to overcome the presence of out-of-class samples in D_u through the following procedure:

1. Pre-training via contrastive learning. OpenCoS first learns an unsupervised representation of f via SimCLR (Chen et al., 2020a), using both D_l and D_u without labels. More specifically, we learn the penultimate features of f, denoted by f_e, by minimizing the contrastive loss defined in (3). We also introduce a projection header g as in (4), a 2-layer MLP following Chen et al. (2020a).

2. Detecting out-of-class samples. From the learned representation of f_e and g, OpenCoS identifies the out-of-class unlabeled data D_u^out from the given data D_u = D_u^in ∪ D_u^out. This detection process is based on the similarity score between D_l and D_u in the representation space of f_e and g (see Section 3.2).

3. Semi-supervised learning with auxiliary loss and batch normalization. Now, one can use any semi-supervised learning scheme to train f using D_l and D_u^in, e.g., ReMixMatch (Berthelot et al., 2020). In addition, OpenCoS minimizes an auxiliary loss that assigns a soft-label to each sample in D_u^out, which is also based on the representation of f_e and g (see Section 3.3). Furthermore, we found that maintaining auxiliary batch normalization layers (Xie et al., 2020) for D_u^out is beneficial to our loss, as they mitigate the distribution mismatch arising from D_u^out.

Putting it all together, OpenCoS provides an effective and systematic way to detect and utilize out-of-class data for semi-supervised learning. Due to its simplicity, our framework can incorporate the most recently proposed semi-supervised learning methods (Berthelot et al., 2019; 2020; Sohn et al., 2020) and improve their performance in the presence of out-of-class samples. Figure 2 illustrates the overall training scheme of OpenCoS.
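The detection stage of this procedure (steps 1-2) can be sketched end-to-end as follows; `embed` is only a stand-in for the pre-trained SimCLR embedding g(f_e(·)), and the toy 2-D features are our own, not the paper's:

```python
import numpy as np

def embed(x):
    """Stand-in for the SimCLR embedding g(f_e(.)) of step 1;
    here we simply L2-normalize raw features for illustration."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def opencos_split(x_l, y_l, x_u, num_classes):
    """Step 2: score each unlabeled point by its maximum cosine
    similarity to the class prototypes of D_l, then split D_u into
    in-class / out-of-class parts at t = mu_l - 2 * sigma_l."""
    z_l, z_u = embed(x_l), embed(x_u)
    protos = np.stack([z_l[y_l == c].mean(axis=0) for c in range(num_classes)])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    s_l = (z_l @ protos.T).max(axis=1)   # scores of labeled (in-class) data
    s_u = (z_u @ protos.T).max(axis=1)   # scores of unlabeled data
    t = s_l.mean() - 2 * s_l.std()       # detection threshold
    return x_u[s_u >= t], x_u[s_u < t]

# Two labeled classes around [1, 0] and [0, 1]; one in-class and one
# clearly out-of-class unlabeled point.
x_l = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.3, 0.7]])
y_l = np.array([0, 0, 1, 1])
x_u = np.array([[0.95, 0.05], [-1.0, -1.0]])
d_in, d_out = opencos_split(x_l, y_l, x_u, num_classes=2)
```

Step 3 would then feed `d_in` to the chosen SSL method and `d_out` to the auxiliary soft-label loss of Section 3.3.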

3.2. DETECTION CRITERION OF OPENCOS

For a given labeled dataset D_l and an open-set unlabeled dataset D_u, we aim to detect the subset of unlabeled training data D_u^out ⊆ D_u whose elements are out-of-class, i.e., y_u ∉ Y. A standard way to handle this task is to train a confidence-calibrated classifier using D_l (Hendrycks & Gimpel, 2017; Liang et al., 2018; Lee et al., 2018a; b; Hendrycks et al., 2019a; b; Bergman & Hoshen, 2020; Tack et al., 2020). However, such methods typically assume a sufficient number of in-class samples (i.e., a large D_l), which does not hold in our case due to the label-scarce nature of SSL. This motivates us to consider a more suitable approach that leverages the open-set unlabeled dataset D_u for contrastive learning. OpenCoS then utilizes the labeled dataset D_l to estimate the class-wise distributions of (pre-trained) embeddings, and uses them to define a detection score for D_u. We assume an encoder f_e : X → R^{d_e} and a projection header g : R^{d_e} → R^{d_p} pre-trained via SimCLR on D_l ∪ D_u. Motivated by the similarity metric used in the pre-training objective of SimCLR (4), we propose a simple yet effective detection score s(x_u) for an unlabeled input x_u, based on the cosine similarity between x_u and class-wise prototypical representations {v_c}_{c=1}^C obtained from D_l. Namely, we first define a class-wise similarity score sim_c(x_u) for each class c as follows:

v_c(D_l; f_e, g) := (1/N_l^c) Σ_i 1[y_l^(i) = c] · g(f_e(x_l^(i))),   (5)

sim_c(x_u; D_l, f_e, g) := CosineSimilarity(g(f_e(x_u)), v_c),   (6)

where N_l^c := |{(x_l^(i), y_l^(i)) | y_l^(i) = c}| is the sample size of class c in D_l. Then, our detection score s(x_u) is defined as the maximal similarity score between x_u and the prototypes {v_c}_{c=1}^C:

s(x_u) := max_{c=1,...,C} sim_c(x_u).   (7)

In practice, we use a pre-defined threshold t for detecting out-of-class samples in D_u, i.e., we detect a given sample x_u as out-of-class if s(x_u) < t.
In our experiments, we found that the empirical value t := μ_l − 2σ_l performs well across all the datasets tested, where μ_l and σ_l are the mean and standard deviation computed over the labeled scores {s(x_l^(i))}_{i=1}^{N_l}, respectively, although further tuning of t could improve the performance. Further analysis of our detection threshold can be found in Appendix B.4.
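A small NumPy sketch of the prototype (5), similarity (6), score (7), and threshold computations, under the assumption that the SimCLR embeddings g(f_e(·)) are already given as plain arrays (the toy 2-D embeddings are ours):

```python
import numpy as np

def prototypes(z_l, y_l, num_classes):
    """v_c: mean of the projected labeled embeddings of class c, eq. (5)."""
    return np.stack([z_l[y_l == c].mean(axis=0) for c in range(num_classes)])

def sim(z, protos):
    """sim_c(x): cosine similarity between each embedding and each v_c."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    pn = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return zn @ pn.T                     # shape: (num_samples, C)

def score(z, protos):
    """s(x) = max_c sim_c(x), eq. (7)."""
    return sim(z, protos).max(axis=1)

# Labeled embeddings of two classes, clustered near [1, 0] and [0, 1].
z_l = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.3, 0.7]])
y_l = np.array([0, 0, 1, 1])
v = prototypes(z_l, y_l, num_classes=2)
s_l = score(z_l, v)
t = s_l.mean() - 2 * s_l.std()           # threshold t = mu_l - 2 * sigma_l
s_ood = score(np.array([[-1.0, -0.5]]), v)[0]   # a far-away embedding
```

Any unlabeled sample scoring below `t` would be routed to D_u^out; an embedding pointing away from every class prototype, like `s_ood` here, falls well below the threshold.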

3.3. AUXILIARY LOSS AND BATCH NORMALIZATION OF OPENCOS

Based on the detection criterion defined in Section 3.2, the open-set unlabeled dataset D_u can be split into (a) the in-class unlabeled dataset D_u^in and (b) the out-of-class unlabeled dataset D_u^out. The labeled dataset D_l and D_u^in are now used to train the classifier f with any existing semi-supervised learning method (Berthelot et al., 2019; 2020; Sohn et al., 2020). In addition, we propose to further utilize D_u^out via an auxiliary loss that assigns a soft-label to each x_u^out ∈ D_u^out. More specifically, for any semi-supervised learning objective L_SSL(x_l, x_u^in; f), we consider the following loss:

L_OpenCoS := L_SSL(x_l, x_u^in; f) + λ · H(q(x_u^out), f(x_u^out)),   (8)

where H denotes the cross-entropy loss, λ is a hyperparameter, and q(x_u^out) defines a specific assignment of a distribution over Y for x_u^out. In this paper, we propose to assign q(x_u^out) based on the class-wise similarity scores sim_c(x_u^out) defined in (6), again utilizing the contrastive representation f_e and g:

q_c(x_u) := exp(sim_c(x_u; f_e, g)/τ) / Σ_i exp(sim_i(x_u; f_e, g)/τ),   (9)

where τ is a (temperature) hyperparameter. At first glance, assigning a label over Y to x_u^out may seem counter-intuitive, as the true label of x_u^out is not in Y by definition. However, even when out-of-class samples cannot be represented as one-hot labels, one can still model their class-conditional likelihoods as a linear combination (i.e., a soft-label) of Y: for instance, although "cat" images are out-of-class for CIFAR-100, there are still some classes in CIFAR-100 that are semantically similar to "cat", e.g., "leopard", "lion", or "tiger", so that assigning a soft-label, e.g., 0.1 · "leopard" + 0.2 · "lion" + 0.7 · "tiger", might be beneficial. Even if the out-of-classes are totally different from the in-classes, one can assign uniform labels to ignore them.
We empirically found that such soft-labels based on representations learned via contrastive learning offer an effective way to utilize out-of-class samples, although such samples are known to significantly harm vanilla semi-supervised learning schemes. We present a detailed discussion of our soft-label assignments in Section 4.4.

Auxiliary batch normalization. Finally, we propose to handle the data-distribution shift originating from the class-distribution mismatch (Oliver et al., 2018), i.e., D_l and D_u^out being drawn from different underlying distributions. This shift may degrade the in-class classification performance, as the auxiliary loss utilizes out-of-class samples. To handle the issue, we use additional batch normalization layers (BN; Ioffe & Szegedy, 2015) for training samples in D_u^out to disentangle the two distributions. In our experiments, we observe that such auxiliary BNs are beneficial when using out-of-class samples via the auxiliary loss (see Section 4.4). Auxiliary BNs have also been studied in the adversarial learning literature (Xie et al., 2020): decoupling BNs improves the performance of adversarial training by handling the distribution mismatch between clean and adversarial samples. In this paper, we find that a similar strategy can improve model performance in realistic semi-supervised learning.
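The soft-label assignment (9) and the auxiliary term of (8) can be sketched as follows; the toy similarity values are our own, and a real implementation would compute them from the contrastive embeddings via (6):

```python
import numpy as np

def soft_labels(sims, tau=1.0):
    """q_c(x) = exp(sim_c(x)/tau) / sum_i exp(sim_i(x)/tau), eq. (9).
    `sims` holds class-wise cosine similarities, shape (N, C)."""
    e = np.exp(sims / tau)
    return e / e.sum(axis=1, keepdims=True)

def aux_loss(q, preds, lam=0.5, eps=1e-12):
    """lambda * H(q(x_out), f(x_out)): the auxiliary term of eq. (8),
    averaged over the batch of out-of-class samples."""
    return float(lam * -np.mean(np.sum(q * np.log(preds + eps), axis=1)))

# A "cat-like" out-of-class sample over C = 3 in-classes: most similar
# to class 2, dissimilar to the others.
sims = np.array([[0.1, 0.2, 0.8]])
q = soft_labels(sims, tau=1.0)
preds = np.array([[0.2, 0.2, 0.6]])   # model predictions f(x_out)
loss = aux_loss(q, preds)
```

Note that equal similarities yield a uniform q, matching the paper's remark that out-of-class samples unrelated to every in-class are effectively ignored.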

4. EXPERIMENTS

In this section, we verify the effectiveness of our method over a wide range of semi-supervised learning benchmarks in the presence of various out-of-class data. The full details of the experimental setups can be found in Appendix A.

Datasets. We perform experiments on image classification tasks for several benchmarks in the semi-supervised learning literature (Berthelot et al., 2020; Sohn et al., 2020): the CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009) datasets. Specifically, we focus on settings where each dataset is extremely label-scarce: only 4 or 25 labels per class are given during training, while the rest of the training data are assumed to be unlabeled. To configure realistic semi-supervised learning scenarios, we additionally assume that the unlabeled data contain samples from an external dataset: for example, in the case of CIFAR-10, we use unlabeled samples from the SVHN (Netzer et al., 2011) or TinyImageNet datasets.

Baselines. We evaluate MixMatch (Berthelot et al., 2019), ReMixMatch (Berthelot et al., 2020), and FixMatch (Sohn et al., 2020) as baselines in our experimental setup, which are considered state-of-the-art methods in conventional semi-supervised learning. We also compare our method with three prior works applicable to our setting: Uncertainty-Aware Self-Distillation (UASD; Chen et al. 2020c), RealMix (Nair et al., 2019), and DS3L (Guo et al., 2020), which propose schemes to detect and filter out out-of-class samples in the unlabeled dataset; e.g., DS3L learns to re-weight unlabeled samples to reduce the effect of out-of-class samples. Recall that our method uses SimCLR (Chen et al., 2020a) for pre-training. Unless otherwise noted, we also pre-train the baselines via SimCLR for a fair comparison, denoting the fine-tuned models by "-ft," e.g., MixMatch-ft and UASD-ft.
We confirm that fine-tuned models show comparable or better performance than those trained from scratch, as presented in Figure 1(b) and Appendix A.2. We also report the performance purely obtainable from (unsupervised) SimCLR: namely, we additionally consider (a) SimCLR-le, a SimCLR model under the linear evaluation protocol (Zhang et al., 2016; Chen et al., 2020a), i.e., it additionally learns a linear layer on the labeled dataset, and (b) SimCLR-ft, where the whole SimCLR model is fine-tuned on the labeled dataset. Somewhat interestingly, these models turn out to be the strongest baselines in our setups; they often outperform the state-of-the-art semi-supervised baselines under large proportions of out-of-class samples (see Table 1). Finally, we remark that our framework can incorporate any conventional semi-supervised method for training. We denote our method built upon an existing method by "+ OpenCoS", e.g., ReMixMatch + OpenCoS.

Training details. As suggested by Oliver et al. (2018), we re-implemented all the baseline methods considered, including SimCLR, under the same codebase, and performed experiments with the same model architecture, ResNet-50 (He et al., 2016). Due to the label-scarce nature of semi-supervised learning, we do not use a validation set in our setting. Instead, we checkpoint every 2^16 training samples and report (a) the median test accuracy of the last 5 checkpoints out of 50 in total and (b) the best accuracy among all checkpoints. We fix τ = 1, the temperature hyperparameter in (9), and λ = 0.5 in (8), in all our experiments. The full details of the model architecture and hyperparameters can be found in Appendix A.2 and B.1, respectively.

4.1. EXPERIMENTS ON VARYING PROPORTIONS OF OUT-OF-CLASS SAMPLES

We first evaluate the effect of out-of-class unlabeled samples in semi-supervised learning under varying proportions of the total dataset. We consider the CIFAR-10 and TinyImageNet datasets and synthetically control the proportion between the two within 50K training samples. For example, a proportion of 80% means the training dataset consists of 40K samples from TinyImageNet and 10K samples from CIFAR-10. In this experiment, we assume that 25 labels per class are always given on the CIFAR-10 side. We compare three models under varying proportions of out-of-class samples: (a) a ReMixMatch model trained from scratch (ReMixMatch), (b) a SimCLR model fine-tuned by ReMixMatch (ReMixMatch-ft), and (c) our OpenCoS model applied to ReMixMatch-ft (+ OpenCoS). Figure 1(b) demonstrates the results. Overall, we observe that the performance of ReMixMatch rapidly degrades as the proportion of out-of-class samples in the unlabeled data increases. While ReMixMatch-ft significantly mitigates this problem, it still fails at larger proportions: e.g., at 80% out-of-class, the performance of ReMixMatch-ft falls to that of ReMixMatch. OpenCoS, in contrast, successfully prevents the performance degradation of ReMixMatch-ft, especially in the regime where out-of-class samples dominate in-class samples.

4.2. EXPERIMENTS ON CIFAR DATASETS

In this section, we evaluate our method on several benchmarks where CIFAR datasets are assumed to be in-class: more specifically, we consider scenarios in which either CIFAR-10 or CIFAR-100 is the in-class dataset, with an out-of-class dataset of either SVHN or TinyImageNet. Additionally, we consider a separate benchmark called CIFAR-Animals + CIFAR-Others, following the setup of related work (Oliver et al., 2018): the in-class dataset consists of the 6 animal classes from CIFAR-10, while the remaining samples are considered out-of-class. We fix every benchmark to have 50K training samples. We assume an 80% proportion of out-of-class samples, i.e., 10K in-class and 40K out-of-class samples, except for CIFAR-Animals + CIFAR-Others, which consists of 30K and 20K samples for in- and out-of-class, respectively. We report ReMixMatch-ft + OpenCoS, as it tends to outperform FixMatch-ft + OpenCoS in such CIFAR-scale experiments, while the opposite holds in the large-scale ImageNet experiments in Section 4.3. Table 1 presents the results: OpenCoS consistently improves ReMixMatch-ft while simultaneously outperforming the other baselines. For example, OpenCoS improves the test accuracy of ReMixMatch-ft from 28.51% to 68.37% with 4 labels per class on CIFAR-10 + TinyImageNet. We also observe large discrepancies between the median and best accuracies of the semi-supervised learning baselines MixMatch-ft, ReMixMatch-ft, and FixMatch-ft, especially in the extreme label-scarce scenario of 4 labels per class, i.e., these methods suffer from over-fitting on out-of-class samples. One can also confirm this significant over-fitting in state-of-the-art SSL methods by comparing with the baselines equipped with detection schemes, e.g., UASD-ft, RealMix-ft, and DS3L-ft, which show less over-fitting but lower best accuracy.

4.3. EXPERIMENTS ON IMAGENET DATASETS

We also evaluate OpenCoS on ImageNet to verify its scalability to a larger and more complex dataset. We design 9 benchmarks from the ImageNet dataset, similar to Restricted ImageNet (Tsipras et al., 2019): more specifically, we define 9 super-classes of ImageNet, each consisting of 11–118 sub-classes. We perform our experiments on each super-class as an individual dataset. Each benchmark (a super-class) contains 25 labels per sub-class, and we use the full ImageNet as an unlabeled dataset (excluding the labeled samples). In this experiment, we checkpoint every 2^15 training samples and report the median test accuracy of the last 3 checkpoints out of 10. We present additional experimental details, e.g., the configuration of the dataset, in Appendix A.3. Table 2 shows the results: OpenCoS still effectively improves the baselines, largely surpassing SimCLR-le and SimCLR-ft as well. For example, OpenCoS improves the test accuracy on Bird to 81.78% from FixMatch-ft's 78.73%, also significantly improving over SimCLR-le's 75.81%. This shows the efficacy of OpenCoS in exploiting open-set unlabeled data from unknown (but related) classes, or even from the unseen distribution of another dataset in the real world.

4.4. ABLATION STUDY

We perform an ablation study to further understand how OpenCoS works. Specifically, we assess the individual effects of the components in OpenCoS and show that each of them makes an orthogonal contribution to the overall improvements. We also provide a detailed evaluation of our proposed detection score (7) compared to other out-of-distribution detection methods.

Table 3: Ablation study on the three main components of our method: the detection criterion ("Detect"), auxiliary loss ("Aux. loss"), and auxiliary BNs ("Aux. BNs"). We report the mean and standard deviation over three runs with different random seeds and a fixed split of labeled data.

Component analysis. To further analyze the individual contribution of each component of OpenCoS, we incrementally apply these components one by one to the ReMixMatch-ft (CIFAR-scale) and FixMatch-ft (ImageNet-scale) baselines. Specifically, we consider CIFAR-Animals + CIFAR-Others and CIFAR-10 + SVHN for the CIFAR-scale benchmarks, and Produce, Bird, and Food + ImageNet for the ImageNet-scale benchmarks. Table 3 summarizes the results, which indeed confirm that each component of OpenCoS makes an orthogonal contribution to improving the accuracy on the benchmarks tested. We observe that leveraging out-of-class samples via the auxiliary loss ("Aux. loss") achieves consistent improvements and also outperforms the baselines significantly. Finally, we remark that the auxiliary batch normalization layers ("Aux. BNs") give a consistent, often significant, improvement: e.g., 55.78% → 57.77% on CIFAR-10 + SVHN.

Other detection scores. In Section 3.2, we propose a detection score s(·) (7) for detecting out-of-class samples in an unlabeled dataset, based on the contrastive representation of SimCLR.
This setup differs from the standard out-of-distribution (OOD) detection task (Emmott et al., 2013; Liu et al., 2018): OOD detection targets unseen (i.e., "out-of-distribution") samples at test time, while our setup aims to detect seen out-of-class samples during training, assuming few in-class labels. Due to this lack of labeled information, the standard techniques for OOD detection (Hendrycks & Gimpel, 2017; Liang et al., 2018; Lee et al., 2018b) are not guaranteed to still perform well in our setup. We examine this in Appendix B.3 by comparing the detection performance of such OOD detection scores with ours (7) upon a shared SimCLR representation: in short, we indeed observe that our approach of directly leveraging the contrastive representation can perform better than simply applying OOD scores that rely on few labeled samples, e.g., our score achieves an AUROC of 98.10% on the CIFAR-Animals + CIFAR-Others benchmark, compared to 80.79% for the maximum softmax probability score (Hendrycks & Gimpel, 2017). We present the detailed experimental setups and more results in Appendix B.3.

Effect of soft-labeling. We emphasize that our soft-labeling scheme can rather be viewed as a more reasonable way to label out-of-class samples compared to existing state-of-the-art SSL methods; e.g., MixMatch simply assigns its sharpened predictions. A prior work (Li & Hoiem, 2016) makes an observation similar to ours: assigning soft-labels to novel data can be beneficial for transfer learning. This motivates us to consider an experiment that further supports the claim that our soft-labeling gives informative signals: we train a classifier from scratch by minimizing only the cross-entropy loss with soft-labels (i.e., without in-class samples).

Table 4: Comparison of the median test accuracy of ResNet-50 trained on out-of-class samples and their soft-labels on the CIFAR-10 benchmarks with 4 labels per class. We denote this new setting of minimizing only the auxiliary loss as "Aux. loss only". We report the mean and standard deviation over three runs with different random seeds and splits of labeled data.

In Table 4, the trained classifier performs much better than (random) guessing, even close to some baselines, despite never seeing any in-class samples; this supports that the generated soft-labels contain informative features of the in-classes. The details of the experimental setups can be found in Appendix A.2.

Examples of actual soft-labels. We also present some concrete examples of our soft-labeling scheme in Figure 3 for better understanding, obtained from random unlabeled samples in the CIFAR-10 + TinyImageNet benchmark. Overall, we qualitatively observe that out-of-class samples sharing some semantic features with the in-classes (e.g., Figure 3(a)) receive relatively high-confidence soft-labels capturing that similarity, while the soft-labels are very close to uniform otherwise (e.g., Figure 3(b)).

5. CONCLUSION

In this paper, we propose a simple and general framework for handling novel unlabeled data, aiming toward a more realistic assumption for semi-supervised learning. Our key idea is to (intentionally) not use label information, i.e., to rely on unsupervised representations when handling novel data, which can be naturally incorporated into semi-supervised learning with our framework, OpenCoS. In contrast to previous approaches, OpenCoS opens a way to further utilize such novel data by assigning them soft-labels, which are again obtained from unsupervised learning. We hope our work will motivate researchers to extend this framework under even more realistic assumptions, e.g., noisy labels (Wang et al., 2018; Lee et al., 2019) or imbalanced learning (Liu et al., 2020).

A TRAINING DETAILS

A.1 DETAILS ON THE EXPERIMENTAL SETUP

For the experiments reported in Table 1, we generally follow the training details of FixMatch (Sohn et al., 2020), including the optimizer, learning rate schedule, and exponential moving average. Specifically, we use the Nesterov SGD optimizer with momentum 0.9, a cosine learning rate decay with an initial learning rate of 0.03, and an exponential moving average with a decay of 0.999. The batch size is 64, which is widely adopted in semi-supervised learning (SSL) methods. We do not use weight decay for these models, as they are fine-tuned. We use a simple augmentation strategy, i.e., flip and crop, as a default. We use the augmentation scheme of SimCLR (Chen et al., 2020a) (i.e., random crop with resize, random color distortion, and random Gaussian blur) when an SSL method requires a strong augmentation strategy, e.g., for consistency regularization in the SSL literature (Berthelot et al., 2020; Sohn et al., 2020). We fix the number of augmentations to 2, following Berthelot et al. (2019): e.g., MixMatch-ft generates two augmentations of each unlabeled sample, while ReMixMatch-ft generates one weak and one strong augmentation. In the case of ReMixMatch-ft, we do not use the ramp-up weighting function, the pre-mixup loss, or the rotation loss, which make only a marginal difference in fine-tuning, for efficient computation. For FixMatch-ft, we set the relative size of the labeled and unlabeled batches to µ = 1 for a fair comparison with the other baselines, and scale the learning rate linearly with µ, as suggested by Sohn et al. (2020). Following Chen et al. (2020c), UASD-ft computes predictions by accumulative ensembling instead of using an exponential moving average. OpenCoS shares all hyperparameters of the baseline SSL methods, e.g., FixMatch + OpenCoS shares the hyperparameters of FixMatch-ft.
For the results of ReMixMatch (from scratch) in Figure 1(b), we report the median accuracy of the last 10 checkpoints out of 200 checkpoints, where a checkpoint is saved every 2^16 training samples.

A.2 CIFAR EXPERIMENTS

Training from scratch. We pre-train all the baselines via SimCLR for a fair comparison, as mentioned in Section 4. In Table 5, we also report the performance of each baseline model when trained from scratch. Here, we report the median accuracy of the last 10 checkpoints out of 500 checkpoints in total. We also present the fine-tuned baselines (see Section 4.2), denoted by "-ft," e.g., MixMatch-ft. We follow the training details originally used in each baseline method. For example, ReMixMatch from scratch uses the Adam optimizer with a fixed learning rate of 0.002 and a weight decay of 0.02. We use the simple strategy (i.e., flip and crop) and RandAugment (Cubuk et al., 2019) as the weak and strong augmentations, respectively. In addition, we use the ramp-up weighting function, the pre-mixup loss, and the rotation loss for ReMixMatch. We consider the CIFAR-100 + TinyImageNet benchmark assuming an 80% proportion of out-of-class samples, i.e., 10K in-class and 40K out-of-class samples.

Training without in-class samples from scratch. For the experiments reported in Table 4, we train a classifier from scratch only with the unlabeled out-of-class samples and their soft-labels, on the CIFAR-10 benchmarks with 4 labels per class. We use ResNet-50, and SGD with momentum 0.9, weight decay 0.0001, and an initial learning rate of 0.1. The learning rate is divided by 10 after epochs 100 and 150, for 200 epochs in total. We set the batch size to 128, and use a simple data augmentation strategy, i.e., flip and crop. We minimize the cross-entropy loss between the soft-labels q(·) and model predictions f(·), i.e., L = H(q(x_u^out), f(x_u^out)). For stable training, we apply temperature scaling on both the soft-labels and the model predictions, with temperatures 0.1 and 4, respectively.

Analysis of model architectures. For all our experiments, we use ResNet-50 following the standard of SimCLR (Chen et al., 2020a).
This architecture is larger than Wide-ResNet-28-2 (Zagoruyko & Komodakis, 2016), a more widely adopted architecture in the semi-supervised learning literature (Oliver et al., 2018). We have found that using a larger network, i.e., ResNet-50, is necessary to leverage the pre-trained features of SimCLR: in Table 5, we provide an evaluation on another choice of model architecture, i.e., Wide-ResNet-28-2. The hyperparameters are the same as in the ResNet-50 experiments. One can observe that OpenCoS trained on Wide-ResNet-28-2 still improves ReMixMatch-ft, outperforming the other baselines. More importantly, however, we observe that pre-training Wide-ResNet-28-2 via SimCLR does not significantly improve the baselines trained from scratch (Hénaff et al., 2019; Chen et al., 2020a;b).

Experiments on more labeled data. We have performed additional experiments on the CIFAR-10 + SVHN benchmark with 400 labels per class, and the results are given in Table 6. One can still observe that OpenCoS consistently outperforms other methods when more labeled data are available.

Details of super-classes. Following Tsipras et al. (2019), we group subsets of semantically similar classes into 9 different super-classes, as shown in Table 7.
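The two-sided temperature scaling used for training from soft-labels (L = H(q, f) with temperatures 0.1 and 4) can be sketched as below. This is a minimal illustration assuming both the soft-labels and the predictions are given as logit vectors; the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a small temperature sharpens the
    distribution, a large one smooths it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_ce(soft_label_logits, pred_logits, t_label=0.1, t_pred=4.0):
    """Cross-entropy H(q, f) between sharpened soft-labels (T = 0.1)
    and smoothed model predictions (T = 4), as described for the
    from-scratch training on out-of-class samples."""
    q = softmax(soft_label_logits, t_label)  # sharpened targets
    p = softmax(pred_logits, t_pred)         # smoothed predictions
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))
```

Sharpening the targets while smoothing the predictions stabilizes optimization: the targets behave almost like hard labels, while the smoothed predictions avoid vanishing gradients early in training.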

Details of ImageNet experiments.

For the experiments reported in Table 2, we use a pre-trained ResNet-50 model of Chen et al. (2020a) and fine-tune the projection header for 5 epochs on ImageNet. We follow the optimization details of the fine-tuning experiments of SimCLR (Chen et al., 2020a): specifically, we use the Nesterov SGD optimizer with momentum 0.9 and a learning rate of 0.00625 (following LearningRate = 0.05 · BatchSize/256). We set the batch size to 32, and report the median accuracy of the last 3 checkpoints out of 10 checkpoints in total. Data augmentation, regularization techniques, and other hyperparameters are the same as in the CIFAR experiments. In the case of FixMatch-ft + OpenCoS, we empirically observe that it is more beneficial not to discard the detected out-of-class samples in FixMatch training, as this performs better than using in-class samples only.

Effect of hyperparameters. In Section 4, we perform all the experiments with the fixed temperature τ = 1 and loss weight λ = 0.5. To examine the effect of the hyperparameters τ and λ, we additionally test τ ∈ {0.1, 0.5, 1, 2, 4} and λ ∈ {0.1, 0.5, 1, 2, 4} on the CIFAR-100 + TinyImageNet benchmark with ResNet-50. The results are presented in Table 8. Overall, we find our method is fairly robust to τ and λ.

Effect of out-of-class samples. To clarify how the improvements of OpenCoS come from out-of-class samples, we have considered additional CIFAR-scale experiments with 4 labels per class. We newly pre-train and fine-tune SimCLR models using in-class samples only, i.e., 30,000 for CIFAR-Animals and 10,000 for the CIFAR-10 and CIFAR-100 benchmarks, and compare two baselines: SimCLR-le and ReMixMatch-ft. Interestingly, we find that merely merging out-of-class samples into the training dataset improves the performance of SimCLR models in several cases (see Table 9), e.g., SimCLR-le on CIFAR-10 improves from 55.27% to 58.20% with TinyImageNet.
Also, OpenCoS significantly outperforms all baselines, even when out-of-class samples hurt the performance of SimCLR-le or ReMixMatch-ft. This confirms that, compared to other SSL baselines, the proposed method effectively leverages the contrastive representations of out-of-class samples.

Robustness to incorrect detection. We observe that our method is quite robust to incorrectly detected out-of-class samples, since those samples are still leveraged via the auxiliary loss instead of the SSL algorithm. We have considered an additional experiment on CIFAR-10 with 250 labels (out of 50,000 samples), which assumes (i) all the unlabeled samples are in-class, and (ii) 80% of those in-class samples are incorrectly detected as out-of-class by OpenCoS. Here, we compare OpenCoS with a baseline that only uses the correctly-detected (in-class) samples without the auxiliary loss, i.e., the baseline is trained on 10,000 samples while OpenCoS uses 50,000 in total. In this scenario, OpenCoS achieves 89.54% median test accuracy, while the baseline achieves 89.27%: this shows that our auxiliary loss does not harm training even when it is incorrectly applied to in-class samples.
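The soft-label assignment for detected out-of-class samples (illustrated in Figure 3, with temperature τ = 0.1) can be sketched as follows. This is an assumption-laden illustration: it uses per-class prototypes (e.g., mean labeled embeddings) and cosine similarity, consistent with the paper's similarity-based score, but the prototype construction and function names here are hypothetical.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def assign_soft_label(feature, class_prototypes, tau=0.1):
    """Soft-label for an out-of-class sample: softmax over cosine
    similarities between its contrastive feature and the per-class
    prototypes, sharpened by temperature tau."""
    sims = [cosine_sim(feature, p) / tau for p in class_prototypes]
    m = max(sims)  # numerical stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]
```

A feature close to one prototype yields a confident soft-label (as for "gazelle" vs. "deer" in Figure 3(a)), while a feature equidistant from all prototypes yields a near-uniform one (as for "pizza" in Figure 3(b)).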

B.3 EVALUATIONS OF OUR DETECTION SCORE

Baselines. We consider the maximum softmax probability (MSP; Hendrycks & Gimpel 2017), ODIN (Liang et al., 2018), and the Mahalanobis distance-based score (Lee et al., 2018b) as baseline detection methods. As MSP and ODIN require a classifier to obtain their scores, we employ SimCLR-le, a SimCLR model that additionally learns a linear layer on the labeled dataset, for both baselines. ODIN performs input pre-processing by adding small perturbations, combined with temperature scaling:

P(y = c | x; T) = exp(f_c(x)/T) / Σ_y exp(f_y(x)/T),    x̂ = x − ε · sign(−∇_x log P(y = c | x; T)),

where f = (f_1, ..., f_C) is the logit vector of the deep neural network, T > 0 is a temperature scaling parameter, and ε is the magnitude of the noise. ODIN computes the pre-processed data x̂ and feeds it into the classifier to obtain the confidence score, i.e., max_y P(y | x̂; T), identifying the input as in-class if the confidence score is higher than some threshold δ. We choose the temperature T and the noise magnitude ε from {1, 10, 100, 1000} and {0, 0.0005, 0.001, 0.0014, 0.002, 0.0024, 0.005, 0.01, 0.05, 0.1, 0.2}, respectively, using 2,000 validation data. The Mahalanobis distance-based score (Mahalanobis) assumes the features of the neural network f follow a class-conditional Gaussian distribution. It then computes the Mahalanobis distance between the input x and the closest class-conditional Gaussian distribution, i.e.,

M(x) = max_c −(f(x) − µ_c)^⊤ Σ^{−1} (f(x) − µ_c),

where µ_c is the class mean and Σ is the covariance of the labeled data. We fix the covariance matrix to the identity because the number of labeled samples is insufficient to estimate it: the feature dimensions of the SimCLR encoder f_e and projection header g are 2048. Moreover, Mahalanobis uses a noise magnitude ε for input pre-processing like ODIN, and uses the feature ensemble method of Lee et al. (2018b).
We choose ε from {0, 0.0005, 0.001, 0.0014, 0.002, 0.005, 0.01}, and perform the feature ensemble over intermediate features, including f_e's and g's, using 2,000 validation data.

Metrics. We follow the threshold-free detection metrics used in Lee et al. (2018b) to measure the effectiveness of detection scores in identifying out-of-class samples. We denote true positive, true negative, false positive, and false negative as TP, TN, FP, and FN, respectively.

• True negative rate (TNR) at 95% true positive rate (TPR). We measure TNR = TN / (FP+TN) when TPR = TP / (TP+FN) is 95%.

• Detection accuracy. For unlabeled data x ∈ D_u (= D_u^in ∪ D_u^out), this metric corresponds to the maximum classification probability over all possible thresholds δ: 1 − min_δ {FNR · P(x ∈ D_u^in) + FPR · P(x ∈ D_u^out)}, where the false negative rate is FNR = FN / (FN+TP) and the false positive rate is FPR = FP / (FP+TN).

• Area under the receiver operating characteristic curve (AUROC). The ROC curve is a graph of the true positive rate (TPR) against the false positive rate (FPR) obtained by varying the threshold, and we measure the area under it.

• Area under the precision-recall curve (AUPR). The PR curve is a graph of precision = TP / (TP+FP) against recall = TP / (TP+FN) obtained by varying the threshold. AUPR-in (or -out) is the AUPR where in-class (or out-of-class) samples are specified as positive.

Results. In this section, we present evaluations of our detection score s(·) (7) under various detection metrics on the CIFAR-Animals + CIFAR-Others benchmark with 4 labels per class. Table 10 shows the results: interestingly, our score outperforms MSP and ODIN and performs comparably to Mahalanobis, even though these baselines require more computational cost, e.g., input pre-processing. This confirms that the design of our score is an effective and efficient way to detect out-of-class samples based on the representation of SimCLR.
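Two of the threshold-free metrics above can be computed directly from raw scores. The sketch below assumes a higher score means in-class (positive), consistent with the paper's similarity score; the function names are illustrative.

```python
def auroc(in_scores, out_scores):
    """AUROC via the rank statistic: the probability that a random
    in-class score exceeds a random out-of-class score (ties count
    half). Equivalent to the area under the ROC curve."""
    wins = sum((i > o) + 0.5 * (i == o)
               for i in in_scores for o in out_scores)
    return wins / (len(in_scores) * len(out_scores))

def tnr_at_tpr(in_scores, out_scores, tpr=0.95):
    """TNR at the score threshold that admits the target fraction of
    in-class (positive) samples, e.g., TNR at TPR 95%."""
    thr = sorted(in_scores, reverse=True)[int(tpr * len(in_scores)) - 1]
    return sum(o < thr for o in out_scores) / len(out_scores)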
In Figure 4, we provide the receiver operating characteristic (ROC) curves that support the above results. We remark that the projection header g (4) is crucial for the detection, e.g., g enhances the AUROC of our score from 88.80% to 98.10%. By its definition (7), our score can be viewed as a simpler version of Mahalanobis without input pre-processing and feature ensembles, under an assumption of identity covariance, which may explain their comparable performance. We additionally provide the performance of OpenCoS with various detection methods, including the above baselines and two artificial methods: (a) Random, a random detection with probability 0.5, and (b) Oracle, a perfect detection. For MSP, ODIN, and Mahalanobis, we choose detection thresholds at TPR 95%. Table 11 shows the results: we observe that the classification accuracy is proportional to the detection performance. Remarkably, our detection method achieves accuracy comparable to Oracle, which gives the optimal performance of OpenCoS. We additionally provide the detection performance under various proportions of out-of-class samples, i.e., 50% and 67%, on this benchmark. For each setting, the number of out-of-class samples is fixed at 20K, while the number of in-class samples is controlled to 20K and 10K, respectively. We choose the same detection threshold t := µ_l − 2σ_l throughout all experiments: this is a reasonable choice, giving ≈95% confidence if the score follows a Gaussian distribution.
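The threshold rule t := µ_l − 2σ_l and the resulting in-/out-of-class split can be sketched as below; this is a minimal illustration with hypothetical function names, not the released implementation.

```python
import math

def detection_threshold(labeled_scores):
    """t := mu_l - 2 * sigma_l, where mu_l and sigma_l are the mean
    and standard deviation of the similarity scores on the labeled
    set. Under the paper's Gaussian argument, roughly 95% of in-class
    scores should fall above this threshold."""
    n = len(labeled_scores)
    mu = sum(labeled_scores) / n
    var = sum((s - mu) ** 2 for s in labeled_scores) / n
    return mu - 2 * math.sqrt(var)

def split_unlabeled(scores, t):
    """Partition unlabeled indices into detected in-class (score >= t)
    and detected out-of-class (score < t)."""
    in_idx = [i for i, s in enumerate(scores) if s >= t]
    out_idx = [i for i, s in enumerate(scores) if s < t]
    return in_idx, out_idx
```

The detected in-class samples then feed the base SSL method, while the detected out-of-class samples are routed to the auxiliary soft-label loss.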



Footnotes.
1. Nevertheless, our framework is not restricted to the single method of SimCLR; it is easily generalizable to other contrastive learning methods (Hénaff et al., 2019; He et al., 2020; Chen et al., 2020b).
2. In this work, we adopt the well-known cosine similarity to define our score, but other designs are possible as long as they represent class-wise similarity (Chen et al., 2020d; Vinyals et al., 2016; Snell et al., 2017).
3. https://tiny-imagenet.herokuapp.com/
4. Note that this architecture is larger than Wide-ResNet-28-2 (Zagoruyko & Komodakis, 2016) used in the semi-supervised learning literature (Oliver et al., 2018). We use ResNet-50 following the standard of SimCLR. For a fair comparison, ReMixMatch-ft + OpenCoS shares these settings.
5. https://github.com/google-research/simclr



Figure 1: (a) Illustration of open-set unlabeled data under class-distribution mismatch in semi-supervised learning, i.e., unlabeled data may contain unknown out-of-class samples. (b) Comparison of median test accuracy under varying proportions of out-of-class samples on the CIFAR-10 + TinyImageNet benchmark with 25 labels per class.

Additional batch normalization layers (Xie et al., 2020) could further help to mitigate the class-distribution mismatch by decoupling the batch normalization statistics. We propose a generic SSL framework, coined OpenCoS, based on the aforementioned techniques for handling open-set unlabeled data, which can be integrated with any existing SSL method.



Figure 2: Overview of our proposed framework, OpenCoS. First, our method detects out-of-class samples based on contrastive representation. The out-of-class samples detected by OpenCoS are further utilized via an auxiliary loss with soft-labels generated from the representation, while the remaining in-class samples are used for standard semi-supervised methods. Also, the out-of-class samples pass through additional batch normalization layers to handle a class-distribution mismatch.


Figure 3: Illustration of soft-label assignments in the CIFAR-10 + TinyImageNet benchmark. An unlabeled out-of-class sample from (a) "gazelle" is assigned a soft-label with ≈78% confidence for "deer", while one from (b) "pizza" is assigned an almost uniform soft-label (≈10% confidence per class; the panel shows the top-5 classes). The soft-labels are scaled with the temperature τ = 0.1.

Figure 4: Receiver operating characteristic (ROC) curves of detection methods on CIFAR-Animals + CIFAR-Others benchmark with 4 labels per class.

Table 1: Comparison of median test accuracy on various benchmark datasets. We report the mean and standard deviation over three runs with different random seeds and splits, and also report the mean of the best accuracy in parentheses. The best scores are indicated in bold. We denote methods handling unlabeled out-of-class samples (i.e., open-set) as "Open-SSL".

Comparison of median test accuracy on 9 super-classes of ImageNet, which are obtained by grouping semantically similar classes in ImageNet; Dog, Reptile, Produce, Bird, Insect, Food, Primate, Aquatic animal, and Scenery. We report the mean and standard deviation over three runs with different random seeds and splits. The best scores are indicated in bold. We denote methods handling unlabeled out-of-class samples (i.e., open-set) as "Open-SSL".

Comparison of the median test accuracy of Wide-ResNet-28-2 and ResNet-50 on CIFAR-100 + TinyImageNet benchmark over baseline methods. The best scores are indicated in bold. We denote methods handling unlabeled out-of-class samples (i.e., open-set) as "Open-SSL".

Comparison of the median test accuracy on the CIFAR-10 + SVHN benchmark with 400 labels per class over baseline methods. The best scores are indicated in bold.

Super-classes used in 9 benchmarks from ImageNet dataset. The class ranges are inclusive.

Comparison of median test accuracy on the CIFAR-100 + TinyImageNet benchmark with (a) 4 and (b) 25 labels per class, over various hyperparameters τ and λ.

Comparison of the median test accuracy for the use of out-of-class samples on the CIFARscale benchmarks with 4 labels per class. We denote whether using out-of-class samples for training as "w/ out-of-class". We report the mean and standard deviation over three runs with different random seeds and splits. The best scores are indicated in bold.

Comparison of detection methods on the CIFAR-Animals + CIFAR-Others benchmark with 4 labels per class under various evaluation metrics. We denote our detection method without the projection header as "Ours w/o header". The best scores are indicated in bold. Columns: TNR at TPR 95% ↑, Detection Accuracy ↑, AUROC ↑, AUPR-in ↑, AUPR-out ↑.

Comparison of the median test accuracy on the CIFAR-Animals + CIFAR-Others benchmark with 4 labels per class among various detection methods. We denote our detection method without the projection header as "Ours w/o header". We report mean and standard deviation over three runs with different random seeds and a fixed split of labeled data. The best scores are indicated in bold.

Table 12(a) shows the detection performance of our threshold and its applicability over various proportions of out-of-class samples. Although tuning t gives further improvements (see Table 12(b)), we fix the threshold without any tuning.

The detection performance across different (a) proportions of out-of-class samples and (b) detection thresholds on the CIFAR-Animals + CIFAR-Others benchmark with 4 labels per class. (a) The detection performance of the proposed threshold t := µ_l − 2σ_l under varying proportions of out-of-class samples. (b) The detection performance and median test accuracy across different thresholds t = µ_l − k · σ_l for k = 1, 2, 3, 4.


Algorithm 1 (fragment, recovered from extraction): for each labeled sample x_l, accumulate the similarity score, S_l ← S_l ∪ {s(x_l; f_e, g)} (Eq. 7); compute the threshold t from S_l; run SSL with the auxiliary loss (Eq. 8); update the parameters of f by computing the gradients of the proposed loss L_OpenCoS.

