IN DEFENSE OF PSEUDO-LABELING: AN UNCERTAINTY-AWARE PSEUDO-LABEL SELECTION FRAMEWORK FOR SEMI-SUPERVISED LEARNING

Abstract

Recent research in semi-supervised learning (SSL) is mostly dominated by consistency regularization based methods, which achieve strong performance. However, they heavily rely on domain-specific data augmentations, which are not easy to generate for all data modalities. Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation. We argue that PL underperforms due to erroneous high-confidence predictions from poorly calibrated models; these predictions generate many incorrect pseudo-labels, leading to noisy training. We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo-labeling accuracy by drastically reducing the amount of noise encountered in the training process. Furthermore, UPS generalizes the pseudo-labeling process, allowing for the creation of negative pseudo-labels; these negative pseudo-labels can be used for multi-label classification as well as for negative learning to improve single-label classification. We achieve strong performance when compared to recent SSL methods on the CIFAR-10 and CIFAR-100 datasets. We also demonstrate the versatility of our method on the video dataset UCF-101 and the multi-label dataset Pascal VOC.

1. INTRODUCTION

The recent extraordinary success of deep learning methods can be mostly attributed to advancements in learning algorithms and the availability of large-scale labeled datasets. However, constructing large labeled datasets for supervised learning tends to be costly and is often infeasible. Several approaches have been proposed to overcome this dependency on huge labeled datasets; these include semi-supervised learning (Berthelot et al., 2019; Tarvainen & Valpola, 2017; Miyato et al., 2018; Lee, 2013), self-supervised learning (Doersch et al., 2015; Noroozi & Favaro, 2016; Chen et al., 2020a), and few-shot learning (Finn et al., 2017; Snell et al., 2017; Vinyals et al., 2016). Semi-supervised learning (SSL) is one of the most dominant approaches for solving this problem, where the goal is to leverage a large unlabeled dataset alongside a small labeled dataset. One common assumption for SSL is that decision boundaries should lie in low-density regions (Chapelle & Zien, 2005). Consistency-regularization based methods achieve this by making the network outputs invariant to small input perturbations (Verma et al., 2019). However, one issue with these methods is that they often rely on a rich set of augmentations, like affine transformations, cutout (DeVries & Taylor, 2017), and color jittering in images, which limits their capability for domains where these augmentations are less effective (e.g. videos and medical images). Pseudo-labeling based methods select unlabeled samples with high confidence as training targets (pseudo-labels); this can be viewed as a form of entropy minimization, which reduces the density of data points at the decision boundaries (Grandvalet & Bengio, 2005; Lee, 2013). One advantage of pseudo-labeling over consistency regularization is that it does not inherently require augmentations and can be generally applied to most domains. However, recent consistency regularization approaches tend to outperform pseudo-labeling on SSL benchmarks.
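To make this concrete, conventional pseudo-labeling reduces to a simple confidence-threshold rule over softmax outputs. The following is a minimal sketch of that rule only; the function name and the threshold value are illustrative, not taken from any specific method:

```python
import numpy as np

def select_pseudo_labels(probs, tau=0.9):
    """Conventional pseudo-labeling: keep unlabeled samples whose
    highest predicted class probability exceeds a threshold tau
    (tau=0.9 is an arbitrary illustrative value).

    probs: (N, C) array of predicted class probabilities.
    Returns indices of selected samples and their hard pseudo-labels.
    """
    confidence = probs.max(axis=1)   # max class probability per sample
    labels = probs.argmax(axis=1)    # predicted class per sample
    selected = np.where(confidence >= tau)[0]
    return selected, labels[selected]

# Example: two confident predictions and one uncertain one.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.92, 0.03]])
idx, pl = select_pseudo_labels(probs, tau=0.9)
# idx -> [0, 2]; pl -> [0, 1]; sample 1 is rejected.
```

The weakness discussed above is visible here: the rule trusts the softmax probability itself, so a miscalibrated network that outputs 0.95 for a wrong class will still pass the threshold.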
This work is in defense of pseudo-labeling: we demonstrate that pseudo-labeling based methods can perform on par with consistency regularization methods. Although the selection of unlabeled samples with high-confidence predictions moves decision boundaries to low-density regions in pseudo-labeling based approaches, many of these selected predictions are incorrect due to the poor calibration of neural networks (Guo et al., 2017). Calibration measures the discrepancy between the confidence level of a network's individual predictions and its overall accuracy (Dawid, 1982; DeGroot & Fienberg, 1983); for poorly calibrated networks, an incorrect prediction can have high confidence. We argue that conventional pseudo-labeling based methods achieve poor results because poor network calibration produces incorrectly pseudo-labeled samples, leading to noisy training and poor generalization. To remedy this, we empirically study the relationship between output prediction uncertainty and calibration. We find that selecting predictions with low uncertainty greatly reduces the effect of poor calibration, improving generalization. Motivated by this, we propose an uncertainty-aware pseudo-label selection (UPS) framework that leverages the prediction uncertainty to guide the pseudo-label selection procedure. We believe pseudo-labeling has been impactful due to its simplicity, generality, and ease of implementation; to this end, our proposed framework attempts to maintain these benefits while addressing the issue of calibration to drastically improve PL performance. UPS does not require modality-specific augmentations and can leverage most uncertainty estimation methods in its selection process. Furthermore, the proposed framework allows for the creation of negative pseudo-labels (i.e., labels which specify the absence of specific classes).
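As a rough illustration of this idea (not the paper's exact selection criterion), an uncertainty-aware rule can combine the usual confidence threshold with an uncertainty threshold computed from multiple stochastic forward passes, e.g. MC dropout. The thresholds `tau` and `kappa`, and the use of predictive standard deviation as the uncertainty measure, are assumptions made for this sketch:

```python
import numpy as np

def ups_style_select(mc_probs, tau=0.9, kappa=0.05):
    """Uncertainty-aware selection sketch: a sample is pseudo-labeled
    only if its averaged prediction is both confident (>= tau) and
    stable across stochastic passes (uncertainty <= kappa).

    mc_probs: (T, N, C) array of probabilities from T stochastic
    forward passes (e.g. MC dropout) over N samples, C classes.
    """
    mean_probs = mc_probs.mean(axis=0)   # (N, C) averaged prediction
    std_probs = mc_probs.std(axis=0)     # (N, C) predictive std
    labels = mean_probs.argmax(axis=1)
    confidence = mean_probs.max(axis=1)
    # Uncertainty of the predicted class: std of its probability
    # across the T passes.
    uncertainty = std_probs[np.arange(len(labels)), labels]
    selected = np.where((confidence >= tau) & (uncertainty <= kappa))[0]
    return selected, labels[selected]

# Sample 0 is confident and stable; sample 1 is confident on average
# but its prediction fluctuates across passes, so it is rejected.
mc_probs = np.array([[[0.95, 0.05], [0.99, 0.01]],
                     [[0.95, 0.05], [0.85, 0.15]],
                     [[0.95, 0.05], [0.99, 0.01]]])
idx, pl = ups_style_select(mc_probs)
# idx -> [0]; pl -> [0]
```

The key difference from the plain confidence rule is the second condition: a high softmax score alone no longer suffices when the prediction is unstable under stochastic perturbation of the network.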
If a network predicts the absence of a class with high confidence and high certainty, then a negative label can be assigned to that sample. This generalization is beneficial for both single-label and multi-label learning. In the single-label case, networks can use these labels for negative learning (Kim et al., 2019); in the multi-label case, class presence is independent, so both positive and negative labels are necessary for training. Our key contributions include the following: (1) We introduce UPS, a novel uncertainty-aware pseudo-label selection framework which greatly reduces the effect of poor network calibration on the pseudo-labeling process; (2) While prior SSL methods focus on single-label classification, we generalize pseudo-labeling to create negative labels, allowing for negative learning and multi-label classification; and (3) Our comprehensive experimentation shows that the proposed method achieves strong performance on the commonly used benchmark datasets CIFAR-10 and CIFAR-100. In addition, we highlight our method's flexibility by outperforming previous state-of-the-art approaches on the video dataset UCF-101 and the multi-label Pascal VOC dataset.
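The negative-label rule described above can be sketched as the mirror image of positive selection: a class is marked absent when its predicted probability is consistently near zero across stochastic passes. The thresholds `tau_neg` and `kappa` below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def select_negative_labels(mc_probs, tau_neg=0.05, kappa=0.02):
    """Negative pseudo-label sketch: mark class c as absent for
    sample i when its mean predicted probability is very low
    (<= tau_neg) and stable across the T stochastic passes
    (std <= kappa). Both thresholds are illustrative.

    mc_probs: (T, N, C) array of class probabilities.
    Returns a boolean (N, C) mask of negative labels.
    """
    mean_probs = mc_probs.mean(axis=0)   # (N, C)
    std_probs = mc_probs.std(axis=0)     # (N, C)
    return (mean_probs <= tau_neg) & (std_probs <= kappa)

# One sample, three classes: the network is unsure whether class 0
# or class 1 is present, but consistently rules out class 2.
mc_probs = np.array([[[0.70, 0.29, 0.01]],
                     [[0.60, 0.39, 0.01]]])
neg_mask = select_negative_labels(mc_probs)
# neg_mask -> [[False, False, True]]
```

Note that a sample can contribute negative labels even when no positive pseudo-label is selected for it, which is how this generalization incorporates more unlabeled samples into training.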

2. RELATED WORKS

Semi-supervised learning is a heavily studied problem. In this work, we mostly focus on pseudo-labeling and consistency regularization based approaches, as these are currently the dominant approaches for SSL. Following Berthelot et al. (2019), we refer interested readers to other SSL approaches, which include "transductive" models (Gammerman et al., 1998; Joachims, 1999; 2003), graph-based methods (Zhu et al., 2003; Bengio et al., 2006; Liu et al., 2019), and generative modeling (Belkin & Niyogi, 2002; Lasserre et al., 2006; Kingma et al., 2014; Pu et al., 2016). Furthermore, several recent self-supervised approaches (Grill et al., 2020; Chen et al., 2020b; Caron et al., 2020) have shown strong performance when applied to the SSL task. For a general overview of SSL, we point to (Chapelle et al., 2010; Zhu, 2005).

Pseudo-labeling

The goal of pseudo-labeling (Lee, 2013; Shi et al., 2018) and self-training (Yarowsky, 1995; McClosky et al., 2006) is to generate pseudo-labels for unlabeled samples with a model trained on labeled data. In (Lee, 2013), pseudo-labels are created from the predictions of a trained neural network. Pseudo-labels can also be assigned to unlabeled samples based on neighborhood graphs (Iscen et al., 2019). Shi et al. (2018) extend the idea of pseudo-labeling by incorporating confidence scores for unlabeled samples based on the density of a local neighborhood. Inspired by noise correction work (Yi & Wu, 2019), Wang & Wu (2020) attempt to update the pseudo-labels through an optimization framework. Recently, Xie et al. (2019) showed that self-training can be used to improve performance on benchmark supervised classification tasks. A concurrent



The motivations for using negative learning (NL) in this work differ greatly from Kim et al. (2019). In this work, NL is used to incorporate more unlabeled samples into training and to generalize pseudo-labeling to the multi-label classification setting, whereas Kim et al. (2019) use negative learning primarily to obtain good network initializations for learning with noisy labels. Further discussion about NL can be found in Appendix K.

