ROPAWS: ROBUST SEMI-SUPERVISED REPRESENTATION LEARNING FROM UNCURATED DATA

Abstract

Semi-supervised learning aims to train a model using limited labels. State-of-the-art semi-supervised methods for image classification, such as PAWS, rely on self-supervised representations learned with large-scale unlabeled but curated data. However, PAWS is often less effective when using real-world unlabeled data that is uncurated, e.g., contains out-of-class data. We propose RoPAWS, a robust extension of PAWS that can work with real-world unlabeled data. We first reinterpret PAWS as a generative classifier that models densities using kernel density estimation. From this probabilistic perspective, we calibrate its prediction based on the densities of labeled and unlabeled data, which leads to a simple closed-form solution from Bayes' rule. We demonstrate that RoPAWS significantly improves PAWS, by +5.3% on uncurated Semi-iNat and +0.4% on curated ImageNet.

1. INTRODUCTION

Semi-supervised learning aims to address the fundamental challenge of training models with limited labeled data by leveraging large-scale unlabeled data. Recent works exploit the success of self-supervised learning (He et al., 2020; Chen et al., 2020a) in learning representations from unlabeled data to train large-scale semi-supervised models (Chen et al., 2020b; Cai et al., 2022). Instead of self-supervised pre-training followed by semi-supervised fine-tuning, PAWS (Assran et al., 2021) proposed a single-stage approach that combines supervised and self-supervised learning and achieves state-of-the-art accuracy and convergence speed.

While PAWS can leverage curated unlabeled data, we empirically show that it is not robust to real-world uncurated data, which often contains out-of-class data. A common approach to handling uncurated data in semi-supervised learning is to filter unlabeled data using out-of-distribution (OOD) classification (Chen et al., 2020d; Saito et al., 2021; Liu et al., 2022). However, OOD filtering methods do not fully utilize OOD data, which can be beneficial for learning representations, especially on large-scale realistic datasets. Furthermore, filtering OOD data can be ineffective, since in-class and out-of-class data are often hard to discriminate in practical scenarios.

To this end, we propose RoPAWS, a robust semi-supervised learning method that can leverage uncurated unlabeled data. PAWS makes overconfident in-class predictions on out-of-class data, since it assigns each unlabeled sample a pseudo-label based on nearby labeled data. To handle this, RoPAWS regularizes the pseudo-labels by measuring the similarities between labeled and unlabeled data. These pseudo-labels are further calibrated by label propagation among unlabeled data. Figure 1 shows a conceptual illustration of RoPAWS, and Figure 4 visualizes the learned representations. More specifically, RoPAWS calibrates the prediction of PAWS from a probabilistic view.
We first introduce a new interpretation of PAWS as a generative classifier that models densities over representations by kernel density estimation (KDE) (Rosenblatt, 1956). The calibrated prediction is given by a closed-form solution from Bayes' rule, which implicitly computes the fixed point of an iterative propagation of labels and priors over unlabeled data. In addition, RoPAWS explicitly accounts for out-of-class data by modeling a prior distribution and computing a reweighted loss, making the model robust to uncurated data. Unlike OOD filtering methods, RoPAWS leverages all of the unlabeled (and labeled) data for representation learning.
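To make the generative-classifier view concrete, the following is a minimal sketch (not the paper's exact formulation) of a KDE-based posterior over classes: class-conditional densities p(x|y) are estimated by summing Gaussian kernel weights over labeled representations of each class, and the posterior follows from Bayes' rule. The `bg_density` term is a hypothetical constant background density standing in for the out-of-class prior; a positive value shrinks all in-class posteriors for low-density (likely out-of-class) queries. The function names and the uniform class prior are our assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(x, anchors, tau=0.1):
    """Unnormalized Gaussian kernel weights between a query x and anchor points."""
    d2 = ((anchors - x[None, :]) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * tau ** 2))

def kde_posterior(x, z_lab, y_lab, n_classes, tau=0.1, bg_density=0.0):
    """Posterior p(y|x) of a KDE-based generative classifier (illustrative sketch).

    z_lab: (n, d) labeled representations; y_lab: (n,) integer labels.
    bg_density: hypothetical constant density for out-of-class data, so the
    in-class posteriors need not sum to 1 for low-density queries.
    """
    k = gaussian_kernel(x, z_lab, tau)
    # p(x|y) estimated by summing kernel weights of class-y anchors
    dens = np.array([k[y_lab == c].sum() for c in range(n_classes)])
    # Bayes' rule with a uniform class prior; the background term absorbs
    # probability mass for queries far from every labeled point
    total = dens.sum() + bg_density
    return dens / max(total, 1e-12)
```

A query near the labeled points of one class yields a confident posterior for that class, while a distant query yields near-zero in-class posteriors once `bg_density > 0`, which mirrors how density information can flag out-of-class data instead of hard-filtering it.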

