HOW DOES SEMI-SUPERVISED LEARNING WITH PSEUDO-LABELERS WORK? A CASE STUDY

Abstract

Semi-supervised learning is a popular machine learning paradigm that utilizes a large amount of unlabeled data as well as a small amount of labeled data to facilitate learning tasks. While semi-supervised learning has achieved great success in training neural networks, its theoretical understanding remains largely open. In this paper, we aim to theoretically understand a semi-supervised learning approach based on pre-training and linear probing. In particular, the semi-supervised learning approach we consider first trains a two-layer neural network based on the unlabeled data with the help of pseudo-labelers. Then it linearly probes the pre-trained network on a small amount of labeled data. We prove that, under a certain toy data generation model and a two-layer convolutional neural network, the semi-supervised learning approach can achieve nearly zero test loss, while a neural network directly trained by supervised learning on the same amount of labeled data can only achieve constant test loss. Through this case study, we demonstrate a separation between semi-supervised learning and supervised learning in terms of test loss given the same amount of labeled data.

1. INTRODUCTION

With the help of human-annotated labels, supervised learning has achieved remarkable success in several computer vision tasks (Girshick et al., 2014; Long et al., 2015; Krizhevsky et al., 2012; Tran et al., 2015). However, annotating large-scale datasets (e.g., video datasets with temporal dimensions) is time-consuming and costly. In order to reduce the number of labels used for training while maintaining good prediction performance, a variety of methods have been proposed. Among these methods, semi-supervised learning (Scudder, 1965; Fralick, 1967; Agrawala, 1970), which leverages both a small amount of labeled data and a large amount of unlabeled data to improve learning performance, is one of the most widely used approaches. It has been shown to achieve promising performance for a wide variety of tasks, including image classification (Rasmus et al., 2015; Springenberg, 2015; Laine & Aila, 2016), image generation (Kingma et al., 2014; Odena, 2016; Salimans et al., 2016), domain adaptation (Saito et al., 2017; Shu et al., 2018; Lee et al., 2019), and word embedding (Turian et al., 2010; Peters et al., 2017).

One of the most popular semi-supervised learning approaches is pseudo-labeling (Lee et al., 2013), which generates pseudo-labels for unlabeled data for pre-training. This approach has been remarkably successful in improving performance on many tasks. For example, in image classification, one can first train a teacher network on a small labeled dataset and use it as a pseudo-labeler to generate pseudo-labels for a large unlabeled dataset. Then one can train a student network on the combination of labeled and pseudo-labeled images (Xie et al., 2020; Pham et al., 2021b; Rizve et al., 2021).

In order to theoretically understand semi-supervised learning with pseudo-labelers, Oymak & Gulcu (2021) considered learning a linear classifier in the Gaussian mixture model setting.
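The teacher-student pipeline described above can be sketched in a few lines. The sketch below is purely illustrative (the synthetic data generator, the choice of logistic-regression teacher and student, and all set sizes are our own assumptions, not the setup of any cited work): a teacher is fit on a small labeled set, used to pseudo-label a large unlabeled set, and a student is then trained on the union of labeled and pseudo-labeled data.

```python
# Illustrative teacher-student pseudo-labeling sketch (all choices are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, d=20):
    """Synthetic binary data: coordinate 0 carries the label signal."""
    y = rng.choice([-1, 1], size=n)
    X = 0.3 * rng.standard_normal((n, d))
    X[:, 0] += 0.5 * y
    return X, y

X_lab, y_lab = make_data(50)       # small labeled set
X_unlab, _ = make_data(5000)       # large unlabeled set (labels discarded)
X_test, y_test = make_data(1000)

# Step 1: train a teacher (the pseudo-labeler) on the small labeled set.
teacher = LogisticRegression().fit(X_lab, y_lab)

# Step 2: generate pseudo-labels for the unlabeled data.
pseudo_labels = teacher.predict(X_unlab)

# Step 3: train a student on the labeled and pseudo-labeled data combined.
X_all = np.vstack([X_lab, X_unlab])
y_all = np.concatenate([y_lab, pseudo_labels])
student = LogisticRegression().fit(X_all, y_all)

acc = student.score(X_test, y_test)
```

Even though the student sees mostly pseudo-labels (some of which are wrong), the much larger effective training set typically lets it match or exceed the teacher's accuracy, which is the empirical phenomenon the works cited above exploit at scale.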
They are able to show that in the high-dimensional limit, the predictors found by semi-supervised learning are correlated with the Bayes-optimal predictor. Frei et al. (2022c) further proved that a semi-supervised learning algorithm converges to the Bayes-optimal predictor for mixture models. However, their analyses are limited to linear classifiers and cannot explain the success of semi-supervised learning with neural networks.

In this paper, we attempt to theoretically explain the success of semi-supervised learning with pseudo-labelers in training neural networks. Specifically, we focus on a toy data model that contains both signal patches and noise patches, where the signal patch is correlated with the label while the noise patch is not. We consider semi-supervised learning with pre-training and linear probing. In the pre-training stage, we train a two-layer convolutional neural network (CNN) on an unlabeled dataset with pseudo-labels. We then fine-tune the pre-trained model using linear probing on a small amount of labeled data. We provide a comprehensive analysis of the learning process in both the pre-training and linear probing stages. The contributions of our work are summarized as follows.

• We theoretically show that with the help of pseudo-labelers, the CNN can learn the feature representation during the pre-training stage. Moreover, the learned features are highly correlated with the true labels of the data, even though the true labels are unknown and not used during the pre-training stage.

• Based on our analysis of the pre-training process, we further show that when linearly probing the pre-trained model on the downstream task, the final classifier can achieve near-zero test loss and test error. Notably, these guarantees of small test loss and error require only a very small amount of labeled training data.

• As a comparison, we show that standard supervised learning cannot learn a good classifier under the same setting.
Specifically, we show that even when the training process converges to a global minimum of the training loss, the learned two-layer CNN can only achieve constant-level test loss. This, together with the aforementioned results for semi-supervised learning, demonstrates the advantage of semi-supervised learning over standard supervised learning.

Notation. We use lower case letters, lower case bold face letters, and upper case bold face letters to denote scalars, vectors, and matrices, respectively. For a scalar x, we use [x]_+ to denote max{x, 0}. For a vector v = (v_1, …, v_d)^⊤, we denote by ∥v∥_2 its ℓ_2 norm, and use supp(v) := {j : v_j ≠ 0} to denote its support. For two sequences {a_k} and {b_k}, we denote a_k = O(b_k) if |a_k| ≤ C|b_k| for some absolute constant C, denote a_k = Ω(b_k) if b_k = O(a_k), and denote a_k = Θ(b_k) if a_k = O(b_k) and a_k = Ω(b_k). We also denote a_k = o(b_k) if lim |a_k/b_k| = 0. Finally, we use Θ̃(·), Õ(·), and Ω̃(·) to omit logarithmic factors in these notations.
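As a concrete, purely illustrative instance of this setup, the sketch below generates two-patch data in which one patch carries the label signal and the other is pure noise, computes features with a fixed two-layer ReLU convolutional body, and linearly probes those features on a small labeled set. Random filters here stand in for the filters that, in our analysis, would be learned during pseudo-label pre-training; every dimension and constant is an assumption chosen for illustration.

```python
# Hedged sketch of the signal-patch / noise-patch data model and linear probing.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d, m = 50, 16                      # patch dimension, number of filters (illustrative)
mu = np.zeros(d)
mu[0] = 2.0                        # signal direction and strength (assumed)

def sample(n, noise_std=0.2):
    y = rng.choice([-1, 1], size=n)
    signal = y[:, None] * mu                         # signal patch: y * mu
    noise = noise_std * rng.standard_normal((n, d))  # noise patch: label-independent
    return np.stack([signal, noise], axis=1), y      # shape (n, 2 patches, d)

# Stand-in for pre-trained convolutional filters; in the paper's setting these
# would be learned from pseudo-labeled data, here they are simply random.
W = rng.standard_normal((m, d)) / np.sqrt(d)

def features(X):
    # Two-layer CNN body: apply ReLU(<w_j, patch>) to each patch, sum-pool over patches.
    return np.maximum(X @ W.T, 0.0).sum(axis=1)      # shape (n, m)

X_lab, y_lab = sample(30)          # very small labeled set for the probe
X_test, y_test = sample(1000)

# Linear probing: the convolutional body W stays fixed; only a linear head is trained.
probe = LogisticRegression().fit(features(X_lab), y_lab)
acc = probe.score(features(X_test), y_test)
```

The design point to notice is that the probe never updates W: all the labeled data is spent on an m-dimensional linear problem, which is why a very small labeled set can suffice once the representation already separates the classes.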

2. RELATED WORK

Semi-supervised learning methods in practice. Since the invention of semi-supervised learning in Scudder (1965); Fralick (1967); Agrawala (1970), a wide range of semi-supervised learning approaches have been proposed, including generative models (Miller & Uyar, 1996; Nigam et al., 2000), semi-supervised support vector machines (Bennett & Demiriz, 1998; Xu et al., 2007; 2009), graph-based methods (Zhu et al., 2003; Belkin et al., 2006; Zhou et al., 2003), and co-training (Blum & Mitchell, 1998). For a comprehensive review of classical semi-supervised learning methods, please refer to Chapelle et al. (2010); Zhu & Goldberg (2009). In recent years, a number of deep semi-supervised learning approaches have been proposed, such as generative methods (Odena, 2016; Li et al., 2019), consistency regularization methods (Sajjadi et al., 2016; Laine & Aila, 2016; Rasmus et al., 2015; Tarvainen & Valpola, 2017), and pseudo-labeling methods (Lee et al., 2013; Zhai et al., 2019; Xie et al., 2020; Pham et al., 2021a). In this work, we focus on pseudo-labeling methods.

Theory of semi-supervised learning. To understand semi-supervised learning, Castelli & Cover (1995; 1996) studied the relative value of labeled data over unlabeled data under a parametric assumption on the marginal distribution of input features. Later, a series of works proved that semi-supervised learning can achieve better sample complexity or generalization performance than supervised learning under certain assumptions on the marginal distribution (Niyogi, 2013; Globerson et al., 2017) or the ratio of labeled and unlabeled samples (Singh et al., 2008; Darnstädt, 2015), while Balcan & Blum (2010) provided a unified PAC framework able to analyze both sample-complexity and algorithmic issues. Oymak & Gulcu (2021); Frei et al. (2022c) considered semi-supervised learning with pseudo-labelers by learning a linear classifier for mixture models and proving convergence to the Bayes-optimal predictor.




