HOW DOES SEMI-SUPERVISED LEARNING WITH PSEUDO-LABELERS WORK? A CASE STUDY

Abstract

Semi-supervised learning is a popular machine learning paradigm that utilizes a large amount of unlabeled data together with a small amount of labeled data to facilitate learning tasks. While semi-supervised learning has achieved great success in training neural networks, its theoretical understanding remains largely open. In this paper, we aim to theoretically understand a semi-supervised learning approach based on pre-training and linear probing. In particular, the semi-supervised learning approach we consider first trains a two-layer neural network on the unlabeled data with the help of pseudo-labelers. It then linearly probes the pre-trained network on a small amount of labeled data. We prove that, under a certain toy data generation model and a two-layer convolutional neural network, the semi-supervised learning approach can achieve nearly zero test loss, while a neural network directly trained by supervised learning on the same amount of labeled data can only achieve constant test loss. Through this case study, we demonstrate a separation between semi-supervised learning and supervised learning in terms of test loss, given the same amount of labeled data.

1. INTRODUCTION

With the help of human-annotated labels, supervised learning has achieved remarkable success in several computer vision tasks (Girshick et al., 2014; Long et al., 2015; Krizhevsky et al., 2012; Tran et al., 2015). However, annotating large-scale datasets (e.g., video datasets with temporal dimensions) is time-consuming and costly. In order to reduce the number of labels used for training while maintaining good prediction performance, a variety of methods have been proposed. Among these methods, semi-supervised learning (Scudder, 1965; Fralick, 1967; Agrawala, 1970), which leverages both a small amount of labeled data and a large amount of unlabeled data to improve learning performance, is one of the most widely used approaches. It has been shown to achieve promising performance for a wide variety of tasks, including image classification (Rasmus et al., 2015; Springenberg, 2015; Laine & Aila, 2016), image generation (Kingma et al., 2014; Odena, 2016; Salimans et al., 2016), domain adaptation (Saito et al., 2017; Shu et al., 2018; Lee et al., 2019), and word embedding (Turian et al., 2010; Peters et al., 2017). One of the most popular semi-supervised learning approaches is pseudo-labeling (Lee et al., 2013), which generates pseudo-labels for unlabeled data for pre-training. This approach has been remarkably successful in improving performance on many tasks. For example, in image classification, one can first train a teacher network on a small labeled dataset and use it as a pseudo-labeler to generate pseudo-labels for a large unlabeled dataset. One can then train a student network on the combination of labeled and pseudo-labeled images (Xie et al., 2020; Pham et al., 2021b; Rizve et al., 2021). In order to theoretically understand semi-supervised learning with pseudo-labelers, Oymak & Gulcu (2021) considered learning a linear classifier in the Gaussian mixture model setting.
They showed that, in the high-dimensional limit, the predictors found by semi-supervised learning are correlated with the Bayes-optimal predictor. Frei et al. (2022c) further proved that a semi-supervised learning algorithm can provably converge to the Bayes-optimal predictor for mixture models. However, these analyses are limited to linear classifiers and cannot explain the success of semi-supervised learning with neural networks.
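The teacher-student pseudo-labeling pipeline described above can be sketched as follows. This is a minimal illustration on a synthetic Gaussian mixture, not the construction analyzed in this paper: the helpers `make_data` and `train_logreg`, the class means, and all hyperparameters are hypothetical choices, and linear classifiers stand in for the teacher and student networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # balanced +/-1 labels; class mean at y * mu in 2-D with unit Gaussian noise
    y = rng.choice([-1.0, 1.0], size=n)
    mu = np.array([2.0, 2.0])
    X = y[:, None] * mu + rng.normal(size=(n, 2))
    return X, y

def train_logreg(X, y, steps=500, lr=0.1):
    # plain gradient descent on the logistic loss log(1 + exp(-y w.x))
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = np.clip(y * (X @ w), -30.0, 30.0)
        grad = -(y[:, None] * X / (1.0 + np.exp(margins))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

# small labeled set, large unlabeled set, held-out test set
X_lab, y_lab = make_data(10)
X_unl, _ = make_data(1000)
X_test, y_test = make_data(1000)

teacher = train_logreg(X_lab, y_lab)        # step 1: teacher on labeled data
y_pseudo = np.sign(X_unl @ teacher)         # step 2: pseudo-label unlabeled data
student = train_logreg(                     # step 3: student on the union
    np.vstack([X_lab, X_unl]),
    np.concatenate([y_lab, y_pseudo]),
)

acc = lambda w: (np.sign(X_test @ w) == y_test).mean()
print(f"teacher acc {acc(teacher):.3f}, student acc {acc(student):.3f}")
```

On this well-separated mixture both classifiers do well; the point of the sketch is only the data flow (labeled data trains the teacher, the teacher's predictions become training labels for the student), which is the structure the theoretical works above analyze.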

