SEMI-SUPERVISED LEARNING WITH A PRINCIPLED LIKELIHOOD FROM A GENERATIVE MODEL OF DATA CURATION

Abstract

We currently do not have an understanding of semi-supervised learning (SSL) objectives such as pseudo-labelling and entropy minimization as log-likelihoods, which precludes the development of e.g. Bayesian SSL. Here, we note that benchmark image datasets such as CIFAR-10 are carefully curated, and we formulate SSL objectives as a log-likelihood in a generative model of data curation. We show that SSL objectives, from entropy minimization and pseudo-labelling, to state-of-the-art techniques similar to FixMatch, can be understood as lower bounds on our principled log-likelihood. We are thus able to introduce a Bayesian extension of SSL, which gives considerable improvements over standard SSL in the setting of 40 labelled points on CIFAR-10, with performance of 92.2±0.3% vs 88.6% in the original FixMatch paper. Finally, our theory suggests that SSL is effective in part due to the statistical patterns induced by data curation. This provides an explanation of past results which show SSL performs better on clean datasets without any "out of distribution" examples. Confirming these results, we find that SSL gave much larger performance improvements on curated than on uncurated data, using matched curated and uncurated datasets based on Galaxy Zoo 2.¹

1. INTRODUCTION

To build high-performing deep learning models for industrial and medical applications, it is necessary to train on large human-labelled datasets. For instance, ImageNet (Deng et al., 2009), a classic benchmark dataset for object recognition, contains over 1 million labelled examples. Unfortunately, human labelling is often prohibitively expensive. In contrast, obtaining unlabelled data is usually very straightforward. For instance, unlabelled image data can be obtained in almost unlimited volumes from the internet. Semi-supervised learning (SSL) attempts to leverage this unlabelled data to reduce the required number of human labels (Seeger, 2000; Zhu, 2005; Chapelle et al., 2006; Zhu & Goldberg, 2009; Van Engelen & Hoos, 2020). One family of SSL methods, those based on low-density separation, assumes that decision boundaries lie in regions of low probability density, far from all labelled and unlabelled points. To achieve this, pre-deep-learning (DL) low-density separation SSL methods such as entropy minimization and pseudo-labelling (Grandvalet & Bengio, 2005; Lee, 2013) use objectives that repel decision boundaries away from unlabelled points by encouraging the network to make more certain predictions on those points. Entropy minimization (as the name suggests) minimizes the predictive entropy, whereas pseudo-labelling treats the currently most-probable label as a pseudo-label, and minimizes the cross-entropy to that pseudo-label. More modern work uses the notion of consistency regularisation, which augments the unlabelled data (e.g. using translations and rotations), then encourages the neural network to produce similar outputs for different augmentations of the same underlying image (Sajjadi et al., 2016; Xie et al., 2019; Berthelot et al., 2019b; Sohn et al., 2020).
Further developments of this line of work have resulted in many variants and combinations of these algorithms, from directly encouraging the smoothness of the classifier outputs around unlabelled datapoints (Miyato et al., 2018) to the "FixMatch" family of algorithms (Berthelot et al., 2019b;a; Sohn et al., 2020), which combine pseudo-labelling and consistency regularisation by augmenting each image twice, and using one of the augmented images to provide a pseudo-label for the other augmentation. However, some of the biggest successes of deep learning, from supervised learning to many generative models, have been built on the principled statistical framework of maximum (marginal) likelihood inference (e.g. the cross-entropy objective in supervised learning can be understood as the log-likelihood for a Categorical-softmax model of the class label; MacKay, 2003). Low-density separation SSL methods such as pseudo-labelling and entropy minimization are designed primarily to encourage the class boundary to lie in low-density regions. They therefore cannot be understood as log-likelihoods, and cannot be combined with principled statistical methods such as Bayesian inference. Here, we give a formal account of SSL methods based on low-density separation (Chapelle et al., 2006) as lower bounds on a principled log-likelihood. In particular, we consider pseudo-labelling (Lee, 2013), entropy minimization (Grandvalet & Bengio, 2005), and modern methods similar to FixMatch (Sohn et al., 2020). Thus, we introduce a Bayesian extension of SSL which gives 92.2 ± 0.3% accuracy, vs 88.6% in the case of 40 labelled examples in the original FixMatch paper. We confirm the importance of data curation for SSL on real data from Galaxy Zoo 2 (also see Cozman et al., 2003; Oliver et al., 2018; Chen et al., 2020; Guo et al., 2020).
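The FixMatch-style combination of pseudo-labelling and consistency regularisation described above can be summarised in a few lines. The following is a minimal numpy sketch of the idea, not the original implementation; the function names and the 0.95 confidence threshold are our own illustrative choices (the threshold value happens to match the one used in the FixMatch paper).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fixmatch_style_loss(weak_logits, strong_logits, threshold=0.95):
    """Illustrative FixMatch-style unlabelled loss.

    The weakly augmented view of each image provides a hard pseudo-label;
    the strongly augmented view is pushed (via cross-entropy) towards that
    label, but only for images where the weak prediction is confident.
    """
    weak_probs = softmax(weak_logits)             # (batch, classes)
    pseudo_labels = weak_probs.argmax(axis=-1)    # hard pseudo-labels
    confidence = weak_probs.max(axis=-1)
    mask = confidence >= threshold                # keep confident examples only
    log_strong = np.log(softmax(strong_logits))
    ce = -log_strong[np.arange(len(pseudo_labels)), pseudo_labels]
    return (ce * mask).mean()                     # masked mean over the batch
```

For example, a batch element whose weak-view prediction is near-uniform is masked out and contributes nothing to the loss, while a confidently predicted element contributes the cross-entropy between its strong-view prediction and its pseudo-label.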

2. BACKGROUND

The intuition behind low-density separation objectives for semi-supervised learning is that decision boundaries should lie in low-density regions, away from both labelled and unlabelled data. As such, it is sensible to "repel" decision boundaries from labelled and unlabelled datapoints, which can be achieved by making the classifier as certain as possible on those points. This happens automatically for labelled points, as the standard supervised objective already encourages the classifier to be as certain as possible about the true class label. For unlabelled points, however, we need a new objective that encourages certainty, and we focus on two approaches. The first, and perhaps most direct, is entropy minimization (Grandvalet & Bengio, 2005),
$$\mathcal{L}_{\text{neg entropy}}(X) = \sum_{y \in \mathcal{Y}} p_y(X) \log p_y(X), \tag{1}$$
where $X$ is the input, $y$ is a particular label, and $\mathcal{Y}$ is the set of possible labels. Here, we have followed the typical probabilistic approach of writing the negative entropy as an objective to be maximized. Alternatively, we could use pseudo-labelling, which takes the current classification, $y^*$, to be the true label, and maximizes the log-probability of that label (Lee, 2013),
$$\mathcal{L}_{\text{pseudo}}(X) = \log p_{y^*}(X), \qquad y^* = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \log p_y(X). \tag{2}$$
Lee (2013) regarded pseudo-labelling as closely related to entropy minimization, as the optimal value of both objectives is reached when all the probability mass is assigned to a single class. However, neither is formulated as a principled log-likelihood, which gives rise to at least three problems. First, these methods cannot be combined with other principled statistical methods such as Bayesian inference. Second, it is unclear how to combine these objectives with standard supervised objectives, except by taking a weighted sum and optimizing the weight as a hyperparameter. Third, these objectives risk reinforcing any initial poor classifications, and it is unclear whether this is desirable.
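Eqs. (1) and (2) translate directly into code. The following numpy sketch is our own illustration (the function names are not from the paper); both objectives take the vector of class probabilities $p_y(X)$ for a single input:

```python
import numpy as np

def neg_entropy_objective(probs):
    """Eq. (1): negative entropy of the predictive distribution.

    Maximising this (i.e. minimising entropy) sharpens the prediction,
    repelling the decision boundary from the unlabelled point.
    """
    return float(np.sum(probs * np.log(probs)))

def pseudo_label_objective(probs):
    """Eq. (2): log-probability of the currently most probable label y*."""
    y_star = int(np.argmax(probs))
    return float(np.log(probs[y_star]))
```

Both objectives are at most 0, and both approach their maximum as all the probability mass concentrates on a single class, which is exactly the sense in which Lee (2013) regarded the two as closely related: a sharp prediction such as `[0.98, 0.01, 0.01]` scores higher than a flat one under either objective.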

UNLABELLED POINTS ARE UNINFORMATIVE UNDER THE STANDARD GENERATIVE MODEL

It is important to note that, under the standard supervised-learning generative model, unlabelled points should not give any information about the weights. The typical supervised learning setup assumes that the joint probability factorises as
$$P(X, \theta, Y) = P(X)\, P(\theta)\, P(Y \mid X, \theta),$$
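Under this factorisation, an unlabelled point contributes $P(X) \sum_{y} P(y \mid X, \theta) = P(X)$ to the likelihood, because the conditional probabilities sum to one over the label set; this term is constant in $\theta$, so the unlabelled point carries no information about the weights. A toy numpy check of this (the softmax classifier and parameter values are our own illustration):

```python
import numpy as np

def class_probs(x, theta):
    """Toy softmax classifier P(Y | X=x, theta) for a scalar input x."""
    logits = theta * x                  # theta: one weight per class
    z = logits - logits.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

# An unlabelled point's likelihood contribution is
# P(X) * sum_y P(y | X, theta) = P(X): the sum over labels is 1
# for every theta, so it cannot inform the weights.
x = 1.7
for theta in [np.zeros(3), np.array([3.0, -1.0, 0.5])]:
    assert np.isclose(class_probs(x, theta).sum(), 1.0)
```

However arbitrarily we change `theta`, the sum over labels stays at 1, so the marginal likelihood of the unlabelled input is unchanged; only the generative model of curation introduced in this paper breaks this independence.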



¹ Our code: https://anonymous.4open.science/r/GZ_SSL-ED9E; MIT Licensed

