SEMI-SUPERVISED LEARNING WITH A PRINCIPLED LIKELIHOOD FROM A GENERATIVE MODEL OF DATA CURATION

Abstract

We currently lack an understanding of semi-supervised learning (SSL) objectives such as pseudo-labelling and entropy minimization as log-likelihoods, which precludes the development of e.g. Bayesian SSL. Here, we note that benchmark image datasets such as CIFAR-10 are carefully curated, and we formulate SSL objectives as a log-likelihood in a generative model of data curation. We show that SSL objectives, from entropy minimization and pseudo-labelling to state-of-the-art techniques similar to FixMatch, can be understood as lower bounds on our principled log-likelihood. We are thus able to introduce a Bayesian extension of SSL, which gives considerable improvements over standard SSL in the setting of 40 labelled points on CIFAR-10, with performance of 92.2±0.3% vs 88.6% in the original FixMatch paper. Finally, our theory suggests that SSL is effective in part due to the statistical patterns induced by data curation. This provides an explanation of past results showing that SSL performs better on clean datasets without any "out of distribution" examples. Confirming these results, we find that SSL gave much larger performance improvements on curated than on uncurated data, using matched curated and uncurated datasets based on Galaxy Zoo 2.

1. INTRODUCTION

To build high-performing deep learning models for industrial and medical applications, it is necessary to train on large human-labelled datasets. For instance, ImageNet (Deng et al., 2009), a classic benchmark dataset for object recognition, contains over 1 million labelled examples. Unfortunately, human labelling is often prohibitively expensive. In contrast, obtaining unlabelled data is usually very straightforward. For instance, unlabelled image data can be obtained in almost unlimited volumes from the internet. Semi-supervised learning (SSL) attempts to leverage this unlabelled data to reduce the required number of human labels (Seeger, 2000; Zhu, 2005; Chapelle et al., 2006; Zhu & Goldberg, 2009; Van Engelen & Hoos, 2020). One family of SSL methods - those based on low-density separation - assumes that decision boundaries lie in regions of low probability density, far from all labelled and unlabelled points. To achieve this, pre-deep-learning (DL) low-density separation SSL methods such as entropy minimization and pseudo-labelling (Grandvalet & Bengio, 2005; Lee, 2013) use objectives that repel decision boundaries away from unlabelled points by encouraging the network to make more certain predictions on those points. Entropy minimization (as the name suggests) minimizes the predictive entropy, whereas pseudo-labelling treats the currently most-probable label as a pseudo-label, and minimizes the cross entropy to that pseudo-label. More modern work uses the notion of consistency regularisation, which augments the unlabelled data (e.g. using translations and rotations), then encourages the neural network to produce similar outputs for different augmentations of the same underlying image (Sajjadi et al., 2016; Xie et al., 2019; Berthelot et al., 2019b; Sohn et al., 2020).
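The two classic low-density separation objectives above can be stated very compactly. The following is a minimal NumPy sketch, not the paper's implementation: the function names are illustrative, and the confidence threshold in the pseudo-labelling loss is an assumption borrowed from FixMatch-style methods rather than part of the original pseudo-labelling objective.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_minimization_loss(logits):
    # Mean predictive entropy over unlabelled points:
    # H[p] = -sum_y p(y|x) log p(y|x).  Minimizing this pushes
    # predictions towards certainty, repelling the decision
    # boundary from unlabelled points.
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def pseudo_label_loss(logits, threshold=0.95):
    # Cross-entropy to the argmax "pseudo-label".  The threshold
    # keeps only points where the model is already confident
    # (a FixMatch-style refinement; classic pseudo-labelling
    # corresponds to threshold=0).
    p = softmax(logits)
    conf = p.max(axis=-1)
    labels = p.argmax(axis=-1)
    mask = conf >= threshold
    if not mask.any():
        return 0.0
    ce = -np.log(p[np.arange(len(p)), labels] + 1e-12)
    return float(ce[mask].mean())
```

In practice these terms are computed on the network's outputs for unlabelled (and, for consistency regularisation, augmented) inputs and added to the supervised cross-entropy loss.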
Further developments of this line of work have resulted in many variants/combinations of these algorithms, from directly encouraging the smoothness of the classifier outputs around unlabelled datapoints (Miyato et al., 2018) to the "FixMatch" family of



1 Our code: https://anonymous.4open.science/r/GZ_SSL-ED9E; MIT Licensed

