DON'T FEAR THE UNLABELLED: SAFE SEMI-SUPERVISED LEARNING VIA DEBIASING

Abstract

Semi-supervised learning (SSL) provides an effective means of leveraging unlabelled data to improve a model's performance. Even though the domain has received a considerable amount of attention in the past years, most methods present the common drawback of lacking theoretical guarantees. Our starting point is to notice that the estimate of the risk that most discriminative SSL methods minimise is biased, even asymptotically. This bias impedes the use of standard statistical learning theory and can hurt empirical performance. We propose a simple way of removing the bias. Our debiasing approach is straightforward to implement and applicable to most deep SSL methods. We provide simple theoretical guarantees on the trustworthiness of these modified methods, without having to rely on the strong assumptions on the data distribution that SSL theory usually requires. In particular, we provide generalisation error bounds for the proposed methods. We evaluate debiased versions of different existing SSL methods, such as the Pseudo-label method and FixMatch, and show that debiasing can compete with classic deep SSL techniques in various settings by providing better-calibrated models. Additionally, we provide a theoretical explanation of the intuition behind popular SSL methods. An implementation of a debiased version of FixMatch is available at https://github.com/HugoSchmutz/DeFixmatch

1. INTRODUCTION

The promise of semi-supervised learning (SSL) is to be able to learn powerful predictive models using partially labelled data. In turn, this would allow machine learning to be less dependent on the often costly and sometimes dangerously biased task of labelling data. Early SSL approaches, e.g. Scudder's (1965) untaught pattern recognition machine, simply replaced unknown labels with predictions made by some estimate of the predictive model and used the obtained pseudo-labels to refine their initial estimate. Other more complex branches of SSL have been explored since, notably using generative models (from McLachlan, 1977, to Kingma et al., 2014) or graphs (notably following Zhu et al., 2003). Deep neural networks, which are state-of-the-art supervised predictors, have been trained successfully using SSL. Somewhat surprisingly, the main ingredient of their success is still the notion of pseudo-labels (or one of its variants), combined with systematic use of data augmentation (e.g. Xie et al., 2019; Sohn et al., 2020; Rizve et al., 2021). An obvious SSL baseline is simply throwing away the unlabelled data. We will call such a baseline the complete case, following the missing data literature (e.g. Tsiatis, 2006). As reported in van Engelen & Hoos (2020), the main risk of SSL is the potential performance degradation caused by the introduction of unlabelled data. Indeed, semi-supervised learning outperforms the complete case baseline only in specific cases (Singh et al., 2008; Schölkopf et al., 2012; Li & Zhou, 2014). This degradation risk for generative models has been analysed in Chapelle et al. (2006, Chapter 4). To overcome this issue, previous works introduced the notion of safe semi-supervised learning for techniques which never reduce predictive performance by introducing unlabelled data (Li & Zhou, 2014; Guo et al., 2020).
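The Scudder-style self-training loop described above fits in a few lines. The sketch below is purely illustrative (the toy data, the logistic-regression model, and all names such as `self_train` are ours, not from the paper): fit a model on the labelled data, pseudo-label the unlabelled points, then refit on everything.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, steps=500):
    """Minimal logistic regression fitted by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

def self_train(X_lab, y_lab, X_unlab, rounds=3):
    """Scudder-style self-training: fit on labelled data, pseudo-label
    the unlabelled points with the current model, then refit on both."""
    w = fit_logreg(X_lab, y_lab)
    for _ in range(rounds):
        pseudo = (sigmoid(X_unlab @ w) > 0.5).astype(float)  # hard pseudo-labels
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo])
        w = fit_logreg(X_all, y_all)
    return w

# Toy 1-D data: two well-separated Gaussian classes, plus a bias feature.
rng = np.random.default_rng(0)
x0 = rng.normal(-2.0, 1.0, 100)
x1 = rng.normal(2.0, 1.0, 100)
X = np.column_stack([np.concatenate([x0, x1]), np.ones(200)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Keep only 10 labels per class; the remaining 180 points are unlabelled.
lab_idx = np.r_[0:10, 100:110]
unlab_idx = np.setdiff1d(np.arange(200), lab_idx)
w = self_train(X[lab_idx], y[lab_idx], X[unlab_idx])
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
```

On data this well clustered, self-training works: the cluster assumption holds, so pseudo-labels are mostly correct and the refit does no harm. The toy example in the next paragraph shows what happens when that assumption fails.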
Our loose definition of safeness is as follows: an SSL algorithm is safe if it has theoretical guarantees that are similar to or stronger than those of the complete case baseline. The "theoretical" part of the definition is motivated by the fact that any empirical assessment of the generalisation performance of an SSL algorithm is jeopardised by the scarcity of labels. "Similar or stronger guarantees" can be understood in a broad sense, since there are many kinds of theoretical guarantees (e.g. the two methods may both be consistent, have similar generalisation bounds, or both be asymptotically normal with related asymptotic variances). Unfortunately, popular deep SSL techniques generally do not benefit from theoretical guarantees without strong and essentially untestable assumptions on the data distribution (Mey & Loog, 2022), such as the smoothness assumption (small perturbations of the features x do not cause large modifications of the labels, p(y|pert(x)) ≈ p(y|x)) or the cluster assumption (data points are distributed on discrete clusters and points in the same cluster are likely to share the same label). Entropy minimisation, pseudo-labelling and consistency-based methods all rely on such distributional assumptions to ensure performance. However, no proof is given that guarantees the effectiveness of state-of-the-art methods (Tarvainen & Valpola, 2017; Miyato et al., 2018; Sohn et al., 2020; Pham et al., 2021). To illustrate that SSL requires specific assumptions, we show in a toy example that pseudo-labelling can fail. To do so, we draw samples from two uniform distributions with a small overlap. Both supervised and semi-supervised neural networks are trained using the same labelled dataset. While the supervised algorithm perfectly learns the true posterior p(1|x), the semi-supervised learning methods (both entropy minimisation and pseudo-label) underestimate p(1|x) for x ∈ [1, 3] (see Figure 1).
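A setup of this kind can be reconstructed as follows. The exact interval endpoints below are our assumption (chosen so that the two classes overlap on [1, 3], matching the region where the text reports underestimation); only the general shape, two uniforms with a small overlap, comes from the paper.

```python
import numpy as np

# Hypothetical reconstruction of the toy data: class 0 ~ U[-1, 3],
# class 1 ~ U[1, 5], equal priors, so the classes overlap on [1, 3].
rng = np.random.default_rng(0)
n = 1000
x0 = rng.uniform(-1.0, 3.0, n)  # class 0 samples
x1 = rng.uniform(1.0, 5.0, n)   # class 1 samples

def true_posterior(x):
    """p(1|x) under equal class priors: ratio of the class densities."""
    d0 = ((x >= -1.0) & (x <= 3.0)) / 4.0
    d1 = ((x >= 1.0) & (x <= 5.0)) / 4.0
    return d1 / (d0 + d1)

# Inside the overlap both densities are equal, so the true p(1|x) is 1/2:
post = true_posterior(np.array([0.0, 2.0, 4.0]))  # outside / overlap / outside

# Empirical check: among samples falling in [1, 3], about half are class 1.
xs_all = np.concatenate([x0, x1])
labels = np.concatenate([np.zeros(n), np.ones(n)])
in_overlap = (xs_all >= 1.0) & (xs_all <= 3.0)
frac1 = labels[in_overlap].mean()
```

The failure mode is then easy to see: on the overlap, the true posterior is 1/2, but entropy minimisation and pseudo-labelling both push predictions towards 0 or 1, so any confident answer there is necessarily miscalibrated. This is precisely a setting where the cluster assumption does not hold.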
We also test our proposed method (DeSSL) on this dataset and show that the unbiased version of each SSL technique learns the true distribution accurately. See Appendix A for the results with entropy minimisation. Beyond this toy example, a recent benchmark (Wang et al., 2022) of recent SSL methods demonstrates that no single method is empirically better than the others. The scarcity of labels therefore highlights the need for competitive methods that benefit from theoretical guarantees. The main motivation of this work is to show that these competitive methods can be easily modified to benefit from theoretical guarantees without performance degradation.

1.1. CONTRIBUTIONS

Rather than relying on the strong geometric assumptions usually used in SSL theory, we use the missing completely at random (MCAR) assumption, a standard assumption from the missing data literature (see e.g. Little & Rubin, 2019) that is often implicitly made in most SSL works. With this single assumption on the data distribution, we propose a new safe SSL method derived by simply debiasing common SSL risk estimates. Our main contributions are:

• We introduce debiased SSL (DeSSL), a safe method that can be applied to most deep SSL algorithms without assumptions on the data distribution;

• We propose a theoretical explanation of the intuition behind popular SSL methods. We provide theoretical guarantees on the safeness of DeSSL in terms of consistency, calibration and asymptotic normality, and we also provide a generalisation error bound;

• We show how simple it is to apply DeSSL to the most popular methods, such as Pseudo-label and FixMatch, and show empirically that DeSSL leads to models that are never worse than their classical counterparts, generally better calibrated and sometimes much more accurate.
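To convey the flavour of the debiasing idea, here is a hedged numerical sketch, our reading of the mechanism rather than a verbatim formula from the paper. A typical SSL objective adds a surrogate penalty (e.g. an entropy term) computed on unlabelled predictions; under MCAR, the same penalty computed on labelled points has the same expectation, so subtracting it cancels the bias on average. All names (`surrogate`, `dessl_penalty`, `lam`) are illustrative.

```python
import numpy as np

def surrogate(p):
    """Binary entropy of predicted probabilities -- a typical SSL penalty."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log1p(-p))

def ssl_penalty(p_lab, p_unlab, lam=1.0):
    """Classic SSL: penalty on unlabelled predictions only (biased term)."""
    return lam * surrogate(p_unlab).mean()

def dessl_penalty(p_lab, p_unlab, lam=1.0):
    """Debiased sketch: subtract the same penalty on labelled predictions."""
    return lam * (surrogate(p_unlab).mean() - surrogate(p_lab).mean())

# Under MCAR, labelled and unlabelled predictions are draws from the same
# distribution, so the debiased penalty is zero in expectation, whatever
# that common distribution is (here, uniform predicted probabilities).
rng = np.random.default_rng(0)
draws = [dessl_penalty(rng.uniform(size=50), rng.uniform(size=200))
         for _ in range(2000)]
mean_penalty = float(np.mean(draws))
```

The classic penalty `ssl_penalty` has a strictly positive expectation, which is exactly the asymptotic bias the paper points at; the debiased version averages to zero, leaving only the (unbiased) complete-case risk to drive learning.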



Figure 1: (Left) Data histogram. (Right) Posterior probabilities p(1|x) of the same model trained following either complete case (only labelled data), Pseudo-label or our DePseudo-label.

