DON'T FEAR THE UNLABELLED: SAFE SEMI-SUPERVISED LEARNING VIA DEBIASING

Abstract

Semi-supervised learning (SSL) provides an effective means of leveraging unlabelled data to improve a model's performance. Even though the domain has received a considerable amount of attention in recent years, most methods present the common drawback of lacking theoretical guarantees. Our starting point is to notice that the estimate of the risk that most discriminative SSL methods minimise is biased, even asymptotically. This bias impedes the use of standard statistical learning theory and can hurt empirical performance. We propose a simple way of removing the bias. Our debiasing approach is straightforward to implement and applicable to most deep SSL methods. We provide simple theoretical guarantees on the trustworthiness of these modified methods, without having to rely on the strong assumptions on the data distribution that SSL theory usually requires. In particular, we provide generalisation error bounds for the proposed methods. We evaluate debiased versions of different existing SSL methods, such as the Pseudo-label method and FixMatch, and show that debiasing can compete with classic deep SSL techniques in various settings by providing better-calibrated models. Additionally, we provide a theoretical explanation of the intuition behind popular SSL methods. An implementation of a debiased version of FixMatch is available at https://github.com/HugoSchmutz/DeFixmatch
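To fix ideas before the formal treatment, the following minimal PyTorch sketch illustrates the debiasing principle on one common surrogate, entropy minimisation: the surrogate term computed on unlabelled inputs is counterbalanced by the same term computed on labelled inputs, so that the two cancel in expectation. The function name, the choice of entropy as the surrogate, and the weight `lam` are illustrative choices for this sketch, not a specification of the released implementation.

```python
import torch.nn.functional as F

def entropy(logits):
    # Mean Shannon entropy of the predicted class probabilities.
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

def debiased_ssl_loss(model, x_lab, y_lab, x_unlab, lam=1.0):
    """Debiased SSL objective (sketch, with entropy as the surrogate).

    Standard SSL minimises  CE(labelled) + lam * H(unlabelled),
    whose expectation differs from the supervised risk. Subtracting
    the same surrogate H evaluated on the labelled inputs removes
    that bias, since both H terms share the same expectation when
    labelled and unlabelled inputs are drawn from the same marginal.
    """
    logits_lab = model(x_lab)
    supervised = F.cross_entropy(logits_lab, y_lab)  # complete case term
    h_unlab = entropy(model(x_unlab))                # usual SSL surrogate
    h_lab = entropy(logits_lab)                      # debiasing term
    return supervised + lam * (h_unlab - h_lab)
```

Note that setting `lam = 0` recovers exactly the complete case baseline discussed below, which is what makes a direct safety comparison with it meaningful.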

1. INTRODUCTION

The promise of semi-supervised learning (SSL) is to be able to learn powerful predictive models using partially labelled data. In turn, this would allow machine learning to be less dependent on the often costly and sometimes dangerously biased task of labelling data. Early SSL approaches, e.g. Scudder's (1965) untaught pattern recognition machine, simply replaced unknown labels with predictions made by some estimate of the predictive model and used the obtained pseudo-labels to refine their initial estimate. Other more complex branches of SSL have been explored since, notably using generative models (from McLachlan, 1977, to Kingma et al., 2014) or graphs (notably following Zhu et al., 2003). Deep neural networks, which are state-of-the-art supervised predictors, have been trained successfully using SSL. Somewhat surprisingly, the main ingredient of their success is still the notion of pseudo-labels (or one of its variants), combined with the systematic use of data augmentation (e.g. Xie et al., 2019; Sohn et al., 2020; Rizve et al., 2021).

An obvious SSL baseline is simply throwing away the unlabelled data. We will call such a baseline the complete case, following the missing data literature (e.g. Tsiatis, 2006). As reported in van Engelen & Hoos (2020), the main risk of SSL is the potential performance degradation caused by the introduction of unlabelled data. Indeed, semi-supervised learning outperforms the complete case baseline only in specific cases (Singh et al., 2008; Schölkopf et al., 2012; Li & Zhou, 2014). This degradation risk has been analysed for generative models in Chapelle et al. (2006, Chapter 4). To overcome this issue, previous works introduced the notion of safe semi-supervised learning for techniques that never reduce predictive performance when introducing unlabelled data (Li & Zhou, 2014; Guo et al., 2020). Our loose definition of safeness is as follows: an SSL algorithm is safe if it has theoretical guarantees that are similar to or stronger than those of the complete case baseline. The "theoretical" part of the definition

