DON'T FEAR THE UNLABELLED: SAFE SEMI-SUPERVISED LEARNING VIA DEBIASING

Abstract

Semi-supervised learning (SSL) provides an effective means of leveraging unlabelled data to improve a model's performance. Even though the domain has received considerable attention in recent years, most methods share a common drawback: a lack of theoretical guarantees. Our starting point is the observation that the risk estimate minimised by most discriminative SSL methods is biased, even asymptotically. This bias impedes the use of standard statistical learning theory and can hurt empirical performance. We propose a simple way of removing the bias. Our debiasing approach is straightforward to implement and applicable to most deep SSL methods. We provide simple theoretical guarantees on the trustworthiness of these modified methods, without relying on the strong assumptions on the data distribution that SSL theory usually requires. In particular, we provide generalisation error bounds for the proposed methods. We evaluate debiased versions of existing SSL methods, such as the Pseudo-label method and Fixmatch, and show that debiasing can compete with classic deep SSL techniques in various settings while providing better calibrated models. Additionally, we provide a theoretical explanation of the intuition behind popular SSL methods. An implementation of a debiased version of Fixmatch is available at https://github.com/HugoSchmutz/DeFixmatch

1. INTRODUCTION

The promise of semi-supervised learning (SSL) is to be able to learn powerful predictive models using partially labelled data. In turn, this would allow machine learning to be less dependent on the often costly and sometimes dangerously biased task of labelling data. Early SSL approaches, such as Scudder's (1965) untaught pattern recognition machine, simply replaced unknown labels with predictions made by some estimate of the predictive model and used the obtained pseudo-labels to refine the initial estimate. Other, more complex branches of SSL have been explored since, notably using generative models (from McLachlan, 1977, to Kingma et al., 2014) or graphs (notably following Zhu et al., 2003). Deep neural networks, which are state-of-the-art supervised predictors, have been trained successfully using SSL. Somewhat surprisingly, the main ingredient of their success is still the notion of pseudo-labels (or one of its variants), combined with systematic use of data augmentation (e.g. Xie et al., 2019; Sohn et al., 2020; Rizve et al., 2021). An obvious SSL baseline is simply throwing away the unlabelled data. We call this baseline the complete case, following the missing data literature (e.g. Tsiatis, 2006). As reported by van Engelen & Hoos (2020), the main risk of SSL is the potential degradation caused by the introduction of unlabelled data: semi-supervised learning outperforms the complete case baseline only in specific cases (Singh et al., 2008; Schölkopf et al., 2012; Li & Zhou, 2014). This degradation risk for generative models has been analysed in Chapelle et al. (2006, Chapter 4). To overcome this issue, previous works introduced the notion of safe semi-supervised learning for techniques that never reduce predictive performance when introducing unlabelled data (Li & Zhou, 2014; Guo et al., 2020).
Our loose definition of safeness is as follows: an SSL algorithm is safe if it has theoretical guarantees that are similar to or stronger than those of the complete case baseline. The "theoretical" part of the definition is motivated by the fact that any empirical assessment of the generalisation performance of an SSL algorithm is jeopardised by the scarcity of labels. "Similar or stronger guarantees" can be understood in a broad sense, since there are many kinds of theoretical guarantees (e.g. the two methods may both be consistent, have similar generalisation bounds, or both be asymptotically normal with related asymptotic variances). Unfortunately, popular deep SSL techniques generally do not benefit from theoretical guarantees without strong and essentially untestable assumptions on the data distribution (Mey & Loog, 2022), such as the smoothness assumption (small perturbations of the features x do not cause large modifications of the labels, p(y|pert(x)) ≈ p(y|x)) or the cluster assumption (data points are distributed in discrete clusters, and points in the same cluster are likely to share the same label). Entropy minimisation, pseudo-labelling and consistency-based methods all rely on such distributional assumptions to ensure good performance. However, no proof is given that guarantees the effectiveness of state-of-the-art methods (Tarvainen & Valpola, 2017; Miyato et al., 2018; Sohn et al., 2020; Pham et al., 2021). To illustrate that SSL requires specific assumptions, we show on a toy example that pseudo-labelling can fail. To do so, we draw samples from two uniform distributions with a small overlap. Both supervised and semi-supervised neural networks are trained using the same labelled dataset. While the supervised algorithm learns the true distribution p(1|x) perfectly, the semi-supervised methods (both entropy minimisation and pseudo-label) underestimate p(1|x) for x ∈ [1, 3] (see Figure 1).
We also test our proposed method (DeSSL) on this dataset and show that the unbiased version of each SSL technique learns the true distribution accurately. See Appendix A for the results with entropy minimisation. Beyond this toy example, a recent benchmark (Wang et al., 2022) of recent SSL methods demonstrates that no single method is empirically better than the others. The scarcity of labels therefore brings out the need for competitive methods that benefit from theoretical guarantees. The main motivation of this work is to show that such competitive methods can easily be modified to benefit from theoretical guarantees without performance degradation.

1.1. CONTRIBUTIONS

Rather than relying on the strong geometric assumptions usually used in SSL theory, we use the missing completely at random (MCAR) assumption, a standard assumption from the missing data literature (see e.g. Little & Rubin, 2019) that is implicitly made in most SSL works. With only this assumption on the data distribution, we propose a new safe SSL method derived by simply debiasing common SSL risk estimates. Our main contributions are:

• We introduce debiased SSL (DeSSL), a safe method that can be applied to most deep SSL algorithms without assumptions on the data distribution;

• We propose a theoretical explanation of the intuition behind popular SSL methods. We provide theoretical guarantees on the safeness of DeSSL regarding consistency, calibration and asymptotic normality. We also provide a generalisation error bound;

• We show how simple it is to apply DeSSL to the most popular methods, such as Pseudo-label and Fixmatch, and show empirically that DeSSL leads to models that are never worse than their classical counterparts, generally better calibrated and sometimes much more accurate.

2. SEMI-SUPERVISED LEARNING

The ultimate objective of most learning frameworks is to minimise a risk R, defined as the expectation of a particular loss function L over a data distribution p(x, y), over a set of models f_θ(x) parametrised by θ ∈ Θ. The distribution p(x, y) being unknown, we generally minimise a Monte Carlo approximation of the risk, the empirical risk R̂(θ), computed on a sample of n i.i.d. points drawn from p(x, y). Under mild assumptions, R̂(θ) is an unbiased and consistent estimate of R(θ). Its unbiasedness is one of the basic properties used in the development of traditional learning theory and asymptotic statistics (van der Vaart, 1998; Shalev-Shwartz & Ben-David, 2014). Semi-supervised learning leverages both labelled and unlabelled data to improve the model's performance and generalisation: further information on the distribution p(x) provides a better understanding of the distributions p(x, y) and p(y|x). Indeed, p(x) may carry information on p(y|x) (Schölkopf et al., 2012; Goodfellow et al., 2016, Chapter 7.6; van Engelen & Hoos, 2020). In the following, we have access to n samples drawn from the distribution p(x, y), where some of the labels are missing. We introduce a new binary random variable r ∼ B(π) that governs whether or not a data point is labelled (r = 0 missing, r = 1 observed, with π ∈ (0, 1) the probability of being labelled). The labelled (respectively unlabelled) data points are indexed by the set L (respectively U), with L ∪ U = {1, ..., n}. We denote by n_l the number of labelled and n_u the number of unlabelled data points. The MCAR assumption states that the missingness of a label y is independent of both the features and the value of the label: p(x, y, r) = p(x, y)p(r). This is the case when neither features nor labels carry information about the potential missingness of the labels. This description of semi-supervised learning as a missing data problem has already been adopted in multiple works (see e.g.
Seeger, 2000; Ahfock & McLachlan, 2019). Moreover, the MCAR assumption is implicitly made in the experimental design of most SSL works: missing labels are drawn completely at random in datasets such as MNIST, CIFAR or SVHN (Tarvainen & Valpola, 2017; Miyato et al., 2018; Xie et al., 2019; Sohn et al., 2020). Finally, the definition of safeness in the introduction is not straightforward without the MCAR assumption, as the complete case is not an unbiased estimator of the risk in non-MCAR settings (see e.g. Liu & Goldberg, 2020, Section 3.1).

2.1. COMPLETE CASE: THROWING THE UNLABELLED DATA AWAY

In missing data theory, the complete case is the learning scheme that only uses fully observed instances, namely labelled data. The natural estimator of the risk is then simply the empirical risk computed on the labelled data. Fortunately, in the MCAR setting, the complete case risk estimate keeps the same good properties as the traditional supervised one: it is unbiased and converges pointwise to R(θ). Therefore, traditional learning theory holds for the complete case under MCAR. While these observations are hardly new (see e.g. Liu & Goldberg, 2020), they can be seen as particular cases of the theory that we develop below. The risk to minimise is
$$\hat{R}_{\mathrm{CC}}(\theta) = \frac{1}{n_l} \sum_{i \in L} L(\theta; x_i, y_i). \quad (1)$$
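As a quick numerical illustration (a synthetic sketch, not from the paper: the per-sample losses and labelling probability below are made up), averaging losses over an MCAR-labelled subset estimates the full-population risk with no bias:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample losses L(theta; x_i, y_i) for a fixed theta.
loss = rng.exponential(scale=1.0, size=200_000)

# MCAR labelling: r_i ~ Bernoulli(pi), independent of (x_i, y_i).
pi = 0.1
r = rng.random(loss.shape[0]) < pi

# Complete-case risk estimate: mean loss over labelled points only.
risk_cc = loss[r].mean()

# Under MCAR, risk_cc matches the full-population mean up to Monte Carlo noise.
gap = abs(risk_cc - loss.mean())
```

Were the labelling instead dependent on the loss value (a non-MCAR setting), the complete-case average would be systematically biased, which is why the MCAR assumption matters here.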

2.2. INCORPORATING UNLABELLED DATA

A major drawback of the complete case framework is that a lot of data ends up not being exploited. A class of SSL approaches, mainly inductive methods in the taxonomy of van Engelen & Hoos (2020), aims to minimise a modified estimator of the risk that includes unlabelled data. The optimisation problem therefore generally becomes finding the θ that minimises the SSL risk,
$$\hat{R}_{\mathrm{SSL}}(\theta) = \frac{1}{n_l} \sum_{i \in L} L(\theta; x_i, y_i) + \frac{\lambda}{n_u} \sum_{i \in U} H(\theta; x_i), \quad (2)$$
where H is a term that does not depend on the labels and λ is a scalar weight balancing the labelled and unlabelled terms. In the literature, H can generally be seen as a surrogate of L. Indeed, the intuitive choices of H turn out to be equal or equivalent to a form of expectation of L under a distribution given by the model.

2.3. SOME EXAMPLES OF SURROGATES

A recent overview of SSL techniques has been proposed by van Engelen & Hoos (2020). In this work, we focus on methods suited for a discriminative probabilistic model p_θ(y|x) that approximates the conditional p(y|x). We categorise methods into two distinct families, entropy-based and consistency-based.

Entropy-based methods. Entropy-based methods aim to minimise a term of entropy of the predictions computed on unlabelled data. They thus encourage the model to be confident on unlabelled data, implicitly using the cluster assumption. Entropy-based methods can all be described through an expectation of L under a distribution π_x computed at the data point x:
$$H(\theta; x) = \mathbb{E}_{\pi_x(\tilde{x}, \tilde{y})}[L(\theta; \tilde{x}, \tilde{y})]. \quad (3)$$
For instance, Grandvalet & Bengio (2004) simply use the Shannon entropy as H(θ; x), which can be rewritten as Equation (3) with π_x(x̃, ỹ) = δ_x(x̃)p_θ(ỹ|x), where δ_x is the Dirac distribution at x. Pseudo-label methods, which pick the class with the maximum predicted probability as a pseudo-label for the unlabelled data (Scudder, 1965), can also be described as Equation (3). See Appendix B for a complete description of the entropy-based literature (Berthelot et al., 2019; 2020; Xie et al., 2019; Sohn et al., 2020; Rizve et al., 2021) and further details.

Consistency-based methods. Another range of SSL methods minimises a consistency objective that encourages predictions to be invariant to perturbations of either the data or the model, in order to enforce stability of the model predictions. These methods rely on the smoothness assumption. In this category, we cite the Π-model (Sajjadi et al., 2016), temporal ensembling (Laine & Aila, 2017), Mean Teacher (Tarvainen & Valpola, 2017), virtual adversarial training (VAT, Miyato et al., 2018) and interpolation consistency training (ICT, Verma et al., 2019). These objectives H are equivalent to an expectation of L (see Appendix B).
The general form of the unsupervised objective can be written as
$$C_1 \, \mathbb{E}_{\pi_x(\tilde{x}, \tilde{y})}[L(\theta; \tilde{x}, \tilde{y})] \;\leq\; H(\theta; x) = \mathrm{Div}\big(f_\theta(x, \cdot), \mathrm{pert}(f_{\bar{\theta}}(x, \cdot))\big) \;\leq\; C_2 \, \mathbb{E}_{\pi_x(\tilde{x}, \tilde{y})}[L(\theta; \tilde{x}, \tilde{y})],$$
where f_θ denotes the predictions of the model, Div is a non-negative function that measures the divergence between two distributions, θ̄ is a fixed copy of the current parameter θ (the gradient is not propagated through θ̄), pert is a perturbation applied to the model or the data, and 0 ≤ C_1 ≤ C_2. Previous works also remarked that H is an expectation of L for entropy minimisation and pseudo-label (Zhu et al., 2022; Aminian et al., 2022). We describe a more general framework covering further methods and, with our theory, provide an intuition on the choice of H.
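To make the surrogate view concrete, here is a small numerical check (a sketch; the probability vector is arbitrary) that the Shannon entropy surrogate of Grandvalet & Bengio (2004) is exactly the expectation of the negative log-likelihood loss under the model's own predictive distribution, while the pseudo-label surrogate plugs in the argmax class:

```python
import numpy as np

def nll(probs, y):
    """Negative log-likelihood loss L(theta; x, y) given model probabilities."""
    return -np.log(probs[y])

# An arbitrary model predictive distribution p_theta(. | x) over 3 classes.
p = np.array([0.7, 0.2, 0.1])

# Entropy surrogate: H(theta; x) = E_{y ~ p_theta(.|x)}[L(theta; x, y)],
# which coincides with the Shannon entropy of the prediction.
h_entropy = sum(p[y] * nll(p, y) for y in range(len(p)))
shannon = -(p * np.log(p)).sum()

# Pseudo-label surrogate: the loss evaluated at the most likely class.
h_pseudo = nll(p, p.argmax())
```

Both surrogates are computable without the true label, which is what makes them usable on unlabelled points in Equation (2).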

2.4. SAFE SEMI-SUPERVISED LEARNING

The main risk of SSL is the potential degradation caused by the introduction of unlabelled data when distributional assumptions are not satisfied (Singh et al., 2008; Schölkopf et al., 2012; Li & Zhou, 2014), specifically in settings where the MCAR assumption no longer holds (Oliver et al., 2018; Guo et al., 2020). Additionally, Zhu et al. (2022) show disparate impacts of pseudo-labelling on the different sub-classes of the population. As remarked by Oliver et al. (2018), SSL performance is often enabled by leveraging large validation sets, which is not suited to real-world applications. To mitigate these problems, previous works introduced the notion of safe semi-supervised learning for techniques which never reduce learning performance by introducing unlabelled data (Li & Zhou, 2014; Kawakita & Takeuchi, 2014; Li et al., 2016; Gan et al., 2017; Trapp et al., 2017; Guo et al., 2020). While these works share the spirit of this definition of safeness, they formalise it in different ways; we list in Appendix C.1 the theoretical guarantees given by these different works. In our work, we call safe an SSL algorithm that has theoretical guarantees similar to or stronger than those of the complete case baseline. Even though the methods presented in Section 2.3 produce good performance on a variety of SSL benchmarks, they generally do not benefit from theoretical guarantees, even elementary ones. Moreover, Schölkopf et al. (2012) identify settings, characterised by the causal relationship between the features x and the target y, where SSL may systematically fail, even when classic SSL assumptions hold. Our example of Figure 1 also shows that classic SSL may fail to generalise in a very benign setting with a large number of labelled data.
The methods presented above minimise a biased version of the risk under the MCAR assumption, and therefore classical learning theory no longer applies, as we argue more precisely in Appendix C.2. Learning with a biased estimate of the risk is not necessarily unsafe, but it is difficult to provide theoretical guarantees for such methods, even if some works attempt to do so under strong assumptions on the data distribution (Mey & Loog, 2022, Sections 4 and 5). Additionally, the choice of H can be confusing, as seen in the literature. For instance, Grandvalet & Bengio (2004) and Corduneanu & Jaakkola (2003) respectively perform entropy and mutual information minimisation, whereas Pereyra et al. (2017) and Krause et al. (2010) perform maximisation of the same quantities. Some other SSL methods do have theoretical guarantees; unfortunately, so far these methods come with either strong assumptions or important computational burdens. Li & Zhou (2014) introduced a safe semi-supervised SVM and showed that the accuracy of their method is never worse than that of an SVM trained with only labelled data, under the assumption that the true model is accessible. However, if the distributional assumptions are not satisfied, neither improvement nor degradation is expected. Sakai et al. (2017) proposed an unbiased estimate of the risk for binary classification by including unlabelled data. The key idea is to use unlabelled data to better evaluate, on the one hand, the risk on positive class samples and, on the other, the risk on negative samples. They provided theoretical guarantees on its variance and a generalisation error bound. The method is designed only for binary classification and has not been tested in a deep-learning setting. It has been extended to ordinal regression in follow-up work (Tsuchiya et al., 2021). In the context of kernel machines, Liu & Goldberg (2020) used an unbiased estimate of the risk, like ours, for a specific choice of H. Guo et al.
(2020) proposed DS3L, a safe method that needs to approximately solve a bi-level optimisation problem. In particular, the method is designed for a different setting, not under the MCAR assumption, where there is a class mismatch between labelled and unlabelled data. The resolution of the optimisation problem provides a solution no worse than the complete case, but comes with approximations. They provide a generalisation error bound. Sokolovska et al. (2008) proposed a method with asymptotic guarantees under strong assumptions, namely that the feature space is finite and the marginal distribution of x is fully known. Fox-Roberts & Rosten (2014) proposed an unbiased estimator in the generative setting applicable to a large range of models, and proved that this estimator has a lower variance than that of the complete case.

3. DESSL: UNBIASED SEMI-SUPERVISED LEARNING

To overcome the issues introduced by the second term in the SSL approximation of the risk, we propose DeSSL, an unbiased version of the SSL estimator that uses labelled data to cancel the bias. The idea is to retrieve the properties of classical learning theory. Fortunately, we will see that the proposed method can even have better properties than the complete case, in particular with regard to the variance of the estimate. The proposed DeSSL objective is
$$\hat{R}_{\mathrm{DeSSL}}(\theta) = \frac{1}{n_l} \sum_{i \in L} L(\theta; x_i, y_i) + \frac{\lambda}{n_u} \sum_{i \in U} H(\theta; x_i) - \frac{\lambda}{n_l} \sum_{i \in L} H(\theta; x_i). \quad (4)$$
Under the MCAR assumption, this estimator is unbiased for any value of the parameter λ. For a proof of this result, see Appendix D. We prove the optimality of debiasing with the labelled dataset in Appendix F. Intuitively, for entropy-based methods, H should be applied only to unlabelled data, to enforce the confidence of the model only on unlabelled data points; for consistency-based methods, H can be applied to any subset of data points. Our theory and proposed method remain the same whether H is applied to all the available data or not (see Appendix K).
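As an illustration, here is a minimal NumPy sketch of the DeSSL objective with the negative log-likelihood as L and the Shannon entropy as H (an entropy-minimisation surrogate). The function names and toy logits are ours, not taken from the official implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_nll(probs, y):
    """Average negative log-likelihood over a labelled batch."""
    return -np.log(probs[np.arange(len(y)), y]).mean()

def mean_entropy(probs):
    """Average Shannon entropy of the predictions (label-free surrogate H)."""
    return -(probs * np.log(probs)).sum(axis=1).mean()

def dessl_risk(logits_l, y_l, logits_u, lam):
    """Debiased SSL risk: the complete-case term, plus the surrogate on
    unlabelled data, minus the same surrogate on labelled data."""
    p_l, p_u = softmax(logits_l), softmax(logits_u)
    return (mean_nll(p_l, y_l)
            + lam * mean_entropy(p_u)
            - lam * mean_entropy(p_l))

logits_l = np.array([[2.0, 0.0], [0.0, 2.0]])
y_l = np.array([0, 1])
```

When the labelled and unlabelled batches happen to have identical predictive distributions, the two surrogate terms cancel exactly and the DeSSL risk reduces to the complete-case risk, whatever the value of λ.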

3.1. DOES THE DESSL RISK ESTIMATOR MAKE SENSE?

The most intuitive interpretation is that by debiasing the risk estimator, we get back to the basics of learning theory. This way of debiasing is closely related to the method of control variates (Owen, 2013, Chapter 8), a common variance reduction technique: a term with null expectation is added to a Monte Carlo estimator in order to reduce its variance without modifying its expectation. DeSSL can also be interpreted as a control variate on the risk's gradient itself, and should thus improve the optimisation scheme. This idea is close to the optimisation schemes introduced by Johnson & Zhang (2013) and Defazio et al. (2014), which reduce the variance of the gradient estimates to improve optimisation performance. As a matter of fact, we study the gradient variance reduction of DeSSL in Appendix E.1. Another interesting way to interpret DeSSL is as a constrained optimisation problem. Indeed, minimising R̂_DeSSL is equivalent to minimising the Lagrangian of the following problem:
$$\min_\theta \hat{R}_{\mathrm{CC}}(\theta) \quad \text{s.t.} \quad \frac{1}{n_u} \sum_{i \in U} H(\theta; x_i) = \frac{1}{n_l} \sum_{i \in L} H(\theta; x_i).$$
The idea of this optimisation problem is to minimise the complete case risk estimator while ensuring that some property represented by H is on average equal on the labelled and unlabelled data. For example, in the case of entropy minimisation, this programme encourages the model to have the same confidence on unlabelled examples as on labelled ones; the debiasing term of our objective penalises the confidence of the model on the labelled data. Pereyra et al. (2017) show that penalising the entropy of supervised models improves on the state of the art on common benchmarks, which supports the idea of debiasing with labelled data in the case of entropy minimisation. Moreover, the debiasing term in pseudo-label is similar to plausibility inference (Barndorff-Nielsen, 1976).
Intuitively, we understand the benefit of debiasing the estimator with labelled data as penalising the confidence of the model on these data points. The debiasing can in fact be performed on any subset of the training data (labelled or unlabelled); however, with regard to the variance of the estimator, we can prove that debiasing with only the labelled data or with the whole dataset are both optimal and equivalent (see Appendix F). Our objective also resembles the doubly robust risk estimates used for SSL in the context of kernel machines by Liu & Goldberg (2020) and for deep learning by Hu et al. (2022). In both cases, their focus is quite different, as they consider weaker conditions than MCAR but very specific choices of H.
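The control-variate mechanism underlying DeSSL can be sketched in a few lines (a synthetic example, unrelated to any particular SSL loss): subtracting a zero-mean correlated term leaves the estimate unbiased while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

f = np.exp(x)        # per-sample quantity whose mean we want to estimate
h = x                # control variate: correlated with f, known zero mean

# Optimal control-variate coefficient (cf. Owen, 2013, Chapter 8).
lam = np.cov(f, h)[0, 1] / h.var()

# Same expectation as f, lower per-sample variance.
f_cv = f - lam * h
```

In DeSSL, the analogous zero-mean term is the gap between the surrogate H averaged on unlabelled and on labelled points, which vanishes in expectation under MCAR.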

3.2. IS RDeSSL (θ) AN ACCURATE RISK ESTIMATE?

Because of the connections between our debiased estimate and variance reduction techniques, we have a natural interest in the variance of the estimate. A lower-variance estimate of the risk means estimating it more accurately, leading to better models. Similarly to traditional control variates (Owen, 2013), the variance can be computed and optimised in λ.

Theorem 3.1. The function λ → V(R̂_DeSSL(θ)|r) reaches its minimum at
$$\lambda_{\mathrm{opt}} = \frac{n_u}{n} \frac{\mathrm{Cov}(L(\theta; x, y), H(\theta; x))}{\mathbb{V}(H(\theta; x))},$$
and at λ_opt,
$$\mathbb{V}(\hat{R}_{\mathrm{DeSSL}}(\theta)|r)_{\lambda_{\mathrm{opt}}} = \left(1 - \frac{n_u}{n} \rho^2_{L,H}\right) \mathbb{V}(\hat{R}_{\mathrm{CC}}(\theta)|r) \leq \mathbb{V}(\hat{R}_{\mathrm{CC}}(\theta)|r),$$
where ρ_{L,H} = Corr(L(θ; x, y), H(θ; x)). Additionally, we have a variance reduction regime, V(R̂_DeSSL(θ)|r) ≤ V(R̂_CC(θ)|r), for all λ between 0 and 2λ_opt.

A proof of this theorem is available in Appendix E. This theorem provides a formal justification of the heuristic idea that H should be a surrogate of L. Indeed, DeSSL is a more accurate risk estimate when H is strongly positively correlated with L, which is likely to be the case when H is equal or equivalent to an expectation of L. Choosing λ positive is then a coherent choice. We demonstrate in Appendix E that L and H are positively correlated when L is the negative log-likelihood and H is the entropy. Finally, we experimentally validate this theorem and the unbiasedness of our estimator in Appendix E.1. Other SSL methods have variance reduction guarantees; see Fox-Roberts & Rosten (2014) and Sakai et al. (2017). In a purely supervised context, Chen et al. (2020) show that the effectiveness of data augmentation techniques lies partially in the variance reduction of the risk estimate. A natural application of this theorem would be to tune λ automatically by estimating λ_opt. In our case, however, the estimation of Cov(L(θ; x, y), H(θ; x)) with few labels led to unsatisfactory results.
Nevertheless, we estimate it more accurately using the test set (which is of course impossible in practice) on different datasets and methods, to provide intuition on the order of magnitude of λ_opt and the range of the variance reduction regime, in Appendix M.2.
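The variance reduction of Theorem 3.1 can be checked with a small simulation (entirely synthetic: the per-sample values of L and H below are correlated Gaussians, not real losses, and the labelled fraction is arbitrary). We repeatedly re-draw which points are labelled, compute the risk estimates, and compare their variances at the λ_opt given by the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_l = 10_000, 2_000
n_u = n - n_l

# Synthetic, positively correlated per-sample loss L and surrogate H.
z = rng.normal(size=n)
L_vals = z + 0.3 * rng.normal(size=n)
H_vals = z

# lambda_opt = (n_u / n) * Cov(L, H) / Var(H), as in Theorem 3.1.
lam_opt = (n_u / n) * np.cov(L_vals, H_vals)[0, 1] / H_vals.var()

def risk_variance(lam, trials=1_000):
    """Variance of the DeSSL risk estimate over random labelled subsets
    (lam = 0 recovers the complete-case estimate)."""
    estimates = []
    for _ in range(trials):
        idx = rng.permutation(n)
        lab, unl = idx[:n_l], idx[n_l:]
        estimates.append(L_vals[lab].mean()
                         + lam * H_vals[unl].mean()
                         - lam * H_vals[lab].mean())
    return np.var(estimates)

var_cc, var_opt = risk_variance(0.0), risk_variance(lam_opt)
```

With strongly correlated L and H, the debiased estimate at λ_opt has a markedly smaller variance than the complete-case estimate, consistent with the (1 - (n_u/n)ρ²) factor of the theorem.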

3.3. CALIBRATION, CONSISTENCY AND ASYMPTOTIC NORMALITY

The calibration of a model is its capacity to predict probability estimates that are representative of the true distribution. This property is crucial in real-world applications where reliable predictions are needed. In the sense of scoring rules theory (Gneiting & Raftery, 2007), we prove that DeSSL is as well calibrated as the complete case; see Theorem G.1 and its proof in Appendix G. We say that θ̂ is consistent if d(θ̂, θ*) converges to 0 in probability as n → ∞, where d is a distance on Θ. The properties of θ̂ depend on the behaviour of the functions L and H. We therefore use the following standard assumptions.

Assumption 3.2. The minimum θ* of R is well-separated: $\inf_{\theta : d(\theta^*, \theta) \geq \epsilon} R(\theta) > R(\theta^*)$.

Assumption 3.3. The uniform weak law of large numbers holds for both L and H.

Theorem 3.4. Under the MCAR assumption, Assumption 3.2 and Assumption 3.3, θ̂ = arg min R̂_DeSSL is consistent.

For a proof of this theorem, see Appendix G. The theorem is a simple application of van der Vaart's (1998) Theorem 5.7 on the consistency of M-estimators. The result also holds for the complete case, with λ = 0, which shows that the complete case is a solid baseline under the MCAR assumption.

Coupling of n_l and n_u under the MCAR assumption. Under the MCAR assumption, n_l and n_u are random variables. We have r ∼ B(π) (i.e. any x has probability π of being labelled). Then, as n grows to infinity, $\frac{n_l}{n} = \frac{n_l}{n_l + n_u} \to \pi$. Therefore, both n_l and n_u grow to infinity, with $\frac{n_l}{n_u} \to \frac{\pi}{1-\pi}$. This implies n_u = O(n_l) as n goes to infinity. Going further, we prove the asymptotic normality of θ̂_DeSSL by slightly simplifying the learning objective: the idea is to replace n_l with πn to remove the dependence between samples. This modification is equivalent to our objective up to a change of λ. We also show that the asymptotic variance can be optimised with respect to λ.
We define the cross-covariance matrix between the random vectors ∇L(θ; x, y) and ∇H(θ; x) as K_θ(i, j) = Cov(∇L(θ; x, y)_i, ∇H(θ; x)_j).

Theorem 3.5. Suppose L and H are smooth functions in C²(Θ, ℝ). Assume R(θ) admits a second-order Taylor expansion at θ* with a non-singular second-order derivative V_{θ*}. Under the MCAR assumption, θ̂_DeSSL is asymptotically normal and the trace of the covariance can be minimised. Indeed, Tr(Σ_DeSSL) reaches its minimum at
$$\lambda_{\mathrm{opt}} = (1 - \pi) \frac{\mathrm{Tr}\big(V_{\theta^*}^{-1} K_{\theta^*} V_{\theta^*}^{-1}\big)}{\mathrm{Tr}\big(V_{\theta^*}^{-1} \mathbb{E}\left[\nabla H(\theta^*; x) \nabla H(\theta^*; x)^T\right] V_{\theta^*}^{-1}\big)},$$
and at λ_opt we have Tr(Σ_DeSSL) - Tr(Σ_CC) ≤ 0.

See Appendix I for a proof of this result. The theorem shows that there exists a range of λ for which DeSSL yields a better estimate of the model's parameters than the complete case. In this context, we have a variance reduction regime not only on the risk estimate but, even more importantly, on the parameter estimate.

3.4. RADEMACHER COMPLEXITY AND GENERALISATION BOUNDS

In this section, we prove an upper bound on the generalisation error of DeSSL. To simplify, we use the same modification as for the asymptotic variance. The unbiasedness of R̂_DeSSL can directly be used to derive generalisation bounds based on the Rademacher complexity (Bartlett & Mendelson, 2002), defined in our case as
$$\hat{\mathcal{R}}_n = \mathbb{E}_{(\varepsilon_i)_{i \leq n}} \left[ \sup_{\theta \in \Theta} \left( \frac{1}{n\pi} \sum_{i \in L} \varepsilon_i L(\theta; x_i, y_i) - \frac{\lambda}{n\pi} \sum_{i \in L} \varepsilon_i H(\theta; x_i) + \frac{\lambda}{n(1-\pi)} \sum_{i \in U} \varepsilon_i H(\theta; x_i) \right) \right],$$
where the ε_i are i.i.d. Rademacher variables independent of the data. In the particular case λ = 0, we recover the standard Rademacher complexity of the complete case.

Theorem 3.6. Assume that labels are MCAR and that both L and H are bounded. Then there exists a constant κ > 0, depending on λ, L, H and the ratio of observed labels, such that, with probability at least 1 - δ, for all θ ∈ Θ,
$$R(\theta) \leq \hat{R}_{\mathrm{DeSSL}}(\theta) + 2\hat{\mathcal{R}}_n + \kappa \sqrt{\frac{\log(4/\delta)}{n}}.$$

Published as a conference paper at ICLR 2023

The proof follows Shalev-Shwartz & Ben-David (2014, Chapter 26) and is available in Appendix J.

4. EXPERIMENTS

We evaluate the performance of DeSSL against different classic methods. In particular, we perform experiments with varying λ on MNIST, DermaMNIST, CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), and with a fixed λ on five small datasets of MedMNIST (Yang et al., 2021; 2023). The results of these experiments are reported below. In our figures, the error bars represent the size of the 95% confidence interval (CI). We also modified the implementation of Fixmatch (Sohn et al., 2020) and compare it with its debiased version on CIFAR-10, CIFAR-100 and STL10 (Coates et al., 2011). Finally, we show how simple it is to debias an existing implementation by demonstrating it on the consistency-based models benchmarked by Oliver et al. (2018), namely VAT, Π-model and Mean Teacher, on CIFAR-10 and SVHN (Netzer et al., 2011). We observe similar performance between the debiased and biased versions of the different methods, both in terms of cross-entropy and accuracy. See Appendix O.

4.1. PSEUDO LABEL

We compare PseudoLabel and DePseudoLabel on CIFAR-10 and MNIST. We test the influence of the hyperparameter λ and report the accuracy, the cross-entropy and the expected calibration error (ECE, Guo et al., 2017) at the epoch of best validation accuracy, using 10% of n_l as the validation set.

MNIST is an advantageous dataset for SSL, since classes are well separated. We train a LeNet-like architecture using n_l = 1000 labelled data points on 10 different splits of the training dataset into a labelled and an unlabelled set. Models are then evaluated on the standard 10,000 test samples. Results are reported in Figure 2 and Appendix L. In this example, SSL and DeSSL have almost the same accuracy for all λ; however, DeSSL appears consistently better calibrated.

CIFAR-10 We train a CNN-13 from Tarvainen & Valpola (2017) on 5 different splits. We use n_l = 4000 labelled data points and the rest of the dataset as unlabelled. Models are then evaluated on the standard test samples. Results are reported in Figure 3, with the ECE in Appendix M. The performance of both methods on CIFAR-100 with n_l = 10000 is reported in Appendix M. We observe that DeSSL provides both a better cross-entropy and a better ECE, with the same accuracy, for small λ. For larger λ, DeSSL performs better in all reported metrics. We performed a paired Student's t-test to check that our results are significant and report the p-values in Appendix M. The p-values indicate that for λ close to 10, DeSSL is often significantly better in all metrics. Moreover, DeSSL for large λ provides a better cross-entropy and ECE than the complete case, whereas SSL never does.

MedMNIST is a large-scale MNIST-like collection of biomedical images. We selected the five smallest 2D datasets of the collection; for these datasets, it is likely that the cluster assumption no longer holds. Results are reported in Appendix L.
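For reference, the ECE used throughout these experiments can be sketched as follows (a standard equal-width-bin implementation; the choice of 10 bins and the toy arrays are ours, for illustration only):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: average |accuracy - confidence| over equal-width confidence
    bins, weighted by the fraction of samples falling in each bin."""
    conf = probs.max(axis=1)          # confidence of the top prediction
    pred = probs.argmax(axis=1)       # predicted class
    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece

probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]])
labels = np.array([0, 1, 0])
```

A perfectly calibrated and perfectly confident classifier has an ECE of zero; overconfident models, the typical failure mode discussed in these experiments, have larger values.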
We show that DePseudoLabel competes with PseudoLabel in terms of accuracy, and even succeeds when PseudoLabel's accuracy falls below that of the complete case. Moreover, DePseudoLabel is always better in terms of cross-entropy, and hence calibration, whereas PseudoLabel is always worse than the complete case. We focus on the unbalanced dataset DermaMNIST, which has a majority class, and compare the accuracy per class of the complete case, PseudoLabel and DePseudoLabel.

4.2. FIXMATCH (SOHN ET AL., 2020)

The efficiency of the debiasing method lies in the correlation between the labelled objective and the unlabelled objective, as shown in Theorem 3.1. We therefore debiased a version of Fixmatch that includes strong augmentations in the labelled objective; see Appendix N for further details. We also compare our results to the original Fixmatch (Fixmatch* in Table 1). While this modified version is slightly worse than the original Fixmatch, debiasing it works. For this experiment, we use n_l = 4000 on 5 different folds for CIFAR-10 and n_l = 10000 on one fold for CIFAR-100. First, we report that a strong complete case baseline using data augmentation reaches 87.27% accuracy on CIFAR-10 (resp. 62.62% on CIFAR-100). A paired Student's t-test confirms that our results are significantly better on CIFAR-10 (p-value in Appendix N). We then observe that debiasing improves both the accuracy and the cross-entropy of this modified version of Fixmatch. Inspired by Zhu et al. (2022), we analyse individual classes on CIFAR-10 and show that our method improves performance on "poor" classes more equally than the biased version. Pseudo-label-based methods with a fixed selection threshold draw more pseudo-labels from the "easy" subpopulations; for instance, in the toy dataset (Figure 6), PseudoLabel will always draw samples of the blue class in the overlapping area. The debiasing term prevents the method from overfitting and becoming overconfident on the "easy" classes. Indeed, DeFixmatch improves over Fixmatch by 1.57% (resp. 1.94%) overall, and by 4.91% (resp. 8.00%) on the worst class. We report the accuracy per class in Appendix N.1 and the results on STL10 in Appendix N.2.

5. CONCLUSION

Motivated by the remarks of van Engelen & Hoos (2020) and Oliver et al. (2018) on the lack of theoretical guarantees in SSL, we proposed a simple modification of SSL frameworks. We consider frameworks that include unlabelled data in the computation of the risk estimator, and we debias them using labelled data. We show that this debiasing comes with several theoretical guarantees, and we demonstrate these results experimentally on several common SSL datasets. DeSSL shows competitive accuracy compared to its biased counterpart while significantly improving calibration. Several future directions are open to us. Our experiments suggest that DeSSL performs as well as SSL when the classic SSL assumptions hold (such as the cluster assumption), for instance on MNIST in Figure 2; there, DeSSL does not outperform in terms of accuracy, but it does in terms of cross-entropy and expected calibration error. When these assumptions no longer hold, DeSSL outperforms its SSL counterparts. We showed that $\lambda_{opt}$ exists (Theorem 3.1), and our formula therefore provides guidelines for the optimisation of $\lambda$. Finally, an interesting improvement would be to go beyond the MCAR assumption by considering settings with a distribution mismatch between labelled and unlabelled data (Guo et al., 2020; Cao et al., 2021; Hu et al., 2022).

A TOY EXAMPLE

We trained a 4-layer neural network (1/20/100/20/1) with ReLU activations using 25,000 labelled and 25,000 unlabelled points drawn from two overlapping 1D uniform distributions. We used $\lambda = 1$ and a confidence threshold $\tau = 0.70$ for Pseudo-label. We optimised the model's weights with stochastic gradient descent (SGD) and a learning rate of 0.1.
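The toy setup can be reproduced in miniature. The sketch below substitutes a 1D logistic regression for the 4-layer network and uses full-batch gradient steps, so it only illustrates the shape of the debiased pseudo-label objective (sample sizes and all names are ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two overlapping 1D uniforms, as above; class 0 on [-1, 0.2], class 1 on [-0.2, 1].
n = 2000
x_lab = np.concatenate([rng.uniform(-1.0, 0.2, n // 2), rng.uniform(-0.2, 1.0, n // 2)])
y_lab = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
x_unl = np.concatenate([rng.uniform(-1.0, 0.2, n // 2), rng.uniform(-0.2, 1.0, n // 2)])

w, b = 0.0, 0.0
lr, lam, tau = 0.1, 1.0, 0.70

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pl_grads(x, p, tau):
    # Gradient of the pseudo-label cross-entropy, on confident points only.
    conf = (np.maximum(p, 1.0 - p) > tau).astype(float)
    hard = (p > 0.5).astype(float)
    return np.mean(conf * (p - hard) * x), np.mean(conf * (p - hard))

for _ in range(200):  # full-batch gradient steps for brevity
    p_l = sigmoid(w * x_lab + b)
    g_w = np.mean((p_l - y_lab) * x_lab)  # supervised cross-entropy gradient
    g_b = np.mean(p_l - y_lab)
    gu_w, gu_b = pl_grads(x_unl, sigmoid(w * x_unl + b), tau)
    gl_w, gl_b = pl_grads(x_lab, p_l, tau)
    g_w += lam * (gu_w - gl_w)  # pseudo-label term, debiased on labelled points
    g_b += lam * (gu_b - gl_b)
    w -= lr * g_w
    b -= lr * g_b

acc = np.mean((sigmoid(w * x_lab + b) > 0.5) == y_lab)
```

With overlapping supports, the best achievable accuracy is below 100%; the sketch only checks that the debiased objective still learns a sensible decision boundary.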

B DETAILS ON SURROGATES AND MORE EXAMPLES

We provide in this appendix further details on our classification of SSL methods into entropy-based and consistency-based (see Section 2.3). We detail a general framework for both of these classes of methods and show how popular SSL methods fit into our framework.

B.1 ENTROPY-BASED

We classify as entropy-based the methods that aim to minimise an entropy term, such as Grandvalet & Bengio (2004), which minimises Shannon's entropy, or Pseudo-label, whose objective is a form of entropy (see Remark E.5). These methods encourage the model to be confident on unlabelled data, implicitly relying on the cluster assumption. We recall that entropy-based methods can all be described as an expectation of $L$ under a distribution $\pi_x$ computed at the datapoint $x$: $H(\theta; x) = \mathbb{E}_{\pi_x(\tilde{x}, \tilde{y})}[L(\theta; \tilde{x}, \tilde{y})]$.

Pseudo-label: As presented in the core article, the unsupervised objective of Pseudo-label can be written as an expectation of $L$ under the distribution $\pi_x(\tilde{x}, \tilde{y}) = \delta_x(\tilde{x}) p_\theta(\tilde{y}|\tilde{x})$. The Pseudo-label unsupervised term can also be written as a Rényi entropy, $H_\infty(x) = -\max_y \log p_\theta(y|x)$. For this reason, we classify Pseudo-label methods as entropy-based. Lee (2013) popularised the pseudo-label method for deep semi-supervised learning. Rizve et al. (2021) recently improved pseudo-label selection by introducing an uncertainty-aware mechanism on the confidence of the model's predicted probabilities. Pham et al. (2021) reach state-of-the-art performance on the ImageNet challenge using pseudo-labels on a large dataset of additional images.
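Concretely, the pseudo-label surrogate can be computed as the Rényi min-entropy of the predicted probabilities. A minimal sketch (function name and toy probabilities are ours):

```python
import numpy as np

def min_entropy(probs):
    # Rényi min-entropy H_inf(x) = -log max_y p(y|x); modulo the confidence
    # selection, this is the Pseudo-label unsupervised objective.
    return -np.log(probs.max(axis=-1))

p = np.array([[0.70, 0.20, 0.10],    # fairly confident prediction
              [0.34, 0.33, 0.33]])   # nearly uniform prediction
h = min_entropy(p)
```

Confident predictions get a low min-entropy, so minimising it pushes the model towards confident outputs on unlabelled points, as the cluster assumption suggests.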

B.2 PSEUDO-LABEL AND DATA AUGMENTATION

Recently, several methods based on data augmentation have been proposed and shown to perform well on a large spectrum of SSL tasks. The idea is to make the model resilient to strong data augmentations of the input (Berthelot et al., 2019; 2020; Sohn et al., 2020; Xie et al., 2019; Zhang et al., 2021a). These methods rely on both the cluster assumption and the smoothness assumption and sit at the border between entropy-based and consistency-based methods: the model should make the same prediction for an input and an augmented version of it. For instance, in Sohn et al. (2020), we first compute pseudo-labels predicted from a weakly augmented version of $x$ (flip-and-shift data augmentation) and then minimise the likelihood of these predictions on a strongly augmented version of $x$. In Xie et al. (2019), the method differs slightly: we minimise the cross-entropy between the prediction of the model on $x$ and its predictions on an augmented version. In both cases, the unsupervised part of the risk estimator can be reformulated as Equation 12.

Fixmatch: In Fixmatch (Sohn et al., 2020), the unsupervised objective can be written as
$$H(\theta; x) = \mathbb{1}[\max_y p_{\bar{\theta}}(y|x_1) > \tau] \, L(\theta; x_2, \arg\max_y p_{\bar{\theta}}(y|x_1)),$$
where $\bar{\theta}$ is a fixed copy of the current parameters $\theta$, indicating that the gradient is not propagated through it, $x_1$ is a weakly augmented version of $x$ and $x_2$ a strongly augmented one. Therefore, we write $H$ as an expectation of $L$ under the distribution $\pi_x(\tilde{x}, \tilde{y}) = \delta_{x_2}(\tilde{x}) \, \delta_{\arg\max_y p_{\bar{\theta}}(y|x_1)}(\tilde{y}) \, \mathbb{1}[\max_y p_{\bar{\theta}}(y|x_1) > \tau]$.

UDA: In UDA (Xie et al., 2019), the unsupervised objective can be written as
$$H(\theta; x) = \sum_y p_{\bar{\theta}}(y|x) \, L(\theta; x_1, y),$$
where $\bar{\theta}$ is a fixed copy of the current parameters $\theta$ and $x_1$ is an augmented version of $x$. Therefore, we write $H$ as an expectation of $L$ under the distribution $\pi_x(\tilde{x}, \tilde{y}) = \delta_{x_1}(\tilde{x}) \, p_{\bar{\theta}}(\tilde{y}|x)$.
The former (Zhang et al., 2021a) is an improved version of Fixmatch with a threshold $\tau$ that varies with the class and the training stage. The latter (Rizve et al., 2021) introduces a measure of uncertainty in the pseudo-labelling step to improve the selection, and also introduces negative pseudo-labels to improve single-label classification.
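The Fixmatch objective above can be sketched as follows, with logits arrays standing in for the network's outputs on the weakly and strongly augmented inputs (function names and toy logits are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fixmatch_h(logits_weak_fixed, logits_strong, tau=0.95):
    # 1[max_y p(y|x1) > tau] * CE(p_theta(.|x2), argmax_y p(y|x1));
    # logits_weak_fixed plays the role of the stop-gradient copy theta-bar.
    p_weak = softmax(logits_weak_fixed)
    confident = p_weak.max(axis=-1) > tau
    pseudo = p_weak.argmax(axis=-1)
    ce = -np.log(softmax(logits_strong)[np.arange(len(pseudo)), pseudo])
    return np.where(confident, ce, 0.0)

logits_weak = np.array([[5.0, 0.0, 0.0],    # confident: pseudo-label kept
                        [0.1, 0.0, 0.0]])   # near-uniform: discarded
logits_strong = np.array([[2.0, 1.0, 0.0],
                          [2.0, 1.0, 0.0]])
h = fixmatch_h(logits_weak, logits_strong)
```

The indicator gives the fixed-threshold selection discussed above: only confidently pseudo-labelled points contribute, which is exactly how "easy" subpopulations end up over-represented.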

B.3 CONSISTENCY-BASED

Consistency-based methods aim to smooth the decision function of the model or to produce more stable predictions. Their objectives $H$ are not directly an expectation of $L$, but they are equivalent to one: for all the following methods, we can write the unsupervised objective $H$ such that
$$C_1 \, \mathbb{E}_{\pi_x(\tilde{x}, \tilde{y})}[L(\theta; \tilde{x}, \tilde{y})] \leq H(\theta; x) \leq C_2 \, \mathbb{E}_{\pi_x(\tilde{x}, \tilde{y})}[L(\theta; \tilde{x}, \tilde{y})], \quad 0 \leq C_1 \leq C_2.$$
Indeed, consistency-based methods minimise an unsupervised objective that is a divergence between the model's predictions and its predictions on a modified version of the input (data augmentation) or under a perturbation of the model. Using the fact that all norms are equivalent in a finite-dimensional space, such as the space of the labels, we obtain the equivalence between a consistency-based $H$ and an expectation of $L$.

VAT: The virtual adversarial training method proposed by Miyato et al. (2018) generates the most impactful perturbation $r_{adv}$ to add to $x$. The objective is to train a model robust to input perturbations. This method is closely related to the adversarial training introduced by Goodfellow et al. (2014):
$$H(\theta; x) = \mathrm{Div}(f_{\bar{\theta}}(x, \cdot), f_\theta(x + r_{adv}, \cdot)),$$
where $\mathrm{Div}$ is a non-negative function that measures the divergence between two distributions, for instance the cross-entropy or the KL divergence. If the divergence is the cross-entropy, it is straightforward to write the unlabelled objective as Equation 3. If it is the KL divergence, we can write the objective as $H(\theta; x) = \mathbb{E}_{\pi_x(x + r, \tilde{y})}[L(\theta; \tilde{x}, \tilde{y})] - \mathbb{E}_{\pi_x(\tilde{x}, \tilde{y})}[L(\bar{\theta}; \tilde{x}, \tilde{y})]$ with $\pi_x(\tilde{x}, \tilde{y}) = \delta_x(\tilde{x}) p_{\bar{\theta}}(\tilde{y}|\tilde{x})$. Therefore, the variations of $H$ with respect to $\theta$ are the same as those of $\mathbb{E}_{\pi_x(x + r, \tilde{y})}[L(\theta; \tilde{x}, \tilde{y})]$. VAT thus sits between consistency-based and entropy-based methods, as long as the measure of divergence is the KL divergence or the cross-entropy.
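A simplified sketch of the VAT consistency term: the actual method finds $r_{adv}$ by power iteration on the divergence, whereas this illustration perturbs in a random direction of norm $\epsilon$ (a deliberate simplification; the toy linear model and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL divergence between rows of two categorical distributions.
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# Toy linear classifier; in VAT proper, r_adv is found by power iteration.
W = rng.normal(size=(4, 3))
x = rng.normal(size=(8, 4))
eps = 0.5
r = rng.normal(size=x.shape)
r_adv = eps * r / np.linalg.norm(r, axis=-1, keepdims=True)

H = kl(softmax(x @ W), softmax((x + r_adv) @ W))  # Div(f(x, .), f(x + r, .))
```

The first argument plays the role of the fixed prediction (no gradient propagated through it in the actual method); the term is non-negative and vanishes when the model is locally constant.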
Mean-Teacher: A different form of pseudo-labelling is the Mean-Teacher approach proposed by Tarvainen & Valpola (2017), where pseudo-labels are generated by a teacher model for a student model. The parameters of the student model are updated by gradient descent, while the teacher's are a moving average of the student's parameters over the previous training steps. The idea is to obtain more stable pseudo-labels from the teacher than in classic Pseudo-label. Final predictions are made by the student model. A generic form of the unsupervised part of the risk estimator is then
$$H(\theta; x) = \sum_y (p_\theta(y|x) - p_{\bar{\theta}}(y|x))^2,$$
where $\bar{\theta}$ are the fixed parameters of the teacher.
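The teacher update and the consistency term can be sketched as follows (decay value, dictionary layout and names are illustrative):

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    # Teacher parameters follow an exponential moving average of the student's.
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def consistency(p_student, p_teacher):
    # H(theta; x) = sum_y (p_theta(y|x) - p_teacher(y|x))^2
    return np.sum((p_student - p_teacher) ** 2, axis=-1)

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student)
h = consistency(np.array([[0.6, 0.4]]), np.array([[0.5, 0.5]]))
```

The averaging smooths the teacher's trajectory, which is precisely why its pseudo-labels are more stable than the student's own predictions.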

Π-Model

The Π-Models are intrinsically stochastic models (for example, models with dropout) encouraged to make consistent predictions across several passes of the same $x$ through the model. The SSL loss exploits the stochastic behaviour of the model $f_\theta$ and penalises different predictions for the same $x$ (Sajjadi et al., 2016). Let us denote by $f_\theta(x, \cdot)_1$ and $f_\theta(x, \cdot)_2$ two passes of $x$ through the model $f_\theta$. A generic form of the unsupervised part of the risk estimator is then
$$H(\theta; x) = \mathrm{Div}(f_\theta(x, \cdot)_1, f_\theta(x, \cdot)_2),$$
where $\mathrm{Div}$ is a measure of divergence between two distributions (often the Kullback-Leibler divergence).

Temporal ensembling: Temporal ensembling (Laine & Aila, 2017) is a form of Π-Model where we compare the current prediction of the model on the input $x$ with an accumulation of the previous passes through the model. Training is then faster, as the network is evaluated only once per input at each epoch, and the perturbation is expected to be less noisy than for Π-Models.

ICT: Interpolation consistency training (Verma et al., 2019) is an SSL method based on the mixup operation (Zhang et al., 2017). The trained model is encouraged to be consistent with predictions at interpolations. The unsupervised term of the objective is then computed on two inputs:
$$H(\theta; x_1, x_2) = \mathrm{Div}\big(f_\theta(\alpha x_1 + (1 - \alpha) x_2, \cdot), \; \alpha f_\theta(x_1, \cdot) + (1 - \alpha) f_\theta(x_2, \cdot)\big),$$
with $\alpha$ drawn from a Beta$(a, a)$ distribution. With the exact same transformation, we can show that this objective is equivalent to a form of expectation of $L$.
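A Π-Model pass can be sketched with dropout as the source of stochasticity (toy linear model; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def stochastic_pass(x, W, drop=0.5):
    # Dropout makes the model stochastic: two passes on the same x differ.
    mask = rng.random(x.shape) > drop
    return softmax((x * mask) @ W)

W = rng.normal(size=(4, 3))
x = rng.normal(size=(8, 4))
p1 = stochastic_pass(x, W)
p2 = stochastic_pass(x, W)
H = np.sum(p1 * (np.log(p1) - np.log(p2)), axis=-1)  # KL(pass 1 || pass 2)
```

Minimising this divergence pushes the model towards predictions that are invariant to its own stochasticity, the smoothing effect described above.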

C SAFE SEMI-SUPERVISED LEARNING

C.1 ON THE DEFINITION OF SAFE SEMI-SUPERVISED LEARNING

Previous works introduced the notion of safe semi-supervised learning for techniques which never reduce learning performance when introducing unlabelled data (Li & Zhou, 2014; Kawakita & Takeuchi, 2014; Li et al., 2016; Trapp et al., 2017; Guo et al., 2020). While the spirit of the definition of safe is the same across these works, there are different ways of formalising it. In the following, we list the theoretical guarantees each of these works gives on the performance of SSL methods:

• Li & Zhou (2014) prove that the accuracy of their method, namely S4VM, is never worse than the complete case in a transductive setting under the low-density assumption.
• Kawakita & Takeuchi (2014) prove that the learned model is asymptotically normal with a lower asymptotic variance, without assumptions on the data distribution, but only when $n_l \leq n_u$.
• Li et al. (2016) prove that their method builds a model which is better than the complete case for a chosen measure of performance among top-k precision, $F_\beta$ score or AUC, under either the cluster assumption, the low-density assumption or the manifold assumption.
• Trapp et al. (2017) prove that the learned sum-product network has a lower train loss than the complete case's.
• Guo et al. (2020) prove that the empirical risk of the SSL model on the labelled train set is lower than or equal to that of the complete case, with no assumption on the data distribution or the missingness mechanism. Additionally, they prove generalisation error bounds.

Some other works proposed theoretical guarantees for their SSL methods without calling them safe. In regard to the definition and the safe SSL methods listed above, we can categorise the following methods as safe. Among them, we cite the following:

• Sokolovska et al. (2008) prove that the learned model is asymptotically normal with a lower asymptotic variance, under the assumptions that the feature space is finite and $n_u \rightarrow \infty$.
• Fox-Roberts & Rosten (2014) propose an estimator of the likelihood in a generative setting and prove that it is unbiased with a lower variance than its complete case baseline, with no assumption on the data distribution.
• Loog (2015) proposes a method for which the log-likelihood of the SSL model on the labelled train set is at least that of the complete case, under the Gaussian assumption of linear discriminant analysis.
• Sakai et al. (2017) prove that their risk estimator is unbiased with a lower variance than the complete case baseline. Additionally, they prove generalisation error bounds that decrease with the number of unlabelled data, without assumptions on the data distribution but with strong assumptions on the chosen loss function.

C.2 ON THE SEMI-SUPERVISED BIAS

We provide in this appendix a further explanation of the risk induced by the SSL bias introduced in Section 2.4. The presented methods minimise a biased version of the risk under the MCAR assumption, so classical learning theory no longer applies:
$$\mathbb{E}[\hat{R}_{SSL}(\theta)] = \mathbb{E}[L(\theta; x, y)] + \lambda \mathbb{E}[H(\theta; x)] \neq R(\theta).$$
Learning over a biased estimate of the risk is not necessarily unsafe, but it is difficult to provide theoretical guarantees for such methods, even if some works try to do so with strong assumptions on the data distribution (Mey & Loog 2022, Sections 4 and 5; Zhang et al. 2021b). Previous works proposed generalisation error bounds for SSL methods under strong assumptions on the data distribution or the true model; we refer to the survey by Mey & Loog (2022). More recently, Wei et al. (2021) proved an upper bound for training deep models with the pseudo-label method under strong assumptions. Under soft assumptions, Aminian et al. (2022) provide an error bound showing that the choice of $H$ is crucial for good performance. Indeed, the unbiasedness of the risk estimate is crucial in the development of learning theory. The bias of the risk estimate may look like that of a regularisation, such as ridge regularisation. However, SSL and regularisation are intrinsically different for several reasons:

• Regularisers have a vanishing impact in the limit of infinite data, whereas SSL terms usually do not in the proposed methods, see Equation 19. A solution would be to choose $\lambda$ as a function of the number of data points and make it vanish as $n$ goes to infinity. However, in most works, the choice of $\lambda$ is independent of $n$ or $n_l$ (Oliver et al., 2018; Sohn et al., 2020).
• One of the main advantages of regularisation is to turn the learning problem into a "more convex" problem, see Shalev-Shwartz & Ben-David (2014, Chapter 13). Indeed, ridge regularisation will often turn a convex problem into a strongly convex one.
However, SSL risks turning the learning problem into a non-convex one, as previously noted by Sokolovska et al. (2008).
• The objective of a regulariser is to bias the risk towards optima with smooth decision functions, whereas entropy-based SSL leads to sharp decision functions.
• Regularisation usually does not depend on the data, whereas $H$ does in the SSL framework. An entropy bias has actually been used as a regulariser by Pereyra et al. (2017), but as entropy maximisation, which should have the exact opposite effect of the entropy minimisation SSL method introduced by Grandvalet & Bengio (2004).

D PROOF THAT $\hat{R}_{DeSSL}(\theta)$ IS UNBIASED UNDER MCAR

Theorem D.1. Under the MCAR assumption, $\hat{R}_{DeSSL}(\theta)$ is an unbiased estimator of $R(\theta)$.

As a consequence of the theorem, under the MCAR assumption, $\hat{R}_{CC}(\theta)$ is also unbiased, being the special case of $\hat{R}_{DeSSL}(\theta)$ with $\lambda = 0$.

Proof. We first recall that the DeSSL risk estimator is defined for any $\lambda$ by
$$\hat{R}_{DeSSL}(\theta) = \frac{1}{n_l} \sum_{i \in \mathcal{L}} L(\theta; x_i, y_i) + \frac{\lambda}{n_u} \sum_{i \in \mathcal{U}} H(\theta; x_i) - \frac{\lambda}{n_l} \sum_{i \in \mathcal{L}} H(\theta; x_i) = \sum_{i=1}^n \left[ \frac{r_i}{n_l} L(\theta; x_i, y_i) + \lambda \left( \frac{1 - r_i}{n_u} - \frac{r_i}{n_l} \right) H(\theta; x_i) \right].$$
By the law of total expectation, $\mathbb{E}[\hat{R}_{DeSSL}(\theta)] = \mathbb{E}_r[\mathbb{E}_{x,y}[\hat{R}_{DeSSL}(\theta)|r]]$. Under the MCAR assumption, the data $(x, y)$ and the missingness variable $r$ are independent, so $\mathbb{E}_r[\mathbb{E}_{x,y}[\hat{R}_{DeSSL}(\theta)|r]] = \mathbb{E}_r[\mathbb{E}_{x,y}[\hat{R}_{DeSSL}(\theta)]]$. We focus on $\mathbb{E}_{x,y}[\hat{R}_{DeSSL}(\theta)]$. Replacing $\hat{R}_{DeSSL}(\theta)$ by its definition and using the linearity of the expectation,
$$\mathbb{E}_{x,y}[\hat{R}_{DeSSL}(\theta)] = \frac{1}{n_l} \sum_{i \in \mathcal{L}} \mathbb{E}[L(\theta; x_i, y_i)] + \frac{\lambda}{n_u} \sum_{i \in \mathcal{U}} \mathbb{E}[H(\theta; x_i)] - \frac{\lambda}{n_l} \sum_{i \in \mathcal{L}} \mathbb{E}[H(\theta; x_i)].$$
The couples $(x_i, y_i)$ are i.i.d. samples from the same distribution, so
$$\mathbb{E}_{x,y}[\hat{R}_{DeSSL}(\theta)|r] = \mathbb{E}[L(\theta; x, y)] + \lambda \mathbb{E}[H(\theta; x)] - \lambda \mathbb{E}[H(\theta; x)] = \mathbb{E}[L(\theta; x, y)] = R(\theta).$$
Finally, $\hat{R}_{DeSSL}(\theta)$ is unbiased, since $R(\theta)$ is a constant:
$$\mathbb{E}[\hat{R}_{DeSSL}(\theta)] = \mathbb{E}_r[\mathbb{E}_{x,y}[\hat{R}_{DeSSL}(\theta)|r]] = \mathbb{E}_r[R(\theta)] = R(\theta).$$

At $\lambda_{opt}$, the variance of $\hat{R}_{DeSSL}(\theta)|r$ becomes
$$\mathbb{V}(\hat{R}_{DeSSL}(\theta)|r) = \frac{1}{n_l} \mathbb{V}(L(\theta; x, y)) \left( 1 - \frac{n_u}{n} \frac{\mathrm{Cov}(L(\theta; x, y), H(\theta; x))^2}{\mathbb{V}(H(\theta; x)) \, \mathbb{V}(L(\theta; x, y))} \right) = \left( 1 - \frac{n_u}{n} \rho_{L,H}^2 \right) \frac{1}{n_l} \mathbb{V}(L(\theta; x, y)),$$
where $\rho_{L,H} = \mathrm{Corr}(L(\theta; x, y), H(\theta; x))$.

Remark E.1. If $H$ is perfectly correlated with $L$ ($\rho_{L,H} = 1$), then the variance of the DeSSL estimator equals the variance of the estimator with no missing labels.

Remark E.2. Is it possible to estimate $\lambda_{opt}$ in practice? The data distribution $p(x, y)$ being unknown, $\lambda_{opt}$ cannot be computed directly: we need estimators of the covariance $\mathrm{Cov}(L(\theta; x, y), H(\theta; x))$ and the variance $\mathbb{V}(H(\theta; x))$ (see Equation 24). We also have to be careful not to introduce a new bias with this computation: if we compute $\lambda_{opt}$ on the training set, it becomes dependent on $x$ and $y$, and $\hat{R}_{DeSSL}(\theta)$ becomes biased. A solution is to use a validation dataset for its computation; another is to use the splitting method (Avramidis & Wilson, 1993). Moreover, computing $\lambda_{opt}$ is tiresome and time-consuming in practice, as it has to be updated for every value of $\theta$, i.e. at each gradient step. A natural plug-in estimator is
$$\hat{\lambda}_{opt} = \frac{\frac{1}{n_l} \sum_{i \in \mathcal{L}} (L(\theta; x_i, y_i) - \bar{L}(\theta))(H(\theta; x_i) - \bar{H}(\theta))}{\frac{1}{n} \sum_{i=1}^n (H(\theta; x_i) - \bar{H}(\theta))^2},$$
where $\bar{H}(\theta) = \frac{1}{n} \sum_{i=1}^n H(\theta; x_i)$ and $\bar{L}(\theta) = \frac{1}{n_l} \sum_{i \in \mathcal{L}} L(\theta; x_i, y_i)$.

Remark E.3. About the sign of $\lambda$. The theorem still has a qualitative merit when it comes to choosing $\lambda$: it tells us that the sign of $\lambda$ is positive when $H$ and $L$ are positively correlated, which will generally be the case for the examples mentioned in the article.
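Theorem D.1 and Remark E.2 can be checked on synthetic per-example losses: we draw correlated scalars standing in for L and H, apply an MCAR mask, and verify that the debiased estimate averages to the true risk and that the plug-in estimate of λ_opt is positive when L and H are positively correlated (the distributions below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

n, pi, lam, reps = 200, 0.3, 1.0, 2000
estimates = []
for _ in range(reps):
    L = rng.normal(1.0, 0.5, size=n)            # per-example losses, E[L] = 1
    H = 0.8 * L + rng.normal(0.0, 0.3, size=n)  # surrogate correlated with L
    r = rng.random(n) < pi                      # MCAR mask: label observed
    # Debiased risk estimate on this draw.
    estimates.append(L[r].mean() + lam * H[~r].mean() - lam * H[r].mean())

# Monte Carlo mean should match the true risk E[L] = 1 (Theorem D.1).
mc_mean = np.mean(estimates)

# Plug-in lambda_opt of Remark E.2, computed on one fresh draw
# (in practice this should be held-out data to avoid re-introducing bias).
L = rng.normal(1.0, 0.5, size=n)
H = 0.8 * L + rng.normal(0.0, 0.3, size=n)
r = rng.random(n) < pi
H_bar = H.mean()
lam_hat = np.mean((L[r] - L[r].mean()) * (H[r] - H_bar)) / np.mean((H - H_bar) ** 2)
```

Positive correlation between L and H yields a positive λ̂_opt, consistent with the sign discussion of Remark E.3.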
For instance, concerning the entropy minimisation technique, the following proposition shows that the log-likelihood is negatively correlated with its entropy, which justifies the choice $\lambda > 0$ in entropy minimisation.

Proposition E.4. The log-likelihood of the true distribution, $\log p(y|x)$, is negatively correlated with its entropy $H_{\tilde{y}}(p(\cdot|x)) = -\mathbb{E}_{\tilde{y} \sim p(\cdot|x)}[\log p(\tilde{y}|x)]$:
$$\mathrm{Cov}(\log p(y|x), H_{\tilde{y}}(p(\cdot|x))) \leq 0.$$

Proof.
$$\mathrm{Cov}(\log p(y|x), H_{\tilde{y}}(p(\cdot|x))) = \mathbb{E}_{x,y}[\log p(y|x) \, H_{\tilde{y}}(p(\cdot|x))] - \mathbb{E}_{x,y}[\log p(y|x)] \, \mathbb{E}_x[H_{\tilde{y}}(p(\cdot|x))]$$
$$= -\mathbb{E}_{x,y}\big[\log p(y|x) \, \mathbb{E}_{\tilde{y}|x}[\log p(\tilde{y}|x)]\big] + \mathbb{E}_{x,y}[\log p(y|x)] \, \mathbb{E}_x\big[\mathbb{E}_{\tilde{y}|x}[\log p(\tilde{y}|x)]\big].$$
By the law of total expectation, $\mathbb{E}_x[\mathbb{E}_{\tilde{y}|x}[\log p(\tilde{y}|x)]] = \mathbb{E}_{x,\tilde{y}}[\log p(\tilde{y}|x)] = \mathbb{E}_{x,y}[\log p(y|x)]$, so
$$\mathrm{Cov}(\log p(y|x), H_{\tilde{y}}(p(\cdot|x))) = \mathbb{E}_{x,y}[\log p(y|x)]^2 - \mathbb{E}_{x,y}\big[\log p(y|x) \, \mathbb{E}_{\tilde{y}|x}[\log p(\tilde{y}|x)]\big].$$
On the other hand, again by the law of total expectation and Jensen's inequality,
$$\mathbb{E}_{x,y}\big[\log p(y|x) \, \mathbb{E}_{\tilde{y}|x}[\log p(\tilde{y}|x)]\big] = \mathbb{E}_x\big[\mathbb{E}_{y|x}[\log p(y|x)]^2\big] \geq \mathbb{E}_x\big[\mathbb{E}_{y|x}[\log p(y|x)]\big]^2 = \mathbb{E}_{x,y}[\log p(y|x)]^2.$$
Finally,
$$\mathrm{Cov}(\log p(y|x), H_{\tilde{y}}(p(\cdot|x))) \leq \mathbb{E}_{x,y}[\log p(y|x)]^2 - \mathbb{E}_{x,y}[\log p(y|x)]^2 = 0.$$

Remark E.5. We can also see Pseudo-label as a form of entropy. Indeed, modulo the confidence selection on the predicted probability, the Pseudo-label objective is the Rényi min-entropy $H_\infty(x) = -\max_y \log p(y|x)$.

E.1 ON RISK ESTIMATION QUALITY

We train a CNN-13 on CIFAR-10 using only 4,000 labelled data and then split the test dataset into labelled and unlabelled data to estimate the PseudoLabel (PL) and DePseudoLabel (DePL) risks, which we compare to the oracle risk estimate computed on the whole test set. For this experiment, we split the test set 100 times into 40 labelled and 9,960 unlabelled data points in order to estimate the variance of the risk estimator.
We compute $\lambda_{opt}$ using the entire test set. This experiment illustrates that DePL is unbiased for any value of $\lambda$ (Figure 7, Left) and that its variance can be optimised in $\lambda$, reaching its minimum at $\lambda_{opt}$ (Figure 7, Right). The gradient of DeSSL is also an unbiased estimator of the risk's gradient, as is the complete case's. We compare the quality of these two estimators in the same setting: the variance of the gradient of the DePL risk is considerably lower than the complete case's (see Figure 8). Variance reduction techniques applied to gradients have shown promising results in improving optimisation algorithms (Defazio et al., 2014; Johnson & Zhang, 2013).

G PROOF OF THEOREM G.1

The calibration of a model is its capacity to predict probability estimates that are representative of the true distribution. This property is decisive in real-world applications where we need reliable predictions. A scoring rule $S$ is a function assigning a score $S(p_\theta, (x, y))$ to the predictive distribution $p_\theta(y|x)$ relative to the event $y|x \sim p(y|x)$, where $p(x, y)$ is the true distribution (see e.g. Gneiting & Raftery, 2007). A scoring rule measures both the accuracy and the quality of the predictive uncertainty, meaning that better calibration is rewarded. The expected scoring rule is defined as $S(p_\theta, p) = \mathbb{E}_p[S(p_\theta, (x, y))]$, and a proper scoring rule is one such that $S(p_\theta, p) \leq S(p, p)$ (Gneiting & Raftery, 2007). The motivation behind proper scoring rules is the following: if the true data distribution $p$ is accessible to our set of models, then the scoring rule encourages predicting $p_\theta = p$. The opposite of a proper scoring rule can then be used as a training loss that encourages calibrated predictive uncertainty: $L(\theta; x, y) = -S(p_\theta, (x, y))$. The most common losses used to train models, such as the log-likelihood, are based on proper scoring rules.

Theorem G.1.
If $S(p_\theta, (x, y)) = -L(\theta; x, y)$ is a proper scoring rule, then
$$S'(p_\theta, (x, y, r)) = -\left( \frac{rn}{n_l} L(\theta; x, y) + \lambda n \left( \frac{1-r}{n_u} - \frac{r}{n_l} \right) H(\theta; x) \right)$$
is also a proper scoring rule.

The proof follows directly from unbiasedness and the MCAR assumption. The main interpretation of this theorem is that we can expect DeSSL to be as well calibrated as the complete case.

Proof. The scoring rule considered in our SSL framework is
$$S'(p_\theta, (x, y, r)) = -\left( \frac{rn}{n_l} L(\theta; x, y) + \lambda n \left( \frac{1-r}{n_u} - \frac{r}{n_l} \right) H(\theta; x) \right),$$
and the proper scoring rule of the fully supervised problem is $S(p_\theta, (x, y)) = -L(\theta; x, y)$. Let $p$ be the true distribution of the data $(x, y, r)$. Under MCAR, $r$ is independent of $x$ and $y$, so $p(x, y, r) = p(r) p(x, y)$.

I ASYMPTOTIC NORMALITY OF DESSL

In the following, we study a modified version of the objective to simplify the proof. Let us consider the DeSSL objective
$$L'(\theta; x, y, r) = \frac{r}{\pi} L(\theta; x, y) + \lambda \left( \frac{1-r}{1-\pi} - \frac{r}{\pi} \right) H(\theta; x),$$
which has the same properties as the original one (unbiasedness, variance reduction, consistency, and generalisation error bounds). The idea is to replace $n_l$ with $\pi n$ to simplify the expressions; since $n_l$ converges to $\pi n$, the following theorem should also hold for the true DeSSL objective. We define the cross-covariance matrix between the random vectors $\nabla L(\theta; x, y)$ and $\nabla H(\theta; x)$ as $K_\theta(i, j) = \mathrm{Cov}(\nabla L(\theta; x, y)_i, \nabla H(\theta; x)_j)$.

Theorem I.1. Suppose $L$ and $H$ are smooth functions in $C^2(\Theta, \mathbb{R})$. Assume $R(\theta)$ admits a second-order Taylor expansion at $\theta^*$ with a non-singular second-order derivative $V_{\theta^*}$. Under the MCAR assumption, $\hat{\theta}_{DeSSL}$ is asymptotically normal with covariance
$$\Sigma_{DeSSL} = \frac{1}{\pi} V_{\theta^*}^{-1} \mathbb{E}\big[\nabla L(\theta^*; x, y) \nabla L(\theta^*; x, y)^T\big] V_{\theta^*}^{-1} + \frac{\lambda^2}{\pi(1-\pi)} V_{\theta^*}^{-1} \mathbb{E}\big[\nabla H(\theta^*; x) \nabla H(\theta^*; x)^T\big] V_{\theta^*}^{-1} - \frac{\lambda}{\pi} V_{\theta^*}^{-1} K_{\theta^*} V_{\theta^*}^{-1}.$$
As a consequence, we can minimise the trace of the covariance: $\mathrm{Tr}(\Sigma_{DeSSL})$ reaches its minimum at
$$\lambda_{opt} = (1-\pi) \, \frac{\mathrm{Tr}(V_{\theta^*}^{-1} K_{\theta^*} V_{\theta^*}^{-1})}{\mathrm{Tr}\big(V_{\theta^*}^{-1} \mathbb{E}[\nabla H(\theta^*; x) \nabla H(\theta^*; x)^T] V_{\theta^*}^{-1}\big)},$$
and at $\lambda_{opt}$,
$$\mathrm{Tr}(\Sigma_{DeSSL}) - \mathrm{Tr}(\Sigma_{CC}) = -\frac{1-\pi}{\pi} \, \frac{\mathrm{Tr}(V_{\theta^*}^{-1} K_{\theta^*} V_{\theta^*}^{-1})^2}{\mathrm{Tr}\big(V_{\theta^*}^{-1} \mathbb{E}[\nabla H(\theta^*; x) \nabla H(\theta^*; x)^T] V_{\theta^*}^{-1}\big)} \leq 0.$$
The complete case is the special case of DeSSL with $\lambda = 0$, so the theorem also holds for the complete case.

Proof. We define $L'(\theta; x, y, r) = \frac{r}{\pi} L(\theta; x, y) + \lambda \left( \frac{1-r}{1-\pi} - \frac{r}{\pi} \right) H(\theta; x)$. The assumptions of the theorem are sufficient to apply Theorem 5.23 of van der Vaart (1998) to the couple $(\hat{\theta}_{DeSSL}, L')$. Hence, we obtain the following representation for $\hat{\theta}_{DeSSL}$:
$$\sqrt{n}(\hat{\theta}_{DeSSL} - \theta^*) = \frac{1}{\sqrt{n}} V_{\theta^*}^{-1} \sum_{i=1}^n \left[ \frac{r_i}{\pi} \nabla L(\theta^*; x_i, y_i) + \lambda \left( \frac{1-r_i}{1-\pi} - \frac{r_i}{\pi} \right) \nabla H(\theta^*; x_i) \right] + o_p(1).$$
The asymptotic normality follows:
$$\sqrt{n}(\hat{\theta}_{DeSSL} - \theta^*) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma_{DeSSL}), \quad \Sigma_{DeSSL} = V_{\theta^*}^{-1} \mathbb{E}\big[\nabla L'(\theta^*; x, y, r) \nabla L'(\theta^*; x, y, r)^T\big] V_{\theta^*}^{-1},$$
and we simplify the expression of $\Sigma_{DeSSL}$ using the MCAR assumption. The asymptotic relative efficiency of $\hat{\theta}_{DeSSL}$ compared to $\hat{\theta}_{CC}$ is defined as the quotient $\mathrm{Tr}(\Sigma_{DeSSL}) / \mathrm{Tr}(\Sigma_{CC})$. This quotient can be minimised with respect to $\lambda$:
$$\lambda_{opt} = (1-\pi) \, \frac{\mathrm{Tr}(V_{\theta^*}^{-1} K_{\theta^*} V_{\theta^*}^{-1})}{\mathrm{Tr}\big(V_{\theta^*}^{-1} \mathbb{E}[\nabla H(\theta^*; x) \nabla H(\theta^*; x)^T] V_{\theta^*}^{-1}\big)},$$
and at $\lambda_{opt}$,
$$\frac{\mathrm{Tr}(\Sigma_{DeSSL})}{\mathrm{Tr}(\Sigma_{CC})} = 1 - \frac{1-\pi}{\pi} \, \frac{\mathrm{Tr}(V_{\theta^*}^{-1} K_{\theta^*} V_{\theta^*}^{-1})^2}{\mathrm{Tr}\big(V_{\theta^*}^{-1} \mathbb{E}[\nabla H(\theta^*; x) \nabla H(\theta^*; x)^T] V_{\theta^*}^{-1}\big) \, \mathrm{Tr}(\Sigma_{CC})} \leq 1.$$

Remark I.2. On the sign of $\lambda$. A sufficient condition for $\lambda_{opt} > 0$ is that $K_{\theta^*}$ is positive semi-definite. Indeed, using the fact that $V_{\theta^*}$ is positive definite and Proposition 6.1 of Serre (2010), we get $\mathrm{Tr}(V_{\theta^*}^{-1} K_{\theta^*} V_{\theta^*}^{-1}) > 0$ and therefore $\lambda_{opt} > 0$.

Remark I.3. Why minimise the trace of $\Sigma_{DeSSL}$? Minimising the trace of $\Sigma_{DeSSL}$ yields an estimator with a smaller asymptotic MSE, see Chen et al. (2020).

Remark I.4. Fully supervised setting. Our theorem matches the fully supervised setting: observing all the labels corresponds to $\pi = 1$, and we obtain $\Sigma_{DeSSL} = \Sigma_{CC} = \Sigma_{\text{Fully supervised}}$.

J PROOF OF THEOREM 3.6

Our proof is based on the following result from Shalev-Shwartz & Ben-David (2014, Theorem 26.5).

Theorem J.1. Let $\mathcal{H}$ be a set of parameters, $z \sim \mathcal{D}$ a random variable living in a space $\mathcal{Z}$, $c > 0$, and $\ell : \mathcal{H} \times \mathcal{Z} \rightarrow [-c, c]$. We denote $L_{\mathcal{D}}(h) = \mathbb{E}_z[\ell(h, z)]$ and $L_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h, z_i)$, where $z_1, \ldots, z_m$ are i.i.d. samples from $\mathcal{D}$. For any $\delta > 0$, with probability at least $1 - \delta$, we have
$$L_{\mathcal{D}}(h) \leq L_S(h) + 2 \, \mathbb{E}_{(\varepsilon_i)_{i \leq m}}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \varepsilon_i \ell(h, z_i)\right] + 4c \sqrt{\frac{2 \log(4/\delta)}{m}},$$
where $\varepsilon_1, \ldots, \varepsilon_m$ are i.i.d. Rademacher variables independent from $z_1, \ldots, z_m$.
We can now restate and prove our generalisation bound.

Theorem 3.6. We assume that both $L$ and $H$ are bounded and that the labels are MCAR. Then there exists a constant $\kappa > 0$, depending on $\lambda$, $L$, $H$ and the ratio of observed labels, such that, with probability at least $1 - \delta$, for all $\theta \in \Theta$,
$$R(\theta) \leq \hat{R}_{DeSSL}(\theta) + 2 \mathcal{R}_n + \kappa \sqrt{\frac{\log(4/\delta)}{n}},$$
where $\mathcal{R}_n$ is the Rademacher complexity
$$\mathcal{R}_n = \mathbb{E}_{(\varepsilon_i)_{i \leq n}}\left[\sup_{\theta \in \Theta} \left( \frac{1}{n\pi} \sum_{i \in \mathcal{L}} \varepsilon_i L(\theta; x_i, y_i) - \frac{\lambda}{n\pi} \sum_{i \in \mathcal{L}} \varepsilon_i H(\theta; x_i) + \frac{\lambda}{n(1-\pi)} \sum_{i \in \mathcal{U}} \varepsilon_i H(\theta; x_i) \right)\right],$$
with $\varepsilon_1, \ldots, \varepsilon_n$ i.i.d. Rademacher variables independent from the data.

Proof. We use Theorem J.1 with $z = (x, y, r)$, $\mathcal{H} = \Theta$, $m = n$, and
$$\ell(h, z) = \frac{r}{\pi} L(\theta; x, y) + \lambda \left( \frac{1-r}{1-\pi} - \frac{r}{\pi} \right) H(\theta; x).$$
The unbiasedness of our estimate under the MCAR assumption, proven in Appendix D, ensures that the condition of Equation 35 is satisfied with $L_{\mathcal{D}}(h) = R(\theta)$ and $L_S(h) = \hat{R}_{DeSSL}(\theta)$. Now, since $L$ and $H$ are bounded, there exists $M > 0$ such that $|L| < M$ and $|H| < M$. We can then bound $\ell$:
$$|\ell(h, z)| \leq \frac{M}{\pi} + \lambda \max\left( \frac{1}{1-\pi}, \frac{1}{\pi} \right) M = c.$$

L.3 MEDMNIST

We trained a 5-layer CNN with a fixed $\lambda = 1$ and $n_l$ set to 10% of the training data. We report in Table 2 the mean accuracy and cross-entropy on 5 different splits of the labelled and unlabelled data, together with the number of labelled data used; the AUC is reported in Appendix L. DePseudoLabel competes with PseudoLabel in terms of accuracy and even succeeds when PseudoLabel's accuracy falls below the complete case's. Moreover, DePseudoLabel is always better in terms of cross-entropy, and thus calibration, whereas PseudoLabel is always worse than the complete case.

As explained in the main text, estimating $\mathrm{Cov}(L(\theta; x, y), H(\theta; x))$ with few labels led to extremely unstable, unsatisfactory results. However, we test the formula on CIFAR-10 with different methods to provide intuition on the order of magnitude of $\lambda_{opt}$ and the range of the variance reduction regime (between 0 and $2\lambda_{opt}$). To do so, we estimate $\lambda_{opt}$ on the test set for CIFAR-10 by training a CNN-13 using only 4,000 labelled data for 200 epochs. The values of $\lambda_{opt}$ are 1.67, 31.16 and 0.66 for entropy minimisation, Pseudo-label and Fixmatch respectively. The reduced-variance regime therefore covers the intuitive choices of $\lambda$ in the SSL literature. Unfortunately, computing $\lambda_{opt}$ on the test set is not applicable in practice.

M.3 CIFAR-100

N FIXMATCH (SOHN ET AL., 2020)

We performed a paired Student's t-test to ensure that our performances are significantly better on CIFAR-10 in accuracy, cross-entropy and Brier score. The results show that DeFixmatch is significantly better, with p-values of 6.5×10⁻⁵ in accuracy, 3.3×10⁻⁵ in cross-entropy and 7.6×10⁻⁵ in Brier score.

N.1 PER CLASS ACCURACY

In recent work, Zhu et al. (2022) exposed the disparate effect of SSL on different classes: classes with a high complete case accuracy benefit more from SSL than classes with a low baseline accuracy. They introduced a metric called the benefit ratio (BR), which quantifies the impact of SSL on a class C:
$$BR(C) = \frac{acc_{SSL}(C) - acc_{CC}(C)}{acc_S(C) - acc_{CC}(C)},$$
where $acc_{SSL}(C)$, $acc_{CC}(C)$ and $acc_S(C)$ are respectively the accuracy on class C of an SSL-trained model, a complete-case model and a fully supervised model (a model that has access to all labels). Inspired by this work, we report the per-class accuracy and the benefit ratio in Table N.1. We see that "poor" classes such as bird, cat and dog tend to benefit much more from DeFixmatch than from Fixmatch. We compute $acc_S(C)$ using a pre-trained model with the same architecture. Zhu et al. (2022) also promote the idea that a fair SSL algorithm should benefit the different sub-classes equally, i.e. BR(C) = BR(C′) for all C, C′. While perfect equality seems unachievable in practice, we propose to look at the standard deviation of the BR across classes: Fixmatch's is 0.12, whereas DeFixmatch's is 0.06. Therefore, DeFixmatch improves the sub-populations' accuracies more equally. DeFixmatch also performs better than Fixmatch on STL-10, using 4,200 labelled data for training and 800 for validation.
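The benefit ratio is a one-line computation; the numbers below are illustrative only, not taken from the paper's tables:

```python
def benefit_ratio(acc_ssl, acc_cc, acc_sup):
    """BR(C) = (acc_SSL(C) - acc_CC(C)) / (acc_S(C) - acc_CC(C)):
    the fraction of the fully supervised model's headroom over the
    complete case that the SSL model recovers on class C."""
    return (acc_ssl - acc_cc) / (acc_sup - acc_cc)

# Illustrative values: SSL recovers 40% of the available headroom.
br = benefit_ratio(acc_ssl=0.80, acc_cc=0.70, acc_sup=0.95)
```

A BR near 1 means SSL almost matches full supervision on that class; a negative BR means SSL degraded the class below the complete case.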

N.3 FIXMATCH DETAILS

As first detailed in Appendix B, Fixmatch is a pseudo-label-based method with data augmentation: it uses weak augmentations of $x$ (flip-and-shift) for the pseudo-label selection and then minimises the likelihood of these pseudo-labels on a strongly augmented version of $x$:
$$H(\theta; x) = \mathbb{1}[\max_y p_{\bar{\theta}}(y|x_1) > \tau] \, L(\theta; x_2, \arg\max_y p_{\bar{\theta}}(y|x_1)),$$
where $x_1$ is a weak augmentation of $x$ and $x_2$ a strong augmentation. We first tried to debias an implementation of Fixmatch, but training was very unstable and led to models that were much worse than the complete case. We believe this behaviour occurs because the supervised part of the loss does not include strong augmentations: our theoretical results encourage a strong correlation between $L$ and $H$, and therefore the inclusion of strong augmentations in the supervised term. Moreover, a solid baseline for CIFAR-10 using only labelled data already integrates strong augmentations (Cubuk et al., 2020). We therefore modify the implementation (see Code) so that the supervised loss term can be written as
$$L(\theta; x, y) = \frac{1}{2} \left( \mathbb{E}_{x_1 \sim \text{weak}(x)}[-\log p_\theta(y|x_1)] + \mathbb{E}_{x_2 \sim \text{strong}(x)}[-\log p_\theta(y|x_2)] \right),$$
where $x_1$ is a weak augmentation of $x$ and $x_2$ a strong augmentation. This modification encourages us to choose $\lambda = 1/2$, as the original Fixmatch implementation used $\lambda = 1$. We also note that this modification slightly degrades the performance of Fixmatch (by less than 2%) compared to that reported by Sohn et al. (2020). However, including strong augmentations in the supervised part greatly improves the performance of the complete case.
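The modified supervised term can be sketched as follows, with logits arrays standing in for the network's outputs on a weak and a strong augmentation of the same labelled input (function names and toy values are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sup_loss(logits_weak, logits_strong, y):
    # 1/2 [ -log p(y | weak(x)) - log p(y | strong(x)) ]: averaging the
    # cross-entropy over a weak and a strong augmentation so that the
    # supervised term correlates with the strongly augmented H term.
    idx = np.arange(len(y))
    ce_w = -np.log(softmax(logits_weak)[idx, y])
    ce_s = -np.log(softmax(logits_strong)[idx, y])
    return 0.5 * (ce_w + ce_s)

y = np.array([0])
lw = np.array([[2.0, 0.0, 0.0]])
loss = sup_loss(lw, lw, y)  # with identical logits, this is the plain CE
```

When the two augmented views yield the same logits, the term reduces to the usual cross-entropy, which is why the rescaling only suggests halving λ rather than changing the method.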



Footnotes:
https://github.com/LeeDoYup/FixMatch-pytorch
https://github.com/HugoSchmutz/DeFixmatch.git
https://github.com/brain-research/realistic-ssl-evaluation



Figure 1: (Left) Data histogram. (Right) Posterior probabilities p(1|x) of the same model trained following either complete case (only labelled data), Pseudo-label or our DePseudo-label.

Figure 2: The influence of λ on Pseudo-label and DePseudo-label for a Lenet trained on MNIST with n l = 1000: (Left) Mean test accuracy; (Right) Mean test cross-entropy, with 95% CI.

Figure 4: Class accuracies (without the majority class) on DermaMNIST trained with n l = 1000 labelled data on five folds. (Left) CompleteCase (B-Acc: 26.88 ± 2.26%); (Middle) PseudoLabel (B-Acc: 22.03 ± 1.45%); (Right) DePseudoLabel (B-Acc: 28.84 ± 1.02%), with 95% CI.

Figure 5: Data histogram

Related methods have recently been proposed in the literature (Zhang et al., 2021a; Rizve et al., 2021).

Figure 7: (Left) Risk estimate value for PseudoLabel (PL) and DePseudoLabel (DePL) compared to the true value of the risk. (Right) The influence of λ on the ratio V(R̂_DePL(θ)|r)/V(R̂_CC(θ)|r).

Figure 8: The influence of λ on the ratio V(∇R̂_DePL(θ)|r)/V(∇R̂_CC(θ)|r).

S′(p_θ, p) = ∫ p(x, y, r) S′(p_θ, (x, y, r)) dx dy dr
= ∫ p(x, y) p(r) S′(p_θ, (x, y, r)) dx dy dr
= −∫ p(x, y) [ E_r[rn/n_l] L(θ; x, y) + λn E_r[(1 − r)/n_u − r/n_l] H(θ; x) ] dx dy
= −∫ p(x, y) L(θ; x, y) dx dy
= S(p_θ, p),

since, under MCAR, E_r[r] = n_l/n, so that E_r[rn/n_l] = 1 and E_r[(1 − r)/n_u − r/n_l] = 0. Therefore, if S(p_θ, (x, y)) = −L(θ; x, y) is a proper scoring rule, then S′(p_θ, (x, y, r)) = −( (rn/n_l) L(θ; x, y) + λn((1 − r)/n_u − r/n_l) H(θ; x) ) is also a proper scoring rule.
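The unbiasedness underlying this computation can be checked with a quick Monte Carlo simulation: averaging the debiased estimate over many MCAR masks r recovers the mean of L, whatever λ. The loss values below are synthetic placeholders for L(θ; x, y) and H(θ; x), not outputs of a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_l = 10_000, 3_000
n_u = n - n_l
lam = 0.7  # any lambda works: the correction term has zero mean under MCAR

# Synthetic per-example values standing in for L(theta; x, y) and H(theta; x).
L = rng.gamma(2.0, 1.0, size=n)
H = 0.5 * L + rng.normal(0.0, 0.1, size=n)

true_risk = L.mean()

estimates = []
for _ in range(500):
    r = np.zeros(n)
    r[rng.choice(n, n_l, replace=False)] = 1.0  # MCAR mask with n_l labels
    complete_case = (r * L).sum() / n_l
    debias = ((1 - r) * H).sum() / n_u - (r * H).sum() / n_l
    estimates.append(complete_case + lam * debias)

print(abs(np.mean(estimates) - true_risk))  # small: the estimator is unbiased
```

The complete-case term alone is already unbiased here; the debiasing term adds a zero-mean correction whose purpose is variance reduction when L and H are correlated.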

Figure 9: The influence of λ on Pseudo-label and DePseudo-label for a Lenet trained on MNIST with n l = 1000: (Left) Test accuracy; (Middle) Mean test cross-entropy; (Right) Mean test ECE, with 95% CI

Figure 13: The influence of λ on Pseudo-label and DePseudo-label on CIFAR-100 with n_l = 4000: (Left) Mean test accuracy; (Middle) Mean test cross-entropy; (Right) Test ECE, with 95% CI.

Test accuracy, worst class accuracy and cross-entropy of Complete Case, Fixmatch and DeFixmatch on 5 folds of CIFAR-10 and one fold of CIFAR-100. Fixmatch* is the original version of Fixmatch, which uses strong augmentations only on the unlabelled data.

Test accuracy and cross-entropy of Complete Case (CC), PseudoLabel (PL) and DePseudoLabel (DePL) on five datasets of MedMNIST.

Test AUC of Complete Case, PseudoLabel and DePseudoLabel on five datasets of MedMNIST.

Figure 12: p-values of a paired Student's t-test between PseudoLabel and DePseudoLabel. (Right) DePseudoLabel is better than PseudoLabel; (Left) DePseudoLabel is worse than PseudoLabel.

M.2 COMPUTATION OF λ_opt ON THE TEST SET

Mean accuracy per class and mean benefit ratio (BR) on 5 folds for Fixmatch, DeFixmatch and the Complete Case. Bold: classes with "poor" complete-case accuracy.

Test accuracy, worst class accuracy and cross-entropy of Complete Case, Fixmatch and DeFixmatch on 5 folds of CIFAR-10, one fold of CIFAR-100 and one fold of STL10. Weak augmentations are also used for the supervised part of the loss. In this context,

L(θ; x, y) = E_{x_1∼weak(x)}[−log p_θ(y|x_1)]

and

H(θ; x) = E_{x_1∼weak(x)} E_{x_2∼strong(x)} [ 1{max_y p_θ(y|x_1) > τ} (−log p_θ(argmax_y p_θ(y|x_1) | x_2)) ].

ACKNOWLEDGEMENTS

This work has been supported by the French government, through the 3IA Côte d'Azur, Investment in the Future, project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002. We thank Jes Frellsen for suggesting the interpretation of DeSSL as a constrained optimisation problem. We are also grateful to the OPAL infrastructure from Université Côte d'Azur for providing resources and support.


E PROOF AND COMMENTS ABOUT THEOREM 3.1

Theorem 3.1. The function λ → V(R̂_DeSSL(θ)|r) reaches its minimum for

λ_opt = (n_u/n) Cov(L(θ; x, y), H(θ; x)) / V(H(θ; x))

and

V(R̂_DeSSL(θ)|r)|_{λ=λ_opt} = V(R̂_CC(θ)|r) (1 − (n_u/n) ρ²_{L,H}),

where ρ_{L,H} = Corr(L(θ; x, y), H(θ; x)).

Proof: For any λ ∈ R, we want to compute the variance V(R̂_DeSSL(θ)|r). Under the MCAR assumption, x and y are both jointly independent of r. Also, the couples (x_i, y_i, r_i) are independent. Therefore, we have

V(R̂_DeSSL(θ)|r) = Σ_{i=1}^n V_{(x,y)∼p(x,y)}( (r_i/n_l) L(θ; x, y) + λ((1 − r_i)/n_u − r_i/n_l) H(θ; x) ).

Using the fact that the couples (x_i, y_i) are i.i.d. samples following the same distribution, each term of the sum equals

(r_i²/n_l²) V(L(θ, x, y)) + λ² ((1 − r_i)/n_u − r_i/n_l)² V(H(θ, x)) + 2λ (r_i/n_l)((1 − r_i)/n_u − r_i/n_l) Cov(L(θ, x, y), H(θ, x)).

Now, we remark that the variable r is binary and therefore r² = r, (1 − r)² = 1 − r and r(1 − r) = 0. Using that and simplifying, each term equals

(r_i/n_l²) V(L(θ, x, y)) + λ² ((1 − r_i)/n_u² + r_i/n_l²) V(H(θ, x)) − 2λ (r_i/n_l²) Cov(L(θ, x, y), H(θ, x)).

Finally, by summing and simplifying the expression (note that Σ_i r_i = n_l, Σ_i (1 − r_i) = n_u and n_l + n_u = n), we compute the variance:

V(R̂_DeSSL(θ)|r) = V(L(θ, x, y))/n_l + λ² (1/n_u + 1/n_l) V(H(θ, x)) − (2λ/n_l) Cov(L(θ, x, y), H(θ, x)).

So V(R̂_DeSSL(θ)|r) is a quadratic function of λ and reaches its minimum for λ_opt such that its derivative vanishes, i.e. λ_opt = (n_u/n) Cov(L(θ, x, y), H(θ, x))/V(H(θ, x)). Plugging λ_opt back into the quadratic, and using V(R̂_CC(θ)|r) = V(L(θ, x, y))/n_l, gives the announced expression of the variance at the optimum.

F WHY DEBIASING WITH THE LABELLED DATASET?

We remark that the debiasing can be performed with any subset of the training data, labelled and unlabelled. The choice of debiasing only with the labelled data can be explained both intuitively and computationally in regard to Theorem 3.1. Intuitively, the debiasing term penalises the confidence on the labelled datapoints and thus prevents overfitting on the training dataset. As remarked in Section 3.1, Pereyra et al. (2017) showed that penalising low-entropy models acts as a strong regulariser in supervised settings. This supports the idea of penalising low entropy on the labelled dataset, i.e. debiasing the entropy minimisation with the labelled dataset. Considering pseudo-label-based methods, the objective for the labelled data is to predict the correct labels with moderate confidence.
This is also similar to the concept of plausibility inference described by Barndorff-Nielsen (1976). In regard to Theorem 3.1, we show that the optimal choice of subset for debiasing is either only the labelled data or the whole dataset, and that both are equivalent. We consider a subset A of the training set and define the weights a as follows: a_i = (1/|A|) 1{x_i ∈ A}, so that the debiasing term becomes Σ_i a_i H(θ; x_i); the resulting estimator is still unbiased under MCAR. We compute the variance of this quantity as in the proof of Theorem 3.1. Suppose that no labelled datapoints are in A. Then the covariance term of the variance is null. Hence, having no labelled datapoints in A leads to a variance increase. We also remark that debiasing with the entire dataset is equivalent to debiasing with the labelled datapoints. Indeed,

(1/n_u) Σ_i (1 − r_i) H(θ; x_i) − (1/n) Σ_i H(θ; x_i) = (n_l/n) [ (1/n_u) Σ_i (1 − r_i) H(θ; x_i) − (1/n_l) Σ_i r_i H(θ; x_i) ],

which is equivalent to debiasing with only the labelled dataset after replacing λ by λ n_l/n. At this point, we can still sample a random subset composed of l labelled and u unlabelled datapoints, so that a_i = (1/(l + u)) 1{x_i ∈ A}. We show in the following that the optimal choices for the couple (l, u) are (n_l, 0) and (n_l, n_u), i.e. only the labelled data or the whole dataset. Sampling l labelled and u unlabelled datapoints to debias the estimator, simplifying the terms in the sum of Equation 27 and summing, we obtain V(R̂_DeSSL(θ)|r) as a function of (λ, l, u), which we minimise. In λ, the minimum is attained at a multiple of Cov(L(θ, x, y), H(θ, x))/V(H(θ, x)), and plugging it back leaves a dependence on (l, u) through the factor (n_u − u + l)(l + u) and the ratio l/(n_u + l). We can easily show that (n_u − u + l)(l + u) reaches its minimum for u = 0 or u = n_u, and both values yield the same variance. Then l/(n_u + l) is an increasing function of l and reaches its maximum at l = n_l. So finally, the optimal choices for the couple (l, u) are (n_l, 0) and (n_l, n_u), and we showed that these couples are equivalent.

H PROOF OF THEOREM 3.4

Assumption 3.2: the minimum θ* of R is well-separated.
This result is a direct application of Theorem 5.7 from van der Vaart (1998, Chapter 5), which states that, under assumptions A and B for L, θ̂ = argmin R̂ is asymptotically consistent with respect to n. Assumption A remains unchanged as we have M-estimators of the same R. We now aim to prove that, under assumption B for both L and H, assumption B also holds for θ → (rn/n_l) L(θ; x, y) + λ(1 − rn/n_l) H(θ; x).

Lemma H.1. If the uniform law of large numbers holds for both L and H, then it holds for θ → (rn/n_l) L(θ; x, y) + λ(1 − rn/n_l) H(θ; x).

Proof. Suppose assumption B holds for L. Then the same result holds if we replace n with n_l, as n and n_l are coupled by the law of r: when n grows to infinity, n_l does too, and inversely. Now, suppose assumption B holds for H; the same remark applies as for L. It remains to show that the supremum of the deviation of the combined objective converges to 0 in probability. We first split the absolute value and the sup operator over the two terms; the first term converges to 0 by assumption B for L. So we now have to prove that the second term also converges to 0 in probability. Again by splitting the absolute value and the sup, and bounding each resulting term using assumption B for H, the second term converges to 0 as well. We now just have to apply the result of van der Vaart (1998, Theorem 5.7) to obtain the asymptotic consistency of θ̂ = argmin R̂_DeSSL.

Remark H.2. A sufficient condition on the function H to verify assumption B, the uniform weak law of large numbers, is to be bounded (Newey & McFadden, 1994, Lemma 2.4). For instance, the entropy H(θ; x) = −Σ_y p_θ(y|x) log p_θ(y|x) is bounded and therefore entropy minimisation is asymptotically consistent.
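As a sanity check on Theorem 3.1, the closed-form λ_opt = (n_u/n) Cov(L, H)/V(H) can be compared with a brute-force scan of the Monte Carlo variance of R̂_DeSSL over MCAR masks. This is our own small simulation with synthetic L and H values, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_l = 2_000, 500
n_u = n - n_l

L = rng.gamma(2.0, 1.0, size=n)             # stand-in for L(theta; x, y)
H = 0.8 * L + rng.normal(0.0, 0.5, size=n)  # correlated stand-in for H(theta; x)

def dessl_variance(lam, n_reps=2_000):
    """Monte Carlo variance of the DeSSL risk estimate over MCAR masks r."""
    vals = np.empty(n_reps)
    for k in range(n_reps):
        r = np.zeros(n)
        r[rng.choice(n, n_l, replace=False)] = 1.0
        vals[k] = ((r * L).sum() / n_l
                   + lam * (((1 - r) * H).sum() / n_u - (r * H).sum() / n_l))
    return vals.var()

# Closed form of Theorem 3.1, with empirical moments as stand-ins.
lam_opt = (n_u / n) * np.cov(L, H)[0, 1] / H.var(ddof=1)

grid = np.linspace(0.0, 2.0 * lam_opt, 9)  # lam_opt sits at the grid centre
variances = [dessl_variance(lam) for lam in grid]
best = grid[int(np.argmin(variances))]
print(lam_opt, best)  # the empirical minimiser lies near lam_opt
```

The variance curve is quadratic in λ, so the empirical minimiser should land on the grid point closest to λ_opt up to Monte Carlo noise.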

K DESSL WITH H APPLIED ON ALL AVAILABLE DATA

For consistency-based SSL methods, it is common to use all the available data for the consistency term:

(1/n) Σ_{i=1}^n H(θ; x_i).

With the same idea, we debias the risk estimate with the labelled data:

R̂_DeSSL(θ) = (1/n_l) Σ_i r_i L(θ; x_i, y_i) + λ ( (1/n) Σ_i H(θ; x_i) − (1/n_l) Σ_i r_i H(θ; x_i) ).

Under MCAR, this risk estimate is unbiased and the main theorems of the article hold with minor modifications. In Theorem 3.1, λ_opt is slightly different but the expression of the variance at λ_opt remains the same. The scoring rule in Theorem G.1 is different but the theorem remains the same. Theorems 3.4, 3.5 and 3.6 remain the same, with very similar proofs.

Theorem K.1. The function λ → V(R̂_DeSSL(θ)|r) reaches its minimum for

λ_opt = Cov(L(θ; x, y), H(θ; x)) / V(H(θ; x))

and

V(R̂_DeSSL(θ)|r)|_{λ=λ_opt} = V(R̂_CC(θ)|r) (1 − (n_u/n) ρ²_{L,H}),

where ρ_{L,H} = Corr(L(θ; x, y), H(θ; x)).

When H is applied to all labelled and unlabelled data, the scoring rule used in the learning process is S′(p_θ, (x, y, r)) = −( (rn/n_l) L(θ; x, y) + λ(1 − rn/n_l) H(θ; x) ), and S′ is a proper scoring rule.
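The equivalence between this variant and the formulation of the main text (up to a rescaling of λ) is easy to verify numerically: the debiasing term of this section equals n_u/n times the unlabelled-minus-labelled term used elsewhere. The H values below are synthetic, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_l = 1_000, 300
n_u = n - n_l

r = np.zeros(n)
r[rng.choice(n, n_l, replace=False)] = 1.0  # MCAR mask
H = rng.exponential(1.0, size=n)            # stand-in for H(theta; x_i)

# Debiasing term when H is averaged over all data (this appendix) ...
all_data_term = H.mean() - (r * H).sum() / n_l
# ... versus the unlabelled-minus-labelled term of the main text.
labelled_only_term = ((1 - r) * H).sum() / n_u - (r * H).sum() / n_l

# Identical up to the factor n_u / n, i.e. a rescaling of lambda.
print(np.isclose(all_data_term, (n_u / n) * labelled_only_term))  # True
```

This is why λ_opt loses its n_u/n prefactor in Theorem K.1 while the variance at the optimum is unchanged.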

O CIFAR AND SVHN: OLIVER ET AL. (2018) IMPLEMENTATION OF CONSISTENCY-BASED MODELS

In this section, we present the results on CIFAR and SVHN obtained by debiasing the implementation of Oliver et al. (2018) of Π-Model, Mean Teacher and VAT. We mimic the experiments of Oliver et al. (2018, Figure 4) with the same configuration and the exact same hyperparameters (Oliver et al., 2018, Appendices B and C). We perform early stopping independently on both cross-entropy and accuracy. As reported below, we reach almost the same results as the biased methods.

