ON THE IMPORTANCE OF CALIBRATION IN SEMI-SUPERVISED LEARNING

Anonymous

Abstract

State-of-the-art (SOTA) semi-supervised learning (SSL) methods have been highly successful in leveraging a mix of labeled and unlabeled data by combining consistency regularization and pseudo-labeling. During pseudo-labeling, the model's predictions on unlabeled data are used as training targets, and model calibration is therefore important in mitigating confirmation bias. Yet, many SOTA methods are optimized for model performance, with little focus directed at improving model calibration. In this work, we empirically demonstrate that model calibration is strongly correlated with model performance and propose to improve calibration via approximate Bayesian techniques. We introduce a family of new SSL models that optimize for calibration and demonstrate their effectiveness across the standard vision benchmarks of CIFAR-10, CIFAR-100 and ImageNet, giving up to a 16.2% improvement in test accuracy on the CIFAR-100-400-labels benchmark. Furthermore, we demonstrate their effectiveness on additional realistic and challenging problems, such as class-imbalanced datasets and an application in photonic science.

1. INTRODUCTION

While deep learning has achieved unprecedented success in recent years, its reliance on vast amounts of labeled data remains a long-standing challenge. Semi-supervised learning (SSL) aims to mitigate this by leveraging unlabeled samples in combination with a limited set of annotated data. In computer vision, two powerful techniques that have emerged are pseudo-labeling (also known as self-training) (Rosenberg et al., 2005; Xie et al., 2019b) and consistency regularization (Bachman et al., 2014; Sajjadi et al., 2016). Broadly, pseudo-labeling assigns artificial labels to unlabeled samples, which are then used to train the model, while consistency regularization enforces that random perturbations of the unlabeled inputs produce similar predictions. These two techniques are typically combined by minimizing the cross-entropy between pseudo-labels and predictions derived from differently augmented inputs, and have led to strong performance on vision benchmarks (Sohn et al., 2020; Assran et al., 2021). Intuitively, given that pseudo-labels (i.e. the model's predictions for unlabeled data) are used to drive training objectives, the calibration of the model should be of paramount importance. Model calibration (Guo et al., 2017) measures how truthfully a model's output quantifies its predictive uncertainty, i.e. it can be understood as the alignment between the model's prediction confidence and its ground-truth accuracy. In some SSL methods, the model's confidence is used as a selection metric (Lee, 2013; Sohn et al., 2020) to determine pseudo-label acceptance, further highlighting the need for proper confidence estimates. Even outside this family of methods, the cross-entropy minimization objectives common in SSL imply that models will naturally be driven to output high-confidence predictions (Grandvalet & Bengio, 2004).
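As a concrete illustration, the combined objective described above (hard pseudo-labels derived from one augmented view, cross-entropy against predictions on a differently augmented view, with a confidence threshold governing pseudo-label acceptance) can be sketched as follows. This is a minimal NumPy mock-up in the spirit of FixMatch, not a reference implementation; the function names and the threshold value `tau=0.95` are illustrative:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over class logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pseudo_label_loss(weak_logits, strong_logits, tau=0.95):
    """Unlabeled-data loss: cross-entropy between hard pseudo-labels
    (from the weakly augmented view) and predictions on the strongly
    augmented view, masked by a confidence threshold tau."""
    probs = softmax(weak_logits)                   # pseudo-label distribution
    conf = probs.max(axis=1)                       # max class probability
    targets = probs.argmax(axis=1)                 # hard pseudo-labels
    mask = conf >= tau                             # keep only confident samples
    log_p = np.log(softmax(strong_logits) + 1e-12)
    ce = -log_p[np.arange(len(targets)), targets]  # per-sample cross-entropy
    return (ce * mask).mean()                      # averaged over the batch
```

Note that an over-confident model makes the mask pass nearly every sample, so incorrect pseudo-labels keep entering the loss; this is exactly the route by which poor calibration feeds confirmation bias.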
Having high-confidence predictions is highly desirable in SSL, since we want the decision boundary to lie in low-density regions of the data manifold, i.e. away from labeled data points (Murphy, 2022). However, without proper calibration, a model can easily become over-confident. This is highly detrimental, as the model is encouraged to reinforce its mistakes, resulting in the phenomenon commonly known as confirmation bias (Arazo et al., 2019). Despite the fundamental importance of calibration in SSL, many state-of-the-art (SOTA) methods have thus far been empirically driven and optimized for performance, with little focus on techniques that specifically target improving calibration to mitigate confirmation bias. In this work, we explore the generality of the importance of calibration in SSL by focusing on two broad families of SOTA methods.

To motivate our work, we first empirically show that strong baselines like FixMatch (Sohn et al., 2020) and PAWS (Assran et al., 2021), one from each of the two families, employ a set of indirect techniques to implicitly maintain calibration, and that achieving good calibration is strongly correlated with improved performance. Furthermore, we demonstrate that it is not straightforward to control calibration via such indirect techniques. To remedy this issue, we propose techniques directed at explicitly improving calibration by using approximate Bayesian methods designed to capture model uncertainty. In particular, we explore approximate Bayesian neural networks (Blundell et al., 2015) and weight-ensembling approaches (Izmailov et al., 2018). Our modifications form a new family of SSL methods that improve upon the SOTA on both standard benchmarks and real-world applications. Our contributions are summarized as follows:

1. Using SOTA SSL methods as case studies, we empirically show that maintaining good calibration is strongly correlated with better model performance in SSL.
2. We propose to use approximate Bayesian techniques to directly improve calibration, and provide theoretical results on generalization bounds for SSL to motivate our approach.
3. We introduce a new family of methods, BAM-, that improves calibration via BAyesian Model averaging (see Fig. 1) and demonstrate their improvements over a variety of SOTA SSL methods on standard benchmarks, notably giving up to 16.2% gains in test accuracy.
4. We further explore weight-averaging techniques, one of which (i.e. EMA) is well established in SSL, and show that their effectiveness can be understood as improving pseudo-label calibration.
5. We further demonstrate the efficacy of BAM- in more challenging and realistic scenarios, such as class-imbalanced datasets and a real-world application in photonic science.

Figure 1: Illustration of our BAyesian Model averaging (BAM) approach

2. RELATED WORK

Semi-supervised learning (SSL) and confirmation bias. A fundamental problem in SSL methods based on pseudo-labeling (Rosenberg et al., 2005) is confirmation bias (Tarvainen & Valpola, 2017; Murphy, 2022), i.e. the phenomenon where a model overfits to incorrect pseudo-labels. Several strategies have emerged to tackle this problem: Guo et al. (2020) and Ren et al. (2020) looked into weighting unlabeled samples; Thulasidasan et al. (2019) and Arazo et al. (2019) propose augmentation strategies like MixUp (Zhang et al., 2017); while Cascante-Bonilla et al. (2020) propose to re-initialize the model before every iteration to overcome confirmation bias. Another popular technique is to impose a selection metric (Yarowsky, 1995) to retain only the highest-quality pseudo-labels, commonly realized via a fixed threshold on the maximum class probability (Xie et al., 2019a; Sohn et al., 2020). Recent works have further extended such selection metrics to use dynamic thresholds, either in time (Xu et al., 2021) or class-wise (Zou et al., 2018; Zhang et al., 2021). Different from the above approaches, our work proposes to overcome confirmation bias in SSL by directly improving the calibration of the model through approximate Bayesian techniques.
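For reference, the notion of calibration used throughout is commonly quantified with the expected calibration error (ECE) of Guo et al. (2017): the average gap between confidence and accuracy over equal-width confidence bins, weighted by how many samples fall in each bin. A minimal NumPy sketch follows; the bin count is an illustrative choice:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin samples by prediction confidence, then average the
    per-bin |accuracy - confidence| gap weighted by bin occupancy.
    A perfectly calibrated model has ECE = 0."""
    conf = probs.max(axis=1)                 # prediction confidence
    pred = probs.argmax(axis=1)              # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap       # weight by fraction of samples
    return ece
```

An over-confident model (e.g. one predicting everything at confidence 1.0 while only half its predictions are correct) would score an ECE of 0.5 under this metric.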


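As an aside, the EMA weight averaging referred to among the weight-ensembling approaches above can be sketched in a few lines. Here model weights are represented as a plain dictionary of arrays purely for illustration, and the decay value is a typical but arbitrary choice:

```python
import numpy as np

def ema_update(avg_weights, new_weights, decay=0.999):
    """One step of exponential moving averaging over model weights,
    a weight-ensembling technique widely used in SSL: the averaged
    model evolves slowly, smoothing out noisy SGD iterates."""
    return {k: decay * avg_weights[k] + (1.0 - decay) * new_weights[k]
            for k in avg_weights}
```

Called once per training step with the current model weights, this maintains a slowly moving average whose pseudo-labels tend to be better calibrated than those of any single iterate.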