CONFIDENCE ESTIMATION USING UNLABELED DATA

Abstract

Overconfidence is a common issue for deep neural networks, limiting their deployment in real-world applications. To better estimate confidence, existing methods mostly focus on fully-supervised scenarios and rely on training labels. In this paper, we propose the first confidence estimation method for a semi-supervised setting, where most training labels are unavailable. We stipulate that even with limited training labels, we can still reasonably approximate the model's confidence on unlabeled samples by inspecting the consistency of its predictions through the training process. We use training consistency as a surrogate function and propose a consistency ranking loss for confidence estimation. On both image classification and segmentation tasks, our method achieves state-of-the-art performance in confidence estimation. Furthermore, we show the benefit of the proposed method through a downstream active learning task.

1. INTRODUCTION

Besides accuracy, the confidence, measuring how certain a model is of its prediction, is also critical in real-world applications such as autonomous driving (Ding et al., 2021) and computer-aided diagnosis (Laves et al., 2019). Despite the strong prediction power of deep networks, their overconfidence is a very common issue (Guo et al., 2017; Nguyen et al., 2015; Szegedy et al., 2014). The output of a standard model, e.g., the softmax output, does not correctly reflect the confidence. The reason is that the training is optimized only with respect to the training set (Naeini et al., 2015), not the underlying distribution. Accurate confidence estimation is important in practice. In autonomous driving and computer-aided diagnosis, analyzing low-confidence samples can help identify subpopulations of events or patients that deserve extra consideration. Meanwhile, reweighting hard samples, i.e., samples on which the model has low confidence, can help improve the model's performance. Highly uncertain samples can also be used to promote model performance in active learning (Siddiqui et al., 2020; Moon et al., 2020).

Different ideas have been proposed for confidence estimation. Bayesian approaches (MacKay, 1992; Neal, 1996; Graves, 2011) rely on probabilistic interpretations of a model's output, but their high computational demand restricts their application. Monte Carlo dropout (Gal & Ghahramani, 2016) was introduced to mitigate this computational inefficiency, but it requires sampling multiple model predictions at the inference stage, which is time-consuming. Another idea is to use an ensemble of neural networks (Lakshminarayanan et al., 2017), which can still be expensive in both inference time and storage. To overcome the inefficiency issue, some recent works focus on the whole training process rather than the final model. However, most existing methods rely purely on labeled data, and thus are not well suited for a semi-supervised setting.
Indeed, confidence estimation is critically needed in the semi-supervised setting, where we have limited labels and a large amount of unlabeled data. A model trained with limited labels is sub-optimal. Confidence estimates help efficiently improve the quality of the model, and help annotate the vast majority of unlabeled data in a scalable manner (Wang et al., 2022; Sohn et al., 2020; Xu et al., 2021). This motivates us to exploit unlabeled data for confidence learning. For data without labels, our idea is to use the consistency of the predictions through the training process. An initial investigation suggests that the consistency of predictions tends to be correlated with sample confidence on both labeled and unlabeled data. Having established training consistency as an approximation of confidence, the next challenge is that the consistency can only be evaluated on data available during training. To this end, we propose to re-calibrate the model's prediction by aligning it with the consistency. In particular, we propose a novel Consistency Ranking Loss that regulates the model's output after the softmax layer so that the ranking of its confidence estimates matches the ranking of the training consistency. After the re-calibration, we expect the model's output to correctly reflect its confidence on test samples. We validate the effectiveness of the proposed Consistency Ranking Loss both theoretically and empirically. Specifically, we show the superiority of our method in real applications, such as image classification and medical image segmentation, under semi-supervised settings. We also demonstrate the benefit of our method through active learning tasks.

Related work. There are two mainstream approaches to confidence (uncertainty) estimation: confidence calibration and ordinal ranking. Confidence calibration treats confidence as the true probability of making a correct prediction and tries to estimate it directly (Platt, 2000; Guo et al., 2017; Jungo & Reyes, 2019; Zadrozny & Elkan, 2002; Naeini et al., 2015).
For any sample, the confidence estimate generated by a well-calibrated classifier should equal the likelihood of a correct prediction. Directly estimating the confidence may be challenging. Instead, many works focus on the ordinal ranking aspect (Geifman et al., 2018; Geifman & El-Yaniv, 2017; Mandelbaum & Weinshall, 2017; Moon et al., 2020; Lakshminarayanan et al., 2017). Regardless of the actual estimated confidence values, the ranking of samples with respect to the confidence level should be consistent with the chance of correct prediction. A model with well-ranked confidence estimates can be widely used in selective classification, active learning, and semi-supervised learning (Siddiqui et al., 2020; Yoo & Kweon, 2019; Sener & Savarese, 2018; Tarvainen & Valpola, 2017; Zhai et al., 2019; Miyato et al., 2018; Xie et al., 2020; Chen et al., 2021; Li & Yin, 2020). Semi-supervised active learning methods (Gao et al., 2020; Huang et al., 2021) only focus on finding high-uncertainty samples during training, and thus cannot estimate uncertainty for unseen samples.
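The full definition of the Consistency Ranking Loss appears later in the paper; as a rough illustration of the ordinal-ranking idea discussed above, the sketch below implements a generic pairwise hinge ranking penalty in NumPy. The function name `pairwise_ranking_loss` and the `margin` parameter are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def pairwise_ranking_loss(conf, consistency, margin=0.05):
    """Illustrative pairwise ranking penalty (not the paper's exact loss).

    For every pair where sample i has higher training consistency than
    sample j, penalize the model unless its confidence for i exceeds its
    confidence for j by at least `margin`.
    """
    conf = np.asarray(conf, dtype=float)
    cons = np.asarray(consistency, dtype=float)
    total, pairs = 0.0, 0
    for i in range(len(conf)):
        for j in range(len(conf)):
            if cons[i] > cons[j]:  # i should be ranked more confident than j
                total += max(0.0, margin - (conf[i] - conf[j]))
                pairs += 1
    return total / max(pairs, 1)
```

A correctly ordered pair with a sufficient confidence gap incurs zero loss; an inverted pair is penalized in proportion to the violation, pushing the post-softmax confidence ranking toward the consistency ranking.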

2. CONSISTENCY -A NEW SURROGATE OF CONFIDENCE

Our main idea is to use the training consistency, i.e., the frequency with which a training datum receives the same prediction in consecutive training epochs, as a surrogate function of the model's confidence. In this section, we first formalize the definition of training consistency. Next, we use qualitative and quantitative evidence to show that consistency can be used as a surrogate function of confidence. This justifies the usage of training consistency as supervision for model confidence estimation, which we will introduce in the next section.

Definition: training consistency. Assume a given dataset with n labeled and p unlabeled data, D = (X, Y, U). Here X = {x_1, ..., x_n}, x_i ∈ 𝒳 is the set of labeled data with corresponding labels Y = {y_1, ..., y_n}, y_i ∈ 𝒴 = {1, ..., K}. The set of unlabeled data U = {x_{n+1}, ..., x_{n+p}}, x_j ∈ 𝒳 cannot be directly used to train the model, but will be used to help estimate confidence. We assume a simple training setting where we use the labeled set (X, Y) to train a model f(x, y; W): 𝒳 × 𝒴 → [0, 1]. The method naturally generalizes to more sophisticated semi-supervised learning methods, where unlabeled data can also be used. For any datum, either labeled or unlabeled, x_i ∈ X ∪ U, we have the classification ŷ_i = arg max_{y∈𝒴} f(x_i, y; W). Note that traditionally the model output for the predicted label, f(x_i, ŷ_i; W) = max_{y∈𝒴} f(x_i, y; W), is used as the confidence. Our definition involves the training process. Denote by W^t the model weights at the t-th training epoch, t = 1, ..., T. We use ŷ_i^t = arg max_{y∈𝒴} f(x_i, y; W^t) to denote the model classification for sample x_i at the t-th epoch. For a sample x_i, we define its training consistency as the frequency of getting consistent predictions in consecutive training epochs over the whole training process (T epochs in total):

$$c_i = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbf{1}\{\hat{y}_i^t = \hat{y}_i^{t+1}\}. \qquad (1)$$

Qualitative analysis shows consistency is a good surrogate of confidence. We provide a qualitative example in Fig. 1 with the feature representations. We observe that the data further from the
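Eq. (1) can be computed directly from the per-epoch predictions. The NumPy sketch below (the function name and the (T, N) array layout are our own choices, not from the paper) counts, for each sample, the fraction of consecutive epoch pairs whose predictions agree:

```python
import numpy as np

def training_consistency(preds):
    """Compute c_i from Eq. (1).

    preds: array of shape (T, N) holding the predicted class label of each
    of N samples at each of T epochs. Returns, per sample, the fraction of
    the T-1 consecutive epoch pairs on which the prediction is unchanged.
    """
    preds = np.asarray(preds)
    agree = preds[1:] == preds[:-1]  # (T-1, N) boolean agreement matrix
    return agree.mean(axis=0)        # c_i in [0, 1] for each sample
```

Note that Eq. (1) needs no labels, which is what allows it to be evaluated on the unlabeled set U.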



Moon et al. (2020) use the frequency of correct predictions through the training process to approximate the confidence of a model on each training sample. Geifman et al. (2018) collect model snapshots over the training process to compensate for overfitting and estimate confidence.
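For contrast with training consistency, the correctness-frequency target of Moon et al. (2020) described above can be sketched as follows; unlike Eq. (1), it compares predictions against ground-truth labels, so it is only computable on labeled samples (the function name and array layout are again illustrative):

```python
import numpy as np

def correctness_frequency(preds, labels):
    """Fraction of epochs at which the prediction matches the true label
    (the target used by Moon et al. (2020)); requires labels, so it cannot
    be evaluated on unlabeled data.

    preds: (T, N) predicted class per epoch; labels: (N,) ground truth.
    """
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    return (preds == labels[None, :]).mean(axis=0)
```

The dependence on `labels` is exactly what training consistency removes, which is why consistency extends confidence supervision to the unlabeled set.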

