UNCERTAINTY CALIBRATION ERROR: A NEW METRIC FOR MULTI-CLASS CLASSIFICATION

Anonymous

Abstract

Various metrics have recently been proposed to measure uncertainty calibration of deep models for classification. However, these metrics either fail to capture miscalibration correctly or lack interpretability. We propose to use the normalized entropy as a measure of uncertainty and derive the Uncertainty Calibration Error (UCE), a comprehensible calibration metric for multi-class classification. In our experiments, we focus on uncertainty from variational Bayesian inference methods and compare UCE to established calibration errors on the task of multi-class image classification. UCE avoids several pathologies of other metrics, but does not sacrifice interpretability. It can be used for regularization to improve calibration during training without penalizing predictions with justified high confidence.

1. INTRODUCTION

Advances in deep learning have led to superior accuracy in classification tasks, making deep learning classifiers an attractive choice for safety-critical applications like autonomous driving (Chen et al., 2015) or computer-aided diagnosis (Esteva et al., 2017). However, the high accuracy of recent deep learning models alone is not sufficient for such applications. In cases where serious decisions are made upon a model's predictions, it is essential to also consider the uncertainty of these predictions. We need to know if a prediction is likely to be incorrect or if invalid input data is presented to a deep model, e.g. data that is far away from the training domain or obtained from a defective sensor. The consequences of a false decision based on an uncertain prediction can be fatal. A natural expectation is that the certainty of a prediction should be directly correlated with the quality of the prediction. In other words, predictions with high certainty are more likely to be accurate than uncertain predictions, which are more likely to be incorrect. A common misconception is the assumption that the estimated softmax likelihood can be directly used as a confidence measure for the predicted class. This expectation is dangerous in the context of critical decision-making. The estimated likelihood of models trained by minimizing the negative log-likelihood (i.e. cross entropy) is highly overconfident; that is, the estimated likelihood is considerably higher than the observed frequency of accurate predictions with that likelihood (Guo et al., 2017).

2. UNCERTAINTY ESTIMATION

In this work, we focus on uncertainty from approximately Bayesian methods. We assume a general multi-class classification task with $C$ classes. Let input $x \in \mathcal{X}$ be a random variable with corresponding label $y \in \mathcal{Y} = \{1, \ldots, C\}$. Let $f^w(x)$ be the output (logits) of a neural network with weight matrices $w$, and with model likelihood $p(y = c \mid f^w(x))$ for class $c$, taken from a probability vector $p = \sigma_{\mathrm{SM}}(f^w(x))$ obtained by passing the model output through the softmax function $\sigma_{\mathrm{SM}}(\cdot)$. From a frequentist perspective, the softmax likelihood is often interpreted as the confidence of a prediction. Throughout this paper, we follow this definition. The frequentist approach assumes a single best point estimate of the parameters (or weights) of a neural network. In frequentist inference, the weights of a deep model are obtained by maximum likelihood estimation (Bishop, 2006), and the normalized output likelihood for an unseen test input does not consider uncertainty in the weights (Kendall & Gal, 2017). Weight uncertainty (also referred to as model or epistemic uncertainty) is a considerable source of predictive uncertainty for models trained on data sets of limited size (Bishop, 2006; Kendall & Gal, 2017). Bayesian neural networks and recent advances in their approximation provide valuable mathematical tools for the quantification of model uncertainty (Gal & Ghahramani, 2016; Kingma & Welling, 2014). Instead of assuming the existence of a single best parameter set, we place distributions over the parameters and want to consider all possible parameter configurations, weighted by their posterior. More specifically, given a training data set $\mathcal{D}$ and an unseen test sample $x$ with class label $y$, we are interested in evaluating the predictive distribution $p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w$. This integral requires evaluating the posterior $p(w \mid \mathcal{D})$, which involves the intractable marginal likelihood.
A possible solution is to approximate the posterior with a simpler, tractable distribution $q(w)$ by optimization. In this work, we incorporate the following approximately Bayesian methods, which we use in our experiments to obtain weight uncertainty: Monte Carlo (MC) dropout (Gal & Ghahramani, 2016), Gaussian dropout (Wang & Manning, 2013; Kingma et al., 2015), Bayes by Backprop (Blundell et al., 2015), SWA-Gaussian (Maddox et al., 2019), and (although not Bayesian) deep ensembles (Lakshminarayanan et al., 2017). A short review of each of these methods can be found in Appendix A.2.
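The MC integration step shared by all of these methods can be sketched in a few lines. The following numpy snippet (a minimal sketch; the function names are ours, not from the paper) averages softmax outputs over several sampled predictions, here standing in for posterior weight samples or ensemble members:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predictive(logits_per_member):
    """Approximate p(y | x, D) by averaging the softmax outputs of several
    posterior samples (here: hypothetical ensemble members)."""
    return np.mean([softmax(z) for z in logits_per_member], axis=0)

# three hypothetical members disagreeing on a 3-class input
logits = [np.array([2.0, 0.1, 0.1]),
          np.array([0.1, 2.0, 0.1]),
          np.array([1.0, 1.0, 0.5])]
p = ensemble_predictive(logits)
print(p)  # a valid probability vector; disagreement raises its entropy
```

Disagreement between the samples shows up as a flatter averaged distribution, i.e. higher predictive entropy, which is exactly the signal the calibration metrics below operate on.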

3. RELATED CALIBRATION METRICS

Expected Calibration Error The expected calibration error (ECE) is one of the most popular calibration error metrics. It estimates model calibration by binning the predicted confidences $\hat{p} = \max_c p(y = c \mid x)$ into $M$ bins of equal width and comparing them to average accuracies per bin (Naeini et al., 2015; Guo et al., 2017): $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$, with number of test samples $n$, and $\mathrm{acc}(B)$ and $\mathrm{conf}(B)$ denoting the accuracy and confidence of bin $B$, respectively. Several recent works have described severe pathologies of the ECE metric (Ashukha et al., 2020; Nixon et al., 2019; Kumar et al., 2019). Most notably, the ECE metric is minimized by a model constantly predicting the marginal distribution of the majority class, which makes it impossible to optimize directly (Kumar et al., 2018). Additionally, the ECE only considers the maximum class probability and ignores the remaining entries of the probability vector $p(x)$.

Adaptive Calibration Error Nixon et al. (2019) proposed the adaptive calibration error (ACE) to address the issue of fixed bin widths in ECE-like metrics. For models with high accuracy or overconfidence, most predictions fall into the rightmost bins, whereas only very few fall into the remaining bins. ACE spaces the bins such that an equal number of predictions contributes to each bin. The final ACE is computed by averaging over per-class ACE values to address the issue raised by Kull et al. (2019). However, this makes the metric more sensitive to the manually selected number of bins $M$, as the number of bins effectively becomes $C \cdot M$, with number of classes $C$. With fixed bin widths, the number of samples in the sparsely populated bins is further reduced, which increases the variance of each per-bin measurement. With adaptive bins, the lower-confidence bins span a wide range of values, which increases the bias of each bin's measurement.
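For reference, the binned ECE estimator described above can be written in a few lines of numpy (a minimal sketch; `expected_calibration_error` is our name, not from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE = sum_m |B_m|/n * |acc(B_m) - conf(B_m)| with M equal-width bins."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()    # acc(B_m)
            conf = confidences[in_bin].mean()  # conf(B_m)
            ece += in_bin.sum() / n * abs(acc - conf)
    return ece

# a model that reports 0.9 confidence but is only 60% accurate is miscalibrated
conf = np.full(10, 0.9)
corr = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(expected_calibration_error(conf, corr))  # ≈ 0.3
```

Note that all ten samples fall into a single bin here, which already hints at the binning sensitivity discussed above.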
Negative Log-Likelihood Deep models for classification are usually trained by minimizing the average negative log-likelihood (NLL): $\mathrm{NLL} = \frac{1}{N} \sum_{i=1}^{N} -\log p(y = y_i \mid x_i)$. The NLL is also commonly used as a metric for measuring the calibration of uncertainty. However, the NLL is minimized by increasing the confidence $\max_c p(y = c \mid x)$, which favors overconfident models and models with higher accuracy (Ashukha et al., 2020). This metric is therefore unable to compare the calibration of models with different accuracies, and training a model by minimizing NLL does not necessarily lead to good calibration.

Brier Score

The average Brier score is another popular metric for assessing the quality of predictive uncertainty and is defined as (Brier, 1950; Lakshminarayanan et al., 2017) $\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left( \mathbf{1}(y_i = c) - p(y = c \mid x_i) \right)^2$ (3). Similarly to the NLL, the Brier score favors high probabilities for correct predictions and low probabilities for incorrect predictions. Thus, models with higher accuracy tend to show a better Brier score, which makes the metric unsuitable for comparing the quality of uncertainty of models with different accuracies.

Maximum Mean Calibration Error Common recalibration methods are applied post hoc, e.g. temperature scaling on a separate calibration set. Kumar et al. (2018) proposed the maximum mean calibration error (MMCE), a trainable surrogate for the calibration error. It is defined as $\mathrm{MMCE}^2(\tilde{\mathcal{D}}) = \sum_{i,j \in \tilde{\mathcal{D}}} \frac{\left( \mathbf{1}(\hat{y}_i = y_i) - \hat{p}_i \right) \left( \mathbf{1}(\hat{y}_j = y_j) - \hat{p}_j \right) k(\hat{p}_i, \hat{p}_j)}{m^2}$ (4). Kumar et al. (2018) claim to have addressed the issue that the ECE is unsuitable for direct optimization due to its high discontinuity in $w$. However, MMCE is also minimized by a model constantly predicting the marginal distribution of the classes. This leads to a subpar logit temperature when training with MMCE, and temperature scaling can further reduce miscalibration (Kumar et al., 2018).
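The multi-class Brier score in Eq. (3) is straightforward to compute; a minimal numpy sketch (`brier_score` is our name, not from the paper):

```python
import numpy as np

def brier_score(probs, labels):
    """Mean multi-class Brier score: average squared distance between the
    predicted probability vector and the one-hot encoded label."""
    probs = np.asarray(probs)
    onehot = np.eye(probs.shape[1])[labels]  # 1(y_i = c) as a one-hot row
    return np.mean(np.sum((onehot - probs) ** 2, axis=1))

# two binary predictions; the second one is confidently wrong
probs = np.array([[0.8, 0.2], [0.4, 0.6]])
labels = np.array([0, 0])
print(brier_score(probs, labels))  # (0.08 + 0.72) / 2 = 0.4
```

The example illustrates the point made above: the score is dominated by how close probabilities are to the one-hot targets, i.e. by accuracy.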

4. UNCERTAINTY CALIBRATION ERROR

To give an insight into our general approach to measuring the calibration of uncertainty, we first revisit the definition of perfect calibration of confidence (Guo et al., 2017) and show how this concept can be extended to calibration under our definition of uncertainty. Let $\hat{y} = \arg\max p$ be the most likely class prediction of input $x$ with confidence $\hat{p} = \max p$ and true label $y$. Then, following Guo et al. (2017), perfect calibration of confidence is defined as $\mathbb{P}\left[ \hat{y} = y \mid \hat{p} = \alpha \right] = \alpha, \ \forall \alpha \in [0, 1]$ (6). That is, the probability of a correct prediction $\hat{y} = y$ given the prediction confidence $\hat{p}$ should exactly correspond to the prediction confidence. Instead of using only the probability of the predicted class, we use the entropy of $p$ to express prediction uncertainty: $\mathrm{H}(p) = -\sum_{c=1}^{C} p^{(c)} \log p^{(c)}$. Let $q(k) := \left( \mathbb{P}[y = 1 \mid \arg\max p(x) = k], \ldots, \mathbb{P}[y = C \mid \arg\max p(x) = k] \right)$ (8) be a probability vector of true marginal class probabilities for all inputs $x$ predicted with class $k$. Consider the following example: three i.i.d. inputs $x_{1:3}$ in a binary classification task with ground truth labels $\{1, 1, 2\}$ have all been predicted with $\arg\max p(x_{1:3}) = 1$. Then, $q(1) = \left( \frac{2}{3}, \frac{1}{3} \right)$. With this, we define a model to be perfectly calibrated if $\mathrm{H}(q(k)) = \mathrm{H}(p \mid \arg\max p = k) \ \forall k \in \{1, \ldots, C\}$. From this, we derive an error metric for the calibration of uncertainty: $\mathbb{E}_p\left[ \left| \mathrm{H}(q) - \mathrm{H}(p) \right| \right]$. However, this metric and the use of the entropy as a measure of uncertainty lack interpretability, as the entropy scales with the number of classes $C$. This does not allow comparing the uncertainty or the calibration of models trained on different data sets. Therefore, we propose to use the normalized entropy to scale the values to a range between 0 and 1: $\tilde{\mathrm{H}}(p) := -\frac{1}{\log C} \sum_{c=1}^{C} p^{(c)} \log p^{(c)}, \ \tilde{\mathrm{H}} \in [0, 1]$ (11). We further increase interpretability and argue that the normalized entropy should correlate with the model error. From Eq. (6) and Eq.
(11), we define perfect calibration of uncertainty as $\mathbb{P}\left[ \hat{y} = y \mid \tilde{\mathrm{H}}(p) = \alpha \right] = \alpha, \ \forall \alpha \in [0, 1]$. That is, in a batch of inputs that are all predicted with an uncertainty of e.g. 0.2, a top-1 error of 20 % is expected. The confidence is interpreted as the probability of belonging to a particular class, which should naturally correlate with the model error of that class. This characteristic does not generally apply to entropy, and thus the question arises why entropy should correspond to the model error.

Proposition 1. The normalized entropy (uncertainty) $\tilde{\mathrm{H}}(p)$ approaches the top-1 error in the limit of the number of classes $C$ if the model $p$ is well-calibrated.

Proof. $\lim_{C \to \infty} \tilde{\mathrm{H}}(p) = (1 - \hat{p})$. The top-1 error equals $(1 - \hat{p})$ if the model is perfectly calibrated in the sense of Eq. (6). For a detailed proof, see Appendix A.1.

Thus, the normalized entropy gives us an intuitive and interpretable measure of uncertainty. If a model is perfectly calibrated, $\tilde{\mathrm{H}}$ corresponds to the top-1 error. We propose the following notion to quantify miscalibration of uncertainty: $\mathbb{E}_{\tilde{\mathrm{H}}}\left[ \left| \mathbb{P}\left( \hat{y} = y \mid \tilde{\mathrm{H}}(p) = \alpha \right) - \alpha \right| \right], \ \forall \alpha \in [0, 1]$. We refer to this as the Expected Uncertainty Calibration Error (UCE) and approximate it with $\mathrm{UCE} := \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{err}(B_m) - \mathrm{uncert}(B_m) \right|$, using the same binning scheme as in ECE estimation. The error per bin is defined as $\mathrm{err}(B_m) := \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i \neq y_i)$, where the indicator is 1 for an incorrect and 0 for a correct prediction. The uncertainty per bin is defined as $\mathrm{uncert}(B_m) := \frac{1}{|B_m|} \sum_{i \in B_m} \tilde{\mathrm{H}}(p_i)$.

Properties of UCE The proposed UCE metric solves several problems of other metrics. First, the UCE is not zero for a model constantly predicting the marginal class distribution. Estimators of metrics with this pathology (e.g. ECE, MMCE) suffer from varying bias and therefore do not allow comparing the miscalibration of different models (Ashukha et al., 2020; Vaicenavicius et al., 2019).
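The estimator defined above translates directly into code. A minimal numpy sketch of the normalized entropy and the binned UCE estimator (function names are ours, not from the paper):

```python
import numpy as np

def normalized_entropy(probs, eps=1e-12):
    """H~(p) = -1/log(C) * sum_c p_c log p_c, scaled to [0, 1]."""
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + eps), axis=-1) / np.log(probs.shape[-1])

def uce(probs, labels, n_bins=15):
    """UCE = sum_m |B_m|/n * |err(B_m) - uncert(B_m)|, binning the
    predictions by their normalized entropy (equal-width bins)."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    unc = normalized_entropy(probs)
    wrong = (probs.argmax(axis=1) != labels).astype(float)  # 1 iff incorrect
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, n = 0.0, len(labels)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (unc > lo) & (unc <= hi)
        if in_bin.any():
            total += in_bin.sum() / n * abs(wrong[in_bin].mean() - unc[in_bin].mean())
    return total

# binary toy batch: 80% accurate at p = 0.8, but entropy-uncertainty ~0.72;
# at C = 2 the gap is large (by Proposition 1 it shrinks as C grows)
probs = np.tile([0.8, 0.2], (10, 1))
labels = np.array([0] * 8 + [1] * 2)
print(uce(probs, labels))
```

The toy batch shows the binary case where uncertainty and top-1 error need not coincide even for a confidence-calibrated model; Proposition 1 only equates them in the many-class limit.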
In contrast to ACE, UCE is not highly sensitive to the number of bins and provides a consistent ranking of different models for the same classification task (see Fig. 1). Additionally, UCE can be used as a trainable regularizer in a similar manner to MMCE. During training, we compute the UCE over mini-batches $\tilde{\mathcal{D}} \subset \mathcal{D}$ and add it to the NLL training objective $\arg\min_w \sum_{\tilde{\mathcal{D}}} \mathrm{NLL}(\tilde{\mathcal{D}}, w) + \lambda\, \mathrm{UCE}(\tilde{\mathcal{D}}, w)$ (18), weighted by a factor $\lambda$. UCE is zero for an optimal model and thus does not penalize highly confident predictions of models with high accuracy, which is a major disadvantage of plain entropy regularization (Pereyra et al., 2017). Predictions with low uncertainty but a high top-1 error are penalized, whereas predictions with high accuracy are encouraged to have low uncertainty.

5. EXPERIMENTS

We evaluate the proposed uncertainty calibration error on multi-class image classification on CIFAR-10 with ResNet-34 and on CIFAR-100 with ResNet-50 (He et al., 2016; Krizhevsky & Hinton, 2009). The feature extractor of ResNet is used as implemented in PyTorch 1.6 (Paszke et al., 2019), and the last linear layer is implemented using the different Bayesian approximations from § 2. All models were trained from random initialization. We employed early stopping at the highest validation set accuracy. More details on the training procedure and a link to our source code can be found in Appendix A.3. First, we compute the accuracies, all calibration error metrics from § 3, and the UCE on the test sets of CIFAR-10/100 for all models. We investigate the effect of the number of bins in the estimators of the metrics involving binning and analyze the ranking of different models under varying softmax temperature $\tau$, where $p = \sigma_{\mathrm{SM}}(\tau^{-1} f^w(x))$. Finally, we train a ResNet on CIFAR-10/100, SVHN, and Fashion-MNIST with added calibration error regularization as in Eq. (5) and Eq. (18). We compare UCE regularization ($\lambda = 10$) to regularization with MMCE ($\lambda = 10$) and to the confidence penalty $\arg\min_w \sum_{\tilde{\mathcal{D}}} \mathrm{NLL}(\tilde{\mathcal{D}}, w) + \lambda\, \mathrm{H}(\tilde{\mathcal{D}}, w)$ with $\lambda = 0.1$, which penalizes the entropy of the probability vector $p$ of each prediction (Pereyra et al., 2017). We combine the regularization experiments with post-hoc calibration using temperature scaling (Guo et al., 2017). Additionally, we analyze the utility of the normalized entropy as a measure of uncertainty and perform rejection and out-of-distribution (OoD) detection experiments using $\tilde{\mathrm{H}}$. We define an uncertainty threshold $\tilde{\mathrm{H}}_{\max}$ and reject all predictions from the test set where $\tilde{\mathrm{H}}(p) > \tilde{\mathrm{H}}_{\max}$. A decrease in false predictions on the remaining test set is expected. To demonstrate the OoD detection ability, we provide images from CIFAR-100 to a deep model trained on CIFAR-10 (note that the two CIFAR data sets have no mutual classes).
In this experiment, we compose a batch of 100 random samples from the test set of the training domain and stepwise replace images with out-of-distribution data. In practice, it is expected that models are applied to a mix of known and unknown classes. After each step, we evaluate the mean batch uncertainty and expect the mean uncertainty to increase monotonically as a function of the fraction of OoD data.
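The rejection protocol described above can be sketched in a few lines of numpy (a minimal sketch with synthetic predictions; function names and the toy data are ours, not from the paper):

```python
import numpy as np

def normalized_entropy(probs, eps=1e-12):
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + eps), axis=-1) / np.log(probs.shape[-1])

def reject_and_score(probs, labels, h_max):
    """Reject predictions with normalized entropy above h_max; return the
    top-1 error on the retained subset and the retained fraction."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    keep = normalized_entropy(probs) <= h_max
    err = (probs[keep].argmax(axis=1) != labels[keep]).mean()
    return err, keep.mean()

# 8 confident correct predictions, 2 uncertain wrong ones (hypothetical)
probs = np.array([[0.99, 0.01]] * 8 + [[0.55, 0.45]] * 2)
labels = np.array([0] * 8 + [1] * 2)
print(reject_and_score(probs, labels, h_max=1.0))  # no rejection: error 0.2
print(reject_and_score(probs, labels, h_max=0.5))  # uncertain rejected: error 0.0
```

Lowering the threshold removes exactly the high-entropy (and here incorrect) predictions, which is the behavior expected of a well-calibrated uncertainty measure.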

5.1. RESULTS

In this section, the results of the experimental setups described above are presented and discussed. Brier score and NLL are both highly sensitive to the model accuracy, which is especially apparent on CIFAR-10. For the first three models with similar accuracy, the Brier scores differ only marginally. Thus, both the Brier score and the NLL are unsuitable for comparing the calibration of different models. Ashukha et al. (2020) propose to use the calibrated NLL at optimal temperature for model comparison. However, Figs. 5 and 6 plot the metrics over varying softmax temperature and show that the models with the highest accuracy have the lowest Brier score and NLL, regardless of the temperature. From this we deduce that neither Brier score nor NLL should be used for comparing multi-class calibration, even at optimal temperature. The remaining metrics show a consistent ranking before and after the point of optimal temperature. The metrics ECE, UCE and MMCE have a narrow region in which the optimal temperature for all models can be found. This allows comparing the calibration of models if they are all over- or underconfident. However, all metrics fail at comparing underconfident models to overconfident models (see the model ranking left and right of the optimal temperature in Fig. 6).

Comparison of Calibration Error Metrics

Fig. 1 shows the effect of the number of bins $M$ in the estimators of ECE, ACE and UCE. Both ECE and ACE are more sensitive to the number of bins and do not provide a consistent ranking of models under varying bin count. This is due to the fact that fewer bins are populated when using $\tilde{\mathrm{H}}$ as uncertainty (cf. Fig. 10 in the appendix). This can be interpreted as a possible downside of the UCE metric, as the adaptive binning scheme of ACE explicitly addresses it. However, we argue that a consistent ranking due to robustness against the bin count results in a metric that is more useful in practice.

Uncertainty Regularization

Tab. 1 shows results of ResNet-50 with SWAG trained on CIFAR-100. All regularization methods considerably reduce miscalibration compared to unregularized models. Plain entropy regularization is surprisingly effective on CIFAR-100; however, on CIFAR-10 (see Tab. 1), it increased miscalibration and is generally outperformed by MMCE and UCE regularization at optimal temperature. Therefore, when performing post-hoc temperature scaling, MMCE and UCE regularization are preferable to entropy regularization. UCE regularization can be interpreted as entropy penalization for predictions with low accuracy. As UCE is zero for an optimal model, it encourages a model to reach high accuracy.

Rejection & OoD Detection Fig. 2 (left) shows the top-1 error as a function of a decreasing uncertainty threshold $\tilde{\mathrm{H}}_{\max}$, and Fig. 2 (right) shows the mean batch uncertainty as the fraction of OoD data increases. Robust rejection of uncertain predictions and detection of OoD data based on the normalized entropy $\tilde{\mathrm{H}}(p)$ is possible, and $\tilde{\mathrm{H}}$ is generally more sensitive to OoD data than the confidence $\max p$.

6. CONCLUSION

We have proposed to measure uncertainty based on the normalized entropy. From this, we derived the uncertainty calibration error, a new metric that avoids several pathologies of existing calibration errors. In our experimental evaluation, we focused on uncertainty from approximate Bayesian methods and deep ensembles. The UCE considers more than only the class with the highest probability and is not minimized by a constant model predicting the marginal class distribution. In contrast to the Brier score and NLL, it allows comparing models with different accuracies. It is not sensitive to a varying number of bins and provides a consistent ranking of models. However, we follow the suggestion of Ashukha et al. (2020) and state that comparisons of calibration between different models should only be done at optimal softmax temperature. Regularization with UCE during training reduces miscalibration and does not penalize high accuracy or predictions with justified high confidence. UCE regularization with temperature scaling often performed best in our experiments in terms of calibration. The normalized entropy itself is a useful measure of uncertainty and allows for robust rejection of uncertain predictions and detection of OoD data. We hope to have provided a useful new metric for the reliable evaluation of uncertainty estimation. UCE is easy to implement and interpretable, as it expresses the discrepancy between the uncertainty and the model error, which increases the chance of adoption by deep learning practitioners.

7. REBUTTAL

In this section, new or updated results from the rebuttal phase are presented. We identified the main question of the reviewers as why/where our calibration metric (UCE) is beneficial compared to previous metrics (ECE, MMCE, ACE) and how normalized entropy is beneficial over vanilla entropy. We project the following main changes for the final manuscript:

• illustrate the advantage of normalized entropy over vanilla entropy
• add toy experiments to show that UCE is able to capture miscalibration where ECE, ACE and MMCE fail (see results below)
• add results from experiments on additional data sets (SVHN, Fashion-MNIST)

To make room for the following content, we propose the removal or reduction (approx. one page) of

• the review of Bayesian methods, referring to previous work instead, and
• the experiments regarding temperature (removing Table 1 and Figure 1).

Additionally, we will incorporate all minor comments of the reviewers in the final manuscript.

7.1. UNCERTAINTY FROM NORMALIZED ENTROPY VS. ENTROPY

The advantages of normalized entropy over vanilla entropy in our definition of UCE are twofold: First, the domain of the metric does not scale with the number of classes $C$, which helps comparing the calibration of models trained on different data sets. Second, the value of the metric is more interpretable: rejecting test samples where $\tilde{\mathrm{H}}(p) > 0.2$ will result in a classification error below 0.2 on the remaining samples if the model is well-calibrated (see Fig. 2). We argue that using the normalized entropy as an uncertainty measure is as interpretable as $\max p$, but avoids the pathologies of $\max p$ when used in a calibration metric.

7.2. UCE VS. OTHER CALIBRATION METRICS

UCE is more reliable than ECE and MMCE because it is based on the normalized entropy and incorporates the predictions of all classes. It is more robust than ACE, as it is significantly less sensitive to binning.

Reliable Uncertainty Detection ECE and MMCE can be minimized by models with an uninformative, constant output (Ovadia et al., 2019). Consider a data set with two classes, 60 % of class 1 and 40 % of class 2, and a degenerate model that constantly predicts the marginal probabilities $p = (0.6, 0.4)$. This leads to 60 % correctly classified samples overall and therefore to perfect calibration scores for ECE and MMCE. Class 2, however, is misclassified in 100 % of the cases, and not 60 % as expected. The miscalibration of class 2 is not reflected in ECE or MMCE. ACE is computed class-wise and is able to capture this miscalibration. UCE is based on the normalized entropy to determine uncertainty and therefore incorporates the predictions of all classes. Fig. 3 visualizes these results.
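The degenerate-model argument above can be checked numerically. The following sketch (single-bin estimators for brevity; variable names are ours, with class 1 mapped to index 0) shows that a confidence-based ECE is exactly zero for the constant marginal predictor, while the entropy-based UCE gap is large:

```python
import numpy as np

# degenerate binary model that always outputs the marginal p = (0.6, 0.4)
n = 1000
labels = np.array([0] * 600 + [1] * 400)       # 60% class 1, 40% class 2
probs = np.tile([0.6, 0.4], (n, 1))

conf = probs.max(axis=1)                       # always 0.6
correct = probs.argmax(axis=1) == labels       # true for the 600 class-1 samples
ece_single_bin = abs(correct.mean() - conf.mean())   # 0.0: ECE sees no problem

unc = -np.sum(probs * np.log(probs), axis=1) / np.log(2)  # ~0.971 per sample
err = 1.0 - correct.mean()                                # 0.4 top-1 error
uce_single_bin = abs(err - unc.mean())         # ~0.57 > 0: UCE flags it
print(ece_single_bin, uce_single_bin)
```

The model's confidence (0.6) matches its marginal accuracy, so ECE and MMCE are satisfied; the entropy-based view exposes the uninformative predictor.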

Robustness to Varying Number of Bins

From a reliable calibration metric, we would expect a constant ranking of two models in comparison, independent of varying calibration parameters. However, ACE is highly sensitive to the number of bins and produces arbitrary rankings of models. Fig. 4 highlights this by comparing two differently calibrated random models.

7.3. UCE REGULARIZATION

At optimal temperature (as suggested by Ashukha et al. (2020)), UCE and MMCE regularization considerably reduce miscalibration for all employed calibration metrics, outperforming entropy regularization, with UCE achieving the highest accuracy on CIFAR-100, SVHN and Fashion-MNIST (see Tab. 1). We want to stress that UCE, in contrast to MMCE, was not specifically designed for use as a calibration regularizer (Kumar et al., 2018).

Why UCE Regularization Works UCE regularization works best when computed class-wise (in a similar manner to ACE): $\overline{\mathrm{UCE}} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{UCE}(c)$, where $\mathrm{UCE}(c)$ is computed over the predictions of class $c$. Consider the following binary classification example: a batch with mainly samples from class 1 and few samples from class 2 is predicted entirely as class 1 with high confidence. NLL further pushes the confidence of the predictions towards 1.0, favoring overconfidence, whereas UCE is only reduced if the confidence of the overconfidently false predictions is reduced.

Table 1: Results from SWAG trained with entropy, MMCE and UCE regularization at optimal temperature (+T). We used the weighted MMCE (Kumar et al., 2018).

A.1 PROOF OF PROPOSITION 1

Proof. With Lemma 1 and $\hat{p} = \max p$, we rewrite the normalized entropy as $\tilde{\mathrm{H}}(p) = -\frac{\hat{p} \log \hat{p}}{\log C} - \frac{(1 - \hat{p}) \log \frac{1 - \hat{p}}{C - 1}}{\log C}$. Now, in the limit of the number of classes $C$, $\lim_{C \to \infty} \tilde{\mathrm{H}}(p) = \lim_{C \to \infty} -\frac{(1 - \hat{p}) \log \frac{1 - \hat{p}}{C - 1}}{\log C} = \lim_{C \to \infty} -(1 - \hat{p}) \left[ \frac{\log(1 - \hat{p})}{\log C} - \frac{\log(C - 1)}{\log C} \right] = (1 - \hat{p})$. The top-1 error equals $(1 - \hat{p})$ if the model is perfectly calibrated in the sense of Eq. (6).

Lemma 1. Given a softmax output $p$ with $C$ entries and the most likely prediction $\hat{y} = \arg\max p$ with likelihood $\hat{p} = \max p$, the remaining entries $p_{i, i \neq \hat{y}}$ are approximately uniformly distributed with probability $\frac{1 - \hat{p}}{C - 1}$.

Proof. This assumption is approximately correct (1) if $\hat{p} \to 1$ or (2) if $C \to \infty$. Let $\tilde{p}_j = p_i \ \forall i \neq \hat{y}$ and $\tilde{q}_j = \frac{1 - \hat{p}}{C - 1}$. Note that $\tilde{p}$ and $\tilde{q}$ are not proper probability distributions, as $\sum_j \tilde{p}_j = \sum_j \tilde{q}_j = (1 - \hat{p})$.
(1) Consider $\mathrm{KL}[\tilde{p} \,\|\, \tilde{q}]$ as $\hat{p}$ approaches 1: $\lim_{\hat{p} \to 1} \mathrm{KL}[\tilde{p} \,\|\, \tilde{q}] = \lim_{\hat{p} \to 1} \sum_{j=1}^{C-1} \tilde{p}_j \log \frac{\tilde{p}_j}{\tilde{q}_j} = \lim_{\hat{p} \to 1} \left[ \sum_{j=1}^{C-1} \tilde{p}_j \log \tilde{p}_j - \sum_{j=1}^{C-1} \tilde{p}_j \log \tilde{q}_j \right] = \lim_{\hat{p} \to 1} \left[ \sum_{j=1}^{C-1} \tilde{p}_j \log \tilde{p}_j - (1 - \hat{p}) \log \frac{1 - \hat{p}}{C - 1} \right] = 0$.

(2) Let $z_i$ be the logits of a model trained with L2 regularization. The magnitude of the logits $|z_i|$ cannot become arbitrarily large, and due to the normalizing nature of the softmax, $\lim_{C \to \infty} \frac{\exp z_i}{\sum_{j=1}^{C} \exp z_j} = \lim_{C \to \infty} \frac{1}{C}$. Alternatively, let $z \in A^C$ and $z' \in B^K$ be two logit vectors with $C < K$. If both models have been trained with L2 regularization, the magnitudes of the logits $|z_i|, |z'_i|$ cannot become arbitrarily large; more specifically, $A = B \subset \mathbb{R}$. Due to the normalizing nature of the softmax, $z'$ corresponds to a lower softmax temperature, and as the temperature decreases with an increasing number of classes, the softmax approaches a uniform distribution (Jang et al., 2017).

SWA-Gaussian Stochastic weight averaging (SWA) uses stochastic gradient descent steps around a local loss optimum of a trained network and averages the weights $w_{\mathrm{SWA}} = \frac{1}{T} \sum_{i=1}^{T} w_i$ of the model from each step $i$ (Izmailov et al., 2018). This explores the loss landscape, and averaging helps to find a better weight estimate than converging to a single local optimum. SWA-Gaussian (SWAG) is closely related to Bayes by Backprop (Maddox et al., 2019). It assumes a Gaussian distribution with diagonal covariance matrix as approximate variational posterior. Instead of using backpropagation to directly optimize $\mu$ and $\sigma$, it fits a Gaussian using $\mu = w_{\mathrm{SWA}}$ and $\Sigma_{\mathrm{diag}} = \mathrm{diag}(\overline{w^2} - w_{\mathrm{SWA}}^2)$, with $\overline{w^2} = \frac{1}{T} \sum_{i=1}^{T} w_i^2$. This doubles the number of parameters at test time. The approximate Gaussian posterior is then $\mathcal{N}(w_{\mathrm{SWA}}, \Sigma_{\mathrm{diag}})$, and MC integration with samples $w_i \sim \mathcal{N}(w_{\mathrm{SWA}}, \Sigma_{\mathrm{diag}})$ is used to compute the final probability vector $p(x)$.
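Proposition 1 can be checked numerically under the Lemma 1 uniformity assumption. The sketch below (`h_norm` is our name) evaluates the normalized entropy of a vector with top entry $\hat{p}$ and the remaining mass spread uniformly, for a growing number of classes:

```python
import numpy as np

def h_norm(p_hat, C):
    """Normalized entropy of a C-class vector with max entry p_hat and the
    remaining mass (1 - p_hat) spread uniformly over the other C - 1 classes."""
    rest = np.full(C - 1, (1.0 - p_hat) / (C - 1))
    vec = np.concatenate([[p_hat], rest])
    return -np.sum(vec * np.log(vec)) / np.log(C)

# H~ decreases toward 1 - p_hat = 0.3 as the number of classes grows
for C in (10, 100, 10_000, 1_000_000):
    print(C, round(h_norm(0.7, C), 4))
```

The convergence is logarithmically slow, which is consistent with the proof: the dominant term decays only like $\log(C - 1) / \log C$.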
Deep Ensembles Training multiple randomly initialized copies of a deep network by maximum posterior estimation and ensembling them to obtain multiple predictions for a single input is not a variational inference method. However, deep ensembles have been reported to produce surprisingly useful and comparatively well-calibrated uncertainty estimates in practice (Lakshminarayanan et al., 2017). Deep ensembles considerably increase the number of parameters at training and test time. We use deep ensembles as a non-Bayesian baseline for uncertainty estimation.

A.3 TRAINING SETTING

The base model implementations from PyTorch 1.6 (Paszke et al., 2019) are used and trained with the following settings:

• Adam optimizer with an initial learning rate of 3e-4, β1 = 0.9, β2 = 0.999, and a mini-batch size of 256 (Kingma & Ba, 2015)
• weight decay of 1e-6
• negative log-likelihood (cross-entropy) loss
• reduce-on-plateau learning rate scheduler (patience of 20 epochs) with a factor of 0.1
• an additional validation set (5,000 samples) randomly extracted from the training set

We used M = 15 bins where necessary. Here, all metrics provide the same ranking of models.

A.5 ADDITIONAL FIGURES

Both MMCE and UCE regularization lead to less overconfident models and reduce miscalibration (the optimal temperature is closer to τ = 1). Entropy regularization leads to underconfident models and is not as effective as MMCE and UCE regularization on CIFAR-10. MMCE and UCE regularization at optimal temperature outperform entropy regularization at optimal temperature for all metrics except the Brier score. On CIFAR-100, UCE favors fewer bins, which makes UCE less sensitive to the total number of bins. Due to the adaptive binning, ACE is highly sensitive to the bin count.



Eq. (4) is computed over a batch $\tilde{\mathcal{D}} \subset \mathcal{D}$ with batch size $m$, a matrix-valued universal kernel $k$, and $\hat{y} = \arg\max_c p(y = c \mid x)$. Trainable calibration metrics are used in joint optimization with the negative log-likelihood: $\arg\min_w \sum_{\tilde{\mathcal{D}}} \mathrm{NLL}(\tilde{\mathcal{D}}, w) + \lambda\, \mathrm{MMCE}(\tilde{\mathcal{D}}, w)$ (5).

Figure 2: (Left) Rejection results on CIFAR-10 for decreasing uncertainty threshold comparing H(p) and max p as uncertainty metric. In both cases, the top-1 error decreases strictly monotonically with decreasing threshold. (Right) Out-of-distribution detection for CIFAR-10 → CIFAR-100. The normalized entropy H(p) as measure of uncertainty can be used to robustly detect OoD data.

Figure 3: Calibration diagrams for a toy experiment with a degenerated model constantly predicting the marginal probabilities p = (0.6, 0.4) in a binary classification task. ECE and MMCE only consider max p and fail at capturing the miscalibration of class 2 with p(c = 2) = 0.4, but acc(c = 2) = 0. The red bars show the measured miscalibration. Uncertainty is given as normalized entropy. The left diagram is computed using ECE as MMCE does not involve binning.

Figure 5: Calibration error vs. softmax temperature on CIFAR-10. All metrics provide an inconsistent ranking of models over τ. The metrics ECE, UCE and MMCE have a narrow region in which the optimal temperature for all models can be found. They show a more consistent ranking before and after the point of optimal temperature. This allows comparing the calibration of models if they are all over- or underconfident. However, all metrics fail at comparing underconfident models to overconfident models. Even at optimal temperature, the Brier score and NLL fail at comparing the calibration of models with different accuracies, as these metrics are always lower for models with better accuracy.

Figure 6: Calibration error vs. softmax temperature on CIFAR-100.

Figure8: Calibration error vs. softmax temperature from SWAG trained with different regularization on CIFAR-100. In this experiment, entropy regularization without temperature scaling (τ = 1) was surprisingly effective and outperforms MMCE and UCE regularization. However, at optimal temperature both MMCE and UCE regularization outperform entropy regularization for all metrics.

Figure: Rejection results on CIFAR-100 for a decreasing uncertainty threshold, comparing $\tilde{\mathrm{H}}(p)$ and $\max p$ as uncertainty metric. In both cases, the top-1 error decreases monotonically with a decreasing threshold.

(Left) Calibration error values for two ResNet models A and B on the CIFAR-10 test set. (Right) ACE and UCE values for different Bayesian methods on the CIFAR-100 test set. ECE and ACE are very sensitive to the number of bins used in the estimator and do not yield a consistent ranking of the models. UCE is less sensitive to the bin number and ranks models consistently, allowing the comparison of different models.

Revision of Table


Classification accuracy and calibration error results for different models on CIFAR-10/100.

ACKNOWLEDGMENTS

Acknowledgments withheld.

A.2 BAYESIAN DEEP LEARNING METHODS

In the following, we briefly describe common approximately Bayesian methods which we use in our experiments to obtain weight uncertainty.

Monte Carlo Dropout One practical approximation of the posterior is variational inference with Monte Carlo (MC) dropout (Gal & Ghahramani, 2016). To determine model uncertainty, dropout variational inference is performed by training the model $f^w$ with dropout (Srivastava et al., 2014) and using dropout at test time to sample from the approximate posterior distribution by performing $N$ stochastic forward passes per test sample (Gal & Ghahramani, 2016; Kendall & Gal, 2017). This is also referred to as MC dropout. In MC dropout, the final probability vector of the predictive distribution is computed by MC integration: $p(x) = \frac{1}{N} \sum_{n=1}^{N} \sigma_{\mathrm{SM}}(f^{\hat{w}_n}(x))$, with weight samples $\hat{w}_n \sim q(w)$.

Gaussian Dropout Gaussian dropout was first proposed by Wang & Manning (2013) and linked to variational inference by Kingma et al. (2015). Dropout introduces Bernoulli noise during optimization and reduces overfitting of the training data. The resulting output $y$ of a layer with dropout is a weighted sum of Bernoulli random variables. Then, the central limit theorem states that $y$ is approximately normally distributed. Instead of sampling the weights and computing the resulting output, we can directly sample from the implicit Gaussian distribution of dropout using the reparameterization trick (Kingma et al., 2015). Gaussian dropout is a continuous approximation to Bernoulli dropout and, in comparison, better approximates the true posterior distribution and is expected to provide improved uncertainty estimates (Louizos & Welling, 2017). To obtain the final probability vector $p(x)$, we again use MC integration with $N$ stochastic forward passes. The dropout rate $p$ is now a learnable parameter and does not need to be chosen carefully by hand. In fact, $p$ could be optimized w.r.t. uncertainty calibration, scaling the variance of the implicit Gaussian of dropout. A similar approach was presented by Gal et al.
(2017) using the Concrete distribution (Maddison et al., 2016; Jang et al., 2017). However, we focus on metrics for measuring calibration and therefore fix $p$ in our subsequent experiments.

Bayes by Backprop Blundell et al. (2015) assume a Gaussian distribution with diagonal covariance matrix as variational posterior $q(w \mid \theta)$, parameterized by mean $\mu$ and standard deviation $\sigma$, where $\theta = \{\mu, \sigma\}$. A sample of the weights can be obtained by sampling from a multivariate unit Gaussian, shifting it by $\mu$, and scaling it by $\sigma$. The network is then trained by minimizing the variational objective $\mathrm{KL}\left[ q(w \mid \theta) \,\|\, p(w) \right] - \mathbb{E}_{q(w \mid \theta)}\left[ \log p(\mathcal{D} \mid w) \right]$. In the case of a zero-mean Gaussian prior, the first term can be implemented by weight decay. In contrast to Gaussian dropout, which operates on the implicit distribution of the activations, Bayes by Backprop (BBB) directly operates on the weights. This doubles the number of trainable parameters in practice. MC integration is used to obtain the final probability vector $p(x)$.
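The BBB sampling and MC integration steps can be sketched as follows (a minimal numpy sketch for a single linear layer; the softplus parameterization of $\sigma$ and all names are our assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bbb_predict(x, mu, rho, n_samples=200):
    """Sample weights with the reparameterization trick w = mu + sigma * eps,
    eps ~ N(0, I), and MC-integrate the softmax outputs."""
    sigma = np.log1p(np.exp(rho))   # softplus keeps sigma positive
    probs = []
    for _ in range(n_samples):
        w = mu + sigma * rng.standard_normal(mu.shape)  # w ~ q(w | theta)
        probs.append(softmax(x @ w))                    # stochastic forward pass
    return np.mean(probs, axis=0)

x = rng.normal(size=(1, 4))
mu = rng.normal(size=(4, 3))
p = bbb_predict(x, mu, rho=np.full((4, 3), -1.0))
print(p)  # MC estimate of the predictive distribution
```

In a real network, $\mu$ and $\rho$ would be trained by backpropagating through the reparameterized samples; here they are fixed random values purely to illustrate the sampling.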

