UNCERTAINTY CALIBRATION ERROR: A NEW METRIC FOR MULTI-CLASS CLASSIFICATION

Anonymous

Abstract

Various metrics have recently been proposed to measure uncertainty calibration of deep models for classification. However, these metrics either fail to capture miscalibration correctly or lack interpretability. We propose to use the normalized entropy as a measure of uncertainty and derive the Uncertainty Calibration Error (UCE), a comprehensible calibration metric for multi-class classification. In our experiments, we focus on uncertainty from variational Bayesian inference methods and compare UCE to established calibration errors on the task of multi-class image classification. UCE avoids several pathologies of other metrics, but does not sacrifice interpretability. It can be used for regularization to improve calibration during training without penalizing predictions with justified high confidence.

1. INTRODUCTION

Advances in deep learning have led to superior accuracy in classification tasks, making deep learning classifiers an attractive choice for safety-critical applications like autonomous driving (Chen et al., 2015) or computer-aided diagnosis (Esteva et al., 2017). However, the high accuracy of recent deep learning models alone is not sufficient for such applications. In cases where serious decisions are made upon a model's predictions, it is essential to also consider the uncertainty of these predictions. We need to know whether a prediction is likely to be incorrect or whether invalid input data is presented to a deep model, e.g. data that is far away from the training domain or obtained from a defective sensor. The consequences of a false decision based on an uncertain prediction can be fatal. A natural expectation is that the certainty of a prediction should be directly correlated with the quality of the prediction: predictions with high certainty are more likely to be accurate, whereas uncertain predictions are more likely to be incorrect.

A common misconception is the assumption that the estimated softmax likelihood can be directly used as a confidence measure for the predicted class. This expectation is dangerous in the context of critical decision-making. The estimated likelihood of models trained by minimizing the negative log-likelihood (i.e. cross entropy) is highly overconfident; that is, the estimated likelihood is considerably higher than the observed frequency of accurate predictions with that likelihood (Guo et al., 2017).

2. UNCERTAINTY ESTIMATION

In this work, we focus on uncertainty from approximately Bayesian methods. We assume a general multi-class classification task with $C$ classes. Let input $x \in \mathcal{X}$ be a random variable with corresponding label $y \in \mathcal{Y} = \{1, \ldots, C\}$. Let $f^{w}(x)$ be the output (logits) of a neural network with weight matrices $w$, with model likelihood $p(y = c \mid f^{w}(x))$ for class $c$ given by the $c$-th entry of the probability vector $p = \sigma_{\mathrm{SM}}(f^{w}(x))$, obtained by passing the model output through the softmax function $\sigma_{\mathrm{SM}}(\cdot)$. From a frequentist perspective, the softmax likelihood is often interpreted as the confidence of a prediction; throughout this paper, we follow this definition. The frequentist approach assumes a single best point estimate of the parameters (or weights) of a neural network. In frequentist inference, the weights of a deep model are obtained by maximum likelihood estimation (Bishop, 2006), and the normalized output likelihood for an unseen test input does not consider uncertainty in the weights (Kendall & Gal, 2017). Weight uncertainty (also referred to as model or epistemic uncertainty) is a considerable source of predictive uncertainty for models trained on data sets of limited size (Bishop, 2006; Kendall & Gal, 2017). Bayesian neural networks and recent advances in their approximation provide valuable mathematical tools for the quantification of model uncertainty (Gal & Ghahramani, 2016; Kingma & Welling, 2014). Instead of assuming the existence of a single best parameter set, we place distributions over the parameters and consider all possible parameter configurations, weighted by their posterior. More specifically, given a training data set $\mathcal{D}$ and an unseen test sample $x$ with class label $y$, we are interested in evaluating the predictive distribution $p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w$. This integral requires evaluating the posterior $p(w \mid \mathcal{D})$, which involves the intractable marginal likelihood.
A possible solution is to approximate the posterior with a simpler, tractable distribution $q(w)$ by optimization. In this work, we incorporate the following approximately Bayesian methods, which we use in our experiments to obtain weight uncertainty: Monte Carlo (MC) dropout (Gal & Ghahramani, 2016), Gaussian dropout (Wang & Manning, 2013; Kingma et al., 2015), Bayes by Backprop (Blundell et al., 2015), SWA-Gaussian (Maddox et al., 2019), and (although not Bayesian) deep ensembles (Lakshminarayanan et al., 2017). A short review of each of the methods can be found in Appendix A.2.
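To illustrate how such approximations are used at test time, the following sketch approximates the predictive distribution by Monte Carlo averaging of softmax outputs over stochastic forward passes, as in MC dropout. It assumes a PyTorch model containing dropout layers; the function and variable names are our own and not from the paper:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor,
                       n_samples: int = 20) -> torch.Tensor:
    """Approximate p(y | x, D) by averaging softmax outputs over
    n_samples stochastic forward passes with dropout kept active."""
    model.train()  # keep dropout layers stochastic at test time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0)  # shape: (batch_size, C)
```

Each forward pass corresponds to one sample $w \sim q(w)$; averaging the resulting probability vectors is a Monte Carlo estimate of the integral above.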

3. RELATED CALIBRATION METRICS

Expected Calibration Error The expected calibration error (ECE) is one of the most popular calibration error metrics and estimates model calibration by binning the predicted confidences $\hat{p} = \max_{c} p(y = c \mid x)$ into $M$ bins of equidistant intervals and comparing them to the average accuracy per bin (Naeini et al., 2015; Guo et al., 2017): $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$, with number of test samples $n$, and $\mathrm{acc}(B)$ and $\mathrm{conf}(B)$ denoting the accuracy and confidence of bin $B$, respectively. Several recent works have described severe pathologies of the ECE metric (Ashukha et al., 2020; Nixon et al., 2019; Kumar et al., 2019). Most notably, ECE is minimized by a model that constantly predicts the marginal distribution of the majority class, which makes it impossible to optimize directly (Kumar et al., 2018). Additionally, ECE only considers the maximum class probability and ignores the remaining entries of the probability vector $p$.

Adaptive Calibration Error Nixon et al. (2019) proposed the adaptive calibration error (ACE) to address the fixed bin widths of ECE-like metrics. For models with high accuracy or overconfidence, most predictions fall into the rightmost bins, whereas only very few predictions fall into the remaining bins. ACE spaces the bins such that an equal number of predictions contributes to each bin. The final ACE is computed by averaging over per-class ACE values to address the issue raised by Kull et al. (2019). However, this makes the metric more sensitive to the manually selected number of bins $M$, as the effective number of bins becomes $C \cdot M$, with number of classes $C$. With fixed bin widths, the number of samples in the sparsely populated bins is further reduced, which increases the variance of the measurement in each bin. With adaptive bins, the lower-confidence bins span a wide range of values, which increases the bias of each bin's measurement.
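The ECE estimator above can be sketched in a few lines of NumPy. This is a minimal illustration with equal-width bins, not the reference implementation from any of the cited works; all names are our own:

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE with M equal-width confidence bins.
    probs: (N, C) predicted probability vectors; labels: (N,) true classes."""
    confidences = probs.max(axis=1)           # \hat{p} = max_c p(y=c | x)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(labels)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # weight |B_m| / n times the accuracy-confidence gap of the bin
            ece += (in_bin.sum() / n) * abs(
                accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece
```

For example, a model that always predicts with confidence 0.9 but is correct only half the time has an ECE of 0.4; a perfectly calibrated model has an ECE of 0. ACE differs only in the binning step, replacing the equal-width edges with quantiles of the confidences so that each bin receives the same number of predictions.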
Negative Log-Likelihood Deep models for classification are usually trained by minimizing the average negative log-likelihood (NLL): $\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y = y_i \mid x_i)$. The NLL is also commonly used as a metric for measuring the calibration of uncertainty. However, the NLL is minimized by increasing the confidence $\max_{c} p(y = c \mid x)$, which favors over-confident models and models with higher accuracy (Ashukha et al., 2020). The metric is therefore unable to compare the calibration of models with different accuracies, and training a model by minimizing the NLL does not necessarily lead to good calibration.
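As an evaluation metric, the average NLL amounts to indexing out the probability assigned to each true class. A minimal sketch (the small eps guard against log(0) and all names are our additions):

```python
import numpy as np

def average_nll(probs: np.ndarray, labels: np.ndarray,
                eps: float = 1e-12) -> float:
    """Average negative log-likelihood of the true class.
    probs: (N, C) probability vectors; labels: (N,) integer class labels."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_class_probs + eps)))
```

Note how the score can always be improved by pushing the true-class probability toward 1, which is exactly why the NLL conflates calibration with accuracy and confidence.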

