CALIBRATION OF NEURAL NETWORKS USING SPLINES

Abstract

Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision making depends on the predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the main idea is to compare the respective cumulative probability distributions. From this, by approximating the empirical cumulative distribution using a differentiable function via splines, we obtain a recalibration function, which maps the network outputs to actual (calibrated) class assignment probabilities. The spline-fitting is performed using a held-out calibration set and the obtained recalibration function is evaluated on an unseen test set. We tested our method against existing calibration approaches on various image classification datasets and our spline-based recalibration approach consistently outperforms existing methods on KS error as well as other commonly used calibration measures.

1. INTRODUCTION

Despite their success, modern neural networks have been shown to be poorly calibrated (Guo et al. (2017)), which has led to a growing interest in the calibration of neural networks over the past few years (Kull et al. (2019); Kumar et al. (2019; 2018); Müller et al. (2019)). In classification problems, a classifier is said to be calibrated if the probability values it associates with the class labels match the true probabilities of correct class assignments. For instance, if an image classifier outputs a probability of 0.2 for the "horse" label on 100 test images, then approximately 20 of those images should actually be horses. It is important to ensure calibration when using classifiers for safety-critical applications such as medical image analysis and autonomous driving, where the downstream decision making depends on the predicted probabilities. An important aspect of machine learning research is the measure used to evaluate the performance of a model; in the context of calibration, this amounts to measuring the difference between two empirical probability distributions. To this end, the popular metric, Expected Calibration Error (ECE) (Naeini et al. (2015)), approximates the classwise probability distributions using histograms and takes an expected difference. This histogram approximation has the weakness that the resulting calibration error depends on the binning scheme (number of bins and bin divisions). Even though the drawbacks of ECE have been pointed out and some improvements have been proposed (Kumar et al. (2019); Nixon et al. (2019)), the histogram approximation has not been eliminated.
In this paper, we first introduce a simple, binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test (Kolmogorov (1933); Smirnov (1939)), which also provides an effective visualization of the degree of miscalibration, similar to the reliability diagram (Niculescu-Mizil & Caruana (2005)). The main idea of the KS-test is to compare the respective classwise cumulative (empirical) distributions. Furthermore, by approximating the empirical cumulative distribution using a differentiable function via splines (McKinley & Levine (1998)), we obtain an analytical recalibration function which maps the given network outputs to the actual class assignment probabilities. Such a direct mapping was previously unavailable, and the problem has been approached indirectly via learning, for example, by optimizing a (modified) cross-entropy loss (Guo et al. (2017); Mukhoti et al. (2020); Müller et al. (2019)). Similar to the existing methods (Guo et al. (2017); Kull et al. (2019)), the spline-fitting is performed using a held-out calibration set and the obtained recalibration function is evaluated on an unseen test set. We evaluated our method against existing calibration approaches on various image classification datasets, and our spline-based recalibration approach consistently outperforms existing methods on KS error, ECE, and other commonly used calibration measures. Our approach to calibration does not update the model parameters, which allows it to be applied to any trained network, and it retains the original classification accuracy in all the tested cases.
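The binning-free KS measure described above can be sketched numerically for the top-1 prediction: sort the samples by predicted score, then compare the cumulative mass of the scores against the cumulative mass of the 0/1 correctness indicators, and report the maximum absolute gap. The sketch below is illustrative and uses hypothetical function names; it is not the paper's released implementation.

```python
import numpy as np

def ks_error(scores, correct):
    """Binning-free KS calibration error for top-1 predictions (illustrative sketch).

    scores  : (n,) max-softmax score of each prediction
    correct : (n,) 1 if the top-1 prediction was right, else 0
    Returns the max gap between the two empirical cumulative distributions.
    """
    order = np.argsort(scores)
    s = scores[order]
    c = correct[order].astype(float)
    n = len(s)
    # Cumulative probability mass the model assigns vs. the mass actually observed.
    cum_scores = np.cumsum(s) / n
    cum_correct = np.cumsum(c) / n
    return float(np.max(np.abs(cum_scores - cum_correct)))

# Toy check: predictions with score 0.8 that are right 80% of the time
# should give a KS error close to zero.
rng = np.random.default_rng(0)
scores = np.full(10000, 0.8)
correct = (rng.random(10000) < 0.8).astype(int)
print(ks_error(scores, correct))  # close to 0
```

If the same predictions were always wrong, the gap would instead grow to the full score mass (0.8 here), so the measure separates calibrated from miscalibrated score distributions without any binning.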

2. NOTATION AND PRELIMINARIES

We abstract the network as a function f_θ : D → [0, 1]^K, where D ⊂ ℝ^d, and write f_θ(x) = z. Here, x may be an image, or other input datum, and z is a vector, sometimes known as the vector of logits. In this paper, the parameters θ will not be considered, and we write simply f to represent the network function. We often refer to this function as a classifier, and in theory this could be of some other type than a neural network.

In a classification problem, K is the number of classes to be distinguished, and we call the value z_k (the k-th component of the vector z) the score for the class k. If the final layer of a network is a softmax layer, then the values z_k satisfy Σ_{k=1}^{K} z_k = 1 and z_k ≥ 0. Hence, the z_k are pseudo-probabilities, though they do not necessarily have anything to do with the real probabilities of correct class assignments. Typically, the value y* = argmax_k z_k is taken as the (top-1) prediction of the network, and the corresponding score, max_k z_k, is called the confidence of the prediction. However, the term confidence does not have any mathematical meaning in this context and we deprecate its use.

We assume we are given a set of training data (x_i, y_i), i = 1, ..., n, where x_i ∈ D is an input data element, which for simplicity we call an image, and y_i ∈ K = {1, ..., K} is the so-called ground-truth label. Our method also uses two other sets of data, called calibration data and test data.

It would be desirable if the numbers z_k output by a network represented true probabilities. For this to make sense, we posit the existence of joint random variables (X, Y), where X takes values in a domain D ⊂ ℝ^d, and Y takes values in K. Further, let Z = f(X), another random variable, and Z_k = f_k(X) be its k-th component. Note that in this formulation X and Y are joint random variables, and the probability P(Y | X) is not assumed to be 1 for a single class and 0 for the others.
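The notation above can be made concrete with a minimal sketch, in which the network f is replaced by a fixed logit vector for a single input (the names `logits`, `y_star`, and `confidence` are illustrative):

```python
import numpy as np

def softmax(u):
    # Numerically stable softmax: the scores z_k are nonnegative and sum to 1.
    e = np.exp(u - np.max(u))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw network outputs for K = 3 classes
z = softmax(logits)                 # pseudo-probabilities z_k
y_star = int(np.argmax(z))          # top-1 prediction, y* = argmax_k z_k
confidence = float(np.max(z))       # max_k z_k, the (deprecated) "confidence"
```

As the text notes, `z` sums to one by construction, but nothing here guarantees that `confidence` matches the true probability that the top-1 prediction is correct; that gap is exactly what calibration measures.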
A network is said to be calibrated if for every class k, P(Y = k | Z = z) = z_k. This can be written briefly as P(k | f(x)) = f_k(x) = z_k. Thus, if the network takes input x and outputs z = f(x), then z_k represents the probability (given f(x)) that image x belongs to class k. A weaker, classwise condition requires only that P(Y = k | Z_k = z_k) = z_k for each class k. This paper uses this classwise definition of calibration in the proposed KS metric.

Calibration and accuracy of a network are different concepts. For instance, one may consider a classifier that simply outputs the class prior probabilities, ignoring the input x. Thus, if f_k(x) = z_k = P(Y = k), this classifier f is calibrated, but its accuracy is no better than that of a random predictor. Therefore, in calibrating a classifier, it is important that this is not done while sacrificing classification (for instance top-1) accuracy.
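The input-ignoring classifier mentioned above can be checked with a small simulation (illustrative, not from the paper): a classifier that always outputs the class priors is calibrated, since among all samples it assigns score z_k, a fraction z_k indeed has label k, yet its top-1 accuracy is only the majority-class rate.

```python
import numpy as np

rng = np.random.default_rng(1)
priors = np.array([0.5, 0.3, 0.2])        # true class priors P(Y = k)
y = rng.choice(3, size=100000, p=priors)  # ground-truth labels

# The input-ignoring classifier: f_k(x) = P(Y = k) for every input x.
z = np.tile(priors, (len(y), 1))

# Calibrated: the empirical frequency of each label matches the assigned score.
for k in range(3):
    print(k, priors[k], float(np.mean(y == k)))

# But top-1 accuracy is just the majority-class rate (about 0.5 here).
acc = float(np.mean(np.argmax(z, axis=1) == y))
```

This is why recalibration methods are required not to sacrifice accuracy: calibration alone is a property of the scores, not of the decision quality.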



Footnotes:
1. We consider metrics that measure classwise (top-r) calibration error (Kull et al. (2019)). Refer to section for details.
2. Open-source implementation available at https://github.com/kartikgupta-at-anu/spline-calibration




