CALIBRATION OF NEURAL NETWORKS USING SPLINES

Abstract

Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision making depends on the predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test, in which the main idea is to compare the respective cumulative probability distributions. From this, by approximating the empirical cumulative distribution with a differentiable function via splines, we obtain a recalibration function, which maps the network outputs to actual (calibrated) class assignment probabilities. The spline fitting is performed on a held-out calibration set, and the obtained recalibration function is evaluated on an unseen test set. We tested our method against existing calibration approaches on various image classification datasets, and our spline-based recalibration approach consistently outperforms existing methods on the KS error as well as other commonly used calibration measures.

1. INTRODUCTION

Despite their success, modern neural networks are known to be poorly calibrated (Guo et al., 2017), which has led to growing interest in the calibration of neural networks over the past few years (Kull et al., 2019; Kumar et al., 2018; 2019; Müller et al., 2019). In classification problems, a classifier is said to be calibrated if the probability values it associates with the class labels match the true probabilities of correct class assignment. For instance, if an image classifier outputs a probability of 0.2 for the "horse" label on 100 test images, then approximately 20 of those images should in fact be horses. Ensuring calibration is especially important when classifiers are used in safety-critical applications such as medical image analysis and autonomous driving, where downstream decision making depends on the predicted probabilities.

An important aspect of machine learning research is the measure used to evaluate the performance of a model; in the context of calibration, this amounts to measuring the difference between two empirical probability distributions. To this end, the popular metric, Expected Calibration Error (ECE) (Naeini et al., 2015), approximates the classwise probability distributions using histograms and takes an expected difference. This histogram approximation has the weakness that the resulting calibration error depends on the binning scheme (the number of bins and the bin divisions). Although the drawbacks of ECE have been pointed out and improvements have been proposed (Kumar et al., 2019; Nixon et al., 2019), the histogram approximation has not been eliminated.1

In this paper, we first introduce a simple, binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test (Kolmogorov, 1933; Smirnov, 1939), which also provides an effective visualization of the degree of miscalibration, similar to the reliability diagram (Niculescu-Mizil & Caruana, 2005).
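To make the binning dependence concrete, the following is a minimal sketch (not the paper's code) of top-label ECE with equal-width confidence bins; `n_bins` is the free parameter whose choice changes the reported error:

```python
import numpy as np

def ece(confidences, correct, n_bins):
    """Expected Calibration Error with equal-width confidence bins.

    confidences: predicted probability of the top label per sample.
    correct:     1 if the top label was right, else 0.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in bin
            avg_conf = confidences[mask].mean()
            err += (mask.sum() / n) * abs(acc - avg_conf)
    return err

# Synthetic, roughly calibrated scores: reported ECE typically shifts
# slightly as the number of bins changes.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 10_000)
correct = (rng.uniform(size=10_000) < conf).astype(float)
print(ece(conf, correct, 10), ece(conf, correct, 15))
```

Note that when all samples fall into a single bin, the per-bin average hides any within-bin miscalibration, which is one of the documented weaknesses of the histogram approximation.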
To this end, the main idea of the KS test is to compare the respective classwise cumulative (empirical) distributions. Furthermore, by approximating the empirical cumulative distribution with a differentiable function via splines (McKinley & Levine, 1998), we obtain a recalibration function, which maps the network outputs to calibrated class assignment probabilities.
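The cumulative comparison can be sketched as follows. This is an illustrative, binning-free KS-style calibration error, not the authors' released implementation: samples are sorted by score, and the statistic is the maximum gap between the cumulative predicted probability and the cumulative empirical accuracy.

```python
import numpy as np

def ks_error(scores, correct):
    """Binning-free KS-style calibration error.

    scores:  predicted probability of the top label per sample.
    correct: 1 if the top label was right, else 0.
    """
    order = np.argsort(scores)
    s = scores[order]
    c = correct[order].astype(float)
    n = len(s)
    # Normalised cumulative sums play the role of the two empirical
    # cumulative distributions being compared.
    cum_scores = np.cumsum(s) / n
    cum_correct = np.cumsum(c) / n
    return np.max(np.abs(cum_scores - cum_correct))
```

For a perfectly calibrated model the two cumulative curves coincide and the error is zero; plotting `cum_scores` against `cum_correct` gives a visualization of miscalibration analogous to a reliability diagram, without choosing any bins.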



1 We consider metrics that measure classwise (top-r) calibration error (Kull et al., 2019). Refer to section for details.

