MITIGATING BIAS IN CALIBRATION ERROR ESTIMATION

Abstract

Building reliable machine learning systems requires that we correctly understand their level of confidence. Calibration measures the degree to which a model's confidence scores reflect its actual accuracy, and most research on calibration focuses on techniques to improve an empirical estimate of calibration error, ECE_BIN. Using simulation, we show that ECE_BIN can systematically underestimate or overestimate the true calibration error depending on the nature of model miscalibration, the size of the evaluation data set, and the number of bins. Critically, ECE_BIN is more strongly biased for perfectly calibrated models. We propose a simple alternative calibration error metric, ECE_SWEEP, in which the number of bins is chosen to be as large as possible while preserving monotonicity in the calibration function. Evaluating our measure on distributions fit to neural network confidence scores on CIFAR-10, CIFAR-100, and ImageNet, we show that ECE_SWEEP produces a less biased estimator of calibration error and therefore should be used by any researcher wishing to evaluate the calibration of models trained on similar datasets.
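The bin-sweep idea can be sketched in a few lines: sort the scores, repeatedly split them into an increasing number of equal-mass bins, and stop as soon as per-bin accuracy stops being monotone. This is a minimal illustration under our own simplifying choices (equal-mass bins via `np.array_split`, a function name `ece_sweep` of our own), not the paper's reference implementation:

```python
import numpy as np

def ece_sweep(conf, correct):
    """Sweep the number of equal-mass bins upward; report the binned
    calibration error for the largest binning whose per-bin accuracy
    is still monotonically non-decreasing."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(conf)
    conf, correct = conf[order], correct[order]
    n = len(conf)
    ece = 0.0
    for b in range(1, n + 1):
        conf_bins = np.array_split(conf, b)
        acc = np.array([c.mean() for c in np.array_split(correct, b)])
        if np.any(np.diff(acc) < 0):  # monotonicity violated: stop sweeping
            break
        weights = np.array([len(c) for c in conf_bins]) / n
        gaps = np.abs(np.array([c.mean() for c in conf_bins]) - acc)
        ece = float(weights @ gaps)
    return ece
```

For example, scores of 0.1 on five incorrect examples and 0.9 on five correct ones sweep all the way to singleton bins and yield an error of 0.1.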



Figure 1: Bias in ECE_BIN for perfectly calibrated models. Simulated data from a perfectly calibrated model with confidence scores modeled on the output of a ResNet-110 trained on CIFAR-10 (He et al., 2016; Kängsepp, 2019). We show a reliability diagram for a sample of size n = 200 and the distribution of ECE_BIN scores computed across 10^6 independent simulations. Even though the model is perfectly calibrated, ECE_BIN systematically predicts large calibration errors.

Introduction

Machine learning models are increasingly deployed in high-stakes settings like self-driving cars (Caesar et al., 2020; Geiger et al., 2013; Sun et al., 2020) and medical diagnosis (Esteva et al., 2017; 2019; Gulshan et al., 2016), where a model's ability to recognize when it is likely to be incorrect is critical. Unfortunately, such models often fail in unexpected and poorly understood ways, hindering our ability to interpret and trust such systems (Azulay & Weiss, 2018; Biggio & Roli, 2018; Hendrycks & Dietterich, 2019; Recht et al., 2019; Szegedy et al., 2013). To address these issues, calibration is used to ensure that a machine learning model produces confidence scores that reflect its ground-truth likelihood of being correct (Platt et al., 1999; Zadrozny & Elkan, 2001; 2002). To obtain an estimate of the calibration error, or ECE, the standard procedure (Guo et al., 2017; Naeini et al., 2015) partitions the model's confidence scores into bins and compares the model's predicted accuracy to its empirical accuracy within each bin. We refer to this specific metric as ECE_BIN. Although recent work has pointed out that ECE_BIN is sensitive to implementation hyperparameters (Kumar et al., 2019; Nixon et al., 2019), measuring the statistical bias in ECE_BIN, i.e., the difference between the expected value of ECE_BIN and the true calibration error (TCE), has remained largely unaddressed. In this paper, we address this problem by developing techniques to measure bias in existing calibration metrics.
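The standard binning procedure just described can be sketched as follows. This is a minimal illustration assuming equal-width bins over [0, 1]; the function name `ece_bin` is our own, not a reference implementation:

```python
import numpy as np

def ece_bin(conf, correct, n_bins=15):
    """Binned calibration error: the weighted average, over equal-width
    bins, of |mean confidence - empirical accuracy| in each bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each score to a bin index; clip 1.0 into the last bin.
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    n = len(conf)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece
```

For instance, a model that always reports confidence 0.9 but is correct on only 8 of 10 examples lands in a single bin with gap |0.9 − 0.8| = 0.1, so ECE_BIN = 0.1.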
We use simulation to create a setting where the TCE can be computed analytically and thus the bias can be estimated directly. As Figure 1 highlights, we find empirically that ECE_BIN has non-negligible statistical bias and systematically predicts large errors for perfectly calibrated models.

Footnote: Naeini et al. (2015) introduce ECE as an acronym for Expected Calibration Error. However, ECE is not a proper expectation, whereas the true calibration error is computed under an expectation. To resolve this confusion, we prefer to read ECE as Estimated Calibration Error.

