MITIGATING BIAS IN CALIBRATION ERROR ESTIMATION

Abstract

Building reliable machine learning systems requires that we correctly understand their level of confidence. Calibration measures how well a model's confidence scores match its actual accuracy, and most research in calibration concentrates on techniques to improve an empirical estimate of calibration error, ECE_BIN. Using simulation, we show that ECE_BIN can systematically underestimate or overestimate the true calibration error depending on the nature of the model's miscalibration, the size of the evaluation data set, and the number of bins. Critically, ECE_BIN is more strongly biased for perfectly calibrated models. We propose a simple alternative calibration error metric, ECE_SWEEP, in which the number of bins is chosen to be as large as possible while preserving monotonicity in the calibration function. Evaluating our measure on distributions fit to neural network confidence scores on CIFAR-10, CIFAR-100, and ImageNet, we show that ECE_SWEEP produces a less biased estimator of calibration error and should therefore be used by any researcher wishing to evaluate the calibration of models trained on similar datasets.



Figure 1: Bias in ECE_BIN for perfectly calibrated models. Simulated data from a perfectly calibrated model with confidence scores modeled on ResNet-110 CIFAR-10 outputs (He et al., 2016; Kängsepp, 2019). We show a reliability diagram for a sample of size n = 200 and the distribution of ECE_BIN scores computed across 10^6 independent simulations. Even though the model is perfectly calibrated, ECE_BIN systematically reports large calibration errors.

1. INTRODUCTION

Machine learning models are increasingly deployed in high-stakes settings such as self-driving cars (Caesar et al., 2020; Geiger et al., 2013; Sun et al., 2020) and medical diagnosis (Esteva et al., 2017; 2019; Gulshan et al., 2016), where a model's ability to recognize when it is likely to be incorrect is critical. Unfortunately, such models often fail in unexpected and poorly understood ways, hindering our ability to interpret and trust them (Azulay & Weiss, 2018; Biggio & Roli, 2018; Hendrycks & Dietterich, 2019; Recht et al., 2019; Szegedy et al., 2013). To address these issues, calibration is used to ensure that a machine learning model produces confidence scores that reflect the model's ground-truth likelihood of being correct (Platt et al., 1999; Zadrozny & Elkan, 2001; 2002). To obtain an estimate of the calibration error, or ECE [1], the standard procedure (Guo et al., 2017; Naeini et al., 2015) partitions the model's confidence scores into bins and compares the model's predicted accuracy to its empirical accuracy within each bin. We refer to this specific metric as ECE_BIN. Although recent work has pointed out that ECE_BIN is sensitive to implementation hyperparameters (Kumar et al., 2019; Nixon et al., 2019), measuring the statistical bias in ECE_BIN, i.e., the difference between the expected ECE_BIN and the true calibration error (TCE), has remained largely unaddressed. In this paper, we address this problem by developing techniques to measure bias in existing calibration metrics.
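The binning procedure just described can be sketched in a few lines. This is a minimal illustration of an equal-width binned estimator, not the authors' reference implementation; the function name and defaults are ours:

```python
import numpy as np

def ece_bin(confidences, correct, n_bins=15, p=2):
    """Equal-width binned calibration error estimate (ECE_BIN sketch)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence score to a bin index in [0, n_bins - 1].
    idx = np.clip(np.digitize(confidences, bins) - 1, 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()  # predicted accuracy in the bin
        avg_acc = correct[mask].mean()       # empirical accuracy in the bin
        # Weight each bin's gap by the fraction of samples it contains.
        err += mask.mean() * abs(avg_conf - avg_acc) ** p
    return err ** (1.0 / p)
```

For example, ten predictions made with confidence 0.9 that are all correct yield an estimate of 0.1 under the ℓ2 norm.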
We use simulation to create a setting where the TCE can be computed analytically and thus the bias can be estimated directly. As Figure 1 highlights, we find empirically that ECE_BIN has non-negligible statistical bias and systematically reports large errors for perfectly calibrated models. Motivated by the monotonicity of true calibration curves arising from trained models, we develop a simple alternative metric for measuring calibration error, the monotonic sweep calibration error (ECE_SWEEP), which chooses the largest number of bins possible while maintaining monotonicity in the approximation to the calibration curve. Our results suggest that ECE_SWEEP is less biased than the standard ECE_BIN and can thus more reliably estimate calibration error.

Does the use of an improved ECE measure affect which recalibration method is preferred? In Figure 2, we examine this question using 10 pre-trained models, comparing the standard ECE measure, ECE_BIN with 15 equal-width bins, to our ECE_SWEEP. With large dataset sizes for recalibration and evaluation [2], we find that ECE_BIN produces a different selection of the preferred recalibration method on 30% of the models. (We use histogram binning (Zadrozny & Elkan, 2001), temperature scaling (Guo et al., 2017), and isotonic regression (Zadrozny & Elkan, 2002) as the recalibration techniques.) When we reduce the validation and evaluation sets to 10% of their original size and recalibrate with these smaller sets, ECE_BIN produces a different selection in 22% of the cases (with 10 bins, we see disagreement in 27% of the cases). Thus, the use of our improved ECE measure has significant implications not only for estimating calibration error but also for improving calibration with methods like temperature scaling.

2. BACKGROUND

True calibration error (TCE). We define the true calibration error as the difference between a model's predicted confidence and its true likelihood of being correct under the ℓ_p norm:

TCE(f) = (E_X[ |f(X) − E_Y[Y | f(X)]|^p ])^{1/p}.    (1)

The TCE is dictated by two independent features of a model: (1) the distribution of confidence scores f(X) ∼ F, over which the outer expectation is computed, and (2) the true calibration curve E_Y[Y | f(X)], which governs the relationship between the confidence score f(x) and the empirical accuracy (see Figure 3 for an illustration). In our experiments, we measure calibration error using the ℓ_2 norm because it increases the sensitivity of the error metric to extremely poorly calibrated predictions, which tend to be more harmful in applications. In addition, the mean squared prediction error of the classifier, or Brier score (Brier, 1950), can be decomposed into terms corresponding to the squared ℓ_2 calibration error and the variance of the model's correctness likelihood (Kuleshov & Liang, 2015; Kumar et al., 2019).
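The monotonic sweep idea introduced above can be sketched as follows. This is one simple variant under our own assumptions: we use equal-mass bins, grow the bin count b, and stop at the first violation of monotonicity in the per-bin empirical accuracies; the helper names are ours, not the authors':

```python
import numpy as np

def bin_accuracies(confidences, correct, n_bins):
    """Per-bin empirical accuracies with equal-mass bins."""
    order = np.argsort(confidences)
    splits = np.array_split(correct[order], n_bins)
    return np.array([s.mean() for s in splits if len(s) > 0])

def ece_sweep(confidences, correct, p=2):
    """Monotonic sweep calibration error (sketch): use the largest bin
    count for which the binned accuracies remain non-decreasing."""
    order = np.argsort(confidences)
    conf_s, corr_s = confidences[order], correct[order]
    best_b = 1
    for b in range(2, len(conf_s) + 1):
        accs = bin_accuracies(conf_s, corr_s, b)
        if np.all(np.diff(accs) >= 0):
            best_b = b
        else:
            break  # first violation of monotonicity: keep the previous b
    # Equal-mass binned error at the chosen bin count.
    err, n = 0.0, len(conf_s)
    for chunk_c, chunk_y in zip(np.array_split(conf_s, best_b),
                                np.array_split(corr_s, best_b)):
        err += (len(chunk_c) / n) * abs(chunk_c.mean() - chunk_y.mean()) ** p
    return err ** (1.0 / p)
```

The O(n^2) sweep is fine for a sketch; an implementation intended for large evaluation sets would cache the sorted arrays and incremental bin statistics.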



[1] Naeini et al. (2015) introduce ECE as an acronym for Expected Calibration Error. However, ECE is not a proper expectation, whereas the true calibration error is computed under an expectation. To resolve this confusion, we prefer to read ECE as Estimated Calibration Error.

[2] We use standard validation sets of 5,000 examples for CIFAR-10/100 and 25,000 examples for ImageNet, and evaluation sets of 10,000 examples for CIFAR-10/100 and 25,000 for ImageNet.



Figure 2: Bias affects which recalibration algorithm is preferred. For ten models, we report which recalibration method is determined to be superior based on either ECE_BIN or ECE_SWEEP. The wide bar indicates the superior method using the entire validation set (mean of X instances); the narrow bars each use a random sample of 10% of the original validation set. The recalibration methods tested are histogram binning, temperature scaling, and isotonic regression.
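Of the recalibration methods compared in Figure 2, temperature scaling is the simplest: a single scalar T > 0 divides the logits before the softmax and is fit on a held-out set by minimizing negative log-likelihood. A minimal sketch under our own assumptions (function names are ours; the paper's implementation may differ):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Fit T > 0 by minimizing negative log-likelihood on a validation set."""
    def nll(T):
        probs = softmax(logits, T)
        return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    res = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded")
    return res.x
```

An overconfident model (logit margins far larger than its accuracy justifies) yields a fitted T > 1, which softens the predicted probabilities.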

Figure 3: Curves controlling true calibration error. Our ability to measure calibration is contingent on both the confidence score distribution (e.g., f(X) ∼ Beta(2.8, 0.05)) and the true calibration curve (e.g., E_Y[Y | f(X) = c] = c²).
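The setup in Figure 3 can be simulated directly: with confidences drawn from Beta(2.8, 0.05) and labels drawn as Bernoulli(c²), the TCE reduces to a one-dimensional expectation, so the bias of any estimator can be measured by averaging it over many simulated datasets and comparing to the TCE. A minimal Monte Carlo sketch (the Beta parameters and the c² curve come from Figure 3; the function names and simulation details are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_calibration_error(n_mc=10**6, p=2):
    """Monte Carlo estimate of TCE for the Figure 3 setup:
    f(X) ~ Beta(2.8, 0.05) and E[Y | f(X) = c] = c**2."""
    c = rng.beta(2.8, 0.05, size=n_mc)
    # |c - c**2| is the pointwise gap between confidence and true accuracy.
    return np.mean(np.abs(c - c**2) ** p) ** (1.0 / p)

def simulate_dataset(n):
    """Draw one evaluation set of size n from the same model."""
    c = rng.beta(2.8, 0.05, size=n)
    y = (rng.random(n) < c**2).astype(float)  # labels follow the true curve
    return c, y
```

Feeding `simulate_dataset` samples to any ECE estimator and averaging over many replications gives that estimator's expected value; its difference from `true_calibration_error()` is the bias.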

