MULTI-CLASS UNCERTAINTY CALIBRATION VIA MUTUAL INFORMATION MAXIMIZATION-BASED BINNING

Abstract

Post-hoc multi-class calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods often fail to preserve classification accuracy. When classes have small prior probabilities, HB also faces severe sample-inefficiency after the conversion into K one-vs-rest class-wise calibration problems. The goal of this paper is to resolve the identified issues of HB in order to provide calibrated confidence estimates using only a small holdout calibration dataset for bin optimization, while preserving multi-class ranking accuracy. From an information-theoretic perspective, we derive the I-Max concept for binning, which maximizes the mutual information between labels and quantized logits. This concept mitigates potential loss in ranking performance due to lossy quantization and, by disentangling the optimization of bin edges and representatives, allows simultaneous improvement of ranking and calibration performance. To improve sample efficiency and the quality of estimates from a small calibration set, we propose a shared class-wise (sCW) calibration strategy: one calibrator is shared among similar classes (e.g., those with similar class priors), so that the training sets of their class-wise calibration problems can be merged to train the single calibrator. The combination of sCW and I-Max binning outperforms state-of-the-art calibration methods on various evaluation metrics across different benchmark datasets and models, using a small calibration set (e.g., 1k samples for ImageNet).

1. INTRODUCTION

Despite their great ability to learn discriminative features, deep neural network (DNN) classifiers often make over-confident predictions. This can lead to potentially catastrophic consequences in safety-critical applications, e.g., medical diagnosis and autonomous driving perception tasks. A multi-class classifier is perfectly calibrated if, among the cases receiving the prediction distribution q, the ground truth class distribution is also q. The mismatch between the prediction and ground truth distributions can be measured using the Expected Calibration Error (ECE) (Guo et al., 2017; Kull et al., 2019). Since the pioneering work of Guo et al. (2017), scaling methods have been widely acknowledged as an efficient post-hoc multi-class calibration solution for modern DNNs. The common practice for evaluating their ECE resorts to histogram density estimation (HDE) for modeling the distribution of the predictions. However, Vaicenavicius et al. (2019) proved that, with a fixed number of evaluation bins, the ECE of scaling methods is underestimated even with an infinite number of samples. Widmann et al. (2019), Kumar et al. (2019), and Wenger et al. (2020) also empirically showed this underestimation phenomenon. This renders scaling methods unreliable calibration solutions, as their true ECEs can be larger than evaluated, putting many applications at risk. Additionally, configuring the HDE faces a bias/variance trade-off: increasing its number of evaluation bins reduces the bias, as the evaluation quantization error is smaller; however, the estimation of the ground truth correctness then begins to suffer from high variance. Fig. 1-a) shows that the empirical ECE estimates of both the raw network outputs and the temperature scaling method (TS) (Guo et al., 2017) are sensitive to the number of evaluation bins. An alternative technique for post-hoc calibration is Histogram Binning (HB) (Zadrozny & Elkan, 2001; Guo et al., 2017; Kumar et al., 2019).
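To make the evaluation pitfall above concrete, the following sketch (our own illustration, not code from the paper; the function name and equal-size evaluation bins are assumptions) estimates top-1 ECE with a histogram density estimate of the confidence distribution. Rerunning it with different `n_bins` values on the same predictions changes the estimate, which is the bin-count sensitivity discussed above.

```python
import numpy as np

def ece_hde(confidences, correct, n_bins=15):
    """Empirical top-1 ECE using an equal-size histogram density
    estimate (HDE) of the confidence distribution.

    confidences: top-1 predicted probability per sample, shape (N,)
    correct:     1 if the top-1 prediction was right, else 0, shape (N,)
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # bin index per sample; fold confidence 1.0 into the last bin
    idx = np.clip(np.searchsorted(edges, confidences, side="right") - 1,
                  0, n_bins - 1)
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # empty evaluation bins contribute nothing
        # gap between empirical correctness and average confidence in the bin
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap
    return ece
```

Because the per-bin correctness is estimated from finitely many samples, a large `n_bins` reduces quantization bias but inflates the variance of each bin's accuracy estimate, as described in the text.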
Note that HB here is a calibration method, distinct from the HDE used for evaluating the ECEs of scaling methods. HB produces discrete predictions, whose probability mass functions can be empirically estimated without using HDE/KDE. Therefore, its ECE estimate is constant with respect to the number of evaluation bins in Fig. 1-a), and it converges to the true value with an increasing number of evaluation samples (Vaicenavicius et al., 2019), see Fig. 1-b). The most common variants of HB are Equal (Eq.) size binning (uniformly partitioning the probability interval [0, 1]) and Eq. mass binning (uniformly distributing samples over bins). These simple methods for multi-class calibration are known to degrade accuracy, since quantization through binning may remove a considerable amount of the label information contained in the classifier's outputs. In this work we show that the key for HB to retain the accuracy of trained classifiers is choosing bin edges that minimize the amount of label information loss; both Eq. size and Eq. mass binning are suboptimal in this respect. We present I-Max, a novel iterative method for optimizing bin edges with proven convergence. As the location of its bin edges inherently ensures sufficient calibration samples per bin, the bin representatives of I-Max can then be effectively optimized for calibration. The two design objectives, calibration and accuracy, are thus cleanly disentangled under I-Max. For multi-class calibration, I-Max adopts the one-vs-rest (OvR) strategy to individually calibrate the prediction probability of each class. To cope with a limited number of calibration samples, we propose to share one binning scheme for calibrating the prediction probabilities of similar classes, e.g., classes with similar priors or belonging to the same category. In the small-data regime, we can even fit one binning scheme on the merged training sets of all per-class calibrations.
Such a shared class-wise (sCW) calibration strategy greatly improves the sample efficiency of I-Max binning. I-Max is evaluated on multiple performance metrics, including accuracy, ECE, Brier score and NLL, and compared against benchmark calibration methods across multiple datasets and trained classifiers. For ImageNet, I-Max obtains up to a 66.11% reduction in ECE compared to the baseline and up to a 38.14% reduction compared to the state-of-the-art GP-scaling method (Wenger et al., 2020).
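As a rough illustration of the ingredients discussed in this section, the sketch below (our own hypothetical code, not the paper's implementation, and without the iterative I-Max edge optimization) fits a one-vs-rest histogram-binning calibrator with either Eq. size or Eq. mass edges, and scores a candidate binning by the empirical mutual information I(Y; B) between the binary OvR label Y and the bin index B, i.e., the quantity that I-Max maximizes when placing bin edges.

```python
import numpy as np

def binning_edges(probs, n_bins, scheme):
    """Bin edges over [0, 1] for a one-vs-rest class-wise calibrator.
    'eq_size' partitions [0, 1] uniformly; 'eq_mass' places edges at
    quantiles of the calibration-set probabilities."""
    if scheme == "eq_size":
        return np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(probs, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0
    return edges

def assign_bins(probs, edges):
    # map each probability to its bin index (last bin is right-closed)
    return np.clip(np.searchsorted(edges, probs, side="right") - 1,
                   0, len(edges) - 2)

def bin_representatives(probs, labels, edges):
    """Calibrated bin outputs: empirical class frequency per bin."""
    idx = assign_bins(probs, edges)
    reps = np.empty(len(edges) - 1)
    for b in range(len(reps)):
        m = idx == b
        # fall back to the bin midpoint if no calibration sample landed here
        reps[b] = labels[m].mean() if m.any() else 0.5 * (edges[b] + edges[b + 1])
    return reps

def label_information(probs, labels, edges, eps=1e-12):
    """Empirical mutual information I(Y; B) between the binary OvR label
    and the bin index; higher means less label information lost to binning."""
    idx = assign_bins(probs, edges)
    n_bins = len(edges) - 1
    joint = np.zeros((n_bins, 2))
    np.add.at(joint, (idx, labels), 1.0)  # joint histogram of (bin, label)
    joint /= joint.sum()
    pb = joint.sum(1, keepdims=True)  # marginal over bins
    py = joint.sum(0, keepdims=True)  # marginal over labels
    return float(np.sum(joint * (np.log(joint + eps) - np.log(pb @ py + eps))))
```

Comparing `label_information` for Eq. size and Eq. mass edges on a held-out calibration set makes the suboptimality claim testable in practice; under sCW, the `probs`/`labels` arrays would simply be the merged OvR training sets of the classes sharing the calibrator.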

2. RELATED WORK

For confidence calibration, Bayesian DNNs and their approximations (Blundell et al., 2015; Gal & Ghahramani, 2016) are resource-demanding methods for capturing predictive model uncertainty. However, applications with limited complexity overhead and latency require sampling-free, single-model-based calibration methods. Examples include modifying the training loss (Kumar et al., 2018), scalable Gaussian processes (Milios et al., 2018), sampling-free uncertainty estimation (Postels et al., 2019), data augmentation (Patel et al., 2019; Thulasidasan et al., 2019; Yun et al., 2019; Hendrycks et al., 2020), and ensemble distribution distillation (Malinin et al., 2020). In comparison, a simple approach that requires no retraining of the models is post-hoc calibration (Guo et al., 2017).

Figure 1: (a) Temperature scaling (TS), equal-size histogram binning (HB), and our proposal, sCW I-Max binning, are compared for post-hoc calibration of a CIFAR100 (WRN) classifier. (b) Binning offers a reliable ECE measure as the number of evaluation samples increases.

