MULTI-CLASS UNCERTAINTY CALIBRATION VIA MUTUAL INFORMATION MAXIMIZATION-BASED BINNING

Abstract

Post-hoc multi-class calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods often fail to preserve classification accuracy. When classes have small prior probabilities, HB also suffers from severe sample inefficiency after the conversion into K one-vs-rest class-wise calibration problems. The goal of this paper is to resolve the identified issues of HB in order to provide calibrated confidence estimates using only a small holdout calibration dataset for bin optimization, while preserving multi-class ranking accuracy. From an information-theoretic perspective, we derive the I-Max concept for binning, which maximizes the mutual information between labels and quantized logits. This concept mitigates potential loss in ranking performance due to lossy quantization and, by disentangling the optimization of bin edges and representatives, allows simultaneous improvement of ranking and calibration performance. To improve the sample efficiency of estimates from a small calibration set, we propose a shared class-wise (sCW) calibration strategy, sharing one calibrator among similar classes (e.g., with similar class priors) so that the training sets of their class-wise calibration problems can be merged to train the single calibrator. The combination of sCW and I-Max binning outperforms state-of-the-art calibration methods on various evaluation metrics across different benchmark datasets and models, using a small calibration set (e.g., 1k samples for ImageNet).
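To make the one-vs-rest conversion and the sCW training-set merging concrete, the NumPy sketch below fits a single binning calibrator on the pooled class-wise logits of all K classes. The equal-mass quantile bin edges and the function names (`fit_shared_binning`, `apply_binning`) are our own illustrative simplifications; the paper's I-Max method instead optimizes the bin edges to maximize the mutual information between labels and quantized logits.

```python
import numpy as np

def fit_shared_binning(logits, labels, n_bins=15):
    """Illustrative shared class-wise (sCW) histogram binning.

    Instead of fitting K separate one-vs-rest calibrators, all K
    class-wise problems are merged and a single set of bin edges and
    representatives is trained, improving sample efficiency. Bin edges
    here are simple quantiles (equal-mass), not the I-Max edges.
    """
    n, k = logits.shape
    # One-vs-rest conversion: each (sample, class) pair becomes a
    # binary example with target "is this the true class?".
    scores = logits.reshape(-1)                                    # n*k logits
    targets = (labels[:, None] == np.arange(k)).reshape(-1).astype(float)
    # Equal-mass bin edges shared across all classes (interior edges only).
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bin_idx = np.digitize(scores, edges)
    # Bin representatives: empirical frequency of the positive class per bin.
    reps = np.array([targets[bin_idx == b].mean() if np.any(bin_idx == b)
                     else 0.0 for b in range(n_bins)])
    return edges, reps

def apply_binning(logits, edges, reps):
    """Map each class-wise logit to its bin representative."""
    return reps[np.digitize(logits, edges)]
```

Note that, as with any class-wise HB scheme, the calibrated per-class scores need not sum to one; this sketch only illustrates how pooling the K class-wise problems lets one calibrator be trained on K times more examples.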

1. INTRODUCTION

Despite their great ability to learn discriminative features, deep neural network (DNN) classifiers often make over-confident predictions. This can lead to potentially catastrophic consequences in safety-critical applications, e.g., medical diagnosis and autonomous driving perception tasks. A multi-class classifier is perfectly calibrated if, among the cases receiving the prediction distribution q, the ground truth class distribution is also q. The mismatch between the prediction and ground truth distribution can be measured using the Expected Calibration Error (ECE) (Guo et al., 2017; Kull et al., 2019).

Since the pioneering work of Guo et al. (2017), scaling methods have been widely acknowledged as an efficient post-hoc multi-class calibration solution for modern DNNs. The common practice of evaluating their ECE resorts to histogram density estimation (HDE) for modeling the distribution of the predictions. However, Vaicenavicius et al. (2019) proved that, with a fixed number of evaluation bins, the ECE of scaling methods is underestimated even with an infinite number of samples. Widmann et al. (2019), Kumar et al. (2019), and Wenger et al. (2020) also empirically demonstrated this underestimation phenomenon. This renders scaling methods unreliable calibration solutions, as their true ECEs can be larger than the evaluated values, putting many applications at risk. Additionally, setting up the HDE faces a bias/variance trade-off: increasing its number of evaluation bins reduces the bias, as the evaluation quantization error becomes smaller; however, the estimation of the ground truth correctness then begins to suffer from high variance. Fig. 1a shows that the empirical ECE estimates of both the raw network outputs and the temperature scaling method (TS) (Guo et al., 2017) are sensitive to the number of evaluation bins.
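To make the role of the evaluation binning explicit, here is a minimal NumPy sketch of the common HDE-based top-label ECE estimator; the function name `empirical_ece` and the equal-width evaluation bins are illustrative assumptions rather than the paper's exact protocol. The parameter `n_eval_bins` is the evaluation bin count whose bias/variance trade-off is discussed above.

```python
import numpy as np

def empirical_ece(confidences, correct, n_eval_bins=15):
    """Illustrative top-label ECE estimate via histogram density
    estimation (HDE).

    `confidences` holds the predicted top-class probabilities and
    `correct` is a 0/1 array marking correct predictions. More
    evaluation bins lower the quantization bias of the estimate but
    raise the variance of the per-bin accuracy estimates.
    """
    edges = np.linspace(0.0, 1.0, n_eval_bins + 1)
    # Assign each prediction to an equal-width evaluation bin.
    bin_idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_eval_bins - 1)
    ece = 0.0
    for b in range(n_eval_bins):
        mask = bin_idx == b
        if mask.any():
            # |accuracy - mean confidence| in this bin, weighted by bin mass.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

Re-running such an estimator with different `n_eval_bins` values reproduces the sensitivity illustrated in Fig. 1a: the reported ECE of a scaling method changes with the evaluation binning even though the underlying predictions do not.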

