MULTI-CLASS UNCERTAINTY CALIBRATION VIA MUTUAL INFORMATION MAXIMIZATION-BASED BINNING

Abstract

Post-hoc multi-class calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods often fail to preserve classification accuracy. When classes have small prior probabilities, HB also faces the issue of severe sample-inefficiency after the conversion into K one-vs-rest class-wise calibration problems. The goal of this paper is to resolve the identified issues of HB in order to provide calibrated confidence estimates using only a small holdout calibration dataset for bin optimization, while preserving multi-class ranking accuracy. From an information-theoretic perspective, we derive the I-Max concept for binning, which maximizes the mutual information between labels and quantized logits. This concept mitigates potential loss in ranking performance due to lossy quantization and, by disentangling the optimization of bin edges and representatives, allows simultaneous improvement of ranking and calibration performance. To improve the sample efficiency of estimation from a small calibration set, we propose a shared class-wise (sCW) calibration strategy, sharing one calibrator among similar classes (e.g., with similar class priors) so that the training sets of their class-wise calibration problems can be merged to train the single calibrator. The combination of sCW and I-Max binning outperforms state-of-the-art calibration methods on various evaluation metrics across different benchmark datasets and models, using a small calibration set (e.g., 1k samples for ImageNet).

1. INTRODUCTION

Despite their great ability to learn discriminative features, deep neural network (DNN) classifiers often make over-confident predictions. This can lead to potentially catastrophic consequences in safety-critical applications, e.g., medical diagnosis and autonomous driving perception tasks. A multi-class classifier is perfectly calibrated if, among the cases receiving the prediction distribution q, the ground truth class distribution is also q. The mismatch between the prediction and ground truth distribution can be measured using the Expected Calibration Error (ECE) (Guo et al., 2017; Kull et al., 2019). Since the pioneering work of Guo et al. (2017), scaling methods have been widely acknowledged as an efficient post-hoc multi-class calibration solution for modern DNNs. The common practice of evaluating their ECE resorts to histogram density estimation (HDE) for modeling the distribution of the predictions. However, Vaicenavicius et al. (2019) proved that, with a fixed number of evaluation bins, the ECE of scaling methods is underestimated even with an infinite number of samples. Widmann et al. (2019), Kumar et al. (2019) and Wenger et al. (2020) also empirically showed this underestimation phenomenon. This renders scaling methods unreliable calibration solutions, as their true ECEs can be larger than evaluated, putting many applications at risk. Additionally, configuring the HDE faces a bias/variance trade-off: increasing its number of evaluation bins reduces the bias, as the evaluation quantization error is smaller; however, the estimation of the ground truth correctness then begins to suffer from high variance. Fig. 1-a) shows that the empirical ECE estimates of both the raw network outputs and the temperature scaling method (TS) (Guo et al., 2017) are sensitive to the number of evaluation bins.
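To make this sensitivity concrete, the following sketch shows how the top-1 ECE of a continuous calibrator is typically estimated with HDE over equal-size evaluation bins; the function name and interface are ours, not from the paper, and the returned value depends on the chosen `n_bins`.

```python
import numpy as np

def top1_ece(confidences, correct, n_bins=15):
    """Empirical top-1 ECE with equal-size evaluation bins (HDE).

    confidences: max predicted probability per sample, in [0, 1].
    correct: 1 if the top-1 prediction matches the label, else 0.
    The estimate changes with n_bins, which is the sensitivity
    discussed above.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to an evaluation bin.
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            # Weighted gap between empirical accuracy and mean confidence.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

Varying `n_bins` on the same predictions reproduces the instability shown in Fig. 1-a).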
It remains unclear how to optimally choose the number of evaluation bins so as to minimize the estimation error. Recent work (Zhang et al., 2020; Widmann et al., 2019) suggested kernel density estimation (KDE) instead of HDE. However, the choice of the kernel and bandwidth also remains unclear, and the smoothness of the ground truth distribution is hard to verify in practice. An alternative technique for post-hoc calibration is Histogram Binning (HB) (Zadrozny & Elkan, 2001; Guo et al., 2017; Kumar et al., 2019). Note that HB here is a calibration method, and is different from the HDE used for evaluating the ECEs of scaling methods. HB produces discrete predictions, whose probability mass functions can be empirically estimated without using HDE/KDE. Therefore, its ECE estimate is constant and unaffected by the number of evaluation bins in Fig. 1-a), and it converges to the true value with increasing evaluation samples (Vaicenavicius et al., 2019), see Fig. 1-b). The most common variants of HB are Eq. size binning (uniformly partitioning the probability interval [0, 1]) and Eq. mass binning (uniformly distributing samples over bins). These simple methods for multi-class calibration are known to degrade accuracy, since quantization through binning may remove a considerable amount of the label information contained in the classifier's outputs. In this work we show that the key for HB to retain the accuracy of trained classifiers is choosing bin edges that minimize the amount of label information loss; both Eq. size and Eq. mass binning are suboptimal in this respect. We present I-Max, a novel iterative method for optimizing bin edges with proven convergence. As the location of its bin edges inherently ensures sufficient calibration samples per bin, the bin representatives of I-Max can then be effectively optimized for calibration. The two design objectives, calibration and accuracy, are thus nicely disentangled under I-Max.
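For reference, the two baseline bin-edge rules can be sketched as follows (plain NumPy; the helper names are ours):

```python
import numpy as np

def eq_size_edges(n_bins):
    # Eq. size binning: uniform partition of the probability interval [0, 1].
    return np.linspace(0.0, 1.0, n_bins + 1)

def eq_mass_edges(probs, n_bins):
    # Eq. mass binning: quantiles of the calibration-set confidences, so
    # each bin receives (roughly) the same number of samples.
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(np.asarray(probs, dtype=float), qs)
    edges[0], edges[-1] = 0.0, 1.0  # cover the full interval
    return edges
```

Eq. size edges ignore the calibration data entirely, while Eq. mass edges only balance sample counts; neither rule looks at the labels, which is the gap I-Max addresses.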
For multi-class calibration, I-Max adopts the one-vs-rest (OvR) strategy to individually calibrate the prediction probability of each class. To cope with a limited number of calibration samples, we propose to share one binning scheme for calibrating the prediction probabilities of similar classes, e.g., classes with similar priors or belonging to the same category. In the small-data regime, we can even choose to fit one binning scheme on the merged training sets of all per-class calibrations. Such a shared class-wise (sCW) calibration strategy greatly improves the sample efficiency of I-Max binning. I-Max is evaluated according to multiple performance metrics, including accuracy, ECE, Brier score and NLL, and compared against benchmark calibration methods across multiple datasets and trained classifiers. For ImageNet, I-Max obtains up to a 66.11% reduction in ECE compared to the baseline and up to a 38.14% reduction compared to the state-of-the-art GP-scaling method (Wenger et al., 2020).

2. RELATED WORK

For confidence calibration, Bayesian DNNs and their approximations, e.g., (Blundell et al., 2015; Gal & Ghahramani, 2016), are resource-demanding methods to consider predictive model uncertainty. However, applications with limited complexity overhead and latency require sampling-free and single-model based calibration methods. Examples include modifying the training loss (Kumar et al., 2018), scalable Gaussian processes (Milios et al., 2018), sampling-free uncertainty estimation (Postels et al., 2019), data augmentation (Patel et al., 2019; Thulasidasan et al., 2019; Yun et al., 2019; Hendrycks et al., 2020) and ensemble distribution distillation (Malinin et al., 2020). In comparison, a simple approach that requires no retraining of the models is post-hoc calibration (Guo et al., 2017). Scaling of the prediction probabilities (logits) and binning are the two main solutions for post-hoc calibration. Scaling methods use parametric or non-parametric models to adjust the raw logits. Guo et al. (2017) investigated linear models, ranging from the single-parameter based TS to more complicated vector/matrix scaling. To avoid overfitting, Kull et al. (2019) suggested to regularize matrix scaling with an L2 loss on the model weights. Recently, Wenger et al. (2020) adopted a latent Gaussian process for multi-class calibration. Ji et al. (2019) extended TS to a bin-wise setting, learning separate temperatures for various confidence subsets. To improve the expressive capacity of TS, an ensemble of temperatures was adopted by Zhang et al. (2020). Owing to the continuous outputs of scaling methods, one critical issue discovered in recent work is that their empirical ECE estimate is not only non-verifiable (Kumar et al., 2019), but also asymptotically smaller than the ground truth (Vaicenavicius et al., 2019). Recent work (Zhang et al., 2020; Widmann et al., 2019) exploited KDEs for an improved ECE evaluation; however, the parameter setting requires further investigation.
Nixon et al. (2019) and Ashukha et al. (2020) discussed potential issues of the ECE metric, and the former suggested to 1) use equal mass binning for ECE evaluation; 2) measure both top-1 and class-wise ECE to evaluate multi-class calibrators; and 3) only include predictions with a confidence above some epsilon in the class-wise ECE score. As an alternative to scaling, HB quantizes the raw confidences with either Eq. size or Eq. mass bins (Zadrozny & Elkan, 2001). It offers asymptotically convergent ECE estimation (Vaicenavicius et al., 2019), but is less sample-efficient than scaling methods and also suffers from accuracy loss (Guo et al., 2017). Kumar et al. (2019) proposed to perform scaling before binning for an improved sample efficiency. Isotonic regression (Zadrozny & Elkan, 2002) and Bayesian binning into quantiles (BBQ) (Naeini et al., 2015) are often viewed as binning methods. However, their ECE estimates face the same issue as scaling methods: though isotonic regression fits a piecewise linear function, its predictions are continuous as they are interpolated for unseen data. BBQ considers multiple binning schemes with different numbers of bins and combines them using a continuous Bayesian score, resulting in continuous predictions. In this work, we improve the current HB design by casting bin optimization into an MI maximization problem. Furthermore, our findings can also be used to improve scaling methods.

3. METHOD

Here we introduce the I-Max binning scheme, which addresses the issues of HB in terms of preserving label information in multi-class calibration. After the problem setup in Sec. 3.1, Sec. 3.2 presents a sample-efficient technique for one-vs-rest calibration. In Sec. 3.3 we formulate the training objective of binning as MI maximization and derive a simple algorithm for I-Max binning.

3.1. PROBLEM SETUP

We address supervised multi-class classification tasks, where each input x ∈ X belongs to one of K classes, and the ground truth labels are one-hot encoded, i.e., y = [y_1, y_2, ..., y_K] ∈ {0, 1}^K. Let f: X → [0, 1]^K be a DNN trained using the cross-entropy loss. It maps each x onto a probability vector q = [q_1, ..., q_K] ∈ [0, 1]^K, which is used to rank the K possible classes of the current instance, e.g., argmax_k q_k being the top-1 ranked class. As the trained classifier tends to overfit to the cross-entropy loss rather than the accuracy (i.e., the 0/1 loss), q as the prediction distribution is typically poorly calibrated. A post-hoc calibrator h that revises q can deliver an improved performance. To evaluate the calibration performance of h ∘ f, the class-wise ECE averaged over the K classes is a common metric, measuring the expected deviation of the predicted per-class confidence after calibration, i.e., h_k(q), from the ground truth probability p(y_k = 1 | h(q)):

cwECE(h ∘ f) = (1/K) Σ_{k=1}^K E_{q=f(x)} [ |p(y_k = 1 | h(q)) − h_k(q)| ].   (1)

When h is a binning scheme, h_k(q) is discrete and thus repetitive. We can then empirically set p(y_k = 1 | h(q)) to the frequency of label-1 samples among those receiving the same h_k(q). On the contrary, scaling methods are continuous. It is unlikely that two samples attain the same h_k(q), thus requiring additional quantization, i.e., applying HDE for modeling the distribution of h_k(q), or alternatively using KDE. Ideally, we should compare the whole distribution h(q) with the ground truth p(y | h(q)). However, neither HDE nor KDE scales well with the number of classes. Therefore, the multi-class ECE evaluation often boils down to the one-dimensional class-wise ECE as in (1) or the top-1 ECE, i.e., E[ |p(y_{k*} = 1 | h(q)) − max_k h_k(q)| ] with k* = argmax_k h_k(q).

3.2. ONE-VS-REST (OVR) STRATEGY FOR MULTI-CLASS CALIBRATION

HB was initially developed for two-class calibration.
When dealing with multi-class calibration, it separately calibrates the prediction probability q_k of each class in a one-vs-rest (OvR) fashion: for any class-k, HB takes y_k as the binary label for a two-class calibration task in which class-1 means y_k = 1 and class-0 collects all other K − 1 classes. It then revises the prediction probability q_k of y_k = 1 by mapping its logit λ_k ≜ log q_k − log(1 − q_k) onto a given number of bins, and reproducing it with the calibrated prediction probability. Here, we choose to bin the logit λ_k instead of q_k, as the former is unbounded, i.e., λ_k ∈ R, which eases the bin edge optimization process. Nevertheless, as q_k and λ_k have a monotonic bijective relation, binning q_k and binning λ_k are equivalent. We note that after the K class-wise calibrations we avoid the extra normalization step used in (Guo et al., 2017). After OvR marginalizes the multi-class predictive distribution, each class is treated independently (see Sec. A1). The calibration performance of HB depends on the setting of its bin edges and representatives. From a calibration set C = {(y, x)}, we can construct K training sets, i.e., S_k = {(y_k, λ_k)} ∀k, under the one-vs-rest strategy, and then optimize the class-wise (CW) HB over each training set. As two common solutions in the literature, Eq. size and Eq. mass binning focus on bin representative optimization. Their bin edge locations, on the other hand, are either fixed (independent of the calibration set) or only ensure a balanced training sample distribution over the bins. After binning the logits in the calibration set S_k = {(y_k, λ_k)}, the bin representatives are set to the empirical frequencies of samples with y_k = 1 in each bin. To improve the sample efficiency of bin representative optimization, Kumar et al. (2019) proposed to perform scaling-based calibration before HB.
Namely, after properly scaling the logits {λ_k}, the representative per bin is then set to the averaged sigmoid-response of the scaled logits in S_k belonging to that bin. However, pre-scaling does not resolve the sample inefficiency issue arising from a small class prior p_k. The two-class ratio in S_k is p_k : 1 − p_k. When p_k is small, we will need a large calibration set C = {(y, x)} to collect enough class-1 samples in S_k for setting the bin representatives. To address this, we propose to merge {S_k} across similar classes and then use the merged set S for HB training, yielding one binning scheme shareable by multiple per-class calibration tasks, i.e., shared class-wise (sCW) binning instead of CW binning respectively trained on S_k. In Sec. 4, we respectively experiment with using a single binning scheme for all classes in the balanced multi-class setting, and with sharing one binning scheme among the classes with similar class priors in the imbalanced multi-class setting. Note that both S_k and S serve as empirical approximations to the inaccessible ground truth distribution p(y_k, λ_k) for bin optimization. The former suffers from high variance, arising from insufficient samples (Fig. A1-a), while the latter is biased due to containing samples drawn from the other classes (Fig. A1-b). As the calibration set size is usually small, the variance is expected to outweigh the bias in the approximation error (see an empirical analysis in Sec. A2).
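The construction of the class-wise sets S_k and the merged sCW set S can be sketched as follows; this is a minimal illustration, and the clipping constant and helper names are our additions:

```python
import numpy as np

def ovr_logit_sets(probs, labels):
    """Build S_k = {(y_k, lambda_k)} for each class under one-vs-rest.

    probs: (N, K) predicted probabilities q.
    labels: (N,) integer ground-truth classes.
    Returns the per-class sets and the merged set S used for sCW training.
    """
    probs = np.clip(np.asarray(probs, float), 1e-12, 1 - 1e-12)
    n, k = probs.shape
    lam = np.log(probs) - np.log1p(-probs)        # lambda_k = log q_k - log(1 - q_k)
    y = np.eye(k, dtype=int)[np.asarray(labels)]  # one-hot binary targets y_k
    per_class = [(y[:, c], lam[:, c]) for c in range(k)]
    merged = (y.ravel(), lam.ravel())             # sCW: merge all S_k into S
    return per_class, merged
```

For a class with prior p_k, only about p_k * N entries of S_k have y_k = 1, whereas the merged S pools label-1 samples from all participating classes, which is the sample-efficiency argument made above.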

3.3. BIN OPTIMIZATION VIA MUTUAL INFORMATION (MI) MAXIMIZATION

Binning can be viewed as a quantizer Q that maps the real-valued logit λ ∈ R to the bin interval m ∈ {1, ..., M} if λ ∈ I_m = [g_{m−1}, g_m), where M is the total number of bin intervals, and the bin edges g_m are sorted (g_{m−1} < g_m, with g_0 = −∞ and g_M = ∞). Any logit binned to I_m is reproduced by the same bin representative r_m. In the context of calibration, the bin representative r_m assigned to the logit λ_k is used as the calibrated prediction probability of class-k. As multiple classes can be assigned the same bin representative, we may then encounter ties when making top-k predictions based on calibrated probabilities. Therefore, binning as lossy quantization generally does not preserve the raw logit-based ranking performance, being subject to potential accuracy loss. Unfortunately, increasing M to reduce the quantization error is not a good solution here. For a given calibration set, the number of samples per bin generally reduces as M increases, while a reliable frequency estimation for setting the bin representatives {r_m} demands sufficient samples per bin. Considering that the top-k accuracy reflects how well the ground truth label can be recovered from the logits, we propose bin optimization via maximizing the MI between the quantized logits Q(λ) and the label y:

Q* = argmax_{Q: {g_m}} I(y; m = Q(λ)) =(a) argmax_{Q: {g_m}} [H(m) − H(m|y)],   (2)

where the index m is viewed as a discrete random variable with P(m|y) = ∫_{g_{m−1}}^{g_m} p(λ|y) dλ and P(m) = ∫_{g_{m−1}}^{g_m} p(λ) dλ, and the equality (a) is based on the relation of MI to the entropy H(m) and the conditional entropy H(m|y) of m. This formulation offers a quantizer Q* optimal at preserving the label information for a given budget on the number of bins. Unlike designing distortion-based quantizers, the reproduction values of the raw logits, i.e., the bin representatives {r_m}, are not part of the optimization space, as it is sufficient to know the mapped bin index m of each logit.
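As a concrete check, I(y; m = Q(λ)) can be estimated from a labeled logit sample with the plug-in estimator below; the function name is ours, and the estimator is the standard empirical MI, not a routine from the paper:

```python
import numpy as np

def empirical_mi(logits, labels, edges):
    """Plug-in estimate of I(y; m = Q(lambda)) for interior bin edges
    g_1..g_{M-1} (g_0 = -inf and g_M = +inf are implicit).

    logits: (N,) logits lambda; labels: (N,) binary y in {0, 1}.
    """
    logits = np.asarray(logits, float)
    labels = np.asarray(labels, int)
    m = np.digitize(logits, edges)  # bin index per logit
    n_bins = len(edges) + 1
    mi = 0.0
    for b in range(n_bins):
        for y in (0, 1):
            p_joint = np.mean((m == b) & (labels == y))
            p_m, p_y = np.mean(m == b), np.mean(labels == y)
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (p_m * p_y))
    return mi  # in nats
```

Evaluating this for Eq. size, Eq. mass and I-Max edges on the same sample reproduces the comparison made in Fig. 3.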
Once the bin edges {g*_m} are obtained, the bin representative r_m achieving zero calibration error shall equal P(y = 1|m), which can be empirically estimated from the samples within the bin interval I_m. It is interesting to analyze the objective function after the equality (a) in (2). The first term H(m) is maximized if P(m) is uniform, which is attained by Eq. mass binning. A uniform sample distribution over the bins is a sample-efficient strategy for optimizing the bin representatives for the sake of calibration. However, it does not consider any label information, and thus can suffer from severe accuracy loss. Through MI maximization, we can view I-Max as revising Eq. mass binning by incorporating the label information into the optimization objective, i.e., adding the second term H(m|y). As a result, I-Max not only enjoys a well-balanced sample distribution for calibration, but also maximally preserves label information for accuracy. In the example of Fig. 2, the bin edges of I-Max binning are densely located in the area where the uncertainty of y given the logit is high. This uncertainty results from small gaps between the top class predictions. With small bin widths, such nearby prediction logits are more likely to be located in different bins, and thus remain distinguishable after binning. On the other hand, Eq. mass binning has a single bin stretching across this high-uncertainty area, due to an imbalanced ratio between the p(λ|y = 1) and p(λ|y = 0) samples. Eq. size binning follows a pattern closer to I-Max binning; however, its very narrow bin widths around zero may introduce large empirical frequency estimation errors when setting the bin representatives. For solving problem (2), we formulate an equivalent problem. Theorem 1.
The MI maximization problem given in (2) is equivalent to

max_{Q: {g_m}} I(y; m = Q(λ)) ≡ min_{{g_m, φ_m}} L({g_m, φ_m}),   (3)

where the loss L({g_m, φ_m}) is defined as

L({g_m, φ_m}) ≜ Σ_{m=0}^{M−1} ∫_{g_m}^{g_{m+1}} p(λ) Σ_{y'∈{0,1}} P(y = y'|λ) log [ P(y = y') / σ((2y' − 1)φ_m) ] dλ,   (4)

and {φ_m}, a set of real-valued auxiliary variables, are introduced here to ease the optimization.

[Fig. 3 caption (fragment): ... the information bottleneck (IB) limit (Tishby et al., 1999) on the label information I(y; Q(λ)) vs. the compression rate I(λ; Q(λ)). The information-rate pairs achieved by I-Max binning are very close to the limit. The information loss of Eq. mass binning is considerably larger, whereas Eq. size binning gets stuck in the low-rate regime, failing to reach the upper bound even with more bins.]

Proof. See Sec. A3 for the proof.

Next, we compute the derivatives of the loss L with respect to {g_m, φ_m}. When the conditional distribution P(y|λ) takes the sigmoid model, i.e., P(y|λ) ≈ σ((2y − 1)λ), the stationary points of L, zeroing the gradients over {g_m, φ_m}, have the closed-form expression

g_m = log [ log( (1 + e^{φ_m}) / (1 + e^{φ_{m−1}}) ) / log( (1 + e^{−φ_{m−1}}) / (1 + e^{−φ_m}) ) ],

φ_m = log [ ∫_{g_m}^{g_{m+1}} σ(λ) p(λ) dλ / ∫_{g_m}^{g_{m+1}} σ(−λ) p(λ) dλ ] ≈ log [ Σ_{λ_n ∈ S_m} σ(λ_n) / Σ_{λ_n ∈ S_m} σ(−λ_n) ],   (5)

where the approximation for φ_m arises from using the logits in the training set S as an empirical approximation to p(λ), with S_m ≜ S ∩ [g_m, g_{m+1}). We can thus solve the problem by iteratively and alternately updating {g_m} and {φ_m} based on (5) (see Algo. 1 in the appendix for pseudocode). The convergence and initialization of this iterative method, as well as the sigmoid-model assumption, are discussed along with the proof of Theorem 1 in Sec. A3. As the iterative method operates on an approximation of the inaccessible ground truth distribution p(y, λ), we synthesize an example, see Fig. 3, to assess its effectiveness. As quantization can only reduce the MI, we evaluate I(y; λ), serving as the upper bound in Fig. 3-a) for I(y; Q(λ)). Among the three realizations of Q, I-Max achieves higher MI than Eq.
size and Eq. mass, and, more importantly, it approaches the upper bound over the iterations. Next, we assess the performance within the framework of the information bottleneck (IB) (Tishby et al., 1999), see Fig. 3-b). In the context of our problem, IB tackles min_Q (1/β) I(λ; Q(λ)) − I(y; Q(λ)), with the weight factor β > 0 balancing between 1) maximizing the information rate I(y; Q(λ)) and 2) minimizing the compression rate I(λ; Q(λ)). By varying β, IB gives the maximal achievable information rate for a given compression rate. Fig. 3-b) shows that I-Max approaches the theoretical limits, and provides an information-theoretic perspective on the sub-optimal performance of the alternative binning schemes. Sec. A3.2 has a more detailed discussion on the connection between IB and our problem formulation.
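Under the sigmoid-model assumption, both updates in (5) need only the logit sample itself. The alternating optimization can be sketched as follows; the quantile initialization, fixed iteration count, and neutral value for empty bins are our simplifications, and the paper's Algo. 1 may differ in these details:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def i_max_bin_edges(logits, n_bins=15, n_iters=200):
    """Iterative bin-edge optimization alternating the closed-form
    g_m and phi_m updates of (5), assuming P(y=1|lambda) ~ sigma(lambda)."""
    lam = np.sort(np.asarray(logits, float))
    # Interior edges g_1..g_{M-1}, initialized at sample quantiles (Eq. mass start).
    g = np.quantile(lam, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    for _ in range(n_iters):
        # phi_m update: log-ratio of sigmoid responses of the logits per bin.
        idx = np.digitize(lam, g)
        phi = np.empty(n_bins)
        for m in range(n_bins):
            s = lam[idx == m]
            if s.size == 0:
                phi[m] = 0.0  # simplification: keep empty bins at a neutral value
            else:
                phi[m] = np.log(sigmoid(s).sum()) - np.log(sigmoid(-s).sum())
        # g_m update: closed-form edge between adjacent auxiliary variables.
        num = np.log1p(np.exp(phi[1:])) - np.log1p(np.exp(phi[:-1]))
        den = np.log1p(np.exp(-phi[:-1])) - np.log1p(np.exp(-phi[1:]))
        g = np.log(num / den)
    return g
```

The bin representatives are then set separately, e.g., as empirical label frequencies per bin, which is the calibration/accuracy disentanglement described in Sec. 3.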

4. EXPERIMENTS

Datasets and Models We evaluate post-hoc calibration methods on four benchmark datasets, i.e., ImageNet (Deng et al., 2009), CIFAR10/100 (Krizhevsky, 2009) and SVHN (Netzer et al., 2011), and across various modern DNN architectures. More details are reported in Sec. A8.1.

Training and Evaluation Details

We perform class-balanced random splits of the test data, unless stated otherwise: the calibration and evaluation set sizes are both 25k for ImageNet, and 5k for CIFAR10/100. Different from ImageNet and CIFAR10/100, the test set of SVHN is class-imbalanced. We evenly split it into a calibration and an evaluation set of size 13k each. All reported numbers are means across 5 random splits; stds can be found in the appendix. Note that some calibration methods only use a subset of the available calibration samples for training, showing their sample efficiency. Further calibrator training details are provided in Sec. A8.1. We empirically evaluate MI, accuracy (top-1 and top-5 ACCs), ECE (class-wise and top-1), Brier score and NLL; the latter two are shown in the appendix. Analogous to (Nixon et al., 2019), we use thresholding when evaluating the class-wise ECE (CWECE_thr). Without thresholding, the empirical class-wise ECE score may be misleading. When a class-k has a small class prior (e.g., 0.01 or 0.001), the empirical class-wise ECE score will be dominated by prediction samples where class-k is not the ground truth. For these cases, a properly trained classifier will often not rank class-k among the top classes and will thus yield only small calibration errors. While it is good to have many cases with small calibration errors, they should not wash out the calibration errors of the rest of the cases (prone to poor calibration) through performance averaging. These include (1) cases where class-k is the ground truth class but not correctly ranked, and (2) cases where the classifier misclassifies some class-j as class-k. Thresholding remedies the washing out by focusing on the crucial cases, i.e., only averaging across cases where the prediction for class-k is above a threshold.
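The thresholded class-wise ECE can be sketched as below for a calibrator with discrete outputs (so the ground truth frequency per output value can be counted directly); the function name and interface are ours:

```python
import numpy as np

def classwise_ece_thresholded(calibrated, labels_onehot, thresholds):
    """Class-wise ECE where class-k only averages over samples whose
    calibrated probability for class-k exceeds thresholds[k]
    (e.g., the empirical class prior).

    calibrated: (N, K) calibrated probabilities with finitely many values.
    labels_onehot: (N, K) one-hot ground truth.
    """
    calibrated = np.asarray(calibrated, float)
    labels_onehot = np.asarray(labels_onehot, float)
    n, k = calibrated.shape
    scores = []
    for c in range(k):
        keep = calibrated[:, c] > thresholds[c]
        if not keep.any():
            continue  # no crucial cases for this class
        p, y = calibrated[keep, c], labels_onehot[keep, c]
        score = 0.0
        for v in np.unique(p):  # discrete outputs of a binning calibrator
            mask = p == v
            # Gap between empirical frequency of y_c = 1 and the output v.
            score += mask.mean() * abs(y[mask].mean() - v)
        scores.append(score)
    return float(np.mean(scores))
```

Setting `thresholds` to all zeros recovers the unthresholded class-wise ECE, which is the variant prone to the washing-out effect described above.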
In all experiments, our primary choice of threshold is to set it according to the class prior, for the reason that class-k is unlikely to be the ground truth if its a-posteriori probability becomes lower than its prior after observing the sample. While empirical ECE estimation of binning schemes is simple, we resort to HDE with 100 equal-size evaluation bins (Wenger et al., 2020) for scaling methods. Sec. A6 also reports the results attained by HDE with additional binning schemes and by KDE. For the HDE-based ones, we notice that with 100 evaluation bins, the ECE estimate is insensitive to the choice of binning scheme.

4.1. EQ. SIZE, EQ. MASS VS. I-MAX BINNING

In Tab. 1, we compare three binning schemes: Eq. size, Eq. mass and I-Max binning. The accuracy performances of the binning schemes are proportional to their MI; Eq. mass binning is highly suboptimal at label information preservation, and thus shows a severe accuracy drop. The accuracy of Eq. size binning is closer to that of I-Max binning, but still lower, in particular at Acc_top5. Also note that I-Max approaches the MI theoretical limit of I(y; λ) = 0.0068. The advantages of I-Max become even more prominent when comparing the NLLs of the binning schemes. For all ECE evaluation metrics, I-Max binning improves on the baseline calibration performance and outperforms Eq. size binning. Eq. mass binning is out of this comparison scope, as its poor accuracy deems the method impractical. Overall, I-Max successfully mitigates the negative impact of quantization on the ACCs while still providing an improved and verifiable ECE performance. Additionally, one-for-all sCW I-Max achieves an even better calibration with only 1k calibration samples, instead of the standard CW binning with 25k calibration samples, highlighting the effectiveness of the sCW strategy.
Furthermore, it is interesting to note that the CW ECE of the Baseline classifier is very small, i.e., 0.000442; it may thus appear as if the Baseline classifier were well calibrated. However, its top1 ECE is much larger, i.e., 0.0357. Such inconsistent observations disappear after thresholding the class-wise ECE with the class prior. This example confirms the necessity of thresholding the class-wise ECE. In Sec. A5 we perform additional ablations on the number of bins and calibration samples. Accordingly, a post-hoc analysis investigates how the quantization errors of the binning schemes change the ranking order. The observations are consistent with the intuition behind the problem formulation (see Sec. 3.3) and the empirical results from Tab. 1: MI maximization is a proper criterion for multi-class calibration, and it maximally mitigates the potential accuracy loss.

4.2. SCALING VS. I-MAX BINNING

In Tab. 2, we compare I-Max binning to benchmark scaling methods. Namely, matrix scaling with L2 regularization (Kull et al., 2019) has a large model capacity compared to other parametric scaling methods, while TS (Guo et al., 2017) only uses a single parameter and MnM (Zhang et al., 2020) uses three temperatures as an ensemble of TS (ETS). As a non-parametric method, GP (Wenger et al., 2020) yields state-of-the-art calibration performance. Eight additional scaling methods can be found in Sec. A10. Benefiting from its model capacity, matrix scaling achieves the best accuracy.

[Table 1: ACCs and ECEs of Eq. mass, Eq. size and I-Max binning for the case of ImageNet (InceptionResNetV2). Due to the poor accuracy of Eq. mass binning, its ECEs are not considered for comparison. The MI is empirically evaluated based on KDE, analogous to Fig. 3, where the MI upper bound is I(y; λ) = 0.0068. For the other datasets and models, we refer to Sec. A9.]

To showcase the complementary nature of scaling and binning, we investigate combining binning with GP (a top-performing non-parametric scaling method, though with the drawback of high complexity) and TS (a commonly used scaling method). Here, we propose to bin the raw logits and use the GP/TS-scaled logits of the samples per bin for setting the bin representatives, replacing the empirical frequency estimates. As GP is then only needed at the calibration learning phase, complexity is no longer an issue. Being mutually beneficial, GP helps improving the ACCs and ECEs of binning: a marginal ACC drop of 0.16% (0.01%) on Acc_top1 for ImageNet (CIFAR100) and 0.24% on Acc_top5 for ImageNet; and a large ECE reduction of 38.27% (49.78%) in CWECE_cls-prior and 66.11% (76.07%) in top1 ECE relative to the baseline for ImageNet (CIFAR100).
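The scaling-plus-binning combination can be sketched as follows for the TS variant: the bin edges partition the raw logits, but each representative is the mean sigmoid response of the scaled logits in that bin rather than an empirical label frequency. The temperature value below is an arbitrary placeholder, and the helper names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scaled_bin_representatives(logits, edges, temperature=1.5):
    """Bin raw logits, but set each bin representative from the
    TS-scaled logits falling into that bin (placeholder temperature)."""
    logits = np.asarray(logits, float)
    idx = np.digitize(logits, edges)
    reps = np.full(len(edges) + 1, np.nan)  # NaN marks empty bins
    for m in range(len(edges) + 1):
        in_bin = logits[idx == m]
        if in_bin.size:
            reps[m] = sigmoid(in_bin / temperature).mean()
    return reps

def calibrate(logit, edges, reps):
    # A new logit is mapped to its bin and reproduced by the representative.
    return reps[np.digitize([logit], edges)[0]]
```

The GP variant would replace `sigmoid(in_bin / temperature)` with the GP-calibrated probabilities of the samples per bin; since the scaler is only queried while learning the representatives, its inference cost does not affect deployment.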

4.3. SHARED CLASS WISE HELPS SCALING METHODS

Though free of quantization loss, some scaling methods, i.e., Beta (Kull et al., 2017), Isotonic regression (Zadrozny & Elkan, 2002) and Platt scaling (Platt, 1999), suffer from even more severe accuracy degradation than I-Max binning. As they also use the one-vs-rest strategy for multi-class calibration, we find that the proposed shared class-wise (sCW) strategy is beneficial for reducing their accuracy loss and improving their ECE performance with only 1k calibration samples, see Tab. 3.

4.4. IMBALANCED MULTI-CLASS SETTING

Lastly, we turn our experiments to an imbalanced multi-class setting. The adopted SVHN dataset has non-uniform class priors, ranging from 6% (e.g., digit 8) to 19% (e.g., digit 0). We reproduce Tab. 2 for SVHN, yielding Tab. 4. In order to better control the bias caused by merging calibration sets in the imbalanced multi-class setting, the former one-for-all sCW strategy of the balanced multi-class setting changes to sharing one I-Max scheme among classes with similar class priors. Despite the class imbalance, I-Max and its variants perform best compared to the other calibrators, similar to Tab. 2. This shows that both I-Max and the sCW strategy generalize to the imbalanced multi-class setting. In Tab. 4, we additionally evaluate the class-wise ECE at multiple thresholds. We ablate various threshold settings, namely: 1) 0 (no thresholding); 2) the class prior; 3) 1/K (any class with prediction probability below 1/K will not be the top-1); and 4) a relatively large number, 0.5 (the case where the confidence in class-k outweighs NOT class-k). We observe that I-Max and its variants are consistently top-performing across the different thresholds.

5. CONCLUSION

We proposed I-Max binning for multi-class calibration, which maximally preserves the label information under quantization, reducing potential accuracy losses. Using the shared class-wise (sCW) strategy, we also addressed the sample-inefficiency issue of binning and scaling methods that rely on the one-vs-rest (OvR) strategy for multi-class calibration.

A1 NO EXTRA NORMALIZATION AFTER K CLASS-WISE CALIBRATIONS

There is a group of calibration schemes that rely on the one-vs-rest conversion to turn multi-class calibration into K class-wise calibrations, e.g., histogram binning (HB), Platt scaling and Isotonic regression. After per-class calibration, the calibrated prediction probabilities of all classes no longer fulfill the constraint Σ_{k=1}^K q_k = 1. An extra normalization step was taken in Guo et al. (2017) to regain the normalization constraint. Here, we note that this extra normalization is unnecessary and partially undoes the per-class calibration effect. For HB, normalization makes its outputs continuous like those of any scaling method, thereby suffering from the same issue at ECE evaluation. The one-vs-rest strategy essentially marginalizes the multi-class predictive distribution over each class. After such marginalization, each class and its prediction probability shall be treated independently, thus no longer being constrained by the multi-class normalization constraint. This is analogous to training a CIFAR or ImageNet classifier with a sigmoid rather than a softmax cross-entropy loss, e.g., Ryou et al. (2019). At training and test time, each class prediction probability is individually taken from the respective sigmoid response without normalization. The class with the largest response is then top-ranked, and normalization itself has no influence on the ranking performance.

A2 S VS. S_k AS EMPIRICAL APPROXIMATIONS TO p(λ_k, y_k) FOR BIN OPTIMIZATION

In Sec. 3.2 of the main paper, we discussed the sample inefficiency issue that arises when there are classes with small class priors. Fig. A1-a) shows an example for ImageNet with 1k classes. The class prior of class-394 is about 0.001. Among the 10k calibration samples, we can only collect 10 samples whose ground truth is class-394. Estimating the bin representatives from these 10 samples is highly unreliable, resulting in poor calibration performance.
To tackle this, we proposed to merge the training sets {S_k} across a selected set of classes (e.g., classes with similar priors, classes belonging to the same category, or all classes) and to use the merged set S to train a single binning scheme for calibrating these classes, i.e., shared class-wise (sCW) instead of CW binning. Fig. A1-b) shows that after merging over the 1k ImageNet classes, the set S has a sufficient number of samples from both the positive (y = 1) and the negative (y = 0) class under the one-vs-rest conversion. Tab. 1 showed the benefits of sCW over CW binning, and Tab. 3 showed that sCW is also beneficial to scaling methods which use one-vs-rest for multi-class calibration. As pointed out in Sec. 3.2, both S and S_k are empirical approximations to the inaccessible ground truth p(λ_k, y_k) for bin optimization. In Fig. A2, we empirically analyze their approximation errors. From the CIFAR10 test set, we take 5k samples to approximate the per-class logit distribution p(λ_k|y_k = 1) by means of histogram density estimation, and use it as the baseline BS_k for comparison. Here, we focus on p(λ_k|y_k = 1), as its empirical estimation suffers from small class priors and is thus much more challenging than that of p(λ_k|y_k = 0), as illustrated in Fig. A1. For each class, we evaluate the square root of the Jensen-Shannon divergence (JSD) from the baseline BS_k to the empirical distribution of S or S_k attained at different numbers of samples. In general, Fig. A2 confirms that the variance (due to too few samples) outweighs the bias (due to training-set merging). Nevertheless, sCW does not always have smaller JSDs than CW; for instance, for class-7 with more than 2k samples, the blue bar (sCW) is larger than the orange bar (CW). So, for class-7, the bias of merging logits starts outweighing the variance when the number of samples exceeds 2k.
Unfortunately, we do not have more samples to further evaluate the JSDs, i.e., to make the variance sufficiently small to reveal the bias impact. Another reason that we do not observe large JSDs of sCW for CIFAR10 is that the logit distributions of its 10 classes are similar; the bias of sCW is therefore small, making CIFAR10 a good use case for sCW. Moving from CIFAR10 to CIFAR100 and ImageNet, there are more classes with even smaller class priors, so we expect the sample-inefficiency issue of S_k to become more critical. It is then beneficial to exploit sCW for bin optimization, as well as for other methods based on the one-vs-rest conversion for multi-class calibration. Note that for the JSD evaluation, the histogram estimator sets the bin number as the maximum of the 'sturges' and 'fd' estimators, both of which adapt their bin setting to the number of samples.
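The one-vs-rest conversion and the sCW set merging described above can be sketched in a few lines of numpy; `one_vs_rest_sets` and `shared_cw_set` are our illustrative names, not the paper's code:

```python
import numpy as np

def one_vs_rest_sets(logits, labels):
    """CW calibration sets S_k: pair each sample's class-k logit with a
    binary label telling whether k is the ground-truth class."""
    n, k = logits.shape
    onehot = np.eye(k, dtype=int)[labels]
    return [(logits[:, c], onehot[:, c]) for c in range(k)]

def shared_cw_set(logits, labels):
    """sCW set S: merge all K one-vs-rest sets, so one shared calibrator is
    trained on n*K (logit, binary label) pairs instead of n per class."""
    n, k = logits.shape
    onehot = np.eye(k, dtype=int)[labels]
    return logits.ravel(), onehot.ravel()

# toy check: 1k samples, 100 classes -> S has 1k positives among 100k pairs,
# while each S_k holds only ~10 positives on average
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 100))
labels = rng.integers(0, 100, size=1000)
lam, y = shared_cw_set(logits, labels)
```

With 1k samples and a uniform prior, each CW set S_k would see roughly 10 positives, whereas the merged set S sees all 1k, which is the sample-efficiency gain discussed above.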

A3 PROOF OF THEOREM 1 AND ALGORITHM DETAILS OF I-MAX

In this section, we prove Theorem 1 in Sec. 3.3, discuss the connection to the information bottleneck (IB) (Tishby et al., 1999), analyze the convergence behavior of the iterative method derived in Sec. 3.3, and modify the k-means++ algorithm (Arthur & Vassilvitskii, 2007) for initialization. To assist the implementation of the iterative method, we further provide pseudo code and perform a complexity/memory cost analysis.

A3.1 PROOF OF THEOREM 1

Theorem 1. The mutual information (MI) maximization problem

{g*_m} = arg max_{Q:{g_m}} I(y; m = Q(λ))    (A1)

is equivalent to

max_{Q:{g_m}} I(y; m = Q(λ)) ≡ min_{{g_m, φ_m}} L({g_m, φ_m}),    (A2)

where the loss L({g_m, φ_m}) is defined as

L({g_m, φ_m}) ≜ Σ_{m=0}^{M−1} ∫_{g_m}^{g_{m+1}} p(λ) Σ_{y'∈{0,1}} P(y = y'|λ) log [ P(y = y') / P_σ(y = y'; φ_m) ] dλ    (A3)

with

P_σ(y; φ_m) ≜ σ[(2y − 1) φ_m].    (A4)

The real-valued auxiliary variables {φ_m} are introduced here to ease the optimization.

Proof. Before starting our proof, we note that the upper-case P indicates probability mass functions of discrete random variables, e.g., the label y ∈ {0, 1} and the bin interval index m; the lower-case p is reserved for probability density functions of continuous random variables, e.g., the raw logit λ ∈ R. The key to proving the equivalence is to show the inequality I(y; m = Q(λ)) ≥ −L({g_m, φ_m}), whose equality is attainable by minimizing L over {φ_m}. By the definition of MI, we first expand I(y; m = Q(λ)) as

I(y; m = Q(λ)) = Σ_{m=0}^{M−1} ∫_{g_m}^{g_{m+1}} p(λ) Σ_{y'∈{0,1}} P(y = y'|λ) log [ P(y = y'|m) / P(y = y') ] dλ.

From this expression, we note that the MI maximization effectively accounts only for the bin edges {g_m}. The bin representatives can be arbitrary as long as they indicate the condition λ ∈ [g_m, g_{m+1}); the bin interval index m is sufficient for conditioning the probability mass function of y, i.e., P(y|m). After optimizing the bin edges, we therefore have the freedom to set the bin representatives for the sake of post-hoc calibration.

Next, based on the MI expression, we compute its sum with L:

I(y; Q(λ)) + L({g_m, φ_m}) (a)= Σ_{m=0}^{M−1} Σ_{y'∈{0,1}} P(m) P(y = y'|m) log [ P(y = y'|m) / P_σ(y = y'; φ_m) ] (b)= Σ_{m=0}^{M−1} P(m) KLD[ P(y|m) ‖ P_σ(y; φ_m) ] (c)≥ 0,    (A9)

where the equality (a) uses ∫_{g_m}^{g_{m+1}} p(λ) P(y = y'|λ) dλ = P(y = y', λ ∈ [g_m, g_{m+1})) = P(m) P(y = y'|m); from (a) to (b), we simply identify the term in [•] as the Kullback-Leibler divergence (KLD) between two probability mass functions of y. As the probability mass function P(m) and the KLD are both non-negative, we reach the inequality at (c), where the equality holds if P_σ(y; φ_m) = P(y|m).
By further noting that L is convex over {φ_m} and that P_σ(y; φ_m) = P(y|m) nulls out its gradient over {φ_m}, we then reach

I(y; Q(λ)) + min_{{φ_m}} L({g_m, φ_m}) = 0.    (A10)

The obtained equality concludes our proof. Lastly, we note that L({g_m, φ_m}) reduces to an NLL loss (as P(y) in the log-probability ratio can be omitted), which is a common loss for calibrators. However, only through this equivalence proof and the MI-maximization formulation can we clearly identify the great importance of the bin edges in preserving label information: even though {g_m, φ_m} are jointly optimized in the equivalent problem, only {g_m} play the determinant role in maximizing the MI.

A3.2 CONNECTION TO INFORMATION BOTTLENECK (IB)

IB (Tishby et al., 1999) is a generic information-theoretic framework for stochastic quantization design. Viewing binning as quantization, IB seeks a balance between two conflicting goals: 1) maximizing the information rate, i.e., the mutual information I(y; Q(λ)) between the label and the quantized logits; and 2) minimizing the compression rate, i.e., the mutual information I(λ; Q(λ)) between the logits and the quantized logits. It unifies them by solving

min_{p(m|λ)} (1/β) I(λ; m = Q(λ)) − I(y; m = Q(λ)),

where m is the bin index assigned to λ and β is a weighting factor (a larger value focusing more on the information rate, a smaller value on the compression rate). The compression rate is the bottleneck for maximizing the information rate. Note that IB optimizes the distribution p(m|λ), which describes the probability of λ being assigned to the bin with index m. Since this is not a deterministic assignment, IB yields a stochastic rather than deterministic quantizer. Our information-maximization formulation is a special case of IB with β infinitely large, as we care predominantly about how well the label can be predicted from the compressed representation (the quantized logits); making the compression rate as small as possible is not a requirement of our problem. For us, the only bottleneck is the number of bins usable for quantization. Furthermore, with β → ∞, the stochastic quantization degenerates to a deterministic one. If stochastic binning were used for calibration, it would output a weighted sum of all bin representatives, thereby being continuous and not ECE-verifiable; given that, we do not use it for calibration. As IB defines the best trade-off between the information rate and the compression rate, we use it as the upper limit for assessing the optimality of I-Max in Fig. 3-b). By varying β, IB traces the maximal achievable information rate for a given compression rate. For the binning schemes (Eq. size, Eq. mass and I-Max), we vary the number of bins and evaluate their achieved information and compression rates. As we can clearly observe from Fig. 3-b), I-Max approaches the upper limit defined by IB. Note that the compression rate, though measured in bits, is different from the number of bins used by the quantizer. As quantization is lossy, the compression rate captures the common information between the logits and the quantized logits; the number of bins imposes an upper limit on the information that can be preserved after quantization.

A3.3 CONVERGENCE OF THE ITERATIVE METHOD

For convenience, we recall the update equations for {g_m, φ_m} from Sec. 3.3 of the main paper:

g_m = log [ log((1 + e^{φ_m}) / (1 + e^{φ_{m−1}})) / log((1 + e^{−φ_{m−1}}) / (1 + e^{−φ_m})) ],
φ_m = log [ ∫_{g_m}^{g_{m+1}} σ(λ) p(λ) dλ / ∫_{g_m}^{g_{m+1}} σ(−λ) p(λ) dλ ] ≈ log [ Σ_{λ_n∈S_m} σ(λ_n) / Σ_{λ_n∈S_m} σ(−λ_n) ]  ∀m.    (A13)

In the following, we show that the updates on {g_m} and {φ_m} according to (A13) continuously decrease the loss L, i.e.,

L({g^l_m, φ^l_m}) ≥ L({g^{l+1}_m, φ^l_m}) ≥ L({g^{l+1}_m, φ^{l+1}_m}).    (A14)

The second inequality is based on the explained property of L: it is convex over {φ_m}, and for any given {g_m} the minimum is attained by P_σ(y; φ_m) = P(y|m). As φ_m is the log-probability ratio of P_σ(y; φ_m), minimizing L over {φ_m} yields exactly the φ_m update in (A13). For the first inequality, the update equation on {g_m} is the outcome of solving the stationary point equation of L({g_m, φ^l_m}) over {g_m} under the condition p(λ = g_m) > 0 for any m:

∂L({g_m, φ^l_m}) / ∂g_m = p(λ = g_m) Σ_{y'∈{0,1}} P(y = y'|λ = g_m) log [ P_σ(y = y'; φ^l_m) / P_σ(y = y'; φ^l_{m−1}) ] != 0  ∀m.    (A16)

Being a stationary point is a necessary condition for a local extremum when the function's first-order derivative exists at that point (first-derivative test). To further show that this local extremum is actually a local minimum, we resort to the second-derivative test, i.e., we check whether the Hessian of L({g_m, φ^l_m}) is positive definite at the stationary point {g^{l+1}_m}. Due to φ_m > φ_{m−1} (the update equation involves the monotonically increasing sigmoid), we have

∂²L({g_m, φ^l_m}) / ∂g_m ∂g_{m'} |_{g_m = g^{l+1}_m} = 0 for m' ≠ m, and ∂²L({g_m, φ^l_m}) / ∂g_m² |_{g_m = g^{l+1}_m} > 0  ∀m,    (A17)

implying that all eigenvalues of the Hessian are positive (equivalently, the Hessian is positive definite). Therefore, {g^{l+1}_m}, as the stationary point of L({g_m, φ^l_m}), is a local minimum. It is important to note from the stationary point equation (A16) that {g^{l+1}_m} is the unique local minimum among all {g_m} with p(λ = g_m) > 0 for all m. In other words, the first inequality holds under the condition p(λ = g^l_m) > 0 for any m. Binning is a lossy data processing.
In order to maximally preserve the label information, it is natural to exploit all bins in the optimization, not wasting any single bin in the area without mass, i.e., p(λ = g m ) = 0. Having said that, it is reasonable to constrain {g m } with p(λ = g m ) > 0 ∀m over iterations, thereby concluding that the iterative method will converge to a local minimum based on the two inequalities (A14).
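The alternating updates in (A13) can be sketched in numpy as follows. This is a minimal illustration, not the paper's released code: `imax_bins` is our name, the equal-mass quantile initialization stands in for the JSD-based k-means++ seeding of A3.4, and no numerical-overflow safeguards are included:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def imax_bins(logits, num_bins=15, iters=200):
    """Sketch of the I-Max updates (A13): alternate between closed-form
    bin edges g_m and log-ratio bin representatives phi_m."""
    lam = np.asarray(logits, dtype=float)
    # simple equal-mass init instead of the paper's JSD k-means++ seeding
    phi = np.quantile(lam, (np.arange(num_bins) + 0.5) / num_bins)
    edges = np.empty(num_bins - 1)
    for _ in range(iters):
        # edge update: g_m from neighbouring representatives (first eq. of A13)
        num = np.log1p(np.exp(phi[1:])) - np.log1p(np.exp(phi[:-1]))
        den = np.log1p(np.exp(-phi[:-1])) - np.log1p(np.exp(-phi[1:]))
        edges = np.log(num / den)
        # representative update: empirical sigmoid-mass log-ratio per bin
        idx = np.digitize(lam, edges)
        for m in range(num_bins):
            s = lam[idx == m]
            if s.size:  # keep the previous phi_m if a bin happens to be empty
                phi[m] = np.log(sigmoid(s).sum() / sigmoid(-s).sum())
    return edges, phi
```

Since the sigmoid is monotone, strictly increasing {φ_m} yield strictly increasing {g_m}, so the edge vector stays sorted over the iterations, matching the convergence argument above.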

A3.4 INITIALIZATION OF THE ITERATIVE METHOD

We propose to initialize the iterative method by modifying the k-means++ algorithm (Arthur & Vassilvitskii, 2007). As I(y; λ) is a constant with respect to {g_m, φ_m}, minimizing L is equivalent to minimizing the term on the RHS of (A18). The last approximation (A19) is reached by turning the binning problem into a clustering problem, i.e., grouping the logit samples in the training set S according to the KLD measure, where {φ_m} effectively act as the cluster centers. The k-means++ algorithm (Arthur & Vassilvitskii, 2007) initializes the cluster centers based on the Euclidean distance; in our case, we instead use the JSD as the distance measure to initialize {φ_m}. Compared with the KLD, the JSD is symmetric and bounded.
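A toy sketch of this JSD-based seeding, written from scratch rather than by patching sklearn (the function names are ours; each sample's sigmoid probability acts as a Bernoulli distribution, and the usual squared-Euclidean k-means++ weights are replaced by JSDs):

```python
import numpy as np

def bern_jsd(p, q):
    """Jensen-Shannon divergence between Bernoulli(p) and Bernoulli(q)."""
    def kld(a, b):
        a = np.clip(a, 1e-12, 1 - 1e-12)
        b = np.clip(b, 1e-12, 1 - 1e-12)
        return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))
    m = 0.5 * (p + q)
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

def jsd_kmeanspp_init(logits, num_bins, seed=0):
    """k-means++-style seeding of the bin representatives {phi_m}, with the
    bounded, symmetric JSD replacing the Euclidean distance."""
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    centers = [rng.choice(p)]
    for _ in range(num_bins - 1):
        # distance of every sample to its nearest current center
        d = np.min(np.stack([bern_jsd(p, c) for c in centers]), axis=0)
        # sample the next center proportionally to that distance
        centers.append(rng.choice(p, p=d / d.sum()))
    c = np.sort(np.array(centers))
    return np.log(c / (1 - c))  # back to log-ratio (phi) space, sorted
```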

A3.5 A REMARK ON THE ITERATIVE METHOD DERIVATION

The closed-form update on {g_m} in (A13) is based on the sigmoid-model approximation, which has been validated through our empirical experiments. It is expected to work with properly trained classifiers that do not overly overfit the cross-entropy loss, e.g., those trained with data augmentation and other regularization techniques. Nevertheless, even in corner cases where classifiers are poorly trained, the iterative method can still be operated without the sigmoid-model approximation. Namely, as shown in Fig. 2 of the main paper, we can resort to KDEs for an empirical estimate of the ground truth distribution p(λ|y). Using the KDEs, we can compute the gradient of L over {g_m} and perform an iterative gradient-based update on {g_m}, replacing the closed-form update. Essentially, the sigmoid-model approximation is only needed to find the stationary points of the gradient equations in closed form, speeding up the convergence of the method. If one wishes to keep a closed-form update on {g_m}, an alternative is to use the KDEs to adjust the sigmoid model, e.g., p(y|λ) ≈ σ[(2y − 1)(aλ + ab)], where a and b are chosen to match the KDE-based approximation to p(y|λ). Once set, a and b act as a scaling and a bias term in the original closed-form update equations:

g_m = (1/a) log [ log((1 + e^{φ_m}) / (1 + e^{φ_{m−1}})) / log((1 + e^{−φ_{m−1}}) / (1 + e^{−φ_m})) ] − b,    (A20)

with the φ_m update as in (A13), using σ(a(λ_n + b)) in place of σ(λ_n), for all m.

A3.6 COMPLEXITY AND MEMORY ANALYSIS

To ease the reproducibility of I-Max, we provide the pseudocode in Algorithm 1. Based on it, we further analyze the complexity and memory cost of I-Max at training and test time. We simplify this complexity analysis, as our algorithm runs completely offline and is purely numpy-based. Although the underlying (numpy) operations performed at each step of the algorithm differ, we treat multiplication, division, logarithm and exponential functions as each incurring the same unit cost, and we ignore the costs of logic operations and add/subtract operators. The initialization has a complexity of O(NM) for the one-dimensional logits. We exploit the sklearn implementation of the k-means++ initialization, originally used for k-means clustering, but replace the MSE with the JSD in the distance measure. Following Algorithm 1, we arrive at an overall training complexity of O(N·M + I·(10·M + 2·M)), where I is the number of iterations. Our Python code runs Algorithm 1 within seconds for classifiers as large as ImageNet, performed purely in numpy. The largest storage and memory consumption is for keeping the N logits used during the I-Max learning phase. At test time, the memory and storage requirements are negligible, as only 2M − 1 floats need to be saved: the M bin representatives {φ_m}_{m=0}^{M−1} and the M − 1 bin edges {g_m}_{m=1}^{M−1}. The complexity at test time consists merely of the logic operations that compute the bin assignment of each logit, which can be done using numpy's efficient 'digitize' function. I-Max thus offers a real-time post-hoc calibrator, adding almost zero complexity and memory cost relative to the computations of the original classifier. We will release our code soon.
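The test-time step described above amounts to a single vectorized bin lookup; a minimal sketch (our illustrative `imax_calibrate` helper, with toy edges/representatives):

```python
import numpy as np

def imax_calibrate(logits, edges, phi):
    """Test-time I-Max calibration: assign each logit to a bin via one
    vectorized digitize call, then emit the sigmoid of the bin representative."""
    m = np.digitize(logits, edges)            # bin index per logit
    return 1.0 / (1.0 + np.exp(-phi[m]))      # calibrated class probability

# toy parameters: M-1 = 3 edges define M = 4 bins with one phi_m each
edges = np.array([-2.0, 0.0, 2.0])
phi = np.array([-4.0, -1.0, 1.0, 4.0])
q = imax_calibrate(np.array([-3.0, 0.5, 5.0]), edges, phi)
# the three logits fall into bins 0, 2 and 3 respectively
```

Only the 2M − 1 floats in `edges` and `phi` need to be stored, matching the memory analysis above.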

A4 POST-HOC ANALYSIS ON THE EXPERIMENT RESULTS IN SEC. 4.1

In Tab. 1 of Sec. 4.1, we compared three different binning schemes by measuring their ACCs and ECEs. The observations on their accuracy performance align with the mutual information maximization viewpoint introduced in Sec. 3.3 and Fig. 2. Here, we re-present Fig. 2 and provide an alternative explanation to strengthen the understanding of how the location of the bin edges affects the accuracy, e.g., why Eq. Size binning performs acceptably at top-1 ACC but fails at top-5 ACC. Specifically, Fig. A3 shows the histograms of the raw logits, grouped by their ranks instead of their labels as in Fig. 2. As expected, the logits with low ranks (i.e., "Rest" below top-5 in Fig. A3) are small and thus occupy the left-hand side of the plot, whereas the top-1 logits are mostly located on the right-hand side. Besides sorting the logits according to their ranks, we additionally estimate the density of the logits associated with the ground truth (GT) classes, i.e., GT in Fig. A3. For a properly trained classifier, the histogram of the top-1 logits shall largely overlap with the GT density curve, i.e., the top-1 prediction is correct in most cases. Judging from its bin edge locations, Eq. Mass binning attempts to attain small quantization errors for the logits of low ranks rather than the top-5, which certainly degrades the accuracy after binning. On the contrary, Eq. Size binning aims at a small quantization error for the top-1 logits but ignores the top-5 ones; as a result, we observed its poor top-5 ACCs. I-Max binning nicely distributes its bin edges in the area where the GT logits are likely to be located, and its bin widths become smaller in the area where the top-5 logits are close together (i.e., the overlap region between the red and blue histograms). Note that any logit larger than zero must be top-1 ranked, as there can exist at most one


class with prediction probability larger than 0.5. Given that, the bins located above zero no longer need to maintain the ranking order; rather, they reduce the precision loss of the top-1 prediction probability after binning.

Figure A3: Histogram of CIFAR100 (WRN) logits in S constructed from 1k calibration samples, using the same setting as Fig. 2 in the main paper. Instead of categorizing the logits according to their two-class label y_k ∈ {0, 1} as in Fig. 2, here we sort them according to their ranks given by the CIFAR100 WRN classifier. As a baseline, we also plot the KDE of the logits associated with the ground truth classes, i.e., GT.

The second part of our post-hoc analysis concerns the sCW binning strategy. When the same binning scheme is used for all per-class calibrations, the chance of creating ties in the top-k predictions is much higher than with CW binning, e.g., more than one class may be top-1 ranked according to the calibrated prediction probabilities. The ACCs reported in the main paper are attained by simply returning the first found class, i.e., using the class index as the secondary sorting criterion. This is certainly a suboptimal solution. Here, we investigate how the ties affect the ACCs of sCW binning. To this end, we use the raw logits (before binning) as the secondary sorting criterion. The resulting ACC*_top1 and ACC*_top5 are shown in Tab. A1. Interestingly, such a simple change reduces the accuracy loss of Eq. Mass and I-Max binning to zero, indicating that they preserve the top-5 ranking order of the raw logits, though not in a strictly monotonic sense, i.e., some > are replaced by =. As opposed to I-Max binning, Eq. Mass binning performs poorly at calibration, i.e., it has very high NLL and ECE. This is because it trivially ranks many classes as top-1, each with the same very small confidence score. So even though the accuracy loss is no longer an issue, it is still not a good solution for multi-class calibration. For Eq. Size binning, resolving ties only helps restore the baseline top-5 but not the top-1 ACC: its poor bin representative setting, caused by unreliable empirical frequency estimation over too-narrow bins, can result in a permutation among the top-5 predictions. Concluding from the above, our post-hoc analysis confirms that I-Max binning outperforms the other two binning schemes at mitigating the accuracy loss and at multi-class calibration. In particular, there exists a simple solution to close the accuracy gap to the baseline while still retaining the desired calibration gains.

Table A2: Ablation on the number of bins and calibration samples for sCW I-Max binning, where the basic setting is identical to Tab. 1 in Sec. 4.1 of the main paper.

In Tab. 1 of Sec. 4.1, sCW I-Max binning is the top performer on the ACC, ECE and NLL measures. Here, we further investigate how the number of bins and the calibration set size influence its performance. Tab. A2 shows that in order to benefit from more bins, we must accordingly increase the number of calibration samples. More bins reduce the quantization loss but increase the empirical frequency estimation error when setting the bin representatives. Accordingly, we observe reduced ACCs and increased ECEs for 50 bins with only 1k calibration samples; after increasing the calibration set size to 5k, we start to see the benefit of more bins in reducing the quantization error for better ACCs. Next, we further exploit a scaling method, GP (Wenger et al., 2020), to improve the sample efficiency of binning in setting the bin representatives; the combination is particularly beneficial for the ACCs and the top-1 ECE. Overall, more bins are beneficial to the ACCs, while the ECEs favor fewer bins.
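The raw-logit tie-breaking discussed above can be sketched with a single lexicographic sort; `topk_with_tiebreak` is our illustrative helper, not the paper's code:

```python
import numpy as np

def topk_with_tiebreak(q_cal, raw_logits, k=5):
    """Rank classes by calibrated probability; break ties (common under sCW
    binning, where many classes share a bin representative) by raw logit."""
    # lexsort sorts by the LAST key first: primary -q_cal, secondary -raw_logits,
    # i.e., descending calibrated probability, then descending raw logit
    order = np.lexsort((-raw_logits, -q_cal))
    return order[:k]

q = np.array([0.7, 0.7, 0.1, 0.7])        # three-way tie after binning
raw = np.array([2.0, 3.5, -1.0, 1.0])
top3 = topk_with_tiebreak(q, raw, k=3)
# the tie among classes 0, 1, 3 is resolved by their raw logits
```

Using the class index as the secondary key (the suboptimal default in Tab. 1) would instead return the tied classes in index order.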

A6 EMPIRICAL ECE ESTIMATION OF SCALING METHODS UNDER MULTIPLE EVALUATION SCHEMES

As mentioned in the main paper, scaling methods suffer from not providing verifiable ECEs, see Fig. 1. Here, we discuss alternatives for estimating their ECEs. The current literature can be split into two types of ECE evaluation: histogram density estimation (HDE) and kernel density estimation (KDE).

A6.1 HDE-BASED ECE EVALUATION

HDE bins the prediction probabilities (or logits) for density modeling. The binning scheme has different variants, and changing the bin edges can yield different measures of the ECE. Two bin-edge schemes have been discussed in the literature (Eq. size and Eq. mass), and this paper introduced a new one (I-Max). Alternatively, we also evaluate a binning scheme that uses KMeans clustering to determine the bin edges.
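As a concrete instance of an HDE-based evaluation, a top-1 ECE with equal-mass evaluation bins can be sketched as follows (our illustrative `ece_equal_mass`, not the exact evaluation code of the paper):

```python
import numpy as np

def ece_equal_mass(conf, correct, num_bins=100):
    """Top-1 ECE with an equal-mass HDE: bin edges are confidence quantiles,
    so every evaluation bin holds roughly the same number of samples."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.quantile(conf, np.linspace(0, 1, num_bins + 1)[1:-1])
    idx = np.digitize(conf, edges)
    ece = 0.0
    for m in range(num_bins):
        mask = idx == m
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap   # weight gap by empirical bin mass
    return ece
```

An Eq.-size variant would simply replace the quantile edges by `np.linspace(0, 1, num_bins + 1)[1:-1]`, which is exactly why the reported ECE depends on the chosen evaluation scheme.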

A6.2 KDE-BASED ECE EVALUATION

Recent work (Zhang et al., 2020) presented an alternative ECE evaluation scheme which exploits KDEs to estimate the distribution of the prediction probabilities {q_k} from the test set samples. Using the code provided by Zhang et al. (2020), we observe that the KDE with the setting in their paper can have a sub-optimal fit in the probability space.

Figure A4: Distribution of the top-1 predictions and its log-space counterpart, i.e., λ = log q − log(1 − q).

This can be observed from Fig. A4a and Fig. A4c: the fit is good for ImageNet/InceptionResNetV2, but when the distribution is significantly skewed to the right (as in the case of CIFAR100/WRN) the mismatch becomes large. We expect the case of CIFAR100/WRN to be common among modern DNNs, due to their high capacity and proneness to overfitting. Equivalently, we can learn the distribution in the log space via the bijective transformation λ = log q − log(1 − q), with q = σ(λ). As we can observe from Fig. A4b and Fig. A4d, the KDE fit in the log space is consistently good for both models. Zhang et al. (2020) empirically validated their KDE on a toy example where the ground truth ECE can be analytically computed. By analogy, we reproduce the experiment and further compare it with the log-space KDE evaluation. Using the same settings as in (Zhang et al., 2020), we assess the ECE evaluation error of the KDE, i.e., |ECE_gt − ECE_kde|, in both the log and the probability space, obtaining prob. 0.0020 vs. log 0.0017 for the toy setting β_0 = 0.5, β_1 = −1.5. For an even less calibrated setting, β_0 = 0.2, β_1 = −1.9, we obtain prob. 0.0029 vs. log 0.0020. The log-space KDE-based ECE evaluation (kdeECE_log) thus has a lower estimation error than the probability-space one.
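The bijective log-space transformation and a plain Gaussian KDE can be sketched as follows (our own minimal helpers, not the code of Zhang et al. (2020); the clipping `eps` is an assumption to keep the transform finite at q ∈ {0, 1}):

```python
import numpy as np

def logit_transform(q, eps=1e-7):
    """Bijective map q -> lambda = log q - log(1-q); fitting the KDE in this
    log space avoids boundary bias when q piles up near 1."""
    q = np.clip(q, eps, 1 - eps)
    return np.log(q) - np.log1p(-q)

def kde_pdf(samples, grid, bandwidth):
    """Plain Gaussian KDE evaluated on a grid (no SciPy dependency)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
```

One would fit `kde_pdf` to `logit_transform(q_top1)` and map densities back through q = σ(λ) when needed; the transform is exactly inverted by the sigmoid.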

A6.3 ALTERNATIVE ECE EVALUATION SCHEMES

Concluding from the above, Tab. A3 shows the ECE estimates attained by HDEs (from four different bin-setting schemes, each using 10² evaluation bins) and by a KDE (from (Zhang et al., 2020), but in the log space). As we can see, the obtained results are evaluation-scheme dependent. On the contrary, I-Max binning with and without GP is not affected, and more importantly, its ECEs are better than those of the scaling methods regardless of the evaluation scheme. Note that the ECEs of I-Max binning (as a calibrator rather than an evaluation scheme) are agnostic to the evaluation scheme. Furthermore, BBQ suffers from severe accuracy degradation.

In general, post-hoc and during-training calibration can be viewed as two orthogonal ways to improve calibration, as they can easily be combined. Exemplarily, we compare/combine post-hoc calibration methods against/with a during-training regularization that directly modifies the training objective to encourage less confident predictions through an entropy regularization term (Entr. Reg.) (Pereyra et al., 2017). Additionally, we adopt Mixup (Zhang et al., 2018), a data augmentation shown to improve calibration (Thulasidasan et al., 2019). We re-train the CIFAR100 WRN classifier respectively using Entr. Reg. and Mixup, take each trained model as one baseline, and further perform post-hoc calibration; in Tab. A4, the best numbers per training mode are marked in bold and the underlined scores are the best across the three models.

A9 EXTEND TAB. 1 FOR MORE DATASETS AND MODELS.

Tab. 1 in Sec. 4.1 of the main paper is replicated across datasets and models, with the basic setting unchanged. Specifically, three different ImageNet models can be found in Tab. A5, Tab. A6 and Tab. A7; three CIFAR100 models in Tab. A8, Tab. A9 and Tab. A10; and, similarly, three CIFAR10 models in Tab. A11, Tab. A12 and Tab. A13. The accuracy degradation of Eq. Mass shrinks as the dataset has fewer classes, e.g., CIFAR10. This results from the higher class priors, which make the one-vs-rest conversion less critical for CIFAR10 than for ImageNet. Nevertheless, its accuracy losses are still much larger than those of the other binning schemes, i.e., Eq. Size and I-Max binning; therefore, its calibration performance is not considered for comparison. Overall, the observations of Tab. A5-A13 are similar to Tab. 1, showing the stable performance gains of I-Max binning across datasets and models.

A10 EXTEND TAB. 2 FOR MORE SCALING METHODS, DATASETS AND MODELS

Tab. 2 in Sec. 4.2 of the main paper is replicated across datasets and models, now including more scaling methods for comparison. The three binning methods all use the shared CW strategy, so 1k calibration samples are sufficient. The basic setting remains the same as in Tab. 2. Three different ImageNet models can be found in Tab. A14, Tab. A15 and Tab. A16. Three models for CIFAR100



Figure 1: (a) Temperature scaling (TS), equal-size histogram binning (HB), and our proposal, sCW I-Max binning, are compared for post-hoc calibration of a CIFAR100 (WRN) classifier. (b) Binning offers a reliable ECE measure as the number of evaluation samples increases.


Figure A1: Histogram of ImageNet (InceptionResNetV2) logits for (a) CW and (b) sCW training. By means of the set-merging strategy to handle the two-class imbalance of 1:999, S has K = 1000 times more class-1 samples than S_k given the same 10k calibration samples from C.

The conditional distribution P(y|m) is given as

P(y = y'|m) = P(y = y' | λ ∈ [g_m, g_{m+1})) = P(y = y', λ ∈ [g_m, g_{m+1})) / P(λ ∈ [g_m, g_{m+1})) = [∫_{g_m}^{g_{m+1}} p(λ) P(y = y'|λ) dλ] / P(m),

and the sum of the MI with the loss evaluates to

I(y; Q(λ)) + L({g_m, φ_m}) = Σ_{m=0}^{M−1} ∫_{g_m}^{g_{m+1}} p(λ) Σ_{y'} P(y = y'|λ) log [ P(y = y'|m) / P_σ(y = y'; φ_m) ] dλ = Σ_{m=0}^{M−1} P(m) Σ_{y'} P(y = y'|m) log [ P(y = y'|m) / P_σ(y = y'; φ_m) ] = Σ_{m=0}^{M−1} P(m) KLD[ P(y|m) ‖ P_σ(y; φ_m) ] ≥ 0.

Plugging {g^{l+1}_m} and P(y|λ) = σ[(2y − 1)λ] into (A7), the resulting P(y = y'|m) at iteration l + 1 yields the update equation of φ_m given in (A13), where P(y = 1|m) is induced by {g^{l+1}_m}. To prove the first inequality, we start by showing that {g^{l+1}_m} is a local minimum of L({g_m, φ^l_m}): the update equation on {g_m} is the outcome of solving the stationary point equation (A16) of L({g_m, φ^l_m}) over {g_m} under the condition p(λ = g_m) > 0 for any m.

that was developed to initialize the cluster centers for the k-means clustering algorithm. It is based on the following identity:

L({g_m, φ_m}) + I(y; λ) = Σ_{m=0}^{M−1} ∫_{g_m}^{g_{m+1}} p(λ) KLD[ P(y = y'|λ) ‖ P_σ(y = y'; φ_m) ] dλ    (A18)
≈ (1/|S|) Σ_{λ_n∈S} min_m KLD[ P(y = y'|λ_n) ‖ P_σ(y = y'; φ_m) ].    (A19)

Algorithm 1: I-Max Binning Calibration
Input: number of bins M, logits {λ_n}_{n=1}^N and binary labels {y_n}_{n=1}^N
Result: bin edges {g_m}_{m=0}^M (g_0 = −∞ and g_M = ∞) and bin representatives {φ_m}_{m=0}^{M−1}
Initialization: {φ_m} ← Kmeans++({λ_n}_{n=1}^N, M) (see A3.4)
for iteration = 1, 2, . . . , 200 do
    for m = 1, 2, . . . , M − 1 do
        g_m ← log [ log((1 + e^{φ_m}) / (1 + e^{φ_{m−1}})) / log((1 + e^{−φ_{m−1}}) / (1 + e^{−φ_m})) ]
    end
    for m = 0, 1, . . . , M − 1 do
        S_m ← {λ_n} ∩ [g_m, g_{m+1})
        φ_m ← log [ Σ_{λ_n∈S_m} σ(λ_n) / Σ_{λ_n∈S_m} σ(−λ_n) ]
    end
end

ACCs and ECEs of I-Max binning (15 bins) and scaling methods. All methods use 1k calibration samples, except for Mtx. Scal. and ETS-MnM, which require the complete calibration set, i.e., 25k/5k for ImageNet/CIFAR100. Six additional scaling methods can be found in A10.

ACCs and ECEs of scaling methods using the one-vs-rest conversion for multi-class calibration. Here we compare using 1k samples for both CW and one-for-all sCW scaling.

ACCs and ECEs of I-Max binning (15 bins) and scaling methods. All methods use 1k calibration samples, except for Mtx. Scal. and ETS-MnM, which require the complete calibration set, i.e., 13k for SVHN. Here, we also report the class-wise ECEs using four different thresholds.

Our experiments showed that I-Max yields consistent class-wise and top-1 calibration improvements over multiple datasets and model architectures, outperforming HB and state-of-the-art scaling methods. Combining I-Max with scaling methods offers further calibration performance gains, and more importantly, ECE estimates that can converge to the ground truth in the large sample limit. Future work will investigate extensions of I-Max that jointly calibrate multiple classes and thereby directly model class correlations. Interestingly, even on datasets such as ImageNet, which contain several closely related classes, there is no clear evidence that methods that do model class correlations,

Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 694-699, January 2002.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016.
Hongyi Zhang, Moustapha Cissé, Yann Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.
Jize Zhang, Bhavya Kailkhura, and T. Han. Mix-n-Match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning (ICML), Vienna, Austria, July 2020.

This document supplements the presentation of Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning in the main paper with the following: A1: no extra normalization after K class-wise calibrations; A2: S vs. S_k as empirical approximations to p(λ_k, y_k) for bin optimization in Sec. 3.2; ...; A10: extend Tab. 2 in Sec. 4.2 for more scaling methods, datasets and models.

Comparison of sCW binning methods for ImageNet InceptionResNetV2. As sCW binning creates ties among the top predictions, the ACCs initially reported in Tab. 1 of Sec. 4.1 use the class index as the secondary sorting criterion. Here, we add ACC*_top1 and ACC*_top5, which are attained by using the raw logits as the secondary sorting criterion. As the CW ECEs are not affected by this change, we only report the new top-1 ECE*.

ECEs of scaling methods under various evaluation schemes for ImageNet InceptionResNetV2. Overall, we consider five evaluation schemes: (1) dECE: equal-size binning; (2) mECE: equal-mass binning; (3) kECE: MSE-based KMeans clustering; (4) iECE: I-Max binning; (5) kdeECE: KDE. The HDE-based schemes, i.e., (1)-(4), use 10² bins.

Method               | Acc   | dECE            | mECE            | kECE            | iECE            | kdeECE          | mean ECE
…                    | …     | 0.0597 ± 0.0007 | 0.0593 ± 0.0008 | 0.0613 ± 0.0008 | 0.0634 ± 0.0008 | 0.1372 ± 0.0028 | 0.0762
Vec Scal. w. L2 reg. | 80.53 | 0.0494 ± 0.0002 | 0.0472 ± 0.0004 | 0.0498 ± 0.0003 | 0.0531 ± 0.0003 | 0.0805 ± 0.0010 | 0.0560
Mtx Scal. w. L2 reg. | 80.78 | 0.0508 ± 0.0003 | 0.0488 ± 0.0004 | 0.0512 ± 0.0005 | 0.0544 ± 0.0004 | 0.0898 ± 0.0011 | 0.0590

ECEs of post-hoc and during-training calibration. A WRN CIFAR100 classifier is trained in three modes: 1) no during-training calibration; 2) using entropy regularization (Entr. Reg.); and 3) using Mixup.

To calibrate a DNN-based classifier, there exist two groups of methods: one improves the calibration during training, whereas the other performs post-hoc calibration. In this paper, we focus on post-hoc calibration because it is simple and does not require re-training of deployed models. In the following, we briefly discuss the advantages and disadvantages of post-hoc and during-training calibration.

It can be seen in Tab. A4 that, compared to the Baseline model (without during-training calibration via Entr. Reg. or Mixup), Entr. Reg. improves the top1 ECE from 0.06880 to 0.04806. Further applying post-hoc calibration, I-Max and I-Max w. GP reduce the 0.04806 to 0.02202 and 0.01562, respectively. This indicates that their combination is beneficial. In this particular case, we also observed that without Entr. Reg., directly post-hoc calibrating the Baseline model appears to be more effective, e.g., a top1 ECE of 0.01428 and a class-wise ECE of 0.04574. Switching to Mixup, the best top1 ECE of 0.01364 is attained by combining Mixup with post-hoc I-Max w. GP, while I-Max alone, without during-training calibration, is still the best at class-wise ECE.

While post-hoc calibrators are simple and effective at calibration, during-training techniques may deliver more than improved calibration, e.g., better generalization performance and robustness against adversarial examples. Therefore, instead of choosing either a post-hoc or a during-training technique, we recommend their combination: during-training techniques improve the generalization and robustness of the Baseline classifier, and post-hoc calibration can further boost its calibration at a low computational cost.

The CIFAR models include DenseNet (Huang et al., 2017) and ResNeXt8x64 (Xie et al., 2017). The ImageNet and CIFAR models are publicly available pre-trained networks; details are reported at the respective websites, i.e., ImageNet classifiers: https://github.com/Cadene/pretrained-models.pytorch and CIFAR classifiers: https://github.com/bearpaw/pytorch-classification.

A8.2 TRAINING SCALING METHODS

The hyper-parameters were decided based on the respective original publications of the scaling methods, with some exceptions; we found that the following parameters were the best for all scaling methods. All scaling methods use the Adam optimizer with batch size 256 for CIFAR and 4096 for ImageNet.
The learning rate was set to 10^-3 for temperature scaling (Guo et al., 2017) and Platt scaling (Platt, 1999), 10^-4 for vector scaling (Guo et al., 2017), and 10^-5 for matrix scaling (Guo et al., 2017). Matrix scaling was further regularized, as suggested by Kull et al. (2019), with an L2 loss on the bias vector and the off-diagonal elements of the weighting matrix. The hyper-parameters of BBQ (Naeini et al., 2015), isotonic regression (Zadrozny & Elkan, 2002) and Beta calibration (Kull et al., 2017) were taken directly from Wenger et al. (2020). I-Max bin optimization started from a k-means++ initialization, which uses the JSD instead of the Euclidean metric as the distance measure, see Sec. A3.4. Then, we iteratively and alternately updated {g_m} and {φ_m} according to (5) for up to 200 iterations. With the attained bin edges {g_m}, we set the bin representatives {r_m} based on the empirical frequency of class 1. If a scaling method is combined with binning, an alternative setting for {r_m} is to take the averaged prediction probabilities, based on the scaled logits of the samples in each bin, e.g., in Tab. 2 in Sec. 4.2. Note that, for CW binning in Tab. 1, the number of samples from the minority class is small, i.e., 25k/1k = 25. This leaves only about 25/15 ≈ 2 samples per bin, which is too few for empirical frequency estimates; alternatively, we set {r_m} based on the raw prediction probabilities. For ImageNet and CIFAR10/100, whose test sets have uniform class priors, the used sCW setting shares one binning scheme among all classes. For the imbalanced multi-class SVHN setting, we instead share binning among classes with similar class priors.
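The representative-setting step described above can be sketched in one dimension as follows. This is our own minimal illustration, not the paper's implementation: the bin edges are quantile stand-ins rather than I-Max-optimized edges, and the function names and the min_count threshold are hypothetical.

```python
import numpy as np

def fit_representatives(logits, labels, edges, min_count=5, probs=None):
    """Bin representatives {r_m} for a FIXED set of bin edges {g_m}.

    Default: the empirical frequency of class 1 in each bin.
    Fallback (e.g., for sample-starved bins, or when combining with a
    scaling method): the mean predicted probability within the bin."""
    bins = np.digitize(logits, edges)          # bin index per sample
    reps = np.empty(len(edges) + 1)
    for m in range(len(edges) + 1):
        mask = bins == m
        if mask.sum() >= min_count:
            reps[m] = labels[mask].mean()      # empirical class-1 frequency
        elif probs is not None and mask.any():
            reps[m] = probs[mask].mean()       # averaged scaled probabilities
        else:
            reps[m] = 0.5                      # uninformative default
    return reps

def calibrate(logits, edges, reps):
    """Map each logit to the representative of its bin."""
    return reps[np.digitize(logits, edges)]

# Synthetic one-vs-rest problem whose true p(y=1|logit) is a sigmoid.
rng = np.random.default_rng(1)
logits = rng.normal(0.0, 2.0, 5000)
labels = (rng.random(5000) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
edges = np.quantile(logits, np.linspace(0.1, 0.9, 9))   # stand-in bin edges
reps = fit_representatives(logits, labels, edges)
```

Because the representatives are fit separately from the edges, one can swap in scaled probabilities per bin without re-running the edge optimization, which is the disentanglement the paragraph relies on.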

Tab. 1 Extension: ImageNet - InceptionResNetV2

Method       | Acc top1     | Acc top5     | CW ECE          | top1 ECE        | NLL
Baseline     | 80.33 ± 0.15 | 95.10 ± 0.15 | 0.0486 ± 0.0003 | 0.0357 ± 0.0009 | 0.8406 ± 0.0095
Eq. Mass 25k | 7.78 ± 0.15  | 27.92 ± 0.71 | 0.0016 ± 0.0001 | 0.0606 ± 0.0013 | 3.5960 ± 0.0137
Eq. Mass 1k  | 5.02 ± 0.13  | 26.75 ± 0.37 | 0.0022 ± 0.0001 | 0.0353 ± 0.0012 | 3.5272 ± 0.0142
Eq. Size 25k | 78.52 ± 0.15 | 89.06 ± 0.13 | 0.1344 ± 0.0005 | 0.0547 ± 0.0017 | 1.5159 ± 0.0136
Eq. Size 1k  | 80.14 ± 0.23 | 88.99 ± 0.12 | 0.1525 ± 0.0023 | 0.0279 ± 0.0043 | 1.2671 ± 0.0130
I-Max 25k    | 80.27 ± 0.17 | 95.01 ± 0.19 | 0.0342 ± 0.0006 | 0.0329 ± 0.0010 | 0.8499 ± 0.0105
I-Max 1k     | 80.20 ± 0.18 | 94.86 ± 0.17 | 0.0302 ± 0.0041 | 0.0200 ± 0.0033 | 0.7860 ± 0.0208

Tab. 1 Extension: ImageNet - DenseNet

Method       | Acc top1     | Acc top5     | CW ECE          | top1 ECE        | NLL
Baseline     | 77.21 ± 0.12 | 93.51 ± 0.14 | 0.0502 ± 0.0006 | 0.0571 ± 0.0014 | 0.9418 ± 0.0120
Eq. Mass 25k | 18.48 ± 0.19 | 45.12 ± 0.26 | 0.0017 ± 0.0000 | 0.1657 ± 0.0020 | 2.9437 ± 0.0162
Eq. Mass 1k  | 17.21 ± 0.47 | 45.69 ± 1.22 | 0.0054 ± 0.0004 | 0.1572 ± 0.0047 | 2.9683 ± 0.0561
Eq. Size 25k | 74.34 ± 0.28 | 88.27 ± 0.11 | 0.1272 ± 0.0011 | 0.0660 ± 0.0018 | 1.6699 ± 0.0165
Eq. Size 1k  | 77.06 ± 0.28 | 88.22 ± 0.10 | 0.1519 ± 0.0016 | 0.0230 ± 0.0050 | 1.3948 ± 0.0105
I-Max 25k    | 77.07 ± 0.13 | 93.40 ± 0.17 | 0.0334 ± 0.0004 | 0.0577 ± 0.0008 | 0.9492 ± 0.0130
I-Max 1k     | 77.13 ± 0.14 | 93.34 ± 0.17 | 0.0263 ± 0.0119 | 0.0201 ± 0.0088 | 0.9229 ± 0.0103

Tab. 1 Extension: ImageNet - ResNet152

Method       | Acc top1     | Acc top5     | CW ECE          | top1 ECE        | NLL
Baseline     | 78.33 ± 0.17 | 94.00 ± 0.14 | 0.0500 ± 0.0004 | 0.0512 ± 0.0018 | 0.8760 ± 0.0133
Eq. Mass 25k | 17.45 ± 0.10 | 44.87 ± 0.37 | 0.0017 ± 0.0000 | 0.1555 ± 0.0010 | 2.9526 ± 0.0168
Eq. Mass 1k  | 16.25 ± 0.54 | 45.53 ± 0.81 | 0.0064 ± 0.0004 | 0.1476 ± 0.0054 | 2.9471 ± 0.0556
Eq. Size 25k | 75.50 ± 0.28 | 88.85 ± 0.19 | 0.1223 ± 0.0008 | 0.0604 ± 0.0017 | 1.6012 ± 0.0252
Eq. Size 1k  | 78.24 ± 0.16 | 88.81 ± 0.19 | 0.1480 ± 0.0015 | 0.0286 ± 0.0053 | 1.3308 ± 0.0178
I-Max 25k    | 78.24 ± 0.16 | 93.91 ± 0.17 | 0.0334 ± 0.0005 | 0.0521 ± 0.0015 | 0.8842 ± 0.0135
I-Max 1k     | 78.19 ± 0.21 | 93.82 ± 0.17 | 0.0295 ± 0.0030 | 0.0196 ± 0.0049 | 0.8638 ± 0.0135

Tab. 1 Extension: CIFAR100 - WRN

Method      | Acc top1     | CW ECE          | top1 ECE        | NLL
Baseline    | 81.35 ± 0.13 | 0.1113 ± 0.0010 | 0.0748 ± 0.0018 | 0.7816 ± 0.0076
Eq. Mass 5k | 60.78 ± 0.62 | 0.0129 ± 0.0010 | 0.4538 ± 0.0074 | 1.1084 ± 0.0117
Eq. Mass 1k | 62.04 ± 0.53 | 0.0252 ± 0.0032 | 0.4744 ± 0.0049 | 1.1789 ± 0.0308
Eq. Size 5k | 80.39 ± 0.36 | 0.1143 ± 0.0013 | 0.0783 ± 0.0032 | 1.0772 ± 0.0184
Eq. Size 1k | 81.12 ± 0.15 | 0.1229 ± 0.0030 | 0.0273 ± 0.0055 | 1.0165 ± 0.0105
I-Max 5k    | 81.22 ± 0.12 | 0.0692 ± 0.0020 | 0.0751 ± 0.0024 | 0.7878 ± 0.0090
I-Max 1k    | 81.30 ± 0.22 | 0.0518 ± 0.0036 | 0.0231 ± 0.0067 | 0.7593 ± 0.0085

Tab. 1 Extension: CIFAR100 - ResNeXt

Method      | Acc top1     | CW ECE          | top1 ECE        | NLL
Baseline    | … ± 0.08     | 0.0979 ± 0.0015 | 0.0590 ± 0.0028 | 0.7271 ± 0.0026
Eq. Mass 5k | 63.02 ± 0.54 | 0.0131 ± 0.0012 | 0.4764 ± 0.0057 | 1.0535 ± 0.0191
Eq. Mass 1k | 64.48 ± 0.64 | 0.0265 ± 0.0011 | 0.4980 ± 0.0070 | 1.1232 ± 0.0277
Eq. Size 5k | 80.81 ± 0.26 | 0.1070 ± 0.0008 | 0.0700 ± 0.0030 | 1.0178 ± 0.0066
Eq. Size 1k | 81.99 ± 0.21 | 0.1195 ± 0.0013 | 0.0230 ± 0.0033 | 0.9556 ± 0.0071
I-Max 5k    | 81.99 ± 0.08 | 0.0601 ± 0.0027 | 0.0627 ± 0.0034 | 0.7318 ± 0.0026
I-Max 1k    | 81.96 ± 0.14 | 0.0549 ± 0.0081 | 0.0205 ± 0.0074 | 0.7127 ± 0.0040

Tab. 1 Extension: CIFAR100 - DenseNet

Method      | Acc top1     | CW ECE          | top1 ECE        | NLL
Baseline    | 82.36 ± 0.26 | 0.1223 ± 0.0008 | 0.0762 ± 0.0015 | 0.7542 ± 0.0143
Eq. Mass 5k | 57.23 ± 0.50 | 0.0117 ± 0.0011 | 0.4173 ± 0.0051 | 1.1819 ± 0.0228
Eq. Mass 1k | 58.11 ± 0.21 | 0.0233 ± 0.0005 | 0.4339 ± 0.0024 | 1.2049 ± 0.0405
Eq. Size 5k | 81.35 ± 0.23 | 0.1108 ± 0.0017 | 0.0763 ± 0.0029 | 1.0207 ± 0.0183
Eq. Size 1k | 82.22 ± 0.30 | 0.1192 ± 0.0024 | 0.0219 ± 0.0021 | 0.9482 ± 0.0137
I-Max 5k    | 82.35 ± 0.26 | 0.0740 ± 0.0007 | 0.0772 ± 0.0010 | 0.7618 ± 0.0145
I-Max 1k    | 82.32 ± 0.22 | 0.0546 ± 0.0122 | 0.0189 ± 0.0071 | 0.7022 ± 0.0124

Tab. 1 Extension: CIFAR10 - WRN

Method      | Acc top1     | CW ECE          | top1 ECE        | NLL
Baseline    | 96.12 ± 0.14 | 0.0457 ± 0.0011 | 0.0288 ± 0.0007 | 0.1682 ± 0.0062
Eq. Mass 5k | 91.06 ± 0.54 | 0.0180 ± 0.0045 | 0.0794 ± 0.0066 | 0.2066 ± 0.0091
Eq. Mass 1k | 91.24 ± 0.27 | 0.0212 ± 0.0009 | 0.0836 ± 0.0091 | 0.2252 ± 0.0220
Eq. Size 5k | … ± 0.14     | 0.0344 ± 0.0008 | 0.0290 ± 0.0013 | 0.2231 ± 0.0074
Eq. Size 1k | 96.04 ± 0.15 | 0.0278 ± 0.0021 | 0.0105 ± 0.0028 | 0.2744 ± 0.0812
I-Max 5k    | 96.10 ± 0.14 | 0.0329 ± 0.0011 | 0.0276 ± 0.0007 | 0.1704 ± 0.0067
I-Max 1k    | 96.06 ± 0.13 | 0.0304 ± 0.0012 | 0.0113 ± 0.0039 | 0.1595 ± 0.0604

Tab. 1 Extension: CIFAR10 - ResNeXt

Method      | Acc top1     | CW ECE          | top1 ECE        | NLL
Baseline    | 96.30 ± 0.18 | 0.0485 ± 0.0014 | 0.0201 ± 0.0021 | 0.1247 ± 0.0058
Eq. Mass 5k | 89.40 ± 0.55 | 0.0168 ± 0.0037 | 0.0589 ± 0.0052 | 0.2011 ± 0.0085
Eq. Mass 1k | 89.85 ± 0.61 | 0.0269 ± 0.0051 | 0.0676 ± 0.0127 | 0.2208 ± 0.0172
Eq. Size 5k | 96.30 ± 0.20 | 0.0274 ± 0.0013 | 0.0174 ± 0.0013 | 0.1613 ± 0.0101
Eq. Size 1k | 96.17 ± 0.24 | 0.0288 ± 0.0039 | 0.0114 ± 0.0025 | 0.2495 ± 0.0571
I-Max 5k    | 96.26 ± 0.20 | 0.0240 ± 0.0020 | 0.0167 ± 0.0014 | 0.1264 ± 0.0066
I-Max 1k    | 96.22 ± 0.21 | 0.0254 ± 0.0030 | 0.0104 ± 0.0025 | 0.1397 ± 0.0276

Tab. 1 Extension: CIFAR10 - DenseNet

Method      | Acc top1     | CW ECE          | top1 ECE        | NLL
…           | … ± 0.36     | 0.0137 ± 0.0039 | 0.0657 ± 0.0041 | 0.2283 ± 0.0101
Eq. Size 1k | 96.64 ± 0.22 | 0.0262 ± 0.0035 | 0.0101 ± 0.0035 | 0.2465 ± 0.0543
Eq. Size 5k | 96.74 ± 0.07 | 0.0301 ± 0.0012 | 0.0242 ± 0.0013 | 0.1912 ± 0.0075
I-Max 1k    | 96.59 ± 0.32 | 0.0261 ± 0.0025 | 0.0098 ± 0.0027 | 0.1208 ± 0.0044
I-Max 5k    | 96.71 ± 0.09 | 0.0284 ± 0.0013 | 0.0233 ± 0.0009 | 0.1608 ± 0.0086

Tab. 2 Extension: ImageNet - InceptionResNetV2



Tab. 2 Extension: CIFAR100 - WRN

Tab. 2 Extension: CIFAR100 - DenseNet

Tab. 2 Extension: CIFAR10 - ResNeXt

