IMPROVED UNCERTAINTY POST-CALIBRATION VIA RANK-PRESERVING TRANSFORMS

Anonymous

Abstract

Modern machine learning models with high accuracy often exhibit poor uncertainty calibration: the output probabilities of the model do not reflect its accuracy and tend to be over-confident. Existing post-calibration methods such as temperature scaling recalibrate a trained model using simple calibrators with one or a few parameters, which can have rather limited capacity. In this paper, we propose Neural Rank-Preserving Transforms (NRPT), a new post-calibration method that adjusts the output probabilities of a trained classifier using a calibrator of higher capacity, while maintaining its prediction accuracy. NRPT learns a calibrator that preserves the rank of the probabilities through general monotonic transforms, individualizes to the original input, and allows learning with any loss function that encourages calibration. We show experimentally that NRPT improves the expected calibration error (ECE) significantly over existing post-calibration methods such as (local) temperature scaling on large-scale image and text classification tasks. The performance of NRPT can further match ensemble methods such as deep ensembles, while being much more parameter-efficient. We further demonstrate the improved calibration ability of NRPT beyond the ECE metric, such as the accuracy among top-confidence predictions, as well as optimizing the tradeoff between calibration and sharpness.

1. INTRODUCTION

Modern machine learning models such as deep neural networks have achieved high performance on many challenging tasks, and have been deployed in production systems that impact billions of people (LeCun et al., 2015). It is increasingly critical that the outputs of these models are comprehensible and safe to use in downstream applications. However, high-accuracy classification models often exhibit the failure mode of miscalibration: the output probabilities of these models do not reflect the true accuracies, and tend to be over-confident (Guo et al., 2017; Lakshminarayanan et al., 2017). As the output probabilities are typically interpreted as (an estimate of) true accuracies and used in downstream applications, miscalibration can negatively impact decision making, and is especially dangerous in risk-sensitive domains such as medical AI (Begoli et al., 2019; Jiang et al., 2012) or self-driving cars (Michelmore et al., 2018). An important question is how to properly calibrate these models so as to make the output probabilities more trustworthy and safer to use.

Existing methods for uncertainty calibration can roughly be divided into two types. Diversity-based methods such as ensembles (Lakshminarayanan et al., 2017; Wen et al., 2020) and Bayesian networks (Gal & Ghahramani, 2016; Maddox et al., 2019; Dusenberry et al., 2020) work by aggregating the predicted probability over multiple models, or over multiple runs of a randomized model. These methods are able to improve both the accuracy and the uncertainty calibration over a single deterministic model (Ovadia et al., 2019). However, deploying these models requires either storing all the ensemble members and/or running multiple random variants of the same model, which makes them memory-expensive and runtime-inefficient.
On the other hand, post-calibration methods work by learning a calibrator on top of the output probabilities (or logits) of an existing well-trained model (Platt et al., 1999; Zadrozny & Elkan, 2001; 2002; Guo et al., 2017; Ding et al., 2020). For a K-class classification model that outputs logits z = ẑ(x) ∈ R^K, post-calibration methods learn a calibrator f : R^K → R^K using additional holdout data, so that f(z) is better calibrated than the original z. The architectures of such calibrators are typically simple: a prevalent example is the temperature scaling method, which learns f_T(z) = z/T with a single trainable parameter T > 0 by minimizing the negative log-likelihood (NLL) loss on holdout data. Such simple calibrators add no overhead to the existing model, and are empirically shown to improve calibration significantly on a variety of tasks and models (Guo et al., 2017). Despite this empirical success, the design of post-calibration methods is not yet fully satisfactory: in practice, simple calibrators such as temperature scaling often underfit the calibration loss on their training data, whereas more complex calibrators can often overfit; see Figure 1 for a quantitative illustration of this effect.

[Figure 1 caption: Temperature scaling minimizes the training and validation NLL reasonably well (and improves the ECE), but still underfits the NLL. Matrix scaling learns a higher-capacity matrix calibrator and minimizes the training NLL better, but does not improve the ECE, since the calibrator does not maintain the accuracy and is encouraged to improve the accuracy instead of the calibration. Our Neural Rank-Preserving Transforms (NRPT) learns a higher-capacity calibrator that preserves the accuracy, and improves both the training/validation NLL and the ECE.]
While the underfitting of simple calibrators is perhaps due to their limited expressive power, the overfitting of complex calibrators is also believed to be natural, since the holdout dataset used for training the calibrators is typically small (e.g. a few thousand examples). One concrete example is the matrix scaling method, which learns a matrix calibrator f_{W,b}(z) = Wz + b involving O(K^2) trainable parameters. When K is large, matrix scaling often tends to overfit and hurt calibration, despite being a strict generalization of temperature scaling (Guo et al., 2017). It is further observed that the overfitting cannot be easily fixed by applying common regularizers such as L2 on the calibrator (Kull et al., 2019). This empirical evidence seems to suggest that complex calibrators with a large number of parameters are perhaps not recommended when designing post-calibration methods.

In this paper, we show that, in contrast to this prior belief, large calibrators do not necessarily overfit; it is rather the lack of an accuracy constraint on the calibrator that may have caused the overfitting. Observe that matrix scaling, unlike temperature scaling, is not guaranteed to maintain the accuracy of the model: it applies a general affine transform z → Wz + b on the logits and can modify their rank (and thus the predicted top label), whereas temperature scaling is guaranteed to preserve the rank. When trained with the NLL loss, a calibrator that does not maintain the accuracy may attempt to improve the accuracy at the cost of hurting (or not improving) the calibration. Motivated by this observation, this paper proposes Neural Rank-Preserving Transforms (NRPT), a method for learning calibrators that maintain the accuracy of the model, yet are complex enough to yield better calibration performance than simple calibrators such as temperature scaling.
Our key idea is that a sufficient condition for the calibrator to maintain the accuracy is for it to preserve the rank of the logits: any mapping that preserves the rank of the logits will not change the predicted top label. We instantiate this idea by designing a family of calibrators that perform entrywise monotone transforms on each individual logit (or log-probability): for a K-class classification problem, NRPT scales each logit as z_i → f(z_i, x), where z_i ∈ R is the i-th logit (1 ≤ i ≤ K), x ∈ R^d is the original input features, and f : R × R^d → R is monotonically increasing in its first argument but otherwise arbitrary. As f is monotone, we have f(z_1, x) ≤ f(z_2, x) whenever z_1 ≤ z_2, and thus f preserves the rank of the logits. This method strictly generalizes temperature scaling (which uses f(z_i, x) = z_i/T) and local temperature scaling (which uses f(z_i, x) = z_i/T(x)) (Ding et al., 2020). The fact that f can depend on x further improves the expressivity of f and allows great flexibility in the architecture design. We compare our instantiation of NRPT against temperature scaling and matrix scaling in Figure 1, where we see that NRPT is indeed able to fit the training loss better than temperature scaling and does not suffer from overfitting.

Our contributions We propose Neural Rank-Preserving Transforms (NRPT), an improved method for performing uncertainty post-calibration on a trained classifier while maintaining its accuracy (Section 3). NRPT learns calibrators that scale the logits using general monotone transforms, are individualized to the original input features, and allow learning with any calibration loss function (not restricted to those that correlate with the accuracy). We show experimentally that NRPT improves the expected calibration error (ECE) significantly over existing post-calibration methods on large-scale image and text classification tasks such as CIFAR-100, ImageNet, and MNLI (Section 4.1).
NRPT can further match diversity-based methods such as deep ensembles, while using far fewer additional parameters. We further demonstrate the strong calibration ability of NRPT beyond the ECE, by showing that it improves the accuracy among top-confidence predictions, as well as the tradeoff between ECE and sharpness of prediction (Section 4.2). Due to space constraints, we defer discussions of additional related work to Appendix A, and additional experimental details and results to the later appendices.

2. BACKGROUND ON UNCERTAINTY CALIBRATION

We consider K-class classification problems where X ∈ R^d is the input (features), Y ∈ [K] := {1, . . . , K} is the true label, and (X, Y) follows some underlying joint distribution. Let p̂ : R^d → ∆_K be a prediction model (for example, a neural network learned from data) that maps inputs to probabilities, where ∆_K := {(p_1, . . . , p_K) : p_i ≥ 0, Σ_i p_i = 1} is the set of all probability distributions on [K]. We say p̂ is perfectly calibrated if P(Y = k | p̂(X) = p) = p_k for all p ∈ ∆_K, k ∈ [K]. In other words, a model is perfectly calibrated if, whenever the model predicts p̂(X) = p, the conditional distribution of Y is exactly p. It is difficult to evaluate perfect calibration from finite data, as for almost all p we do not receive samples that satisfy the exact conditioning p̂(x) = p. This motivates considering alternative scalar metrics for calibration that can be estimated from data.

ECE The Expected Calibration Error (ECE) is a commonly used metric that measures calibration by grouping examples according to the confidence (i.e. the top predicted probability) (Naeini et al., 2015; Guo et al., 2017). Let {(x_i, y_i)}_{i=1}^n be the evaluation dataset on which we wish to evaluate the calibration of a model p̂. Define the intervals I_m = ((m−1)/M, m/M], where M > 0 is a (fixed) number of bins, and partition the examples into M bins according to the confidence: B_m = {i : max_k p̂(x_i)_k ∈ I_m}. Define the accuracy and confidence within B_m as

acc(B_m) := (1/|B_m|) Σ_{i∈B_m} 1[arg max_k p̂(x_i)_k = y_i],  conf(B_m) := (1/|B_m|) Σ_{i∈B_m} max_k p̂(x_i)_k.

The ECE is then defined as the (weighted) average difference between accuracy and confidence:

ECE(p̂) := Σ_{m=1}^M (|B_m|/n) |acc(B_m) − conf(B_m)|.   (1)

The ECE is a sensible calibration metric since it is a binned approximation of the top-label calibration error (TCE), which measures the difference between accuracy and confidence under exact conditioning:

TCE(p̂) := E[ | P(arg max_k p̂(X)_k = Y | max_k p̂(X)_k) − max_k p̂(X)_k | ].
Debiased ECE Recent work shows that the ECE has an inherent positive bias, and proposes the Debiased ECE, which approximately removes this bias using Gaussian bootstrapping (Kumar et al., 2019):

DECE(p̂) := ECE(p̂) − E_{R_{1:M}}[ Σ_{m=1}^M (|B_m|/n) |conf(B_m) − R_m| − ECE(p̂) ],  where R_m ∼ N( acc(B_m), acc(B_m)(1 − acc(B_m)) / |B_m| ).

Kumar et al. (2019) showed that the debiased ECE is typically a more accurate estimator of the TCE than the ECE, especially when the TCE is relatively small. In our experiments, we use both the ECE and the debiased ECE for evaluating calibration.

NLL The Negative Log-Likelihood (NLL), typically used as the loss function for training classifiers, is also a measure of calibration:

NLL(p̂) := (1/n) Σ_{i=1}^n −log p̂(x_i)_{y_i}.   (2)

The NLL is a proper scoring rule (Lakshminarayanan et al., 2017), in the sense that the population minimizer over all possible p̂ is achieved at the ground-truth conditional distribution p* (Hastie et al., 2009). In general, the NLL measures the distance between p̂ and p*, and is thus a joint metric of accuracy and calibration.

Predictive entropy (sharpness) While we are mostly concerned with the accuracy and calibration of a model, these two metrics alone do not fully guarantee proper uncertainty quantification. For example, any high-accuracy model can be calibrated in a "trivial" way such that the ECE becomes exactly 0: map the confidence on all examples to be equal to the (overall) accuracy of the model, and rescale the non-top probabilities accordingly. In order to prevent such trivial calibration, we additionally measure the sharpness of the predictions using the predictive entropy (Lakshminarayanan et al., 2017):

PEnt(p̂) := (1/n) Σ_{i=1}^n Σ_{k=1}^K −p̂(x_i)_k log p̂(x_i)_k.

Lower predictive entropy indicates sharper predictions (i.e. predictions closer to delta distributions than to the uniform distribution). In general, the predictive entropy is not necessarily related to the calibration; however, for models that have the same accuracy, we observe that the predictive entropy is typically negatively correlated with calibration: sharper predictions are usually less calibrated.
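As a concrete reference, the ECE and predictive entropy metrics above can be computed in a few lines of NumPy. This is a minimal sketch (function names are ours); the binning follows the convention I_m = ((m−1)/M, m/M] from the definition of the ECE:

```python
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error (Eq. (1)) with bins I_m = ((m-1)/M, m/M]."""
    conf = probs.max(axis=1)                       # top predicted probability
    correct = (probs.argmax(axis=1) == labels).astype(float)
    # map each confidence to its bin index in {0, ..., M-1}
    bin_idx = np.clip(np.ceil(conf * n_bins).astype(int) - 1, 0, n_bins - 1)
    n, total = len(labels), 0.0
    for m in range(n_bins):
        mask = bin_idx == m
        if mask.any():
            # weighted |acc(B_m) - conf(B_m)|
            total += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return total

def predictive_entropy(probs, eps=1e-12):
    """PEnt: average entropy of the predicted distributions (lower = sharper)."""
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())
```

For example, a batch of predictions that all place probability 0.8 on a wrong class has ECE 0.8 (confidence 0.8, accuracy 0), while a maximally unsharp binary prediction has predictive entropy log 2.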

3. RANK PRESERVING TRANSFORMS

We now introduce our main algorithm, Neural Rank-Preserving Transforms (NRPT), for performing post-calibration on trained classifiers. Throughout this section, we consider K-class classification problems (K ≥ 2), and let ẑ : R^d → R^K denote the input-to-logit mapping of a trained classifier. The predicted probabilities of the model are the softmax of the logits:

p̂(x) = σ_SM(ẑ(x)) = [ exp(ẑ(x)_k) / Σ_{j∈[K]} exp(ẑ(x)_j) ]_{k∈[K]}.

Temperature Scaling We begin by reviewing temperature scaling, a simple yet strong baseline method for post-calibration. Temperature scaling recalibrates a model by scaling down the logits using a single temperature parameter T > 0:

f_T(ẑ) = f_T(ẑ_1, . . . , ẑ_K) = [ẑ_1/T, . . . , ẑ_K/T] = ẑ/T,   (3)

and uses σ_SM(f_T(ẑ)) as the calibrated probabilities. The parameter T is typically learned by minimizing the NLL loss on a held-out calibration dataset. Temperature scaling clearly preserves the rank of the logits, and is observed to improve both the NLL and the ECE on test data by learning a temperature parameter that is typically above one on large, over-confident models (Guo et al., 2017). However, as we have seen in Figure 1, temperature scaling often does not minimize the (training) NLL on the calibration dataset well, due to its limited model capacity.

Individualization The first building block of our algorithm is to individualize (or localize) temperature scaling, an idea recently proposed in the Local Temperature Scaling (LTS) method (Ding et al., 2020): each input x ∈ R^d can have its own temperature T(x) > 0. This still preserves the rank of the logits, but can substantially increase the capacity of the calibrator, as the temperature can now adapt to the raw input. Formally, we calibrate the model by scaling down the logits using an individualized temperature model T_θ(x) > 0:

f_{T_θ}(ẑ; x) = [ẑ_1/T_θ(x), . . . , ẑ_K/T_θ(x)].   (4)
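The temperature scaling baseline above can be illustrated with a minimal sketch that fits a single global T by one-dimensional grid search over the held-out NLL. The data is synthetic and the grid-search routine is our own illustration (a real implementation would typically use gradient descent on T):

```python
import numpy as np

def nll(logits, labels, T):
    """Held-out NLL of the temperature-scaled model sigma_SM(z / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                     # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the T > 0 minimizing the held-out NLL (1-D grid search)."""
    return grid[np.argmin([nll(logits, labels, T) for T in grid])]

# synthetic held-out logits: the correct class logit is inflated ~70% of the time
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=2000)
logits = rng.normal(size=(2000, 10))
logits[np.arange(2000), labels] += 3.0 * (rng.random(2000) < 0.7)
T_hat = fit_temperature(logits, labels)
```

Note that rescaling by any fitted T cannot change the argmax, so the accuracy of the base model is preserved by construction.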
We require the temperature model T_θ(x) to always output a positive scalar, but it can otherwise have an arbitrary architecture. In our experiments, we find that LTS with the right choice of architecture can consistently outperform temperature scaling.

Rank-preserving transforms via general monotone calibrators We now introduce our key idea of performing general monotone calibration, which preserves the rank of the logits and can be even more flexible than local temperature scaling. Our key observation is that the fundamental property that allows (local) temperature scaling (3) and (4) to preserve the rank is their monotonicity: ẑ_i ≤ ẑ_j guarantees f(ẑ)_i ≤ f(ẑ)_j. Further, this is satisfied for (3) and (4) because the calibrator applies an entrywise, monotonically increasing function to each logit. Motivated by this, we consider general monotone calibrators of the form

f_θ(ẑ; x) = [g_θ(ẑ_1; x), . . . , g_θ(ẑ_K; x)],  where g_θ(z; x) is monotonically increasing in z for all x.   (5)

Observe that both temperature scaling (g(ẑ_i; x) = ẑ_i/T) and local temperature scaling (g(ẑ_i; x) = ẑ_i/T(x)) are special cases of (5), and under this perspective are still limited in capacity, as the g used in both cases is linear in ẑ_i for any given x. To make the calibrator more expressive, we would rather learn an arbitrary g under the monotonicity constraint.

Instantiation via monotone two-layer networks We now explain how we design a function class g_θ(z; x) that is monotone in z for any x and not too restricted in its expressivity. Existing techniques for building such monotone function classes include classical non-parametric methods such as isotonic regression (Barlow & Brunk, 1972), as well as more sophisticated tricks such as parametrizing the derivative (d/dz) g_θ(z; x) by a non-negative neural network (Wehenkel & Louppe, 2019). However, for the purpose of designing calibrators, we prefer a simpler parametric class that enables efficient gradient-based learning.
We achieve this by using a class of two-layer neural networks in ẑ_i, with coefficients depending on x:

g_θ(z_i; x) := Σ_{j=1}^M a_j φ( z_i / T_θj(x) − b_θj(x) ),  where a_j ≥ 0, T_θj(x) > 0, and φ is a monotonically increasing nonlinearity.   (6)

It is straightforward to see that (6) is guaranteed to be monotonically increasing in z_i for any x, as desired. Further, by choosing a proper φ and using a large number of neurons M, (6) can express a fairly large class of monotonic functions in z_i for any fixed x. We also note that (6) recovers local temperature scaling (4) if we take φ(t) = t to be the identity mapping (as g then becomes linear in z_i), and therefore has strictly higher expressivity.

Architectural choices For implementing the calibrator (5) and (6), in theory one is free to use any architecture as long as T_θj(x) > 0 is guaranteed. However, we observe experimentally that reusing the representation of the trained classifier and weight sharing can help improve the calibration performance. In all our experiments, we choose [T_θj(x), b_θj(x)] to be a two-layer neural network on top of the last hidden layer (the pre-logit layer) of the trained classifier, with shared weights:

[T_θ1(x), . . . , T_θM(x)] = σ_temp( A_temp σ(W H(x) + b) ),  [b_θ1(x), . . . , b_θM(x)] = A_bias σ(W H(x) + b),   (7)

where H : R^d → R^{d_hid} is the last hidden layer of the trained classifier, and W ∈ R^{N×d_hid}, b ∈ R^N, and A_temp, A_bias ∈ R^{M×N} are the trainable parameters. We further use a strictly positive nonlinearity σ_temp to guarantee that the temperatures are positive and not too small.

Flexible loss functions Our final observation is that our calibrator (5)-(6) can be trained not only with the NLL loss, but with any other loss function that encourages calibration.
This is only possible for rank-preserving calibrators: calibrators that do not preserve the accuracy have to be trained using a loss that jointly encourages high accuracy and calibration (such as the NLL), as otherwise the calibrator may hurt the accuracy for the sake of achieving good calibration. We specifically propose to use the ECE (1) directly as a loss function for training the calibrator. Notice that even though the ECE is non-smooth in the model outputs (due to the binning and the non-smoothness of the argmax prediction rule), it still has a non-trivial gradient, since the average confidence conf(B_m) is differentiable with respect to the model outputs. We find that training with the ECE loss can often result in better calibration, at the cost of reducing the sharpness (see Section 4 for details). We remark that Kumar et al. (2018) considered training with the MMCE (maximum mean calibration error), a kernelized version of the ECE loss; however, to the best of our knowledge, no prior work has considered training with the ECE directly.
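To illustrate the claim that the ECE has a non-trivial gradient, the following sketch (our own illustration, not the paper's implementation) computes the ECE of a batch of confidences together with its gradient, holding the bin assignments and the argmax predictions fixed; the gradient matches finite differences away from bin boundaries:

```python
import numpy as np

def ece_loss_and_grad(conf, correct, n_bins=10):
    """ECE as a training loss, with its (piecewise) gradient w.r.t. the confidences.

    With bin assignments and predictions held fixed, the bin-m contribution
    (|B_m|/n)|conf_bar_m - acc_m| has gradient sign(conf_bar_m - acc_m)/n
    w.r.t. every confidence in the bin."""
    n = len(conf)
    bin_idx = np.clip(np.ceil(conf * n_bins).astype(int) - 1, 0, n_bins - 1)
    loss, grad = 0.0, np.zeros(n)
    for m in range(n_bins):
        mask = bin_idx == m
        if mask.any():
            gap = conf[mask].mean() - correct[mask].mean()
            loss += mask.sum() / n * abs(gap)
            grad[mask] = np.sign(gap) / n
    return loss, grad
```

The non-smoothness (from binning and from the absolute value at gap 0) only affects a measure-zero set of confidence values, so minibatch SGD on this loss is well-defined almost everywhere.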

Summary of algorithm

We summarize our NRPT algorithm as follows: (i) build a calibrator f_θ(ẑ; x) using the rank-preserving transform g_θ(z_i; x) defined in (5), (6), (7); (ii) train the calibrator f_θ by minimizing any desired loss (e.g. NLL or ECE) on a holdout calibration dataset; (iii) output the calibrated model p̂(x) = σ_SM(f_θ(ẑ(x); x)).
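The forward pass of step (i) can be sketched as follows. This is a minimal, untrained NumPy mock-up of the calibrator (5)-(7): the parameter names (W, b, A_temp, A_bias) follow (7), while σ = ReLU, σ_temp = 0.2 + relu6, and φ = leaky-ReLU are assumptions matching the experimental section; a real implementation would train θ by SGD on a calibration set rather than use random weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class NRPTCalibrator:
    """Untrained sketch of the rank-preserving calibrator in Eqs. (5)-(7)."""
    def __init__(self, d_hid, M=5, N=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(N, d_hid))   # shared first layer
        self.b = np.zeros(N)
        self.A_temp = rng.normal(scale=0.1, size=(M, N))  # temperature head
        self.A_bias = rng.normal(scale=0.1, size=(M, N))  # bias head
        self.a = np.ones(M)                               # a_j >= 0, fixed here

    def __call__(self, logits, h):
        u = np.maximum(self.W @ h + self.b, 0.0)              # sigma = ReLU
        T = 0.2 + np.clip(self.A_temp @ u, 0.0, 6.0)          # sigma_temp: T in [0.2, 6.2]
        bias = self.A_bias @ u
        pre = logits[:, None] / T[None, :] - bias[None, :]    # z_i / T_j(x) - b_j(x)
        phi = np.where(pre >= 0, pre, 0.5 * pre)              # leaky-ReLU: strictly increasing
        return phi @ self.a                                   # Eq. (6): nonnegative combination

cal = NRPTCalibrator(d_hid=8)
rng = np.random.default_rng(1)
h, z = rng.normal(size=8), rng.normal(size=5)   # hidden features H(x), logits z(x)
p = softmax(cal(z, h))                          # step (iii): calibrated probabilities
```

Because every T_j(x) ≥ 0.2 > 0, φ is strictly increasing, and the a_j are nonnegative, each logit passes through the same strictly increasing map, so the rank of the logits (and hence the predicted top label) is preserved.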

4. EXPERIMENTS

4.1 CALIBRATION ON IMAGE AND TEXT CLASSIFICATION

Tasks and models We perform calibration experiments on three benchmark tasks:
• CIFAR-100, split into 45K/5K/10K as train/calibration/val data. We train a WideResNet-28-10 (Zagoruyko & Komodakis, 2016) on the train split, achieving 80.36% accuracy.
• ImageNet ILSVRC2012 (Deng et al., 2009), split into 1.1M/100K/50K as train/calibration/val data. We train a WideResNet-50-2 on the train split, achieving 76.27% top-1 accuracy.
• MNLI, one of the largest text classification tasks in the GLUE benchmark (Wang et al., 2018), split into 350K/43K/20K as train/calibration/val data. We finetune a pretrained BERT-Base (Devlin et al., 2018) model, achieving 83.36% accuracy on the matched (MNLI-m) data and 83.77% accuracy on the mismatched (MNLI-mm) data. Calibration performance is also evaluated on MNLI-m and MNLI-mm separately.

Methods and evaluation metrics

We implement our NRPT algorithm with a two-layer neural network calibrator on top of the last hidden representation H of the trained classifier (see (6) and (7)). In the case of BERT, H is the last encoder layer at the CLS token. For all three tasks, we choose hidden dimension N = 512, the number of neurons M ∈ {5, 10}, and φ to be the leaky-ReLU activation. We choose σ_temp(t) = 0.2 + relu6(t), so that T_θj(x) is guaranteed to lie within [0.2, 6.2]. We compare our NRPT against the original uncalibrated model, as well as two existing post-calibration methods: Temperature Scaling (TS, see (3)), a strong baseline method for post-calibration (Guo et al., 2017), and Local Temperature Scaling (LTS, see (4)), a generalization of temperature scaling that is observed to achieve state-of-the-art performance on a variety of computer vision tasks (Ding et al., 2020). On CIFAR-100 and ImageNet we further compare with deep ensembles (Lakshminarayanan et al., 2017), a representative diversity-based method with strong calibration performance (Ovadia et al., 2019). We train the calibrators using either the NLL loss (2) or the ECE loss (1), with unregularized minibatch SGD on the calibration data. We use "+E" in {TS+E, LTS+E, NRPT+E} to indicate that a method is trained with the ECE loss. Additional architectural and training details can be found in Appendix B.

We evaluate the calibration methods on three types of metrics: the NLL; the ECE and debiased ECE (DECE); and the predictive entropy (PEnt), for evaluating the sharpness of the calibrated predictions.

Results Table 1 summarizes our main results. We observe that NRPT and NRPT+E consistently perform the best within each group in terms of both the NLL and the ECE metrics. In particular, the best test likelihood on CIFAR-100 and ImageNet is achieved by NRPT, and the best test ECE or debiased ECE is achieved by either NRPT or NRPT+E.
This justifies our intuition that maximizing the expressivity of the calibrator, while making sure it preserves the rank of the logits, can indeed improve the calibration performance without significant overfitting. As a side note, we find that the choice of the loss function can substantially impact the behavior of the final calibrator: training with the NLL loss typically minimizes the test NLL well and improves the ECE by a reasonable amount, whereas training with the ECE loss typically minimizes the test ECE better (at least on the image tasks), at the cost of performing slightly worse on the NLL and the sharpness. We remark that the behavior on MNLI is slightly different in that the best ECE is achieved by NRPT instead of NRPT+E, potentially because the language task differs from the image tasks. However, we do observe that NRPT+E still performs best among the methods trained with the ECE loss, again demonstrating the benefit of the improved expressivity of the NRPT calibrator.

Comparison with deep ensembles; parameter efficiency We further compare NRPT against deep ensembles (Lakshminarayanan et al., 2017) in Table 2. While deep ensembles in general can improve the accuracy and achieve much better NLL and sharpness (PEnt) due to their higher model capacity, we find that NRPT can consistently achieve a better ECE than an ensemble of 4 models, and nearly match an ensemble of 8 models on CIFAR-100. NRPT further has much lower memory overhead than the ensembles: the calibrators we use have only 0.5%-2% of the size of the original model, whereas even an ensemble of 2 models requires doubling the model size. We also compare against Monte Carlo Dropout (Gal & Ghahramani, 2016) in Appendix C, where we find that Dropout cannot simultaneously maintain the accuracy and achieve calibration as well as NRPT.

4.2. CALIBRATION PERFORMANCE BEYOND THE ECE

Accuracy among top-confidence predictions We investigate the calibration ability of NRPT beyond the ECE metric by looking at how well it improves the ranking of the confidence among individual examples. We measure the quality of the ranking by visualizing the accuracy among top-confidence predictions: we take the subset of the test set for which the calibrated confidence ranks among the top x% (e.g. 20%, 10%), and evaluate the accuracy (or error rate) among these examples. Roughly speaking, the error should become lower as we decrease the percentage x, but we can further compare this curve across different methods. In Figure 2a, we confirm that NRPT has a lower classification error than TS and LTS for the majority of the percentages, except at the very tail. This can be further quantified by the PRR (prediction rejection ratio) metric, where a higher PRR implies better accuracy among top-confidence examples (see Appendix D for the details of the PRR). In Table 3, we find that NRPT achieves a better PRR than both TS and LTS.

Tradeoff between sharpness and ECE For post-calibration methods that maintain the accuracy of the model, the calibration (e.g. ECE) is typically negatively correlated with the sharpness of the prediction (e.g. predictive entropy). We investigate the ability of NRPT to optimize this tradeoff when both metrics are desired. To test this, we train each of {TS, LTS, NRPT} on a weighted combination of the NLL and ECE losses, αL_NLL + C(1 − α)L_ECE, where we use multiple weight values α ∈ {0, 0.1, . . . , 0.7, 0.75, 0.8, . . . , 1}, and plot the resulting predictive entropy and ECE as a tradeoff curve. In Figure 2b, we see that NRPT achieves a nearly universally better tradeoff curve than TS and LTS. This suggests that the improved expressivity of NRPT can be beneficial in practice when more than one metric is desired and it is necessary to manage the tradeoff.

5. CONCLUSION

We proposed Neural Rank-Preserving Transforms (NRPT), an improved technique for uncertainty post-calibration, and showed that it outperforms existing post-calibration methods on benchmark tasks. A number of interesting research questions remain open: for example, can we gain a better understanding of the choice of architecture for the monotonic transforms used in NRPT? Can we build calibrators that combine the advantages of standard post-calibration methods and ensemble-like methods? We leave these as future work.

A ADDITIONAL RELATED WORK

Post-calibration methods The scaling-binning calibrator (Kumar et al., 2019) combines binning with parametric approaches: it first fits a parametric function to the calibration dataset and then performs the binning. While most binning-type methods are defined for binary problems, they can be extended to multiclass classification (K ≥ 3 classes) by performing the calibration on all K one-vs-rest binary tasks and re-normalizing the calibrated probabilities (Zadrozny & Elkan, 2002). The parametric methods can also be extended to the multi-class case with various degrees of freedom, including temperature scaling, vector scaling, and matrix scaling. Guo et al. (2017) tested the multi-class calibration methods on a variety of tasks and found that temperature scaling performs the best across the board. Local Temperature Scaling (Ding et al., 2020) proposes to use an individualized temperature for each example; in this paper we implement our own variant of this method as one of our baselines (see Appendix B). Dirichlet calibration (Kull et al., 2019) improves the per-class calibration by using a different Dirichlet distribution for each class as the calibrator. MMCE (Kumar et al., 2018) proposes to optimize a kernelized version of the ECE for improving the calibration; however, they do not consider optimizing the original ECE directly.

Diversity-based uncertainty quantification Diversity-based uncertainty quantification can be roughly divided into two types.
Ensemble methods such as deep ensembles (Lakshminarayanan et al., 2017) train an ensemble of models from different initializations (and with different SGD noise), and find that the aggregated (average) predicted probability exhibits better uncertainty calibration than a single deterministic model. As ensembles are memory- and runtime-heavy, a recent line of work proposes to make ensembles more efficient, either by reducing the parameter count through smart reparametrizations (Wen et al., 2020) or by using a single deterministic model that simulates the ensemble (Liu et al., 2020). A related line of work proposes to distill an ensemble of models (Malinin & Gales, 2018; Malinin et al., 2020; Tran et al., 2020). We remark that both the efficient ensembling approach and the distillation approach improve the uncertainty calibration through simulating an ensemble, and both can be used jointly with post-calibration methods. Bayesian neural networks (MacKay, 1995) are capable of producing uncertainty estimates by nature, since they learn a distribution over networks (whose predictions can be aggregated) rather than a single network. Monte Carlo Dropout (Gal & Ghahramani, 2016) is a popular approximation that applies dropout at test time and aggregates over the resulting random predictions.

B ADDITIONAL EXPERIMENTAL DETAILS

B.1 ARCHITECTURES

NRPT We choose σ in (7) to be the ReLU activation and φ to be the leaky-ReLU activation, with negative slope tuned in {0.5, 0.8, 1.5, 2.0}. The number of neurons M was tuned in {5, 10}. We further initialize T_θj(x) such that it takes initial values of (approximately) {0.5, 1.0, 1.5, . . . , 0.5M}, by properly initializing the biases within these networks.

LTS For LTS (Local Temperature Scaling) we use an architecture similar to that of NRPT: T_θ(x) = σ_temp(aᵀ σ(W H(x) + b)), where H : R^d → R^{d_hid} is the last hidden representation layer of the trained classifier, W ∈ R^{N×d_hid}, and a ∈ R^N. As in NRPT, we choose N = 512 and σ_temp(t) = 0.2 + relu6(t). We remark that our implementation is likely different from the implementation of Ding et al. (2020) (and operates on different base models).
Nevertheless, we find that our implementation is also a strong calibrator that consistently outperforms temperature scaling.

B.2 TRAINING AND EVALUATION

CIFAR-100 The base WideResNet-28-10 on CIFAR-100 was trained with batch size 128 for 200 epochs, with a cosine learning rate schedule and initial learning rate 0.1.

ImageNet The base WideResNet-50-2 on ImageNet was trained with batch size 256 (parallelized over 8 GPUs) for 100 epochs. The initial learning rate was 0.1, with a fixed decay by a factor of 0.1 at the {30, 60, 80}-th epochs.

MNLI The BERT-Base on MNLI was finetuned from the pretrained model with batch size 32 for 3 epochs, with the AdamW optimizer and learning rate 2 × 10^-5.

All post-calibrators are trained with a one-cycle learning rate schedule (Smith, 2017), and we tune the initial learning rate within {1e-3, 3.1e-3, 1e-2, 3.1e-2}. All post-calibrators are trained with the same batch size as used for training the base model. The number of epochs for training the calibrators was 50 on CIFAR-100, 5 on ImageNet, and 6 on MNLI.

ECE as loss and evaluation metric

We choose the number of bins to be 15 for evaluating the ECE, following the standard practice in (Guo et al., 2017) (and the body of recent work). However, at train time we tune the number of bins within {5, 10}, as we evaluate the train-time ECE loss on small minibatches, which can benefit from a smaller number of bins.

Hyperparameter tuning All hyperparameter tuning was conducted by further splitting the calibration dataset into a training set and a development set: we train with a grid of hyperparameters on the training set and select the best configuration on the development set.
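For concreteness, the standard equal-width-bin ECE used for evaluation can be computed as follows. This is a generic sketch of the metric from (Guo et al., 2017), not the paper's own implementation.

```python
import numpy as np


def ece(confidences, correct, n_bins=15):
    """Expected calibration error with equal-width confidence bins.

    confidences: max predicted probability per example.
    correct:     1 if the prediction was right, else 0.
    Returns sum over bins of |accuracy - mean confidence|, weighted by bin size.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += (mask.sum() / total) * gap
    return err
```

With few examples per minibatch, many of the 15 bins are nearly empty and the estimate becomes noisy, which is why a smaller train-time bin count ({5, 10}) can help.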

C COMPARISON BETWEEN NRPT AND MONTE CARLO DROPOUT

We compare the calibration performance of NRPT and Monte Carlo Dropout (Gal & Ghahramani, 2016) on CIFAR-100, where we train a WideResNet-28-10 model for each drop probability and evaluate the calibration by aggregating over 8 random predictions at test time (each with a different mask). We observe a consistent trend for Dropout: increasing the drop probability improves the ECE at the cost of hurting the accuracy, which is expected since randomized predictions naturally become more calibrated but less accurate as the level of randomization increases. Comparing NRPT+E with Dropout, we see that NRPT+E achieves better ECE than Dropout up to drop probability 0.7; at drop probability 0.8 the ECE is better than that of NRPT+E, but this comes at the cost of significantly lower accuracy. On the NLL and predictive entropy metrics, NRPT performs slightly better than Drop 0.8 and slightly worse than Drop 0.7. However, Drop 0.7 also has worse accuracy than NRPT (which does not change the accuracy of the model). This suggests that NRPT may be preferred over Dropout when maintaining the accuracy is crucial; Dropout can only achieve better calibration by hurting the accuracy.
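The evaluation protocol above (dropout kept active at test time, probabilities averaged over 8 random masks) can be sketched as follows in PyTorch. The function name is ours; the key point is that only the dropout layers are switched back to stochastic mode, so e.g. batch-norm statistics stay in evaluation mode.

```python
import torch
import torch.nn as nn


def mc_dropout_predict(model, x, n_samples=8):
    """Average softmax probabilities over several stochastic forward passes."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()  # re-enable randomness only in the dropout layers
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0)
```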

D DETAILS ON THE PRR METRIC

The PRR (prediction rejection ratio) metric is commonly used for quantitatively summarizing the accuracy among top-confidence predictions, or equivalently for evaluating whether the model can reliably reject to predict (Malinin et al., 2020). The PRR metric is based on the AUC (area under the curve) of the accuracy-among-top-confidence curve as in Figure 3. We compare the curve of a method against that of an "oracle" ranking of confidence (which ranks all incorrect predictions below all correct predictions), as well as a "random" ranking of confidence, illustrated by a straight line. The PRR metric is then defined as

PRR(Method) := (AUC(Random) − AUC(Method)) / (AUC(Random) − AUC(Oracle)) = (shaded area in green) / (shaded area in orange).

A higher PRR indicates that the method achieves a better (closer-to-oracle) ranking of confidence.
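The definition above can be sketched in NumPy. This is our own illustrative implementation of the formula, not the paper's code: the method AUC comes from sorting by confidence, the oracle from ranking correct predictions first, and the random curve is flat at the overall accuracy (it assumes the accuracy is strictly between 0 and 1, so the denominator is nonzero).

```python
import numpy as np


def retention_auc(order, correct):
    """AUC of the accuracy-among-top-k curve under a given example ordering."""
    correct = np.asarray(correct, dtype=float)[order]
    topk_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    return topk_acc.mean()


def prr(confidences, correct):
    """Prediction rejection ratio: 1 for the oracle ranking, 0 for random."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    method = retention_auc(np.argsort(-confidences), correct)
    oracle = retention_auc(np.argsort(-correct, kind="stable"), correct)
    random = correct.mean()  # flat curve: accuracy is the same at every cutoff
    return (random - method) / (random - oracle)
```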



We remark that our method is partly motivated by the attempt to improve over matrix scaling. However, we find that matrix scaling consistently performs worse than temperature scaling, and thus we omit those results here. For temperature scaling, since there is only one trainable parameter, we can in theory obtain the exact optimal solution on the entire dataset; however, we observed that the SGD solution with a proper learning rate decay almost always coincides with the exact solution.
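Since temperature scaling has a single parameter and the NLL is well behaved in it, the (essentially exact) optimum can be found with a second-order method on the full calibration set. The sketch below is our own illustration of this point, using L-BFGS and optimizing log T so that the temperature stays positive.

```python
import torch


def fit_temperature(logits, labels, max_iter=50):
    """Fit the single temperature-scaling parameter by minimizing NLL with L-BFGS."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T for positivity
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter,
                            line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```

A well-tuned SGD run with decaying learning rate converges to the same value, consistent with the observation above.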



Figure 1: Post-calibration training curves on a WideResNet-28-10 on CIFAR-100. Temperature scaling

Figure 2: Comparison between TS, LTS, and NRPT in the calibration abilities. Each dot in (b) is obtained by optimizing a weighted combination of the NLL and ECE loss. Shaded area in (a) and crosses in (b) indicate the standard deviation over 4 random seeds.

uses the randomized prediction capability of Dropout to perform uncertainty calibration. SWAG (Maddox et al., 2019) performs uncertainty calibration via an approximate Bayesian model averaging using the SGD iterates. Bayesian rank-one factors (Dusenberry et al., 2020) is a Bayesian version of BatchEnsembles that learns a posterior over the rank-one parametrization of ensembles.

B ADDITIONAL EXPERIMENTAL DETAILS

B.1 MODELS

Figure 3: Illustration of the PRR metric.

Figure 4: Accuracy among most-confident examples on ImageNet.

Comparison between NRPT and existing post-calibration methods. Metrics are reported in terms of the mean and standard deviation over 4 random seeds.

Comparison between NRPT and deep ensembles.

Prediction rejection ratio (PRR) metric (defined in Appendix D). Higher is better.

Comparison between NRPT and Monte Carlo Dropout. "Drop 0.4" indicates the dropout method with drop probability 0.4 and keep probability 0.6. All dropout methods are evaluated by aggregating the randomized predictions over 8 masks.

