IMPROVED UNCERTAINTY POST-CALIBRATION VIA RANK PRESERVING TRANSFORMS Anonymous

Abstract

Modern machine learning models with high accuracy often exhibit poor uncertainty calibration: their output probabilities do not reflect their accuracy, and tend to be over-confident. Existing post-calibration methods such as temperature scaling recalibrate a trained model using simple calibrators with one or a few parameters, which can have rather limited capacity. In this paper, we propose Neural Rank Preserving Transforms (NRPT), a new post-calibration method that adjusts the output probabilities of a trained classifier using a calibrator of higher capacity, while maintaining its prediction accuracy. NRPT learns a calibrator that preserves the rank of the probabilities through general monotonic transforms, individualizes to the original input, and allows learning with any loss function that encourages calibration. We show experimentally that NRPT improves the expected calibration error (ECE) significantly over existing post-calibration methods such as (local) temperature scaling on large-scale image and text classification tasks. The performance of NRPT can further match ensemble methods such as deep ensembles, while being much more parameter-efficient. We further demonstrate the improved calibration ability of NRPT beyond the ECE metric, such as accuracy among top-confidence predictions, as well as optimizing the trade-off between calibration and sharpness.

1. INTRODUCTION

Modern machine learning models such as deep neural networks have achieved high performance on many challenging tasks, and have been deployed in production systems that impact billions of people (LeCun et al., 2015). It is increasingly critical that the outputs of these models are comprehensible and safe to use in downstream applications. However, high-accuracy classification models often exhibit the failure mode of miscalibration: their output probabilities do not reflect the true accuracies, and tend to be over-confident (Guo et al., 2017; Lakshminarayanan et al., 2017). As the output probabilities are typically interpreted as (an estimate of) true accuracies and used in downstream applications, miscalibration can negatively impact decision making, and is especially dangerous in risk-sensitive domains such as medical AI (Begoli et al., 2019; Jiang et al., 2012) or self-driving cars (Michelmore et al., 2018). It is thus an important question how to properly calibrate these models so as to make their output probabilities more trustworthy and safer to use.

Existing methods for uncertainty calibration can roughly be divided into two types. Diversity-based methods such as ensembles (Lakshminarayanan et al., 2017; Wen et al., 2020) and Bayesian networks (Gal & Ghahramani, 2016; Maddox et al., 2019; Dusenberry et al., 2020) work by aggregating predicted probabilities over multiple models, or over multiple passes of a randomized model. These methods are able to improve both the accuracy and the uncertainty calibration over a single deterministic model (Ovadia et al., 2019). However, deploying them requires either storing all the ensemble members and/or running multiple random variants of the same model, which makes them memory-expensive and runtime-inefficient.
On the other hand, post-calibration methods work by learning a calibrator on top of the output probabilities (or logits) of an existing well-trained model (Platt et al., 1999; Zadrozny & Elkan, 2001; 2002; Guo et al., 2017; Ding et al., 2020). For a K-class classification model that outputs logits z = z(x) ∈ R^K, post-calibration methods learn a calibrator f : R^K → R^K using additional holdout data, so that f(z) is better calibrated than the original z. The architectures of such calibrators are typically simple: a prevalent example is the temperature scaling method, which learns f_T(z) = z/T with a single trainable parameter T > 0 by minimizing the negative log-likelihood (NLL) loss on holdout data. Such simple calibrators add no overhead to the existing model, and are empirically shown to improve calibration significantly on a variety of tasks and models (Guo et al., 2017). Despite this empirical success, the design of post-calibration methods is not yet fully satisfactory: in practice, simple calibrators such as temperature scaling often underfit the calibration loss on their training data, whereas more complex calibrators can often overfit; see Figure 1 for a quantitative illustration of this effect. While the underfitting of simple calibrators is perhaps due to their limited expressive power, the overfitting of complex calibrators is also believed to be natural, since the holdout dataset used for training the calibrator is typically small (e.g., a few thousand examples). One concrete example is the matrix scaling method, which learns a matrix calibrator f_{W,b}(z) = Wz + b involving O(K^2) trainable parameters. When K is large, matrix scaling often tends to overfit and hurt calibration, despite being a strict generalization of temperature scaling (Guo et al., 2017). It is further observed that this overfitting cannot be easily fixed by applying common regularizers such as L2 on the calibrator (Kull et al., 2019).
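As a concrete illustration, temperature scaling as described above can be sketched in a few lines. The grid search below is a minimal stand-in for the gradient-based NLL minimization typically used in practice; all function names and the grid range are our own choices, not part of any standard API.

```python
import numpy as np

def nll(logits, labels):
    """Mean negative log-likelihood of the true labels under softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Temperature scaling: pick T > 0 minimizing the holdout NLL of logits / T.

    Dividing all logits by the same T preserves their rank, so the
    predicted top label (and hence the accuracy) is unchanged.
    """
    return min(grid, key=lambda T: nll(logits / T, labels))
```

For an over-confident model (logits with too large a scale), the fitted T comes out greater than 1, softening the predicted probabilities without changing any prediction.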
This empirical evidence seems to suggest that complex calibrators with a large number of parameters are perhaps not recommended when designing post-calibration methods. In this paper, we show that, in contrast to this prior belief, large calibrators do not necessarily overfit; rather, it is the lack of an accuracy constraint on the calibrator that may have caused the overfitting. Observe that matrix scaling, unlike temperature scaling, is not guaranteed to maintain the accuracy of the model: it applies a general affine transform z → Wz + b on the logits and can modify their rank (and thus the predicted top label), whereas temperature scaling is guaranteed to preserve the rank. When trained with the NLL loss, a calibrator that does not maintain the accuracy may attempt to improve the accuracy at the cost of hurting (or not improving) the calibration. Motivated by this observation, this paper proposes Neural Rank-Preserving Transforms (NRPT), a method for learning calibrators that maintain the accuracy of the model, yet are complex enough to yield better calibration performance than simple calibrators such as temperature scaling. Our key idea is that a sufficient condition for the calibrator to maintain the accuracy is for it to preserve the rank of the logits: any mapping that preserves the rank of the logits will not change the predicted top label. We instantiate this idea by designing a family of calibrators that perform entrywise monotone transforms on each individual logit (or log-probability): for a K-class classification problem, NRPT scales each logit as z_i → f(z_i, x), where z_i ∈ R is the i-th logit (1 ≤ i ≤ K), x ∈ R^d is the original input feature vector, and f : R × R^d → R is monotonically increasing in its first argument but otherwise arbitrary. As f is monotone, we have f(z_1, x) ≤ f(z_2, x) whenever z_1 ≤ z_2, and thus f preserves the rank of the logits.
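To make the rank-preserving construction concrete, the sketch below implements one possible entrywise monotone calibrator f(z_i, x): a sum of softplus ramps whose slopes depend on x but are constrained to be positive, so f is strictly increasing in z_i and the predicted top label is preserved. This is an illustrative architecture of our own devising, not the paper's exact instantiation; the parameter shapes (`Wa`, `Wb`) are hypothetical.

```python
import numpy as np

def monotone_calibrator(z, x, params):
    """Entrywise transform z_i -> f(z_i, x), monotone increasing in z_i.

    f(z_i, x) = sum_k softplus(a_k(x) * z_i + b_k(x)), where the slopes
    a_k(x) = exp(x @ Wa)_k are strictly positive, so df/dz_i > 0 and the
    rank of the logits (hence the argmax) is preserved for every input x.
    Illustrative architecture only (hypothetical parameterization).
    """
    Wa, Wb = params                  # feature-to-ramp maps, shapes (d, m)
    a = np.exp(x @ Wa)               # strictly positive slopes, shape (m,)
    b = x @ Wb                       # offsets, shape (m,)
    u = a[None, :] * z[:, None] + b[None, :]   # shape (K, m)
    return np.logaddexp(0.0, u).sum(axis=1)    # softplus summed over ramps
```

Note that temperature scaling is recovered (up to the softplus nonlinearity) by a single ramp with constant slope 1/T, and making the slopes depend on x is what allows the calibrator to individualize to each input.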
This method strictly generalizes temperature scaling (which uses f(z_i, x) = z_i/T) and local temperature scaling (which uses f(z_i, x) = z_i/T(x)) (Ding et al., 2020). The fact that f can depend on x further improves the expressivity of f and allows great flexibility in the architecture design. We compare our instantiation of NRPT against temperature scaling and matrix scaling in Figure 1, where we see that NRPT is indeed able to fit the training loss better than temperature scaling and does not suffer from overfitting.



Figure 1: Post-calibration training curves on a WideResNet-28-10 on CIFAR-100. Temperature scaling minimizes the training and validation NLL reasonably well (and improves the ECE), but still underfits the NLL. Matrix scaling learns a higher-capacity matrix calibrator and minimizes the training NLL better, but does not improve the ECE since the calibrator does not maintain the accuracy and is encouraged to improve the accuracy instead of calibration. Our Neural Rank-Preserving Transforms (NRPT) learns a higher-capacity calibrator that preserves the accuracy, and improves both the training/validation NLL as well as the ECE.
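For reference, the binned ECE reported in the figure can be estimated as follows. This is a standard estimator (average gap between accuracy and confidence over equal-width confidence bins, weighted by bin mass); the exact binning details vary across papers, so treat the specifics here as one common convention.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: sum over bins of (bin mass) * |accuracy - confidence|."""
    conf = probs.max(axis=1)               # predicted confidence per example
    pred = probs.argmax(axis=1)            # predicted label per example
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)  # examples falling in this bin
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```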

