IMPROVED UNCERTAINTY POST-CALIBRATION VIA RANK PRESERVING TRANSFORMS

Anonymous

Abstract

Modern machine learning models with high accuracy often exhibit poor uncertainty calibration: the output probabilities of the model do not reflect its accuracy and tend to be over-confident. Existing post-calibration methods such as temperature scaling recalibrate a trained model using simple calibrators with one or a few parameters, which can have rather limited capacity. In this paper, we propose Neural Rank Preserving Transforms (NRPT), a new post-calibration method that adjusts the output probabilities of a trained classifier using a calibrator of higher capacity, while maintaining its prediction accuracy. NRPT learns a calibrator that preserves the rank of the probabilities through general monotonic transforms, individualizes to the original input, and allows learning with any loss function that encourages calibration. We show experimentally that NRPT significantly improves the expected calibration error (ECE) over existing post-calibration methods such as (local) temperature scaling on large-scale image and text classification tasks. The performance of NRPT can further match ensemble methods such as deep ensembles, while being much more parameter-efficient. We further demonstrate the improved calibration ability of NRPT beyond the ECE metric, such as accuracy among top-confidence predictions, as well as optimizing the tradeoff between calibration and sharpness.

1. INTRODUCTION

Modern machine learning models such as deep neural networks have achieved high performance on many challenging tasks, and have been put into production systems that impact billions of people (LeCun et al., 2015). It is increasingly critical that the outputs of these models are comprehensible and safe to use in downstream applications. However, high-accuracy classification models often exhibit the failure mode of miscalibration: their output probabilities do not reflect the true accuracies, and tend to be over-confident (Guo et al., 2017; Lakshminarayanan et al., 2017). As output probabilities are typically interpreted as (an estimate of) true accuracies and used in downstream applications, miscalibration can negatively impact decision making, and is especially dangerous in risk-sensitive domains such as medical AI (Begoli et al., 2019; Jiang et al., 2012) or self-driving cars (Michelmore et al., 2018). How to properly calibrate these models, so as to make their output probabilities more trustworthy and safer to use, is thus an important question.

Existing methods for uncertainty calibration can roughly be divided into two types. Diversity-based methods such as ensembles (Lakshminarayanan et al., 2017; Wen et al., 2020) and Bayesian networks (Gal & Ghahramani, 2016; Maddox et al., 2019; Dusenberry et al., 2020) work by aggregating predicted probabilities over multiple models, or over multiple runs of a randomized model. These methods can improve both the accuracy and the uncertainty calibration of a single deterministic model (Ovadia et al., 2019). However, deploying them requires storing all ensemble members and/or running multiple random variants of the same model, which makes them memory-expensive and runtime-inefficient.
On the other hand, post-calibration methods work by learning a calibrator on top of the output probabilities (or logits) of an existing well-trained model (Platt et al., 1999; Zadrozny & Elkan, 2001; 2002; Guo et al., 2017; Ding et al., 2020). For a K-class classification model that outputs logits z = z(x) ∈ R^K, post-calibration methods learn a calibrator f : R^K → R^K using additional holdout data, so that f(z) is better calibrated than the original z. The architectures of such calibrators are typically simple: a prevalent example is the temperature scaling method, which learns f_T(z) = z/T with a single trainable parameter T > 0 by
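As a concrete illustration of the post-calibration setup (this is standard temperature scaling as in Guo et al. (2017), not the NRPT method proposed in this paper), a minimal sketch below fits the single parameter T on holdout logits by minimizing the negative log-likelihood. The function names (`fit_temperature`, `nll`) and the use of NumPy/SciPy are our own illustrative choices, not from the original work.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def softmax(z):
    # Numerically stable row-wise softmax over logits of shape (n, K).
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(T, logits, labels):
    # Negative log-likelihood of the temperature-scaled probabilities
    # f_T(z) = z / T, evaluated on holdout (logits, labels).
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))


def fit_temperature(logits, labels):
    # Learn T > 0 by minimizing holdout NLL; the bounds are an
    # illustrative search range, not prescribed by the paper.
    res = minimize_scalar(lambda T: nll(T, logits, labels),
                          bounds=(0.05, 10.0), method="bounded")
    return res.x
```

Because dividing all logits by the same T > 0 is a monotonic transform, the argmax class (and hence accuracy) is unchanged; only the confidence of the prediction is adjusted.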

