PRIVACY-PRESERVING RECALIBRATION UNDER DOMAIN SHIFT

Abstract

Classifiers deployed in high-stakes applications must output calibrated confidence scores, i.e. their predicted probabilities should reflect empirical frequencies. Typically this is achieved with recalibration algorithms that adjust probability estimates based on real-world data; however, existing algorithms are not applicable in real-world situations where the test data follows a different distribution from the training data and privacy preservation is paramount (e.g. protecting patient records). We introduce a framework that provides abstractions for performing recalibration under differential privacy constraints. This framework allows us to adapt existing recalibration algorithms to satisfy differential privacy while remaining effective under domain shift. Guided by our framework, we also design a novel recalibration algorithm, accuracy temperature scaling, that is tailored to the requirements of differential privacy. In an extensive empirical study, we find that our algorithm improves calibration on domain-shift benchmarks under the constraints of differential privacy. On the 15 highest-severity perturbations of the ImageNet-C dataset, our method achieves a median ECE of 0.029, over 2x better than the next best recalibration method and almost 5x better than no recalibration at all.

1. INTRODUCTION

Machine learning classifiers are currently deployed in high-stakes applications where (1) the cost of failure is high, so prediction uncertainty must be accurately calibrated, (2) the test distribution does not match the training distribution, and (3) data is subject to privacy constraints. All three of these challenges must be addressed in applications such as medical diagnosis (Khan et al., 2001; Chen et al., 2018; Kortum et al., 2018), financial decision making (Berestycki et al., 2002; Rasekhschaffe & Jones, 2019; He & Antón, 2003), security and surveillance systems (Sun et al., 2015; Patel et al., 2015; Agre, 1994), criminal justice (Berk, 2012; 2019; Rudin & Ustun, 2018), and mass-market autonomous driving (Kendall & Gal, 2017; Yang et al., 2018; Glancy, 2012). While much prior work has addressed these challenges individually, they have not been considered simultaneously. The goal of this paper is to propose a framework that formalizes challenges (1)-(3) jointly, to introduce benchmark problems, and to design and compare new algorithms under the framework.

A standard approach for addressing challenge (1) is uncertainty quantification, where the classifier outputs its confidence in every prediction to indicate how likely it is that the prediction is correct. These confidence scores must be meaningful and trustworthy. A widely used criterion for good confidence scores is calibration (Brier, 1950; Cesa-Bianchi & Lugosi, 2006; Guo et al., 2017): among the data samples for which the classifier outputs confidence p ∈ (0, 1), exactly a p fraction of the samples should be classified correctly. Several methods (Guo et al., 2017) learn calibrated classifiers when the training distribution matches the test distribution. However, this classical assumption is routinely violated in real-world applications, and calibration performance can degrade significantly under even small domain shifts (Snoek et al., 2019).
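The calibration criterion above is commonly quantified by the binned expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between accuracy and mean confidence is averaged across bins, weighted by bin size. The following is a minimal NumPy sketch of this standard metric; the function name and the choice of 15 equal-width bins are our illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted average over bins of
    |accuracy(bin) - mean confidence(bin)|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width confidence bins.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# A classifier that reports confidence 0.9 but is right only half the
# time on those predictions has an ECE of |0.5 - 0.9| = 0.4.
ece = expected_calibration_error([0.9] * 10, [1, 0] * 5)
```

A perfectly calibrated classifier (e.g. confidence 0.8 with 80% of those predictions correct) yields an ECE of zero under this estimator.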
To address this challenge, several methods have been proposed to recalibrate a classifier on data from the test distribution (Platt et al., 1999; Guo et al., 2017; Kuleshov et al., 2018; Snoek et al., 2019). These methods make small adjustments to the classifier to minimize calibration error on a validation dataset drawn from the test distribution, but they are typically only applicable when they have (unrestricted) access to this validation set.

Additionally, high-stakes applications often require privacy. For example, it is difficult for hospitals to share patient data with machine learning providers due to legal privacy protections (Centers for Medicare & Medicaid Services, 1996). When the data is particularly sensitive, provable differential privacy becomes necessary. Differential privacy (Dwork et al., 2014) provides a mathematically rigorous definition of privacy along with algorithms that meet the requirements of this definition. For instance, a hospital may share only certain statistics of its data, where the shared statistics must have bounded mutual information with respect to individual patients. The machine learning provider can then use these shared statistics (possibly combining statistics from many different hospitals) to recalibrate the classifier and provide better confidence estimates.

In this paper, we present a framework to address all three challenges (calibration, domain shift, and differential privacy) and introduce a benchmark to standardize performance measurement and compare algorithms. We show how to modify modern recalibration techniques (e.g. Zadrozny & Elkan, 2001; Guo et al., 2017) to satisfy differential privacy using this framework, and compare their empirical performance. This framework can be viewed as performing federated learning for recalibration, with the constraint that each party's data must be kept differentially private.
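As an illustration of the kind of statistic sharing described above, a private data source can release a bounded aggregate through the standard Laplace mechanism, which adds noise scaled to the statistic's sensitivity divided by the privacy budget ε. This is a minimal sketch under our own assumptions (function name, clipping bounds), not the paper's exact protocol.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Release the mean of `values` (each clipped to [lower, upper])
    with epsilon-differential privacy via the Laplace mechanism.

    The sensitivity of the clipped mean is (upper - lower) / n, since
    changing one record moves the mean by at most that amount.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(values.mean() + noise)

# Example: a hospital releases its classifier's accuracy (a mean of
# 0/1 correctness indicators) on 1000 records with epsilon = 1.
correct = np.random.default_rng(0).integers(0, 2, size=1000)
private_acc = dp_mean(correct, lower=0.0, upper=1.0, epsilon=1.0)
```

Because the sensitivity shrinks as 1/n, larger validation sets allow the same privacy budget to yield far less noisy statistics.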
We also present a novel recalibration technique, accuracy temperature scaling, that is particularly effective in this framework. This technique requires each private data source to share only two statistics: the classifier's overall accuracy and its average confidence score. We adjust the classifier until the average confidence equals the overall accuracy. Because each private data source reveals only two numbers, it is much easier to satisfy differential privacy. In our experiments, we find that without privacy requirements the new recalibration algorithm performs on par with algorithms that use the entire validation dataset, such as temperature scaling (Guo et al., 2017); with privacy requirements, the new algorithm performs 2x better than the second-best baseline.

In summary, the contributions of our paper are as follows. (1) We introduce the problem of privacy-preserving calibration under domain shift and design a framework for adapting existing recalibration techniques to this setting. (2) We introduce accuracy temperature scaling, a novel recalibration method designed with privacy concerns in mind, which requires only the overall accuracy and average confidence of the model on the validation set. (3) Using our framework, we empirically evaluate our method on a large set of benchmarks against state-of-the-art techniques and show that it performs well across a wide range of situations under differential privacy.
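The adjustment described above can be sketched as a one-dimensional search: since the mean max-softmax confidence decreases monotonically as the temperature grows, bisection finds the temperature at which the mean confidence matches the (privately released) accuracy. This is our own minimal sketch, not the paper's implementation; the search bounds and iteration count are assumptions, and the target accuracy must lie between the mean confidences attainable at the two bounds.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def accuracy_temperature_scaling(logits, accuracy, t_lo=1e-2, t_hi=100.0, iters=60):
    """Bisect for the temperature T at which the mean max-softmax
    confidence of logits / T equals the target accuracy. Mean confidence
    is monotonically decreasing in T, so bisection converges."""
    for _ in range(iters):
        t_mid = 0.5 * (t_lo + t_hi)
        mean_conf = softmax(logits / t_mid).max(axis=-1).mean()
        if mean_conf > accuracy:
            t_lo = t_mid  # confidence too high: search larger temperatures
        else:
            t_hi = t_mid  # confidence too low: search smaller temperatures
    return 0.5 * (t_lo + t_hi)

# Example: choose T so average confidence matches a reported accuracy of 0.5.
rng = np.random.default_rng(0)
logits = 3.0 * rng.normal(size=(500, 10))
t = accuracy_temperature_scaling(logits, accuracy=0.5)
```

Note that only the scalar accuracy crosses the privacy boundary here; the logits stay with the model owner, which is what makes the two-statistic protocol easy to privatize.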

2.1. CALIBRATION

Description of Calibration Consider a classification task from an input domain X (e.g. images) to a finite set of labels Y = {1, · · · , m}. We assume that there is some joint distribution P* on X × Y. This could be the training distribution or the distribution from which we draw test data. A classifier is a pair (φ, p), where φ : X → Y maps each input x ∈ X to a label y ∈ Y and p : X → [0, 1] maps each input x to a confidence value p(x). We say that the classifier (φ, p) is perfectly calibrated (Brier, 1950; Gneiting et al., 2007) with respect to the distribution P* if

    Pr_{(x,y)∼P*}[φ(x) = y | p(x) = c] = c   for all c ∈ [0, 1].   (1)

Note that calibration is a property not only of the classifier (φ, p) but also of the distribution P*: a classifier (φ, p) can be calibrated with respect to one distribution (e.g. the training distribution) but not another (e.g. the test distribution). To simplify notation, we drop the dependency on P*. To measure numerically how well a classifier is calibrated, the commonly used metric is the Expected Calibration Error (ECE) (Naeini et al., 2015), defined by

    ECE(φ, p) := Σ_{c ∈ [0,1]} Pr[p(x) = c] · |Pr[φ(x) = y | p(x) = c] − c|.

In other words, ECE measures the average deviation from Eq. 1. In practice, the ECE is approximated by binning: partitioning the predicted confidences into bins and taking a weighted average of the difference between the accuracy and the average confidence in each bin (see Appendix A.1 for details).

Recalibration Methods Several methods apply a post-training adjustment to a classifier (φ, p) to achieve calibration (Platt et al., 1999; Niculescu-Mizil & Caruana, 2005). The one most relevant to our paper is temperature scaling (Guo et al., 2017). On each input x ∈ X , a neural network

