PRIVACY PRESERVING RECALIBRATION UNDER DOMAIN SHIFT

Abstract

Classifiers deployed in high-stakes applications must output calibrated confidence scores, i.e. their predicted probabilities should reflect empirical frequencies. Typically this is achieved with recalibration algorithms that adjust probability estimates based on real-world data; however, existing algorithms are not applicable in real-world situations where the test data follows a different distribution from the training data and privacy preservation is paramount (e.g. protecting patient records). We introduce a framework that provides abstractions for performing recalibration under differential privacy constraints. This framework allows us to adapt existing recalibration algorithms to satisfy differential privacy while remaining effective under domain shift. Guided by our framework, we also design a novel recalibration algorithm, accuracy temperature scaling, that is tailored to the requirements of differential privacy. In an extensive empirical study, we find that our algorithm improves calibration on domain-shift benchmarks under the constraints of differential privacy. On the 15 highest severity perturbations of the ImageNet-C dataset, our method achieves a median ECE of 0.029, over 2x better than the next best recalibration method and almost 5x better than without recalibration.

Under review as a conference paper at ICLR 2021

1. INTRODUCTION

Machine learning classifiers are currently deployed in high-stakes applications where (1) the cost of failure is high, so prediction uncertainty must be accurately calibrated, (2) the test distribution does not match the training distribution, and (3) the data is subject to privacy constraints. All three of these challenges must be addressed in applications such as medical diagnosis (Khan et al., 2001; Chen et al., 2018; Kortum et al., 2018), financial decision making (Berestycki et al., 2002; Rasekhschaffe & Jones, 2019; He & Antón, 2003), security and surveillance systems (Sun et al., 2015; Patel et al., 2015; Agre, 1994), criminal justice (Berk, 2012; 2019; Rudin & Ustun, 2018), and mass market autonomous driving (Kendall & Gal, 2017; Yang et al., 2018; Glancy, 2012). While much prior work has addressed these challenges individually, they have not been considered simultaneously. The goal of this paper is to propose a framework that formalizes challenges (1)-(3) jointly, introduce benchmark problems, and design and compare new algorithms under the framework.

A standard approach for addressing challenge (1) is uncertainty quantification, where the classifier outputs its confidence in every prediction to indicate how likely it is that the prediction is correct. These confidence scores must be meaningful and trustworthy. A widely used criterion for good confidence scores is calibration (Brier, 1950; Cesa-Bianchi & Lugosi, 2006; Guo et al., 2017): among the data samples for which the classifier outputs confidence p ∈ (0, 1), exactly a p fraction of the samples should be classified correctly. Several methods (Guo et al., 2017) learn calibrated classifiers when the training distribution matches the test distribution. However, this classical assumption is routinely violated in real-world applications, and calibration performance can significantly degrade under even small domain shifts (Snoek et al., 2019). To address this challenge, several methods have been proposed to re-calibrate a classifier on data from the test distribution (Platt et al., 1999; Guo et al., 2017; Kuleshov et al., 2018; Snoek et al., 2019). These methods make small adjustments to the classifier to minimize calibration error on a validation dataset drawn from the test distribution, but they are typically only applicable when they have (unrestricted) access to this validation set.

Additionally, high-stakes applications often require privacy. For example, it is difficult for hospitals to share patient data with machine learning providers due to legal privacy protections (Centers for Medicare & Medicaid Services, 1996). When the data is particularly sensitive, provable differential privacy becomes necessary. Differential privacy (Dwork et al., 2014) provides a mathematically rigorous definition of privacy along with algorithms that meet the requirements of this definition. For instance, a hospital may share only certain statistics of its data, where the shared statistics must have bounded mutual information with respect to individual patients. The machine learning provider can then use these shared statistics, possibly combining statistics from many different hospitals, to recalibrate the classifier and provide better confidence estimates.

In this paper, we present a framework to address all three challenges (calibration, domain shift, and differential privacy) and introduce a benchmark to standardize performance and compare algorithms. We show how to modify modern recalibration techniques (e.g. Zadrozny & Elkan, 2001; Guo et al., 2017) to satisfy differential privacy using this framework, and compare their empirical performance. This framework can be viewed as performing federated learning for recalibration, with the constraint that each party's data must be kept differentially private. We also present a novel recalibration technique, accuracy temperature scaling, that is particularly effective in this framework. This new technique requires private data sources to share only two statistics: the overall accuracy and the average confidence score of the classifier. We adjust the classifier until the average confidence equals the overall accuracy. Because only two numbers are revealed by each private data source, it is much easier to satisfy differential privacy. In our experiments, we find that without privacy requirements the new recalibration algorithm performs on par with algorithms that use the entire validation dataset, such as Guo et al. (2017); with privacy requirements the new algorithm performs 2x better than the second best baseline.

In summary, the contributions of our paper are as follows. (1) We introduce the problem of "privacy preserving calibration under domain shift" and design a framework for adapting existing recalibration techniques to this setting. (2) We introduce accuracy temperature scaling, a novel recalibration method designed with privacy concerns in mind, that requires only the overall accuracy and average confidence of the model on the validation set. (3) Using our framework, we empirically evaluate our method on a large set of benchmarks against state-of-the-art techniques and show that it performs well across a wide range of situations under differential privacy.

2. BACKGROUND

2.1. CALIBRATION

Consider a classification task from an input domain X (e.g. images) to a finite set of labels Y = {1, ..., m}. We assume that there is some joint distribution P* on X × Y. This could be the training distribution or the distribution from which we draw test data. A classifier is a pair (φ, p), where φ : X → Y maps each input x ∈ X to a label y ∈ Y and p : X → [0, 1] maps each input x to a confidence value c. We say that the classifier (φ, p) is perfectly calibrated (Brier, 1950; Gneiting et al., 2007) with respect to the distribution P* if for all c ∈ [0, 1],

Pr_{(x,y)∼P*}[φ(x) = y | p(x) = c] = c.    (1)

Note that calibration is a property not only of the classifier (φ, p) but also of the distribution P*. A classifier (φ, p) can be calibrated with respect to one distribution (e.g. the training distribution) but not another (e.g. the test distribution). To simplify notation, we drop the dependency on P*. To numerically measure how well a classifier is calibrated, the commonly used metric is the Expected Calibration Error (ECE) (Naeini et al., 2015), defined by

ECE(φ, p) := Σ_{c ∈ [0,1]} Pr[p(x) = c] · |Pr[φ(x) = y | p(x) = c] − c|.    (2)

In other words, the ECE measures the average deviation from Eq. 1.
In practice, the ECE is approximated by binning: partitioning the predicted confidences into bins and then taking a weighted average of the difference between the accuracy and the average confidence in each bin (see Appendix A.1 for details).

Recalibration Methods. Several methods apply a post-training adjustment to a classifier (φ, p) to achieve calibration (Platt et al., 1999; Niculescu-Mizil & Caruana, 2005). The one most relevant to our paper is temperature scaling (Guo et al., 2017). On each input x ∈ X, a neural network typically first computes a logit score l_1(x), l_2(x), ..., l_m(x) for each of the m labels, then computes a confidence score or probability estimate p(x) with a softmax function. Temperature scaling adds a temperature parameter T ∈ R+ to the softmax function:

p(x; T) = max_i e^{l_i(x)/T} / Σ_j e^{l_j(x)/T}.    (3)

A higher temperature reduces the confidence, and vice versa. T is trained to minimize the standard cross entropy objective on the validation dataset, which is equivalent to maximizing log likelihood. Despite its simplicity, temperature scaling performs well empirically in classification calibration for deep neural networks.

Alternative methods for classification calibration have also been proposed. Histogram binning (Zadrozny & Elkan, 2001) partitions confidence scores in [0, 1] into bins {[0, ε), [ε, 2ε), ..., [1 − ε, 1]} and sorts each validation sample into a bin based on its confidence p(x). The algorithm then resets the confidence level of each bin to match the average classification accuracy of the data points in that bin. Isotonic regression methods (Kuleshov et al., 2018) learn an additional layer on top of the softmax output layer. This additional layer is trained on a validation dataset to fit the output confidence scores to the empirical probabilities in each bin. Other methods include Platt scaling (Platt et al., 1999) and Gaussian process calibration (Wenger et al., 2019).
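As a concrete illustration of Eq. 3, temperature-scaled confidences can be computed as follows (a minimal sketch, not the authors' implementation; the logits are hypothetical):

```python
import numpy as np

def softmax_confidence(logits, T):
    """Temperature-scaled confidence (Eq. 3): max_i e^{l_i/T} / sum_j e^{l_j/T}."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)        # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)                    # confidence of the predicted label

logits = np.array([[2.0, 0.5, -1.0],
                   [4.0, 3.9, 0.0]])
# T = 1 recovers the usual softmax; T > 1 lowers confidence, T < 1 raises it.
assert softmax_confidence(logits, 2.0).mean() < softmax_confidence(logits, 1.0).mean()
```

Note that the argmax prediction itself is unchanged by T; only the confidence attached to it moves, which is why recalibration by temperature scaling leaves accuracy untouched.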

2.2. ROBUSTNESS TO DOMAIN SHIFT

Preventing massive performance degradation of machine learning models under domain shift has been a long-standing problem. There are several approaches developed in the literature. Unsupervised domain adaptation (Ganin & Lempitsky, 2014; Shu et al., 2018) learns a joint representation between the source domain (original data) and target domain (domain shifted data). Invariance based methods (Cissé et al., 2017; Miyato et al., 2018; Madry et al., 2017; Lakshminarayanan et al., 2017; Cohen et al., 2019) prevent the classifier output from changing significantly given small perturbations to the input. Transfer learning methods (Pan & Yang, 2009; Bengio, 2012; Dai et al., 2007) fine-tune the classifier on labeled data in the target domain. We classify our method in this category because we also fine-tune on the target domain, but with minimal data requirements (we only need the overall classifier accuracy).

2.3. DIFFERENTIAL PRIVACY

Differential privacy (Dwork et al., 2014) is a procedure for sharing information about a dataset with the public while withholding critical information about individuals in the dataset. Informally, it guarantees that an attacker can only learn a limited amount of new information about any individual. Differentially private approaches are critical in privacy sensitive applications. For example, a hospital may wish to gain medical insight or calibrate its prediction models by releasing diagnostic information to outside experts, but it cannot release information about any particular patient.

One common notion of differential privacy is ε-differential privacy (Dwork et al., 2014). Let us define a database D as a collection of data points in a universe X, and represent it by its histogram D ∈ N^{|X|}.

Definition 1. A randomized algorithm M is ε-differentially private if for all databases D, D′ ∈ N^{|X|} with ||D − D′||_1 ≤ 1, and for all sets S of possible outputs,

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S].

Intuitively, the output of M should not change much if a single data point is added or removed. An attacker that learns the output of M gains only limited information about any particular data point.

Given a deterministic real valued function f : N^{|X|} → R^z, we would like to design a function M that remains as close as possible to f but satisfies Definition 1. This can be achieved by the Laplace mechanism (McSherry & Talwar, 2007; Dwork, 2008). Let us define the L1 sensitivity of f as

∆f = max_{D, D′ ∈ N^{|X|}, ||D − D′||_1 = 1} ||f(D) − f(D′)||_1.

Then the Laplace mechanism adds Laplacian random noise as in Eq. 4:

M_L(D; f, ε) = f(D) + (Y_1, ..., Y_z),    (4)

where the Y_i are i.i.d. random variables drawn from the Laplace(loc = 0, scale = ∆f/ε) distribution. The function M_L satisfies ε-differential privacy, and we reproduce the proof in Appendix A.2.
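The Laplace mechanism of Eq. 4 is one line of code. The sketch below (our own, with a hypothetical count query) releases a scalar statistic with ε-differential privacy, assuming its L1 sensitivity is known:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with eps-differential privacy by adding
    Laplace(loc=0, scale=sensitivity/epsilon) noise, as in Eq. 4."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
# Example: privately release a count. A count has sensitivity 1, since
# adding or removing one record changes it by at most 1.
true_count = 120
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller ε (stronger privacy) means a larger noise scale, which is the calibration-vs-privacy tradeoff discussed later in the paper.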

3. RECALIBRATION UNDER DIFFERENTIAL PRIVACY

In this section we propose a framework for performing recalibration that allows independent parties to pool their data for improved calibration, while maintaining differential privacy. This setup can be framed as differentially private federated learning for recalibration. Multiple parties experience the same domain shift (e.g. because they live in the same changing world). Each party would benefit from access to additional data, but each party also wants to keep their own data private. Our framework allows all parties to react to domain shifts more quickly by pooling their data (so each individual party needs less labeled data from the new distribution), while maintaining the privacy of each party.

3.1. EXAMPLE APPLICATIONS

We begin with example scenarios that illustrate the main desiderata and challenges of this problem.

Example 1: Suppose you have a classifier for diagnosing a medical condition and deploy your classifier across many hospitals. The hospitals need calibrated confidences for a similar but more unusual condition (e.g. the original model may have been trained on an existing virus strain but needs to be recalibrated for a novel strain of the virus). There are two options:

1. Each hospital uses only its own private data to calibrate the classifier.
2. Each hospital sends some (differentially private) information to you, and you aggregate the information and calibrate the classifier.

Option 2 is preferable if each hospital has only a handful of patients with the particular condition. In this case, the hospitals are the parties that wish to keep their data (patient information) private. The novel strain of the virus represents a domain shift. If the hospitals each have only a few data points, they want to aggregate their data in order to improve their classifier's calibration while still respecting patient privacy.

Example 2: Suppose that there is a third-party advertising company that runs ads for websites. This advertising company has worked with news websites before, but recently acquired new clients from a different category of websites. The individual websites have user information, but they cannot provide it to the third-party advertising company due to privacy constraints. The third-party advertising company wants calibrated models for whether a user will click on an ad.

Example 3: Another category of scenarios in which our framework can be used is individual privacy. An individual may have labeled data that they wish to keep private; however, they would still like calibrated confidences from prediction models (e.g. financial software for individuals).
With differential privacy, individuals can provide summary statistics with added noise to an aggregator. In this setup, differential privacy is guaranteed on the individual level. Aggregators can then improve their confidence estimation using noisy summary statistics from many individuals.

3.2. GENERAL FRAMEWORK

We propose a standard framework to handle the general situation represented by the examples above. This two-party framework involves (1) a calibrator and (2) private data sources, and it allows us to adapt recalibration algorithms for differential privacy. A private data source may be e.g. a hospital (as in Example 1 above), a website (as in Example 2), or an individual (as in Example 3).

1. [Calibrator:] Input an uncalibrated classifier (φ, p).

2. [Private Data Sources:] Each data source i = 1, ..., d inputs its private dataset D_i.

3. At each iteration k = 1, ..., K, the calibrator chooses a query function f_k, and each private data source answers it through a differentially private mechanism M_k, returning M_k(D_i).

4. [Calibrator:] Output a recalibrated classifier (φ, p′) based on the observations M_k(D_i), k = 1, ..., K, i = 1, ..., d.

Under this framework, differential privacy is automatically satisfied: if for each k = 1, ..., K, M_k is ε/K-differentially private, then the combined function (M_1, ..., M_K) is ε-differentially private (Dwork et al., 2014). The differential privacy guarantees for each private data source i are independent of the policy of the calibrator or the other private data sources; i.e. even if the calibrator and all other private data sources collude to steal information from the i-th data source, as long as the i-th private data source follows the protocol, its data is protected by differential privacy.

This framework simplifies the problem into two design choices: select the query function f_k for k = 1, ..., K, and select the mapping from the observations M_k(D_1), ..., M_k(D_d), k = 1, ..., K, to the calibrated confidence function p′. We discuss the most reasonable choices for several existing recalibration algorithms below. Note that in general, the calibration quality degrades as the privacy level increases (i.e. as ε decreases).
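A minimal sketch of this query loop (function names such as `private_recalibration` and the bounded example query are ours, not from the paper): the calibrator issues K queries, each data source answers through the Laplace mechanism with budget ε/K, and the calibrator averages the noisy answers.

```python
import numpy as np

def private_recalibration(datasets, query_fns, sensitivity, epsilon, rng):
    """Run K differentially private queries against d private datasets.

    Each query spends epsilon/K of the privacy budget, so by sequential
    composition the whole protocol is epsilon-differentially private
    for every data source.
    """
    K = len(query_fns)
    answers = []
    for f_k in query_fns:
        noisy = [f_k(D) + rng.laplace(0.0, sensitivity * K / epsilon)
                 for D in datasets]              # one Laplace draw per source
        answers.append(np.mean(noisy))           # calibrator aggregates
    return answers

rng = np.random.default_rng(0)
datasets = [rng.normal(size=50) for _ in range(10)]
# Hypothetical bounded query: fraction of entries above zero. Assuming
# fixed-size datasets of 50 records, changing one record moves the
# fraction by at most 1/50, so we take sensitivity = 1/50.
query_fns = [lambda D: (D > 0).mean()] * 3
answers = private_recalibration(datasets, query_fns, 1 / 50, epsilon=1.0, rng=rng)
```

The choice of f_k and of how the calibrator maps the noisy answers to p′ is exactly the pair of design decisions the framework leaves open.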

3.3. ADAPTING EXISTING ALGORITHMS

In this section, we explain how we adapt the algorithms introduced in Section 2 to our framework. Note that many existing recalibration algorithms involve parametric optimization, and in these cases multiple iterations K are needed to search the parameter space. However, using additional iterations hurts the calibration, since a larger K increases the Laplace noise added to achieve ε/K-differential privacy per query; i.e. for any fixed Laplace noise, more queries means less privacy. Thus, we propose golden section search as a better alternative to grid search for parametric optimization, since it is more efficient at finding the extremum of a unimodal function within a specified interval and requires fewer queries. See Appendix C.1 for additional details about golden section search.

Temperature Scaling. Temperature scaling finds the temperature T in Eq. 3 that maximizes log likelihood. At each iteration k = 1, ..., K, the function f_k queries D_i for the log likelihood at some temperature, and we average the log likelihood over all the private datasets. We observe that the log likelihood is a unimodal function of the temperature in Proposition 1 (see Appendix B for the proof). Therefore, golden section search can find the maximum of this unimodal function with the fewest queries. We may refer to temperature scaling as NLL-T for brevity.

Proposition 1. For any distribution p* on X × Y where Y = {1, ..., m}, and for any set of functions l_1, ..., l_m : X → R,

E_{x,y∼p*}[log (e^{l_y(x)/T} / Σ_j e^{l_j(x)/T})]

is a unimodal function of T.

ECE Minimization (ECE-T). Instead of finding a temperature that maximizes log likelihood, we find that empirically it is often better to directly minimize the discretized ECE in Eq. 2. Adapting ECE minimization to our framework is similar to log likelihood maximization, except that we query for the quantities necessary to compute the ECE score instead of the log likelihood.
In Appendix C.2.2, we show how to compute the ECE score with as few queried quantities as possible.

Histogram Binning. Histogram binning can be adapted to the above protocol with only one iteration (K = 1). The function f_1 queries D_i for the number of correct predictions in each bin and the total number of samples in each bin. We average the query results from the different datasets. To compute the new confidence for a bin, we divide the average number of correct predictions by the average total number of samples in that bin.

4. ACCURACY TEMPERATURE SCALING

The algorithms above reveal a lot of information about the private datasets D_1, ..., D_d, so a large amount of noise must be added to guarantee differential privacy. The relative amount of noise also increases as the amount of data available decreases, as is the case when binning is used. Larger noise degrades calibration performance. To improve performance, we propose a new recalibration algorithm, accuracy temperature scaling, that acquires much less information than previous algorithms.

Our method is a form of temperature scaling that is based on a weaker notion than calibration. Let the classification accuracy and average confidence be denoted as

Acc(φ) = Pr[φ(x) = y],    Conf(p) = E[p(x)].

Acc and Conf are expectations of [0, 1]-bounded random variables, so they can be accurately estimated even from a relatively small quantity of data. We say that a classifier is consistent if Acc(φ) = Conf(p). We tune the temperature parameter in Eq. 3 until the average confidence Conf is identical to the average accuracy Acc, i.e. until consistency is achieved. We refer to our method as Acc-T for brevity. Consistency is a strictly weaker condition than calibration. Surprisingly, even when there is a lot of data and no privacy requirements, optimizing for consistency achieves performance similar to directly optimizing the ECE in our experiments, as shown in Appendix E.2.
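Without privacy, the consistency objective Acc − Conf can be sketched directly (hypothetical logits and labels; this is not the differentially private version, which Section 4.1 covers):

```python
import numpy as np

def acc_minus_conf(logits, labels, T):
    """Acc(phi) - Conf(p_T): accuracy of the argmax prediction minus the
    mean temperature-scaled confidence. Acc-T tunes T until this is ~0."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    acc = (logits.argmax(axis=1) == labels).mean()   # unaffected by T
    conf = probs.max(axis=1).mean()
    return acc - conf

rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 10)) * 3
labels = rng.integers(0, 10, size=500)
# Conf decreases monotonically in T while Acc is constant, so the
# objective increases with T and crosses zero at the consistent temperature.
assert acc_minus_conf(logits, labels, 0.2) < acc_minus_conf(logits, labels, 5.0)
```

Because the objective depends on the data only through two bounded averages, it is exactly the kind of low-sensitivity statistic the Laplace mechanism handles well.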

4.1. ACCURACY TEMPERATURE SCALING UNDER DIFFERENTIAL PRIVACY

Adapting Acc-T to our differential privacy framework is similar to doing so for temperature scaling in Section 3.3. As we show in Proposition 2 (see Appendix B for the proof), the Acc-T objective is also a unimodal function of T, so we can use golden section search to find the T that minimizes the objective function. Algorithm 1 provides the complete algorithm for Acc-T under differential privacy. On Line 2, we select the initial temperature values. Line 3 specifies a query function that the private data sources use to pool their data while respecting differential privacy. Lines 4-12 implement differentially private golden section search over the recalibration temperature parameter. The algorithm outputs a temperature value that improves the classifier's calibration on the new domain.

Proposition 2. For any distribution p* on X × Y where Y = {1, ..., m}, and for any set of functions l_1, ..., l_m : X → R, let p̄_T : x ↦ max_i e^{l_i(x)/T} / Σ_j e^{l_j(x)/T} and φ : x ↦ argmax_i l_i(x). Then

|Pr_{x,y∼p*}[φ(x) = y] − E_{p*}[p̄_T(x)]|

is a unimodal function of T.

Algorithm 1: Acc-T with differential privacy
1: Input: private datasets D_1, ..., D_d; logit functions l_1, ..., l_m : X → R; initial temperature range [T⁰_−, T⁰_+]; number of iterations K; privacy level ε. Define φ and p̄_T as in Proposition 2.
2: Set T⁰_0 = T⁰_+ − (T⁰_+ − T⁰_−) · 0.618 and T⁰_1 = T⁰_− + (T⁰_+ − T⁰_−) · 0.618.
3: For T⁰_0, set M_0 : D_i ↦ Σ_{(x_i, y_i) ∈ D_i} [I(φ(x_i) = y_i) − p̄_{T⁰_0}(x_i)] + Lap((K+1)/ε) and sample v⁰_0 = (1/d) Σ_{i=1}^d M_0(D_i). Similarly set M_1 for T⁰_1 and sample v⁰_1.
4: for k = 0, ..., K − 1 do
5:   if |v^k_0| ≥ |v^k_1| then
6:     Set T^{k+1}_+ = T^k_+, T^{k+1}_− = T^k_0, T^{k+1}_0 = T^k_1, T^{k+1}_1 = T^{k+1}_− + (T^{k+1}_+ − T^{k+1}_−) · 0.618.
7:     Set v^{k+1}_0 = v^k_1. Sample v^{k+1}_1 for T^{k+1}_1 as in Line 3.
8:   else
9:     Set T^{k+1}_− = T^k_−, T^{k+1}_+ = T^k_1, T^{k+1}_1 = T^k_0, T^{k+1}_0 = T^{k+1}_+ − (T^{k+1}_+ − T^{k+1}_−) · 0.618.
10:    Set v^{k+1}_1 = v^k_0. Sample v^{k+1}_0 for T^{k+1}_0 as in Line 3.
11:  end if
12: end for
13: Return (T^K_− + T^K_+)/2 as the optimal temperature.
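A runnable paraphrase of Algorithm 1 (a sketch under our assumptions: `rng`, the dataset format, and the per-sample sensitivity bound of 1 are ours): golden section search over T, where each evaluation of the consistency statistic is answered by every data source through the Laplace mechanism.

```python
import numpy as np

GOLD = 0.618  # (sqrt(5) - 1) / 2

def noisy_stat(datasets, T, epsilon_per_query, rng):
    """Average over sources of sum_i [I(phi(x_i)=y_i) - p_T(x_i)] + Laplace noise.
    Each sample changes the sum by at most 1, so the L1 sensitivity is 1."""
    vals = []
    for logits, labels in datasets:
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)
        conf = (np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)).max(axis=1)
        correct = (logits.argmax(axis=1) == labels)
        vals.append((correct - conf).sum() + rng.laplace(0.0, 1.0 / epsilon_per_query))
    return np.mean(vals)

def acc_t(datasets, t_lo, t_hi, K, epsilon, rng):
    """Golden section search for the consistent temperature (Algorithm 1)."""
    eps_q = epsilon / (K + 1)                      # K+1 queries in total
    t0 = t_hi - (t_hi - t_lo) * GOLD
    t1 = t_lo + (t_hi - t_lo) * GOLD
    v0 = noisy_stat(datasets, t0, eps_q, rng)
    v1 = noisy_stat(datasets, t1, eps_q, rng)
    for _ in range(K - 1):
        if abs(v0) >= abs(v1):                     # minimum lies in [t0, t_hi]
            t_lo, t0, v0 = t0, t1, v1
            t1 = t_lo + (t_hi - t_lo) * GOLD
            v1 = noisy_stat(datasets, t1, eps_q, rng)
        else:                                      # minimum lies in [t_lo, t1]
            t_hi, t1, v1 = t1, t0, v0
            t0 = t_hi - (t_hi - t_lo) * GOLD
            v0 = noisy_stat(datasets, t0, eps_q, rng)
    return (t_lo + t_hi) / 2

rng = np.random.default_rng(0)
datasets = [(rng.normal(size=(200, 5)) * 2, rng.integers(0, 5, size=200))
            for _ in range(10)]
T = acc_t(datasets, t_lo=0.5, t_hi=3.0, K=5, epsilon=1.0, rng=rng)
assert 0.5 <= T <= 3.0
```

Each source reveals only a noised scalar per query, which is why so little of the ε budget is consumed relative to methods that query per-bin statistics.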

4.2. COMPARISON

We briefly discuss how our method, Acc-T, compares to others such as histogram binning, temperature scaling, and ECE-T in terms of theoretical bias (calibration error given infinite data), worst case variance (calibration error degradation when less data is available), and adaptability to differential privacy (based on the relative amount of noise that must be added to satisfy differential privacy). Acc-T has a higher theoretical bias than the other methods, since its objective function does not directly minimize the calibration error. However, in our experiments on deep neural networks, the bias of Acc-T is only slightly worse than or comparable to that of ECE-T or temperature scaling in practice. Acc-T also has a lower worst case variance than the other methods, because it does not use binning (all data points contribute to a single statistic rather than being split across bins) and its objective function has a smaller range than that of temperature scaling. Overall, Acc-T has the highest adaptability to differential privacy; it has smaller L1 sensitivity than the other methods, so less noise is necessary for differential privacy. Additional factors that affect the calibration quality and the level of privacy are discussed in Appendix D.1.

5. EXPERIMENTS

In this section, we run an extensive series of large, controllable experiments on three datasets to compare our proposed method, Acc-T, against five baseline methods, three of which are designed with privacy concerns in mind, using the general procedure in Section 3. These benchmarks include various domain shifts and privacy settings, and our proposed Acc-T method consistently outperforms the baselines. We also extensively validate the relationship between calibration error and several relevant factors for domain shift and privacy. Additional experimental details are included in Appendix E.

5.1. EXPERIMENTAL SETUP

Methods We evaluate the differentially private versions of temperature scaling, ECE-T, histogram binning, and Acc-T over an extensive range of settings that considers calibration under various domain shifts and privacy concerns. We also include two baseline methods: (1) no calibration and (2) recalibration with only one private dataset from the target domain (so data from other sources is not used; in this case privacy constraints need not be taken into account, but less data is available).

Datasets To simulate various domain shifts, we use the ImageNet-C, CIFAR-100-C, and CIFAR-10-C datasets (Hendrycks & Dietterich, 2019), which are perturbed versions of the ImageNet (Deng et al., 2009), CIFAR-100 (Krizhevsky & Hinton, 2009), and CIFAR-10 (Krizhevsky & Hinton, 2009) test sets. Each -C dataset includes 15 perturbed versions of the original test set, with perturbations such as Gaussian noise, motion blur, JPEG compression, and fog. We divide each perturbed test set into a validation split containing the different "private data sources", each with the same number of samples, and a test split containing all of the remaining images. We then apply the recalibration algorithms over the validation split and evaluate the ECE on the test split. Note that only the unperturbed training sets were used to train the models.

Relevant factors

We evaluate the ECE for all of the methods while controlling the following three factors: (1) the number of private data sources, (2) the number of samples per data source, and (3) the privacy level ε. When we vary one factor, we keep the other two factors constant.

Additional details We use K = 5 iterations for all experiments, and report the average ECE achieved over 500 trials with randomly divided splits for each experiment. We report other experimental setup details, including the type of network used, in Appendix E.1.

5.2. RESULTS AND ANALYSIS

In Fig. 1, we plot the ECE vs. (1a) the number of private data sources, (1b) the number of samples per data source, and (1c) the privacy level ε, for the ImageNet "fog" perturbation. Fig. 2 shows a similar plot for the CIFAR-100 "jpeg compression" perturbation, and Fig. 3 shows a similar plot for the CIFAR-10 "motion blur" perturbation. Our proposed method, Acc-T, is shown in red, and clearly outperforms the other methods under the constraints of differential privacy for these ranges of values. Full plots for all perturbations and datasets are included in Appendix E.3. Table 1 shows the overall median and mean ECE achieved by each recalibration method on ImageNet, CIFAR-100, and CIFAR-10. These averages are computed over all perturbations, numbers of private data sources, numbers of samples per source, and ε settings from the suite of experiments in Appendix E.3. Our method, Acc-T, far outperforms the other methods in the domain-shifted differential privacy setting. The performance of all recalibration algorithms degrades when subjected to the constraints of differential privacy, but some are affected more than others in a given situation. Selecting a differentially private recalibration algorithm for a particular situation thus requires some consideration. To this end, we provide some analysis of these methods under the three relevant factors.

Number of Private Data Sources

As the number of sources increases, Acc-T tends to do well, even when the number of samples per source is small. Because Acc-T does not involve binning and the sensitivity of its objective function is small, there is relatively less noise for this method than for others. Therefore, it can effectively combine data from multiple sources even under the constraints of differential privacy, and is the best method in general.

Number of Samples Per Source

As the number of samples per source increases, Acc-T tends to do well given enough data sources. As the number of samples per source grows towards infinity, recalibration with only one source works very well, since we do not need to query other sources or apply privacy constraints. Histogram binning and ECE-T may also perform quite well with many bins when the number of samples is very large.

Privacy Level

When ε is very low (i.e. the privacy requirements are very strong), recalibrating with only one data source works well; this method remains unaffected by the strong privacy constraints, while all other methods worsen drastically due to the increased noise. For mid-range ε values, Acc-T works well. When ε is very high, ECE-T can work well, since privacy is not much of a concern.

A ADDITIONAL BACKGROUND INFORMATION

A.1 COMPUTATION OF ECE

To compute the ECE, discretization is necessary. We first divide [0, 1] into bins c = (c_1, ..., c_k) such that 0 < c_1 < ... < c_k = 1, and then compute the average accuracy Acc and the average confidence Conf in each bin (for convenience, denote c_0 = 0):

Acc(f, c, i) = Pr[f(x) = y | p(x) ∈ [c_{i−1}, c_i)]
Conf(f, c, i) = E[p(x) | p(x) ∈ [c_{i−1}, c_i)]

Then the ECE defined in Eq. 2 can be approximated by the discretized version

ECE(f, p) ≈ ECE(f, p; c) := Σ_{i=1}^k Pr[p(x) ∈ [c_{i−1}, c_i)] · |Acc(f, c, i) − Conf(f, c, i)|.

Given empirical data D = {x_{1:n}, y_{1:n}}, we can estimate ECE(f, p; c) as

ECE(f, p; c) ≈ ÊCE(f, p; c, D) := Σ_{i=1}^k (1/n) |Σ_{j : p(x_j) ∈ [c_{i−1}, c_i)} (I(f(x_j) = y_j) − p(x_j))|.

Note that there are two approximations: we first discretize the ECE, and then use finite data to approximate the discretized expression:

ECE(f, p) ≈ ECE(f, p; c) ≈ ÊCE(f, p; c, D).

In practice, if the first approximation is better (more bins are used), then the second approximation must be worse (there is less data in each bin) (Kumar et al., 2019). In other words, with finite data, there is a tradeoff between calibration error and estimation error. Note that newer estimators, e.g. Kumar et al. (2019), can measure the ECE more accurately, particularly when there are more bins.
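The empirical estimator above can be sketched with equal-width bins (variable names are ours; the toy data is hypothetical):

```python
import numpy as np

def binned_ece(confidences, correct, k=15):
    """Empirical ECE estimate: sum over bins of (1/n)|sum_{x in bin} [I(correct) - p(x)]|."""
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, k + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences >= lo) & (confidences < hi)
        if hi == 1.0:                           # last bin is closed: [1 - 1/k, 1]
            in_bin |= confidences == 1.0
        if in_bin.any():
            ece += abs((correct[in_bin] - confidences[in_bin]).sum()) / n
    return ece

# A perfectly calibrated toy example: confidence matches per-bin accuracy.
conf = np.array([0.8] * 10)
corr = np.array([1] * 8 + [0] * 2)              # 80% correct at confidence 0.8
assert abs(binned_ece(conf, corr)) < 1e-9
```

Increasing k tightens the discretization but leaves fewer samples per bin, which is exactly the tradeoff described above.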

A.2 LAPLACE MECHANISM PROOF

Theorem 1. The Laplace mechanism M_L satisfies ε-differential privacy (Dwork et al., 2014).

Proof. Let D, D′ ∈ N^{|X|} be databases with ||D − D′||_1 ≤ 1, and let p_D denote the probability density function of M_L(D; f, ε). For any point x ∈ R^z,

p_D(x) / p_{D′}(x) = Π_{i=1}^z [exp(−ε|f(D)_i − x_i| / ∆f) / exp(−ε|f(D′)_i − x_i| / ∆f)]
= Π_{i=1}^z exp(ε(|f(D′)_i − x_i| − |f(D)_i − x_i|) / ∆f)
≤ Π_{i=1}^z exp(ε|f(D)_i − f(D′)_i| / ∆f)
= exp(ε · ||f(D) − f(D′)||_1 / ∆f)
≤ exp(ε),

where the first inequality follows from the triangle inequality, and the second inequality follows from the definition of sensitivity (Dwork et al., 2014).

B PROOFS

Proof of Proposition 1. We differentiate the expected log-likelihood with respect to T:

$$\frac{\partial}{\partial T}\,\mathbb{E}_{x,y\sim p^*}\left[\log\frac{e^{l_y(x)/T}}{\sum_j e^{l_j(x)/T}}\right] = \mathbb{E}_{x,y\sim p^*}\left[\frac{\partial}{\partial T}\,\frac{l_y(x)}{T} - \frac{\partial}{\partial T}\log\sum_j e^{l_j(x)/T}\right]$$

$$= \mathbb{E}_{x,y\sim p^*}\left[-\frac{l_y(x)}{T^2} + \frac{\sum_j e^{l_j(x)/T}\,l_j(x)/T^2}{\sum_j e^{l_j(x)/T}}\right] = \frac{1}{T^2}\,\mathbb{E}_{x,y\sim p^*}\left[-l_y(x) + \frac{\sum_j l_j(x)\,e^{l_j(x)/T}}{\sum_j e^{l_j(x)/T}}\right]$$

Let us set the derivative equal to 0. Suppose there are multiple solutions T_1 < T_2; this implies that

$$\mathbb{E}_{x,y\sim p^*}\left[\frac{\sum_j l_j(x)\,e^{l_j(x)/T_1}}{\sum_j e^{l_j(x)/T_1}}\right] = \mathbb{E}_{x,y\sim p^*}\left[\frac{\sum_j l_j(x)\,e^{l_j(x)/T_2}}{\sum_j e^{l_j(x)/T_2}}\right]. \quad (5)$$

The function $\mathbb{E}_{x,y\sim p^*}\!\left[\sum_j l_j(x)e^{l_j(x)/T} \big/ \sum_j e^{l_j(x)/T}\right]$ is monotonically non-increasing in T. Therefore, if there are 0 or 1 solutions to Eq. 5, the original function must be unimodal. If there are at least 2 solutions T_1 < T_2, then $\mathbb{E}_{x,y\sim p^*}\!\left[\sum_j l_j(x)e^{l_j(x)/T} \big/ \sum_j e^{l_j(x)/T}\right]$ must be constant for all T ∈ [T_1, T_2], which implies that $\mathbb{E}_{x,y\sim p^*}\!\left[e^{l_y(x)/T} \big/ \sum_j e^{l_j(x)/T}\right]$ is constant for T ∈ [T_1, T_2]. This further implies that $\mathbb{E}_{x,y\sim p^*}\!\left[e^{l_y(x)/T} \big/ \sum_j e^{l_j(x)/T}\right]$ is constant for all T ∈ ℝ, which is also unimodal.

Proof of Proposition 2. Because p_T(x) is a monotonically decreasing function of T, E_{p*}[p_T(x)] is also a monotonically decreasing function of T. This means that Pr_{x,y∼p*}[f(x) = y] − E_{p*}[p_T(x)] is a monotonically increasing function of T. The absolute value of a monotonic function must be monotonic or unimodal.

C ADDITIONAL DETAILS FOR SECTION 3

C.1 GOLDEN SECTION SEARCH

The golden section search is an algorithm for finding the extremum of a unimodal function within a specified interval. It is an iterative method that reduces the search interval with each iteration. The algorithm is described below for a minimization problem, but it works for maximization problems as well.

1. Specify the function to be minimized, g(·), and an interval [T_min, T_max] over which to minimize it.

2. Select two interior points T_1 < T_2 such that T_1 = T_max − ((√5 − 1)/2)(T_max − T_min) and T_2 = T_min + ((√5 − 1)/2)(T_max − T_min). Evaluate g(T_1) and g(T_2).

3. If g(T_1) > g(T_2), determine a new T_min, T_1, T_2, T_max as follows:
   T_min ← T_1, T_max ← T_max, T_1 ← T_2, T_2 ← T_min + ((√5 − 1)/2)(T_max − T_min).
   If g(T_1) < g(T_2), determine a new T_min, T_1, T_2, T_max as follows:
   T_min ← T_min, T_max ← T_2, T_2 ← T_1, T_1 ← T_max − ((√5 − 1)/2)(T_max − T_min).
   Note that in either case, only one new function evaluation is performed.

4. If the interval is sufficiently small, i.e. T_max − T_min < δ, then the minimum occurs at (T_min + T_max)/2. Otherwise, repeat Step 3.
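The steps above can be sketched in a few lines of Python (function and variable names are our own, not from the paper's code):

```python
import math

def golden_section_search(g, t_min, t_max, n_iters=50, delta=1e-6):
    """Minimize a unimodal function g over [t_min, t_max] via golden section search."""
    phi = (math.sqrt(5) - 1) / 2  # inverse golden ratio, ~0.618
    t1 = t_max - phi * (t_max - t_min)
    t2 = t_min + phi * (t_max - t_min)
    g1, g2 = g(t1), g(t2)
    for _ in range(n_iters):
        if t_max - t_min < delta:
            break
        if g1 > g2:
            # Minimum lies in [t1, t_max]; reuse t2 as the new t1.
            t_min, t1, g1 = t1, t2, g2
            t2 = t_min + phi * (t_max - t_min)
            g2 = g(t2)  # only one new evaluation per iteration
        else:
            # Minimum lies in [t_min, t2]; reuse t1 as the new t2.
            t_max, t2, g2 = t2, t1, g1
            t1 = t_max - phi * (t_max - t_min)
            g1 = g(t1)
    return (t_min + t_max) / 2
```

Because each iteration shrinks the interval by a constant factor of (√5 − 1)/2 at the cost of a single function evaluation, this search needs far fewer queries than a grid search of comparable resolution, which matters once every query consumes privacy budget.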

C.2 ADAPTING EXISTING RECALIBRATION METHODS FOR DIFFERENTIAL PRIVACY

In this section, we go into more detail about how to adapt several existing recalibration algorithms for the differential privacy setting with our framework.

C.2.1 TEMPERATURE SCALING

Temperature scaling optimizes the temperature parameter T under the negative log-likelihood (NLL) loss, and thus requires multiple iterations of golden section search to query the databases D_i at different temperature values. In this case, the objective function g(·) is the NLL loss over all samples. In the standard formulation, the overall loss is the average of the per-sample NLL losses, but summing these losses within each database rather than averaging is equivalent up to a constant scale factor (the total number of samples in the database). Thus, the function f_k queries each D_i for its summed NLL loss. The sensitivity ∆f is technically infinite, since the range of the NLL function is unbounded, but in practice we can choose some sufficiently large value (we chose ∆f = 10, since that was approximately the largest NLL value we observed among the images that we checked). We chose T_min = 0.5 and T_max = 3.0, since empirically the optimal temperature always fell within this range, and used K = 5 iterations. To aggregate information from the different D_i, we simply average M_k(D_1), …, M_k(D_d). The new classifier (φ, p′) outputs probabilities that are recalibrated with the (noisy) optimal temperature.
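As a concrete illustration of the query f_k and its privatized response M_k for this method, here is a minimal NumPy sketch (the function names are ours; the clipping constant matches the ∆f = 10 discussed above):

```python
import numpy as np

def summed_nll(logits, labels, T, clip=10.0):
    """f_k: the summed, clipped NLL of one private source at temperature T.
    Clipping each per-sample loss at `clip` bounds the sensitivity by clip."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return np.minimum(nll, clip).sum()

def private_response(logits, labels, T, eps, K=5, clip=10.0, rng=None):
    """M_k: Laplace mechanism with scale K * clip / eps, so that the K
    adaptive queries jointly satisfy eps-differential privacy."""
    rng = rng or np.random.default_rng()
    return summed_nll(logits, labels, T, clip) + rng.laplace(0.0, K * clip / eps)
```

The calibrator would call `private_response` once per golden-section iteration at each candidate temperature, average the responses across sources, and minimize the result.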

C.2.2 TEMPERATURE SCALING BY ECE MINIMIZATION

The standard recalibration objective when applying temperature scaling is to maximize the log likelihood of a validation dataset; this objective appears in both recent papers (Guo et al., 2017) and established textbooks (Smola et al., 2000). An alternative, but surprisingly overlooked, objective is to minimize the discretized ECE directly. To adapt this method to differential privacy, we must again use multiple iterations of golden section search to query the databases D_i at different temperature values. Here we want to find the temperature that minimizes the discretized ECE:

$$\min_T \sum_{\text{bins}} |\text{Acc} - \text{Conf}| \cdot pr \;=\; \min_T \sum_{\text{bins}} \left|\frac{n_{\text{correct}}}{n_{\text{bin}}} - \frac{\sum_i c_i}{n_{\text{bin}}}\right| \cdot \frac{n_{\text{bin}}}{n_{\text{total}}} \quad (6)$$

where pr is the proportion of samples in the bin, n_correct is the number of correct predictions in the bin, n_bin is the total number of samples in the bin, Σ_i c_i is the sum of the confidence scores of all samples in the bin, and n_total is the total number of samples across all bins. Simplifying Eq. 6 and ignoring n_total as a constant, our objective function g(·) becomes

$$g(\phi, T) = \sum_{\text{bins}} \left|n_{\text{correct}} - \sum_i c_i\right|$$

The function f_k queries each D_i for the quantity (n_correct − Σ_i c_i) in each bin. The sensitivity ∆f = 1, since this quantity can change by at most 1 with the addition or removal of one sample from a database. We use T_min = 0.5, T_max = 3.0, and K = 5 iterations. We use 15 bins (since we also evaluate the discretized ECE with 15 bins), so the M_k are vectors in R^15. To aggregate information from the different D_i, we average M_k(D_1), …, M_k(D_d), take the absolute value of this average, and then sum the resulting vector over all bins. In the absence of noise, this aggregation yields the correct overall g(·) exactly, using all samples from all sources. The new classifier (φ, p′) outputs probabilities that are recalibrated with the (noisy) optimal temperature.

Unsurprisingly, ECE-T performs very well without the constraints of differential privacy, so this method may be a good choice when ε is high.
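The per-bin query and the aggregation described above can be sketched as follows (names are ours; we assume confidences in [0, 1] and 15 equal-width bins):

```python
import numpy as np

def bin_query(confidences, correct, n_bins=15):
    """f_k: for each equal-width bin, return n_correct - sum(c_i).
    Adding or removing one sample changes exactly one coordinate by at
    most |ok - c| <= 1, so the sensitivity is Delta f = 1."""
    bins = np.minimum((np.asarray(confidences) * n_bins).astype(int), n_bins - 1)
    out = np.zeros(n_bins)
    for b, conf, ok in zip(bins, confidences, correct):
        out[b] += float(ok) - conf
    return out

def g_from_responses(responses):
    """Aggregate the (possibly noisy) vectors from all sources: average them,
    take absolute values, and sum over bins to obtain g(phi, T)."""
    return np.abs(np.mean(responses, axis=0)).sum()
```

Note that the absolute value must be taken after averaging across sources, so that signed per-bin errors from different sources can cancel exactly as they would in a pooled computation.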

C.2.3 HISTOGRAM BINNING

Histogram binning is a relatively simple, non-parametric recalibration method that can be adapted to differential privacy with a single iteration (i.e. K = 1). The function f_1 queries each D_i for the number of correct predictions in each bin and the total number of samples in each bin. ∆f = 2, because if exactly one entry is added to or removed from a database, the number of correct predictions can change by at most 1 in exactly one bin, and the total number of samples can change by at most 1 in exactly one bin. We use 15 bins in our experiments, so the M_k are vectors in R^30. To aggregate information from the different D_i, we average M_1(D_1), …, M_1(D_d). The new confidence for each bin is the average number of correct predictions divided by the average total number of samples for that bin.

D ADDITIONAL DETAILS FOR SECTION 4

D.1 FACTORS THAT AFFECT CALIBRATION QUALITY AND PRIVACY

Factor (increase)  | Calibration quality            | Privacy preservation
↑ Data             | better                         | better
↑ Iterations       | better w/o DP, worse with DP   | worse
↑ Bins             | better (with enough data)      | may be worse
↑ Sensitivity      | worse                          | worse
↑ ε                | better                         | worse

Table 2 shows several factors and hyperparameter choices that affect the calibration quality and the level of privacy for all recalibration methods. More data improves both calibration and privacy. More iterations improve calibration when privacy is not required (e.g. running more iterations of gradient descent), but hurt privacy (making multiple queries in a parametric optimization setting with the same amount of added noise increases ε). Using more bins, for methods that involve binning, improves calibration when enough data is available, but may hurt privacy. Higher sensitivity of the f_k function hurts privacy, and a higher ε represents less privacy. We discuss each of these in more detail below.

Data

Differentially private recalibration algorithms require sufficient data in order to work well. We cannot trivially combine data from different private datasets, because each dataset holder must honor its agreement with the individuals whose information is in that dataset. Our framework describes a method for pooling data from different private datasets while allowing each one to respect differential privacy for its users, which is necessary for improved calibration while preserving privacy.

Number of iterations

For parametric optimization recalibration methods, multiple iterations are generally needed to search the parameter space. Additional iterations improve calibration when differential privacy is not required (e.g. running more iterations of gradient descent), but hurt calibration when it is. With multiple iterations, a worst-case bound on the overall L_1 sensitivity of f_k is K times the sensitivity of a single query, ∆f_single, since a single database entry may change the response to each query by up to ∆f_single. Thus, the noise added to the true query responses must follow a Lap(0, K · ∆f_single/ε) distribution to maintain ε-differential privacy. Because using more iterations increases the amount of noise added, it is best to search the parameter space while minimizing the number of iterations needed for the desired granularity. We use golden section search to do this. Each iteration of the golden section search narrows the range of possible values of the extremum but increases the amount of noise added to the data; in general, we select K such that the granularity and the noise are balanced.

Binning

Several of the recalibration methods discussed use binning, where all of the confidence estimates are divided into mutually exclusive bins. Without differential privacy, using more bins generally improves calibration when a lot of data is available (i.e. above a "data threshold"), but hurts calibration below this threshold: when not enough data is available, using more bins increases the estimation error, since there are too few samples in each bin. In the differential privacy setting, using more bins may degrade calibration. In this setting, one query may request a summary statistic from each bin. Because a single database entry can fall in exactly one bin, the remaining bins are unaffected and the sensitivity does not increase with more bins. However, although the number of bins does not affect the absolute amount of noise, it can affect the relative amount: when more bins are used, there are fewer elements in each bin on average, so the summary statistics involved tend to be smaller and the noise is relatively larger. Note that when multiple equal-width bins are involved, as in temperature scaling by ECE minimization (see Section C.2.2), the optimization problem may not be strictly unimodal, since samples can change bins as the temperature changes. Using bins with equal numbers of samples, rather than equal widths, ensures unimodality in temperature scaling but makes it difficult to combine information from different private data sources (since different sources will have different bin endpoints). Thus, we elected to use equal-width bins in our experiments. Although the function to be minimized is not necessarily unimodal, it is generally a close enough approximation that golden section search returns reasonably good results with few queries, and it empirically performs better than grid search.

Sensitivity of f_k

An f_k function with a large range has a detrimental effect on the amount of noise added. For instance, the range of the negative log-likelihood is technically infinite (although in practice we clip it at some sufficiently large value). Thus, the sensitivity of a method with the negative log-likelihood in its objective function is quite high, and the amount of noise needed to preserve differential privacy is large.

ε value

Calibration is worse when ε is smaller, i.e. when stronger differential privacy constraints impose a higher privacy level.
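The composition argument above can be made concrete with a two-line helper (a sketch under our own naming; recall that Lap(0, b) has standard deviation b√2):

```python
import math

def laplace_scale(delta_f_single, eps, K):
    """Per-query Laplace scale when the eps budget is split across K adaptive
    queries: b = delta_f_single / (eps / K) = K * delta_f_single / eps."""
    return K * delta_f_single / eps

def laplace_std(delta_f_single, eps, K):
    """Noise standard deviation b * sqrt(2): grows linearly with K."""
    return math.sqrt(2) * laplace_scale(delta_f_single, eps, K)
```

For example, with ∆f_single = 10 (the clipped NLL) and ε = 1, moving from K = 1 to K = 5 queries multiplies the per-query noise standard deviation by 5, which is exactly why a query-efficient search such as golden section is preferable to a fine grid.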

E.1 EXPERIMENTAL SETUP

We simulated the problem of recalibration with multiple private datasets on domain-shifted data using the ImageNet-C, CIFAR-100-C, and CIFAR-10-C datasets (Hendrycks & Dietterich, 2019), which are perturbed versions of the ImageNet (Deng et al., 2009), CIFAR-100 (Krizhevsky & Hinton, 2009), and CIFAR-10 (Krizhevsky & Hinton, 2009) test sets, respectively. We randomly divided each perturbed test set into n_sources validation sets of size n_samples and a test set comprising the remaining images, where n_sources is the number of private data sources and n_samples is the number of samples per source. We computed each ECE value by binning with 15 equal-width bins.

For ImageNet, we varied the number of private data sources from 100 to 2000 in steps of 100, with 10 samples per data source and ε = 1. We varied the number of samples per data source from 5 to 100 in steps of 5, with 100 private data sources and ε = 1. We varied ε from 0.2 to 2.0 in steps of 0.2, with 50 samples per data source and 100 private data sources. For CIFAR-100 and CIFAR-10, we varied the number of private data sources from 10 to 250 in steps of 10, with 10 samples per data source and ε = 1. We varied the number of samples per data source from 5 to 50 in steps of 5, with 50 private data sources and ε = 1. We varied ε from 0.2 to 2.0 in steps of 0.2, with 30 samples per data source and 50 private data sources. We used K = 5 iterations for all experiments, and report the average ECE over 500 randomly divided trials for each experiment.

All models were trained on only the unperturbed training sets. For ImageNet, we trained a ResNet-50 network (He et al., 2015) for 90 epochs with an SGD optimizer (Sutskever et al., 2013), an initial learning rate of 0.1, and a cosine annealing learning rate schedule (Loshchilov & Hutter, 2016). For CIFAR-100 and CIFAR-10, we trained Wide ResNet-28-10 networks (Zagoruyko & Komodakis, 2016) for 200 epochs with an SGD optimizer, an initial learning rate of 0.1, and again a cosine annealing schedule. For each dataset, we tested both the unperturbed accuracy and the perturbed accuracy on each of the 15 perturbation types in (Hendrycks & Dietterich, 2019) at multiple severity levels to ensure sharpness; these accuracy tables can be found in E.2.

E.2 EXPERIMENTS WITHOUT DIFFERENTIAL PRIVACY CONSTRAINTS

Table 3 shows the classification accuracy achieved by our models on each of the 15 perturbations of the CIFAR-10-C, CIFAR-100-C, and ImageNet-C test sets, as well as on the unperturbed test sets. Note that the models are trained only on unperturbed training data. The accuracies achieved are in line with reported state-of-the-art numbers. Tables 4, 5, and 6 summarize our calibration results without differential privacy constraints for CIFAR-10, CIFAR-100, and ImageNet, respectively. Our Acc-T algorithm generally improves the model's calibration compared to the standard temperature scaling method NLL-T. Despite its simplicity, Acc-T also performs on par with ECE-T, generally achieving similar ECEs, even when privacy is not required.

E.3 EXPERIMENTS WITH DIFFERENTIAL PRIVACY CONSTRAINTS

The figures in this section show recalibration results for ImageNet, CIFAR-100, and CIFAR-10 under the highest severity perturbations. In the left panel of each figure, we vary the number of private data sources; in the middle panel, we vary the number of samples per data source; and in the right panel, we vary the privacy level ε. Our method, Acc-T, generally does best in these settings.

IMAGENET RESULTS

ImageNet, unperturbed

We note that using different clipping thresholds for NLL-T (where the clipped NLL loss is min(clipping threshold, NLL)) can slightly affect its performance. In practice, selecting the optimal clipping threshold would violate differential privacy, because doing so requires access to the labeled test data. However, even under the most favorable threshold, Acc-T significantly outperforms NLL-T. In Fig. 52, we show an example of NLL-T performance at different clipping thresholds for CIFAR-10 under the "snow" perturbation with a perturbation severity of 1. In this case, using the optimal clipping threshold would improve performance by 0.7% over using a clipping threshold of 10, and this improvement comes at the cost of violating privacy.

Finally, Table 7 shows the overall median and mean ECE achieved by each recalibration method on CIFAR-100 with a perturbation severity of 1 (the lowest perturbation level). These averages are computed over all perturbations, numbers of private data sources, numbers of samples per source, and ε settings from the suite of experiments. Comparing these results to those shown in Table 1, which used a perturbation severity of 5 (the highest level), we see that the overall calibration improves for all methods when the degree of domain shift is lower, but our proposed algorithm, Acc-T, still outperforms the other methods.



ACCURACY TEMPERATURE SCALING

When we add Laplace noise according to Eq. 4, the added noise increases with the number of iterations K and the L_1 sensitivity of the query functions f_k. In other words, when we adapt a calibration algorithm to our framework, we need to add more noise if the original algorithm gains a

CONCLUSION

Simultaneously addressing the challenges of calibration, domain shift, and privacy is extremely important in many environments. In this paper, we introduced a framework for recalibration on domain-shifted data under the constraints of differential privacy. Within this framework, we designed a novel algorithm to handle all three challenges. Our method demonstrated impressive performance across a wide range of settings on a large suite of benchmarks. In future work, we are interested in investigating recalibration under different types of privacy mechanisms.



where each entry D_x represents the number of elements in the database that take the value x ∈ X. A randomized algorithm M is one that takes an input D ∈ N^|X| and (stochastically) outputs some value M(D) = b for b ∈ Range(M).

Definition 1. Let M be a randomized function M : N^|X| → Range(M). We say that M is ε-differentially private if for all S ⊆ Range(M) and for any two databases D, D′ ∈ N^|X| that differ by only one element, i.e. ‖D − D′‖_1 ≤ 1, we have

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S].
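To make Definition 1 concrete, the following sketch (entirely our own example, not from the paper) checks the density-ratio form of the guarantee for the Laplace mechanism applied to a counting query with sensitivity 1:

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution Lap(mu, b)."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

# A counting query f(D) = |D| has sensitivity 1, so adding Lap(0, 1/eps) noise
# makes it eps-differentially private: for neighboring databases with
# f(D) = 5 and f(D') = 6, the output densities differ by at most e^eps
# at every point.
eps = 0.5
b = 1.0 / eps
for x in (-3.0, 0.0, 2.7, 5.5, 9.0):
    ratio = laplace_pdf(x, 5.0, b) / laplace_pdf(x, 6.0, b)
    assert math.exp(-eps) - 1e-12 <= ratio <= math.exp(eps) + 1e-12
```

The bound follows because |x − 6| − |x − 5| lies in [−1, 1] for every x, so the log-ratio of the two densities is at most (sensitivity)/b = ε in magnitude.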

(a) [Calibrator:] The calibrator designs a function f_k : N^|X| → R^s, where s ∈ N. For each i = 1, …, d, the calibrator sends the function f_k to private data source i.
(b) [Private Data Sources:] For each i = 1, …, d, the i-th private data source uses the Laplace mechanism in Eq. 4 to convert f_k into an M_k that satisfies ε/K-differential privacy, and sends M_k(D_i) back to the calibrator.
4. [Calibrator:] Output a new classifier (φ, p′) based on the responses M_k(D_i).
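One round of this protocol can be sketched as follows (a simplified illustration with our own names; `f_k` here stands for an arbitrary vector-valued query whose L_1 sensitivity is `delta_f`):

```python
import numpy as np

def laplace_mechanism(f, D, delta_f, eps, rng):
    """Convert the query f into an eps-DP response by adding Lap(0, delta_f/eps)
    noise to each coordinate (the mechanism referred to as Eq. 4 in the text)."""
    value = np.asarray(f(D), dtype=float)
    return value + rng.laplace(0.0, delta_f / eps, size=value.shape)

def run_round(f_k, sources, delta_f, eps, K, rng):
    """One framework iteration: each of the d private sources answers f_k under
    (eps/K)-differential privacy, and the calibrator averages the responses."""
    responses = [laplace_mechanism(f_k, D, delta_f, eps / K, rng) for D in sources]
    return np.mean(responses, axis=0)
```

Each source only ever releases noisy query responses, so the calibrator (and anyone observing the communication) learns nothing beyond what ε-differential privacy permits, regardless of how the aggregated responses are later used.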

Figure 1: Recalibration results for ImageNet under the "fog" perturbation, with varying (1a) number of private data sources, (1b) number of samples per data source, and (1c) privacy level ε. Acc-T does best in these settings.

preserves ε-differential privacy.

Proof. Let D ∈ N^|X| and D′ ∈ N^|X| be two databases that differ by at most one element, i.e. ‖D − D′‖_1 ≤ 1. Let f : N^|X| → R^z, and let p_D and p_{D′} denote the probability density functions of M_L(D; f, ε) and M_L(D′; f, ε), respectively. Then we can take the ratio of p_D to p_{D′} at an arbitrary point x ∈ R^z:

[Figure panels omitted in extraction; the panel titles read "Sources = 100, ε = 1.0" and "Sources = 50, ε = 1.0", with axis ticks from 0.25 to 2.00.]

Figure 52: Recalibration results for CIFAR-10 under the "snow" perturbation with a perturbation severity of 1, with different clipping thresholds. Number of sources = 50, number of samples per source = 30, and ε = 1.0. Acc-T significantly outperforms NLL-T regardless of the clipping threshold.

Median and mean expected calibration error (ECE) achieved for domain-shifted data under differential privacy. Columns from left to right show the median/mean ECE achieved over all perturbations, numbers of private data sources, numbers of samples per source, and ε for ImageNet, CIFAR-100, and CIFAR-10. Best calibration is shown in bold.

Table 2: The impact of various factors on recalibration quality and privacy preservation.

Table 4: Expected calibration error (ECE) on CIFAR-10 without privacy constraints is shown. Columns from left to right show ECE for the baseline without calibration, recalibration with temperature scaling by minimizing the negative log likelihood, recalibration with temperature scaling by matching predictive confidence to accuracy (our method), and recalibration by minimizing the ECE directly. Rows indicate the type of perturbation applied.

Table 5: Expected calibration error (ECE) on CIFAR-100 without privacy constraints is shown. Columns from left to right show ECE for the baseline without calibration, recalibration with temperature scaling by minimizing the negative log likelihood, recalibration with temperature scaling by matching predictive confidence to accuracy (our method), and recalibration by minimizing the ECE directly. Rows indicate the type of perturbation applied. These results correspond to a perturbation severity of 5. Best calibration for each perturbation is shown in bold.

Table 6: Expected calibration error (ECE) on ImageNet without privacy constraints is shown. Columns from left to right show ECE for the baseline without calibration, recalibration with temperature scaling by minimizing the negative log likelihood, recalibration with temperature scaling by matching predictive confidence to accuracy (our method), and recalibration by minimizing the ECE directly. Rows indicate the type of perturbation applied.

