A CLOSER LOOK AT THE CALIBRATION OF DIFFERENTIALLY PRIVATE LEARNERS

Abstract

We systematically study the calibration of classifiers trained with differentially private stochastic gradient descent (DP-SGD) and observe miscalibration across a wide range of vision and language tasks. Our analysis identifies per-example gradient clipping in DP-SGD as a major cause of miscalibration, and we show that existing approaches for improving calibration with differential privacy only provide marginal improvements in calibration error while occasionally causing large degradations in accuracy. As a solution, we show that differentially private variants of post-processing calibration methods such as temperature scaling and Platt scaling are surprisingly effective and have negligible utility cost to the overall model. Across 7 tasks, temperature scaling and Platt scaling with DP-SGD result in an average 3.1-fold reduction in the in-domain expected calibration error while incurring at most a minor drop in accuracy.

1. INTRODUCTION

Modern deep learning models tend to memorize their training data in order to generalize better (Zhang et al., 2021; Feldman, 2020), posing great privacy challenges in the form of training data leakage or membership inference attacks (Shokri et al., 2017; Hayes et al., 2017; Carlini et al., 2021). To address these concerns, differential privacy (DP) has become a popular paradigm for providing rigorous privacy guarantees when performing data analysis and statistical modeling based on private data. In practice, a commonly used DP algorithm to train machine learning (ML) models is DP-SGD (Abadi et al., 2016). The algorithm involves clipping per-example gradients and injecting noise into parameter updates during the optimization process. Although DP-SGD can give strong privacy guarantees, prior works have identified that this privacy comes at the cost of other aspects of trustworthy ML, such as degraded accuracy and disparate impact (Bagdasaryan et al., 2019; Feldman, 2020; Sanyal et al., 2022). These tradeoffs pose a challenge for privacy-preserving ML, as they force practitioners to make difficult decisions on how to weigh privacy against other key aspects of trustworthiness. In this work, we expand the study of privacy-related tradeoffs by characterizing and proposing mitigations for the privacy-calibration tradeoff. The tradeoff is significant, as assessing model uncertainty is important for deploying models in safety-critical scenarios like healthcare and law, where explainability (Cosmides & Tooby, 1996) and risk control (Van Calster et al., 2019) are needed in addition to privacy (Knolle et al., 2021). The existence of such a tradeoff may be surprising, as we might expect differentially private training to improve calibration by preventing models from memorizing training examples and promoting generalization (Dwork et al., 2015; Bassily et al., 2016; Kulynych et al., 2022).
Moreover, training with modern pre-trained architectures shows a strong positive correlation between calibration and classification error (Minderer et al., 2021), and differentially private training based on pre-trained models is increasingly performant (Tramer & Boneh, 2021; Li et al., 2022b; De et al., 2022). However, we find that DP training has the surprising effect of consistently producing over-confident prediction scores in practice (Bu et al., 2021). We show an example of this phenomenon in a simple 2D logistic regression problem (Fig. 1). We find a polarization phenomenon, where the DP-trained model achieves similar accuracy to its non-private counterpart, but its confidences are clustered around either 0 or 1. As we will see later, the polarization insight conveyed by this motivating example transfers to more realistic settings. Our first contribution quantifies existing privacy-calibration tradeoffs for state-of-the-art models that leverage DP training and pre-trained backbones such as RoBERTa (Liu et al., 2019b) and vision transformers (ViT) (Dosovitskiy et al., 2020). Although there have been some studies of miscalibration for differentially private learning (Bu et al., 2021; Knolle et al., 2021), they focus on simple tasks (e.g., MNIST, SNLI) with relatively small neural networks trained from scratch. Our work shows that miscalibration problems persist even for state-of-the-art private models with accuracies approaching or matching their non-private counterparts. Through controlled experiments, we show that these calibration errors are unlikely to be due solely to the regularization effects of DP-SGD, and are more likely caused by the per-example gradient clipping operation in DP-SGD. Our second contribution shows that the privacy-calibration tradeoff can be easily addressed through differentially private variants of temperature scaling (DP-TS) and Platt scaling (DP-PS).
To enable these modifications, we provide a simple privacy accounting analysis, proving that DP-SGD based recalibration on a held-out split does not incur additional privacy costs. Through extensive experiments, we show that DP-TS and DP-PS effectively prevent DP-trained models from being overconfident and give a 3.1-fold reduction in in-domain calibration error on average, substantially outperforming more complex interventions that have been claimed to improve calibration (Bu et al., 2021; Knolle et al., 2021). Our miscalibration findings are closely related to the privacy-fairness tradeoff, which has already received substantial attention. For example, per-example gradient clipping has been shown to exacerbate accuracy disparity (Tran et al., 2021; Esipova et al., 2022). Some fairness notions also require calibrated predictions, such as calibration over demographic groups (Pleiss et al., 2017; Liu et al., 2019a) or over a rich class of structured "identifiable" subpopulations (Hébert-Johnson et al., 2018; Kim et al., 2019). Our work expands the understanding of tradeoffs between privacy and other aspects of trustworthiness by characterizing privacy-calibration tradeoffs.

2. RELATED WORK

Calibration. Calibrated probability estimates match the true empirical frequencies of an outcome, and calibration is often used to evaluate the quality of uncertainty estimates provided by ML models. Recent works have observed that highly-accurate models that leverage pre-training are often well-calibrated (Hendrycks et al., 2019; Desai & Durrett, 2020; Minderer et al., 2021; Kadavath et al., 2022). However, we find that even pre-trained models are poorly calibrated when they are fine-tuned using DP-SGD. Our work is not the first to study calibration under learning with DP, but we provide a more comprehensive characterization of privacy-calibration tradeoffs, along with solutions that are both simpler and more effective. Luo et al. (2020) studied private calibration for out-of-domain settings, but did not study whether DP-SGD causes miscalibration in-domain. Angelopoulos et al. (2021) modified split conformal prediction to be privacy-preserving, but they only studied vision models, and their private models suffer a substantial performance decrease compared to non-private ones. They also did not study the miscalibration of private models or the causes of the privacy-calibration tradeoff. Knolle et al. (2021) studied miscalibration, but only on MNIST and a small pneumonia dataset. Our work provides a more comprehensive characterization across more realistic datasets, and our comparisons show that our recalibration approach is consistently more effective. Closest to our work, Bu et al. (2021) identified that DP-SGD produces miscalibrated models on CIFAR-10, SNLI, and MNIST. As a solution, they suggested an alternative clipping scheme that empirically reduces the expected calibration error (ECE).
Our work differs in three ways: (1) our experimental results cover harder tasks and control for confounders such as model accuracy and regularization; (2) we study transfer learning settings that are closer to the state-of-the-art setup in differentially private learning and find substantially worse ECE gaps (e.g., they identify a 43% relative increase in ECE on CIFAR-10, while we find nearly 400% on Food101); and (3) we compare our simple recalibration procedure to their method and find that DP-TS is substantially more effective at reducing ECE.

3. PROBLEM STATEMENT AND METHODS

Our main goal is to build classifiers that are both accurate and calibrated under differential privacy. We begin by defining core preliminary concepts.

3.1. DIFFERENTIAL PRIVACY

Differential privacy is a formal privacy guarantee for a randomized algorithm which intuitively ensures that no adversary has a high probability of identifying whether a record was included in a dataset based on the output of the algorithm. Throughout our work, we study models trained with approximate-DP / $(\epsilon, \delta)$-DP algorithms.

Definition 3.1 (Approximate-DP (Dwork et al., 2006)). A randomized algorithm $M : \mathcal{X} \to \mathcal{Y}$ is $(\epsilon, \delta)$-DP if for all neighboring datasets $X, X' \in \mathcal{X}$ that differ on a single element and all measurable $Y \subseteq \mathcal{Y}$, $P(M(X) \in Y) \le e^{\epsilon} P(M(X') \in Y) + \delta$.

3.2. DIFFERENTIALLY PRIVATE STOCHASTIC GRADIENT DESCENT

The standard approach to training neural networks with DP is the differentially private stochastic gradient descent (DP-SGD) algorithm (Abadi et al., 2016). The algorithm operates by privatizing each gradient update, combining per-example gradient clipping with Gaussian noise injection. Formally, one step of DP-SGD to update $\theta$ with a batch of samples $B_t$ is defined as
$$\theta^{(t+1)} = \theta^{(t)} - \eta_t \left( \frac{1}{B} \sum_{i \in B_t} \mathrm{clip}_C\!\left(\nabla L_i(\theta^{(t)})\right) + \xi \right),$$
where $\eta_t$ is the learning rate at step $t$, $L_i(\theta^{(t)})$ is the learning objective for example $i$, the clipping operation is $\mathrm{clip}_C(\nabla L_i(\theta^{(t)})) = \nabla L_i(\theta^{(t)}) \cdot \min\left(1, C / \|\nabla L_i(\theta^{(t)})\|_2\right)$, and $\xi \sim \mathcal{N}\!\left(0, \frac{C^2 \sigma^2}{B^2} I_p\right)$ is Gaussian noise.
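As a concrete sketch, the update above can be written in a few lines of NumPy. This is an illustrative implementation, not the one used in our experiments: the function name `dp_sgd_step` is ours, and we assume the per-example gradients have already been computed.

```python
import numpy as np

def dp_sgd_step(theta, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each per-example gradient to L2 norm
    clip_norm, average over the batch, then add Gaussian noise with
    standard deviation clip_norm * noise_multiplier / batch_size."""
    batch_size = len(per_example_grads)
    clipped = [
        g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        for g in per_example_grads
    ]
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, clip_norm * noise_multiplier / batch_size,
                       size=np.shape(theta))
    return theta - lr * (mean_grad + noise)
```

With `noise_multiplier = 0` and a large `clip_norm` this reduces to ordinary mini-batch SGD, which makes the two privatization operations easy to ablate individually.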

3.3. CALIBRATION

A probabilistic forecast is said to be calibrated if the forecast has accuracy $p$ on the set of all examples with confidence $p$. Specifically, given a multi-class classification problem where we want to predict a categorical variable $Y$ based on the observation $X$, we say that a probabilistic classifier $h_\theta$ parameterized by $\theta$ over $C$ classes satisfies canonical calibration if for each $p$ in the simplex $\Delta^{C-1}$ and every label $y$, $P(Y = y \mid h_\theta(X) = p) = p_y$ holds.1 Intuitively, a calibrated model should give predictions that truthfully reflect the predictive uncertainty; e.g., among the samples to which a calibrated classifier gives confidence 0.1 for class $k$, 10% of the samples actually belong to class $k$. The canonical calibration property can be difficult to verify in practice when the number of classes is large (Guo et al., 2017). Because of this, we will consider a simpler top-label calibration criterion in this work. In this relaxation, we consider calibration over only the highest-probability class. More formally, we say that a classifier $h_\theta$ is calibrated if for all $p^* \in [0, 1]$, $P(Y \in \arg\max h_\theta(X) \mid \max h_\theta(X) = p^*) = p^*$, where $p^*$ is the true predictive uncertainty. With the same definition of $p^*$, we will quantify the degree to which a classifier is calibrated through the expected calibration error (ECE), defined by $\mathbb{E}\left[\left| p^* - \mathbb{E}\left[\mathbf{1}\{Y \in \arg\max h_\theta(X)\} \mid \max h_\theta(X) = p^*\right]\right|\right]$. In practice, we estimate ECE by first partitioning the confidence scores into $M$ bins $B_1, \ldots, B_M$ before calculating the empirical estimate
$$\widehat{\mathrm{ECE}} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$
where $\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(y_i = \arg\max h_\theta(x_i))$, $\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \max h_\theta(x_i)$, and $\{(x_i, y_i)\}_{i=1}^{n}$ are a set of $n$ i.i.d. samples that follow a distribution $P(X, Y)$.
When appropriate, we will also study fine-grained miscalibration errors through the histogram of conf(B m ) (the confidence histogram) and plot acc(B m ) against conf(B m ) (the reliability diagram).
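The binned ECE estimator translates directly into code. The following is a minimal sketch (the function name and the equal-width bin edges are our choices):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Empirical ECE: partition top-label confidences into equal-width
    bins and sum |accuracy - mean confidence| weighted by bin mass."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()        # acc(B_m)
            conf = confidences[in_bin].mean()   # conf(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

For example, four predictions with confidence 0.95 of which only two are correct fall in one bin with acc 0.5 and conf 0.95, giving an ECE of 0.45.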

3.4. RECALIBRATION

When models are miscalibrated, post-hoc recalibration methods are often used to reduce calibration errors. These methods alleviate miscalibration by adjusting the log-probability scores generated by the probabilistic prediction model. More formally, consider a score-based classifier that produces probabilistic forecasts via softmax($h_\theta(x)$). We can adjust the calibration of this classifier by learning a $g_\phi$ that adjusts the log probabilities and produces a better calibrated forecast softmax($g_\phi \circ h_\theta(x)$). Typically, the recalibration function $g_\phi$ is learned by minimizing a proper scoring rule $\ell$ on a separate validation/recalibration set $X_{\mathrm{recal}}$ with the following optimization problem:
$$\min_\phi \mathbb{E}\left[\ell\left(\mathrm{softmax}(g_\phi \circ h_\theta(x)), y\right)\right].$$
These methods differ in their choice of $g_\phi$. In this work, we will consider temperature scaling ($g_\phi(x) = x / T$) and Platt scaling ($g_\phi(x) = Wx + b$), with the log loss as $\ell$.
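For temperature scaling, $g_\phi$ has a single parameter, so the recalibration objective can be minimized with plain gradient descent. The sketch below is illustrative (the name `fit_temperature` is ours, not from our implementation); it parameterizes $\log T$ to keep $T$ positive and uses the closed-form gradient of the log loss with respect to $\log T$:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, lr=0.1, steps=2000):
    """Minimize the log loss of softmax(logits / T) over T on a
    held-out set. For p = softmax(z / T), the gradient of the
    average NLL w.r.t. log T is mean(z_true - E_p[z]) / T."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = len(labels)
    log_t = 0.0
    for _ in range(steps):
        T = np.exp(log_t)
        p = softmax(logits / T)
        z_true = logits[np.arange(n), labels]   # logit of the true class
        z_expected = (p * logits).sum(axis=1)   # expected logit under p
        grad = (z_true - z_expected).mean() / T  # d NLL / d log T
        log_t -= lr * grad
    return np.exp(log_t)
```

On overconfident scores the fitted temperature exceeds 1, flattening the forecast; for instance, logits of (4, 0) paired with only 70% empirical accuracy drive $T$ toward $4 / \ln(7/3) \approx 4.7$.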

3.5. DIFFERENTIALLY PRIVATE RECALIBRATION

As observed in our motivating example (Fig. 1) and later experimental results (Fig. 2), DP training (with DP-SGD and DP-Adam) tends to produce miscalibrated models. Motivated by the success of recalibration methods such as Platt scaling and temperature scaling in the non-private setting (Guo et al., 2017), we study how these methods can be adapted to build well-calibrated classifiers with DP guarantees. Our proposed approach is very simple and consists of three steps, shown in Algorithm 1. We first split the training set into a model training part ($X_{\mathrm{train}}$) and a validation/recalibration part ($X_{\mathrm{recal}}$). We then train the model using DP-SGD on $X_{\mathrm{train}}$, followed by a recalibration step using DP-SGD on $X_{\mathrm{recal}}$. Depending on the choice of $g_\phi$, we will refer to this algorithm as either DP temperature scaling (DP-TS) or DP Platt scaling (DP-PS).

Algorithm 1: Differentially Private Recalibration
Input: $X = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, validation ratio $\alpha$. Initial: parameters of model $h_\theta$ and recalibrator $g_\phi$.
1. $X_{\mathrm{train}}, X_{\mathrm{recal}}$ = RandomSplit($X, \alpha$)
2. Train $h_\theta$ using DP-SGD to optimize $\min_\theta \mathbb{E}[\ell(\mathrm{softmax}(h_\theta(x)), y)]$ on $X_{\mathrm{train}}$
3. Train $g_\phi$ using DP-SGD to optimize $\min_\phi \mathbb{E}[\ell(\mathrm{softmax}(g_\phi \circ h_\theta(x)), y)]$ on $X_{\mathrm{recal}}$
Output: $g_\phi \circ h_\theta(\cdot)$

The use of sample splitting for recalibration makes privacy accounting simple. To achieve a target $(\epsilon, \delta)$-DP guarantee after recalibration, we can simply run both stages with DP-SGD parameters that achieve $(\epsilon, \delta)$-DP (Prop. A.1). While sample splitting does reduce the number of samples available for model training, using 90% of the dataset for $X_{\mathrm{train}}$ results in a minor utility cost for the model training step in practice.
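Step 3 of Algorithm 1 is particularly simple for DP-TS: the recalibrator has a single scalar parameter, so each per-example gradient is a scalar and clipping reduces to a clamp. The sketch below is illustrative, not our experimental implementation (it assumes full-batch DP-SGD over the recalibration split; names and hyper-parameters are ours):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dp_temperature_scaling(logits, labels, steps=500, lr=0.2,
                           clip=1.0, noise_multiplier=1.0, seed=0):
    """DP-TS sketch: fit log T with DP-SGD on the recalibration split.
    Each per-example gradient w.r.t. log T is a scalar, clipped to
    [-clip, clip]; Gaussian noise scaled by clip * noise_multiplier
    is added to the averaged gradient."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = len(labels)
    log_t = 0.0
    for _ in range(steps):
        T = np.exp(log_t)
        p = softmax(logits / T)
        z_true = logits[np.arange(n), labels]
        per_example = (z_true - (p * logits).sum(axis=1)) / T  # d NLL_i / d log T
        clipped = np.clip(per_example, -clip, clip)
        noisy_grad = clipped.mean() + rng.normal(0.0, clip * noise_multiplier / n)
        log_t -= lr * noisy_grad
    return np.exp(log_t)
```

Because the noise scale shrinks with the size of the recalibration split, even a modest held-out set yields a temperature close to the non-private fit.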

4. EXPERIMENTAL RESULTS

We study three different experimental settings. We first consider in-domain evaluations, where we evaluate calibration errors on the same domain that the models are trained on. Results show that using pre-trained models does not address miscalibration issues in-domain. We then evaluate the same models in out-of-domain settings, showing that both the miscalibration and the effectiveness of our recalibration methods carry over to the out-of-domain setting. Finally, we perform careful ablations to isolate and understand the causes of in-domain miscalibration. In each case, we show that DP-SGD leads to high miscalibration, and DP recalibration substantially reduces calibration errors.

Models. Our goal is to evaluate calibration errors for state-of-the-art private models. Because of this, our models are based on transfer learning from a pre-trained model. For the text datasets, we fine-tune RoBERTa-base using the procedure in Li et al. (2022b); for the vision datasets, we perform linear probes of ViT and ResNet-50 features, following Tramer & Boneh (2021). However, we find that these same models have substantially higher calibration errors. For example, the linear probe for Food101 in Fig. 2 has private accuracy within 7% of the non-private counterpart, but its ECE is more than 4× that of the non-private model. In the language case, we see similar results on QNLI, with a ~6% decrease in accuracy but a ~4.3× increase in ECE. The overall trend of miscalibration is clear across datasets and modalities (Fig. 2).

DP recalibration. We now turn our attention to recalibration algorithms and ask whether DP-TS and DP-PS can address in-domain miscalibration. We find that DP-TS and DP-PS perform consistently well over all datasets and on both modalities, with marginal accuracy drops (Tab. 1 and Tab. 2). In many cases, the differentially private variants of recalibration work nearly as well as their non-private counterparts: the ECE values for the private DP-TS and the non-private baseline DP+NON-PRIVATE-TS are generally close across all the datasets. We note that both DP-TS and DP-PS perform consistently well, with an average relative (in-domain) ECE reduction of 0.58.
Despite their simplicity, the two methods never underperform Global Clipping and DP-SGLD in terms of ECE, and achieve very close or even higher accuracies despite the added cost of sample splitting.

Qualitative analysis. Examining the reliability diagrams before and after DP-TS, we see two clear phenomena. First, the model confidence distribution under DP-SGD is highly polarized (Fig. 3, first two panels), with nearly all examples receiving confidences near 1.0. Second, after DP-TS, this confidence distribution is adjusted to cover a much broader range of confidence values. In the case of SUN397, recalibration yields almost perfect agreement between the model confidences and actual accuracies.

4.2. OUT-OF-DOMAIN CALIBRATION

We complement our in-domain experiments with out-of-domain evaluations. To do this, we evaluate the zero-shot transfer performance of models trained on MNLI and QNLI. Our findings are consistent with the in-domain evaluations: differentially private training generally results in high ECE, while DP-TS and DP-PS generally improve calibration. The gaps out-of-domain are substantially smaller than in the in-domain case, as all methods have low accuracy and are miscalibrated out of domain. However, the general ranking of methods, and the observation that DP-TS and DP-PS lead to private models with calibration errors on par with non-private models, are unchanged.

4.3. ANALYSES AND ABLATION STUDIES

Finally, we carefully study two questions to better understand the miscalibration of private learners: which component of DP-SGD leads to miscalibration, and do confounders such as accuracy or regularization effects explain it? On 2D synthetic data (example given in Fig. 1), Fig. 4(a) shows that fixing the overall privacy guarantee (ϵ) and increasing the clipping threshold from DP (0.1) to DP (1) and further to DP (10) affects the accuracy only marginally but substantially improves calibration.

Controlling for accuracy and regularization. Accuracy and calibration are generally positively correlated (Minderer et al., 2021; Carrell et al., 2022). This poses a question: does the miscalibration of DP models arise from their suboptimal accuracy? We find evidence against this in two different experiments. In the first experiment, we vary ϵ when fine-tuning RoBERTa with DP on MNLI. This results in several models situated on a linear ECE-accuracy tradeoff curve (Fig. 5(a)). Intuitively, extrapolating this curve helps us identify the anticipated ECE for a DP-trained model with a given accuracy. Fig. 5(a) shows that, compared to these private models, the non-private model has substantially lower ECE than would be expected by extrapolating this tradeoff alone. This suggests that private learning experiences a qualitatively different ECE-accuracy tradeoff than standard learning. In the second experiment, we controlled the in-domain accuracy of non-private models to match their private counterparts by early-stopping the non-private models to within 1% of the DP model accuracy. Fig. 5(b) shows that the ECE gap between the private and non-private models persists even when controlling for accuracy. More generally, we find that regularization methods such as early stopping affect the ECE-accuracy tradeoff qualitatively differently than DP-SGD. Our results in Tab.
5 show that most other regularizers, such as early stopping, lead to an accuracy-ECE tradeoff in which highly regularized models are less accurate but better calibrated. This is not the case for DP training, where the resulting models both have lower accuracy and are less calibrated relative to their non-private counterparts. These findings suggest that calibration errors in private and non-private settings may have different causes: in particular, the miscalibration of private models may not be due to the regularization effects of DP-SGD.

DP training leads to similarly high train and test ECE. Learning algorithms that satisfy tight DP guarantees are known to generalize well, meaning that the train (empirical) and test (population) losses of a DP-trained model should be similar (Dwork et al., 2015; Bassily et al., 2016). In a controlled experiment, we fine-tune RoBERTa on QNLI with DP-SGD (ϵ = 8) and observe that the train-test gaps for both ECE and loss are smaller for DP models than for non-private ones (Fig. 6). Yet, for DP-trained models, both the train and test ECE are high compared to the non-private model. Interestingly, these observations are very different from what is seen in miscalibration analyses of non-private models. For instance, Carrell et al. (2022) showed that non-private models tend to be calibrated on the training set but can be miscalibrated on the test set due to overfitting (a large calibration generalization gap). Our results show that DP-trained models have a small calibration generalization gap, but are miscalibrated on both the training and test sets.

5. CONCLUDING REMARKS

In this work, we study the calibration of ML models trained with DP-SGD. We quantify the miscalibration of DP-SGD trained models and verify that it persists even with state-of-the-art pre-trained backbones. While the calibration errors are substantial and consistent, we show that adapting existing post-hoc calibration methods is highly effective for DP-SGD models. We believe it is an open question whether the generalization guarantees of DP-SGD can be leveraged to naturally obtain similarly well-calibrated models without sample splitting and recalibration.

A PRIVACY ANALYSIS FOR INDEPENDENT RELEASES WITH A PARTITION OF DATA

Our post-processing calibration setup requires splitting the original (private) training data into two disjoint splits, one of which is used solely for training and the other solely for post hoc recalibration. Given that both the training and post hoc recalibration algorithms are DP, it is natural to ask what the overall privacy cost of the joint release is. While one can resort to any off-the-shelf privacy composition theorem, we note that in our setup the splits of data used by the two algorithms are disjoint, and thus a tighter characterization of the privacy leakage is possible. The following is common knowledge, and we include the proof only for completeness.

Proposition A.1. Let $M_1 : \mathcal{X}_1 \to \mathcal{Y}$ and $M_2 : \mathcal{X}_2 \times \mathcal{Y} \to \mathcal{Z}$ be $(\epsilon, \delta)$-DP algorithms consuming independent random bits and operating on disjoint splits of the dataset. Then, the algorithm $M : \mathcal{X} \to \mathcal{Y} \times \mathcal{Z}$ defined by $M(X) = (y, z)$ with $y = M_1(X_1)$ and $z = M_2(X_2, y)$, where $(X_1, X_2)$ is a partition of $X$ determined through some procedure independent of $X$, is also $(\epsilon, \delta)$-DP.

Proof. Let $X$ and $X'$ be neighboring datasets. Suppose first that the first component of both partitions is the same, i.e., $X = (X_1, X_2)$ and $X' = (X_1, X_2')$, where $X_2$ and $X_2'$ are neighboring. Then $M$ being $(\epsilon, \delta)$-DP follows directly from $M_2$ being $(\epsilon, \delta)$-DP. The more subtle case is when the second component of both partitions is the same. Specifically, suppose that $X = (X_1, X_2)$ and $X' = (X_1', X_2)$, where $X_1$ and $X_1'$ are neighboring. Let $R$ denote the random variable that controls only the randomness of $M_2$; conditioned on a draw $R = r$, $M_2$ is a deterministic function, which with slight abuse of notation we denote by $M_2(r)$. Let $O = \bigcup_{o_1 \in O_1} \{o_1\} \times O_2(o_1) \subseteq \mathcal{Y} \times \mathcal{Z}$ be a subset of the codomain, and define the following shorthand for the preimage of $M_2$ conditioned on $R = r$: $M_2(r)^{-1}(X_2, S) = \{y \in \mathcal{Y} \mid M_2(r)(X_2, y) \in S\}$.
Then, we have
\begin{align*}
\Pr(M(X) \in O \mid R = r)
&= \sum_{o_1 \in O_1} \Pr(M_1(X_1) = o_1)\, \Pr(M_2(X_2, o_1) \in O_2(o_1) \mid R = r) \\
&= \sum_{o_1 \in O_1} \Pr(M_1(X_1) = o_1)\, \mathbf{1}\!\left[M_2(r)(X_2, o_1) \in O_2(o_1)\right] \\
&= \sum_{o_1 \in O_1} \Pr\!\left(M_1(X_1) = o_1,\; M_1(X_1) \in M_2(r)^{-1}(X_2, O_2(o_1))\right) \\
&= \Pr\!\left(M_1(X_1) \in \bigcup_{o_1 \in O_1} \{o_1\} \cap M_2(r)^{-1}(X_2, O_2(o_1))\right) \\
&\le e^{\epsilon} \Pr\!\left(M_1(X_1') \in \bigcup_{o_1 \in O_1} \{o_1\} \cap M_2(r)^{-1}(X_2, O_2(o_1))\right) + \delta \\
&= e^{\epsilon} \Pr(M(X') \in O \mid R = r) + \delta.
\end{align*}
Since the above holds for all draws of $R$, we conclude that $\Pr(M(X) \in O) \le e^{\epsilon} \Pr(M(X') \in O) + \delta$ for all neighboring $X$ and $X'$ which differ only in their first components. This concludes the proof.

B.1 SETTINGS FOR SYNTHETIC EXPERIMENTS

For the synthetic experiments, we generate two-dimensional mixture-of-Gaussians data of size 10k. The distance between the centers of the two class distributions is shifted by a constant, set to 2 × 1.5. We use logistic regression for the binary classification, with 5k data points per class and a batch size of 4k. We include results with different maximum gradient norms C ∈ {0. The default hyper-parameters for ℓ2, dropout, and early stopping are 1e-1, 0.1, and 8 respectively, so some of the results in Tab. 5 are reused. For recalibration training, we use a fixed number of epochs without hyper-parameter tuning to avoid privacy leakage through the validation sets. We initialize the temperature parameter in DP-TS to 1.0 and train for 100 epochs on all tasks except Food101 (which uses 30 epochs), using DP-SGD with a 0.1 learning rate, a maximum gradient clipping norm of 10, and a linearly decaying learning rate scheduler. We adopt a multiclass extension of Platt scaling with higher-dimensional parameters (Guo et al., 2017). For the baselines, we grid search the maximum norm bound Z ∈ {100, 500, 1000} and the number of epochs over {6, 8, 18} for global clipping (Bu et al., 2021).
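The data-generating process can be sketched as follows (we use the class centers b = (1.5, 0) and (0, 1.5) from Fig. 1; the function name and seed are illustrative, not from our implementation):

```python
import numpy as np

def make_synthetic(n_per_class=5000, seed=0):
    """Two-dimensional Gaussian mixture with unit covariance:
    class +1 centered at (1.5, 0), class -1 at (0, 1.5)."""
    rng = np.random.default_rng(seed)
    x_pos = rng.normal(size=(n_per_class, 2)) + np.array([1.5, 0.0])
    x_neg = rng.normal(size=(n_per_class, 2)) + np.array([0.0, 1.5])
    X = np.vstack([x_pos, x_neg])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y
```

The overlapping unit-covariance classes make the problem non-separable, so a well-calibrated logistic regression should produce intermediate confidences near the decision boundary.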



We slightly abuse the notation of X and Y . To match the label space between MNLI and the OOD tasks, we merge "contradiction" and "neutral" labels into a single "not-contradiction" label.



Figure 1: DP-SGD gives rise to miscalibration for logistic regression. (a) Logistic regression model (blue line) with ϵ = 8 on Gaussian data $\{(x_i, y_i)\}_{i=1}^n$ where $(x, y) \in \mathbb{R}^p \times \{1, -1\}$, $(x - b) \mid y \sim \mathcal{N}(0, I_{2 \times 2})$ with $b = (1.5, 0)$ if $y = 1$ and $b = (0, 1.5)$ otherwise, and $y$ is Rademacher distributed. (b) Reliability diagram and confidence histogram. The DP-SGD trained classifier shows poor calibration, with a large concentration of extreme confidence values (left); the baseline, a standard non-private logistic regression model trained by SGD, is much better calibrated (right).

Here, the standard deviation of the injected noise is determined by the noise multiplier σ returned by privacy accounting and the expected batch size B. Each step of DP-SGD is approximate-DP, and the final model satisfies approximate-DP with privacy leakage parameters that can be computed with privacy loss composition theorems (Abadi et al., 2016; Mironov, 2017; Wang et al., 2019b; Dong et al., 2019; Gopi et al., 2021).

Specific examples of this type of post hoc recalibration technique include temperature scaling (Guo et al., 2017), Platt scaling (Platt et al., 1999), and isotonic regression (Zadrozny & Elkan, 2002).

Datasets. Following prior work (Li et al., 2022b), we train on MNLI, QNLI, QQP, and SST-2 (Wang et al., 2019a) for the text classification tasks, and perform OOD evaluations on common transfer targets such as Scitail (Khot et al., 2018), HANS (McCoy et al., 2019), RTE, WNLI, and MRPC (Wang et al., 2019a).2 For the vision tasks, we focus on the in-domain setting and evaluate on a subset of the transfer tasks in Kornblith et al. (2019) with at least 50k examples.

Methods. As baselines, we train the above models using non-private SGD (NON-PRIVATE), standard DP-SGD (DP), global clipping (Bu et al., 2021) (GLOBAL CLIPPING), and differentially private stochastic gradient Langevin dynamics (Knolle et al., 2021) (DP-SGLD). The last two methods are included to evaluate our simple recalibration approaches against existing methods which are reported to improve calibration. For our recalibration methods, we run the private recalibration procedure of Sec. 3.5 over the in-domain recalibration set X_recal, using private temperature scaling (DP-TS) (Guo et al., 2017) and Platt scaling (DP-PS) (Platt et al., 1999; Guo et al., 2017). We also include a non-private baseline that combines differentially private model training with non-private temperature scaling (DP+NON-PRIVATE-TS) as a way to quantify the privacy costs of the post-hoc recalibration step. Further implementation details and default hyper-parameters for DP training are in Tab. 6 in Appendix B.

4.1. IN-DOMAIN CALIBRATION

We now conduct in-depth experiments across multiple datasets and domains to study miscalibration (Tab. 1, 2). We train differentially private models using pre-trained backbones, and find that their accuracies match previously reported high performance (Tramer & Boneh, 2021; Li et al., 2022b; De et al., 2022).

Figure2: DP trained models display consistently higher ECE than their non-private counterparts.

Figure 3: Reliability diagram and confidence histogram before (Left) and after (Right) recalibration using DP-TS. Recalibration parameters are learned on the validation set X recal of MNLI and SUN397.

Figure 4: Per-example gradient clipping (ϵ = 8) causes large ECE errors in (a) logistic regression on non-separable 2D synthetic data and (b) fine-tuning RoBERTa on MNLI. (c) Performing only gradient noising leads to high accuracy and low ECE.

Ablation on per-example gradient clipping and noise injection. DP-SGD involves per-example gradient clipping and noise injection. To better understand which component contributes more to miscalibration, we perform experiments that isolate the effect of each individual component.

Figure 6: DP-SGD training (ϵ = 8) makes train and eval ECE close but both of them are large. The training dynamics of (a) ECE and (b) Loss on both QNLI training and evaluation sets.


Differentially Private Deep Learning. DP-SGD (Song et al., 2013; Abadi et al., 2016) is a popular algorithm for training deep learning models with DP. Recent works have shown that fine-tuning high-quality pre-trained models with DP-SGD results in good downstream performance (Tramer & Boneh, 2021; Li et al., 2022b; De et al., 2022; Li et al., 2022a). Existing works have studied how ensuring differential privacy through mechanisms such as DP-SGD leads to tradeoffs with other properties, such as accuracy (Feldman, 2020) and fairness, measured by the disparity in accuracies across groups (Bagdasaryan et al., 2019; Tran et al., 2021; Sanyal et al., 2022; Esipova et al., 2022).

The image classification performance (ϵ = 8) of different models before and after recalibration. Results for ϵ = 3 are in Appendix B.3.

The text classification performance (ϵ = 8) before and after recalibration.



The zero-shot transfer paraphrase performance (ϵ = 8) from QQP to MRPC.

Repeating this ablation with RoBERTa fine-tuning on MNLI (Fig. 4(b)) confirms that increasing the clipping threshold (slightly) decreases ECE but does not substantially impact model accuracy. Finally, Fig. 4(c) shows that completely removing clipping and training with only noisy gradient descent dramatically reduces ECE (and increases accuracy). These results suggest that intensive clipping exacerbates miscalibration (even under a fixed privacy guarantee).

Comparison with non-private models trained using common regularizers, i.e., ℓ2 (weight decay factor), dropout (probability), and early stopping (total training epochs). Models are trained on MNLI and evaluated over MNLI, Scitail, and QNLI.

Default hyperparameters for DP fine-tuning on the different datasets, for reproducibility. Batch size is built from a unit batch size of 20 with varying numbers of gradient accumulation steps. The validation ratio is the proportion of the training set split off for tuning recalibration methods.

B.2 IMPLEMENTATION DETAILS

We use pre-trained checkpoints and trainers from the Huggingface library (Wolf et al., 2020) for the NLP experiments. For the CV experiments we linear probe ResNet50 on CIFAR-10 and ViT on SUN397 and Food101. We use the Opacus privacy engine (Yousefpour et al., 2021) as modified by Li et al. (2022b), which computes per-example gradients for transformers. We compare DP training against regularizers popular for fine-tuning, namely ℓ2 regularization, dropout, and early stopping, on the NLP datasets. ℓ2 is the weight decay rate during optimization, taking values in {1e-1, 1e-2, 1e-3, 1e-4}. We apply dropout to both the hidden and attention layers of transformers, with rate in {0.1, 0.2, 0.3, 0.4}. We implement early stopping by limiting the maximum number of training epochs to values in {2, 4, 6, 8}.

We use pre-noise scale 0.046, temperature τ = 6.08, and exponential learning rate decay with learning rate 0.005 and decay factor 0.028, as suggested by Knolle et al. (2021).

B.3 ADDITIONAL IMAGE CLASSIFICATION RESULTS

In Tab. 7, we give additional results for a smaller privacy budget, ϵ = 3. The results are consistent: DP fine-tuning gives poor calibration, while DP-TS and/or DP-PS can recalibrate the classifiers effectively.
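DP-TS recalibrates by fitting a single temperature T on held-out logits and rescaling them to softmax(z / T). The sketch below shows the non-private mechanics only (the paper's DP variant fits the parameter with a DP procedure); grid search stands in for the usual L-BFGS fit, and the names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, temps=np.linspace(0.5, 10.0, 96)):
    """Grid-search the single temperature T that minimizes the negative
    log-likelihood of softmax(logits / T) on held-out data."""
    labels = np.asarray(labels)
    best_t, best_nll = 1.0, np.inf
    for t in temps:
        p = softmax(logits / t)
        nll = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

An overconfident model (large logit margins, but frequent mistakes) is assigned T > 1, which softens its predictions; temperature scaling never changes the argmax, so accuracy is untouched.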

The image classification performance (ϵ = 3) of different models before and after recalibration across datasets.

B.4 ADDITIONAL ABLATION STUDIES

Label noise injection. All of the datasets we consider have labels designed to be unambiguous, so the Bayes optimal predictor would produce a confidence histogram concentrated at 1.0. We might therefore wonder whether the polarized confidence histograms observed in Fig. 3 are an artifact of datasets with unambiguous labels. To test this, we intentionally inject label noise into MNLI and study how it changes the behavior of DP-SGD and non-private learning algorithms. Specifically, we corrupt training labels by replacing each with a uniformly random class with probability p ∈ {0.6, 0.8}. We compare DP-SGD trained models against non-private models with 0.2 dropout regularization. The confidence histograms in Fig. 7 clearly show that differentially private models still produce near-100% confidence, even when the Bayes optimal classifier can be at most 60% confident. The near-100% confidence of DP-SGD trained models is therefore not driven by a dataset's label distribution, and this miscalibration is likely to be even worse on tasks with inherent label uncertainty.
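The corruption scheme described here can be sketched in a few lines (helper name illustrative). Under this scheme the Bayes optimal confidence is at most 1 - p + p/K, which for MNLI's K = 3 classes at p = 0.6 gives the 60% bound quoted above:

```python
import numpy as np

def corrupt_labels(labels, num_classes, p, rng):
    """With probability p, replace each label by a class drawn uniformly at
    random (the drawn class may coincide with the true one, so the clean
    label survives with probability 1 - p + p / num_classes)."""
    labels = np.asarray(labels).copy()
    flip = rng.random(len(labels)) < p
    labels[flip] = rng.integers(0, num_classes, size=flip.sum())
    return labels
```

With p = 0.6 and num_classes = 3, each clean label survives with probability 0.4 + 0.6/3 = 0.6, so any model reporting near-100% confidence is necessarily miscalibrated.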

C REMARKS ON CORRELATIONS BETWEEN ACCURACY AND CALIBRATION

In general, the correlation between accuracy and calibration is not well understood even for non-private learners, since many factors affect calibration: architecture, regularization, optimization, data distribution, overparameterization, etc. Below we summarize some notable empirical findings. Convolutional networks like ResNets and DenseNets can be miscalibrated (Guo et al., 2017). However, Minderer et al. (2021) show that modern models like ViT (Dosovitskiy et al., 2020) are better calibrated than past models, that modern neural networks tend to exhibit a strong positive correlation between calibration error and classification error, and that model architecture matters greatly for calibration. Pre-training can improve model uncertainty and calibration (Hendrycks et al., 2019; Desai & Durrett, 2020; Minderer et al., 2021; Kadavath et al., 2022). Regularizers like gradient noise injection can promote stability and distributional generalization, so good calibration on the training set can transfer to the test set (Kulynych et al., 2022). Carrell et al. (2022) empirically show that popular models with small generalization gaps tend to have small test calibration errors.

Given these observations, it is possible that the per-example gradient clipping and gradient noise injection in DP-SGD contribute to accuracy and calibration in different ways. We therefore carefully control for accuracy and regularization when conducting analyses and drawing conclusions (Tab. 5, Fig. 5(a) and 5(b), Fig. 7). Even with these confounders controlled, DP-SGD trained models remain miscalibrated. In other words, the finding that private learners are much more miscalibrated than their non-private counterparts is less likely to be explained by unambiguous dataset labels, accuracy discrepancies, or the regularization effects of DP-SGD, and more likely to be caused by the per-example gradient clipping operation.

