RECYCLING SCRAPS: IMPROVING PRIVATE LEARNING BY LEVERAGING INTERMEDIATE CHECKPOINTS

Abstract

All state-of-the-art (SOTA) differentially private machine learning (DP ML) methods are iterative in nature, and their privacy analyses allow publicly releasing the intermediate training checkpoints. However, DP ML benchmarks, and even practical deployments, typically use only the final training checkpoint to make predictions. In this work, for the first time, we comprehensively explore various methods that aggregate intermediate checkpoints to improve the utility of DP training. Empirically, we demonstrate that checkpoint aggregations provide significant gains in prediction accuracy over the existing SOTA for the CIFAR10 and StackOverflow datasets, and that these gains are magnified in settings with periodically varying training data distributions. For instance, we improve SOTA StackOverflow accuracy to 22.7% (+0.43% absolute) for ε = 8.2, and to 23.84% (+0.43%) for ε = 18.9. Theoretically, we show that uniform tail averaging of checkpoints improves the empirical risk minimization bound compared to the last checkpoint of DP-SGD. Lastly, we initiate an exploration into estimating the uncertainty that DP noise adds to the predictions of DP ML models. We prove that, under standard assumptions on the loss function, the sample variance of the last few checkpoints provides a good approximation of the variance of the final model of a DP run. Empirically, we show that the last few checkpoints can provide a reasonable lower bound on the variance of a converged DP model.

1. INTRODUCTION

Machine learning models can unintentionally memorize sensitive information about the data they were trained on, which has led to numerous attacks that extract private information about the training data (Ateniese et al., 2013; Fredrikson et al., 2014; 2015; Carlini et al., 2019; Shejwalkar et al., 2021; Carlini et al., 2021; 2022). For instance, membership inference attacks (Shokri et al., 2017) can infer whether a target sample was used to train a given ML model, while property inference attacks (Melis et al., 2019; Mahloujifar et al., 2022) can infer certain sensitive properties of the training data. To address such privacy risks, the literature has introduced various approaches to privacy-preserving ML (Nasr et al., 2018; Shejwalkar & Houmansadr, 2021; Tang et al., 2022). In particular, iterative techniques like differentially private stochastic gradient descent (DP-SGD) (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016b; McMahan et al., 2017) and DP Follow The Regularized Leader (DP-FTRL) (Kairouz et al., 2021) have become the state of the art for training DP neural networks. For establishing benchmarks, prior works in DP ML (Abadi et al., 2016b; McMahan et al., 2017; 2018; Thakkar et al., 2019; Erlingsson et al., 2019; Wang et al., 2019b; Zhu & Wang, 2019; Balle et al., 2020; Erlingsson et al., 2020; Papernot et al., 2020; Tramer & Boneh, 2020; Andrew et al., 2021; Kairouz et al., 2021; Amid et al., 2022; De et al., 2022; Feldman et al., 2022) use only the final model output by the DP algorithm. This is also how DP models are deployed in practice (Ramaswamy et al., 2020; McMahan et al., 2022). However, the privacy analyses of these techniques allow releasing/using all of the intermediate training checkpoints. In this work, we comprehensively study various methods that leverage intermediate checkpoints to 1) improve the utility of DP training, and 2) quantify the uncertainty in DP ML models that is due to the DP noise.

Accuracy improvement using checkpoints:

We propose two classes of aggregation methods, based on aggregating either the parameters of checkpoints or their outputs. We provide both theoretical and empirical analyses for our aggregation methods. Theoretically, we show that the excess empirical risk of the final checkpoint of DP-SGD is log(n) times larger than that of a weighted average of the past k checkpoints, where n is the size of the dataset. Empirically, we demonstrate significant top-1 accuracy gains due to our aggregations for image classification (CIFAR10) and next-word prediction (StackOverflow) tasks. Specifically, we show that our checkpoint aggregations achieve absolute (relative) prediction accuracy improvements of 3.79% (7.2%) at ε = 1 for the CIFAR10 (DP-SGD) and 0.43% (1.9%) at ε = 8.2 for the StackOverflow (DP-FTRLM) SOTA baselines, respectively. We also show that our aggregations significantly reduce the variance in the performance of DP models over training. Finally, we show that these benefits are further magnified in more practical settings with periodically varying training data distributions. For instance, we note absolute (relative) accuracy gains of 17.4% (28.6%) at ε = 8 for CIFAR10 over the DP-SGD baseline in such a setting.

Uncertainty quantification using checkpoints:

There are various sources of randomness in an ML training pipeline (Abdar et al., 2021), e.g., the choice of initial parameters, dataset, batching, etc. This randomness induces uncertainty in the predictions made using such ML models. In critical domains, e.g., medical diagnosis, self-driving cars, and financial market analysis, failing to capture the uncertainty in such predictions can have undesirable repercussions. DP learning adds an additional source of randomness by injecting noise at every training round. Hence, it is paramount to quantify the reliability of DP models, e.g., by quantifying the uncertainty in their predictions. To this end, we take the first steps towards quantifying the uncertainty that DP noise adds to DP ML training. In prior work, Karwa & Vadhan (2017) develop finite-sample confidence intervals, but for the simpler Gaussian mean estimation problem. Various methods exist for uncertainty quantification in ML-based systems (Mitchell, 1980; Roy et al., 2018; Begoli et al., 2019; Hubschneider et al., 2019; McDermott & Wikle, 2019; Tagasovska & Lopez-Paz, 2019; Wang et al., 2019a; Nair et al., 2020; Ferrando et al., 2022). However, these methods either use specialized (or simpler) model architectures to facilitate uncertainty quantification, or are not directly applicable to quantifying the uncertainty due to DP noise in DP ML. For example, a common approach to uncertainty quantification (Barrientos et al., 2019; Nissim et al., 2007; Brawner & Honaker, 2018; Evans et al., 2020), which we call the independent runs method, needs k independent (bootstrap) runs of the ML algorithm. However, repeating a DP ML algorithm multiple times can incur significant privacy and computation costs. To address this issue, we propose to use the last k checkpoints of a single run of a DP ML algorithm as a proxy for the k final checkpoints of independent runs. This does not incur any additional privacy cost to the DP ML algorithm.
Furthermore, it is useful in practice as it does not incur additional training compute, and it works with any algorithm that produces intermediate checkpoints. Theoretically, we consider using the sample variance of a statistic f(θ) at checkpoints θ_{t_1}, ..., θ_{t_k} as an estimator of the variance of f(θ_{t_k}), i.e., the statistic at the final checkpoint, and give a bound on the bias of this estimator. As expected, our bound on the bias decreases as the "burn-in" time t_1 and the time between checkpoints both increase. Intuitively, our proof shows that (i) as the burn-in time increases, the marginal distribution of each θ_{t_i} approaches the distribution of θ_{t_k}, and (ii) as the time between checkpoints increases, any pair θ_{t_i}, θ_{t_j} approaches pairwise independence. Both (i) and (ii) are proven via a mixing-time bound, which shows that starting from any point distribution θ_0, the Markov chain given by DP-SGD approaches its stationary distribution at a certain rate. Empirically, we show that our method provides reasonable lower bounds on the uncertainty quantified using the more accurate (but privacy- and computation-intensive) independent runs method. Related work on checkpoint aggregations: Chen et al. (2017) and Izmailov et al. (2018) explore checkpoint aggregation methods to improve performance in (non-DP) ML settings, but observe negligible performance gains. To our knowledge, De et al. (2022) is the only work in the DP ML literature that uses intermediate checkpoints post training. They apply an exponential moving average (EMA) over the checkpoints of DP-SGD, and note non-trivial gains in performance. However, we propose various aggregation methods that outperform EMA on standard benchmarks.

2. IMPROVING ACCURACY BY AGGREGATING DP TRAINING CHECKPOINTS

In this section, we describe our checkpoint aggregation methods, followed by the experimental setup we use for evaluation. We then detail experimental results that demonstrate the significant accuracy gains DP ML models obtain from checkpoint aggregations. DP preliminaries: Differential Privacy (DP) (Dwork et al., 2006) quantifies the privacy leakage from the outputs of a data analysis procedure. A randomized algorithm M : D* → Y is (ε, δ)-DP if, for all neighbouring datasets D, D′ ∈ D* (i.e., datasets that differ in one data sample) and all measurable sets of outputs S ⊆ Y, we have Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ. We add Gaussian noise to functions of bounded sensitivity to ensure DP. We also use DP's post-processing property: any analysis on the outputs of a DP algorithm does not worsen its DP guarantees. We consider two state-of-the-art DP ML algorithms: 1) DP-SGD for the central learning setting, i.e., when all data is pooled at a central location, e.g., a server, and 2) DP-FTRL for the federated learning (FL) setting, i.e., when private data is distributed across remote collaborating devices. The privacy analyses of both techniques involve composition across training rounds, allowing the release of all intermediate checkpoints computed during training.
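As a concrete reference point, one DP-SGD update (clip each per-example gradient, average, add Gaussian noise calibrated to the clipping norm, then descend) can be sketched as below. This is a minimal plain-Python illustration with our own helper names and signatures, not the exact implementation used in the paper:

```python
import math
import random

def clip(v, c):
    """Scale vector v down so its l2 norm is at most c."""
    n = math.sqrt(sum(x * x for x in v))
    if n <= c or n == 0.0:
        return list(v)
    return [x * c / n for x in v]

def dp_sgd_step(theta, per_example_grads, clip_norm, noise_mult, lr, rng):
    """One DP-SGD step: clip per-example gradients, average them,
    add Gaussian noise proportional to clip_norm, and take a gradient step."""
    b = len(per_example_grads)
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    avg = [sum(g[i] for g in clipped) / b for i in range(len(theta))]
    sigma = noise_mult * clip_norm / b  # noise std of the averaged gradient
    noisy = [a + rng.gauss(0.0, sigma) for a in avg]
    return [t - lr * g for t, g in zip(theta, noisy)]
```

Every intermediate `theta` produced by repeated calls to `dp_sgd_step` is a releasable checkpoint under the composition-based privacy analysis, which is what the aggregation methods below exploit.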

2.1. CHECKPOINTS AGGREGATION METHODS

We propose two types of aggregation methods: parameter aggregation, and output aggregation. Parameter aggregations compute a function of the parameters of intermediate checkpoints from a DP run, and then use the resulting aggregate parameters for inference. On the other hand, output aggregations compute a function of the outputs of the intermediate checkpoints, and use it for making predictions. These classes can encompass a vast majority of possible aggregation algorithms, but for brevity, we experiment with two algorithms in each of the classes: exponential moving average and uniform past k average that aggregate parameters, and predictions average and labels majority vote that aggregate outputs. Note that all of our aggregation algorithms post-process the checkpoints, and hence, do not incur any additional privacy cost. Additionally, our aggregation methods are general, i.e., they are applicable to any training algorithm that computes intermediate checkpoints.

2.1.1. PARAMETER AGGREGATION METHODS

Exponential moving average (EMA): EMA has been previously used (Tan & Le, 2019; Brock et al., 2021) to improve the performance of ML models at inference time. Starting from the last checkpoint of the run, EMA assigns exponentially decaying weights to each of the previous checkpoints; the weights are a function of the EMA coefficient β_t at step t. During training, at each step t, EMA maintains a moving average θ_ema^t that is a weighted average of θ_ema^{t−1} and the t-th checkpoint θ_t: θ_ema^t = (1 − β_t) · θ_ema^{t−1} + β_t · θ_t. Following (Tan & Le, 2019; De et al., 2022), we use a warm-up schedule for the EMA coefficient: β_t = min(β, (1+t)/(10+t)). Uniform past-k average (UPA): At step t of training, UPA computes the mean of the past k checkpoints, i.e., the checkpoints from steps [t − k + 1, t]: θ_upa^{t,k} = (1/k) · Σ_{i=t−k+1}^{t} θ_i.
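As a minimal sketch (plain Python over flat parameter lists; helper names are ours), the two parameter aggregations can be written as:

```python
def ema_update(theta_ema, theta_t, t, beta=0.9999):
    """One EMA step with the warm-up schedule beta_t = min(beta, (1+t)/(10+t)),
    following the formalization above."""
    beta_t = min(beta, (1 + t) / (10 + t))
    return [(1 - beta_t) * e + beta_t * x for e, x in zip(theta_ema, theta_t)]

def upa(checkpoints, k):
    """Uniform parameter-wise average of the last k checkpoints."""
    tail = checkpoints[-k:]
    dim = len(tail[0])
    return [sum(c[j] for c in tail) / len(tail) for j in range(dim)]
```

Both run as post-processing over already-released checkpoints, so neither touches the privacy accounting.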

2.1.2. OUTPUT AGGREGATION METHODS

Output predictions averaging (OPA): For a given test sample x, OPA first computes the prediction vectors f_{θ_i}(x) of the past k checkpoints, i.e., the checkpoints from steps [t − k + 1, t], averages the prediction vectors, and outputs the argmax of the average vector as the final label: ŷ_opa(x) = argmax( (1/k) · Σ_{i=t−k+1}^{t} f_{θ_i}(x) ). Output labels majority vote (OMV): For a given test sample x, OMV computes the output prediction vector of each of the past k checkpoints and the corresponding label, i.e., argmax(f_{θ_i}(x)). It then outputs the majority label among the k labels (breaking ties arbitrarily): ŷ_omv(x) = Majority( argmax(f_{θ_i}(x)) : i ∈ [t − k + 1, t] ).

2.2.1. EXPERIMENTAL SETUP

Model architectures and training details: Following the setup of the state-of-the-art (SOTA) in De et al. (2022), we train a WideResNet-16-4 (depth 16, width 4) using DP-SGD (Abadi et al., 2016b) in JAXline (Babuschkin et al., 2020) for ε ∈ {1, 8}. For CIFAR100, we follow De et al. (2022) and fine-tune the last (classifier) layer of a WideResNet-28-10 pre-trained on ImageNet data. For StackOverflow, we follow the SOTA in (Kairouz et al., 2021; Denisov et al., 2022) and train a one-layer LSTM using DP-FTRL in TFF (Abadi et al., 2016a) for ε ∈ {8.2, 18.9}. For UPA, OPA and OMV, we treat k, the number of checkpoints to aggregate, as a hyperparameter and tune it using validation data. All our results are computed over 5 runs of each setting. We provide additional details of our experimental setup in Section B.1.
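For reference, the two output aggregations (OPA and OMV) can be sketched in plain Python; the helper names are ours, and each checkpoint's f_{θ_i}(x) is assumed to be given as a per-class prediction vector:

```python
from collections import Counter

def opa(prediction_vectors):
    """OPA: average the k checkpoints' prediction vectors for one test
    sample, then return the argmax of the averaged vector."""
    k = len(prediction_vectors)
    n_classes = len(prediction_vectors[0])
    avg = [sum(p[c] for p in prediction_vectors) / k for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

def omv(prediction_vectors):
    """OMV: take each checkpoint's argmax label, then return the majority
    label (Counter.most_common breaks ties arbitrarily, as in the text)."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in prediction_vectors]
    return Counter(labels).most_common(1)[0][0]
```

Note the two can disagree: one over-confident checkpoint can dominate OPA's average, while OMV weighs every checkpoint's vote equally.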

2.2.2. EXPERIMENTAL RESULTS

First, we discuss the results for the original datasets, and then for the more practical periodic distribution shifting datasets. Below, the tables present results for the final training round/step, i.e., for aggregates computed until the last step/round, while the plots show results over the last k rounds for some k much smaller than the total number of training steps/rounds. For StackOverflow, due to the large size of its test data, we provide plots for accuracy on validation data and tables with test accuracy. CIFAR10 results: Table 1 and the leftmost two plots in Figure 1 present the accuracy gains on CIFAR10 for ε ∈ {1, 8}. For ε = 1, OPA provides the maximum accuracy gain of 3.79%, while for ε = 8, EMA provides the maximum gain of 2.23%. We note from Figure 1 that all checkpoint aggregations improve accuracy at all training steps of DP-SGD for both ε's. Next, note from Figure 1 that the accuracy of the baseline DP-SGD has a high variance across training steps, i.e., depending on the hyperparameters, DP-SGD can produce bad/good models, which can be undesirable in practice. However, checkpoint aggregations significantly reduce this variance in accuracy, and therefore increase the confidence in the final DP-SGD model. StackOverflow results: Table 1 and the rightmost two plots in Figure 1 present the accuracy gains on StackOverflow for ε ∈ {18.9, 8.2}; Figure 5 presents the results for ε = ∞. The maximum accuracy gains are 0.57%, 0.43%, and 0.42% for ε of ∞, 18.9, and 8.2, respectively. We observe that UPA aggregation always provides the best accuracy. These improvements are very significant, because there are 10,004 classes in the StackOverflow data. Similarly to CIFAR10, Figures 1 and 5 show that checkpoint aggregations improve accuracy for all training rounds by large margins and also significantly reduce the variance of model accuracy across training rounds.
Note that these improvements are on top of the momentum optimizer used during training. CIFAR100 results: Due to space constraints, we defer the results to Table 7 in Appendix B, and discuss the major observations here. First, we improve the SOTA baseline of De et al. (2022) from 70.6% to 75.51% at ε = 1, and from 77.6% to 80.8% at ε = 8. To achieve these gains, we perform fine-tuning over the EMA checkpoint of the ImageNet-trained WRN-28-10 instead of the final checkpoint as in De et al. (2022). Subsequently, we observe that for fine-tuning on CIFAR100, checkpoint aggregations provide small accuracy gains: for ε of 1 and 8, we observe maximum gains of 0.11%, due to UPA and OPA, respectively.

2.2.3. ACCURACY IMPROVEMENTS IN PERIODIC DISTRIBUTION SHIFTING SETTINGS

In many real-world settings, for instance in federated learning (FL), the training data distribution may vary over time. Zhu et al. (2021) demonstrate the adverse impacts of distribution shifts in training data on the performance of the resulting FL models. Considering the practical significance of such distribution shifts, we consider settings where the training data distribution has diurnal variations, i.e., it is a function of two oscillating distributions, D_1 and D_2; Figure 2 shows these (linearly varying) periodic distribution shifts. Such a scenario commonly occurs in FL training, e.g., when a model is trained using FL with client devices participating from two significantly different time zones. To simulate such a diurnal distribution, for the CIFAR datasets, we design D_1 and D_2 such that they respectively contain the data from the even and odd classes of the original data. For StackOverflow, we design D_1 to contain only the questions from each user, and D_2 to contain only their answers. Then, we draw clients from D_1 and D_2. Apart from the data distribution, the rest of the experimental setup is the same as before. We use the same test and validation data as for the original StackOverflow setting. CIFAR10 results: Table 2 and Figure 3 (left two plots) show accuracy gains for diurnal CIFAR10. We note very large accuracy gains for both ε ∈ {1, 8}; the absolute accuracy gains are 7.45% and 17.37%, respectively, both due to OPA. Observe that such diurnal settings exhibit large variances in accuracy across training steps, but our aggregation methods almost completely eliminate them. The improvements in diurnal settings are significantly larger than on the original CIFAR10. This is because in diurnal settings the variance in model accuracy over training steps is very large, and hence the benefits of checkpoint aggregations are magnified in these settings.
StackOverflow results: Table 2 and Figure 3 (rightmost two plots) present accuracy gains for diurnal StackOverflow. The maximum accuracy gains are 0.09%, 0.44%, and 0.51% for ε of ∞, 18.9, and 8.2, respectively. In contrast to the diurnal CIFAR10 case, the improvements for diurnal StackOverflow and the original StackOverflow are similar. This is because the two distributions in diurnal CIFAR10 (completely different images from even/odd classes) are significantly farther apart than the two distributions in diurnal StackOverflow (text from questions/answers). CIFAR100 results: As before, we fine-tune the WRN pre-trained on downsampled ImageNet1k from De et al. (2022); similar to De et al. (2022), we observe that in small-ε regimes, fine-tuning only the classifier layer provides better accuracy. We defer the full results to Table 8 in Appendix B. For CIFAR100 also, we observe significant gains due to checkpoint aggregations in the PDS setting: for ε of 1 and 8, we note maximum gains of 4.96% (due to UPA) and 3.37% (due to EMA), respectively. Similar to CIFAR10, the gains due to checkpoint aggregation are significantly higher for PDS-CIFAR100 than for non-PDS CIFAR100.

2.3. DATA-DEPENDENT CHECKPOINT SELECTION

So far, our aggregation methods have operated over a fixed sequence of checkpoints, e.g., the past k. However, when some public data similar to the private training data is available, we propose a data-dependent checkpoint selection method for aggregation: aggregate the checkpoints that perform best on the (held-out) public data. We expect this to improve over data-independent aggregation schemes. We validate this hypothesis for the StackOverflow dataset. We use 10,000 random samples (less than 0.01% of the training data) from the StackOverflow validation data as the held-out public data. Note that this is disjoint from the validation data that we use for hyperparameter tuning. We omit the CIFAR datasets from this evaluation due to the lack of additional similar data.
We compute the accuracy of all training checkpoints on the held-out data, find the best k checkpoints, and aggregate them as detailed above. We tune the hyperparameter k using disjoint validation data. Table 3 presents the gains due to data-dependent aggregations over the baseline and data-independent aggregations; Tables 5 and 6 in Appendix B present full results. We see that data-dependent aggregations outperform data-independent aggregations for all ε's. For instance, at ε = 8.2, the accuracies of the baseline, data-independent, and data-dependent aggregations are 22.28%, 22.7%, and 22.8%, respectively. For ε of ∞ and 18.9, aggregating the best past-k checkpoints outperforms aggregating the past k checkpoints (the baseline) by 0.16% (0.73%) and 0.08% (0.51%), respectively. We make the same observation for the PDS StackOverflow dataset: the maximum accuracy gains due to data-dependent aggregations are 0.29%, 0.58%, and 0.66% for ε of ∞, 18.9, and 8.2, respectively.
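A sketch of this data-dependent selection step (helper names are ours; per-checkpoint held-out accuracies are assumed precomputed): rank the checkpoints by held-out accuracy, keep the top k, and aggregate them, here with a uniform parameter average, though any aggregation from Section 2.1 applies:

```python
def aggregate_best_k(checkpoints, heldout_accuracies, k):
    """Rank checkpoints by accuracy on held-out public data, keep the
    top k, and uniformly average their parameters."""
    ranked = sorted(range(len(checkpoints)),
                    key=lambda i: heldout_accuracies[i], reverse=True)
    best = [checkpoints[i] for i in ranked[:k]]
    dim = len(best[0])
    return [sum(c[j] for c in best) / k for j in range(dim)]
```

Since the selection consults only public data and already-released checkpoints, it remains post-processing and adds no privacy cost.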

2.4. IMPROVED EXCESS RISK VIA TAIL AVERAGING

In this section, we formally show that checkpoint aggregations like uniform tail averaging provably improve the privacy/utility trade-offs compared to the last checkpoint of DP-SGD. Consider the following DP-SGD variant:

1. θ_0 ← 0^p.
2. For t ∈ [T]: θ_{t+1} ← Π_C(θ_t − η_t(∇L(θ_t; D) + b_t)), where b_t ∼ N(0, (L²T)/(2nρ) · I_{p×p}) and Π_C(·) is the ℓ₂-projection onto the set C.

We provide the utility guarantee for this algorithm by directly appealing to the result of Shamir & Zhang (2013). For a given α ∈ (0, 1), UPA (Section 2.1.1) corresponds to the average of the last αT models, i.e., θ_upa^priv = (1/(αT)) · Σ_{t=(1−α)T+1}^{T} θ_t. One can also consider polynomial-decay averaging (PDA) with parameter γ ≥ 0, defined as follows: θ_pda^priv[t] = (1 − (γ+1)/(t+γ)) · θ_pda^priv[t−1] + ((γ+1)/(t+γ)) · θ_t. For γ = 0, PDA matches UPA over all iterates. As γ increases, PDA places more weight on later iterates; in particular, if γ = cT, the averaging scheme is very similar to EMA (Section 2.1.1), since as t → T the decay parameter (γ+1)/(t+γ) approaches the constant c/(c+1). In that sense, PDA can be viewed as a method interpolating between UPA and EMA.

Theorem 2.1 (Adapted from Theorems 2 and 4 of Shamir & Zhang (2013)). Consider the DP-SGD algorithm above with the associated parameters. Then there exist a choice of learning rate η_t and number of time steps T such that the following hold for α = Θ(1):

E[L(θ_upa^priv; D)] − min_{θ∈C} L(θ; D) = O(L ‖C‖₂ √p / (nρ)), and
E[L(θ_T; D)] − min_{θ∈C} L(θ; D) = O(L ‖C‖₂ √p log(n) / (nρ)).

Furthermore, for γ = Θ(1), we have

E[L(θ_pda^priv[T]; D)] − min_{θ∈C} L(θ; D) = O(L ‖C‖₂ √p / (nρ)).

Proof. If we choose T = nρ and set η_t appropriately, the proof of Theorem 2 of Shamir & Zhang (2013) implies the following for θ_upa^priv: E[L(θ_upa^priv; D)] − min_{θ∈C} L(θ; D) = O((L ‖C‖₂ √p / (nρ)) · log(1/α)). Setting α = Θ(1) gives the first part of the theorem, and αT = 1, i.e., 1/α = T = nρ, gives the second.
The third part follows from modifying Theorem 4 of Shamir & Zhang (2013) for the convex case (see the end of Section 4 of Shamir & Zhang (2013) for details). The excess empirical risk for θ_T is thus higher by a factor of log(n) in comparison to θ_upa^priv and θ_pda^priv[T]. For the step-size schedules typically used in practice (e.g., fixed or inverse-polynomial step sizes), the last iterate suffers from this extra log(n) factor, and we do not know how to avoid it. Furthermore, Harvey et al. (2019) showed that it is unavoidable in the non-private, high-probability regime.
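The polynomial-decay averaging recursion above is easy to state in code. The sketch below (plain Python, our naming) also illustrates that γ = 0 recovers the uniform average of all iterates:

```python
def pda(iterates, gamma):
    """Polynomial-decay averaging:
    avg[t] = (1 - (gamma+1)/(t+gamma)) * avg[t-1]
           + ((gamma+1)/(t+gamma)) * theta_t,
    with avg[1] initialized to the first iterate."""
    avg = list(iterates[0])
    for t in range(2, len(iterates) + 1):
        w = (gamma + 1) / (t + gamma)
        avg = [(1 - w) * a + w * x for a, x in zip(avg, iterates[t - 1])]
    return avg
```

For gamma = 0 the weight on iterate t is 1/t, which is exactly the running uniform average; larger gamma shifts weight toward late iterates, moving the scheme toward EMA-like behavior.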

3. QUANTIFYING UNCERTAINTY IN PREDICTIONS DUE TO DP NOISE

In this section, we present our proposal for quantifying the uncertainty that differential privacy (DP) noise adds to the outputs of ML algorithms, without additional privacy cost or computation. First, we discuss the issues with the established approach to uncertainty quantification in ML when used in the DP setting, and then present our proposal and theoretical results. Finally, we provide experimental results that demonstrate the practical utility of our approach. (Naive) uncertainty quantification, and its hurdles: As discussed in Section 1, although uncertainty quantification has a long history of research, prior methods, including the most common independent runs method, are not applicable in DP settings. The two major issues with the independent runs method in DP settings are: first, the additional runs of a DP ML algorithm can incur a significant increase in privacy cost; second, the additional computation required by the independent runs method can be prohibitive, e.g., in production FL. Hence, it is not feasible to use the naive method for uncertainty quantification in DP settings.

3.1. TWO BIRDS, ONE STONE: OUR UNCERTAINTY QUANTIFICATION PROPOSAL

To address the two hurdles discussed above, we propose a simple yet efficient method that leverages the intermediate checkpoints computed during a DP run. Specifically, we substitute the k output models from the independent runs method with k checkpoints, e.g., the last k checkpoints, from a single DP run. The rest of the confidence interval computation is the same for both methods. Specifically, we compute the most likely classification labels of the k models/checkpoints on a test input, treat the set of k output labels as a sample from a Student's t-distribution, and compute a confidence interval for the mean of that distribution. Finally, we use the average of such intervals over multiple test inputs as a proxy for the uncertainty in the outputs of the DP ML algorithm.
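Concretely, the per-input interval computation can be sketched as follows (plain Python; the function name is ours, and the critical value t_crit must come from a t-table or a stats library, e.g., roughly 2.776 for a two-sided 95% interval with k = 5 samples, i.e., 4 degrees of freedom):

```python
import math

def ci_width(labels, t_crit):
    """Two-sided confidence-interval width for the mean of the k labels,
    2 * t_crit * s / sqrt(k), where s is the sample standard deviation."""
    k = len(labels)
    mean = sum(labels) / k
    s2 = sum((v - mean) ** 2 for v in labels) / (k - 1)  # sample variance
    return 2 * t_crit * math.sqrt(s2 / k)
```

Whether `labels` comes from the final models of k independent runs or the last k checkpoints of one run is exactly the substitution the method makes; the final uncertainty estimate is the average of `ci_width` over many test inputs.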

3.1.1. THEORY

In this section, we give a theoretical motivation for using checkpoints to estimate the variance of a statistic. We show that, under certain assumptions, if we take the sample variance of a statistic f(θ) computed at the checkpoints θ_{t_1}, ..., θ_{t_k}, then in expectation it is a good approximation of the variance under the limiting distribution of DP-SGD. As expected, the quality of the approximation improves as the length of the "burn-in" time and the time between checkpoints increase. To simplify the proof, we actually prove the theorem for DP-LD, a continuous-time version of DP-SGD. DP-LD can be defined as follows. We first reformulate (unconstrained) DP-SGD with step size η as: θ̃_{(t+1)η} ← θ̃_{tη} − η∇L(θ̃_{tη}; D) + b_t, where b_t ∼ N(0, 2ησ² I_{p×p}). Notice that we have reparameterized θ̃ so that its subscript refers to the sum of all step sizes so far, i.e., after t iterations we have θ̃_{tη} and not θ̃_t. Also notice that the variance of the added noise is proportional to the step size η; in turn, for any η that divides t, after t/η iterations with step size η, the sum of the variances of the added noises is 2tσ². Now, taking the limit as η goes to 0 of the sequence of random variables {θ̃_{tη}}_{t∈Z≥0} defined by DP-SGD, we get a continuous sequence {θ_t}_{t∈R≥0}. In particular, if we fix some t, then θ_t is the limit as η goes to 0 of θ̃_t defined by DP-SGD with step size η. This sequence is exactly the sequence defined by DP-LD, given more formally by the following stochastic differential equation: dθ_t = −∇L(θ_t; D) dt + σ√2 dW_t. (1) Here, W_t is a standard Brownian motion and σ² is analogous to the variance of the noise added in DP-SGD. In Appendix A we show the following: Theorem 3.1 (Simplified version of Theorem A.1). Suppose L is 1-strongly convex and M-smooth, and σ = 1 in (1). Let 0 < t_1 < t_2 < ... < t_k be such that t_{i+1} ≥ t_i + γ for all i > 0 and some γ.
Let {θ_{t_i} : i ∈ [k]} be the checkpoints, and let f : Θ → [−1, 1] be a statistic whose variance we wish to estimate. Let S be the sample variance of f over the checkpoints, and let V be the variance of f(θ_{t_k}), i.e., of the statistic at the final checkpoint. Then, for some "burn-in" time B that is a function of θ_0, M, p, we have: |E[S] − V| = exp(−Ω(min{t_1, γ} − B)). Intuitively, Theorem A.1 and its proof say the following: (i) as we increase t_1, the time before the first checkpoint, each checkpoint's marginal distribution approaches the distribution of θ_{t_k}, and (ii) as we increase γ, the time between checkpoints, the checkpoints' distributions approach pairwise independence. So increasing both t_1 and γ causes our checkpoints to approach k pairwise-independent samples from the distribution of θ_{t_k}, i.e., our variance estimator approaches the true variance in expectation. To show both (i) and (ii), we build upon past results from the sampling literature to show a mixing bound of the following form: running DP-LD from any point initialization θ_0, the Rényi divergence between θ_t and the limit of DP-LD as t → ∞, θ_∞, decays exponentially in t. This mixing bound shows (i), since if t_1 is sufficiently large, then the distributions of all of θ_{t_1}, θ_{t_2}, ..., θ_{t_k} are close to θ_∞, and thus close to each other. It also shows (ii), since DP-LD is a Markov chain, i.e., the distribution of θ_{t_j} conditioned on θ_{t_i} is the distribution of θ_{t_j − t_i} if we run DP-LD starting from θ_{t_i} instead of θ_0. So our mixing bound shows that even after conditioning on θ_{t_i}, θ_{t_j} has distribution close to θ_∞. Since θ_{t_j} is close to θ_∞ conditioned on any value of θ_{t_i}, θ_{t_j} is almost independent of θ_{t_i}.

[Figure 4: Uncertainty due to DP noise, measured using confidence interval widths. We compute the intervals using N bootstrap (independent) runs, and using the last N checkpoints of a single run.]

Empirical analysis:

We compare the uncertainty quantified using the independent runs method and using our method; the experimental setup is the same as in Section 2.2.1. First, for both the StackOverflow and CIFAR10 datasets, we do 100 independent training runs. Then, to compute uncertainty using the independent runs method, we take the final model from k of these runs (chosen randomly), compute prediction labels for a given input, compute confidence interval widths for the input, and finally use the average of the confidence interval widths over a large number of inputs as the final uncertainty estimate. To compute uncertainty using our checkpoints-based method, we instead select our k models to be the last k checkpoints of a random training run, and obtain average confidence intervals as before. Figure 4 shows the results for StackOverflow and CIFAR10. The plots show intervals averaged over 5 runs, i.e., obtained by sampling k final models 5 times, or by using the last k models of 5 random runs. We observe that the uncertainty computed using intermediate checkpoints consistently gives a reasonable lower bound on the uncertainty computed using independent runs.

4. CONCLUSIONS

In this work, we explore methods for aggregating intermediate checkpoints to improve the utility of DP training. Using checkpoint aggregation methods, we show significant improvements in prediction accuracy over the SOTA for the CIFAR and StackOverflow datasets. We also show that uniform tail averaging of checkpoints improves the ERM bound compared to the last checkpoint of DP-SGD. Lastly, we prove that for some standard loss functions, the sample variance of the last few checkpoints provides a good approximation of the variance of the final model of a DP run. We also conduct experiments demonstrating that the last few checkpoints can provide a reasonable lower bound on the variance of a converged DP model. For future work, using checkpoint aggregates during DP training could be an interesting direction to further improve its utility. Leveraging intermediate checkpoints to provide variance estimates for checkpoint aggregates could also be a promising direction.

A PROOF OF THEOREM A.1

For convenience and formality, we review the setup for the theorem we wish to prove. Recall that, given a loss function L, by taking the limit as η goes to 0 in DP-SGD, we recover the stochastic differential equation (1), restated here for convenience: dθ_t = −∇L(θ_t; D) dt + σ√2 dW_t. Note that the solutions θ_t to this equation are random variables. A key property of (1) is that the stationary distribution (equivalently, the limiting distribution as t → ∞) has pdf proportional to exp(−L(θ; D)/σ²) under mild assumptions on L (which are satisfied by strongly convex and smooth functions). To simplify the proofs and presentation in this section, we assume that (a) θ_0 is a point distribution, (b) we optimize unconstrained over R^p, i.e., there is no need for a projection operator in DP-SGD or (1), (c) the loss L is 1-strongly convex and M-smooth, and (d) σ = 1. We note that (a) can be replaced with θ_0 sampled from a random initialization without much work, and (c) can be enforced for Lipschitz, smooth functions by adding a quadratic regularizer. We let θ* denote the (unique) minimizer of L throughout the section. Now, we consider the following setup: we obtain a single sample of the trajectory {θ_t : t ∈ [0, T]}. We have some statistic f : Θ → [−1, 1], and we wish to estimate the variance of the statistic at the final value θ_T, i.e., the variance V := Var(f(θ_T)). To do so, we use the sample variance of the checkpoints at times 0 < t_1 < t_2 < t_3 < ... < t_k = T. That is, our estimator is S = (1/(k−1)) · Σ_{i=1}^{k} (f(θ_{t_i}) − μ)², where μ = (1/k) · Σ_{i=1}^{k} f(θ_{t_i}). Theorem A.1. Under the preceding assumptions/setup, for some sufficiently large constant c, let γ = 1/(2M) + ln(cM(p + ln(1/Δ) + ‖θ_0 − θ*‖₂²)) + c·ln(1/Δ) (recall that p is the dimensionality of the space). Then, if t_1 > γ and t_{i+1} > t_i + γ for all i > 0, for S, V as defined above: |E[S] − V| = O(Δ).
Before proving this theorem, we need a few helper lemmas about Rényi divergences.

Definition A.2. The Rényi divergence of order $\alpha > 1$ between two distributions $P$ and $Q$ (with support $\mathbb{R}^d$), $D_\alpha(P, Q)$, is defined as follows:
$$D_\alpha(P, Q) := \frac{1}{\alpha - 1}\ln\int_{\theta \in \mathbb{R}^d} \frac{P(\theta)^\alpha}{Q(\theta)^{\alpha-1}}\, d\theta.$$
We refer the reader to e.g. van Erven & Harremos (2014); Mironov (2017) for properties of the Rényi divergence. The following property shows that for any two random variables close in Rényi divergence, functions of them are close in expectation:

Lemma A.3 (Adapted from Lemma C.2 of Bun & Steinke (2016)). Let $P$ and $Q$ be two distributions on $\Omega$ and $g : \Omega \to [-1, 1]$. Then, $|\mathbb{E}_{x \sim P}[g(x)] - \mathbb{E}_{x \sim Q}[g(x)]| \le e^{D_2(P \| Q)} - 1$. Here, $D_2(P \| Q)$ is the Rényi divergence of order two between the distributions $P$ and $Q$.

The next lemma shows that the solution to (1) approaches $\theta_\infty$ exponentially quickly in Rényi divergence.

Lemma A.4. Fix some point $\theta_0$. Assume $L$ is 1-strongly convex and $M$-smooth. Let $P$ be the distribution of $\theta_t$ according to (1) for $\sigma = 1$ and
$$t := \frac{1}{2M} + \ln\big(c(M\|\theta_0 - \theta^*\|_2^2 + p\ln(M))\big) + c\ln(1/\Delta),$$
where $c$ is a sufficiently large constant. Let $Q$ be the stationary distribution of (1). Then: $D_2(P, Q) = O(\Delta^2)$.

The proof of this lemma builds upon techniques in Ganesh & Talwar (2020), and we defer it to the end of the section. Our final helper lemma shows that $\theta_\infty$ is close to $\theta^*$ with high probability:

Lemma A.5. Let $\theta_\infty$ be the random variable given by the stationary distribution of (1) for $\sigma = 1$. If $L$ is 1-strongly convex, then: $\Pr[\|\theta_\infty - \theta^*\|_2 > \sqrt{p} + x] \le \exp(-x^2/2)$.

Proof. We know the stationary distribution has pdf proportional to $\exp(-L(\theta; D))$. In particular, since $L$ is 1-strongly convex, this means $\theta_\infty$ is a sub-Gaussian random vector (i.e., its dot product with any unit vector is a sub-Gaussian random variable), and thus the above tail bound applies to it.
We now show that, under the assumptions in Theorem A.1, every checkpoint is close to the stationary distribution, and that every pair of checkpoints is nearly pairwise independent.

Lemma A.6. Under the assumptions/setup of Theorem A.1, we have:
(E1) $\forall i: |\mathbb{E}[f(\theta_{t_i})] - \mathbb{E}[f(\theta_{t_k})]| = O(\Delta)$,
(E2) $\forall i: |\mathbb{E}[f(\theta_{t_i})^2] - \mathbb{E}[f(\theta_{t_k})^2]| = O(\Delta)$,
(E3) $\forall i < j: |\mathrm{Cov}(f(\theta_{t_i}), f(\theta_{t_j}))| = O(\Delta)$.

Proof. We assume without loss of generality that $\Delta$ is at most a sufficiently small constant; otherwise, since $f$ has range $[-1, 1]$, all of the above quantities can easily be bounded by 2, so a bound of $O(\Delta)$ holds for any distributions on $\{\theta_{t_i}\}$. For (E1), by the triangle inequality, it suffices to prove a bound of $O(\Delta)$ on $|\mathbb{E}[f(\theta_{t_i})] - \mathbb{E}[f(\theta_\infty)]|$. We abuse notation by letting $\theta_t$ denote both the random variable and its distribution. Then:
$$|\mathbb{E}[f(\theta_{t_i})] - \mathbb{E}[f(\theta_\infty)]| \overset{\text{Lemma A.3}}{\le} e^{D_2(f(\theta_{t_i}), f(\theta_\infty))} - 1 \overset{(*_1)}{\le} e^{D_2(\theta_{t_i}, \theta_\infty)} - 1 \overset{\text{Lemma A.4}}{=} e^{O(\Delta^2)} - 1 \overset{(*_2)}{=} O(\Delta).$$
In $(*_1)$ we use the data-processing inequality (Theorem 9 of van Erven & Harremos (2014)), and in $(*_2)$ we use the fact that $e^x - 1 \le 2x$ for $x \in [0, 1]$, together with our assumption on $\Delta$. (E2) follows from (E1) by using $f^2$ (which is still bounded in $[-1, 1]$) in place of $f$. For (E3), note that since (1) is a (continuous) Markov chain, the distribution of $\theta_{t_j}$ conditioned on $\theta_{t_i}$ is the same as the distribution of $\theta_{t_j - t_i}$ according to (1) if we start from $\theta_{t_i}$ instead of $\theta_0$. Let $P$ be the joint distribution of $(\theta_{t_i}, \theta_{t_j})$. Let $Q$ be the joint distribution of $(\theta_{t_i}, \theta_\infty)$ (since (1) has the same stationary distribution regardless of its initialization, this is a pair of independent variables). Let $P', Q'$ be defined identically to $P, Q$, except that when sampling $\theta_{t_i}$, if $\|\theta_{t_i} - \theta^*\|_2 > \sqrt{p} + \sqrt{2\ln(1/\Delta)}$ we instead set $\theta_{t_i} = \theta^*$ (and in the case of $P'$, we instead sample $\theta_{t_j}$ from $\theta_{t_j} \mid \theta_{t_i} = \theta^*$ when this happens). Let $R$ denote this distribution over $\theta_{t_i}$.
Then, similarly to the proof of (E1), we have:
$$|\mathbb{E}_{P'}[f(\theta_{t_i})f(\theta_{t_j})] - \mathbb{E}_{Q'}[f(\theta_{t_i})]\mathbb{E}[f(\theta_\infty)]| \overset{\text{Lemma A.3}}{\le} e^{D_2(P', Q')} - 1 \overset{(*_3)}{\le} e^{\max_{\theta_{t_i} \in \mathrm{supp}(R)}\{D_2(\theta_{t_j} \mid \theta_{t_i},\, \theta_\infty)\}} - 1 \overset{\text{Lemma A.4}}{=} e^{O(\Delta^2)} - 1 = O(\Delta).$$
Here $(*_3)$ follows from the convexity of the Rényi divergence, and in our application of Lemma A.4 we use the fact that for all $\theta_{t_i} \in \mathrm{supp}(R)$, $\|\theta_{t_i} - \theta^*\|_2 \le \sqrt{p} + \sqrt{2\ln(1/\Delta)}$. Furthermore, by Lemma A.5, we know $P$ and $P'$ (resp. $Q$ and $Q'$) differ by at most $\Delta$ in total variation distance. So, since $f$ is bounded in $[-1, 1]$, we have:
$$|\mathbb{E}_P[f(\theta_{t_i})f(\theta_{t_j})] - \mathbb{E}_{P'}[f(\theta_{t_i})f(\theta_{t_j})]| \le \Delta, \qquad |\mathbb{E}_Q[f(\theta_{t_i})]\mathbb{E}[f(\theta_\infty)] - \mathbb{E}_{Q'}[f(\theta_{t_i})]\mathbb{E}[f(\theta_\infty)]| \le \Delta.$$
Then, by applying the triangle inequality twice:
$$|\mathbb{E}_P[f(\theta_{t_i})f(\theta_{t_j})] - \mathbb{E}_Q[f(\theta_{t_i})]\mathbb{E}[f(\theta_\infty)]| = O(\Delta).$$
Now we can prove (E3) as follows:
$$|\mathrm{Cov}(f(\theta_{t_i}), f(\theta_{t_j}))| = |\mathbb{E}[f(\theta_{t_i})f(\theta_{t_j})] - \mathbb{E}[f(\theta_{t_i})]\mathbb{E}[f(\theta_{t_j})]| \le |\mathbb{E}[f(\theta_{t_i})f(\theta_{t_j})] - \mathbb{E}[f(\theta_{t_i})]\mathbb{E}[f(\theta_\infty)]| + |\mathbb{E}[f(\theta_{t_i})]\mathbb{E}[f(\theta_\infty)] - \mathbb{E}[f(\theta_{t_i})]\mathbb{E}[f(\theta_{t_j})]| \le O(\Delta) + |\mathbb{E}[f(\theta_\infty)] - \mathbb{E}[f(\theta_{t_j})]| = O(\Delta).$$

Proof of Theorem A.1. We again assume without loss of generality that $\Delta$ is at most a sufficiently small constant. The proof strategy is to express $\mathbb{E}[S]$ in terms of the individual variances $\mathrm{Var}(f(\theta_{t_i}))$, which can be bounded using Lemma A.6. We have the following:
$$\mathbb{E}[S] = \frac{1}{k-1}\sum_{i=1}^k \mathbb{E}\big[(f(\theta_{t_i}) - \mu)^2\big] = \frac{1}{k-1}\sum_{i=1}^k \mathbb{E}\Bigg[\Big(\frac{k-1}{k}\Big)^2\Big(\underbrace{f(\theta_{t_i})}_{x_i} - \underbrace{\tfrac{1}{k-1}\textstyle\sum_{j \in [k], j \ne i} f(\theta_{t_j})}_{y_i}\Big)^2\Bigg]. \quad (2)$$
From (2), we have the following:
$$\mathbb{E}[(x_i - y_i)^2] = \mathbb{E}[x_i^2] - 2\mathbb{E}[x_i y_i] + \mathbb{E}[y_i^2] = \underbrace{\mathrm{Var}(x_i)}_{A} + \underbrace{\mathrm{Var}(y_i)}_{B} + \underbrace{(\mathbb{E}[x_i])^2 + (\mathbb{E}[y_i])^2 - 2\mathbb{E}[x_i y_i]}_{C}. \quad (3)$$
In the following, we bound each of the terms A, B, and C individually. First, let us consider the term B.
We have the following:
$$B = \mathrm{Var}(y_i) = \frac{1}{(k-1)^2}\Bigg(\sum_{j \in [k], j \ne i} \mathrm{Var}(f(\theta_{t_j})) + 2\sum_{\substack{1 \le j < \ell \le k \\ j \ne i,\, \ell \ne i}} \mathrm{Cov}(f(\theta_{t_j}), f(\theta_{t_\ell}))\Bigg). \quad (4)$$
Plugging (E3) of Lemma A.6 into (4), we bound the variance of $y_i$ as follows:
$$B = \mathrm{Var}(y_i) = \frac{1}{(k-1)^2}\Bigg(\sum_{j \in [k], j \ne i} \mathrm{Var}(f(\theta_{t_j}))\Bigg) \pm O(\Delta). \quad (5)$$
We now focus on bounding the term C in (3). (E1) and (E3) of Lemma A.6 imply the following:
$$(\mathbb{E}[x_i])^2 = (\mathbb{E}[f(\theta_{t_k})])^2 \pm O(\Delta), \quad (6)$$
$$(\mathbb{E}[y_i])^2 = (\mathbb{E}[f(\theta_{t_k})])^2 \pm O(\Delta), \quad (7)$$
$$\mathbb{E}[x_i y_i] = (\mathbb{E}[f(\theta_{t_k})])^2 \pm O(\Delta). \quad (8)$$
Plugging (6), (7), and (8) into (3), we have:
$$\mathbb{E}[(x_i - y_i)^2] = \mathrm{Var}(f(\theta_{t_i})) + \frac{1}{(k-1)^2}\Bigg(\sum_{j \in [k], j \ne i} \mathrm{Var}(f(\theta_{t_j}))\Bigg) \pm O(\Delta). \quad (9)$$
Now, (E1) and (E2) of Lemma A.6 imply that $\forall i: |\mathrm{Var}(f(\theta_{t_i})) - \mathrm{Var}(f(\theta_{t_k}))| = O(\Delta)$. So from (9) we have the following:
$$\mathbb{E}[(x_i - y_i)^2] = \mathrm{Var}(f(\theta_{t_k})) \cdot \frac{k}{k-1} \pm O(\Delta). \quad (10)$$
Plugging this bound back into (2), we have:
$$\mathbb{E}[S] = \frac{1}{k-1}\cdot\Big(\frac{k-1}{k}\Big)^2\cdot k\cdot\mathrm{Var}(f(\theta_{t_k}))\cdot\frac{k}{k-1} \pm O(\Delta) = \mathrm{Var}(f(\theta_{t_k})) \pm O(\Delta), \quad (11)$$
which completes the proof.
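As an illustrative numerical sanity check of Theorem A.1 (not part of the proof): for the 1-strongly convex quadratic loss L(θ) = θ²/2 with σ = 1, equation (1) is an Ornstein–Uhlenbeck process with stationary distribution N(0, 1), so the within-run checkpoint sample variance of a bounded statistic such as f(θ) = tanh(θ) should approximately match the across-run variance of f(θ_T) once the burn-in time and checkpoint spacing are large. All simulation parameters below are illustrative choices, not constants from the theorem.

```python
import numpy as np

# Euler-Maruyama simulation of d(theta) = -theta dt + sqrt(2) dW,
# run for n_runs independent trajectories in parallel.
rng = np.random.default_rng(0)
n_runs, dt, T = 2000, 0.01, 10.0
ckpt_times = [2.0, 4.0, 6.0, 8.0, 10.0]   # burn-in and spacing well above 1
steps = int(T / dt)
theta = np.full(n_runs, 1.0)              # point initialization theta_0 = 1
ckpts = []
for s in range(1, steps + 1):
    theta += -theta * dt + np.sqrt(2 * dt) * rng.normal(size=n_runs)
    if any(abs(s * dt - t) < dt / 2 for t in ckpt_times):
        ckpts.append(np.tanh(theta))      # bounded statistic f in [-1, 1]
ckpts = np.stack(ckpts)                   # shape (k, n_runs)
within_run = ckpts.var(axis=0, ddof=1).mean()  # E[S], averaged over runs
across_run = ckpts[-1].var(ddof=1)             # V = Var(f(theta_T))
```

With well-separated checkpoints the two quantities agree closely, which is exactly the E[S] ≈ V conclusion of the theorem.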

A.1 OPTIMIZING THE NUMBER OF CHECKPOINTS

In Theorem A.1, we fixed the number of checkpoints and gave lower bounds on the burn-in time and separation between checkpoints needed for the sample variance to have bias at most $\Delta$. We could instead consider the problem where $T$, the time of the final checkpoint, is fixed, and we want to choose the $k$ that minimizes the (upper bound on the) mean squared error of the sample variance of $\{f(\theta_{iT/k})\}_{i \in [k]}$. Here, we sketch a solution to this problem using the bound from this section. The mean squared error of the sample variance is the sum of the squared bias and the variance of this estimator. We will use the following reparameterization of Theorem A.1:

Theorem A.7 (Reparameterized version of Theorem A.1). Let $c_1 := \frac{1}{2M} + \ln\big(c_2 M(p + \|\theta_0 - \theta^*\|_2^2)\big)$, where $c_2$ is a sufficiently large constant. Then if $S$ is the sample variance of $\{f(\theta_{iT/k})\}_{i \in [k]}$, $V$ is the true variance of $f(\theta_T)$, and $T/k > c_1$:
$$|\mathbb{E}[S] - V|^2 \le \exp\Big(-\frac{T/k - c_1}{c_2}\Big).$$
One can also show that the variance of $S$ is close to the variance of the sample variance of i.i.d. samples:

Lemma A.8. Let $\tilde{S}$ be the sample variance of $k > 1$ i.i.d. samples of $f(\theta_T)$. Then, if $c_2$ is a sufficiently large constant, for $c_1$ as defined in Theorem A.7:
$$\mathrm{Var}(\tilde{S}) \le \frac{1}{k}, \qquad |\mathrm{Var}(S) - \mathrm{Var}(\tilde{S})| \le 2\exp\Big(-\frac{T/k - c_1}{c_2}\Big).$$

Proof. Let $x_1, \ldots, x_k$ be $k$ i.i.d. samples of $f(\theta_T)$. Then, since each $x_i$ lies in the interval $[-1, 1]$:
$$\mathrm{Var}(\tilde{S}) = \frac{\mathbb{E}[x_1^4]}{k} - \frac{\mathrm{Var}(x_1)^2(k-3)}{k(k-1)} \le \frac{1}{k},$$
giving the first part of the lemma. For the second part, let $x_i$ now be the sampled value of $f(\theta_{iT/k})$. Then:
$$\mathbb{E}[S^2] = \mathbb{E}\Bigg[\Bigg(\frac{1}{k-1}\sum_{i \in [k]}\Big(x_i - \frac{1}{k}\sum_{j \in [k]} x_j\Big)^2\Bigg)^2\Bigg].$$
For some coefficients $c_{i,j,\ell,m}$, this can be written as $\sum_{i \le j \le \ell \le m} c_{i,j,\ell,m}\,\mathbb{E}[x_i x_j x_\ell x_m]$, where $\sum_{i \le j \le \ell \le m} |c_{i,j,\ell,m}| \le 2$. By a similar argument to Theorem A.1, the change in this expectation if we instead use $x_i$ that are i.i.d. is at most $\exp\big(-\frac{T/k - c_1}{c_2}\big)$, as long as $c_2$ is a sufficiently large constant. In other words, $|\mathbb{E}[S^2] - \mathbb{E}[\tilde{S}^2]| \le \exp\big(-\frac{T/k - c_1}{c_2}\big)$.
A similar argument applies to $\mathbb{E}[S]^2$, giving the second part of the lemma.

Putting it all together, we have an upper bound on the mean squared error of the sample variance of:
$$\frac{1}{k} + 3\exp\Big(-\frac{T/k - c_1}{c_2}\Big),$$
assuming $k > 1$ and $T/k > c_1$. Minimizing this expression with respect to $k$ gives $k = \frac{T}{c_1 + c_2\ln(3T/c_2)}$, which we can then round to the nearest integer larger than 1 to determine the number of checkpoints that minimizes our upper bound on the mean squared error. Of course, if $T < 2c_1$ then Theorem A.1 cannot be applied to give a meaningful bias bound for any number of checkpoints, so this choice of $k$ is not meaningful in that case.

A.2 PROOF OF LEMMA A.4

We will bound the divergences $D_\alpha(P_1, P_2)$, $D_\alpha(P_2, P_3)$, and $D_\alpha(P_3, P_4)$, where $P_1$ is the distribution of $\theta_\eta$ that is the solution to (1), $P_2$ is a Gaussian centered at the point $\theta_0 - \eta\nabla L(\theta_0; D)$, $P_3$ is a Gaussian centered at $\theta^*$, and $P_4$ is the stationary distribution of (1). Then, we can use the approximate triangle inequality for Rényi divergences to convert these pairwise bounds into the desired bound.

Lemma A.9. Fix some $\theta_0$. Let $P_1$ be the distribution of $\theta_\eta$ that is the solution to (1), and let $P_2$ be the distribution $N(\theta_0 - \eta\nabla L(\theta_0; D), 2\eta)$. Then:
$$D_\alpha(P_1, P_2) = O\big(M^2\ln(\alpha)\cdot\max\{p\eta^2, \|\theta_0 - \theta^*\|_2^2\eta^3\}\big).$$

Proof. Let $\theta_t$ be the solution trajectory of (1) starting from $\theta_0$, and let $\tilde{\theta}_t$ be the solution trajectory if we replace $\nabla L(\theta_t; D)$ with $\nabla L(\theta_0; D)$. Then $\theta_\eta$ is distributed according to $P_1$ and $\tilde{\theta}_\eta$ is distributed according to $P_2$. By a tail bound on Brownian motion (see e.g. Fact 32 in Ganesh & Talwar (2020)), we have that $\max_{t \in [0,\eta]} \|\int_0^t dW_s\|_2 \le \sqrt{\eta(p + 2\ln(2/\delta))}$ w.p. $1 - \delta$. Then, following the proof of Lemma 13 in Ganesh & Talwar (2020), w.p. $1 - \delta$,
$$\max_{t \in [0,\eta]} \|\theta_t - \theta_0\|_2 \le cM(\sqrt{p} + \ln(1/\delta))\sqrt{\eta} + M\|\theta_0 - \theta^*\|_2\,\eta,$$
for some sufficiently large constant $c$, and the same is true w.p. $1 - \delta$ over $\tilde{\theta}_t$. Now, following the proof of Theorem 15 in Ganesh & Talwar (2020), for some constant $c'$, we have the divergence bound $D_\alpha(P_1, P_2) \le \varepsilon$ as long as:
$$\frac{M^4\ln^2\alpha}{\varepsilon^2}\big(p\eta^2 + \|\theta_0 - \theta^*\|_2^2\eta^3\big) < c'.$$
In other words, for any fixed $\eta$, we get a divergence bound of
$$D_\alpha(P_1, P_2) = O\big(M^2\ln(\alpha)\cdot\max\{p\eta^2, \|\theta_0 - \theta^*\|_2^2\eta^3\}\big),$$
as desired.

Lemma A.10. Let $P_2$ be the distribution $N(\theta_0 - \eta\nabla L(\theta_0; D), 2\eta)$ and $P_3$ be the distribution $N(\theta^*, 2\eta)$. Then for $\eta \le 2/M$:
$$D_\alpha(P_2, P_3) \le \frac{\alpha\|\theta_0 - \theta^*\|_2^2}{4\eta}.$$

CIFAR10 training: We use Jaxline (Bradbury et al., 2018) to train on CIFAR10 using DP-SGD (Berrada et al., 2022). For CIFAR10, we use a WideResNet with depth 16 and width 4. We fix the clip norm to 1, the batch size to 4096, and the augmentation multiplicity to 16, as in De et al. (2022). We then set the learning rate and noise multiplier, respectively, to 2 and 10 for ε = 1, and to 4 and 3 for ε = 8. For periodic distribution shifting (PDS) CIFAR10, we set the learning rate and noise multiplier, respectively, to 2 and 12 for ε = 1, and to 4 and 4 for ε = 8. We stop training when the intended privacy budget is exhausted.

CIFAR100 training: For CIFAR100, we also use Jaxline (Bradbury et al., 2018), and use DP-SGD to fine-tune the last (classifier) layer of a WideResNet with depth 28 and width 10 that is pre-trained on the entire ImageNet dataset. We fix the clip norm to 1, the batch size to 16,384, and the augmentation multiplicity to 16, as in De et al. (2022). We then set the learning rate and noise multiplier, respectively, to 3.5 and 21.1 for ε = 1, and to 4 and 9.4 for ε = 8. For periodic distribution shifting (PDS) CIFAR100, we set the learning rate and noise multiplier, respectively, to 4 and 21.1 for ε = 1, and to 5 and 9.4 for ε = 8. We stop training when the privacy budget is exhausted. We would like to highlight that we obtain a significant improvement over the SOTA baseline of De et al. (2022): in particular, unlike De et al. (2022), we fine-tune the final EMA checkpoint, i.e., the one computed using EMA during pre-training over ImageNet.
This modification (without any additional checkpoint aggregations) gives a major accuracy boost of 5% (70.3% → 75.51%) for ε = 1, and of 3.2% (77.6% → 80.81%) for ε = 8, for the normal CIFAR100 baseline. We obtain similarly large improvements by fine-tuning the EMA of the pre-trained checkpoints (instead of just the final checkpoint) in the PDS-CIFAR100 case. We leave further investigation of this phenomenon to future work.

StackOverflow training: We follow the SOTA in (Kairouz et al., 2021; Denisov et al., 2022) and use TFF to train a one-layer LSTM (detailed architecture in Table 4; (Reddi et al., 2020)) on StackOverflow using the full version of DP-FTRLM from (Denisov et al., 2022). StackOverflow is naturally user-partitioned data, and we process 100 clients in each FL round. We train for 2048 FL rounds and set the clip norm, noise multiplier, server learning rate, client learning rate, and server momentum, respectively, to 1, 0.341, 0.5, 1.0, and 0.95 for ε = 18.9, and to 1, 0.682, 0.25, 1.0, and 0.95 for ε = 8.2. For PDS-StackOverflow, the same set of hyperparameters performs the best, based on our tuning of the aforementioned hyperparameters.

Hyperparameter tuning of checkpoint aggregations: Here we provide the methodology we follow to obtain the best hyperparameters for our checkpoint aggregation methods (Section 2.1). For EMA, De et al. (2022) simply use the EMA coefficient that works best in the non-private baseline. However, for each of the settings we consider, we tune the EMA coefficient over {0.85, 0.9, 0.95, 0.99, 0.999, 0.9999} and observe that the best EMA coefficients for private and non-private settings need not be the same (Table 9). For instance, for CIFAR10, for ε of 1 and 8, EMA coefficients of 0.95 and 0.99 perform the best and outperform 0.9999 by 0.6% and 0.3%, respectively. Hence, we advise future works to tune the EMA coefficient. Full results are given in Table 9.
For the aggregation methods based on the past k checkpoints, our tuning methodology is general and as follows: we use some validation data (disjoint from the training data) and evaluate the efficacy of aggregating k checkpoints, where we vary k over a large enough range, and we select the k that performs the best on average over 5 runs. Finally, we present results on test data using the best k value. Our tuning method is easy to replicate, so for conciseness we only provide the ranges of k that we use for tuning. For CIFAR10 and CIFAR100, we tune k ∈ {3, 5, 10, 20, ..., 200} for both parameter and output aggregation. For StackOverflow, we tune k ∈ {3, 5, 10, 20, ..., 200} for parameter aggregation (i.e., UPA) and for output aggregation (i.e., OPA and OMV). However, since output aggregation requires storing k checkpoints on device in the FL setting of StackOverflow, we reduce the tuning range to k ∈ {3, 5, 10, 20, ..., 100}.
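The k-tuning loop described above can be sketched as follows, using uniform parameter averaging as the aggregation. The helper names and the toy scoring function are illustrative, not the paper's actual training/evaluation pipeline.

```python
import numpy as np

def tune_k(checkpoints, k_grid, evaluate):
    """Select the tail length k whose uniformly averaged checkpoint
    scores best on held-out validation data.

    `evaluate` maps a parameter vector to a validation score
    (higher is better).
    """
    best_k, best_score = None, float("-inf")
    for k in k_grid:
        if k > len(checkpoints):
            continue
        aggregated = np.mean(checkpoints[-k:], axis=0)
        score = evaluate(aggregated)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy usage: pretend the validation score is highest near the point [4.0],
# so averaging only the last two checkpoints (both at [4.0]) wins.
ckpts = [np.array([0.0]), np.array([0.0]), np.array([4.0]), np.array([4.0])]
best_k = tune_k(ckpts, k_grid=[2, 3, 4], evaluate=lambda v: -abs(v[0] - 4.0))
```

In the paper's setup, `evaluate` would be validation accuracy averaged over 5 runs, and the final number would be reported on test data.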

Details of data-dependent checkpoint selection for StackOverflow:

Here we describe the precise method for selecting the best k of all checkpoints and aggregating them. We perform these experiments for StackOverflow due to the availability of additional data, and omit the CIFAR datasets. Specifically, we take 10,000 samples from the validation partition of the original StackOverflow data and use them as held-out data. Note that the held-out data is disjoint from both the training and validation data we use in the other parts of our experiments. We compute the performance of all checkpoints on the held-out data and order them from highest- to lowest-performing. Then we tune k as detailed in the previous section. The aggregation function remains the same for all our aggregation methods from Section 2.1 except EMA. Recall that traditional EMA is computed over all the checkpoints produced during training; here, we compute EMA over the best k checkpoints as $\theta^i_{\mathrm{ema}} = (1 - \beta)\cdot\theta^{i-1}_{\mathrm{ema}} + \beta\cdot\theta^i$, where $i \in \{1, 2, \ldots, k\}$. We keep the EMA coefficient β constant throughout. For EMA, we further tune β using validation data.
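The EMA-over-best-k update above can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def ema_over_checkpoints(checkpoints, beta):
    """Apply theta_ema^i = (1 - beta) * theta_ema^{i-1} + beta * theta^i
    over an ordered list of checkpoints (here: the best-k checkpoints,
    ordered by held-out performance), with a constant coefficient beta."""
    ema = np.asarray(checkpoints[0], dtype=float)
    for theta in checkpoints[1:]:
        ema = (1.0 - beta) * ema + beta * np.asarray(theta, dtype=float)
    return ema

# With beta = 0.5 over the checkpoints [0], [1], [1]:
# ema goes 0 -> 0.5 -> 0.75.
ema = ema_over_checkpoints(
    [np.array([0.0]), np.array([1.0]), np.array([1.0])], beta=0.5)
```

Note that a larger β weights later checkpoints in the ordered list more heavily, which is why β itself is tuned on validation data.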



Note that EMA (with default EMA coefficients from non-private settings) has also been used in De et al. (2022) to improve DP-SGD's performance. We observe that even a coarse-grained tuning of the EMA coefficient provides significant accuracy gains. For instance, for CIFAR10, our tuned EMA coefficients outperform those of De et al. (2022) by ∼0.6% and 0.35% for ε of 1 and 8, respectively.

For CIFAR100, Bu et al. (2022) achieve a SOTA accuracy of 83% at ε = 1 using 303M-parameter vision transformers pre-trained on ImageNet21k. However, due to computational constraints, we use the 36M-parameter WideResNet (depth 28, width 10) described above.

Note that only the data-dependent aggregations use public data, so this comparison is just for illustration.

Using Bun & Steinke (2016), it is easy to convert the privacy guarantee to an (ε, δ)-DP guarantee.

Jain et al. (2021) show that for carefully chosen step sizes the logarithmic factor can be removed, and Feldman et al. (2020) extend this analysis to a DP-SGD variant with varying batch sizes. Unlike those methods, averaging can be done as post-processing of DP-SGD's outputs, rather than as a modification of the algorithm.

To our knowledge, all prior mixing results for DP-LD require that the α-Rényi divergence between θ_0 and θ_∞ is finite for some α > 1, or some stronger assumption, which cannot be satisfied by a point distribution for θ_0. Thus, our mixing result for DP-LD starting from a point distribution may be of independent interest.



Accuracy improvements due to checkpoint aggregation methods for DP-SGD-trained CIFAR10 and DP-FTRL-trained StackOverflow.

Figure 2: Probability of sampling data from distributions $D_1$ and $D_2$.

Accuracy gains due to checkpoint aggregations for DP-SGD-trained periodic distribution shifting (PDS) CIFAR10 (test data) and DP-FTRL-trained PDS StackOverflow (validation data).

To formalize the problem, we define the following notation: consider a dataset $D = \{d_1, \ldots, d_n\}$ and a loss function $L(\theta; D) = \frac{1}{n}\sum_{i=1}^n \ell(\theta; d_i)$, where each individual loss $\ell(\cdot; d_i)$ is convex and Lipschitz in the first parameter, and $\theta \in \mathcal{C}$, with $\mathcal{C} \subseteq \mathbb{R}^p$ being a convex constraint set. We analyze the following variant of DP-SGD, which is guaranteed to be $\rho$-zCDP.
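A minimal sketch of a single clipped-and-noised SGD step in the spirit of the DP-SGD variant referenced above. This is illustrative only: the function and parameter names are hypothetical, the projection onto the constraint set C is omitted, and the calibration of the noise scale to a ρ-zCDP budget is not shown.

```python
import numpy as np

def dp_sgd_step(theta, per_example_grads, clip_norm, noise_mult, lr, rng):
    """One DP-SGD update: clip each per-example gradient to l2 norm
    `clip_norm`, sum, add Gaussian noise with standard deviation
    noise_mult * clip_norm, average, and take a gradient step."""
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_mult * clip_norm, size=np.shape(theta))
    return theta - lr * noisy_sum / len(per_example_grads)

# Toy usage with zero noise so the clipping is visible: the gradient
# [3, 4] has norm 5 and is clipped down to [0.6, 0.8].
rng = np.random.default_rng(0)
theta = dp_sgd_step(np.zeros(2), [np.array([3.0, 4.0])],
                    clip_norm=1.0, noise_mult=0.0, lr=1.0, rng=rng)
```

Every iterate produced by such a loop is an intermediate checkpoint, which is exactly what the aggregation methods in this work consume.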

Figure 4: Uncertainty due to DP noise measured using confidence interval widths. We compute the intervals using N bootstrap (independent) runs, and using the last N checkpoints of a single run.


Figure 5: Accuracy gains due to checkpoint aggregations for StackOverflow (left) and periodic distribution shifting StackOverflow (right) trained using DP-FTRLM without any DP, i.e., ε = ∞.

For experiments with DP, we fix the privacy parameter δ to $10^{-5}$ on CIFAR10/CIFAR100, and to $10^{-6}$ on StackOverflow, ensuring that $\delta < n^{-1}$, where $n$ is the number of examples in CIFAR10/CIFAR100 and the number of users in StackOverflow.

Table 2: Test accuracy gains for original CIFAR10 and StackOverflow (StackOverflow rows shown; baseline in the first column, checkpoint aggregation methods in the remaining columns).
ε = ∞: 25.24 ± 0.16 | 25.72 ± 0.02 | 25.81 ± 0.02 | 25.79 ± 0.01 | 25.78 ± 0.01
ε = 18.9: 23.41 ± 0.08 | 23.56 ± 0.02 | 23.84 ± 0.01 | 23.6 ± 0.02 | 23.57 ± 0.02
ε = 8.2: 22.28 ± 0.08 | 22.43 ± 0.04 | 22.7 ± 0.03 | 22.57 ± 0.04 | 22.52 ± 0.04

Test accuracy gains for periodic distribution shifting (PDS) CIFAR10 and StackOverflow.



StackOverflow LSTM architecture details.

Table 5: Accuracy gains on test data due to data-dependent checkpoint aggregation algorithms for StackOverflow trained using DP-FTRLM to achieve user-level DP.
ε = ∞: 25.24 ± 0.16 | 25.97 ± 0.01 | 25.97 ± 0.01 | 25.94 ± 0.03 | 25.97 ± 0.02
ε = 18.9: 23.41 ± 0.08 | 23.88 ± 0.04 | 23.92 ± 0.03 | 23.81 ± 0.06 | 23.76 ± 0.07
ε = 8.2: 22.28 ± 0.08 | 22.66 ± 0.05 | 22.80 ± 0.05 | 22.7 ± 0.01 | 22.69 ± 0.01

Table 6: Accuracy gains on test data due to data-dependent checkpoint aggregation algorithms for periodic distribution shifting StackOverflow trained using DP-FTRLM to achieve user-level DP.
ε = ∞: 23.89 ± 0.04 | 24.18 ± 0.02 | 24.15 ± 0.02 | 24.15 ± 0.01 | 24.19 ± 0.01
ε = 18.9: 21.6 ± 0.13 | 22.18 ± 0.07 | 22.17 ± 0.06 | 22.21 ± 0.08 | 22.19 ± 0.09
ε = 8.2: 20.24 ± 0.29 | 20.84 ± 0.13 | 20.85 ± 0.12 | 20.90 ± 0.06 | 20.82 ± 0.08

Table 7: Accuracy gains on test data for CIFAR100 fine-tuned using DP-SGD and sample-level DP.
ε = 8: 80.81 ± 0.11 | 80.88 ± 0.10 | 80.83 ± 0.09 | 80.92 ± 0.10 | 80.82 ± 0.10
ε = 1: 75.51 ± 0.15 | 75.42 ± 0.13 | 75.62 ± 0.12 | 75.51 ± 0.16 | 75.57 ± 0.18

Table 8: Accuracy gains on test data for periodic distribution shifting CIFAR100 fine-tuned using DP-SGD and sample-level DP.
ε = 8: 77.16 ± 0.11 | 80.53 ± 0.07 | 80.53 ± 0.08 | 80.49 ± 0.06 | 80.41 ± 0.09
ε = 1: 70.84 ± 0.16 | 74.83 ± 0.15 | 75.81 ± 0.16 | 75.02 ± 0.17 | 74.97 ± 0.18

We observe that tuning the EMA coefficient can provide significant gains in accuracy over the default value of 0.9999 that De et al. (2022) use; with the warm-up schedule and number of training steps that De et al. (2022) use, 0.9999 and 0.999 provide the same results. This implies that tuning the EMA coefficient for each privacy budget is required for the best performance.


Proof. By contractivity of gradient descent, we have $\|(\theta_0 - \eta\nabla L(\theta_0; D)) - \theta^*\|_2 \le \|\theta_0 - \theta^*\|_2$. Now the lemma follows from the Rényi divergence bounds between Gaussians (see e.g. Example 3 of van Erven & Harremos (2014)).

Lemma A.11. Let $P_3$ be the distribution $N(\theta^*, 2\eta)$ and let $P_4$ be the stationary distribution of (1). Then for $\eta \le 1/2M$ we have:

Proof. By $M$-smoothness of the negative log density of $P_4$, we also have a pointwise bound on $P_4$ around $\theta^*$. In addition, since $P_4$ is 1-strongly log-concave, $P_4(\theta^*) \ge \big(\frac{1}{2\pi}\big)^{p/2}$ (as the 1-strongly log-concave density with mode $\theta^*$ that minimizes $P_4(\theta^*)$ is the multivariate normal with mean $\theta^*$ and identity covariance). Finally, for $\alpha \ge 1$ and $\eta \le 1/2M$, we have $\alpha/4\eta > (\alpha - 1)M/2$. Putting it all together gives the claimed bound; in the step marked $(*)$, we use the fact that $\alpha/4\eta > (\alpha - 1)M/2$ to ensure the integral converges.

Lemma A.12. Fix some point $\theta_0$. Let $P$ be the distribution of $\theta_\eta$ that is the solution to (1) from $\theta_0$ for time $\eta \le 1/2M$. Let $Q$ be the stationary distribution of (1). Then:

Proof. By monotonicity of Rényi divergences (see e.g. Proposition 9 of Mironov (2017)), we can assume $\alpha \ge 2$. Then, by applying the approximate triangle inequality for Rényi divergences twice (see e.g. Proposition 11 of Mironov (2017)), we get:
$$D_\alpha(P_1, P_4) \le \frac{5}{3} D_{3\alpha}(P_1, P_2) + \frac{4}{3} D_{3\alpha-1}(P_2, P_3) + D_{3\alpha-2}(P_3, P_4).$$
The lemma now follows from Lemmas A.9, A.10, and A.11.

Lemma A.4 now follows by plugging $\alpha = 2$, $\eta = 1/2M$ into Lemma A.12 and then using Theorem 2 of Vempala & Wibisono (2019).

