DIFFERENTIALLY PRIVATE ADAPTIVE OPTIMIZATION WITH DELAYED PRECONDITIONERS

Abstract

Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP²), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP² across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.

1. INTRODUCTION

Adaptive optimizers such as AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010) and RMSProp (Hinton et al., 2012) are commonly used to improve convergence speed in machine learning training. However, in privacy-sensitive applications, the benefits of adaptivity may degrade as a result of noise added to the preconditioners to guarantee differential privacy (Li et al., 2022). Prior works typically address this issue by using non-sensitive auxiliary data to approximate the underlying structures of private gradients (Asi et al., 2021; Kairouz et al., 2021a; Li et al., 2022). While this can boost performance, assuming access to informative public data may be unrealistic in many privacy-sensitive applications. In this work, we instead ask: Can we improve privacy/utility trade-offs in private adaptive optimization without accessing auxiliary data?

A key insight in addressing this question is that for many machine learning problems, the gradient geometry may not change drastically during successive steps of optimization (e.g., see Figure 1, which plots successive distributions of preconditioner values). This presents an opportunity to estimate the preconditioners used by adaptive optimizers with smaller noise, by averaging across previous iterates. To this end, we propose DP², a differentially private adaptive method that uses historical gradients to construct delayed preconditioners with reduced noise. Despite the simplicity of this approach, we find that it can significantly improve performance in practice, improving convergence speed by as much as 4× relative to non-adaptive baselines, all without the need to access auxiliary data. To better understand these performance gains, we theoretically and empirically analyze the method to study the effect of using delayed preconditioners, including trade-offs that emerge between noise reduction and staleness.

Contributions.
We propose DP² as a method for differentially private adaptive optimization with delayed preconditioners. Unlike prior work, DP² does not rely on auxiliary data to improve privacy/utility trade-offs in private training. We provide convergence guarantees for DP² in both convex and non-convex settings, and analyze the trade-offs between delay and privacy noise. We conduct extensive experiments to showcase the effectiveness of DP², which can significantly improve model utility for a given privacy budget across text and recommendation benchmarks.

2. BACKGROUND AND RELATED WORK

In this section we discuss closely related works and set up some preliminaries. We start by discussing prior work in differentially private optimization, considering the classic framework of (ε, δ)-differential privacy (DP) (Dwork et al., 2006), defined as follows.

Definition 1 (Differential privacy (Dwork et al., 2006)). A randomized algorithm M is (ε, δ)-differentially private if for all neighboring datasets D, D′ differing by one element, and every possible subset of outputs O, Pr(M(D) ∈ O) ≤ e^ε Pr(M(D′) ∈ O) + δ.

Differentially Private SGD. Informally, DP in machine learning offers protection by masking the influence of individual examples (example-level DP, e.g., Abadi et al., 2016; Bassily et al., 2014; Song et al., 2013) or all of the examples from one user (user-level DP, e.g., Kairouz et al., 2021b; McMahan et al., 2018) on the trained model. In this work, we consider example-level DP, using the popular subsampled Gaussian mechanism (Dwork et al., 2014; Mironov et al., 2019) to perturb gradients to ensure DP. Unless much larger batch sizes and possibly larger datasets are used, DP mechanisms often lead to a significant utility drop. Extensive research has thus been devoted to investigating improved privacy/utility/computation trade-offs for DP-SGD, including various training techniques (e.g., data augmentation and large-batch training) (De et al., 2022), leveraging public data (Amid et al., 2022; Zhou et al., 2021), and releasing gradient statistics via tree aggregation to reduce the amount of noise (Chan et al., 2011; Denisov et al., 2022; Kairouz et al., 2021b). These prior works are orthogonal to and could be applied in conjunction with our proposed method, which focuses specifically on privacy in the context of adaptive optimization.

Differentially Private Adaptive Optimization.
To reduce privacy cost in iterative DP algorithms, it is natural to consider applying adaptive optimizers (e.g., AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010), RMSProp (Hinton et al., 2012), AMSGrad (Reddi et al., 2018), and Yogi (Zaheer et al., 2018)) to speed up convergence. A straightforward approach is to first privatize mini-batch gradients and then plug the noisy gradients into any adaptive updating rule (Zhou et al., 2020). However, estimating gradient moments in this way may yield preconditioners with too much noise, resulting in adaptive methods that may not have meaningful improvements over DP-SGD (Li et al., 2022). As we discuss in Section 1, more recent works suggest the use of non-sensitive public information to estimate the preconditioners (or other gradient structures) (Asi et al., 2021; Kairouz et al., 2021a; Li et al., 2022), which may not always be available in practice. In Section 5.2, we empirically benchmark two baselines along this line of work and demonstrate that DP² can perform comparably to these state-of-the-art methods, even though it does not require access to auxiliary data. Finally, we note that previous works have explored the high-level direction of delayed preconditioners, but mainly as a compromise for computational considerations in non-private training (Gupta et al., 2018). In this work, we instead show that staleness can be leveraged to improve privacy/utility trade-offs in private adaptive optimization, and propose and analyze a novel method for delaying preconditioner computation in the context of private training.

Notation. In this work, we consider using adaptive optimization methods to solve the classic empirical risk minimization objective, i.e., min_w F(w) = (1/n) Σ_{i=1}^{n} f(x_i; w), where w ∈ R^d and {f(x_i; w)}_{i∈[n]} are individual loss functions on training samples i ∈ [n]. For vectors u, v ∈ R^d, we use u + v for coordinate-wise addition, and u/v for coordinate-wise division.
For any vector v, v_j denotes the j-th coordinate of v; for example, g^{i,t}_j refers to the j-th coordinate of the gradient g^{i,t}. Finally, |v| ∈ R^d denotes taking coordinate-wise absolute values, and ∥·∥_M denotes the matrix norm defined as ∥·∥_M := ⟨·, M·⟩ for a symmetric and positive definite matrix M ∈ R^{d×d}, or a diagonal matrix with non-negative diagonal entries populated by a vector M ∈ R^d.
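As a concrete illustration of this notation, the following sketch (with arbitrary example values, not taken from the paper) computes a coordinate-wise division and the weighted norm ⟨x, Mx⟩ for a diagonal M populated by a vector:

```python
import numpy as np

# Coordinate-wise division u / v, as used throughout the paper.
u = np.array([2.0, 6.0, 8.0])
v = np.array([1.0, 2.0, 4.0])
print(u / v)  # -> [2. 3. 2.]

# The matrix norm <x, M x> with a diagonal M given by a vector of
# non-negative entries (illustrative values).
x = np.array([1.0, 2.0, 3.0])
M_vec = np.array([4.0, 1.0, 0.5])
norm_sq = np.dot(x, M_vec * x)  # 4*1 + 1*4 + 0.5*9
print(norm_sq)  # -> 12.5
```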

3. DP²: DELAYED PRECONDITIONERS FOR DIFFERENTIALLY PRIVATE ADAPTIVE OPTIMIZATION

We now introduce our DP² framework. While we discuss DP² in the context of a particular adaptive method (RMSProp), we note that the approach is method-agnostic in that it can generally be applied to any private adaptive optimization method where preconditioners are calculated at each iteration. As an initial step towards understanding the algorithm, we first investigate the effects of delayed preconditioners in non-private training in Section 3.1. We then explain how to apply this idea to construct less noisy preconditioners from prior gradients in private training in Section 3.2.

3.1. DELAYED PRECONDITIONERS IN NON-PRIVATE SETTINGS

Adaptive methods use preconditioners to adapt to gradient geometry, effectively resulting in coordinate-wise learning rates. This can be advantageous for many applications, especially those with sparse gradients or non-uniform stochastic noise (e.g., Hinton et al., 2012; McMahan & Streeter, 2010; Reddi et al., 2021; Zhang et al., 2020). One of the key design choices of DP² is to update preconditioners less frequently and use the average of past gradients to reduce noise. Our observation is that a wide range of learning problems are tolerant to the staleness of preconditioners; in this subsection, we validate this empirically on the benchmark datasets considered throughout this paper.

There are potentially many ways to instantiate the idea of delayed preconditioner computation in adaptive optimization. Here we consider a specific algorithm, which is the exact non-private version of our proposed DP² framework (Algorithm 1) introduced in later sections. The basic idea is to alternate between s steps of SGD and s steps of an adaptive method (for simplicity we assume RMSProp as the adaptive algorithm), where s is a constant larger than 1. Each time we switch from SGD to RMSProp, we average the s past SGD gradients and use the average to update the preconditioner; the preconditioner is then used in subsequent RMSProp updates (thus being stale). As motivation for DP², we empirically show that RMSProp with delayed preconditioners achieves almost the same optimization performance as standard RMSProp (Figure 2). As discussed in Section 2, the idea of delayed preconditioning has been briefly discussed in prior work (Gupta et al., 2018) for the purpose of speeding up the computation of adaptive optimization in non-private training.
Unlike this prior work, we focus on the goal of reducing noise in private training, propose an alternative method for using stale preconditioners that is more amenable to differential privacy, and analyze our method in both convex and non-convex settings.
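The alternating scheme described above can be sketched as follows in non-private form; the toy objective, step sizes, and value of s are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def delayed_rmsprop(grad_fn, w, s=5, beta=0.9, lr=0.005, eps=1e-8, rounds=40):
    """Alternate s SGD steps with s RMSProp steps; the preconditioner v is
    refreshed only at each switch, from the average of the s SGD gradients."""
    v = np.zeros_like(w)
    for _ in range(rounds):
        acc = np.zeros_like(w)
        for _ in range(s):                           # s plain SGD steps
            g = grad_fn(w)
            acc += g
            w = w - lr * g
        v = beta * v + (1 - beta) * (acc / s) ** 2   # delayed preconditioner update
        for _ in range(s):                           # s RMSProp steps with the stale v
            g = grad_fn(w)
            w = w - lr * g / (np.sqrt(v) + eps)
    return w

# Ill-conditioned quadratic f(w) = 0.5 * w^T diag(a) w, optimum at 0.
a = np.array([100.0, 1.0])
w_final = delayed_rmsprop(lambda w: a * w, np.array([1.0, 1.0]))
print(np.linalg.norm(w_final))
```

The point of the sketch is only that the stale preconditioner still adapts the per-coordinate step sizes sensibly, mirroring the Figure 2 observation.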

3.2. CONSTRUCTING DELAYED PRECONDITIONERS WITH REDUCED NOISE

Without access to public data or other side information, prior works typically update preconditioners based on noisy gradients at each iteration (Zhou et al., 2020). For instance, a natural way to privatize RMSProp is to update the preconditioner v ∈ R^d as v ← βv + (1−β)(g̃)², where β ∈ (0, 1) is a moving average constant and g̃ ∈ R^d is the noisy gradient output by some standard privacy mechanism (e.g., the Gaussian mechanism). However, a drawback to this is that the noise gets accumulated at each iteration, making adaptive methods significantly less effective (Li et al., 2022). Inspired by the observation that problems can be tolerant to the staleness of preconditioners (Figure 2), we propose to update the preconditioners less frequently to reduce noise. For instance, we update v every s steps using some aggregate function of the s recent private gradients from DP-SGD. During iterations where v is not updated, we simply apply the most recent (stale) v to precondition the gradients. In order to mitigate the noise, we average over these s gradients to form a pseudo-gradient ḡ, which can be plugged into arbitrary adaptive optimization algorithms. Note that the privacy noise variance will be reduced s times if we average s Gaussian random variables (i.e., the DP noise).

Algorithm 1: DP²-RMSProp
1:  for t = 0, ..., T − 1 do
2:      if t mod (s₁ + s₂) = 0 then
3:          Reset accumulator G_t ← 0
4:      if t mod (s₁ + s₂) = s₁ then
5:          Update moment estimates as v ← βv + (1 − β)(G_t/s₁)²
6:          Reset accumulator G_t ← 0
7:      Uniformly randomly sample a mini-batch B of size b from the private training data
8:      Get individual gradients for each sample i ∈ B: g^{i,t} ← ∇f(x_i; w_t)
9:      Privatize the (preconditioned) gradients using the Gaussian mechanism:
            g̃_t ← (1/b) ( Σ_{i∈B} clip(g^{i,t}/D_t, C) + N(0, σ²C²) ),
            where D_t ← 1 if t mod (s₁ + s₂) < s₁, and D_t ← √v + ϵ otherwise
10:     Accumulate the private gradients: G_{t+1} ← G_t + g̃_t
11:     Update model parameters: w_{t+1} ← w_t − α_t g̃_t
12: return w_T

DP² is summarized in Algorithm 1.
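A minimal Python sketch of Algorithm 1 follows, applied to a toy least-squares problem. All hyperparameters, the loss, and the data are illustrative assumptions; the authors' actual implementation is in JAX and vectorizes the per-example operations:

```python
import numpy as np

rng = np.random.default_rng(0)

def clip(g, C):
    """L2-clip a single gradient to norm at most C."""
    n = np.linalg.norm(g)
    return g * min(1.0, C / n) if n > 0 else g

def dp2_rmsprop(X, y, T=200, b=8, s1=10, s2=10, C=1.0, sigma=1.0,
                lr=0.05, beta=0.9, eps=1e-4):
    """Sketch of DP2-RMSProp (Algorithm 1) on 0.5*(x.w - y)^2 losses."""
    d = X.shape[1]
    w, v, G = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(T):
        if t % (s1 + s2) == 0:
            G = np.zeros(d)                              # reset accumulator
        if t % (s1 + s2) == s1:
            v = beta * v + (1 - beta) * (G / s1) ** 2    # delayed preconditioner
            G = np.zeros(d)
        # D_t = 1 in SGD iterations, sqrt(v) + eps in RMSProp iterations
        D = np.ones(d) if t % (s1 + s2) < s1 else np.sqrt(v) + eps
        idx = rng.choice(len(X), size=b, replace=False)
        # per-example gradients, preconditioned *before* clipping
        per_ex = [clip((X[i] @ w - y[i]) * X[i] / D, C) for i in idx]
        noise = rng.normal(0.0, sigma * C, size=d)
        g_tilde = (np.sum(per_ex, axis=0) + noise) / b   # Gaussian mechanism
        G = G + g_tilde                                  # accumulate private grads
        w = w - lr * g_tilde
    return w

X = rng.normal(size=(256, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true
w_hat = dp2_rmsprop(X, y)
print(np.linalg.norm(w_hat - w_true))
```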
For simplicity of presentation, we assume RMSProp as the adaptive method (denoted as DP²-RMSProp) throughout this section. However, our framework can be generally applied to other common adaptive methods (see Appendices C.3 and D). The high-level idea is to alternate between s₁ steps of private SGD and s₂ private RMSProp steps, and use averages of s₁ SGD gradients (i.e., the average of the accumulator G ∈ R^d) to update the preconditioner v. Next, we discuss some key components of our algorithm.

Order of privatization and preconditioning. Given a private preconditioner v, there are generally two choices to perform adaptive optimization over the raw gradients {g^{i,t}}_{i∈B} generated from mini-batch B at the t-th iteration:

1. First privatize gradients with clipping threshold C₁, then precondition the noisy gradients with √v + ϵ, where ϵ is a small constant:
   g̃_t ← [ (1/b) ( Σ_{i∈B} clip(g^{i,t}, C₁) + N(0, σ²C₁²) ) ] / (√v + ϵ).
2. First precondition gradients with √v + ϵ, then privatize the output with clipping threshold C₂:
   g̃_t ← (1/b) ( Σ_{i∈B} clip(g^{i,t}/(√v + ϵ), C₂) + N(0, σ²C₂²) ).

The difference is that the privacy noise in the first choice may be scaled in an undesired direction: the term N(0, σ²C²)/(√v + ϵ) with a less noisy estimate of √v (perfect estimation removing all privacy noise in the extreme case) would amplify the noise N(0, σ²C²) on informative coordinates (i.e., coordinates with smaller preconditioner values), which is consistent with the argument made in Li et al. (2022). We empirically compare the two options and show that the latter gives better performance (Section 5.3).

It is critical to average noisy gradients to construct a cleaner estimate of the preconditioner (Lines 5 and 10 in Algorithm 1) and apply it for adaptive optimization (Line 9). As these two steps access raw gradients twice, we would need to privatize them separately. Unfortunately, the privacy budget would then accumulate with each query to the raw training data.
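The noise-amplification argument for the first ordering can be checked numerically; the constants below are illustrative:

```python
import numpy as np

# If noise is added first and preconditioned afterwards, the Gaussian noise
# gets divided by sqrt(v) + eps, which *amplifies* it on exactly the
# informative coordinates where the preconditioner value v is small.
rng = np.random.default_rng(0)
sigma, C, eps = 1.0, 1.0, 1e-3
v = np.array([1e-4, 1.0])   # coord 0: informative (small v); coord 1: not

noise = rng.normal(0.0, sigma * C, size=(100_000, 2))
scaled = noise / (np.sqrt(v) + eps)   # choice 1: the noise is preconditioned too

per_coord_std = scaled.std(axis=0)
print(per_coord_std)   # coord 0's noise std is roughly 90x coord 1's
```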
Hence, we use the private SGD gradients for both the model update and the preconditioner estimation. This results in a hybrid method that alternates between private SGD and private adaptive optimization steps. Note that to obtain an unbiased estimate of the true delayed preconditioner, we can correct the bias in (G_t/s₁)² (Line 5) by subtracting the privacy noise variance term σ²C²/(s₁b²) from (G_t/s₁)²; this term is usually very small and negligible in practice. While in principle the non-adaptive and adaptive updates can take different numbers of consecutive iterations, in our empirical evaluation we simply set s₁ = s₂ and find that this works reasonably well across all datasets (Section 5).

Privacy guarantees. From Algorithm 1, we see that at each iteration we access raw data and pass them through the privacy barrier once (Line 9) to generate private gradients g̃_t with the same noise multiplier σ and batch size b, and the preconditioner only accumulates already differentially private gradients. Since the final model is a composition of these private releases (noisy gradients), Algorithm 1 (or DP² in general) achieves the same privacy guarantees as standard DP-SGD training under the same training settings. For completeness, we formally state the privacy guarantee below.

Theorem 1 (Privacy guarantee of Algorithm 1 (Abadi et al., 2016)). There exist constants c₁ and c₂ such that for any ε < c₁b²T/n², Algorithm 1 is (ε, δ)-differentially private for any δ > 0 if σ ≥ c₂ b√(T log(1/δ)) / (nε).

In practice, we use the Rényi differential privacy (RDP) accountant for the subsampled Gaussian mechanism (Mironov et al., 2019) to compute the actual ε's reported in the experiments (Section 5).
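The bias-correction remark above can be verified with a quick Monte Carlo sketch (toy constants): each private gradient coordinate carries Gaussian noise of variance σ²C²/b², so E[(G_t/s)²] exceeds the clean squared gradient by σ²C²/(sb²), which can be subtracted out.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, C, b, s = 1.0, 1.0, 16, 10
g_clean = 0.3            # a fixed clean per-step gradient coordinate (illustrative)
trials = 200_000

noise = rng.normal(0.0, sigma * C / b, size=(trials, s))
G_over_s = g_clean + noise.mean(axis=1)        # one (G_t / s) sample per trial
biased = (G_over_s ** 2).mean()
corrected = biased - sigma**2 * C**2 / (s * b**2)
print(biased, corrected, g_clean**2)           # corrected is close to 0.09
```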

4. CONVERGENCE ANALYSIS

In this section, we analyze Algorithm 1 for both convex and non-convex problems. We aim to study the convergence properties of DP² and investigate the trade-offs between delay and privacy noise. In doing so, key challenges arise from alternating between adaptive and non-adaptive updates and from the staleness of the preconditioners.

4.1. CONVEX CASES

For convex functions, we define the optimal model w* as w* ∈ arg min_w F(w). We first state some assumptions (apart from convexity) that are used in the analysis.

Assumption 1. There exists a constant R such that ∥w_t − w*∥₂ ≤ R for any iteration t.

Assumption 2 (Bounded stochastic gradient norm). There exists a constant C such that ∥g^{i,t}∥₂ ≤ C for any i ∈ [n] and iteration t.

Assumption 1 (bounded domain across all iterations) is commonly used in the adaptive optimization literature (Asi et al., 2021; Levy et al., 2018; Li et al., 2022; Reddi et al., 2018). Assumption 2 bounds the L₂ norm of the stochastic gradient, thus helping bound the L₂ sensitivity of calculating and averaging individual gradients from a mini-batch. Assuming a bounded stochastic gradient norm is standard in prior works on convex and non-convex private optimization (e.g., Kairouz et al., 2021a; Li et al., 2022; Zhou et al., 2020). Under this assumption, supposing clipping does not occur, we have g̃_t ← g_t + N(0, σ²C²/b²), where g_t := (1/b) Σ_{i∈B} g^{i,t}. Without loss of generality, let s₁ = s₂ = s in Algorithm 1. Our main convergence result is as follows (assuming t starts from 1).

Theorem 2 (Convergence of Algorithm 1 for convex problems). Let Assumptions 1 and 2 hold, and assume F is convex. Let the learning rate be α_t ← α^{⌊t/(2s)⌋ + ⌊(t+s)/(2s)⌋ + 1} / √t. After running Algorithm 1 for T iterations with s = υT for a small constant υ ∈ (0, 1], we obtain

min_{t∈[T]} E[F(w_t)] − F(w*) ≤ ((R² + κ) / α^{⌊1/(2υ)⌋ + ⌊(1+υ)/(2υ)⌋}) · (1/√T) Σ_{t∈T_υ} E[∥D_t∥₁] + (1/T) Σ_{t=1}^{T} (α^{⌊t/(2υT)⌋ + ⌊(t+υT)/(2υT)⌋} / √t) E[∥N_t∥²_{D_t}],

where T_υ denotes the iteration indices where we switch from private RMSProp steps to private SGD steps, plus the last iteration, with cardinality |T_υ| = ⌈1/(2υ)⌉; N_t ∼ N(0, σ²C²/b²); κ ≥ max{α²C², C·h(s)/(ϵ√(1−β))}; α = min{ϵ, 1/(√M + ϵ), 1}; and M := C² + σ²C²/(sb²).
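To see how the stage-wise schedule in Theorem 2 behaves, the following sketch (with illustrative α and s) evaluates the exponent ⌊t/(2s)⌋ + ⌊(t+s)/(2s)⌋ + 1: it increases by one every s steps, i.e., at every SGD/RMSProp switch, so α_t decays geometrically across phases on top of the 1/√t decay.

```python
import math

def lr(t, s, alpha):
    """Stage-wise learning rate alpha_t from Theorem 2 (illustrative use)."""
    exponent = t // (2 * s) + (t + s) // (2 * s) + 1
    return alpha ** exponent / math.sqrt(t)

s, alpha = 5, 0.5
exps = [t // (2 * s) + (t + s) // (2 * s) + 1 for t in range(1, 21)]
print(exps)           # the exponent steps up by 1 at t = 5, 10, 15, 20
print(lr(1, s, alpha))  # -> 0.5
```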
We defer all proofs to Appendix A and state simplified convergence results in Corollary 1. As we can see, the above upper bound relies on a critical metric h(s), which is related to temporal gradient similarity and the amount of staleness s. Formally, h(s) satisfies

h(s) ≥ max_{t∈[T]} E[∥g_t∥₁] / ( E[∥(1/s) G_{⌊t/s⌋s}∥₁] + dϵ ) = max_{t∈[T]} E[∥g_t∥₁] / ( E[∥(1/s) Σ_{i=⌊t/s⌋s−s}^{⌊t/s⌋s−1} g̃_i∥₁] + dϵ ),

where the expectation is taken with respect to all randomness in the algorithm, and G_{⌊t/s⌋s} ∈ R^d refers to the latest accumulator used to update v (Line 5 in Algorithm 1). A smaller h(s) indicates better convergence. The denominator of h(s) can be decomposed into the average of past raw gradients and the average of random Gaussian noise. Intuitively, h(s) tends to be smaller when the gradients across the s iterations in G_{⌊t/s⌋s} are more similar to the current gradient g_t in terms of gradient norms. In Appendix A.2, we show that an upper bound of h(s) can be expressed as c₁ + c₂s, where c₁, c₂ are two constants. We also visualize the value of h(s) on the IMDB dataset in Figure 3, and show that (1) the values of h(s) are consistently small across all delays, and (2) h(s) increases as s gets larger, which is consistent with the c₁ + c₂s expression.

Trade-offs between delay and noise. Here we discuss how s affects convergence based on our analysis. Intuitively, a larger s (larger delay) results in staler preconditioners, but introduces less noise due to private gradient averaging. In our convergence bound, several terms depend on s (or υ). Although this makes it difficult to derive a closed-form characterization of an optimal s, we can analyze the effects of s in simplified settings. In particular, examining the first term on the RHS of the convergence bound, let α = 1/(√M + ϵ) = 1/(√(c₃ + c₄/υ) + ϵ) (where c₃, c₄ are two constants), and assume ⌊1/(2υ)⌋ + ⌊(1+υ)/(2υ)⌋ = 1/(2υ) + (1+υ)/(2υ) = (2+υ)/(2υ).
Combined with h(s), the dependence on υ in (R² + κ)/α^{⌊1/(2υ)⌋ + ⌊(1+υ)/(2υ)⌋} can be expressed as (c₁ + c₂υ)(√(c₃ + c₄/υ) + ϵ)^{(2+υ)/(2υ)}. This suggests that there exists an optimal υ that achieves the minimal value. In Section 5.1, we empirically study the effects of s across real-world datasets, and demonstrate that there exist specific ranges of s that provide favorable trade-offs between delay and noise (Figure 6).

Corollary 1. Let Assumptions 1 and 2 hold, and assume F is convex. Ignoring constants, the convergence rate under the learning rate α_t = O(1/√t) simplifies to

min_{t∈[T]} E[F(w_t)] − F(w*) ≤ O( (1/√T) max_{t∈T_s} E[∥D_t∥₁] ) + O( (1/T) Σ_{t=1}^{T} (1/√t) E[∥N_t∥²_{D_t}] ),

where T_s denotes the iteration indices where we switch from private RMSProp steps to private SGD steps, plus the last iteration (thus having a constant cardinality), and N_t ∼ N(0, σ²C²/b²).

At a high level, the first term is due to adaptive optimization using RMSProp, and the second term corresponds to the added privacy noise. Our O(1/√T) rate is the same as previous results for SGD (or DP-SGD) in convex cases with decaying learning rates (Bassily et al., 2014; Nemirovski et al., 2009). Compared with DP-SGD, the added privacy noise is reduced from (1/T) Σ_{t=1}^{T} (1/√t) E[∥N_t∥²] to (1/T) Σ_{t=1}^{T} (1/√t) E[∥N_t∥²_{D_t}] when the gradients are sparse (so that ∥D_t∥₁ < d in adaptive iterations). Hence, this theorem suggests constant improvements relative to DP-SGD when the number of switches is constant.
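The sparse-gradient noise reduction in Corollary 1 can be illustrated numerically (with an arbitrary example preconditioner): for i.i.d. Gaussian noise N, E[∥N∥²_D] = ∥D∥₁ · Var, so when D_j is small on most coordinates the weighted noise term is much smaller than the unweighted one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, var = 1000, 0.01
D = np.full(d, 0.05)
D[:10] = 1.0                 # only 10 "informative" coordinates (illustrative)

N = rng.normal(0.0, np.sqrt(var), size=(50_000, d))
unweighted = (N**2).sum(axis=1).mean()      # ~ d * var = 10
weighted = (D * N**2).sum(axis=1).mean()    # ~ ||D||_1 * var = 0.595
print(unweighted, weighted)
```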

4.2. NON-CONVEX CASES

We make the following additional assumptions, which are common in non-convex convergence analyses.

Assumption 3 (Smoothness). Each f(x_i; w) (i ∈ [n]) is L-smooth with respect to w ∈ R^d.

Assumption 4. The stochastic gradient variance is bounded, i.e., E[∥g^{i,t} − E[g^{i,t}]∥₂²] ≤ τ² for all i, t.

Theorem 3 (Convergence of Algorithm 1 for non-convex problems). Let Assumptions 1-4 hold. Define the constant M := C² + σ²C²/(sb²). Under any delay parameter s, after running Algorithm 1 with constant learning rates α_t = α such that Lα/ϵ ≤ 1, we have

(1/T) Σ_{t=1}^{T} E[∥∇F(w_t)∥²] ≤ 2(√M + 1) F(w₁) / (αT) + 2αL(√M + 1) ( τ²/(2ϵ²b) + dσ²C²/(2b²) ).

The proof is deferred to Appendix B. Compared with Theorem 2, here we do not have constraints on s. Note that to guarantee (ε, δ)-DP after running T iterations, we can set σ² = O(b²T log(1/δ)/(n²ε²)), α = O(1/√d), and T = O(nε/√(log(1/δ))), to arrive at a convergence bound of O(√d/(nε) + τ²/(√d·b)). Under any s, our rate (with and without noise) matches previous results on DP-SGD and (DP) adaptive methods for non-convex problems (Li et al., 2022; Zaheer et al., 2018). We note that our non-convex analysis does not directly highlight the benefits of adaptivity or the trade-offs around s; hence the optimal choice of s according to this result is s = T, which maximizes the reduction of privacy noise. However, the practical performance can be better than the upper bound derived here, as shown in our experiments (Section 5). Most previous works studying stochastic non-convex adaptive optimization do not prove improvements relative to SGD (e.g., Alacaoglu et al., 2020; De et al., 2018; Ward et al., 2020; Zaheer et al., 2018). It remains an open problem to rigorously characterize the benefits of adaptivity for non-convex problems, which we leave for future work.
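The role of s in the moment bound M can be seen directly (illustrative constants): M = C² + σ²C²/(sb²) shrinks toward C² as the delay s grows, since s privatized gradients are averaged.

```python
def moment_bound(C, sigma, b, s):
    """M = C^2 + sigma^2 C^2 / (s b^2), the second-moment bound from Lemma 1."""
    return C**2 + sigma**2 * C**2 / (s * b**2)

C, sigma, b = 1.0, 4.0, 2
values = [moment_bound(C, sigma, b, s) for s in (1, 4, 16, 64)]
print(values)   # -> [5.0, 2.0, 1.25, 1.0625], decreasing toward C^2 = 1
```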

5. EMPIRICAL EVALUATION

In this section we report empirical results on a range of learning tasks. In Section 5.1, we compare DP² with the baselines of DP-SGD and vanilla DP adaptive methods across various privacy budgets, and investigate the effects of delay on all datasets. We additionally compare DP² with recent, more advanced private adaptive methods in Section 5.2, and conduct ablation studies to validate the effectiveness of different DP² components in Section 5.3. In all experiments, we use the Rényi differential privacy (RDP) accountant for the subsampled Gaussian mechanism (Mironov et al., 2019) for privacy accounting. We focus on the RMSProp optimizer (Hinton et al., 2012) and provide results relating to other adaptive methods such as AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010) in Appendix C. Our experiments are implemented in JAX (Bradbury et al., 2018) with Haiku (Hennigan et al., 2020) to auto-vectorize over the per-example operations (e.g., per-example clipping) for substantial speedups (Subramani et al., 2021). Unless explicitly stated, we report results with the best grid-searched hyperparameters. Note that for DP² we tune the learning rates and clipping thresholds separately for private SGD iterations and private adaptive (RMSProp) iterations. See Appendix C.2 for hyperparameter details. Our code is publicly available at github.com/kenziyuliu/DP2.

Tuning s. In all experiments, we tune the delay parameter s via grid search. For convex tasks, we choose s from {0.025, 0.05, 0.1, 0.5, 1, 2} epochs; for the non-convex model, we choose s from {0.5, 3, 10, 25} epochs. We explore the sensitivity of DP² to s in Section 5.2, and show that a wide range of s values results in superior performance compared with baseline methods.

Datasets and Tasks. We pick datasets and tasks where adaptivity is crucial (e.g., those involving sparse gradients).
For such tasks, adaptive methods have major benefits relative to SGD in non-private training, and we expect DP² to retain these benefits in private training. See Appendix C.1 for a detailed description. For all datasets, we explore the effects of several noise multiplier (σ) values, and set δ = 10^{−k}, where k is the smallest integer satisfying 10^{−k} ≤ 1/n for training dataset size n.
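The δ rule above amounts to a one-line helper; the dataset sizes here are examples, not the paper's exact splits:

```python
import math

def choose_delta(n):
    """delta = 10^{-k} with the smallest integer k such that 10^{-k} <= 1/n."""
    k = math.ceil(math.log10(n))
    return 10.0 ** (-k)

print(choose_delta(25_000))    # -> 1e-05
print(choose_delta(500_000))   # -> 1e-06
```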

5.1. DP² COMPARED WITH DP-SGD AND VANILLA DP ADAPTIVE METHODS

We consider two popular baselines: DP-SGD (Abadi et al., 2016) and vanilla DP-RMSProp (Zhou et al., 2020). In vanilla DP adaptive methods, private gradients are plugged into adaptive updating rules to approximate the preconditioners at each iteration. Figure 4 compares DP²-RMSProp with DP-SGD and DP-RMSProp. We observe that across all datasets, DP² consistently and substantially outperforms the baselines in terms of both convergence and absolute performance.

Privacy/utility trade-offs. Figure 5 compares DP²-RMSProp (Algorithm 1) with DP-SGD and DP-RMSProp across a range of privacy parameters, where the ε ranges are consistent with prior works (e.g., Kairouz et al., 2021b). Similar to the results in Figure 4, DP² significantly outperforms DP-SGD and DP-RMSProp under each privacy budget, achieving consistently more favorable privacy/utility trade-offs. For reference, non-private RMSProp achieves 87% accuracy, 62% accuracy, and 0.88 mean square error (MSE) on IMDB, StackOverflow, and MovieLens, respectively. Indeed, with weaker privacy (larger ε), we expect smaller utility gaps between private and non-private optimization. In Appendix C.4, we additionally explore how increasing the computational budget may affect the privacy/utility trade-off.

Effects of s. Finally, we empirically study the effect of the delay parameter s. Intuitively, there exists a trade-off between the amount of delay and the privacy noise in the preconditioner: averaging over more historical gradients (larger s) could yield less noisy preconditioners, while introducing more staleness. In the first three subplots of Figure 6, we report test performance versus the delay s across all datasets; in the last subplot, we additionally show convergence behavior under different values of s.
These results suggest that there is a "sweet spot" for s to yield good performance: small delays gradually improve over DP-RMSProp; moderate delays perform best in terms of convergence and absolute performance; and large delays may slow down convergence (although it is possible to reach similar performance with sufficient training). These empirical results are consistent with the implications of our convergence analysis discussed in Section 4.1.

5.2. DP² COMPARED WITH RECENT METHODS FOR PRIVATE OPTIMIZATION

As discussed in Section 2, beyond DP-SGD and vanilla DP adaptive methods, another line of work uses auxiliary public data to improve private (adaptive) optimization. While not directly comparable to DP², since DP² does not require any side/public information, we compare DP² against two state-of-the-art methods along this direction: (1) AdaDPS (Li et al., 2022), which uses public data or their statistics to estimate gradient geometry, and (2) PDA-DPMD (Amid et al., 2022), which uses the loss on public data as a mirror map to learn the underlying gradient geometry. We report results following the experimental settings of Li et al. (2022). Even though DP² does not require auxiliary information, we find that it achieves comparable performance to these state-of-the-art approaches that require additional public data. Corresponding convergence plots are presented in Figure 11 in Appendix C.6.

5.3. ABLATION STUDIES

Finally, we study the effectiveness of different components of DP². Recall that in Algorithm 1, we use noisy gradients from DP-SGD iterations to update both the model parameters and the preconditioner, such that the total privacy cost is identical to that of DP-SGD. The first variant considers accumulating DP-SGD gradients in the same way, but runs private adaptive methods using the delayed preconditioner in almost all iterations. This requires adding independent noise twice at most iterations (once when accumulating the preconditioner and once when noising the preconditioned update), thus increasing the total privacy budget. The second variant is identical to DP² except that it applies the delayed preconditioner after noising the clean gradient; this is to study the order of preconditioning discussed in Section 3. As illustrated in Figure 7, both variants indeed significantly underperform our proposed method on the IMDB dataset, validating the design choices of DP². We defer complete results to Figure 10 and Table 4 in Appendix C.5, and the exact algorithms of both variants to Appendix D.

6. CONCLUSION AND FUTURE WORK

In this work, we proposed DP², a private adaptive optimization framework that uses historical gradients to construct delayed but less noisy preconditioners, yielding improved privacy/utility trade-offs without the need to access auxiliary data. We demonstrated the effectiveness of DP² both theoretically and empirically. In the future, it would be interesting to extend the techniques developed herein to other privacy-sensitive applications such as federated learning (McMahan et al., 2017; Reddi et al., 2021). It is also worth exploring interplays between DP² and private online optimization with tree aggregation, which similarly releases cumulative statistics with reduced noise (Chan et al., 2011).

A. PROOFS

Lemma 1. Under Assumption 2, let s₁ = s₂ = s in Algorithm 1. Then for any j ∈ [d], E[v_j] ≤ C² + σ²C²/(sb²).

Proof. Recall that C is the gradient norm bound (Assumption 2), and let the clipping threshold be C as well. For each j ∈ [d],

E[((1/s) G_j)²] = E[( (1/s)(g_j^{i₁} + ⋯ + g_j^{i_s}) + (1/s)(N_j^{i₁} + ⋯ + N_j^{i_s}) )²]
= E[(1/s²)(g_j^{i₁} + ⋯ + g_j^{i_s})²] + E[(1/s²)(N_j^{i₁} + ⋯ + N_j^{i_s})²]
≤ C² + σ²C²/(sb²),

where {i₁, ..., i_s} denotes the indices of the s noisy gradients used to obtain G_j, and N_j^{i₁}, ..., N_j^{i_s} are zero-mean Gaussian variables with variance σ²C²/b² under noise multiplier σ, clipping threshold C, and mini-batch size b. Hence for any j ∈ [d] and t ∈ [T],

E[((1/s) G_j)²] ≤ C² + σ²C²/(sb²) := M,    E[v_j] ≤ M,    E[√(v_j)] ≤ √(E[v_j]) ≤ √M,

and therefore E[D_j^t] ≤ max{√M + ϵ, 1}.

A.1. PROOF OF THEOREM 2

Based on the updating rule, we have

∥w_{t+1} − w*∥²_{D_t}
= ∥w_t − α_t g_t/D_t − α_t N_t − w*∥²_{D_t}
= ∥w_t − w*∥²_{D_t} + ∥α_t g_t/D_t + α_t N_t∥²_{D_t} − 2⟨w_t − w*, α_t g_t + α_t D_t N_t⟩
= ∥w_t − w*∥²_{D_t} − 2α_t⟨g_t, w_t − w*⟩ + (α_t)²⟨g_t, g_t/D_t⟩ − 2α_t⟨w_t − w*, D_t N_t⟩ + (α_t)²∥N_t∥²_{D_t} + 2(α_t)²⟨g_t, N_t⟩.
Rearranging terms gives
$$\langle g^t, w^t - w^* \rangle = \frac{\|w^t - w^*\|^2_{D^t} - \|w^{t+1} - w^*\|^2_{D^t}}{2\alpha^t} + \frac{\alpha^t}{2}\Big\langle g^t, \tfrac{g^t}{D^t}\Big\rangle - \langle w^t - w^*, D^t N^t \rangle + \frac{\alpha^t}{2}\|N^t\|^2_{D^t} + \alpha^t \langle g^t, N^t \rangle.$$
Taking the expectation on both sides conditioned on $w^t$,
$$\langle \nabla F(w^t), w^t - w^* \rangle = \frac{\mathbb{E}_t\big[\|w^t - w^*\|^2_{D^t}\big] - \mathbb{E}_t\big[\|w^{t+1} - w^*\|^2_{D^t}\big]}{2\alpha^t} + \frac{\alpha^t}{2}\mathbb{E}_t\Big[\Big\langle g^t, \tfrac{g^t}{D^t}\Big\rangle\Big] + \frac{\alpha^t}{2}\mathbb{E}_t\big[\|N^t\|^2_{D^t}\big],$$
where we have used the fact that $N^t$ is a zero-mean Gaussian variable independent of $g^t$ and $w^t$. Taking the full expectation on both sides and using the convexity of $F(\cdot)$:
$$\mathbb{E}[F(w^t)] - F(w^*) \le \frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^t}\big] - \mathbb{E}\big[\|w^{t+1} - w^*\|^2_{D^t}\big]}{2\alpha^t} + \frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{D^t}\Big\rangle\Big] + \frac{\alpha^t}{2}\mathbb{E}\big[\|N^t\|^2_{D^t}\big].$$
Applying a telescoping sum, we have
$$\sum_{t=1}^T \big(\mathbb{E}[F(w^t)] - F(w^*)\big) \le \underbrace{\frac{\|w^1 - w^*\|^2_{D^1}}{2\alpha^1} + \sum_{t=2}^T \left(\frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^t}\big]}{2\alpha^t} - \frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^{t-1}}\big]}{2\alpha^{t-1}}\right)}_{T_1} + \underbrace{\sum_{t=1}^T \frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{D^t}\Big\rangle\Big]}_{T_2} + \sum_{t=1}^T \frac{\alpha^t}{2}\mathbb{E}\big[\|N^t\|^2_{D^t}\big],$$
and it remains to bound the right-hand side. Here the vector $D^t \in \mathbb{R}^d$ satisfies $D^t = 1$ when running private SGD steps and $D^t = \sqrt{v} + \epsilon$ when running private RMSProp steps. Let the delay parameter be scheduled as $s = \upsilon T$ ($0 < \upsilon < 1$) and the learning rate be
$$\alpha^t \leftarrow \frac{\alpha^{\lfloor \frac{t}{2s} \rfloor + \lfloor \frac{t+s}{2s} \rfloor + 1}}{\sqrt{t}}, \qquad \alpha = \min\Big\{\epsilon,\; \frac{1}{\sqrt{M} + \epsilon},\; 1\Big\},$$
where $M$ is the upper bound on $\mathbb{E}[v_j]$ for $j \in [d]$, as defined and proved in Lemma 1.

We first consider the $T_1$ term. There are four cases.

1. DP-SGD at iteration $t-1$ and DP-SGD at iteration $t$: since $D^t = D^{t-1}$, the only requirement is $\alpha^t \le \alpha^{t-1}$, which holds for our choice of learning rates.

2. Private RMSProp at iteration $t-1$ and private RMSProp at iteration $t$: similarly, we only need $\alpha^t \le \alpha^{t-1}$, which holds for our choice.
3. DP-SGD at iteration $t-1$ and private RMSProp at iteration $t$: we require
$$\frac{\epsilon}{\alpha^t} \ge \frac{1}{\alpha^{t-1}} \;\Longrightarrow\; \frac{\sqrt{v^t} + \epsilon}{\alpha^t} \ge \frac{1}{\alpha^{t-1}}.$$
In this case we must have $t \bmod s = 0$, so the exponent in our learning rate schedule increases; the requirement is therefore satisfied by our choice as long as $\alpha \le \epsilon$.

4. Private RMSProp at iteration $t-1$ and DP-SGD at iteration $t$: this transition is handled by the grouping argument below.

The updates form the pattern DP-SGD → ⋯ → DP-SGD → DP-RMSProp → ⋯ → DP-RMSProp, where every pattern takes $2s$ iterations, except for the first pattern, because the telescoping sum starts from $t = 2$. For the first pattern, we have
$$\frac{\|w^1 - w^*\|^2_{D^1}}{2\alpha^1} + \sum_{t=2}^{2s}\left(\frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^t}\big]}{2\alpha^t} - \frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^{t-1}}\big]}{2\alpha^{t-1}}\right) = \frac{\|w^1 - w^*\|^2_{D^1}}{2\alpha^1} + \sum_{t=2}^{2s} \mathbb{E}\Big[\|w^t - w^*\|^2_{\frac{D^t}{2\alpha^t} - \frac{D^{t-1}}{2\alpha^{t-1}}}\Big] \le \frac{\|w^1 - w^*\|^2_{D^1}}{2\alpha^1} + R^2\sum_{t=2}^{2s}\left(\frac{\mathbb{E}[\|D^t\|_1]}{2\alpha^t} - \frac{\mathbb{E}[\|D^{t-1}\|_1]}{2\alpha^{t-1}}\right) \le \frac{R^2}{2\alpha^{2s}}\mathbb{E}\big[\|D^{2s}\|_1\big],$$
where $D^{2s} = \sqrt{v} + \epsilon$. For $k \ge 1$, we have
$$\sum_{t=2sk+1}^{2sk+2s}\left(\frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^t}\big]}{2\alpha^t} - \frac{\mathbb{E}\big[\|w^t - w^*\|^2_{D^{t-1}}\big]}{2\alpha^{t-1}}\right) = \frac{\mathbb{E}\big[\|w^{2sk+1} - w^*\|^2_{D^{2sk+1}}\big]}{2\alpha^{2sk+1}} - \frac{\mathbb{E}\big[\|w^{2sk+1} - w^*\|^2_{D^{2sk}}\big]}{2\alpha^{2sk}} + \sum_{t=2sk+2}^{2sk+2s} \mathbb{E}\Big[\|w^t - w^*\|^2_{\frac{D^t}{2\alpha^t} - \frac{D^{t-1}}{2\alpha^{t-1}}}\Big] \le \frac{\mathbb{E}\big[\|w^{2sk+1} - w^*\|^2_{D^{2sk+1}}\big]}{2\alpha^{2sk+1}} + R^2\left(\frac{\mathbb{E}[\|D^{2sk+2s}\|_1]}{2\alpha^{2sk+2s}} - \frac{\mathbb{E}[\|D^{2sk+1}\|_1]}{2\alpha^{2sk+1}}\right) \le \frac{R^2}{2\alpha^{2sk+2s}}\mathbb{E}\big[\|D^{2sk+2s}\|_1\big],$$
where $D^{2sk+2s} = \sqrt{v} + \epsilon$ belongs to DP-RMSProp updates.

We next look at the $T_2$ term, and prove by induction that there exists a constant $\kappa$ such that
$$\sum_{t=1}^T \frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{D^t}\Big\rangle\Big] \le \frac{\kappa}{\alpha^T}\mathbb{E}\big[\|D^T\|_1\big].$$
When $T = 1$ ($\alpha^1 = \alpha$ and $D^1 = 1$), $\frac{\alpha}{2}\mathbb{E}[\|g^1\|^2] \le \frac{\kappa d}{\alpha}$ holds if $\kappa \ge \alpha^2 C^2$. At each step $t$, the goal is to show
$$\frac{\kappa}{\alpha^{t-1}}\mathbb{E}\big[\|D^{t-1}\|_1\big] + \frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{D^t}\Big\rangle\Big] \le \frac{\kappa}{\alpha^t}\mathbb{E}\big[\|D^t\|_1\big].$$
There are again four cases.

1. DP-SGD at iteration $t-1$ and DP-SGD at iteration $t$: we require $\frac{\kappa d}{\alpha^{t-1}} + \frac{\alpha^t}{2}\mathbb{E}[\|g^t\|^2] \le \frac{\kappa d}{\alpha^t}$, which holds for our choice of $\alpha^t$ since gradients are bounded and $\kappa \ge \alpha^2 C^2$.
2. Private RMSProp at iteration $t-1$ and private RMSProp at iteration $t$: we need
$$\frac{\kappa}{\alpha^{t-1}}\mathbb{E}\big[\|\sqrt{v^{t-1}} + \epsilon\|_1\big] + \frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{\sqrt{v^{t-1}} + \epsilon}\Big\rangle\Big] \le \frac{\kappa}{\alpha^t}\mathbb{E}\big[\|\sqrt{v^t} + \epsilon\|_1\big],$$
for which it suffices to show
$$\frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{\sqrt{v^{t-1}} + \epsilon}\Big\rangle\Big] \le \Big(\frac{\kappa}{\alpha^t} - \frac{\kappa}{\alpha^{t-1}}\Big)\mathbb{E}\big[\|\sqrt{v^{t-1}} + \epsilon\|_1\big].$$
Let
$$h(s) \ge \max_{t \in [T]} \left\{\frac{\mathbb{E}[\|g^t\|_1]}{\mathbb{E}\big[\big\|\tfrac{1}{s}G^{\lfloor t/s \rfloor s} + \epsilon\big\|_1\big]}\right\}.$$
Based on our updating rule,
$$\mathbb{E}\big[\|\sqrt{v^t} + \epsilon\|_1\big] \ge \sqrt{1 - \beta}\; \mathbb{E}\Big[\Big\|\tfrac{1}{s}G^{\lfloor t/s \rfloor s} + \epsilon\Big\|_1\Big].$$
Note that
$$\frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{\sqrt{v^{t-1}} + \epsilon}\Big\rangle\Big] \le \frac{\alpha^t}{2\epsilon}\mathbb{E}\big[\|g^t\|^2\big] \le \frac{\alpha^t C}{2\epsilon}\mathbb{E}\big[\|g^t\|\big] \le \frac{\alpha^t C}{2\epsilon}\mathbb{E}\big[\|g^t\|_1\big],$$
where we have used the assumption that $\|g^t\| \le C$. Combining the above,
$$\frac{\alpha^t C}{2\epsilon}\mathbb{E}\big[\|g^t\|_1\big] \le \frac{\alpha^t C}{2\epsilon}\, h(s)\, \mathbb{E}\Big[\Big\|\tfrac{1}{s}G^{\lfloor t/s \rfloor s} + \epsilon\Big\|_1\Big] \le \frac{\alpha^t C}{2\epsilon} \cdot \frac{h(s)}{\sqrt{1 - \beta}}\, \mathbb{E}\big[\|\sqrt{v^{t-1}} + \epsilon\|_1\big] \le \kappa \Big(\frac{1}{\alpha^t} - \frac{1}{\alpha^{t-1}}\Big)\mathbb{E}\big[\|\sqrt{v^{t-1}} + \epsilon\|_1\big],$$
so the condition holds as long as $\kappa \ge \frac{C h(s)}{\epsilon\sqrt{1 - \beta}}$.

3. DP-SGD at iteration $t-1$ and private RMSProp at iteration $t$: we want to show $\frac{\kappa d}{\alpha^{t-1}} + \frac{\alpha^t}{2}\mathbb{E}\big[\langle g^t, \tfrac{g^t}{D^t}\rangle\big] \le \frac{\kappa}{\alpha^t}\mathbb{E}[\|D^t\|_1]$. As $\|g^t\| \le C$, it holds that
$$\frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{\sqrt{v^t} + \epsilon}\Big\rangle\Big] \le \frac{\alpha^t}{2\epsilon}\mathbb{E}\big[\|g^t\|^2\big] \le \frac{\alpha^t C}{2\epsilon}\mathbb{E}\big[\|g^t\|\big] \le \frac{\alpha^t C}{2\epsilon}\mathbb{E}\big[\|g^t\|_1\big],$$
and therefore
$$\frac{\alpha^t}{2}\mathbb{E}\Big[\Big\langle g^t, \tfrac{g^t}{\sqrt{v^t} + \epsilon}\Big\rangle\Big] \le \frac{C h(s)}{2\epsilon\sqrt{1 - \beta}}\, \alpha^t\, \mathbb{E}\big[\|\sqrt{v^t} + \epsilon\|_1\big].$$
Based on our learning rate schedule in Eq. (18),
$$\sqrt{t}\,\alpha^t = \sqrt{t-1}\,\alpha^{t-1}\epsilon \;\Longrightarrow\; \frac{\alpha^t}{2} \le \Big(\frac{1}{\alpha^t} - \frac{1}{\alpha^{t-1}}\Big)\epsilon \le \frac{1}{\alpha^t} - \frac{d}{\alpha^{t-1}\,\mathbb{E}[\|D^t\|_1]}.$$
Hence,
$$\frac{C h(s)}{2\epsilon\sqrt{1 - \beta}}\, \alpha^t\, \mathbb{E}\big[\|\sqrt{v^t} + \epsilon\|_1\big] \le \frac{C h(s)}{\epsilon\sqrt{1 - \beta}}\, \mathbb{E}\big[\|D^t\|_1\big]\Big(\frac{1}{\alpha^t} - \frac{d}{\alpha^{t-1}\,\mathbb{E}[\|D^t\|_1]}\Big) \le \kappa\Big(\frac{\mathbb{E}[\|D^t\|_1]}{\alpha^t} - \frac{d}{\alpha^{t-1}}\Big),$$
where we require $\kappa \ge \frac{C h(s)}{\epsilon\sqrt{1 - \beta}}$.

4. Private RMSProp at iteration $t-1$ and DP-SGD at iteration $t$: we need
$$\frac{\kappa}{\alpha^{t-1}}\mathbb{E}\big[\|\sqrt{v^{t-1}} + \epsilon\|_1\big] + \frac{\alpha^t}{2}\mathbb{E}\big[\|g^t\|^2\big] \le \frac{\kappa d}{\alpha^t}.$$
Plugging in $\mathbb{E}[\|\sqrt{v^{t-1}}\|_1] \le d\sqrt{M}$ (Lemma 1) and $\|g^t\|^2 \le C^2$, we have
$$\frac{\kappa}{\alpha^{t-1}}\mathbb{E}\big[\|\sqrt{v^{t-1}} + \epsilon\|_1\big] + \frac{\alpha^t}{2}\mathbb{E}\big[\|g^t\|^2\big] \le \frac{\kappa}{\alpha^{t-1}}\big(d\sqrt{M} + d\big) + \frac{\alpha^t}{2}C^2.$$
Based on our learning rate schedule in Eq. (18), for some constant $\gamma$,
$$\alpha^{t-1} = \frac{\gamma}{\sqrt{t-1}}, \qquad \alpha^t \le \frac{\gamma}{\sqrt{t}\,(\sqrt{M} + 1)} \;\Longrightarrow\; \frac{\alpha^t}{2} \le \frac{1}{\alpha^t} - \frac{\sqrt{M} + 1}{\alpha^{t-1}} \le \frac{d}{\alpha^t} - \frac{d\sqrt{M} + d}{\alpha^{t-1}}.$$
Therefore $\frac{\alpha^t}{2}C^2 \le \kappa\big(\frac{d}{\alpha^t} - \frac{d\sqrt{M} + d}{\alpha^{t-1}}\big)$ holds as long as $\kappa \ge \alpha^2 C^2$. To sum up, the requirement on $\kappa$ is
$$\kappa \ge \max\Big\{\alpha^2 C^2,\; \frac{C h(s)}{\epsilon\sqrt{1 - \beta}}\Big\}.$$

Final convergence result:
$$\min_{t \in [T]} \mathbb{E}[F(w^t)] - F(w^*) \le \frac{R^2 + \kappa}{\alpha^{\lfloor \frac{1}{2\upsilon} \rfloor + \lfloor \frac{1+\upsilon}{2\upsilon} \rfloor}} \cdot \frac{1}{\sqrt{T}}\sum_{t \in \mathcal{T}_\upsilon} \mathbb{E}\big[\|D^t\|_1\big] + \frac{1}{T}\sum_{t=1}^T \frac{\alpha^{\lfloor \frac{t}{2\upsilon T} \rfloor + \lfloor \frac{t + \upsilon T}{2\upsilon T} \rfloor}}{\sqrt{t}}\mathbb{E}\big[\|N^t\|^2_{D^t}\big],$$
where $\mathcal{T}_\upsilon$ denotes the set of iteration indices at which we switch from private RMSProp steps to private SGD steps, plus the last iteration; its cardinality is $|\mathcal{T}_\upsilon| = \lceil \frac{1}{2\upsilon} \rceil$; $\kappa \ge \max\big\{\alpha^2 C^2,\, \frac{C h(s)}{\epsilon\sqrt{1-\beta}}\big\}$; and $\alpha = \min\big\{\epsilon,\, \frac{1}{\sqrt{M}+\epsilon},\, 1\big\}$.

A.2 A CLOSER LOOK AT h(s)

We now closely examine $h(s)$, defined via
$$h(s) \ge \max_{t \in [T]} \left\{\frac{\mathbb{E}[\|g^t\|_1]}{\mathbb{E}\big[\big\|\tfrac{1}{s}G^{\lfloor t/s \rfloor s} + \epsilon\big\|_1\big]}\right\}.$$
Assume that mini-batch gradients on consecutive time steps are not very different, i.e., $\|g^t - g^{t-1}\|_1 \le \bar{M}$ (we write $\bar{M}$ to distinguish this drift bound from the constant $M$ in Lemma 1). This means the gradients cannot be too far from one another, which can be used to show the dependence of $h(s)$ on the delay parameter $s$. Denote the gap between the current iteration $t$ and the iteration where $v$ was last updated as $k := t - \lfloor \tfrac{t}{s} \rfloor s$. Hence,
$$\frac{\|g^t\|_1}{\big\|\tfrac{1}{s}\big(g^{t-k-1} + \cdots + g^{t-k-s}\big) + \tfrac{1}{s}\big(N^{t-k-1} + \cdots + N^{t-k-s}\big)\big\|_1 + d\epsilon} = \frac{\big\|\tfrac{1}{s}\big[(g^t - g^{t-k-1}) + \cdots + (g^t - g^{t-k-s})\big] + \tfrac{1}{s}\big(g^{t-k-1} + \cdots + g^{t-k-s}\big)\big\|_1}{\big\|\tfrac{1}{s}\big(g^{t-k-1} + \cdots + g^{t-k-s}\big) + \tfrac{1}{s}\big(N^{t-k-1} + \cdots + N^{t-k-s}\big)\big\|_1 + d\epsilon} \le \frac{\big\|\tfrac{1}{s}\big(g^{t-k-1} + \cdots + g^{t-k-s}\big)\big\|_1}{\big\|\tfrac{1}{s}\big(g^{t-k-1} + \cdots + g^{t-k-s}\big) + \tfrac{1}{s}\big(N^{t-k-1} + \cdots + N^{t-k-s}\big)\big\|_1 + d\epsilon} + \frac{\tfrac{1}{s}\big(s + (s+1) + \cdots + 2s\big)\bar{M}}{d\epsilon},$$
where the last term uses $\|g^t - g^{t-k-i}\|_1 \le (k+i)\bar{M} \le (s+i)\bar{M}$. Denote $a := \tfrac{1}{s}\big(N^{t-k-1} + \cdots + N^{t-k-s}\big)$ and $b := \tfrac{1}{s}\big(g^{t-k-1} + \cdots + g^{t-k-s}\big)$. Then
$$h(s) \le \frac{\mathbb{E}[\|b\|_1]}{\mathbb{E}[\|a + b\|_1] + d\epsilon} + \frac{2s\bar{M}}{d\epsilon} \le \frac{1}{\frac{\mathbb{E}[\|a\|_1]}{\mathbb{E}[\|b\|_1]} - 1 + \frac{d\epsilon}{\mathbb{E}[\|b\|_1]}} + \frac{2s\bar{M}}{d\epsilon}.$$
In the special case where gradients are sparse, i.e., $\mathbb{E}[\|b\|_1] < \mathbb{E}[\|a\|_1]$, we have
$$h(s) \le \frac{1}{\frac{\mathbb{E}[\|a\|_1]}{\mathbb{E}[\|b\|_1]} + \frac{d\epsilon}{\mathbb{E}[\|b\|_1]} - 1} + \frac{2s\bar{M}}{d\epsilon}.$$
It is easy to see that the right-hand side is $O(s)$ and increases with $s$.
We can informally express this bound as $c_1 s + c_2$, where $c_1$ and $c_2$ are two constants.
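This roughly linear growth can be seen in a toy numeric sketch. All constants below (the drift rate $c$, the horizon $T$, and the $\epsilon$ value) are illustrative choices of ours, not the paper's settings: for a slowly drifting gradient sequence, the ratio between the current gradient magnitude and the stale window average that would build the delayed preconditioner grows with the delay $s$.

```python
# A toy numeric sketch of the staleness term in h(s). The drift model
# g_t = a + c*t (so |g_t - g_{t-1}| <= c) and all constants are illustrative.
def h_of_s(s, T=2000, eps=1e-3, a=1.0, c=0.001):
    g = [a + c * t for t in range(T)]
    ratios = []
    for t in range(s, T):
        k0 = (t // s) * s                  # step of the last preconditioner update
        window = g[k0 - s:k0]              # the s stale gradients that built v
        ratios.append(abs(g[t]) / (abs(sum(window)) / s + eps))
    return max(ratios)

print([round(h_of_s(s), 3) for s in (5, 50, 500)])  # grows with the delay s
```

Larger delays average older gradients, so the worst-case mismatch with the current gradient grows, matching the $c_1 s + c_2$ intuition above.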

B PROOF OF THEOREM 3

First, we introduce a result that will be used in this section. Under the bounded stochastic gradient variance assumption (Assumption 4), conditioned on $w^t$ we have
$$\mathbb{E}_t\big[\|g^t\|^2\big] \le \frac{\tau^2}{b} + \|\nabla F(w^t)\|^2,$$
where $b$ is the mini-batch size used to obtain the gradient $g^t \leftarrow \frac{1}{b}\sum_{i \in B} g^{i,t}$. This lemma is proved in Zaheer et al. (2018). The per-coordinate version of this result is that for $j \in [d]$,
$$\mathbb{E}_t\big[(g^t_j)^2\big] \le \frac{\tau_j^2}{b} + \big(\nabla_j F(w^t)\big)^2, \qquad \sum_{j \in [d]} \tau_j^2 = \tau^2.$$

As we assume $F(w)$ is $L$-smooth, at each iteration $t$,
$$F(w^{t+1}) \le F(w^t) + \langle \nabla F(w^t), w^{t+1} - w^t \rangle + \frac{L}{2}\|w^{t+1} - w^t\|^2.$$
Based on the updating rule of Algorithm 1, we have
$$F(w^{t+1}) \le F(w^t) - \alpha^t\Big\langle \nabla F(w^t), \frac{g^t}{D^t} + N^t \Big\rangle + \frac{(\alpha^t)^2 L}{2}\Big\|\frac{g^t}{D^t} + N^t\Big\|^2,$$
where $N^t \in \mathbb{R}^d$ with $N^t_j \sim \mathcal{N}\big(0, \frac{\sigma^2 C^2}{b^2}\big)$ under noise multiplier $\sigma$ and clipping threshold $C$, and $D^t$ satisfies
$$D^t \leftarrow \begin{cases} 1 & \text{if } t \bmod 2s \le s, \\ \sqrt{v} + \epsilon & \text{otherwise.} \end{cases}$$
Taking the expectation with respect to the samples at iteration $t$ and $N^t$,
$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \alpha^t \sum_{j \in [d]} \frac{\big(\nabla_j F(w^t)\big)^2}{D^t_j} + \frac{(\alpha^t)^2 L}{2}\sum_{j \in [d]} \frac{\mathbb{E}_t[(g^t_j)^2]}{(D^t_j)^2} + \frac{d(\alpha^t)^2 L}{2b^2}\sigma^2 C^2,$$
where we have used the fact that $N^t$ is a zero-mean random variable independent of $w^t$, and that $D^t$ is independent of the samples at time $t$. We need to consider two cases.

1. DP-SGD at iteration $t$: here $D^t = 1$. Plugging in $\mathbb{E}_t[(g^t_j)^2] \le \frac{\tau_j^2}{b} + (\nabla_j F(w^t))^2$, we have
$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2}\Big)\|\nabla F(w^t)\|^2 + (\alpha^t)^2 L\Big(\frac{\tau^2}{2b} + \frac{\sigma^2 C^2 d}{2b^2}\Big).$$
Under a constant learning rate $\alpha^t = \alpha \le \frac{1}{L}$,
$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \frac{\alpha}{2}\|\nabla F(w^t)\|^2 + \alpha^2 L\Big(\frac{\tau^2}{2b} + \frac{\sigma^2 C^2 d}{2b^2}\Big).$$
Taking the full expectation on both sides gives
$$\frac{\alpha}{2}\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] \le \mathbb{E}[F(w^t)] - \mathbb{E}[F(w^{t+1})] + \alpha^2 L\Big(\frac{\tau^2}{2b} + \frac{\sigma^2 C^2 d}{2b^2}\Big).$$
2. Private RMSProp at iteration $t$: we have
$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \alpha^t \sum_{j \in [d]} \frac{\big(\nabla_j F(w^t)\big)^2}{\sqrt{v^t_j} + \epsilon} + \frac{(\alpha^t)^2 L}{2\epsilon}\sum_{j \in [d]} \frac{\mathbb{E}_t[(g^t_j)^2]}{\sqrt{v^t_j} + \epsilon} + \frac{d(\alpha^t)^2 L \sigma^2 C^2}{2b^2}.$$
Plugging in $\mathbb{E}_t[(g^t_j)^2] \le \frac{\tau_j^2}{b} + (\nabla_j F(w^t))^2$ results in
$$\mathbb{E}_t[F(w^{t+1})] \le F(w^t) - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2\epsilon}\Big)\sum_{j \in [d]} \frac{\big(\nabla_j F(w^t)\big)^2}{\sqrt{v^t_j} + \epsilon} + \frac{(\alpha^t)^2 L}{2\epsilon}\sum_{j \in [d]} \frac{\tau_j^2}{\big(\sqrt{v^t_j} + \epsilon\big) b} + \frac{d(\alpha^t)^2 L \sigma^2 C^2}{2b^2} \le F(w^t) - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2\epsilon}\Big)\sum_{j \in [d]} \frac{\big(\nabla_j F(w^t)\big)^2}{\sqrt{v^t_j} + \epsilon} + (\alpha^t)^2 L\Big(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2b^2}\Big).$$
Taking the full expectation on both sides yields
$$\mathbb{E}[F(w^{t+1})] \le \mathbb{E}[F(w^t)] - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2\epsilon}\Big)\sum_{j \in [d]} \mathbb{E}\left[\frac{\big(\nabla_j F(w^t)\big)^2}{\sqrt{v^t_j} + \epsilon}\right] + (\alpha^t)^2 L\Big(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2b^2}\Big).$$
We need to lower bound $\sum_{j \in [d]} \mathbb{E}\big[\frac{(\nabla_j F(w^t))^2}{\sqrt{v^t_j} + \epsilon}\big]$. We know from Hölder's inequality that $\mathbb{E}[\langle u, v \rangle] \le \mathbb{E}[\|u\|_1]\,\mathbb{E}[\|v\|_\infty]$. Now note that
$$\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] = \mathbb{E}\left[\Big\langle \frac{(\nabla F(w^t))^2}{D^t}, D^t \Big\rangle\right] \le \mathbb{E}\left[\Big\|\frac{(\nabla F(w^t))^2}{D^t}\Big\|_1\right] \mathbb{E}\big[\|D^t\|_\infty\big] \le \mathbb{E}\left[\Big\|\frac{(\nabla F(w^t))^2}{D^t}\Big\|_1\right] \big(\sqrt{M} + \epsilon\big).$$
Hence
$$\sum_{j \in [d]} \mathbb{E}\left[\frac{\big(\nabla_j F(w^t)\big)^2}{D^t_j}\right] \ge \frac{\mathbb{E}\big[\|\nabla F(w^t)\|^2\big]}{\sqrt{M} + \epsilon},$$
and
$$\mathbb{E}[F(w^{t+1})] \le \mathbb{E}[F(w^t)] - \Big(\alpha^t - \frac{(\alpha^t)^2 L}{2\epsilon}\Big)\frac{\mathbb{E}\big[\|\nabla F(w^t)\|^2\big]}{\sqrt{M} + \epsilon} + (\alpha^t)^2 L\Big(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2b^2}\Big).$$
Let $\alpha^t = \alpha \le \frac{\epsilon}{L}$; we obtain
$$\mathbb{E}[F(w^{t+1})] \le \mathbb{E}[F(w^t)] - \frac{\alpha}{2(\sqrt{M} + \epsilon)}\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] + \alpha^2 L\Big(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2b^2}\Big).$$
Combining the two cases, for any $t$ we have
$$\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] \le \frac{2(\sqrt{M} + 1)}{\alpha}\big(\mathbb{E}[F(w^t)] - \mathbb{E}[F(w^{t+1})]\big) + 2\alpha L\big(\sqrt{M} + 1\big)\Big(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2b^2}\Big).$$
Taking a telescoping sum results in
$$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[\|\nabla F(w^t)\|^2\big] \le \frac{2(\sqrt{M} + 1)F(w^1)}{\alpha T} + 2\alpha L\big(\sqrt{M} + 1\big)\Big(\frac{\tau^2}{2\epsilon^2 b} + \frac{d\sigma^2 C^2}{2b^2}\Big),$$
where $M := C^2 + \frac{\sigma^2 C^2}{s b^2}$.

Dataset       | DP-SGD    | DP-RMSProp          | PDA-DPMD  | AdaDPS            | DP 2-RMSProp (w/ RMSProp)
IMDB          | (5, 0.5)  | (0.3, 0.1, 10^-3)   | (5, 0.5)  | (1, 5, 10^-3)     | (0.1, 3, 0.5, 5, 10^-7, 195)
StackOverflow | (3, 0.25) | (0.03, 0.1, 10^-3)  | (3, 0.25) | (0.4, 5, 10^-3)   | (0.3, 0.3, 0.25, 5, 10^-5, 1000)
MovieLens     | (0.1, 1)  | (0.001, 0.5, 10^-3) | (0.1, 1)  | (0.01, 10, 10^-2) | (0.1, 0.03, 1, 5, 10^-3, 31250)

Dataset       | Ablation Variant 1                | Ablation Variant 2
IMDB          | (3.0, 0.1, 0.5, 2.0, 10^-7, 780)  | (0.3, 0.3, 0.25, 10^-3, 780)
StackOverflow | (1.0, 1.0, 1.0, 1.0, 10^-5, 1000) | (0.3, 0.001, 0.25, 10^-5, 1000)

The DP 2 framework can be applied to a range of adaptive methods beyond the RMSProp rule mostly discussed in the main text. We extend DP 2 to the AdaGrad update rule (with only one line of code changed; see Appendix D) and benchmark its convergence and privacy/utility trade-offs. As shown in Figure 8 and Figure 9, DP 2-AdaGrad, like DP 2-RMSProp, consistently and substantially improves over the baselines in terms of both convergence and absolute performance, demonstrating the generality of DP 2 across adaptive optimizers.
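The "one line of code" that distinguishes these update rules (Line 5 of Algorithm 1) can be sketched as follows. The function name and the default β are ours for illustration, and we assume the RMSProp rule is a standard exponential moving average; the three rules themselves follow the variants listed in Appendix D:

```python
import numpy as np

# Sketch of the single line that differs across DP2 variants when the
# accumulated noisy gradient sum G over s steps is folded into the delayed
# preconditioner v. Names and defaults are illustrative.
def update_preconditioner(v, G, s, beta=0.9, rule="rmsprop"):
    g2 = (G / s) ** 2                                    # squared averaged noisy gradient
    if rule == "rmsprop":
        return beta * v + (1.0 - beta) * g2              # exponential moving average
    if rule == "adagrad":
        return v + g2                                    # cumulative sum (DP2-AdaGrad)
    if rule == "yogi":
        return v + (1.0 - beta) * np.sign(g2 - v) * g2   # Yogi's additive rule (DP2-Yogi)
    raise ValueError(f"unknown rule: {rule}")

v = np.zeros(3)
G = np.array([3.0, -6.0, 0.0])
for rule in ("rmsprop", "adagrad", "yogi"):
    print(rule, update_preconditioner(v, G, s=3, rule=rule))
```

Only the accumulation rule changes; the rest of the training loop, including the delayed schedule and the privatization order, stays identical.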

C.4 EFFECTS OF INCREASING COMPUTATIONAL BUDGETS

When differential privacy introduces a large utility gap between private and non-private training, one approach to improving the privacy/utility trade-off is to increase the computational cost by using larger batch sizes under a fixed number of steps. The noise multiplier must then increase to achieve the same privacy target, but the overall privacy noise may still be reduced due to the larger batch size. This technique may be adopted in practice when we want to prioritize the utility of private optimization under a fixed privacy budget. In Figure 9 (right), we explore the effect of such increased computation on StackOverflow. With a 4× increase in computational cost (4× larger batch sizes with the same number of training iterations), we observe that the privacy/utility trade-offs of all methods improve substantially, narrowing the utility gap to non-private training. In particular, the absolute performance improvement of DP 2 over the vanilla DP baselines remains similar.

On IMDB, we observe that despite not using any auxiliary information, the convergence of DP 2-RMSProp is comparable with that of AdaDPS-RMSProp (Li et al., 2022).

Figure 11: Test accuracies of DP 2 compared against recent private (adaptive) methods that leverage public data (Amid et al., 2022; Li et al., 2022). Dotted lines correspond to training metrics.

In Figure 12, we additionally implement a private AdaGrad method proposed in Kairouz et al. (2021a) that also leverages public data. Specifically, in each iteration, the algorithm clips and adds independent noise to both the clean gradients and the preconditioner estimated from clean gradients; it then uses public data to estimate a gradient subspace onto which to project the clipped/noised preconditioner, in order to reduce the effect of noise; finally, it preconditions the noisy gradient with the noisy preconditioner and takes an update step. Our implementation differs from Kairouz et al.
(2021a) in that we use the diagonal form of the preconditioner instead of the full matrix form. To estimate the gradient subspace, we follow the approach described in Zhou et al. (2021), where the projection matrix $V \in \mathbb{R}^{d \times k}$ ($d$ the number of parameters, $k$ the dimension of the subspace) is obtained by taking the top-$k$ eigenspace of
$$M_t = \frac{1}{|X_{\mathrm{pub}}|}\sum_{x_i \in X_{\mathrm{pub}}} \nabla_{w^t} f(x_i; w^t)\, \nabla_{w^t} f(x_i; w^t)^\top,$$
where $X_{\mathrm{pub}}$ is the set of public examples. Unfortunately, we have not obtained satisfactory results for this noisy AdaGrad algorithm. We remark that since the method is extremely computationally expensive (it involves computing the eigendecomposition of a $d \times d$ matrix with $d = 10001$ at every iteration), further hyperparameter tuning may help improve the performance. However, our ablation studies (Section 5.3 and Appendix C.5) may shed light on the current observations, since this method privatizes gradients before preconditioning.
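A minimal sketch of this subspace estimation step is below. The synthetic stand-in gradients, dimensions, and all values are illustrative, not the paper's setup:

```python
import numpy as np

# Sketch of the public-data subspace estimation described above: build M_t from
# per-example gradients on public data and take its top-k eigenvectors as the
# projection matrix V. grads_pub stands in for per-example gradients.
rng = np.random.default_rng(0)
d, n_pub, k = 50, 200, 5
grads_pub = rng.normal(size=(n_pub, d)) @ np.diag(np.linspace(3.0, 0.1, d))

M = grads_pub.T @ grads_pub / n_pub        # M_t = (1/|X_pub|) sum_i g_i g_i^T
eigvals, eigvecs = np.linalg.eigh(M)       # eigh returns ascending eigenvalues
V = eigvecs[:, -k:]                        # top-k eigenspace, V in R^{d x k}

noisy_precond = rng.normal(size=d)         # stand-in for the clipped/noised preconditioner
projected = V @ (V.T @ noisy_precond)      # project onto the subspace to cut noise
print(projected.shape)
```

The cost concern in the text is visible here: `eigh` on a $d \times d$ matrix is $O(d^3)$, which is prohibitive at $d = 10001$ per iteration.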

D ALGORITHMS

For completeness, we present all algorithms mentioned in the main text in detail.

• Non-private version of DP 2: change only Line 9 in Algorithm 1 to $\tilde{g}^t \leftarrow \frac{1}{b}\sum_{i \in B} g^{i,t} / D^t$.

• DP 2 with the AdaGrad update rule (DP 2-AdaGrad): change only Line 5 in Algorithm 1 to $v \leftarrow v + \big(G^t / s_1\big)^2$.

• DP 2 with Yogi's additive update rule (DP 2-Yogi): change only Line 5 in Algorithm 1 to $v \leftarrow v + (1 - \beta)\,\mathrm{sign}\big(\big(G^t / s_1\big)^2 - v\big) \odot \big(G^t / s_1\big)^2$.

• Ablation variant 1 (extra query) with delayed preconditioners: see Algorithm 2. Observe that the clean batch gradients $\{g^{i,t}\}_{i \in B}$ are privatized twice in most iterations (when $(t-1) \bmod s \ne 0$), increasing the total privacy cost.

• Ablation variant 2 (noise before preconditioning) with delayed preconditioners: in Line 9 of Algorithm 1, privatize the batch gradients with the replacement $\tilde{g}^t \leftarrow \frac{1}{b}\big(\sum_{i \in B} \mathrm{clip}(g^{i,t}, C) + \mathcal{N}(0, \sigma^2 C^2)\big) / D^t$.
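To see why the order of privatization matters for variant 2, consider a toy illustration with numbers of our own choosing: when preconditioner entries $D_j$ are small, dividing after noising amplifies the injected noise, while the precondition-then-noise order of DP 2 leaves it at its calibrated scale.

```python
import numpy as np

# Toy illustration (illustrative constants) of noise amplification in variant 2.
# A zero clean gradient isolates the privacy noise in both update rules.
rng = np.random.default_rng(0)
d, noise_std = 1000, 0.5
D = np.full(d, 0.05)                        # small preconditioner entries
g = np.zeros(d)
N = rng.normal(0.0, noise_std, size=(5000, d))

dp2_update = g / D + N                      # DP2: precondition, then add noise
variant2_update = (g + N) / D               # variant 2: add noise, then precondition
print(dp2_update.std(), variant2_update.std())  # noise std: ~0.5 vs ~10.0
```

The amplification factor is exactly $1/D_j$ per coordinate, which is why the utility drop of variant 2 is largest on high-dimensional tasks with many small preconditioner entries (see Appendix C.5).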



We consider the practical diagonal (as opposed to matrix) form of adaptive methods throughout the paper. We do not directly compare with the prior work of Asi et al. (2021), as its code is not publicly available and implementation details are missing from the paper; however, the more recent PDA-DPMD work of Amid et al. (2022), which we compare with, reports superior performance to Asi et al. (2021). We also implement the diagonal variant of the method proposed in the theoretically-focused work of Kairouz et al. (2021a), but observe that accuracy improves only marginally beyond random guessing (see Figure 12 in Appendix C.6).



Figure 1: Preconditioner values do not change drastically during optimization (IMDB dataset).

Figure 2: In non-private training, RMSProp with delayed preconditioners achieves similar training loss as standard RMSProp across all datasets. Final test accuracies are presented in Section 5.1. This observation provides motivation for our proposed DP 2 framework for private training (Section 3.2).

Figure 3: Visualization of h(s) versus s on IMDB.

Figure 4: Test performance of DP 2 compared to DP-SGD and DP-RMSProp on IMDB (left), StackOverflow (middle), and MovieLens-100k (right) for a fixed privacy budget. For all datasets, we calculate the privacy loss (ε) under fixed δ's, noise multipliers {1.0, 1.0, 0.5}, and batch size 64. All runs are repeated over 5 random seeds. Dotted lines correspond to training metrics.

Figure 6: Effect of the delay parameter s. The first three subplots show trade-offs between delay and noise. The rightmost subplot shows convergence curves under different delays (s=10000 corresponds to a delay of ≈3 epochs), where DP 2 achieves a 4× convergence speedup over DP-SGD. Privacy settings follow those of Figure 4. Although a specific value of s achieves the greatest improvements, we observe that nearly all instantiations of DP 2 improve upon the baselines.

Figure 7: Different ablation variants of DP 2 on IMDB. The dotted lines correspond to training accuracy.

Figure 8: (Extension of Figure 4 to the AdaGrad update rule) Test accuracy of DP 2 compared to DP-SGD, DP-RMSProp, and DP-AdaGrad on IMDB and StackOverflow. Dotted lines denote training performance.

Figure 10: Test accuracies for ablation studies on DP 2 . Dotted lines correspond to training metrics.

Figure 12: Comparing DP 2 against a noisy AdaGrad variant based on Kairouz et al. (2021a) where the gradients and the preconditioner are privatized separately.

Algorithm 1: DP 2-RMSProp: Delayed Preconditioners for Differentially Private RMSProp. Input: number of iterations T, batch size b, noise multiplier σ, clipping thresholds C, initial model w^0 ∈ R^d, v = 0, constant ϵ ∈ R^+, learning rate schedule α^t, moving average parameter β, SGD cumulative aggregation step s_1, RMSProp cumulative aggregation step s_2.
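A heavily simplified, self-contained sketch of the DP 2-RMSProp loop on a toy quadratic may help make the structure concrete. All constants are illustrative; real DP training uses per-example clipping over mini-batches and a privacy accountant, and we assume here (as in the ablation discussion) that only the DP-SGD-phase noisy gradients feed the delayed preconditioner, with preconditioning applied before clipping and noising:

```python
import numpy as np

# Toy sketch of DP2-RMSProp alternating s DP-SGD steps with s preconditioned
# steps on F(w) = 0.5 w^T A w. Constants are illustrative, not the paper's.
rng = np.random.default_rng(0)
d, s, T = 10, 20, 400
C, sigma, alpha, beta, eps = 1.0, 0.5, 0.05, 0.9, 1e-3
A = np.diag(np.linspace(1.0, 10.0, d))      # ill-conditioned quadratic
w, v, G = np.ones(d), np.zeros(d), np.zeros(d)

def clip(x, c):
    return x * min(1.0, c / (np.linalg.norm(x) + 1e-12))

for t in range(T):
    g = A @ w                               # clean gradient at step t
    if (t // s) % 2 == 0:                   # DP-SGD phase
        g_tilde = clip(g, C) + rng.normal(0.0, sigma * C, size=d)
        G += g_tilde                        # reuse the same noisy gradients for v
        if (t + 1) % s == 0:                # fold into the delayed preconditioner
            v = beta * v + (1.0 - beta) * (G / s) ** 2
            G = np.zeros(d)
    else:                                   # private RMSProp phase with stale v
        D = np.sqrt(v) + eps
        g_tilde = clip(g / D, C) + rng.normal(0.0, sigma * C, size=d)
    w = w - alpha * g_tilde

print(0.5 * w @ A @ w)                      # loss after training
```

Because the adaptive phase reuses the preconditioner built from the previous DP-SGD phase, no extra privacy queries are needed beyond those of DP-SGD itself.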

which show that DP 2 has comparable performance to state-of-the-art baselines, but without the need to access auxiliary data. See Appendix C.6 for full details and convergence curves.

DP 2 compared with other private (adaptive) methods that use public data (Amid et al., 2022;

Tuned hyperparameters for different methods across the three datasets. For DP-SGD and PDA-DPMD, the values refer to (LR, clip); for DP-RMSProp and AdaDPS, (LR, clip, adaptivity ϵ); and for DP 2, (LR for SGD iters, LR for RMSProp iters, clip for SGD iters, clip for RMSProp iters, adaptivity ϵ, delay s). Bold values lie on the edges of the hyperparameter grids.

Tuned hyperparameters for ablation studies (Section 5.3) on IMDB and StackOverflow. Both variants use the RMSProp update rule for the adaptive steps. Bold values lie on the edges of the hyperparameter grids. For Variants 1 and 2 respectively, the values refer to (LR for SGD iters, LR for RMSProp iters, clip for SGD iters, clip for RMSProp iters, adaptivity ϵ, delay s) and (LR for SGD iters, LR for RMSProp iters, clip for both SGD/RMSProp iters, adaptivity ϵ, delay s). Note that for Variant 2 the clipping threshold does not need to be tuned separately for SGD/RMSProp iterations, as it applies to preconditioned gradients in both cases.

C.3 RESULTS FOR DP 2-ADAGRAD

(Extension of Figure 5 to the AdaGrad update rule and increased computational cost.) Privacy/utility trade-offs of DP 2 compared to DP-SGD, DP-RMSProp, and DP-AdaGrad on IMDB and StackOverflow. "(4×)" denotes increasing the batch size and the number of epochs simultaneously by a factor of 4 and picking the appropriate noise multiplier to arrive at similar privacy costs (ε).

C.5 ADDITIONAL RESULTS FOR ABLATION STUDIES

Table 4 summarizes the results of the ablation studies on IMDB, StackOverflow, and MovieLens, and Figure 10 reports test accuracies on IMDB and StackOverflow during optimization. The variants are discussed in Section 5.3 and the complete algorithms are presented in Appendix D. We observe that DP 2 consistently outperforms the two (weaker) variants on all datasets, verifying our design choices for DP 2. In particular, the utility drop of Variant 2 (adding noise before preconditioning) on StackOverflow is more significant than on IMDB; we attribute this to StackOverflow being a high-dimensional learning task (roughly 5 million model parameters), where the detrimental effect of preconditioning per-coordinate noise is larger.
Summary of ablation studies on all three datasets.

which uses 1% of the training data as public data (250 examples) to approximate the preconditioner. On StackOverflow, where the same 1% public split corresponds to 2,460 examples, we observe that AdaDPS-RMSProp can outperform DP 2. On the other hand, the extra public data does not help PDA-DPMD outperform DP 2.

ACKNOWLEDGMENTS

The work of TL, ZL, and VS was supported in part by National Science Foundation Grant IIS1838017, a Google Faculty Award, a Meta Faculty Award, the Private AI Collaborative Research Institute, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or any other funding agency.

C EXPERIMENTAL DETAILS AND ADDITIONAL RESULTS

C.1 DATASETS

IMDB (Maas et al., 2011) is a binary classification dataset on sentiment analysis for movie reviews that includes 25,000/25,000 training/test samples. Each sample is a review under a vocabulary size of 10,000. We train a logistic regression model with 10,001 parameters.

StackOverflow (Kaggle, 2022; TensorFlow Federated, 2022) is a large-scale text dataset containing questions and answers from Stack Overflow. We focus on the task of classifying the tag(s) of a given sentence described in TensorFlow Federated (2022), though in the usual centralized training setting instead of a federated setting. We randomly sample 246,092 sentences for training and 61,719 for testing, where each sentence is described by 10,000 features. We format the task as a 500-class classification problem, and the resulting model has roughly 5 million parameters.

MovieLens-100k (Harper & Konstan, 2015) is a movie review dataset commonly used for recommender systems. It contains 100,000 movie ratings from 943 users on 1,682 items (≈6% non-zero entries). We study a (non-convex) matrix factorization task with embedding size 100, totaling 262,500 parameters. We treat each non-zero entry as a 'record' for differential privacy, and randomly partition the entries for training and evaluation.

C.2 HYPERPARAMETERS

Unless otherwise stated, we fix the following hyperparameters in our experiments: for IMDB, StackOverflow, and MovieLens respectively, we train for 100/50/50 epochs with batch size 64 and privacy δ = 10^-5/10^-6/10^-6. We then perform a grid search over the remaining hyperparameters:

• Learning rates: we search over {0.03, 0.1, 0.3, 1, 3, 5} for the SGD/AdaGrad update rules and over {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3} for the RMSProp update rule.

• Per-example clipping thresholds: we search over {0.1, 0.25, 0.5, 1} when performing per-example clipping on clean gradients without preconditioning (e.g., for DP-SGD updates), and over {0.1, 0.25, 0.5, 1, 2, 3, 5} when clipping preconditioned clean gradients (e.g., for DP 2 updates in adaptive iterations). The rationale is that preconditioned gradient norms are, in general, larger than those without preconditioning (recall from Section 3.2 that DP 2 applies preconditioning before privatization). For AdaDPS and DP 2-RMSProp, we also tried a few even larger clipping thresholds (≥10), though we did not perform a full sweep over the other hyperparameters at those values due to computational constraints.

• Delay parameter s: for all datasets, s (i.e., the number of optimization steps) is chosen heuristically as a function of the number of steps per epoch. When reporting the best results (e.g., Figure 4, Figure 5), we search over s ∈ {195, 390, 780} (roughly 0.5, 1, 2 epochs respectively) for IMDB (390 steps/epoch); s ∈ {100, 300, 1000, 3000} for StackOverflow (3845 steps/epoch); and s ∈ {1250, 15625, 31250, 50000} for MovieLens (1250 steps/epoch).

• Adaptivity ϵ: in our settings, the adaptivity parameter ϵ for RMSProp/AdaGrad (in the denominator of the update) affects the amount of adaptivity as well as the norms of preconditioned gradients, which may in turn influence the privacy/utility trade-off under per-example clipping.
We tune ϵ over a small grid of {10^-2, 10^-3, 10^-5, 10^-7}. All reported results use the best hyperparameter configurations, selected based on training-set metrics (as overfitting generally does not occur under DP noise). To facilitate reproducibility, we summarize the tuned hyperparameters for the main experiments and the ablation studies in Table 2 and Table 3 below, respectively.

Algorithm 2: Ablation variant 1 (extra query) using delayed preconditioners

