ON THE UNIVERSALITY OF LANGEVIN DIFFUSION FOR PRIVATE EUCLIDEAN (CONVEX) OPTIMIZATION

Anonymous

Abstract

In this paper, we revisit the problems of differentially private empirical risk minimization (DP-ERM) and differentially private stochastic convex optimization (DP-SCO). We show that a well-studied continuous-time algorithm from statistical physics, called Langevin diffusion (LD), simultaneously provides optimal privacy/utility trade-offs for both DP-ERM and DP-SCO, under both ε-DP and (ε, δ)-DP, for both convex and strongly convex loss functions. We prove new, time- and dimension-independent uniform stability properties of LD, from which we derive the corresponding optimal excess population risk guarantees for ε-DP. An important attribute of our DP-SCO guarantees for ε-DP is that they match the non-private optimal bounds as ε → ∞.

1. INTRODUCTION

Over the last decade, there has been significant progress in providing tight upper and lower bounds for differentially private empirical risk minimization (DP-ERM) (Chaudhuri et al., 2011; Kifer et al., 2012; Bassily et al., 2014; Song et al., 2013; McMahan et al., 2017; Smith et al., 2017; Wu et al., 2017; Iyengar et al., 2019; Song et al., 2020; Chourasia et al., 2021) and differentially private stochastic convex optimization (DP-SCO) (Bassily et al., 2019; Feldman et al., 2020; Bassily et al., 2020; Kulkarni et al., 2021; Gopi et al., 2022; Asi et al., 2021b), both in the ε-DP setting and in the (ε, δ)-DP setting. While we know tight bounds for both DP-ERM and DP-SCO in the (ε, δ)-DP setting (Bassily et al., 2014; 2019), the ε-DP setting (i.e., where δ = 0) is much less understood. First, to the best of our knowledge, tight DP-SCO bounds are not known under ε-DP. (In this paper, when we say a bound is tight for a problem, we always implicitly require that it reach the optimal non-private bound, including polylogarithmic factors, for the same task as ε → ∞.) Second, the algorithms for DP-ERM and DP-SCO in the ε-DP setting are inherently different from those in the (ε, δ)-DP setting: while all the algorithms in the (ε, δ)-DP setting are based on DP variants of gradient descent (Bassily et al., 2014; 2019; Feldman et al., 2020; Bassily et al., 2020), the best algorithms for ε-DP are based on a combination of the exponential mechanism (McSherry & Talwar, 2007) and output perturbation (Chaudhuri et al., 2011). Third, we know that for convex problems, moving from ε-DP to (ε, δ)-DP yields a polynomial improvement in the error bounds in terms of the model dimensionality p; it is unknown whether such an improvement is even possible when the loss functions are non-convex. In this work, we close these gaps in our understanding of DP-ERM and DP-SCO via the following contributions.

1. We provide a unified framework for DP-ERM/DP-SCO via an information-theoretic tool called Langevin diffusion (LD) (Langevin, 1908; Lemons & Gythiel, 1997), defined in eq. (1) and eq. (2), which under appropriate choices of parameters interpolates between optimal/tight utility bounds for both DP-ERM and DP-SCO, under both ε- and (ε, δ)-DP.

2. We provide tight DP-SCO bounds for both convex and strongly convex losses under ε-DP. To achieve these bounds, we show uniform stability of the exponential mechanism on strongly convex losses (which, to the best of our knowledge, was unknown prior to our work). Notably, our proof of uniform stability is almost immediate given the unified framework from item 1.

3. We provide a lower bound showing that if the loss functions are non-convex, it is not possible to obtain any polynomial improvement in the error in terms of the model dimensionality when we shift from ε-DP to (ε, δ)-DP. This is in sharp contrast to the convex setting.

4. Along the way we provide a set of results which may be of independent interest:
(a) A simple Rényi divergence bound between two Langevin diffusions run on loss functions with a bounded gradient difference.
(b) For strongly convex and smooth losses, a Rényi divergence bound between two Langevin diffusions that approaches the Rényi divergence between their stationary distributions.
(c) Improved analyses for ε-DP-ERM: for strongly convex losses we improve the bound by log factors via a better algorithm, and for non-convex losses we remove the assumption of Bassily et al. (2014) that the constraint set contains a ball of radius r.
(d) A last-iterate analysis of Langevin diffusion as an optimization algorithm, using continuous analogs of techniques in Shamir & Zhang (2013). To the best of our knowledge, this is the first such analysis tailored to a continuous-time DP algorithm. Our work initiates a systematic study of DP continuous-time optimization.
We believe this may have ramifications for the design of discrete-time DP optimization algorithms, analogous to the non-private setting. There, continuous-time dynamical viewpoints have helped in designing new algorithms, including the celebrated mirror descent (see the discussion in Chapter 3 of Nemirovskij & Yudin (1983)) and Polyak's momentum method (Polyak, 1964), and in understanding the implicit bias of gradient descent (Vardi et al., 2022). Extending the dynamical viewpoint to private optimization would help us understand the price of privacy in the convergence rate of private optimization. Further, it is known that privacy helps generalization, and that implicit bias underlies the generalization ability of (non-private) machine learning models trained by stochastic gradient descent. Taking the dynamical viewpoint may help us understand whether there is a more fundamental reason for the generalization ability of differential privacy. In the rest of this section, we elaborate on each of these conceptual contributions, and alongside highlight the technical advances they required. At the end, we compare against the most relevant works; we defer a broader comparison to other works in the area to Appendix A.

Problem description: Consider a data set D = {d_1, …, d_n} drawn from some domain τ, and an associated loss function ℓ : C × τ → R, where C ⊂ R^p is called the constraint set. The objective in DP-ERM is to output a model θ_priv, while ensuring differential privacy, that approximately minimizes the excess empirical risk

Risk_ERM(θ_priv) = (1/n) ∑_{i=1}^n ℓ(θ_priv; d_i) − min_{θ∈C} (1/n) ∑_{i=1}^n ℓ(θ; d_i).

For brevity, we will refer to (1/n) ∑_{i=1}^n ℓ(θ; d_i) as L(θ; D), the empirical loss. For DP-SCO, we assume each data point d_i in the data set D is drawn i.i.d.
from some distribution 𝒟 over the domain τ, and the objective is to minimize the excess population risk

Risk_SCO(θ_priv) = E_{d∼𝒟}[ℓ(θ_priv; d)] − min_{θ∈C} E_{d∼𝒟}[ℓ(θ; d)].

In other words, the goal of DP-ERM is to output θ_priv that minimizes the average loss on D, while the goal of DP-SCO is to output θ_priv that performs well on 'unseen' data sampled from 𝒟. It is easy to see that DP-SCO is the stronger requirement. Throughout the paper, we make two standard assumptions in differentially private optimization: (i) the loss function ℓ(θ; d) is L-Lipschitz w.r.t. the ℓ₂-norm, i.e., ‖∇_θ ℓ(θ; d)‖₂ ≤ L for all θ ∈ C, d ∈ τ, and (ii) the constraint set has bounded diameter. W.l.o.g., we assume the loss functions are twice continuously differentiable within the constraint set C; if not, we can ensure this by convolving the loss function with a finite-variance Gaussian kernel (Feldman et al., 2018). Depending on the problem context, we make additional assumptions like m-strong convexity, i.e., ∇²_θ ℓ(θ; d) ⪰ mI, and M-smoothness, i.e., ∇²_θ ℓ(θ; d) ⪯ MI, where A ⪯ B denotes that B − A is a positive semidefinite matrix. We drop the subscript θ when it is clear from the context. We provide notational details in Appendix A.

Langevin Diffusion (LD): We start with the Langevin diffusion algorithm in eq. (1), which forms the building block for all the algorithms considered in this paper. Intuitively, one should think of (1) as the limit of noisy gradient descent, and (2) as the limit of projected noisy gradient descent, both as the step size η → 0.

Langevin diffusion. Let W_t be the standard Brownian motion in p dimensions, and let β_t > 0 be the so-called inverse temperature. Langevin diffusion is the following stochastic differential equation:

dθ_t = −β_t ∇L(θ_t; D)·dt + √2·dW_t.  (1)

"Projected" Langevin diffusion. Sometimes, we will only have the Lipschitz guarantee within a constraint set. We can then consider the following "projected" version of LD:

dθ_t = −β_t ∇L(θ_t; D)·dt + √2·dW_t + ν_t 𝔏(dt), with θ_t ∈ C for all t ≥ 0,  (2)

where 𝔏 is a measure supported on {t : θ_t ∈ ∂C} and ν_t is an outer unit normal vector at θ_t for all such θ_t. See, e.g., (Bubeck et al., 2018, Sections 2.1 and 3.1) for a discussion of (2) and verification that a solution exists for convex C under M-smoothness for some finite M (which can be enforced with arbitrarily small perturbations to the loss via convolution).

1.1 OUR RESULTS AND TECHNIQUES: CONCEPTUAL CONTRIBUTIONS

Optimal excess population risk for DP-SCO under ε-DP: DP-SCO is, at this point, a very well-studied problem in the literature (see the references above). One approach to obtaining the optimal excess population risk is to first prove an optimal DP-ERM bound, and then use the uniform stability property (Bousquet & Elisseeff, 2002) (Definition B.14) of the underlying DP algorithm to obtain a population risk guarantee. These two steps indeed provide the optimal bounds of Θ(1/√n + √(p log(1/δ))/(εn)) for convex losses and Θ(1/(mn) + p log(1/δ)/(mε²n²)) for m-strongly convex losses under (ε, δ)-DP, via variants of DP-SGD (Bassily et al., 2020). One crucial aspect of these bounds is that they reach the non-private optimal SCO bounds as ε → ∞. In our work, we obtain Θ(1/√n + p/(εn)) for convex losses and Θ(1/(mn) + p² log n/(mε²n²)) for m-strongly convex losses under ε-DP. Analogous to the (ε, δ)-DP setting, these bounds reach the non-private optimum as ε → ∞. As mentioned earlier, such bounds for the ε-DP setting were unknown prior to this work.
Table 1: Summary of our results. All bounds assume ‖∇ℓ(θ; ·)‖₂ ≤ L. Here, time (T) refers to the length for which the Langevin diffusion is run; for the ε-DP results, T = ∞ means we use the stationary distribution of the Langevin diffusion. We set the diameter of the constraint set ‖C‖₂ = 1.

ERM:
- Convex: ε-DP excess risk Lp/(εn), T = ∞; (ε, δ)-DP excess risk L√(p log(1/δ))/(εn), T = 1/p.
- m-strongly convex: ε-DP excess risk L²(p² + p log n)/(mε²n²), T = ∞; (ε, δ)-DP excess risk L²p log(1/δ) log²(pεn)/(mε²n²), T = L² log(1/δ) log⁴(pεn)/(m²ε²n²).
- Non-convex: ε-DP excess risk (Lp/(εn)) log(εn/p), T = ∞; (ε, δ)-DP excess risk (Lp/(εn)) log(εn/p), T = ∞.

SCO:
- Convex: ε-DP excess risk L/√n + Lp/(εn), T = ∞; (ε, δ)-DP excess risk L/√n + L√(p log(1/δ))/(εn), T = min{1/p, log(1/δ)/(ε²n)}.
- m-strongly convex: ε-DP excess risk L²/(mn) + L²p² log n/(mε²n²), T = ∞; (ε, δ)-DP excess risk L²/(mn) + L²p log(1/δ) log²(pεn)/(mε²n²), T = L² log(1/δ) log⁴(pεn)/(m²ε²n²).

Our optimal DP-SCO bound is obtained by proving a dimension-independent uniform stability guarantee for the standard exponential mechanism on the loss function L(θ; D) + (m/2)‖θ‖₂². The translation to a DP-SCO guarantee is then immediate (Bousquet & Elisseeff, 2002): we appeal to the DP-ERM guarantee of the exponential mechanism for such score functions, and combine it with the uniform stability guarantee. To show dimension-independent uniform stability of the exponential mechanism on the regularized loss L(θ; D) + (m/2)‖θ‖₂², we view the exponential mechanism as the limiting distribution of LD, and prove a time-independent O(L²/(mn)) uniform stability bound for LD. Here, L is the ℓ₂-Lipschitz parameter of the loss function ℓ(θ; ·). Equipped with this bound, we can combine it with the DP-ERM bound to obtain the excess population risk bound we intended to achieve. We believe this proof technique may be of independent interest, as it reduces showing that the exponential mechanism is uniformly stable to showing that gradient descent has time-independent uniform stability, which is a much better understood problem.
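The view of the exponential mechanism as the limiting distribution of LD can be sanity-checked numerically in one dimension: a long run of the Euler discretization of the diffusion should produce samples distributed approximately as the Gibbs density proportional to exp(−βL). Below is a minimal sketch under illustrative assumptions (a hypothetical quadratic empirical loss; the step size and horizon are chosen for the demo, not taken from the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_samples(grad, beta, theta0, eta=1e-3, burn_in=50_000, n_samples=50_000):
    """Euler discretization of dtheta = -beta*grad(theta) dt + sqrt(2) dW.
    After burn-in, iterates are (approximate) samples from exp(-beta * L)."""
    theta = theta0
    out = []
    for step in range(burn_in + n_samples):
        noise = rng.normal() * np.sqrt(2 * eta)
        theta = theta - eta * beta * grad(theta) + noise
        if step >= burn_in:
            out.append(theta)
    return np.array(out)

# Hypothetical 1-D quadratic empirical loss L(theta) = (theta - c)^2 / 2.
# Its Gibbs distribution exp(-beta * L) is the Gaussian N(c, 1/beta).
c, beta = 1.5, 4.0
samples = langevin_samples(lambda t: t - c, beta, theta0=0.0)
print(samples.mean(), samples.var())  # approximately c and 1/beta
```

The long-run histogram matches the exponential mechanism's density, which is exactly the limiting-distribution view used in the stability argument above.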

Unification of DP-ERM and DP-SCO via LD:

In this work we show that LD (Langevin, 1908; Lemons & Gythiel, 1997), defined in (1) and (2), interpolates between the optimal known utility bounds for both DP-ERM and DP-SCO, under both ε- and (ε, δ)-DP. Based on the inverse temperature β_t and the length T for which the diffusion process is run, Langevin diffusion not only recovers the best known bounds in all the settings mentioned above, but also provides new, previously unknown results. For example, it recovers the optimal excess population risk bound for convex and strongly convex DP-SCO under ε-DP, improving on the previous best known bounds of Asi et al. (2021b). A summary of the results we achieve in this paper is given in Table 1. While our algorithm is purely information-theoretic, it is worth highlighting that it was not a priori clear whether such a universal object, achieving optimal excess risk under both ε- and (ε, δ)-DP, even exists. As we discuss later, the inverse temperature settings of LD that give rise to the best algorithms for ε-DP and for (ε, δ)-DP are off by roughly a factor of √p (p being the number of model parameters, i.e., the dimension of the parameter space). Hence, in the constant-ε regime for (ε, δ)-DP, we analyze LD far before it has converged to a stationary distribution.

Two-phase analysis of DP-Langevin diffusion:

In the process of demonstrating the universality of LD, we discovered two clear phases in the diffusion process, which enable us to obtain either (ε, δ)-DP or ε-DP results. For brevity, it is easiest to explain this in the context of Lipschitz convex losses; we provide details for other losses in the remainder of the paper. As is evident from Table 1, to obtain the DP-ERM and DP-SCO bounds in the (ε, δ)-DP setting, LD runs for T ≈ 1/p, whereas in the ε-DP setting it needs to run until it converges to a stationary distribution. It is easy to show that when T ≤ 1/p, the diffusion process cannot converge to a stationary distribution under any reasonable choice of inverse temperature that ensures DP. This follows from the fact that, within the mentioned time period, for certain loss functions and constraint sets, with high probability LD does not escape a ball of diameter c‖C‖₂ for some c < 1, which contains only a c^Ω(p) fraction of the probability mass of the stationary distribution. However, LD is still able to obtain the desired ERM bounds in this time, because the desired ERM bounds are satisfied by points at distance Ω(√p/(εn)) from the empirical minimizer (see Appendix H for a more detailed discussion). In the ε-DP case, by contrast, we run LD until it has converged to the stationary distribution, which is the exponential mechanism (see Appendix D for a more detailed discussion). As a result, the utility analyses for the two phases are very different. In the (ε, δ)-DP case, we analyze the algorithm as a noisy gradient flow and use tools from optimization theory (Wilson et al., 2021), whereas in the ε-DP setting we analyze the utility in terms of the stationary distribution that the diffusion process converges to, i.e., the Gibbs distribution.
Following this viewpoint, if one studies DP-SGD (Bassily et al., 2014) (which is a typical optimization algorithm in the (ε, δ)-DP case) as a discretization of the LD process, one can observe that under optimal parameter settings it does not converge to anywhere near the stationary distribution (see Appendix H).
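Concretely, the discretization in question is projected noisy gradient descent: θ_{k+1} = Π_C(θ_k − ηβ∇L(θ_k; D) + √(2η)·ξ_k) with ξ_k ~ N(0, I_p), which recovers (2) as η → 0. A minimal sketch follows (a hypothetical quadratic loss and unit-ball constraint set; the parameters are illustrative, not the tuned settings from the analysis):

```python
import numpy as np

rng = np.random.default_rng(1)

def project_ball(theta, radius=1.0):
    """Euclidean projection onto C = {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def discretized_projected_ld(grad_L, p, beta, eta, num_steps):
    """One run of projected noisy gradient descent, the discretization of (2):
    theta <- Pi_C(theta - eta*beta*grad + sqrt(2*eta)*xi), xi ~ N(0, I_p)."""
    theta = np.zeros(p)
    for _ in range(num_steps):
        xi = rng.standard_normal(p)
        theta = project_ball(theta - eta * beta * grad_L(theta) + np.sqrt(2 * eta) * xi)
    return theta

# Hypothetical smooth empirical loss L(theta; D) = ||theta - mu||^2 / 2,
# whose empirical minimizer mu lies inside the unit ball.
mu = np.full(5, 0.3)
theta_T = discretized_projected_ld(lambda t: t - mu, p=5, beta=50.0, eta=1e-3, num_steps=20_000)
print(np.linalg.norm(theta_T - mu))  # the iterate hovers near mu, fluctuation scale ~ sqrt(p/beta)
```

As the text notes, under the (ε, δ)-DP parameter settings the iterate concentrates near the empirical minimizer long before the law of θ_T resembles the stationary distribution.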

1.2. OUR RESULTS AND TECHNIQUES: TECHNICAL CONTRIBUTIONS

In this section we give an overview of our technical results. Due to space constraints, we only give formal statements for our most novel results and high-level descriptions for remaining results.

Rényi divergence bounds for LD (Appendix C and Appendix I):

We cannot use standard composition theorems of DP (Dwork & Roth, 2014), because the underlying algorithm is a continuous-time process. One main technical contribution of this work is to quantify the Rényi divergence between two LD processes run on neighboring data sets:

Lemma 1.1 (Simplified version of Lemma C.1). Let Θ_[0,T] and Θ′_[0,T] be the distributions of the trajectory {θ_t}_{t∈[0,T]} in (2) when run on D and D′, respectively. Suppose that ‖∇L(θ; D) − ∇L(θ; D′)‖₂ ≤ ∆ for all θ. Then for all α ≥ 1:

R_α(Θ_[0,T], Θ′_[0,T]) ≤ (α∆²/4) ∫₀ᵀ β_t² dt.

The idea behind the lemma is to define an infinite sequence of pairs of DP-SGD runs on D, D′ with decreasing step sizes, such that (i) a fixed Rényi divergence bound holds for all pairs in the sequence, and (ii) the trajectory of (2) is the limit of the sequence. We then conclude using Fatou's lemma. This result forms the foundation of the privacy analysis in the rest of the paper. A similar result was provided in Chourasia et al. (2021). Our result is stronger in that it proves a divergence bound between the entire histories {θ_t}_{0≤t≤T}, rather than just the last iterate θ_T, which enables us to output weighted averages of θ_t privately. Furthermore, it is proven using only tools from the differential privacy literature and Fatou's lemma, yielding an arguably much simpler proof. Additionally, for the special case of strongly convex and smooth loss functions, we leverage techniques from Vempala & Wibisono (2019) and Ganesh & Talwar (2020) to show a Rényi divergence bound (Lemma I.1) that approaches the divergence between the stationary distributions of the LD processes, i.e., the privacy guarantee of the exponential mechanism. Since this bound is not needed for our DP-ERM or DP-SCO results, we defer it to Appendix I.
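The proof sketch can be mirrored numerically: a discretized step of size η has ℓ₂-sensitivity ηβ∆ and adds Gaussian noise of variance 2η, so the Rényi bound for the Gaussian mechanism charges αηβ²∆²/4 per step, and composing over T/η steps gives (α∆²/4)·β²T for constant β, matching the integral in the lemma independently of η. A quick check with illustrative constants:

```python
def renyi_gaussian(alpha, sensitivity, sigma2):
    """Renyi divergence of order alpha between N(0, sigma2*I) and N(shift, sigma2*I)
    with ||shift||_2 <= sensitivity (standard Gaussian-mechanism bound)."""
    return alpha * sensitivity**2 / (2 * sigma2)

def discrete_composition(alpha, delta, beta, T, eta):
    """Sum of per-step Renyi bounds for the Euler discretization of LD:
    per-step sensitivity eta*beta*delta, per-step noise variance 2*eta."""
    num_steps = round(T / eta)
    return num_steps * renyi_gaussian(alpha, eta * beta * delta, 2 * eta)

alpha, delta, beta, T = 2.0, 0.5, 10.0, 1.0
continuous_bound = alpha * delta**2 / 4 * beta**2 * T  # (alpha*Delta^2/4) * int beta_t^2 dt
for eta in (1e-1, 1e-2, 1e-3):
    print(discrete_composition(alpha, delta, beta, T, eta))  # same value for every eta
print(continuous_bound)
```

The step size cancels exactly, which is why the limiting (Fatou's lemma) argument yields a clean bound for the continuous-time trajectory.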

LD as exponential mechanism, DP-ERM under ε-DP (Appendix D):

The privacy analysis of Langevin diffusion (LD) allows us to derive results on private optimization via the exponential mechanism (McSherry & Talwar, 2007). One technical challenge we need to overcome while mapping LD to the exponential mechanism is optimizing within the constraint set C. We do this via a result of Tanaka (1979), stated in Lemma D.2. This, in turn, readily gives us optimal empirical risk for ℓ₂-Lipschitz convex functions (described in Appendix D), using the utility analysis in Bassily et al. (2014) (see Theorem D.1). Getting optimal empirical risk for strongly convex Lipschitz functions requires an algorithmic improvement. Recall that the algorithm in Bassily et al. (2014) for strongly convex losses uses a two-step process: output perturbation (Chaudhuri et al., 2011), followed by a one-shot exponential mechanism. In Algorithm 3, we propose an iterated exponential mechanism, which may be of independent interest. It allows us to define an algorithm based purely on an exponential mechanism over the loss function L(θ; D). The idea is to iteratively run the exponential mechanism on a sequence of shrinking constraint sets C = C₀ ⊇ C₁ ⊇ … ⊇ C_k, where C_{i+1} = B(θ_i, r_i) ∩ C_i and θ_i is the output of the i-th run. This yields E[Risk_ERM(θ_k)] = O(L²(p² + p log n)/(mε²n²)), a mild improvement on the excess empirical risk of Bassily et al. (2014), who achieved O(L²p² log n/(mε²n²)). For non-convex functions, the results of Bassily et al. (2014) either assume a small ball of radius r is contained in the constraint set C, with utility depending on log(‖C‖₂/r), or use the discrete exponential mechanism on a ball-covering of C. One of our technical contributions is to show that the continuous exponential mechanism achieves the optimal excess loss without the small-ball assumption (Theorem D.5). Our algorithm is (arguably) more flexible than the ball-covering algorithm in Bassily et al.
(2014), since for certain classes of non-convex losses one can still approximately sample from the continuous exponential mechanism efficiently (e.g., if the stationary distribution satisfies a Poincaré inequality or isoperimetry (Chewi et al., 2021)). Table 1 summarizes these results.

LD as noisy gradient flow, and DP-ERM under (ε, δ)-DP (Appendix F.1): The view of LD we took for the ε-DP case applies once the diffusion has converged to a stationary distribution. We also study LD in the setting where it is far from convergence. In fact, we argue in Appendix H that, under the settings we operate with, the algorithm's convergence to the stationary distribution is not much better than to a point distribution. We present two results under (ε, δ)-DP: (i) LD achieves optimal excess empirical risk bounds for convex losses (Theorem F.1), and (ii) LD achieves optimal excess empirical risk bounds for strongly convex losses (Theorem F.2). The optimality of these algorithms follows from the standard lower bounds in Bassily et al. (2014), and the privacy guarantee follows from the privacy accounting machinery described above (see Table 1 for a summary of the bounds). Our utility bound holds for the last model θ_T output by LD.

Optimal DP-SCO under ε-DP (Appendix E): Direct approaches lose a √p-factor. We improve this and get an optimal bound as follows:

Theorem 1.3 (Simplified version of Theorem E.4). Let θ_priv be the output of the exponential mechanism when run on the regularized loss L_m(θ; D) := L(θ; D) + (m/2)‖θ‖₂². For an appropriate choice of m we have

E_{θ_priv}[Risk_SCO(θ_priv)] = O(Lp‖C‖₂/(εn) + L‖C‖₂/√n).

The above bound is obtained by showing a dimension-independent uniform stability result for the exponential mechanism on strongly convex losses: we view the exponential mechanism as the limit, as η → 0 and T → ∞, of gradient descent, which has a dimension- and time-independent uniform stability bound for strongly convex losses (see Corollary E.3). While Raginsky et al. (2017), as well as extensions of results in Bassily et al.
(2014), give dimension-dependent uniform stability bounds for the exponential mechanism, a dimension-independent bound was not known prior to this work. A dimension-independent uniform stability bound was also proven independently of us in the contemporary work of Gopi et al. (2022) (Theorem 6.10), albeit using a different proof; we discuss it in more detail in Appendix A. Given uniform stability of the exponential mechanism on strongly convex losses, we apply it in the convex case by adding a regularizer to the loss function. In the strongly convex case, where we use the iterated exponential mechanism, we show that the population and empirical minimizers of a strongly convex loss function are close to each other with high probability. Given this claim and the uniform stability bound, only a slight modification of our DP-ERM analysis is needed to get the desired DP-SCO bound, informally stated below:

Theorem 1.4 (Simplified version of Theorem E.6). Let θ_k be the output of Algorithm 3, with a slight modification to the algorithm's parameters. Then it is ε-DP, and for m-strongly convex losses it satisfies E[Risk_SCO(θ_k)] = O(L²p² log n/(mε²n²) + L²/(mn)).

We also prove uniform stability of LD in the (ε, δ)-DP setting by analyzing noisy gradient descent as the learning rate tends to zero. The optimality of the results above follows from standard arguments (see Appendix C in Bassily et al. (2019)).

Lower bound for non-convex functions (Appendix G): For convex loss functions, it is known that the excess error improves by a √p factor in going from ε-DP to (ε, δ)-DP. However, the excess empirical risk of our algorithm for non-convex loss functions under (ε, δ)-DP is the same as that under ε-DP. We finally show that this is not an artifact of our algorithm or analysis; rather, in general, it is not possible to get an improvement by going from ε-DP to (ε, δ)-DP for non-convex loss functions:

Theorem 1.5 (Simplified version of Theorem G.1).
There exists a dataset D = {d_1, …, d_n} and a Lipschitz (non-convex) loss function for which any (ε, δ)-DP algorithm must incur excess empirical risk matching the ε-DP lower bound, up to logarithmic factors.

The lower bound uses a reduction to the top-k-selection problem defined over a universe of size s. In particular, we define a packing of the p-dimensional Euclidean ball such that there is a bijective mapping between the centers of the packing and [s]. We then define a non-convex function that attains its minimum at the center corresponding to the coordinate j ∈ [s] with maximum frequency. Since the size of an α-net is ≈ (1/α)^p and the mapping is bijective, this gives the desired lower bound using Steinke & Ullman (2017).

Comparison to Gopi et al. (2022):

In a concurrent, independent, and complementary work on convex losses, Gopi et al. (2022) showed that the stationary distribution of a Metropolis-Hastings-style process provides the optimal algorithm for both DP-SCO and DP-ERM under (ε, δ)-DP for ℓ₂-Lipschitz convex losses. Their results immediately imply a single algorithm that spans ε- and (ε, δ)-DP for DP-ERM. In comparison, our work captures a much larger spectrum of unification, i.e., DP-ERM and DP-SCO, under ε- and (ε, δ)-DP, for ℓ₂-Lipschitz losses with or without strong convexity. Furthermore, unlike Gopi et al. (2022), our privacy analysis does not rely on convexity. In terms of gradient oracle complexity, Gopi et al. (2022) give oracle-efficient algorithms for their construction. While we acknowledge that the oracle complexity of our LD-based algorithm is an important research question, we leave it for future work (see Section 1.4) and focus on statistical efficiency only. The DP-SCO result in Gopi et al. (2022) relies on a uniform stability property of the exponential mechanism, analogous to our work. While their result relies on bounding the Wasserstein distance between two exponential mechanisms run on neighboring data sets, our uniform stability guarantee follows immediately from the uniform stability guarantee of the diffusion process (which in the limit matches the exponential mechanism).

Comparison to Chourasia et al. (2021): Follow-up work has extended the analysis of Chourasia et al. (2021) to the mini-batch setting, showing that under smoothness and strong convexity of the loss function L(θ; D), the privacy cost of DP-SGLD converges to a finite stationary value, even as the number of time steps goes to ∞. This in turn improves the gradient oracle complexity over differentially private stochastic gradient descent, dubbed DP-SGD (Algorithm 1). Our analysis of DP-LD improves on the result of Chourasia et al.
(2021) by allowing divergence bounds between the entire histories {θ_t}_{0≤t≤T}, rather than just the last iterate θ_T, which enables us to output weighted averages of θ_t privately; this is necessary for some of our (ε, δ)-DP guarantees. Altschuler & Talwar (2022), in a follow-up to our work, remove the requirement of strong convexity in Chourasia et al. (2021) and provide an analogous bound, only for the last iterate θ_T, using a different technique based on optimal transport. In general, these results are orthogonal to ours, since they do not seek to unify the existing algorithms or to provide tighter utility/privacy trade-offs via the Langevin dynamics/diffusion viewpoint.

Comparison to Asi et al. (2021b): Asi et al. (2021b) provide the previously best known bounds for Lipschitz and m-strongly convex losses under ε-DP. Notice that their bounds do not reach the non-private optimal bounds of O(1/√n) and O(1/(mn)), respectively, as ε → ∞; our bounds, on the other hand, match the non-private optimal bounds as ε → ∞. To the best of our knowledge, the polylog(n) dependence in the non-private part of the error bounds of Asi et al. (2021b) is not a slack in the analysis, but is unavoidable for their algorithm. We note that Asi et al. (2021b) has the advantage of requiring weaker primitives (solving an ERM problem instead of running the exponential mechanism), and is thus easier to implement. However, our improvements are not solely due to using stronger primitives; e.g., our uniform stability bound is a generalization of a uniform stability bound implicitly used in their paper.
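The time-independent uniform stability bound referenced here, namely that two gradient descent trajectories on neighboring datasets stay within 2L/(mn) of each other for m-strongly convex losses regardless of the number of steps, is easy to observe numerically. A sketch on hypothetical strongly convex quadratic per-example losses (illustrative, not the losses analyzed in the paper):

```python
import numpy as np

def gd(data, m, eta, num_steps, theta0=0.0):
    """Gradient descent on the m-strongly convex empirical loss
    L(theta; D) = mean_i [ m/2 * theta^2 - d_i * theta ]."""
    theta = theta0
    for _ in range(num_steps):
        grad = m * theta - np.mean(data)
        theta -= eta * grad
    return theta

rng = np.random.default_rng(2)
m, L, n = 0.5, 1.0, 100
D = rng.uniform(-L, L, size=n)   # |d_i| <= L, so neighboring gradients differ by <= 2L/n
D_prime = D.copy()
D_prime[0] = -D[0]               # neighboring dataset: one data point replaced

gaps = [abs(gd(D, m, 0.1, T) - gd(D_prime, m, 0.1, T)) for T in (10, 100, 10_000)]
print(gaps)  # stays below 2L/(m*n) = 0.04 at every horizon, not growing with T
```

Because the map is a contraction for strongly convex losses, the gap saturates at the distance between the two empirical minimizers instead of accumulating over time; this is the shape of the argument that, in the η → 0, T → ∞ limit, yields stability of the exponential mechanism.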


We provide a thorough comparison to other related works in Appendix A.

1.4. FUTURE DIRECTIONS

While we know that most of the DP-ERM and DP-SCO bounds are optimal, our understanding of the optimal rates of convergence to these error bounds (in terms of gradient oracle complexity (Bubeck, 2015)) is far from complete. For example, Kuru et al. (2020) show DP algorithms with accelerated oracle complexity for strongly convex and smooth losses; can we obtain optimal DP-SCO/DP-ERM rates with accelerated oracle complexity without strong convexity? Understanding the trajectory of private optimization has further ramifications, such as clarifying the natural scope of higher-order descent methods under privacy constraints, and explaining phenomena such as gradients being heavy-tailed or lying in a low-dimensional subspace. For example, in the non-private setting, higher-order methods can be naturally explained using variational methods that study the trajectory of optimization (Wibisono et al., 2016). For DP, one can study the corresponding stochastic variational methods. Here, we can treat differentiation as a linear operator and then use the machinery of operator algebra to understand the conditions necessary for the calculus that allows us to derive stochastic variational methods. From a practical perspective, these methods can help us understand whether DP-SGD converges to a robust network when training deep neural networks. Without privacy, we know that there is an implicit bias of gradient descent towards non-robust local minima of non-convex problems, even though robust networks exist (Vardi et al., 2022). However, because of its stochasticity, DP-SGD behaves like the so-called Ornstein-Uhlenbeck process. An immediate consequence is that the second-order term of the Taylor expansion becomes active, adding a regularizer-like effect. This phenomenon does not exist in gradient flow, and we conjecture that it might be the critical aspect of DP-SGD that allows convergence to a robust network.

A OTHER RELATED WORK

Work on optimization: Although optimization methods in computer science have mostly been discrete, there is a vast literature that studies optimization from a continuous-time dynamical point of view, with the earliest examples being mirror descent (see the discussion in Nemirovskij & Yudin (1983, Chapter 3)) and Polyak's momentum method (Polyak, 1964). In fact, Polyak in his 1964 paper motivated his approach via a heavy ball moving in a potential, and used this physical intuition to give a rigorous proof for quadratic losses. More recent works have shown a close connection between gradient flow and gradient descent (Wilson, 2018), and between accelerated methods and second-order ordinary differential equations, by taking a variational perspective (Hu & Lessard, 2017; Li et al., 2017; Su et al., 2014; Wibisono et al., 2016; Wilson et al., 2021). In the unconstrained case, this can be interpreted as a damped nonlinear oscillator (Cabot et al., 2009). This observation has led to fruitful work on an averaging interpretation of accelerated dynamics (Krichene et al., 2015; 2016), and is also a cornerstone of the "restarting" heuristic (O'donoghue & Candes, 2015). The idea of approximating discrete-time stochastic algorithms by continuous-time equations can be traced back to the vast literature on stochastic approximation theory; we refer the reader to the excellent monograph by Kushner and Yin on the topic (Harold et al., 1997). In optimization theory with machine learning as the motivation, the earliest works, to the best of our knowledge, that studied the dynamical properties of stochastic gradient descent are the independent and concurrent works of Li et al. (2017); Mandt et al. (2017); Vollmer et al. (2016).
The idea of discretizing stochastic differential equations (and, by extension, stochastic gradient descent and its variants) dates back to the seminal work of Mil'shtejn (1975), who performed an extensive numerical analysis of stochastic differential equations. Since then, several works have studied continuous-time gradient descent (Hu et al., 2019; Feng et al., 2017). A long line of work on sampling shows that, using the Langevin dynamics, under certain assumptions (often strong convexity and smoothness, or variants thereof), one can efficiently obtain an approximate sample from the stationary distribution. In particular, the gradient oracle complexity of these results is often linear in the dimension and inverse-polynomial in the approximation error. The error metric varies from paper to paper; originally, total variation distance, Wasserstein distances, and KL divergences were more commonly studied, but starting with Vempala & Wibisono (2019), recent works have focused on Rényi divergence bounds.

Work on DP and optimization:

The connection between dynamical systems and differential privacy is also not new, including for non-convex loss functions. In a concurrent and complementary work on convex losses, Gopi et al. (2022) study private optimization and show the universality of the exponential mechanism for both stochastic convex optimization and empirical risk minimization. Their analysis takes the sampling perspective, i.e., it applies once the diffusion process has converged. It is worth mentioning that objective perturbation (Chaudhuri et al., 2011; Kifer et al., 2012) can potentially be thought of as a (near) universal algorithm for the problem classes considered in this paper, albeit with the following two caveats: i) the instantiations of the algorithm for ε-DP and (ε, δ)-DP require noise drawn from two different distributions, namely, the Gamma distribution and the Normal distribution, and ii) it requires the loss function ℓ(θ; ·) to be twice continuously differentiable and ∇²_θ ℓ(θ; ·) to have near-constant rank. As discussed in the remainder of our paper, Langevin diffusion does not require any such assumptions.

B NOTATION AND PRELIMINARIES

In this section, we give a brief exposition of the concepts and results used in the rest of the paper. In Table 2 we provide a summary of the notation used in the paper.

Background on Langevin dynamics. One of the important tools in stochastic calculus is Ito's lemma (Itô, 1944). It can be seen as the stochastic calculus counterpart of the chain rule; it can be derived from Taylor's expansion by noting that the second-order term does not vanish under quadratic variation:

Lemma B.1 (Ito's lemma (Itô, 1944)). Let $x_t \in \mathbb{R}^p$ be governed by the Langevin diffusion process $\mathrm{d}x_t = \mu_t\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t$, where $W_t$ is the standard Brownian motion in $p$ dimensions, $\mu_t \in \mathbb{R}^p$ is the drift, and $\sigma_t^2 \in \mathbb{R}$ is the variance. We have the following for any fixed (twice-differentiable) function $f : \mathbb{R}^p \to \mathbb{R}$:
$$\mathrm{d}f(x_t) = \left(\Big\langle \nabla f(x)\big|_{x=x_t},\ \mu_t \Big\rangle + \frac{\sigma_t^2}{2}\sum_{i=1}^{p}\frac{\partial^2 f(x)}{\partial x_i^2}\bigg|_{x=x_t}\right)\mathrm{d}t + \sigma_t\Big\langle \nabla f(x)\big|_{x=x_t},\ \mathrm{d}W_t \Big\rangle.$$

  R_α(·, ·)    Rényi divergence of order α
  T            continuous time
  A ⪰ 0        A is positive semidefinite
  A ⪰ B        A − B is positive semidefinite
  I_p          the p × p identity matrix
Table 2: Notation

Definition B.2 (Rényi divergence). For distributions P and Q and order α > 1,
$$R_\alpha(P, Q) = \frac{1}{\alpha - 1}\ln \int_{\mathrm{supp}(Q)} \frac{P(x)^\alpha}{Q(x)^{\alpha - 1}}\,\mathrm{d}x = \frac{1}{\alpha - 1}\ln \mathbb{E}_{x\sim Q}\left[\frac{P(x)^\alpha}{Q(x)^\alpha}\right].$$
The α-Rényi divergence for α = 1 (resp. ∞) is defined by taking the limit of R_α(P, Q) as α approaches 1 (resp. ∞), and equals the KL divergence (resp. max divergence).

We next define differential privacy, our choice of the notion of data privacy. Central to the notion of differential privacy is the definition of adjacent or neighboring datasets. Two datasets D and D′ are called adjacent (denoted D ∼ D′) if they differ in exactly one data point.

Definition B.3 (Approximate differential privacy (Dwork et al., 2006b;a)). A randomized mechanism M : D^n → R is said to satisfy (ε, δ)-differential privacy, or (ε, δ)-DP for short, if for any adjacent D, D′ ∈ D^n and measurable subset S ⊆ R, it holds that Pr[M(D) ∈ S] ≤ e^ε Pr[M(D′) ∈ S] + δ. When δ = 0, this is known as pure differential privacy, denoted ε-DP.
Definition B.4 (Rényi differential privacy (Mironov, 2017)). A randomized mechanism M : D^n → R is said to satisfy (α, ε)-Rényi differential privacy, or (α, ε)-RDP for short, if for any adjacent D, D′ ∈ D^n it holds that R_α(M(D), M(D′)) ≤ ε.

It is easy to see that ε-DP is exactly (∞, ε)-RDP. Similarly, the following fact relates (ε, δ)-DP to (α, ε)-RDP:

Fact B.5 ((Mironov, 2017, Proposition 3)). If M satisfies (α, ε)-RDP, then M is (ε + log(1/δ)/(α − 1), δ)-differentially private for any 0 < δ < 1.

Rényi divergences satisfy a number of other useful properties, which we list here.

Fact B.6 (Monotonicity). For 1 ≤ α₁ ≤ α₂ we have R_{α₁}(P, Q) ≤ R_{α₂}(P, Q).

Fact B.7 (Post-processing). For any (possibly randomized) map f : X → Y we have R_α(f(P), f(Q)) ≤ R_α(P, Q).

Lemma B.8 (Gaussian dichotomy (van Erven & Harremos, 2014, Example 3)). Let P = P₁ × P₂ × ⋯ and Q = Q₁ × Q₂ × ⋯, where P_i and Q_i are unit-variance Gaussian distributions with means μ_i and ν_i, respectively. Then R_α(P_i, Q_i) = (α/2)(μ_i − ν_i)², and by additivity, for α > 0,
$$R_\alpha(P, Q) = \frac{\alpha}{2}\sum_{i=1}^{\infty}(\mu_i - \nu_i)^2.$$
As a corollary, we have $R_\alpha\big(N(0, \sigma^2 I_p), N(x, \sigma^2 I_p)\big) \le \frac{\alpha\|x\|_2^2}{2\sigma^2}$.

Fact B.9 (Adaptive composition (Mironov, 2017, Proposition 1)). Let X₀, X₁, …, X_k be arbitrary sample spaces. For each i ∈ [k], let f_i, f′_i : Δ(X_{i−1}) → Δ(X_i) be maps from distributions over X_{i−1} to distributions over X_i such that for any distribution X_{i−1} over X_{i−1}, R_α(f_i(X_{i−1}), f′_i(X_{i−1})) ≤ ε_i. Then, for F, F′ : Δ(X₀) → Δ(X_k) defined as F(·) = f_k(f_{k−1}(… f₁(·) …)) and F′(·) = f′_k(f′_{k−1}(… f′₁(·) …)), we have R_α(F(X₀), F′(X₀)) ≤ Σ_{i=1}^k ε_i for any X₀ ∈ Δ(X₀).

Fact B.10 (Weak triangle inequality (Mironov, 2017, Proposition 11)). For any α > 1, q > 1 and distributions P₁, P₂, P₃ with the same support:
$$R_\alpha(P_1, P_3) \le \frac{\alpha - 1/q}{\alpha - 1}\,R_{q\alpha}(P_1, P_2) + R_{\frac{q\alpha - 1}{q - 1}}(P_2, P_3).$$

We discuss two differentially private mechanisms for optimization in this paper. The first one is the exponential mechanism.
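As a quick sanity check on the Gaussian formula in Lemma B.8 (an illustrative sketch of ours, not part of the paper's development), one can compare the closed form against a direct numerical integration of the Rényi divergence in one dimension:

```python
import math

def renyi_gaussian_numeric(mu, nu, sigma, alpha, lo=-30.0, hi=30.0, step=1e-3):
    # Numerically evaluate R_alpha(N(mu, sigma^2), N(nu, sigma^2)) via a
    # Riemann sum of the integral of P(x)^alpha * Q(x)^(1 - alpha).
    def pdf(x, m):
        return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    x, total = lo, 0.0
    while x < hi:
        total += pdf(x, mu) ** alpha * pdf(x, nu) ** (1 - alpha) * step
        x += step
    return math.log(total) / (alpha - 1)

def renyi_gaussian_closed_form(mu, nu, sigma, alpha):
    # Closed form from Lemma B.8: alpha * (mu - nu)^2 / (2 sigma^2).
    return alpha * (mu - nu) ** 2 / (2 * sigma ** 2)
```

For example, with μ = 0, ν = 1, σ = 1 and α = 2, both evaluate to 1.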
Definition B.11 (Exponential mechanism (McSherry & Talwar, 2007)). Given a privacy parameter ε, an arbitrary domain D, a range R, and a loss function ℓ : D × R → R, the exponential mechanism samples a single element from R according to the probability distribution
$$\pi_D(r) = \frac{e^{-\varepsilon\,\ell(D, r)/2\Delta}}{\sum_{r' \in \mathcal{R}} e^{-\varepsilon\,\ell(D, r')/2\Delta}},$$
where Δ is the sensitivity of ℓ, defined as Δ := max_{D ∼ D′, r ∈ R} |ℓ(D, r) − ℓ(D′, r)|.

The second algorithm that we discuss is the stochastic gradient descent of Bassily et al. (2014), presented in Algorithm 1. The algorithm can be seen as a noisy stochastic variant of the classic gradient descent algorithm, where the stochasticity comes from two sources in every iteration: the sampling of d_t, and the explicit noise added to the gradient before the descent step.

Algorithm 1 Noisy stochastic gradient descent (Bassily et al., 2014)
1: Initialize θ₀ ∈ C arbitrarily.
2: for t = 0 to T − 1 do
3:   Sample d_t uniformly at random from D.
4:   ∇̂_priv ← ∇ℓ(θ_t; d_t) + N(0, σ²I), where σ² = 8TL² log(1/δ)/ε².
5:   θ_{t+1} ← Π_C(θ_t − η · ∇̂_priv), where Π_C(v) = arg min_{θ∈C} ‖v − θ‖₂.
6: end for
7: return θ_T.

We use the result of Steinke & Ullman (2017) for our lower bound proof. We use their equivalent result for the empirical mean (see equation (2) in Steinke & Ullman (2017)) and for privacy parameters (ε, δ), using a standard reduction (Bun et al., 2018; Steinke & Ullman, 2015)⁷:

Theorem B.12. Fix n, s, k ∈ N. Set β = 1 + (1/2) log(s/(8 max{2k, 28})). Let P₁, …, P_s ∼ Beta(β, β), and let X := {x₁, …, x_n} be such that x_i ∈ {0, 1}^s for all i ∈ [n], where each x_{i,j} is independent (conditioned on P) with E[x_{i,j}] = P_j for all i ∈ [n] and j ∈ [s]. Let M : ({0, 1}^s)^n → {0, 1}^s be (1, 1/(ns))-differentially private. Suppose ‖M(x)‖₁ = ‖M(x)‖₂² = k for all x with probability 1, and
$$\mathbb{E}_M\left[\frac{1}{n}\sum_{i=1}^{n}\sum_{j : M(x)_j = 1} x_{i,j}\right] \ge \frac{1}{n}\max_{\substack{S \subset [s]\\ |S| = k}}\sum_{i=1}^{n}\sum_{u \in S} x_{i,u} - \frac{k}{20}. \tag{3}$$
Then n ∈ Ω(√k · log(s/k)).

Results from statistics and machine learning: We will sometimes use Fatou's lemma in our proofs.
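To make Algorithm 1 concrete, here is a minimal sketch of noisy projected SGD in Python (our own illustration with a toy 1-D mean-estimation loss ℓ(θ; d) = ½(θ − d)²; the clipping and constant choices of Bassily et al. (2014) are not reproduced):

```python
import math
import random

def noisy_sgd(data, T, eps, delta, L, radius, eta, seed=0):
    """Projected noisy SGD in the spirit of Algorithm 1, on the toy loss
    l(theta; d) = 0.5 * (theta - d)^2, whose gradient is theta - d."""
    rng = random.Random(seed)
    sigma2 = 8 * T * L ** 2 * math.log(1 / delta) / eps ** 2  # noise variance of Algorithm 1
    theta = 0.0
    for _ in range(T):
        d = rng.choice(data)                              # sample d_t uniformly from D
        grad_priv = (theta - d) + rng.gauss(0.0, math.sqrt(sigma2))
        theta = theta - eta * grad_priv
        theta = max(-radius, min(radius, theta))          # Euclidean projection onto C = [-radius, radius]
    return theta
```

By construction, the returned iterate always lies in the constraint set C.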
The form we will use is stated here for convenience:

Lemma B.13 (Fatou's lemma). Let {X_i} be a sequence of random variables for which there is a constant c such that Pr[X_i ≥ c] = 1 for all i. Then E[lim inf_{i→∞} X_i] ≤ lim inf_{i→∞} E[X_i].

For our SCO bounds, we will use uniform stability, a notion of algorithmic stability introduced to derive high-probability bounds on the generalization error. Formally, it is defined as follows:

⁷ Steinke & Ullman (2017) state their result in terms of the population mean and privacy parameters (1, 1/(ns)).

Definition B.14 (Uniform stability (Bousquet & Elisseeff, 2002)). A mechanism M is μ(n)-uniformly stable with respect to ℓ if for any pair of datasets D, D′ of size n differing in at most one individual,
$$\sup_{d\in\tau}\Big|\mathbb{E}_M[\ell(M(D), d)] - \mathbb{E}_M[\ell(M(D'), d)]\Big| \le \mu(n).$$

In this paper, we will need the following result.

Lemma B.15 (Bousquet & Elisseeff (2002)). Suppose M is μ(n)-uniformly stable. Then:
$$\mathbb{E}_{D\sim\mathcal{D}^n, M}[\mathrm{Risk}_{\mathrm{SCO}}(M(D))] \le \mathbb{E}_{D\sim\mathcal{D}^n, M}[\mathrm{Risk}_{\mathrm{ERM}}(M(D))] + \mu(n).$$

C R ÉNYI DIVERGENCE BOUND FOR LANGEVIN DIFFUSION (LD)

This section is devoted to proving the divergence bound between two LD processes run on neighboring datasets. It forms the basis of the privacy analysis in the rest of the paper.

Lemma C.1. Let θ₀, θ′₀ have the same distribution Θ₀, let θ_T be the solution to (2) given θ₀ and database D, and let θ′_T be the solution to (2) given θ′₀ and database D′, where D ∼ D′. Let Θ_{[0,T]} (resp. Θ′_{[0,T]}) be the distribution of the trajectory {θ_t}_{t∈[0,T]} (resp. {θ′_t}_{t∈[0,T]}). Suppose ‖∇L(θ; D) − ∇L(θ; D′)‖₂ ≤ Δ for all θ. Then for all α ≥ 1:
$$R_\alpha\big(\Theta_{[0,T]}, \Theta'_{[0,T]}\big) \le \frac{\alpha\Delta^2}{4}\int_0^T \beta_t^2\,\mathrm{d}t.$$
The idea behind the proof is to use a bound on the divergence between Gaussians and RDP composition to bound the divergence between projected noisy gradient descent run on D and on D′. Then, taking the limit as the step size of gradient descent goes to 0 and applying Fatou's lemma (Lemma B.13), we get the bound above.

Proof. For ease of presentation, we first show a divergence bound between Θ_T, Θ′_T, the distributions of θ_T, θ′_T, and then describe how to modify the proof to show the same bound between Θ_{[0,T]}, Θ′_{[0,T]}. Let Ψ_{D,m,i} be a map from (distributions over) R^p to (distributions over) R^p that takes the point θ to the distribution
$$\Pi_{\mathcal{C}}\left(N\left(\theta - \left(\int_{(i-1)T/m}^{iT/m}\beta_t\,\mathrm{d}t\right)\nabla L(\theta; D),\ \frac{2T}{m}I\right)\right),$$
where Π_C is the ℓ₂-projection onto C. It is well known (see e.g. Lemma B.8) that R_α(N(0, σ²I), N(x, σ²I)) ≤ α‖x‖₂²/(2σ²), which gives:
$$R_\alpha\left(N\left(\theta - \left(\int_{(i-1)T/m}^{iT/m}\beta_t\,\mathrm{d}t\right)\nabla L(\theta; D), \frac{2T}{m}I\right),\ N\left(\theta - \left(\int_{(i-1)T/m}^{iT/m}\beta_t\,\mathrm{d}t\right)\nabla L(\theta; D'), \frac{2T}{m}I\right)\right)$$
$$= R_\alpha\left(N\left(0, \frac{2T}{m}I\right),\ N\left(\left(\int_{(i-1)T/m}^{iT/m}\beta_t\,\mathrm{d}t\right)\big(\nabla L(\theta; D) - \nabla L(\theta; D')\big), \frac{2T}{m}I\right)\right) \le \frac{\alpha\Delta^2}{4}\cdot\frac{\left(\int_{(i-1)T/m}^{iT/m}\beta_t\,\mathrm{d}t\right)^2}{T/m}.$$
Let Ψ_{D,m} denote the composition Ψ_{D,m,m} ∘ Ψ_{D,m,m−1} ∘ ⋯ ∘ Ψ_{D,m,1}. By Fact B.9, we have
$$R_\alpha\big(\Psi_{D,m}(\Theta_0), \Psi_{D',m}(\Theta_0)\big) \le \sum_{i=1}^{m}\max_\theta R_\alpha\big(\Psi_{D,m,i}(\theta), \Psi_{D',m,i}(\theta)\big).$$
Plugging in the bound on R_α(Ψ_{D,m,i}(θ), Ψ_{D′,m,i}(θ)), we get
$$R_\alpha\big(\Psi_{D,m}(\Theta_0), \Psi_{D',m}(\Theta_0)\big) \le \frac{\alpha\Delta^2}{4}\cdot\frac{m}{T}\sum_{i=1}^{m}\left(\int_{(i-1)T/m}^{iT/m}\beta_t\,\mathrm{d}t\right)^2.$$
Note that Θ_T = lim_{m→∞} Ψ_{D,m}(Θ₀) and Θ′_T = lim_{m→∞} Ψ_{D′,m}(Θ₀). Since exp((α − 1)R_α(P, Q)) is a monotone function of R_α(P, Q) and is the expectation of a positive random variable, by Fatou's lemma we have:
$$R_\alpha(\Theta_T, \Theta'_T) \le \lim_{m\to\infty} R_\alpha\big(\Psi_{D,m}(\Theta_0), \Psi_{D',m}(\Theta_0)\big) \le \frac{\alpha\Delta^2}{4}\cdot\lim_{m\to\infty}\frac{m}{T}\sum_{i=1}^{m}\left(\int_{(i-1)T/m}^{iT/m}\beta_t\,\mathrm{d}t\right)^2 = \frac{\alpha\Delta^2\int_0^T \beta_t^2\,\mathrm{d}t}{4}.$$
This gives the bound on R_α(Θ_T, Θ′_T). To obtain the same bound for R_α(Θ_{[0,T]}, Θ′_{[0,T]}), we modify Ψ_{D,m,i} so that instead of receiving Θ_{(i−1)T/m} and outputting Θ_{iT/m}, it receives the joint distribution {Θ_{jT/m}}_{0≤j≤i−1} and outputs {Θ_{jT/m}}_{0≤j≤i} by appending the (also jointly distributed) variable
$$\Theta_{iT/m} = \Pi_{\mathcal{C}}\left(N\left(\Theta_{(i-1)T/m} - \left(\int_{(i-1)T/m}^{iT/m}\beta_t\,\mathrm{d}t\right)\nabla L(\Theta_{(i-1)T/m}; D),\ \frac{2T}{m}I\right)\right).$$
That is, we update Ψ_{D,m,i} so it outputs the distributions of all iterates seen so far instead of just the distribution of the last iterate; the limiting value of the joint distribution {Θ_{jT/m}}_{0≤j≤i} is then Θ_{[0,T]} according to eq. (2), and the same divergence bound holds.
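The limit taken in the last step can be checked numerically. The following sketch (ours, for illustration) verifies that for β_t = t, the discrete quantity (m/T)·Σ_i(∫_{(i−1)T/m}^{iT/m} β_t dt)² approaches ∫_0^T β_t² dt = T³/3 as m grows:

```python
def discrete_rdp_sum(T, m):
    # (m/T) * sum_i (integral of beta_t over the i-th subinterval)^2, for beta_t = t.
    # The exact sub-integral of t over [(i-1)T/m, iT/m] is the difference of t^2/2.
    total = 0.0
    for i in range(1, m + 1):
        a = (i - 1) * T / m
        b = i * T / m
        sub = (b * b - a * a) / 2.0
        total += sub * sub
    return (m / T) * total

# With T = 2, the limiting value is T^3 / 3 = 8/3; the approximation error shrinks with m.
```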

D LD AS EXPONENTIAL MECHANISM AND DP-ERM UNDER ε-DP

In this section, we study the privacy-utility trade-offs of LD when viewed as a variant of the exponential mechanism (McSherry & Talwar, 2007). Using this viewpoint, we show that LD can achieve the optimal excess empirical risk bounds for ℓ₂-Lipschitz losses, including non-convex, convex, and strongly convex losses.

D.1 BOUND FOR CONVEX LOSSES

We revisit the following result of Bassily et al. (2014) on the exponential mechanism instantiated on the empirical loss (Algorithm 2), which samples θ_priv from C with probability density proportional to exp(−(εn/(2L‖C‖₂))·L(θ; D)).

Theorem D.1 (Bassily et al. (2014)). Assume the constraint set C ⊂ R^p with bounded diameter is convex, and each individual loss function in L(θ; D) is convex and L-Lipschitz within C. Then the output θ_priv of Algorithm 2 is ε-differentially private, and
$$\mathbb{E}_{\theta^{\mathrm{priv}}}[\mathrm{Risk}_{\mathrm{ERM}}(\theta^{\mathrm{priv}})] = O\left(\frac{Lp\cdot\|\mathcal{C}\|_2}{\varepsilon n}\right).$$

Equivalence of Algorithm 2 and Langevin diffusion:

The following lemma, implied by e.g. (Tanaka, 1979, Theorem 4.1), shows that one can implement Algorithm 2 using only solutions to eq. (2); note that this does not necessarily mean solutions to eq. (2) can be sampled efficiently.

Lemma D.2. Let L be an M-smooth function for some finite M. If β_t = β for all t, then the stationary distribution of (2) has pdf proportional to exp(−βL(θ; D))·1(θ ∈ C), where 1(·) is the indicator function.

We recall that one can ensure smoothness by convolving L (appropriately extended to all of R^p) with a Gaussian kernel of finite variance (Feldman et al., 2018, Appendix C). In particular, since we only need M to be finite, we can take the convolution with the Gaussian kernel N(0, λ²I_p) for arbitrarily small λ > 0; the result of the convolution is L/λ-smooth (perhaps arbitrarily large, but still finite) and differs from L by an arbitrarily small amount everywhere in C.
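As an illustration of Lemma D.2 (a sketch of ours, not the paper's implementation), an Euler-Maruyama discretization of (2) with projection onto C approximately samples from the density ∝ exp(−βL(θ))·1(θ ∈ C). For a 1-D quadratic loss on C = [−1, 1], the iterates stay in C and concentrate around the minimizer 0 as β grows:

```python
import math
import random

def projected_langevin(grad, beta, radius, eta=1e-3, steps=20000, seed=1):
    """Discretized Langevin diffusion d(theta) = -beta * grad L(theta) dt + sqrt(2) dW,
    with Euclidean projection onto [-radius, radius] after each step."""
    rng = random.Random(seed)
    theta = 0.5
    samples = []
    for _ in range(steps):
        noise = rng.gauss(0.0, math.sqrt(2 * eta))
        theta = theta - eta * beta * grad(theta) + noise
        theta = max(-radius, min(radius, theta))   # projection onto C
        samples.append(theta)
    return samples

# L(theta) = theta^2 / 2, so grad L(theta) = theta; the stationary density on C
# is then proportional to exp(-beta * theta^2 / 2), centered at 0.
samples = projected_langevin(grad=lambda t: t, beta=50.0, radius=1.0)
```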

D.2 BOUND FOR STRONGLY CONVEX LOSSES

Our algorithm for strongly convex losses, given as Algorithm 3, is an iterated version of the exponential mechanism. Again, we note that Algorithm 3 can be implemented using only Langevin diffusion as a primitive.

Theorem D.3. For any ε, suppose we run Algorithm 3 with k = 1 + ⌈log log(εn/(p + log n))⌉ and ε_i = ε/2^{k−i+1}. Assume each individual loss function in L(θ; D) is L-Lipschitz within the constraint set C₀. Then Algorithm 3 is ε-differentially private. Additionally, if the loss function L(θ; D) is m-strongly convex and the constraint set C₀ is convex, then over the randomness of the algorithm, the output θ_k of Algorithm 3 satisfies:
$$\mathbb{E}[\mathrm{Risk}_{\mathrm{ERM}}(\theta_k)] = O\left(\frac{L^2(p^2 + p\log n)}{m\varepsilon^2 n^2}\right).$$

Algorithm 3 Iterated Exponential Mechanism (ITERATED-EXP-MECHANISM)
Input: Loss function L, constraint set C₀ ⊂ R^p with bounded diameter, Lipschitz constant L, strong convexity parameter m, number of iterations k, privacy parameter sequence {ε_i}_{i=1}^k, and dataset D of n samples.
1: for i = 1 to k do
2:   Sample θ_i from C_{i−1} with probability proportional to exp(−(ε_i n/(2L‖C_{i−1}‖₂))·L(θ; D)).
3:   C_i ← {θ ∈ C_{i−1} : ‖θ − θ_i‖₂ ≤ √(cL(p + 3 log n)‖C_{i−1}‖₂/(mε_i n))}.
4: end for
5: return θ_k

The theorem follows by solving a recurrence on ‖C_i‖₂ to bound the diameter of the final set C_{k−1}. Then, we show that the minimizer over C₀ is also in C_{k−1} with high probability, so the analysis of the exponential mechanism gives the theorem. We note that a similar result was achieved by Bassily et al. (2014); however, we improve the p² log n in their result to p² + p log n, and only need the exponential mechanism as a primitive (whereas their algorithm requires computing the minimum of L). To prove Theorem D.3, we first need the following lemma, which shows that with high probability we choose a series of sets C_i that all contain the minimizer over C₀. Lemma D.4.
Suppose we sample θ from a convex constraint set C ⊂ R^p with bounded diameter with probability proportional to exp(−(εn/(2L‖C‖₂))·L(θ; D)), where L(·; D) is an m-strongly convex function. Let θ* = arg min_{θ∈C} L(θ; D). Then for any t ≥ 0 and some sufficiently large constant c, we have
$$\Pr\left[\|\theta - \theta^*\|_2 \le \sqrt{\frac{cL(p + t)\|\mathcal{C}\|_2}{m\varepsilon n}}\right] \ge 1 - 2^{-t}.$$
The lemma follows from a tail bound on the excess loss of the exponential mechanism, using m-strong convexity to translate the excess loss bound into a distance bound. The proof is given below.

Proof of Lemma D.4. By e.g. the proof of (Bassily et al., 2014, Theorem III.2), we know that for some sufficiently large constant c:
$$\Pr\left[L(\theta; D) - L(\theta^*; D) \le \frac{cL\|\mathcal{C}\|_2}{2\varepsilon n}(p + t)\right] \ge 1 - 2^{-t}. \tag{4}$$
We now show that the claim holds conditioned on this event. By optimality of θ* and convexity of C, we know
$$\langle \nabla L(\theta^*; D),\ \theta - \theta^* \rangle \ge 0. \tag{5}$$
So, by m-strong convexity, we have
$$\frac{cL\|\mathcal{C}\|_2}{2\varepsilon n}(p + t) \overset{(4)}{\ge} L(\theta; D) - L(\theta^*; D) \ge \langle \nabla L(\theta^*; D),\ \theta - \theta^* \rangle + \frac{m}{2}\|\theta - \theta^*\|_2^2 \overset{(5)}{\ge} \frac{m}{2}\|\theta - \theta^*\|_2^2.$$
Rearranging gives the claim.

Given Lemma D.4, we can now prove Theorem D.3.

Proof of Theorem D.3. The privacy guarantee is immediate from the privacy guarantee of the exponential mechanism, composition, and the fact that for this choice of ε_i, k we have Σ_{i=1}^k ε_i < ε. Setting t = 3 log n in Lemma D.4, in iteration i, letting θ*_i = arg min_{θ∈C_{i−1}} L(θ; D), we have that with probability 1 − 2^{−t} = 1 − 1/n³, θ*_i ∈ C_i, and thus θ*_i = θ*_{i+1}. Then by a union bound, with probability 1 − k/n³ ≥ 1 − log log(εn)/n³, we have θ*₁ ∈ C_{k−1} (equivalently, θ*₁ = θ*₂ = ⋯ = θ*_k). When this event fails to happen, our excess loss is at most L‖C₀‖₂, and in turn the contribution of this failure event to the expected excess loss is O(L‖C₀‖₂ log log(εn)/n³), which is asymptotically less than our desired excess loss bound. So it suffices to prove the desired expected excess loss bound conditioned on this event.
By the analysis of the exponential mechanism, conditioned on this event, we have that
$$\mathbb{E}_{\theta_k}[L(\theta_k; D)] - L(\theta_1^*; D) = O\left(\frac{Lp\|\mathcal{C}_{k-1}\|_2}{\varepsilon_k n}\right) = O\left(\frac{Lp\|\mathcal{C}_{k-1}\|_2}{\varepsilon n}\right). \tag{6}$$
Note that L(θ*₁; D) = min_{θ∈C₀} L(θ; D) by definition, so it now suffices to bound ‖C_{k−1}‖₂ by O(L(p + log n)/(mεn)). To do this, we have the recurrence relation:
$$\|\mathcal{C}_i\|_2 \le 2\sqrt{\frac{cL(p + 3\log n)\|\mathcal{C}_{i-1}\|_2}{m\varepsilon_i n}}.$$
Solving the recurrence for C_{k−1}, we get:
$$\|\mathcal{C}_{k-1}\|_2 \le \left(\frac{4cL(p + 3\log n)}{mn}\right)^{1 - 2^{-(k-1)}}\cdot\big(\|\mathcal{C}_0\|_2\big)^{2^{-(k-1)}}\cdot\prod_{i=1}^{k-1}\varepsilon_i^{-2^{-(k-i)}} = \left(\frac{4cL(p + 3\log n)}{m\varepsilon n}\right)^{1 - 2^{-(k-1)}}\cdot\big(\|\mathcal{C}_0\|_2\big)^{2^{-(k-1)}}\cdot\prod_{i=1}^{k-1}\big(2^{k-i+1}\big)^{2^{-(k-i)}}. \tag{7}$$
We claim the following:
$$\|\mathcal{C}_0\|_2 \le \frac{2L}{m}. \tag{8}$$
Let θ_global be the minimizer of L(θ; D) over all of R^p. By the triangle inequality, there exists a point θ in C₀ which is at distance at least ‖C₀‖₂/2 from θ_global. By m-strong convexity, this implies that the gradient at θ has ℓ₂-norm at least m‖C₀‖₂/2. On the other hand, by Lipschitzness over C₀, the gradient at θ has ℓ₂-norm at most L. This gives us eq. (8). Using eq. (8), we can simplify eq. (7) to
$$\|\mathcal{C}_{k-1}\|_2 \le \frac{2L}{m}\cdot\left(\frac{2c(p + 3\log n)}{\varepsilon n}\right)^{1 - 2^{-(k-1)}}\cdot\prod_{i=1}^{k-1}\big(2^{k-i+1}\big)^{2^{-(k-i)}}.$$
We have:
$$\log_2\prod_{i=1}^{k-1}\big(2^{k-i+1}\big)^{2^{-(k-i)}} = \sum_{i=1}^{k-1}(k - i + 1)2^{-(k-i)} \le \sum_{j=1}^{\infty}(j + 1)2^{-j} = 3.$$
In other words, ∏_{i=1}^{k−1}(2^{k−i+1})^{2^{−(k−i)}} is at most 8, regardless of the value of k. Now, using the fact that z^{1/log z} = O(1) for any z > 0, our final upper bound on ‖C_{k−1}‖₂ is:
$$\|\mathcal{C}_{k-1}\|_2 = O\left(\frac{L}{m}\cdot\left(\frac{p + \log n}{\varepsilon n}\right)^{1 - 2^{-(k-1)}}\right) = O\left(\frac{L(p + \log n)}{m\varepsilon n}\right).$$
Plugging this into eq. (6) gives us Theorem D.3.
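The bound ∏_{i=1}^{k−1}(2^{k−i+1})^{2^{−(k−i)}} ≤ 8 used above is easy to confirm numerically (an illustrative sketch of ours):

```python
def log2_product(k):
    # log2 of prod_{i=1}^{k-1} (2^(k-i+1))^(2^-(k-i))
    #   = sum_{i=1}^{k-1} (k - i + 1) * 2^-(k - i),
    # which is bounded by sum_{j>=1} (j+1) * 2^-j = 3, so the product is at most 2^3 = 8.
    return sum((k - i + 1) * 2.0 ** (-(k - i)) for i in range(1, k))
```

The exponent approaches 3 from below as k grows, independently of all other parameters.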

D.3 BOUND FOR NON-CONVEX LOSSES

If the loss function L is non-convex but still L-Lipschitz, we can still obtain a comparable error bound for Algorithm 2, as long as the constraint set C is convex.

Theorem D.5. Assume the constraint set C ⊂ R^p with bounded diameter is convex, and each individual loss function in L(θ; D) is L-Lipschitz within C. Then for p ≤ εn/2, over the randomness of Algorithm 2, the output θ_priv satisfies
$$\mathbb{E}\left[\mathrm{Risk}_{\mathrm{ERM}}(\theta^{\mathrm{priv}})\right] = O\left(\frac{Lp\cdot\|\mathcal{C}\|_2}{\varepsilon n}\log\frac{\varepsilon n}{p}\right).$$
Note that the assumption on p can easily be removed: if p > εn/2, any θ ∈ C achieves Risk_ERM(θ) ≤ L‖C‖₂ < 2Lp‖C‖₂/(εn) by L-Lipschitzness. To prove Theorem D.5, we show there is a "good" subset C_Good ⊂ C with large volume that only contains points with small excess loss. We note that Bassily et al. (2014) also gave an analysis of the continuous exponential mechanism for non-convex losses, although their analysis assumes C contains an ℓ₂-ball of radius r > 0, which they use to choose C_Good; in turn, their error bound is roughly proportional to log(1/r). In contrast, by choosing C_Good more carefully, we remove this dependence on r.

Proof. By the analysis of the exponential mechanism, for any G ⊆ C we have:
$$\mathbb{E}_{\theta^{\mathrm{priv}}}\left[L(\theta^{\mathrm{priv}}; D)\right] - \max_{\theta\in G} L(\theta; D) = O\left(\frac{L\|\mathcal{C}\|_2}{\varepsilon n}\cdot\log\frac{\mathrm{Vol}(\mathcal{C})}{\mathrm{Vol}(G)}\right).$$
Let us define G := {θ* + r(θ − θ*) : θ ∈ ∂C, 0 ≤ r ≤ R} for some R ≤ 1 we will choose later, where θ* is a minimizer of L(θ; D) over C. Since G is a copy of C rescaled by R around θ*, Vol(C)/Vol(G) = (1/R)^p. So the analysis of the exponential mechanism gives
$$\mathbb{E}_{\theta^{\mathrm{priv}}}\left[L(\theta^{\mathrm{priv}}; D)\right] - \max_{\theta\in G} L(\theta; D) = O\left(\frac{Lp\|\mathcal{C}\|_2}{\varepsilon n}\cdot\log\frac{1}{R}\right). \tag{9}$$
By L-Lipschitzness of L(θ; D), and since G is C rescaled by R, so that max_{θ∈G}‖θ − θ*‖₂ ≤ R‖C‖₂, we have:
$$\max_{\theta\in G} L(\theta; D) - L(\theta^*; D) \le RL\|\mathcal{C}\|_2. \tag{10}$$
Combining (9) and (10), we get:
$$\mathbb{E}_{\theta^{\mathrm{priv}}}\left[\mathrm{Risk}_{\mathrm{ERM}}(\theta^{\mathrm{priv}})\right] = O\left(L\|\mathcal{C}\|_2\left(\frac{p}{\varepsilon n}\cdot\log\frac{1}{R} + R\right)\right). \tag{11}$$
The above bound is minimized by choosing R = p/(εn), which is at most 1 by assumption, giving the theorem.
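The choice R = p/(εn) in the last step can be spot-checked numerically: it is the stationary point of f(R) = (p/(εn))·log(1/R) + R, and a grid search confirms it as the minimizer (an illustrative sketch of ours):

```python
import math

def tradeoff(R, a):
    # f(R) = a * log(1/R) + R, where a plays the role of p / (eps * n) in eq. (11).
    # f is convex in R with f'(R) = -a/R + 1, so the unique minimizer is R = a.
    return a * math.log(1.0 / R) + R

a = 1e-3                                        # stands in for p / (eps * n) << 1
grid = [i / 10 ** 5 for i in range(1, 10 ** 5)]  # R ranging over (0, 1)
best_R = min(grid, key=lambda R: tradeoff(R, a))
```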

E UNIFORM STABILITY OF LD AND OPTIMAL DP-SCO UNDER ε-DP

In this section, we provide uniform stability bounds for LD in the setting of ε-DP. Combined with the excess empirical risk bounds of Section D, these give us optimal excess population risk bounds for convex and strongly convex losses.

E.1 BOUND FOR CONVEX LOSSES

All our algorithms in the ε-DP case are primarily based on the exponential mechanism. Therefore, to get SCO bounds, we establish uniform stability of the exponential mechanism on regularized losses. To the best of our knowledge, such uniform stability bounds for the exponential mechanism were unknown prior to this work. We first recall the following fact about projected noisy gradient descent on strongly convex (and smooth) losses.

Lemma E.1. Let f, f′ be m-strongly convex, M-smooth functions such that ‖∇f(θ) − ∇f′(θ)‖₂ ≤ Δ for all θ ∈ C. Recall that, given a convex set C, projected noisy gradient descent on f performs the following random update: sample ξ_t ∼ N(0, σ²I_p) and compute
$$\theta_{t+1} = \Pi_{\mathcal{C}}\big(\theta_t - \eta_t\nabla f(\theta_t) + \xi_t\big),$$
where Π_C is the Euclidean projection onto C. Let {θ_t}_t, {θ′_t}_t be the trajectories given by running projected noisy gradient descent on f, f′ respectively, starting from the same point θ₀, and suppose M ≤ 1/max_t η_t. Then for any (shared) fixed realization of the noise {ξ_t}_t, ‖θ_t − θ′_t‖₂ ≤ Δ/m.

Lemma E.1 is implied by e.g. the proof of (Hardt et al., 2016, Theorem 3.9). We have the following corollaries:

Corollary E.2. If ℓ is m-strongly convex and L-Lipschitz, then for any β_t such that ∫_a^b β_t dt is finite whenever 0 ≤ a ≤ b, and any t > 0, outputting the solution θ_t to (2) (given some fixed θ₀ independent of L) is (2L²/(mn))-uniformly stable.

Corollary E.3. If ℓ is convex and L-Lipschitz, then running Algorithm 2 on the regularized loss function L_m(θ; D) := L(θ; D) + (m/2)‖θ‖₂² is (2L²/(mn))-uniformly stable (with respect to the unregularized loss L).

The idea behind the proofs is that the uniform stability bound implied by Lipschitzness and Lemma E.1 is independent of t and η, and so it applies to the limiting distribution as η → 0 (giving Corollary E.2) and as t → ∞, which by Lemma D.2 is the exponential mechanism (giving Corollary E.3).
We believe that this proof demonstrates the power of the unified framework that is the focus of this paper; once we have this framework in mind, we can almost immediately obtain optimal SCO bounds under ε-DP, and also improve on Asi et al. (2021b) by polylog(n) factors.

Proof of Corollary E.2. Since the regularizer does not depend on the dataset D, by L-Lipschitzness we have ‖∇L(θ; D) − ∇L(θ; D′)‖₂ ≤ 2L/n. Fix some η > 0 and set f = L, η_t = ∫_{(t−1)η}^{tη} β_s ds, and σ² = 2η in Lemma E.1, and note that the bound in Lemma E.1 does not depend on σ². We will assume L is M-smooth for some finite M, since, per the discussion in Section D, we can replace L with a smoothed version that differs from L by an arbitrarily small amount. Lemma E.1 then gives that, if θ_t, θ′_t are the result of running projected noisy gradient descent on L using D, D′ respectively (from the same initial θ₀), then ‖θ_t − θ′_t‖₂ ≤ 2L/(mn) for any fixed realization of the noise {ξ_t}_t, as long as the smoothness assumption is satisfied. Taking the limit as η goes to 0, the trajectory of this projected noisy gradient descent approaches the solution to (2); furthermore, in this limit the smoothness assumption in Lemma E.1 is trivially satisfied, as all η_t go to 0. Therefore, by Fatou's lemma (Lemma B.13), we have the following: for all t, if θ_t, θ′_t are the solutions to eq. (2) using D, D′ respectively with a fixed realization of the Brownian motion, then ‖θ_t − θ′_t‖₂ ≤ 2L/(mn). Taking the expectation over the Brownian motion and using L-Lipschitzness of ℓ, we conclude that for any t and all d ∈ τ, E_{θ_t, θ′_t}[ℓ(θ_t; d) − ℓ(θ′_t; d)] ≤ 2L²/(mn).

Proof of Corollary E.3. Similarly to the proof of Corollary E.2, if θ_t, θ′_t are the solutions to eq. (2) (run on the regularized loss L_m) using D, D′ respectively, with β_t = β and a fixed realization of the Brownian motion, then E[‖θ_t − θ′_t‖₂] ≤ 2L/(mn), and thus E_{θ_t, θ′_t}[ℓ(θ_t; d) − ℓ(θ′_t; d)] ≤ 2L²/(mn).
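The coupling argument of Lemma E.1 can be simulated directly: run projected noisy gradient descent on two neighboring datasets with a shared noise realization, and check that the trajectories never drift further than Δ/m apart. The sketch below is our own illustration, using a toy 1-strongly convex, 1-smooth quadratic loss:

```python
import random

def coupled_trajectories(data, data_prime, eta=0.1, steps=200, radius=5.0, seed=3):
    """Projected noisy GD on f(theta) = (1/n) sum_d (theta - d)^2 / 2 (1-strongly
    convex, 1-smooth) for two neighboring datasets, with SHARED noise xi_t."""
    rng = random.Random(seed)
    n = len(data)
    theta, theta_p = 0.0, 0.0
    gaps = []
    for _ in range(steps):
        xi = rng.gauss(0.0, 1.0)                          # shared noise realization
        g = sum(theta - d for d in data) / n              # gradient on D
        g_p = sum(theta_p - d for d in data_prime) / n    # gradient on D'
        theta = max(-radius, min(radius, theta - eta * g + xi))
        theta_p = max(-radius, min(radius, theta_p - eta * g_p + xi))
        gaps.append(abs(theta - theta_p))
    return gaps

data = [1.0, 2.0, 3.0, 4.0]
data_p = [1.0, 2.0, 3.0, 0.0]          # neighboring dataset: one point replaced
delta = abs(4.0 - 0.0) / len(data)     # sup-norm gradient difference: |d - d'| / n = 1
gaps = coupled_trajectories(data, data_p)
```

With m = 1 here, Lemma E.1 predicts the gap never exceeds Δ/m = 1, for every noise realization.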
Note that the above inequality holds for the unregularized loss. Taking the limit as t → ∞ and using Lemma D.2, we conclude that, for θ_priv and θ′_priv output by Algorithm 2 using D and D′ respectively, E[ℓ(θ_priv; d) − ℓ(θ′_priv; d)] ≤ 2L²/(mn) for all d ∈ τ.

Then, from Corollary E.3 and Theorem D.1, we have the following theorem:

Theorem E.4. Let θ_priv be the output of Algorithm 2 when run on the regularized loss L_m(θ; D) := L(θ; D) + (m/2)‖θ‖₂². If each ℓ is convex and L-Lipschitz, then for m = L/(‖C‖₂√n) we have:
$$\mathbb{E}_{\theta^{\mathrm{priv}}}[\mathrm{Risk}_{\mathrm{SCO}}(\theta^{\mathrm{priv}})] = O\left(\frac{Lp\|\mathcal{C}\|_2}{\varepsilon n} + \frac{L\|\mathcal{C}\|_2}{\sqrt{n}}\right).$$
Furthermore, outputting θ_priv is ε-differentially private.

Proof. Assume without loss of generality that 0 ∈ C. Let θ* be the empirical minimizer of L. If m ≤ L/‖C‖₂, the Lipschitz constant does not change by more than a constant factor, so by Theorem D.1 we have:
$$\mathbb{E}_{\theta^{\mathrm{priv}}}[L_m(\theta^{\mathrm{priv}}; D)] - L_m(\theta^*; D) \le \mathbb{E}_{\theta^{\mathrm{priv}}}[L_m(\theta^{\mathrm{priv}}; D)] - \min_{\theta\in\mathcal{C}} L_m(\theta; D) = O\left(\frac{Lp\|\mathcal{C}\|_2}{\varepsilon n}\right).$$
The functions L_m and L differ by at most (m/2)‖C‖₂² everywhere in C, so in turn:
$$\mathbb{E}_{\theta^{\mathrm{priv}}}[\mathrm{Risk}_{\mathrm{ERM}}(\theta^{\mathrm{priv}})] = O\left(\frac{Lp\|\mathcal{C}\|_2}{\varepsilon n} + m\|\mathcal{C}\|_2^2\right).$$
Finally, we apply the uniform stability bound of Corollary E.3 to get:
$$\mathbb{E}_{\theta^{\mathrm{priv}}}[\mathrm{Risk}_{\mathrm{SCO}}(\theta^{\mathrm{priv}})] = O\left(\frac{Lp\|\mathcal{C}\|_2}{\varepsilon n} + m\|\mathcal{C}\|_2^2 + \frac{L^2}{mn}\right).$$
The theorem follows by setting m = L/(‖C‖₂√n).

E.2 BOUND FOR STRONGLY CONVEX LOSSES

Under strong convexity, using a proof similar to that of Theorem E.4, we can obtain near-optimal DP-SCO bounds under ε-DP. We first show that the empirical minimizer is close to the population minimizer with high probability:

Lemma E.5. Let ℓ be an m-strongly convex function and C ⊂ R^p a convex set with bounded diameter such that for any d, θ, ‖∇ℓ(θ; d) − E_{d∼D}[∇ℓ(θ; d)]‖₂ ≤ Δ, and let θ* := arg min_{θ∈C} E_{d∼D}[ℓ(θ; d)] and θ_emp := arg min_{θ∈C} ℓ(θ; D). Then for D ∼ D^n, with probability 1 − β, we have:
$$\|\theta_{\mathrm{emp}} - \theta^*\|_2 = O\left(\frac{\Delta\sqrt{\log(1/\beta)}}{m\sqrt{n}}\right).$$
Proof. Consider a function ℓ̃ over R^p whose gradient is ∇ℓ̃(θ) = ∇ℓ(Π_C(θ)) + m(θ − Π_C(θ)). We will show that the assumptions on ℓ in the lemma statement also hold for ℓ̃, and that the distance between the empirical and population minimizers of ℓ̃ over R^p dominates the distance between the empirical and population minimizers of ℓ over C. So proving the lemma for ℓ̃ and R^p implies it for ℓ and C, i.e., it suffices to prove the lemma assuming C = R^p. By m-strong convexity of ℓ in C, ℓ̃ is also m-strongly convex. Also, for any d, θ we have ‖∇ℓ̃(θ; d) − E_{d∼D}[∇ℓ̃(θ; d)]‖₂ = ‖∇ℓ(Π_C(θ); d) − E_{d∼D}[∇ℓ(Π_C(θ); d)]‖₂ ≤ Δ. So we have shown the assumptions in the lemma hold for ℓ̃ as desired.

From strong convexity of ℓ̃, the empirical and population minimizers of ℓ̃ over R^p are the unique points where its gradient vanishes. This gives us that for any D, the empirical minimizer of ℓ̃ over R^p equals θ̃_emp := θ_emp − (1/m)∇ℓ(θ_emp; D), where θ_emp is the minimizer of ℓ(θ; D) over C, and the population minimizer of E_{d∼D}[ℓ̃(θ; d)] is θ̃* := θ* − (1/m)E_{d∼D}[∇ℓ(θ*; d)]. Note that by convexity, either ∇ℓ(θ_emp; D) = 0 or −∇ℓ(θ_emp; D) lies in the normal cone of C at θ_emp (and the same holds for θ*). In turn, θ_emp = Π_C(θ̃_emp) and θ* = Π_C(θ̃*), and since the projection is a non-expansive operator, ‖θ̃_emp − θ̃*‖₂ ≥ ‖θ_emp − θ*‖₂. This shows that the distance between the minimizers of ℓ̃ dominates the distance between the minimizers of ℓ, as desired.
We now turn to proving the lemma assuming C = R^p. In that case, E_{d∼D}[∇ℓ(θ*; d)] = 0. By the assumptions of the lemma and a vector Azuma inequality (see e.g. Hayes (2003)), with probability 1 − β over D we have ‖∇ℓ(θ*; D)‖₂ = O(Δ√(log(1/β))/√n). Furthermore, ∇ℓ(θ_emp; D) = 0 by strong convexity and since C = R^p. Then by strong convexity, with probability 1 − β,
$$\|\theta^* - \theta_{\mathrm{emp}}\|_2 \le \frac{\|\nabla\ell(\theta^*; D) - \nabla\ell(\theta_{\mathrm{emp}}; D)\|_2}{m} = \frac{\|\nabla\ell(\theta^*; D)\|_2}{m} = O\left(\frac{\Delta\sqrt{\log(1/\beta)}}{m\sqrt{n}}\right),$$
as desired.

Given Lemma E.5, if we want to ensure that the population minimizer, rather than the empirical minimizer, remains in the sets chosen in Algorithm 3, we just need to choose a slightly larger ball. From this modification and uniform stability, we get the following DP-SCO bound:

Theorem E.6. Let m, n, p, ℓ be as in Lemma E.5. Let θ_k be the output of Algorithm 3, except we let the radius of C_i be
$$\sqrt{\frac{cL(p + 3\log n)\|\mathcal{C}_{i-1}\|_2}{m\varepsilon_i n}} + \frac{cL\sqrt{\log n}}{m\sqrt{n}}$$
instead of √(cL(p + 3 log n)‖C_{i−1}‖₂/(mε_i n)) (the radius chosen in Theorem D.3 to obtain the optimal ERM bound). Then
$$\mathbb{E}[\mathrm{Risk}_{\mathrm{SCO}}(\theta_k)] = O\left(\frac{L^2 p^2\log n}{m\varepsilon^2 n^2} + \frac{L^2}{mn}\right).$$
Proof. Note that by L-Lipschitzness of ℓ in C, we have ‖∇ℓ(θ; d) − E_{d∼D}[∇ℓ(θ; d)]‖₂ ≤ 2L. By Lemma D.4, Lemma E.5, and the triangle inequality, if c is sufficiently large, then the population minimizer of ℓ in C_i is in C_{i+1} for each i with probability 1 − 2/n³. Then by a union bound, the population minimizer is in C_{k−1} with high probability. When this event fails to hold, our excess population loss is O(L‖C₀‖₂), and so the contribution of this event to the expected excess loss is O(L‖C₀‖₂ log log(εn)/n³) = O(L² log log(εn)/(mn³)), which is asymptotically less than our desired bound. So it suffices to prove the desired expected excess loss bound conditioned on this event.
We can bound the diameter of C_{k−1} similarly to the proof of Theorem D.3, by noting that:
$$\|\mathcal{C}_i\|_2 \le 2\max\left\{2\sqrt{\frac{cL(p + 3\log n)\|\mathcal{C}_{i-1}\|_2}{m\varepsilon_i n}},\ \frac{cL\sqrt{\log n}}{m\sqrt{n}}\right\}.$$
Then, unrolling the recursion similarly to the proof of Theorem D.3, we have:
$$\|\mathcal{C}_{k-1}\|_2 = O\left(\frac{L(p + \log n)}{m\varepsilon n} + \frac{L\sqrt{\log n}}{m\sqrt{n}}\right).$$
Now, combining Theorem D.1 and the uniform stability bound of Corollary E.3, we get that the expected excess population loss of θ_k compared to the population minimizer over C_{k−1} is:
$$O\left(\frac{Lp\|\mathcal{C}_{k-1}\|_2}{\varepsilon n} + \frac{L^2}{mn}\right) = O\left(\frac{L^2}{mn}\left(\frac{p^2 + p\log n}{\varepsilon^2 n} + \frac{p\sqrt{\log n}}{\varepsilon\sqrt{n}} + 1\right)\right) = O\left(\frac{L^2 p^2\log n}{m\varepsilon^2 n^2} + \frac{L^2}{mn}\right).$$
In the final equality, we use the fact that p√(log n)/(ε√n) is the geometric mean of p² log n/(ε²n) and 1, and thus p√(log n)/(ε√n) ≤ max{p² log n/(ε²n), 1}. We conclude by noting that, conditioned on the event that the population minimizer over C₀ is contained in C_{k−1}, θ_k has this same excess population loss bound compared to the population minimizer over C₀.
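The geometric-mean step above (√(x·1) ≤ max{x, 1}, with x = p² log n/(ε²n)) is elementary and can be spot-checked over a wide range of magnitudes (a small sketch of ours):

```python
import math

def middle_term_dominated(x):
    # sqrt(x * 1) <= max(x, 1): the geometric mean of two nonnegative numbers
    # is at most their maximum, so the cross term is absorbed into the bound.
    return math.sqrt(x) <= max(x, 1.0)
```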

F LD AS NOISY GRADIENT FLOW AND DP-ERM AND DP-SCO UNDER (ε, δ)-DP

We now investigate LD (defined in (1) and (2)) as a noisy gradient flow algorithm (i.e., for finite time T, or when it is far from convergence). In fact, in Section H we argue that, in the setting of parameters we operate with, the algorithm could not have converged to a limiting distribution in any reasonable sense. In the following, we provide optimal DP-ERM and DP-SCO bounds via LD for both convex and strongly convex losses. All proofs are deferred to Appendix J.

F.1 DP-ERM BOUNDS FOR CONVEX AND STRONGLY CONVEX LOSSES

We first provide excess ERM bounds for the LD in (2) for convex losses.

Theorem F.1. Let the inverse temperature be β_t = β for all t > 0. Then for an appropriate choice of β, T, the solution θ_priv = θ_T to (2) is (ε, δ)-differentially private and
$$\mathbb{E}\left[\mathrm{Risk}_{\mathrm{ERM}}(\theta^{\mathrm{priv}})\right] = O\left(\frac{L\|\mathcal{C}\|_2\sqrt{p\log(1/\delta)}}{\varepsilon n}\right).$$
The proof of this theorem can be viewed as a continuous analogue of Shamir & Zhang (2013). We sketch it here. We define a potential φ_t ∝ ‖θ_t − θ*‖₂², and use Ito's lemma to analyze its rate of change. Rearranging the resulting inequality and using convexity, we then bound the integral of the excess loss of θ_t over any interval of time [a, b]. Using techniques similar to those in Shamir & Zhang (2013), we translate this into a bound on the excess loss of θ_T in terms of T and β. We use Lemma C.1 to choose the value of β that preserves privacy, and then optimize the resulting bound over T.

We now provide excess ERM bounds of LD in (2) for strongly convex losses. To simplify the presentation, in this theorem alone we assume L satisfies 2m-strong convexity instead of m-strong convexity; this does not affect our final bound by more than constant factors.

Theorem F.2. Let θ_t be the solution to (2) with β_t = t^a, and suppose L is 2m-strongly convex. Then for any ε, δ and an appropriate choice of a, T, setting B_t = t^{a+1}/(a+1) and
$$\theta^{\mathrm{priv}} = \frac{1}{e^{mB_T} - 1}\int_0^T \theta_t\,\mathrm{d}e^{mB_t},$$
over the randomness of the algorithm,
$$\mathbb{E}\left[\mathrm{Risk}_{\mathrm{ERM}}(\theta^{\mathrm{priv}})\right] = O\left(\frac{pL^2\log(1/\delta)}{m\varepsilon^2 n^2}\log\left(\frac{\varepsilon n}{L\log(1/\delta)}\right)\log\left(\frac{p\varepsilon^2 n^2}{\log(1/\delta)}\right)\right).$$
Furthermore, outputting θ_priv is (ε, δ)-differentially private.

The proof is fairly similar to that of Theorem F.1; the main differences are that we now define the potential φ_t to be proportional to e^{mB_t}‖θ_t − θ*‖₂² instead of ‖θ_t − θ*‖₂², and there is no need to translate to a bound on the final iterate as in Theorem F.1.
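To illustrate how Lemma C.1 combines with Fact B.5 to calibrate the temperature for (ε, δ)-DP (a sketch of ours; the paper's exact constant choices may differ): for constant β_t = β, Lemma C.1 gives R_α ≤ α·ρ with ρ = Δ²β²T/4 for every α, and Fact B.5 converts this to (ε, δ)-DP by optimizing over α:

```python
import math

def eps_from_rdp(rho, delta):
    """Given the RDP curve R_alpha = alpha * rho (rho = Delta^2 beta^2 T / 4 from
    Lemma C.1 with constant beta), return the best (eps, delta)-DP guarantee via
    Fact B.5: eps = min_{alpha > 1} [alpha * rho + log(1/delta) / (alpha - 1)],
    approximated here by a grid search over alpha."""
    best = float("inf")
    alpha = 1.01
    while alpha <= 10000:
        best = min(best, alpha * rho + math.log(1 / delta) / (alpha - 1))
        alpha += 0.01
    return best

def eps_closed_form(rho, delta):
    # Setting the derivative to zero gives alpha = 1 + sqrt(log(1/delta) / rho),
    # hence eps = rho + 2 * sqrt(rho * log(1/delta)).
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))
```

The grid search and the closed-form optimum agree, which is the usual "RDP accounting" recipe for choosing β given a target (ε, δ).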

F.2 DP-SCO BOUNDS FOR CONVEX AND STRONGLY CONVEX LOSSES

The uniform stability bound of LD in the case of (ε, δ)-DP follows by taking the limit as the step sizes go to 0 in the bound of (Bassily et al., 2020, Lemma 3.1):

Lemma F.3. Let θ_T be the solution to (2) at time T, and suppose each loss ℓ is L-Lipschitz. Then outputting θ_T satisfies μ-uniform stability for μ = (4L²/n) ∫₀^T β_t dt.

Combining Lemma F.3 with Theorem F.1, we obtain tight DP-SCO guarantees for convex losses.

Theorem F.4. If each ℓ is convex and L-Lipschitz, then for an appropriate choice of β_t and T, for θ_priv = θ_T that is the solution to (2), we have:

E_{θ_priv}[Risk_SCO(θ_priv)] = O( L‖C‖₂ √(p log(1/δ))/(εn) + L‖C‖₂/√n ).

We can also combine Corollary E.2 and Theorem F.2 to obtain tight DP-SCO guarantees for strongly convex losses.

Theorem F.5. Suppose ℓ is m-strongly convex and L-Lipschitz. Then for an appropriate choice of β_t and T, for θ_priv as defined in Theorem F.2, we have:

E_{θ_priv}[Risk_SCO(θ_priv)] = O( (pL² log(1/δ))/(mε²n²) · log( εn/(L√(log(1/δ))) ) · log( pε²n²/log(1/δ) ) + L²/(mn) ).

In both cases, outputting θ_priv is (ε, δ)-differentially private.
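To see concretely how the two terms of the convex DP-SCO bound trade off, the sketch below evaluates both terms of Theorem F.4's bound (up to the unspecified constants hidden in the O(·)) for sample parameter values; it is a numeric illustration only, chosen to show that the private term vanishes as ε → ∞, leaving the non-private statistical rate.

```python
import math

def dp_sco_convex_bound(L, C2, p, eps, delta, n):
    """The two terms of Theorem F.4's bound, up to constants:
    the privacy term L*C2*sqrt(p*log(1/delta))/(eps*n) and the
    statistical term L*C2/sqrt(n)."""
    priv = L * C2 * math.sqrt(p * math.log(1.0 / delta)) / (eps * n)
    stat = L * C2 / math.sqrt(n)
    return priv, stat

priv, stat = dp_sco_convex_bound(L=1.0, C2=1.0, p=100, eps=1.0, delta=1e-6, n=10_000)
# as eps grows, the privacy term becomes negligible next to the statistical term
big_eps_priv, _ = dp_sco_convex_bound(1.0, 1.0, 100, 1e6, 1e-6, 10_000)
```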

G LOWER BOUND ON DP-ERM FOR NON-CONVEX LOSSES

In this section, we show the following lower bound on the excess empirical risk for 1-Lipschitz non-convex loss functions. The lower bound implies that there is no advantage, in terms of the dependence on the dimension p, in moving from ε-DP to (ε, δ)-DP.

Theorem G.1. Let ε ≤ 1 and 2^{−Ω(n)} ≤ δ ≤ 1/n^{1+Ω(1)}, and let B(0, 1) be the unit Euclidean ball centered at the origin. Then there exists a 1-Lipschitz non-convex function L : B(0, 1) × X → R and a dataset^{10} D = {d_1, ..., d_n} such that for every p ∈ N, there is no (ε, δ)-differentially private algorithm A that outputs θ_priv such that

E[L(θ_priv; D)] − min_{θ∈B(0,1)} L(θ; D) = o( √(p log(1/δ)) / (nε) ).

Proof. We first perform two translations of Theorem B.12: first from (1, 1/n^s) to (ε, δ) using Steinke & Ullman (2015), and then from a sample-complexity statement to a result stated in terms of an accuracy bound. A direct corollary of Theorem B.12 with k = 1 is as follows: for every s ∈ N, no (ε, δ)-differentially private algorithm M on an input X satisfying the premise of Theorem B.12 outputs an index j ∈ [s] such that

max_{u∈[s]} (1/n) Σ_{i=1}^n x_{i,u} − E_M[ (1/n) Σ_{i=1}^n x_{i,j} ] = o( √(log(s) log(1/δ)) / (nε) ), (13)

where ε ≤ 1 and 2^{−Ω(n)} ≤ δ ≤ 1/n^{1+Ω(1)}. Using this lower bound on top-selection, we derive our lower bound by defining an appropriate non-convex loss function. In particular, we define a packing over the p-dimensional Euclidean ball such that there is a bijective mapping between the centers of the packing and [s]. The loss function then attains its minimum at the center of the packing corresponding to the coordinate j ∈ [s] with maximum frequency. Since the size of the α-packing is ≈ (1/α)^p and there is a bijective mapping, this gives a lower bound using eq. (13). Let B(0, 1) be the p-dimensional Euclidean ball centered at the origin and let α ∈ (0, 1/2) be a constant. Consider an α-packing with centers C = {c_1, c_2, ...}. It is known that the size N(α) of such a packing satisfies (1/α)^p ≤ N(α) ≤ (3/α)^p. Let s = N(α).
Further, let f : B(0, 1) → {1, ..., s} be defined as follows: f(θ) = j such that c_j = argmin_{c∈C} ‖θ − c‖₂ (ties broken arbitrarily). In particular, f is the function that maps a point in the unit ball to the index of its closest center in C. We now define our loss function as follows:

L(θ; D) := (1/n) Σ_{d_i∈D} ℓ(θ; d_i), where ℓ(θ; d_i) = min_{c_j∈C} ( ‖θ − c_j‖₂/α − 1 ) · d_{i,j}. (14)

For the Lipschitz property, note that each loss function is (1/α)-Lipschitz because the gradient, wherever it is defined, is just ((θ − c_j)/(α‖θ − c_j‖₂)) · d_{i,j}. We prove this formally. Consider any θ, θ′ in B(0, 1) and a data point d_i ∈ D. We wish to show |ℓ(θ; d_i) − ℓ(θ′; d_i)| ≤ (1/α)‖θ − θ′‖₂. We can split the line segment from θ to θ′ into a sequence of line segments (θ_0, θ_1), (θ_1, θ_2), ..., (θ_{k−1}, θ_k), where θ_0 = θ and θ_k = θ′, such that for any line segment (θ_m, θ_{m+1}), the endpoints θ_m and θ_{m+1} share a minimizer in C of (‖θ − c_j‖₂/α − 1)·d_{i,j}.^{10} It now suffices to show |ℓ(θ_m; d_i) − ℓ(θ_{m+1}; d_i)| ≤ (1/α)‖θ_m − θ_{m+1}‖₂ for each m, since we then have:

|ℓ(θ; d_i) − ℓ(θ′; d_i)| ≤ Σ_{m=0}^{k−1} |ℓ(θ_m; d_i) − ℓ(θ_{m+1}; d_i)| ≤ (1/α) Σ_{m=0}^{k−1} ‖θ_m − θ_{m+1}‖₂ = (1/α)‖θ − θ′‖₂.

Let c_j be a shared minimizer of (‖θ − c_j‖₂/α − 1)·d_{i,j} for θ_m and θ_{m+1}. If d_{i,j} = 0, then trivially |ℓ(θ_m; d_i) − ℓ(θ_{m+1}; d_i)| ≤ (1/α)‖θ_m − θ_{m+1}‖₂. Otherwise, d_{i,j} = 1, and by the triangle inequality we have:

|ℓ(θ_m; d_i) − ℓ(θ_{m+1}; d_i)| = | ‖θ_m − c_j‖₂/α − ‖θ_{m+1} − c_j‖₂/α | ≤ (1/α)‖θ_m − θ_{m+1}‖₂.

Now suppose there is an (ε, δ)-differentially private algorithm A that, on input a non-convex function L and n data points {d_1, ..., d_n}, outputs a θ_priv such that

E_A[L(θ_priv; D)] − min_{θ∈B(0,1)} L(θ; D) = o( √(p log(1/δ)) / (nε) ), (15)

where D = {d_1, ..., d_n}.

^{10} In particular, for each c_j, let B_j be the set of points in B(0, 1) for which c_j is a minimizer of (‖θ − c_j‖₂/α − 1)·d_{i,j}. We can split the line segment from θ to θ′ at each point where it enters or leaves some B_j to obtain this sequence of segments; by this construction, each segment's endpoints are both in B_j for some j.
We will construct an algorithm that uses A as a subroutine and solves the top-selection problem with error o( √(log(s) log(1/δ)) / (nε) ), contradicting the lower bound of Theorem B.12.

Algorithm B:
• On input X = {x_1, ..., x_n}, invoke A on the loss function defined by eq. (14) and the data points X to get θ_priv as output.
• Output f(θ_priv).

Since the last step is post-processing, B is (ε, δ)-differentially private. We now show that if A outputs a θ_priv satisfying eq. (15), then j := f(θ_priv) satisfies eq. (13), leading to a contradiction. First note that for any c ∈ C and all θ such that ‖θ − c‖₂ ≤ α/2,

L(c; D) = −(1/n) Σ_{i=1}^n x_{i,f(c)} ≤ L(θ; D).

Therefore,

L(θ*; X) := min_{c∈C} L(c; X) = min_{c∈C} −(1/n) Σ_{i=1}^n x_{i,f(c)}.

This implies that f(θ*) = argmax_{1≤j≤s} (1/n) Σ_{i=1}^n x_{i,j}, which is exactly the top-selection problem. Therefore, eq. (15) implies eq. (13), because p log(1/α) ≤ log(s) ≤ p log(3/α) and α ∈ (0, 1/2) is a constant.
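The packing-based loss used in the reduction can be written down directly. The sketch below uses random points in the ball as stand-in "centers" (a true α-packing would additionally have pairwise distances ≥ α) and numerically checks the (1/α)-Lipschitz property established above, which holds because each branch of the min is (1/α)-Lipschitz and a pointwise minimum of Lipschitz functions is Lipschitz with the same constant.

```python
import numpy as np

def packing_loss(theta, d, centers, alpha):
    """l(theta; d) = min_j (||theta - c_j||_2 / alpha - 1) * d_j,
    the per-example non-convex loss from the lower-bound construction."""
    dists = np.linalg.norm(centers - theta, axis=1)
    return np.min((dists / alpha - 1.0) * d)

rng = np.random.default_rng(1)
p, s, alpha = 3, 8, 0.25
centers = rng.standard_normal((s, p))
centers /= np.maximum(np.linalg.norm(centers, axis=1, keepdims=True), 1.0)
d = rng.integers(0, 2, size=s).astype(float)  # one {0,1}^s data point

# empirical (1/alpha)-Lipschitz check on random pairs in the unit ball
for _ in range(1000):
    a, b = rng.standard_normal(p), rng.standard_normal(p)
    a /= max(np.linalg.norm(a), 1.0)
    b /= max(np.linalg.norm(b), 1.0)
    gap = abs(packing_loss(a, d, centers, alpha) - packing_loss(b, d, centers, alpha))
    assert gap <= np.linalg.norm(a - b) / alpha + 1e-9
```

At θ = c_j with d_j = 1 the loss attains its minimum possible value −1, matching the claim that minimizing L picks out the most frequent coordinate.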

H ON THE CONVERGENCE TIME OF LD UNDER (ε, δ)-DP

In this section we discuss the choice of the time for which LD is run in our approximate-DP results. In particular, we discuss why, for convex losses, the optimal runtime in Theorem F.1 is ∝ 1/p, and why, for the setting of parameters in Theorem F.1, the diffusion process does not come close to converging to its stationary distribution. We show that such properties are also present in DP-SGD (Bassily et al., 2014) when it is analyzed carefully, taking the learning rate and the number of time steps into account.

On the Optimal Time for DP-ERM with Approximate DP: For DP-ERM with a convex loss under approximate DP, our eventual choice of T is ‖C‖₂²/p, i.e., decreasing in the dimension if ‖C‖₂ is fixed. We note that this phenomenon is not unique to our analysis; e.g., for learning rate η > 0, the eventual value of Tη in the same setting in Bassily et al. (2019) is also roughly proportional to 1/p. This is perhaps counterintuitive, as in the non-noisy setting the amount of time one runs gradient descent is generally independent of the dimension. We can provide some intuition for this phenomenon from the perspective of the Langevin diffusion. As an example, consider the loss function L(θ) = ‖θ‖₂, and suppose C = B(0, 1), i.e., the ℓ₂ ball of radius 1 centered at the origin. For privacy, per Lemma C.1 we will set β ∝ 1/√T = √p. At distance proportional to p/β ∝ √p from the minimizer, the Langevin diffusion stops making progress towards the minimizer in expectation, since at this distance the progress towards the minimizer due to the drift −∇L(θ) is cancelled out by the movement in perpendicular directions due to the Brownian motion (see Figure 1 for a visualization of this phenomenon). This suggests that we only need to run the diffusion until θ_t reaches this distance from the minimizer, as past that point we do not expect the diffusion to make progress. Now, the distance √p is increasing with the dimension, and in turn the distance from the ball of radius √p centered at the origin to the boundary of C is decreasing, since ‖C‖₂ is a constant. So even if we start at the boundary of C (the worst case for this loss function), the distance the Langevin diffusion needs to travel to be within √p of the origin is decreasing with the dimension. The total distance we travel due to the gradient drift is roughly Tβ ∝ √T, and thus the time we need to run the diffusion to reach this ball also decreases with the dimension. In particular, once the dimension is large enough, the diffusion stays very close to the boundary of C with high probability; i.e., in this example, taking an arbitrary initial point and outputting it is about as good as running the diffusion for any amount of time.

On the Non-Convergence of DP-SGD/Finite-Time Diffusion: We show here that while both the Langevin diffusion and DP-SGD achieve optimal error rates in finite time, for the same parameter settings they do not converge to the stationary distribution. This in part explains why, e.g., the finite-time Langevin diffusion for convex losses achieves a much better privacy parameter than its stationary distribution. We use the same example as in the preceding discussion, i.e., L(θ) = ‖θ‖₂, C = B(0, 1). In the proof of Theorem F.1, we show that optimal error rates are achieved when the Langevin diffusion runs for time T = 1/(2p).
Asymptotically, we get the same bound if we use e.g. T = 1/(100p), so let us instead consider this choice of T. Since β ∝ 1/√T, and the movement due to the gradient drift is proportional to Tβ, the movement due to the gradient drift for our parameter choices decreases with p. The integral from 0 to T of the Brownian motion has distribution N(0, T·I_p); i.e., with probability 1 − e^{−Ω(p)}, the total movement due to Brownian motion is at most, say, 2√(Tp) = 1/5. In particular, we get that for all sufficiently large p, with probability 1 − e^{−Ω(p)}, the Langevin diffusion does not move more than a total distance of 1/3. If θ_0 is on the boundary, then the probability that a random point from the stationary distribution lies in a ball of radius 1/3 centered at θ_0 is e^{−Ω(p)}. So even though θ_T in expectation obtains the (asymptotic) optimal excess empirical loss, the total variation distance between θ_T and the stationary distribution is at least 1 − e^{−Ω(p)}, and similarly, e.g., any p-Wasserstein distance between θ_T and the stationary distribution is Ω(1) = Ω(‖C‖₂).
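The non-convergence discussion can be checked by simulation. The sketch below discretizes the diffusion for L(θ) = ‖θ‖₂ with T = 1/(100p) and β = 10√p (the √p scaling from the discussion; the constant 10 is illustrative), starts on the boundary of the unit ball, and measures the total displacement, which stays far below the ball's radius as p grows.

```python
import numpy as np

def ld_displacement(p, n_steps=500, seed=0):
    """Simulate d(theta) = -beta * grad L dt + sqrt(2) dW for L = ||theta||_2
    with T = 1/(100 p), beta = 10*sqrt(p), starting on the boundary of the
    unit ball; return ||theta_T - theta_0||_2 (a sketch, not the paper's proof)."""
    rng = np.random.default_rng(seed)
    T, beta = 1.0 / (100 * p), 10.0 * np.sqrt(p)
    dt = T / n_steps
    theta0 = np.zeros(p)
    theta0[0] = 1.0            # a boundary point of B(0, 1)
    theta = theta0.copy()
    for _ in range(n_steps):
        g = theta / max(np.linalg.norm(theta), 1e-12)
        theta = theta - beta * g * dt + np.sqrt(2.0 * dt) * rng.standard_normal(p)
    return np.linalg.norm(theta - theta0)

# displacement stays well below 1/3 across a range of dimensions
for p in (50, 200, 800):
    assert ld_displacement(p) < 1.0 / 3.0
```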

I SPLIT REGIMES FOR RÉNYI DIVERGENCE BOUNDS ON LANGEVIN DIFFUSION

In this section, we show that when the loss functions are strongly convex and smooth, we can bound the Rényi divergence between two diffusions using different loss functions by a quantity that converges to roughly the divergence between the stationary distributions of the diffusions. In doing so, we show that Rényi divergence bounds between two diffusions exist in two different regimes: for small T it is advantageous to analyze the privacy of LD as a noisy optimization algorithm, and for large T it is advantageous to analyze the privacy of LD as an algorithm that samples from approximately a Gibbs distribution.

I.1 "LONG TERM" RÉNYI DIVERGENCE BOUND

Under m-strong convexity and M-smoothness, using the results in Vempala & Wibisono (2019), we can also give a bound on the Rényi divergence that depends on closeness to the stationary distribution. Since we rely on the bounds in Vempala & Wibisono (2019), which apply to the unconstrained setting, we also focus only on the unconstrained setting here. To ensure that the initial Rényi divergence to the stationary distribution is finite, we assume that L(θ; D) and L(θ; D′) have minimizers θ_opt and θ′_opt, respectively, such that ‖θ_opt‖₂, ‖θ′_opt‖₂ ≤ R.

Lemma I.1. Suppose we sample θ_0 = θ′_0 from Θ_0 = N(0, (1/(βm)) I_p). Let Θ_t, Θ′_t be the resulting distributions of θ_t, θ′_t according to (1) using D, D′ respectively, and let P, P′ be the stationary distributions of (1) using D, D′ respectively. Then for any T ≥ t_0 := 2 log((α − 1) max{2, M/m}) and α ≥ 2:

R_α(Θ_T, Θ′_T) ≤ O( ( βmR²((M/m)² + α) + p(M/m) log(M/m) ) · exp( −(T − t_0)βm/(3α) ) ) + (4/3) R_{3α}(P, P′).

Proof. Since adding a constant to L does not affect the sampling problem, assume without loss of generality that P has density function P(θ) = exp(−βL(θ; D)). Let Q = N(θ_opt, (1/(βm)) I_p). By m-strong convexity of L, L(θ; D) − L(θ_opt; D) ≥ (m/2)‖θ − θ_opt‖₂², so we have:

exp((α − 1) R_α(P, Q)) = ∫_{R^p} P(θ)^α / Q(θ)^{α−1} dθ
= (2π/(βm))^{p(α−1)/2} ∫_{R^p} exp( α log P(θ) + (α − 1)(βm/2)‖θ − θ_opt‖₂² ) dθ
≤ P(θ_opt)^α (2π/(βm))^{p(α−1)/2} ∫_{R^p} exp( −(βm/2)‖θ − θ_opt‖₂² ) dθ
= P(θ_opt)^α (2π/(βm))^{pα/2}
≤ (βM/(2π))^{pα/2} (2π/(βm))^{pα/2} = (M/m)^{pα/2},

where the first inequality uses strong convexity, and the second inequality uses the fact that the βm-log-strongly-convex, βM-log-smooth distribution with mode θ_opt that has the largest density at θ_opt is N(θ_opt, (1/(βM)) I_p). The above bound thus implies that

R_α(P, Q) ≤ (pα/(2(α − 1))) log(M/m).
By a similar argument, but instead using M-smoothness and the fact that the βm-log-strongly-convex, βM-log-smooth distribution with mode θ_opt that has the smallest density at θ_opt is N(θ_opt, (1/(βm)) I_p), we have for α < M/(M − m):

exp((α − 1) R_α(Q, P)) = ∫_{R^p} Q(θ)^α / P(θ)^{α−1} dθ
= (βm/(2π))^{αp/2} ∫_{R^p} exp( −αβm‖θ − θ_opt‖₂²/2 − (α − 1) log P(θ) ) dθ
≤ (βm/(2π))^{αp/2} P(θ_opt)^{−(α−1)} ∫_{R^p} exp( (αβ(M − m) − βM) ‖θ − θ_opt‖₂²/2 ) dθ
≤ (βm/(2π))^{p/2} ∫_{R^p} exp( (αβ(M − m) − βM) ‖θ − θ_opt‖₂²/2 ) dθ
(*)= ( m / (M − α(M − m)) )^{p/2}
⇒ R_α(Q, P) ≤ (p/(2(α − 1))) log( m / (M − α(M − m)) ).

In (*), we use the assumption on α to ensure the integral does not diverge. In particular, choosing α = 1 + m/M satisfies the assumption and gives:

R_{1+m/M}(Q, P) ≤ (pM/(2m)) log(M/m).

The same bounds hold for P′, Q′ defined using D′ and θ′_opt instead of D, θ_opt. Using the weak triangle inequality for Rényi divergences (Fact B.10), letting Θ_0 = N(0, (1/(βm)) I_p), we have for any α > 1 and q > 1:

R_α(P′, Θ_0) ≤ ((α − 1/q)/(α − 1)) R_{qα}(P′, Q′) + R_{(qα−1)/(q−1)}(Q′, Θ_0).

Setting q = 2 and plugging in our bound above together with Lemma B.8, we get:

R_α(P′, Θ_0) ≤ (pα/(2(α − 1))) log(M/m) + βm(2α − 1)R²/2.

If α ≥ 2, this can be simplified to:

R_α(P′, Θ_0) ≤ p log(M/m) + βm(2α − 1)R²/2.

Then, following the proof of (Ganesh & Talwar, 2020, Lemma 20), we get that if α ≥ 2:

R_α(P′, Θ′_T) ≤ R_α(P′, Θ_0) · exp(−Tβm/α) ≤ ( p log(M/m) + βm(2α − 1)R²/2 ) · exp(−Tβm/α).

Similarly, we have:

R_α(Θ_0, P) ≤ ((α − 1/q)/(α − 1)) R_{qα}(Θ_0, Q) + R_{(qα−1)/(q−1)}(Q, P).

Letting q = max{2, M/m} and α = 1 + (q − 1)/q², and again plugging in our bound above and Lemma B.8:

R_{1+(q−1)/q²}(Θ_0, P) ≤ (1 + q) R_{q+1−1/q}(Θ_0, Q) + R_{1+1/q}(Q, P)
≤ (1 + q) R_{q+1−1/q}(Θ_0, Q) + R_{1+m/M}(Q, P)
≤ βm(q² + 2q − 1/q)R²/2 + (pM/(2m)) log(M/m).

Now, using (Vempala & Wibisono, 2019, Theorem 2 and Lemma 14), we have that for any α ≥ 2 and T ≥ t_0 := 2 log((α − 1)q):

R_α(Θ_T, P) ≤ R_α(Θ_{t_0}, P) · exp( −2(T − t_0)βm/α ) ≤ ( (q − 1)/(q²(1 − 1/α)) ) · R_{1+(q−1)/q²}(Θ_0, P) · exp( −2(T − t_0)βm/α ).
To simplify, we use the facts that q ≥ 2 and α ≥ 2, in conjunction with our bound on R_{1+(q−1)/q²}(Θ_0, P), to upper bound this by:

( βm(q² + 2q − 1/q)R²/2 + (pM/(2m)) log(M/m) ) · exp( −(T − t_0)βm/α ).

We can now use the weak triangle inequality (Fact B.10) twice, together with monotonicity of Rényi divergences (Fact B.6), to directly prove a divergence bound between Θ_T and Θ′_T for α ≥ 2:

R_α(Θ_T, Θ′_T) ≤ ((α − 1/3)/(α − 1)) R_{3α}(Θ_T, P) + ((3α/2 − 1)/(3α/2 − 3/2)) R_{3α−1}(P, P′) + R_{3α−2}(P′, Θ′_T)
≤ (5/3) R_{3α}(Θ_T, P) + (4/3) R_{3α}(P, P′) + R_{3α}(P′, Θ′_T). (16)

Substituting the bounds on R_{3α}(Θ_T, P), R_{3α}(P, P′), and R_{3α}(P′, Θ′_T), we have the bound

(16) ≤ ( βmR² · (max{30, 5(M/m)² + 10M/m} + 18α − 3)/6 + ((5pM/m + 3p)/6) log(M/m) ) · exp( −(T − t_0)βm/(3α) ) + (4/3) R_{3α}(P, P′).

This completes the proof of Lemma I.1.

One can modify the proof so that the bound converges to R_α(P, P′) rather than (4/3) R_{3α}(P, P′). To do so, note that both the leading 4/3 and the order 3α arise from applying Fact B.10 in (16) with a fixed parameter q. By instead applying Fact B.10 with a parameter q depending on T, we can replace (4/3) R_{3α}(P, P′) with an expression c(T) R_{α(T)}(P, P′) for some functions c(T), α(T) approaching 1 and α, respectively. The cost of doing this is that the coefficient 5/3 in front of the first term in (16), and the orders 3α in the first and last terms of (16), become larger over time; however, for a fixed order α these terms decay as exp(−T/α), so if we choose q in our application of Fact B.10 so that these values grow sub-linearly, the first and last terms in (16) will still decay as exp(−T^c) for some 0 < c < 1. This modification heavily complicates the proof and the form of the final bound, so we omit a proof including this modification and instead opt for the simpler presentation and weaker bound here.

I.2 SWITCHING BETWEEN "SHORT" AND "LONG" TERM

If both loss functions are m-strongly convex and we use a fixed β_t = β, then Chourasia et al.
(Chourasia et al., 2021, Corollary 1) implies the following:

Lemma I.2. Suppose ‖∇L(θ; D) − ∇L(θ; D′)‖₂ ≤ ∆ for all θ, and L(θ; D), L(θ; D′) are both m-strongly convex with respect to θ. Then if θ_0 and θ′_0 are sampled from Θ_0 = N(0, (1/(βm)) I_p), for all α ≥ 1 and T ≥ 0:

R_α(Θ_T, Θ′_T) ≤ (αβ∆²/m) · (1 − e^{−βmT/2}).

When we have m-strong convexity, M-smoothness, and a bound ∆ on the ℓ₂-norm of the difference between the gradients of L(θ; D) and L(θ; D′), both Lemma I.1 and Lemma I.2 give a bound on the Rényi divergence. Intuitively, Lemma I.2 bounds the divergence between the noisy gradient flows. It is therefore initially 0, when there is no history of the noisy gradient flows and the distributions are identical; however, it worsens as T increases, since the distribution at time T becomes more dependent on the difference between the histories of the noisy gradient flows and less dependent on the shared initial distribution. On the other hand, Lemma I.1 is effectively a bound on the "sampling error" between the finite-time distribution and the stationary distribution of each diffusion, plus the divergence between the stationary distributions of the two diffusions. The former improves as T increases, and the latter is fixed. In turn, as long as (4/3) R_{3α}(P, P′) ≤ αβ∆²/m — that is, the asymptotic bound of Lemma I.2 is worse than the asymptotic bound of Lemma I.1 — since the sampling error goes to 0 as T goes to infinity, there is a shift in regimes where, roughly speaking, the divergence due to sampling error becomes smaller than the divergence between the noisy gradient flows. At this point, we should use the "long term" bound instead of the "short term" bound.
In particular, this regime shift occurs roughly when:

O( βmR²((M/m)² + α) + p(M/m) log(M/m) ) · exp( −(T − t_0)βm/(3α) ) ≤ αβ∆²/m − (4/3) R_{3α}(P, P′).

Rearranging, we get that we shift regimes at roughly a time T* such that

T* ≈ 2 log((α − 1) max{2, M/m}) + Θ( (α/(βm)) · log( ( βmR²((M/m)² + α) + p(M/m) log(M/m) ) / ( αβ∆²/m − (4/3) R_{3α}(P, P′) ) ) ).

Again, we remark that one can modify Lemma I.1 so that the bound approaches R_α(P, P′) instead of (4/3) R_{3α}(P, P′). For this asymptotic bound, the inequality R_α(P, P′) ≤ αβ∆²/m is always satisfied; i.e., either the limiting bound of Lemma I.2 is tight for the stationary distributions, or such a regime shift always exists.
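The regime shift can be visualized numerically. The sketch below plugs illustrative constants into bounds with the same shape as Lemmas I.1 and I.2 (the leading constants and the stationary divergence R_{3α}(P, P′) are placeholders, not the paper's exact expressions) and locates the crossover time T*, after which the "long term" bound is the better one.

```python
import numpy as np

def short_term_bound(T, alpha, beta, m, Delta):
    """Lemma I.2-shaped bound: alpha*beta*Delta^2/m * (1 - exp(-beta*m*T/2))."""
    return alpha * beta * Delta**2 / m * (1.0 - np.exp(-beta * m * T / 2.0))

def long_term_bound(T, alpha, beta, m, M, p, R, renyi_stationary):
    """Lemma I.1-shaped bound: a leading factor decaying as
    exp(-(T - t0)*beta*m/(3*alpha)) plus (4/3) * R_{3alpha}(P, P')."""
    t0 = 2.0 * np.log((alpha - 1.0) * max(2.0, M / m))
    lead = beta * m * R**2 * ((M / m) ** 2 + alpha) + p * (M / m) * np.log(M / m)
    return lead * np.exp(-(T - t0) * beta * m / (3.0 * alpha)) + 4.0 / 3.0 * renyi_stationary

params = dict(alpha=2.0, beta=5.0, m=1.0)
Ts = np.linspace(0.0, 200.0, 20_000)
short = short_term_bound(Ts, Delta=0.1, **params)
longb = long_term_bound(Ts, M=4.0, p=10, R=1.0, renyi_stationary=0.01, **params)
cross = Ts[np.argmax(longb < short)]  # first time the long-term bound wins
```

For these placeholder values, the short-term bound saturates at αβ∆²/m = 0.1 while the long-term bound decays towards (4/3)·0.01, so a crossover exists.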

J DEFERRED PROOFS FROM SECTION F

Proof of Theorem F.1. Let φ_x(θ_t) = (1/(2β))‖θ_t − x‖₂² be the potential with respect to a fixed x ∈ C. We will write dφ_x(θ_t) as the sum of two terms: dφ^A_x(θ_t), which is what dφ_x(θ_t) would be if we used eq. (1) instead of eq. (2), and dφ^B_x(θ_t), which is simply the difference dφ_x(θ_t) − dφ^A_x(θ_t). For example, dφ^B_x(θ_t) is 0 when θ_t is not on the boundary of C. Since C is convex, if θ is a point on the boundary of C and t is a normal vector of C at θ, then ⟨θ − x, t⟩ ≥ 0. By the definition of Π_{C,θ_t}, this implies dφ^B_x(θ_t) ≤ 0 always, so dφ_x(θ_t) ≤ dφ^A_x(θ_t). By Ito's lemma (Lemma B.1), we have the following:

dφ^A_x(θ_t) = (p/β) dt − ⟨θ_t − x, ∇L(θ_t; D)⟩ dt + (√2/β) ⟨θ_t − x, dW_t⟩. (17)

Combining eq. (17) with the fact that dφ_x(θ_t) ≤ dφ^A_x(θ_t), we get:

dφ_x(θ_t) ≤ ( p/β − ⟨θ_t − x, ∇L(θ_t; D)⟩ ) dt + (√2/β) ⟨θ_t − x, dW_t⟩. (18)

Furthermore, by linearity of expectation we have the following:

E[dφ_x(θ_t)] ≤ (p/β) dt − E[⟨θ_t − x, ∇L(θ_t; D)⟩] dt + (√2/β) E[⟨θ_t − x, dW_t⟩]
⇔ E[⟨θ_t − x, ∇L(θ_t; D)⟩] dt ≤ (p/β) dt + (√2/β) E[⟨θ_t − x, dW_t⟩] − E[dφ_x(θ_t)] = (p/β) dt − E[dφ_x(θ_t)]. (19)

The last equality in eq. (19) follows from the observation that E[⟨θ_t − x, dW_t⟩] = E[⟨θ_t, dW_t⟩] − E[⟨x, dW_t⟩] = E[⟨θ_t, dW_t⟩] = 0 (eqs. (20)–(22)), where we use the law of iterated expectations and the fact that in the Ito integral, dW_t is independent of {W_τ}_{0≤τ≤t}.

By convexity of the loss function L we have:

L(θ_t; D) − L(x; D) ≤ ⟨θ_t − x, ∇L(θ_t; D)⟩. (23)

Combining eq. (19) and eq. (23) we have the following:

E[(L(θ_t; D) − L(x; D)) dt] ≤ (p/β) dt − E[dφ_x(θ_t)]. (24)

Let a ≤ b. Integrating eq. (24) from a to b, we have the following:

E[ ∫_a^b (L(θ_t; D) − L(x; D)) dt ] ≤ p(b − a)/β + E[ (1/(2β))‖θ_a − x‖₂² − (1/(2β))‖θ_b − x‖₂² ], (25)

by the definition of the potential φ_x(θ_t) = (1/(2β))‖θ_t − x‖₂². Consider two non-negative real numbers k and γ such that γ < k.
Define

S_{k,γ} := (1/(k + γ)) ∫_{T−k}^{T} L(θ_t; D) dt  and  S̄_k := (1/k) ∫_{T−k}^{T} L(θ_t; D) dt, (30)

so that S̄_k = lim_{γ→0} S_{k,γ} is the average value of L(θ_t; D) over the interval [T − k, T]; in particular, S̄_0 := lim_{k→0} S̄_k = L(θ_T; D). Dividing eq. (25) by b − a and rearranging,

−E[L(x; D)] ≤ p/β + (1/(2β(b − a))) E[‖θ_a − x‖₂²] − (1/(b − a)) E[ ∫_a^b L(θ_t; D) dt ].

Setting a = T − k and b = T,

−E[L(x; D)] ≤ p/β + (1/(2βk)) E[‖θ_{T−k} − x‖₂²] − ((k + γ)/k) E[S_{k,γ}].

We also have

E[k · S_{k−γ,γ}] = E[(k + γ) S_{k,γ}] − E[ ∫_{T−k}^{T−(k−γ)} L(θ_t; D) dt ]. (26)

Applying the previous inequality with x = θ_t and integrating over t ∈ [T − k, T − (k − γ)] (an interval of length γ), this implies that

−E[ ∫_{T−k}^{T−(k−γ)} L(θ_t; D) dt ] ≤ pγ/β + (1/(2βk)) E[ ∫_{T−k}^{T−(k−γ)} ‖θ_{T−k} − θ_t‖₂² dt ] − (1 + γ/k) · γ · E[S_{k,γ}]
≤ pγ/β + γ‖C‖₂²/(2βk) − (1 + γ/k) · γ · E[S_{k,γ}]. (27)

Plugging eq. (27) into eq. (26), we have the following:

E[k · S_{k−γ,γ}] ≤ E[(k − γ²/k) S_{k,γ}] + pγ/β + γ‖C‖₂²/(2βk).

Dividing by k, we get

E[S_{k−γ,γ}] ≤ E[(1 − γ²/k²) S_{k,γ}] + pγ/(βk) + γ‖C‖₂²/(2βk²).

Rearranging the terms, we get

E[ S_{k−γ,γ} − (1 − γ²/k²) S_{k,γ} ] ≤ pγ/(βk) + γ‖C‖₂²/(2βk²).

This gives the following set of implications:

E[ lim_{γ→0} (S_{k−γ,γ} − S_{k,γ})/γ ] ≤_(*) lim_{γ→0} E[ (S_{k−γ,γ} − S_{k,γ})/γ ] ≤ p/(βk) + ‖C‖₂²/(2βk²)
⇒ (d/dc) E[S̄_{k−c}] ≤ p/(βk) + ‖C‖₂²/(2βk²)
⇒ E[S̄_0] ≤ E[S̄_k] + p/β + ‖C‖₂²/(2βk). (28)

For the inequality (*): since θ_t ∈ C for all t, and L is L-Lipschitz, L(θ_t; D) is lower and upper bounded by constants for all t. Using eq. (30), this implies that (S_{k−γ,γ} − S_{k,γ})/γ is lower bounded by a constant for all γ < k. In other words, we can apply Fatou's lemma to upper bound the expectation of the limit by the limit of the expectations, hence the inequality (*). Now, setting k = T in eq. (28), we get:

E[S̄_0] ≤ E[S̄_T] + p/β + ‖C‖₂²/(2βT). (31)

If we set a = 0, b = T in eq. (25) and then divide by T, we have:

E[S̄_T] − L(x; D) ≤ p/β + E[ (1/(2βT))‖θ_0 − x‖₂² − (1/(2βT))‖θ_T − x‖₂² ] ≤ p/β + ‖C‖₂²/(2βT). (32)

Combining eq. (31) and eq. (32) and setting x = θ*, where θ* is any minimizer over C, we get the bound of eq. (33) on E[L(θ_T; D) − L(θ*; D)].

We next turn to Theorem F.2, whose potential φ_t = e^{mB_t}‖θ_t − θ*‖₂² satisfies the bound of eq. (38) on E[dφ_t]. Integrating eq. (38) from t = 0 to T and noting that φ_t ≥ 0 for all t, we get the corresponding bound on the weighted-average excess loss. We now determine what choice of a, as a function of T, ensures privacy when β_t = t^a. Our final choice of a will satisfy a ≥ 0 for all sufficiently large n.
Recall from Lemma C.1 that, for (ε, δ)-differential privacy, since we have L-Lipschitzness it suffices if:

4L² log(1/δ) ∫_0^T β_t² dt / (εn²) ≤ ε ⇔ ∫_0^T t^{2a} dt ≤ ε²n² / (4L² log(1/δ)) ⇔ T^{2a+1} ≤ (2a + 1) ε²n² / (4L² log(1/δ)). (43)

Note that we use the assumption a ≥ 0 to show that the integral equals T^{2a+1}/(2a+1) rather than diverging. For eq. (43) to hold, since a ≥ 0 gives 2a + 1 ≥ 1, letting

a = ( log_{max{2,T}}( εn / (2L√(log(1/δ))) ) − 1 ) / 2

suffices. We will eventually choose T to be at most 2 for sufficiently large n, which implies that a ≥ 0 for all sufficiently large n, as promised. We now wish to evaluate the integral in eq. (42); the last integral in eq. (44) can be upper bounded as:

∫_0^T exp(c√t) dt = (2/c²) ∫_0^{c√T} x e^x dx = 2( e^{c√T}(c√T − 1) + 1 )/c² ≤ 2√T ( e^{c√T} − 1 )/c, (46)

where the last inequality follows from the inequality e^x ≥ 1 + x. We also have that the denominator of eq. (42) is equal to e^{mB_T} − 1 = e^{c√T} − 1. Plugging eq. (44) and eq. (46) into eq. (42) yields eq. (47); substituting the expression for c then gives eq. (48). Note that φ_0 = ‖θ_0 − θ*‖₂² ≤ ‖C‖₂² ≤ L²/m². So, to complete the proof, we can choose

R = p m² ε² n² φ_0 / (2L² log(1/δ)) ≤ p ε² n² / (2 log(1/δ)),

which causes the second term in eq. (48) to be larger than the first term and gives the theorem statement.
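The calibration of a can be checked numerically. The sketch below implements the choice of a above (the base-of-logarithm reading of the formula is a reconstruction from the garbled source, so treat it as an assumption) and verifies the sufficient privacy condition T^{2a+1} ≤ (2a+1)ε²n²/(4L² log(1/δ)).

```python
import math

def calibrate_a(T, eps, delta, n, L):
    """Choose a = (log_base(eps*n / (2*L*sqrt(log(1/delta)))) - 1)/2 with
    base = max(2, T), then verify the sufficient privacy condition
    T^(2a+1) <= (2a+1) * eps^2 * n^2 / (4 * L^2 * log(1/delta))."""
    base = max(2.0, T)
    X = eps * n / (2.0 * L * math.sqrt(math.log(1.0 / delta)))
    a = (math.log(X, base) - 1.0) / 2.0
    ok = T ** (2 * a + 1) <= (2 * a + 1) * (eps * n) ** 2 / (4 * L**2 * math.log(1.0 / delta))
    return a, ok

a, ok = calibrate_a(T=2.0, eps=1.0, delta=1e-6, n=10_000, L=1.0)
```

The check works because T^{2a+1} ≤ base^{2a+1} = X ≤ X² when X ≥ 1, and the right-hand side is (2a+1)X² ≥ X².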



Footnotes:
• We focus only on Lipschitz losses and constraint sets bounded in the ℓ₂-norm; the non-Euclidean settings (Talwar et al., 2015; Asi et al., 2021a; Bassily et al., 2021) are beyond the scope of this work.
• There is a setting of parameters obtaining the optimal (asymptotic) excess empirical loss, a loss function L, and a constraint set C such that the total variation (1-Wasserstein, resp.) distance to the stationary distribution can be as large as 1 − o(1) (Ω(‖C‖₂), resp.).
• A Lyapunov function maps scalar or vector variables to real numbers (R^p → R) and decreases along the solution trajectory of an SDE. Lyapunov functions are primarily used in ordinary differential equations to prove stability, and in continuous optimization to prove convergence.
• The lower bound in Bassily et al. (2019) is technically for (ε, δ)-DP, but can be interpreted as one for ε-DP up to slack factors of polylog(n).
• The authors of Gopi et al. (2022) also formally acknowledged the independence claim.
• In particular, we can always ensure twice differentiability by convolving the loss function with the bump kernel (Kifer et al., 2012), and then make the smoothness parameter finite but arbitrarily large, which does not affect the Lipschitzness.
• This follows since ℓ(θ) is m-strongly convex iff ℓ(θ) − (m/2)‖θ‖₂² is convex. By m-strong convexity of ℓ in C, ℓ(θ) − (m/2)‖θ‖₂² is convex in C, and its convexity over R^p then follows from, e.g., Theorem 4.1 in Yan (2012).
• The dataset D = {d_1, ..., d_n} is such that d_i ∈ {0, 1}^s for all i ∈ [n], each d_{i,j} is independent (conditioned on P) with E[d_{i,j}] = P_j for all i ∈ [n] and j ∈ [s]; here P is the distribution defined in Theorem B.12.



Chourasia et al. (2021); Altschuler & Talwar (2022); Ryffel et al. (2022): Recently, Chourasia et al. (2021) studied discretizations of the LD algorithm as DP-(Stochastic) Gradient Langevin Dynamics (DP-SGLD), and Ryffel et al. (2022) extended these results of Chourasia et al. (2021).

Asi et al. (2021b): Asi et al. (2021b) showed DP-SCO bounds under a general condition called κ-growth via a custom localization-based algorithm. A corollary of their results is that under ε-DP one can obtain an excess population risk of O

Li et al. (2017) and Mandt et al. (2017) independently gave the first asymptotic convergence results for stochastic gradient descent and the momentum method as approximations to stochastic differential equations, while Vollmer et al. (2016) gave a non-asymptotic bound on the convergence of the stochastic gradient LD algorithm by using Poisson equations.

Fact B.6 (Monotonicity (van Erven & Harremos, 2014, Theorem 3)). For any distributions P, Q and 0 < α ≤ α′, R_α(P, Q) ≤ R_{α′}(P, Q).

Fact B.7 (Post-Processing (van Erven & Harremos, 2014, Theorem 9)). For any sample spaces X, Y, distributions P, Q over X, and any function f : X → Y, R_α(f(P), f(Q)) ≤ R_α(P, Q), where f(P) denotes the distribution of f(x) for x ∼ P.

∆ := max_{r∈R} |u(D, r) − u(D′, r)|. If R is continuous, we instead sample from the distribution with pdf:

p_D(r) = e^{−εℓ(D,r)/(2∆)} / ∫_{r′∈R} e^{−εℓ(D,r′)/(2∆)} dr′.

Differentially private stochastic gradient descent (DP-SGD) (Bassily et al., 2014).
Require: data set D = {d_1, ..., d_n}; loss function ℓ : C × D → R; gradient ℓ₂-norm bound L; constraint set C ⊆ R^p; number of iterations T; noise variance σ²; learning rate η.
1: Choose any point θ_0 ∈ C.
2: for t = 0, ..., T − 1 do
3: Sample d_t uniformly at random from D, with replacement.
4:
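The DP-SGD loop described above can be sketched as follows. The noise calibration (σ as a function of ε, δ, T, n) is deliberately omitted here, so this sketch is not by itself differentially private without that calibration; the helper names and the toy usage are illustrative.

```python
import numpy as np

def dp_sgd(data, grad_loss, proj, theta0, T, sigma, eta, L, rng):
    """Sketch of the DP-SGD loop (Bassily et al., 2014 style): at each step,
    sample one example uniformly with replacement, take a step along the
    clipped per-example gradient plus Gaussian noise, and project onto C."""
    theta = np.array(theta0, dtype=float)
    n = len(data)
    for _ in range(T):
        d = data[rng.integers(n)]           # step 3: uniform sample, with replacement
        g = grad_loss(theta, d)
        norm = np.linalg.norm(g)
        if norm > L:                        # enforce the gradient l2-norm bound L
            g = g * (L / norm)
        noise = sigma * rng.standard_normal(theta.shape)
        theta = proj(theta - eta * (g + noise))
    return theta

# toy usage: mean estimation with l(theta; d) = ||theta - d||^2 / 2 on the unit ball
rng = np.random.default_rng(3)
data = rng.normal(0.3, 0.1, size=(500, 5))
proj = lambda th: th / max(np.linalg.norm(th), 1.0)
theta_hat = dp_sgd(data, lambda th, d: th - d, proj, np.zeros(5),
                   T=2000, sigma=0.05, eta=0.01, L=1.0, rng=rng)
```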

Figure 1: An abstract visualization of how the gradient drift and Brownian motion counteract each other in Langevin diffusion for L(θ; D) = ‖θ‖₂, where C = B(0, 1) is the unit Euclidean ball centered at 0. The blue arrow shows the integral of the gradient drift over a time interval η; the red arrow shows the integral of the Brownian motion, perpendicular to the gradient drift, over the same interval. In the left picture, we can see that when we are close to the minimizer (the center of the circle), Brownian motion has a stronger effect than the gradient drift. In the right picture, we see that when we are far from the minimizer, the opposite is true. In the middle picture, we see the equilibrium point, where the two counteract each other. As the dimension increases, the ratio of the red arrow's length to the blue arrow's length increases, and so the equilibrium point gets further from the minimizer.



It remains to use the bound on E[L(θ_T; D) − L(θ*; D)] and to choose β, T to minimize the above bound while satisfying privacy. By setting ∆ = 2L/n in Lemma C.1 and setting α = 1 + 2√(log(1/δ))/ε in Fact B.5, we get that, as long as 2√(log(1/δ))/ε ≥ 1, to satisfy (ε, δ)-differential privacy it suffices if 4 log(1/δ) L² β² T ≤ ε²n². Plugging the resulting choice of β into eq. (33), we get E[L(θ_T; D) − L(θ*; D)] = O( pL√(T log(1/δ))/(εn) + ‖C‖₂²/(βT) ).

Integrating eq. (38) and using de^{mB_t} = mβ_t e^{mB_t} dt, we have

∫_0^T ( E[L(θ_t; D)] − L(θ*; D) ) de^{mB_t} ≤ mφ_0 + mp ∫_0^T e^{mB_t} dt. (40)

By convexity of L and Jensen's inequality applied to the e^{mB_t}-weighted average θ_priv,

( e^{mB_T} − 1 ) L(θ_priv; D) ≤ ∫_0^T L(θ_t; D) de^{mB_t}. (41)

Plugging eq. (41) into eq. (40) and using linearity of expectation, we have

E[L(θ_priv; D) − L(θ*; D)] ≤ ( mφ_0 + mp ∫_0^T e^{mB_t} dt ) / ( e^{mB_T} − 1 ). (42)

Plugging eq. (44) and eq. (46) into eq. (42), we obtain a bound (eq. (47)) in terms of c and T; plugging in the expression for c, eq. (47) becomes a two-term bound (eq. (48)) whose second term is O( (R + 1) L² log(1/δ) log(R + 1) / (m ε² n²) ).

Proof of Theorem F.4. We again choose β = εn/(L√(8T log(1/δ))), as in Theorem F.1. Then, combining Lemma F.3 with eq. (35), we get the following bound on E_{θ_priv}[Risk_SCO(θ_priv)]:

Summary of our upper bounds for DP-ERM and DP-SCO. The bounds marked in blue were not known even via different algorithms. Every bound is tight up to polylog(p, ε, n) factors. Convex: class of convex bounded Lipschitz losses; m-SC: convex with ∇²ℓ(θ; ·) ⪰ mI; Nonconvex: class of losses with

the output of the ε_i-DP exponential mechanism on C_i. For appropriate choices of ε_i, r_i, we get the following ERM bound for θ_k, the output of the exponential mechanism on the final constraint set:

show the empirical-risk-to-population-risk transformation by proving uniform stability bounds for both ε-DP and (ε, δ)-DP. For the ε-DP setting, the standard transformation from empirical risk to population risk in Bassily et al. (2014) (either via ℓ₂-regularization, or via the stability property of DP itself) leads to a bound that is sub-optimal by a

..., d_n} and a 1-Lipschitz non-convex function L such that for every p ∈ N, there is no (ε, δ)-differentially private algorithm A that outputs θ_priv such that Risk_ERM(θ_priv) ∈ o( √(p log(1/δ)) / (nε) ).

Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110. PMLR, 2017.
Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.
Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), pp. 94–103. IEEE, 2007.
Panayotis Mertikopoulos and Mathias Staudigl. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, 2018.

Chourasia et al. (2021) and Ryffel et al. (2022) study discretizations of the LD algorithm as DP-(Stochastic) Gradient Langevin Dynamics (DP-SGLD). They show that under smoothness and strong convexity of the loss function L(θ; D), the privacy cost of DP-SGLD converges to a finite stationary value, even as the number of time steps goes to ∞. Wang et al. (2019) used the result of Raginsky et al. (2017) to prove a sub-optimal excess empirical risk of

Bassily et al. (2014) for convex losses, and we quote the result for completeness. Notice that the privacy guarantee in Theorem D.1 does not rely on convexity. Input: loss function L, constraint set C ⊂ R^p with bounded diameter, Lipschitz constant L, number of iterations k, privacy parameter ε, data set D of n samples.

D) in C, and ∂C is the boundary of C. By convexity of C, G is contained in C. Furthermore, G is simply C rescaled by R in all directions around the point θ * (i.e., if θ * were the origin, G would simply be RC), so


Proof of Theorem F.2. Note that β_t = dB_t/dt by definition, and let φ_t = e^{mB_t}‖θ_t − θ*‖₂². Similarly to eq. (19), by Ito's lemma we have:

E[dφ_t] ≤ β_t e^{mB_t} E[ m‖θ_t − θ*‖₂² + ⟨θ* − θ_t, ∇L(θ_t; D)⟩ ] dt + p e^{mB_t} dt. (36)

By 2m-strong convexity of L,

⟨θ* − θ_t, ∇L(θ_t; D)⟩ ≤ L(θ*; D) − L(θ_t; D) − m‖θ_t − θ*‖₂². (37)

Note that de^{mB_t} = mβ_t e^{mB_t} dt. Then, plugging eq. (37) into eq. (36), we get:

E[dφ_t] ≤ β_t e^{mB_t} ( L(θ*; D) − E[L(θ_t; D)] ) dt + p e^{mB_t} dt, (38)

and hence ( E[L(θ_t; D)] − L(θ*; D) ) de^{mB_t} ≤ −m E[dφ_t] + mp e^{mB_t} dt.


To minimize the right-hand side of the above inequality, we choose T = Θ(min{...}). The result follows.

