ON THE UNIVERSALITY OF LANGEVIN DIFFUSION FOR PRIVATE EUCLIDEAN (CONVEX) OPTIMIZATION

Anonymous

Abstract

In this paper, we revisit the problems of differentially private empirical risk minimization (DP-ERM) and differentially private stochastic convex optimization (DP-SCO). We show that a well-studied continuous-time algorithm from statistical physics, called Langevin diffusion (LD), simultaneously provides optimal privacy/utility trade-offs for both DP-ERM and DP-SCO, under both ε-DP and (ε, δ)-DP, for both convex and strongly convex loss functions. We also establish new time- and dimension-independent uniform stability properties of LD, from which we derive the corresponding optimal excess population risk guarantees under ε-DP. An important attribute of our DP-SCO guarantees for ε-DP is that they match the optimal non-private bounds as ε → ∞.

1. INTRODUCTION

Over the last decade, there has been significant progress in providing tight upper and lower bounds for differentially private empirical risk minimization (DP-ERM) (Chaudhuri et al., 2011; Kifer et al., 2012; Bassily et al., 2014; Song et al., 2013; McMahan et al., 2017; Smith et al., 2017; Wu et al., 2017; Iyengar et al., 2019; Song et al., 2020; Chourasia et al., 2021) and differentially private stochastic convex optimization (DP-SCO) (Bassily et al., 2019; Feldman et al., 2020; Bassily et al., 2020; Kulkarni et al., 2021; Gopi et al., 2022; Asi et al., 2021b), both in the ε-DP setting and in the (ε, δ)-DP setting.¹ While tight bounds are known for both DP-ERM and DP-SCO in the (ε, δ)-DP setting (Bassily et al., 2014; 2019), the ε-DP setting (i.e., where δ = 0) is much less understood. First, to the best of our knowledge, tight DP-SCO bounds are not known under ε-DP. (Throughout, when we say a bound is tight for a problem, we implicitly require that it recovers the optimal non-private bound for the same task, including polylogarithmic factors, as ε → ∞.) Second, the algorithms for DP-ERM and DP-SCO in the ε-DP setting are inherently different from those in the (ε, δ)-DP setting: while all the algorithms in the (ε, δ)-DP setting are based on DP variants of gradient descent (Bassily et al., 2014; 2019; Feldman et al., 2020; Bassily et al., 2020), the best algorithms for ε-DP combine the exponential mechanism (McSherry & Talwar, 2007) with output perturbation (Chaudhuri et al., 2011). Third, we know that moving from ε-DP to (ε, δ)-DP yields, for convex problems, a polynomial improvement in the error bounds in terms of the model dimensionality p; it is unknown whether such an improvement is even possible when the loss functions are non-convex. In this work, we close these gaps in our understanding of DP-ERM and DP-SCO via the following contributions.

1. We provide a unified framework for DP-ERM/DP-SCO via an information-theoretic tool called Langevin diffusion (LD) (Langevin, 1908; Lemons & Gythiel, 1997), defined in eq. (1) and eq. (2), which under an appropriate choice of parameters interpolates between optimal/tight utility bounds for both DP-ERM and DP-SCO, under both ε-DP and (ε, δ)-DP.

2. We provide tight DP-SCO bounds for both convex and strongly convex losses under ε-DP. To achieve these bounds, we show uniform stability of the exponential mechanism on strongly convex losses (which, to the best of our knowledge, was unknown prior to our work). Notably, our proof of uniform stability is almost immediate given the unified framework from item 1.

3. We provide a lower bound showing that if the loss functions are non-convex, it is not possible to obtain any polynomial improvement in the error in terms of the model dimensionality when we shift from ε-DP to (ε, δ)-DP. This is in sharp contrast to the convex setting.

4. Along the way, we provide a set of results that may be of independent interest: (a) a simple Rényi divergence bound between two Langevin diffusions run on loss functions with a bounded gradient difference; (b) for strongly convex and smooth losses, a Rényi divergence bound between two Langevin diffusions that approaches the Rényi divergence between their stationary distributions; (c) improved analyses for ε-DP-ERM: for strongly convex losses we improve the bound by log factors via a better algorithm, and for non-convex losses we remove the assumption of Bassily et al. (2014) that the constraint set contains a ball of radius r; (d) a last-iterate analysis of Langevin diffusion as an optimization algorithm, using continuous analogs of techniques in Shamir & Zhang (2013). To the best of our knowledge, this is the first such analysis tailored to a continuous-time DP algorithm. Our work initiates a systematic study of DP continuous-time optimization.
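As a purely illustrative sketch of the continuous-time dynamics underlying our framework, the following Python snippet runs an Euler–Maruyama discretization of Langevin diffusion, dθ_t = −∇L(θ_t; D) dt + √(2/β) dW_t, on a toy strongly convex empirical loss whose minimizer is the data mean. The step size, temperature β, and loss are illustrative choices only; they are not the calibration used in our analysis, and the sketch omits the projection onto C and all privacy accounting.

```python
import numpy as np

def grad_empirical_loss(theta, data):
    """Gradient of L(theta; D) = (1/n) sum_i 0.5*||theta - d_i||^2,
    which simplifies to theta - mean(data)."""
    return theta - data.mean(axis=0)

def langevin_diffusion(data, beta, eta=1e-2, steps=5000, seed=0):
    """Euler-Maruyama discretization of
        d theta_t = -grad L(theta_t; D) dt + sqrt(2/beta) dW_t,
    with step size eta: each step injects Gaussian noise of std sqrt(2*eta/beta)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(data.shape[1])
    for _ in range(steps):
        noise = rng.normal(size=theta.shape) * np.sqrt(2.0 * eta / beta)
        theta = theta - eta * grad_empirical_loss(theta, data) + noise
    return theta

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5))

# Large beta (low temperature): the stationary distribution exp(-beta * L)
# concentrates near the ERM minimizer, here the sample mean of the data.
theta_hat = langevin_diffusion(data, beta=1e4)
print(np.linalg.norm(theta_hat - data.mean(axis=0)))  # small
```

For this 1-strongly-convex loss the stationary distribution is a Gaussian of covariance (1/β)·I around the minimizer, so larger β trades more optimization accuracy for less noise; in the private setting this knob is exactly what interpolates between utility and privacy.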
We believe this may have ramifications for the design of discrete-time DP optimization algorithms, analogous to the non-private setting, where continuous-time dynamical viewpoints have helped in designing new algorithms, including the celebrated mirror descent (see the discussion in Chapter 3 of Nemirovskij & Yudin (1983)) and Polyak's momentum method (Polyak, 1964), and in understanding the implicit bias of gradient descent (Vardi et al., 2022). Extending the dynamical viewpoint to private optimization would help us understand the price of privacy for the convergence rate of private optimization. Further, it is known that privacy helps in generalization, and that implicit bias underlies the generalization ability of (non-private) machine learning models trained by stochastic gradient descent. Taking the dynamical viewpoint may help us understand whether there is a more fundamental reason for the generalization ability of differential privacy. In the rest of this section, we elaborate on each of these conceptual contributions, and alongside highlight the technical advances that were required. At the end, we compare to the most relevant works; we defer a broader comparison to other works in the area to Appendix A.

Consider a data set D = {d_1, . . . , d_n} drawn from some domain τ and an associated loss function ℓ : C × τ → R, where C ⊂ R^p is called the constraint set. The objective in DP-ERM is to output a model θ^priv, while ensuring differential privacy, that approximately minimizes the excess empirical risk

    Risk_ERM(θ^priv) = (1/n) Σ_{i=1}^n ℓ(θ^priv; d_i) − min_{θ ∈ C} (1/n) Σ_{i=1}^n ℓ(θ; d_i).

For brevity, we will refer to (1/n) Σ_{i=1}^n ℓ(θ; d_i) as L(θ; D), or the empirical loss. For DP-SCO, we assume each data point d_i in the data set D is drawn i.i.d. from some distribution 𝒟 over the domain τ, and the objective is to minimize the excess population risk

    Risk_SCO(θ^priv) = E_{d∼𝒟}[ℓ(θ^priv; d)] − min_{θ ∈ C} E_{d∼𝒟}[ℓ(θ; d)].

¹ We only focus on Lipschitz losses and constraint sets bounded in the ℓ_2-norm; the non-Euclidean setting (Talwar et al., 2015; Asi et al., 2021a; Bassily et al., 2021) is beyond the scope of this work.
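To make the two objectives concrete, the following toy sketch (an illustrative squared loss of our own choosing, not one analyzed in this paper) evaluates both the excess empirical risk and the excess population risk in a setting where both minimizers are known in closed form, using a perturbed sample mean as a stand-in for a private model θ^priv.

```python
import numpy as np

# Illustrative loss l(theta; d) = 0.5*||theta - d||^2: the empirical
# minimizer of L(theta; D) is the sample mean, and the population
# minimizer is E[d], so both excess risks have closed forms to check.

def empirical_loss(theta, data):
    """L(theta; D) = (1/n) sum_i 0.5*||theta - d_i||^2."""
    return 0.5 * np.mean(np.sum((theta - data) ** 2, axis=1))

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, size=(1000, 3))   # d_i drawn i.i.d. from N(1, I)
theta_priv = data.mean(axis=0) + 0.05        # stand-in for a private model

# Excess empirical risk: L(theta_priv; D) - min_theta L(theta; D).
# For this loss it equals 0.5*||theta_priv - mean(data)||^2 exactly.
risk_erm = empirical_loss(theta_priv, data) - empirical_loss(data.mean(axis=0), data)

# Excess population risk: E_d[l(theta_priv; d)] - min_theta E_d[l(theta; d)],
# which for this loss equals 0.5*||theta_priv - E[d]||^2 with E[d] = (1, 1, 1).
risk_sco = 0.5 * np.sum((theta_priv - 1.0) ** 2)

print(risk_erm, risk_sco)
```

Note the gap between the two quantities: risk_erm is measured against the minimizer on the sample D, while risk_sco is measured against the minimizer of the population objective, and bounding the latter is exactly where uniform stability arguments enter.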

