DIFFERENTIALLY PRIVATE OPTIMIZATION FOR SMOOTH NON-CONVEX ERM

Abstract

We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order necessary solution for non-convex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.

1. INTRODUCTION

Privacy protection has become a central issue in machine learning, and differential privacy (Dwork & Roth, 2014) is a rigorous and popular framework for quantifying privacy. In this paper, we propose a differentially private optimization algorithm that finds an approximate second-order necessary solution for ERM problems. We propose several techniques to improve the practical performance of the method, including backtracking line search, mini-batching, and a heuristic to avoid the effects of conservative assumptions made in the analysis.

For a given f : R^d → R, consider the minimization problem min_{w ∈ R^d} f(w). We want to find an approximate second-order necessary solution, defined formally as follows.

Definition 1 ((ϵ_g, ϵ_H)-2NS). For given positive values of ϵ_g and ϵ_H, we say that w is an (ϵ_g, ϵ_H)-approximate second-order necessary solution (abbreviated as (ϵ_g, ϵ_H)-2NS) if

  ∥∇f(w)∥ ≤ ϵ_g,  λ_min(∇²f(w)) ≥ −ϵ_H.  (1)

We are mostly interested in the case ϵ_g = α and ϵ_H = √(Mα); that is, we seek an (α, √(Mα))-2NS, where M is the Lipschitz constant of ∇²f.

We focus on the empirical risk minimization (ERM) problem, defined as follows.

Definition 2 (ERM). Given a dataset D = {x_1, …, x_n} and a loss function l(w, x), we seek the parameter w ∈ R^d that minimizes the empirical risk

  f(w) = L(w, D) := (1/n) Σ_{i=1}^n l(w, x_i).

ERM is a classical problem in machine learning that has been studied extensively; see, for example, Shalev-Shwartz & Ben-David (2014). In this paper, we describe differentially private (DP) techniques for solving ERM. Previous research on DP algorithms for ERM and optimization has focused mainly on convex loss functions. Recent research on differentially private algorithms for non-convex ERM (Wang et al., 2018a; Wang & Xu, 2019; Zhang et al., 2017) targets an approximate stationary point, which satisfies only the first condition in (1).
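The 2NS conditions in Definition 1 are easy to check numerically when gradients and Hessians are available explicitly. The following minimal sketch (function name is ours) verifies both conditions with NumPy:

```python
import numpy as np

def is_2ns(grad, hess, eps_g, eps_h):
    """Check the (eps_g, eps_h)-approximate second-order necessary
    conditions of Definition 1: ||grad|| <= eps_g and
    lambda_min(hess) >= -eps_h."""
    grad_ok = np.linalg.norm(grad) <= eps_g
    lam_min = np.linalg.eigvalsh(hess)[0]   # eigvalsh: ascending eigenvalues
    return bool(grad_ok and lam_min >= -eps_h)

# A strictly convex quadratic at its minimizer satisfies both conditions.
assert is_2ns(np.zeros(3), np.eye(3), eps_g=0.1, eps_h=0.1)
```

A saddle point of a non-convex function fails the second condition even when its gradient vanishes, which is exactly the case the eigenvalue test is designed to catch.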
Wang & Xu (2021) propose a trust-region-type algorithm (DP-TR) that yields an approximate second-order necessary solution for ERM, satisfying both conditions in (1), for particular choices of ϵ_g and ϵ_H. That work requires the trust-region subproblem to be solved exactly at each iteration and fixes the trust-region radius at a small value, akin to a "short step" in a line-search method. An earlier work (Wang et al., 2019) proposed the DP-GD algorithm, which takes short steps in a noisy gradient direction and then sorts through all the iterates so generated to find one that satisfies second-order necessary conditions. Our work matches the sample complexity bound of DP-GD, which is O(d/(α²√ρ)) for ρ-zCDP, or O(d√(ln(1/δ))/(α²ε)) for (ε, δ)-DP, for finding an (α, √(Mα))-2NS, and has an iteration complexity of O(α^{-2}).

Our contributions can be summarized as follows.

• Our algorithm is elementary and is based on a simple (non-private) line-search algorithm for finding an approximate second-order necessary solution. It evaluates second-order information (a noisy Hessian matrix) only when insufficient progress can be made using first-order (gradient) information alone. By contrast, DP-GD uses the (noisy) Hessian only for checking the second-order approximate condition, while DP-TR requires the noisy Hessian at every iteration.

• Our algorithm is practical and fast. DP-TR has a slightly better sample complexity bound than our method, depending on α^{-7/4} rather than α^{-2}. However, since our analysis is based on the worst case, we can be more aggressive with step sizes (see below). DP-TR requires solving the trust-region subproblem exactly, which is relatively expensive and unnecessary when the gradient is large enough to take a productive step. Experiments demonstrate that our algorithm requires fewer iterations than DP-TR, does less computation on average per iteration, and thus runs significantly faster than DP-TR.
Moreover, we note that the mini-batch version of DP-TR has sample complexity O(α^{-2}), matching the sample complexity of the mini-batch version of our algorithm.

• We use line search and mini-batching to accelerate the algorithm. Differentially private line search algorithms have been proposed by Chen & Lee (2020). We use the same sparse vector technique as their work, but provide a tighter analysis of the sensitivity of the query that checks the sufficient decrease condition. In addition, we provide a rigorous analysis of the guaranteed function decrease with high probability.

• To complement our worst-case analysis, we propose a heuristic that can obtain much more rapid convergence while retaining the guarantees provided by the analysis.

The remainder of the paper is structured as follows. In Section 2, we review basic definitions and properties from differential privacy and state our assumptions on the function f to be optimized. In Section 3, we describe our algorithm and its analysis: the basic short-step version in Section 3.1, an extension to a practical line search method in Section 3.2, and a mini-batch adaptation in Section 3.3. In Section 4, we present experimental results and demonstrate the effectiveness of our algorithms.

2. PRELIMINARIES

We use several variants of DP as needed for our analysis, including (ε, δ)-DP (Dwork & Roth, 2014), (α, ϵ)-RDP (Mironov, 2017), and zCDP (Bun & Steinke, 2016). We review their definitions and properties in Appendix A.

We make the following assumptions about the smoothness of the objective function f.

Assumption 1. We assume f is bounded below by f_min. Assume further that f is G-smooth and has M-Lipschitz Hessian, that is, for all w_1, w_2 ∈ dom(f),

  ∥∇f(w_1) − ∇f(w_2)∥ ≤ G ∥w_1 − w_2∥,  ∥∇²f(w_1) − ∇²f(w_2)∥ ≤ M ∥w_1 − w_2∥,

where ∥·∥ denotes the vector 2-norm and the matrix 2-norm, respectively. We use this notation throughout the paper.

For the ERM version of f (see Definition 2), we make additional assumptions.

Assumption 2. For the ERM setting, we assume the loss function l(w, x) is G-smooth and has M-Lipschitz Hessian with respect to w. Thus L(w, D) (the average loss across n samples) is also G-smooth and has M-Lipschitz Hessian with respect to w. In addition, we assume l(w, x) has bounded function values, gradients, and Hessians. That is, there are constants B, B_g, and B_H such that for any w, x we have

  0 ≤ l(w, x) ≤ B,  ∥∇_w l(w, x)∥ ≤ B_g,  ∥∇²_w l(w, x)∥ ≤ B_H.

As a consequence, the ℓ_2 sensitivities of L(w, D) and ∇L(w, D) are bounded by B/n and 2B_g/n, respectively. For the Hessian, we have

  ∥∇²L(w, D) − ∇²L(w, D′)∥_F ≤ √d ∥∇²L(w, D) − ∇²L(w, D′)∥ ≤ 2B_H √d / n.

To simplify notation, we define g(w) := ∇f(w) and H(w) := ∇²f(w). From the definition (20) of ℓ_2-sensitivity, we have that the sensitivities of f, g, and H are

  ∆_f = B/n,  ∆_g = 2B_g/n,  ∆_H = 2B_H √d / n.  (4)
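The sensitivities (4) are elementary to compute once the bounds B, B_g, B_H are known. A minimal sketch (function name is ours; B, B_g, B_H are assumed to be supplied by the user for their loss):

```python
import math

def sensitivities(n, d, B, B_g, B_H):
    """l2-sensitivities from (4): Delta_f = B/n, Delta_g = 2*B_g/n,
    Delta_H = 2*B_H*sqrt(d)/n, for a dataset of n samples in d dims."""
    return B / n, 2.0 * B_g / n, 2.0 * B_H * math.sqrt(d) / n
```

Note that all three sensitivities shrink like 1/n, which is what lets larger datasets tolerate the same noise multipliers at a better utility.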

3. MAIN RESULTS

Our algorithmic starting point is the elementary algorithm described in Wright & Recht (2022, Chapter 3.6), which has convergence guarantees to points that satisfy approximate second-order necessary conditions. For simplicity, we use the following notation to describe and analyze the method:

  f_k := f(w_k),  g_k := g(w_k) = ∇f(w_k),  H_k := H(w_k) = ∇²f(w_k).

We employ the Gaussian mechanism to perturb gradients and Hessians, and denote

  g̃_k = g_k + ε_k,  H̃_k = H_k + E_k,

where ε_k ∼ N(0, ∆_g²σ_g² I_d) for some chosen parameter σ_g, and E_k is a symmetric matrix in which each entry on and above the diagonal is i.i.d. N(0, ∆_H²σ_H²), for some chosen value of σ_H. Let λ̃_k denote the minimum eigenvalue of H̃_k and p̃_k the corresponding eigenvector, with sign and norm chosen to satisfy

  ∥p̃_k∥ = 1 and (p̃_k)ᵀ g̃_k ≤ 0.  (5)

Algorithm 1 specifies the general form of our optimization algorithm. We discuss two strategies, a "short step" strategy and one based on backtracking line search, to choose the step sizes γ_{k,g} and γ_{k,H} taken along the directions −g̃_k and p̃_k, respectively. For each variant, we define a quantity MIN_DEC to be the minimum decrease per iteration, and use it together with a lower bound on f to define an upper bound T on the required number of iterations. In each iteration, we take a step along the negative perturbed gradient −g̃_k if ∥g̃_k∥ > ϵ_g. Otherwise, we check the minimum eigenvalue λ̃_k of the perturbed Hessian H̃_k. If λ̃_k < −ϵ_H, we take a step along the direction p̃_k. In the remaining case, we have ∥g̃_k∥ ≤ ϵ_g and λ̃_k ≥ −ϵ_H, so the approximate second-order necessary conditions are satisfied, and we output the current iterate w_k as a 2NS solution. The quantities σ_f, σ_g, σ_H determine the amount of noise added to function, gradient, and Hessian evaluations, respectively, with the goal of preserving privacy via the Gaussian mechanism.
We can target a certain privacy level for the overall algorithm (ρ in ρ-zCDP, for example), find an upper bound on the number of iterations required by whichever variant of Algorithm 1 we are using, and then choose σ_f, σ_g, and σ_H to ensure this level of privacy. Conversely, we can choose positive values for σ_f, σ_g, and σ_H and then determine what level of privacy this choice ensures. We can also keep track of the privacy leakage as the algorithm progresses, opening the possibility of adaptive schemes for choosing the σ's.
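As a concrete sketch, the main loop of Algorithm 1 with the short-step sizes of Section 3.1 might look as follows in NumPy. The function name and signature are our own, and `grad`/`hess` stand for exact ERM gradient and Hessian oracles; as in the paper, the noise scale is the sensitivity times the noise multiplier:

```python
import numpy as np

def dp_second_order_descent(grad, hess, w0, eps_g, eps_h,
                            sens_g, sens_h, sigma_g, sigma_h,
                            G, M, T, rng):
    """Sketch of the short-step variant of Algorithm 1 (names and
    signature are ours).  Takes noisy gradient steps while the noisy
    gradient is large, otherwise noisy negative-curvature steps, and
    stops at an approximate second-order necessary solution."""
    w = np.asarray(w0, dtype=float)
    d = w.size
    for _ in range(T):
        # Perturbed gradient g~_k = g_k + eps_k, eps_k ~ N(0, (Delta_g sigma_g)^2 I_d)
        g_t = grad(w) + rng.normal(0.0, sens_g * sigma_g, size=d)
        if np.linalg.norm(g_t) > eps_g:
            w = w - (1.0 / G) * g_t              # gradient step, gamma_{k,g} = 1/G
            continue
        # Perturbed Hessian: i.i.d. Gaussian noise on and above the diagonal, symmetrized
        E = rng.normal(0.0, sens_h * sigma_h, size=(d, d))
        E = np.triu(E) + np.triu(E, 1).T
        lam, V = np.linalg.eigh(hess(w) + E)
        lam_min, p = lam[0], V[:, 0]
        if lam_min < -eps_h:
            if p @ g_t > 0:                      # enforce (p~_k)^T g~_k <= 0, cf. (5)
                p = -p
            w = w + (2.0 * abs(lam_min) / M) * p  # negative-curvature step, gamma_{k,H} = 2|lam|/M
        else:
            return w                             # approximate 2NS reached
    return w
```

On a strongly convex quadratic with negligible noise, one gradient step lands near the minimizer and the subsequent eigenvalue check terminates the loop.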

3.1. SHORT STEP

In the short-step version of the algorithm, we make choices for the step sizes that are independent of k:

  γ_{k,g} ≡ 1/G,  γ_{k,H} ≡ 2|λ̃_k| / M.  (7)

The choices of MIN_DEC and the noise parameters σ_f, σ_g, and σ_H are discussed in the following results. First, we discuss the privacy guarantee and its relationship to the noise variances and the number of iterations.

Theorem 1. Let the noise variances σ_f, σ_g, σ_H be given. Suppose a run of Algorithm 1 takes k_g gradient steps and k_H negative curvature steps. Then the run is ρ-zCDP, where

  ρ = (1/2) ( 1/σ_f² + (k_g + k_H)/σ_g² + k_H/σ_H² ).

Recall that T is the maximum number of iterations defined in (6). Let

  ρ̄ = (1/2) ( 1/σ_f² + T/σ_g² + T/σ_H² ).

We always have ρ̄ ≥ ρ, so the algorithm is ρ̄-zCDP. Conversely, for given ρ > 0 and ρ_f ∈ (0, ρ), we can choose

  σ_f² = 1/(2ρ_f),  σ_g² = σ_H² = T/(ρ − ρ_f)  (8)

to ensure that the algorithm is ρ-zCDP.

Proof. The proof follows directly from the zCDP guarantee for the Gaussian mechanism combined with postprocessing and composition of zCDP.

Algorithm 1 DP Optimization with Second-Order Guarantees
  Given: minimum decrease per iteration MIN_DEC, tolerances ϵ_g and ϵ_H, noise parameters σ_f, σ_g, and σ_H
  Initialize w_0 and sample z ∼ N(0, ∆_f²σ_f²)
  Compute an upper bound on the required number of iterations:
    T = (f(w_0) + |z| − f_min) / MIN_DEC, where f_min is the lower bound on f  (6)
  Set σ_g and σ_H using T (see the theorems for details)
  for k = 1, 2, …, T do
    Sample ε_k ∼ N(0, ∆_g²σ_g² I_d)
    Compute the perturbed gradient g̃_k = g_k + ε_k
    if ∥g̃_k∥ > ϵ_g then
      Choose γ_{k,g} and set w_{k+1} ← w_k − γ_{k,g} g̃_k  ▷ Gradient step
    else
      Sample E_k, a d × d symmetric matrix in which each entry on and above the diagonal is i.i.d. N(0, ∆_H²σ_H²)
      Compute the perturbed Hessian H̃_k = H_k + E_k
      Compute the minimum eigenvalue of H̃_k and the corresponding eigenvector (λ̃_k, p̃_k) satisfying (5)
      if λ̃_k < −ϵ_H then
        Choose γ_{k,H} and set w_{k+1} ← w_k + γ_{k,H} p̃_k  ▷ Negative curvature step
      else
        return w_k  ▷ Approximate 2NS
      end if
    end if
  end for

Remark. In our algorithm, the actual noise is scaled by the corresponding sensitivity ∆ defined in (4). We do the same for later algorithms. In practice, we expect most steps to be gradient steps, so ρ̄ is an overestimate of the actual privacy leakage ρ, and we can be more aggressive in choosing the noise variances. We discuss a two-phase approach in Section 3.4.

We now discuss guarantees on the output of Algorithm 1. First, we analyze MIN_DEC for each short step.

Lemma 2. With the short-step size choices (7), if the noise satisfies the following conditions for some positive constants c, c_1, and c_2 with c_1 < 1/2 and c_2 + c < 1/3,

  ∥ε_k∥ ≤ min( c_1 ϵ_g, (c_2/M) ϵ_H² ),  ∥E_k∥ ≤ c ϵ_H,  (9)

then the amount of decrease in each step is at least

  MIN_DEC = min( ((1 − 2c_1)/(2G)) ϵ_g², 2(1/3 − c_2 − c) ϵ_H³ / M² ).  (10)

The true gradient and the true minimum eigenvalue of the Hessian satisfy

  ∥g_k∥ ≤ (1 + c_1) ∥g̃_k∥,  λ_k > −(1 + c)|λ̃_k|.  (11)

Remark.
The constants c, c_1, and c_2 in the conditions (9) control how accurate our noisy gradient and Hessian estimates are. MIN_DEC is smaller when we choose smaller constants, and we also obtain a tighter solution, as the corollary below shows. However, smaller constants require smaller noise to satisfy the conditions (9), which in turn translates to a larger required sample size n for our ERM problem, as we will see in Theorem 4.

Corollary 3. Assuming the noise satisfies (9) at each iteration, the short-step version of the algorithm (using (7), (10)) outputs a ((1 + c_1)ϵ_g, (1 + c)ϵ_H)-2NS.

With the results above, we now analyze the guarantees of the fixed-step-size algorithm in the ERM setting.

Theorem 4 (Sample complexity of the short-step algorithm). Consider the ERM setting, and suppose that the number of samples n satisfies n ≥ n_min, where

  n_min := max( √(2d) B_g σ_g log(T/ζ) / min( c_1 ϵ_g, (c_2/M) ϵ_H² ),  C √d B_H σ_H log(T/ζ) / (c ϵ_H) ),

and c and C are universal constants from Lemma D.7. Then, with probability at least {(1 − ζ/T)(1 − C exp(−cCd))}^T, the output of the short-step version of the algorithm (using (7), (10)) is a ((1 + c_1)ϵ_g, (1 + c)ϵ_H)-2NS. With the choice of σ's in (8) using ρ_f = c_f ρ for some c_f ∈ (0, 1), hiding logarithmic terms and constants, the asymptotic dependence of n_min on (ϵ_g, ϵ_H), ρ, and d is

  n_min = (d/√ρ) Õ( max( ϵ_g^{-2}, ϵ_g^{-1} ϵ_H^{-2}, ϵ_H^{-7/2} ) ).  (12)

When (ϵ_g, ϵ_H) = (α, √(Mα)), this simplifies to (d/√ρ) Õ(α^{-2}).

Remark. When the conditions (9) do not hold, the algorithm can fail to converge to a ((1 + c_1)ϵ_g, (1 + c)ϵ_H)-2NS. First, the noise in the perturbed gradient and Hessian can be so large that the step is not a descent direction. Second, due to the noise, we may terminate early, or fail to terminate in time when checking the approximate second-order conditions. If the noise is not excessive, the solution is still acceptable, since the noisy evaluations satisfy the termination conditions.
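The calibration (8) and the leakage accounting of Theorem 1 are simple arithmetic; the following sketch (helper names are ours) makes the bookkeeping explicit:

```python
import math

def calibrate_zcdp(rho, rho_f_frac, T):
    """Choose noise multipliers per (8): sigma_f^2 = 1/(2 rho_f) and
    sigma_g^2 = sigma_H^2 = T/(rho - rho_f), so that any run of at
    most T iterations is rho-zCDP.  rho_f = rho_f_frac * rho."""
    rho_f = rho_f_frac * rho
    sigma_f = math.sqrt(1.0 / (2.0 * rho_f))
    sigma_g = sigma_h = math.sqrt(T / (rho - rho_f))
    return sigma_f, sigma_g, sigma_h

def zcdp_spent(sigma_f, sigma_g, sigma_h, k_g, k_h):
    """Actual leakage from Theorem 1 for a run with k_g gradient steps
    and k_h negative-curvature steps:
    rho = (1/2)(1/sigma_f^2 + (k_g + k_h)/sigma_g^2 + k_h/sigma_h^2)."""
    return 0.5 * (1.0 / sigma_f**2
                  + (k_g + k_h) / sigma_g**2
                  + k_h / sigma_h**2)
```

The gap between `zcdp_spent` for a typical run (mostly gradient steps, so k_h small) and the worst-case budget ρ̄ is exactly the slack that the two-phase heuristic of Section 3.4 tries to exploit.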

3.2. LINE SEARCH ALGORITHM

Instead of using a conservative fixed step size, we can perform a backtracking line search. The backtracking line search requires an initial value γ_0, a decrease parameter β ∈ (0, 1) for the step size, and constants c_g ∈ (0, 1 − c_1), c_H ∈ (0, 1 − c − (8/3)c_2) that determine the amount of decrease required. Each line search tries in succession the values γ_0, βγ_0, β²γ_0, …, until a value is found that satisfies a sufficient decrease condition. For gradient steps, the condition is

  f(w − γ g̃) ≤ f(w) − c_g γ ∥g̃∥²,  (SD1)

while for negative curvature steps it is

  f(w + γ p̃) ≤ f(w) − (1/2) c_H γ² |λ̃|.  (SD2)

To make the line search differentially private, we use the sparse vector technique (SVT) from Dwork & Roth (2014). We define queries according to (SD1) and (SD2):

  q_g(γ, w) = f(w) − f(w − γ g̃) − c_g γ ∥g̃∥²,  (13a)
  q_H(γ, w) = f(w) − f(w + γ p̃) − (1/2) c_H γ² |λ̃|,  (13b)

whose nonnegativity is equivalent to the corresponding sufficient decrease condition. Algorithm 2 specifies the differentially private line search using SVT, adapted from the AboveThreshold algorithm (Dwork & Roth, 2014). By requiring sufficient decrease, we aim for a more substantial improvement in the function value than in the short-step algorithm. As a fallback, if the line search fails, we use step sizes similar to the short-step values (differing only by a constant factor), yielding a decrease similar to the short-step case.

We state the complete algorithm enhanced with line search in Algorithm 3. In the algorithm, we compute the fallback step size γ̄ and use a multiple bγ̄ (with b > 1) as the initial step size for the line search. We compute the query sensitivity ∆_q accordingly and call the private line search subroutine to find a step size γ that satisfies the sufficient decrease condition. We have the following privacy guarantee.

Theorem 5. Suppose that σ_f, σ_g, σ_H, and λ are given.
Suppose an actual run of the line search algorithm takes k_g gradient steps and k_H negative curvature steps. The run is ρ-zCDP, where

  ρ = (1/2) ( 1/σ_f² + (k_g + k_H)/σ_g² + k_H/σ_H² + (k_g + k_H)/λ² ).

Algorithm 2 Private backtracking line search using SVT
  Given: query q and its sensitivity ∆_q, initial step size γ_init, fallback step size γ̄, decrease parameter β, privacy parameter λ
  function DP-LINESEARCH(q, ∆_q, γ_init, γ̄, β, λ)
    Initialize γ ← γ_init. Sample ξ ∼ Lap(2λ∆_q)
    for i = 1, 2, …, i_max = ⌊log_β(γ̄/γ_init)⌋ + 1 do
      Sample ν_i ∼ Lap(4λ∆_q)
      Evaluate q_i = q(γ) and q̃_i = q_i + ν_i
      if q̃_i ≥ ξ then
        HALT and output γ
      else
        Set γ ← βγ
      end if
    end for
    HALT and output γ̄
  end function

Algorithm 3 DP Optimization with Second-Order Guarantees and Backtracking Line Search
  Given: noise bound parameters c_1, c_2, c; sufficient decrease parameters c_g, c_H; initial step size multipliers b_g, b_H; line search decrease parameters β_g, β_H; tolerances ϵ_g and ϵ_H; noise parameters σ_f, σ_g, σ_H, and λ_SVT
  Initialize w_0, sample z ∼ N(0, ∆_f²σ_f²), and compute MIN_DEC according to (17)
  Compute an upper bound on the required number of iterations T = (f(w_0) + |z| − f_min)/MIN_DEC
  Set γ̄_g ← 2(1 − c_1 − c_g)/G
  for k ← 1, 2, …, T do
    Sample ε_k ∼ N(0, ∆_g²σ_g² I_d)
    Compute the perturbed gradient g̃_k = g_k + ε_k
    if ∥g̃_k∥ > ϵ_g then
      Define q_{k,g}(γ) = f(w_k) − f(w_k − γ g̃_k) − c_g γ ∥g̃_k∥²
      Set γ^init_{k,g} ← b_g γ̄_g, ∆_{q_{k,g}} ← (2/n) γ^init_{k,g} B_g ∥g̃_k∥  ▷ Line search query sensitivity
      γ_{k,g} ← DP-LINESEARCH(q_{k,g}, ∆_{q_{k,g}}, γ^init_{k,g}, γ̄_g, β_g, λ_SVT)  ▷ Backtracking line search
      w_{k+1} ← w_k − γ_{k,g} g̃_k  ▷ Gradient step
    else
      Sample E_k, a d × d symmetric matrix in which each entry on and above the diagonal is i.i.d. N(0, ∆_H²σ_H²)
      Compute the perturbed Hessian H̃_k = H_k + E_k
      Compute the minimum eigenvalue of H̃_k and the corresponding eigenvector (λ̃_k, p̃_k) satisfying (5).
      if λ̃_k < −ϵ_H then
        Define q_{k,H}(γ) = f(w_k) − f(w_k + γ p̃_k) − (1/2) c_H γ² |λ̃_k|
        Set γ̄_{k,H} ← t_2 |λ̃_k|/M, γ^init_{k,H} ← b_H γ̄_{k,H}, ∆_{q_{k,H}} ← (2/n) γ^init_{k,H} B_g
        γ_{k,H} ← DP-LINESEARCH(q_{k,H}, ∆_{q_{k,H}}, γ^init_{k,H}, γ̄_{k,H}, β_H, λ_SVT)  ▷ Backtracking line search
        w_{k+1} ← w_k + γ_{k,H} p̃_k  ▷ Negative curvature step
      else
        return w_k
      end if
    end if
  end for

Recall that T is the maximum number of iterations defined in (6). Let

  ρ̄ = (1/2) ( 1/σ_f² + T/σ_g² + T/σ_H² + T/λ² ).

We always have ρ̄ ≥ ρ, so the algorithm is ρ̄-zCDP. Conversely, for given ρ > 0 and ρ_f ∈ (0, ρ), we can choose

  σ_f² = 1/(2ρ_f),  σ_g² = σ_H² = λ² = 3T / (2(ρ − ρ_f)),  (14)

to ensure that the algorithm is ρ-zCDP.

Proof. SVT is (1/λ)-DP, and thus (1/(2λ²))-zCDP. The result then follows directly from the zCDP guarantee for the Gaussian mechanism combined with postprocessing and composition of zCDP.

We now discuss the guarantee on the output of the algorithm. We first derive conditions under which sufficient decrease holds.

Lemma 6. Assume the same bounded noise conditions (9) as before. With sufficient decrease coefficients c_g ∈ (0, 1 − c_1) and c_H ∈ (0, 1 − c − (8/3)c_2), let γ̄_g = 2(1 − c_1 − c_g)/G and γ̄_H = t_2 |λ̃|/M as defined in Algorithm 3. The sufficient decrease conditions (SD1) and (SD2) are satisfied when γ ≤ γ̄_g and γ ∈ [(t_1/t_2) γ̄_H, γ̄_H], respectively, where 0 < t_1 < t_2 are the solutions of the quadratic equation (given our choice of c, c_2, c_H, real solutions exist)

  r(t) := −(1/6) t² + (1/2)(1 − c − c_H) t − c_2 = 0.

Explicitly, with t_1 taking the minus sign,

  t_1, t_2 = (3/2)(1 − c − c_H) ∓ 3 √( (1/4)(1 − c − c_H)² − (2/3) c_2 ).  (15)

In particular, we have q_g(γ̄_g) ≥ 0 and q_H(γ̄_H) ≥ 0.

We now derive the minimum amount of decrease for each iteration.

Lemma 7. Using the DP line search in Algorithm 3, assume the same bounded noise conditions (9) as before. With sufficient decrease coefficients c_g ∈ (0, 1 − c_1) and c_H ∈ (0, 1 − c − (8/3)c_2), define γ̄_g and γ̄_H as before.
Choose initial step size multipliers b_g, b_H > 1 and decrease parameters β_g ∈ (0, 1), β_H ∈ (t_1/t_2, 1). Let i_max = ⌊log_β(1/max(b_g, b_H))⌋ + 1. If n satisfies

  n ≥ 16λ ( log i_max + log(T/ξ) ) max( 2 b_g B_g / (c_g ϵ_g), 4 b_H B_g M / (t_2 c_H ϵ_H²) ),  (16)

then with probability at least 1 − ξ/T, the amount of decrease in a single step is at least

  MIN_DEC = min( (1/G)(1 − c_1 − c_g) c_g ϵ_g², (1/4) c_H t_2² ϵ_H³ / M² ).  (17)

With the results above, we can analyze the guarantees of the line search algorithm in the ERM setting.

Theorem 8 (Sample complexity of the line search algorithm). Assume the same conditions as in Lemma 7, and suppose the number of samples n satisfies n ≥ n_min, where

  n_min := max( √(2d) B_g σ_g log(T/ζ) / min( c_1 ϵ_g, (c_2/M) ϵ_H² ),
                C √d B_H σ_H log(T/ζ) / (c ϵ_H),
                16λ ( log i_max + log(T/ξ) ) max( 2 b_g B_g / (c_g ϵ_g), 4 b_H B_g M / (t_2 c_H ϵ_H²) ) ).  (18)

Then, with probability at least {(1 − ζ/T)(1 − C exp(−cCd))(1 − ξ/T)}^T, the output of the algorithm is a ((1 + c_1)ϵ_g, (1 + c)ϵ_H)-2NS. With the choice of σ's and λ in (14), hiding logarithmic terms and constants, the asymptotic dependence of n_min on (ϵ_g, ϵ_H) and ρ is

  n_min = (d/√ρ) Õ( max( ϵ_g^{-2}, ϵ_g^{-1} ϵ_H^{-2}, ϵ_H^{-7/2} ) ).  (19)

When (ϵ_g, ϵ_H) = (α, √(Mα)), this simplifies to (d/√ρ) Õ(α^{-2}).

Proof. The proof is similar to that of Theorem 4 (see Appendix D.4), using Lemma 7. We have an additional term in the success probability due to the SVT line search step. For the asymptotic bound on n_min, note that MIN_DEC (17), T, σ_g, and σ_H are the same as those of the short-step algorithm, up to constants. The additional requirement (16) on n is O(λ log T · max(ϵ_g^{-1}, ϵ_H^{-2})). Since we choose λ = σ_g, this is of the same order as the first term inside the max in (18). Thus, the asymptotic bound on n_min is the same as that of the short-step algorithm.
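The SVT-based backtracking of Algorithm 2 can be sketched in a few lines. The function name and signature are ours; `q` is the sufficient decrease query (13a) or (13b), and the Laplace scales 2λ∆_q (threshold) and 4λ∆_q (queries) follow the AboveThreshold pattern:

```python
import numpy as np

def dp_linesearch(q, delta_q, gamma_init, gamma_fallback, beta, lam, rng):
    """Sketch of Algorithm 2: private backtracking line search via the
    sparse vector technique.  q(gamma) >= 0 encodes sufficient decrease;
    lam is the SVT privacy parameter.  Returns the fallback step size if
    no trial step passes the noisy threshold test."""
    threshold_noise = rng.laplace(0.0, 2.0 * lam * delta_q)   # xi ~ Lap(2*lam*Dq)
    gamma = gamma_init
    i_max = int(np.floor(np.log(gamma_fallback / gamma_init)
                         / np.log(beta))) + 1
    for _ in range(i_max):
        q_noisy = q(gamma) + rng.laplace(0.0, 4.0 * lam * delta_q)
        if q_noisy >= threshold_noise:
            return gamma          # noisy sufficient decrease holds
        gamma *= beta             # backtrack
    return gamma_fallback
```

Only the query answers touch the data, so a whole (possibly multi-trial) line search costs a single SVT invocation of privacy, which is what makes backtracking affordable here.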

3.3. MINI-BATCHING

We can use mini-batching to speed up the algorithm. Specifically, in each iteration k, we sample m data points from D without replacement, forming the mini-batch S_k. The objective then computes the average risk over the set S_k only, that is,

  f_{k,S_k} := (1/m) Σ_{i∈S_k} l(w, x_i).

Likewise, we modify all the prior algorithms by evaluating the gradients and Hessians over the mini-batch S_k. We show that the sample complexity of the mini-batch version of the algorithm remains Õ(d√(ln(1/δ))/(εα²)) when (ϵ_g, ϵ_H) = (α, √(Mα)) for (ε, δ)-DP. The details are in Appendix B.
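Sampling without replacement and averaging over the mini-batch can be sketched as follows (function name is ours; the sensitivities in (4) are rescaled by replacing n with m):

```python
import numpy as np

def minibatch_grad(per_sample_grads, m, rng):
    """Draw a mini-batch S_k of m indices without replacement and
    return the averaged gradient over S_k (sketch)."""
    n = len(per_sample_grads)
    idx = rng.choice(n, size=m, replace=False)   # subsample without replacement
    return np.mean([per_sample_grads[i] for i in idx], axis=0)
```

Sampling without replacement matters for the privacy analysis: the amplification result quoted in Appendix B (Wang et al., 2018b) is stated for exactly this subsampling scheme with fraction s = m/n.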

3.4. DISCUSSION: PRACTICAL IMPROVEMENT AND EIGENVALUE COMPUTATION

The estimate of T in (6) is pessimistic, being based on the MIN_DEC obtained from our worst-case analysis. We now propose a two-phase strategy to speed up the algorithm. In the first phase, we use a fraction (say 3/4) of the privacy budget to try a smaller T, which corresponds to less noise and quicker convergence. If we are unable to find a desired solution, we fall back to the original algorithm with the remaining privacy budget, using the last iterate as a warm start.

In the actual implementation, we can use the Lanczos method to compute the minimum eigenvalue and eigenvector. This slightly changes our analysis; see Appendix C for a discussion. We note that the Lanczos method requires only Hessian-vector products, so the full Hessian is never formed. We can use automatic differentiation or finite differencing to obtain Hessian-vector products efficiently, reducing the time complexity of this step to O(d) from O(d²).
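A matrix-free eigenvalue computation of this kind can be sketched with SciPy's Lanczos-based `eigsh` on a `LinearOperator`; the function name is ours, and `hvp` stands for any Hessian-vector product oracle (e.g., from automatic differentiation):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def min_eig_via_hvp(hvp, d):
    """Smallest eigenvalue and eigenvector of an implicit symmetric
    d x d Hessian using Lanczos iteration.  Only Hessian-vector
    products v -> H @ v are needed; the full matrix is never formed."""
    op = LinearOperator((d, d), matvec=hvp)
    vals, vecs = eigsh(op, k=1, which='SA')   # 'SA': smallest algebraic
    return vals[0], vecs[:, 0]
```

Each Lanczos iteration costs one Hessian-vector product, which is why the per-step cost drops from O(d²) (forming the Hessian) to O(d) per product.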

4. EXPERIMENTS

We carry out numerical experiments to demonstrate the performance of our DP optimization algorithms, following experimental protocols similar to Wang & Xu (2021). We use the Covertype dataset and perform the necessary data preprocessing; details of the dataset and additional experiments can be found in Appendix E. Let x_i be the feature vector and y_i ∈ {−1, +1} the binary label. We investigate the non-convex ERM loss

  min_{w∈R^p} (1/n) Σ_{i=1}^n log(1 + exp(−y_i ⟨x_i, w⟩)) + r(w),

where r(w) = Σ_{i=1}^p λ w_i²/(1 + w_i²) is a non-convex regularizer and λ is a fixed small positive constant in our experiments. We implement our algorithms and DP-TR in Python using PyTorch. To make the results comparable, we modify DP-TR so that it explicitly checks the approximate second-order necessary conditions and stops if they are satisfied, in the same way as our algorithms.

We run the experiments under two settings:
1. Finding a loose solution: ϵ_g = 0.060 and ϵ_H ≈ 0.245. Here our requirement for the 2NS is loose, which translates to a large sample size n relative to the required sample complexity.
2. Finding a tight solution: ϵ_g = 0.030 and ϵ_H ≈ 0.173. Here our requirement for the 2NS is tight, and the sample size n is small relative to the required sample complexity.

For each setting, we pick different levels of privacy budget ε and run each configuration with five different random seeds, converting between differential privacy notions to (ε, δ)-DP where necessary for comparison. We present the aggregated results in the tables below. In each entry, we report the mean ± standard deviation across the five runs. If any of the five runs failed to find a solution, or found a solution but failed to terminate due to the noise, we mark the runtime with ×. Runtimes are measured in seconds via the Python function time.perf_counter().
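The regularized logistic objective above can be written compactly; the following NumPy sketch (function name is ours; the paper's implementation uses PyTorch) mirrors the formula:

```python
import numpy as np

def nonconvex_erm_loss(w, X, y, lam):
    """Logistic loss with the non-convex regularizer
    r(w) = lam * sum_i w_i^2 / (1 + w_i^2) from Section 4 (sketch).
    X: (n, p) features, y: (n,) labels in {-1, +1}."""
    z = -y * (X @ w)
    loss = np.mean(np.log1p(np.exp(z)))          # log(1 + exp(-y <x, w>))
    reg = lam * np.sum(w**2 / (1.0 + w**2))
    return loss + reg
```

At w = 0 the loss equals log 2 regardless of the data, a convenient sanity check; the regularizer is bounded by λp, which keeps the objective smooth with bounded Hessian, as Assumption 2 requires.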
In the tables, we use acronyms for the methods: TR for DP-TR; OPT for our proposed algorithm and 2OPT for its two-phase variant; OPT-LS for our proposed algorithm with line search; methods with suffix "-B" use mini-batching. The rows recoverable here report final loss and runtime (mean ± std over five runs) at privacy budgets ε = 0.2, 0.6, 1.0:

  Method  | ε = 0.2: loss, time        | ε = 0.6: loss, time        | ε = 1.0: loss, time
  OPT-LS  | 0.577 ± 0.032, ×           | 0.687 ± 0.028, 0.4 ± 0.1   | 0.699 ± 0.018, 0.4 ± 0.1
  2OPT    | 0.626 ± 0.078, ×           | 0.712 ± 0.017, 0.6 ± 0.2   | 0.712 ± 0.018, 0.6 ± 0.2
  2OPT-B  | 0.712 ± 0.018, 1.4 ± 0.3   | 0.712 ± 0.018, 1.4 ± 0.4   | 0.712 ± 0.018, 2.0 ± 1.7
  2OPT-LS | 0.699 ± 0.018, 0.5 ± 0.2   | 0.699 ± 0.018, 0.5 ± 0.2   | 0.699 ± 0.018, 0.5 ± 0.2

Experimental results show that for finding a loose solution under the higher privacy budgets ε = 0.6, 1.0, our short-step algorithm OPT outperforms TR, with much lower runtime and lower final loss. Under the low privacy budget ε = 0.2, OPT can fail to terminate successfully, yet its final loss is still lower than TR's. The reason is that, due to the conservative estimate of the decrease, the per-iteration privacy budget is low, so the noise prevents us from checking the 2NS conditions accurately enough. In practice, we can stop early, and the solution is still acceptable despite the failure to terminate; heuristics may be employed to spend extra privacy budget on checking the 2NS conditions. Line search and mini-batching improve on the short-step algorithm, especially when combined with our two-phase strategy. We remark that, like OPT, OPT-LS has an even more conservative theoretical minimum decrease; the two-phase strategy, which uses an aggressive estimate of the decrease, complements line search well. We observe that 2OPT-LS performs consistently well across all privacy budget levels and both settings. Finally, we remark that the number of Hessian evaluations is minimal; see Appendix E.3 for details.

5. CONCLUSION

We develop simple differentially private optimization algorithms based on an elementary algorithm for finding an approximate second-order optimal point of a smooth non-convex function. The proposed algorithms take noisy gradient steps or negative curvature steps based on a noisy Hessian on non-convex ERM problems. To obtain a method that is more practical than conservative short-step methods, we employ line search, mini-batching, and a two-phase strategy. We track privacy leakage using zCDP (RDP for mini-batching). Our work matches the sample complexity of DP-GD, but with a much simplified analysis. Although DP-TR has a better sample complexity, its mini-batched version has the same complexity as ours. Our algorithms have a significant advantage over DP-TR in terms of runtime: 2OPT-LS, which combines line search and the two-phase strategy, consistently outperforms DP-TR in numerical experiments.

A A QUICK REVIEW OF DIFFERENTIAL PRIVACY

Definition A.1 ((ε, δ)-DP (Dwork & Roth, 2014)). A randomized algorithm A is (ε, δ)-DP if for all neighboring datasets D, D′ and all events S in the output space of A,

  Pr(A(D) ∈ S) ≤ e^ε Pr(A(D′) ∈ S) + δ.

When δ = 0, we say A is ε-DP. In DP-ERM, we say D′ is a neighboring dataset of D if they differ in just one data point; that is, by changing one data point x_k in D to x′_k, we obtain the dataset D′.

Rényi DP (RDP) was introduced by Mironov as a relaxation of the original DP.

Definition A.2 (Rényi divergence). For two probability distributions P and Q defined over R, the Rényi divergence of order α > 1 is

  D_α(P∥Q) = (1/(α−1)) log E_{w∼Q} [ (P(w)/Q(w))^α ].

Definition A.3 ((α, ϵ)-RDP (Mironov, 2017)). A randomized algorithm M : D → R is (α, ϵ)-RDP if for all neighboring dataset pairs D, D′,

  D_α(M(D)∥M(D′)) ≤ ϵ.

Another notion of differential privacy is zero-concentrated differential privacy (zCDP), which requires a linear bound on the divergence of all orders.
Definition A.4 (zCDP (Bun & Steinke, 2016)). A randomized algorithm M : D → R satisfies (ξ, ρ)-zCDP if for all neighboring dataset pairs D, D′ and all α ∈ (1, ∞),

  D_α(M(D)∥M(D′)) ≤ ξ + ρα.

Equivalently, M satisfies (ξ, ρ)-zCDP if for all α ∈ (1, ∞), M satisfies (α, ξ + ρα)-RDP. When ξ = 0, we simply say that M satisfies ρ-zCDP.

RDP and zCDP share several properties (Mironov, 2017; Bun & Steinke, 2016).

Proposition A.1 (Composition of RDP). Suppose that M_1 : D → R_1 is (α, ϵ_1)-RDP and M_2 : R_1 × D → R_2 is (α, ϵ_2)-RDP. Then the mechanism defined as (X, Y), where X ∼ M_1(D) and Y ∼ M_2(X, D), is (α, ϵ_1 + ϵ_2)-RDP.

Proposition A.2 (Composition of zCDP). Suppose that M_1 : D → R_1 is ρ_1-zCDP and M_2 : R_1 × D → R_2 is ρ_2-zCDP. Then the mechanism defined as (X, Y), where X ∼ M_1(D) and Y ∼ M_2(X, D), is (ρ_1 + ρ_2)-zCDP.

Proposition A.3 (Preservation under postprocessing). Consider mappings M : D → R and g : R → R′. By the analog of the data-processing inequality, D_α(P∥Q) ≥ D_α(g(P)∥g(Q)). Hence if M(·) is (α, ϵ)-RDP, so is g(M(·)); similarly, if M(·) is ρ-zCDP, so is g(M(·)).

We can easily convert between these notions of differential privacy.

Proposition A.4 (RDP to (ε, δ)-DP). If M is an (α, ϵ)-RDP mechanism, then it is (ϵ + log(1/δ)/(α−1), δ)-DP for any 0 < δ < 1.

Proposition A.5 (ε-DP to zCDP). If M is an ε-DP mechanism, then it is also (ε²/2)-zCDP.

Proposition A.6 (zCDP to (ε, δ)-DP). Suppose that M : D → R is (ξ, ρ)-zCDP. Then M is also (ε, δ)-DP for all δ > 0 with ε = ξ + ρ + 2√(ρ log(1/δ)). Thus, to achieve an (ε, δ)-DP guarantee for given ε and δ, it suffices to satisfy (ξ, ρ)-zCDP with

  ρ = ( √(ε − ξ + log(1/δ)) − √(log(1/δ)) )² ≈ (ε − ξ)² / (4 log(1/δ)).

A common way to achieve differential privacy is to add Gaussian noise to the output.

Proposition A.7 (Gaussian mechanism).
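The conversion in Proposition A.6 (specialized to ξ = 0) and its inverse are exact and easy to code; helper names below are ours:

```python
import math

def zcdp_to_dp(rho, delta):
    """Proposition A.6 with xi = 0: rho-zCDP implies (eps, delta)-DP
    for eps = rho + 2*sqrt(rho*log(1/delta))."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

def dp_budget_to_zcdp(eps, delta):
    """Inverse direction: the rho-zCDP level sufficient for (eps, delta)-DP,
    rho = (sqrt(eps + log(1/delta)) - sqrt(log(1/delta)))^2."""
    L = math.log(1.0 / delta)
    return (math.sqrt(eps + L) - math.sqrt(L)) ** 2
```

The two functions are exact inverses: substituting ε = ρ + 2√(ρL) into the second formula gives (√ρ + √L − √L)² = ρ, which is how a target (ε, δ) budget is translated into the ρ used throughout Section 3.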
Given any function h : X^n → R^d, the Gaussian mechanism is defined as

  G_σ h(D) = h(D) + N(0, ∆_h² σ² I_d),

where ∆_h denotes the ℓ_2-sensitivity of the function h, defined as ∆_h = sup_{D∼D′} ∥h(D) − h(D′)∥. The Gaussian mechanism G_σ h satisfies (α, α/(2σ²))-RDP for all α ∈ [1, ∞) and thus also satisfies (1/(2σ²))-zCDP. We defer results for (ε, δ)-DP to Appendix B.2.

B ANALYSIS OF THE MINI-BATCH ALGORITHMS

For the line search version of the algorithm, we need additional assumptions if we want to check the sufficient decrease conditions using the mini-batch loss. For simplicity, we consider only the short-step version of the algorithm in this section.

B.1 RDP ANALYSIS

(In this section, we use s = m/n to denote the sampling fraction.) We evaluate the gradient and the Hessian on the mini-batch S_k, written g_{k,S_k} and H_{k,S_k}, and let g̃_{k,S_k} and H̃_{k,S_k} be their perturbed versions, respectively. The other parts of the algorithm remain unchanged. The sensitivities ∆_f, ∆_g, ∆_H stated in (4) are rescaled by replacing the n in their denominators with the mini-batch size m. Let g_k and H_k be the gradient and the Hessian evaluated on the full dataset D. We can decompose the deviation of their noisy approximations as

  ∥g̃_{k,S_k} − g_k∥ ≤ ∥g̃_{k,S_k} − g_{k,S_k}∥ + ∥g_{k,S_k} − g_k∥,
  ∥H̃_{k,S_k} − H_k∥ ≤ ∥H̃_{k,S_k} − H_{k,S_k}∥ + ∥H_{k,S_k} − H_k∥,  (21)

where the first term in each bound is due to the added Gaussian noise and the second is due to subsampling. We bound the two terms separately with high probability, using the following subsampling concentration results from Kohler & Lucchi (2017).

Lemma B.1 (Gradient deviation bound). With probability at least 1 − η,

  ∥g_{k,S_k} − g_k∥ ≤ 4√2 B_g √( (log(2d/η) + 1/4) / |S_k| ).

Lemma B.2 (Hessian deviation bound). With probability at least 1 − η,

  ∥H_{k,S_k} − H_k∥ ≤ 4 B_H √( log(2d/η) / |S_k| ).

For the decomposition (21), we use the Gaussian concentration results described earlier to bound the first term, and the subsampling results stated above to bound the second. For iteration k, with probability at least (1 − ζ/T)(1 − C exp(−cCd))(1 − η/T)², we have

  ∥g̃_{k,S_k} − g_k∥ ≤ √(2d) ∆_g σ_g log(T/ζ) + 4√2 B_g √( (log(2dT/η) + 1/4) / |S_k| ),
  ∥H̃_{k,S_k} − H_k∥ ≤ C √d ∆_H σ_H + 4 B_H √( log(2dT/η) / |S_k| ).  (22)
It suffices to require that each term on the right-hand side of the bounds above is at most half of the corresponding term on the right-hand side of (9), so that a similar analysis applies. For the deviation due to subsampling, we need
$$\frac{4\sqrt{2}\,B_g\sqrt{\log(2dT/\eta) + 1/4}}{\sqrt{|S_k|}} \le \frac{1}{2}\min\left(c_1\epsilon_g, \frac{c_2}{M}\epsilon_H^2\right), \quad (23a) \qquad \frac{4B_H\sqrt{\log(2dT/\eta)}}{\sqrt{|S_k|}} \le \frac{1}{2}\,c\,\epsilon_H. \quad (23b)$$
Rearranging the terms, we obtain a condition on the size of the mini-batch:
$$|S_k| \ge \max\left(64B_g^2\left(\log(2dT/\eta) + \tfrac14\right)\max\left(c_1^{-2}\epsilon_g^{-2}, \tfrac{M^2}{c_2^2}\epsilon_H^{-4}\right),\; 32B_H^2\log(2dT/\eta)\,c^{-2}\epsilon_H^{-2}\right). \quad (24)$$
The following convergence result is immediate, based on the same analysis as in the full-batch case (cf. Theorem 8).

Theorem B.3. Suppose the number of samples $n$ satisfies $n \ge n_{\min}$, where
$$n_{\min} := \max\left(\frac{2\sqrt{2d}\,B_g\sigma_g\sqrt{\log(T/\zeta)}}{\min\left(c_1\epsilon_g, \frac{c_2}{M}\epsilon_H^2\right)},\; \frac{2C\sqrt{d}\,B_H\sigma_H\sqrt{\log(T/\zeta)}}{c\,\epsilon_H},\; s^{-1}\,64B_g^2\left(\log(2dT/\eta)+\tfrac14\right)\max\left(c_1^{-2}\epsilon_g^{-2}, \tfrac{M^2}{c_2^2}\epsilon_H^{-4}\right),\; s^{-1}\,32B_H^2\log(2dT/\eta)\,c^{-2}\epsilon_H^{-2}\right). \quad (25)$$
Then, with probability at least $\{(1 - \zeta/T)(1 - C\exp(-cCd))(1 - \eta/T)^2\}^T$, the output of the mini-batch short step algorithm is a $((1+c_1)\epsilon_g, (1+c)\epsilon_H)$-2NS. With the choice of $\sigma$'s in (8), hiding logarithmic terms and constants, the asymptotic dependence of $n_{\min}$ on $(\epsilon_g, \epsilon_H)$ and $\rho$ is
$$n_{\min} = \frac{d}{\sqrt{\rho}}\,\tilde O\left(\max\left(\epsilon_g^{-2},\; \epsilon_g^{-1}\epsilon_H^{-2},\; \epsilon_H^{-7/2}\right)\right). \quad (26)$$
We now discuss privacy guarantees. Subsampling cannot be handled directly under zCDP, but under RDP, Wang et al. (2018b) provide a generalized analysis for subsampled mechanisms.

Theorem B.4 (RDP for Subsampled Mechanisms). Given a dataset of $n$ points drawn from a domain $\mathcal{X}$ and a (randomized) mechanism $\mathcal{M}$ that takes an input from $\mathcal{X}^m$ for $m \le n$, let the randomized algorithm $\mathcal{M} \circ \mathrm{subsample}$ be defined as: (1) subsample: subsample $m$ datapoints of the dataset without replacement (with sampling fraction $s = m/n$), and (2) apply $\mathcal{M}$: a randomized algorithm taking the subsampled dataset as the input.
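Condition (24) can be evaluated numerically; the helper below is our own reading of the flattened formula, with the paper's constants taken at face value:

```python
import math

def min_batch_size(B_g, B_H, M, eps_g, eps_H, c1, c2, c, d, T, eta):
    """Smallest mini-batch size |S_k| satisfying (23a)-(23b), cf. (24)."""
    log_term = math.log(2 * d * T / eta)
    grad_req = 64 * B_g**2 * (log_term + 0.25) * max(
        c1**-2 * eps_g**-2, M**2 * c2**-2 * eps_H**-4)
    hess_req = 32 * B_H**2 * log_term * c**-2 * eps_H**-2
    return math.ceil(max(grad_req, hess_req))
```

As expected, tightening $\epsilon_g$ or $\epsilon_H$ forces larger mini-batches.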
For all integers $\alpha \ge 2$, if $\mathcal{M}$ is $(\alpha, \epsilon(\alpha))$-RDP, then the new randomized algorithm $\mathcal{M} \circ \mathrm{subsample}$ obeys $(\alpha, \epsilon'(\alpha))$-RDP, where
$$\epsilon'(\alpha) \le \frac{1}{\alpha - 1}\log\left(1 + s^2\binom{\alpha}{2}\min\left\{4\left(e^{\epsilon(2)} - 1\right),\; e^{\epsilon(2)}\min\left\{2, \left(e^{\epsilon(\infty)} - 1\right)^2\right\}\right\} + \sum_{j=3}^{\alpha} s^j\binom{\alpha}{j} e^{(j-1)\epsilon(j)}\min\left\{2, \left(e^{\epsilon(\infty)} - 1\right)^j\right\}\right).$$
For the Gaussian mechanism, we have $\epsilon(\alpha) = \frac{\alpha}{2\sigma^2}$ and $\epsilon(\infty) = \infty$, so the bound simplifies to
$$\epsilon'(\alpha) \le \frac{1}{\alpha - 1}\log\left(1 + s^2\binom{\alpha}{2}\min\left\{4\left(e^{1/\sigma^2} - 1\right),\; 2e^{1/\sigma^2}\right\} + \sum_{j=3}^{\alpha} 2s^j\binom{\alpha}{j} e^{(j-1)j/(2\sigma^2)}\right) =: \epsilon'_N(\alpha; \sigma, s).$$
When $s$ is small and $\sigma$ is large, we can discard the higher-order terms and approximate the right-hand side as
$$\epsilon'_N(\alpha; \sigma, s) \approx \frac{1}{\alpha - 1}\cdot s^2\,\frac{\alpha(\alpha - 1)}{2}\cdot\frac{4}{\sigma^2} = \frac{2s^2\alpha}{\sigma^2}, \quad (28)$$
where we use the approximation $e^t \approx 1 + t$ for small $t$.

Theorem B.5. Consider the short step version of the algorithm using subsampling, with given $\sigma_f, \sigma_g, \sigma_H, \lambda$ and sampling fraction $s$. Suppose an actual run of the subsampled algorithm takes $k_g$ gradient steps and $k_H$ negative curvature steps. The run is data-dependent $(\alpha, \epsilon(\alpha))$-RDP, where $\epsilon(\alpha) = \frac{\alpha}{2\sigma_f^2} + (k_g + k_H)\,\epsilon'_N(\alpha; \sigma_g, s) + k_H\,\epsilon'_N(\alpha; \sigma_H, s)$. Let
$$\varepsilon(\alpha) = \frac{\alpha}{2\sigma_f^2} + T\,\epsilon'_N(\alpha; \sigma_g, s) + T\,\epsilon'_N(\alpha; \sigma_H, s). \quad (29)$$
We always have $\varepsilon(\alpha) \ge \epsilon(\alpha)$, so the algorithm is $(\alpha, \varepsilon(\alpha))$-RDP.

Given the complexity of the subsampled privacy guarantee $\epsilon'_N(\alpha; \sigma, s)$, we do not have an explicit formula for setting the parameters $\sigma_f, \sigma_g, \sigma_H$. However, given an $(\varepsilon, \delta)$-DP privacy budget, we can optimize the parameters numerically to meet the privacy guarantee. Recall the conversion from $(\alpha, \epsilon(\alpha))$-RDP to $(\varepsilon_{DP}, \delta_{DP})$-DP: given $\delta_{DP}$, we solve $\varepsilon_{DP}(\epsilon(\cdot)) = \min_\alpha \left\{\epsilon(\alpha) + \frac{\log(1/\delta_{DP})}{\alpha - 1}\right\}$. We can therefore choose $\sigma_f, \sigma_g, \sigma_H$ to minimize
$$\max\left(\varepsilon_{DP}(\varepsilon(\cdot)) - \bar\varepsilon_{DP},\; 0\right), \quad (30)$$
where $\bar\varepsilon_{DP}$ is the target privacy budget and $\varepsilon(\cdot)$ is the upper bound in (29).
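The bound $\epsilon'_N(\alpha; \sigma, s)$ and the RDP-to-DP conversion can be computed directly. The sketch below is our own code (not the autodp implementation), following the simplified Gaussian formulas above for integer $\alpha$:

```python
import math

def subsampled_gaussian_rdp(alpha, sigma, s):
    """epsilon'_N(alpha; sigma, s): RDP of the subsampled Gaussian mechanism
    (Wang et al., 2018b), for integer alpha >= 2, with eps(j) = j / (2 sigma^2)."""
    eps2 = 1.0 / sigma**2  # eps(2) = 2 / (2 sigma^2)
    total = 1.0 + s**2 * math.comb(alpha, 2) * min(
        4 * (math.exp(eps2) - 1), 2 * math.exp(eps2))
    for j in range(3, alpha + 1):
        total += 2 * s**j * math.comb(alpha, j) * math.exp((j - 1) * j / (2 * sigma**2))
    return math.log(total) / (alpha - 1)

def rdp_to_dp(eps_curve, delta, alphas=range(2, 200)):
    """Convert an RDP curve alpha -> eps(alpha) to an (eps_DP, delta)-DP level."""
    return min(eps_curve(a) + math.log(1 / delta) / (a - 1) for a in alphas)
```

For small $s$ and moderate $\sigma$ the value is close to the approximation $2s^2\alpha/\sigma^2$ in (28); once $s \gtrsim 1/\sigma^2$, the higher-order terms start to dominate and the approximation degrades.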

B.2 SAMPLE COMPLEXITY USING (ε, δ)-DP

Under the $(\varepsilon, \delta)$-DP scheme, subsampling is easier to deal with, and we derive a sample complexity bound. We first introduce some useful results for $(\varepsilon, \delta)$-DP.

Proposition B.1 (Composition of $(\varepsilon, \delta)$-DP). Suppose that $\mathcal{M}_1 : \mathcal{D} \to \mathcal{R}_1$ is $(\varepsilon_1, \delta_1)$-DP and $\mathcal{M}_2 : \mathcal{R}_1 \times \mathcal{D} \to \mathcal{R}_2$ is $(\varepsilon_2, \delta_2)$-DP. Then the mechanism defined as $(X, Y)$, where $X \sim \mathcal{M}_1(D)$ and $Y \sim \mathcal{M}_2(X, D)$, is $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-DP.

Definition B.1 (Gaussian Mechanism for $(\varepsilon, \delta)$-DP). Given any function $h : \mathcal{X}^n \to \mathbb{R}^d$, the Gaussian Mechanism is defined as $\mathcal{G}_\sigma h(D) = h(D) + \mathcal{N}(0, \Delta_h^2\sigma^2 I_d)$, where $\Delta_h$ is the $\ell_2$-sensitivity of $h$. If $\sigma \ge \frac{\sqrt{2\ln(1.25/\delta)}}{\varepsilon}$, then the Gaussian Mechanism $\mathcal{G}_\sigma h$ satisfies $(\varepsilon, \delta)$-differential privacy.

Theorem B.6 (Privacy amplification via subsampling (Balle et al., 2018)). Given a dataset of $n$ points drawn from a domain $\mathcal{X}$ and a (randomized) mechanism $\mathcal{M}$ that takes an input from $\mathcal{X}^m$ for $m \le n$, let the randomized algorithm $\mathcal{M} \circ \mathrm{subsample}$ be defined as: (1) subsample: subsample $m$ datapoints of the dataset without replacement (sampling parameter $s = m/n$), and (2) apply $\mathcal{M}$: a randomized algorithm taking the subsampled dataset as the input. If $\mathcal{M}$ is $(\varepsilon, \delta)$-DP, then $\mathcal{M} \circ \mathrm{subsample}$ is $(\varepsilon', \delta')$-DP, where $\varepsilon' = \log\left(1 + s(e^\varepsilon - 1)\right) \le s(e^\varepsilon - 1)$ and $\delta' = s\delta$.

Theorem B.7 (Advanced Composition). For all $\varepsilon_0, \delta_0, \delta' \ge 0$, the class of $(\varepsilon_0, \delta_0)$-differentially private mechanisms satisfies $(\varepsilon, k\delta_0 + \delta')$-differential privacy under $k$-fold adaptive composition for
$$\varepsilon = \sqrt{2k\ln(1/\delta')}\,\varepsilon_0 + k\varepsilon_0\left(e^{\varepsilon_0} - 1\right).$$
As a corollary, for $0 < \varepsilon < 1$, it suffices to choose $\varepsilon_0 = \frac{\varepsilon}{2\sqrt{2k\log(1/\delta')}}$ to ensure that the composition is $(\varepsilon, k\delta_0 + \delta')$-DP. In particular, we can additionally choose $\delta' = \delta/2$ and $\delta_0 = \delta/(2k)$ to satisfy $(\varepsilon, \delta)$-DP.

We have the following privacy guarantee for the algorithm using sampling without replacement.

Theorem B.8. Consider the short step version of the algorithm using subsampling.
Given privacy parameters $\varepsilon, \delta, \varepsilon_f, \delta_f \in (0, 1)$ such that $\varepsilon_f < \varepsilon$ and $\delta_f < \delta$, and subsampling parameter $s$, let
$$\varepsilon_0 = \frac{\varepsilon - \varepsilon_f}{8s\sqrt{2T\ln(2/(\delta - \delta_f))}}, \qquad \delta_0 = \frac{\delta - \delta_f}{4sT},$$
where $T$ is estimated as before in (6). Set $\sigma_f = \frac{\sqrt{2\ln(1.25/\delta_f)}}{\varepsilon_f}$ and $\sigma_g = \sigma_H = \frac{\sqrt{2\ln(1.25/\delta_0)}}{\varepsilon_0}$. Then the algorithm is $(\varepsilon, \delta)$-DP.

Proof. By the Gaussian mechanism, the step for estimating $T$ is $(\varepsilon_f, \delta_f)$-DP. It suffices to show that the remaining steps are $(\varepsilon - \varepsilon_f, \delta - \delta_f)$-DP. Using advanced composition, we only need to show that each iteration is $(4s\varepsilon_0, 2s\delta_0)$-DP. Consider a single iteration without subsampling. From the use of the Gaussian mechanism and the sparse vector technique, computing the perturbed gradient step and the perturbed Hessian step are both $(\varepsilon_0, \delta_0)$-DP, whereas the backtracking line search step is $\varepsilon_0$-DP. By composition, the whole iteration is $(2\varepsilon_0, 2\delta_0)$-DP. Applying the privacy amplification result (Theorem B.6), each iteration using subsampling is $(4s\varepsilon_0, 2s\delta_0)$-DP. Since (as earlier) we expect most steps to be gradient steps rather than negative curvature steps, this overestimates the privacy leakage.

Theorem B.9. For $c_f \in (0, 1)$, setting $\varepsilon_f = c_f\varepsilon$ and $\delta_f = c_f\delta$, under the choice of parameters $\sigma_g, \sigma_H$ in Theorem B.3, the asymptotic dependence of $n_{\min}$ in Theorem B.3 on $(\epsilon_g, \epsilon_H)$ and $(\varepsilon, \delta)$ is
$$n_{\min} = \tilde O\left(\frac{d\sqrt{\ln(1/\delta)}}{\varepsilon}\max\left(\epsilon_g^{-2},\; \epsilon_g^{-1}\epsilon_H^{-2},\; \epsilon_H^{-4}\right)\right). \quad (31)$$
When $(\epsilon_g, \epsilon_H) = (\alpha, \sqrt{M\alpha})$, the dependence simplifies to $\tilde O\left(\frac{d\sqrt{\ln(1/\delta)}}{\varepsilon\alpha^2}\right)$, matching the result for the full-batch version of the algorithm after converting $\rho$-zCDP to $(\varepsilon, \delta)$-DP via $\sqrt{\rho} = O\left(\frac{\varepsilon}{\sqrt{\ln(1/\delta)}}\right)$ using Proposition A.6.

Proof. As before, we have $\sqrt{T} = O\left(\max\left(\epsilon_g^{-1}, \epsilon_H^{-3/2}\right)\right)$. After simplification, the order of $\sigma_g$ and $\sigma_H$ is
$$\frac{s}{\varepsilon}\sqrt{T\ln(2sT/\delta)\ln(1/\delta)} = \tilde O\left(\frac{s}{\varepsilon}\max\left(\epsilon_g^{-1}, \epsilon_H^{-3/2}\right)\right).$$
The asymptotic dependence of $n_{\min}$ follows by substituting $\sigma_g$ and $\sigma_H$ into (25).
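The parameter settings of Theorem B.8 and the amplification bound of Theorem B.6 are straightforward to compute; the helpers below reflect our reading of the flattened formulas (names are ours):

```python
import math

def amplified_epsilon(eps, s):
    """Privacy amplification by subsampling without replacement (Theorem B.6)."""
    return math.log(1 + s * (math.exp(eps) - 1))

def theorem_b8_noise(eps, delta, eps_f, delta_f, s, T):
    """Noise multipliers (sigma_f, sigma_g = sigma_H) of Theorem B.8."""
    eps0 = (eps - eps_f) / (8 * s * math.sqrt(2 * T * math.log(2 / (delta - delta_f))))
    delta0 = (delta - delta_f) / (4 * s * T)
    sigma_f = math.sqrt(2 * math.log(1.25 / delta_f)) / eps_f
    sigma_g = math.sqrt(2 * math.log(1.25 / delta0)) / eps0
    return sigma_f, sigma_g
```

Note that $s = 1$ gives $\varepsilon' = \varepsilon$ in the amplification bound, i.e., no amplification without subsampling.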

C COMPUTATION OF THE SMALLEST EIGENVALUE USING LANCZOS METHOD

In our algorithms, we need to compute the smallest eigenvalue of the perturbed Hessian. This can be done efficiently using the randomized Lanczos algorithm. We have the following result from Carmon et al. (2018).

Lemma C.1. Suppose that the Lanczos method is used to estimate the smallest eigenvalue of $H$, where $\|H\| \le M$, starting from a random vector uniformly distributed on the unit sphere. For any $\delta \in [0, 1)$, this approach finds the smallest eigenvalue of $H$ to an absolute precision of $\epsilon/2$, together with a corresponding direction $v$, in at most
$$\min\left(n,\; 1 + \frac{1}{2}\ln\left(2.75\,n/\delta^2\right)\sqrt{\frac{M}{\epsilon}}\right)$$
iterations, with probability at least $1 - \delta$.

To use the Lanczos method in our algorithm, we output an estimate $\tilde\lambda$ of $\lambda_{\min}(\tilde H)$ along with the corresponding eigenvector, provided that $\tilde\lambda \le -\epsilon_H/2$. If $\tilde\lambda > -\epsilon_H/2$, we declare that $\lambda_{\min}(\tilde H) \ge -\epsilon_H$, with an error probability of at most $\delta_L$. Our analysis and convergence results still hold after replacing $\epsilon_H$ by $\epsilon_H/2$ and multiplying the per-iteration success probability by the success probability $1 - \delta_L$ of the Lanczos algorithm.
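A self-contained sketch of randomized Lanczos for the smallest eigenvalue, accessing the matrix only through matrix-vector products and using full reorthogonalization for numerical stability (this is illustrative code, not the paper's implementation):

```python
import numpy as np

def lanczos_min_eig(matvec, d, iters, rng):
    """Estimate the smallest eigenvalue (and an eigenvector) of a symmetric
    d x d operator via Lanczos started from a random unit vector."""
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    V, alphas, betas = [v], [], []
    for _ in range(min(iters, d)):
        w = matvec(V[-1])
        alphas.append(V[-1] @ w)
        # orthogonalize against all previous Lanczos vectors (full reorth)
        for u in V:
            w = w - (u @ w) * u
        b = np.linalg.norm(w)
        if b < 1e-12:  # Krylov subspace exhausted
            break
        betas.append(b)
        V.append(w / b)
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)
    vals, vecs = np.linalg.eigh(T)                    # eigen-decompose tridiagonal T
    direction = np.stack(V[:k], axis=1) @ vecs[:, 0]  # Ritz vector in original space
    return vals[0], direction
```

On a $d$-dimensional problem, $d$ iterations recover the exact extreme eigenpair; Lemma C.1 shows far fewer iterations suffice for the $\epsilon/2$ accuracy needed by the algorithm.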

D MISSING PROOFS

D.1 PROOF OF LEMMA 2

Lemma D.1. With the short step size choices (7), if the noise satisfies the following conditions for some positive constants $c, c_1, c_2$ with $c_1 < \frac12$ and $c_2 + c < \frac13$:
$$\|\varepsilon_k\| \le \min\left(c_1\epsilon_g, \frac{c_2}{M}\epsilon_H^2\right), \qquad \|E_k\| \le c\,\epsilon_H,$$
then the amount of decrease in each step is at least
$$\mathrm{MIN\_DEC} = \min\left(\frac{1 - 2c_1}{2G}\,\epsilon_g^2,\; 2\left(\frac13 - c_2 - c\right)\frac{\epsilon_H^3}{M^2}\right). \quad (34)$$
The true gradient and the true minimum eigenvalue of the Hessian satisfy
$$\|g_k\| \le (1 + c_1)\|\tilde g_k\|, \qquad \lambda_k > -(1 + c)|\tilde\lambda_k|.$$

Proof. We use the following two standard bounds, which follow from the smoothness assumptions on $f$:
$$f(w + p) \le f(w) + \nabla f(w)^\top p + \frac{G}{2}\|p\|^2, \quad (35)$$
$$f(w + p) \le f(w) + \nabla f(w)^\top p + \frac12 p^\top\nabla^2 f(w)\,p + \frac{M}{6}\|p\|^3. \quad (36)$$
For simplicity, we drop the iteration index $k$ in the analysis below. For gradient steps, we have $\|\tilde g\| > \epsilon_g$. Write $g = \tilde g - \varepsilon$. Using $\|\varepsilon\| \le c_1\epsilon_g < c_1\|\tilde g\|$, it follows from (35) that
$$f(w - \gamma_g\tilde g) \le f - \gamma_g(\tilde g - \varepsilon)^\top\tilde g + \frac{G}{2}\gamma_g^2\|\tilde g\|^2 = f - \frac{1}{G}(\tilde g - \varepsilon)^\top\tilde g + \frac{1}{2G}\|\tilde g\|^2 \le f - \frac{1}{2G}\|\tilde g\|^2 + \frac{1}{G}\|\varepsilon\|\|\tilde g\| \le f - \frac{1}{2G}\|\tilde g\|^2 + \frac{c_1}{G}\|\tilde g\|^2 = f - \frac{1 - 2c_1}{2G}\|\tilde g\|^2 \le f - \frac{1 - 2c_1}{2G}\,\epsilon_g^2,$$
while the true gradient satisfies $\|g\| \le \|\tilde g\| + \|\varepsilon\| \le (1 + c_1)\|\tilde g\|$.

When a negative curvature step is taken, we have $\tilde\lambda < -\epsilon_H$. By assumption, $\|\varepsilon\| \le \frac{c_2}{M}\epsilon_H^2 < \frac{c_2}{M}|\tilde\lambda|^2$ and $\|E\| \le c\,\epsilon_H < c|\tilde\lambda|$. Recall the definition (5) of $\tilde p$, and write $g = \tilde g - \varepsilon$, $H = \tilde H - E$. From (36), with $\gamma_H = 2|\tilde\lambda|/M$, we have
$$f(w + \gamma_H\tilde p) \le f + \gamma_H g^\top\tilde p + \frac12\gamma_H^2\,\tilde p^\top H\tilde p + \frac{M}{6}\gamma_H^3\|\tilde p\|^3 = f + \gamma_H\tilde g^\top\tilde p + \frac12\gamma_H^2\,\tilde p^\top\tilde H\tilde p + \frac{M}{6}\gamma_H^3\|\tilde p\|^3 - \gamma_H\varepsilon^\top\tilde p - \frac12\gamma_H^2\,\tilde p^\top E\tilde p$$
$$\le f + \frac12\left(\frac{2|\tilde\lambda|}{M}\right)^2\left(-|\tilde\lambda|\right) + \frac{M}{6}\left(\frac{2|\tilde\lambda|}{M}\right)^3 + \frac{2|\tilde\lambda|}{M}\|\varepsilon\| + \frac12\left(\frac{2|\tilde\lambda|}{M}\right)^2\|E\| \le f - \frac23\,\frac{|\tilde\lambda|^3}{M^2} + \frac{2|\tilde\lambda|}{M}\|\varepsilon\| + \frac{2|\tilde\lambda|^2}{M^2}\|E\| \le f - \left(\frac23 - 2c_2 - 2c\right)\frac{|\tilde\lambda|^3}{M^2} \le f - 2\left(\frac13 - c_2 - c\right)\frac{\epsilon_H^3}{M^2},$$
provided that $c_2 + c < 1/3$; here we also used $\tilde g^\top\tilde p \le 0$, $\tilde p^\top\tilde H\tilde p = -|\tilde\lambda|$, and $\|\tilde p\| = 1$.

Let $\lambda$ denote the minimum eigenvalue of $H$. It follows from Weyl's inequality that $|\tilde\lambda - \lambda| \le \|E\| \le c|\tilde\lambda|$, and thus $\lambda > \tilde\lambda - c|\tilde\lambda| \ge -(1 + c)|\tilde\lambda|$.
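The two step types analyzed above can be sketched as follows, under our reading of (7): a gradient step of length $1/G$ while $\|\tilde g\| > \epsilon_g$, otherwise a negative-curvature step of length $2|\tilde\lambda|/M$ along the sign-aligned noisy eigenvector (function name and signature are ours):

```python
import numpy as np

def short_step(w, g_noisy, lam_noisy, v_noisy, G, M, eps_g):
    """One iteration of the short-step scheme.

    g_noisy:   perturbed gradient g~
    lam_noisy: perturbed minimum Hessian eigenvalue lam~ (negative if used)
    v_noisy:   corresponding unit eigenvector
    """
    if np.linalg.norm(g_noisy) > eps_g:
        return w - g_noisy / G                 # gradient step, gamma_g = 1/G
    p = v_noisy if g_noisy @ v_noisy <= 0 else -v_noisy  # make g~^T p <= 0
    return w + (2.0 * abs(lam_noisy) / M) * p  # curvature step, gamma_H = 2|lam~|/M
```

The sign alignment enforces $\tilde g^\top \tilde p \le 0$, which is what lets the proof drop the first-order term in the curvature-step bound.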

PROOF OF LEMMA 6

Lemma D.2. Assume the same bounded noise conditions (9) as before. Choose sufficient decrease coefficients $c_g \in (0, 1 - c_1)$ and $c_H \in \left(0,\; 1 - c - \sqrt{8c_2/3}\right)$, and let $\tilde\gamma_g = 2(1 - c_1 - c_g)/G$ and $\tilde\gamma_H = t_2|\tilde\lambda|/M$, as defined in Algorithm 3. Then the sufficient decrease conditions (SD1) and (SD2) are satisfied when $\gamma \le \tilde\gamma_g$ and $\gamma \in [(t_1/t_2)\tilde\gamma_H, \tilde\gamma_H]$, respectively, where $0 < t_1 < t_2$ are the solutions of the quadratic equation (real solutions exist given our choice of $c, c_2, c_H$)
$$r(t) := -\frac16 t^2 + \frac12(1 - c - c_H)\,t - c_2 = 0.$$
Explicitly, we have
$$t_1, t_2 = \frac32(1 - c - c_H) \mp 3\sqrt{\frac14(1 - c - c_H)^2 - \frac23 c_2}. \quad (37)$$
In particular, we have $q_g(\tilde\gamma_g) \ge 0$ and $q_H(\tilde\gamma_H) \ge 0$.

Proof. The analysis is similar to that of Lemma 2. Again, for simplicity, we drop the iteration index $k$. For gradient steps, we have $\|\tilde g\| > \epsilon_g$. Write $g = \tilde g - \varepsilon$. Using $\|\varepsilon\| \le c_1\epsilon_g < c_1\|\tilde g\|$, it follows from (35) that
$$f(w - \gamma\tilde g) \le f(w) - \gamma(\tilde g - \varepsilon)^\top\tilde g + \frac{G}{2}\gamma^2\|\tilde g\|^2 \le f(w) - \left(\gamma - \frac{G}{2}\gamma^2\right)\|\tilde g\|^2 + \gamma\|\tilde g\|\|\varepsilon\| \le f(w) - \gamma\left(1 - \frac{G}{2}\gamma - c_1\right)\|\tilde g\|^2.$$
It follows from the definition of $\tilde\gamma_g$ that (SD1) holds when $\gamma \le \tilde\gamma_g$.

When a negative curvature step is taken, we have $\tilde\lambda < -\epsilon_H$. By assumption, $\|\varepsilon\| \le \frac{c_2}{M}\epsilon_H^2 < \frac{c_2}{M}|\tilde\lambda|^2$ and $\|E\| \le c\,\epsilon_H < c|\tilde\lambda|$. Recall the definition (5) of $\tilde p$, and write $g = \tilde g - \varepsilon$, $H = \tilde H - E$. From (36), we have for $\gamma > 0$ that
$$f(w + \gamma\tilde p) \le f(w) + \gamma\tilde g^\top\tilde p + \frac12\gamma^2\tilde p^\top\tilde H\tilde p + \frac{M}{6}\gamma^3\|\tilde p\|^3 - \gamma\varepsilon^\top\tilde p - \frac12\gamma^2\tilde p^\top E\tilde p \le f(w) - \frac12\gamma^2|\tilde\lambda| + \frac{M}{6}\gamma^3 + \gamma\|\varepsilon\| + \frac12\gamma^2\|E\| \le f(w) - \underbrace{\left(\frac12\gamma^2(1 - c)|\tilde\lambda| - \gamma\,\frac{c_2}{M}|\tilde\lambda|^2 - \frac{M}{6}\gamma^3\right)}_{g(\gamma)}.$$
By reparameterizing $\gamma = t|\tilde\lambda|/M$, we obtain
$$g(\gamma) - \frac12 c_H\gamma^2|\tilde\lambda| = \left(-\frac16 t^3 + \frac12(1 - c - c_H)t^2 - c_2 t\right)\frac{|\tilde\lambda|^3}{M^2} = t\cdot r(t)\,\frac{|\tilde\lambda|^3}{M^2}.$$
Note that (SD2) holds when $g(\gamma) \ge \frac12 c_H\gamma^2|\tilde\lambda|$. The result follows from the fact that $r(t) \ge 0$ for $t \in [t_1, t_2]$.
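The interval endpoints $t_1, t_2$ in (37) are the roots of the quadratic $r(t)$; a small helper (the function name is ours):

```python
import math

def curvature_step_interval(c, c2, cH):
    """Roots t1 < t2 of r(t) = -(1/6) t^2 + (1/2)(1 - c - cH) t - c2."""
    b = 1.5 * (1 - c - cH)
    disc = b * b - 6.0 * c2  # equals 9 * [(1/4)(1-c-cH)^2 - (2/3) c2]
    if disc < 0:
        raise ValueError("no real roots: decrease c2 or cH")
    return b - math.sqrt(disc), b + math.sqrt(disc)
```

With $c = c_2 = c_H = 0$ the interval is $[0, 3]$, i.e., without noise any step up to $3|\tilde\lambda|/M$ satisfies (SD2).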

D.2 PROOF OF LEMMA 7

Lemma D.3. Using the DP line search Algorithm 3, assume the same bounded noise conditions (9) as before. Choose sufficient decrease coefficients $c_g \in (0, 1 - c_1)$ and $c_H \in \left(0,\; 1 - c - \sqrt{8c_2/3}\right)$, and define $\tilde\gamma_g$ and $\tilde\gamma_H$ as before. Choose initial step size multipliers $b_g, b_H > 1$ and decrease parameters $\beta_g \in (0, 1)$, $\beta_H \in (t_1/t_2, 1)$. Let $i_{\max} = \lfloor\log_\beta \max(b_g, b_H)\rfloor + 1$. If $n$ satisfies
$$n \ge 16\lambda\left(\log i_{\max} + \log(T/\xi)\right)\max\left(\frac{2b_g B_g}{c_g\epsilon_g},\; \frac{4b_H B_g M}{t_2 c_H\epsilon_H^2}\right),$$
then, with probability at least $1 - \xi/T$, the amount of decrease in a single step is at least
$$\mathrm{MIN\_DEC} = \min\left(\frac{1}{G}(1 - c_1 - c_g)\,c_g\,\epsilon_g^2,\; \frac14 c_H t_2^2\,\frac{\epsilon_H^3}{M^2}\right). \quad (38)$$

Proof. The analysis is similar to that of Lemma 2. Again, for simplicity, we drop the iteration index $k$. For gradient steps, we have $\|\tilde g\| > \epsilon_g$. Write $g = \tilde g - \varepsilon$. Using $\|\varepsilon\| \le c_1\epsilon_g < c_1\|\tilde g\|$, it follows from (35) that
$$f(w - \gamma\tilde g) \le f(w) - \gamma(\tilde g - \varepsilon)^\top\tilde g + \frac{G}{2}\gamma^2\|\tilde g\|^2 \le f(w) - \left(\gamma - \frac{G}{2}\gamma^2\right)\|\tilde g\|^2 + \gamma\|\tilde g\|\|\varepsilon\| \le f(w) - \gamma\left(1 - \frac{G}{2}\gamma - c_1\right)\|\tilde g\|^2.$$
It follows from the definition of $\tilde\gamma_g$ that (SD1) holds when $\gamma \le \tilde\gamma_g$.

When a negative curvature step is taken, we have $\tilde\lambda < -\epsilon_H$. By assumption, $\|\varepsilon\| \le \frac{c_2}{M}\epsilon_H^2 < \frac{c_2}{M}|\tilde\lambda|^2$ and $\|E\| \le c\,\epsilon_H < c|\tilde\lambda|$. Recall the definition (5) of $\tilde p$, and write $g = \tilde g - \varepsilon$, $H = \tilde H - E$. From (36), we have for $\gamma > 0$ that
$$f(w + \gamma\tilde p) \le f(w) + \gamma\tilde g^\top\tilde p + \frac12\gamma^2\tilde p^\top\tilde H\tilde p + \frac{M}{6}\gamma^3\|\tilde p\|^3 - \gamma\varepsilon^\top\tilde p - \frac12\gamma^2\tilde p^\top E\tilde p \le f(w) - \frac12\gamma^2|\tilde\lambda| + \frac{M}{6}\gamma^3 + \gamma\|\varepsilon\| + \frac12\gamma^2\|E\| \le f(w) - \underbrace{\left(\frac12\gamma^2(1 - c)|\tilde\lambda| - \gamma\,\frac{c_2}{M}|\tilde\lambda|^2 - \frac{M}{6}\gamma^3\right)}_{g(\gamma)}.$$
By reparameterizing $\gamma = t|\tilde\lambda|/M$, we obtain
$$g(\gamma) - \frac12 c_H\gamma^2|\tilde\lambda| = \left(-\frac16 t^3 + \frac12(1 - c - c_H)t^2 - c_2 t\right)\frac{|\tilde\lambda|^3}{M^2} = t\cdot r(t)\,\frac{|\tilde\lambda|^3}{M^2}.$$
Note that (SD2) holds when $g(\gamma) \ge \frac12 c_H\gamma^2|\tilde\lambda|$. The result follows from the fact that $r(t) \ge 0$ for $t \in [t_1, t_2]$.

D.3 PROOF OF COROLLARY 3

Corollary D.4 (Corollary 3). Assuming the noise satisfies (9) at each iteration, the short step version of the algorithm (using (7) and (10)) outputs a $((1 + c_1)\epsilon_g, (1 + c)\epsilon_H)$-2NS.

Proof. From the minimum decrease (10) just derived, it follows that the algorithm terminates in $T^*$ iterations, where
$$T^* = \frac{f(w_0) - f^*}{\mathrm{MIN\_DEC}}.$$
Our choice of $T$ in (6) is an upper bound on $T^*$, so the algorithm halts within $T$ iterations. At the iteration $k$ at which the algorithm halts, we have $\|\tilde g_k\| \le \epsilon_g$ and $\tilde\lambda_k \ge -\epsilon_H$. It follows from (11) that the output is a $((1 + c_1)\epsilon_g, (1 + c)\epsilon_H)$-2NS.
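The per-step decrease of Lemma 2 and the resulting iteration cap $T^*$ can be computed directly (function names are ours):

```python
import math

def min_decrease(G, M, c1, c2, c, eps_g, eps_H):
    """MIN_DEC of Lemma 2: the guaranteed decrease per accepted step."""
    grad_dec = (1 - 2 * c1) / (2 * G) * eps_g**2
    curv_dec = 2 * (1.0 / 3 - c2 - c) * eps_H**3 / M**2
    return min(grad_dec, curv_dec)

def iteration_bound(f0, f_star, G, M, c1, c2, c, eps_g, eps_H):
    """T* = (f(w0) - f*) / MIN_DEC: iterations until the algorithm halts."""
    return math.ceil((f0 - f_star) / min_decrease(G, M, c1, c2, c, eps_g, eps_H))
```

Since each iteration decreases $f$ by at least MIN_DEC and $f \ge f^*$, the algorithm cannot run longer than $T^*$ steps.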

D.4 PROOF OF THEOREM 4

Theorem D.5 (Sample complexity of the short step algorithm). Consider the ERM setting. Suppose that the number of samples $n$ satisfies $n \ge n_{\min}$, where
$$n_{\min} := \max\left(\frac{\sqrt{2d}\,B_g\sigma_g\sqrt{\log(T/\zeta)}}{\min\left(c_1\epsilon_g, \frac{c_2}{M}\epsilon_H^2\right)},\; \frac{C\sqrt{d}\,B_H\sigma_H\sqrt{\log(T/\zeta)}}{c\,\epsilon_H}\right).$$
With probability at least $\{(1 - \zeta/T)(1 - C\exp(-cCd))\}^T$, where $c$ and $C$ are the universal constants in Lemma D.7, the output of the short step version of the algorithm (using (7) and (10)) is a $((1 + c_1)\epsilon_g, (1 + c)\epsilon_H)$-2NS. With the choice of $\sigma$'s in (8), using $\rho_f = c_f\rho$ for $c_f \in (0, 1)$, and hiding logarithmic terms and constants, the asymptotic dependence of $n_{\min}$ on $(\epsilon_g, \epsilon_H)$, $\rho$, and $d$ is
$$n_{\min} = \frac{d}{\sqrt{\rho}}\,\tilde O\left(\max\left(\epsilon_g^{-2},\; \epsilon_g^{-1}\epsilon_H^{-2},\; \epsilon_H^{-7/2}\right)\right).$$
When $(\epsilon_g, \epsilon_H) = (\alpha, \sqrt{M\alpha})$, the dependence simplifies to $\frac{d}{\sqrt{\rho}}\,\tilde O(\alpha^{-2})$.

Before proving Theorem 4, we introduce two concentration results.

Lemma D.6 (Gaussian concentration (Vershynin, 2018)). For $x \sim \mathcal{N}(0, \sigma^2 I_d)$, with probability at least $1 - \eta$ for any $0 < \eta < 1$, we have $\|x\| \le \sqrt{2d}\,\sigma\sqrt{\log(1/\eta)}$.

Lemma D.7 (Upper tail estimate for Wigner ensembles (Tao, 2012, p. 110)). Let $M = (m_{ij})_{1\le i,j\le d}$ be a $d \times d$ random symmetric matrix. Suppose that the coefficients $m_{ij}$ for $j \ge i$ are independent, mean zero, and have uniform sub-Gaussian tails. Then there exist universal constants $C, c > 0$ such that for all $A \ge C$, we have $\mathbb{P}\left(\|M\| > A\sqrt{d}\right) \le C\exp(-cAd)$.

Proof. It follows from the concentration results that, in iteration $k$, with probability at least $(1 - \zeta/T)(1 - C\exp(-cCd))$, we have
$$\|\varepsilon_k\| \le \sqrt{2d}\,\Delta_g\sigma_g\sqrt{\log(T/\zeta)}, \qquad \|E_k\| \le C\sqrt{d}\,\Delta_H\sigma_H.$$
We need a condition on $n$ that ensures that these right-hand sides are no greater than the right-hand sides of (9). We substitute for $\Delta_g$ and $\Delta_H$ from (4) and solve for $n_{\min}$ by rearranging the terms. The result then follows from Corollary 3, provided the concentration results hold for all iterations. Now let us calculate the success probability. In each iteration, the concentration results hold with probability at least $(1 - \zeta/T)(1 - C\exp(-cCd))$ (if we do not compute the perturbed Hessian, the probability is at least $1 - \zeta/T$, which is higher). Using conditional probability, the overall success probability is $\{(1 - \zeta/T)(1 - C\exp(-cCd))\}^\tau$, conditioned on the number of iterations $\tau$. Since $\tau \le T$, the overall success probability is at least $\{(1 - \zeta/T)(1 - C\exp(-cCd))\}^T$.

For the second part, recall from (8) that $\sigma_g = \sigma_H = \sqrt{T/((1 - c_f)\rho)}$. With (10) and our choice of $T$ in (6), we have $\sqrt{T} = O\left(\max\left(\epsilon_g^{-1}, \epsilon_H^{-3/2}\right)\right)$. We obtain the asymptotic bound on $n_{\min}$ by substituting $\sigma_g$ and $\sigma_H$.

E EXPERIMENTAL SETTINGS AND ADDITIONAL EXPERIMENTS

We first remark that all our experiments were run on a cluster with a 36-core Intel Xeon Gold 6254 3.1GHz CPU, utilizing 8 CPU cores for each run.

E.1 DATASETS

The Covertype dataset contains $n = 581012$ data points. Each data point has the form $(x, y)$, where $x$ is a 54-dimensional feature vector (the first 10 columns are numerical, columns 11-14 form the WildernessArea one-hot vector, and the last 40 columns form the SoilType one-hot vector) and $y$, the label, is one of $\{1, 2, \ldots, 7\}$. For preprocessing, we normalize the first 10 numerical columns and keep only those samples for which $y \in \{1, 2\}$. The number of samples remaining in this restricted set is $n = 495141$. We recode $y = 2$ to $y = -1$, so that $y \in \{-1, 1\}$. The IJCNN dataset contains $n = 4999$ data points. Each point consists of $(x, y)$, where $x$ is a 22-dimensional feature vector (the first 10 columns are one-hot and columns 11-22 are numerical) and $y$, the label, is binary. For preprocessing, we normalize the data. For the RDP privacy accounting used in the mini-batched algorithm, we use the autodp package. Below we repeat the same experiment using the IJCNN dataset. We remark that for 2OPT-LS under $\varepsilon = 0.2$, the result is unavailable (reported as <NA>): due to numerical issues, the autodp package cannot handle subsampling with a very low privacy budget.
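The preprocessing described above can be sketched as follows (a hypothetical helper, assuming `X` holds the raw feature matrix and `y` the raw integer labels):

```python
import numpy as np

def preprocess_covertype(X, y):
    """Keep classes {1, 2}, z-score the first 10 numerical columns,
    and recode y = 2 to y = -1 so that y is in {-1, 1}."""
    mask = (y == 1) | (y == 2)
    X, y = X[mask].astype(float), y[mask]
    num = X[:, :10]
    X[:, :10] = (num - num.mean(axis=0)) / num.std(axis=0)  # normalize numeric cols
    y = np.where(y == 2, -1, 1)
    return X, y
```

The one-hot columns are left untouched, matching the description above.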

E.3 ADDITIONAL EXPERIMENTS

Additionally, we consider the logistic loss
$$l(w) = \frac{1}{n}\sum_{i=1}^n \frac{1}{1 + \exp\left(-y_i\langle x_i, w\rangle\right)} + \frac{\lambda}{2}\|w\|^2,$$
and repeat our experiments on the aforementioned datasets with $\lambda = 10^{-3}$. We can verify that the two chosen losses have Lipschitz gradients and Hessians as long as the feature vectors $x_i$ are bounded. In this set of experiments, we find solutions with $(\epsilon_g, \epsilon_H) = (0.040, 0.200)$ and $(\epsilon_g, \epsilon_H) = (0.020, 0.141)$. We also show the aggregated results for the number of noisy Hessian evaluations. We note that the number of noisy Hessian evaluations required by our algorithm is very low, whereas DP-TR must evaluate the noisy Hessian at every iteration.
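A sketch of this loss and its gradient, under our reading of the (flattened) formula above, with the gradient obtained by the chain rule; `sigmoid` denotes the logistic function:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_and_grad(w, X, y, lam):
    """l(w) = (1/n) sum_i sigmoid(y_i <x_i, w>) + (lam/2) ||w||^2,
    with gradient (1/n) sum_i s_i (1 - s_i) y_i x_i + lam * w,
    where s_i = sigmoid(y_i <x_i, w>)."""
    t = y * (X @ w)
    s = sigmoid(t)
    loss = s.mean() + 0.5 * lam * (w @ w)
    grad = (X * (s * (1 - s) * y)[:, None]).mean(axis=0) + lam * w
    return loss, grad
```

Both the gradient and the Hessian of this loss are Lipschitz whenever the rows of `X` are bounded, consistent with the remark above.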



We still evaluate noisy Hessians in our implementation, since there is no optimized support for this in PyTorch. Upon checking, the loss has Lipschitz gradients and Hessians as long as the feature vectors $x_i$ are bounded. We implemented DP-GD but could not produce practical results using the algorithmic parameters described in the DP-GD paper.

Covertype data source: UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/covertype
IJCNN data source: LIBSVM data repository, https://www.openml.org/search?type=data&sort=runs&id=1575&status=active
autodp open source repo: https://github.com/yuxiangw/autodp



[Result tables garbled in extraction; only the captions are recoverable:]

Table: Covertype: finding a loose solution: $(\epsilon_g, \epsilon_H) = (0.060, 0.245)$
Table: Covertype: finding a tight solution: $(\epsilon_g, \epsilon_H) = (0.030, 0.173)$
Table: IJCNN: finding a tight solution: $(\epsilon_g, \epsilon_H) = (0.020, 0.141)$
Table: Covertype (logistic loss): finding a loose solution: $(\epsilon_g, \epsilon_H) = (0.040, 0.200)$
Table: Covertype (logistic loss): finding a tight solution: $(\epsilon_g, \epsilon_H) = (0.020, 0.141)$
Table 6: Covertype Hessian evaluations (logistic loss): finding a loose solution: $(\epsilon_g, \epsilon_H) = (0.040, 0.200)$
Table: Covertype Hessian evaluations (logistic loss): finding a tight solution
Table: IJCNN (logistic loss): finding a loose solution: $(\epsilon_g, \epsilon_H) = (0.040, 0.200)$
Table: IJCNN (logistic loss): finding a tight solution: $(\epsilon_g, \epsilon_H) = (0.020, 0.141)$
Table 10: IJCNN Hessian evaluations (logistic loss): finding a loose solution: $(\epsilon_g, \epsilon_H) = (0.040, 0.200)$
Table: IJCNN Hessian evaluations (logistic loss): finding a tight solution

