ON DYNAMIC NOISE INFLUENCE IN DIFFERENTIALLY PRIVATE LEARNING

Abstract

Protecting privacy in learning while maintaining model performance has become increasingly critical in many applications that involve sensitive data. Private Gradient Descent (PGD) is a commonly used private learning framework, which noises gradients based on the Differential Privacy protocol. Recent studies show that dynamic privacy schedules of decreasing noise magnitudes can improve the loss at the final iteration, yet theoretical understanding of the effectiveness of such schedules and of their connections to optimization algorithms remains limited. In this paper, we provide a comprehensive analysis of noise influence in dynamic privacy schedules to answer these critical questions. We first present a dynamic noise schedule minimizing the utility upper bound of PGD, and show how the noise influence from each optimization step collectively impacts the utility of the final model. Our study also reveals how the impact of dynamic noise influence changes when momentum is used. We empirically show that the connection exists for general non-convex losses, and that the influence is greatly impacted by the loss curvature.

Our contributions are summarized as follows. 1) For the class of loss functions satisfying the Polyak-Lojasiewicz condition (Polyak, 1963), we show that a dynamic schedule improving the utility upper bound is shaped by the influence of per-iteration noise on the final loss. As the influence is tightly connected to the loss curvature, the advantage of using a dynamic schedule consequently depends on the loss function. 2) Beyond gradient descent, our results show that gradient methods with momentum implicitly introduce a dynamic schedule and result in an improved utility bound. 3) We empirically validate our results on convex and non-convex loss functions (the latter need not satisfy the PL condition). Our results suggest that the preferred dynamic schedule admits an exponentially decaying form, and works better when learning with high-curvature loss functions.
Moreover, dynamic schedules give more utility under stricter privacy conditions (e.g., smaller sample size and less privacy budget).

Differentially Private Learning. Differential privacy (DP) characterizes the chance that an algorithm output (e.g., a learned model) leaks private information in its training data when the output distribution is known. Since the outputs of many learning algorithms have undetermined distributions, the probability of their privacy leakage is hard to measure. A common approach to tackle this issue is to inject randomness with a known probability distribution to privatize the learning procedures. Classical methods include output perturbation (

1. INTRODUCTION

In the era of big data, privacy protection in machine learning systems is becoming a crucial topic, as an increasing amount of personal data is involved in training models (Dwork et al., 2020) and malicious attackers are present (Shokri et al., 2017; Fredrikson et al., 2015). In response to the growing demand, differentially private (DP) machine learning (Dwork et al., 2006) provides a computational framework for privacy protection and has been widely studied in various settings, including both convex and non-convex optimization (Wang et al., 2017; 2019; Jain et al., 2019). One widely used procedure for privacy-preserving learning is (Differentially) Private Gradient Descent (PGD) (Bassily et al., 2014; Abadi et al., 2016). A typical gradient descent procedure updates its model by the gradients of losses evaluated on the training data. When the data is sensitive, the gradients should be privatized to prevent excess privacy leakage. PGD privatizes a gradient by adding controlled noise. As such, the models from PGD are expected to have lower utility than those from unprotected algorithms. In cases where strict privacy control is exercised, or equivalently, where the privacy budget is tight, the accumulated effect of highly noised gradients may lead to unacceptable model performance. It is thus critical to design effective privatization procedures for PGD that maintain a good balance between utility and privacy. Recent years have witnessed a promising privatization direction that studies how to dynamically adjust the privacy-protecting noise during the learning process, i.e., dynamic privacy schedules, to boost utility under a specific privacy budget. One example is (Lee & Kifer, 2018), which reduced the noise magnitude when the loss does not decrease, based on the observation that the gradients become very small when approaching convergence, so that a static noise scale will overwhelm these gradients.
Another example is (Yu et al., 2019), which periodically decreased the magnitude following a predefined strategy, e.g., exponential decay or step decay. Both approaches confirmed the empirical advantages of decreasing noise magnitudes. Intuitively, the dynamic mechanism may coordinate with certain properties of the learning task, e.g., the training data and the loss surface. Yet there is no theoretical analysis available, and two important questions remain unanswered: 1) What is the form of utility-preferred noise schedules? 2) When and to what extent do such schedules improve utility? To answer these questions, in this paper we develop a principled approach to construct dynamic schedules and quantify their utility bounds in different learning algorithms.

Table 1: Comparison of utility upper bounds using different privacy schedules. The algorithms are $T$-iteration and $\frac{1}{2}R$-zCDP under the PL condition (unless marked with *). The $O$ notation in this table drops other $\ln$ terms. Unless otherwise specified, all algorithms terminate at step $T = O\left(\ln\frac{N^2 R}{D}\right)$. Assume loss functions are 1-smooth and 1-Lipschitz continuous, and all parameters satisfy their numeric assumptions. Key notations: $O_p$ - the bound occurs with probability $p$; $D$ - feature dimension; $N$ - sample size; $R$ - privacy budget; $c_i$ - constant; other notations can be found in Section 4. An extended table and explanation are available in Appendix A.

Algorithm | Schedule ($\sigma_t^2$) | Utility Upper Bound

on sample size accordingly. Recently, Zhou et al. proved the utility bound by using the momentum of gradients (Polyak, 1964; Kingma & Ba, 2014). Table 1 summarizes the upper bounds of the methods studied in this paper (in the last block of rows) and results from state-of-the-art algorithms based on private gradients. Our work shows that considering the dynamic influence can lead to a tighter bound.

3. PRIVATE GRADIENT DESCENT

Notations. We consider a learning task by empirical risk minimization (ERM), $f(\theta) = \frac{1}{N}\sum_{n=1}^{N} f(\theta; x_n)$, on a private dataset $\{x_n\}_{n=1}^{N}$ with $\theta \in \mathbb{R}^D$. The gradient methods are defined as $\theta_{t+1} = \theta_t - \eta_t \nabla_t$, where $\nabla_t = \nabla f(\theta_t) = \frac{1}{N}\sum_{n} \nabla f(\theta_t; x_n)$ denotes the non-private gradient at iteration $t$ and $\eta_t$ is the step learning rate. $\nabla_t^{(n)} = \nabla f(\theta_t; x_n)$ denotes the gradient on a sample $x_n$. $I_c$ denotes the indicator function that returns 1 if the condition $c$ holds, otherwise 0.

Assumptions. (1) In this paper, we assume $f(\theta)$ is continuous and differentiable. Many commonly used loss functions satisfy this assumption, e.g., the logistic function. (2) For a learning task, only a finite amount of privacy cost is allowed, where the maximum cost is called the privacy budget and denoted as $R$. (3) Generally, we assume that the sample-wise loss functions $f(\theta; x)$ are $G$-Lipschitz continuous and the empirical loss $f(\theta)$ is $M$-smooth.

Definition 3.1 ($G$-Lipschitz continuity). A function $f(\cdot)$ is $G$-Lipschitz continuous if, for $G > 0$ and all $x, y$ in the domain of $f(\cdot)$, $f(\cdot)$ satisfies $|f(y) - f(x)| \le G\|y - x\|_2$.

Definition 3.2 ($m$-strong convexity). A function $f(\cdot)$ is $m$-strongly convex if $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|^2$ for some $m > 0$ and all $x, y$ in the domain of $f(\cdot)$.

Definition 3.3 ($M$-smoothness). A function is $M$-smooth w.r.t. the $l_2$ norm if $f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{M}{2}\|y - x\|^2$ for some constant $M > 0$ and all $x, y$ in the domain of $f(\cdot)$.

For a private algorithm $\mathcal{M}(d)$ which maps a dataset $d$ to some output, the privacy cost is measured by the bound of the output difference on adjacent datasets. Adjacent datasets are defined to be datasets that differ in only one sample. In this paper, we use zero-Concentrated Differential Privacy (zCDP, see Definition 3.4) as the privacy measurement, because it is simple and allows the privacy costs of the iterations to be composed adaptively.
Various privacy metrics are discussed or reviewed in (Desfontaines & Pejó, 2019). A notable example is the Moment Accountant (MA) (Abadi et al., 2016), which adopts a similar principle for composing privacy costs but is less tight for a smaller privacy budget. We note that alternative metrics can be adapted to our study without major impact on the analysis.

Definition 3.4 ($\rho$-zCDP (Bun & Steinke, 2016)). Let $\rho > 0$. A randomized algorithm $\mathcal{M}: \mathcal{D}^n \to \mathbb{R}$ satisfies $\rho$-zCDP if, for all adjacent datasets $d, d' \in \mathcal{D}^n$, $D_\alpha(\mathcal{M}(d) \,\|\, \mathcal{M}(d')) \le \rho\alpha$, $\forall \alpha \in (1, \infty)$, where $D_\alpha(\cdot \,\|\, \cdot)$ denotes the Rényi divergence (Rényi, 1961) of order $\alpha$.

zCDP provides a linear composition of the privacy costs of subroutine algorithms. When the input vector is privatized by injecting Gaussian noise of $\mathcal{N}(0, \sigma_t^2 I)$ at the $t$-th iteration, the composed privacy cost is proportional to $\sum_t \rho_t$, where the step cost is $\rho_t = \frac{1}{\sigma_t^2}$. For simplicity, we absorb the constant coefficient into the (residual) privacy budget $R$. The formal theorems for the privacy cost computation of composition and Gaussian noising are included in Lemmas B.1 and B.2.

Generally, we define the Private Gradient Descent (PGD) method as the iterations for $t = 1, \dots, T$:

$\theta_{t+1} = \theta_t - \eta_t \phi_t = \theta_t - \eta_t\left(\nabla_t + \sigma_t G \nu_t / N\right)$, (1)

where $\phi_t = g_t$ is the gradient privatized from $\nabla_t$ as shown in Algorithm 1, $G/N$ is the bound of the sensitivity of the gathered gradient excluding one sample gradient, and $\nu_t \sim \mathcal{N}(0, I)$ is a vector whose elements follow the standard Gaussian distribution. We use $\sigma_t$ to denote the noise scale at step $t$ and use $\sigma$ to collectively represent the schedule $(\sigma_1, \dots, \sigma_T)$ when this causes no confusion. When the Lipschitz constant is unknown, we can control the upper bound by scaling the gradient if its norm exceeds some constant. The scaling operation is often called clipping in the literature, since it clips the gradient norm at a threshold. After the gradient is noised, we apply a modification, $\phi(\cdot)$, to enhance its utility.
In this paper, we consider two types of $\phi(\cdot)$:

$\phi(m_t, g_t) = g_t$ (GD),
$\phi(m_t, g_t) = \left[\beta(1 - \beta^{t-1}) m_t + (1 - \beta) g_t\right] / (1 - \beta^t)$ (Momentum).

We now show that the PGD using Algorithm 1 guarantees a privacy cost less than $R$:

Algorithm 1 Privatizing Gradients
Input: Raw gradients $[\nabla_t^{(1)}, \dots, \nabla_t^{(n)}]$ ($n = N$ by default), $v_t$, residual privacy budget $R_t$, assuming the full budget is $R$ and $R_1 = R$.
1: $\rho_t \leftarrow 1/\sigma_t^2$, $\nabla_t \leftarrow \frac{1}{n}\sum_{i=1}^{n} \nabla_t^{(i)}$   ▷ Budget request
2: if $\rho_t < R_t$ then
3:     $R_{t+1} \leftarrow R_t - \rho_t$
4:     $g_t \leftarrow \nabla_t + G\sigma_t \nu_t / N$, $\nu_t \sim \mathcal{N}(0, I)$   ▷ Privacy noise
5:     $m_{t+1} \leftarrow \phi(m_t, g_t)$, or $g_1$ if $t = 1$
6:     return $\eta_t m_{t+1}$, $R_{t+1}$   ▷ Utility projection
7: else
8:     Terminate
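The steps of Algorithm 1 can be sketched in a few lines of Python for the GD case $\phi(m_t, g_t) = g_t$. This is a minimal illustration rather than the authors' implementation: the function name `privatize_gradient`, the explicit clipping of each per-sample gradient to norm $G$, and the use of the batch size $n$ (equal to $N$ by default) in the noise term are assumptions made for the example.

```python
import math
import random


def privatize_gradient(per_sample_grads, sigma_t, G, residual_budget, rng):
    """One privatizing step for the GD case, phi(m, g) = g (hypothetical helper).

    Returns (g_t, new_residual_budget), or (None, residual_budget) when the
    requested step cost 1 / sigma_t^2 does not fit in the residual budget.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    rho_t = 1.0 / sigma_t ** 2                  # budget request
    if not rho_t < residual_budget:
        return None, residual_budget            # terminate
    # Assumption: clip each per-sample gradient to norm G so that the
    # sensitivity bound G / n holds even if the Lipschitz constant is unknown.
    clipped = []
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, G / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])
    mean = [sum(g[d] for g in clipped) / n for d in range(dim)]
    # Privacy noise: G * sigma_t * nu_t / n with nu_t ~ N(0, I).
    noisy = [mean[d] + G * sigma_t * rng.gauss(0.0, 1.0) / n for d in range(dim)]
    return noisy, residual_budget - rho_t
```

Calling the function in a loop with a schedule $\sigma_1, \dots, \sigma_T$ spends $\sum_t 1/\sigma_t^2$ of the budget and stops as soon as a request no longer fits, which mirrors the accounting used in Theorem 3.1.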

Theorem 3.1. Suppose $f(\theta; x)$ is $G$-Lipschitz continuous and the PGD algorithm, with privatized gradients defined by Algorithm 1, stops at step $T$. The PGD algorithm outputs $\theta_T$ and satisfies $\rho$-zCDP where $\rho \le \frac{1}{2}R$.

Note that Theorem 3.1 allows $\sigma_t$ to differ across iterations. Next we present a principled approach for deriving dynamic schedules optimized for the final loss $f(\theta_T)$.

4. DYNAMIC POLICIES BY MINIMIZING UTILITY UPPER BOUNDS

To characterize the utility of PGD, we adopt the Expected Excess Risk (EER), a notion widely used for analyzing the convergence of random algorithms, e.g., (Bassily et al., 2014; Wang et al., 2017). Due to the presence of noise and the limited number of learning iterations, optimization using private gradients is expected to reach a point with a higher loss (i.e., excess risk) than the optimal solution without private protection. Define $\theta^* = \arg\min_\theta f(\theta)$. After Algorithm 1 is iterated $T$ times in total, the EER gives the expected utility degradation: $\mathrm{EER} = E_\nu[f(\theta_{T+1})] - f(\theta^*)$. Due to the variety of loss functions and the complexity of the recursive iterations, an exact EER with noise is intractable for most functions. Instead, we study the worst-case scenario, i.e., the upper bound of the EER, and our goal is to minimize this upper bound. For consistency, we refer to the upper bound of the EER divided by the initial error as ERUB. The upper bound often has convenient functional forms which are (1) sufficiently simple that we can directly minimize them, and (2) closely related to the landscape of the objective, which depends on both the training dataset and the loss function. As a consequence, it is also used in the previous PGD literature (Pichapati et al., 2019; Wang et al., 2017) for choosing proper parameters. Moreover, we let $\mathrm{ERUB}_{\min}$ be the optimal upper bound achievable by a specific choice of parameters, e.g., $\sigma$ and $T$. In this paper, we consider the class of loss functions satisfying the Polyak-Lojasiewicz (PL) condition, which bounds losses by the corresponding gradient norms. It is more general than $m$-strong convexity: if $f$ is differentiable and $M$-smooth, then $m$-strong convexity implies the PL condition.
Definition 4.1 (Polyak-Lojasiewicz condition (Polyak, 1963)). For $f(\theta)$, there exists $\mu > 0$ such that, for every $\theta$, $\|\nabla f(\theta)\|^2 \ge 2\mu\left(f(\theta) - f(\theta^*)\right)$.

The PL condition helps us to reveal how the influence of step noise propagates to the final excess error, i.e., the EER. Though the assumption was also used previously in Wang et al. (2017) and Zhou et al. (2020), neither work discussed the propagated influence of noise. In the following sections, we will show how the influence can tighten the upper bound in gradient descent and its momentum variant.
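As a concrete instance of Definition 4.1, a diagonal quadratic $f(\theta) = \frac{1}{2}\sum_i a_i \theta_i^2$ with all $a_i > 0$ satisfies the PL condition with $\mu = \min_i a_i$ and $f(\theta^*) = 0$. The sketch below (the helper name `pl_gap` and the choice of a diagonal quadratic are illustrative assumptions, not taken from the paper) evaluates the slack $\|\nabla f(\theta)\|^2 - 2\mu(f(\theta) - f(\theta^*))$, which should be non-negative.

```python
def pl_gap(theta, eigs):
    """Slack of the PL inequality for f(theta) = 0.5 * sum(a_i * theta_i^2).

    Uses mu = min(a_i) and f(theta*) = 0; a non-negative return value means
    the PL condition holds at theta.
    """
    mu = min(eigs)
    f = 0.5 * sum(a * t * t for a, t in zip(eigs, theta))
    grad_sq = sum((a * t) ** 2 for a, t in zip(eigs, theta))
    return grad_sq - 2.0 * mu * f
```

The slack is zero exactly along the direction of the smallest curvature, so $\mu$ cannot be enlarged for this function.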

4.1. GRADIENT DESCENT METHODS

For brevity, we first define the following constants:

$\frac{1}{\alpha} \triangleq \frac{2RMN^2}{DG^2}\left(f(\theta_1) - f(\theta^*)\right)$, $\kappa \triangleq \frac{M}{\mu}$, and $\gamma \triangleq 1 - \frac{1}{\kappa}$, (2)

which satisfy $\kappa \ge 1$ and $\gamma \in [0, 1)$. Note that $\kappa$ is the condition number of $f(\cdot)$ if $f(\cdot)$ is strongly convex. $\kappa$ tends to be large if the function is sensitive to small differences in inputs, and $1/\alpha$ tends to be large if more samples are provided and the privacy budget is less strict. The convergence of PGD under the PL condition has been studied for private (Wang et al., 2017) and non-private (Karimi et al., 2016; Nesterov & Polyak, 2006; Reddi et al., 2016) ERM. Below we extend the bound in (Wang et al., 2017) by considering the dynamic influence of noise and by allowing $\sigma_t$ to be dynamic:

Theorem 4.1. Let $\alpha$, $\kappa$ and $\gamma$ be defined in Eq. (2), and $\eta_t = \frac{1}{M}$. Suppose $f(\theta; x_i)$ is $G$-Lipschitz and $f(\theta)$ is $M$-smooth and satisfies the Polyak-Lojasiewicz condition. For PGD, the following holds:

$\mathrm{ERUB} = \gamma^T + R\sum_{t=1}^{T} q_t \sigma_t^2$, where $q_t \triangleq \gamma^{T-t}\alpha$. (3)

In Eq. (3), the step noise magnitude $\sigma_t^2$ has an exponential influence, $q_t$, on the EER. The dynamic characteristic of the influence is the key to proving a tighter bound. Moreover, in the presence of the dynamic influence, it is natural to choose a dynamic $\sigma_t^2$. A static $\sigma_t^2$, with $q_t$ relaxed to a static 1, was studied by Wang et al., who proved a bound that is nearly optimal except for a $\ln^2 N$ factor. To get the optimal bound, in the following sections we look for the $\sigma$ and $T$ that minimize the upper bound.
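The bound of Theorem 4.1 is straightforward to evaluate for any candidate schedule. The following sketch computes the influence weights $q_t = \gamma^{T-t}\alpha$ and the resulting ERUB; the function names `influence` and `erub` are hypothetical, and the code does not check that the schedule respects the budget constraint $\sum_t 1/\sigma_t^2 \le R$.

```python
def influence(T, alpha, kappa):
    """Influence weights q_t = gamma^(T - t) * alpha for t = 1..T."""
    gamma = 1.0 - 1.0 / kappa
    return [gamma ** (T - t) * alpha for t in range(1, T + 1)]


def erub(sigma_sq, alpha, kappa, R):
    """ERUB = gamma^T + R * sum_t q_t * sigma_t^2 for a schedule sigma_sq.

    sigma_sq[t - 1] holds sigma_t^2; the caller is responsible for keeping
    sum(1 / s for s in sigma_sq) within the privacy budget R.
    """
    gamma = 1.0 - 1.0 / kappa
    T = len(sigma_sq)
    q = influence(T, alpha, kappa)
    return gamma ** T + R * sum(qt * s for qt, s in zip(q, sigma_sq))
```

Because $\gamma < 1$, the weights grow with $t$: noise injected near the end of training is discounted the least, which is the exponential influence discussed above.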

4.1.1. UNIFORM SCHEDULE

The uniform setting of $\sigma_t$ has been previously studied in Wang et al. (2017). Here, we show that the bound can be further tightened by considering the dynamic influence of iterations and a proper $T$.

Theorem 4.2. Suppose the conditions in Theorem 4.1 are satisfied. When $\sigma_t^2 = T/R$, let $\alpha$, $\gamma$ and $\kappa$ be defined in Eq. (2) and let $T$ be:

$T = O\left(\kappa \ln\left(1 + \frac{1}{\kappa\alpha}\right)\right)$. (4)

Meanwhile, if $\kappa \ge \frac{1}{1-c} > 1$ and $1/\alpha > 1/\alpha_0$ for some constants $c \in (0, 1)$ and $\alpha_0 > 0$, the corresponding bound is:

$\mathrm{ERUB}^{\mathrm{uniform}}_{\min} = \Theta\left(\frac{\kappa^2}{\kappa + 1/\alpha} \ln\left(1 + \frac{1}{\kappa\alpha}\right)\right)$. (5)

Sketch of proof. The key of the proof is to find a proper $T$ to minimize

$\mathrm{ERUB} = \gamma^T + \sum_{t=1}^{T} \gamma^{T-t}\alpha R \sigma^2 = \gamma^T + \alpha T \frac{1 - \gamma^T}{1 - \gamma} = \gamma^T + \alpha\kappa(1 - \gamma^T)T$,

where we use $\sigma_t^2 = T/R$. Setting its derivative with respect to $T$ to zero requires solving $\gamma^T \ln\gamma + \alpha\kappa(1 - \gamma^T) - \alpha\kappa T \gamma^T \ln\gamma = 0$, which however is intractable. In (Wang et al., 2017), $T$ is chosen to be $O(\ln(1/\alpha))$ and ERUB is relaxed as $\gamma^T + \alpha\kappa T^2$. The approximation results in a less tight bound of $O(\alpha(1 + \kappa\ln^2(1/\alpha)))$, which explodes as $\kappa \to \infty$. We observe that for a very sharp loss function, i.e., a large $\kappa$, any minor perturbation may result in tremendously fluctuating loss values. In this case, not stepping forward is a good choice. Thus, we choose $T = \frac{1}{\ln(1/\gamma)} \ln\left(1 + \frac{\ln(1/\gamma)}{\alpha}\right) \le O\left(\kappa \ln\left(1 + \frac{1}{\kappa\alpha}\right)\right)$, which converges to 0 as $\kappa \to +\infty$. The full proof is deferred to the appendix.
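Although the stationarity equation in the proof sketch has no closed-form solution, the uniform-schedule bound $\gamma^T + \alpha\kappa(1 - \gamma^T)T$ is cheap to minimize by brute force. The sketch below compares an exhaustive scan over integer $T$ with the closed-form choice from the proof sketch; the function names and the scan limit `T_max` are illustrative assumptions, and $\kappa > 1$ is assumed so that $\gamma > 0$.

```python
import math


def erub_uniform(T, alpha, kappa):
    """ERUB with sigma_t^2 = T / R: gamma^T + alpha * kappa * (1 - gamma^T) * T."""
    gamma = 1.0 - 1.0 / kappa
    return gamma ** T + alpha * kappa * (1.0 - gamma ** T) * T


def best_T_by_scan(alpha, kappa, T_max=10000):
    """Integer T in [1, T_max] minimizing the uniform-schedule ERUB."""
    return min(range(1, T_max + 1), key=lambda T: erub_uniform(T, alpha, kappa))


def closed_form_T(alpha, kappa):
    """T = ln(1 + ln(1/gamma) / alpha) / ln(1/gamma), as in the proof sketch."""
    log_inv_gamma = math.log(1.0 / (1.0 - 1.0 / kappa))
    return math.log(1.0 + log_inv_gamma / alpha) / log_inv_gamma
```

Evaluating `erub_uniform` at both values of $T$ for a few settings of $\alpha$ and $\kappa$ gives a quick sense of how much is lost by using the closed-form choice instead of the exact minimizer.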

4.1.2. DYNAMIC SCHEDULE

A dynamic schedule can improve the upper bound delivered by the uniform schedule. First, we observe that the excess risk in Eq. (3) is upper bounded by two terms: the first term characterizes the error due to the finite number of gradient descent iterations; the second term, a weighted sum, comes from the error propagated from the noise at each iteration. Now we show that for any $\{q_t \mid q_t > 0, t = 1, \dots, T\}$ (not limited to the $q_t$ defined in Eq. (3)), there is a unique $\sigma_t$ minimizing the weighted sum:

Lemma 4.1 (Dynamic schedule). Suppose $\sigma_t$ satisfy $\sum_{t=1}^{T} \sigma_t^{-2} = R$. Given a positive sequence $\{q_t\}$, the following equation holds:

$\min_\sigma R\sum_{t=1}^{T} q_t \sigma_t^2 = \left(\sum_{t=1}^{T} \sqrt{q_t}\right)^2$, when $\sigma_t^2 = \frac{1}{R}\sum_{i=1}^{T} \sqrt{\frac{q_i}{q_t}}$. (6)

Remarkably, the difference between the minimum and $T\sum_{t=1}^{T} q_t$ (uniform $\sigma_t$) increases monotonically with the variance of $\sqrt{q_t}$ over $t$. We see that the dynamics in $\sigma_t$ come from the non-uniform nature of the weight $q_t$. Since $q_t$ represents the impact of $\sigma_t$ on the final error, we call it the influence. Given the dynamic schedule in Eq. (6), it is of interest to what extent the ERUB can be improved. First, we present Theorem 4.3 to show the optimal $T$ and ERUB.

Theorem 4.3. Suppose the conditions in Theorem 4.1 are satisfied. Let $\alpha$, $\kappa$ and $\gamma$ be defined in Eq. (2). When $\eta_t = \frac{1}{M}$, the $\sigma_t$ (based on Eqs. (3) and (6)) and the $T$ minimizing ERUB are:

$\sigma_t^2 = \frac{1}{R}\,\frac{\sqrt{(1/\gamma)^T} - 1}{1 - \sqrt{\gamma}}\,\sqrt{\gamma^t}$, $T = 2\kappa \ln\left(1 + \frac{1}{\kappa\alpha}\right)$. (7)

Meanwhile, when $\kappa \ge 1$ and $1/\alpha \ge 1/\alpha_0$ for some positive constant $\alpha_0$, the minimal bound is:

$\mathrm{ERUB}^{\mathrm{dynamic}}_{\min} = \Theta\left(\frac{\kappa^2}{\kappa^2 + 1/\alpha}\right)$. (8)

4.1.3. DISCUSSION

In Theorems 4.2 and 4.3, we present what are, to the best of our knowledge, the tightest bounds for functions satisfying the PL condition. We further analyze the advantages of our bounds from two aspects: sample efficiency and robustness to sharp losses.

Sample efficiency. Since a dataset cannot be infinitely large, it is critical to know how accurately the model can be trained privately with a limited number of samples.
Formally, it is of interest to study the case where $\kappa$ is fixed and $N$ is large enough such that $\alpha \ll 1$. Then the upper bound in Eq. (5) becomes

$\mathrm{ERUB}^{\mathrm{uniform}}_{\min} \le O\left(\kappa^2 \alpha \ln\frac{1}{\kappa\alpha}\right) \le \tilde{O}\left(\frac{DG^2 \ln(N)}{MN^2 R}\right)$, (9)

where we ignore $\kappa$ and other logarithmic constants with $\tilde{O}$, as done in Wang et al. (2017). As a result, we get a bound very similar to that of (Wang et al., 2017), except that $R$ is replaced by $R_{MA} = \epsilon^2 / \ln(1/\delta)$ when using the Moment Accountant. In comparison, based on Lemma B.3, $R = 2\rho = 2\epsilon + 4\ln(1/\delta) + 4\sqrt{\ln(1/\delta)(\epsilon + \ln(1/\delta))}$ if $\theta_T$ satisfies $\rho$-zCDP. Because $\ln(1/\delta) > 1$, it is easy to see that $R = R_{zCDP} > R_{MA}$ when $\epsilon \le 2\ln(1/\delta)$. Compared to the bound reported in (Wang et al., 2017), our bound saves a factor of $\ln N$ and thus requires fewer samples to achieve the same accuracy. Remarkably, the saving is due to keeping the influence terms, as shown in the proof of Theorem 4.2. Using the dynamic schedule, we have

$\mathrm{ERUB}^{\mathrm{dynamic}}_{\min} \le O(\alpha) = O\left(\frac{DG^2}{MN^2 R}\right)$,

which saves another $\ln N$ factor in comparison to the bound using the uniform schedule in Eq. (9). As shown in Table 1, this advantage is maintained when comparing with other baselines.

Robustness. Besides sample efficiency, we are also interested in the robustness of the convergence in the presence of privacy noise. Because of the privacy noise, private gradient descent will be unable to reach an ideal spot. Specifically, when the samples are noisy or have noisy labels, the loss curvature may be sharp. The sharpness also implies lower smoothness, i.e., a small $M$, or a very small PL parameter. Thus, gradients may change tremendously at some steps, especially in the presence of privacy noise. As illustrated in the left figure, the highly curved loss function (the green curve) results in a higher mean final loss (the red dashed line) than the flatter curves (purple and blue lines). Such changes have a more critical impact when only a small number of iterations can be executed due to the privacy constraint.
Assume $\alpha$ is some constant while $\kappa \gg 1/\alpha$. We immediately get:

$\mathrm{ERUB}^{\mathrm{uniform}}_{\min} = \Theta\left(\kappa \ln\left(1 + \frac{1}{\kappa\alpha}\right)\right) = \Theta\left(\frac{1}{\alpha}\right) \le O\left(\frac{MN^2 R}{DG^2}\right)$, $\mathrm{ERUB}^{\mathrm{dynamic}}_{\min} = \Theta(1)$.

Both are robust, but the dynamic schedule has a smaller factor since $1/\alpha$ could be a large number. In addition, the factor implies that the dynamic schedule becomes more robust relative to the uniform one as more samples are used.

[Figure: solid lines are optimization trajectories and dashed horizontal lines are the averaged final losses.]

Section 4.1 shows that the step noise has an exponentially increasing influence on the final loss, and therefore a decreasing noise magnitude improves the utility upper bound by a $\ln N$ factor. However, the proper schedule can be hard to find when the curvature information, e.g., $\kappa$, is absent. A parameterized method that depends less on the curvature information is preferred. On the other hand, running many iterations will result in forgetting the initial iterations, since the accumulated noise overwhelms the information propagated from the beginning. This effect reduces the efficiency of recursive learning frameworks.
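The uniform-versus-dynamic comparison of Section 4.1 can also be checked numerically. The sketch below builds the schedule of Lemma 4.1, $\sigma_t^2 = \frac{1}{R}\sum_i \sqrt{q_i/q_t}$, and evaluates the bound of Theorem 4.1 under both schedules for the same $T$. Function names and the default $R = 1$ are assumptions made for the example; by the Cauchy-Schwarz inequality the dynamic value can never exceed the uniform one.

```python
import math


def dynamic_schedule(q, R):
    """Lemma 4.1 schedule: sigma_t^2 = (1 / R) * sum_i sqrt(q_i / q_t)."""
    root_sum = sum(math.sqrt(x) for x in q)
    return [root_sum / (R * math.sqrt(x)) for x in q]


def weighted_noise(q, sigma_sq, R):
    """Noise term of the bound: R * sum_t q_t * sigma_t^2."""
    return R * sum(x * s for x, s in zip(q, sigma_sq))


def compare(alpha, kappa, T, R=1.0):
    """Return (ERUB under the uniform schedule, ERUB under the dynamic one)."""
    gamma = 1.0 - 1.0 / kappa
    q = [gamma ** (T - t) * alpha for t in range(1, T + 1)]
    uniform = [T / R] * T
    dynamic = dynamic_schedule(q, R)
    return (gamma ** T + weighted_noise(q, uniform, R),
            gamma ** T + weighted_noise(q, dynamic, R))
```

Both schedules spend exactly the same budget, $\sum_t 1/\sigma_t^2 = R$, so any gap between the two returned values comes purely from how the noise is distributed over the iterations.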

4.2. GRADIENT DESCENT METHODS WITH MOMENTUM

As an alternative to GD, the momentum method can mitigate the two issues. It was originally proposed to stabilize the gradient estimation (Polyak, 1964). In this section, we show that momentum (which is agnostic to the curvature) can flatten the dynamic influence and improve the utility upper bound. Previously, Pichapati et al. used the momentum as an estimate of the gradient mean, without discussing convergence improvements. Zhou et al. gave a bound for Adam with DP. However, the derivation is based on the gradient norm, which results in a looser bound (see Table 1). The momentum method stabilizes gradients by taking a moving average of historical coordinate values and thus greatly reduces the variance. The $\phi(m_t, g_t)$ can be rewritten as:

$m_{t+1} = \phi(m_t, g_t) = \frac{v_{t+1}}{1 - \beta^t}$, $v_{t+1} = \beta v_t + (1 - \beta) g_t = (1 - \beta)\sum_{i=1}^{t} \beta^{t-i} g_i$, $v_1 = 0$,

where $\beta \in [0, 1]$. Note that $v_{t+1}$ is a biased estimate of the gradient expectation while $m_{t+1}$ is unbiased.

Theorem 4.4 (Convergence under PL condition). Suppose $f(\theta; x_i)$ is $G$-Lipschitz, and $f(\theta)$ is $M$-smooth and satisfies the Polyak-Lojasiewicz condition. Assume $\beta \ne \gamma$ and $\beta \in (0, 1)$. Let $\eta_t = \frac{\eta_0}{2M}$ and $\eta_0 \le 8\left(\sqrt{1 + 64\beta\gamma(\gamma - \beta)^{-2}(1 - \beta)^{-3}} + 1\right)^{-1}$. Then the following holds:

$\mathrm{EER} \le \Big(\gamma^T + \underbrace{2R\eta_0\alpha\, U_3(\sigma, T)}_{\text{noise variance}}\Big)\left(f(\theta_1) - f(\theta^*)\right) - \underbrace{\zeta\frac{\eta_0}{2M}\sum_{t=1}^{T} \gamma^{T-t} E\|v_{t+1}\|^2}_{\text{momentum effect}}$,

where $\gamma = 1 - \frac{\eta_0}{\kappa}$, $\zeta = 1 - \frac{1}{\beta(1 - \beta)^3}\eta_0^2 - \frac{1}{4}\eta_0 \ge 0$, and $U_3 = \sum_{t=1}^{T} \gamma^{T-t}\frac{(1 - \beta)^2}{(1 - \beta^t)^2}\sum_{i=1}^{t} \beta^{2(t-i)}\sigma_i^2$.

The upper bound includes three parts that influence the bound differently: (1) Convergence. The convergence term is mainly determined by $\eta_0$ and $\kappa$. $\eta_0$ should be in $(0, \kappa)$ so that the upper bound can converge. A large $\eta_0$ is preferred to speed up convergence if it does not make the other two terms worse. (2) Noise Variance. The second term, compressed in $U_3$, is the effect of the averaged noise, $\sum_{i=1}^{t} \beta^{2(t-i)}\sigma_i^2$.
One difference introduced by the momentum is the factor $(1 - \beta)/(1 - \beta^t)$, which is less than $\gamma^t$ at the beginning and converges to a non-zero constant $1 - \beta$. Therefore, in $U_3$, $\gamma^{T-t}(1 - \beta)/(1 - \beta^t)$ will be constantly less than $\gamma^T$ meanwhile. Furthermore, when $t > \hat{T}$, the moving average $\sum_{i=1}^{t} \beta^{2(t-i)}\sigma_i^2$ smooths the influence of each $\sigma_t$. In Appendix D, we will see that the influence dynamics are less steep than those of GD. (3) Momentum Effect. The momentum effect term can improve the upper bound when $\eta_0$ is small. For example, when $\beta = 0.9$ and $\gamma = 0.99$, then $\eta_0 \le 0.98/M$, which is a reasonable value. Following the analysis, when $M$ is large, which means the gradient norms will fluctuate significantly, the momentum term may take the lead. Adjusting the noise scale in this case may be less useful for improving utility. To give an insight into the effect of the dynamic schedule, we provide the following utility bounds.

Theorem 4.5 (Uniform schedule). Suppose the assumptions in Theorem 4.4 are satisfied. Let $\sigma_t^2 = T/R$, and let:

$\hat{T} = \max t$ s.t. $\gamma^{t-1} \ge \frac{1 - \beta}{1 - \beta^t}$, $T = O\left(\frac{\kappa}{\eta_0}\ln\left(1 + \frac{\eta_0}{\kappa\alpha}\right)\right)$.

Given some positive constant $c$ and $\alpha_0 > 0$ with $1/\alpha > 1/\alpha_0$, the following inequality holds:

$\mathrm{ERUB}_{\min} \le O\left(\frac{\kappa^2}{\kappa + \eta_0/\alpha}\left(I_{T \le \hat{T}} + \gamma^{\hat{T}-1}\ln\left(1 + \frac{\eta_0}{\kappa\alpha}\right) I_{T > \hat{T}}\right)\right)$.

Theorem 4.6 (Dynamic schedule). Suppose the assumptions in Theorem 4.4 are satisfied. Let $\alpha' = \frac{2\eta_0\alpha}{\gamma(1 - \gamma\beta^2)}$, $\beta < \gamma$ and $\hat{T} = \max t$ s.t. $\gamma^{t-1} \ge \frac{1 - \beta}{1 - \beta^t}$. Use the following schedule:

$\sigma_t^2 = \frac{1}{R}\sum_{i=1}^{T}\sqrt{\frac{q_i}{q_t}}$, $T_{\mathrm{dyn}} = O\left(\frac{2\kappa}{\eta_0}\ln\left(1 + \frac{\eta_0}{\kappa\alpha}\right)\right)$,

where $q_t = c_1\gamma^{T+t} I_{T \le \hat{T}} + \gamma^{\hat{T}-1} c_2\gamma^{T-t} I_{T > \hat{T}}$ for some positive constants $c_1$ and $c_2$. The following inequalities hold:

$\mathrm{ERUB} \le \gamma^T + 2\eta_0\alpha\sum_{t=1}^{T} R q_t \sigma_t^2$, $\mathrm{ERUB}_{\min} \le O\left(\frac{\kappa\alpha}{\kappa\alpha + \eta_0}\left(\frac{\kappa\alpha'}{\kappa\alpha + \eta_0} I_{T \le \hat{T}} + I_{T > \hat{T}}\right)\right)$.

Discussion. Theoretically, the dynamic schedule is more influential in vanilla gradient descent methods than in the momentum variant. The result is mainly attributed to the averaging operation.
The moving average, $(1 - \beta)\sum_{i=1}^{t} \beta^{t-i} g_i / (1 - \beta^t)$, increases the influence of the under-represented initial steps and decreases that of the over-sensitive last steps. Counterintuitively, the preferred dynamic schedule should be increasing, since $q_t$ decreases when $t \le \hat{T}$.
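The two forms of the momentum update used in this section, the recursion $\phi(m_t, g_t)$ and the bias-corrected moving average, are equivalent, and this is easy to verify on scalar gradients. The sketch below is illustrative only: the function names are hypothetical, gradients are treated as scalars for brevity, and $\beta \in (0, 1)$ is assumed.

```python
def momentum_recursive(grads, beta):
    """Apply m_{t+1} = [beta * (1 - beta^(t-1)) * m_t + (1 - beta) * g_t] / (1 - beta^t).

    grads[t - 1] holds g_t; returns [m_2, m_3, ..., m_{T+1}].
    """
    m = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = (beta * (1.0 - beta ** (t - 1)) * m + (1.0 - beta) * g) / (1.0 - beta ** t)
        out.append(m)
    return out


def momentum_weights(t, beta):
    """Weights placed on g_1..g_t by m_{t+1}: (1 - beta) * beta^(t - i) / (1 - beta^t)."""
    return [(1.0 - beta) * beta ** (t - i) / (1.0 - beta ** t) for i in range(1, t + 1)]
```

At $t = 1$ the coefficient of $m_t$ vanishes, so the recursion returns $g_1$, matching line 5 of Algorithm 1. The weights always sum to one, which is why $m_{t+1}$ is an unbiased average, and the privacy noise carried by each $g_i$ is averaged with the same weights.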

4.3. PRIVATE STOCHASTIC GRADIENT DESCENT

Though PGD provides a guarantee for both utility and privacy, computing gradients over the whole dataset is impractical for large-scale problems. For this reason, studying the convergence of Private Stochastic Gradient Descent (PSGD) is meaningful. Algorithm 1 can be easily extended to PSGD by subsampling $n$ gradients, where the batch size $n \ll N$. According to (Yu et al., 2019), when privacy is measured by zCDP, there are two ways to account for the privacy cost of PSGD depending on the batch-sampling method: sub-sampling with or without replacement. In this paper, we focus on random subsampling with replacement since it is widely used in the deep learning literature, e.g., (Abadi et al., 2016; Feldman et al., 2020). Accordingly, we replace $N$ in the definition of $\alpha$ by $n$, because the term comes from the sensitivity of the batch data (see Eq. (1)). For clarity, we assume that $T$ is the number of iterations rather than epochs and that $\tilde{\nabla}_t$ is the mean stochastic gradient. When a batch of data is randomly sampled, the privacy cost of one iteration is $cp^2/\sigma_t^2$, where $c$ is some constant, $p = n/N$ is the sample rate, and $1/\sigma_t^2$ is the full-batch privacy cost. For details of the sub-sampling theorems, we refer to Theorem 3 of (Yu et al., 2019) and their empirical setting. Therefore, we can replace the privacy constraint $\sum_t p^2/\sigma_t^2 = R$ by $\sum_t 1/\sigma_t^2 = R'$, where $R' = R/p^2 = \frac{N^2}{n^2}R$. Note that we omit the constant $c$ because it does not affect the results regarding uniform or dynamic schedules. Notice that $N^2 R$ in $\alpha$ is replaced by $n^2 R' = N^2 R$. Thus, the form of $\alpha$ is unchanged, which is convenient for the following derivations. Now we study the utility bound of PSGD. To quantify the randomness of batch sampling, we define a random vector $\xi_t$ with $E[\xi_t] = 0$ and $E\|\xi_t\|^2 \le D$ such that $\tilde{\nabla}_t \le \nabla_t + \sigma_g \xi_t / n$ for some positive constant $\sigma_g$. Because $\xi_t$ has properties similar to those of the privacy noise $\nu_t$, we can easily extend the PGD bounds to PSGD bounds with the following theorems.
Theorem 4.7 (Utility bounds of PSGD). Let $\alpha$, $\kappa$ and $\gamma$ be defined in Eq. (2), and $\eta_t = \frac{1}{M}$. Suppose $f(\theta; x_i)$ is $G$-Lipschitz and $f(\theta)$ is $M$-smooth and satisfies the Polyak-Lojasiewicz condition. For PSGD, when the batch size satisfies $n = \max\{N\sqrt{R}, 1\}$, the following holds:

$\mathrm{ERUB} = \gamma^T + \alpha_g\sigma_g^2 + R'\sum_{t=1}^{T} q_t\sigma_t^2$, where $q_t \triangleq \gamma^{T-t}\alpha$, $\sum_t 1/\sigma_t^2 = R'$, (14)

where $\alpha_g = \frac{D}{2\mu N^2 R\,(f(\theta_1) - f(\theta^*))}$.

Theorem 4.8 (PSGD with momentum). Let $\alpha_g = \frac{D}{2\mu N^2 R\,(f(\theta_1) - f(\theta^*))}$. Suppose the assumptions in Theorem 4.4 hold. When the batch size satisfies $n = \max\{N\sqrt{R}, 1\}$ and PSGD is used, $U_3(\sigma, T)$ has to be replaced by $\tilde{U}_3 = U_3^g + U_3$, with

$\alpha R\, U_3^g \le \alpha_g\sigma_g^2$. (15)

As shown above, the utility bound of PSGD differs from that of PGD merely by $\alpha_g\sigma_g^2$. Note that $\alpha_g = O\left(\frac{D}{N^2 R}\right)$, which fits the order of the dynamic-schedule bounds. In addition, $\alpha$ and the other variables are unchanged. Hence, the conclusions regarding the dynamic and uniform schedules remain the same.
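The budget rescaling used for PSGD is simple arithmetic and can be sanity-checked directly. The sketch below computes the equivalent full-batch budget $R' = R/p^2$ with $p = n/N$ and the batch size $n = \max\{N\sqrt{R}, 1\}$ from Theorems 4.7 and 4.8. The function names are hypothetical, the constant $c$ is omitted as in the text, and rounding the batch size down to an integer is an assumption of this example.

```python
import math


def psgd_budget(R, N, n):
    """Equivalent full-batch budget R' = R / p^2 for sample rate p = n / N."""
    p = n / N
    return R / p ** 2


def psgd_batch_size(R, N):
    """Batch size n = max(N * sqrt(R), 1), rounded down (rounding is an assumption)."""
    return max(int(N * math.sqrt(R)), 1)
```

For any batch size the product $n^2 R'$ equals $N^2 R$, which is why the form of $\alpha$ is unchanged when moving from PGD to PSGD.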

5. EXPERIMENTS

We empirically validate the properties of privacy schedules and their connections to learning algorithms. In this section, we briefly review the schedule behavior on quadratic losses under varying data sensitivity. Details of the experimental setups and empirical results are available in Appendix D. We first show the estimated influence of step noise $q_t$ (obtained by retraining the private learning algorithms, see Appendix D) in Fig. 2 Left. We see that the trends of influence are approximately exponential in $t$. This observation motivates the use of an exponentially decaying schedule in practice. We then show the trends of the variance of influence (dashed lines with the right axis) and the relative final losses (solid lines with the left axis) in the Middle Pane, where uni denotes the uniform schedule baseline, exp is an exponential schedule, and dyn denotes the dynamic schedule minimizing the ERUB. The influence increases steeply when the data scale is large and therefore has a large variance. Meanwhile, dynamic schedules show improvements in the final loss when the variance is large. This reveals the connection between the influence and the dynamic advantage (refer to Lemma 4.1). We lastly evaluate the impact of momentum in the Right Pane, using a Deep Neural Network (DNN) with 2 layers and 100 hidden units. Because of the time cost of training deep networks, we do not estimate the influence by retraining and then computing schedules. Instead, we grid-search over the schedule hyper-parameters to find the best one. We see that the influence modeled by an exponential function (expinfl) has performance comparable to that of the influence modeled by a linear combination of two reverse exponential functions (momexpinfl). The latter only shows an advantage in the setting where the data scale is 25 and the number of iterations is only 100, as expected from our analysis in Theorems 4.5 and 4.6. The inherent reason is that the dynamic schedule is more effective when $T$ is larger.

6. CONCLUSION

When a privacy budget is provided for a certain learning task, one has to carefully schedule the privacy usage throughout the learning process. Uniformly scheduling the budget has been widely used in the literature, whereas increasing evidence suggests that dynamic schedules could empirically outperform the uniform one. This paper provided a principled analysis of the problem of optimal budget allocation and connected the advantages of dynamic schedules to both the loss structure and the learning behavior. We further validated our results through empirical studies.

A COMPARISON OF ALGORITHMS

Algorithm | Schedule ($\sigma_t^2$) | Utility Upper Bound
 |  | $O\left(\frac{D\ln^3 N}{N R_{\epsilon,\delta}}\right)$
GD+MA (Wang et al., 2017) | $O\left(\frac{T}{R_{\epsilon,\delta}}\right)$ | $O\left(\frac{D\ln^2 N}{N^2 R_{\epsilon,\delta}}\right)$
*GD+Adv+BBImp (Cummings et al., 2018) | $O\left(\frac{n^2\ln(n/\delta)}{R_{\epsilon,\delta}}\right)$ | $O_p\left(\frac{D^2\ln^2(1/p)}{R_{\epsilon,\delta} N^{1-c}}\right)$
Adam+MA (Zhou et al., 2020) | $O\left(\frac{T}{R_{\epsilon,\delta}}\right)$ | $O_p\left(\frac{\sqrt{D}\ln(ND/(1-p))}{N R_{\epsilon,\delta}}\right)$
GD, Non-Private | $0$ | $O\left(\frac{D}{N^2 R}\right)$
GD+zCDP, Static Schedule | $\frac{T}{R}$ | $O\left(\frac{D\ln N}{N^2 R}\right)$
GD+zCDP, Dynamic Schedule | $O\left(\frac{\gamma^{(t-T)/2}}{R}\right)$ | $O\left(\frac{D}{N^2 R}\right)$
Momentum+zCDP, Static Schedule | $\frac{T}{R}$ | $O\left(\frac{D}{N^2 R}\left(c + \ln N\, I_{T > \hat{T}}\right)\right)$
Momentum+zCDP, Dynamic Schedule | $O\left(\frac{c_1\gamma^{T+t} + c_2\gamma^{(T-t)/2}}{R}\right)$ | $O\left(\frac{D}{N^2 R}\left(1 + \frac{cD}{N^2 R} I_{T > \hat{T}}\right)\right)$

We present Table 2 as a supplement to Table 1. Asymptotic upper bounds are achieved when the sample size $N$ approaches infinity. Both $R$ and $R_{\epsilon,\delta}$, with $R_{\epsilon,\delta} < R$, are the privacy budgets of the corresponding algorithms. Specifically, $R_{\epsilon,\delta} = \epsilon^2/\ln(1/\delta) < R$ when the private algorithm is $(\epsilon, \delta)$-DP with $\epsilon \le 2\ln(1/\delta)$.

PGD+Adv. Adv denotes the Advanced Composition method (Bassily et al., 2014). The method assumes that the loss function is 1-strongly convex, which implies the PL condition, and that the optimized variable is in a convex set of diameter 1 w.r.t. the $l_2$ norm.

PGD+MA. MA denotes the Moment Accountant (Abadi et al., 2016), which improves the composed privacy bound over Advanced Composition. The improvement in the privacy bound leads to an enhanced utility bound as a result.

PGD+Adv+BBImp.
The dynamic method assumes that the loss is 1-strongly convex and data comes in stream with n ≤ N samples at each round. Their utility upper bound is achieved at some probability p with any positive c. Adam+MA. The authors prove a convergence bound for the gradient norms which is extended to loss bound by using PL condition. They also presents the results for AdaGrad and GD which are basically of the same upper bound. Out theorems improve their bound by using the recursive derivation based on the PL condition, while their bound is a simple application of the condition on the gradient norm bound. GD, Non-Private. This method does not inject noise into gradients but limit the number of iterations. With the bound, we can see that our utility bound are optimal with dynamic schedule. GD+zCDP. We discussed the static and dynamic schedule for the gradient descent method where the dynamic noise influence is the key to tighten the bound. Momemtum+zCDP. Different from the GD+zCDP, momentum methods will have two phase of utility upper bound. When T is small than some positive constant T , the bound is as tight as the non-private one. Afterwards, the momentum has a bound degraded as the GD bound. O(ln N 2 R ,δ D ) SSGD+zCDP (Feldman et al., 2020) O 1 √ N + 2 √ D √ RN ln N N 2 16D/R 2 +4N * SGD+MA (Bassily et al., 2019) O max √ D N √ R ,δ , 1 √ N min{ N 8 , N 2 R ,δ 32D } GD+zCDP, Static Schedule O1-p G 2 µN D ln(N ) ln(1/p) N R + 4 p O(ln N 2 R D ) GD+zCDP, Dynamic Schedule O1-p G 2 µN D ln(1/p) N R + 4 p O(ln N 2 R D ) Momentum+zCDP, Static Sch. O1-p G 2 µN D ln(1/p) N R (c + ln N I T > T ) + 4 p O(ln N 2 R D ) Momentum+zCDP, Dynamic Sch. O1-p G 2 µN D ln(1/p) N R (1 + cD N 2 R I T > T ) + 4 p O(ln N 2 R D ) GD, Non-Private O D N 2 R O(ln N 2 R D ) GD+zCDP, Static Schedule O D ln N N 2 R O(ln N 2 R D ) GD+zCDP, Dynamic Schedule O D N 2 R O(ln N 2 R D ) Momentum+zCDP, Static Sch. O D N 2 R (c + ln N I T > T ) O(ln N 2 R D ) Momentum+zCDP, Dynamic Sch. 
O D N 2 R (1 + cD N 2 R I T > T ) O(ln N 2 R D ) A.1 COMAPRISON OF GENERALIZATION BOUNDS In addition to the empirical risk bounds in Table 2 , in this section we study the true risk bounds, or generalization error bounds. True risk bounds characterize how well the learnt model can generalize to unseen samples subject to the inherent data distribution. By leveraging the generic learning-theory tools, we extend our results to the True Excess Risk (TER) for strongly convex functions as follows. For a model θ, its TER is defined as follows: TER E x∼X [E[f (θ; x)]] -min θ E x∼X [f ( θ; x)], where the second expectation is over the randomness of generating θ (e.g., the noise and stochastic batches). Assume a dataset d consist of N samples drawn i.i.d. from the distribution X . Two approaches could be used to extend the empirical bounds to the true excess risk: One is proposed by Shalev-Shwartz et al. (2009) where the true excess risk of PGD can be bounded in high probability. For example, Bassily et al. (2014) achieved a ln 2 N N bound with N 2 iterations. Alternatively, instead of relying on the probabilistic bound, Bassily et al. (2019) used the uniform stability to give a tighter bound. Later, Feldman et al. ( 2020) improve the efficiency of gradient computation to achieve a similar bound. Both approaches introduce an additive term to the empirical bounds. In this section, we adopt both approaches to investigate the two types of resulting true risk bounds. (1) True Risk in High Probability. First, we consider the high-probability true risk bound. Based on Section 5.4 from (Shalev-Shwartz et al., 2009) (restated in Theorem A.1), we can relate the EER to the TER. Theorem A.1. Let f (θ; x) be G-Lipschitz, and f (θ) be µ-strong convex loss function given any x ∈ X . With probability at least 1 -p over the randomness of sampling the data set d, the following inequality holds: TER(θ) ≤ 2G 2 µN f (θ) -f (θ * ) + 4G 2 pµN , where θ * = arg min θ f (θ). To apply the Eq. 
( 16), we need to extend EER, the expectation bound, to a high-probability bound. Following (Bassily et al., 2014 ) (Section D), we repeate the PGD with privacy budget R/k for k times. Note, the output of all repetitions is still of R budget. When k = 1, let the EER of the algorithm be denoted as F (R). Then the EER of one execution of the k repetitions is F (R/k) where privacy is accounted by zCDP. When k = log 2 (1/p) for p ∈ [0, 1], by Markov's inequality, there exists one repetition whose EER is F (R/ log 2 (1/p)) with probability at least 1 -1/2 k = 1 -p. Combined with Eq. ( 16), we use the bounds of uniform schedule and dynamic schedules in Section 4.1.3 to obtain: TER uniform ≤ Õ G 2 µN D ln(N ) ln(1/p) N R + 4 p , TER dynamic ≤ Õ G 2 µN D ln(1/p) N R + 4 p , where we again ignore the κ and other constants. Similarly, we can extend the momentum methods. (  sup x∈X E[f (M(d); x) -f (M(d ); x)] ≤ s, where the expectation is over the internal randomness of M. Theorem A.2 (See, e.g., (Shalev-Shwartz & Ben-David, 2014) ). Suppose M : D N → Θ is a s-uniformly stable algorithm w.r.t. the loss function f . Let D be any distribution from over data space and let d ∼ D N . The following holds true. E d∼D N [E[f (M(d); D) -f (M(d); d)]] ≤ s, where the second expectation is over the internal randomness of M. f (M(d); D) and f (M(d); d) represent the true loss and the empirical loss, respecitvely. Theorem A.3 (Uniform stability of PGD from (Bassily et al., 2019) ). Suppose η < 2/M for M smooth, G-Lipschitz f (θ; x). Then PGD is s-uniformly stable with s = G 2 T η/N . Combining Theorems A.2 and A.3, we obtain the following: TER ≤ EER +G 2 ηT N . Because EER in this paper compresses a γ T or similar exponential terms, unlike (Bassily et al., 2019) , we cannot directly minimize the TER upper bound w.r.t. T and η in the presence of a polynomial form of γ T and T . Therefore, we still use T = O(ln N 2 R D ) and η for minimizing EER. 
Note that $\frac{G^2 \eta T}{N} \le O\left(\frac{G^2}{MN} \ln \frac{N^2 R}{D}\right) \le O\left(\frac{G^2}{M}\right)$, where we assume $N \gg D$ and use $\ln N \le N$. Because the term $O(G^2/M)$ is constant and independent of the dimension, we follow (Bassily et al., 2019) and drop it when comparing the bounds. After dropping the additive term, the advantage of dynamic schedules is preserved since TER ≤ EER. A similar extension can be derived for (Wang et al., 2017). We summarize the results and compare them to prior works in Table 3, where we include an additional method: Snowball Stochastic Gradient Descent (SSGD). SSGD dynamically schedules the batch size to achieve an optimal convergence rate in linear time.

Discussion. By using uniform stability, we successfully transfer the advantage of our dynamic schedules from empirical bounds to true risk bounds. The inherent reason is that our bounds only need $\ln N$ iterations to reach the preferred final loss. With uniform stability, the logarithmic $T$ reduces the gap caused by the transfer. Compared to (Feldman et al., 2020; Bassily et al., 2019), our method remarkably improves the efficiency in $T$, from $N$ or $N^2$ to $\ln(N)$. This implies that fewer iterations are required to converge to the same generalization error.

… $_i\}_{i=1}^{N}$ is
$$\Delta_2(\nabla_t) = \max_n \left\| \frac{1}{N} \sum_{j=1, j \neq n}^{N} \nabla_t^{(j)} - \frac{1}{N} \sum_{j=1}^{N} \nabla_t^{(j)} \right\|_2 = \frac{1}{N} \max_n \left\| \nabla_t^{(n)} \right\|_2,$$
where $\nabla_t^{(n)}$ denotes the gradient of the $n$-th sample.

Lemma B.2 (Gaussian mechanism (Bun & Steinke, 2016)). Let $f : \mathcal{D}^n \to \mathcal{Z}$ have sensitivity $\Delta$. Define a randomized algorithm $M : \mathcal{D}^n \to \mathcal{Z}$ by $M(x) \leftarrow f(x) + \mathcal{N}(0, \Delta^2 \sigma^2 I)$. Then $M$ satisfies $\frac{1}{2\sigma^2}$-zCDP.

Lemma B.3 ((Bun & Steinke, 2016)). If $M$ is a mechanism satisfying $\rho$-zCDP, then $M$ is $(\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta)$-DP for any $\delta > 0$.

By solving $\rho + 2\sqrt{\rho \ln(1/\delta)} = \epsilon$ for the root with $\sqrt{\rho} \ge 0$, we get $\rho = \epsilon + 2\ln(1/\delta) - 2\sqrt{\ln(1/\delta)\,(\epsilon + \ln(1/\delta))}$.
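The conversion in Lemma B.3 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the inverse is obtained by treating $\rho + 2\sqrt{\rho \ln(1/\delta)} = \epsilon$ as a quadratic in $\sqrt{\rho}$, which gives $\sqrt{\rho} = \sqrt{\ln(1/\delta) + \epsilon} - \sqrt{\ln(1/\delta)}$.

```python
import math


def zcdp_to_dp_epsilon(rho, delta):
    """Lemma B.3: rho-zCDP implies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def dp_to_zcdp_rho(epsilon, delta):
    """Invert Lemma B.3: the rho whose converted epsilon equals the target.

    Solving the quadratic in sqrt(rho) and keeping the non-negative root gives
    sqrt(rho) = sqrt(ln(1/delta) + epsilon) - sqrt(ln(1/delta)).
    """
    log_term = math.log(1.0 / delta)
    root = math.sqrt(log_term + epsilon) - math.sqrt(log_term)
    return root * root


# Hypothetical usage with the (4, 1e-8)-DP target quoted in Appendix D.
rho = dp_to_zcdp_rho(4.0, 1e-8)
total_budget = 2.0 * rho  # R, using rho = R / 2 for the composed mechanism
```

With these inputs the result agrees with the 0.1963-zCDP budget and $R = 0.3927$ stated in the experimental setup, up to rounding.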

B.2 AUXILIARY LEMMAS

Lemma B.4. If max n x n 2 = 1 and 1 N n x n = 0, then the gradient sensitivity of the squared loss will be ∆ 2 (∇) = max i 1 N 2f (θ; x i ) x i 2 ≤ 1 2 (DM θ 2 + 1), where Θ M is the set of all possible parameters θ t generated by the learning algorithm M. Proof. According to the definition of sensitivity in Eq. ( 19), we have ∆ 2 (∇) = max i ∇ (i) 2 = max n 1 n A (i) θ -x i 2 where we use i denotes the index of sample in the dataset. Here, we assume it is constant 1. We may get A (i) θ -x i 2 2 = x i (x i θ -1) 2 2 = (x i θ -1) 2 x i 2 2 = 2f (θ; x i ) x i 2 2 where f (θ; x i ) = 1 2 (x i θ -1) 2 . Thus, ∆ 2 (∇) = max i 1 N 2f (θ; x i ) x i 2 Since x n 2 ≤ 1 and 1 N N n=1 x n = 0, f (θ) = 1 2N N n=1 [(x n θ) 2 -2x n θ + 1] ≤ 1 2N N n=1 [( x n θ ) 2 + 1] ≤ 1 2 (DM θ 2 + 1) Lemma B.5. Assume assumptions in Theorem 4.4 are satisfied. Given variables defined in Theorem 4.4, the following inequality holds true: T t=1 γ T -t 2(1 -β)η t b t t i=1 β t-i ∇ t -∇ i 2 ≤ η 3 0 βγ 2M (1 -β) 3 (γ -β) 2 T -1 i=1 γ T -i v i+1 2 . Proof. We handle the inner summation. By smoothness, the inequality ∇f (x) -∇f (y) ≤ M x -y holds true. Thus, t i=1 β t-i ∇ t -∇ i 2 ≤ M 2 t i=1 β t-i θ t -θ i 2 = M 2 t-1 k=0 β k θ t -θ t-k 2 = M 2 t-1 k=0 β k t-1 i=t-k η i v i+1 /b i 2 ≤ M 2 t-1 k=0 β k t-1 j=t-k η 2 j /b 2 j t-1 i=t-k v i+1 2 where the last inequality is by Cauchy-Schwartz inequality. Because 1 bt = 1 1-β t ≤ 1 1-β and η t = η0 2M , t i=1 β t-i ∇ t -∇ i 2 ≤ η 2 0 4(1 -β) 2 t-1 k=0 β k k t-1 i=t-k v i+1 2 = η 2 0 4(1 -β) 2 t-1 k=0 β k k t-1 i=1 v i+1 2 I(i ≥ t -k) = η 2 0 4(1 -β) 2 t-1 i=1 v i+1 2 t-1 k=0 β k kI(k ≥ t -i) = η 2 0 4(1 -β) 2 t-1 i=1 v i+1 2 t-1 k=t-i β k k ( ) where I(•) is the indicating function which output 1 if the condition holds true, otherwise 0. Denote the left-hand-side of the conclusion as LHS. We plug Eq. 
( 20) into LHS to get LHS ≤ T t=1 γ T -t 1 b t η 3 0 4M (1 -β) t-1 i=1 v i+1 2 t-1 k=t-i β k k ≤ η 3 0 4M (1 -β) 2 T t=1 γ T -t t-1 i=1 v i+1 2 t-1 k=t-i β k k where we relax the upper bound by 1 bt = 1 1-β t ≤ 1 1-β . Using Lemma B.6 can directly lead to the conclusion: LHS ≤ η 3 0 βγ 2M (1 -β) 3 (γ -β) 2 T -1 i=1 γ T -i v i+1 2 . Lemma B.6. Given variables defined in Theorem 4.4, the following inequality holds true: T t=1 γ T -t t-1 i=1 v i+1 2 t-1 k=t-i kβ k ≤ 2βγ (γ -β) 2 (1 -β) T -1 i=1 γ T -i v i+1 2 . Proof. We first derive the summation: U 1 (t, i) t-1 k=t-i β k k = t-1 k=t-i k j=1 β k = t-1 k=t-i t-1 j=1 β k I(j ≤ k) = t-1 j=1 t-1 k=max(t-i,j) β k = t-1 j=1 β max(t-i,j) -β t 1 -β = 1 1 -β (t -i)β t-i + β t-i+1 -β t 1 -β - β -β t 1 -β = 1 1 -β (t -i)β t-i + β t-i+1 -β 1 -β Now, we substitute U 1 (t, i) into LHS and replace t -i by j, i.e., t = j + i, to get LHS = T t=1 γ T -t t-1 i=1 v i+1 2 1 1 -β (t -i)β t-i + β t-i+1 -β 1 -β = T -1 i=1 v i+1 2 T t=i+1 γ T -t 1 1 -β (t -i)β + β t-i+1 -β 1 -β = T -1 i=1 v i+1 2 T -i j=1 γ T -(j+i) 1 1 -β jβ j + β j+1 -β 1 -β = T -1 i=1 γ T -i v i+1 2 T -i j=1 γ -j 1 1 -β jβ j + β j+1 -β 1 -β ≤ 1 1 -β T -1 i=1 γ T -i v i+1 2 T -i j=1 j β γ j + β 1 -β β γ j Let a = β/γ, we show T -i j=1 ja j = T -i j=1 j o=1 a j = T -i o=1 T -i j=o a j = T -i o=1 ( a o -a T -i+1 1 -a ) = a -a T -i+1 (1 -a) 2 -(T -i) a T -i+1 1 -a ≤ a (1 -a) 2 . Thus, LHS ≤ 1 1 -β T -1 i=1 γ T -i v i+1 2   a (1 -a) 2 + β 1 -β T -i j=1 a j   ≤ 1 1 -β T -1 i=1 γ T -i v i+1 2 a (1 -a) 2 + β 1 -β a 1 -a ≤ a (1 -a) 2 (1 -β) T -1 i=1 γ T -i v i+1 2 Because γ < 1, β < a = β/γ and a -a) 2 + β 1 -β a 1 -a ≤ 2a (1 -a) 2 . Therefore, LHS ≤ 2a (1 -a) 2 (1 -β) T -1 i=1 γ T -i v i+1 2 = 2βγ (γ -β) 2 (1 -β) T -1 i=1 γ T -i v i+1 2 Lemma B.7. Suppose γ ∈ (0, 1) and β ∈ (0, 1). Define T = max t s.t. γ t-1 ≥ 1 -β 1 -β t . If t ≤ T , 1-β 1-β t ≤ γ t-1 for t = 1, . . . , T . If t > T , 1-β 1-β t < γ T -1 . Proof. 
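Define $h(t) = \gamma^{t-1}(1 - \beta^t)$, whose derivative is
$$h'(t) = \gamma^{t-1}(1 - \beta^t) \ln \gamma + \gamma^{t-1}(-\beta^t) \ln \beta = \gamma^{t-1}\left[\ln \gamma - \beta^t(\ln \gamma + \ln \beta)\right] = \gamma^{t-1}\left[1 - \beta^t(1 + \log_\gamma \beta)\right] \ln \gamma.$$
A simple calculation shows $\left[1 - \beta^t(1 + \log_\gamma \beta)\right]_{t=0} = -\log_\gamma \beta < 0$ and $\lim_{t \to +\infty} \left[1 - \beta^t(1 + \log_\gamma \beta)\right] = 1$. When $t = -\log_\beta(1 + \log_\gamma \beta)$, denoted as $t_0$, we have $1 - \beta^t(1 + \log_\gamma \beta) = 0$. Because $1 - \beta^t(1 + \log_\gamma \beta)$ is monotonically increasing in $t$ and $\gamma^{t-1} \ln \gamma$ is negative, $h'(t) \ge 0$ if $t \le t_0$; otherwise, $h'(t) < 0$. Therefore, $h(t)$ is unimodal: it increases up to $t_0$ and decreases afterwards. Because $h(1) = 1 - \beta$ and $h(\hat{T}) = \gamma^{\hat{T}-1}(1 - \beta^{\hat{T}}) \ge 1 - \beta > 0$, we have $h(t) \ge 1 - \beta$ for $t = 1, \ldots, \hat{T}$. Thus, for all $t \in [1, \hat{T}]$, we have $\frac{1-\beta}{1-\beta^t} \le \gamma^{t-1}$. For $t > \hat{T}$, because $\frac{1-\beta}{1-\beta^t}$ monotonically decreases in $t$, we have $\frac{1-\beta}{1-\beta^t} < \frac{1-\beta}{1-\beta^{\hat{T}}} \le \gamma^{\hat{T}-1}$.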
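The threshold $\hat{T}$ of Lemma B.7 has no closed form, but it is easy to find by a bounded scan. The following is a minimal sketch under the lemma's assumptions $\gamma, \beta \in (0, 1)$; the function name and the scan bound are illustrative choices, not part of the paper. The bound uses the fact that $\frac{1-\beta}{1-\beta^t} \ge 1 - \beta$, so the condition can only hold while $\gamma^{t-1} \ge 1 - \beta$.

```python
import math


def momentum_switch_point(gamma, beta):
    """Largest integer t >= 1 with gamma**(t-1) >= (1-beta)/(1-beta**t)."""
    if not (0.0 < gamma < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("gamma and beta must lie in (0, 1)")
    # The condition requires gamma**(t-1) >= 1 - beta, which bounds the scan.
    t_upper = int(math.floor(1.0 + math.log(1.0 - beta) / math.log(gamma)))
    t_hat = 1  # t = 1 always satisfies the condition (both sides equal 1)
    for t in range(1, t_upper + 1):
        if gamma ** (t - 1) >= (1.0 - beta) / (1.0 - beta ** t):
            t_hat = t
    return t_hat


# Hypothetical values: slow contraction (gamma near 1) and strong momentum.
t_hat_slow = momentum_switch_point(gamma=0.99, beta=0.9)
# Fast contraction: the condition already fails at t = 2.
t_hat_fast = momentum_switch_point(gamma=0.5, beta=0.9)
```

A larger $\hat{T}$ means a longer first phase in which the momentum bound matches the non-private one.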

C PROOFS

Under review as a conference paper at ICLR 2021

Proof of Theorem 3.1. Because all sample losses are $G$-Lipschitz continuous, the sensitivity of the averaged gradient is upper bounded by $G/N$. Based on Lemma B.2, the privacy cost of $g_t$ is $\frac{1}{2\sigma_t^2}$.¹ Here, we make the output of each iteration a tuple $(\theta_{t+1}, v_{t+1})$. For the first iteration, because $\theta_1$ is randomly initialized and carries no private information, the mapping
$$\begin{bmatrix} v_2 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} g_1 \\ \theta_1 - \eta_1 g_1 \end{bmatrix}$$
is $\hat{\rho}_1$-zCDP where $\hat{\rho}_1 = \frac{1}{2\sigma_1^2}$. Suppose the output of the $t$-th iteration, $(\theta_t, v_t)$, is $\hat{\rho}_t$-zCDP. At each iteration, we have the mapping $(\theta_t, v_t) \to (\theta_{t+1}, v_{t+1})$ defined as
$$\begin{bmatrix} v_{t+1} \\ \theta_{t+1} \end{bmatrix} = \begin{bmatrix} \phi(v_t, g_t) \\ \theta_t - \eta_t \phi(v_t, g_t) \end{bmatrix}.$$
Thus, the output tuple $(\theta_{t+1}, v_{t+1})$ is $\left(\hat{\rho}_t + \frac{1}{2\sigma_t^2}\right)$-zCDP by Lemma B.1. The recursion implies that $(\theta_{T+1}, v_{T+1})$ has the privacy cost
$$\hat{\rho}_{T+1} = \hat{\rho}_T + \frac{1}{2\sigma_T^2} = \cdots = \sum_{t=1}^{T} \frac{1}{2\sigma_t^2} = \frac{1}{2} \sum_{t=1}^{T} \rho_t \le \frac{1}{2}(R - R_T) \le \frac{1}{2} R.$$
Let $\rho = \hat{\rho}_{T+1}$. Then we get the conclusion.
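The accounting in this proof amounts to summing the per-step costs $\frac{1}{2\sigma_t^2}$. The sketch below illustrates that sum for a uniform schedule $\sigma_t^2 = T/R$; the helper names are hypothetical, and the noise multipliers $\sigma_t$ are assumed to be relative to the gradient sensitivity, as in Lemma B.2.

```python
import math


def zcdp_cost(sigmas):
    """Total zCDP cost of T noisy steps: sum over t of 1 / (2 * sigma_t**2)."""
    return sum(1.0 / (2.0 * s * s) for s in sigmas)


def uniform_schedule(num_steps, budget):
    """Uniform schedule sigma_t**2 = T / R, which spends the budget R evenly."""
    sigma = math.sqrt(num_steps / budget)
    return [sigma] * num_steps


# Hypothetical usage with the budget R = 0.3927 from Appendix D.
R = 0.3927
schedule = uniform_schedule(100, R)
rho_total = zcdp_cost(schedule)  # equals R / 2 up to floating-point error
```

Any other schedule with $\sum_t \sigma_t^{-2} = R$ yields the same total cost, which is what allows uniform and dynamic schedules to be compared at equal privacy.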

C.1 GRADIENT DESCENTS

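Proof of Theorem 4.1. With the definition of smoothness in Definition 3.3 and Eq. (1), we have
$$\begin{aligned} f(\theta_{t+1}) - f(\theta_t) &\le -\eta_t \nabla_t^\top (\nabla_t + G\sigma_t\nu_t/N) + \tfrac{1}{2} M \eta_t^2 \left\| \nabla_t + G\sigma_t\nu_t/N \right\|^2 \\ &= -\eta_t \left(1 - \tfrac{1}{2} M \eta_t\right) \|\nabla_t\|^2 - (1 - M\eta_t)\eta_t \nabla_t^\top G\sigma_t\nu_t/N + \tfrac{1}{2} M \eta_t^2 \left\| G\sigma_t\nu_t/N \right\|^2 \\ &\le -2\mu\eta_t \left(1 - \tfrac{1}{2} M \eta_t\right)(f(\theta_t) - f(\theta^*)) - (1 - M\eta_t)\eta_t \nabla_t^\top G\sigma_t\nu_t/N + \tfrac{1}{2} M \eta_t^2 \left\| G\sigma_t\nu_t/N \right\|^2, \end{aligned}$$
where the last inequality is due to the Polyak-Lojasiewicz condition. Taking expectation on both sides, we obtain
$$\mathbb{E}[f(\theta_{t+1})] - \mathbb{E}[f(\theta_t)] \le -2\mu\eta_t\left(1 - \tfrac{M}{2}\eta_t\right)(\mathbb{E}[f(\theta_t)] - f(\theta^*)) + \tfrac{M}{2}(\eta_t G\sigma_t/N)^2\, \mathbb{E}\|\nu_t\|^2,$$
which, by subtracting $f(\theta^*)$ on both sides and re-arranging, can be reformulated as
$$\mathbb{E}[f(\theta_{t+1})] - f(\theta^*) \le \left(1 - 2\mu\eta_t\left(1 - \tfrac{M}{2}\eta_t\right)\right)(\mathbb{E}[f(\theta_t)] - f(\theta^*)) + \tfrac{M}{2}(\eta_t G\sigma_t/N)^2 D.$$
Recursively using the inequality, we get
$$\mathbb{E}[f(\theta_{T+1})] - f(\theta^*) \le \prod_{t=1}^{T}\left(1 - 2\mu\eta_t\left(1 - \tfrac{M}{2}\eta_t\right)\right)(\mathbb{E}[f(\theta_1)] - f(\theta^*)) + \frac{MD}{2}\sum_{t=1}^{T}\prod_{i=t+1}^{T}\left(1 - 2\mu\eta_i\left(1 - \tfrac{M}{2}\eta_i\right)\right)(\eta_t G\sigma_t/N)^2.$$
Let $\eta_t \equiv 1/M$. Then the above inequality can be simplified as
$$\begin{aligned} \mathbb{E}[f(\theta_{T+1})] - f(\theta^*) &\le \gamma^T(\mathbb{E}[f(\theta_1)] - f(\theta^*)) + R\sum_{t=1}^{T}\gamma^{T-t}\frac{MD}{2R}\left(\frac{\eta_t G}{N}\right)^2\sigma_t^2 \\ &= \gamma^T(\mathbb{E}[f(\theta_1)] - f(\theta^*)) + R\sum_{t=1}^{T}\gamma^{T-t}\alpha\sigma_t^2\,(\mathbb{E}[f(\theta_1)] - f(\theta^*)) \\ &= \left(\gamma^T + R\sum_{t=1}^{T} q_t\sigma_t^2\right)(f(\theta_1) - f(\theta^*)). \end{aligned}$$

Proof of Theorem 4.2. The minimizer of the upper bound of Eq. (3) can be written as
$$T^* = \arg\min_T\; \gamma^T + \alpha\kappa(1 - \gamma^T)T, \tag{21}$$
where we substitute $\sigma^2 = T/R$. To solve the minimization problem, we need to set its derivative to zero, which involves an equation of the form $T\gamma^T = c$ for some real constant $c$. However, the solution is $W_k(c)$ for some integer $k$, where $W$ is the Lambert W function, which does not have a simple analytical form. Instead, because $\gamma^T > 0$, we can minimize a surrogate upper bound as follows:
$$T^* = \arg\min_T\; \gamma^T + \alpha\kappa T = \frac{1}{\ln(1/\gamma)}\ln\left(\frac{\ln(1/\gamma)}{\kappa\alpha}\right), \quad \text{if } \kappa\alpha + \ln\gamma < 0,$$
where we use the surrogate upper bound and utilize $\gamma = 1 - \frac{1}{\kappa}$.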
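However, the minimizer of the surrogate objective is not optimal for the original objective. When $\kappa$ is large, the term $-\alpha\kappa\gamma^T T$ cannot be neglected as we expect. On the other hand, $T$ suffers from explosion if $\kappa \to \infty$ and meanwhile $1/\gamma \to 1^{+}$. This tendency is counterintuitive, since a small $T$ should be taken for sharp losses. To fix the issue, we change the form of $T^*$ to
$$T^* = \frac{1}{\ln(1/\gamma)}\ln\left(1 + \frac{\ln(1/\gamma)}{\alpha}\right), \tag{23}$$
which gradually converges to 0 as $\kappa \to \infty$. Now we substitute Eq. (23) into the original objective function, Eq. (21), to get
$$\mathrm{ERUB}^{\mathrm{uniform}} = \frac{1}{1 + \ln(1/\gamma)/\alpha}\left[1 + \kappa\ln\left(1 + \frac{\ln(1/\gamma)}{\alpha}\right)\right]. \tag{24}$$
Notice that $\ln(1/\gamma) = \ln(\kappa/(\kappa - 1)) = \ln(1 + 1/(\kappa - 1)) \le \frac{1}{\kappa - 1} \le \frac{1}{c\kappa}$ because $\kappa \ge \frac{1}{1-c} > 1$ for some constant $c \in (0, 1)$. In addition, $\ln(1/\gamma) = -\ln(1 - 1/\kappa) \ge 1/\kappa$. Now, we can get the upper bound of Eq. (24) as
$$\mathrm{ERUB}^{\mathrm{uniform}} \le \frac{\kappa}{\kappa + 1/\alpha}\left[1 + \kappa\ln\left(1 + \frac{1}{c\kappa\alpha}\right)\right] \le c_1\frac{\kappa}{\kappa + 1/\alpha}\,\kappa\left(\ln\left(1 + \frac{1}{\kappa\alpha}\right) + \ln\frac{1}{c}\right) \le c_1 c_2\frac{\kappa^2}{\kappa + 1/\alpha}\ln\left(1 + \frac{1}{\kappa\alpha}\right)$$
for some constants $c_1$, $c_2$ and large enough $\frac{1}{\alpha}$. Also, we can get the lower bound
$$\mathrm{ERUB}^{\mathrm{uniform}} \ge \frac{c\kappa}{c\kappa + 1/\alpha}\left[1 + \kappa\ln\left(1 + \frac{1}{\kappa\alpha}\right)\right] \ge c\,\frac{\kappa^2}{\kappa + 1/\alpha}\ln\left(1 + \frac{1}{\kappa\alpha}\right),$$
where we use the condition $c \in (0, 1)$. Thus, $\mathrm{ERUB}^{\mathrm{uniform}} = \Theta\left(\frac{\kappa^2}{\kappa + 1/\alpha}\ln\left(1 + \frac{1}{\kappa\alpha}\right)\right)$.

Proof of Lemma 4.1. By $\sum_{t=1}^{T}\sigma_t^{-2} = R$ and the Cauchy-Schwarz inequality, we can derive the achievable lower bound as
$$R\sum_t q_t\sigma_t^2 = \left(\sum_t \frac{1}{\sigma_t^2}\right)\left(\sum_t q_t\sigma_t^2\right) \ge \left(\sum_{t=1}^{T}\sqrt{q_t}\right)^2,$$
where the inequality becomes an equality if and only if $s/\sigma_t^2 = q_t\sigma_t^2$, i.e., $\sigma_t = (s/q_t)^{1/4}$, for some positive constant $s$. The equality $\sum_{t=1}^{T}\sigma_t^{-2} = R$ immediately suggests $\sqrt{s} = \frac{1}{R}\sum_{t=1}^{T}\sqrt{q_t}$. Thus, we get the $\sigma_t$. Notice that
$$T\sum_{t=1}^{T} q_t - \left(\sum_{t=1}^{T}\sqrt{q_t}\right)^2 = T^2\,\frac{1}{T}\sum_{t=1}^{T}\left(\sqrt{q_t} - \frac{1}{T}\sum_{i=1}^{T}\sqrt{q_i}\right)^2 = T^2\,\mathrm{Var}[\sqrt{q_t}],$$
where the variance is w.r.t. $t$.

Proof of Theorem 4.3. The upper bound of Eq. (3) can be written as
$$\mathrm{ERUB}^{\mathrm{dyn}} = \gamma^T + \sum_{t=1}^{T}\gamma^{T-t}\alpha R\sigma_t^2 = \gamma^T + \alpha\left(\sum_{t=1}^{T}\sqrt{\gamma^{T-t}}\right)^2 = \gamma^T + \alpha\left(\frac{1 - \gamma^{T/2}}{1 - \sqrt{\gamma}}\right)^2,$$
where we make use of Lemma 4.1. Then, the minimizer of the ERUB is
$$T^* = \arg\min_T\; \gamma^T + \alpha\left(\frac{1 - \gamma^{T/2}}{1 - \sqrt{\gamma}}\right)^2 = 2\log_\gamma\left(\frac{\alpha}{\alpha + (1 - \sqrt{\gamma})^2}\right). \tag{26}$$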
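We can substitute Eq. (26) into $\mathrm{ERUB}^{\mathrm{dyn}}$ to get
$$\begin{aligned} \mathrm{ERUB}^{\mathrm{dyn}}_{\min} &= \left(\frac{\alpha}{\alpha + (1 - \sqrt{\gamma})^2}\right)^2 + \alpha\left(\frac{1}{1 - \sqrt{\gamma}}\right)^2\left(1 - \frac{\alpha}{\alpha + (1 - \sqrt{\gamma})^2}\right)^2 \\ &= \left(\frac{\alpha(1 - \sqrt{\gamma})^{-2}}{\alpha(1 - \sqrt{\gamma})^{-2} + 1}\right)^2 + \frac{\alpha(1 - \sqrt{\gamma})^{-2}}{\left(\alpha(1 - \sqrt{\gamma})^{-2} + 1\right)^2} \\ &= \frac{\alpha(1 - \sqrt{\gamma})^{-2}}{\alpha(1 - \sqrt{\gamma})^{-2} + 1}. \end{aligned}$$
Notice that
$$(1 - \sqrt{\gamma})^{-2} = \kappa^2 + \kappa^2 - \kappa + 2\kappa\sqrt{\kappa(\kappa - 1)} = \kappa\left(2\kappa - 1 + 2\sqrt{\kappa(\kappa - 1)}\right)$$
and it is bounded by
$$\kappa\left(2\kappa - 1 + 2\sqrt{\kappa(\kappa - 1)}\right) \le 4\kappa^2, \qquad \kappa\left(2\kappa - 1 + 2\sqrt{\kappa(\kappa - 1)}\right) \ge \kappa\left(2\kappa - (3\kappa - 2) + 2\sqrt{(\kappa - 1)(\kappa - 1)}\right) = \kappa(-\kappa + 2 + 2\kappa - 2) = \kappa^2.$$
Therefore, $\kappa \le (1 - \sqrt{\gamma})^{-1} \le 2\kappa$, with which we can derive
$$\mathrm{ERUB}^{\mathrm{dyn}}_{\min} \le \frac{4\kappa^2\alpha}{\kappa^2\alpha + 1}, \qquad \mathrm{ERUB}^{\mathrm{dyn}}_{\min} \ge \frac{\kappa^2\alpha}{4\kappa^2\alpha + 1} \ge \frac{1}{4}\,\frac{\kappa^2\alpha}{\kappa^2\alpha + 1}.$$
Thus, $\mathrm{ERUB}^{\mathrm{dyn}}_{\min} = \Theta\left(\frac{\kappa^2\alpha}{\kappa^2\alpha + 1}\right)$.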
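To make the comparison between Theorems 4.2 and 4.3 concrete, the sketch below evaluates the bound $\gamma^T + R\sum_t q_t\sigma_t^2$ for a uniform schedule and for the schedule of Lemma 4.1, with $q_t = \alpha\gamma^{T-t}$. It is a numerical illustration only: the function names and the values of $T$, $\gamma$, $\alpha$ and $R$ are assumptions chosen for the example, not settings from the paper.

```python
import numpy as np


def influence(num_steps, gamma, alpha):
    """Per-step noise influence q_t = alpha * gamma**(T - t), t = 1..T."""
    t = np.arange(1, num_steps + 1)
    return alpha * gamma ** (num_steps - t)


def dynamic_sigma2(q, budget):
    """Lemma 4.1 schedule: sigma_t**2 = sum_i sqrt(q_i) / (R * sqrt(q_t))."""
    sqrt_q = np.sqrt(q)
    return sqrt_q.sum() / (budget * sqrt_q)


def erub(q, sigma2, gamma, budget):
    """Relative utility bound gamma**T + R * sum_t q_t * sigma_t**2."""
    return gamma ** len(q) + budget * np.sum(q * sigma2)


# Illustrative (assumed) values.
T, gamma, alpha, R = 50, 0.9, 1e-3, 0.3927
q = influence(T, gamma, alpha)
sigma2_uniform = np.full(T, T / R)
sigma2_dynamic = dynamic_sigma2(q, R)

erub_uniform = erub(q, sigma2_uniform, gamma, R)
erub_dynamic = erub(q, sigma2_dynamic, gamma, R)
```

Both schedules spend the same budget, $\sum_t \sigma_t^{-2} = R$, and the dynamic value matches the closed form $\gamma^T + \alpha\left(\frac{1 - \gamma^{T/2}}{1 - \sqrt{\gamma}}\right)^2$ from the proof of Theorem 4.3.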

C.2 GRADIENT DESCENTS WITH MOMENTUM

Proof of Theorem 4.4. Without loss of generality, we absorb the Cσ t /N into the variance of ν t such that ν t ∼ N (0, Cσ 2 t N I) and g t ← ∇ t + ν t . Define b t = 1 -β t . By smoothness and Eq. ( 1), we have f (θ t+1 ) -f (θ t ) ≤ ∇ t (θ t+1 -θ t ) + 1 2 M θ t+1 -θ t 2 = - η t b 2 t b t ∇ t v t+1 + 1 2 M η 2 t b 2 t v t+1 2 = η t b 2 t b t ∇ t -v t+1 2 -b t ∇ t 2 -v t+1 2 + 1 2 M η 2 t b 2 t v t+1 2 = η t b 2 t b t ∇ t -v t+1 U1(t) -η t ∇ t 2 - η t b 2 t (1 - 1 2 M η t ) v t+1 2 , where only the U 1 (t) is non-negative. Specifically, U 1 (t) describes the difference between current gradient and the average. We can expand v t+1 to get an upper bound: U 1 (t) = b t ∇ t -v t+1 2 = (1 -β) t i=1 β t-i ∇ t -(1 -β) t i=1 β t-i g i 2 = (1 -β) 2 t i=1 β t-i (∇ t -g i ) 2 = (1 -β) 2 t i=1 β t-i (∇ t -∇ i ) + t i=1 β t-i (∇ i -g i ) 2 ≤ 2(1 -β) 2 t i=1 β t-i (∇ t -∇ i ) 2 + t i=1 β t-i (∇ i -g i ) 2 ≤ 2(1 -β)     b t t i=1 β t-i ∇ t -∇ i 2 U2(t) (gradient variance) +(1 -β) t i=1 β t-i ν i 2 noise variance     where we use x + y 2 ≤ ( x + y ) 2 ≤ 2( x 2 + y 2 ). The last inequality can be proved by Cauchy-Schwartz inequality for each coordinate. We plug the U 1 (t) into Eq. ( 27) and use the PL condition to get f (θ t+1 ) -f (θ t ) ≤ η t b 2 t U 1 (t) -η t ∇ t 2 - η t b 2 t (1 - 1 2 M η t ) v t+1 2 ≤ -η t ∇ t 2 + η t b 2 t 2(1 -β) b t U 2 (t) + (1 -β) t i=1 β t-i ν i 2 - η t b 2 t (1 - 1 2 M η t ) v t+1 2 ≤ -2µη t (f (θ t ) -f (θ * )) + 2(1 -β)η t b t U 2 (t) + 2(1 -β) 2 η t b 2 t t i=1 β t-i ν i 2 - η t b 2 t (1 - 1 2 M η t ) v t+1 2 . Rearranging terms and taking expectation to show E[f (θ t+1 )] -f (θ * ) ≤ γ(E[f (θ t )] -f (θ * )) + 2(1 -β) 2 η t b 2 t t i=1 β t-i E ν i 2 + 2(1 -β)η t b t E[U 2 (t)] - η t b 2 t (1 - 1 2 M η t )E v t+1 2 = γ(E[f (θ t )] -f (θ * )) + 2(1 -β) 2 η t b 2 t t i=1 β 2(t-i) C 2 Dσ 2 t N 2 + 2(1 -β)η t b t E[U 2 (t)] - η t b 2 t (1 - 1 2 M η t )E v t+1 2 where γ = 1 -η 0 /κ = 1 -2µη t . 
The recursive inequality implies E[f (θ T +1 )] -f (θ * ) ≤ γ T (f (θ 1 ) -f (θ * )) + T t=1 γ T -t 2(1 -β) 2 η t b 2 t t i=1 β 2(t-i) C 2 Dσ 2 t N 2 + T t=1 γ T -t 2(1 -β)η t b t E[U 2 (t)] - T t=1 γ T -t η t b 2 t (1 - 1 2 M η t )E v t+1 2 = γ T + 2η 0 αR T t=1 γ T -t (1 -β) 2 b 2 t t i=1 β 2(t-i) σ 2 t U3 (f (θ 1 ) -f (θ * )) + T t=1 γ T -t 2(1 -β)η t b t E[U 2 (t)] - T t=1 γ T -t η t b 2 t (1 - 1 2 M η t )E v t+1 2 U4(t) . where we utilize α = DC 2 2M N 2 R 1 f (θ1)-f (θ * ) and η t = η0 2M . By Lemma B.5, we have T t=1 γ T -t 2(1 -β)η t b t U 2 (t) ≤ η 3 0 βγ 2M (1 -β) 3 (γ -β) 2 T -1 i=1 γ T -i v i+1 2 . Thus, by 1 bt ≥ 1, U 4 (t) ≤ η 3 0 βγ 2M (1 -β) 3 (γ -β) 2 T -1 i=1 γ T -i E v i+1 2 - η 0 2M (1 - η 0 4 ) T t=1 γ T -t E v t+1 2 = - η 0 2M ζ T t=1 γ T -t E v t+1 2 where ζ = 1 - 1 4 η 0 - βγ (γ -β) 2 (1 -β) 3 η 2 0 = 1 - 1 4 η 0 - β/γ (1 -β/γ) 2 (1 -β) 3 η 2 0 When a small enough η 0 , e.g., Specifically, η 0 ≤ (γ -β) 2 (1 -β) 3 8βγ 1 + 64βγ (γ -β) 2 (1 -β) 3 -1 = 8 1 + 64βγ(γ -β) -2 (1 -β) -3 + 1 We can have ζ ≥ 0. By the definition of U 3 (T, σ), we can get E[f (θ T +1 )] -f (θ * ) ≤ γ T + 2η 0 αRU 3 (T, σ) (f (θ 1 ) -f (θ * )) - η 0 2M ζ T t=1 γ T -t E v t+1 2 . Proof of Theorem 4.5. Since σ t is static, by definition of U 3 in Theorem 4.4, U 3 = T t=1 γ T -t (1 -β) 2 (1 -β t ) 2 t i=1 β 2(t-i) σ 2 = σ 2 T t=1 γ T -t (1 -β) 2 (1 -β t ) 2 t i=1 β 2(t-i) = σ 2 T t=1 γ T -t (1 -β) 2 (1 -β t ) 2 1 -β 2t 1 -β 2 = σ 2 T t=1 γ T -t 1 -β 1 -β t 1 + β t 1 + β . Because 1-β 1-β t 1+β t 1+β ≤ 1, the U 3 will be smaller than the corresponding summation in GD with uniform schedule. By Lemma B.7, when T > T , we can rewrite U 3 as U 3 ≤ σ 2 T t=1 γ T -t 1 -β 1 -β t = σ 2 T t=1 γ T -t 1 -β 1 -β t + σ 2 T t= T +1 γ T -t 1 -β 1 -β t ≤ σ 2 T t=1 γ T -t γ t-1 + σ 2 T t= T +1 γ T -t γ T -1 = σ 2 γ T -1 T + σ 2 γ T -1 T - T t=1 γ T -T -t = σ 2 γ T -1 T + σ 2 γ T -1 -γ T -1 1 -γ = T γR γ T T + γ T -T -1 1 -γ where we use σ 2 = T /R in the last line. 
Without assuming T > T , we can generally write the upper bound as U 3 ≤ T γR γ T min{ T , T } + max{ γ T -T -1 1 -γ , 0} . By Theorem 4.4, because ζ ≥ 0, we have ERUB ≤ γ T + 2Rη 0 αU 3 = γ T (1 + α γ T min{ T , T } + max{ γ T -T -1 1 -γ , 0} ) where α = 2η 0 α. First, we consider T ≤ T . Use T = 1 ln(1/γ) ln 1 + η0 κα = O κ η0 ln 1 + η0 κα to get ERUB ≤ α α + η 0 /κ 1 + α γ -1 ( 2 ln(1/γ) ln 1 + η 0 κα ) 2 ≤ α α + η 0 /κ 1 + 8κ 2 α η 0 γ ln 2 1 + η 0 κα ≤ O κ κ + η 0 /α 1 + 8κ 2 α η 0 γ ln 2 1 + η 0 κα = O κ κ + η 0 /α 1 + 4κ γ = O κ 2 κ + η 0 /α where we used ln(1/γ) ≥ η 0 /κ and ln(1 + x) ≤ √ x for any x > 0. Second, when T > T , ERUB ≤ γ T (1 + α γ T T + γ T -T -1 1 -γ ) ≤ O γ T + 2α γ T κ(γ T -γ T ) . Make use of T = 1 ln(1/γ) ln 1 + η0 κα to show ERUB ≤ O κ κ + η 0 /α + 4κ 2 α η 0 γ (γ T - κ κ + η 0 /α ) ln 1 + η 0 κα ≤ O κ 2 κ + η 0 /α γ T -1 ln 1 + η 0 κα . Proof of Theorem 4.6. By Lemma B.7, we can rewrite U 3 as U 3 = T t=1 γ T -t (1 -β) 2 (1 -β t ) 2 t i=1 β 2(t-i) σ 2 i ≤ T t=1 γ T -t γ 2(t-1) t i=1 β 2(t-i) σ 2 i + T t= T +1 γ T -t γ 2( T -1) t i=1 β 2(t-i) σ 2 i ≤ γ T -T T t=1 γ T -t γ 2(t-1) t i=1 β 2(t-i) σ 2 i V1 +γ 2( T -1) T t= T +1 γ T -t t i=1 β 2(t-i) σ 2 i V2 We derive V 1 and V 2 separately. For V 1 , we can obtain the upper bound by V 1 = T t=1 γ T -t γ 2(t-1) t i=1 β 2(t-i) σ 2 i = γ T -2 T t=1 γ t t i=1 β 2(t-i) σ 2 i = γ T -2 T i=1 β -2i σ 2 i T t=i γβ 2 t = γ T -2 T i=1 β -2i σ 2 i γβ 2 i -γβ 2 T +1 1 -γβ 2 = γ 2 T -3 T i=1 γ i-T -1 -β 2( T +1-i) 1 -γβ 2 σ 2 i = γ 2 T -3 T i=1 1 -(γβ 2 ) T +1-i 1 -γβ 2 γ i-T -1 σ 2 i ≤ γ T γ 2 (1 -γβ 2 ) T i=1 γ i σ 2 i ≤ γ T γ(γ -β 2 ) T i=1 γ i σ 2 i For V 2 , we can derive V 2 = T t= T +1 γ T -t t i=1 β 2(t-i) σ 2 i = T t=1 γ T -t t i=1 β 2(t-i) σ 2 i - T t=1 γ T -t t i=1 β 2(t-i) σ 2 i = T t=1 γ T -t t i=1 β 2(t-i) σ 2 i -γ T -T T t=1 γ T -t t i=1 β 2(t-i) σ 2 i . 
We first consider the first term T t=1 γ T -t t i=1 β 2(t-i) σ 2 i = T i=1 σ 2 i T t=i γ T -t β 2(t-i) = T i=1 γ T β -2i σ 2 i T t=i γ -t β 2t = T i=1 γ T β -2i σ 2 i (β 2 /γ) i -(β 2 /γ) T +1 1 -(β 2 /γ) = T i=1 γ T +1-i -β 2(T +1-i) γ -β 2 σ 2 i . Similarly, we have γ T -T T t=1 γ T -t t i=1 β 2(t-i) σ 2 i = γ T -T T i=1 γ T +1-i -β 2( T +1-i) γ -β 2 σ 2 i = T i=1 γ T +1-i -γ T -T β 2( T +1-i) γ -β 2 σ 2 i . Thus, V 2 = T i=1 γ T +1-i -β 2(T +1-i) γ -β 2 σ 2 i - T i=1 γ T +1-i -γ T -T β 2( T +1-i) γ -β 2 σ 2 i = T i= T +1 γ T +1-i -β 2(T +1-i) γ -β 2 σ 2 i + T i=1 γ T -T -β 2(T -T ) γ -β 2 β 2( T +1-i) σ 2 i ≤ T i= T +1 γ T +1-i -β 2(T +1-i) γ -β 2 σ 2 i + T i=1 γ T - T γ -β 2 β 2( T +1-i) σ 2 i . Substitute V 1 and V 2 into U 3 to get U 3 ≤ γ T 1 γ(γ -β 2 ) T i=1 γ i σ 2 i + γ 2 T -2 T i= T +1 γ T +1-i -β 2(T +1-i) γ -β 2 σ 2 i + T i=1 γ T + T -2 γ -β 2 β 2( T +1-i) σ 2 i ≤ γ T γ(γ -β 2 ) T i=1 (γ i + γ T -1 β 2( T +1-i) )σ 2 i + γ 2 T -2 T i= T +1 γ T +1-i -β 2(T +1-i) γ -β 2 σ 2 i ≤ 2γ T γ(γ -β 2 ) T i=1 γ i σ 2 i + γ 2 T -2 T i= T +1 γ T +1-i -β 2(T +1-i) γ -β 2 σ 2 i = T t=1 q t σ 2 t where q t = 2 γ(γ -β 2 ) γ T +t I T ≤ T + γ 2( T -1) γ T +1-i -β 2(T +1-i) γ -β 2 γ T -t I T > T ≤ c 1 γ T +t I T ≤ T + γ T -1 c 2 γ T -t I T > T where c 1 = 2 γ(γ-β 2 ) and c 2 = γ 2 T γ-β 2 . When T > T , by Lemma 4.1, the lower bound of R T t=1 q t σ 2 t is T t=1 √ q t 2 = γ T T t=1 c 1 γ t + T t= T +1 γ T -1 c 2 γ -t 2 = γ T √ c 1 γ 1 -γ T /2 1 - √ γ + √ c 2 1 -γ ( T -T -1)/2 √ γ -1 2 = γ T √ c 1 γ 1 -γ T /2 1 - √ γ + √ c 2 γ ( T -T -1)/2 -1 1 - √ γ 2 ≤ O   c 2 γ ( T -1)/2 -γ T /2 1 - √ γ 2   which is achieved when σ 2 t = 1 R T i=1 q i q t . By Theorem 4.4, because ζ ≥ 0, we have ERUB ≤ γ T + 2Rη 0 αU 3 = γ T + 2η 0 α T t=1 Rq t σ 2 t . And the minimum of the upper bound is ERUB min = γ T + α O   γ ( T -1)/2 -γ T /2 1 - √ γ 2   where α = 2η 0 c 2 α. Let T = 2 ln(1/γ) ln 1 + η0 κα . 
Then, ERUB min = O   κα κα + η 0 2 + α (1 - √ γ) 2 γ ( T -1)/2 -(1 -γ ( T -1)/2 )κα κα + η 0 2   ≤ O   κα κα + η 0 2 + 2η 0 c 2 α (1 - √ γ) 2 γ ( T -1)/2 κα + η 0 2   = O κα (κα + η 0 ) 2 κα + 2η 0 c 2 /κ (1 - √ γ) 2 γ ( T -1) = O κα (κα + η 0 ) 2 (κα + c 3 η 0 ) ≤ O κα κα + η 0 where c 3 is some constant. When T ≤ T , U 3 ≤ γ T -T T t=1 γ T -t γ 2(t-1) t i=1 β 2(t-i) σ 2 i V1 ≤ γ T -2 1 -γβ 2 T i=1 γ i σ 2 i with which we obtain ERUB ≤ γ T + 2Rη 0 αU 3 ≤ γ T + 2η 0 α γ -2 1 -γβ 2 T t=1 Rq t σ 2 t . where we let q t = γ T +t . By Lemma 4.1, T i=1 Rq t σ 2 i ≥ T t=1 √ q t 2 = γ T T t=1 γ t/2 2 = γ T +1 1 -γ T /2 1 - √ γ 2 . Thus, ERUB min ≤ γ T + 2η 0 α γ T -1 1 -γβ 2 1 -γ T /2 1 - √ γ 2 = γ T 1 + 2η 0 γc 1 α 1 -γ T /2 1 - √ γ 2 Let T = 2 ln(1/γ) ln 1 + η0 κα . Then, ERUB min ≤ κα κα + η 0 2 1 + 2η 0 γc 1 α (1 - √ γ) 2 ( 1 κα + 1 ) 2 ≤ κα κα + η 0 2 1 + O( 1 κα + 1 ) ≤ O κα κα + η 0 2 . In summary, ERUB min ≤ O κα κα + η 0 I T ≤ T κα κα + η 0 + I T > T C.3 STOCHASTIC GRADIENT DESCENTS Proof of Theorem 4.7. Let ∇t be the stochastic gradient of the step t. By the smoothness, we have  f (θ t+1 ) -f (θ t ) ≤ -η t ∇ t ( ∇t + Gσ t ν t /n) + 1 2 M η 2 t ∇t + Gσ t ν t /n 2 = -η t ∇ t (∇ t + σ g ξ t /n + Gσ t ν t /n) + 1 2 M η 2 t ∇ t + σ g ξ t /n + Gσ t ν t /n 2 . Note that E(σ g ξ t /n + Gσ t ν t /n) = 0 and E(σ g ξ t /n + Gσ t ν t /n) 2 = σ 2 g + (Gσ t /n) 2 . f (θ t+1 ) -f (θ t ) ≤ -η t ∇ t (∇ t + σt ζ t /n) + 1 2 M η 2 t ∇ t + σt ζ t /n 2 = -η t (1 - 1 2 M η t ) ∇ t 2 -(1 -M η t )η t ∇ t σt ζ t /n + 1 2 M η 2 t σt ζ t /n 2 ≤ -2µη t (1 - 1 2 M η t )(f (θ t ) -f (θ * )) -(1 -M η t )η t ∇ t σt ζ t /n + 1 2 M η 2 t σt ζ t /n 2 . 
Then following the same proof of Theorem 4.1, we can get E[f (θ T +1 )] -f (θ * ) ≤ γ T (E[f (θ 1 )] -f (θ * )) + R T t=1 γ T -t α 1 G 2 σ2 t (E[f (θ 1 )] -f (θ * )) = γ T + R T t=1 γ T -t α( 1 G 2 σ 2 g + σ 2 t ) (E[f (θ 1 )] -f (θ * )) = γ T + R α 1 G 2 σ 2 g 1 -γ T 1 -γ + R T t=1 γ T -t ασ 2 t (E[f (θ 1 )] -f (θ * )) ≤ γ T + R κα G 2 σ 2 g + R T t=1 γ T -t ασ 2 t (E[f (θ 1 )] -f (θ * )). where  R κα G 2 = D 2µ(f (θ1)-f (θ * )) 1 n 2 = D 2µ(f (θ1)-f (θ * )) min{ 1 N 2 R , 1} ≤ D 2µ(f (θ1)-f (θ * )) 1 N 2 R . Proof of Ũ3 = 1 G 2 T t=1 γ T -t (1 -β) 2 (1 -β t ) 2 t i=1 β 2(t-i) σ2 i = T t=1 γ T -t (1 -β) 2 (1 -β t ) 2 t i=1 β 2(t-i) ( 1 G 2 σ 2 g + σ 2 t ) = U g 3 + U 3 where we define U g 3 1 G 2 σ 2 g T t=1 γ T -t (1 -β) 2 (1 -β t ) 2 t i=1 β 2(t-i) . We can upper bound U g 3 by U g 3 = 1 G 2 σ 2 g T t=1 γ T -t (1 -β) 2 (1 -β t ) 2 1 -β 2t 1 -β 2 = 1 G 2 σ 2 g T t=1 γ T -t 1 -β 1 -β t 1 + β t 1 + β ≤ 1 G 2 σ 2 g T t=1 γ T -t ≤ 1 G 2 σ 2 g 1 1 -γ = 1 G 2 κσ 2 g . Combine with the factors of U 3 in the PGD bounds: αR U g 3 ≤ αR G 2 κσ 2 g = αR G 2 κσ 2 g = Dσ 2 g 2µn 2 (f (θ 1 ) -f (θ * )) ≤ Dσ 2 g 2µN 2 R(f (θ 1 ) -f (θ * )) .

D EXPERIMENTS

Dataset. (1) Synthetic data. We generate a 100-dimensional dataset of linearly separable data points using the scikit-learn package. The data points are clustered around two points of a hypercube separated by a Euclidean distance of 10. In total, 1000 samples are generated for training with the logistic loss. (2) Real data. We create a subset of the MNIST dataset (Lecun et al., 1998) including 1000 handwritten images of the digits 3 and 5 (MNIST35). Compared to the original dataset (70,000 samples), the small set is more vulnerable to attack, and private learning requires larger noise (see the 1/N factor in Eq. (1)). Following the preprocessing in (Abadi et al., 2016), we project the vectorized images into a 60-dimensional subspace extracted by PCA.

Setup. The samples are first normalized so that $\sum_{n=1}^{N} x_n = 0$ and the standard deviation is 1. Then the sample norms are scaled such that $\max_n \|x_n\| = 10$ (i.e., the data scale). We fix the learning rate to 0.1 based on the corresponding non-private training experiments (same setting without noise). The total privacy budget is 0.1963-zCDP (equal to $(4, 10^{-8})$-DP), which implies $R = 0.3927$. To control the sensitivity of the gradients, we clip gradients with a clipping norm fixed at 4. Formally, we scale a sample gradient down to length 4 if its norm is larger than 4. Because the schedule highly depends on the iteration number $T$, we grid-search the best $T$ for the compared methods. We ignore the privacy cost of such tuning in our experiments, a protocol also used in previous work (Abadi et al., 2016; Wu et al., 2017). All experiments are repeated 100 times, and the metrics are averaged afterwards.

(middle) The influence values are estimated using the Hessian eigenvalues (squared loss) and by retraining (logistic loss). The larger the data scale, the larger the influence variance.
(right) The relative final training losses versus the data scale, where the uniform schedule (uni), the dynamic schedule (dyn) and the exponentially decaying schedule (exp) are compared. The relative loss is computed w.r.t. the loss of the uniform schedule. For example, if the dynamic loss is $e_1$ and the uniform loss is $e_0$, then the relative loss is $(e_1 - e_0)/e_0$. The dashed lines use the right axis and present the influence-related term.

Estimate of influences by retraining. In the left pane of Fig. 3, we estimate the influence of $\sigma_t$ by retraining multiple times. While keeping $\sigma_i$ for $i \neq t$ fixed (at the fine-tuned uniform schedule $\sigma$), the value of $\sigma_t$ is varied from 20 to 200. Then we fit a quadratic function of $\sigma_t$, whose coefficient is treated as the estimate of $q_t$. Repeating the estimation for all $t$ from 1 to 100 gives the trend of influence shown in the middle pane.

Results. In the left pane of Fig. 3, the squared final loss (i.e., $f(\theta_{T+1})$) is approximately a quadratic function of $\sigma_t$ with small relative variance (i.e., the variance divided by the mean value: 0.14, 0.032, 0.016 for t = 10, 60, 90, respectively). When $\sigma_t$ increases, the frequency of clipping increases as well, which leads to more variance in the final losses. We use the least squares method to fit the relation, shown as the solid lines. The final logistic loss is more sensitive to the noise because of the additional uncertainty from the changing Hessian. We still fit a quadratic function of $\sigma_t$. It turns out that the relative variances of the quadratic coefficient (approximately the influence) are small: 0.025, 0.024, 0.027 for t = 10, 60, 90, respectively. The middle pane shows the relationship between the estimated $q_t$ and the learning steps. The $q_t$ of the squared loss is computed analytically using the Hessian eigenvalues. The $q_t$ of the logistic loss is computed by retraining based on the uniform schedule.
Both loss functions show an increasing influence as learning continues, which indicates that the dynamic schedule should decrease accordingly. The squared loss has a rather steep trend while the logistic loss has a relatively flat one. The reason is that the logistic loss has a larger variance in the gradient norm and therefore clipping happens more frequently (more than approximately 80% of gradients are clipped). As a result, the variance of influence is relatively small for the logistic loss. Moreover, we vary the data scale (scaling all samples uniformly such that all sample norms are less than a specific value), which changes the variance of influence, as seen in the figures. The last pane compares the uniform, dynamic and exponential decay (Yu et al., 2019) schedules using final losses relative to the uniform schedule. We set the exponential decay schedule to approximate the dynamic schedule by fitting $\sigma_t = \sigma_0 \exp(-kt)$ using the least squares method. The final losses are picked by grid searching the best $T \in [1, 100]$. We see that the advantage of the dynamic schedule over the uniform one increases when the data scale is less than 15. But we also notice that some loss gaps decrease, suggesting that the data scale is not the inherent reason for the dynamic advantage. According to Lemma 4.1, the advantage should be proportional to $T^2 \mathrm{Var}(q_t)$, which is shown to be decreasing when the data scale is larger than 15. When the data scale continues to increase, gradient clipping changes the curvature of the loss function. Therefore, an increasing $\mathrm{Var}(T q_t)$ is witnessed. Meanwhile, the loss gap decreases. In the last row of Fig. 3, the DNN is evaluated in the same setting. Though the influence increases with $t$, its variance is small and less dependent on the data scale in comparison to shallow models. Though the dynamic schedule estimated by retraining influences does not perform stably, the exp method still depends decreasingly on the variance of influence, as expected.
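The exponential fit and the relative loss used in the comparison can be sketched as follows. The text only states that $\sigma_t = \sigma_0 \exp(-kt)$ is fitted by least squares; fitting in the log domain (linear regression of $\ln \sigma_t$ on $t$) is an assumption made here for simplicity, and the names `fit_exp_schedule` and `relative_loss` are hypothetical.

```python
import math

def fit_exp_schedule(sigmas):
    """Fit sigma_t ~ sigma0 * exp(-k * t), t = 1..T, by linear least squares
    on log(sigma_t). Returns (sigma0, k). Log-domain fitting is an assumption."""
    T = len(sigmas)
    ts = list(range(1, T + 1))
    ys = [math.log(s) for s in sigmas]
    mean_t = sum(ts) / T
    mean_y = sum(ys) / T
    stt = sum((t - mean_t) ** 2 for t in ts)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    k = -sty / stt                          # log(sigma_t) = log(sigma0) - k * t
    sigma0 = math.exp(mean_y + k * mean_t)  # intercept of the regression line
    return sigma0, k

def relative_loss(loss, uniform_loss):
    # Relative loss w.r.t. the uniform schedule: (e1 - e0) / e0.
    return (loss - uniform_loss) / uniform_loss

# Placeholder dynamic schedule that is exactly exponential, for illustration only.
dynamic = [50.0 * math.exp(-0.03 * t) for t in range(1, 101)]
sigma0, k = fit_exp_schedule(dynamic)
gap = relative_loss(0.9, 1.0)  # negative value: lower loss than the uniform schedule
```

A negative relative loss means the compared schedule ends with a lower final loss than the uniform one.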



For brevity, when we refer to the privacy cost of some value, e.g., a gradient, we actually mean the cost of the mechanism that outputs the value.



Figure 1: Private gradient descent repeated 100 times on two differently-curved loss functions. Solid lines are optimization trajectories and dashed horizontal lines are the averaged final losses.

Figure 2: Comparison of the dynamic schedule and the uniform schedule on different data scales. The left pane is the influence by iteration estimated by retraining. The other two show the relative loss when varying the data scale (left axis with solid lines) and the variance of influence (right axis with dashed lines). The middle is on the MNIST35 dataset consisting of 1000 digit 3 and 5 images using quadratic regression. The right is the final loss on the subsampled MNIST dataset of 1000 training samples and 50,000 test samples when using DNN and momentum methods.

1 (Composition & Post-processing). Let two mechanisms be $M: \mathcal{D}^n \to \mathcal{Y}$ and $M': \mathcal{D}^n \times \mathcal{Y} \to \mathcal{Z}$. Suppose $M$ satisfies $(\rho_1, a)$-zCDP and $M'(\cdot, y)$ satisfies $(\rho_2, a)$-zCDP for all $y \in \mathcal{Y}$. Then, the mechanism $M'': \mathcal{D}^n \to \mathcal{Z}$ (defined by $M''(x) = M'(x, M(x))$) satisfies $(\rho_1 + \rho_2)$-zCDP.
Definition B.1 (Sensitivity). The sensitivity of a gradient query $\nabla_t$ to the dataset {x

Figure 3: Experiments with logistic loss on synthetic data and with squared loss, logistic loss and DNN on MNIST35, by rows. (left) The final loss is fitted by a quadratic function of the form $c_2 \sigma_t^2 + c_0$. (middle) The influence values are estimated using the Hessian eigenvalues (squared loss) and by retraining (logistic loss). The larger the data scale, the larger the influence variance. (right) The relative final training losses versus the data scale, where the uniform schedule (uni), dynamic schedule (dyn) and exponential-decaying schedule (exp) are compared. The relative loss is computed w.r.t. the losses of the uniform schedule. For example, if the dynamic loss is $e_1$ and the uniform loss is $e_0$, then the relative loss is $(e_1 - e_0)/e_0$. The dashed lines use the right axis and present the influence-related term.

Comparison of empirical excess risk bounds. The algorithms are $T$-iteration $\frac{1}{2}R$-zCDP or, equivalently, $(\epsilon, \delta)$-DP under the PL condition (unless marked with * for convexity). The $O$ notation in this table drops other $\ln$ terms. All algorithms in the second part terminate at step $T = O(\ln \frac{N^2 R}{D})$. Assume loss functions are 1-smooth and 1-Lipschitz continuous, and all parameters satisfy their numeric assumptions. Key notations: $O_p$: the bound occurs with probability $p$; $D$: feature dimension; $N$: sample size; $R$: privacy budget; $c_i$: constant.

Comparison of true excess risk bounds. The algorithms are $T$-iteration $\frac{1}{2}R$-zCDP or, equivalently, $(\epsilon, \delta)$-DP under the $\mu$-strongly-convex condition. The $O$ notation in this table drops other $\ln$ terms. Assume loss functions are 1-smooth and 1-Lipschitz continuous, and all parameters satisfy their numeric assumptions. * marks the method with the convexity assumption.


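Theorem 4.8. Without loss of generality, we can write $\sigma_g \xi_t + G \sigma_t \nu_t$ as $\tilde{\sigma}_t \zeta_t$, where $\tilde{\sigma}_t = \sqrt{\sigma_g^2 + (G \sigma_t)^2}$ and $\zeta_t$ is a random vector with $\mathbb{E}\zeta_t = 0$ and $\mathbb{E}\|\zeta_t\|^2 \le D$. Therefore, we replace $\nu_t$ by $\zeta_t$ and $\sigma_t^2$ by $\tilde{\sigma}_t^2 / G^2 = \sigma_g^2 / G^2 + \sigma_t^2$. Now, we only need to update $U_3(\sigma, T)$ as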

