IMPROVED CONVERGENCE OF DIFFERENTIAL PRIVATE SGD WITH GRADIENT CLIPPING

Abstract

Differential private stochastic gradient descent (DP-SGD) with gradient clipping (DP-SGD-GC) is an effective optimization algorithm that can train machine learning models with a privacy guarantee. Despite the popularity of DP-SGD-GC, its convergence in the unbounded domain without the Lipschitz continuity assumption is less understood; existing analyses of DP-SGD-GC either impose additional assumptions or end up with a utility bound that involves a non-vanishing bias term. In this work, for smooth and unconstrained problems, we improve the current analysis and show that DP-SGD-GC can achieve a vanishing utility bound without any bias term. Furthermore, when the noise generated from subsampled gradients is light-tailed, we prove that DP-SGD-GC can achieve nearly the same utility bound as DP-SGD applied to Lipschitz continuous objectives. As a by-product, we propose a new clipping technique, called value clipping, to mitigate the computational overhead caused by the classic gradient clipping. Experiments on standard benchmark datasets are conducted to support our analysis.

1. INTRODUCTION

Training machine learning models that achieve decent prediction accuracy while preserving data privacy is fundamental in many modern machine learning applications. The concept of differential privacy (DP), introduced by Dwork (2006) and Dwork & Roth (2014), offers an elegant mathematical framework to characterize the privacy-preserving ability of randomized algorithms. It has been widely applied to tasks including clustering, regression, principal component analysis, empirical-risk minimization, matrix completion, graph distance estimation, optimization and deep learning (Chaudhuri & Monteleoni, 2008; Chaudhuri et al., 2011; Agarwal et al., 2018; Ge et al., 2018; Jain et al., 2018; Fan & Li, 2022; Fan et al., 2022). For the empirical-risk minimization (ERM) problem, among many proposed methods, differential private stochastic gradient descent (DP-SGD) is an effective algorithm that can solve the ERM problem with a privacy guarantee and achieve a reasonable utility bound. DP-SGD has received substantial interest in recent years due to its simplicity and effectiveness (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016; Wang et al., 2017; Bassily et al., 2019; Feldman et al., 2020; Asi et al., 2021). In the classic analysis of DP-SGD, the variance of the Gaussian noise used in each iteration relies crucially on the $\ell_2$-sensitivity of the loss function. Therefore, most early works on DP-SGD assume each individual loss function to be Lipschitz continuous in its domain (Song et al., 2013; Bassily et al., 2014). However, many real-world problems are only smooth but not globally Lipschitz continuous; the unconstrained linear regression problem is one example. There are two techniques to circumvent the Lipschitz continuity assumption: (i) imposing an additional bounded domain constraint on the original problem; (ii) clipping gradients in their 2-norm and using the clipped gradients to update the model (Abadi et al., 2016).
In practice, the gradient clipping technique is usually preferred over imposing a bounded domain constraint because the latter requires prior knowledge of the distance between initialization and solution, which is typically unavailable for unconstrained problems. In summary, the state-of-the-art implementations of DP-SGD all advocate the gradient clipping technique.

Table 1: The utility bound and assumptions needed by different algorithms for convex problems, where d is the problem size, n is the number of data points and ϵ measures the privacy-preserving ability; see Section 3 for more details. "†" is based on a trivial extension of Bassily et al. (2014).

• This work is theoretical in essence but also includes a practical contribution (Section 5). We develop a novel value clipping technique for problems that satisfy the weak growth condition (Definition 3.1). The proposed value clipping technique can be implemented within one forward-backward propagation on existing learning platforms and can alleviate the computational overhead caused by gradient clipping. The efficiency of value clipping is demonstrated on real datasets.

2. RELATED WORK

DP-SGD with gradient clipping was initially proposed by Abadi et al. (2016). Gradient clipping and its variants have been widely adopted by many privacy-aware training algorithms (Andrew et al., 2021). Despite the popularity of gradient clipping, establishing the convergence rate of DP-SGD-GC without the Lipschitz continuity and bounded domain assumptions remains a challenging task; see (Wang et al., 2022, Remark 5) for a short discussion on the hardness of removing the bounded domain assumption. This research question was not carefully studied until the recent works of Chen et al. (2020) and Song et al. (2021), who provided counter-examples showing that DP-SGD-GC can suffer from a constant utility in the worst case. Chen et al. (2020) studied the convergence of DP-SGD-GC to a stationary point in the nonconvex setting and showed that an additional assumption on the gradient distribution is sufficient to derive a meaningful utility bound. Song et al. (2021) showed that DP-SGD-GC converges to a perturbed objective function for the generalized linear model and can suffer from a constant utility for the original objective in the worst case. Some other recent works study the convergence of DP-SGD-GC for smooth objectives (Du et al., 2021; Wu et al., 2021; Yang et al., 2022); the rates in these works usually involve a bias term due to clipping.

A concurrent work from Bu et al. (2022) suggests that a small clipping threshold can yield promising performance for DP-SGD-GC in certain scenarios, such as training language models. Their empirical discovery contrasts with the theoretical analysis in this work, as our proof technique relies on a large clipping threshold. The experiments of Bu et al. (2022) indicate that the analysis in this work may be further improvable; a rigorous theoretical justification for the phenomenon they describe is worth future investigation. Another concurrent work from Yang et al. (2022) studied the convergence of DP-SGD-GC under the generalized smooth condition; their analysis relies on a different set of assumptions and does not overlap with this work.

On the practical side, the original implementation of gradient clipping was inefficient, as one needs to calculate the gradient norm of each individual sample in every iteration. Many works (Goodfellow, 2015; Abadi et al., 2016; Rochette et al., 2019; Bu et al., 2021; Subramani et al., 2021) have been carried out to improve the efficiency of DP-SGD-GC from either an engineering or an algorithmic perspective. Our proposed value clipping technique can be viewed as an alternative to the classic gradient clipping.

3. PRELIMINARIES

Notation. Throughout this paper, for any positive integer $n$, we denote $[n] := \{1, 2, \ldots, n\}$. We denote by $\|\cdot\|$ the vector 2-norm or the matrix operator norm unless otherwise specified. We use the notation $O(\cdot)$ to hide poly-logarithmic terms.

We consider the empirical-risk minimization (ERM) problem

$$\min_{w \in \mathbb{R}^d} \; f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w), \tag{P}$$

where $n$ is the number of training samples, the $f_i$'s are differentiable functions and $w$ is the model we wish to train. Throughout the paper, we assume that $f$ is bounded below and its minimum is attainable. We let $W^*$ be the set of solutions of problem (P), and denote the optimal function value by $f^*$. We assume that a lower bound of $f^*$ is known a priori. Note that this assumption holds in many realistic settings; for example, ERM objectives are usually lower bounded by zero.

Next we introduce the weak growth condition (WGC), which is the cornerstone of our analysis.

Definition 3.1. A function $h : \mathbb{R}^d \to \mathbb{R}$ is $(\beta_1, \beta_2)$-WGC for some $\beta_1 > 0$, $\beta_2 \ge 0$ if

$$\|\nabla h(w)\|^2 \le \beta_1 \Big( h(w) - \inf_{u \in \mathbb{R}^d} h(u) \Big) + \beta_2 \quad \forall w \in \mathbb{R}^d.$$

The weak growth condition bounds the squared norm of the gradient by a linear function of the objective value. WGC has gained increasing interest in recent years, as multiple works have demonstrated that WGC and its variants can improve the classic analysis of SGD-type algorithms (Schmidt & Roux, 2013; Needell et al., 2014; Vaswani et al., 2019; Qian et al., 2019; Stich, 2019; Khaled & Richtárik, 2020; Fang et al., 2021; Gower et al., 2021). It is easy to show that smooth functions that are bounded below necessarily satisfy WGC; the following lemma makes this precise.

Lemma 3.2. If a function $h : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth for some $L > 0$ and bounded below, i.e., $\inf_{w \in \mathbb{R}^d} h(w) > -\infty$, then $h$ is $(2L, 0)$-WGC.

We stress that WGC is not a strong assumption.
In fact, a wide range of nonconvex and nonsmooth problems arising from the ERM problem also satisfy WGC, e.g., the Lipschitz continuous model with smooth and convex loss function; see Fang et al. (2021, § 4.1) and Section D for more details.

We recall the standard definition of differential privacy (DP).

Definition 3.3 (Dwork, 2006). A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for all neighboring datasets $D, D'$ and for all events $S$ in the output space of $\mathcal{A}$, we have

$$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$

If $\delta = 0$, then $\mathcal{A}$ is said to be $\epsilon$-differentially private.

The detailed algorithm of DP-SGD-GC is shown in Algorithm 1. It has been shown that DP-SGD-GC is $(\epsilon, \delta)$-DP as long as the noise level $\sigma$ is larger than a certain threshold (Theorem 3.4).

Theorem 3.4 (Abadi et al., 2016, Theorem 1). Let $q = B/n$, where $B$ is the batch size and $n$ is the number of data points. There exist constants $c_1$ and $c_2$ such that for any $\epsilon < c_1 q^2 T$, Algorithm 1 is $(\epsilon, \delta)$-DP for any $\delta > 0$ if

$$\sigma \ge c_2 \frac{q \sqrt{T \log(1/\delta)}}{\epsilon}.$$

Algorithm 1 Differential-private SGD with gradient clipping (DP-SGD-GC)
1: Input: number of iterations $T \in \mathbb{N}$, clipping threshold $C > 0$, noise level $\sigma > 0$, batch size $B \in [1, n]$, learning rate $\eta > 0$, initial iterate $w^{(0)}$.
2: for $t \leftarrow 0, \ldots, T-1$ do
3:   Sample a mini-batch $\mathcal{B}_t$, where each data point is sampled with probability $B/n$;
4:   $g_i^{(t)} = \nabla f_i(w^{(t)})$ $\forall i \in \mathcal{B}_t$;
5:   $\bar{g}_i^{(t)} = g_i^{(t)} / \max\{1, \|g_i^{(t)}\|/C\}$ $\forall i \in \mathcal{B}_t$;
6:   $w^{(t+1)} = w^{(t)} - \eta \frac{1}{B} \big( \sum_{i \in \mathcal{B}_t} \bar{g}_i^{(t)} + \xi^{(t)} \big)$, where $\xi^{(t)} \sim N(0, C^2 \sigma^2 I_{d \times d})$;
7: end for
8: Return: $w^{(i)}$, where $i$ is sampled uniformly at random from $\{0, 1, \ldots, T\}$.
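The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the callback `per_sample_grad`, the helper `least_squares_grads`, and all function names are hypothetical, and the sketch assumes the Gaussian noise is added to the summed clipped gradients before dividing by $B$. It does not calibrate $\sigma$ to a privacy budget; that choice is governed by Theorem 3.4.

```python
import numpy as np


def clip_per_sample(grads, C):
    """Rescale each row g_i to g_i / max(1, ||g_i|| / C), so every row has norm <= C."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    return grads / np.maximum(1.0, norms / C)


def dp_sgd_gc(per_sample_grad, w0, n, T, C, sigma, B, eta, rng):
    """Sketch of DP-SGD-GC.

    per_sample_grad(w, idx) is a hypothetical callback returning an array of
    shape (len(idx), d) whose rows are the gradients of f_i at w for i in idx.
    """
    w = np.asarray(w0, dtype=float).copy()
    iterates = [w.copy()]
    for _ in range(T):
        # Each data point joins the mini-batch independently with probability B/n.
        idx = np.flatnonzero(rng.random(n) < B / n)
        clipped = clip_per_sample(per_sample_grad(w, idx), C)
        # Gaussian noise with covariance C^2 sigma^2 I (standard deviation C * sigma).
        noise = rng.normal(0.0, C * sigma, size=w.shape)
        # Assumption: noise is added to the summed clipped gradients, then divided by B.
        w = w - eta * (clipped.sum(axis=0) + noise) / B
        iterates.append(w.copy())
    # Return an iterate chosen uniformly at random from {0, ..., T}, plus the full path.
    return iterates[rng.integers(0, T + 1)], iterates


def least_squares_grads(X, y):
    """Per-sample gradients of the toy losses f_i(w) = 0.5 * (x_i^T w - y_i)^2."""
    def per_sample_grad(w, idx):
        residual = X[idx] @ w - y[idx]
        return residual[:, None] * X[idx]
    return per_sample_grad
```

With `sigma=0`, `B=n` and a threshold `C` larger than every per-sample gradient norm, the sketch reduces to plain gradient descent, which is a convenient sanity check before turning the noise on.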

4. MAIN THEORETICAL RESULTS

We present our main theoretical contributions in this section. Denote by $w_{\text{priv}}$ the output of Algorithm 1. We are interested in upper bounds on the excess empirical risk $f(w_{\text{priv}}) - f^*$ and the squared gradient norm $\|\nabla f(w_{\text{priv}})\|^2$ without assuming the $f_i$'s to be Lipschitz continuous on $\mathbb{R}^d$. Part of our analysis relies on assuming that the noise generated from subsampled gradients is "light-tailed" (sub-Gaussian). Formally, we introduce the following assumption.

Assumption 4.1. There exists $\rho > 0$ such that

$$\mathbb{E}_i\big[\exp\big(\|\nabla f_i(w) - \nabla f(w)\|^2 / \rho^2\big)\big] \le e \quad \forall w \in \mathbb{R}^d.$$

We note that the above light-tail assumption is widely used in the analysis of high-probability utility bounds of SGD (Nemirovskii et al., 2009; Juditsky & Nesterov, 2014; Ghadimi & Lan, 2013; Harvey et al., 2019; Feldman et al., 2020). Assumption 4.1 does not imply that the $f_i$'s are globally Lipschitz continuous on $\mathbb{R}^d$, and therefore does not trivialize our analysis. Note that there is a recent trend of analyzing SGD and its variants with heavy-tail gradient noise (Gürbüzbalaban et al., 2021). Our analysis does not apply to the heavy-tail setting because our Proposition 4.2 relies crucially on the light-tail-noise assumption.

The idea of our proof is concise. We first summarize the proof sketch as follows, and then explain the technical details step by step.

Proof sketch:
• Assuming that the initial objective gap $f(w^{(0)}) - f^*$ is bounded, we are able to prove that, with high probability, the iterates generated by SGD have bounded objective values that depend only logarithmically on $T$.
• The weak growth condition allows us to convert the objective value upper bound into a gradient norm upper bound. Thus, with high probability, gradient clipping never happens during the run of DP-SGD if the clipping threshold $C$ is chosen appropriately.
• Finally, we can apply the classic convergence analysis of SGD (without gradient clipping) and obtain a non-trivial upper bound on the excess empirical risk or the gradient norm.

To begin with, we develop a uniform upper bound on the objective values $f(w^{(t)})$, where the $w^{(t)}$'s are the iterates generated by the vanilla SGD algorithm (Algorithm 2) with sub-Gaussian noise.

Proposition 4.2 (Uniform upper bound on objective values of SGD with sub-Gaussian noise). Assume $f$ is $L$-smooth for some $L > 0$ and there exists $\sigma > 0$ such that $\mathbb{E}[\exp(\|\zeta^{(t)}\|^2/\sigma^2)] \le e$ for any $t \in \mathbb{N}$. Denote by $\{w^{(t)}\}_{t=0}^{T}$ the iterates generated by Algorithm 2 with $\eta \le \min\big\{\frac{1}{2L}, \frac{1}{\sigma\sqrt{T}}\big\}$. Then for any $\delta \in (0, 1)$,

$$\max_{t \in \{0, 1, \ldots, T\}} f(w^{(t)}) - f^* \le 2\big(f(w^{(0)}) - f^*\big) + O(\log(T/\delta))$$

with probability at least $1 - \delta$.

Algorithm 2 SGD
1: Input: total iterations $T \in \mathbb{N}$, learning rate $\eta > 0$, initial iterate $w^{(0)}$.
2: for $t \leftarrow 0, \ldots, T-1$ do
3:   $w^{(t+1)} = w^{(t)} - \eta\big(\nabla f(w^{(t)}) + \zeta^{(t)}\big)$;
4: end for
5: Return: $w^{(i)}$, where $i$ is sampled uniformly at random from $\{0, 1, \ldots, T\}$.

The conclusion stated in Proposition 4.2 may seem obvious at first glance, as the training algorithm is expected to produce almost decreasing objective values during the optimization process. However, we note that deriving a nearly constant upper bound on $f(w^{(t)}) - f^*$ that holds uniformly over $t \in \{0, 1, \ldots, T\}$ without assuming Lipschitz continuity or a bounded domain is nontrivial. Our proof relies on a recently proposed technical tool called the generalized Freedman inequality (Harvey et al., 2019); see Section B.1 for the detailed proof of Proposition 4.2. We also remark that Proposition 4.2 holds for the standard SGD algorithm without considering differential privacy, and thus may be of independent interest.

Based on Proposition 4.2, we can further obtain an upper bound on each individual loss $f_i(w^{(t)}) - f_i^*$ under Assumption 4.1, and therefore also upper bound $\|\nabla f_i(w^{(t)})\|$ via the weak growth condition.

Proposition 4.3.
Assume the $f_i$'s are $L$-smooth for some $L > 0$ and there exists $\sigma > 0$ such that $\mathbb{E}[\exp(\|\zeta^{(t)}\|^2/\sigma^2)] \le e$ for any $t \in \mathbb{N}$. Denote by $\{w^{(t)}\}_{t=0}^{T}$ the iterates generated by Algorithm 2 with $\eta \le \min\big\{\frac{1}{2L}, \frac{1}{\sigma\sqrt{T}}\big\}$. Then for any $\delta \in (0, 1)$,

$$\max_{i \in [n],\, t \in [T]} \|\nabla f_i(w^{(t)})\| \le 2\beta_1\big(f(w^{(0)}) - f^*\big) + O\big(\log(1/\delta) + \log T + \log n\big) \tag{1}$$

holds with probability at least $1 - \delta$.

When Assumption 4.1 holds, Proposition 4.3 suggests that the upper bound on the maximum gradient norm depends logarithmically on $\delta$, $T$ and $n$ (eq. (1)). Now we are ready to present our excess empirical risk bound and gradient norm bound for DP-SGD-GC in terms of conditional expectation.

Proposition 4.4 (Convergence on conditional expectation). Assume that Assumption 4.1 holds. Denote by $w_{\text{priv}}$ the output of Algorithm 1 and define $\mathcal{E} := \{\|\nabla f_i(w^{(t)})\| \le C \;\; \forall i \in [n],\, t \in \{0, 1, \ldots, T\}\}$ as the event that no clipping happens during the run of Algorithm 1. Let $D_f := f(w^{(0)}) - f^*$. Given any $\epsilon > 0$ and $\delta, \delta' \in (0.5, 1)$:

• Assume the $f_i$'s are convex and $L$-smooth for some $L > 0$. Set

$$T > \epsilon/c_1, \quad \sigma = c_2 \frac{B\sqrt{T \log(1/\delta)}}{n\epsilon}, \quad C = c_3 + c_4 \log(nT/\delta'), \quad \eta = \min\Big\{\frac{1}{2L}, \frac{c_5 B}{\sqrt{(B\rho^2 + C^2 d\sigma^2)T}}\Big\}, \tag{2}$$

where $c_1, c_2, c_5$ are some absolute constants and $c_3, c_4$ are constants that depend on $L$ and $D_f$. Then Algorithm 1 is $(\epsilon, \delta)$-DP, $\Pr[\mathcal{E}] \ge 1 - \delta'$, and

$$\mathbb{E}\big[f(w_{\text{priv}}) - f^* \mid \mathcal{E}\big] \le O\Big(\frac{1}{T} + \frac{1}{\sqrt{BT}} + \frac{\big(\log(1/\delta') + \log(Tn)\big)\sqrt{d \log(1/\delta)}}{n\epsilon}\Big).$$

• Assume the $f_i$'s are $L$-smooth for some $L > 0$. Set $T, \sigma, C, \eta$ as in eq. (2). Then Algorithm 1 is $(\epsilon, \delta)$-DP, $\Pr[\mathcal{E}] \ge 1 - \delta'$, and

$$\mathbb{E}\big[\|\nabla f(w_{\text{priv}})\|^2 \mid \mathcal{E}\big] \le O\Big(\frac{1}{T} + \frac{1}{\sqrt{BT}} + \frac{\big(\log(1/\delta') + \log(Tn)\big)\sqrt{d \log(1/\delta)}}{n\epsilon}\Big).$$

While the bounds stated in Proposition 4.4 are close to our objective, we note that these bounds are expressed in terms of conditional expectation, where the conditioning event happens with high probability; this is different from the classic notion of convergence in expectation.
Fortunately, we show that it is easy to convert the bounds in Proposition 4.4 into expected utility bounds when the random variable $f(w_{\text{priv}}) - f^*$ (or $\|\nabla f(w_{\text{priv}})\|^2$) is sub-exponential, which is true under Assumption 4.1. Lemma A.9 serves as the main technical tool for this conversion.

Theorem 4.5 (Convergence on expectation). Suppose that Assumption 4.1 holds. Denote by $w_{\text{priv}}$ the output of Algorithm 1. Let $D_f := f(w^{(0)}) - f^*$. Given any $\epsilon > 0$ and $\delta \in (0.5, 1)$, set $T, \sigma, \eta$ in the same way as in eq. (2) and let $C = c_3 + c_4 \log(nT)$, where $c_1, c_2, c_5$ are some absolute constants and $c_3, c_4$ are constants that depend on $L$ and $D_f$.

• Assume the $f_i$'s are convex and $L$-smooth for some $L > 0$. Then Algorithm 1 is $(\epsilon, \delta)$-DP and

$$\mathbb{E}\big[f(w_{\text{priv}}) - f^*\big] \le O\Big(\frac{1}{T} + \frac{1}{\sqrt{BT}} + \frac{\sqrt{d}}{n\epsilon}\Big).$$

Consequently, we have $\mathbb{E}[f(w_{\text{priv}}) - f^*] = O\big(d^{1/2}(n\epsilon)^{-1}\big)$ by setting $T = \Theta(n^2\epsilon^2 d^{-1})$.

• Assume the $f_i$'s are $L$-smooth for some $L > 0$. Then Algorithm 1 is $(\epsilon, \delta)$-DP and

$$\mathbb{E}\big[\|\nabla f(w_{\text{priv}})\|^2\big] \le O\Big(\frac{1}{T} + \frac{1}{\sqrt{BT}} + \frac{\sqrt{d}}{n\epsilon}\Big).$$

Consequently, we have $\mathbb{E}[\|\nabla f(w_{\text{priv}})\|^2] = O\big(d^{1/2}(n\epsilon)^{-1}\big)$ by setting $T = \Theta(n^2\epsilon^2 d^{-1})$.

Remark 4.1. Our results suggest that, when the problem is smooth, the Lipschitz continuity and bounded domain assumptions can be removed almost for free when analyzing DP-SGD-GC with light-tailed gradient noise; the only cost is some logarithmic terms.

Remark 4.2. Our analysis also holds for DP-GD-GC. When analyzing DP-GD-GC, Assumption 4.1 is no longer required. However, when Assumption 4.1 does not hold, an additional multiplicative term $\sqrt{n}$ appears in eq. (1) and the final utility bound is $O\big(\sqrt{d}/(\sqrt{n}\epsilon)\big)$.

Theorem 4.5 is the main theoretical contribution of this paper. The bounds stated in Theorem 4.5 nearly match the rate $O\big(\log(1/\delta)\, d^{1/2}(n\epsilon)^{-1}\big)$, which is the best-known bound of DP-SGD with the Lipschitz continuity or bounded domain assumption (Bassily et al., 2014).
To our knowledge, for unconstrained smooth problems, this is the first utility bound of DP-SGD-GC without a nonvanishing bias term. Note that existing lower bound analyses for DP-SGD either assume the domain is bounded (Bassily et al., 2014) or the loss is Lipschitz continuous (Song et al., 2021) . Therefore those lower bounds are not comparable with the upper bounds stated in Theorem 4.5. We leave the lower bound of DP-SGD-GC for unconstrained smooth problems as a future direction to explore.
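As a quick sanity check on the choice $T = \Theta(n^2\epsilon^2 d^{-1})$ in Theorem 4.5, one can substitute $T = n^2\epsilon^2/d$ into the first two terms of the bound. The short calculation below suppresses constants and logarithmic factors; the side condition $\sqrt{d} \le n\epsilon$ is an assumption added here so that the $1/T$ term is dominated, and it is harmless because the bound is vacuous otherwise.

```latex
\begin{align*}
\frac{1}{T} &= \frac{d}{n^2\epsilon^2}
  = \frac{\sqrt{d}}{n\epsilon}\cdot\frac{\sqrt{d}}{n\epsilon}
  \le \frac{\sqrt{d}}{n\epsilon}
  && \text{(assuming } \sqrt{d} \le n\epsilon\text{)},\\
\frac{1}{\sqrt{BT}} &= \frac{\sqrt{d}}{\sqrt{B}\, n\epsilon}
  \le \frac{\sqrt{d}}{n\epsilon}
  && \text{(since } B \ge 1\text{)},\\
\frac{1}{T} + \frac{1}{\sqrt{BT}} + \frac{\sqrt{d}}{n\epsilon}
  &\le \frac{3\sqrt{d}}{n\epsilon}.
\end{align*}
```

All three terms are therefore of order $d^{1/2}(n\epsilon)^{-1}$, which is the rate stated in the theorem.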

5. VALUE CLIPPING

This section focuses on the practical side of DP-SGD-GC. A well-known implementation issue of Algorithm 1 is that the gradient clipping step (line 5 of Algorithm 1) requires access to the norm of each individual gradient in the sampled batch, and a naive implementation of DP-SGD-GC on current deep learning platforms cannot fully exploit the parallelism of GPUs; see Goodfellow (2015), Rochette et al. (2019) and Bu et al. (2021) for some attempts to mitigate this issue. The state-of-the-art implementation of DP-SGD-GC is from Subramani et al. (2021), who developed a highly engineered approach that exploits language primitives, compilation, and vectorization on certain deep learning platforms.

In this section, we propose a value clipping technique for functions that satisfy the weak growth condition (Definition 3.1). The proposed value clipping can be viewed as an alternative to the classic gradient clipping technique that is easy to implement on all existing deep learning platforms such as PaddlePaddle. The intuition behind value clipping is simple: when the $f_i$'s satisfy the weak growth condition, the norm of their gradients can be bounded by a function of their objective values, i.e., $\|\nabla f_i(w)\| \le \sqrt{\beta_1 (f_i(w) - f_i^*) + \beta_2}$; therefore, scaling the gradient by $\sqrt{\beta_1 (f_i(w) - f_i^*) + \beta_2}$ ensures that the scaled gradient has a bounded norm. Formally,

$$\forall w \in \mathbb{R}^d, \quad g := \nabla f_i(w) \Big/ \max\Big\{1, \frac{\sqrt{\beta_1 (f_i(w) - f_i^*) + \beta_2}}{C}\Big\} \;\Longrightarrow\; \|g\| \le C.$$

Algorithm 3 Differential-private SGD with value clipping (DP-SGD-VC)
1: Input: number of iterations $T \in \mathbb{N}$, clipping threshold $C > 0$, noise level $\sigma > 0$, batch size $B \in [1, n]$, learning rate $\eta > 0$, initial iterate $w^{(0)}$, WGC parameters $\beta_1 > 0$, $\beta_2 \ge 0$, and $f^*_{lb} \in \mathbb{R}$ that lower bounds $f_i^*$ $\forall i \in [n]$.
2: for $t \leftarrow 0, \ldots, T-1$ do
3:   Sample a mini-batch $\mathcal{B}_t$, where each data point is sampled with probability $B/n$;
4:   $\bar{g}_i^{(t)} = \nabla f_i(w^{(t)}) \big/ \max\big\{1, \sqrt{\beta_1 (f_i(w^{(t)}) - f^*_{lb}) + \beta_2}/C\big\}$ $\forall i \in \mathcal{B}_t$;
5:   $w^{(t+1)} = w^{(t)} - \eta \frac{1}{B} \big( \sum_{i \in \mathcal{B}_t} \bar{g}_i^{(t)} + \xi^{(t)} \big)$, where $\xi^{(t)} \sim N(0, C^2 \sigma^2 I_{d \times d})$;
6: end for
7: Return: $w^{(i)}$, where $i$ is sampled uniformly at random from $\{0, 1, \ldots, T\}$.

The detailed algorithm of DP-SGD with value clipping, termed DP-SGD-VC, is shown in Algorithm 3. Note that DP-SGD-VC requires the WGC parameters $\beta_1, \beta_2$ and a lower bound of the $f_i^*$'s as its input. The WGC parameters for simple models, including linear and logistic regression, are easy to calculate. For feed-forward neural networks, the calculation of WGC parameters is achievable but more involved; see Appendix D for details. It is easy to verify that DP-SGD-VC is $(\epsilon, \delta)$-DP because the norm of the clipped gradient is guaranteed to be bounded by $C$; the following corollary is a direct consequence of Theorem 3.4 and describes the DP property of DP-SGD-VC.

Corollary 5.1. Assume the $f_i$'s are $(\beta_1, \beta_2)$-WGC for some $\beta_1 > 0$, $\beta_2 \ge 0$. Let $q = B/n$, where $B$ is the batch size and $n$ is the number of data points. There exist constants $c_1$ and $c_2$ such that for any $\epsilon < c_1 q^2 T$, Algorithm 3 is $(\epsilon, \delta)$-DP for any $\delta > 0$ if $\sigma \ge c_2 q \sqrt{T \log(1/\delta)}\, \epsilon^{-1}$.

Remark 5.1. DP-SGD-VC is easy to implement on existing auto-differentiation based deep learning platforms. The value clipping step (line 4 of Algorithm 3) can be realized within one forward-backward propagation if the WGC parameters are given in advance. Therefore DP-SGD-VC can be as fast as the vanilla SGD algorithm.
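The value clipping step can be illustrated on a toy least-squares problem. The sketch below is an assumption-laden illustration rather than the authors' implementation: the helper names are hypothetical, the losses are $f_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$ (each is $\|x_i\|^2$-smooth with $f_i^* = 0$, so by Lemma 3.2 one may take $\beta_1 = 2\max_i \|x_i\|^2$, $\beta_2 = 0$ and $f^*_{lb} = 0$), and the clipping factor uses the square root of the WGC bound so that the clipped norm is at most $C$. The last lines show why a single backward pass suffices: the summed clipped gradient equals the gradient of a reweighted loss whose weights depend only on the per-sample loss values.

```python
import numpy as np


def value_clip_scales(losses, beta1, beta2, f_lb, C):
    """Per-sample factors 1 / max(1, sqrt(beta1 * (f_i - f_lb) + beta2) / C)."""
    grad_norm_bound = np.sqrt(beta1 * (losses - f_lb) + beta2)
    return 1.0 / np.maximum(1.0, grad_norm_bound / C)


def least_squares_losses_and_grads(w, X, y):
    """Losses f_i(w) = 0.5 * (x_i^T w - y_i)^2 and their per-sample gradients."""
    residual = X @ w - y
    return 0.5 * residual**2, residual[:, None] * X


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)
C = 0.5

# WGC parameters for the toy problem (see the assumptions in the lead-in).
beta1 = 2.0 * np.max(np.sum(X**2, axis=1))
losses, grads = least_squares_losses_and_grads(w, X, y)
scales = value_clip_scales(losses, beta1, beta2=0.0, f_lb=0.0, C=C)
clipped = scales[:, None] * grads

# Only loss values are needed to form the scales, so the summed clipped gradient
# is the gradient of sum_i scales_i * f_i(w) with the scales held fixed.
reweighted_grad = X.T @ (scales * (X @ w - y))
```

In an auto-differentiation framework the same effect would be obtained by treating the scales as constants and back-propagating through the reweighted loss once, which avoids materializing per-sample gradients.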

6. NUMERICAL STUDY

We conduct experiments on two standard image classification benchmark datasets: MNIST (LeCun, 1998) and CIFAR10 (Krizhevsky & Hinton, 2009). In the Appendix, we also present some experimental results on synthetic data with light-tailed noise. For MNIST, we train a linear classifier and a two-layer MLP with 128 hidden nodes. For CIFAR10, to achieve decent accuracy, we use a pre-trained VGG16 network (Simonyan & Zisserman, 2015) to extract informative high-level features. Based on the 512-dimensional extracted features, we train a linear classifier and a two-layer MLP with 128 hidden nodes.

Implementation details. For all experiments, we set the batch size $B = 128$, the noise level $\sigma = 1.0$ and the confidence level $\delta = 10^{-5}$. For MNIST, we try learning rates in $\{2 \times 10^{-3}, 5 \times 10^{-3}, 10^{-2}\}$ for each experiment and report the best result. For CIFAR10, we fix the learning rate to 0.1. All experiments are conducted on a server with 4 CPUs and one NVIDIA Tesla P100 GPU.

6.1. THE EVOLUTION OF CLIPPING FREQUENCY DURING TRAINING

We run DP-SGD-GC with different clipping thresholds. In particular, we try C ∈ {1, 5, 20, 40} for MNIST and C ∈ {0.1, 0.2, 0.4, 1.0} for CIFAR10. The evolution of training accuracy and clipping frequency per epoch under different clipping thresholds is shown in Figure 1. We observe that, in most cases, the clipping frequency decreases as the training accuracy goes up. This observation aligns with the WGC, as a lower training loss implies a smaller average gradient norm, which in turn results in a lower clipping frequency. We also see that, in most cases, the clipping frequency becomes close to 0 when the clipping threshold is chosen appropriately; this observation is consistent with our theoretical analysis. Another interesting observation is the result of training the two-layer neural network on MNIST with C ∈ {20, 40}. The clipping frequency is small initially and stabilizes at 15% instead of decreasing to zero. We conjecture that this phenomenon occurs because the WGC parameters of the neural network grow as training proceeds and thus prevent some gradient norms from falling below the clipping threshold.

6.2. THE EVALUATION OF DP-SGD-VC

We set $f^*_{lb} = 0$ for all experiments with DP-SGD-VC. The calculation of the WGC parameters for a feed-forward neural network with cross-entropy loss is given in Section D, where $\beta_2 = 0$ and $\beta_1$ depends on the spectral norm of each layer of the network. Compared with vanilla SGD, DP-SGD-VC incurs the additional cost of calculating the spectral norm of each layer in each iteration. As shown below, the overhead of calculating the spectral norm is not significant.

Training and testing accuracy

The training and testing accuracy of DP-SGD-GC and DP-SGD-VC with different clipping thresholds are shown in Figure 2 and Figure 3. We observe that DP-SGD-VC converges slightly slower than DP-SGD-GC in terms of epochs. This observation should not be surprising, as DP-SGD-VC uses an upper bound of the gradient norm for clipping, which results in a smaller effective learning rate than DP-SGD-GC. For CIFAR10, training and testing are easy because the model is trained on pre-trained features; both DP-SGD-VC and DP-SGD-GC achieve similar accuracy at the end of training. For MNIST, there is an unfortunate loss of training and testing accuracy. For MNIST with the linear model, there is a ∼2% loss in training and testing accuracy. For MNIST with the two-layer NN, there is a 2∼3% loss in training and testing accuracy for C ∈ {20, 40} and a ∼4% loss for C = 10. The gap between DP-SGD-GC and DP-SGD-VC is more obvious when the clipping threshold is small. We conjecture that the loss of training and testing accuracy is due to our estimation of $f_i^*$. The estimate $f^*_{lb} = 0$ is accurate for CIFAR10, as the model can almost perfectly fit all pre-trained data. However, the estimate is inaccurate for MNIST and thus results in a loose upper bound on the gradient norms.

Comparing the computational time per epoch

We report the per-epoch runtime of different algorithms in Table 2. All experiments are conducted on a server with one NVIDIA Tesla P100 GPU. Vanilla SGD without privacy consideration is the baseline method and is the fastest among all algorithms. Micro-batching is the naive implementation of DP-SGD-GC and is significantly slower than SGD. GC-Opacus is the implementation of DP-SGD-GC from the (highly optimized) Opacus package; we can see that there is still a gap between the performance of GC-Opacus and the standard non-private SGD algorithm. DP-SGD-VC is our implementation of DP-SGD with the proposed value clipping technique.
We observe that DP-SGD-VC is slightly slower than the standard SGD algorithm and faster than the other private training methods.

Limitations

Despite the efficiency of DP-SGD-VC, it also has certain limitations: (i) as shown in the experiments, DP-SGD-VC may cause some loss in training/testing accuracy if our estimation of the WGC parameters is loose; (ii) DP-SGD-VC requires calculating the WGC parameters, which we show is feasible for feed-forward neural networks with cross-entropy loss. However, it would be hard to apply VC to more complicated network architectures with arbitrary loss in a black-box manner, e.g., transformers with ranking loss. Overall, we consider DP-SGD-VC a computationally cheap alternative to DP-SGD-GC; DP-SGD-VC can perform similarly to DP-SGD-GC in certain scenarios but may not be applicable in others.

7. CONCLUSION AND FUTURE WORK

This paper studied the convergence behavior of a widely used privacy-preserving learning algorithm called DP-SGD-GC. Our analysis extended the convergence of DP-SGD-GC to smooth and unconstrained problems without assuming the objective to be globally Lipschitz continuous. We believe that our theoretical results improve the current understanding of DP-SGD-GC and provide new insights for practitioners and researchers to use DP-SGD-GC and design new algorithms. Our analysis can potentially be used for other privacy-preserving learning algorithms such as adaptive DP-SGD (Asi et al., 2021) and DP-SGD with subspace identification (Zhou et al., 2021; Song et al., 2021).

Our work suggests some future directions. Firstly, the light-tail noise condition may not hold in some machine learning applications (Gürbüzbalaban et al., 2021). In these scenarios, the utility bound of DP-GD-GC (Remark 4.2) is $\sqrt{n}$ worse than the best-known bound of DP-SGD-GC for Lipschitz functions. Whether it is possible to further improve the utility bound under heavy-tail noise is an interesting question. Another direction is to explore the lower bound of DP-SGD-GC with a carefully tuned clipping threshold for smooth and unconstrained problems. Lastly, we may also combine our analysis with more methods and schemes in optimization, e.g., adaptive gradient methods, gradient compression, and distributed optimization (Kingma & Ba, 2015; Agarwal et al., 2018; Zhou et al., 2020; Li et al., 2022).

Lemma A.8 (Tail bound for the maximum of sub-Gaussian variables). Let $X_1, X_2, \ldots, X_n$ be random variables. Assume that there exists $K > 0$ such that $\mathbb{E}[\exp(\lambda X_i)] \le \exp(\lambda^2 K^2)$ for all $\lambda \in \mathbb{R}$ and $i \in [n]$. Then there exists some absolute positive constant $c$ such that

$$\Pr\Big[\max_{i \in [n]} |X_i| \ge cK\sqrt{\log(2n)} + ct\Big] \le \exp(-t^2/K^2) \quad \forall t \ge 0.$$

Proof. By Lemma A.4, there exists some absolute constant $c > 0$ such that

$$\Pr[|X_i| \ge t] \le 2\exp\big(-t^2/(c^2K^2)\big) \quad \forall t \ge 0,\; i \in [n].$$
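Then we simply apply the union bound to obtain the conclusion. For any $t \ge 0$,

$$\begin{aligned}
\Pr\Big[\max_{i \in [n]} |X_i| \ge cK\sqrt{\log(2n)} + ct\Big]
&\le \sum_{i=1}^{n} \Pr\Big[|X_i| \ge cK\sqrt{\log(2n)} + ct\Big] \\
&\le 2n \exp\Big(-\big(cK\sqrt{\log(2n)} + ct\big)^2/(c^2K^2)\Big) \\
&\le \exp(\log(2n)) \exp\Big(-\frac{K^2\log(2n)}{K^2} - \frac{2tK\sqrt{\log(2n)}}{K^2} - \frac{t^2}{K^2}\Big) \\
&= \exp\Big(-\frac{2tK\sqrt{\log(2n)}}{K^2} - \frac{t^2}{K^2}\Big) \\
&\le \exp(-t^2/K^2),
\end{aligned}$$

which yields the desired result.

Lemma A.9. Let $X$ be a random variable and $\mathcal{E}$ be a random event such that $\mathbb{E}[|X| \mid \mathcal{E}] \le \alpha$ and $\Pr[\mathcal{E}] \ge 1 - \delta$ for some $\alpha > 0$, $\delta \in (0, 1)$. If $X$ satisfies $\Pr[|X| - \beta \ge t] \le 2\exp(-t/K)$ $\forall t \ge 0$ for some $\beta, K > 0$, then

$$\mathbb{E}[|X|] \le \alpha + \delta\beta + \delta\log(8/\delta)K.$$

Proof. Let $Q : [0, 1] \to \mathbb{R} \cup \{\infty\}$ be the quantile function of the random variable $|X|$, i.e.,

$$Q(p) = \inf\{x \in \mathbb{R} \mid p \le \Pr[|X| \le x]\} \quad \forall p \in [0, 1].$$

By the assumption that $|X| - \beta$ is sub-exponential, we have that

$$Q(1 - \delta') \le \beta + K\log(2/\delta') \quad \forall \delta' \in (0, 1). \tag{3}$$

Now we are ready to prove the conclusion. First notice that

$$\mathbb{E}[|X|] = \Pr[\mathcal{E}]\,\mathbb{E}[|X| \mid \mathcal{E}] + \Pr[\mathcal{E}^c]\,\mathbb{E}[|X| \mid \mathcal{E}^c] \le \alpha + \mathbb{E}[|X| \cdot 1_{\mathcal{E}^c}], \tag{4}$$

where $1_{\mathcal{E}^c}$ is the indicator function of the event $\mathcal{E}^c$. It remains to bound $\mathbb{E}[|X| \cdot 1_{\mathcal{E}^c}]$. Define a new set of events

$$A_i := \Big\{|X| \in \Big[Q\Big(1 - \frac{\delta}{2^{i-1}}\Big),\, Q\Big(1 - \frac{\delta}{2^{i}}\Big)\Big]\Big\}, \quad i = 1, 2, \ldots,$$

and denote the probability measure space by $(\Omega, \mathcal{F}, \mu)$. Then

$$\begin{aligned}
\mathbb{E}[|X| \cdot 1_{\mathcal{E}^c}]
&= \int_{\Omega} |X| \cdot 1_{\mathcal{E}^c}\, d\mu(\omega)
\le \int_{\Omega} |X| \cdot 1_{|X| \ge Q(1-\delta)}\, d\mu(\omega)
\le \sum_{i=1}^{\infty} \int_{\Omega} |X| \cdot 1_{A_i}\, d\mu(\omega) \\
&\le \sum_{i=1}^{\infty} \frac{\delta}{2^i}\, Q\Big(1 - \frac{\delta}{2^i}\Big)
\overset{(i)}{\le} \delta\beta + \sum_{i=1}^{\infty} \frac{\delta}{2^i} K \log\Big(\frac{2^{i+1}}{\delta}\Big) \\
&\le \delta\beta + \delta K \sum_{i=1}^{\infty} \frac{1}{2^i}\big((i+1)\log(2) + \log(1/\delta)\big)
\overset{(ii)}{\le} \delta\beta + \delta K\big(3\log(2) + \log(1/\delta)\big),
\end{aligned}$$

where (i) is by eq. (3) and (ii) is by the standard scalar inequality (Lemma A.2). The above together with eq. (4) yields the desired result.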

A.3 THE GENERALIZED FREEDMAN INEQUALITY

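We restate the generalized Freedman inequality and its corollary from Harvey et al. (2019).

Lemma A.10 (Generalized Freedman Inequality, Harvey et al., 2019, Theorem 3.2). Let $\{d_i, \mathcal{F}_i\}_{i=1}^{T}$ be a martingale difference sequence. Suppose $v_{i-1} \ge 0$, $i \in [T]$, are $\mathcal{F}_{i-1}$-measurable random variables such that $\mathbb{E}[\exp(\lambda d_i) \mid \mathcal{F}_{i-1}] \le \exp\big(\frac{\lambda^2}{2} v_{i-1}\big)$ for all $i \in [T]$, $\lambda > 0$. Let $S_t = \sum_{i=1}^{t} d_i$ and $V_t = \sum_{i=1}^{t} v_{i-1}$. Let $\alpha_i \ge 0$ and $\alpha = \max_{i \in [T]} \alpha_i$. Then

$$\Pr\Big[\bigcup_{t=1}^{T} \Big\{S_t \ge x \text{ and } V_t \le \sum_{i=1}^{t} \alpha_i d_i + \beta\Big\}\Big] \le \exp\Big(-\frac{x}{4\alpha + 8\beta/x}\Big) \quad \forall x, \beta > 0.$$

The following lemma is an immediate consequence of the generalized Freedman inequality.

Lemma A.11. Let $\{d_i, \mathcal{F}_i\}_{i=1}^{T}$ be a martingale difference sequence. Suppose $v_{i-1} \ge 0$, $i \in [T]$, are $\mathcal{F}_{i-1}$-measurable random variables such that $\mathbb{E}[\exp(\lambda d_i) \mid \mathcal{F}_{i-1}] \le \exp\big(\frac{\lambda^2}{2} v_{i-1}\big)$ for all $i \in [T]$, $\lambda > 0$. Let $S_t = \sum_{i=1}^{t} d_i$ and $V_t = \sum_{i=1}^{t} v_{i-1}$. Let $\delta \in (0, 1)$ and assume there exist $R(\delta)$ and $\{\alpha_i^{(t)} \ge 0,\; i = 1, 2, \ldots, T,\; t = 1, 2, \ldots, T\}$ such that

$$\Pr\Big[\bigcap_{t=1}^{T} \Big\{V_t \le \sum_{i=1}^{t} \alpha_i^{(t)} d_i + R(\delta)\Big\}\Big] \ge 1 - \delta.$$

Let $\alpha = \max_{i \in [T],\, t \in [T]} \alpha_i^{(t)}$. Then

$$\Pr\Big[\bigcup_{t=1}^{T} \{S_t \ge x\}\Big] \le \delta + T\exp\Big(-\frac{x}{4\alpha + 8R(\delta)/x}\Big) \quad \forall x > 0.$$

Proof. Given $\delta \in (0, 1)$, $x \in \mathbb{R}$, define the events $A_t := \{S_t \ge x\}$ and $B_t := \big\{V_t \le \sum_{i=1}^{t} \alpha_i^{(t)} d_i + R(\delta)\big\}$. Then

$$\begin{aligned}
\Pr\Big[\bigcup_{t=1}^{T} \{S_t \ge x\}\Big] = \Pr\Big[\bigcup_{t=1}^{T} A_t\Big]
&= \Pr\Big[\Big(\bigcup_{t=1}^{T} A_t\Big) \cap \Big(\bigcap_{t=1}^{T} B_t\Big)\Big] + \Pr\Big[\Big(\bigcup_{t=1}^{T} A_t\Big) \cap \Big(\bigcap_{t=1}^{T} B_t\Big)^c\Big] \\
&\overset{(i)}{\le} \Pr\Big[\Big(\bigcup_{t=1}^{T} A_t\Big) \cap \Big(\bigcap_{t=1}^{T} B_t\Big)\Big] + \delta \\
&\le \Pr\Big[\bigcup_{t=1}^{T} \Big(A_t \cap \bigcap_{i=1}^{T} B_i\Big)\Big] + \delta \\
&\le \Pr\Big[\bigcup_{t=1}^{T} (A_t \cap B_t)\Big] + \delta \\
&\overset{(ii)}{\le} \sum_{t=1}^{T} \Pr[A_t \cap B_t] + \delta \\
&\overset{(iii)}{\le} \delta + T\exp\Big(-\frac{x}{4\alpha + 8R(\delta)/x}\Big) \quad \forall x > 0,
\end{aligned}$$

where (i) is by the assumption $\Pr\big[\bigcap_{t=1}^{T} B_t\big] \ge 1 - \delta$, (ii) is by the union bound (note that we cannot directly use Lemma A.10 since $V_t$ relies on $\alpha_1^{(t)}, \ldots, \alpha_t^{(t)}$ instead of $\alpha_1, \ldots, \alpha_t$), and (iii) is by applying Lemma A.10 $T$ times.

Algorithm 4 SGD
1: Input: total iterations $T \in \mathbb{N}$, learning rate $\eta > 0$, initial iterate $w^{(0)}$.
2: for $t \leftarrow 0, \ldots, T-1$ do
3:   $w^{(t+1)} = w^{(t)} - \eta\big(\nabla f(w^{(t)}) + \xi^{(t)}\big)$;
4: end for
5: Return: $w^{(i)}$, where $i$ is sampled uniformly at random from $\{0, 1, \ldots, T\}$.

A.4 OTHER LEMMAS

Recall Lemma 3.2.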
Proof of Lemma 3.2. By the smoothness of $h$, we have

$$h(v) \le h(u) + \langle \nabla h(u), v - u \rangle + \frac{L}{2}\|v - u\|^2 \quad \forall u, v \in \mathbb{R}^d.$$

Setting $v = u - \nabla h(u)/L$, we obtain

$$\begin{aligned}
& h(u - \nabla h(u)/L) \le h(u) - \frac{1}{2L}\|\nabla h(u)\|^2 \\
\implies\ & \inf_{v \in \mathbb{R}^d} h(v) \le h(u - \nabla h(u)/L) \le h(u) - \frac{1}{2L}\|\nabla h(u)\|^2 \\
\implies\ & \|\nabla h(u)\|^2 \le 2L\Big(h(u) - \inf_{v \in \mathbb{R}^d} h(v)\Big) \quad \forall u \in \mathbb{R}^d.
\end{aligned}$$

Next, we review the standard convergence rate of the SGD algorithm in the following lemma.

Lemma A.12 (Ghadimi & Lan, 2013, Theorem 2.1). Denote $\hat{w}$ as the output of Algorithm 4. Assume that the $f_i$'s are $L$-smooth for some $L > 0$, $\eta \le 1/L$, and there exists $\sigma \ge 0$ such that $\mathbb{E}[\|\xi^{(t)}\|^2] \le \sigma^2$ for all $t \in \mathbb{N}$.

• If the $f_i$'s are further convex, then
$$\mathbb{E}[f(\hat{w}) - f^*] \le \frac{\inf_{w \in W^*} \|w^{(0)} - w\|^2}{(T+1)\eta} + \sigma^2 \eta,$$
where $W^*$ is the set of solutions.

• If the $f_i$'s are not necessarily convex, then
$$\mathbb{E}\big[\|\nabla f(\hat{w})\|^2\big] \le \frac{2\big(f(w^{(0)}) - f^*\big)}{(T+1)\eta} + L\sigma^2 \eta.$$

The following lemma connects the WGC parameters of the $f_i$'s to the WGC parameters of $f$.

Lemma A.13. Assume that the $f_i$'s are $(\beta_1, \beta_2)$-WGC for all $i \in [n]$. Then $f := \frac{1}{n}\sum_{i=1}^{n} f_i$ is $(\beta_1, \beta_2 + \Gamma\beta_1)$-WGC, where $\Gamma = \frac{1}{n}\sum_{i=1}^{n} (f^* - f_i^*)$.

Proof.

$$\|\nabla f(w)\|^2 \overset{(i)}{\le} \frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(w)\|^2 \overset{(ii)}{\le} \frac{1}{n}\sum_{i=1}^{n} \big[\beta_1 (f_i(w) - f_i^*) + \beta_2\big] \le \beta_1 (f(w) - f^*) + \Gamma\beta_1 + \beta_2,$$

where (i) is by the convexity of $\|\cdot\|^2$ and (ii) is by the assumption that the $f_i$'s are $(\beta_1, \beta_2)$-WGC.
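As a numerical sanity check of Lemma 3.2 and Lemma A.13, the following sketch evaluates both inequalities on a small least-squares problem. The problem instance (per-sample losses $f_i(w) = \frac{1}{2}(a_i^\top w - b_i)^2$, the sizes `n` and `d`, and all function names) is an illustrative assumption and is not taken from the paper's experiments. For this choice each $f_i$ is $L$-smooth with $L = \max_i \|a_i\|^2$ and $f_i^* = 0$, so each $f_i$ is $(2L, 0)$-WGC and $\Gamma = f^*$.

```python
import numpy as np

# Hypothetical least-squares instance: f_i(w) = 0.5 * (a_i . w - b_i)^2.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))   # row i is a_i
b = rng.normal(size=n)

def f(w):
    """Average loss f = (1/n) * sum_i f_i."""
    return 0.5 * np.mean((A @ w - b) ** 2)

def grad_f(w):
    """Gradient of the average loss."""
    return A.T @ (A @ w - b) / n

# Each f_i has Hessian a_i a_i^T, hence is ||a_i||^2-smooth; take the largest.
L = float(np.max(np.sum(A ** 2, axis=1)))
w_star = np.linalg.lstsq(A, b, rcond=None)[0]
f_star = f(w_star)
Gamma = f_star  # Gamma = (1/n) sum_i (f^* - f_i^*) with f_i^* = 0

def wgc_gaps(w):
    """Return (rhs - lhs) for Lemma 3.2 and for Lemma A.13 at the point w."""
    g2 = float(np.sum(grad_f(w) ** 2))
    lemma_3_2_rhs = 2 * L * (f(w) - f_star)
    lemma_a13_rhs = 2 * L * (f(w) - f_star) + Gamma * 2 * L
    return lemma_3_2_rhs - g2, lemma_a13_rhs - g2

points = [rng.normal(scale=s, size=d) for s in (0.1, 1.0, 10.0) for _ in range(100)]
gaps = np.array([wgc_gaps(w) for w in points])
print("smallest gap (Lemma 3.2, Lemma A.13):", gaps.min(axis=0))
```

Both gaps should be non-negative at every sampled point. The second is never smaller than the first, which reflects the extra slack $\Gamma\beta_1$ that Lemma A.13 pays for working with per-sample minima.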

APPENDIX B PROOFS FOR SECTION 4

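B.1 PROOF OF PROPOSITION 4.2

Proof. We begin with the smoothness of $f$:

$$\begin{aligned}
f(w^{(t+1)}) &\le f(w^{(t)}) + \langle \nabla f(w^{(t)}), w^{(t+1)} - w^{(t)} \rangle + \frac{L}{2}\|w^{(t+1)} - w^{(t)}\|^2 \\
&= f(w^{(t)}) - \eta\|\nabla f(w^{(t)})\|^2 - \eta\langle \nabla f(w^{(t)}), \zeta^{(t)} \rangle + \frac{L\eta^2}{2}\|\nabla f(w^{(t)}) + \zeta^{(t)}\|^2 \\
&= f(w^{(t)}) - \underbrace{\Big(\eta - \frac{L\eta^2}{2}\Big)}_{\ge 0}\|\nabla f(w^{(t)})\|^2 + \underbrace{\frac{L\eta^2}{2}\|\zeta^{(t)}\|^2}_{z_t} + \underbrace{(L\eta^2 - \eta)\langle \nabla f(w^{(t)}), \zeta^{(t)} \rangle}_{u_t}.
\end{aligned} \tag{5}$$

Let $Z_t := \sum_{i=0}^{t} z_i$ and $U_t := \sum_{i=0}^{t} u_i$. Recursively applying eq. (5) gives

$$f(w^{(t)}) \le f(w^{(0)}) + Z_{T-1} + U_{t-1}, \quad \forall t = 0, 1, \ldots, T. \tag{6}$$

All we need is to show that $Z_{T-1}$ and $U_{t-1}$ are bounded above with high probability.

First, we bound $Z_{T-1}$. Notice that the $\zeta^{(t)}$'s are sub-Gaussian variables, therefore $Z_{T-1}$ is sub-exponential. Applying Lemma A.7, there exists some absolute constant $c_1 > 0$ such that for any $\delta' \in (0, 0.5)$,

$$Z_{T-1} \le \frac{L\eta^2}{2} c_1 T \tilde{\sigma}^2 \log(1/\delta') \le \frac{c_1 L \log(1/\delta')}{2} \quad \text{(by the definition of } \eta\text{)} \tag{7}$$

with probability at least $1 - \delta'$.

Next, we bound the term $U_{t-1}$. Noticing that $\{U_t\}_{t=0}^{\infty}$ is a martingale sequence, we bound it by the generalized Freedman inequality (Lemma A.10). Denote by $\mathcal{F}_{t-1}$ the $\sigma$-algebra generated by $\{w^{(1)}, \ldots, w^{(t)}\}$. Then by the definition of $\zeta^{(t)}$, we have

$$\mathbb{E}\Big[\exp\Big(\frac{u_t^2}{(L\eta^2 - \eta)^2 \tilde{\sigma}^2 \|\nabla f(w^{(t)})\|^2}\Big) \,\Big|\, \mathcal{F}_{t-1}\Big] \le e,$$

where we use the inequality $|u_t|^2 \le (L\eta^2 - \eta)^2 \|\nabla f(w^{(t)})\|^2 \|\zeta^{(t)}\|^2$. By the properties of sub-Gaussian variables (Lemma A.4), there exists some absolute constant $c_2 > 0$ such that

$$\mathbb{E}[\exp(\lambda u_t) \mid \mathcal{F}_{t-1}] \le \exp\Big(\frac{\lambda^2}{2} c_2 (L\eta^2 - \eta)^2 \tilde{\sigma}^2 \|\nabla f(w^{(t)})\|^2\Big) \quad \forall \lambda \in \mathbb{R}.$$

Let $v_{t-1} := c_2 (L\eta^2 - \eta)^2 \tilde{\sigma}^2 \|\nabla f(w^{(t)})\|^2$. Then

$$\begin{aligned}
v_{t-1} &\le 2 c_2 L \eta^2 \tilde{\sigma}^2 \big(f(w^{(t)}) - f^*\big) && \text{(by the weak growth condition)} \\
&\le \frac{2 c_2 L}{T} \big(f(w^{(t)}) - f^*\big) && \text{(by } \eta \le 1/(\tilde{\sigma}\sqrt{T})\text{)}.
\end{aligned}$$

Furthermore, for any $t = 0, 1, \ldots, T-1$,

$$\begin{aligned}
V_t := \sum_{i=0}^{t} v_{i-1} &\le c_2 \sum_{i=0}^{t} \frac{2L}{T} \big(f(w^{(i)}) - f^*\big) \\
&\le c_2 \sum_{i=0}^{t} \frac{2L}{T} \big(f(w^{(0)}) - f^* + Z_{T-1} + U_{i-1}\big) && \text{(by eq. (6))} \\
&\le 2 c_2 L \big(f(w^{(0)}) - f^* + Z_{T-1}\big) + \frac{2 c_2 L}{T} \sum_{i=0}^{t} \sum_{j=0}^{i-1} u_j \\
&\le 2 c_2 L \big(f(w^{(0)}) - f^* + Z_{T-1}\big) + \frac{2 c_2 L}{T} \sum_{i=0}^{t} (t - i) u_i && \text{(rearranging)} \\
&\le 2 c_2 L \big(f(w^{(0)}) - f^* + Z_{T-1}\big) + 2 c_2 L \sum_{i=0}^{t} \frac{t - i}{T} u_i.
\end{aligned}$$

Combining the above with eq. (7), with probability $1 - \delta'$,

$$V_t \le 2 c_2 L \big(f(w^{(0)}) - f^* + c_1 L \log(1/\delta')/2\big) + 2 c_2 L \sum_{i=0}^{t} \frac{t - i}{T} u_i \quad \forall t = 0, 1, \ldots, T-1.$$

Now we are ready to apply Lemma A.11. Making the identification

$$d_i = u_i, \quad \alpha_i^{(t)} = 2 c_2 L \frac{t - i}{T}, \quad \alpha = 2 c_2 L, \quad R(\delta') = 2 c_2 L \big(f(w^{(0)}) - f^* + c_1 L \log(1/\delta')/2\big)$$

and applying Lemma A.11 gives

$$\Pr\Big[\bigcup_{t=0}^{T-1} \{U_t \ge x\}\Big] \le \delta' + T \exp\Big(-\frac{x}{4\alpha + 8R(\delta')/x}\Big) \quad \forall x > 0.$$

It is easy to verify that with the choice $x = \max\big\{4\sqrt{R(\delta')\log(T/\delta')},\ 8\alpha\log(T/\delta')\big\}$, we have

$$\Pr\Big[\bigcup_{t=0}^{T-1} \{U_t \ge x\}\Big] \le 2\delta'. \tag{8}$$

Therefore, with probability at least $1 - 2\delta'$,

$$\begin{aligned}
\max_{t = 0, 1, \ldots, T} f(w^{(t)}) - f^*
&\overset{(i)}{\le} f(w^{(0)}) - f^* + c_1 L \log(1/\delta')/2 \\
&\qquad + \max\Big\{4\sqrt{2 c_2 L \big(f(w^{(0)}) - f^* + c_1 L \log(1/\delta')/2\big)\log(T/\delta')},\ 16 c_2 L \log(T/\delta')\Big\} \\
&\overset{(ii)}{\le} \Big(\sqrt{f(w^{(0)}) - f^* + c_1 L \log(1/\delta')/2} + 4\sqrt{c_2 L \log(T/\delta')}\Big)^2 \\
&\overset{(iii)}{\le} 2\big(f(w^{(0)}) - f^* + c_1 L \log(1/\delta')/2\big) + 32 c_2 L \log(T/\delta'),
\end{aligned}$$

where (i) is by eq. (6) and eq. (8), (ii) is by the facts that $\max\{a, b\} \le a + b$ and $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for any $a, b \ge 0$, and (iii) is by Lemma A.1. Substituting $\delta'$ with $\delta/2$, we obtain the desired result.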
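The step "it is easy to verify" in the proof above can be checked numerically. The sketch below assumes the threshold is $x = \max\{4\sqrt{R\log(T/\delta')},\ 8\alpha\log(T/\delta')\}$ and evaluates the tail term $T\exp(-x/(4\alpha + 8R/x))$ from Lemma A.11 on a grid of parameter values. The function names and the grid are illustrative assumptions. The reason the check passes is that both $4\alpha$ and $8R/x$ are at most $x / (2\log(T/\delta'))$ under this choice of $x$, so the exponent is at least $\log(T/\delta')$.

```python
import math
from itertools import product

def freedman_threshold(R, alpha, T, delta):
    """x = max{4*sqrt(R*log(T/delta)), 8*alpha*log(T/delta)} (assumed form)."""
    log_term = math.log(T / delta)
    return max(4.0 * math.sqrt(R * log_term), 8.0 * alpha * log_term)

def tail_term(x, R, alpha, T):
    """The term T * exp(-x / (4*alpha + 8*R/x)) appearing in Lemma A.11."""
    return T * math.exp(-x / (4.0 * alpha + 8.0 * R / x))

# Illustrative grid of parameters; any R, alpha > 0, T >= 1, delta in (0, 1) works.
grid = product([1e-3, 1.0, 1e3], [1e-3, 1.0, 1e3], [1, 10, 10_000], [1e-6, 0.1, 0.49])
worst_ratio = 0.0
for R, alpha, T, delta in grid:
    x = freedman_threshold(R, alpha, T, delta)
    worst_ratio = max(worst_ratio, tail_term(x, R, alpha, T) / delta)

print("largest tail_term / delta over the grid:", worst_ratio)
```

A ratio of at most 1 on every grid point means the tail term never exceeds $\delta'$, which together with the additive $\delta'$ in Lemma A.11 gives the $2\delta'$ bound.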

B.2 PROOF OF PROPOSITION 4.3

Proof. By the assumption that the $f_i$'s are $L$-smooth, we know that $f$ is also $L$-smooth. Therefore $f$ is $(2L, 0)$-WGC (by Lemma 3.2), i.e.,

$$\|\nabla f(w^{(t)})\| \le \sqrt{2L\big(f(w^{(t)}) - f^*\big)} \quad \forall t = 0, 1, \ldots, T.$$

Applying Proposition 4.2, we obtain that

$$\max_{t = 0, 1, \ldots, T} \|\nabla f(w^{(t)})\| \le \sqrt{4L\big(f(w^{(0)}) - f^* + c_1 L \log(2/\delta)/2\big) + 64 c_2 L^2 \log(2T/\delta)} \tag{9}$$

with probability at least $1 - \delta$ for some absolute positive constants $c_1, c_2$.

It remains to bound the individual gradient norm $\|\nabla f_i(w^{(t)})\|$. When Assumption 4.1 holds, we have that

$$\max_{i \in [n],\, t = 0, 1, \ldots, T} \|\nabla f_i(w^{(t)})\| \le \max_{t \in \{0, 1, \ldots, T\}} \|\nabla f(w^{(t)})\| + \max_{i \in [n],\, t = 0, 1, \ldots, T} \|\nabla f_i(w^{(t)}) - \nabla f(w^{(t)})\|.$$

By Assumption 4.1, we know that $\|\nabla f_i(w) - \nabla f(w)\|$ is sub-Gaussian for all $w \in \mathbb{R}^d$. Therefore we can apply Lemma A.8 to bound the maximum of sub-Gaussian variables, which gives

$$\max_{i \in [n],\, t = 0, 1, \ldots, T} \|\nabla f_i(w^{(t)}) - \nabla f(w^{(t)})\| < c_3 \rho \log(n(T+1)) + c_3 \rho \log(1/\delta) \tag{10}$$

for some absolute positive constant $c_3 > 0$ and any $\delta \in (0, 1)$. Combining eq. (9) and eq. (10), we obtain that

$$\max_{i \in [n],\, t = 0, 1, \ldots, T} \|\nabla f_i(w^{(t)})\| \le \sqrt{4L\big(f(w^{(0)}) - f^* + c_1 L \log(2/\delta)/2\big) + 64 c_2 L^2 \log(2T/\delta)} + c_3 \rho \log(n(T+1)) + c_3 \rho \log(1/\delta)$$

with probability at least $1 - 2\delta$, which yields the desired result eq. (1) under Assumption 4.1.

B.3 PROOF OF PROPOSITION 4.4

Proof. We first consider the case that the $f_i$'s are convex and smooth. By Lemma 3.2, we know that the $f_i$'s are $(2L, 0)$-WGC in this scenario. Set

$$C = \sqrt{4L\big(f(w^{(0)}) - f^* + c_1 L \log(2/\delta')/2\big) + 16 c_2 L \log(2T/\delta')} + c_3 \rho \log(n(T+1)) + c_3 \rho \log(1/\delta'), \qquad \sigma = \frac{c_4\, q \sqrt{T \log(1/\delta)}}{\epsilon}, \tag{11}$$

where $c_1, c_2, c_3$ correspond to the absolute positive constants that appeared in the proof of Proposition 4.2, and $c_4$ is some positive constant that appeared in Theorem 3.4.

Next we need to apply Proposition 4.3. Denote $\mathcal{B}_t$ as the batch sampled at the $t$-th iteration and let

$$\zeta^{(t)} := \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla f_i(w^{(t)}) + \xi^{(t)} - \nabla f(w^{(t)}).$$

By the definition of $\xi^{(t)}$ and Assumption 4.1, the $\zeta^{(t)}$'s satisfy

$$\mathbb{E}\Big[\exp\Big(\|\zeta^{(t)}\|^2 \Big/ \Big(\frac{c_5 \rho^2}{B} + \frac{c_5 C^2 d \sigma^2}{B^2}\Big)\Big)\Big] \le e \quad \text{(by the property of Poisson sampling)},$$

where $c_5$ is some positive absolute constant. Let

$$\tilde{\sigma}^2 = c_5 \Big(\frac{\rho^2}{B} + \frac{C^2 d \sigma^2}{B^2}\Big), \qquad \eta = \min\Big\{\frac{1}{2L},\ \frac{1}{\tilde{\sigma}\sqrt{T}}\Big\}. \tag{12}$$

Then Proposition 4.3 tells us that Algorithm 2 with the above setup of $\tilde{\sigma}$ and $\eta$ will produce iterates such that

$$\|\nabla f_i(w^{(t)})\| \le C, \quad \forall i \in [n],\ t = 0, 1, \ldots, T,$$

with probability $1 - \delta'$.

The above analysis is based on Algorithm 2. Next, we draw the connection between Algorithm 2 with the above setup of parameters and Algorithm 1 with the parameter setup in eq. (11) and $\eta$ defined as in eq. (12).

We introduce some new notation to help our analysis. To distinguish between Algorithm 2 (SGD) and Algorithm 1 (DP-SGD-GC), we denote $\{\tilde{w}^{(t)}\}_{t=0}^{T}$ as the iterates from Algorithm 2 and $\{w^{(t)}\}_{t=0}^{T}$ as the iterates from Algorithm 1. We further let $\tilde{E}_t$ and $E_t$ be the events $\{\|\nabla f_i(\tilde{w}^{(t)})\| \le C,\ \forall i \in [n]\}$ and $\{\|\nabla f_i(w^{(t)})\| \le C,\ \forall i \in [n]\}$ respectively. For two random variables $A$ and $B$, we denote $A \sim B$ if $A$ and $B$ are independent and identically distributed. Consider two independent runs of Algorithm 2 and Algorithm 1 with $\tilde{w}^{(0)} = w^{(0)}$. We are going to show that

• $[\tilde{w}^{(0)}, \ldots, \tilde{w}^{(T)}]$ conditioned on the event $\cap_{t=0}^{T} \tilde{E}_t$ has the same distribution as $[w^{(0)}, \ldots, w^{(T)}]$ conditioned on the event $\cap_{t=0}^{T} E_t$;
• $\Pr[\cap_{t=0}^{T} \tilde{E}_t] = \Pr[\cap_{t=0}^{T} E_t]$.

The first conclusion follows step by step. Given $\tilde{w}^{(0)} = w^{(0)}$, we can conclude that $\tilde{w}^{(1)} \sim w^{(1)}$ conditioned on $\tilde{E}_0$ and $E_0$ (both gradients are bounded, so clipping is inactive), which further implies $\tilde{w}^{(2)} \sim w^{(2)}$ conditioned on $\tilde{E}_1$ and $E_1$, and so on.

For the second conclusion, we prove by induction. Given $\tilde{w}^{(0)} = w^{(0)}$, the base case $\Pr[\tilde{E}_0] = \Pr[E_0]$ is obviously true. Next, given $\Pr[\cap_{t=0}^{m} \tilde{E}_t] = \Pr[\cap_{t=0}^{m} E_t]$, we are going to prove $\Pr[\cap_{t=0}^{m+1} \tilde{E}_t] = \Pr[\cap_{t=0}^{m+1} E_t]$. We know that

$$\Pr\big[\cap_{t=0}^{m+1} \tilde{E}_t\big] = 1 - \Pr\big[\cup_{t=0}^{m+1} \tilde{E}_t^{c}\big] = 1 - \Big(\Pr[\tilde{E}_0^{c}] + \Pr[\tilde{E}_0 \cap \tilde{E}_1^{c}] + \ldots + \Pr[\cap_{t=0}^{m} \tilde{E}_t \cap \tilde{E}_{m+1}^{c}]\Big) \quad \text{(by Lemma A.3)}.$$

We only need to prove that

$$\Pr\big[\cap_{t=0}^{p} \tilde{E}_t \cap \tilde{E}_{p+1}^{c}\big] = \Pr\big[\cap_{t=0}^{p} E_t \cap E_{p+1}^{c}\big] \quad \forall p = 0, 1, \ldots, m,$$

which is equivalent to

$$\Pr\big[\tilde{E}_{p+1}^{c} \mid \cap_{t=0}^{p} \tilde{E}_t\big] \Pr\big[\cap_{t=0}^{p} \tilde{E}_t\big] = \Pr\big[E_{p+1}^{c} \mid \cap_{t=0}^{p} E_t\big] \Pr\big[\cap_{t=0}^{p} E_t\big].$$

By induction, suppose $\Pr[\cap_{t=0}^{p} \tilde{E}_t] = \Pr[\cap_{t=0}^{p} E_t]$ holds. Conditioning on the two events $\cap_{t=0}^{p} \tilde{E}_t$ and $\cap_{t=0}^{p} E_t$, we have $\tilde{w}^{(p+1)} \sim w^{(p+1)}$ and therefore

$$\Pr\big[\tilde{E}_{p+1}^{c} \mid \cap_{t=0}^{p} \tilde{E}_t\big] = \Pr\big[E_{p+1}^{c} \mid \cap_{t=0}^{p} E_t\big].$$

Combining the above, we prove that $\Pr[\cap_{t=0}^{m+1} \tilde{E}_t] = \Pr[\cap_{t=0}^{m+1} E_t]$ and finish the induction.

With the above two conclusions, we can transfer the convergence of Algorithm 2 conditioned on the event $\cap_{t=0}^{T} \tilde{E}_t$ to Algorithm 1 conditioned on the event $\cap_{t=0}^{T} E_t$.

Next, we establish the conditional convergence of SGD. The proof is simply based on the law of total expectation. Denote $\tilde{w}_{\mathrm{out}}$ as the output of Algorithm 2 and $\tilde{E} = \cap_{t=0}^{T} \tilde{E}_t$. Then

$$\begin{aligned}
& \mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^* \mid \tilde{E}] \Pr[\tilde{E}] + \mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^* \mid \tilde{E}^{c}] \Pr[\tilde{E}^{c}] = \mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^*] \\
\implies\ & \mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^* \mid \tilde{E}] \Pr[\tilde{E}] \le \mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^*] \\
\overset{(i)}{\implies}\ & \mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^* \mid \tilde{E}] \le 2\,\mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^*],
\end{aligned} \tag{13}$$

where (i) is by the fact that $\Pr[\tilde{E}] \ge 1 - \delta' \ge 0.5$.

Now we can transfer the convergence of conditional SGD to conditional DP-SGD-GC.
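Notice that the term $\mathbb{E}[\|\xi^{(t)}\|^2]$ in Lemma A.12 is bounded by $\tilde{\sigma}^2$. Then the classic convergence rate of SGD (Lemma A.12) gives, with $E = \cap_{t=0}^{T} E_t$,

$$\begin{aligned}
\mathbb{E}[f(w_{\mathrm{priv}}) - f^* \mid E] &= \mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^* \mid \tilde{E}] \\
&\le 2\,\mathbb{E}[f(\tilde{w}_{\mathrm{out}}) - f^*] && \text{(by eq. (13))} \\
&\le \frac{2 D_w^2}{(T+1)\eta} + 2\tilde{\sigma}^2 \eta && \text{(by Lemma A.12)} \\
&\overset{(i)}{\le} \frac{4 L D_w^2}{T} + \frac{2 D_w^2 \tilde{\sigma}}{\sqrt{T}} + \frac{2\tilde{\sigma}}{\sqrt{T}} \\
&\overset{(ii)}{\le} O\Big(\frac{2 L D_w^2}{T} + \frac{D_w^2 \rho}{\sqrt{BT}} + \frac{D_w^2 C \sigma \sqrt{d}}{B\sqrt{T}} + \frac{\rho}{\sqrt{TB}} + \frac{C \sigma \sqrt{d}}{B\sqrt{T}}\Big) \\
&\overset{(iii)}{\le} O\Big(\frac{2 L D_w^2}{T} + \frac{D_w^2 \rho}{\sqrt{BT}} + \frac{D_w^2 C q \sqrt{d \log(1/\delta)}}{B\epsilon} + \frac{\rho}{\sqrt{TB}} + \frac{C q \sqrt{d \log(1/\delta)}}{B\epsilon}\Big) \\
&\overset{(iv)}{=} O\Big(\frac{2 L D_w^2}{T} + \frac{D_w^2 \rho}{\sqrt{BT}} + \frac{D_w^2 C \sqrt{d \log(1/\delta)}}{n\epsilon} + \frac{\rho}{\sqrt{TB}} + \frac{C \sqrt{d \log(1/\delta)}}{n\epsilon}\Big) \\
&\overset{(v)}{=} O\Big(\frac{D_w^2}{T} + \frac{D_w^2 \rho}{\sqrt{BT}} + \frac{D_w^2 \big(\log(1/\delta') + \log(n) + \log(T)\big) \sqrt{d \log(1/\delta)}}{n\epsilon}\Big),
\end{aligned}$$

where (i) is by the definition of $\eta$, (ii) is by the definition of $\tilde{\sigma}$, (iii) is by the definition of $\sigma$ (eq. (11)), (iv) is by the definition of $q = B/n$, and (v) is by noticing $C = O(\log(1/\delta') + \log(n) + \log(T))$ from eq. (11).

Now consider the case that the $f_i$'s are $L$-smooth but not necessarily convex. We set the parameters the same as for the convex case and apply the convergence rate of SGD for smooth but not necessarily convex functions (Lemma A.12). Following the same proof template as for the convex case, we can obtain the desired result.

B.4 PROOF OF THEOREM 4.5

Proof. The outline of the proof is summarized as follows.

• First we show that the excess empirical risk, i.e., $f(w_{\mathrm{priv}}) - f^*$, or the squared gradient norm, i.e., $\|\nabla f(w_{\mathrm{priv}})\|^2$, are sub-exponential random variables. We also derive a pessimistic upper bound of their sub-exponential parameters (Lemma B.1).
• Next, we develop a technical tool to convert a conditional expected error bound (where the conditioning event happens with high probability) to an expected error bound for sub-exponential random variables (Lemma A.9).
• Finally, by tuning $\delta'$ (the probability that gradient clipping happens during training; see Proposition 4.4), Lemma A.9 together with Lemma B.1 and Proposition 4.4 yields the desired result.

Proof of Lemma B.1. We begin with the smoothness of $f$. For any $t \in \{0, 1, \ldots, T\}$,

$$\begin{aligned}
f(w^{(t)}) &\le f(w^{(0)}) + \langle \nabla f(w^{(0)}), w^{(t)} - w^{(0)} \rangle + \frac{L}{2}\|w^{(t)} - w^{(0)}\|^2 \\
&\overset{(i)}{\le} f(w^{(0)}) + \frac{1}{2L}\|\nabla f(w^{(0)})\|^2 + L\|w^{(t)} - w^{(0)}\|^2 \\
&\le f(w^{(0)}) + \frac{1}{2L}\|\nabla f(w^{(0)})\|^2 + L\Big\|\eta \sum_{i=0}^{t-1} \Big(\frac{1}{B}\sum_{j \in \mathcal{B}_t} \hat{g}_j^{(i)} + \frac{1}{B}\xi^{(i)}\Big)\Big\|^2 \\
&\overset{(ii)}{\le} f(w^{(0)}) + \frac{1}{2L}\|\nabla f(w^{(0)})\|^2 + L\eta^2 T \sum_{i=0}^{T-1} \Big\|\frac{1}{B}\sum_{j \in \mathcal{B}_t} \hat{g}_j^{(i)} + \frac{1}{B}\xi^{(i)}\Big\|^2 \\
&\overset{(iii)}{\le} f(w^{(0)}) + \frac{1}{2L}\|\nabla f(w^{(0)})\|^2 + L\eta^2 T^2 C^2 + \frac{L\eta^2 T}{B^2} \sum_{i=0}^{T-1} \|\xi^{(i)}\|^2 \\
&\overset{(iv)}{\le} f(w^{(0)}) + \big(f(w^{(0)}) - f^*\big) + L\eta^2 T^2 C^2 + \frac{L\eta^2 T}{B^2} \sum_{i=0}^{T-1} \|\xi^{(i)}\|^2 \\
&\overset{(v)}{\le} f(w^{(0)}) + \big(f(w^{(0)}) - f^*\big) + \frac{T^2 C^2}{L} + \underbrace{\frac{T}{L B^2} \sum_{i=0}^{T-1} \|\xi^{(i)}\|^2}_{Z_T},
\end{aligned} \tag{16}$$

where (i) is by the fact that $\langle a, b \rangle \le \frac{1}{2\lambda}\|a\|^2 + \frac{\lambda}{2}\|b\|^2$ for all $\lambda > 0$, (ii) is by the convexity of $\|\cdot\|^2$, (iii) is because $\|\hat{g}_j^{(i)}\| \le C$ due to gradient clipping, (iv) is by the weak growth condition, and (v) is by the assumption on the learning rate ($\eta \le 1/L$).

By the definition of $\xi^{(i)}$, $Z_T$ follows a sub-exponential distribution. By Lemma A.4, there exists some absolute constant $c_1 > 0$ such that

$$\Pr[|Z_T| \ge t] \le 2\exp\Big(-\frac{t\, c_1 L B^2}{T^2 C^2 d \sigma^2}\Big).$$

Combining with eq. (16), we have that

$$\Pr\Big[f(w_{\mathrm{priv}}) - f^* \ge 2\big(f(w^{(0)}) - f^*\big) + \frac{T^2 C^2}{L} + t\Big] \le 2\exp\Big(-\frac{t\, c_1 L B^2}{T^2 C^2 d \sigma^2}\Big) \quad \forall t \ge 0,$$

which finishes the proof for eq. (14). By the weak growth condition, we further have that

$$\|\nabla f(w_{\mathrm{priv}})\|^2 \le 2L\big(f(w_{\mathrm{priv}}) - f^*\big).$$

Therefore

$$\Pr\Big[\|\nabla f(w_{\mathrm{priv}})\|^2 \ge 4L\big(f(w^{(0)}) - f^*\big) + 2T^2 C^2 + 2Lt\Big] \le 2\exp\Big(-\frac{t\, c_1 L B^2}{T^2 C^2 d \sigma^2}\Big) \quad \forall t \ge 0,$$

which finishes the proof for eq. (15).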
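The update analyzed in the proof above moves the iterate by $\eta\big(\frac{1}{B}\sum_{j} \hat{g}_j + \frac{1}{B}\xi\big)$, where each clipped per-sample gradient satisfies $\|\hat{g}_j\| \le C$. The following is a minimal sketch of one such step, written under stated assumptions rather than as the paper's Algorithm 1: the clipping rule $\hat{g} = g \cdot \min\{1, C/\|g\|\}$, the noise $\xi \sim \mathcal{N}(0, C^2\sigma^2 I_d)$, and a fixed batch (instead of Poisson sampling) are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def clip_rows(G, C):
    """Rescale each row g of G to g * min(1, C / ||g||), so every row has norm <= C."""
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    scale = np.minimum(1.0, C / np.maximum(norms, 1e-12))
    return G * scale

def dp_sgd_gc_step(w, per_sample_grads, eta, C, sigma, B, rng):
    """One noisy clipped step: w - eta * (sum of clipped grads + xi) / B.

    Assumes xi ~ N(0, (C * sigma)^2 I); B is the (expected) batch size.
    """
    g_hat = clip_rows(per_sample_grads, C)
    xi = rng.normal(scale=C * sigma, size=w.shape)
    return w - eta * (g_hat.sum(axis=0) + xi) / B

# Tiny usage example with made-up gradients.
rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4)) * np.array([[0.01], [0.1], [1], [5], [10], [0.5], [2], [50]])
w0 = np.zeros(4)
w1 = dp_sgd_gc_step(w0, G, eta=0.1, C=1.0, sigma=0.0, B=8, rng=rng)
print("step length with sigma = 0:", np.linalg.norm(w1 - w0))
```

With `sigma = 0` and a batch of exactly `B` samples, the step length is at most `eta * C`, which is the per-step quantity that step (iii) of the proof bounds by $C$ before multiplying by $\eta$.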

APPENDIX C MISSING EXPERIMENTS

C.1 EXPERIMENTS ON SYNTHETIC DATA

The light-tail-noise assumption may not hold when training neural networks on real-world data. To stay consistent with our theory, we conduct experiments with synthetic data and artificial Gaussian noise. Our synthetic data has 10,000 samples, and each sample is generated from a 256-dimensional standard Gaussian distribution and a ground-truth linear model. To simulate stochastic gradients with light-tailed noise, we perform linear regression and add standard Gaussian noise (light-tailed) to the true gradient. The experimental results are shown in Figure 4. The conclusion is the same as in We can observe that the performance gap between GC and VC grows as ϵ becomes smaller, i.e., as the privacy requirement becomes stricter. For the linear model, the testing accuracy gap is ∼ 2% when ϵ ∼ 10 and ∼ 5% when ϵ ∼ 0.01. For two-layer neural networks, the performance gap is larger: the testing accuracy gap is ∼ 3% when ϵ ∼ 10 and ∼ 9% when ϵ ∼ 0.01. This result should not be too surprising; imposing a smaller ϵ implies adding more noise to the gradient, which further increases the gradient norm and the clipping frequency, and eventually amplifies the error created by VC.
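A minimal sketch of the synthetic setup described above is given below. The sample count (10,000), the dimension (256), the standard Gaussian features, and the standard Gaussian gradient noise follow the text. Everything else is an assumption made for illustration: noise-free labels $y = Xw_{\mathrm{true}}$, a standard Gaussian ground-truth model, the squared loss $\frac{1}{2n}\|Xw - y\|^2$, and the function names.

```python
import numpy as np

def make_synthetic(n=10_000, d=256, seed=0):
    """Standard Gaussian features with labels from a ground-truth linear model.

    Noise-free labels and a Gaussian w_true are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true
    return X, y, w_true

def noisy_gradient(w, X, y, rng):
    """True least-squares gradient plus standard Gaussian (light-tailed) noise."""
    grad = X.T @ (X @ w - y) / len(y)
    return grad + rng.standard_normal(w.shape)

X, y, w_true = make_synthetic()
g = noisy_gradient(np.zeros(X.shape[1]), X, y, np.random.default_rng(1))
print("noisy gradient norm at w = 0:", np.linalg.norm(g))
```

Because the noise is added to the full gradient rather than produced by subsampling, it is exactly Gaussian, which is what makes this setup match the light-tail assumption by construction.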

APPENDIX D MORE ON THE WEAK GROWTH CONDITION

We review some existing results and describe the weak growth condition for feed-forward neural networks with cross-entropy loss. First, we review the generalized growth condition (Fang et al., 2021).

Combining Lemma D.1, eq. (18) and eq. (19), we obtain the following weak growth condition

$$\|\partial f(\mathbf{W})\|_F^2 = \sum_{i=1}^{H} \|\partial_{W_i} f(\mathbf{W})\|_F^2 \le 8\|x\|_2^2 \sum_{i=1}^{H} \prod_{j=1, j \ne i}^{H} \|W_j\|_2^2\, f(\mathbf{W}).$$

Furthermore, by $\|\nabla_{\hat{y}} g(\hat{y})\|_2 \le 2$ and Allen-Zhu et al. (2019, Fact 2.6), we also have

$$\|\partial f(\mathbf{W})\|_F^2 = \sum_{i=1}^{H} \|\partial_{W_i} f(\mathbf{W})\|_F^2 \le 4\|x\|_2^2 \sum_{i=1}^{H} \prod_{j=1, j \ne i}^{H} \|W_j\|_2^2.$$

To sum up, we obtain

$$\|\partial f(\mathbf{W})\|_F^2 = \sum_{i=1}^{H} \|\partial_{W_i} f(\mathbf{W})\|_F^2 \le 4\|x\|_2^2 \sum_{i=1}^{H} \prod_{j=1, j \ne i}^{H} \|W_j\|_2^2\, \min\{1, 2f(\mathbf{W})\}.$$

To access the WGC parameters, we need to compute $\|W_j\|_2$, $j \in [H]$, in each iteration; the overhead can be amortized by increasing the batch size.
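To make the cost of accessing the WGC parameters concrete, the sketch below evaluates the right-hand side of the final bound from the layer spectral norms. It assumes the bound reads $4\|x\|_2^2 \sum_{i=1}^{H} \prod_{j \ne i} \|W_j\|_2^2 \min\{1, 2f(\mathbf{W})\}$, with a product over the other layers inside the sum; that reading of the flattened formula, and the function name `wgc_bound`, are assumptions of this example.

```python
import numpy as np

def wgc_bound(weights, x, loss_value):
    """Evaluate 4 * ||x||^2 * sum_i prod_{j != i} ||W_j||_2^2 * min(1, 2 * loss).

    weights: list of 2-D arrays W_1, ..., W_H; x: input vector; loss_value: f(W) >= 0.
    """
    # Squared spectral norm (largest singular value squared) of every layer.
    spec_sq = [np.linalg.norm(W, 2) ** 2 for W in weights]
    total = 0.0
    for i in range(len(spec_sq)):
        prod = 1.0
        for j, s in enumerate(spec_sq):
            if j != i:
                prod *= s
        total += prod
    return 4.0 * float(np.dot(x, x)) * total * min(1.0, 2.0 * loss_value)

# Two layers with spectral norms 2 and 3, unit-norm input.
layers = [2.0 * np.eye(3), 3.0 * np.eye(3)]
x = np.array([1.0, 0.0, 0.0])
print(wgc_bound(layers, x, loss_value=0.1), wgc_bound(layers, x, loss_value=1.0))
```

Each call costs one spectral-norm computation per layer, independent of the batch size, which is why the overhead can be amortized over larger batches.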



The light-tail assumption is standard for deriving high-probability error bounds of SGD in the literature.

Note that there are several equivalent (up to constant) definitions of a sub-Gaussian variable; see Vershynin, 2018, Proposition 2.5.2. These definitions are often used interchangeably in the literature.

We do not try to optimize the bound in Lemma B.1, as we show that a loose bound is enough for our purpose. The bound in Lemma B.1 should be improvable.



Figure 1: The evolution of training accuracy and clipping frequency during DP-SGD-GC, where the solid lines represent training accuracy and the dashed lines denote the clipping frequency per epoch. The left two figures present the results on the MNIST dataset, where the algorithm is about (1.0379, 10^{-5})-DP. The right two figures show the results on the CIFAR10 dataset (features extracted by a pretrained VGG16 network), where the algorithm is about (0.9580, 10^{-5})-DP.

Figure 2: Value clipping (VC) versus gradient clipping (GC) in terms of training accuracy.

Figure 3: Value clipping (VC) versus gradient clipping (GC) in terms of testing accuracy.

1) and suppose there are positive values R(δ) > 0 and non-negative values {α (t)

Figure 4: The evolution of clipping frequency and the comparison of gradient clipping (GC) and value clipping (VC).

Figure 5: Testing accuracy of gradient clipping (GC) and value clipping (VC) with varying ϵ and δ.

Per epoch runtime of different methods. SGD without gradient clipping is the baseline method; Micro-batching is the naive implementation of DP-SGD-GC; GC and VC are the classic gradient clipping and the proposed value clipping, respectively. We bold the shortest per epoch runtime among methods besides SGD.

al., Then we need to bound ∂ ŷ ∂W . Directly applying Allen-Zhu et al. (2019, Fact 2.6) gives that

ETHICS STATEMENT

This paper studies a class of privacy-preserving machine learning algorithms called differential private SGD. We provide an improved convergence analysis and some new theoretical insights into DP-SGD. We are not aware of any potential negative social impact of this work. The datasets used for the experiments do not contain personally identifiable information or offensive content.

8. APPENDIX

APPENDIX A LEMMAS

A.1 STANDARD FACTS

Lemma A.1. For any $a, b \in \mathbb{R}$, it is true that $(a + b)^2 \le 2a^2 + 2b^2$.

Lemma A.2. It holds that $\sum_{i=1}^{\infty} (i/2^i) = 2$.

Lemma A.3. Given $n$ random events $A_1, A_2, \ldots, A_n$. It holds that

A.2 SUB-GAUSSIAN AND SUB-EXPONENTIAL PROPERTIES

Lemma A.4 (Sub-Gaussian properties, Vershynin, 2018, Proposition 2.5.2). Let X be a random variable. Then the following properties are equivalent; the parameters K i , i = 1, . . . , 4 differ from each other by at most an absolute constant factor.• The tails of X satisfy• The MGF of X 2 is bounded at some point, e.g.,• E[X] = 0 and the MGF of X satisfies .5 (Harvey et al., 2019, Lemma A.4 ). Let X 1 , X 2 , . . . , X n be random variables. Assume that there exist .6 (Harvey et al., 2019, Claim A.7) . Suppose X is a random variable such that there exists constants c andNote that the original lemmas from Harvey et al. ( 2019) assume E[exp(λX)] ≤ exp(λK) for all λ ≤ 1/K. It is easy to check that their results also hold forThen Lemma A.6 givesThe above finishes the proof. Now we proceed to the detailed proof. Consider the case that f i 's are convex and L-smooth. Let T, σ, C, η satisfy eq. ( 2). We apply Lemma A.9 by making the identificationwhere the choice of α is by Proposition 4.4 and the choice of β and K is by Lemma B.1. Then Lemma A.9 gives thatwhere (i) is true by settingNote that the above proof template works as long as we can construct β, K that polynomially depends on T, n, d, ϵ -we can tune δ ′ to convert all polynomial terms in β and K to polylogarithm terms by Lemma A.9. We find Lemma A.9 a quite convenient technical tool for our analysis and we are not aware if this result exist in the literature.When f i 's are smooth but not necessarily convex. The proof follows exactly the same as the convex case (just need to let X := ∥∇f (w priv )∥ 2 , β := 4L(f (wand apply eq. ( 15) instead of eq. ( 14)). We omit the details to avoid tedious repetition.

Lemma B.1 (A pessimistic bound for DP-SGD-GC).

Assume that f i 's are L-smooth. Let w priv be the output from Algorithm 1 with η ≤ 1/L. ThenFurthermore,2021, Proposition 4.1). Consider the objective minwhere ℓ : R → R ≥0 is a nonnegative 1-dimensional loss function that is convex, 1-smooth, and satisfies inf ℓ = 0. The functions h i 's are assumed to be β-Lipschitz continuous for some β > 0.Then f i 's and f are shown to satisfy the following weak growth condition.Lemma D.1 (Fang et al., 2021, Proposition 4.1) . For any w ∈ R d ,where ∂f (w) is the Clarke's generalized gradient (Clarke, 1981) .We refer readers to Fang et al. (2021) for more details on the derivation of the above lemma. Note that f i 's and f in eq. ( 17) can be nonconvex, and the above lemma suggests that the weak growth condition holds for a certain class of nonconvex ERM.

CROSS-ENTROPY LOSS

We give a simple extension of Lemma D.1 to feed-forward neural networks with cross-entropy loss.We denote the number of classes as K. For simplicity, we consider a feed-forward neural network with fixed width m. We denote the parameter of a H-layer feed-forward neural network asWith a little abuse of notation, we denote σ as the ReLU activation, e.g., σ(x) = (max{x 1 , 0}, . . . , max{x m , 0}). Note that neural networks with the ReLU activation is not differentiable everywhere on its domain. We define the "gradient" of a ReLU neural networks in the same way as in Allen-Zhu et al. (2019, Fact 2.6 ).We consider a single training sample x (which is enough for our purpose) and define the architecture of the H-layer feed-forward neural network aswhere h j is the hidden variables of the j-th layer, ŷ is the prediction produced by the network.Without loss of generality, we assume that the label for our training sample is the first class. Then by the definition of the cross-entropy loss, we have thatexp(ŷ i -ŷ1 ) .Making the identification ℓ(α) = log(1 + exp(α)) and g(ŷ) = log K i=2 exp(ŷ i -ŷ1 ) .Then f (W) = ℓ(g(ŷ)).It is obvious that ℓ satisfy the assumptions made by Lemma D.1. In order to apply Lemma D.1, all the remaining is to bound ∂g(ŷ) ∂W . First, we notice that, exp(ŷ 3 -ŷ1 ) K i=2 exp(ŷ i -ŷ1 ) , . . . , exp(ŷ K -ŷ1 ) K i=2 exp(ŷ i -ŷ1 ) =⇒ ∥∇ ŷ g(ŷ)∥ 1 = 2 =⇒ ∥∇ ŷ g(ŷ)∥ 2 ≤ 2.(18)

