IMPROVED CONVERGENCE OF DIFFERENTIAL PRIVATE SGD WITH GRADIENT CLIPPING

Abstract

Differentially private stochastic gradient descent (DP-SGD) with gradient clipping (DP-SGD-GC) is an effective optimization algorithm that can train machine learning models with a privacy guarantee. Despite the popularity of DP-SGD-GC, its convergence on unbounded domains without the Lipschitz continuity assumption is less understood; existing analyses of DP-SGD-GC either impose additional assumptions or end up with a utility bound that involves a non-vanishing bias term. In this work, for smooth and unconstrained problems, we improve the current analysis and show that DP-SGD-GC can achieve a vanishing utility bound without any bias term. Furthermore, when the noise generated from subsampled gradients is light-tailed, we prove that DP-SGD-GC can achieve nearly the same utility bound as DP-SGD applied to Lipschitz continuous objectives. As a by-product, we propose a new clipping technique, called value clipping, to mitigate the computational overhead caused by classic gradient clipping. Experiments on standard benchmark datasets are conducted to support our analysis.

1. INTRODUCTION

Training machine learning models that can achieve decent prediction accuracy while preserving data privacy is fundamental in many modern machine learning applications. The concept of differential privacy (DP) from Dwork (2006); Dwork & Roth (2014) offers an elegant mathematical framework to characterize the privacy-preserving ability of randomized algorithms, which has been widely applied to tasks including clustering, regression, principal component analysis, empirical-risk minimization, matrix completion, graph distance estimation, optimization and deep learning (Chaudhuri & Monteleoni, 2008; Chaudhuri et al., 2011; Agarwal et al., 2018; Ge et al., 2018; Jain et al., 2018; Fan & Li, 2022; Fan et al., 2022). For the empirical-risk minimization (ERM) problem, among the many proposed methods, differentially private stochastic gradient descent (DP-SGD) is an effective algorithm that can solve the ERM problem with a privacy guarantee and achieve a reasonable utility bound. DP-SGD has received substantial interest in recent years due to its simplicity and effectiveness (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016; Wang et al., 2017; Bassily et al., 2019; Feldman et al., 2020; Asi et al., 2021). In the classic analysis of DP-SGD, the variance of the Gaussian noise used in each iteration of DP-SGD relies crucially on the ℓ2-sensitivity of the loss function. Therefore, most early works on DP-SGD assume each individual loss function to be Lipschitz continuous on its domain (Song et al., 2013; Bassily et al., 2014). However, many real-world problems are only smooth but not globally Lipschitz continuous; for example, the unconstrained linear regression problem. There are two techniques to circumvent the Lipschitz continuity assumption: (i) imposing an additional bounded domain constraint on the original problem; (ii) clipping gradients in their ℓ2-norm and using the clipped gradients to update the model (Abadi et al., 2016).
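To make technique (ii) concrete, the following is a minimal sketch of one DP-SGD-GC update in the style of Abadi et al. (2016): each per-sample gradient is clipped to ℓ2-norm at most c, the clipped gradients are averaged, and Gaussian noise calibrated to the clipped sensitivity is added. The function names, the noise multiplier `sigma`, and the step size `lr` are illustrative choices, not the paper's exact notation.

```python
import numpy as np

def clip(g, c):
    """Scale g so that its l2-norm is at most c; leave it unchanged otherwise."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def dp_sgd_gc_step(w, per_sample_grads, c, sigma, lr, rng):
    """One DP-SGD-GC update: clip each per-sample gradient to norm c, average,
    then add Gaussian noise whose scale is calibrated to the clipped sensitivity."""
    B = len(per_sample_grads)
    avg = np.mean([clip(g, c) for g in per_sample_grads], axis=0)
    noise = rng.normal(0.0, sigma * c / B, size=w.shape)
    return w - lr * (avg + noise)
```

Because every per-sample gradient has norm at most c after clipping, replacing one data point changes the averaged gradient by at most 2c/B in ℓ2-norm, which is what lets the Gaussian noise scale be set independently of any global Lipschitz constant.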
In practice, the gradient clipping technique is usually preferred over imposing a bounded domain constraint because the latter requires prior knowledge of the distance between the initialization and the solution, which is typically unavailable for unconstrained problems. Indeed, state-of-the-art implementations of DP-SGD all advocate the gradient clipping technique. Despite the popularity of DP-SGD with gradient clipping (DP-SGD-GC), the convergence of DP-SGD-GC for unconstrained problems that are not globally Lipschitz continuous has not been well studied. In fact, recent works (Chen et al., 2020; Song et al., 2021) have reported that DP-SGD-GC can suffer from a constant utility in the worst case. Given these negative results, one may consider DP-SGD-GC an algorithm with a fundamental non-convergence issue. In this work, we show that this is not the case: with a careful choice of the clipping threshold, we prove that DP-SGD-GC can achieve the same utility bound as its non-clipped counterpart DP-SGD. Formally, we summarize our contributions as follows.

• For unconstrained problems that are convex and smooth but not necessarily globally Lipschitz continuous, we show that DP-SGD-GC can achieve an O(√d/(nϵ)) utility bound when the noise generated from the subsampled gradients is light-tailed (Assumption 4.1)¹, which is the same as the utility bound of DP-SGD applied to Lipschitz continuous problems. See Table 1 for a comparison with existing results. To our knowledge, our convergence analysis of DP-SGD-GC for convex, smooth and unconstrained problems provides the first utility bound without a non-vanishing bias term.

• We show that our analysis also applies to unconstrained smooth problems that can potentially be nonconvex. Consequently, DP-SGD-GC can achieve an O(√d/(nϵ)) gradient norm bound for smooth problems under the light-tail-noise assumption.

• This work is theoretical in essence but also includes a practical contribution (Section 5). We develop a novel value clipping technique for problems that satisfy the weak growth condition (Definition 3.1). The proposed value clipping technique can be implemented within one forward-backward propagation on existing learning platforms and can alleviate the computational overhead caused by gradient clipping. The efficiency of value clipping is demonstrated on real datasets.



¹ The light-tail assumption is standard for deriving high-probability error bounds for SGD in the literature.



2. RELATED WORK

DP-SGD with gradient clipping was initially proposed by Abadi et al. (2016). Gradient clipping and its variants have been widely adopted by many privacy-aware training algorithms (Andrew et al., 2021). Despite the popularity of gradient clipping, establishing the convergence rate of DP-SGD-GC without the Lipschitz continuity and bounded domain assumptions remains challenging; see (Wang et al., 2022, Remark 5) for a short discussion on the hardness of removing the bounded domain assumption. This research question was not carefully studied until the recent works of Chen et al. (2020) and Song et al. (2021), who provided counter-examples showing that DP-SGD-GC can suffer from a constant utility in the worst case. Chen et al. (2020) studied the convergence of DP-SGD-GC to a stationary point in the nonconvex setting and showed that an additional assumption on the gradient distribution is sufficient to derive a meaningful utility bound. Song et al. (2021) showed that DP-SGD-GC converges to a perturbed objective function for the generalized linear model and can suffer from a constant utility for the original objective in the worst case. While some other recent works study the convergence of DP-SGD-GC for smooth objectives (Du et al., 2021; Wu et al., 2021; Yang et al., 2022), the rates in these works usually involve a bias term due to clipping. A concurrent work from Bu et al. (2022) suggests that a small clipping threshold can yield promising performance for DP-SGD-GC in certain scenarios, such as training language models. Their empirical discovery contrasts with the theoretical analysis in this work, as our proof technique relies on a large clipping threshold. Bu et al. (2022)'s experiments indicate that the analysis in this

Table 1: The utility bound and assumptions needed by different algorithms for convex problems, where d is the problem dimension, n is the number of data points and ϵ measures the privacy-preserving ability; see Section 3 for more details. "†" is based on a trivial extension of Bassily et al. (2014).

