IMPROVED CONVERGENCE OF DIFFERENTIALLY PRIVATE SGD WITH GRADIENT CLIPPING

Abstract

Differentially private stochastic gradient descent (DP-SGD) with gradient clipping (DP-SGD-GC) is an effective optimization algorithm that can train machine learning models with a privacy guarantee. Despite the popularity of DP-SGD-GC, its convergence on unbounded domains without the Lipschitz continuity assumption is less well understood; existing analyses of DP-SGD-GC either impose additional assumptions or end up with a utility bound that involves a non-vanishing bias term. In this work, for smooth and unconstrained problems, we improve the current analysis and show that DP-SGD-GC can achieve a vanishing utility bound without any bias term. Furthermore, when the noise generated from subsampled gradients is light-tailed, we prove that DP-SGD-GC can achieve nearly the same utility bound as DP-SGD applied to Lipschitz continuous objectives. As a by-product, we propose a new clipping technique, called value clipping, to mitigate the computational overhead caused by classic gradient clipping. Experiments on standard benchmark datasets are conducted to support our analysis.

1. INTRODUCTION

Training machine learning models that can achieve decent prediction accuracy while preserving data privacy is fundamental in many modern machine learning applications. The concept of differential privacy (DP) from Dwork (2006); Dwork & Roth (2014) offers an elegant mathematical framework to characterize the privacy-preserving ability of randomized algorithms, which has been widely applied to tasks including clustering, regression, principal component analysis, empirical-risk minimization, matrix completion, graph distance estimation, optimization, and deep learning (Chaudhuri & Monteleoni, 2008; Chaudhuri et al., 2011; Agarwal et al., 2018; Ge et al., 2018; Jain et al., 2018; Fan & Li, 2022; Fan et al., 2022). For the empirical-risk minimization (ERM) problem, among many proposed methods, differentially private stochastic gradient descent (DP-SGD) is an effective algorithm that can solve the ERM problem with a privacy guarantee and achieve a reasonable utility bound. DP-SGD has received substantial interest in recent years due to its simplicity and effectiveness (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016; Wang et al., 2017; Bassily et al., 2019; Feldman et al., 2020; Asi et al., 2021). In the classic analysis of DP-SGD, the variance of the Gaussian noise used in each iteration relies crucially on the ℓ2-sensitivity of the loss function. Therefore, most early works on DP-SGD assume each individual loss function to be Lipschitz continuous on its domain (Song et al., 2013; Bassily et al., 2014). However, many real-world problems are only smooth but not globally Lipschitz continuous; for example, the unconstrained linear regression problem. There are two techniques to circumvent the Lipschitz continuity assumption: (i) imposing an additional bounded domain constraint on the original problem; (ii) clipping gradients in their ℓ2-norm and using the clipped gradients to update the model (Abadi et al., 2016).
In practice, the gradient clipping technique is usually preferred over imposing a bounded domain constraint, because the latter requires prior knowledge of the distance between the initialization and the solution, which is typically unavailable for unconstrained problems. Indeed, state-of-the-art implementations of DP-SGD all adopt the gradient clipping technique.
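To make the mechanism concrete, the following is a minimal NumPy sketch of one DP-SGD-GC update in the style of Abadi et al. (2016): each per-sample gradient is clipped to ℓ2-norm at most C, the clipped gradients are summed, Gaussian noise with standard deviation proportional to C is added, and the noisy average is used for the descent step. The function names (`clip`, `dp_sgd_gc_step`) and the specific noise parameterization are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def clip(g, C):
    """Clip a per-sample gradient g to ell_2-norm at most C."""
    norm = np.linalg.norm(g)
    # Scale down only when the norm exceeds the clipping threshold C.
    return g * min(1.0, C / norm) if norm > 0 else g

def dp_sgd_gc_step(w, per_sample_grads, C, lr, noise_multiplier, rng):
    """One DP-SGD-GC update: clip each gradient, add Gaussian noise
    calibrated to the clipping threshold C, then take an averaged step.

    noise_multiplier scales the noise standard deviation (sigma = noise_multiplier * C);
    its value would be chosen from the desired (epsilon, delta) privacy budget.
    """
    clipped = [clip(g, C) for g in per_sample_grads]
    noise = rng.normal(0.0, noise_multiplier * C, size=w.shape)
    noisy_avg = (np.sum(clipped, axis=0) + noise) / len(per_sample_grads)
    return w - lr * noisy_avg
```

Clipping bounds the ℓ2-sensitivity of the summed gradient by C regardless of the loss, which is exactly why no global Lipschitz constant is needed to calibrate the Gaussian noise.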

