ON DYNAMIC NOISE INFLUENCE IN DIFFERENTIALLY PRIVATE LEARNING

Abstract

Protecting privacy in learning while maintaining model performance has become increasingly critical in many applications that involve sensitive data. Private Gradient Descent (PGD) is a commonly used private learning framework, which noises gradients according to the Differential Privacy protocol. Recent studies show that dynamic privacy schedules with decreasing noise magnitudes can improve the loss at the final iteration; yet theoretical understanding of the effectiveness of such schedules, and of their connections to optimization algorithms, remains limited. In this paper, we provide a comprehensive analysis of noise influence in dynamic privacy schedules to answer these critical questions. We first present a dynamic noise schedule that minimizes the utility upper bound of PGD, and show how the noise influence from each optimization step collectively impacts the utility of the final model. Our study also reveals how the impact of dynamic noise influence changes when momentum is used. We empirically show that the connection holds for general non-convex losses and that the influence is strongly affected by the loss curvature.

1. INTRODUCTION

In the era of big data, privacy protection in machine learning systems is becoming a crucial topic, as increasing amounts of personal data are involved in training models (Dwork et al., 2020) and malicious attackers are present (Shokri et al., 2017; Fredrikson et al., 2015). In response to the growing demand, differentially private (DP) machine learning (Dwork et al., 2006) provides a computational framework for privacy protection and has been widely studied in various settings, including both convex and non-convex optimization (Wang et al., 2017; 2019; Jain et al., 2019). One widely used procedure for privacy-preserving learning is (Differentially) Private Gradient Descent (PGD) (Bassily et al., 2014; Abadi et al., 2016). A typical gradient descent procedure updates its model by the gradients of losses evaluated on the training data. When the data is sensitive, the gradients should be privatized to prevent excess privacy leakage. PGD privatizes a gradient by adding controlled noise. As such, models produced by PGD are expected to have lower utility than those from unprotected algorithms. In cases where strict privacy control is exercised, or equivalently, under a tight privacy budget, the accumulating effects of highly-noised gradients may lead to unacceptable model performance. It is thus critical to design effective privatization procedures for PGD that maintain a good balance between utility and privacy. Recent years have witnessed a promising privatization direction that studies how to dynamically adjust the privacy-protecting noise during the learning process, i.e., dynamic privacy schedules, to boost utility under a given privacy budget. One example is (Lee & Kifer, 2018), which reduces the noise magnitude when the loss stops decreasing, based on the observation that gradients become very small when approaching convergence, so a static noise scale would overwhelm them.
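To make the privatization step above concrete, the following sketch implements one PGD update with per-example gradient clipping and Gaussian noise. The function name and parameter values are illustrative, not taken from the paper:

```python
import numpy as np

def private_gradient_step(w, per_example_grads, lr=0.1, clip=1.0,
                          sigma=1.0, rng=None):
    """One (Differentially) Private Gradient Descent update:
    clip each per-example gradient to bound sensitivity, average,
    then add Gaussian noise calibrated to the clipping norm."""
    rng = rng or np.random.default_rng(0)
    clipped = [g / max(1.0, np.linalg.norm(g) / clip)
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip / len(per_example_grads),
                       size=w.shape)
    return w - lr * (mean_grad + noise)

# With sigma=0 this reduces to plain clipped gradient descent:
w_new = private_gradient_step(np.zeros(3),
                              [np.array([3.0, 0.0, 0.0]),
                               np.array([0.0, 1.0, 0.0])],
                              sigma=0.0)
assert np.allclose(w_new, [-0.05, -0.05, 0.0])
```

Clipping bounds each example's contribution (the sensitivity), which is what lets the noise scale `sigma * clip / N` be calibrated to a privacy budget.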
Another example is (Yu et al., 2019), which periodically decreases the magnitude following a predefined strategy, e.g., exponential or step decay. Both approaches confirmed the empirical advantages of decreasing noise magnitudes. Intuitively, the dynamic mechanism may coordinate with certain properties of the learning task, e.g., the training data and the loss surface. Yet no theoretical analysis is available, and two important questions remain unanswered: 1) What is the form of utility-preferred noise schedules? 2) When and to what extent do such schedules improve utility? To answer these questions, in this paper we develop a principled approach to construct dynamic schedules and quantify their utility bounds in different learning algorithms.

Algorithm                        | Noise scale                            | Utility upper bound
---------------------------------|----------------------------------------|------------------------------------------------
                                 | O(sqrt(T) / R_{eps,delta})             | O(D ln^2(N) / (N^2 R_{eps,delta}))
Adam+MA (Zhou et al., 2020)      | O(sqrt(T) / R_{eps,delta})             | O_p(sqrt(D ln(ND/(1-p))) / (N R_{eps,delta}))
GD, Non-Private                  | 0                                      | O(D / (N^2 R))
GD+zCDP, Static Schedule         | T/R                                    | O(D ln(N) / (N^2 R))
GD+zCDP, Dynamic Schedule        | O(gamma^{(t-T)/2} / R)                 | O(D / (N^2 R))
Momentum+zCDP, Static Schedule   | T/R                                    | O((D / (N^2 R)) (c + ln(N) 1[T > T_1]))
Momentum+zCDP, Dynamic Schedule  | O((c_1 gamma^{T_1+t} + c_2 gamma^{(T-t)/2}) / R) | O((D / (N^2 R)) (1 + (cD / (N^2 R)) 1[T > T_1]))

Our contributions are summarized as follows. 1) For the class of loss functions satisfying the Polyak-Lojasiewicz (PL) condition (Polyak, 1963), we show that a dynamic schedule improving the utility upper bound is shaped by the influence of per-iteration noise on the final loss. As this influence is tightly connected to the loss curvature, the advantage of using a dynamic schedule consequently depends on the loss function. 2) Beyond gradient descent, our results show that gradient methods with momentum implicitly introduce a dynamic schedule and result in an improved utility bound. 3) We empirically validate our results on convex and non-convex loss functions (which need not satisfy the PL condition).
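The momentum effect described in contribution 2 can be seen in a small numeric sketch: a noise impulse entering the momentum buffer at one step influences all later updates with geometrically decaying weight, which is an implicit decaying noise schedule. The function below is purely illustrative:

```python
def noise_influence_under_momentum(beta, T):
    """Inject a unit noise impulse at step 0 into the momentum buffer
    v_t = beta * v_{t-1} + g_t and record its contribution to each
    subsequent update. A sketch of how momentum geometrically
    discounts the influence of early-step noise."""
    v, influence = 0.0, []
    for t in range(T):
        g = 1.0 if t == 0 else 0.0   # noise impulse only at step 0
        v = beta * v + g
        influence.append(v)          # contribution to the update at step t
    return influence

# The impulse at step 0 decays as beta^t in later updates:
assert noise_influence_under_momentum(beta=0.5, T=4) == [1.0, 0.5, 0.25, 0.125]
```

So even under a static noise scale, momentum weights early noise less in the final iterate, mirroring the effect of an explicit decaying schedule.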
Our results suggest that the preferred dynamic schedule takes an exponentially decaying form and works better when learning with high-curvature loss functions. Moreover, dynamic schedules yield larger utility gains under stricter privacy conditions (e.g., smaller sample sizes and smaller privacy budgets).
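The exponentially decaying form can be sketched as a budget-allocation rule: under zCDP, per-step privacy costs compose additively, and a Gaussian mechanism on a sensitivity-1 query with noise standard deviation sigma_t costs rho_t = 1/(2 sigma_t^2). Assigning exponentially growing per-step budgets (hence exponentially shrinking noise) that sum to a fixed total gives a schedule of this form. The function below is a minimal illustration; its constants and exponents are not the paper's exact schedule:

```python
import numpy as np

def exp_decay_noise_schedule(rho_total, T, gamma=0.9):
    """Split a total zCDP budget rho_total across T steps so that
    later steps get exponentially more budget (hence less noise).
    Gaussian mechanism on a sensitivity-1 query: rho_t = 1/(2*sigma_t^2).
    A sketch of the *form* of a decaying schedule only."""
    weights = gamma ** np.arange(T, 0, -1)     # small early, large late
    rho = rho_total * weights / weights.sum()  # per-step budgets sum to rho_total
    sigma = np.sqrt(1.0 / (2.0 * rho))         # noise std per step
    return sigma

sched = exp_decay_noise_schedule(rho_total=1.0, T=100, gamma=0.9)
assert np.all(np.diff(sched) < 0)  # noise magnitude decreases over iterations
```

By construction the per-step costs 1/(2 sigma_t^2) still sum to the fixed total budget, so the decaying schedule spends no more privacy than a static one.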

2. RELATED WORK

Differentially Private Learning. Differential privacy (DP) characterizes the chance that an algorithm's output (e.g., a learned model) leaks private information about its training data when the output distribution is known. Since the outputs of many learning algorithms have undetermined distributions, the probability of their privacy leakage is hard to measure. A common approach to tackle this issue is to inject randomness with a known probability distribution to privatize the learning procedure. Classical methods include output perturbation (Chaudhuri et al., 2011), objective perturbation (Chaudhuri et al., 2011), and gradient perturbation (Abadi et al., 2016; Bassily et al., 2014; Wu et al., 2017). Among these approaches, Private Gradient Descent (PGD) has attracted extensive attention in recent years because it can be flexibly integrated with variants of gradient-based iterative methods, e.g., stochastic gradient descent, momentum methods (Qian, 1999), and Adam (Kingma & Ba, 2014), for both convex and non-convex problems. Adaptive gradient clipping (Thakkar et al., 2019) and dynamic batch sizes (Feldman et al., 2020) have also been demonstrated to improve convergence.

Utility Upper Bounds. A utility upper bound is a critical metric for privacy schedules that characterizes the maximum utility a schedule can deliver in theory. Wang et al. (2017) were the first to prove a utility bound under the PL condition. In this paper, we improve the upper bound through a more accurate estimation of the dynamic influence of step noise. Remarkably, by introducing a dynamic schedule, we further boost the sample-efficiency of the upper bound. With a similar intuition, Feldman et al. (2020) proposed to gradually increase the batch size, which reduces the dependence



Comparison of utility upper bounds using different privacy schedules. The algorithms are T-iteration (1/2)R-zCDP under the PL condition (unless marked with *). The O notation in this table drops other ln terms. Unless otherwise specified, all algorithms terminate at step T = O(ln(N^2 R / D)). We assume loss functions are 1-smooth and 1-Lipschitz continuous, and all parameters satisfy their numeric assumptions. Key notations: O_p - the bound holds with probability p; D - feature dimension; N - sample size; R - privacy budget; c_i - constants; other notations can be found in Section 4. An extended table and explanation are available in Appendix A.


