DIFFERENTIALLY PRIVATE OPTIMIZATION FOR SMOOTH NON-CONVEX ERM

Abstract

We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order necessary solution for non-convex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.

1. INTRODUCTION

Privacy protection has become a central issue in machine learning, and differential privacy (Dwork & Roth, 2014) is a rigorous and popular framework for quantifying privacy. In this paper, we propose a differentially private optimization algorithm that finds an approximate second-order necessary solution for ERM problems. We propose several techniques to improve the practical performance of the method, including backtracking line search, mini-batching, and a heuristic that avoids the effects of conservative assumptions made in the analysis.

For a given $f : \mathbb{R}^d \to \mathbb{R}$, consider the minimization problem $\min_{w \in \mathbb{R}^d} f(w)$. We want to find an approximate second-order necessary solution, defined formally as follows.

Definition 1 ($(\epsilon_g, \epsilon_H)$-2NS). For given positive values of $\epsilon_g$ and $\epsilon_H$, we say that $w$ is an $(\epsilon_g, \epsilon_H)$-approximate second-order necessary solution (abbreviated as $(\epsilon_g, \epsilon_H)$-2NS) if
$$\|\nabla f(w)\| \le \epsilon_g, \qquad \lambda_{\min}\left(\nabla^2 f(w)\right) \ge -\epsilon_H. \tag{1}$$

We are mostly interested in the case of $\epsilon_g = \alpha$ and $\epsilon_H = \sqrt{M\alpha}$; that is, we seek an $(\alpha, \sqrt{M\alpha})$-2NS, where $M$ is the Lipschitz constant for $\nabla^2 f$. We focus on the empirical risk minimization (ERM) problem, defined as follows.

Definition 2 (ERM). Given a dataset $D = \{x_1, \ldots, x_n\}$ and a loss function $l(w, x)$, we seek the parameter $w \in \mathbb{R}^d$ that minimizes the empirical risk
$$f(w) = L(w, D) := \frac{1}{n} \sum_{i=1}^n l(w, x_i).$$

ERM is a classical problem in machine learning that has been studied extensively. Prior work proposes a trust-region-type (DP-TR) algorithm that gives an approximate second-order necessary solution for ERM, satisfying both conditions in (1), for particular choices of $\epsilon_g$ and $\epsilon_H$. This work requires the trust-region subproblem to be solved exactly at each iteration, and fixes the radius of the trust region at a small value, akin to a "short step" in a line-search method.
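As a quick illustration, the two conditions in Definition 1 can be checked numerically. The sketch below (the function name and the toy quadratic are ours, not from the paper) tests $\|\nabla f(w)\| \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 f(w)) \ge -\epsilon_H$ at a candidate point:

```python
import numpy as np

def is_2ns(grad, hess, eps_g, eps_H):
    """Check the (eps_g, eps_H)-2NS conditions of Definition 1:
    ||grad|| <= eps_g  and  lambda_min(hess) >= -eps_H."""
    small_gradient = np.linalg.norm(grad) <= eps_g
    lam_min = np.linalg.eigvalsh(hess).min()  # hess is symmetric
    return bool(small_gradient and lam_min >= -eps_H)

# Toy check at w = 0 for f(w) = 0.5 * w^T A w: the gradient is zero,
# and A has one mildly negative curvature direction (eigenvalue -0.05).
A = np.array([[1.0, 0.0], [0.0, -0.05]])
print(is_2ns(np.zeros(2), A, eps_g=1e-3, eps_H=0.1))    # True:  -0.05 >= -0.1
print(is_2ns(np.zeros(2), A, eps_g=1e-3, eps_H=0.01))   # False: -0.05 <  -0.01
```

With $\epsilon_H = \sqrt{M\alpha}$, the second condition tolerates mild negative curvature of the scale that the algorithm's Hessian steps can exploit.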
An earlier work (Wang et al., 2019) proposed the DP-GD algorithm, which takes short steps in a noisy gradient direction, then sorts through all the iterates so generated to find one that satisfies second-order necessary conditions. Our work matches the sample complexity bound of DP-GD for finding an $(\alpha, \sqrt{M\alpha})$-2NS, which is $O\!\left(\frac{d}{\alpha^2 \sqrt{\rho}}\right)$ for $\rho$-zCDP or $O\!\left(\frac{d \sqrt{\ln(1/\delta)}}{\alpha^2 \varepsilon}\right)$ for $(\varepsilon, \delta)$-DP, and has an iteration complexity of $O(\alpha^{-2})$. Our contributions can be summarized as follows.

• Our algorithm is elementary and is based on a simple (non-private) line-search algorithm for finding an approximate second-order necessary solution. It evaluates second-order information (a noisy Hessian matrix) only when insufficient progress can be made using first-order (gradient) information alone. By contrast, DP-GD uses the (noisy) Hessian only for checking the approximate second-order condition, while DP-TR requires the noisy Hessian to be calculated at every iteration.

• Our algorithm is practical and fast. DP-TR has a slightly better sample complexity bound than our method, depending on $\alpha^{-7/4}$ rather than $\alpha^{-2}$. However, since our analysis is based on the worst case, we can be more aggressive with step sizes (see below). DP-TR requires solving the trust-region subproblem exactly, which is relatively expensive and unnecessary when the gradient is large enough to take a productive step. Experiments demonstrate that our algorithm requires fewer iterations than DP-TR, does less computation on average at each iteration, and thus runs significantly faster than DP-TR. Moreover, we note that the mini-batch version of DP-TR has a sample complexity of $O(\alpha^{-2})$, matching the sample complexity of the mini-batch version of our algorithm.

• We use line search and mini-batching to accelerate the algorithm. Differentially private line-search algorithms have been proposed by Chen & Lee (2020).
We use the same sparse vector technique as their work, but provide a tighter analysis of the sensitivity of the query that checks the sufficient decrease condition. In addition, we provide a rigorous analysis of the guaranteed function decrease with high probability.

• To complement our worst-case analysis, we propose a heuristic that obtains much more rapid convergence while retaining the guarantees provided by the analysis.

The remainder of the paper is structured as follows. In Section 2, we review basic definitions and properties from differential privacy and state our assumptions on the function $f$ to be optimized. In Section 3, we describe our algorithm and its analysis: the basic short-step version in Section 3.1, an extension to a practical line-search method in Section 3.2, and a mini-batch adaptation in Section 3.3. In Section 4, we present experimental results that demonstrate the effectiveness of our algorithms.
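To make the line-search idea concrete, here is a minimal skeleton of backtracking along the negative (noisy) gradient. The function name, constants, and the optional Laplace perturbation of the sufficient-decrease query are illustrative stand-ins: the paper's private version answers this query through the sparse vector technique with carefully calibrated noise, which this sketch does not reproduce.

```python
import numpy as np

def backtracking_step(f, w, g, t0=1.0, beta=0.5, c=1e-4,
                      noise_scale=0.0, max_backtracks=20, rng=None):
    """Backtrack along d = -g until the sufficient decrease condition
    f(w) - f(w + t*d) >= c * t * ||g||^2 holds. With noise_scale > 0,
    the decrease query is perturbed with Laplace noise, loosely mimicking
    a private sufficient-decrease check."""
    rng = rng or np.random.default_rng()
    d = -g
    fw, t = f(w), t0
    for _ in range(max_backtracks):
        decrease = fw - f(w + t * d)
        if noise_scale > 0:
            decrease += rng.laplace(0.0, noise_scale)
        if decrease >= c * t * np.dot(g, g):
            return w + t * d, t  # accepted step and its length
        t *= beta
    return w, 0.0  # no step accepted: caller may turn to Hessian information

# Usage on f(w) = 0.5 * ||w||^2 starting from w = (2, 0):
f = lambda w: 0.5 * float(np.dot(w, w))
w_new, t = backtracking_step(f, np.array([2.0, 0.0]), np.array([2.0, 0.0]))
print(t, f(w_new))  # accepts t = 1.0, landing on the minimizer w = 0
```

When every trial step is rejected, the fall-back return value signals that gradient information alone is insufficient, which is exactly the point at which our algorithm consults the (noisy) Hessian.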

2. PRELIMINARIES

We use several variants of DP as needed for our analysis, including $(\varepsilon, \delta)$-DP (Dwork & Roth, 2014), $(\alpha, \epsilon)$-RDP (Mironov, 2017), and zCDP (Bun & Steinke, 2016). We review their definitions and properties in Appendix A. We make the following assumptions about the smoothness of the objective function $f$.

Assumption 1. We assume $f$ is bounded below by $\underline{f}$. Assume further that $f$ is $G$-smooth and has $M$-Lipschitz Hessian, that is, for all $w_1, w_2 \in \mathrm{dom}(f)$,
$$\|\nabla f(w_1) - \nabla f(w_2)\| \le G \|w_1 - w_2\|, \qquad \|\nabla^2 f(w_1) - \nabla^2 f(w_2)\| \le M \|w_1 - w_2\|,$$
where $\|\cdot\|$ denotes the vector 2-norm or the matrix 2-norm, as appropriate. We use this notation throughout the paper.

For the ERM version of $f$ (see Definition 2), we make additional assumptions.

Assumption 2. For the ERM setting, we assume the loss function $l(w, x)$ is $G$-smooth and has $M$-Lipschitz Hessian with respect to $w$. Thus $L(w, D)$ (the average loss across $n$ samples) is also $G$-smooth and has $M$-Lipschitz Hessian with respect to $w$. In addition, we assume $l(w, x)$ has bounded function values, gradients, and Hessians. That is, there are constants $B$, $B_g$, and $B_H$ such that for any $w, x$ we have
$$0 \le l(w, x) \le B, \qquad \|\nabla_w l(w, x)\| \le B_g, \qquad \|\nabla^2_w l(w, x)\| \le B_H.$$
As a consequence, the $l_2$ sensitivity of $L(w, D)$ and $\nabla L(w, D)$ is bounded by $B/n$ and $2B_g/n$, respectively, and we have
$$\|\nabla^2 L(w, D) - \nabla^2 L(w, D')\|_F \le \sqrt{d}\, \|\nabla^2 L(w, D) - \nabla^2 L(w, D')\| \le \frac{2 B_H \sqrt{d}}{n}.$$
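These sensitivity bounds are what calibrate the privacy noise. As a sketch under our assumptions (the function name is ours): by Bun & Steinke (2016), adding Gaussian noise with standard deviation $\sigma = \Delta_2 / \sqrt{2\rho}$ to a query with $l_2$ sensitivity $\Delta_2$ satisfies $\rho$-zCDP, so the ERM gradient with per-sample bound $B_g$ can be released as follows:

```python
import numpy as np

def private_gradient(per_sample_grads, B_g, rho, rng):
    """Release the mean of per-sample gradients under rho-zCDP.

    Each row of per_sample_grads is assumed to satisfy ||g_i|| <= B_g
    (Assumption 2), so replacing one sample changes the mean by at most
    2 * B_g / n in l2 norm. Gaussian noise with standard deviation
    sigma = sensitivity / sqrt(2 * rho) gives rho-zCDP (Bun & Steinke, 2016).
    """
    n, dim = per_sample_grads.shape
    sensitivity = 2.0 * B_g / n
    sigma = sensitivity / np.sqrt(2.0 * rho)
    return per_sample_grads.mean(axis=0) + rng.normal(0.0, sigma, size=dim)
```

The Hessian can be privatized analogously, with a symmetric Gaussian noise matrix scaled by the $2 B_H \sqrt{d} / n$ Frobenius-norm sensitivity bound above.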

