DIFFERENTIALLY PRIVATE ADAPTIVE OPTIMIZATION WITH DELAYED PRECONDITIONERS

Abstract

Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP²), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP² across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.

1. INTRODUCTION

Adaptive optimizers such as AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010) and RMSProp (Hinton et al., 2012) are commonly used to improve convergence speed in machine learning training. However, in privacy-sensitive applications, the benefits of adaptivity may degrade as a result of the noise added to preconditioners to guarantee differential privacy (Li et al., 2022). Prior works typically address this issue by using non-sensitive auxiliary data to approximate the underlying structure of private gradients (Asi et al., 2021; Kairouz et al., 2021a; Li et al., 2022). While this can boost performance, assuming access to informative public data may be unrealistic in many privacy-sensitive applications. In this work, we instead ask: Can we improve privacy/utility trade-offs in private adaptive optimization without accessing auxiliary data?

A key insight in addressing this question is that for many machine learning problems, the gradient geometry may not change drastically during successive steps of optimization (e.g., see Figure 1, which plots successive distributions of preconditioner values). This presents an opportunity to estimate the preconditioners used by adaptive optimizers with smaller noise, by averaging across previous iterates. To this end, we propose DP², a differentially private adaptive method that uses historical gradients to construct delayed preconditioners with reduced noise. Despite the simplicity of this approach, we find that it can significantly improve performance in practice, improving convergence speed by as much as 4× relative to non-adaptive baselines, all without the need to access auxiliary data. To better understand these performance gains, we theoretically and empirically analyze the effect of using delayed preconditioners, including the trade-offs that emerge between noise reduction and staleness.

Contributions.
• We propose DP² as a method for differentially private adaptive optimization with delayed preconditioners. Unlike prior work, DP² does not rely on auxiliary data to improve privacy/utility trade-offs in private training.
• We provide convergence guarantees for DP² in both convex and non-convex settings, and analyze the trade-offs between delay and privacy noise.
• We conduct extensive experiments to showcase the effectiveness of DP², which can significantly improve model utility for a given privacy budget across text and recommendation benchmarks.
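The noise-reduction intuition behind delayed preconditioners — averaging privatized gradient estimates over s steps shrinks the effective noise standard deviation by roughly √s — can be checked with a small numerical sketch. This is an illustrative toy (the synthetic "stable" gradient and all variable names are our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, sigma = 10_000, 16, 1.0            # dimension, delay length, noise scale
true_grad = np.full(d, 0.5)              # toy gradient with stable geometry

# One privatized gradient per step (Gaussian noise added each step).
noisy = true_grad + sigma * rng.normal(size=(s, d))

# Error of a single-step estimate vs. the average of the last s noisy gradients.
err_single = np.abs(noisy[-1] - true_grad).mean()
err_avg = np.abs(noisy.mean(axis=0) - true_grad).mean()

print(err_single / err_avg)   # ≈ 4, i.e. ≈ √s: averaging shrinks noise ~√s
```

The same averaging also makes the estimate stale by up to s steps, which is exactly the delay/noise trade-off analyzed later in the paper.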

2. BACKGROUND AND RELATED WORK

In this section we discuss closely related works and set up some preliminaries. We start by discussing prior work in differentially private optimization, considering the classic framework of (ε, δ)-differential privacy (DP) (Dwork et al., 2006), defined as follows.

Definition 1 (Differential privacy (Dwork et al., 2006)). A randomized algorithm M is (ε, δ)-differentially private if for all neighboring datasets D, D′ differing by one element, and every possible subset of outputs O, Pr(M(D) ∈ O) ≤ e^ε Pr(M(D′) ∈ O) + δ.

Differentially Private SGD. Informally, DP in machine learning offers protection by masking the influence of individual examples (example-level DP, e.g., Abadi et al., 2016; Bassily et al., 2014; Song et al., 2013) or of all the examples from one user (user-level DP, e.g., Kairouz et al., 2021b; McMahan et al., 2018) on the trained model. In this work, we consider example-level DP, using the popular subsampled Gaussian mechanism (Dwork et al., 2014; Mironov et al., 2019) to perturb gradients. Unless much larger batch sizes (and possibly larger datasets) are used, DP mechanisms often lead to a significant utility drop. Extensive research has thus been devoted to investigating improved privacy/utility/computation trade-offs for DP-SGD, including various training techniques (e.g., data augmentation and large-batch training) (De et al., 2022), leveraging public data (Amid et al., 2022; Zhou et al., 2021), and releasing gradient statistics via tree aggregation to reduce the amount of noise (Chan et al., 2011; Denisov et al., 2022; Kairouz et al., 2021b). These prior works are orthogonal to, and could be applied in conjunction with, our proposed method, which focuses specifically on privacy in the context of adaptive optimization.

Differentially Private Adaptive Optimization. To reduce privacy cost in iterative DP algorithms, it is natural to consider applying adaptive optimizers (e.g., AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010), RMSProp (Hinton et al., 2012), AMSGrad (Reddi et al., 2018), and Yogi (Zaheer et al., 2018)) to speed up convergence. A straightforward approach is to first privatize mini-batch gradients and then plug the noisy gradients into any adaptive update rule (Zhou et al., 2020). However, estimating gradient moments in this way may yield preconditioners with too much noise, resulting in adaptive methods that may not offer meaningful improvements over DP-SGD (Li et al., 2022). As we discuss in Section 1, more recent works suggest the use of non-sensitive public information to estimate the preconditioners (or other gradient structures) (Asi et al., 2021; Kairouz et al., 2021a; Li et al., 2022), which may not always be available in practice. In Section 5.2, we empirically benchmark two baselines along this line of work and demonstrate that DP² can perform comparably to these state-of-the-art methods, even though it does not require access to auxiliary data.

Finally, we note that previous works have explored the high-level direction of delayed preconditioners, but mainly as a compromise for computational considerations in non-private training (Gupta et al., 2018). In this work, we instead show that staleness can be leveraged to improve privacy/utility trade-offs in private adaptive optimization, and we propose and analyze a novel method for delaying preconditioner computation in the context of private training.

Notation. In this work, we consider using adaptive optimization methods to solve the classic empirical risk minimization objective, i.e., min_w F(w) = (1/n) Σ_{i∈[n]} f(x_i; w), where w ∈ R^d and {f(x_i; w)}_{i∈[n]} are individual loss functions on training samples i ∈ [n]. For vectors u, v ∈ R^d, we use u + v for coordinate-wise addition and u/v for coordinate-wise division. For any vector v, v_j denotes the j-th coordinate of v; for example, g_j^{i,t} refers to the j-th coordinate of gradient g^{i,t}. Finally, |v| ∈ R^d denotes taking coordinate-wise absolute values, and ∥·∥_M denotes the matrix norm defined as ∥·∥_M := ⟨·, M·⟩ for a symmetric and positive definite matrix M ∈ R^{d×d}, or a diagonal matrix with non-negative diagonal entries populated by a vector M ∈ R^d.

3. DP²: DELAYED PRECONDITIONERS FOR DIFFERENTIALLY PRIVATE ADAPTIVE OPTIMIZATION

We now introduce our DP² framework. While we discuss DP² in the context of a particular adaptive method (RMSProp), we note that the approach is method-agnostic in that it can generally be applied

Figure 1: Preconditioner values do not change drastically during optimization (IMDB dataset).
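To make the pieces above concrete, the following sketch combines DP-SGD-style gradient privatization (clip, then add Gaussian noise) with an RMSProp-style preconditioner that is refreshed only every s steps from an average of the accumulated noisy gradients. This is an illustrative simplification under our own assumptions (function and parameter names, the particular clipping rule and refresh schedule are ours), not the paper's exact algorithm:

```python
import numpy as np

def dp2_rmsprop_sketch(grad_fn, w0, steps=200, s=10, lr=0.1, beta=0.9,
                       clip=1.0, sigma=0.01, eps=1e-8, seed=0):
    """Toy DP^2-style loop (illustrative only, not the paper's algorithm).

    Each step privatizes the gradient as in DP-SGD; every s steps the
    accumulated noisy gradients are averaged (noise std shrinks ~1/sqrt(s))
    and used to refresh a delayed RMSProp-style preconditioner.
    """
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    d = w.size
    accum = np.zeros(d)      # sum of noisy gradients since the last refresh
    v = np.zeros(d)          # second-moment estimate built from delayed averages
    precond = None           # None => fall back to plain DP-SGD updates
    for t in range(steps):
        g = grad_fn(w)
        g = g / max(1.0, np.linalg.norm(g) / clip)       # norm clipping
        g_noisy = g + sigma * clip * rng.normal(size=d)  # Gaussian mechanism
        accum += g_noisy
        if (t + 1) % s == 0:                             # delayed refresh
            avg = accum / s
            v = beta * v + (1 - beta) * avg ** 2
            precond = np.sqrt(v) + eps
            accum[:] = 0.0
        step = g_noisy if precond is None else g_noisy / precond
        w -= lr * step
    return w
```

For example, with `grad_fn = lambda w: w - w_star` (a toy quadratic), the iterates move toward `w_star` while the preconditioner stays fixed between refreshes. The delay length s controls how stale the preconditioner is versus how much noise gets averaged out, mirroring the trade-off analyzed in the paper.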

