DIFFERENTIALLY PRIVATE ADAPTIVE OPTIMIZATION WITH DELAYED PRECONDITIONERS

Abstract

Privacy noise may negate the benefits of using adaptive optimizers in differentially private model training. Prior works typically address this issue by using auxiliary information (e.g., public data) to boost the effectiveness of adaptive optimization. In this work, we explore techniques to estimate and efficiently adapt to gradient geometry in private adaptive optimization without auxiliary data. Motivated by the observation that adaptive methods can tolerate stale preconditioners, we propose differentially private adaptive training with delayed preconditioners (DP²), a simple method that constructs delayed but less noisy preconditioners to better realize the benefits of adaptivity. Theoretically, we provide convergence guarantees for our method for both convex and non-convex problems, and analyze trade-offs between delay and privacy noise reduction. Empirically, we explore DP² across several real-world datasets, demonstrating that it can improve convergence speed by as much as 4× relative to non-adaptive baselines and match the performance of state-of-the-art optimization methods that require auxiliary data.

1. INTRODUCTION

Adaptive optimizers such as AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010) and RMSProp (Hinton et al., 2012) are commonly used to improve convergence speed in machine learning training. However, in privacy-sensitive applications, the benefits of adaptivity may degrade as a result of the noise added to the preconditioners to guarantee differential privacy (Li et al., 2022). Prior works typically address this issue by using non-sensitive auxiliary data to approximate the underlying structure of private gradients (Asi et al., 2021; Kairouz et al., 2021a; Li et al., 2022). While this can boost performance, assuming access to informative public data may be unrealistic in many privacy-sensitive applications. In this work, we instead ask: Can we improve privacy/utility trade-offs in private adaptive optimization without accessing auxiliary data?

A key insight in addressing this question is that for many machine learning problems, the gradient geometry may not change drastically during successive steps of optimization (e.g., see Figure 1, which plots successive distributions of preconditioner values). This presents an opportunity to estimate the preconditioners used by adaptive optimizers with smaller noise by averaging across previous iterates. To this end, we propose DP², a differentially private adaptive method that uses historical gradients to construct delayed preconditioners with reduced noise. Despite the simplicity of this approach, we find that it can significantly improve performance in practice, improving convergence speed by as much as 4× relative to non-adaptive baselines, all without the need to access auxiliary data. To better understand these performance gains, we theoretically and empirically analyze the method to study the effect of using delayed preconditioners, including the trade-offs that emerge between noise reduction and staleness.

Contributions. We propose DP² as a method for differentially private adaptive optimization with delayed preconditioners. Unlike prior work, DP² does not rely on auxiliary data to improve privacy/utility trade-offs in private training. We provide convergence guarantees for DP² in both convex and non-convex settings, and analyze the trade-offs between delay and privacy noise. We conduct



Figure 1: Preconditioner values do not change drastically during optimization (IMDB dataset).
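To make the delayed-preconditioner idea concrete, the toy sketch below (our own illustrative construction, not the paper's exact DP² algorithm; the function name `dp2_sketch` and all hyperparameter values are hypothetical) runs DP-SGD-style clipped and noised updates, but rebuilds an RMSProp-style preconditioner only once every `delay` steps from the average of the buffered noisy gradients. Because independent noise averages out, the preconditioner's effective noise shrinks roughly as 1/sqrt(delay), at the cost of using a stale estimate of the gradient geometry.

```python
import numpy as np

def dp2_sketch(grad_fn, w0, steps=200, lr=0.05, delay=10,
               clip=1.0, noise_mult=0.1, eps=1e-2, seed=0):
    """Toy sketch: private adaptive training with a delayed preconditioner.

    Each per-step gradient is clipped and noised as in DP-SGD. Every
    `delay` steps, the RMSProp-style preconditioner is refreshed from the
    average of the last `delay` noisy gradients, so the noise in the
    preconditioner is averaged down while the estimate grows stale.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    v = np.ones_like(w)   # delayed preconditioner: stale but less noisy
    buf = []              # noisy gradients accumulated since last refresh
    for _ in range(steps):
        g = grad_fn(w)
        g = g / max(1.0, np.linalg.norm(g) / clip)           # clip to norm `clip`
        g = g + rng.normal(0.0, noise_mult * clip, g.shape)  # add privacy noise
        buf.append(g)
        if len(buf) == delay:                 # periodic preconditioner refresh
            g_bar = np.mean(buf, axis=0)      # averaging shrinks the noise
            v = g_bar ** 2
            buf = []
        w -= lr * g / np.sqrt(v + eps)        # adaptive step with stale v
    return w
```

On an ill-conditioned quadratic, for example, the delayed preconditioner can still capture per-coordinate gradient scale well enough to speed up the poorly scaled directions, which is the intuition the figure above supports: preconditioner values drift slowly, so staleness costs little.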

