MULTI-EPOCH MATRIX FACTORIZATION MECHANISMS FOR PRIVATE MACHINE LEARNING

Abstract

We introduce new differentially private (DP) mechanisms for gradient-based machine learning (ML) training involving multiple passes (epochs) over a dataset, substantially improving the achievable privacy-utility-computation tradeoffs. Our key contribution is an extension of the online matrix factorization DP mechanism to multiple participations, substantially generalizing the approach of Denisov et al. (2022). We first give conditions under which it is possible to reduce the problem with per-iteration vector contributions to the simpler one of scalar contributions. Using this, we formulate the construction of optimal (in total squared error at each iterate) matrix mechanisms for SGD variants as a convex program, and propose an efficient optimization algorithm via a closed-form solution to the dual function. While tractable, both solving the convex problem offline and computing the necessary noise masks during training can become prohibitively expensive when many training steps are necessary. To address this, we design a Fourier-transform-based mechanism with significantly less computation and only a minor utility decrease. Extensive empirical evaluation on two tasks, example-level DP for image classification and user-level DP for language modeling, demonstrates substantial improvements over the previous state-of-the-art. Though our primary application is to ML, our main DP results are applicable to arbitrary linear queries and hence may have much broader applicability.

1. INTRODUCTION

Differentially private stochastic gradient descent (DP-SGD) is the de facto standard algorithm for DP machine learning (ML) (Song et al., 2013; Abadi et al., 2016a). However, obtaining state-of-the-art privacy-utility tradeoffs critically requires the use of privacy amplification techniques like shuffling (Erlingsson et al., 2019; Feldman et al., 2022) or (Poisson) subsampling (Bassily et al., 2014; Zhu & Wang, 2019; Wang et al., 2019). These in turn require strong assumptions on the manner in which data is processed that are rarely valid in applications of DP-SGD, as implementing these procedures is often impractical (Kairouz et al., 2021). Kairouz et al. (2021) recently proposed the DP-FTRL framework, which avoids reliance on amplification by sampling through the use of DP streaming of prefix sums (Dwork et al., 2010; Chan et al., 2011; Honaker, 2015). DP-FTRL can often match (or outperform) DP-SGD in privacy-utility tradeoffs; indeed, this algorithm enabled McMahan & Thakurta (2022) to train the first known provably DP ML model on user data in a production setting. Several works have since focused on this primitive as an instantiation of the streaming matrix mechanism; in particular, Denisov et al. (2022) showed that leveraging optimal matrix mechanisms led to significant empirical improvements, though their work was restricted to the single-epoch setting.

As shown in Figs. 1 and 3, we achieve substantially improved privacy-utility tradeoffs with comparable computation. Our methods outperform all prior work, including DP-SGD with amplification, to privacy budgets as low as ε ≈ 2. To accomplish this, we propose a formalism for measuring multi-participation sensitivity, given in Section 2, a significant extension of the single-participation sensitivity used in Denisov et al. (2022). We show in Section 3 how one may compute matrix mechanisms optimized for this multi-participation setting.
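As background on the prefix-sum primitive underlying DP-FTRL, the sketch below (not from the paper; the function name and the trivial factorization B = A, C = I are our illustrative choices) shows the basic matrix-mechanism shape: factor the prefix-sum matrix A as A = BC, privatize the encoded stream Cx with Gaussian noise, and decode with B.

```python
import numpy as np

def prefix_sum_mechanism(x, noise_scale, rng):
    """Toy (non-streaming) sketch of the matrix mechanism for prefix sums.

    A is the lower-triangular all-ones matrix, so A @ x is the vector of
    prefix sums of the scalar stream x. Factoring A = B @ C and releasing
    B @ (C @ x + z) yields A @ x + B @ z, i.e., the true prefix sums plus
    correlated noise shaped by B.
    """
    n = len(x)
    A = np.tril(np.ones((n, n)))
    # Trivial factorization B = A, C = I. Optimized factorizations
    # (Denisov et al., 2022) instead choose B, C to minimize the total
    # error, which scales with sensitivity(C) and the norms of B's rows.
    B, C = A, np.eye(n)
    z = rng.normal(scale=noise_scale, size=n)
    return B @ (C @ x + z)

rng = np.random.default_rng(0)
noisy = prefix_sum_mechanism(np.ones(8), noise_scale=0.0, rng=rng)
# With zero noise this recovers the exact prefix sums 1, 2, ..., 8.
```

In DP-FTRL, x would be the stream of (clipped) batch gradients rather than scalars, and the encoding/decoding is performed online.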
This generalization enables application of optimized streaming matrix mechanisms to settings where each example (or user) may contribute to multiple elements of the data matrix (the matrix formed by stacking unnoised batch gradients in ML). We also explore the computational tradeoffs of our approaches. In particular, computing optimal matrix factorizations may become intractable when a large number of training steps n is required, as we discuss in Section 4. While this is uncommon in federated algorithms for user-level DP, it can be a limitation when training with SGD for example-level privacy. To reduce this cost, we propose and investigate an approach based on the Fast Fourier Transform (FFT) (Nussbaumer, 1981), which is near-optimal for the single-epoch setting and efficiently computable for most, if not all, ML settings. Indeed, we find this approach still outperforms the mechanisms from the extant literature, even under multiple participations.

Contributions. 1) We provide a framework for computing the sensitivity of matrix mechanisms under general participation schemas. To do this, we prove a new theorem bounding sensitivity for multi-dimensional data contributions, which reduces the problem to that of measuring sensitivity for scalar contributions alone (Section 2). 2) We extend the results of Denisov et al. (2022) to the optimization problems corresponding to these generalized notions of sensitivity, showing that the algorithms proposed there apply in our setting (Section 3). 3) We propose and analyze a computationally efficient factorization based on the Fourier transform that is near-optimal for the single-epoch setting and can be efficiently extended to handle multiple epochs (Section 4). 4) We perform detailed empirical comparisons of our mechanisms with both prior matrix mechanism approaches and DP-SGD, showing that the methods proposed here outperform all others (in particular, DP-SGD with amplification) at privacy budgets as low as ε ≈ 2, without any need for privacy amplification (Section 5). 5) We will upload all code used in the final manuscript.
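To make the notion of sensitivity under a participation schema concrete, the brute-force sketch below (our illustration, with a hypothetical helper name; it assumes the vector-to-scalar reduction of contribution 1 applies, so each participation contributes a scalar of magnitude at most ζ) upper-bounds the ℓ2 sensitivity of an encoder C by maximizing over the allowed participation patterns and all sign choices. It is only feasible for small patterns; the paper's structural results avoid this exponential search.

```python
import itertools
import numpy as np

def multi_participation_sensitivity(C, patterns, zeta=1.0):
    """Brute-force L2 sensitivity bound for encoder C under a
    participation schema.

    `patterns` lists the index sets of steps in which a single example
    may participate. A worst-case neighboring dataset changes that
    example's scalar contribution (magnitude <= zeta, arbitrary sign)
    at every index of one pattern, so we maximize the norm of the
    signed column sum over patterns and sign vectors.
    """
    worst = 0.0
    for pattern in patterns:
        cols = C[:, list(pattern)]
        for signs in itertools.product([-1.0, 1.0], repeat=len(pattern)):
            worst = max(worst, np.linalg.norm(cols @ np.array(signs)))
    return zeta * worst

# Example: n = 4 steps, 2 epochs of 2 steps each, fixed shuffled order:
# an example seen at step 0 participates again at step 2, etc.
C = np.tril(np.ones((4, 4)))  # encoder of the trivial factorization
patterns = [(0, 2), (1, 3)]
sens = multi_participation_sensitivity(C, patterns)
```

With a single participation per example (singleton patterns), this reduces to the maximum column norm of C, the single-epoch sensitivity used by Denisov et al. (2022).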

Related work

The core privacy primitive here is the matrix mechanism (Li et al., 2015). It has a long history of study and application, mostly in the offline setting (McKenna et al., 2018; Edmonds et al., 2020; Yuan et al., 2016; Hardt & Talwar, 2010).

Example- and user-level DP, and the connection to federated learning (FL). In addition to example-level DP, we consider user-level DP. As observed by McMahan et al. (2018), private FL algorithms are well suited to providing user-level DP or other multi-example units of privacy (e.g., document-level), as bounding the sensitivity of a single user's contribution to an aggregate update is made straightforward by the per-user data processing pattern inherent in FL. However, our primary application is to datacenter training, where user data can be processed in a fixed shuffled order, unlike cross-device FL. We use the term 'participation' to denote the event that an example (user or client in FL) contributes to the gradient sum (or a model update in FL) x_i for a given step/iteration (round in FL) i. Individual contributions to x_i are scaled so their maximum ℓ2 norm is ζ. Our
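The scaling of individual contributions to an ℓ2 norm of at most ζ is the standard clipping step; a minimal sketch (the function name is ours, not the paper's):

```python
import numpy as np

def clip_contribution(g, zeta):
    """Scale a per-example (or per-user) contribution g so that its
    L2 norm is at most zeta, as assumed by the sensitivity analysis.
    Contributions already within the bound are left unchanged."""
    norm = np.linalg.norm(g)
    if norm > zeta:
        g = g * (zeta / norm)
    return g

clipped = clip_contribution(np.array([3.0, 4.0]), zeta=1.0)
# The input has norm 5, so it is rescaled to norm 1.
```

Clipping is what lets the sensitivity of the full mechanism be stated purely in terms of the encoder matrix and the participation schema, with ζ as a multiplicative constant.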



Figure 1: Our optimal multi-epoch matrix and FFT-based mechanisms outperform all others, including DP-SGD with amplification, to privacy budgets as low as ε ≈ 4. Using our sensitivity calculation of Theorem 2.1 and stamping (Section 5), we optimize a single-pass (k = 1) matrix of Denisov et al. (2022) but apply it here with > 1 pass. We use an online Honaker-based decoder equivalent to that of Kairouz et al. (2021), except for a significant improvement to tree-completion (Appendix D.3). Models are trained for 20 epochs on CIFAR10 with a batch size of 500. We repeat each setting 12 times and show 95% bootstrapped confidence intervals. The empirical setup is described in Section 5.1.

Fichtenberger et al. (2022) and Denisov et al. (2022) independently applied the matrix mechanism to the adaptive streaming setting, where outputs are released one-by-one and the privacy analysis must account for an adversary adaptively defining the inputs. Denisov et al. (2022) connected the matrix mechanism to DP ML via the DP-FTRL algorithm of Kairouz et al. (2021), and showed that computing optimal factorizations significantly improves the privacy-utility-computation tradeoffs when only a single pass (epoch) over the training data is needed.

