MULTI-EPOCH MATRIX FACTORIZATION MECHANISMS FOR PRIVATE MACHINE LEARNING

Abstract

We introduce new differentially private (DP) mechanisms for gradient-based machine learning (ML) training involving multiple passes (epochs) of a dataset, substantially improving the achievable privacy-utility-computation tradeoffs. Our key contribution is an extension of the online matrix factorization DP mechanism to multiple participations, substantially generalizing the approach of Denisov et al. (2022). We first give conditions under which it is possible to reduce the problem with per-iteration vector contributions to the simpler one of scalar contributions. Using this, we formulate the construction of optimal (in total squared error at each iterate) matrix mechanisms for SGD variants as a convex program. We propose an efficient optimization algorithm via a closed form solution to the dual function. While tractable, both solving the convex problem offline and computing the necessary noise masks during training can become prohibitively expensive when many training steps are necessary. To address this, we design a Fourier-transform-based mechanism with significantly less computation and only a minor utility decrease. Extensive empirical evaluation on two tasks, example-level DP for image classification and user-level DP for language modeling, demonstrates substantial improvements over the previous state-of-the-art. Though our primary application is to ML, we note our main DP results are applicable to arbitrary linear queries and hence may have much broader applicability.

1. INTRODUCTION

Differentially private stochastic gradient descent (DP-SGD) is the de facto standard algorithm for DP machine learning (ML) (Song et al., 2013; Abadi et al., 2016a). However, obtaining state-of-the-art privacy-utility tradeoffs critically requires use of privacy amplification techniques like shuffling (Erlingsson et al., 2019; Feldman et al., 2022) or (Poisson) subsampling (Bassily et al., 2014; Zhu & Wang, 2019; Wang et al., 2019). These in turn require strong assumptions on the manner in which data is processed that are rarely valid in applications of DP-SGD, as implementing these procedures is often impractical (Kairouz et al., 2021).

Kairouz et al. (2021) recently proposed the DP-FTRL framework, which avoids reliance on amplification by sampling by using DP streaming of prefix sums (Dwork et al., 2010; Chan et al., 2011; Honaker, 2015). DP-FTRL can often match (or outperform) DP-SGD in privacy-utility tradeoffs. Indeed, this algorithm enabled McMahan & Thakurta (2022) to train the first known provably DP ML model on user data in a production setting. Several works have since focused on this primitive as an instantiation of the streaming matrix mechanism; in particular, Denisov et al. (2022) showed that leveraging optimal matrix mechanisms led to significant empirical improvements, though their work was restricted to the single-epoch setting.

As shown in Figs. 1 and 3, we achieve substantially improved privacy-utility tradeoffs, with comparable computation. Our methods outperform all prior work, including DP-SGD with amplification, to as low as ε ≈ 2. To accomplish this, we propose a formalism for measuring multi-participation sensitivity, given in Section 2, a significant extension to the single-participation sensitivity used in Denisov et al. (2022). We show in Section 3 how one may compute matrix mechanisms optimized for this multi-participation setting.
This generalization enables application of optimized streaming matrix mechanisms to settings where each example (or user) may contribute to multiple elements of the data matrix (the matrix formed by stacking unnoised batch gradients in ML).

Figure 1: Our optimal multi-epoch matrix and FFT-based mechanisms outperform all others, including DP-SGD with amplification, as low as ε ≈ 4. Using our sensitivity calculation of Theorem 2.1 and stamping (Section 5), we optimize a single pass (k = 1) matrix of Denisov et al. (2022) but apply it here with > 1 pass. We use an online Honaker-based decoder equivalent to that of Kairouz et al. (2021) except for a significant improvement to tree-completion in Appendix D.3. Models trained for 20 epochs on CIFAR10 with a batch size of 500. We repeat each setting 12 times and show 95% bootstrapped confidence intervals. Empirical setup is in Section 5.1.

We also explore the computational tradeoffs of our approaches. In particular, computing optimal matrix factorizations may become intractable when large numbers of training steps n are required, as we discuss in Section 4. While this is uncommon in the federated algorithms for user-level DP, it can be a limitation when training with SGD for example-level privacy. To reduce this cost, we propose and investigate an approach based on the Fast Fourier Transform (FFT) (Nussbaumer, 1981), which is near-optimal for the single-epoch setting and efficiently computable for most, if not all, ML settings. Indeed, we find this approach still outperforms the mechanisms from the extant literature, even under multiple participations.

Contributions

1) We provide a framework for computing the sensitivity of matrix mechanisms under general participation schemas. To do this, we prove a new theorem bounding sensitivity for multi-dimensional data contributions. This allows us to reduce the problem to that of measuring sensitivity for scalar contributions alone (Section 2).
2) We extend the results of Denisov et al. (2022) to the optimization problems corresponding to these generalized notions of sensitivity, showing that the algorithms proposed there can be applied in our setting (Section 3).
3) We propose and analyze a computationally-efficient factorization based on the Fourier transform which is near optimal for the single-epoch setting and can be efficiently extended to handle multiple epochs (Section 4).
4) We perform detailed empirical comparisons of our mechanisms with both the prior matrix mechanism approaches and DP-SGD. We show that the methods proposed here outperform all others (in particular, DP-SGD with amplification), to privacy budgets as low as ε ≈ 2, and without any need for privacy amplification (Section 5).
5) We will upload all code used in the final manuscript.

Related work

The core privacy primitive here is the matrix mechanism (Li et al., 2015). Its long history of study and application was mostly in the offline setting (McKenna et al., 2018; Edmonds et al., 2020; Yuan et al., 2016; Hardt & Talwar, 2010). Fichtenberger et al. (2022) and Denisov et al. (2022) independently applied it to the adaptive streaming setting, where outputs are released one-by-one and privacy analysis must account for an adversary adaptively defining the inputs. Denisov et al. (2022) connected the matrix mechanism to DP ML, via the DP-FTRL algorithm of Kairouz et al. (2021), and showed that computing optimal factorizations significantly improves the privacy-utility-computation tradeoffs when needing only a single pass (epoch) over the training data.

Example- and user-level DP, and the connection to federated learning (FL)

In addition to example-level DP, we consider user-level DP. As observed by McMahan et al. (2018), private FL algorithms are well suited to providing user-level DP or other multi-example units of privacy, e.g. document-level, as bounding the sensitivity of a single user's contribution to an aggregate update is made straightforward by the per-user data processing pattern inherent in FL. However, our primary application is to datacenter training, where user data can be processed in a fixed shuffled order, unlike cross-device FL. We use the term 'participation' to denote the event that an example (user or client in FL) contributes to the gradient sum (or a model update in FL) $x_i$ for a given step/iteration (round in FL) $i$. Individual contributions to $x_i$ are scaled so their maximum $\ell_2$ norm is $\zeta$. Our mechanisms compute sums over individual clipped contributions, and then post-process by dividing by the batch size (or clients/round) to compute an average gradient (or model update). We assume $\zeta = 1$, applying appropriate scaling as needed. Appendix A summarizes terminology and notation.
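The clip, sum, then average pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `clip_to_norm`, `clipped_sum` and `to_average` are hypothetical, contributions are plain Python lists, and the DP noise that a mechanism would add to the sum before averaging is deliberately omitted.

```python
from math import sqrt

def clip_to_norm(g, zeta=1.0):
    """Scale vector g down (never up) so that its l2 norm is at most zeta."""
    norm = sqrt(sum(v * v for v in g))
    scale = 1.0 if norm <= zeta else zeta / norm
    return [v * scale for v in g]

def clipped_sum(contributions, zeta=1.0):
    """x_i: the sum of the clipped contributions participating in one step."""
    clipped = [clip_to_norm(g, zeta) for g in contributions]
    return [sum(column) for column in zip(*clipped)]

def to_average(x_i, batch_size):
    """Post-processing: divide the (in practice, noised) sum by the batch size."""
    return [v / batch_size for v in x_i]

# Two hypothetical per-example gradients; the first exceeds the norm bound.
contributions = [[3.0, 4.0], [0.3, 0.4]]
x_i = clipped_sum(contributions, zeta=1.0)
average = to_average(x_i, batch_size=len(contributions))
```

Because each clipped contribution has norm at most ζ, adding or removing one example changes `x_i` by at most ζ in $\ell_2$ norm, which is the quantity the sensitivity analysis builds on.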

2. PRIVACY FOR ADAPTIVE STREAMS WITH MULTIPLE PARTICIPATIONS

We define and efficiently bound the sensitivity of the multi-participation adaptive streaming (continual release) setting, by generalizing Denisov et al. (2022, Sec. 2). We assume a database of $m$ examples (or users in FL, or records in a general DP application) that is processed as a stream over $n$ steps. A set of $B$ examples is selected on each step $i$, and processed via an adaptively chosen function (e.g., computing a gradient at the current model), producing a vector of $\ell_2$ norm at most $\zeta$. These vectors are summed and provided to the DP mechanism as $x_i \in \mathbb{R}^d$, which then releases a privatized function of $[x_1, \dots, x_i]$, the stream so far. When a particular example contributes to the sum $x_i$, we say it participates on step $i$. We are primarily interested in the case where $m/B < n$, and hence each example is used on more than one step. This is the multiple-epoch setting of ML.

Two data streams $x$ and $\tilde{x}$ are said to be neighboring if they differ in the contributions derived from a single example, either by zeroing out all of its contributions, or by replacing them arbitrarily subject to the norm bound $\zeta$. Thus, the participation pattern does not change (all records contribute to the same steps in $x$ and $\tilde{x}$, with only the vectors associated with one record changing). We define a participation schema $\Pi$ as the set of possible participation patterns $\pi \in \Pi$, with each $\pi \subseteq [n]$ indicating a set of steps in which a single example might participate. Assuming each record contributes at most once (single-participation, $\Pi = \{\{1\}, \{2\}, \dots, \{n\}\}$) recovers the standard streaming setting. This captures, for example, training with minibatch SGD using a single pass (epoch) over a training dataset. At the other extreme, we have every-step participation with $\Pi = \{[n]\}$, where each record contributes to every step. This captures learning with full gradient descent, where we compute the gradient on the full training dataset on every iteration.
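To make the two extreme schemas concrete, the sketch below builds them as collections of 1-indexed step sets and checks whether the steps on which two streams differ fit inside a single participation pattern. The function names are hypothetical, streams are scalar (d = 1) lists for simplicity, and the check covers only the support condition of neighboring streams, not the norm bound ζ.

```python
def single_participation(n):
    """Pi = {{1}, {2}, ..., {n}}: each record contributes to at most one step."""
    return [frozenset({i}) for i in range(1, n + 1)]

def every_step_participation(n):
    """Pi = {[n]}: each record contributes to every step."""
    return [frozenset(range(1, n + 1))]

def changed_steps(x, x_tilde):
    """1-indexed steps on which the two streams differ."""
    return frozenset(i + 1 for i, (a, b) in enumerate(zip(x, x_tilde)) if a != b)

def support_fits_schema(x, x_tilde, schema):
    """Necessary condition for neighboring streams: all changes lie in one pattern."""
    steps = changed_steps(x, x_tilde)
    return any(steps <= pattern for pattern in schema)

x = [0.5, 0.2, 0.1]
x_one_step = [0.5, -0.3, 0.1]   # differs from x on step 2 only
x_two_steps = [0.1, -0.3, 0.1]  # differs from x on steps 1 and 2
```

Here `x` and `x_one_step` can be neighbors under either schema, whereas `x` and `x_two_steps` can only be neighbors when a single record is allowed to participate on more than one step.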

Fixed-epoch-order participation

We focus on a generalization of the above two, $(k, b)$-participation, where each example participates at most $k$ times, with any adjacent participations exactly $b$ steps apart: formally, $\Pi$ is the set of all $\pi$ such that $|\pi| \le k$, and if $\pi = \{i_1, \dots, i_k\}$, we have $i_j - i_{j-1} = b$ for all $j \in \{2, \dots, k\}$. Note that $(k{=}1, b{=}n)$-participation recovers the single-epoch setting and $(k{=}n, b{=}1)$-participation recovers every-step participation; as a further example, $(k{=}3, b{=}2)$-participation has $\Pi = \{\{1, 3, 5\}, \{2, 4, 6\}\}$. We focus on this participation schema because:

1) It encompasses multi-epoch SGD training using a data processing pattern well-supported by modern ML infrastructure. The only requirement is that rather than shuffling the dataset for each epoch, the dataset is shuffled once and the same order of minibatches is used for each epoch. With this setup, $k$ epochs of training on a dataset of size $m$ with a batch size $B$ gives $n = mk/B$ total training steps, and satisfies $(k, m/B)$-participation.
2) We show (e.g., Eq. (3)) that in important cases this participation schema allows for the efficient computation of sensitivity.
3) We will see in Section 3 that the more possible participation patterns $\pi$, the more constrained the problem of finding optimal mechanisms becomes. Hence, a relatively restrictive (but practical) schema like $(k, b)$-participation yields more favorable privacy-utility tradeoffs.

Sensitivity of linear queries on multi-participation adaptive streams

Consider a full-rank square query (or workload) matrix $A \in \mathbb{R}^{n \times n}$; we wish to compute the function $x \mapsto Ax$ in a differentially private manner, where we consider inputs $x$ and outputs $Ax$ to be elements of $\mathbb{R}^{n \times d}$, under geometry inherited from the Frobenius inner product. We utilize the matrix mechanism (Li et al., 2015), which, provided a factorization $A = BC$, computes the estimate

$$\widehat{Ax} = B(Cx + z), \tag{1}$$

where $z$ is a sample from appropriately scaled isotropic Gaussian noise.
The scale is determined by the sensitivity of the mapping $x \mapsto Cx$; roughly, how much outputs of this mapping can vary (in $\ell_2$ norm) when we swap the input stream $x$ for a neighboring one $\tilde{x}$. We refer to this matrix $C$ as the "encoder", as it encodes $x$ as $Cx$ before adding Gaussian noise. Similarly, we call $B$ the "decoder".

Let $\mathbf{N}$ be the set of all pairs of neighboring streams, and let $\mathfrak{D} := \{x - \tilde{x} \mid (x, \tilde{x}) \in \mathbf{N}\}$ represent the set of all possible deltas between neighboring $x, \tilde{x}$. The definition of $\mathfrak{D}$ implies it is symmetric ($u \in \mathfrak{D} \Rightarrow -u \in \mathfrak{D}$). We will say a $\mathfrak{D}$ satisfies the participation schema $\Pi$ if the indices of all nonzero elements in each vector $u \in \mathfrak{D}$ correspond to some $\pi \in \Pi$. Critically, for linear queries $\mathfrak{D}$ fully captures the sensitivity of the query:

Definition 1. The sensitivity of the matrix factorization mechanism Eq. (1) is defined as

$$\mathrm{sens}_{\mathfrak{D}}(C) = \sup_{(x, \tilde{x}) \in \mathbf{N}} \|Cx - C\tilde{x}\|_F = \sup_{u \in \mathfrak{D}} \|Cu\|_F. \tag{2}$$

Convexity of $\|Cu\|_F$ in $u$ implies that $\sup_{u \in \mathfrak{D}} \|Cu\|_F = \sup_{u \in \mathrm{conv}(\mathfrak{D})} \|Cu\|_F$, and hence without loss of generality (wlog), we take $\mathfrak{D}$ to be convex as needed.

It is illustrative to consider some specific $\mathfrak{D}$s for scalar per-step contributions with $\zeta = d = 1$. Single-participation corresponds to $\mathfrak{D} = \mathrm{conv}\{\alpha e_i \mid \alpha \in [-1, 1],\ i \in [n]\}$, where $e_i$ for $i \in [n]$ are the standard basis vectors. Noting $\|Cu\| = \|{-Cu}\|$ and convexity of $\|Cu\|$, we see the maximum will be achieved at some $e_i$, recovering the 'max-$\ell_2$-norm-over-columns' measurement of sensitivity of Li et al. (2015, Proposition 3). Every-step participation corresponds to the $\ell_\infty$ ball, $\mathfrak{D} = \{x \mid \|x\|_\infty \le 1\}$.

Conditions allowing the reduction to per-iterate scalar contributions

In ML, examples are used to calculate gradients of $d > 1$ dimensions, and so we wish to consider $x \in \mathbb{R}^{n \times d}$, with rows $x_i \in \mathbb{R}^d$ corresponding to the sum of gradients for examples participating in step $i$. In order to compute sensitivity, one may hope that the sensitivity for each $x_i \in \mathbb{R}^d$ can be bounded by only considering some appropriately worst-case $x_i \in \mathbb{R}$.
More formally, consider a fixed participation schema $\Pi$, and further assume (wlog) $\zeta = 1$. Then, for vector-valued contributions we have

$$\mathfrak{D}^d_\Pi = \mathrm{conv}\left\{ G \in \mathbb{R}^{n \times d} \;\middle|\; \exists \pi \in \Pi \text{ s.t. } \|G_{[i,:]}\|_2 \le 1 \text{ for } i \in \pi \text{ and } G_{[i,:]} = 0 \text{ for } i \notin \pi \right\}.$$

In the $d = 1$ case, we have a much simpler polytope, $\mathfrak{D}^1_\Pi = \mathrm{conv}(\mathcal{D}^1_\Pi)$ where

$$\mathcal{D}^1_\Pi = \bigcup_{\pi \in \Pi} \left\{ u \in \mathbb{R}^n \;\middle|\; u_i \in \{-1, 1\} \text{ if } i \in \pi,\ 0 \text{ otherwise} \right\}.$$

One might hope to show $\mathrm{sens}_{\mathfrak{D}^d_\Pi}(C) \le \mathrm{sens}_{\mathfrak{D}^1_\Pi}(C)$, and the authors in fact initially conjectured this to be true. To our surprise, while this inequality holds under a variety of assumptions, it does not hold in general (Appendix H.2 gives a counterexample). Empirically we have observed that for various query matrices $A$ and $(k, b)$-participation with $d = 1$, the optimal $C$ satisfy (or almost satisfy) the condition $C^\top C \ge 0$ (element-wise non-negativity). In this case, we can show:

Corollary 2.1. When per-step contributions are bounded by $\zeta = 1$, for any participation schema $\Pi$ and dimensionality $d \ge 1$, when $C^\top C \ge 0$ elementwise, we have $\mathrm{sens}_{\mathfrak{D}^d_\Pi}(C) = \mathrm{sens}_{\mathfrak{D}^1_\Pi}(C)$.

In particular, this implies that if $C$ is optimal in the $d = 1$ case and satisfies $C^\top C \ge 0$, it is also optimal in the $d > 1$ case. This result is a corollary of Theorem H.1, which establishes additional conditions under which $\mathrm{sens}_{\mathfrak{D}^d_\Pi}(C) \le \mathrm{sens}_{\mathfrak{D}^1_\Pi}(C)$ holds. All proofs are in Appendix H onwards.

Difficulty of computing sens(C)

In general, computing $\mathrm{sens}(C)$ is a convex quadratic maximization problem over a convex set, which can be NP-hard. Even the simple case of computing the sensitivity for an arbitrary matrix $C$ under every-step participation with scalar ($d = 1$) contributions is NP-hard: it is exactly the problem of computing the $\ell_\infty \to \ell_2$ operator norm (Tropp, 2004). (In fact, it is useful to observe that $\mathrm{sens}_{\mathfrak{D}}(\cdot)$ can always be viewed as an operator norm; see Appendix B.) This hardness is in stark contrast to the single-participation setting, where calculating sensitivity is trivial.
However, we can in some cases compute sensitivity exactly by brute force. Take $d = 1$. Observe that $\mathcal{D}^1_\Pi$ is a finite set, and so a direct calculation using Eq. (2) is often possible. But $|\mathcal{D}^1_\Pi| = |\Pi| 2^k$, and observing the symmetry $\|Cu\| = \|C(-u)\|$ can reduce the computational cost only by half. In general $|\Pi|$ may be exponential in $k$, but in the special case of $(k, b)$-participation, we have $|\Pi| = b$ (the number of steps in one epoch). Hence, for modest numbers of epochs $k$, directly computing sensitivity is possible; e.g., in our StackOverflow experiments in Section 5.2, we can reduce $u \in \mathcal{D}^1_\Pi$ to only $342 \cdot 2^5 = 10{,}944$ vectors. Theorem H.1 can be used to translate bounds from scalar to higher dimensions, though without such a translation it is not clear how to generalize a brute-force method.

Computing sensitivity when $C^\top C \ge 0$

Let $X = C^\top C$. When $X$ has only nonnegative elements, one may reduce the problem of computing $\mathrm{sens}_{\mathfrak{D}^1_\Pi}(C)$ to

$$\mathrm{sens}_{\mathfrak{D}^d_\Pi}(C) = \max_{u \in \mathcal{D}^1_\Pi} \|Cu\|_F = \max_{u \in \mathcal{D}^1_\Pi} \sqrt{u^\top X u} = \max_{\pi \in \Pi} \sqrt{\mathbf{1}^\top X_{[\pi,\pi]} \mathbf{1}}, \tag{3}$$

where $X_{[\pi,\pi]} \in \mathbb{R}^{k \times k}$ is the submatrix of $X$ formed from the rows and columns selected by $\pi$, $|\pi| = k$, and $\mathbf{1} \in \mathbb{R}^k$. The first equality follows from Corollary 2.1, and then the max must be achieved by the maximum-magnitude nonnegative vector $u$, specifically $\mathbf{1}_k$. As noted above, the matrices we consider satisfy this property, and hence we can compute the exact sensitivity efficiently for $(k, b)$-participation.

Upper-bounding sensitivity

As an alternative to structural conditions on $C$ or $X$ allowing efficient exact computation of sensitivity for $d > 1$, we can look to (reasonably tight) upper bounds on the sensitivity of $C$. In the case of $(k, b)$-participation, one efficient method of computing upper bounds for the multiple-participation sensitivity of $C$ has shown itself to be particularly useful:

Theorem 2.1. Let $C \in \mathbb{R}^{n \times n}$ and let $\lambda = \max_{\pi \in \Pi} \|C_{[:,\pi]}\|_2$, the largest spectral norm of the columns of $C$ selected by any pattern $\pi$. Then $\mathrm{sens}_{\mathfrak{D}^1_\Pi}(C) \le \lambda \sqrt{k}$.

In the $(k, b)$-participation case, $|\Pi| = b$.
The complexity of computing the largest eigenvalue of the subselected $C$ matrix is cubic in $k$. Thus, computing this upper bound is of order $bk^3$, easily computable for the range of $k, b$ considered here ($k \le 100$, $b \le 500$).

Differential Privacy Guarantee

Using our generalization of adaptive streams to multiple participations we obtain the following result (a straightforward generalization of Denisov et al. (2022, Theorem 2.1)). The proof is identical to that in Denisov et al. (2022), except we replace the sensitivity bound with that for multiple participations obtained via Corollary 2.1.

Theorem 2.2. Let $A \in \mathbb{R}^{n \times n}$ be a lower-triangular full-rank query matrix, and let $A = BC$ be any factorization with the following property: for any two neighboring streams $x, \tilde{x} \in \mathbb{R}^{n \times d}$, we have $\|C(x - \tilde{x})\|_F \le \kappa$. Let $Z \sim \mathcal{N}(0, \kappa^2 \sigma^2)^{n \times d}$ with $\sigma$ large enough so that $\mathcal{M}(x) = Ax + BZ = B(Cx + Z)$ satisfies $(\varepsilon, \delta)$-DP (or $\rho$-zCDP or $\mu$-Gaussian DP) in the nonadaptive continual release model. Then, $\mathcal{M}$ satisfies the same DP guarantee (with the same parameters) even when the rows of the input are chosen adaptively.
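The two exact sensitivity computations discussed in this section (brute force over the sign vectors, and the shortcut of Eq. (3) when $C^\top C \ge 0$) can be compared on a toy instance. The following is a sketch under stated assumptions rather than the authors' code: it takes $d = 1$, $\zeta = 1$ and $n = kb$, enumerates only the maximal patterns of $(k, b)$-participation, and uses the prefix-sum matrix itself as a stand-in encoder $C$. All function names are hypothetical.

```python
from itertools import product
from math import sqrt

def kb_patterns(k, b):
    """Maximal (k, b)-participation patterns for n = k * b steps (1-indexed)."""
    return [tuple(start + j * b for j in range(k)) for start in range(1, b + 1)]

def num_sign_vectors(num_patterns, k):
    """Vectors to check after halving by the symmetry ||Cu|| = ||C(-u)||."""
    return num_patterns * 2 ** (k - 1)

def sens_brute_force(C, patterns):
    """max ||Cu||_2 over sign vectors supported on some pattern (first sign fixed)."""
    n = len(C[0])
    best = 0.0
    for pi in patterns:
        for tail in product((1.0, -1.0), repeat=len(pi) - 1):
            u = [0.0] * n
            for step, sign in zip(pi, (1.0,) + tail):
                u[step - 1] = sign
            Cu = [sum(c * x for c, x in zip(row, u)) for row in C]
            best = max(best, sqrt(sum(y * y for y in Cu)))
    return best

def sens_from_gram(C, patterns):
    """Eq. (3): max over patterns of sqrt(1^T X[pi, pi] 1), valid when X = C^T C >= 0."""
    n = len(C[0])
    X = [[sum(row[i] * row[j] for row in C) for j in range(n)] for i in range(n)]
    if any(X[i][j] < 0 for i in range(n) for j in range(n)):
        raise ValueError("shortcut requires C^T C >= 0 elementwise")
    return max(sqrt(sum(X[i - 1][j - 1] for i in pi for j in pi)) for pi in patterns)

k, b = 3, 2
n = k * b
# Lower-triangular all-ones (prefix-sum) matrix used as an illustrative encoder.
C = [[1.0 if col <= row else 0.0 for col in range(n)] for row in range(n)]
patterns = kb_patterns(k, b)
exact = sens_brute_force(C, patterns)
shortcut = sens_from_gram(C, patterns)
```

On this instance both routines agree, and `num_sign_vectors(342, 6)` reproduces the count of 10,944 vectors quoted for the StackOverflow setting. The shortcut touches only $b$ submatrices of size $k \times k$, which is why it scales to settings where the brute-force enumeration does not.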

3. OPTIMIZING MATRIX MECHANISMS FOR MULTIPLE EPOCHS

We now present methods for computing optimal matrix mechanisms that are specialized to a specific participation schema $\Pi$ and query matrix $A$. For example, Fig. 2 shows the optimal factorization for the query matrix $A$ representing SGD with momentum and cooldown under $(k{=}6, b{=}342)$-participation, used in Section 5.2. Specializing the mechanism to both the participation pattern and the specific query workload enables us to obtain state-of-the-art results in ML (Section 5). We follow the approach of Denisov et al. (2022), which showed empirical improvements over prior methods in single-epoch settings.

We begin by defining the loss of interest, i.e., the total variance of noise added, for the mechanism defined in Eq. (1). Note that this loss characterizes other downstream tasks like DP Mean Estimation. Given $\mathfrak{D}$, assume that we may represent $\mathfrak{D} = \mathrm{conv}(\mathcal{D})$ for some finite set $\mathcal{D}$; as we have seen, this is the case, e.g., in $(k, b)$-participation. Then the loss which corresponds to total squared error of a factorization, at a fixed privacy level, may be expressed as:

$$\mathcal{L}(B, C) = \mathrm{sens}^2_{\mathfrak{D}}(C)\, \|B\|_F^2 \quad \text{where} \quad \mathrm{sens}_{\mathfrak{D}}(C) = \sup_{u \in \mathfrak{D}} \|Cu\|_2 = \sup_{u \in \mathcal{D}} \|Cu\|_2. \tag{4}$$

Figure 2: The optimal factorization $A = BC$ under $(k{=}6, b{=}342)$-participation, constructed by solving the optimization problem Eq. (5). Matrix $A$ encodes SGD with momentum 0.95 and a learning-rate cooldown schedule for the last 25% of rounds, as used in our StackOverflow experiments (Section 5.2). The constraints on sensitivity imposed by this participation schema are evident in the resulting matrices. For example, the white diagonals with a period of $b = 342$ in $X = C^\top C$ show that the columns of $C$ that could correspond to a pair of rounds $(i, j)$ where the same user might participate are in fact orthogonal. See Fig. 13 in Appendix F.4 for a larger view.

Observing that for all nonzero $\alpha$ the factorization $A = (\alpha B)(\tfrac{1}{\alpha} C)$ has identical loss, we conclude that we may consider the constrained version of the problem of minimizing this loss where $\mathrm{sens}_{\mathfrak{D}}(C) \le 1$.
Since for any $C$, $B = AC^\dagger$ produces the minimum-Frobenius-norm $B$-matrix, it is sufficient to solve:

$$\min_{C} \mathcal{L}(AC^\dagger, C) = \min_{C :\, \mathrm{sens}^2_{\mathfrak{D}}(C) = 1} \|AC^\dagger\|_F^2. \tag{5}$$

With the change of variables $X = C^\top C$, equivalently:

$$\min_{X \text{ is PD},\ \mathrm{sens}^2_{\mathfrak{D}}(X) = 1} \mathrm{tr}(A^\top A X^{-1}) \quad \text{where} \quad \mathrm{sens}^2_{\mathfrak{D}}(X) = \sup_{u \in \mathcal{D}} u^\top X u. \tag{6}$$

Theorem 3.1. Let a finite $\mathcal{D} = \{u_i\}_{i=1}^k$ be given, and assume that the vectors $\{u_i\}_{i=1}^k$ span $\mathbb{R}^n$. Assume that $A$ is full-rank, and for $v \in \mathbb{R}^k$ define $H_v = [u_1, \dots, u_k]\, \mathrm{diag}(v)^{1/2}$ and $U = H_v H_v^\top$. Define the Lagrangian $L$ as

$$L(X, v) := \mathrm{tr}(A^\top A X^{-1}) + \sum_{u \in \mathcal{D}} v_u \left( u^\top X u - 1 \right).$$

Then, for Lagrange multipliers $v$ such that $U$ is full-rank, the minimizer $X(v)$ of $L$ for this fixed $v$ may be represented as

$$X(v) = U^{-1/2} \left( U^{1/2} A^\top A\, U^{1/2} \right)^{1/2} U^{-1/2},$$

and the Lagrange dual function $g$ for the problem Eq. (6) can be expressed in closed form in terms of the dual variables $v$:

$$g(v) := \inf_{X \text{ is PD}} L(X, v) = 2\, \mathrm{tr}\left( \left( U^{1/2} A^\top A\, U^{1/2} \right)^{1/2} \right) - \sum_{u \in \mathcal{D}} v_u. \tag{7}$$

Remark. The restriction that $v$ yields a full-rank $U$ serves to restrict to cases where the Lagrangian has a finite, positive-definite minimizer in the primal variable; if the vectors $\{u\}$ span $\mathbb{R}^n$, the problem Eq. (6) has a finite minimizer by Lemma I.1. Any setting of the dual variables $v$ corresponding to this minimizer is contained in a neighborhood uniformly satisfying this full-rank property, and so it is valid to differentiate our expression for $g$ with respect to such $v$ (as we will do in Appendix I.1).

Corollary 3.1. In the same setup as Theorem 3.1, the gradient of the dual function $g$ is:

$$\frac{\partial g}{\partial v_i} = u_i^\top U^{-1/2} \left( U^{1/2} A^\top A\, U^{1/2} \right)^{1/2} U^{-1/2} u_i - 1.$$

Moreover, a maximizer $v^\star$ of the dual must satisfy:

$$v^\star = \mathrm{diagpart}\left( \left( H_{v^\star}^\top A^\top A\, H_{v^\star} \right)^{1/2} \right). \tag{8}$$

The optimal value of the problem defined in Eq. (6) is $\mathrm{tr}(v^\star)$.

Remark. In the single-participation case of Denisov et al. (2022), $H_v = \mathrm{diag}(v)^{1/2}$, and Eq. (8) recovers the fixed point expression of that paper's Theorem 3.2. Our Corollary 3.1 implies that the optimization methods presented in Denisov et al.
(2022) may be applied, with suitable translation, to our setting; we use these methods to generate the optimal matrices studied empirically in Section 5.
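As a sanity check on the closed forms in Theorem 3.1, the following worked derivation verifies that the stated $X(v)$ is a stationary point of the Lagrangian and yields the stated dual value. It is a sketch consistent with the theorem statement, not the proof given in Appendix I. It assumes $v$ is entrywise nonnegative so that $U = H_v H_v^\top = \sum_{u} v_u\, u u^\top$, and the shorthand $M$ is introduced here only for readability.

```latex
% Sketch only. Assumptions: v >= 0 entrywise, U and A full-rank,
% and M := U^{1/2} A^T A U^{1/2} is shorthand introduced for this check.
\begin{align*}
L(X, v) &= \mathrm{tr}(A^\top A X^{-1}) + \sum_{u \in \mathcal{D}} v_u \,(u^\top X u - 1)
         = \mathrm{tr}(A^\top A X^{-1}) + \mathrm{tr}(U X) - \sum_{u \in \mathcal{D}} v_u, \\
\nabla_X L(X, v) &= -X^{-1} A^\top A X^{-1} + U = 0
  \quad\Longleftrightarrow\quad X U X = A^\top A.
\intertext{Substituting $X(v) = U^{-1/2} M^{1/2} U^{-1/2}$ with $M := U^{1/2} A^\top A\, U^{1/2}$:}
X(v)\, U\, X(v) &= U^{-1/2} M^{1/2} M^{1/2} U^{-1/2} = U^{-1/2} M\, U^{-1/2} = A^\top A, \\
\mathrm{tr}\big(U X(v)\big) &= \mathrm{tr}\big(U^{1/2} M^{1/2} U^{-1/2}\big) = \mathrm{tr}\big(M^{1/2}\big), \\
\mathrm{tr}\big(A^\top A\, X(v)^{-1}\big) &= \mathrm{tr}\big(A^\top A\, U^{1/2} M^{-1/2} U^{1/2}\big)
  = \mathrm{tr}\big(M M^{-1/2}\big) = \mathrm{tr}\big(M^{1/2}\big), \\
g(v) &= L\big(X(v), v\big) = 2\, \mathrm{tr}\big(M^{1/2}\big) - \sum_{u \in \mathcal{D}} v_u.
\end{align*}
```

Since $\mathrm{tr}(A^\top A X^{-1})$ is convex in $X$ over positive-definite matrices and the remaining terms are linear in $X$, this stationary point is the minimizer, which matches Eq. (7).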

4. FAST-FOURIER-TRANSFORM-BASED DP-PREFIX SUM ESTIMATION

Our work has two types of computation costs: optimization costs are those associated with optimizing and generating (or, computing) a mechanism, whereas noise generation costs are those associated with using the mechanism to sample noise for DP prefix sum release. Note that once optimized, a single mechanism can be reused indefinitely to generate noise for other runs by simply resampling new noise and applying the same decoder. The best known methods for computing the optimal factorizations scale as at least $O(n^3)$ (Yuan et al., 2016; Denisov et al., 2022). This optimization cost can become intractable when $n$ grows too large. Thus, in this section we focus on reducing optimization computation at a small decrease in the achievable privacy-utility tradeoff. A prime candidate for this goal is the Discrete Fourier Transform (DFT) because there are known algorithms both for nearly-optimal private convolutions (Fawaz et al., 2013), which are intimately related to the DFT, and for efficient calculation of the DFT using the Fast Fourier Transform (FFT) (Nussbaumer, 1981).

We present an FFT-based mechanism that reduces the noise generation costs. We then prove rigorous DP guarantees for it and show that these lead to near-optimal privacy-utility tradeoffs in the single-epoch setting. We provide two improvements over prior work: 1) extending the result to the multi-epoch and multi-dimensional setting and 2) providing explicit non-asymptotic analysis of the algorithm's utility.

Let $A$ represent the (Toeplitz) matrix of all 1s on or below the main diagonal and 0s elsewhere; i.e., the prefix-sum matrix. In this section, we perform our analysis in the Fourier domain. To release $Ax$, we define a circulant matrix $A_{\mathrm{circ}} \in \mathbb{R}^{2n \times 2n}$ with a corresponding input vector $x_{\mathrm{ext}} = \mathrm{concat}(x, 0_n)$, $0_n \in \{0\}^n$, so that the first $n$ entries of $A_{\mathrm{circ}} x_{\mathrm{ext}}$ are equal to $Ax$ (see Appendix J.1). Thus, we study the DP release of $A_{\mathrm{circ}} x_{\mathrm{ext}}$.
Note that, to be consistent with the literature on FFT, in this section and in Appendix J we index all vectors and matrices starting from zero. Theorem J.1 of Gray (2006) (restated in Appendix J.1) shows there exists a diagonal $\Sigma$ such that $A_{\mathrm{circ}} = F^* \Sigma F$, where $F$ is the DFT matrix. $A_{\mathrm{circ}}$ can then be factorized as $A_{\mathrm{circ}} = B_{\mathrm{circ}} C_{\mathrm{circ}}$ where $B_{\mathrm{circ}} = F^* \Sigma^{1/2}$ and $C_{\mathrm{circ}} = \Sigma^{1/2} F$. The (complex-valued) matrix mechanism specified by the factorization above (and presented as Algorithm 1 of Appendix C) is nearly optimal in the class of matrix-factorization-based mechanisms, as we show in Appendix J. Though we prove a simple zCDP guarantee for any participation schema having at most $k = \max_{\pi \in \Pi} |\pi|$ participations in Theorem 4.1, we can instead use our Corollary 2.1 when the participation schema is known in advance (as in our experiments with $(k, b)$-participation).

Theorem 4.1. Under $k$-participation, Algorithm 1 satisfies $(k^2 \rho)$-zCDP.

The optimal FFT decoder

Observe from Theorems 2.1 and 2.2 that the privacy guarantee of our multi-participation adaptive setting is independent of the choice of decoder, $B$. Thus, instead of taking $B_{\mathrm{circ}}$ above, we take the optimal decoder using the Moore-Penrose pseudoinverse of the encoder, i.e., $B_{\mathrm{circ}} = S C_{\mathrm{circ}}^\dagger$. We do so using an equivalent encoder $C \in \mathbb{R}^{n \times 2n}$ as shown in Appendix K. We find that this leads to significant improvements in the privacy-utility tradeoff (see Fig. 9 in Appendix E) at no additional computational overhead.

Computation Costs

Though we define the optimal FFT decoder as a pseudoinverse, observe that we do not need to optimize (or even compute) the decoder; by Appendix K, the problem is reduced to that of solving a highly structured linear system.
However, we find that even suboptimal implementations using the pseudoinverse can still factorize a mechanism for n = 10,000 in 146 minutes on a V100 GPU, remaining well within practical requirements since we need only generate a mechanism once, before it can be reused indefinitely for training. In contrast, computing optimal matrices becomes practically difficult near n ≈ 10,000, taking 24 hours to compute an effective factorization for n = 8,192 using batch-priority cloud CPU resources. We remark that this regime of n is highly practical, e.g., standard federated benchmarks use n ≈ 2000 (Reddi et al., 2020) and our central image classification experiments use n = 2000.

In terms of noise generation, the FFT mechanism shows preferable asymptotic properties, scaling as O(nd log 2 n). However, even for the optimal matrix mechanism, with runtime scaling as $O(dn^2)$, noise generation on a GPU (even with a significantly suboptimal implementation) takes negligible time. Further, noise from our mechanisms can be pre-generated if needed. We discuss these tradeoffs in Appendix C.
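The circulant embedding at the heart of this section can be checked numerically: zero-padding $x$ to length $2n$ and multiplying by the circulant matrix whose first column is $n$ ones followed by $n$ zeros reproduces the prefix sums in the first $n$ entries, and that multiplication diagonalizes under the DFT. The sketch below illustrates only this noiseless identity, not Algorithm 1 or its privacy noise. It uses a naive $O(m^2)$ DFT from the standard library for self-containment (a real implementation would use an FFT routine), and the function names are hypothetical.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform: unnormalized forward, 1/m-scaled inverse."""
    m = len(v)
    sign = 1 if inverse else -1
    out = []
    for r in range(m):
        total = sum(v[c] * cmath.exp(sign * 2j * cmath.pi * r * c / m) for c in range(m))
        out.append(total / m if inverse else total)
    return out

def prefix_sums_via_circulant(x):
    """First n entries of A_circ @ x_ext, computed in the Fourier domain."""
    n = len(x)
    first_col = [1.0] * n + [0.0] * n   # first column of the 2n x 2n circulant A_circ
    x_ext = list(x) + [0.0] * n         # x_ext = concat(x, 0_n)
    eigenvalues = dft(first_col)        # diagonal of Sigma, up to DFT normalization
    x_hat = dft(x_ext)
    y = dft([lam * xh for lam, xh in zip(eigenvalues, x_hat)], inverse=True)
    return [value.real for value in y[:n]]

result = prefix_sums_via_circulant([1.0, 2.0, 3.0, 4.0])
```

Up to floating-point error, `result` equals the running totals of the input. The zero padding is what prevents the wrap-around of the circular convolution from contaminating the first $n$ outputs.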

5. EMPIRICAL EVALUATION

We compare four main mechanism classes: tree-based mechanisms (Honaker, 2015), including 'tree-completion' of Kairouz et al. (2021); our FFT mechanism; our optimal factorizations; and DP-SGD with (though incorrect) amplification via Poisson subsampling (Abadi et al., 2016b). When computing optimal factorizations by solving Eq. (6), we may vary the sensitivity constraint set D to encode different participation schemas. We compute optimal factorizations for the (k, b)-participation setting. An encoder C factorized for one value of k can be applied for another, by simply computing its new sensitivity (due to our Section 2), though this will alter its privacy-utility tradeoffs. For example, our MF1, 6e in Fig. 3 uses this to extend Denisov et al. (2022) to k > 1.

When applying a mechanism that is determined independently of the number of participations (e.g., FFT or tree aggregation) or extending an optimal mechanism for k participations to a larger number of participations, the sensitivity may scale poorly in k. In such cases, reusing a single encoder multiple times over the course of n steps may actually yield lower sensitivity, and hence more favorable privacy-utility tradeoffs. We term this approach encoder stamping. This also provides a straightforward method for extending any factorization to handle more iterations without, e.g., re-optimizing Eq. (6). Combined with our Section 2, stamping lets us apply mechanisms from Denisov et al. (2022), e.g., MF(k=1, n=1000)×2 in Fig. 1; mechanisms with stamping have "×s" appended in this way. Discussion of stamping and its relation to existing literature is in Appendix D.4. The manner in which baselines from the extant literature map to this setting can be found in Appendix D.1. Since the matrix mechanism reduces privacy cost of training to that of the release of a single Gaussian mechanism, accounting in our case becomes quite simple; see Appendix D.2.
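Because the whole training run reduces to one Gaussian release, a basic privacy account needs only the noise multiplier σ of Theorem 2.2 (noise standard deviation divided by sensitivity). The sketch below uses two standard facts about zero-concentrated DP: a Gaussian mechanism with noise multiplier σ satisfies ρ-zCDP with ρ = 1/(2σ²), and ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP. This is an illustrative assumption about how such accounting can be done; the accountant actually used in Appendix D.2 may differ and may be tighter. Function names are hypothetical.

```python
from math import log, sqrt

def gaussian_zcdp(noise_multiplier):
    """rho for a single Gaussian release with std = noise_multiplier * sensitivity."""
    return 1.0 / (2.0 * noise_multiplier ** 2)

def zcdp_to_epsilon(rho, delta):
    """Standard (not the tightest) conversion from rho-zCDP to (epsilon, delta)-DP."""
    return rho + 2.0 * sqrt(rho * log(1.0 / delta))

# Hypothetical setting: noise multiplier 1.0 and delta = 1e-6.
rho = gaussian_zcdp(1.0)
epsilon = zcdp_to_epsilon(rho, delta=1e-6)
```

The sensitivity enters only through the noise scale, so a factorization with smaller multi-participation sensitivity achieves the same ρ with less absolute noise.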

5.1. EXAMPLE-LEVEL DP FOR AN IMAGE CLASSIFICATION TASK.

We train image classification models on CIFAR10 (Krizhevsky, 2009), which has become a de facto standard for comparing DP ML algorithms: it is sufficiently easy for existing DP algorithms to achieve nontrivial accuracy, but not so simple as to be unable to differentiate approaches. Details on our full setup are in Appendix E; generally, they match those of Kairouz et al. (2021). Notably, we make improvements on their Online Honaker-based approach by not just completing the tree with virtual steps, but also zeroing out noise from virtual steps as detailed in Appendix D.3. We find this led to significant improvements of around a few percentage points. For all matrix mechanisms except the Denisov et al. (2022) baseline and our Optimal MF(k = 20, dim = 2000) × 1, we optimize over the stamps s by the losses in Appendix D.5, which we find well match the ordering in ML accuracy. In contrast to Section 5.2, in this section we only compare factorizations of the prefix-sum matrix; we do not incorporate momentum or cooldown directly into the mechanisms, though we use both momentum and cooldown as postprocessing for the matrix-factorization-based mechanisms and report results for the best settings we find (with both). For DP-SGD, we report results for the best setting (no momentum, with cooldown). Details are in Appendix E.

Main results (Figure 1)

First, we see that the optimal factorization for the target (k, b) = (20, 100) setting outperforms all other mechanisms across (nearly) all privacy levels, only slightly underperforming DP-SGD with amplification at ε = 2. To the best of our knowledge, this represents the first empirical demonstration of an ML algorithm which is competitive with DP-SGD into this high-privacy regime, without any amplification by sampling. The FFT (optimal decoder) mechanism outperforms all baselines, again well toward the high-privacy regime at ε ≈ 4.
Though at a worse privacy-utility tradeoff compared with our multi-epoch optimal matrices, this mechanism shows promise for outperforming prior work when n grows too large for generating optimal factorizations.

5.2. USER-LEVEL DP FOR A NEXT WORD PREDICTION TASK

User-level DP for language models is an important real-world task (McMahan & Thakurta, 2022). There is much history in the DP language modelling literature, which we briefly describe in Appendix F.1. Here, we use the standard benchmark: StackOverflow next-word prediction (Reddi et al., 2020). Our experimental setup fixes the same model and hyperparameters as Kairouz et al. (2021) and Denisov et al. (2022), except for the notable changes below. Details are in Appendix F.2.

Figure 3: Our MF6, 6e achieves within 2% relative difference in performance from the non-private baseline at ε = 8.8, δ = 10^{-6}. Our MF1, 6e, though worse, outperforms all baselines from the literature. Select runs were replicated multiple times, with bootstrap 95% intervals shown. DP-GDM, 2052e is the extreme case of every-round participation (342× more computationally expensive than 6 epochs of training), and hence generally infeasible. The server learning rate η_s was optimized over 0.5 and 1.0, and additionally 0.25 for ε = 2. The blue bands give the non-private baseline accuracy for 6 epochs of training with η_s = 1.0 and 0.5.

Notable changes from prior work

We use a much higher 1000 clients per round (≈ 100 in prior work), enabled mainly by our multi-epoch factorizations. We also zero out large updates with ℓ∞ norm greater than 100 (rather than scaling them down to our clipping norm of ζ = 1), as we found this improved the stability of noisy training (see Table 2 in Appendix F.3). This may have enabled more consistent success of a higher server learning rate η_s = 1.0. We conducted initial simulations which verified that two observations from Denisov et al. (2022) for the single-epoch setting extend to our multi-epoch and large-batch (1000 clients/round instead of 167) setting. First, linear server learning rate cooldown from 1× to 0.05× over the final 512 rounds offers a small improvement over constant rates (more so in the higher-privacy regime). Second, optimizing a factorization with momentum and this cooldown schedule, rather than applying both as post-processing, consistently offers a small benefit. See Appendix F.2 for details. Thus, for our primary investigation we fix these preferable design choices and compare the following algorithms.

Algorithms

All algorithms train for 6 epochs (2052 rounds) with large batches of 1000 clients/round (better for DP training) unless otherwise noted. Honaker, 6e is the DP-FTRL algorithm of Kairouz et al. (2021), trained for 2048 rounds (a power of 2). MF1, 1e (Denisov et al., 2022) is the state-of-the-art for single-epoch training, using 167 clients/round and k=1. MF1, 6e uses our Eq. (3) to take the (non-negative) (k=1)-optimized matrix of the previous approach and compute the sensitivity under (k=6, b=342)-participation, allowing us to train for 6 epochs (2048 rounds) with large batches. MF6, 6e is our approach directly optimizing the matrix factorization for (k=6, b=342)-participation via Eq. (6). DP-SGDM, 6e is the DP-FedAvg algorithm of McMahan et al. (2018), run for 2052 rounds and accounted with Poisson sampling, which is an incorrect, though standard, accounting computation, as noted in Section 1. DP-GDM, 2052e is full-batch gradient descent for 2052 rounds and 2052 epochs, or 342× the computation cost of our 6 epoch runs. We compute the exact privacy cost for this approach, and estimate the accuracy from experiments with 1000 clients per round for computational efficiency, following the methodology of Kairouz et al. (2021). This is essentially an upper bound on the best privacy-accuracy tradeoffs with unlimited computational resources.

Main results (Figure 3)

We find that our MF6, 6e is the best feasible private result at 25.25% accuracy and (17.7, 10^{-6})-DP. This exceeds the non-private baseline of 25.2% accuracy reported by Kairouz et al. (2021), and is within the margins of small hyperparameter tuning differences of our improved non-private baseline (25.43%) and private full-batch gradient descent (25.31%). At (8.84, 10^{-6})-DP, MF6, 6e achieves 24.94% accuracy, substantially improving over the previous state-of-the-art at this privacy level given by MF1, 1e at 24.11% accuracy, and considerably improving on both the accuracy and privacy of Honaker, 6e (24.86% accuracy at (21.0, 10^{-6})-DP). In fact, we achieve better accuracy at ε = 8.84 than prior methods achieve at ε = 17.7. DP-SGDM, 6e is outperformed by MF6, 6e and MF1, 6e across all ε values evaluated. Fig. 12 in Appendix F shows results for each learning rate separately, with numeric results in Tables 4 and 5. Discussion and conclusions can be found in Appendix G.

A SUMMARY OF NOTATION AND TERMINOLOGY

The following table summarizes the notation used throughout the paper:

Symbol                     Meaning
n                          Number of steps of the streaming linear query (SGD steps or FL rounds).
d                          Dimension of per-step user contributions.
x_i ∈ R or R^d             Sum of per-example gradients (or per-user model updates) on step i.
x ∈ R^{n×d}                Stream of inputs x_i, equivalently the matrix with rows x_i (so x_i = x_[i,:]).
ζ                          Clipping norm that limits the size of per-example contributions to x_i.
π                          Participation pattern, the set of steps that an example could participate in.
A ∈ R^{n×n}                Lower-triangular linear query matrix to be factorized as A = BC.
Π
T ∈ R^{n×n}                T := A^⊤ A, for convenience.
λ_min(A), λ_max(A)         Smallest and largest eigenvalues of the real matrix A.
A^*                        Conjugate (Hermitian) transpose of A.
X^⋆                        A matrix X that is "optimal" in a context-dependent sense.
A^†                        Moore-Penrose pseudoinverse of the matrix A.
A_[i,j]                    The (i, j)th entry of the matrix A.
A_[i,:] and A_[:,j]        The ith row and jth column.
s                          Number of encoder C replications (stamps) into a block-diagonal matrix.
conv(S)                    Convex hull of the set S.
[n]                        = {1, . . . , n}.
‖X‖_F                      The Frobenius norm of a matrix X.

We utilize terminology from federated learning as well as standard centralized training, which generally map as follows:

Centralized                Federated

Eq. (2) shows that our generalized notion of sensitivity can be viewed directly as a particular operator norm. To see this, view $C : V_1 \to V_2$ as a linear operator from vector space $V_1$ to $V_2$. Then, with $\|\cdot\|_{(1)}$ the vector norm on $V_1$ and similarly $\|\cdot\|_{(2)}$ for $V_2$, an operator norm is defined as

$$\|C\|_{(1),(2)} = \max_{u \in V_1 :\, \|u\|_{(1)} \le 1} \|Cu\|_{(2)}.$$

Because we use the Gaussian mechanism and thus are interested in the $\ell_2$ sensitivity, $\|\cdot\|_{(2)} = \|\cdot\|_2$, and we define the norm

$$\|u\|_{(1)} = \|u\|_{D} := \inf\left\{ r > 0 : \tfrac{u}{r} \in D \right\},$$

the vector norm induced by D (the fact that D is a closed, convex, symmetric set ensures this is a norm). Note $u \in D \Leftrightarrow \|u\|_D \le 1$. Thus, we have $\mathrm{sens}_D(C) = \|C\|_{D,2}$.

C THE FFT MECHANISMS AND REDUCING COMPUTATION

$F_{[k,:]} = \frac{1}{\sqrt{2n}} \left[ \exp\left(-\frac{j 2\pi k a}{2n}\right) : a \in \{0, \ldots, 2n-1\} \right]$.
$(\Sigma, w) \leftarrow \left(\mathrm{diag}(v_{\mathrm{DFT}}),\ \text{standard complex Normal in } 2n \text{ dimensions}\right)$.
$(s, z) \leftarrow \left( \left[x_0,\ x_0 + x_1,\ \ldots,\ \sum_{a=0}^{n-1} x_a\right],\ \sqrt{\frac{\kappa^2 \|v_{\mathrm{DFT}}\|_1}{4 n \rho}}\; F^{*} \Sigma^{1/2} \cdot w \right)$.
Output $s + z.\mathrm{real}[0, \ldots, n-1]$.

We propose two FFT mechanisms. First, we propose the FFT mechanism described in Algorithm 1. This mechanism has the same computational complexity (no optimization costs and O(nd log n) noise generation) as the Honaker method used in Kairouz et al. (2021), but at a better privacy-utility tradeoff, as shown in Fig. 9. The privacy and utility analysis can be found in Appendix J. Second, the FFT Optimal Decoder (FFT Opt Dec) mechanism presented in Section 4 takes (a real-valued translation of) the encoder C from Algorithm 1 and uses the optimal decoder, defined in terms of the Moore-Penrose pseudoinverse of C.
Similarly to Algorithm 1, there is no need to construct a literal matrix to multiply by in the case of noise defined by the optimal decoder; the noise generation time of the mechanism, however, increases by a logarithmic factor to O(nd log^2 n) (as discussed in Appendix K). This complexity is still feasible for many steps (as we discuss below) and comes with significant utility benefits (see Fig. 1).

All the mechanisms we study scale as either O(n^2) or O(n · polylog(n)). For our n = 2000 step environments, and even far beyond to n ≈ 10,000, our algorithms can be efficiently realized on GPUs with runtime on the order of seconds per step (including computing and applying gradients and noise). The main challenge in these cases is storing the n^2 + nd coordinates in GPU memory (the former for the decoder matrix, the latter for the noise samples). Given that each of the d coordinates of noise can be sampled independently, this algorithm is straightforwardly parallelizable, and so work may be partitioned across many processors when needed. Noise could instead be pregenerated in an entirely separate process and stored on disk, to be loaded into memory row-by-row concurrently with training.

Outside of computation, both our FFT mechanisms take O(nd) space, as all noise for x ∈ R^{n×d} must be pre-generated. This is in contrast to all prior work, and even our optimal factorizations, which require only O(d) space to generate the noise at the current step. Again, we note that space is often much cheaper, so this is typically not the limiting factor.

D MECHANISMS UNDER CONSIDERATION: BASELINES, SUBTLETIES, AND LOSSES.

D.1 BASELINES

Kairouz et al. (2021) and Denisov et al. (2022) both present approaches for ML model training which can be understood as instances of the matrix mechanism: the former grounded in the binary-tree mechanism as refined by Honaker (2015), and the latter explicitly optimizing a factorization under single participation. These two works yield two natural baselines:

• Kairouz et al. (2021) explores 'tree restarts' (generalized as our notion of 'stamps', s, in Section 5) and the so-called 'tree-completion trick' for the Honaker estimator-from-below variant of the binary tree method for computing differentially private prefix sums; the matrix-factorization perspective on this estimator allows us to implement slightly optimized versions of these methods; see Appendices D.3 and D.4.

• Denisov et al. (2022) computes optimal factorizations of various optimization-related matrices, though only for a single epoch. For these matrices, we leverage the results in Section 2 to directly compute the sensitivity of the encoder matrices for multiple participations.

These two papers can be combined in other ways as well; e.g., the results of Denisov et al. (2022) show that the 'fully efficient estimator' of Honaker (2015) may be used as a drop-in replacement for the estimator from below in Kairouz et al. (2021). We focus on the two mechanisms specified above as the natural baselines for the present work.

D.2 PRIVACY ACCOUNTING

The matrix mechanism of Eq. (1) conceptually adds isotropic Gaussian noise in some encoded space. In our case, we encode a matrix of gradients (clipped to ℓ2 norm ζ) computed over the course of training, denoted by G, with the matrix C, and add Gaussian noise to each entry of the matrix CG. Under the assumption that the matrix factorization has been appropriately scaled so that C has sensitivity 1, this Gaussian noise will have standard deviation σ = ζz in each coordinate, where z is the 'noise multiplier' parameter determining the privacy level of the mechanism (see Table 3 for examples). Privacy costs are computed as a single application of the Gaussian mechanism to CG using the PLDAccountant provided by the Google DP Library. We also use this accountant to analyze the DP-(S)GD baselines (which require more complex accounting), yielding a small improvement in ε over the Rényi-DP accounting used in prior works.
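As a rough illustration of the noise addition described above, the sketch below adds i.i.d. Gaussian noise with standard deviation σ = ζz to CG and then decodes with B. It is only a toy under stated assumptions: the encoder is taken to be already scaled to sensitivity 1, the identity encoder and the function name `matrix_mechanism` are placeholders chosen for illustration, and no privacy accounting is performed.

```python
import numpy as np

def matrix_mechanism(G, B, C, clip_norm, noise_multiplier, rng):
    """Return B @ (C @ G + Z), with Z ~ N(0, (clip_norm * noise_multiplier)^2) i.i.d.

    Assumes C has sensitivity 1 and that contributions to the rows of G were
    clipped to L2 norm `clip_norm` before being summed.
    """
    sigma = clip_norm * noise_multiplier
    encoded = C @ G
    Z = rng.normal(0.0, sigma, size=encoded.shape)
    return B @ (encoded + Z)

rng = np.random.default_rng(0)
n, d = 6, 3
A = np.tril(np.ones((n, n)))        # prefix-sum query
C = np.eye(n)                       # placeholder encoder (sensitivity 1 when k = 1)
B = A @ np.linalg.pinv(C)           # decoder chosen so that B @ C = A
G = rng.normal(size=(n, d))         # stand-in for the stream of clipped gradients
noisy_prefix_sums = matrix_mechanism(G, B, C, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Because the noise enters only through the single release of CG + Z, the privacy cost is that of one Gaussian mechanism; everything after it (multiplication by B) is post-processing.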

D.3 IMPROVEMENTS TO 'TREE COMPLETION' BY REMOVING NOISE FROM VIRTUAL STEPS

The "tree completion" trick of Kairouz et al. (2021) is used on the last step of any restart (in our terminology, stamp) to reduce the noise added on this step. This is achieved by adding virtual steps (with 0 inputs) until the final step of that level in the tree, because this noise will be the lowest in that level. In this section, we show how to further improve on this trick and that our matrix mechanisms make analyzing such tricks easier. Our implementations of the binary-tree baselines (Honaker, 2015; Kairouz et al., 2021) utilize these improvements.

For the online Honaker estimate, Honaker (2015) obtains a DP estimate for the release node i ∈ [n] (representing the prefix sum until i) by summing the corresponding subtrees prior to this node. These are exactly the subtrees corresponding to the binary representation of this node (Honaker, 2015). Then, the variance required to release node i, with subtrees of height $0, 1, \ldots, l_i - 1$, is

$$\left( \sum_{j=0}^{l_i - 1} c_j^2 \cdot 2^j \right) \cdot \sigma^2 = \frac{1}{2 \cdot \left(1 - 2^{-l_i}\right)} \cdot \sigma^2, \qquad \text{where } c_j = \frac{1/2^j}{\sum_{j'=0}^{l_i - 1} \left(1/2^{j'}\right)}.$$

Notice that reaching a new height in the tree decreases the variance needed. In Kairouz et al. (2021), just before terminating on some non-power-of-two step $n < n'$, where $n' = 2^{\lceil \log_2(n) \rceil}$ is the next power of two, they run $n' - n$ virtual steps on zero gradients. This enables their mechanism to use the minimal noise of the power-of-two step for the final real step. However, notice that in both these cases, the methods assume that these virtual steps must be privatized. In fact, they need not be, because we know a priori that these steps are virtual, i.e., they do not correspond to real gradients. Thus, in our methods we account for this in our mechanism and reduce the noise of the final step accordingly. Importantly, this can be computed without altering the asymptotic runtime and storage complexity of the algorithm: on the last step, the contributions of the virtual steps to the power-of-two noise can be calculated using, e.g., a second binary tree, and removed. This leads to a significant benefit in the loss, as we observed in Table 1 in Appendix D.5. We believe this oversight of prior works showcases the power of our matrix mechanism approach.

Indeed, let $C_{\mathrm{tree}}$ be a matrix representing the (complete) binary tree used in the mechanisms of Honaker (2015) and Kairouz et al. (2021) with $2^{\lceil \log_2(n) \rceil}$ leaves; for the sake of concreteness, assume this is the matrix constructed in Appendix C of Denisov et al. (2022). Let $B_{\mathrm{Hon}}$ represent the Honaker estimator-from-below; in our language, the decoder used by Kairouz et al. (2021). The tree completion trick of (Kairouz et al., 2016) can be understood as follows. The matrix $B_{\mathrm{Hon}} C_{\mathrm{tree}}$ is of size $2^{\lceil \log_2(n) \rceil} \times 2^{\lceil \log_2(n) \rceil}$. In the case that n is not a power of two, the penultimate rows and columns of this matrix will go unused. However, for this factorization, the variance added by $B_{\mathrm{Hon}}$ on the final row will be quite small, due to the binary tree's redundancy in encoding estimates of this sum. The matrix we wish to factorize is a prefix-sum matrix S of size n × n; this matrix can be expressed as any one of a family of transformations of the (potentially larger) product $B_{\mathrm{Hon}} C_{\mathrm{tree}}$:

$$S = P_j B_{\mathrm{Hon}} C_{\mathrm{tree}} E,$$

where E embeds an n-dimensional vector into $2^{\lceil \log_2(n) \rceil}$ dimensions by padding with zeros, and $P_j$ projects back down to n dimensions in a similarly axis-aligned way, taking the first n − 1 rows of its right-hand matrix argument, and only one, but any, of the jth rows for $n \le j \le 2^{\lceil \log_2(n) \rceil}$. To minimize the Frobenius norm of the constructed decoder in the factorization of S, we may simply pick the row with the lowest ℓ2 norm; in the case of $B_{\mathrm{Hon}}$, this is the final row.

One more optimization becomes clear when tree completion is formulated in this manner. Similar to our optimal decoder of Section 4, any decoder can be used without changing the sensitivity of the encoder. Noting that nonzero entries in the decoder increase our loss of Eq. (4), we can simply zero out the columns of the decoder corresponding to these virtual steps. This decreases the loss, preserves the same error in the DP estimate of the prefix sum (the inputs are 0), and maintains the same DP guarantee. In other words, we need not account for the noise of virtual steps, or the error it introduces.

We now provide a more rigorous explanation. The image of $C_{\mathrm{tree}} E$ can be contained in an axis-aligned subspace; effectively, the subspace corresponding to elements that may be nonzero in the binary tree when run for n steps. In other words, the columns of the decoder corresponding to rows of the encoder that are removed via the projection need not be included, because the corresponding inputs are never processed. Therefore, denoting the projection onto this subspace by Π, we may write:

$$S = P_j B_{\mathrm{Hon}} \Pi C_{\mathrm{tree}} E = \left(P_j B_{\mathrm{Hon}} \Pi\right) \left(\Pi C_{\mathrm{tree}} E\right),$$

further reducing the variance of the decoder (without increasing the sensitivity of the encoder) in this incomplete binary tree case.

In our implementations of the online Honaker mechanism, we freely use these tricks, in addition to exact calculations of the sensitivity of the mechanism enabled by noting that the encoder is all-nonnegative and by the observations of Section 2, leading to some improvements in these mechanisms over those in the existing literature. No changes to accounting are required, as privacy is inherited from the matrix mechanism perspective.
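The zeroing argument can be checked numerically on a small tree. The sketch below is illustrative only: the helper `binary_tree_encoder` is a hypothetical construction (not necessarily the matrix of Denisov et al. (2022), Appendix C), the decoder is a pseudoinverse-based stand-in rather than the Honaker estimator-from-below, and the projection simply keeps the first n rows.

```python
import numpy as np

def binary_tree_encoder(num_leaves):
    """Rows are indicator vectors of the leaves under each node of a complete binary tree."""
    rows, width = [], 1
    while width <= num_leaves:
        for start in range(0, num_leaves, width):
            row = np.zeros(num_leaves)
            row[start:start + width] = 1.0
            rows.append(row)
        width *= 2
    return np.array(rows)

n = 5                                   # real steps (not a power of two)
N = 1 << (n - 1).bit_length()           # completed tree has 2^ceil(log2 n) leaves
C_tree = binary_tree_encoder(N)         # shape (2N - 1, N)
E = np.eye(N)[:, :n]                    # zero-pads an n-vector to N entries
S = np.tril(np.ones((n, n)))            # prefix-sum matrix to factorize

encoder = C_tree @ E
# Stand-in decoder (assumption): pseudoinverse-based, restricted to the first n rows.
B = (np.tril(np.ones((N, N))) @ np.linalg.pinv(C_tree))[:n]

# Tree nodes that only ever aggregate virtual (all-zero) inputs.
virtual_only = ~encoder.any(axis=1)
B_zeroed = B.copy()
B_zeroed[:, virtual_only] = 0.0         # drop the noise those nodes would contribute
```

Both `B @ encoder` and `B_zeroed @ encoder` reproduce S, since the zeroed decoder columns multiply all-zero encoder rows; the encoder (and hence its sensitivity) is untouched, while the Frobenius norm of the decoder cannot increase.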

D.4 STAMPING: REPEATED MECHANISMS IN THE MATRIX-FACTORIZATION SETTING

In stamping, we define a new encoder matrix as the Kronecker product of some given encoder C with an s × s identity matrix I, creating a new sn-dimensional linear DP query mechanism. Assuming C is of shape n × n, the resulting encoder is an sn × sn block-diagonal encoder matrix, formed by 'repeating' the matrix C along the diagonal. Kairouz et al. (2021) explored 'restarting' their binary-tree based prefix sum estimation mechanism, treating the number of restarts used as a hyperparameter, and treating the result of a 'completed' application of the binary tree as fixed. For linear operators A with constant columns below the diagonal, this approach may be used to construct a new factorization from an existing one. This constant-columns property, or a block-based variant thereof, is required to simply treat the output from a 'completed' application of the existing mechanism as fixed; for a general matrix A, there is no clear prefix property which can be leveraged for this purpose. Taking the matrix view, one may construct a similar encoder/decoder pair for the prefix-sum matrix by reusing the initial decoder B on the block-diagonal and fixing the columns to simply repeat the final row of this decoder below the block-diagonal; notice that the constant-column property of the prefix sum matrix guarantees that this construction appropriately factorizes A. The noise that the matrix mechanism thus constructed adds can be implemented as a 'restarted' tree mechanism; however, since we compute sensitivities exactly for decoders of this structure as in Section 2, the privacy properties of these mechanisms we construct to replicate the 'restarts' of (Kairouz et al., 2021) may not be identical to those presented there, where accounting is performed by composition. The matrix-mechanism perspective additionally allows one to apply the 'stamping' construction to any linear operator A (e.g. the momentum matrix), where a reuse of fixed previous outputs is not possible. 
A 'stamped' factorization of any A may be obtained, for example, by matrix pseudoinversion: letting $B = A (C \otimes I)^{\dagger}$. This defines a legitimate factorization of any A requiring only suitable non-degeneracy assumptions on C, and indeed represents the optimal decoder for the stamped encoder. The pseudoinverse-based construction can, even in the prefix-sum case, be quite different from a construction designed to replicate 'restarted' mechanisms. For example, if C is a matrix representation of the binary-tree encoder and I = 1 (that is, a single stamp), then the resulting decoder matrix represents the full, rather than online, Honaker decoder (Honaker, 2015); the validity of this mechanism in the adaptive streaming setting was only shown quite recently by Denisov et al. (2022). For comparability with existing literature (and to preserve potential for an efficient implementation), however, all of the tree-based mechanisms we explore in the main body have decoders which replicate the setting of composition. All other stamped mechanisms use the optimal decoder.

Optimizing over s

This matrix instantiation of stamping enables us to directly analyze and minimize (over s) the stamped mechanism's multi-epoch loss, Eq. (4), without running compute-intensive ML experiments. Indeed, we observe in Table 1 of Appendix D.5 that for mechanisms which were not explicitly optimized for the (k, b)-participation setting, there exist stamped mechanisms (s > 1) with much lower loss that correspondingly lead to much more performant ML models (e.g., Figures 7 and 8 of Appendix D). Interestingly, with our capacity to measure mechanisms in a single shot in the multi-epoch setting, we see a similar trend as was observed for restarts in Kairouz et al. (2021): 'stamped' mechanisms have lower total loss than their non-stamped full-tree counterparts in the 20-epoch, 100 steps per epoch setting (see Table 1); training performance of these 'stamped' mechanisms on CIFAR10 can be found in Fig. 7. However, our approach offers a significant improvement in that we can now directly tune this hyperparameter without needing to actually run ML training. This lets us reduce computation by only analyzing the loss of the generated matrices and then running the mechanism with the lowest loss.
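Tuning s from the loss alone can be sketched as follows. This is a toy under explicit assumptions: the loss is taken to be sens(C)^2 · ‖B‖_F^2 (our reading of Eq. (4), which is not reproduced in this appendix), the base encoder is a lower-triangular matrix of ones chosen only because C^T C is then non-negative, sensitivity uses the all-ones shortcut valid in that case, and all function names are hypothetical. `np.kron(np.eye(s), C)` is used because it yields the block-diagonal structure described above.

```python
import numpy as np

def stamp(C, s):
    """Block-diagonal encoder with s copies of C along the diagonal."""
    return np.kron(np.eye(s), C)

def sensitivity_nonneg(C, k, b):
    """max over (k, b)-participation patterns of ||C[:, pi] @ 1||_2 (needs C^T C >= 0)."""
    X = C.T @ C
    n = C.shape[1]
    patterns = (np.arange(i, n, b)[:k] for i in range(b))
    return max(np.sqrt(X[np.ix_(pi, pi)].sum()) for pi in patterns)

def stamped_loss(A, C_base, s, k, b):
    """Assumed loss sens(C)^2 * ||B||_F^2 with the pseudoinverse decoder B = A C^+."""
    C = stamp(C_base, s)
    B = A @ np.linalg.pinv(C)
    assert np.allclose(B @ C, A), "stamped encoder must have full column rank"
    return sensitivity_nonneg(C, k, b) ** 2 * np.linalg.norm(B, "fro") ** 2

k, b = 4, 4                      # toy setting: 4 epochs of 4 steps each
n = k * b
A = np.tril(np.ones((n, n)))     # prefix-sum matrix
losses = {}
for s in (1, 2, 4):
    C_base = np.tril(np.ones((n // s, n // s)))   # hypothetical base encoder
    losses[s] = stamped_loss(A, C_base, s, k, b)
best_s = min(losses, key=losses.get)
```

In this toy setting the loss decreases as s grows, mirroring the qualitative trend reported for mechanisms not optimized for multiple participations; no model training is needed to compare the candidates.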

D.5 FACTORIZATION LOSSES AND PER-ITERATE VARIANCE

As a first measure of the privacy-utility tradeoff, we compare the losses of each mechanism from Eq. (4) for factorizations of the matrices under consideration. We compare measured losses of several factorizations of the prefix-sum matrix for the (k, b) = (20, 100) setting of the CIFAR experiments in Table 1. We plot the per-iterate variance of the mechanisms in Fig. 1, along with several variations, in Fig. 4. In Fig. 5, we plot the per-iterate variance distribution of factorizations of the momentum and cooldown matrix (described in Section 5) at a fixed variance level, and with various privacies, computed for the (k, b) = (6, 342) participation setting.

Mechanism                                Loss      Computation
…                                        1.4e6
Optimal Decoder Honaker(n = 100)×20      1.8e6
MF(k = 20, n = 2000)                     6.5e5     O(n^3) Optimization + O(n^2) Noise Generation
MF(k = 10, n = 1000)×2                   8.8e5
MF(k = 5, n = 500)×4                     1.2e6
MF(k = 1, n = 100)×20                    2.5e6
MF(k = 1, n = 2000)                      1.6e6
MF(k = 1, n = 1000)×2                    1.37e6
MF(k = 1, n = 500)×4                     1.4e6
MF(k = 1, n = 400)×5                     1.5e6
MF(k = 1, n = 200)×10                    1.8e6
MF(k = 1, n = 100)×20                    2.5e6
FFT(n = 2000)                            2.3e6     O(n log n) Noise Generation
FFT(n = 1000)×2                          1.8e6
FFT(n = 400)×5                           1.6e6
FFT(n = 200)×10                          1.9e6
FFT(n = 100)×20                          2.5e6
FFT Optimal Decoder(n = 2000)            2.2e6     O(n log^2 n) Noise Generation
FFT Optimal Decoder(n = 1000)×2          1.5e6
FFT Optimal Decoder(n = 400)×5           1.1e6
FFT Optimal Decoder(n = 200)×10          1.2e6
FFT Optimal Decoder(n = 100)×20          1.7e6

Table 1: Loss for various prefix-sum factorizations, computed via Eq. (4), in the multiple-participation setting for 20 epochs with 100 steps per epoch. Lowest-loss mechanism in each class bolded. Note that '(Online) Honaker' corresponds to the restarted decoder. By evaluating the dual problem (Section 3), 6.53e5 represents a lower bound on the optimal loss; the optimal matrix factorization is within 0.2% of this optimal value. Though up to O(n^2) noise generation can be tolerated practically for large ML training runs, we find that the stamped FFT optimal decoder obtains the best privacy-utility tradeoffs while requiring only O(n log^2(n)) time. Sensitivity is calculated exhaustively with contributions constrained to +1 for all matrices except FFT ones, where sensitivity is calculated using Theorem 2.1.

Figure 6: Comparing the optimal decoder, i.e., OptDecoderHonaker, with the standard stamping decoder (including fixing the output of each block), i.e., Online Honaker, with the optimal factorization.

E DETAILS AND ADDITIONAL EXPERIMENTS FOR CIFAR10.

We train image-classification models using the CIFAR10 dataset as hosted in tensorflow-datasets, containing 50,000 training and 10,000 test examples. We evaluate and compute test accuracies on the entire test set, following the open-sourced code of Kairouz et al. (2021) . We reuse the network architecture, dataset processing and initialization strategies presented in Kairouz et al. (2021) ; in particular, the architecture we use can be found in their Table 2 (b).

Optimization setup and hyperparameters

We train all mechanisms for 20 epochs with a batch size of 500, yielding 100 steps per epoch and 2000 steps in total. After performing some small initial grid searches, we settled on using linear learning rate cooldown to 0.05× the initial learning rate over the last 500 steps of training. We found this consistently improved utility for all mechanisms and privacy levels. As mentioned in Section 5, for this 20-epoch training setup, we only compare factorizations of the prefix-sum matrix, and do not include any factorizations of matrices which incorporate momentum or learning rate cooldown directly in the mechanism itself (Denisov et al., 2022). We sweep over learning rates with values (1 × 10^i, 2 × 10^i, 5 × 10^i) for i in {-2, -1}; for all mechanisms and noise levels, optimal values were in the interior of this sweep. We sweep over momentum values of 0, 0.85, 0.9, and 0.95, and find that nonzero momentum works best for all matrix mechanisms, while no momentum works best for DP-SGD at our scale, as found previously by Kairouz et al. (2021). For Honaker and FFT-based factorizations, there is no known a priori way to choose the optimal number of stamps s for a given (k, b) setting. Therefore we treat the value s as a hyperparameter and sweep across it, for s ∈ {1, 2, 5, 10, 20}. As can be seen in Table 1, the optimal s for both of these factorizations was in the interior of this sweep. As shown, e.g., in Fig. 7, the training-time performance of these mechanisms matched the expected order for computed loss. This value s represents an extra hyperparameter which must be set for the Honaker and FFT mechanisms; to the best of our knowledge, computing the loss for various instantiations of these mechanisms via Eq. (4) represents the only known method for setting this parameter other than simply training models. We also apply our sensitivity analysis of Section 2 to the matrices of Denisov et al. (2022), which are optimized for k = 1. In doing so, we also optimize the number of stamps. We report the best results as identified by the losses in Table 1.
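The linear cooldown described above (to 0.05× the initial learning rate over the last 500 of 2000 steps) can be written as a small schedule function. This is a sketch under one assumed convention, namely that the multiplier reaches exactly 0.05 on the final step; the function name and the base learning rate of 0.1 are illustrative only.

```python
def cooldown_multiplier(step, total_steps=2000, cooldown_steps=500, final_fraction=0.05):
    """Learning-rate multiplier at 0-based `step`: 1.0, then linear decay to final_fraction."""
    start = total_steps - cooldown_steps
    if step < start:
        return 1.0
    # Fraction of the cooldown completed; equals 1.0 on the last step (assumed convention).
    progress = (step - start + 1) / cooldown_steps
    return 1.0 - (1.0 - final_fraction) * progress

base_learning_rate = 0.1   # illustrative value from the sweep range
schedule = [base_learning_rate * cooldown_multiplier(t) for t in range(2000)]
```

The same function with `total_steps=2052` and `cooldown_steps=512` would correspond to the StackOverflow setting of Section 5.2, again under the assumed endpoint convention.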

F ADDITIONAL STACKOVERFLOW DETAILS

F.1 PRIVACY AND LANGUAGE MODELLING

Language models trained on user data are an important real-world application of DP training, as these models can memorize their training data if appropriate mitigations are not applied (Carlini et al., 2019; Song & Shmatikov, 2019; Carlini et al., 2021). Since one user might contribute thousands of tokens (training examples) to a dataset, it is particularly important to consider user-level guarantees (McMahan et al., 2018). Building on the approach of Kairouz et al. (2021), Google recently announced the first-ever launch of a language model trained on user data with a formal user-level DP guarantee (ρ = 0.81 zCDP), further demonstrating the importance of this application (McMahan & Thakurta, 2022).

F.2 HYPERPARAMETER TUNING AND INITIAL EXPERIMENTS

All runs use server momentum 0.95 and a learning-rate cooldown schedule for the last 25% of rounds. Zeroing outlier updates and using 1000 clients/round (6 epoch runs) allows the use of the higher server learning rates. MF1, 1e replicates the result of single-epoch training from Denisov et al. (2022); note that with 167 clients/round and this mechanism, the higher learning rate does not appear to help.

Fig. 10 gives preliminary experimental results which informed the main experiments used in the paper. Note that the y-axis range (test set accuracy) is highly compressed, and so the primary point of comparison is on epsilons. For example, Denisov et al. (2022) shows that cross-run variation of 0.002 or more is typical. The 6 horizontal lines give test-set accuracy for various non-private training mechanisms. The "Unnoised MF" runs correspond to the same code path used for privacy, but without any noise addition. In particular, these use momentum with learning rate cooldown; the other unnoised runs use a standard FL implementation with momentum but a fixed learning rate schedule; "cpr=167" corresponds to one epoch of training (167 clients/round), and "cpr=50" is 50 clients/round (only about 1/3 of an epoch). This last non-private baseline uses the best hyperparameters for FedAvgM from Reddi et al. (2020). The two "Unnoised MF" runs with accuracies between 0.246 and 0.248 are functionally identical, and the line near 0.248 accuracy is the same except it does not use learning-rate cooldown. Thus, for the given learning rates, we see the higher-epsilon private runs add sufficiently small noise that the accuracy is essentially equivalent to unnoised baselines with the same hyperparameters. However, using larger learning rates can achieve accuracy over 25%, even with privacy as in the case of the MF-6-6 run, hence motivating the inclusion of larger learning rates in the main experiments.
The MF (Matrix Factorization) runs with "prefix" in the name correspond to computing an optimal factorization of the prefix-sum matrix (lower triangular matrix of ones) and then applying momentum (and possibly learning rate cooldown) as post-processing. The other MF runs directly factor the momentum or momentum+cooldown matrix.

F.3 IMPACT OF ZEROING-OUT LARGE-NORM UPDATES

We observed that zeroing out updates with an ℓ∞ norm greater than 100ζ = 100 greatly stabilized training, allowing larger learning rates, particularly for MF6, 6e. We conducted ablation experiments where we turned off this zeroing, which produced a large fraction of unconverged runs, as detailed in Table 2. The number of updates zeroed increases significantly with larger amounts of noise and larger learning rates, as shown in Fig. 11.

… 3, showing results for server learning rates η_s = 0.5 and 1.0, as well as 0.25 when ε = 2. All algorithms use momentum 0.95 and a client learning rate of 1.0. For the 1 and 6 epoch runs, we observe MF generally tolerates larger learning rates, though lower learning rates perform better for all algorithms at ε = 2.
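The zeroing rule can be sketched as a per-update preprocessing step. This is an illustrative reading of the description above, not the experiment code: updates whose ℓ∞ norm exceeds 100ζ are replaced by zeros, and all other updates are clipped to ℓ2 norm ζ in the usual way. The function name and defaults are hypothetical.

```python
import numpy as np

def preprocess_update(update, clip_norm=1.0, zero_threshold=100.0):
    """Zero out outlier updates; otherwise scale down to L2 norm at most `clip_norm`."""
    if np.max(np.abs(update)) > zero_threshold * clip_norm:
        # Outlier: contribute nothing instead of a rescaled (possibly unstable) direction.
        return np.zeros_like(update)
    l2 = np.linalg.norm(update)
    if l2 == 0.0:
        return update
    return update * min(1.0, clip_norm / l2)

outlier = preprocess_update(np.array([200.0, 0.5]))    # zeroed
clipped = preprocess_update(np.array([3.0, 4.0]))      # scaled to unit L2 norm
small = preprocess_update(np.array([0.3, 0.4]))        # left unchanged
```

A zeroed update still has ℓ2 norm at most ζ, so the sensitivity bound used for privacy is unaffected by this choice.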

F.4 COMPLETE RESULTS

In this section we give additional details on our main grid of experiments. Fig. 12 uses the same data as Fig. 3, but shows results for each learning rate individually. Tables 4 and 5 give the mean, minimum, maximum, and standard deviation of test-set accuracy corresponding to Fig. 12, as well as the number of replicated experiments ('count'). Table 3 gives the noise multipliers to achieve our various privacy targets ε. Due to a change in the accountant used, we have slightly different ε targets around 8.8 for the different methods. Note the noise multipliers here are incomparable between the MF and (S)GD mechanisms in terms of the total noise introduced. For matrix factorization, we sample Z ∼ N(0, ζ^2 z^2) for noise multiplier z; this noise is applied after mapping the raw gradients/updates x through the linear map C, which is normalized so the total sensitivity is ζ. For the (S)GD mechanisms, we add noise N(0, ζ^2 z^2) independently to each model update. In all cases, noise is added to the sum of per-user updates, so the effective noise in the average update scales down with the number of clients per round.

Some runs were conducted with shuffling between epochs, and some were conducted using a fixed order for all epochs (as required by our DP analysis). We saw no impact of reshuffling on the final test set accuracy, and so include all runs in these results regardless of the shuffling setting.

Our work significantly improves the privacy-utility tradeoffs in DP ML. Indeed, our work outperforms the state-of-the-art (and DP-SGD with amplification) by ≈ 5 percentage points across many privacy levels, and to as low as ε ≈ 2, with practically implementable assumptions. We remark that we compare our mechanisms with DP-SGD on level ground using well-performing but not state-of-the-art models and training protocols; e.g., very large models, augmentations prior to clipping (De et al., 2022), public data usage, and large batch sizes can all aid training.
Many, if not all, of these techniques are applicable to our setting and can thus be combined with our mechanisms to realize additional absolute performance gains, likely beyond the performance achieved by DP-SGD. The major limitation of our approach is the computation required to generate the optimal matrices. Though our optimal FFT decoder bridges the gap between the mechanisms without optimizer costs and our optimal mechanism, it still leaves some room for improvement in the privacy-utility tradeoff. We believe this is an important area for future work.

H ANALYSIS FOR SECTION 2

1. We have $k = 1$ or $k = 2$ participations.
2. All the entries of $C^\top C$ are non-negative.
3. The rows of $G$ are all co-linear, $G_{[i,:]} = u_i \cdot G_{[1,:]}$ for $u_i \in \{-1, 1\}$, $i > 1$.
4. The rows of $G$ are all orthogonal, $\langle G_{[i,:]}, G_{[j,:]} \rangle = 0$, $\forall i \ne j$, $i, j \in [k]$.

Then,

$$\|CG\|_F \le 1. \tag{10}$$

Furthermore, the following statements also hold without assuming conditions (1)-(4) above.
• $\|CG\|_F = O(\log k)$.
• If we replace the condition on $G$ with $\forall i \in [k]$, $\|G_{[i,:]}\|_1 \le 1$, then $\|CG\|_F \le 1$.

Note that Theorem H.1 is generally applied to $C_{[:,\pi]}$, the sub-matrix of some $C \in \mathbb{R}^{n \times n}$ formed by keeping only the columns selected by a particular participation pattern $\pi$.

Proof of Theorem H.1. Let $C = [c_1\ c_2\ \cdots\ c_k]$ with each $c_i \in \mathbb{R}^n$ a column vector, and write $g_{[i,j]}$ for the $(i, j)$-th entry of $G$. It will also be useful to note

$$\|CG\|_F^2 = \operatorname{tr}\big(CG(CG)^\top\big) = \operatorname{tr}\big(C^\top C\, GG^\top\big). \tag{11}$$

In the following we prove each of the individual cases of Theorem H.1.

When $k = 1$ or $k = 2$: For $k = 1$, we have

$$\|CG\|_F^2 = \|c_1\|_2^2 \Big(\sum_{j=1}^{d} g_{[1,j]}^2\Big) \le \|c_1\|_2^2 = \max_{u \in \{\pm 1\}} \|u \cdot c_1\|_2^2.$$

An equivalent argument is used in Denisov et al. (2022, Thm. 3.1).
For $k = 2$, we have the following:

$$
\begin{aligned}
\|CG\|_F^2 &= \sum_{i=1}^{2} \|c_i\|_2^2 \Big(\sum_{j=1}^{d} g_{[i,j]}^2\Big) + 2 g_{[1,1]} g_{[2,1]} \langle c_1, c_2 \rangle + \cdots + 2 g_{[1,d]} g_{[2,d]} \langle c_1, c_2 \rangle \\
&\le \sum_{i=1}^{2} \|c_i\|_2^2 + 2 |g_{[1,1]}|\,|g_{[2,1]}|\,|\langle c_1, c_2 \rangle| + \cdots + 2 |g_{[1,d]}|\,|g_{[2,d]}|\,|\langle c_1, c_2 \rangle| \\
&\le \sum_{i=1}^{2} \|c_i\|_2^2 + \big(g_{[1,1]}^2 + g_{[2,1]}^2\big) |\langle c_1, c_2 \rangle| + \cdots + \big(g_{[1,d]}^2 + g_{[2,d]}^2\big) |\langle c_1, c_2 \rangle| \qquad (12) \\
&\le \|c_1\|_2^2 + \|c_2\|_2^2 + 2 |\langle c_1, c_2 \rangle| = \max\big\{\|c_1 + c_2\|_2^2,\ \|c_1 - c_2\|_2^2\big\} \\
&\le \max_{u \in \{\pm 1\}^2} \|Cu\|_2^2 \le 1,
\end{aligned}
$$

where Eq. (12) follows from the standard A.M. ≥ G.M. inequality, and the step after it uses that each row of $G$ has $\ell_2$ norm at most one.

All the entries of $C^\top C$ are non-negative: Let $X = C^\top C$ and $\hat{G} = GG^\top$. Observe $\hat{G}_{[i,j]} \in [-1, 1]$, and using Eq. (11), when $X$ is elementwise non-negative, $\operatorname{tr}(X\hat{G})$ is maximized when $\hat{G} = \mathbf{1}_{k \times k} = uu^\top$, by choosing $u = \mathbf{1}_k$. Hence,

$$\|CG\|_F^2 \le \operatorname{tr}\big(C^\top C\, uu^\top\big) = \|Cu\|_2^2 \le \max_{u \in \{\pm 1\}^k} \|Cu\|_2^2. \tag{13}$$

Eq. (13) completes the proof.

The rows of $G$ are co-linear: By the convexity of $\|CG\|_F^2$ with respect to the matrix $G$, we may assume the rows of $G$ have $\ell_2$ norm 1. Under the co-linearity assumption, this translates to $G_{[i,:]} = u_i G_{[1,:]}$, with each $u_i \in \{\pm 1\}$. Let $u = [u_1, \ldots, u_k] \in \{-1, 1\}^k$. Then for the matrix $GG^\top$ we have the following:

$$[GG^\top]_{[i,j]} = \langle G_{[i,:]}, G_{[j,:]} \rangle = u_i u_j \langle G_{[1,:]}, G_{[1,:]} \rangle = u_i u_j,$$

which implies

$$GG^\top = uu^\top. \tag{14}$$

Using Eq. (14) with Eq. (11), we have

$$\|CG\|_F^2 = \operatorname{tr}\big(CG(CG)^\top\big) = \operatorname{tr}\big(C^\top C\, uu^\top\big) \le \max_{u \in \{\pm 1\}^k} \|Cu\|_2^2 \le 1.$$

The rows of $G$ are all orthogonal: This condition implies $\hat{G} = GG^\top$ is a diagonal matrix with diagonal entries in $[0, 1]$, and so Eq. (11) implies $\|CG\|_F^2 \le \operatorname{tr}(X)$. It is thus sufficient to show $\operatorname{tr}(X) \le \max_{u \in \{-1,1\}^k} \operatorname{tr}(Xuu^\top)$. We give a construction for a $u$ that shows this. Observe

$$\operatorname{tr}(Xuu^\top) = \operatorname{tr}(X) + 2 \sum_{i=1}^{k} u_i \underbrace{\sum_{j=1}^{i-1} u_j X_{[i,j]}}_{b_i}.$$

We can choose $u_1 = 1$ and then $u_i = \operatorname{sign}(b_i)$, since $b_i$ depends only on $C$ and the previously fixed $u_j$ for $j < i$; this ensures the double sum on the right is non-negative, completing the proof.

$\|CG\|_F = O(\log k)$ without assuming conditions (1)-(4): We will prove this claim via a probabilistic argument.
First notice that, due to convexity, we have the following:

$$\max_{u \in \{\pm 1\}^k} \|Cu\|_2^2 \le 1 \ \Rightarrow\ \forall x \in \mathbb{R}^k,\ \|Cx\|_2^2 \le \|x\|_\infty^2. \tag{15}$$

We now observe the following for Normal distributions:

$$\|CG\|_F^2 = \mathbb{E}_{z \sim \mathcal{N}(0,1)^d} \|CGz\|_2^2 \le \mathbb{E}_{z \sim \mathcal{N}(0,1)^d} \|Gz\|_\infty^2 = O\Big(\max_{i \in [k]} \|G_{[i,:]}\|_2^2 \cdot \log k\Big) = O(\log k). \tag{16}$$

The inequality in Eq. (16) follows from Eq. (15), and the $O(\cdot)$ bound that follows it comes from the expectation of the maximum of Gaussian random variables, since each coordinate of $Gz$ is Gaussian with variance $\|G_{[i,:]}\|_2^2$.

Replacing the condition on $G$ with $\forall i \in [k]$, $\|G_{[i,:]}\|_1 \le 1$, then $\|CG\|_F \le 1$: First notice that since $\|CG\|_F^2$ is a convex

Notice that the constraint on any matrix $G \in \mathcal{G}$ is oblivious to the sign, meaning we can flip the sign of any set of entries in $G$ and the new matrix will still be in $\mathcal{G}$. This along with Claim H.1 immediately implies that the set $\mathcal{H} = \{H \in \{-1, 0, 1\}^{k \times d} : \forall i \in [k],\ \|H_{[i,:]}\|_0 = \|H_{[i,:]}\|_\infty =$

$\|Cu\|_2 \le 1$, then $\|CG\|_F \le 1$ for all $G$ with row $\ell_2$-norm at most one. Unfortunately, the conjecture is not true when $k > 2$, as shown by the following counterexample with $n = k = 3$ and $d = 2$:

$$C = \frac{1}{\sqrt{24}} \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & -1 \\ 1 & -1 & 2 \end{pmatrix} \quad \text{and} \quad G = \frac{1}{\sqrt{5}} \begin{pmatrix} 2 & 1 \\ 2 & -1 \\ 1 & 2 \end{pmatrix}.$$

Direct calculation shows $\max_{u \in \{\pm 1\}^k} \|Cu\|_2 = 1$, but $\|CG\|_F = \sqrt{1.1} \approx 1.049$.

$= (u_i, 0, \ldots, 0) \in \mathbb{R}^d$ for $i \in [n]$.

$H_v = [u_1, \ldots, u_k] \operatorname{diag}(v)^{1/2}$, $U = H_v H_v^\top$. Then, for Lagrange multipliers $v$ such that $U$ is full-rank, the Lagrange dual function $g$ can be expressed in closed form in terms of the Lagrange multipliers:

$$g(v) := \inf_{X \text{ is PD}} L(X, v) = 2 \operatorname{tr}\Big(\big(U^{1/2} A^\top A\, U^{1/2}\big)^{1/2}\Big) - \sum_{u \in \mathcal{D}} v_u.$$

Proof. As noted in the remark after Theorem 3.1, any optimal setting of the dual variables must be in the interior of a neighborhood in which the representation Eq. (7) is valid. It is therefore permissible to differentiate this representation.

Take some set of vectors $\{u_i\}_{i=1}^n \subseteq \mathcal{D}$ which span $\mathbb{R}^n$. Fix some representation $e_i = \sum_{j=1}^n \alpha_{ij} u_j$, where $e_i$ is the $i$-th standard basis vector. Take $y$ of $\ell_2$ norm 1, and express $y = \sum_{i=1}^n \gamma_i e_i$.
Then for $X$ satisfying our assumptions,

$$y^\top X y = \sum_{i,j=1}^{n} \gamma_i \gamma_j\, e_i^\top X e_j \le n^2 \max_{1 \le i,j \le n} |\gamma_i \gamma_j|\, e_i^\top X e_j.$$

Similarly,

$$e_i^\top X e_j = \sum_{k,l=1}^{n} \alpha_{ik} \alpha_{jl}\, u_k^\top X u_l \le n^2 \max_{1 \le k,l \le n} |\alpha_{ik} \alpha_{jl}|\, u_k^\top X u_l \le n^2 \max_{1 \le k,l \le n} |\alpha_{ik} \alpha_{jl}|,$$

where the final inequality follows by the assumptions on $X$. Now, since the $\ell_2$ norm of $y$ is 1, the orthogonality of the $e_i$ implies that each $|\gamma_i \gamma_j|$ is at most 1. Therefore

$$y^\top X y \le n^4 \max_{1 \le i,j,k,l \le n} |\alpha_{ik} \alpha_{jl}|,$$

and we have sufficiency of $\mathcal{D}$ spanning $\mathbb{R}^n$.

For necessity, suppose $\mathcal{D}$ does not span $\mathbb{R}^n$. Then there is some vector $y \in \operatorname{span}(\mathcal{D})^\perp$ of norm 1. Take any $X$ such that $\sup_{u \in \mathcal{D}} u^\top X u \le 1$. Then, since $y^\top u = 0$ for all $u \in \mathcal{D}$, $Y := X + \alpha\, y y^\top$ satisfies the same set of inequalities for any $\alpha$.

Lemma I.2. Let $U, V \in \mathbb{S}^n_{++}$. Let $U = U_L U_R$ be a factorization of $U$ such that $U_R V U_L$ is PSD, and the following equation defines a positive-definite matrix $X$:

$$X = U_R^\dagger \big(U_R V U_L\big)^{1/2} U_L^\dagger. \tag{31}$$

Then this $X$ solves the equation $XUX = V$, or equivalently $U = X^{-1} V X^{-1}$. Moreover, this positive-definite solution $X$ is unique.

Proof. We will begin by showing that $X$ as defined by Eq. (31) solves the equation $XUX = V$; then we will show that any two positive-definite representations of the form Eq. (31) are in fact identical. Notice that the representation $U = U_L U_R$ implies that $\operatorname{rank}(U_L) \ge n$ and $\operatorname{rank}(U_R) \ge n$. Therefore $U_L U_L^\dagger = I = U_R^\dagger U_R$,

Then, $A_{\mathrm{circ}} = F^* \Sigma F$, where $\Sigma \in \mathbb{C}^{2n \times 2n}$ is a diagonal matrix whose diagonal is the DFT (defined in Equation 35) of the first column of $A_{\mathrm{circ}}$. Here, $*$ is the Hermitian (conjugate transpose) operation.

Privacy and utility guarantees

In the following we provide the privacy guarantee and the main utility guarantee for the FFT mechanism defined in Algorithm 1.

Theorem J.2 (DP-Prefix Sum via FFT Privacy Guarantee). Algorithm 1 is $\rho$-zCDP in the adaptive continuous release model.

Next, we analyze the utility of Algorithm 1 and show that it is nearly optimal in terms of the mean squared error (MSE) in the single-pass setting.
First, we express the MSE in Theorem J.3 below.

Theorem J.3 (DP-Prefix Sum via DFT Utility). The MSE achieved by Algorithm 1 using the real and imaginary components of $z$ is

$$\mathbb{E}[\mathrm{MSE}] = \frac{\kappa^2 \|v_{\mathrm{DFT}}\|_1^2}{2 \rho n^2}.$$

In the following, we give an explicit expression for $\|v_{\mathrm{DFT}}\|_1$ in terms of the problem parameters. Finally, we argue that Theorem J.3 is nearly optimal.

Corollary J.1. The expected mean squared error (MSE) is given by the following:

$$\mathbb{E}[\mathrm{MSE}] = \frac{\kappa^2}{2 \rho n^2} \left( n + \sum_{a=0}^{\frac{2n-1}{2}} \frac{1}{\sin\left(\frac{\pi(2a+1)}{2n}\right)} \right)^2.$$

Near-optimal utility

Here, we show that Theorem J.3 is near-optimal in utility for the single-participation setting. To do this, we compare with a lower bound on the expected MSE of any factorization-based mechanism from Henzinger et al. (2022, Theorem 2):

$$\frac{1}{2 \rho \pi^2} \left( 2 + \ln\frac{2n+1}{3} + \frac{\ln(2n+1)}{2n} \right)^2.$$

We find that, though our analytical upper bound in Corollary J.1 is ≈ 6x worse than the lower bound, the empirical noise added in Algorithm 1 closely tracks the lower bound to within a factor of 1.2x, because it only adds the real part of the noise. Results are in Figure 14 of Appendix J.1.

Proof. First, consider the non-adaptive setting and the following mechanism, with parameters as defined in Algorithm 1,

Showing near-optimal utility via MSE experiments

$\Sigma\left(F x_{\mathrm{ext}} + \Sigma^{-1} z\right)$, where $z = \sqrt{\frac{\kappa^2 \|v_{\mathrm{DFT}}\|_1}{4 n \rho}}\, \sqrt{\Sigma} \cdot w$. We claim that this satisfies $\frac{\kappa^2 \|v_{\mathrm{DFT}}\|_1}{4 n \sigma^2}$-zCDP. To see this, we proceed by bounding $\rho_i$ for each coordinate $i \in [2n]$ defined in Equation 36. For brevity, let $b = F x_{\mathrm{ext}}$. Consider two neighboring data sets $g$ and $g'$ with corresponding $(b, x_{\mathrm{ext}})$ and $(b', x'_{\mathrm{ext}})$. Then,

$$\|b - b'\|_\infty = \|F(x_{\mathrm{ext}} - x'_{\mathrm{ext}})\|_\infty = \frac{\kappa}{\sqrt{2n}}. \tag{37}$$

We will now prove the zCDP guarantee independently for each of the $2n$ coordinates and then use standard zCDP composition (Bun & Steinke, 2016). For any coordinate $i \in \{0, \ldots, 2n-1\}$, adding noise $\sigma \sqrt{|v_{\mathrm{DFT}}[i]|} \cdot \mathcal{N}_{\mathrm{complex}}(0, 1)$ to $b[i]$ satisfies $\rho_i$-zCDP with $\rho_i = \frac{\kappa^2 |v_{\mathrm{DFT}}[i]|}{4 n \sigma^2}$. Then by composition, we have that

$$\rho = \sum_{i=0}^{2n-1} \rho_i = \frac{\kappa^2}{4 n \sigma^2} \sum_{i=0}^{2n-1} |v_{\mathrm{DFT}}[i]| = \frac{\kappa^2 \|v_{\mathrm{DFT}}\|_1}{4 n \sigma^2}. \tag{38}$$

Therefore, setting $\sigma^2 = \frac{\kappa^2 \|v_{\mathrm{DFT}}\|_1}{4 n \rho}$ satisfies non-adaptive $\rho$-zCDP. We prove the adaptive part using the same $\sigma$. We have the following from Equation 36:

$$\left[F^*\left(\Sigma F x_{\mathrm{ext}} + z\right)\right] = F^* \sqrt{\Sigma} \left( \sqrt{\Sigma} F x_{\mathrm{ext}} + \frac{1}{\sqrt{\Sigma}} z \right) = F^* \sqrt{\Sigma} \left( \sqrt{\Sigma} F x_{\mathrm{ext}} + \sqrt{\frac{\kappa^2 \|v_{\mathrm{DFT}}\|_1}{4 n \rho}} \cdot w \right). \tag{39}$$

Since $w$ in Equation 39 is a spherical Gaussian, and the original query matrix $A$ is lower triangular, the adaptive privacy guarantee follows by Theorem 2.1 in Denisov et al. (2022).

J.3 PROOF OF THEOREM J.3

Theorem J.3 Restated. The MSE achieved by Algorithm 1 using the real and imaginary components of $z$ is

$$\mathbb{E}[\mathrm{MSE}] = \frac{\kappa^2 \|v_{\mathrm{DFT}}\|_1^2}{2 \rho n^2}.$$

Proof. The MSE is given by the following: $\mathbb{E}[\mathrm{MSE}] = \frac{1}{n}\, \mathbb{E}\, \|z[0, \ldots, n-1]\|_2^2 = \kappa^2\, v_{\mathrm{DFT}} \cdots$

which is real-valued by Lemma K.1. Note that the sensitivity of $C_F$ is identical to that of $C$ for any notion of sensitivity expressible as in Definition 1, due to the unitarity of the Fourier transform. Since this matrix is of shape $[2n, n]$, there is a choice in computing the decoder $B$ such that $B C_F$ represents the prefix-sum matrix. The two decoders we present below correspond to two subtly distinct mechanisms.

Mechanism 1: A real-valued version of the mechanism presented in Section 4.
One natural translation of the analysis in Section 4 (indeed, a real-valued version of the precise operation described in Algorithm 1) may be computed by inserting a Fourier transform to match the inverse transform in $C_F$: $B_F = BF$. Clearly $B_F C_F = BC$, and $B_F$ is real-valued by Lemma K.1.

Proposition K.1. For any $\mathfrak{D}$, the mechanism described in Section 4 is distributionally equivalent to an application of the real-valued matrix mechanism with the factorization $(B_F, C_F)$, and satisfies the same privacy guarantees.

Proof. Since $C_F$ and $C$ have the same sensitivity, it suffices to show:
• $\Re[F^* \sqrt{\Sigma} z]$ (for $z$ a sample from an isotropic complex Gaussian) is distributionally equivalent to $P F^* \sqrt{\Sigma} F b$ for $b$ a sample from a real (isotropic) Gaussian with the same variance.

This is a consequence of the distributional invariance of the Gaussian under unitary transformations:

$$\Re[F^* \sqrt{\Sigma} z] \sim \Re[F^* \sqrt{\Sigma} F z] = F^* \sqrt{\Sigma} F\, \Re[z] \ \ (\text{as } F^* \sqrt{\Sigma} F \text{ is real}) \sim F^* \sqrt{\Sigma} F b,$$

where the variances are as desired.

Note that the efficiency of the mechanism described in Section 4 carries over immediately to this factorization $(B_F, C_F)$; indeed, the capacity to compute the noise $B_F b$ with complexity $n \log(n)$ may be established directly, in a similar manner. This mechanism is not, however, the optimal one for the encoder $C_F$, and this subtlety has difficult downstream effects when integrating with real-valued factorization codepaths (e.g., see the discussion in Appendix D.4). We proceed to show that the optimal decoder can be used directly, at only a moderate loss of efficiency with a sufficiently careful implementation.

Mechanism 2: A real-valued optimal decoder with complexity $n \log^2(n)$.

As noted in the literature (e.g., Section 3 of Denisov et al. (2022)), for a fixed encoder, the optimal decoder may always be computed in terms of an appropriate pseudoinversion of the encoder.
Therefore, we may compute the optimal decoder for the encoder $C_F$, defining $B_{F\mathrm{opt}} = S C_F^\dagger$, where $S$ is the prefix-sum matrix. Since $C_F$ is real, its pseudoinverse is as well, and $B_{F\mathrm{opt}}$ is also real-valued. Since $B_{F\mathrm{opt}}$ can have no more variance than $B_F$, all the utility analysis of the DFT mechanism in Section 4 carries through as an upper bound for this factorization. Privacy of this mechanism is ensured by the fact that it reuses the encoder $C_F$.

The major way in which these mechanisms operationally differ comes down to the cost of computing the noise vector $B_{F\mathrm{opt}} b$, where $b$ represents a sample from an isotropic Gaussian distribution. Though we do not know of a complexity result which matches the decoder $B_F$, we will show that the complexity cost which must be paid is only logarithmically higher.

Proposition K.2. The mapping $b \mapsto B_{F\mathrm{opt}} b$, where $b \in \mathbb{R}^n$, may be evaluated in $O(n \log^2(n))$ time.

Proof. First, notice that the matrix $C_F$ is one-to-one; indeed, this is immediately implied by the factorization $S = B_F C_F$. By Theorem 1.2.1 (P6) of Campbell & Meyer (1979), any one-to-one matrix $T$ admits the following representation for its pseudoinverse: $T^\dagger = (T^* T)^{-1} T^*$. We compute:

$$C_F^\dagger = (C_F^* C_F)^{-1} C_F^* = \Big( \big(F^* \sqrt{\Sigma} F E\big)^* F^* \sqrt{\Sigma} F E \Big)^{-1} \big(F^* \sqrt{\Sigma} F E\big)^* = \Big( P F^* \sqrt{\Sigma}^{\,*} F F^* \sqrt{\Sigma} F E \Big)^{-1} \big(F^* \sqrt{\Sigma} F E\big)^*.$$

Therefore the computational cost of computing the mapping $b \mapsto S C_F^\dagger b$ can be upper bounded by the maximum of $n \log n$ and the cost of computing the mapping $v \mapsto (P F^* |\Sigma| F E)^{-1} v$. The cost of computing this mapping is, in turn, bounded by the cost of inverting a general (full-rank) Toeplitz system, since $(P F^* |\Sigma| F E)^{-1} v$ may alternatively be characterized as the solution $x$ to the equation $P F^* |\Sigma| F E x = v$. The computational cost of solving such a system is known to be $n \log^2(n)$; see, e.g., de Hoog (1987).

Lemma K.1. For a real-valued vector $v$, let $\hat{v}$ represent its discrete Fourier transform.
If $\hat{v}$ has no purely real, negative entries, then letting $\sqrt{\cdot}$ denote the (pointwise) principal branch of the square root and $F$ the matrix representation of the Fourier transform, the matrix $F^* \sqrt{\hat{v}} F$ (with $\sqrt{\hat{v}}$ placed on the diagonal) is real-valued.

Proof. Conjugate symmetry of the DFT states that for a $j$-dimensional real-valued vector $x$, $\hat{x}[m] = \overline{\hat{x}[j - m]}$, and the converse also holds: if $\hat{x}$ has this symmetry, then $x$ is real-valued. This can be seen by examining the action of conjugation of $x$ on the Fourier transform $\hat{x}$. Now, by the assumptions on $\hat{v}$ and the choice of the principal branch of the square root (see the footnote on relaxing these assumptions), if $\hat{v}$ has this conjugate symmetry, so does $\sqrt{\hat{v}}$. Therefore there is some real-valued vector $y$ such that $\hat{y} = \sqrt{\hat{v}}$. The matrix $F^* \sqrt{\hat{v}} F$ represents convolution with $y$ in the standard basis, and hence is real-valued.
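Lemma K.1 can be checked numerically on the prefix-sum case. The following is a minimal sketch, assuming NumPy's unnormalized `fft`/`ifft` convention and a small illustrative size `n = 8`; the variable names are hypothetical and not taken from the paper's code. It takes the first column `v` of the $2n \times 2n$ circulant embedding, applies the principal square root to its DFT, and confirms that the resulting convolution kernel `y` is real up to round-off and that the circulant built from `y` squares to a matrix whose top-left $n \times n$ block is the prefix-sum matrix.

```python
import numpy as np

n = 8  # illustrative size (an assumption, not a value from the paper)

# First column of the 2n x 2n circulant embedding of the n x n prefix-sum matrix.
v = np.concatenate([np.ones(n), np.zeros(n)])
v_hat = np.fft.fft(v)

# Hypothesis of Lemma K.1: no DFT entry lies on the negative real axis
# (entries that are zero up to round-off are tolerated here).
tol = 1e-9
assert not np.any((np.abs(v_hat.imag) < tol) & (v_hat.real < -tol))

# np.sqrt on a complex array takes the principal branch pointwise.
y_complex = np.fft.ifft(np.sqrt(v_hat))
imag_residue = float(np.max(np.abs(y_complex.imag)))
y = y_complex.real

# Circulant matrix with first column y: entry (i, j) is y[(i - j) mod 2n].
idx = (np.arange(2 * n)[:, None] - np.arange(2 * n)[None, :]) % (2 * n)
C_y = y[idx]

# circ(y) @ circ(y) = circ(v), whose top-left n x n block is the prefix-sum matrix.
top_left = (C_y @ C_y)[:n, :n]
prefix_sum = np.tril(np.ones((n, n)))
print(imag_residue, np.max(np.abs(top_left - prefix_sum)))
```

Both printed quantities should be at the level of floating-point error. The check is only a sanity test for one input; it does not replace the symmetry argument in the proof.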



This is in contrast to Poisson, or independent fixed-sized batch sampling, with replacement across steps, as is assumed by many works (Abadi et al., 2016a; Bassily et al., 2014; Zhu & Wang, 2019; Wang et al., 2019). Many works in fact process batches in a shuffled order without replacement and then incorrectly apply DP analysis for, e.g., Poisson sampling. Indeed, we perform this same (incorrect) analysis for our DP-SGD baseline because it reproduces the previous state-of-the-art results for DP-SGD.

We conjecture it is "almost" true; tightly bounding the necessary error term is an interesting open question.

https://github.com/google/differential-privacy

We show how the optimal binary-tree decoder differs from the online, composition-based decoder in Fig. 6.

These assumptions can be avoided, though at the cost of taking care in choosing the square root of the negative elements of $\hat{v}$ to preserve the appropriate symmetry.



$\Pi$: Participation schema, the set of sets of steps (set of all $\pi$) an example could participate in.
$\mathfrak{D}$: $\mathfrak{D} = \{x - \tilde{x} \mid (x, \tilde{x}) \in \mathbf{N}\}$, the set of deltas between neighboring input streams $x, \tilde{x}$.
$\mathcal{D}$: Corners of $\mathfrak{D}$ when assumed to be a polytope, $\mathfrak{D} = \operatorname{conv}(\mathcal{D})$.
$(k, b)$-participation: participation schema $\Pi$ with at most $k$ participations, separated by exactly $b$.

Figure 4: Per-iterate variance for prefix-sum factorizations. All mechanisms above yield the same privacy ((ε, δ) = (4.38, 10 -5 ) in the (k, b) = (20, 100) setting), but have different total variances (the integral of the curves above).

Figs. 4 and 5 both demonstrate the effect of (k, b)-participations on the optimization problem. Particularly interesting to consider are the optimally-factorized matrices; in both cases, the epoch structure is clearly visible in the manner in which the mechanisms distribute variance. We see also the effect of 'stamps' s in the variance distribution, effectively a proxy for the epoch structure directly accounted for by the optimal mechanisms.

Figure 5: Per-iterate variance for momentum + cooldown matrix factorizations. Privacy measured in the (k, b) = (6, 342) setting.

Figure 7: DP-FTRL-Honaker baseline ablation with respect to number of 'stamps' s.

Figure 8: Ablation of prefix-sum factorizations, optimized for different number of epochs, and 'stamped' as appropriate. Performance improves as the geometry used for computing the factorization approaches that used for training.

Figure 9: Our optimal multi-epoch matrix and FFT-based mechanisms outperform all others, including DP-SGD with amplification, as low as ε ≈ 4. Using our sensitivity calculation of Theorem 2.1 and stamping (Section 5), we optimize a single pass ($k = 1$) matrix of Denisov et al. (2022) but apply it here with > 1 pass. We use an online Honaker-based decoder equivalent to that of Kairouz et al. (2021) except for a significant improvement to tree-completion in Appendix D.3. Models trained for 20 epochs on CIFAR10 with a batch size of 500. We repeat each setting 12 times and show 95% bootstrapped confidence intervals. Empirical setup is in Section 5.1.

The StackOverflow next-word prediction task, introduced in Reddi et al. (2020), has become a benchmark problem for DP training; our experimental setup here fixes the same model and adapts hyperparameters from previous work, including Kairouz et al. (2021) and Denisov et al. (2022).

Figure 10: Preliminary experimental results and non-private (unnoised) baselines. The notation sX, cY indicates a server learning rate X and client learning rate Y .

Figure 11: Number of large-magnitude updates zeroed per training round.

Figure 12: Complete data used in Fig. 3, showing results for server learning rates $\eta_s = 0.5$ and 1.0, as well as 0.25 when ε = 2. All algorithms use momentum 0.95 and a client learning rate of 1.0. For the 1 and 6 epoch runs, we observe MF generally tolerates larger learning rates, though lower learning rates perform better for all algorithms at ε = 2.

Figure 13: A more detailed view of the matrices shown in Fig. 2.

F.5 OPTIMAL MATRIX MECHANISMS

G DISCUSSIONS AND CONCLUSIONS

function, the maximum happens at the extreme points of the constraint set $\mathcal{G} = \{G \mid \forall i \in [k],\ \|G_{[i,:]}\|_1 = 1\}$. We use Claim H.1 to identify the extreme points of $\mathcal{G}$.

Claim H.1 (Theorem 1 in Cao et al. (2022)). The extreme points of the set of $k \times d$ row-stochastic matrices are precisely the row permutation matrices, i.e., the matrices with entries in $\{0, 1\}^{k \times d}$ in which each row has exactly one non-zero entry.
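The extreme-point argument for the $\ell_1$ case can be sanity-checked numerically. The sketch below is illustrative only: it reuses the $3 \times 3$ matrix $C$ and the $3 \times 2$ matrix $G$ from the counterexample in Section H.2, enumerates all signed one-hot matrices $H$ (the extreme points of the row-wise $\ell_1$ ball), and additionally samples random $G$ with unit row $\ell_1$ norm. The helper names, the seed, and the number of samples are assumptions made for this example.

```python
import itertools
import numpy as np

# Matrices from the counterexample in Section H.2.
C = np.array([[2, 1, 1], [1, 2, -1], [1, -1, 2]]) / np.sqrt(24)
G_l2 = np.array([[2, 1], [2, -1], [1, 2]]) / np.sqrt(5)  # rows have l2 norm 1
k, d = 3, 2

def max_sign_norm(C):
    """max over u in {-1, +1}^k of ||C u||_2, by brute force."""
    return max(np.linalg.norm(C @ np.array(u))
               for u in itertools.product([-1.0, 1.0], repeat=C.shape[1]))

def signed_one_hot_matrices(k, d):
    """All k x d matrices with exactly one entry in {-1, +1} per row."""
    choices = [(j, s) for j in range(d) for s in (-1.0, 1.0)]
    for rows in itertools.product(choices, repeat=k):
        H = np.zeros((k, d))
        for i, (j, s) in enumerate(rows):
            H[i, j] = s
        yield H

bound = max_sign_norm(C)

# Frobenius norm at every extreme point of the row-wise l1 ball.
worst_extreme = max(np.linalg.norm(C @ H) for H in signed_one_hot_matrices(k, d))

# Random interior points with unit row l1 norm (hypothetical sampling scheme).
rng = np.random.default_rng(0)
def sample_l1_rows():
    G = rng.standard_normal((k, d))
    return G / np.abs(G).sum(axis=1, keepdims=True)
worst_l1 = max(np.linalg.norm(C @ sample_l1_rows()) for _ in range(2000))

# The l2-normalized G violates the bound.
frob_l2 = np.linalg.norm(C @ G_l2)
print(bound, worst_extreme, worst_l1, frob_l2 ** 2)
```

In this instance the $\ell_1$-constrained matrices never exceed $\max_u \|Cu\|_2$, while the $\ell_2$-normalized $G$ gives $\|CG\|_F^2 = 1.1$, consistent with the contrast between the two row constraints.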

the other direction, for each $\pi \in \Pi$, we apply Theorem H.1 to the matrix $C_\pi = C_{[:,\pi]}$, and observe that $C^\top C \ge 0$ is sufficient to imply $C_\pi^\top C_\pi \ge 0$.

Note: the condition $C^\top C \ge 0$ is sufficient but not in fact necessary for Corollary 2.1 to hold. In particular, for $(k, b)$-participation $\Pi$, the sub-matrices $C_\pi^\top C_\pi$ for $\pi \in \Pi$ "touch" only $k^2 b$ of the $n^2 = k^2 b^2$ entries of $C^\top C$; the other entries of $C^\top C$ could in fact be negative. However, we did not need to use this observation for any of the matrices in our experiments.

H.4 PROOF OF THEOREM 2.1

Theorem 2.1 Restated. Let $C \in \mathbb{R}^{n \times n}$, and take some participation pattern $\Pi$, with $k = \max_{\pi \in \Pi} |\pi|$ the maximum number of participations. With $C_{[:,\pi]}$ representing the selection of the columns of the matrix $C$ indexed by $\pi$ and $\|\cdot\|_2$ the spectral matrix norm, let $\lambda = \max_{\pi \in \Pi}$.

Let a finite $\mathcal{D} = \{u_i\}_{i=1}^k$ be given, and assume that the vectors $\{u_i\}_{i=1}^k$ span $\mathbb{R}^n$. Assume that $A$ is full-rank, and for $v \in \mathbb{R}^k$ define

as implied by the Moore definition of the Moore-Penrose pseudoinverse. So:

Circulant matrices expressed using Fourier Transforms

Theorem J.1 (Adapted from Gray (2006)). Consider any circulant matrix $A_{\mathrm{circ}} \in \mathbb{R}^{2n \times 2n}$. Let $F \in \mathbb{C}^{2n \times 2n}$, where the $k$-th row of $F$ is given by $F[k, :] = \frac{1}{\sqrt{2n}} \left[ \exp\left( \frac{-j 2 \pi k a}{2n} \right) : a \in \{0, \ldots, 2n-1\} \right]$.

Figure 14: Algorithm 1 achieves near-optimal utility as measured by the analytic lower bound from Henzinger et al. (2022, Theorem 2).

J.2 PROOF OF THEOREM J.2

Theorem J.2 Restated. Algorithm 1 is $\rho$-zCDP in the adaptive continuous release model.

, $\Sigma[:n, :n]$ refers to the top-left $n \times n$ submatrix of $\Sigma$.

J.4 PROOF OF COROLLARY J.1

Corollary J.1 Restated. Under the same setting as Theorem J.3, the MSE for Algorithm 1 is the following: $\mathbb{E}[\mathrm{MSE}] =$

In this notation, $C_F^\dagger = (P F^* |\Sigma| F E)^{-1} \big(F^* \sqrt{\Sigma} F E\big)^*$. Now, the matrix $P F^* |\Sigma| F E$ is Toeplitz, since $F^* |\Sigma| F$ is circulant, and $P$, $E$ combine to select the top-left $n \times n$ square of $F^* |\Sigma| F$. Notice that $P F^* |\Sigma| F E$ is not circulant, and therefore cannot be diagonalized by the $n$-dimensional Fourier transform. The development of Section 4 yields the representation $S = P F^* \Sigma F E$, which implies that matrix-vector products with the matrix $S$ may be computed in $n \log n$ time by use of the FFT. Similarly, matrix-vector products with $\big(F^* \sqrt{\Sigma} F E\big)^*$ may be computed in $n \log n$ time.
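The representation $S = P F^* \Sigma F E$ translates directly into an FFT-based prefix sum. The following is a minimal sketch, assuming NumPy's unnormalized `fft`/`ifft` pair (whose normalization constants cancel in $F^* \Sigma F$) and taking the diagonal of $\Sigma$ to be the DFT of the vector of $n$ ones followed by $n$ zeros; the function name is hypothetical.

```python
import numpy as np

def prefix_sum_via_fft(x):
    """Compute S x (all prefix sums of x) as P F^* Sigma F E x in O(n log n)."""
    n = len(x)
    # First column of the 2n x 2n circulant embedding of the prefix-sum matrix.
    v = np.concatenate([np.ones(n), np.zeros(n)])
    sigma = np.fft.fft(v)                        # diagonal of Sigma
    x_ext = np.concatenate([x, np.zeros(n)])     # E x: embed into 2n dimensions
    y = np.fft.ifft(sigma * np.fft.fft(x_ext))   # F^* Sigma F (E x)
    return y[:n].real                            # P: keep the first n coordinates

rng = np.random.default_rng(0)  # illustrative seed
x = rng.standard_normal(16)
fft_result = prefix_sum_via_fft(x)
direct_result = np.cumsum(x)
print(np.max(np.abs(fft_result - direct_result)))
```

A plain cumulative sum is of course cheaper for $S$ itself; the point of the FFT form is that the same pattern applies to products with $F^* \sqrt{\Sigma} F E$ and its adjoint, where no such shortcut exists.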

and take some participation schema Π, with k = max π∈Π |π| the maximum number of participations. With C [:,π] representing selecting the columns of the matrix C indexed by π and • 2 the spectral matrix norm, let λ = max

Data vector x ∈ R n (with each |x i | ≤ ζ) and zCDP parameter ρ. v DFT ∈ C 2n ← the DFT of v (defined in Eq. (34)). Let v DFT [:n] be the first n coordinates. F ← DFT matrix in 2n-dimensions, where the k-th row of F is given by

Number of divergent training runs with and without zeroing of user updates with ∞ norm greater than 100; η s gives the server learning rate.

Noise multiplier parameters for the StackOverflow experiments to achieve various εs at δ = 10 -6 . Privacy was computed using the PLD accountant, see Appendix D.2.

Test set accuracy statistics for DP-SGDM, 6e (1000 clients/round) and DP-GDM, 2052e (342,477 clients/round). Accuracy for DP-GDM, 2052e was estimated with 1000 clients/round with an appropriately scaled noise multiplier. The count column gives the number of repeated trials of the given configuration, with $\eta_s$ indicating the server learning rate.

Test set accuracy for matrix-factorization based mechanisms. Note: Some 6 epoch runs

$1\}^k$, and let $G \in \mathbb{R}^{k \times d}$ be such that each row $G_{[i,:]}$ for $i \in [k]$ satisfies $\|G_{[i,:]}\|_2 \le 1$. Suppose at least one of the following conditions holds:

$1\}$ is the set of extreme points of the set $\mathcal{G}$. (If the set of extreme points of $\mathcal{G}$ is larger than $\mathcal{H}$, then the signs of any such extreme point can be flipped to create a new extreme point of row-stochastic matrices, which would violate Claim H.1.) It is not hard to observe that for any $H \in \mathcal{H}$, there exists a $u_H \in \{\pm 1\}^k$ s.t. $\|CH\|_F = \|C u_H\|_2$. Since $\|C u_H\|_2 \le \max_u \|Cu\|_2$ for any choice of $u_H$, and $\max \|CG\|_F$ is attained at one of the matrices in $\mathcal{H}$, the claim in Theorem H.1 follows.

H.2 A COUNTEREXAMPLE FOR GENERAL C

Theorem H.1 suggests that the following conjecture might be true, since it holds in so many special cases: If $\max$

049.

Corollary 2.1 Restated. When per-step contributions are bounded by $\zeta = 1$, for any participation pattern $\Pi$ and dimensionality $d \ge 1$, when $C^\top C \ge 0$ elementwise, we have $\operatorname{sens}_{\mathfrak{D}^d_\Pi}(C) = \operatorname{sens}_{\mathfrak{D}^1_\Pi}(C)$.

$\|Cu\|_2$, and observe we can construct a $G$ such that $\|CG\|_F = \|Cu\|_2$ by taking rows $G_{[i,:]}$


Proof. Since there is some finite set of vectors u ∈ R n specifying D, the supremum in Eq. ( 5) may be reduced to a maximum over these elements.Our problem then takes the form: inf X is PD tr(A AX -1 ) s.t. u Xu ≤ 1, ∀u ∈ D.(18)Recall that we have defined1 2 , and U = H v H v . Now, note:Introducing Lagrange multipliers v u ≥ 0, for the problem Eq. ( 18) we form the Lagrangian for positive-definite X:= tr(A AX -1 ) + tr= tr(A AX -1 ) + tr (UX) -For fixed v, any finite minimizer of L for positive-definite X must correspond to a zero of this Lagrangian's gradient. We then compute the gradientU and A are full-rank by assumption; therefore Lemma I.2 is applicable, and Eq. ( 23) has a unique positive-definite zero (and indeed, the infimum in Eq. ( 18) becomes a minimum):Note that Eq. ( 23) also immediately implies that if U is not full-rank, then there is no finite positivedefinite minimizer of L in X. Letting g(v) = min X L(X, v) be the Lagrange dual function and plugging back into Eq. ( 22), we have

= min

Xis PD tr(XU) + tr (UX)u v u using Eq. ( 23) Xis PD 2 tr (UX) -I.1.1 PROOF OF COROLLARY 3.1Corollary 3.1 Restated. In the same setup as Theorem 3.1, a maximizer of the dual v must satisfy:Moreover, the optimal value of the problem defined in 6 is tr (v ).Differentiating, we find:(recalling the usage of the symbol X in Eq. ( 24)).Thereforeand we have the stated expression for the gradient of the dual function.Now, at a maximizer of the dual function, this derivative must vanish. An equivalent condition is diagpart(H XH) = 1, and henceso at the optimum v in fact g(v ) = u v u , establishing the second claim of our result.Again using the observation that diagpart(H v XH v ) = 1 and soFurther, using the second claim of Corollary I.1, we can takev , and multiplying this by H v and H v on the left and right respectively yieldsand so we conclude for the optimal Lagrange multiplier v ,

I.2 LEMMAS AND COROLLARIES

Lemma I.1. The set of positive-definite X such that sup u∈D u Xu ≤ 1 is bounded as a subset of R n×n if and only if D = {u} spans R n .Proof. Suppose that D spans R n . For a PSD matrix, a bound on the trace implies a bound on the elements; therefore it is sufficient to show that sup u∈D u Xu ≤ 1 implies that the maximum eigenvalue of X is uniformly bounded for X PSD.this can be seen by on the left or right as appropriate by U R VU L 1 2 , and notingSince all the terms here are symmetric, the appropriate equality follows by uniqueness of the symmetric matrix square root. Therefore:The uniqueness of a positive-definite X solving XUX = V follows from the uniqueness of the usual matrix square root. Indeed, assume Y positive-definite satisfies YUY = V. Then:Since the positive-definite square root is uniquely determined, U 1/2 YU 1/2 is uniquely determined. Since U is invertible, Y is uniquely determined as well, and we have Y = X.Corollary I.1. Two particular instantiations of Lemma I.2 are of interest. X as the matrix geometric mean of U -1 and V (takingand assuming the representationProof. By positive-definiteness of U and V, Eq. ( 32) is clearly positive definite; Eq. ( 33) may be seen to be positive definite via the SVD of the pseudoinverses involved. Symmetry is again clear. Therefore both representations satisfy the assumptions of Lemma I.2.
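The geometric-mean instantiation of Lemma I.2 can be verified numerically. The sketch below is a minimal check, assuming the symmetric factorization $U_L = U_R = U^{1/2}$, so that $X = U^{-1/2} (U^{1/2} V U^{1/2})^{1/2} U^{-1/2}$; the helper names, the test dimension, and the random test matrices are assumptions made for this example.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric PSD square root via an eigendecomposition."""
    M = (M + M.T) / 2  # symmetrize against round-off
    w, Q = np.linalg.eigh(M)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def geometric_mean_solution(U, V):
    """Positive-definite X with X U X = V, for positive-definite U and V."""
    U_half = psd_sqrt(U)
    U_inv_half = np.linalg.inv(U_half)
    return U_inv_half @ psd_sqrt(U_half @ V @ U_half) @ U_inv_half

rng = np.random.default_rng(0)  # illustrative seed
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
U = A @ A.T + n * np.eye(n)  # random positive-definite test matrices
V = B @ B.T + n * np.eye(n)

X = geometric_mean_solution(U, V)
residual = np.max(np.abs(X @ U @ X - V))
min_eig = np.linalg.eigvalsh((X + X.T) / 2).min()
print(residual, min_eig)
```

The residual should be at round-off level and the smallest eigenvalue strictly positive, matching the existence part of the lemma; uniqueness is not something a single numerical example can confirm.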

J ANALYSIS FOR SECTION 4 J.1 ADDITIONAL DETAILS

Defining the circulant matrix

We consider the special case where $A$ is the prefix sum linear query matrix (lower-triangular matrix of ones). Then, we define the corresponding circulant matrix $A_{\mathrm{circ}}$. It is straightforward to verify $A_{\mathrm{circ}[:n,:n]} = A$.

Proof. Recall the definition of the DFT from Equation 35 and of $v$ in Equation 34. It is immediate that $v_{\mathrm{DFT}}[0] = n$. For any $k \ne 0$, we have, from Equation 41, that when $k > 0$ is even, $v_{\mathrm{DFT}}[k] = 0$. For $k$ odd, we have

Combining these, the term $\|v_{\mathrm{DFT}}\|_1$ is

J.5 PROOF OF THEOREM 4.1

Theorem 4.1 Restated. Under $k$ participation, Algorithm 1 satisfies $(k^2 \rho)$-zCDP.

Proof. The proof goes exactly as for Theorem J.2, except that Equation 37 gets replaced by the following:

K TWO RELATED FFT MECHANISMS

The FFT mechanism presented in Section 4 can be understood as an application of a complex-valued matrix mechanism factorizing the prefix-sum matrix as $S = P F^* \Sigma F E$, where $E$ and $P$ are appropriate embedding and projection matrices, respectively embedding an $n$-dimensional vector in the first $n$ components of $\mathbb{R}^{2n}$ and projecting those same first $n$ components back to $\mathbb{R}^n$, and following this application by 'chopping off' the imaginary part of the noise. The entries of $\Sigma$ may be computed exactly; they contain no purely negative entries, so specifying the principal branch of the square root resolves the implicit ambiguity in the formulation above. This branch also corresponds to the implementation of the complex square root in major software frameworks (e.g., NumPy).

All these operations are linear, and since everything begins and ends in the real domain, this mechanism can be expressed as a real-valued mechanism. Therefore identical codepaths can be used for implementing experiments with the FFT, though notably, without some special implementation of the mechanism, realizing the potential computational savings will not be immediate.
In this small section, we translate this complex-valued mechanism into two real-valued mechanisms which can be integrated with the code backing the rest of the paper. These mechanisms differ in their decoding matrix $B$, and thus achieve different levels of loss. Both have efficient implementations, though with asymptotics differing by a logarithmic factor. We implement and experiment with both of these mechanisms in the main body.

These two mechanisms share an encoding matrix: $C_F = F^* \sqrt{\Sigma} F E$.

Remark. Lemma K.1 can be understood as a statement about the solvability of a certain repeated-convolution equation over real-valued functions (the equation $g * g = f$). We suspect that this fact has been observed in the harmonic analysis literature as a general property of all Fourier transforms; we could find no reference. The symmetries discussed above take a slightly different form in the continuous and noncompact case (i.e., the Fourier transform of real-valued functions on $\mathbb{R}^d$) and in the finite-dimensional Fourier transform here, so we choose to prove this statement in this limited setting.

