PRIVATE FEDERATED LEARNING WITHOUT A TRUSTED SERVER: OPTIMAL ALGORITHMS FOR CONVEX LOSSES

Abstract

This paper studies federated learning (FL), especially cross-silo FL, with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person's data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the study of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which requires silo i's communications to satisfy record/item-level differential privacy (DP). ISRL-DP ensures that the data of each person (e.g. patient) in silo i (e.g. hospital i) cannot be leaked. ISRL-DP differs from well-studied privacy notions: central and user-level DP assume that people trust the server/other silos, while, at the other end of the spectrum, local DP assumes that people do not trust anyone at all (even their own silo). Sitting between central and local DP, ISRL-DP makes the realistic assumption (in cross-silo FL) that people trust their own silo, but not the server or other silos. In this work, we provide tight (up to logarithms) upper and lower bounds for ISRL-DP FL with convex/strongly convex loss functions and homogeneous (i.i.d.) silo data. Remarkably, we show that similar bounds are attainable for smooth losses with arbitrary heterogeneous silo data distributions, via an accelerated ISRL-DP algorithm. We also provide tight upper and lower bounds for ISRL-DP federated empirical risk minimization, and use acceleration to attain the optimal bounds in fewer rounds of communication than the state of the art. Finally, with a secure "shuffler" to anonymize silo messages (but without a trusted server), our algorithm attains the optimal central DP rates under more practical trust assumptions. Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithm in classification and regression tasks.

1. INTRODUCTION

Machine learning tasks often involve data from different "silos" (e.g. cell-phone users or organizations such as hospitals) containing sensitive information (e.g. location or health records). In federated learning (FL), each silo (a.k.a. "client") stores its data locally, and a central server coordinates updates among the silos to achieve a global learning objective (Kairouz et al., 2019). One of the primary motivations for FL was to offer greater privacy (McMahan et al., 2017). However, storing data locally is not sufficient to prevent data leakage: model parameters or updates can still reveal sensitive information, e.g. via model inversion or membership inference attacks (Fredrikson et al., 2015; He et al., 2019; Song et al., 2020; Zhu & Han, 2020). Differential privacy (DP) (Dwork et al., 2006) protects against such privacy attacks, and different notions of DP have been proposed for FL. The works of Jayaraman & Wang (2018); Truex et al. (2019); Noble et al. (2022) considered central DP (CDP) FL, which protects the privacy of silos' aggregated data against an external adversary who observes the final trained model. There are two major issues with CDP FL: 1) it does not guarantee privacy for each specific silo; and 2) it does not guarantee data privacy when an adversarial eavesdropper has access to other silos or the server. To address the first issue, McMahan et al. (2018); Geyer et al. (2017); Jayaraman & Wang (2018); Gade & Vaidya (2018); Wei et al. (2020a); Zhou & Tang (2020); Levy et al. (2021); Ghazi et al. (2021) considered user-level DP (a.k.a. client-level DP), which guarantees privacy of each silo's full local data set. This is a practical notion for cross-device FL, where each silo/client corresponds to a single person (e.g. cell-phone user) with many records (e.g. text messages).
However, user-level DP still suffers from the second critical shortcoming of CDP: it allows silo data to be leaked to an untrusted server or to other silos. Furthermore, user-level DP is less suitable for cross-silo FL, where silos are typically organizations (e.g. hospitals, banks, or schools) that contain data from many different people (e.g. patients, customers, or students). In cross-silo FL, each person has a record (a.k.a. "item") that may contain sensitive data. Thus, an appropriate notion of DP for cross-silo FL should protect the privacy of each individual record ("item-level DP") in silo i, rather than silo i's full aggregate data. Another notion of DP is local DP (LDP) (Kasiviswanathan et al., 2011; Duchi et al., 2013). While central and user-level DP assume that people trust all of the silos and the server, LDP assumes that individuals (e.g. patients) do not trust anyone else with their sensitive data, not even their own silo (e.g. hospital). Thus, LDP would require each person (e.g. patient) to randomize their report (e.g. medical test results) before releasing it (e.g. to their own doctor/hospital). Since patients/customers/students usually trust their own hospital/bank/school, LDP may be unnecessarily stringent and hinder performance/accuracy. In this work, we consider a privacy notion called inter-silo record-level differential privacy (ISRL-DP), which requires that all of the communications of each silo satisfy (item-level) DP; see Fig. 1. Why ISRL-DP? ISRL-DP is the natural notion of DP for cross-silo FL, where each silo contains data from many individuals who trust their own silo but may not trust the server or other silos (e.g. hospitals in Fig. 1). The item-level privacy guarantee that ISRL-DP provides for each silo (e.g. hospital) ensures that no person's record can be leaked. In contrast to central DP and user-level DP, the protection of ISRL-DP is guaranteed even against an adversary with access to the server and/or other silos.
This is because each silo's communications are DP with respect to its own data records and hence cannot leak information to any adversarial eavesdropper. On the other hand, since individuals (e.g. patients) trust their own silo (e.g. hospital), ISRL-DP does not require individuals to randomize their own data reports (e.g. health records). Thus, ISRL-DP leads to better performance/accuracy than local DP by relaxing the strict local DP requirement. Another benefit of ISRL-DP is that each silo i can set its own $(\epsilon_i, \delta_i)$ item-level DP budget depending on its privacy needs; see Appendix H and also Liu et al. (2022); Aldaghri et al. (2021). In addition, ISRL-DP can be useful in cross-device FL without a trusted server: if the ISRL privacy parameters are chosen sufficiently small, then ISRL-DP implies user-level DP (see Appendix C). Unlike user-level DP, ISRL-DP does not allow data to be leaked to the untrusted server/other users. Another intermediate trust model between the low-trust local model and the high-trust central/user-level models is the shuffle model of DP (Bittau et al., 2017; Cheu et al., 2019). In this model, a secure shuffler receives noisy reports from the silos and randomly permutes them before the reports are sent to the untrusted server. An algorithm is shuffle differentially private (SDP) if silos' shuffled messages are CDP; see Definition 3. Fig. 2 summarizes which parties are assumed to be trustworthy (from the perspective of a person contributing data to a silo) under each of the described privacy notions. Problem setup: Consider an FL setting with N silos, each containing a local data set of n samples: $X_i = (x_{i,1}, \ldots, x_{i,n})$ for $i \in [N] := \{1, \ldots, N\}$. In each round of communication r, silos download the global model $w_r$ and use their local data to improve the model. Then, silos send local updates to the server (or to other silos, in peer-to-peer FL), which updates the global model to $w_{r+1}$.
For each silo i, let $\mathcal{D}_i$ be an unknown probability distribution on a data universe $\mathcal{X}_i$ (so that $X_i \in \mathcal{X}_i^n$), and let $\mathcal{X} := \bigcup_{i=1}^{N} \mathcal{X}_i$. Given a loss function $f: \mathcal{W} \times \mathcal{X} \to \mathbb{R}$, define silo i's local objective as
$$F_i(w) := \mathbb{E}_{x_i \sim \mathcal{D}_i}[f(w, x_i)], \qquad (1)$$
where $\mathcal{W} \subset \mathbb{R}^d$ is a parameter domain. Our goal is to find a model parameter that performs well for all silos, by solving the FL problem
$$\min_{w \in \mathcal{W}} \Big\{ F(w) := \frac{1}{N} \sum_{i=1}^{N} F_i(w) \Big\}, \qquad (2)$$
while maintaining the privacy of each silo's local data. (We put "trust" in quotes when describing the shuffler because the shuffler is assumed to be secure, and silo messages must already satisfy at least a weak level of ISRL-DP in order to realize SDP: anonymization alone cannot "create" DP (Dwork & Roth, 2014).) At times, we will focus on empirical risk minimization (ERM), where $\hat{F}_i(w) := \frac{1}{n} \sum_{j=1}^{n} f(w, x_{i,j})$ is silo i's local objective; in the ERM case, our goal is to solve $\min_{w \in \mathcal{W}} \hat{F}_X(w) := \frac{1}{N} \sum_{i=1}^{N} \hat{F}_i(w)$ while maintaining privacy. When $F_i$ takes the form (1) (not necessarily ERM), we may refer to the problem as stochastic convex optimization (SCO) for emphasis. For ERM, we make no assumptions on the data; for SCO, we assume the samples $\{x_{i,j}\}_{i \in [N], j \in [n]}$ are drawn independently. For SCO, we say problem (2) is "i.i.d." or "homogeneous" if $\mathcal{X}_i = \mathcal{X}$ and $\mathcal{D}_i = \mathcal{D}$ for all i. The excess risk of an algorithm $\mathcal{A}$ for solving (2) is $\mathbb{E}F(\mathcal{A}(X)) - F^*$, where $F^* := \inf_{w \in \mathcal{W}} F(w)$ and the expectation is over both the random draw of $X = (X_1, \ldots, X_N)$ and the randomness of $\mathcal{A}$. For ERM, the excess empirical risk of $\mathcal{A}$ is $\mathbb{E}\hat{F}_X(\mathcal{A}(X)) - \hat{F}_X^*$, where $\hat{F}_X^* := \min_{w \in \mathcal{W}} \hat{F}_X(w)$ and the expectation is taken solely over the randomness of $\mathcal{A}$. Thus, a fundamental question in FL concerns the minimum achievable excess risk while maintaining privacy. In this work, we study the following questions for FL with convex and strongly convex loss functions:
Question 1. What is the minimum achievable excess risk for solving (2) with inter-silo record-level DP?
Question 2.
With a trusted shuffler (but no trusted server), can the optimal central DP rates be attained?
Contributions: Our first contribution is a complete answer to Question 1 when silo data is i.i.d.: we give tight upper and lower bounds in Section 2. The ISRL-DP rates sit between the local DP and central DP rates: higher trust allows for higher accuracy. Further, we show that the ISRL-DP rates nearly match the optimal non-private rates if $d \lesssim n\epsilon_0^2$, where $\epsilon_0$ is the ISRL-DP parameter ("privacy for free"). As a corollary of our analysis, we also derive tight upper and lower bounds for FL algorithms that satisfy both ISRL-DP and user-level DP simultaneously, which could be useful in cross-device settings where (e.g. cell-phone) users do not trust the server or other users with their sensitive data (e.g. text messages): see Appendix D.3.5. Second, we give a complete answer to Question 1 when $F = \hat{F}_X$ is an empirical loss, in Section 4. While (Girgis et al., 2021) provided a tight upper bound for the (non-strongly) convex case, we use a novel accelerated algorithm to achieve this upper bound in fewer communication rounds. Further, we obtain matching lower bounds, and we also cover the strongly convex case. Third, we give a partial answer to Question 1 when silo data is heterogeneous (non-i.i.d.), providing algorithms for smooth $f(\cdot, x)$ that nearly achieve the optimal i.i.d. rates in Section 3. For example, if $f(\cdot, x)$ is $\mu$-strongly convex and $\beta$-smooth, then the excess risk of our algorithm matches the i.i.d. lower bound up to a multiplicative factor of $\tilde{O}(\beta/\mu)$. Our algorithm is significantly more effective (in terms of excess risk) than existing ISRL-DP FL algorithms (e.g. Arachchige et al. (2019); Dobbe et al. (2020); Zhao et al. (2020)): see Appendix A for a thorough discussion of related work. Fourth, we address Question 2 in Section 5: we give a positive answer to Question 2 when silo data is i.i.d.
Further, with heterogeneous silo data, the optimal central DP rates are nearly achieved without a trusted server, if the loss function is smooth. We summarize our results in Fig. 3 . 
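For concreteness, the global objective $F$ in (2) and its ERM counterpart $\hat{F}_X$ can be written down directly. The sketch below uses a linear model with squared loss purely for illustration; the paper allows any convex, Lipschitz loss $f$:

```python
import numpy as np

def local_erm(w, X_i, y_i):
    """Silo i's empirical risk F_hat_i(w) = (1/n) sum_j f(w, x_ij).
    Here f is the squared loss of a linear model (an illustrative choice)."""
    return 0.5 * float(np.mean((X_i @ w - y_i) ** 2))

def federated_erm(w, silos):
    """F_hat_X(w) = (1/N) sum_i F_hat_i(w): the federated ERM objective (2)."""
    return float(np.mean([local_erm(w, X_i, y_i) for X_i, y_i in silos]))
```

The SCO objective (1)-(2) replaces each silo's empirical average by an expectation over its (unknown) local distribution; the code structure is identical, with fresh samples standing in for the data set.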

Fig. 3: summary of our excess risk bounds for smooth convex and smooth strongly convex losses (one convex ERM bound is due to (Girgis et al., 2021)). *For our non-i.i.d. algorithm, Theorem 2.2 only applies if $\epsilon_0 = O(1/n)$ or $N = O(1)$: see Appendix D.3.

Differential Privacy: Let $\mathcal{X} = \mathcal{X}_1^n \times \cdots \times \mathcal{X}_N^n$ and let $\rho: \mathcal{X}^2 \to [0, \infty)$ be a distance between databases. Two databases $X, X' \in \mathcal{X}$ are $\rho$-adjacent if $\rho(X, X') \le 1$. DP ensures that (with high probability) an adversary cannot distinguish between the outputs of algorithm $\mathcal{A}$ when it is run on adjacent databases:

Definition 1 (Differential Privacy). Let $\epsilon \ge 0$, $\delta \in [0, 1)$. A randomized algorithm $\mathcal{A}: \mathcal{X} \to \mathcal{W}$ is $(\epsilon, \delta)$-differentially private (DP) (with respect to $\rho$) if for all $\rho$-adjacent data sets $X, X' \in \mathcal{X}$ and all measurable subsets $S \subseteq \mathcal{W}$, we have
$$\mathbb{P}(\mathcal{A}(X) \in S) \le e^{\epsilon}\, \mathbb{P}(\mathcal{A}(X') \in S) + \delta. \qquad (3)$$

Definition 2 (Inter-Silo Record-Level Differential Privacy). For $i \in [N]$, let $\rho_i(X_i, X_i') := \sum_{j=1}^{n} \mathbb{1}_{\{x_{i,j} \ne x'_{i,j}\}}$ for silo data sets $X_i, X_i' \in \mathcal{X}_i^n$. A randomized algorithm $\mathcal{A}$ is $(\epsilon_0, \delta_0)$-ISRL-DP if for all $i \in [N]$ and all $\rho_i$-adjacent silo data sets $X_i, X_i'$, the full transcript of silo i's sent messages satisfies (3) for any fixed settings of the other silos' messages and data.

Definition 3 (Shuffle Differential Privacy (Bittau et al., 2017; Cheu et al., 2019)). A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-shuffle DP (SDP) if for all $\rho$-adjacent databases $X, X' \in \mathcal{X}$ and all measurable subsets $S$, the collection of all uniformly randomly permuted messages sent by the shuffler satisfies (3), with $\rho(X, X') := \sum_{i=1}^{N} \sum_{j=1}^{n} \mathbb{1}_{\{x_{i,j} \ne x'_{i,j}\}}$.

Notation: Let $\|\cdot\|$ denote the $\ell_2$ norm and let $\Pi_{\mathcal{W}}(z) := \operatorname{argmin}_{w \in \mathcal{W}} \|w - z\|^2$ denote the projection onto $\mathcal{W}$. We write $a = \tilde{O}(b)$ if $a \lesssim \log^2(\theta)\, b$ for some parameters $\theta$.
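As a concrete illustration of Definition 1, here is a minimal sketch of the Gaussian mechanism that underlies the noisy gradient reports in our algorithms. The calibration below is the classical one (valid for small $\epsilon$); the paper's algorithms instead calibrate $\sigma^2$ via amplification-by-subsampling and composition arguments, so this is a simplification:

```python
import numpy as np

def gaussian_mechanism(v, l2_sensitivity, eps, delta, rng):
    """Release v + N(0, sigma^2 I) with the classical calibration
    sigma = Delta_2 * sqrt(2 ln(1.25/delta)) / eps, which yields
    (eps, delta)-DP for small eps (Gaussian mechanism)."""
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return v + rng.normal(0.0, sigma, size=v.shape)

# For a K-sample average of gradients with norm at most L (Assumption 1,
# part 2), replacing one record changes the average by at most 2L/K in
# l2 norm; L and K below are illustrative values.
L, K = 1.0, 32
sensitivity = 2.0 * L / K
```

Under ISRL-DP (Definition 2), each silo applies such a mechanism to its own minibatch gradient before transmission, so the transcript is private against the server and the other silos.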
A function $h: \mathcal{W} \to \mathbb{R}^m$ is $L$-Lipschitz if $\|h(w) - h(w')\| \le L \|w - w'\|$ for all $w, w' \in \mathcal{W}$; a differentiable function $h$ is $\beta$-smooth if its derivative $\nabla h$ is $\beta$-Lipschitz. We assume the following throughout:
Assumption 1.
1. $\mathcal{W} \subset \mathbb{R}^d$ is closed and convex, with $\|w - w'\| \le D$ for all $w, w' \in \mathcal{W}$.
2. $f(\cdot, x)$ is $L$-Lipschitz and convex for all $x \in \mathcal{X}$. In some parts of the paper, we assume $f(\cdot, x)$ is $\mu$-strongly convex.
3. $\sup_{w \in \mathcal{W}} \mathbb{E}_{x_i \sim \mathcal{D}_i} \|\nabla f(w, x_i) - \nabla F_i(w)\|^2 \le \phi^2$ for all $i \in [N]$.

4. In each round r, a uniformly random subset $S_r$ of $M_r \in [N]$ silos is available to communicate with the server, where $\{M_r\}_{r \ge 0}$ are independent random variables with $\frac{1}{M} := \mathbb{E}\big(\frac{1}{M_r}\big)$.

In Assumption 1, part 4, the network determines $M_r$: it is not a design parameter. This assumption is more general (and more realistic for cross-device FL (Kairouz et al., 2019)) than in most (DP) FL works, which usually assume $M = N$ or that $M_r = M$ is deterministic. On the other hand, in cross-silo FL, typically all silos can reliably communicate in each round, i.e. $M = N$ (Kairouz et al., 2019).
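A small simulation may clarify Assumption 1, part 4: the network draws $M_r$, a uniformly random subset $S_r$ of that size becomes available, and the quantity entering our bounds is $1/M := \mathbb{E}(1/M_r)$. The distribution of $M_r$ below is an arbitrary illustrative choice, not something the assumption specifies:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50  # total number of silos (illustrative value)

def sample_round(rng, N):
    """One communication round under Assumption 1, part 4: the network
    draws M_r, then a uniformly random subset S_r of M_r silos is
    available to communicate with the server."""
    M_r = int(rng.integers(low=10, high=N + 1))      # illustrative law for M_r
    S_r = rng.choice(N, size=M_r, replace=False)     # uniformly random subset
    return M_r, S_r

# Monte Carlo estimate of 1/M := E[1/M_r], the effective number of
# participating silos that appears in the excess risk bounds.
inv_draws = [1.0 / sample_round(rng, N)[0] for _ in range(20000)]
M_eff = 1.0 / float(np.mean(inv_draws))
```

Note that $M$ is a harmonic-type mean of the $M_r$, so rounds with few available silos weigh more heavily than rounds with many.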

2. INTER-SILO RECORD-LEVEL DP FL WITH HOMOGENEOUS SILO DATA

In this section, we provide tight (up to logarithms) upper and lower bounds on the excess risk of ISRL-DP algorithms for FL with i.i.d. silo data. For consistency of presentation, we assume that there is an untrusted server. However, our algorithms readily extend to peer-to-peer FL (no server), by having silos send private messages directly to each other and perform model updates themselves.

2.1. UPPER BOUNDS VIA NOISY DISTRIBUTED MINIBATCH SGD

We begin with our upper bounds, which are obtained via Noisy Distributed Minibatch SGD (MB-SGD). In each round r, all $M_r$ available silos receive $w_r$ from the server and send back noisy stochastic gradients: $\tilde{g}_r^i := \frac{1}{K} \sum_{j=1}^{K} \nabla f(w_r, x_{i,j}^r) + u_i$, where $u_i \sim N(0, \sigma^2 I_d)$ and the $x_{i,j}^r$ are drawn uniformly from $X_i$ (with replacement). The server averages these $M_r$ reports and updates $w_{r+1} := \Pi_{\mathcal{W}}\big[w_r - \frac{\eta_r}{M_r} \sum_{i \in S_r} \tilde{g}_r^i\big]$. After R rounds, a weighted average of the iterates is returned: $\hat{w}_R = \frac{1}{\Gamma_R} \sum_{r=0}^{R-1} \gamma_r w_r$ with $\Gamma_R := \sum_{r=0}^{R-1} \gamma_r$. With proper choices of $\{\eta_r, \gamma_r\}_{r=0}^{R-1}$, $\sigma^2$, K, and R, we have:

Theorem 2.1 (Informal). Let $\epsilon_0 \le 2 \ln(2/\delta_0)$, $\delta_0 \in (0, 1)$. Then Noisy MB-SGD is $(\epsilon_0, \delta_0)$-ISRL-DP. Moreover:
1. If $f(\cdot, x)$ is convex, then
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\left(\frac{LD}{\sqrt{M}} \left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d \ln(1/\delta_0)}}{\epsilon_0 n}\right)\right). \qquad (4)$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex, then
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\left(\frac{L^2}{\mu M} \left(\frac{1}{n} + \frac{d \ln(1/\delta_0)}{\epsilon_0^2 n^2}\right)\right). \qquad (5)$$

The first terms in (4) and (5) ($LD/\sqrt{Mn}$ for convex and $L^2/(\mu M n)$ for strongly convex) are bounds on the uniform stability (Bousquet & Elisseeff, 2002) of Noisy MB-SGD, which we use to bound its generalization error. Our stability bound for $\mu > 0$ in Lemma D.2 is novel even for $N = 1$. The second terms in (4) and (5) are bounds on the empirical risk of the algorithm. We use Nesterov smoothing (Nesterov, 2005) to extend our bounds to non-smooth losses; this requires choosing a different stepsize and R than Bassily et al. (2019) (for $N = 1$, $\mu = 0$), which eliminates the restriction on the smoothness parameter that appears in (Bassily et al., 2019, Theorem 3.2). Privacy of Noisy MB-SGD follows from the Gaussian mechanism (Dwork & Roth, 2014, Theorem A.2), privacy amplification by subsampling (Ullman, 2017), and advanced composition (Dwork & Roth, 2014, Theorem 3.20) or the moments accountant (Abadi et al., 2016, Theorem 1). See Appendix D.2 for the detailed proof.
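A minimal sketch of Noisy Distributed MB-SGD follows, under simplifying assumptions: all silos are available every round, the step size is constant, iterates are averaged uniformly rather than with the weights $\gamma_r$, and the noise level $\sigma$ is taken as given (the ISRL-DP calibration of $\sigma$ is in Appendix D.2):

```python
import numpy as np

def project_ball(w, D):
    """Euclidean projection onto the ball of radius D (one convex choice of W)."""
    nrm = np.linalg.norm(w)
    return w if nrm <= D else (D / nrm) * w

def noisy_mb_sgd(silos, grad_f, d, R, K, eta, sigma, D, rng):
    """Sketch of Noisy Distributed MB-SGD: each silo averages K sampled
    gradients, adds its own Gaussian noise before transmission, and the
    server averages the noisy reports and takes a projected step."""
    w = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(R):
        reports = []
        for X_i in silos:
            idx = rng.integers(0, len(X_i), size=K)   # uniform with replacement
            g = np.mean([grad_f(w, X_i[j]) for j in idx], axis=0)
            reports.append(g + rng.normal(0.0, sigma, size=d))  # silo-side noise
        w = project_ball(w - eta * np.mean(reports, axis=0), D)
        avg += w / R                                   # uniform iterate average
    return avg
```

The key ISRL-DP feature is that noise is added inside each silo, before any message leaves it, so the transcript is private even against the server.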

2.2. LOWER BOUNDS

We provide excess risk lower bounds for the case $M = N$, establishing the optimality of Noisy MB-SGD for two function classes: $\mathcal{F}_{L,D} := \{f \mid f(\cdot, x)$ is convex and $L$-Lipschitz for all $x \in \mathcal{X}$, and $\|w - w'\| \le D$ for all $w, w' \in \mathcal{W}\}$, and $\mathcal{G}_{\mu,L,D} := \{f \in \mathcal{F}_{L,D} \mid f(\cdot, x)$ is $\mu$-strongly convex for all $x \in \mathcal{X}\}$. We restrict attention to distributions $\mathcal{D}$ satisfying Assumption 1, part 3. The $(\epsilon_0, \delta_0)$-ISRL-DP algorithm class $\mathcal{A}_{(\epsilon_0, \delta_0), C}$ contains all sequentially interactive algorithms and all fully interactive, C-compositional (defined in Appendix D.3) algorithms. If $\mathcal{A}$ is sequentially interactive or $O(1)$-compositional, we write $\mathcal{A} \in \mathcal{A}_{(\epsilon_0, \delta_0)}$. The vast majority of DP algorithms in the literature are C-compositional: for example, any algorithm that uses the strong composition theorems of (Dwork & Roth, 2014; Kairouz et al., 2015; Abadi et al., 2016) in its privacy analysis is 1-compositional (see Appendix D.3.2). In particular, Noisy MB-SGD is 1-compositional, hence it is in $\mathcal{A}_{(\epsilon_0, \delta_0)}$.

Theorem 2.2 (Informal). Let $\epsilon_0 \in (0, \sqrt{N}]$, $\delta_0 = o(1/nN)$, $M = N$, and $\mathcal{A} \in \mathcal{A}_{(\epsilon_0, \delta_0), C}$. Then:
1. There exist an $f \in \mathcal{F}_{L,D}$ and a distribution $\mathcal{D}$ such that for $X \sim \mathcal{D}^{nN}$, we have
$$\mathbb{E}F(\mathcal{A}(X)) - F^* = \tilde{\Omega}\left(\frac{\phi D}{\sqrt{Nn}} + LD \min\left\{1, \frac{\sqrt{d}}{\epsilon_0 n \sqrt{N} C^2}\right\}\right).$$
2. There exist a $\mu$-smooth $f \in \mathcal{G}_{\mu,L,D}$ and a distribution $\mathcal{D}$ such that for $X \sim \mathcal{D}^{nN}$, we have
$$\mathbb{E}F(\mathcal{A}(X)) - F^* = \tilde{\Omega}\left(\frac{\phi^2}{\mu n N} + \frac{L^2}{\mu} \min\left\{1, \frac{d}{\epsilon_0^2 n^2 N C^4}\right\}\right).$$
Further, if $\mathcal{A} \in \mathcal{A}_{(\epsilon_0, \delta_0)}$, then the above lower bounds hold with $C = 1$.

The lower bounds for $\mathcal{A}_{(\epsilon_0, \delta_0)}$ are nearly tight by Theorem 2.1. The first term in each of the lower bounds is the optimal non-private rate; the second terms are proved in Appendix D.3.4. In particular, if $d \lesssim \epsilon_0^2 n$, then the non-private term in each bound is dominant, so the ISRL-DP rates match the respective non-private rates, resulting in "privacy for free" (Nemirovskii & Yudin, 1983). The ISRL-DP rates sit between the rates for LDP and CDP: higher trust allows for higher accuracy. For example, for $\mathcal{F}_{L,D}$, the LDP rate is $\Theta\big(LD \min\{1, \sqrt{d}/(\epsilon_0 \sqrt{N})\}\big)$ (Duchi et al., 2013), and the CDP rate is $\Theta\big(\phi D/\sqrt{Nn} + LD \min\{1, \sqrt{d}/(\epsilon_0 n N)\}\big)$ (Bassily et al., 2019). Theorem 2.2 is more generally applicable than existing LDP and CDP lower bounds. When $n = 1$, ISRL-DP is equivalent to LDP, and Theorem 2.2 recovers the LDP lower bounds (Duchi et al., 2013; Smith et al., 2017). However, Theorem 2.2 holds for a wider class of algorithms than the lower bounds of (Duchi et al., 2013; Smith et al., 2017), which were limited to sequentially interactive LDP algorithms. When $N = 1$, ISRL-DP is equivalent to CDP, and Theorem 2.2 recovers the CDP lower bounds (Bassily et al., 2019). Obtaining our more general lower bounds under the more complex notion of ISRL-DP is challenging: the lower bound approaches of (Duchi et al., 2013; Smith et al., 2017; Duchi & Rogers, 2019) are heavily tailored to LDP and sequentially interactive algorithms, and the applicability of the standard CDP lower bound framework (e.g. (Bassily et al., 2014; 2019)) to ISRL-DP FL is unclear. In light of these challenges, we take a different approach to proving Theorem 2.2. We first analyze the central DP guarantees of $\mathcal{A}$ when the silo data sets $X_1, \ldots, X_N$ are shuffled in each round, showing that the central DP parameter amplifies to $\epsilon = \tilde{O}(\epsilon_0/\sqrt{N})$. We could not have concluded this from existing amplification results (Erlingsson et al., 2020b; Feldman et al., 2020b; Balle et al., 2019; Cheu et al., 2019; Balle et al., 2020), since these results are all limited to sequentially interactive LDP algorithms and $n = 1$. Thus, we prove the first privacy amplification by shuffling bound for fully interactive ISRL-DP FL algorithms. Then, we apply the CDP lower bounds of Bassily et al. (2019) to $\mathcal{A}_s$, the "shuffled" version of $\mathcal{A}$; this implies that $\mathcal{A}_s$ has excess population loss lower bounded as in Theorem 2.2. Finally, we observe that the i.i.d. assumption implies that $\mathcal{A}_s$ and $\mathcal{A}$ have the same expected population loss.
Note that our proof techniques can also be used to obtain ISRL-DP lower bounds for other problems in which a CDP lower bound is known.

3. INTER-SILO RECORD-LEVEL DP FL WITH HETEROGENEOUS SILO DATA

Consider the non-i.i.d. FL problem, where $F_i(w)$ takes the form (1) for some unknown distributions $\mathcal{D}_i$ on $\mathcal{X}_i$, $i \in [N]$. The uniform stability approach that we used to obtain our i.i.d. upper bounds does not work in this setting. Instead, we directly minimize F by modifying Noisy MB-SGD as follows: 1. We draw disjoint batches of K local samples without replacement from each silo and set $R = \lfloor n/K \rfloor$; thus, stochastic gradients are independent across iterations, so our bounds apply to F. 2. We use acceleration (Ghadimi & Lan, 2012) to increase the convergence rate. 3. To provide ISRL-DP, we re-calibrate $\sigma^2$ and apply parallel composition (McSherry, 2009). We call the resulting algorithm One-Pass Accelerated Noisy MB-SGD; it is described in Algorithm 1. In the strongly convex case, we use a multi-stage implementation of One-Pass Accelerated Noisy MB-SGD (inspired by (Ghadimi & Lan, 2013)) to further expedite convergence: see Appendix E.1 for details. Carefully tuning the step sizes, $\sigma^2$, and K yields:

Theorem 3.1 ($M = N$ case). Let $f(\cdot, x)$ be $\beta$-smooth for all $x \in \mathcal{X}$, and assume $\epsilon_0 \le 8 \ln(1/\delta_0)$, $\delta_0 \in (0, 1)$. Then One-Pass Accelerated Noisy MB-SGD is $(\epsilon_0, \delta_0)$-ISRL-DP. Moreover, if $M = N$, then:
1. For convex $f(\cdot, x)$, we have
$$\mathbb{E}F(w_R^{ag}) - F^* \lesssim \frac{\phi D}{\sqrt{Nn}} + \left(\frac{\beta^{1/4} L D^{3/2} \sqrt{d \ln(1/\delta_0)}}{\epsilon_0 n \sqrt{N}}\right)^{4/5}. \qquad (6)$$
2. For $\mu$-strongly convex $f(\cdot, x)$ with $\kappa = \beta/\mu$, we have
$$\mathbb{E}F(w_R^{ag}) - F^* = \tilde{O}\left(\frac{\phi^2}{\mu n N} + \sqrt{\kappa}\, \frac{L^2}{\mu} \frac{d \ln(1/\delta_0)}{\epsilon_0^2 n^2 N}\right). \qquad (7)$$

Remarkably, the bound (7) nearly matches the optimal i.i.d. bound (5), up to the factor $\tilde{O}(\sqrt{\kappa})$. In particular, for well-conditioned loss functions, our algorithm achieves the optimal i.i.d. rates even when silo data is arbitrarily heterogeneous. The gap between the bound (6) and the i.i.d. bound (4) is $O\big((\sqrt{d}/(\epsilon_0 n \sqrt{N}))^{1/5}\big)$. Closing the gaps between the non-i.i.d. upper bounds in Theorem 3.1 and the i.i.d. lower bounds in Theorem 2.2 is left as an open problem.
Compared to previous upper bounds, Theorem 3.1 is a major improvement: see Appendix A for details.

Algorithm 1: Accelerated Noisy MB-SGD
1: Input: data $X_i \in \mathcal{X}_i^n$, $i \in [N]$; strong convexity modulus $\mu \ge 0$; noise parameter $\sigma^2$; iteration number $R \in \mathbb{N}$; batch size $K \in [n]$; step size parameters $\{\eta_r\}_{r \in [R]}$, $\{\alpha_r\}_{r \in [R]}$.
2: Initialize $w_0^{ag} = w_0 \in \mathcal{W}$ and $r = 1$.
3: for $r \in [R]$ do
4: Server computes and broadcasts $w_r^{md} := \alpha_r w_{r-1} + (1 - \alpha_r) w_{r-1}^{ag}$.
5: for $i \in S_r$ in parallel do
6: Silo i draws $\{x_{i,j}^r\}_{j=1}^{K}$ from $X_i$ (replacing samples for ERM) and noise $u_i \sim N(0, \sigma^2 I_d)$.
7: Silo i computes $\tilde{g}_r^i := \frac{1}{K} \sum_{j=1}^{K} \nabla f(w_r^{md}, x_{i,j}^r) + u_i$.
8: end for
9: Server aggregates and updates $w_r := \Pi_{\mathcal{W}}\big[w_{r-1} - \frac{\eta_r}{M_r} \sum_{i \in S_r} \tilde{g}_r^i\big]$.
10: Server updates and broadcasts $w_r^{ag} = \alpha_r w_r + (1 - \alpha_r) w_{r-1}^{ag}$.
11: end for
12: Output: $w_R^{ag}$.

In Appendix E.2, we provide a general version (and proof) of Theorem 3.1 for $M \le N$. If $M < N$ but M is sufficiently large, or silo heterogeneity is sufficiently small, then the same bounds in Theorem 3.1 hold with N replaced by M. Intuitively, the $M < N$ case is harder when data is highly heterogeneous, since stochastic estimates of $\nabla F$ then have larger variance. In Lemma E.3 (Appendix E.2), we use a combinatorial argument to bound the variance of the stochastic gradients.
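The structure of the one-pass variant can be sketched as follows, with all silos available and illustrative schedules $\alpha_r = 2/(r+2)$ and constant $\eta$ standing in for the paper's tuned parameters; the disjoint-batch ("one-pass") sampling is the key difference from the ERM variant:

```python
import numpy as np

def project_ball(w, D):
    """Euclidean projection onto the ball of radius D (one convex choice of W)."""
    nrm = np.linalg.norm(w)
    return w if nrm <= D else (D / nrm) * w

def one_pass_accelerated(silos, grad_f, d, K, sigma, eta, D, rng):
    """Sketch of One-Pass Accelerated Noisy MB-SGD (all silos available).
    Each silo's data is consumed in disjoint batches, so R = floor(n/K)
    and every stochastic gradient uses fresh samples. The md/ag
    interpolation follows Ghadimi & Lan-style acceleration; the schedule
    alpha_r = 2/(r+2) and constant eta are illustrative, not the tuned
    choices from the paper."""
    n = len(silos[0])
    R = n // K
    w = np.zeros(d)
    w_ag = np.zeros(d)
    for r in range(R):
        alpha = 2.0 / (r + 2)
        w_md = alpha * w + (1.0 - alpha) * w_ag          # extrapolation point
        reports = []
        for X_i in silos:
            batch = X_i[r * K:(r + 1) * K]               # disjoint batch: one pass
            g = np.mean([grad_f(w_md, x) for x in batch], axis=0)
            reports.append(g + rng.normal(0.0, sigma, size=d))  # silo-side noise
        w = project_ball(w - eta * np.mean(reports, axis=0), D)
        w_ag = alpha * w + (1.0 - alpha) * w_ag          # aggregated iterate
    return w_ag
```

Because each record is touched in exactly one round, parallel composition (rather than advanced composition over R rounds) governs the privacy analysis, which is what permits the smaller noise re-calibration in step 3 above.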

4. INTER-SILO RECORD-LEVEL DP FEDERATED ERM

In this section, we provide an ISRL-DP FL algorithm, Accelerated Noisy MB-SGD, with optimal excess empirical risk. The difference between this algorithm and One-Pass Accelerated Noisy MB-SGD is that silo i now samples from $X_i$ with replacement in each round: see line 6 in Algorithm 1. This allows us to a) amplify privacy via local subsampling and advanced composition/the moments accountant, allowing for smaller $\sigma^2$; and b) run more iterations of our algorithm to better optimize $\hat{F}_X$. These modifications are necessary for obtaining the optimal rates for federated ERM. We again employ Nesterov smoothing (Nesterov, 2005) to extend our results to non-smooth f.

Theorem 4.1 (Informal). Let $\epsilon_0 \le 2 \ln(2/\delta_0)$, $\delta_0 \in (0, 1)$. Then there exist algorithmic parameters such that Algorithm 1 is $(\epsilon_0, \delta_0)$-ISRL-DP and:
1. If $f(\cdot, x)$ is convex, then
$$\mathbb{E}\hat{F}_X(w_R^{ag}) - \hat{F}_X^* = \tilde{O}\left(\frac{LD \sqrt{d \ln(1/\delta_0)}}{\epsilon_0 n \sqrt{M}}\right). \qquad (8)$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex, then
$$\mathbb{E}\hat{F}_X(w_R^{ag}) - \hat{F}_X^* = \tilde{O}\left(\frac{L^2}{\mu} \frac{d \ln(1/\delta_0)}{\epsilon_0^2 n^2 M}\right). \qquad (9)$$

See Appendix F.1 for the formal statement and proof. With non-random $M_r = M$, Girgis et al. (2021) provide an upper bound for (non-strongly) convex ISRL-DP ERM that nearly matches the one we provide in (8). However, Algorithm 1 achieves the upper bounds for convex and strongly convex loss in fewer rounds of communication than (Girgis et al., 2021): see Appendix F.1 for details. In Appendix F.2, we give matching lower bounds, establishing the optimality of Algorithm 1 for ERM.

5. SHUFFLE DP FEDERATED LEARNING

Assume access to a secure shuffler and fix $M_r = M$. In each round r, the shuffler receives reports $(Z_r^{(1)}, \ldots, Z_r^{(M)})$ from the active silos (for concreteness, we assume $S_r = [M]$ here), draws a uniformly random permutation $\pi$ of $[M]$, and then sends $(Z_r^{(\pi(1))}, \ldots, Z_r^{(\pi(M))})$ to the server for aggregation. When this shuffling procedure is combined with ISRL-DP Noisy Distributed MB-SGD, we obtain:

Theorem 5.1 (i.i.d.). Let $\epsilon \le \ln(2/\delta)$, $\delta \in (0, 1)$. Then there are choices of algorithmic parameters such that Shuffled Noisy MB-SGD is $(\epsilon, \delta)$-SDP. Moreover:
1. If $f(\cdot, x)$ is convex, then
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\left(LD \left(\frac{1}{\sqrt{nM}} + \frac{\sqrt{d \ln(1/\delta)}}{\epsilon n N}\right)\right).$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex, then
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\left(\frac{L^2}{\mu} \left(\frac{1}{nM} + \frac{d \ln(1/\delta)}{\epsilon^2 n^2 N^2}\right)\right).$$

See Appendix G.1 for details and proof. When $M = N$, the rates in Theorem 5.1 match the optimal central DP rates (Bassily et al., 2019), and are attained without a trusted server. Thus, with shuffling, Noisy MB-SGD is simultaneously optimal for i.i.d. FL in the inter-silo and central DP models. If silo data is heterogeneous, then we use a shuffle DP variation of One-Pass Accelerated Noisy MB-SGD, described in Appendix G.2, to get:

Theorem 5.2 (Non-i.i.d.). Assume $f(\cdot, x)$ is $\beta$-smooth for all $x \in \mathcal{X}$. Let $\epsilon \le 15$, $\delta \in (0, 1/2)$. Then there is an $(\epsilon, \delta)$-SDP variation of One-Pass Accelerated Noisy MB-SGD such that for $M = N$, we have:
1. If $f(\cdot, x)$ is convex, then
$$\mathbb{E}F(w_R^{ag}) - F^* \lesssim \frac{\phi D}{\sqrt{Nn}} + \left(\frac{\beta^{1/4} L D^{3/2} \sqrt{d} \ln(d/\delta)}{\epsilon n N}\right)^{4/5}. \qquad (10)$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex with $\kappa = \beta/\mu$, then
$$\mathbb{E}F(w_R^{ag}) - F^* = \tilde{O}\left(\frac{\phi^2}{\mu n N} + \sqrt{\kappa}\, \frac{L^2}{\mu} \frac{d \ln(1/\delta)}{\epsilon^2 n^2 N^2}\right). \qquad (11)$$

The bound (11) matches the optimal i.i.d. central DP bound (Bassily et al., 2019) up to $\tilde{O}(\sqrt{\kappa})$. Hence, if f is not ill-conditioned, then (11) shows that neither i.i.d. data nor a trusted server is necessary to attain the optimal CDP rates. See Appendix G.2 for the proof and the $M < N$ case.
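The shuffler itself is simple to sketch: it applies a uniformly random permutation to the silo reports before the server sees them. Since the server's averaging step is permutation-invariant, shuffling changes the adversary's view (enabling the amplified SDP guarantee) without changing the algorithm's trajectory:

```python
import numpy as np

def shuffle_reports(reports, rng):
    """Secure-shuffler step: apply a uniformly random permutation pi of [M]
    to the silo reports, so the server receives the anonymized collection
    (Z^(pi(1)), ..., Z^(pi(M))) and cannot link a report to its silo."""
    perm = rng.permutation(len(reports))
    return [reports[i] for i in perm]
```

In Shuffled Noisy MB-SGD, this step is inserted between the silos' noisy gradient reports and the server's aggregation in each round.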

6. NUMERICAL EXPERIMENTS

We validate our theoretical findings with three sets of experiments. Our results indicate that ISRL-DP MB-SGD yields accurate, private models in practice. Our method performs well even relative to non-private Local SGD, a.k.a. FedAvg (McMahan et al., 2017), and outperforms ISRL-DP Local SGD for most privacy levels. Appendix I contains the details of the experiments and additional results. Binary Logistic Regression with MNIST: Following (Woodworth et al., 2020b), we consider binary logistic regression on MNIST (LeCun & Cortes, 2010). The task is to classify digits as odd or even. Each of 25 odd/even digit pairings is assigned to a silo ($N = 25$). Fig. 4 shows that ISRL-DP MB-SGD outperforms (non-private) Local SGD for $\epsilon \ge 12$ and outperforms ISRL-DP Local SGD. Linear Regression with Health Insurance Data: We divide the data set (Choi, 2018)

7. CONCLUDING REMARKS AND OPEN QUESTIONS

This paper considered FL without a trusted server and advocated for inter-silo record-level DP (ISRL-DP) as a practical privacy notion in this setting, particularly in cross-silo applications. We provided optimal ISRL-DP algorithms for convex/strongly convex FL in both the i.i.d. and ERM settings when all $M = N$ clients are able to communicate. The i.i.d. rates sit between the rates for the stringent "no trust" local DP and relaxed "high trust" central DP notions, and allow for "privacy for free" when $d \lesssim \epsilon_0^2 n$. As a side result, in Appendix D.3.5, we established the optimal rates for cross-device FL with user-level DP in the absence of a trusted server. Additionally, we devised an accelerated ISRL-DP algorithm to obtain state-of-the-art upper bounds for heterogeneous FL. We also gave a shuffle DP algorithm that attains the optimal central DP rates for i.i.d. FL and nearly attains them for non-i.i.d. FL. An open problem is to close the gap between our i.i.d. lower bounds and non-i.i.d. upper bounds: e.g. can the $\sqrt{\kappa}$ in (7) be removed? Also, when $M < N$, are our upper bounds tight? Finally, what performance is possible for non-convex ISRL-DP FL?

A. THOROUGH DISCUSSION OF RELATED WORK

Federated Learning: In the absence of differential privacy constraints, federated learning (FL) has received much attention from researchers in recent years. The most relevant works to ours are (Koloskova et al., 2020; Li et al., 2020; Karimireddy et al., 2020; Woodworth et al., 2020a;b; Yuan & Ma, 2020), which prove bounds on the convergence rates of FL algorithms. From an algorithmic standpoint, all of these works propose and analyze either Minibatch SGD (MB-SGD), FedAvg/Local SGD (McMahan et al., 2017), or an extension or accelerated/variance-reduced variation of one of these (e.g. SCAFFOLD (Karimireddy et al., 2020)). Notably, Woodworth et al.
(2020b) proves tight upper and lower bounds that establish the near-optimality of accelerated MB-SGD for the heterogeneous SCO problem with non-random $M_r = M = N$ in a fairly wide parameter regime. Lobel & Ozdaglar (2010); Touri & Gharesifard (2015); Nedic et al. (2017) provide convergence results with random connectivity graphs. Our upper bounds describe the effect of the mean of $1/M_r$ on DP FL. DP Optimization: In the centralized setting, DP ERM and SCO are well understood for convex and strongly convex loss functions (Bassily et al., 2014; Wang et al., 2017; Bassily et al., 2019; Feldman et al., 2020a; Lowy & Razaviyayn, 2021; Asi et al., 2021). Tight excess risk bounds for local DP SCO were provided in Duchi et al. (2013); Smith et al. (2017). A few works have also considered shuffle DP ERM and SCO. Girgis et al. (2021); Erlingsson et al. (2020a) showed that the optimal CDP convex ERM rate can be attained in the lower-trust (relative to the central model) shuffle model of DP. The main difference between our treatment of shuffle DP and that of Cheu et al. (2021) is that our results are much more general. For example, Cheu et al. (2021) does not consider FL: they consider the simpler problem of stochastic convex optimization (SCO), a special case of FL in which each silo has only $n = 1$ sample. Additionally, Cheu et al. (2021) only considers the i.i.d. case, not the more challenging non-i.i.d. case. Further, Cheu et al. (2021) assumes perfect communication ($M = N$), while we also analyze the case where $M < N$ and some silos are unavailable in certain rounds (e.g. due to internet issues). Note that our bounds in Theorem 5.1 recover the results of Cheu et al. (2021) in the special case considered in their work. DP Federated Learning: More recently, there have been many proposed attempts to ensure the privacy of individuals' data during and after the FL process.
Some of these use secure multi-party computation (MPC) (Chen et al., 2018; Ma et al., 2018), but this approach leaves users vulnerable to inference attacks on the trained model and does not provide the rigorous guarantee of DP. Others (McMahan et al., 2018; Geyer et al., 2017; Jayaraman & Wang, 2018; Gade & Vaidya, 2018; Wei et al., 2020a; Zhou & Tang, 2020; Levy et al., 2021; Ghazi et al., 2021; Noble et al., 2022) use user-level DP or central DP (CDP), which rely on a trusted third party, or hybrid DP/MPC approaches (Jayaraman & Wang, 2018; Truex et al., 2019). The work of Jayaraman & Wang (2018) proves CDP empirical risk bounds and high-probability guarantees on the population loss when the data is i.i.d. across silos. However, ISRL-DP and SDP are not considered, nor is heterogeneous (non-i.i.d.) FL. It is also worth mentioning that Geyer et al. (2017) consider random $M_r$ but do not prove any bounds. Despite this progress, prior to our present work, very little was known about the excess risk potential of ISRL-DP FL algorithms, except in the two extreme corner cases of $N = 1$ and $n = 1$. When $N = 1$, ISRL-DP and CDP are essentially equivalent; tight ERM (Bassily et al., 2014) and i.i.d. SCO (Bassily et al., 2019; Feldman et al., 2020a) bounds are known for this case. In addition, for LDP i.i.d. SCO with $n = 1$ and $M_r = N$, Duchi et al. (2013) establish the minimax optimal rate for the class of sequentially interactive algorithms and non-strongly convex loss functions. To the best of our knowledge, all previous works examining the general ISRL-DP FL problem with arbitrary $n, M, N \geq 1$ either focus on ERM and/or do not provide excess risk bounds that scale with both $M$ and $n$, making the upper bounds provided in the present work significantly tighter. Furthermore, none of the existing works on ISRL-DP FL provide lower bounds, or upper bounds for the case of random $M_r$.
We discuss each of these works in turn below. Truex et al. (2020) give an ISRL-DP FL algorithm, but no risk bounds are provided in their work. The works of Huang et al. (2020) and Huang & Gong (2020) use ISRL-DP ADMM algorithms for smooth convex federated ERM. However, the utility bounds in their works are stated in terms of an average of the silo functions evaluated at different points, so it is not clear how to relate their results to the standard performance measure for learning (which we consider in this paper): expected excess risk at the point $\hat{w}$ output by the algorithm. Also, no lower bounds are provided for their performance measure, so the sub-optimality gap of their result is not clear. (Wu et al., 2019, Theorem 2) provides an $(\epsilon, 0)$-ISRL-DP ERM bound for fixed $M_r = M = N$ of $O\big(\kappa \frac{L^2}{\mu}\frac{d}{nN\epsilon^2} + \epsilon\big)$ for $\mu$-strongly convex, $\beta$-smooth $f$ with condition number $\kappa = \beta/\mu$, where $1/\epsilon^2$ is an average of the $1/\epsilon_i^2$. The additive $\epsilon$ term is clearly problematic: e.g., if $\epsilon = \Theta(1)$, the bound becomes trivial. Ignoring this term, the first term in their bound is still looser than the bound that we provide in Theorem 4.1. Namely, for $\epsilon = \epsilon_0$, our bound in part 2 of Theorem 4.1 is tighter by a factor of $O\big(\frac{\ln(1/\delta_0)}{\kappa n}\big)$ and does not require $\beta$-smoothness of the loss. Additionally, the bounds in (Wu et al., 2019) require $R$ "large enough" and do not come with communication complexity guarantees. In the convex case, the ISRL-DP ERM bound reported in (Wu et al., 2019, Theorem 3) is difficult to interpret because the unspecified "constants" in the upper bound on $\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^*$ are said to be allowed to depend on $R$. (Wei et al., 2020b, Theorems 2-3) provide convergence rates for smooth convex Polyak-Łojasiewicz (a generalization of strong convexity) ISRL-DP ERM, which are complicated non-monotonic functions of $R$. Since they do not prescribe a choice of $R$, it is unclear what excess loss and communication complexity bounds are attainable with their algorithm.
Dobbe et al. (2020) propose an ISRL-DP Inexact Alternating Minimization Algorithm (IAMA) with Laplace noise; their result (Dobbe et al., 2020, Theorem 3.11) gives a convergence rate for smooth, strongly convex ISRL-DP FL of order $O\big(\Theta \sum_{i \in [M]} \frac{\sigma_i^2}{R}\big)$, ignoring smoothness and strong convexity factors, where $\Theta$ is a parameter that is only upper bounded in special cases (e.g., quadratic objective). Thus, the bounds given in (Dobbe et al., 2020) are not complete for general strongly convex loss functions. Even in the special cases where a bound on $\Theta$ is provided, our bounds are tighter. Assuming that $\Theta = 1$ and (for simplicity of exposition) that parameters are the same across silos, (Dobbe et al., 2020, Theorem 3.11) implies taking $\sigma^2 = 1/\epsilon^2$ to ensure $(\epsilon, 0)$-ISRL-DP. The resulting convergence rate is then $O(M/\epsilon^2)$, which does not scale with $n$ and is increasing in $M$. Also, the dependence of their rate on the dimension $d$ is unclear, as it does not appear explicitly in their theorem. Ignoring this issue, the dependence on $M$ and $n$ in the bound of (Dobbe et al., 2020) is still looser than all of the excess loss bounds that we provide in the present work. Zhao et al. (2020) and Arachchige et al. (2019) apply the ISRL-DP FL framework to the Internet of Things, and Seif et al. (2020) use noisy (full-batch) GD for ISRL-DP wireless channels in the FL (smooth strongly convex) ERM setting. The bounds in these works do not scale with the number of data points $n$, however (only with the number of silos $N$). Therefore, the bounds that we provide in the present work are tighter, and apply to general convex FL problems beyond wireless channels and the Internet of Things.

B RIGOROUS DEFINITION OF INTER-SILO RECORD-LEVEL DP

Recall: Definition 4 (Differential Privacy). Let $\epsilon \geq 0$, $\delta \in [0, 1)$. A randomized algorithm $\mathcal{A}: \mathcal{X} \to \mathcal{W}$ is $(\epsilon, \delta)$-DP if for all $\rho$-adjacent data sets $X, X' \in \mathcal{X}$ and all measurable subsets $S \subseteq \mathcal{W}$, we have $\mathbb{P}(\mathcal{A}(X) \in S) \leq e^{\epsilon}\,\mathbb{P}(\mathcal{A}(X') \in S) + \delta$. If this inequality holds for all measurable subsets $S$, then we denote this property by $\mathcal{A}(X) \approx_{(\epsilon,\delta)} \mathcal{A}(X')$.

An $R$-round fully interactive randomized algorithm $\mathcal{A}$ for FL is characterized in every round $r \in [R]$ by $N$ local silo functions called randomizers $\mathcal{R}_r^{(i)}: \mathcal{Z}^{(r-1) \times N} \times \mathcal{X}_i^n \to \mathcal{Z}$ ($i \in [N]$) and an aggregation mechanism. (See below for an example that further clarifies the terminology used in this paragraph.) The randomizers send messages $Z_r^{(i)} := \mathcal{R}_r^{(i)}(Z_{1:r-1}, X_i)$ to the server or other silos. The messages $Z_r^{(i)}$ may depend on silo data $X_i$ and the outputs $Z_{1:r-1} := \{Z_t^{(j)}\}_{j \in [N],\, t \in [r-1]}$ of the silos' randomizers in prior rounds. Then, the server (or the silos, for peer-to-peer FL) updates the global model. We consider the output of $\mathcal{A}: \mathcal{X} \to \mathcal{Z}^{R \times N}$ to be the transcript of all silos' communications: i.e., the collection of all $N \times R$ messages $\{Z_r^{(i)}\}$. Algorithm $\mathcal{A}$ is $(\epsilon_0, \delta_0)$-ISRL-DP if for all silos $i$, the full transcript $\{Z_r^{(i)}\}_{r \in [R]}$ is $(\epsilon_0, \delta_0)$-DP for any fixed settings of the other silos' messages and data. More precisely:

Definition 5 (Inter-Silo Record-Level Differential Privacy). Let $\rho_i(X_i, X_i') := \sum_{j=1}^n \mathbb{1}_{\{x_{i,j} \neq x_{i,j}'\}}$, $i \in [N]$. A randomized algorithm $\mathcal{A}$ is $(\epsilon_0, \delta_0)$-ISRL-DP if for all silos $i$ and all $\rho_i$-adjacent $X_i, X_i'$,
$$\big(\mathcal{R}_1^{(i)}(X_i),\, \mathcal{R}_2^{(i)}(Z_1, X_i),\, \cdots,\, \mathcal{R}_R^{(i)}(Z_{1:R-1}, X_i)\big) \approx_{(\epsilon_0, \delta_0)} \big(\mathcal{R}_1^{(i)}(X_i'),\, \mathcal{R}_2^{(i)}(Z_1', X_i'),\, \cdots,\, \mathcal{R}_R^{(i)}(Z_{1:R-1}', X_i')\big),$$
where $Z_r := \{\mathcal{R}_r^{(i)}(Z_{1:r-1}, X_i)\}_{i=1}^N$ and $Z_r' := \{\mathcal{R}_r^{(i)}(Z_{1:r-1}', X_i')\}_{i=1}^N$.
Example clarifying the terminology used in the definition of ISRL-DP given above: Assume all $M_r = N$ silos are available in every round and let $\mathcal{A}$ be the minibatch SGD algorithm, $w_{r+1} := w_r - \eta g_r$, where $g_r = \frac{1}{NK}\sum_{i=1}^N \sum_{j=1}^K \nabla f(w_r, x_{i,j}^r)$ for $\{x_{i,j}^r\}_{j=1}^K$ drawn randomly from $X_i$. Then the randomizers $\mathcal{R}_r^{(i)}: \mathcal{Z}^{(r-1)\times N} \times \mathcal{X}_i^n \to \mathcal{Z}$ of silo $i$ are its stochastic gradients: $Z_r^{(i)} = \mathcal{R}_r^{(i)}(Z_{1:r-1}, X_i) = \frac{1}{K}\sum_{j=1}^K \nabla f(w_r, x_{i,j}^r)$ for $\{x_{i,j}^r\}_{j=1}^K$ drawn randomly from $X_i$. Note that the output of these randomizers depends on $w_r$, which is a function of the previous stochastic gradients of all silos, $Z_{1:r-1} = \{Z_t^{(i)}\}_{i \in [N],\, t \in [r-1]}$. The aggregation mechanism outputs $g_r$ by simply averaging the outputs of the silos' randomizers: $g_r = \frac{1}{N}\sum_{i=1}^N Z_r^{(i)}$. We view the output of $\mathcal{A}: \mathcal{X} \to \mathcal{Z}^{R \times N}$ to be the transcript of all silos' communications, which in this case is the collection of all $N \times R$ stochastic minibatch gradients $\{Z_r^{(i)}\}_{i \in [N],\, r \in [R]}$. Note that in practice the algorithm $\mathcal{A}$ does not truly output a list of gradients, but rather outputs $\hat{w} \in \mathcal{W}$ that is some convex combination of the iterates $\{w_r\}_{r \in [R]}$, which themselves are functions of $\{Z_r^{(i)}\}_{i \in [N],\, r \in [R]}$. However, by the post-processing property of DP (Dwork & Roth, 2014, Proposition 2.1), the privacy of $\hat{w}$ is guaranteed if the silo transcripts are DP. Thus, here we simply consider the output of $\mathcal{A}$ to be the silo transcripts. Clearly, minibatch SGD is not ISRL-DP. To make it ISRL-DP, it is necessary to introduce additional randomness to ensure that each silo's collection of stochastic gradients is DP, conditional on the messages and data of all other silos. For example, Noisy MB-SGD is an ISRL-DP variation of (projected) minibatch SGD.
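The transcript view described above can be made concrete in a few lines of code. The following illustrative sketch (not from the paper; it uses a hypothetical toy loss $f(w,x) = \frac{1}{2}\|w-x\|^2$) runs non-private MB-SGD while recording each silo's messages, then recomputes the final iterate from the transcript alone, illustrating why post-processing lets us treat the transcript as the algorithm's output:

```python
import numpy as np

def mb_sgd_with_transcript(silo_data, R, K, eta, rng):
    """Non-private minibatch SGD for the toy loss f(w, x) = 0.5*||w - x||^2,
    recording the transcript {Z_r^(i)}: each silo's message is its local
    minibatch gradient, and the server update depends only on the messages."""
    n, d = silo_data[0].shape
    w = np.zeros(d)
    transcript = []  # transcript[r][i] = Z_r^(i)
    for r in range(R):
        msgs = []
        for Xi in silo_data:
            idx = rng.integers(0, n, size=K)           # sample with replacement
            msgs.append(np.mean(w - Xi[idx], axis=0))  # Z_r^(i), grad f = w - x
        transcript.append(msgs)
        w = w - eta * np.mean(msgs, axis=0)            # aggregation: average messages
    return w, transcript

def replay_from_transcript(transcript, d, eta):
    """Post-processing: the final iterate is a deterministic function of the
    transcript, so DP for the transcript implies DP for the output."""
    w = np.zeros(d)
    for msgs in transcript:
        w = w - eta * np.mean(msgs, axis=0)
    return w

rng = np.random.default_rng(1)
data = [rng.normal(0.0, 1.0, size=(20, 3)) for _ in range(5)]  # N = 5 silos, n = 20
w_final, T = mb_sgd_with_transcript(data, R=10, K=4, eta=0.5, rng=rng)
w_replayed = replay_from_transcript(T, d=3, eta=0.5)
assert np.allclose(w_final, w_replayed)
```

The assertion passes because the server's iterates are fully determined by the messages, which is exactly the sense in which privacy of the transcript suffices.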

C RELATIONSHIPS BETWEEN NOTIONS OF DP

C.1 ISRL-DP IS STRONGER THAN CDP

Assume $\mathcal{A}$ is $(\epsilon_0, \delta_0)$-ISRL-DP. Let $X, X'$ be adjacent databases in the CDP sense; i.e., there exists a unique $i \in [N]$, $j \in [n]$ such that $x_{i,j} \neq x_{i,j}'$. Then for all $r \in [R]$ and $l \neq i$, $X_l = X_l'$, so the conditional distributions of $\mathcal{R}_r^{(l)}(Z_{1:r-1}, X_l)$ and $\mathcal{R}_r^{(l)}(Z_{1:r-1}', X_l')$ given $Z_{1:r-1}^{(l' \neq l)} = z_{1:r-1}^{(l' \neq l)}$ are identical for all $z_{1:r-1}^{(l' \neq l)} \in \mathcal{Z}^{(r-1)\times(N-1)}$. Integrating both sides of this equality with respect to the joint density of $Z_{1:r-1}^{(l' \neq l)}$ shows that $\mathcal{R}_r^{(l)}(Z_{1:r-1}, X_l) = \mathcal{R}_r^{(l)}(Z_{1:r-1}', X_l')$ (unconditional equality of distributions). Hence the full transcript of silo $l$ is (unconditionally) $(0,0)$-CDP for all $l \neq i$. A similar argument (using the inequality (3) instead of equality) shows that silo $i$'s full transcript is unconditionally $(\epsilon_0, \delta_0)$-CDP. Therefore, by the basic composition theorem for DP (Dwork & Roth, 2014), the full combined transcript of all $N$ silos is $(\epsilon_0, \delta_0)$-CDP, which implies that $\mathcal{A}$ is $(\epsilon_0, \delta_0)$-CDP. Conversely, $(\epsilon, \delta)$-CDP does not imply $(\epsilon', \delta')$-ISRL-DP for any $\epsilon' > 0$, $\delta' \in (0,1)$: a CDP algorithm may send non-private updates to the server and rely on the server to randomize, completely violating the requirement of ISRL-DP.

Footnote 12: We assume that $\mathcal{R}_r^{(i)}(Z_{1:r-1}, X_i)$ does not depend on $X_j$ ($j \neq i$) given $Z_{1:r-1}$ and $X_i$; that is, the distribution of $\mathcal{R}_r^{(i)}$ is completely characterized by $Z_{1:r-1}$ and $X_i$. Therefore, the randomizers of silo $i$ cannot "eavesdrop" on another silo's data, which aligns with the local data principle of FL. We allow $Z_t^{(i)}$ to be empty/zero if silo $i$ does not output anything to the server in round $t$.

C.2 ISRL-DP IMPLIES USER-LEVEL DP FOR SMALL $(\epsilon_0, \delta_0)$

Precisely, we claim that if $\mathcal{A}$ is $(\epsilon_0, \delta_0)$-ISRL-DP, then $\mathcal{A}$ is $(n\epsilon_0,\, n e^{(n-1)\epsilon_0}\delta_0)$ user-level DP; conversely, $(\epsilon, \delta)$-user-level DP does not imply $(\epsilon', \delta')$-ISRL-DP for any $\epsilon' > 0$, $\delta' \in (0, 1)$.
The first part of the claim follows from group privacy (see (Kamath, 2020, Theorem 10)) and the argument used above in Appendix C.1 to remove the conditioning. The second part of the claim holds because a user-level DP algorithm may send non-private updates to the server and rely on the server to randomize, completely violating the requirement of ISRL-DP. Therefore, if $\epsilon_0 = O(1/n)$ and $\delta_0 \ll 1/n$, then any $(\epsilon_0, \delta_0)$-ISRL-DP algorithm also provides a strong user-level privacy guarantee.
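As a quick numerical illustration (a sketch, not part of the formal argument), the group-privacy conversion above can be computed directly; with $\epsilon_0 = O(1/n)$ the resulting user-level parameters indeed stay small:

```python
import math

def user_level_from_isrl_dp(eps0, delta0, n):
    """Group privacy (Kamath, 2020, Thm. 10): an (eps0, delta0) record-level
    guarantee over n records per user yields an
    (n * eps0, n * exp((n - 1) * eps0) * delta0) user-level guarantee."""
    return n * eps0, n * math.exp((n - 1) * eps0) * delta0

# With eps0 = O(1/n) and delta0 << 1/n, the user-level epsilon is O(1):
eps_u, delta_u = user_level_from_isrl_dp(eps0=1.0 / 1000, delta0=1e-10, n=1000)
```

Here `user_level_from_isrl_dp` is a hypothetical helper name; with these parameters the user-level guarantee is roughly $(1, 2.7 \times 10^{-7})$.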

D PROOFS AND SUPPLEMENTARY MATERIAL FOR SECTION 2

D.1 PSEUDOCODE OF NOISY MB-SGD

We present pseudocode for Noisy MB-SGD in Algorithm 2.

Algorithm 2: Noisy ISRL-DP MB-SGD
1: Input: $N, d, R \in \mathbb{N}$, $\sigma^2 \geq 0$, $X_i \in \mathcal{X}_i^n$ for $i \in [N]$, loss function $f(w, x)$, $K \in [n]$, $\{\eta_r\}_{r \in [R]}$ and $\{\gamma_r\}_{r \in [R]}$.
2: Initialize $w_0 \in \mathcal{W}$.
3: for $r \in \{0, 1, \cdots, R-1\}$ do
4:   for $i \in S_r$ in parallel do
5:     Server sends global model $w_r$ to silo $i$.
6:     Silo $i$ draws $K$ samples $x_{i,j}^r$ uniformly from $X_i$ (for $j \in [K]$) and noise $u_i \sim N(0, \sigma^2 \mathbf{I}_d)$.
7:     Silo $i$ computes $\tilde{g}_r^i := \frac{1}{K}\sum_{j=1}^K \nabla f(w_r, x_{i,j}^r) + u_i$ and sends it to the server.
8:   end for
9:   Server aggregates $\tilde{g}_r := \frac{1}{M_r}\sum_{i \in S_r} \tilde{g}_r^i$.
10:  Server updates $w_{r+1} := \Pi_{\mathcal{W}}[w_r - \eta_r \tilde{g}_r]$.
11: end for
12: Output: $\hat{w}_R = \frac{1}{\Gamma_R}\sum_{r=0}^{R-1} \gamma_r w_r$, where $\Gamma_R := \sum_{r=0}^{R-1} \gamma_r$.
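To make the algorithm concrete, here is a minimal numerical sketch (illustrative only, not part of the analysis) of Noisy MB-SGD for a hypothetical toy loss $f(w,x) = \frac{1}{2}\|w-x\|^2$, together with the noise-variance formula $\sigma^2 = 256 L^2 R \ln(2.5R/\delta_0)\ln(2/\delta_0)/(n^2\epsilon_0^2)$ prescribed later in Theorem D.1; each silo perturbs its own minibatch gradient before communicating, so the server never sees raw gradients:

```python
import math
import numpy as np

def sigma_squared(L, R, n, eps0, delta0):
    """Noise variance prescribed in Theorem D.1:
    sigma^2 = 256 L^2 R ln(2.5R/delta0) ln(2/delta0) / (n^2 eps0^2)."""
    return (256 * L**2 * R * math.log(2.5 * R / delta0)
            * math.log(2 / delta0) / (n**2 * eps0**2))

def noisy_mb_sgd(silo_data, sigma, R, K, eta, radius, rng):
    """Sketch of Algorithm 2 for the toy loss f(w, x) = 0.5*||w - x||^2
    (so grad f = w - x), with W the Euclidean ball of the given radius."""
    n, d = silo_data[0].shape
    w = np.zeros(d)
    iterates = []
    for r in range(R):
        noisy = []
        for Xi in silo_data:  # all M_r = N silos respond
            idx = rng.integers(0, n, size=K)             # uniform, with replacement
            g = np.mean(w - Xi[idx], axis=0)             # local minibatch gradient
            noisy.append(g + rng.normal(0.0, sigma, d))  # per-silo privacy noise
        w = w - eta * np.mean(noisy, axis=0)
        if np.linalg.norm(w) > radius:                   # projection onto W
            w *= radius / np.linalg.norm(w)
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)                     # gamma_r = 1/R averaging

rng = np.random.default_rng(0)
data = [rng.normal(1.0, 0.1, size=(50, 5)) for _ in range(4)]  # N = 4 silos, n = 50
w_hat = noisy_mb_sgd(data, sigma=0.01, R=200, K=10, eta=0.1, radius=10.0, rng=rng)
```

The silo loop would run in parallel in a real deployment; a small fixed `sigma` is used here purely for illustration, whereas a private run would set it via `sigma_squared`.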

D.2 PROOF OF THEOREM 2.1

We begin by proving Theorem 2.1 for $\beta$-smooth $f(\cdot, x)$ and then extend our result to the non-smooth case via Nesterov smoothing (Nesterov, 2005). We will require some preliminaries. We begin with the following definition from (Bousquet & Elisseeff, 2002):

Definition 6 (Uniform Stability). A randomized algorithm $\mathcal{A}: \mathcal{X}^{\tilde{N}} \to \mathcal{W}$ is said to be $\alpha$-uniformly stable (w.r.t. loss function $f: \mathcal{W} \times \mathcal{X} \to \mathbb{R}$) if for any pair of adjacent data sets $X, X' \in \mathcal{X}^{\tilde{N}}$, $|X \Delta X'| \leq 2$, we have $\sup_{x \in \mathcal{X}} \mathbb{E}_{\mathcal{A}}[f(\mathcal{A}(X), x) - f(\mathcal{A}(X'), x)] \leq \alpha$.

In our context, $\tilde{N} = N \times n$. The following well-known lemma allows us to easily pass from empirical risk to population loss when the algorithm in question is uniformly stable:

Lemma D.1. Let $\mathcal{A}: \mathcal{X}^{\tilde{N}} \to \mathcal{W}$ be $\alpha$-uniformly stable w.r.t. a convex loss function $f: \mathcal{W} \times \mathcal{X} \to \mathbb{R}$. Let $\mathcal{D}$ be any distribution over $\mathcal{X}$ and let $X \sim \mathcal{D}^{\tilde{N}}$. Then the excess population loss is upper bounded by the excess expected empirical loss plus $\alpha$:
$$\mathbb{E}[F(\mathcal{A}(X), \mathcal{D}) - F^*] \leq \alpha + \mathbb{E}\Big[\hat{F}_X(\mathcal{A}(X)) - \min_{w \in \mathcal{W}} \hat{F}_X(w)\Big],$$
where the expectations are over both the randomness in $\mathcal{A}$ and the sampling of $X \sim \mathcal{D}^{\tilde{N}}$. Here we denote the empirical loss by $\hat{F}_X(w)$ and the population loss by $F(w, \mathcal{D})$ for additional clarity, and $F^* := \min_{w \in \mathcal{W}} F(w, \mathcal{D})$.

Proof. By Theorem 2.2 in (Hardt et al., 2016), $\mathbb{E}[F(\mathcal{A}(X), \mathcal{D}) - \hat{F}_X(\mathcal{A}(X))] \leq \alpha$. Hence
$$\mathbb{E}[F(\mathcal{A}(X), \mathcal{D}) - F^*] = \mathbb{E}\Big[F(\mathcal{A}(X), \mathcal{D}) - \hat{F}_X(\mathcal{A}(X)) + \hat{F}_X(\mathcal{A}(X)) - \min_{w \in \mathcal{W}}\hat{F}_X(w) + \min_{w \in \mathcal{W}}\hat{F}_X(w) - F^*\Big] \leq \alpha + \mathbb{E}\Big[\hat{F}_X(\mathcal{A}(X)) - \min_{w \in \mathcal{W}}\hat{F}_X(w)\Big],$$
since $\mathbb{E}\min_{w \in \mathcal{W}}\hat{F}_X(w) \leq \min_{w \in \mathcal{W}}\mathbb{E}[\hat{F}_X(w)] = \min_{w \in \mathcal{W}} F(w, \mathcal{D}) = F^*$.

The next step is to bound the uniform stability of Algorithm 2.

Lemma D.2. Let $f(\cdot, x)$ be a convex, $L$-Lipschitz, and $\beta$-smooth loss for all $x \in \mathcal{X}$. Let $n := \min\{n_i\}_{i=1}^N$. Then under Assumption 3, Noisy MB-SGD with constant stepsize $\eta \leq \frac{1}{\beta}$ and averaging weights $\gamma_r = \frac{1}{R}$ is $\alpha$-uniformly stable with respect to $f$ for $\alpha = \frac{2L^2 R\eta}{nM}$.
If, in addition, $f(\cdot, x)$ is $\mu$-strongly convex for all $x \in \mathcal{X}$, then Noisy MB-SGD with constant step size $\eta_r = \eta \leq \frac{1}{\beta}$ and any averaging weights $\{\gamma_r\}_{r=1}^R$ is $\alpha$-uniformly stable with respect to $f$ for $\alpha = \frac{4L^2}{\mu(Mn - 1)}$ (assuming $\min\{M, n\} > 1$).

Proof of Lemma D.2. The proof of the convex case extends techniques and arguments used in the proofs of (Hardt et al., 2016, Theorem 3.8), (Feldman & Vondrak, 2019, Lemma 4.3), and (Bassily et al., 2019, Lemma 3.4) to the ISRL-DP FL setting; the strongly convex bound requires additional work to get a tight bound. For now, fix the randomness of $\{M_r\}_{r \geq 0}$. Let $X, X' \in \mathcal{X}^{\tilde{N}}$ be two data sets, denoted $X = (X_1, \cdots, X_N)$ for $X_i \in \mathcal{X}^n$ for all $i \in [N]$ and similarly for $X'$, and assume $|X \Delta X'| = 2$. Then there is a unique $a \in [N]$ and $b \in [n]$ such that $x_{a,b} \neq x_{a,b}'$. For $t \in \{0, 1, \cdots, R\}$, denote the $t$-th iterates of Algorithm 2 on these two data sets by $w_t = w_t(X)$ and $w_t' = w_t(X')$, respectively. We claim that
$$\mathbb{E}\big[\|w_t - w_t'\| \,\big|\, \{M_r\}_{0 \leq r \leq t}\big] \leq \frac{2L\eta}{n}\sum_{r=0}^t \frac{1}{M_r} \quad (13)$$
for all $t$. We prove the claim by induction. It is trivially true when $t = 0$. Suppose (13) holds for all $t \leq \tau$. Denote the samples in each local minibatch at iteration $\tau$ by $\{x_{i,j}\}_{i \in [N],\, j \in [K]}$ (dropping the $\tau$ for brevity). Assume WLOG that $S_\tau = [M_\tau]$. First condition on the randomness due to minibatch sampling and the Gaussian noise, whose average we denote by $\bar{u} = \frac{1}{M_\tau}\sum_{i \in S_\tau} u_i$. Also denote (for $t \geq 0$) $\tilde{g}_t := \frac{1}{M_t K}\sum_{i \in S_t,\, j \in [K]} \nabla f(w_t, x_{i,j}) + \bar{u}$ and $\tilde{g}_t' := \frac{1}{M_t K}\sum_{i \in S_t,\, j \in [K]} \nabla f(w_t', x_{i,j}') + \bar{u}$. Then, by the same argument used in Lemma 3.4 of (Bassily et al., 2019), we can effectively ignore the noise in our analysis of step $\tau + 1$ of the algorithm, since the same (conditionally non-random) update is performed on $X$ and $X'$, so the noises cancel out.
More precisely, by non-expansiveness of the projection and of the gradient descent step for $\eta \leq \frac{2}{\beta}$ (see Lemma 3.7 in (Hardt et al., 2016)), we have
$$\|w_{\tau+1} - w_{\tau+1}'\| = \big\|\Pi_{\mathcal{W}}(w_\tau - \eta_\tau\tilde{g}_\tau) - \Pi_{\mathcal{W}}(w_\tau' - \eta_\tau\tilde{g}_\tau')\big\| \leq \big\|(w_\tau - \eta_\tau\tilde{g}_\tau) - (w_\tau' - \eta_\tau\tilde{g}_\tau')\big\|$$
$$\leq \bigg\|\bigg(w_\tau - \eta_\tau\Big(\frac{1}{M_\tau K}\sum_{(i,j) \neq (a,b)}\nabla f(w_\tau, x_{i,j}) + \bar{u}\Big)\bigg) - \bigg(w_\tau' - \eta_\tau\Big(\frac{1}{M_\tau K}\sum_{(i,j) \neq (a,b)}\nabla f(w_\tau', x_{i,j}) + \bar{u}\Big)\bigg)\bigg\| + \frac{q_\tau\eta_\tau}{M_\tau K}\big\|\nabla f(w_\tau, x_{a,b}) - \nabla f(w_\tau', x_{a,b}')\big\|$$
$$\leq \|w_\tau - w_\tau'\| + \frac{q_\tau\eta_\tau}{M_\tau K}\big\|\nabla f(w_\tau, x_{a,b}) - \nabla f(w_\tau', x_{a,b}')\big\|,$$
where $q_\tau \in \{0, 1, \cdots, K\}$ is a realization of the random variable $Q_\tau$ that counts the number of times index $b$ occurs in silo $a$'s local minibatch at iteration $\tau$. (Recall that we sample uniformly with replacement.) Now $Q_\tau$ is a sum of $K$ independent Bernoulli$(\frac{1}{n_a})$ random variables, hence $\mathbb{E}Q_\tau = \frac{K}{n_a}$. Using the inductive hypothesis and taking expectation over the randomness of the Gaussian noise and the minibatch sampling proves the claim. Next, taking expectation with respect to the randomness of $\{M_r\}_{r \in [t]}$ implies $\mathbb{E}\|w_t - w_t'\| \leq \frac{2L\eta t}{nM}$, since the $M_r$ are i.i.d. with $\mathbb{E}(\frac{1}{M_1}) = \frac{1}{M}$. Then Jensen's inequality and Lipschitz continuity of $f(\cdot, x)$ imply that for any $x \in \mathcal{X}$,
$$\mathbb{E}[f(\bar{w}_R, x) - f(\bar{w}_R', x)] \leq L\,\mathbb{E}\|\bar{w}_R - \bar{w}_R'\| \leq \frac{L}{R}\sum_{t=0}^{R-1}\mathbb{E}\|w_t - w_t'\| \leq \frac{2L^2\eta}{RMn}\cdot\frac{R(R+1)}{2} = \frac{L^2\eta(R+1)}{Mn},$$
completing the proof of the convex case.

Next, suppose $f$ is $\mu$-strongly convex. The proof begins identically to the convex case. We condition on $M_r$, $u_i$, and $S_r$ as before and (keeping the same notation used there) obtain, for any $r \geq 0$,
$$\|w_{r+1} - w_{r+1}'\| \leq \bigg\|\bigg(w_r - \eta_r\Big(\frac{1}{M_r K}\sum_{(i,j) \neq (a,b)}\nabla f(w_r, x_{i,j}) + \bar{u}\Big)\bigg) - \bigg(w_r' - \eta_r\Big(\frac{1}{M_r K}\sum_{(i,j) \neq (a,b)}\nabla f(w_r', x_{i,j}) + \bar{u}\Big)\bigg)\bigg\| + \frac{q_r\eta_r}{M_r K}\big\|\nabla f(w_r, x_{a,b}) - \nabla f(w_r', x_{a,b}')\big\|.$$
We will need the following tighter estimate of the non-expansiveness of the gradient updates to bound the first term on the right-hand side of the inequality above:

Lemma D.3.
(Hardt et al., 2016) Let $G: \mathcal{W} \to \mathbb{R}$ be $\mu$-strongly convex and $\beta$-smooth, and assume $\eta \leq \frac{2}{\beta + \mu}$. Then for any $w, v \in \mathcal{W}$, we have
$$\|(w - \eta\nabla G(w)) - (v - \eta\nabla G(v))\| \leq \Big(1 - \frac{\eta\beta\mu}{\beta + \mu}\Big)\|v - w\| \leq \Big(1 - \frac{\eta\mu}{2}\Big)\|v - w\|.$$
Note that $G_r(w) := \frac{1}{M_r K}\sum_{(i,j) \neq (a,b),\, (i,j) \in [M_r]\times[K]} f(w, x_{i,j}^r)$ is $(1 - \frac{q_r}{M_r K})\beta$-smooth and $(1 - \frac{q_r}{M_r K})\mu$-strongly convex, and hence so is $G_r(w) + \langle\bar{u}, w\rangle$. Therefore, invoking Lemma D.3 and the assumption $\eta_r = \eta \leq \frac{1}{\beta}$, as well as Lipschitzness of $f(\cdot, x)$ for all $x \in \mathcal{X}$, yields
$$\|w_{r+1} - w_{r+1}'\| \leq \bigg(1 - \frac{\eta\mu\big(1 - \frac{q_r}{M_r K}\big)}{2}\bigg)\|w_r - w_r'\| + \frac{2q_r\eta L}{M_r K}.$$
Next, taking expectations over the $M_r$ (with $\mathbb{E}(\frac{1}{M_r}) = \frac{1}{M}$), the minibatch sampling (recall $\mathbb{E}Q_r = \frac{K}{n_a}$), and the Gaussian noise implies
$$\mathbb{E}\|w_{r+1} - w_{r+1}'\| \leq \bigg(1 - \frac{\eta\mu\big(1 - \frac{1}{n_a M}\big)}{2}\bigg)\mathbb{E}\|w_r - w_r'\| + \frac{2\eta L}{n_a M}.$$
One can then prove the following claim by an inductive argument very similar to the one used in the proof of the convex part of Lemma D.2: for all $t \geq 0$,
$$\mathbb{E}\|w_t - w_t'\| \leq \frac{2\eta L}{n_a M}\sum_{r=0}^t(1 - b)^r, \quad \text{where } b := \frac{\mu\eta}{2}\Big(\frac{n_a M - 1}{n_a M}\Big) < 1.$$
The above claim implies that
$$\mathbb{E}\|w_t - w_t'\| \leq \frac{2\eta L}{n_a M}\cdot\frac{1 - (1 - b)^{t+1}}{b} \leq \frac{4L}{\mu(n_a M - 1)} \leq \frac{4L}{\mu(nM - 1)}.$$
Finally, using the above bound together with Lipschitz continuity of $f$ and Jensen's inequality, we obtain that for any $x \in \mathcal{X}$,
$$\mathbb{E}[f(\hat{w}_R, x) - f(\hat{w}_R', x)] \leq L\,\mathbb{E}\|\hat{w}_R - \hat{w}_R'\| = L\,\mathbb{E}\bigg\|\frac{1}{\Gamma_R}\sum_{r=0}^{R-1}\gamma_r(w_r - w_r')\bigg\| \leq L\,\mathbb{E}\bigg[\frac{1}{\Gamma_R}\sum_{r=0}^{R-1}\gamma_r\|w_r - w_r'\|\bigg] \leq L\bigg[\frac{1}{\Gamma_R}\sum_{r=0}^{R-1}\gamma_r\cdot\frac{4L}{\mu(nM - 1)}\bigg] = \frac{4L^2}{\mu(nM - 1)}.$$
This completes the proof of Lemma D.2.

Next, we bound the empirical loss of Noisy MB-SGD (Algorithm 2). We will need the following two lemmas for the proof of Lemma D.6 (and hence Theorem 2.1):

Lemma D.4 (Projection lemma). Let $\mathcal{W} \subseteq \mathbb{R}^d$ be a closed convex set. Then $\|\Pi_{\mathcal{W}}(a) - b\|^2 \leq \|a - b\|^2$ for any $a \in \mathbb{R}^d$, $b \in \mathcal{W}$.

Lemma D.5 ((Stich, 2019)). Let $b > 0$, let $a, c \geq 0$, and let $\{\eta_t\}_{t \geq 0}$ be non-negative step sizes such that $\eta_t \leq \frac{1}{g}$ for all $t \geq 0$, for some parameter $g \geq a$.
Let $\{r_t\}_{t \geq 0}$ and $\{s_t\}_{t \geq 0}$ be two non-negative sequences of real numbers which satisfy
$$r_{t+1} \leq (1 - a\eta_t)r_t - b\eta_t s_t + c\eta_t^2$$
for all $t \geq 0$. Then there exist particular choices of step sizes $\eta_t \leq \frac{1}{g}$ and averaging weights $\gamma_t \geq 0$ such that
$$\frac{b}{\Gamma_T}\sum_{t=0}^T s_t\gamma_t + a\,r_{T+1} = \tilde{O}\Big(g\,r_0\exp\Big(-\frac{aT}{g}\Big) + \frac{c}{aT}\Big),$$
where $\Gamma_T := \sum_{t=0}^T \gamma_t$. In fact, we can choose $\eta_t$ and $\gamma_t$ as follows:
$$\eta_t = \eta = \min\bigg\{\frac{1}{g},\, \frac{\ln\big(\max\{2,\, a^2 r_0 T^2/c\}\big)}{aT}\bigg\}, \qquad \gamma_t = (1 - a\eta)^{-(t+1)}.$$

We give the excess empirical risk guarantee of ISRL-DP MB-SGD below:

Lemma D.6. Let $f: \mathcal{W} \times \mathcal{X} \to \mathbb{R}$ be $\mu$-strongly convex (with $\mu = 0$ for the convex case), $L$-Lipschitz, and $\beta$-smooth in $w$ for all $x \in \mathcal{X}$, where $\mathcal{W}$ is a closed convex set in $\mathbb{R}^d$ such that $\|w - w'\| \leq D$ for all $w, w' \in \mathcal{W}$. Let $X \in \mathcal{X}$. Then Noisy MB-SGD (Algorithm 2) with $\sigma^2 = \frac{256 L^2 R\ln(2.5R/\delta_0)\ln(2/\delta_0)}{n^2\epsilon_0^2}$ attains the following empirical loss bounds as a function of the step size and the number of rounds:

1. (Convex) For any $\eta \leq 1/\beta$ and $R \in \mathbb{N}$, with $\gamma_r := 1/R$, we have
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* \leq \frac{D^2}{2\eta R} + \frac{\eta L^2}{2}\bigg(\frac{256\,dR\ln(2.5R/\delta_0)\ln(2/\delta_0)}{Mn^2\epsilon_0^2} + 1\bigg).$$

2. (Strongly convex) There exists a constant stepsize $\eta_r = \eta \leq 1/\beta$ such that if $R \geq 2\kappa\ln\big(\frac{\mu M\epsilon_0^2 n^2\beta D}{L^2 d}\big)$, then
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* = \tilde{O}\bigg(\frac{L^2}{\mu}\bigg(\frac{d\ln(1/\delta_0)}{M\epsilon_0^2 n^2}\bigg) + \frac{1}{R}\bigg).$$

Proof. First, condition on the random $M_r$ and consider $M_r$ as fixed. Let $w^* \in \operatorname{argmin}_{w \in \mathcal{W}}\hat{F}_X(w)$ be any minimizer of $\hat{F}_X$, and denote the average of the i.i.d. Gaussian noises across all silos in round $r$ by $\bar{u}_r := \frac{1}{M_r}\sum_{i \in S_r} u_i$. Then for any $r \geq 0$, conditional on $M_r$, we have
$$\mathbb{E}\big[\|w_{r+1} - w^*\|^2 \,\big|\, M_r\big] = \mathbb{E}\bigg[\bigg\|\Pi_{\mathcal{W}}\bigg[w_r - \eta_r\bigg(\frac{1}{M_r}\sum_{i \in S_r}\Big(\frac{1}{K}\sum_{j=1}^K\nabla f(w_r, x_{i,j}^r) + u_i\Big)\bigg)\bigg] - w^*\bigg\|^2 \,\bigg|\, M_r\bigg]$$
$$\leq \mathbb{E}\big[\|w_r - w^*\|^2 \,\big|\, M_r\big] - 2\eta_r\,\mathbb{E}\big[\langle\nabla\hat{F}_X(w_r) + \bar{u}_r,\, w_r - w^*\rangle \,\big|\, M_r\big] + \eta_r^2\,\mathbb{E}\bigg[\bigg\|\bar{u}_r + \frac{1}{M_r}\sum_{i \in S_r}\frac{1}{K}\sum_{j=1}^K\nabla f(w_r, x_{i,j}^r)\bigg\|^2 \,\bigg|\, M_r\bigg]$$
$$\leq (1 - \mu\eta_r)\,\mathbb{E}\big[\|w_r - w^*\|^2 \,\big|\, M_r\big] - 2\eta_r\,\mathbb{E}\big[\hat{F}_X(w_r) - \hat{F}_X^* \,\big|\, M_r\big] + \eta_r^2\bigg(\frac{d\sigma^2}{M_r} + L^2\bigg), \quad (15)$$
where we used Lemma D.4 in the first inequality; $\mu$-strong convexity of $\hat{F}_X$ (for $\mu \geq 0$) and the fact that $\bar{u}_r$ is mean-zero and independent of the gradient estimate in the next inequality; and the fact that $f(\cdot, x)$ is $L$-Lipschitz for all $x$ (together with independence of the noise and the data again) in the last inequality. Now we consider the convex ($\mu = 0$) and strongly convex ($\mu > 0$) cases separately.

Convex ($\mu = 0$) case: Re-arranging (15), we get
$$\mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^* \,|\, M_r] \leq \frac{1}{2\eta_r}\big(\mathbb{E}[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2]\big) + \frac{\eta_r}{2}\bigg(\frac{d\sigma^2}{M_r} + L^2\bigg),$$
and hence
$$\mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^*] \leq \frac{1}{2\eta_r}\big(\mathbb{E}[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2]\big) + \frac{\eta_r}{2}\bigg(\frac{d\sigma^2}{M} + L^2\bigg)$$
by taking total expectation and using $\mathbb{E}[1/M_r] = 1/M$.
Then for $\eta_r = \eta$, the average iterate $\bar{w}_R$ satisfies
$$\mathbb{E}\hat{F}_X(\bar{w}_R) - \hat{F}_X^* \leq \frac{1}{R}\sum_{r=0}^{R-1}\mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^*] \leq \frac{1}{R}\sum_{r=0}^{R-1}\frac{1}{2\eta}\big(\mathbb{E}[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2]\big) + \frac{\eta}{2}\bigg(\frac{d\sigma^2}{M} + L^2\bigg) \leq \frac{\|w_0 - w^*\|^2}{\eta R} + \frac{\eta}{2}\bigg(\frac{d\sigma^2}{M} + L^2\bigg).$$
Plugging in $\sigma^2$ finishes the proof of the convex case.

Strongly convex ($\mu > 0$) case: Recall from (15) that
$$\mathbb{E}[\|w_{t+1} - w^*\|^2] \leq (1 - \mu\eta_t)\,\mathbb{E}[\|w_t - w^*\|^2] - 2\eta_t\,\mathbb{E}[\hat{F}_X(w_t) - \hat{F}_X^*] + \eta_t^2\bigg(\frac{d\sigma^2}{M} + L^2\bigg) \quad (16)$$
for all $t \geq 0$ (upon taking expectation over $M_t$). Now, (16) satisfies the conditions of Lemma D.5, with sequences $r_t = \mathbb{E}\|w_t - w^*\|^2$, $s_t = \mathbb{E}[\hat{F}_X(w_t) - \hat{F}_X^*]$ and parameters $a = \mu$, $b = 2$, $c = \frac{d\sigma^2}{M} + L^2$, $g = 2\beta$, $T = R$. Then Lemma D.5 and Jensen's inequality imply
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* = \tilde{O}\bigg(\beta D^2\exp\Big(-\frac{R}{2\kappa}\Big) + \frac{L^2}{\mu}\bigg(\frac{1}{R} + \frac{d}{M\epsilon_0^2 n^2}\bigg)\bigg).$$
Finally, plugging in $R$ and $\sigma^2$ completes the proof.

We are now prepared to prove Theorem 2.1 in the $\beta$-smooth case:

Theorem D.1. Let $f(w, x)$ be $\beta$-smooth in $w$ for all $x \in \mathcal{X}$. Assume $\epsilon_0 \leq 2\ln(2/\delta_0)$, choose $\sigma^2 = \frac{256 L^2 R\ln(2.5R/\delta_0)\ln(2/\delta_0)}{n^2\epsilon_0^2}$ and $K \geq \frac{\epsilon_0 n}{4\sqrt{2R\ln(2/\delta_0)}}$. Then Algorithm 2 is $(\epsilon_0, \delta_0)$-ISRL-DP. Moreover, there are choices of $\{\eta_r\}_{r=1}^R$ such that Algorithm 2 achieves the following excess loss bounds:

1. If $f(\cdot, x)$ is convex, then setting $R = \frac{\beta D\sqrt{M}}{L}\min\big\{\sqrt{n},\, \frac{\epsilon_0 n}{\sqrt{d}}\big\} + \min\big\{nM,\, \frac{\epsilon_0^2 n^2 M}{d}\big\}$ yields
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\bigg(\frac{LD}{\sqrt{M}}\bigg(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 n}\bigg)\bigg). \quad (17)$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then setting $R = \max\big(\frac{2\beta}{\mu}\ln\big(\frac{\beta D^2\mu M\epsilon_0^2 n^2}{dL^2}\big),\, \min\big\{Mn,\, \frac{M\epsilon_0^2 n^2}{d}\big\}\big)$ yields
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\bigg(\frac{L^2}{\mu M}\bigg(\frac{1}{n} + \frac{d\ln(1/\delta_0)}{\epsilon_0^2 n^2}\bigg)\bigg). \quad (18)$$

Proof. Privacy: By independence of the Gaussian noise across silos, it suffices to show that the transcript of silo $i$'s interactions with the server is DP for all $i \in [N]$ (conditional on the transcripts of all other silos). WLOG consider $i = 1$.
By the advanced composition theorem (Theorem 3.20 in (Dwork & Roth, 2014)), it suffices to show that each of the $R$ rounds of the algorithm is $(\tilde{\epsilon}, \tilde{\delta})$-ISRL-DP, where $\tilde{\epsilon} = \frac{\epsilon_0}{2\sqrt{2R\ln(2/\delta_0)}}$ (we used the assumption $\epsilon_0 \leq 2\ln(2/\delta_0)$ here) and $\tilde{\delta} = \frac{\delta_0}{2R}$. First, condition on the randomness due to the local sampling of the minibatch (line 6 of Algorithm 2). The $L_2$ sensitivity of each local step of SGD is bounded by
$$\Delta := \sup_{|X_1\Delta X_1'| \leq 2,\, w \in \mathcal{W}}\bigg\|\frac{1}{K}\sum_{j=1}^K\nabla f(w, x_{1,j}) - \nabla f(w, x_{1,j}')\bigg\| \leq \frac{2L}{K}$$
by $L$-Lipschitzness of $f$. Let $\epsilon' := \frac{n}{2K}\tilde{\epsilon} = \frac{n}{2K}\cdot\frac{\epsilon_0}{2\sqrt{2R\ln(2/\delta_0)}}$; the assumption on $K$ ensures $\epsilon' \leq 1$, so that the privacy guarantees of the Gaussian mechanism and of amplification by subsampling stated below indeed hold. The standard privacy guarantee of the Gaussian mechanism (see (Dwork & Roth, 2014, Theorem A.1)) implies that (conditional on the randomness due to sampling) taking $\sigma^2 \geq \frac{8L^2\ln(1.25/\tilde{\delta})}{(\epsilon')^2 K^2}$ suffices to make round $r$ (in isolation) $(\epsilon', \tilde{\delta})$-ISRL-DP. Now we invoke the randomness due to sampling: privacy amplification by subsampling (Ullman, 2017) implies that round $r$ (in isolation) is $(\frac{2\epsilon' K}{n}, \tilde{\delta})$-ISRL-DP, i.e., $(\tilde{\epsilon}, \tilde{\delta})$-ISRL-DP. Therefore, with sampling, it suffices to take $\sigma^2 \geq \frac{32L^2\ln(1.25/\tilde{\delta})}{n^2\tilde{\epsilon}^2} = \frac{256L^2 R\ln(2.5R/\delta_0)\ln(2/\delta_0)}{n^2\epsilon_0^2}$ to ensure that round $r$ (in isolation) is $(\tilde{\epsilon}, \tilde{\delta})$-ISRL-DP for all $r$, and hence that the full algorithm ($R$ rounds) is $(\epsilon_0, \delta_0)$-ISRL-DP.

Excess loss: 1. First suppose $f$ is merely convex ($\mu = 0$). By Lemma D.2, Lemma D.1, and Lemma D.6, we have
$$\mathbb{E}F(\hat{w}_R) - F^* \leq \frac{2L^2 R\eta}{nM} + \frac{D^2}{2\eta R} + \frac{\eta L^2}{2}\bigg(\frac{256\,dR\ln(2.5R/\delta_0)\ln(2/\delta_0)}{Mn^2\epsilon_0^2} + 1\bigg)$$
for any $\eta \leq 1/\beta$. Choosing $\eta = \min\Big(\frac{1}{\beta},\, \frac{D}{L\sqrt{R}}\min\big\{1,\, \frac{\epsilon_0 n\sqrt{M}}{\sqrt{Rd}},\, \sqrt{\frac{nM}{R}}\big\}\Big)$ implies
$$\mathbb{E}F(\hat{w}_R) - F^* \lesssim \frac{\beta D^2}{R} + \frac{LD}{\sqrt{R}}\bigg(1 + \frac{\sqrt{dR\ln(R/\delta_0)}}{\epsilon_0 n\sqrt{M}} + \sqrt{\frac{R}{nM}}\bigg).$$
Finally, one verifies that the prescribed choice of $R$ yields (17).

2. Now suppose $f$ is $\mu$-strongly convex.
Then for the $\eta \leq 1/\beta$ used in the proof of Lemma D.6 and $R \geq \frac{2\beta}{\mu}\ln\big(\frac{\beta D^2\mu M\epsilon_0^2 n^2}{dL^2}\big)$, we have
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* = \tilde{O}\bigg(\frac{L^2}{\mu}\bigg(\frac{d\ln(1/\delta_0)}{M\epsilon_0^2 n^2} + \frac{1}{R} + \frac{1}{Mn}\bigg)\bigg)$$
by Lemma D.6, Lemma D.2, and Lemma D.1. Hence (18) follows from our choice of $R \geq \min\big(Mn,\, \frac{M\epsilon_0^2 n^2}{d}\big)$.

We use Theorem D.1 to prove Theorem 2.1 via Nesterov smoothing (Nesterov, 2005), similarly to how (Bassily et al., 2019) proceeded for CDP SCO with non-strongly convex loss and $N = 1$. That is, for non-smooth $f$, we run ISRL-DP Noisy MB-SGD on the smoothed objective (a.k.a. $\beta$-Moreau envelope) $f_\beta(w) := \min_{v \in \mathcal{W}}\big(f(v) + \frac{\beta}{2}\|w - v\|^2\big)$, where $\beta > 0$ is a design parameter that we will optimize. The following key lemma allows us to easily extend Theorem D.1 to non-smooth $f$:

Lemma D.7 (Nesterov (2005)). Let $f: \mathcal{W} \to \mathbb{R}$ be convex and $L$-Lipschitz, and let $\beta > 0$. Then the $\beta$-Moreau envelope $f_\beta(w) := \min_{v \in \mathcal{W}}\big(f(v) + \frac{\beta}{2}\|w - v\|^2\big)$ satisfies:
1. $f_\beta$ is convex, $2L$-Lipschitz, and $\beta$-smooth.
2. For all $w$, $f_\beta(w) \leq f(w) \leq f_\beta(w) + \frac{L^2}{2\beta}$.

Now let us re-state the precise version of Theorem 2.1 before providing its proof:

Theorem D.2 (Precise version of Theorem 2.1). Let $\epsilon_0 \leq 2\ln(2/\delta_0)$ and choose $\sigma^2 = \frac{256L^2 R\ln(2.5R/\delta_0)\ln(2/\delta_0)}{n^2\epsilon_0^2}$, $K \geq \frac{\epsilon_0 n}{4\sqrt{2R\ln(2/\delta_0)}}$. Then Algorithm 2 is $(\epsilon_0, \delta_0)$-ISRL-DP. Further, there exist choices of $\beta > 0$ such that running Algorithm 2 on $f_\beta(w, x) := \min_{v \in \mathcal{W}}\big(f(v, x) + \frac{\beta}{2}\|w - v\|^2\big)$ yields:

1. If $f(\cdot, x)$ is convex, then setting $R = \frac{\beta D\sqrt{M}}{L}\min\big\{\sqrt{n},\, \frac{\epsilon_0 n}{\sqrt{d}}\big\} + \min\big\{nM,\, \frac{\epsilon_0^2 n^2 M}{d}\big\}$ yields
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\bigg(\frac{LD}{\sqrt{M}}\bigg(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 n}\bigg)\bigg). \quad (19)$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then setting $R = \max\big(\frac{2\beta}{\mu}\ln\big(\frac{\beta D^2\mu M\epsilon_0^2 n^2}{dL^2}\big),\, \min\big\{Mn,\, \frac{M\epsilon_0^2 n^2}{d}\big\}\big)$ yields
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\bigg(\frac{L^2}{\mu M}\bigg(\frac{1}{n} + \frac{d\ln(1/\delta_0)}{\epsilon_0^2 n^2}\bigg)\bigg).$$

Proof. Privacy: ISRL-DP is immediate from post-processing (Dwork & Roth, 2014, Proposition 2.1), since we already showed that Algorithm 2 (applied to $f_\beta$) is ISRL-DP.
Excess risk: We have $\mathbb{E}F(\hat{w}_R) - F^* \leq \mathbb{E}F_\beta(\hat{w}_R) - F_\beta^* + \frac{L^2}{2\beta}$ by part 2 of Lemma D.7. Moreover, by part 1 of Lemma D.7 and Theorem D.1, we have:
1. $\mathbb{E}F_\beta(\hat{w}_R) - F_\beta^* = \tilde{O}\Big(\frac{LD}{\sqrt{M}}\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 n}\Big)\Big)$ for convex $f$;
2. $\mathbb{E}F_\beta(\hat{w}_R) - F_\beta^* = \tilde{O}\Big(\frac{L^2}{\mu M}\Big(\frac{1}{n} + \frac{d\ln(1/\delta_0)}{\epsilon_0^2 n^2}\Big)\Big)$ for $\mu$-strongly convex $f$.
Thus, choosing $\beta_1$ such that $L^2/\beta_1 \leq \frac{LD}{\sqrt{M}}\Big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 n}\Big)$ and $\beta_2$ such that $L^2/\beta_2 \leq \frac{L^2}{\mu M}\Big(\frac{1}{n} + \frac{d\ln(1/\delta_0)}{\epsilon_0^2 n^2}\Big)$ completes the proof.
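The sandwich inequality of Lemma D.7 (part 2) is easy to verify numerically. The following illustrative sketch (not part of the proof) approximates the Moreau envelope of the 1-Lipschitz function $f = |\cdot|$ by minimizing over a fine grid; the envelope is the well-known Huber function, and the checks $f_\beta \leq f \leq f_\beta + L^2/(2\beta)$ hold at every test point:

```python
import numpy as np

def moreau_envelope(f, beta, w, grid):
    """f_beta(w) = min_v ( f(v) + (beta/2) * (w - v)^2 ), approximated by
    minimizing over a fine grid of candidate v values."""
    return float(np.min(f(grid) + 0.5 * beta * (w - grid) ** 2))

# For f = |.| (L = 1), f_beta is the Huber function:
# (beta/2) * w^2 for |w| <= 1/beta, and |w| - 1/(2*beta) otherwise.
beta, L = 4.0, 1.0
grid = np.linspace(-3.0, 3.0, 200001)
for w in [-2.0, -0.1, 0.0, 0.05, 1.5]:
    fb = moreau_envelope(np.abs, beta, w, grid)
    assert fb <= abs(w) + 1e-9                      # f_beta <= f
    assert abs(w) <= fb + L**2 / (2 * beta) + 1e-9  # f <= f_beta + L^2/(2*beta)
```

The gap $f - f_\beta$ is exactly $L^2/(2\beta)$ for $|w| \geq 1/\beta$, which is why the smoothing parameter $\beta$ trades bias against smoothness in the proof above.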

D.3 LOWER BOUNDS FOR ISRL-DP FL: SUPPLEMENTAL MATERIAL AND PROOFS

This section requires familiarity with the notation introduced in the rigorous definition of ISRL-DP in Appendix B.

D.3.1 C-COMPOSITIONALITY

The $(\epsilon_0, \delta_0)$-ISRL-DP algorithm class $\mathcal{A}_{(\epsilon_0,\delta_0),C}$ contains all sequentially interactive algorithms and all fully interactive, $C$-compositional algorithms.

Definition 7 (Compositionality). Let $\mathcal{A}$ be an $R$-round $(\epsilon_0, \delta_0)$-ISRL-DP FL algorithm with data domain $\mathcal{X}$. Let $\{(\epsilon_0^r, \delta_0^r)\}_{r=1}^R$ denote the minimal (non-negative) parameters of the local randomizers $\mathcal{R}_r^{(i)}$ selected at round $r$ such that $\mathcal{R}_r^{(i)}(Z^{(1:r-1)}, \cdot)$ is $(\epsilon_0^r, \delta_0^r)$-DP for all $i \in [N]$ and all $Z^{(1:r-1)}$. For $C > 0$, we say that $\mathcal{A}$ is $C$-compositional if $\sqrt{\sum_{r \in [R]}(\epsilon_0^r)^2} \leq C\epsilon_0$. If such $C$ is an absolute constant, we simply say $\mathcal{A}$ is compositional. Definition 7 is an extension of the definition in Joseph et al. (2019) to $\delta_0 > 0$.

D.3.2 ALGORITHMS WHOSE PRIVACY FOLLOWS FROM THE ADVANCED COMPOSITION THEOREM ARE 1-COMPOSITIONAL

Suppose $\mathcal{A}$ is $(\epsilon_0, \delta_0)$-ISRL-DP by the advanced composition theorem (Dwork & Roth, 2014, Theorem 3.20). Then
$$\epsilon_0 = \sqrt{2\sum_{r=1}^R(\epsilon_0^r)^2\ln(1/\delta')} + \sum_{r=1}^R\epsilon_0^r\big(e^{\epsilon_0^r} - 1\big), \quad \text{and} \quad \delta_0 = \sum_{r=1}^R\delta_0^r + \delta'$$
for any $\delta' \in (0, 1)$. Assume $\delta' \in (0, 1/3)$ without loss of generality: otherwise the privacy guarantee of $\mathcal{A}$ is essentially meaningless (see Remark D.1). Then
$$\epsilon_0 \geq \sqrt{2\sum_{r=1}^R(\epsilon_0^r)^2\ln(1/\delta')} \geq \sqrt{2}\sqrt{\sum_{r=1}^R(\epsilon_0^r)^2} \geq \sqrt{\sum_{r=1}^R(\epsilon_0^r)^2},$$
so that $\mathcal{A}$ is 1-compositional. Note that even if $\delta' > 1/3$, $\mathcal{A}$ would still be compositional, but the constant $C$ may be larger than 1.
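The inequality chain above can be checked numerically. The following illustrative sketch (not part of the argument) computes the total $\epsilon_0$ given by advanced composition for $R$ identical rounds and confirms that it dominates $\sqrt{\sum_r(\epsilon_0^r)^2}$, i.e., that $C = 1$ suffices:

```python
import math

def advanced_composition_eps(eps_rounds, delta_prime):
    """Total epsilon from the advanced composition theorem:
    sqrt(2 * sum_r eps_r^2 * ln(1/delta')) + sum_r eps_r * (e^{eps_r} - 1)."""
    s2 = sum(e * e for e in eps_rounds)
    return math.sqrt(2 * s2 * math.log(1 / delta_prime)) + sum(
        e * (math.exp(e) - 1) for e in eps_rounds)

eps_rounds = [0.05] * 200   # R = 200 rounds, each eps_0^r = 0.05
delta_prime = 0.01          # any delta' < 1/3
eps0 = advanced_composition_eps(eps_rounds, delta_prime)
l2_norm = math.sqrt(sum(e * e for e in eps_rounds))
assert l2_norm <= eps0      # sqrt(sum (eps_0^r)^2) <= eps0, i.e. C = 1
```

Since $\ln(1/\delta') > 1$ for $\delta' < 1/3$, the first summand of `eps0` alone already exceeds `l2_norm`.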

D.3.3 EXAMPLE OF LDP ALGORITHM THAT IS NOT COMPOSITIONAL

This example is a simple modification of Example 2.2 in (Joseph et al., 2019) (adapted to our definition of compositionality for $\delta_0 > 0$). Given any $C > 0$, set $d := 2C^2$ and let $\mathcal{X} = \{e_1, \cdots, e_d\} \subseteq \{0, 1\}^d$ be the standard basis of $\mathbb{R}^d$. Let $n = 1$ and $X = (x_1, \cdots, x_N) \in \mathcal{X}^N$. For all $i \in [N]$, let $Q^{(i)}: \mathcal{X} \to \mathcal{X}$ be the randomized response mechanism that outputs $Q^{(i)}(x_i) = x_i$ with probability $\frac{e^{\epsilon_0}}{e^{\epsilon_0} + d - 1}$ and otherwise outputs a uniformly random element of $\mathcal{X}\setminus\{x_i\}$. Note that $Q^{(i)}$ is $\epsilon_0$-DP, hence $(\epsilon_0, \delta_0)$-DP for any $\delta_0 > 0$. Consider the $d$-round algorithm $\mathcal{A}: \mathcal{X}^N \to \mathcal{Z}^{d \times N}$ in Algorithm 3, where $\mathcal{Z} = \mathbb{R}^d$.

Algorithm 3: LDP algorithm that is not $C$-compositional
1: for $r \in [d]$ do
2:   for $i \in [N]$ do
3:     if $x_i = e_r$ then
4:       $\mathcal{R}_r^{(i)}(x_i) := Q^{(i)}(x_i)$.
5:     end if
6:   end for
7: end for
8: Output: $\{\mathcal{R}_r^{(i)}(x_i)\}_{i \in [N],\, r \in [d]}$.

Since each silo's data is only referenced once and $Q^{(i)}$ is $\epsilon_0$-DP, we have $\epsilon_0^r = \epsilon_0$ and $\mathcal{A}$ is $(\epsilon_0, \delta_0)$-DP. However, $\sqrt{\sum_{r=1}^d(\epsilon_0^r)^2} = \sqrt{d\epsilon_0^2} = \sqrt{2}\,C\epsilon_0 > C\epsilon_0$, so $\mathcal{A}$ is not $C$-compositional. Also, note that our One-Pass Accelerated Noisy MB-SGD is only $C$-compositional for $C \geq \sqrt{R}$, since $\epsilon_0^r = \epsilon_0$ for this algorithm. Thus, substituting the $R$ that is used to prove the upper bounds in Theorem 3.1 (see Appendix E.2) and plugging $C = \sqrt{R}$ into Theorem 2.2 explains where the non-i.i.d. lower bounds in Fig. 3 come from.
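The randomized response mechanism $Q^{(i)}$ above is $\epsilon_0$-DP because every output's likelihood ratio between two different inputs is at most $e^{\epsilon_0}$. The following illustrative sketch (not part of the example) verifies this exhaustively for $d = 2C^2 = 8$:

```python
import math

def rr_prob(output_idx, input_idx, d, eps0):
    """Randomized response over the basis {e_1, ..., e_d}: report the true
    vector with prob e^eps0 / (e^eps0 + d - 1), else a uniform wrong one."""
    p_true = math.exp(eps0) / (math.exp(eps0) + d - 1)
    return p_true if output_idx == input_idx else (1 - p_true) / (d - 1)

C, eps0 = 2.0, 1.0
d = int(2 * C**2)   # d = 2 C^2 = 8
# eps0-DP: every output likelihood ratio between distinct inputs is <= e^eps0
worst_ratio = max(
    rr_prob(z, x, d, eps0) / rr_prob(z, xp, d, eps0)
    for z in range(d) for x in range(d) for xp in range(d) if x != xp)
assert worst_ratio <= math.exp(eps0) + 1e-12
```

The worst-case ratio is attained at $z = x \neq x'$ and equals $e^{\epsilon_0}$ exactly, so the mechanism's privacy guarantee is tight.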

D.3.4 PROOF OF THEOREM 2.2

First, let us state the complete, formal version of Theorem 2.2, which uses notation from Appendix B:

Theorem D.3 (Complete Version of Theorem 2.2). Let $\epsilon_0 \in (0, \sqrt{N}]$, $\delta_0 = o(1/nN)$, and $A \in \mathcal{A}_{(\epsilon_0,\delta_0),C}$. Suppose that in each round $r$, the local randomizers are all $(\epsilon_0^r, \delta_0^r)$-DP, for $\epsilon_0^r \lesssim \frac{1}{n}$, $\delta_0^r = o(1/nNR)$, and $N \ge 16\ln(2/\delta_0^r n)$. Then:
1. There exists an $f \in \mathcal{F}_{L,D}$ and a distribution $\mathcal{D}$ such that for $X \sim \mathcal{D}^{nN}$, we have
$$\mathbb{E}F(A(X)) - F^* = \tilde{\Omega}\left(\frac{\phi D}{\sqrt{Nn}} + LD\min\left\{1, \frac{\sqrt{d}}{\epsilon_0 n\sqrt{N}C^2}\right\}\right).$$
2. There exists a $\mu$-smooth $f \in \mathcal{G}_{\mu,L,D}$ and a distribution $\mathcal{D}$ such that for $X \sim \mathcal{D}^{nN}$, we have
$$\mathbb{E}F(A(X)) - F^* = \tilde{\Omega}\left(\frac{\phi^2}{\mu nN} + LD\min\left\{1, \frac{d}{\epsilon_0^2 n^2 N C^4}\right\}\right).$$
Further, if $A \in \mathcal{A}$, then the above lower bounds hold with $C = 1$.

Before we proceed to the proof of Theorem 2.2, we recall the simpler characterization of ISRL-DP for sequentially interactive algorithms. A sequentially interactive algorithm $A$ with randomizers $\{R^{(i)}\}_{i=1}^N$ is $(\epsilon_0, \delta_0)$-ISRL-DP if and only if for all $i \in [N]$, $R^{(i)}(\cdot, Z^{(1:i-1)}) : \mathcal{X}^{n_i} \to \mathcal{Z}$ is $(\epsilon_0, \delta_0)$-DP for all $Z^{(1:i-1)} \in \mathcal{Z}^{i-1}$. In what follows, we will fix $\mathcal{X}_i = \mathcal{X}$ for all $i$. We now turn to the proof.

Step 1: Privacy amplification by shuffling. We begin by stating and proving the amplification-by-shuffling result that we will leverage to obtain Theorem 2.2:

Theorem D.4. Let $A \in \mathcal{A}_{(\epsilon_0,\delta_0),C}$ with $\epsilon_0 \in (0, \sqrt{N}]$ and $\delta_0 \in (0,1)$. Assume that in each round, the local randomizers $R_r^{(i)}(Z^{(1:r-1)}, \cdot) : \mathcal{X}^n \to \mathcal{Z}$ are $(\epsilon_0^r, \delta_0^r)$-DP for all $i \in [N]$, $r \in [R]$, $Z^{(1:r-1)} \in \mathcal{Z}^{(r-1)\times N}$, with $\epsilon_0^r \le \frac{1}{n}$. Assume $N \ge 16\ln(2/\delta_0^r n)$. If $A$ is $C$-compositional, then assume $\delta_0^r \le \frac{1}{14nNR}$ and denote $\delta := 14Nn\sum_{r=1}^R \delta_0^r$; if instead $A$ is sequentially interactive, then assume $\delta_0 = \delta_0^r \le \frac{1}{7Nn}$ and denote $\delta := 7Nn\delta_0$. Let $A_s : \mathcal{X} \to \mathcal{W}$ be the same algorithm as $A$, except that in each round $r$, $A_s$ draws a random permutation $\pi_r$ of $[N]$ and applies $R_r^{(i)}$ to $X_{\pi_r(i)}$ instead of $X_i$. Then $A_s$ is $(\epsilon, \delta)$-CDP, where
$$\epsilon = O\left(\frac{\epsilon_0\ln(1/nN\delta_0^{\min})\,C^2}{\sqrt{N}}\right), \qquad \delta_0^{\min} := \min_{r\in[R]}\delta_0^r.$$
In particular, if $A \in \mathcal{A}$, then $\epsilon = O\left(\frac{\epsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\right)$. Note that for sequentially interactive $A$, $\delta_0^{\min} = \delta_0$.

To the best of our knowledge, the restriction on $\epsilon_0^r$ is needed to obtain $\epsilon = \tilde{O}(\epsilon_0/\sqrt{N})$ in all works that have analyzed privacy amplification by shuffling (Erlingsson et al., 2020b; Feldman et al., 2020b; Balle et al., 2019; Cheu et al., 2019; Balle et al., 2020), but these works focus on the sequentially interactive case with $n = 1$, so the restriction amounts to $\epsilon_0 \lesssim 1$ (or $\epsilon_0 = \tilde{O}(1)$). The non-sequential, $C$-compositional part of Theorem D.4 will follow as a corollary (Corollary D.1) of the following result, which analyzes the privacy amplification in each round:

Theorem D.5 (Single-round privacy amplification by shuffling). Let $\epsilon_0^r \le \ln\left(\frac{N}{16\ln(2/\delta_r)}\right)/n$, $r \in \mathbb{N}$, and let $R_r^{(i)}(Z, \cdot) : \mathcal{X}^n \to \mathcal{Z}$ be an $(\epsilon_0^r, \delta_0^r)$-DP local randomizer for all $Z = Z^{(1:N)}_{(1:r-1)} \in \mathcal{Z}^{(r-1)\times N}$ and $i \in [N]$, where $\mathcal{X}$ is an arbitrary set. Given a distributed data set $X = (X_1, \cdots, X_N) \in \mathcal{X}^{N\times n}$ and $Z = Z^{(1:N)}_{(1:r-1)}$, consider the shuffled algorithm $A_s^r : \mathcal{X}^{n\times N}\times\mathcal{Z}^{(r-1)\times N} \to \mathcal{Z}^N$ that first samples a random permutation $\pi$ of $[N]$ and then computes $Z_r = (Z_r^{(1)}, \cdots, Z_r^{(N)})$, where $Z_r^{(i)} := R_r^{(i)}(Z, X_{\pi(i)})$. Then $A_s^r$ is $(\epsilon_r, \tilde{\delta}_r)$-CDP, where
$$\epsilon_r := \ln\left[1 + \left(\frac{e^{\epsilon_0^r}-1}{e^{\epsilon_0^r}+1}\right)\left(\frac{8\sqrt{e^{n\epsilon_0^r}\ln(4/\delta_r)}}{\sqrt{N}} + \frac{8e^{n\epsilon_0^r}}{N}\right)\right], \qquad \tilde{\delta}_r := \delta_r + 2Nn e^{(n-1)\epsilon_0^r}\delta_0^r.$$
In particular, if $\epsilon_0^r = O\left(\frac1n\right)$, then $\epsilon_r = O\left(\frac{\epsilon_0^r\sqrt{\ln(1/\delta_r)}}{\sqrt{N}}\right)$. Further, if $\epsilon_0^r \le 1/n$, then setting $\delta_r := Nn\delta_0^r$ implies that
$$\epsilon_r = O\left(\frac{\epsilon_0^r\sqrt{\ln(1/nN\delta_0^r)}}{\sqrt{N}}\right) \tag{23}$$
and $\tilde{\delta}_r \le 7Nn\delta_0^r$, which is in $(0,1)$ if we assume $\delta_0^r \in (0, \frac{1}{7Nn}]$.

We sometimes refer to the algorithm $A_s^r$ as the "shuffled algorithm derived from the randomizers $\{R_r^{(i)}\}$." From Theorem D.5, we obtain:

Corollary D.1 ($R$-round privacy amplification for $C$-compositional algorithms).
Let $A : \mathcal{X}^{n\times N} \to \mathcal{Z}^{R\times N}$ be an $R$-round, $(\epsilon_0, \delta_0)$-ISRL-DP, $C$-compositional algorithm with $\epsilon_0 \in (0, \sqrt{N}]$ and $\delta_0 \in (0,1)$, where $\mathcal{X}$ is an arbitrary set. Assume that in each round, the local randomizers $R_r^{(i)}(Z^{(1:r-1)}, \cdot) : \mathcal{X}^n \to \mathcal{Z}$ are $(\epsilon_0^r, \delta_0^r)$-DP for $i \in [N]$, $r \in [R]$, where $N \ge 16\ln(2/\delta_0^r n)$, $\epsilon_0^r \le \frac1n$, and $\delta_0^r \le \frac{1}{14nNR}$. Then the shuffled algorithm $A_s : \mathcal{X}^{n\times N} \to \mathcal{Z}^{R\times N}$ derived from $\{R_r^{(i)}\}$ is $(\epsilon, \delta)$-CDP with $\epsilon = O\left(\frac{\epsilon_0\ln(1/nN\delta_0^{\min})C^2}{\sqrt{N}}\right)$ and $\delta \le 14Nn\sum_{r=1}^R\delta_0^r$, where $\delta_0^{\min} := \min_{r\in[R]}\delta_0^r$. In particular, if $A \in \mathcal{A}$ is compositional, then $\epsilon = O\left(\frac{\epsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\right)$.

Proof. Let $\delta' := \sum_r Nn\delta_0^r$ and $\delta_r := Nn\delta_0^r$. Then the (central) privacy loss of the full $R$-round shuffled algorithm is bounded as
$$\epsilon \le 2\sum_r(\epsilon_r)^2 + \sqrt{2\sum_r(\epsilon_r)^2\ln(1/\delta')} = O\left(\sum_r\frac{(\epsilon_0^r)^2\ln(1/\delta_r)}{N} + \sqrt{\sum_r\frac{(\epsilon_0^r)^2\ln(1/\delta_r)\ln(1/\delta')}{N}}\right) = O\left(\frac{C^2\epsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\right),$$
where the three (in)equalities follow, in order, from the advanced composition theorem (Dwork & Roth, 2014), (23) in Theorem D.5, and $C$-compositionality of $A$ combined with the assumption $\epsilon_0 \lesssim \sqrt{N}$. Also, $\delta = \delta' + \sum_r\tilde{\delta}_r$ by the advanced composition theorem, where $\tilde{\delta}_r \le 7Nn\delta_0^r$ by Theorem D.5. Hence $\delta \le 14Nn\sum_r\delta_0^r$. In particular, if $A$ is compositional, then $C = O(1)$, so $\epsilon = O\left(\frac{\epsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\right)$.

Remark D.1. The upper bounds assumed on $\delta_0^r$ and $\delta_r$ in Theorem 2.2 ensure that $\delta \in (0,1)$ and that the lower bounds of (Bassily et al., 2014) apply (see Theorem D.6). These assumptions are not very restrictive in practice, since $\delta_0^r, \delta_0 \ll 1/n$ is needed for meaningful privacy guarantees (see e.g. Chapter 2 of (Dwork & Roth, 2014)) and $R$ must be polynomial for the algorithm to run. To quote (Dwork & Roth, 2014, page 18), "typically we are interested in values of δ that are less than the inverse of any polynomial in the size of the database." For larger δ (e.g. $\delta = \Omega(1/n)$), there are examples of algorithms that satisfy the definition of DP but clearly violate any reasonable notion of privacy.
For instance, an algorithm that outputs $\delta_0 n$ random samples from each silo's data set is $(0, \delta_0)$-ISRL-DP, but completely violates the privacy of at least one person in each silo if $\delta_0 \ge 1/n$. Also, since $N \gg 1$ is the regime of interest (otherwise, if $N = \tilde{O}(1)$, the CDP lower bounds of (Bassily et al., 2019) already match our upper bounds up to logarithms), the requirement that $N$ be larger than $16\ln(2/\delta_0^{\min}n)$ is unimportant.

The sequentially interactive part of Theorem D.4 will be clear directly from the proof of Theorem D.5. We now turn to the proof of Theorem D.5, which uses the techniques of (Feldman et al., 2020b). First, we need some more notation. The privacy relation in (3) between random variables $P$ and $Q$ can be characterized by the hockey-stick divergence $D_{e^\epsilon}(P\|Q) := \int\max\{0, p(x) - e^\epsilon q(x)\}\,dx$, where $p$ and $q$ denote the probability density or mass functions of $P$ and $Q$, respectively. Then $P \simeq_{(\epsilon,\delta)} Q$ iff $\max\{D_{e^\epsilon}(P\|Q), D_{e^\epsilon}(Q\|P)\} \le \delta$. Second, recall that the total variation distance between $P$ and $Q$ is given by $TV(P,Q) = \frac12\int_{\mathbb{R}}|p(x)-q(x)|\,dx$. Third, we recall the notion of group privacy:

Definition 8 (Group DP). A randomized algorithm $A : \mathcal{X}^N \to \mathcal{Z}$ is $(\epsilon, \delta)$ group DP for groups of size $N$ if $A(X) \simeq_{(\epsilon,\delta)} A(X')$ for all $X, X' \in \mathcal{X}^N$.

We also need the following stronger version of a decomposition from (Kairouz et al., 2015) and Lemma 3.2 of (Murtagh & Vadhan, 2016):

Lemma D.8 ((Kairouz et al., 2015)). Let $R_0, R_1 : \mathcal{X}^n \to \mathcal{Z}$ be local randomizers such that $R_0(X_0)$ and $R_1(X_1)$ are $(\epsilon, 0)$-indistinguishable. Then there exists a randomized algorithm $U : \{X_0, X_1\} \to \mathcal{Z}$ such that $R_0(X_0) = \frac{e^\epsilon}{e^\epsilon+1}U(X_0) + \frac{1}{e^\epsilon+1}U(X_1)$ and $R_1(X_1) = \frac{1}{e^\epsilon+1}U(X_0) + \frac{e^\epsilon}{e^\epsilon+1}U(X_1)$.

Lemma D.8 follows from the proof of Lemma 3.2 in (Murtagh & Vadhan, 2016), noting that the weaker hypothesis assumed in Lemma D.8 is sufficient for all steps to go through.

Definition 9 (Deletion Group DP).
An algorithm $R : \mathcal{X}^n \to \mathcal{Z}$ is $(\epsilon, \delta)$ deletion group DP for groups of size $n$ if there exists a reference distribution $\rho$ such that $R(X) \simeq_{(\epsilon,\delta)} \rho$ for all $X \in \mathcal{X}^n$. It is easy to show that if $R$ is $(\epsilon, \delta)$-deletion group DP for groups of size $n$, then $R$ is $(2\epsilon, (1+e^\epsilon)\delta)$ group DP for groups of size $n$. In addition, we have the following result: Now, define
$$L_U^{(i)}(Z, X) := \begin{cases} U^{(i)}(Z, X_1^0) & \text{if } X = X_1^0, \\ U^{(i)}(Z, X_1^1) & \text{if } X = X_1^1, \\ L^{(i)}(Z, X) & \text{otherwise.} \end{cases}$$
For any inputs $Z, X$, let $A_U(Z, X)$ be defined exactly as $A_s(Z, X)$ (same $\pi$), but with the randomizers $R^{(i)}$ replaced by $L_U^{(i)}$. Then by (24) and (25),
$$A_L(X^0) = \frac{e^{\epsilon_0^r}}{e^{\epsilon_0^r}+1}A_U(X^0) + \frac{1}{e^{\epsilon_0^r}+1}A_U(X^1) \quad\text{and}\quad A_L(X^1) = \frac{1}{e^{\epsilon_0^r}+1}A_U(X^0) + \frac{e^{\epsilon_0^r}}{e^{\epsilon_0^r}+1}A_U(X^1).$$
Then by (26), for any $X \in \mathcal{X}^n\setminus\{X_1^0, X_1^1\}$ and any $Z = Z_{1:r-1} \in \mathcal{Z}^{(r-1)\times N}$, we have
$$L_U^{(i)}(Z, X) = \frac{1}{2e^{\tilde\epsilon_0}}L_U^{(i)}(Z, X_1^0) + \frac{1}{2e^{\tilde\epsilon_0}}L_U^{(i)}(Z, X_1^1) + (1 - e^{-\tilde\epsilon_0})\,LO^{(i)}(Z, X).$$
Hence Lemma D.10 (with $p := e^{-\tilde\epsilon_0} = e^{-n\epsilon_0^r}$) implies that $A_U(X^0)$ and $A_U(X^1)$ are
$$\left(\ln\left(1 + \frac{8\sqrt{e^{\tilde\epsilon_0}\ln(4/\delta_r)}}{\sqrt{N}} + \frac{8e^{\tilde\epsilon_0}}{N}\right),\ \delta_r\right)\text{-indistinguishable}$$
for any $\delta_r \ge 2e^{-Ne^{-n\epsilon_0^r}/16}$. Applying Lemma D.14 with $P := A_L(X^0)$, $Q = A_L(X^1)$, $q = \frac{e^{\epsilon_0^r}-1}{e^{\epsilon_0^r}+1}$, $P' = A_U(X^0)$, $Q' = A_U(X^1)$, and $P_0 = \frac12(P' + Q')$, together with convexity of the hockey-stick divergence, yields that $A_L(X^0)$ and $A_L(X^1)$ are $(\epsilon_r, \delta_r)$-indistinguishable, as desired. This proves the claim and hence (by Lemma D.13, as described earlier) the theorem.

Remark D.2. Notice that if $A$ is sequentially interactive, then the proof of Theorem D.5 above almost immediately implies the sequentially interactive part of Theorem D.4. Essentially, just change notation: replace $Z_{1:r-1}$ by $Z^{(1:i-1)}$, the collection of (single) reports sent by the first $i-1$ silos; note that $\epsilon_0^r = \epsilon_0$, $\delta_0^r = \delta_0$; and view the $N$ reports as being sent in order instead of simultaneously.
Alternatively, plug our techniques for $n > 1$ into the proof of Theorem 3.8 in (Feldman et al., 2020b), which is for sequentially interactive algorithms.

Step 2: Combine Theorem D.4 with the following CDP SCO lower bounds, which follow from (Bassily et al., 2014; 2019) and the non-private SCO lower bounds (Nemirovskii & Yudin, 1983; Agarwal et al., 2012):

Theorem D.6 ((Bassily et al., 2019; 2014)). Let $\mu, D, \epsilon > 0$, $L \ge \mu D$, and $\delta = o(1/nN)$. Consider $\mathcal{X} := \{-\frac{D}{\sqrt{d}}, \frac{D}{\sqrt{d}}\}^d \subset \mathbb{R}^d$ and $\mathcal{W} := B_2(0, D) \subset \mathbb{R}^d$. Let $A : \mathcal{X}^{nN} \to \mathcal{W}$ be any $(\epsilon, \delta)$-CDP algorithm. Then:
1. There exists a ($\mu = 0$) convex, linear ($\beta$-smooth for any $\beta$), $L$-Lipschitz loss $f : \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ and a distribution $\mathcal{D}$ on $\mathcal{X}$ such that if $X \sim \mathcal{D}^{nN}$, then the expected excess loss of $A$ is lower bounded as
$$\mathbb{E}F(A(X)) - F^* = \tilde{\Omega}\left(\frac{\phi D}{\sqrt{Nn}} + LD\min\left\{1, \frac{\sqrt{d}}{\epsilon nN}\right\}\right).$$
2. For $L \approx \mu D$, there exists a $\mu$-strongly convex and smooth, $L$-Lipschitz loss $f : \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ and a distribution $\mathcal{D}$ on $\mathcal{X}$ such that if $X \sim \mathcal{D}^{nN}$, then the expected excess loss of $A$ is lower bounded as
$$\mathbb{E}F(A(X)) - F^* = \tilde{\Omega}\left(\frac{\phi^2}{\mu Nn} + \frac{L^2}{\mu}\min\left\{1, \frac{d}{\epsilon^2 n^2 N^2}\right\}\right).$$
For general $L, \mu, D$, the above strongly convex lower bound holds with the factor $\frac{L^2}{\mu}$ replaced by $LD$.

Namely, if $A$ is $(\epsilon_0, \delta_0)$-ISRL-DP, then (under the hypotheses of Theorem 2.2) $A_s$ is $(\epsilon, \delta)$-CDP for $\epsilon = \tilde{O}(\epsilon_0/\sqrt{N})$, so Theorem D.6 implies that the excess loss of $A_s$ is lower bounded as in Theorem D.6 with $\epsilon$ replaced by $\epsilon_0/\sqrt{N}$.

Step 3: We simply observe that when the expectation is taken over the randomness in sampling $X \sim \mathcal{D}^{n\times N}$, the expected excess population loss of $A_s$ is identical to that of $A$, since $X_i$ and $X_{\pi(i)}$ have the same distribution for all $i, \pi$ by the i.i.d. assumption. This completes the proof of Theorem 2.2.
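For intuition, the single-round amplification bound of Theorem D.5 is easy to evaluate numerically; the sketch below (our illustration, not the authors' code) exhibits the $\tilde{O}(\epsilon_0^r\sqrt{\ln(1/\delta_r)/N})$ behavior when $\epsilon_0^r \le 1/n$:

```python
import math

def shuffled_epsilon(eps0_r, n, N, delta_r):
    # Central-DP epsilon of one shuffled round, per Theorem D.5:
    # ln[1 + ((e^e0 - 1)/(e^e0 + 1)) * (8*sqrt(e^{n*e0} ln(4/d_r))/sqrt(N) + 8*e^{n*e0}/N)]
    a = (math.exp(eps0_r) - 1.0) / (math.exp(eps0_r) + 1.0)
    b = 8.0 * math.sqrt(math.exp(n * eps0_r) * math.log(4.0 / delta_r)) / math.sqrt(N)
    c = 8.0 * math.exp(n * eps0_r) / N
    return math.log(1.0 + a * (b + c))
```

With $\epsilon_0^r = 1/n$, the amplified central epsilon shrinks roughly like $1/\sqrt{N}$ as the number of silos grows, and it is far below the trivial group-privacy bound $n\epsilon_0^r$.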

D.3.5 TIGHT EXCESS RISK BOUNDS FOR CROSS-DEVICE FL WITHOUT A TRUSTED SERVER

We explain how our ISRL-DP excess risk bounds also imply tight excess risk bounds for algorithms that satisfy both ISRL-DP and user-level DP simultaneously, which may be desirable for cross-device FL without a trusted server. Assume $M = N$ for simplicity and consider i.i.d. FL for concreteness. Given $(\epsilon, \delta)$ with $\epsilon \le 1$, let $\epsilon_0 = \epsilon/n$ and $\delta_0 = \delta/4n \le \delta/(ne^{(n-1)\epsilon_0}) = \delta/(ne^{(n-1)\epsilon/n})$, and run Noisy (ISRL-DP) MB-SGD with noise calibrated to $(\epsilon_0, \delta_0)$. Then the algorithm also satisfies $(\epsilon, \delta)$-user-level DP by Appendix C. Thus, by Theorem 2.1, we obtain hybrid ISRL-DP/user-level DP excess risk upper bounds. For example, in the convex case, we get
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\left(LD\left(\frac{\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 n\sqrt{N}} + \frac{1}{\sqrt{Nn}}\right)\right) = \tilde{O}\left(LD\left(\frac{\sqrt{d\ln(n/\delta)}}{\epsilon\sqrt{N}} + \frac{1}{\sqrt{Nn}}\right)\right) = \tilde{O}\left(\frac{LD\sqrt{d\ln(n/\delta)}}{\epsilon\sqrt{N}}\right). \tag{28}$$
Regarding lower bounds: note that the semantics of the hybrid ISRL-DP/user-level DP notion are essentially identical to LDP, except that individual "records/items" are now thought of as datasets of size $n$. Thus, letting $n = 1$ in our ISRL-DP lower bounds (we think of each silo as having just one "record," even though that record is really a dataset) yields lower bounds matching (up to logarithms) the upper bounds attained above. For example, putting $n = 1$ in the convex lower bound in Theorem 2.2 yields a bound that matches (28), establishing the optimal FL rates (up to logarithmic factors) for this algorithm class. Note that the minimax risk bounds for ISRL-DP/user-level DP hybrid algorithms resemble the bounds for LDP algorithms (Duchi et al., 2013), scaling with $N$ but not with $n$.
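The calibration in the paragraph above amounts to a one-line parameter map; the sketch below (ours, not from the paper) also checks the inequality $\delta/(4n) \le \delta/(ne^{(n-1)\epsilon/n})$ that the calibration relies on for $\epsilon \le 1$:

```python
import math

def hybrid_privacy_params(eps, delta, n):
    # Per-record noise targets so that Noisy ISRL-DP MB-SGD is simultaneously
    # (eps, delta) user-level DP; valid for eps <= 1 via the group-privacy step.
    assert eps <= 1.0
    eps0 = eps / n
    delta0 = delta / (4.0 * n)
    # sanity check: e^{(n-1)*eps/n} <= e <= 4, so delta/(4n) <= delta/(n*e^{(n-1)eps0})
    assert math.exp((n - 1) * eps0) <= 4.0
    return eps0, delta0
```

For example, targeting $(\epsilon, \delta) = (1, 10^{-5})$ user-level DP across silos of $n = 100$ records yields per-record targets $(\epsilon_0, \delta_0) = (0.01, 2.5\times10^{-8})$.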

E PROOFS AND SUPPLEMENTAL MATERIAL FOR SECTION 3 E.1 MULTI-STAGE IMPLEMENTATION OF ACCELERATED NOISY MB-SGD

Here we describe the multi-stage implementation of Accelerated Noisy MB-SGD that we will use to further expedite convergence for strongly convex losses, which builds on (Ghadimi & Lan, 2013). As before, for SCO, silos sample locally without replacement in each round and set $R = \lfloor n/K\rfloor$, whereas for ERM, silos sample locally with replacement.

Multi-stage implementation of Algorithm 1:
Inputs: $U \in [R]$ such that $\sum_{k=1}^U R_k \le R$ for $R_k$ defined below; $w_0 \in \mathcal{W}$, $\Delta \ge F(w_0) - F^*$, $V > 0$, and $q_0 = w_0$. For $k \in [U]$, do the following:
1. Run $R_k$ rounds of Algorithm 1 using $w_0 = q_{k-1}$, $\{\alpha_r\}_{r\ge1}$, and $\{\eta_r\}_{r\ge1}$, where
$$R_k = \left\lceil\max\left\{4\sqrt{\frac{2\beta}{\mu}},\ \frac{128V^2}{3\mu\Delta 2^{-(k+1)}}\right\}\right\rceil, \quad \alpha_r = \frac{2}{r+1}, \quad \eta_r = \frac{4\upsilon_k}{r(r+1)}, \quad \upsilon_k = \max\left\{2\beta,\ \left[\frac{\mu V^2 R_k(R_k+1)(R_k+2)}{3\Delta 2^{-(k-1)}}\right]^{1/2}\right\}.$$
2. Set $q_k = w^{ag}_{R_k}$, where $w^{ag}_{R_k}$ is the output of Step 1 above. Then update $k \leftarrow k+1$ and return to Step 1.
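A sketch (ours, following the stage definitions above; `V` is the gradient-noise level and `Delta` an upper bound on the initial gap) of how the per-stage round counts and step sizes can be generated:

```python
import math

def stage_schedule(k, beta, mu, V, Delta):
    # Stage-k parameters of the multi-stage scheme (after Ghadimi & Lan, 2013):
    # R_k grows geometrically in k; alpha_r, eta_r are accelerated step sizes.
    R_k = math.ceil(max(4.0 * math.sqrt(2.0 * beta / mu),
                        128.0 * V * V / (3.0 * mu * Delta * 2.0 ** (-(k + 1)))))
    upsilon_k = max(2.0 * beta,
                    (mu * V * V * R_k * (R_k + 1) * (R_k + 2)
                     / (3.0 * Delta * 2.0 ** (-(k - 1)))) ** 0.5)
    alpha = [2.0 / (r + 1) for r in range(1, R_k + 1)]
    eta = [4.0 * upsilon_k / (r * (r + 1)) for r in range(1, R_k + 1)]
    return R_k, alpha, eta

def num_stages(R, beta, mu, V, Delta):
    # Largest U with sum_{k<=U} R_k <= R, as required by the inputs above.
    total, U = 0, 0
    while True:
        R_k, _, _ = stage_schedule(U + 1, beta, mu, V, Delta)
        if total + R_k > R:
            return U
        total += R_k
        U += 1
```

The geometric growth of $R_k$ means later stages run longer with smaller effective noise, which is what drives the linear-then-sublinear rate in Lemma E.2.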

E.2 COMPLETE VERSION AND PROOF OF THEOREM 3.1

We will state and prove the theorem for general $M \in [N]$ under Assumption 3. We first require some additional notation. Define the heterogeneity parameter
$$\upsilon^2 := \sup_{w\in\mathcal{W}}\frac1N\sum_{i=1}^N\|\nabla F_i(w) - \nabla F(w)\|^2,$$
which has appeared in (Khaled et al., 2019; Koloskova et al., 2020; Karimireddy et al., 2020; Woodworth et al., 2020b). Note that $\upsilon^2 = 0$ iff $F_i = F + a_i$ for constants $a_i \in \mathbb{R}$, $i \in [N]$ ("homogeneous up to translation").

Theorem E.1 (Complete version of Theorem 3.1). Let $f(\cdot, x)$ be $\beta$-smooth for all $x \in \mathcal{X}$. Assume $\epsilon_0 \le 8\ln(1/\delta_0)$ and $\delta_0 \in (0,1)$. Then, with $\sigma^2 = \frac{32L^2\ln(1.25/\delta_0)}{\epsilon_0^2K^2}$, One-Pass Accelerated Noisy MB-SGD is $(\epsilon_0, \delta_0)$-ISRL-DP. Moreover, there are choices of stepsize, batch size, and $\lambda > 0$ such that:
1. Running One-Pass Accelerated Noisy MB-SGD on $\tilde{f}(w,x) := f(w,x) + \frac{\lambda}{2}\|w - w_0\|^2$ (where $w_0 \in \mathcal{W}$) yields
$$\mathbb{E}F(w^{ag}_R) - F^* = \tilde{O}\left(\frac{\phi D}{\sqrt{nM}} + \left(\frac{\beta^{1/4}LD^{3/2}\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 n\sqrt{M}}\right)^{4/5} + \sqrt{\frac{N-M}{N-1}}\mathbb{1}_{\{N>1\}}\,\upsilon L^{1/5}D^{4/5}\beta^{1/5}\left(\frac{\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 nM^3}\right)^{1/5}\right). \tag{30}$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex for all $x \in \mathcal{X}$ and $\kappa = \frac{\beta}{\mu}$, then running the Multi-Stage Implementation of One-Pass Accelerated Noisy MB-SGD directly on $f$ with batch size $K$ yields
$$\mathbb{E}F(w^{ag}_R) - F^* = \tilde{O}\left(\frac{\phi^2}{nM} + \frac{L^2}{\mu}\frac{\sqrt{\kappa}\,d\ln(1/\delta_0)}{\epsilon_0^2n^2M} + \frac{\upsilon^2}{\mu\sqrt{\kappa}M}\left(1 - \frac{M-1}{N-1}\right)\mathbb{1}_{\{N>1\}}\right). \tag{31}$$

Remark E.1.
1. For convex $f$, if $M = N$ or
$$\upsilon \lesssim \sqrt{\frac{N-1}{N-M}}\left[\frac{(L^3D^2\beta^2)^{1/5}M^{1/5}(d\ln(1/\delta_0))^{3/10}}{n^{3/5}\epsilon_0^{3/5}} + \phi\left(\frac{\beta D}{L}\right)^{1/5}\left(\frac{\epsilon_0^2M}{n^3d\ln(1/\delta_0)}\right)^{1/10}\right],$$
then (30) recovers the bound (6) in Theorem 3.1 ($M = N$ version), with $N$ replaced by $M$.
2. For $\mu$-strongly convex $f$, if $M = N$ or
$$\upsilon^2 \lesssim \left(\frac{N-1}{N-M}\right)\sqrt{\kappa}\left(\frac{\phi^2}{n} + \frac{\sqrt{\kappa}L^2d\ln(1/\delta_0)}{\epsilon_0^2n^2}\right),$$
then (31) recovers the bound (7) in Theorem 3.1, with $N$ replaced by $M$.

To prove Theorem E.1, we will need some preliminaries:

Lemma E.1 ((Woodworth et al., 2020b, Lemma 4)). Let $F : \mathcal{W} \to \mathbb{R}$ be convex and $\beta$-smooth, and suppose that the unbiased stochastic gradients $\tilde{g}(w_t)$ at each iteration have bounded variance $\mathbb{E}\|\tilde{g}(w) - \nabla F(w)\|^2 \le V^2$.
If $\hat{w}^{ag}_R$ is computed by $R$ steps of Accelerated MB-SGD on the regularized objective $\tilde{F}(w) = F(w) + \frac{V}{2\|w_0-w^*\|\sqrt{R}}\|w - w_0\|^2$, then
$$\mathbb{E}F(\hat{w}^{ag}_R) - F^* \lesssim \frac{\beta\|w_0-w^*\|^2}{R^2} + \frac{V\|w_0-w^*\|}{\sqrt{R}}.$$

We then have the following bound for the multi-stage protocol:

Lemma E.2 ((Ghadimi & Lan, 2013, Proposition 7)). Let $F : \mathcal{W} \to \mathbb{R}$ be $\mu$-strongly convex and $\beta$-smooth, and suppose that the unbiased stochastic gradients $\tilde{g}(w_r)$ at each iteration $r$ have bounded variance $\mathbb{E}\|\tilde{g}(w_r) - \nabla F(w_r)\|^2 \le V^2$. If $\hat{w}^{ag}_R$ is computed by $R$ steps of the Multi-Stage Accelerated MB-SGD, then
$$\mathbb{E}F(\hat{w}^{ag}_R) - F^* \lesssim \Delta\exp\left(-\sqrt{\frac{\mu}{\beta}}R\right) + \frac{V^2}{\mu R}, \quad\text{where } \Delta = F(w_0) - F^*.$$

Of course, (Woodworth et al., 2020b, Lemma 4) and (Ghadimi & Lan, 2013, Proposition 7) are stated for the non-private Accelerated MB-SGD (AC-SA). However, we observe that these bounds depend on the stochastic gradient oracle only via its bias and variance; hence these results also apply to our (multi-stage implementation of) Accelerated Noisy MB-SGD. Next, we bound the variance of our noisy stochastic gradient estimators:

Lemma E.3. Let $X_i \sim \mathcal{D}_i^n$ and $\tilde{g}_r := \frac{1}{M_r}\sum_{i\in S_r}\frac1K\sum_{j\in[K]}(\nabla f(w_r, x^r_{i,j}) + u_i)$, where $(x^r_{i,j})_{j\in[K]}$ are sampled from $X_i$ and $u_i \sim N(0, \sigma^2 I_d)$ is independent of $\nabla f(w_r, x^r_{i,j})$ for all $i \in [N]$, $j \in [K]$. Then
$$\mathbb{E}\|\tilde{g}_r - \nabla F(w_r)\|^2 \le \frac{\phi^2}{MK} + \left(1 - \frac{M-1}{N-1}\right)\frac{\upsilon^2}{M}\mathbb{1}_{\{N>1\}} + \frac{d\sigma^2}{M}.$$
The three terms on the right-hand side correspond (from left to right) to the variances of: local minibatch sampling within each silo, the draw of the silo set $S_r$ of size $M_r$ under Assumption 3, and the Gaussian noise. We now turn to the proof of Lemma E.3.

Proof of Lemma E.3. First, fix the randomness due to the size of the silo set $M_r$. Now $\tilde{g}_r = g_r + \bar{u}_r$, where $\bar{u}_r = \frac{1}{M_r}\sum_{i=1}^{M_r}u_i \sim N(0, \frac{\sigma^2}{M_r}I_d)$ and $\bar{u}_r$ is independent of $g_r := \frac{1}{M_r}\sum_{i\in S_r}\frac{1}{K_i}\sum_{j\in[K]}\nabla f(w_r, x^r_{i,j})$.
Hence
$$\mathbb{E}[\|\tilde{g}_r - \nabla F(w_r)\|^2 \mid M_r] = \mathbb{E}[\|g_r - \nabla F(w_r)\|^2 \mid M_r] + \mathbb{E}[\|\bar{u}_r\|^2 \mid M_r] = \mathbb{E}[\|g_r - \nabla F(w_r)\|^2 \mid M_r] + \frac{d\sigma^2}{M_r}.$$
Let us drop the $r$ subscripts for brevity (denoting $g = g_r$, $w = w_r$, $S = S_r$, and $M_r = M'$, since they have the same distribution) and denote $h_i := \frac{1}{K_i}\sum_{j=1}^{K_i}\nabla f(w, x_{i,j})$. Now, conditionally on $M'$, we have
$$\mathbb{E}[\|g - \nabla F(w)\|^2 \mid M'] = \mathbb{E}\left[\Bigl\|\frac{1}{M'}\sum_{i\in S}(h_i - \nabla F_i(w)) + \frac{1}{M'}\sum_{i\in S}\nabla F_i(w) - \nabla F(w)\Bigr\|^2 \Bigm| M'\right] = \underbrace{\frac{1}{M'^2}\mathbb{E}\left[\Bigl\|\sum_{i\in S}h_i - \nabla F_i(w)\Bigr\|^2 \Bigm| M'\right]}_{(a)} + \underbrace{\frac{1}{M'^2}\mathbb{E}\left[\Bigl\|\sum_{i\in S}\nabla F_i(w) - \nabla F(w)\Bigr\|^2 \Bigm| M'\right]}_{(b)},$$
since, conditional on $S$, the cross-terms vanish by (conditional) independence of $h_i$ and the non-random $\sum_{i'\in S}\nabla F_{i'}(w) - \nabla F(w)$ for all $i \in S$. Now we bound $(a)$:
$$(a) = \mathbb{E}_S\left[\mathbb{E}_{h_i}\Bigl\|\sum_{i\in S}h_i(w) - \nabla F_i(w)\Bigr\|^2 \Bigm| S, M'\right] = \mathbb{E}_S\left[\sum_{i\in S}\mathbb{E}_{h_i}\|h_i(w) - \nabla F_i(w)\|^2 \Bigm| S, M'\right] \le \mathbb{E}_S\left[\sum_{i\in S}\frac{\phi^2}{K}\right] \le \mathbb{E}_S\left[\frac{M'\phi^2}{K}\right].$$

Excess loss:
1. For the convex case, we choose $\lambda = \frac{V}{2D\sqrt{R}}$, where $V^2 = \frac{\phi^2}{MK} + \frac{\upsilon^2}{M}\mathbb{1}_{\{N>1\}}\left(\frac{N-M}{N-1}\right) + \frac{d\sigma^2}{M}$ is the variance of the noisy stochastic minibatch gradients, by Lemma E.3 for our noise with variance $\sigma^2 = \frac{32L^2\ln(1.25/\delta_0)}{K^2\epsilon_0^2}$. Now plugging $V^2$ into Lemma E.1, setting $R = n/K$ and $\lambda := \frac{V}{2D\sqrt{R}}$, yields
$$\mathbb{E}F(w^{ag}_R) - F^* \lesssim \frac{\beta D^2K^2}{n^2} + \frac{\phi D}{\sqrt{nM}} + \frac{LD\sqrt{d\ln(1/\delta_0)}}{\epsilon_0\sqrt{KnM}} + \frac{\sqrt{K}\upsilon D}{\sqrt{nM}}\sqrt{\frac{N-M}{N-1}}\mathbb{1}_{\{N>1\}}.$$
Choosing $K = \left(\frac{L}{\beta D}\right)^{2/5}\frac{n^{3/5}(d\ln(1/\delta_0))^{1/5}}{\epsilon_0^{2/5}M^{1/5}}$ implies (30).
2. For strongly convex loss, we plug the same estimate for $V^2$ used above into Lemma E.2 with $R = n/K$ to obtain
$$\mathbb{E}F(w^{ag}_R) - F^* \lesssim \Delta\exp\left(-\frac{n}{K\sqrt{\kappa}}\right) + \frac{\phi^2}{\mu nM} + \frac{\upsilon^2K}{\mu nM}\left(1 - \frac{M-1}{N-1}\right)\mathbb{1}_{\{N>1\}} + \frac{L^2}{\mu}\frac{d\ln(1/\delta_0)}{Kn\epsilon_0^2M}.$$
Choosing $K = \frac{n}{\sqrt{\kappa}}\ln\left(\mu\Delta\min\left\{\frac{\epsilon_0^2n^2M}{L^2d\ln(1/\delta_0)}, \frac{nM}{\phi^2}\right\}\right)^{-1}$ yields (31).

F PROOFS AND SUPPLEMENTARY MATERIAL FOR SECTION 4

F.1 PROOF OF THEOREM 4.1

We begin by considering $\beta$-smooth $f$.

Theorem F.1 (Smooth ERM Upper Bound).
Assume $f(\cdot, x)$ is $\beta$-smooth for all $x$. Let $\epsilon_0 \le 2\ln(2/\delta_0)$, $\delta_0 \in (0,1)$, choose $K \ge \frac{\epsilon_0 n}{4\sqrt{2R\ln(2/\delta_0)}}$, and $\sigma^2 = \frac{256L^2R\ln(2.5R/\delta_0)\ln(2/\delta_0)}{n^2\epsilon_0^2}$. Then Algorithm 1 is $(\epsilon_0, \delta_0)$-ISRL-DP. Further:
1. If $f(\cdot, x)$ is convex, then running Algorithm 1 on the regularized objective $\tilde{f}(w,x) = f(w,x) + \frac{\lambda}{2}\|w - w_0\|^2$ with
$$R = \max\left\{\left(\sqrt{\frac{\beta D}{L}}\frac{\sqrt{M}\epsilon_0 n}{\sqrt{d\ln(1/\delta_0)}}\right)^{1/2},\ \mathbb{1}_{\{MK<Nn\}}\frac{\epsilon_0^2n^2}{Kd\ln(1/\delta_0)}\right\}$$
(and $\lambda$ specified in the proof) yields
$$\mathbb{E}\hat{F}_X(w^{ag}_R) - \hat{F}_X^* = \tilde{O}\left(\frac{LD\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 n\sqrt{M}}\right). \tag{32}$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex, then running the multi-stage implementation (Appendix E.1) of Algorithm 1 with $R = \max\left\{\sqrt{\frac{\beta}{\mu}}\ln\left(\frac{\Delta_X\mu M\epsilon_0^2n^2}{L^2d}\right),\ \mathbb{1}_{\{MK<Nn\}}\frac{\epsilon_0^2n^2}{Kd\ln(1/\delta_0)}\right\}$ and $\Delta_X \ge \hat{F}_X(w_0) - \hat{F}_X^*$ yields
$$\mathbb{E}\hat{F}_X(w^{ag}_R) - \hat{F}_X^* = \tilde{O}\left(\frac{L^2}{\mu}\frac{d\ln(1/\delta_0)}{\epsilon_0^2n^2M}\right).$$

To prove Theorem F.1, we will require the following lemma:

Lemma F.1 ((Lei et al., 2017)). Let $\{a_l\}_{l\in[\tilde{N}]}$ be an arbitrary collection of vectors such that $\sum_{l=1}^{\tilde{N}}a_l = 0$. Further, let $\mathcal{S}$ be a uniformly random subset of $[\tilde{N}]$ of size $\tilde{M}$. Then
$$\mathbb{E}\left\|\frac{1}{\tilde{M}}\sum_{l\in\mathcal{S}}a_l\right\|^2 = \frac{\tilde{N}-\tilde{M}}{(\tilde{N}-1)\tilde{M}}\cdot\frac{1}{\tilde{N}}\sum_{l=1}^{\tilde{N}}\|a_l\|^2 \le \frac{\mathbb{1}_{\{\tilde{M}<\tilde{N}\}}}{\tilde{M}\tilde{N}}\sum_{l=1}^{\tilde{N}}\|a_l\|^2.$$

Proof of Theorem F.1. Privacy: ISRL-DP of Algorithm 2 follows from ISRL-DP of the stochastic gradients $\{\tilde{g}^i_r\}_{r=1}^R$ in line 7 of the algorithm (which was established in Theorem D.1) and the post-processing property of DP (Dwork & Roth, 2014, Proposition 2.1). Namely, since the choices of $\sigma^2, K$ given in Theorem F.1 ensure that silos' local stochastic minibatch gradients are ISRL-DP, and the iterates in Algorithm 1 are functions of these private noisy gradients (which involve no additional data queries), the iterates themselves are ISRL-DP.
Excess risk:
1. For $\lambda = \frac{V_X}{\|w_0-w^*\|\sqrt{R}}$, Lemma E.1 implies
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* \lesssim \frac{\beta D^2}{R^2} + \frac{V_XD}{\sqrt{R}}, \tag{34}$$
where $V_X^2 := \sup_{r\in[R]}\mathbb{E}\|\frac{1}{M_r}\sum_{i\in S_r}\tilde{g}^i_r - \nabla\hat{F}_X(w^{md}_r)\|^2$ for $\tilde{g}^i_r = \frac1K\sum_{j=1}^K\nabla f(w^{md}_r, x^r_{i,j}) + u_i$ defined in line 7 of Algorithm 1. Now Lemma F.1 and $L$-Lipschitzness of $f(\cdot, x)$ imply that
$$V_X^2 \le \frac{\mathbb{1}_{\{M_rK<nN\}}}{M_rKnN}\sum_{i=1}^N\sum_{j=1}^n\sup_{w\in\mathcal{W}}\|\nabla f(w, x_{i,j}) - \nabla\hat{F}_X(w)\|^2 \le \mathbb{1}_{\{M_rK<nN\}}\frac{4L^2}{M_rK},$$
conditional on $M_r$.
Hence, taking total expectation with respect to $M_r$ and plugging this bound into (34) yields
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* \lesssim \frac{\beta D^2}{R^2} + LD\left(\frac{\mathbb{1}_{\{MK<nN\}}}{\sqrt{MKR}} + \frac{\sqrt{d\ln^2(R/\delta_0)}}{\sqrt{M}\epsilon_0 n}\right),$$
by our choice of $\sigma^2$ and independence of the noises $\{u_i\}_{i\in S_r}$ across silos. Then one verifies that the prescribed choice of $R$ yields (32).
2. Invoking Lemma E.2 with the same estimate for $V_X^2$ obtained above gives
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* \lesssim \Delta_X\exp\left(-\sqrt{\frac{\mu}{\beta}}R\right) + \frac{L^2}{\mu}\left(\frac{\mathbb{1}_{\{MK<nN\}}}{MKR} + \frac{d\ln^2(R/\delta_0)}{M\epsilon_0^2n^2}\right),$$
and plugging in the prescribed $R$ completes the proof.

We now re-state a precise form of Theorem 4.1 before providing its proof:

Theorem F.2 (Precise Re-statement of Theorem 4.1). Let $\epsilon_0 \le 2\ln(2/\delta_0)$, $\delta_0 \in (0,1)$, choose $\sigma^2 = \frac{256L^2R\ln(2.5R/\delta_0)\ln(2/\delta_0)}{n^2\epsilon_0^2}$ and $K \ge \frac{\epsilon_0 n}{4\sqrt{2R\ln(2/\delta_0)}}$. Then Algorithm 1 is $(\epsilon_0, \delta_0)$-ISRL-DP. Further, there exist choices of $\beta > 0$ such that for Algorithm 2 run on $f_\beta(w, x) := \min_{v\in\mathcal{W}}\left(f(v,x) + \frac{\beta}{2}\|w - v\|^2\right)$, we have:
1. If $f(\cdot, x)$ is convex, then setting $R = \max\left\{\frac{\sqrt{M}\epsilon_0 n}{\sqrt{d\ln(1/\delta_0)}},\ \mathbb{1}_{\{MK<Nn\}}\frac{\epsilon_0^2n^2}{Kd\ln(1/\delta_0)}\right\}$ yields
$$\mathbb{E}\hat{F}_X(w^{ag}_R) - \hat{F}_X^* = \tilde{O}\left(\frac{LD\sqrt{d\ln(1/\delta_0)}}{\epsilon_0 n\sqrt{M}}\right).$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex and $R = \tilde{O}\left(\max\left\{\frac{\sqrt{M}\epsilon_0 n}{\sqrt{d\ln(1/\delta_0)}},\ \mathbb{1}_{\{MK<Nn\}}\frac{\epsilon_0^2n^2}{Kd\ln(1/\delta_0)}\right\}\right)$ (in the multi-stage implementation in Appendix E.1), then
$$\mathbb{E}\hat{F}_X(w^{ag}_R) - \hat{F}_X^* = \tilde{O}\left(\frac{L^2}{\mu}\frac{d\ln(1/\delta_0)}{\epsilon_0^2n^2M}\right).$$

Proof. We established ISRL-DP in Theorem F.1.
Excess risk:
1. By Lemma D.7 and Theorem F.1, we have
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* \lesssim \frac{LD\sqrt{d\ln^2(R/\delta_0)}}{\sqrt{M}\epsilon_0 n} + \frac{L^2}{\beta}, \quad\text{for the prescribed } R, \quad\text{and}$$
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* \lesssim \frac{L^2}{\mu}\frac{d\ln^2(R/\delta_0)}{M\epsilon_0^2n^2} + \frac{L^2}{\beta}, \quad\text{if } R = \max\left\{\sqrt{\frac{\beta}{\mu}}\ln\left(\frac{\Delta_X\mu M\epsilon_0^2n^2}{L^2d}\right),\ \mathbb{1}_{\{MK<Nn\}}\frac{\epsilon_0^2n^2}{Kd\ln(1/\delta_0)}\right\}.$$
Now choosing $\beta := \frac{\mu M\epsilon_0^2n^2}{d}$ yields both the desired excess risk and the communication complexity bound.

Remark F.1. The algorithm of (Girgis et al., 2021) requires $R = \tilde{\Omega}(\epsilon_0^2n^2M/d)$ communications, making Algorithm 1 faster by a factor of $\min(\sqrt{M}\epsilon_0 n/\sqrt{d},\ MK)$.
If M " N and full batches are used, the advantage of our algorithm over that of (Girgis et al., 2021) is even more significant. F.2 LOWER BOUNDS FOR ISRL-DP FEDERATED ERM Formally, define the algorithm class B pϵ0,δ0q,C to consist of those (sequentially interactive or Ccompositional, ISRL-DP) algorithms A P A pϵ0,δ0q,C such that for any X P X, f P F L,D , the expected empirical loss of the shuffled algorithm A s derived from A is upper bounded by the expected loss of A: E A,tπrur p F pA s pXqq À E A p F pApXqq. Here A s denotes the algorithm that applies the randomizer R piq r to X πrpiq for all i, r, but otherwise behaves exactly like A. This is not a very constructive definition but we will describe examples of algorithms in B pϵ0,δ0q,C . B pϵ0,δ0q,C includes all C-compositional or sequentially interactive ISRL-DP algorithms that are symmetric with respect to each of the N silos, meaning that the aggregation functions g r are symmetric (i.e. g r pZ 1 , ¨¨¨, Z N q " g r pZ πp1q , ¨¨¨Z πpN q q for all permutations π) and in each round r the randomizers R piq r " R r are the same for all silos i P rN s. (R piq r can still change with r though.) For example, Algorithm 2 and Algorithm 1 are both in B pϵ0,δ0q :" B pϵ0,δ0q,1 . This is because the aggregation functions used in each round are simple averages of the M " N noisy gradients received from all silos (and they are compositional) and the randomizers used by every silo in round r are identical: each adds the same Gaussian noise to the stochastic gradients. B also includes sequentially interactive algorithms that choose the order in which silos are processed uniformly at random. This is because the distributions of the updates of A and A s are both averages over all permutations of rN s of the conditional (on π) distributions of the randomizers applied to the π-permuted database. If A P B pϵ0,δ0q,C is sequentially interactive or compositional, we write A P B. 
We now state and prove our ERM lower bounds:

Theorem F.3. Let $\epsilon_0 \in (0, \sqrt{N}]$, $\delta_0 = o(1/nN)$, and $A \in \mathcal{B}_{(\epsilon_0,\delta_0),C}$ such that in every round $r \in [R]$, the local randomizers $R^{(i)}_r(Z^{(1:r-1)}, \cdot) : \mathcal{X}^n \to \mathcal{Z}$ are $(\epsilon^r_0, \delta^r_0)$-DP for all $i \in [N]$, $Z^{(1:r-1)} \in \mathcal{Z}^{(r-1)\times N}$, with $\epsilon^r_0 \le \frac1n$, $\delta^r_0 = o(1/nNR)$, and $N \ge 16\ln(2/\delta^r_0n)$. Then:
1. There exists a (linear, hence $\beta$-smooth for all $\beta \ge 0$) loss function $f \in \mathcal{F}_{L,D}$ and a database $X \in \mathcal{X}^{nN}$ (for some $\mathcal{X}$) such that
$$\mathbb{E}\hat{F}_X(A(X)) - \hat{F}_X^* = \tilde{\Omega}\left(LD\min\left\{1, \frac{\sqrt{d}}{\epsilon_0 n\sqrt{N}C^2}\right\}\right).$$
2. There exists a ($\mu$-smooth) $f \in \mathcal{G}_{\mu,L,D}$ and a database $X \in \mathcal{X}^{nN}$ such that
$$\mathbb{E}\hat{F}_X(A(X)) - \hat{F}_X^* = \tilde{\Omega}\left(LD\min\left\{1, \frac{d}{\epsilon_0^2n^2NC^4}\right\}\right).$$
Further, if $A \in \mathcal{B}$, then the above lower bounds hold with $C = 1$.

Proof. Step 1 is identical to Step 1 of the proof of Theorem 2.2. Step 2 is very similar to Step 2 in the proof of Theorem 2.2, but now we use Theorem F.4 (below) instead of Theorem D.6 to lower bound the excess empirical loss of $A_s$. Step 3: Finally, the definition of $\mathcal{B}$ implies that the excess risk of $A$ is the same as that of $A_s$, hence the lower bound also applies to $A$.

Theorem F.4 ((Bassily et al., 2014)). Let $\mu, D, \epsilon > 0$, $L \ge \mu D$, and $\delta = o(1/nN)$. Consider $\mathcal{X} := \{-\frac{D}{\sqrt{d}}, \frac{D}{\sqrt{d}}\}^d \subset \mathbb{R}^d$ and $\mathcal{W} := B_2(0, D) \subset \mathbb{R}^d$. Let $A : \mathcal{X}^{nN} \to \mathcal{W}$ be any $(\epsilon, \delta)$-CDP algorithm. Then:
1. There exists a ($\mu = 0$) convex, linear ($\beta$-smooth for any $\beta$), $L$-Lipschitz loss $f : \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ and a database $X \in \mathcal{X}^{nN}$ such that the expected empirical loss of $A$ is lower bounded as
$$\mathbb{E}\hat{F}_X(A(X)) - \hat{F}_X^* = \tilde{\Omega}\left(LD\min\left\{1, \frac{\sqrt{d}}{\epsilon nN}\right\}\right).$$
2. There exists a $\mu$-strongly convex, $\mu$-smooth, $L$-Lipschitz loss $f : \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ and a database $X \in \mathcal{X}^{nN}$ such that the expected empirical loss of $A$ is lower bounded as
$$\mathbb{E}\hat{F}_X(A(X)) - \hat{F}_X^* = \tilde{\Omega}\left(LD\min\left\{1, \frac{d}{\epsilon^2n^2N^2}\right\}\right).$$

G PROOFS AND SUPPLEMENTARY MATERIALS FOR SECTION 5

G.1 PROOF OF THEOREM 5.1

Theorem G.1 (Precise version of Theorem 5.1). Let $\epsilon \le \ln(2/\delta)$, $\delta \in (0,1)$, and $M \ge 16\ln(18RM^2/N\delta)$ for (polynomial) $R$ specified in the proof.
Then there is a constant $C > 0$ such that setting $\sigma^2 := \frac{CL^2RM\ln(RM^2/N\delta)\ln(R/\delta)\ln(1/\delta)}{n^2N^2\epsilon^2}$ ensures that the shuffled version of Algorithm 2 is $(\epsilon, \delta)$-CDP. Moreover, there exist $\eta_r = \eta$ and $\{\gamma_r\}_{r=0}^{R-1}$ such that the shuffled version of Algorithm 2 achieves the following upper bounds:
1. If $f(\cdot, x)$ is convex, then
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\left(LD\left(\frac{1}{\sqrt{nM}} + \frac{\sqrt{d\ln(1/\delta)}}{\epsilon nN}\right)\right).$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex, then
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\left(\frac{L^2}{\mu}\left(\frac{1}{nM} + \frac{d\ln(1/\delta)}{\epsilon^2n^2N^2}\right)\right).$$

Proof. We fix $K = 1$ for simplicity, but note that $K > 1$ can also be used (see Lemma 3 in (Girgis et al., 2021)). We shall also assume WLOG that $f(\cdot, x)$ is $\beta$-smooth: the reduction to non-smooth $f$ follows by Nesterov smoothing, as in the proofs of Theorems 2.1 and 4.1. Our choices of $R$ shall be
$$R := \max\left(\frac{n^2N^2\epsilon^2}{M},\ \frac{N}{M},\ \min\left\{n, \frac{\epsilon^2n^2N^2}{dM}\right\},\ \frac{\beta D}{L}\min\left\{\sqrt{nM}, \frac{\epsilon nN}{\sqrt{d}}\right\}\right)$$
for convex $f$, and
$$R := \max\left(\frac{n^2N^2\epsilon^2}{M},\ \frac{N}{M},\ \frac{8\beta}{\mu}\ln\left(\frac{\beta D^2\mu\epsilon^2n^2N^2}{dL^2}\right),\ \min\left\{n, \frac{\epsilon^2n^2N^2}{dM}\right\}\right)$$
for strongly convex $f$.

Privacy: Observe that in each round $r$, the model updates of the shuffled algorithm $A^r_s$ can be viewed as post-processing of the composition $\mathcal{M}_r(X) = \mathcal{S}_M\circ\mathrm{samp}_{M,N}(Z^{(1)}_r, \cdots, Z^{(N)}_r)$, where $\mathcal{S}_M$ uniformly randomly shuffles the $M$ received reports, $\mathrm{samp}_{M,N}$ is the mechanism that chooses $M$ reports uniformly at random from $N$, and $Z^{(i)}_r = \mathrm{samp}_{1,n}(\hat{R}_r(x_{i,1}), \cdots, \hat{R}_r(x_{i,n}))$, where $\hat{R}_r(x) := \nabla f(w_r, x) + u$ and $u \sim N(0, \sigma^2I_d)$. Recall (Theorem A.1 in (Dwork & Roth, 2014)) that $\sigma^2 = \frac{8L^2\ln(2/\hat{\delta}_0)}{\hat{\epsilon}_0^2}$ suffices to ensure that $\hat{R}_r$ is $(\hat{\epsilon}_0, \hat{\delta}_0)$-DP if $\hat{\epsilon}_0 \le 1$. Now note that $\mathcal{M}_r(X) = \tilde{R}^M(\mathcal{S}_M\,\mathrm{samp}_{M,N}(X_1, \cdots, X_N))$, where $\tilde{R} : \mathcal{X}^n \to \mathcal{Z}$ is given by $X \mapsto \mathrm{samp}_{1,n}(\hat{R}(x_1), \cdots, \hat{R}(x_n))$ and $\tilde{R}^M : \mathcal{X}^{nM} \to \mathcal{Z}^M$ is given by $X \mapsto (\tilde{R}(X_1), \cdots, \tilde{R}(X_M))$ for any $X = (X_1, \cdots, X_M) \in \mathcal{X}^{nM}$. This is because we are applying the same randomizer (same additive Gaussian noise) across silos, and the operators $\mathcal{S}_M$ and $\tilde{R}^M$ commute.
(Also, applying a randomizer to all $N$ silos and then randomly choosing $M$ reports is equivalent to randomly choosing $M$ silos and then applying the same randomizer to all $M$ of these silos.) Therefore, conditional on the random subsampling of $M$ out of $N$ silos (denoted $(X_1, \cdots, X_M)$ for convenience), Theorem 3.8 in (Feldman et al., 2020b) implies that $(\hat{R}(x_{\pi(1),1}), \cdots, \hat{R}(x_{\pi(1),n}), \cdots, \hat{R}(x_{\pi(M),1}), \cdots, \hat{R}(x_{\pi(M),n}))$ is $(\hat{\epsilon}, \hat{\delta})$-CDP, where $\hat{\epsilon} = O\left(\frac{\hat{\epsilon}_0\sqrt{\ln(1/M\hat{\delta}_0)}}{\sqrt{M}}\right)$ and $\hat{\delta} = 9M\hat{\delta}_0$, provided $\hat{\epsilon}_0 \le 1$ and $M \ge 16\ln(2/\hat{\delta}_0)$ (which we will see is satisfied by our assumption on $M$). Next, privacy amplification by subsampling (see (Ullman, 2017) and Lemma 3 in (Girgis et al., 2021)) of silos and local samples implies that $\mathcal{M}_r$ is $(\epsilon_r, \delta_r)$-CDP, where $\epsilon_r = \frac{2\hat{\epsilon}M}{nN} = O\left(\frac{\hat{\epsilon}_0\sqrt{M}\sqrt{\ln(1/M\hat{\delta}_0)}}{nN}\right)$ and $\delta_r = \frac{M}{nN}\hat{\delta} = \frac{9M^2}{nN}\hat{\delta}_0$. Finally, by the advanced composition theorem (Theorem 3.20 in (Dwork & Roth, 2014)), to ensure that $A_s$ is $(\epsilon, \delta)$-CDP, it suffices to make each round $(\epsilon_r, \delta_r)$-CDP for suitable per-round targets; for some $C > 0$ and $\hat{\delta}_0 = \frac{nN\delta}{18RM^2}$, we see that $\sigma^2 = O\left(\frac{L^2\ln(RM^2/N\delta)\ln(R/\delta)\ln(1/\delta)RM}{n^2N^2\epsilon^2}\right)$ ensures that $A_s$ is $(\epsilon, \delta)$-CDP, i.e. that $A$ is $(\epsilon, \delta)$-SDP. Note that our choices of $R$ in the theorem (specifically $R \ge N/M$ and $R \ge \frac{n^2N^2\epsilon^2}{M}$) ensure that $\hat{\delta}_0, \delta \le 1$ and $\hat{\epsilon}_0 \lesssim 1$, so that Theorem 3.8 in (Feldman et al., 2020b) indeed gives us the amplification-by-shuffling result used above.

Excess risk: Note that shuffling does not affect the uniform stability of the algorithm, so we proceed similarly to the proof of Theorem 2.1, except that $\sigma^2$ is now smaller.
1. Convex case: Set $\gamma_r = \gamma = 1/R$ for all $r$. Now Lemma D.2, Lemma D.1, and Lemma D.6 (with $\sigma^2$ in the lemma replaced by the $\sigma^2$ prescribed here) together imply, for any $\eta \le 1/\beta$, that
$$\mathbb{E}F(\hat{w}_R) - F^* \lesssim \frac{L^2R\eta}{nM} + \frac{D^2}{\eta R} + \eta\left(\frac{L^2}{M} + \frac{d\sigma^2}{M}\right) + \frac{\beta D^2}{R}.$$
Then one can verify that plugging in the prescribed $R$ yields the stated excess population loss bound.
2. $\mu$-strongly convex case: By Lemma D.6 (and its proof), we know there exist $\eta \le 1/\beta$ and $\{\gamma_r\}_{r=1}^R$ such that
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* = \tilde{O}\left(\beta D^2\exp\left(-\frac{R}{2\kappa}\right) + \frac{L^2}{\mu R} + \frac{d\sigma^2}{\mu MR}\right).$$
Hence Lemma D.1 and Lemma D.2 imply
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\left(\beta D^2\exp\left(-\frac{R}{2\kappa}\right) + \frac{L^2}{\mu R} + \frac{L^2d\ln(1/\delta)}{\mu\epsilon^2n^2N^2} + \frac{L^2}{\mu Mn}\right).$$
Then one verifies that the prescribed $R$ is large enough to achieve the stated excess population loss bound.

G.2 SDP ONE-PASS ACCELERATED NOISY MB-SGD AND THE PROOF OF THEOREM 5.2

To develop an SDP variation of One-Pass Accelerated Noisy MB-SGD, we use the binomial-noise-based protocol of (Cheu et al., 2021) (described in Algorithm 4) instead of the Gaussian mechanism and amplification by shuffling, because for our one-pass algorithm, amplification by shuffling would impose an impractical restriction on $\epsilon$. Algorithm 4 invokes the SDP scalar summation subroutine $P_{1D}$ (Algorithm 5). We recall the privacy and accuracy guarantees of Algorithm 4 below.
Lemma G.1 ((Cheu et al., 2021)). For any $0 < \epsilon \le 15$, $0 < \delta < 1/2$, $d, N \in \mathbb{N}$, and $L > 0$, there are choices of parameters $b, g \in \mathbb{N}$ and $p \in (0, 1/2)$ for $\mathcal{P}_{1D}$ (Algorithm 5) such that, for $X = (x_1, \cdots, x_N)$ containing vectors of maximum norm $\max_{i\in[N]}\|x_i\| \le L$, the following holds: 1) $\mathcal{P}_{vec}$ is $(\epsilon, \delta)$-SDP; and 2) $\mathcal{P}_{vec}(X)$ is an unbiased estimate of $\sum_{i=1}^N x_i$ with bounded variance
$$\mathbb{E}\Bigg\|\mathcal{P}_{vec}(X;\epsilon,\delta;L) - \sum_{i=1}^N x_i\Bigg\|^2 = O\Bigg(\frac{dL^2\log^2(d/\delta)}{\epsilon^2}\Bigg).$$

With these building blocks in hand, we provide our SDP One-Pass Accelerated Noisy MB-SGD algorithm in Algorithm 6. We now provide the general version of Theorem 5.2 for $M \le N$.

Algorithm 4 $\mathcal{P}_{vec}$, a shuffle protocol for vector summation (Cheu et al., 2021)
1: Input: database of $d$-dimensional vectors $X = (x_1, \cdots, x_N)$; privacy parameters $\epsilon, \delta$; $L$.
2: procedure: Local Randomizer $\mathcal{R}_{vec}(x_i)$
3:   for $j \in [d]$ do
4:     Shift component to enforce non-negativity: $w_{i,j} \leftarrow x_{i,j} + L$
5:     $m_j \leftarrow \mathcal{R}_{1D}(w_{i,j})$
6:   end for
7:   Output labeled messages $\{(j, m_j)\}_{j\in[d]}$
8: end procedure
9: procedure: Analyzer $\mathcal{A}_{vec}(y)$
10:  for $j \in [d]$ do
11:    Run analyzer on coordinate $j$'s messages: $z_j \leftarrow \mathcal{A}_{1D}(y_j)$
12:    Re-center: $o_j \leftarrow z_j - NL$
13:  end for
14:  Output the vector of estimates $o = (o_1, \cdots, o_d)$
15: end procedure

Algorithm 5 $\mathcal{P}_{1D}$, a shuffle protocol for summing scalars (Cheu et al., 2021)
1: Input: scalar database $X = (x_1, \cdots, x_N) \in [0, L]^N$; $g, b \in \mathbb{N}$; $p \in (0, \frac{1}{2})$.
2: procedure: Local Randomizer $\mathcal{R}_{1D}(x_i)$
3:   $\bar{x}_i \leftarrow \lfloor x_i g/L\rfloor$.
4:   Sample rounding value $\eta_1 \sim \mathrm{Ber}(x_i g/L - \bar{x}_i)$.
5:   Set $\tilde{x}_i \leftarrow \bar{x}_i + \eta_1$.
6:   Sample privacy noise value $\eta_2 \sim \mathrm{Bin}(b, p)$.
7:   Report $y_i \in \{0,1\}^{g+b}$ containing $\tilde{x}_i + \eta_2$ copies of 1 and $g + b - (\tilde{x}_i + \eta_2)$ copies of 0.
8: end procedure
9: procedure: Analyzer $\mathcal{A}_{1D}(\mathcal{S}(y))$
10:  Output estimator $\frac{L}{g}\Big(\big(\sum_{i=1}^N\sum_{j=1}^{b+g}(y_i)_j\big) - pbN\Big)$.
11: end procedure

Algorithm 6 SDP Accelerated Noisy MB-SGD
1: Input: Data $X_i \in \mathcal{X}_i^{n_i}$, $i \in [N]$; strong convexity modulus $\mu \ge 0$; privacy parameters $(\epsilon, \delta)$; iteration number $R \in \mathbb{N}$; batch size $K \in [n]$; step size parameters $\{\eta_r\}_{r\in[R]}$, $\{\alpha_r\}_{r\in[R]}$.
2: Initialize $w^{ag}_0 = w_0 \in \mathcal{W}$ and $r = 1$.
3: for $r \in [R]$ do
5:   for $i \in S_r$ in parallel do
6:     Silo $i$ draws $\{x^r_{i,j}\}_{j=1}^K$ from $X_i$ (without replacement) and computes $Z^r_i := \{\nabla f(w^{md}_r, x^r_{i,j})\}_{j=1}^K$.
7:   end for
8:   Server receives $\tilde{g}_r := \frac{1}{M_rK}\mathcal{P}_{vec}(\{Z^r_i\}_{i\in S_r}; \epsilon, \delta; L)$.
10:  Server updates and broadcasts $w^{ag}_r = \alpha_r w_r + (1 - \alpha_r)w^{ag}_{r-1}$.
11: end for
12: Output: $w^{ag}_R$.

Theorem G.2 (Complete version of Theorem 5.2). Let $f(\cdot, x)$ be $\beta$-smooth for all $x \in \mathcal{X}$. Assume $\epsilon \le 15$ and $\delta \in (0, \frac{1}{2})$. Then Algorithm 6 is $(\epsilon, \delta)$-SDP. Moreover, there are choices of stepsize, batch size, and $\lambda > 0$ such that (for $\upsilon$ defined in (29)):

1. Running Algorithm 6 on $\tilde{f}(w, x) := f(w, x) + \frac{\lambda}{2}\|w - w_0\|^2$ (where $w_0 \in \mathcal{W}$) yields
$$\mathbb{E}F(w^{ag}_R) - F^* = O\Bigg(\frac{\phi D}{\sqrt{nM}} + \Big(\frac{\beta^{1/4}LD^{3/2}\sqrt{d\ln(d/\delta)}}{\epsilon nM}\Big)^{4/5} + \sqrt{\frac{N-M}{N-1}}\mathbb{1}_{\{N>1\}}\,\upsilon L^{1/5}D^{4/5}\beta^{1/5}\Big(\frac{\sqrt{d}\ln(d/\delta)}{\epsilon nM^{3.5}}\Big)^{1/5}\Bigg). \quad (39)$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex for all $x \in \mathcal{X}$ and $\kappa = \beta/\mu$, then running the Multi-Stage Implementation of Algorithm 6 (recall Appendix E.1) directly on $f$ yields
$$\mathbb{E}F(w^{ag}_R) - F^* = \tilde{O}\Bigg(\frac{\phi^2}{\mu nM} + \frac{L^2}{\mu}\frac{\sqrt{\kappa}\,d\ln(1/\delta)}{\epsilon^2n^2M^2} + \frac{\upsilon^2}{\mu}\frac{\sqrt{\kappa}}{M}\Big(1 - \frac{M-1}{N-1}\Big)^{\mathbb{1}_{\{N>1\}}}\Bigg). \quad (40)$$

Remark G.1. For convex $f$, if $M = N$ or
$$\upsilon \lesssim \sqrt{\frac{N-1}{N-M}}\Bigg[\frac{(L^3D^2\beta^2)^{1/5}(d\ln(d/\delta))^{3/10}}{M^{1/10}n^{3/5}\epsilon^{3/5}} + \phi\Big(\frac{\beta D}{L}\Big)^{1/5}\Big(\frac{\epsilon^2M^2}{n^3d\ln^2(d/\delta)}\Big)^{1/10}\Bigg],$$
then the last term in part 1 is dominated by the first two.

Proof. 1. By Lemma E.3, the conditional variance of the noisy stochastic minibatch gradients given $M_r$ is $V^2 = \frac{\phi^2}{MK} + \frac{\upsilon^2}{M}\mathbb{1}_{\{N>1\}}\big(\frac{N-M}{N-1}\big) + \mathrm{Var}(\tilde{g}_r)$. Also, conditional on $M_r$, we have $\mathrm{Var}(\tilde{g}_r) \lesssim \frac{dL^2\ln^2(d/\delta)}{\epsilon^2M_r^2K^2}$ by Lemma G.1 and independence of the data.
Hence, taking total expectation over $M_r$, we get
$$V^2 = \frac{\phi^2}{MK} + \frac{\upsilon^2}{M}\mathbb{1}_{\{N>1\}}\Big(\frac{N-M}{N-1}\Big) + \frac{dL^2\ln^2(d/\delta)}{\epsilon^2M^2K^2}.$$
Now plugging $V^2$ into Lemma E.1, setting $R = n/K$, and choosing $\lambda := \frac{V}{2D\sqrt{R}}$ yields
$$\mathbb{E}F(w^{ag}_R) - F^* \lesssim \frac{\beta D^2K^2}{n^2} + \frac{\phi D}{\sqrt{nM}} + \frac{LD\sqrt{d}\ln(d/\delta)}{\epsilon M\sqrt{K}n} + \frac{\sqrt{K}\upsilon D}{\sqrt{nM}}\sqrt{\frac{N-M}{N-1}}\mathbb{1}_{\{N>1\}}.$$
Choosing $K = \big(\frac{L}{\beta D}\big)^{2/5}\frac{n^{3/5}(d\ln(1/\delta_0))^{1/5}}{\epsilon_0^{2/5}M^{2/5}}$ implies (39).

2. For the strongly convex loss, we plug the same estimate for $V^2$ into Lemma E.2 with $R = n/K$ to obtain
$$\mathbb{E}F(w^{ag}_R) - F^* \lesssim \Delta\exp\Big(\frac{-n}{K\sqrt{\kappa}}\Big) + \frac{\phi^2}{\mu nM} + \frac{\upsilon^2K}{\mu nM}\Big(1 - \frac{M-1}{N-1}\Big)^{\mathbb{1}_{\{N>1\}}} + \frac{L^2}{\mu}\frac{d\ln^2(d/\delta)}{Kn\epsilon^2M^2},$$
where $\Delta \ge F(w_0) - F^*$. Choosing $K = \frac{n}{\sqrt{\kappa}\ln\big(\mu\Delta\min\big\{\frac{\epsilon^2n^2M^2}{L^2d\ln(1/\delta_0)},\, nM\big\}\big)}$ yields (40).
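To make the binomial-noise mechanism concrete, here is a minimal Python sketch of the scalar protocol $\mathcal{P}_{1D}$ (Algorithm 5): each client quantizes its value to $g$ levels with unbiased Bernoulli rounding, adds $\mathrm{Bin}(b, p)$ noise, and reports $g + b$ bits; the analyzer sums all bits and subtracts the expected noise $pbN$. Function names and test values are ours; the choices of $g, b, p$ achieving a target $(\epsilon, \delta)$ are those of (Cheu et al., 2021) and are not reproduced here.

```python
import numpy as np

def randomizer_1d(x, L, g, b, p, rng):
    """R_1D: encode a scalar x in [0, L] as a (g + b)-bit report."""
    xbar = int(np.floor(x * g / L))            # quantize to g levels
    eta1 = rng.binomial(1, x * g / L - xbar)   # unbiased Bernoulli rounding
    eta2 = rng.binomial(b, p)                  # binomial privacy noise
    ones = xbar + eta1 + eta2
    y = np.zeros(g + b, dtype=int)
    y[:ones] = 1                               # bit order is destroyed by the shuffler anyway
    return y

def analyzer_1d(reports, L, g, b, p):
    """A_1D: sum all received bits and debias by the expected noise p*b per report."""
    total = sum(int(y.sum()) for y in reports)
    return (L / g) * (total - p * b * len(reports))
```

With $p = 0$ (no privacy noise), the protocol reduces to unbiased randomized rounding, so sums of values that are exact multiples of $L/g$ are recovered exactly.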

H ISRL-DP UPPER BOUNDS WITH UNBALANCED DATA SET SIZES AND DIFFERING PRIVACY NEEDS ACROSS SILOS

In order to state the generalized versions of our upper bounds (for arbitrary $n_i, \epsilon_i, \delta_i$, $i \in [N]$), we require some additional notation and assumptions.

H.1 ADDITIONAL NOTATION AND ASSUMPTIONS

First, we define a generalization of $(\epsilon_0, \delta_0)$-ISRL-DP (as it was formally defined in Appendix B) that allows for differing privacy parameters across silos:

Definition 10 (Generalized Inter-Silo Record-Level Differential Privacy). Let $\rho_i(X_i, X'_i) := \sum_{j=1}^{n_i}\mathbb{1}_{\{x_{i,j}\ne x'_{i,j}\}}$, $i \in [N]$. A randomized algorithm $\mathcal{A}$ is $\{(\epsilon_i, \delta_i)\}_{i=1}^N$-ISRL-DP if for all silos $i$ and all $\rho_i$-adjacent $X_i, X'_i$,
$$(\mathcal{R}^{(i)}_1(X_i), \mathcal{R}^{(i)}_2(Z_1, X_i), \cdots, \mathcal{R}^{(i)}_R(Z_{1:R-1}, X_i)) \simeq_{(\epsilon_i, \delta_i)} (\mathcal{R}^{(i)}_1(X'_i), \mathcal{R}^{(i)}_2(Z'_1, X'_i), \cdots, \mathcal{R}^{(i)}_R(Z'_{1:R-1}, X'_i)),$$
where $Z_r := \{\mathcal{R}^{(i)}_r(Z_{1:r-1}, X_i)\}_{i=1}^N$ and $Z'_r := \{\mathcal{R}^{(i)}_r(Z'_{1:r-1}, X'_i)\}_{i=1}^N$.

We also allow the weights put on each silo in the FL objective to differ and consider
$$\min_{w\in\mathcal{W}}\Big\{F(w) := \sum_{i=1}^N p_iF_i(w)\Big\},$$
where $p_i \in [0, 1]$ and $\sum_{i=1}^N p_i = 1$. However, we present our results for the case $p_i = 1/N$ for all $i \in [N]$. This is without loss of generality: given any $\tilde{F}(w) = \sum_{i=1}^N p_iF_i(w)$, we have $\tilde{F}(w) = \frac{1}{N}\sum_{i=1}^N\tilde{F}_i(w)$, where $\tilde{F}_i(w) := Np_iF_i(w) = \mathbb{E}_{x_i\sim\mathcal{D}_i}[Np_if(w, x_i)] =: \mathbb{E}_{x_i\sim\mathcal{D}_i}[\tilde{f}_i(w, x_i)]$. Thus, our results for the case $p_i = 1/N$ apply for general $p_i$, but $L$ gets replaced by $\tilde{L} = \max_{i\in[N]}p_iNL$, $\mu$ by $\tilde{\mu} = \min_{i\in[N]}p_iN\mu$, and $\beta$ by $\tilde{\beta} = \max_{i\in[N]}p_iN\beta$.

We choose batch sizes $K_i$ such that $K_i/n_i = K_l/n_l$ for all $i, l \in [N]$ in each round, and denote $K = \min_{i\in[N]}K_i$. In addition to the assumptions made in the main body, we refine the assumption on $\{M_r\}$ to include a description of the second moment of $1/M_r$:

Assumption 2. In each round $r$, a uniformly random subset $S_r$ of $M_r \in [N]$ distinct silos is available to communicate with the server, where $\{M_r\}_{r\ge0}$ are independent random variables with $\frac{1}{M} := \mathbb{E}\big(\frac{1}{M_r}\big)$ and $\frac{1}{M'} := \sqrt{\mathbb{E}\big(\frac{1}{M_r^2}\big)}$.

For $M \in [N]$, denote $\bar{\sigma}^2_M := \frac{1}{M}\sum_{i=1}^M\sigma^2_{(i)}$, where $\sigma^2_{(1)} := \sigma^2_{\max} := \max_{i\in[N]}\sigma^2_i \ge \sigma^2_{(2)} \ge \cdots \ge \sigma^2_{(N)} := \sigma^2_{\min} := \min_{i\in[N]}\sigma^2_i$. (More generally, whenever a bar and an $M$ subscript are appended to a parameter, it denotes the average of the $M$ largest values.) Also, define $\Sigma^2 := \sqrt{\mathbb{E}(\bar{\sigma}^2_{M_1})^2}$ for any $\{\sigma^2_i\}_{i=1}^N \subseteq [0, \infty)$. Next, recall the heterogeneity parameter from (29):
$$\upsilon^2 := \sup_{w\in\mathcal{W}}\frac{1}{N}\sum_{i=1}^N\|\nabla F_i(w) - \nabla F(w)\|^2.$$
Lastly, for given parameters, denote $\xi_i := \big(\frac{1}{n_i\epsilon_i}\big)^2\ln(2.5R/\delta_i)\ln(2/\delta_i)$ for $i \in [N]$, $\xi_{\max} := \max(\xi_1, \cdots, \xi_N)$, and
$$\Xi := \sqrt{\mathbb{E}_{M_1}\Big(\frac{1}{M_1}\sum_{i=1}^{M_1}\xi_{(i)}\Big)^2}.$$
In the case of balanced data and identical parameters across silos, we have $\xi_i = \xi = \Xi$ for all $i$. In the general case, we have $\xi_{\min} \le \Xi \le \xi_{\max}$.
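To make the new notation concrete, the following sketch (with hypothetical silo parameters of our own choosing) computes $\xi_i$ and the "average of the $M$ largest values" operation; the quantity $\Xi$ additionally averages over the randomness of $M_r$, which is not modeled here.

```python
import math

def xi(n_i, eps_i, delta_i, R):
    """Per-silo parameter xi_i from Appendix H.1."""
    return (1.0 / (n_i * eps_i)) ** 2 * math.log(2.5 * R / delta_i) * math.log(2.0 / delta_i)

def bar_M(values, M):
    """Average of the M largest values (the bar-plus-M-subscript notation)."""
    return sum(sorted(values, reverse=True)[:M]) / M

# hypothetical unbalanced silos: (n_i, eps_i, delta_i)
R = 100
params = [(1000, 1.0, 1e-6), (500, 0.5, 1e-6), (2000, 2.0, 1e-6)]
xis = [xi(n, e, d, R) for (n, e, d) in params]
xi_max, xi_min = max(xis), min(xis)
assert xi_min <= bar_M(xis, 2) <= xi_max   # averages of largest values interpolate
```

As the assertion illustrates, any such average lies between $\xi_{\min}$ and $\xi_{\max}$, mirroring the bound $\xi_{\min} \le \Xi \le \xi_{\max}$.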

H.2 PSEUDOCODE FOR NOISY ISRL-DP MB-SGD IN THE UNBALANCED CASE

The generalized version of Noisy ISRL-DP MB-SGD is described in Algorithm 7.

Algorithm 7 Noisy ISRL-DP MB-SGD
1: Input: $N, d, R \in \mathbb{N}$; $\{\sigma_i\}_{i\in[N]} \subset [0, \infty)$; $X_i \in \mathcal{X}_i^{n_i}$ for $i \in [N]$; loss function $f(w, x)$; $\{K_i\}_{i=1}^N \subset \mathbb{N}$; $\{\eta_r\}_{r\in[R]}$ and $\{\gamma_r\}_{r\in[R]}$.
2: Initialize $w_0 \in \mathcal{W}$.
3: for $r \in \{0, 1, \cdots, R-1\}$ do
4:   for $i \in S_r$ in parallel do
5:     Server sends global model $w_r$ to silo $i$.
6:     Silo $i$ draws $K_i$ samples $x^r_{i,j}$ uniformly from $X_i$ (for $j \in [K_i]$) and noise $u_i \sim \mathcal{N}(0, \sigma^2_iI_d)$.
7:     Silo $i$ computes $\tilde{g}^i_r := \frac{1}{K_i}\sum_{j=1}^{K_i}\nabla f(w_r, x^r_{i,j}) + u_i$ and sends it to the server.
8:   end for
9:   Server aggregates $\tilde{g}_r := \frac{1}{M_r}\sum_{i\in S_r}\tilde{g}^i_r$.
10:  Server updates $w_{r+1} := \Pi_{\mathcal{W}}[w_r - \eta_r\tilde{g}_r]$.
11: end for
12: Output: $\hat{w}_R = \frac{1}{\Gamma_R}\sum_{r=0}^{R-1}\gamma_rw_r$, where $\Gamma_R := \sum_{r=0}^{R-1}\gamma_r$.
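For concreteness, one round of Algorithm 7 can be sketched in a few lines of Python. This is a simplified sketch under assumptions of our own: all silos participate (i.e. $M_r = N$), `grad_f` and the projection onto $\mathcal{W}$ are supplied by the caller, and the weighted iterate averaging of line 12 is omitted.

```python
import numpy as np

def silo_gradient(w, X_i, K_i, sigma_i, grad_f, rng):
    """Lines 6-7: one silo's noisy minibatch gradient."""
    idx = rng.choice(len(X_i), size=K_i, replace=False)
    g = np.mean([grad_f(w, X_i[j]) for j in idx], axis=0)
    return g + rng.normal(0.0, sigma_i, size=g.shape)   # Gaussian ISRL-DP noise

def server_round(w, silos, Ks, sigmas, grad_f, eta, project, rng):
    """Lines 9-10: average the silos' noisy gradients and take a projected step."""
    g_bar = np.mean([silo_gradient(w, X, K, s, grad_f, rng)
                     for X, K, s in zip(silos, Ks, sigmas)], axis=0)
    return project(w - eta * g_bar)
```

With the noise turned off and full local batches, one round with step size 1 on a quadratic loss lands exactly on the global empirical minimizer, which is a convenient sanity check.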

H.3 GENERAL UNBALANCED VERSION OF THEOREM 2.1

We first state the formal version of Theorem 2.1 for arbitrary $n_i, \epsilon_i, \delta_i$, using the notation defined in Appendix H.1.

Theorem H.1 (Generalized Version of Theorem 2.1). Let $\epsilon_i \le 2\ln(2/\delta_i)$ and $\delta_i \in (0, 1)$. Then Algorithm 2 is $\{(\epsilon_i, \delta_i)\}_{i=1}^N$-ISRL-DP if $\sigma^2_i = \frac{256L^2R\ln(2.5R/\delta_i)\ln(2/\delta_i)}{n_i^2\epsilon_i^2}$ and $K_i \ge \frac{\epsilon_in_i}{4\sqrt{2R\ln(2/\delta_i)}}$. Moreover, with notation as in Appendix H.1, there are choices of algorithmic parameters such that:

1. If $f(\cdot, x)$ is convex, then
$$\mathbb{E}F(\hat{w}_R) - F^* \lesssim LD\Bigg(\frac{1}{\sqrt{n_{\min}M}} + \sqrt{d\min\Big\{\frac{\Xi}{M'}, \frac{\xi_{\max}}{M}\Big\}}\Bigg).$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then
$$\mathbb{E}F(\hat{w}_R) - F^* = \tilde{O}\Bigg(\frac{L^2}{\mu}\Big(\frac{1}{Mn_{\min}} + d\min\Big\{\frac{\Xi}{M'}, \frac{\xi_{\max}}{M}\Big\}\Big)\Bigg).$$

Remark H.1. Note that $1/M' \ge 1/M$ by the Cauchy–Schwarz inequality. Both upper bounds in Theorem H.1 involve minima of the terms $\Xi/M'$ and $\xi_{\max}/M$, which trade off the unbalancedness of silo data and privacy needs against the variance of $1/M_r$. In particular, if the variance of $1/M_r$ is small enough that $\Xi/M' \le \xi_{\max}/M$, then the excess risk bounds in Theorem H.1 depend on averages of the parameters across silos, rather than on maxima. In FL problems with unbalanced data and disparate privacy needs across a large number of silos, the difference between "average" and "max" can be substantial. On the other hand, if data is balanced and privacy needs are identical across silos, then $\xi_i = \xi_{\max} = \Xi = \ln(2.5R/\delta_0)\ln(2/\delta_0)/(n^2\epsilon_0^2)$ for all $i$ and $\Xi/M' \ge \xi_{\max}/M$, so we recover Theorem 2.1, with dependence only on the mean $1/M$ of $1/M_r$ and not on the square root of the second moment, $1/M'$.

To prove Theorem H.1, we need the following empirical loss bound for Algorithm 2, which generalizes Lemma D.6 to the unbalanced setting:

Lemma H.1. Let $f: \mathcal{W}\times\mathcal{X}\to\mathbb{R}$ be $\mu$-strongly convex (with $\mu = 0$ in the convex case), $L$-Lipschitz, and $\beta$-smooth in $w$ for all $x \in \mathcal{X}$, where $\mathcal{W} \subset \mathbb{R}^d$ is a closed convex set such that $\|w - w'\| \le D$ for all $w, w' \in \mathcal{W}$. Let $X \in \mathcal{X}$. Then Algorithm 2 with $\sigma^2_i = \frac{256L^2R\ln(2.5R/\delta_i)\ln(2/\delta_i)}{n_i^2\epsilon_i^2}$ attains the following empirical loss bounds as a function of the step size and the number of rounds:

1. (Convex) For any $\eta \le 1/\beta$ and $R \in \mathbb{N}$, with $\gamma_r := 1/R$, we have
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* \le \frac{D^2}{\eta R} + \frac{\eta}{2}\Big(d\min\Big\{\frac{\Sigma^2}{M'}, \frac{\sigma^2_{\max}}{M}\Big\} + L^2\Big).$$

2. (Strongly convex) There exists a constant step size $\eta_r = \eta \le 1/\beta$ such that if $R \ge 2\kappa\ln\big(\frac{\beta\mu D^2}{L^2d\min(\Xi/M', \xi_{\max}/M)}\big)$, then
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* = \tilde{O}\Bigg(\frac{L^2}{\mu}\Big(\frac{1}{R} + d\min\Big\{\frac{\Xi}{M'}, \frac{\xi_{\max}}{M}\Big\}\Big)\Bigg).$$

Proof. By the proof of Lemma D.6, we have, for all $r \ge 0$,
$$\mathbb{E}\|w_{r+1} - w^*\|^2 \le (1-\mu\eta_r)\mathbb{E}\|w_r - w^*\|^2 - 2\eta_r\mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^*] + \eta_r^2\Big(d\min\Big\{\frac{\Sigma^2}{M'}, \frac{\sigma^2_{\max}}{M}\Big\} + L^2\Big), \quad (43)$$
using linearity of expectation for the first term in the minimum and the Cauchy–Schwarz inequality for the second term in the minimum. Now we consider the convex ($\mu = 0$) and strongly convex ($\mu > 0$) cases separately.

Convex ($\mu = 0$) case: Re-arranging (43), we get
$$\mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^*] \le \frac{1}{2\eta_r}\big(\mathbb{E}[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2]\big) + \frac{\eta_r}{2}\Big(d\min\Big\{\frac{\Sigma^2}{M'}, \frac{\sigma^2_{\max}}{M}\Big\} + L^2\Big).$$
Then for $\eta_r = \eta$, the average iterate $\bar{w}_R$ satisfies
$$\mathbb{E}\hat{F}_X(\bar{w}_R) - \hat{F}_X^* \le \frac{1}{R}\sum_{r=0}^{R-1}\mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^*] \le \frac{\|w_0 - w^*\|^2}{\eta R} + \frac{\eta}{2}\Big(d\min\Big\{\frac{\Sigma^2}{M'}, \frac{\sigma^2_{\max}}{M}\Big\} + L^2\Big),$$
which proves part 1 of the lemma.

Strongly convex ($\mu > 0$) case: Note that (43) satisfies the conditions of Lemma D.5 with the sequences $r_t = \mathbb{E}\|w_t - w^*\|^2$ and $s_t = \mathbb{E}[\hat{F}_X(w_t) - \hat{F}_X^*]$. Then Lemma D.5 and Jensen's inequality imply
$$\mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* = \tilde{O}\Bigg(\beta D^2\exp\Big(\frac{-R}{2\kappa}\Big) + \frac{L^2}{\mu}\Big(\frac{1}{R} + d\min\Big\{\frac{\Xi}{M'}, \frac{\xi_{\max}}{M}\Big\}\Big)\Bigg),$$
where $\kappa = \beta/\mu$. Finally, plugging in $R$ completes the proof.

We are now prepared to prove Theorem H.1.

Proof of Theorem H.1. Privacy: The proof follows exactly as in the balanced case, since $\sigma^2_i$ is now calibrated to $(\epsilon_i, \delta_i)$ for each $i \in [N]$.

Excess loss: We prove the results for the case where $f(\cdot, x)$ is $\beta$-smooth; the non-smooth case follows by Nesterov smoothing, as in the proof of Theorem 2.1.

1. Convex case: By Lemma D.2 (and its proof), Lemma D.1, and Lemma H.1, we have $\mathbb{E}F(\hat{w}_R) - F^* \le \alpha + \mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^*$; bounding the right-hand side via Lemma H.1 and choosing $R = \min\big(n_{\min}M, \frac{1}{d}\max\big\{\frac{M'}{\Xi}, \frac{M}{\xi_{\max}}\big\}\big)$ completes the proof of the convex case.

2. $\mu$-strongly convex case: By Lemma D.2 (and its proof), Lemma D.1, and Lemma H.1, we have
$$\mathbb{E}F(\hat{w}_R) - F^* \le \alpha + \mathbb{E}\hat{F}_X(\hat{w}_R) - \hat{F}_X^* \le \frac{4L^2}{\mu(Mn_{\min}-1)} + \tilde{O}\Bigg(\frac{L^2}{\mu}\Big(\frac{1}{R} + d\min\Big\{\frac{\Xi}{M'}, \frac{\xi_{\max}}{M}\Big\}\Big)\Bigg)$$
for the $\eta \le 1/\beta$ prescribed in the proof of Lemma H.1 and any $R \ge 2\kappa\ln\big(\frac{\beta\mu D^2}{L^2d\min(\Xi/M', \xi_{\max}/M)}\big)$. Thus, choosing $R = 2\kappa\ln\big(\frac{\beta\mu D^2}{L^2d\min(\Xi/M', \xi_{\max}/M)}\big) + Mn_{\min}$ completes the proof.

Remark H.2. Generalized versions of the other upper bounds in this paper can also be derived with the techniques used above. The key takeaways are: a) the excess empirical risk (and the private term in the SCO bounds) involves a minimum of two terms that trade off the degree of unbalancedness against the variance of $1/M_r$; in particular, if the variance of $1/M_r$ is sufficiently small (e.g. if $M_r \equiv M$, as most existing works on FL assume), then the refined excess risk bounds depend on averages of the parameters across silos, rather than on worst-case maxima; and b) the generalization error scales with $\min_{i\in[N]}n_i$.

I NUMERICAL EXPERIMENTS: DETAILS AND ADDITIONAL RESULTS

In some plots in this section, we include a parameter describing the heterogeneity of the FL problem:
$$\upsilon^2_* := \frac{1}{N}\sum_{i=1}^N\|\nabla F_i(w^*)\|^2,$$
which has appeared in (Khaled et al., 2019; Koloskova et al., 2020; Karimireddy et al., 2020; Woodworth et al., 2020b). If the data is homogeneous, then all $F_i$ share the same minimizers, so $\upsilon^2_* = 0$, but the converse is false.

ISRL-DP Local SGD runs as follows: in round $r$, each silo $i \in S_r$ receives the global model $w_r$ and takes $K$ steps of noisy SGD (with one sample per step) on its local data: $w^{i,0}_r = w_r$ and $w^{i,t}_r = w^{i,t-1}_r - \eta(\nabla f(w^{i,t-1}_r, x^r_{i,t}) + u^t_i)$ for $t \in [K]$, where $x^r_{i,t}$ is drawn uniformly at random from $X_i$ and $u^t_i \sim \mathcal{N}(0, \sigma^2I_d)$ with $\sigma^2 = \frac{8L^2RK\log(1/\delta)}{\epsilon_0^2n^2}$. Then silo $i$ sends its $K$-th iterate $w^{i,K}_r$ to the server; the server averages the iterates across all participating silos and updates the global model to $w_{r+1} = \frac{1}{M_r}\sum_{i\in S_r}w^{i,K}_r$.
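The baseline just described can be sketched in a few lines of Python. This is a minimal sketch under assumptions of our own: all silos participate, and `grad_f` is supplied by the caller.

```python
import numpy as np

def local_sgd_round(w_r, silos, K, eta, sigma2, grad_f, rng):
    """One round of ISRL-DP Local SGD: K noisy one-sample steps per silo, then average."""
    finals = []
    for X_i in silos:
        w = w_r.copy()
        for _ in range(K):
            x = X_i[rng.integers(len(X_i))]               # one sample per step
            noise = rng.normal(0.0, np.sqrt(sigma2), size=w.shape)
            w = w - eta * (grad_f(w, x) + noise)
        finals.append(w)                                  # silo sends its K-th iterate
    return np.mean(finals, axis=0)                        # server averages the iterates
```

Note the contrast with MB-SGD: here the server averages *models* after $K$ local steps, rather than averaging a single noisy gradient per round.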

I.1 LOGISTIC REGRESSION WITH MNIST

The data set can be downloaded from http://yann.lecun.com/exdb/mnist; our code does this automatically.

Experimental setup: To divide the data into $N = 25$ silos and for preprocessing, we borrow code from (Woodworth et al., 2020b). For ISRL-DP Local SGD, we started with the same stepsize grid, but sometimes required additional tuning with smaller stepsizes (especially for small $\epsilon$) to find the optimal one.

Choice of $\sigma^2$ and $K$: We used smaller noise (compared to the theoretical portion of the paper) to get better utility, at the cost of a larger $K$ (larger computational cost), which is needed for privacy: $\sigma^2 = \frac{8L^2\ln(1/\delta)R}{n^2\epsilon^2}$ for ISRL-DP MB-SGD, which provides ISRL-DP by Theorem 1 of (Abadi et al., 2016) if $K = \frac{n\sqrt{\epsilon}}{2\sqrt{R}}$ (c.f. Theorem 3.1 in (Bassily et al., 2019)). For ISRL-DP Local SGD, one needs $\sigma^2 = \frac{8L^2\ln(1/\delta)RK}{n^2\epsilon^2}$, since the sensitivity and the number of gradient evaluations are both larger by a factor of $K$. Here $L = 2\max_{x\in\mathcal{X}}\|x\| = 20$ is an upper bound on the Lipschitz parameter of the softmax loss and was estimated directly from the pre-processed training data.

J LIMITATIONS AND SOCIETAL IMPACTS

J.1 LIMITATIONS

Our results rely on certain assumptions (e.g. convex, Lipschitz loss) that may be violated in some practical applications. Moreover, our theoretical results require an a priori bound on the Lipschitz parameter (for noise calibration). While such a bound is fairly easy to obtain for models such as logistic regression with data that is (known to be) bounded (e.g. our MNIST experiments), it is unrealistic for models such as unconstrained linear regression with potentially unbounded data (e.g. our medical cost data experiments). In practice, gradient clipping can be incorporated into our algorithms in such situations; we have shown empirically (in the medical cost data experiments) that MB-SGD still performs well with clipping. However, we did not obtain our theoretical results with gradient clipping. An interesting direction for future work would be to determine optimal rates for ISRL-DP FL without Lipschitzness and/or without convexity. Further, as we explained in Appendix I, pre-processing and hyperparameter tuning (and estimation of $L$) were not done in an ISRL-DP manner in our numerical experiments, since we did not want to detract focus from the evaluation of ISRL-DP FL algorithms. As a consequence, the overall privacy loss for the entire experimental process is higher than the $\epsilon$ indicated in the plots, which reflects only the privacy loss from running the FL algorithms with fixed hyperparameters and (pre-processed) data.

J.2 SOCIETAL IMPACTS

Our work provides algorithms for protecting the privacy of users during federated learning. Privacy is usually thought of favorably: for example, the right to privacy is an element of various legal systems. However, it is conceivable that our algorithms could be misused by corporations and governments to justify malicious practices, such as collecting personal data without users' permission. Also, the (necessarily) lower accuracy of privately trained models could have negative consequences: e.g. if an ISRL-DP model is used to predict the effects of climate change and the model gives less accurate and more optimistic results, then a government might use this as justification to improperly eliminate environmental protections. Nevertheless, we believe the dissemination of privacy-preserving machine learning algorithms, and greater knowledge about these algorithms, provides a net benefit to society.



We abbreviate central differential privacy by CDP. This is different from the concentrated differential privacy notion in Bun & Steinke (2016), for which the same abbreviation is sometimes used in other works.
We assume that the reports can be decrypted by the server, but not by the shuffler (Erlingsson et al., 2020a; Feldman et al., 2020b).
In Appendix H, we consider the more general setting where data set sizes $n_i$ and ISRL-DP parameters $(\epsilon_i, \delta_i)$ may vary across silos, and the weights $p_i$ on each silo's loss $F_i$ in (2) may differ (i.e. $p_i \ne 1/N$).
A function $g: \mathcal{W}\to\mathbb{R}$ is $\mu$-strongly convex ($\mu \ge 0$) if $g(w) \ge g(w') + \langle\partial g(w'), w - w'\rangle + \frac{\mu}{2}\|w - w'\|^2$ for all $w, w' \in \mathcal{W}$ and all subgradients $\partial g(w')$. If $\mu = 0$, $g$ is convex.
ERM is a special case of the FL problem (2): if $\mathcal{D}_i$ is the empirical distribution on $X_i$, then $F = \hat{F}_X$.
In (Bassily et al., 2019), the number of iterations is denoted by $T$, rather than $R$.
Sequentially interactive algorithms can query silos adaptively in sequence, but cannot query any one silo more than once; fully interactive algorithms can query each silo any number of times. See (Joseph et al., 2019) for further discussion.
Up to logarithms, a factor of $\phi/L$, and, in the strongly convex case, a factor of $\mu D/L$.
If $d > \epsilon_0^2n^2N$, then the ISRL-DP algorithm that outputs any $w_0 \in \mathcal{W}$ attains the matching upper bound $O(LD)$.
Specifically, Lemma D.1 in Appendix D does not apply without the i.i.d. assumption.
We also describe the ISRL-DP Local SGD baseline in Appendix I.
Note that in order for their result to be correct, by (Bassily et al., 2014), when $N = M = 1$ their bound must scale at least as $d^2/(\epsilon^2n^2)$, unless their bound is trivial ($\ge LD$).
Technically, this assumption on $N$ is needed to ensure that the condition on $\epsilon^r_0$ in Theorem D.5 is satisfied. A similar restriction appears in Theorem 3.8 of (Feldman et al., 2020b).
This follows from the assumption, given in the lead-up to Definition 5, that $\mathcal{R}^{(i)}(Z_{1:r-1}, X)$ is conditionally independent of $X'$ given $Z_{1:r-1}$ for all $Z_{1:r-1}$ and $X \ne X'$.
It is easy to see that the same arguments we use in this subsection for i.i.d. SCO can also be used to establish tight federated ERM bounds.
See (Abadi et al., 2016; Liu & Talwar, 2019; Papernot & Steinke, 2021) and the references therein for discussion of DP PCA and DP hyperparameter tuning.



Figure 1: ISRL-DP protects the privacy of each patient's record regardless of whether the server/other silos are trustworthy, as long as the patient's own hospital is trusted. By contrast, user-level DP protects aggregate data of patients in hospital i and does not protect against adversarial server/other silos.

1. By the post-processing property of DP, this also ensures that the broadcasts by the server and the global model are DP. Privacy notions similar or identical to ISRL-DP have been considered in Truex et al. (2020); Huang et al. (2020); Huang & Gong (2020); Wu et al. (2019); Wei et al. (2020b); Dobbe et al. (2020); Zhao et al. (2020); Arachchige et al. (2019); Seif et al. (2020); Liu et al. (2022). We provide a rigorous definition of ISRL-DP in Definition 2 and Appendix B.

Figure 2: Trust assumptions of DP FL notions. We put "trust" in quotes because the shuffler is assumed to be secure, and silo messages must already satisfy (at least a weak level of) ISRL-DP in order to realize SDP: anonymization alone cannot "create" DP (Dwork & Roth, 2014).

We divide the data ($d = 7$, $nN = 1338$) into $N$ silos based on the level of the target: patient medical costs. Fig. 5 shows that ISRL-DP MB-SGD outperforms ISRL-DP Local SGD, especially in the high-privacy regime $\epsilon \le 2$. For $\epsilon \ge 2$, ISRL-DP MB-SGD is in line with (non-private) Local SGD.

Multiclass Logistic Regression with Obesity Data: We train a softmax regression model for a 7-way classification task with an obesity data set (Palechor & de la Hoz Manotas, 2019). We divide the data into $N = 7$ heterogeneous silos based on the value of the target variable, obesity level, which takes 7 values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III. As shown in Fig. 6, our algorithm significantly outperforms ISRL-DP Local SGD, by 10-30% across all privacy levels.

Figure 4: Binary logistic regression on MNIST. $\delta = 1/n^2$. 90% error bars are shown.

Figure 6: Softmax regression on obesity data. $\delta = 1/n^2$.

Figure 9: Test error vs. $\epsilon$ for linear regression on heterogeneous health insurance data. $\delta = 1/n^2$. 90% error bars are shown.

This recovers the bound (11) in Theorem 3.1, with $N$ replaced by $M$. Also, note that appealing to privacy amplification by subsampling would yield tighter excess risk bounds than those stated in Theorem G.2 when $M < N$, but would require the restriction $\epsilon \lesssim M/N$. To avoid this restriction, we do not invoke privacy amplification by subsampling in our analysis.

The data can be downloaded from: https://papers.nips.cc/paper/2020/hash/45713f6ff2041d3fdfae927b82488db8-Abstract.html. It is available under a Creative Commons Attribution-Share Alike 3.0 license. There are $n_i = n = 8673$ training and 2168 test examples per silo; to expedite training, we use only 1/7

ACKNOWLEDGMENTS

This work was partly supported by a gift from the USC-Meta Center for Research and Education in AI and Learning.

APPENDIX

To ease navigation, we provide a high-level table of contents for this Appendix.

Lemma D.9. Let $X_0 \in \mathcal{X}^n$. If $\mathcal{R}: \mathcal{X}^n\to\mathcal{Z}$ is an $(\epsilon, \delta)$-DP local randomizer, then $\mathcal{R}$ is $(n\epsilon, ne^{(n-1)\epsilon}\delta)$ deletion group DP for groups of size $n$ with reference distribution $\mathcal{R}(X_0)$ (i.e. $\mathcal{R}(X) \simeq_{(\tilde{\epsilon}, \tilde{\delta})} \mathcal{R}(X_0)$ for all $X \in \mathcal{X}^n$, where $\tilde{\epsilon} = n\epsilon$ and $\tilde{\delta} = ne^{(n-1)\epsilon}\delta$).


Proof. By group privacy (see e.g. Theorem 10 in (Kamath, 2020)) and the assumption that $\mathcal{R}$ is $(\epsilon, \delta)$-DP, it follows that $\mathcal{R}(X)$ and $\mathcal{R}(X')$ are $(n\epsilon, ne^{(n-1)\epsilon}\delta)$-indistinguishable for all $X, X' \in \mathcal{X}^n$. In particular, taking $X' := X_0$ completes the proof.

Lemma D.10. Let $\mathcal{R}^{(i)}: \mathcal{X}^n\to\mathcal{Z}$ be randomized algorithms ($i \in [N]$) and let $\mathcal{A}_s: \mathcal{X}^{n\times N}\to\mathcal{Z}^N$ be the shuffled algorithm $\mathcal{A}_s(X) := (\mathcal{R}^{(1)}(X_{\pi(1)}), \cdots, \mathcal{R}^{(N)}(X_{\pi(N)}))$ derived from $\{\mathcal{R}^{(i)}\}_{i\in[N]}$ for $X = (X_1, \cdots, X_N)$, where $\pi$ is a uniformly random permutation of $[N]$. Let $X^0 = (X^0_1, X_2, \cdots, X_N)$ and $X^1 = (X^1_1, X_2, \cdots, X_N)$, $\delta \in (0, 1)$, and $p \in [\frac{16\ln(2/\delta)}{N}, 1]$. Suppose that for all $i \in [N]$ and all $X \in \mathcal{X}^n\setminus\{X^1_1, X^0_1\}$, there exists a distribution $\mathrm{LO}^{(i)}(X)$ such that $\mathcal{R}^{(i)}(X) = \frac{p}{2}\mathcal{R}^{(i)}(X^0_1) + \frac{p}{2}\mathcal{R}^{(i)}(X^1_1) + (1-p)\mathrm{LO}^{(i)}(X)$. Then $\mathcal{A}_s(X^0) \simeq_{(\epsilon, \delta)} \mathcal{A}_s(X^1)$.

Proof. The proof mirrors that of Lemma 3.3 in (Feldman et al., 2020b) closely, replacing their notation with ours. Observe that the DP assumption in Lemma 3.3 of (Feldman et al., 2020b) is not actually needed in the proof.

Lemma D.11. Let $\mathcal{R}: \mathcal{X}^n\to\mathcal{Z}$ be $(\epsilon, \delta)$ deletion group DP for groups of size $n$ with reference distribution $\rho$. Then there exists a randomizer $\mathcal{R}': \mathcal{X}^n\to\mathcal{Z}$ such that: (i) $\mathcal{R}'$ is $(\epsilon, 0)$ deletion group DP for groups of size $n$ with reference distribution $\rho$; and (ii) $TV(\mathcal{R}(X), \mathcal{R}'(X)) \le \delta$. In particular, $\mathcal{R}'$ is $(2\epsilon, 0)$ group DP for groups of size $n$ (by (i)).

Proof. The proof is nearly identical to that of Lemma 3.7 in (Feldman et al., 2020b).

We also need the following stronger version of Lemma 3.7 from (Feldman et al., 2020b):

Lemma D.12. If $\mathcal{R}(X^1_1) \simeq_{(\epsilon_0, \delta_0)} \mathcal{R}(X^0_1)$, then there exists a randomizer $\mathcal{R}'$ such that $\mathcal{R}'(X^1_1) \simeq_{(\epsilon_0, 0)} \mathcal{R}(X^0_1)$ and $TV(\mathcal{R}'(X^1_1), \mathcal{R}(X^1_1)) \le \delta_0$.

Proof. The proof follows the same techniques as Lemma 3.7 in (Feldman et al., 2020b), noting that the weaker hypothesis in Lemma D.12 suffices for all the steps to go through and that the assumption of $n = 1$ in (Feldman et al., 2020b) is not needed in the proof.

Lemma D.13 ((Dwork & Roth, 2014), Lemma 3.17). Given random variables $P, Q, P'$ and $Q'$: if $D_{e^\epsilon}(P', Q') \le \delta$, $TV(P, P') \le \delta'$, and $TV(Q, Q') \le \delta'$, then $D_{e^\epsilon}(P, Q) \le \delta + (e^\epsilon + 1)\delta'$.

Lemma D.14 ((Feldman et al., 2020b), Lemma 2.3). Let $P$ and $Q$ be distributions satisfying $P = (1-q)P_0 + qP_1$ and $Q = (1-q)P_0 + qQ_1$ for some $q \in [0, 1]$. Then for any $\epsilon > 0$, if $\epsilon' = \log(1 + q(e^\epsilon - 1))$, then $D_{e^{\epsilon'}}(P\|Q) \le q\max\{D_{e^\epsilon}(P_1\|P_0), D_{e^\epsilon}(P_1\|Q_1)\}$.

We are now ready to prove Theorem D.5.

Proof of Theorem D.5. Let $X^0, X^1 \in \mathcal{X}^{n\times N}$ be adjacent (in the CDP sense) distributed data sets (i.e. $|X^0\Delta X^1| \le 1$). Assume WLOG that $X^0 = (X^0_1, X_2, \cdots, X_N)$ and $X^1 = (X^1_1, X_2, \cdots, X_N)$, where $X^0_1 = (x_{1,0}, x_{1,2}, \cdots, x_{1,n}) \ne (x_{1,1}, x_{1,2}, \cdots, x_{1,n})$. We can also assume WLOG that $X_j \notin \{X^0_1, X^1_1\}$ for all $j \in \{2, \cdots, N\}$, by re-defining $\mathcal{X}$ and $\mathcal{R}^{(i)}_r$ if necessary.

Fix $i \in [N]$, $r \in [R]$, and $Z = Z_{1:r-1} = Z^{(1:N)}_{(1:r-1)} \in \mathcal{Z}^{(r-1)\times N}$; denote $\mathcal{R}(X) := \mathcal{R}^{(i)}_r(Z, X)$ for $X \in \mathcal{X}^n$, and $\mathcal{A}_s(X) := \mathcal{A}^r_s(Z_{1:r-1}, X)$. Draw $\pi$ uniformly from the set of permutations of $[N]$. Now, since $\mathcal{R}$ is $(\epsilon^r_0, \delta^r_0)$-DP, $\mathcal{R}(X^1_1) \simeq_{(\epsilon^r_0, \delta^r_0)} \mathcal{R}(X^0_1)$, so by Lemma D.12 there exists a local randomizer $\mathcal{R}'$ such that $\mathcal{R}'(X^1_1) \simeq_{(\epsilon^r_0, 0)} \mathcal{R}(X^0_1)$ and $TV(\mathcal{R}'(X^1_1), \mathcal{R}(X^1_1)) \le \delta^r_0$. Hence, by Lemma D.8, there exist distributions $U(X^0_1)$ and $U(X^1_1)$ such that $\mathcal{R}(X^0_1)$ and $\mathcal{R}'(X^1_1)$ can each be written as mixtures of $U(X^0_1)$ and $U(X^1_1)$; call these decompositions (24) and (25).

Denote $\tilde{\epsilon}_0 := n\epsilon^r_0$ and $\tilde{\delta}_0 := ne^{(n-1)\epsilon^r_0}\delta^r_0$. By convexity of the hockey-stick divergence and the hypothesis that $\mathcal{R}$ is $(\epsilon^r_0, \delta^r_0)$-DP (hence $\mathcal{R}(X) \simeq_{(\tilde{\epsilon}_0, \tilde{\delta}_0)} \mathcal{R}(X^0_1), \mathcal{R}(X^1_1)$ for all $X$, by Lemma D.9), we have that $\mathcal{R}$ is $(\tilde{\epsilon}_0, \tilde{\delta}_0)$ deletion group DP for groups of size $n$ with reference distribution $\rho$. Thus, Lemma D.11 implies that there exists a local randomizer $\mathcal{R}''$ such that $\mathcal{R}''(X)$ and $\rho$ are $(\tilde{\epsilon}_0, 0)$-indistinguishable and $TV(\mathcal{R}''(X), \mathcal{R}(X)) \le \tilde{\delta}_0$ for all $X$. Then, by the definition of $(\tilde{\epsilon}_0, 0)$-indistinguishability, for all $X$ there exists a "left-over" distribution $\mathrm{LO}(X)$ such that $\mathcal{R}''(X)$ can be written as a mixture involving $\mathrm{LO}(X)$. Now, define a randomizer $\mathcal{L}$ by $\mathcal{L}(X^0_1) := \mathcal{R}(X^0_1)$, $\mathcal{L}(X^1_1) := \mathcal{R}'(X^1_1)$, and, for all $X \in \mathcal{X}^n\setminus\{X^0_1, X^1_1\}$, $\mathcal{L}(X)$ given by the corresponding mixture. (The equality follows from (24) and (25).)
Note that $TV(\mathcal{R}(X^0_1), \mathcal{L}(X^0_1)) = 0$, $TV(\mathcal{R}(X^1_1), \mathcal{L}(X^1_1)) \le \delta^r_0$, and, for all $X \in \mathcal{X}^n\setminus\{X^0_1, X^1_1\}$, $TV(\mathcal{R}(X), \mathcal{L}(X)) \le 2ne^{(n-1)\epsilon^r_0}\delta^r_0$.

Keeping $r$ fixed (and omitting $r$ scripts everywhere), for any $i \in [N]$ and $Z := Z_{1:r-1} \in \mathcal{Z}^{(r-1)\times N}$, let $\mathcal{L}^{(i)}(Z, \cdot)$, $U^{(i)}(Z, \cdot)$, and $\mathrm{LO}^{(i)}(Z, \cdot)$ denote the randomizers resulting from the process described above. Let $\mathcal{A}_L: \mathcal{X}^{n\times N}\to\mathcal{Z}^N$ be defined exactly as $\mathcal{A}^r_s := \mathcal{A}_s$ (same $\pi$), but with the randomizers $\mathcal{R}^{(i)}$ replaced by $\mathcal{L}^{(i)}$. Since $\mathcal{A}_s$ applies each randomizer $\mathcal{R}^{(i)}$ exactly once and $\mathcal{R}^{(1)}(Z, X_{\pi(1)}), \cdots, \mathcal{R}^{(N)}(Z, X_{\pi(N)})$ are independent conditional on $Z = Z_{1:r-1}$, we have $TV(\mathcal{A}_s(X^0), \mathcal{A}_L(X^0)) \le N(2ne^{(n-1)\epsilon^r_0})\delta^r_0$ and $TV(\mathcal{A}_s(X^1), \mathcal{A}_L(X^1)) \le N(2ne^{(n-1)\epsilon^r_0})\delta^r_0$ (see (Den Hollander, 2012)). Now we claim that $\mathcal{A}_L(X^0)$ and $\mathcal{A}_L(X^1)$ are $(\epsilon_r, \delta_r)$-indistinguishable for any $\delta_r \ge 2e^{-Ne^{-n\epsilon^r_0}/16}$. Observe that this claim implies that $\mathcal{A}_s(X^0)$ and $\mathcal{A}_s(X^1)$ are $(\epsilon_r, \tilde{\delta}_r)$-indistinguishable, by Lemma D.13 (with $P' := \mathcal{A}_L(X^0)$, $Q' := \mathcal{A}_L(X^1)$, $P := \mathcal{A}_s(X^0)$, $Q := \mathcal{A}_s(X^1)$). Therefore, it remains to prove the claim, i.e. to show that $D_{e^{\epsilon_r}}(\mathcal{A}_L(X^0), \mathcal{A}_L(X^1)) \le \delta_r$ for any $\delta_r \ge 2e^{-Ne^{-n\epsilon^r_0}/16}$.

...by conditional independence of $h_i - \nabla F_i$ and $h_{i'} - \nabla F_{i'}$ given $S$. Next, we bound term (b). Denote $y_i := \nabla F_i(w)$ and let $\bar{y}$ denote the corresponding average. Assume WLOG that $\bar{y} = 0$ (otherwise, consider $y'_i = y_i - \bar{y}$, which has mean 0). In what follows, we omit the "conditional on $M_1$" notation (but continue to condition on $M_1$) and denote by $\Omega$ the collection of all $\binom{N}{M_1}$ subsets of $[N]$ of size $M_1$. Finally, we take expectation over the randomness in $M_1$ and use $\mathbb{E}[1/M_1] = 1/M$ to arrive at the lemma. The result also clearly holds when $N = 1$, since term (b) is zero when there is no variance in silo sampling (which is the case when $N = 1$).

Proof of Theorem E.1.
Privacy: By post-processing (Dwork & Roth, 2014, Proposition 2.1), it suffices to show that the $R = n/K$ noisy stochastic gradients computed in line 7 of Algorithm 1 are $(\epsilon_0, \delta_0)$-ISRL-DP. Further, since the batches sampled locally are disjoint (we sample locally without replacement), parallel composition (McSherry, 2009) implies that if each update in line 7 is $(\epsilon_0, \delta_0)$-ISRL-DP, then the full algorithm is $(\epsilon_0, \delta_0)$-ISRL-DP. Now recall that the Gaussian mechanism provides $(\epsilon_0, \delta_0)$-DP if $\sigma^2 \ge \frac{8\Delta^2\ln(1/\delta_0)}{\epsilon_0^2}$, where $\Delta$ is the $L_2$ sensitivity of the non-private gradient update in line 7 of Algorithm 1; this follows from (Bun & Steinke, 2016, Propositions 1.3 and 1.6) and our assumption $\epsilon_0 \le 8\ln(1/\delta_0)$. Therefore, conditional on the private transcript of all other silos, our choice of $\sigma^2$ implies that silo $i$'s transcript is $(\epsilon_0, \delta_0)$-DP for all $i \in [N]$, which means that One-Pass Noisy Accelerated Distributed MB-SGD is $(\epsilon_0, \delta_0)$-ISRL-DP.

Published as a conference paper at ICLR 2023

of the MNIST samples ($n = 1238$ training examples per silo). We fix $\delta_i = \delta = 1/n^2$ and test $\epsilon \in \{0.75, 1.5, 3, 6, 12, 18\}$. The maximum $\upsilon^2_*$ is about 0.17 for this problem (corresponding to each silo having disjoint local data sets/pairs of digits).

Preprocessing: We used PCA to reduce the dimensionality to $d = 50$. We used an 80/20 train/test split for all silos. To improve numerical stability, we clipped the input $\langle w, x\rangle$ (i.e. projected it onto $[-15, 15]$) before feeding it into the logistic loss.

Hyperparameter tuning: For each algorithm and each setting of $\epsilon, R, K, \upsilon^2_*$, we swept through a range of constant stepsizes and ran 3 trials to find the (approximately) optimal stepsize for that particular algorithm and experiment. We then used the corresponding $w_R$ (averaged over the 3 runs) to compute the test error. For (ISRL-DP) MB-SGD, the stepsize grid consisted of 10 evenly spaced points between $e^{-6}$ and 1. For (ISRL-DP) Local SGD, the stepsizes were between $e^{-8}$ and $e^{-1}$.
We repeated this entire process 20 times for fresh train/test splits of the data and reported the average test error in our plots.

Choice of $\sigma^2$ and $K$: We used smaller noise (compared to the theoretical portion of the paper) to get better utility, at the cost of larger $K$/larger computational cost, which is needed for privacy: we chose $\sigma^2$ so that Theorem 1 of (Abadi et al., 2016) provides ISRL-DP, and $K = n\sqrt{\epsilon}/(2\sqrt{R})$ (c.f. Theorem 3.1 in (Bassily et al., 2019)). Here $L = 2\max_{x \in \mathcal{X}} \|x\|$ is an upper bound on the Lipschitz parameter of the logistic loss and was computed directly from the training data.

To estimate $\upsilon_*^2$, we followed the procedure used in (Woodworth et al., 2020b), using Newton's method to compute $w_*$ and then averaging $\|\nabla \hat{F}_i(w_*)\|^2$ over all $i \in [N]$.

Additional experimental result: In Fig. 7, we show an additional experiment with $M = 18$ silos available to communicate in each round. The results are qualitatively similar to those presented in the main body for MNIST with $M = 25, 12$.

Limitations of experiments: It is important to note that pre-processing and hyperparameter tuning (and estimation of $L$) were not done in an ISRL-DP manner, since we did not want to detract focus from evaluation of ISRL-DP FL algorithms. As a consequence, the overall privacy loss for the entire experimental process is higher than the $\epsilon$ indicated in the plots, which solely reflects the privacy loss from running the FL algorithms with fixed hyperparameters and (pre-processed) data. Similar remarks apply to the linear regression experiments (see Appendix I.2).
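The $\upsilon_*^2$ estimation procedure above (compute $w_*$ with Newton's method, then average the squared per-silo gradient norms at $w_*$) can be illustrated on a toy problem. The sketch below is ours, not the paper's code: it uses one-dimensional quadratic silo objectives $F_i(w) = (w - c_i)^2/2$, for which Newton's method on the average loss converges in a single step.

```python
def estimate_upsilon_sq(centers, w0=0.0, newton_steps=1):
    """Estimate the gradient-heterogeneity parameter
    v*^2 = (1/N) * sum_i ||grad F_i(w*)||^2 for toy quadratic silo
    objectives F_i(w) = (w - c_i)^2 / 2, so grad F_i(w) = w - c_i.
    Newton's method on the average loss finds w* exactly in one step."""
    N = len(centers)
    w = w0
    for _ in range(newton_steps):
        grad = sum(w - c for c in centers) / N   # F'(w), average gradient
        hess = 1.0                               # F''(w) = 1 for quadratics
        w -= grad / hess                         # Newton update
    # average squared per-silo gradient norm at the minimizer w*
    return sum((w - c) ** 2 for c in centers) / N

# Identical silos -> no heterogeneity; spread-out silos -> v*^2 > 0.
print(estimate_upsilon_sq([1.0, 1.0, 1.0]))   # 0.0
print(estimate_upsilon_sq([0.0, 2.0]))        # w* = 1, v*^2 = 1.0
```

For the actual experiments the losses are logistic/squared losses in $d$ dimensions, so Newton's method needs the full Hessian and several iterations, but the structure of the estimate is the same.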

I.2 LINEAR REGRESSION WITH HEALTH INSURANCE DATA

Data set: The data (https://www.kaggle.com/mirichoi0218/insurance), which is available under an Open Database license, consists of $\tilde{N} = 1338$ observations. The target variable $y$ is medical charges. There are $d - 1 = 6$ features: age, sex, BMI, number of children, smoker, and geographic region.

Experimental setup: For a given $N$, we grouped the data into $N$ (almost balanced) silos by sorting $y$ in ascending order and then dividing into $N$ groups, the first $N - 1$ of size $\lceil 1338/N \rceil$, with the remaining points in the last silo. For each $N$, we ran experiments with $R = 35$. We ran 20 trials, each with a fresh random train/test (80/20) split. We tested $\epsilon \in \{0.125, 0.25, 0.5, 1, 2, 3\}$ and fixed $\delta_i = 1/n_i^2$ for all experiments. To estimate $\upsilon_*^2$, we followed the procedure used in (Woodworth et al., 2020b), using Newton's method to compute $w_*$ and then averaging $\|\nabla \hat{F}_i(w_*)\|^2$ over all $i \in [N]$.

Preprocessing: We first numerically encoded the categorical variables and then standardized the numerical features age and BMI to have zero mean and unit variance.
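The silo construction described above (sort the targets, give the first $N-1$ silos $\lceil 1338/N \rceil$ records each, and put the remainder in the last silo) can be sketched as follows; this is our own minimal illustration, not the paper's code.

```python
import math

def partition_into_silos(y_values, N):
    """Sort record indices by target value in ascending order, then split
    into N nearly balanced silos: the first N-1 silos get ceil(len(y)/N)
    records each, and the last silo gets whatever remains."""
    order = sorted(range(len(y_values)), key=lambda i: y_values[i])
    size = math.ceil(len(y_values) / N)
    silos = [order[k * size:(k + 1) * size] for k in range(N - 1)]
    silos.append(order[(N - 1) * size:])   # remainder goes to the last silo
    return silos

# 1338 records, N = 4 -> first 3 silos of size 335, last of size 333.
silos = partition_into_silos(list(range(1338)), N=4)
print([len(s) for s in silos])   # [335, 335, 335, 333]
```

Because the split follows the sorted target, silos are deliberately heterogeneous: each silo covers a different range of medical charges.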

Gradient clipping:

In the absence of a reasonable a priori bound on the Lipschitz parameter of the squared loss (as is typical for unconstrained linear regression problems with potentially unbounded data), we incorporated gradient clipping (Abadi et al., 2016) into the algorithms. We then calibrated the noise to the clip threshold $L$ to ensure ISRL-DP. For a fair comparison, we also allowed clipping for the non-private algorithms (when it helped their performance).

Hyperparameter tuning: For each trial and each algorithm, we swept through a log-scale grid of 10 stepsizes and 5 clip thresholds 3 times, selected the parameter $w$ that minimized the average (over 3 repetitions) training error (among all $10 \times 5 = 50$ combinations), and computed the corresponding average test error. The stepsize grids ranged from $e^{-8}$ to $e$ for (ISRL-DP) MB-SGD and from $e^{-10}$ to $1$ for (ISRL-DP) Local SGD. The excess risk (train and test) was computed for the normalized objective function $F(w, X, Y) = \|Y - Xw\|^2 / (2N_0)$, where $N_0 \in \{1070, 268\}$ (1070 for train, 268 for test), $X$ is $N_0 \times d$ with $d = 7$ (including a column of all 1s), and $Y \in \mathbb{R}^{N_0}$. The clip threshold grids were $\{100, 10^4, 10^6, 10^8, 10^{32}\}$, with the last element corresponding to effectively no clipping.

Choice of $\sigma^2$ and $K$: We used the same $\sigma^2$ and $K = n\sqrt{\epsilon}/(2\sqrt{R})$ as in the logistic regression experiments described in Appendix I.1. However, here $L$ is the clip threshold instead of the Lipschitz parameter.

Relative Root Mean Square Error (RMSE): We scale our reported errors (in the plots) to make them more interpretable. We define the Relative (test) RMSE of an algorithm to be $\sqrt{MSE/NMSE} = \sqrt{\sum_{i=1}^{N_{test}} (y_i - \hat{y}_i)^2 / \sum_{i=1}^{N_{test}} (y_i - \bar{y}_{train})^2}$, where $NMSE$ ("Naive Mean Square Error") is the (test) MSE incurred by the non-private naive predictor that predicts $y_i$ to be $\bar{y}_{train}$, the average of the training labels, for all test data $i \in [N_{test}]$. Here $\hat{y}_i$ is the predicted label of the algorithm.
Relative RMSE can be thought of as the coefficient of variation or relative standard deviation of the algorithm's predictions. Note that even though the naive predictor is completely non-private and is allowed to access the entire training data even when $M < N$ (violating the decentralized-data principle of FL), the ISRL-DP FL algorithms still outperform this predictor for most values of $\epsilon$ (except in some experiments with very small $\epsilon$), as evidenced by Relative RMSE values below 1. For $\epsilon \approx 1$, ISRL-DP MB-SGD tends to outperform the naive predictor by 30-40%.

Additional experimental results: In Fig. 8 and Fig. 9, we present results for experiments with additional settings of $N$ and $M$. We observe qualitatively similar behavior as in the plots presented in the main body of the paper. In particular, ISRL-DP MB-SGD continues to outperform ISRL-DP Local SGD in most tested setups/privacy levels (especially in the high-privacy regime). On the other hand, for some settings of $M, N, K$, ISRL-DP Local SGD outperforms ISRL-DP MB-SGD as $\epsilon \to 3$ (e.g. Fig. 9 with $N = M = 3$). In general, the utility impact of ISRL-DP is relatively insignificant for this problem when $\epsilon \approx 3$.
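The Relative RMSE metric defined above is straightforward to compute; the sketch below is our own minimal illustration (not the paper's evaluation code), with toy labels chosen by us.

```python
import math

def relative_rmse(y_true, y_pred, y_train_mean):
    """Relative RMSE = sqrt(MSE / NMSE), where NMSE is the test MSE of the
    naive predictor that always outputs the training-label mean."""
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    nmse = sum((yt - y_train_mean) ** 2 for yt in y_true)
    return math.sqrt(mse / nmse)

# The naive predictor itself scores exactly 1; values below 1 mean the
# algorithm beats predicting the training mean on every test point.
y_true = [1.0, 2.0, 3.0]
print(relative_rmse(y_true, [2.0, 2.0, 2.0], y_train_mean=2.0))  # 1.0
print(relative_rmse(y_true, [1.1, 2.0, 2.9], y_train_mean=2.0))
```

The $1/N_{test}$ factors in MSE and NMSE cancel in the ratio, which is why the function can work with raw sums of squares.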

I.3 SOFTMAX REGRESSION WITH OBESITY DATASET

The data set can be freely downloaded from https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+. The data contains 17 attributes (such as age, gender, physical activity, eating habits, etc.) and 2111 records.

Experimental setup: We divide the data into $N = 7$ heterogeneous silos based on the value of the target variable, obesity level, which is categorical and takes 7 values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III. We fix $\delta_i = \delta = 1/n^2$ and test $\epsilon \in \{0.5, 1, 3, 6, 9\}$. We ran three trials, each with a new train/test split, and reported the average test error.

Preprocessing: We numerically encode the categorical variables and standardize the continuous numerical features to have zero mean and unit variance. We used an 80/20 train/test split. We discarded a small number of (randomly selected) training samples from some silos in order to obtain balanced silo sets, which eased implementation of the noisy methods.

Hyperparameter tuning: For each algorithm and each setting of $\epsilon, R, K$, we swept through a range of constant stepsizes to find the (approximately) optimal stepsize for that particular algorithm and experiment. We then used the corresponding $w_R$ to compute test error for that trial. For (ISRL-DP) MB-SGD, the stepsize grid consisted of 8 evenly spaced points between $e^{-7}$ and $e^{-1}$. For (ISRL-DP)
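The silo construction for this experiment (one silo per obesity-level class, then randomly discarding records so all silos end up the same size) can be sketched as follows. This is a minimal illustration of ours, not the paper's code; the function name, seed, and toy labels are assumptions.

```python
import random
from collections import defaultdict

def split_by_label_and_balance(records, labels, seed=0):
    """Assign each record to the silo of its class label (one silo per
    class), then randomly discard records from the larger silos so that
    every silo ends up with the size of the smallest one."""
    silos = defaultdict(list)
    for rec, lab in zip(records, labels):
        silos[lab].append(rec)
    rng = random.Random(seed)
    m = min(len(v) for v in silos.values())        # smallest silo size
    return {lab: rng.sample(v, m) for lab, v in silos.items()}

labels = ['a'] * 10 + ['b'] * 7 + ['c'] * 12
records = list(range(len(labels)))
balanced = split_by_label_and_balance(records, labels)
print(sorted(len(v) for v in balanced.values()))   # [7, 7, 7]
```

Splitting by class label makes the silos maximally heterogeneous (each silo sees only one value of the target), which is exactly the regime the heterogeneous-FL analysis targets.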

