D2P-FED: DIFFERENTIALLY PRIVATE FEDERATED LEARNING WITH EFFICIENT COMMUNICATION

Abstract

In this paper, we propose discrete Gaussian based differentially private federated learning (D2P-FED), a unified scheme that achieves both differential privacy (DP) and communication efficiency in federated learning (FL). Compared with the only prior work that addresses both aspects, D2P-FED provides a stronger privacy guarantee, better composability, and smaller communication cost. The key idea is to apply discrete Gaussian noise to the private data transmission. We provide a complete analysis of the privacy guarantee, communication cost, and convergence rate of D2P-FED. We evaluated D2P-FED on INFIMNIST and CIFAR10. The results show that D2P-FED outperforms the state of the art by 4.7% to 13.0% in terms of model accuracy while saving one third of the communication cost. The results might be surprising at first glance but are reasonable because the quantization level k in D2P-FED is independent of the modulus q: as long as q is large enough, the probability that the noise exceeds q is small and thus has negligible impact on model accuracy.

1. INTRODUCTION

Federated learning (FL) is a popular machine learning paradigm that allows a central server to train models over decentralized data sources. In FL, each client trains locally on its own data and sends only the model update to the server, which then updates the global model based on the aggregated local updates. Since the data stays local, FL can provide better privacy protection than traditional centralized learning. However, FL faces two main challenges: (1) it lacks a rigorous privacy guarantee (e.g., differential privacy (DP)) and has indeed been shown to be vulnerable to various inference attacks (Nasr et al., 2019; Pustozerova & Mayer; Xie et al., 2019); (2) it incurs considerable communication cost. In many potential applications of FL, such as mobile devices, these two challenges are present simultaneously. However, privacy and communication efficiency have mostly been studied independently in the past. On the privacy side, existing work has applied a gold-standard privacy notion, differential privacy (DP), to FL, which ensures that the server can hardly determine the participation of any client by observing their updates (Geyer et al., 2017). To achieve DP, each client needs to inject noise into its local updates, and as a side effect the performance of the trained model inevitably degrades. To improve model utility, secure multiparty computation (SMC) has been used in tandem with DP to reduce the noise (Jayaraman et al., 2018; Truex et al., 2019). The key idea is to prevent the server from observing individual updates, making only the aggregate accessible, and thus move from local DP to central DP. However, SMC introduces extra communication overhead for each client. On the other hand, there has been extensive research on improving the communication efficiency of FL while ignoring the privacy aspect (Tsitsiklis & Luo, 1987; Balcan et al., 2012; Zhang et al., 2013; Arjevani & Shamir, 2015; Chen et al., 2016).
However, these communication reduction methods either have implementations incompatible with existing DP mechanisms or break the DP guarantees when combined with SMC. The only existing work that tries to reconcile DP and communication efficiency in FL is cpSGD (Agarwal et al., 2018). The authors leveraged the Binomial mechanism, which adds Binomial noise to local updates to ensure differential privacy. The discrete nature of Binomial noise allows it to be transmitted efficiently. However, cpSGD faces several limitations in real-world applications. First, with Binomial noise, the output of a learning algorithm has different supports on different input datasets; as a result, the Binomial mechanism can only guarantee approximate DP, under which the participation of a client can be completely exposed with nonzero probability. Second, there is no tight composition theorem for DP with Binomial noise, so the resulting privacy budget skyrockets in a multi-round FL protocol. Hence, the Binomial mechanism cannot produce a useful model with a reasonable privacy budget on complex tasks. Last but not least, the Binomial mechanism involves several mutually constrained hyper-parameters and an extremely complicated privacy formula, which makes hyper-parameter tuning difficult. In this paper, we propose discrete Gaussian based differentially private federated learning (D2P-FED), an alternative technique to reduce communication cost while maintaining differential privacy in FL. Our key idea is to leverage the discrete Gaussian mechanism in FL, which adds discrete Gaussian noise to client updates. We show that the discrete Gaussian mechanism satisfies Rényi DP, which provides better composability. We employ secure aggregation along with the discrete Gaussian mechanism to lower the noise and establish the privacy guarantee for this hybrid privacy protection approach.
To save the communication cost, we integrate stochastic quantization and random rotation into the protocol. We then cast FL as a general distributed mean estimation problem and analyze the utility of the overall protocol. Our theoretical analysis sheds light on the superiority of D2P-FED over cpSGD. Our experiments show that D2P-FED leads to state-of-the-art performance in managing the trade-off among privacy, utility, and communication.

2. RELATED WORK

How to reduce the communication cost in traditional distributed learning settings is well studied (Tsitsiklis & Luo (1987); Balcan et al. (2012); Zhang et al. (2013); Arjevani & Shamir (2015); Chen et al. (2016)). However, most of these approaches either require communication between the workers or are designed for specific learning tasks, so they cannot be applied directly to general-purpose FL. The most relevant work is Suresh et al. (2017), which proposed stochastic quantization to save communication cost and random rotation to lower the mean squared error of the estimated mean. We follow their approach to improve the communication efficiency and model utility of D2P-FED. Nevertheless, our work differs from theirs in that we also study how to ensure DP for rotated and quantized data transmission and prove a convergence result for the learning algorithm with both communication cost reduction and privacy protection in place. On the other hand, differentially private FL has undergone rapid development in the past few years (Geyer et al. (2017); McMahan et al. (2017); Jayaraman et al. (2018)). However, these methods mainly focus on improving utility under a small privacy budget and ignore the issue of communication cost. In particular, we adopt a hybrid approach similar to Truex et al. (2019), which combines SMC with DP to reduce the noise. SMC ensures that the central server only sees the aggregated update but not individual ones from clients; as a result, the noise added by each client can be reduced by a factor of the number of clients participating in one round. Our work differs from theirs in that we inject discrete Gaussian noise into local updates instead of continuous Gaussian noise. This allows us to use secure aggregation (Bonawitz et al., 2017), which is much cheaper than the threshold homomorphic encryption used by Truex et al. (2019).
We further study the interaction between the discrete Gaussian noise and secure aggregation, as well as their effects on learning convergence. We identify cpSGD (Agarwal et al. (2018)) as the work most comparable to D2P-FED. Just like D2P-FED, cpSGD aims to improve both the communication cost and the utility under a rigorous privacy guarantee. However, cpSGD suffers from the three main defects discussed in Section 1. This paper proposes the discrete Gaussian mechanism to mitigate these issues.

3. BACKGROUND AND NOTATION

In this section, we provide an overview of FL and DP and establish notation. We use bold lower-case letters (e.g. $\mathbf{a}, \mathbf{b}, \mathbf{c}$) to denote vectors, and bold upper-case letters (e.g. $\mathbf{A}, \mathbf{B}, \mathbf{C}$) for matrices. We denote $\{1, \dots, n\}$ by $[n]$.

FL Overview. In an FL system, there are one server and $n$ clients $C_i$, $i \in [n]$. The server holds a global model of dimension $d$. Each client holds (IID or non-IID) samples drawn from some unknown distribution $\mathcal{D}$. The goal is to learn the global model $w \in \mathbb{R}^d$ that minimizes some loss function $L(w, \mathcal{D})$. To achieve this, the system runs a $T$-round FL protocol. The server initializes the global model with $w_0$. In round $t \in [T]$, the server randomly sub-samples $\gamma n$ clients from $[n]$ with sub-sampling rate $\gamma$ and broadcasts the global model $w_{t-1}$ to the chosen clients. Each chosen client $C_i$ then runs a local optimizer (e.g. SGD, Adam, or RMSprop), computes the difference between the locally optimized model $w_t^{(i)}$ and the global model $w_{t-1}$, i.e. $g_t^{(i)} = w_t^{(i)} - w_{t-1}$, and uploads $g_t^{(i)}$ to the server. The server takes the average of the differences and updates the global model: $w_t = w_{t-1} + \frac{1}{\gamma n}\sum_i g_t^{(i)}$.

Communication in FL. The clients in FL are often edge devices, where the upload bandwidth is fairly limited; therefore, communication efficiency is of utmost importance to FL. Let $\pi$ denote a communication protocol. We denote the per-round communication cost by $\mathcal{C}(\pi, g^{[n]})$. To lower the communication cost, the difference vectors are typically compressed before being sent to the server. The compression degrades model performance, and we measure the performance loss via the mean squared error. Specifically, letting $\bar{g} = \frac{1}{n}\sum_{i=1}^n g^{(i)}$ denote the actual mean of the difference vectors and $\hat{g}$ denote the server's estimate of the mean under some protocol such as D2P-FED, we measure the performance loss by $\mathcal{E}(\pi, g^{[n]}) = \mathbb{E}[\|\hat{g} - \bar{g}\|_2^2]$, i.e., the mean squared error between the estimated and the actual mean.
This mean squared error is directly related to the convergence rate of FL (Agarwal et al., 2018).

Threat Model & Differential Privacy. We assume that the server is honest-but-curious. Namely, the server follows the protocol honestly under law-enforcement or reputation pressure, but is curious to learn the client-side data from the legitimate client-side messages. In the FL context, the server wants to extract information about the client-side data by studying the received local updates without deviating from the protocol. The above attack, widely known as the inference attack (Shokri et al., 2017; Yeom et al., 2018; Nasr et al., 2019), can be effectively mitigated using a canonical privacy notion, namely differential privacy (DP). Intuitively, DP in the ML context ensures that the trained model is nearly the same regardless of the participation of any single client.

Definition 1 ($(\epsilon, \delta)$-DP). A randomized algorithm $f : \mathcal{D} \to \mathcal{R}$ is $(\epsilon, \delta)$-differentially private if for every pair of neighboring datasets $D$ and $D'$ that differ in only one datapoint, and every possible (measurable) output set $E$, the following inequality holds: $P[f(D) \in E] \le e^{\epsilon} P[f(D') \in E] + \delta$.

$(\epsilon, \delta)$-DP has been used as the privacy notion in most existing works on privacy-preserving FL. However, in this paper we consider a generalization of DP, Rényi differential privacy (RDP), which is strictly stronger than $(\epsilon, \delta)$-DP for $\delta > 0$ and allows tighter analysis when composing multiple mechanisms. The second point is particularly appealing, as FL typically comprises multiple rounds, yet existing works suffer from skyrocketing privacy budgets in multi-round learning.

Definition 2 ($(\alpha, \epsilon)$-RDP). For two probability distributions $P$ and $Q$ with the same support, the Rényi divergence of order $\alpha > 1$ is defined by $D_\alpha(P \| Q) \triangleq \frac{1}{\alpha-1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right]$. A randomized mechanism $f : \mathcal{D} \to \mathcal{R}$ is $(\alpha, \epsilon)$-RDP if for any neighboring datasets $D, D' \in \mathcal{D}$ it holds that $D_\alpha(f(D) \| f(D')) \le \epsilon$.
The intuition behind RDP is the same as for other variants of differential privacy: similar inputs should yield similar output distributions; under RDP the similarity is measured by the Rényi divergence. RDP can be converted to $(\epsilon, \delta)$-DP using the following transformation.

Lemma 1 (RDP-DP conversion (Mironov (2017))). If $\mathcal{M}$ obeys $(\alpha, \epsilon)$-RDP, then $\mathcal{M}$ obeys $(\epsilon + \frac{\log(1/\delta)}{\alpha - 1}, \delta)$-DP for all $0 < \delta < 1$.

RDP enjoys an operationally convenient and quantitatively accurate way of tracking cumulative privacy loss when composing multiple mechanisms (Lemma 2) or being combined with sub-sampling (Wang et al., 2018). As a result, RDP is particularly suitable for the ML context.

Lemma 2 (Adaptive composition of RDP (Mironov (2017))). If (randomized) mechanism $\mathcal{M}_1$ obeys $(\alpha, \epsilon_1)$-RDP and $\mathcal{M}_2$ obeys $(\alpha, \epsilon_2)$-RDP, then their composition obeys $(\alpha, \epsilon_1 + \epsilon_2)$-RDP.
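The two lemmas above translate directly into a few lines of accounting code. The following is a minimal sketch (the helper names are ours, not from the paper) of how Lemma 1 and Lemma 2 combine: RDP budgets at a fixed order $\alpha$ simply add up, and a single conversion at the end yields an $(\epsilon, \delta)$-DP statement.

```python
import math

def rdp_to_dp(alpha: float, eps_rdp: float, delta: float) -> float:
    """Lemma 1 (Mironov, 2017): (alpha, eps)-RDP implies
    (eps + log(1/delta)/(alpha - 1), delta)-DP for 0 < delta < 1."""
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)

def compose_rdp(eps_list):
    """Lemma 2: adaptive composition at a fixed order alpha sums the budgets."""
    return sum(eps_list)

# Example: ten mechanisms, each (alpha=8, 0.05)-RDP.
alpha, delta = 8.0, 1e-5
total_rdp = compose_rdp([0.05] * 10)          # 0.5 at alpha = 8
eps_dp = rdp_to_dp(alpha, total_rdp, delta)   # 0.5 + log(1e5)/7
```

Converting once after composing is what keeps the multi-round budget tight; converting each round to $(\epsilon, \delta)$-DP and then composing would waste budget.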

4. DISCRETE GAUSSIAN MECHANISM

In this section, we present the discrete Gaussian mechanism and establish its privacy guarantee. We first introduce the discrete Gaussian distribution.

Definition 3 (Discrete Gaussian Distribution). A discrete Gaussian is a probability distribution on a discrete additive subgroup $L$ (for instance, a multiple of $\mathbb{Z}$) parameterized by $\sigma$. For a discrete Gaussian distribution $N_L(\sigma)$ and $x \in L$, the probability mass at $x$ is proportional to $e^{-x^2/(2\sigma^2)}$.

The discrete Gaussian mechanism works by adding noise drawn from the discrete Gaussian distribution. Canonne et al. (2020) proved concentrated DP for the discrete Gaussian mechanism. However, concentrated DP lacks tight privacy amplification and composition theorems. To address this, we turn to RDP and provide the first RDP analysis of the discrete Gaussian mechanism. The proof is deferred to Appendix A due to space limitations.

Theorem 1 (RDP for the discrete Gaussian mechanism). If $f$ has sensitivity 1 and $\mathrm{range}(f) \subseteq L$, then the discrete Gaussian mechanism $f(\cdot) + N_L(\sigma)$ satisfies $(\alpha, \alpha/(2\sigma^2))$-RDP.

Under RDP, the discrete Gaussian exhibits a tight privacy amplification bound under sub-sampling (Wang et al., 2018). This suits FL well since a subset of clients is sub-sampled to upload updates in each round.

Corollary 1 (Privacy amplification for the discrete Gaussian mechanism (Wang et al., 2018)). If a discrete Gaussian mechanism is $(\alpha, \frac{\alpha}{2\sigma^2})$-RDP, then augmented with sub-sampling (without replacement), the privacy guarantee is amplified to (1) $(\alpha, O(\frac{\alpha\gamma^2}{\sigma^2}))$ in the high privacy regime; or (2) $(\alpha, O(\alpha\gamma^2 e^{1/\sigma^2}))$ in the low privacy regime.

Besides, RDP enables the discrete Gaussian mechanism to be composed tightly with the analytical moments accountant (Wang et al., 2018), which saves a large amount of privacy budget in multi-round FL. The analytical moments accountant is a data structure that symbolically tracks the cumulant generating function of the composed mechanisms.
Since it has no closed-form solution, we instead introduce the canonical composition of RDP (Mironov, 2017) below for ease of discussion in Section 6.3.

Corollary 2 (Composition for the discrete Gaussian mechanism (Wang et al., 2018)). If a discrete Gaussian mechanism is $(\alpha, \frac{\alpha}{2\sigma^2})$-RDP, then the sequential composition of $T$ such mechanisms yields an $(\alpha, \frac{T\alpha}{2\sigma^2})$-RDP guarantee. If we convert all the RDP guarantees back to $(\epsilon, \delta)$-DP, the growth of $\epsilon$ under the same $\delta$ is asymptotically $O(\sqrt{T})$.

Note that both privacy amplification and composition are given in asymptotic form for clarity of presentation. For tight bounds, we refer the reader to Theorem 27 and Section 3.3 in Wang et al. (2018).
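To make the mechanism concrete, here is a small illustrative sampler for the discrete Gaussian together with the Theorem 1 / Corollary 2 budget arithmetic. The sampler normalizes the mass $e^{-x^2/(2\sigma^2)}$ over a truncated support, which is a simplification for illustration only (exact samplers, e.g. the rejection sampler of Canonne et al. (2020), avoid truncation); the function names are ours.

```python
import numpy as np

def sample_discrete_gaussian(sigma, size, step=1.0, truncation=10, seed=0):
    """Sample from (an approximation of) N_L(sigma) on the lattice step*Z by
    normalizing exp(-x^2 / (2 sigma^2)) over a truncated support."""
    support = np.arange(-truncation * sigma, truncation * sigma + step, step)
    logits = -support**2 / (2.0 * sigma**2)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.default_rng(seed).choice(support, size=size, p=probs)

def rdp_budget(alpha, sigma, T=1):
    """Theorem 1: one round costs alpha / (2 sigma^2) at order alpha;
    Corollary 2: T rounds compose to T * alpha / (2 sigma^2)."""
    return T * alpha / (2.0 * sigma**2)

noise = sample_discrete_gaussian(4.0, size=100_000)
# The empirical variance sits slightly below sigma^2, consistent with the
# variance bound quoted in Appendix D (Lemma 4).
```

The budget helper makes the composition behavior of Corollary 2 explicit: doubling the number of rounds doubles the RDP budget at a fixed order, rather than inflating it by a log(1/δ) factor.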

5. D2P-FED: ALGORITHM AND PRIVACY ANALYSIS

In this section, we formally present D2P-FED and provide a rigorous privacy analysis.

5.1. ALGORITHM

Algorithm 1 provides the pseudocode for D2P-FED. It follows the general FL pipeline, which iteratively performs the following steps: (1) the server broadcasts the global model to a subset of clients; (2) the selected clients train the global model on their local data and upload the resulting model difference; and (3) the server aggregates the model differences uploaded by the clients and updates the global model. On top of this general FL pipeline, D2P-FED introduces additional client-side steps to improve communication efficiency and privacy.

Stochastic Quantization (line 12-14). Each client stochastically quantizes its update vector to $k$ levels, which reduces the number of bits needed to encode each coordinate (see McMahan et al. (2016) for a detailed explanation). On the other hand, quantization lowers the fidelity of the update vector and thus introduces error into the estimate of the mean of the gradients. To lower the estimation error, the clients apply a random rotation to the updates before quantization, as proposed by McMahan et al. (2016). The details are discussed in Section 6.1.

Discrete Gaussian Mechanism (line 15). We apply the discrete Gaussian mechanism to ensure DP. To determine the noise magnitude, we need to bound the $\ell_2$-sensitivity (defined in Section 5.2) of the gradient aggregate. Without quantization and random rotation, one could clip the individual gradient updates, in which case the $\ell_2$-sensitivity is just the clipping threshold. However, the inclusion of the compression steps makes the analysis of the $\ell_2$-sensitivity more involved. We provide the sensitivity analysis and RDP guarantees for the entire algorithm in Section 5.2. Each client samples its noise from the discrete Gaussian distribution. In contrast to prior work where each client adds independent noise, we require the clients to share the same random seed, generate the same noise, and each add an average share of the noise in each round. This is because the sum of multiple independent discrete Gaussians is no longer a discrete Gaussian.
Note that the same random seed is also required for communication reduction in secure aggregation, so we can conveniently reuse it here without introducing further overhead (see Bonawitz et al. (2017) for details). The noise magnitude is set so that the aggregate noise from all clients provides the global DP guarantee.

Algorithm 1: D2P-FED Protocol.

Input: support lattice $L = \frac{2g_{\max}}{k-1} \cdot \mathbb{Z}$; noise scale $\sigma$; rotation matrix $R$; random seed $s$; quantization level $k$; odd modulus $q$; wrap map $\phi_q(x) = \frac{2g_{\max}}{k-1}\big((\frac{k-1}{2g_{\max}}x + \frac{q-1}{2}) \bmod q - \frac{q-1}{2}\big)$.

1:  for $t \in [T]$ do
2:    Server:
3:      sub-sample a set $S$ of $\gamma n$ clients
4:      broadcast $w_{t-1}$ to $S$
5:      foreach pair of clients $i, j \in S$: $i$ sends $u_{ij}$ to $j$, $u_{ij} \sim \mathrm{Unif}(L_q^d)$, $L_q := \{x \in L : |x| \le \frac{q-1}{2}\}$
6:    Client:
7:    foreach client $i \in S$ do
8:      train the model $w_t^{(i)}$ with $w_{t-1}$ as initialization
9:      $g_t^{(i)} = w_t^{(i)} - w_{t-1}$   /* compute the difference */
10:     $g_{t,\text{clipped}}^{(i)} = g_t^{(i)} / \max(1, \|g_t^{(i)}\|_2 / D)$   /* clip the difference */
11:     $g_{t,\text{rotated}}^{(i)} = R \times g_{t,\text{clipped}}^{(i)}$   /* random rotation */
12:     let $b[r] := -g_{\max} + \frac{2 r g_{\max}}{k-1}$ for every $r \in [0, k)$   /* quantize */
13:     for $j \in [d]$ with $b[r] \le g_{t,\text{rotated}}^{(i)}[j] \le b[r+1]$ do
14:       $g_{t,\text{quantized}}^{(i)}[j] = b[r+1]$ w.p. $\frac{g_{t,\text{rotated}}^{(i)}[j] - b[r]}{b[r+1]-b[r]}$, and $b[r]$ otherwise
15:     $g_{t,\text{dp}}^{(i)} = g_{t,\text{quantized}}^{(i)} + \frac{\nu_i}{\gamma n}$, where $\nu_i \sim N_L^d(\sigma)$ is drawn from the shared seed $s$   /* discrete Gaussian */
16:     $\hat{g}_t^{(i)} = \phi_q\big(\phi_q(g_{t,\text{dp}}^{(i)}) + \sum_{j \ne i, j \in S} u_{ij} - \sum_{j \ne i, j \in S} u_{ji}\big)$   /* mask */
17:     send $\hat{g}_t^{(i)}$ to the server
18:   Server:
19:     $\hat{g}_t = \frac{1}{\gamma n}\sum_{i \in S} \hat{g}_t^{(i)}$   /* aggregate */
20:     $w_t = w_{t-1} + \hat{g}_t$

Secure Aggregation (line 5, 16, 19). To reduce the noise magnitude required for DP, we hide clients' individual updates from the central server and only allow it to see the aggregated update via secure aggregation. If the individual updates were visible to the central server, each of them would have to be protected with the same privacy guarantee as the averaged update; in that case, the required noise scales up with $O(\gamma n)$. If, instead, the central server can only access the aggregated update, the required noise is $O(1)$.
Hence, secure aggregation of local updates leads to a significant noise reduction. However, there is a challenge in integrating secure aggregation with the discrete Gaussian mechanism: discrete Gaussian variables have infinite support and are therefore incompatible with secure aggregation, which operates on a finite field. In Section 5.2, we address this challenge by mapping the noised vector onto a cyclic additive group before applying secure aggregation, and we show that the RDP guarantees are preserved under this mapping.
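The client-side steps described above can be summarized in a short sketch. This is an illustrative rendering of Algorithm 1's client logic, not the paper's implementation: the rotation is omitted (identity), all parameter values are hypothetical, and the shared noise is approximated by rounding a continuous Gaussian onto the lattice, whereas the actual protocol draws a true discrete Gaussian from the shared seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # per-client randomness for quantization

def client_update(g, D=1.0, k=17, g_max=1.0, sigma=2.0, gamma_n=10, shared_seed=42):
    """Sketch of one client's update: clip -> stochastic k-level quantization
    -> add a 1/(gamma*n) share of the seed-shared noise."""
    # 1. Clip the difference vector to l2 norm D.
    g = g / max(1.0, np.linalg.norm(g) / D)
    # 2. Stochastic k-level quantization on [-g_max, g_max] with step
    #    2*g_max/(k-1): round up to the next level with probability equal to
    #    the relative offset, which keeps the quantized vector unbiased.
    step = 2.0 * g_max / (k - 1)
    lower = np.floor((g + g_max) / step) * step - g_max
    p_up = (g - lower) / step
    q = lower + step * (rng.random(g.shape) < p_up)
    # 3. Every client derives the *same* noise from the shared seed and adds a
    #    1/(gamma*n) share (rounded Gaussian as a stand-in for N_L(sigma)).
    shared = np.random.default_rng(shared_seed)
    noise = step * np.round(shared.normal(0.0, sigma, g.shape))
    return q + noise / gamma_n
```

Averaging many such updates recovers the clipped vector in expectation, because the quantization is unbiased and the identical noise shares sum to a single noise draw.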

5.2. PRIVACY ANALYSIS

Before applying the discrete Gaussian mechanism to D2P-FED, we need to determine how to calibrate the added noise. In differential privacy, the calibration is guided by the sensitivity of the function, defined below.

Definition 4 ($\ell_2$-sensitivity). Given a function $f : \mathcal{D} \to \mathcal{R}$ and two neighboring datasets $D$ and $D'$, the $\ell_2$-sensitivity of $f$ is defined as $\Delta_f = \max_{D, D'} \|f(D) - f(D')\|_2$.

In DP for deep learning, the traditional way to bound the $\ell_2$-sensitivity is to clip the update vector. However, quantization further influences the sensitivity after clipping. We bound the sensitivity after quantization as follows. The proof is deferred to Appendix B due to space limitations.

Theorem 2. If we clip the $\ell_2$ norm of $g$ to $D$ and quantize it to $k = \sqrt{d} + 1$ levels, then the $\ell_2$-sensitivity of the difference is $4D$.

Given Theorem 1 and Theorem 2, we obtain the RDP bound for D2P-FED.

Corollary 3 (RDP for D2P-FED). Given the clipping bound $D$ and the noise scale $\sigma$, D2P-FED satisfies $(\alpha, \frac{8\alpha D^2}{\sigma^2})$-RDP.

Remark 1: Comparison with cpSGD. It may seem unclear how to interpret the above bound when compared with Theorem 1 in cpSGD (Agarwal et al., 2018). The claim that D2P-FED has a better privacy guarantee than cpSGD is justified by the following three aspects: (1) D2P-FED follows RDP, which is a strictly stronger privacy notion than the $(\epsilon, \delta)$-DP to which cpSGD is intrinsically limited; (2) D2P-FED enjoys tighter composition than cpSGD, which is of critical significance in an FL protocol with potentially thousands of rounds; (3) our experimental results in Figure 1a also empirically show that D2P-FED composes more tightly than cpSGD: the total privacy budget of D2P-FED grows much more slowly as training proceeds.

Remark 2: Privacy Effect of Secure Aggregation. Corollary 3 is built on the assumption that the central server only has access to the summed updates, not the individual ones.
If the central server had access to individual updates, the noise would have to scale up by $\gamma n$ times to maintain the same privacy guarantee, which would severely hurt model accuracy. To enforce this assumption, we leverage a cryptographic technique, secure aggregation (Bonawitz et al., 2017), which guarantees that the central server can only see the aggregated result. The basic idea is to mask the inputs with random values that cancel out in pairs. However, since the discrete Gaussian has infinite support, we cannot directly apply random masks to it. To reconcile secure aggregation with the discrete Gaussian, we project the involved values into a quotient group after shifting and then apply the random masks, as shown in line 16 of Algorithm 1. By the post-processing theorem of RDP (Mironov (2017)), the result still satisfies rigorous Rényi differential privacy, as proved in Appendix C. Note that we consider a simplified version of the full secure aggregation protocol (Bonawitz et al. (2017)) in Algorithm 1 and omit many details such as the generation of the random masks and how to handle dropout; we deem this sufficient to clarify the idea behind the reconciliation. The complete version of secure aggregation can be reconciled using exactly the same trick.

Theorem 3 (Informal). The distributed discrete Gaussian mechanism with secure aggregation obeys the same RDP guarantee as the vanilla global discrete Gaussian mechanism with the same parameters.
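The interplay between the wrap map $\phi_q$ and the pairwise masks can be checked numerically. The snippet below is a toy demonstration (toy dimensions and modulus, our variable names): each client adds masks that cancel in pairs, everything is wrapped by $\phi_q$ into the centered range, and the server still recovers the wrapped sum of the unmasked lattice vectors.

```python
import numpy as np

def phi_q(x, g_max=1.0, k=5, q=257):
    """The wrap map of Algorithm 1: scale lattice values to integers, reduce
    mod q into [-(q-1)/2, (q-1)/2], and scale back."""
    s = (k - 1) / (2.0 * g_max)  # lattice step is 1/s
    return ((x * s + (q - 1) / 2) % q - (q - 1) / 2) / s

rng = np.random.default_rng(0)
d, m, q, k, g_max = 8, 3, 257, 5, 1.0
step = 2.0 * g_max / (k - 1)
# Toy lattice-valued noised updates for m clients.
updates = step * rng.integers(-5, 6, size=(m, d))
# Pairwise masks u_ij, uniform over the centered lattice range L_q.
masks = step * rng.integers(-(q - 1) // 2, (q - 1) // 2 + 1, size=(m, m, d))
masked = [
    phi_q(phi_q(updates[i])
          + sum(masks[i][j] for j in range(m) if j != i)
          - sum(masks[j][i] for j in range(m) if j != i))
    for i in range(m)
]
# The masks cancel in the (wrapped) sum, leaving phi_q of the true sum.
recovered = phi_q(sum(masked))
assert np.allclose(recovered, phi_q(updates.sum(axis=0)))
```

Any single masked vector, by contrast, is uniformly distributed over $L_q^d$ and reveals nothing on its own, which is exactly what the honest-but-curious server is restricted to seeing.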

6. COMMUNICATION PROTOCOL & UTILITY ANALYSIS

In this section, we present our communication protocol in detail and discuss the communication cost and estimation error of D2P-FED in direct comparison to cpSGD. The drastic improvement of D2P-FED comes mainly from the tighter composition of the discrete Gaussian mechanism compared to the Binomial mechanism in cpSGD.

6.1. COMMUNICATION PROTOCOL

As the first step, we leverage the stochastic $k$-level quantization proposed by McMahan et al. (2016) to lower the communication cost, as described in lines 12-14 of Algorithm 1. If we denote vanilla stochastic $k$-level quantization by $\pi_k$, the per-round communication cost is reduced to $\mathcal{C}(\pi_k, g^{[n]}) = n \cdot (d \log_2 k + \tilde{O}(1))$. However, stochastic quantization sacrifices some accuracy for communication efficiency. Concretely, $\mathcal{E}(\pi_k, g^{[n]}) = O\big(\frac{d}{n} \cdot \frac{1}{n}\sum_{i=1}^n \|g^{(i)}\|_2^2\big)$. Since the number of parameters $d$ ranges from tens of thousands to hundreds of thousands in federated learning, this estimation error is too large. To reduce it, we randomly rotate the difference vector (McMahan et al., 2016) as the second step. The key intuition is that the MSE of stochastic uniform quantization is $O(\frac{d}{n}(g_{\max})^2)$. With random rotation, we can limit $g_{\max}$ to $O(\sqrt{\log d / d})$ w.h.p., so the MSE improves to $O(\frac{\log d}{n})$. Agarwal et al. (2018) also leverage random rotation to reduce the MSE. However, in their setting, random rotation intrinsically harms the privacy guarantee because the $\ell_\infty$-sensitivity may increase under rotation. A natural advantage of the discrete Gaussian is that its privacy guarantee depends only on the $\ell_2$-sensitivity, which is invariant under rotation; thus random rotation does not harm our privacy guarantee at all. We omit the details and refer interested readers to McMahan et al. (2016). We denote the protocol using $k$-level quantization and random rotation by $\pi_k^{(\mathrm{rot})}$. The communication cost $\mathcal{C}(\pi_k^{(\mathrm{rot})}, g^{[n]})$ remains the same, while the MSE is reduced to $\mathcal{E}(\pi_k^{(\mathrm{rot})}, g^{[n]}) = O\big(\frac{\log d}{n} \cdot \frac{1}{n}\sum_{i=1}^n \|g^{(i)}\|_2^2\big)$.
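The effect of the rotation is easy to verify numerically. The following sketch uses a dense random orthogonal matrix as a stand-in for the structured rotations used in practice (randomized Hadamard transforms, which cost $O(d \log d)$ instead of $O(d^2)$); the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024
# Worst case for quantization: all of the vector's mass on one coordinate,
# so g_max equals the full l2 norm.
g = np.zeros(d)
g[0] = 1.0
# Random orthogonal matrix via QR of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
g_rot = R @ g
# Rotation preserves the l2 norm but spreads the mass: the max coordinate
# drops from 1 to O(sqrt(log d / d)) w.h.p., shrinking the d * g_max^2
# quantization MSE term from O(d) to O(log d).
print(np.abs(g).max(), np.abs(g_rot).max())
```

Since the server knows the rotation (it is derived from the shared seed), it can invert it after aggregation, so the rotation costs no extra communication.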

6.2. CONVERGENCE RATE OF D 2 P-FED

In this section, we relate the convergence rate to the mean squared error using Corollary 4 and analyze the mean squared error of mean estimation in D2P-FED. Note that we assume each client executes one iteration per round, so $g$ equals the gradient or belongs to the sub-gradients.

Corollary 4 (Ghadimi & Lan (2013)). Let $F(w) = L(w, \mathcal{D})$ for some given distribution $\mathcal{D}$. Let $F(w)$ be $L$-smooth and $\forall w, \|\nabla F(w)\| \le \rho$. Let $w_0$ satisfy $F(w_0) - F(w^*) \le \rho_F$. Then after $T$ rounds,
$$\mathbb{E}_{t \sim \mathrm{Unif}([T])}[\|\nabla F(w_t)\|_2^2] \le \frac{2\rho_F L}{T} + \frac{2\sqrt{2}\,\lambda\sqrt{L\rho_F}}{\sqrt{T}} + \rho B,$$
where $\lambda^2 = 2\max_{1 \le t \le T} \mathbb{E}[\|g(w_t) - \nabla F(w_t)\|_2^2] + 2\max_{1 \le t \le T} \mathbb{E}_q[\|\hat{g}(w_t) - g(w_t)\|_2^2]$ and $B = \max_{1 \le t \le T} \|\hat{g}(w_t) - g(w_t)\|$.

As Corollary 4 indicates, for a given gradient bound, convergence slows as the MSE grows. We therefore analyze D2P-FED's MSE and obtain the following theorem. The proof is deferred to Appendix D due to space limitations.

Theorem 4. If we choose $\sigma \ge 1/\sqrt{2\pi}$, the mean squared error is
$$\mathcal{E}\big(\pi^{(\mathrm{rot})}_{k,q,N_L(\sigma^2)}, g^{[n]}\big) \le \left(1 - \frac{1 - \Phi(nq)}{1 + 3e^{-2\pi^2\sigma^2}}\right) \cdot \frac{4d(g_{\max})^2}{n(k-1)^2}\left(\frac{1}{4} + \frac{\sigma^2}{\gamma^2 n^2}\right) + \big(1 - \Phi(n(q-k-1))\big) \cdot q^2,$$
where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution.

Remark 1: Choice of $g_{\max}$. As Theorem 4 indicates, the dominant term in the MSE is proportional to the square of $g_{\max}$. A natural choice of $g_{\max}$ is the clipping bound $D$. If we want to match the MSE guarantee of cpSGD, $O(\frac{\sigma^2 \log d}{n(k-1)^2})$, we need to inherit their choice of $g_{\max} = O(D\sqrt{\frac{\log d}{d}})$, which can be achieved by clipping the $\ell_\infty$ norm of the gradient after random rotation. For instance, according to Lemma 8 in Agarwal et al. (2018), we can choose $g_{\max} = \frac{2\sqrt{\log(2nd/\delta)}\,D}{\sqrt{d}}$. In that case, the probability that the maximum coordinate of $g$ exceeds $g_{\max}$ is at most $\delta$, and it follows that the probability that the $\ell_\infty$-clipping actually changes the update is bounded by $\delta$.
Hence, the RHS of the MSE bound in Theorem 4 becomes
$$\left(1 - \frac{1 - \Phi(nq)}{1 + 3e^{-2\pi^2\sigma^2}} - \delta\right) \cdot \frac{4d(g_{\max})^2}{n(k-1)^2}\left(\frac{1}{4} + \frac{\sigma^2}{\gamma^2 n^2}\right) + \big(1 - \Phi(n(q-k-1)) + \delta\big) \cdot q^2,$$
which is of the same order as the original bound given that $\delta$ is small.

Remark 2: Comparison with cpSGD. Since the MSE bound is of the same order as cpSGD's (even the constants are close), a natural question is: what is the advantage of D2P-FED over cpSGD in terms of MSE? The advantage stems from a smaller standard deviation of the noise. Given a fixed privacy budget, thanks to the tighter composition of sub-sampled RDP (Wang et al., 2018), each round of D2P-FED can spend more privacy budget and thus use a smaller noise scale. Plugging a smaller $\sigma$ into Theorem 4 gives a better MSE and thus a better convergence rate, as cpSGD follows a similar convergence rate bound. Moreover, according to Figure 1 in Agarwal et al. (2018), even with the same noise scale, Gaussian noise provides a stronger privacy guarantee than Binomial noise. As discrete Gaussian noise follows the same RDP bound as continuous Gaussian noise, we believe the discrete Gaussian can map the same noise scale to a lower privacy cost.

6.3. COMMUNICATION COST OF D2P-FED

First, we state the per-round communication cost, which is exactly the same as cpSGD's.

Theorem 5. The per-round communication cost of D2P-FED is $\mathcal{C}\big(\pi^{(\mathrm{rot})}_{k,q,N_L(\sigma^2)}, g^{[n]}\big) = n \cdot \big(d\log(nq+1) + \tilde{O}(1)\big)$.

Now let us compare the number of rounds in cpSGD and D2P-FED qualitatively. During the following discussion, we usually omit $\delta$ for ease of exposition and assume that $\delta$ is fixed. For cpSGD, the tightest known bound comes from combining standard privacy amplification (Balle et al., 2018) with advanced composition (Dwork et al., 2010). Concretely, if a mechanism costs privacy budget $\epsilon$, then composed with sub-sampling the privacy budget is reduced to $O(\gamma\epsilon)$, where $\gamma$ is the sub-sampling rate.
If the mechanism is composed sequentially $T$ times, the privacy budget grows to $O(\epsilon\sqrt{T\log(1/\delta)})$. Thus, the total privacy budget of cpSGD is $O(\gamma\epsilon\sqrt{T\log(1/\delta)})$. In contrast, D2P-FED provides a total privacy budget of $O(\gamma\epsilon\sqrt{T})$, saving a factor of $\sqrt{\log(1/\delta)}$. Since $\delta$ is typically very small, the saving is quite significant. If the privacy budgets of the two protocols are the same, then D2P-FED can use noise with an $O(\sqrt{\log(1/\delta)})$-times smaller scale than cpSGD in each round. This leads to $O(\sqrt{\log(1/\delta)})$-times faster convergence. For a given gradient bound, D2P-FED can then reach it with $O(\log(1/\delta))$-times fewer rounds and thus save $O(\log(1/\delta))$-times the communication cost. Both D2P-FED and cpSGD intrinsically require secure aggregation to establish their privacy guarantee; Agarwal et al. (2018) did not discuss this issue explicitly. As pointed out in Bonawitz et al. (2017), once combined with secure aggregation, each field has to expand at least $\gamma n$ times ($\gamma n$ being the number of chosen clients) to prevent overflow of the sum.
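The asymptotic comparison above can be made concrete by dropping constants. The sketch below (our function names; constants and lower-order terms omitted, so the numbers are only indicative) exposes the $\sqrt{\log(1/\delta)}$ factor separating the two budget growth rates.

```python
import math

def cpsgd_total_budget(eps_round, T, gamma, delta):
    """Sub-sampling then advanced composition (Dwork et al., 2010):
    total ~ gamma * eps * sqrt(T * log(1/delta)), constants dropped."""
    return gamma * eps_round * math.sqrt(T * math.log(1.0 / delta))

def d2pfed_total_budget(eps_round, T, gamma):
    """Sub-sampled RDP composition: total ~ gamma * eps * sqrt(T)."""
    return gamma * eps_round * math.sqrt(T)

T, gamma, eps, delta = 1000, 0.01, 0.5, 1e-5
ratio = cpsgd_total_budget(eps, T, gamma, delta) / d2pfed_total_budget(eps, T, gamma)
# ratio = sqrt(log(1/delta)), roughly 3.4 for delta = 1e-5, independent of T
```

Because the ratio does not depend on $T$, the advantage persists no matter how long training runs; it only grows when a smaller $\delta$ is demanded.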

7. EVALUATION

We aim to answer the following three questions with our empirical evaluation: (1) How does D2P-FED perform in multi-round federated learning compared to cpSGD under either the same privacy guarantee or the same communication cost? (2) How do different choices of hyper-parameters affect the performance of D2P-FED? (3) Does D2P-FED work under heterogeneous data distributions? Due to space limitations, we present our main results for (1) in this section and defer the results for (2) and (3) to Appendix E.

7.1. EXPERIMENT SETUP

To answer the above questions, we evaluated D2P-FED and cpSGD on INFIMNIST (Bottou (2007)) and CIFAR10 (Krizhevsky et al. (2009)). We sampled 10M hand-written digits from INFIMNIST and randomly split the data among 100K clients. In each round, 100 clients are randomly chosen to upload their difference vectors to train a three-layer MLP. For CIFAR10, we select 10 out of 2000 clients in each round to train a two-layer convolutional network. All RDP bounds are converted to $(\epsilon, \delta)$-DP for ease of comparison, and the total $\delta$ is set to $10^{-5}$ for all experiments.

7.2. MODEL ACCURACY VS. PRIVACY BUDGET

To answer the first question, we studied model accuracy under the same privacy budget, as shown in Figure 1a. Compared with cpSGD, D2P-FED achieves 4.7% higher model accuracy on INFIMNIST and 13.0% higher model accuracy on CIFAR10 after convergence. As expected, D2P-FED composes far more tightly under sub-sampling, as its curves are much sharper than those of cpSGD. Consequently, D2P-FED also converges at a smaller privacy budget than cpSGD. Although cpSGD shows better accuracy in the high-privacy region of Figure 1a, this is not necessarily the case in general but depends on the scale of the discrete Gaussian noise, as studied in Section E.1. Note that the results for cpSGD in Figure 1 differ from the results in Figure 2 of the original paper (Agarwal et al., 2018). The reason is that in the original paper, the authors do not sub-sample clients in each round but assign each client to exactly one round beforehand, to avoid the composition that cpSGD cannot handle well. However, that scheme is far from practical in the real world due to the dynamic nature of the clients.

7.3. MODEL ACCURACY VS. COMMUNICATION COST

To answer the second question, we also studied the model accuracy under the same communication cost; the results are shown in Figure 1b.

A PROOF FOR THEOREM 1

We consider the Rényi divergence between two discrete Gaussian distributions differing in the mean value.

Proof.

$$
D_\alpha(N_L(0, \sigma^2) \,\|\, N_L(\mu, \sigma^2))
\overset{(1)}{=} \frac{1}{\alpha - 1} \log \sum_{x \in L} \frac{1}{\sum_{x \in L} \exp(-(x - \mu)^2 / (2\sigma^2))} \exp(-\alpha x^2 / (2\sigma^2)) \cdot \exp(-(1 - \alpha)(x - \mu)^2 / (2\sigma^2))
$$
$$
= \frac{1}{\alpha - 1} \log \frac{1}{\sum_{x \in L} \exp(-(x - \mu)^2 / (2\sigma^2))} \sum_{x \in L} \exp\big((-x^2 + 2(1 - \alpha)\mu x - (1 - \alpha)\mu^2) / (2\sigma^2)\big)
$$
$$
\overset{(2)}{\le} \frac{1}{\alpha - 1} \log \big\{\exp((\alpha^2 - \alpha)\mu^2 / (2\sigma^2))\big\} = \alpha\mu^2 / (2\sigma^2)
$$

(1) Because range(f) ⊆ L, we only consider µ ∈ L. The normalizers of N_L(0, σ²) and N_L(µ, σ²) then cancel out, as Σ_{x∈L} exp(−(x − µ)²/(2σ²)) is periodic in µ.

(2) Completing the square gives −x² + 2(1 − α)µx − (1 − α)µ² = −(x − (1 − α)µ)² + (α² − α)µ², and Σ_{x∈L} exp(−(x − (1 − α)µ)²/(2σ²)) / Σ_{x∈L} exp(−(x − µ)²/(2σ²)) = √π ϑ((1 − α)πµ, e^{−π²}) / ϑ(0, 1/e) ≤ 1, where ϑ is the Jacobi theta function (Wikipedia).
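The bound of Theorem 1 can be sanity-checked numerically. The snippet below evaluates the Rényi divergence between two discrete Gaussians on the integer lattice (a wide truncated support stands in for the full lattice, and the parameters are arbitrary) and confirms that it stays below αµ²/(2σ²).

```python
import math

def dgauss_logpmf(mean, sigma, support):
    """Log-pmf of the discrete Gaussian N_Z(mean, sigma^2) on a truncated support."""
    logw = [-(x - mean) ** 2 / (2 * sigma**2) for x in support]
    m = max(logw)
    logz = m + math.log(sum(math.exp(v - m) for v in logw))
    return [v - logz for v in logw]

def renyi_divergence(logp, logq, alpha):
    """D_alpha(P || Q) = 1/(alpha-1) log sum p^alpha q^(1-alpha), in log space."""
    terms = [alpha * lp + (1 - alpha) * lq for lp, lq in zip(logp, logq)]
    m = max(terms)
    return (m + math.log(sum(math.exp(t - m) for t in terms))) / (alpha - 1)

sigma, mu, alpha = 2.0, 3, 8
support = range(-60, 61)               # wide enough that truncated tails are negligible
logp = dgauss_logpmf(0, sigma, support)
logq = dgauss_logpmf(mu, sigma, support)
bound = alpha * mu**2 / (2 * sigma**2)  # Theorem 1: D_alpha <= alpha * mu^2 / (2 sigma^2)
assert renyi_divergence(logp, logq, alpha) <= bound + 1e-9
```

The log-sum-exp formulation avoids the overflow that a direct evaluation of p^α q^{1−α} would hit at large Rényi orders.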

B PROOF FOR THEOREM 2

Proof. The ℓ2 sensitivity of δ is naturally bounded by 2D. The rotation does not change the sensitivity. The k-level quantization might expand the space, as shown in Figure 2. An upper bound on the radius of the red circle is D + √d · D/(k − 1). When we take k = √d + 1 (so that k − 1 = √d), this reduces to 2D. Thus, the upper bound on the sensitivity is 4D.
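A minimal sketch of k-level stochastic quantization (not the exact D 2 P-FED quantizer; the level placement and helper name are illustrative) demonstrates the two properties the sensitivity argument relies on: the estimate is unbiased, and the per-coordinate error never exceeds one quantization step 2g_max/(k − 1).

```python
import random

def stochastic_quantize(x, g_max, k):
    """Round x in [-g_max, g_max] to one of k evenly spaced levels, going
    up or down at random so that the result is unbiased (a sketch)."""
    step = 2 * g_max / (k - 1)
    idx = (x + g_max) / step        # fractional level index in [0, k-1]
    low = min(int(idx), k - 2)      # clamp so that low + 1 is still a valid level
    p_up = idx - low                # probability of rounding up to the next level
    level = low + (1 if random.random() < p_up else 0)
    return -g_max + level * step

random.seed(0)
g_max, k = 1.0, 17
step = 2 * g_max / (k - 1)
# Per-coordinate error is bounded by one quantization step.
for _ in range(1000):
    x = random.uniform(-g_max, g_max)
    assert abs(stochastic_quantize(x, g_max, k) - x) <= step + 1e-12

# Unbiasedness: the average of many quantizations recovers x.
x = 0.123
avg = sum(stochastic_quantize(x, g_max, k) for _ in range(20000)) / 20000
assert abs(avg - x) < 0.01
```

The per-coordinate bound is what turns, after summing over d coordinates, into the √d · D/(k − 1) radius expansion used in the proof.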

C PROOF FOR THEOREM 3

We first prove the following lemma.

Lemma 3. $\varphi_q(\sum_i x_i) = \varphi_q(\sum_i \varphi_q(x_i))$.

Proof. Recall that $\varphi_q(x) = \frac{2g_{max}}{k-1}\big((\frac{k-1}{2g_{max}} x + \frac{q-1}{2}) \bmod q - \frac{q-1}{2}\big)$. Write $c = \frac{k-1}{2g_{max}}$. Then $c \cdot \varphi_q(x) \equiv c \cdot x \pmod q$, so $c \sum_i \varphi_q(x_i) \equiv c \sum_i x_i \pmod q$; applying $\varphi_q$ reduces both sums to the same centered residue, i.e., $\varphi_q(\sum_i \varphi_q(x_i)) = \varphi_q(\sum_i x_i)$.

Then we prove Theorem 3 as follows.

Proof. Given Lemma 3,

$$
\hat{g}_t = \frac{1}{k} \sum_{i \in S} \hat{g}^{(i)}_t
= \frac{1}{k} \sum_{i \in S} \varphi_q\Big(\varphi_q(g^{(i)}_{t,dp}) + \sum_{j \neq i, j \in S} u_{ij} - \sum_{j \neq i, j \in S} u_{ji}\Big)
= \frac{1}{k} \varphi_q\Big(\sum_{i \in S} \big(\varphi_q(g^{(i)}_{t,dp}) + \sum_{j \neq i, j \in S} u_{ij} - \sum_{j \neq i, j \in S} u_{ji}\big)\Big)
= \frac{1}{k} \varphi_q\Big(\sum_{i \in S} g^{(i)}_{t,dp}\Big),
$$

where the last equality uses Lemma 3 and the fact that the pairwise masks $u_{ij}$ cancel in the sum.

The expression $\sum_{i \in S} g^{(i)}_{t,dp}$ forms a centralized discrete Gaussian mechanism, and according to the post-processing theorem, the same RDP guarantee still holds.
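The mask-cancellation step can be illustrated with an integer analogue of φ_q (the real φ_q also rescales by the quantization step; names and parameters below are illustrative): each client adds pairwise random masks modulo q, and the server recovers the true sum because the masks cancel.

```python
import random

def mod_centered(x, q):
    """Reduce x modulo q into the centered range [-q//2, q//2) -- an integer
    analogue of the map phi_q (the real phi_q also rescales)."""
    return (x + q // 2) % q - q // 2

random.seed(1)
q, n = 2**16, 5
x = [random.randint(-100, 100) for _ in range(n)]               # quantized updates
u = [[random.randrange(q) for _ in range(n)] for _ in range(n)]  # pairwise masks

# Client i uploads its update plus the masks it shares with every other client.
uploads = [
    mod_centered(x[i]
                 + sum(u[i][j] for j in range(n) if j != i)
                 - sum(u[j][i] for j in range(n) if j != i), q)
    for i in range(n)
]

# Every mask u[i][j] appears once with + and once with -, so the masks cancel
# and the server recovers the true sum modulo q.
assert mod_centered(sum(uploads), q) == mod_centered(sum(x), q)
```

Individual uploads look uniformly random modulo q, which is what prevents the server from observing any single client's update.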

D PROOF FOR THEOREM 4

Proof Sketch. Before starting the proof, we introduce two lemmas.

Lemma 4 (Proposition 19 from Canonne et al. (2020)). For all σ ∈ ℝ with σ > 0,
$$
\mathbb{V}[N_\mathbb{Z}(0, \sigma^2)] \le \sigma^2 \Big(1 - \frac{4\pi^2\sigma^2}{e^{4\pi^2\sigma^2} - 1}\Big) < \sigma^2.
$$
Moreover, if σ² ≤ 1/3, then $\mathbb{V}[N_\mathbb{Z}(0, \sigma^2)] \le 3 \cdot e^{-\frac{1}{2\sigma^2}}$.

Lemma 5 (Proposition 23 from Canonne et al. (2020)). For all m ∈ ℤ with m ≥ 1 and all σ ∈ ℝ with σ > 0,
$$
\mathbb{P}_{X \sim N_\mathbb{Z}(0, \sigma^2)}[X \ge m] \le \mathbb{P}_{X \sim N(0, \sigma^2)}[X \ge m - 1].
$$
Moreover, if σ ≥ 1/√(2π), we have
$$
\mathbb{P}_{X \sim N_\mathbb{Z}(0, \sigma^2)}[X \ge m] \ge \frac{1}{1 + 3e^{-2\pi^2\sigma^2}} \mathbb{P}_{X \sim N(0, \sigma^2)}[X \ge m].
$$

Now we start our proof. The MSE can be rewritten as
$$
\mathbb{E}[\|\hat{X} - X\|_2^2] = \frac{1}{n^2} \sum_{j=1}^{d} \sum_{i=1}^{n} \mathbb{E}[(\hat{X}_i(j) - X_i(j))^2].
$$
For each term $\mathbb{E}[(\hat{X}_i(j) - X_i(j))^2]$, we need to consider two cases.

E.1 INFLUENCE OF THE NOISE SCALE

The hyper-parameter of the most vital interest in D 2 P-FED is the scale of the noise. To understand its effect, we evaluated D 2 P-FED on INFIMNIST with three different choices of noise scale, as shown in Figure 3. It is no surprise that the higher the noise scale, the smaller the privacy budget and the lower the model accuracy. This also supports the claim in Section 7.2 that D 2 P-FED can perform relatively well in the high-privacy region, at the cost of model accuracy.
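Both Lemma 4 and the noise-scale trade-off can be probed numerically. The snippet below computes the variance of N_ℤ(0, σ²) on a truncated support (an approximation of the full lattice) and confirms that, up to floating-point rounding, it never exceeds σ², the variance of the continuous Gaussian.

```python
import math

def dgauss_variance(sigma, trunc=200):
    """Variance of the discrete Gaussian N_Z(0, sigma^2), exact up to
    truncation of the (double-exponentially small) tails."""
    support = range(-trunc, trunc + 1)
    w = [math.exp(-x * x / (2 * sigma**2)) for x in support]
    z = sum(w)
    return sum(x * x * wi for x, wi in zip(support, w)) / z

for sigma in [0.5, 1.0, 2.0, 8.0]:
    v = dgauss_variance(sigma)
    # Lemma 4: the discrete Gaussian is never more spread out than its
    # continuous counterpart. The 1e-9 slack absorbs floating-point error,
    # since the true gap can fall below machine precision for large sigma.
    assert v <= sigma**2 + 1e-9
```

For small σ the gap is substantial (the second bound of Lemma 4), which is one reason low noise scales cost little accuracy but burn privacy budget quickly.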

E.2 INFLUENCE OF HETEROGENEOUS DATA DISTRIBUTION

It is well known that data is often heterogeneously distributed among the clients of a federated learning system. To better understand D 2 P-FED's behavior in this setting, we simulated heterogeneous data distribution by partitioning the INFIMNIST data according to class and evaluated D 2 P-FED on the resulting clients. The results are shown in Figure 4: under heterogeneous data distribution, the model accuracy drops by more than 10%. This complies with previous empirical results, and there has been a line of research on addressing the issue (Yurochkin et al., 2019a;b; Wang et al., 2020). Although orthogonal to this paper, we deem how to integrate these works with D 2 P-FED an interesting open problem.
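A minimal sketch of such a class-based split (a hypothetical helper, not the exact partition used in our evaluation): sorting by label and handing each client a contiguous shard yields the pathological non-IID setting described above.

```python
def partition_by_class(labels, n_clients):
    """Give each client a contiguous shard of the label-sorted data, so
    most clients see only one or two classes (a non-IID split sketch)."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shard = len(order) // n_clients
    return [order[i * shard:(i + 1) * shard] for i in range(n_clients)]

# Toy example: 1000 samples, 10 classes, 20 clients.
labels = [i % 10 for i in range(1000)]
clients = partition_by_class(labels, 20)
# Each client ends up with at most 2 distinct classes.
assert all(len({labels[i] for i in c}) <= 2 for c in clients)
```

An IID baseline is recovered by shuffling the indices before sharding instead of sorting them.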

E.3 INFLUENCE OF GROUP SIZE q

We also ran D 2 P-FED with multiple choices of the discrete group size q, as shown in Figure 5. We observe that once the noise scale σ is fixed, the performance is relatively robust to the choice of q.
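This robustness is consistent with the tail-bound intuition: once q is moderately larger than σ, the probability that the noise wraps around the group is negligible. A rough Chernoff-style estimate (illustrative parameters; the −1 shift loosely accounts for the discrete-to-continuous step of Lemma 5):

```python
import math

def overflow_bound(sigma, q):
    """Chernoff-style upper bound on P[|N(0, sigma^2)| > q/2 - 1]; combined
    with Lemma 5 this roughly bounds the chance that the discrete noise
    wraps around a group of size q."""
    t = q / 2 - 1
    return min(1.0, 2 * math.exp(-t * t / (2 * sigma**2)))

sigma = 10.0
for bits in [6, 8, 10]:
    q = 2**bits
    print(f"q = 2^{bits:2d}: overflow prob <= {overflow_bound(sigma, q):.2e}")
```

The bound drops double-exponentially in the bit width, which is why a modest q already makes overflow effects invisible in the accuracy curves.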



CONCLUSION

In this work, we developed D 2 P-FED to achieve both differential privacy and communication efficiency in the context of federated learning. By applying the discrete Gaussian mechanism to the private data transmission, D 2 P-FED provides a stronger privacy guarantee, better composability, and a smaller communication cost than the only prior work, cpSGD, both theoretically and empirically.



(Algorithm fragment) Server: sample a subset of clients S ⊂ [n] with |S| = γn, and broadcast w_{t−1} and g_max to S. 4 Client: …

Figure 1: D 2 P-FED vs. cpSGD: (a) privacy budget vs. model accuracy; (b) communication cost vs. model accuracy.

Figure 2: Sensitivity after Quantization.

Figure 3: D 2 P-FED under different σ.

Figure 4: D 2 P-FED under homogeneous/heterogeneous distribution.

Figure 5: D 2 P-FED under different group size q.

Stochastic Quantization. To lower the communication cost, the clients stochastically quantize the values in the update vectors to some discrete domain.

D 2 P-FED consistently achieves better model accuracy under the same communication cost on both INFIMNIST and CIFAR10 (Figure 1b). The main reason is that the tight composition property allows D 2 P-FED to use a smaller per-feature communication cost while still achieving better accuracy. As a concrete instance, D 2 P-FED with a 50% compression rate can achieve better accuracy than cpSGD with a 25% compression rate; cpSGD with a 50% compression rate either leads to an unacceptable privacy budget or does not converge.

If no overflow happens, the per-coordinate error $\epsilon_{\neg o}$ is bounded due to Lemma 4. If overflow happens, we trivially have $\epsilon_o \le k^2$. Thus, we have
$$
\mathbb{E}[(\hat{X}_i(j) - X_i(j))^2] = \mathbb{P}[\neg o] \cdot \epsilon_{\neg o} + \mathbb{P}[o] \cdot \epsilon_o
\le \mathbb{P}_{X \sim N_L(\sigma^2)}[X \le q] \cdot \epsilon_{\neg o} + \mathbb{P}_{X \sim N_L(\sigma^2)}[X \ge q - k] \cdot \epsilon_o.
$$
With Lemma 5, we can bound the probabilities and obtain the final MSE result in the theorem.

