PRACTICAL LOCALLY PRIVATE FEDERATED LEARNING WITH COMMUNICATION EFFICIENCY

Anonymous

Abstract

Federated learning (FL) is a technique that trains machine learning models from decentralized data sources. We study FL under local differential privacy constraints, which provide strong protection against sensitive data disclosure by obfuscating the data before it leaves the client. We identify two major concerns in designing practical privacy-preserving FL algorithms: communication efficiency and high-dimensional compatibility. We then develop a gradient-based learning algorithm called sqSGD (selective quantized stochastic gradient descent) that addresses both concerns. The proposed algorithm is based on a novel privacy-preserving quantization scheme that uses a constant number of bits per dimension per client. We then improve the base algorithm in two ways. First, we apply a gradient subsampling strategy that simultaneously offers better training performance and smaller communication costs under a fixed privacy budget. Second, we utilize randomized rotation as a preprocessing step to reduce quantization error. We also initiate a discussion about the role of quantization and perturbation in FL algorithm design with privacy and communication constraints. Finally, the practicality of the proposed framework is demonstrated on benchmark datasets. Experimental results show that sqSGD successfully learns large models like LeNet and ResNet under local privacy constraints. In addition, at a fixed privacy and communication level, sqSGD significantly outperforms baseline algorithms.

1. INTRODUCTION

1.1. BACKGROUND

Federated learning (FL) (Kairouz et al., 2019; Konečnỳ et al., 2016) is a rapidly evolving application of distributed optimization to large-scale learning or estimation scenarios where multiple entities, called clients, collaborate in solving a machine learning problem under the coordination of a central server. Each client's raw data is stored locally and not exchanged or transferred. To achieve the learning objective, the server collects minimal information from the clients for immediate aggregation. FL is particularly suitable for mobile and edge device applications since the (sensitive) individual data never directly leave the device, and it has seen deployments in industry (? Hard et al., 2019; Leroy et al., 2019). While FL offers significant practical privacy improvements over centralizing all the training data, it lacks a formal privacy guarantee. As discussed in Melis et al. (2018), even if only model updates (i.e. gradient updates) are transmitted, it is easy to compromise the privacy of individual clients.

Differential privacy (DP) (Dwork et al., 2014) is the state-of-the-art approach to address information disclosure. Differentially private algorithms obscure the participation of any individual by injecting algorithm-specific random noise. In the FL setting, DP is suitable for protecting against external adversaries, i.e. a malicious analyst that tries to infer individual data by observing final or intermediate model results. However, DP paradigms typically assume a trusted curator, which corresponds to the server in the FL setting. This assumption is often not satisfied in practice, where users that act as clients may not trust the service provider that acts as the server.
Local differential privacy (LDP) (Kasiviswanathan et al., 2011; Dwork et al., 2014) provides privacy protection on the individual level via applying randomized mechanisms that obfuscate the data before leaving the client, and is a more natural privacy model for distributed learning scenarios like FL, allowing easier compliance with regulatory strictures (Bhowmick et al., 2018) .

1.2. METHODOLOGY

We focus on distributed learning procedures that aim to solve an empirical risk minimization problem in a decentralized fashion:

$$\min_{\theta \in \Theta} L(\theta), \qquad L(\theta) = \frac{1}{N} \sum_{m=1}^{M} L_m(\theta), \tag{1}$$

where $M$ denotes the number of clients. Let $Y_m$ denote the training data belonging to the $m$-th client, consisting of $N_m$ data points, and let $N = \sum_{m=1}^{M} N_m$. The term $L_m(\theta) = \sum_{y \in Y_m} \ell(\theta, y)$ stands for the locally aggregated loss evaluated at parameter $\theta$, where $\ell : \Theta \times \mathcal{Y} \to \mathbb{R}_+$ is the loss function depending on the problem context, and we use $\Theta$ and $\mathcal{Y}$ to denote the parameter and data domains respectively. We seek to optimize (1) via gradient-based methods. In what follows we use $[n]$, $n \in \mathbb{N}_+$, to denote the set $\{1, 2, \ldots, n\}$. At round $t$, the procedure iterates as:

1. The server distributes the current value $\theta_t$ among a subset $S$ of clients.
2. For each client $s \in S$, a local update $\Delta_s = g(\theta_t, Y_s)$ is computed and transmitted to the server.
3. The server aggregates all $\{\Delta_s\}_{s \in S}$ to obtain a global update $\Delta$ and updates the parameter as $\theta_{t+1} = \theta_t + \Delta$.

We distinguish two practical scenarios, cross-silo FL and cross-device FL (Kairouz et al., 2019). In cross-silo FL, $M$ is relatively small (e.g. $M \le 100$) and usually each client has a moderate or large amount of data (i.e. $\min_m N_m \gg 1$). As a consequence, all the clients participate in the learning process (i.e. $S = [M]$). In each iteration a client locally computes a negative stochastic gradient $g(\theta_t, Y_s) = -\frac{1}{R} \sum_{y \in R_s} \nabla \ell(\theta_t, y)$ of $L_s(\theta_t)/N_s$, based on a uniform random subsample $R_s \subset Y_s$ of size $R$. In cross-device FL, $S$ is a uniformly random subset of $[M]$, and $g(\theta_t, Y_s) = -\frac{1}{N_s} \sum_{y \in Y_s} \nabla \ell(\theta_t, y)$ is the average negative gradient over $Y_s$ evaluated at $\theta_t$. In this paper we focus on the FedSgd aggregation rule corresponding to a stochastic gradient step with learning rate $\eta$, i.e., $\Delta = \eta \sum_{s \in S} \frac{N_s}{N} g(\theta_t, Y_s)$.
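The round structure and the FedSgd aggregation rule above can be sketched in a few lines of NumPy. This is a minimal, non-private illustration under stated assumptions: the names `fedsgd_round`, `client_data`, and `grad_fn` are hypothetical, all clients participate (the cross-silo case), and `N` is taken as the total number of points held by the participating clients.

```python
import numpy as np

def fedsgd_round(theta, client_data, grad_fn, lr, batch_size=None, rng=None):
    """One FedSgd round: theta + lr * sum_s (N_s / N) * g(theta, Y_s).

    client_data: list of arrays, one per client, each of shape (N_s, ...).
    grad_fn(theta, y): gradient of the per-example loss at theta (assumed given).
    batch_size: if set, each client uses a uniform subsample of that size
                (cross-silo style); otherwise the full local average is used.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_total = sum(len(y_s) for y_s in client_data)
    delta = np.zeros_like(theta)
    for y_s in client_data:
        if batch_size is not None:
            idx = rng.choice(len(y_s), size=batch_size, replace=False)
            batch = y_s[idx]
        else:
            batch = y_s
        # Local update: the negative averaged gradient g(theta, Y_s).
        g_s = -np.mean([grad_fn(theta, y) for y in batch], axis=0)
        # Server-side weighted aggregation.
        delta += lr * (len(y_s) / n_total) * g_s
    return theta + delta
```

For the quadratic loss with gradient `theta - y`, a single full-gradient round with `lr=1.0` moves `theta` to the global data mean, which is a convenient sanity check.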
Generally speaking, there are two sources of data that need privacy protection: the parameter $\theta_t$ and the updates $\{\Delta_s\}_{s \in S}$. In this paper we develop algorithms for protecting $\{\Delta_s\}_{s \in S}$ against inference attacks or reconstruction attacks, which will be defined later. Note that the protection of the $\theta_t$'s could be done by applying standard central DP techniques as in Bhowmick et al. (2018). To begin our discussion on suitably defined privacy models, we first review the local model of privacy.

Local differential privacy (Kasiviswanathan et al., 2011). We say a randomized algorithm $A$ that maps the private data $X \in \mathcal{X}$ to some value $Z = A(X) \in \mathcal{Z}$ is $\epsilon$-locally differentially private if the induced conditional probability measure $P(\cdot \mid X = x)$ satisfies, for any $x, x' \in \mathcal{X}$ and any $Z \in \mathcal{Z}$:

$$e^{-\epsilon} \le \frac{P(Z \mid X = x)}{P(Z \mid X = x')} \le e^{\epsilon}. \tag{2}$$

In the FL setting, the private data $X$ is typically not the raw data $Y$ but the gradient of the loss function evaluated at $Y$. Hereafter we assume the private data to be a $d$-dimensional Euclidean vector, where $d$ could be very large (e.g. on the order of millions). LDP is a very strong privacy protection model that protects individual data against inference attacks, which aim at inferring the membership of arbitrary data from the data universe. According to the discussion in Bhowmick et al. (2018), allowing adversaries with such strength may be overly pessimistic: to conduct an effective inference attack the adversary must come up with "the true data" and only use the privatized output to verify his/her belief about the membership. If the prior information is reasonably constrained for the adversary, we may adopt a weaker, but still elegant, type of privacy protection paradigm that protects against reconstruction attacks. Heuristically, a reconstruction attack aims at recovering individual data with respect to a well-defined criterion, given a prior over the data domain that is not too concentrated. Formally, we adopt the definition in Bhowmick et al. (2018):

Reconstruction attack (Bhowmick et al., 2018). Let $\pi$ be a prior distribution over the data domain $\mathcal{Y}$ that encodes the adversary's prior belief. We describe the generation process of privatized user data as $Y \to X \to Z = A(X)$ for some (privacy-protecting) mechanism $A$. Additionally, let $f : \mathcal{Y} \to \mathbb{R}^k$ be the target of reconstruction (i.e., the adversary wants to evaluate the value $f(Y)$) and let $L_{\mathrm{rec}} : \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}_+$ be the reconstruction loss serving as the criterion. Then an estimator $\zeta : \mathcal{Z} \to \mathbb{R}^k$ provides an $(\alpha, p, f)$-reconstruction breach for the loss $L_{\mathrm{rec}}$ if there exists some $z \in \mathcal{Z}$ such that:

$$P\big(L_{\mathrm{rec}}(f(X), \zeta(z)) \le \alpha \mid A(X) = z\big) > p. \tag{3}$$

The existence of a reconstruction breach provides an attack that effectively breaks the privatized algorithm for some output $z$.
Hence, to provide protection against reconstruction attacks, we need the following to hold:

$$\sup_{\zeta} \sup_{z \in \mathcal{Z}} P\big(L_{\mathrm{rec}}(f(X), \zeta(z)) \le \alpha \mid A(X) = z\big) \le p. \tag{4}$$

When (4) holds, the mechanism $A$ is said to be $(\alpha, p, f)$-protected against reconstruction for the loss $L_{\mathrm{rec}}$. In (Bhowmick et al., 2018, Lemma 2.2), the authors relate LDP mechanisms to protection against reconstruction attacks, and the results for several scenarios suggest that using LDP mechanisms with large $\epsilon$ may still provide decent protection against reconstruction. As a consequence, we identify two sorts of threats considered in this paper:

Inference attacks by powerful adversaries: this is the standard setup in LDP scenarios, under which the adversary can observe the final learned model, as well as the information about potential participants except for the precise membership list. For such cases, only low-$\epsilon$ LDP mechanisms provide sufficient protection against inference attacks.

Reconstruction attacks by curious onlookers: in this relaxed scenario we assume the adversary is able to observe all model updates and individual communications in order to conduct reconstruction attacks that target the evaluation of some function $f$ over private individual data. However, the adversary's knowledge about individual data is restricted to some prior distribution $\pi_f$. For such cases, we consider mechanisms that are based upon LDP algorithms with large $\epsilon$ but provide decent protection against reconstruction.

Moreover, we assume the adversaries to be semi-honest (Lyu et al., 2020), i.e. they are curious about individual data but follow the FL protocol correctly. In the seminal work (Duchi et al., 2018), the authors developed minimax optimal LDP mean estimators under i.i.d. generative constraints. Bhowmick et al. (2018) used separate mechanisms to privatize both the direction and the scale of the individual gradients, and showed that the resulting method achieves performance comparable to non-private counterparts when using large $\epsilon$.
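To make the $\epsilon$-LDP condition (2) concrete, the following sketch checks it for binary randomized response, a textbook LDP mechanism that is used here only as an illustration and is not the mechanism proposed in this paper. The function names are hypothetical.

```python
import math

def randomized_response_probs(eps):
    """Table of P(Z = z | X = x) for binary randomized response.

    The true bit is reported with probability e^eps / (1 + e^eps)
    and flipped otherwise.
    """
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return {(x, z): (keep if x == z else 1.0 - keep)
            for x in (0, 1) for z in (0, 1)}

def max_log_ratio(probs):
    """Largest |log P(z | x) - log P(z | x')| over outputs z and inputs x, x'."""
    return max(
        abs(math.log(probs[(x, z)]) - math.log(probs[(xp, z)]))
        for z in (0, 1) for x in (0, 1) for xp in (0, 1)
    )
```

The mechanism is $\epsilon$-LDP exactly when `max_log_ratio` does not exceed `eps`; for randomized response the bound is attained with equality.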
Despite the effectiveness of the LDP approach, the communication cost is prohibitive. For models involving deep neural architectures, the number of parameters grows to the order of millions. In each round, each client transmits $O(d\delta)$ bits to the server, where $\delta$ is the minimum number of bits needed to represent a real number with the desired precision. The resulting communication cost is usually not affordable in mobile scenarios. In fact, the primary bottleneck for FL is usually communication, especially for deep neural architectures (Kairouz et al., 2019). It is thus of significant importance to reduce the communication cost to a reasonable level. We identify two main challenges in building practical privacy-preserving FL algorithms:

Communication efficiency: The algorithm should be communication efficient, measured in terms of bits transferred per client in a single round.

High-dimensional compatibility: It was shown in Duchi et al. (2018) that when estimating a $d$-dimensional vector, the local privacy constraint incurs a multiplicative penalty of $O(d/\epsilon^2)$ in terms of minimax $\ell_2$ risk. This linear dependence on the dimension of the gradients may result in overly noisy gradient estimates that destroy the learning process. Hence the algorithm should be able to handle high-dimensional models with tolerable performance degradation.

Summary of contributions

In this paper, we propose a practical solution to the locally private FL setting that addresses the above concerns, which we term selective quantized stochastic gradient descent (sqSGD). Our contributions are summarized as follows:

1. We propose a novel algorithm for locally private multivariate mean estimation. The proposed algorithm stochastically quantizes each dimension of each sample to $K$ equally spaced levels and serves as the base algorithm for our gradient-based learning framework.
2. We make several improvements to the base algorithm under the gradient-based learning scenario. Specifically, we attribute the estimation error of the base algorithm to perturbation error, caused by privacy constraints, and quantization error. We then utilize a gradient subsampling strategy to simultaneously reduce communication cost and improve private estimation utility. We also apply randomized rotation to reduce quantization error.
3. We verify the performance of sqSGD on benchmark datasets using standard neural architectures like LeNet (Lecun et al., 1998) and ResNet (He et al., 2016), and systematically study the impact of both communication and privacy constraints. Specifically, under the same privacy and communication constraints, our model outperforms baseline algorithms that do not involve quantization by a significant margin.

2. RELATED WORK

Communication-efficient FL: Since its introduction in (Konečný et al., 2016), communication efficiency has been one of the central topics of FL (Konečnỳ et al., 2016; Kairouz et al., 2019). Konečnỳ et al. (2016) described two kinds of general approaches for improving communication efficiency: structured updates like randomized masking, and sketched updates like quantization methods. The design of sqSGD could be viewed as a combination of the aforementioned approaches.

Privacy-preserving FL: ... (2018) justified using large privacy budgets in LDP settings while still providing reasonable privacy guarantees against reconstruction attacks. However, most of the previous works on privacy-preserving FL did not address communication efficiency problems.

Gradient sparsification and quantization: Gradient dropping methods, which threshold gradients based on either a fixed single threshold (Aji & Heafield, 2017) or adaptively chosen thresholds (Chen et al., 2017), are reported to greatly reduce communication overhead. DGC (Lin et al., 2018) adopted more elegant refinements like accumulation correction and factor masking to achieve even higher compression ratios. FedSel (Liu et al., 2020) performed privatized gradient sparsification by adopting randomized response techniques to select the most significant gradient dimension. Gradient quantization methods like QSGD (Alistarh et al., 2017) or TernGrad (Wen et al., 2017) are able to achieve little performance degradation on large neural architectures using fewer than 4 quantization bits. Privatized versions of quantized SGD have recently been proposed: cpSGD (Agarwal et al., 2018) adopts the central DP model, and vqSGD (Gandikota et al., 2019) allows the local DP model.

3.1. BASE MECHANISM

As a starting point, we base our communication-efficient mean estimation algorithm on the stochastic $K$-level quantization scheme (Suresh et al., 2017). We assume a uniform upper bound $U$ on the $\ell_\infty$ norm of any individual gradient over the whole course of the FL process. The quantization procedure is described as follows. Let $K \ge 2$ be the desired quantization level. We quantize the gradient to a sequence of points $-U = B_1 < B_2 < \cdots < B_{K-1} < B_K = U$, where

$$B_k = -U + \frac{2(k-1)U}{K-1}. \tag{5}$$

Under review as a conference paper at ICLR 2021

The case $K = 2$ corresponds to using only the endpoints $\{-U, U\}$ for quantization. We denote the quantization range as $\mathcal{B} = \{B_k\}_{k=1}^{K}$. For each dimension $j \in [d]$ of the individual gradient, we first locate $X_j$ in one of the $K - 1$ bins by finding a $k^*$ such that $X_j \in [B_{k^*}, B_{k^*+1})$, and then round $X_j$ to a boundary of the bin as:

$$\tilde{X}_j = \begin{cases} B_{k^*+1} & \text{with probability } \dfrac{(K-1)(X_j - B_{k^*})}{2U}, \\ B_{k^*} & \text{otherwise.} \end{cases} \tag{6}$$

Note that $\tilde{X}_j$ is an unbiased estimator of $X_j$. The quantization scheme reduces the communication cost to $d \log_2 K$ bits per round per client. Next we privatize $\tilde{X}$ by constructing a locally private unbiased estimator of $\tilde{X}$ (where we fix the randomness of the quantization step), thereby obtaining a locally private estimator of $X$ built from a vector in $\mathcal{B}^d$.

To describe the privatization scheme we introduce some new notation: for all $v_1, v_2 \in \mathcal{B}^d$, define

$$\mathcal{M}(v_1, v_2) = \#\{j : v_1^j = v_2^j\} - \#\{j : v_1^j \ne v_2^j\},$$

where $\#C$ stands for the number of elements in the set $C$. For a given integer $\kappa \in \{0, \ldots, d-1\}$, let $S_{>\kappa}(v; \mathcal{B}^d) := \{u : \mathcal{M}(u, v) > \kappa\}$ and $S_{\le\kappa}(v; \mathcal{B}^d) := \{u : \mathcal{M}(u, v) \le \kappa\}$. The privatization procedure, which we term PrivQuant$_{K,\infty}$, is summarized in Algorithm 1. To state the privacy guarantee of PrivQuant$_{K,\infty}$, we identify two setups which correspond to protection against inference and reconstruction attacks.
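The quantization step (5)-(6) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the rounding probability is taken to be the relative position of the coordinate inside its bin (the choice that makes the estimator unbiased), and the privatization step is not included.

```python
import numpy as np

def quantization_levels(U, K):
    """Equally spaced levels B_1 = -U, ..., B_K = U."""
    return np.linspace(-U, U, K)

def stochastic_quantize(x, U, K, rng):
    """Stochastically round each coordinate of x in [-U, U] to a level."""
    x = np.asarray(x, dtype=float)
    width = 2.0 * U / (K - 1)
    levels = quantization_levels(U, K)
    # 0-based index of the lower boundary of the bin containing each coordinate.
    lower = np.clip(np.floor((x + U) / width), 0, K - 2).astype(int)
    # Probability of rounding up: relative position inside the bin.
    p_up = (x - levels[lower]) / width
    round_up = rng.random(x.shape) < p_up
    return levels[lower + round_up]

def bits_per_round(d, K):
    """Communication cost d * log2(K) of sending one quantized gradient."""
    return d * np.log2(K)
```

Averaging many independent quantizations of the same vector should recover that vector up to sampling noise, which is a quick empirical check of unbiasedness.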
The latter case requires additional specifications for the priors and evaluation function held by the adversary, which we state as follows. We assume the evaluation function $f$ to be a map from $\mathcal{Y}$ to a compact set $C \subset \mathbb{R}^r$. For any absolutely continuous prior $\pi$ on $\mathcal{Y}$, we use $\pi_f$ to denote the induced prior on $f$. Let $\pi_0$ be the uniform prior over $C$; we focus on priors that belong to the following set:

$$\mathcal{P}_f(\rho_0) = \left\{ \pi : \sup_{y \in C} \log \frac{d\pi_f(y)}{d\pi_0(y)} \le \rho_0 \right\},$$

where $\rho_0$ is a non-negative number. The set $\mathcal{P}_f(\rho_0)$ characterizes prior beliefs that are "not much more certain than uniformly random guessing" over all possible outcomes that belong to the image of the evaluation function. We state the theorem under the orthogonal reconstruction attack that uses the evaluation function $f_A(x) = Ax / \|x\|_2$ with $A \in \mathbb{R}^{r \times d}$ an orthonormal matrix, i.e., $AA^T = I_r$, and the reconstruction criterion is chosen as the $\ell_2$ error $L_{\mathrm{rec}}(x, x') = \big\| x/\|x\|_2 - x'/\|x'\|_2 \big\|_2^2$.

Theorem 1. For any $K \ge 2$, consider PrivQuant$_{K,\infty}$ as a mechanism parameterized by $(\kappa, p)$. Let $\tau := \left\lceil \frac{d+\kappa+1}{2} \right\rceil$, and let $(\kappa, p)$ be chosen such that the following relation holds:

$$\frac{p}{1-p} \times \frac{\sum_{\ell=0}^{\tau-1} \binom{d}{\ell} (K-1)^{d-\ell}}{\sum_{\ell=\tau}^{d} \binom{d}{\ell} (K-1)^{d-\ell}} \le e^{\epsilon}. \tag{8}$$

Then $Z$ is an unbiased estimator of $X$, i.e. $\mathbb{E}(Z) = X$, with randomness taken jointly over the quantization step and the sampling step. Moreover, we have the following privacy guarantees:

Inference attack protection: PrivQuant$_{K,\infty}(\cdot, \kappa, p)$ is $\epsilon$-locally differentially private.

Reconstruction attack protection: with $r \ge 4$ and $a \in [0, 1]$, PrivQuant$_{K,\infty}(\cdot, \kappa, p)$ is $(\sqrt{2 - 2a}, \omega(a), f_A)$-protected against reconstruction for

$$\omega(a) = \sqrt{8} \exp\left(-\frac{(r-1)a^2}{2}\right) \exp(\epsilon + \rho_0).$$

The proof is deferred to appendix A.1. Theorem 1 implies that, if the adversary aims at reconstructing a non-vanishing fraction of the private data with $r = O(d)$, then for small $a$ that allows coarse reconstruction, we only need $\epsilon$ and $\rho_0$ to be of smaller order than $d$ to ensure that any reconstruction attack will not succeed with reasonable probability.
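Condition (8) can be evaluated exactly with integer binomial sums, which may help when exploring admissible $(\kappa, p)$ pairs for small or moderate $d$. The sketch below is an illustration with hypothetical function names; it assumes that $\tau$ is the smallest integer number of matching coordinates for which $\mathcal{M} > \kappa$, i.e. the ceiling of $(d+\kappa+1)/2$, and that $p < 1$.

```python
import math

def tau_of(d, kappa):
    """Smallest match count l with 2*l - d > kappa, i.e. ceil((d + kappa + 1) / 2)."""
    return (d + kappa + 2) // 2

def log_privacy_ratio(d, K, kappa, p):
    """Natural log of the left-hand side of (8), using exact integer sums."""
    tau = tau_of(d, kappa)
    # weights[l] counts vectors in B^d agreeing with a fixed vector on exactly l coordinates.
    weights = [math.comb(d, l) * (K - 1) ** (d - l) for l in range(d + 1)]
    low, high = sum(weights[:tau]), sum(weights[tau:])
    # math.log accepts arbitrarily large Python integers.
    return math.log(p / (1.0 - p)) + math.log(low) - math.log(high)

def largest_kappa(d, K, p, eps):
    """Largest kappa in {0, ..., d-1} for which (8) holds, or None if there is none."""
    feasible = [k for k in range(d) if log_privacy_ratio(d, K, k, p) <= eps]
    return max(feasible) if feasible else None
```

For $d = 1$ and $K = 2$ the binomial ratio equals one, so the condition reduces to $p/(1-p) \le e^{\epsilon}$, the familiar randomized response bound.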
Note that for $K = 2$ levels and $U = 1$, Algorithm 1 reduces to the algorithm PrivUnit$_\infty$ in (Bhowmick et al., 2018). We may view the estimation error of $Z$ (measured in terms of $\ell_2$ risk) as coming from two sources: one from the quantization step, and the other from the privatization step. In the following sections we take a closer look at the two sources of error and develop corresponding improvements. Indeed, as verified by our empirical study (see section 4), certain improvements are necessary to achieve high-dimensional compatibility.

Algorithm 1 Private $K$-quantized $\ell_\infty$ Ball: PrivQuant$_{K,\infty}$
Require: $X \in [-U, U]^d$, $\kappa \in \{0, \ldots, d-1\}$, $p \ge \frac{1}{2}$, $\tau = \left\lceil \frac{d+\kappa+1}{2} \right\rceil$.
1: Quantize each coordinate of $X$ to $\mathcal{B}$ using (6) to obtain $\tilde{X} \in \mathcal{B}^d$.
2: Sample a random vector $V$ as follows: with probability $p$, $V$ is sampled uniformly at random from $S_{>\kappa}(\tilde{X}; \mathcal{B}^d)$; otherwise, with probability $1 - p$, $V$ is sampled uniformly at random from $S_{\le\kappa}(\tilde{X}; \mathcal{B}^d)$.
3: Calculate the normalizing factor
$$m = \frac{p \binom{d-1}{\tau-1} (K-1)^{d-\tau}}{\sum_{\ell=\tau}^{d} \binom{d}{\ell} (K-1)^{d-\ell}} - \frac{(1-p) \binom{d-1}{\tau-1} (K-1)^{d-\tau}}{\sum_{\ell=0}^{\tau-1} \binom{d}{\ell} (K-1)^{d-\ell}}$$
return $Z = \frac{1}{m} \cdot V$

3.2. IMPROVING PRIVATIZATION UTILITY VIA SELECTIVE GRADIENT UPDATE

PrivQuant$_{K,\infty}$ provides an unbiased estimator of individual gradients, but the considerable amount of noise injected would hurt the training process. This defect is amplified in deep learning scenarios: empirically, Lin et al. (2018) observed that for many representative deep architectures, the gradients are nearly sparse, i.e. most of the gradient dimensions are close to zero. Hence privatizing the whole gradient vector would result in a privatization error that can be orders of magnitude larger than the norm of the gradient itself. The (almost) sparse structure of the gradients suggests a modification to the estimation scheme that privatizes and transmits only a fraction of the gradient dimensions, which could be regarded as significantly reducing variance at the cost of a small amount of bias.
Such techniques are closely related to gradient compression, which sends only a few of the most significant gradient dimensions (Lin et al., 2018). Typically, gradient compression techniques require selecting the top-$k$ gradients measured in absolute magnitude (Lin et al., 2018; Aji & Heafield, 2017). A straightforward extension to the locally private setting would be to perform private selection methods like the exponential mechanism (Dwork et al., 2014) or noisy top-$k$ (Ding et al., 2019), for which we need to allocate a certain fraction of the privacy budget. However, these solutions are either computationally expensive or "too noisy" for picking high-dimensional gradients that are almost sparse. We defer a more detailed discussion to appendix A.2. Motivated by algorithm designs in distributed learning like random masking (Konečnỳ et al., 2016) and Hogwild! (Niu et al., 2011), we utilize a simple strategy, gradient subsampling, that randomly samples $\tilde{d} = rd$ dimensions, where $r$ is the sampling proportion of gradient dimensions. To further improve training efficiency, we apply local accumulation techniques like momentum correction and factor masking, similar to (Lin et al., 2018). It is worth noting that a low sampling ratio does not necessarily improve model performance, as the overall bias of the sparse approximation becomes significant when only a few dimensions are selected. A more detailed study of this aspect is provided in appendix A.3.
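A minimal sketch of the gradient subsampling step is given below. It shows only the uniform random selection of coordinates and the server-side placement of the received values; the local accumulation techniques (momentum correction, factor masking) and the privatization are omitted. The function names are hypothetical, and the sketch assumes the server learns the selected indices, for example through randomness shared with the client.

```python
import numpy as np

def subsample_gradient(grad, ratio, rng):
    """Select round(ratio * d) coordinates of grad uniformly at random."""
    d = grad.shape[0]
    d_sub = max(1, int(round(ratio * d)))
    idx = np.sort(rng.choice(d, size=d_sub, replace=False))
    return idx, grad[idx]

def scatter_update(idx, values, d):
    """Server side: place the received coordinates into a length-d update."""
    full = np.zeros(d)
    full[idx] = values
    return full
```

With `ratio=0.005` and a gradient of dimension 1000, only five coordinates are selected and transmitted in a round.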

3.3. IMPROVING QUANTIZATION PERFORMANCE VIA RANDOMIZED ROTATION

As shown in (Agarwal et al., 2018; Suresh et al., 2017), for a $d$-dimensional vector $X$, the quantization error scales with $\big(\max_{j \in [d]} X_j - \min_{j \in [d]} X_j\big)^2$. To reduce quantization error, an effective way is to preprocess the data via randomized rotation, so as to make $\max_{j \in [d]} X_j - \min_{j \in [d]} X_j$ small in expectation. Note that we only need to rotate the fraction of the gradient that is selected for transmission. To perform randomized rotation we need to assume public randomness between the server and the clients, under which we generate a random orthogonal matrix $R \in \mathbb{R}^{\tilde{d} \times \tilde{d}}$, apply the linear transform $X \mapsto RX$ to the individual gradients on the client side, and apply the inverse transform $Z \mapsto R^{-1}Z$ on the server side. We adopt the method in (Agarwal et al., 2018; Suresh et al., 2017) and set $R = \frac{1}{\sqrt{\tilde{d}}} HA$, where $A$ is a random diagonal matrix with i.i.d. Rademacher entries (i.e. $P(A_{ii} = 1) = P(A_{ii} = -1) = 0.5$, $\forall i \in [\tilde{d}]$), and $H$ is a Walsh-Hadamard matrix (Horadam, 2012), defined recursively by the formula:

$$H(2^1) = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad H(2^m) = \begin{pmatrix} H(2^{m-1}) & H(2^{m-1}) \\ H(2^{m-1}) & -H(2^{m-1}) \end{pmatrix}.$$

The randomized rotation operation is performed before quantization and privatization; therefore we project the vector $X$ onto an $\ell_2$ ball of radius $U$ before rotation. This ensures that the vector obtained after rotation has $\ell_\infty$ norm bounded by $U$ as well. The final selective SGD updating process is summarized in algorithm 2.
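The rotation can be sketched with a dense Walsh-Hadamard matrix. The sketch assumes the construction of Suresh et al. (2017), in which the rotation is the Hadamard matrix times a random diagonal sign matrix, normalized so that it is orthogonal; the function names are hypothetical, and a practical implementation would use a fast Walsh-Hadamard transform rather than a dense matrix.

```python
import numpy as np

def hadamard(n):
    """Walsh-Hadamard matrix of size n (a power of two), built recursively."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_rotation(n, rng):
    """R = H A / sqrt(n), with A a diagonal matrix of i.i.d. +/-1 signs."""
    signs = rng.choice([-1.0, 1.0], size=n)
    # Multiplying by signs scales column j of H by signs[j], i.e. computes H @ diag(signs).
    return hadamard(n) * signs / np.sqrt(n)
```

Because the matrix is orthogonal, the server can undo the rotation with the transpose, and the rotation preserves the $\ell_2$ norm of the client vector.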

4. EXPERIMENTS

In this section, we apply sqSGD to simulated FL environments, constructed from the standard datasets MNIST (Lecun et al., 1998), EMNIST (Cohen et al., 2017), and Fashion MNIST (Xiao et al., 2017), which we abbreviate as FMNIST hereafter. All three datasets are of equal (training) sample size; we therefore randomly partition each dataset into 10 equally sized blocks to simulate a cross-silo FL environment consisting of a central server and 10 clients. We train on the MNIST and EMNIST datasets using the LeNet-5 architecture (Lecun et al., 1998) (119,850 parameters). We also train a larger network under the ResNet-110 architecture (1,722,224 parameters) on the FMNIST dataset. Since the main goal of the empirical study is not to achieve state-of-the-art performance, we adopt the following common strategies without further tuning across all experiments: in each round, we randomly sample a single batch of size 32 for each client and compute a single gradient update step. Without gradient subsampling, this would be equivalent to a stochastic gradient descent update over the entire data with a batch size of 320. The training duration is measured in number of epochs (1 epoch = 188 rounds). We use the same configuration ... We set the $\ell_2$ norm bound of the gradients to $U = 10$, as it works reasonably well. Note that we could use a more elegant distributed private estimation of the $\ell_2$ norm bound by sending privatized magnitudes evaluated at different points of the parameter space, i.e. via mechanisms like ScalarDP in Bhowmick et al. (2018). Finally, for a given privacy budget $\epsilon$, we use the following method to determine the value of $(\kappa_\epsilon, p_\epsilon)$: we set $\kappa_\epsilon$ to be the largest integer such that the value of the left-hand side of (8) is smaller than or equal to $e^{0.9\epsilon}$, and $p_\epsilon = \frac{e^{0.1\epsilon}}{1 + e^{0.1\epsilon}}$.

Choice of privacy budget: across all the experiments we align the choice $\epsilon = O(\sqrt{d})$ with the additional risk factor $O(d/\epsilon^2)$ in private mean estimation (in comparison to the non-private case). As the models used in our experiments are all of high dimension, such a large $\epsilon$ provides little protection against inference attacks. However, with respect to the reconstruction attack specified in section 1.2 and the protection level (4), such a choice still provides reasonable protection against reconstruction attacks if the adversary's goal is to evaluate more than $O(\sqrt{d})$ of the entries in the private data.

Impact of communication constraints

We evaluate the performance of trained models using the test set accuracy. The privacy budget is chosen as $\epsilon = 400$ for LeNet models and $\epsilon = 2000$ for ResNet models. The sampling ratio is fixed at $r = 0.5\%$. We vary the quantization level $K$ in the range $\{2, 8, 32\}$, which corresponds to using $\{1, 3, 5\}$ bits per client per dimension. The results are plotted in figure 1. The results indicate that while 1-bit sqSGD is efficient in terms of communication cost, the quantization error results in significant degradation in model performance, both in the final accuracy after model convergence and in the speed of convergence. A reasonable quantization level (i.e. larger than 3 bits) yields competitive performance.

Impact of privacy constraints

Next we investigate the performance loss due to privacy constraints. We fix the sampling rate at $r = 0.5\%$ and the quantization level at $K = 16$. For the LeNet-5 architecture, we select $\epsilon$ from the set $\{200, 300, 400\}$; for the ResNet-110 architecture, we select $\epsilon$ from the set $\{1000, 2000, 3000\}$. The results are plotted in figure 2. The results imply that with limited communication, we need high privacy budgets to ensure successful training on large models.

Baseline comparisons

We compare sqSGD against two baselines:

Piecewise mechanism (PM) (Wang et al., 2019): PM is designed for mean estimation under the local privacy model and improves accuracy upon the PrivUnit$_\infty$ algorithm in Bhowmick et al. (2018).

vqSGD (Gandikota et al., 2019): vqSGD uses a communication-optimal distributed mean estimation algorithm that adopts vector quantization strategies. The per-client communication cost of vqSGD is $O(\log d)$ for the mean estimation task. The algorithm has trivial extensions to locally private settings using randomized response strategies (Gandikota et al., 2019). As discussed by the authors, for distributed optimization settings with high-dimensional models, variance reduction techniques are necessary: specifically, one data point is communicated $b$ times and the server uses the average of the $b$ quantized gradients for aggregation. This makes the communication cost $O(b \log d)$. Note that in private settings this results in a caveat that the privacy protection level is degraded under composition. In our experiments we ignore this caveat and use variance reduction so that the communication cost of vqSGD matches that of sqSGD and PM.

We compare the three approaches across all tasks mentioned above under the same privacy budget. For sqSGD we use $K = 16$ for LeNet architectures and $K = 128$ for ResNet architectures, and we adjust the sampling ratio in sqSGD such that communication costs remain equal among all methods. The comparisons are presented in figure 3. Figure 3 shows that sqSGD consistently outperforms PM by a significant margin across all tasks under various privacy levels. vqSGD obtains performance comparable to sqSGD for relatively smaller models in the high-privacy regime (note that vqSGD offers much worse privacy protection in this setting), but breaks down when models get larger. Overall, in high-dimensional settings, sqSGD provides much better optimization performance.

Additional experiments

We present an experimental study on the effect of gradient subsampling in appendix A.3, and another study on the trade-off between quantization and subsampling in A.4. The results of additional experiments further justify the choice of sampling ratios and quantization levels in the above comparisons.

5. CONCLUSION

We studied privacy-preserving federated learning under the local differential privacy model. To achieve both communication efficiency and high-dimensional compatibility, we proposed a gradient-based learning framework, sqSGD, that is based on a novel private multivariate mean estimation scheme with controllable quantization levels. We applied gradient subsampling and randomized rotation to reduce the estimation error of the base mechanism, exploiting the specific structure of federated learning with large modern neural architectures. Finally, our algorithm was shown on standard datasets to be capable of training large models from random initializations, surpassing baseline algorithms by a significant margin.

A APPENDIX

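A.1 PROOFS

Proof of Theorem 1. Let $u \in \mathcal{B}^d$ and $U \sim \mathrm{Unif}(\mathcal{B}^d)$. The vector $V \in \mathcal{B}^d$, sampled as in Algorithm 1, has p.m.f.

$$p(v \mid u) \propto \begin{cases} p / P(\mathcal{M}(U, u) > \kappa) & \text{if } \mathcal{M}(v, u) > \kappa, \\ (1-p) / P(\mathcal{M}(U, u) \le \kappa) & \text{if } \mathcal{M}(v, u) \le \kappa. \end{cases}$$

The event that $\mathcal{M}(U, u) = \kappa$ when $\frac{d+\kappa+1}{2} \in \mathbb{Z}$ implies that $U$ and $u$ match in exactly $\frac{d+\kappa+1}{2}$ coordinates; the number of such matches is $\binom{d}{(d+\kappa+1)/2}$. Computing the binomial sum, we have

$$P(\mathcal{M}(U, u) > \kappa) = \frac{1}{K^d} \sum_{\ell = \frac{d+\kappa+1}{2}}^{d} \binom{d}{\ell} (K-1)^{d-\ell}, \qquad P(\mathcal{M}(U, u) \le \kappa) = \frac{1}{K^d} \sum_{\ell=0}^{\frac{d+\kappa+1}{2} - 1} \binom{d}{\ell} (K-1)^{d-\ell}.$$

Now we show unbiasedness by showing $\mathbb{E}[V \mid u] = m \cdot u$. We have

$$\mathbb{E}[V \mid u] = p\, \mathbb{E}[U \mid \mathcal{M}(U, u) > \kappa] + (1-p)\, \mathbb{E}[U \mid \mathcal{M}(U, u) \le \kappa].$$

By rotational symmetry, it suffices to consider the first coordinate:

$$\mathbb{E}[V_1 \mid u] = p \underbrace{\mathbb{E}[U_1 \mid \mathcal{M}(U, u) > \kappa]}_{\Upsilon_1} + (1-p) \underbrace{\mathbb{E}[U_1 \mid \mathcal{M}(U, u) \le \kappa]}_{\Upsilon_2}.$$

For $\Upsilon_1$, we have:

$$\Upsilon_1 = \frac{1}{K^d P(\mathcal{M}(U, u) > \kappa)} \times \sum_{\ell=\tau}^{d} \left( u_1 \binom{d}{\ell} (K-1)^{d-\ell} - \sum_{w \in \mathcal{B} \setminus u_1} w \binom{d}{\ell} (K-1)^{d-\ell-1} \right) = \frac{u_1}{K^d P(\mathcal{M}(U, u) > \kappa)} \times \binom{d-1}{\tau-1} (K-1)^{d-\tau},$$

where in the second equality we used the fact that $\sum_{w \in \mathcal{B}} w = 0$. Similar calculations yield:

$$\Upsilon_2 = -\frac{u_1}{K^d P(\mathcal{M}(U, u) \le \kappa)} \times \binom{d-1}{\tau-1} (K-1)^{d-\tau}.$$

Combining the two preceding displays with the decomposition of $\mathbb{E}[V_1 \mid u]$ above, we have:

$$\mathbb{E}[V_1 \mid u] = \left( \frac{p \binom{d-1}{\tau-1} (K-1)^{d-\tau}}{\sum_{\ell=\tau}^{d} \binom{d}{\ell} (K-1)^{d-\ell}} - \frac{(1-p) \binom{d-1}{\tau-1} (K-1)^{d-\tau}}{\sum_{\ell=0}^{\tau-1} \binom{d}{\ell} (K-1)^{d-\ell}} \right) \times u_1 = m u_1.$$

Next we show the privacy guarantees.

LDP guarantee: As $P(\mathcal{M}(U, u) > \kappa)$ is decreasing in $\kappa$, for any $u, u' \in \mathcal{B}^d$ and $v \in \mathcal{B}^d$ we have

$$\frac{p(v \mid u)}{p(v \mid u')} \le \frac{p}{1-p} \cdot \frac{P(\mathcal{M}(U, u') \le \kappa)}{P(\mathcal{M}(U, u) > \kappa)} = \frac{p}{1-p} \times \frac{\sum_{\ell=0}^{\tau-1} \binom{d}{\ell} (K-1)^{d-\ell}}{\sum_{\ell=\tau}^{d} \binom{d}{\ell} (K-1)^{d-\ell}}.$$

The result follows by relation (8).

Reconstruction protection guarantee: the result follows by Bhowmick et al. (2018, Proposition 1) and the LDP guarantee.

Performing top-$k$ selection with local privacy constraints could be done via two approaches. The first is to iteratively run the exponential mechanism (Dwork et al., 2014) $k$ times, each time selecting a single index with Gumbel noise. The bottleneck of this approach is that it requires sampling from a high-dimensional distribution $k$ times, which is computationally heavy.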
Note that the privatization step is carried out on the client-side device, which is usually assumed to have limited computational power in FL settings (Kairouz et al., 2019); the iterative exponential mechanism is therefore not practical for FL scenarios. The second approach is the noisy top-k algorithm (Ding et al., 2019), which generalizes the report noisy max mechanism in Dwork et al. (2014). The algorithm requires adding Laplace noise of scale $2Uk/\epsilon$ to each dimension of the gradient vector. In practice, k is typically chosen on the order of hundreds. Since most gradient entries are very small in magnitude, keeping the noise at a reasonable level requires allocating a large budget to the selection step, which would significantly affect the overall privacy level.
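To make the scale argument concrete, the following is a minimal sketch of noisy top-k selection on a synthetic gradient, not the implementation of Ding et al. (2019). It assumes the noise is added to the coordinate magnitudes, and the helper names (`laplace`, `noisy_top_k`, `overlap`) as well as the toy values of `d`, `k`, `U` and the Gaussian gradient are hypothetical choices made only for illustration.

```python
import random

def laplace(rng, scale):
    # The difference of two i.i.d. Exp(1) draws is a standard Laplace variable.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def top_k_indices(scores, k):
    # Indices of the k largest scores.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def noisy_top_k(grad, k, eps, U, rng):
    # Noise scale 2Uk/eps as quoted in the text; adding it to magnitudes
    # (rather than signed entries) is an assumption of this sketch.
    scale = 2.0 * U * k / eps
    noisy = [abs(g) + laplace(rng, scale) for g in grad]
    return top_k_indices(noisy, k)

def overlap(grad, k, eps, U, seed=0):
    # Fraction of the exact top-k indices recovered by the noisy selection.
    rng = random.Random(seed)
    exact = set(top_k_indices([abs(g) for g in grad], k))
    return len(exact & set(noisy_top_k(grad, k, eps, U, rng))) / k

# Hypothetical toy gradient: mostly tiny entries, clipped to [-U, U].
data_rng = random.Random(1)
U, d, k = 1.0, 2000, 20
grad = [max(-U, min(U, data_rng.gauss(0.0, 0.01))) for _ in range(d)]

for eps in (1.0, 1e3, 1e6):
    print(f"eps={eps:g}  noise scale={2.0 * U * k / eps:g}  "
          f"overlap={overlap(grad, k, eps, U):.2f}")
```

In this toy setting the entries have magnitude around $10^{-2}$, so the selection only tracks the true top-k once the noise scale $2Uk/\epsilon$ falls well below that level, which is the budget problem described above.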

A.3 EXPERIMENTS ON THE EFFECT OF GRADIENT SUBSAMPLING

In this experiment, we investigate the effect of subsampling under the setup of training a ResNet110 model on the FMNIST dataset. We fix the privacy level at $\epsilon = 2000$ and the quantization level at $K = 16$. We vary the subsampling ratio over the set $\{1, 5, 10, 50\} \times 10^{-3}$. The results are plotted in figure 4. The plot shows that using a high sampling ratio severely hurts the training performance while also incurring more communication. It is thus necessary to perform subsampling. However, using a very low subsampling ratio also causes training failure. This phenomenon will be further explored in the next experiment.

A.4 EXPERIMENTS ON THE TRADE-OFF BETWEEN QUANTIZATION AND SAMPLING

In this experiment, we study the trade-off between quantization and sampling under the setup of training a LeNet-5 model on the MNIST dataset. We fix the privacy level at $\epsilon = 400$, and fix the total communication cost per client, measured using the product $r \log_2 K$. We vary the quantization level $K$ over $\{2, 8, 32, 128\}$, which corresponds to using $\{1, 3, 5, 7\}$ bits per client per dimension, and the sampling rate is adjusted accordingly. The results are shown in figure 5. The results suggest that increasing the quantization level may not monotonically increase training performance. This is mainly due to the random subsampling scheme of sqSGD, under which the structure of gradients is not fully explored.
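The coupling between the two knobs can be illustrated with a short calculation. The sketch below holds the product $r \log_2 K$ fixed and solves for the sampling rate at each quantization level; the budget value `0.07` and the function name `sampling_rate` are hypothetical, chosen only so that the numbers are easy to read, and are not taken from the experiment.

```python
import math

def sampling_rate(budget, K):
    # Fixed per-client budget: budget = r * log2(K), hence r = budget / log2(K).
    return budget / math.log2(K)

budget = 0.07  # hypothetical value of r * log2(K), not from the paper
for K in (2, 8, 32, 128):
    bits = math.log2(K)           # bits per sampled coordinate
    r = sampling_rate(budget, K)  # sampling rate that keeps the budget fixed
    print(f"K={K:<4d} bits={bits:.0f}  r={r:.4f}  r*log2(K)={r * bits:.2f}")
```

Under a fixed budget, moving from $K = 2$ to $K = 128$ shrinks the fraction of sampled coordinates by a factor of seven, so finer quantization is paid for with fewer transmitted coordinates.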



In general, there are two sources of communication cost in a distributed learning scenario that requires sequential interactions between the server and the clients: downlink cost, incurred when clients download the updated model from the central server, and uplink cost, incurred when clients send their updates to the server. It was previously noted in Konečnỳ et al. (2016) that uplink cost usually dominates downlink cost in FL settings. Hence, in the scope of this work, communication cost refers to uplink communication cost.



As optimization via stochastic gradient methods remains the most representative task in federated learning, most privacy-preserving FL approaches are based on private SGD type algorithms like Abadi et al. (2016). McMahan et al. (2018) used a slightly tweaked central DP model to train large language models. Wei et al. (2020) derived federated aggregation schemes under the central DP model and discussed convergence properties. Bhowmick et al. (

Figure 1: Study on impact of communication constraints

Figure 3: Comparison of sqSGD against piecewise mechanism

Figure 4: Study on the effect of gradient subsampling

The randomized rotation and sparsification strategies have no effect on the privacy level, hence the local update within a single round is $\epsilon$-LDP. At the same time, the communication cost is further reduced from $O(d \log_2 K)$ bits to $O(\tilde{d} \log_2 K)$ bits per client.

Algorithm 2 sqSGD: Selective Quantized SGD with local privacy guarantee
Require: Training data $\{Y_i\}_{i=1}^{M}$, gradient norm bound $U$, privacy budget $\epsilon$, sampling ratio $r \in (0, 1)$, local correction factors $\alpha, \beta$, learning rate $\eta$, batch size $B$.
1: Find a $(\kappa, p)$ pair that satisfies relation (8)
2: Calculate the corresponding normalizing constant $m$ as in algorithm 1
3: Set $\tilde{d} = 2^{\log_2(rd)}$, and use public randomness to generate a random matrix $R \in \mathbb{R}^{\tilde{d} \times \tilde{d}}$ according to the construction in (10)
4: Initialize model parameter $\theta_0$ and local residuals $\mathrm{res}_{i,0} = 0$, $\forall i = 1, \dots, N$
5: for $t = 0, \dots, T-1$ do
     Select a uniformly random subset of $[d]$ with cardinality $\tilde{d}$, denoted as $D_s$,

