VOTING-BASED APPROACHES FOR DIFFERENTIALLY PRIVATE FEDERATED LEARNING

Abstract

While federated learning (FL) enables distributed agents to collaboratively train a centralized model without sharing data with each other, it fails to protect users against inference attacks that mine private information from the centralized model. Thus, equipping federated learning methods with differential privacy (DPFL) becomes attractive. Existing algorithms based on privately aggregating clipped gradients require many rounds of communication, may fail to converge, and cannot scale up to large-capacity models because of the explicit dimension dependence of their added noise. In this paper, we adopt the knowledge transfer model of private learning pioneered by Papernot et al. (2017; 2018) and extend their algorithm PATE, as well as the recent alternative Private-kNN (Zhu et al., 2020), to the federated learning setting. The key difference is that our method privately aggregates the labels from the agents in a voting scheme, instead of aggregating the gradients, hence avoiding the dimension dependence and achieving significant savings in communication cost. Theoretically, we show that when the margins of the voting scores are large, the agents enjoy exponentially higher accuracy and stronger (data-dependent) differential privacy guarantees at both the agent level and the instance level. Extensive experiments show that our approach significantly improves the privacy-utility trade-off over the current state-of-the-art in DPFL.

1. INTRODUCTION

With increasing ethical and legal concerns about leveraging private data, federated learning (FL) (McMahan et al., 2017) has emerged as a paradigm that allows agents to collaboratively train a centralized model without sharing local data. In this work, we consider two typical settings of federated learning: (1) local agents are large in number, e.g., learning user behavior over many mobile devices (Hard et al., 2018); (2) local agents are few in number but each holds sufficient instances, e.g., learning a health-related model across multiple hospitals without sharing patients' data (Huang et al., 2019).

When implemented using secure multi-party computation (SMC) (Bonawitz et al., 2017), federated learning eliminates the need for any agent to share its local data. However, it does not protect the agents or their users from inference attacks that combine the learned model with side information. Extensive studies have established that these attacks could lead to blatant reconstruction of the proprietary datasets (Dinur & Nissim, 2003) and identification of individuals (a legal liability for the participating agents) (Shokri et al., 2017).

Motivated by this challenge, there have been a number of recent efforts (Truex et al., 2019b; Geyer et al., 2017; McMahan et al., 2018) in developing federated learning methods with differential privacy (DP), a well-established definition of privacy that provably prevents such attacks. Among these efforts, DP-FedAvg (Geyer et al., 2017; McMahan et al., 2018) extends the NoisySGD method (Song et al., 2013; Abadi et al., 2016) to the federated learning setting by adding Gaussian noise to the clipped accumulated gradient. The recent state-of-the-art DP-FedSGD (Truex et al., 2019b) follows the same framework but uses per-sample gradient clipping.
A notable limitation of these gradient-based methods is that they require clipping the magnitude of gradients to τ and adding noise proportional to τ to every coordinate of the shared global model with d parameters. The clipping and perturbation steps introduce either large bias (when τ is small) or large variance (when τ is large), which interferes with the convergence of SGD and makes it hard to scale up to large-capacity models. In Sec. 3, we concretely demonstrate these limitations with examples and theory. In particular, we show that FedAvg may fail to decrease the loss function when combined with gradient clipping, and that DP-FedAvg requires many outer-loop iterations (i.e., many rounds of communication to synchronize model parameters) to converge under differential privacy.

To avoid gradient clipping, we propose to conduct the aggregation over the label space, which has been shown to be an effective approach in standard (non-federated) learning settings, i.e., voting-based model-agnostic approaches (Papernot et al., 2017; 2018; Zhu et al., 2020). To achieve this, we relax the traditional federated learning setting to allow unlabeled public data at the server side. We also consider a more complete scenario for federated learning, in which there is either a large number of local agents or a limited number of local agents. The agent-level privacy introduced in DP-FedAvg works seamlessly with our setting when there are many agents. However, when there are few agents, hiding all the data belonging to one specific agent becomes burdensome or unnecessary. To this end, we provide a more complete privacy notion, covering both the agent level and the instance level. Under each setting, we theoretically and empirically show that the proposed label aggregation method effectively removes the sensitivity issue caused by gradient clipping or noise addition, and achieves a favorable privacy-utility trade-off compared to other DPFL algorithms. Our contributions are summarized as follows:

1. We propose two voting-based DPFL algorithms via label aggregation (PATE-FL and Private-kNN-FL) and demonstrate their clear advantages over gradient-aggregation-based DPFL methods (e.g., DP-FedAvg) in terms of communication cost and scalability to high-capacity models.
2. We provide provable differential privacy guarantees under two levels of granularity: agent-level DP and instance-level DP. Each is natural in a particular regime of FL, depending on the number of agents and the size of their data.
3. Extensive evaluation demonstrates that our method improves the privacy-utility trade-off over randomized gradient-based approaches in both the agent-level and instance-level cases.

A remark on our novelty. Though PATE-FL and Private-kNN-FL are algorithmically similar to the original PATE (Papernot et al., 2018) and Private-kNN (Zhu et al., 2020), they are not the same, and we are adapting them to a new problem: federated learning. The adaptation itself is nontrivial and requires substantial technical innovations. We highlight three challenges below.

• Several key DP techniques that contributed to the success of PATE and Private-kNN in the standard setting are no longer applicable (e.g., privacy amplification by sampling and noisy screening). This is partly because, in standard private learning, the attacker only sees the final models, whereas in FL the attacker can eavesdrop on all network traffic.
• Moreover, PATE and Private-kNN only provide instance-level DP. We show that PATE-FL and Private-kNN-FL also satisfy the stronger agent-level DP. PATE-FL's agent-level DP parameter is, surprisingly, a factor of 2 better than its instance-level DP parameter. Private-kNN-FL in addition enjoys a factor of k amplification for the instance-level DP.
• A key challenge of FL is the data heterogeneity of individual agents, while PATE randomly splits the dataset so that each teacher is identically distributed. The heterogeneity does not affect our privacy analysis but does make it unclear whether PATE would work. We are the first to report strong empirical evidence that PATE-style DP algorithms remain highly effective in the non-i.i.d. case.

2. PRELIMINARY

In this section, we start by introducing the notation of federated learning and differential privacy. Then, two randomized gradient-based baselines, DP-FedAvg and DP-FedSGD, are introduced as the DPFL background.

FedAvg (McMahan et al., 2017) is a vanilla federated learning algorithm that we consider as a non-DP baseline. In this algorithm, a fraction of agents is sampled at each communication round with probability q. Each selected agent downloads the shared global model and improves it by learning from local data using E iterations of stochastic gradient descent (SGD). We denote this local update process as an inner loop. Only the gradient is sent to the server, where it is averaged with the gradients of the other selected agents to improve the global model. The global model is learned after T communication rounds, where each communication round is denoted as one outer loop.

The definition of differential privacy applies at different levels of granularity, depending on how adjacent datasets are defined. If we are to protect whether one agent participates in training, the neighboring datasets are defined by adding or removing the entire local data of that agent. This is known as agent-level (user-level) differential privacy, which has been investigated in DP-FedAvg (Geyer et al., 2017; McMahan et al., 2018). Compared to FedAvg, DP-FedAvg (Figure 1) clips each agent's model gradient to a threshold S and adds noise to the scaled gradient before it is averaged at the server. Note that this DP notion is favored when data samples within one agent reveal the same sensitive information, e.g., cell phone agents send the same message. However, when there are only a few agents, hiding the entire dataset of one agent becomes difficult and inappropriate. We then consider instance-level DP, where adjacent datasets are defined as differing in one single training example.
This definition is consistent with the standard non-federated learning differential privacy ( that eavesdrop what sent out by each agent. In our experiment, we assume that the aggregation is conducted by SMC for all privacy-preserving algorithms that we consider.
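The clip-then-noise aggregation that this section attributes to DP-FedAvg can be sketched as follows. This is a minimal illustration under stated assumptions, not the DP-FedAvg implementation: the function names `clip_update` and `noisy_average`, the use of plain Python lists for updates, and the choice to add Gaussian noise with standard deviation `sigma * S` to each coordinate of the summed update are all assumptions made for the example.

```python
import math
import random


def clip_update(update, S):
    """Scale `update` so that its L2 norm is at most S (unchanged if already within S)."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, S / norm) if norm > 0 else 1.0
    return [u * scale for u in update]


def noisy_average(updates, S, sigma, rng):
    """Clip every agent's update to norm S, sum, add Gaussian noise, then average.

    Assumption of this sketch: noise with standard deviation sigma * S is
    added to every coordinate of the summed update.
    """
    clipped = [clip_update(u, S) for u in updates]
    dim = len(updates[0])
    total = [sum(c[j] for c in clipped) for j in range(dim)]
    noisy = [t + rng.gauss(0.0, sigma * S) for t in total]
    return [x / len(updates) for x in noisy]


rng = random.Random(0)
updates = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
clipped = [clip_update(u, S=1.0) for u in updates]
# sigma=0 switches the noise off so that the effect of clipping is visible.
avg = noisy_average(updates, S=1.0, sigma=0.0, rng=rng)
```

Only the first update exceeds the threshold and is rescaled; the second passes through unchanged. The number of noisy coordinates equals the model dimension, which is the dimension dependence that the paper argues against.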

3. CHALLENGES FOR GRADIENT-BASED FEDERATED LEARNING

In this section, we highlight the main challenges of the conventional DPFL frameworks in terms of accuracy, convergence and communication cost. For other challenges, we refer the readers to a survey (Kairouz et al., 2019). The details of DP-FedAvg are summarized in the algorithm section of the appendix.

Here, we draw connections to DP-FedAvg's convergence rate and demonstrate that using many outer-loop iterations (T) could have a similar convergence issue under differential privacy. When E = 1 in the local update (inner loop), the FedAvg algorithm is equivalent to SGD with distributed data, which requires many rounds of communication. The appeal of FedAvg is to set E to be larger so that each agent performs E iterations to update its own parameters before synchronizing the parameters to the global model, hence reducing the number of communication rounds. However, setting E > 1 may not improve convergence at all.

Now, we take a closer look at the effect of increasing E in the case of piecewise linear functions. Let η be the learning rate for individual agents. In the convergence section of the appendix, we establish that the effect of increasing E is essentially that of increasing the learning rate for a large family of optimization problems with piecewise linear objective functions. It is known that for the family of G-Lipschitz functions supported on a B-bounded domain, any Krylov-space method has a rate of convergence that is lower bounded by $\Omega(BG/\sqrt{T})$ (Nesterov, 2003, Section 3.2.1). This indicates that the variant of FedAvg that aggregates only the loss function part of the gradient, or projects only when synchronizing, requires $\Omega(1/\alpha^2)$ rounds of outer loop (i.e., communication) in order to converge to an α-stationary point; that is, increasing E does not help, even if no noise is added.
This also says that DP-FedAvg is essentially the same as the stochastic subgradient method in almost all locations of a piecewise linear objective function, with gradient noise $\mathcal{N}(0, \frac{\sigma^2}{N} I_d)$. The additional noise in DP-FedAvg imposes more challenges on the convergence. If we plan to run T rounds and achieve (ε, δ)-DP, the noise scale σ must be chosen accordingly (see, e.g., McMahan et al., 2018, Theorem 1), which results in a convergence rate upper bound of

$$\frac{GB\sqrt{1 + \frac{2Td\log(1.25/\delta)}{N^2\epsilon^2}}}{\sqrt{T}} = O\left(\frac{GB}{\sqrt{T}} + \frac{GB\sqrt{d\log(1.25/\delta)}}{N\epsilon}\right)$$

for an optimal choice of the learning rate Eη.

The above bound is tight for stochastic subgradient methods, and in fact also information-theoretically optimal. The $GB/\sqrt{T}$ part of the upper bound matches the information-theoretic lower bound for all methods that have access to T calls of a stochastic subgradient oracle (Agarwal et al., 2009, Theorem 1), while the second part matches the information-theoretic lower bound for all (ε, δ)-differentially private methods on the agent level (Bassily et al., 2014, Theorem 5.3). That is, the first term indicates that there must be many rounds of communication, while the second term says that the dependence on the ambient dimension d is unavoidable for DP-FedAvg. Clearly, our method also has such a dependence in the worst case, but it is easier for our approach to adapt to the structure that exists in the data (i.e., high consensus among the votes), as we will illustrate later. In contrast, the dimension has a larger impact on DP-FedAvg, since it needs to explicitly add noise with variance Ω(d). Another observation is that when N is small, no DP method with reasonable ε, δ parameters is able to achieve high accuracy. This partially motivates us to consider the other regime, which deals with instance-level DP.
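To make the two regimes of the convergence bound concrete, the sketch below evaluates it for a growing number of rounds T. It assumes the bound has the form GB·sqrt(1 + 2Td·log(1.25/δ)/(N²ε²))/sqrt(T); the function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def privacy_floor(G, B, d, N, eps, delta):
    """Dimension-dependent term GB * sqrt(2 d log(1.25/delta)) / (N eps)."""
    return G * B * math.sqrt(2.0 * d * math.log(1.25 / delta)) / (N * eps)


def rate_bound(G, B, T, d, N, eps, delta):
    """Upper bound GB * sqrt(1 + 2 T d log(1.25/delta) / (N^2 eps^2)) / sqrt(T)."""
    x = 2.0 * T * d * math.log(1.25 / delta) / (N ** 2 * eps ** 2)
    return G * B * math.sqrt(1.0 + x) / math.sqrt(T)


def two_term_bound(G, B, T, d, N, eps, delta):
    """GB / sqrt(T) plus the privacy floor, using sqrt(1 + x) <= 1 + sqrt(x)."""
    return G * B / math.sqrt(T) + privacy_floor(G, B, d, N, eps, delta)


# Hypothetical problem sizes: 10^6 parameters, 1000 agents, (1, 1e-3)-DP.
params = dict(G=1.0, B=1.0, d=10**6, N=1000, eps=1.0, delta=1e-3)
floor = privacy_floor(**params)
bounds = {T: rate_bound(T=T, **params) for T in (10, 1000, 100000)}
```

As T grows, `bounds[T]` decreases toward `floor` but never drops below it: more communication rounds remove the optimization term only, while the term that grows with the dimension d remains.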

3.3. OTHER CHALLENGES

Expensive Communication Cost: Up-stream communication cost (Konečnỳ et al., 2016), i.e., the total updates transmitted from local agents to the server, is another key concern in FL. For FedAvg, our convergence analysis suggests that increasing E does not speed up the convergence, so a high communication cost is expected until the model converges. CpSGD (Agarwal et al., 2018) is another DPFL method, which aims at reducing the communication cost by gradient quantization with binomial noise. However, sampling from a binomial distribution can be difficult on devices, which prevents it from being practical in real-world scenarios.

Network Complexity: DP-FedAvg requires clipping the gradient magnitude to τ and perturbing every coordinate of the parameters, which is hard to scale up to large models, as the noise level increases in proportion to the network capacity.

4. ALGORITHM

We assume there are unlabeled data drawn from $D_G$ at the server, which are public and accessible to any agent. The goal is to design an (ε, δ)-DP algorithm (either at the agent level or at the instance level) that outputs pseudo-labels for a subset of the server's unlabeled data. Then a global model is trained in a semi-supervised way, using pseudo-labeled and unlabeled data.

PATE-FL

In PATE-FL (Algorithm 1), each agent i trains a local "teacher" model $f_i$ using its own private local data. For each "student" query $x_t$, every agent adds Gaussian noise to its prediction (i.e., a C-dimensional histogram), the noisy predictions are aggregated via SMC, and the label with the most votes is returned to the server as the "pseudo-label" of $x_t$. Similar to the original PATE, the idea behind the privacy guarantee is that adding or removing any instance can change at most one agent's prediction. The same argument also naturally applies to adding or removing one agent. In fact, we gain a factor of 2 in the stronger agent-level DP due to a smaller sensitivity (see the proof for details). Another important difference is that in the original PATE, the teachers are trained on random splits of the data, while in our case, the agents naturally come with different distributions. We propose to optionally use domain adaptation techniques to mitigate these differences when training the "teachers".

Private-kNN-FL

Next we present how the teachers $f_i$ are constructed in the Private-kNN-FL method (see Algorithm 2). Each agent has a data-independent feature extractor φ. For every unlabeled query $x_t$, agent i finds the $k_i$ nearest neighbors of $x_t$ in its local data by measuring their Euclidean distance in the feature space $\mathbb{R}^{d_\phi}$, and $f_i(x_t)$ outputs the frequency vector of the votes of these nearest neighbors. Subsequently, the $f_i(x_t)$ from all agents are privately aggregated, and the argmax of the noisy voting scores is returned to the server. Different from the original Private-kNN (Zhu et al., 2020), we apply kNN on each agent's local data instead of the entire private dataset.
This distinction allows us to receive up to kN neighbors while bounding the contribution of individual agents by k. Compared to PATE-FL, this approach enjoys a stronger instance-level DP guarantee, since the sensitivity from adding or removing one instance is a factor of k/2 smaller than that of the agent level.

Algorithm 1 PATE-FL
Input: Noise σ, global data $D_G$, Q queries
1: for i in N clients do
2:   Train local model $f_i$ using $D_i$
3: for t = 0, 1, ..., Q, pick $x_t \in D_G$ do
4:   for each agent i in 1, ..., N do
5:     $\tilde{f}_i(x_t) = f_i(x_t) + \mathcal{N}(0, \frac{\sigma^2}{N} I_C)$

4.1. PRIVACY ANALYSIS

We provide our privacy analysis based on Rényi differential privacy (RDP) (Mironov, 2017). RDP inherits and generalizes the information-theoretical properties of DP, and has been used for privacy analysis in DP-FedAvg. We defer the background on RDP, its connection to DP, and all proofs of our technical results to the RDP section of the appendix.

Theorem 3 (Privacy guarantee). Let PATE-FL and Private-kNN-FL answer Q queries with noise scale σ. For agent-level protection, both algorithms guarantee $(\alpha, Q\alpha/(2\sigma^2))$-RDP for all α ≥ 1. For instance-level protection, PATE-FL and Private-kNN-FL obey $(\alpha, Q\alpha/\sigma^2)$-RDP and $(\alpha, Q\alpha/(k\sigma^2))$-RDP, respectively.

This theorem says that both algorithms achieve agent-level and instance-level differential privacy. With the same noise injected into each agent's output, Private-kNN-FL enjoys a stronger instance-level DP guarantee (by a factor of k/2) compared to its agent-level guarantee, while PATE-FL's instance-level DP guarantee is weaker by a factor of 2.

Improved accuracy and privacy with large margin: Let $f_1, ..., f_N : \mathcal{X} \to \Delta^{C-1}$, where $\Delta^{C-1}$ denotes the probability simplex (the soft-label space). Note that both algorithms we propose can be viewed as voting among these local agents, each of which outputs a probability distribution in $\Delta^{C-1}$. First, let us define the margin parameter γ(x), which measures the difference between the largest and second-largest coordinates of $\frac{1}{N}\sum_{i=1}^{N} f_i(x)$.

Lemma 4. Conditioning on the teachers, for each public data point x, if the noise added to each coordinate of $\frac{1}{N}\sum_{i=1}^{N} f_i(x)$ is drawn from $\mathcal{N}(0, \sigma^2/N^2)$, then with probability at least $1 - C\exp\{-N^2\gamma(x)^2/(8\sigma^2)\}$, the privately released label matches the majority vote without noise.

The proof (in the Appendix) is a straightforward application of Gaussian tail bounds and a union bound over the C coordinates. This lemma implies that for every public data point x such that $\gamma(x) \ge \frac{2\sigma\sqrt{2\log(C/\delta)}}{N}$, the output label matches the noiseless majority vote with probability at least 1 - δ.
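The quantities in Theorem 3 and Lemma 4 are easy to evaluate numerically. The sketch below assumes the standard RDP-to-DP conversion ε = ε_RDP(α) + log(1/δ)/(α - 1) (Mironov, 2017), minimized over α in closed form for the agent-level guarantee, and obtains the margin threshold by solving C·exp(-N²γ²/(8σ²)) ≤ δ for γ. Function names and numeric values are hypothetical.

```python
import math


def agent_level_epsilon(Q, sigma, delta):
    """(eps, delta)-DP implied by (alpha, Q*alpha/(2 sigma^2))-RDP.

    Minimises c*alpha + log(1/delta)/(alpha - 1) over alpha > 1,
    where c = Q / (2 sigma^2); the minimiser is alpha = 1 + sqrt(log(1/delta)/c).
    """
    c = Q / (2.0 * sigma ** 2)
    log_term = math.log(1.0 / delta)
    alpha = 1.0 + math.sqrt(log_term / c)
    return c * alpha + log_term / (alpha - 1.0)


def failure_bound(N, gamma, sigma, C):
    """Lemma 4 bound C * exp(-N^2 gamma^2 / (8 sigma^2)) on a label mismatch."""
    return C * math.exp(-(N ** 2) * gamma ** 2 / (8.0 * sigma ** 2))


def margin_threshold(N, sigma, C, delta):
    """Smallest margin gamma for which the Lemma 4 bound is at most delta."""
    return 2.0 * sigma * math.sqrt(2.0 * math.log(C / delta)) / N


# Hypothetical setting: 200 agents, 10 classes, 100 queries, noise scale 40.
eps = agent_level_epsilon(Q=100, sigma=40.0, delta=1e-3)
gamma_min = margin_threshold(N=200, sigma=40.0, C=10, delta=1e-3)
check = failure_bound(N=200, gamma=gamma_min, sigma=40.0, C=10)
```

By construction, `check` equals the target failure probability up to rounding, and `eps` equals c + 2·sqrt(c·log(1/δ)) with c = Q/(2σ²), which makes the trade-off between the number of queries and the noise scale explicit.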
Next we show that for those data points x such that γ(x) is large, the privacy loss for releasing $\arg\max_j [\frac{1}{N}\sum_{i=1}^{N} f_i(x)]_j$ is exponentially smaller.

Theorem 5. For each public data point x, the mechanism that releases $\arg\max_j [\frac{1}{N}\sum_{i=1}^{N} f_i(x) + \mathcal{N}(0, (\sigma^2/N^2) I_C)]_j$ obeys (α, ε)-data-dependent-RDP, where

$$\epsilon \le C e^{-\frac{N^2\gamma(x)^2}{8\sigma^2}} + \frac{1}{\alpha - 1}\log\left(1 + e^{\frac{(2\alpha-1)s^2}{2\sigma^2} - \frac{N^2\gamma(x)^2}{16\sigma^2} + \log C}\right),$$

where s = 1 for PATE-FL, and s = 1/k for Private-kNN-FL.

This bound implies that when the margin of the voting scores is large, the agents enjoy exponentially stronger (data-dependent) differential privacy guarantees at both the agent level and the instance level.

) for instance-level DP. Five independent rounds of experiments are conducted to report the mean accuracy and its standard deviation. We use both labeled and unlabeled data on the Digit datasets but only labeled data for all other datasets. We defer the experimental details to the appendix.

Digit Datasets Evaluation: MNIST, SVHN and USPS, together referred to as the Digit datasets, provide a controlled setting that mimics the real case, where the agent-to-server or agent-to-agent distributions can differ. We simulate 140 agents using SVHN with 3000 records each and 60 agents using MNIST with 1000 records each. USPS serves as the unlabeled public data, where 3000 records can be accessed by the local agents and the remaining records are used for testing.

5.1. EVALUATION ON AGENT-LEVEL DP

In Table 1, our methods PATE-FL and PATE-FL+DA are compared to private and non-private baselines. PATE-FL+DA is based on PATE-FL, where each agent model is trained with a domain adaptation (DA) technique (Ganin et al., 2016). FedAvg+DA is the variant of FedAvg with the same DA technique. We observe: (1) When the privacy costs of DP-FedAvg and PATE-FL are close, our method significantly improves the accuracy from 76.3% to 83.8%. (2) The further improved accuracy of 92.5% for PATE-FL+DA demonstrates that our framework can orthogonally benefit from DA techniques, whereas the benefit remains highly uncertain for the gradient-based methods. (3) Both FedAvg and DP-FedAvg perform better than their DA variants. The possible reason might be that FL with domain adaptation is more closely

1, our method achieves consistently better performance than DP-FedAvg. Moreover, we plot the privacy-accuracy trade-off in Figure 2. For every fixed privacy budget on the x-axis, we do a grid search over all hyperparameters (e.g., the number of queries and the noise scale for PATE-FL, and the number of communication rounds and the noise scale for DP-FedAvg). In the figure, the accuracy of PATE-FL is consistently higher than that of DP-FedAvg.

5.2. EVALUATION ON INSTANCE-LEVEL DP

When agents are few, preserving privacy across agents becomes hard and not meaningful. We then focus on preserving each instance's privacy, a.k.a. instance-level DP. FedAvg is the non-private baseline.

Office-Caltech Evaluation: Office-Caltech consists of data from four domains: Caltech (C), Amazon (A), Webcam (W) and DSLR (D). We pick one domain as the server each time, and the remaining ones are used for the local agents (e.g., in A, C, D → W, Webcam is treated as the server). We split off 70% of the data from the server domain as publicly available unlabeled data, while the remaining 30% is used for testing. For Private-kNN-FL, we instantiate the public feature extractor using the network backbone without the classifier layer. Both AlexNet and ResNet50 are ImageNet pre-trained. We set σ = 15 for Private-kNN-FL with AlexNet and σ = 25 with ResNet50. To address the domain adaptation issue, each agent can choose a smaller k if it observes that the domain gap is large, as a smaller k implies a more selective set of neighbors. In our experiment, we set k to be 5% of the local data size (i.e., each agent returns the noisy predictions of its top-5% neighbors). We observe in Table 2 that DP-FedSGD degrades when the backbone changes from the lightweight AlexNet to the heavier ResNet50, while ours improves by 10%. This is because a larger model capacity leads to a more sensitive response to gradient clipping or noise injection. In contrast, our Private-kNN-FL avoids the gradient operation by label aggregation and can still benefit from the larger model



Footnote (Krylov-space method): one that outputs a solution in the subspace spanned by a sequence of subgradients.



Figure 1: DP-FedAvg and PATE-FL are used for agent-level DP. DP-FedSGD and Private-kNN-FL are used for instance-level DP.

and privately from a party-specific domain distribution $D_i$. C is the number of classes. The objective is to output a global model that performs well on the target (server) distribution. Most prior works consider the target distribution to be a uniform distribution over the union of all local data, which is restrictive in practice. Here we consider an agnostic federated learning scenario (Mohri et al., 2019; Peng et al., 2019c), where the server distribution $D_G$ can be different from all agent distributions. In light of this, we assume each agent has access to part of the unlabeled server data drawn from the target distribution $D_G$.

DIFFERENTIAL PRIVACY FOR FEDERATED LEARNING

Differential privacy (Dwork et al., 2006) is a quantifiable and composable definition of privacy that provides provable guarantees against identification of individuals in a private dataset.

Definition 1. A randomized mechanism $\mathcal{M} : \mathcal{D} \to \mathcal{R}$ with a domain $\mathcal{D}$ and range $\mathcal{R}$ satisfies (ε, δ)-differential privacy if, for any two adjacent datasets $D, D' \in \mathcal{D}$ and for any subset of outputs $S \subseteq \mathcal{R}$, it holds that $\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta$.

e.g., McMahan et al., 2018, Theorem 1). which results in a convergence rate upper bound of

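Algorithm 1 PATE-FL (continued)
6:   end for
7:   $\tilde{y}_t = \arg\max_{y \in \{1,...,C\}} [\sum_{i=1}^{N} \tilde{f}_i(x_t)]_y$
8: end for
9: Train a global model θ using $(x_t, \tilde{y}_t)_{t=1}^{Q}$

Algorithm 2 Private-kNN-FL
Input: Noise σ, global data $D_G$, Q queries
1: for t = 0, 1, ..., Q, pick $x_t \in D_G$ do
2:   for each agent i in 1, ..., N do
3:     Apply φ on $D_i$ and $x_t$
4:     $y_1, ..., y_k$ ← top-k closest labels
5:     $\tilde{f}_i(x_t) = \frac{1}{k}(\sum_{j=1}^{k} y_j) + \mathcal{N}(0, \frac{\sigma^2}{N} I_C)$
6:   end for
7:   $\tilde{y}_t = \arg\max_{y \in \{1,...,C\}} [\sum_{i=1}^{N} \tilde{f}_i(x_t)]_y$
8: end for
9: Train a global model θ using $(x_t, \tilde{y}_t)_{t=1}^{Q}$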
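The two voting algorithms can be sketched for a single query as follows. This is a toy illustration under stated assumptions: secure aggregation (SMC) is replaced by an in-process sum, features are assumed to be precomputed by the feature extractor, and the function names `knn_vote` and `noisy_argmax` as well as all data values are hypothetical. For PATE-FL, the per-agent vote vector would come from the local teacher model instead of `knn_vote`.

```python
import math
import random


def knn_vote(query_feat, local_feats, local_labels, k, n_classes):
    """Frequency vector of the labels of the k nearest local neighbours (Euclidean)."""
    order = sorted(range(len(local_feats)),
                   key=lambda i: math.dist(query_feat, local_feats[i]))
    votes = [0.0] * n_classes
    for i in order[:k]:
        votes[local_labels[i]] += 1.0 / k
    return votes


def noisy_argmax(agent_votes, sigma, rng):
    """Sum per-agent votes, each perturbed with N(0, sigma^2 / N) noise per class,
    and return the index of the largest noisy total."""
    n_agents = len(agent_votes)
    n_classes = len(agent_votes[0])
    per_agent_std = sigma / math.sqrt(n_agents)
    totals = [0.0] * n_classes
    for votes in agent_votes:
        for c in range(n_classes):
            totals[c] += votes[c] + rng.gauss(0.0, per_agent_std)
    return max(range(n_classes), key=lambda c: totals[c])


# Hypothetical toy data: two agents, 2-D features, two classes.
agent_a = ([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]], [0, 0, 1])
agent_b = ([[0.5, 0.5], [6.0, 5.0], [1.0, 0.0]], [0, 1, 0])
query = [0.2, 0.1]

votes = [knn_vote(query, feats, labels, k=2, n_classes=2)
         for feats, labels in (agent_a, agent_b)]
# sigma=0 gives the noiseless majority vote, useful as a sanity check.
label = noisy_argmax(votes, sigma=0.0, rng=random.Random(0))
```

Each agent contributes only a C-dimensional vector per query, so the amount of noise depends on the number of classes and not on the size of any model.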

Figure 2: Privacy-accuracy tradeoff for MNIST dataset with Non-i.i.d partition. The x-axis is the privacy budget and the y-axis reports the corresponding accuracy.

3.1. CHALLENGE 1: BIASED GRADIENT ESTIMATION

Recent works (Li et al., 2018) have shown that FedAvg may not converge well under heterogeneity (e.g., non-identical distributions). Here, we provide a simple example to show that the clipping step of DP-FedAvg may raise an additional challenge. Consider the special case of two agents with updates $\Delta_1$ and $\Delta_2$, where $\|\Delta_1\|_2 = \tau + \alpha$ and $\|\Delta_2\|_2 \le \tau$. Then the global update will be $\frac{1}{2}\left(\frac{\tau \Delta_1}{\|\Delta_1\|_2} + \Delta_2\right)$, which is biased. The unbiased global update would be $\frac{1}{2}(\Delta_1 + \Delta_2)$. Such a simple example can be embedded in more realistic problems, causing substantial bias that leads to non-convergence.

3.2. CHALLENGE 2: SLOW CONVERGENCE

Recent works (Li et al., 2019; Wang et al., 2019) have investigated the convergence rate in FL methods.
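The two-agent clipping example from Challenge 1 can be checked numerically. In the sketch below, the variable names `delta_1` and `delta_2` and the concrete vectors are hypothetical; the only assumption taken from the text is that the first update has norm τ + α and the second has norm at most τ.

```python
import math


def clip(update, tau):
    """Rescale `update` to L2 norm tau if its norm exceeds tau."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [u * scale for u in update]


def average(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


tau, alpha = 1.0, 3.0
delta_1 = [tau + alpha, 0.0]  # norm tau + alpha: this update is clipped
delta_2 = [0.0, 0.5]          # norm <= tau: this update is left unchanged

unbiased = average([delta_1, delta_2])                       # (delta_1 + delta_2) / 2
clipped = average([clip(delta_1, tau), clip(delta_2, tau)])  # global update after clipping
bias = [c - u for c, u in zip(clipped, unbiased)]
```

Here the clipped global update differs from the unbiased one by α/2 along the direction of the first update, so the bias grows with how far the larger update exceeds the threshold.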

To address this issue, recent works apply delicate clipping strategies (McMahan et al., 2018; Geyer et al., 2017) and reduce data dimension with PCA (Abadi et al., 2016). In this work, we propose to avoid such dimension dependence and empirically investigate how network architecture affects performance in various DPFL approaches.

Table 1: Agent-level DP Evaluation. We compare the state-of-the-art DPFL methods with ours on the Digit and CelebA datasets. For the (ε, δ)-DP setting, we set $\delta = 10^{-3}$ across all the methods.

In other words, our proposed methods avoid the dependence on the model dimension d that is inherent in DP-FedAvg, and can release models at essentially no privacy cost when there is a high consensus among the votes from local agents.

4.2. COMMUNICATION COST

Finally, regarding the communication issue, our proposed methods are parallel, as each agent works independently without any synchronization. Overall, we reduce the up-stream communication cost from d · T floats (model size times T rounds) to C · Q floats in one round.
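The communication saving can be illustrated with a quick count of uploaded floats per agent. The sizes below are hypothetical and chosen only to show the orders of magnitude involved; the function names are likewise illustrative.

```python
def upstream_floats_gradient(d, T):
    """Floats one agent uploads over T rounds when sending a d-dimensional update per round."""
    return d * T


def upstream_floats_voting(C, Q):
    """Floats one agent uploads when sending a C-dimensional vote for each of Q queries."""
    return C * Q


# Hypothetical sizes, for illustration only.
d, T = 1_000_000, 100  # model parameters, communication rounds
C, Q = 10, 1_000       # number of classes, number of public queries
ratio = upstream_floats_gradient(d, T) / upstream_floats_voting(C, Q)
```

With these assumed sizes, gradient aggregation uploads 10,000 times as many floats per agent as voting does; the actual ratio depends on the model, the number of rounds, and the number of queries.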

Table 2: Instance-level DP on Office-Caltech using different backbones.

Table 3: Instance-level DP on DomainNet. We compare our method with DP-FedSGD and the non-private baseline FedAvg. The total number of local agents is 5. We set $\delta = 10^{-4}$.

related to multi-source domain adaptation (Peng et al., 2019b) than the traditional domain adaptation. In other words, averaging gradients of domain adaptation methods implies averaging different trajectories towards the server's distribution, which may not work in practice. How to improve DP-FedAvg variants with DA techniques remains an open problem.

CelebA Dataset Evaluation: CelebA is a 220k face attribute dataset with 40 attributes defined. 300 agents are designed with partitioned training data. We split off 600 unlabeled data points at the server, and the remaining 59,400 images are used for testing. Detailed settings are given in the appendix. Consistent with the Digit datasets, our method achieves a clear performance gain of 1.8% over DP-FedAvg while maintaining the same privacy cost.

MNIST Dataset with Non-i.i.d Partition Evaluation: In both the CelebA and Digit experiments, we partition each dataset across agents in an i.i.d. manner. To investigate our proposed algorithm under a non-i.i.d. partition scenario, we choose an experimental setup similar to that of McMahan et al. (2017). We divide the training set of sorted MNIST among 100 agents, such that each agent has samples from 6 digits only. This way, each agent gets 600 data points from 6 classes. We split off 30% of the MNIST testing set as the available unlabeled public data, and the remaining testing set is used for testing. As shown in Table


Table 3 compares our Private-kNN-FL method with DP-FedSGD. We observe that when the privacy costs are closely aligned, our method outperforms DP-FedSGD by more than 10% in accuracy across all three cases. When the accuracies are closely aligned, our method saves more than 60% of the privacy cost, showing a consistent advantage over DP-FedSGD.

5.3. ABLATION STUDY

In this section, we investigate the agent-level privacy-utility trade-off with respect to the number of agents and the volume of local data. MNIST is used for generality and simplicity. We randomly pick 1000 test examples as the unlabeled server data and use the remaining 9000 for testing. We adopt the model structure proposed in (Abadi et al., 2016) for both of our methods.

Effect of Data per Agent:

We fix the number of agents to 100 and vary the number of data points per agent over {50, 100, 200, 400, 600}. While varying only the "data per agent" factor, we fairly tune the other privacy parameters of all the methods for their best performance. In Figure 3 (a), as "data per agent" increases, all the methods improve as the overall dataset volume increases. Our method achieves consistently higher accuracy than DP-FedAvg. The failure cases for both methods occur when "data per agent" is below 50, which cannot ensure well-trained local agent models. Label aggregation over such weak local models results in failure or sub-optimal performance.

Effect of Number of Agents:

In Figure 3 (b), we vary N ∈ {50, 100, 200, 400, 800} and fix the overall privacy budget at ε = 5, $\delta = 10^{-3}$. Following (Geyer et al., 2017), each agent has exactly 600 data points, where data samples are duplicated when N ∈ {200, 400, 800}. We conduct a grid search for each method to obtain optimal hyper-parameters. Our method shows a clear performance advantage over DP-FedAvg. We also see that DP-FedAvg gradually approaches our method as the number of agents increases.

6. CONCLUSIONS

In this work, we propose voting-based approaches for differentially private federated learning (DPFL) under two privacy regimes: agent-level and instance-level. We thoroughly investigate the real-world challenges of DPFL and demonstrate the advantages of our methods over gradient-aggregation-based DPFL methods in terms of utility, convergence, reliance on network capacity, and communication cost. Extensive empirical evaluation shows that our methods improve the privacy-utility trade-off in both privacy regimes.

