FEDPD: DEFYING DATA HETEROGENEITY THROUGH PRIVACY DISTILLATION

Abstract

The performance of federated learning (FL) models typically suffers from data heterogeneity, i.e., data distributions that vary across clients. Recent works have shown great potential for sharing client information to mitigate data heterogeneity. Yet some literature reveals a dilemma between preserving strong privacy and promoting model performance. Revisiting the purpose of sharing information motivates us to raise fundamental questions: Which part of the data is more critical for model generalization? Which part of the data is more privacy-sensitive? Can we resolve this dilemma by sharing the features that are useful for generalization and keeping the more sensitive data local? Our work sheds light on data-dominated sharing and training by decoupling the original training data into sensitive features and generalizable features. Specifically, we propose a Federated Privacy Distillation framework, named FedPD, to alleviate the privacy-performance dilemma. FedPD keeps the distilled sensitive features local and constructs a global dataset from the shared generalizable features in a differentially private manner. Clients can then perform local training on both the local and the securely shared data, achieving high model performance while avoiding leakage of the private information that was not distilled into the sensitive features. Theoretically, we demonstrate the superiority of sharing only useful features over sharing raw data. Empirically, we show the efficacy of FedPD in promoting performance with comprehensive experiments.

1. INTRODUCTION

Federated learning (FL), an emerging privacy-protection paradigm, has received increasing attention recently (Kairouz et al., 2021; Li et al., 2021b; Yang et al., 2019) because it preserves data privacy without transmitting raw data. In general, distributed clients collaboratively train a global model by aggregating gradients (or model parameters). However, distributed data can cause heterogeneity issues (McMahan et al., 2017; Li et al., 2022; 2020; Zhao et al., 2018) due to diverse computing capability and non-IID data distribution across federated clients, which results in unstable convergence and degraded performance. To address the challenge of heterogeneity, the seminal work federated averaging (FedAvg) (McMahan et al., 2017) proposes weighted averaging to cope with non-IID data distributions when sharing selected local parameters in each communication round. Despite addressing the diversity of computing and communication, FedAvg still struggles with the client drift issue (Karimireddy et al., 2020). Therefore, recent works try to resolve this issue by devising new learning objectives (Li et al., 2020), designing new aggregation strategies (Yurochkin et al., 2019), and constructing information for sharing (Zhao et al., 2018; Yoon et al., 2021). Among these explorations, sharing relevant information across clients is a straightforward and promising approach to mitigating data heterogeneity. However, recent works point out a dilemma between preserving strong privacy and promoting model performance. Specifically, Zhao et al. (2018) show that sharing a limited amount of data can significantly improve training performance. Unfortunately, sharing raw data, synthesized data, logits, and statistical information (Luo et al., 2021; Goetz & Tewari, 2020; Hao et al., 2021; Karimireddy et al., 2020) can incur high privacy risks. To protect clients' privacy, differential privacy (DP) provides the de facto standard for provable, quantitative security guarantees. The primary concern in applying DP is performance degradation (Tramer & Boneh, 2020). Thus, solving the above dilemma would promote model performance while preserving strong privacy.

1.1. SYSTEMATIC OVERVIEW OF FEDPD

To solve the dilemma, we revisit the purpose of sharing information: sharing raw data benefits model generalization while risking privacy leakage. This motivates us to raise the following fundamental questions.

(1) Is it necessary to share complete raw data features to mitigate data heterogeneity? We find that some data features are more important than others for training a global model. Therefore, an intuitive approach is to divide the data features into two parts: one part that serves model generalization, named generalizable features, and the other part that carries clients' private information, named sensitive features. The dilemma can then be solved by sharing generalizable features and keeping sensitive features local throughout the training procedure. The insight is that the sensitive features in the data are kept locally, while the generalizable features intrinsically related to generalization are shared across clients. Accordingly, numerous decentralized clients can share generalizable features without privacy concerns and construct a global dataset to perform local training.

(2) How can data features be divided into generalizable features and sensitive features? It is challenging to identify which part of the data is more important for model generalization and which part is more privacy-sensitive. To resolve this challenge, we propose a novel framework named Federated Privacy Distillation (FedPD). FedPD introduces a competitive mechanism that decomposes $x \in \mathbb{R}^d$ with dimension $d$ into generalizable features $x_g \in \mathbb{R}^d$ and sensitive features $x_s \in \mathbb{R}^d$, i.e., $x = x_g + x_s$. In FedPD, the sensitive features $x_s$ aim to cover almost all information in the data $x$, while the generalizable features $x_g$ compete with $x_s$ to extract sufficient information for training, such that models trained on $x_g$ generalize well. Consequently, the sensitive features are almost the same as the data, while models trained on the generalizable features generalize well.

(3) What is the difference between sharing raw data features and sharing partial features? To ensure that sharing the generalizable features $x_g$ does not expose FL to the danger of privacy leakage, we follow the conventional practice of applying differential privacy to protect the generalizable features $x_g$ shared across clients. Our trick is that most information in the data has been distilled into the sensitive features $x_s$, which are kept locally and are therefore very secure. In other words, we only need a relatively small noise to protect $x_g$, without the need to fully protect the raw data $x$, yet we achieve much stronger privacy than the straightforward protection (i.e., directly sharing $x$ with differential privacy). Intuitively, privacy is easier to preserve when sharing partial information in the data than when sharing complete information, which is consistent with our theoretical analysis.

1.2. OUR RESULTS AND CONTRIBUTION

To tackle data heterogeneity, we propose a novel privacy-preserving framework that constructs a global dataset from securely shared data and performs local training on both the local and the shared data, shedding new light on data-dominated sharing schemes. To show its efficacy, we deploy FedPD on four popular FL algorithms, namely FedAvg, FedProx, SCAFFOLD, and FedNova, and conduct experiments in various scenarios with different numbers of devices and varying degrees of heterogeneity. Our extensive results show that FedPD achieves considerable performance gains on different FL algorithms. Our solution not only improves model performance in FL but also provides strong security, which is theoretically guaranteed through the lens of differential privacy. Our contributions are summarized as follows:
• We raise a fundamental question: whether it is necessary to share complete raw data features when sharing private data to mitigate data heterogeneity in FL.

[...] (Li et al., 2020), FedIR (Hsu et al., 2020), SCAFFOLD (Karimireddy et al., 2020), and MOON (Li et al., 2021a). Other works propose new model aggregation schemes, such as FedAvgM (Hsu et al., 2019), FedNova (Wang et al., 2020b), FedMA (Wang et al., 2020a), and FedBN (Li et al., 2021c).

2. RELATED WORK

Another promising direction is sharing some data, which mainly focuses on synthesizing and sharing data of different clients to mitigate client drift (Zhao et al., 2018; Jeong et al., 2018; Long et al., 2021). To avoid the privacy leakage caused by sharing data, some methods share the statistics of data (Yoon et al., 2021; Shin et al., 2020), which still contain some raw data content. Other methods distribute intermediate features (Hao et al., 2021), logits (Chang et al., 2019; Luo et al., 2021), or learned embeddings (Tan et al., 2022). Although these tactics enhance privacy to some degree, advanced attacks can still successfully reconstruct raw data from the shared data (Zhao et al., 2020). Unlike prior research, we exploit DP to ensure the privacy of shared data and then analyze the privacy-performance trade-off.

Differential privacy in federated learning. Recent works on model memorization and gradient leakage confirm that model parameters are only seemingly secure (Carlini et al., 2019). Training with differential privacy (Zhu et al., 2019; Nasr et al., 2019) is a feasible way to avoid some attacks, albeit at some loss in utility. Differential privacy quantifies to what extent individual privacy in a statistical dataset is preserved while releasing a model established over specific datasets. In FL, training with differential privacy, i.e., adding noise to the model or data, originally aims to protect the local information of each client (Yuan et al., 2019; Thakkar et al., 2019). Some works analyze the relation between convergence and utility in FL (Huang et al., 2020; Wei et al., 2020). A series of DP works add noise to gradients or model parameters in FL to protect model privacy (Kim et al., 2021; van der Hoeven, 2019; Triastcyn & Faltings, 2019; Sun et al., 2021). Unlike model-based protection, our work aims to protect data privacy and mitigate the client drift issue. We provide a detailed discussion of existing works in Appendix A.5.
Ideally, if we could identify the sensitive features $x_s$ and the generalizable features $x_g$, we would be able to solve the privacy-performance dilemma. Intuitively, the sensitive features $x_s$ contain most of the information in the data, while the generalizable features $x_g$ contain the non-sensitive part that helps global generalization in FL. To resolve the dilemma between protecting privacy and promoting performance, we can keep the sensitive features local while sharing the generalizable features under a differential privacy guarantee. The major challenge is that the intersection of these two types of features may not be empty, which makes it difficult to distill privacy.

3.2. PRIVACY DISTILLATION

To address this issue, we propose a competitive mechanism to perform privacy distillation. Therein, the generalizable features aim to train models that generalize well on the raw data, while the sensitive features compete with the generalizable features to reconstruct the raw data. Consequently, the sensitive features are almost the same as the data, while models trained on the generalizable features generalize well on the raw data. We propose two approaches to instantiate this competitive mechanism, i.e., to make the generalizable features useful for model generalization while keeping the sensitive features almost the same as the raw data, so that they cover almost all information in the raw data.

3.2.1. OPTIMIZATION VIEW

A straightforward approach is to distill private information in a meta-learning manner (Finn et al., 2017). Specifically, we employ a generative model, e.g., a variational auto-encoder (VAE), $G(\cdot; \theta)$ parameterized by $\theta$ to cover all information in the raw data, i.e., $x_s = G(x; \theta)$ aims to reconstruct $x$. Meanwhile, to ensure that the generalizable features $x_g(\theta) = x - x_s = x - G(x; \theta)$ are useful for model generalization, we train an auxiliary classifier $A(\cdot; w)$ parameterized by $w$ on $x_g$ such that $A(\cdot; w)$ performs well on the raw data $x$. We then formalize privacy distillation as the following optimization problem:

$$\min_{\theta} \; \mathbb{E}_{(x, y)} L(A(x; \hat{w}(\theta)), y) + H(x_g(\theta)), \quad \text{s.t.} \quad \hat{w}(\theta) = \arg\min_{w} \mathbb{E}_{(x_g(\theta), y)} L(A(x_g(\theta); w), y), \quad x_g(\theta) = x - G(x; \theta). \tag{1}$$

Here, $y$ is the label of both the sample $x$ and its generalizable features $x_g(\theta)$, $\hat{w}(\theta)$ is a function of $\theta$ denoting the parameters of the classifier $A(x; \cdot)$, $H(x_g(\theta))$ is the information entropy of $x_g(\theta)$, and $L(\cdot, \cdot)$ is the cross-entropy loss. Every possible parameter $\theta$ is paired with a model trained on the corresponding generated data $x_g(\theta)$. Thus, solving the optimization problem amounts to searching for parameters $\theta$ that generate generalizable features $x_g(\theta)$ with minimum information entropy, such that the model $A(x_g(\theta); w)$ trained on $(x_g(\theta), y)$ still performs well on the raw data. However, the proposed non-convex optimization problem is non-trivial to solve. We employ a simple yet effective trick widely used in reinforcement learning (Mnih et al., 2015): we alternately update $G(x; \theta)$ over $x$ via stochastic gradient descent and update $A(x_g(\theta); w)$ over $x_g(\theta)$. Moreover, we minimize an upper bound of $H(x_g(\theta))$ given by the variance of $x_g(\theta)$, following Ahuja et al. (2021).
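One way to see why a variance can serve as a surrogate for the entropy term is the standard fact that, for a fixed variance, the Gaussian distribution maximizes differential entropy, so $H(X) \le \tfrac{1}{2}\log(2\pi e\,\mathrm{Var}[X])$ for a scalar $X$. The sketch below illustrates this under strong simplifying assumptions: features are scalars, and `split_features` with its `shrink` parameter is a hypothetical stand-in for the generator $G$, not the VAE used in the paper.

```python
import math
import random

def gaussian_entropy_bound(values):
    """Upper bound on differential entropy (in nats) from the sample variance.

    For a fixed variance the Gaussian maximizes differential entropy, so
    H(X) <= 0.5 * log(2 * pi * e * Var[X]).
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return 0.5 * math.log(2 * math.pi * math.e * var)

def split_features(x, shrink):
    """Hypothetical stand-in for the generator: x_s = shrink * x, x_g = x - x_s."""
    x_s = [shrink * v for v in x]
    x_g = [v - s for v, s in zip(x, x_s)]
    return x_g, x_s

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
for shrink in (0.5, 0.9, 0.99):
    x_g, x_s = split_features(x, shrink)
    # The closer x_s is to x, the smaller the variance (and the bound) of x_g.
    print(shrink, round(gaussian_entropy_bound(x_g), 3))
```

As `shrink` approaches 1, $x_s$ covers more of $x$ and the entropy bound on $x_g$ decreases, which mirrors the competition between the two feature types described above.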

3.2.2. GENERALIZATION VIEW

Besides the optimization approach, we also provide a generalization view of privacy distillation. At a high level, we aim to train a model $A(\cdot; w)$ on $x_g$ such that $A(\cdot; w)$ generalizes well on $x$, i.e., on samples drawn from a different distribution. Therefore, we should model how the performance on the generated data transfers to the raw data. To derive a detailed connection between these two distributions, the metric used to measure generalization performance must be defined clearly. According to margin theory (Koltchinskii & Panchenko, 2002), maximizing the margin between data points and the decision boundary achieves strong generalization performance, so we relate such a margin to the generalization performance.

Definition 3.1 (Margin). We define the margin of a classifier $A(\cdot; w)$ on a distribution $P$ with a distance metric $d$ as

$$M_m(A, P) = \mathbb{E}_{(x, y) \sim P} \inf_{A(x') \neq y} d(x', x).$$

Built upon this margin, we can quantify the generalization performance of $A(\cdot; w)$ on a given distribution: a large margin means strong generalization performance. A recent work (Tang et al., 2022) shows that the margin is intrinsically related to the distribution discrepancy in the representation space, i.e., the distance between the distribution from which $x$ is sampled and that from which $x_g$ is sampled. Thus, we propose minimizing the discrepancy between the generated distribution and the raw distribution in the representation space:

$$\min_{\theta} \; \mathbb{E}_{(x, y)} L(A(x; w), y) + L(A(x_g(\theta); w), y) + H(x_g(\theta)) + d(r(x_g(\theta)), r(x)), \tag{2}$$

where $d$ is the distance metric used in the definition of the margin and $r(x_g(\theta))$ stands for the representation of $x_g(\theta)$ generated by the classifier $A$.
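To make Definition 3.1 concrete, the following sketch evaluates an empirical version of the margin for a binary linear classifier with the Euclidean distance. These choices are assumptions made for illustration only: in this special case the infimum has a closed form (the distance to the hyperplane for correctly classified points, and 0 for misclassified ones), whereas the paper's classifier $A$ is a general network. All function names are hypothetical.

```python
import math

def predict(w, b, x):
    """Binary linear classifier: returns +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def point_margin(w, b, x, y):
    """inf over x' with A(x') != y of ||x' - x||_2 for a linear classifier."""
    if predict(w, b, x) != y:
        return 0.0  # x itself is misclassified, so x' = x attains distance 0
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Otherwise the nearest differently-classified points lie at the hyperplane.
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))

def empirical_margin(w, b, samples):
    """Sample average standing in for the expectation over P."""
    return sum(point_margin(w, b, x, y) for x, y in samples) / len(samples)

samples = [((2.0, 0.0), 1), ((-1.0, 0.0), -1), ((0.5, 3.0), -1)]
print(empirical_margin((1.0, 0.0), 0.0, samples))  # (2 + 1 + 0) / 3 = 1.0
```

The third sample is misclassified and contributes 0, which shows how the margin penalizes errors as well as points that sit close to the decision boundary.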

3.3. DIFFERENTIALLY PRIVATE GENERALIZABLE FEATURES

The proposed privacy distillation methods make it possible to keep most (private) information locally while sending the generalizable features to the server, which can then construct a global dataset from these generalizable features and send it back to clients for local training. However, for ease of computing the information entropy, we employ the variance of the generalizable features as a surrogate, which may cause privacy leakage. This would break the original intention of federated learning, namely protecting privacy. Thus, the shared generalizable features should be protected. To avoid privacy leakage, additional noise (e.g., Gaussian or Laplacian) is added to the generalizable features $x_g$, i.e., $x_p \triangleq x_g + \mathcal{N}(0, \sigma^2)$. Clients then send $x_p$ to the server to construct a globally shared dataset. Using the global dataset, clients train a classifier $F(\cdot; \phi)$ parameterized by $\phi$ on the local and shared data:

$$\min_{\phi} L_F(\phi) = \mathbb{E}_{(x, y)} L(F(x; \phi), y) + \mathbb{E}_{(x_p, y)} L(F(x_p; \phi), y). \tag{3}$$

Algorithm 1 summarizes the training procedure of FedAvg with FedPD. To make sure the framework can be used without privacy concerns, we further provide the corresponding analysis. Before that, we introduce the definition of differential privacy, which we use for adding i.i.d. noise to the generalizable features.

            $\phi^{r+1}_{k,E-1} \leftarrow$ ClientTraining($k$, $\phi^r$, $D^k_r$)
        end for
        $\phi^{r+1} \leftarrow \sum_{k \in S_r} \frac{|D_k|}{n} \phi^{r+1}_{k,E-1}$
    end for
    ClientTraining($k$, $\phi$, $D^k_r$):
        for each local epoch $j$ with $j = 0, \cdots, E-1$ do
            $\phi_{k,j+1} \leftarrow \phi_{k,j} - \eta_k \nabla_{\phi} L_F(\phi)$, i.e., Eq. 3
        end for
        Return $\phi$ to server

Theorem 3.4. For any $\epsilon > 0$, $\delta \in [0, 1]$, and $\tilde{\delta} \in [0, 1]$, the class of $(\epsilon, \delta)$-differentially private mechanisms satisfies $(\tilde{\epsilon}_{\tilde{\delta}}, 1 - (1 - \tilde{\delta}) \prod_i (1 - \delta_i))$-differential privacy under $k$-fold adaptive composition for

$$\tilde{\epsilon}_{\tilde{\delta}} = \min\left\{ k\epsilon, \; \frac{(e^{\epsilon} - 1)\epsilon k}{e^{\epsilon} + 1} + \epsilon\sqrt{2k \log\left(e + \frac{\sqrt{k\epsilon^2}}{\tilde{\delta}}\right)}, \; \frac{(e^{\epsilon} - 1)\epsilon k}{e^{\epsilon} + 1} + \epsilon\sqrt{2k \log\left(\frac{1}{\tilde{\delta}}\right)} \right\}.$$
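The bound in Theorem 3.4 can be evaluated numerically to see when it improves on naive composition $k\epsilon$. The sketch below assumes the square-root form of the standard $k$-fold adaptive composition bound and takes $\delta_i = \delta$ for every mechanism; both are assumptions about the intended statement, and the function names are hypothetical.

```python
import math

def composed_epsilon(eps, k, delta_tilde):
    """Minimum of the three candidate bounds on the composed privacy budget."""
    base = (math.exp(eps) - 1.0) * eps * k / (math.exp(eps) + 1.0)
    naive = k * eps
    bound_a = base + eps * math.sqrt(
        2.0 * k * math.log(math.e + math.sqrt(k * eps * eps) / delta_tilde))
    bound_b = base + eps * math.sqrt(2.0 * k * math.log(1.0 / delta_tilde))
    return min(naive, bound_a, bound_b)

def composed_delta(delta, k, delta_tilde):
    """1 - (1 - delta_tilde) * prod_i (1 - delta_i), assuming delta_i = delta."""
    return 1.0 - (1.0 - delta_tilde) * (1.0 - delta) ** k

for k in (1, 10, 100, 1000):
    print(k, k * 0.1, composed_epsilon(0.1, k, 1e-5), composed_delta(1e-6, k, 1e-5))
```

For a single mechanism the naive term $k\epsilon$ is the smallest, while for many compositions the square-root terms grow more slowly and take over.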
Since $x_s$ is kept by the corresponding client, an adversary sees nothing of it, which can be regarded as adding a sufficiently large noise to $x$ to make it random enough. Considering all clients' data as a whole, we use a relatively small $\sigma$ (i.e., $\sigma_c < \sigma_d + \sigma_c$) to achieve a much smaller privacy loss, as summarized in Theorem 3.5.

Theorem 3.5. Given an identical privacy requirement, the $\sigma_c$ of FedPD is much smaller than the $\sigma$ that would have to be added to the raw data in conventional FL.

Given $(\epsilon, \delta)$-DP on each client side, we use the composition theorem to analyze the overall privacy of FedPD. In summary, FedPD protects two types of data features in two different ways, i.e., small noise for generalizable features and extremely large noise for sensitive features, and thus attains higher model performance and stronger security at the same time.
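The two data-side steps of Section 3.3, protecting $x_g$ with additive Gaussian noise and aggregating client models with weights $|D_k|/n$, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: features and model parameters are plain Python lists, the value `sigma=0.15` is an arbitrary placeholder, and the function names are hypothetical.

```python
import random

def protect_features(x_g, sigma, rng):
    """x_p = x_g + N(0, sigma^2), with independent noise per coordinate."""
    return [v + rng.gauss(0.0, sigma) for v in x_g]

def fedavg_aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors with weights |D_k| / n."""
    n = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(size / n * params[i] for params, size in zip(client_params, client_sizes))
        for i in range(dim)
    ]

rng = random.Random(0)
# Each client perturbs its generalizable features before sending them out.
x_p = protect_features([0.2, -0.1, 0.05], sigma=0.15, rng=rng)
print(x_p)

# The server averages client models, weighting by local dataset size.
phi = fedavg_aggregate([[1.0, 0.0], [3.0, 2.0]], [100, 300])
print(phi)  # [2.5, 1.5]
```

Only $x_p$ leaves the client in this sketch; $x_s$ never appears in any message, which is the sense in which it is treated as protected by an arbitrarily large noise.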

4. EXPERIMENTS AND EVALUATION

4.1. EXPERIMENT SETUP

Federated Non-IID Datasets. We conduct experiments on several popular image classification datasets, including CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), Fashion-MNIST (FMNIST) (Xiao et al., 2017), and SVHN (Netzer et al., 2011). We use latent Dirichlet sampling (LDA) (Hsu et al., 2019) to simulate Non-IID distributions with 10 and 100 clients. The main idea is to draw $q \sim \mathrm{Dir}(\alpha p)$ from a Dirichlet distribution, where $\alpha$ controls the degree of heterogeneity: the smaller $\alpha$ is, the more severe the resulting Non-IID distribution. In our experiments, we partition the datasets by LDA with two different degrees, $\alpha = 0.1$ and $\alpha = 0.05$. In addition, to show that our framework works well under other Non-IID partitions, we test two further partition strategies: (1) #C = k (McMahan et al., 2017; Li et al., 2022): each client only has $k$ different labels from the dataset, and $k$ controls the degree of imbalance. (2) Subset method (Zhao et al., 2018): each client [...] (Wang et al., 2020b), with or without FedPD, to explore the potency of our method. We run all algorithms with local epochs $E = 1$ and $E = 5$. The detailed hyper-parameters of each FL algorithm and of privacy distillation on the different datasets are listed in A.3.1.
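The effect of $\alpha$ in the Dirichlet sampling step can be illustrated with a short sketch that draws one class-proportion vector $q \sim \mathrm{Dir}(\alpha p)$ per client. It assumes a uniform prior $p$ over 10 classes and covers only the sampling of label proportions, not the assignment of actual images to clients; the function names are hypothetical.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from Dir(concentration) by normalizing independent Gamma draws."""
    while True:
        draws = [rng.gammavariate(c, 1.0) for c in concentration]
        total = sum(draws)
        if total > 0.0:  # guard against all draws underflowing to zero
            return [d / total for d in draws]

def client_label_distributions(num_clients, prior, alpha, rng):
    """One class-proportion vector q ~ Dir(alpha * p) per client."""
    concentration = [alpha * p for p in prior]
    return [sample_dirichlet(concentration, rng) for _ in range(num_clients)]

rng = random.Random(0)
prior = [0.1] * 10  # assumed uniform prior over 10 classes
for alpha in (100.0, 0.1, 0.05):
    qs = client_label_distributions(10, prior, alpha, rng)
    # Average share of each client's dominant class: near 1 means each
    # client holds almost a single label.
    avg_max = sum(max(q) for q in qs) / len(qs)
    print(alpha, round(avg_max, 3))
```

With a large $\alpha$ the clients' label proportions stay close to the prior, whereas small values such as those used in the experiments concentrate almost all of a client's mass on one or two classes.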

4.2. EXPERIMENTAL RESULTS

Main Results. The results on CIFAR-10, CIFAR-100, FMNIST, and SVHN are shown in Tables 1, 2, 5, and 6, respectively, and demonstrate that FedPD yields a significant performance gain. We also show the convergence speed of different algorithms on CIFAR-10 with α = 0.1, E = 1, M = 10 in Figure 2a, which shows that FedPD also greatly improves the convergence rate.

Privacy and performance. To explore the relationship between the privacy level ϵ and performance, we conduct experiments with different σ². As shown in Figure 2b, performance decreases as the protection strength increases. Laplacian noise yields results comparable to those of Gaussian noise, as listed in Table 4. In conclusion, we suggest sacrificing part of the privacy when communication resources are limited. Another question is whether the globally shared data can be inferred by attack methods. To answer it, we resort to the model inversion attack (He et al., 2019), widely used in the literature, to reconstruct our shared data. The results in Figure 3b indicate that privacy distillation alone still carries a risk of privacy leakage. Figure 3c provides strong evidence for the protection offered by the differentially private noise added to the generalizable features, showing that FedPD can strongly protect private information. The original images can be found in Appendix A.2.

Different number of clients. Tables 1, 2, 5, and 6 show that FedPD strengthens performance and speeds up convergence with both 10 and 100 clients. The improvement is especially noteworthy with 100 clients on CIFAR-10 and CIFAR-100. The reason may be that FL on CIFAR-10 and CIFAR-100 with 100 clients has a more divergent data distribution than on FMNIST. With FedPD, the missing data knowledge can be well replenished.

Different data heterogeneity. Tables 1, 2, 5, and 6 show that a high Non-IID degree (α = 0.05) yields a larger improvement than a lower one (α = 0.1), which also indicates that FedPD defends well against data heterogeneity. Moreover, Table 3 shows that the other two kinds of heterogeneous partition cause a larger performance decline than LDA (α = 0.1), and that FedPD attains an improvement comparable to that under LDA with α = 0.1, indicating that FedPD is insensitive to the type of Non-IID data distribution.

Different local epochs. To test the effect of the local epoch E, we choose E = 1 and E = 5 with the same Non-IID degree (α = 0.1) and client number (K = 10). We run 1000 rounds with 1 epoch of local training and 400 rounds with 5 epochs of local training. The results show that FedPD is robust to the number of local epochs.

Other Facts of FedPD. To give an intuitive understanding of why x g can substitute for the raw data x without drastic performance degradation, we train two networks separately on x and x g on CIFAR-10 and test them on x, x s, and x g, respectively. The results are presented in Figure 2c. As we can see, most of the features useful for downstream tasks are contained in x g. More experimental details are presented in Appendix A.4.

5. CONCLUDING REMARKS

In this paper, we observe that a model gains substantial performance when assisted by generalizable features. We then apply DP to protect the generalizable features and construct a globally shared dataset for defying heterogeneity in FL. Our contribution lies not only in improving model performance in Non-IID scenarios, but also in inspiring a new viewpoint on data-dominated secure sharing, e.g., distilling data before knowledge learning. We expect that our work could stimulate further data-dominated sharing in FL or other popular learning algorithms. Our framework shows superior results against the model inversion attack, yet we have not finished exploring data poisoning attacks on the shared data. We conduct preliminary experiments on data poisoning attacks, in which some clients send Gaussian noise to the server, causing performance degradation and slow convergence. Limited storage or communication resources may limit the power of FedPD, since FedPD introduces extra storage overhead. We leave the exploration of a storage-friendly FedPD as future work.

ETHIC STATEMENT

This paper does not raise any ethical concerns. This study does not involve any human subjects, practices to data set releases, potentially harmful insights, methodologies and applications, potential conflicts of interest and sponsorship, discrimination/bias/fairness concerns, privacy and security issues, legal compliance, and research integrity issues.

REPRODUCIBILITY STATEMENT

To make all experiments reproducible, we list the detailed hyper-parameters of each FL algorithm and of privacy distillation on the different datasets in A.3.1. Due to privacy concerns, we will upload an anonymous link to the source code and instructions during the discussion phase so that it is visible only to reviewers. All definitions can be found in Section 3, and the complete proof can be found in Appendix A.6.

However, some studies (Zhao et al., 2018; Li et al., 2022) [...] accumulates during the FedAvg weighted aggregation, leading to performance degradation of FL models. Recently, a series of works propose new learning objectives to keep the update direction of local training from drifting too far away from the global model. FedProx (Li et al., 2020) adds an $L_2$ distance as a regularization term in the objective function and provides a theoretical guarantee of convergence. Similarly, FedIR (Hsu et al., 2020) introduces a novel objective function that applies self-normalized weights over a mini-batch to address non-identical class distributions. SCAFFOLD (Karimireddy et al., 2020) restricts the model using previous information. Besides, MOON (Li et al., 2021a) introduces contrastive learning at the model level to correct the divergence between clients and the server. Meanwhile, recent works propose new model aggregation schemes. FedAvgM (Hsu et al., 2019) applies momentum on the server side. FedNova (Wang et al., 2020b) adopts a normalized averaging method to eliminate objective inconsistency. A study (Cho et al., 2020) also indicates that biasing client selection toward clients with higher local loss can speed up the convergence rate. Coordinate-wise averaging of weights can also harm performance. FedMA (Wang et al., 2020a) adopts a Bayesian non-parametric strategy for heterogeneous data. FedBN (Li et al., 2021c) focuses on feature-shift Non-IID data and performs local batch normalization before averaging models.
Another existing direction for tackling data heterogeneity is sharing data. This line of work mainly assembles the data of different clients to construct a global IID dataset, mitigating client drift by replenishing the information that clients lack (Zhao et al., 2018). Existing methods include synthesizing data from the raw data with GANs (Jeong et al., 2018; Long et al., 2021). However, the synthetic data is generally rather similar to the raw data, leading to privacy leakage to some degree. Adding noise to the shared data is another promising strategy (Chatalic et al., 2022; Cai et al., 2021). Some methods synthesize data for sharing from the statistics of data (Yoon et al., 2021; Shin et al., 2020), which still contain some raw data content. Other methods distribute intermediate features (Hao et al., 2021) or logits (Chang et al., 2019; Luo et al., 2021), or learn a new embedding (Tan et al., 2022). These tactics increase the difficulty of privacy protection because some existing feature inversion methods can reconstruct images (Zhao et al., 2020). Most of the above methods share information either without a privacy guarantee or with strong privacy preservation but poor performance, posing the privacy-performance dilemma. Concretely, in FD (Jeong et al., 2018), all clients collaboratively leverage a generative model to generate data with a homogeneous distribution. For better privacy protection, G-PATE (Long et al., 2021) aggregates local discriminators in a GAN. Fed-ZDAC and Fed-ZDAS (Hao et al., 2021), which differ in the side that performs augmentation, introduce zero-shot data augmentation by gathering intermediate activations and batch normalization (BN) statistics to generate fake data. Inspired by mixup, MAFL (Yoon et al., 2021) proposes FedMix to share information by averaging local data, which also raises privacy concerns.
Cronus (Chang et al., 2019) transmits logit information, while CCVR (Luo et al., 2021) collects statistical information of logits to sample fake data. FedFTG (Zhang et al., 2022) uses a generator to explore the input space of the local models and transfers local knowledge to the global model. FedDF (Lin et al., 2020) utilizes knowledge distillation based on unlabeled data or a generator and then conducts AVGLOGITS. The main difference between FedDF and FedPD is that our method distills the private information that is kept locally rather than distilling knowledge. We provide multiple steps to protect privacy while achieving a substantial performance gain.

Differential privacy with federated learning. Recent works on model memorization and gradient leakage confirm that model parameters are only seemingly secure. Carlini et al. (2019) found that unintended and persistent memorization of sensitive data occurs early during training, regardless of data rarity and model size. Training with differential privacy (Zhu et al., 2019; Nasr et al., 2019) is a feasible way to avoid serious consequences, albeit at some loss in utility. Differential privacy is a framework for quantifying to what extent individual privacy in a statistical dataset is preserved while releasing a model established over specific datasets. It has spawned a large set of research topics on data-releasing mechanisms and noise-adding mechanisms. In particular, noise-adding mechanisms have been widely used in differentially private learning algorithms to hide whether an individual is in the dataset or not. In federated settings, training with differential privacy, i.e., adding noise to the model or data, originally aims to protect the local information of each client. For example, an adversary should not be able to discern whether a client's data was used for early training. Here, we summarize some highly cited works and works from top venues.
Yuan et al. (Yuan et al., 2019) apply differential privacy to protect medical images by adopting the well-known AlexNet and the Gaussian mechanism. Huang et al. (Huang et al., 2020) integrate an approximate augmented Lagrangian function and a Gaussian noise mechanism to balance utility and privacy in FL. Wei et al. (Wei et al., 2020) perturb early-trained parameters locally by adding noise before uploading them to a server for aggregation. Both Huang et al. and Wei et al. are the first (to their knowledge) to analyze the relation between convergence and utility in FL. Andrew et al. (Thakkar et al., 2019) explore setting an adaptive clipping norm in the federated setting rather than using a fixed one. They show that adaptive gradient clipping can perform as well as any fixed clip chosen by hand. Kim et al. (Kim et al., 2021) provide a noise variance bound that guarantees local DP after multiple rounds of parameter aggregation. They introduce a trilemma among privacy, utility, and transmission rate in federated stochastic gradient descent. Hoeven et al. (van der Hoeven, 2019) introduce data-dependent bounds and apply symmetric noise in online learning, which allows the data provider to pick the noise distribution. Triastcyn et al. (Triastcyn & Faltings, 2019) adapt the notion of Bayesian differential privacy to federated learning and provide the necessary analyses of the privacy guarantee. Sun et al. (Sun et al., 2021) explicitly vary the ranges of weights at different layers in a DNN and shuffle high-dimensional parameters at aggregation to ease the explosion of privacy budgets. All the works above apply DP and its variants to the federated setting for different goals and scenarios, and thus provide the underlying security that DP guarantees.

A.6 DIFFERENTIAL PRIVACY

Proof of Theorem 3.4.

Definition A.1 (Privacy Loss). Let $\mathcal{M}: \mathcal{D} \rightarrow \mathcal{R}$ be a randomized mechanism with input domain $\mathcal{D}$ and range $\mathcal{R}$. Let $D, D'$ be a pair of adjacent datasets and aux be an auxiliary input.
For an outcome $o \in \mathcal{R}$, the privacy loss at $o$ is defined by:

$$L^{(o)}_{\mathrm{Pri}} \triangleq \log \frac{\Pr[\mathcal{M}(\mathrm{aux}, D) = o]}{\Pr[\mathcal{M}(\mathrm{aux}, D') = o]}.$$

We need to compute the privacy loss at an outcome $o$ as a random variable when the random mechanism operates on two adjacent databases $D$ and $D'$. Privacy loss is a random variable that accumulates the random noise added to the algorithm/model. We aim at an exact analysis of privacy by composing multiple random mechanisms. For simplification, we start with a particular random mechanism $\mathcal{M}^{\dagger}$ and then generalize it. The mechanism $\mathcal{M}^{\dagger}$ does not depend on the database or the query but relies on a hypothesis hp. For hp = 0, the outcomes $O_i$ of $\mathcal{M}^{\dagger}_i$ are independent and identically distributed according to a discrete distribution $O_{\mathrm{hp}=0} \sim P^{\dagger,0}$, where

$$P^{\dagger,0}(o) = \begin{cases} \delta & \text{for } o = 0, \\ (1-\delta)e^{\epsilon}/(1+e^{\epsilon}) & \text{for } o = 1, \\ (1-\delta)/(1+e^{\epsilon}) & \text{for } o = 2, \\ 0 & \text{for } o = 3. \end{cases}$$

For hp = 1, the outcome $O_i$ of $\mathcal{M}^{\dagger}_i$ follows $O_{\mathrm{hp}=1} \sim P^{\dagger,1}$, where

$$P^{\dagger,1}(o) = \begin{cases} 0 & \text{for } o = 0, \\ (1-\delta)/(1+e^{\epsilon}) & \text{for } o = 1, \\ (1-\delta)e^{\epsilon}/(1+e^{\epsilon}) & \text{for } o = 2, \\ \delta & \text{for } o = 3. \end{cases}$$

Let $R(\epsilon, \delta)$ be the privacy region of a single access to $\mathcal{M}^{\dagger}$. The privacy region consists of two rejection regions with errors, i.e., rejecting a true null hypothesis (type-I error) and retaining a false null-hypothesis
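The two outcome distributions $P^{\dagger,0}$ and $P^{\dagger,1}$ defined above can be checked numerically. The sketch below builds both distributions for illustrative values of $\epsilon$ and $\delta$ (chosen arbitrarily, with $\delta > 0$ assumed) and evaluates the privacy loss of Definition A.1 at each outcome; the function names are hypothetical.

```python
import math

def p_dagger(eps, delta):
    """Outcome distributions of the mechanism under hp = 0 and hp = 1."""
    hi = (1.0 - delta) * math.exp(eps) / (1.0 + math.exp(eps))
    lo = (1.0 - delta) / (1.0 + math.exp(eps))
    p0 = [delta, hi, lo, 0.0]
    p1 = [0.0, lo, hi, delta]
    return p0, p1

def privacy_loss(p0, p1, o):
    """log(P0(o) / P1(o)); infinite when only one hypothesis can produce o."""
    if p1[o] == 0.0:
        return math.inf
    if p0[o] == 0.0:
        return -math.inf
    return math.log(p0[o] / p1[o])

p0, p1 = p_dagger(eps=0.5, delta=0.01)
print(sum(p0), sum(p1))
print([privacy_loss(p0, p1, o) for o in range(4)])
```

Outcomes 1 and 2 carry a privacy loss of exactly $+\epsilon$ and $-\epsilon$, while outcomes 0 and 3, each occurring with probability $\delta$ under one hypothesis only, reveal the hypothesis outright. This is the sense in which the mechanism sits at the boundary of $(\epsilon, \delta)$-differential privacy.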



Additional convergence-speed figures for other experiments are shown in Appendix A.4.



Figure 1: FL framework with the plug-in FedPD. During the privacy distillation process, clients generate generalizable features and add noise to obtain protected generalizable features $x_p$. The protected generalizable features $x_p$ are collected from numerous distributed clients to construct a globally shared dataset, while the sensitive features $x_s$ are kept locally. During the local training procedure, the local raw data and a subset of the globally shared data jointly train the local model for global aggregation. $L_A$ denotes Eq. 1 and $L_F$ denotes Eq. 3 in our paper.

i.d noise to generalizable features.

Definition 3.2. (Differential Privacy). A randomized mechanism $M$ provides $(\epsilon, \delta)$-differential privacy (DP) if, for any two neighboring datasets $D$ and $D'$ that differ in a single entry and for all $S \subseteq \mathrm{Range}(M)$,

$$\Pr(M(D) \in S) \le e^{\epsilon} \cdot \Pr(M(D') \in S) + \delta,$$

where $\epsilon$ is the privacy budget and $\delta$ is the failure probability.

The noise we add to $x_g$ is proportional to the sensitivity, as defined in Definition 3.3. The concept of sensitivity was originally used for sharing a dataset while achieving $(\epsilon, \delta)$-differential privacy. Later, we follow Theorem 3.4 to analyze the privacy of the globally shared data.

Definition 3.3. (Sensitivity). The sensitivity of a query function $F : D \to R$ for any two neighboring datasets $D, D'$ is

$$\Delta = \max_{D, D'} \|F(D) - F(D')\|,$$

where $\|\cdot\|$ denotes the $L_1$ or $L_2$ norm.

Algorithm 1 FedAvg with FedPD
server input: initial $\phi^0$, communication rounds $R$
client $k$'s input: local epochs $E$, local dataset $D^k$, learning rate $\eta_k$
Initialization: server distributes the initial model $\phi^0$ to all clients and generates the globally shared dataset $D^s$ (details in Algorithm 2)
Server Executes:
for each round $r = 1, 2, \cdots, R$ do
    server samples a subset of clients $S_r \subseteq \{1, \ldots, K\}$
    $n \leftarrow \sum_{i \in S_r} |D^i|$
    client $k$ samples a subset of the globally shared dataset $D^k_r \subseteq D^s$ ($|D^k_r| = |D^k|$)
    server communicates $\phi^r$ and the sampled shared data $D^k_r$ to the selected clients $k \in S_r$
    for each client $k \in S_r$ in parallel do
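Definitions 3.2 and 3.3 can be illustrated with a minimal sketch of noise that scales with the sensitivity. This is not the paper's implementation: it assumes the classical Gaussian-mechanism calibration $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \epsilon$ (valid for $0 < \epsilon < 1$) and bounds the $L_2$ sensitivity by clipping. The function names `clip_l2`, `gaussian_sigma`, and `protect`, and the clipping step itself, are hypothetical choices made for illustration.

```python
import math
import random


def clip_l2(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    return [v * max_norm / norm for v in vec]


def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism noise scale (assumes 0 < eps < 1)."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("calibration assumes 0 < eps < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps


def protect(x_g, max_norm, eps, delta, rng):
    """Clip a feature vector x_g and add Gaussian noise (illustrative only)."""
    clipped = clip_l2(x_g, max_norm)
    # Two clipped vectors differ by at most 2 * max_norm in L2 norm,
    # which is the sensitivity assumed for releasing a single record.
    sigma = gaussian_sigma(2.0 * max_norm, eps, delta)
    return [v + rng.gauss(0.0, sigma) for v in clipped]


if __name__ == "__main__":
    rng = random.Random(0)
    x_p = protect([3.0, 4.0, 0.0], max_norm=1.0, eps=0.5, delta=1e-5, rng=rng)
    print(x_p)
```

The noise scale is linear in the sensitivity, so halving the clipping norm halves the standard deviation of the added noise for the same $(\epsilon, \delta)$.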

(a) Test accuracy on CIFAR-10 with α = 0.1, E = 1, K = 10, smoothed with a Gaussian filter for better visualization. Test accuracy on FMNIST with different noise levels σ², yielding various privacy budgets ϵ (lower ϵ is preferred). Two classifiers trained on different data forms, with test accuracy on x, x_s, and x_g, respectively.

Figure 2: Experiments of the relationship between privacy and performance.

Figure 3: Model inversion attack results. A white-box attack is applied to the globally shared data x_p and to the generalizable features x_g, respectively. The attack results are shown in (b) and (c), for comparison with the shared data x_p in (a).

Test Accuracy on CIFAR-100 with different noise level σ 2

Figure 8: Privacy-Performance results on different datasets

Figure 12: Convergence comparison on FMNIST.

• We answer the question by proposing a plug-and-play framework named FedPD, where raw data features are divided into generalizable features and sensitive features. In FedPD, the sensitive features are distilled in a competitive manner and kept locally, while the generalizable features are shared in a differentially private manner to construct a global dataset.

Results with/without FedPD on CIFAR-10, reported as w/ (w/o) FedPD (centralized training ACC = 95.48%). "Round" means the number of communication rounds needed to reach the target accuracy. ↓ (↑) indicates that smaller (larger) values are better. "None" implies that the target accuracy was not attained during the entire training process. All "Speedup" values are calculated relative to the vanilla FedAvg "Round" in the different Non-IID partition scenarios.

has all classes from the data, but one dominant class far outnumbers the other classes. These three partition methods mainly include label skew and quantity skew. The visualization of the data distribution is shown in Figure 4 in Appendix A.1.

Results with/without FedPD on CIFAR-100

Experiment results of different Non-IID partition methods on CIFAR-10 with 10 clients.

Experiment results with different added noise on CIFAR-10.

Results with/without FedPD on FMNIST

Results with/without FedPD on SVHN

have pointed out that the divergence between FedAvg and centralized training is slight in the IID case. However, under a heterogeneous distribution, there is a considerable divergence between the different clients and centralized training, and the gap

A.2 GLOBALLY SHARED DATA

We display the globally shared data x_p from four different datasets alongside the raw data x to illustrate our privacy protection. First, the raw data used in Figure 3 are shown in Figure 5.

A.3 MORE DETAILS OF FEDPD

Algorithm 1 gives an intuitive explanation of how we deploy FedPD on an FL algorithm, e.g., FedAvg, and Algorithm 2 illustrates the procedure for generating the globally shared data.

Algorithm 2 Globally Shared Data Generation
Server input: communication rounds $T$ of the generation process, noise mean $\mu$, noise level $\sigma^2$
Client $k$'s input: local epochs $Q$, local dataset $D^k$
Initialization: server distributes the initial models $w^0, \theta^0$ to all clients
Server Executes:

We tuned the learning rate over {0.0001, 0.001, 0.01, 0.1} and report the best results together with the corresponding learning rate. In most cases we use 0.01 as the learning rate; the exceptions are SCAFFOLD and FedNova on SVHN under the α = 0.1, E = 1, K = 100 setting, where the learning rates are 0.0001 and 0.001, respectively. The batch size is set to 64 when K = 10 and to 32 when K = 100. The number of clients selected for aggregation on the server side is 5 per round for K = 10 and 10 per round for K = 100. The noise in our experiments is drawn from N(0, 0.15).

In addition, we provide an insight experiment on the need for mixup data augmentation (Zhang et al., 2017) in our approach, shown in Figure 7. As we can see, a lack of data leads to poor generalization of the auxiliary classifier A on x, and even adequate data for the VAE G still has a bad effect.
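For readers unfamiliar with mixup (Zhang et al., 2017), the augmentation mentioned above can be sketched as follows. This is a generic illustration of the standard mixup rule, $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ and $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$ with $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$; how FedPD integrates it with the auxiliary classifier and the VAE is not specified here, and the function name `mixup` and the value of `alpha` are assumptions.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Return a convex combination of two samples and their one-hot labels."""
    # lam is drawn from Beta(alpha, alpha), so it always lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


if __name__ == "__main__":
    rng = random.Random(0)
    x, y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                      alpha=1.0, rng=rng)
    print(lam, x, y)
```

Because both inputs and labels are mixed with the same coefficient, the mixed label remains a valid probability vector whenever the original labels are one-hot.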

A.3.2 TRICK FOR FEDPD

(type-II error). Let $\epsilon^{\dagger}_k, \delta^{\dagger}_k$ be the parameters of $M^{\dagger}_i$ that define privacy. $R(M, D, D')$ of any mechanism $M$ can be regarded as an intersection of $\{(\epsilon^{\dagger}_k, \delta^{\dagger}_k)\}$ privacy regions. For an arbitrary mechanism $M$, we need to compute its privacy region using the $(\epsilon^{\dagger}_k, \delta^{\dagger}_k)$ pairs. Let $D, D'$ be neighboring databases and $O$ be the output domain. Define (symmetric) $P, P'$ to be the probability density functions of the outputs $M(D), M(D')$, respectively. Assume a permutation $\pi$ over $O$ such that $P'(o) = P(\pi(o))$. Let $S$ denote the complement of a rejection region. Since $R(M, D, D')$ is convex, we have

Define $\mathrm{Dt}_{\epsilon^{\dagger}}(P, P') = \max_{S \subseteq O} \{P(S) - e^{\epsilon^{\dagger}} P'(S)\}$. Thus, $M$'s privacy region is the set:

Next, we consider composition of random mechanisms $M_1, \ldots, M_i$. By accessing $M^{\dagger}_i$,

$$P(O_{1,hp} = o_1, \ldots, O_{i,hp} = o_i) = \prod_{j=1}^{i} P^{\dagger,hp}(o_j).$$

By algebra on the two discrete distributions,

$$\mathrm{Dt}_{(i-2j)\epsilon}\big(P^{i}, (P')^{i}\big) = 1 - (1-\delta)^{i} + (1-\delta)^{i}\, \frac{\sum_{l=0}^{j-1} \binom{i}{l} \left(e^{\epsilon(i-l)} - e^{\epsilon(i-2j+l)}\right)}{(1+e^{\epsilon})^{i}}.$$

Hence, the privacy region is an intersection of $i$ regions, parameterized by $1 - (1-\delta)\prod_{i}(1-\delta_i)$.
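The closed form for $\mathrm{Dt}_{(i-2j)\epsilon}$ can be checked numerically on small cases. The sketch below enumerates all $4^i$ outcomes of the $i$-fold product of $P^{\dagger,0}$ and $P^{\dagger,1}$ and compares the result with the closed-form expression. It relies on two assumptions: for discrete distributions the maximum over $S$ equals the sum of the positive parts of $P(o) - e^{\epsilon^{\dagger}} P'(o)$, and the exponent of the denominator $(1 + e^{\epsilon})$ is the number of compositions $i$. The function names are hypothetical.

```python
from itertools import product
from math import comb, exp, isclose, log


def p_dagger(eps, delta, hp):
    """Distribution of M-dagger over outcomes {0, 1, 2, 3} for hp in {0, 1}."""
    w = (1.0 - delta) / (1.0 + exp(eps))
    p0 = [delta, w * exp(eps), w, 0.0]
    # The hp = 1 distribution is the hp = 0 distribution with outcomes reversed.
    return p0 if hp == 0 else p0[::-1]


def dt_bruteforce(eps, delta, i, j):
    """max_S {P^i(S) - e^{(i-2j) eps} (P')^i(S)} by full enumeration."""
    p, q = p_dagger(eps, delta, 0), p_dagger(eps, delta, 1)
    scale = exp((i - 2 * j) * eps)
    total = 0.0
    for outcome in product(range(4), repeat=i):
        pr_p = pr_q = 1.0
        for o in outcome:
            pr_p *= p[o]
            pr_q *= q[o]
        # The optimal S collects exactly the outcomes with a positive gap.
        total += max(0.0, pr_p - scale * pr_q)
    return total


def dt_closed_form(eps, delta, i, j):
    """Closed-form expression stated in the proof (denominator exponent i)."""
    s = sum(comb(i, l) * (exp(eps * (i - l)) - exp(eps * (i - 2 * j + l)))
            for l in range(j))
    keep = (1.0 - delta) ** i
    return 1.0 - keep + keep * s / (1.0 + exp(eps)) ** i


if __name__ == "__main__":
    eps, delta = 0.5, 0.01
    for i, j in [(3, 0), (3, 1), (4, 2)]:
        print(i, j, dt_bruteforce(eps, delta, i, j), dt_closed_form(eps, delta, i, j))
```

The enumeration is exponential in $i$, so it is only meant as a sanity check for a handful of compositions.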

