FEDPD: DEFYING DATA HETEROGENEITY THROUGH PRIVACY DISTILLATION

Abstract

Model performance in federated learning (FL) typically suffers from data heterogeneity, i.e., data distributions vary across clients. Prior works have shown great potential in sharing client information to mitigate data heterogeneity. Yet, the literature also reveals a dilemma between preserving strong privacy and promoting model performance. Revisiting the purpose of sharing information motivates us to raise fundamental questions: Which part of the data is more critical for model generalization? Which part of the data is more privacy-sensitive? Can we solve this dilemma by sharing useful (for generalization) features and keeping more sensitive data local? Our work sheds light on data-dominated sharing and training by decoupling the original training data into sensitive features and generalizable features. Specifically, we propose a Federated Privacy Distillation framework named FedPD to alleviate the privacy-performance dilemma. FedPD keeps the distilled sensitive features local and constructs a global dataset from the shared generalizable features in a differentially private manner. Accordingly, clients can perform local training on both the local and securely shared data, acquiring high model performance while avoiding leakage of the locally retained sensitive features. Theoretically, we demonstrate the superiority of sharing only useful features over sharing raw data. Empirically, we show the efficacy of FedPD in promoting performance through comprehensive experiments.

1. INTRODUCTION

Federated learning (FL), an emerging privacy-preserving paradigm, has received increasing attention recently (Kairouz et al., 2021; Li et al., 2021b; Yang et al., 2019), as it preserves data privacy without transmitting raw data. In general, distributed clients collaboratively train a global model by aggregating gradients (or model parameters). However, distributed data can cause heterogeneity issues (McMahan et al., 2017; Li et al., 2022; 2020; Zhao et al., 2018) due to diverse computing capabilities and non-IID data distributions across federated clients, resulting in unstable convergence and degraded performance. To address this challenge, the seminal work on federated averaging (FedAvg) (McMahan et al., 2017) proposes weighted averaging to overcome non-IID data distributions when sharing selected local parameters in each communication round. Despite addressing the diversity of computing and communication, FedAvg still struggles with the client-drift issue (Karimireddy et al., 2020). Therefore, recent works try to resolve this issue by devising new learning objectives (Li et al., 2020), designing new aggregation strategies (Yurochkin et al., 2019), and constructing information for sharing (Zhao et al., 2018; Yoon et al., 2021). Among these explorations, sharing relevant information across clients provides a straightforward and promising approach to mitigating data heterogeneity. However, recent works point out a dilemma between preserving strong privacy and promoting model performance. Specifically, Zhao et al. (2018) show that sharing a limited amount of data can significantly improve training performance. Unfortunately, sharing raw data, synthesized data, logits, and statistical information (Luo et al., 2021; Goetz & Tewari, 2020; Hao et al., 2021; Karimireddy et al., 2020) can incur high privacy risks. To protect clients' privacy, differential privacy (DP) provides a de facto standard for quantifiable, provable privacy guarantees.
The primary concern in applying DP is performance degradation (Tramer & Boneh, 2020). Thus, resolving the above dilemma would allow promoting model performance while preserving strong privacy.
Under review as a conference paper at ICLR 2023
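For concreteness, FedAvg's size-weighted parameter averaging can be sketched as follows. This is a minimal illustration with toy two-parameter "models"; the helper name `fedavg_aggregate` is ours, not taken from any of the cited works.

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """FedAvg-style weighted averaging: each client's parameters are
    weighted by its share of the total number of training examples."""
    total = sum(client_sizes)
    return sum((n / total) * p for p, n in zip(client_params, client_sizes))

# Two clients with toy parameter vectors; client 2 holds 3x the data.
params = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
agg = fedavg_aggregate(params, [1, 3])
print(agg)  # → [2.5 3.5]
```

The larger client dominates the average (weights 0.25 and 0.75 here), which is exactly why strongly skewed non-IID data distributions can pull the global model toward a few clients.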

1.1. SYSTEMATIC OVERVIEW OF FEDPD

To solve the dilemma, we revisit the purpose of sharing information: sharing raw data benefits model generalization while risking privacy leakage. This motivates us to raise the following fundamental questions. (1) Is it necessary to share complete raw data features to mitigate data heterogeneity? We find that some data features are more important than others for training a global model. Therefore, an intuitive approach is to divide the data features into two parts: one part for model generalization, named generalizable features, and the other part carrying clients' privacy, named sensitive features. The dilemma can then be solved by sharing the generalizable features and keeping the sensitive features local throughout the training procedure. The insight is that the sensitive features in the data are kept locally, while the generalizable features intrinsically related to generalization are shared across clients. Accordingly, numerous decentralized clients can share generalizable features without privacy concerns and construct a global dataset for local training. (2) How do we divide data features into generalizable features and sensitive features? It is challenging to identify which part of the data is more important for model generalization and which part is more privacy-sensitive. To resolve this challenge, we propose a novel framework named Federated Privacy Distillation (FedPD). FedPD introduces a competitive mechanism that decomposes x ∈ R^d with dimension d into generalizable features x_g ∈ R^d and sensitive features x_s ∈ R^d, i.e., x = x_g + x_s. In FedPD, the sensitive features x_s aim to cover almost all information in the data x, while the generalizable features x_g compete with x_s to extract information sufficient for training, such that models trained on x_g generalize well. Consequently, the sensitive features are almost identical to the data, while models trained on the generalizable features generalize well.
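The additive decomposition above can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's training procedure: we stand in a single linear map `W_s` for the learned distiller that produces x_s, and define x_g as the residual so that x = x_g + x_s holds by construction; the competing reconstruction and task losses are only indicated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # feature dimension

# Hypothetical linear stand-in for the learned distiller producing x_s.
W_s = rng.normal(scale=0.1, size=(d, d))

def decompose(x):
    x_s = x @ W_s   # sensitive features: trained to cover almost all of x
    x_g = x - x_s   # generalizable features: the competing residual
    return x_g, x_s

x = rng.normal(size=(4, d))       # a mini-batch of raw feature vectors
x_g, x_s = decompose(x)

# Competitive objectives (indicated only):
#   - a reconstruction loss pushes x_s toward x,
#   - a task loss on x_g pushes x_g to remain predictive.
recon_loss = np.mean((x_s - x) ** 2)
print(np.allclose(x_g + x_s, x))  # → True: the decomposition is exact
```

Defining x_g as the residual x - x_s guarantees the constraint x = x_g + x_s exactly, so the competition plays out entirely through the two losses.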
(3) What is the difference between sharing raw data features and sharing partial features? To ensure that sharing the generalizable features x_g does not expose FL to the danger of privacy leakage, we follow the conventional approach of applying differential privacy to protect the generalizable features x_g shared across clients. The key observation is that most information in the data has been distilled into the sensitive features x_s, which never leave the client. In other words, we only need relatively small noise to protect x_g, rather than fully protecting the raw data x, yet we achieve much stronger privacy than the straightforward protection (i.e., directly sharing x with differential privacy). Intuitively, sharing partial information makes privacy easier to preserve than sharing complete information, which is consistent with our theoretical analysis.
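A standard way to privatize shared vectors is the Gaussian mechanism: clip each vector's L2 norm to bound its sensitivity, then add calibrated noise. The sketch below illustrates that pattern for the shared x_g; the helper names and the values of `clip_norm` and `sigma` are ours for illustration, and calibrating sigma to a target (epsilon, delta) budget, as the paper's analysis would require, is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_rows(x, clip_norm):
    """Bound the L2 norm of each row so that one shared feature
    vector changes the output by at most clip_norm (its sensitivity)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(1.0, norms / clip_norm)

def privatize(x_g, clip_norm=1.0, sigma=0.5):
    """Gaussian-mechanism sketch: clip, then add noise whose scale is
    proportional to the clipped sensitivity before sharing."""
    clipped = clip_rows(x_g, clip_norm)
    noise = rng.normal(scale=sigma * clip_norm, size=x_g.shape)
    return clipped + noise

x_g = rng.normal(size=(4, 8))   # generalizable features to be shared
shared = privatize(x_g)         # only this noisy version leaves the client
```

Because x_g carries only part of the information in x, the same noise scale buys a stronger effective protection than it would on the raw data, which is the intuition behind sharing partial features.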

1.2. OUR RESULTS AND CONTRIBUTION

To tackle data heterogeneity, we propose a novel privacy-preserving framework, which constructs a global dataset using securely shared data and performs local training on both the local and shared data, shedding new light on data-dominated sharing schemes. To show its efficacy, we deploy FedPD on four popular FL algorithms, including FedAvg, FedProx, SCAFFOLD, and FedNova, and conduct experiments on various scenarios with different numbers of devices and varying degrees of heterogeneity. Our extensive results show that FedPD achieves considerable performance gains across these FL algorithms. Our solution not only improves model performance in FL but also provides strong security, theoretically guaranteed through the lens of differential privacy. Our contributions are summarized as follows:
• We raise a foundational question: whether it is necessary to share complete raw data features when sharing private data to mitigate data heterogeneity in FL.
• We answer the question by proposing a plug-and-play framework named FedPD, where raw data features are divided into generalizable features and sensitive features. In FedPD, the sensitive features are distilled in a competitive manner and kept locally, while the generalizable features are shared in a differentially private manner to construct a global dataset.
• We give a new perspective on employing differential privacy that adds noise to partial data features instead of the complete raw data features, which is theoretically superior to the raw-data-sharing strategy.
• Extensive experiments demonstrate that FedPD can considerably improve the performance of FL models.
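The plug-and-play aspect can be illustrated by how a client's local update consumes both datasets: it simply trains on the union of its private data and the shared global dataset, regardless of which FL algorithm runs on top. The arrays and the single logistic-regression SGD step below are hypothetical stand-ins, not the paper's actual model or objective.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Hypothetical data: (x_local, y_local) stays private on the client;
# (x_shared, y_shared) is the globally shared, privatized dataset.
x_local, y_local = rng.normal(size=(16, d)), rng.integers(0, 2, 16)
x_shared, y_shared = rng.normal(size=(32, d)), rng.integers(0, 2, 32)

# Local training runs over the union of the two datasets; any FL
# algorithm's local update (FedAvg, FedProx, ...) can consume it.
x_train = np.concatenate([x_local, x_shared])
y_train = np.concatenate([y_local, y_shared])

# One logistic-regression SGD step as a stand-in for the local update.
w = np.zeros(d)
p = 1.0 / (1.0 + np.exp(-(x_train @ w)))          # predicted probabilities
grad = x_train.T @ (p - y_train) / len(y_train)   # mean log-loss gradient
w -= 0.1 * grad
```

Since only the training set changes, the base algorithm's aggregation and communication logic are untouched, which is what makes the framework plug-and-play.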

