EXPLOIT UNLABELED DATA ON THE SERVER! FEDERATED LEARNING VIA UNCERTAINTY-AWARE ENSEMBLE DISTILLATION AND SELF-SUPERVISION

Anonymous

Abstract

Federated Learning (FL) is a distributed machine learning paradigm in which multiple clients cooperate to train a server model. In practice, it is hard to assume that each client possesses large-scale data or that many clients are always available to participate in FL in the same round, which may lead to data deficiency. This deficiency degrades the entire learning process. To resolve this challenge, we propose Federated learning with entropy-weighted ensemble Distillation and Self-supervised learning (FedDS). FedDS reliably handles situations where not only the amount of data per client but also the number of clients is scarce. This advantage is achieved by leveraging unlabeled data, which is prevalent on the server. We demonstrate the effectiveness of FedDS on classification tasks for CIFAR-10/100 and PathMNIST. On CIFAR-10, our method improves over FedAVG by 12.54% in the data-deficient regime, and by 17.16% and 23.56% in the more challenging scenarios of noisy labels and Byzantine clients, respectively.

1. INTRODUCTION

Federated Learning (FL) is a distributed machine learning paradigm that involves the cooperation of multiple clients to train a server model (McMahan et al., 2017). In FL, a server model is trained as follows: 1) the server distributes the current server model to clients; 2) each client independently trains the model downloaded from the server with its available local data and sends the resultant model back to the server; 3) the server updates the server model with the collected locally-trained models; and 4) the steps are repeated. By collecting updated model parameters at the server instead of raw client data, FL mitigates personal information leakage. In some FL scenarios, such as developing a medical diagnosis algorithm, it is often the case that both the number of clients (participating hospitals) and the size of the labeled dataset in each client (the number of relevant patients and their labels at each hospital) are deficient. Such a lack of participating clients and labeled client data degrades the performance of standard FL methods, e.g., FedAVG (McMahan et al., 2017) (refer to Fig. 1). The deficiency may also destabilize the learning process, which increases the label noise sensitivity of FL methods. Even in such scenarios, unlabeled data is abundant or easy to collect in practice, which may help mitigate the data deficiency and label noise vulnerability of FL algorithms. In this paper, we propose a robust FL algorithm that utilizes additional unlabeled data on the server.
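The four-step training loop above can be sketched as a single aggregation round. The code below is a minimal, framework-free illustration in which model weights are plain dictionaries of float lists; the function name and client interface are ours, not from the paper.

```python
def fedavg_round(server_weights, client_updates):
    """One FedAVG-style aggregation step (illustrative sketch).

    `server_weights` maps parameter names to lists of floats.
    `client_updates` is a list of (local_weights, num_local_examples)
    pairs that clients send back after local training on the model
    they downloaded from the server.
    """
    total = sum(n for _, n in client_updates)
    aggregated = {}
    for name in server_weights:
        # Average each parameter, weighted by local dataset size,
        # as in FedAVG (McMahan et al., 2017).
        aggregated[name] = [
            sum((n / total) * w[name][i] for w, n in client_updates)
            for i in range(len(server_weights[name]))
        ]
    return aggregated
```

In a real system the server would distribute `aggregated` back to the clients and repeat the round; deep-learning frameworks apply the same averaging to parameter tensors rather than float lists.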
The key idea of our method is to leverage unlabeled data both to mitigate the lack of data and to reliably aggregate the client models into a single server model via unsupervised ensemble knowledge distillation. We postulate that the aforementioned accuracy degradation mainly stems from unreliable clients participating in the server model update. To mitigate the influence of unreliable clients, we measure the entropy of each client's output to assess the uncertainty of each model, and use it to weight each client model. We found that this simple entropy measure is sufficiently well calibrated to train the server model better; thereby, we suppress the contribution of unreliable clients during aggregation. Furthermore, our proposed entropy-weighted ensemble distillation (EED) is performed jointly with a self-supervised learning (SSL) loss on the unlabeled data at the server. In our setting, while additionally imposing an SSL loss is simple and incurs negligible overhead, it proves crucial and brings several benefits, such as faster convergence and reduced influence of unreliable clients.
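To make the entropy-weighting idea concrete, the sketch below combines client predictions on a single unlabeled example, down-weighting high-entropy (uncertain) clients. The exact weighting function used by FedDS is not specified here; a softmax over negative entropies is one plausible instantiation, and all names are ours.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Shannon entropy: low values indicate a confident prediction.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weighted_ensemble(client_logits):
    """Aggregate per-client logits for one unlabeled input into a single
    soft pseudo-label, suppressing uncertain clients."""
    probs = [softmax(z) for z in client_logits]
    ents = [entropy(p) for p in probs]
    weights = softmax([-e for e in ents])  # low entropy -> high weight
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]
```

The resulting soft pseudo-label can then serve as the distillation target for the server model.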

Our major contributions

We propose Federated learning with entropy-weighted ensemble Distillation and Self-supervised learning (FedDS), a method for reliably updating the server model by utilizing the server's unlabeled data in an unsupervised way. We demonstrate through experiments that FedDS outperforms several strong baselines on classification tasks for CIFAR-10/100 and PathMNIST, especially in the data-deficient regime, and is also robust to various challenging scenarios with unreliable clients.

2. RELATED WORK

Since the emergence of the FL paradigm (McMahan et al., 2017), there have been many follow-up studies tackling various challenges in FL: enhancing communication efficiency (Alistarh et al., 2017; Suresh et al., 2017; Bernstein et al., 2018; Tang et al., 2018; Wu et al., 2018; Hamer et al., 2020; Rothchild et al., 2020; Reisizadeh et al., 2020; Haddadpour et al., 2021; Qiao et al., 2021; Konečný et al., 2016; Hyeon-Woo et al., 2022; Jeong et al., 2018), stabilizing convergence and solving various issues arising from heterogeneity in clients' data (Li et al., 2020; Karimireddy et al., 2020; Acar et al., 2021; Reddi et al., 2021; Yuan & Ma, 2020) and in clients' model structures (Diao et al., 2021b), protecting clients from privacy attacks (Ammad-Ud-Din et al., 2019; Gong et al., 2021), and effectively aggregating unreliable client models containing malicious clients (Chang et al., 2019). In particular, the use of extra data, in addition to the local data available at each client, has been shown to be effective against some of the aforementioned challenges in FL (Chang et al., 2019; Lin et al., 2020; Li & Wang, 2019). In this section, we briefly review the main research streams utilizing additional data in FL that are closely related to our work, i.e., utilizing unlabeled data on the server. For a comprehensive review of FL, please refer to Kairouz et al. (2019).

Knowledge distillation in federated learning. Knowledge distillation (KD) is used to transfer knowledge of a teacher model to a student model (Xie et al., 2020; Zhou et al., 2021; Radosavovic et al., 2018; Kim et al., 2021). The teacher model provides pseudo labels for either labeled or unlabeled data, and the student model learns to mimic the teacher's behavior by using these pseudo-labeled data as supervision.
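As a concrete reference point, the standard KD objective matches the student's softened output distribution to the teacher's via a KL divergence. The sketch below uses temperature scaling in the usual Hinton-style formulation; the function names and default temperature are ours, not from the paper.

```python
import math

def softened_softmax(logits, temperature):
    # Softmax over temperature-scaled logits; higher temperatures
    # produce softer (higher-entropy) distributions.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions;
    the student minimizes this to mimic the teacher's soft labels."""
    t = softened_softmax(teacher_logits, temperature)
    s = softened_softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
```

The loss is zero when student and teacher agree exactly and grows as their predicted distributions diverge.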
To adapt KD to FL, data to be pseudo-labeled must be present at the point of aggregation (the server), so that knowledge can be extracted and transferred to the aggregated model (the server model). However, if the data to be distilled is provided by clients to the server, FL loses the privacy and security properties that are among its major advantages. To address this, some studies assume that transferring features and logits derived from clients' data is allowed (He et al., 2020; Gong et al., 2021), while others assume that additional datasets exist (Gong et al., 2021; Chang et al., 2019; Lin et al., 2020; Li & Wang, 2019; Shi et al., 2021). The latter assumes that an extra public dataset is shared across all clients. KD is then performed at the server by collecting from the clients the logit values for the public dataset, instead of model parameters. This approach can achieve higher communication efficiency and privacy protection. Our work relaxes this assumption by keeping the extra unlabeled data only on the server, rather than sharing it across all clients. In a setting similar to ours, Lin et al. (2020) propose an FL method called FedDF. In FedDF, the server obtains pseudo labels for the unlabeled data with the collected client models and performs KD with those pseudo labels. They show that their approach is robust to non-i.i.d. data, achieving higher accuracy than FedAVG (McMahan et al., 2017). We further take into account the confidence of each client and the core knowledge of the server by measuring the uncertainty of client predictions and applying



Figure 1: The accuracy of the standard FL method, FedAVG (McMahan et al., 2017), rapidly drops as the amount of available data at each client decreases. Our FedDS mitigates the effect of data deficiency by exploiting unlabeled data on the server. The accuracy is measured at the 50th communication round on the CIFAR-10 classification task. This result follows the same setting as the main experiment in Fig. 4 except for the data amounts at each client.

