MODEL-AGNOSTIC ROUND-OPTIMAL FEDERATED LEARNING VIA KNOWLEDGE TRANSFER

Abstract

Federated learning enables multiple parties to collaboratively learn a model without exchanging their local data. Currently, federated averaging (FedAvg) is the most widely used federated learning algorithm. However, FedAvg and its variants have clear shortcomings: they can only be used to learn differentiable models and need many communication rounds to converge. In this paper, we propose FedKT, a novel federated learning algorithm that needs only a single communication round (i.e., is round-optimal). By applying a knowledge transfer approach, FedKT can be applied to any classification model. Moreover, we develop differentially private versions of FedKT and theoretically analyze their privacy loss. Experiments show that our method achieves accuracy close to or better than the other state-of-the-art federated learning algorithms.

1. INTRODUCTION

While the size of the training data strongly influences the quality of a machine learning model, in practice the data are often dispersed over different parties. Due to regulations on data privacy, the data cannot be centralized at a single party for training. To address these issues, federated learning (Kairouz et al., 2019; Li et al., 2019a;b; Yang et al., 2019) enables multiple parties to collaboratively learn a model without exchanging their local data. It has become a hot research topic and has shown promising results in the real world (Bonawitz et al., 2019; Hard et al., 2018; Li et al., 2020a; Peng et al., 2020). Currently, federated averaging (FedAvg) (McMahan et al., 2016) is a widely used federated learning algorithm. Its training is an iterative process with four steps in each iteration. First, the server sends the global model to the selected parties. Second, each selected party updates the model with its local data. Third, the updated models are sent back to the server. Last, the server averages all the received models to update the global model. There are also many variants of FedAvg (Li et al., 2020c; Karimireddy et al., 2020). For example, to handle heterogeneous data, FedProx (Li et al., 2020c) introduces an additional proximal term to limit the local updates, while SCAFFOLD (Karimireddy et al., 2020) introduces control variates to correct the local updates. The overall frameworks of these studies remain similar to FedAvg. FedAvg and its variants have the following limitations. First, they rely on gradient descent for optimization and thus cannot be applied to train non-differentiable models, such as decision trees, in the federated setting. Second, they usually need many communication rounds to reach a good model, which causes massive communication traffic and imposes fault-tolerance requirements across rounds.
Last, FedAvg is originally designed for the cross-device setting (Kairouz et al., 2019), where the parties are mobile devices and the number of parties is large. In the cross-silo setting, where the parties are organizations or data centers and the number of parties is relatively small, it is possible to take better advantage of the relatively high computation power of the parties.

To address the above limitations, we propose a novel federated learning algorithm called FedKT (Federated learning via Knowledge Transfer), focusing on the cross-silo setting. With round-optimality as a design goal, FedKT extends the idea of ensemble learning to the federated setting with a novel two-tier design. Inspired by the success of unlabelled public data, which is often available (e.g., text and images), in many studies (Papernot et al., 2017; 2018; Jordon et al., 2019; Chang et al., 2019), we adopt the knowledge transfer method to reduce the inference and storage costs of ensemble learning. As such, FedKT is able to learn any classification model, whether differentiable or not. Moreover, we develop differentially private versions of FedKT and theoretically analyze their privacy loss in order to provide different differential privacy guarantees. Our experiments on four tasks show that FedKT performs well compared with the other state-of-the-art algorithms. Our main contributions are as follows.

• We propose a new federated learning algorithm named FedKT. To the best of our knowledge, FedKT is the first algorithm that places no limitation on the model architecture and needs only a single communication round.

• We show that FedKT can easily achieve both example-level and party-level differential privacy, and we theoretically analyze the bound of its privacy cost.
• We conduct experiments on various models and tasks and show that FedKT achieves accuracy comparable to the other iterative algorithms. Moreover, FedKT can be used as an initialization step to achieve better accuracy when combined with the other approaches.
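For reference, one FedAvg communication round as described in the introduction can be sketched as follows. This is a minimal illustration with hypothetical helper names (`fedavg_round`, `local_update`); real implementations average per-layer weight tensors, typically weighting each party by its local data size.

```python
import numpy as np

def fedavg_round(global_weights, local_datasets, local_update, party_weights=None):
    # 1) The server sends global_weights to each selected party;
    # 2) each party updates the model on its own local data;
    # 3) the updated models are sent back to the server;
    # 4) the server averages them (optionally weighted, e.g. by data size).
    local_models = [local_update(global_weights, data) for data in local_datasets]
    if party_weights is None:
        party_weights = [1.0 / len(local_models)] * len(local_models)
    return sum(w * m for w, m in zip(party_weights, local_models))

# Toy usage: "models" are plain weight vectors; the hypothetical local
# update shifts the weights by the mean of the party's data.
update = lambda w, data: w + np.mean(data)
new_global = fedavg_round(np.zeros(2), [np.array([1.0]), np.array([3.0])], update)
```

Iterating this round many times until convergence is precisely the communication cost that a single-round algorithm such as FedKT avoids.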

2. BACKGROUND AND RELATED WORK

2.1. ENSEMBLE LEARNING

Instead of using a single model for prediction, ensemble learning (Zhang & Ma, 2012) combines the predictions of multiple models to obtain better predictive performance. Widely used ensemble learning algorithms include boosting (Rätsch et al., 2001) and bagging (Prasad et al., 2006). One important factor in ensemble learning is model diversity: greater diversity among the models usually improves the ensemble's performance. In federated learning, since different parties have their own local data, there is natural diversity among the local models, so the local models can be used as an ensemble for prediction. Previous works (Yurochkin et al., 2019; Guha et al., 2019) have studied ensemble learning for federated learning and demonstrated promising predictive accuracy. As noted in those studies, since prediction involves all the local models, the inference and storage costs are prohibitively high, especially when the number of models is large. In our study, we also use the local models as an ensemble and further apply knowledge transfer to learn a single model, reducing the inference and storage costs.
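Using the local models as an ensemble can be sketched as follows: a minimal illustration in which each "local model" is any callable returning a class label and the ensemble predicts by majority vote (the helper name `ensemble_predict` is ours, not from the cited works).

```python
from collections import Counter

def ensemble_predict(local_models, x):
    # Majority vote over the parties' local models: collect each model's
    # predicted class for input x and return the most common one.
    votes = [model(x) for model in local_models]
    return Counter(votes).most_common(1)[0][0]

# Toy usage: three "local models" as simple threshold rules, reflecting
# the natural diversity of models trained on different local data.
models = [lambda x: int(x > 0), lambda x: int(x > 1), lambda x: int(x > -1)]
pred = ensemble_predict(models, 0.5)  # two of the three models vote for class 1
```

Note that every prediction invokes all the local models, which is exactly why the inference and storage costs grow with the number of parties.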

2.2. KNOWLEDGE TRANSFER OF THE TEACHER ENSEMBLE

Knowledge transfer has been successfully used in previous studies (Hinton et al., 2015; Papernot et al., 2017; 2018; Jordon et al., 2019). Through knowledge transfer, an ensemble of models can be compressed into a single model. A typical example is the PATE (Private Aggregation of Teacher Ensembles) framework (Papernot et al., 2017). PATE first divides the original dataset into multiple disjoint subsets and trains a teacher model separately on each subset. Then, the teacher ensemble labels a public unlabelled dataset by max voting, i.e., choosing the majority class among the teachers as the label. Last, a student model is trained on the labelled public dataset. An appealing feature of PATE is that it can easily satisfy differential privacy guarantees by adding noise to the vote counts. Moreover, PATE can be applied to any classification model regardless of the training algorithm. However, PATE is not designed for federated learning. Inspired by PATE, we propose FedKT, which adopts the knowledge transfer approach in the federated setting to address the limitations of FedAvg.
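The noisy max-voting step of PATE can be sketched as follows: an illustrative implementation that counts the teachers' votes per class, perturbs each count with Laplace noise of inverse scale gamma, and returns the arg-max class as the label for a public example (the helper name `noisy_max_label` is ours).

```python
import numpy as np

def noisy_max_label(teacher_votes, num_classes, gamma, rng=None):
    # Count the teachers' votes per class, add Laplace noise with scale
    # 1/gamma to each count (smaller gamma means more noise and a stronger
    # privacy guarantee), and label the public example with the noisy max.
    rng = rng if rng is not None else np.random.default_rng()
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))

# Toy usage: 7 of 10 teachers vote for class 1, so with mild noise the
# majority class is almost always returned.
votes = np.array([1, 1, 1, 1, 1, 1, 1, 0, 2, 0])
label = noisy_max_label(votes, num_classes=3, gamma=2.0)
```

The student model never sees the private data or the individual teachers, only the noisily aggregated labels, which is what makes the differential privacy analysis tractable.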

2.3. FEDERATED LEARNING WITH A SINGLE COMMUNICATION ROUND

There are several preliminary studies on federated learning algorithms with a single communication round. Guha et al. (2019) propose a one-shot federated learning algorithm to train support vector machines (SVMs) in both supervised and semi-supervised settings. Instead of simply averaging all the model weights as in FedAvg, Yurochkin et al. (2019) propose PFNM, which adopts a Bayesian nonparametric model to aggregate the local models when they are multilayer perceptrons (MLPs). Their method performs well in a single communication round and can also be applied over multiple communication rounds. While the above two methods are designed for specific models

