FEDEED: EFFICIENT FEDERATED DISTILLATION WITH ENSEMBLE OF AGGREGATED MODELS

Abstract

In this paper, we study the key components of distillation-based model aggregation in federated learning (FL). To that end, we first propose a generalized distillation framework that divides the training and model aggregation process into three key stages and includes existing methods as special cases. By investigating the contribution of each stage, we propose a novel distillation-based FL scheme, named Federated Efficient Ensemble Distillation (FedEED). Unlike existing approaches, the ensemble teacher in FedEED is constructed from aggregated models rather than client models, which improves scalability in large-scale systems. The use of aggregated models also yields a higher level of privacy protection, because access to individual client models is no longer required. Furthermore, knowledge distillation in FedEED flows only from the ensemble teacher to a designated model, so that the diversity among the aggregated models is maintained and the performance of the ensemble teacher improves. Experimental results show that FedEED outperforms state-of-the-art FL schemes, including FedAvg and FedDF, on benchmark datasets. Beyond the performance advantage, designated distillation also allows server-side distillation to run in parallel with client-side local training, which can speed up training in real-world systems.

1. INTRODUCTION

Federated learning (FL) (McMahan et al., 2017) allows users to jointly train a deep learning model without sharing their own data. Recent works (Lin et al., 2020; Zhang et al., 2022; Huang et al., 2022; Cho et al., 2022) adopted knowledge distillation (Hinton et al., 2015) for model aggregation to tackle data and device heterogeneity. These distillation-based aggregation methods have been shown to outperform weight-averaging methods such as FedAvg (McMahan et al., 2017). However, existing works utilize all client models to build the ensemble for knowledge distillation, leading to poor scalability in real-world applications with a large number, e.g., thousands, of clients. Furthermore, existing methods build the ensemble teacher from similar client models, and the low diversity among the constituent models limits the performance of the ensemble teacher.

To tackle the above issues, we study the key components of distillation-based FL and propose an efficient and scalable distillation method. To that end, we first introduce a generalized framework for distillation-based model aggregation, which consists of three major components: the local trainers, the ensemble trainer, and the global trainer. The local trainers perform local training on the client side, with each local trainer handling a sub-group of participating clients. Each local trainer then collects and aggregates the updated client models into one global model, which serves as the global model for all participating clients in its sub-group. Existing FL algorithms such as FedAvg (McMahan et al., 2017) can be used in place as a local trainer. Next, the ensemble trainer combines the updated global models and local models into an ensemble, providing a higher capacity than the stand-alone global models. Finally, the global trainer utilizes the ensemble to further enhance the global models via knowledge distillation.
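The three stages can be sketched in simplified numpy form as follows; the function names, logit averaging, and the temperature value are illustrative assumptions for exposition, not specifics from the paper:

```python
import numpy as np

def fedavg(client_weights, sizes):
    """Local-trainer stage: weight-average one sub-group's client models,
    weighting each client by its local sample count (as in FedAvg)."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, sizes))

def ensemble_teacher(logit_list):
    """Ensemble-trainer stage: average the member models' logits on a
    batch of (unlabeled) distillation data to form the teacher output."""
    return np.mean(logit_list, axis=0)

def kd_loss(student_logits, teacher_logits, T=3.0):
    """Global-trainer stage: KL divergence between the temperature-softened
    teacher and student distributions, which the global model minimizes."""
    def softmax(z):
        e = np.exp(z / T - (z / T).max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    return float((p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean())
```

In a full round, `fedavg` runs once per sub-group, `ensemble_teacher` is evaluated on server-side distillation data, and `kd_loss` drives gradient updates of the global model.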
With the generalized framework, we investigate the contribution of each key component. In particular, we compare the performance of distillation-based FL with different local, ensemble, and global trainers, and tackle the limitations of existing methods in a bottom-up manner. We focus on several aspects of the distillation framework: 1) improving the scalability and privacy of distillation-based aggregation by building the ensemble teacher from a set of aggregated models, so that client models are no longer required to construct the ensemble teacher; 2) maximizing the capacity of the ensemble and global models by maintaining the diversity among the models used to create the ensemble; and 3) reducing the computation overhead by exploiting the parallelism between server-side and client-side training. Integrating these ideas, we propose a new algorithm named Federated Efficient Ensemble Distillation (FedEED). FedEED is a highly scalable, distillation-based FL algorithm that does not require direct access to the client models, which provides further protection of user privacy. FedEED achieves state-of-the-art results on CIFAR-10/100 (Krizhevsky et al., 2009) with non-independent and identically distributed (non-IID) data.
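To make the server-side flow concrete, here is a toy sketch of one FedEED-style round over flat parameter vectors. The grouping, plain averaging, and the interpolation step (a stand-in for gradient-based distillation) are all illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def fedeed_server_round(client_weights, num_groups, step=0.5, seed=0):
    """Toy FedEED-style round: the server keeps only per-group aggregated
    models and never touches the raw client models after grouping."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(client_weights))
    groups = np.array_split(order, num_groups)
    # Each "local trainer" aggregates its sub-group (plain averaging here).
    aggregated = [np.mean([client_weights[i] for i in g], axis=0)
                  for g in groups]
    # The ensemble teacher is built from aggregated models only, so its
    # size is num_groups, independent of the number of clients.
    teacher = np.mean(aggregated, axis=0)
    # Distill only into a single designated model (interpolation stands in
    # for gradient-based distillation); the other aggregated models are
    # left untouched, preserving the diversity of the ensemble.
    designated = aggregated[0] + step * (teacher - aggregated[0])
    return designated, aggregated[1:]
```

Because only the designated model is updated on the server, the remaining aggregated models can be dispatched back to clients for local training while the distillation step runs, which is the parallelism mentioned in point 3) above.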

The contributions of this paper include:

1) We propose a generalized framework for distillation-based model aggregation, which can be viewed as a generalization of existing distillation-based FL algorithms. With the proposed framework, the contribution of each key component can be investigated to build more efficient algorithms.

2) By investigating the contribution of each component in the generalized framework, we propose FedEED, a highly efficient and scalable, distillation-based FL algorithm with improved privacy and model diversity. Experimental results demonstrate that FedEED achieves state-of-the-art performance with lower complexity and latency than existing federated distillation algorithms.

2. RELATED WORKS

Federated learning. Deep learning has achieved great success in the last decade. In practice, however, large amounts of user data cannot be shared with a central server due to privacy regulations and communication constraints. To tackle these issues, federated learning (McMahan et al., 2017) was proposed to train a global model on data belonging to different users, without data sharing. The simplest approach is FedAvg (McMahan et al., 2017), which performs multiple local epochs of training on the client side and then aggregates the updated client models by weight averaging. Other approaches, such as FedProx (Li et al., 2020), also utilize weight averaging for model aggregation, but with added regularization to tackle data heterogeneity. In this paper, we focus on distillation-based model aggregation for FL. For both the generalized framework and the newly proposed FedEED, averaging-based methods such as FedAvg and FedProx can be directly applied as local trainers, making the proposed framework and FedEED compatible with different weight-averaging methods.

Knowledge distillation. Knowledge distillation (Hinton et al., 2015) was proposed in deep learning to compress deep neural networks: student models, which typically have a smaller size, are trained to mimic the output of a teacher model. Recently, knowledge distillation has been applied to FL in two types of approaches. The first type utilizes distillation to perform model aggregation. For example, FedDF (Lin et al., 2020) and FedBE (Chen & Chao, 2021) utilized the client models as a teacher to update the global model on the server, with the purpose of improving model aggregation under heterogeneous data. Other works, including FedFTG (Zhang et al., 2022) and Fed-ET (Cho et al., 2022), utilized distillation for the same purpose but in a different setting, i.e.,
in a data-free or model-heterogeneous environment. The second type shares model predictions between clients and the server for training purposes. For example, in FD (Jeong et al., 2018), model predictions are shared between clients to regularize local training. In FedAD (Gong et al., 2021), model outputs on the client side are used to train a global model on the server through distillation. The generalized framework and FedEED proposed in this paper belong to the first type; in fact, the framework is a generalization of existing works of this type. However, the mechanisms FedEED uses to improve scalability, privacy, and performance are orthogonal to those of existing works and can be combined with them to further improve performance.
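The distillation objective underlying the first type of approach is the classic Hinton et al. (2015) loss. A minimal numpy version is sketched below; the temperature and mixing weight are the commonly used defaults, not values taken from any of the cited papers:

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax along the last axis."""
    m = z.max(axis=-1, keepdims=True)
    return z - m - np.log(np.exp(z - m).sum(axis=-1, keepdims=True))

def hinton_kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.9):
    """(1 - lam) * hard-label cross-entropy + lam * T^2 * soft-label
    cross-entropy against the temperature-softened teacher outputs."""
    ls = log_softmax(student_logits)
    hard = -ls[np.arange(len(labels)), labels].mean()
    p_teacher = np.exp(log_softmax(teacher_logits / T))
    soft = -(p_teacher * log_softmax(student_logits / T)).sum(axis=-1).mean()
    return float((1 - lam) * hard + lam * (T ** 2) * soft)
```

In aggregation-type methods such as FedDF, the teacher logits come from an ensemble of models and the hard-label term is typically dropped (lam = 1), since server-side distillation data may be unlabeled.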

