PERFEDMASK: PERSONALIZED FEDERATED LEARNING WITH OPTIMIZED MASKING VECTORS

Abstract

Recently, various personalized federated learning (FL) algorithms have been proposed to tackle data heterogeneity. To mitigate device heterogeneity, a common approach is to use masking. In this paper, we first show that using random masking can lead to a bias in the obtained solution of the learning model. To address this issue, we propose a personalized FL algorithm with optimized masking vectors called PerFedMask. In particular, PerFedMask facilitates each device to obtain its optimized masking vector based on its computational capability before training. Fine-tuning is performed after training. PerFedMask is a generalization of a recently proposed personalized FL algorithm, FedBABU (Oh et al., 2022). PerFedMask can also be combined with other FL algorithms, including HeteroFL (Diao et al., 2021) and Split-Mix FL (Hong et al., 2022). Results based on the CIFAR-10 and CIFAR-100 datasets show that the proposed PerFedMask algorithm provides a higher test accuracy after fine-tuning and a lower average number of trainable parameters when compared with six existing state-of-the-art FL algorithms in the literature.

1. INTRODUCTION

Federated learning (FL) is a distributed artificial intelligence (AI) framework, which allows multiple edge devices to train a single model collaboratively (Konečný et al., 2015; McMahan et al., 2017). The model is trained under the orchestration of a central server. In a typical FL algorithm, each communication round includes the following steps: (1) the edge devices download the latest model from the server to be used as their local model; (2) each device performs multiple local update iterations to update the local model based on its local dataset; (3) the devices upload their updated local models to the server; (4) the server computes the new model by aggregating the local models. In practical systems, the devices may have diverse and limited computation, communication, and storage capabilities. Moreover, the local datasets available to the devices may differ in size and contain non-independent and identically distributed (non-IID) data samples across the devices. Under these heterogeneous settings, the performance of conventional FL algorithms can degrade (Wang et al., 2020; Li et al., 2021). To handle the case when the data is non-IID, some works (Li et al., 2020a; Karimireddy et al., 2020) have introduced new optimization frameworks to obtain a more stable global model for the devices. Another approach to address the data heterogeneity issue is to design a personalized model for each device (Arivazhagan et al., 2019; Fallah et al., 2020; Collins et al., 2021; Oh et al., 2022). In personalized FL algorithms, instead of training a single model for all the devices, an initial shared model is obtained. This initial model can then be personalized for each device using its local data samples. To overcome the computation limitation of the heterogeneous devices, one common approach is to use masking vectors.
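The four-step communication round described above can be sketched as follows. This is a minimal, self-contained illustration of federated averaging (FedAvg-style aggregation) on a toy quadratic loss, not the training setup used in the paper; the function names and the toy loss are our own illustrative choices.

```python
import numpy as np

def local_update(model, data, lr=0.1, num_iters=5):
    """Steps (1)-(2): start from the downloaded global model and run a few
    local SGD iterations on a toy quadratic loss ||w - mean(data)||^2."""
    w = model.copy()
    target = data.mean(axis=0)
    for _ in range(num_iters):
        grad = 2.0 * (w - target)  # gradient of the toy local loss
        w -= lr * grad
    return w

def fedavg_round(global_model, local_datasets):
    """Steps (3)-(4): devices upload their local models and the server
    aggregates them, weighting by local dataset size."""
    local_models = [local_update(global_model, data) for data in local_datasets]
    sizes = np.array([len(d) for d in local_datasets], dtype=float)
    weights = sizes / sizes.sum()
    return sum(w_k * m_k for w_k, m_k in zip(weights, local_models))

rng = np.random.default_rng(0)
global_model = np.zeros(4)
# Non-IID setting: each device's data is centered at a different point.
datasets = [rng.normal(loc=c, size=(20, 4)) for c in (1.0, -1.0, 3.0)]
for _ in range(10):
    global_model = fedavg_round(global_model, datasets)
```

After several rounds, the global model approaches the (size-weighted) average of the devices' local optima, which illustrates why a single shared model can sit far from every individual device's optimum under non-IID data.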
Masking vectors can be used to train only a sub-network of the learning model for each device based on the computational capability of that device. Masking vectors can be combined with pruning and freezing methods. Pruning methods utilize masking vectors to keep the important parameters of the learning model and remove the unimportant ones from the model architecture. However, leveraging pruning in FL may incur additional communication overhead (Babakniya et al., 2022; Bibikar et al., 2022). Moreover, it results in different model architectures across the devices (Guo et al., 2016). This may lead to accuracy loss, particularly when data heterogeneity exists in the system (Hong et al., 2022). In the freezing methods, the masking vectors are used to freeze some parts of the learning model for each device. Unlike pruning, the masked parameters are not removed but are frozen during local updates. Hence, a more stable FL algorithm is obtained without changing the learning model architecture. Sidahmed et al. (2021) and Pfeiffer et al. (2022) have shown that freezing methods can reduce the computational and communication resources required for training the learning model in FL. However, the aforementioned works use heuristic approaches for designing the masking vectors, and do not provide a theoretical analysis for their choice. They also do not address the data heterogeneity issue in their proposed algorithms. In this work, we aim to answer the following question: By exploiting the freezing method in FL, what is a systematic approach to determine the masking vectors that can improve the final test accuracy in a setting with data and device heterogeneities? We first show that using the masking vectors to freeze the model parameters for the devices may lead to a bias in the convergence bound. This bias can hinder the success of employing masking vectors to tackle the device heterogeneity issue in FL.
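The distinction between pruning and freezing can be made concrete with a one-line update rule. In the sketch below, a binary masking vector zeroes out the gradient of the frozen entries, so masked parameters keep their values and the model architecture is unchanged; under pruning, those entries would instead be removed from the architecture entirely. The function name and values are illustrative only.

```python
import numpy as np

def masked_sgd_step(params, grads, mask, lr=0.1):
    """Freezing via a binary mask: entries with mask == 0 receive no update,
    but they remain part of the model (unlike pruning, nothing is removed)."""
    return params - lr * (mask * grads)

params = np.array([1.0, 2.0, 3.0, 4.0])
grads = np.array([0.5, 0.5, 0.5, 0.5])
mask = np.array([1.0, 1.0, 0.0, 0.0])  # this device can only afford to train half the parameters
updated = masked_sgd_step(params, grads, mask)
# Frozen entries are unchanged; trainable entries take a normal SGD step.
```

Because all devices keep the full architecture, the server can aggregate the local models directly, which is what makes freezing more stable than pruning under heterogeneity.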
Using the insights from our analysis, we propose PerFedMask, Personalized Federated Learning with Optimized Masking Vectors (see Fig. 1). Specifically, by decoupling the learning model into a global model and a local head model, we first freeze the local head model for all the devices. Then, we freeze a portion of the global model for each device based on its computational capability. In our work, the masking vectors are determined before training by minimizing the bias term in the convergence bound. This approach can mitigate the bias, although it may not eliminate it completely. Thus, after the global model is trained, the previously frozen local head model assists in fine-tuning the entire learning model for each device. We demonstrate empirically the effectiveness of PerFedMask under heterogeneous settings when compared with six existing state-of-the-art FL algorithms. PerFedMask has several distinct advantages: (1) PerFedMask generalizes the recently proposed personalized FL algorithm FedBABU (Oh et al., 2022). In particular, FedBABU is a special case of PerFedMask in which all the devices have the same computational capability. (2) PerFedMask is flexible. Since PerFedMask does not change the model architecture, it can be combined with other FL algorithms such as HeteroFL (Diao et al., 2021) and Split-Mix FL (Hong et al., 2022) to further improve the performance. (3) PerFedMask can address the objective inconsistency problem, which arises due to different numbers of local update iterations. Unlike FedNova (Wang et al., 2020), which requires modifying the device optimizers to tackle this issue, in PerFedMask we consider the same number of local update iterations for all the devices, while adjusting the required number of computations for those devices with lower computational capabilities.
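The overall workflow of decoupling, per-device masking, and fine-tuning can be sketched as below. This is only a structural toy: the head is frozen for everyone during training, each device masks its global-model update according to its capability, and fine-tuning happens locally afterwards. The mask heuristic here (train the first fraction of parameters) is a deliberately naive stand-in; PerFedMask instead optimizes the masks before training by minimizing the bias term in the convergence bound. All names (`device_mask`, `local_round`) are hypothetical.

```python
import numpy as np

GLOBAL_DIM, HEAD_DIM = 8, 3

def device_mask(capability, dim=GLOBAL_DIM):
    """Illustrative heuristic mask: train the first capability*dim global
    parameters. (PerFedMask optimizes this choice; this is only a stand-in.)"""
    m = np.zeros(dim)
    m[: int(capability * dim)] = 1.0
    return m

def local_round(global_params, mask, grads, lr=0.1):
    """Local update with the head frozen and the global model partially masked."""
    return global_params - lr * (mask * grads)

rng = np.random.default_rng(1)
capabilities = [1.0, 0.5]                      # two devices, unequal compute budgets
masks = [device_mask(c) for c in capabilities]
global_params = np.zeros(GLOBAL_DIM)
head_params = rng.normal(size=HEAD_DIM)        # frozen for all devices during training

for _ in range(3):  # federated training of the global model only
    grads = [rng.normal(size=GLOBAL_DIM) for _ in capabilities]
    local_models = [local_round(global_params, m, g) for m, g in zip(masks, grads)]
    global_params = np.mean(local_models, axis=0)

# After training: each device fine-tunes the entire model (global + head) locally.
personalized = [np.concatenate([global_params, head_params]) for _ in capabilities]
```

Note that when all capabilities equal 1, every mask is all-ones and the procedure reduces to training the full global model with a frozen head, which mirrors how FedBABU arises as a special case.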

2. RELATED WORK

FL Algorithms with non-IID Data. FedAvg (McMahan et al., 2017) is the most popular FL algorithm, which aims to find a single model for all the devices. However, the local data samples at the

Figure 1: Illustration of an FL system using PerFedMask. The model is decoupled into a global model and a local head model. The local head model remains unchanged during training. The devices collaboratively train the global model. Some parts of the global model can be frozen for the devices during local updates using the optimized masking vectors. After training, a personalized model is obtained for each device by fine-tuning.

