EFFICIENT PERSONALIZED FEDERATED LEARNING VIA SPARSE MODEL-ADAPTATION

Abstract

Federated Learning (FL) aims to train machine learning models for multiple clients without sharing their private data. Due to the heterogeneity of clients' local data distributions, recent studies explore personalized FL, which learns and deploys distinct local models with the help of auxiliary global models. However, clients can be heterogeneous not only in their local data distributions, but also in their computation and communication resources. The capacity and efficiency of personalized models are restricted by the lowest-resource clients, leading to sub-optimal performance and limited practicality of personalized FL. To overcome these challenges, we propose a communication- and computation-efficient approach named pFedGate that adaptively and efficiently learns sparse local models. With a lightweight trainable gating layer, pFedGate enables clients to reach their full potential in model capacity by generating different sparse models that account for both the heterogeneous data distributions and resource constraints. Meanwhile, both computation and communication efficiency are improved thanks to the adaptability between model sparsity and clients' resources. Further, we theoretically show that the proposed pFedGate has superior complexity with guaranteed convergence and generalization error. Extensive experiments show that pFedGate achieves superior global accuracy, individual accuracy and efficiency simultaneously over state-of-the-art methods. We also demonstrate that pFedGate outperforms competitors in the novel-client participation and partial-client participation scenarios, and can learn meaningful sparse local models adapted to different data distributions.

1. INTRODUCTION

Federated Learning (FL) gains increasing popularity in machine learning scenarios where the data are distributed across different places and cannot be transmitted due to privacy concerns (Muhammad et al., 2020; Meng et al., 2021; Yu et al., 2021; Hong et al., 2021; Yang et al., 2021). Typical FL trains a unique global model from multiple data owners (clients) by transmitting and aggregating intermediate information with the help of a centralized server (McMahan et al., 2017; Kairouz et al., 2021). Although using a shared global model for all clients shows promising average performance, the inherent statistical heterogeneity among clients challenges the existence and convergence of the global model (Sattler et al., 2020; Li et al., 2020). Recently, emerging efforts introduce personalization into FL by learning and deploying distinct local models (Yang et al., 2019; Karimireddy et al., 2020; Tan et al., 2021). The distinct models are designed to fit the heterogeneous local data distributions via techniques that model the relationships between the global model and personalized local models, such as multi-task learning (Collins et al., 2021), meta-learning (Dinh et al., 2020a), model mixture (Li et al., 2021c), knowledge distillation (Zhu et al., 2021) and clustering (Ghosh et al., 2020). However, the heterogeneity among clients exists not only in local data distributions, but also in their computation and communication resources (Chai et al., 2019; 2020). The lowest-resource clients restrict the capacity and efficiency of the personalized models for the following reasons: (1) the adopted model architecture is usually assumed to be the same across all clients for aggregation compatibility, and (2) clients' communication bandwidth and participation frequency usually determine how much they can contribute to the model training of other clients and how quickly they can agree on a converged "central point" w.r.t. their local models.
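The server-side aggregation of intermediate information described above can be sketched as a FedAvg-style weighted parameter average (McMahan et al., 2017). This is a minimal numpy illustration of the generic FL aggregation step, not the method proposed in this paper; the function and variable names are our own.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style).

    client_weights: list of dicts mapping parameter name -> np.ndarray
    client_sizes:   number of local samples per client (aggregation weights)
    """
    total = sum(client_sizes)
    agg = {}
    for name in client_weights[0]:
        agg[name] = sum(
            (n / total) * w[name] for w, n in zip(client_weights, client_sizes)
        )
    return agg

# Two toy clients sharing one parameter tensor; client 2 holds 3x more data.
clients = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
sizes = [1, 3]
global_model = fedavg_aggregate(clients, sizes)
# 0.25 * [1, 2] + 0.75 * [3, 4] = [2.5, 3.5]
```

Note that every client must ship parameters of the same shape for this average to be well-defined, which is exactly the aggregation-compatibility constraint discussed above.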
This resource heterogeneity is under-explored in most existing personalized FL (pFL) works; instead, they gain accuracy improvements at the cost of a large amount of additional computation or communication. Without a special design that takes efficiency and resource heterogeneity into account, pFL achieves only sub-optimal performance and limited practicality. To overcome these challenges, in this paper, we propose a novel method named pFedGate for efficient pFL, which learns to generate personalized sparse models based on adaptive gated weights and different clients' resource constraints. Specifically, we introduce a lightweight trainable gating layer for each client, which predicts sparse, continuous and block-wise gated weights and transforms the global model shared across all clients into a personalized sparse one. The gated-weight prediction is conditioned on specific samples for better estimation of the heterogeneous data distributions. Thanks to the adaptability between model sparsity and clients' resources, the personalized models and the FL training process gain better computation and communication efficiency. As a result, the model adaptation via sparse gated weights delivers the double benefit of personalization and efficiency: (1) the sparse model adaptation enables each client to reach its full potential in model capacity with no need for compatibility with other low-resource clients, and to deal with a small, focused hypothesis space restricted by the personalized sparsity and the local data distribution; (2) different resource restrictions can be easily imposed on the predicted weights since we consider block-wise masking under a flexible combinatorial optimization setting. We further provide a space-time complexity analysis to show pFedGate's superiority over state-of-the-art (SOTA) methods, and provide theoretical guarantees for pFedGate in terms of its generalization and convergence.
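To make the mechanism concrete, the following is a rough numpy sketch of sample-conditioned block-wise gating under a sparsity budget. It is a simplified illustration, not the paper's actual algorithm: the blocks are square matrices, the gating layer is a single linear map (`gate_W`, a hypothetical name), and a simple top-k selection stands in for the paper's combinatorial optimization over block-wise masks.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_gated_forward(x, blocks, gate_W, sparsity_budget):
    """Sketch of sample-conditioned block-wise gating.

    x:               one input sample, shape (d,)
    blocks:          list of (d, d) weight matrices (shared global model blocks)
    gate_W:          (d, num_blocks) gating-layer parameters (hypothetical)
    sparsity_budget: fraction of blocks the client's resources allow, in (0, 1]
    """
    # The lightweight gating layer predicts a continuous score per block
    # from the sample itself, so gating adapts to the local data.
    scores = x @ gate_W                        # shape: (num_blocks,)
    k = max(1, int(round(sparsity_budget * len(blocks))))
    keep = np.argsort(scores)[-k:]             # keep top-k blocks within budget
    gates = np.zeros(len(blocks))
    gates[keep] = scores[keep]                 # sparse, continuous gated weights
    # Forward pass touches only the retained (scaled) blocks, so the
    # computation cost shrinks with the sparsity budget.
    h = x.copy()
    for g, B in zip(gates, blocks):
        if g != 0.0:
            h = g * (h @ B)
    return h, gates

d, num_blocks = 4, 3
x = rng.normal(size=d)
blocks = [rng.normal(size=(d, d)) for _ in range(num_blocks)]
gate_W = rng.normal(size=(d, num_blocks))
out, gates = block_gated_forward(x, blocks, gate_W, sparsity_budget=0.5)
# With a 0.5 budget over 3 blocks, only 2 gated weights are nonzero.
```

Because the mask is block-wise rather than element-wise, a client's resource constraint translates directly into how many blocks it computes and communicates, which is what ties the sparsity level to the client's hardware budget.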
We evaluate the proposed pFedGate on four FL benchmarks against several SOTA methods. We show that pFedGate achieves superior global accuracy, individual accuracy and efficiency simultaneously (up to 4.53% average accuracy improvement with 12x smaller sparsity than the strongest compared pFL method). We demonstrate the effectiveness and robustness of pFedGate in the partial-client participation and novel-client participation scenarios. We also find that pFedGate can learn meaningful sparse local models adapted to different data distributions, and conduct various experiments to verify the necessity and effectiveness of pFedGate's components. Our main contributions can be summarized as follows:
• We exploit the potential of co-designing model compression and personalization in FL, and propose a novel computation- and communication-efficient pFL approach that learns to generate sparse local models with fine-grained sample-level adaptation.
• We provide a new formulation for efficient pFL that considers clients' heterogeneity in both local data distribution and hardware resources, and provide theoretical results about the generalization, convergence and complexity of the proposed method.
• We achieve SOTA results on several FL benchmarks and illustrate the feasibility of simultaneously gaining better efficiency and accuracy for pFL. To facilitate further studies, we release our code at https://github.com/AnonyMLResearcher/pFedGate.

2. RELATED WORKS

Personalized FL. Personalized FL draws increasing attention as a natural way to improve FL performance for heterogeneous clients. Many efforts have been devoted to it via multi-task learning (Smith et al., 2017; Corinzia & Buhmann, 2019; Huang et al., 2021; Marfoq et al., 2021), model mixture (Zhang et al., 2020; Li et al., 2021c), clustering (Briggs et al., 2020; Sattler et al., 2020; Chai et al., 2020), knowledge distillation (Lin et al., 2020; Zhu et al., 2021; Ozkara et al., 2021), meta-learning (Khodak et al., 2019; Jiang et al., 2019; Fallah et al., 2020; Singhal et al., 2021), and transfer learning (Yang et al., 2020; He et al., 2020; Zhang et al., 2021a). Although effective in accuracy improvements, most works pay the cost of additional computation or communication compared to non-personalized methods. For example, Sattler et al. (2020) consider group-wise client relationships, requiring client-wise distance calculations that are computationally intensive in cross-device scenarios. Fallah et al. (2020) leverage model-agnostic meta-learning to enable fast local personalized training, which requires computationally expensive second-order gradients. Zhang et al. (2021a) learn pair-wise client relationships and need to store and compute a similarity matrix with square complexity w.r.t. the number of clients. Marfoq et al. (2021) learn a mixture of multiple global models, which multiplies the storage and communication costs. Our work differs from these works by considering a practical setting where clients are heterogeneous in both data distribution and hardware resources. Under this setting, we achieve personalization from a novel

