FEDPSE: PERSONALIZED SPARSIFICATION WITH ELEMENT-WISE AGGREGATION FOR FEDERATED LEARNING

Abstract

Federated learning (FL) is a popular distributed machine learning framework in which clients aggregate models' parameters instead of sharing their individual data. In FL, clients frequently communicate with the server under limited network bandwidth, which gives rise to a communication bottleneck. To alleviate this bottleneck, multiple compression methods have been proposed to reduce the number of transmitted parameters. However, with these techniques, federated performance degrades significantly on non-IID (not independent and identically distributed) datasets. To address this issue, we propose an effective method, called FedPSE, which solves the efficiency challenge of FL with heterogeneous data. FedPSE compresses the local updates on clients using Top-K sparsification and aggregates these updates on the server by element-wise averaging. Clients then download personalized sparse updates from the server to update their individual local models. We theoretically analyze the convergence of FedPSE under the non-convex setting. Moreover, extensive experiments on four benchmark tasks demonstrate that FedPSE outperforms state-of-the-art methods on non-IID datasets in terms of both efficiency and accuracy.

1. INTRODUCTION

Federated learning (FL) is a prevailing distributed framework that can prevent clients' sensitive data from being disclosed (Kairouz et al., 2021; McMahan et al., 2017b). The naive FL procedure consists of three steps: uploading clients' models to the server after local training, aggregating them globally, and downloading the aggregated model from the server. In practice, the weight updates ∆W = W_new − W_old can be communicated instead of the model weights W (Asad et al., 2021; Li et al., 2021a). Recently, FL has been increasingly applied in multiple tasks, such as computer vision, recommender systems, and medical diagnosis (Bibikar et al., 2021; Kairouz et al., 2021; Qayyum et al., 2020; Xu et al., 2021).
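The three-step round above can be sketched as follows. This is a minimal illustration of communicating weight updates ∆W rather than full weights, not the paper's method; the single-gradient-step `local_train` on a synthetic least-squares objective is an assumption made purely for demonstration.

```python
import numpy as np

def local_train(w, data, lr=0.1):
    # Stand-in for local training: one gradient step on a
    # synthetic least-squares objective (illustrative only).
    X, y = data
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def fl_round(w_global, client_datasets):
    """One naive FL round: each client trains locally and uploads
    its weight update dW = W_new - W_old; the server averages the
    updates and applies them to the global model."""
    deltas = []
    for data in client_datasets:
        w_new = local_train(w_global.copy(), data)
        deltas.append(w_new - w_global)  # transmit dW, not W
    return w_global + np.mean(deltas, axis=0)
```

Communicating ∆W instead of W leaves the round mathematically equivalent for simple averaging, but makes the transmitted tensor amenable to compression, since updates are typically smaller in magnitude and more compressible than raw weights.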

1.1. EXISTING PROBLEM

Despite the aforementioned advantage, the communication cost of FL is overburdened by the fact that the server and clients exchange massive numbers of parameters frequently (Asad et al., 2021; Kairouz et al., 2021). Furthermore, the upstream/downstream bandwidth between the server and clients is usually limited, such as wireless connections in the cross-device (ToC) setting and dedicated networks in the cross-silo (ToB) setting, which further decreases communication efficiency (Li et al., 2021a; Sattler et al., 2019). FL is therefore much more time-consuming than traditional centralized machine learning, especially when the model parameters are massive, as in cross-silo FL scenarios (Qayyum et al., 2020; Shi et al., 2020). It is thus necessary to optimize the bidirectional communication cost to minimize the training time of FL (Bernstein et al., 2018; Philippenko & Dieuleveut, 2021; Sattler et al., 2019; Wen et al., 2017). To resolve this challenge, various methods have been proposed, such as matrix decomposition (Li et al., 2021c; McMahan et al., 2017b), quantization (Li et al., 2021a; Sattler et al., 2019), and sparsification (Gao et al., 2021; Mostafa & Wang, 2019; Wu et al., 2020; Yang et al., 2021b). Although these algorithms can significantly reduce the quantity of communicated information, most of them only perform well on IID data.
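Among the compression methods listed above, sparsification is the one FedPSE builds on. As a minimal sketch (not the paper's full algorithm), Top-K sparsification keeps only the k largest-magnitude entries of an update and transmits their values and indices; the helper name and interface below are assumptions for illustration.

```python
import numpy as np

def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries of an update tensor and
    zero out the rest. Returns the sparse update plus the surviving
    indices (the values and indices are what would be transmitted)."""
    flat = update.ravel()
    # argpartition finds the k largest |entries| in O(n) time
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(update.shape), idx
```

Transmitting k values and k indices instead of the full dense update reduces the upstream payload from n floats to roughly 2k entries, which is the source of the communication savings that sparsification methods exploit.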

