FEDPSE: PERSONALIZED SPARSIFICATION WITH ELEMENT-WISE AGGREGATION FOR FEDERATED LEARNING

Abstract

Federated learning (FL) is a popular distributed machine learning framework in which clients aggregate models' parameters instead of sharing their individual data. In FL, clients communicate with the server frequently under limited network bandwidth, which gives rise to a communication challenge. To resolve this challenge, multiple compression methods have been proposed to reduce the number of transmitted parameters. However, with these techniques, federated performance degrades significantly on Non-IID (non-independently and identically distributed) datasets. To address this issue, we propose an effective method, called FedPSE, which solves the efficiency challenge of FL with heterogeneous data. FedPSE compresses the local updates on clients using Top-K sparsification and aggregates these updates on the server by element-wise averaging. Clients then download personalized sparse updates from the server to update their individual local models. We theoretically analyze the convergence of FedPSE under the non-convex setting. Moreover, extensive experiments on four benchmark tasks demonstrate that FedPSE outperforms state-of-the-art methods on Non-IID datasets in terms of both efficiency and accuracy.

1. INTRODUCTION

Federated learning (FL) is a prevailing distributed framework that prevents clients' sensitive data from being disclosed (Kairouz et al., 2021; McMahan et al., 2017b). Naive FL involves three steps: uploading clients' models to the server after local training, global aggregation, and downloading the aggregated model from the server. In practice, the weight updates ΔW = W_new − W_old can be communicated instead of the model weights W (Asad et al., 2021; Li et al., 2021a). Recently, FL has been increasingly applied in multiple tasks, such as computer vision, recommender systems, and medical diagnosis (Bibikar et al., 2021; Kairouz et al., 2021; Qayyum et al., 2020; Xu et al., 2021).

1.1. EXISTING PROBLEM

Despite the aforementioned advantage, the communication cost of FL is burdensome because the server and clients exchange massive numbers of parameters frequently (Asad et al., 2021; Kairouz et al., 2021). Furthermore, upstream/downstream bandwidth between the server and clients is usually limited, such as wireless connections in cross-device (ToC) FL and dedicated networks in the cross-silo (ToB) setting, which further decreases communication efficiency (Li et al., 2021a; Sattler et al., 2019). FL is much more time-consuming than traditional centralized machine learning, especially when the model parameters are massive under cross-silo FL scenarios (Qayyum et al., 2020; Shi et al., 2020). Therefore, it is necessary to optimize the bidirectional communication cost to minimize the training time of FL (Bernstein et al., 2018; Philippenko & Dieuleveut, 2021; Sattler et al., 2019; Wen et al., 2017). To resolve this challenge, various methods have been proposed, such as matrix decomposition (Li et al., 2021c; McMahan et al., 2017b), quantization (Li et al., 2021a; Sattler et al., 2019), and sparsification (Gao et al., 2021; Mostafa & Wang, 2019; Wu et al., 2020; Yang et al., 2021b). Although these algorithms can significantly reduce the quantity of communicated information, most of them only work well under the ideal condition of IID (identically and independently distributed) datasets (Li et al., 2021c; Sattler et al., 2019; Wen et al., 2017). In fact, the isolated datasets on clients are usually heterogeneous, because each dataset belongs to a particular client with a specific geographic location and time window of data collection (Kairouz et al., 2021; Kulkarni et al., 2020; Xu & Huang, 2022; Yang et al., 2021a). Hence, current compression techniques, which ignore the personalization of clients, face significant performance degradation on Non-IID datasets (Liu et al., 2022; Sattler et al., 2019; Wu et al., 2020).

1.2. SOLUTION

To bridge this gap, we propose a Personalized Sparsification with Element-wise aggregation for the cross-silo federated learning (FedPSE) paradigm, as shown in Figure 1. In the first step of FedPSE, with efficiency and personalization in mind, clients train their models on local datasets and upload sparse updates to the server, as shown in Figure 1(a). The kept indices of these compressed updates are likely to differ from each other due to the heterogeneity of clients' datasets. Secondly, we leverage element-wise averaging to aggregate the collected sparse updates on the server, which relieves the bias of the traditional aggregation method, as shown in Figure 1(b). Lastly, the server sparsifies the downstream parameters for each client in a personalized manner, as shown in Figure 1(c). In particular, the downstream updates transferred from the server to each client also contain k individual elements each, preserving the overall compression ratio. Please see Section 4 for more details. In this way, FedPSE compresses both upstream and downstream communication overhead with personalization in mind.

1.3. CONTRIBUTION

We summarize our main contributions as follows:

• We propose a novel personalized sparsification with an element-wise aggregation framework for FL, which resolves the bidirectional communication challenge on Non-IID datasets.

• We propose an element-wise aggregation method, which improves the performance of FL with sparse aggregated matrices.

• We propose a downstream selection mechanism to personalize the clients' models, which adapts to various distributions and significantly increases performance in the Non-IID setting.

• We provide a convergence analysis of our method as well as extensive experiments on four benchmark datasets; the results demonstrate that our proposed FedPSE outperforms existing state-of-the-art FL frameworks on Non-IID datasets in terms of both efficiency and accuracy.

2. RELATED WORK

In this section, we briefly review optimization methods that focus on the core challenges in FL.

2.1. COMMUNICATION EFFICIENCY

Although FedAVG (McMahan et al., 2017a), the naive federated algorithm, can decrease communication cost by allowing multiple local steps, the massive number of parameters transmitted in one communication step is still a critical bottleneck. In general, there are three kinds of compression approaches to tackle this problem: matrix decomposition (Huang et al., 2022; Li et al., 2021c), quantization (Bernstein et al., 2018; Li et al., 2021a), and sparsification (Gao et al., 2021; Sattler et al., 2019; Wu et al., 2020). Firstly, decomposition methods, which decompose the transmitted matrix, are impractical, offering a smaller compression ratio at a higher computational cost. Secondly, quantization-based methods, which limit the number of bits, have an upper bound on the compression ratio and slow down convergence in terms of training iterations (Chai et al., 2020). Thirdly, sparsification-based methods mask some elements of the transmitted matrices, among which Top-K sparsification with residual error is most widely used due to its promising performance and convergence guarantee (Dan Alistarh et al., 2018; Gao et al., 2021; Wu et al., 2020; Yang et al., 2021b). We therefore leverage the error-compensated Top-K compressor to reduce the communication overhead in this article.

2.2. PERSONALIZATION

Since FL relies on the stochastic gradient descent (SGD) algorithm to train neural networks, its performance is easily biased by Non-IID datasets, e.g., feature skew, label skew, etc. (Kairouz et al., 2021; Zhu et al., 2021). Different strategies have been applied to FL with heterogeneity in mind (e.g., data shuffling, multi-task learning, and local model optimization) (Achituve et al., 2021; Chen & Chao, 2021; Huang et al., 2021; Li et al., 2021b; Ma et al., 2022; Oh et al., 2021; T Dinh et al., 2020; Xu & Huang, 2022; Zhang et al., 2021). Although these state-of-the-art methods perform well under the Non-IID condition, communication efficiency is often ignored.

2.3. COMPRESSION WITH PERSONALIZATION

In short, few algorithms take both of the above challenges into consideration. To the best of our knowledge, only FedSTC (Sattler et al., 2019) and FedSCR (Wu et al., 2020) claim to resolve the communication burden with Non-IID data in one shot. Specifically, FedSTC compresses the clients' updates by combining Top-K sparsification and ternary quantization. Similarly, FedSCR compresses the transferred information by removing redundant updates that fall below an adaptive threshold. However, the clients in FedSTC and FedSCR share the same model, which significantly decreases federated performance in the Non-IID setting. In this paper, we propose a personalized compression paradigm for FL, which offers competitive efficiency and improved performance on heterogeneous data.

3. PRELIMINARY

In this section, we present the preliminary techniques behind our proposal: FedAVG and Top-K sparsification.

3.1. FEDAVG

FedAVG (McMahan et al., 2017a), a basic FL algorithm, builds distributed machine learning models via model aggregation rather than data aggregation. Suppose there are N clients with datasets {D_1, D_2, ..., D_N} in FL. During the r-th federated round, client i, i ∈ {1, ..., N}, trains the local model W_i^{r−1} on its n_i samples individually and uploads the weight updates ΔW_i^r to the server for aggregation. The server then leverages the averaging function to generate the global matrix:

ΔW_s^r ← Σ_{i=1}^N (n_i / Σ_{j=1}^N n_j) ΔW_i^r.

In the end, the global matrix ΔW_s^r is sent back to each client to update the local model: W_i^r ← W_i^{r−1} + ΔW_s^r.
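The weighted-average step above can be sketched in a few lines. This is a minimal NumPy sketch; the function name is ours, not from any FedAVG reference implementation:

```python
import numpy as np

def fedavg_aggregate(updates, sample_counts):
    """Weighted average of client updates: sum_i (n_i / sum_j n_j) * dW_i."""
    total = float(sum(sample_counts))
    return sum((n / total) * dw for dw, n in zip(updates, sample_counts))

# A client holding three times the data contributes three times the weight.
dw1, dw2 = np.array([1.0, 1.0]), np.array([3.0, 3.0])
agg = fedavg_aggregate([dw1, dw2], [1, 3])  # -> array([2.5, 2.5])
```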

3.2. TOP-K SPARSIFICATION

Top-K sparsification is the most widely used compression method in FL, which maintains stable performance even with a high proportion of ignored parameters (Gao et al., 2021; Wu et al., 2020; Sattler et al., 2019; Yang et al., 2021b). The Top-K compressor keeps the K elements of the input matrix with the largest absolute values and sets the rest to zero.
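A minimal sketch of the Top-K compressor, with the residual (error feedback) kept locally. We read the sparsity ratio `p` as the fraction of entries dropped, matching the paper's convention that p = 0.9 transmits only 10% of the parameters; this reading and the helper names are our assumptions:

```python
import numpy as np

def top_k(x, p):
    """Keep the k = ceil((1 - p) * size) largest-magnitude entries of a
    flat array; zero the rest."""
    k = max(1, int(np.ceil((1.0 - p) * x.size)))
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x|
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

x = np.array([0.1, -2.0, 0.3, 1.5])
sparse = top_k(x, p=0.5)       # keeps the two largest-magnitude entries
residual = x - sparse          # error kept locally and re-added next round
```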

4. METHOD

In this section, we first give an overview of the proposed FedPSE paradigm. We then present its three main components, i.e., Upstream Personalized Sparsification (UPS), Element-Wise Aggregation (EWA), and Downstream Personalized Sparsification (DPS). Finally, we summarize the whole algorithm.

4.1. OVERVIEW

Our purpose with FedPSE is to design an FL paradigm that reduces the bidirectional communication cost while improving performance on Non-IID datasets. We optimize the steps of FL according to its training process, as shown in Figure 1. Firstly, in order to reduce the upstream communication burden, we compress the information sent from clients to the server. Secondly, as the global updates may be biased by this compression, it is worth developing a new aggregation method suitable for sparse matrices. Thirdly, since clients with Non-IID datasets should have individual local models, the server needs to personalize the downstream information to each client while keeping the compression sparsity. We motivate each of these methods in the following paragraphs.

Algorithm 1: UPS
Input: downstream personalized sparse updates ΔŴ_{s,i}^{r−1}, sparsity ratio p
Output: upstream sparse updates ΔŴ_i^r, sample count n_i^r
1:  W_i^{r,0} = W_i^{r−1,0} + ΔŴ_{s,i}^{r−1}
2:  Reset n_i^r = 0
3:  for each local training step t = 1, ..., T do
4:      Sample a batch: B_i^{r,t} ∼ D_i
5:      g_i^{r,t} = ∇f_i(W_i^{r,t−1}, B_i^{r,t})
6:      W_i^{r,t} = Opt(W_i^{r,t−1}, g_i^{r,t}, α)
7:      n_i^r = n_i^r + |B_i^{r,t}|
8:  end
9:  ΔW_i^r = W_i^{r,T} − W_i^{r,0} + e_i^{r−1}
10: ΔŴ_i^r = Top-K(ΔW_i^r, p)
11: e_i^r = ΔW_i^r − ΔŴ_i^r
12: return ΔŴ_i^r, n_i^r

The first step, upstream personalized sparsification, compresses clients' updates after local training, inspired by prior work (Sattler et al., 2019; Wu et al., 2020). According to Wu et al. (2020), the matrices of weight updates are structure-sparse: most elements are redundant in the Non-IID setting. We therefore leverage the Top-K sparsification operator with residual error to compress the weight updates, whose convergence rate has been proved theoretically (Gao et al., 2021; Haddadpour et al., 2021). Naturally, the indices reserved by the compressed updates follow personalized distributions on heterogeneous datasets. Each client then transmits its individual updates to the server for the following aggregation.

The second step is to propose an appropriate aggregation method for the sparse updates. In Appendix A, we demonstrate with two examples under the IID/Non-IID settings the bias of FedAVG caused by sparsification, and prove the effectiveness of the EWA method, which is more suitable for sparse aggregation. Please see Appendix A for details.

Thirdly, the server should sparsify the downstream updates for each client with personalization in mind. To quantitatively measure the divergence between the global distribution and a local distribution, we compress the global updates to k elements and compute their correlation distances with the clients' upstream updates. These correlation distances are likely to differ from each other in the Non-IID setting. Finally, the personalized downstream updates are selected according to the correlation distances. Further details can be found in Section 4.4.

4.2. UPSTREAM PERSONALIZED SPARSFICATION (UPS)

We suppose that N clients with their own training datasets {D_1, D_2, ..., D_N} and a server participate in FL. At the beginning of FL, client i (∀i ∈ P, P = {1, 2, ..., N}) initializes its weights W_i with the same parameters W^0 (Zhao et al., 2018). Algorithm 1 presents the process of local training on client i during the r-th federated round, where the inputs, i.e., the weight updates ΔŴ_{s,i}^{r−1} and the sparsity ratio p, are transmitted from the server. The first step of local training is updating the client model with the received ΔŴ_{s,i}^{r−1} (line 1). Secondly, client i trains its individual model on n_i^r samples (lines 3-8), where f_i denotes the loss function and Opt (e.g., SGD or Adam) denotes the model's optimizer with learning rate α. After that, client i compresses the updates via the error-compensated Top-K compressor (lines 9-11). Finally, the sparse updates ΔŴ_i^r and the number of local training samples n_i^r are uploaded to the server for aggregation.
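The local-training steps of Algorithm 1 can be sketched as follows. This is a minimal NumPy sketch with plain SGD standing in for Opt; `grad_fn` is a stand-in for the stochastic gradient of f_i, and all names here are ours:

```python
import numpy as np

def top_k(x, p):
    """Keep the ceil((1 - p) * size) largest-magnitude entries; zero the rest."""
    k = max(1, int(np.ceil((1.0 - p) * x.size)))
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def ups_round(w_prev, server_update, batches, grad_fn, lr, p, residual):
    """One client round in the spirit of Algorithm 1 (UPS)."""
    w = w_prev + server_update          # line 1: apply downstream update
    w0, n = w.copy(), 0
    for batch in batches:               # lines 3-8: local training (SGD)
        w = w - lr * grad_fn(w, batch)
        n += len(batch)
    delta = w - w0 + residual           # line 9: error compensation
    sparse = top_k(delta, p)            # line 10: Top-K sparsification
    new_residual = delta - sparse       # line 11: keep the rest locally
    return sparse, n, new_residual

# Toy loss f(w) = ||w||^2 / 2, so the gradient is w (batch contents unused).
w = np.array([1.0, 10.0])
sparse, n, res = ups_round(w, np.zeros(2), [[0]], lambda w, b: w,
                           lr=0.1, p=0.5, residual=np.zeros(2))
# One SGD step: w -> 0.9 * w, delta = [-0.1, -1.0]; Top-K keeps only -1.0.
```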

4.3. ELEMENT-WISE AGGREGATION (EWA)

Algorithm 2: EWA
Input: sparse updates ΔŴ_i^r and sample counts n_i^r, i ∈ P
Output: aggregated updates ΔW_s^r
1: for i = 1, ..., N do
2:     Compute the index matrix of client updates: M_{agg,i}^r = Sign(|ΔŴ_i^r|)
3: end
4: # Sum matrix of training samples:
5: N = Σ_{i=1}^N n_i^r · M_{agg,i}^r
6: ΔW_s^r = (Σ_{i=1}^N n_i^r · ΔŴ_i^r) ⊘ N
7: return ΔW_s^r

From the perspective of the selection mechanism in federated learning, a kept non-zero value in ΔŴ_i^r indicates that client i is selected by the server at that element location. Consequently, each element of the aggregated matrix has its own set of participating samples. We introduce our Element-Wise Aggregation (EWA) method in Algorithm 2. Firstly, the server receives the sparse updates ΔŴ_i^r from client i during the r-th federated round, ∀i ∈ P. The server then computes the non-zero index matrix M_{agg,i}^r of ΔŴ_i^r, which is used to calculate the per-location sum matrix N of training samples (lines 1-5). Finally, the server aggregates the global updates ΔW_s^r, where ⊘ denotes the element-wise division operator (line 6).
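Algorithm 2 can be sketched as follows: each coordinate is averaged only over the clients that actually transmitted it, weighted by their sample counts. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def ewa_aggregate(sparse_updates, sample_counts):
    """Element-wise aggregation in the spirit of Algorithm 2 (EWA)."""
    weighted = np.zeros_like(sparse_updates[0], dtype=float)
    mask_sum = np.zeros_like(sparse_updates[0], dtype=float)
    for dw, n in zip(sparse_updates, sample_counts):
        mask = (dw != 0).astype(float)      # index matrix M_agg,i
        weighted += n * dw
        mask_sum += n * mask                # per-element participating samples
    # Element-wise division (the circled-slash operator); coordinates that no
    # client transmitted stay zero.
    return np.divide(weighted, mask_sum, out=np.zeros_like(weighted),
                     where=mask_sum != 0)

u1, u2, u3 = np.array([0., 1.]), np.array([2., 0.]), np.array([0., 3.])
agg = ewa_aggregate([u1, u2, u3], [1, 1, 1])  # -> array([2., 2.])
```

Note how the first coordinate is averaged over one participant (client 2) and the second over two (clients 1 and 3), reproducing the dense average (2, 2) from Appendix A.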

4.4. DOWNSTREAM PERSONALIZED SPARSIFICATION (DPS)

This process can be separated into three sub-steps: measuring the heterogeneity between the global and local distributions, selecting the downstream indices, and computing the downstream personalized updates. First, during the r-th federated round, the server quantifies the heterogeneity of data distributions via the distance between ΔŴ_s^r and ΔŴ_i^r, ∀i ∈ P (lines 1-4). The server flattens the updates and then normalizes them (lines 2-3). We compute the distance via the cosine function, which can be replaced by other correlation functions (line 4). Secondly, we obtain the index matrix M_s^r of the global sparse updates ΔŴ_s^r via the sign function. Similarly, the index matrix M_i^r of client i is calculated (line 6). The intersection matrix M_{i,in}^r of M_s^r and M_i^r can be regarded as the common information owned by both the server and client i, and is kept in full during downstream transmission (line 7). As M_{i,in}^r has k_{i,in} elements, we choose the remaining k − k_{i,in} elements from the compensation matrices M_{i,c}^r and M_{s,c}^r. To enhance the generalization of training, we leverage a random mechanism to choose d_{s,i}^r · (k − k_{i,in}) elements from M_{i,c}^r, while selecting (1 − d_{s,i}^r) · (k − k_{i,in}) elements from M_{s,c}^r (lines 8-13). Finally, we combine these into the personalized index matrix M_{s,i}^r of client i (line 14) and calculate the downstream sparse updates ΔŴ_{s,i}^r sent from the server to client i (lines 15-16). Since each client uploads its unique sparse updates ΔŴ_i^r, the correlation distance d_{s,i}^r and the downstream sparse updates ΔŴ_{s,i}^r differ across clients.
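The three DPS sub-steps can be sketched as follows. This is a NumPy sketch under stated assumptions: how the cosine distance is scaled into [0, 1] and the clipping of the selection counts are our choices, not specified by the paper:

```python
import numpy as np

def _pick(rng, cand, n):
    """Randomly pick n indices from cand (safe when n == 0)."""
    return rng.choice(cand, size=n, replace=False) if n > 0 else cand[:0]

def dps_select(client_sparse, global_dense, global_sparse, rng):
    """Sketch of DPS: distance, index selection, personalized masking."""
    gs, gd, cu = global_sparse.ravel(), global_dense.ravel(), client_sparse.ravel()
    # (1) Heterogeneity: cosine distance of flattened, normalized updates,
    #     mapped into [0, 1] (our assumption).
    a = gs / (np.linalg.norm(gs) + 1e-12)
    b = cu / (np.linalg.norm(cu) + 1e-12)
    d = 0.5 * (1.0 - float(a @ b))
    # (2) Index selection: keep the intersection, fill the rest randomly
    #     from the client-only and server-only compensation sets.
    m_s, m_i = gs != 0, cu != 0
    m_in = m_s & m_i
    k, k_in = int(m_s.sum()), int(m_in.sum())
    cand_i = np.flatnonzero(m_i & ~m_in)    # compensation M_i,c
    cand_s = np.flatnonzero(m_s & ~m_in)    # compensation M_s,c
    n_i = min(len(cand_i), int(round(d * (k - k_in))))
    n_s = min(len(cand_s), k - k_in - n_i)
    chosen = np.concatenate([np.flatnonzero(m_in),
                             _pick(rng, cand_i, n_i),
                             _pick(rng, cand_s, n_s)])
    # (3) Personalized downstream update: mask the dense global updates.
    out = np.zeros_like(gd)
    out[chosen] = gd[chosen]
    return out.reshape(global_dense.shape), d

gd = np.array([1.0, 0.5, 2.0, 0.5, 3.0, 0.5])   # global dense updates
gs = np.array([1.0, 0.0, 2.0, 0.0, 3.0, 0.0])   # Top-K of gd, k = 3
cu = np.array([4.0, 0.0, 0.0, 5.0, 0.0, 0.0])   # client kept indices {0, 3}
out, d = dps_select(cu, gd, gs, np.random.default_rng(0))
```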

4.5. PUTTING ALL TOGETHER

Algorithm 3: DPS
Input: upstream sparse updates ΔŴ_i^r, global updates ΔW_s^r, sparse global updates ΔŴ_s^r
Output: downstream personalized sparse updates ΔŴ_{s,i}^r
1:  # Measure the heterogeneity between the global and local distributions:
2:  v_s^r = Norm(Flatten(ΔŴ_s^r))
3:  v_i^r = Norm(Flatten(ΔŴ_i^r))
4:  d_{s,i}^r = CosDistance(v_s^r, v_i^r)
5:  M_s^r = Sign(|ΔŴ_s^r|)
6:  M_i^r = Sign(|ΔŴ_i^r|)
7:  The intersection of index matrices: M_{i,in}^r = M_i^r ⊙ M_s^r
8:  The number of non-zero values in M_s^r: k = ||M_s^r||
9:  The number of non-zero values in M_{i,in}^r: k_{i,in} = ||M_{i,in}^r||
10: The compensations of index matrices: M_{i,c}^r = M_i^r − M_{i,in}^r
11: M_{s,c}^r = M_s^r − M_{i,in}^r
12: M̃_{i,c}^r = Random(M_{i,c}^r, d_{s,i}^r · (k − k_{i,in}))
13: M̃_{s,c}^r = Random(M_{s,c}^r, (1 − d_{s,i}^r) · (k − k_{i,in}))
14: Combine the indices: M_{s,i}^r = M_{i,in}^r + M̃_{i,c}^r + M̃_{s,c}^r
15: # Compute the downstream personalized updates: ΔŴ_{s,i}^r = M_{s,i}^r ⊙ ΔW_s^r
16: return ΔŴ_{s,i}^r

To sum up, we assemble the FedPSE framework in Algorithm 4, which executes the federated process, named PSE, R times. Before training, we initialize the model weights of all clients to W^0, the received updates ΔŴ_{s,i}^0 to zero, and the sparsity ratio p for communication compression. First, we obtain the personalized sparse updates ΔŴ_i^r of each client i via the UPS method (Algorithm 1), using the previous downstream updates ΔŴ_{s,i}^{r−1} and the sparsity ratio p (line 5). We then apply the EWA operator (Algorithm 2) to compute the global updates ΔW_s^r of the r-th federated round (line 8). Furthermore, the server sparsifies the global updates ΔW_s^r into ΔŴ_s^r at compression rate p (line 9). Finally, we obtain the personalized updates of each client via the DPS method (Algorithm 3), whose inputs are the sparse upstream updates ΔŴ_i^r, the global updates ΔW_s^r, and the sparse global updates ΔŴ_s^r (line 11).

4.6. THEORETICAL ANALYSIS

Algorithm 4: FedPSE framework
1:  Initialization: W_i^0 = W^0, ΔŴ_{s,i}^0 = 0 (∀i ∈ P), sparsity ratio p
2:  for each federated round r = 1, ..., R do
3:      # At clients:
4:      for each client i ∈ P do
5:          ΔŴ_i^r, n_i^r ← UPS(ΔŴ_{s,i}^{r−1}, p)
6:      end
7:      # At server:
8:      ΔW_s^r ← EWA(ΔŴ_i^r, n_i^r), ∀i ∈ P
9:      ΔŴ_s^r ← Top-K(ΔW_s^r, p)
10:     for i ∈ P do
11:         ΔŴ_{s,i}^r ← DPS(ΔŴ_i^r, ΔW_s^r, ΔŴ_s^r)
12:     end
13: end
14: return W_i^{R,T}, ∀i ∈ P

In this subsection, we theoretically analyze the convergence of the FedPSE framework. We suppose that the loss function f_i : R^n → R of client i (∀i ∈ P) is differentiable, where n is the dimension of the parameters, and consider the general deep-learning setting in which f_i is non-convex. Our convergence results are proved under the following assumptions.

Assumption 1 (Lipschitz smoothness). The loss function f_i (∀i ∈ P) is L-Lipschitz smooth (L-smooth), i.e., ||∇f_i(W_i^u) − ∇f_i(W_i^v)|| ≤ L ||W_i^u − W_i^v||, ∀W_i^u, W_i^v ∈ R^n.

Assumption 2 (Bounded gradient). The second moment of the stochastic gradient G_i computed on a single sample within client i is bounded, i.e., Σ_{i=1}^N E[||G_i(W_i)||^2] ≤ σ^2, ∀W_i ∈ R^n.

We aim to prove that min_{r ∈ {1,...,R}} E[||∇f_i(W_i^r)||^2] → 0 as R → ∞, which is a standard convergence guarantee for non-convex problems (Liu & Wright, 2015). Similarly to Dan Alistarh et al. (2018), we use W̄^r to denote the auxiliary random variable during the r-th federated round, with

W̄^{r+1} = W̄^r − α^r Ḡ(W̄^r), where Ḡ(W̄^r) = (1/N) Σ_{i=1}^N G_i(W_i^r) and W̄^0 = W_i^0 (∀i ∈ P).

Lemma 1. For any federated round r ≥ 1:

E[||W̄^r − W_i^r||^2] ≤ ((1 + 2φ)/ρ) Σ_{q=1}^r (φ(1 + ρ))^q (α^{r−q})^2 σ^2/N,

where φ = 1 − k/n, 0 < k ≤ n, and ρ > 0.

Lemma 1 bounds the difference between W̄^r and W_i^r during the r-th federated round. Consequently, the convergence of FedPSE is guaranteed as follows.

Theorem 1. Choose the learning rate schedule α such that Σ_{q=1}^r (φ(1 + ρ))^q (α^{r−q})^2 ≤ C α^r (∀r > 0). Then our proposed FedPSE satisfies

(1 / Σ_{r=1}^R α^r) Σ_{r=1}^R α^r E[||∇f_i(W_i^r)||^2]
≤ 4 (f_i(W̄^0) − f_i(W̄^*)) / Σ_{r=1}^R α^r + (2σ^2 L / N)(1 + 2L(1 + 2φ)C/ρ) Σ_{r=1}^R (α^r)^2 / Σ_{r=1}^R α^r,

where C > 0 is a constant and W̄^* denotes the optimum of the auxiliary variable.

Theorem 1 implies that the client model W_i^r converges when the number of federated rounds R is large enough, provided that α^r satisfies

lim_{R→∞} Σ_{r=1}^R α^r = ∞ and lim_{R→∞} (Σ_{r=1}^R (α^r)^2) / (Σ_{r=1}^R α^r) = 0.

Finally, we derive the convergence speed of our framework; please see Appendix B for details and proofs.
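As a hedged illustration (our example, not part of the paper's analysis), a decaying schedule such as α^r = α_0/√r satisfies both limit conditions, since each term of the first sum is at least α_0/√R and Σ_{r=1}^R 1/r ≤ 1 + ln R:

```latex
\sum_{r=1}^{R} \frac{\alpha_0}{\sqrt{r}}
\;\ge\; R \cdot \frac{\alpha_0}{\sqrt{R}} = \alpha_0 \sqrt{R}
\xrightarrow{R\to\infty} \infty,
\qquad
\frac{\sum_{r=1}^{R} (\alpha_0/\sqrt{r})^2}{\sum_{r=1}^{R} \alpha_0/\sqrt{r}}
\;\le\; \frac{\alpha_0 (1+\ln R)}{\sqrt{R}}
\xrightarrow{R\to\infty} 0.
```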

5. EXPERIMENT

In this section, we empirically compare the performance of our proposed FedPSE framework with other federated learning paradigms for personalized compression. We aim to answer the following questions.

• Q1: Does FedPSE outperform other optimized algorithms on Non-IID data?

• Q2: Is the convergence speed of FedPSE acceptable in the Non-IID setting?

• Q3: Can the correlation distance quantitatively measure the divergence between the local distribution and the global distribution?

• Q4: Does the EWA method improve the performance of FedPSE?

• Q5: How does the sparsity ratio influence the performance of FedPSE?

5.1. EXPERIMENTAL SETTINGS

Datasets. To test the effectiveness of our proposed model, we choose four widely used benchmark datasets: MNIST (Deng, 2012), Fashion-MNIST (FMNIST) (Xiao et al., 2017), IMDB (Maas et al., 2011), and CIFAR-10 (Krizhevsky et al., 2009). We load these datasets using the Keras package in TensorFlow 2.8, keeping their original train/test splits (Abadi et al., 2016). Non-IID Setting. As most empirical work on synthetic Non-IID datasets partitions a "flat" existing dataset based on the labels (Kairouz et al., 2021), we also use label distribution skew as our Non-IID setting. We set a variable Non-IID ratio λ, ranging from 0.0 to 1.0, to simulate the Non-IIDness of clients' datasets, the same as Beutel et al. (2020). In this way, the distribution of clients' datasets becomes more heterogeneous with a larger λ. Please see more details in Appendix C.1. Metrics. Following existing work (Sattler et al., 2019; Wu et al., 2020), we use accuracy on the test dataset as the evaluation metric. To compare the performance of different strategies in the decentralized scenario, we optimize their hyper-parameters and choose the average of the metric over all clients as the optimization target. Please see more details in Appendix C.2 and C.3.
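A label-skew partition of the kind described above can be sketched as follows. This is our simplified reading of the λ-controlled skew (a fraction λ of each class is pinned to a "home" client, the rest is spread uniformly); Beutel et al. (2020) define the mechanism in more detail:

```python
import numpy as np

def label_skew_partition(labels, n_clients, lam, seed=0):
    """Partition sample indices with a label-skew Non-IID ratio lam in [0, 1].

    lam = 0.0 gives a uniform (IID-like) split; lam = 1.0 pins each class
    entirely to its home client (class c -> client c % n_clients).
    """
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_home = int(lam * len(idx))
        parts[c % n_clients].extend(idx[:n_home])       # skewed share
        for i in idx[n_home:]:                           # remainder: uniform
            parts[rng.integers(n_clients)].append(i)
    return [np.array(p) for p in parts]

labels = np.repeat(np.arange(10), 100)          # 10 classes, 100 samples each
parts = label_skew_partition(labels, n_clients=5, lam=1.0)
# With lam = 1.0, each of the 5 clients holds exactly 2 classes.
```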

5.2. PERFORMANCE COMPARISON OF ALGORITHMS WITH DIFFERENT NON-IID RATIOS

To answer question Q1, we first set different Non-IID ratios λ, i.e., 0.0, 0.5, and 1.0, to partition the original datasets among clients. As shown in Table 1 and Appendix D.1, we set different numbers of clients, ranging from 2 to 100, for cross-silo federated learning, similar to previous studies (Gao et al., 2021; Wu et al., 2020). Furthermore, we compare the accuracy of our proposed FedPSE against the three aforementioned algorithms, i.e., FedAVG (McMahan et al., 2017a), FedSTC (Sattler et al., 2019), and FedSCR (Wu et al., 2020). Among them, FedAVG is the baseline method without communication compression, while FedSTC and FedSCR are state-of-the-art methods for the communication-efficiency problem in the Non-IID setting. Notably, we set the same sparsity parameter p = 0.9 for FedSTC, FedSCR, and FedPSE, meaning that only 10% of the parameters are transmitted between clients and the server compared with FedAVG. We summarize the results in Table 1 and Appendix D.1, which compare the performance of the FedAVG, FedSTC, FedSCR, and FedPSE paradigms on these benchmark datasets. From these experiments, we conclude that our proposed FedPSE almost always achieves the best performance of all algorithms in the Non-IID setting (λ = 1.0 or 0.5) and is robust to the number of clients. For example, the average metric of FedPSE exceeds FedSTC by 6.85% and FedSCR by 5.57% on the FMNIST dataset, as shown in Table 2 of Appendix D.1. Besides, FedPSE is also robust to the Non-IID ratios, achieving performance very similar to FedAVG and FedSTC even when the data distribution is homogeneous (λ = 0.0), where FedSCR shows a weakness. Furthermore, our method also outperforms the other models with partial client participation in aggregation, as shown in Appendix D.2.
5.3. CONVERGENCE COMPARISON OF ALGORITHMS

To answer question Q2, we plot the metric over federated rounds on the Non-IID (λ = 1.0) FMNIST dataset, as shown in Figure 2, where the compression ratio p is 0.9 with 5 participating clients. We observe that the performance of FedAVG is unstable on Non-IID datasets, consistent with previous results (Zhao et al., 2018), while the compressed methods suffer less from the Non-IID data. Importantly, our proposed FedPSE improves over FedSTC and FedSCR in both accuracy and convergence speed. Furthermore, FedSCR outperforms FedSTC in the Non-IID setting, which is consistent with the earlier result (Wu et al., 2020).

5.4. CORRELATION DISTANCE OF FEDPSE WITH DIFFERENT NON-IID RATIOS

To answer question Q3, we plot the correlation distance of participating clients under different Non-IID ratios (i.e., λ = 0.0, 0.5, 1.0), as shown in Figure 3. Figure 3 shows that the average correlation distance increases with the Non-IID ratio, which demonstrates that the correlation distance is able to represent the heterogeneity between a local distribution and the global distribution. Furthermore, we compute the variance of the correlation distance under diverse Non-IID settings, as shown in Appendix D.3. The variance grows with the degree of heterogeneity, indicating that the correlation distance also reflects the distribution divergence among clients.

5.5. ABLATION STUDY OF EWA

To answer question Q4, we vary the sparsity ratio p from 0.1 to 0.9 on the FMNIST dataset and report the average test accuracy on Non-IID data (λ = 1.0) in Figure 4, where "FedPSE without EWA" denotes FedPSE using the naive aggregation method of FedAVG. Figure 4 shows that FedPSE with EWA consistently outperforms FedPSE without EWA under different compression rates, which demonstrates the effectiveness of our proposed EWA method. Besides, the average improvement from EWA is 0.002 with 2 clients, 0.0071 with 5 clients, and 0.0143 with 10 clients; that is, the benefit of EWA increases with the number of clients. Furthermore, we conduct experiments on the FMNIST dataset over Non-IID ratios λ (from 0.0 to 1.0) with a fixed p = 0.9, as shown in Figure 5. The test accuracy of FedPSE with EWA exceeds that of the variant without EWA across different distributions. The advantage of FedPSE increases rapidly with rising λ, meaning our proposed paradigm performs better under extreme Non-IID conditions. In conclusion, the EWA method is useful for the sparse aggregation process on the server.

5.6. PERFORMANCE COMPARISON OF FEDPSE WITH DIFFERENT SPARSITY RATIOS

Figure 4 can also answer question Q5. We find that the accuracy of FedPSE as a function of the sparsity ratio varies with the number of clients. For instance, the accuracy stays steady across most sparsity ratios with two clients, while the metric increases rapidly as p rises from 0.1 to 0.7 with more clients. The sparsity hyper-parameter p of FedPSE can thus be tuned to balance performance and efficiency.

6. CONCLUSION

We propose a personalized sparsification with element-wise aggregation for federated learning to address the Non-IID isolated-data scenario. We first sparsify the upstream updates of clients via the Top-K operator. We then propose the EWA aggregation method to improve federated performance on sparse matrices. Finally, we leverage the DPS method to preserve both personalization and sparsification for the downstream information. Experiments on real-world datasets demonstrate that our model significantly outperforms current methods on isolated Non-IID data.

A AGGREGATION EXAMPLES

In this section, we give two examples to compare the results of different aggregation methods under the IID and Non-IID settings.


For the IID setting, we suppose that there are three clients with updates ΔW_1 = (1, 1), ΔW_2 = (2, 2), and ΔW_3 = (3, 3). The original aggregation of these dense updates is (2, 2) (red arrow), as shown in Figure 6(a), which we regard as the accurate result. We then generate the corresponding sparse updates ΔŴ_1 = (0, 1), ΔŴ_2 = (2, 0), and ΔŴ_3 = (0, 3), for which the naive averaging method (McMahan et al., 2017b) yields the biased result (2/3, 4/3) (green arrow). From an element-wise perspective, computing the aggregated vector via the Element-Wise Aggregation (EWA) method of Section 4.3 recovers the precise aggregation (2, 2) (blue arrow) from the same sparse updates. Similarly, assume the clients' updates are ΔW_1 = (1, 2), ΔW_2 = (3, 2), and ΔW_3 = (3, 4) in the Non-IID setting, sparsified via the Top-K operator into ΔŴ_1 = (0, 2), ΔŴ_2 = (3, 0), and ΔŴ_3 = (0, 4). It is easy to calculate the accurate aggregation (7/3, 8/3) (red arrow), the naive result (1, 2) (green arrow), and the EWA vector (3, 3) (blue arrow). The EWA result is closer to the accurate dense aggregation than the naive sparse aggregation, as shown in Figure 6(b), in both direction and magnitude. Overall, the original averaging approach is easily biased by sparse matrices, while our proposed EWA method is more suitable for sparse aggregation.
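Both examples can be checked numerically. A small sketch, where `ewa` re-implements the element-wise average of Section 4.3 with equal sample counts:

```python
import numpy as np

def naive_avg(updates):
    """FedAVG-style average: divide every coordinate by the client count."""
    return sum(updates) / len(updates)

def ewa(updates):
    """Element-wise average: divide each coordinate only by the number of
    clients that transmitted a non-zero value there."""
    stack = np.stack(updates)
    counts = (stack != 0).sum(axis=0).astype(float)
    return np.divide(stack.sum(axis=0), counts,
                     out=np.zeros(stack.shape[1:]), where=counts != 0)

# IID example: the dense average is (2, 2).
sparse_iid = [np.array([0., 1.]), np.array([2., 0.]), np.array([0., 3.])]
print(naive_avg(sparse_iid))   # [0.666.. 1.333..] -- biased
print(ewa(sparse_iid))         # [2. 2.]           -- matches the dense average

# Non-IID example: the dense average is (7/3, 8/3).
sparse_noniid = [np.array([0., 2.]), np.array([3., 0.]), np.array([0., 4.])]
print(naive_avg(sparse_noniid))  # [1. 2.]
print(ewa(sparse_noniid))        # [3. 3.] -- closer in direction and magnitude
```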

B CONVERGENCE ANALYSIS

B.1 ANALYSIS PRELIMINARIES AND ASSUMPTIONS

The FedPSE framework minimizes the differentiable loss function f_i : R^n → R (∀i ∈ P) individually, where n denotes the dimension of the parameters. We consider the general deep-learning setting in which f_i is non-convex. Our convergence results are proved under the following assumptions.

Assumption 3 (Smoothness). The loss function f_i (∀i ∈ P) is L-Lipschitz smooth (L-smooth), i.e., ||∇f_i(W_i^u) − ∇f_i(W_i^v)|| ≤ L ||W_i^u − W_i^v||, ∀W_i^u, W_i^v ∈ R^n.

We further assume that client i trains its model W_i with an unbiased stochastic gradient, i.e., E[G_i(W_i)] = ∇f_i(W_i), where G_i denotes the stochastic gradient over a batch of M samples.

Assumption 4 (Bounded gradient). The second moment of the local gradient G_i is bounded, i.e., Σ_{i=1}^N E[||G_i(W_i)||^2] ≤ σ^2, ∀W_i ∈ R^n, where ||·|| is the ℓ2-norm.

As shown in Algorithm 5, FedPSE consists of three steps: upstream personalized sparsification (UPS), element-wise aggregation (EWA), and downstream personalized sparsification (DPS). The training steps of each federated round are collectively termed PSE in the following sections. The local model is updated during the (r+1)-th federated round by

W_i^{r+1} = W_i^r − PSE_{i=1}^N (α^r G_i(W_i^r) + e_i^{r−1}),

where α^r is the learning rate.

Under review as a conference paper at ICLR 2023

According to Algorithm 2, e_i^{r−1} = α^{r−1} G_i(W_i^{r−1}) − Top-k(α^{r−1} G_i(W_i^{r−1}) + e_i^{r−2}).
We then expand PSE with the previous parameters and get
$$\mathrm{PSE}_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^{r-1}\big) = \big(M_{in}^r + R(M_{i,c}^r, d_{s,i}^r \cdot k_c) + R(M_{s,c}^r, (1 - d_{s,i}^r) \cdot k_c)\big) \odot \Delta W_s^r$$
$$= \big(M_{in}^r + M_{i,c}^r + M_{s,c}^r\big) \odot \Delta W_s^r$$
$$= \bar W^r \odot \big(M_{in}^r + M_{i,c}^r\big) \odot \text{Top-}k\Big(\sum_{i=1}^{N} \text{Top-}k\big(\alpha^r G_i(W_i^r) + e_i^{r-1}\big)\Big) + M_{s,c}^r \odot \text{Top-}k\big(\alpha^r G_i(W_i^r) + e_i^{r-1}\big),$$
where Top-$k(\cdot)$ is the compression operator described in Algorithm 1, $R(\cdot)$ denotes the random selection function in Algorithm 4, and
$$\bar W^r = \vec{1}_n \oslash \sum_{i=1}^{N} \mathrm{Sign}\Big(\text{Top-}k\big(\alpha^r G_i(W_i^r) + e_i^{r-1}\big)\Big)$$
is the element weight matrix.

Similar to Alistarh et al. (2018), we use $\widetilde W^r$ to denote the auxiliary model weight with convergence guarantee during the $r$-th federated round, and get
$$\widetilde W^{r+1} = \widetilde W^r - \alpha^r G(W^r), \quad \text{where } G(W^r) = \frac{1}{N}\sum_{i=1}^{N} G_i(W_i^r) \text{ and } \widetilde W^0 = W_i^0 \ (\forall i \in \mathcal{P}).$$
The difference between the auxiliary model $\widetilde W^r$ and the local model $W_i^r$ can be represented by
$$W_i^r - \widetilde W^r = W_i^0 - \sum_{q=1}^{r} \mathrm{PSE}_{i=1}^{N}\big(\alpha^q G_i(W_i^q) + e_i^{q-1}\big) - \widetilde W^0 + \sum_{q=1}^{r} \alpha^q G(W^q)$$
$$= \frac{1}{N}\sum_{q=1}^{r}\sum_{i=1}^{N} \alpha^q G_i(W_i^q) - \sum_{q=1}^{r} \mathrm{PSE}_{i=1}^{N}\big(\alpha^q G_i(W_i^q) + e_i^{q-1}\big).$$

Then we define a commonly used $k$-contraction operator (Gao et al., 2021; Stich et al., 2018):

Definition 1. For any vector $W \in \mathbb{R}^n$ and $0 < k \le n$, an operator $Q: \mathbb{R}^n \to \mathbb{R}^n$ is a $k$-contraction operator if it satisfies the following property:
$$\mathbb{E}\|W - Q(W)\|^2 \le \Big(1 - \frac{k}{n}\Big)\|W\|^2, \quad \forall W \in \mathbb{R}^n.$$
Apparently, the Top-K compressor is a $k$-contraction operator.

Lemma 2. $\forall W_i \in \mathbb{R}^n$ and $0 < k \le n$, we have
$$\mathbb{E}\Big[\Big\|\frac{1}{N}\sum_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big) - \mathrm{PSE}_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big)\Big\|^2\Big] \le \Big(1 - \frac{k}{n}\Big)\Big\|\frac{1}{N}\sum_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big)\Big\|^2.$$

Proof.
Combining equation (2) and Definition 1, and writing $\bar u^r = \frac{1}{N}\sum_{i=1}^{N}(\alpha^r G_i(W_i^r) + e_i^r)$, we obtain
$$\mathbb{E}\big[\|\bar u^r - \mathrm{PSE}_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big)\|^2\big] \le \mathbb{E}\big[\|\bar u^r - \text{Top-}k(\bar u^r)\|^2\big] = \mathbb{E}\big[\|\bar u^r - Q(\bar u^r)\|^2\big] \le \Big(1 - \frac{k}{n}\Big)\|\bar u^r\|^2.$$

B.2 MAIN PROCESS

As is standard in non-convex settings (Liu & Wright, 2015), we want to guarantee the following convergence:
$$\min_{r \in \{1,\cdots,R\}} \mathbb{E}\big[\|\nabla f_i(W_i^r)\|^2\big] \xrightarrow{R \to \infty} 0,$$
that is, the algorithm converges ergodically to an optimal point where the gradients are zero. Our purpose is to minimize the difference between the "real" model $\widetilde W^r$ and the local model $W_i^r$ observed at federated round $r$, which means decreasing the value of the loss function $f_i(W_i^r)$. Furthermore, we need to bound the following expression:
$$\frac{1}{\sum_{r=1}^{R}\alpha^r} \sum_{r=1}^{R} \alpha^r\, \mathbb{E}\big[\|\nabla f_i(W_i^r)\|^2\big].$$

Then we get Lemma 3:

Lemma 3. For any federated round $r \ge 1$:
$$\mathbb{E}\big[\|W_i^r - \widetilde W^r\|^2\big] \le \frac{1 + 2\varphi}{\rho} \sum_{q=1}^{r} (\varphi(1+\rho))^q (\alpha^{r-q})^2 \frac{\sigma^2}{N},$$
where $\varphi = 1 - \frac{k}{n}$, $0 < k \le n$ and $\rho > 0$.

Proof.
We derive the difference between $W_i^{r+1}$ and $\widetilde W^{r+1}$:
$$\mathbb{E}\big[\|W_i^{r+1} - \widetilde W^{r+1}\|^2\big] = \mathbb{E}\Big[\big\|W_i^r - \mathrm{PSE}_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big) - \widetilde W^r + \alpha^r G(W^r)\big\|^2\Big]$$
$$= \mathbb{E}\Big[\big\|W_i^r - \widetilde W^r - \mathrm{PSE}_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big) + \frac{\alpha^r}{N}\sum_{i=1}^{N} G_i(W_i^r)\big\|^2\Big]$$
$$= \mathbb{E}\Big[\big\|W_i^r - \widetilde W^r - \frac{1}{N}\sum_{i=1}^{N} e_i^r - \mathrm{PSE}_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big) + \frac{1}{N}\sum_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big)\big\|^2\Big]$$
$$\le \mathbb{E}\Big[\big\|\frac{1}{N}\sum_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big) - \mathrm{PSE}_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big)\big\|^2\Big] + \mathbb{E}\Big[\big\|W_i^r - \widetilde W^r - \frac{1}{N}\sum_{i=1}^{N} e_i^r\big\|^2\Big].$$

From Lemma 2, we can obtain
$$\mathbb{E}\big[\|W_i^{r+1} - \widetilde W^{r+1}\|^2\big] \le \varphi\Big\|\frac{1}{N}\sum_{i=1}^{N}\big(\alpha^r G_i(W_i^r) + e_i^r\big)\Big\|^2 + \mathbb{E}\Big[\big\|W_i^r - \widetilde W^r - \frac{1}{N}\sum_{i=1}^{N} e_i^r\big\|^2\Big]$$
$$\le \varphi\Big\|\frac{1}{N}\sum_{i=1}^{N} \alpha^r G_i(W_i^r)\Big\|^2 + \varphi\Big\|\frac{1}{N}\sum_{i=1}^{N} e_i^r\Big\|^2 + \mathbb{E}\Big[\big\|W_i^r - \widetilde W^r - \frac{1}{N}\sum_{i=1}^{N} e_i^r\big\|^2\Big].$$

From equation (4) and Lemma 2, we then have
$$\mathbb{E}\big[\|W_i^{r+1} - \widetilde W^{r+1}\|^2\big] \le \varphi\, \mathbb{E}\|\alpha^r G(W^r)\|^2 + \varphi\, \mathbb{E}\|W_i^r - \widetilde W^r\|^2 + (1+\varphi)\, \mathbb{E}\Big\|\frac{1}{N}\sum_{q=1}^{r}\sum_{i=1}^{N} \text{Top-}k\big(\alpha^q G_i(W_i^q) + e_i^q\big) - \sum_{q=1}^{r} \mathrm{PSE}_{i=1}^{N}\big(\alpha^q G_i(W_i^q) + e_i^q\big)\Big\|^2$$
$$\le (1 + 2\varphi)\Big(1 + \frac{1}{\rho}\Big)\mathbb{E}\|\alpha^r G(W^r)\|^2 + \varphi(1+\rho)\, \mathbb{E}\|W_i^r - \widetilde W^r\|^2,$$
where the last step uses Young's inequality. Iterating the above inequality over $r$ gives
$$\mathbb{E}\big[\|W_i^r - \widetilde W^r\|^2\big] \le \frac{1 + 2\varphi}{\rho} \sum_{q=1}^{r} (\varphi(1+\rho))^q\, \mathbb{E}\|\alpha^{r-q} G(W^{r-q})\|^2.$$
From Assumption 4, we have
$$\mathbb{E}\big[\|W_i^r - \widetilde W^r\|^2\big] \le \frac{1 + 2\varphi}{\rho} \sum_{q=1}^{r} (\varphi(1+\rho))^q (\alpha^{r-q})^2 \frac{\sigma^2}{N}.$$

Theorem 2. Assume that our proposed FedPSE is applied to minimize an objective loss function $f_i$ that satisfies the assumptions in B.1. If we choose a learning rate schedule that satisfies
$$\frac{\sum_{q=1}^{r} (\varphi(1+\rho))^q (\alpha^{r-q})^2}{\alpha^r} \le C \qquad (8)$$
for some constant $C > 0$, then we have the following result after $R$ federated rounds:
$$\frac{1}{\sum_{r=1}^{R}\alpha^r} \sum_{r=1}^{R} \alpha^r\, \mathbb{E}\big[\|\nabla f_i(W_i^r)\|^2\big] \le \frac{4\big(f_i(\widetilde W^0) - f_i(\widetilde W^*)\big)}{\sum_{r=1}^{R}\alpha^r} + \frac{2\sigma^2 L}{N}\Big(1 + \frac{2L(1+2\varphi)C}{\rho}\Big)\frac{\sum_{r=1}^{R}(\alpha^r)^2}{\sum_{r=1}^{R}\alpha^r},$$
where $\widetilde W^*$ is the optimal solution to $f_i$.

Proof.
Under Assumption 3, we have
$$f_i(\widetilde W^{r+1}) - f_i(\widetilde W^r) \le \langle \nabla f_i(\widetilde W^r), \widetilde W^{r+1} - \widetilde W^r \rangle + \frac{L}{2}\|\widetilde W^{r+1} - \widetilde W^r\|^2 = -\langle \nabla f_i(\widetilde W^r), \alpha^r G(W^r) \rangle + \frac{L}{2}\|\alpha^r G(W^r)\|^2 \qquad (9)$$
and
$$\|\nabla f_i(W_i^r)\|^2 = \|\nabla f_i(W_i^r) - \nabla f_i(\widetilde W^r) + \nabla f_i(\widetilde W^r)\|^2 \le 2\|\nabla f_i(W_i^r) - \nabla f_i(\widetilde W^r)\|^2 + 2\|\nabla f_i(\widetilde W^r)\|^2 \le 2L^2\|W_i^r - \widetilde W^r\|^2 + 2\|\nabla f_i(\widetilde W^r)\|^2. \qquad (10)$$

Taking the expectation at federated round $r$, we bound
$$\mathbb{E}[f_i(\widetilde W^{r+1})] - f_i(\widetilde W^r) \le -\langle \nabla f_i(\widetilde W^r), \alpha^r\, \mathbb{E}[G(W^r)] \rangle + \frac{L}{2}\mathbb{E}\big[\|\alpha^r G(W^r)\|^2\big]$$
$$= -\langle \nabla f_i(\widetilde W^r), \alpha^r \nabla f_i(W^r) \rangle + \frac{L}{2}\mathbb{E}\big[\|\alpha^r G(W^r)\|^2\big]$$
$$= -\frac{\alpha^r}{2}\Big(\|\nabla f_i(\widetilde W^r)\|^2 + \|\nabla f_i(W^r)\|^2 - \|\nabla f_i(\widetilde W^r) - \nabla f_i(W^r)\|^2\Big) + \frac{(\alpha^r)^2 L}{2}\mathbb{E}\big[\|G(W^r)\|^2\big]$$
$$\le -\frac{\alpha^r}{2}\|\nabla f_i(\widetilde W^r)\|^2 + \frac{\alpha^r L^2}{2}\|W_i^r - \widetilde W^r\|^2 + \frac{(\alpha^r)^2 L}{2}\mathbb{E}\big[\|G(W^r)\|^2\big]$$
$$\le -\frac{\alpha^r}{2}\Big(\|\nabla f_i(\widetilde W^r)\|^2 + L^2\|W_i^r - \widetilde W^r\|^2\Big) + \alpha^r L^2\|W_i^r - \widetilde W^r\|^2 + \frac{(\alpha^r)^2 L\sigma^2}{2N}.$$

Taking the total expectation over all previous rounds, it yields
$$\mathbb{E}[f_i(\widetilde W^{r+1})] - \mathbb{E}[f_i(\widetilde W^r)] \le -\frac{\alpha^r}{2}\mathbb{E}\Big[\|\nabla f_i(\widetilde W^r)\|^2 + L^2\|W_i^r - \widetilde W^r\|^2\Big] + \alpha^r L^2\, \mathbb{E}\big[\|W_i^r - \widetilde W^r\|^2\big] + \frac{(\alpha^r)^2 L\sigma^2}{2N}$$
$$\le -\frac{\alpha^r}{2}\mathbb{E}\Big[\|\nabla f_i(\widetilde W^r)\|^2 + L^2\|W_i^r - \widetilde W^r\|^2\Big] + \frac{(\alpha^r)^2 L\sigma^2}{2N} + \frac{\alpha^r L^2(1+2\varphi)}{\rho}\sum_{q=1}^{r}(\varphi(1+\rho))^q (\alpha^{r-q})^2 \frac{\sigma^2}{N}$$
$$= -\frac{\alpha^r}{2}\mathbb{E}\Big[\|\nabla f_i(\widetilde W^r)\|^2 + L^2\|W_i^r - \widetilde W^r\|^2\Big] + \frac{(\alpha^r)^2 L\sigma^2}{2N} + \frac{(\alpha^r)^2 L^2(1+2\varphi)}{\rho}\,\frac{\sum_{q=1}^{r}(\varphi(1+\rho))^q(\alpha^{r-q})^2}{\alpha^r}\,\frac{\sigma^2}{N}.$$

Applying (8) to the above inequality, we get
$$\mathbb{E}[f_i(\widetilde W^{r+1})] - \mathbb{E}[f_i(\widetilde W^r)] \le -\frac{\alpha^r}{2}\mathbb{E}\Big[\|\nabla f_i(\widetilde W^r)\|^2 + L^2\|W_i^r - \widetilde W^r\|^2\Big] + \frac{(\alpha^r)^2 L\sigma^2}{2N} + \frac{(\alpha^r)^2 L^2(1+2\varphi)C\sigma^2}{\rho N}$$
$$= -\frac{\alpha^r}{2}\mathbb{E}\Big[\|\nabla f_i(\widetilde W^r)\|^2 + L^2\|W_i^r - \widetilde W^r\|^2\Big] + \frac{(\alpha^r)^2 \sigma^2 L}{2N}\Big(1 + \frac{2L(1+2\varphi)C}{\rho}\Big).$$

Combining with inequality (10), we can obtain
$$\alpha^r\, \mathbb{E}\big[\|\nabla f_i(W_i^r)\|^2\big] \le 2\alpha^r\, \mathbb{E}\Big[\|\nabla f_i(\widetilde W^r)\|^2 + L^2\|W_i^r - \widetilde W^r\|^2\Big] \le 4\big(\mathbb{E}[f_i(\widetilde W^r)] - \mathbb{E}[f_i(\widetilde W^{r+1})]\big) + \frac{2(\alpha^r)^2\sigma^2 L}{N}\Big(1 + \frac{2L(1+2\varphi)C}{\rho}\Big).$$

Summing up the above inequality for $r = 1, 2, \cdots, R$, we have
$$\sum_{r=1}^{R} \alpha^r\, \mathbb{E}\big[\|\nabla f_i(W_i^r)\|^2\big] \le 4\big(f_i(\widetilde W^0) - f_i(\widetilde W^*)\big) + \frac{2\sigma^2 L}{N}\Big(1 + \frac{2L(1+2\varphi)C}{\rho}\Big)\sum_{r=1}^{R}
(\alpha^r)^2. \qquad (11)$$

Dividing by the summation of the learning rates therefore gives
$$\frac{1}{\sum_{r=1}^{R}\alpha^r} \sum_{r=1}^{R} \alpha^r\, \mathbb{E}\big[\|\nabla f_i(W_i^r)\|^2\big] \le \frac{4\big(f_i(\widetilde W^0) - f_i(\widetilde W^*)\big)}{\sum_{r=1}^{R}\alpha^r} + \frac{2\sigma^2 L}{N}\Big(1 + \frac{2L(1+2\varphi)C}{\rho}\Big)\frac{\sum_{r=1}^{R}(\alpha^r)^2}{\sum_{r=1}^{R}\alpha^r}. \qquad (12)$$

Condition (8) holds if $\varphi(1+\rho) < 1$. To derive the bound of $\rho$, we require
$$\varphi(1+\rho) = \Big(1 - \frac{k}{n}\Big)(1+\rho) < 1.$$
Therefore, one should choose $\rho < \frac{k}{n-k}$ to satisfy the above inequality.

Theorem 2 implies that each client's model in the FedPSE framework converges when the number of federated rounds $R$ is large enough and $\alpha^r$ satisfies the following conditions:
$$\lim_{R\to\infty} \sum_{r=1}^{R} \alpha^r = \infty \quad \text{and} \quad \lim_{R\to\infty} \frac{\sum_{r=1}^{R}(\alpha^r)^2}{\sum_{r=1}^{R}\alpha^r} = 0.$$

Corollary 1. Under the assumptions in Theorem 2, if $\tau = \varphi(1+\rho)$ and $\alpha^r = \theta\sqrt{MN/R}$, $\forall r > 0$, where $\theta > 0$ is a constant, we have the convergence speed of FedPSE:
$$\mathbb{E}\Big[\frac{1}{R}\sum_{r=1}^{R}\|\nabla f_i(W_i^r)\|^2\Big] \le \frac{4\big(f_i(\widetilde W_i^0) - f_i(\widetilde W_i^*)\big)}{\theta\sqrt{MNR}} + \frac{2\theta L\sigma^2\sqrt{M}}{\sqrt{NR}} + \frac{4\sigma^2 L^2(1+2\varphi)\tau\theta^2 M}{R\rho(1-\tau)}.$$

On Non-IID datasets with $\lambda = 1.0$, we compute the promotion of our method compared to the SOTA methods via the following function:
$$Promotion = \frac{Acc_{FedPSE} - Acc_{SOTA}}{Acc_{SOTA}}.$$
We abbreviate the promotion of FedPSE compared to FedSTC as P1 and the promotion compared to FedSCR as P2, as shown in Table 2. From Table 2, we conclude that our proposed FedPSE achieves the best performance among all algorithms under different numbers of clients, which is consistent with Section 5.2.

We also conduct experiments with partial client participation in aggregation on the FMNIST and IMDB datasets in the Non-IID setting ($\lambda = 1.0$). Specifically, the clients are randomly sampled by the server at different rates. As shown in Figure 7, our method (red line) significantly outperforms the other models, which aligns with the conclusion in Section 5.2.

In order to prove the effectiveness of the DPS algorithm, we construct experiments on FedPSE with different downstream updates on the FMNIST datasets over Non-IID ratios $\lambda$ (i.e.
from 0.0 to 1.0), as shown in Figure 5. FedPSE without the DPS algorithm means that the server sparsifies the global updates with the Top-K method and broadcasts the same sparse information to all clients. The accuracy of FedPSE with DPS exceeds that of the variant without DPS across the different distributions in the Non-IID setting (i.e., $\lambda \ge 0.2$). In conclusion, the DPS method, which personalizes the clients' models, is useful for our paradigm with heterogeneous datasets.

E EXAMPLE CODE

We provide example code for our framework in this section. Figure 9 shows an example implementation of the UPS algorithm, Figure 10 of the EWA algorithm, and Figure 12 of the DPS algorithm. Furthermore, we present the training process of FedPSE in Figure 13, in which we assume that there are two clients (a and b) with two-layer neural networks.



Figure 1: The proposed framework of FedPSE.

Input: upstream sparse updates ∆Ŵ_i^r, global updates ∆W_s^r, global sparse updates ∆Ŵ_s^r
Output: downstream sparse updates ∆Ŵ_{s,i}^r to the client i
# Measure the heterogeneity of the distributions via ∆Ŵ_s^r and ∆Ŵ_i^r
Normalize the global sparse updates: ∆Ŵ_s^r ← Norm(Flatten(∆Ŵ_s^r))
Normalize client i's sparse updates: ∆Ŵ_i^r ← Norm(Flatten(∆Ŵ_i^r))
Compute the correlation distance: d_{s,i}^r = 0.5 − 0.5 · Cosine(∆Ŵ_s^r, ∆Ŵ_i^r)
# Select the reserved indices of the downstream updates
Get the index matrices: M_s^r = Sign(∆Ŵ_s^r) and M_i^r = Sign(∆Ŵ_i^r)
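The correlation-distance step of this algorithm can be sketched in NumPy as follows (the function name and the `eps` guard against zero-norm inputs are our additions):

```python
import numpy as np

def correlation_distance(global_sparse, client_sparse, eps=1e-12):
    """d = 0.5 - 0.5 * Cosine(Norm(Flatten(global)), Norm(Flatten(client))),
    mapping cosine similarity in [-1, 1] to a distance in [0, 1]."""
    g = np.asarray(global_sparse, dtype=float).ravel()
    c = np.asarray(client_sparse, dtype=float).ravel()
    g = g / (np.linalg.norm(g) + eps)   # Norm(Flatten(.))
    c = c / (np.linalg.norm(c) + eps)
    return 0.5 - 0.5 * float(g @ c)     # cosine similarity -> distance
```

A client whose sparse update points in the same direction as the global update gets a distance near 0 (homogeneous), disjoint sparse supports give 0.5, and opposite directions approach 1.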

Figure 2: Test accuracy regarding rounds.

Figure 3: Correlation distance on 5 clients.

Figure 5: Accuracy with Non-IID ratios.

Figure 6: Aggregation results of different methods.

Figure 7: Accuracy comparison over sampling ratios on Non-IID datasets.

Figure 8: Accuracy with different downstream updates.

Figure 11: Example code of the utils.

Table 1: Performance comparison of algorithms on datasets with different Non-IID ratios λ.

Table 2: Performance comparison of algorithms with different numbers of clients on Non-IID datasets.

Table 3: Variance of correlation distances in FedPSE with different Non-IID ratios.

In this section, we compute the variance of the correlation distances with diverse Non-IID ratios λ, as shown in Table 3, where we perform the experiments on the FMNIST and IMDB datasets with a varying number of clients. We then deduce our conclusion in Section 5.4.


Proof. First we prove that $\alpha^r = \theta\sqrt{MN/R}$, a constant step size, satisfies inequality (8). We set $\alpha^r = \alpha$ for simplicity:
$$\frac{\sum_{q=1}^{r}(\varphi(1+\rho))^q(\alpha^{r-q})^2}{\alpha^r} = \frac{\sum_{q=1}^{r}\tau^q(\alpha^{r-q})^2}{\alpha^r} = \alpha\sum_{q=1}^{r}\tau^q = \alpha\,\frac{\tau(1-\tau^r)}{1-\tau}.$$
Since $0 \le \tau < 1$, we then obtain
$$\lim_{r\to\infty} \alpha\,\frac{\tau(1-\tau^r)}{1-\tau} = \frac{\alpha\tau}{1-\tau}.$$
Therefore, inequality (8) holds with $C = \frac{\alpha\tau}{1-\tau}$. From Theorem 2, we obtain the inequality on the expected average squared gradients of $f_i$ stated in Corollary 1. From Corollary 1, we conclude that the FedPSE framework has a convergence rate of $O(1/\sqrt{R})$ with a proper learning rate. It also indicates that the hyper-parameter $k$ (in Top-K) has only a minor impact on the convergence rate if $R$ is large enough.
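The closed form above can be sanity-checked numerically: for a constant step size, the left-hand side of condition (8) equals ατ(1 − τ^r)/(1 − τ) and never exceeds C = ατ/(1 − τ) when τ < 1. A small sketch (the values of τ and α are arbitrary illustrative choices):

```python
# Sanity check of condition (8) for a constant step size alpha:
# sum_{q=1}^{r} tau^q * alpha^2 / alpha = alpha * tau * (1 - tau^r) / (1 - tau),
# which stays below C = alpha * tau / (1 - tau) whenever 0 <= tau < 1.
tau, alpha = 0.8, 0.05          # arbitrary values with tau < 1
C = alpha * tau / (1 - tau)

for r in range(1, 200):
    lhs = sum(tau ** q * alpha ** 2 for q in range(1, r + 1)) / alpha
    closed_form = alpha * tau * (1 - tau ** r) / (1 - tau)
    assert abs(lhs - closed_form) < 1e-9   # geometric-series identity
    assert lhs <= C + 1e-9                 # condition (8) with C = alpha*tau/(1-tau)
```

The left-hand side increases monotonically in r and converges to C, so the constant step size satisfies (8) for every round.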

C DETAILS OF EXPERIMENT SETTINGS

C.1 SEPARATION OF CLIENTS' DATASETS

In this section, we present the details of the method for splitting an existing "flat" dataset in the Non-IID setting. First, the whole set of training/test samples is separated into two parts: a heterogeneous subset containing a fraction λ of the samples and a homogeneous subset containing the remaining (1.0 − λ) fraction. On one hand, the homogeneous subset is randomly and evenly partitioned across the N clients. On the other hand, the heterogeneous subset is sorted by label and split across the N clients in order, so that each client has its own distribution. Finally, we combine the above-mentioned homogeneous and heterogeneous parts into the individual training/test dataset of each client.
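This splitting procedure can be sketched as follows (our own helper, assuming integer class labels; `lam` plays the role of λ):

```python
import numpy as np

def split_non_iid(labels, num_clients, lam, seed=0):
    """Split sample indices among clients: a `lam` fraction is sorted by
    label and dealt out in contiguous blocks (heterogeneous part), while
    the remaining (1 - lam) fraction is shuffled and dealt out randomly
    (homogeneous part)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_het = int(lam * len(labels))
    het, hom = idx[:n_het], idx[n_het:]
    het = het[np.argsort(labels[het], kind="stable")]  # sort by label
    # Each client receives one block of the sorted part plus one random block.
    return [np.concatenate([h, m])
            for h, m in zip(np.array_split(het, num_clients),
                            np.array_split(hom, num_clients))]
```

With `lam = 1.0` every client ends up with a few contiguous classes only; with `lam = 0.0` the split is fully IID.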

C.2 MODELS FOR DIFFERENT DATASETS

We build a ConvNet2 model (Beutel et al., 2020) for the MNIST and FMNIST datasets (Subramani et al., 2021). Similarly, for the IMDB dataset we construct a model with one Embedding layer, which is prebuilt in Keras, followed by a fully-connected layer. Finally, we train a ResNet18 model (He et al., 2016) on Cifar10.

C.3 HYPER-PARAMETERS OPTIMIZATION

We fix the architectures of the models and set the random seed to 1. We then optimize the hyper-parameters (e.g., batch size ranging from 64 to 512 and learning rate ranging from 1e-3 to 1e-2) for the different strategies to compare their best metrics in the following experiments. In particular, to accelerate the training of ResNet18 on Cifar10, we use a pre-trained model to initialize the clients' weights. The experiments are conducted on a stand-alone PC to simulate the communication in federated learning.

D SUPPLEMENTARY EXPERIMENTS

D.1 PERFORMANCE COMPARISON OF ALGORITHMS WITH DIFFERENT NUMBERS OF CLIENTS

In order to prove the effectiveness of our method, we vary the number of clients from 2 to 100 and report the corresponding accuracy in Table 2.

