COMMUNICATION-EFFICIENT AND DRIFT-ROBUST FEDERATED LEARNING VIA ELASTIC NET

Abstract

Federated learning (FL) is a distributed method for training a global model over a set of local clients while keeping data localized. It reduces privacy and security risks but faces important challenges, including expensive communication costs and client drift. To address these issues, we propose FedElasticNet, a communication-efficient and drift-robust FL framework leveraging the elastic net. It repurposes the two elastic net regularizers (i.e., the ℓ₁ and ℓ₂ penalties on the local model updates): (1) the ℓ₁-norm regularizer sparsifies the local updates to reduce the communication cost, and (2) the ℓ₂-norm regularizer resolves the client drift problem by limiting the impact of local updates that drift due to data heterogeneity. FedElasticNet is a general framework for FL; hence, without additional costs, it can be integrated into prior FL techniques, e.g., FedAvg, FedProx, SCAFFOLD, and FedDyn. We show that our framework effectively resolves the communication cost and client drift problems simultaneously.

1. INTRODUCTION

Federated learning (FL) is a collaborative method that allows many clients to contribute individually to training a global model by sharing local models rather than private data. Each client has a local training dataset, which it does not want to share with the global server. Instead, each client computes an update to the current global model maintained by the server, and only this update is communicated. FL significantly reduces privacy and security risks (McMahan et al., 2017; Li et al., 2020a), but it faces crucial challenges that make federated settings distinct from other classical problems (Li et al., 2020a), such as expensive communication costs and client drift due to heterogeneous local training datasets and heterogeneous systems (McMahan et al., 2017; Li et al., 2020a; Konečnỳ et al., 2016a;b). Communicating models is a critical bottleneck in FL, in particular when the federated network comprises a massive number of devices (Bonawitz et al., 2019; Li et al., 2020a; Konečnỳ et al., 2016b). In such a scenario, communication in the federated network may be many orders of magnitude slower than local computation because of limited communication bandwidth and device power (Li et al., 2020a). To reduce this communication cost, several strategies have been proposed (Konečnỳ et al., 2016b; Li et al., 2020a). In particular, Konečnỳ et al. (2016b) proposed several methods to form structured local updates and approximate them, e.g., subsampling and quantization. Reisizadeh et al. (2020) and Xu et al. (2020) also proposed efficient quantization methods for FL to reduce the communication cost. In addition, since the datasets owned by local clients are heterogeneous, models trained on each local dataset are inconsistent with the global model that minimizes the global empirical loss (Karimireddy et al., 2020; Malinovskiy et al., 2020; Acar et al., 2021). This issue is referred to as the client drift problem.
In order to resolve the client drift problem, FedProx (Li et al., 2020b) added a proximal term to the local objective function to regulate local model updates. Karimireddy et al. (2020) proposed the SCAFFOLD algorithm, which transfers both model updates and control variates to resolve the client drift problem. FedDyn (Acar et al., 2021) dynamically regularizes local objective functions to the same end. Unlike most prior works, which focus on either the communication cost problem or the client drift problem, we propose a technique that effectively resolves both problems simultaneously.

2. BACKGROUND: LASSO AND ELASTIC NET

Lasso (Tibshirani, 1996) imposes an ℓ₁-norm penalty on the model parameters. For a linear least-squares problem, the objective of Lasso is to solve min_θ ‖y − Xθ‖₂² + λ₁‖θ‖₁, where y is the outcome and X is the covariate matrix. Lasso performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the resulting model. However, it has some limitations, especially for high-dimensional models. If a group of variables is highly correlated, then Lasso tends to select only one variable from the group and does not care which one is selected (Zou & Hastie, 2005). The elastic net (Zou & Hastie, 2005) overcomes these limitations by adding an ℓ₂-norm penalty; its objective is to solve min_θ ‖y − Xθ‖₂² + (λ₂/2)‖θ‖₂² + λ₁‖θ‖₁. The elastic net simultaneously enables automatic variable selection and continuous shrinkage via the ℓ₁-norm regularizer, while its ℓ₂-norm regularizer encourages the grouping effect, i.e., strongly correlated covariates tend to enter or leave the model together (Zou & Hastie, 2005; Hu et al., 2018). We leverage the elastic net approach to resolve the critical problems of FL: the expensive communication cost and the client drift problem.
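For intuition about how the two penalties interact, consider a single coordinate (or an orthonormal design), where the elastic net minimizer has a closed form built from the soft-thresholding operator. The sketch below is purely illustrative (the function names and data are ours, not from the paper): the ℓ₁ penalty zeroes out small coefficients, and the ℓ₂ penalty then shrinks the survivors.

```python
import numpy as np

def soft_threshold(z, lam):
    # S(z, lam) = sign(z) * max(|z| - lam, 0): the proximal map of the l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def elastic_net_1d(z, lam1, lam2):
    # Minimizer of (1/2)(theta - z)^2 + (lam2/2) theta^2 + lam1 |theta|:
    # soft-threshold first (selection), then shrink by the ridge factor (grouping).
    return soft_threshold(z, lam1) / (1.0 + lam2)

z = np.array([2.0, 0.3, -1.5, 0.05])  # toy least-squares coefficients
print(elastic_net_1d(z, lam1=0.5, lam2=1.0))
```

With λ₁ = 0.5 the two small coefficients are set exactly to zero, and the remaining ones are shrunk by the factor 1/(1 + λ₂); this selection-plus-shrinkage behavior is exactly what FedElasticNet exploits on the local updates.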

3. PROPOSED METHOD: FEDELASTICNET

We assume that m local clients communicate with the global server. For the kth client (where k ∈ [m]) participating in each training round, we assume that a training data feature x ∈ X and its corresponding label y ∈ Y are drawn IID from a device-indexed joint distribution, i.e., (x, y) ∼ P_k (Acar et al., 2021). The objective is to find

arg min_{θ∈R^d} [ R(θ) := (1/m) Σ_{k∈[m]} L_k(θ) ],

where L_k(θ) = E_{(x,y)∼P_k}[ℓ_k(θ; (x, y))] is the local risk of the kth client over the possibly heterogeneous data distribution P_k. Here, θ represents the model parameters and ℓ_k(·) is a loss function such as the cross entropy (Acar et al., 2021).

FedElasticNet. The proposed method (FedElasticNet) leverages the elastic net approach to resolve the communication cost and client drift problems. We introduce ℓ₁-norm and ℓ₂-norm penalties on the local updates: in each round t ∈ [T], the kth local client attempts to find θ_k^t by solving the following optimization problem:

θ_k^t = arg min_θ L_k(θ) + (λ₂/2)‖θ − θ^{t-1}‖₂² + λ₁‖θ − θ^{t-1}‖₁,    (4)

where θ^{t-1} denotes the global model received from the server. The client then transmits the difference Δ_k^t = θ_k^t − θ^{t-1} to the server. Inspired by the elastic net, we introduce two types of regularizers for the local objective functions; however, each of them works in a different way so as to resolve one of the two FL problems. First, the ℓ₂-norm regularizer resolves the client drift problem by limiting the impact of variable local updates, as in FedProx (Li et al., 2020b). FedDyn (Acar et al., 2021) also adopts an ℓ₂-norm regularizer to control the client drift. Second, the ℓ₁-norm regularizer sparsifies the local updates Δ_k^t = θ_k^t − θ^{t-1}. We consider two ways of measuring the communication cost. One is the number of nonzero elements in Δ_k^t (Yoon et al., 2021; Jeong et al., 2021), which the ℓ₁-norm penalty reduces.
The other is the (Shannon) entropy, which is the theoretical lower bound on data compression (Cover & Thomas, 2006). We demonstrate in Section 4 that the ℓ₁-norm penalty on the local updates effectively reduces the number of nonzero elements as well as the entropy. To further boost the sparsity of Δ_k^t = θ_k^t − θ^{t-1}, we set Δ_k^t(i) = 0 before transmission if |Δ_k^t(i)| ≤ ϵ, where Δ_k^t(i) denotes the ith element of Δ_k^t. The parameter ϵ is chosen in a range that does not affect the classification accuracy. Our FedElasticNet approach can be integrated into existing FL algorithms such as FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020), and FedDyn (Acar et al., 2021) without additional costs, as described in the following subsections.
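The local round described above can be sketched end to end: minimize the elastic-net-regularized local objective, then ϵ-threshold the update before transmission. This is a minimal sketch using subgradient descent on a toy quadratic client loss; the function name `local_update` and all numeric values are illustrative, not from the paper.

```python
import numpy as np

def local_update(grad_fn, theta_global, lam1, lam2, eps, lr=0.1, steps=200):
    """One FedElasticNet client round (illustrative subgradient descent).

    Approximately minimizes
        L_k(theta) + lam2/2 ||theta - theta_global||^2
                   + lam1 ||theta - theta_global||_1,
    then thresholds the update to boost sparsity before transmission.
    """
    theta = theta_global.copy()
    for _ in range(steps):
        drift = theta - theta_global
        # elastic-net subgradient: loss gradient + l2 pull + l1 sign term
        g = grad_fn(theta) + lam2 * drift + lam1 * np.sign(drift)
        theta -= lr * g
    delta = theta - theta_global
    delta[np.abs(delta) <= eps] = 0.0  # epsilon-thresholding of tiny entries
    return delta

# Toy client loss L_k(theta) = 1/2 ||theta - target||^2
target = np.array([1.0, 0.0, 0.0, -2.0])
grad_fn = lambda th: th - target
delta = local_update(grad_fn, np.zeros(4), lam1=0.1, lam2=0.5, eps=1e-3)
print("nonzeros:", np.count_nonzero(delta))  # 2 for this toy problem
```

Coordinates whose optimal drift is zero never move (the subgradient vanishes there), so the transmitted update is sparse even before thresholding.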

Algorithm 1 FedElasticNet for FedAvg & FedProx

Input: T, θ⁰, λ₁ > 0, λ₂ > 0
1: for each round t = 1, 2, ..., T do
2:   Sample devices P_t ⊆ [m] and transmit θ^{t-1} to each selected local client
3:   for each local client k ∈ P_t do in parallel
4:     Set θ_k^t = arg min_θ L_k(θ) + (λ₂/2)‖θ − θ^{t-1}‖₂² + λ₁‖θ − θ^{t-1}‖₁
5:     Transmit Δ_k^t = θ_k^t − θ^{t-1} to the global server
6:   end for
7:   Set θ^t = θ^{t-1} + Σ_{k∈P_t} (n_k/n) Δ_k^t
8: end for

Algorithm 2 FedElasticNet for SCAFFOLD

Input: T, θ⁰, λ₁ > 0, λ₂ > 0, global step size η_g, and local step size η_l
1: for each round t = 1, 2, ..., T do
2:   Sample devices P_t ⊆ [m] and transmit θ^{t-1} and c^{t-1} to each selected device
3:   for each device k ∈ P_t do in parallel
4:     Initialize the local model θ_k^t = θ^{t-1}
5:     for b = 1, ..., B do
6:       Compute the mini-batch gradient ∇L_k(θ_k^t)
7:       θ_k^t ← θ_k^t − η_l (∇L_k(θ_k^t) − c_k^{t-1} + c^{t-1} + λ₂(θ_k^t − θ^{t-1}) + λ₁ sign(θ_k^t − θ^{t-1}))
8:     end for
9:     Set c_k^t = c_k^{t-1} − c^{t-1} + (1/(Bη_l))(θ^{t-1} − θ_k^t)
10:    Transmit Δ_k^t = θ_k^t − θ^{t-1} and Δc_k = c_k^t − c_k^{t-1} to the global server
11:  end for
12:  Set θ^t = θ^{t-1} + (η_g/|P_t|) Σ_{k∈P_t} Δ_k^t
13:  Set c^t = c^{t-1} + (1/m) Σ_{k∈P_t} Δc_k
14: end for

3.1. FEDELASTICNET FOR FEDAVG & FEDPROX (FEDAVG & FEDPROX + ELASTIC NET)

Our FedElasticNet can be applied to FedAvg (McMahan et al., 2017) by adding the two regularizers on the local updates, which resolves the client drift problem and the communication cost problem. As shown in Algorithm 1, each local client minimizes the local objective function (4). In Step 7, n and n_k denote the total number of data points across all clients and the number of data points of the kth client, respectively. It is worth mentioning that FedProx uses the ℓ₂-norm regularizer to address data and system heterogeneity (Li et al., 2020b). By adding the ℓ₁-norm regularizer, we can sparsify the local updates of FedProx and thus effectively reduce the communication cost. Notice that Algorithm 1 can be viewed as the integration of FedProx and FedElasticNet.
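The server side of Algorithm 1 (Step 7) is plain FedAvg aggregation applied to the sparsified differences. A minimal sketch, with client updates stubbed in by hand and the helper name `aggregate` chosen for illustration:

```python
import numpy as np

def aggregate(theta_prev, deltas, num_points):
    # Step 7: theta^t = theta^{t-1} + sum_k (n_k / n) * delta_k
    n = float(sum(num_points))
    theta = theta_prev.copy()
    for delta_k, n_k in zip(deltas, num_points):
        theta += (n_k / n) * delta_k  # sparse deltas contribute only where nonzero
    return theta

theta_prev = np.zeros(5)
deltas = [np.array([1.0, 0, 0, 0, 2.0]),   # sparse update from client 1
          np.array([0, 0, -1.0, 0, 2.0])]  # sparse update from client 2
theta_new = aggregate(theta_prev, deltas, num_points=[30, 10])
print(theta_new)
```

With data-point counts 30 and 10, the updates are weighted 0.75 and 0.25; in a real deployment the sparse deltas would be entropy-coded before transmission, which is where the communication savings materialize.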

3.2. FEDELASTICNET FOR SCAFFOLD (SCAFFOLD + ELASTIC NET)

In SCAFFOLD, each client updates its model with the following mini-batch gradient step and control variate c_k^t (Karimireddy et al., 2020):

θ_k^t ← θ_k^t − η_l (∇L_k(θ_k^t) − c_k^{t-1} + c^{t-1}),    (5)
c_k^t ← c_k^{t-1} − c^{t-1} + (1/(Bη_l))(θ^{t-1} − θ_k^t),    (6)

where η_l is the local step size and B is the number of mini-batches at each round. The control variate steers the local parameters θ_k^t toward the global optimum rather than each local optimum, which effectively resolves the client drift problem. However, SCAFFOLD incurs twice as much communication cost, since it must communicate both the local update Δ_k^t = θ_k^t − θ^{t-1} and the control variate Δc_k = c_k^t − c_k^{t-1}, which are of the same dimension. In order to reduce the communication cost of SCAFFOLD, we apply our FedElasticNet framework. In the proposed algorithm (see Algorithm 2), each local client performs the following mini-batch gradient step instead of (5):

θ_k^t ← θ_k^t − η_l (∇L_k(θ_k^t) − c_k^{t-1} + c^{t-1} + λ₂(θ_k^t − θ^{t-1}) + λ₁ sign(θ_k^t − θ^{t-1})),    (7)

where λ₁ sign(θ_k^t − θ^{t-1}) corresponds to the (sub)gradient of the ℓ₁-norm regularizer λ₁‖θ_k^t − θ^{t-1}‖₁. This ℓ₁-norm regularizer sparsifies the local update Δ_k^t = θ_k^t − θ^{t-1} and hence reduces the communication cost. Since the control variate already addresses the client drift problem, we can remove the ℓ₂-norm regularizer or set λ₂ to a small value.
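The modified local step can be written in a few lines. This is an illustrative sketch of the single gradient step, not the full Algorithm 2; `scaffold_en_step` and all numeric inputs are our own toy choices.

```python
import numpy as np

def scaffold_en_step(theta_k, theta_prev, grad, c_k, c, eta_l, lam1, lam2):
    """One local mini-batch step of SCAFFOLD + elastic net (sketch).

    'grad' stands in for the mini-batch gradient of L_k at theta_k.
    """
    drift = theta_k - theta_prev
    return theta_k - eta_l * (grad - c_k + c
                              + lam2 * drift            # optional l2 pull toward the server model
                              + lam1 * np.sign(drift))  # l1 subgradient that sparsifies the update

theta = scaffold_en_step(np.array([1.0, -0.5]), np.zeros(2),
                         grad=np.array([0.2, 0.2]),
                         c_k=np.zeros(2), c=np.zeros(2),
                         eta_l=0.1, lam1=0.01, lam2=0.0)
print(theta)
```

Setting `lam2=0.0` reflects the remark above: the control variates already handle drift, so only the ℓ₁ term is kept.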

3.3. FEDELASTICNET FOR FEDDYN (FEDDYN + ELASTIC NET)

In FedDyn, each local client optimizes the following local objective, which is the sum of its empirical loss and a penalized risk function:

θ_k^t = arg min_θ L_k(θ) − ⟨∇L_k(θ_k^{t-1}), θ⟩ + (λ₂/2)‖θ − θ^{t-1}‖₂²,    (8)

where the penalized risk is dynamically updated so as to satisfy the following first-order condition for local optima:

∇L_k(θ_k^t) − ∇L_k(θ_k^{t-1}) + λ₂(θ_k^t − θ^{t-1}) = 0.    (9)

This first-order condition shows that the stationary points of the local objective function are consistent with the server model (Acar et al., 2021); that is, the client drift is resolved. However, FedDyn incurs the same communication cost as FedAvg and FedProx. By integrating FedElasticNet and FedDyn, we can effectively reduce the communication cost of FedDyn as well. In the proposed method (i.e., FedElasticNet for FedDyn), each local client optimizes the following local empirical objective:

θ_k^t = arg min_θ L_k(θ) − ⟨∇L_k(θ_k^{t-1}), θ⟩ + (λ₂/2)‖θ − θ^{t-1}‖₂² + λ₁‖θ − θ^{t-1}‖₁,    (10)

which is the sum of (8) and the additional ℓ₁-norm penalty on the local updates. The corresponding first-order condition is given by

∇L_k(θ_k^t) − ∇L_k(θ_k^{t-1}) + λ₂(θ_k^t − θ^{t-1}) + λ₁ sign(θ_k^t − θ^{t-1}) = 0.    (11)

Notice that the stationary points of the local objective function are consistent with the server model as in (9). If θ_k^t ≠ θ^{t-1} elementwise (i.e., sign(θ_k^t − θ^{t-1}) = ±1), then the first-order condition becomes

∇L_k(θ_k^t) − ∇L_k(θ_k^{t-1}) + λ₂(θ_k^t − θ^{t-1}) = ±λ₁,    (12)

where ±λ₁ denotes the elementwise vector of signed λ₁ values. Our empirical results show that the optimized hyperparameter is λ₁ = 10⁻⁴ or 10⁻⁶, so the impact of ±λ₁ in (12) is negligible. Hence, the proposed FedElasticNet for FedDyn resolves the client drift problem. Further, the local update Δ_k^t = θ_k^t − θ^{t-1} is sparse due to the ℓ₁-norm regularizer, which effectively reduces the communication cost at the same time. The detailed algorithm is described in Algorithm 3.
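The first-order condition (11) can be checked numerically on a toy quadratic client loss, for which the regularized local objective has a closed-form minimizer via soft-thresholding. The sketch below is ours (assumed names `soft`, `feddyn_en_local`; toy loss L_k(θ) = ½‖θ − a‖²), not the paper's implementation.

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def feddyn_en_local(a, g_prev, theta_prev, lam1, lam2):
    # Closed-form minimizer of  1/2||theta - a||^2 - <g_prev, theta>
    #   + lam2/2 ||theta - theta_prev||^2 + lam1 ||theta - theta_prev||_1
    # for the toy quadratic loss (gradient: theta - a).
    u = soft(a - theta_prev + g_prev, lam1) / (1.0 + lam2)
    return theta_prev + u

a, g_prev, theta_prev = np.array([1.0]), np.array([0.3]), np.array([0.0])
lam1, lam2 = 0.1, 0.5
theta = feddyn_en_local(a, g_prev, theta_prev, lam1, lam2)
# Check the first-order condition (11) at the optimum; grad L_k(theta) = theta - a:
resid = (theta - a) - g_prev + lam2 * (theta - theta_prev) + lam1 * np.sign(theta - theta_prev)
print(resid[0])  # ~0 when theta != theta_prev
```

The residual vanishes on the nonzero coordinates, matching (11); on coordinates where the update is exactly zero, the ℓ₁ subgradient absorbs the slack instead.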

Algorithm 3 FedElasticNet for FedDyn

Input: T, θ⁰, λ₁ > 0, λ₂ > 0, h⁰ = 0, ∇L_k(θ_k⁰) = 0
1: for each round t = 1, 2, ..., T do
2:   Sample devices P_t ⊆ [m] and transmit θ^{t-1} to each selected device
3:   for each device k ∈ P_t do in parallel
4:     Set θ_k^t = arg min_θ L_k(θ) − ⟨∇L_k(θ_k^{t-1}), θ⟩ + (λ₂/2)‖θ − θ^{t-1}‖₂² + λ₁‖θ − θ^{t-1}‖₁
5:     Set ∇L_k(θ_k^t) = ∇L_k(θ_k^{t-1}) − λ₂(θ_k^t − θ^{t-1}) − λ₁ sign(θ_k^t − θ^{t-1})
6:     Transmit Δ_k^t = θ_k^t − θ^{t-1} to the global server
7:   end for
8:   for each device k ∉ P_t do in parallel
9:     Set θ_k^t = θ_k^{t-1} and ∇L_k(θ_k^t) = ∇L_k(θ_k^{t-1})
10:  end for
11:  Set h^t = h^{t-1} − (λ₂/m) Σ_{k∈P_t} (θ_k^t − θ^{t-1}) − (λ₁/m) Σ_{k∈P_t} sign(θ_k^t − θ^{t-1})
12:  Set θ^t = (1/|P_t|) Σ_{k∈P_t} θ_k^t − (1/λ₂) h^t
13: end for

Convergence Analysis. We provide a convergence analysis of FedElasticNet for FedDyn (Algorithm 3).

Theorem 3.1. Assume that the clients are uniformly randomly selected at each round and the local loss functions are convex and β-smooth. Then Algorithm 3 satisfies the following inequality:

E[R((1/T) Σ_{t=0}^{T-1} γ^t)] − R(θ*)
  ≤ (1/T)(1/κ₀)(E‖γ⁰ − θ*‖₂² + κC₀) + (κ′/κ₀)λ₁²d
    − (1/T)(2λ₁/λ₂) Σ_{t=1}^{T} ⟨γ^{t-1} − θ*, (1/m) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})]⟩,    (13)

where θ* = arg min_θ R(θ), P = |P_t|, γ^t = (1/P) Σ_{k∈P_t} θ_k^t, d = dim(θ),
κ = (10m/(Pλ₂)) (λ₂ + β)/(λ₂² − 25β²),
κ₀ = (2/λ₂)(λ₂² − 25λ₂β − 50β²)/(λ₂² − 25β²),
κ′ = (5/λ₂)(λ₂ + β)/(λ₂² − 25β²) = κ · P/(2m),
C₀ = (1/m) Σ_{k∈[m]} E‖∇L_k(θ_k⁰) − ∇L_k(θ*)‖², and
θ̃_k^t = arg min_θ L_k(θ) − ⟨∇L_k(θ_k^{t-1}), θ⟩ + (λ₂/2)‖θ − θ^{t-1}‖₂² + λ₁‖θ − θ^{t-1}‖₁ for all k ∈ [m].

Theorem 3.1 provides a convergence rate of FedElasticNet for FedDyn. As T → ∞, the first term of (13) converges to 0 at the rate O(1/T). The second and third terms of (13) are additional penalty terms caused by the ℓ₁-norm regularizer. The second term is a negligible constant in the range of hyperparameters of our interest. As for the last term, notice that the summand at each t includes the expected average of sign vectors whose elements are ±1.
If a coordinate of the sign vectors across clients is viewed as an IID Rademacher realization (±1 with probability 1/2 each), then its average over clients is small with high probability by the concentration property (see Appendix B.3). In addition, γ^{t-1} − θ* characterizes how much the average of the local models deviates from the globally optimal model, which tends to be small as training proceeds. Therefore, the effect of both additional terms is negligible.
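The concentration argument above is easy to simulate. Under the Rademacher model for one coordinate of the sign vectors, the average over m clients concentrates sharply around zero (a numerical sketch with assumed values m = 1000, δ = 0.1):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000        # number of clients
trials = 2000   # independent coordinates / rounds simulated

# Each coordinate of the averaged sign vector, modeled as the mean of m IID
# Rademacher (+/-1 with prob 1/2) variables, concentrates around 0.
signs = rng.choice([-1.0, 1.0], size=(trials, m))
avg = signs.mean(axis=1)

# Empirical P(|avg| > 0.1); the one-sided Hoeffding bound exp(-m * 0.1^2 / 2)
# from Appendix B.3 evaluates to about 0.0067 here.
print(np.mean(np.abs(avg) > 0.1))
```

The empirical exceedance probability is well below the Hoeffding bound, supporting the claim that the last term of (13) is negligible for realistic numbers of clients.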

4. EXPERIMENTS

In this section, we evaluate the proposed FedElasticNet on benchmark datasets for various FL scenarios. In particular, FedElasticNet is integrated with prior methods including FedProx (Li et al., 2020b), SCAFFOLD (Karimireddy et al., 2020), and FedDyn (Acar et al., 2021). The experimental results show that FedElasticNet effectively enhances communication efficiency while maintaining classification accuracy and resolving the client drift problem. We observe that the integration of FedElasticNet and FedDyn (Algorithm 3) achieves the best performance.

Experimental Setup. We use the same benchmark datasets as prior works. The evaluated datasets include MNIST (LeCun et al., 1998), a subset of EMNIST (EMNIST-L) (Cohen et al., 2017), CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and Shakespeare (Shakespeare, 1914). The IID split is generated by randomly assigning datapoints to the local clients. For the non-IID splits, a Dirichlet distribution over the label ratios is used to generate uneven label distributions among local clients, as in Zhao et al. (2018) and Acar et al. (2021). For the uneven label distributions among 100 experimental devices, the experiments are performed with Dirichlet parameters of 0.3 and 0.6, and the number of data points per client is drawn from a lognormal distribution as in Acar et al. (2021). The data imbalance is controlled by varying the variance of the lognormal distribution (Acar et al., 2021). We use the same neural network models as the FedDyn experiments (Acar et al., 2021). For MNIST and EMNIST-L, fully connected neural networks with 2 hidden layers of 200 and 100 neurons, respectively, are used (Acar et al., 2021). Note that the model used for the MNIST dataset is the same as in Acar et al. (2021) and McMahan et al. (2017).

For the CIFAR-10 and CIFAR-100 datasets, we use a CNN model consisting of 2 convolutional layers with 64 5 × 5 filters, followed by 2 fully connected layers with 394 and 192 neurons and a softmax layer. For the next-character prediction task on Shakespeare, we use a stacked LSTM as in Acar et al. (2021). For the MNIST, EMNIST-L, CIFAR-10, and CIFAR-100 datasets, we evaluate three cases: IID, non-IID with Dirichlet (.6), and non-IID with Dirichlet (.3). The Shakespeare dataset is evaluated for IID and non-IID cases as in Acar et al. (2021). We use a batch size of 10 for the MNIST dataset, 50 for the CIFAR-10, CIFAR-100, and EMNIST-L datasets, and 20 for the Shakespeare dataset. We optimize the hyperparameters depending on the evaluated dataset: the learning rates, λ₂, and λ₁.

Evaluation of Methods

We compare the baseline methods (FedProx, SCAFFOLD, and FedDyn) with the proposed FedElasticNet integrations (Algorithms 1, 2, and 3), respectively. We evaluate the communication cost and classification accuracy for non-IID settings of the prior methods and the proposed methods. The robustness to the client drift problem is measured by the classification accuracy in non-IID settings. We report the communication costs in two ways: (i) the number of nonzero elements in the transmitted values, as in Yoon et al. (2021) and Jeong et al. (2021), and (ii) the Shannon entropy of the transmitted values. Note that the Shannon entropy is the theoretical limit of data compression (Cover & Thomas, 2006), which can be approached by practical algorithms; for instance, Han et al. (2016) used Huffman coding for model compression. We calculate the entropy of discretized values with a bin size of 0.01. Note that the transmitted values are not discretized in FL; the discretization is used only to calculate the entropy. Lossy compression schemes (e.g., scalar quantization, vector quantization) are not considered since they involve several implementation issues beyond our research scope. Algorithms 1, 2, and 3 reduce the entropy compared to their baseline methods. We note that FedElasticNet integrated with FedDyn (Algorithm 3) achieves the minimum entropy, i.e., the minimum communication cost. For FedDyn, we evaluate the Shannon entropy for two cases: (i) transmitting the updated local models θ_k^t as in Acar et al. (2021) and (ii) transmitting the local updates Δ_k^t = θ_k^t − θ^{t-1} as in Algorithm 3. We observe that transmitting the local updates Δ_k^t instead of the local models θ_k^t reduces the Shannon entropy significantly. Hence, it is beneficial to transmit the local updates Δ_k^t even for FedDyn if it adopts an additional compression scheme.
The numbers of nonzero elements for the two cases (i.e., θ_k^t and Δ_k^t) are the same for FedDyn. Fig. 1 shows that FedElasticNet maintains the classification accuracy or incurs marginal degradation. We observe a classification gap between FedProx and Algorithm 1 for CIFAR-10 and CIFAR-100. However, the classification accuracies of FedDyn and Algorithm 3 are almost identical in the converged regime. In particular, Algorithm 3 significantly reduces the Shannon entropy, which can be explained by the distributions of the local updates Δ_k^t for FedDyn, Algorithm 2, and Algorithm 3. Because of the ℓ₁-norm penalty on the local updates, Algorithm 3 produces sparser local updates than FedDyn. The local updates of FedDyn can be modeled by a Gaussian distribution, whereas the local updates of FedElasticNet can be modeled by a non-Gaussian distribution (similar to the Laplacian distribution). It is well known in information theory that the Gaussian distribution maximizes the entropy for a given variance (Cover & Thomas, 2006). Hence, FedElasticNet reduces the entropy by transforming the Gaussian-like distribution into a non-Gaussian one.
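The entropy metric above can be reproduced in a few lines: discretize the update into bins of width 0.01 and compute the Shannon entropy of the resulting empirical distribution. This is our own sketch of the measurement (the synthetic Gaussian and sparsified updates are illustrative stand-ins for real FedDyn and FedElasticNet updates):

```python
import numpy as np

def empirical_entropy(delta, bin_size=0.01):
    """Shannon entropy (bits) of an update after discretizing into bins of
    width bin_size; used here as a proxy for the communication-cost lower bound."""
    bins = np.round(delta / bin_size).astype(np.int64)
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
dense = rng.normal(0.0, 0.05, size=10_000)    # Gaussian-like update (FedDyn-style)
sparse = dense * (rng.random(10_000) < 0.1)   # 90%-sparsified update (FedElasticNet-style)
print(empirical_entropy(dense) > empirical_entropy(sparse))  # True: sparsity lowers the entropy
```

A mostly-zero update concentrates probability mass on the zero bin, which drives the entropy (and hence the achievable coded size) down, mirroring the reported gains of Algorithm 3.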

5. CONCLUSION

We proposed FedElasticNet, a general framework that improves communication efficiency and resolves the client drift problem simultaneously. We introduced two types of penalty terms on the local model updates by repurposing the classical elastic net. The ℓ₁-norm regularizer sparsifies the local model updates, which reduces the communication cost. The ℓ₂-norm regularizer limits the impact of variable local updates to resolve the client drift problem. Importantly, our framework can be integrated with prior FL techniques so as to resolve the communication cost problem and the client drift problem at the same time. By integrating FedElasticNet with FedDyn, we achieve the best communication efficiency while maintaining classification accuracy on heterogeneous datasets.

A APPENDIX A.1 EXPERIMENT DETAILS

We provide the details of our experiments. We select the datasets for our experiments from those used in prior work on federated learning (McMahan et al., 2017; Li et al., 2020b; Acar et al., 2021). To fairly compare the non-IID environments, the datasets and the experimental environments are the same as those of Acar et al. (2021).

Hyperparameters. We describe the hyperparameters used in our experiments in Section 4. We perform a grid search to find the best λ₁ and ϵ for the proposed algorithms; each hyperparameter was doubled as long as the performance improved. We use the same λ₂ as in Acar et al. (2021). SCAFFOLD uses the same number of local epochs and batch size as the other algorithms; it is not included in Table 4 because it requires no other hyperparameters. Table 5 shows the hyperparameters used in our experiments.

Dataset      Algorithm      λ₁      λ₂         ϵ
CIFAR-10     FedProx        −       10⁻⁴       −
             Algorithm 1    10⁻⁶    10⁻⁴       10⁻³
             Algorithm 2    10⁻⁴    0          10⁻⁴
             FedDyn         −       10⁻²       −
             Algorithm 3    10⁻⁴    10⁻²       5 × 10⁻³
CIFAR-100    FedProx        −       10⁻⁴       −
             Algorithm 1    10⁻⁶    10⁻⁴       10⁻³
             Algorithm 2    10⁻⁴    0          10⁻⁴
             FedDyn         −       10⁻²       −
             Algorithm 3    10⁻⁴    10⁻²       10⁻³
MNIST        FedProx        −       10⁻⁴       −
             Algorithm 1    10⁻⁶    10⁻⁶       10⁻³
             Algorithm 2    10⁻⁴    0          10⁻⁴
             FedDyn         −       5 × 10⁻²   −
             Algorithm 3    10⁻⁴    5 × 10⁻²   5 × 10⁻³
             FedProx        −       10⁻⁴       −
             Algorithm 1    10⁻⁶    10⁻⁶       10⁻³
             Algorithm 2    10⁻⁴    0          10⁻⁴
             FedDyn         −       4 × 10⁻²   −
             Algorithm 3    10⁻⁴    4 × 10⁻²   2 × 10⁻³
             FedProx        −       10⁻⁴       −
             Algorithm 1    10⁻⁶    10⁻⁶       9 × 10⁻³
             Algorithm 2    10⁻⁶    0          9 × 10⁻⁴
             FedDyn         −       10⁻²       −
             Algorithm 3    10⁻⁶    10⁻²       10⁻²

Table 5: Hyperparameters.

A.2 REGULARIZER COEFFICIENTS

We selected λ₁ over {10⁻², 10⁻⁴, 10⁻⁶, 10⁻⁸} to observe the impact of λ₁ on the classification accuracy. We prefer a larger λ₁ to enhance communication efficiency, as long as the ℓ₁-norm regularizer does not degrade the classification accuracy. Figures 3, 4, and 5 show the classification accuracy depending on λ₁ for the CIFAR-10 dataset with a 10% participation rate and Dirichlet (.3). The unit of the cumulative number of elements is 10⁷. In Algorithm 1, we selected λ₁ = 10⁻⁶ to avoid any degradation of classification accuracy (see Fig. 3) while maximizing the sparsity of the local updates. We selected the coefficients λ₁ for the other algorithms in the same way (see Fig. 4 for Algorithm 2 and Fig. 5 for Algorithm 3).

B PROOF

We utilize some techniques from FedDyn (Acar et al., 2021).

B.1 DEFINITION

We introduce a formal definition and properties that we will use.

Definition B.0.1. A function L_k is β-smooth if it satisfies ‖∇L_k(x) − ∇L_k(y)‖ ≤ β‖x − y‖ for all x, y.

If a function L_k is convex and β-smooth, then it satisfies

−⟨∇L_k(x), z − y⟩ ≤ −L_k(z) + L_k(y) + (β/2)‖z − x‖²  for all x, y, z.    (15)

As a consequence of convexity and smoothness, the following property holds (Nesterov, 2018, Theorem 2.1.5):

(1/(2βm)) Σ_{k∈[m]} ‖∇L_k(x) − ∇L_k(x*)‖² ≤ R(x) − R(x*)  for all x,    (16)

where R(x) = (1/m) Σ_{k=1}^m L_k(x) and ∇R(x*) = 0. We will also use the relaxed triangle inequality (Karimireddy et al., 2020, Lemma 3):

‖Σ_{j=1}^n v_j‖² ≤ n Σ_{j=1}^n ‖v_j‖².    (17)

B.2 PROOF OF THEOREM 3.1

The theorem that we will prove is as follows.

Theorem B.1 (Full statement of Theorem 3.1). Assume that the clients are uniformly randomly selected at each round and the individual loss functions {L_k}_{k=1}^m are convex and β-smooth. Also assume that λ₂ > 27β. Letting R(θ) = (1/m) Σ_{k∈[m]} L_k(θ) and θ* = arg min_θ R(θ), Algorithm 3 satisfies

E[R((1/T) Σ_{t=0}^{T-1} γ^t)] − R(θ*)
  ≤ (1/T)(1/κ₀)(E‖γ⁰ − θ*‖² + κC₀) + (κ′/κ₀)λ₁²d
    − (1/T)(2λ₁/λ₂) Σ_{t=1}^{T} ⟨γ^{t-1} − θ*, (1/m) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})]⟩,

where γ^t = (1/P) Σ_{k∈P_t} θ_k^t = θ^t + (1/λ₂) h^t with P = |P_t|,

κ = (10m/(Pλ₂)) (λ₂ + β)/(λ₂² − 25β²),
κ₀ = (2/λ₂)(λ₂² − 25λ₂β − 50β²)/(λ₂² − 25β²),
κ′ = (5/λ₂)(λ₂ + β)/(λ₂² − 25β²) = κ · P/(2m),
C₀ = (1/m) Σ_{k∈[m]} E‖∇L_k(θ_k⁰) − ∇L_k(θ*)‖², and d = dim(θ).

To prove the theorem, we define variables that will be used throughout the proof:

θ̃_k^t = arg min_θ L_k(θ) − ⟨∇L_k(θ_k^{t-1}), θ⟩ + (λ₂/2)‖θ − θ^{t-1}‖₂² + λ₁‖θ − θ^{t-1}‖₁  for all k ∈ [m],    (19)
C_t = (1/m) Σ_{k∈[m]} E‖∇L_k(θ_k^t) − ∇L_k(θ*)‖²,    (20)
ϵ_t = (1/m) Σ_{k∈[m]} E‖θ̃_k^t − γ^{t-1}‖².    (21)

Note that θ̃_k^t optimizes the kth loss function under the assumption that the kth client (k ∈ [m]) is selected at round t; clearly, θ̃_k^t = θ_k^t if k ∈ P_t. C_t is the average of the expected squared differences between the gradients at each local model and at the globally optimal model.
Lastly, ϵ_t is the average squared deviation of each client model from the average of the local models. Note that C_t and ϵ_t approach zero if all clients' models converge to the globally optimal model, i.e., θ_k^t → θ*. The following lemma expresses h^t, which captures how much the averaged active devices' model deviates from the global model.

Lemma B.2. Algorithm 3 satisfies

h^t = (1/m) Σ_{k∈[m]} ∇L_k(θ_k^t).

Proof. Starting from the update of h^t in Algorithm 3,

h^t = h^{t-1} − (λ₂/m) Σ_{k∈[m]} (θ_k^t − θ^{t-1}) − (λ₁/m) Σ_{k∈[m]} sign(θ_k^t − θ^{t-1})
    = h^{t-1} − (1/m) Σ_{k∈[m]} (∇L_k(θ_k^{t-1}) − ∇L_k(θ_k^t) − λ₁ sign(θ_k^t − θ^{t-1})) − (λ₁/m) Σ_{k∈[m]} sign(θ_k^t − θ^{t-1})
    = h^{t-1} − (1/m) Σ_{k∈[m]} (∇L_k(θ_k^{t-1}) − ∇L_k(θ_k^t)),

where the second equality follows from (11). By summing h^t recursively, we have

h^t = h⁰ + (1/m) Σ_{k∈[m]} ∇L_k(θ_k^t) − (1/m) Σ_{k∈[m]} ∇L_k(θ_k⁰) = (1/m) Σ_{k∈[m]} ∇L_k(θ_k^t).

The next lemma shows how much the average of the local models changes using only round-t parameters.

Lemma B.3. Algorithm 3 satisfies

E[γ^t − γ^{t-1}] = (1/(λ₂m)) Σ_{k∈[m]} E[−∇L_k(θ̃_k^t)] − (λ₁/(λ₂m)) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})].

Proof. Starting from the definition of γ^t,

E[γ^t − γ^{t-1}]
 = E[(1/P) Σ_{k∈P_t} θ_k^t − θ^{t-1} − (1/λ₂) h^{t-1}]
 = E[(1/P) Σ_{k∈P_t} (θ_k^t − θ^{t-1}) − (1/λ₂) h^{t-1}]
 = E[(1/(λ₂P)) Σ_{k∈P_t} (∇L_k(θ_k^{t-1}) − ∇L_k(θ_k^t) − λ₁ sign(θ_k^t − θ^{t-1})) − (1/λ₂) h^{t-1}]    (23)
 = E[(1/(λ₂P)) Σ_{k∈P_t} (∇L_k(θ_k^{t-1}) − ∇L_k(θ̃_k^t) − λ₁ sign(θ̃_k^t − θ^{t-1})) − (1/λ₂) h^{t-1}]    (24)
 = E[(1/(λ₂m)) Σ_{k∈[m]} (∇L_k(θ_k^{t-1}) − ∇L_k(θ̃_k^t) − λ₁ sign(θ̃_k^t − θ^{t-1})) − (1/λ₂) h^{t-1}]    (25)
 = (1/(λ₂m)) Σ_{k∈[m]} E[−∇L_k(θ̃_k^t)] − (λ₁/(λ₂m)) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})],

where (23) follows from (11), (24) follows since θ̃_k^t = θ_k^t if k ∈ P_t, (25) follows since the clients are chosen uniformly at random, and the last equality is due to Lemma B.2.

Next, note that Algorithm 3 is the same as FedDyn except for the ℓ₁-norm penalty. Since this new penalty does not affect the derivations of the following bounds in FedDyn (Acar et al., 2021), we can obtain them directly; proofs are omitted for brevity:

E‖h^t‖² ≤ C_t,    (27)
C_t ≤ (1 − P/m) C_{t-1} + (2β²P/m) ϵ_t + (4βP/m) E[R(γ^{t-1}) − R(θ*)],    (28)
E‖γ^t − γ^{t-1}‖² ≤ (1/m) Σ_{k∈[m]} E‖θ̃_k^t − γ^{t-1}‖² = ϵ_t.    (29)

Lemma B.4.
Given the model parameters at round (t − 1), Algorithm 3 satisfies

E‖γ^t − θ*‖² ≤ E‖γ^{t-1} − θ*‖² − (2/λ₂) E[R(γ^{t-1}) − R(θ*)] + (β/λ₂) ϵ_t + E‖γ^t − γ^{t-1}‖²
  − (2λ₁/λ₂) ⟨γ^{t-1} − θ*, (1/m) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})]⟩,    (30)

where the expectations are taken conditionally on the parameters at round (t − 1).

Proof.

E‖γ^t − θ*‖²
 = E‖γ^{t-1} − θ* + γ^t − γ^{t-1}‖²
 = E‖γ^{t-1} − θ*‖² + 2E[⟨γ^{t-1} − θ*, γ^t − γ^{t-1}⟩] + E‖γ^t − γ^{t-1}‖²    (31)
 = E‖γ^{t-1} − θ*‖² + E‖γ^t − γ^{t-1}‖² + (2/(λ₂m)) Σ_{k∈[m]} E[⟨γ^{t-1} − θ*, −∇L_k(θ̃_k^t) − λ₁ sign(θ̃_k^t − θ^{t-1})⟩]    (32)
 ≤ E‖γ^{t-1} − θ*‖² + E‖γ^t − γ^{t-1}‖² + (2/(λ₂m)) Σ_{k∈[m]} E[L_k(θ*) − L_k(γ^{t-1}) + (β/2)‖θ̃_k^t − γ^{t-1}‖²]
   + (2/(λ₂m)) Σ_{k∈[m]} E[⟨γ^{t-1} − θ*, −λ₁ sign(θ̃_k^t − θ^{t-1})⟩]    (33)
 = E‖γ^{t-1} − θ*‖² + E‖γ^t − γ^{t-1}‖² − (2/λ₂) E[R(γ^{t-1}) − R(θ*)] + (β/λ₂) ϵ_t
   − (2λ₁/λ₂) ⟨γ^{t-1} − θ*, (1/m) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})]⟩,    (34)

where (32) follows from Lemma B.3, (33) follows from (15), and (34) follows from the definitions of R(·) and ϵ_t.

Lemma B.5. Algorithm 3 satisfies

(1 − 5β²/λ₂²) ϵ_t ≤ (10/λ₂²) C_{t-1} + (10β/λ₂²) E[R(γ^{t-1}) − R(θ*)] + (5λ₁²/λ₂²) d.    (35)

Proof. Starting from the definitions of ϵ_t and γ^{t-1},

ϵ_t = (1/m) Σ_{k∈[m]} E‖θ̃_k^t − γ^{t-1}‖²
 = (1/m) Σ_{k∈[m]} E‖θ̃_k^t − θ^{t-1} − (1/λ₂) h^{t-1}‖²
 = (1/λ₂²)(1/m) Σ_{k∈[m]} E‖∇L_k(θ_k^{t-1}) − ∇L_k(θ̃_k^t) − λ₁ sign(θ̃_k^t − θ^{t-1}) − h^{t-1}‖²    (36)
 ≤ (5/λ₂²)(1/m) Σ_{k∈[m]} ( E‖∇L_k(θ_k^{t-1}) − ∇L_k(θ*)‖² + E‖∇L_k(θ*) − ∇L_k(γ^{t-1})‖²
   + E‖∇L_k(γ^{t-1}) − ∇L_k(θ̃_k^t)‖² + λ₁²d + E‖h^{t-1}‖² )    (37)
 ≤ (5/λ₂²)(1/m) Σ_{k∈[m]} ( E‖∇L_k(θ*) − ∇L_k(γ^{t-1})‖² + E‖∇L_k(γ^{t-1}) − ∇L_k(θ̃_k^t)‖² ) + (5λ₁²/λ₂²) d + (10/λ₂²) C_{t-1}    (38)
 ≤ (10/λ₂²) C_{t-1} + (10β/λ₂²) E[R(γ^{t-1}) − R(θ*)] + (5β²/λ₂²) ϵ_t + (5λ₁²/λ₂²) d,    (39)

where (36) follows from (11), (37) follows from the relaxed triangle inequality (17) applied to the five-term decomposition [∇L_k(θ_k^{t-1}) − ∇L_k(θ*)] + [∇L_k(θ*) − ∇L_k(γ^{t-1})] + [∇L_k(γ^{t-1}) − ∇L_k(θ̃_k^t)] − λ₁ sign(θ̃_k^t − θ^{t-1}) − h^{t-1}, together with ‖sign(·)‖² ≤ d, (38) follows from the definition of C_{t-1} and (27), and (39) follows from (16), β-smoothness, and the definition of ϵ_t. Rearranging the ϵ_t terms proves the claim.
After multiplying (28) by κ = (10m/(Pλ₂))(λ₂ + β)/(λ₂² − 25β²), we obtain the following theorem by summing Lemma B.4 and the scaled version of (28).

Theorem B.6. Given the model parameters at round (t − 1), Algorithm 3 satisfies

κ₀ E[R(γ^{t-1}) − R(θ*)] ≤ (E‖γ^{t-1} − θ*‖² + κC_{t-1}) − (E‖γ^t − θ*‖² + κC_t) + κ(P/(2m)) λ₁² d
  − (2λ₁/λ₂) ⟨γ^{t-1} − θ*, (1/m) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})]⟩,

where κ = (10m/(Pλ₂))(λ₂ + β)/(λ₂² − 25β²) and κ₀ = (2/λ₂)(λ₂² − 25λ₂β − 50β²)/(λ₂² − 25β²). Note that the expectations taken above are conditional expectations given the model parameters at time (t − 1).

Proof. Summing Lemma B.4 and the κ-scaled version of (28), we have

E‖γ^t − θ*‖² + κC_t ≤ E‖γ^{t-1} − θ*‖² + κC_{t-1} − κ(P/m)C_{t-1} + κ(2β²P/m)ϵ_t + κ(4βP/m)E[R(γ^{t-1}) − R(θ*)]
  − (2/λ₂)E[R(γ^{t-1}) − R(θ*)] + (β/λ₂)ϵ_t + E‖γ^t − γ^{t-1}‖²
  − (2λ₁/λ₂) ⟨γ^{t-1} − θ*, (1/m) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})]⟩.    (40)

Since E‖γ^t − γ^{t-1}‖² ≤ ϵ_t by (29), we have

κ(2β²P/m)ϵ_t + (β/λ₂)ϵ_t + E‖γ^t − γ^{t-1}‖² ≤ (κ(2β²P/m) + β/λ₂ + 1) ϵ_t.    (41)

This can be further bounded as follows:

(κ(2β²P/m) + β/λ₂ + 1) ϵ_t
 = ((10m/(Pλ₂))(λ₂ + β)/(λ₂² − 25β²) · (2β²P/m) + β/λ₂ + 1) ϵ_t
 = (1/(λ₂(λ₂² − 25β²))) (20(λ₂ + β)β² + β(λ₂² − 25β²) + λ₂(λ₂² − 25β²)) ϵ_t
 = (λ₂(λ₂ + β)/(λ₂² − 25β²)) (1 − 5β²/λ₂²) ϵ_t
 ≤ (λ₂(λ₂ + β)/(λ₂² − 25β²)) ((10/λ₂²)C_{t-1} + (10β/λ₂²)E[R(γ^{t-1}) − R(θ*)] + (5λ₁²/λ₂²)d)
 = κ(P/m)C_{t-1} + κ(βP/m)E[R(γ^{t-1}) − R(θ*)] + κ(P/(2m))λ₁²d,

where the inequality follows from Lemma B.5. Substituting this bound into (40) gives

E‖γ^t − θ*‖² + κC_t ≤ E‖γ^{t-1} − θ*‖² + κC_{t-1} − κ₀E[R(γ^{t-1}) − R(θ*)] + κ(P/(2m))λ₁²d
  − (2λ₁/λ₂) ⟨γ^{t-1} − θ*, (1/m) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})]⟩.

Rearranging terms, we prove the claim.

Summing Theorem B.6 over t = 1, ..., T, telescoping, dividing by T, and applying Jensen's inequality, we obtain

E[R((1/T) Σ_{t=0}^{T-1} γ^t)] − R(θ*) ≤ (1/T)(1/κ₀)(E‖γ⁰ − θ*‖² + κC₀) + (1/κ₀)κ(P/(2m))λ₁²d
  − (1/T)(2λ₁/λ₂) Σ_{t=1}^{T} ⟨γ^{t-1} − θ*, (1/m) Σ_{k∈[m]} E[sign(θ̃_k^t − θ^{t-1})]⟩,

which completes the proof of Theorem B.1.

B.3 DISCUSSION ON CONVERGENCE

In this section, we revisit the convergence result stated in Theorem 3.1. Recall the bound
$$\mathbb{E}\Big[R\Big(\frac{1}{T}\sum_{t=0}^{T-1}\gamma^t\Big) - R(\theta^*)\Big] \le \frac{1}{T\kappa_0}\big(\mathbb{E}\|\gamma^0-\theta^*\|^2 + \kappa C^0\big) + \frac{1}{\kappa_0}\kappa\frac{P}{2m}\lambda_1^2 d - \frac{1}{T}\frac{2\lambda_1}{\lambda_2}\sum_{t=1}^{T}\Big\langle \gamma^{t-1}-\theta^*,\ \frac{1}{m}\sum_{k\in[m]}\mathbb{E}\big[\operatorname{sign}(\hat{\theta}^t_k-\theta^{t-1})\big]\Big\rangle.$$
As discussed in the main body, the second term is a negligible constant in the range of our hyperparameters, since $\lambda_1$ is of order $10^{-4}$ to $10^{-6}$. Consider the last term, where each summand is the inner product of two factors: (1) $\gamma^{t-1}-\theta^*$, the deviation of the averaged local models from the globally optimal model, and (2) the average of the sign vectors across clients. The deviation term characterizes how far the averaged local models are from the global optimum; thus, we can assume that it vanishes as training proceeds, or is at least bounded. To analyze the average of the sign vectors, assume a special case where the sign vectors $\operatorname{sign}(\hat{\theta}^t_k-\theta^{t-1})$ are IID across clients. To simplify the argument further, consider only a single coordinate $i$ of the sign vectors, say $X_k = \operatorname{sign}(\hat{\theta}^t_k(i)-\theta^{t-1}(i))$, and suppose $X_k = \pm 1$ with probability $0.5$ each. Then, the concentration inequality (Durrett, 2019) implies that for any $\delta > 0$,
$$\mathbb{P}\Big(\frac{1}{m}\sum_{k\in[m]}\operatorname{sign}\big(\hat{\theta}^t_k(i)-\theta^{t-1}(i)\big) > \delta\Big) = \mathbb{P}\Big(\frac{1}{m}\sum_{k\in[m]}X_k > \delta\Big) \le e^{-\frac{m\delta^2}{2}}$$
holds, which vanishes exponentially fast in the number of clients $m$. Since $m$ is large in many FL scenarios, the average of the sign vectors is negligible with high probability, which in turn implies that the last term is also negligible.
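The concentration argument above is easy to check numerically. The following sketch (our own illustration, not part of the paper's experiments) draws IID Rademacher signs for $m$ clients, modeling one coordinate of the sign vectors, and compares the empirical tail probability of their average against the Hoeffding bound $e^{-m\delta^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta, trials = 100, 0.3, 20000  # clients, threshold, Monte-Carlo trials

# X_k: one coordinate of the k-th client's sign vector, modeled as an
# IID Rademacher variable (+1 or -1 with probability 0.5 each)
X = rng.choice([-1.0, 1.0], size=(trials, m))
avg = X.mean(axis=1)  # average of the signs across the m clients

empirical = float(np.mean(avg > delta))       # P((1/m) sum_k X_k > delta)
hoeffding = float(np.exp(-m * delta**2 / 2))  # e^{-m delta^2 / 2}

print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {hoeffding:.4f}")
```

Already at $m=100$ the bound is $e^{-4.5}\approx 0.011$, and it shrinks exponentially as $m$ grows, consistent with the claim that the sign-average term is negligible for large federations.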



Zou & Hastie (2005) proposed the elastic net to encourage the grouping effect, i.e., to encourage strongly correlated covariates to enter or leave the model description together (Hu et al., 2018). The elastic net was originally proposed to overcome the limitations of the Lasso (Tibshirani, 1996), which imposes only an $\ell_1$-norm penalty on the model parameters. For instance, in a linear least squares problem, the elastic net solves
$$\min_{w}\ \|y - Xw\|_2^2 + \lambda_1\|w\|_1 + \lambda_2\|w\|_2^2,$$
combining the sparsity-inducing $\ell_1$ penalty with the stabilizing $\ell_2$ penalty.
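To make the penalty concrete, the following minimal sketch (our own illustration; the data, hyperparameters, and helper names are hypothetical) fits the elastic net objective $\tfrac12\|y-Xw\|^2 + \lambda_1\|w\|_1 + \lambda_2\|w\|_2^2$ by proximal gradient descent and exhibits the grouping effect on two strongly correlated covariates:

```python
import numpy as np

def soft_threshold(z, tau):
    """Soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def elastic_net(X, y, lam1, lam2, lr=1e-3, iters=5000):
    """Proximal gradient descent on 0.5*||y - Xw||^2 + lam1*||w||_1 + lam2*||w||_2^2."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) + 2 * lam2 * w        # gradient of the smooth part
        w = soft_threshold(w - lr * grad, lr * lam1)   # prox step for the l1 part
    return w

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# two strongly correlated covariates plus one irrelevant covariate
X = np.column_stack([z + 0.01 * rng.normal(size=n),
                     z + 0.01 * rng.normal(size=n),
                     rng.normal(size=n)])
y = X[:, 0] + X[:, 1]

w = elastic_net(X, y, lam1=5.0, lam2=5.0)
print(w)  # the two correlated columns receive nearly equal nonzero weights
```

The $\ell_2$ term makes the objective strictly convex, so the two near-duplicate columns share the weight almost equally (the grouping effect), whereas a pure $\ell_1$ (Lasso) penalty would tend to select one of them arbitrarily.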



Figure 1: Classification accuracy evaluated on the MNIST, EMNIST-L, CIFAR-10, and CIFAR-100 datasets (10% participation rate and Dirichlet(0.3)).

Figure 2: Comparison of the distributions of the transmitted local updates $\Delta^t$.

Figure 3: Classification accuracy and sparsity of local updates depending on λ 1 (Algorithm 1).

Figure 5: Classification accuracy and sparsity of local updates depending on λ 1 (Algorithm 3).

Figure 7: Classification accuracy evaluated on the IID and non-IID Shakespeare datasets.

Figure 8: Classification accuracy evaluated on the MNIST and EMNIST-L datasets (100% participation rate).

Figure 9: Classification accuracy evaluated on the CIFAR-10 and CIFAR-100 datasets (100% participation rate).

Now we are ready to prove the main claim by combining all lemmas. Take the sum on both sides of Theorem B.6 over $t = 1, \dots, T$. Then, telescoping gives us
$$\kappa_0\sum_{t=1}^{T}\mathbb{E}[R(\gamma^{t-1})-R(\theta^*)] \le \big(\mathbb{E}\|\gamma^0-\theta^*\|^2 + \kappa C^0\big) - \big(\mathbb{E}\|\gamma^T-\theta^*\|^2 + \kappa C^T\big) + T\,\kappa\frac{P}{2m}\lambda_1^2 d - \frac{2\lambda_1}{\lambda_2}\sum_{t=1}^{T}\Big\langle \gamma^{t-1}-\theta^*,\ \frac{1}{m}\sum_{k\in[m]}\mathbb{E}\big[\operatorname{sign}(\hat{\theta}^t_k-\theta^{t-1})\big]\Big\rangle.$$
Since $\kappa$ is positive if $\lambda_2 > 27\beta$, we can drop the negative term $-(\mathbb{E}\|\gamma^T-\theta^*\|^2 + \kappa C^T)$ in the middle. Then,
$$\kappa_0\sum_{t=1}^{T}\mathbb{E}[R(\gamma^{t-1})-R(\theta^*)] \le \mathbb{E}\|\gamma^0-\theta^*\|^2 + \kappa C^0 + T\,\kappa\frac{P}{2m}\lambda_1^2 d - \frac{2\lambda_1}{\lambda_2}\sum_{t=1}^{T}\Big\langle \gamma^{t-1}-\theta^*,\ \frac{1}{m}\sum_{k\in[m]}\mathbb{E}\big[\operatorname{sign}(\hat{\theta}^t_k-\theta^{t-1})\big]\Big\rangle.$$

Number of non-zero elements accumulated over all rounds, simulated with 10% client participation for IID and non-IID settings in FL scenarios. The non-IID settings of the MNIST, EMNIST-L, CIFAR-10, and CIFAR-100 datasets are created with a Dirichlet distribution over the labels owned by each client. Algorithm 1 is FedElasticNet for FedProx, Algorithm 2 is FedElasticNet for SCAFFOLD, and Algorithm 3 is FedElasticNet for FedDyn. The unit of the cumulative number of elements is $10^7$.

Cumulative entropy values of transmitted bits with 10% client participation for IID and non-IID settings in FL scenarios. The non-IID settings of the MNIST, EMNIST-L, CIFAR-10, and CIFAR-100 datasets are created with a Dirichlet distribution over the labels owned by each client. Algorithm 1 is FedElasticNet for FedProx, Algorithm 2 is FedElasticNet for SCAFFOLD, and Algorithm 3 is FedElasticNet for FedDyn. The left-side numbers for FedDyn are the entropy values when the local models $\theta^t_k$ are transmitted, and the right-side numbers in parentheses are the entropy values when the local updates $\Delta^t$ are transmitted.
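As a rough illustration of why sparsified updates compress well (a sketch under our own modeling assumptions, not the paper's measurement code), one can quantize an update vector and compare the empirical entropy of its symbols before and after sparsification:

```python
import numpy as np

def empirical_entropy(symbols):
    """Empirical Shannon entropy (bits per symbol) of a discrete sequence."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def quantize(v, bits=8):
    """Uniform symmetric quantization of a real-valued update vector."""
    levels = 2 ** (bits - 1) - 1
    return np.round(v * levels / np.abs(v).max()).astype(np.int32)

rng = np.random.default_rng(0)
dense = rng.standard_normal(10_000)  # a dense local update (modeled as Gaussian)

# the l1 penalty drives most update coordinates to zero; model this by
# keeping only the largest 5% of the entries in magnitude
thresh = np.quantile(np.abs(dense), 0.95)
sparse = np.where(np.abs(dense) > thresh, dense, 0.0)

h_dense = empirical_entropy(quantize(dense))
h_sparse = empirical_entropy(quantize(sparse))
print(f"entropy dense = {h_dense:.2f} bits, sparse = {h_sparse:.2f} bits")
```

The sparse vector is dominated by a single symbol (zero), so its per-symbol entropy, and hence the entropy-coded bit cost of transmission, drops sharply relative to the dense update.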



Cumulative entropy values of transmitted bits with 100% client participation for IID and non-IID settings in FL scenarios. The non-IID settings of the MNIST, EMNIST-L, CIFAR-10, and CIFAR-100 datasets are created with a Dirichlet distribution over the labels owned by each client. Algorithm 1 is FedElasticNet for FedProx, Algorithm 2 is FedElasticNet for SCAFFOLD, and Algorithm 3 is FedElasticNet for FedDyn. The left-side numbers for FedDyn are the entropy values when the local models $\theta^t_k$ are transmitted, and the right-side numbers in parentheses are the entropy values when the local updates $\Delta^t$ are transmitted.

