LOTTERY AWARE SPARSITY HUNTING: ENABLING FEDERATED LEARNING ON RESOURCE-LIMITED EDGE

Abstract

Limited computation and communication capabilities of clients pose significant challenges for federated learning (FL) over resource-limited edge nodes. A potential solution is to deploy off-the-shelf sparse learning algorithms that train a binary sparse mask on each client, with the expectation that a consistent sparse server mask emerges, yielding sparse weight tensors. However, as we investigate in this paper, such naive deployments incur a significant accuracy drop compared to FL with dense models, especially for clients with limited resource budgets. In particular, our investigations reveal a serious lack of consensus among the sparsity masks trained on different clients, which prevents the server mask from converging and can substantially degrade model performance. Based on these key observations, we propose federated lottery aware sparsity hunting (FLASH), a unified sparse learning framework that lets the server win a lottery, i.e., yield a sparse sub-model able to maintain classification performance under highly resource-limited client settings. Moreover, to support FL on devices requiring different parameter densities, we leverage our findings to present hetero-FLASH, where clients can have different target sparsity budgets based on their device resource limits. Experimental evaluations with multiple models on various datasets (both IID and non-IID) show the superiority of our models in closing the gap with the unpruned baseline, yielding up to ∼10.1% improved accuracy with ∼10.26× lower communication cost than existing alternatives at similar hyperparameter settings. Code is released as Supplementary.

1. INTRODUCTION

Federated learning (FL) McMahan et al. (2017) is a popular form of distributed training that has gained significant traction due to its ability to let multiple clients learn a shared global model without transferring their private data. However, clients' heterogeneity and resource limitations pose significant challenges for FL deployment over edge nodes, including mobile phones and IoT devices. To resolve these issues, various methods have been proposed over the past few years, including efficient learning for heterogeneous collaborative training Lin et al. (2020).

Our Contributions. Our contribution is fourfold. In view of the above limitations, we first identify crucial differences between a centralized model and the corresponding FL model in learning the sparse mask for each layer. In particular, we observe that in FL the server model fails to yield convergent sparse masks, primarily due to a lack of consensus among the clients' later-layer masks. In contrast, the centralized model shows a significantly stronger convergence trend in learning sparse masks for all layers. We then experimentally demonstrate the utility of pruning sensitivity and mask convergence in achieving good accuracy, setting the stage for closing the performance gap in sparse FL. We then leverage our findings to present federated lottery aware sparsity hunting (FLASH), a sparse FL methodology that addresses the aforementioned limitations in a unified manner. At its core, FLASH uses a two-stage FL process: a robust, low-cost layer sensitivity evaluation stage followed by an FL training stage. In particular, disentangling the layer sensitivity evaluation from sparse weight training allows us to either train a sparse mask or freeze a sensitivity-driven pre-defined mask, which can further translate to a proportional communication saving.
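The lack of consensus among clients' masks can be quantified, for instance, as the mean pairwise intersection-over-union (IoU) of the binary masks clients learn for a given layer. The following is an illustrative sketch of such a metric (the specific IoU-based measure is our assumption, not necessarily the paper's exact diagnostic):

```python
import numpy as np

def mask_consensus(masks):
    """Mean pairwise intersection-over-union (IoU) of binary masks.

    masks: list of flattened {0,1} numpy arrays, one layer mask per client.
    A value near 1.0 means clients agree on which weights to keep;
    a value near 0.0 indicates the lack of consensus discussed above.
    """
    n = len(masks)
    scores = []
    for i in range(n):
        for j in range(i + 1, n):
            inter = np.logical_and(masks[i], masks[j]).sum()
            union = np.logical_or(masks[i], masks[j]).sum()
            scores.append(inter / max(union, 1))
    return float(np.mean(scores))

# Three toy client masks for one layer (density d = 0.5 over 8 weights)
m1 = np.array([1, 1, 1, 1, 0, 0, 0, 0])
m2 = np.array([1, 1, 1, 0, 1, 0, 0, 0])
m3 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(round(mask_consensus([m1, m2, m3]), 3))  # → 0.248
```

Tracking such a score per layer over FL rounds would expose the pattern described above: early layers converging to high agreement while later layers stay near chance-level overlap.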
To deal with heterogeneity in clients' compute budgets, we further extend our methodology to hetero-FLASH, where individual clients can support different densities based on their resources. Here, to deal with the unique problem of the server selecting different sparse models for different clients, we present server-side gradual mask sub-sampling, which identifies sparse masks via a form of layer sensitivity re-calibration, proceeding from the model with the highest density support down to the one with the lowest. We conduct experiments on MNIST, FEMNIST, and CIFAR-10 with different models for both IID and non-IID client data partitioning. Experimental results show that, compared to the existing alternative Qiu et al. (2021) at iso-hyperparameter settings, FLASH can yield accuracy improvements of up to ∼8.9% and ∼10.1% on IID and non-IID data settings, respectively, with communication reduced by up to ∼10.2×.
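One natural way to realize such server-side sub-sampling is to derive each lower-density mask as a nested subset of the next higher-density mask, keeping the largest-magnitude weights the parent mask already retains. The sketch below illustrates this idea under that assumption; it is not the exact FLASH re-calibration rule:

```python
import numpy as np

def subsample_mask(weights, parent_mask, target_density):
    """Derive a lower-density mask nested inside a higher-density parent mask
    by keeping the largest-magnitude weights the parent already retains.
    Illustrative sketch only; FLASH's sensitivity re-calibration may differ."""
    k = int(round(target_density * weights.size))
    scores = np.abs(weights) * parent_mask   # ignore weights the parent pruned
    keep = np.argsort(scores)[-k:]           # indices of top-k magnitudes
    child = np.zeros_like(parent_mask)
    child[keep] = 1
    return child

w = np.array([0.9, -0.1, 0.5, 0.05, -0.7, 0.3, 0.2, -0.4])
parent = np.array([1, 0, 1, 0, 1, 1, 0, 1])  # density 5/8 for high-budget clients
child = subsample_mask(w, parent, target_density=3 / 8)
print(child.tolist())  # → [1, 0, 1, 0, 1, 0, 0, 0]
```

Because every child mask is a subset of its parent, clients with different budgets share a common core of active weights, which keeps server aggregation consistent across heterogeneous densities.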



¹We measure layer importance via the proxy of sensitivity. A layer with higher sensitivity demands a higher percentage of non-zero weights than a less sensitive layer.
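A sensitivity-proportional budget split can be sketched as follows; the proportional-allocation heuristic and the single-pass surplus redistribution are our assumptions for illustration, not the paper's exact rule:

```python
import numpy as np

def allocate_density(sensitivity, sizes, global_density, max_density=1.0):
    """Distribute a global non-zero-parameter budget across layers in
    proportion to their sensitivity scores (illustrative heuristic).

    sensitivity: per-layer importance scores (higher = keep more weights)
    sizes:       per-layer parameter counts
    Returns per-layer densities whose size-weighted mean equals global_density.
    """
    sensitivity = np.asarray(sensitivity, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    budget = global_density * sizes.sum()            # total non-zeros allowed
    raw = sensitivity / sensitivity.sum() * budget   # budget split by sensitivity
    d = np.minimum(raw / sizes, max_density)         # clip at fully dense
    # redistribute any clipped surplus uniformly (single pass, for brevity)
    surplus = budget - (d * sizes).sum()
    unclipped = d < max_density
    if unclipped.any():
        d[unclipped] += surplus / sizes[unclipped].sum()
    return d

d = allocate_density([4.0, 2.0, 1.0], [1000, 4000, 8000], global_density=0.1)
print(np.round(d, 4).tolist())  # → [0.7429, 0.0929, 0.0232]
```

Note how the small, highly sensitive layer is kept mostly dense while the large, insensitive layer absorbs most of the pruning, all while meeting the same global budget d = 0.1.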



Other proposed methods include heterogeneous collaborative training Zhu et al. (2021), distillation He et al. (2020), federated dropout techniques Horvath et al. (2021); Caldas et al. (2018b), and efficient aggregation for faster convergence and reduced communication Reddi et al. (2020); Li et al. (2020b). However, these methods do not necessarily address the growing concerns of highly computation- and communication-limited edge devices. Meanwhile, reducing the memory, compute, and latency costs of deep neural networks (DNNs) in centralized training for efficient edge deployment has also become an active area of research. In particular, recently proposed sparse learning (SL) strategies Evci et al. (2020); Kundu et al. (2021b); Mocanu et al. (2018); Dettmers & Zettlemoyer (2019); Raihan & Aamodt (2020) effectively train weights and associated binary sparse masks so that only a fraction of model parameters is updated during training, potentially enabling lucrative reductions in both training time and compute cost Qiu et al. (2021); Raihan & Aamodt (2020), while creating a model that meets a target parameter density d and yields accuracy close to that of the unpruned baseline. However, the challenges and opportunities of sparse learning in FL are yet to be fully unveiled. Only very recently have a few works Bibikar et al. (2021); Huang et al. (2022) tried to leverage sparse learning in FL, primarily to show its efficacy in non-IID settings. Nevertheless, these works mainly used sparsity for non-aggressive model compression, limiting the actual benefits of sparse learning, and required multiple local epochs, which may further increase the training time for stragglers, making the overall FL process inefficient Zhang et al. (2021). Moreover, the server-side pruning used in these methods may not necessarily adhere to the layers' pruning sensitivity¹, which often plays a crucial role in sparse model performance Kundu et al. (2021b); Zhang et al. (2018). Another recent work, ZeroFL Qiu et al. (2021), has explored deploying sparse learning in FL settings. However, Qiu et al. (2021) could not leverage any advantage of model sparsity in the clients' communication cost and had to keep significantly more parameters active than a target d to yield good accuracy. Moreover, as shown in Fig. 1(b), for d = 0.05, ZeroFL still suffers a substantial accuracy drop of ∼14% compared to the baseline.

Figure 1: Comparison of (a) accuracy at different communication budgets with ZeroFL Qiu et al. (2021) and FedAvg (w/ d = 1.0); (b) accuracy vs. parameter density of each client. The proposed approaches significantly outperform the existing alternative Qiu et al. (2021) at ultra-low target parameter density (d).
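The connection between density and communication cost can be made concrete with a back-of-envelope model: if a client sends only its non-zero weights (e.g., as 32-bit floats at density d) plus a 1-bit-per-weight binary mask, the payload shrinks accordingly. This is a simplified bookkeeping sketch under those assumptions, not the paper's measured savings:

```python
def comm_saving(density, bits_per_weight=32):
    """Upper-bound communication saving from sending only non-zero weights
    (values at the given density) plus a 1-bit-per-weight binary mask,
    relative to a dense update. Illustrative cost model only; real savings
    depend on the actual mask/index encoding used."""
    dense = bits_per_weight                      # bits per weight, dense update
    sparse = bits_per_weight * density + 1.0     # value bits + mask bit, per weight
    return dense / sparse

print(round(comm_saving(0.05), 1))  # → 12.3
```

At d = 0.05 this simple model bounds the saving at roughly 12×; actual reported reductions (e.g., ∼10.2× above) are lower because of encoding overheads and layers kept denser for sensitivity reasons.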

Model Pruning. Over the past few years, a plethora of research has sought efficient model compression via pruning, particularly in centralized training Ma et al. (2021); Frankle & Carbin (2018); Liu et al. (2021); You et al. (2019); He et al. (2018). Pruning essentially identifies and removes unimportant parameters to yield compute-efficient inference models. More recently, sparse learning Evci et al. (2020); Kundu et al. (2021b); Dettmers & Zettlemoyer (2019); Raihan & Aamodt (2020); Kundu et al. (2020; 2019), a popular form of model pruning, has gained significant traction as it can yield FLOPs advantages even during training. In particular, it ensures that only a fraction d of the model parameters remains non-zero throughout training, for a target parameter density d (d < 1.0; sparsity is (1 − d) × 100%), potentially reducing training compute and communication costs if deployed for FL. Dynamic network rewiring (DNR). We leverage DNR Kundu et al. (2021b) to sparsely learn the sparsity mask of each client. In DNR, a model starts with a randomly initialized mask following the
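The rewiring mechanic common to DNR-style sparse learning can be sketched as a prune-and-regrow step that keeps the density d fixed: drop the smallest-magnitude active weights, then regrow the same number of connections elsewhere. The random regrowth below is a simplification (DNR's actual drop/grow criteria differ); the sketch only illustrates how density is preserved across steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_and_regrow(weights, mask, regrow_frac=0.25):
    """One simplified prune-and-regrow rewiring step: drop the
    smallest-magnitude fraction of currently active weights, then regrow the
    same number of connections at random inactive positions, so the overall
    parameter density d stays fixed throughout training."""
    active = np.flatnonzero(mask)
    n_rewire = max(1, int(regrow_frac * active.size))
    # prune: smallest-magnitude active weights
    drop = active[np.argsort(np.abs(weights[active]))[:n_rewire]]
    mask = mask.copy()
    mask[drop] = 0
    # regrow: random currently-inactive positions
    inactive = np.flatnonzero(mask == 0)
    grow = rng.choice(inactive, size=n_rewire, replace=False)
    mask[grow] = 1
    return mask

w = rng.normal(size=16)
m = (np.arange(16) < 8).astype(int)  # initial mask, density d = 0.5
m2 = prune_and_regrow(w, m)
print(int(m2.sum()))  # → 8 (density preserved)
```

In the FL setting discussed above, it is precisely these per-client rewiring decisions that can drift apart across clients, producing the mask-consensus problem this paper targets.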

