SUPERFED: WEIGHT-SHARED FEDERATED LEARNING

Abstract

Federated Learning (FL) is a well-established technique for privacy-preserving distributed training. Much attention has been given to various aspects of FL training. A growing number of applications that consume FL-trained models, however, increasingly operate under dynamically and unpredictably variable conditions, rendering a single model insufficient. We argue for training a global "family of models" cost-efficiently in a federated fashion. Training them independently for different tradeoff points, however, incurs ≈ O(k) cost for any k architectures of interest. Straightforward applications of FL techniques to recent weight-shared training approaches are either infeasible or prohibitively expensive. We propose SuperFed, an architectural framework that incurs O(1) cost to co-train a large family of models in a federated fashion by leveraging weight-shared learning. We achieve an order-of-magnitude cost savings in both communication and computation by proposing two novel training mechanisms: (a) distribution of weight-shared models to federated clients, and (b) central aggregation of arbitrarily overlapping weight-shared model parameters. The combination of these mechanisms is shown to achieve an order-of-magnitude (9.43x) reduction in computation and communication cost for training a family of 5 × 10^8 models, compared to independently training as few as k = 9 DNNs, without any accuracy loss.

1. INTRODUCTION

With the increase in the computational power of smartphones, the use of on-device inference in mobile applications is on the rise, ranging from image recognition (google vision; azure vision), virtual assistants (Alexa), and voice recognition (google ASR) to recommendation systems (Bin et al., 2019). Indeed, on-device inference is pervasive, especially with recent advances in software (Chen et al., 2018; torch mobile), accelerators (samsung exynos; apple neural engine), and neural architecture optimizations (Howard et al., 2019; Sun et al., 2020; Wu et al., 2019a). The surge in its use cases (Cai et al., 2017; Han et al., 2019; Kang et al., 2017; Lane et al., 2016; Reddi et al., 2021; Wu et al., 2019b) has led to a growing interest in providing support not only for on-device inference, but also for on-device training of these models (Dhar et al., 2021). Federated Learning (FL) is an emerging distributed training technique that allows smartphones with different data sources to collaboratively train an ML model (McMahan et al., 2017; Chen & Chao, 2020; Wang et al., 2020; Karimireddy et al., 2021; Konečnỳ et al., 2016). FL enjoys three key properties: it (a) has a smaller communication cost, (b) is massively parallel, and (c) involves no data sharing. As a result, numerous applications such as GBoard (Hard et al., 2018), Apple's Siri (sir, 2019), pharmaceutical discovery (CORDIS., 2019), medical imaging (Silva et al., 2019), health record mining (Huang & Liu, 2019), and recommendation systems (Ammad-ud-din et al., 2019) are readily adopting federated learning. However, adoption of FL in smartphone applications is non-trivial.
As a result, recent works pay attention to the emerging challenges that occur in training, such as data heterogeneity (Karimireddy et al., 2021; Li et al., 2020; Acar et al., 2021), heterogeneous resources (Alistarh et al., 2017; Ivkin et al., 2019; Li et al., 2020; Konečnỳ et al., 2016), and privacy (Truex et al., 2019; Mo et al., 2021; Gong et al., 2021). These advances have helped FL adoption, particularly under challenging training conditions. However, the success of FL adoption depends not only on tackling challenges that occur in training but also post-training (at deployment). Indeed, deploying ML models for on-device inference is exceedingly challenging (Wu et al., 2019b; Reddi et al., 2021). Yet, most existing FL training techniques do not take these deployment challenges into consideration. In this paper, we focus on developing FL training techniques that account for them.
It is well-established that any single model statically chosen for on-device inference is sub-optimal. This is because deployment conditions may continuously change on a multi-task system like a smartphone (Xu et al., 2019) due to dynamic resource availability (Fang et al., 2018). For instance, the computational budget may vary due to excessive consumption by background apps; the energy budget may vary if the smartphone is in low-power or power-saver mode (Yu et al., 2019). Furthermore, an increasing number of applications require flexibility with respect to resource-accuracy trade-offs in order to efficiently utilize the dynamic resources available at deployment (Fang et al., 2018). In all of these deployment scenarios, a single model neither satisfies variable constraints nor offers the flexibility to make trade-offs. In contrast to existing FL approaches that produce a single model, we need to produce multiple model variants (varying in size/latency) for efficient on-device inference. However, training these model variants independently is computationally prohibitive (Cai et al., 2020).
This is particularly true for FL, where training these variants independently cumulatively inflates both the communication cost and the computation cost. Thus, it is imperative to develop techniques for training multiple models in a federated fashion cost-efficiently and without any accuracy loss, achieving asymptotic cost improvements relative to independently training them.
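To make the cost argument concrete, the following back-of-the-envelope sketch compares the total communication volume of k independent federated runs against a single weight-shared run. All numbers (rounds, cohort size, model sizes) are illustrative assumptions, not measurements from this paper.

```python
# Illustrative communication-cost comparison: k independent FL runs
# vs. one weight-shared (supernetwork) FL run. Hypothetical parameters.

def independent_cost_mb(k, rounds, clients_per_round, model_mb):
    # Each of the k variants needs its own full federated run; every
    # round, each participating client downloads and uploads one copy.
    return k * rounds * clients_per_round * 2 * model_mb

def weight_shared_cost_mb(rounds, clients_per_round, supernet_mb):
    # A single federated run of the supernetwork covers all variants;
    # per-round traffic is bounded by the supernetwork's size.
    return rounds * clients_per_round * 2 * supernet_mb

indep = independent_cost_mb(k=9, rounds=500, clients_per_round=10, model_mb=20)
shared = weight_shared_cost_mb(rounds=500, clients_per_round=10, supernet_mb=25)
print(f"independent: {indep} MB, weight-shared: {shared} MB, "
      f"savings: {indep / shared:.1f}x")
```

The independent cost grows linearly in k (the O(k) term from the abstract), while the weight-shared cost is fixed regardless of how many subnetworks the supernetwork contains.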



Figure 1: Weight-shared FL training. Shared weights reside on the server. NN subnetworks are distributed (left) and deployed (right) to participating clients, globally training a dense accuracy-latency tradeoff space.

To achieve this goal, we propose SuperFed, a novel federated framework that targets the problem of efficient on-device inference on smartphones with better training algorithms. SuperFed co-trains a family of model variants simultaneously in a federated fashion by leveraging weight-sharing (Cai et al., 2020; Yu et al., 2020). After federated training, the clients perform local neural architecture search to find the appropriate model variants for their deployment scenarios. In weight-sharing, all model variants are subnetworks of a supernetwork (Cai et al., 2020) and share their parameters partially. The largest subnetwork's (i.e., the supernetwork's) parameters contain the other subnetworks' parameters within them as proper subgraphs. Weight-sharing brings two key benefits to FL: it (a) significantly reduces the communication and computation cost of training k model variants, and (b) requires no re-training after the federated training of the supernetwork is complete. Hence, SuperFed decouples training from neural architecture search, which allows local clients to dynamically select subnetworks of their choice from the globally-trained supernetwork without any re-training.

However, applying existing weight-shared training techniques to federated learning is challenging. First, weight-shared training techniques like Progressive Shrinking (PS) (Cai et al., 2020) work on centralized i.i.d. data, whereas data in FL is decentralized and typically non-i.i.d. Second, PS uses a pre-trained largest subnetwork during weight-shared training. This requirement becomes impractical in the context of FL because (a) a globally pre-trained FL model is hard to obtain, and (b) training one first may significantly increase the overall communication cost.
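The containment property described above (smaller subnetworks' parameters living inside the supernetwork's tensors) can be illustrated with a minimal sketch. This is a generic width-slicing scheme in the spirit of Cai et al. (2020) and Yu et al. (2020), not necessarily the exact parameter-sharing layout used by SuperFed.

```python
import numpy as np

# Illustrative weight-sharing sketch: a subnetwork's weights are a
# slice of the supernetwork's weight tensor, so every smaller variant
# is contained in the largest one and shares its parameters.

rng = np.random.default_rng(0)
supernet_w = rng.standard_normal((64, 64))  # 64 out x 64 in channels

def extract_subnet(supernet_w, out_ch, in_ch):
    # A width-reduced subnetwork reuses the leading channels; basic
    # NumPy slicing returns a view, so the weights are truly shared.
    return supernet_w[:out_ch, :in_ch]

small = extract_subnet(supernet_w, 16, 16)
medium = extract_subnet(supernet_w, 32, 32)

# The small subnetwork's weights are a proper subset of the medium's,
# and both share storage with the supernetwork.
assert np.array_equal(medium[:16, :16], small)
assert np.shares_memory(supernet_w, small)
```

Because all variants alias the same storage, a gradient step taken through any one subnetwork also updates the corresponding region of every larger subnetwork, which is both the source of the O(1) cost and of the interference discussed next.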
Third, weight-shared training techniques need to minimize interference (Cai et al., 2020; Xu et al., 2022). Interference occurs when smaller subnetworks interfere with the training of larger subnetworks (Fig. 2a in Yu et al. (2020)). To mitigate interference, PS adopts a multi-phase training approach that prioritizes the training of larger subnetworks before training smaller ones. Such multi-phase training may incur significant communication cost in federated learning. Instead, we argue that a weight-shared training technique for FL must be one-shot (single phase) while still mitigating interference.

As part of the SuperFed framework, we propose MaxNet, a weight-shared training technique for FL. Figure 1 provides a high-level overview of our proposed approach. MaxNet hosts the supernetwork on the server and assumes no pre-trained model before federated training begins. MaxNet decides which individual subnetworks are distributed for training to participating clients and when (subnetwork distribution). MaxNet's subnetwork distribution optimizes both lower bound (smallest subnet)
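When clients train different subnetworks, their returned updates cover different, overlapping regions of the supernetwork, so the server must aggregate them element-wise. The sketch below shows one plausible overlap-aware averaging rule, hedged as an illustration of the aggregation problem named in the abstract rather than MaxNet's exact algorithm.

```python
import numpy as np

# Sketch of server-side aggregation when clients trained different,
# overlapping slices of one supernetwork weight tensor. The per-element
# averaging rule here is an illustrative assumption, not necessarily
# the paper's MaxNet aggregation.

supernet_w = np.zeros((4, 4))  # current global weights

# Each client reports (slice it trained, updated weights for that slice).
client_updates = [
    (np.s_[:2, :2], np.full((2, 2), 1.0)),  # client with a small subnet
    (np.s_[:3, :3], np.full((3, 3), 2.0)),  # client with a medium subnet
]

acc = np.zeros_like(supernet_w)
count = np.zeros_like(supernet_w)
for sl, update in client_updates:
    acc[sl] += update
    count[sl] += 1

# Average each element over the clients that actually trained it;
# elements no client touched keep the previous global value.
mask = count > 0
supernet_w[mask] = acc[mask] / count[mask]
```

Elements covered by both clients are averaged (e.g. the top-left region becomes 1.5), elements covered by one client take that client's value, and untouched elements are left unchanged, which is the essential difference from vanilla FedAvg over identical model copies.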

