SUPERNET TRAINING FOR FEDERATED IMAGE CLASSIFICATION UNDER SYSTEM HETEROGENEITY

Abstract

Efficiently deploying deep neural networks across many devices with diverse resource constraints, particularly edge devices, is one of the most challenging problems when data privacy must also be preserved. Conventional approaches have evolved to either improve a single global model while keeping each client's heterogeneous training data decentralized (i.e., data heterogeneity; Federated Learning (FL)) or to train an overarching network that supports diverse architectural settings for heterogeneous systems with different computational capabilities (i.e., system heterogeneity; Neural Architecture Search). However, few studies have considered both directions simultaneously. This paper proposes the federation of supernet training (FedSup) framework to address both scenarios at once: clients send and receive a supernet that contains all possible architectures as nested sub-networks. The approach is inspired by the observation that averaging parameters during model aggregation in FL is similar to weight sharing in supernet training. Thus, the proposed FedSup framework combines the weight-sharing approach widely used for training one-shot models with FL averaging (FedAvg). Furthermore, we develop an efficient algorithm (E-FedSup) that sends sub-models to clients in the broadcast stage to reduce communication costs and training overhead, and we introduce several strategies to enhance supernet training in the FL environment. We verify the proposed approach with extensive empirical evaluations. The resulting framework also ensures robustness to data and model heterogeneity on several standard benchmarks.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable empirical success in many machine learning applications. This has led to increasing demand for training models using local data from mobile devices and the Internet of Things (IoT), because billions of local machines worldwide can provide more computational power and data than a central-server system (Lim et al., 2020; El-Sayed et al., 2018). However, it remains difficult to deploy such models efficiently on diverse hardware platforms with significantly different specifications (e.g., latency, TPU) (Cai et al., 2019) and to subsequently train a global model without sharing local data. Federated learning (FL) has become a popular paradigm for collaborative machine learning (Li et al., 2019; 2018; Karimireddy et al., 2019; Mohri et al., 2019; Lin et al., 2020; Acar et al., 2021). In the FL framework, each client (e.g., a mobile device or an entire organization) individually updates its local model using its private data; the central server (e.g., a service manager) then updates the global model by aggregating all local updates, and the process is repeated until convergence. Most notably, federated averaging (FedAvg) (McMahan et al., 2017) uses averaging as its aggregation method over locally learned client models, which helps avoid systematic privacy leakage (Voigt & Von dem Bussche, 2017). Despite the popularity of FL, the resulting models suffer from data heterogeneity because the locally generated data are not identically distributed.
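To make the aggregation step described above concrete, the following is a minimal sketch of one FedAvg round on a toy linear model with squared loss. It is not the paper's implementation: `local_update`, the learning rate, and the toy client datasets are hypothetical illustrations of sample-weighted parameter averaging.

```python
import numpy as np

def local_update(weights, data, lr=0.1):
    """One pass of local SGD on a linear model with squared loss.
    `data` is a list of (x, y) pairs; purely illustrative."""
    w = weights.copy()
    for x, y in data:
        grad = 2 * (w @ x - y) * x  # gradient of (w.x - y)^2
        w -= lr * grad
    return w

def fedavg_round(global_w, client_datasets):
    """One FedAvg round: each client trains locally, then the server
    averages the returned weights, weighting clients by dataset size."""
    client_ws, sizes = [], []
    for data in client_datasets:
        client_ws.append(local_update(global_w, data))
        sizes.append(len(data))
    return np.average(client_ws, axis=0, weights=np.asarray(sizes, float))

# Toy run: two clients whose data both fit y = 2*x.
w0 = np.zeros(1)
clients = [[(np.array([1.0]), 2.0)] * 4,   # client with 4 samples
           [(np.array([2.0]), 4.0)] * 2]   # client with 2 samples
w1 = fedavg_round(w0, clients)             # moves from 0 toward 2
```

The sample-size weighting mirrors FedAvg's aggregation rule, in which larger local datasets contribute proportionally more to the new global model.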
To tackle data heterogeneity, most FL studies have considered new objective functions for aggregating local models (Acar et al., 2021; Wang et al., 2020; Yuan & Ma, 2020; Li et al., 2021a), using auxiliary data on the central server (Lin et al., 2020; Zhang et al., 2022), encoding weights for an efficient communication stage (Wu et al., 2022; Hyeon-Woo et al., 2022; Xu et al., 2021), or recruiting helpful clients for more accurate global models (Li et al., 2019; Cho et al., 2020; Nishio & Yonetani, 2019). Recently, there has also been tremendous interest in deploying FL algorithms in real-world applications such as mobile devices and IoT (Diao et al., 2021; Horvath et al., 2021; Hyeon-Woo et al., 2022). However, significant issues remain in delivering compact models specialized for edge devices with widely diverse hardware platforms and efficiency constraints (Figure 1 (a)). In the early days, neural architecture search (NAS) studies suffered from system heterogeneity issues in deploying resource-adaptive models to clients, but this challenge has been largely resolved by training a single set of shared weights as a one-shot model (i.e., a supernet) (Cai et al., 2019; Yu et al., 2020) (Figure 1 (b)). However, this approach has rarely been considered under data heterogeneity scenarios, which can provoke training instability. Recent works have studied model heterogeneity in FL by sampling or generating sub-networks (Mushtaq et al., 2021; Diao et al., 2021; Khodak et al., 2021; Shamsian et al., 2021), or by employing models pruned from a global model (Horvath et al., 2021; Luo et al., 2021b). However, these methods have limitations in model scaling (e.g., depth (#layers), width (#channels), kernel size), training stability, and client personalization. This paper presents a novel framework that considers both scenarios, namely the Federation of Supernet Training (FedSup), i.e., sub-models nested in a supernet for both data and model heterogeneity.
FedSup uses weight sharing from supernet training: the server forwards the supernet to each local client, and each client trains an ensemble of sub-models sampled from it (Figure 1 (c)). We also present an Efficient FedSup (E-FedSup), which broadcasts sub-models to local clients in lieu of full supernets (Figure 1 (d)). To evaluate both methods, we focus on improving global accuracy (on the server, i.e., universality) and personalized accuracy (of on-device tuned models, i.e., personalization). Our key contributions are summarized as follows:

• We propose a novel framework that simultaneously obtains a large number of sub-networks at once under data heterogeneity, and develop an efficient version that broadcasts sub-models for local training, dramatically reducing resource consumption during training and hence the network bandwidth and local training overhead for clients.

• To enhance supernet training under federated scenarios, we propose a new normalization technique named Parametric Normalization (PN), which substitutes for the mean and variance batch statistics in batch normalization. Our method protects data privacy by not tracking running statistics of the representations at each hidden layer, and reduces discrepancies across normalization layers shared by different sub-networks.

• We extend previous methods by analyzing global accuracy and per-client personalized accuracy when multiple dimensions (depth, width, kernel size) are dynamic, and demonstrate the superiority of our methods on the accuracy-vs-FLOPs Pareto front.

• Experimental results confirm that FedSup and E-FedSup provide much richer representations than current static training approaches on several FL benchmark datasets, improving both global and personalized client model accuracy.
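To illustrate the weight-sharing idea behind sub-models nested in a supernet, the sketch below slices sub-layers of different widths out of one shared weight matrix and samples sub-network configurations, in the style of slimmable networks. The class, the search-space choices (depths, width ratios), and the single-layer setup are hypothetical simplifications, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedLinear:
    """A full-width layer whose narrower sub-layers reuse the leading
    rows/columns of one shared weight matrix (illustrative only)."""
    def __init__(self, d_in, d_out):
        self.W = rng.standard_normal((d_out, d_in)) * 0.1

    def forward(self, x, width_ratio=1.0):
        d_out = max(1, int(self.W.shape[0] * width_ratio))
        d_in = x.shape[-1]
        # A sub-network at this width uses only the top-left block of W.
        return x @ self.W[:d_out, :d_in].T

def sample_subnet_config(rng, depths=(2, 3, 4), widths=(0.25, 0.5, 1.0)):
    """Sample one sub-network (depth plus per-layer width ratio) from a
    toy search space; the dimension choices here are assumptions."""
    depth = int(rng.choice(depths))
    return {"depth": depth,
            "widths": [float(rng.choice(widths)) for _ in range(depth)]}

layer = SharedLinear(8, 8)
x = np.ones((1, 8))
full = layer.forward(x, 1.0)   # all 8 output units
half = layer.forward(x, 0.5)   # first 4 rows of the same shared W
cfg = sample_subnet_config(rng)
```

Because the half-width sub-layer reuses the first rows of the shared matrix, updating it during local training also updates the corresponding slice of the full supernet, which is what makes averaging such updates across clients resemble FedAvg aggregation.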



Figure 1: (a) Standard FL framework, (b) supernet training in a standard datacenter optimization (i.e., centralized settings), (c) our proposed federation of supernet training framework (FedSup), and (d) the efficient FedSup algorithm (E-FedSup).

Available local resources vary greatly depending on the specification of devices (Yu et al., 2018). From this perspective, this can become a significant bottleneck for aggregation rounds in synchronous FL training if the same-sized model is distributed to all clients without considering local resources (Li et al., 2020).
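To make the bandwidth argument concrete, the sketch below shows how a server might extract only a nested sub-model's tensors for transmission to a low-resource client, instead of broadcasting the full supernet. The dictionary layout, the uniform width scaling of both dimensions, and the layer names are illustrative assumptions, not E-FedSup's actual protocol.

```python
import numpy as np

def extract_submodel(supernet_weights, width_ratio):
    """Slice out the nested sub-model a client will receive; only these
    smaller tensors need to be transmitted (illustrative sketch)."""
    sub = {}
    for name, W in supernet_weights.items():
        d_out = max(1, int(W.shape[0] * width_ratio))
        d_in = max(1, int(W.shape[1] * width_ratio))
        sub[name] = W[:d_out, :d_in].copy()
    return sub

def payload_size(weights):
    """Number of parameters sent over the network."""
    return sum(W.size for W in weights.values())

rng = np.random.default_rng(0)
supernet = {"fc1": rng.standard_normal((64, 64)),
            "fc2": rng.standard_normal((64, 64))}
small = extract_submodel(supernet, 0.25)  # for a low-resource client
```

Under this toy scaling, a 0.25-width sub-model shrinks each weight matrix in both dimensions, so the broadcast payload drops to 1/16 of the full supernet, which is the kind of saving that motivates sending resource-matched sub-models rather than one uniform model.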

