SUPERNET TRAINING FOR FEDERATED IMAGE CLASSIFICATION UNDER SYSTEM HETEROGENEITY

Abstract

Efficiently deploying deep neural networks across many devices with diverse resource constraints, particularly edge devices, is one of the most challenging problems when data privacy must also be preserved. Conventional approaches have evolved either to improve a single global model while keeping heterogeneous local training data decentralized (i.e., data heterogeneity; Federated Learning (FL)) or to train an overarching network that supports diverse architectural settings to address systems with different computational capabilities (i.e., system heterogeneity; Neural Architecture Search). However, few studies have considered both directions simultaneously. This paper proposes the federation of supernet training (FedSup) framework to address both scenarios simultaneously: clients send and receive a supernet from which all possible sub-model architectures can be sampled. The approach is inspired by the observation that averaging parameters during model aggregation in FL is similar to weight sharing in supernet training. Thus, the proposed FedSup framework combines the weight-sharing approach widely used for training single-shot models with FL averaging (FedAvg). Furthermore, we develop an efficient algorithm (E-FedSup) that sends only a sub-model to each client during the broadcast stage to reduce communication costs and training overhead, together with several strategies to enhance supernet training in the FL environment. We verify the proposed approach with extensive empirical evaluations. The resulting framework is also shown to be robust to both data and model heterogeneity on several standard benchmarks.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable empirical success in many machine learning applications. This has led to increasing demand for training models using local data from mobile devices and the Internet of Things (IoT), because billions of local machines worldwide can provide more computational power and data than central server systems (Lim et al., 2020; El-Sayed et al., 2018). However, it remains arduous to deploy such models efficiently on diverse hardware platforms with significantly different specifications (e.g., latency, TPU) (Cai et al., 2019) and to train a global model without sharing local data. Federated learning (FL) has become a popular paradigm for collaborative machine learning (Li et al., 2019; 2018; Karimireddy et al., 2019; Mohri et al., 2019; Lin et al., 2020; Acar et al., 2021). In the FL framework, each client (e.g., a mobile device or an entire organization) individually updates its local model using its private data, the central server (e.g., a service manager) then updates the global model by aggregating all local updates, and the process repeats until convergence. Most notably, federated averaging (FedAvg) (McMahan et al., 2017) aggregates the locally learned client models by averaging, which helps avoid systematic privacy leakage (Voigt & Von dem Bussche, 2017). Despite the popularity of FL, the resulting models suffer from data heterogeneity because locally generated data is not identically distributed.
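The FedAvg aggregation step described above can be sketched as a data-size-weighted average of client parameters. The following is a minimal illustration, not the paper's implementation; the function and parameter names are hypothetical, and real implementations operate on full neural-network weight tensors rather than flat dictionaries:

```python
from typing import Dict, List

def fedavg_aggregate(client_weights: List[Dict[str, float]],
                     client_sizes: List[int]) -> Dict[str, float]:
    """Aggregate client models into a global model by weighted averaging.

    Each client's contribution is weighted by its local dataset size,
    as in FedAvg (McMahan et al., 2017).
    """
    total = sum(client_sizes)
    global_weights = {}
    for key in client_weights[0]:
        global_weights[key] = sum(
            w[key] * (n / total)
            for w, n in zip(client_weights, client_sizes)
        )
    return global_weights

# Two clients with equal data sizes: the global parameter is the mean.
merged = fedavg_aggregate([{"w": 1.0}, {"w": 3.0}], [100, 100])
```

In one communication round, the server broadcasts the global model, each client runs local SGD on its private data, and the server calls an aggregation step like this on the returned parameters.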
To tackle data heterogeneity, most FL studies have considered new objective functions for aggregating local models (Acar et al., 2021; Wang et al., 2020; Yuan & Ma, 2020; Li et al., 2021a), using auxiliary data on the central server (Lin et al., 2020; Zhang et al., 2022), encoding the weights for an efficient communication stage (Wu et al., 2022; Hyeon-Woo et al., 2022; Xu et al., 2021), or recruiting helpful clients to obtain more accurate global models (Li et al., 2019; Cho et al., 2020; Nishio & Yonetani, 2019). Recently, there has also been tremendous interest in deploying FL algorithms in real-world applications such as mobile devices and IoT (Diao et al., 2021; Horvath et al., 2021; Hyeon-Woo et al., 2022). However, significant issues remain in delivering compact models specialized for edge devices with widely diverse hardware platforms and efficiency constraints (Figure 1 (a)). It is notorious that the inference time of a neural network

