FEDORAS: FEDERATED ARCHITECTURE SEARCH UNDER SYSTEM HETEROGENEITY

Abstract

Federated learning (FL) has recently gained considerable attention due to its ability to learn on decentralised data while preserving client privacy. However, it also poses additional challenges related to the heterogeneity of the participating devices, both in terms of their computational capabilities and contributed data. Meanwhile, Neural Architecture Search (NAS) has been successfully used with centralised datasets, producing state-of-the-art results in constrained or unconstrained settings. However, such centralised datasets may not always be available for training. Most recent work at the intersection of NAS and FL attempts to alleviate this issue in a cross-silo federated setting, which assumes homogeneous compute environments with datacenter-grade hardware. In this paper we explore the question of whether we can design architectures of different footprints in a cross-device federated setting, where the device landscape, availability and scale are very different. To this end, we design our system, FedorAS, to discover and train promising architectures in a resource-aware manner when dealing with devices of varying capabilities holding non-IID distributed data. We present empirical evidence of its effectiveness across different settings, spanning three different modalities (vision, speech, text), and showcase its superior performance compared to state-of-the-art federated solutions, while maintaining resource efficiency.

1. INTRODUCTION

As smart devices become omnipresent where we live, work and socialise, the ML-powered services they provide grow in sophistication. This ambient intelligence has undoubtedly been sustained by recent advances in Deep Learning (DL) across a multitude of tasks and modalities. Parallel to this race for state-of-the-art performance on various DL benchmarks, mobile and embedded devices have also become more capable of accommodating new Deep Neural Network (DNN) designs [37], some even integrating specialised accelerators (e.g. NPUs) into their Systems-on-Chip (SoCs) to efficiently run DL workloads [3]. These devices come in various configurations in terms of their compute/memory capabilities and power envelopes [4] and co-exist in the wild as a rich multi-generational ecosystem (system heterogeneity) [79]. They acquire intelligence through users' interactions, which are themselves innately heterogeneous, leading to non-independent and identically distributed (non-IID) data in the wild (data heterogeneity).

Powered by the recent advances in SoC capabilities and motivated by privacy concerns [74] over the custody of data, Federated Learning (FL) [58] has emerged as a way of training on-device on user data without it ever directly leaving the device. However, FL training has largely focused on the weights of a static global model architecture, assumed to be runnable by every participating client [40]. Not only may this assumption not hold, but it can also lead to subpar performance of the overall training process in the presence of stragglers, or to bias when certain low-powered devices are consistently dropped. At the opposite end, more capable devices might not fully exploit their data if the deployed model is of reduced capacity so that all devices can participate [52].

Parallel to these trends, Neural Architecture Search (NAS) has become the de facto mechanism to automate the design of DNNs that can meet the requirements (e.g.
latency, model size) for these to run on resource-constrained devices. The success of NAS can be partly attributed to the fact that these frameworks are commonly run in datacenters, where high-performing hardware and/or large curated datasets [43, 23, 20, 39, 62] are available. However, this also imposes two major limitations on current NAS approaches: i) privacy, i.e. these methods were often not designed to work when users' data must remain on-device; and, consequently, ii) tail-data non-discoverability, i.e. they might never be exposed to infrequent or time/user-specific data that exist in the wild but not necessarily in centralised datasets. On top of this, the whole cost is borne by the provider, and separate on-device modelling/profiling needs to be done in the case of hardware-aware NAS [26, 73, 45], which has hitherto mainly focused on inference performance.

Motivated by the aforementioned phenomena and the limitations of existing NAS methods, we propose FedorAS, a system that performs NAS over heterogeneous devices holding heterogeneous data in a resource-aware and federated manner. To this end, we cluster clients into tiers based on their capabilities and design a supernet comprising operations that cover the whole spectrum of compute complexities. This supernet acts both as a search space and a weight-sharing backbone. Upon federation, it is only partially and stochastically shared with clients, respecting their computational and bandwidth capabilities. In turn, we leverage resource-aware one-shot path sampling [28] and adapt it to facilitate lightweight on-device NAS. In this way, networks in a given search space are not only deployed in a resource-aware manner, but also trained as such, by tuning the downstream communication (i.e. the subspace explored by each client) and computation (i.e. FLOPs of sampled paths) to meet each device's training budget.
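To make the idea of resource-aware single-path sampling concrete, the following is a minimal, hypothetical sketch: each supernet layer offers candidate operations with known FLOP costs, and a client rejection-samples one operation per layer until the resulting path fits its training budget. The search space, operation names and costs below are illustrative placeholders, not the actual FedorAS search space.

```python
import random

# Illustrative search space: one dict of {operation: cost in MFLOPs} per
# supernet layer. The names and numbers are made up for this sketch.
SEARCH_SPACE = [
    {"identity": 0, "conv3x3": 120, "conv5x5": 200},  # layer 0 choices
    {"identity": 0, "conv3x3": 120, "mbconv": 80},    # layer 1 choices
    {"identity": 0, "conv3x3": 120, "conv5x5": 200},  # layer 2 choices
]

def sample_path(flops_budget, max_tries=1000, rng=random):
    """Rejection-sample a single path whose total cost fits the budget."""
    for _ in range(max_tries):
        # Pick one operation per layer uniformly at random.
        path = [rng.choice(list(layer)) for layer in SEARCH_SPACE]
        cost = sum(layer[op] for layer, op in zip(SEARCH_SPACE, path))
        if cost <= flops_budget:
            return path, cost
    raise RuntimeError("no path found within budget")

path, cost = sample_path(flops_budget=300)
```

A lower-tier device would simply call `sample_path` with a smaller budget, so its training compute per round is bounded by construction rather than by dropping the client.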
Once federated training of the supernetwork has completed, usable pretrained networks can be extracted even before fine-tuning or personalising per device, thus minimising the number of extra on-device training rounds needed to achieve competitive performance. In summary, in this work we make the following contributions:

• We propose a system for resource-efficient federated NAS that can be applied in cross-device settings, where partial device participation, device and data heterogeneity are innate characteristics.

• We implement a system called FedorAS (Federated nAS) that leverages a server-resident supernet enabling weight sharing for efficient knowledge exchange between clients, without directly sharing common model architectures with one another.

• We propose a novel aggregation method named OPA (OPerator Aggregation) for weighting updates from multiple "single-path one-shot" client updates in a frequency-aware manner.

• We empirically evaluate the performance and convergence of our system under IID and non-IID settings across different datasets, tasks and modalities, spanning different device distributions, and compare our system's behaviour against state-of-the-art FL techniques.
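Frequency-aware aggregation of the kind OPA performs can be sketched as follows. This is an assumption-laden illustration, not the actual OPA implementation: each client is assumed to report, per operator, its updated weights together with how many times that operator appeared in its sampled paths, and the server averages each operator's weights using those counts, so operators sampled rarely are not drowned out by untouched copies.

```python
from collections import defaultdict

def opa_aggregate(client_updates):
    """Frequency-weighted per-operator averaging (illustrative sketch).

    client_updates: list of dicts mapping operator name ->
    (weights as list[float], number of times the op was sampled locally).
    """
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for update in client_updates:
        for op, (weights, n) in update.items():
            if n == 0:
                continue  # this client never trained the operator
            scaled = [w * n for w in weights]  # weight update by sampling count
            if sums[op] is None:
                sums[op] = scaled
            else:
                sums[op] = [a + b for a, b in zip(sums[op], scaled)]
            counts[op] += n
    # Normalise each operator by its total sampling frequency across clients.
    return {op: [s / counts[op] for s in sums[op]] for op in sums}

agg = opa_aggregate([
    {"conv3x3": ([1.0, 2.0], 3)},
    {"conv3x3": ([3.0, 4.0], 1), "conv5x5": ([0.5, 0.5], 2)},
])
# conv3x3 is averaged with weights 3:1 in favour of the first client.
```

Contrast this with plain FedAvg, which weights whole-model updates by dataset size only; here each operator has its own effective weighting, reflecting how often the single-path sampler actually trained it.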

2. BACKGROUND & MOTIVATION

Federated Learning. A typical FL pipeline is comprised of three distinct stages: given a global model initialised on a central server, $\omega_g$, i) the server randomly samples $k$ clients out of the available $K$ ($k \ll K$ for the cross-device setting [10]; $k = K$ for the cross-silo setting) and sends them the current state of the global model; ii) those $k$ clients perform training on-device using their own data partition, $D_i$, for a number of epochs and send the updated models, $\omega_i^{(t)}$, to the server after local training is completed; finally, iii) the server aggregates these models and a new global model, $\omega_g^{(t+1)}$, is obtained. This aggregation can be implemented in different ways [58, 70, 53]. For example, in FedAvg [58] each update is weighted by the relative amount of data on each client: $\omega_g^{(t+1)} = \sum_{i=0}^{k} \frac{|D_i|}{\sum_{j=0}^{k} |D_j|} \, \omega_i^{(t)}$. Stages ii) and iii) repeat until convergence. The quality of the global model $\omega_g$ can be assessed: on the global test set; by evaluating the fit of $\omega_g$ to each participating client's data ($D_i$) and deriving fairness metrics [54]; or by evaluating the adaptability of $\omega_g$ to each client's data, or to new data these might generate over time, which is commonly referred to as personalised FL [27, 51]. Contrary to traditional distributed learning, cross-device FL performs the bulk of the compute on a highly heterogeneous [40] set of devices in terms of their compute capabilities, availability and data distribution. In such scenarios, a trade-off between model capacity and client participation arises: larger architectures might result in more accurate models which may only be trained on a fraction of the available devices; on the other hand, deploying smaller-footprint networks could target more devices (and thus more data) for training, but these might be of inferior quality (gap in Fig. 1).

Neural Architecture Search. NAS is usually defined as a bi-level optimisation problem:

$$a^\star = \arg\min_{a \in \mathcal{A}} \mathcal{L}\big(\omega_a^*(D_t), D_v\big), \quad \text{where} \quad \omega_a^*(D_t) = \arg\min_{\omega_a} \mathcal{L}(\omega_a, D_t) \quad (1)$$
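As a concrete instance of the FedAvg aggregation rule described earlier in this section, the following minimal sketch weights each client update $\omega_i^{(t)}$ by its share of the total data, $|D_i| / \sum_j |D_j|$. Model weights are plain lists of floats purely for illustration; a real implementation would operate on full parameter tensors.

```python
def fedavg(client_weights, client_data_sizes):
    """Weighted average of client models, a la FedAvg (illustrative sketch).

    client_weights: list of per-client weight vectors (list[float] each).
    client_data_sizes: list of |D_i|, the local dataset size of each client.
    """
    total = sum(client_data_sizes)
    dim = len(client_weights[0])
    agg = [0.0] * dim
    for w, n in zip(client_weights, client_data_sizes):
        for d in range(dim):
            agg[d] += (n / total) * w[d]  # each client contributes |D_i|/sum|D_j|
    return agg

# Two clients: one holding 30 samples, one holding 10, so their updates
# are mixed 3:1.
global_w = fedavg([[1.0, 1.0], [5.0, 3.0]], [30, 10])
# -> [2.0, 1.5]
```

Note how a client with more data dominates the average; this is exactly why, in the cross-device trade-off above, consistently excluding low-powered devices skews the global model toward the data of the remaining participants.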

