FEDMES: SPEEDING UP FEDERATED LEARNING WITH MULTIPLE EDGE SERVERS

Abstract

We consider federated learning with multiple wireless edge servers, each with its own local coverage. We focus on speeding up training in this increasingly practical setup. Our key idea is to utilize the devices located in the overlapping areas between the coverages of edge servers: in the model-downloading stage, a device in an overlapping area receives multiple models from different edge servers, averages the received models, and then updates the averaged model with its local data. Each such device sends its updated model to multiple edge servers by broadcasting, thereby acting as a bridge that shares trained models between servers. Even when some edge servers are given biased datasets within their coverages, their training processes can be assisted by adjacent servers through the devices in the overlapping regions. As a result, the proposed scheme does not require costly communication with the central cloud server (located at a higher tier than the edge servers) for model synchronization, significantly reducing the overall training time compared to conventional cloud-based federated learning systems. Extensive experimental results show remarkable performance gains of our scheme over existing methods.
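The per-device procedure described above (average the models received from multiple edge servers, then perform a local update before broadcasting back) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the toy quadratic loss are assumptions introduced for the example.

```python
import numpy as np

def average_models(models):
    """Element-wise average of the parameter vectors a device in an
    overlapping area received from multiple edge servers."""
    return np.mean(models, axis=0)

def local_update(model, grad_fn, lr=0.1, steps=1):
    """A few steps of gradient descent on the device's local data
    (represented here by an abstract gradient function)."""
    for _ in range(steps):
        model = model - lr * grad_fn(model)
    return model

# Toy example: models downloaded from two edge servers, and the
# gradient of a quadratic local loss ||w - w_star||^2.
w_server_a = np.array([1.0, 2.0])
w_server_b = np.array([3.0, 0.0])
grad = lambda w: 2.0 * (w - np.array([2.0, 1.0]))

w = average_models([w_server_a, w_server_b])  # -> [2.0, 1.0]
w = local_update(w, grad)
# The device would then broadcast w back to both edge servers,
# letting adjacent servers share trained models without the cloud.
```

The averaging step is what lets an edge server with a biased local dataset benefit from its neighbors: the broadcast update carries information from both servers' coverages.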

1. INTRODUCTION

With the explosive growth in the number of smartphones, wearable devices, and Internet of Things (IoT) sensors, a large portion of data generated nowadays is collected outside the cloud, especially at distributed end-devices at the edge. Federated learning (McMahan et al., 2017; Konecny et al., 2016a;b; Bonawitz et al., 2019; Li et al., 2019a) is a recent paradigm for this setup, which enables training a machine learning model in a distributed network while largely resolving the privacy concerns of individual devices. However, training requires repeated downloading and uploading of models between the parameter server (PS) and the devices, presenting significant challenges in terms of 1) the communication bottleneck at the PS and 2) the non-IID (not independent and identically distributed) data across devices (Zhao et al., 2018; Sattler et al., 2019; Li et al., 2019b; Reisizadeh et al., 2019; Jeong et al., 2018). In federated learning, the PS can be located at the cloud or at the edge (e.g., at small base stations). Most current studies on federated learning consider the former, assuming that millions of devices are within the coverage of a PS at the cloud; at every global round, the devices in the system must communicate with this cloud PS to download and upload models. However, an inherent limitation of this cloud-based system is the long distance between the devices and the cloud server, which causes significant propagation delay during the model downloading/uploading stages of federated learning (Mao et al., 2017; Nguyen et al., 2019). Specifically, it is reported in (Mao et al., 2017) that the supportable latency (for inference) of cloud-based systems is larger than 100 milliseconds, while edge-based systems have supportable latencies below tens of milliseconds. This large delay between the cloud and the devices directly increases the training time of cloud-based federated learning systems.
In order to support latency-sensitive applications (e.g., smart cars) or emergency events (e.g., disaster response by drones) with federated learning, an edge-based system is essential. An issue, however, is that although an edge-based federated learning system can considerably reduce the latency between the PS and the devices, the coverage of an edge server is generally limited in practical systems (e.g., wireless cellular networks); there are often too few devices within the coverage of a single edge server to train a global model with sufficient accuracy. Accordingly, the

