MODERATED ASYNCHRONOUS FEDERATED LEARNING ON HETEROGENEOUS MOBILE DEVICES WITH NON-IID DATA

Abstract

Federated learning allows multiple clients to jointly learn an ML model while keeping their data private. Synchronous federated learning (Sync-FL) requires the devices to share local gradients in lockstep, which provides stronger guarantees but suffers from stragglers that slow the entire training process. Conventional techniques drop the updates from stragglers entirely and thereby lose the opportunity to learn from the data the stragglers hold, which is especially costly in a non-IID setting. Asynchronous federated learning (Async-FL) offers a potential solution by allowing the clients to proceed at their own pace, and typically achieves faster convergence. We target video action recognition on edge devices as an exemplar heavyweight task for a realistic Async-FL edge setup. Our FL system, KUIPER, leverages Async-FL to learn a heavy model for video action recognition on a heterogeneous edge testbed with non-IID data. KUIPER introduces a novel aggregation scheme that addresses the straggler problem while accounting for the differing client data in a non-IID setting. Although the proposed aggregation technique is designed primarily for video action recognition, it is task-independent and scalable, which we demonstrate through experiments on other vision and NLP tasks. KUIPER shows 11% faster convergence than Oort [OSDI-21], up to 12% and 9% higher test accuracy than FedBuff and Oort [OSDI-21], respectively, on HMDB51, and up to 10% and 9% on UCF101.

1. INTRODUCTION

Federated learning McMahan et al. (2017) has gained great popularity in recent times as it allows heterogeneous clients to collaborate and benefit from peer data while keeping their own data private. As a result, the clients learn a better model through collaboration than they would individually. The training process is orchestrated by a central server that broadcasts the global model to the clients, while the clients run local training on their own data and share only their gradient updates with the server. This has made it possible for clients with limited computational resources to participate in the learning process. However, heterogeneous clients with varying computational capabilities (we use the term "computational capabilities" as a shorthand for heterogeneity in both the compute on a node and the communication link connecting the node to the federation server), if forced to synchronize, make the process progress at the speed of the slowest client Li et al. (2020a). For example, in our experimental setup of embedded nodes with mobile GPUs, the Jetson Nano is 5× slower than the Jetson AGX Xavier; variation in network speeds adds to this heterogeneity. It becomes crucial to incorporate even slow clients when the data distribution among clients is non-IID, as every client then has distinctive elements to contribute to the learned model. In this paper, we target a heavyweight learning task, namely video action recognition, that to date had been considered out of the reach of embedded devices, i.e., mobile GPUs. The straggler problem becomes particularly serious for heavyweight learning tasks on heterogeneous edge devices, since the devices are resource constrained relative to the demands of the task and the variance in device capabilities (processing power, memory, storage) is large (5× in our representative setup). Therefore, an obvious way to deal with stragglers is to drop their updates, as synchronous schemes with reporting deadlines do.
However, dropping stragglers prevents the global model from learning features specific to the stragglers' local data, leading to a model that underfits. This problem becomes more acute as the degree of non-IIDness increases.
Our proposed solution, KUIPER: We propose KUIPER to solve the above problems of heterogeneous clients with resource constraints and non-IID data; Figure 1 shows an overview. We consider the typical FL case with non-IID data, where a client may not have training data for all the classes but wants a global model that works on all of them (i.e., it learns from its peers). Our solution is based on the idea of scaling stale updates before aggregation according to the staleness of the updates and the clients' training error in the current iteration. Training error is a measure of how much progress the local model has made in learning from its own data. This scaling ensures that the global model is not starved of the information that could be learned from the stragglers' data. Our scaling policy is designed to maintain high model quality while still incorporating relatively outdated updates when they improve the global model. Further, we find that a purely asynchronous solution does not work well due to the wide diversity in the rates of client updates. We therefore batch the updates from a group of clients, quantified by the batch size K, before aggregation; this makes KUIPER a buffered asynchronous approach. Our contributions can be summarized as follows:
1. We propose a novel scheme to include heterogeneous clients in federated learning by balancing the utility of their data with their computational (and communication) efficiency.
2. We demonstrate our heterogeneous FL technique through video action recognition, which is a computationally heavy task and can be accomplished on resource-constrained edge devices only through the use of FL.
In our setting, this task is particularly challenging due to device heterogeneity, network heterogeneity, and non-IID data.
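To make the scaling idea concrete, the sketch below shows one burst-aggregation step in the spirit described above. The specific weighting functions are illustrative assumptions, not KUIPER's exact formulas: within a burst, each update is weighed by the product of its local data size and training accuracy, and the burst as a whole is damped by a simple power-law decay in the average staleness.

```python
import numpy as np

def burst_aggregate(global_model, updates, t, tau, alpha=0.5):
    """One buffered-async aggregation step (illustrative sketch only).

    updates: list of (delta, n_samples, train_acc, client_id) tuples
    t:       index of the current aggregation round
    tau:     dict client_id -> round at which that client last pulled the model
    alpha:   staleness-decay exponent (assumed hyperparameter)
    """
    # Within the burst: weigh each update by local data size x training
    # accuracy, so clients that learned more from more data count more.
    raw = [n * acc for (_, n, acc, _) in updates]
    w = np.array(raw) / sum(raw)
    burst_delta = sum(wi * d for wi, (d, _, _, _) in zip(w, updates))

    # Across the burst: damp by the average staleness t - tau_i
    # (a 1/(1 + s)^alpha decay, assumed here for illustration).
    s = np.mean([t - tau[cid] for (_, _, _, cid) in updates])
    scale = 1.0 / (1.0 + s) ** alpha

    new_model = global_model + scale * burst_delta
    # Record that the clients in this burst now hold the round-t model.
    for (_, _, _, cid) in updates:
        tau[cid] = t
    return new_model
```

Under this sketch, fresh bursts (low average staleness) are applied nearly at full strength, while bursts dominated by long-delayed updates are progressively attenuated rather than discarded.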



There are two recent promising solutions to this problem, Oort Lai et al. (2021) and FedBuff Nguyen et al. (2022); we discuss why they fall short and compare them empirically to our solution. The Kuiper belt is a band of small celestial bodies beyond the orbit of Neptune from which many short-period comets are believed to originate; similarly, we make small devices coalesce to achieve big tasks. Aspects of this design are shared with FedBuff Nguyen et al. (2022); we explain the differences in Section 2 and empirically demonstrate KUIPER's superiority (Section 5).



Figure 1: Overview of a working example of KUIPER in action for 5 heterogeneous clients with burst size K = 3. A circle denotes that a client is ready with its updates. A dashed vertical line denotes an aggregation step, where we also update τi for the clients aggregated in the burst. The aggregator waits for 3 clients to respond, which comprise a burst, denoted by identically colored circles. Within the burst, the individual client updates are weighed by a function of their local data size and training accuracy. The burst as a whole is then weighed by the average staleness (t - τi) of the clients comprising it, and the global model is updated. The updated model is sent back to all the clients in that same burst, and the process repeats.

Again, for a distributed edge-device scenario, high degrees of non-IIDness are commonly seen Zhao et al. (2018); Chen et al. (2020b). We empirically observe the severe negative consequence of discarding stragglers on the learning accuracy (Figure 7(d)). This motivates the use of asynchronous aggregation, which allows the central server to aggregate the clients' gradient updates as soon as they are made available, without having to wait for all the clients to respond. However, it has remained an open problem how best to aggregate the updates sent by all clients so as to maximize the information learned while minimizing any adverse effect from slow updates.
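The buffered-asynchronous flow from Figure 1 can be summarized in a minimal server-side skeleton. This is a sketch, not KUIPER's implementation: it assumes clients push their deltas into a thread-safe queue at their own pace, and it uses a plain mean of the K deltas as a placeholder for KUIPER's weighted aggregation.

```python
import queue

import numpy as np

def serve(update_queue, model, K, total_bursts):
    """Buffered-async server loop (illustrative skeleton).

    Clients push (client_id, delta) into `update_queue` whenever they
    finish local training; the server applies one aggregation step per
    K arrivals, without ever waiting for the full client population.
    """
    t = 0
    tau = {}  # client_id -> round at which that client last received the model
    while t < total_bursts:
        burst = [update_queue.get() for _ in range(K)]  # block until K are ready
        t += 1
        # Placeholder aggregation: plain mean of the K deltas. KUIPER
        # additionally weighs by data size, training accuracy, and staleness.
        model = model + np.mean([d for (_, d) in burst], axis=0)
        for cid, _ in burst:
            tau[cid] = t  # clients in this burst now hold the round-t model
    return model, tau
```

In a real deployment the updated model would be sent back only to the clients in the just-aggregated burst, while the remaining clients keep training on their (now stale) copies, which is exactly what makes the staleness term τi necessary.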

