QUAFL: FEDERATED AVERAGING MADE ASYNCHRONOUS AND COMMUNICATION-EFFICIENT

Abstract

Federated Learning (FL) is an emerging paradigm for large-scale distributed training of machine learning models, in which individual nodes keep their training data local. In this work, we take steps towards addressing two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients. Specifically, we present a new variant of the classic federated averaging (FedAvg) algorithm, which supports both asynchronous communication and communication compression. We provide a new analysis technique showing that, in spite of these system relaxations, our algorithm can provide convergence similar to FedAvg in some parameter regimes. On the experimental side, we show that our algorithm ensures fast convergence on standard federated tasks.

1. INTRODUCTION

Federated learning (FL) (Konečnỳ et al., 2016; McMahan et al., 2017) is a paradigm for large-scale distributed learning, in which multiple clients, orchestrated by a central authority, cooperate to jointly optimize a machine learning model given their local data. The key promise is to enable joint training over distributed client data, often located on end devices which are computationally- and communication-limited, without the data leaving the client device. The basic optimization algorithm underlying the learning process is known as federated averaging (FedAvg) (McMahan et al., 2017), and works roughly as follows: the central authority periodically communicates a shared model to all clients; the clients then optimize this model locally based on their data and communicate the resulting models back to the central authority, which incorporates them, often via some form of averaging, and then initiates the next iteration. This algorithmic blueprint has been shown to be effective in practice (Li et al., 2020), and has also motivated a rich line of research analyzing its convergence properties (Stich, 2018; Haddadpour & Mahdavi, 2019), as well as proposing improved variants (Reddi et al., 2020; Karimireddy et al., 2020; Li & Richtárik, 2021).

Scaling federated learning runs into a number of practical challenges (Kairouz et al., 2021). One natural bottleneck is synchronization between the server and the clients: as practical deployments may contain thousands of nodes, it is infeasible for the central server to orchestrate synchronous rounds among all participants. A simple mitigating approach is node sampling, e.g. (Smith et al., 2017; Bonawitz et al., 2019); a more general one is asynchronous communication, e.g. (Wu et al., 2020; Nguyen et al., 2022b), by which the server and the nodes may work with inconsistent versions of the shared model.
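To make the FedAvg blueprint concrete, the following is a minimal sketch of one synchronous round on a one-dimensional quadratic toy objective; the function names (local_sgd, fedavg_round) and the toy loss are ours for illustration, not the paper's.

```python
def local_sgd(model, target, lr=0.1, steps=5):
    """Client side: a few SGD steps on the toy local loss (w - target)^2."""
    w = model
    for _ in range(steps):
        grad = 2.0 * (w - target)  # d/dw (w - target)^2
        w -= lr * grad
    return w

def fedavg_round(server_model, client_targets):
    """One synchronous FedAvg round: broadcast, local training, average."""
    client_models = [local_sgd(server_model, t) for t in client_targets]
    return sum(client_models) / len(client_models)

# Toy run: heterogeneous clients pull the shared model toward the mean target.
w = 0.0
targets = [1.0, 2.0, 3.0]
for _ in range(50):
    w = fedavg_round(w, targets)
print(round(w, 3))  # converges to 2.0, the average of the client targets
```

Even this toy example exhibits the synchronization cost discussed below: the server must wait for every client to finish its local steps before averaging.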
An orthogonal scalability barrier is the high communication cost of transmitting parameter updates (Kairouz et al., 2021), which may overwhelm communication-limited clients. Several communication-compression approaches have been proposed to address this (Jin et al., 2020; Jhunjhunwala et al., 2021; Li & Richtárik, 2021; Wang et al., 2022). It is reasonable to assume that both of these bottlenecks must be mitigated in practice: for instance, communication reduction may not be as effective if the server has to wait for each of the clients to complete its local steps on a version of the model; yet, most references on compressed communication assume synchrony. At the same time, removing synchrony completely may lead to divergence, given that local data is usually heterogeneous. Thus, it is interesting to ask whether asynchrony, communication compression, and heterogeneous local data can be jointly supported.

Contribution. In this paper, we address this question by proposing an algorithm for Quantized Asynchronous Federated Learning, called QuAFL, which is an extension of FedAvg specifically adapted to support both asynchronous communication and communication compression. We provide a theoretical analysis of the algorithm's convergence under compressed and asynchronous communication, as well as experimental results on up to 300 nodes showing that it can also lead to practical performance gains.
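As a concrete illustration of the kind of communication compression considered here, the sketch below implements a standard unbiased stochastic quantizer in the style of QSGD; this is a generic example of the technique, not necessarily the quantization operator used by QuAFL, and the function name is ours.

```python
import random

def stochastic_quantize(values, levels=4):
    """QSGD-style unbiased stochastic quantization: each coordinate is
    randomly rounded to one of `levels` uniform levels in [-scale, scale],
    so that the quantized value equals the input in expectation."""
    scale = max(abs(v) for v in values) or 1.0
    out = []
    for v in values:
        x = abs(v) / scale * levels           # position among the levels
        low = int(x)                          # index of the lower level
        p = x - low                           # probability of rounding up
        q = low + (1 if random.random() < p else 0)
        out.append((scale * q / levels) * (1 if v >= 0 else -1))
    return out

random.seed(0)
v = [0.3, -0.7, 1.0]
# Unbiasedness: averaging many independent quantizations recovers the input.
avg = [sum(col) / 2000
       for col in zip(*(stochastic_quantize(v) for _ in range(2000)))]
print([round(a, 1) for a in avg])  # close to the original vector
```

Because only a level index per coordinate (plus one scale) needs to be transmitted, such a scheme can substantially reduce per-round communication at the cost of added variance.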

