QUAFL: FEDERATED AVERAGING MADE ASYNCHRONOUS AND COMMUNICATION-EFFICIENT

Abstract

Federated Learning (FL) is an emerging paradigm to enable the large-scale distributed training of machine learning models, while still allowing individual nodes to maintain local data. In this work, we take steps towards addressing two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients. Specifically, we present a new variant of the classic federated averaging (FedAvg) algorithm, which supports both asynchronous communication and communication compression. We provide a new analysis technique showing that, in spite of these system relaxations, our algorithm can provide similar convergence to FedAvg in some parameter regimes. On the experimental side, we show that our algorithm ensures fast convergence for standard federated tasks.

1. INTRODUCTION

Federated learning (FL) (Konečnỳ et al., 2016; McMahan et al., 2017) is a paradigm for large-scale distributed learning, in which multiple clients, orchestrated by a central authority, cooperate to jointly optimize a machine learning model given their local data. The key promise is to enable joint training over distributed client data, often located on end devices which are computationally- and communication-limited, without the data leaving the client device. The basic optimization algorithm underlying the learning process is known as federated averaging (FedAvg) (McMahan et al., 2017), and works roughly as follows: a central authority periodically communicates a shared model to all clients; the clients optimize this model locally based on their data and communicate the resulting models back to the central authority, which incorporates them, often via some form of averaging, and then initiates the next iteration. This algorithmic blueprint has been shown to be effective in practice (Li et al., 2020), and has also motivated a rich line of research analyzing its convergence properties (Stich, 2018; Haddadpour & Mahdavi, 2019), as well as proposing improved variants (Reddi et al., 2020; Karimireddy et al., 2020; Li & Richtárik, 2021).

Scaling federated learning runs into a number of practical challenges (Kairouz et al., 2021). One natural bottleneck is synchronization between the server and the clients: as practical deployments may contain thousands of nodes, it is infeasible for the central server to orchestrate synchronous rounds among all participants. A simple mitigating approach is node sampling, e.g. (Smith et al., 2017; Bonawitz et al., 2019); another, more general one is asynchronous communication, e.g. (Wu et al., 2020; Nguyen et al., 2022b), by which the server and the nodes may work with inconsistent versions of the shared model.
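As a point of reference for what follows, the synchronous FedAvg round described above can be sketched in a few lines of Python. This is an illustrative toy, not the exact procedure of McMahan et al. (2017): the scalar least-squares loss, step sizes, and function names are ours.

```python
import random

def grad(w, data):
    # Gradient of a toy scalar least-squares loss (1/n) * sum (w*x - y)^2
    # over the client's local (x, y) pairs; stands in for the local objective.
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def local_update(w, data, lr=0.05, local_steps=5):
    # Client-side phase: a few gradient steps starting from the server model.
    for _ in range(local_steps):
        w -= lr * grad(w, data)
    return w

def fedavg_round(server_w, clients, sample_size):
    # One synchronous round: sample clients, broadcast the model, let each
    # client train locally, then average the returned models.
    chosen = random.sample(range(len(clients)), sample_size)
    return sum(local_update(server_w, clients[i]) for i in chosen) / sample_size
```

Note that the server blocks until every sampled client returns its full local update; removing exactly this synchronization point, while also compressing both directions of communication, is what the present work targets.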
An orthogonal scalability barrier is the high communication cost of transmitting parameter updates (Kairouz et al., 2021), which may overwhelm communication-limited clients. Several communication-compression approaches have been proposed to address this (Jin et al., 2020; Jhunjhunwala et al., 2021; Li & Richtárik, 2021; Wang et al., 2022). It is reasonable to assume that both bottlenecks would need to be mitigated in practice: for instance, communication reduction may not be as effective if the server still has to wait for each client to complete its local steps on a consistent version of the model; yet, most references on compressed communication assume synchrony. At the same time, removing synchrony completely may lead to divergence, given that local data is usually heterogeneous. It is therefore interesting to ask whether asynchrony, communication compression, and heterogeneous local data can be jointly supported.

Contribution. In this paper, we address this question by proposing an algorithm for Quantized Asynchronous Federated Learning, called QuAFL, which extends FedAvg to support both asynchronous communication and communication compression. We provide a theoretical analysis of the algorithm's convergence under compressed and asynchronous communication, and experimental results on up to 300 nodes showing that it can also lead to practical performance gains.

Overview. The main idea behind QuAFL is to allow clients to perform their local steps independently of the round structure implemented by the server, on a local, inconsistent version of the parameters, assuming a probabilistic scheduling model. Specifically, all clients receive a copy of the model when joining the computation, and start performing at most K ≥ 1 optimization steps on it based on their local data. Independently, in each "logical round," the server samples a set of s clients uniformly at random, and sends them a compressed copy of its current model.
Whenever receiving the server's message, clients immediately respond with a compressed version of their current model, which may still be in the middle of the local optimization process, and therefore may include neither the most recent server updates nor the totality of the K local optimization steps. In fact, we even allow that, with some probability, some contacted clients take no steps at all. Clients carefully integrate the received server model into their next local iteration, while the server does the same with the client models it receives.

The key missing piece concerns quantization. Directly applying standard compressors to transmitted updates (Alistarh et al., 2017b; Karimireddy et al., 2019) runs into the issue that the quantization error may be too large, as it is proportional to the norm of the (updated) model at the client. Resolving this analytically would require either an unrealistic second-moment bound on the maximum gradient update, e.g. (Chen et al., 2021), or variance-reduction techniques (Gorbunov et al., 2021), which may be complex in practice. We circumvent this issue differently, by leveraging a lattice-based quantizer (Davies et al., 2021), which has the property that the quantization error depends only on the difference between the quantized model and a carefully-chosen "reference point." We instantiate this technique for the first time in the federated setting.

Our analysis relies on a new potential argument, which shows that the discrepancy between the client and server models is always bounded. This bound serves both to control the "noise" introduced at different steps by model inconsistency, and to ensure that the local models remain consistent enough to allow correct encoding and decoding via lattice quantization. The technique is complex yet modular, and should enable the analysis of more complex algorithmic variants.
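The client-server interaction and the role of the reference point can be illustrated with the following simplified sketch. This is not the exact QuAFL update rule: the true averaging weights, step accounting, and the lattice construction of Davies et al. (2021) are more involved, and the plain coordinate-wise grid quantizer, the mixing weight `alpha`, and all names here are ours, chosen purely for illustration.

```python
import random

def lattice_quantize(x, ref, eps):
    # Round each coordinate of x onto the grid {ref_i + k * eps, k integer}.
    # In this toy version the per-coordinate error is at most eps / 2
    # regardless of ||x||; in the actual lattice scheme, it is the number of
    # bits transmitted that depends on the distance ||x - ref||, since nearby
    # reference points let the receiver decode from low-order bits alone.
    return [r + eps * round((xi - r) / eps) for xi, r in zip(x, ref)]

class Client:
    def __init__(self, model, target, lr=0.05, max_steps=5):
        self.model, self.target = list(model), list(target)
        self.lr, self.steps_left = lr, max_steps

    def local_step(self):
        # One of at most K local gradient steps, here on a toy quadratic
        # loss ||w - target||^2 standing in for the client's local objective.
        if self.steps_left > 0:
            self.model = [w - self.lr * 2 * (w - t)
                          for w, t in zip(self.model, self.target)]
            self.steps_left -= 1

    def on_server_message(self, q_server, alpha=0.5, eps=0.1):
        # Reply immediately with a quantized copy of the *current* local
        # model (possibly mid-optimization), then fold the server model in.
        reply = lattice_quantize(self.model, q_server, eps)
        self.model = [(1 - alpha) * w + alpha * s
                      for w, s in zip(self.model, q_server)]
        return reply

def server_round(server_model, clients, s, shared_ref, eps=0.1):
    # One "logical round": contact s clients chosen uniformly at random,
    # send a model quantized against a commonly held reference point, and
    # average the quantized replies into the server model.
    chosen = random.sample(clients, s)
    q_server = lattice_quantize(server_model, shared_ref, eps)
    replies = [c.on_server_message(q_server) for c in chosen]
    avg = [sum(col) / s for col in zip(*replies)]
    return [(w + a) / 2 for w, a in zip(server_model, avg)]
```

Crucially, the server never waits for a client to finish its K steps: whatever the client holds at the moment of contact is what gets averaged. The bounded-discrepancy argument mentioned above is what guarantees that these partially-trained, quantized models remain close enough for this averaging, and for the lattice decoding, to be sound.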
We validate our algorithm experimentally in the standard LEAF (Caldas et al., 2018) benchmarking environment, on a series of standard federated tasks. In practice, QuAFL can compress updates by more than 3× without significant loss of convergence, and can withstand a large constant fraction of "slow" clients submitting infrequent updates. Moreover, in a setting where client computation speeds are heterogeneous, QuAFL provides end-to-end speedup, since the server can progress without waiting for all clients to complete their local computation.

2. RELATED WORK

The federated averaging (FedAvg) algorithm was introduced by McMahan et al. (2017), and Stich (2018) was among the first to analyze its convergence rate in the homogeneous-data setting. Here, we investigate whether one can jointly eliminate two of the algorithm's main scalability bottlenecks, the synchrony between server and client iterations and the necessity of full-precision communication, while supporting heterogeneous data distributions. Due to space constraints, we focus on prior work which seeks to mitigate these two constraints in the context of FL.

There is significant research into communication compression for FedAvg (Philippenko & Dieuleveut, 2020; Reisizadeh et al., 2020; Jin et al., 2020; Haddadpour et al., 2021). However, virtually all of this work considers synchronous iterations. Reisizadeh et al. (2020) introduced FedPAQ, a variant of FedAvg which supports quantized communication via standard compressors and provides strong convergence bounds, under the strong assumption of i.i.d. client data. Jin et al. (2020) examine the viability of a variant of the signSGD quantizer (Seide et al., 2014; Karimireddy et al., 2019) in the context of FedAvg, providing convergence guarantees; however, their rate guarantees have a polynomial dependence on the model dimension d, rendering them less practically meaningful. Haddadpour et al. (2021) proposed FedCOM, a family of federated optimization algorithms with communication compression and corresponding convergence rates; yet we note that, in order to prove convergence in the challenging heterogeneous-data setting, this reference requires non-trivial technical assumptions on the quantized gradients (Haddadpour et al., 2021, Assumption 5). Chen et al. (2021) also considered update compression, but for convex losses, coupled with a rather strong second-moment bound assumption on the gradients. Finally, Jhunjhunwala et al. (2021) examine adapting the degree of compression during the execution, proving convergence bounds for their scheme under the non-standard i.i.d. data-sampling assumption.

