QUAFL: FEDERATED AVERAGING MADE ASYNCHRONOUS AND COMMUNICATION-EFFICIENT

Abstract

Federated Learning (FL) is an emerging paradigm to enable the large-scale distributed training of machine learning models, while still allowing individual nodes to maintain local data. In this work, we take steps towards addressing two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients. Specifically, we present a new variant of the classic federated averaging (FedAvg) algorithm, which supports both asynchronous communication and communication compression. We provide a new analysis technique showing that, in spite of these system relaxations, our algorithm can provide similar convergence to FedAvg in some parameter regimes. On the experimental side, we show that our algorithm ensures fast convergence for standard federated tasks.

Overview. The main idea behind QuAFL is that we allow clients to perform their local steps independently of the round structure implemented by the server, and on a local, inconsistent version of the parameters, assuming a probabilistic scheduling model. Specifically, all clients receive a copy of the model when joining the computation, and start performing at most K ≥ 1 optimization steps on it based on their local data. Independently, in each "logical round," the server samples a set of s clients uniformly at random, and sends them a compressed copy of its current model. Whenever receiving the server's message, clients immediately respond with a compressed version of their current model, which may still be in the middle of the local optimization process, and therefore may not include recent server updates, nor the totality of the K local optimization steps. In fact, we even allow that, with some probability, some contacted clients do not take any steps at all. 
Clients carefully integrate the received server model into their next local iteration, while the server does the same with the client models it receives.

The key missing piece regards quantization. Directly applying standard compressors on transmitted updates (Alistarh et al., 2017b; Karimireddy et al., 2019) runs into the issue that the quantization error may be too large, as it is proportional to the norm of the (updated) model at the client. Resolving this analytically would require either an unrealistic second-moment bound on the maximum gradient update, e.g. (Chen et al., 2021), or variance-reduction techniques (Gorbunov et al., 2021), which may be complex in practice. We circumvent this issue differently, by leveraging a lattice-based quantizer (Davies et al., 2021), which has the property that the quantization error only depends on the difference between the quantized model and a carefully-chosen "reference point." We instantiate this technique for the first time in the federated setting.

Our analysis technique relies on a new potential argument, which shows that the discrepancy between the client and server models is always bounded. This bound serves to control the "noise" at different steps due to model inconsistency, but also to ensure that the local models are consistent enough to allow correct encoding and decoding via lattice quantization. The technique is complex yet modular, and should allow further analysis of more complex algorithmic variants.

We validate our algorithm experimentally in the rigorous LEAF (Caldas et al., 2018) environment, on a series of standard tasks. Specifically, in practice, QuAFL can compress updates by more than 3× without significant loss of convergence, and can withstand a large constant fraction of "slow" clients submitting infrequent updates. 
Moreover, in a setting where client computation speeds are heterogeneous, QuAFL provides end-to-end speedup, since the server can progress without waiting for all clients to complete their local computation.

1. INTRODUCTION

Federated learning (FL) (Konečnỳ et al., 2016; McMahan et al., 2017) is a paradigm for large-scale distributed learning, in which multiple clients, orchestrated by a central authority, cooperate to jointly optimize a machine learning model given their local data. The key promise is to enable joint training over distributed client data, often located on end devices which are computationally- and communication-limited, without the data leaving the client device. The basic optimization algorithm underlying the learning process is known as federated averaging (FedAvg) (McMahan et al., 2017), and works roughly by having a central authority periodically communicate a shared model to all clients; then, the clients optimize this model locally based on their data, and communicate the resulting models to a central authority, which incorporates these models, often via some form of averaging, after which it initiates the next iteration. This algorithmic blueprint has been shown to be effective in practice (Li et al., 2020), and has also motivated a rich line of research analyzing its convergence properties (Stich, 2018; Haddadpour & Mahdavi, 2019), as well as proposing improved variants (Reddi et al., 2020; Karimireddy et al., 2020; Li & Richtárik, 2021).

Scaling federated learning runs into a number of practical challenges (Kairouz et al., 2021). One natural bottleneck is synchronization between the server and the clients: as practical deployments may contain thousands of nodes, it is infeasible for the central server to orchestrate synchronous rounds among all participants. A simple mitigating approach is node sampling, e.g. (Smith et al., 2017; Bonawitz et al., 2019); another, more general one is asynchronous communication, e.g. (Wu et al., 2020; Nguyen et al., 2022b), by which the server and the nodes may work with inconsistent versions of the shared model. 
An orthogonal scalability barrier is the high communication cost of transmitting parameter updates (Kairouz et al., 2021), which may overwhelm communication-limited clients. Several communication-compression approaches have been proposed to address this (Jin et al., 2020; Jhunjhunwala et al., 2021; Li & Richtárik, 2021; Wang et al., 2022). It is reasonable to assume that both these bottlenecks would need to be mitigated in practice: for instance, communication reduction may not be as effective if the server has to wait for each of the clients to complete their local steps on a version of the model; yet, synchrony is assumed by most references with compressed communication. At the same time, removing synchrony completely may lead to divergence, given that local data is usually heterogeneous. Thus, it is interesting to ask whether asynchrony, communication compression, and heterogeneous local data can be jointly supported.

Contribution. In this paper, we address this by proposing an algorithm for Quantized Asynchronous Federated Learning called QuAFL, which is an extension of FedAvg, specifically adapted to support both asynchronous communication and communication compression. We provide a theoretical analysis of the algorithm's convergence under compressed and asynchronous communication, and experimental results on up to 300 nodes showing that it can also lead to practical performance gains.

2. RELATED WORK

The federated averaging (FedAvg) algorithm was introduced by McMahan et al. (2017), and Stich (2018) was among the first to consider its convergence rate in the homogeneous data setting. Here, we investigate whether one can jointly eliminate two of the main scalability bottlenecks of this algorithm, the synchrony between the server and client iterations, as well as the necessity of full-precision communication, with heterogeneous data distributions. Due to space constraints, we focus on prior work which seeks to mitigate these two constraints in the context of FL.

There is significant research into communication-compression for FedAvg (Philippenko & Dieuleveut, 2020; Reisizadeh et al., 2020; Jin et al., 2020; Haddadpour et al., 2021). However, virtually all of this work considers synchronous iterations. Reisizadeh et al. (2020) introduced FedPAQ, a variant of FedAvg which supports quantized communication via standard compressors, and provides strong convergence bounds, under the strong assumption of i.i.d. client data. Jin et al. (2020) examine the viability of a variant of the signSGD quantizer (Seide et al., 2014; Karimireddy et al., 2019) in the context of FedAvg, providing convergence guarantees; however, the rate guarantees have a polynomial dependence on the model dimension d, rendering them less practically meaningful. Haddadpour et al. (2021) proposed FedCOM, a family of federated optimization algorithms with communication-compression and convergence rates; yet, we note that, in order to prove convergence in the challenging heterogeneous-data setting, this reference requires non-trivial technical assumptions on the quantized gradients (Haddadpour et al., 2021, Assumption 5). Chen et al. (2021) also considered update compression, but under convex losses, coupled with a rather strong second-moment bound assumption on the gradients. Finally, Jhunjhunwala et al. (2021) examine adapting the degree of compression during the execution, proving convergence bounds for their scheme, under the non-standard i.i.d. data sampling assumption. We observe that each of these references requires at least one non-standard assumption for the convergence of FedAvg with compression. By contrast, our analysis works for general (non-convex) losses, under a standard non-i.i.d. data distribution, without relying on second-moment bounds on the gradients.

A complementary approach to reducing communication cost in FL has been to investigate optimizers with faster convergence, e.g. (Mishchenko et al., 2019; Karimireddy et al., 2020), or adaptive optimizers (Reddi et al., 2020; Tong et al., 2020). Recent work has shown that these approaches can be compatible with communication-compression (Gorbunov et al., 2021; Li & Richtárik, 2021; Wang et al., 2022). Specifically, for non-convex losses, MARINA (Gorbunov et al., 2021) offers theoretical guarantees both in terms of convergence and bits transmitted. However, MARINA is structured in synchronous rounds; moreover, it periodically (with some probability) has clients compute full gradients and transmit uncompressed model updates, and requires complex synchronization and variance-reduction to compensate for the extra noise due to quantization. Tyurin & Richtárik (2022) proposed a family of theoretical methods called DASHA, which combines the general structure of MARINA with Momentum Variance Reduction (MVR) methods (Cutkosky & Orabona, 2019), partially relaxing the coupling between the server and the workers and allowing compressed updates. In contrast to these works, we focus on obtaining a practical algorithm with good convergence bounds: we always transmit compressed, low-precision messages, and consider a notion of asynchronous communication which allows the server and nodes to make progress independently, in non-blocking fashion. 
We focus on the classic, practical FedAvg algorithm, although our general algorithmic and analytic approach should generalize to more complex notions of local optimization.

Our approach extends ideas from the analysis of decentralized variants of SGD (Lian et al., 2017; Tang et al., 2018; Nadiradze et al., 2021; Koloskova et al., 2019; Lu & De Sa, 2020), bringing them into the context of federated optimization. Significant differences exist: notably, we introduce a novel potential argument, adapted to FL, and cannot rely on stronger assumptions available in the decentralized setting, e.g. a gradient second-moment bound (Lu & De Sa, 2020). The concurrent work of Koloskova et al. (2022) provided sharper convergence bounds for asynchronous SGD in a model that is related to ours. Specifically, this reference considers a setting with worst-case and average delay bounds on asynchrony, and proves convergence rates that are similar to ours in the case of a single sampled client at a time (s = 1). By contrast, our work considers a different probabilistic model on the delays, related to that of Cannelli et al. (2020), in which the worst-case delay may be unbounded. In addition, we allow the clients to be interrupted by the server during their local computation, which may lead to practical improvements in terms of waiting times and load-balancing. At the technical level, the two analysis techniques are different: in particular, their technique does not require a lower bound on the number of SGD steps w.r.t. the number of nodes n.

3.1. SYSTEM OVERVIEW

System Model. We assume a distributed system with one coordinator and $n$ workers, jointly minimizing a $d$-dimensional, differentiable function $f : \mathbb{R}^d \to \mathbb{R}$. We consider the empirical risk minimization (ERM) setting, in which data samples are located at the $n$ nodes. Each agent $i$ has a local function $f_i$ associated to its own local fraction of the data, i.e., for all $x \in \mathbb{R}^d$: $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$. The goal is to converge on a model $x^*$ which minimizes the empirical loss. Clients run a distributed variant of SGD, coordinated by the central node. We will assume that each client $i$ is able to obtain unbiased stochastic gradients $g_i$ of its own local function $f_i$, i.e. $\mathbb{E}[g_i(x)] = \nabla f_i(x)$. These stochastic gradients can be computed by each agent by sampling i.i.d. from its own local distribution. Our analysis will consider the case where each client distribution is distinct, but there is a bound on the maximum gradient discrepancy.

We model client asynchrony as follows: between two consecutive interactions with the server, each client should perform a number of gradient steps on its local model. We treat the number of local steps at client $i$ as a random variable $H_i$, taking values in $\{0, 1, 2, \ldots, K\}$, where $K$ is a bound on how many steps a client can take in isolation. We emphasize the fact that $H_i$ can take the value 0, meaning that the client may take no steps since last contacted. Our only assumption regarding asynchrony is that the expected value of each $H_i$, denoted by $H$, exists and is $> 0$. That is, we assume that, on average, each client makes non-zero progress, and clients progress at similar rates, although the individual step distributions $H_i$ can be completely different.
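The step-count model above can be illustrated with a small simulation. The following is a minimal sketch, not part of the algorithm: the truncated geometric distribution and the names `sample_local_steps`, `p_continue` and `empirical_mean_steps` are assumptions made purely for illustration, since the analysis only requires that each $H_i$ lies in $\{0, \ldots, K\}$ and has a positive mean.

```python
import random


def sample_local_steps(K, p_continue, rng):
    """Draw a step count in {0, ..., K}.

    Hypothetical model: after each step the client manages another step
    with probability p_continue before the server interrupts it.
    """
    steps = 0
    while steps < K and rng.random() < p_continue:
        steps += 1
    return steps


def empirical_mean_steps(K, p_continue, trials=20000, seed=0):
    """Monte Carlo estimate of the expected number of local steps H."""
    rng = random.Random(seed)
    total = sum(sample_local_steps(K, p_continue, rng) for _ in range(trials))
    return total / trials
```

Under this assumed distribution the mean is $\sum_{j=1}^{K} p^j$, which is positive whenever `p_continue > 0`, even though individual draws are frequently zero.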

3.2. ALGORITHM DESCRIPTION

Overview. Our algorithm starts from the standard pattern used by federated averaging (FedAvg): computation and communication are organized in logical "rounds," where in each round the server transmits its current version of the model to either all, or a subset of clients. The clients should then take some number of local optimization steps on the received model, which is at most K ≥ 1, and transmit the result to the server, which integrates these updates. Our algorithm will relax this pattern in two orthogonal ways, allowing for both quantized and asynchronous communication.

Quantized Communication. The first relaxation is to only allow for compressed communication of the server model and of the client updates, via quantization. For this, we employ a carefully parametrized version of the lattice-based quantization technique of Davies et al. (2021), whose analytical properties we describe in the analysis section. For practical purposes, this quantization technique presents an encoding function Enc(A), which encodes an arbitrary input A to its quantized representation. (We always communicate vectors via their quantized representations.) To "read" an encoded message Enc(A), a node must call the symmetric Dec(B, Enc(A)) function, which allows for the "decoding" of the input Enc(A) with respect to a reference point B, returning a quantized output Q(A). We formally specify the properties of the compression process in Section 4.

Algorithm 1 Pseudocode for QuAFL Algorithm.

    % Initial models X_0 = X^1 = X^2 = ... = X^n = 0^d, number of local steps K
    % Encoding (Enc(A)) and decoding (Dec(B, Enc(A))) functions, with common parametrization.

    % At the Server:
    for t = 0 to T - 1 do
        Server chooses s clients uniformly at random, let S be the resulting set.
        for all clients i ∈ S do
            Server sends Enc(X_t) to the client i.
            Server receives Enc(Y^i) from client i
            % Y^i = X^i - η h_i is the client's progress since last contacted
            Q(Y^i) ← Dec(X_t, Enc(Y^i))    % Decodes client messages relative to X_t
        end for
        X_{t+1} = 1/(s+1) X_t + 1/(s+1) Σ_{i∈S} Q(Y^i)
    end for

    % At client i: the server interaction function
    function (server interaction)
        MSG_i ← Enc(X^i - η h_i)    % Client i compresses its local progress since last contacted.
        Client sends MSG_i to the server.
        Client receives Enc(X_t) from the server, where t is the current server time.
        Q(X_t) ← Dec(X^i, Enc(X_t))    % Client decodes the message using its old model as reference point.
        % The client then updates its local model
        X^i = 1/(s+1) Q(X_t) + s/(s+1) (X^i - η h_i)
        % Finally, it performs K new local steps on the updated X^i, unless interrupted again.
        LOCALUPDATES(X^i, K)
        WAIT()
    end function

    function LOCALUPDATES(X^i, K)
        h_i = 0    % local gradient accumulator
        for q = 0 to K - 1 do
            h_i^q = g_i(X^i - η Σ_{ℓ=0}^{q-1} h_i^ℓ)    % compute the qth local gradient
            h_i = h_i + h_i^q    % add it to the accumulator
        end for
    end function

Asynchronous Communication. A key practical limitation of the FedAvg pattern is that the server and its workers have to communicate in synchronous, lock-step fashion: thus, the server must wait for the results of computation at a round before it can move to the next round. In particular, this means that the server has to wait for the slowest client to complete its local steps before it can proceed. QuAFL relaxes this requirement by essentially allowing any contacted node $i$ to immediately return (a quantized version of) its current version of the model to the server upon being contacted, even though the client might still not have completed all its $K$ local optimization steps for the round. More precisely, the client always records its "base" model at the end of the last interaction with the server into parameter $X^i$, and sums up its gradient updates since the last interaction into the buffer $h_i$. Upon being contacted, the client simply sends its current progress $Y^i = X^i - \eta h_i$ to the server in quantized form, where $\eta$ is the learning rate; this progress excludes any local step whose computation was interrupted by the server. It is possible that this progress is zero. The client then decodes the quantized server model $X_t$, using its old local model $X^i$ as the decoding key. Finally, the client updates $X^i$ to include the server's new information via weighted averaging. It is then ready to restart its local update loop, upon this new model.

It is important to notice that the server interaction occurs asynchronously, and that it might occur either while the client is still performing local steps, or after the client has completed its $K$ local steps, and is idle, waiting for server contact. In the former case, upon being contacted, the client immediately calls the server interaction function, without performing additional steps. (In particular, we allow the number of completed local steps to be 0.) Globally, the server contacts $s$ random agents in each logical round, sends them a quantized version of the global model $X_t$, then receives quantized versions of their progress, and then incorporates this into the global model which will be sent at the next round.

Discussion. The practical advantage of QuAFL is that the server does not have to wait for each of the contacted clients to complete their local optimization on the global model $X_t$. In addition, an important departure from FedAvg is the averaging between the server and client models. Our formulation is important for fast convergence: as we show in Figure 4, other forms, such as just adopting the client average, lead to worse convergence.
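The two averaging rules in Algorithm 1 can be sketched in a few lines of Python. This is a simplified, sequential illustration rather than the actual implementation: the quantizer is replaced by identity placeholders (`identity_enc`, `identity_dec`), the accumulated gradients are passed in as plain lists instead of being computed asynchronously, and all function and parameter names are hypothetical.

```python
import random


def identity_enc(x):
    # Placeholder for Enc: no compression (assumption; the paper uses a lattice quantizer).
    return list(x)


def identity_dec(reference, msg):
    # Placeholder for Dec(reference, Enc(x)); this stand-in ignores the reference point.
    return list(msg)


def server_round(server_model, client_models, client_grads, s, eta, rng,
                 enc=identity_enc, dec=identity_dec):
    """Simulate one logical round; client_models and client_grads are updated in place.

    client_grads[i] holds the accumulated gradient h_i of client i since its
    last contact with the server (all zeros if it has taken no steps).
    """
    n, d = len(client_models), len(server_model)
    sampled = rng.sample(range(n), s)
    server_msg = enc(server_model)
    decoded = []
    for i in sampled:
        # Client progress Y^i = X^i - eta * h_i, possibly after fewer than K steps.
        y = [client_models[i][j] - eta * client_grads[i][j] for j in range(d)]
        # The server decodes the client message relative to its own model X_t.
        decoded.append(dec(server_model, enc(y)))
        # The client decodes the server model relative to its old model X^i ...
        q_server = dec(client_models[i], server_msg)
        # ... and averages: X^i = Q(X_t)/(s+1) + s/(s+1) * (X^i - eta * h_i).
        client_models[i] = [(q_server[j] + s * y[j]) / (s + 1) for j in range(d)]
        client_grads[i] = [0.0] * d  # the local accumulator restarts
    # Server update: X_{t+1} = (X_t + sum of decoded client models) / (s+1).
    new_server = [(server_model[j] + sum(q[j] for q in decoded)) / (s + 1)
                  for j in range(d)]
    return new_server, sampled
```

For instance, with a server model at the origin, four clients at `[1.0, 1.0]`, no local progress and `s = 2`, both the new server model and the two sampled clients move to `2/3` in every coordinate, while the clients that were not sampled stay unchanged.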

4.1. ANALYTICAL ASSUMPTIONS

We begin by stating the assumptions we make in the theoretical analysis of our algorithm. Specifically, we assume the following for the global loss function $f$, the individual client losses $f_i$, and the stochastic gradients $g_i$:

1. Uniform Lower Bound: There exists $f_* \in \mathbb{R}$ such that $f(x) \ge f_*$ for all $x \in \mathbb{R}^d$.

2. Smooth Gradients: For any client $i$, the gradient $\nabla f_i(x)$ is $L$-Lipschitz continuous for some $L > 0$, i.e. for all $x, y \in \mathbb{R}^d$: $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$. (1)

3. Bounded Variance: For any client $i$, the variance of the stochastic gradients is bounded by some $\sigma^2 > 0$, i.e. for all $x \in \mathbb{R}^d$: $\mathbb{E}\|g_i(x) - \nabla f_i(x)\|^2 \le \sigma^2$. (2)

4. Bounded Dissimilarity: There exist constants $G^2 \ge 0$ and $B^2 \ge 1$, s.t. for all $x \in \mathbb{R}^d$: $\frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x)\|^2 \le G^2 + B^2 \|\nabla f(x)\|^2$. (3)

The first three conditions are universal in distributed non-convex stochastic optimization, whereas the fourth encodes the fact that there must be a bound on the amount of divergence between the local distributions at the nodes in order to allow for joint optimization (Karimireddy et al., 2020; Jin et al., 2020; Gorbunov et al., 2021). In addition, we make the following assumption on the local progress performed by each node:

5. Probabilistic Progress: The expected number of local steps taken by a client when contacted by the server is $H > 0$.

Quantization Procedure. Please recall the semantics of our quantization procedure from Section 3.2. In this context, the quantizer has the following guarantees (Davies et al., 2021), referred to below as Lemma 4.1. For an input $x$ and a decoding key $y$ that are sufficiently close (Section 4.3 uses the requirement $\|x - y\| \le R^{R^d}\gamma$), and with probability at least $1 - (\ldots)\,O(R^{-d})$, the function $Q_{R,\gamma}(x) = \mathrm{Dec}_{R,\gamma}(y, \mathrm{Enc}_{R,\gamma}(x))$ has the following properties:

1. (Unbiased decoding) $\mathbb{E}[Q_{R,\gamma}(x)] = \mathbb{E}[\mathrm{Dec}_{R,\gamma}(y, \mathrm{Enc}_{R,\gamma}(x))] = x$;

2. (Error bound) $\|Q_{R,\gamma}(x) - x\| \le (R^2 + 7)\gamma$;

3. (Communication bound) $O\left(d \log\left(\frac{R}{\gamma}\|x - y\|\right)\right)$ bits are needed to send $\mathrm{Enc}_{R,\gamma}(x)$.
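To make the reference-point idea concrete, the following sketch implements a deliberately simplified quantizer on the cubic grid $\gamma\mathbb{Z}^d$. It is not the lattice scheme of Davies et al. (2021) and does not reproduce its guarantees; the functions `lattice_enc` and `lattice_dec`, the use of an integer modulus `R`, and the per-coordinate closeness condition are all assumptions chosen for illustration. The encoder rounds stochastically (hence unbiasedly) and transmits only each grid index modulo `R`; the decoder recovers the full index by picking the candidate nearest to its reference point, which succeeds when every coordinate of `x - y` is well within `(R / 2) * gamma`.

```python
import math
import random


def lattice_enc(x, gamma, R, rng):
    """Stochastically round x onto the grid gamma * Z^d and keep each index mod R."""
    msg = []
    for v in x:
        scaled = v / gamma
        low = math.floor(scaled)
        # Round up with probability equal to the fractional part (unbiased rounding).
        idx = low + (1 if rng.random() < scaled - low else 0)
        msg.append(idx % R)  # only about log2(R) bits per coordinate are sent
    return msg


def lattice_dec(y, msg, gamma, R):
    """Return the grid point congruent to msg (mod R) that lies closest to the reference y."""
    out = []
    for ref, m in zip(y, msg):
        ref_idx = round(ref / gamma)
        delta = (m - ref_idx) % R
        if delta > R // 2:
            delta -= R  # map the residue into a window centred on the reference index
        out.append((ref_idx + delta) * gamma)
    return out
```

The message size in this toy scheme depends on `R` but not on the norm of `x`, and the decoding error is at most `gamma` per coordinate as long as the reference point is close enough; this mirrors, in a much cruder form, why the analysis needs the client and server models to stay close.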

4.2. MAIN RESULTS

Roughly, our aim is to show that in our algorithm, local models of the clients stay close to the local model of the server, so that models are consistent, and we can successfully apply the quantizer. Let $\mu_t = (X_t + \sum_{i=1}^{n} X^i)/(n + 1)$ be the mean over all the node models in the system at a given $t$. Our main result shows the following:

Theorem 4.2. Assume the total number of server steps $T \ge \Omega(n^3)$, the learning rate $\eta = \frac{n+1}{sH\sqrt{T}}$, and quantization parameters $R = 2 + T^{3/d}$ and $\gamma^2 = \frac{\eta^2}{(R^2+7)^2}\left(\sigma^2 + 2KG^2 + \frac{f(\mu_0) - f_*}{L}\right)$. Let $H > 0$ be the expected number of local steps already performed by a client when interacting with the server. Then, with probability at least $1 - O(1/T)$, we have that Algorithm 1 converges at the following rate

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{5(f(\mu_0) - f_*)}{\sqrt{T}} + \frac{8KL(\sigma^2 + 2KG^2)}{H^2\sqrt{T}} + O\left(\frac{n^3 K L^2 (\sigma^2 + 2KG^2)}{sH^3 T}\right)$$

and uses $O(sT(d \log n + \log T))$ expected communication bits in total.

Discussion. The result provides a trade-off between the convergence speed of the algorithm, the variance of the local distributions (given by $\sigma$ and $G$), the sampling set size $s$, and the average number of local steps $H$ performed by a node when contacted by the server. For constant $s$, $H$ and $K$, this bound appears to be asymptotically optimal. Specifically, the third term contains similar "nuisance factors" as the second term, with the addition of the $n^3$ factor, and also bounds the extra variance. Crucially, this larger term is divided by $T$, as opposed to $\sqrt{T}$; since $T$ is our asymptotic parameter, it is common to assume that this extra term becomes negligible as $T$ is large, e.g. (Lu & De Sa, 2020). However, for super-constant $s = \omega(1)$, there is no tangible benefit due to sampling over $s$ clients. Intuitively, this is because of asynchrony: each client gets sampled on average every $n/s$ steps, and therefore will work on a "stale" copy of the model that is $n/s$ rounds old, which affects convergence speed. While this is a limitation of the analysis, from the practical perspective, this can be addressed by observing that, due to asynchrony, in terms of wall-clock time, we can think of the $s$ interaction steps between the server and the clients as happening in parallel. Thus, again, in terms of wall-clock time, it is reasonable to perform the substitution $T \to sT$ in the above rate calculation, which indeed suggests that our algorithm is able to obtain speedup with respect to the number of sampled clients $s$. In practice, it should be reasonable to assume that $H = \Theta(K)$, that is, that on average each client $i$ will have completed its local steps on the old version of the model $X^i$ when being contacted: otherwise, the sampling frequency of the server is too high, and prevents clients from making progress on their local optimization, and the server should simply decrease it.

Convergence at the Server. Finally, we show convergence at the server itself, as opposed to the convergence of the mean of the local models as in Theorem 4.2. We get that:

Corollary 4.3. Assume the total number of steps $T \ge \Omega(n^4)$, the learning rate $\eta = \frac{n+1}{sH\sqrt{T}}$, and quantization parameters $R = 2 + T^{3/d}$ and $\gamma^2 = \frac{\eta^2}{(R^2+7)^2}\left(\sigma^2 + 2KG^2 + \frac{f(\mu_0) - f_*}{L}\right)$. Let $H > 0$ be the expected number of local steps already performed by a client when interacting with the server. Then, with probability at least $1 - O(1/T)$, we have that Algorithm 1 converges at the following rate

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(X_t)\|^2 \le \frac{5(f(\mu_0) - f_*)}{\sqrt{T}} + \frac{8KL(\sigma^2 + 2KG^2)}{H^2\sqrt{T}} + O\left(\frac{n^4 K L^2 (\sigma^2 + 2KG^2)}{sH^2 T}\right).$$

This corollary yields a very similar bound to our main result, except for the larger dependency between $T$ and $n$, which is intuitively required due to the additional time required for the server to converge to a similar bound to the mean $\mu_t$. The third term may be significant for a large number of nodes $n$; however, since it is divided by $T$ (as opposed to $\sqrt{T}$), it can be seen as negligible for moderate $n$ and large $T$. 
The fact that QuAFL can match some of the best known rates for FedAvg under some parameter settings may seem surprising, since our algorithm is asynchronous (in particular, nodes take steps on local, delayed versions of the server model) and also supports communication-compression.
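As a small worked aid for the parameter choices in Theorem 4.2, the sketch below evaluates the learning rate and quantization parameters for concrete inputs. It assumes the formulas read $\eta = (n+1)/(sH\sqrt{T})$, $R = 2 + T^{3/d}$ and $\gamma^2 = \frac{\eta^2}{(R^2+7)^2}(\sigma^2 + 2KG^2 + (f(\mu_0) - f_*)/L)$; the function name `quafl_parameters` and its argument names are hypothetical, and `f_gap` stands for $f(\mu_0) - f_*$.

```python
import math


def quafl_parameters(n, s, H, T, d, sigma2, G2, K, L, f_gap):
    """Evaluate eta, R and gamma^2 under the assumed reading of Theorem 4.2."""
    eta = (n + 1) / (s * H * math.sqrt(T))
    R = 2 + T ** (3.0 / d)
    gamma2 = eta ** 2 / (R ** 2 + 7) ** 2 * (sigma2 + 2 * K * G2 + f_gap / L)
    return eta, R, gamma2
```

For example, `n = 9`, `s = 2`, `H = 5` and `T = 10000` give `eta = 0.01`. For realistic model dimensions `d`, the exponent `3 / d` is tiny, so `R` stays only slightly above 3 and the quantization resolution is driven mainly by `eta`.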

4.3. OVERVIEW OF THE ANALYSIS

The complete analysis is fairly complex, and is provided in full in the Appendix. Due to space constraints, we only provide an overview of the proofs, outlining the main intermediate results. The first step in the proof is bounding the deviation between the local models and their mean. For this, we introduce the potential function $\Phi_t = \|X_t - \mu_t\|^2 + \sum_{i=1}^{n} \|X^i - \mu_t\|^2$, and we use a load-balancing approach to show that this potential has the following supermartingale-type property:

Lemma 4.4. For any time step $t$ we have:

$$\mathbb{E}[\Phi_{t+1}] \le \left(1 - \frac{1}{4n}\right)\mathbb{E}[\Phi_t] + 8s\eta^2 \sum_{i=1}^{n} \mathbb{E}\|h_i\|^2 + 16n(R^2 + 7)^2\gamma^2.$$

The intuition behind this result is that the potential $\Phi_t$ will stay well-concentrated around its mean, except for influences from the variance due to local steps (second term) or quantization (third term). With this in place, the next lemma allows us to track the evolution of the average of the local models, with respect to local step and quantization variance:

Lemma 4.5. For any step $t$

$$\mathbb{E}\|\mu_{t+1} - \mu_t\|^2 \le \frac{2s^2\eta^2}{n(n+1)^2}\sum_{i} \mathbb{E}\|h_i\|^2 + \frac{2}{(n+1)^2}(R^2 + 7)^2\gamma^2.$$

In both cases, the upper bound depends on the second moment of the nodes' local progress $\sum_{i} \mathbb{E}\|h_i\|^2$. (This is due to the fact that the server contacts $s$ clients, which are chosen uniformly at random.) Then, our main technical lemma uses properties (1), (2) and (3) to concentrate $\sum_{i} \mathbb{E}\|h_i(X_t^i)\|^2$ around the true gradient $\mathbb{E}\|\nabla f(\mu_t)\|^2$, where the expectation is taken over the algorithm's randomness.

Lemma 4.6. For any step $t$, we have that

$$\sum_{i=1}^{n} \mathbb{E}\|h_i\|^2 \le 2nK(\sigma^2 + 2KG^2) + 8L^2K^2\,\mathbb{E}[\Phi_t] + 4nK^2B^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2.$$

We can then combine Lemmas 4.4 and 4.6 to get an upper bound on the potential with respect to $\mathbb{E}\|\nabla f(\mu_t)\|^2$. Summing over steps, we obtain the following:

Lemma 4.7.

$$\sum_{t=0}^{T} \mathbb{E}[\Phi_t] \le 80Tn^2(R^2 + 7)^2\gamma^2 + 80Tn^2sK\eta^2(\sigma^2 + 2KG^2) + 160B^2n^2sK^2\eta^2 \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2.$$

Next, using the $L$-smoothness of the function $f$, implied by (1), we can show that

$$\mathbb{E}[f(\mu_{t+1})] \le \mathbb{E}[f(\mu_t)] + \mathbb{E}\langle \nabla f(\mu_t), \mu_{t+1} - \mu_t \rangle + \frac{L}{2}\mathbb{E}\|\mu_{t+1} - \mu_t\|^2.$$

Final argument. Using the above inequality, and given that $\mathbb{E}[\mu_{t+1} - \mu_t] = -\frac{\eta}{n+1}\sum_{i \in S} h_i(X_t^i)$, we observe that the sum $\sum_{i=1}^{n} \mathbb{E}\langle \nabla f(\mu_t), \mu_{t+1} - \mu_t \rangle$ can be concentrated around $\mathbb{E}\|\nabla f(\mu_t)\|^2$, in similar fashion as in Lemma 4.6. Together with Lemma 4.5, this results in the following bound:

$$\mathbb{E}[f(\mu_{t+1})] - \mathbb{E}[f(\mu_t)] \le \frac{5\eta sKL^2\,\mathbb{E}[\Phi_t]}{n(n+1)} + \left(\frac{4sL^2\eta^3K^3}{n+1} + \frac{2s^2K\eta^2L}{(n+1)^2}\right)(\sigma^2 + 2KG^2) + \frac{(R^2 + 7)^2\gamma^2 L}{(n+1)^2} + \left(-\frac{3\eta sH}{4(n+1)} + \frac{8B^2L^2\eta^3sK^3}{n+1} + \frac{4B^2s^2K^2L\eta^2}{(n+1)^2}\right)\mathbb{E}\|\nabla f(\mu_t)\|^2.$$

For $\eta = \frac{n+1}{sH\sqrt{T}}$, as stated in the Theorem, we can use Lemma 4.7 to cancel out the terms containing the potential $\Phi_t$ (after summing up the inequality over $T$ steps). Replacing these terms, and modulo some additional term wrangling, we obtain the claimed convergence bound.

Quantization Impact. Finally, we address the correctness of the quantization technique. We show that the quantization fails with negligible probability:

Lemma 4.8. Let $T \ge \Omega(n^3)$, then for quantization parameters $R = 2 + T^{3/d}$ and $\gamma^2 = \frac{\eta^2}{(R^2+7)^2}\left(\sigma^2 + 2KG^2 + \frac{f(\mu_0) - f_*}{L}\right)$ we have that the probability of quantization never failing during the entire run of Algorithm 1 is at least $1 - O(1/T)$.

Per Lemma 4.1, in order for the communication to fail with negligible probability, we need to show that whenever the server communicates with a client, the two-norm of the difference between their local models is at most $R^{R^d}\gamma$. Hence, we need to bound $\mathbb{E}[\Phi_t]$. The only similar use of this technique was in Nadiradze et al. (2021); however, the authors of this reference could benefit from assuming that the second moment of the gradients was bounded. Since we make no such assumption here, we need to find a way to bound $\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2$. Fortunately, our main result shows that the gradients are vanishing, so we can take advantage of the convergence rate and plug it back into Lemma 4.7. Similarly, due to Property 3 of Lemma 4.1, the number of bits used by our algorithm in one communication between the server and a client depends on the two-norm of the distance between their local models. Thus, we can use the bound on $\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\mu_t)\|^2$ to show the following.

Lemma 4.9. Let $T \ge \Omega(n^3)$, then for quantization parameters $R = 2 + T^{3/d}$ and $\gamma^2 = \frac{\eta^2}{(R^2+7)^2}\left(\sigma^2 + 2KG^2 + \frac{f(\mu_0) - f_*}{L}\right)$ we have that the expected number of bits used by Algorithm 1 in total is $O(sT(d \log n + \log T))$.

We note that the communication cost per step is also asymptotically optimal, modulo the multiplicative $\log n$ and additive $\log T$ terms, required to ensure success with probability $1 - O(1/T)$.
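The step from the one-round recursion of Lemma 4.4 to a bound on the summed potential can be sketched as follows. This is a simplified reconstruction for intuition, not the Appendix proof: it assumes $\Phi_0 = 0$ (all models are initialized to $0^d$ in Algorithm 1), and the shorthand $a_t$, $c_t$ is introduced here only for this sketch.

```latex
% Shorthand (assumed for this sketch): a_t = E[\Phi_t], and c_t collects the
% last two terms on the right-hand side of Lemma 4.4.
\begin{align*}
a_{t+1} &\le \Bigl(1 - \tfrac{1}{4n}\Bigr) a_t + c_t,
\qquad
c_t = 8 s \eta^2 \sum_{i=1}^{n} \mathbb{E}\|h_i\|^2 + 16 n (R^2+7)^2 \gamma^2 . \\
\intertext{Summing over $t = 0, \dots, T-1$, and using $a_0 = 0$ and $a_T \ge 0$:}
\sum_{t=0}^{T} a_t
  &\le \Bigl(1 - \tfrac{1}{4n}\Bigr) \sum_{t=0}^{T} a_t + \sum_{t=0}^{T-1} c_t
\quad\Longrightarrow\quad
\sum_{t=0}^{T} a_t \le 4n \sum_{t=0}^{T-1} c_t .
\end{align*}
```

Substituting the bound of Lemma 4.6 into $c_t$ then yields the coefficients $64 n^2 (R^2+7)^2\gamma^2$, $64 n^2 s K \eta^2 (\sigma^2 + 2KG^2)$ and $128 B^2 n^2 s K^2 \eta^2$ per step, plus a term $256\, n s \eta^2 L^2 K^2 \sum_t a_t$ that has to be moved to the left-hand side. If one assumes $256\, n s \eta^2 L^2 K^2 \le 1/5$, dividing by $4/5$ turns 64 into 80 and 128 into 160, which is consistent with the constants in Lemma 4.7; whether the Appendix proceeds exactly this way is not confirmed here.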

5. EXPERIMENTAL RESULTS

Experimental Setup and Goals. We implemented our algorithm in PyTorch in order to train neural networks for image classification tasks, specifically residual CNNs (He et al., 2016) on the MNIST (LeCun & Cortes, 2010), Fashion MNIST (Xiao et al., 2017), CIFAR-10 (Krizhevsky & Hinton, 2009) and CelebA (Liu et al., 2015) datasets, in the rigorous FL setup of LEAF (Caldas et al., 2018). Details are presented in the Appendix. We aim to validate our analysis relative to the impact of various parameters. We omit error bars, as we observed that the variance is very low. Specifically, the parameters we examine are n, s, and K, which have the same meaning as in our theoretical analysis. Our experiments are described by (n, s, K, b), where b is the number of bits for quantization. In addition, we define swt as the server waiting time between two consecutive calls, and the server interaction time, sit, as the amount of time that the server needs to send and receive necessary data.

We assume a server and n clients. The training dataset is distributed among clients so that each has access to a fixed 1/n partition of the training data. We track the accuracy of the server's model on an unseen validation dataset. We measure loss and accuracy of the model with respect to simulation time and total gradient steps performed by clients. In each round, the server chooses s clients uniformly at random. It then sends its model to those clients and receives their current local models. Each client will have taken a maximum of K local steps by the time it is contacted by the server. We update both the client and server models following QuAFL, and then increase the server time by sit. The server then waits for another interval of server waiting time (swt) to make its next call. Unless otherwise stated, communication is compressed. 
We differentiate between two types of timing experiments: uniform timing experiments assume that all clients take the same amount of time for a gradient step; non-uniform timing experiments distinguish between fast and slow clients. Specifically, the length of each client step is taken to be a random variable X ∼ Exp(λ), where λ is 1/2 for fast clients and 1/8 for slow clients; the expected step time E(X) is 2 and 8, respectively. In each experiment, we assumed 30% of clients to be slow. Figure 1 examines the impact of the number of sampled peers s when training ResNet18 on the CelebA dataset, where 30% of clients are slow. We first observe that convergence speed clearly follows the ordering of the number of peers s, confirming our analysis. Interestingly, timings in this experiment are set up so that there is a 27% probability that a slow client will not have taken any steps when interacting with the server. (This probability decreases as s increases.) Thus, this experiment also shows that QuAFL is indeed robust to such slow clients, although their proportion can impact convergence. Figure 2 examines the impact of the number of quantization bits b, showing that increasing b from 8 to 10 improves convergence; however, there is clear saturation after 10 bits. In Figure 3, we examine the loss convergence of FedAvg and QuAFL versus simulated execution time, in a system with 20 clients, out of which 25% are slow. (The Baseline is a single slow node that performs an optimization step per round.) Here, it is evident that QuAFL's asynchrony allows it to converge faster in terms of wall-clock time than its synchronous counterparts. In Figure 4, we examine the impact of different types of averaging on the convergence of the basic QuAFL pattern, on the CelebA dataset, with n = 100 clients. All variants execute in the same setup, with individually-tuned hyper-parameters.
We clearly observe that the variant where averaging is applied both at the server and at the client performs the best, which validates our choices. In Figure 5, we compare the convergence of QuAFL with the lattice quantizer, relative to QuAFL using the standard QSGD quantizer (Alistarh et al., 2017a). We note that using QSGD is not theoretically justified, and in fact we had to perform careful tuning in order to obtain stable convergence for this variant. It is interesting that QuAFL appears to support this, albeit at the cost of slower convergence. Finally, in Figure 6, we compare QuAFL convergence relative to FedBuff, a state-of-the-art asynchronous FL protocol (Nguyen et al., 2022a), which buffers client messages at the server and updates the server model as soon as s messages, with K local updates each, are in the server buffer. Since the timing model of FedBuff is different, we compare convergence in terms of the total number of gradient steps taken by clients, without quantization. We observe that QuAFL converges faster: our analysis suggests that this is because QuAFL takes into account partial progress by slow clients, whereas in FedBuff slow clients consistently contribute less to the server updates. We present additional experimental results in the Appendix, specifically on higher node counts (up to 300), full-convergence experiments, as well as across all other tasks.

6. CONCLUSIONS AND LIMITATIONS

We have provided the first variant of FedAvg which incorporates both asynchronous and compressed communication, and have shown that this algorithm can still provide good convergence guarantees. Our analysis should be extensible to more complex federated optimizers, such as gradient tracking (e.g., Haddadpour et al., 2021), controlled averaging (Karimireddy et al., 2020), or variance-reduced variants (Gorbunov et al., 2021). Our work has the following limitations. First, our algorithm has an optimal convergence rate only when H = Θ(K), which we believe is natural due to asynchrony. Second, this version of the analysis requires the expected number of local steps H to be the same across all devices. We believe that this can be addressed either by modifying the objective, or by de-biasing via sampling, and we plan to investigate it in future work, together with validation on real-world deployments.

7. REPRODUCIBILITY STATEMENT

All the code required to reproduce our experimental setup and our experiments is available at https://anonymous.4open.science/r/QuAFL-Anonymous. 

A EXPERIMENTAL SETUP

In this section, we describe our experimental setup in detail. We begin by defining the hyperparameters which control the behavior of QuAFL and FedAvg. Then, we proceed by carefully describing the way in which we simulated each of the algorithms. Finally, we detail the datasets, tasks, and models used for our experiments.

A.1 HYPER-PARAMETERS

We first define our hyper-parameters; in later sections, we examine their impact on algorithm behavior through ablation studies.

n: Number of clients.
s: Number of clients interacting with the server at each step.
K: In QuAFL, the maximum number of local steps a client may take between two server calls. In FedAvg, the number of local steps performed by each client upon each server call.
b: Number of bits used to send a coordinate after quantization.
swt: Server waiting time, i.e. the amount of time that the server waits between two consecutive calls.
sit: Server interaction time, i.e. the amount of time that the server needs to send and receive the necessary data (excluding computation time).
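As an illustration only, the hyper-parameters above can be grouped into a small validated configuration object. The class name `ExperimentConfig`, the validation rules, and the example values are assumptions made for this sketch; they are not taken from the released code. The 32-bit baseline in `compression_ratio` follows the statement in Section A.2 that b bits are sent instead of 32 per coordinate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the (n, s, K, b, swt, sit) hyper-parameters."""

    n: int      # number of clients
    s: int      # clients contacted by the server at each step
    K: int      # max local steps (QuAFL) or exact local steps (FedAvg)
    b: int      # bits per coordinate after quantization
    swt: float  # server waiting time between two consecutive calls
    sit: float  # server interaction time (send + receive)

    def __post_init__(self):
        if not 1 <= self.s <= self.n:
            raise ValueError("s must satisfy 1 <= s <= n")
        if self.K < 1:
            raise ValueError("K must be at least 1")
        if not 1 <= self.b <= 32:
            raise ValueError("b must be between 1 and 32 bits")
        if self.swt < 0 or self.sit < 0:
            raise ValueError("swt and sit must be non-negative")

    def compression_ratio(self):
        """Ratio of uncompressed (32-bit) to compressed coordinate size."""
        return 32 / self.b


# Example values are placeholders, not settings reported in the paper.
cfg = ExperimentConfig(n=20, s=5, K=10, b=8, swt=2.0, sit=1.0)
print(cfg.compression_ratio())
```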

A.2 SIMULATION

We attempt to simulate a realistic FL deployment scenario, as follows. We assume a server and n clients, each of which initially has a copy of the model. The training dataset is distributed among the clients so that each of them has access to 1/n of the training data. We track the performance of each algorithm by evaluating the server's model on an unseen validation dataset. We measure the loss and accuracy of the model with respect to simulation time, server steps, and the total number of local steps performed by clients. This setup is common to QuAFL and FedAvg. In the following, we describe the specifics of each algorithm and the differences between them.

QuAFL: Upon each server call, the server chooses s clients uniformly at random. It then sends its model to those clients and asks for their current local models. (Recall that clients send their model to the server immediately.) Each of the clients will have taken at most K local steps by the time it is contacted by the server. The server then replaces its model with a carefully-computed average over the received models and its current model. This process increases the time on the server by the server interaction time (sit). The server then waits for another interval of server waiting time (swt) before making its next call. The s receiving clients replace their model with a weighted average of their current model and the received server model. Since each client performs local steps from its last interaction time until the current server time, nodes are effectively executing asynchronously. Moreover, note that communication is compressed, as all models are encoded at their source and decoded at their destination.

Quantization: To obtain lightweight but efficient communication between the clients and the server, we use the well-known lattice quantization (Davies et al., 2021). Using this method, we send b instead of 32 bits for each scalar dimension.
Informally, each 32-bit number maps to one of the 2^b quantized levels and can be sent using only b bits. The encoded number can then be decoded to a sufficiently close number at the destination, following the quantization protocol.

FedAvg: At the beginning of each round, the server chooses s clients at random and sends its current model to them. Each of those clients receives the model, uncompressed, performs exactly K local steps using this model as the starting point, and then sends the resulting model back to the server. The server then computes the average of the received models and adopts it as its model. Because of this synchronous structure, in each round the server must wait for the slowest client to complete its local steps, plus an extra sit for the communication time. After completing each round, the server starts the next call immediately, which means that swt = 0 in FedAvg.

Timing Experiments. We differentiate between two types of timing experiments. Uniform timing experiments, presented in the paper body, assume that all clients take the same amount of time for a gradient step. However, in real-world setups, different devices may require different amounts of time to perform a single local step. This is one of the main disadvantages of synchronous federated optimization algorithms. To demonstrate how this fact affects the experiments, in our non-uniform timing experiments we distinguish between fast and slow clients. The length of each local step can be characterized as a memoryless time event. Therefore, the length of each local step can be modeled by a random variable X ∼ Exponential(λ). The parameter λ is 1/2 for fast clients and 1/8 for slow clients; the expected step time E(X) is 2 and 8, respectively. In each timing experiment, we assumed only one fourth of the clients to be slow.
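The non-uniform timing model can be simulated directly. The sketch below is an illustration under stated assumptions rather than the authors' simulator: it counts how many local steps a client completes in a fixed window between two contacts, with step durations drawn from Exponential(λ) and the count capped at K. The window length, K, and the function names are hypothetical. For a fixed window w, the probability that a client completes no step at all is exp(-λw), which the simulation can be compared against.

```python
import math
import random


def steps_completed(window, rate, max_steps, rng):
    """Count local steps finished within `window`, capped at `max_steps`.

    Each step duration is drawn independently from Exponential(rate).
    """
    elapsed, steps = 0.0, 0
    while steps < max_steps:
        elapsed += rng.expovariate(rate)  # duration of the next local step
        if elapsed > window:
            break  # the server contacts the client mid-step
        steps += 1
    return steps


def prob_no_step(window, rate):
    """Probability that the first step alone outlasts the window."""
    return math.exp(-rate * window)


rng = random.Random(0)
window, max_steps, trials = 10.0, 5, 20000  # hypothetical settings
for label, rate in (("fast", 1 / 2), ("slow", 1 / 8)):
    counts = [steps_completed(window, rate, max_steps, rng) for _ in range(trials)]
    mean_steps = sum(counts) / trials
    zero_frac = counts.count(0) / trials
    print(label, round(mean_steps, 2), round(zero_frac, 3), round(prob_no_step(window, rate), 3))
```

In the actual experiments the time between two contacts of the same client is itself random, since it depends on swt, sit and the sampling of s out of n clients, so the fixed window here is only a simplification.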

A.3 DATASETS AND MODELS

We used PyTorch to manage the training process in our algorithm. We trained neural networks for image classification tasks on three well-known datasets: MNIST, Fashion MNIST, and CIFAR-10. For all datasets, we used the default train/test split as our training/validation split. In the following, we describe the model architecture and the training hyper-parameters used for each of these datasets.

MNIST:

We used the SGD optimizer with a constant learning rate lr = 0.5 throughout training. We used a two-layer MLP architecture with (784, 32, 10) nodes in its layers, respectively. We used batch size 128 in each client's SGD step.

Fashion MNIST: Although this dataset has the same sample size and number of classes as MNIST, obtaining competitive performance on it requires a more complicated architecture. Therefore, we used a CNN model to train on this dataset, demonstrating the performance of our algorithm on a non-convex task. To optimize the models, we used the Adam optimizer with constant lr = 0.001 and batch size 100.

CIFAR-10: To load this dataset, we used data augmentation and normalization. For this task, we trained ResNet20 models. The SGD optimizer with constant lr = 0.03 is used in the training process. Batch sizes of 64 and 200 are used for training and validation, respectively.

A.4 RESULTS ON FASHION MNIST (FMNIST)

We begin by validating our earlier results, presented in the paper body, for the slightly more complex FMNIST dataset, and on a convolutional model. In Figures 7 and 8, we examine the impact of the parameters K and s, respectively, on the total number of interaction rounds at the server needed to reach a certain training loss. As expected, we notice that higher K and s improve the convergence behavior of the algorithm. In Figure 9, we examine the impact of the server waiting time on the convergence of the algorithm relative to the number of server rounds. Again, we notice that a higher server waiting time improves convergence, as it allows the server to take advantage of additional local steps performed at the clients, as predicted by our analysis. (Higher swt means a higher average number of completed steps H.) Next, we compare the convergence of the sequential Baseline, FedAvg, and QuAFL, again in terms of the number of optimization "rounds" at the server. As expected, the Baseline is faster to converge than FedAvg, which in turn is faster than QuAFL in this measure. Specifically, the difference between QuAFL and the other algorithms arises because, in our algorithm, nodes operate on old variants of the model at every step, which slows down convergence. Next, we examine convergence in terms of actual time, in the heterogeneous setting in which 25% of the clients are slow. In Figure 11, we observe the validation accuracy ensured by various algorithms relative to the simulated execution time, whereas in Figure 12 we observe the training loss versus the same metric. (We assume that, in Baseline, a single node acts as both the client and the server, and that this node is slow, i.e. has higher per-step times.) To further support the robustness of our algorithm in regimes with a large number of clients, we conducted an experiment with n = 300 clients and s = 30 peers interacting with the server at each step.
The validation accuracy and loss versus time for this experiment are plotted in Figure 13 and Figure 14, respectively. We observe that, importantly, if time is taken into account rather than the number of server rounds, QuAFL can provide notable speedups in these metrics. This is specifically because of its asynchronous communication patterns, which allow it to complete rounds faster, without always having to wait for the slow nodes to complete their local computation. While this behaviour is simulated, we believe that it reflects the algorithm's practical potential. Finally, Figure 15 shows that all methods can reach the maximum accuracy for this dataset/model combination (for the SGD baseline, this occurs later), although QuAFL is the fastest to do so in terms of wall-clock time. Our last experiment, in Figure 16, examines whether naive QSGD quantization of the transmitted updates in FedBuff (Nguyen et al., 2022a) can converge at a good rate, relative to QuAFL at the same quantization ratio. We find that this is not the case: first, we remark that, with careful tuning of the learning rate, FedBuff can indeed converge. We believe that this is because of the specific application: the norm of the model and updates in DNNs tends to be small, and therefore the quantization error induced by direct QSGD stochastic quantization is also manageable. However, we observe a clear loss of convergence for the FedBuff + QSGD algorithm in this case, relative to QuAFL, which we ascribe to the fact that we are essentially running a heuristic.

A.5 RESULTS ON CIFAR-10

We now present results for a standard image classification task on the CIFAR-10 dataset, using a ResNet20 model (He et al., 2016). Figures 17 and 18 show the decrease in training loss versus the number of server steps (or rounds) for different values of K and s, respectively. As our theory suggests, increasing K and s leads to an improvement in the convergence rate of the system. Figure 19 demonstrates the impact of the number of quantization bits b on the convergence behaviour of the algorithm. By the definition of b, increasing the number of quantization bits improves the communication accuracy. Thus, as can be seen in the graph, higher values of b improve convergence relative to the number of server steps. Finally, Figure 20 shows the impact of the server interaction frequency, again controlled via the timeout parameter swt, on the algorithm's convergence. It is apparent that a very high interaction frequency can slow the algorithm down, by not allowing it to take advantage of the clients' local steps. In Figures 21 and 22, we examine the validation accuracy and loss, respectively, ensured by various algorithms versus the simulated execution time. (As in the FMNIST experiments, we assumed the Baseline to be a single slow node that performs an optimization step per round.) Again, the asynchronous nature of QuAFL provides a faster convergence rate than its synchronous counterparts, which can be clearly seen in these figures.

Recall that $X_t$ denotes the model of the server at step $t$, and $X^i$ is the local model of client $i$ after its last interaction with the server. Also, $\tilde h_i$ is the sum of local gradient steps for model $X^i$ since its last interaction with the server. For the convergence analysis, local steps of the clients that are not selected by the server do not have any effect on the server or on other clients.
Therefore, we do not need to assume that clients perform their local steps asynchronously; we can instead assume that all clients run their local gradient steps after the server contacts them. What remains is to account for the randomness of the server's client selection and for the fact that the server can contact nodes before they have finished their $K$ steps. For this purpose, we assume that the number of completed steps is a random variable $H_t^i$ with mean $H$. To carry out the analysis in this setting, we introduce notation indexed by the server round. We write $X_t^i$ for the value of $X^i$ when the server is running its $t$-th iteration, and $\tilde h_{i,t}$ for the sum of local steps at this time. We index each local step $q$ with a superscript. Formally, we have $\tilde h_{i,t}^0 = 0$, and for $1 \le q \le H_t^i$ we let

$$\tilde h_{i,t}^q = \tilde g_i\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_{i,t}^s\Big), \qquad \tilde h_{i,t} = \sum_{q=0}^{H_t^i} \tilde h_{i,t}^q .$$

Further, for $1 \le q \le H_t^i$, let

$$h_{i,t}^q = \mathbb{E}\Big[\tilde g_i\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_{i,t}^s\Big)\Big] = \nabla f_i\Big(X_t^i - \sum_{s=0}^{q-1} \eta\, \tilde h_{i,t}^s\Big)$$

be the expected value of $\tilde h_{i,t}^q$ taken over the randomness of the stochastic gradient $\tilde g_i$. Also, we have

$$h_{i,t} = \sum_{q=0}^{H_t^i} h_{i,t}^q .$$

B.2 PROPERTIES OF LOCAL STEPS

Lemma B.1. For any agent $i$ and step $t$,

$$\mathbb{E}\|h_{i,t}^q\|^2 \le \frac{\sigma^2}{K^2} + 8L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 4\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 .$$

Proof.

$$\begin{aligned}
\mathbb{E}\|h_{i,t}^q\|^2 &\le \mathbb{E}\Big\|\nabla f_i\Big(X_t^i - \sum_{s=0}^{q-1}\eta\,\tilde h_{i,t}^s\Big) - \nabla f_i(\mu_t) + \nabla f_i(\mu_t)\Big\|^2 \\
&\le 2\,\mathbb{E}\Big\|\nabla f_i\Big(X_t^i - \sum_{s=0}^{q-1}\eta\,\tilde h_{i,t}^s\Big) - \nabla f_i(\mu_t)\Big\|^2 + 2\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 \\
&\le 4L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 4\eta^2 L^2 q \sum_{s=0}^{q-1}\mathbb{E}\|\tilde h_{i,t}^s\|^2 + 2\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 \\
&\le 4L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 4\eta^2 L^2 q \sum_{s=0}^{q-1}\big(\mathbb{E}\|h_{i,t}^s\|^2 + \sigma^2\big) + 2\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 .
\end{aligned}$$

The rest of the proof is done by induction, assuming $\eta < \frac{1}{4LK^2}$.

Lemma 4.6. For any step $t$, we have that

$$\sum_{i=1}^n \mathbb{E}\|\tilde h_{i,t}\|^2 \le 2nK(\sigma^2 + 2KG^2) + 8L^2K^2\,\mathbb{E}[\Phi_t] + 4nK^2B^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2 .$$

Proof.
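Using Lemma B.1:

$$\begin{aligned}
\sum_{i=1}^n \mathbb{E}\|\tilde h_{i,t}\|^2 &= \sum_{i=1}^n \sum_{h=0}^{K} \Pr[H_t^i = h]\;\mathbb{E}\Big\|\sum_{q=1}^{h}\tilde h_{i,t}^q\Big\|^2 \\
&\le \sum_{i=1}^n \sum_{h=1}^{K} \Pr[H_t^i = h]\; h \sum_{q=1}^{h}\mathbb{E}\|\tilde h_{i,t}^q\|^2 \\
&\le nK\sigma^2 + \sum_{i=1}^n \sum_{h=1}^{K} \Pr[H_t^i = h]\; h \sum_{q=1}^{h}\mathbb{E}\|h_{i,t}^q\|^2 \\
&\le nK\sigma^2 + \sum_{i=1}^n K^2\Big(\frac{\sigma^2}{K^2} + 8L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 4\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2\Big) \\
&\le 2nK\sigma^2 + \sum_{i=1}^n K^2\Big(8L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 4\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2\Big) \\
&\le 2nK\sigma^2 + 8L^2K^2\,\mathbb{E}[\Phi_t] + 4nK^2G^2 + 4nK^2B^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{aligned}$$

Lemma B.2. For any local step $1 \le q$, agent $1 \le i \le n$ and step $t$,

$$\mathbb{E}\|\nabla f_i(\mu_t) - h_{i,t}^q\|^2 \le 4L^2\eta^2q^2\sigma^2 + 4L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 8L^2\eta^2q^2\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 .$$

Proof.

$$\begin{aligned}
\mathbb{E}\|\nabla f_i(\mu_t) - h_{i,t}^q\|^2 &= \mathbb{E}\Big\|\nabla f_i(\mu_t) - \nabla f_i\Big(X_t^i - \sum_{s=0}^{q-1}\eta\,\tilde h_{i,t}^s\Big)\Big\|^2 \\
&\le L^2\,\mathbb{E}\Big\|\mu_t - X_t^i + \sum_{s=0}^{q-1}\eta\,\tilde h_{i,t}^s\Big\|^2 \\
&\le 2L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 2L^2\eta^2\,\mathbb{E}\Big\|\sum_{s=0}^{q-1}\tilde h_{i,t}^s\Big\|^2 \\
&\le 2L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 2L^2\eta^2 q\sum_{s=0}^{q-1}\mathbb{E}\|\tilde h_{i,t}^s\|^2 \\
&\overset{\text{Lemma B.1}}{\le} 2L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 2L^2\eta^2q^2\Big(2\sigma^2 + 8L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 4\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2\Big) \\
&= 4L^2\eta^2q^2\sigma^2 + (2L^2 + 16L^4\eta^2q^2)\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 8L^2\eta^2q^2\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 \\
&\le 4L^2\eta^2q^2\sigma^2 + 4L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 8L^2\eta^2q^2\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 ,
\end{aligned}$$

where the last inequality follows from $\eta < \frac{1}{4LK}$.

Lemma B.3. For any time step $t$,

$$\sum_{i=1}^n \mathbb{E}\langle\nabla f(\mu_t), -h_{i,t}\rangle \le 4KL^2\,\mathbb{E}[\Phi_t] + \Big(-\frac{3Hn}{4} + 8B^2L^2\eta^2K^3n\Big)\mathbb{E}\|\nabla f(\mu_t)\|^2 + 4nL^2\eta^2K^3(\sigma^2 + 2G^2) .$$

Proof.

$$\begin{aligned}
\sum_{i=1}^n \mathbb{E}\langle\nabla f(\mu_t), -h_{i,t}\rangle &= \sum_{i=1}^n\sum_{h=1}^{K}\Pr[H_t^i = h]\;\mathbb{E}\Big\langle\nabla f(\mu_t), -\sum_{q=1}^{h}h_{i,t}^q\Big\rangle + \sum_{i=1}^n \Pr[H_t^i = 0]\;\mathbb{E}\langle\nabla f(\mu_t), 0\rangle \\
&= \sum_{i=1}^n\sum_{h=1}^{K}\Pr[H_t^i = h]\sum_{q=1}^{h}\Big(\mathbb{E}\langle\nabla f(\mu_t), \nabla f_i(\mu_t) - h_{i,t}^q\rangle - \mathbb{E}\langle\nabla f(\mu_t), \nabla f_i(\mu_t)\rangle\Big) .
\end{aligned}$$

Using Young's inequality, we can upper bound $\mathbb{E}\langle\nabla f(\mu_t), \nabla f_i(\mu_t) - h_{i,t}^q\rangle$ by $\frac{\mathbb{E}\|\nabla f(\mu_t)\|^2}{4} + \mathbb{E}\|\nabla f_i(\mu_t) - h_{i,t}^q\|^2$.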
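Plugging this into the above inequality, we get:

$$\begin{aligned}
\sum_{i=1}^n \mathbb{E}\langle\nabla f(\mu_t), -h_{i,t}\rangle &\le \sum_{i=1}^n\sum_{h=1}^{K}\Pr[H_t^i = h]\sum_{q=1}^{h}\Big(\mathbb{E}\|\nabla f_i(\mu_t) - h_{i,t}^q\|^2 + \frac{\mathbb{E}\|\nabla f(\mu_t)\|^2}{4} - \mathbb{E}\langle\nabla f(\mu_t), \nabla f_i(\mu_t)\rangle\Big) \\
&\overset{\text{Lemma B.2}}{\le} \sum_{i=1}^n\sum_{h=1}^{K}\Pr[H_t^i = h]\sum_{q=1}^{h}\Big(4L^2\eta^2q^2\sigma^2 + 4L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 8L^2\eta^2q^2\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 \\
&\qquad\qquad + \frac{\mathbb{E}\|\nabla f(\mu_t)\|^2}{4} - \mathbb{E}\langle\nabla f(\mu_t), \nabla f_i(\mu_t)\rangle\Big) \\
&\le \sum_{i=1}^n\sum_{h=1}^{K}\Pr[H_t^i = h]\; h\Big(4L^2\eta^2h^2\sigma^2 + 4L^2\,\mathbb{E}\|X_t^i - \mu_t\|^2 + 8L^2\eta^2h^2\,\mathbb{E}\|\nabla f_i(\mu_t)\|^2 \\
&\qquad\qquad + \frac{\mathbb{E}\|\nabla f(\mu_t)\|^2}{4} - \mathbb{E}\langle\nabla f(\mu_t), \nabla f_i(\mu_t)\rangle\Big) \\
&\le 4KL^2\,\mathbb{E}[\Phi_t] + 4nL^2\eta^2K^3(\sigma^2 + 2G^2) + \Big(8B^2nL^2\eta^2K^3 + \frac{Hn}{4} - Hn\Big)\mathbb{E}\|\nabla f(\mu_t)\|^2 ,
\end{aligned}$$

where in the last step we used that $\mathbb{E}[H_t^i] = H$ and that $\frac{1}{n}\sum_{i=1}^n f_i(x) = f(x)$ for any vector $x$.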
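For completeness, the two elementary facts used in the proof of Lemma B.3 can be spelled out as follows; this is a short worked derivation added for clarity, with $a$ and $b$ denoting arbitrary vectors. The form of Young's inequality used above follows from expanding a squared norm:

```latex
0 \le \Big\|\tfrac{1}{2}a - b\Big\|^2
  = \tfrac{1}{4}\|a\|^2 - \langle a, b\rangle + \|b\|^2
\quad\Longrightarrow\quad
\langle a, b\rangle \le \tfrac{1}{4}\|a\|^2 + \|b\|^2 ,
\qquad a = \nabla f(\mu_t),\; b = \nabla f_i(\mu_t) - h_{i,t}^q .
```

The coefficient $-\frac{3Hn}{4}$ of $\mathbb{E}\|\nabla f(\mu_t)\|^2$ then arises from the last two terms inside the sum, using $\sum_{h=1}^{K}\Pr[H_t^i = h]\,h = \mathbb{E}[H_t^i] = H$ and $\sum_{i=1}^n \nabla f_i(\mu_t) = n\nabla f(\mu_t)$:

```latex
\sum_{i=1}^n H\Big(\tfrac{1}{4}\,\mathbb{E}\|\nabla f(\mu_t)\|^2
  - \mathbb{E}\langle\nabla f(\mu_t), \nabla f_i(\mu_t)\rangle\Big)
= \tfrac{Hn}{4}\,\mathbb{E}\|\nabla f(\mu_t)\|^2 - Hn\,\mathbb{E}\|\nabla f(\mu_t)\|^2
= -\tfrac{3Hn}{4}\,\mathbb{E}\|\nabla f(\mu_t)\|^2 .
```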

B.3 UPPER BOUNDING POTENTIAL FUNCTIONS

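We proceed by proving Lemma 4.4, which upper bounds the expected change in potential.

Lemma 4.4. For any time step $t$ we have:

$$\mathbb{E}[\Phi_{t+1}] \le \Big(1 - \frac{1}{4n}\Big)\mathbb{E}[\Phi_t] + 8s\eta^2\sum_{i=1}^n\mathbb{E}\|\tilde h_{i,t}\|^2 + 16n(R^2+7)^2\gamma^2 .$$

Proof. First, we bound the change in potential $\Delta_t = \Phi_{t+1} - \Phi_t$ for some fixed time step $t > 0$. For this, let $\Delta_t^S$ be the change in potential when the set $S$ of agents wakes up. For $i \in S$, define $S_t^i$ and $S_t$ as follows:

$$S_t^i = -\frac{s}{s+1}\,\eta\,\tilde h_{i,t} + \frac{Q(X_t) - X_t}{s+1}, \qquad
S_t = -\frac{1}{s+1}\,\eta\sum_{i\in S}\tilde h_{i,t} + \frac{1}{s+1}\sum_{i\in S}\Big(Q(X_t^i - \eta\tilde h_{i,t}) - (X_t^i - \eta\tilde h_{i,t})\Big) .$$

We have that:

$$X_{t+1}^i = \frac{sX_t^i + X_t}{s+1} + S_t^i, \qquad
X_{t+1} = \frac{\sum_{i\in S}X_t^i + X_t}{s+1} + S_t, \qquad
\mu_{t+1} = \mu_t + \frac{\sum_{j\in S}S_t^j + S_t}{n+1} .$$

This gives us that, for $i \in S$:

$$X_{t+1}^i - \mu_{t+1} = \frac{sX_t^i + X_t}{s+1} + S_t^i - \frac{\sum_{j\in S}S_t^j + S_t}{n+1} - \mu_t, \qquad
X_{t+1} - \mu_{t+1} = \frac{\sum_{i\in S}X_t^i + X_t}{s+1} + S_t - \frac{\sum_{j\in S}S_t^j + S_t}{n+1} - \mu_t .$$

For $k \notin S$ we get that

$$X_{t+1}^k - \mu_{t+1} = X_t^k - \frac{\sum_{j\in S}S_t^j + S_t}{n+1} - \mu_t .$$

Hence:

$$\begin{aligned}
\Delta_t^S &= \sum_{i\in S}\bigg(\Big\|\frac{sX_t^i + X_t}{s+1} + S_t^i - \frac{\sum_{j\in S}S_t^j + S_t}{n+1} - \mu_t\Big\|^2 - \|X_t^i - \mu_t\|^2\bigg) \\
&\quad + \Big\|\frac{\sum_{i\in S}X_t^i + X_t}{s+1} + S_t - \frac{\sum_{j\in S}S_t^j + S_t}{n+1} - \mu_t\Big\|^2 - \|X_t - \mu_t\|^2 \\
&\quad + \sum_{k\notin S}\bigg(\Big\|X_t^k - \frac{\sum_{j\in S}S_t^j + S_t}{n+1} - \mu_t\Big\|^2 - \|X_t^k - \mu_t\|^2\bigg) \\
&= \sum_{i\in S}\bigg(\Big\|\frac{sX_t^i + X_t}{s+1} - \mu_t\Big\|^2 + \Big\|S_t^i - \frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 \\
&\qquad\qquad + 2\Big\langle\frac{sX_t^i + X_t}{s+1} - \mu_t,\; S_t^i - \frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\rangle - \|X_t^i - \mu_t\|^2\bigg) \\
&\quad + \Big\|\frac{\sum_{i\in S}X_t^i + X_t}{s+1} - \mu_t\Big\|^2 + \Big\|S_t - \frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 \\
&\qquad\qquad + 2\Big\langle\frac{\sum_{i\in S}X_t^i + X_t}{s+1} - \mu_t,\; S_t - \frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\rangle - \|X_t - \mu_t\|^2 \\
&\quad + \sum_{k\notin S}2\Big\langle X_t^k - \mu_t,\; -\frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\rangle + \sum_{k\notin S}\Big\|\frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 .
\end{aligned}$$

Observe that:

$$\sum_{k=0}^{n}\Big\langle X_t^k - \mu_t,\; -\frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\rangle = 0 .$$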
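After combining the above two equations, we get that:

$$\begin{aligned}
\Delta_t^S &= \sum_{i\in S}\bigg(\Big\|\frac{s(X_t^i - \mu_t) + (X_t - \mu_t)}{s+1}\Big\|^2 - \frac{s}{s+1}\|X_t^i - \mu_t\|^2 - \frac{1}{s+1}\|X_t - \mu_t\|^2\bigg) \\
&\quad + \Big\|\frac{\sum_{i\in S}(X_t^i - \mu_t) + (X_t - \mu_t)}{s+1}\Big\|^2 - \sum_{i\in S}\frac{1}{s+1}\|X_t^i - \mu_t\|^2 - \frac{1}{s+1}\|X_t - \mu_t\|^2 \\
&\quad + \sum_{i\in S}\bigg(\Big\|S_t^i - \frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 + 2\Big\langle\frac{sX_t^i + X_t}{s+1} - \mu_t,\; S_t^i\Big\rangle\bigg) \\
&\quad + \Big\|S_t - \frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 + 2\Big\langle\frac{\sum_{i\in S}X_t^i + X_t}{s+1} - \mu_t,\; S_t\Big\rangle + \sum_{k\notin S}\Big\|\frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 .
\end{aligned}$$

By simplifying the above, we get:

$$\begin{aligned}
\Delta_t^S &= -\frac{s}{(s+1)^2}\sum_{i\in S}\|X_t^i - X_t\|^2 - \frac{1}{(s+1)^2}\sum_{i\in S}\|X_t^i - X_t\|^2 - \frac{1}{(s+1)^2}\sum_{i,j\in S}\|X_t^i - X_t^j\|^2 \\
&\quad + \sum_{i\in S}\Big\|S_t^i - \frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 + \frac{2s}{s+1}\sum_{i\in S}\langle X_t^i - \mu_t, S_t^i\rangle + \frac{2}{s+1}\sum_{i\in S}\langle X_t - \mu_t, S_t^i\rangle \\
&\quad + \Big\|S_t - \frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 + \frac{2}{s+1}\sum_{i\in S}\langle X_t^i - \mu_t, S_t\rangle + \frac{2}{s+1}\langle X_t - \mu_t, S_t\rangle \\
&\quad + \sum_{k\notin S}\Big\|\frac{\sum_{j\in S}S_t^j + S_t}{n+1}\Big\|^2 .
\end{aligned}$$

Let $\alpha$ be a parameter that we will fix later:

$$\langle X_t^i - \mu_t, S_t^i\rangle \overset{\text{Young}}{\le} \alpha\|X_t^i - \mu_t\|^2 + \frac{\|S_t^i\|^2}{4\alpha} .$$

Finally, we get that

$$\begin{aligned}
\Delta_t^S &\le -\frac{1}{s+1}\sum_{i\in S}\|X_t^i - X_t\|^2 + 2\sum_{i\in S}\|S_t^i\|^2 + \frac{2s(s+1)}{(n+1)^2}\sum_{j\in S}\|S_t^j\|^2 + \frac{2s(s+1)}{(n+1)^2}\|S_t\|^2 \\
&\quad + \sum_{i\in S}\frac{2s\alpha}{s+1}\|X_t^i - \mu_t\|^2 + \sum_{i\in S}\frac{s\|S_t^i\|^2}{2\alpha(s+1)} + \sum_{i\in S}\frac{2\alpha}{s+1}\|X_t - \mu_t\|^2 + \sum_{i\in S}\frac{\|S_t^i\|^2}{2\alpha(s+1)} \\
&\quad + 2\|S_t\|^2 + \frac{2(s+1)}{(n+1)^2}\sum_{j\in S}\|S_t^j\|^2 + \frac{2(s+1)}{(n+1)^2}\|S_t\|^2 \\
&\quad + \sum_{i\in S}\frac{2\alpha}{s+1}\|X_t^i - \mu_t\|^2 + \sum_{i\in S}\frac{\|S_t\|^2}{2\alpha(s+1)} + \frac{2\alpha}{s+1}\|X_t - \mu_t\|^2 + \frac{\|S_t\|^2}{2\alpha(s+1)} \\
&\quad + \sum_{j\in S}\frac{(n-s)(s+1)}{(n+1)^2}\|S_t^j\|^2 + \frac{(n-s)(s+1)}{(n+1)^2}\|S_t\|^2 \\
&= -\frac{1}{s+1}\sum_{i\in S}\|X_t^i - X_t\|^2 + \Big(2 + \frac{2(s+1)^2}{(n+1)^2} + \frac{1}{2\alpha} + \frac{(n-s)(s+1)}{(n+1)^2}\Big)\sum_{j\in S}\|S_t^j\|^2 \\
&\quad + \Big(2 + \frac{2(s+1)^2}{(n+1)^2} + \frac{1}{2\alpha} + \frac{(n-s)(s+1)}{(n+1)^2}\Big)\|S_t\|^2 + \sum_{i\in S}2\alpha\|X_t^i - \mu_t\|^2 + 2\alpha\|X_t - \mu_t\|^2 \\
&\le -\frac{1}{s+1}\sum_{i\in S}\|X_t^i - X_t\|^2 + \Big(4 + \frac{1}{2\alpha}\Big)\sum_{i\in S}\|S_t^i\|^2 + \Big(4 + \frac{1}{2\alpha}\Big)\|S_t\|^2 + \sum_{i\in S}2\alpha\|X_t^i - \mu_t\|^2 + 2\alpha\|X_t - \mu_t\|^2 .
\end{aligned}$$

Using the definitions of $S_t^i$ and $S_t$, the Cauchy-Schwarz inequality, and the properties of quantization, we get that

$$\|S_t^i\|^2 \le \frac{2s^2}{(s+1)^2}\,\eta^2\|\tilde h_{i,t}\|^2 + \frac{2(R^2+7)^2\gamma^2}{(s+1)^2} .$$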
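$$\|S_t\|^2 \le \frac{2s}{(s+1)^2}\,\eta^2\sum_{i\in S}\|\tilde h_{i,t}\|^2 + \frac{2s^2(R^2+7)^2\gamma^2}{(s+1)^2} .$$

Next, we plug this into the previous inequality:

$$\begin{aligned}
\Delta_t^S &\le -\frac{1}{s+1}\sum_{i\in S}\|X_t^i - X_t\|^2 + \sum_{i\in S}2\alpha\|X_t^i - \mu_t\|^2 + 2\alpha\|X_t - \mu_t\|^2 \\
&\quad + \Big(4 + \frac{1}{2\alpha}\Big)\bigg(\frac{2s^2 + 2s}{(s+1)^2}\,\eta^2\sum_{i\in S}\|\tilde h_{i,t}\|^2 + \frac{(2s^2 + 2s)(R^2+7)^2\gamma^2}{(s+1)^2}\bigg) \\
&\le -\frac{1}{s+1}\sum_{i\in S}\|X_t^i - X_t\|^2 + \sum_{i\in S}2\alpha\|X_t^i - \mu_t\|^2 + 2\alpha\|X_t - \mu_t\|^2 \\
&\quad + \Big(4 + \frac{1}{2\alpha}\Big)\Big(\eta^2\sum_{i\in S}\|\tilde h_{i,t}\|^2 + 2(R^2+7)^2\gamma^2\Big) .
\end{aligned}$$

Next, we calculate the probability of choosing the set $S$ and upper bound $\Delta_t$ in expectation. For this, we define $\mathbb{E}_t$ as the expectation conditioned on the entire history up to and including step $t$:

$$\begin{aligned}
\mathbb{E}_t[\Delta_t] &= \sum_{S}\frac{1}{\binom{n}{s}}\,\mathbb{E}_t[\Delta_t^S] \\
&\le \sum_{S}\frac{1}{\binom{n}{s}}\bigg(-\frac{1}{s+1}\sum_{i\in S}\|X_t^i - X_t\|^2 + \sum_{i\in S}2\alpha\|X_t^i - \mu_t\|^2 + 2\alpha\|X_t - \mu_t\|^2 \\
&\qquad\qquad + \Big(4 + \frac{1}{2\alpha}\Big)\Big(\eta^2\sum_{i\in S}\|\tilde h_{i,t}\|^2 + 2(R^2+7)^2\gamma^2\Big)\bigg) \\
&= -\frac{\binom{n-1}{s-1}}{(s+1)\binom{n}{s}}\sum_{i}\|X_t^i - X_t\|^2 + \sum_{i}2\alpha\,\frac{\binom{n-1}{s-1}}{\binom{n}{s}}\|X_t^i - \mu_t\|^2 + 2\alpha\|X_t - \mu_t\|^2 \\
&\qquad\qquad + \Big(4 + \frac{1}{2\alpha}\Big)\bigg(\eta^2\,\frac{\binom{n-1}{s-1}}{\binom{n}{s}}\sum_{i}\|\tilde h_{i,t}\|^2 + 2(R^2+7)^2\gamma^2\bigg) \\
&\le -\sum_{i}\frac{s\|X_t^i - \mu_t\|^2}{(s+1)n} + \sum_{i}\frac{2s\alpha}{n}\|X_t^i - \mu_t\|^2 + 2\alpha\|X_t - \mu_t\|^2 + \Big(8 + \frac{1}{\alpha}\Big)(R^2+7)^2\gamma^2 \\
&\qquad\qquad + \sum_{i}\frac{s}{n}\Big(4 + \frac{1}{2\alpha}\Big)\eta^2\,\mathbb{E}_t\|\tilde h_{i,t}\|^2 \\
&\le \Big(-\frac{s}{(s+1)n} + 2\alpha\Big)\Phi_t + \Big(8 + \frac{1}{\alpha}\Big)(R^2+7)^2\gamma^2 + \sum_{i}\frac{s}{n}\Big(4 + \frac{1}{2\alpha}\Big)\eta^2\,\mathbb{E}_t\|\tilde h_{i,t}\|^2 .
\end{aligned}$$

By setting $\alpha = \frac{3s-1}{n(8s+8)} \ge \frac{1}{8n}$, we get that:

$$\mathbb{E}_t[\Delta_t] \le -\frac{1}{4n}\Phi_t + 16n(R^2+7)^2\gamma^2 + \sum_{i}8s\eta^2\,\mathbb{E}_t\|\tilde h_{i,t}\|^2 .$$

Next, we remove the conditioning, and use the definitions of $\Delta_t$ and $S_t^i$ (for $S_t^i$ we also use the upper bound that comes from the properties of quantization):

$$\mathbb{E}[\mathbb{E}_t[\Phi_{t+1}]] = \mathbb{E}[\Delta_t + \Phi_t] \le \Big(1 - \frac{1}{4n}\Big)\mathbb{E}[\Phi_t] + 16n(R^2+7)^2\gamma^2 + 8s\eta^2\sum_{i}\mathbb{E}\|\tilde h_{i,t}\|^2 .$$

Lemma B.4. For any time step $t$ we have:

$$\mathbb{E}[\Phi_{t+1}] \le \Big(1 - \frac{1}{5n}\Big)\mathbb{E}[\Phi_t] + 16n(R^2+7)^2\gamma^2 + 16nsK\eta^2(\sigma^2 + 2KG^2) + 32B^2nsK^2\eta^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2 .$$

Proof.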
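By combining Lemmas 4.4 and 4.6, we have:

$$\begin{aligned}
\mathbb{E}[\Phi_{t+1}] &\le \Big(1 - \frac{1}{4n}\Big)\mathbb{E}[\Phi_t] + 16n(R^2+7)^2\gamma^2 + 8s\eta^2\Big(2nK(\sigma^2 + 2KG^2) + 8L^2K^2\,\mathbb{E}[\Phi_t] + 4nK^2B^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2\Big) \\
&= \Big(1 - \frac{1}{4n} + 64sL^2K^2\eta^2\Big)\mathbb{E}[\Phi_t] + 16n(R^2+7)^2\gamma^2 + 16nsK\eta^2(\sigma^2 + 2KG^2) + 32B^2nsK^2\eta^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le \Big(1 - \frac{1}{5n}\Big)\mathbb{E}[\Phi_t] + 16n(R^2+7)^2\gamma^2 + 16nsK\eta^2(\sigma^2 + 2KG^2) + 32B^2nsK^2\eta^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{aligned}$$

Lemma B.5. For the sum of potential functions over all $T$ steps we have:

$$\sum_{t=0}^{T}\mathbb{E}[\Phi_t] \le 80Tn^2(R^2+7)^2\gamma^2 + 80Tn^2sK\eta^2(\sigma^2 + 2KG^2) + 160B^2n^2sK^2\eta^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 .$$

Proof.

$$\begin{aligned}
\sum_{t=0}^{T-1}\mathbb{E}[\Phi_{t+1}] &\le \sum_{t=0}^{T-1}\bigg(\Big(1 - \frac{1}{5n}\Big)\mathbb{E}[\Phi_t] + 16n(R^2+7)^2\gamma^2 + 16nsK\eta^2(\sigma^2 + 2KG^2) + 32B^2nsK^2\eta^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2\bigg) \\
&\le \Big(1 - \frac{1}{5n}\Big)\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + 16Tn(R^2+7)^2\gamma^2 + 16TnsK\eta^2(\sigma^2 + 2KG^2) + 32B^2nsK^2\eta^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{aligned}$$

Rearranging, we obtain

$$\begin{aligned}
\sum_{t=0}^{T}\mathbb{E}[\Phi_t] &\le 5n\bigg(16Tn(R^2+7)^2\gamma^2 + 16TnsK\eta^2(\sigma^2 + 2KG^2) + 32B^2nsK^2\eta^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2\bigg) \\
&= 80Tn^2(R^2+7)^2\gamma^2 + 80Tn^2sK\eta^2(\sigma^2 + 2KG^2) + 160B^2n^2sK^2\eta^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{aligned}$$

Lemma 4.5. For any step $t$,

$$\mathbb{E}\|\mu_{t+1} - \mu_t\|^2 \le \frac{2s^2\eta^2}{n(n+1)^2}\sum_{i}\mathbb{E}\|\tilde h_{i,t}\|^2 + \frac{2}{(n+1)^2}(R^2+7)^2\gamma^2 .$$

Proof.

$$\begin{aligned}
\mathbb{E}\|\mu_{t+1} - \mu_t\|^2 &\le \sum_{S}\frac{1}{\binom{n}{s}(n+1)^2}\,\mathbb{E}\bigg\|-\eta\sum_{i\in S}\tilde h_{i,t} + \frac{Q(X_t) - X_t}{s+1} + \frac{1}{s+1}\sum_{i\in S}\Big(Q(X_t^i - \eta\tilde h_{i,t}) - (X_t^i - \eta\tilde h_{i,t})\Big)\bigg\|^2 \\
&\le \sum_{S}\frac{1}{\binom{n}{s}(n+1)^2}\bigg(2s\eta^2\sum_{i\in S}\mathbb{E}\|\tilde h_{i,t}\|^2 + \frac{2}{s+1}\,\mathbb{E}\|Q(X_t) - X_t\|^2 \\
&\qquad\qquad + \frac{2}{s+1}\sum_{i\in S}\mathbb{E}\Big\|Q(X_t^i - \eta\tilde h_{i,t}) - (X_t^i - \eta\tilde h_{i,t})\Big\|^2\bigg) \\
&\le \sum_{S}\frac{1}{\binom{n}{s}(n+1)^2}\bigg(2s\eta^2\sum_{i\in S}\mathbb{E}\|\tilde h_{i,t}\|^2 + 2(R^2+7)^2\gamma^2\bigg) \\
&= \sum_{i}\frac{2s\eta^2\binom{n-1}{s-1}}{\binom{n}{s}(n+1)^2}\,\mathbb{E}\|\tilde h_{i,t}\|^2 + \frac{2}{(n+1)^2}(R^2+7)^2\gamma^2 \\
&= \sum_{i}\frac{2s^2\eta^2}{n(n+1)^2}\,\mathbb{E}\|\tilde h_{i,t}\|^2 + \frac{2}{(n+1)^2}(R^2+7)^2\gamma^2 .
\end{aligned}$$

By plugging Lemma 4.6 into the above upper bound, we get the following.

Lemma B.6. For any step $t$,

$$\mathbb{E}\|\mu_{t+1} - \mu_t\|^2 \le \frac{4s^2K\eta^2(\sigma^2 + 2KG^2)}{(n+1)^2} + \frac{16s^2L^2K^2\eta^2\,\mathbb{E}[\Phi_t]}{n(n+1)^2} + \frac{8B^2s^2K^2\eta^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2}{(n+1)^2} + \frac{2(R^2+7)^2\gamma^2}{(n+1)^2} .$$

Proof.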
$$\begin{aligned}
\mathbb{E}\|\mu_{t+1} - \mu_t\|^2 &\le \sum_{i}\frac{2s^2\eta^2}{n(n+1)^2}\,\mathbb{E}\|\tilde h_{i,t}\|^2 + \frac{2}{(n+1)^2}(R^2+7)^2\gamma^2 \\
&\le \frac{2s^2\eta^2}{n(n+1)^2}\Big(2nK(\sigma^2 + 2KG^2) + 8L^2K^2\,\mathbb{E}[\Phi_t] + 4nK^2B^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2\Big) + \frac{2(R^2+7)^2\gamma^2}{(n+1)^2} \\
&= \frac{4s^2K\eta^2}{(n+1)^2}(\sigma^2 + 2KG^2) + \frac{16s^2L^2K^2\eta^2}{n(n+1)^2}\,\mathbb{E}[\Phi_t] + \frac{8B^2s^2K^2\eta^2}{(n+1)^2}\,\mathbb{E}\|\nabla f(\mu_t)\|^2 + \frac{2(R^2+7)^2\gamma^2}{(n+1)^2} .
\end{aligned}$$

B.4 CONVERGENCE

Theorem B.7. For learning rate $\eta = \frac{n+1}{sH\sqrt{T}}$, Algorithm 1 converges at rate:

$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 &\le \frac{2(f(\mu_0) - f^*)}{\sqrt{T}} + \frac{800nKL^2(R^2+7)^2\gamma^2}{H} + \frac{6KL(\sigma^2 + 2KG^2)}{H^2\sqrt{T}} \\
&\quad + \frac{808n(n+1)^2K^2L^2}{sH^3T}(\sigma^2 + 2KG^2) + \frac{2(R^2+7)^2\gamma^2L\sqrt{T}}{(n+1)^2sH} .
\end{aligned}$$

Proof. Let $\mathbb{E}_t$ denote expectation conditioned on the entire history up to and including step $t$. By $L$-smoothness we have that

$$\mathbb{E}_t[f(\mu_{t+1})] \le f(\mu_t) + \mathbb{E}_t\langle\nabla f(\mu_t), \mu_{t+1} - \mu_t\rangle + \frac{L}{2}\,\mathbb{E}_t\|\mu_{t+1} - \mu_t\|^2 . \tag{5}$$

First, we look at $\mathbb{E}_t\langle\nabla f(\mu_t), \mu_{t+1} - \mu_t\rangle = \langle\nabla f(\mu_t), \mathbb{E}_t[\mu_{t+1} - \mu_t]\rangle$. If the set $S$ is chosen at step $t+1$, we have that

$$\mu_{t+1} - \mu_t = \frac{1}{n+1}\bigg(-\eta\sum_{i\in S}\tilde h_{i,t} + \frac{Q(X_t) - X_t}{s+1} + \frac{1}{s+1}\sum_{i\in S}\Big(Q(X_t^i - \eta\tilde h_{i,t}) - (X_t^i - \eta\tilde h_{i,t})\Big)\bigg) .$$

Thus, in this case:

$$\mathbb{E}_t[\mu_{t+1} - \mu_t] = -\frac{\eta}{n+1}\sum_{i\in S}h_{i,t} ,$$

where we used the unbiasedness of quantization and of the stochastic gradients. We would like to note that, even though we condition on the entire history up to and including step $t$, and this includes conditioning on $X_t^i$, the algorithm has not yet used $\tilde h_{i,t}$ (it does not count towards the computation of $\mu_t$); thus, we can safely use all properties of stochastic gradients. Hence, we can proceed by taking into account that each set of agents $S$ is chosen as initiator with probability $1/\binom{n}{s}$:

$$\mathbb{E}_t[\mu_{t+1} - \mu_t] = \sum_{S}\frac{1}{\binom{n}{s}}\Big(-\frac{\eta}{n+1}\Big)\sum_{i\in S}h_{i,t} = -\frac{s\eta}{n(n+1)}\sum_{i=1}^{n}h_{i,t} ,$$

and subsequently

$$\mathbb{E}_t\langle\nabla f(\mu_t), \mu_{t+1} - \mu_t\rangle = \sum_{i=1}^{n}\frac{s\eta}{n(n+1)}\,\mathbb{E}_t\langle\nabla f(\mu_t), -h_{i,t}\rangle .$$

Hence, we can rewrite (5) as:

$$\mathbb{E}_t[f(\mu_{t+1})] \le f(\mu_t) + \sum_{i=1}^{n}\frac{s\eta}{n(n+1)}\,\mathbb{E}_t\langle\nabla f(\mu_t), -h_{i,t}\rangle + \frac{L}{2}\,\mathbb{E}_t\|\mu_{t+1} - \mu_t\|^2 .$$
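Next, we remove the conditioning:

$$\mathbb{E}[f(\mu_{t+1})] = \mathbb{E}[\mathbb{E}_t[f(\mu_{t+1})]] \le \mathbb{E}[f(\mu_t)] + \sum_{i=1}^{n}\frac{s\eta}{n(n+1)}\,\mathbb{E}\langle\nabla f(\mu_t), -h_{i,t}\rangle + \frac{L}{2}\,\mathbb{E}\|\mu_{t+1} - \mu_t\|^2 .$$

This allows us to use Lemmas B.6 and B.3:

$$\begin{aligned}
\mathbb{E}[f(\mu_{t+1})] - \mathbb{E}[f(\mu_t)] &\le \frac{s\eta}{n(n+1)}\bigg(4KL^2\,\mathbb{E}[\Phi_t] + \Big(-\frac{3Hn}{4} + 8B^2L^2\eta^2K^3n\Big)\mathbb{E}\|\nabla f(\mu_t)\|^2 + 4nL^2\eta^2K^3(\sigma^2 + 2G^2)\bigg) \\
&\quad + \frac{L}{2}\bigg(\frac{4s^2K\eta^2(\sigma^2 + 2KG^2)}{(n+1)^2} + \frac{16s^2L^2K^2\eta^2\,\mathbb{E}[\Phi_t]}{n(n+1)^2} + \frac{8B^2s^2K^2\eta^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2}{(n+1)^2} + \frac{2(R^2+7)^2\gamma^2}{(n+1)^2}\bigg) \\
&= \Big(\frac{4\eta sKL^2}{n(n+1)} + \frac{8s^2K^2L^3\eta^2}{n(n+1)^2}\Big)\mathbb{E}[\Phi_t] + \Big(\frac{4sL^2\eta^3K^3}{n+1} + \frac{2s^2K\eta^2L}{(n+1)^2}\Big)(\sigma^2 + 2KG^2) + \frac{(R^2+7)^2\gamma^2L}{(n+1)^2} \\
&\quad + \Big(-\frac{3\eta sH}{4(n+1)} + \frac{8B^2L^2\eta^3sK^3}{n+1} + \frac{4B^2s^2K^2L\eta^2}{(n+1)^2}\Big)\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{aligned}$$

By simplifying the above inequality, we get:

$$\begin{aligned}
\mathbb{E}[f(\mu_{t+1})] - \mathbb{E}[f(\mu_t)] &\le \frac{5\eta sKL^2\,\mathbb{E}[\Phi_t]}{n(n+1)} + \Big(\frac{4sL^2\eta^3K^3}{n+1} + \frac{2s^2K\eta^2L}{(n+1)^2}\Big)(\sigma^2 + 2KG^2) + \frac{(R^2+7)^2\gamma^2L}{(n+1)^2} \\
&\quad + \Big(-\frac{3\eta sH}{4(n+1)} + \frac{8B^2L^2\eta^3sK^3}{n+1} + \frac{4B^2s^2K^2L\eta^2}{(n+1)^2}\Big)\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{aligned}$$

By summing the above inequality for $t = 0$ to $t = T-1$, we get that

$$\begin{aligned}
\mathbb{E}[f(\mu_T)] - f(\mu_0) &\le \frac{5\eta sKL^2}{n(n+1)}\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + T\Big(\frac{4sL^2\eta^3K^3}{n+1} + \frac{2s^2K\eta^2L}{(n+1)^2}\Big)(\sigma^2 + 2KG^2) \\
&\quad + \Big(-\frac{3\eta sH}{4(n+1)} + \frac{8B^2L^2\eta^3sK^3}{n+1} + \frac{4B^2s^2K^2L\eta^2}{(n+1)^2}\Big)\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 + \frac{(R^2+7)^2\gamma^2LT}{(n+1)^2} .
\end{aligned}$$

Further, we use Lemma B.5:

$$\begin{aligned}
\mathbb{E}[f(\mu_T)] - f(\mu_0) &\le \frac{5\eta sKL^2}{n(n+1)}\bigg(80Tn^2(R^2+7)^2\gamma^2 + 80Tn^2sK\eta^2(\sigma^2 + 2KG^2) + 160B^2n^2sK^2\eta^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2\bigg) \\
&\quad + T\Big(\frac{4sL^2\eta^3K^3}{n+1} + \frac{2s^2K\eta^2L}{(n+1)^2}\Big)(\sigma^2 + 2KG^2) + \frac{(R^2+7)^2\gamma^2LT}{(n+1)^2} \\
&\quad + \Big(-\frac{3\eta sH}{4(n+1)} + \frac{8B^2L^2\eta^3sK^3}{n+1} + \frac{4B^2s^2K^2L\eta^2}{(n+1)^2}\Big)\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le \frac{400\eta snKL^2T(R^2+7)^2\gamma^2}{n+1} + \frac{404Tns^2K^2L^2\eta^3(\sigma^2 + 2KG^2)}{n+1} + \frac{3s^2K\eta^2LT(\sigma^2 + 2KG^2)}{(n+1)^2} + \frac{(R^2+7)^2\gamma^2LT}{(n+1)^2} \\
&\quad + \Big(-\frac{3\eta sH}{4(n+1)} + \frac{8B^2L^2\eta^3sK^3}{n+1} + \frac{4B^2s^2K^2L\eta^2}{(n+1)^2} + \frac{800B^2ns^2K^3\eta^3L^2}{n+1}\Big)\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{aligned}$$

By assuming $\eta < \frac{1}{100B\sqrt{ns}K^2L}$, we get:

$$\begin{aligned}
\mathbb{E}[f(\mu_T)] - f(\mu_0) &\le \frac{400\eta snKL^2T(R^2+7)^2\gamma^2}{n+1} + \Big(\frac{3s^2K\eta^2LT}{(n+1)^2} + \frac{404Tns^2K^2L^2\eta^3}{n+1}\Big)(\sigma^2 + 2KG^2) \\
&\quad + \frac{(R^2+7)^2\gamma^2LT}{(n+1)^2} - \frac{\eta sH}{2(n+1)}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 .
\end{aligned}$$

Next, we regroup terms, multiply both sides by $\frac{2(n+1)}{\eta sHT}$, and use the fact that $f(\mu_T) \ge f^*$:

$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 &\le \frac{2(n+1)(f(\mu_0) - f^*)}{sH\eta T} + \frac{800nKL^2(R^2+7)^2\gamma^2}{H} \\
&\quad + \Big(\frac{6sK\eta L}{H(n+1)} + \frac{808nsK^2L^2\eta^2}{H}\Big)(\sigma^2 + 2KG^2) + \frac{2(R^2+7)^2\gamma^2L}{(n+1)sH\eta} .
\end{aligned}$$

Finally, we set $\eta = \frac{n+1}{sH\sqrt{T}}$:

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{2(f(\mu_0) - f^*)}{\sqrt{T}} + \frac{800nKL^2(R^2+7)^2\gamma^2}{H} + \frac{6KL(\sigma^2 + 2KG^2)}{H^2\sqrt{T}} \tag{6}$$

$$\qquad\qquad + \frac{808n(n+1)^2K^2L^2}{sH^3T}(\sigma^2 + 2KG^2) + \frac{2(R^2+7)^2\gamma^2L\sqrt{T}}{(n+1)^2sH} . \tag{7}$$

Lemma B.8. For quantization parameters $(R^2+7)^2\gamma^2 = \frac{(n+1)^2}{s^2H^2T}\Big(\sigma^2 + 2KG^2 + \frac{f(\mu_0) - f^*}{L}\Big)$ we have:

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{5(f(\mu_0) - f^*)}{\sqrt{T}} + \frac{8KL(\sigma^2 + 2KG^2)}{H^2\sqrt{T}} + \frac{1608n(n+1)^2K^2L^2(\sigma^2 + 2KG^2)}{sH^3T} + \frac{800n(n+1)^2KL(f(\mu_0) - f^*)}{s^2H^3T} .$$

Proof.

$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 &\le \frac{2(f(\mu_0) - f^*)}{\sqrt{T}} + \frac{800nKL^2(R^2+7)^2\gamma^2}{H} + \frac{6KL(\sigma^2 + 2KG^2)}{H^2\sqrt{T}} \\
&\quad + \frac{808n(n+1)^2K^2L^2}{sH^3T}(\sigma^2 + 2KG^2) + \frac{2(R^2+7)^2\gamma^2L\sqrt{T}}{(n+1)^2sH} \\
&= \frac{2(f(\mu_0) - f^*)}{\sqrt{T}} + \frac{800nKL^2(n+1)^2}{s^2H^3T}\Big(\sigma^2 + 2KG^2 + \frac{f(\mu_0) - f^*}{L}\Big) + \frac{6KL(\sigma^2 + 2KG^2)}{H^2\sqrt{T}} \\
&\quad + \frac{808n(n+1)^2K^2L^2}{sH^3T}(\sigma^2 + 2KG^2) + \frac{2L}{s^2H^2\sqrt{T}}\Big(\sigma^2 + 2KG^2 + \frac{f(\mu_0) - f^*}{L}\Big) \\
&\le \frac{5(f(\mu_0) - f^*)}{\sqrt{T}} + \frac{8KL(\sigma^2 + 2KG^2)}{H^2\sqrt{T}} + \frac{1608n(n+1)^2K^2L^2(\sigma^2 + 2KG^2)}{sH^3T} + \frac{800n(n+1)^2KL(f(\mu_0) - f^*)}{s^2H^3T} .
\end{aligned}$$

Lemma B.9. We have:

$$5s\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + 3\eta^2\sum_{t=0}^{T-1}\sum_{i}\mathbb{E}\|\tilde h_{i,t}\|^2 \le 1000Tn^3s(R^2+7)^2\gamma^2 + 10000B^2n^3s^2K^2LT(R^2+7)^2\gamma^2 .$$

Proof.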
\begin{align*}
5s\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] &+ 3\eta^2\sum_{t=0}^{T-1}\sum_i \mathbb{E}\|h_{i,t}\|^2 \\
&\le 5s\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + 3\eta^2\sum_{t=0}^{T-1}\Big( 2nK(\sigma^2+2KG^2) + 8L^2K^2\,\mathbb{E}[\Phi_t] + 4nK^2B^2\,\mathbb{E}\|\nabla f(\mu_t)\|^2 \Big) \\
&\le 5s\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + 6nT\eta^2K(\sigma^2+2KG^2) + 24\eta^2L^2K^2\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + 12nB^2\eta^2K^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le 6s\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + 6nT\eta^2K(\sigma^2+2KG^2) + 12B^2n\eta^2K^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le 6s\Big( 80Tn^2(R^2+7)^2\gamma^2 + 80Tn^2sK\eta^2(\sigma^2+2KG^2) + 160B^2n^2sK^2\eta^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \Big) \\
&\quad + 6nT\eta^2K(\sigma^2+2KG^2) + 12B^2n\eta^2K^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le 480Tn^2s(R^2+7)^2\gamma^2 + \big(480Tn^2s^2K\eta^2 + 6nT\eta^2K\big)(\sigma^2+2KG^2) \\
&\quad + \big(960n^2s^2K^2B^2\eta^2 + 12B^2n\eta^2K^2\big)\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le 480Tn^2s(R^2+7)^2\gamma^2 + 486Tn^2s^2K\eta^2(\sigma^2+2KG^2) + 1000B^2n^2s^2K^2\eta^2\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le 480Tn^2s(R^2+7)^2\gamma^2 + 486Tn^2s^2K\eta^2(\sigma^2+2KG^2) \\
&\quad + 1000B^2n^2s^2K^2\eta^2\Big( \frac{2(n+1)(f(\mu_0)-f^*)}{sH\eta} + \frac{800TnKL^2(R^2+7)^2\gamma^2}{H} \\
&\qquad\quad + \Big( \frac{6TsK\eta L}{H(n+1)} + \frac{808TnsK^2L^2\eta^2}{H} \Big)(\sigma^2+2KG^2) + \frac{2T(R^2+7)^2\gamma^2L}{(n+1)sH\eta} \Big) \\
&\le 480Tn^2s(R^2+7)^2\gamma^2 + 486Tn^2s^2K\eta^2(\sigma^2+2KG^2) + \frac{2000B^2n^2(n+1)sK^2\eta(f(\mu_0)-f^*)}{H} \\
&\quad + \frac{800000\,TB^2n^3s^2K^3\eta^2L^2(R^2+7)^2\gamma^2}{H} + \Big( \frac{6000\,TB^2n^2s^3K^3\eta^3L}{H(n+1)} + \frac{808000\,TB^2n^3s^3K^4\eta^4L^2}{H} \Big)(\sigma^2+2KG^2) \\
&\quad + \frac{4000\,TB^2nsK^2\eta L(R^2+7)^2\gamma^2}{H} \\
&\le 1000Tn^3s(R^2+7)^2\gamma^2 + \frac{4000B^2n^3sK^2(n+1)}{\sqrt{T}}(f(\mu_0)-f^*) + \frac{10000\,Tn^3(n+1)^2s^2K}{T}(\sigma^2+2KG^2) \\
&\le 1000Tn^3s(R^2+7)^2\gamma^2 + 10000B^2n^3s^2K^2LT(R^2+7)^2\gamma^2 .
\end{align*}

Lemma 4.8. Let $T \ge \Omega(n^3)$. Then for quantization parameters $R = 2 + T^{3/d}$ and $\gamma^2 = \frac{(n+1)^2}{s^2H^2T(R^2+7)^2}\big(\sigma^2 + 2KG^2 + \frac{f(\mu_0)-f^*}{L}\big)$, the probability of quantization never failing during the entire run of Algorithm 1 is at least $1 - O\big(\frac{1}{T}\big)$.

Proof. Let $\mathcal{L}_t$ be the event that quantization does not fail during step $t$. Our goal is to show that $\Pr\big[\bigcap_{t=1}^{T}\mathcal{L}_t\big] \ge 1 - O\big(\frac{1}{T}\big)$.
In order to do this, we first prove that $\Pr[\neg\mathcal{L}_{t+1} \mid \mathcal{L}_1, \mathcal{L}_2, \ldots, \mathcal{L}_t] \le O\big(\frac{1}{T^2}\big)$ ($O$ is with respect to $T$ here). We need to lower bound the probability that, for all $i \in S$:

\begin{align*}
\|X_t - X_t^i\|^2 &\le \big(R^{R^d}\gamma\big)^2 \tag{8} \\
\|X_t - (X_t^i - \eta h_{i,t})\|^2 &\le \big(R^{R^d}\gamma\big)^2 \tag{9} \\
\|X_t - X_t^i\|^2 &= O\Big(\frac{\gamma^2(\mathrm{poly}(T))^2}{R^2}\Big) \tag{10} \\
\|X_t - (X_t^i - \eta h_{i,t})\|^2 &= O\Big(\frac{\gamma^2(\mathrm{poly}(T))^2}{R^2}\Big)
\end{align*}

We would like to point out that these conditions are necessary for decoding to succeed; we ignore encoding, since it is accounted for when a node tries to decode the message. Since $R = 2 + T^{3/d}$, we have $\big(R^{R^d}\big)^2 \ge 2^{2T^3} \ge T^{30}$ for large enough $T$. Hence, it suffices to upper bound the probability that

\begin{align*}
\sum_{i\in S}\|X_t - X_t^i\|^2 + \sum_{i\in S}\|X_t - (X_t^i - \eta h_{i,t})\|^2 \ge T^{30}\gamma^2 .
\end{align*}

To prove this, we have:

\begin{align*}
\sum_{i\in S}\|X_t - X_t^i\|^2 + \sum_{i\in S}\|X_t - (X_t^i - \eta h_{i,t})\|^2
&\le \sum_{i\in S}\big( 5\|X_t - \mu_t\|^2 + 5\|\mu_t - X_t^i\|^2 + 3\eta^2\|h_{i,t}\|^2 \big) \\
&\le 5s\Phi_t + 3\eta^2\sum_i \|h_{i,t}\|^2 .
\end{align*}

Now, we use Markov's inequality and Lemma B.9:

\begin{align*}
\Pr\Big[ 5s\Phi_t + 3\eta^2\sum_i\|h_{i,t}\|^2 \ge T^{30}\gamma^2 \,\Big|\, \mathcal{L}_1, \mathcal{L}_2, \ldots, \mathcal{L}_t \Big]
&\le \frac{\mathbb{E}\big[ 5s\Phi_t + 3\eta^2\sum_i\|h_{i,t}\|^2 \,\big|\, \mathcal{L}_1, \mathcal{L}_2, \ldots, \mathcal{L}_t \big]}{T^{30}\gamma^2} \\
&\le \frac{1000Tn^3s(R^2+7)^2\gamma^2 + 10000B^2n^3s^2K^2LT(R^2+7)^2\gamma^2}{T^{30}\gamma^2} \\
&\le O\Big(\frac{1}{T^2}\Big) .
\end{align*}

Thus, the failure probability due to the models not being close enough for quantization to be applied is at most $O\big(\frac{1}{T^2}\big)$. Conditioned on the event that $\|X_t - X_t^i\|$ and $\|X_t - (X_t^i - \eta h_{i,t})\|$ are upper bounded by $T^{15}\gamma$ (this is the event whose probability we lower bounded using Markov's inequality), the probability of the quantization algorithm failing is at most

\begin{align*}
\sum_{i\in S}\log\log\Big(\frac{1}{\gamma}\|X_t - X_t^i\|\Big)\cdot O(R^{-d}) + \sum_{i\in S}\log\log\Big(\frac{1}{\gamma}\|X_t - (X_t^i - \eta h_{i,t})\|\Big)\cdot O(R^{-d})
\le O\Big(\frac{s\log\log T}{T^3}\Big) \le O\Big(\frac{1}{T^2}\Big) .
\end{align*}

By the law of total probability (to remove the conditioning) and the union bound, the total probability of failure, either because quantization cannot be applied or because the quantization algorithm itself fails, is at most $O\big(\frac{1}{T^2}\big)$.
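The range claim used above, that $R = 2 + T^{3/d}$ gives $(R^{R^d})^2 \ge 2^{2T^3} \ge T^{30}$ for large enough $T$, can be sanity-checked numerically. The sketch below is only an illustration and not part of the proof: it works in $\log_2$ space to avoid overflow, the function names are hypothetical, and it assumes the exponent in $R$ reads as $3/d$.

```python
import math


def log2_range_squared(T, d):
    """log2 of (R^(R^d))^2 with R = 2 + T^(3/d) (assumed reading of R)."""
    R = 2 + T ** (3 / d)
    # (R^(R^d))^2 = 2^(2 * R^d * log2(R)); keep d moderate so R**d fits in a float.
    return 2 * (R ** d) * math.log2(R)


def range_claim_holds(T, d):
    """Check (R^(R^d))^2 >= 2^(2 T^3) >= T^30 by comparing base-2 logarithms."""
    middle = 2 * T ** 3            # log2 of 2^(2 T^3)
    target = 30 * math.log2(T)     # log2 of T^30
    return log2_range_squared(T, d) >= middle >= target


if __name__ == "__main__":
    for T in (2, 3, 10, 1000):
        print(T, range_claim_holds(T, d=4))
```

The first inequality holds for every $T \ge 1$ because $R^d \ge T^3$ and $\log_2 R \ge 1$; the second is the one that needs $T$ to be large enough, which in this check already happens at $T = 3$.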
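Finally, we use the chain rule to get that

\begin{align*}
\Pr\Big[\bigcap_{t=1}^{T}\mathcal{L}_t\Big]
= \prod_{t=1}^{T}\Pr\Big[\mathcal{L}_t \,\Big|\, \bigcap_{s=0}^{t-1}\mathcal{L}_s\Big]
= \prod_{t=1}^{T}\Big(1 - \Pr\Big[\neg\mathcal{L}_t \,\Big|\, \bigcap_{s=0}^{t-1}\mathcal{L}_s\Big]\Big)
\ge 1 - \sum_{t=1}^{T}\Pr\Big[\neg\mathcal{L}_t \,\Big|\, \bigcap_{s=0}^{t-1}\mathcal{L}_s\Big]
\ge 1 - O\Big(\frac{1}{T}\Big) .
\end{align*}

Lemma 4.9. Let $T \ge \Omega(n^3)$. Then for quantization parameters $R = 2 + T^{3/d}$ and $\gamma^2 = \frac{\eta^2}{(R^2+7)^2}\big(\sigma^2 + 2KG^2 + \frac{f(\mu_0)-f^*}{L}\big)$, the expected number of bits used by Algorithm 1 per communication is $O(d\log(n) + \log(T))$.

Proof. At step $t+1$, by Corollary 4.3, we know that the total number of bits used is at most

\begin{align*}
\sum_{i\in S}\Big( O\Big(d\log\Big(\frac{R}{\gamma}\|X_t^i - X_t\|\Big)\Big) + O\Big(d\log\Big(\frac{R}{\gamma}\|X_t - (X_t^i - \eta h_{i,t})\|\Big)\Big) \Big) .
\end{align*}

By taking the randomness of agent interaction at step $t+1$ into account, we get that the expected number of bits used is at most:

\begin{align*}
&\sum_{S}\frac{1}{\binom{n}{s}}\sum_{i\in S}\Big( O\Big(d\log\Big(\frac{R}{\gamma}\|X_t^i - X_t\|\Big)\Big) + O\Big(d\log\Big(\frac{R}{\gamma}\|X_t - (X_t^i - \eta h_{i,t})\|\Big)\Big) \Big) \\
&\quad = \sum_i \frac{s}{n}\Big( O\Big(d\log\Big(\frac{R}{\gamma}\|X_t^i - X_t\|\Big)\Big) + O\Big(d\log\Big(\frac{R}{\gamma}\|X_t - (X_t^i - \eta h_{i,t})\|\Big)\Big) \Big) \\
&\quad \le \sum_i \frac{s}{n}\Big( O\Big(d\log\Big(\frac{R^2}{\gamma^2}\|X_t^i - X_t\|^2\Big)\Big) + O\Big(d\log\Big(\frac{R^2}{\gamma^2}\|X_t - (X_t^i - \eta h_{i,t})\|^2\Big)\Big) \Big) \\
&\quad \overset{\text{Jensen}}{\le} s\, O\Big(d\log\Big(\frac{R^2}{\gamma^2}\sum_i \frac{1}{n}\big(\|X_t^i - X_t\|^2 + \|X_t - (X_t^i - \eta h_{i,t})\|^2\big)\Big)\Big) \\
&\quad \le s\, O\Big(d\log\Big(\frac{R^2}{\gamma^2}\sum_i \frac{1}{n}\big(\|X_t - \mu_t\|^2 + \|X_t^i - \mu_t\|^2 + \eta^2\|h_{i,t}\|^2\big)\Big)\Big) \\
&\quad \le s\, O\Big(d\log\Big(\frac{R^2}{\gamma^2}\Big(\Phi_t + \frac{\eta^2}{n}\sum_i\|h_{i,t}\|^2\Big)\Big)\Big) .
\end{align*}

So the expected number of bits per communication in all rounds is at most:

\begin{align*}
\frac{1}{sT}\sum_{t=0}^{T-1} s\, O\Big(d\log\Big(\frac{R^2}{\gamma^2}\Big(\Phi_t + \frac{\eta^2}{n}\sum_i\|h_{i,t}\|^2\Big)\Big)\Big)
\le O\Big(d\log\Big(\frac{R^2}{\gamma^2}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Phi_t + \frac{1}{T}\sum_{t=0}^{T-1}\frac{\eta^2}{n}\sum_i\|h_{i,t}\|^2\Big)\Big)\Big) .
\end{align*}

Next, by Jensen's inequality and Lemma B.9, we get that the expected number of bits used is at most

\begin{align*}
&O\Big(d\,\mathbb{E}\log\Big(\frac{R^2}{\gamma^2}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\Phi_t + \frac{1}{T}\sum_{t=0}^{T-1}\frac{\eta^2}{n}\sum_i\|h_{i,t}\|^2\Big)\Big)\Big) \\
&\quad \overset{\text{Jensen}}{\le} O\Big(d\log\Big(\frac{R^2}{\gamma^2}\Big(\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + \frac{1}{T}\sum_{t=0}^{T-1}\frac{\eta^2}{n}\sum_i\mathbb{E}\|h_{i,t}\|^2\Big)\Big)\Big) \\
&\quad \le O\Big(d\log\Big(\frac{R^2}{\gamma^2}\cdot\frac{1}{T}\big(1000Tn^3s(R^2+7)^2\gamma^2 + 10000B^2n^3s^2K^2LT(R^2+7)^2\gamma^2\big)\Big)\Big) \\
&\quad \le O\Big(d\log\Big(R^2\big(1000n^3s(R^2+7)^2 + 10000B^2n^3s^2K^2L(R^2+7)^2\big)\Big)\Big) = O(d\log(n) + \log(T)) .
\end{align*}

Proof of Theorem 4.2. The proof simply follows from combining Lemmas B.8, 4.8 and 4.9.

Lemma B.10.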
For the convergence of the server, we have:

\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(X_t)\|^2
&\le \frac{15(f(\mu_0)-f^*)}{\sqrt{T}} + \frac{24KL(\sigma^2+2KG^2)}{H^2\sqrt{T}} + \Big( \frac{4824n(n+1)^2K^2L^2}{sH^3T} + \frac{320n^2(n+1)^2KL^2}{sH^2T} \Big)(\sigma^2+2KG^2) \\
&\quad + \Big( \frac{2400n(n+1)^2KL}{s^2H^3T} + \frac{160n^2(n+1)^2L^2}{s^2H^2T} \Big)(f(\mu_0)-f^*) .
\end{align*}

Proof.

\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(X_t)\|^2
&= \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(X_t) - \nabla f(\mu_t) + \nabla f(\mu_t)\|^2 \\
&\le \frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(X_t) - \nabla f(\mu_t)\|^2 + \frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le \frac{2L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|X_t - \mu_t\|^2 + \frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le \frac{2L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}[\Phi_t] + \frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le 2L^2\Big( 80n^2(R^2+7)^2\gamma^2 + 80n^2sK\eta^2(\sigma^2+2KG^2) + 160B^2n^2sK^2\eta^2\,\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \Big) + \frac{2}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le 160n^2L^2(R^2+7)^2\gamma^2 + 160n^2sKL^2\eta^2(\sigma^2+2KG^2) + \frac{3}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \\
&\le 160n^2L^2(R^2+7)^2\gamma^2 + 160n^2sKL^2\eta^2(\sigma^2+2KG^2) + \frac{15(f(\mu_0)-f^*)}{\sqrt{T}} + \frac{24KL(\sigma^2+2KG^2)}{H^2\sqrt{T}} \\
&\quad + \frac{4824n(n+1)^2K^2L^2(\sigma^2+2KG^2)}{sH^3T} + \frac{2400n(n+1)^2KL(f(\mu_0)-f^*)}{s^2H^3T} \\
&\le 160n^2L^2\eta^2\Big(\sigma^2+2KG^2+\frac{f(\mu_0)-f^*}{L}\Big) + 160n^2sKL^2\eta^2(\sigma^2+2KG^2) + \frac{15(f(\mu_0)-f^*)}{\sqrt{T}} + \frac{24KL(\sigma^2+2KG^2)}{H^2\sqrt{T}} \\
&\quad + \frac{4824n(n+1)^2K^2L^2(\sigma^2+2KG^2)}{sH^3T} + \frac{2400n(n+1)^2KL(f(\mu_0)-f^*)}{s^2H^3T} \\
&\le \frac{15(f(\mu_0)-f^*)}{\sqrt{T}} + \frac{24KL(\sigma^2+2KG^2)}{H^2\sqrt{T}} + \Big( \frac{4824n(n+1)^2K^2L^2}{sH^3T} + \frac{320n^2(n+1)^2KL^2}{sH^2T} \Big)(\sigma^2+2KG^2) \\
&\quad + \Big( \frac{2400n(n+1)^2KL}{s^2H^3T} + \frac{160n^2(n+1)^2L^2}{s^2H^2T} \Big)(f(\mu_0)-f^*) .
\end{align*}

Finally, the proof of Corollary 4.3 follows from combining Lemmas B.10, 4.8 and 4.9.
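The first steps of the proof above rely on the generic bound $\|\nabla f(X_t)\|^2 \le 2L^2\|X_t - \mu_t\|^2 + 2\|\nabla f(\mu_t)\|^2$, which holds for any $L$-smooth $f$. The following sketch checks it numerically on a toy objective. The diagonal quadratic, the helper names and the sampling distributions are assumptions chosen for illustration only; they are not taken from the paper's experiments.

```python
import random


def grad(a, x):
    """Gradient of the toy objective f(x) = 0.5 * sum_j a_j * x_j^2 (L = max(a))."""
    return [aj * xj for aj, xj in zip(a, x)]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(c * c for c in v)


def server_grad_bound(a, x_server, mu):
    """Right-hand side 2 L^2 ||X - mu||^2 + 2 ||grad f(mu)||^2."""
    L = max(a)
    diff = [xs - m for xs, m in zip(x_server, mu)]
    return 2 * L ** 2 * sq_norm(diff) + 2 * sq_norm(grad(a, mu))


rng = random.Random(0)
a = [rng.uniform(0.1, 3.0) for _ in range(5)]   # curvatures of the toy quadratic
checks = []
for _ in range(1000):
    mu = [rng.gauss(0, 1) for _ in a]                # stand-in for the mean model
    x_server = [m + rng.gauss(0, 0.5) for m in mu]   # stand-in for the server model
    lhs = sq_norm(grad(a, x_server))
    checks.append(lhs <= server_grad_bound(a, x_server, mu) + 1e-12)
all_hold = all(checks)
```

In the lemma, the distance term $\|X_t - \mu_t\|^2$ is then bounded by the potential $\Phi_t$, which is why the server's rate inherits the bound for $\mu_t$ up to constant factors and additional $O(1/T)$ terms.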



end for
% At Client i:
% Upon (asynchronous) contact from the server, run INTERACTWITHSERVER
% Local variables:
% X^i stores the base client model, following the last server interaction. Initially 0^d.
% h_i accumulates local gradient steps since last server interaction, initially 0^d.
1: function INTERACTWITHSERVER
2:

Figure 1: Peers s ∈ {10, 20, 30, 40} on convergence, for n = 100 clients, 14-bit quantization, on CelebA.

Figure 7: Impact of the maximum number of local steps K ∈ {5, 10, 20} on the QuAFL algorithm / Fashion MNIST.

Figure 10: Convergence comparison relative to total number of rounds, between QuAFL, FedAvg, and the sequential baseline.

Figure 11: Time vs. accuracy for various algorithm variants, on Fashion MNIST.

Figure 17: Impact of maximum local steps K ∈ {3, 9, 15} on the QuAFL algorithm, on ResNet20/CIFAR-10.

Figure 21: Time vs. validation accuracy for various algorithm variants.

Theorem 4.2. Assume the total number of steps $T \ge \Omega(n^3)$, the learning rate $\eta = \frac{n+1}{sH\sqrt{T}}$, and quantization parameters $R = 2 + T^{3/d}$ and $\gamma^2 = \frac{\eta^2}{(R^2+7)^2}\big(\sigma^2 + 2KG^2 + \frac{f(\mu_0)-f^*}{L}\big)$. Let $H > 0$ be the expected number of local steps already performed by a client when interacting with the server. Then, with probability at least $1 - O\big(\frac{1}{T}\big)$, Algorithm 1 converges at the following rate:

\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mu_t)\|^2 \le \frac{5(f(\mu_0)-f^*)}{\sqrt{T}} + \frac{8KL(\sigma^2+2KG^2)}{H^2\sqrt{T}} + O\Big(\frac{n^3KL^2(\sigma^2+2KG^2)}{sH^3T}\Big)
\end{align*}

and uses $O(sT(d\log n + \log T))$ expected communication bits in total.
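To make the parameter choices in Theorem 4.2 concrete, the sketch below evaluates the learning rate, the quantization parameters and the two leading $1/\sqrt{T}$ terms of the rate for given problem constants. It is an illustrative calculator only: the function and argument names are hypothetical, `f_gap` stands for $f(\mu_0) - f^*$, the exponent in $R$ is read as $3/d$, and the $O(\cdot)$ term is omitted because its constant is not stated in the theorem.

```python
import math


def quafl_parameters(n, s, H, T, d, K, L, sigma2, G2, f_gap):
    """Learning rate and quantization parameters as stated in Theorem 4.2."""
    eta = (n + 1) / (s * H * math.sqrt(T))      # eta = (n+1) / (s H sqrt(T))
    R = 2 + T ** (3 / d)                        # R = 2 + T^(3/d) (assumed reading)
    noise = sigma2 + 2 * K * G2 + f_gap / L     # sigma^2 + 2 K G^2 + (f(mu_0) - f*) / L
    gamma2 = eta ** 2 / (R ** 2 + 7) ** 2 * noise
    return eta, R, gamma2


def leading_rate_terms(H, T, K, L, sigma2, G2, f_gap):
    """The two explicit 1/sqrt(T) terms of the bound; the O(.) term is left out."""
    noise = sigma2 + 2 * K * G2
    return 5 * f_gap / math.sqrt(T) + 8 * K * L * noise / (H ** 2 * math.sqrt(T))


if __name__ == "__main__":
    # Toy constants chosen only for illustration (T = 10000 >= n^3 = 729).
    eta, R, gamma2 = quafl_parameters(n=9, s=2, H=1, T=10_000, d=3, K=2,
                                      L=1.0, sigma2=1.0, G2=1.0, f_gap=1.0)
    print("eta:", eta, "R:", R, "gamma^2:", gamma2)
    print("leading terms:", leading_rate_terms(H=1, T=10_000, K=2, L=1.0,
                                               sigma2=1.0, G2=1.0, f_gap=1.0))
```

Note that the learning rate depends on $H$, the expected number of local steps completed at interaction time, so in practice that quantity would have to be estimated or bounded before the schedule can be set.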

(Lemma 23): Lemma 4.1. (Lattice Quantization) Fix parameters $R$ and $\gamma > 0$. There exists a quantization procedure defined by an encoding function $\mathrm{Enc}_{R,\gamma} : \mathbb{R}^d \to \{0, 1\} \to \mathbb{R}^d$ such that, for any vector $x \in \mathbb{R}^d$ which we are trying to quantize, and any vector $y$ which is used by decoding, which we call the decoding key, if $\|x - y\| \le R^{R^d}\gamma$ then with probability at least $1 - \log\log(\frac{\|x-y\|}{\gamma}$


