FEDEXP: SPEEDING UP FEDERATED AVERAGING VIA EXTRAPOLATION

Abstract

Federated Averaging (FedAvg) remains the most popular algorithm for Federated Learning (FL) optimization due to its simple implementation, stateless nature, and privacy guarantees when combined with secure aggregation. Recent work has sought to generalize the vanilla averaging in FedAvg to a generalized gradient descent step by treating client updates as pseudo-gradients and using a server step size. While the use of a server step size has been shown to provide performance improvements theoretically, the practical benefit of the server step size has not been observed in most existing works. In this work, we present FedExP, a method to adaptively determine the server step size in FL based on dynamically varying pseudo-gradients throughout the FL process. We begin by considering the overparameterized convex regime, where we reveal an interesting similarity between FedAvg and the Projection Onto Convex Sets (POCS) algorithm. We then show how FedExP can be motivated as a novel extension of the extrapolation mechanism that is used to speed up POCS. Our theoretical analysis also discusses the implications of FedExP in underparameterized and non-convex settings. Experimental results show that FedExP consistently converges faster than FedAvg and competing baselines on a range of realistic FL datasets.

Published as a conference paper at ICLR 2023

[Figure 1: Accuracy heatmap of FedAvg for client step sizes η_ℓ ∈ {0.01, 0.03, 0.1, 0.3, 1} and server step sizes η_g ∈ {0.1, 0.3, 1, 3, 10}; performance peaks at moderate values of both step sizes.]

While a large server step size may be asymptotically optimal, it is not always effective in practical non-asymptotic and communication-limited settings (Charles & Konečný, 2020). While the importance of the server step size has been theoretically well established in these works, we find that its practical relevance has not been explored.
In this work, we take a step towards bridging this gap between theory and practice by adaptively tuning the value of η_g used in every round. Before discussing our proposed algorithm, we first highlight a useful and novel connection between FedAvg and the POCS algorithm used to find a point in the intersection of some convex sets.

Connection Between FedAvg and POCS in the Overparameterized Convex Regime. Consider the case where the local objectives of the clients {F_i(w)}_{i=1}^M are convex. In this case, the set of minimizers of F_i(w), given by S_i* = {w : w ∈ arg min F_i(w)}, is also a convex set for all i ∈ [M]. Now let us assume that we are in the overparameterized regime, where d is sufficiently larger than the total number of data points across clients.

1. INTRODUCTION

Federated Learning (FL) has emerged as a key distributed learning paradigm in which a central server orchestrates the training of a machine learning model across a network of devices. FL is based on the fundamental premise that data never leaves a client's device, as clients only communicate model updates with the server. Federated Averaging, or FedAvg, first introduced by McMahan et al. (2017), remains the most popular algorithm in this setting due to the simplicity of its implementation, its stateless nature (i.e., clients do not maintain local parameters during training), and its ability to incorporate privacy-preserving protocols such as secure aggregation (Bonawitz et al., 2016; Kadhe et al., 2020).

Slowdown Due to Heterogeneity. One of the most persistent problems in FedAvg is the slowdown in model convergence due to data heterogeneity across clients. In FedAvg, clients usually perform multiple steps of gradient descent on their heterogeneous objectives before communicating with the server, which leads to what is colloquially known as client drift error (Karimireddy et al., 2019). The effect of heterogeneity is further exacerbated by the constraint that only a fraction of the total number of clients may be available for training in every round (Kairouz et al., 2021). Various techniques have been proposed to combat this slowdown, among the most popular being variance reduction techniques such as Karimireddy et al. (2019); Mishchenko et al. (2022); Mitra et al. (2021), but they either make clients stateful, add extra computation or communication requirements, or have privacy limitations.

Server Step Size. Recent work has sought to deal with this slowdown by using two separate step sizes in FedAvg: a client step size used by the clients to minimize their local objectives, and a server step size used by the server to update the global model by treating client updates as pseudo-gradients (Karimireddy et al., 2019; Reddi et al., 2021). To achieve the fastest convergence rate, these works propose keeping the client step size as O(1/(τ√T)) and the server step size as O(√(τM)), where T is the number of communication rounds, τ is the number of local steps, and M is the number of clients. Using a small client step size mitigates client drift, and a large server step size prevents global slowdown. In practice, however, a small client step size severely slows down convergence in the initial rounds and cannot be fully compensated for by a large server step size (see Figure 1). Also, if local objectives differ significantly, then it may be beneficial to use smaller values of the server step size (Malinovsky et al., 2022). Therefore, we seek to answer the following question: For a moderate client step size, can we adapt the server step size according to the local progress made by the clients and the heterogeneity of their objectives? In general, this question is challenging to answer because it is difficult to obtain knowledge of the heterogeneity between the local objectives and appropriately use it to adapt the server step size.

Our Contributions. In this paper, we take a novel approach to address the question posed above. We begin by considering the case where the models are overparameterized, i.e., the number of model parameters is larger than the total number of data points across all clients. This is often true for modern deep neural network models (Zhang et al., 2017; Jacot et al., 2018) and the small datasets collected by edge clients in the FL setting.
In this overparameterized regime, the global minimizer becomes a common minimizer for all local objectives, even though they may be arbitrarily heterogeneous. Using this fact, we obtain a novel connection between FedAvg and the Projection Onto Convex Sets (POCS) algorithm, which is used to find a point in the intersection of some convex sets. Based on this connection, we find an interesting analogy between the server step size and the extrapolation parameter that is used to speed up POCS (Pierra, 1984). We propose new extensions to the extrapolated POCS algorithm to support inexact and noisy projections as in FedAvg. In particular, we derive a time-varying bound on the progress made by clients towards the global minimum and show how this bound can be used to adaptively estimate a good server step size at each round. The result is our proposed algorithm FedExP, a method to adaptively determine the server step size in each round of FL based on the pseudo-gradients in that round. Although motivated by the overparameterized regime, our proposed FedExP algorithm performs well (both theoretically and empirically) in the general case, where the model can be either overparameterized or underparameterized. For this general case, we derive convergence upper bounds for both convex and non-convex objectives. Some highlights of our work are as follows.

• We reveal a novel connection between FedAvg and the POCS algorithm for finding a point in the intersection of convex sets.

• The proposed FedExP algorithm is simple to implement with virtually no additional communication, computation, or storage required at clients or the server. It is well suited for both cross-device and cross-silo FL, and is compatible with partial client participation.

• Experimental results show that FedExP converges 1.4-2× faster than FedAvg and most competing baselines on standard FL tasks.

Related Work.
Popular algorithms for adaptively tuning the step size when training neural networks include Adagrad (Duchi et al., 2011) and its variants RMSProp (Tieleman et al., 2012) and Adadelta (Zeiler, 2012). These algorithms consider the notion of coordinate-wise adaptivity and adapt the step size separately for each dimension of the parameter vector based on the magnitude of the accumulated gradients. While these algorithms can be extended to the federated setting using the concept of pseudo-gradients, as done by Reddi et al. (2021), these extensions are agnostic to the inherent data heterogeneity across clients, which is central to FL. In contrast, FedExP is explicitly designed for FL settings and uses a client-centric notion of adaptivity that utilizes the heterogeneity of client updates in each round. The work closest to ours is Johnson et al. (2020), which proposes a method to adapt the step size for large-batch training by estimating the gradient diversity (Yin et al., 2018) of a minibatch. This result has been improved in a recent work by Horváth et al. (2022). However, both Johnson et al. (2020); Horváth et al. (2022) focus on the centralized setting. In FedExP, we use a similar concept, but within a federated environment, which comes with a stronger theoretical motivation, since client data are inherently diverse in this case. We defer a more detailed discussion of other adaptive step size methods and related work to Appendix A.

2. PROBLEM FORMULATION AND PRELIMINARIES

As in most standard federated learning frameworks, we consider the problem of optimizing the model parameters w ∈ R^d to minimize the global objective function F(w) defined as follows:

min_{w ∈ R^d} F(w) := (1/M) Σ_{i=1}^M F_i(w),   (1)

where F_i(w) := (1/|D_i|) Σ_{δ_i ∈ D_i} ℓ(w, δ_i) is the empirical risk objective computed on the local dataset D_i at the i-th client. Here, ℓ(·, ·) is a loss function and δ_i represents a data sample from the empirical local data distribution D_i. The total number of clients in the FL system is denoted by M. Without loss of generality, we assume that all M client objectives are given equal weight in the global objective function defined in (1). Our algorithm and analysis can be directly extended to the case where client objectives are unequally weighted, e.g., proportional to local dataset sizes |D_i|.

FedAvg. We focus on solving (1) using FedAvg (McMahan et al., 2017; Kairouz et al., 2021). At round t of FedAvg, the server sends the current global model w^{(t)} to all clients. Upon receiving the global model, clients perform τ steps of local stochastic gradient descent (SGD) to compute their updates {∆_i^{(t)}}_{i=1}^M for round t as follows.

Perform Local SGD: w_i^{(t,k+1)} = w_i^{(t,k)} − η_l ∇F_i(w_i^{(t,k)}, ξ_i^{(t,k)})  ∀k ∈ {0, 1, . . . , τ−1}   (2)

Compute Local Difference: ∆_i^{(t)} = w^{(t)} − w_i^{(t,τ)},   (3)

where w_i^{(t,0)} = w^{(t)} for all i ∈ [M], η_l is the client step size, and ∇F_i(w_i^{(t,k)}, ξ_i^{(t,k)}) represents a stochastic gradient computed on the minibatch ξ_i^{(t,k)} sampled randomly from D_i.

Server Optimization in FedAvg. In vanilla FedAvg (McMahan et al., 2017), the global model is simply updated as the average of the client local models, that is, w^{(t+1)} = (1/M) Σ_{i=1}^M w_i^{(t,τ)}.
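The round structure above is easy to simulate. The following is a minimal NumPy sketch of our own (not the authors' implementation): each toy client runs τ full-batch gradient steps as a stand-in for local SGD on simple quadratic objectives, and the server averages the pseudo-gradients (η_g = 1 recovers vanilla FedAvg).

```python
import numpy as np

def fedavg_round(w, client_grads, eta_l, tau, eta_g=1.0):
    """One FedAvg round: every client runs tau local gradient steps from the
    global model w and sends back the pseudo-gradient Delta_i = w - w_i^(tau);
    the server averages them (eta_g = 1 is vanilla FedAvg)."""
    deltas = []
    for grad in client_grads:           # grad is a callable returning grad F_i(w)
        w_i = w.copy()
        for _ in range(tau):            # local full-batch steps (SGD stand-in)
            w_i = w_i - eta_l * grad(w_i)
        deltas.append(w - w_i)          # pseudo-gradient of client i
    delta_bar = np.mean(deltas, axis=0) # aggregated client update
    return w - eta_g * delta_bar

# Two toy quadratic clients F_i(w) = 0.5 ||w - c_i||^2, so grad F_i(w) = w - c_i;
# the global minimizer is the mean of the two centers, (0.5, 0.5).
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grads = [lambda w, c=c: w - c for c in centers]

w = np.zeros(2)
for _ in range(50):
    w = fedavg_round(w, grads, eta_l=0.1, tau=5)
```

For these symmetric quadratics the fixed point of the round map is exactly the global minimizer, so the iterates converge to (0.5, 0.5).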
To improve over this, recent work (Reddi et al., 2021; Hsu et al., 2019) has focused on optimizing the server aggregation process by treating the client updates ∆_i^{(t)} as "pseudo-gradients" and multiplying by a server step size when aggregating them as follows.

Generalized FedAvg Global Update: w^{(t+1)} = w^{(t)} − η_g ∆^{(t)},  where ∆^{(t)} = (1/M) Σ_{i=1}^M ∆_i^{(t)}

is the aggregated client update in round t and η_g acts as the server step size. Note that setting η_g = 1 recovers the vanilla FedAvg update.

In the overparameterized regime, the model can fit all the training data at clients simultaneously and hence be a minimizer for all local objectives. Thus we assume that the global minimum satisfies w* ∈ S_i*, ∀i ∈ [M]. Our original problem in (1) can then be reformulated as trying to find a point in the intersection of the convex sets {S_i*}_{i=1}^M, since w* ∈ S_i*, ∀i ∈ [M]. One of the most popular algorithms to do so is the Projection Onto Convex Sets (POCS) algorithm (Gurin et al., 1967). In POCS, at every iteration the current model is updated as follows.

Generalized POCS update: w_POCS^{(t+1)} = w_POCS^{(t)} + λ ( (1/M) Σ_{i=1}^M P_i(w_POCS^{(t)}) − w_POCS^{(t)} ),

where P_i(w_POCS^{(t)}) is a projection of w_POCS^{(t)} onto the set S_i* and λ is known as the relaxation coefficient (Combettes, 1997).

Extrapolation in POCS. Combettes (1997) notes that POCS has primarily been used with λ = 1, with studies failing to demonstrate a systematic benefit of λ < 1 or λ > 1 (Mandel, 1984). This prompted Combettes (1997) to study an adaptive method of setting λ, first introduced by Pierra (1984), as follows:

λ^{(t)} = ( (1/M) Σ_{i=1}^M ||P_i(w^{(t)}) − w^{(t)}||² ) / ( || (1/M) Σ_{i=1}^M (P_i(w^{(t)}) − w^{(t)}) ||² ).

Pierra (1984) refers to the POCS algorithm with this adaptive λ^{(t)} as the Extrapolated Parallel Projection Method (EPPM). This is referred to as extrapolation since we always have λ^{(t)} ≥ 1 by Jensen's inequality. The intuition behind EPPM lies in showing that the update with the proposed λ^{(t)} always satisfies ||w_POCS^{(t+1)} − w*||² < ||w_POCS^{(t)} − w*||², thereby achieving asymptotic convergence. Experimental results in Pierra (1984) and Combettes (1997) show that EPPM can give an order-wise speedup over POCS, motivating us to study this algorithm in the FL context.
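When exact projections are available, the extrapolated update is straightforward to implement. The sketch below is our own toy construction (not from the paper): it projects onto two lines in R² intersecting at w* = [0, 3] and compares EPPM's adaptive λ^{(t)} against plain parallel POCS with λ = 1.

```python
import numpy as np

def project_onto_line(w, a, b):
    """Exact projection of w onto the affine set {x : a.x = b}."""
    return w - (a @ w - b) / (a @ a) * a

def parallel_projection_step(w, projections, extrapolate=True):
    """One parallel POCS step. With extrapolation, the relaxation coefficient
    is lambda = mean_i ||d_i||^2 / ||mean_i d_i||^2 >= 1 (Jensen's inequality)."""
    dirs = [P(w) - w for P in projections]      # d_i = P_i(w) - w
    d_bar = np.mean(dirs, axis=0)
    lam = np.mean([d @ d for d in dirs]) / (d_bar @ d_bar) if extrapolate else 1.0
    return w + lam * d_bar

# Two lines intersecting at w* = [0, 3].
P1 = lambda w: project_onto_line(w, np.array([3.0, 1.0]), 3.0)
P2 = lambda w: project_onto_line(w, np.array([1.0, 1.0]), 3.0)
w_star = np.array([0.0, 3.0])

w_pocs, w_eppm = np.array([4.0, -2.0]), np.array([4.0, -2.0])
for _ in range(20):
    w_pocs = parallel_projection_step(w_pocs, [P1, P2], extrapolate=False)
    w_eppm = parallel_projection_step(w_eppm, [P1, P2], extrapolate=True)
# After the same number of steps, the extrapolated iterate is much closer to w*.
```

With λ = 1 the averaged projection contracts slowly along the near-common direction of the two lines, while the extrapolated λ^{(t)} > 1 compensates for the averaging and converges noticeably faster.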

3.2. INCORPORATING EXTRAPOLATION IN FL

Note that to implement POCS we do not need to explicitly know the sets {S_i*}_{i=1}^M; we only need to know how to compute a projection onto these sets. From this point of view, we see that FedAvg proceeds similarly to POCS. In each round, clients receive w^{(t)} from the server and run multiple SGD steps to compute an "approximate projection" w_i^{(t,τ)} of w^{(t)} onto their solution sets S_i*. These approximate projections are then aggregated at the server to update the global model. In this case, the relaxation coefficient λ plays exactly the same role as the server step size η_g in FedAvg. Inspired by this observation and the idea of extrapolation in POCS, we seek to understand whether a similar idea can be applied to tune the server step size η_g in FedAvg. Note that the EPPM algorithm makes use of exact projections to prove convergence, which are not available to us in FL settings. This is further complicated by the fact that the client updates are noisy due to the stochasticity in sampling minibatches. We find that, in order to use an EPPM-like step size, the requirement of exact projections can be relaxed to the following condition, which bounds the distance of the local models from the global minimum.

Approximate projection condition in FL:

(1/M) Σ_{i=1}^M ||w_i^{(t,τ)} − w*||² ≤ ||w^{(t)} − w*||²,   (6)

where w^{(t)} and {w_i^{(t,τ)}}_{i=1}^M are the global and local client models, respectively, at round t, and w* is a global minimum. Intuitively, this condition says that after the local updates, the local models are on average closer to the optimum w* than the model w^{(t)} at the beginning of that round. We first show that condition (6) holds in the overparameterized convex regime under some conditions. The full proofs for lemmas and theorems in this paper are included in Appendix C.

Lemma 1. Let F_i(w) be convex and L-smooth for all i ∈ [M] and let w* be a common minimizer of all F_i(w). Assuming clients run full-batch gradient descent to minimize their local objectives with η_l ≤ 1/L, then (6) holds for all t and τ ≥ 1.

In the presence of stochastic gradient noise or when the model is underparameterized, although (6) may not hold in general, we expect it to be satisfied at least during the initial phase of training, when ||w^{(t)} − w*||² is large and clients make common progress towards a minimum.

Algorithm 1 Proposed Algorithm: FedExP
1: Input: w^{(0)}, number of rounds T, local iteration steps τ, parameters η_l, ϵ
2: For t = 0, . . . , T−1 communication rounds do:
3:   Global server does:
4:     Send w^{(t)} to all clients
5:   Clients i ∈ [M] in parallel do:
6:     Set w_i^{(t,0)} ← w^{(t)}
7:     For k = 0, . . . , τ−1 local iterations do:
8:       Update w_i^{(t,k+1)} ← w_i^{(t,k)} − η_l ∇F_i(w_i^{(t,k)}, ξ_i^{(t,k)})
9:     Send ∆_i^{(t)} ← w^{(t)} − w_i^{(t,τ)} to the server
10:  Global server does:
11:    Compute ∆^{(t)} ← (1/M) Σ_{i=1}^M ∆_i^{(t)} and η_g^{(t)} ← max(1, Σ_{i=1}^M ||∆_i^{(t)}||² / (2M(||∆^{(t)}||² + ϵ)))
12:    Update global model with w^{(t+1)} ← w^{(t)} − η_g^{(t)} ∆^{(t)}

Given that (6) holds, we now consider the generalized FedAvg update with a server step size η_g^{(t)} in round t. Our goal is to find the value of η_g^{(t)} that minimizes the distance of w^{(t+1)} to w*:

||w^{(t+1)} − w*||² = ||w^{(t)} − w*||² + (η_g^{(t)})² ||∆^{(t)}||² − 2 η_g^{(t)} ⟨w^{(t)} − w*, ∆^{(t)}⟩.   (7)

Setting the derivative of the RHS of (7) with respect to η_g^{(t)} to zero, we have

(η_g^{(t)})_opt = ⟨w^{(t)} − w*, ∆^{(t)}⟩ / ||∆^{(t)}||² = Σ_{i=1}^M ⟨w^{(t)} − w*, ∆_i^{(t)}⟩ / (M ||∆^{(t)}||²) ≥ Σ_{i=1}^M ||∆_i^{(t)}||² / (2M ||∆^{(t)}||²),   (8)

where the last inequality follows from ⟨a, b⟩ = (1/2)[||a||² + ||b||² − ||a − b||²], the definition of ∆_i^{(t)} in (3), and (6). Note that depending on the values of {∆_i^{(t)}}_{i=1}^M, we may have (η_g^{(t)})_opt ≫ 1. Thus, we see that (6) acts as a suitable replacement for projection to justify the use of extrapolation in FL settings.
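Condition (6) and the lower bound (8) can be checked numerically. The sketch below is our own construction (an overparameterized linear regression split across clients, with F_i(w) = ||X_i w − y_i||²/(2n) as a stand-in for the local objectives): a shared minimizer w* exists by design, clients run full-batch gradient descent with η_l ≤ 1/L, and we verify both inequalities on the resulting pseudo-gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, n = 20, 3, 5                      # overparameterized: M * n = 15 < d = 20
w_star = rng.normal(size=d)
clients = []
for _ in range(M):
    X = rng.normal(size=(n, d))
    clients.append((X, X @ w_star))     # y_i = X_i w*, so w* minimizes every F_i

def local_gd(w, X, y, eta_l, tau):
    """tau steps of full-batch gradient descent on F_i(w) = ||Xw - y||^2 / (2n)."""
    for _ in range(tau):
        w = w - eta_l * X.T @ (X @ w - y) / len(y)
    return w

# Smoothness constant L = max_i lambda_max(X_i^T X_i / n); Lemma 1 needs eta_l <= 1/L.
L = max(np.linalg.eigvalsh(X.T @ X / len(y)).max() for X, y in clients)
eta_l, tau = 1.0 / L, 10

w0 = np.zeros(d)
local_models = [local_gd(w0, X, y, eta_l, tau) for X, y in clients]
deltas = [w0 - w_i for w_i in local_models]
delta_bar = np.mean(deltas, axis=0)

# Condition (6): on average, local models end the round closer to w*.
lhs = np.mean([np.sum((w_i - w_star) ** 2) for w_i in local_models])
rhs = np.sum((w0 - w_star) ** 2)

# Optimal server step size and its lower bound (8).
eta_opt = (w0 - w_star) @ delta_bar / np.sum(delta_bar ** 2)
bound = np.mean([np.sum(dl ** 2) for dl in deltas]) / (2 * np.sum(delta_bar ** 2))
```

Here `lhs <= rhs` confirms (6), and `bound <= eta_opt` confirms that the server-side quantity computable without knowing w* indeed lower-bounds the optimal step size.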

3.3. PROPOSED ALGORITHM

Motivated by our findings above, we propose the following server step size for the generalized FedAvg update at each round:

(η_g^{(t)})_FedExP = max( 1, Σ_{i=1}^M ||∆_i^{(t)}||² / (2M(||∆^{(t)}||² + ϵ)) ).   (9)

We term our algorithm Federated Extrapolated Averaging, or FedExP, in reference to the original EPPM algorithm which inspired this work. Note that our proposed step size satisfies the property that (η_g^{(t)})_opt − (η_g^{(t)})_FedExP ≤ (η_g^{(t)})_opt − 1 when (6) holds, which can be seen by comparing (8) and (9). Since (7) depends quadratically on η_g^{(t)}, we can show that in this case ||w^{(t)} − (η_g^{(t)})_FedExP ∆^{(t)} − w*||² ≤ ||w^{(t)} − ∆^{(t)} − w*||², implying that the FedExP iterate is at least as close to the optimum as the FedAvg update. In the rest of the paper, we denote (η_g^{(t)})_FedExP as η_g^{(t)} when the context is clear.

Importance of Adding a Small Constant to the Denominator. In the case where (6) does not hold, using the lower bound established in (8) can cause the proposed step size to blow up. This is especially true towards the end of training, where we can have ||∆^{(t)}||² ≈ 0 but ||∆_i^{(t)}||² ≠ 0. Thus, we propose to add a small positive constant ϵ to the denominator in (9) to prevent this blow-up. For a large enough ϵ our algorithm reduces to FedAvg, and therefore tuning ϵ can be a useful tool to interpolate between vanilla averaging and extrapolation. Similar techniques exist in adaptive algorithms such as Adam (Kingma & Ba, 2015) and Adagrad (Duchi et al., 2011) to improve stability.

Compatibility with Partial Client Participation and Secure Aggregation. Note that FedExP can be easily extended to support partial participation of clients by calculating η_g^{(t)} using only the updates of participating clients, i.e., the averaging and division in (9) are only over the clients that participate in the round. Furthermore, since the server only needs to estimate the average of pseudo-gradient norms, η_g^{(t)} can be computed with secure aggregation, similar to computing ∆^{(t)}.

Connection with Gradient Diversity.
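The rule (9), including the floor at 1 and the ϵ guard, is a one-line computation at the server. A minimal sketch of our own (variable names are ours, not the paper's):

```python
import numpy as np

def fedexp_step_size(deltas, eps=1e-3):
    """Server step size of Eq. (9): extrapolate when client pseudo-gradients
    are diverse, fall back to vanilla averaging (eta_g = 1) otherwise."""
    deltas = np.asarray(deltas)                 # shape (M, d)
    delta_bar = deltas.mean(axis=0)             # aggregated pseudo-gradient
    num = np.sum(deltas ** 2)                   # sum_i ||Delta_i||^2
    den = 2 * len(deltas) * (delta_bar @ delta_bar + eps)
    return max(1.0, num / den)

# Identical client updates: the ratio is 1/2, so the max clips eta_g to 1.
eta_same = fedexp_step_size([np.ones(4), np.ones(4)], eps=0.0)
# Conflicting updates largely cancel in the average, so eta_g > 1.
eta_div = fedexp_step_size([np.array([2.0, 0.0]), np.array([-1.0, 0.0])], eps=0.0)
```

Note that with ϵ = 0 and near-cancelling updates the denominator approaches zero, which is exactly the blow-up the ϵ guard is there to prevent.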
We see that our lower bound on (η_g^{(t)})_opt naturally depends on the similarity of the client updates to each other. In the case where τ = 1 and clients run full-batch gradient descent, our lower bound (8) reduces to Σ_{i=1}^M ||∇F_i(w^{(t)})||² / (2M ||∇F(w^{(t)})||²), which is used as a measure of data heterogeneity in many FL works (Wang et al., 2020; Haddadpour & Mahdavi, 2019). Our lower bound suggests using larger step sizes as this gradient diversity increases, which can be a useful tool to speed up training in heterogeneous settings. This is an orthogonal approach to existing optimization methods for tackling heterogeneity, such as Karimireddy et al. (2020b); Li et al. (2020); Acar et al. (2021), which propose additional regularization terms or control variates in the local client objectives to limit the impact of heterogeneity.
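The gradient-diversity ratio of Yin et al. (2018), Σ_i ||g_i||² / (M ||mean_i g_i||²), of which the bound above is one half, can be computed in a few lines; the sketch below is our own illustration:

```python
import numpy as np

def gradient_diversity(client_grads):
    """sum_i ||g_i||^2 / (M ||mean_i g_i||^2): equals 1 when all client
    gradients are identical, and grows as they become more heterogeneous."""
    g = np.asarray(client_grads)    # shape (M, d)
    g_bar = g.mean(axis=0)
    return np.sum(g ** 2) / (len(g) * (g_bar @ g_bar))

div_homog = gradient_diversity([np.array([1.0, 1.0])] * 4)
div_heter = gradient_diversity([np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                                np.array([-1.0, 0.0]), np.array([1.0, 1.0])])
```

For identical gradients the ratio is exactly 1; the more the client gradients point in different directions, the more they cancel in the average and the larger the ratio becomes.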

4. CONVERGENCE ANALYSIS

Our analysis so far has focused on the overparameterized convex regime to motivate our algorithm. In this section, we discuss the convergence of our algorithm in the presence of underparameterization and non-convexity. We would like to emphasize that (6) is not needed to show convergence of FedExP; it is only needed to motivate why FedExP might be beneficial. To show general convergence, we only require that η_l be sufficiently small and the standard assumptions stated below.

Challenge in incorporating stochastic noise and partial participation. Our current analysis focuses on the case where clients compute full-batch gradients in every step with full participation. This is primarily due to the difficulty in decoupling the effect of stochastic and sampling noise on η_g^{(t)} and the pseudo-gradients {∆_i^{(t)}}_{i=1}^M. To be more specific, if we use ξ^{(t)} to denote the randomness at round t, then E_{ξ^{(t)}}[η_g^{(t)} ∆^{(t)}] ≠ E_{ξ^{(t)}}[η_g^{(t)}] E_{ξ^{(t)}}[∆^{(t)}], which significantly complicates the proof. This is purely a theoretical limitation. Empirically, our results in Section 6 show that FedExP performs well with both SGD and partial client participation.

Assumption 1. (L-smoothness) The local objective F_i(w) is differentiable and L-smooth for all i ∈ [M], i.e., ||∇F_i(w) − ∇F_i(w')|| ≤ L ||w − w'||, ∀w, w' ∈ R^d.

Assumption 2. (Bounded data heterogeneity at optimum) The norm of the client gradients at the global optimum w* is bounded as follows: (1/M) Σ_{i=1}^M ||∇F_i(w*)||² ≤ σ*².

Theorem 1. (F_i are convex) Under Assumptions 1 and 2, and assuming clients compute full-batch gradients with full participation and η_l ≤ 1/(6τL), the iterates {w^{(t)}} generated by FedExP satisfy

F(w̄^{(T)}) − F* ≤ O( ||w^{(0)} − w*||² / (η_l τ Σ_{t=0}^{T−1} η_g^{(t)}) )  [T1: initialization error]
+ O( η_l² τ(τ−1) L σ*² )  [T2: client drift error]
+ O( η_l τ σ*² )  [T3: noise at optimum],

where η_g^{(t)} is the FedExP server step size at round t and w̄^{(T)} = ( Σ_{t=0}^{T−1} η_g^{(t)} w^{(t)} ) / ( Σ_{t=0}^{T−1} η_g^{(t)} ).
For the non-convex case, we need the data heterogeneity to be bounded everywhere, as follows.

Assumption 3. (Bounded global gradient variance) There exists a constant σ_g² > 0 such that the global gradient variance is bounded as follows: (1/M) Σ_{i=1}^M ||∇F_i(w) − ∇F(w)||² ≤ σ_g², ∀w ∈ R^d.

Theorem 2. (F_i are non-convex) Under Assumptions 1 and 3, and assuming clients compute full-batch gradients with full participation and η_l ≤ 1/(6τL), the iterates {w^{(t)}} generated by FedExP satisfy

min_{t∈[T]} ||∇F(w^{(t)})||² ≤ O( (F(w^{(0)}) − F*) / (η_l τ Σ_{t=0}^{T−1} η_g^{(t)}) )  [T1: initialization error]
+ O( η_l² L² (τ−1)τ σ_g² )  [T2: client drift error]
+ O( η_l L τ σ_g² )  [T3: global variance],

where η_g^{(t)} is the FedExP server step size at round t.

[Figure 2: The last iterate of FedExP has an oscillating behavior in F(w) but monotonically decreases ||w^{(t)} − w*||²; the average of the last two iterates lies in a lower loss region than the last iterate.]

Discussion. In the convex case, the error of FedAvg can be bounded by O(||w^{(0)} − w*||²/(η_l τ T)) + O(η_l² τ(τ−1) L σ*²) (Khaled et al., 2020), and in the non-convex case by O((F(w^{(0)}) − F*)/(η_l τ T)) + O(η_l² L² τ(τ−1) σ_g²) (Wang et al., 2020). A careful inspection reveals that the impact of T1 on the convergence of FedExP is different from FedAvg (the effect of T2 is the same). We see that since Σ_{t=0}^{T−1} η_g^{(t)} ≥ T, FedExP reduces T1 faster than FedAvg. However, this comes at the price of an increased error floor due to T3. Thus, the larger step sizes in FedExP help us reach the vicinity of an optimum faster, but can ultimately saturate at a higher error floor due to noise around the optimum. Note that the impact of the error floor can be controlled by setting the client step size η_l appropriately. Moreover, in the overparameterized convex regime where σ*² = 0, the effect of T2 and T3 vanishes, and thus FedExP clearly outperforms FedAvg. This aligns well with our initial motivation of using extrapolation in the overparameterized regime.

5. FURTHER INSIGHTS INTO FEDEXP

In this section, we discuss some further insights into the training of FedExP and how we leverage these insights to improve its performance.

FedExP monotonically decreases ||w^{(t)} − w*||² but not necessarily F(w^{(t)}) − F(w*). Recall that our original motivation for the FedExP step size was to minimize the distance to the optimum, given by ||w^{(t+1)} − w*||², when (6) holds. Doing so satisfies ||w^{(t+1)} − w*||² ≤ ||w^{(t)} − w*||², but does not necessarily satisfy F(w^{(t+1)}) ≤ F(w^{(t)}). To better illustrate this phenomenon, we consider the following toy example in R². We consider a setup with two clients, where the objective at each client is given as follows:

F_1(w) = (3w_1 + w_2 − 3)²;  F_2(w) = (w_1 + w_2 − 3)².

We denote the sets of minimizers of F_1(w) and F_2(w) by S_1* = {w : 3w_1 + w_2 = 3} and S_2* = {w : w_1 + w_2 = 3}, respectively. Note that S_1* and S_2* intersect at the point w* = [0, 3], making it a global minimum. To minimize their local objectives, we assume clients run gradient descent with τ → ∞ in every round; in this case, the local models are exact projections of the global model onto the solution sets {S_i*}_{i=1}^2, the lower bound in (8) can be improved by a factor of 2, and we therefore use η_g^{(t)} = (||∆_1^{(t)}||² + ||∆_2^{(t)}||²)/(2||∆^{(t)}||²). Figure 2 shows the trajectory of the iterates generated by FedExP and FedAvg. We see that while ||w^{(t)} − w*||² decreases monotonically for FedExP, F(w^{(t)}) does not, and in fact oscillates, as we discuss below.

Understanding oscillations in F(w^{(t)}). The oscillations in F(w^{(t)}) are caused by the FedExP iterates trying to minimize their distance from the solution sets S_1* and S_2* simultaneously. The initialization point w^{(0)} is closer to S_1* than S_2*, which causes the FedExP iterate at round 1 to move towards S_2*, then back towards S_1*, and so on. To understand why this happens, consider the case where ∆_1^{(t)} = 0 and ∆_2^{(t)} ≠ 0. In this case, we have η_g^{(t)} = 2 and therefore w^{(t+1)} = w^{(t)} − 2∆^{(t)} = w_2^{(t,τ)}, which indicates that FedExP is now trying to minimize ||∆_2^{(t+1)}||². This gives us the intuition that the FedExP update in round t tries to minimize the objectives of the clients with ||∆_i^{(t)}||² ≫ 0, even though this can lead to a temporary increase in the global loss F(w^{(t)}).

Averaging last two iterates in FedExP. Given the oscillating behavior of the iterates of FedExP, we find that measuring progress on F(w) using the last iterate can be misleading. Motivated by this finding, we propose to set the final model as the average of the last two iterates of FedExP. While the last iterate oscillates between regions that minimize the losses F_1(w) and F_2(w) respectively, the average of the last two iterates is more stable and proceeds along a globally low loss region. Interestingly, we find that the benefits of averaging the iterates of FedExP also extend to training neural networks with multiple clients in practical FL scenarios (see Appendix D.1). In practice, the number of iterates to average over could also be a hyperparameter for FedExP, but we find that averaging the last two iterates works well, and we use this for our other experiments.
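The two-client toy example above (F_1(w) = (3w_1 + w_2 − 3)², F_2(w) = (w_1 + w_2 − 3)²) is easy to reproduce. The sketch below is our own: it runs FedExP with exact projections (the τ → ∞ local update) and the factor-2-improved step size, and records both the squared distance to w* and the global loss.

```python
import numpy as np

def project(w, a, b):
    """Exact projection onto {x : a.x = b} (the tau -> infinity local update)."""
    return w - (a @ w - b) / (a @ a) * a

a1, a2, w_star = np.array([3.0, 1.0]), np.array([1.0, 1.0]), np.array([0.0, 3.0])
F = lambda w: (a1 @ w - 3) ** 2 + (a2 @ w - 3) ** 2   # global loss (up to 1/M)

w = np.array([2.0, 0.0])
dists, losses = [np.sum((w - w_star) ** 2)], [F(w)]
for _ in range(15):
    d1 = w - project(w, a1, 3.0)                      # pseudo-gradient, client 1
    d2 = w - project(w, a2, 3.0)                      # pseudo-gradient, client 2
    d_bar = (d1 + d2) / 2
    eta_g = (d1 @ d1 + d2 @ d2) / (2 * (d_bar @ d_bar))  # improved bound
    w = w - eta_g * d_bar
    dists.append(np.sum((w - w_star) ** 2))
    losses.append(F(w))
```

Running this, the distance ||w^{(t)} − w*||² shrinks every round, while F(w^{(t)}) rises on some rounds before falling again, matching the oscillation described above.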

6. EXPERIMENTS

We evaluate the performance of FedExP on synthetic and real FL tasks. For our synthetic experiment, we consider a distributed overparameterized linear regression problem. This experiment aligns most closely with our theory and allows us to carefully examine the performance of FedExP when (6) holds. For realistic FL tasks, we consider image classification on the following datasets: i) EMNIST (Cohen et al., 2017), ii) CIFAR-10 (Krizhevsky et al., 2009), iii) CIFAR-100 (Krizhevsky et al., 2009), and iv) CINIC-10 (Darlow et al., 2018). In all experiments, we compare against the following baselines: i) FedAvg, ii) SCAFFOLD (Karimireddy et al., 2020b), and iii) FedAdagrad (Reddi et al., 2021), which is a federated version of the popular Adagrad algorithm. To the best of our knowledge, there are no other baselines that adaptively tune the server step size in FL.

Experimental Setup. For the synthetic experiment, we consider a setup with 20 clients, 30 samples at each client, and a model of size 1000, making this an overparameterized problem. The data at each client is generated following a procedure similar to the synthetic dataset in Li et al. (2020). We use the federated version of EMNIST available at Caldas et al. (2019), which is naturally partitioned into 3400 clients. For CIFAR-10/100 we artificially partition the data into 100 clients, and for CINIC-10 we partition the data into 200 clients. In both cases, we follow a Dirichlet distribution with α = 0.3 for the partitioning to model heterogeneity among client data (Hsu et al., 2019). For EMNIST we use the same CNN architecture as Reddi et al. (2021). For CIFAR-10, CIFAR-100 and CINIC-10 we use a ResNet-18 model (He et al., 2016). For our baselines, we find the best performing η_g and η_l by grid search. For FedExP we optimize ϵ and η_l by grid search. We fix the number of participating clients to 20, the minibatch size to 50, and the number of local updates to 20 for all experiments.
In Appendix D, we provide additional details and results, including the best performing hyperparameters, a comparison with FedProx (Li et al., 2020), and results for more rounds.

FedExP comprehensively outperforms FedAvg and baselines. Our experimental results in Figure 3 demonstrate that FedExP clearly outperforms FedAvg and competing baselines that use the best performing η_g and η_l found by grid search. Moreover, FedExP does not require additional communication or storage at clients or server, unlike SCAFFOLD and FedAdagrad. The order-wise improvement in the case of the convex linear regression experiment confirms our theoretical motivation for FedExP outlined in Section 3.2. In this case, since (6) is satisfied, we know that the FedExP iterates are always moving towards the optimum. For realistic FL tasks, we see a consistent speedup; the values of η_g^{(t)} used by FedExP can be found in Appendix D.5. The key takeaway from our experiments is that adapting the server step size allows FedExP to take much larger steps in some (but not all) rounds compared to the constant optimized step size used by our baselines, leading to a large speedup.

Comparison with FedAdagrad. As discussed in Section 1, FedAdagrad and FedExP use different notions of adaptivity: FedAdagrad uses coordinate-wise adaptivity, while FedExP uses client-based adaptivity. We believe that the latter is more meaningful for FL settings, as seen in our experiments. In many experiments, especially image classification tasks like CIFAR, the gradients produced are dense with relatively little variance in coordinate-wise gradient magnitudes (Reddi et al., 2021; Zhang et al., 2020). In such cases, FedAdagrad is unable to leverage any coordinate-level information and gives almost the same performance as FedAvg.

Comparison with SCAFFOLD. We see that FedExP outperforms SCAFFOLD in all experiments, showing that adaptively tuning the server step size is sufficient to achieve speedup in FL settings.
Furthermore, SCAFFOLD even fails to outperform FedAvg for the more difficult CIFAR and CINIC datasets. Several other papers have reported similar findings, including Reddi et al. (2021) and Yu et al. (2022). Several reasons have been postulated for this behavior, including the staleness of control variates (Reddi et al., 2021) and the difficulty of characterizing client drift in non-convex scenarios (Yu et al., 2022). Thus, while theoretically attractive, simply using variance reduction techniques such as SCAFFOLD may not provide any speedup in practice.

Adding extrapolation to SCAFFOLD. We note that SCAFFOLD only modifies the local SGD procedure at clients and keeps the global aggregation at the server unchanged. Therefore, it is easy to modify the SCAFFOLD algorithm to use extrapolation when updating the global model at the server (algorithm details in Appendix E). Figure 4 shows the result of our proposed extrapolated SCAFFOLD on the CIFAR-10 dataset. Interestingly, we observe that while SCAFFOLD alone fails to outperform FedAvg, the extrapolated version of SCAFFOLD achieves the best performance among all algorithms. This result highlights the importance of carefully tuning the server step size to achieve the best performance for variance-reduction algorithms. It is also possible to add extrapolation to algorithms with server momentum (Appendix F).

7. CONCLUSION

In this paper, we have proposed FedExP, a novel extension of FedAvg that adaptively determines the server step size used in every round of global aggregation in FL. Our algorithm is based on the key observation that FedAvg can be seen as an approximate variant of the POCS algorithm, especially for overparameterized convex objectives. This has inspired us to leverage the idea of extrapolation that is used to speed up POCS in a federated setting, resulting in FedExP. We have also discussed several theoretical and empirical perspectives of FedExP. In particular, we have explained some design choices in FedExP and how it can be used in practical scenarios with partial client participation and secure aggregation. We have also shown the convergence of FedExP for possibly underparameterized models and non-convex objectives. Our experimental results have shown that FedExP consistently outperforms baseline algorithms with virtually no additional computation or communication at clients or server. We have also shown that the idea of extrapolation can be combined with other techniques, such as the variance-reduction method in SCAFFOLD, for greater speedup. Future work will study the convergence analysis of FedExP with stochastic gradient noise and the incorporation of extrapolation into a wider range of algorithms used in FL. 

A ADDITIONAL RELATED WORK

In this section, we provide further discussion on some additional related work that complements our discussion in Section 1.

Adaptive Step Size in Gradient Descent. Here we briefly discuss methods for tuning the step size in gradient descent and the challenges in applying them to the FL setting. Early methods to tune the step size in gradient descent were based on line search (or backtracking) strategies (Armijo, 1966; Goldstein, 1977). However, these strategies need to repeatedly compute the function value or gradient within an iteration, making them computationally expensive. Another popular class of adaptive step sizes is based on the Polyak step size (Polyak, 1969; Hazan & Kakade, 2019; Loizou et al., 2021). Similar to FedExP, the Polyak step size is derived by trying to minimize the distance to the optimum for convex functions. However, it is not clear how this can be extended to the federated setting, where we only have access to pseudo-gradients. Moreover, the Polyak step size requires knowledge of the function value at the optimum, which is hard to estimate. Another related class of step sizes is the Barzilai-Borwein step size (Barzilai & Borwein, 1988); however, to the best of our knowledge, these are known to provably work only for quadratic functions (Raydan, 1993; Burdakov et al., 2019). A recent work (Malitsky & Mishchenko, 2020) alleviates some of the concerns associated with these classical methods by setting the step size as an approximation of the inverse local Lipschitz constant; however, it is again not clear how this intuition can be applied to the federated setting. An orthogonal line of work has focused on methods that adapt to the geometry of the data using gradient information from previous iterations, the most popular among them being Adagrad (Duchi et al., 2011) and its extensions RMSProp (Tieleman et al., 2012) and Adadelta (Zeiler, 2012). There exist federated counterparts of these algorithms, namely FedAdagrad; however, as we show in our experiments, these methods can fail to even outperform FedAvg in standard FL tasks.

Overparameterization in FL.
Inspired by the success of analyzing deep neural networks in the neural tangent kernel (NTK) regime (Jacot et al., 2018; Arora et al., 2019; Allen-Zhu et al., 2019), recent work has looked at studying the convergence of overparameterized neural networks in the FL setting. Huang et al. (2021) and Deng et al. (2022) show that for a sufficiently wide neural network and proper step size conditions, FedAvg will converge to a globally optimal solution even in the presence of data heterogeneity. We note that these works are primarily concerned with convergence analysis, whereas our focus is on developing a practical algorithm that is inspired by characteristics of the overparameterized regime for speeding up FL training. Another recent line of work has looked at utilizing NTK-style Jacobian features for learning an FL model in just a few rounds of communication (Yu et al., 2022; Yue et al., 2022). While interesting, these approaches are orthogonal to our current work.

B to the server. This procedure is illustrated in Figure 5.

C PROOFS

We first state some preliminary lemmas that will be used throughout the proofs.

Lemma 2. (Jensen's inequality) For any $a_i \in \mathbb{R}^d$, $i \in \{1, 2, \dots, M\}$:
$$\left\|\frac{1}{M}\sum_{i=1}^{M} a_i\right\|^2 \leq \frac{1}{M}\sum_{i=1}^{M} \|a_i\|^2, \qquad \left\|\sum_{i=1}^{M} a_i\right\|^2 \leq M\sum_{i=1}^{M} \|a_i\|^2.$$

We also note the following known result related to the Bregman divergence.

Lemma 3. (Khaled et al., 2020) If $F$ is smooth and convex, then
$$\|\nabla F(w) - \nabla F(w')\|^2 \leq 2L\left(F(w) - F(w') - \langle \nabla F(w'), w - w' \rangle\right).$$

Lemma 4. (Co-coercivity of a convex smooth function) If $F$ is $L$-smooth and convex, then
$$\langle \nabla F(w) - \nabla F(w'), w - w' \rangle \geq \frac{1}{L}\|\nabla F(w) - \nabla F(w')\|^2.$$

A direct consequence of this lemma is
$$\langle \nabla F(w), w - w^* \rangle \geq \frac{1}{L}\|\nabla F(w)\|^2 \qquad (17)$$
where $w^*$ is a minimizer of $F(w)$.
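The two forms of Jensen's inequality in Lemma 2 are easy to sanity-check numerically; the vectors below are arbitrary choices of ours.

```python
# Numeric sanity check of the two forms of Jensen's inequality in Lemma 2,
# on a handful of arbitrary vectors in R^3.

def sq_norm(v):
    return sum(x * x for x in v)

vecs = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0], [-1.5, 0.5, 2.0], [2.0, 2.0, -1.0]]
M = len(vecs)

mean_vec = [sum(col) / M for col in zip(*vecs)]
sum_vec = [sum(col) for col in zip(*vecs)]

lhs_mean = sq_norm(mean_vec)                       # ||(1/M) sum a_i||^2
rhs_mean = sum(sq_norm(v) for v in vecs) / M       # (1/M) sum ||a_i||^2
lhs_sum = sq_norm(sum_vec)                         # ||sum a_i||^2
rhs_sum = M * sum(sq_norm(v) for v in vecs)        # M sum ||a_i||^2

assert lhs_mean <= rhs_mean and lhs_sum <= rhs_sum
```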

C.1 PROOF OF LEMMA 1

Let $F_i(w)$ be the local objective at a client and $w^*$ be the global minimum. From the overparameterization assumption, we know that $w^*$ is also a minimizer of $F_i(w)$. We have
$$\left\|w_i^{(t,k)} - w^*\right\|^2 = \left\|w_i^{(t,k-1)} - \eta_l \nabla F_i(w_i^{(t,k-1)}) - w^*\right\|^2 \qquad (18)$$
$$= \left\|w_i^{(t,k-1)} - w^*\right\|^2 - 2\eta_l \left\langle \nabla F_i(w_i^{(t,k-1)}),\; w_i^{(t,k-1)} - w^* \right\rangle + \eta_l^2 \left\|\nabla F_i(w_i^{(t,k-1)})\right\|^2 \qquad (19)$$
$$\leq \left\|w_i^{(t,k-1)} - w^*\right\|^2 - \frac{2\eta_l}{L}\left\|\nabla F_i(w_i^{(t,k-1)})\right\|^2 + \eta_l^2 \left\|\nabla F_i(w_i^{(t,k-1)})\right\|^2 \qquad (20)$$
$$\leq \left\|w_i^{(t,k-1)} - w^*\right\|^2 - \frac{\eta_l}{L}\left\|\nabla F_i(w_i^{(t,k-1)})\right\|^2 \qquad (21)$$
where (20) follows from (17) and (21) follows from $\eta_l \leq \frac{1}{L}$. Telescoping the above inequality over the $\tau$ local steps we have
$$\left\|w_i^{(t,\tau)} - w^*\right\|^2 \leq \left\|w^{(t)} - w^*\right\|^2 - \frac{\eta_l}{L}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w_i^{(t,k)})\right\|^2. \qquad (22)$$
Thus we have
$$\frac{1}{M}\sum_{i=1}^{M}\left\|w_i^{(t,\tau)} - w^*\right\|^2 \leq \left\|w^{(t)} - w^*\right\|^2 - \frac{\eta_l}{ML}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w_i^{(t,k)})\right\|^2 \qquad (23)$$
$$\leq \left\|w^{(t)} - w^*\right\|^2. \qquad (24)$$
This completes the proof of this lemma.
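The monotone decrease in (21) is easy to observe numerically; the 1-D quadratic below is our own toy choice, with $L$-smoothness constant $L = 2$ and $\eta_l = 1/(4) \leq 1/L$.

```python
# Numeric sanity check of Lemma 1's key step on a 1-D quadratic (our toy
# choice): with eta_l <= 1/L, each gradient step does not increase the
# distance to the minimizer w*.

def gd_distances(grad, w, w_star, eta, steps=20):
    """Run gradient descent and record |w - w*| after every step."""
    dists = [abs(w - w_star)]
    for _ in range(steps):
        w -= eta * grad(w)
        dists.append(abs(w - w_star))
    return dists

# F(w) = (w - 3)^2 is L-smooth with L = 2 and minimizer w* = 3.
dists = gd_distances(lambda w: 2 * (w - 3), w=0.0, w_star=3.0, eta=0.25)

assert all(b <= a for a, b in zip(dists, dists[1:]))  # monotone decrease
```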

C.2 CONVERGENCE ANALYSIS FOR CONVEX OBJECTIVES

Our proof technique is inspired by Khaled et al. (2020) with some key differences. The biggest difference is the incorporation of the adaptive FedExP server step sizes, which Khaled et al. (2020) do not account for. Another difference is that we provide convergence guarantees in terms of the number of rounds $T$, while Khaled et al. (2020) focus on the number of iterations $T' = T\tau$. We highlight the specific steps where we adjust the analysis of Khaled et al. (2020) below.

We begin by modifying Khaled et al. (2020, Lemma 11 and Lemma 13) to bound client drift in every round instead of every iteration.

Lemma 5. (Bounding client aggregate gradients)
$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w_i^{(t,k)})\right\|^2 \leq \frac{3L^2}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w_i^{(t,k)} - w^{(t)}\right\|^2 + 6\tau L\left(F(w^{(t)}) - F(w^*)\right) + 3\tau\sigma_*^2. \qquad (25)$$

Proof of Lemma 5:
$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w_i^{(t,k)})\right\|^2 = \frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w_i^{(t,k)}) - \nabla F_i(w^{(t)}) + \nabla F_i(w^{(t)}) - \nabla F_i(w^*) + \nabla F_i(w^*)\right\|^2 \qquad (26)$$
$$\leq \frac{3}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w_i^{(t,k)}) - \nabla F_i(w^{(t)})\right\|^2 + \frac{3}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w^{(t)}) - \nabla F_i(w^*)\right\|^2 + \frac{3}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w^*)\right\|^2 \qquad (27)$$
$$\leq \frac{3L^2}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w_i^{(t,k)} - w^{(t)}\right\|^2 + 6\tau L\left(F(w^{(t)}) - F^*\right) + 3\tau\sigma_*^2. \qquad (28)$$
The first term in (28) follows from the $L$-smoothness of $F_i(w)$, the second term follows from Lemma 3, and the third term follows from the bounded noise at the optimum.

Lemma 6. (Bounding client drift)
$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w^{(t)} - w_i^{(t,k)}\right\|^2 \leq 12\eta_l^2\tau^2(\tau-1)L\left(F(w^{(t)}) - F(w^*)\right) + 6\eta_l^2\tau^2(\tau-1)\sigma_*^2. \qquad (29)$$
Proof of Lemma 6:
$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w^{(t)} - w_i^{(t,k)}\right\|^2 = \eta_l^2\,\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\sum_{l=0}^{k-1}\nabla F_i(w_i^{(t,l)})\right\|^2 \qquad (30)$$
$$\leq \eta_l^2\,\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1} k\sum_{l=0}^{k-1}\left\|\nabla F_i(w_i^{(t,l)})\right\|^2 \qquad (31)$$
$$\leq \eta_l^2\,\tau(\tau-1)\,\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w_i^{(t,k)})\right\|^2 \qquad (32)$$
$$\leq 3\eta_l^2\tau(\tau-1)L^2\,\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w^{(t)} - w_i^{(t,k)}\right\|^2 + 6\eta_l^2\tau^2(\tau-1)L\left(F(w^{(t)}) - F(w^*)\right) + 3\eta_l^2\tau^2(\tau-1)\sigma_*^2 \qquad (33)$$
$$\leq \frac{1}{2M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w^{(t)} - w_i^{(t,k)}\right\|^2 + 6\eta_l^2\tau^2(\tau-1)L\left(F(w^{(t)}) - F(w^*)\right) + 3\eta_l^2\tau^2(\tau-1)\sigma_*^2 \qquad (34)$$
where (33) uses Lemma 5 and (34) uses $\eta_l \leq \frac{1}{6\tau L}$. Therefore we have
$$\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w^{(t)} - w_i^{(t,k)}\right\|^2 \leq 12\eta_l^2\tau^2(\tau-1)L\left(F(w^{(t)}) - F(w^*)\right) + 6\eta_l^2\tau^2(\tau-1)\sigma_*^2.$$

Proof of Theorem 1: We define the following auxiliary variables that will be used in the proof. Aggregate client gradient: $h_i^{(t)} = \sum_{k=0}^{\tau-1}\nabla F_i(w_i^{(t,k)})$; we also define $\bar{h}^{(t)} = \frac{1}{M}\sum_{i=1}^{M} h_i^{(t)}$. Recall that the update of the global model can be written as $w^{(t+1)} = w^{(t)} - \eta_g^{(t)}\eta_l \bar{h}^{(t)}$. We have
$$\left\|w^{(t+1)} - w^*\right\|^2 = \left\|w^{(t)} - \eta_g^{(t)}\eta_l\bar{h}^{(t)} - w^*\right\|^2 \qquad (37)$$
$$= \left\|w^{(t)} - w^*\right\|^2 - 2\eta_g^{(t)}\eta_l\left\langle w^{(t)} - w^*, \bar{h}^{(t)}\right\rangle + (\eta_g^{(t)})^2\eta_l^2\left\|\bar{h}^{(t)}\right\|^2 \qquad (38)$$
$$\leq \left\|w^{(t)} - w^*\right\|^2 - 2\eta_g^{(t)}\eta_l\left\langle w^{(t)} - w^*, \bar{h}^{(t)}\right\rangle + \eta_g^{(t)}\eta_l^2\,\frac{1}{M}\sum_{i=1}^{M}\left\|h_i^{(t)}\right\|^2 \qquad (39)$$
where (39) follows from $\eta_g^{(t)} \leq \frac{\sum_{i=1}^{M}\|h_i^{(t)}\|^2}{M\|\bar{h}^{(t)}\|^2}$. Inequality (39) is a key step in our proof and the differentiating factor of our approach from Khaled et al. (2020). Following a similar technique as Khaled et al. (2020) to bound $(\eta_g^{(t)})^2\eta_l^2\|\bar{h}^{(t)}\|^2$ would end up requiring the condition $\eta_l \leq 1/(8L\eta_g^{(t)})$, which cannot be satisfied in our setup due to the adaptive choice of $\eta_g^{(t)}$. Therefore we first upper bound $(\eta_g^{(t)})^2\eta_l^2\|\bar{h}^{(t)}\|^2$ by $\eta_g^{(t)}\eta_l^2\frac{1}{M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2$ and focus on further bounding this quantity in the rest of the proof, which does not require the aforementioned condition. Note that this comes at the expense of the additional $T_3$ error seen in our final convergence bound in Theorem 1.
Therefore,
$$\left\|w^{(t+1)} - w^*\right\|^2 \leq \left\|w^{(t)} - w^*\right\|^2 - 2\eta_g^{(t)}\eta_l\underbrace{\left\langle w^{(t)} - w^*, \bar{h}^{(t)}\right\rangle}_{T_1} + \eta_g^{(t)}\eta_l^2\underbrace{\frac{1}{M}\sum_{i=1}^{M}\left\|h_i^{(t)}\right\|^2}_{T_2}. \qquad (40)$$

Bounding $T_2$: we have
$$T_2 = \frac{1}{M}\sum_{i=1}^{M}\left\|h_i^{(t)}\right\|^2 \qquad (41)$$
$$= \frac{1}{M}\sum_{i=1}^{M}\left\|\sum_{k=0}^{\tau-1}\nabla F_i(w_i^{(t,k)})\right\|^2 \qquad (42)$$
$$\leq \frac{\tau}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|\nabla F_i(w_i^{(t,k)})\right\|^2 \qquad (43)$$
$$\leq \frac{3\tau L^2}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w_i^{(t,k)} - w^{(t)}\right\|^2 + 6\tau^2 L\left(F(w^{(t)}) - F^*\right) + 3\tau^2\sigma_*^2 \qquad (44)$$
where (43) follows from Jensen's inequality and (44) follows from Lemma 5.

Bounding $T_1$:
$$T_1 = \frac{1}{M}\sum_{i=1}^{M}\left\langle w^{(t)} - w^*, h_i^{(t)}\right\rangle \qquad (45)$$
$$= \frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\langle w^{(t)} - w^*, \nabla F_i(w_i^{(t,k)})\right\rangle. \qquad (46)$$
We have
$$\left\langle w^{(t)} - w^*, \nabla F_i(w_i^{(t,k)})\right\rangle = \left\langle w^{(t)} - w_i^{(t,k)}, \nabla F_i(w_i^{(t,k)})\right\rangle + \left\langle w_i^{(t,k)} - w^*, \nabla F_i(w_i^{(t,k)})\right\rangle. \qquad (47)$$
From the $L$-smoothness of $F_i$ we have
$$\left\langle w^{(t)} - w_i^{(t,k)}, \nabla F_i(w_i^{(t,k)})\right\rangle \geq F_i(w^{(t)}) - F_i(w_i^{(t,k)}) - \frac{L}{2}\left\|w^{(t)} - w_i^{(t,k)}\right\|^2. \qquad (48)$$
From the convexity of $F_i$ we have
$$\left\langle w_i^{(t,k)} - w^*, \nabla F_i(w_i^{(t,k)})\right\rangle \geq F_i(w_i^{(t,k)}) - F_i(w^*). \qquad (49)$$
Therefore, adding the above inequalities, we have
$$\left\langle w^{(t)} - w^*, \nabla F_i(w_i^{(t,k)})\right\rangle \geq F_i(w^{(t)}) - F_i(w^*) - \frac{L}{2}\left\|w^{(t)} - w_i^{(t,k)}\right\|^2. \qquad (50)$$
Substituting (50) in (46) we have
$$T_1 \geq \tau\left(F(w^{(t)}) - F(w^*)\right) - \frac{L}{2M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w^{(t)} - w_i^{(t,k)}\right\|^2. \qquad (51)$$
Here we would like to note that the bound for $T_1$ is our contribution and is needed in our proof due to the relaxation in (39). The bound for $T_2$ follows a similar technique as Khaled et al. (2020, Lemma 12).
Substituting the bounds for $T_1$ and $T_2$ in (40), we have
$$\left\|w^{(t+1)} - w^*\right\|^2 \leq \left\|w^{(t)} - w^*\right\|^2 - 2\eta_g^{(t)}\eta_l\tau(1 - 3\eta_l\tau L)\left(F(w^{(t)}) - F(w^*)\right) + 3\eta_g^{(t)}\eta_l^2\tau^2\sigma_*^2 + \left(3\eta_g^{(t)}\eta_l^2\tau L^2 + \eta_g^{(t)}\eta_l L\right)\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w_i^{(t,k)} - w^{(t)}\right\|^2$$
$$\leq \left\|w^{(t)} - w^*\right\|^2 - \eta_g^{(t)}\eta_l\tau\left(F(w^{(t)}) - F(w^*)\right) + 3\eta_g^{(t)}\eta_l^2\tau^2\sigma_*^2 + 2\eta_g^{(t)}\eta_l L\,\frac{1}{M}\sum_{i=1}^{M}\sum_{k=0}^{\tau-1}\left\|w_i^{(t,k)} - w^{(t)}\right\|^2 \qquad (52)$$
$$\leq \left\|w^{(t)} - w^*\right\|^2 - \eta_g^{(t)}\eta_l\tau\left(F(w^{(t)}) - F(w^*)\right) + 3\eta_g^{(t)}\eta_l^2\tau^2\sigma_*^2 + 24\eta_g^{(t)}\eta_l^3\tau^2(\tau-1)L^2\left(F(w^{(t)}) - F(w^*)\right) + 12\eta_g^{(t)}\eta_l^3\tau^2(\tau-1)L\sigma_*^2 \qquad (53)$$
$$\leq \left\|w^{(t)} - w^*\right\|^2 - \frac{\eta_g^{(t)}\eta_l\tau}{3}\left(F(w^{(t)}) - F(w^*)\right) + 3\eta_g^{(t)}\eta_l^2\tau^2\sigma_*^2 + 12\eta_g^{(t)}\eta_l^3\tau^2(\tau-1)L\sigma_*^2 \qquad (54)$$
where both (52) and (54) use $\eta_l \leq \frac{1}{6\tau L}$, and (53) uses Lemma 6. Rearranging terms and averaging over all rounds, we have
$$\frac{\sum_{t=0}^{T-1}\eta_g^{(t)}\left(F(w^{(t)}) - F(w^*)\right)}{\sum_{t=0}^{T-1}\eta_g^{(t)}} \leq \frac{3\left\|w^{(0)} - w^*\right\|^2}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}} + 9\eta_l\tau\sigma_*^2 + 36\eta_l^2\tau(\tau-1)L\sigma_*^2.$$
This implies
$$F(\bar{w}^{(T)}) - F(w^*) \leq \mathcal{O}\!\left(\frac{\left\|w^{(0)} - w^*\right\|^2}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}}\right) + \mathcal{O}\!\left(\eta_l^2\tau(\tau-1)L\sigma_*^2\right) + \mathcal{O}\!\left(\eta_l\tau\sigma_*^2\right)$$
where $\bar{w}^{(T)} = \frac{\sum_{t=0}^{T-1}\eta_g^{(t)} w^{(t)}}{\sum_{t=0}^{T-1}\eta_g^{(t)}}$. This completes the proof.

C.3 CONVERGENCE ANALYSIS FOR NON-CONVEX OBJECTIVES

Our proof technique is inspired by Wang et al. (2020), and we use one of their intermediate results to bound client drift in non-convex settings, as described below. We highlight the specific steps where we adjust the analysis of Wang et al. (2020).

We begin by defining the following auxiliary variables that will be used in the proof. Normalized gradient: $h_i^{(t)} = \frac{1}{\tau}\sum_{k=0}^{\tau-1}\nabla F_i(w_i^{(t,k)})$; we also define $\bar{h}^{(t)} = \frac{1}{M}\sum_{i=1}^{M} h_i^{(t)}$.

Lemma 7. (Bounding client drift in the non-convex setting)
$$\frac{1}{M}\sum_{i=1}^{M}\left\|\nabla F_i(w^{(t)}) - h_i^{(t)}\right\|^2 \leq \frac{1}{8}\left\|\nabla F(w^{(t)})\right\|^2 + 5\eta_l^2 L^2\tau(\tau-1)\sigma_g^2.$$

Proof of Lemma 7: Let $D = 4\eta_l^2 L^2\tau(\tau-1)$. We have the following bound from equation (87) in Wang et al. (2020):
$$\frac{1}{M}\sum_{i=1}^{M}\left\|\nabla F_i(w^{(t)}) - h_i^{(t)}\right\|^2 \leq \frac{D}{1-D}\left\|\nabla F(w^{(t)})\right\|^2 + \frac{D\sigma_g^2}{1-D}.$$
From $\eta_l \leq \frac{1}{6\tau L}$ we have $D \leq \frac{1}{9}$, which implies $\frac{1}{1-D} \leq \frac{9}{8}$ and $\frac{D}{1-D} \leq \frac{1}{8}$. Therefore we have
$$\frac{1}{M}\sum_{i=1}^{M}\left\|\nabla F_i(w^{(t)}) - h_i^{(t)}\right\|^2 \leq \frac{1}{8}\left\|\nabla F(w^{(t)})\right\|^2 + \frac{9D}{8}\sigma_g^2 \qquad (60)$$
$$\leq \frac{1}{8}\left\|\nabla F(w^{(t)})\right\|^2 + 5\eta_l^2 L^2\tau(\tau-1)\sigma_g^2. \qquad (61)$$

Proof of Theorem 2: The update of the global model can be written as
$$w^{(t+1)} = w^{(t)} - \eta_g^{(t)}\eta_l\tau\,\bar{h}^{(t)}. \qquad (62)$$
Now, using the Lipschitz-smoothness assumption, we have
$$F(w^{(t+1)}) - F(w^{(t)}) \leq -\eta_g^{(t)}\eta_l\tau\left\langle\nabla F(w^{(t)}), \bar{h}^{(t)}\right\rangle + (\eta_g^{(t)})^2\eta_l^2\tau^2\frac{L}{2}\left\|\bar{h}^{(t)}\right\|^2 \qquad (63)$$
$$\leq -\eta_g^{(t)}\eta_l\tau\left\langle\nabla F(w^{(t)}), \bar{h}^{(t)}\right\rangle + \eta_g^{(t)}\eta_l^2\tau^2\frac{L}{2M}\sum_{i=1}^{M}\left\|h_i^{(t)}\right\|^2 \qquad (64)$$
where (64) uses $\eta_g^{(t)} \leq \frac{\sum_{i=1}^{M}\|h_i^{(t)}\|^2}{M\|\bar{h}^{(t)}\|^2}$. As in the convex case, inequality (64) is a key step in our proof and the differentiating factor of our approach from Wang et al. (2020). Following a similar technique as Wang et al. (2020) to bound $(\eta_g^{(t)})^2\eta_l^2\tau^2\frac{L}{2}\|\bar{h}^{(t)}\|^2$ would need the condition $\eta_l \leq 1/(2L\tau\eta_g^{(t)})$, which cannot be satisfied in our setup due to the adaptive choice of $\eta_g^{(t)}$.
Therefore we first upper bound $(\eta_g^{(t)})^2\eta_l^2\tau^2\frac{L}{2}\|\bar{h}^{(t)}\|^2$ by $\eta_g^{(t)}\eta_l^2\tau^2\frac{L}{2M}\sum_{i=1}^{M}\|h_i^{(t)}\|^2$ and focus on further bounding this quantity in the rest of the proof, which does not require the aforementioned condition. Note that this comes at the expense of the additional $T_3$ error seen in our final convergence bound in Theorem 2. Therefore we have
$$F(w^{(t+1)}) - F(w^{(t)}) \leq -\eta_g^{(t)}\eta_l\tau\underbrace{\left\langle\nabla F(w^{(t)}), \bar{h}^{(t)}\right\rangle}_{T_1} + \eta_g^{(t)}\eta_l^2\tau^2\frac{L}{2}\underbrace{\frac{1}{M}\sum_{i=1}^{M}\left\|h_i^{(t)}\right\|^2}_{T_2}. \qquad (65)$$

Bounding $T_1$: we have
$$T_1 = \left\langle\nabla F(w^{(t)}), \frac{1}{M}\sum_{i=1}^{M} h_i^{(t)}\right\rangle \qquad (66)$$
$$= \frac{1}{2}\left\|\nabla F(w^{(t)})\right\|^2 + \frac{1}{2}\left\|\frac{1}{M}\sum_{i=1}^{M} h_i^{(t)}\right\|^2 - \frac{1}{2}\left\|\nabla F(w^{(t)}) - \frac{1}{M}\sum_{i=1}^{M} h_i^{(t)}\right\|^2 \qquad (67)$$
$$\geq \frac{1}{2}\left\|\nabla F(w^{(t)})\right\|^2 - \frac{1}{2M}\sum_{i=1}^{M}\left\|\nabla F_i(w^{(t)}) - h_i^{(t)}\right\|^2 \qquad (68)$$
where (67) uses $\langle a, b\rangle = \frac{1}{2}\|a\|^2 + \frac{1}{2}\|b\|^2 - \frac{1}{2}\|a - b\|^2$, and (68) uses Jensen's inequality and the definition of the global objective function $F$.

Bounding $T_2$: we have
$$T_2 = \frac{1}{M}\sum_{i=1}^{M}\left\|h_i^{(t)}\right\|^2 \qquad (69)$$
$$= \frac{1}{M}\sum_{i=1}^{M}\left\|h_i^{(t)} - \nabla F_i(w^{(t)}) + \nabla F_i(w^{(t)}) - \nabla F(w^{(t)}) + \nabla F(w^{(t)})\right\|^2 \qquad (70)$$
$$\leq \frac{3}{M}\sum_{i=1}^{M}\left(\left\|h_i^{(t)} - \nabla F_i(w^{(t)})\right\|^2 + \left\|\nabla F_i(w^{(t)}) - \nabla F(w^{(t)})\right\|^2 + \left\|\nabla F(w^{(t)})\right\|^2\right) \qquad (71)$$
$$\leq \frac{3}{M}\sum_{i=1}^{M}\left\|h_i^{(t)} - \nabla F_i(w^{(t)})\right\|^2 + 3\sigma_g^2 + 3\left\|\nabla F(w^{(t)})\right\|^2 \qquad (72)$$
where (71) uses Jensen's inequality and (72) uses the bounded data heterogeneity assumption. Here we would like to note that the bound for $T_2$ is our contribution and is needed in our proof due to the relaxation in (64). The bound for $T_1$ follows a similar technique as in Wang et al. (2020).

Substituting the $T_1$ and $T_2$ bounds into (65), we have
$$F(w^{(t+1)}) - F(w^{(t)}) \leq -\eta_g^{(t)}\eta_l\tau\left(\frac{1}{2}\left\|\nabla F(w^{(t)})\right\|^2 - \frac{1}{2M}\sum_{i=1}^{M}\left\|\nabla F_i(w^{(t)}) - h_i^{(t)}\right\|^2\right) + \eta_g^{(t)}\eta_l^2\tau^2\frac{L}{2}\left(3\sigma_g^2 + 3\left\|\nabla F(w^{(t)})\right\|^2 + \frac{3}{M}\sum_{i=1}^{M}\left\|h_i^{(t)} - \nabla F_i(w^{(t)})\right\|^2\right) \qquad (73)$$
$$\leq -\eta_g^{(t)}\eta_l\tau\left(\frac{1}{4}\left\|\nabla F(w^{(t)})\right\|^2 - \frac{1}{M}\sum_{i=1}^{M}\left\|\nabla F_i(w^{(t)}) - h_i^{(t)}\right\|^2\right) + 3\eta_g^{(t)}\eta_l^2\tau^2 L\sigma_g^2 \qquad (74)$$
$$\leq -\eta_g^{(t)}\eta_l\tau\,\frac{1}{8}\left\|\nabla F(w^{(t)})\right\|^2 + 3\eta_g^{(t)}\eta_l^2\tau^2 L\sigma_g^2 + 5\eta_g^{(t)}\eta_l^3 L^2\tau^2(\tau-1)\sigma_g^2 \qquad (75)$$
where (74) uses $\eta_l \leq \frac{1}{6\tau L}$ and (75) uses Lemma 7. Thus, rearranging terms and averaging over all rounds, we have
$$\frac{\sum_{t=0}^{T-1}\eta_g^{(t)}\left\|\nabla F(w^{(t)})\right\|^2}{\sum_{t=0}^{T-1}\eta_g^{(t)}} \leq \frac{8\left(F(w^{(0)}) - F^*\right)}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}} + 40\eta_l^2 L^2\tau(\tau-1)\sigma_g^2 + 24\eta_l L\tau\sigma_g^2.$$
This implies
$$\min_{t\in[T]}\left\|\nabla F(w^{(t)})\right\|^2 \leq \mathcal{O}\!\left(\frac{F(w^{(0)}) - F^*}{\eta_l\tau\sum_{t=0}^{T-1}\eta_g^{(t)}}\right) + \mathcal{O}\!\left(\eta_l^2 L^2\tau(\tau-1)\sigma_g^2\right) + \mathcal{O}\!\left(\eta_l L\tau\sigma_g^2\right). \qquad (77)$$
This completes the proof.

C.4 EXACT PROJECTION WITH GRADIENT DESCENT FOR LINEAR REGRESSION

Let $F(w) = \|Aw - b\|^2$, where $A$ is an $n \times d$ matrix and $b$ is an $n$-dimensional vector. We assume that $d \geq n$ and that $A$ has rank $n$. The singular value decomposition (SVD) of $A$ can be written as
$$A = U\Sigma V^\top = U\,[\Sigma_1 \;\; 0]\begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix} = U\Sigma_1 V_1^\top. \qquad (78)$$
Therefore,
$$w^{(T)} = (I - \eta_l A^\top A)^T w^{(0)} + \eta_l\sum_{t=0}^{T-1}(I - \eta_l A^\top A)^t A^\top b \qquad (93)$$
$$= V(I - \eta_l\Sigma^\top\Sigma)^T V^\top w^{(0)} + \eta_l\sum_{t=0}^{T-1} V(I - \eta_l\Sigma^\top\Sigma)^t\,\Sigma^\top U^\top b \qquad (94)$$
$$= \left(V_1(I - \eta_l\Sigma_1^2)^T V_1^\top + V_2V_2^\top\right)w^{(0)} + \eta_l V_1\sum_{t=0}^{T-1}(I - \eta_l\Sigma_1^2)^t\,\Sigma_1 U^\top b.$$
In the limit $T \to \infty$ and with $\eta_l \leq 1/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $A^\top A$, we have
$$\lim_{T\to\infty}(I - \eta_l\Sigma_1^2)^T = 0 \quad \text{and} \quad \lim_{T\to\infty}\sum_{t=0}^{T-1}(I - \eta_l\Sigma_1^2)^t = \frac{1}{\eta_l}\Sigma_1^{-2}.$$
Thus,
$$\lim_{T\to\infty} w^{(T)} = V_2V_2^\top w^{(0)} + V_1\Sigma_1^{-1}U^\top b \qquad (97)$$
$$= P_{S^*}(w^{(0)}). \qquad (98)$$

Thus, in this setting, $w_i^{(t,\tau)} = P_{S_i^*}(w^{(t)})$ for all $i \in [M]$, i.e., the local models are an exact projection of $w^{(t)}$ onto their respective solution sets. From (8) we have
$$(\eta_g^{(t)})_{\text{opt}} = \frac{\left\langle w^{(t)} - w^*, \bar{\Delta}^{(t)}\right\rangle}{\left\|\bar{\Delta}^{(t)}\right\|^2} = \frac{\sum_{i=1}^{M}\left\langle w^{(t)} - w^*, \Delta_i^{(t)}\right\rangle}{M\left\|\bar{\Delta}^{(t)}\right\|^2}.$$
We can lower bound $\langle w^{(t)} - w^*, \Delta_i^{(t)}\rangle$ as follows:
$$\left\langle w^{(t)} - w^*, \Delta_i^{(t)}\right\rangle = \left\langle w^{(t)} - w_i^{(t,\tau)} + w_i^{(t,\tau)} - w^*,\; w^{(t)} - w_i^{(t,\tau)}\right\rangle \qquad (101)$$
$$= \left\|w^{(t)} - w_i^{(t,\tau)}\right\|^2 + \left\langle w_i^{(t,\tau)} - w^*,\; w^{(t)} - w_i^{(t,\tau)}\right\rangle \qquad (102)$$
$$\geq \left\|w^{(t)} - w_i^{(t,\tau)}\right\|^2 \qquad (103)$$
$$= \left\|\Delta_i^{(t)}\right\|^2 \qquad (104)$$
where (103) uses the fact that $\langle w_i^{(t,\tau)} - w^*, w^{(t)} - w_i^{(t,\tau)}\rangle \geq 0$, following the properties of projection (Boyd & Dattorro, 2003). Thus we have
$$(\eta_g^{(t)})_{\text{opt}} \geq \frac{\sum_{i=1}^{M}\left\|\Delta_i^{(t)}\right\|^2}{M\left\|\bar{\Delta}^{(t)}\right\|^2}.$$
Note here the improvement by a factor of 2 in the lower bound compared to (8).
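The limit in (97)-(98) can be checked numerically on a tiny instance; the $1 \times 2$ system below is our own choice, for which the projection of $w^{(0)}$ onto the solution set $\{w : w_1 + w_2 = 1\}$ has a simple closed form.

```python
# Numeric check that gradient descent on F(w) = ||Aw - b||^2 with d > n
# converges to the Euclidean projection of w^(0) onto the solution set S*.
# Toy instance (ours): A = [1, 1], b = 1, so S* = {w : w1 + w2 = 1}.

def gd_least_squares(w, a, b, eta=0.1, iters=2000):
    for _ in range(iters):
        r = sum(ai * wi for ai, wi in zip(a, w)) - b          # residual Aw - b
        w = [wi - eta * 2 * r * ai for ai, wi in zip(a, w)]   # grad = 2 A^T (Aw - b)
    return w

w0 = [0.2, 0.0]
w_inf = gd_least_squares(w0, a=[1.0, 1.0], b=1.0)

# Closed-form projection of w0 onto {w : w1 + w2 = 1}:
shift = (1.0 - (w0[0] + w0[1])) / 2.0
proj = [w0[0] + shift, w0[1] + shift]   # = [0.6, 0.4]
```

Gradient descent leaves the component of $w^{(0)}$ along the null space of $A$ (here, the direction $w_1 - w_2$) untouched, which is exactly why the limit is the projection rather than an arbitrary solution.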

D ADDITIONAL EXPERIMENTS AND SETUP DETAILS

Our code is available at the following link https://github.com/Divyansh03/FedExP.

D.1 IMPACT OF AVERAGING ITERATES FOR NEURAL NETWORKS

As discussed in Section 5, we find that setting the final FedExP model to the average of the last two iterates also improves performance when training neural networks in practical FL scenarios. To demonstrate this, we consider an experiment on the CIFAR-10 dataset with 10 clients, where the data at each client is distributed using a Dirichlet distribution with α = 0.3. We set the number of local steps to τ = 20 and train a CNN model having the same architecture as outlined in McMahan et al. (2017) with full client participation. Figure 6 shows the training accuracy as a function of the last iterate and the average of the last two iterates for FedAvg and FedExP. We see that the last iterate of FedExP has an oscillating behavior that can hide improvements in training accuracy. On the other hand, the average of the last two iterates of FedExP produces a more stable training curve and shows a considerable improvement in the final accuracy. Note, however, that this improvement only appears for FedExP; averaging iterates does not make a significant difference for FedAvg.
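The tail-averaging trick described above amounts to keeping one extra buffer on the server and reporting the mean of the last two global models; a minimal sketch (function names are ours, and the oscillating toy dynamics stand in for the FedExP iterates):

```python
# Sketch of the two-iterate tail averaging used for the final FedExP model:
# keep the previous global model and report (w^(T) + w^(T-1)) / 2.

def run_rounds(w, server_round, T):
    """server_round: maps w^(t) to w^(t+1); returns the averaged final model."""
    w_prev = w
    for _ in range(T):
        w_prev, w = w, server_round(w)
    return [(a + b) / 2.0 for a, b in zip(w_prev, w)]

# Toy dynamics (ours) that oscillate around the fixed point [1.0]:
final = run_rounds([0.0], lambda w: [2.0 - w[0]], T=5)
```

In this toy example the last iterate keeps bouncing between 0 and 2, while the average of the last two iterates sits at the fixed point, mirroring the smoothing effect seen in Figure 6.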

D.2 DATASET DETAILS

Here we provide more details about the datasets used in Section 6.

Synthetic linear regression. In this case we assume that the local objective of each client is given by $F_i(w) = \|A_i w - b_i\|^2$, where $A_i \in \mathbb{R}^{30\times 1000}$, $b_i \in \mathbb{R}^{30}$ and $w \in \mathbb{R}^{1000}$. We set the number of clients to $M = 20$. Note that since $d \geq \sum_{i=1}^{M} n_i$, this is an overparameterized convex problem. To generate $A_i$ and $b_i$, we follow a similar process as Li et al. (2020). We have $(A_i)_{j:} \sim \mathcal{N}(m_i, I_d)$ and $(b_i)_j = w_i^\top (A_i)_{j:}$, where $m_i \sim \mathcal{N}(u_i, 1)$, $w_i \sim \mathcal{N}(y_i, 1)$, $u_i \sim \mathcal{N}(0, 0.1)$, $y_i \sim \mathcal{N}(0, 0.1)$.

EMNIST. EMNIST is an image classification task consisting of handwritten characters associated with 62 labels. The federated EMNIST dataset available at Caldas et al. (2019) is naturally partitioned into 3400 clients based on the identities of the character authors. The numbers of training and test samples are 671,585 and 77,483 respectively.

CIFAR-10/100. CIFAR-10 is a natural image dataset consisting of 60,000 32×32 images divided into 10 classes. CIFAR-100 uses a finer labeling of the CIFAR images to divide them into 100 classes, making it a harder dataset for image classification. In both cases the numbers of training and test examples are 50,000 and 10,000 respectively. To simulate a federated setting, we artificially partition the training data into 100 clients following the procedure outlined in Hsu et al. (2019).

CINIC-10. CINIC-10 is a natural image dataset that can be used as a direct replacement for CIFAR in machine learning tasks. It is intended to be a harder dataset than CIFAR-10 while being easier than CIFAR-100. The numbers of training and test examples are both 90,000. We partition the training data into 200 clients in this case, following a similar procedure as for CIFAR.
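The synthetic generation process above can be sketched as follows. Dimensions are reduced for speed, `random.gauss` takes a standard deviation (hence the square roots on the 0.1 variances), and the per-coordinate use of $m_i$ is our reading of $(A_i)_{j:} \sim \mathcal{N}(m_i, I_d)$.

```python
import random

# Sketch of the synthetic linear-regression data generation described above,
# following Li et al. (2020): Gaussian rows with a per-client mean and a
# per-client ground-truth weight vector. Dimensions reduced for speed.

def make_client(n=30, d=50, seed=0):
    rng = random.Random(seed)
    u = rng.gauss(0, 0.1 ** 0.5)                      # u_i ~ N(0, 0.1)
    y = rng.gauss(0, 0.1 ** 0.5)                      # y_i ~ N(0, 0.1)
    m = rng.gauss(u, 1.0)                             # m_i ~ N(u_i, 1)
    w_true = [rng.gauss(y, 1.0) for _ in range(d)]    # w_i ~ N(y_i, 1)
    A = [[rng.gauss(m, 1.0) for _ in range(d)] for _ in range(n)]
    b = [sum(wj * aj for wj, aj in zip(w_true, row)) for row in A]
    return A, b

A, b = make_client()
```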

D.3 HYPERPARAMETER DETAILS

For our baselines, we find the best performing $\eta_g$ and $\eta_l$ by grid-search tuning. For FedExP we search for $\epsilon$ and $\eta_l$. This is done by running algorithms for 50 rounds and finding the parameters that achieve the highest training accuracy averaged over the last 10 rounds. We provide details of the grid used for each experiment below.

Grid for Synthetic.

For FedAvg and SCAFFOLD, the grid for $\eta_g$ is $\{10^{0}, 10^{0.5}, 10^{1}, 10^{1.5}, 10^{2}\}$. For FedAdagrad, the grid for $\eta_g$ is $\{10^{-1}, 10^{-0.5}, 10^{0}, 10^{0.5}, 10^{1}\}$. For FedExP we keep $\epsilon = 0$ in this experiment, as (6) is satisfied in this case. The grid for $\eta_l$ is $\{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\}$ for all algorithms.

Grid for neural network experiments. For FedAvg and SCAFFOLD, the grid for $\eta_g$ is $\{10^{-1}, 10^{-0.5}, 10^{0}, 10^{0.5}, 10^{1}\}$. For FedAdagrad, the grid for $\eta_g$ is $\{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\}$. For FedExP, the grid for $\epsilon$ is $\{10^{-3}, 10^{-2.5}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$. The grid for $\eta_l$ is $\{10^{-2}, 10^{-1.5}, 10^{-1}, 10^{-0.5}, 10^{0}\}$ for all algorithms. We use lower values of $\eta_g$ in the grid for FedAdagrad based on observations from Reddi et al. (2021), which show that FedAdagrad performs better with smaller values of the server step size. We provide details of the best performing hyperparameters below.

Table 3: Base-10 logarithm of the best combination of $\epsilon$ and $\eta_l$ for FedExP and combination of $\eta_l$ and $\eta_g$ for baselines. For the synthetic dataset we keep $\epsilon = 0$ for FedExP.

Dataset      FedExP ($\epsilon$, $\eta_l$)   FedAvg ($\eta_g$, $\eta_l$)   SCAFFOLD ($\eta_g$, $\eta_l$)   FedAdagrad ($\eta_g$, $\eta_l$)
Synthetic    (*, -1)                         (1, -1)                       (1, -1)                         (-1, -1)
EMNIST       (-1, -0.5)                      (0, -0.5)                     (0, -0.5)                       (-0.5, -0.5)
CIFAR-10     (-3, -2)                        (0, -2)                       (0, -2)                         (-1, -2)
CIFAR-100    (-3, -2)                        (0, -2)                       (0, -2)                         (-1, -2)
CINIC-10     (-3, -2)                        (0, -2)                       (0, -2)                         (-1, -2)

Other hyperparameters are kept the same for all algorithms. In particular, we apply a weight decay of 0.0001 for all algorithms and decay $\eta_l$ by a factor of 0.998 in every round. We also use gradient clipping to improve the stability of the algorithms, as done in previous works (Acar et al., 2021). In all experiments we fix the number of participating clients to 20, the minibatch size to 50 (for the synthetic dataset this reduces to full-batch gradient descent), and the number of local updates $\tau$ to 20.

D.4 SENSITIVITY OF FEDEXP TO ϵ

To evaluate the sensitivity of FedExP to $\epsilon$, we compute the training accuracy of FedExP after 500 rounds for varying $\epsilon$ on different tasks. For each task, we fix $\eta_l$ to the value used in our experiments in Section 6 and only vary $\epsilon$. The results are summarized below.

Table 4: Training accuracy obtained by FedExP with different choices of $\epsilon$ ($\epsilon \in \{10^{-3}, 10^{-2.5}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$) after 500 rounds of training on various tasks. The value of $\eta_l$ is fixed for each task ($10^{-0.5}$ for EMNIST and $10^{-2}$ for others). Results are averaged over the last 10 rounds.

We see that the sensitivity to $\epsilon$ is similar to that of the $\tau$ parameter which is added to the denominator of FedAdam and FedAdagrad (Reddi et al., 2021) to prevent the step size from blowing up. Keeping $\epsilon$ too large reduces the adaptivity of the method and makes the behavior similar to FedAvg. At the same time, keeping $\epsilon$ too small may not always be beneficial either, as seen in the case of EMNIST. In practice, we find that a grid search for $\epsilon$ in the range $\{10^{-3}, 10^{-2.5}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$ usually suffices to yield a good value of $\epsilon$. A general rule of thumb would be to start with $\epsilon = 10^{-3}$ and increase $\epsilon$ until the performance drops.
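The rule of thumb above (start at $\epsilon = 10^{-3}$ and increase until performance drops) can be written as a simple sweep. The `evaluate` callable is a placeholder of ours for training FedExP with a given $\epsilon$ and reporting training accuracy.

```python
# Sketch of the epsilon rule of thumb described above: sweep epsilon upward
# from 1e-3 and stop when accuracy drops. `evaluate` is a placeholder for
# running FedExP with the given epsilon and returning an accuracy.

def pick_epsilon(evaluate, grid=(1e-3, 10 ** -2.5, 1e-2, 10 ** -1.5, 1e-1)):
    best_eps, best_acc = grid[0], evaluate(grid[0])
    for eps in grid[1:]:
        acc = evaluate(eps)
        if acc < best_acc:       # performance dropped: stop the sweep
            break
        best_eps, best_acc = eps, acc
    return best_eps

# Toy stand-in (ours) where accuracy peaks at eps = 1e-2:
eps = pick_epsilon(lambda e: -abs(e - 1e-2))
```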

D.5 ADDITIONAL RESULTS

In this section, we provide additional results obtained from our experiments.

Synthetic linear regression. Note that for the synthetic linear regression experiments there is no test data. Also note that there is no randomness in this experiment, since clients compute full-batch gradients with full participation. We provide the plot of $\eta_g^{(t)}$ for FedExP in Figure 7. We see that FedExP takes much larger steps in some (but not all) rounds compared to the constant optimum step size taken by our baselines, leading to a large speedup. Recall that we also set $\epsilon = 0$ in this experiment (since it aligns with our theory), which also explains the larger values of $\eta_g^{(t)}$.

EMNIST. For EMNIST we observe that SCAFFOLD gives slightly better training loss than FedExP towards the end of training. As described in Section 6, extrapolation can be combined with the variance reduction in SCAFFOLD (the resulting algorithm is referred to as SCAFFOLD-ExP) to further improve performance. This gives the best result in this case, as shown in Figure 8.

CIFAR-10, CIFAR-100 and CINIC-10. From Figure 3 and Figures 9-11, we see that FedExP comprehensively outperforms baselines in these cases, achieving almost 10%-20% higher accuracy than the closest baseline by the end of training. The margin of improvement is largest for CIFAR-100, which can be considered the toughest dataset in our experiments. This points to the practical utility of FedExP even in challenging FL scenarios.

E COMBINING EXTRAPOLATION WITH SCAFFOLD

As described in Section 6, the extrapolation step can be added to the SCAFFOLD algorithm in a similar way as in FedExP. The detailed steps of this SCAFFOLD-ExP algorithm are shown in Algorithm 2.

Algorithm 2 SCAFFOLD-ExP
1: Input: $w^{(0)}$, control variate $c^{(0)}$, $c_i$, $\forall i \in [M]$, number of rounds $T$, local iteration steps $\tau$, parameters $\eta_l$, $\epsilon$
2: For $t = 0, \dots, T-1$ communication rounds do:
     For $k = 0, \dots, \tau-1$ local iterations do:
     Update global control variate with $c^{(t+1)} \leftarrow c^{(t)} - \Psi^{(t)}$

The experimental setup is the same as described in Section 6. The hyperparameters $\eta_l$, $\epsilon$ for FedExP-M and $\eta_l$, $\eta_g$ for FedAdam and FedAvg-M were tuned following a similar process as described in Appendix D.3, and the resulting values are in Table 6.

Table 6: Base-10 logarithm of the best combination of $\epsilon$ and $\eta_l$ for FedExP-M and combination of $\eta_l$ and $\eta_g$ for FedAdam and FedAvg-M.

Our results show that server momentum can be successfully combined with extrapolation for the best speedup among all baselines. The behavior of FedAdam and FedAvg-M is quite similar in these experiments, which can be attributed to the dense nature of the gradients in image classification, as discussed in Section 6.



We refer here to a parallel implementation of POCS. This is also known as the Parallel Projection Method (PPM) and the Simultaneous Iterative Reconstruction Technique (SIRT) in some literature (Combettes, 1997).



Figure 1: Test accuracy (%) achieved by different server and client step sizes on EMNIST dataset (Cohen et al., 2017) after 50 rounds (details of experimental setup are in Section 6 and Appendix D).


Figure 2: Training characteristics of FedAvg and FedExP for the 2-D toy problem in Section 5. The last iterate of FedExP has an oscillating behavior in $F(w)$ but monotonically decreases $\|w^{(t)} - w^*\|^2$; the average of the last two iterates lies in a lower loss region than the last iterate.


Figure 3: Experimental results on a synthetic linear regression experiment and a range of realistic FL tasks. FedExP consistently gives faster convergence compared to baselines while adding no extra computation, communication or storage at clients or server.

Although FedExP may increase F(w) in some rounds, as shown in Figure 2, it is beneficial in the long run as it leads to a faster decrease in distance to the global optimum w*.


Figure 4: Adding extrapolation to SCAFFOLD for greater speedup.

[Figure 5 schematic: 1) Server sends global model w^(t) to all clients; 2) Clients 1, 2, . . . , M perform local training and return local models w_1^(t,τ), w_2^(t,τ), . . . , w_M^(t,τ).]

Figure 5: Schematic of client-server communication in FedExP.

Figure 6: Benefit of averaging the last two iterates for FedExP in training a CNN model on CIFAR-10. Note that averaging does not make a significant difference for FedAvg.


Figure 7: Global learning rates for synthetic data with linear regression. Results from a single instance of experiment.

Figure 8: Additional results for EMNIST dataset. Mean and standard deviation from experiments with 20 different random seeds. The shaded areas show the standard deviation.

Figure 9: Additional results for CIFAR-10 dataset. Mean and standard deviation from experiments with 5 different random seeds. The shaded areas show the standard deviation.

Figure 10: Additional results for CIFAR-100 dataset. Mean and standard deviation from experiments with 5 different random seeds. The shaded areas show the standard deviation.


Figure 15: Training loss results of FedExP-M, FedAdam and FedAvg-M on the CIFAR-10 and CIFAR-100 datasets.

Figure 16: Training accuracy results of FedExP-M, FedAdam and FedAvg-M on the CIFAR-10 and CIFAR-100 datasets.

Figure 17: Test accuracy results of FedExP-M, FedAdam and FedAvg-M on the CIFAR-10 and CIFAR-100 datasets.

Table showing the average number of rounds to reach a desired accuracy for FedExP and baselines. FedExP provides a consistent speedup over all baselines, including 2× over FedAvg. This verifies that FedExP also provides performance improvement in more general settings with realistic datasets and models.


B TABLE OF NOTATION AND SCHEMATIC

B.1 TABLE OF NOTATION

Summary of notation used in the paper:
w_i^(t,k): Local model at the i-th client at the t-th round and k-th iteration
τ: Number of local SGD steps
Δ_i^(t): Update of the i-th client at the t-th round

C.4.1 IMPROVING LOWER BOUND IN (8) IN THE CASE OF EXACT PROJECTIONS

Let S*_i be convex and let w* ∈ S*_i for all i ∈ [M]. We assume that w

Test accuracy obtained by FedExP and baselines after 2000 rounds of training on various tasks. Results are averaged across 3 random seeds and the last 20 rounds. We see that FedExP continues to outperform baselines, including FedProx, in the long term as well.

ACKNOWLEDGMENTS

This work was supported in part by NSF grants CCF-2045694 and CNS-2112471, ONR grant N00014-23-1-2149, and the CMU David Barakat and LaVerne Owen-Barakat Fellowship.


where U is a matrix with orthogonal columns. Here V_1 is a basis for the row space of A, while V_2 is a basis for the null space of A. We first prove the following lemmas about the set of minimizers of F(w) and the projection onto this set.

Lemma 8. The set of minimizers of F(w) is given by S* = {w : Aw = b} = {w* + V_2 z}, where w* is any fixed minimizer of F(w).

Lemma 9. The projection of any w ∈ R^d onto S* is given by P_{S*}(w) = w* + V_2 V_2^⊤ (w − w*).

Proof. When w ∈ S*, it is easy to see that this holds. We therefore consider the case where w ∉ S*: any candidate projection other than w* + V_2 V_2^⊤ (w − w*) has a strictly larger distance to w, since the cross term between the row-space and null-space components is zero, leading to a contradiction.

We now show that running gradient descent on F(w) starting from w with a sufficiently small step size converges to P_{S*}(w).

Lemma 10. Let w^(0), w^(1), . . . be the iterates generated by running gradient descent on F(w) with w^(0) = w and learning rate η_l ≤ 1/λ_max, where λ_max is the largest eigenvalue of A^⊤ A. Then lim_{T→∞} w^(T) = P_{S*}(w).

Proof. By the gradient descent update, w^(t+1) = w^(t) − η_l A^⊤ (A w^(t) − b). The component of w^(t) − P_{S*}(w) lying in the row space of A contracts at every step, while the component lying in the null space of A is left unchanged, so the iterates converge to P_{S*}(w).

Long-Term Behavior of Algorithms and Comparison with FedProx. To evaluate the long-term behavior of different algorithms, we ran the experiments for 2000 rounds. Here, we also consider an additional algorithm, namely FedProx, for comparison. For a fair comparison, we tuned the µ parameter of FedProx for each dataset by doing a grid search over the range {10^{-3}, 10^{-2}, 10^{-1}, 1}, as done in the original FedProx paper (Li et al., 2020). The results for EMNIST, CIFAR-10, CIFAR-100, and CINIC-10 in Figures 12-14 and Table 5 are from experiments with 3 different random seeds. Except for the synthetic dataset, the plots show mean and standard deviation values across all random seeds, computed over a moving average window of size 20.
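Lemma 10 is easy to verify numerically. The sketch below (our own construction with synthetic data) runs gradient descent on F(w) = ½∥Aw − b∥² in an overparameterized problem (d = 6 > n = 3) and compares the limit point with the projection of the initialization onto S*, computed here via the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
# Overparameterized least squares: d = 6 parameters, n = 3 equations,
# so the minimizers of F(w) = 0.5 * ||A w - b||^2 form an affine set S*.
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)

w0 = rng.standard_normal(6)              # arbitrary initialization
w = w0.copy()
lam_max = np.linalg.eigvalsh(A.T @ A).max()
eta_l = 1.0 / lam_max                    # step size allowed by Lemma 10
for _ in range(20000):
    w = w - eta_l * A.T @ (A @ w - b)    # gradient descent on F(w)

# Projection of w0 onto S* via the pseudo-inverse: P(w0) = w0 - A^+ (A w0 - b)
proj = w0 - np.linalg.pinv(A) @ (A @ w0 - b)
print(np.linalg.norm(w - proj))          # ≈ 0: GD converges to the projection
```

The null-space component of the initialization is never touched by the gradient updates, which is exactly why the limit is the projection of w0 rather than an arbitrary minimizer.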

F COMBINING EXTRAPOLATION WITH SERVER MOMENTUM

We begin by recalling some notation from our work. The vector w^(t) is the global model at round t and Δ^(t) is the average of client updates at round t. The server momentum update at round t can be written as v^(t) = Δ^(t) + β v^(t−1) (with v^(−1) = 0), and the global model update can be written as w^(t+1) = w^(t) − η_g^(t) v^(t). Our goal is now to find the η_g^(t) that minimizes ∥w^(t+1) − w*∥². We have

∥w^(t+1) − w*∥² = ∥w^(t) − w*∥² − 2 η_g^(t) ⟨w^(t) − w*, v^(t)⟩ + (η_g^(t))² ∥v^(t)∥². (107)

Setting the derivative of the RHS of (107) with respect to η_g^(t) to zero, we have η_g^(t) = ⟨w^(t) − w*, v^(t)⟩ / ∥v^(t)∥². Our goal now is to find a lower bound on ⟨w^(t) − w*, v^(t)⟩. We have the following lemma.

Lemma 11. Assume that η_g^(r) ≤ ⟨w^(r) − w*, v^(r)⟩ / ∥v^(r)∥² for all r ≤ t − 1. Then ⟨w^(t) − w*, v^(t)⟩ ≥ m^(t)/2, which implies η_g^(t) ≥ m^(t) / (2 ∥v^(t)∥²).

Proof. We proceed via a proof by induction. The statement clearly holds at t = 0 since ⟨w^(0) − w*, v^(0)⟩ = ⟨w^(0) − w*, Δ^(0)⟩ ≥ m^(0)/2. Now, assuming the lemma holds at t − 1, we have

⟨w^(t) − w*, v^(t)⟩ = ⟨w^(t) − w*, Δ^(t)⟩ + β ⟨w^(t) − w*, v^(t−1)⟩ ≥ m^(t)/2 + β (⟨w^(t−1) − w*, v^(t−1)⟩ − η_g^(t−1) ∥v^(t−1)∥²) ≥ m^(t)/2,

where the last line follows from the fact that w^(t) = w^(t−1) − η_g^(t−1) v^(t−1) together with the assumption η_g^(t−1) ≤ ⟨w^(t−1) − w*, v^(t−1)⟩ / ∥v^(t−1)∥². Thus we propose to keep the following server step size when using server momentum:

η_g^(t) = m^(t) / (2 (∥v^(t)∥² + ϵ)),

where m^(t) = Σ_{i=1}^M ∥Δ_i^(t)∥² / M. Note that we also add a small constant ϵ to the denominator to prevent the step size from blowing up, as done for FedExP. We call server momentum with this step size FedExP-M.

We compare the performance of FedExP-M with FedAdam and FedAvg-M (FedAvg with server momentum) on the CIFAR-10 and CIFAR-100 datasets, as shown in Figures 15-17, where the mean and standard deviation values are computed over 3 random seeds and a moving average window of size 20, as discussed in Section 6. We note that this is only a preliminary result, and future work will study the effect of combining server momentum and extrapolation more rigorously.
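A minimal sketch of the resulting FedExP-M server step (our own code and naming; we read the proposed rule as η_g^(t) = m^(t) / (2(∥v^(t)∥² + ϵ)), and since the text does not state a max(1, ·) clamp as in FedExP, we omit it here):

```python
import numpy as np

def fedexp_m_server_step(w, v_prev, client_updates, beta=0.9, eps=1e-3):
    """One FedExP-M server step: momentum over the averaged client
    pseudo-gradients Delta_i = w - w_i, with the extrapolated step size
    eta_g = m_t / (2 * (||v||^2 + eps))."""
    deltas = np.stack(client_updates)            # shape (M, d)
    delta_bar = deltas.mean(axis=0)              # average update Delta^(t)
    v = delta_bar + beta * v_prev                # server momentum, v^(-1) = 0
    m_t = (np.linalg.norm(deltas, axis=1) ** 2).mean()   # sum_i ||Delta_i||^2 / M
    eta_g = m_t / (2 * (np.linalg.norm(v) ** 2 + eps))
    return w - eta_g * v, v, eta_g
```

As with FedExP, the step size grows when heterogeneous client updates largely cancel in the momentum buffer, and stays moderate when they agree.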

