FEDERATED LEARNING'S BLESSING: FEDAVG HAS LINEAR SPEEDUP

Abstract

Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-i.i.d. data across the network, low device participation, high communication costs, and the mandate that data remain private bring challenges to understanding the convergence of FL algorithms, particularly with regard to how convergence scales with the number of participating devices. In this paper, we focus on Federated Averaging (FedAvg), arguably the most popular and effective FL algorithm class in use today, and provide a unified and comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, it remains open how FedAvg's convergence scales with the number of participating devices in the fully heterogeneous FL setting, a crucial question whose answer would shed light on the performance of FedAvg in large FL systems. We fill this gap by providing a unified analysis that establishes convergence guarantees for FedAvg under three classes of problems: strongly convex smooth, convex smooth, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates and communication efficiencies. While there have been linear speedup results in distributed optimization that assume full participation, ours are the first to establish linear speedup for FedAvg under both statistical and system heterogeneity. For strongly convex and convex problems, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm, which are the first linear speedup guarantees for momentum variants of FedAvg in the convex setting. To provably accelerate FedAvg, we design a new momentum-based FL algorithm that further improves the convergence rate in overparameterized linear regression problems.
Empirical studies of the algorithms in various settings have supported our theoretical results.

1. INTRODUCTION

Federated learning (FL) is a machine learning paradigm where many clients (e.g., mobile devices or organizations) collaboratively train a model under the orchestration of a central server (e.g., a service provider), while keeping the training data decentralized (Smith et al. (2017); Kairouz et al. (2019)). In recent years, FL has swiftly emerged as an important learning paradigm (McMahan et al. (2017); Li et al. (2020a)), one that enjoys widespread success in applications such as personalized recommendation (Chen et al. (2018)), virtual assistants (Lam et al. (2019)), and keyboard prediction (Hard et al. (2018)), to name a few, for at least three reasons. First, the rapid proliferation of smart devices equipped with both computing power and data-capturing capabilities has provided the infrastructure core for FL. Second, the rising awareness of privacy and the explosive growth of computational power in mobile devices have made it increasingly attractive to push computation to the edge. Third, the empirical success of communication-efficient FL algorithms has enabled increasingly large-scale parallel computing and learning with less communication overhead. Despite its promise and broad applicability, the potential value FL delivers is coupled with the unique challenges it brings forth. In particular, when FL learns a single statistical model using data from across all the devices while keeping each individual device's data isolated (Kairouz et al. (2019)), it faces two challenges that are absent in centralized optimization and distributed (stochastic) optimization (Zhou & Cong (2018); Stich (2019); Khaled et al. (2019); Liang et al. (2019); Wang & Joshi (2018); Woodworth et al. (2018); Wang et al. (2019); Jiang & Agrawal (2018); Yu et al. (2019b;a); Khaled et al. (2020); Koloskova et al. (2020); Woodworth et al.
(2020b;a)): 1) Data (statistical) heterogeneity: data distributions across devices are different (and data cannot be shared); 2) System heterogeneity: only a subset of devices may access the central server at each time, both because the communication bandwidth profiles vary across devices and because there is no central server that has control over when a device is active (the presence of "stragglers").

Table 1: Our convergence results for FedAvg and accelerated FedAvg in this paper. Throughout the paper, N is the total number of local devices, and K ≤ N is the maximal number of devices that are accessible to the central server. T is the total number of stochastic updates performed by each local device, and E is the number of local steps between two consecutive server communications (and hence T/E is the number of communications). † In the linear regression setting, we have κ̃ = κ1 for FedAvg and κ̃ = √(κ1 κ̃1) for momentum accelerated FedAvg (FedMaSS), where κ1 and κ̃1 are condition numbers defined in Section G. Since κ1 ≥ κ̃1, this implies a speedup factor of √(κ1/κ̃1) for accelerated FedAvg.

Participation | Strongly convex     | Convex               | Overparameterized  | Overparam. linear regression
Full          | O(1/(NT) + E²/T²)   | O(1/√(NT) + NE²/T)   | O(exp(−NT/(Eκ1)))  | O(exp(−NT/(Eκ̃))) †
Partial       | O(E²/(KT) + E²/T²)  | O(E²/√(KT) + KE²/T)  | O(exp(−KT/(Eκ1)))  | O(exp(−KT/(Eκ̃))) †

To address these challenges, Federated Averaging (FedAvg) (McMahan et al. (2017)) was proposed as a particularly effective heuristic, which has enjoyed great empirical success. This success has since motivated a growing line of research efforts into understanding its theoretical convergence guarantees in various settings. For instance, Haddadpour & Mahdavi (2019) analyzed FedAvg (for non-convex smooth problems satisfying PL conditions) under the assumption that each local device's minimizer is the same as the minimizer of the joint problem (if all devices' data is aggregated together), an overly restrictive assumption that severely limits data heterogeneity. Very recently, Li et al.
(2020b) furthered this progress and established an O(1/T) convergence rate for FedAvg for strongly convex smooth federated learning problems with both data and system heterogeneity. In the same setting, Karimireddy et al. (2019) also established an O(1/T) result that allows for a linear speedup when the number of participating devices is large. At the same time, Huo et al. (2020) studied Nesterov accelerated FedAvg for non-convex smooth problems and established an O(1/√T) convergence rate to stationary points. However, despite these very recent fruitful pioneering efforts into understanding the theoretical convergence properties of FedAvg, it remains open how the number of devices, particularly the number of devices that participate in the computation, affects the convergence speed. In particular, is linear speedup of FedAvg a universal phenomenon across different settings and for any number of devices? What about when FedAvg is accelerated with momentum updates? Does the presence of both data and system heterogeneity in FL imply different communication complexities and require technical novelties over results in distributed and decentralized optimization? These aspects are currently unexplored or underexplored in FL. We fill in the gaps here by providing affirmative answers. Our Contributions. We provide a comprehensive and unified convergence analysis of FedAvg and its accelerated variants in the presence of both data and system heterogeneity. Our contributions are threefold. First, we establish an O(1/KT) convergence rate under FedAvg for strongly convex and smooth problems and an O(1/√(KT)) convergence rate for convex and smooth problems (where K is the number of participating devices), thereby establishing that FedAvg enjoys the desirable linear speedup property in the FL setup. Prior to our work here, the best and most closely related convergence analyses are given by Li et al. (2020b) and Karimireddy et al.
(2019), which established an O(1/T) convergence rate for strongly convex smooth problems under FedAvg. Our rate matches the same (and optimal) dependence on T, but also completes the picture by establishing the linear dependence on K for any K ≤ N, where N is the total number of devices, whereas Li et al. (2020b) has no linear speedup analysis and Karimireddy et al. (2019) only allows linear speedup close to full participation (K = O(N)). As for convex and smooth problems, no prior work established the O(1/√T) rate under both system and data heterogeneity. Our unified analysis highlights the common elements and distinctions between the strongly convex and convex settings. Second, we establish the same convergence rates, O(1/KT) for strongly convex and smooth problems and O(1/√(KT)) for convex and smooth problems, for Nesterov accelerated FedAvg. We analyze the accelerated version of FedAvg here because empirically it tends to perform better, yet its theoretical convergence guarantee was unknown. To the best of our knowledge, these are the first results that provide a linear speedup characterization of Nesterov accelerated FedAvg in those two problem classes (that FedAvg and Nesterov accelerated FedAvg share the same convergence rate is to be expected: this is the case even for centralized stochastic optimization). Prior to our results here, the most relevant results (Yu et al. (2019a); Li et al. (2020a); Huo et al. (2020)) only concern the non-convex setting, where convergence is measured with respect to stationary points (vanishing of gradient norms, rather than optimality gaps). Our unified analysis of Nesterov FedAvg also illustrates the technical similarities and distinctions compared to the original FedAvg algorithm, whereas prior works (in the non-convex setting) were scattered and used different notations.
Third, we study a subclass of strongly convex smooth problems where the objective is overparameterized and establish a faster O(exp(−KT/κ)) convergence rate for FedAvg, in contrast to the O(exp(−T/κ)) rate of individual solvers (Ma et al. (2018)). Within this class, we further consider the linear regression problem and establish an even sharper rate under FedAvg. In addition, we propose a new variant of accelerated FedAvg based on a momentum update of Liu & Belkin (2020), MaSS accelerated FedAvg, and establish a faster convergence rate (compared to the case without acceleration). This stands in contrast to generic (strongly) convex stochastic problems, where theoretically no rate improvement is obtained when one accelerates FedAvg. The detailed convergence results are summarized in Table 1. Connections with Distributed and Decentralized Optimization. Federated learning is closely related to distributed and decentralized optimization, and as such it is important to discuss connections and distinctions between our work and related results from that literature. First, when there is neither system heterogeneity, i.e. all devices participate in parameter averaging during a communication round, nor statistical heterogeneity, i.e. all devices have access to a common set of stochastic gradients, FedAvg coincides with the "Local SGD" of Stich (2019), which showed the linear speedup rate O(1/NT) for strongly convex and smooth functions. Woodworth et al. (2020b) and Woodworth et al. (2020a) further improved the communication complexity that guarantees the linear speedup rate. When there is only data heterogeneity, some works have continued to use the term Local SGD to refer to FedAvg, while others subsume it in more general frameworks that include decentralized model averaging based on a network topology or a mixing matrix. These works have provided linear speedup analyses for strongly convex and convex problems, e.g. Khaled et al. (2020); Koloskova et al.
(2020), as well as non-convex problems, e.g. Jiang & Agrawal (2018); Yu et al. (2019b); Wang & Joshi (2018). However, these results do not consider system heterogeneity, i.e. the presence of stragglers in the device network. Even with decentralized model averaging, the assumptions usually imply that the model average over all devices is the same as the decentralized model average based on the network topology (e.g., Koloskova et al. (2020), Proposition 1), which precludes system heterogeneity as defined in this paper and prevalent in FL problems. For momentum accelerated FedAvg, Yu et al. (2019a) provided a linear speedup analysis for non-convex problems, while results for the strongly convex and convex settings are entirely lacking, even without system heterogeneity. Karimireddy et al. (2019) considers both types of heterogeneity for FedAvg, but their rate implies a linear speedup only when the number of stragglers is negligible. In contrast, our linear speedup analyses consider both types of heterogeneity present in the full federated learning setting, and are valid for any number of participating devices. We also highlight a distinction in communication efficiency when system heterogeneity is present. Moreover, our results for Nesterov accelerated FedAvg complete the picture for strongly convex and convex problems. For a detailed comparison with related works, please refer to Table 2 in Appendix Section B.

2. SETUP

In this paper, we study the following federated learning problem:

min_w F(w) := Σ_{k=1}^N p_k F_k(w),   (1)

where N is the number of local devices (users/nodes/workers) and p_k is the k-th device's weight, satisfying p_k ≥ 0 and Σ_{k=1}^N p_k = 1. On the k-th local device, there are n_k data points: x_k^1, x_k^2, ..., x_k^{n_k}. The local objective F_k(·) is defined as F_k(w) := (1/n_k) Σ_{j=1}^{n_k} ℓ(w; x_k^j), where ℓ denotes a user-specified loss function. Each device only has access to its local data, which gives rise to its own local objective F_k. Note that we do not make any assumptions on the data distributions of the local devices: the local minimum F_k* = min_w F_k(w) can be far from the global minimum of Eq. (1) (data heterogeneity).
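As a toy illustration of the weighted objective above (hypothetical data sizes and quadratic local objectives, not from the paper), the global minimizer generally differs from every local minimizer:

```python
import numpy as np

# Hypothetical per-device data sizes; weights p_k = n_k / n satisfy
# p_k >= 0 and sum to 1.
n_k = np.array([100, 300, 600])
p = n_k / n_k.sum()

# Toy quadratic local objectives F_k(w) = 0.5 * (w - c_k)^2 with
# distinct minimizers c_k, mimicking data heterogeneity.
c = np.array([0.0, 1.0, 4.0])

def F(w):
    # Global objective F(w) = sum_k p_k F_k(w)
    return np.sum(p * 0.5 * (w - c) ** 2)

# The global minimizer is the weighted mean of the c_k, which differs
# from every local minimizer c_k (so each F_k^* is far from F(w^*)).
w_star = np.sum(p * c)
```

Here `n_k`, `c`, and the quadratic losses are illustrative stand-ins; any smooth losses would play the same role.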

2.1. THE FEDERATED AVERAGING (FEDAVG) ALGORITHM

We first introduce the standard Federated Averaging (FedAvg) algorithm, which was first proposed by McMahan et al. (2017). FedAvg updates the model on each device by local Stochastic Gradient Descent (SGD) and sends the latest model to the central server every E steps. The central server takes a weighted average over the model parameters received from the active devices and broadcasts the latest averaged model to all devices. Formally, the updates of FedAvg at round t are described as follows:

v_{t+1}^k = w_t^k − α_t g_{t,k},
w_{t+1}^k = v_{t+1}^k                     if t+1 ∉ I_E,
w_{t+1}^k = Σ_{k∈S_{t+1}} q_k v_{t+1}^k   if t+1 ∈ I_E,

where w_t^k is the local model parameter maintained on the k-th device at the t-th iteration and g_{t,k} := ∇F_k(w_t^k, ξ_t^k) is the stochastic gradient based on ξ_t^k, the data point sampled from the k-th device's local data uniformly at random. I_E = {E, 2E, ...} is the set of global communication steps, at which local parameters from a set of active devices are averaged and broadcast to all devices. We use S_{t+1} to denote the (random) set of active devices at step t+1, and q_k is a set of averaging weights specific to the sampling procedure used to obtain S_{t+1}. Since federated learning usually involves an enormous number of local devices, it is often more realistic to assume that only a subset of local devices is active at each communication round (system heterogeneity). In this work, we consider both the case of full participation, where the model is averaged over all devices at each communication round, in which case q_k = p_k for all k and w_{t+1}^k = Σ_{k=1}^N p_k v_{t+1}^k if t+1 ∈ I_E, and the case of partial participation, where we follow Li et al. (2020b) and assume that S_{t+1} is obtained by one of two types of sampling schemes that simulate practical scenarios. One scheme establishes S_{t+1} by sampling devices i.i.d. with probabilities p_k with replacement, and uses q_k = 1/K, where K = |S_{t+1}|; the other samples S_{t+1} uniformly from all devices without replacement, and uses q_k = p_k N/K.
Both schemes guarantee that gradient updates in partial-participation FedAvg are unbiased stochastic versions of the updates in FedAvg with full participation, which is important in the theoretical analysis of convergence. Because the original sampling scheme and weights proposed by McMahan et al. (2017) lack this property, they are not considered in this paper. For more details on the notation and setup, as well as the properties of the two sampling schemes, please refer to Section A in the appendix.
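To make the update rule concrete, the following is a minimal one-dimensional simulation of FedAvg on toy quadratic local objectives. All constants and names are hypothetical (this is a sketch, not the paper's implementation); partial participation uses sampling scheme I with q_k = 1/K:

```python
import numpy as np

rng = np.random.default_rng(0)

def fedavg(grad, w0, N, T, E, alpha, p, K=None):
    """1-D sketch of FedAvg. grad(k, w, rng) returns a stochastic
    gradient of F_k at w. Every E steps the server averages the models
    of the active devices: all N of them when K is None (q_k = p_k),
    else K devices sampled i.i.d. with probabilities p (scheme I,
    q_k = 1/K), and broadcasts the average to every device."""
    w = np.full(N, float(w0))
    for t in range(T):
        for k in range(N):                   # one local SGD step per device
            w[k] -= alpha * grad(k, w[k], rng)
        if (t + 1) % E == 0:                 # communication round, t+1 in I_E
            if K is None:
                avg = np.dot(p, w)           # full participation
            else:
                S = rng.choice(N, size=K, p=p)
                avg = w[S].mean()            # partial participation, q_k = 1/K
            w[:] = avg                       # broadcast averaged model
    return np.dot(p, w)
```

On quadratics F_k(w) = 0.5 (w − c_k)² the averaged iterate converges near the weighted mean of the c_k, despite each device drifting toward its own minimizer between communications.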

2.2. ASSUMPTIONS

We make the following standard assumptions on the objective functions F_1, ..., F_N. Assumptions 1 and 2 are commonly satisfied by a range of popular objective functions, such as ℓ2-regularized logistic regression and cross-entropy loss functions.

Assumption 1 (L-smooth). F_1, ..., F_N are all L-smooth: for all v and w, F_k(v) ≤ F_k(w) + (v − w)^T ∇F_k(w) + (L/2) ||v − w||².

Assumption 2 (Strongly convex). F_1, ..., F_N are all μ-strongly convex: for all v and w, F_k(v) ≥ F_k(w) + (v − w)^T ∇F_k(w) + (μ/2) ||v − w||².

Assumption 3 (Bounded local variance). Let ξ_t^k be sampled from the k-th device's local data uniformly at random. The variance of stochastic gradients on each device is bounded: E ||∇F_k(w_t^k, ξ_t^k) − ∇F_k(w_t^k)||² ≤ σ_k² for k = 1, ..., N and any w_t^k. Let σ² := Σ_{k=1}^N p_k σ_k².

Assumption 4 (Bounded local gradient). The expected squared norm of stochastic gradients is uniformly bounded, i.e., E ||∇F_k(w_t^k, ξ_t^k)||² ≤ G² for all k = 1, ..., N and t = 0, ..., T − 1.

Assumptions 3 and 4 have been made in many previous works on federated learning, e.g. Yu et al. (2019b); Li et al. (2020b); Stich (2019). We provide further justification of their generality. As the averaged model parameters approach w*, the L-smoothness property implies that E ||∇F_k(w_t^k, ξ_t^k)||² and E ||∇F_k(w_t^k, ξ_t^k) − ∇F_k(w_t^k)||² approach E ||∇F_k(w*, ξ_t^k)||² and E ||∇F_k(w*, ξ_t^k) − ∇F_k(w*)||², respectively. Therefore, there is no substantial difference between these assumptions and assuming the bounds at w* only (Koloskova et al. (2020)). Furthermore, compared to assuming bounded gradient diversity as in related work (Haddadpour & Mahdavi (2019); Li et al. (2020a)), Assumption 4 is much less restrictive: when the optimality gap converges to zero, bounded gradient diversity forces the local objectives to have the same minimizer as the global objective, contradicting the heterogeneous data setting.
For detailed discussions of our assumptions, please refer to Appendix Section B.
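As a quick numerical illustration of Assumptions 1 and 2 (using a hypothetical quadratic objective, not one from the paper): for F(w) = 0.5 wᵀAw with A positive definite, the smoothness and strong convexity constants are the extreme eigenvalues of A, and the two inequalities hold with gap exactly 0.5 (v−w)ᵀA(v−w):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic F(w) = 0.5 w^T A w. Then Assumption 1 holds with
# L = lambda_max(A) and Assumption 2 with mu = lambda_min(A).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
eigs = np.linalg.eigvalsh(A)     # ascending order
mu, L = eigs[0], eigs[-1]

F = lambda w: 0.5 * w @ A @ w
gradF = lambda w: A @ w

for _ in range(100):
    v, w = rng.standard_normal(2), rng.standard_normal(2)
    # gap = F(v) - F(w) - (v-w)^T grad F(w) = 0.5 (v-w)^T A (v-w)
    gap = F(v) - F(w) - (v - w) @ gradF(w)
    d2 = np.dot(v - w, v - w)
    assert gap <= L / 2 * d2 + 1e-9     # L-smoothness (Assumption 1)
    assert gap >= mu / 2 * d2 - 1e-9    # mu-strong convexity (Assumption 2)
```

The bounds follow from the Rayleigh quotient: μ ≤ (v−w)ᵀA(v−w)/||v−w||² ≤ L.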

3. LINEAR SPEEDUP ANALYSIS OF FEDAVG

In this section, we provide convergence analyses of FedAvg for convex objectives in the general setting with both heterogeneous data (statistical heterogeneity) and partial participation (system heterogeneity). We show that for strongly convex and smooth objectives, the convergence rate of the optimality gap of averaged parameters across devices is O(1/KT), while for convex and smooth objectives, the rate is O(1/√(KT)). Our results improve upon Li et al. (2020b); Karimireddy et al. (2019) by showing linear speedup for any number of participating devices, and upon Khaled et al. (2020); Koloskova et al. (2020) by allowing system heterogeneity. The proofs also highlight similarities and distinctions between the strongly convex and convex settings. Detailed proofs are deferred to Appendix Section E.

3.1. STRONGLY CONVEX AND SMOOTH OBJECTIVES

We first show that FedAvg has an O(1/KT) convergence rate for μ-strongly convex and L-smooth objectives. The result relies on a technical improvement over the analysis in Li et al. (2020b). Moreover, it implies a distinction in the communication efficiency that guarantees this linear speedup for FedAvg with full versus partial device participation. With full participation, E can be chosen as large as O(√(T/N)) without degrading the linear speedup in the number of workers. On the other hand, with partial participation, E must be O(1) to guarantee O(1/KT) convergence.

Theorem 1. Under Assumptions 1, 2, 3, 4 and appropriately chosen decaying learning rates, FedAvg with full device participation satisfies

EF(w̄_T) − F* = O( (κ ν²_max σ²/μ)/(NT) + (κ² E² G²/μ)/T² ),

and with partial device participation with at most K sampled devices at each communication round,

EF(w̄_T) − F* = O( (κ E² G²/μ)/(KT) + (κ ν²_max σ²/μ)/(NT) + (κ² E² G²/μ)/T² ).

Proof sketch. Because our unified analyses of the results in the main text follow the same framework with variations in technical details, we first give an outline of the proof of Theorem 1 to illustrate the main ideas. For full participation, the main ingredient is the recursive contraction bound

E ||w̄_{t+1} − w*||² ≤ (1 − μ α_t) E ||w̄_t − w*||² + α_t² (1/N) ν²_max σ² + 6 α_t³ L E² G²,

where the O(α_t³ E² G²) term is the key improvement over the bound in Li et al. (2020b), which has O(α_t² E² G²) instead. We then use induction to obtain a non-recursive bound on E ||w̄_T − w*||², which is converted to a bound on EF(w̄_T) − F* using L-smoothness. For partial participation, an additional term O((1/K) α_t² E² G²) of leading order, resulting from the sampling variance, is added to the contraction bound. To facilitate the understanding of our analysis, please refer to the high-level summary in Appendix C.

Linear speedup. We compare our bound with that in Li et al. (2020b), which is O(1/(NT) + E²/(KT) + E²G²/T). Because the term E²G²/T is O(1/T) without a dependence on N, their bound cannot achieve linear speedup for any choice of E.
The improvement of our bound comes from the term (κ² E² G²/μ)/T², which is now O(E²/T²) and hence not of leading order. As a result, all leading terms scale with 1/N in the full device participation setting, and with 1/K in the partial participation setting. This implies that in both settings, there is a linear speedup in the number of active workers during a communication round. We also emphasize that the reason one cannot recover the full participation bound by setting K = N in the partial participation bound is the variance generated by sampling.

Communication complexity. Our bound implies a distinction in the choice of E between the full and partial participation settings. With full participation, the term involving E, O(E²/T²), is not of leading order O(1/T), so we can increase E and reduce the number of communication rounds without degrading the linear speedup in the iteration complexity O(1/NT), as long as E = O(√(T/N)), since then O(E²/T²) = O(1/NT) matches the leading term. This corresponds to a communication complexity of T/E = O(√(NT)). In contrast, the bound in Li et al. (2020b) does not allow E to scale with √T while preserving the O(1/T) rate, even with full participation. On the other hand, with partial participation, (κ E² G²/μ)/(KT) is also a leading term, and so E must be O(1). In this case, our bound still yields a linear speedup in K, which is also confirmed by experiments. The requirement that E = O(1) in order to achieve linear speedup under partial participation cannot be removed for our sampling schemes, as the term (κ E² G²/μ)/(KT) comes from the variance of the sampling process, which is O(E²/T²). In Proposition 1 in Section E of the appendix, we provide a problem instance where the dependence of the sampling variance on E is tight.

Comparison with related works. To better understand the significance of the obtained bound, we compare our rates to the best-known results in related settings. Haddadpour & Mahdavi (2019) prove a linear speedup O(1/KT) result for strongly convex and smooth objectives, with O(K^{1/3} T^{2/3}) communication complexity under non-i.i.d. data and partial participation. However, their result builds on the bounded gradient diversity assumption, which implies the existence of a w* that minimizes all local objectives (see the discussions in Section 2.2 and Appendix B), effectively removing statistical heterogeneity. The bound in Koloskova et al. (2020) matches our bound in the full participation case, but their framework excludes partial participation (Koloskova et al., 2020, Proposition 1). The result of Karimireddy et al. (2019) applies to the full FL setting, but only exhibits linear speedup when K = O(N), i.e. close to full participation, whereas our result has linear speedup for any number of participating devices. When there is no data heterogeneity, i.e. in the classical distributed optimization paradigm, the communication complexity can be further improved, e.g. Woodworth et al. (2020b;a), but such results are not directly comparable to ours since we consider the setting where individual devices have access to different datasets.
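The full-participation tradeoff E = O(√(T/N)) and the resulting communication complexity T/E = O(√(NT)) can be sanity-checked with toy numbers (the scales below are hypothetical, not from the paper's experiments):

```python
import math

# Hypothetical scales: T stochastic updates per device, N devices.
T, N = 10**6, 100

# Full participation: choosing E ~ sqrt(T/N) keeps the O(E^2/T^2) term
# at the same order as the leading O(1/(N*T)) term, so linear speedup
# is preserved while communicating only every E steps.
E = int(math.sqrt(T / N))        # E = 100
rounds = T // E                  # number of communications T/E
leading = 1 / (N * T)            # O(1/(N*T)) leading term
lagging = E**2 / T**2            # O(E^2/T^2) term

# The two error terms match in order, and T/E = sqrt(N*T).
assert abs(lagging / leading - 1.0) < 1e-9
assert rounds == int(math.sqrt(N * T))
```

With partial participation the extra O(E²/(KT)) sampling-variance term is itself of leading order unless E = O(1), which is why the same trick does not apply there.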

3.2. CONVEX SMOOTH OBJECTIVES

Next we provide a linear speedup analysis of FedAvg for convex and smooth objectives and show that the optimality gap is O(1/√(KT)). This result complements the strongly convex case of the previous part, as well as the non-convex smooth setting in Jiang & Agrawal (2018); Yu et al. (2019b); Haddadpour & Mahdavi (2019), where O(1/√(KT)) results are given in terms of the averaged gradient norm, and it also extends the result in Khaled et al. (2020), which shows linear speedup in the convex setting, but only for full participation.

Theorem 2. Under Assumptions 1, 3, 4 and constant learning rate α_t = O(√(N/T)), FedAvg with full participation satisfies

min_{t≤T} F(w̄_t) − F(w*) = O( ν²_max σ² / √(NT) + N E² L G² / T ),

and with partial device participation with K sampled devices at each communication round and learning rate α_t = O(√(K/T)),

min_{t≤T} F(w̄_t) − F(w*) = O( ν²_max σ² / √(KT) + E² G² / √(KT) + K E² L G² / T ).

The analysis again relies on a recursive bound, but without contraction:

E ||w̄_{t+1} − w*||² + α_t (F(w̄_t) − F(w*)) ≤ E ||w̄_t − w*||² + α_t² (1/N) ν²_max σ² + 6 α_t³ E² L G²,

which is then summed over time steps to give the desired bound, with α_t = O(√(N/T)).

Choice of E and linear speedup. With full participation, as long as E = O(T^{1/4}/N^{3/4}), the convergence rate is O(1/√(NT)) with O(N^{3/4} T^{3/4}) communication rounds. In the partial participation setting, E must be O(1) in order to achieve the linear speedup O(1/√(KT)). This is again due to the fact that the sampling variance E ||w̄_t − v̄_t||² = O(α_t² E² G²) cannot be made independent of E, as illustrated by Proposition 1. See also the proof in Section E for how the sampling variance and the term E²G²/√(KT) are related.
Our result again demonstrates the difference in communication complexities between full and partial participation, and is to our knowledge the first result on linear speedup in the general federated learning setting with both heterogeneous data and partial participation for convex objectives.

4. LINEAR SPEEDUP ANALYSIS OF NESTEROV ACCELERATED FEDAVG

A natural extension of the FedAvg algorithm is to use momentum-based local updates instead of local SGD updates in order to accelerate FedAvg. As we know from stochastic optimization, Nesterov and other momentum updates fail to provably accelerate over SGD (Liu & Belkin (2020); Kidambi et al. (2018); Liu et al. (2018); Yuan & Ma (2020)). This is in contrast to the classical acceleration of Nesterov gradient descent over GD. Thus, in the FL setting, the best provable convergence rate for FedAvg with Nesterov updates is the same as that of FedAvg with SGD updates. Nevertheless, Nesterov and other momentum updates are frequently used in practice, in both non-FL and FL settings, and are observed to perform better empirically. In fact, previous works such as Stich (2019) on FedAvg with vanilla SGD use FedAvg with Nesterov or other momentum updates in their experiments to achieve target accuracy. Because of the popularity of Nesterov and other momentum-based methods, understanding the linear speedup behavior of FedAvg with such local updates is important. To our knowledge, the only convergence analyses of FedAvg with momentum-based stochastic updates focus on the non-convex smooth case (Huo et al. (2020); Yu et al. (2019a); Li et al. (2020a)), and no results existed in the convex smooth settings. In this section, we complete the picture by providing the first O(1/KT) and O(1/√(KT)) convergence results for Nesterov accelerated FedAvg for convex objectives, matching the rates for FedAvg with SGD updates. Detailed proofs of the convergence results in this section are deferred to Appendix Section F.

4.1. STRONGLY CONVEX AND SMOOTH OBJECTIVES

The Nesterov accelerated FedAvg algorithm follows the updates

v_{t+1}^k = w_t^k − α_t g_{t,k},
w_{t+1}^k = v_{t+1}^k + β_t (v_{t+1}^k − v_t^k)                        if t+1 ∉ I_E,
w_{t+1}^k = Σ_{k∈S_{t+1}} q_k [ v_{t+1}^k + β_t (v_{t+1}^k − v_t^k) ]  if t+1 ∈ I_E,

where g_{t,k} := ∇F_k(w_t^k, ξ_t^k) is the stochastic gradient sampled on the k-th device at time t, and q_k again depends on the participation and sampling scheme.

Theorem 3. Let v̄_T = Σ_{k=1}^N p_k v_T^k in Nesterov accelerated FedAvg, and set learning rates α_t = (6/μ) · 1/(t+γ), β_{t−1} = 3 / (14 (t+γ) (1 − 6/(t+γ)) max{μ, 1}). Then under Assumptions 1, 2, 3, 4, with full device participation,

EF(v̄_T) − F* = O( (κ ν²_max σ²/μ)/(NT) + (κ² E² G²/μ)/T² ),

and with partial device participation with K sampled devices at each communication round,

EF(v̄_T) − F* = O( (κ ν²_max σ²/μ)/(NT) + (κ E² G²/μ)/(KT) + (κ² E² G²/μ)/T² ).

Similar to FedAvg, the key step in the proof of this result is a recursive contraction bound, but one that involves three time steps, due to the update format of Nesterov SGD (see Lemma 7 in Appendix F.1). We can then again use induction and L-smoothness to obtain the desired bound. To our knowledge, this is the first convergence result for Nesterov accelerated FedAvg in the strongly convex and smooth setting. The discussion of the linear speedup of FedAvg applies equally to the Nesterov accelerated variant. In particular, to achieve the O(1/NT) linear speedup, T iterations of the algorithm require only O(√(NT)) communication rounds with full participation.
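To make the momentum update concrete, here is a minimal one-dimensional simulation of Nesterov accelerated FedAvg with full participation on toy quadratic objectives. This is a sketch under hypothetical constants (fixed α and β, toy losses), not the paper's experimental configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def nesterov_fedavg(grad, w0, N, T, E, alpha, beta, p):
    """1-D sketch of Nesterov accelerated FedAvg, full participation.
    grad(k, w, rng) returns a stochastic gradient of F_k at w. Each
    device takes a local momentum step; every E steps the server
    averages the post-momentum models w with weights p."""
    w = np.full(N, float(w0))
    v = np.full(N, float(w0))
    for t in range(T):
        # v_{t+1}^k = w_t^k - alpha * g_{t,k}
        v_new = w - alpha * np.array([grad(k, w[k], rng) for k in range(N)])
        # w_{t+1}^k = v_{t+1}^k + beta * (v_{t+1}^k - v_t^k)
        w = v_new + beta * (v_new - v)
        if (t + 1) % E == 0:          # communication: average over devices
            w[:] = np.dot(p, w)
        v = v_new
    return np.dot(p, v)
```

On the quadratics F_k(w) = 0.5 (w − c_k)² the fixed point of the averaged dynamics is the weighted mean of the c_k, as with plain FedAvg; momentum changes the transient, not the limit.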

4.2. CONVEX SMOOTH OBJECTIVES

We now show that the optimality gap of Nesterov accelerated FedAvg has rate O(1/√(KT)) for convex and smooth objectives. This result complements the strongly convex case of the previous part, as well as the non-convex smooth setting in Huo et al. (2020); Yu et al. (2019a); Li et al. (2020a), where a similar O(1/√(KT)) rate is given in terms of the averaged gradient norm.

Theorem 4. Set learning rates α_t = β_t = O(√(N/T)). Then under Assumptions 1, 3, 4, Nesterov accelerated FedAvg with full device participation has rate

min_{t≤T} F(v̄_t) − F* = O( ν²_max σ² / √(NT) + N E² L G² / T ),

and with partial device participation with K sampled devices at each communication round,

min_{t≤T} F(v̄_t) − F* = O( ν²_max σ² / √(KT) + E² G² / √(KT) + K E² L G² / T ).

We emphasize again that in the stochastic optimization setting, the optimal convergence rate that FedAvg with Nesterov updates can achieve is the same as that of FedAvg with SGD updates. However, due to the popularity and superior performance of momentum methods in practice, it is still important to understand the linear speedup behavior of such FedAvg variants. Our results in this section fill exactly this gap.

5. NUMERICAL EXPERIMENTS

In this section, we empirically examine the linear speedup convergence of FedAvg and Nesterov accelerated FedAvg in the settings analyzed in the previous sections: strongly convex objectives, convex smooth objectives, and overparameterized objectives.

Setup. Following the experimental setting in Stich (2019), we conduct experiments on both synthetic datasets and the real-world dataset w8a (Platt (1998)) (d = 300, n = 49749). We consider the distributed objective F(w) = Σ_{k=1}^N p_k F_k(w), where the objective function on the k-th local device covers three cases: 1) Strongly convex objective: regularized binary logistic regression, F_k(w) = (1/N_k) Σ_{i=1}^{N_k} log(1 + exp(−y_i^k w^T x_i^k)) + (λ/2) ||w||², with regularization parameter λ = 1/n ≈ 2 × 10⁻⁵. 2) Convex smooth objective: binary logistic regression without regularization. 3) Overparameterized setting: linear regression without noise added to the labels, F_k(w) = (1/N_k) Σ_{i=1}^{N_k} (w^T x_i^k + b − y_i^k)².

Linear speedup of FedAvg and Nesterov accelerated FedAvg. To verify the linear speedup convergence predicted by Theorems 1, 2, 3, and 4, we evaluate the number of iterations needed to reach ε-accuracy for the three objectives. We initialize all runs with w_0 = 0_d and measure the number of iterations needed to reach the target accuracy ε. For each configuration (E, K), we extensively search the learning rate of the form min(η_0, nc/(1+t)), where η_0 ∈ {0.1, 0.12, 1, 32} depending on the problem and c can take values c = 2^i for any i ∈ Z. As the results in Figure 1 show, the number of iterations decreases as the number of (active) workers increases, consistently for FedAvg and Nesterov accelerated FedAvg across all scenarios. For additional experiments on the impact of E, the detailed experimental setup, and hyperparameter settings, please refer to Appendix Section I.

Recall the FedAvg updates:

v_{t+1}^k = w_t^k − α_t g_{t,k},
w_{t+1}^k = v_{t+1}^k                     if t+1 ∉ I_E,
w_{t+1}^k = Σ_{k∈S_{t+1}} q_k v_{t+1}^k   if t+1 ∈ I_E.
The following observations apply to FedAvg updates, while Nesterov accelerated FedAvg requires modifications. For full device participation, or partial participation with $t \notin \mathcal{I}_E$, note that $v_t = w_t = \sum_{k=1}^N p_k v_t^k$. For partial participation with $t \in \mathcal{I}_E$, $w_t \neq v_t$ in general, since $v_t = \sum_{k=1}^N p_k v_t^k$ while $w_t = \sum_{k \in S_t} q_k w_t^k$. However, we can use unbiased sampling strategies such that $\mathbb{E}_{S_t} w_t = v_t$. Note that $v_{t+1}$ is one SGD step from $w_t$: $v_{t+1} = w_t - \alpha_t g_t$, where $g_t = \sum_{k=1}^N p_k g_{t,k}$ is the one-step stochastic gradient averaged over all devices and $g_{t,k} = \nabla F_k(w_t^k, \xi_t^k)$. Similarly, we denote the expected one-step gradient $\bar g_t = \mathbb{E}_{\xi_t}[g_t] = \sum_{k=1}^N p_k \mathbb{E}_{\xi_t^k} g_{t,k}$, where $\mathbb{E}_{\xi_t^k} g_{t,k} = \nabla F_k(w_t^k)$ and $\xi_t = \{\xi_t^k\}_{k=1}^N$ denotes the random samples at all devices at time step $t$.

Since in this work we also consider partial participation, the sampling strategy used to model system heterogeneity can also affect convergence. Here we follow the prior works Li et al. (2020b) and Li et al. (2020a) and consider two types of sampling schemes that guarantee $\mathbb{E}_{S_t} w_t = v_t$. Sampling scheme I establishes $S_{t+1}$ by sampling $K$ devices i.i.d. with replacement according to the probabilities $p_k$, and sets $q_k = \frac{1}{K}$. In this case the expected squared norm of $w_{t+1} - v_{t+1}$ is bounded by (Li et al., 2020b, Lemma 5):

$\mathbb{E}_{S_{t+1}} \|w_{t+1} - v_{t+1}\|^2 \le \frac{4}{K}\alpha_t^2 E^2 G^2.$ (5)

Sampling scheme II establishes $S_{t+1}$ by sampling $K$ devices uniformly without replacement and sets $q_k = p_k \frac{N}{K}$, in which case

$\mathbb{E}_{S_{t+1}} \|w_{t+1} - v_{t+1}\|^2 \le \frac{4(N-K)}{K(N-1)}\alpha_t^2 E^2 G^2.$ (6)

We summarize these upper bounds as

$\mathbb{E}_{S_{t+1}} \|w_{t+1} - v_{t+1}\|^2 \le \frac{4}{K}\alpha_t^2 E^2 G^2,$ (7)

and this bound will be used in the convergence proofs of the partial participation results.
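The two sampling schemes and the unbiasedness property $\mathbb{E}_{S_t} w_t = v_t$ can be illustrated concretely. The snippet below is a minimal sketch with our own function names, using a scalar parameter per device, and checks the unbiasedness by Monte Carlo:

```python
import numpy as np

def scheme_one(rng, p, K):
    """Scheme I: sample K indices i.i.d. with replacement w.p. p_k; q_k = 1/K."""
    idx = rng.choice(len(p), size=K, replace=True, p=p)
    return idx, np.full(K, 1.0 / K)

def scheme_two(rng, p, K):
    """Scheme II: sample K devices uniformly without replacement; q_k = p_k * N / K."""
    N = len(p)
    idx = rng.choice(N, size=K, replace=False)
    return idx, p[idx] * N / K

def aggregate(rng, sampler, p, v, K):
    """One communication round: w = sum_{k in S} q_k v_k."""
    idx, q = sampler(rng, p, K)
    return float(q @ v[idx])

# Monte Carlo check that both schemes satisfy E_S[w] = v_bar = sum_k p_k v_k.
rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])
v = np.array([1.0, -2.0, 0.5, 3.0])
v_bar = float(p @ v)
est = {
    s.__name__: np.mean([aggregate(rng, s, p, v, K=2) for _ in range(50_000)])
    for s in (scheme_one, scheme_two)
}
```

Both empirical means should agree with $\bar v = \sum_k p_k v_k$ up to Monte Carlo error, which is exactly the unbiasedness used in Eq (5) through Eq (7).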

B COMPARISON OF CONVERGENCE RATES WITH RELATED WORKS

In this section, we compare our convergence rates with the best-known results in the literature (see Table 2). In Haddadpour & Mahdavi (2019), the authors provide an $O(1/NT)$ convergence rate for nonconvex problems under the Polyak-Łojasiewicz (PL) condition, which means their results directly apply to strongly convex problems. However, their analysis relies on a bounded gradient diversity assumption, defined as

$\Lambda(w) = \frac{\sum_k p_k \|\nabla F_k(w)\|_2^2}{\|\sum_k p_k \nabla F_k(w)\|_2^2} \le B.$

This is a more restrictive assumption than bounded gradients when the target accuracy $\epsilon \to 0$ under the PL condition. To see this, consider the gradient diversity at the global optimum $w^*$, i.e., $\Lambda(w^*) = \frac{\sum_k p_k \|\nabla F_k(w^*)\|_2^2}{\|\sum_k p_k \nabla F_k(w^*)\|_2^2}$. Since the denominator $\|\nabla F(w^*)\|_2^2 = 0$, for $\Lambda(w^*)$ to be bounded it requires $\|\nabla F_k(w^*)\|_2^2 = 0$ for all $k$. This indicates $w^*$ is also the minimizer of each local objective, which contradicts the practical setting of heterogeneous data. Therefore, their bound is not effective for arbitrarily small $\epsilon$-accuracy under general heterogeneous data, while our convergence results still hold in this case.

In Karimireddy et al. (2019), linear speedup convergence rates of FedAvg are provided for strongly convex, general convex, and nonconvex problems. However, their rates do not enjoy linear speedup for every number of participating devices, while our results apply to any valid $K \le N$. For example, they provide an optimality gap of $O((1 - \frac{K}{N})\frac{E}{T})$ for the strongly convex case (Karimireddy et al., 2019, Theorem V): with partial participation and $K = O(1)$, this rate is $O(E/T)$, which does not have linear speedup, and it is $O(E/NT)$ only when $K = O(N)$. Moreover, under partial participation, the FedAvg analysis in Karimireddy et al. (2019) requires $E = O(1)$.

| Algorithm | Convergence rate | Maximal $E$ | Participation | Assumption | Objective |
| FedAvg, Haddadpour & Mahdavi (2019) | $O(\frac{1}{KT})$ | $O(K^{-1/3}T^{2/3})$ †‡‡ | Partial | Bounded gradient diversity | Strongly convex § |
| FedAvg, Koloskova et al. (2020) | $O(\frac{1}{NT})$ | $O(N^{-1/2}T^{1/2})$ | Full | Bounded gradient | Strongly convex |
| FedAvg, Karimireddy et al. (2019) | $O(\frac{1}{NT})$ †† | $O(N^{-1/2}T^{1/2})$ †† | Partial | Bounded gradient dissimilarity | Strongly convex |
| FedAvg/N-FedAvg (our work) | $O(\frac{1}{KT})$ | $O(N^{-1/2}T^{1/2})$ ‡ | Partial | Bounded gradient | Strongly convex |
| FedAvg, Khaled et al. (2020) | $O(\frac{1}{\sqrt{NT}})$ | $O(N^{-3/2}T^{1/2})$ | Full | Bounded gradient | Convex |
| FedAvg, Koloskova et al. (2020) | $O(\frac{1}{\sqrt{NT}})$ | $O(N^{-3/4}T^{1/4})$ | Full | Bounded gradient | Convex |
| FedAvg, Karimireddy et al. (2019) | $O(\frac{1}{\sqrt{NT}})$ †† | $O(N^{-3/4}T^{1/4})$ †† | Partial | Bounded gradient dissimilarity | Convex |
| FedAvg/N-FedAvg (our work) | $O(\frac{1}{\sqrt{KT}})$ | $O(N^{-3/4}T^{1/4})$ ‡ | Partial | Bounded gradient | Convex |
| FedAvg (our work) | $O(\exp(-\frac{NT}{E\kappa_1}))$ | $O(T^\beta)$ | Partial | Bounded gradient | Overparameterized LR |
| FedMaSS (our work) | $O(\exp(-\frac{NT}{E\sqrt{\kappa_1\kappa}}))$ | $O(T^\beta)$ | Partial | Bounded gradient | Overparameterized LR |

Table 2: A high-level summary of the convergence results in this paper compared to prior state-of-the-art FL algorithms. This table only highlights the dependence on $T$ (number of iterations), $E$ (the maximal number of local steps), $N$ (the total number of devices), and $K \le N$ (the number of participating devices). $\kappa$ is the condition number of the system and $\beta \in (0,1)$. We denote Nesterov accelerated FedAvg as N-FedAvg in this table.
† This $E$ is obtained under the i.i.d. setting.
‡ This $E$ is obtained under the full participation setting.
§ In Haddadpour & Mahdavi (2019), the convergence rate is for nonconvex smooth problems under the PL condition, which also applies to strongly convex problems; we therefore compare it with our strongly convex results here.
‡‡ The bounded gradient diversity assumption is not applicable for general heterogeneous data when converging to arbitrarily small $\epsilon$-accuracy (see the discussion in Sec B).
†† Although the results in Karimireddy et al. (2019) apply to the partial participation setting, they only achieve linear speedup under full participation ($K = N$), while we show linear speedup convergence for any $K \le N$ (see the discussion in Sec B). The $E$ in the table is obtained under full participation; under partial participation their analysis requires $E = O(1)$.

C A HIGH-LEVEL SUMMARY OF FEDAVG ANALYSIS

To facilitate the understanding of our analysis and highlight the improvements of our work compared to prior art, we summarize the general steps used in the proofs across the various settings. In this section, we take the strongly convex case as an example to illustrate our analysis. The corresponding proof for general convex functions follows the same framework.

One step progress bound

This step establishes the progress of the distance to the optimal solution ($\|w_t - w^*\|^2$) after one SGD step (see line 9, Alg 1):

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le O\left(\eta_t \mathbb{E}\|w_t - w^*\|^2 + \alpha_t^2 \sigma^2 / N + \alpha_t^3 E^2 G^2\right).$

Algorithm 1 FedAvg: Federated Averaging

The above bound consists of three main ingredients: the distance to the optimum at the previous step (with $\eta_t \in (0,1)$ to obtain a contraction bound), the variance of the stochastic gradients at the local clients (second term), and the variance across different clients (third term). Notice that the third term in this bound is the primary source of improvement in the rate. Compared to the bound in Li et al. (2020b), we improve the third term from $O(\alpha_t^2 E^2 G^2)$ to $O(\alpha_t^3 E^2 G^2)$, which enables the linear speedup in the convergence rate.
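The pseudocode of Algorithm 1 did not survive extraction here, but the loop being analyzed (local SGD with weighted averaging every $E$ steps, full participation) can be sketched as follows; the function names and signature are ours, not the paper's:

```python
import numpy as np

def fedavg(grad_fns, p, w0, T, E, lr, rng=None):
    """Minimal full-participation FedAvg sketch (names/signature are ours).

    grad_fns[k](w, rng) returns a stochastic gradient of F_k at w.
    Devices run local SGD and synchronize by weighted averaging every E steps:
        w_t = sum_k p_k w_t^k  when (t + 1) is a communication step.
    """
    N = len(grad_fns)
    w_local = [w0.copy() for _ in range(N)]
    for t in range(T):
        for k in range(N):
            w_local[k] = w_local[k] - lr(t) * grad_fns[k](w_local[k], rng)
        if (t + 1) % E == 0:  # communication round
            w_avg = sum(pk * wk for pk, wk in zip(p, w_local))
            w_local = [w_avg.copy() for _ in range(N)]
    return sum(pk * wk for pk, wk in zip(p, w_local))
```

With deterministic quadratic local objectives $F_k(w) = \frac{1}{2}\|w - c_k\|^2$, each communication round contracts the averaged iterate toward the global optimum $\sum_k p_k c_k$, matching the one-step contraction described above.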

Iterative deduction

This step uses the one-step progress bound iteratively to connect the current distance to the optimal solution with the initial distance $\|w_0 - w^*\|^2$: $\mathbb{E}\|w_{t+1} - w^*\|^2 \le O(\mathbb{E}\|w_0 - w^*\|^2 \frac{1}{T})$. Then we can use the distance to the optimum to upper bound the optimality gap via $\mathbb{E}[F(w_t)] - F^* \le O(\mathbb{E}\|w_t - w^*\|^2)$, giving $F(w_t) - F^* \le O(1/T)$: the optimality gap converges at the same rate as the distance to the optimum.

From full participation to partial participation

There are three sources of variance that affect the convergence rate. The first two come from the variance within local clients and the variance across clients (the second and third terms in the one-step progress bound). Partial participation, which involves a sampling procedure, is the third source of variance. Therefore, compared to the rate under full participation, it adds another variance term to the convergence rate; here we follow a derivation similar to Li et al. (2020b).

D TECHNICAL LEMMAS

To facilitate reading, we first summarize some basic properties of $L$-smooth and $\mu$-strongly convex functions, found in e.g. Rockafellar (1970), which are used in various steps of the proofs in the appendix.

Lemma 1. Let $F$ be a convex $L$-smooth function. Then we have the following inequalities:
1. Quadratic upper bound: $0 \le F(w) - F(w') - \langle \nabla F(w'), w - w' \rangle \le \frac{L}{2}\|w - w'\|^2$.
2. Coercivity: $\frac{1}{L}\|\nabla F(w) - \nabla F(w')\|^2 \le \langle \nabla F(w) - \nabla F(w'), w - w' \rangle$.
3. Lower bound: $F(w) \ge F(w') + \langle \nabla F(w'), w - w' \rangle + \frac{1}{2L}\|\nabla F(w) - \nabla F(w')\|^2$. In particular, $\|\nabla F(w)\|^2 \le 2L(F(w) - F(w^*))$.

4. Optimality gap:

$F(w) - F(w^*) \le \langle \nabla F(w), w - w^* \rangle.$

Lemma 2. Let $F$ be a $\mu$-strongly convex function. Then

$F(w) \le F(w') + \langle \nabla F(w'), w - w' \rangle + \frac{1}{2\mu}\|\nabla F(w) - \nabla F(w')\|^2, \qquad F(w) - F(w^*) \le \frac{1}{2\mu}\|\nabla F(w)\|^2.$

E PROOF OF CONVERGENCE RESULTS FOR FEDAVG

E.1 STRONGLY CONVEX SMOOTH OBJECTIVES

To organize our proofs more effectively and highlight the significance of our results compared to prior works, we first state the key lemmas used in the proofs of the main results and defer their proofs to later.

Lemma 3 (One step progress, strongly convex). Let $w_t = \sum_{k=1}^N p_k w_t^k$, suppose our functions satisfy Assumptions 1, 2, 3, 4, and set the step size $\alpha_t = \frac{4}{\mu(\gamma + t)}$ with $\gamma = \max\{32\kappa, E\}$ and $\kappa = \frac{L}{\mu}$. Then the updates of FedAvg with full participation satisfy

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le (1 - \mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + \alpha_t^2 \frac{1}{N}\nu_{\max}^2 \sigma^2 + 6E^2 L \alpha_t^3 G^2.$

We emphasize that the above lemma is the key step that allows us to obtain a bound that improves on the convergence result of Li et al. (2020b) with linear speedup. Its proof makes use of the following two results.

Lemma 4 (Bounding gradient variance; Lemma 2 in Li et al. (2020b)). Given Assumption 3, the gradient variance is bounded as $\mathbb{E}\|g_t - \bar g_t\|^2 \le \sum_{k=1}^N p_k^2 \sigma_k^2$.

Lemma 5 (Bounding the divergence of $w_t^k$; Lemma 3 in Li et al. (2020b)). Given Assumption 4, and assuming $\alpha_t$ is non-increasing with $\alpha_t \le 2\alpha_{t+E}$ for all $t \ge 0$, we have $\mathbb{E}\sum_{k=1}^N p_k \|w_t - w_t^k\|^2 \le 4E^2\alpha_t^2 G^2$.

We now restate Theorem 1 from the main text and then prove it using Lemma 3.

Theorem 1. Let $w_T = \sum_{k=1}^N p_k w_T^k$ in FedAvg, $\nu_{\max} = \max_k N p_k$, and set decaying learning rates $\alpha_t = \frac{4}{\mu(\gamma+t)}$ with $\gamma = \max\{32\kappa, E\}$ and $\kappa = \frac{L}{\mu}$. Then under Assumptions 1, 2, 3, 4 with full device participation,

$\mathbb{E} F(w_T) - F^* = O\left(\frac{\kappa \nu_{\max}^2 \sigma^2/\mu}{NT} + \frac{\kappa^2 E^2 G^2/\mu}{T^2}\right),$

and with partial device participation with at most $K$ sampled devices at each communication round,

$\mathbb{E} F(w_T) - F^* = O\left(\frac{\kappa E^2 G^2/\mu}{KT} + \frac{\kappa \nu_{\max}^2 \sigma^2/\mu}{NT} + \frac{\kappa^2 E^2 G^2/\mu}{T^2}\right).$

Proof.
The road map of the proof for full device participation contains three steps. First, we establish a recursive relationship between $\mathbb{E}\|w_{t+1} - w^*\|^2$ and $\mathbb{E}\|w_t - w^*\|^2$, upper bounding the progress of FedAvg from step $t$ to step $t+1$. Second, we show that $\mathbb{E}\|w_t - w^*\|^2 = O(\frac{\nu_{\max}^2\sigma^2/\mu}{tN} + \frac{E^2 L G^2/\mu^2}{t^2})$ by induction using the recursive relationship from the previous step. Third, we use L-smoothness to bound the optimality gap by $\mathbb{E}\|w_t - w^*\|^2$.

By Lemma 3, we have the following upper bound for the one-step progress:

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le (1 - \mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + \alpha_t^2 \frac{1}{N}\nu_{\max}^2\sigma^2 + 6E^2 L\alpha_t^3 G^2.$

We show next that $\mathbb{E}\|w_t - w^*\|^2 = O(\frac{\nu_{\max}^2\sigma^2/\mu}{tN} + \frac{E^2 L G^2/\mu^2}{t^2})$ by induction. To simplify the presentation, denote $C \equiv 6E^2 L G^2$ and $D \equiv \frac{1}{N}\nu_{\max}^2\sigma^2$. Suppose we have the bound $\mathbb{E}\|w_t - w^*\|^2 \le b(\alpha_t D + \alpha_t^2 C)$ for some constant $b$ and learning rates $\alpha_t$. Then the one-step progress from Lemma 3 becomes

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le (b(1-\mu\alpha_t) + \alpha_t)\alpha_t D + (b(1-\mu\alpha_t) + \alpha_t)\alpha_t^2 C.$

To establish the result at step $t+1$, it remains to choose $\alpha_t$ and $b$ such that $(b(1-\mu\alpha_t)+\alpha_t)\alpha_t \le b\alpha_{t+1}$ and $(b(1-\mu\alpha_t)+\alpha_t)\alpha_t^2 \le b\alpha_{t+1}^2$.
If we let $\alpha_t = \frac{4}{\mu(t+\gamma)}$ with $\gamma = \max\{E, 32\kappa\}$ (this choice of $\gamma$ is required to guarantee the one-step progress) and set $b = \frac{4}{\mu}$, we have

$(b(1-\mu\alpha_t)+\alpha_t)\alpha_t = \left(b\frac{t+\gamma-4}{t+\gamma} + \frac{4}{\mu(t+\gamma)}\right)\frac{4}{\mu(t+\gamma)} = b\frac{t+\gamma-3}{t+\gamma}\cdot\frac{4}{\mu(t+\gamma)} \le b\frac{4}{\mu(t+\gamma+1)} = b\alpha_{t+1},$

$(b(1-\mu\alpha_t)+\alpha_t)\alpha_t^2 = b\frac{t+\gamma-3}{t+\gamma}\cdot\frac{16}{\mu^2(t+\gamma)^2} \le b\frac{16}{\mu^2(t+\gamma+1)^2} = b\alpha_{t+1}^2,$

where we have used the inequalities

$\frac{t+\gamma-1}{(t+\gamma)^2} \le \frac{1}{t+\gamma+1}, \qquad \frac{t+\gamma-2}{(t+\gamma)^3} \le \frac{1}{(t+\gamma+1)^2} \qquad \forall\, \gamma \ge 1.$

Thus we have established the result at step $t+1$ assuming it holds at step $t$:

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le b(\alpha_{t+1} D + \alpha_{t+1}^2 C).$

At step $t = 0$, we can ensure the following inequality by scaling $b$ by $c\|w_0 - w^*\|^2$ for a sufficiently large constant $c$:

$\|w_0 - w^*\|^2 \le b(\alpha_0 D + \alpha_0^2 C) = b\left(\frac{4}{\mu\gamma}D + \frac{16}{\mu^2\gamma^2}C\right).$

It follows that

$\mathbb{E}\|w_t - w^*\|^2 \le c\|w_0 - w^*\|^2 \frac{4}{\mu}(D\alpha_t + C\alpha_t^2) \quad \forall\, t \ge 0.$ (8)

Finally, the L-smoothness of $F$ implies

$\mathbb{E} F(w_T) - F^* \le \frac{L}{2}\mathbb{E}\|w_T - w^*\|^2 \le \frac{L}{2} c\|w_0 - w^*\|^2 \frac{4}{\mu}(D\alpha_T + C\alpha_T^2) = 2c\|w_0 - w^*\|^2 \kappa (D\alpha_T + C\alpha_T^2)$
$\le 2c\|w_0 - w^*\|^2 \kappa \left(\frac{4}{\mu(T+\gamma)}\cdot\frac{1}{N}\nu_{\max}^2\sigma^2 + 6E^2 L G^2 \left(\frac{4}{\mu(T+\gamma)}\right)^2\right) = O\left(\frac{\kappa}{\mu}\frac{\nu_{\max}^2\sigma^2}{N}\cdot\frac{1}{T} + \frac{\kappa^2}{\mu}E^2G^2\cdot\frac{1}{T^2}\right),$

where in the first line we use the property of L-smooth functions (see Lemma 1), and in the second line we use the conclusion in Eq (8).

With partial participation, the update at each communication round is given by a weighted average over a subset of sampled devices. When $t+1 \notin \mathcal{I}_E$, $v_{t+1} = w_{t+1}$, while when $t+1 \in \mathcal{I}_E$, we have $\mathbb{E} w_{t+1} = v_{t+1}$ by design of the sampling schemes (Li et al. (2020b), Lemma 4), so that

$\mathbb{E}\|w_{t+1} - w^*\|^2 = \mathbb{E}\|w_{t+1} - v_{t+1} + v_{t+1} - w^*\|^2 = \mathbb{E}\|w_{t+1} - v_{t+1}\|^2 + \mathbb{E}\|v_{t+1} - w^*\|^2.$

This in particular implies $\mathbb{E}\|v_t - w^*\|^2 \le \mathbb{E}\|w_t - w^*\|^2$ for all $t$.
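The two elementary inequalities invoked in the induction can be checked numerically; this is only a sanity check of the algebra, not part of the proof:

```python
# Check (t+γ-1)/(t+γ)^2 <= 1/(t+γ+1) and (t+γ-2)/(t+γ)^3 <= 1/(t+γ+1)^2.
# Clearing denominators with s = t + γ, these read s^2 - 1 <= s^2 and
# s^3 - 3s - 2 <= s^3, which hold for all t >= 0 and γ >= 1.
def induction_inequalities_hold(t, gamma):
    s = t + gamma
    first = (s - 1) / s**2 <= 1 / (s + 1)
    second = (s - 2) / s**3 <= 1 / (s + 1)**2
    return first and second

all_hold = all(
    induction_inequalities_hold(t, g)
    for t in range(0, 2000)
    for g in (1, 2, 5, 32, 320)
)
```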
Since $v_t = \sum_{k=1}^N p_k v_t^k$ always averages over all devices, the full participation one-step progress result (Lemma 3) applied to $v_t$ implies

$\mathbb{E}\|v_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)\mathbb{E}\|v_t - w^*\|^2 + 6E^2L\alpha_t^3G^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2 \le (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + 6E^2L\alpha_t^3G^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2.$

The bound on $\mathbb{E}\|w_{t+1} - v_{t+1}\|^2$ for the two sampling schemes we consider is provided in Eq (7), and applying it we can write the one-step progress for partial participation as

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2 + \frac{4}{K}\alpha_t^2E^2G^2 + 6E^2L\alpha_t^3G^2,$ (9)

and the same induction and L-smoothness arguments as in the full participation case imply

$\mathbb{E} F(w_T) - F^* = O\left(\frac{\kappa\nu_{\max}^2\sigma^2/\mu}{NT} + \frac{\kappa E^2G^2/\mu}{KT} + \frac{\kappa^2E^2G^2/\mu}{T^2}\right).$

E.1.1 DEFERRED PROOFS OF KEY LEMMAS

Here we first rewrite the proofs of Lemmas 4 and 5 from Li et al. (2020b), with slight modifications, for the consistency and completeness of this work, since we will later use modified versions of these results in the convergence proof for Nesterov accelerated FedAvg.

Proof of Lemma 4 (Li et al. (2020b)). Since the stochastic gradients at different devices are independent,

$\mathbb{E}\|g_t - \bar g_t\|^2 = \mathbb{E}\|g_t - \mathbb{E} g_t\|^2 = \sum_{k=1}^N p_k^2\,\mathbb{E}\|g_{t,k} - \mathbb{E} g_{t,k}\|^2 \le \sum_{k=1}^N p_k^2\sigma_k^2.$

Proof of Lemma 5. Since communication is done every $E$ steps, for any $t \ge 0$ we can find a $t_0 \le t$ such that $t - t_0 \le E-1$ and $w_{t_0}^k = w_{t_0}$ for all $k$. Using that $\alpha_t$ is non-increasing and $\alpha_{t_0} \le 2\alpha_t$ for any $t - t_0 \le E-1$, we bound $\mathbb{E}\sum_{k=1}^N p_k\|w_t - w_t^k\|^2$ as follows:

$\mathbb{E}\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 = \mathbb{E}\sum_{k=1}^N p_k\|w_t^k - w_{t_0} - (w_t - w_{t_0})\|^2 \le \mathbb{E}\sum_{k=1}^N p_k\|w_t^k - w_{t_0}\|^2 = \mathbb{E}\sum_{k=1}^N p_k\|w_t^k - w_{t_0}^k\|^2$
$= \mathbb{E}\sum_{k=1}^N p_k\left\|\sum_{i=t_0}^{t-1}\alpha_i g_{i,k}\right\|^2 \le \sum_{k=1}^N p_k (t-t_0)\sum_{i=t_0}^{t-1}\alpha_i^2\,\mathbb{E}\|g_{i,k}\|^2 \le \sum_{k=1}^N p_k E^2\alpha_{t_0}^2 G^2 \le 4E^2\alpha_t^2G^2,$

where the first inequality holds because $w_t - w_{t_0}$ is the $p$-weighted average of the $w_t^k - w_{t_0}$.

Based on the results of Lemmas 4 and 5, we now prove the upper bound on the one-step SGD progress. This proof improves on the previous work Li et al. (2020b) and is the first to reveal the linear speedup of the convergence of FedAvg.

Proof of Lemma 3.
We have

$\|w_{t+1} - w^*\|^2 = \|(w_t - \alpha_t g_t) - w^*\|^2 = \|(w_t - \alpha_t \bar g_t - w^*) - \alpha_t(g_t - \bar g_t)\|^2 = A_1 + A_2 + A_3,$

where we denote

$A_1 = \|w_t - w^* - \alpha_t\bar g_t\|^2, \quad A_2 = 2\alpha_t\langle w_t - w^* - \alpha_t\bar g_t,\, \bar g_t - g_t\rangle, \quad A_3 = \alpha_t^2\|g_t - \bar g_t\|^2.$

By the definitions of $g_t$ and $\bar g_t$ (see Eq (4)), we have $\mathbb{E} A_2 = 0$. For $A_3$, Lemma 4 gives the upper bound $\alpha_t^2\,\mathbb{E}\|g_t - \bar g_t\|^2 \le \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2$. Next we bound $A_1$:

$\|w_t - w^* - \alpha_t\bar g_t\|^2 = \|w_t - w^*\|^2 + 2\langle w_t - w^*, -\alpha_t\bar g_t\rangle + \alpha_t^2\|\bar g_t\|^2,$

and we will show that the third term $\alpha_t^2\|\bar g_t\|^2$ can be canceled by an upper bound of the second term, which is one of the major improvements over the prior art Li et al. (2020b). The upper bound of the second term can be derived as follows, using the strong convexity and L-smoothness of $F_k$:

$-2\alpha_t\langle w_t - w^*, \bar g_t\rangle = -2\alpha_t\sum_{k=1}^N p_k\langle w_t - w^*, \nabla F_k(w_t^k)\rangle = -2\alpha_t\sum_{k=1}^N p_k\langle w_t - w_t^k, \nabla F_k(w_t^k)\rangle - 2\alpha_t\sum_{k=1}^N p_k\langle w_t^k - w^*, \nabla F_k(w_t^k)\rangle$
$\le -2\alpha_t\sum_{k=1}^N p_k\langle w_t - w_t^k, \nabla F_k(w_t^k)\rangle + 2\alpha_t\sum_{k=1}^N p_k(F_k(w^*) - F_k(w_t^k)) - \alpha_t\mu\sum_{k=1}^N p_k\|w_t^k - w^*\|^2$
$\le 2\alpha_t\sum_{k=1}^N p_k\left(F_k(w_t^k) - F_k(w_t) + \frac{L}{2}\|w_t - w_t^k\|^2 + F_k(w^*) - F_k(w_t^k)\right) - \alpha_t\mu\sum_{k=1}^N p_k\|w_t^k - w^*\|^2$
$= \alpha_t L\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 + 2\alpha_t\sum_{k=1}^N p_k[F_k(w^*) - F_k(w_t)] - \alpha_t\mu\|w_t - w^*\|^2,$

where in the last step we use Jensen's inequality $\sum_k p_k\|w_t^k - w^*\|^2 \ge \|w_t - w^*\|^2$. We record the bound obtained so far, as it will also be used in the proof of the convex case:

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + \alpha_t L\sum_{k=1}^N p_k\,\mathbb{E}\|w_t - w_t^k\|^2 + 2\alpha_t\sum_{k=1}^N p_k\,\mathbb{E}[F_k(w^*) - F_k(w_t)] + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2 + \alpha_t^2\,\mathbb{E}\|\bar g_t\|^2.$ (10)

The term $2\alpha_t\sum_{k=1}^N p_k[F_k(w^*) - F_k(w_t)]$ is negative and could simply be dropped, but this yields a suboptimal bound that fails to provide the desired linear speedup.
Instead, we upper bound it using the following derivation:

$2\alpha_t\sum_{k=1}^N p_k[F_k(w^*) - F_k(w_t)] = 2\alpha_t[F(w^*) - F(w_t)] \le 2\alpha_t\,\mathbb{E}[F(w_{t+1}) - F(w_t)]$
$\le 2\alpha_t\,\mathbb{E}\langle\nabla F(w_t), w_{t+1} - w_t\rangle + \alpha_t L\,\mathbb{E}\|w_{t+1} - w_t\|^2 = -2\alpha_t^2\,\mathbb{E}\langle\nabla F(w_t), g_t\rangle + \alpha_t^3 L\,\mathbb{E}\|g_t\|^2 = -2\alpha_t^2\langle\nabla F(w_t), \bar g_t\rangle + \alpha_t^3 L\,\mathbb{E}\|g_t\|^2$
$= -\alpha_t^2\left(\|\nabla F(w_t)\|^2 + \|\bar g_t\|^2 - \|\nabla F(w_t) - \bar g_t\|^2\right) + \alpha_t^3 L\,\mathbb{E}\|g_t\|^2$
$= -\alpha_t^2\left(\|\nabla F(w_t)\|^2 + \|\bar g_t\|^2 - \left\|\sum_k p_k(\nabla F_k(w_t) - \nabla F_k(w_t^k))\right\|^2\right) + \alpha_t^3 L\,\mathbb{E}\|g_t\|^2$
$\le -\alpha_t^2\left(\|\nabla F(w_t)\|^2 + \|\bar g_t\|^2 - \sum_k p_k\|\nabla F_k(w_t) - \nabla F_k(w_t^k)\|^2\right) + \alpha_t^3 L\,\mathbb{E}\|g_t\|^2$
$\le -\alpha_t^2\|\bar g_t\|^2 + \alpha_t^2 L^2\sum_k p_k\|w_t - w_t^k\|^2 + \alpha_t^3 L\,\mathbb{E}\|g_t\|^2 - \alpha_t^2\|\nabla F(w_t)\|^2,$

where we have used the smoothness of $F$ twice and Jensen's inequality. Note that the term $-\alpha_t^2\|\bar g_t\|^2$ exactly cancels the $\alpha_t^2\|\bar g_t\|^2$ in the bound in Eq (10), so that plugging in the bound for $-2\alpha_t\langle w_t - w^*, \bar g_t\rangle$, we have so far proved

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + \alpha_t L\sum_{k=1}^N p_k\,\mathbb{E}\|w_t - w_t^k\|^2 + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2 + \alpha_t^2 L^2\sum_{k=1}^N p_k\,\mathbb{E}\|w_t - w_t^k\|^2 + \alpha_t^3 L\,\mathbb{E}\|g_t\|^2 - \alpha_t^2\|\nabla F(w_t)\|^2.$ (11)

Under Assumption 4, we have $\mathbb{E}\|g_t\|^2 \le G^2$. Furthermore, we can check that our choice of $\alpha_t$ is non-increasing and satisfies $\alpha_t \le 2\alpha_{t+E}$, so we may plug the bound $\mathbb{E}\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 \le 4E^2\alpha_t^2G^2$ from Lemma 5 into the above inequality. Therefore, with $\nu_{\max} := N\max_k p_k$ and $\nu_{\min} := N\min_k p_k$, we conclude

$\mathbb{E}\|w_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + 4E^2L\alpha_t^3G^2 + 4E^2L^2\alpha_t^4G^2 + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2 + \alpha_t^3LG^2$
$= (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + 4E^2L\alpha_t^3G^2 + 4E^2L^2\alpha_t^4G^2 + \alpha_t^2\frac{1}{N^2}\sum_{k=1}^N(p_kN)^2\sigma_k^2 + \alpha_t^3LG^2$
$\le (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + 4E^2L\alpha_t^3G^2 + 4E^2L^2\alpha_t^4G^2 + \alpha_t^2\frac{\nu_{\max}^2}{N^2}\sum_{k=1}^N\sigma_k^2 + \alpha_t^3LG^2$
$\le (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + 6E^2L\alpha_t^3G^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2,$

where in the last inequality we use $\sigma^2 = \sum_{k=1}^N p_k\sigma_k^2$ and that by construction $\alpha_t$ satisfies $L\alpha_t \le \frac{1}{8}$.
One may ask whether the dependence on $E$ in the term $\frac{\kappa E^2G^2/\mu}{KT}$ can be removed, or equivalently whether $\sum_k p_k\|w_t^k - w_t\|^2 = O(1/T^2)$ can hold independently of $E$. We provide a simple counterexample showing that this is not possible in general.

Proposition 1. There exists a dataset such that if $E = O(T^\beta)$ for any $\beta > 0$, then $\sum_k p_k\|w_t^k - w_t\|^2 = \Omega(\frac{1}{T^{2-2\beta}})$.

Proof. Suppose we have an even number of devices and each $F_k(w) = \frac{1}{n_k}\sum_{j=1}^{n_k}\|x_k^j - w\|^2$ contains data points $x_k^j = w_{*,k}$, with $n_k \equiv n$. Moreover, the $w_{*,k}$'s come in pairs around the origin. As a result, the global objective $F$ is minimized at $w^* = 0$. Moreover, if we start from $w_0 = 0$, then by design of the dataset the updates in the local steps exactly cancel each other at each iteration, resulting in $w_t = 0$ for all $t$. On the other hand, if $E = T^\beta$, then starting from any $t = O(T)$ with constant step size $O(\frac{1}{T})$, after $E$ local steps the local parameters move toward $w_{*,k}$ with $\|w_{t+E}^k\|^2 = \Omega((T^\beta\cdot\frac{1}{T})^2) = \Omega(\frac{1}{T^{2-2\beta}})$. This implies

$\sum_k p_k\|w_{t+E}^k - w_{t+E}\|^2 = \sum_k p_k\|w_{t+E}^k\|^2 = \Omega\left(\frac{1}{T^{2-2\beta}}\right),$

which decays at a slower rate than $\frac{1}{T^2}$ for any $\beta > 0$. Thus the sampling variance $\mathbb{E}\|w_{t+1} - v_{t+1}\|^2 = \Omega(\sum_k p_k\,\mathbb{E}\|w_{t+1}^k - w_{t+1}\|^2)$ decays at a slower rate than $\frac{1}{T^2}$, resulting in a convergence rate slower than $O(\frac{1}{T})$ with partial participation.
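The counterexample of Proposition 1 is easy to simulate. The snippet below (our own illustration, not the paper's code) uses two devices with quadratic objectives centered at $\pm c$: the averaged iterate stays at $w^* = 0$ by symmetry, while the local divergence after $E$ local steps of size $\sim 1/T$ grows like $(E/T)^2$:

```python
# Two devices with F_1(w) = (w - c)^2 and F_2(w) = (w + c)^2, so F is
# minimized at w* = 0.  Starting from 0, local updates exactly mirror each
# other, so the average stays at 0, but the per-device divergence after
# E = T^beta local steps with step size 1/T scales like (E/T)^2.
def divergence_after_local_steps(c, E, lr):
    w1 = w2 = 0.0
    for _ in range(E):
        w1 -= lr * 2 * (w1 - c)   # gradient of (w - c)^2
        w2 -= lr * 2 * (w2 + c)   # gradient of (w + c)^2
    w_bar = 0.5 * (w1 + w2)       # averaged iterate, stays at the optimum 0
    divergence = 0.5 * ((w1 - w_bar) ** 2 + (w2 - w_bar) ** 2)
    return w_bar, divergence

T = 10_000
w_bar, div_one_step = divergence_after_local_steps(1.0, E=1, lr=1.0 / T)
_, div_many_steps = divergence_after_local_steps(1.0, E=int(T ** 0.5), lr=1.0 / T)
```

With $E = \sqrt{T}$ ($\beta = 1/2$), the divergence is larger than with $E = 1$ by roughly a factor of $E^2$, matching the $\Omega(1/T^{2-2\beta})$ rate in the proposition.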

E.2 CONVEX SMOOTH OBJECTIVES

In this section we provide the proof of the convergence result for FedAvg with convex smooth objectives. The key step is a one-step progress result analogous to that of the strongly convex case, and their proofs share identical components as well.

Lemma 6 (One step progress, convex case). Let $w_t = \sum_{k=1}^N p_kw_t^k$ in FedAvg. Under Assumptions 1, 3, 4, the following bound holds for all $t$:

$\mathbb{E}\|w_{t+1} - w^*\|^2 + \alpha_t(F(w_t) - F(w^*)) \le \mathbb{E}\|w_t - w^*\|^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2 + 6\alpha_t^3E^2LG^2.$

Proof. The first part of the proof follows directly from Eq (10) in the proof of Lemma 3. Setting $\mu = 0$ in Eq (10) (since we are now in the convex setting instead of the strongly convex one), we obtain

$\|w_{t+1} - w^*\|^2 \le \|w_t - w^*\|^2 + \alpha_tL\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 + 2\alpha_t\sum_{k=1}^N p_k[F_k(w^*) - F_k(w_t)] + \alpha_t^2\|\bar g_t\|^2 + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2.$

The difference of this bound from the strongly convex case is that we no longer have a contraction factor $1-\mu\alpha_t$ in front of $\|w_t - w^*\|^2$. In the strongly convex case, we were able to cancel $\alpha_t^2\|\bar g_t\|^2$ against $2\alpha_t\sum_k p_k[F_k(w^*) - F_k(w_t)]$ and obtain only lower-order terms. In the convex case, we use a different strategy and preserve $\sum_k p_k[F_k(w^*) - F_k(w_t)]$ in order to obtain the desired optimality gap. We have

$\|\bar g_t\|^2 = \left\|\sum_k p_k\nabla F_k(w_t^k)\right\|^2 = \left\|\sum_k p_k\nabla F_k(w_t^k) - \sum_k p_k\nabla F_k(w_t) + \sum_k p_k\nabla F_k(w_t)\right\|^2$
$\le 2\left\|\sum_k p_k\nabla F_k(w_t^k) - \sum_k p_k\nabla F_k(w_t)\right\|^2 + 2\left\|\sum_k p_k\nabla F_k(w_t)\right\|^2 \le 2L^2\sum_k p_k\|w_t^k - w_t\|^2 + 2\|\nabla F(w_t)\|^2,$

using $\nabla F(w^*) = 0$.
Now using the L-smoothness of $F$, we have $\|\nabla F(w_t)\|^2 \le 2L(F(w_t) - F(w^*))$, so that

$\|w_{t+1} - w^*\|^2 \le \|w_t - w^*\|^2 + \alpha_tL\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 + 2\alpha_t\sum_{k=1}^N p_k[F_k(w^*) - F_k(w_t)] + 2\alpha_t^2L^2\sum_k p_k\|w_t^k - w_t\|^2 + 4\alpha_t^2L(F(w_t) - F(w^*)) + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2$
$= \|w_t - w^*\|^2 + (2\alpha_t^2L^2 + \alpha_tL)\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 + \alpha_t\sum_{k=1}^N p_k[F_k(w^*) - F_k(w_t)] + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2 + \alpha_t(1 - 4\alpha_tL)(F(w^*) - F(w_t)).$

Since $F(w^*) \le F(w_t)$, as long as $4\alpha_tL \le 1$ we can drop the last term and rearrange the inequality to obtain

$\|w_{t+1} - w^*\|^2 + \alpha_t(F(w_t) - F(w^*)) \le \|w_t - w^*\|^2 + (2\alpha_t^2L^2 + \alpha_tL)\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2 \le \|w_t - w^*\|^2 + \frac{3}{2}\alpha_tL\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2.$

The same argument as before yields $\mathbb{E}\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 \le 4E^2\alpha_t^2G^2$, which gives

$\|w_{t+1} - w^*\|^2 + \alpha_t(F(w_t) - F(w^*)) \le \|w_t - w^*\|^2 + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2 + 6\alpha_t^3E^2LG^2 \le \|w_t - w^*\|^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2 + 6\alpha_t^3E^2LG^2.$

With the one-step progress result, we can now prove the convergence result in the convex setting, which we restate below.

Theorem 2. Under Assumptions 1, 3, 4 and constant learning rate $\alpha_t = O(\sqrt{\frac{N}{T}})$, FedAvg satisfies

$\min_{t\le T} F(w_t) - F(w^*) = O\left(\frac{\nu_{\max}\sigma^2}{\sqrt{NT}} + \frac{NE^2LG^2}{T}\right)$

with full participation, and with partial device participation with $K$ sampled devices at each communication round and learning rate $\alpha_t = O(\sqrt{\frac{K}{T}})$,

$\min_{t\le T} F(w_t) - F(w^*) = O\left(\frac{\nu_{\max}\sigma^2}{\sqrt{KT}} + \frac{E^2G^2}{\sqrt{KT}} + \frac{KE^2LG^2}{T}\right).$

Proof. We first prove the bound for full participation.
Applying Lemma 6, we have

$\mathbb{E}\|w_{t+1} - w^*\|^2 + \alpha_t(F(w_t) - F(w^*)) \le \mathbb{E}\|w_t - w^*\|^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2 + 6\alpha_t^3E^2LG^2.$

Summing the inequalities from $t = 0$ to $t = T$ and telescoping, we obtain

$\sum_{t=0}^T \alpha_t(F(w_t) - F(w^*)) \le \|w_0 - w^*\|^2 + \sum_{t=0}^T\alpha_t^2\cdot\frac{1}{N}\nu_{\max}^2\sigma^2 + \sum_{t=0}^T\alpha_t^3\cdot 6E^2LG^2,$

so that

$\min_{t\le T} F(w_t) - F(w^*) \le \frac{1}{\sum_{t=0}^T\alpha_t}\left(\|w_0 - w^*\|^2 + \sum_{t=0}^T\alpha_t^2\cdot\frac{1}{N}\nu_{\max}^2\sigma^2 + \sum_{t=0}^T\alpha_t^3\cdot 6E^2LG^2\right).$

By setting the constant learning rate $\alpha_t \equiv \sqrt{\frac{N}{T}}$, we have

$\min_{t\le T} F(w_t) - F(w^*) \le \frac{1}{\sqrt{NT}}\|w_0 - w^*\|^2 + \frac{1}{\sqrt{NT}}\cdot T\cdot\frac{N}{T}\cdot\frac{1}{N}\nu_{\max}^2\sigma^2 + \frac{1}{\sqrt{NT}}\cdot T\left(\frac{N}{T}\right)^{3/2}\cdot 6E^2LG^2$
$= \left(\|w_0 - w^*\|^2 + \nu_{\max}^2\sigma^2\right)\frac{1}{\sqrt{NT}} + \frac{N}{T}\cdot 6E^2LG^2 = O\left(\frac{\nu_{\max}^2\sigma^2}{\sqrt{NT}} + \frac{NE^2LG^2}{T}\right).$

For partial participation, the one-step progress bound in Lemma 6 is updated in a similar manner as the strongly convex case in (9) to incorporate the sampling variance. More precisely, with partial participation,

$\mathbb{E}\|w_{t+1} - w^*\|^2 = \mathbb{E}\|w_{t+1} - v_{t+1} + v_{t+1} - w^*\|^2 = \mathbb{E}\|w_{t+1} - v_{t+1}\|^2 + \mathbb{E}\|v_{t+1} - w^*\|^2,$

where $\mathbb{E} w_{t+1} = v_{t+1}$ for all $t$ by the unbiasedness of our sampling schemes. Since $v_t = \sum_{k=1}^N p_kv_t^k$ always averages over all devices, the full participation one-step progress bound in Lemma 6 applied to $v_t$ implies

$\mathbb{E}\|v_{t+1} - w^*\|^2 + \alpha_t(F(v_t) - F(w^*)) \le \mathbb{E}\|v_t - w^*\|^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2 + 6\alpha_t^3E^2LG^2 \le \mathbb{E}\|w_t - w^*\|^2 + \alpha_t^2\frac{1}{N}\nu_{\max}^2\sigma^2 + 6\alpha_t^3E^2LG^2.$

The bound on $\mathbb{E}\|w_{t+1} - v_{t+1}\|^2$ for the two sampling schemes we consider is provided in Eq (7), and applying it to the above bound we can write the one-step progress for partial participation as

$\mathbb{E}\|w_{t+1} - w^*\|^2 + \alpha_t(F(w_t) - F(w^*)) \le \mathbb{E}\|w_t - w^*\|^2 + \alpha_t^2\left(\frac{1}{N}\nu_{\max}^2\sigma^2 + C\right) + 6E^2L\alpha_t^3G^2,$

where $C = \frac{4}{K}E^2G^2$ or $\frac{N-K}{N-1}\frac{4}{K}E^2G^2$ depending on the sampling scheme.
Summing up the one-step progress over $t$,

$\min_{t\le T}F(w_t) - F(w^*) \le \frac{1}{\sum_{t=0}^T\alpha_t}\left(\|w_0 - w^*\|^2 + \sum_{t=0}^T\alpha_t^2\cdot\left(\frac{1}{N}\nu_{\max}^2\sigma^2 + C\right) + \sum_{t=0}^T\alpha_t^3\cdot 6E^2LG^2\right),$

so that with $\alpha_t = \sqrt{\frac{K}{T}}$, we have

$\min_{t\le T}F(w_t) - F(w^*) = O\left(\frac{\nu_{\max}\sigma^2}{\sqrt{KT}} + \frac{E^2G^2}{\sqrt{KT}} + \frac{KE^2LG^2}{T}\right).$

F PROOF OF CONVERGENCE RESULTS FOR NESTEROV ACCELERATED FEDAVG F.1 STRONGLY CONVEX SMOOTH OBJECTIVES

Recall that Nesterov accelerated FedAvg follows the updates

$v_{t+1}^k = w_t^k - \alpha_tg_{t,k}, \qquad w_{t+1}^k = \begin{cases} v_{t+1}^k + \beta_t(v_{t+1}^k - v_t^k) & \text{if } t+1\notin\mathcal{I}_E, \\ \sum_{k\in S_{t+1}}q_k\left(v_{t+1}^k + \beta_t(v_{t+1}^k - v_t^k)\right) & \text{if } t+1\in\mathcal{I}_E. \end{cases}$

The proofs of the convergence results for Nesterov accelerated FedAvg consist of components that are direct analogues of the FedAvg case. We first state these analogue results before proving the main theorem. As before, the proofs of the lemmas are deferred to after the main proof.

Lemma 7 (One step progress, Nesterov). Let $v_t = \sum_{k=1}^N p_kv_t^k$ in Nesterov accelerated FedAvg, suppose our functions satisfy Assumptions 1, 2, 3, 4, and set step sizes $\alpha_t = \frac{6}{\mu}\frac{1}{t+\gamma}$ and $\beta_{t-1} = \frac{3}{14(t+\gamma)(1-\frac{6}{t+\gamma})\max\{\mu,1\}}$ with $\gamma = \max\{32\kappa, E\}$ and $\kappa = \frac{L}{\mu}$. Then the updates of Nesterov accelerated FedAvg satisfy

$\mathbb{E}\|v_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)(1+\beta_{t-1})^2\,\mathbb{E}\|v_t - w^*\|^2 + 20E^2L\alpha_t^3G^2 + (1-\alpha_t\mu)\beta_{t-1}^2\,\mathbb{E}\|v_{t-1} - w^*\|^2 + \alpha_t^2\frac{1}{N}\nu_{\max}\sigma^2 + 2\beta_{t-1}(1+\beta_{t-1})(1-\alpha_t\mu)\,\mathbb{E}\left[\|v_t - w^*\|\cdot\|v_{t-1} - w^*\|\right].$

The one-step progress result makes use of the same bound on the gradient variance as Lemma 4, as well as a divergence bound analogous to Lemma 5, which we state below.

Lemma 8 (Bounding the divergence of $w_t^k$, Nesterov). Given Assumption 4, and assuming $\alpha_t$ is non-increasing with $\alpha_t \le 2\alpha_{t+E}$ and $2\beta_{t-1}^2 + 2\alpha_t^2 \le 1/2$ for all $t\ge 0$, $w_t = \sum_{k=1}^N p_kw_t^k$ in Nesterov accelerated FedAvg satisfies

$\mathbb{E}\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 \le 16(E-1)^2\alpha_t^2G^2.$

Theorem 3. Let $v_T = \sum_{k=1}^N p_kv_T^k$ in Nesterov accelerated FedAvg and set learning rates $\alpha_t = \frac{6}{\mu}\frac{1}{t+\gamma}$ and $\beta_{t-1} = \frac{3}{14(t+\gamma)(1-\frac{6}{t+\gamma})\max\{\mu,1\}}$. Then under Assumptions 1, 2, 3, 4 with full device participation,

$\mathbb{E} F(v_T) - F^* = O\left(\frac{\kappa\nu_{\max}\sigma^2/\mu}{NT} + \frac{\kappa^2E^2G^2/\mu}{T^2}\right),$

and with partial device participation with $K$ sampled devices at each communication round,

$\mathbb{E} F(v_T) - F^* = O\left(\frac{\kappa\nu_{\max}\sigma^2/\mu}{NT} + \frac{\kappa E^2G^2/\mu}{KT} + \frac{\kappa^2E^2G^2/\mu}{T^2}\right).$

Proof. We first prove the result for full participation.
Applying the one-step progress bound in Lemma 7, we have

$\mathbb{E}\|v_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)(1+\beta_{t-1})^2\,\mathbb{E}\|v_t - w^*\|^2 + 20E^2L\alpha_t^3G^2 + (1-\alpha_t\mu)\beta_{t-1}^2\,\mathbb{E}\|v_{t-1} - w^*\|^2 + \alpha_t^2\frac{1}{N}\nu_{\max}\sigma^2 + 2\beta_{t-1}(1+\beta_{t-1})(1-\alpha_t\mu)\,\mathbb{E}\left[\|v_t - w^*\|\cdot\|v_{t-1} - w^*\|\right].$

Recall that we require $\alpha_{t_0} \le 2\alpha_t$ for any $t - t_0 \le E-1$, $L\alpha_t \le \frac{1}{5}$, and $2\beta_{t-1}^2 + 2\alpha_t^2 \le 1/2$ in order for Lemmas 8 and 7 to hold, which can be checked from the definitions of $\alpha_t$ and $\beta_t$. We show next that $\mathbb{E}\|v_t - w^*\|^2 = O(\frac{\nu_{\max}^2\sigma^2/\mu}{tN} + \frac{E^2LG^2/\mu^2}{t^2})$ by induction. Assume we have shown $\mathbb{E}\|v_t - w^*\|^2 \le b(C\alpha_t^2 + D\alpha_t)$ for all iterations up to $t$, where $C = 20E^2LG^2$, $D = \frac{1}{N}\nu_{\max}^2\sigma^2$, and $b$ is a constant to be chosen later. Recall that we choose the step sizes $\alpha_t = \frac{6}{\mu}\frac{1}{t+\gamma}$ and $\beta_{t-1} = \frac{3}{14(t+\gamma)(1-\frac{6}{t+\gamma})\max\{\mu,1\}}$ with $\gamma = \max\{32\kappa, E\}$, so that $\beta_{t-1} \le \alpha_t$ and

$(1-\mu\alpha_t)(1+14\beta_{t-1}) \le \left(1 - \frac{6}{t+\gamma}\right)\left(1 + \frac{3}{(t+\gamma)(1-\frac{6}{t+\gamma})}\right) = 1 - \frac{6}{t+\gamma} + \frac{3}{t+\gamma} = 1 - \frac{3}{t+\gamma} = 1 - \frac{\mu\alpha_t}{2}.$

Moreover, $\mathbb{E}\|v_{t-1} - w^*\|^2 \le b(C\alpha_{t-1}^2 + D\alpha_{t-1}) \le 4b(C\alpha_t^2 + D\alpha_t)$ with the chosen step sizes. Therefore the bound for $\mathbb{E}\|v_{t+1} - w^*\|^2$ can be further simplified using

$2\beta_{t-1}(1+\beta_{t-1})(1-\alpha_t\mu)\,\mathbb{E}\left[\|v_t - w^*\|\cdot\|v_{t-1} - w^*\|\right] \le 4\beta_{t-1}(1+\beta_{t-1})(1-\alpha_t\mu)\cdot b(C\alpha_t^2 + D\alpha_t)$

and

$(1-\alpha_t\mu)\beta_{t-1}^2\,\mathbb{E}\|v_{t-1} - w^*\|^2 \le 4(1-\alpha_t\mu)\beta_{t-1}^2\cdot b(C\alpha_t^2 + D\alpha_t),$

so that

$\mathbb{E}\|v_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)\left((1+\beta_{t-1})^2 + 4\beta_{t-1}(1+\beta_{t-1}) + 4\beta_{t-1}^2\right)\cdot b(C\alpha_t^2 + D\alpha_t) + 20E^2L\alpha_t^3G^2 + \alpha_t^2\frac{1}{N}\nu_{\max}\sigma^2$
$\le (1-\mu\alpha_t)(1+14\beta_{t-1})\cdot b(C\alpha_t^2 + D\alpha_t) + 20E^2L\alpha_t^3G^2 + \alpha_t^2\frac{1}{N}\nu_{\max}\sigma^2$
$\le b\left(1 - \frac{\mu\alpha_t}{2}\right)(C\alpha_t^2 + D\alpha_t) + C\alpha_t^3 + D\alpha_t^2 = \left(b\left(1-\frac{\mu\alpha_t}{2}\right)+\alpha_t\right)\alpha_t^2C + \left(b\left(1-\frac{\mu\alpha_t}{2}\right)+\alpha_t\right)\alpha_tD,$

and it remains to choose $b$ such that

$\left(b\left(1-\frac{\mu\alpha_t}{2}\right)+\alpha_t\right)\alpha_t \le b\alpha_{t+1}, \qquad \left(b\left(1-\frac{\mu\alpha_t}{2}\right)+\alpha_t\right)\alpha_t^2 \le b\alpha_{t+1}^2,$

from which we can conclude $\mathbb{E}\|v_{t+1} - w^*\|^2 \le b(\alpha_{t+1}^2C + \alpha_{t+1}D)$.
With $b = \frac{6}{\mu}$, we have

$\left(b\left(1-\frac{\mu\alpha_t}{2}\right)+\alpha_t\right)\alpha_t = \left(b\frac{t+\gamma-3}{t+\gamma} + \frac{6}{\mu(t+\gamma)}\right)\frac{6}{\mu(t+\gamma)} = b\frac{t+\gamma-2}{t+\gamma}\cdot\frac{6}{\mu(t+\gamma)} \le b\frac{t+\gamma-1}{t+\gamma}\cdot\frac{6}{\mu(t+\gamma)} \le b\frac{6}{\mu(t+\gamma+1)} = b\alpha_{t+1},$

where we have used $\frac{t+\gamma-1}{(t+\gamma)^2} \le \frac{1}{t+\gamma+1}$. Similarly,

$\left(b\left(1-\frac{\mu\alpha_t}{2}\right)+\alpha_t\right)\alpha_t^2 = b\frac{t+\gamma-2}{t+\gamma}\left(\frac{6}{\mu(t+\gamma)}\right)^2 \le b\frac{36}{\mu^2(t+\gamma+1)^2} = b\alpha_{t+1}^2,$

where we have used $\frac{t+\gamma-2}{(t+\gamma)^3} \le \frac{1}{(t+\gamma+1)^2}$. Finally, to ensure $\|v_0 - w^*\|^2 \le b(C\alpha_0^2 + D\alpha_0)$, we can rescale $b$ by $c\|v_0 - w^*\|^2$ for some constant $c$. It follows that $\mathbb{E}\|v_t - w^*\|^2 \le b(C\alpha_t^2 + D\alpha_t)$ for all $t\ge 0$. Using the L-smoothness of $F$,

$\mathbb{E} F(v_T) - F^* = \mathbb{E}[F(v_T) - F(w^*)] \le \frac{L}{2}\mathbb{E}\|v_T - w^*\|^2 \le \frac{L}{2}c\|v_0 - w^*\|^2\frac{6}{\mu}(D\alpha_T + C\alpha_T^2) = 3c\|v_0 - w^*\|^2\kappa(D\alpha_T + C\alpha_T^2)$
$\le 3c\|v_0 - w^*\|^2\kappa\left(\frac{6}{\mu(T+\gamma)}\cdot\frac{1}{N}\nu_{\max}\sigma^2 + 20E^2LG^2\left(\frac{6}{\mu(T+\gamma)}\right)^2\right) = O\left(\frac{\kappa}{\mu}\frac{\nu_{\max}\sigma^2}{N}\cdot\frac{1}{T} + \frac{\kappa^2}{\mu}E^2G^2\cdot\frac{1}{T^2}\right).$

With partial participation, the same argument with an added term for the sampling error yields

$\mathbb{E} F(w_T) - F^* = O\left(\frac{\kappa\nu_{\max}\sigma^2/\mu}{NT} + \frac{\kappa E^2G^2/\mu}{KT} + \frac{\kappa^2E^2G^2/\mu}{T^2}\right).$

F.1.1 DEFERRED PROOFS OF KEY LEMMAS

Proof of Lemma 8. The proof of the bound on $\mathbb{E}\sum_{k=1}^N p_k\|w_t - w_t^k\|^2$ for Nesterov accelerated FedAvg follows a similar logic to Lemma 5, but requires extra reasoning. Since communication is done every $E$ steps, for any $t\ge 0$ we can find a $t_0\le t$ such that $t-t_0\le E-1$ and $w_{t_0}^k = w_{t_0}$ for all $k$.
Moreover, using that $\alpha_t$ is non-increasing, $\alpha_{t_0}\le 2\alpha_t$, and $\beta_t\le\alpha_t$ for any $t-t_0\le E-1$, we have

$\mathbb{E}\sum_{k=1}^N p_k\|w_t - w_t^k\|^2 = \mathbb{E}\sum_{k=1}^N p_k\|w_t^k - w_{t_0} - (w_t - w_{t_0})\|^2 \le \mathbb{E}\sum_{k=1}^N p_k\|w_t^k - w_{t_0}\|^2 = \mathbb{E}\sum_{k=1}^N p_k\|w_t^k - w_{t_0}^k\|^2$
$= \mathbb{E}\sum_{k=1}^N p_k\left\|\sum_{i=t_0}^{t-1}\beta_i(v_{i+1}^k - v_i^k) - \sum_{i=t_0}^{t-1}\alpha_ig_{i,k}\right\|^2$
$\le 2\sum_{k=1}^N p_k\,\mathbb{E}\sum_{i=t_0}^{t-1}(E-1)\alpha_i^2\|g_{i,k}\|^2 + 2\sum_{k=1}^N p_k\,\mathbb{E}\sum_{i=t_0}^{t-1}(E-1)\beta_i^2\|v_{i+1}^k - v_i^k\|^2$
$\le 2\sum_{k=1}^N p_k\,\mathbb{E}\sum_{i=t_0}^{t-1}(E-1)\alpha_i^2\left(\|g_{i,k}\|^2 + \|v_{i+1}^k - v_i^k\|^2\right) \le 4\sum_{k=1}^N p_k\,\mathbb{E}\sum_{i=t_0}^{t-1}(E-1)\alpha_i^2G^2 \le 4(E-1)^2\alpha_{t_0}^2G^2 \le 16(E-1)^2\alpha_t^2G^2,$

where we have used $\mathbb{E}\|v_{i+1}^k - v_i^k\|^2 \le G^2$. To see this bound for appropriate $\alpha_t$, $\beta_t$, note the recursion

$v_{t+1}^k - v_t^k = w_t^k - w_{t-1}^k - (\alpha_tg_{t,k} - \alpha_{t-1}g_{t-1,k}), \qquad w_{t+1}^k - w_t^k = -\alpha_tg_{t,k} + \beta_t(v_{t+1}^k - v_t^k),$

so that

$v_{t+1}^k - v_t^k = -\alpha_{t-1}g_{t-1,k} + \beta_{t-1}(v_t^k - v_{t-1}^k) - (\alpha_tg_{t,k} - \alpha_{t-1}g_{t-1,k}) = \beta_{t-1}(v_t^k - v_{t-1}^k) - \alpha_tg_{t,k}.$

Since the identity $v_{t+1}^k - v_t^k = \beta_{t-1}(v_t^k - v_{t-1}^k) - \alpha_tg_{t,k}$ implies

$\mathbb{E}\|v_{t+1}^k - v_t^k\|^2 \le 2\beta_{t-1}^2\,\mathbb{E}\|v_t^k - v_{t-1}^k\|^2 + 2\alpha_t^2G^2,$

as long as $\alpha_t$, $\beta_{t-1}$ satisfy $2\beta_{t-1}^2 + 2\alpha_t^2 \le 1/2$, we can guarantee $\mathbb{E}\|v_t^k - v_{t-1}^k\|^2 \le G^2$ for all $k$ by induction. Together with Jensen's inequality this also gives $\mathbb{E}\|v_t - v_{t-1}\|^2 \le G^2$ for all $t$.

Now we are ready to prove the one-step progress result for Nesterov accelerated FedAvg. The first part of the proof is identical to that of the FedAvg case, while the main recursion takes a different form.

Proof of Lemma 7. We again have $\|v_{t+1} - w^*\|^2 = \|(w_t - \alpha_tg_t) - w^*\|^2$, and using exactly the same derivation as in the FedAvg case, we can obtain the following bound (same as Eq (11) in the proof of Lemma 3):

$\mathbb{E}\|v_{t+1} - w^*\|^2 \le (1-\mu\alpha_t)\mathbb{E}\|w_t - w^*\|^2 + \alpha_tL\sum_{k=1}^N p_k\,\mathbb{E}\|w_t - w_t^k\|^2 + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2 + \alpha_t^2L^2\sum_{k=1}^N p_k\,\mathbb{E}\|w_t - w_t^k\|^2 + \alpha_t^3L\,\mathbb{E}\|g_t\|^2 - \alpha_t^2\|\nabla F(w_t)\|^2.$

Different from the FedAvg case, we no longer have $w_t = v_t$.
Instead,
$$\|\overline{w}_t-w^*\|^2 = \|\overline{v}_t+\beta_{t-1}(\overline{v}_t-\overline{v}_{t-1})-w^*\|^2 = \|(1+\beta_{t-1})(\overline{v}_t-w^*)-\beta_{t-1}(\overline{v}_{t-1}-w^*)\|^2$$
$$= (1+\beta_{t-1})^2\|\overline{v}_t-w^*\|^2 - 2\beta_{t-1}(1+\beta_{t-1})\langle\overline{v}_t-w^*,\overline{v}_{t-1}-w^*\rangle + \beta_{t-1}^2\|\overline{v}_{t-1}-w^*\|^2$$
$$\le (1+\beta_{t-1})^2\|\overline{v}_t-w^*\|^2 + 2\beta_{t-1}(1+\beta_{t-1})\|\overline{v}_t-w^*\|\cdot\|\overline{v}_{t-1}-w^*\| + \beta_{t-1}^2\|\overline{v}_{t-1}-w^*\|^2,$$
which gives a recursion involving both $\overline{v}_t$ and $\overline{v}_{t-1}$:
$$\|\overline{v}_{t+1}-w^*\|^2 \le (1-\alpha_t\mu)(1+\beta_{t-1})^2\|\overline{v}_t-w^*\|^2 + 2(1-\alpha_t\mu)\beta_{t-1}(1+\beta_{t-1})\|\overline{v}_t-w^*\|\cdot\|\overline{v}_{t-1}-w^*\| + \alpha_t^2\sum_{k=1}^N p_k^2\sigma_k^2$$
$$+ \beta_{t-1}^2(1-\alpha_t\mu)\|\overline{v}_{t-1}-w^*\|^2 + \alpha_t L\sum_{k=1}^N p_k\|\overline{w}_t-w_t^k\|^2 + \alpha_t^2L^2\sum_k p_k\|\overline{w}_t-w_t^k\|^2 + \alpha_t^3LG^2,$$
and we will use this recursive relation to obtain the desired bound. We can check that our choices of $\alpha_t$ and $\beta_t$ satisfy: $\alpha_t$ is non-increasing, $\alpha_t\le 2\alpha_{t+E}$, and $2\beta_{t-1}^2+2\alpha_t^2\le 1/2$ for all $t\ge 0$, so that we can apply the bound from Lemma 8 on $\mathbb{E}\sum_{k=1}^N p_k\|\overline{w}_t-w_t^k\|^2$ to conclude that, with $\nu_{\max} := N\cdot\max_k p_k$,
$$\mathbb{E}\|\overline{v}_{t+1}-w^*\|^2 \le \mathbb{E}(1-\mu\alpha_t)(1+\beta_{t-1})^2\|\overline{v}_t-w^*\|^2 + 16E^2L\alpha_t^3G^2 + 16E^2L^2\alpha_t^4G^2 + \alpha_t^3LG^2 + (1-\alpha_t\mu)\beta_{t-1}^2\|\overline{v}_{t-1}-w^*\|^2$$
$$+ \alpha_t^2\frac{\nu_{\max}\sigma^2}{N} + 2\beta_{t-1}(1+\beta_{t-1})(1-\alpha_t\mu)\|\overline{v}_t-w^*\|\cdot\|\overline{v}_{t-1}-w^*\|$$
$$\le \mathbb{E}(1-\mu\alpha_t)(1+\beta_{t-1})^2\|\overline{v}_t-w^*\|^2 + 20E^2L\alpha_t^3G^2 + (1-\alpha_t\mu)\beta_{t-1}^2\|\overline{v}_{t-1}-w^*\|^2 + \alpha_t^2\frac{\nu_{\max}\sigma^2}{N} + 2\beta_{t-1}(1+\beta_{t-1})(1-\alpha_t\mu)\|\overline{v}_t-w^*\|\cdot\|\overline{v}_{t-1}-w^*\|,$$
where we have used $\sigma^2=\sum_k p_k\sigma_k^2$, and by construction our $\alpha_t$ satisfies $L\alpha_t\le\frac{1}{5}$.

F.2 CONVEX SMOOTH OBJECTIVES

In this section we prove the convergence result for Nesterov accelerated FedAvg with convex and smooth objectives. Unlike the FedAvg algorithm, where the convex and strongly convex results share identical components, the proof for the convex setting of Nesterov FedAvg uses a change of variables, although the general ideas are in the same vein: we prove a one-step progress bound for $\mathbb{E}\|\overline{w}_{t+1}-w^*\|^2+\eta_t(F(\overline{w}_t)-F(w^*))$, which is then used to form a telescoping sum that gives an upper bound on $\min_{t\le T}F(\overline{w}_t)-F(w^*)$.

Lemma 9 (One-step progress, convex case, Nesterov). Let $\overline{w}_t=\sum_{k=1}^N p_kw_t^k$ in Nesterov accelerated FedAvg, and define $\eta_t=\frac{\alpha_t}{1-\beta_t}$. Under Assumptions 1, 3, 4, the following bound holds for all $t$:
$$\mathbb{E}\|\overline{w}_{t+1}-w^*\|^2+\eta_t(F(\overline{w}_t)-F(w^*)) \le \mathbb{E}\|\overline{w}_t-w^*\|^2 + 32LE^2\alpha_t^2\eta_tG^2 + \eta_t^2\frac{\nu_{\max}\sigma^2}{N} + \frac{2\eta_t\beta_t^2}{1-\beta_t}G^2.$$

Theorem 4. Set learning rates $\alpha_t=\beta_t=O(\sqrt{N/T})$. Then under Assumptions 1, 3, 4, Nesterov accelerated FedAvg with full device participation has rate
$$\min_{t\le T}F(\overline{w}_t)-F^* = O\Big(\frac{\nu_{\max}\sigma^2}{\sqrt{NT}}+\frac{NE^2LG^2}{T}\Big),$$
and with partial device participation with $K$ sampled devices at each communication round,
$$\min_{t\le T}F(\overline{w}_t)-F^* = O\Big(\frac{\nu_{\max}\sigma^2}{\sqrt{KT}}+\frac{E^2G^2}{\sqrt{KT}}+\frac{KE^2LG^2}{T}\Big).$$

Proof.
Applying the bound from Lemma 9 with $\eta_t=\frac{\alpha_t}{1-\beta_t}$, we have
$$\mathbb{E}\|\overline{w}_{t+1}-w^*\|^2+\eta_t(F(\overline{w}_t)-F(w^*)) \le \mathbb{E}\|\overline{w}_t-w^*\|^2 + 32LE^2\alpha_t^2\eta_tG^2 + \eta_t^2\frac{\nu_{\max}\sigma^2}{N} + \frac{2\eta_t\beta_t^2}{1-\beta_t}G^2.$$
Summing the inequalities from $t=0$ to $t=T$, we obtain
$$\sum_{t=0}^T\eta_t(F(\overline{w}_t)-F(w^*)) \le \|\overline{w}_0-w^*\|^2 + \sum_{t=0}^T\eta_t^2\cdot\frac{\nu_{\max}\sigma^2}{N} + \sum_{t=0}^T\eta_t\alpha_t^2\cdot 32LE^2G^2 + \sum_{t=0}^T\frac{2\eta_t\beta_t^2}{1-\beta_t}G^2,$$
so that
$$\min_{t\le T}F(\overline{w}_t)-F(w^*) \le \frac{1}{\sum_{t=0}^T\eta_t}\Big(\|\overline{w}_0-w^*\|^2 + \sum_{t=0}^T\eta_t^2\cdot\frac{\nu_{\max}\sigma^2}{N} + \sum_{t=0}^T\eta_t\alpha_t^2\cdot 32LE^2G^2 + \sum_{t=0}^T\frac{2\eta_t\beta_t^2}{1-\beta_t}G^2\Big).$$
By setting the constant learning rates $\alpha_t\equiv\sqrt{N/T}$ and $\beta_t\equiv c\sqrt{N/T}$, so that $\eta_t=\frac{\alpha_t}{1-\beta_t}=\frac{\sqrt{N/T}}{1-c\sqrt{N/T}}\le 2\sqrt{N/T}$, we have
$$\min_{t\le T}F(\overline{w}_t)-F(w^*) \le \frac{1}{2\sqrt{NT}}\|\overline{w}_0-w^*\|^2 + \frac{2}{\sqrt{NT}}\cdot T\cdot\frac{N}{T}\cdot\frac{\nu_{\max}\sigma^2}{N} + \frac{1}{\sqrt{NT}}\cdot T\Big(\sqrt{\tfrac{N}{T}}\Big)^3\cdot 32LE^2G^2 + \frac{2}{\sqrt{NT}}\cdot T\Big(\sqrt{\tfrac{N}{T}}\Big)^3G^2$$
$$= \Big(\frac{1}{2}\|\overline{w}_0-w^*\|^2+2\nu_{\max}\sigma^2\Big)\frac{1}{\sqrt{NT}} + \frac{N}{T}\big(32LE^2G^2+2G^2\big) = O\Big(\frac{\nu_{\max}\sigma^2}{\sqrt{NT}}+\frac{NE^2LG^2}{T}\Big).$$
Similarly, for partial participation we have
$$\min_{t\le T}F(\overline{w}_t)-F(w^*) \le \frac{1}{\sum_{t=0}^T\alpha_t}\Big(\|\overline{w}_0-w^*\|^2 + \sum_{t=0}^T\alpha_t^2\cdot\Big(\frac{\nu_{\max}\sigma^2}{N}+C\Big) + \sum_{t=0}^T\alpha_t^3\cdot 6E^2LG^2\Big),$$
where $C=\frac{4}{K}E^2G^2$ or $\frac{N-K}{N-1}\frac{4}{K}E^2G^2$, so that with $\alpha_t\equiv\sqrt{K/T}$ and $\beta_t\equiv c\sqrt{K/T}$ we have
$$\min_{t\le T}F(\overline{w}_t)-F(w^*) = O\Big(\frac{\nu_{\max}\sigma^2}{\sqrt{KT}}+\frac{E^2G^2}{\sqrt{KT}}+\frac{KE^2LG^2}{T}\Big).$$

F.2.1 DEFERRED PROOFS OF KEY LEMMAS

Proof of Lemma 9. Define $p_t:=\frac{\beta_t}{1-\beta_t}[\overline{w}_t-\overline{w}_{t-1}+\alpha_t\overline{g}_{t-1}]=\frac{\beta_t^2}{1-\beta_t}(\overline{v}_t-\overline{v}_{t-1})$ for $t\ge 1$ and $p_0:=0$. We can check that
$$\overline{w}_{t+1}+p_{t+1}=\overline{w}_t+p_t-\frac{\alpha_t}{1-\beta_t}g_t.$$
Now define $z_t:=\overline{w}_t+p_t$ and $\eta_t:=\frac{\alpha_t}{1-\beta_t}$ for all $t$, so that we have the recursive relation $z_{t+1}=z_t-\eta_tg_t$. Now
$$\|z_{t+1}-w^*\|^2 = \|(z_t-\eta_tg_t)-w^*\|^2 = \|(z_t-\eta_t\overline{g}_t-w^*)-\eta_t(g_t-\overline{g}_t)\|^2 = A_1+A_2+A_3,$$
where
$$A_1=\|z_t-w^*-\eta_t\overline{g}_t\|^2,\qquad A_2=-2\eta_t\langle z_t-w^*-\eta_t\overline{g}_t,\,g_t-\overline{g}_t\rangle,\qquad A_3=\eta_t^2\|g_t-\overline{g}_t\|^2,$$
where again $\mathbb{E}A_2=0$ and $\mathbb{E}A_3\le\eta_t^2\sum_kp_k^2\sigma_k^2$.
For $A_1$ we have
$$\|z_t-w^*-\eta_t\overline{g}_t\|^2 = \|z_t-w^*\|^2 + 2\langle z_t-w^*,-\eta_t\overline{g}_t\rangle + \eta_t^2\|\overline{g}_t\|^2.$$
Using the convexity and $L$-smoothness of $F_k$,
$$-2\eta_t\langle z_t-w^*,\overline{g}_t\rangle = -2\eta_t\sum_{k=1}^N p_k\langle z_t-w^*,\nabla F_k(w_t^k)\rangle = -2\eta_t\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle - 2\eta_t\sum_{k=1}^N p_k\langle\overline{w}_t-w_t^k,\nabla F_k(w_t^k)\rangle - 2\eta_t\sum_{k=1}^N p_k\langle w_t^k-w^*,\nabla F_k(w_t^k)\rangle$$
$$\le -2\eta_t\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle - 2\eta_t\sum_{k=1}^N p_k\langle\overline{w}_t-w_t^k,\nabla F_k(w_t^k)\rangle + 2\eta_t\sum_{k=1}^N p_k(F_k(w^*)-F_k(w_t^k))$$
$$\le 2\eta_t\sum_{k=1}^N p_k\Big(F_k(w_t^k)-F_k(\overline{w}_t)+\frac{L}{2}\|\overline{w}_t-w_t^k\|^2+F_k(w^*)-F_k(w_t^k)\Big) - 2\eta_t\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle$$
$$= \eta_tL\sum_{k=1}^N p_k\|\overline{w}_t-w_t^k\|^2 + 2\eta_t\sum_{k=1}^N p_k[F_k(w^*)-F_k(\overline{w}_t)] - 2\eta_t\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle,$$
which results in
$$\mathbb{E}\|z_{t+1}-w^*\|^2 \le \mathbb{E}\|z_t-w^*\|^2 + \eta_tL\sum_{k=1}^N p_k\|\overline{w}_t-w_t^k\|^2 + 2\eta_t\sum_{k=1}^N p_k[F_k(w^*)-F_k(\overline{w}_t)] + \eta_t^2\|\overline{g}_t\|^2 + \eta_t^2\sum_{k=1}^N p_k^2\sigma_k^2 - 2\eta_t\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle.$$
As before, $\|\overline{g}_t\|^2\le 2L^2\sum_kp_k\|w_t^k-\overline{w}_t\|^2+4L(F(\overline{w}_t)-F(w^*))$, so that
$$\eta_t^2\|\overline{g}_t\|^2 + \eta_t\sum_{k=1}^N p_k[F_k(w^*)-F_k(\overline{w}_t)] \le 2L^2\eta_t^2\sum_kp_k\|w_t^k-\overline{w}_t\|^2 + \eta_t(1-4\eta_tL)(F(w^*)-F(\overline{w}_t)) \le 2L^2\eta_t^2\sum_kp_k\|w_t^k-\overline{w}_t\|^2$$
for $\eta_t\le\frac{1}{4L}$. Using $\sum_{k=1}^N p_k\|\overline{w}_t-w_t^k\|^2\le 16E^2\alpha_t^2G^2$ and $\sum_{k=1}^N p_k^2\sigma_k^2\le\frac{\nu_{\max}\sigma^2}{N}$, it follows that
$$\mathbb{E}\|\overline{w}_{t+1}-w^*\|^2+\eta_t(F(\overline{w}_t)-F(w^*)) \le \mathbb{E}\|\overline{w}_t-w^*\|^2 + (\eta_tL+2L^2\eta_t^2)\sum_{k=1}^N p_k\|\overline{w}_t-w_t^k\|^2 + \eta_t^2\sum_{k=1}^N p_k^2\sigma_k^2 - 2\eta_t\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle$$
$$\le \mathbb{E}\|\overline{w}_t-w^*\|^2 + 32LE^2\alpha_t^2\eta_tG^2 + \eta_t^2\frac{\nu_{\max}\sigma^2}{N} - 2\eta_t\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle$$
if $\eta_t\le\frac{1}{2L}$. It remains to bound $\mathbb{E}\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle$. Recall that $z_t-\overline{w}_t=\frac{\beta_t}{1-\beta_t}[\overline{w}_t-\overline{w}_{t-1}+\alpha_t\overline{g}_{t-1}]=\frac{\beta_t^2}{1-\beta_t}(\overline{v}_t-\overline{v}_{t-1})$, and that $\mathbb{E}\|\overline{v}_t-\overline{v}_{t-1}\|^2\le G^2$ and $\mathbb{E}\|\nabla F_k(w_t^k)\|^2\le G^2$. Cauchy-Schwarz gives
$$\mathbb{E}\sum_{k=1}^N p_k\langle z_t-\overline{w}_t,\nabla F_k(w_t^k)\rangle \le \sum_{k=1}^N p_k\sqrt{\mathbb{E}\|z_t-\overline{w}_t\|^2}\cdot\sqrt{\mathbb{E}\|\nabla F_k(w_t^k)\|^2} \le \frac{\beta_t^2}{1-\beta_t}G^2.$$
Thus
$$\mathbb{E}\|\overline{w}_{t+1}-w^*\|^2+\eta_t(F(\overline{w}_t)-F(w^*)) \le \mathbb{E}\|\overline{w}_t-w^*\|^2 + 32LE^2\alpha_t^2\eta_tG^2 + \eta_t^2\frac{\nu_{\max}\sigma^2}{N} + \frac{2\eta_t\beta_t^2}{1-\beta_t}G^2.$$

G GEOMETRIC CONVERGENCE OF FEDAVG IN THE OVERPARAMETERIZED SETTING

Geometric convergence of SGD in the overparameterized setting was established in, e.g., Strohmer & Vershynin (2009). A natural question is whether such a result still holds in the federated learning setting. In this section, we provide the first geometric convergence rate of FedAvg for overparameterized strongly convex and smooth problems, and show that it preserves linear speedup at the same time. We then sharpen this result in the special case of linear regression. Inspired by recent advances in accelerating SGD (Liu et al. (2020); Jain et al. (2017)), we further propose a novel momentum-based FedAvg algorithm, which enjoys an improved convergence rate over FedAvg. Detailed proofs are deferred to Appendix Section H. In particular, we do not need Assumptions 3 and 4, and we use modified versions of Assumptions 1 and 2 detailed in this section.

G.1 GEOMETRIC CONVERGENCE OF FEDAVG IN THE OVERPARAMETERIZED SETTING

Recall the FL problem $\min_w\sum_{k=1}^N p_kF_k(w)$ with $F_k(w)=\frac{1}{n_k}\sum_{j=1}^{n_k}\ell(w;x_k^j)$. In this section we consider the standard empirical risk minimization (ERM) setting, where $\ell$ is non-negative, $l$-smooth, and convex, and, as before, each $F_k(w)$ is $L$-smooth and $\mu$-strongly convex. Note that $l\ge L$. This setup includes many important problems in practice. In the overparameterized setting, there exists $w^*\in\arg\min_w\sum_{k=1}^N p_kF_k(w)$ such that $\ell(w^*;x_k^j)=0$ for all $x_k^j$. We first show that FedAvg achieves geometric convergence with linear speedup in the number of workers.

Theorem 5. In the overparameterized setting, FedAvg with communication every $E$ iterations and constant step size $\alpha=O\big(\frac{1}{E}\cdot\frac{N}{l\nu_{\max}+L(N-\nu_{\min})}\big)$ has geometric convergence:
$$\mathbb{E}F(\overline{w}_T) \le \frac{L}{2}(1-\mu\alpha)^T\|\overline{w}_0-w^*\|^2 = O\Big(L\exp\Big(-\frac{\mu}{E}\cdot\frac{NT}{l\nu_{\max}+L(N-\nu_{\min})}\Big)\Big)\cdot\|\overline{w}_0-w^*\|^2.$$
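Theorem 5 can be illustrated with a small simulation. The sketch below (an assumed toy setup, not the paper's experiments) runs FedAvg with single-sample local SGD on an overparameterized linear regression problem where an interpolating $w^*$ exists; the global loss decays to essentially zero, consistent with geometric convergence.

```python
import numpy as np

# Toy illustration of geometric convergence of FedAvg under interpolation:
# overparameterized linear regression split across N devices, one-sample
# local SGD steps, and model averaging every E iterations.
rng = np.random.default_rng(0)
N, E, d, n_k = 4, 5, 50, 5                   # devices, local steps, dim, samples/device
w_star = rng.normal(size=d)                  # interpolating solution: zero training loss
data = []
for _ in range(N):
    X = rng.normal(size=(n_k, d))
    data.append((X, X @ w_star))             # labels consistent with w_star

def global_loss(w):
    return float(np.mean([0.5 * np.mean((X @ w - z) ** 2) for X, z in data]))

alpha = 0.01                                 # small constant step size
w_local = [np.zeros(d) for _ in range(N)]
for t in range(2000):
    for k, (X, z) in enumerate(data):
        j = rng.integers(n_k)                # single-sample stochastic gradient
        g = (w_local[k] @ X[j] - z[j]) * X[j]
        w_local[k] -= alpha * g
    if (t + 1) % E == 0:                     # communication round: average models
        w_avg = np.mean(w_local, axis=0)
        w_local = [w_avg.copy() for _ in range(N)]

final_loss = global_loss(np.mean(w_local, axis=0))
```

Because every sample is fit exactly at $w^*$, the stochastic gradient noise vanishes at the optimum (the automatic variance reduction effect), which is what allows a constant step size to give geometric decay here.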

Linear speedup and Communication Complexity

The linear speedup factor is on the order of $O(N/E)$ for $N \le O(l/L)$: that is, FedAvg with $N$ workers and communication every $E$ iterations provides a geometric convergence speedup factor of $O(N/E)$ in this regime. When $N$ is above this threshold, however, the speedup is almost constant in the number of workers. This matches the findings in Ma et al. (2018). Our result also shows that $E$ can be taken to be $O(T^\beta)$ for any $\beta < 1$ while still achieving geometric convergence, yielding better communication efficiency than the standard FL setting. We emphasize again that, compared to the single-server results in Ma et al. (2018), the difference in our result lies in the factor of $N$ in the speedup, which cannot be obtained by simply applying the single-server result to each device in our problem.

G.2 OVERPARAMETERIZED LINEAR REGRESSION PROBLEMS

We now turn to quadratic problems and show that the bound in Theorem 5 can be improved to $O(\exp(-\frac{NT}{E\kappa_1}))$ for a larger range of $N$. We then propose a variant of FedAvg that has provable acceleration over FedAvg with SGD updates. The local device objectives are now given by sums of squares, $F_k(w)=\frac{1}{2n_k}\sum_{j=1}^{n_k}(w^Tx_k^j-z_k^j)^2$, and there exists $w^*$ such that $F(w^*)\equiv 0$. Two notions of condition number are important in our results: $\kappa_1$, which is based on local Hessians, and $\tilde{\kappa}$, which is termed the statistical condition number (Liu & Belkin (2020); Jain et al. (2017)). For their detailed definitions, please refer to Appendix Section H. Here we use the fact $\tilde{\kappa}\le\kappa_1$. Recall $\nu_{\max}=N\cdot\max_kp_k$ and $\nu_{\min}=N\cdot\min_kp_k$.

Theorem 6. For the overparameterized linear regression problem, FedAvg with communication every $E$ iterations and constant step size $\alpha=O\big(\frac{1}{E}\cdot\frac{N}{l\nu_{\max}+\mu(N-\nu_{\min})}\big)$ has geometric convergence:
$$\mathbb{E}F(\overline{w}_T) \le O\Big(L\exp\Big(-\frac{NT}{E(\nu_{\max}\kappa_1+(N-\nu_{\min}))}\Big)\Big)\,\|\overline{w}_0-w^*\|^2.$$

When $N=O(\kappa_1)$, the convergence rate is $O((1-\frac{N}{E\kappa_1})^T)=O(\exp(-\frac{NT}{E\kappa_1}))$, which exhibits linear speedup in the number of workers, as well as a $1/\kappa_1$ dependence on the condition number $\kappa_1$. Inspired by Liu & Belkin (2020), we propose the MaSS accelerated FedAvg algorithm (FedMaSS):
$$w_{t+1}^k = \begin{cases} u_t^k-\eta_1g_{t,k} & \text{if } t+1\notin\mathcal{I}_E,\\ \sum_{k\in\mathcal{S}_{t+1}}\big(u_t^k-\eta_1g_{t,k}\big) & \text{if } t+1\in\mathcal{I}_E,\end{cases} \qquad u_{t+1}^k = w_{t+1}^k+\gamma(w_{t+1}^k-w_t^k)+\eta_2g_{t,k}.$$
When $\eta_2\equiv 0$, this algorithm reduces to the Nesterov accelerated FedAvg algorithm. In the next theorem, we demonstrate that FedMaSS improves the convergence rate to $O(\exp(-\frac{NT}{E\sqrt{\kappa_1\tilde{\kappa}}}))$. To our knowledge, this is the first acceleration result for FedAvg with momentum updates over SGD updates.

Theorem 7.
For the overparameterized linear regression problem, FedMaSS with communication every $E$ iterations and constant step sizes $\eta_1=O\big(\frac{1}{E}\cdot\frac{N}{l\nu_{\max}+\mu(N-\nu_{\min})}\big)$, $\eta_2=\eta_1\frac{1-1/\tilde{\kappa}}{1+1/\sqrt{\kappa_1\tilde{\kappa}}}$, and $\gamma=\frac{1-1/\sqrt{\kappa_1\tilde{\kappa}}}{1+1/\sqrt{\kappa_1\tilde{\kappa}}}$ has geometric convergence:
$$\mathbb{E}F(\overline{w}_T) \le O\Big(L\exp\Big(-\frac{NT}{E(\nu_{\max}\sqrt{\kappa_1\tilde{\kappa}}+(N-\nu_{\min}))}\Big)\Big)\,\|\overline{w}_0-w^*\|^2.$$
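The shape of the FedMaSS update is easy to see on a single device with exact gradients. The sketch below uses a toy quadratic and illustrative values of $\kappa_1$ and $\tilde{\kappa}$ (hypothetical parameters, not the tuned rates of Theorem 7); setting `eta2 = 0` recovers the Nesterov-style update.

```python
import numpy as np

# Single-device sketch of the FedMaSS update with exact gradients:
#   w_{t+1} = u_t - eta1 * g(u_t)
#   u_{t+1} = w_{t+1} + gamma * (w_{t+1} - w_t) + eta2 * g(u_t)
# on the toy quadratic F(w) = 0.5 * w^T H w (illustrative values only).
H = np.diag([1.0, 10.0])
grad = lambda w: H @ w
kappa1 = kappa_tilde = 10.0                  # hypothetical condition numbers
eta1 = 0.1
eta2 = eta1 * (1 - 1 / kappa_tilde) / (1 + 1 / np.sqrt(kappa1 * kappa_tilde))
gamma = (1 - 1 / np.sqrt(kappa1 * kappa_tilde)) / (1 + 1 / np.sqrt(kappa1 * kappa_tilde))

w = np.array([1.0, 1.0])
u = w.copy()
for _ in range(300):
    g = grad(u)
    w_next = u - eta1 * g
    u = w_next + gamma * (w_next - w) + eta2 * g  # compensation term eta2 * g
    w = w_next
final_value = float(0.5 * w @ H @ w)
```

The extra $\eta_2 g$ compensation term is what distinguishes this update from plain Nesterov momentum; on this toy instance the objective value decays geometrically toward zero.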

Speedup of FedMaSS over FedAvg

To better understand the significance of the above result, we briefly discuss related work on accelerating SGD. Nesterov and Heavy Ball updates are known to fail to accelerate over SGD in both the overparameterized and general convex settings (Liu & Belkin (2020); Kidambi et al. (2018); Liu et al. (2018); Yuan et al. (2016)). Thus, in general, one cannot hope to obtain acceleration results for the FedAvg algorithm with Nesterov or Heavy Ball updates. Fortunately, recent works on SGD (Jain et al. (2017); Liu & Belkin (2020)) introduced an additional compensation term to the Nesterov updates to address this non-acceleration issue. Perhaps surprisingly, we show that the same approach can effectively improve the rate of FedAvg. Comparing the convergence rates of FedMaSS (Theorem 7) and FedAvg (Theorem 6): when $N=O(\sqrt{\kappa_1\tilde{\kappa}})$, the convergence rate is $O((1-\frac{N}{E\sqrt{\kappa_1\tilde{\kappa}}})^T)=O(\exp(-\frac{NT}{E\sqrt{\kappa_1\tilde{\kappa}}}))$, as opposed to $O(\exp(-\frac{NT}{E\kappa_1}))$. Since $\kappa_1\ge\tilde{\kappa}$, this implies a speedup factor of $\sqrt{\kappa_1/\tilde{\kappa}}$ for FedMaSS.

H.1 GEOMETRIC CONVERGENCE OF FEDAVG IN THE OVERPARAMETERIZED SETTING

Theorem 5. In the overparameterized setting, FedAvg with communication every $E$ iterations and constant step size $\alpha=O\big(\frac{1}{E}\cdot\frac{N}{l\nu_{\max}+L(N-\nu_{\min})}\big)$ has geometric convergence:
$$\mathbb{E}F(\overline{w}_t) \le \frac{L}{2}(1-\mu\alpha)^t\|\overline{w}_0-w^*\|^2 = O\Big(\exp\Big(-\frac{\mu}{2E}\cdot\frac{N}{l\nu_{\max}+L(N-\nu_{\min})}\,t\Big)\cdot\|\overline{w}_0-w^*\|^2\Big).$$
Proof. To illustrate the main ideas of the proof, we first present the proof for $E=2$. Let $t-1$ be a communication round, so that $w_{t-1}^k=\overline{w}_{t-1}$. We show that $\|\overline{w}_{t+1}-w^*\|^2\le(1-\alpha_t\mu)(1-\alpha_{t-1}\mu)\|\overline{w}_{t-1}-w^*\|^2$ for appropriately chosen constant step sizes $\alpha_t,\alpha_{t-1}$.
We have w t+1 -w * 2 = (w t -α t g t ) -w * 2 = w t -w * 2 -2α t w t -w * , g t + α 2 t g t 2 and the cross term can be bounded as usual using µ-convexity and L-smoothness of F k : -2α t E t w t -w * , g t = -2α t N k=1 p k w t -w * , ∇F k (w k t ) = -2α t N k=1 p k w t -w k t , ∇F k (w k t ) -2α t N k=1 p k w k t -w * , ∇F k (w k t ) ≤ -2α t N k=1 p k w t -w k t , ∇F k (w k t ) + 2α t N k=1 p k (F k (w * ) -F k (w k t )) -α t µ N k=1 p k w k t -w * 2 ≤ 2α t N k=1 p k F k (w k t ) -F k (w t ) + L 2 w t -w k t 2 + F k (w * ) -F k (w k t ) -α t µ N k=1 p k (w k t -w * ) 2 = α t L N k=1 p k w t -w k t 2 + 2α t N k=1 p k [F k (w * ) -F k (w t )] -α t µ w t -w * 2 = α t L N k=1 p k w t -w k t 2 -2α t N k=1 p k F k (w t ) -α t µ w t -w * 2 and so E w t+1 -w * 2 ≤ E(1 -α t µ) w t -w * 2 -2α t F (w t ) + α 2 t g t 2 + α t L N k=1 p k w t -w k t 2 Applying this recursive relation to w t -w * 2 and using w t-1 -w k t-1 2 ≡ 0, we further obtain E w t+1 -w * 2 ≤ E(1 -α t µ) (1 -α t-1 µ) w t-1 -w * 2 -2α t-1 F (w t-1 ) + α 2 t-1 g t-1 2 -2α t F (w t ) + α 2 t g t 2 + α t L N k=1 p k w t -w k t 2 Now instead of bounding N k=1 p k w t -w k t 2 using the arguments in the general convex case, we follow Ma et al. (2018) and use the fact that in the overparameterized setting, w * is a minimizer of each (w, x j k ) and that each is l-smooth to obtain ∇F k (w t-1 , ξ k t-1 ) 2 ≤ 2l(F k (w t-1 , ξ k t-1 ) - F k (w * , ξ k t-1 )), where recall F k (w, ξ k t-1 ) = (w, ξ k t-1 ), so that N k=1 p k w t -w k t 2 = N k=1 p k w t-1 -α t-1 g t-1 -w k t-1 + α t-1 g t-1,k 2 = N k=1 p k α 2 t-1 g t-1 -g t-1,k 2 = α 2 t-1 N k=1 p k ( g t-1,k 2 -g t-1 2 ) = α 2 t-1 N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 -α 2 t-1 g t-1 2 ≤ α 2 t-1 N k=1 p k 2l(F k (w t-1 , ξ k t-1 ) -F k (w * , ξ k t-1 )) -α 2 t-1 g t-1 2 again using w t-1 = w k t-1 . 
Taking expectation with respect to ξ k t-1 's and using the fact that F (w * ) = 0, we have E t-1 N k=1 p k w t -w k t 2 ≤ 2lα 2 t-1 N k=1 p k F k (w t-1 ) -α 2 t-1 g t-1 2 = 2lα 2 t-1 F (w t-1 ) -α 2 t-1 g t-1 Note also that g t-1 2 = N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 while g t 2 = N k=1 p k ∇F k (w k t , ξ k t ) 2 ≤ 2 N k=1 p k ∇F k (w t , ξ k t ) 2 + 2 N k=1 p k (∇F k (w t , ξ k t ) -∇F k (w k t , ξ k t )) 2 ≤ 2 N k=1 p k ∇F k (w t , ξ k t ) 2 + 2 N k=1 p k l 2 w t -w k t 2 Substituting these into the bound for w t+1 -w * 2 , we have E w t+1 -w * 2 ≤ E(1 -α t µ)((1 -α t-1 µ) w t-1 -w * 2 -2α t-1 F (w t-1 ) + α 2 t-1 g t-1 2 ) -2α t F (w t ) + 2α 2 t N k=1 p k ∇F k (w t , ξ k t ) 2 + 2l 2 α 2 t-1 α 2 t + α t α 2 t-1 L 2lF (w t-1 ) -g t-1 2 = E(1 -α t µ)(1 -α t-1 µ) w t-1 -w * 2 -2α t (F (w t ) -α t N k=1 p k ∇F k (w t , ξ k t ) 2 ) -2α t-1 (1 -α t µ) (1 - lα t-1 (2l 2 α 2 t + α t L) 1 -α t µ )F (w t-1 ) - α t-1 2 N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 from which we can conclude that E w t+1 -w * 2 ≤ (1 -α t µ)(1 -α t-1 µ)E w t-1 -w * 2 if we can choose α t , α t-1 to guarantee E(F (w t ) -α t N k=1 p k ∇F k (w t , ξ k t ) 2 ) ≥ 0 E (1 - lα t-1 (2l 2 α 2 t + α t L) 1 -α t µ )F (w t-1 ) - α t-1 2 N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 ≥ 0 Note that E t N k=1 p k ∇F k (w t , ξ k t ) 2 = E t N k=1 p k ∇F k (w t , ξ k t ), N k=1 p k ∇F k (w t , ξ k t ) = N k=1 p 2 k E t ∇F k (w t , ξ k t ) 2 + N k=1 j =k p j p k E t ∇F k (w t , ξ k t ), ∇F j (w t , ξ j t ) = N k=1 p 2 k E t ∇F k (w t , ξ k t ) 2 + N k=1 j =k p j p k ∇F k (w t ), ∇F j (w t ) = N k=1 p 2 k E t ∇F k (w t , ξ k t ) 2 + N k=1 N j=1 p j p k ∇F k (w t ), ∇F j (w t ) - N k=1 p 2 k ∇F k (w t ) 2 ≤ N k=1 p 2 k E t ∇F k (w t , ξ k t ) 2 + k p k ∇F k (w t ) 2 - 1 N ν min k p k ∇F k (w t ) 2 = N k=1 p 2 k E t ∇F k (w t , ξ k t ) 2 + (1 - 1 N ν min ) ∇F (w t ) 2 and so following Ma et al. 
(2018) if we let α t = min{ qN 2lνmax , 1-q 2L(1-1 N νmin) } for a q ∈ [0, 1] to be optimized later, we have E t (F (w t ) -α t N k=1 p k ∇F k (w t , ξ k t ) 2 ) ≥ E t N k=1 p k F k (w t ) -α t N k=1 p 2 k E t ∇F k (w t , ξ k t ) 2 + (1 - 1 N ν min ) ∇F (w t ) 2 ≥ E t N k=1 p k (qF k (w t , ξ k t ) -α t 1 N ν max ∇F k (w t , ξ k t ) 2 ) + ((1 -q)F (w t ) -α t (1 - 1 N ν min ) ∇F (w t ) 2 ) ≥ qE t N k=1 p k (F k (w t , ξ k t ) - 1 2l ∇F k (w t , ξ k t ) 2 ) + (1 -q)(F (w t ) - 1 2L ∇F (w t ) 2 ) ≥ 0 again using w * optimizes F k (w, ξ k t ) with F k (w * , ξ k t ) = 0. Maximizing α t = min{ qN 2lνmax , 1-q 2L(1-1 N νmin) } over q ∈ [0, 1], we see that q = lνmax lνmax+L(N -νmin) results in the fastest convergence, and this translates to α t = 1 2 N lνmax+L(N -νmin) . Next we claim that α t-1 = c 1 2 N lνmax+L(N -νmin) also guarantees E(1 - lα t-1 (2l 2 α 2 t + α t L) 1 -α t µ )F (w t-1 ) - α t-1 2 N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 ≥ 0 Note that by scaling α t-1 by a constant c ≤ 1 if necessary, we can guarantee lαt-1(2l 2 α 2 t +αtL) 1-αtµ ≤ 1 2 , and so the condition is equivalent to F (w t-1 ) -α t-1 N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 ≥ 0 which was shown to hold with α t-1 ≤ 1 2 N lνmax+L(N -νmin) . For the proof of general E ≥ 2, we use the following two identities: g t 2 ≤ 2 N k=1 p k ∇F k (w t , ξ k t ) 2 + 2 N k=1 p k l 2 w t -w k t 2 E N k=1 p k w t -w k t 2 ≤ E2(1 + 2l 2 α 2 t-1 ) N k=1 p k w t-1 -w k t-1 2 + 8α 2 t-1 lF (w t-1 ) -2α 2 t-1 g t-1 2 where the first inequality has been established before. 
To establish the second inequality, note that N k=1 p k w t -w k t 2 = N k=1 p k w t-1 -α t-1 g t-1 -w k t-1 + α t-1 g t-1,k 2 ≤ 2 N k=1 p k w t-1 -w k t-1 2 + α t-1 g t-1 -α t-1 g t-1,k 2 and k p k g t-1,k -g t-1 2 = k p k ( g t-1,k 2 -g t-1 2 ) = k p k ∇F k (w t-1 , ξ k t-1 ) + ∇F k (w k t-1 , ξ k t-1 ) -∇F k (w t-1 , ξ k t-1 ) 2 -g t-1 2 ≤ 2 k p k ∇F k (w t-1 , ξ k t-1 ) 2 + l 2 w k t-1 -w t-1 2 -g t-1 2 so that using the l-smoothness of , E N k=1 p k w t -w k t 2 ≤ E2(1 + 2l 2 α 2 t-1 ) N k=1 p k w t-1 -w k t-1 2 + 4α 2 t-1 k p k ∇F k (w t-1 , ξ k t-1 ) 2 -2α 2 t-1 g t-1 ≤ E2(1 + 2l 2 α 2 t-1 ) N k=1 p k w t-1 -w k t-1 2 + 4α 2 t-1 2l k p k (F k (w t-1 , ξ k t-1 ) -F k (w * , ξ k t-1 )) -2α 2 t-1 g t-1 2 = E2(1 + 2l 2 α 2 t-1 ) N k=1 p k w t-1 -w k t-1 2 + 8α 2 t-1 lF (w t-1 ) -2α 2 t-1 g t-1 2 Using the first inequality, we have E w t+1 -w * 2 ≤ E(1 -α t µ) w t -w * 2 -2α t F (w t ) + 2α 2 t N k=1 p k ∇F k (w t , ξ k t ) 2 + (2α 2 t l 2 + α t L) N k=1 p k w t -w k t 2 and we choose α t and α t-1 such that E(F (w t ) -α t N k=1 p k ∇F k (w t , ξ k t ) 2 ) ≥ 0 and (2α 2 t l 2 + α t L) ≤ (1 -α t µ)(2α 2 t-1 l 2 + α t-1 L)/3. 
This gives E w t+1 -w * 2 ≤ E(1 -α t µ)[(1 -α t-1 µ) w t-1 -w * 2 -2α t-1 F (w t-1 ) + 2α 2 t-1 N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 + (2α 2 t-1 l 2 + α t-1 L)( N k=1 p k w t-1 -w k t-1 2 + N k=1 p k w t -w k t 2 )/3] Using the second inequality N k=1 p k w t -w k t 2 ≤ E2(1 + 2l 2 α 2 t-1 ) N k=1 p k w t-1 -w k t-1 2 + 8α 2 t-1 lF (w t-1 ) -2α 2 t-1 g t-1 2 and that 2(1 + 2l 2 α 2 t-1 ) ≤ 3, 2α 2 t-1 l 2 + α t-1 L ≤ 1, we have E w t+1 -w * 2 ≤ E(1 -α t µ)[(1 -α t-1 µ) w t-1 -w * 2 -2α t-1 F (w t-1 ) + 2α 2 t-1 N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 + 8α 2 t-1 lF (w t-1 ) + (2α 2 t-1 l 2 + α t-1 L)(2 N k=1 p k w t-1 -w k t-1 2 )] and if α t-1 is chosen such that (F (w t-1 ) -4α t-1 lF (w t-1 )) -α t-1 N k=1 p k ∇F k (w t-1 , ξ k t-1 ) 2 ≥ 0 and (2α 2 t-1 l 2 + α t-1 L)(1 -α t-1 µ) ≤ (2α 2 t-2 l 2 + α t-2 L)/3 we again have E w t+1 -w * 2 ≤ E(1 -α t µ)(1 -α t-1 µ)[ w t-1 -w * 2 + (2α 2 t-2 l 2 + α t-2 L) • (2 N k=1 p k w t-1 -w k t-1 2 )/3] Applying the above derivation iteratively τ < E times, we have E w t+1 -w * 2 ≤ E(1 -α t µ) • • • (1 -α t-τ +1 µ)[(1 -α t-τ µ) w t-τ -w * 2 -2α t-τ F (w t-τ ) + 2α 2 t-τ N k=1 p k ∇F k (w t-τ , ξ k t-τ ) 2 + 8τ α 2 t-τ lF (w t-τ ) + (2α 2 t-τ l 2 + α t-τ L)((τ + 1) N k=1 p k w t-τ -w k t-τ 2 )] as long as the step sizes α t-τ are chosen such that the following inequalities hold (2α 2 t-τ l 2 + α t-τ L)(1 -α t-τ µ) ≤ (2α 2 t-τ -1 l 2 + α t-τ -1 L)/3 2(1 + 2l 2 α 2 t-τ ) ≤ 3 2α 2 t-τ l 2 + α t-τ L ≤ 1 (F (w t-τ ) -4τ α t-τ lF (w t-τ )) -α t-τ N k=1 p k ∇F k (w t-τ , ξ k t-τ ) 2 ≥ 0 We can check that setting α t-τ = c 1 τ +1 N lνmax+L(N -νmin) for some small constant c satisfies the requirements. 
Since communication is done every E iterations, w t0 = w k t0 for some t 0 > t -E , from which we can conclude that E w t -w * 2 ≤ ( t-t0-1 τ =1 (1 -µα t-τ )) w t0 -w * 2 ≤ (1 -c µ E N lν max + L(N -ν min ) ) t-t0 w t0 -w * 2 and applying this inequality to iterations between each communication round, E w t -w * 2 ≤ (1 -c µ E N lν max + L(N -ν min ) ) t w 0 -w * 2 = O(exp( µ E N lν max + L(N -ν min ) t)) w 0 -w * 2 With partial participation, we note that E w t+1 -w * 2 = E w t+1 -v t+1 + v t+1 -w * 2 = E w t+1 -v t+1 2 + E v t+1 -w * 2 = 1 K k p k E w k t+1 -w t+1 2 + E v t+1 -w * 2 and so the recursive identity becomes E w t+1 -w * 2 ≤ E(1 -α t µ) • • • (1 -α t-τ +1 µ)[(1 -α t-τ µ) w t-τ -w * 2 -2α t-τ F (w t-τ ) + 2α 2 t-τ N k=1 p k ∇F k (w t-τ , ξ k t-τ ) 2 + 8τ α 2 t-τ lF (w t-τ ) + (2α 2 t-τ l 2 + α t-τ L + 1 K )((τ + 1) N k=1 p k w t-τ -w k t-τ 2 )] which requires (2α 2 t-τ l 2 + α t-τ L + 1 K )(1 -α t-τ µ) ≤ (2α 2 t-τ -1 l 2 + α t-τ -1 L + 1 K )/3 2(1 + 2l 2 α 2 t-τ ) ≤ 3 2α 2 t-τ l 2 + α t-τ L + 1 K ≤ 1 (F (w t-τ ) -4τ α t-τ lF (w t-τ )) -α t-τ N k=1 p k ∇F k (w t-τ , ξ k t-τ ) 2 ≥ 0 to hold. Again setting α t-τ = c 1 τ +1 N lνmax+L(N -νmin) for a possibly different constant from before satisfies the requirements. Finally, using the L-smoothness of F , F (w T ) -F (w * ) ≤ L 2 E w T -w * 2 = O(L exp(- µ E N lν max + L(N -ν min ) T )) w 0 -w * 2
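The partial-participation argument handled above can also be simulated. The following sketch (an assumed toy setup: $K$ of $N$ devices sampled uniformly at each round, with the server average broadcast to all devices) shows that the loss still decays to essentially zero on an overparameterized regression problem.

```python
import numpy as np

# Toy check of geometric decay under partial participation: each round runs E
# local one-sample SGD steps on every device, then averages only K sampled
# device models and broadcasts the result to all devices.
rng = np.random.default_rng(1)
N, K, E, d, n_k = 6, 3, 5, 60, 4
w_star = rng.normal(size=d)                   # interpolating solution
data = []
for _ in range(N):
    X = rng.normal(size=(n_k, d))
    data.append((X, X @ w_star))

def global_loss(w):
    return float(np.mean([0.5 * np.mean((X @ w - z) ** 2) for X, z in data]))

alpha, w_server = 0.01, np.zeros(d)
for rnd in range(500):
    w_local = [w_server.copy() for _ in range(N)]
    for k, (X, z) in enumerate(data):
        for _ in range(E):                    # E local one-sample SGD steps
            j = rng.integers(n_k)
            g = (w_local[k] @ X[j] - z[j]) * X[j]
            w_local[k] -= alpha * g
    sampled = rng.choice(N, size=K, replace=False)
    w_server = np.mean([w_local[k] for k in sampled], axis=0)

loss_pp = global_loss(w_server)
```

Because every device's local step is non-expansive with respect to any common interpolating solution, averaging any sampled subset preserves that property, and the randomness of the sampling only changes the constant in the decay rate.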

H.2 GEOMETRIC CONVERGENCE OF FEDAVG FOR OVERPARAMETERIZED LINEAR REGRESSION

We first provide details on the quantities used in the proofs of the linear regression results in Section G. The local device objectives are now given by sums of squares, $F_k(w)=\frac{1}{2n_k}\sum_{j=1}^{n_k}(w^Tx_k^j-z_k^j)^2$, and there exists $w^*$ such that $F(w^*)\equiv 0$. Define the local Hessian matrix $H_k:=\frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j(x_k^j)^T$, and the stochastic Hessian matrix $\widetilde{H}_t^k:=\xi_t^k(\xi_t^k)^T$, where $\xi_t^k$ is the stochastic sample on the $k$th device at time $t$. Define $l$ to be the smallest positive number such that $\mathbb{E}\|\xi_t^k\|^2\xi_t^k(\xi_t^k)^T\preceq lH_k$ for all $k$; note that $l\le\max_{k,j}\|x_k^j\|^2$. Let $L$ and $\mu$ be upper and lower bounds, respectively, on the non-zero eigenvalues of $H_k$. Define $\kappa_1:=l/\mu$ and $\kappa:=L/\mu$. Following Liu & Belkin (2020); Jain et al. (2017), we define the statistical condition number $\tilde{\kappa}$ as the smallest positive real number such that $\mathbb{E}\sum_kp_k\widetilde{H}_t^kH^{-1}\widetilde{H}_t^k\preceq\tilde{\kappa}H$. The condition numbers $\kappa_1$ and $\tilde{\kappa}$ are important in the characterization of convergence rates for FedAvg algorithms. Note that $\kappa_1\ge\kappa$ and $\kappa_1\ge\tilde{\kappa}$. Let $H=\sum_kp_kH_k$. In general $H$ has zero eigenvalues. However, because the null space of $H$ and the range of $H$ are orthogonal, in our subsequent analysis it suffices to project $w_t-w^*$ onto the range of $H$, and thus we may restrict attention to the non-zero eigenvalues of $H$. A useful observation is that we can use $w^{*T}x_k^j-z_k^j\equiv 0$ to rewrite the local objectives as $F_k(w)=\frac{1}{2}\langle w-w^*,H_k(w-w^*)\rangle\equiv\frac{1}{2}\|w-w^*\|_{H_k}^2$:
$$F_k(w)=\frac{1}{2n_k}\sum_{j=1}^{n_k}\big(w^Tx_k^j-z_k^j-(w^{*T}x_k^j-z_k^j)\big)^2=\frac{1}{2n_k}\sum_{j=1}^{n_k}\big((w-w^*)^Tx_k^j\big)^2=\frac{1}{2}\langle w-w^*,H_k(w-w^*)\rangle,$$
so that $F(w)=\frac{1}{2}\|w-w^*\|_H^2$. Finally, note that $\mathbb{E}\widetilde{H}_t^k=\frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j(x_k^j)^T=H_k$, and
$$g_{t,k}=\nabla F_k(w_t^k,\xi_t^k)=\widetilde{H}_t^k(w_t^k-w^*),\qquad g_t=\sum_{k=1}^Np_k\widetilde{H}_t^k(w_t^k-w^*),\qquad \overline{g}_t=\sum_{k=1}^Np_kH_k(w_t^k-w^*).$$

Theorem 6.
For the overparamterized linear regression problem, FedAvg with communication every E iterations with constant step size α = O( 1 E N lνmax+µ(N -νmin) ) has geometric convergence: EF (w T ) ≤ O L exp(- N T E(ν max κ 1 + (N -ν min )) ) w 0 -w * 2 . Proof. We again show the result first when E = 2 and t -1 is a communication round. We have w t+1 -w * 2 = (w t -α t g t ) -w * 2 = w t -w * 2 -2α t w t -w * , g t + α 2 t g t 2 and -2α t E t w t -w * , g t = -2α t N k=1 p k w t -w * , ∇F k (w k t ) = -2α t N k=1 p k w t -w k t , ∇F k (w k t ) -2α t N k=1 p k w k t -w * , ∇F k (w k t ) = -2α t N k=1 p k w t -w k t , ∇F k (w k t ) -2α t N k=1 p k w k t -w * , H k (w k t -w * ) = -2α t N k=1 p k w t -w k t , ∇F k (w k t ) -4α t N k=1 p k F k (w k t ) ≤ 2α t N k=1 p k (F k (w k t ) -F k (w t ) + L 2 w t -w k t 2 ) -4α t N k=1 p k F k (w k t ) = α t L N k=1 p k w t -w k t 2 -2α t N k=1 p k F k (w t ) -2α t N k=1 p k F k (w k t ) = α t L N k=1 p k w t -w k t 2 -α t N k=1 p k (w t -w * ), H k (w t -w * ) -2α t N k=1 p k F k (w k t ) and g t 2 = N k=1 p k Hk t (w k t -w * ) 2 = N k=1 p k Hk t (w t -w * ) + N k=1 p k Hk t (w k t -w t ) 2 ≤ 2 N k=1 p k Hk t (w t -w * ) 2 + 2 N k=1 p k Hk t (w k t -w t ) 2 which gives E w t+1 -w * 2 ≤ E w t -w * 2 -α t N k=1 p k w t -w * , H k w t -w * + 2α 2 t N k=1 p k Hk t (w t -w * ) 2 + α t L N k=1 p k w t -w k t 2 + 2α 2 t N k=1 p k Hk t (w k t -w t ) 2 -2α t N k=1 p k F k (w k t ) following Ma et al. (2018) we first prove that E w t -w * 2 -α t N k=1 p k (w t -w * ), H k (w t -w * ) + 2α 2 t N k=1 p k Hk t (w t -w * ) 2 ≤ (1 - N 8(ν max κ 1 + (N -ν min )) )E w t -w * 2 with appropriately chosen α t . Compared to the rate O( µN lνmax+L(N -νmin) ) = O( N νmaxκ1+(N -νmin)κ ) for general strongly convex and smooth objectives, this is an improvement as linear speedup is now available for a larger range of N . 
We have E t N k=1 p k Hk t (w t -w * ) 2 = E t N k=1 p k Hk t (w t -w * ), N k=1 p k Hk t (w t -w * ) = N k=1 p 2 k E t Hk t (w t -w * ) 2 + N k=1 j =k p j p k E t Hk t (w t -w * ), Hj t (w t -w * ) = N k=1 p 2 k E t Hk t (w t -w * ) 2 + N k=1 j =k p j p k E t H k (w t -w * ), H j (w t -w * ) = N k=1 p 2 k E t Hk t (w t -w * ) 2 + N k=1 N j=1 p j p k E t H k (w t -w * ), H j (w t -w * ) - N k=1 p 2 k H k (w t -w * ) 2 = N k=1 p 2 k E t Hk t (w t -w * ) 2 + k p k H k (w t -w * ) 2 - N k=1 p 2 k H k (w t -w * ) 2 ≤ N k=1 p 2 k E t Hk t (w t -w * ) 2 + k p k H k (w t -w * ) 2 - 1 N ν min k p k H k (w t -w * ) 2 ≤ 1 N ν max N k=1 p k E t Hk t (w t -w * ) 2 + (1 - 1 N ν min ) k p k H k (w t -w * ) 2 ≤ 1 N ν max l N k=1 p k (w t -w * ), H k (w t -w * ) + (1 - 1 N ν min ) k p k H k (w t -w * ) 2 = 1 N ν max l (w t -w * ), H(w t -w * ) + (1 - 1 N ν min ) w t -w * , H 2 (w t -w * ) using Hk t ≤ l. Now we have E w t -w * 2 -α t N k=1 p k (w t -w * ), H k (w t -w * ) + 2α 2 t N k=1 p k Hk t (w t -w * ) 2 = w t -w * , (I -α t H + 2α 2 t ( ν max l N H + N -ν min N H 2 ))(w t -w * ) and it remains to bound the maximum eigenvalue of (I -α t H + 2α 2 t ( ν max l N H + N -ν min N H 2 )) and we bound this following Ma et al. (2018) . If we choose α t < N 2(νmaxl+(N -νmin)L) , then -α t H + 2α 2 t ( ν max l N H + N -ν min N H 2 ) ≺ 0 and the convergence rate is given by the maximum of 1-α t λ+2α 2 t ( νmaxl N λ+ N -νmin N λ 2 ) maximized over the non-zero eigenvalues λ of H. To select the step size α t that gives the smallest upper bound, we then minimize over α t , resulting in min αt< N 2(νmaxl+(N -ν min )L) max λ>0:∃v,Hv=λv 1 -α t λ + 2α 2 t ( ν max l N λ + N -ν min N λ 2 ) Since the objective is quadratic in λ, the maximum is achieved at either the largest eigenvalue λ max of H or the smallest non-zero eigenvalue λ min of H. When N ≤ 4νmaxl L-λmin + 4ν min , i.e. 
when N = O(l/λ min ) = O(κ 1 ), the optimal objective value is achieved at λ min and the optimal step size is given by α t = N 4(νmaxl+(N -νmin)λmin) . The optimal convergence rate (i.e. the optimal objective value) is equal to 1 -1 8 N λmin (νmaxl+(N -νmin)λmin) = 1 -1 8 N (νmaxκ1+(N -νmin)) . This implies that when N = O(κ 1 ), the optimal convergence rate has a linear speedup in N . When N is larger, this step size is no longer optimal, but we still have 1 -1 8 N (νmaxκ1+(N -νmin)) as an upper bound on the convergence rate. For general E, we have the recursive relation E w t+1 -w * 2 ≤ E(1 -c 1 8 N (ν max κ 1 + (N -ν min )) ) • • • (1 -c 1 8τ N (ν max κ 1 + (N -ν min )) )[ w t-τ -w * 2 -α t-τ w t-τ -w * , Hw t-τ -w * + 2α p k w t-τ -w k t-τ 2 )] as long as the step sizes are chosen α t-τ = c N 4τ (νmaxl+(N -νmin)λmin) such that the following inequalities hold (2α 2 t-τ l 2 + α t-τ L) ≤ (1 -α t-τ µ)(2α 2 t-τ -1 l 2 + α t-τ -1 L)/3 2(1 + 2l 2 α 2 t-τ ) ≤ 3 2α 2 t-τ l 2 + α t-τ L ≤ 1 and w t-τ -w * 2 -α t-τ w t-τ -w * , Hw t-τ -w * + 2α 2 t-τ N k=1 p k Hk t-τ (w t-τ -w * ) 2 + 4τ α 2 t-1 l w t-1 -w * , H(w t-1 -w * ) ≤ (1 -c N 8(τ + 1)(ν max κ 1 + (N -ν min )) )E w t-τ -w * 2 which gives E w t -w * 2 ≤ (1 -c 1 8E N (ν max κ 1 + (N -ν min )) ) t w 0 -w * 2 = O(exp(- 1 E N (ν max κ 1 + (N -ν min )) t)) w 0 -w * 2 and with partial participation, the same bound holds with a possibly different choice of c.

H.3 GEOMETRIC CONVERGENCE OF FEDMASS FOR OVERPARAMETERIZED LINEAR REGRESSION

Theorem 7. For the overparamterized linear regression problem, FedMaSS with communication every E iterations and constant step sizes η 1 = O( 1 E N lνmax+µ(N -νmin) ), η 2 = η 1 (1-1 κ ) 1+ 1 √ κ 1 κ , γ = 1-1 √ κ 1 κ 1+ 1 √ κ 1 κ has geometric convergence: EF (w T ) ≤ O L exp(- N T E(ν max √ κ 1 κ + (N -ν min )) ) w 0 -w * 2 . Proof. The proof is based on results in Liu & Belkin (2020) which originally proposed the MaSS algorithm. Note that the update can equivalently be written as v k t+1 = (1 -α k )v k t + α k u k t -δ k g t,k w k t+1 = u k t -η k g t,k if t + 1 / ∈ I E N k=1 p k u k t -η k g t,k if t + 1 ∈ I E u k t+1 = α k 1 + α k v k t+1 + 1 1 + α k w k t+1 where there is a bijection between the parameters 1-α k 1+α k = γ k , η k = η k 1 , η k -α k δ k 1+α k = η k 2 , and we further introduce an auxiliary parameter v k t , which is initialized at v k 0 . We also note that when δ k = η k α k , the update reduces to the Nesterov accelerated SGD. This version of the FedAvg algorithm with local MaSS updates is used for analyzing the geometric convergence. As before, define the virtual sequences w t = N k=1 p k w k t , v t = N k=1 p k v k t , u t = N k=1 p k u k t , and g t = N k=1 p k Eg t,k . We have Eg t = g t and w t+1 = u t -η t g t , v t+1 = (1-α k )v t +α k w t -δ k g t , and u t+1 = α k 1+α k v t+1 + 1 1+α k w t+1 . We first prove the theorem with E = 2 and t -1 being a communication round. We have Following the proof in Liu & Belkin (2020) , v t+1 -w * 2 H -1 = (1 -α)v t + EA ≤ E(1 -α) v t -w * 2 H -1 + α u t -w * 2 H -1 ≤ E(1 -α) v t -w * 2 H -1 + α µ u t -w * 2 using the convexity of the norm  p k (u t -u k t ) 2 H -1 ≤ (2δ 2 l 2 + δL) k p k u t -u k t 2 H -1 = (2δ 2 l 2 + δL) k p k α 1 + α v t + 1 1 + α w t -( α 1 + α v k t + 1 1 + α w k t ) 2 H -1 ≤ (2δ 2 l 2 + δL)(2( α 1 + α ) 2 δ 2 + 2( 1 1 + α ) 2 η 2 ) k p k Hk t-1 (u t-1 -w * ) 2 ≤ (2δ 2 l 2 + δL)(2( α 1 + α ) 2 δ 2 + 2( 1 1 + α ) 2 η 2 )l 2 (u t-1 -w * ) 2 



Their result applies to a larger class of non-convex objectives that satisfy the Polyak-Lojasiewicz condition.





Figure 1: The linear speedup of FedAvg in full participation, partial participation, and the linear speedup of Nesterov accelerated FedAvg, respectively.

Overparameterization is a prevalent machine learning setting in which the statistical model has many more parameters than the number of training samples, which ensures the existence of parameter choices with zero training loss (Allen-Zhu et al. (2018); Zhang et al. (2016)). Due to the automatic variance reduction property of overparameterized models, a line of recent works proved that SGD and accelerated methods achieve geometric convergence (Ma et al. (2018); Moulines & Bach (2011); Needell et al. (2014); Schmidt & Roux (2013); Strohmer & Vershynin (2009)).

Figure 2: The convergence of FedAvg w.r.t the number of local steps E.

Kun Yuan, Bicheng Ying, and Ali H Sayed. On the influence of momentum acceleration on online learning. The Journal of Machine Learning Research, 17(1):6602-6667, 2016.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. IJCAI, 2018.

A ADDITIONAL NOTATIONS AND BOUNDS FOR SAMPLING SCHEMES

In this section, we introduce additional notations that are used throughout the proofs. Following common practice, e.g. Stich (2019); Li et al. (2020b), we define two virtual sequences $\overline{v}_t =$

In either case, to achieve an $O(1/T)$ convergence rate, it requires $E = O(1)$ as well. A similar conclusion also holds for the general convex problem.

this implies a speedup factor of $\sqrt{\kappa_1/\kappa}$ for FedMaSS. On the other hand, the same linear speedup in the number of workers holds for $N$ in a smaller range of values.

Now we have proved
$$\mathbb{E}\|\overline{w}_{t+1}-w^*\|^2 \le \Big(1-\frac{1}{8}\cdot\frac{N}{\nu_{\max}\kappa_1+(N-\nu_{\min})}\Big)\mathbb{E}\|\overline{w}_t-w^*\|^2 + \alpha_t L\cdots$$
Next we bound the terms in the second line using a similar argument as in the general case. We have $l\langle \overline{w}_{t-1}-w^*, H(\overline{w}_{t-1}-w^*)\rangle\cdots$, and if $\alpha_t, \alpha_{t-1}$ satisfy the required conditions,
$$\Big(1-\frac{N}{8(\nu_{\max}\kappa_1+(N-\nu_{\min}))}\Big)\Big[\mathbb{E}\|\overline{w}_{t-1}-w^*\|^2 - \alpha_t\langle \overline{w}_{t-1}-w^*, H(\overline{w}_{t-1}-w^*)\rangle + 2\alpha^2\cdots\Big] \le \Big(1-\frac{cN}{16(\nu_{\max}l+(N-\nu_{\min})\lambda_{\min})}\Big)\mathbb{E}\|\overline{w}_{t-1}-w^*\|^2
$$

Expanding $\|\overline{v}_{t+1}-w^*\|^2_{H^{-1}} = \big\|(1-\alpha)\overline{v}_t+\alpha\overline{u}_t-\delta\sum_k p_k\tilde H^k_t(u^k_t-w^*)-w^*\big\|^2_{H^{-1}}$ produces, among others, the cross terms
$$-2\delta\Big\langle \sum_k p_k\tilde H^k_t(\overline{u}_t-w^*),\,(1-\alpha)\overline{v}_t+\alpha\overline{u}_t-w^*\Big\rangle_{H^{-1}} \quad\text{and}\quad -2\delta\Big\langle \sum_k p_k\tilde H^k_t(u^k_t-\overline{u}_t),\,(1-\alpha)\overline{v}_t+\alpha\overline{u}_t-w^*\Big\rangle_{H^{-1}},$$
where $\|\cdot\|_{H^{-1}}$ denotes the norm induced by $H^{-1}$ and $\mu$ is the smallest non-zero eigenvalue of $H$. These are bounded using $\mathbb{E}\sum_k p_k \tilde H^k_t H^{-1}\tilde H^k_t \preceq \kappa H$ (by the definition of $\kappa$ and the operator convexity of the mapping $W \mapsto W H^{-1} W$), the identity $(1-\alpha)\overline{v}_t+\alpha\overline{u}_t = (1-\alpha)((1+\alpha)\overline{u}_t-\overline{w}_t)/\alpha+\alpha\overline{u}_t$, and the identity $-2\langle a,b\rangle = \|a\|^2+\|b\|^2-\|a+b\|^2$. Finally, to deal with the terms $2\delta^2\sum_k p_k\|\tilde H^k_t(\overline{u}_t-u^k_t)\|^2_{H^{-1}} + \delta L\sum_k p_k\|\overline{u}_t-u^k_t\|^2$,


Combining the bounds for $\mathbb{E}\|\overline{w}_{t+1}-w^*\|^2$ and $\mathbb{E}\|\overline{v}_{t+1}-w^*\|^2$, and following Liu & Belkin (2020), we choose step sizes so that the second and third terms are negative. To optimize the step sizes, note that the two inequalities imply a bound whose right hand side is quadratic in $\eta$; maximizing it with respect to $\eta$, we see that $\eta \equiv 1/(\nu_{\max}\tfrac{1}{N}\cdots)$ maximizes the right hand side, which can be combined with the terms involving $\|\overline{u}_{t-1}-w^*\|^2$ in the recursive expansion, and the step sizes can be chosen so that the resulting coefficients are negative. Therefore, we have shown the claimed contraction, and this guarantees geometric convergence for all $t$.

I DETAILS ON EXPERIMENTS AND ADDITIONAL RESULTS

We describe the precise procedure to reproduce the results in this paper. As mentioned in Section 5, we empirically verified the linear speedup in various convex settings for both FedAvg and its accelerated variants. For all results, we set the random seeds to 0, 1, 2 and report the best convergence rate across the three runs. For each run, we initialize w_0 = 0 and measure the number of iterations needed to reach the target accuracy ε. We use the small-scale dataset w8a (Platt, 1998), which consists of n = 49749 samples with feature dimension d = 300. Each label is either +1 or -1. The dataset has sparse binary features in {0, 1}; each sample has 11.15 non-zero feature values out of 300 features on average. We set the batch size to four across all experiments. In the following subsections, we describe the parameter search for each objective separately.
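The even split of the n samples across devices can be sketched as follows; the shuffling step and the use of np.array_split are our assumptions about the partitioning, not details stated in the paper.

```python
import numpy as np

def partition_evenly(num_samples, num_devices, seed=0):
    """Shuffle sample indices and split them into num_devices nearly equal shards."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_samples)
    return np.array_split(idx, num_devices)

shards = partition_evenly(49749, 32)   # w8a has n = 49749 samples
sizes = [len(s) for s in shards]
print(min(sizes), max(sizes))          # shard sizes differ by at most one
```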

I.1 STRONGLY CONVEX OBJECTIVES

We first consider a strongly convex objective function: regularized binary logistic regression with regularization λ = 1/n ≈ 2e-5. We evenly distribute the samples on 1, 2, 4, 8, 16, 32 devices and report the number of iterations/rounds needed to converge to ε-accuracy, where ε = 0.005. The optimal objective value is set as f* = 0.126433176216545; this is determined numerically, and we follow the setting in Stich (2019). The learning rate is decayed as η_t = min(η_0, nc/(1+t)), where we extensively search the best c ∈ {2^-2 c_0, 2^-1 c_0, c_0, 2c_0, 2^2 c_0}. In this case, we search the initial learning rate η_0 ∈ {1, 32} and set c_0 = 1/8.
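The decayed schedule η_t = min(η_0, nc/(1+t)) keeps the learning rate constant early on and then decays it at an O(1/t) rate; a minimal sketch with values from the grid above (the helper name is ours):

```python
def lr_schedule(t, eta0, n, c):
    """eta_t = min(eta0, n*c/(1+t)): constant warm phase, then O(1/t) decay."""
    return min(eta0, n * c / (1 + t))

n, c, eta0 = 49749, 1 / 8, 1.0
etas = [lr_schedule(t, eta0, n, c) for t in range(100000)]
print(etas[0], etas[-1])   # starts at eta0, ends in the 1/t decay phase
```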

I.2 CONVEX SMOOTH OBJECTIVES

We also use binary logistic regression, this time without regularization. The setting is almost the same as its regularized counterpart. We again evenly distribute all the samples on 1, 2, 4, 8, 16, 32 devices. The figure shows the number of iterations needed to converge to ε-accuracy, where ε = 0.02. The optimal objective value is set as f* = 0.11379089057514849, determined numerically. The learning rate is decayed as η_t = min(η_0, nc/(1+t)), where we extensively search the best c ∈ {2^-2 c_0, 2^-1 c_0, c_0, 2c_0, 2^2 c_0}. In this case, we search the initial learning rate η_0 ∈ {1, 32} and set c_0 = 1/8.

I.3 LINEAR REGRESSION

For linear regression, we use the feature vectors from the w8a dataset and generate the ground truth [w*, b*] from a multivariate normal distribution with zero mean and unit standard deviation. We then generate the labels as y_i = x_i^T w* + b*. This procedure ensures that we satisfy the overparameterized setting required by our theorems. We evenly distribute all the samples on 1, 2, 4, 8, 16, 32 devices. The figure shows the number of iterations needed to converge to ε-accuracy, where ε = 0.02. The optimal objective value is f* = 0. The learning rate is decayed as η_t = min(η_0, nc/(1+t)), where we extensively search the best c ∈ {2^-2 c_0, 2^-1 c_0, c_0, 2c_0, 2^2 c_0}. In this case, we search the initial learning rate η_0 ∈ {0.1, 0.12} and set c_0 = 1/256.
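The label-generation step can be sketched as follows; the random feature matrix here is a stand-in for the actual w8a features, and by construction the loss at (w*, b*) is exactly zero, so f* = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 300
X = rng.standard_normal((n, d))      # stand-in for the w8a feature vectors
w_star = rng.standard_normal(d)      # ground truth from a standard normal
b_star = rng.standard_normal()
y = X @ w_star + b_star              # noiseless labels: y_i = x_i^T w* + b*

mse_at_truth = 0.5 * np.mean((X @ w_star + b_star - y) ** 2)
print(mse_at_truth)                  # the optimum interpolates, matching f* = 0
```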

I.4 PARTIAL PARTICIPATION

To examine the linear speedup of FedAvg in the partial participation setting, we evenly distribute the data on 4, 8, 16, 32, 64, 128 devices and uniformly sample 50% of the devices without replacement in each round. All other hyperparameters are the same as in the previous sections.
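Sampling 50% of the devices uniformly without replacement in each round can be sketched as follows (the function name and seed are ours):

```python
import numpy as np

def sample_devices(num_devices, frac=0.5, rng=None):
    """Uniformly sample a fraction of the device indices without replacement."""
    rng = rng or np.random.default_rng(0)
    k = int(frac * num_devices)
    return rng.choice(num_devices, size=k, replace=False)

participants = sample_devices(128)
print(len(participants))   # 64 distinct device indices per round
```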

I.5 NESTEROV ACCELERATED FEDAVG

The experiments on Nesterov accelerated FedAvg (the update formula is given as follows) use the same setting as the previous three sections for vanilla FedAvg. We set β_t = 0.1 and search α_t in the same way as η_t in FedAvg.

I.6 THE IMPACT OF E

In this subsection, we further examine how the number of local steps E affects convergence. As shown in Figure 2, the number of iterations increases as E increases, which slows down convergence in terms of gradient computations. However, a larger E can save communication costs, since the number of communication rounds decreases as E increases. This shows that a proper choice of E is needed to trade off communication cost against convergence speed.
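As a hedged illustration of Nesterov accelerated FedAvg, the sketch below uses a standard Nesterov-momentum local step (look-ahead gradient, momentum β = 0.1 as above) with uniform averaging every E local steps; this is one common form of the update, not necessarily the paper's exact formula, and the quadratic device objectives, step size, and momentum reset at communication are all illustrative assumptions.

```python
def nesterov_fedavg(centers, E=4, rounds=200, alpha=0.1, beta=0.1):
    """Nesterov-momentum local steps with periodic uniform averaging.
    Device k holds the quadratic f_k(w) = 0.5 * (w - centers[k])^2."""
    N = len(centers)
    w = [0.0] * N
    w_prev = [0.0] * N
    for _ in range(rounds):
        for _ in range(E):
            for k in range(N):
                y = w[k] + beta * (w[k] - w_prev[k])   # look-ahead point
                g = y - centers[k]                     # grad f_k at y
                w_prev[k], w[k] = w[k], y - alpha * g
        avg = sum(w) / N                               # communication round
        w = [avg] * N                                  # sync iterates ...
        w_prev = [avg] * N                             # ... and reset momentum
    return sum(w) / N

w_bar = nesterov_fedavg([2.0, 4.0])
print(w_bar)   # approaches the global minimizer (2 + 4) / 2 = 3
```

With symmetric quadratic objectives the averaged iterate follows the centralized Nesterov dynamics on the global objective, so it converges to the global minimizer despite the heterogeneous local minimizers.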

