FEDERATED LEARNING OF A MIXTURE OF GLOBAL AND LOCAL MODELS

Anonymous authors
Paper under double-blind review

Abstract

We propose a new optimization formulation for training federated learning models. The standard formulation has the form of an empirical risk minimization problem constructed to find a single global model trained from the private data stored across all participating devices. In contrast, our formulation seeks an explicit trade-off between this traditional global model and the local models, which can be learned by each device from its own private data without any communication. Further, we develop several efficient variants of SGD (with and without partial participation and with and without variance reduction) for solving the new formulation and prove communication complexity guarantees. Notably, our methods are similar but not identical to federated averaging / local SGD, thus shedding some light on the essence of the elusive method. In particular, our methods do not perform full averaging steps and instead merely take steps towards averaging. We argue for the benefits of this new paradigm for federated learning.

1. INTRODUCTION

With the proliferation of mobile phones, wearable devices, tablets, and smart home devices comes an increase in the volume of data captured and stored on them. This data contains a wealth of potentially useful information for the owners of these devices, and more so if appropriate machine learning models could be trained on the heterogeneous data stored across the network of such devices. The traditional approach involves moving the relevant data to a data center where centralized machine learning techniques can be efficiently applied (Dean et al., 2012; Reddi et al., 2016). However, this approach is not without issues. First, many device users are increasingly sensitive to privacy concerns and prefer their data to never leave their devices. Second, moving data from their place of origin to a centralized location is very inefficient in terms of energy and time.

1.1. FEDERATED LEARNING

Federated learning (FL) (McMahan et al., 2016; Konečný et al., 2016b;a; McMahan et al., 2017) has emerged as an interdisciplinary field focused on addressing these issues by training machine learning models directly on edge devices. The currently prevalent paradigm (Li et al., 2019; Kairouz et al., 2019) casts supervised FL as an empirical risk minimization problem of the form

$$\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}$$

where $n$ is the number of devices participating in training, $x \in \mathbb{R}^d$ encodes the $d$ parameters of a global model (e.g., weights of a neural network) and $f_i(x) := \mathbb{E}_{\xi \sim \mathcal{D}_i}[f(x, \xi)]$ represents the aggregate loss of model $x$ on the local data represented by distribution $\mathcal{D}_i$ stored on device $i$. One of the defining characteristics of FL is that the data distributions $\mathcal{D}_i$ may possess very different properties across the devices. Hence, any potential FL method is explicitly required to be able to work under the heterogeneous data setting. The most popular method for solving (1) in the context of FL is the FedAvg algorithm (McMahan et al., 2016).
In its simplest form, when one does not employ partial participation, model compression, or stochastic approximation, FedAvg reduces to Local Gradient Descent (LGD) (Khaled et al., 2019; 2020), which is an extension of GD performing more than a single gradient step on each device before aggregation. FedAvg has been shown to work well empirically, particularly for non-convex problems, but comes with poor convergence guarantees compared to the non-local counterparts when data are heterogeneous.
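To make the reduction of FedAvg to LGD concrete, the following is a minimal Python sketch, assuming scalar models, full participation, and exact local gradients. The function name `local_gradient_descent`, its parameters, and the toy quadratic losses are illustrative assumptions and are not taken from the cited works.

```python
from typing import Callable, Sequence


def local_gradient_descent(
    grads: Sequence[Callable[[float], float]],
    x0: float,
    stepsize: float,
    local_steps: int,
    rounds: int,
) -> float:
    """LGD sketch: each device runs `local_steps` GD steps, then models are fully averaged."""
    x = x0
    for _ in range(rounds):
        local_models = []
        for grad in grads:
            xi = x  # every device starts from the current global model
            for _ in range(local_steps):
                xi -= stepsize * grad(xi)
            local_models.append(xi)
        x = sum(local_models) / len(local_models)  # full averaging (aggregation)
    return x


# Toy losses f_i(x) = 0.5 * (x - c_i)^2 (an assumption made for illustration).
centers = [0.0, 2.0]
toy_grads = [lambda x, c=c: x - c for c in centers]
x_lgd = local_gradient_descent(toy_grads, x0=5.0, stepsize=0.5, local_steps=1, rounds=60)
```

With `local_steps=1` this is plain GD on the average loss; larger values give the multi-step variant described above. On this toy problem all curvatures are equal, so both variants approach the minimizer of (1), which is the mean of the `centers`.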

Some issues with current approaches to FL

The first motivation for our research comes from the appreciation that data heterogeneity does not merely present challenges to the design of new provably efficient training methods for solving (1), but also inevitably raises questions about the utility of such a global solution to individual users. Indeed, a global model trained across all the data from all devices might be so removed from the typical data and usage patterns experienced by an individual user as to render it virtually useless. This issue has been observed before, and various approaches have been proposed to address it. For instance, the MOCHA framework (Smith et al., 2017) uses a multi-task learning approach to allow for personalization. Next, Khodak et al. (2019) propose a generic online algorithm for gradient-based parameter-transfer meta-learning and demonstrate improved practical performance over FedAvg (McMahan et al., 2017). Approaches based on variational inference (Corinzia & Buhmann, 2019), cyclic patterns in practical FL data sampling (Eichner et al., 2019), transfer learning (Zhao et al., 2018) and explicit model mixing (Peterson et al., 2019) have also been proposed.

The second motivation for our work is the realization that even very simple variants of FedAvg, such as LGD, which should be easier to analyze, fail to provide theoretical improvements in communication complexity over their non-local cousins, in this case, GD (Khaled et al., 2019; 2020). This observation is at odds with the practical success of local methods in FL. This leads us to ask the question: if LGD does not theoretically improve upon GD as a solver for the traditional global problem (1), perhaps LGD should not be seen as a method for solving (1) at all. In such a case, what problem does LGD solve? A good answer to this question would shed light on the workings of LGD, and by analogy, on the role local steps play in more elaborate FL methods such as local SGD (Stich, 2020; Khaled et al., 2020) and FedAvg.

2. CONTRIBUTIONS

In our work we argue that the two motivations mentioned in the introduction point in the same direction, i.e., we show that a single solution can be devised that addresses both problems at the same time. Our main contributions are:

New formulation of FL which seeks an implicit mixture of global and local models. We propose a new optimization formulation of FL. Instead of learning a single global model by solving (1), we propose to learn a mixture of the global model and the purely local models which can be trained by each device $i$ using its data $\mathcal{D}_i$ only. Our formulation (see Sec. 3) lifts the problem from $\mathbb{R}^d$ to $\mathbb{R}^{nd}$, allowing each device $i$ to learn a personalized model $x_i \in \mathbb{R}^d$. These personalized models are encouraged not to depart too much from their mean by the inclusion of a quadratic penalty $\psi$ multiplied by a parameter $\lambda \ge 0$. Admittedly, the idea of softly-enforced similarity of the local models was already introduced in the domain of multi-task relationship learning (Zhang & Yeung, 2010; Liu et al., 2017; Wang et al., 2018) and distributed optimization (Lan et al., 2018; Gorbunov et al., 2019; Zhang et al., 2015). The mixture objective we propose (see (2)) is a special case of their setup, which justifies our approach from the modeling perspective. Note that Zhang et al. (2015), Liu et al. (2017) and Wang et al. (2018) already provide efficient algorithms to solve the mixture objective. However, none of the mentioned papers consider the FL application, nor do they shed light on the communication complexity of LGD algorithms, which we do in our work.

Theoretical properties of the new formulation. We study the properties of the optimal solution of our formulation, thus developing an algorithm-free theory. When the penalty parameter is set to zero, then obviously, each device is allowed to train its own model without any dependence on the data stored on other devices. Such purely local models are rarely useful.
We prove that the optimal local models converge to the traditional global model characterized by (1) at the rate $O(1/\lambda)$. We also show that the total loss evaluated at the local models is never higher than the total loss evaluated at the global model (see Thm. 3.1). Moreover, we prove an insightful structural result for the optimal local models: the optimal model learned by device $i$ arises by subtracting the gradient of the loss function stored on that device evaluated at the same point (i.e., a local model) from the average of the optimal local models (see Thm. 3.2). As a byproduct, this theoretical result sheds new light on the key update step in the model agnostic meta-learning (MAML) method (Finn et al., 2017), which has a similar but subtly different structure. (The subtle difference is that the MAML update obtains the local model by subtracting the gradient evaluated at the global model.) While MAML was originally proposed as a heuristic, we provide rigorous theoretical guarantees.

Loopless LGD: non-uniform SGD applied to our formulation. We then propose a randomized gradient-based method, Loopless Local Gradient Descent (L2GD), for solving our new formulation (Algorithm 1). This method is, in fact, a non-standard application of SGD to our problem, and can be seen as an instance of SGD with non-uniform sampling applied to the problem of minimizing the sum of two convex functions (Zhao & Zhang, 2015; Gower et al., 2019): the average loss, and the penalty. When the loss function is selected by the randomness in our SGD method, the stochastic gradient step can be interpreted as the execution of a single local GD step on each device. Since we set the probability of the loss being sampled to be high, this step is typically repeated multiple times, resulting in multiple local GD steps. In contrast to standard LGD, the number of local steps is not fixed, but random, and follows a geometric distribution.
This mechanism is similar in spirit to how the recently proposed loopless variants of SVRG (Hofmann et al., 2015; Kovalev et al., 2020) work in comparison with the original SVRG (Johnson & Zhang, 2013a; Xiao & Zhang, 2014). Once the penalty is sampled by our method, the resultant SGD step can be interpreted as the execution of an aggregation step. In contrast with standard aggregation, which performs full averaging of the local models, our method merely takes a step towards averaging. However, the step is relatively large.

Convergence theory. By adapting the general theory from Gower et al. (2019) to our setting, we obtain theoretical convergence guarantees assuming that each $f_i$ is $L$-smooth and $\mu$-strongly convex (see Thm. 4.2). Interestingly, by optimizing the probability of sampling the penalty (we get $p^\star = \frac{\lambda}{\lambda + L}$), which is an indirect way of fixing the expected number of local steps to $1 + \frac{L}{\lambda}$, we prove a $\frac{2\lambda}{\lambda + L} \frac{L}{\mu} \log\frac{1}{\varepsilon}$ bound on the expected number of communication rounds (see Cor. 4.3). We believe that this is remarkable in several ways. By choosing $\lambda$ small, we tilt our goal towards pure local models: the number of communication rounds tends to $0$ as $\lambda \to 0$. As $\lambda \to \infty$, the solution of our formulation converges to the optimal global model, and L2GD obtains the communication bound $O\left(\frac{L}{\mu} \log\frac{1}{\varepsilon}\right)$, which matches the efficiency of GD.

What problem do local methods solve? Noting that L2GD is a (mildly nonstandard) version of LGD, which is a key method most local methods for FL are based on, and noting that, as we show, L2GD solves our new formulation of FL, we offer a new and surprising interpretation of the role of local steps in FL. In particular, the role of local steps in gradient type methods, such as GD, is not to reduce communication complexity, as is generally believed. Indeed, there is no theoretical result supporting this claim in the key heterogeneous data regime.
Instead, their role is to steer the method towards finding a mixture of the traditional global and the purely local models. Given that the stepsize is fixed, the more local steps are taken, the more we bias the method towards the purely local models. Our new optimization formulation of FL formalizes this as it defines the problem that local methods, in this case L2GD, solve. There is an added benefit here: the more we want our formulation to be biased towards purely local models (i.e., the smaller the penalty parameter $\lambda$ is), the more local steps L2GD takes, and the better the total communication complexity of L2GD becomes. Hence, despite a lot of research on this topic, our paper provides the first proof that a local method (e.g., L2GD) can be better than its non-local counterpart (e.g., GD) in terms of total communication complexity in the heterogeneous data setting. We are able to do this by noting that local methods are better seen as methods for solving the new FL formulation proposed here.

Generalizations: partial participation, local SGD and variance reduction. We further generalize and improve our method by allowing for (i) stochastic partial participation of devices in each communication round, (ii) subsampling on each device, which means we can perform local SGD steps instead of local GD steps, and (iii) a total variance reduction mechanism to tackle the variance coming from three sources: locality of the updates induced by non-uniform sampling (already present in L2GD), partial participation and local subsampling. Due to its level of generality, this method, which we call L2SGD++, is presented in the Appendix only, alongside the associated complexity results. In the main body of this paper, we instead present a simplified version thereof, which we call L2SGD+ (Algorithm 3). The convergence theory for it is presented in Thm. 5.1 and Cor. 5.2.

Heterogeneous data.
All our methods and convergence results allow for fully heterogeneous data and do not depend on any assumptions on data similarity across the devices. Superior empirical performance. We show through ample numerical experiments that our theoretical predictions can be observed in practice.

3. NEW FORMULATION OF FL

We now introduce our new formulation for training supervised FL models:

$$\min_{x_1, \dots, x_n \in \mathbb{R}^d} \left\{ F(x) := f(x) + \lambda \psi(x) \right\}, \qquad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x_i), \qquad \psi(x) := \frac{1}{2n} \sum_{i=1}^{n} \|x_i - \bar{x}\|^2, \tag{2}$$

where $\lambda \ge 0$ is a penalty parameter, $x_1, \dots, x_n \in \mathbb{R}^d$ are local models, $x := (x_1, x_2, \dots, x_n) \in \mathbb{R}^{nd}$ and $\bar{x} := \frac{1}{n} \sum_{i=1}^{n} x_i$ is the average of the local models. Due to the assumptions on $f_i$ we will make in Sec. 3.1, $F$ is strongly convex and hence (2) has a unique solution, which we denote $x(\lambda) := (x_1(\lambda), \dots, x_n(\lambda)) \in \mathbb{R}^{nd}$. We further let $\bar{x}(\lambda) := \frac{1}{n} \sum_{i=1}^{n} x_i(\lambda)$. We now comment on the rationale behind the new formulation.

Local models ($\lambda = 0$). Note that for each $i$, $x_i(0)$ solves the local problem $\min_{x_i \in \mathbb{R}^d} f_i(x_i)$. That is, $x_i(0)$ is the local model based on data $\mathcal{D}_i$ stored on device $i$ only. This model can be computed by device $i$ without any communication whatsoever. Typically, $\mathcal{D}_i$ is not rich enough for this local model to be useful. In order to learn a better model, one has to take into account the data from other clients as well. This, however, requires communication.

Mixed models ($\lambda \in (0, \infty)$). As $\lambda$ increases, the penalty $\lambda \psi(x)$ has an increasingly substantial effect, and communication is needed to ensure that the models are not too dissimilar, as otherwise the penalty $\lambda \psi(x)$ would be too large.

Global model ($\lambda = \infty$). Let us now look at the limit case $\lambda \to \infty$. Intuitively, this limit case should force the optimal local models to be mutually identical, while minimizing the loss $f$. In particular, this limit case will solve

$$\min \left\{ f(x) \,:\, x_1, \dots, x_n \in \mathbb{R}^d, \; x_1 = x_2 = \dots = x_n \right\},$$

which is equivalent to the global formulation (1). Because of this, let us define $x_i(\infty)$ for each $i$ to be the optimal global solution of (1), and let $x(\infty) := (x_1(\infty), \dots, x_n(\infty))$.
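As an illustration of formulation (2), the following Python sketch evaluates the mixture objective $F$ for a list of local models. The helper names (`average_model`, `penalty`, `mixture_objective`) and the toy quadratic losses are assumptions introduced here for illustration only.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def average_model(models: Sequence[Vector]) -> Vector:
    """x_bar = (1/n) * sum_i x_i."""
    n, d = len(models), len(models[0])
    return [sum(x[j] for x in models) / n for j in range(d)]


def penalty(models: Sequence[Vector]) -> float:
    """psi(x) = 1/(2n) * sum_i ||x_i - x_bar||^2."""
    n = len(models)
    x_bar = average_model(models)
    return sum(
        sum((xj - bj) ** 2 for xj, bj in zip(x, x_bar)) for x in models
    ) / (2 * n)


def mixture_objective(
    models: Sequence[Vector],
    losses: Sequence[Callable[[Vector], float]],
    lam: float,
) -> float:
    """F(x) = (1/n) * sum_i f_i(x_i) + lam * psi(x)."""
    n = len(models)
    f = sum(loss(x) for loss, x in zip(losses, models)) / n
    return f + lam * penalty(models)


# Toy losses f_i(x) = 0.5 * (x - c_i)^2 in dimension d = 1 (illustrative assumption).
targets = [[0.0], [2.0]]
toy_losses = [lambda x, c=c: 0.5 * (x[0] - c[0]) ** 2 for c in targets]
F_local = mixture_objective(targets, toy_losses, lam=3.0)  # purely local models
F_consensus = mixture_objective([[1.0], [1.0]], toy_losses, lam=3.0)  # identical models
```

In this toy instance the purely local models pay only the penalty, while identical models pay only the loss, which is the trade-off that $\lambda$ controls.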

3.1. TECHNICAL PRELIMINARIES

We make the following assumption on the functions $f_i$:

Assumption 3.1 For each $i$, the function $f_i : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $\mu$-strongly convex.

For $x_i, y_i \in \mathbb{R}^d$, $\langle x_i, y_i \rangle$ denotes the standard inner product and $\|x_i\| := \langle x_i, x_i \rangle^{1/2}$ is the standard Euclidean norm. For vectors $x = (x_1, \dots, x_n) \in \mathbb{R}^{nd}$, $y = (y_1, \dots, y_n) \in \mathbb{R}^{nd}$ we define the standard inner product and norm via

$$\langle x, y \rangle := \sum_{i=1}^{n} \langle x_i, y_i \rangle, \qquad \|x\|^2 := \sum_{i=1}^{n} \|x_i\|^2.$$

Note that the separable structure of $f$ implies that $(\nabla f(x))_i = \frac{1}{n} \nabla f_i(x_i)$, i.e., $\nabla f(x) = \frac{1}{n} \left( \nabla f_1(x_1), \nabla f_2(x_2), \dots, \nabla f_n(x_n) \right)$. Note that Assumption 3.1 implies that $f$ is $L_f$-smooth with $L_f := \frac{L}{n}$ and $\mu_f$-strongly convex with $\mu_f := \frac{\mu}{n}$. Clearly, $\psi$ is convex by construction. It can be shown that $\psi$ is $L_\psi$-smooth with $L_\psi = \frac{1}{n}$ (see Appendix). We can also easily see that $(\nabla \psi(x))_i = \frac{1}{n}(x_i - \bar{x})$ (see Appendix), which implies

$$\psi(x) = \frac{n}{2} \sum_{i=1}^{n} \|(\nabla \psi(x))_i\|^2 = \frac{n}{2} \|\nabla \psi(x)\|^2.$$
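The gradient formula for $\psi$ and the identity $\psi(x) = \frac{n}{2}\|\nabla\psi(x)\|^2$ can be checked numerically. The sketch below is a small sanity check under the definitions above; the function name `penalty_and_gradient` and the sample models are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def penalty_and_gradient(models: Sequence[Vector]) -> Tuple[float, List[Vector]]:
    """Return psi(x) and the blocks (grad psi(x))_i = (x_i - x_bar) / n."""
    n, d = len(models), len(models[0])
    x_bar = [sum(x[j] for x in models) / n for j in range(d)]
    diffs = [[x[j] - x_bar[j] for j in range(d)] for x in models]
    psi = sum(v * v for diff in diffs for v in diff) / (2 * n)
    grad = [[v / n for v in diff] for diff in diffs]
    return psi, grad


# Three hypothetical local models in dimension d = 2.
sample_models = [[1.0, 0.0], [3.0, 4.0], [2.0, -1.0]]
psi_value, psi_grad = penalty_and_gradient(sample_models)

# Right-hand side of the identity psi(x) = (n/2) * ||grad psi(x)||^2.
grad_sq_norm = sum(v * v for block in psi_grad for v in block)
identity_rhs = len(sample_models) / 2 * grad_sq_norm
```

The gradient blocks also sum to zero, which is why an aggregation step driven by $\nabla\psi$ cannot move the average of the local models.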

3.2. CHARACTERIZATION OF OPTIMAL SOLUTIONS

Our first result describes the behavior of $f(x(\lambda))$ and $\psi(x(\lambda))$ as a function of $\lambda$.

Theorem 3.1 The function $\lambda \mapsto \psi(x(\lambda))$ is non-increasing, and for all $\lambda > 0$ we have

$$\psi(x(\lambda)) \le \frac{f(x(\infty)) - f(x(0))}{\lambda}. \tag{3}$$

Moreover, the function $\lambda \mapsto f(x(\lambda))$ is non-decreasing, and for all $\lambda \ge 0$ we have $f(x(\lambda)) \le f(x(\infty))$.

Ineq. (3) says that the penalty decreases to zero as $\lambda$ grows, and hence the optimal local models $x_i(\lambda)$ are increasingly similar as $\lambda$ grows. The second statement says that the loss $f(x(\lambda))$ increases with $\lambda$, but never exceeds the optimal global loss $f(x(\infty))$ of the standard FL formulation (1). We now characterize the optimal local models, which connects our model to the MAML framework (Finn et al., 2017), as mentioned in the introduction.

Theorem 3.2 For each $\lambda > 0$ and $1 \le i \le n$ we have

$$x_i(\lambda) = \bar{x}(\lambda) - \frac{1}{\lambda} \nabla f_i(x_i(\lambda)). \tag{5}$$

Further, we have $\sum_{i=1}^{n} \nabla f_i(x_i(\lambda)) = 0$ and $\psi(x(\lambda)) = \frac{1}{2\lambda^2} \|\nabla f(x(\lambda))\|^2$.

The optimal local models (5) are obtained from the average model by subtracting a multiple of the local gradient. Observe that the local gradients always sum up to zero at optimality. This is obviously true for $\lambda = \infty$, but it is a bit less obvious that this holds for any $\lambda > 0$. Next, we argue that the optimal local models converge to the traditional FL solution at the rate $O(1/\lambda)$.

Theorem 3.3 Let $P(z) := \frac{1}{n} \sum_{i=1}^{n} f_i(z)$. Then, $\bar{x}(\infty)$ is the unique minimizer of $P$ and we have

$$\|\nabla P(\bar{x}(\lambda))\|^2 \le \frac{2L^2 \left( f(x(\infty)) - f(x(0)) \right)}{\lambda}.$$
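The structural result (5) and Ineq. (3) can be verified on a problem where the minimizer of (2) is available in closed form. The sketch below assumes scalar quadratic losses $f_i(x) = \frac{a_i}{2}(x - c_i)^2$, for which stationarity of $F$ reads $a_i(x_i - c_i) + \lambda(x_i - \bar{x}) = 0$; the function name and the numeric values are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def optimal_models_quadratic(
    a: Sequence[float], c: Sequence[float], lam: float
) -> Tuple[List[float], float]:
    """Closed-form minimizer of F for scalar losses f_i(x) = a_i/2 * (x - c_i)^2."""
    weights = [ai / (ai + lam) for ai in a]
    x_bar = sum(w * ci for w, ci in zip(weights, c)) / sum(weights)
    models = [(ai * ci + lam * x_bar) / (ai + lam) for ai, ci in zip(a, c)]
    return models, x_bar


a, c, lam = [1.0, 2.0, 4.0], [-1.0, 0.5, 3.0], 0.7
n = len(a)
opt_models, opt_mean = optimal_models_quadratic(a, c, lam)
local_grads = [ai * (xi - ci) for ai, xi, ci in zip(a, opt_models, c)]

# Thm. 3.2: x_i(lam) = x_bar(lam) - grad f_i(x_i(lam)) / lam, and gradients sum to zero.
structure_gap = max(
    abs(xi - (opt_mean - g / lam)) for xi, g in zip(opt_models, local_grads)
)
grad_sum = sum(local_grads)

# Ineq. (3): psi(x(lam)) <= (f(x(inf)) - f(x(0))) / lam, where f(x(0)) = 0 here.
psi_opt = sum((xi - opt_mean) ** 2 for xi in opt_models) / (2 * n)
x_global = sum(ai * ci for ai, ci in zip(a, c)) / sum(a)
f_global = sum(0.5 * ai * (x_global - ci) ** 2 for ai, ci in zip(a, c)) / n
```

Both `structure_gap` and `grad_sum` vanish up to rounding, and `psi_opt` stays below `f_global / lam`, as Theorems 3.1 and 3.2 predict for this instance.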

4. L2GD: LOOPLESS LOCAL GD

In this section we describe a new randomized method for solving the formulation (2). Our method is a non-uniform SGD for (2) seen as a 2-sum problem, sampling either $\nabla f$ or $\nabla \psi$ to estimate $\nabla F$. Letting $0 < p < 1$, we define a stochastic gradient of $F$ at $x \in \mathbb{R}^{nd}$ as follows:

$$G(x) := \begin{cases} \dfrac{\nabla f(x)}{1-p} & \text{with probability } 1 - p, \\[2mm] \dfrac{\lambda \nabla \psi(x)}{p} & \text{with probability } p. \end{cases}$$

Clearly, $G(x)$ is an unbiased estimator of $\nabla F(x)$. This leads to the following method for minimizing $F$, which we call L2GD:

$$x^{k+1} = x^k - \alpha G(x^k). \tag{7}$$

Plugging the formulas for $\nabla f(x)$ and $\nabla \psi(x)$ into (7), and writing the resulting method in a distributed manner, we arrive at Algorithm 1. In each iteration, a coin $\xi$ is tossed and lands $1$ with probability $p$ and $0$ with probability $1 - p$. If $\xi = 0$, all Devices perform one local GD step (8), and if $\xi = 1$, Master shifts each local model towards the average via (9). As we shall see in Sec. 4.2, our theory limits the value of the stepsize $\alpha$, which has the effect that the ratio $\frac{\alpha\lambda}{np}$ cannot exceed $\frac{1}{2}$. Hence, (9) is a convex combination of $x_i^k$ and $\bar{x}^k$. Note that Algorithm 1 is only required to communicate when two consecutive coin tosses land on different values (see the detailed explanation in Sec. C.1 of the appendix). Consequently, the expected number of communication rounds in $k$ iterations of L2GD is $p(1-p)k$.

Remark 4.1 Our algorithm statements do not take data privacy into consideration. While privacy is a very important aspect of FL, in this paper we tackle different FL challenges and thus ignore privacy issues. However, the proposed algorithms can be implemented in a private fashion as well, using tricks that are used in the classical FL scenario (Bonawitz et al., 2017).

Algorithm 1 L2GD: Loopless Local Gradient Descent
  Input: $x_1^0 = \dots = x_n^0 \in \mathbb{R}^d$, stepsize $\alpha$, probability $p$
  for $k = 0, 1, \dots$ do
    $\xi = 1$ with probability $p$ and $0$ with probability $1 - p$
    if $\xi = 0$ then
      All Devices $i = 1, \dots, n$ perform a local GD step:
        $x_i^{k+1} = x_i^k - \frac{\alpha}{n(1-p)} \nabla f_i(x_i^k)$    (8)
    else
      Master computes the average $\bar{x}^k = \frac{1}{n} \sum_{i=1}^{n} x_i^k$
      Master for each $i$ computes a step towards aggregation:
        $x_i^{k+1} = \left(1 - \frac{\alpha\lambda}{np}\right) x_i^k + \frac{\alpha\lambda}{np} \bar{x}^k$    (9)
    end if
  end for
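Algorithm 1 can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not a distributed implementation: the function names (`local_step`, `aggregation_step`, `l2gd`), the fixed random seed, and the toy quadratic losses are all introduced here for illustration.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Gradient = Callable[[Vector], Vector]


def local_step(models: Sequence[Vector], grads: Sequence[Gradient],
               alpha: float, p: float) -> List[Vector]:
    """Step (8): every device takes one GD step on its own loss."""
    n = len(models)
    scale = alpha / (n * (1.0 - p))
    return [
        [xj - scale * gj for xj, gj in zip(x, grad(x))]
        for x, grad in zip(models, grads)
    ]


def aggregation_step(models: Sequence[Vector], alpha: float,
                     lam: float, p: float) -> List[Vector]:
    """Step (9): every local model moves towards the current average."""
    n, d = len(models), len(models[0])
    x_bar = [sum(x[j] for x in models) / n for j in range(d)]
    theta = alpha * lam / (n * p)
    return [[(1.0 - theta) * xj + theta * bj for xj, bj in zip(x, x_bar)]
            for x in models]


def l2gd(grads: Sequence[Gradient], x0: Vector, alpha: float, lam: float,
         p: float, num_iters: int, seed: int = 0) -> List[Vector]:
    """Simulate L2GD: aggregate with probability p, otherwise take local steps."""
    rng = random.Random(seed)
    models = [list(x0) for _ in grads]  # identical starting points
    for _ in range(num_iters):
        if rng.random() < p:  # coin lands 1
            models = aggregation_step(models, alpha, lam, p)
        else:  # coin lands 0
            models = local_step(models, grads, alpha, p)
    return models


# Toy problem (assumption): f_i(x) = 0.5 * ||x - c_i||^2, so L = 1.
centers = [[-1.0], [0.0], [4.0]]
toy_grads = [lambda x, c=c: [xj - cj for xj, cj in zip(x, c)] for c in centers]
L, lam = 1.0, 0.5
p = lam / (L + lam)
alpha = len(centers) / (2.0 * (L + lam))
final_models = l2gd(toy_grads, [0.0], alpha, lam, p, num_iters=200)
```

With these parameter choices the aggregation weight $\frac{\alpha\lambda}{np}$ equals $\frac{1}{2}$ and the local stepsize equals $\frac{1}{2L}$. Because plain L2GD is an SGD method, the iterates are only expected to reach a neighborhood of $x(\lambda)$ rather than converge exactly.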

4.1. THE DYNAMICS OF LOCAL GD AND AVERAGING STEPS

Notice that the average of the local models does not change during an aggregation step. Indeed,

$$\bar{x}^{k+1} = \frac{1}{n} \sum_{i=1}^{n} x_i^{k+1} \overset{(9)}{=} \frac{1}{n} \sum_{i=1}^{n} \left[ \left(1 - \frac{\alpha\lambda}{np}\right) x_i^k + \frac{\alpha\lambda}{np} \bar{x}^k \right] = \bar{x}^k.$$

If several averaging steps take place in a sequence, the point $a = \bar{x}^k$ in (9) remains unchanged, and each local model $x_i^k$ merely moves along the line joining the initial value of the local model at the start of the sequence and $a$, with each step pushing $x_i^k$ closer to the average $a$. In summary, the more local GD steps are taken, the closer the local models get to the pure local models; and the more averaging steps are taken, the closer the local models get to their average value. The relative number of local GD vs. averaging steps is controlled by the parameter $p$: the expected number of consecutive local GD steps is $\frac{1}{p}$, and the expected number of consecutive aggregation steps is $\frac{1}{1-p}$.
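The behavior of consecutive aggregation steps can be made explicit: with $\theta = \frac{\alpha\lambda}{np}$, after $t$ aggregation steps each model equals $a + (1-\theta)^t (x_i - a)$, where $a$ is the unchanged average. The sketch below compares this closed form with the iterated update for scalar models; the function names and numbers are illustrative assumptions.

```python
from typing import List, Sequence


def repeated_aggregation(models: Sequence[float], theta: float, steps: int) -> List[float]:
    """Apply the aggregation rule (9) `steps` times with weight theta = alpha*lam/(n*p)."""
    current = list(models)
    for _ in range(steps):
        avg = sum(current) / len(current)
        current = [(1.0 - theta) * x + theta * avg for x in current]
    return current


def aggregation_closed_form(models: Sequence[float], theta: float, steps: int) -> List[float]:
    """Each model contracts towards the fixed average by the factor (1 - theta)^steps."""
    avg = sum(models) / len(models)
    shrink = (1.0 - theta) ** steps
    return [avg + shrink * (x - avg) for x in models]


start = [0.0, 2.0, 7.0]
iterated = repeated_aggregation(start, theta=0.5, steps=3)
closed = aggregation_closed_form(start, theta=0.5, steps=3)
```

Both computations give the same models, and their mean equals the mean of `start`, which matches the observation that aggregation never moves the average.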

4.2. CONVERGENCE THEORY

We now present our convergence result for L2GD.

Theorem 4.2 Let Assumption 3.1 hold. If $\alpha \le \frac{1}{2\mathcal{L}}$, then

$$\mathbb{E}\|x^k - x(\lambda)\|^2 \le \left(1 - \frac{\alpha\mu}{n}\right)^k \|x^0 - x(\lambda)\|^2 + \frac{2n\alpha\sigma^2}{\mu},$$

where $\mathcal{L} := \frac{1}{n} \max\left\{ \frac{L}{1-p}, \frac{\lambda}{p} \right\}$ and $\sigma^2 := \frac{1}{n^2} \sum_{i=1}^{n} \left( \frac{1}{1-p} \|\nabla f_i(x_i(\lambda))\|^2 + \frac{\lambda^2}{p} \|x_i(\lambda) - \bar{x}(\lambda)\|^2 \right)$.

Let us find the parameters $p, \alpha$ which lead to the fastest rate to push the error within an $O(\varepsilon) + \frac{2n\alpha\sigma^2}{\mu}$ neighborhood of the optimum (see footnote 5), i.e., to achieve

$$\mathbb{E}\|x^k - x(\lambda)\|^2 \le \varepsilon \|x^0 - x(\lambda)\|^2 + \frac{2n\alpha\sigma^2}{\mu}. \tag{10}$$

Corollary 4.3 The value $p^\star = \frac{\lambda}{L + \lambda}$ minimizes both the number of iterations and the expected number of communications for achieving (10). In particular, the optimal number of iterations is $2 \frac{L + \lambda}{\mu} \log\frac{1}{\varepsilon}$, and the optimal expected number of communications is $\frac{2\lambda}{\lambda + L} \frac{L}{\mu} \log\frac{1}{\varepsilon}$.

If we choose $p = p^\star$ (with the largest admissible stepsize $\alpha = \frac{1}{2\mathcal{L}}$), then $\frac{\alpha\lambda}{np} = \frac{1}{2}$, and the aggregation rule (9) in Algorithm 1 becomes

$$x_i^{k+1} = \frac{1}{2} \left( x_i^k + \bar{x}^k \right), \tag{11}$$

while the local GD step (8) becomes $x_i^{k+1} = x_i^k - \frac{1}{2L} \nabla f_i(x_i^k)$. Notice that while our method does not support full averaging, as that is too unstable, (11) suggests that one should take a large step towards averaging.

As $\lambda$ gets smaller, the solution to the optimization problem (2) will increasingly favour pure local models, i.e., $x_i(\lambda) \to x_i(0) := \arg\min f_i$ for all $i$ as $\lambda \to 0$. Pure local models can be computed without any communication whatsoever, and Cor. 4.3 confirms this intuition: the optimal number of communication rounds decreases to zero as $\lambda \to 0$. On the other hand, as $\lambda \to \infty$, the optimal number of communication rounds converges to $2 \frac{L}{\mu} \log\frac{1}{\varepsilon}$, which recovers the performance of GD for finding the globally optimal model (see Fig. 1).

Footnote 5: In Sec. 5 we propose a variance reduced algorithm which removes the $\frac{2n\alpha\sigma^2}{\mu}$-neighborhood from Thm. 4.2. In that setting, our goal will be to achieve $\mathbb{E}\|x^k - x(\lambda)\|^2 \le \varepsilon \|x^0 - x(\lambda)\|^2$.

[Figure caption fragment: ... to reach $\frac{F(x^k) - F(x^\star)}{F(x^0) - F(x^\star)} \le 10^{-5}$ as a function of $p$, with $p^\star \approx 0.09$ (for L2SGD+). Logistic regression on the a1a dataset with $\lambda = 0.1$.]
In summary, we recover the communication efficiency of GD for finding the globally optimal model as $\lambda \to \infty$ (ignoring the $\frac{2n\alpha\sigma^2}{\mu}$-neighborhood). However, for other values of $\lambda$, the communication complexity of L2GD is better and decreases to $0$ as $\lambda \to 0$. Hence, our communication complexity result interpolates between the communication complexity of GD for finding the global model and the zero communication complexity for finding the pure local models.
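The interpolation described above can be tabulated directly from the formulas in Cor. 4.3. The following sketch is a plain calculator for those expressions; the function names and the sample constants ($L = 1$, $\mu = 0.1$, $\varepsilon = 10^{-3}$) are assumptions chosen for illustration.

```python
import math


def l2gd_optimal_p(L: float, lam: float) -> float:
    """Aggregation probability p* = lam / (L + lam) from Cor. 4.3."""
    return lam / (L + lam)


def l2gd_iterations(L: float, mu: float, lam: float, eps: float) -> float:
    """Optimal number of iterations: 2 * (L + lam) / mu * log(1 / eps)."""
    return 2.0 * (L + lam) / mu * math.log(1.0 / eps)


def l2gd_communications(L: float, mu: float, lam: float, eps: float) -> float:
    """Optimal expected communications: 2 * lam / (lam + L) * L / mu * log(1 / eps)."""
    return 2.0 * lam / (lam + L) * L / mu * math.log(1.0 / eps)


L, mu, eps = 1.0, 0.1, 1e-3
lam = 0.5
p_star = l2gd_optimal_p(L, lam)
iters = l2gd_iterations(L, mu, lam, eps)
comms = l2gd_communications(L, mu, lam, eps)

# Limits discussed in the text: GD-like cost for large lam, vanishing cost for small lam.
gd_comms = 2.0 * L / mu * math.log(1.0 / eps)
comms_large_lam = l2gd_communications(L, mu, 1e9, eps)
comms_small_lam = l2gd_communications(L, mu, 1e-9, eps)
```

The communication count is consistent with the expected number of rounds $p(1-p)k$ stated in Sec. 4, evaluated at $p = p^\star$ and the optimal iteration count.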

5. LOOPLESS LOCAL SGD WITH VARIANCE REDUCTION

As we have seen in Sec. 4.2, L2GD is a specific instance of SGD, and thus only converges linearly to a neighborhood of the optimum. In this section, we resolve this issue by incorporating control variates into the stochastic gradient (Johnson & Zhang, 2013b; Defazio et al., 2014). We go further: we assume that each local objective has a finite-sum structure and propose an algorithm, L2SGD+, which takes local stochastic gradient steps while maintaining a (global) linear convergence rate. As a consequence, L2SGD+ is the first local SGD with linear convergence. For convenience, we present variance reduced local GD (i.e., no local subsampling) in the Appendix.

Assumption 5.1 Assume that $f_i$ has a finite-sum structure: $f_i(x_i) = \frac{1}{m} \sum_{j=1}^{m} f_{i,j}(x_i)$. Let $f_{i,j}$ be convex and $\tilde{L}$-smooth, while $f_i$ is $\mu$-strongly convex (for each $1 \le j \le m$, $1 \le i \le n$).

5.1. CONVERGENCE THEORY

We are now ready to present a convergence rate of L2SGD+ (the algorithm, along with its efficient implementation, is presented in Appendix C.4).

Theorem 5.1 Let Assumption 5.1 hold and choose $\alpha = n \min\left\{ \frac{1-p}{4\tilde{L} + \mu m}, \frac{p}{4\lambda + \mu} \right\}$. Then the iteration complexity of Algorithm 3 is $\max\left\{ \frac{4\tilde{L} + \mu m}{(1-p)\mu}, \frac{4\lambda + \mu}{p\mu} \right\} \log\frac{1}{\varepsilon}$.

Next, we find the value of $p$ that yields both the best iteration and communication complexity.

Corollary 5.2 Both the iteration and the communication complexity of L2SGD+ are minimized for $p = p^\star := \frac{4\lambda + \mu}{4\tilde{L} + 4\lambda + (m+1)\mu}$, the value that equalizes the two terms inside the maximum in Thm. 5.1. The resulting iteration complexity is $\left( 4\frac{\lambda}{\mu} + 4\frac{\tilde{L}}{\mu} + m + 1 \right) \log\frac{1}{\varepsilon}$, while the communication complexity is $\frac{4\lambda + \mu}{4\tilde{L} + 4\lambda + (m+1)\mu} \left( 4\frac{\tilde{L}}{\mu} + m \right) \log\frac{1}{\varepsilon}$.

Note that with $\lambda \to \infty$, the communication complexity of L2SGD+ tends to $\left( 4\frac{\tilde{L}}{\mu} + m \right) \log\frac{1}{\varepsilon}$, which is the communication complexity of minibatch SAGA to find the globally optimal model (Hanzely & Richtárik, 2019). On the other hand, in the pure local setting ($\lambda = 0$), the communication complexity becomes $\log\frac{1}{\varepsilon}$; this is because the Lyapunov function involves a term that measures the distance of local models, which requires communication to be estimated.

Remark 5.3 L2SGD+ is the simplest local SGD method with variance reduction. In the Appendix, we present L2SGD++, which allows for 1) an arbitrary number of data points per client and arbitrary local subsampling, 2) partial participation of clients, and 3) local SVRG-like updates of control variates (thus better memory). Lastly, L2SGD++ exploits the complex smoothness structure of the local objectives, resulting in tighter rates.
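The choice of $p$ in this section can be explored numerically. The sketch below evaluates the iteration complexity of Thm. 5.1 on a grid of $p$ values and compares it with the value of $p$ that equalizes the two terms of the maximum. The function names and the constants (smoothness $1$, $\mu = 10^{-4}$, $\lambda = 0.1$, $m = 100$) are assumptions chosen for illustration, not settings confirmed by the experiments.

```python
import math


def l2sgd_plus_iterations(L_tilde: float, mu: float, lam: float,
                          m: int, p: float, eps: float) -> float:
    """Iteration complexity from Thm. 5.1 for a given p in (0, 1)."""
    local_term = (4.0 * L_tilde + mu * m) / ((1.0 - p) * mu)
    aggregation_term = (4.0 * lam + mu) / (p * mu)
    return max(local_term, aggregation_term) * math.log(1.0 / eps)


def l2sgd_plus_optimal_p(L_tilde: float, mu: float, lam: float, m: int) -> float:
    """Value of p at which the two terms inside the max coincide."""
    return (4.0 * lam + mu) / (4.0 * L_tilde + 4.0 * lam + (m + 1) * mu)


L_tilde, mu, lam, m, eps = 1.0, 1e-4, 0.1, 100, 1e-5
p_star = l2sgd_plus_optimal_p(L_tilde, mu, lam, m)
best_iters = l2sgd_plus_iterations(L_tilde, mu, lam, m, p_star, eps)

# Closed form obtained by substituting p_star into either term of the max.
closed_form_iters = (4.0 * lam / mu + 4.0 * L_tilde / mu + m + 1) * math.log(1.0 / eps)

# Brute-force check over a grid of candidate probabilities.
grid = [k / 1000.0 for k in range(1, 1000)]
grid_best = min(l2sgd_plus_iterations(L_tilde, mu, lam, m, q, eps) for q in grid)
```

Since the first term of the maximum increases in $p$ and the second decreases, no grid point can do better than the crossing point `p_star`.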

6. EXPERIMENTS

In this section, we numerically verify the theoretical claims from this paper. We only present a single experiment here; all remaining ones, along with the missing details about the setup, are in the Appendix. In particular, the Appendix includes two more experiments. The first one studies how $p$ (communication) influences the convergence of L2SGD+. The second experiment aims to examine the effect of the parameter $\lambda$ on the convergence rate of L2SGD+.

We consider a logistic regression problem with LibSVM data (Chang & Lin, 2011). The data were normalized so that $f_{i,j}$ is 1-smooth for each $j$, while the local objectives are $10^{-4}$-strongly convex. In order to cover a range of possible scenarios, we have chosen a different number of clients for each dataset (see the Appendix). Lastly, the stepsize was always chosen according to Thm. 5.1.

We compare three different methods: L2SGD+, L2GD with local subsampling (L2SGD in the Appendix), and L2GD with local subsampling and control variates constructed for $\psi$ only (L2SGD2 in the Appendix; similar to (Liang et al., 2019)). We expect L2SGD+ to converge to the global optimum linearly, while both L2SGD and L2SGD2 are expected to converge to a certain neighborhood. Each method is applied to two objectives constructed by a different split of the data among the devices. For the homogeneous split, we randomly reshuffle the data. For the heterogeneous split, we first sort the data based on the labels and then construct the local objectives according to the current order.

Fig. 3 demonstrates the importance of variance reduction: it ensures fast global convergence of L2SGD+, while the neighborhood is slightly smaller for L2SGD2 compared to L2SGD. As predicted, data heterogeneity does not affect the convergence speed of the proposed methods.

A POSSIBLE EXTENSIONS

Our analysis of L2GD can be extended to cover smooth convex and non-convex loss functions $f_i$ (we do not explore these directions). Further, our methods can be extended to a decentralized regime where the devices correspond to vertices of a connected network, and communication is allowed along the edges of the graph only. This can be achieved by introducing an additional randomization over the penalty $\psi$. Further, our approach can be accelerated in the sense of Nesterov (Nesterov, 2004) by adapting a variant of Katyusha (Allen-Zhu, 2017; Qian et al., 2019a) to our setting, thus further reducing the number of communication rounds.

B EXPERIMENTAL SETUP AND FURTHER EXPERIMENTS

In all experiments in this paper, we consider a simple binary classification model: logistic regression. In particular, suppose that device $i$ owns a data matrix $A_i \in \mathbb{R}^{m \times d}$ along with corresponding labels $b_i \in \{-1, 1\}^m$. The local objective for client $i$ is then given as follows:

$$f_i(x) := \frac{1}{m} \sum_{j=1}^{m} f_{i,j}(x) + \frac{\mu}{2} \|x\|^2, \qquad \text{where} \quad f_{i,j}(x) = \log\left(1 + \exp\left( (A_i)_{j,:}\, x \cdot (b_i)_j \right)\right).$$

The rows of the data matrix $A$ were normalized to have length 4 so that $f_{i,j}$ is 1-smooth for each $j$. At the same time, the local objective on each device is $10^{-4}$-strongly convex. The datasets are from LibSVM (Chang & Lin, 2011).

In each case, we consider the simplest locally stochastic algorithm. In particular, each dataset is evenly split among the clients, while the local stochastic method samples a single data point each iteration. We have chosen a different number of clients for each dataset so that we cover different possible scenarios. See Table 1 for details (it also includes sizes of the datasets). Lastly, the stepsize was always chosen according to Thm. 5.1.

In our first experiment, we verify two phenomena:

• The effect of variance reduction on the convergence speed of local methods. We compare 3 different methods: local SGD with full variance reduction (Algorithm 3), shifted local SGD (Algorithm 7) and local SGD (Algorithm 6). Our theory predicts that a fully variance reduced algorithm converges to the global optimum linearly, while both shifted local SGD and local SGD converge to a neighborhood of the optimum. At the same time, the neighborhood should be smaller for shifted local SGD.

• The claim that heterogeneity of the data does not influence the convergence rate. We consider two splits of the data: heterogeneous and homogeneous. For the homogeneous split, we first randomly reshuffle the data and then construct the local objectives according to the current order (i.e., the first client owns the first $m$ indices, etc.). For the heterogeneous split, we first sort the data based on the labels and then construct the local objectives accordingly (thus achieving the worst-case heterogeneity). Note that the overall objective to solve is different in the homogeneous and heterogeneous cases; we thus plot the relative suboptimality of the objective (i.e., $\frac{F(x^k) - F(x^\star)}{F(x^0) - F(x^\star)}$) to directly compare the convergence speed.

In each experiment, we choose $p = 0.1$ and $\lambda = \frac{1}{9}$; such a choice means that $p$ is very close to optimal. The other parameters (i.e., number of clients) are provided in Table 1. The stepsize for each non-variance reduced method was chosen to be the same as for the analogous variance reduced method.

As expected, Figure 4 clearly demonstrates the following:

• Full variance reduction always converges to the global optimum, while methods with partial variance reduction only converge to a neighborhood of the optimum.

• Partial variance reduction (i.e., shifting the local SGD) is better than not using control variates at all, although the improvement in performance is rather negligible.

• Data heterogeneity does not affect the convergence speed of the proposed methods. Therefore, unlike standard local SGD, mixing the local and global models does not suffer from problems with heterogeneity.
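The two data splits described above can be sketched as follows. This is a hypothetical helper written for illustration, not the authors' experimental code: the name `split_among_clients`, the seed, and the toy labels are assumptions, and any data points that do not fit into equal-sized shares are simply dropped.

```python
import random
from typing import List, Sequence


def split_among_clients(labels: Sequence[int], n_clients: int,
                        heterogeneous: bool, seed: int = 0) -> List[List[int]]:
    """Return, for each client, the list of data indices it owns.

    Homogeneous split: indices are randomly reshuffled.
    Heterogeneous split: indices are sorted by label (worst-case heterogeneity).
    In both cases client k owns the k-th block of m consecutive indices.
    """
    indices = list(range(len(labels)))
    if heterogeneous:
        indices.sort(key=lambda i: labels[i])  # stable sort by label
    else:
        random.Random(seed).shuffle(indices)
    m = len(indices) // n_clients  # equal share per client
    return [indices[k * m:(k + 1) * m] for k in range(n_clients)]


toy_labels = [1, -1, 1, -1, 1, -1, 1, -1]
hetero_parts = split_among_clients(toy_labels, n_clients=2, heterogeneous=True)
homo_parts = split_among_clients(toy_labels, n_clients=2, heterogeneous=False)
```

In the heterogeneous split of this toy example, the first client sees only the label $-1$ and the second only the label $1$, which is the extreme case the experiment is designed to probe.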

B.2 EFFECT OF p

In the second experiment, we study the effect of $p$ on the convergence rate of variance reduced local SGD. Note that $p$ immediately influences the number of communication rounds: on average, the clients take $(p^{-1} - 1)$ local steps in between two consecutive rounds of communication (aggregation). In Section 5, we argue that it is optimal (in terms of the convergence rate) to choose $p$ of order $p^\star := \frac{\lambda}{\tilde{L} + \lambda}$. Figure 5 compares $p = p^\star$ against other values of $p$ and confirms its optimality (in terms of optimizing the convergence rate). While the slower convergence of Algorithm 3 with $p < p^\star$ is expected (i.e., communicating more frequently yields faster convergence), slower convergence for $p > p^\star$ is rather surprising; in fact, it means that communicating less frequently yields faster convergence. This effect takes place due to the specific structure of problem (12); it would be lost when enforcing $x_1 = \dots = x_n$ (corresponding to $\lambda = \infty$).
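As a quick arithmetic check of the quantities in this section, the sketch below evaluates the approximate optimal probability and the average number of local steps between aggregations. The function names are hypothetical; the inputs (unit smoothness constant and $\lambda = \frac{1}{9}$) are the values reported for the first experiment.

```python
def approx_optimal_p(L_tilde: float, lam: float) -> float:
    """Order of the optimal aggregation probability: lam / (L_tilde + lam)."""
    return lam / (L_tilde + lam)


def expected_local_steps(p: float) -> float:
    """Average number of local steps between two aggregations: 1/p - 1."""
    return 1.0 / p - 1.0


# Values from the first experiment: 1-smooth f_{i,j} and lam = 1/9.
p_experiment = approx_optimal_p(L_tilde=1.0, lam=1.0 / 9.0)
steps_between_rounds = expected_local_steps(p_experiment)
```

This reproduces the setting $p = 0.1$ used in the first experiment, which corresponds to nine local steps between aggregations on average.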

B.3 EFFECT OF λ

In this experiment, we study how different values of $\lambda$ influence the convergence rate of Algorithm 3, given that everything else (i.e., $p$) is fixed. Note that each value of $\lambda$ yields a different instance of problem (12), and thus a different optimal solution. Therefore, in order to make a fair comparison between convergence speeds, we plot the relative suboptimality, i.e., $\frac{F(x^k) - F(x^\star)}{F(x^0) - F(x^\star)}$, against the number of data passes. Figure 6 presents the results. Given that $\mu$ is small, the complexity of Algorithm 3 is $O\left(\frac{L}{(1-p)\mu}\log\frac{1}{\varepsilon}\right)$ as soon as $\lambda < \lambda^\star := \frac{Lp}{1-p}$; otherwise the complexity is $O\left(\frac{\lambda}{p\mu}\log\frac{1}{\varepsilon}\right)$. This is perfectly consistent with what Figure 6 shows: the choice $\lambda < \lambda^\star$ resulted in a convergence speed comparable to that of $\lambda = \lambda^\star$, while the choice $\lambda > \lambda^\star$ yields a noticeably worse rate than $\lambda = \lambda^\star$. The choice $\lambda = \lambda^\star$ corresponds to the brown dash-dotted line with diamond markers (the third one in the legend). The aggregation probability $p$ was chosen in each case as Table 1 indicates.
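The two regimes can be made concrete with a short calculation. The following sketch evaluates the leading factor of the complexity bound above (with the constants hidden by $O(\cdot)$ dropped); the function names and the numerical values of $L$, $p$, and $\mu$ are hypothetical and serve only as an illustration.

```python
def lambda_star(L, p):
    """Threshold lambda* = L * p / (1 - p) separating the two regimes."""
    return L * p / (1.0 - p)


def complexity_factor(L, lam, p, mu):
    """Factor multiplying log(1/eps) in the small-mu complexity bound."""
    if lam < lambda_star(L, p):
        return L / ((1.0 - p) * mu)  # regime lam < lambda*: independent of lam
    return lam / (p * mu)            # regime lam >= lambda*: grows linearly in lam


# Hypothetical parameters, chosen only for illustration.
L, p, mu = 1.0, 0.1, 0.01
for lam in [0.01, 0.05, lambda_star(L, p), 0.5, 1.0]:
    print(f"lam = {lam:.4f}  factor = {complexity_factor(L, lam, p, mu):.1f}")
```

For every $\lambda$ below the threshold the factor is the same, which matches the observation that such choices converge at a speed comparable to $\lambda = \lambda^\star$, while larger $\lambda$ increases the factor.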

C REMAINING ALGORITHMS

C.1 UNDERSTANDING COMMUNICATION OF L2GD

Example C.1 In order to better understand when communication takes place in Algorithm 1, consider the following possible sequence of coin tosses: 0, 0, 1, 0, 1, 1, 1, 0. The first two coin tosses lead to two local GD steps (8) on all devices. The third coin toss lands 1, at which point all local models $x_i^k$ are communicated to the master, averaged to form $\bar{x}^k$, and the step (9) towards averaging is taken. The fourth coin toss is 0, and at this point the master communicates the updated local models back to the devices, which subsequently perform a single local GD step (8). Then come three consecutive coin tosses landing 1, which means that the local models are again communicated to the master, which performs three averaging steps (9). Finally, the eighth coin toss lands 0, which makes the master send the updated local models back to the devices, which subsequently perform a single local GD step. This example illustrates that communication needs to take place whenever two consecutive coin tosses land different values. If 0 is followed by 1, all devices communicate to the master, and if 1 is followed by 0, the master communicates back to the devices. It is standard to count each pair of communications, Device→Master and the subsequent Master→Device, as a single communication round.

Lemma C.2

The expected number of communication rounds in $k$ iterations of L2GD is $p(1-p)k$.
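The counting argument of Example C.1 and Lemma C.2 can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and it assumes that the devices start in the local phase (i.e., an implicit toss of 0 precedes the sequence), so that each 0 followed by 1 opens exactly one communication round.

```python
import random


def count_communication_rounds(tosses):
    """Count Device->Master communications, i.e., transitions from 0 to 1.

    Each such transition is eventually followed by a Master->Device
    communication, and the pair is counted as one round.
    """
    rounds = 0
    previous = 0  # assumption: training starts with local steps
    for xi in tosses:
        if previous == 0 and xi == 1:
            rounds += 1
        previous = xi
    return rounds


# The sequence from Example C.1 contains two communication rounds.
print(count_communication_rounds([0, 0, 1, 0, 1, 1, 1, 0]))

# Monte Carlo comparison with the expectation p(1 - p)k from Lemma C.2.
rng = random.Random(0)
p, k = 0.1, 200_000
tosses = [1 if rng.random() < p else 0 for _ in range(k)]
print(count_communication_rounds(tosses), "vs", p * (1 - p) * k)
```

The simulated count should be close to $p(1-p)k$ for large $k$; the small discrepancy caused by the assumed initial toss vanishes relative to $k$.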

C.2 L2GD AND FULL AVERAGING

Is there a setup in which the conditions of Thm. 4.2 are satisfied and the aggregation update (9) is identical to full averaging? This is equivalent to requiring $0 < p < 1$ such that $\alpha\lambda = np$. However, since $\mathcal{L} \ge \frac{\lambda}{np}$, we have $\alpha\lambda \le \frac{\lambda}{2\mathcal{L}} \le \frac{np}{2} < np$, which means that full averaging is not supported by our theory.

C.3 LOCAL GD WITH VARIANCE REDUCTION

In this section, we present variance-reduced local gradient descent with partial aggregation. In particular, the proposed algorithm (Algorithm 2) incorporates control variates into Algorithm 1. The proposed method can be seen as a special case of Algorithm 3 with $m = 1$; we thus present it only for pedagogical purposes, as it might shed additional light on our approach. The update rule of the proposed method is $x^{k+1} = x^k - \alpha g^k$, where

$$g^k = \begin{cases} p^{-1}\left(\lambda\nabla\psi(x^k) - n^{-1}\Psi^k\right) + n^{-1}J^k + n^{-1}\Psi^k & \text{with probability } p, \\ (1-p)^{-1}\left(\nabla f(x^k) - n^{-1}J^k\right) + n^{-1}J^k + n^{-1}\Psi^k & \text{with probability } 1-p, \end{cases}$$

for some control variate vectors $J^k, \Psi^k \in \mathbb{R}^{nd}$. A quick check gives $\mathbb{E}\left[g^k \mid x^k\right] = \nabla f(x^k) + \lambda\nabla\psi(x^k) = \nabla F(x^k)$; thus the direction we take is unbiased regardless of the values of the control variates $J^k, \Psi^k$. The goal is to make the control variates $J^k, \Psi^k$ correlated with $n\nabla f(x^k)$ and $n\lambda\nabla\psi(x^k)$, respectively (specifically, we aim to have $\mathrm{Corr}\left[J^k, n\nabla f(x^k)\right] \to 1$ and $\mathrm{Corr}\left[n^{-1}\Psi^k, \lambda\nabla\psi(x^k)\right] \to 1$ as $x^k \to x^\star$). One possible solution is for $J^k, \Psi^k$ to track the most recently observed values of $n\nabla f(\cdot)$ and $n\lambda\nabla\psi(\cdot)$, which corresponds to the update rule

$$\left(\Psi^{k+1}, J^{k+1}\right) = \begin{cases} \left(n\lambda\nabla\psi(x^k),\ J^k\right) & \text{with probability } p, \\ \left(\Psi^k,\ n\nabla f(x^k)\right) & \text{with probability } 1-p. \end{cases}$$

A specific, distributed implementation of the described method is presented as Algorithm 2. The only communication between the devices takes place when the average model $\bar{x}^k$ is computed (with probability $p$), which is analogous to standard local SGD. Therefore, we aim to set $p$ rather small. Note that Algorithm 2 is a special case of SAGA with importance sampling (Qian et al., 2019b); thus, we obtain the convergence rate of the method for free. We state it as Thm. C.3.

Algorithm 2 Variance reduced local gradient descent

Input: $x_1^0 = \dots = x_n^0 \in \mathbb{R}^d$, stepsize $\alpha$, probability $p$
$J_1^0 = \dots = J_n^0 = \Psi_1^0 = \dots = \Psi_n^0 = 0 \in \mathbb{R}^d$
for $k = 0, 1, \dots$ do
    $\xi = 1$ with probability $p$ and $0$ with probability $1 - p$
    if $\xi = 0$ then
        All Devices $i = 1, \dots, n$:
            Compute $\nabla f_i(x_i^k)$
            $x_i^{k+1} = x_i^k - \alpha\left(n^{-1}(1-p)^{-1}\nabla f_i(x_i^k) - n^{-1}\frac{p}{1-p}J_i^k + n^{-1}\Psi_i^k\right)$
            Set $J_i^{k+1} = \nabla f_i(x_i^k)$, $\Psi_i^{k+1} = \Psi_i^k$
    else
        Master computes the average $\bar{x}^k = \frac{1}{n}\sum_{i=1}^n x_i^k$
        Master does for all $i = 1, \dots, n$:
            Set $x_i^{k+1} = x_i^k - \alpha\left(\frac{\lambda}{np}(x_i^k - \bar{x}^k) - (p^{-1} - 1)n^{-1}\Psi_i^k + n^{-1}J_i^k\right)$
            Set $\Psi_i^{k+1} = \lambda(x_i^k - \bar{x}^k)$, $J_i^{k+1} = J_i^k$
    end if
end for

Theorem C.3 Let Assumption 3.1 hold. Set $\alpha = n\min\left\{\frac{1-p}{4L+\mu}, \frac{p}{4\lambda+\mu}\right\}$. Then the iteration complexity of Algorithm 2 is $\max\left\{\frac{4L+\mu}{\mu(1-p)}, \frac{4\lambda+\mu}{\mu p}\right\}\log\frac{1}{\varepsilon}$.

Proof: Clearly, $F(x) = f(x) + \lambda\psi(x) = \frac{1}{2}\Big(\underbrace{2f(x)}_{=:\tilde{f}(x)} + \underbrace{2\lambda\psi(x)}_{=:\tilde{\psi}(x)}\Big)$. Note that $\tilde{\psi}$ is $\frac{2\lambda}{n}$-smooth and $\tilde{f}$ is $\frac{2L}{n}$-smooth. At the same time, $F$ is $\frac{\mu}{n}$-strongly convex. Using the convergence theorem of SAGA with importance sampling from (Qian et al., 2019b; Gazagnadou et al., 2019), we get

$$\mathbb{E}\left[F(x^k) + \frac{\alpha}{2}\Upsilon(J^k, \Psi^k)\right] \le \left(1 - \frac{\alpha\mu}{n}\right)^k\left(F(x^0) + \frac{\alpha}{2}\Upsilon(J^0, \Psi^0)\right),$$

where $\Upsilon(J^k, \Psi^k) := \frac{4}{n^2}\sum_{i=1}^n\left(\left\|\Psi_i^k - \lambda(x_i(\lambda) - \bar{x}(\lambda))\right\|^2 + \left\|J_i^k - \nabla f_i(x_i(\lambda))\right\|^2\right)$ and $\alpha = n\min\left\{\frac{1-p}{4L+\mu}, \frac{p}{4\lambda+\mu}\right\}$, as desired.

Corollary C.4 The iteration complexity of Algorithm 2 is minimized for $p = \frac{4\lambda+\mu}{4\lambda+4L+2\mu}$, which yields complexity $4\left(\frac{\lambda}{\mu} + \frac{L}{\mu} + \frac{1}{2}\right)\log\frac{1}{\varepsilon}$. The communication complexity is minimized for any $p \le \frac{4\lambda+\mu}{4\lambda+4L+2\mu}$, in which case the total number of communication rounds to reach an $\varepsilon$-solution is $\left(\frac{4\lambda}{\mu} + 1\right)\log\frac{1}{\varepsilon}$.

As a direct consequence of Corollary C.4, we see that the optimal choice of $p$ that minimizes both the communication and the number of iterations needed to reach an $\varepsilon$-solution of problem (17) is $p = \frac{4\lambda+\mu}{4\lambda+4L+2\mu}$.

Remark C.5 While both Algorithm 2 and Algorithm 3 are special cases of SAGA, the practical version of variance-reduced local SGD (presented in Section C.5) is not. In particular, we wish to run the SVRG-like method locally in order to avoid storing the full gradient table (SAGA does not require storing a full gradient table for problems with linear models, since it can memorize the residuals; in full generality, however, SVRG-like methods are preferable). Therefore, the variance-reduced local SGD proposed in Section C.5 is neither a special case of SAGA nor a special case of SVRG (or a variant of SVRG). However, it is still a special case of a more general algorithm from (Hanzely & Richtárik, 2019).

As mentioned, Algorithm 3 is a generalization of Algorithm 2 to the case where the local subproblem is a finite sum. Note that Algorithm 2 constructs control variates for both the local subproblem and the aggregation function $\psi$, and constructs the corresponding unbiased gradient estimator. In contrast, Algorithm 3 constructs extra control variates within the local subproblem in order to reduce the variance of the gradient estimator coming from the local subsampling.
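To make the update rules of Algorithm 2 concrete, the following minimal Python sketch runs them on a toy problem. Everything specific to the example is an assumption made for illustration and is not taken from the paper: the models are scalars ($d = 1$), the local losses are $f_i(x) = \frac{1}{2}(x - b_i)^2$ (so that $L = \mu = 1$), and the function names are hypothetical. For this toy problem the minimizer of $F$ has the closed form $x_i(\lambda) = \frac{b_i + \lambda\bar{b}}{1+\lambda}$, which is used as a reference. The stepsize follows Theorem C.3.

```python
import random


def variance_reduced_local_gd(b, lam, p, iters, seed=0):
    """Sketch of Algorithm 2 for scalar models and f_i(x) = 0.5 * (x - b_i)**2."""
    rng = random.Random(seed)
    n = len(b)
    L = mu = 1.0  # smoothness and strong convexity of the toy losses
    alpha = n * min((1 - p) / (4 * L + mu), p / (4 * lam + mu))  # Theorem C.3
    x = [0.0] * n
    J = [0.0] * n    # control variates tracking the local gradients
    Psi = [0.0] * n  # control variates tracking lam * (x_i - xbar)
    for _ in range(iters):
        if rng.random() < p:  # xi = 1: step towards the average (master)
            xbar = sum(x) / n
            new_x = [
                x[i] - alpha * (lam / (n * p) * (x[i] - xbar)
                                - (1 / p - 1) / n * Psi[i] + J[i] / n)
                for i in range(n)
            ]
            Psi = [lam * (x[i] - xbar) for i in range(n)]
            x = new_x
        else:  # xi = 0: local gradient step on every device
            grads = [x[i] - b[i] for i in range(n)]
            x = [
                x[i] - alpha * (grads[i] / (n * (1 - p))
                                - p / (n * (1 - p)) * J[i] + Psi[i] / n)
                for i in range(n)
            ]
            J = grads
    return x


def closed_form_solution(b, lam):
    """Minimizer of F for the toy quadratic losses (an assumption of this sketch)."""
    b_bar = sum(b) / len(b)
    return [(bi + lam * b_bar) / (1 + lam) for bi in b]


b = [1.0, -1.0, 4.0]
print(variance_reduced_local_gd(b, lam=1.0, p=0.5, iters=2000))
print(closed_form_solution(b, lam=1.0))
```

In this toy setting the iterates approach the closed-form solution, in line with the linear rate of Theorem C.3. The sketch keeps all devices in one process, so it illustrates the arithmetic of the updates rather than the communication pattern.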

C.4 L2SGD+: ALGORITHM AND THE EFFICIENT IMPLEMENTATION

Denote by $\mathbf{1} \in \mathbb{R}^m$ the vector of ones. We are now ready to state L2SGD+ as Algorithm 3.

Algorithm 3 L2SGD+: Loopless Local SGD with Variance Reduction
Input: $x_1^0 = \dots = x_n^0 \in \mathbb{R}^d$, stepsize $\alpha$, probability $p$
$J_i^0 = 0 \in \mathbb{R}^{d\times m}$, $\Psi_i^0 = 0 \in \mathbb{R}^d$ (for $i = 1, \dots, n$)
for $k = 0, 1, \dots$ do
    $\xi = 1$ with probability $p$ and $0$ with probability $1 - p$
    if $\xi = 0$ then
        All Devices $i = 1, \dots, n$:
            Sample $j \in \{1, \dots, m\}$ (uniformly at random)
            $g_i^k = \frac{1}{n(1-p)}\left(\nabla f_{i,j}(x_i^k) - (J_i^k)_{:,j}\right) + \frac{J_i^k\mathbf{1}}{nm} + \frac{\Psi_i^k}{n}$
            $x_i^{k+1} = x_i^k - \alpha g_i^k$
            Set $(J_i^{k+1})_{:,j} = \nabla f_{i,j}(x_i^k)$, $\Psi_i^{k+1} = \Psi_i^k$, $(J_i^{k+1})_{:,l} = (J_i^k)_{:,l}$ for all $l \ne j$
    else
        Master computes the average $\bar{x}^k = \frac{1}{n}\sum_{i=1}^n x_i^k$
        Master does for all $i = 1, \dots, n$:
            $g_i^k = \frac{\lambda}{np}(x_i^k - \bar{x}^k) - \frac{p^{-1}-1}{n}\Psi_i^k + \frac{1}{nm}J_i^k\mathbf{1}$
            Set $x_i^{k+1} = x_i^k - \alpha g_i^k$
            Set $\Psi_i^{k+1} = \lambda(x_i^k - \bar{x}^k)$, $J_i^{k+1} = J_i^k$
    end if
end for

L2SGD+ only communicates when two consecutive coin tosses land different values, thus, on average, $p(1-p)k$ times per $k$ iterations. However, L2SGD+ requires communication of the control variates $J_i\mathbf{1}, \Psi_i$ as well; each communication round is thus three times more expensive. Here we present an efficient implementation of L2SGD+ as Algorithm 4, which does not require the communication of $J_i\mathbf{1}, \Psi_i$. As a consequence, Algorithm 4 needs to communicate on average $p(1-p)k$ times per $k$ iterations, while each communication consists of sending only the local models to the master and back.

C.5 LOCAL SGD WITH VARIANCE REDUCTION -GENERAL METHOD

Algorithm 4 L2SGD+: Loopless Local SGD with Variance Reduction (communication-efficient implementation)
Input: $x_1^0 = \dots = x_n^0 = \tilde{x} \in \mathbb{R}^d$, stepsize $\alpha$, probability $p$
Initialize control variates $J_i^0 = 0 \in \mathbb{R}^{d\times m}$, $\Psi_i^0 = 0 \in \mathbb{R}^d$ (for $i = 1, \dots, n$), initial coin toss $\xi^{-1} = 0$
for $k = 0, 1, \dots$ do
    $\xi^k = 1$ with probability $p$ and $0$ with probability $1 - p$
    if $\xi^k = 0$ then
        All Devices $i = 1, \dots, n$:
            if $\xi^{k-1} = 1$ then
                Receive $x_i^k$, $c$ from Master
                Reconstruct $\bar{x}^k = \bar{x}^{k-c}$ using $x_i^k$, $x_i^{k-c}$, $c$
                Set $x_i^k = x_i^k - c\alpha\frac{1}{nm}J_i^k\mathbf{1}$, $J_i^k = J_i^{k-c}$, $\Psi_i^k = \lambda(x_i^{k-c} - \bar{x}^k)$
            end if
            Sample $j \in \{1, \dots, m\}$ (uniformly at random)
            $g_i^k = \frac{1}{n(1-p)}\left(\nabla f_{i,j}(x_i^k) - (J_i^k)_{:,j}\right) + \frac{J_i^k\mathbf{1}}{nm} + \frac{\Psi_i^k}{n}$
            $x_i^{k+1} = x_i^k - \alpha g_i^k$
            Set $(J_i^{k+1})_{:,j} = \nabla f_{i,j}(x_i^k)$, $\Psi_i^{k+1} = \Psi_i^k$, $(J_i^{k+1})_{:,l} = (J_i^k)_{:,l}$ for all $l \ne j$
    else
        Master does for all $i = 1, \dots, n$:
            if $\xi^{k-1} = 0$ then
                Set $c = 0$
                Receive $x_i^k$ from Device and set $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i^k$, $\tilde{x}_i^k = x_i^k$
            end if
            Set $x_i^{k+1} = x_i^k - \alpha\left(\frac{\lambda}{np}(x_i^k - \bar{x}) - \frac{p^{-1}-1}{n}\lambda(\tilde{x} - \bar{x})\right)$
            Set $\tilde{x} = x_i^k$
            Set $c = c + 1$
    end if
end for

In this section, we present a fully general variance-reduced local SGD. We consider a more general instance of (2) where each local objective includes a possibly nonsmooth regularizer that admits a cheap evaluation of the proximal operator. In particular, the objective becomes

$$\min_{x\in\mathbb{R}^{dn}}\ \underbrace{\underbrace{\frac{1}{N}\sum_{i=1}^n\underbrace{\left(\sum_{j=1}^{m_i} f_{i,j}(x_i)\right)}_{=\frac{N}{n}f_i(x)}}_{=f(x)} + \lambda\underbrace{\frac{1}{2n}\sum_{i=1}^n\left\|x_i - \bar{x}\right\|^2}_{=\psi(x)}}_{=F(x)} + \underbrace{\sum_{i=1}^n R_i(x_i)}_{:=R(x)}, \qquad (12)$$

where $m_i$ is the number of data points owned by client $i$ and $N = \sum_{i=1}^n m_i$. In order to squeeze a faster convergence rate from minibatch samplings, we will assume that $f_{i,j}$ is smooth with respect to a matrix $M_{i,j}$ (instead of a scalar $L_{i,j} = \lambda_{\max}(M_{i,j})$).

Assumption C.1 Suppose that $f_{i,j}$ is $M_{i,j}$-smooth ($M_{i,j} \in \mathbb{R}^{d\times d}$, $M_{i,j} \succeq 0$) and $\mu$-strongly convex for $1 \le j \le m_i$, $1 \le i \le n$, i.e.,

$$f_{i,j}(y) + \langle\nabla f_{i,j}(y), x - y\rangle \le f_{i,j}(x) \le f_{i,j}(y) + \langle\nabla f_{i,j}(y), x - y\rangle + \frac{1}{2}\|y - x\|^2_{M_{i,j}}, \quad \forall x, y \in \mathbb{R}^d. \qquad (13)$$

Furthermore, assume that $R_i$ is convex for $1 \le i \le n$.

Our method (Algorithm 5) allows for an arbitrary aggregation probability (same as Algorithms 2 and 3), arbitrary sampling of clients (to model inactive clients), and arbitrary structure and sampling of the local objectives (i.e., arbitrary size of local datasets, arbitrary smoothness structure of each local objective, and arbitrary subsampling strategy of each client). Moreover, it allows for an SVRG-like update rule of the local control variates $J^k$, which requires less storage given an efficient implementation.

To be specific, each device owns a distribution $\mathcal{D}_i$ over subsets of $\{1, \dots, m_i\}$. When the aggregation is not performed (with probability $1 - p$), a subset of active devices $S$ is selected ($S$ follows an arbitrary fixed distribution $\mathcal{D}$). Each of the active clients ($i \in S$) samples a subset of local indices $S_i \sim \mathcal{D}_i$ and observes the corresponding part of the local Jacobian $G_i(x^k)_{(:,S_i)}$ (where $G_i(x^k) := [\nabla f_{i,1}(x^k), \nabla f_{i,2}(x^k), \dots, \nabla f_{i,m_i}(x^k)]$). When the aggregation is performed (with probability $p$), we evaluate $\bar{x}^k$ and distribute it to each device, which uses it to compute the corresponding component of $\lambda\nabla\psi(x^k)$. Those are the key components in constructing the unbiased gradient estimator (without control variates).

It remains to construct the control variates and the unbiased gradient estimator. If the aggregation is done, we simply replace the last column of the gradient table. If the aggregation is not done, we have two options: either keep replacing the columns of the Jacobian table (in which case we obtain a particular case of SAGA (Defazio et al., 2014)), or do an LSVRG-like replacement (Hofmann et al., 2015; Kovalev et al., 2020) (in which case the algorithm is a particular case of GJS (Hanzely & Richtárik, 2019), but is a special case of neither SAGA nor LSVRG).
Note that the LSVRG-like replacement is preferable in practice due to its better memory efficiency (one does not need to store the whole gradient table) for models other than linear ones. In order to keep the gradient estimate unbiased, it will be convenient to define a vector $p_i \in \mathbb{R}^{m_i}$ such that for each $j \in \{1, \dots, m_i\}$ we have $\mathbb{P}(j \in S_i) = p_{i,j}$. Next, to give a tight rate for any given pair of smoothness structure and sampling strategy, we use a standard tool first proposed for the analysis of randomized coordinate descent methods (Richtárik & Takáč, 2016; Qu & Richtárik, 2016), called the Expected Separable Overapproximation (ESO) assumption. ESO provides us with smoothness parameters of the objective which "account" for the given sampling strategy.

Assumption C.2 Suppose that there is $v_i \in \mathbb{R}^{m_i}$ such that for each client we have

$$\mathbb{E}\left[\Big\|\sum_{j\in S_i} M_{i,j}^{\frac{1}{2}}h_{i,j}\Big\|^2\right] \le \sum_{j=1}^{m_i} p_{i,j}v_{i,j}\left\|h_{i,j}\right\|^2, \quad \forall\, 1 \le i \le n,\ \forall h_{i,j} \in \mathbb{R}^{d},\ j \in \{1, \dots, m_i\}.$$

Lastly, denote by $p_i$ the probability that worker $i$ is active and by $\mathbf{1}^{(m_i)} \in \mathbb{R}^{m_i}$ the vector of ones. The resulting algorithm is stated as Algorithm 5. Next, Theorems C.6 and C.7 present the convergence rate of Algorithm 5 (SAGA and SVRG variant, respectively).

Algorithm 5
Input: $x_1^0, \dots, x_n^0 \in \mathbb{R}^d$, number of parallel units $n$, each of them owning $m_i$ data points (for $1 \le i \le n$), distributions $\mathcal{D}_i$ over subsets of $\{1, \dots, m_i\}$, distribution $\mathcal{D}$ over subsets of $\{1, 2, \dots, n\}$, aggregation probability $p$, stepsize $\alpha$
$J_i^0 = 0 \in \mathbb{R}^{d\times m_i}$, $\Psi_i^0 = 0 \in \mathbb{R}^d$ (for $i = 1, \dots, n$)
for $k = 0, 1, \dots$ do
    $\xi = 1$ with probability $p$ and $0$ with probability $1 - p$
    if $\xi = 0$ then
        Sample $S \sim \mathcal{D}$
        All Devices $i \in S$:
            Sample $S_i \sim \mathcal{D}_i$; $S_i \subseteq \{1, \dots, m_i\}$ (independently on each machine)
            Observe $\nabla f_{i,j}(x_i^k)$ for all $j \in S_i$
            $g_i^k = \frac{1}{N(1-p)p_i}\sum_{j\in S_i} p_{i,j}^{-1}\left(\nabla f_{i,j}(x_i^k) - (J_i^k)_{:,j}\right) + \frac{1}{N}J_i^k\mathbf{1}^{(m_i)} + n^{-1}\Psi_i^k$
            $x_i^{k+1} = \mathrm{prox}_{\alpha R_i}(x_i^k - \alpha g_i^k)$
            For all $j \in \{1, \dots, m_i\}$ set $(J_i^{k+1})_{:,j}$ as follows:
                if SAGA: $\nabla f_{i,j}(x_i^k)$ if $j \in S_i$; $(J_i^k)_{:,j}$ otherwise
                if L-SVRG: $\nabla f_{i,j}(x_i^k)$ with probability $\rho_i$; $(J_i^k)_{:,j}$ otherwise
            Set $\Psi_i^{k+1} = \Psi_i^k$
        All Devices $i \notin S$:
            $g_i^k = \frac{1}{N}J_i^k\mathbf{1}^{(m_i)} + n^{-1}\Psi_i^k$
            $x_i^{k+1} = \mathrm{prox}_{\alpha R_i}(x_i^k - \alpha g_i^k)$
            Set $J_i^{k+1} = J_i^k$, $\Psi_i^{k+1} = \Psi_i^k$
    else
        Master computes the average $\bar{x}^k = \frac{1}{n}\sum_{i=1}^n x_i^k$
        Master does for all $i = 1, \dots, n$:
            $g_i^k = \frac{\lambda}{np}(x_i^k - \bar{x}^k) - (p^{-1} - 1)n^{-1}\Psi_i^k + \frac{1}{N}J_i^k\mathbf{1}^{(m_i)}$
            Set $x_i^{k+1} = \mathrm{prox}_{\alpha R_i}\left(x_i^k - \alpha g_i^k\right)$
            Set $\Psi_i^{k+1} = \lambda(x_i^k - \bar{x}^k)$, $J_i^{k+1} = J_i^k$
    end if
end for

Remark C.8 Algorithm 2 is a special case of Algorithm 3, which is in turn a special case of Algorithm 5. Similarly, Theorem C.3 is a special case of Theorem 5.1, which is again a special case of Theorem C.6.

C.6 LOCAL STOCHASTIC ALGORITHMS

In this section, we present two more algorithms: Local SGD with partial variance reduction (Algorithm 7) and Local SGD without variance reduction (Algorithm 6). While Algorithm 6 uses no control variates at all (and is thus essentially Algorithm 1 with the local gradient descent steps replaced by local SGD steps), Algorithm 7 constructs control variates for $\psi$ only, resulting in a locally drifted SGD algorithm (with a constant drift between consecutive rounds of communication). While we do not present the convergence rates of these methods here, we note that they can be easily obtained using the framework from (Gorbunov et al., 2020).

Algorithm 6 Loopless Local SGD (L2SGD)
Input: $x_1^0 = \dots = x_n^0 \in \mathbb{R}^d$, stepsize $\alpha$, probability $p$
for $k = 0, 1, \dots$ do
    $\xi = 1$ with probability $p$ and $0$ with probability $1 - p$
    if $\xi = 0$ then
        All Devices $i = 1, \dots, n$:
            Sample $j \in \{1, \dots, m\}$ (uniformly at random)
            $g_i^k = \frac{1}{n(1-p)}\nabla f_{i,j}(x_i^k)$
            $x_i^{k+1} = x_i^k - \alpha g_i^k$
    else
        Master computes the average $\bar{x}^k = \frac{1}{n}\sum_{i=1}^n x_i^k$
        Master does for all $i = 1, \dots, n$:
            $g_i^k = \frac{\lambda}{np}(x_i^k - \bar{x}^k)$
            Set $x_i^{k+1} = x_i^k - \alpha g_i^k$
    end if
end for

Algorithm 7 Loopless Local SGD with partial variance reduction (L2SGD2)
Input: $x_1^0 = \dots = x_n^0 \in \mathbb{R}^d$, stepsize $\alpha$, probability $p$
$\Psi_i^0 = 0 \in \mathbb{R}^d$ (for $i = 1, \dots, n$)
for $k = 0, 1, \dots$ do
    $\xi = 1$ with probability $p$ and $0$ with probability $1 - p$
    if $\xi = 0$ then
        All Devices $i = 1, \dots, n$:
            Sample $j \in \{1, \dots, m\}$ (uniformly at random)
            $g_i^k = \frac{1}{n(1-p)}\nabla f_{i,j}(x_i^k) + \frac{1}{n}\Psi_i^k$
            $x_i^{k+1} = x_i^k - \alpha g_i^k$
            Set $\Psi_i^{k+1} = \Psi_i^k$
    else
        Master computes the average $\bar{x}^k = \frac{1}{n}\sum_{i=1}^n x_i^k$
        Master does for all $i = 1, \dots, n$:
            $g_i^k = \frac{\lambda}{np}(x_i^k - \bar{x}^k) - \frac{p^{-1}-1}{n}\Psi_i^k$
            Set $x_i^{k+1} = x_i^k - \alpha g_i^k$
            Set $\Psi_i^{k+1} = \lambda(x_i^k - \bar{x}^k)$
    end if
end for
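For comparison with the variance-reduced methods, the sketch below implements the updates of Algorithm 6 (L2SGD) in plain Python. The setting is an assumption made purely for illustration: scalar models, local losses $f_{i,j}(x) = \frac{1}{2}(x - a_{i,j})^2$ with $f_i$ taken as the average of the $f_{i,j}$, a small hand-picked stepsize, and hypothetical function names. Since no control variates are used, the iterates are only expected to reach a neighborhood of the solution.

```python
import random


def l2sgd(a, lam, p, alpha, iters, seed=0):
    """Sketch of Algorithm 6 for scalar models; a[i][j] is data point j of device i."""
    rng = random.Random(seed)
    n, m = len(a), len(a[0])
    x = [0.0] * n
    for _ in range(iters):
        if rng.random() < p:  # xi = 1: step towards the average (master)
            xbar = sum(x) / n
            x = [xi - alpha * lam / (n * p) * (xi - xbar) for xi in x]
        else:  # xi = 0: one local SGD step on every device
            new_x = []
            for i in range(n):
                j = rng.randrange(m)    # sample a local data point
                grad = x[i] - a[i][j]   # gradient of the toy loss f_{i,j}
                new_x.append(x[i] - alpha * grad / (n * (1 - p)))
            x = new_x
    return x


def mixture_solution(a, lam):
    """Closed-form minimizer for the toy quadratic losses (assumption of this sketch)."""
    b = [sum(row) / len(row) for row in a]
    b_bar = sum(b) / len(b)
    return [(bi + lam * b_bar) / (1 + lam) for bi in b]


a = [[0.5, 1.5], [-1.5, -0.5]]  # hypothetical data for two devices
x = l2sgd(a, lam=1.0, p=0.5, alpha=0.002, iters=20000)
print(x, "vs", mixture_solution(a, lam=1.0))
```

The final iterates fluctuate around the closed-form solution instead of converging to it exactly; shrinking the stepsize shrinks the neighborhood at the cost of slower progress.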

D MISSING LEMMAS AND PROOFS

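D.1 GRADIENT AND HESSIAN OF ψ

Lemma D.1 Let $I$ be the $d \times d$ identity matrix and $I_n$ be the $n \times n$ identity matrix. Then we have

$$\nabla^2\psi(x) = \frac{1}{n}\left(I_n - \frac{1}{n}ee^\top\right)\otimes I \quad\text{and}\quad \nabla\psi(x) = \frac{1}{n}\left(x - \begin{pmatrix}\bar{x}\\ \vdots\\ \bar{x}\end{pmatrix}\right).$$

Furthermore, $L_\psi = \frac{1}{n}$.

Proof: Let $O$ be the $d \times d$ zero matrix, let $Q_i := [\underbrace{O, \dots, O}_{i-1}, I, \underbrace{O, \dots, O}_{n-i}] \in \mathbb{R}^{d\times dn}$, and let $Q := [I, \dots, I] \in \mathbb{R}^{d\times dn}$. Note that $x_i = Q_i x$ and $\bar{x} = \frac{1}{n}Qx$. So,

$$\psi(x) = \frac{1}{2n}\sum_{i=1}^n\left\|Q_i x - \frac{1}{n}Qx\right\|^2 = \frac{1}{2n}\sum_{i=1}^n\left\|\left(Q_i - \frac{1}{n}Q\right)x\right\|^2.$$

The Hessian of $\psi$ is

$$\nabla^2\psi(x) = \frac{1}{n}\sum_{i=1}^n\left(Q_i - \frac{1}{n}Q\right)^\top\left(Q_i - \frac{1}{n}Q\right) = \frac{1}{n}\sum_{i=1}^n\left(Q_i^\top Q_i - \frac{1}{n}Q_i^\top Q - \frac{1}{n}Q^\top Q_i + \frac{1}{n^2}Q^\top Q\right)$$
$$= \frac{1}{n}\sum_{i=1}^n Q_i^\top Q_i - \frac{1}{n}\sum_{i=1}^n\frac{1}{n}Q_i^\top Q - \frac{1}{n}\sum_{i=1}^n\frac{1}{n}Q^\top Q_i + \frac{1}{n}\sum_{i=1}^n\frac{1}{n^2}Q^\top Q = \frac{1}{n}\sum_{i=1}^n Q_i^\top Q_i - \frac{1}{n^2}Q^\top Q,$$

and by plugging in for $Q$ and $Q_i$, we get

$$\nabla^2\psi(x) = \frac{1}{n}\begin{pmatrix} \left(1-\frac{1}{n}\right)I & -\frac{1}{n}I & \cdots & -\frac{1}{n}I \\ -\frac{1}{n}I & \left(1-\frac{1}{n}\right)I & \cdots & -\frac{1}{n}I \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n}I & -\frac{1}{n}I & \cdots & \left(1-\frac{1}{n}\right)I \end{pmatrix} = \frac{1}{n}\begin{pmatrix} 1-\frac{1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\ -\frac{1}{n} & 1-\frac{1}{n} & \cdots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \cdots & 1-\frac{1}{n} \end{pmatrix}\otimes I = \frac{1}{n}\left(I_n - \frac{1}{n}ee^\top\right)\otimes I.$$

Notice that $I_n - \frac{1}{n}ee^\top$ is a circulant matrix, with eigenvalues 1 (multiplicity $n-1$) and 0 (multiplicity 1). Since the eigenvalues of a Kronecker product of two matrices are the products of pairs of eigenvalues of these matrices, we have

$$\lambda_{\max}\left(\nabla^2\psi(x)\right) = \lambda_{\max}\left(\frac{1}{n}\left(I_n - \frac{1}{n}ee^\top\right)\otimes I\right) = \frac{1}{n}\lambda_{\max}\left(I_n - \frac{1}{n}ee^\top\right) = \frac{1}{n}.$$

So, $L_\psi = \frac{1}{n}$. The gradient of $\psi$ is given by

$$\nabla\psi(x) = \frac{1}{n}\sum_{i=1}^n\left(Q_i - \frac{1}{n}Q\right)^\top\left(Q_i - \frac{1}{n}Q\right)x = \frac{1}{n}\sum_{i=1}^n\left(Q_i^\top Q_i - \frac{1}{n}Q_i^\top Q - \frac{1}{n}Q^\top Q_i + \frac{1}{n^2}Q^\top Q\right)x.$$

For each $i$, the four terms are, respectively, the vector with $x_i$ in block $i$ and zeros elsewhere, the vector with $\bar{x}$ in block $i$ and zeros elsewhere, the vector with $x_i/n$ in every block, and the vector with $\bar{x}/n$ in every block. Summing over $i$ gives

$$\nabla\psi(x) = \frac{1}{n}\left(x - \begin{pmatrix}\bar{x}\\ \vdots\\ \bar{x}\end{pmatrix} - \begin{pmatrix}\bar{x}\\ \vdots\\ \bar{x}\end{pmatrix} + \begin{pmatrix}\bar{x}\\ \vdots\\ \bar{x}\end{pmatrix}\right) = \frac{1}{n}\left(x - \begin{pmatrix}\bar{x}\\ \vdots\\ \bar{x}\end{pmatrix}\right).$$

D.2 PROOF OF THEOREM 3.1

For any $\lambda, \theta \ge 0$ we have

$$f(x(\lambda)) + \lambda\psi(x(\lambda)) \le f(x(\theta)) + \lambda\psi(x(\theta)), \qquad (15)$$
$$f(x(\theta)) + \theta\psi(x(\theta)) \le f(x(\lambda)) + \theta\psi(x(\lambda)). \qquad (16)$$

By adding inequalities (15) and (16), we get $(\theta - \lambda)\left(\psi(x(\lambda)) - \psi(x(\theta))\right) \ge 0$, which means that $\psi(x(\lambda))$ is decreasing in $\lambda$. Assume $\lambda \ge \theta$. From (16) we get

$$f(x(\lambda)) \ge f(x(\theta)) + \theta\left(\psi(x(\theta)) - \psi(x(\lambda))\right) \ge f(x(\theta)),$$

where the last inequality follows since $\theta \ge 0$ and since $\psi(x(\theta)) \ge \psi(x(\lambda))$. So, $f(x(\lambda))$ is increasing.

D.6 PROOF OF LEMMA D.2

We first have

$$\mathbb{E}\left\|G(x) - G(x(\lambda))\right\|^2 = (1-p)\left\|\frac{\nabla f(x)}{1-p} - \frac{\nabla f(x(\lambda))}{1-p}\right\|^2 + p\left\|\lambda\frac{\nabla\psi(x)}{p} - \lambda\frac{\nabla\psi(x(\lambda))}{p}\right\|^2$$
$$= \frac{1}{1-p}\left\|\nabla f(x) - \nabla f(x(\lambda))\right\|^2 + \frac{\lambda^2}{p}\left\|\nabla\psi(x) - \nabla\psi(x(\lambda))\right\|^2$$
$$\le \frac{2L_f}{1-p}D_f(x, x(\lambda)) + \frac{2\lambda^2 L_\psi}{p}D_\psi(x, x(\lambda)) = \frac{2L}{n(1-p)}D_f(x, x(\lambda)) + \frac{2\lambda^2}{np}D_\psi(x, x(\lambda)).$$

Since $D_f + \lambda D_\psi = D_F$ and $\nabla F(x(\lambda)) = 0$, we can continue:

$$\mathbb{E}\left\|G(x) - G(x(\lambda))\right\|^2 \le \frac{2}{n}\max\left\{\frac{L}{1-p}, \frac{\lambda}{p}\right\}D_F(x, x(\lambda)) = \frac{2}{n}\max\left\{\frac{L}{1-p}, \frac{\lambda}{p}\right\}\left(F(x) - F(x(\lambda))\right).$$

Next, note that

$$\sigma^2 = \frac{1}{n^2}\sum_{i=1}^n\left(\frac{1}{1-p}\left\|\nabla f_i(x_i(\lambda))\right\|^2 + \frac{\lambda^2}{p}\left\|x_i(\lambda) - \bar{x}(\lambda)\right\|^2\right) = \frac{1}{1-p}\left\|\nabla f(x(\lambda))\right\|^2 + \frac{\lambda^2}{p}\left\|\nabla\psi(x(\lambda))\right\|^2$$
$$= (1-p)\left\|\frac{\nabla f(x(\lambda))}{1-p}\right\|^2 + p\left\|\frac{\lambda\nabla\psi(x(\lambda))}{p}\right\|^2 = \mathbb{E}\left\|G(x(\lambda))\right\|^2.$$

Therefore, we have

$$\mathbb{E}\left\|G(x)\right\|^2 \le 2\,\mathbb{E}\left\|G(x) - G(x(\lambda))\right\|^2 + 2\,\mathbb{E}\left\|G(x(\lambda))\right\|^2 \overset{\text{Lemma D.2}+(17)}{\le} 4\mathcal{L}\left(F(x) - F(x(\lambda))\right) + 2\sigma^2,$$

as desired.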

D.7 PROOF OF COROLLARY 4.3

Firstly, to minimize the total number of iterations, it suffices to minimize $\mathcal{L}$, which is achieved with $p = \frac{\lambda}{L+\lambda}$. Let us look at the communication. Fix $\varepsilon > 0$, choose $\alpha = \frac{1}{2\mathcal{L}}$ and let $k = \frac{2n\mathcal{L}}{\mu}\log\frac{1}{\varepsilon}$, so that $\left(1 - \frac{\mu}{2n\mathcal{L}}\right)^k \le \varepsilon$. By Lemma C.2, the expected number of communications to achieve this goal is equal to

$$\mathrm{Comm}_p := p(1-p)k = \frac{2p(1-p)n\mathcal{L}}{\mu}\log\frac{1}{\varepsilon} = \frac{2\max\{pL, (1-p)\lambda\}}{\mu}\log\frac{1}{\varepsilon}.$$

Note first that Algorithm 3 is a special case of Algorithm 5, and Theorem 5.1 immediately follows from Theorem C.6. Therefore, it suffices to show Theorems C.6 and C.7. In order to do so, we will cast Algorithm 5 as a special case of GJS from (Hanzely & Richtárik, 2019). As a consequence, Theorem C.6 will be a special case of Theorem 5.2 from (Hanzely & Richtárik, 2019).

D.9.1 GJS

In this section, we quickly summarize the results from (Hanzely & Richtárik, 2019), which we use to show the convergence rate of Algorithm 3. GJS (Hanzely & Richtárik, 2019) is a method to solve the regularized empirical risk minimization objective, i.e.,

$$\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{j=1}^n f_j(x) + R(x). \qquad (17)$$

Defining $G(x) := [\nabla f_1(x), \dots, \nabla f_n(x)]$, we observe $\mathcal{S}G(x)$ and $\mathcal{U}G(x)$ in every iteration, where $\mathcal{S}$ is a random linear projection operator and $\mathcal{U}$ is a random linear operator which is the identity in expectation. Based on this random gradient information, GJS (Algorithm 8) constructs a variance-reduced gradient estimator $g$ and takes a proximal step in that direction.

Algorithm 8 Generalized JacSketch (GJS) (Hanzely & Richtárik, 2019)
1: Parameters: Stepsize $\alpha > 0$, random projector $\mathcal{S}$ and unbiased sketch $\mathcal{U}$
2: Initialization: Choose solution estimate $x^0 \in \mathbb{R}^d$ and Jacobian estimate $J^0 \in \mathbb{R}^{d\times n}$
3: for $k = 0, 1, \dots$ do
4:     Sample realizations of $\mathcal{S}$ and $\mathcal{U}$, and perform sketches $\mathcal{S}G(x^k)$ and $\mathcal{U}G(x^k)$
5:     $J^{k+1} = J^k - \mathcal{S}(J^k - G(x^k))$    (update the Jacobian estimate)
6:     $g^k = \frac{1}{n}J^k e + \frac{1}{n}\mathcal{U}\left(G(x^k) - J^k\right)e$    (construct the gradient estimator)
7:     $x^{k+1} = \mathrm{prox}_{\alpha R}(x^k - \alpha g^k)$    (perform the proximal SGD step)
8: end for

Next we quickly summarize the theory of GJS.

Assumption D.1 Problem (17) has a unique minimizer $x^\star$, and $f$ is $\mu$-quasi strongly convex, i.e.,

$$f(x^\star) \ge f(y) + \langle\nabla f(y), x^\star - y\rangle + \frac{\mu}{2}\left\|y - x^\star\right\|^2, \quad \forall y \in \mathbb{R}^d.$$

Functions $f_j$ are convex and $M_j$-smooth for some $M_j \succeq 0$, i.e.,

$$f_j(y) + \langle\nabla f_j(y), x - y\rangle \le f_j(x) \le f_j(y) + \langle\nabla f_j(y), x - y\rangle + \frac{1}{2}\left\|y - x\right\|^2_{M_j}, \quad \forall x, y \in \mathbb{R}^d.$$

We can now proceed with the proof of Theorem C.6 and Theorem C.7. As $\nabla f_i(x) - \nabla f_i(y) \in \mathrm{Range}(M_i)$, we must have

$$G(x^k) - G(x^\star) = \mathcal{M}^\dagger\mathcal{M}\left(G(x^k) - G(x^\star)\right) \quad\text{and}\quad J^k - G(x^\star) = \mathcal{M}^\dagger\mathcal{M}\left(J^k - G(x^\star)\right). \qquad (26)$$
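Due to (26) and (25), inequalities (21) and (22) with the choice $Y = \mathcal{M}^{\dagger\frac{1}{2}}X$ become, respectively:

$$\frac{2\alpha}{\tilde{n}^2}p^{-1}\left\|M_{\tilde{n}}^{\frac{1}{2}}Y_{:,\tilde{n}}\right\|^2 + \frac{2\alpha}{\tilde{n}^2}(1-p)^{-1}\sum_{i=1}^n\mathbb{E}\left[p_i^{-1}\Big\|\sum_{j\in S_i}p_{i,j}^{-1}M_{i,j}^{\frac{1}{2}}Y_{:,j}\Big\|^2\right] + \left\|\left(\mathcal{I} - \mathbb{E}[\mathcal{S}]\right)^{\frac{1}{2}}\mathcal{B}(Y)\right\|^2 \le (1 - \alpha\tilde{\mu})\left\|\mathcal{B}(Y)\right\|^2, \qquad (27)$$

$$\frac{2\alpha}{\tilde{n}^2}p^{-1}\left\|M_{\tilde{n}}^{\frac{1}{2}}Y_{:,\tilde{n}}\right\|^2 + \frac{2\alpha}{\tilde{n}^2}(1-p)^{-1}\sum_{i=1}^n\mathbb{E}\left[p_i^{-1}\Big\|\sum_{j\in S_i}p_{i,j}^{-1}M_{i,j}^{\frac{1}{2}}Y_{:,j}\Big\|^2\right] + \left\|\left(\mathbb{E}[\mathcal{S}]\right)^{\frac{1}{2}}\mathcal{B}(Y)\right\|^2 \le \frac{1}{\tilde{n}}\left\|Y\right\|^2.$$

Above, we have used

$$\mathbb{E}\left\|\mathcal{U}Xe\right\|^2 = \mathbb{E}\left\|\mathcal{U}\mathcal{M}^{\frac{1}{2}}Ye\right\|^2 = p^{-1}\left\|M_{\tilde{n}}^{\frac{1}{2}}Y_{:,\tilde{n}}\right\|^2 + (1-p)^{-1}\sum_{i=1}^n\mathbb{E}\left[p_i^{-1}\Big\|\sum_{j\in S_i}p_{i,j}^{-1}M_{i,j}^{\frac{1}{2}}Y_{:,j}\Big\|^2\right].$$

Note that $\mathbb{E}[\mathcal{S}(X)] = X\cdot\mathrm{Diag}\left((1-p)(\mathbf{p}\circ\mathbf{p}'),\ p\right)$, where $\mathbf{p}, \mathbf{p}' \in \mathbb{R}^{\tilde{n}-1}$ are such that $\mathbf{p}_{\Omega(i,j)} = p_{i,j}$ and $\mathbf{p}'_{\Omega(i,j)} = p_i$. Using (24), setting $\mathcal{B}$ to be right multiplication with $\mathrm{Diag}(b)$ and noticing that $\lambda_{\max}(M_{\tilde{n}}) = \tilde{n}\lambda$, it suffices to have

$$\frac{2\alpha}{\tilde{n}}p^{-1}\lambda + (1-p)b_{\tilde{n}}^2 \le (1 - \alpha\tilde{\mu})b_{\tilde{n}}^2,$$
$$\frac{2\alpha}{\tilde{n}^2}(1-p)^{-1}p_{i,j}^{-1}p_i^{-1}v_{\Omega(i,j)} + \left(1 - (1-p)p_{i,j}p_i\right)b_{\Omega(i,j)}^2 \le (1 - \alpha\tilde{\mu})b_{\Omega(i,j)}^2 \quad \forall j \in \{1, \dots, m_i\},\ i \le n,$$
$$\frac{2\alpha}{\tilde{n}}p^{-1}\lambda + p\,b_{\tilde{n}}^2 \le \frac{1}{\tilde{n}},$$
$$\frac{2\alpha}{\tilde{n}^2}(1-p)^{-1}p_{i,j}^{-1}p_i^{-1}v_{\Omega(i,j)} + (1-p)p_{i,j}p_i\,b_{\Omega(i,j)}^2 \le \frac{1}{\tilde{n}} \quad \forall j \in \{1, \dots, m_i\},\ i \le n$$

for the SAGA case, and

$$\frac{2\alpha}{\tilde{n}}p^{-1}\lambda + (1-p)b_{\tilde{n}}^2 \le (1 - \alpha\tilde{\mu})b_{\tilde{n}}^2,$$
$$\frac{2\alpha}{\tilde{n}^2}(1-p)^{-1}p_{i,j}^{-1}p_i^{-1}v_{\Omega(i,j)} + \left(1 - (1-p)p_i\rho_i\right)b_{\Omega(i,j)}^2 \le (1 - \alpha\tilde{\mu})b_{\Omega(i,j)}^2 \quad \forall j \in \{1, \dots, m_i\},\ i \le n,$$
$$\frac{2\alpha}{\tilde{n}}p^{-1}\lambda + p\,b_{\tilde{n}}^2 \le \frac{1}{\tilde{n}},$$
$$\frac{2\alpha}{\tilde{n}^2}(1-p)^{-1}p_{i,j}^{-1}p_i^{-1}v_{\Omega(i,j)} + (1-p)p_i\rho_i\,b_{\Omega(i,j)}^2 \le \frac{1}{\tilde{n}} \quad \forall j \in \{1, \dots, m_i\},\ i \le n$$

for the LSVRG case.

It remains to notice that to satisfy the SAGA case, it suffices to set $b_{\tilde{n}}^2 = \frac{1}{2\tilde{n}p}$, $b_{\Omega(i,j)}^2 = \frac{1}{2\tilde{n}(1-p)p_{i,j}p_i}$ (for $j \in \{1, \dots, m_i\}$, $i \le n$) and

$$\alpha = \min\left\{\min_{j\in\{1,\dots,m_i\},\,1\le i\le n}\frac{\tilde{n}(1-p)p_{i,j}p_i}{4v_{\Omega(i,j)} + \tilde{n}\tilde{\mu}},\ \frac{p}{4\lambda+\tilde{\mu}}\right\}.$$

To satisfy the LSVRG case, it remains to set $b_{\tilde{n}}^2 = \frac{1}{2\tilde{n}p}$, $b_{\Omega(i,j)}^2 = \frac{1}{2\tilde{n}(1-p)p_i\rho_i}$ (for $j \in \{1, \dots, m_i\}$, $i \le n$) and

$$\alpha = \min\left\{\min_{j\in\{1,\dots,m_i\},\,1\le i\le n}\frac{\tilde{n}(1-p)p_i}{\frac{4v_{\Omega(i,j)}}{p_{i,j}} + \tilde{n}\tilde{\mu}\rho_i^{-1}},\ \frac{p}{4\lambda+\tilde{\mu}}\right\}.$$

The last step is to recall that $\tilde{n} = N + 1$, $v_{\Omega(i,j)} = \frac{N+1}{N}v_{i,j}$ and $\tilde{\mu} = \frac{\mu}{n}$, and to note that the iteration complexity is $\frac{1}{\alpha\tilde{\mu}}\log\frac{1}{\varepsilon} = \frac{n}{\alpha\mu}\log\frac{1}{\varepsilon}$.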



After our paper was completed, a lower bound on the performance of local SGD was presented that is worse than the known minibatch SGD guarantee (Woodworth et al., 2020a), confirming that the local methods do not outperform their non-local counterparts in the heterogeneous setup. Similarly, the benefit of local methods in the non-heterogeneous scenario was questioned in (Woodworth et al., 2020b).

The connection of FL and multi-task meta learning is discussed in (Kairouz et al., 2019), for example.

To be specific, L2GD is equivalent to Overlap LGD (Wang et al., 2020) with random local loop size.

If $\lambda = \infty$ and $x_1 = x_2 = \dots = x_n$ does not hold, we have $F(x) = \infty$. Therefore, we can restrict ourselves to the set $x_1 = x_2 = \dots = x_n$ without loss of generality.

We are aware that a linearly converging local SGD (with $\lambda = \infty$) can be obtained as a particular instance of the decoupling method (Mishchenko & Richtárik, 2019). Other variance-reduced local SGD algorithms (Liang et al., 2019; Karimireddy et al., 2019; Wu et al., 2019) do not achieve linear convergence.

Given that $\mu$ is small.

Specifically, we aim to have $\mathrm{Corr}\left[J^k, n\nabla f(x^k)\right] \to 1$ and $\mathrm{Corr}\left[n^{-1}\Psi^k, \lambda\nabla\psi(x^k)\right] \to 1$ as $x^k \to x^\star$.

SAGA does not require storing a full gradient table for problems with linear models, since it can memorize the residuals. However, in full generality, SVRG-like methods are preferable.



Figure 1: Distance of solution x(λ) of (2) to pure local solution x(0) and global solution x(∞) as a function of λ. Logistic regression on a1a dataset. See Appendix for the setup.

Figure 2: Communication rounds to get

Both communication and iteration complexity of L2SGD+ are minimized for p = 4λ+µ 4λ+4L +(m+1)µ . The resulting iteration complexity is 4 λ µ

Figure 3: L2SGD+ vs. L2SGD vs. L2SGD2 with identical stepsize (details in the Appendix).

Figure 6: Effect of the parameter $\lambda$ (legend of the plot) on the convergence rate of Algorithm 3. The choice $\lambda = \lambda^\star$ corresponds to the brown dash-dotted line with diamond markers (the third one in the legend). The aggregation probability $p$ was chosen in each case as Table 1 indicates.

:,l for all l = j else Master computes the average xk = 1

Then the iteration complexity of Algorithm 5 (LSVRG option) is max max j∈{1,...,mi},1≤i≤n4vj n N pi,j +µp -1 i piµ(1-p) , 4λ+µ pµ log 1 ε .Algorithm 5 L2SGD++: Loopless Local SGD with Variance Reduction and Partial Participation

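The quantity $\mathrm{Comm}_p$ is minimized by choosing any $p$ such that $pL = (1-p)\lambda$, i.e., for $p = \frac{\lambda}{\lambda+L} = p^\star$, as desired. The optimal expected number of communications is therefore equal to $\mathrm{Comm}_{p^\star} = \frac{2\lambda}{\lambda+L}\frac{L}{\mu}\log\frac{1}{\varepsilon}$.

To minimize the total number of iterations, it suffices to solve $\min_p\max\left\{\frac{4L+\mu m}{(1-p)\mu}, \frac{4\lambda+\mu}{p\mu}\right\}$, which is achieved with $p = p^\star = \frac{4\lambda+\mu}{4L+4\lambda+(m+1)\mu}$. The expected number of communications to reach an $\varepsilon$-solution is

$$\mathrm{Comm}_p = p(1-p)\max\left\{\frac{4L+\mu m}{(1-p)\mu}, \frac{4\lambda+\mu}{p\mu}\right\}\log\frac{1}{\varepsilon}.$$

Minimizing the above in $p$ yields $p = p^\star = \frac{4\lambda+\mu}{4L+4\lambda+(m+1)\mu}$, as desired. The optimal expected number of communications is therefore equal to $\mathrm{Comm}_{p^\star} = \frac{4\lambda+\mu}{4L+4\lambda+(m+1)\mu}\left(\frac{4L}{\mu} + m\right)\log\frac{1}{\varepsilon}$.

D.9 PROOF OF THEOREMS 5.1, C.6, AND C.7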

Mj , ∀x, y ∈ R d . (19)

2α n p -1 λ + (1 -p)b 2 n ≤ (1 -αµ)b 2 n 2α n 2 (1 -p) -1 p -1 i,j p -1 i v Ω(i,j) + (1 -(1 -p)p i,j p i )b 2 j ≤ (1 -αµ)b 2 j ∀j ∈ {1, . . . , m i }, i ≤ n 2α n p -1 λ + pb 2 n ≤ 1 n 2α

p)pipi (for j ∈ {1, . . . , m i }, i ≤ n) and α = min min j∈{1,...,mi},1≤i≤n n(1-p)pi4v Ω(i,j) pi,j +nµp -1 i , p 4λ+µ .

Setup for the experiments.

presents the result.

Appendix Federated Learning of a Mixture of Global and Local Models

which implies (3) and (4).

D.3 PROOF OF THEOREM 3.2

The equation $\nabla F(x(\lambda)) = 0$ can be equivalently written as $\nabla f_i(x_i(\lambda)) + \lambda\left(x_i(\lambda) - \bar{x}(\lambda)\right) = 0$ for $i = 1, \dots, n$, which is identical to (5). Averaging these identities over $i$, we get $\frac{1}{n}\sum_{i=1}^n\nabla f_i(x_i(\lambda)) = 0$. Further, we have [...] as desired.

D.4 PROOF OF THEOREM 3.3

First, observe that [...] where the second identity is due to Theorem 3.2, which says that $\frac{1}{n}\sum_i\nabla f_i(x_i(\lambda)) = 0$. By applying Jensen's inequality and Lipschitz continuity of the functions $f_i$, we get [...] It remains to apply (3) and notice that $P$ is strongly convex, and thus $x(\infty)$ is indeed the unique minimizer.

D.5 PROOF OF THEOREM 4.2

We first show that our gradient estimator $G(x)$ satisfies the expected smoothness property (Gower et al., 2018; 2019). [...] and [...] Next, Theorem 4.2 follows from Lemma D.2 by applying Theorem 3.1 from (Gower et al., 2019).

[...] where $\{x^k\}$ and $\{J^k\}$ are the random iterates produced by Algorithm 8 with stepsize $\alpha > 0$. Suppose that $\alpha$ and $\mathcal{B}$ are chosen so that $2\alpha$ [...] for all $X \in \mathbb{R}^{d\times n}$. Then for all $k \ge 0$, we have [...]

Let $\Omega(i,j) := j + \sum_{l=1}^{i-1}m_l$. In order to cast problem (12) as a special case of (17), denote $\tilde{n} := N + 1$, $\tilde{f}_{\Omega(i,j)}(x) := \frac{N+1}{N}f_{i,j}(x_i)$ and $\tilde{f}_{\tilde{n}} := (N+1)\psi$. Therefore the objective (12) becomes [...] Let $v \in \mathbb{R}^{\tilde{n}-1}$ be such that $v_{\Omega(i,j)} = \frac{N+1}{N}v_{i,j}$, and as a consequence of (14) we have [...] (24) At the same time, $\Upsilon$ is $\tilde{\mu} := \frac{\mu}{n}$ strongly convex.

D.9.3 PROOF OF THEOREM C.6 AND THEOREM C.7

Let $e \in \mathbb{R}^d$ be a vector of ones and let $p^i \in \mathbb{R}^N$ be such that $p^i_j = p_{i,j}$ if $j \in \{1, \dots, m_i\}$, and $p^i_j = 0$ otherwise. Given the notation, the random operator $\mathcal{U}$ is chosen as [...] We next give two options for how to update the Jacobian: the first one is SAGA-like, the second one is SVRG-like.

SAGA-like:

$(\mathcal{S}X)_{:,m_i} = X_{:,S_i} = X_{:,m_i}\sum_{j\in S_i}e_j e_j^\top$, w.p. [...]

To obtain the convergence rate of Theorem 5.1, it remains to use Theorem C.6 with $p_i = 1$, $m_i = m$ ($\forall i \le n$), where each machine samples (when the aggregation is not performed) individual data points with probability $\frac{1}{m}$, and thus $p_j = \frac{1}{m}$ (for all $j \le N$). The last remaining thing is to realize that $v_j = L$ for all $j \le N$.

