ADAPTIVE PERSONALIZED FEDERATED LEARNING

Abstract

Investigations of the degree of personalization in federated learning algorithms have shown that maximizing only the performance of the global model confines the capacity of the local models to personalize. In this paper, we propose an adaptive personalized federated learning (APFL) algorithm, where each client trains its local model while contributing to the global model. We derive the generalization bound of the mixture of local and global models, and find the optimal mixing parameter. We also propose a communication-efficient optimization method to collaboratively learn the personalized models and analyze its convergence in both smooth strongly convex and nonconvex settings. Extensive experiments demonstrate the effectiveness of our personalization schema, as well as the correctness of the established generalization theory.

1. INTRODUCTION

With the massive amount of data generated by the proliferation of mobile devices and the internet of things (IoT), coupled with concerns over sharing private information, collaborative machine learning and the use of federated optimization (FO) is often crucial for the deployment of large-scale machine learning (McMahan et al., 2017; Kairouz et al., 2019; Li et al., 2020b). In FO, the ultimate goal is to learn a global model that achieves uniformly good performance over almost all participating clients without sharing raw data. To achieve this goal, most existing methods pursue the following procedure to learn a global model: (i) a subset of clients participating in the training is chosen at each round and receives the current copy of the global model; (ii) each chosen client updates the local version of the global model using its own local data; (iii) the server aggregates the obtained local models to update the global model, and this process continues until convergence (McMahan et al., 2017; Mohri et al., 2019; Karimireddy et al., 2019; Pillutla et al., 2019). Most notably, FedAvg by McMahan et al. (2017) uses averaging as its aggregation method over local models. Due to the inherent diversity among local data shards and the highly non-IID distribution of data across clients, FedAvg is hugely sensitive to its hyperparameters, and as a result, does not benefit from a favorable convergence guarantee (Li et al., 2020c). Karimireddy et al. (2019) argue that if these hyperparameters are not carefully tuned, FedAvg may diverge, as local models can drift significantly from each other. Therefore, in the presence of statistical data heterogeneity, the global model might not generalize well on the local data of each client individually (Jiang et al., 2019).
This is even more crucial in fairness-critical systems such as medical diagnosis (Li & Wang, 2019), where poor performance on local clients could result in damaging consequences. This problem is exacerbated even further as the diversity among local data of different clients grows. To better illustrate this fact, we ran a simple experiment on the MNIST dataset where each client's local training data is sampled from a subset of classes to simulate heterogeneity. Naturally, when each client has samples from fewer classes, the heterogeneity among clients is high; if every client has samples from all classes, the distributions of their local training data become almost identical, and heterogeneity is low. The results of this experiment are depicted in Figure 1, where the generalization and training losses of the global models of FedAvg (McMahan et al., 2017) and SCAFFOLD (Karimireddy et al., 2019) on local data diverge as the diversity among clients' data increases. This observation illustrates that solely optimizing the global model's accuracy leads to poor generalization on local clients. To embrace statistical heterogeneity and mitigate the effect of negative transfer, it is necessary to integrate personalization into learning instead of finding a single consensus predictor. This pluralistic solution for FO has recently spurred significant research on personalized learning schemes (Eichner et al., 2019; Smith et al., 2017; Dinh et al., 2020; Mansour et al., 2020; Fallah et al., 2020; Li et al., 2020a). To balance the trade-off between the benefit of collaboration with other users and the disadvantage of statistical heterogeneity among different users' domains, in this paper we propose an adaptive personalized federated learning (APFL) algorithm, which aims to learn for each device a personalized model that is a mixture of the optimal local and global models.
We theoretically analyze the generalization ability of the personalized model on local distributions, with dependence on the mixing parameter, the divergence between local and global distributions, and the number of local and global training samples. To learn the personalized model, we propose a communication-efficient optimization algorithm that adaptively learns the model by leveraging the relatedness between local and global models as learning proceeds. As shown in Figure 1, as diversity progressively increases, the personalized model found by the proposed algorithm demonstrates better generalization than the global models learned by FedAvg and SCAFFOLD. We supplement our theoretical findings with extensive corroborating experimental results that demonstrate the superiority of the proposed personalization schema over the global and localized models of commonly used federated learning algorithms.

2. PERSONALIZED FEDERATED LEARNING

In this section, we propose a personalization approach for federated learning and analyze its statistical properties. Following statistical learning theory, in a federated learning setting each client has access to its own data distribution D_i on the domain Ξ := X × Y, where X ⊆ R^d is the input domain and Y is the label domain. For any hypothesis h ∈ H, the loss function is ℓ : H × Ξ → R_+. The true risk under the local distribution is denoted by L_{D_i}(h) = E_{(x,y)~D_i}[ℓ(h(x), y)], and we use L̂_{D_i}(h) to denote the empirical risk of h on distribution D_i. We use D̄ = (1/n) Σ_{i=1}^n D_i to denote the average distribution over all clients.

2.1. PERSONALIZED MODEL

In a standard federated learning scenario, where the goal is to learn a global model for all devices cooperatively, the global model is obtained by minimizing the empirical risk over the joint distribution D̄, i.e., min_{h∈H} L̂_{D̄}(h), with proper weighting. However, as alluded to before, a single consensus predictor may not generalize well on local distributions when the heterogeneity among local data shards is high (i.e., the global and local optimal models drift significantly). Meanwhile, from the local user's perspective, the key incentive to participate in "federated" learning is the desire to reduce the local generalization error with the help of other users' data. The ideal situation is that each user can exploit the information in the global model to compensate for its small number of local training samples, while minimizing the negative transfer induced by heterogeneity among distributions. This motivates us to mix the global and local models with a controllable weight into a joint prediction model, namely, the personalized model. We now formally introduce our proposed adaptive personalized learning schema, where the goal is to find the optimal combination of the global and local models in order to achieve a better client-specific model. In this setting, the global server still trains the global model by minimizing the empirical risk on the aggregated domain D̄, i.e., ĥ* = arg min_{h∈H} L̂_{D̄}(h), while each user trains a local model that partially incorporates the global model, with mixing weight α_i:

ĥ*_{loc,i} = arg min_{h∈H} L̂_{D_i}( α_i h + (1 − α_i) ĥ* ).

Finally, the personalized model for the ith client is a convex combination of ĥ* and ĥ*_{loc,i}:

h_{α_i} = α_i ĥ*_{loc,i} + (1 − α_i) ĥ*.   (1)

It is worth mentioning that h_{α_i} is not necessarily the minimizer of the empirical risk L̂_{D_i}(·), because ĥ*_{loc,i} is optimized while partially incorporating the global model. Example 1.
Let us illustrate a simple situation where the mixed model does not necessarily coincide with the local ERM model. Consider a setting where the hypothesis class H is the set of all vectors in the ℓ2 unit ball of R²: H = {h ∈ R² : ‖h‖₂ ≤ 1}. Assume the local empirical minimizer is [1, 0], ĥ* = [−1, 0], and α = 0.5. Now, if we wish to find an ĥ*_{loc,i} such that h_{α_i} = α ĥ*_{loc,i} + (1 − α) ĥ* coincides with the local empirical minimizer, we have to solve 0.5 h + 0.5 [−1, 0] = [1, 0] subject to ‖h‖₂ ≤ 1. The only solution is h = [3, 0], which violates the norm constraint, so the equation has no feasible solution, implying that h_{α_i} does not necessarily coincide with the local empirical minimizer. In fact, in most cases, as we will show in the convergence analysis of the proposed algorithm, h_{α_i} incurs a residual risk when evaluated on the training set drawn from D_i.
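Example 1 can be checked numerically. The following minimal NumPy sketch forms the convex combination of (hypothetical) local and global models and verifies that no hypothesis inside the unit ball can make the mixture coincide with the local ERM model:

```python
import numpy as np

def mix(alpha, h_loc, h_glob):
    """Personalized model: convex combination of local and global models."""
    return alpha * h_loc + (1 - alpha) * h_glob

# Example 1: models in the l2 unit ball of R^2.
h_star = np.array([-1.0, 0.0])       # global ERM model
h_local_erm = np.array([1.0, 0.0])   # local empirical risk minimizer
alpha = 0.5

# Solving alpha*h + (1-alpha)*h_star = h_local_erm for h:
h_required = (h_local_erm - (1 - alpha) * h_star) / alpha
feasible = np.linalg.norm(h_required) <= 1.0   # False: ||[3, 0]|| = 3 > 1
```

Here `h_required` equals [3, 0], whose norm exceeds 1, so the mixture cannot recover the local minimizer within the constrained class.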

2.2. GENERALIZATION GUARANTEES

We now characterize the generalization of the mixed model, presenting learning bounds for classification and regression tasks. For classification, we consider a binary task with the squared hinge loss ℓ(h(x), y) = (max{0, 1 − yh(x)})². For regression, we consider the MSE loss ℓ(h(x), y) = (h(x) − y)². Even though we present learning bounds under these two losses, our analysis generalizes to any convex smooth loss. Before formally presenting the generalization bound, we introduce the following quantity to measure the empirical complexity of a hypothesis class H over a training set S.

Definition 1. Let S be a fixed set of samples and consider a hypothesis class H. The worst-case disagreement between a pair of models, measured by absolute loss, is quantified by:

λ_H(S) = sup_{h,h'∈H} (1/|S|) Σ_{(x,y)∈S} |h(x) − h'(x)|.

We now state the main result on the generalization of the proposed personalization schema. The proof of the theorem is provided in Appendix D.

Theorem 1. Let the hypothesis class H be a compact closed set with finite VC dimension d. Assume the loss function ℓ is Lipschitz continuous with constant G and bounded in [0, B]. Then with probability at least 1 − δ, there exists a constant C such that the risk of the mixed model h_{α_i} = α_i ĥ*_{loc,i} + (1 − α_i) ĥ* on the ith local distribution D_i is bounded by:

L_{D_i}(h_{α_i}) ≤ 2α_i² [ L_{D_i}(h*_i) + 2C √((d + log(1/δ)) / m_i) + G λ_H(S_i) ]
  + 2(1 − α_i)² [ L_{D̄}(ĥ*) + B ‖D̄ − D_i‖₁ + C √((d + log(1/δ)) / m) ],   (2)

where m_i, i = 1, 2, ..., n, is the number of training samples at the ith user, m = m_1 + ... + m_n is the total number of samples, S_i is the local training set drawn from D_i, ‖D̄ − D_i‖₁ = ∫_Ξ |P_{(x,y)~D̄} − P_{(x,y)~D_i}| dx dy is the discrepancy between distributions D̄ and D_i, and h*_i = arg min_{h∈H} L_{D_i}(h).

Remark 1. We note that a closely related work to ours is Mansour et al. (2020), where a generalization bound is also provided for mixing global and local models.
However, their bound does not depend on α_i, and hence cannot show how the mixing weight impacts generalization. In Theorem 1, omitting constant terms, we observe that the generalization risk of h_{α_i} on D_i mainly depends on three key quantities: (i) m, the number of global samples drawn from D̄; (ii) the divergence between distributions D̄ and D_i; and (iii) m_i, the amount of local data drawn from D_i. The first quantity, m, is usually fairly large compared to what an individual user holds, so the global model typically generalizes better. The second quantity characterizes the data heterogeneity between the average distribution and the ith local distribution; if this divergence is too high, the global model may hurt local generalization. For the third quantity, since the amount of local data m_i is often small, the generalization performance of a purely local model can be poor.

Optimal mixing parameter. We can also find the optimal mixing parameter α*_i that minimizes the generalization bound in Theorem 1. Since the RHS of (2) is quadratic in α_i, it admits a minimum at

α*_i = [ L_{D̄}(ĥ*) + B‖D̄ − D_i‖₁ + C√((d + log(1/δ))/m) ]
     / [ L_{D̄}(ĥ*) + B‖D̄ − D_i‖₁ + C√((d + log(1/δ))/m) + L_{D_i}(h*_i) + 2C√((d + log(1/δ))/m_i) + Gλ_H(S_i) ].

The optimal mixing parameter is strictly bounded in [0, 1], which matches our intuition. If the divergence term is large, α*_i becomes close to 1, implying that when the local distribution drifts far from the average distribution, it is preferable to weight the local model more. If m_i is small, the local complexity term dominates the denominator and α*_i becomes small, indicating that we need to mix more of the global model into the personalized model. Conversely, if m_i is large, α*_i is again close to 1, meaning that taking mostly the local model gives the desired generalization performance.
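The closed-form expression for α*_i above is straightforward to evaluate once its ingredients are estimated. The sketch below is illustrative only: all risk, divergence, and disagreement inputs (`global_risk`, `divergence`, `lam`, etc.) are assumed to be estimated elsewhere, and the default constants are hypothetical.

```python
import numpy as np

def optimal_alpha(global_risk, divergence, m, m_i,
                  d=10, delta=0.05, G=1.0, lam=0.0,
                  B=1.0, C=1.0, local_risk=0.0):
    """Mixing weight alpha* minimizing the quadratic bound of Theorem 1.

    global_risk ~ L_D(h*), divergence ~ ||D_bar - D_i||_1,
    lam ~ lambda_H(S_i); all are assumed given (estimated elsewhere).
    """
    comp_m = C * np.sqrt((d + np.log(1.0 / delta)) / m)     # global complexity
    comp_mi = C * np.sqrt((d + np.log(1.0 / delta)) / m_i)  # local complexity
    global_term = global_risk + B * divergence + comp_m
    local_term = local_risk + 2.0 * comp_mi + G * lam
    return global_term / (global_term + local_term)
```

As the divergence grows, the returned weight moves toward 1 (favor the local model); shrinking m_i moves it toward 0 (favor the global model), matching the discussion above.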

3. OPTIMIZATION METHOD

To optimize the learning problem cast in the previous section, we propose a communication-efficient adaptive algorithm to learn the personalized local models and the global model. To do so, we let every hypothesis h in the hypothesis space H be parameterized by a vector w ∈ W ⊂ R^d, where W is a convex closed set, and denote the empirical risk at the ith device by the local objective function f_i(w). Adaptive personalized federated learning can be formulated as a two-phase optimization problem: globally update the shared model, and locally update the users' local models. Similar to the FedAvg algorithm, the server solves the following optimization problem:

min_{w∈W} F(w) := (1/n) Σ_{i=1}^n { f_i(w) := E_{ξ_i}[ f_i(w, ξ_i) ] },   (3)

where f_i(·) is the local objective at the ith client, ξ_i is a minibatch of data from the data shard at the ith client, and n is the total number of clients. Motivated by the trade-off between the global and local generalization errors in Theorem 1, we need to learn a personalized model as in (1) to optimize the local empirical risk. To this end, each client solves the following optimization problem over its local data:

min_{v∈W} f_i( α_i v + (1 − α_i) w* ),   (4)

where w* = arg min_w F(w) is the optimal global model. The balance between these two models is governed by the parameter α_i, which is associated with the diversity of the local and global models. We first state the algorithm for a pre-defined proper α_i, and then propose an adaptive schema to learn this parameter as learning proceeds.

Remark 2. As mentioned in Section 2.1, when the hypothesis class is bounded, the mixed model will not coincide with the local ERM model. However, if the class is unbounded, the mixed model will eventually converge to the local ERM model, which means the personalization fails. Hence, to ensure the correctness of our algorithm, we require the parameters to come from a bounded domain W.

Local Descent APFL. At each communication round, each selected client i maintains its local version of the global model w_i^(t), its local model v_i^(t), and the personalized model v̄_i^(t) = α_i v_i^(t) + (1 − α_i) w_i^(t).
Then, the selected clients perform the following updates locally on their own data for τ iterations:

w_i^(t) = Π_W( w_i^(t−1) − η_t ∇f_i( w_i^(t−1); ξ_i^t ) ),
v_i^(t) = Π_W( v_i^(t−1) − η_t ∇_v f_i( v̄_i^(t−1); ξ_i^t ) ),

where Π_W denotes the projection onto W and ∇f_i(·; ξ) denotes the stochastic gradient of f_i(·) evaluated at minibatch ξ.

Algorithm 1: Local Descent APFL
input: mixture weights α_1, ..., α_n, synchronization gap τ.
for t = 1, ..., T do
    parallel for i ∈ U_t do
        if t is not a multiple of τ then
            w_i^(t) = Π_W( w_i^(t−1) − η_t ∇f_i( w_i^(t−1); ξ_i^t ) )
            v_i^(t) = Π_W( v_i^(t−1) − η_t ∇_v f_i( v̄_i^(t−1); ξ_i^t ) )
            v̄_i^(t) = α_i v_i^(t) + (1 − α_i) w_i^(t)
            U_t ← U_{t−1}
        else
            each selected client sends w_i^(t) to the server
            w^(t) = (1/|U_t|) Σ_{j∈U_t} w_j^(t)
            server uniformly samples a subset U_t of K clients
            server broadcasts w^(t) to all chosen clients
        end
    end
end

Using the updated versions of the global and local models, each client also updates its personalized model v̄_i^(t). Clients not selected in the current round keep their previous local model, v̄_i^(t) = v̄_i^(t−1). After these τ local updates, the selected clients send their local versions of the global model w_i^(t) to the server for aggregation by averaging: w^(t) = (1/|U_t|) Σ_{j∈U_t} w_j^(t). The server then chooses another set of K clients for the next round of training and broadcasts the new model to them.

Adaptively updating α. Even though Section 2.2 gives the information-theoretically optimal mixing parameter, in practice we usually do not know the distance between a user's distribution and the average distribution, so finding the optimal α is infeasible. However, we can estimate it empirically during optimization.
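Before turning to the adaptive α update, the fixed-α procedure above can be sketched in a few lines. This is a toy simulation, not the paper's implementation: it assumes full participation, full gradients, and quadratic objectives f_i(w) = 0.5‖w − c_i‖² with illustrative client optima `centers`; note the chain rule ∇_v f_i(αv + (1−α)w) = α ∇f_i(v̄) in the local-model step.

```python
import numpy as np

def project(w, radius=10.0):
    """Euclidean projection onto W = {w : ||w||_2 <= radius}."""
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def local_descent_apfl(centers, alpha, tau=5, rounds=20, lr=0.1):
    """Toy sketch of Local Descent APFL on quadratics f_i(w) = 0.5*||w - c_i||^2,
    whose gradient is (w - c_i). Full participation, full gradients."""
    n, d = centers.shape
    w_glob = np.zeros(d)           # global model
    v = np.zeros((n, d))           # per-client local models
    for _ in range(rounds):
        w_loc = np.tile(w_glob, (n, 1))    # clients receive the global model
        for _ in range(tau):
            for i in range(n):
                # local step on the client's copy of the global model
                w_loc[i] = project(w_loc[i] - lr * (w_loc[i] - centers[i]))
                # personalized model, then local-model step on f_i(v_bar)
                v_bar = alpha * v[i] + (1 - alpha) * w_loc[i]
                v[i] = project(v[i] - lr * alpha * (v_bar - centers[i]))
        w_glob = w_loc.mean(axis=0)        # server aggregation by averaging
    return w_glob, alpha * v + (1 - alpha) * w_glob
```

With α = 1 each personalized model recovers its own optimum c_i, while the aggregated global model converges to the average of the client optima, illustrating the local/global trade-off governed by α.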
Based on the local objective defined in (4), the empirically optimal value of α for each client is found by solving

α*_i = arg min_{α_i∈[0,1]} f_i( α_i v + (1 − α_i) w ),   (5)

which we can optimize with a gradient descent step at every iteration:

α_i^(t) = α_i^(t−1) − η_t ∇_α f_i( v̄_i^(t−1); ξ_i^t )
        = α_i^(t−1) − η_t ⟨ v_i^(t−1) − w_i^(t−1), ∇f_i( v̄_i^(t−1); ξ_i^t ) ⟩,   (6)

which shows that the mixing coefficient α is updated based on the correlation between the difference of the personalized and local versions of the global model, and the gradient at the in-device personalized model. That is, when the global model drifts from the personalized model, the value of α changes to adjust the balance between local data and the shared knowledge among all devices captured by the global model.
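A single step of the α update is just an inner product followed by a step; a minimal sketch (the clipping to [0, 1], which keeps α in the feasible interval of (5), is our addition):

```python
import numpy as np

def alpha_step(alpha, v, w, grad_vbar, lr):
    """One gradient step on the mixing weight:
    d f_i(alpha*v + (1-alpha)*w) / d alpha = <v - w, grad f_i(v_bar)>.
    Clipping to [0, 1] keeps alpha in the feasible interval (our assumption)."""
    alpha = alpha - lr * np.dot(v - w, grad_vbar)
    return float(np.clip(alpha, 0.0, 1.0))
```

For instance, if the local and global models disagree in the direction of the gradient at v̄, the inner product is positive and α decreases, shifting weight toward the global model.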

4. CONVERGENCE ANALYSIS

In this section we provide the convergence analysis of Local Descent APFL with fixed α_i for strongly convex and nonconvex functions. To obtain a tight analysis, and to put the optimization results in the context of the generalization bounds discussed above, we define the following parameterization-invariant quantities that depend only on the distributions of local data across clients and the geometry of the loss functions.

Definition 2. We measure the diversity among local gradients with respect to the gradient of the ith client by ζ_i = sup_{w∈R^d} ‖∇F(w) − ∇f_i(w)‖₂² (Woodworth et al., 2020a). We also define the sum of the gradient diversities of the n clients as ζ = Σ_{i=1}^n ζ_i.

Definition 3. We define ∆_i = ‖v*_i − w*‖₂², where v*_i = arg min_v f_i(v) and w* = arg min_w F(w), to measure the gap between the optimal local model and the optimal global model.

We also need the following standard assumption on the stochastic gradients of the local objectives.

Assumption 1 (Bounded variance). The variance of the stochastic gradients computed on each local data shard is bounded, i.e., for all i ∈ [n]: E[‖∇f_i(x; ξ) − ∇f_i(x)‖²] ≤ σ².

Strongly convex loss. We now establish the convergence of Local Descent APFL for smooth strongly convex functions. Specifically, the following theorem characterizes the convergence of the personalized local model to the optimal local model. The proof is provided in Appendix E.2.3.

Theorem 2. Assume each client's objective function is µ-strongly convex and L-smooth and satisfies Assumption 1, and let κ = L/µ and b = min{K/n, 1/2}.
Using Algorithm 1, choosing the mixing weight α_i ≥ max{1 − 1/(4√6 κ), 1 − 1/(4√6 κ√µ)}, the learning rate η_t = 16/(µ(t + a)) with a = max{128κ, τ}, and the averaging scheme

v̂_i = (1/S_T) Σ_{t=1}^T p_t ( α_i v_i^(t) + (1 − α_i) (1/K) Σ_{j∈U_t} w_j^(t) ),

where p_t = (t + a)² and S_T = Σ_{t=1}^T p_t, and letting f*_i denote the local minimum of the ith client, the following convergence rate holds for all clients i ∈ [n]:

E[f_i(v̂_i)] − f*_i ≤ α_i² O( σ² / (µbT) ) + (1 − α_i)² O( κ²σ² / (µbKT) + (κ²τ²ζ_i + κ²τ ζ/K) / (µbT²) + (ζ_i + ζ/K) / (µb) + κL∆_i / b ).

If we choose τ = √(T/K), then:

E[f_i(v̂_i)] − f*_i ≤ α_i² O( σ² / (µT) ) + (1 − α_i)² O( (κ²σ² + κ²ζ_i + κ⁴ζ/K) / (µKT) ) + (1 − α_i)² O( (ζ_i + ζ/K) / µ + κL∆_i ).

A few remarks about the convergence of the personalized local model are in order. (1) If we set α_i = 1, we recover the O(1/T) convergence rate of single-machine SGD. If we focus only on the terms multiplied by (1 − α_i)², which are contributed by the global model's convergence, and omit the residual error, we achieve an O(1/(KT)) rate using only √(KT) communication rounds, which matches the convergence rate of vanilla local SGD (Stich, 2018; Woodworth et al., 2020a). (2) The residual error is related to the gradient diversity ζ_i and the local-global optimality gap ∆_i. It shows that taking any proportion of the global model results in a sub-optimal ERM model; as discussed in Section 2.1, h_{α_i} will not be the empirical risk minimizer in most cases. We also require α_i to be larger than some threshold in order to obtain a tight rate. This condition can be relaxed, but the residual error becomes looser; the analysis of this relaxation is presented in Appendix F.

Nonconvex loss. The following theorem establishes the convergence rate of the personalized model learned by APFL for nonconvex smooth loss functions. The proof is provided in Appendix G.3.

Theorem 3. Let v̄_i^(t) = α_i v_i^(t) + (1 − α_i) (1/K) Σ_{j∈U_t} w_j^(t). Assume each client's objective function is L-smooth and the domain W is bounded by D_W, that is, ‖w − w'‖₂ ≤ D_W for all w, w' ∈ W. Using Algorithm 1 with full gradients, choosing K = n and learning rate η = 1/(2√5 L√T), we have

(1/T) Σ_{t=1}^T ‖∇f_i(v̄_i^(t))‖² ≤ O( L/√T ) + (1 − α_i)² O( L/√T ) + (1 − α_i²)² O( ζ_i + L² D_W ) + α_i⁴ (1 − α_i)² O( τ⁴ζ / (nT²) + τ²ζ_i / T ) + (1 − α_i)² O( τ²ζ / (nT) ).

Choosing τ = n^{−1/4} T^{1/4}, it holds that:

(1/T) Σ_{t=1}^T ‖∇f_i(v̄_i^(t))‖² ≤ O( 1/√T ) + (1 − α_i)² O( 1/√T + 1/√(nT) ) + (1 − α_i²)² O( ζ_i + L² D_W ).

This shows that APFL converges to a stationary point of a nonconvex objective at a sublinear rate, up to a residual error, using T/τ = n^{1/4} T^{3/4} rounds of communication.
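The diversity measures of Definitions 2 and 3 can be estimated empirically. The following sketch approximates ζ_i by taking the maximum over a finite set of probe points instead of the supremum over all of R^d (an assumption of the sketch); `grads_fn` is a hypothetical callable returning the per-client gradients at a point.

```python
import numpy as np

def gradient_diversity(grads_fn, probe_points):
    """Empirical estimate of zeta_i = sup_w ||grad F(w) - grad f_i(w)||^2,
    approximating the sup by a max over a finite set of probe points.
    grads_fn(w) must return the (n, d) array of per-client gradients at w."""
    zetas = None
    for w in probe_points:
        g = grads_fn(w)                        # (n, d) per-client gradients
        g_avg = g.mean(axis=0)                 # grad F = average of local grads
        z = ((g - g_avg) ** 2).sum(axis=1)     # ||grad f_i - grad F||^2 per client
        zetas = z if zetas is None else np.maximum(zetas, z)
    return zetas, zetas.sum()                  # per-client zeta_i and zeta
```

For quadratics f_i(w) = 0.5‖w − c_i‖², the gradient gap ∇f_i(w) − ∇F(w) is the constant c̄ − c_i, so the estimate is exact regardless of the probe points chosen.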

5. EXPERIMENTS

In this section, we empirically show the effectiveness of the proposed algorithm for personalized federated learning. Due to lack of space, some experimental results are deferred to Appendix B.

Experimental setup. We run our experiments on Microsoft Azure systems, using the Azure ML API. We train with SGD in the spirit of Bottou (2012): at each iteration the learning rate is decreased by 1%, unless otherwise stated. We report the performance on training data for optimization error and on local validation data (drawn from the same distribution as each client's training data) for generalization accuracy. Throughout these experiments we report results for the following three models:

• Global model: the global model of FedAvg or SCAFFOLD.
• Localized global model: the fine-tuned version of the global model at each round of communication after τ steps of local SGD, i.e., localized FedAvg or localized SCAFFOLD. The reported results average the performance of the local models over all online clients. In all experiments τ = 10, unless otherwise stated.
• Personalized model: the personalized model produced by our proposed algorithm APFL. The reported results average the performance of the personalized models over all online clients at each round of communication.

Strongly convex loss. First, we run a set of experiments on the MNIST dataset with different levels of non-IIDness, obtained by assigning a certain number of classes to each client. We use logistic regression with parameter regularization as our strongly convex loss function. In this part, all clients are online in every round; the results with client sampling are discussed in Appendix B.2. We compare the personalized model of APFL, at different personalization rates α, with the global and localized models of FedAvg and SCAFFOLD.
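The class-based heterogeneous split described above can be sketched as follows. This is an illustrative partitioning scheme (the function name and stride-based assignment are our assumptions, not the paper's exact procedure): each client draws only from a fixed subset of classes, so fewer classes per client means higher heterogeneity.

```python
import numpy as np

def partition_by_classes(labels, n_clients, classes_per_client, seed=0):
    """Sketch of a heterogeneous split: each client samples only from a
    fixed subset of classes. Returns a list of index arrays, one per client."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    by_class = {c: rng.permutation(np.where(labels == c)[0]) for c in classes}
    shards = []
    for i in range(n_clients):
        own = rng.choice(classes, size=classes_per_client, replace=False)
        # take a disjoint stride of each chosen class's indices for this client
        idx = np.concatenate([by_class[c][i::n_clients] for c in own])
        shards.append(rng.permutation(idx))
    return shards
```

Setting `classes_per_client` to the total number of classes recovers a near-IID split, matching the diversity knob used in Figures 1 and 2.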
The initial learning rate is set to 0.1 and decays as described above. The results of running this experiment on 100 clients after 100 rounds of communication are depicted in Figure 2, where we move from a highly non-IID data distribution (left) to an IID data distribution (right). As can be seen, the global models learned by FedAvg and SCAFFOLD have high local training losses. On the other hand, taking a larger proportion of the local model in the personalized model (namely, increasing α) results in lower training losses. In terms of generalization, the best performance is given by the personalized model with α = 0.25 in both cases (a) and (b), which outperforms the global models of FedAvg and SCAFFOLD as well as their localized versions. However, as we move toward an IID distribution, the advantage of personalization vanishes, as expected. Hence, in line with our theoretical findings, we benefit most from personalization when there is statistical heterogeneity among the data of different clients; when the data are distributed IID, the local models of FedAvg or SCAFFOLD are preferable. An interesting observation from Figure 2, also in line with our theory, is the relationship of α with both optimization and generalization losses. From the first row, α relates monotonically to the optimization loss: with smaller α, the training loss approaches that of the FedAvg global model, which matches our convergence theory. From the second row, however, there is no such monotone relationship between α and generalization. Indeed, according to (2), the generalization bound is quadratic in α, and hence the generalization performance does not simply increase or decrease monotonically with α. Adaptive α update.
In this part, we show how adaptively learning the value of α across different clients, based on (6), affects the training and generalization performance of APFL's personalized models. We use the three synthetic datasets described in Appendix B.1, with logistic regression as the loss function, and set the initial value α_i^(0) = 0.01 for every i ∈ [n]. The results are depicted in Figure 3, where both the optimization and generalization of the learned models are compared. In training, APFL outperforms FedAvg on the same datasets. More interestingly, in the generalization of the learned APFL personalized models, all datasets achieve almost the same performance as a result of adaptively updating the α values, while the FedAvg algorithm lags far behind. This shows that when we do not know the degree of diversity among the data of different clients, we should adaptively update the α values to guarantee the best generalization performance. We also report results on the EMNIST dataset with adaptive tuning of α in Appendix B.2, with a 2-layer MLP. Nonconvex loss. To showcase the results for a nonconvex loss, we use the CIFAR10 dataset, distributed in a non-IID way with 2 classes per client. We apply a CNN model with 2 convolutional layers followed by 2 fully connected layers, using cross entropy as the loss function. The initial learning rates of APFL and FedAvg are set to 0.1 with the aforementioned decay structure, while for SCAFFOLD this value is 0.05 with 5% decay per iteration to avoid divergence. As can be inferred from the results in Table 1, the personalized model learned by APFL outperforms the localized models of FedAvg and SCAFFOLD, as well as their global models, in both optimization and generalization. In this case, adaptively tuning α achieves the best training loss, while α = 0.25 reaches the best generalization performance. Comparison with other personalization methods.
We now compare our proposed APFL with two recent approaches for personalization in federated learning. In addition to FedAvg, we compare with perFedAvg (Fallah et al., 2020), which takes a meta-learning approach, and pFedMe (Dinh et al., 2020), which regularizes each personalized model toward the global model.

6. CONCLUSIONS

In this paper, we proposed an adaptive federated learning algorithm that learns a mixture of local and global models as the personalized model. Motivated by learning theory in domain adaptation, we provided generalization guarantees for our algorithm that demonstrate the dependence on the diversity between each client's data distribution and the representative sample of the overall data distribution, as well as the number of per-device samples, as key factors in personalization. Moreover, we proposed a communication-efficient optimization algorithm to learn the personalized models and analyzed its convergence rate for both smooth strongly convex and nonconvex functions. Finally, we empirically backed up our theoretical results by conducting experiments in a federated setting.

A ADDITIONAL RELATED WORK

Research in federated learning has proliferated over the past few years. In federated learning, the main objective is to learn a global model that generalizes well to unseen data and converges quickly to a local optimum. This indicates several uncanny resemblances between federated learning and meta-learning approaches (Finn et al., 2017; Nichol et al., 2018). Despite this similarity, however, meta-learning approaches mainly try to learn multiple models, personalized for each new task, whereas most federated learning approaches focus on a single global model. As discussed by Kairouz et al. (2019), the gap between the performance of global and personalized models shows the crucial importance of personalization in federated learning. Several approaches try to personalize the global model, primarily focusing on optimization error, while the main challenge with personalization arises at inference time. Some works on the personalization of models in a decentralized setting can be found in Vanhaesebrouck et al. (2017) and Almeida & Xavier (2018), where, in addition to the optimization error, there are network constraints or peer-to-peer communication limitations (Bellet et al., 2017; Zantedeschi et al., 2019). In general, as discussed by Kairouz et al. (2019), there are three significant categories of personalization methods in federated learning: local fine-tuning, multi-task learning, and contextualization. Yu et al. (2020) argue that the global model learned by federated learning, especially with differential privacy and robust learning objectives, can hurt the performance of many clients; those clients can obtain a better model using only their own data. They empirically show that these three approaches can boost the performance of such clients.
In addition to these three, there is another category that fits our proposed approach most closely: mixing the global and local models. Local fine-tuning: The dominant approach to personalization is local fine-tuning, where each client receives a global model and tunes it using its own local data with several gradient descent steps. This approach is predominantly used in meta-learning methods such as MAML (Finn et al., 2017). Jiang et al. (2019) study the connection between FedAvg and first-order meta-learning methods such as Reptile (Nichol et al., 2018), and combine them to personalize local models. They observed that federated learning with the single objective of global-model performance can limit the capacity of the learned model for personalization. Khodak et al. (2019) use online convex optimization to introduce a meta-learning approach that can be used in federated learning for better personalization. Fallah et al. (2020) borrow ideas from MAML to learn personalized models for each client with convergence guarantees; similar to fine-tuning, they update the local models with several gradient steps, but, like MAML, they use second-order information to update the global model. Another approach, adopted for deep neural networks, is introduced by Arivazhagan et al. (2019), where the base layers are frozen and only the last "personalized" layer is changed locally for each client. The main drawback of local fine-tuning is that it minimizes the optimization error, whereas the more important quantity is the generalization performance of the personalized model; in this setting, the personalized model is prone to overfitting. Multi-task learning: Another view of the personalization problem is to cast it as multi-task learning, similar to Smith et al. (2017). In this setting, optimization on each client can be considered a new task, and approaches from multi-task learning can be applied. One other approach, discussed as an open problem in Kairouz et al.
(2019), is to cluster groups of clients based on features such as region into similar tasks, akin to one of the approaches proposed by Mansour et al. (2020). They propose three different approaches for personalization with generalization guarantees: client clustering, data interpolation, and model interpolation. The first two require meta-features from all clients, which makes them infeasible for federated learning due to privacy concerns. The third scheme, which is also the most promising one in practice, has a formulation close to ours in its interpolation of the local and global models. However, their generalization bound does not demonstrate the advantage of mixing models, whereas our analysis shows how model mixing affects the generalization bound through its dependence on the mixing parameter, the data diversity, and the optimal models on the local and global distributions. Beyond particular techniques for personalization in federated learning, Kairouz et al. (2019) ask the essential question "when is a global FL-trained model better?", or, conversely, when is personalization better? The answer mostly depends on the distribution of data across clients. As we theoretically prove and empirically verify in this paper, when the data is distributed IID, we cannot benefit from personalization, and the situation reduces to the local SGD scenario (Stich, 2018; Haddadpour et al., 2019a;b; Woodworth et al., 2020b). However, when the data is non-IID across clients, which is mostly the case in federated learning, personalization can help balance shared and local knowledge. The question then becomes: what degree of personalization is best for each client? While this was left as an open problem in Mohri et al.
(2019), namely how to appropriately mix the global and local models, we answer it by adaptively tuning the degree of personalization for each client, as discussed in Section 3, so that it becomes agnostic to the local data distributions.
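To make the adaptive tuning concrete, here is a minimal sketch of the mixing-weight update described above: α is moved by a gradient step whose direction is the correlation between the local-global model deviation and the gradient of the local loss at the mixed point, then clipped to [0, 1]. The quadratic toy objective, the learning rate, and all variable names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def update_alpha(alpha, v, w, grad_mixed, lr):
    """One adaptive step for the mixing weight (sketch).

    The derivative of f(alpha*v + (1-alpha)*w) w.r.t. alpha is
    <v - w, grad f(v_bar)>; we take a gradient step and clip to [0, 1].
    """
    g = np.dot(v - w, grad_mixed)
    return float(np.clip(alpha - lr * g, 0.0, 1.0))

# toy quadratic local objective f(x) = 0.5 * ||x - target||^2 (assumed)
target = np.array([1.0, -2.0])
grad_f = lambda x: x - target

v = np.array([0.9, -1.8])   # local personalized parameters
w = np.array([0.0, 0.0])    # global parameters
alpha = 0.5
for _ in range(50):
    v_bar = alpha * v + (1 - alpha) * w
    alpha = update_alpha(alpha, v, w, grad_f(v_bar), lr=0.1)
print(alpha)  # alpha is pushed toward 1: the local model fits this client better
```

Because the local model here is much closer to the client's optimum than the global model, the correlation term is negative and α grows, matching the intuition that larger local-global divergence calls for more personalization.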

B ADDITIONAL EXPERIMENTAL RESULTS

In this section, we present additional experimental results that demonstrate the efficacy of the proposed APFL algorithm. We first describe the datasets used in this paper and then present the additional results.

B.1 DATASETS

For the experiments we use 4 different data sources, as follows:

MNIST and CIFAR10. For the MNIST and CIFAR10 datasets to resemble the federated learning setting, we need to manually distribute them in a non-IID way, so the data distribution is pathologically heterogeneous. To this end, we follow the steps used by McMahan et al. (2017), who partition the dataset based on labels and, for each client, draw samples from a limited number of classes. In the same way we create 3 datasets from MNIST: MNIST non-IID with 2 classes per client, MNIST non-IID with 4 classes per client, and MNIST IID, where the data is distributed uniformly at random across clients. We also create a non-IID CIFAR10 dataset, where each client has access to only 2 classes of data.

EMNIST. In addition to pathologically heterogeneous data distributions, we apply our algorithm to a real-world heterogeneous dataset, an extension of the MNIST dataset. The EMNIST dataset contains images of characters grouped by author, and each author has a different writing style, making the authors' distributions different (Caldas et al., 2018). We train our models on the digit characters of 1000 authors.

Synthetic. To generate the synthetic dataset, we follow the procedure used by Li et al. (2018), which has two parameters, say synthetic(γ, β), that control how much the local model and the local dataset of each client differ from those of the other clients, respectively. Using these parameters, we control the diversity between the data and models of different clients. For each client i we generate a weight matrix W_i ∈ R^{c×m} and a bias b_i ∈ R^c, and the output for the ith client is y_i = arg max σ(W_i x_i + b_i), where σ(·) is the softmax. The input data x_i ∈ R^m has m features, and the output y_i can take c different values, indicating the number of classes.
The model is generated from a Gaussian distribution, W_i ∼ N(μ_i, 1) and b_i ∼ N(μ_i, 1), where μ_i ∼ N(0, γ). The input is drawn from a Gaussian distribution x_i ∼ N(ν_i, Σ), where ν_i ∼ N(V_i, 1) and V_i ∼ N(0, β). The covariance Σ is a diagonal matrix with entries Σ_{k,k} = k^{−1.2}. Using this procedure, we generate three different datasets, namely synthetic(0.0, 0.0), synthetic(0.5, 0.5), and synthetic(1.0, 1.0), moving from an IID dataset to a highly non-IID one.
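The generation procedure above can be sketched as follows for a single client. This is an illustrative reading of the description, not the authors' code: in particular, we treat γ and β as variances when sampling μ_i and V_i, and the sizes m = 60, c = 10 and the sample count are assumed defaults.

```python
import numpy as np

def synthetic_client(gamma, beta, m=60, c=10, n_samples=100, seed=0):
    """Sketch of one client of the synthetic(gamma, beta) dataset.

    gamma controls model heterogeneity across clients, beta controls data
    heterogeneity; gamma/beta are treated as variances here (an assumption).
    """
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(gamma)) if gamma > 0 else 0.0
    W = rng.normal(mu, 1.0, size=(c, m))          # W_i ~ N(mu_i, 1)
    b = rng.normal(mu, 1.0, size=c)               # b_i ~ N(mu_i, 1)
    V = rng.normal(0.0, np.sqrt(beta)) if beta > 0 else 0.0
    nu = rng.normal(V, 1.0, size=m)               # nu_i ~ N(V_i, 1)
    Sigma = np.diag(np.arange(1, m + 1) ** -1.2)  # Sigma_kk = k^{-1.2}
    X = rng.multivariate_normal(nu, Sigma, size=n_samples)
    logits = X @ W.T + b
    y = np.argmax(logits, axis=1)  # argmax of softmax == argmax of logits
    return X, y

X, y = synthetic_client(0.5, 0.5)
print(X.shape, y.shape)
```

With gamma = beta = 0 every client shares the same model and input means, reproducing the IID end of the spectrum; increasing both parameters spreads the clients apart.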

B.2 ADDITIONAL RESULTS

In this part, we present more experimental results that further illustrate the effectiveness of APFL on other datasets and models.

Effect of sampling. To understand how sampling different clients affects the performance of the APFL algorithm, we run the same experiment with different sampling rates on the MNIST dataset. The results are depicted in Figure 4, where we run the experiment with sampling rates K ∈ {0.3, 0.5, 0.7} and with different values of α ∈ {0.25, 0.5, 0.75}. The results are reported for the personalized model of APFL and for localized FedAvg. As can be inferred, decreasing the sampling ratio has a negative impact on both the training and generalization performance of FedAvg. However, regardless of the sampling ratio, APFL outperforms the local model of FedAvg in both training and generalization. Also, from the results of Figure 2, we know that for this highly non-IID dataset, larger α values are preferred; increasing α can diminish the negative impact of sampling on the personalized models, both in training and in generalization.

Natural heterogeneous data. In addition to the CIFAR10 and MNIST datasets with pathologically heterogeneous data distributions, we apply our algorithm to a naturally heterogeneous dataset, EMNIST (Caldas et al., 2018). We use the data of 1000 clients, and in each round of communication we randomly select 10% of the clients to participate in the training. We use an MLP model with 2 hidden layers, each with 200 neurons and ReLU activations, trained with the cross-entropy loss. For APFL, we use the adaptive α scheme with an initial value of 0.5 for each client. We run both algorithms for 250 rounds of communication; in each round, each participating client performs local updates for 1 epoch on its data. Figure 5 shows the results of this experiment.

C DISCUSSIONS AND EXTENSIONS

Connection between learning guarantees and convergence. As Theorem 1 suggests, the generalization bound depends on the divergence between the local and global distributions. In the language of optimization, the counterpart of distribution divergence is gradient diversity; hence, gradient diversity appears in our empirical loss convergence rate (Theorem 2). Another interesting observation is that the generalization bound contains the terms λ_H and L_{D_i}(h_i^*), which are intrinsic to the distributions and the hypothesis class, while the convergence result contains the term ‖v_i^* − w^*‖², which likewise depends only on the data distributions and the chosen hypothesis class. In addition, ‖v_i^* − w^*‖² also reveals the divergence between the local and global optimal solutions.

Why APFL is "adaptive". Both information-theoretically (Theorem 1) and computationally (Theorem 2), we prove that when the local distribution drifts far from the average distribution, the global model does not contribute much to improving local generalization, and we have to tune the mixing parameter α to a larger value. It is therefore necessary to update α adaptively during empirical risk minimization. In Section 3, (6) shows that the update of α depends on the correlation between the local gradient and the deviation between the local and global models. Experimental results show that our method can adaptively tune α and can outperform training schemes that use a fixed α.

Comparison with the local ERM model. A crucial question about personalization is when a mixed model is preferable, and how bad a purely local ERM model can be. In the following corollary, we answer this by showing that the risk of the local ERM model can be strictly worse than that of our personalized model. Corollary 1.
Continuing with Theorem 1, there exist a distribution D_i and constants C_1 and C_2 such that, with probability at least 1 − δ, the following upper bound on the difference between the risks of the personalized model h_{α_i} and the local ERM model ĥ_i^* on D_i holds:

L_{D_i}(h_{α_i}) − L_{D_i}(ĥ_i^*) ≤ (2α_i² − 1) L_{D_i}(h_i^*) + (2α_i² C_1 − C_2) √((d + log(1/δ))/m_i) + 2α_i² G λ_H(S_i) + 2(1 − α_i)² ( L_D̄(h̄^*) + B ‖D̄ − D_i‖_1 + C_1 √((d + log(1/δ))/m) ).

Examining this bound, the personalized model is preferable to the local model whenever the right-hand side is negative. For that we require (2α_i² − 1) and (2α_i² C_1 − C_2) to be negative, which is satisfied by choosing α_i ≤ min{√2/2, √(C_2/(2C_1))}. Then the term √((d + log(1/δ))/m_i) should be sufficiently large, and the divergence term, as well as the generalization error of the global model, has to be small. In this case, from the local model's perspective, it can benefit from incorporating some of the global model. Using a similar technique, we can establish the superiority of the mixed model over the global model as well.

Proof of Corollary 1. In Theorem 1 we already obtained the upper bound

L_{D_i}(h_{α_i}) ≤ 2α_i² ( L_{D_i}(h_i^*) + 2C_1 √((d + log(1/δ))/m_i) + G λ_H(S_i) ) + 2(1 − α_i)² ( L_D̄(h̄^*) + B ‖D̄ − D_i‖_1 + C_1 √((d + log(1/δ))/m) ),

so to upper-bound L_{D_i}(h_{α_i}) − L_{D_i}(ĥ_i^*) we only need a lower bound on L_{D_i}(ĥ_i^*). The fundamental theorem of statistical learning (Shalev-Shwartz & Ben-David, 2014; Mohri et al., 2018) gives a lower risk bound for agnostic PAC learning: for a hypothesis class with finite VC dimension d, there exists a distribution D such that, for any learning algorithm that learns a hypothesis h ∈ H from m i.i.d. samples of D, there exists a constant C such that, with probability at least 1 − δ, we have:

L_D(h) − min_{h'∈H} L_D(h') ≥ C √((d + log(1/δ))/m).
Since ĥ_i^* is learned by the ERM algorithm, the agnostic PAC learning lower risk bound also applies to it; in the worst case it can hold under distribution D_i that, if ĥ_i^* is learned by ERM from m_i samples, then there is a constant C_2 such that, with probability at least 1 − δ,

L_{D_i}(ĥ_i^*) ≥ L_{D_i}(h_i^*) + C_2 √((d + log(1/δ))/m_i).

Combining the two bounds gives the claim of Corollary 1.

Personalization for newly participating nodes. Suppose we already have a trained global model ŵ, and a new device k joins the network and wishes to personalize the global model to its own domain. This can be done by performing a few local stochastic gradient descent updates, starting from the given global model as the initial local model:

v_k^{(t+1)} = v_k^{(t)} − η_t ∇_v f_k(α_k v_k^{(t)} + (1 − α_k) ŵ; ξ_k^{(t)}),

to quickly learn a personalized model for the newly joined device. One thing worth investigating is the difference between APFL and meta-learning approaches, such as model-agnostic meta-learning (Finn et al., 2017). Our goal is to share knowledge among different users in order to reduce the generalization error, while meta-learning cares more about building a meta-learner that helps train models faster and with fewer samples. In this scenario, similar to FedAvg, when a new node joins the network it receives the global model and takes a few stochastic steps on its own data to update it. In Figure 6, we show the results of applying FedAvg and APFL to synthetic data with two different rates of diversity, synthetic(0.0, 0.0) and synthetic(0.5, 0.5). In this experiment, we keep 3 nodes and their data out of the entire training, which runs for 100 rounds of communication among the remaining 97 nodes. In each round, each client updates its local and personalized models for one epoch.
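The new-node update above can be sketched in a few lines. This is a toy instantiation under stated assumptions: the client objective f_k is an illustrative quadratic, ŵ is taken as a zero vector standing in for a trained global model, and the chain-rule factor α (from differentiating the mixed point with respect to v_k) is made explicit.

```python
import numpy as np

# Sketch: a newly joined client k personalizes a frozen global model w_hat by a
# few SGD steps on v_k, per the update in the text:
#   v_k <- v_k - eta * grad_v f_k(alpha*v_k + (1-alpha)*w_hat)

target_k = np.array([2.0, 1.0])                # this client's optimum (assumed)
f_k = lambda x: 0.5 * float(np.sum((x - target_k) ** 2))
grad_f_k = lambda x: x - target_k

w_hat = np.zeros(2)                             # trained global model (assumed)
alpha, eta = 0.75, 0.1
v_k = w_hat.copy()                              # initialize from the global model

losses = []
for _ in range(20):
    mixed = alpha * v_k + (1 - alpha) * w_hat
    v_k -= eta * alpha * grad_f_k(mixed)        # chain rule: d(mixed)/d(v_k) = alpha
    losses.append(f_k(alpha * v_k + (1 - alpha) * w_hat))
print(losses[0] > losses[-1])  # the personalized loss decreases
```

A handful of such steps is exactly the warm-start procedure evaluated in Figure 6: the global model gives the new client a good initialization, and local steps adapt the mixture to its own data.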
After the training is done, those 3 clients join the network, receive the latest global model, and start training their own local and personalized models. Figure 6 shows the training loss and validation accuracy of these 3 nodes during 5 epochs of updates. The local model represents the model that would be trained in FedAvg, while the personalized model is the one resulting from APFL. Although the goal of APFL is to adaptively learn the personalized model during training, it can be seen that APFL learns a better personalized model in this meta-learning scenario as well.

Agnostic global model. As pointed out by Mohri et al. (2019), the global model can be made distributionally robust by optimizing the agnostic loss:

min_{w∈R^d} max_{q∈Δ_n} F(w, q) := Σ_{i=1}^n q_i f_i(w),

where Δ_n = {q ∈ R_+^n | Σ_i q_i = 1} is the n-dimensional simplex. We call this scenario "Adaptive Personalized Agnostic Federated Learning". In this case, the analysis becomes more challenging, since the global empirical risk minimization is performed on an entirely different domain, so the risk upper bound we derived for h_{α_i} no longer holds. Also, from a computational standpoint, since the resulting problem is a minimax optimization problem, the convergence analysis of agnostic APFL is more involved; we leave it as interesting future work.
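As a concrete reading of this minimax objective, here is a sketch that alternates gradient descent on w with projected gradient ascent on q over the simplex. The quadratic client losses, step sizes, and iteration count are illustrative assumptions; the simplex projection is the standard sorting-based Euclidean projection.

```python
import numpy as np

def project_simplex(q):
    """Euclidean projection onto the probability simplex (sorting method)."""
    u = np.sort(q)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(q) + 1) > 0)[0][-1]
    tau = (1 - css[rho]) / (rho + 1)
    return np.maximum(q + tau, 0.0)

# Sketch of min_w max_{q in simplex} sum_i q_i f_i(w) by descent on w and
# projected ascent on q. The quadratic client losses below are toy examples.
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([3.0, 3.0])]
f = [lambda w, t=t: 0.5 * np.sum((w - t) ** 2) for t in targets]
grad = [lambda w, t=t: w - t for t in targets]

w = np.zeros(2)
q = np.ones(3) / 3
for _ in range(200):
    losses = np.array([fi(w) for fi in f])
    q = project_simplex(q + 0.05 * losses)                   # ascent on q
    w -= 0.05 * sum(qi * gi(w) for qi, gi in zip(q, grad))   # descent on w
print(q.round(2))
```

The ascent step shifts the weight q toward the worst-off clients, which is exactly the distributional-robustness behavior the agnostic objective is designed to produce.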

D PROOF OF GENERALIZATION BOUND

In this section we present the proof of the generalization bound for the APFL algorithm. Recall that we define the following hypotheses on the ith local true and empirical distributions:

ĥ_i^* = arg min_{h∈H} L̂_{D_i}(h) (LOCAL EMPIRICAL RISK MINIMIZER)
h_i^* = arg min_{h∈H} L_{D_i}(h) (LOCAL TRUE RISK MINIMIZER)
h̄^* = arg min_{h∈H} L_{D̄}(h) (GLOBAL EMPIRICAL RISK MINIMIZER)
ĥ*_{loc,i} = arg min_{h∈H} L̂_{D_i}(α_i h + (1 − α_i) h̄^*) (MIXED EMPIRICAL RISK MINIMIZER)
h*_{loc,i} = arg min_{h∈H} L_{D_i}(α_i h + (1 − α_i) h̄^*) (MIXED TRUE RISK MINIMIZER)

where L̂_{D_i}(h) and L_{D_i}(h) denote the empirical and true risks on D_i, respectively. From a high-level technical view, since we wish to bound the risk of the mixed model on the local distribution D_i, we first use the convexity of the risk function to decompose it into two parts, L_{D_i}(ĥ*_{loc,i}) and L_{D_i}(h̄^*). To bound L_{D_i}(ĥ*_{loc,i}), a natural idea is to characterize it by the risk of the optimal model, L_{D_i}(h_i^*), plus some excess risk. However, because ĥ*_{loc,i} is not purely the local empirical risk minimizer but partially incorporates the global model, we need to characterize how far it drifts from the local empirical risk minimizer ĥ_i^*. This drift can be described via the hypothesis capacity, which motivates our definition of λ_H(S) to quantify the empirical loss discrepancy over S among pairs of hypotheses in H. We admit that a tighter theory for bounding this drift, depending on how the global model is incorporated, should be possible; we leave it as future work. The following simple result will be useful in the proof of the generalization bound.

Lemma 1. For any two distributions D and D' and any h ∈ H, if the loss is bounded by B, then

L_D(h) ≤ L_{D'}(h) + B ‖D − D'‖_1, where ‖D − D'‖_1 = ∫_Ξ |P_{(x,y)∼D} − P_{(x,y)∼D'}| dx dy.

Proof. L_D(h) ≤ L_{D'}(h) + |L_D(h) − L_{D'}(h)| ≤ L_{D'}(h) + ∫_Ξ |ℓ(y, h(x))| |P_{(x,y)∼D} − P_{(x,y)∼D'}| dx dy ≤ L_{D'}(h) + B ‖D − D'‖_1.

Proof of Theorem 1. We now turn to proving the generalization bound for the proposed APFL algorithm.
Recall that for the classification task we consider the squared hinge loss, and for the regression case the MSE loss. We first show that in both cases we can decompose the risk as

L_{D_i}(h*_{α_i}) ≤ 2α_i² L_{D_i}(ĥ*_{loc,i}) + 2(1 − α_i)² L_{D_i}(h̄^*).

We start with the classification case. The hinge loss max{0, 1 − z} is convex in z, so by Jensen's inequality

max{0, 1 − y(α_i h + (1 − α_i) h')} ≤ α_i max{0, 1 − y h} + (1 − α_i) max{0, 1 − y h'}.

Hence we have:

L_{D_i}(h*_{α_i}) = L_{D_i}(α_i ĥ*_{loc,i} + (1 − α_i) h̄^*)
= E_{(x,y)∼D_i} [ max{0, 1 − y(α_i ĥ*_{loc,i}(x) + (1 − α_i) h̄^*(x))} ]²
≤ E_{(x,y)∼D_i} [ α_i max{0, 1 − y ĥ*_{loc,i}(x)} + (1 − α_i) max{0, 1 − y h̄^*(x)} ]²
≤ 2α_i² E_{(x,y)∼D_i} [ max{0, 1 − y ĥ*_{loc,i}(x)} ]² + 2(1 − α_i)² E_{(x,y)∼D_i} [ max{0, 1 − y h̄^*(x)} ]²
= 2α_i² L_{D_i}(ĥ*_{loc,i}) + 2(1 − α_i)² L_{D_i}(h̄^*).

For the regression case:

L_{D_i}(h*_{α_i}) = L_{D_i}(α_i ĥ*_{loc,i} + (1 − α_i) h̄^*)
= E_{(x,y)∼D_i} [ y − (α_i ĥ*_{loc,i}(x) + (1 − α_i) h̄^*(x)) ]²
= E_{(x,y)∼D_i} [ α_i (y − ĥ*_{loc,i}(x)) + (1 − α_i)(y − h̄^*(x)) ]²
≤ 2α_i² E_{(x,y)∼D_i} [ y − ĥ*_{loc,i}(x) ]² + 2(1 − α_i)² E_{(x,y)∼D_i} [ y − h̄^*(x) ]²
= 2α_i² L_{D_i}(ĥ*_{loc,i}) + 2(1 − α_i)² L_{D_i}(h̄^*).

Thus we can conclude

L_{D_i}(h*_{α_i}) ≤ 2α_i² L_{D_i}(ĥ*_{loc,i}) + 2(1 − α_i)² L_{D_i}(h̄^*),

and we proceed to bound the two terms on the right-hand side, T_1 = L_{D_i}(ĥ*_{loc,i}) and T_2 = L_{D_i}(h̄^*). We first bound T_1. By the uniform convergence bound, for all h ∈ H,

|L_{D_i}(h) − L̂_{D_i}(h)| ≤ C √((d + log(1/δ))/m_i),

where C is a constant factor. So we can bound T_1 as:

T_1 = L_{D_i}(ĥ*_{loc,i})
= L_{D_i}(h_i^*) + [ L_{D_i}(ĥ*_{loc,i}) − L̂_{D_i}(ĥ*_{loc,i}) ] + [ L̂_{D_i}(ĥ*_{loc,i}) − L̂_{D_i}(h_i^*) ] + [ L̂_{D_i}(h_i^*) − L_{D_i}(h_i^*) ]
≤ L_{D_i}(h_i^*) + 2C √((d + log(1/δ))/m_i) + L̂_{D_i}(ĥ*_{loc,i}) − L̂_{D_i}(ĥ_i^*),

where the first and last bracketed terms are each at most C √((d + log(1/δ))/m_i) by uniform convergence, and L̂_{D_i}(h_i^*) ≥ L̂_{D_i}(ĥ_i^*) since ĥ_i^* is the empirical minimizer.
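The decomposition rests on the elementary pointwise inequality (x + y)² ≤ 2x² + 2y² applied with x = α a and y = (1 − α) b. A quick numeric sanity check of this step (illustrative only, not part of the proof):

```python
import numpy as np

# Pointwise check of the inequality behind the risk decomposition:
#   (alpha*a + (1-alpha)*b)^2 <= 2*alpha^2*a^2 + 2*(1-alpha)^2*b^2,
# which follows from (x + y)^2 <= 2x^2 + 2y^2 with x = alpha*a, y = (1-alpha)*b.
rng = np.random.default_rng(0)
for _ in range(1000):
    a, b = rng.normal(size=2) * 10
    alpha = rng.uniform()
    lhs = (alpha * a + (1 - alpha) * b) ** 2
    rhs = 2 * alpha**2 * a**2 + 2 * (1 - alpha) ** 2 * b**2
    assert lhs <= rhs + 1e-9
print("inequality verified on 1000 random samples")
```

Taking expectations of the pointwise inequality over (x, y) ∼ D_i is what produces the 2α_i² and 2(1 − α_i)² factors in the risk bound.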

Note that

L̂_{D_i}(ĥ*_{loc,i}) − L̂_{D_i}(ĥ_i^*) ≤ G (1/|S_i|) Σ_{(x,y)∈S_i} |ĥ*_{loc,i}(x) − ĥ_i^*(x)| ≤ G λ_H(S_i).

As a result, we can bound T_1 by

T_1 ≤ L_{D_i}(h_i^*) + 2C √((d + log(1/δ))/m_i) + G λ_H(S_i).

We now turn to bounding T_2. Plugging Lemma 1 into (11) and using the uniform generalization risk bound immediately gives

T_2 ≤ L_D̄(h̄^*) + B ‖D̄ − D_i‖_1 + C √((d + log(1/δ))/m).

Plugging T_1 and T_2 back into (11) concludes the proof.

Remark 3. One thing worth mentioning is that we assume the customary boundedness of the loss function. This is satisfied whenever the data and the parameters of the hypothesis are bounded: for example, if we learn a linear model w under the constraint ‖w‖ ≤ 1 and the data tuples (x, y) are drawn from a bounded domain, then the loss is obviously bounded by some finite real value. Existing multiple-domain learning bounds, e.g., those of Mansour et al. (2020), are only applicable to settings where one directly learns a model on some combination of the source and target domains, whereas in our setting we partially incorporate the model learned on the source domain and then perform ERM on the joint model over the target domain. Moreover, their results only apply to simple loss functions, e.g., the absolute loss or MSE loss, while we consider the squared hinge loss in the classification case. Analogous to multiple-domain theory, we derive our multi-domain learning bound based on the divergence of the source and target domains, measured in the absolute distance ‖·‖_1. As Mansour et al. (2009) point out, divergence measured by the absolute loss can be large; we therefore leave the development of a more general multiple-domain learning theory that handles the most popular loss functions, such as the hinge loss, cross-entropy loss, and optimal transport, with tighter divergence measures on distributions, as an open question.

E PROOF OF CONVERGENCE RATE IN CONVEX SETTING

In this section, we present the proofs of the convergence rates. For ease of exposition, we first consider the case without client sampling at each communication step and then generalize the proof to the setting where K devices are sampled uniformly at random by the server, as in the proposed algorithm.

Technical challenges. The analysis of convergence rates in our setting is more involved than the analysis of local SGD with periodic averaging by Stich (2018); Woodworth et al. (2020a). The key difficulty arises from the fact that, unlike local SGD, where local solutions evolve by mini-batch SGD alone, in our setting we also partially incorporate the global model when computing stochastic gradients over local data. In addition, our goal is to establish the convergence rate of the mixed model, rather than merely that of the local or global model. To illustrate this, let us first fix the notation for the models used in the analysis, considering for now the simple case K = n (all devices participate in averaging). We define three virtual sequences {w̄^(t)}_{t=1}^T, {v̄_i^(t)}_{t=1}^T, and {v̂_i^(t)}_{t=1}^T, where

w̄^(t) = (1/n) Σ_{j=1}^n w_j^(t), v̄_i^(t) = α_i v_i^(t) + (1 − α_i) w_i^(t), v̂_i^(t) = α_i v_i^(t) + (1 − α_i) w̄^(t).

Since the personalized model incorporates a (1 − α_i) fraction of the global model, the key challenge in the convergence analysis is to determine how much the global model benefits or hurts local convergence. To this end, we analyze how far the dynamics of the personalized model v̂_i^(t) and the global model w̄^(t) differ from each other at each iteration; more specifically, we study the distance between the gradients, ‖∇f_i(v̂_i^(t)) − ∇F(w̄^(t))‖². Perhaps surprisingly, we can relate this distance to the gradient diversity, the personalized model convergence, the global model convergence, and the local-global optimality gap:

E‖∇f_i(v̂_i^(t)) − ∇F(w̄^(t))‖² ≤ 6ζ_i + 2L² E‖v̂_i^(t) − v_i^*‖² + 6L² E‖w̄^(t) − w^*‖² + 6L² Δ_i.
The terms E‖v̂_i^(t) − v_i^*‖² and E‖w̄^(t) − w^*‖² converge quickly for smooth, strongly convex objectives, while ζ_i and Δ_i serve as residual errors that capture the heterogeneity among the local functions.

Algorithm 2 (Local Descent APFL without sampling):

input: initial models w_i^(0), v_i^(0), mixing weights α_i for i ∈ [n].
for t = 0, ..., T do
  if τ does not divide t then
    w_i^(t) = Π_W ( w_i^(t−1) − η_t ∇f_i(w_i^(t−1); ξ_i^t) )
    v_i^(t) = Π_W ( v_i^(t−1) − η_t ∇_v f_i(v̄_i^(t−1); ξ_i^t) )
    v̄_i^(t) = α_i v_i^(t) + (1 − α_i) w_i^(t)
  else
    each client sends w_j^(t) to the server
    w̄^(t) = (1/n) Σ_{j=1}^n w_j^(t)
    the server broadcasts w̄^(t) to all clients
  end
end
for i = 1, ..., n do
  output: personalized model v̂_i = (1/S_T) Σ_{t=1}^T p_t ( α_i v_i^(t) + (1 − α_i)(1/n) Σ_{j=1}^n w_j^(t) ); global model ŵ = (1/(n S_T)) Σ_{t=1}^T p_t Σ_{j=1}^n w_j^(t).
end

E.1 PROOF WITHOUT SAMPLING

Before giving the convergence analysis of Algorithm 1 in the main paper, we first discuss a warm-up case: local descent APFL without client sampling. As Algorithm 2 shows, all clients participate in the averaging stage every τ iterations. The convergence of the global and local models in Algorithm 2 is given in the following theorems. We start with the convergence of the global model.

Theorem 4 (Global model convergence of Local Descent APFL without Sampling). If each client's objective function is μ-strongly convex and L-smooth and satisfies Assumption 1, then using Algorithm 2 with mixing weights α_i ≥ max{1 − 1/(4√6 κ), 1 − 1/(4√6 κ√μ)}, learning rate η_t = 16/(μ(t + a)), where a = max{128κ, τ}, and the averaging scheme ŵ = (1/(n S_T)) Σ_{t=1}^T p_t Σ_{j=1}^n w_j^(t) with p_t = (t + a)² and S_T = Σ_{t=1}^T p_t, the following convergence holds:

E[F(ŵ)] − F(w^*) ≤ O(μ/T³) + O(κ²(τσ² + τ ζ̄_n)/(μT²)) + O(κ²(τσ² + τ ζ̄_n) ln T/(μT³)) + O(σ²/(nT)),

where w^* = arg min_w F(w) is the optimal global solution.

Proof. The proof is deferred to Appendix E.1.2.

The following theorem gives the convergence of the personalized model in Algorithm 2. Theorem 5 (Personalized model convergence of Local Descent APFL without Sampling).
If each client's objective function is μ-strongly convex and L-smooth and satisfies Assumption 1, then using Algorithm 2 with mixing weights α_i ≥ max{1 − 1/(4√6 κ), 1 − 1/(4√6 κ√μ)}, learning rate η_t = 16/(μ(t + a)), where a = max{128κ, τ}, and the averaging scheme v̂_i = (1/S_T) Σ_{t=1}^T p_t ( α_i v_i^(t) + (1 − α_i)(1/n) Σ_{j=1}^n w_j^(t) ), with p_t = (t + a)² and S_T = Σ_{t=1}^T p_t, and letting f_i^* denote the local minimum of the ith client, the following convergence holds for all i ∈ [n]:

E[f_i(v̂_i)] − f_i^* ≤ O(μ/T³) + α_i² O(σ²/(μT)) + (1 − α_i)² O(ζ_i/μ + κLΔ_i) + (1 − α_i)² O(κL ln T/T³) + O(κ²σ²/(μnT)) + O(κ²(τσ² + τ(ζ_i + ζ̄_n))/(μT²)) + O(κ⁴(τσ² + 2τ ζ̄_n)/(μT²)).

Proof. The proof is deferred to Appendix E.1.3.
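To make the algorithm analyzed in these theorems concrete, here is a minimal runnable sketch of local-descent APFL without sampling: every client runs SGD on its local copy of the global model and on its personalized model (gradients taken at the mixed point), and every τ steps the server averages the local copies. The quadratic client objectives, dimensions, and step size are illustrative assumptions; the projection step is omitted for simplicity.

```python
import numpy as np

# Sketch of local-descent APFL without sampling: per-step local updates on
# w_i and v_i, with the w_i's averaged by the server every tau iterations.
n, d, tau, eta, T = 4, 3, 5, 0.05, 100
rng = np.random.default_rng(1)
targets = rng.normal(size=(n, d))               # client optima (assumed toy objectives)
alphas = np.full(n, 0.5)

w = np.zeros((n, d))                             # local copies of the global model
v = np.zeros((n, d))                             # personalized models
for t in range(1, T + 1):
    for i in range(n):
        grad_w = w[i] - targets[i]               # grad f_i(w_i) for f_i quadratic
        v_bar = alphas[i] * v[i] + (1 - alphas[i]) * w[i]
        grad_v = alphas[i] * (v_bar - targets[i])  # grad_v f_i(v_bar), chain rule
        w[i] -= eta * grad_w
        v[i] -= eta * grad_v
    if t % tau == 0:                             # communication round
        w[:] = w.mean(axis=0)                    # server averages and broadcasts

personalized = alphas[:, None] * v + (1 - alphas[:, None]) * w
print(np.allclose(w[0], w[1]))  # all local copies agree right after a sync
```

The deterministic gradients here stand in for the stochastic gradients ∇f_i(·; ξ) of the algorithm; the weighted averaging of iterates used in the theorems' output schemes is likewise omitted from this sketch.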

E.1.1 PROOF OF USEFUL LEMMAS

Before proving Theorems 4 and 5, we first establish a few useful lemmas. Recall the virtual sequences {w̄^(t)}_{t=1}^T, {v̄_i^(t)}_{t=1}^T, {v̂_i^(t)}_{t=1}^T, where w̄^(t) = (1/n) Σ_{i=1}^n w_i^(t), v̄_i^(t) = α_i v_i^(t) + (1 − α_i) w_i^(t), and v̂_i^(t) = α_i v_i^(t) + (1 − α_i) w̄^(t). We start with the following lemma, which bounds the difference between the gradients of the local objective and the global objective evaluated at the local and global models.

Lemma 2. For Algorithm 2, at each iteration, the gap between the local and global gradients is bounded by

E‖∇f_i(v̂_i^(t)) − ∇F(w̄^(t))‖² ≤ 2L² E‖v̂_i^(t) − v_i^*‖² + 6ζ_i + 6L² E‖w̄^(t) − w^*‖² + 6L² Δ_i.

Proof. From the smoothness assumption and by applying Jensen's inequality we have:

E‖∇f_i(v̂_i^(t)) − ∇F(w̄^(t))‖²
≤ 2E‖∇f_i(v̂_i^(t)) − ∇f_i(v_i^*)‖² + 2E‖∇f_i(v_i^*) − ∇F(w̄^(t))‖²
≤ 2L² E‖v̂_i^(t) − v_i^*‖² + 6E‖∇f_i(v_i^*) − ∇f_i(w^*)‖² + 6E‖∇f_i(w^*) − ∇F(w^*)‖² + 6E‖∇F(w^*) − ∇F(w̄^(t))‖²
≤ 2L² E‖v̂_i^(t) − v_i^*‖² + 6L² E‖v_i^* − w^*‖² + 6ζ_i + 6L² E‖w̄^(t) − w^*‖²
= 2L² E‖v̂_i^(t) − v_i^*‖² + 6L² Δ_i + 6ζ_i + 6L² E‖w̄^(t) − w^*‖².

Lemma 3 (Local model deviation without sampling). For Algorithm 2, at each iteration, the deviation between each local version of the global model, w_i^(t), and the global model w̄^(t) is bounded by:

E‖w̄^(t) − w_i^(t)‖² ≤ 3τσ² η_{t−1}² + 3(ζ_i + ζ̄_n) τ² η_{t−1}²,
(1/n) Σ_{i=1}^n E‖w̄^(t) − w_i^(t)‖² ≤ 3τσ² η_{t−1}² + 6τ² ζ̄_n η_{t−1}²,

where ζ̄_n = (1/n) Σ_{i=1}^n ζ_i.

Proof. According to Lemma 8 in Woodworth et al. (2020a):

E‖w̄^(t) − w_i^(t)‖² ≤ (1/n) Σ_{j=1}^n E‖w_j^(t) − w_i^(t)‖² ≤ 3(σ² + ζ_i τ + ζ̄_n τ) Σ_{p=t_c}^{t−1} η_p² Π_{q=p+1}^{t−1} (1 − μη_q),
(1/n) Σ_{i=1}^n E‖w̄^(t) − w_i^(t)‖² ≤ (1/n²) Σ_{i=1}^n Σ_{j=1}^n E‖w_j^(t) − w_i^(t)‖² ≤ 3(σ² + 2τ ζ̄_n) Σ_{p=t_c}^{t−1} η_p² Π_{q=p+1}^{t−1} (1 − μη_q).
Plugging in η q = 16 µ(a+q) yields: E w (t) -w (t) i 2 ≤ 3 σ 2 + ζ i τ + ζ n τ t-1 p=tc η 2 p t-1 q=p+1 a + q -16 a + q ≤ 3 σ 2 + ζ i τ + ζ n τ t-1 p=tc η 2 p t-1 q=p+1 a + q -16 a + q ≤ 3 σ 2 + ζ i τ + ζ n τ t-1 p=tc η 2 p t-1 q=p+1 a + q -2 a + q ≤ 3 σ 2 + ζ i τ + ζ n τ t-1 p=tc η 2 p (a + p -1)(a + p) (a + t -2)(a + t -1) ≤ 3 σ 2 + ζ i τ + ζ n τ t-1 p=tc η 2 p η 2 t-1 η 2 p ≤ 3τ σ 2 + ζ i τ + ζ n τ η 2 t-1 . Similarly, 1 n n i=1 E w (t) -w (t) i 2 ≤ 3τ σ 2 η 2 t-1 + 6τ 2 ζ n η 2 t-1 . Lemma 4. (Convergence of global model) Let w (t) = 1 n n i=1 w (t) i . Under the setting of Theorem 5, we have: E w (T +1) -w * 2 ≤ a 3 (T + a) 3 E w (1) -w * 2 + T + 16 1 a + 1 + ln(T + a) 1536a 2 τ σ 2 + 2τ ζ n L 2 (a -1) 2 µ 4 (T + a) 3 + 128σ 2 T (T + 2a) nµ 2 (T + a) 3 . Proof. Using the updating rule and non-expensive property of projection, as well as applying strong convexity and smoothness assumptions yields: E w (t+1) -w * 2 ≤ E w (t) -w (t) -ηt 1 n n j=1 ∇fj(w (t) j ; ξ t j ) -w * 2 ≤ E w (t) -w * 2 -2ηtE 1 n n j=1 ∇fj(w (t) j ), w (t) -w * + η 2 t σ 2 n + η 2 t E 1 n n j=1 ∇fj(w (t) j ) 2 ≤ E w (t) -w * 2 -2ηtE ∇F (w (t) ), w (t) -w * + η 2 t σ 2 n + η 2 t E 1 n n j=1 ∇fj(w (t) j ) 2 T 1 -2ηtE 1 n n j=1 ∇fj(w (t) j ) -∇F (w (t) ), w (t) -w * T 2 ≤ (1 -µηt)E w (t) -w * 2 -2ηt(E[F (w (t) )] -F (w * )) + η 2 t σ 2 n + T1 + T2, where at the last step we used the strongly convex property. Now we are going to bound T 1 . By the Jensen's inequality and smoothness, we have: T 1 ≤ 2η 2 t E    1 n n j=1 ∇f j (w (t) j ) -∇F (w (t) ) 2    + 2η 2 t E ∇F (w (t) ) 2 ≤ 2η 2 t L 2 1 n n j=1 E w (t) j -w (t) 2 + 4η 2 t L E F (w (t) ) -F (w * ) Then, we bound T 2 as: T 2 ≤ η t    2 µ E    1 n n j=1 ∇f j (w (t) j ) -∇F (w (t) ) 2    + µ 2 E w (t) -w * 2    ≤ 2η t L 2 µ 1 n n j=1 E w (t) j -w (t) 2 + µη t 2 E w (t) -w * 2 . 
Now, by plugging back T 1 and T 2 from ( 13) and ( 14) in ( 12), we have: E w (t+1) -w * 2 ≤ 1 - µη t 2 E w (t) -w * 2 -(2η t -4η 2 t L) ≤-ηt E F (w (t) ) -F (w * ) + η 2 t σ 2 n + 2η t L 2 µ + 2η 2 t L 2 1 n n j=1 E w (t) j -w (t) 2 ≤ 1 - µη t 2 E w (t) -w * 2 + η 2 t σ 2 n + 2η t L 2 µ + 2η 2 t L 2 1 n n j=1 E w (t) j -w (t) 2 . Now, by using Lemma 3 we have: E w (t+1) -w * 2 ≤ 1 - µη t 2 E w (t) -w * 2 + 2η t L 2 µ + 2η 2 t L 2 3τ σ 2 + 2τ ζ n η 2 t-1 + η 2 t σ 2 n . Note that (1 -µηt 2 ) pt ηt = µ(t+a) 2 (t-8+a) 16 ≤ µ(t-1+a) 3

16

= pt-1 ηt-1 , so we multiply pt ηt on both sides and do the telescoping sum: p T η T E w (T +1) -w * 2 ≤ p 0 η 0 E w (1) -w * 2 + T t=1 2L 2 µ + 2η t L 2 3τ σ 2 + 2τ ζ n p t η 2 t-1 + T t=1 p t η t σ 2 n ≤ p 0 η 0 E w (1) -w * 2 + T t=1 2L 2 µ + 2η t L 2 3τ σ 2 + 2τ ζ n 256a 2 µ 2 (a -1) 2 + T t=1 p t η t σ 2 n . (16) Then, by re-arranging the terms will conclude the proof: E w (T +1) -w * 2 ≤ a 3 (T + a) 3 E w (1) -w * 2 + T + 16 1 a + 1 + ln(T + a) 1536a 2 τ σ 2 + 2τ ζ n L 2 (a -1) 2 µ 4 (T + a) 3 + 128σ 2 T (T + 2a) nµ 2 (T + a) 3 , where we use the inequality T t=1 1 t+a ≤ 1 a+1 + T 1 1 t+a < 1 a+1 + ln(T + a).

E.1.2 PROOF OF THEOREM 4

Proof. According to ( 15) and ( 16) in the proof of Lemma 4 we have: pT ηT E w (T +1) -w * 2 ≤ p0 η0 E w (1) -w * 2 - T t=1 pt E F (w (t) ) -F (w * ) + T t=1 2L 2 µ + 2ηtL 2 3τ σ 2 + 2τ ζ n 256a 2 µ 2 (a -1) 2 + T t=1 ptηt σ 2 n , re-arranging term and dividing both sides by S T = T t=1 p t > T 3 yields: 1 ST T t=1 pt E F (w (t) ) -F (w * ) ≤ p0 ST η0 E w (1) -w * 2 + 1 ST T t=1 2L 2 µ + 2ηtL 2 3τ σ 2 + 2τ ζ n 256a 2 µ 2 (a -1) 2 + 1 ST T t=1 ptηt σ 2 n ≤ O µ T 3 + O κ 2 τ σ 2 + 2τ ζ n µT 2 + O κ 2 τ σ 2 + 2τ ζ n ln T µT 3 + O σ 2 nT . Recall that ŵ = 1 nS T T t=1 n j=1 w (t) j and convexity of F , we can conclude that: E [F ( ŵ)] -F (w * ) ≤ O µ T 3 + O κ 2 τ σ 2 + 2τ ζ n µT 2 + O κ 2 τ σ 2 + 2τ ζ n ln T µT 3 + O σ 2 nT .

E.1.3 PROOF OF THEOREM 5

Proof. Recall that we defined virtual sequences {w (t) } T t=1 where w (t) , then by the updating rule and non-expensiveness of projection we have: (t) = 1 n n i=1 w (t) i and v(t) i = α i v (t) i + (1 -α i )w E v(t+1) i -v * i 2 ≤ E v(t) i -α 2 i ηt∇fi(v (t) i ) -(1 -αi)ηt 1 n n j=1 ∇fj(w (t) j ) -v * i 2 + E α 2 i ηt(∇fi(v (t) i ) -∇fi(v (t) i ; ξ t i )) + (1 -αi)ηt 1 n j∈U t ∇fj(w (t) j ) -∇fj(w (t) j ; ξ t j ) 2 ≤ E v(t) i -v * i 2 -2E α 2 i ηt∇fi(v (t) i ) + (1 -αi)ηt 1 n n j=1 ∇fj(w (t) j ), v(t) i -v * i + η 2 t E α 2 i ∇fi(v (t) i ) + (1 -αi) 1 n n j=1 ∇fj(w (t) j ) 2 + α 2 i η 2 t σ 2 + (1 -αi) 2 η 2 t σ 2 n = E v(t) i -v * i 2 -2(α 2 i + 1 -αi)ηtE ∇fi(v (t) i ), v(t) i -v * i T 1 -2ηt(1 -αi)E 1 n n j=1 ∇fj(w (t) j ) -∇fi(v (t) i ), v(t) i -v * i T 2 + η 2 t E α 2 i ∇fi(v (t) i ) + (1 -αi) 1 n n j=1 ∇fj(w (t) j ) 2 T 3 +α 2 i η 2 t σ 2 + (1 -αi) 2 η 2 t σ 2 n . Now, we bound the term T 1 as follows: T1 = -2ηt(α 2 i + 1 -αi)E ∇fi(v (t) i ), v(t) i -v * i -2ηt(α 2 i + 1 -αi)E ∇fi(v (t) i ) -∇fi(v (t) i ), v(t) i -v * i ≤ -2ηt(α 2 i + 1 -αi) E fi(v (t) i ) -fi(v * i ) + µ 2 E v(t) i -v * i 2 + (α 2 i + 1 -αi)ηt 8L 2 µ(1 -8(αi -α 2 i )) E v(t) i - v(t) i 2 + µ(1 -8(αi -α 2 i )) 8 E v(t) i -v * i 2 ≤ -2ηt(α 2 i + 1 -αi) E fi(v (t) i ) -fi(v * i ) + µ 2 E v(t) i -v * i 2 + ηt 8L 2 (1 -αi) 2 µ(1 -8(αi -α 2 i )) E w (t) -w (t) i 2 + µ(1 -8(αi -α 2 i )) 8 E v(t) i -v * i 2 ≤ -2ηt(α 2 i + 1 -αi) E fi(v (t) i ) -fi(v * i ) - 7µηt 8 E v(t) i -v * i 2 + 8ηtL 2 (1 -αi) 2 µ(1 -8(αi -α 2 i )) E w (t) -w (t) i 2 , where we use the fact (α 2 i + 1 -α i ) ≤ 1. Note that, because we set α i ≥ max{1 -1 4 √ 6κ , 1 - 1 4 √ 6κ √ µ }, and hence 1 -8(α i -α 2 i ) ≥ 0, so in the second inequality we can use the arithmeticgeometry inequality. 
Next, we turn to bounding the term T 2 in (17): T2 = -2ηt(1 -αi)E i -v * i ≤ ηt(1 -αi) 2(1 -αi) µ E ∇fi(v + µ 2(1 -αi) E v(t) i -v * i 2 ≤ 6(1 -αi) 2 ηt µ E ∇fi(v (t) i ) -∇fi(v (t) i ) 2 + E ∇fi(v (t) i ) -∇F (w (t) ) 2 +E ∇F (w (t) ) - 1 n n j=1 ∇fj(w (t) j ) 2 + ηtµ 2 E v(t) i -v * i 2 ≤ 6(1 -αi) 2 ηt µ L 2 E w (t) -w (t) i 2 + E ∇fi(v (t) i ) -∇F (w (t) ) 2 +E ∇F (w (t) ) - 1 n n j=1 ∇fj(w (t) j ) 2 + ηtµ 2 E v(t) i -v * i 2 . And finally, we bound the term T 3 in (17) as follows: T3 = E α 2 i ∇fi(v (t) i ) + (1 -αi) 1 n n j=1 ∇fj(w (t) j ) 2 ≤ 2(α 2 i + 1 -αi) 2 E ∇fi(v (t) i ) 2 + 2E (1 -αi) 1 n n j=1 ∇fj(w (t) j ) -∇fi(v (t) i ) 2 ≤ 2 2(α 2 i + 1 -αi) 2 E ∇fi(v (t) i ) -∇f * i 2 + 2(α 2 i + 1 -αi) 2 E ∇fi(v (t) i ) -∇fi(v (t) i ) 2 + 2(1 -αi) 2 E 1 n n j=1 ∇fj(w (t) j ) -∇fi(v (t) i ) 2 ≤ 8L(α 2 i + 1 -αi) E fi(v (t) i ) -f * i + 4(1 -αi) 2 L 2 E w (t) -w (t) i 2 + 6(1 -αi) 2 L 2 E w (t) -w (t) i 2 + E ∇fi(v (t) i ) -∇F (w (t) ) 2 + 1 n n j=1 L 2 E w (t) -w (t) j 2 . Now, using Lemma 3, (1 -α i ) 2 ≤ 1 and plugging back T 1 , T 2 , and T 3 from (18), (19), and (20) into (17), yields: E v(t+1) i -v * i 2 ≤ 1 - 3µηt 8 E v(t) i -v * i 2 -2(ηt -4η 2 t L)(α 2 i + 1 -αi) E fi(v (t) i ) -fi(v * i ) + 8ηtL 2 (1 -αi) 2 µ(1 -8(αi -α 2 i )) + 6(1 -αi) 2 ηtL 2 µ + 10(1 -αi) 2 η 2 t L 2 E w (t) -w (t) i 2 + 6(1 -αi) 2 ηtL 2 µ + 6(1 -αi) 2 η 2 t L 2 1 n n j=1 E w (t) -w (t) j 2 + 6ηt µ + 6η 2 t (1 -αi) 2 E ∇F (w (t) ) -∇fi(v (t) i ) 2 + α 2 i η 2 t σ 2 + (1 -αi) 2 η 2 t σ 2 n , ≤ 1 - 3µηt 8 E v(t) i -v * i 2 -2(ηt -4η 2 t L)(α 2 i + 1 -αi) E fi(v (t) i ) -fi(v * i ) + 8ηtL 2 (1 -αi) 2 µ(1 -8(αi -α 2 i )) + 6(1 -αi) 2 ηtL 2 µ + 10(1 -αi) 2 η 2 t L 2 3τ σ 2 + (ζi + ζ n )τ η 2 t-1 + 6(1 -αi) 2 ηtL 2 µ + 6(1 -αi) 2 η 2 t L 2 3τ σ 2 + 2 ζ n τ η 2 t-1 + 6ηt µ + 6η 2 t (1 -αi) 2 E ∇F (w (t) ) -∇fi(v (t) i ) 2 T 4 +α 2 i η 2 t σ 2 + (1 -αi) 2 η 2 t σ 2 n , where using Lemma 2 we can bound T 4 as: T 4 ≤ 6η t µ (1 -α i ) 2 2L 2 E v(t) i -v * 2 + 6ζ i + 6L 2 E w (t) -w * 2 + 6L 2 ∆ 
i + 6η 2 t (1 -α i ) 2 2L 2 E v(t) i -v * 2 + 6ζ i + 6L 2 E w (t) -w * 2 + 6L 2 ∆ i . Note that we choose α i ≥ max{1 -1 4 √ 6κ , 1 - 1 4 √ 6κ √ µ }, hence 12L 2 (1-αi) 2 µ ≤ µ 8 and 12L 2 (1 - α i ) 2 ≤ µ 8 , thereby we have: T 4 ≤ µη t 4 v(t) i -v * 2 + 36η t 1 µ + η t (1 -α i ) 2 ζ i + L 2 E w (t) -w * 2 + L 2 ∆ i . Now, using Lemma 4 we have: T4 ≤ µηt 4 E v(t) i -v * 2 + 36ηt 1 µ + ηt (1 -αi) 2 ζi + L 2 a 3 (t + a -1) 3 E w (1) -w * 2 + t + 16 1 a + 1 + ln(t + a) 1536τ σ 2 + 2τ ζ n L 2 µ 4 (t + a -1) 3 + 128σ 2 t(t + 2a) nµ 2 (t + a -1) 3 + L 2 ∆i . ( ) By plugging back T 4 from ( 23) in (21) and using the fact -(η t -4η 2 t L) ≤ -1 2 η t , and (α 2 i +1-α i ) ≥ 3 4 , we have: E v(t+1) i -v * i 2 ≤ (1 - µηt 8 )E v(t) i -v * i 2 - 3ηt 4 E fi(v (t) i ) -fi(v * i ) + α 2 i η 2 t σ 2 + (1 -αi) 2 η 2 t σ 2 n + 8ηtL 2 (1 -αi) 2 µ(1 -8(αi -α 2 i )) + 6(1 -αi) 2 ηtL 2 µ + 10(1 -αi) 2 η 2 t L 2 3τ σ 2 + (ζi + ζ n )τ η 2 t-1 + 6(1 -αi) 2 ηtL 2 µ + 6(1 -αi) 2 η 2 t L 2 3τ σ 2 + 2 ζ n τ η 2 t-1 + 36ηt 1 µ + ηt (1 -αi) 2   ζi + L 2   a 3 E w (1) -w * 2 (t -1 + a) 3 + t + 16 1 a + 1 + ln(t + a) 1536τ σ 2 + 2τ ζ n L 2 µ 4 (t + a -1) 3 + 128σ 2 t(t + 2a) nµ 2 (t -1 + a) 3 + ∆i . Note that (1 -µηt 8 ) pt ηt ≤ pt-1 ηt-1 where p t = (t + a) 2 , so, we multiply pt ηt on both sides, and re-arrange the terms: 3pt 4 E fi(v (t) i ) -fi(v * i ) ≤ pt-1 ηt-1 E v(t) i -v * i 2 - pt ηt E v(t+1) i -v * i 2 + ptηt α 2 i σ 2 + (1 -αi) 2 σ 2 n + 8L 2 (1 -αi) 2 µ(1 -8(αi -α 2 i )) + 6(1 -αi) 2 L 2 µ + 10(1 -αi) 2 ηtL 2 3τ σ 2 + (ζi + ζ n )τ ptη 2 t-1 + 6(1 -αi) 2 L 2 µ + 6(1 -αi) 2 ηtL 2 3τ σ 2 + 2 ζ n τ ptη 2 t-1 + 36pt 1 µ + ηt (1 -αi) 2 ζi + L 2 ∆i + 36pt 1 µ + ηt (1 -αi) 2 L 2 a 3 (t -1 + a) 3 + (t + 16Θ(ln t)) 1536τ σ 2 + 2τ ζ n L 2 µ 4 (t + a -1) 3 + 128σ 2 t(t + 2a) nµ 2 (t -1 + a) 3 . 
T t=1 p t ≥ T 3 we have: fi(vi) -fi(v * i ) ≤ 1 ST T t=1 pt(fi(v (t) i ) -fi(v * i )) ≤ 4p0E v(1) i -v * i 2 3η0ST + 1 ST 4 3 T t=1 ptηt α 2 i σ 2 + (1 -αi) 2 σ 2 n + 1 ST 4 3 T t=1 8L 2 (1 -αi) 2 µ(1 -8(αi -α 2 i )) + 6(1 -αi) 2 L 2 µ + 10(1 -αi) 2 ηtL 2 3τ σ 2 + (ζi + ζ n )τ ptη 2 t-1 + 1 ST 4 3 T t=1 6(1 -αi) 2 L 2 µ + 6(1 -αi) 2 ηtL 2 3τ σ 2 + 2 ζ n τ ptη 2 t-1 + 48(1 -αi) 2 L 2 ST T t=1 pt 1 µ + ηt a 3 (t -1 + a) 3 + (t + 16Θ(ln t)) 1536τ σ 2 + 2τ ζ n L 2 µ 4 (t + a -1) 3 + 128σ 2 t(t + 2a) nµ 2 (t -1 + a) 3 + 48(1 -αi) 2 ζi + L 2 ∆i 1 ST T t=1 pt 1 µ + ηt ≤ 4p0E v(1) i -v * i 2 3η0ST + 32T (T + a) 3µST α 2 i σ 2 + (1 -αi) 2 σ 2 n + 4(1 -αi) 2 3 8L 2 T µ(1 -8(αi -α 2 i ))ST + 6L 2 T µST + 10L 2 Θ(ln T ) µST 3τ σ 2 + (ζi + ζ n )τ 256a 2 µ 2 (a -1) 2 + 4 3 6(1 -αi) 2 L 2 T µST + 6(1 -αi) 2 L 2 Θ(ln T ) µST 3τ σ 2 + 2 ζ n τ 256a 2 µ 2 (a -1) 2 + 48(1 -αi) 2 L 2 a 2 (a -1) 2 ST a 3 Θ(ln T ) µ + T a + Θ(ln T ) 1536L 2 τ σ 2 + 2τ ζ n µ 5 + 64(2a + 1)σ 2 T (T + a) naµ 3 + 48(1 -αi) 2 L 2 a 2 (a -1) 2 ST 16a 3 π 2 6µ + (Θ(ln T )) 1536L 2 τ σ 2 + 2τ ζ n µ 5 + 2048(2a + 1)σ 2 naµ 3 T + 48(1 -αi) 2 ζi + L 2 ∆i 1 ST ST µ + 8T (T + 2a) µ = O µ T 3 + α 2 i O σ 2 µT + (1 -αi) 2 O ζi µ + κL∆i + (1 -αi) 2 O κL ln T T 3 + O κ 2 σ 2 µnT + O κ 2 τ 2 (ζi + ζ n ) + κ 2 τ σ 2 µT 2 + O κ 4 τ σ 2 + 2τ ζ n µT 2 . where we use the convergence of , where a = max{128κ, τ }, κ = L µ , and using average scheme ŵ = 1 KS T T t=1 p t j∈Ut w (t) j , where p t = (t + a) 2 , S T = T t=1 p t , and letting F * to denote the minimum of the F , then the following convergence holds: E [F ( ŵ)] -F * ≤ O µ T 3 + O κ 2 τ σ 2 + 2τ ζ K µT 2 + O κ 2 τ σ 2 + 2τ ζ K ln T µT 3 + O σ 2 KT , ( ) where τ is the number of local updates (i.e., synchronization gap) . Proof.  E ∇fi(v (t) i ) - 1 K j∈U t ∇fj(w (t) ) 2 ≤ 2L 2 E v(t) i -v * 2 + 6 2ζi + 2 ζ K + 6L 2 E w (t) -w * 2 + 6L 2 ∆i. Proof. 
From the smoothness assumption and by applying the Jensen's inequality we have: ∇fi(v (t) i ) - 1 K j∈U t ∇fj(w (t) ) 2 ≤ 2E ∇fi(v (t) i ) -∇fi(v * i ) 2 + 2E ∇fi(v * i ) - 1 K j∈U t ∇fj(w (t) ) 2 ≤ 2L 2 E v(t) i -v * 2 + 6E ∇fi(v * i ) -∇fi(w * ) 2 + 6E ∇fi(w * ) - 1 K j∈U t ∇fj(w * ) 2 + 6E ∇ 1 K j∈U t ∇fj(w * ) - 1 K j∈U t ∇fj(w (t) ) 2 ≤ 2L 2 E v(t) i -v * 2 + 6L 2 E v * i -w * 2 + 6 2ζi + 2 1 K j∈U t ζj + 6L 2 E w (t) -w * 2 ≤ 2L 2 E v(t) i -v * 2 + 6L 2 ∆i + 6 2ζi + 2 ζ K + 6L 2 E w (t) -w * 2 . Lemma 6 (Local model deviation with sampling). For Algorithm 1, at each iteration, the deviation between each local version of the global model w (t) i and the global model w (t) is bounded by: E w (t) -w (t) i 2 ≤ 3τ σ 2 η 2 t-1 + 3(ζ i + ζ K )τ 2 η 2 t-1 , 1 K i∈Ut E w (t) -w (t) i 2 ≤ 3τ σ 2 η 2 t-1 + 6τ 2 ζ K η 2 t-1 . where ζ K = 1 K n i=1 ζ i . Proof. According to Lemma 8 in Woodworth et al. (2020a): E w (t) -w (t) i 2 ≤ 1 K j∈Ut E w (t) j -w (t) i 2 ≤ 3 σ 2 + ζ i τ + ζ K τ t-1 p=tc η 2 p t-1 q=p+1 (1 -µη q ) 1 n n i=1 E w (t) -w (t) i 2 ≤ 1 n 2 n i=1 n j=1 E w (t) j -w (t) i 2 ≤ 3 σ 2 + 2τ ζ K t-1 p=tc η 2 p t-1 q=p+1 (1 -µη q ) . Then the rest of the proof follows Lemma 3. Lemma 7. (Convergence of Global Model) Let w (t) = 1 K j∈Ut w (t) j . In Theorem 2's setting, using Algorithm 1 by choosing learning rate as η t = 16 µ(t+a) , we have: E w (T +1) -w * 2 ≤ a 3 (T + a) 3 E w (1) -w * 2 + T + 16 1 a + 1 + ln(T + a) 1536a 2 τ σ 2 + 2τ ζ K L 2 (a -1) 2 µ 4 (T + a) 3 + 128σ 2 T (T + 2a) Kµ 2 (T + a) 3 . Proof. 
According to the updating rule and non-expensiveness of projection, and the strong convexity we have: E w (t+1) -w * 2 ≤ E    w (t) -η t 1 K j∈Ut ∇f j (w (t) j ; ξ t j ) -w * 2    ≤ E w (t) -w * 2 -2η t E   1 K j∈Ut ∇f j (w (t) j ), w (t) -w *   + η 2 t E    1 K j∈Ut ∇f j (w (t) j ) 2    + η 2 t σ 2 K ≤ (1 -µη t )E w (t) -w * 2 -(2η t -2Lη 2 t )E F (w (t) ) -F (w * ) + η 2 t σ 2 K + η 2 t 1 K j∈Ut L 2 E w (t) j -w (t) 2 -2η t E   1 K j∈Ut ∇f j (w (t) j ) -∇f j (w (t) ), w (t) -w *   ≤ (1 -µη t )E w (t) -w * 2 -(2η t -4Lη 2 t ) ≤-ηt E F (w (t) ) -F (w * ) + η 2 t σ 2 K + 2η 2 t L 2 1 K j∈Ut E w (t) j -w (t) 2 + 2η t L 2 µ 1 K j∈Ut E w (t) j -w (t) 2 + µη t 2 E w (t) -w * 2 . ( ) Then, merging the term, multiplying both sides with pt ηt , and do the telescoping sum yields: p T η T E w (T +1) -w * 2 ≤ p 0 η 0 E w (1) -w * 2 -E[F (w (t) ) -F (w * )] + T t=1 2L 2 µ + 2η t L 2 p t 1 K j∈Ut E w (t) j -w (t) 2 + T t=1 p t η t σ 2 K . Plugging Lemma 6 into (26) yields: p T η T E w (T +1) -w * 2 ≤ p 0 η 0 E w (1) -w * 2 -E[F (w (t) ) -F (w * )] + T t=1 2L 2 µ + 2η t L 2 3p t η 2 t-1 τ σ 2 + 2τ ζ K + T t=1 p t η t σ 2 K . Then, by re-arranging the terms will conclude the proof as E w (T +1) -w * 2 ≤ a 3 (T + a) 3 E w (1) -w * 2 + T + 16 1 a + 1 + ln(T + a) 1536a 2 L 2 τ σ 2 + 2τ ζ K (a -1) 2 µ 4 (T + a) 3 + 128σ 2 T (T + 2a) Kµ 2 (T + a) 3 .

E.2.2 PROOF OF THEOREM 6

Proof. According to (28) we have: p T η T E w (T +1) -w * 2 ≤ p 0 η 0 E w (1) -w * 2 -E[F (w (t) ) -F (w * )] + T t=1 2L 2 µ + 2η t L 2 3p t η 2 t-1 τ σ 2 + 2τ ζ K + T t=1 p t η t σ 2 K . By re-arranging the terms and dividing both sides by S T = T t=1 p t > T 3 yields: 1 ST T t=1 pt E F (w (t) ) -F (w * ) ≤ p0 ST η0 E w (1) -w * 2 + 1 ST T t=1 2L 2 µ + 2ηtL 2 3ptη 2 t-1 τ σ 2 + 2τ ζ K + 1 ST T t=1 ptηt σ 2 K ≤ O   µE w (1) -w * 2 T 3   + O κ 2 τ σ 2 + 2τ ζ K µT 2 + O κ 2 τ σ 2 + 2τ ζ K ln T µT 3 + O σ 2 KT . Recalling that ŵ = 1  E [F ( ŵ)] -F (w * ) ≤ O µ T 3 + O κ 2 τ σ 2 + 2τ ζ K µT 2 + O κ 2 τ σ 2 + 2τ ζ K ln T µT 3 + O σ 2 KT .

E.2.3 PROOF OF THEOREM 2

Now we provide the formal proof of Theorem 2. The main difference from without-sampling setting is that only a subset of local models get updated each period due to partial participation of devices, i.e., K out of all n devices that are sampled uniformly at random. To generalize the proof, we will use an indicator function to model this stochastic update, and show that while the stochastic gradient is unbiased, the variance is changed. Proof. Recall that we defined virtual sequences of {w (t) } T t=1 where w t) . We also define an indicator variable to denote whether ith client was selected at iteration t: (t) = 1 K j∈Ut w (t) i and v(t) i = α i v (t) i + (1 -α i )w ( I t i = 1 if i ∈ U t 0 else obviously, E [I t i ] = K n . Then, according to updating rule and non-expensiveness of projection we have: E v(t+1) i -v * i 2 ≤ E v(t) i -α 2 i I t i ηt∇fi(v (t) i ) -(1 -αi)ηt 1 K j∈U t ∇fj(w (t) j ) -v * i 2 + E α 2 i I t i ηt ∇fi(v (t) i ) -∇fi(v (t) i ; ξ t i ) + (1 -αi)ηt 1 K j∈U t ∇fj(w (t) j ) - 1 K j∈U t ∇fj(w (t) j ; ξ t ) 2 = E v(t) i -v * i 2 -2 K n α 2 i ηt∇fi(v (t) i ) + (1 -αi)ηt 1 K j∈U t ∇fj(w (t) j ), v(t) i -v * i + η 2 t E α 2 i I t i ∇fi(v (t) i ) + (1 -αi) 1 K j∈U t ∇fj(w (t) j ) 2 + α 2 i η 2 t 2K 2 σ 2 n 2 + (1 -αi) 2 η 2 t 2σ 2 K . = E v(t) i -v * i 2 -2ηt K n α 2 i + 1 -αi ∇fi(v (t) i ), v(t) i -v * i T 1 -2ηt(1 -αi)E 1 K j∈U t ∇fj(w (t) j ) -∇fi(v (t) i ), v(t) i -v * i T 2 + η 2 t E α 2 i I t i ∇fi(v (t) i ) + (1 -αi) 1 K j∈U t ∇fj(w (t) j ) 2 T 3 +α 2 i η 2 t 2K 2 σ 2 n 2 + (1 -αi) 2 η 2 t 2σ 2 K . 
Now we switch to bound T 1 : T 1 = -2η t ( K n α 2 i + 1 -α i )E ∇f i (v (t) i ), v(t) i -v * i -2η t ( K n α 2 i + 1 -α i )E ∇f i (v (t) i ) -∇f i (v (t) i ), v(t) i -v * i ≤ -2η t ( K n α 2 i + 1 -α i ) E f i (v (t) i ) -f i (v * i ) + µ 2 E v(t) i -v * i 2 + ( K n α 2 i + 1 -α i )η t 8L 2 µ(1 -8(α i -α 2 i K n )) E v(t) i - v(t) i 2 + µ(1 -8(α i -α 2 i K n )) 8 E v(t) i -v * i 2 ≤ -2η t ( K n α 2 i + 1 -α i ) E f i (v (t) i ) -f i (v * i ) + µ 2 E v(t) i -v * i 2 + η t 8L 2 (1 -α i ) 2 µ(1 -8(α i -K n α 2 i )) E w (t) -w (t) i 2 + µ(1 -8(α i -K n α 2 i )) 8 E v(t) i -v * i 2 ≤ -2η t ( K n α 2 i + 1 -α i ) E f i (v (t) i ) -f i (v * i ) - 7µη t 8 E v(t) i -v * i 2 + 8η t L 2 (1 -α i ) 2 µ(1 -8(α i -α 2 i )) E w (t) -w (t) i 2 , For T 2 , we use the same approach as we did in (19); To deal with T 3 , we also employ the similar technique in (20): T 3 = E    α 2 i I t i ∇f i (v (t) i ) + (1 -α i ) 1 K j∈Ut ∇f j (w (t) j ) 2    ≤ 2( K n α 2 i + 1 -α i ) 2 E ∇f i (v (t) i ) 2 + 2E    (1 -α i )   1 K j∈Ut ∇f j (w (t) j ) -∇f i (v (t) i )   2    ≤ 2 2( K n α 2 i + 1 -α i ) 2 E ∇f i (v (t) i ) -∇f * i 2 +2( K n α 2 i + 1 -α i ) 2 E ∇f i (v (t) i ) -∇f i (v (t) i ) 2 + 2(1 -α i ) 2 E    1 K j∈Ut ∇f j (w (t) j ) -∇f i (v (t) i ) 2    ≤ 8L( K n α 2 i + 1 -α i ) E f i (v (t) i ) -f * i + 4(1 -α i ) 2 L 2 E w (t) -w (t) i 2 + 6(1 -α i ) 2 L 2 E w (t) -w (t) i 2 + E ∇f i (v (t) i ) -∇F (w (t) ) 2 + 1 K j∈Ut L 2 E w (t) -w (t) j 2   . Then plugging T 1 , T 2 , T 3 back, we obtain the similar formulation as the without sampling case in (17). Thus: E v(t+1) i -v * i 2 ≤ 1 - 3µηt 8 E v(t) i -v * i 2 -2(ηt -4η 2 t L) α 2 i K n + 1 -αi E fi(v (t) i ) -fi(v * i ) + α 2 i η 2 t 2Kσ 2 n + (1 -αi) 2 η 2 t 2σ 2 K + 8ηtL 2 (1 -αi) 2 µ(1 -8(αi -α 2 i K n )) + 6(1 -αi) 2 ηtL 2 µ + 10(1 -αi) 2 η 2 t L 2 E w (t) -w (t) i 2 + 6(1 -αi) 2 ηtL 2 µ + 6(1 -αi) 2 η 2 t L 2 1 K j∈U t E w (t) -w (t) j 2 + 6ηt µ + 6η 2 t (1 -αi) 2 E 1 K j∈U t ∇fj(w (t) ) -∇fi(v (t) i ) 2 . 
We then examine the lower bound of α_i² (K/n) + 1 − α_i. Notice that: α_i² (K/n) + 1 − α_i = (K/n) ( (α_i − n/(2K))² + n/K − n²/(4K²) ).

Case 1:

2K ≥ 1 The lower bound is attained when α i = 1: α 2 i K n + 1 -α i ≥ K n . Case 2: n 2K < 1 The lower bound is attained when α i = n 2K : α 2 i K n + 1 -α i ≥ 1 -n 4K > 1 2 . So α 2 i K n + 1 -α i ≥ b := min{ K n , 1 } always holds. Now we plug it and Lemma 6 back to (31): E v(t+1) i -v * i 2 ≤ 1 - 3µηt 8 E v(t) i -v * i 2 -bηt E fi(v (t) i ) -fi(v * i ) + α 2 i η 2 t 2Kσ 2 n + (1 -αi) 2 η 2 t 2σ 2 K + 8ηtL 2 (1 -αi) 2 µ(1 -8(αi -α 2 i K n )) + 6(1 -αi) 2 ηtL 2 µ + 10(1 -αi) 2 η 2 t L 2 3τ η 2 t-1 σ 2 + (ζi + ζ K )τ + 6(1 -αi) 2 ηtL 2 µ + 6(1 -αi) 2 η 2 t L 2 3τ η 2 t-1 σ 2 + 2 ζ K τ + 6ηt µ + 6η 2 t (1 -αi) 2 E 1 K j∈U t ∇fj(w (t) ) -∇fi(v (t) i ) 2 . ( ) Plugging Lemma 5 yields: E v(t+1) i -v * i 2 ≤ 1 - 3µηt 8 E v(t) i -v * i 2 -bηt E fi(v (t) i ) -fi(v * i ) + α 2 i η 2 t 2Kσ 2 n + (1 -αi) 2 η 2 t 2σ 2 K + 8ηtL 2 (1 -αi) 2 µ(1 -8(αi -α 2 i K n )) + 6(1 -αi) 2 ηtL 2 µ + 10(1 -αi) 2 η 2 t L 2 3τ η 2 t-1 σ 2 + (ζi + ζ K )τ + 6(1 -αi) 2 ηtL 2 µ + 6(1 -αi) 2 η 2 t L 2 3τ η 2 t-1 σ 2 + 2 ζ K τ + 6ηt µ + 6η 2 t (1 -αi) 2 2L 2 E v(t) i -v * 2 + 6 2ζi + 2 ζ K + 6L 2 E w (t) -w * 2 + 6L 2 ∆i . Then following the same procedure in Appendix E.1.3, together with the application of Lemma 7 we can conclude that: fi(vi) -fi(v * i ) ≤ 1 ST T t=1 pt(fi(v (t) i ) -fi(v * i )) ≤ p0E v(1) i -v * i 2 bη0ST + 1 bST T t=1 ptηt α 2 i η 2 t 2Kσ 2 n + (1 -αi) 2 η 2 t 2σ 2 K + 1 bST T t=1 (1 -αi) 2 L 2 8 µ(1 -8(αi -α 2 i K n )) + 6 µ + 10ηt 3τ ptη 2 t-1 σ 2 + (ζi + ζ K )τ + 1 bST T t=1 (1 -αi) 2 L 2 6 µ + 10ηt 3τ ptη 2 t-1 σ 2 + 2 ζ K τ + 36(1 -αi) 2 L 2 bST T t=1 pt 1 µ + ηt a 3 (t -1 + a) 3 E w (1) -w * 2 + t + 16 1 a + 1 + ln(t + a) 1536a 2 τ σ 2 + 2τ ζ K L 2 (a -1) 2 µ 4 (t -1 + a) 3 + 128σ 2 t(t + 2a) Kµ 2 (t -1 + a) 3 + 36(1 -αi) 2 2ζi + 2 ζ K + L 2 ∆i 1 bST T t=1 pt 1 µ + ηt . = O µ bT 3 + α 2 i O σ 2 µbT + (1 -αi) 2 O 2ζi + 2 ζ K µb + κL∆i b + (1 -αi) 2 O κL ln T bT 3 + O κ 2 σ 2 µbKT + O κ 2 τ 2 (ζi + ζ K ) + κ 2 τ σ 2 µbT 2 +O κ 4 τ σ 2 + 2τ ζ K µbT 2 .
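As a quick sanity check on the case analysis above, the following standalone Python snippet (ours, not part of the original derivation) numerically verifies that α²(K/n) + 1 − α ≥ min{K/n, 1/2} over a grid of mixing weights α ∈ [0, 1] and participation sizes K ≤ n:

```python
import itertools

def mixture_coeff(alpha: float, k: int, n: int) -> float:
    # The coefficient alpha^2 * (K/n) + 1 - alpha appearing in the T_1 bound.
    return alpha ** 2 * (k / n) + 1 - alpha

# Grid-search over mixing weights and participation ratios.
for k, n in itertools.product(range(1, 21), repeat=2):
    if k > n:
        continue
    b = min(k / n, 0.5)  # claimed lower bound b = min{K/n, 1/2}
    for step in range(101):
        alpha = step / 100
        assert mixture_coeff(alpha, k, n) >= b - 1e-12, (alpha, k, n)
print("lower bound b = min{K/n, 1/2} verified on the grid")
```

The bound is tight in both cases: at α = 1 when K/n < 1/2, and near α = n/(2K) otherwise.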

F CONVERGENCE RATE WITHOUT ASSUMPTION ON α_i

In this section, we provide the convergence results of Algorithm 1 without assumption on α i . The following Theorem establish the convergence rate: Theorem 7 (Personalized model convergence of Local Descent APFL without assumption on α i ). If each client's objective function is µ-strongly-convex and L-smooth, and its gradient is bounded by G, using Algorithm 1, learning rate: η t = 8 µ(t+a) , where a = max{64κ, τ }, and using average scheme vi = 1  and f  * i is the local minimum of the ith client, then the following convergence holds for all i ∈ [n]: S T T t=1 p t (α i v (t) i + (1 -α i ) 1 K j∈Ut w (t) j ), where p t = (t + a) 2 , S T = T t=1 p t , E[f i (v i )] -f * i ≤ O µ bT 3 + α 2 i O σ 2 µbT + (1 -α i ) 2 O G 2 µb + (1 -α i ) 2 O κL ln T bT 3 + O κ 2 σ 2 µbKT + O κ 2 τ 2 (ζ i + ζ K ) + κ 2 τ σ 2 µbT 2 +O   κ 4 τ σ 2 + 2τ ζ K µbT 2     , where b = min{ K n , 1 2 } Remark 6. Here we remove the assumption α i ≥ max{1-1 4 √ 6κ , 1-1 4 √ 6κ √ µ }. The key difference is that we can only show the residual error with dependency on G, instead of more accurate quantities ζ i and ∆ i . Apparently, when the diversity among data shards is small, ζ i and ∆ i terms become small which leads to a tighter convergence rate. Also notice that, to realize the bounded gradient assumption, we need to require the parameters come from a bounded domain W. Thus, we need to do projection during parameter update, which is inexpensive. Proof. According to (32): E v(t+1) i -v * i 2 ≤ 1 - 3µη t 8 E v(t) i -v * i 2 -bη t E f i (v (t) i ) -f i (v * i ) + α 2 i η 2 t 2Kσ 2 n + (1 -α i ) 2 η 2 t 2σ 2 K + 8η t L 2 (1 -α i ) 2 µ(1 -8(α i -α 2 i K n )) + 6(1 -α i ) 2 η t L 2 µ + 10(1 -α i ) 2 η 2 t L 2 3τ η 2 t-1 σ 2 + (ζ i + ζ K )τ + 6(1 -α i ) 2 η t L 2 µ + 6(1 -α i ) 2 η 2 t L 2 3τ η 2 t-1 σ 2 + 2 ζ K τ + 6η t µ + 6η 2 t (1 -α i ) 2 E    1 K j∈Ut ∇f j (w (t) ) -∇f i (v (t) i ) 2    . Here, we directly use the bound E 1 K j∈Ut ∇f j (w (t) ) -∇f i (v (t) i ) 2 ≤ 2G 2 . 
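Remark 6 notes that realizing the bounded-gradient assumption requires the parameters to come from a bounded domain W, with an inexpensive projection at each update. A minimal sketch of such a projected step, assuming W is a Euclidean ball (the radius and the ball shape are our illustrative choices, not fixed by the paper):

```python
import numpy as np

def project_to_ball(w: np.ndarray, radius: float) -> np.ndarray:
    """Euclidean projection of w onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(w)
    if norm <= radius:
        return w
    return w * (radius / norm)

# Projected gradient step: move along the negative gradient, then project
# back into the bounded domain W so the bounded-gradient assumption holds.
w = np.array([3.0, 4.0])                      # ||w|| = 5
g = np.array([1.0, 0.0])                      # a stand-in gradient
w = project_to_ball(w - 0.1 * g, radius=1.0)
assert np.linalg.norm(w) <= 1.0 + 1e-12
```

The projection costs one norm computation and one scaling, which is why the remark can call it inexpensive.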
Then we have: E v(t+1) i -v * i 2 ≤ 1 - µη t 4 E v(t) i -v * i 2 -bη t E f i (v (t) i ) -f i (v * i ) + α 2 i η 2 t 2Kσ 2 n + (1 -α i ) 2 η 2 t 2σ 2 K + 8η t L 2 (1 -α i ) 2 µ(1 -8(α i -α 2 i K n )) + 6(1 -α i ) 2 η t L 2 µ + 10(1 -α i ) 2 η 2 t L 2 3τ η 2 t-1 σ 2 + (ζ i + ζ K )τ + 6(1 -α i ) 2 η t L 2 µ + 6(1 -α i ) 2 η 2 t L 2 3τ η 2 t-1 σ 2 + 2 ζ K τ + 12η t µ + 12η 2 t (1 -α i ) 2 G 2 . Then following the same procedure in Appendix E.1.3, we can conclude that: fi(vi) -fi(v * i ) ≤ 1 ST T t=1 pt(fi(v (t) i ) -fi(v * i )) ≤ p0E v(1) i -v * i 2 bη0ST + 1 bST T t=1 ptηt α 2 i η 2 t 2Kσ 2 n + (1 -αi) 2 η 2 t 2σ 2 K + 1 bST T t=1 (1 -αi) 2 L 2 8 µ(1 -8(αi -α 2 i K n )) + 6 µ + 10ηt 3τ ptη 2 t-1 σ 2 + (ζi + ζ K )τ + 1 bST T t=1 (1 -αi) 2 L 2 6 µ + 10ηt 3τ ptη 2 t-1 σ 2 + 2 ζ K τ + 12(1 -αi) 2 G 2 1 bST T t=1 pt 1 µ + ηt . = O µ bT 3 + α 2 i O σ 2 µbT + (1 -αi) 2 O G 2 µb + (1 -αi) 2 O κL ln T bT 3 + O κ 2 σ 2 µbKT + O κ 2 τ 2 (ζi + ζ K ) + κ 2 τ σ 2 µbT 2 + O κ 4 τ σ 2 + 2τ ζ K µbT 2 .

G PROOF OF CONVERGENCE RATE IN NONCONVEX SETTING

In this section, we provide the proofs of the convergence results for nonconvex functions. We first present the convergence rate of the global model of APFL on nonconvex objectives. Theorem 8 (Global model convergence of Local Descent APFL). If each client's objective function is L-smooth, using Algorithm 1 with full gradients, by choosing K = n and the learning rate η = 1/(2√5 L√T), the following convergence holds: (1/T) Σ_{t=1}^T ‖∇F(w^(t))‖² ≤ O(L/√T) + O(τ²ζ/(nT)). Proof. The proof is provided in Appendix G.2. As usual, let us introduce several useful lemmas before the formal proofs of Theorems 3 and 8.

G.1 PROOF OF TECHNICAL LEMMAS

Lemma 8. Under Theorem 3's assumptions, the following statement holds true: f i (v (t+1) i ) ≤ f i (v (t) i ) - 1 8 η ∇f i (v i ) 2 + 3 2 (1 -α i ) 2 η 1 n n j=1 w (t) -w (t) j 2 + 3α 4 i (1 -α) 2 ηL 2 w (t) -w (t) i 2 + 6η(1 -α 2 i ) 2 ζ i + 12η(1 -α 2 i ) 2 L 2 D 2 W + 12η(α i -α 2 i ) 2 ∇F (w (t) ) 2 . Proof. Define the following quantities:  g (t) = α 2 i ∇f i (v (t) i ) + (1 -α i ) 1 n n j=1 ∇f j (w (t) j ) P W (v (t) i , g (t) , η) = 1 η   v(t) i - W   v(t) i -η   α 2 i ∇f i (v (t) i ) + (1 -α i ) 1 n n j=1 ∇f j (w g, P W (w, g, η) ≥ P W (w, g, η) 2 . According to the updating rule and smoothness of f i , we have: f i (v (t+1) i ) ≤ f i (v (t) i ) + ∇f i (v (t) i ), v i - v(t) i + L 2 v(t+1) i - v ≤ f i (v we have: (t) i ) + ∇f i (v (t) i ), v f i (v (t+1) i ) ≤ f i (v (t) i ) -g (t) 2 + η 2 P W (v (t) i , g (t) , η) 2 + η 2 L 2 P W (v i , g (t) , η)  2 + η 2 ∇f i (v (t) i ) -α 2 i ∇f i (v (t) i ) -(1 -α i ) 1 n n j=1 ∇f j (w (t) j ) 2 ≤ f i (v (t) i ) - η 2 - η 2 L 2 ≤ 1 4 η g (t) 2 + η 2 ∇f i (v (t) i ) -α 2 i ∇f i (v (t) i ) -(1 -α i ) 1 n n j=1 ∇f j (w (t) j ) 2 ≤ f i (v (t) i ) - 1 4 η g (t) 2 + (1 -α i ) 2 η ∇F (w (t) ) - 1 n n j=1 ∇f j (w (t) j ) 2 + η α 2 i ∇f i (v (t) i ) -∇f i (v (t) i ) -(1 -α i )∇F (w (t) ) + (1 -α 2 i )∇f i (v (t) i ) 2 ≤ f i (v (t) i ) - 1 4 η g (t) 2 + (1 -α i ) 2 η 1 n n j=1 w (t) -w (t) j 2 + 2η α 2 i ∇f i (v (t) i ) -∇f i (v (t) i ) 2 + 2η (1 -α 2 i )∇f i (v (t) i ) -(1 -α 2 i )∇F (v ≤ f i (v (t) i ) - 1 4 η g (t) 2 + (1 -α i ) 2 η 1 n n j=1 w (t) -w (t) j 2 + 2α 4 i (1 -α) 2 ηL 2 w (t) -w (t) i 2 + 4η(1 -α 2 i ) 2 ζ i + 8η(1 -α 2 i ) 2 L 2 D 2 W + 8η(α i -α 2 i ) 2 ∇F (w (t) ) Using the following inequality to replace g (t) 2 : ∇f i (v i ) 2 ≤ 2 ∇f i (v i ) -g (t) 2 + 2 g (t) 2 hence we can conclude the proof: f i (v (t+1) i ) ≤ f i (v (t) i ) - 1 4 η 1 2 ∇f i (v i ) 2 -∇f i (v i ) -g (t) 2 + (1 -α i ) 2 η 1 n n j=1 w (t) -w (t) j 2 + 2α 4 i (1 -α) 2 ηL 2 w (t) -w (t) i 2 + 4η(1 -α 2 i ) 2 ζ i + 8η(1 -α 2 i ) 2 L 2 D 
2 W + 8η(α i -α 2 i ) 2 ∇F (w (t) ) 2 ≤ f i (v (t) i ) - 1 8 η ∇f i (v i ) 2 + 1 4 η ∇f i (v i ) -   α 2 i ∇f i (v (t) i ) + (1 -α i ) 1 n n j=1 ∇f j (w (t) j )   2 + (1 -α i ) 2 η 1 n n j=1 w (t) -w (t) j 2 + 2α 4 i (1 -α) 2 ηL 2 w (t) -w (t) i 2 + 4η(1 -α 2 i ) 2 ζ i + 8η(1 -α 2 i ) 2 L 2 D 2 W + 8η(α i -α 2 i ) 2 ∇F (w (t) ) 2 ≤ f i (v (t) i ) - 1 8 η ∇f i (v i ) 2 + 3 2 (1 -α i ) 2 η 1 n n j=1 w (t) -w (t) j 2 + 3α 4 i (1 -α) 2 ηL 2 w (t) -w (t) i 2 + 6η(1 -α 2 i ) 2 ζ i + 12η(1 -α 2 i ) 2 L 2 D 2 W + 12η(α i -α 2 i ) 2 ∇F (w (t) ) 2 . Lemma 9. Under Theorem 3's assumptions, the following statement holds true: F (w (t+1) ) ≤ F (w (t) ) + ∇F (w (t) ), w (t+1) -w (t) + L 2 w (t+1) -w (t) 2 ≤ F (w (t) ) -η ∇F (w (t) ), P W (w (t) , ḡ(t) , η) + η 2 L 2 P W (w (t) , ḡ(t) , η) 2 ≤ F (w (t) ) -η ḡ(t) , P W (w (t) , ḡ(t) , η) -η ∇F (w (t) ) -ḡ(t) , P W (w (t) , ḡ(t) , η) Re-arranging terms and doing the telescoping sum from t = 1 to T : + 1 T T t=1 ∇F (w (t) ) 2 ≤ 8 ηT F (w (1) ) + 6L k ) -∇f k (w (j) ) + ∇f k (w (j) ) -∇f i (w (j) ) + ∇f i (w (j) ) -∇f i (w (j) i ) 2 ≤ τ t+τ j=tc 5η 2 2L 2 γ j + ζ n . Summing over t from t c to t c + τ yields: Summing over all synchronization stages t c , and dividing both sides by T can conclude the proof of the first statement: 1 T T t=1 γ t ≤ 10τ 2 η 2 ζ n . ( ) To prove the second statement, let δ i t = w (t) -w (t) i

2

. Notice that: k ) -∇f k (w (j) ) + ∇f k (w (j) ) -∇f i (w (j) ) + ∇f i (w (j) ) -∇f i (w Then plugging in Lemma 10 will conclude the proof. δ i t = w tc - (

G.3 PROOF OF THEOREM 3

Proof. According to Lemma 8: f i (v (t+1) i ) ≤ f i (v (t) i ) - 1 8 η ∇f i (v i ) 2 + 3 2 (1 -α i ) 2 η 1 n n j=1 w (t) -w (t) j 2 + 3α 4 i (1 -α) 2 ηL 2 w (t) -w (t) i 2 + 6η(1 -α 2 i ) 2 ζ i + 12η(1 -α 2 i ) 2 L 2 D 2 W + 12η(α i -α 2 i ) 2 ∇F (w (t) ) 2 . Re-arranging the terms, summing from t = 1 to T , and dividing both sides with T yields: 1 T T t=1 ∇f i (v (t) i ) 2 ≤ 8f i (v i ) ηT  α 2 i ) 2 ζ i + 128(1 -α 2 i ) 2 L 2 D W , Then, plug in Lemma 9 and 10 : 1) ). T L concludes the proof: 1 T T t=1 ∇f i (v (t) i ) 2 ≤ 8f i (v i ) ηT + 48(1 -α 2 i ) 2 ζ i + 128(1 -α 2 i ) 2 L 2 D W + 24α 4 i (1 -α i ) 2 L 2 1 T T t=1 w (t) i -w (t) 2 + 12(1 -α i ) 2 L 2 1 n n j=1 1 T T t=1 w (t) j -w (t) 2 + 128(1 -α i ) 2   8 ηT F (w (1) ) + 6L 2 1 T T t=1 1 n n j=1 w (t) j -w (t) 2   ≤ 8f i (v i ) ηT + 48(1 -α 2 i ) 2 ζ i + 128(1 -α 2 i ) 2 L 2 D W + 24α 4 i (1 -α i ) 2 L 2 200L 2 τ 4 η 4 ζ n + 20τ 2 η 2 ζ i + 7800τ 2 η 2 (1 -α i ) 2 L 2 ζ n + 1024(1 -α i ) 2 ηT F (w 1 T T t=1 ∇f i (v (t) i ) 2 ≤ O L √ T + (1 -α i ) 2 L √ T + (1 -α 2 i ) 2 ζ i + L 2 D W + α 4 i (1 -α i ) 2 O τ 4 ζ nT 2 + τ 2 ζ i T + (1 -α i ) 2 O τ 2 ζ nT .



Figure 1: Comparing the generalization and training losses of our proposed personalized model with the global models of FedAvg and SCAFFOLD while increasing the diversity among clients' data, on the MNIST dataset with a logistic regression model.

(1/|S|) Σ_{(x,y)∈S} |h(x) − h′(x)|. The empirical discrepancy characterizes the complexity of the hypothesis class over a finite set. Similar concepts are also employed in related multiple-source PAC learning and domain adaptation (Kifer et al., 2004; Mansour et al., 2009; Ben-David et al., 2010; Konstantinov et al., 2020; Zhang et al., 2020).
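For concreteness, the empirical discrepancy above is straightforward to compute on a finite sample. A small sketch with two hypothetical scalar-output hypotheses (the hypotheses and the toy sample are ours):

```python
def empirical_discrepancy(h, h_prime, sample):
    """(1/|S|) * sum over (x, y) in S of |h(x) - h'(x)|.

    Only the inputs x are used; the labels y play no role in the
    discrepancy between the two hypotheses.
    """
    return sum(abs(h(x) - h_prime(x)) for x, _ in sample) / len(sample)

# Hypothetical hypotheses and a toy finite sample S of (x, y) pairs.
h = lambda x: 2 * x
h_prime = lambda x: 2 * x + 1
S = [(0.0, 0), (1.0, 1), (2.0, 0)]
print(empirical_discrepancy(h, h_prime, S))  # -> 1.0
```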

Figure 2: Comparing the performance of APFL with FedAvg (APFL with α = 0) and SCAFFOLD on the MNIST dataset. The top row shows the training loss and the bottom row shows the generalization accuracy on training and validation data, respectively. In (a), the accuracy lines of the SCAFFOLD and FedAvg global models are omitted since their low values degrade the readability of the plot.

Figure 3: Comparing APFL with adaptive α and the localized FedAvg. The left figure shows the training performance, and the right one shows the accuracy of these models on local validation data.

introduced in Dinh et al. (2020), which uses a regularization based on the Moreau envelope function. We run these algorithms to train an MLP with 2 hidden layers, each with 200 neurons, on a non-IID MNIST dataset with 2 classes per client. For perFedAvg, similar to their setting, we use learning rates of α = 0.01 (different from the α in our APFL) and β = 0.001. For a fair comparison, we use the same validation set for perFedAvg, and we use 10% of the training data as the test dataset that updates the meta-model. For pFedMe, following their setting, we use λ = 15 and η = 0.01. We use τ = 20 with the total number of communication rounds set to 100, and the batch size is 20. The results of these experiments are presented in


Finn et al. (2017) or domain adaptation and transfer learning (Ben-David et al., 2010; Mansour et al., 2009; Pan & Yang, 2009). Jiang et al. (2019) discuss the similarity between federated learning and meta-learning approaches, notably the Reptile algorithm by Nichol et al. (

An important application of personalization in federated learning is using the model under different contexts. For instance, in the next-character recognition task in Hard et al. (2018), the results should differ based on the context of the use case. Hence, we need a personalized model on one client under different contexts, which requires access to more features about the context during training. Evaluation of the personalized model in such a setting has been investigated by Wang et al. (2019), which is in line with our approach in the experimental results in Section 5. Liang et al. (2020) propose to directly learn the feature representation locally and train the discriminator globally, which reduces the effect of data heterogeneity and ensures fair learning. Personalization via model regularization: Another significant line of work on personalization is model regularization. Several studies introduce personalization approaches for federated learning by regularizing the difference between the global and local models. Hanzely & Richtárik (2020) introduce a new formulation for federated learning in which they add a regularization term on the distance between local and global models. In their approach, they use a mixing parameter that controls the degree of optimization for both the local models and the global model. FedAvg (McMahan et al., 2017) can be considered a special case of this approach. They show that the learned model lies in the convex hull of the local and global models, and that at each iteration, depending on the local models' optimization parameters, the global model moves closer to the global model learned by FedAvg. Similarly, Huang et al. (2020) and Dinh et al. (2020) also propose to use regularization between the local and global models to realize personalized learning. Shen et al. (2020) propose a knowledge distillation approach to achieve personalization, where they apply regularization on the predictions of the local and global models. Personalization via model interpolation: Parallel to our work, other studies introduce personalization approaches for federated learning by mixing the global and local models. The approach closest to our proposal is introduced by Mansour et al. (2020).

Figure 4: Evaluating the effect of sampling on the APFL and FedAvg algorithms using the MNIST dataset, which is non-IID with only 2 classes per client, with a logistic regression model. The first row shows the training performance of the local model of FedAvg and the personalized model of APFL with different sampling rates from {0.3, 0.5, 0.7}. The second row shows the generalization performance of the models on local validation data, aggregated over all clients. It can be inferred that, regardless of the sampling ratio, APFL clearly outperforms FedAvg.

Figure 5: The results of applying FedAvg and APFL (with adaptive α) on an MLP model using the EMNIST dataset, which is naturally heterogeneous. APFL achieves the same training loss as localized FedAvg, while outperforming it in validation accuracy. for the MNIST dataset with the MLP model in the main body. Here, again, we have 100 clients and run the experiments for 100 rounds of communication, each with 1 epoch of training. The results are summarized in Table 3. Again, it can be inferred that APFL generalizes well on the local test datasets of different clients.

Figure 6: Comparing the effect of fine-tuning with the local model of FedAvg and with the personalized model of APFL on the synthetic datasets. The model is trained for 100 rounds of communication with 97 clients, and then 3 clients join to fine-tune the global model based on their own data. It can be seen that the APFL model personalizes the global model better than FedAvg, in both training loss and validation accuracy. Increasing diversity makes personalization harder; however, APFL surpasses FedAvg again.

Let H be a hypothesis class and let D and D′ denote two probability measures over the space Ξ. Let L_{D′}(h) = E_{(x,y)∼D′}[ℓ(h(x), y)] denote the risk of h over D′. If the loss function ℓ(·) is bounded by B, then for every h ∈ H:

The first step is to utilize a uniform VC-dimension error bound over H (Mohri et al., 2018; Shalev-Shwartz & Ben-David, 2014):

As L_{D_i}(ĥ*_{loc,i}) is the risk of the empirical risk minimizer on D_i after incorporating a model learned on a different domain (i.e., the global distribution), one might argue that generalization techniques established in multi-domain learning theory (Ben-David et al., 2010; Mansour et al., 2009; Zhang et al., 2020) can be utilized to serve our purpose. However, we note that the techniques developed in Ben-David et al. (2010); Mansour et al. (2009); Zhang et al. (

Local Descent APFL (without sampling). Input: mixture weights α_1, …, α_n, synchronization gap τ, local models v_i^(0) for i ∈ [n], and local version of the global model w^(0)

j , from the convexity of F (•), we can conclude that

et al. (2016), Lemma 1: For all w ∈ W ⊂ R^d, g ∈ R^d, and η > 0, we have:

g̃^(t) = P_W(w^(t), ḡ^(t), η). According to the updating rule and the smoothness of f_i, we have:

Lemma 10. Under Theorem 3's assumptions, the following statements hold true: (1/T) Σ_{t=1}^T γ_t ≤ 10τ²η² ζ/n and (1/T) Σ_{t=1}^T ‖w^(t) − w_i^(t)‖² ≤ 200L²τ⁴η⁴ ζ/n + 20τ²η² ζ_i. Proof. For the first statement, we define γ_t = (1/n) Σ_{j=1}^n ‖w^(t) − w_j^(t)‖² and let t_c be the latest synchronization stage. Then we have:

≤ τ Σ_{j=t_c}^{t_c+τ} 5η² (2L² γ_j + ζ/n). Since η ≤ 1/(2√5 τL), we have 10L²τ²η² ≤ 1/2; hence, by re-arranging the terms we have: Σ_{t=t_c}^{t_c+τ} γ_t ≤ 10τ³η² ζ/n.

To efficiently optimize the problem we cast in (3) and (4), in this subsection we propose our bilevel optimization algorithm, Local Descent APFL. At each communication round, the server selects K clients uniformly at random as a set U
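A minimal single-machine sketch of the update pattern described here, on toy quadratic objectives (the function names, toy objectives, and hyperparameter values are our own illustrative choices; the paper's Algorithm 1 is the authoritative description):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, tau, eta, rounds, d = 10, 4, 5, 0.05, 20, 3
targets = rng.normal(size=(n, d))       # client i minimizes ||x - targets[i]||^2 / 2
grad = lambda i, x: x - targets[i]      # gradient of the toy local objective

w = np.zeros(d)                         # global model
v = np.zeros((n, d))                    # personalized models
alpha = np.full(n, 0.5)                 # per-client mixing weights

for _ in range(rounds):
    chosen = rng.choice(n, size=K, replace=False)   # server samples K clients
    w_local = {i: w.copy() for i in chosen}         # local copies of the global model
    for _ in range(tau):                            # tau local steps between syncs
        for i in chosen:
            # Local version of the global model takes a plain gradient step.
            w_local[i] -= eta * grad(i, w_local[i])
            # Personalized model descends on the mixed iterate
            # v_bar = alpha_i * v_i + (1 - alpha_i) * w_i.
            v_bar = alpha[i] * v[i] + (1 - alpha[i]) * w_local[i]
            v[i] -= eta * alpha[i] * grad(i, v_bar)
    # Server averages the K returned local models into the new global model.
    w = np.mean([w_local[i] for i in chosen], axis=0)
```

Each client thus maintains two sequences, the local copy of the global model and its personalized model, and only the former is communicated every τ steps.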

3/4 communication rounds. The rate with the factor (1 − α_i)² is contributed by the global model convergence, and here we have an additive residual error reflected by ζ_i and D_W. Compared to the most related work by Haddadpour & Mahdavi (2019) on the convergence of local SGD for nonconvex functions, they obtain O(1/√(nT)), while we only have a speedup in n on some of the terms. This could be resolved by using different learning rates for the local and global updates. Additionally, we assume K = n to derive the convergence in the nonconvex setting, and leave the analysis for partial participation as future work.

The results of training a CNN model on the CIFAR10 dataset using different algorithms.

, where APFL clearly outperforms all other models in both training and generalization. The APFL model with α = 0.75 has the lowest training loss, and the one with adaptive α has the best validation accuracy. perFedAvg is slightly better than the localized FedAvg, but worse than the APFL models. pFedMe performs better than the global model of FedAvg, but it surpasses neither the localized model of FedAvg nor the APFL models.



Paul Vanhaesebrouck, Aurélien Bellet, and Marc Tommasi. Decentralized collaborative learning of personalized models over networks. 2017. Kangkang Wang, Rajiv Mathews, Chloé Kiddon, Hubert Eichner, Françoise Beaufays, and Daniel Ramage. Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252, 2019. Blake Woodworth, Kumar Kshitij Patel, and Nathan Srebro. Minibatch vs local SGD for heterogeneous distributed learning. arXiv preprint arXiv:2006.04735, 2020a. Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, and Nathan Srebro. Is local SGD better than minibatch SGD? arXiv preprint arXiv:2002.07839, 2020b. Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758, 2020. Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Trong Nghia Hoang,



The results of training an MLP on the MNIST dataset with APFL and FedAvg, using a Dirichlet distribution to split data across clients. The parameter of the Dirichlet distribution is set to 1.
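A minimal sketch of this kind of Dirichlet-based non-IID split (the function name, random labels, and client count are our illustrative assumptions; only the Dirichlet parameter β = 1 is taken from the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_split(labels, n_clients, beta=1.0):
    """Assign sample indices to clients with Dirichlet(beta) class proportions."""
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Draw per-client proportions for this class, then cut accordingly.
        proportions = rng.dirichlet(beta * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

labels = rng.integers(0, 10, size=1000)   # stand-in for MNIST labels
shards = dirichlet_split(labels, n_clients=5, beta=1.0)
assert sum(len(s) for s in shards) == len(labels)
```

Smaller β yields more skewed (more heterogeneous) per-client class distributions; β = 1 gives a moderate level of heterogeneity.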

PROOF OF CONVERGENCE OF APFL WITH SAMPLING

In this section, we provide the formal proof of Theorem 2. Before proceeding to the proof, we first give the convergence of the global model. The following theorem establishes the convergence of the global model in APFL. Theorem 6 (Global model convergence of Local Descent APFL). If each client's objective function is µ-strongly convex and L-smooth, and satisfies Assumption 1, using Algorithm 1, by choosing the learning rate η_t = 16/(µ(t+a))

The proof is provided in Appendix E.2.2. Remark 5. It is noticeable that the obtained rate matches the convergence rate of FedAvg, and if we choose τ = T/K, we recover the rate O(1/(KT)), which is the convergence rate of the well-known local SGD with periodic averaging (Woodworth et al., 2020a). Now we turn to the proof of Theorem 2. The proof pipeline is similar to that of Appendix E.1.3, the non-sampling setting. The only difference is that we use sampling here; hence, we introduce a variance that depends on the sampling size K. We first begin with the proof of some technical lemmas.

F(w^(t+1)) ≤ F(w^(t)) − η ‖ḡ^(t)‖². Using the following inequality to replace ‖ḡ^(t)‖²: ‖∇F(w^(t))‖² ≤ 2‖∇F(w^(t)) − ḡ^(t)‖² + 2‖ḡ^(t)‖².

L 2 γ j + L 2 δ i j + ζ i ≤ 5L 2 τ 2 η 2 + 5τ 3 η 2 ζ i . , we have 5L 2 τ 2 η 2 ≤ 14 , hence by re-arranging the terms we have:+ 20τ 3 η 2 ζ i .Summing over all synchronization stages t c , and dividing both sides by T can conclude the proof of the first statement:+ 20τ 2 η 2 ζ i . 200L 2 τ 4 η 4 ζ n + 20τ 2 η 2 ζ i .

