PERSONALIZED FEDERATED COMPOSITE LEARNING WITH FORWARD-BACKWARD ENVELOPES

Abstract

Federated composite optimization (FCO) refers to federated learning problems whose loss function contains a non-smooth regularizer. It arises naturally in applications of federated learning (FL) that involve requirements such as sparsity, low-rankness, and monotonicity. In this study, we propose a personalization method for FCO, called pFedFBE, which uses the forward-backward envelope (FBE) as the clients' loss functions. With FBE, we not only decouple the personalized models from the global model, but also make the personalized objectives smooth and easy to optimize. Despite the nonsmoothness of FCO, pFedFBE enjoys the same convergence complexity as FedAvg applied to FL with unconstrained smooth objectives. Numerical experiments demonstrate the effectiveness of the proposed method.

1. INTRODUCTION

Federated learning (FL) was originally proposed by (McMahan et al., 2016) to solve learning tasks with decentralized data arising in various applications. For example, data generated by medical institutions cannot be shared across institutions due to confidentiality or legal constraints. Instead of accessing all data sets directly, the institutions, or clients, work under the coordination of a central server, which aggregates local information to train a global model. Similar methodologies have been investigated in the literature on decentralized optimization (Colorni et al., 1991; Boyd et al., 2011; Yang et al., 2019). For more background and open problems in federated optimization, we refer to the review articles (Kairouz et al., 2021; Wang et al., 2021). The local loss functions of FL can be nonsmooth. In particular, each is the sum of a smooth function and a nonsmooth regularizer, where the regularizer promotes certain structure of the optimal parameters, such as sparsity, low-rankness, total variation, or additional constraints on the parameters. This has motivated the recent study of the federated setting of composite optimization (Yuan et al., 2021). The mathematical formulation of FCO is

min_{w∈R^d} f(w) := (1/N) Σ_{i=1}^N (f_i(w) + h(w)),   (1)

where f_i(w) = E_{ξ_i}[f̃_i(w, ξ_i)], or its empirical version f_i(w) = (1/|D_i|) Σ_{ξ_i∈D_i} f̃_i(w, ξ_i), is a smooth function defined over the local dataset D_i, and h : R^d → R is a nonsmooth but convex regularizer. Besides, we assume that the proximal operator of h,

prox_h(w) := argmin_{u∈R^d} h(u) + (1/2)∥u − w∥²,

has a closed-form expression and is easy to compute. The difference from the centralized setting is that D_i is the local data of client i, and the data distributions across clients may differ.
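For instance, when h(w) = λ∥w∥₁ (the Lasso regularizer used in Section 4), prox_h is the elementwise soft-thresholding map; a minimal sketch (our own illustration, not taken from the paper):

```python
import numpy as np

def prox_l1(w, lam):
    """Proximal operator of h(w) = lam * ||w||_1:
    argmin_u  lam*||u||_1 + 0.5*||u - w||^2,
    solved elementwise by soft-thresholding."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, -0.5, 1.2])
print(prox_l1(w, 1.0))  # each entry shrunk toward 0 by 1.0, small entries zeroed
```

The closed form is what makes the proximal operator "easy to compute" in the sense assumed above: one vectorized pass over the coordinates.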
Optimizing FL with unconstrained smooth objectives, i.e., problem (1) with h ≡ 0, has been extensively studied; see, e.g., FedAvg (McMahan et al., 2016), FedProx (Li et al., 2020), SCAFFOLD (Karimireddy et al., 2020b), and MIME (Karimireddy et al., 2020a), to name a few. When h ̸= 0, FedDual (Yuan et al., 2021), FedDR (Tran Dinh et al., 2021), and FedADMM (Wang et al., 2022) have been developed. One of the challenges for these algorithms is the heterogeneity of the local datasets D_i, whose distributions are non-identical. The model parameter w learned by minimizing f(w) may perform poorly for individual clients; yet if each client learns parameters from its own data alone, the local models may generalize poorly due to insufficient data. For the case h ≡ 0, personalized FL has been studied to learn the global and local parameters jointly, e.g., (Smith et al., 2017; Hanzely and Richtárik, 2020; Hanzely et al., 2020; Fallah et al., 2020b; Mansour et al., 2020; Chen and Chao, 2021; T Dinh et al., 2020). To the best of our knowledge, no existing work directly investigates personalization techniques for FCO, although the above personalized methods might be generalized. Our Contributions. In this paper, we construct a personalized model for the FCO problem (1), whereas the existing methods mentioned above all personalize FL with unconstrained smooth objectives, i.e., h ≡ 0. Although their approaches may be generalized to (1) by replacing gradients with subgradients, the computational results could be worse due to the slow convergence of subgradient methods. Our main contributions are summarized as follows. • We present a personalized model for FCO (1), called pFedFBE, built on the forward-backward envelope (FBE).
As a generalization of the Moreau envelope under the Bregman distance (Liu and Pong, 2017), the FBE is smooth and has an explicit gradient, a crucial benefit over the Moreau envelope. Analogous to the personalized method based on the Moreau envelope (T Dinh et al., 2020), our proposed method obtains both global parameters for generalization and local parameters for personalization. To the best of our knowledge, this is the first work to investigate personalization for FCO. • Based on the FBE, the local and global loss functions are smooth, and hence FedAvg can be used to solve the resulting model. Under FedAvg, the optimization process of our personalized model can be regarded as several local variable-metric proximal gradient updates followed by a global aggregation step. The variable-metric proximal gradient steps protect the local information, and the aggregation steps guarantee that the total loss is minimized at the aggregated parameter. A proper choice of the FBE parameter allows the local parameters to move towards their own models without drifting far from the global parameter. We propose an algorithm, called pFedFBE, and establish its convergence for nonconvex f_i under mild assumptions. The complexity result matches the standard results of FedAvg for unconstrained smooth FL. • Owing to the properties of the FBE, the convergence rate of pFedFBE matches the standard analysis of FedAvg under standard assumptions on f, h, and the stochasticity. Numerical experiments on various applications demonstrate the effectiveness of the proposed personalized model. Notations. For a vector w ∈ R^d or a matrix H ∈ R^{d×d}, we use ∥w∥ and ∥H∥ to denote the ℓ2 norm and the Frobenius norm, respectively. For a smooth function f : R^d → R, ∇f(x) and ∇²f(x) denote its gradient and Hessian at x, respectively.
For a nonsmooth and convex function h, we denote by ∂h(x) its subgradients at x. We use |D| to denote the cardinality of a set D.

2. PERSONALIZED FEDERATED LEARNING WITH FORWARD-BACKWARD ENVELOPE (PFEDFBE)

The personalized FedAvg (Per-FedAvg) (Fallah et al., 2020b) and personalized FL with Moreau envelope (pFedMe) (T Dinh et al., 2020) were proposed to deal with data heterogeneity in the smooth setting, i.e., h ≡ 0. In pFedMe, the local model is constructed from the Moreau envelope of f_i, namely,

F̃_i(w) = min_{θ_i∈R^d} f_i(θ_i) + (λ/2)∥θ_i − w∥².   (2)

The resulting personalized model is then the bi-level problem

min_{w∈R^d} F̃(w) := (1/N) Σ_{i=1}^N F̃_i(w).   (3)

Solving (3) gives both the global parameter w and the local personalized parameters

θ_i(w) := prox_{f_i/λ}(w) := argmin_{θ_i∈R^d} f_i(θ_i) + (λ/2)∥θ_i − w∥².

A crucial benefit of optimizing the Moreau envelopes F̃_i lies in the flexible choice of λ. When λ = ∞, we have F̃_i(w) = f_i(w) and θ_i(w) = w, which means no personalization is introduced. If λ = 0, F̃_i(w) is a constant function taking the value f_i(θ_i(w)) with θ_i(w) ≡ argmin_{θ_i∈R^d} f_i(θ_i); in this case, there is only personalization and no federation. Hence, they argue that a proper λ ∈ (0, ∞) introduces both federation and personalization. Since the inner problem has no explicit solution, multiple gradient steps are used to estimate the gradient ∇F̃_i(w) = λ(w − prox_{f_i/λ}(w)); for convergence, this estimate must meet a certain accuracy. Once the gradient of F̃_i(w) is available, existing federated optimization algorithms can be adopted to obtain a global parameter w and locally personalized parameters θ_i(w). We note that the Moreau envelope also exists for nonsmooth functions (Rockafellar and Wets, 2009), so pFedMe could be applied to our setting (1). However, evaluating the Moreau envelope then requires solving a nonsmooth inner problem, which may be costly in the absence of explicit expressions. Note also that the inner problem of (3) must be solved to a certain accuracy to guarantee convergence (T Dinh et al., 2020).
Consequently, the bi-level model (3) can be time-consuming to solve and may not be ideal in this setting.
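For contrast, evaluating the Moreau-envelope prox that pFedMe relies on requires an iterative inner solve; a minimal sketch (our illustration, using a toy quadratic f whose prox is known in closed form so the inexact solve can be checked):

```python
import numpy as np

def moreau_prox(grad_f, w, lam, steps=100, lr=0.05):
    """Approximate prox_{f/lam}(w) = argmin_th f(th) + lam/2 * ||th - w||^2
    by gradient descent on the inner problem -- the inexact inner solve
    that pFedMe's analysis must account for. grad_f is a gradient oracle."""
    th = w.copy()
    for _ in range(steps):
        th -= lr * (grad_f(th) + lam * (th - w))
    return th

# Toy f(th) = 0.5*||th - a||^2, whose prox has the closed form
# prox_{f/lam}(w) = (a + lam*w) / (1 + lam).
a = np.array([1.0, -2.0])
w = np.zeros(2)
lam = 2.0
est = moreau_prox(lambda th: th - a, w, lam)
print(est, (a + lam * w) / (1 + lam))  # the estimate approaches the closed form
```

For a general nonsmooth f_i + h, no such closed form exists and every envelope evaluation pays this inner-loop cost, which is exactly what the FBE construction below avoids.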

2.1. PROBLEM FORMULATION

Since the Moreau envelope does not admit explicit expressions, we seek an envelope that enjoys a simpler formulation while sharing similar federation and personalization properties. For composite optimization, a well-known generalization of the Moreau envelope is the FBE (Stella et al., 2017; Liu and Pong, 2017). Specifically, the FBE of f_i + h is defined as

F_i(w) := min_{θ_i∈R^d} f_i(w) + ⟨∇f_i(w), θ_i − w⟩ + h(θ_i) + (λ/2)∥θ_i − w∥².   (4)

When the proximal operator of h has a closed-form solution, the envelope can be written equivalently as (Stella et al., 2017)

F_i(w) = f_i(w) − (1/(2λ))∥∇f_i(w)∥² + H(w − (1/λ)∇f_i(w)),   (5)

where H(w) := min_{θ∈R^d} h(θ) + (λ/2)∥θ − w∥² = h(prox_{h/λ}(w)) + (λ/2)∥prox_{h/λ}(w) − w∥². Assuming the gradient of f_i is Lipschitz continuous with modulus L, i.e., ∥∇f_i(θ) − ∇f_i(w)∥ ≤ L∥θ − w∥ for all θ, w, i, the function F_i is continuously differentiable for any λ > L, with gradient

∇F_i(w) = λ(I − (1/λ)∇²f_i(w))(w − prox_{h/λ}(w − (1/λ)∇f_i(w))).   (6)

Compared with the Moreau envelope, the gradient of the FBE has closed form and can be computed at much lower cost. Furthermore, when λ > L, the set of global minimizers of F_i equals that of f_i + h. With the FBE (4), our personalized model for FCO (1) is

min_{w∈R^d} F(w) := (1/N) Σ_{i=1}^N F_i(w).   (7)

Similar to pFedMe, solving (7) yields both the global parameter w and the local personalized parameters

θ_i(w) := prox_{h/λ}(w − (1/λ)∇f_i(w)).   (8)

When λ = ∞, F_i(w) = f_i(w) + h(w) and θ_i(w) = w; that is, problem (7) reduces to the original problem (1) and there is no personalization. If λ = 0, then θ_i(w) = argmin_{θ_i} ∇f_i(w)^⊤(θ_i − w) + h(θ_i), which is not a constant function whenever ∇f_i(w) depends on w. In the extreme case of linear f_i, θ_i(w) is a constant function taking the value argmin_{θ_i} f_i(θ_i) + h(θ_i); then θ_i(w) is the best personalization parameter and no federation is introduced.
Since λ = ∞ yields only federation, we claim that λ ∈ (0, ∞) allows both federation and personalization. Beyond the linear case, if f_i is a quadratic function with Hessian λ₀I (λ₀ > 0), then setting λ = λ₀ results in perfect personalization and no federation; in this case, a λ ∈ (λ₀, ∞) guarantees both federation and personalization. We also note that λ > 0 is needed for smoothness: otherwise, problem (4) is not strongly convex and F_i is not smooth. Moreover, compared with the original model (1), i.e., λ = ∞, the objective function of our new model (7) is smooth and hence easier to optimize. In essence, our model takes the solution of (7) as an initial point and slightly updates it with respect to each client's own data by performing one proximal gradient step. The benefits over the original Moreau envelope lie in the explicit expressions of θ_i(w) and ∇F_i(w), given in (8) and (6), respectively. To summarize, our proposed model (7) has the following advantages: • The flexible choice of λ allows a user-defined trade-off between federation and personalization. • F_i is smooth while sharing the same minimizers as the nonsmooth function f_i + h. • Although the Moreau envelope of f_i + h is smooth, computing its gradient is expensive compared with the FBE F_i. Both the gradient ∇F_i(w) and the local personalized parameter θ_i(w) have explicit expressions whenever prox_h has a simple, closed-form solution.
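To make formulas (4)–(8) concrete, the following sketch (our own illustration, with a hypothetical quadratic f_i and ℓ1 regularizer h) evaluates the FBE via the closed form (5) and its gradient via (6), then checks the gradient against a central finite difference:

```python
import numpy as np

def prox_l1(w, t):
    # prox of t*||.||_1: elementwise soft-thresholding
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# Hypothetical client: f(w) = 0.5*w^T A w - b^T w, h(w) = mu*||w||_1
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 3.0])   # Hessian of f, so its Lipschitz modulus is L = 3
b = rng.standard_normal(3)
mu, lam = 0.1, 5.0             # lam > L ensures the FBE is smooth

f = lambda w: 0.5 * w @ A @ w - b @ w
grad_f = lambda w: A @ w - b

def fbe(w):
    # Eq. (5): F(w) = f(w) - ||grad f(w)||^2/(2*lam) + H(w - grad f(w)/lam)
    g = grad_f(w)
    u = w - g / lam
    p = prox_l1(u, mu / lam)
    return f(w) - (g @ g) / (2 * lam) + mu * np.abs(p).sum() + lam / 2 * ((p - u) @ (p - u))

def grad_fbe(w):
    # Eq. (6): grad F(w) = lam*(I - Hess f/lam)(w - prox_{h/lam}(w - grad f(w)/lam))
    p = prox_l1(w - grad_f(w) / lam, mu / lam)
    return lam * (np.eye(3) - A / lam) @ (w - p)

w = rng.standard_normal(3)
num = np.array([(fbe(w + 1e-6 * e) - fbe(w - 1e-6 * e)) / 2e-6 for e in np.eye(3)])
print(np.max(np.abs(grad_fbe(w) - num)))  # small finite-difference discrepancy
```

Despite the kink in prox_l1, the envelope itself is continuously differentiable for λ > L, which is why the finite-difference check succeeds everywhere.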

2.2. PFEDFBE: ALGORITHM

In this subsection, we construct an algorithm, called pFedFBE, to solve our proposed model (7). Since F_i(w) is smooth and its gradient has an explicit expression, solving (7) falls into the classic FL setting, and one may utilize existing methods (McMahan et al., 2016; Li et al., 2020; Karimireddy et al., 2020b;a). We now describe how to use FedAvg to solve (7). At the k-th round, the server randomly selects a subset of clients, denoted by S_k. Each selected client is initialized with w_k and performs R local updates. After collecting the local parameters {w^i_{k,R}}_{i∈S_k}, the server updates its model by w_{k+1} = (1/|S_k|) Σ_{i∈S_k} w^i_{k,R}. Let us detail the local updates. Since the full-batch gradient is costly, we take a minibatch D^i_{k,t} ⊂ D_i and compute the unbiased minibatch gradient

∇f_i(w^i_{k,t}) ≈ ∇f̃_i(w^i_{k,t}, D^i_{k,t}) := (1/|D^i_{k,t}|) Σ_{ξ_i∈D^i_{k,t}} ∇f̃_i(w^i_{k,t}, ξ_i).

From the expression (6) of ∇F_i(w), we also need the Hessian of f_i. Using another minibatch D̃^i_{k,t} ⊂ D_i, the unbiased estimated Hessian is

∇²f_i(w^i_{k,t}) ≈ ∇²f̃_i(w^i_{k,t}, D̃^i_{k,t}) := (1/|D̃^i_{k,t}|) Σ_{ξ_i∈D̃^i_{k,t}} ∇²f̃_i(w^i_{k,t}, ξ_i).

Plugging these estimates into (6), we obtain the estimated gradient

∇F_i(w^i_{k,t}) ≈ g_i(w^i_{k,t}) := λ(I − (1/λ)∇²f̃_i(w^i_{k,t}, D̃^i_{k,t}))(w^i_{k,t} − prox_{h/λ}(w^i_{k,t} − (1/λ)∇f̃_i(w^i_{k,t}, D^i_{k,t}))).   (9)

Note that g_i(w^i_{k,t}) is biased due to the nonlinearity of prox_{h/λ}. With the estimated gradient of F_i, each client runs R steps of stochastic gradient descent with a fixed step size η > 0, namely,

w^i_{k,0} := w_k,  w^i_{k,t+1} = w^i_{k,t} − η g_i(w^i_{k,t}),  t = 0, . . . , R − 1.

The detailed algorithm is presented in Algorithm 1. Note that computing Hessians may be costly; approximations to the Hessian are developed in (Finn et al., 2017; Nichol et al., 2018; Fallah et al., 2020a).
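The local-update and aggregation loop just described can be sketched in a single-machine simulation (our own illustration, not the paper's implementation): full-batch gradients and Hessians of hypothetical quadratic clients stand in for the minibatch estimates, and h is a fixed ℓ1 regularizer.

```python
import numpy as np

def prox_l1(w, t):
    # prox of t*||.||_1: soft-thresholding
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def pfedfbe(clients, w0, lam, eta, rounds, R, S, rng, mu=0.1):
    """Single-machine sketch of the pFedFBE loop. Each client is a pair
    (grad_f, hess_f) of exact oracles (standing in for minibatch estimates);
    h = mu*||.||_1 is fixed for illustration."""
    w = w0.copy()
    d = len(w0)
    for _ in range(rounds):
        picked = rng.choice(len(clients), size=S, replace=False)
        finals = []
        for i in picked:
            grad_f, hess_f = clients[i]
            wi = w.copy()
            for _ in range(R):
                # personalized parameter, eq. (8)
                theta = prox_l1(wi - grad_f(wi) / lam, mu / lam)
                # estimated FBE gradient, eq. (9) with exact oracles
                g = lam * (np.eye(d) - hess_f(wi) / lam) @ (wi - theta)
                wi -= eta * g
            finals.append(wi)
        w = np.mean(finals, axis=0)  # server aggregation
    return w

rng = np.random.default_rng(1)
# hypothetical heterogeneous quadratic clients f_i(w) = 0.5*||w - c_i||^2
clients = [((lambda c: (lambda w: w - c))(c), lambda w: np.eye(3))
           for c in [rng.standard_normal(3) for _ in range(5)]]
w = pfedfbe(clients, np.zeros(3), lam=5.0, eta=0.05, rounds=50, R=5, S=3, rng=rng)
print(w)
```

With full client participation and R = 1, the loop reduces to exact gradient descent on F, so the averaged FBE gradient at the output should be small; with partial participation and multiple local steps it behaves like FedAvg with client drift.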
To make the computations affordable, we use the following Hessian-vector approximation in our numerical experiments: ∇²f(w)[u] ≈ (∇f(w + tu) − ∇f(w))/t, where t is a small positive number. Hence, two minibatch gradient evaluations of f̃_i suffice to estimate ∇F_i. Similar to (Fallah et al., 2020b, Section 5), the proximal gradient residual w^i_{k,t} − prox_{h/λ}(w^i_{k,t} − (1/λ)∇f̃_i(w^i_{k,t}, D^i_{k,t})) can also serve as an efficient estimate of ∇F_i(w^i_{k,t}) when λ > L.
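The two-gradient Hessian-vector approximation above can be sketched as follows (our illustration with a hypothetical least-squares loss, whose Hessian X^⊤X is known exactly for comparison):

```python
import numpy as np

# Hypothetical smooth loss f(w) = 0.5*||Xw - y||^2, so Hessian = X^T X
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)
grad_f = lambda w: X.T @ (X @ w - y)

def hvp_fd(grad, w, u, t=1e-6):
    """Finite-difference Hessian-vector product:
    Hess(w) @ u ~ (grad(w + t*u) - grad(w)) / t,
    costing two gradient evaluations and no second derivatives."""
    return (grad(w + t * u) - grad(w)) / t

w = rng.standard_normal(4)
u = rng.standard_normal(4)
exact = X.T @ X @ u
print(np.max(np.abs(hvp_fd(grad_f, w, u) - exact)))  # tiny discretization error
```

Because only the product ∇²f(w)u appears in (6) and (9), this avoids ever forming the d×d Hessian, which is what makes the per-step cost two gradient evaluations.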

3. CONVERGENCE

In this section, we present the convergence results of the proposed pFedFBE, i.e., Algorithm 1. Let us start with the necessary assumptions.

Algorithm 1: pFedFBE for solving (7)
Input: initial point w_0, personalization parameter λ, and learning rate η.
for k = 0, 1, . . . , K − 1 do
  Sample a subset S_k of clients
  for each client i ∈ S_k in parallel do
    Initialize the local model w^i_{k,0} = w_k
    for t = 0, 1, . . . , R − 1 do
      Sample two minibatches D^i_{k,t} and D̃^i_{k,t} from D_i
      Calculate the personalized parameter θ_i(w^i_{k,t}) = prox_{h/λ}(w^i_{k,t} − (1/λ)∇f̃_i(w^i_{k,t}, D^i_{k,t}))
      Compute the local stochastic gradient g_i(w^i_{k,t}) following (9)
      Perform the local update w^i_{k,t+1} = w^i_{k,t} − η g_i(w^i_{k,t})
  Aggregate the local parameters and set w_{k+1} = (1/|S_k|) Σ_{i∈S_k} w^i_{k,R}

Assumption 1
(A1) For each i ∈ {1, . . . , N}, the gradient of f_i is L-Lipschitz continuous: ∥∇f_i(w) − ∇f_i(u)∥ ≤ L∥w − u∥ for all w, u ∈ R^d.
(A2) There exist a constant B > 0 and a set C ⊂ R^d such that, for all w ∈ C and all i, ∥∇f_i(w)∥ ≤ B and ∥s∥ ≤ B for every subgradient s ∈ ∂h(w).
(A3) For each i ∈ {1, . . . , N}, the Hessian of f_i is ρ-Lipschitz continuous: ∥∇²f_i(w) − ∇²f_i(u)∥ ≤ ρ∥w − u∥ for all w, u ∈ R^d.
(A4) For any w ∈ R^d, the stochastic gradient ∇f̃_i(w, ξ_i) and Hessian ∇²f̃_i(w, ξ_i), computed with respect to a single data point ξ_i ∈ D_i, have bounded variance: for all i and w, E_{ξ_i}∥∇f̃_i(w, ξ_i) − ∇f_i(w)∥² ≤ σ_G² and E_{ξ_i}∥∇²f̃_i(w, ξ_i) − ∇²f_i(w)∥² ≤ σ_H².
(A5) For any w ∈ R^d, the gradients and Hessians of the local functions f_i and the global function f(w) := (1/N) Σ_{i=1}^N f_i(w) satisfy ∥∇f_i(w) − ∇f(w)∥² ≤ γ_G² and ∥∇²f_i(w) − ∇²f(w)∥² ≤ γ_H² for all i and w.

The smoothness of f_i and the cheap proximal operator of h are standard assumptions in FCO (Yuan et al., 2021). Assumptions (A2) and (A3) hold for sufficiently smooth f and are satisfied by many problems arising from machine learning, such as the federated Lasso and federated matrix completion problems in Section 4.
These two assumptions yield the Lipschitz continuity of ∇F_i. The bounded variance condition (A4) and the bounded diversity condition (A5) are crucial to control the bias introduced by the randomness and the heterogeneity of the local clients. The conditions on f_i in Assumption 1 are also made in (Fallah et al., 2020b), and the conditions on h are standard in decentralized composite optimization; see (Zeng and Yin, 2018). Compared with the analysis of pFedMe (T Dinh et al., 2020), we do not need an additional assumption on the exactness of solving the Moreau envelope, since a closed-form solution is available for the FBE. Based on Assumption 1, the FBE F_i has the following desired properties.

Proposition 1 (Liu and Pong, 2017, Theorem 3.1) If Assumption (A1) holds, then F_i is smooth and level-bounded for all λ > L. Here, a function φ is level-bounded if {w ∈ R^d : φ(w) ≤ γ} is bounded for all γ ∈ R.

The level boundedness and smoothness ensure the existence of a minimizer of F. The next lemma establishes the gradient Lipschitz continuity of F_i and F.

Lemma 1 Suppose that Assumptions (A1)–(A3) hold. If λ > L, the gradient ∇F_i is Lipschitz continuous over the set C with modulus L_F := (2ρB + (λ + L)(2λ + L))/λ.

Assumption (A5) bounds the divergence between f_i and f. We show that the divergence of the local FBE gradients is also bounded.

Lemma 2 Suppose that Assumptions (A2) and (A5) hold. Then, for λ > L, we have

(1/N) Σ_{i=1}^N ∥∇F_i(w) − ∇F(w)∥² ≤ γ_F² := 24γ_G² + (12B²/λ²)γ_H².

With the variance assumption (A4) on f_i, we establish the following estimates for F_i.

Lemma 3 Let D₁, D₂ ⊂ D be sample sets independent of each other. Suppose that (A4) holds. Then for any λ > L, it holds that

∥E[g_i(w, D₁, D₂) − ∇F_i(w)]∥ ≤ (2σ_G/√|D₁|) √(1 − (|D₁| − 1)/(|D| − 1)),
E∥g_i(w, D₁, D₂) − ∇F_i(w)∥² ≤ σ_F² := (12/|D₁|)σ_G² + (12B²/(|D₂|λ²))σ_H² + (3/(|D₁||D₂|λ²))σ_G²σ_H².
The above lemma shows that the error of the stochastic gradient of F_i is controlled by the variances of the stochastic gradient and Hessian of f_i. Note that the estimated gradient of F_i is biased unless the full batch D is used.

Lemma 4 Let {w^i_{k,t}} be the iterates generated by Algorithm 1, and write w̄_{k,t} := (1/N) Σ_{i=1}^N w^i_{k,t}. Suppose that Assumption 1 holds. If λ > L and η < 1/(10L_F R), then

E[(1/N) Σ_{i=1}^N ∥w^i_{k,t} − w̄_{k,t}∥] ≤ 4ηt(σ_F + γ_F),   (10)
E[(1/N) Σ_{i=1}^N ∥w^i_{k,t} − w̄_{k,t}∥²] ≤ 48tRη²(γ_F² + 4σ_F²).   (11)

The above lemma bounds the consensus error induced by the local stochastic gradient updates, which is proportional to the local step size η. With these preparations, we establish the following convergence of Algorithm 1.

Theorem 1 Consider the objective function F defined via (4) and (7) with λ > L. Suppose that Assumption 1 is satisfied, and recall the definitions of L_F, γ_F, and σ_F from Lemmas 1, 2, and 3, respectively. Run Algorithm 1 for K rounds with R local updates per round and η ≤ 1/(10RL_F). Then the following first-order stationarity bound holds:

(1/(RK)) Σ_{k=0}^{K−1} Σ_{t=0}^{R−1} E∥∇F(w̄_{k+1,t})∥² ≤ 4(F(w_0) − F*)/(ηRK) + 1600η²R²(γ_F² + 4σ_F²) + 8L_F η (N − S)/(S(N − 1)) (γ_F² + σ_F²) + 16rσ_G²,

where w̄_{k+1,t} := (1/S) Σ_{i∈S_k} w^i_{k+1,t} with w̄_{k+1,0} = w_k and w̄_{k+1,R} = w_{k+1}, and r := max_{k,t,i} (1 − (|D^i_{k,t}| − 1)/(|D_i| − 1))/|D^i_{k,t}|.

Remark. Theorem 1 states the result for a fixed step size; the extension to diminishing step sizes is straightforward. Due to the bias of the estimated gradient of F_i, the expected squared gradient norm of F at w̄_{k,t} converges to a ball centered at 0 with radius 16rσ_G². When full-batch gradients are used, this radius vanishes. Taking η = 1/√(RK), the expected squared gradient norm converges at rate O(1/√(RK)), which is similar to (Deng et al., 2020; Reddi et al., 2021).
Since F_i is smooth, more advanced algorithms, such as FedProx (Li et al., 2020) and SCAFFOLD (Karimireddy et al., 2020b), can also be adopted for better complexity results.

4. NUMERICAL EXPERIMENTS

4.1 FEDERATED LASSO

Federated Lasso was considered in (Yuan et al., 2021); the goal is to recover a sparse ground-truth signal from observations. The mathematical formulation is

min_{w∈R^d, b∈R} (1/N) Σ_{j=1}^N Σ_{i=1}^{n_j} ((x_j^{(i)})^⊤ w + b − y_j^{(i)})² + λ∥w∥₁,

where N is the number of clients and client j has n_j observation pairs {(x_j^{(i)}, y_j^{(i)})}_{i=1}^{n_j}. We set λ = 0.1 in our numerical experiments. Synthetic Dataset Descriptions. We consider both an i.i.d. setting and a non-i.i.d. setting. (I) We generate the ground truth w_real = [1_{d₁}^⊤, 0_{d₀}^⊤]^⊤ ∈ R^{1024} with d₁ = 992 ones and d₀ = 32 zeros; here (x_j^{(i)}, y_j^{(i)}) denotes the i-th observation of the j-th client. For each client j, we first generate and fix the mean μ_j ∼ N(0, I_{d×d}). Then we sample n_j observation pairs following

x_j^{(i)} = μ_j + δ_j^{(i)}, where δ_j^{(i)} ∼ N(0_d, I_{d×d}) are i.i.d., for i = 1, . . . , n_j,
y_j^{(i)} = w_real^⊤ x_j^{(i)} + ε_j^{(i)}, where ε_j^{(i)} ∼ N(0, 1) are i.i.d., for i = 1, . . . , n_j.

We generate N = 30 training clients, each possessing 128 pairs of samples, for 3840 training samples in total. (II) The ground truth w_j ∈ R^{1024} of each client is constructed as w_real = [1_{d₁}^⊤, 0, . . . , 0, 0.5 × 1_{i₁}, 0, . . . , 0, 0.5 × 1_{i_{d₀}}, 0, . . . , 0]^⊤, where i₁, . . . , i_{d₀} are drawn uniformly from {d₁ + 1, d₁ + 2, . . . , 1024} for each client. We set d₁ = 8 and d₀ = 2. Using the same data generation process as in (I), 30 training clients each possessing 128 pairs of samples are constructed. The numerical results for settings (I) and (II) are presented in Figures 1, 2, 3, and 4. The precision, recall, density, and F1 indexes are computed by comparing the ground truth w_real with the parameters obtained by the algorithms (any element with absolute value less than 0.01 is regarded as 0).
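The support-recovery metrics described above can be sketched as follows (our own illustration; the 0.01 truncation threshold is the one stated in the text, and the helper name is ours):

```python
import numpy as np

def support_metrics(w_est, w_true, tol=0.01):
    """Precision/recall/F1 of the recovered support: entries with
    |value| < tol are treated as zeros, as in the experiments."""
    est = np.abs(w_est) >= tol
    true = np.abs(w_true) >= tol
    tp = np.sum(est & true)                      # correctly recovered nonzeros
    precision = tp / max(est.sum(), 1)
    recall = tp / max(true.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

w_true = np.array([1.0, 1.0, 0.0, 0.0])
w_est = np.array([0.9, 0.005, 0.3, 0.0])  # misses one true entry, adds one spurious
print(support_metrics(w_est, w_true))
```

A perfect recovery gives precision = recall = F1 = 1; spurious nonzeros lower precision, missed nonzeros lower recall.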
We compare with the baseline algorithms FedAvg (McMahan et al., 2016), Fedmirror (Yuan et al., 2021), FedDual (Yuan et al., 2021), pFedMe (T Dinh et al., 2020), and pFedditto (Li et al., 2021). For all algorithms, we set the maximum number of rounds to 200. In each round, we sample 10 clients and run 20 local iterations with batch size 50. For all algorithms, we tune for the best client learning rate and keep the remaining parameters at their default values. For both settings (I) and (II), we set the learning rate η = 0.0005 and λ = 2000 for pFedFBE. From Figures 1 and 2, we see that our proposed pFedFBE and Fedmirror give the best performance among all algorithms for setting (I). The poor performance of FedDual compared with Fedmirror may stem from the multiple local iterations, whereas the number of local iterations is set to 1 in (Yuan et al., 2021). For the non-i.i.d. setting (II), Figures 3 and 4 show the results with respect to the personalized parameters. For algorithms without personalization, we directly take the global parameters as the personalized parameters. From the test precision and recall, our proposed pFedFBE finds all nonzero elements of w_real without introducing extra nonzero elements. Although the personalized algorithm pFedditto also recognizes all nonzero elements, it mistakes zero entries for nonzeros. pFedFBE gives the best F1 score, test accuracy, and training loss.

4.2 FEDERATED MATRIX COMPLETION

The mathematical formulation is

min_{W, b} (1/N) Σ_{j=1}^N Σ_{i=1}^{n_j} (⟨X_j^{(i)}, W⟩ + b − y_j^{(i)})² + λ∥W∥_nuc,

where N is the number of clients, client j has n_j observation pairs {(X_j^{(i)}, y_j^{(i)})}_{i=1}^{n_j}, ∥W∥_nuc is the nuclear norm of W, and λ > 0 is a parameter controlling the rank of W. We set λ = 0.1 in our numerical experiments. We only present numerical results for a non-i.i.d. setting to exhibit the importance of personalization. The setting is as follows. We set the number of clients to 30 and generate a vector w_j ∈ R^{32} of the form w_j = [1_4^⊤, 0, . . . , 0, 0.25 × 1_{d₀}, 0, . . .
, 0]^⊤, where d₀ is drawn uniformly from {5, 6, . . . , 32} for each client. After obtaining w_j, we set the diagonal matrix W_j = diag(w_j) as the local ground truth. For each client j, we first generate and fix the mean μ_j ∼ N(0, I_{d×d}). Then we sample n_j observation pairs following

x_j^{(i)} = μ_j + δ_j^{(i)}, where δ_j^{(i)} ∼ N(0_d, I_{d×d}) are i.i.d., for i = 1, . . . , n_j,
y_j^{(i)} = ⟨W_j, X_j^{(i)}⟩ + ε_j^{(i)}, where ε_j^{(i)} ∼ N(0, 1) are i.i.d., for i = 1, . . . , n_j.

128 pairs of samples are generated for each client. The numerical results are presented in Figure 5. Since FedDual is not competitive with Fedmirror when multiple local iterations are used, we omit it and instead compare with pFedprox (Li et al., 2020). As in the federated Lasso experiments, the total number of global rounds is set to 200; in each round, we randomly select 10 clients and perform 20 local iterations with batch size 50. We tune only the step sizes for each algorithm. The learning rate η and the parameter λ for pFedFBE are 0.0005 and 2000, respectively. From Figure 5, the personalized parameters of pFedFBE recover the ground-truth rank 5, while all other algorithms fail. Moreover, pFedFBE converges fastest to better training loss, training MSE, and recovery error (defined as the Euclidean distance between the local ground truth and the obtained personalized parameters).

For the numerical tests, we set the total number of global rounds to 100. In each round, we randomly select 10 clients and perform 10 local iterations with batch size 20. For all algorithms, we tune for the best client learning rate and keep the remaining parameters at their default values. We use the learning rate η = 0.005 and λ = 200 for pFedFBE. The results are presented in Figure 6, where the global accuracies are based on the global parameters and the personalized accuracies are computed from the local parameters.
We see that the algorithms exploiting the composite structure, i.e., pFedFBE, Fedmirror, and FedDual, converge faster than the other algorithms in terms of global accuracy. Moreover, pFedFBE outperforms Fedmirror and FedDual. For the personalized accuracies, our pFedFBE performs the best, and the personalized algorithms generally achieve better accuracies than the non-personalized ones.

A.1 PROOF OF LEMMA 1

Recall that ∇F_i(w) = λ(I − (1/λ)∇²f_i(w))(w − prox_{h/λ}(w − (1/λ)∇f_i(w))). For w₁, w₂ ∈ C it holds that

∥∇F_i(w₁) − ∇F_i(w₂)∥
= λ∥(I − (1/λ)∇²f_i(w₁))(w₁ − prox_{h/λ}(w₁ − (1/λ)∇f_i(w₁))) − (I − (1/λ)∇²f_i(w₂))(w₂ − prox_{h/λ}(w₂ − (1/λ)∇f_i(w₂)))∥
= λ∥[(I − (1/λ)∇²f_i(w₁)) − (I − (1/λ)∇²f_i(w₂))](w₁ − prox_{h/λ}(w₁ − (1/λ)∇f_i(w₁))) + (I − (1/λ)∇²f_i(w₂))[w₁ − prox_{h/λ}(w₁ − (1/λ)∇f_i(w₁)) − w₂ + prox_{h/λ}(w₂ − (1/λ)∇f_i(w₂))]∥
≤ ∥∇²f_i(w₁) − ∇²f_i(w₂)∥ ∥w₁ − prox_{h/λ}(w₁ − (1/λ)∇f_i(w₁))∥ + λ∥I − (1/λ)∇²f_i(w₂)∥ (∥w₁ − w₂∥ + ∥w₁ − (1/λ)∇f_i(w₁) − w₂ + (1/λ)∇f_i(w₂)∥)
≤ ρ∥w₁ − w₂∥ ∥w₁ − prox_{h/λ}(w₁ − (1/λ)∇f_i(w₁))∥ + λ(1 + L/λ)(2 + L/λ)∥w₁ − w₂∥
≤ [ρ((1/λ)∥∇f_i(w₁)∥ + (1/λ) max_{θ∈∂h(prox_{h/λ}(w₁ − (1/λ)∇f_i(w₁)))} ∥θ∥) + (λ + L)(2λ + L)/λ] ∥w₁ − w₂∥
≤ ((2ρB + (λ + L)(2λ + L))/λ) ∥w₁ − w₂∥,

where the first inequality is due to the triangle inequality, the second is from (A3) and the nonexpansiveness of prox_{h/λ}, the third is from λ(w − prox_{h/λ}(u)) ∈ ∂h(prox_{h/λ}(u)) at u = w − (1/λ)∇f_i(w), and the last is due to (A2).

A.2 PROOF OF LEMMA 2

It follows from ∇F_i(w) = λ(I − (1/λ)∇²f_i(w))(w − prox_{h/λ}(w − (1/λ)∇f_i(w))) that

∇F_i(w) − ∇F(w) = λ(I − (1/λ)∇²f(w)) r_i + λE_i (w − prox_{h/λ}(w − (1/λ)∇f(w))) + λE_i r_i,

where E_i := (1/λ)(∇²f(w) − ∇²f_i(w)) and r_i := prox_{h/λ}(w − (1/λ)∇f(w)) − prox_{h/λ}(w − (1/λ)∇f_i(w)). By (A5), it holds that

∥E_i∥² ≤ γ_H²/λ².   (12)

Using the nonexpansiveness of prox_{h/λ}, we have

∥r_i∥² ≤ (1/λ²)∥∇f_i(w) − ∇f(w)∥² ≤ γ_G²/λ².   (13)
Combining (12) and (13), it holds that

(1/N) Σ_{i=1}^N ∥∇F_i(w) − ∇F(w)∥²
≤ 3[λ²(1 + L/λ)² (1/N) Σ_{i=1}^N ∥r_i∥² + 4B² (1/N) Σ_{i=1}^N ∥E_i∥² + λ² (1/N) Σ_{i=1}^N ∥E_i∥² ∥r_i∥²]
≤ 3[(λ + L)² γ_G²/λ² + 4B² γ_H²/λ² + λ² max_i ∥E_i∥² γ_G²/λ²]
≤ 3[4γ_G² + (4B²/λ²)γ_H² + 4γ_G²]
= 24γ_G² + (12B²/λ²)γ_H²,

where the first inequality uses the Cauchy–Schwarz inequality together with ∥I − (1/λ)∇²f(w)∥ ≤ 1 + L/λ and ∥w − prox_{h/λ}(w − (1/λ)∇f(w))∥ ≤ 2B/λ, the second inequality is due to (12) and (13), and the last inequality follows from ∥E_i∥ ≤ 2L/λ and λ > L.

A.3 PROOF OF LEMMA 3

From the definition of g_i(w) in (9), with sample set D₁ for the gradient and D₂ for the Hessian, we have

g_i(w) − ∇F_i(w) = λ e₁ (w − prox_{h/λ}(w − (1/λ)∇f_i(w))) + λ(I − (1/λ)∇²f_i(w)) e₂ + λ e₁ e₂,

where e₁ := (1/λ)(∇²f_i(w) − ∇²f̃_i(w, D₂)) and e₂ := prox_{h/λ}(w − (1/λ)∇f_i(w)) − prox_{h/λ}(w − (1/λ)∇f̃_i(w, D₁)). Let us first estimate e₁ and e₂. By (A4), we have

E[e₁] = 0,  E[∥e₁∥²] ≤ σ_H²/(|D₂|λ²),   (14)

where |D₂| is the number of samples. For e₂, the nonexpansiveness of prox_{h/λ} gives

∥E[e₂]∥ ≤ E[(1/λ)∥∇f_i(w) − ∇f̃_i(w, D₁)∥] ≤ (σ_G/(λ√|D₁|)) √(1 − (|D₁| − 1)/(|D| − 1)),   (15)

where the second inequality follows from (A4) and the variance of sampling without replacement. Similarly, the second moment is bounded by

E[∥e₂∥²] ≤ E[(1/λ²)∥∇f_i(w) − ∇f̃_i(w, D₁)∥²] ≤ σ_G²/(λ²|D₁|).   (16)

Combining (14), (15), and (16), we have

∥E[g_i(w) − ∇F_i(w)]∥ ≤ λ∥(I − (1/λ)∇²f_i(w)) E[e₂]∥ ≤ (2σ_G/√|D₁|) √(1 − (|D₁| − 1)/(|D| − 1)).

Furthermore, it holds that

E∥g_i(w) − ∇F_i(w)∥²
≤ 3λ² E[∥e₁∥² ∥w − prox_{h/λ}(w − (1/λ)∇f_i(w))∥² + ∥I − (1/λ)∇²f_i(w)∥² ∥e₂∥² + ∥e₁∥² ∥e₂∥²]
≤ 3λ² [σ_H²/(|D₂|λ²) · 4B²/λ² + (1 + L/λ)² · σ_G²/(|D₁|λ²) + σ_H²/(|D₂|λ²) · σ_G²/(|D₁|λ²)]
≤ (12/|D₁|)σ_G² + (12B²/(|D₂|λ²))σ_H² + (3/(|D₁||D₂|λ²))σ_G²σ_H²,

where the first inequality is from the Cauchy–Schwarz inequality and the independence of D₁ and D₂, and the second inequality uses ∥w − prox_{h/λ}(w − (1/λ)∇f_i(w))∥ ≤ 2B/λ, (14), and (16). This completes the proof.

A.4 PROOF OF LEMMA 4

Note that the local update of Algorithm 1 is

w^i_{k,t+1} = w^i_{k,t} − η g_i(w^i_{k,t}),   (17)

where g_i(w^i_{k,t}) is the estimated gradient of F_i at w^i_{k,t}. Define C_t := (1/N) Σ_{i=1}^N E∥w^i_{k,t} − w̄_{k,t}∥, where w̄_{k,t} := (1/N) Σ_{i=1}^N w^i_{k,t}. We have C₀ = 0 since w^i_{k,0} = w_k for all i.
It follows from the local update scheme (17) that

C_{t+1} = (1/N) Σ_{i=1}^N E∥w^i_{k,t+1} − w̄_{k,t+1}∥
= (1/N) Σ_{i=1}^N E∥w^i_{k,t} − η g_i(w^i_{k,t}) − (1/N) Σ_{j=1}^N (w^j_{k,t} − η g_j(w^j_{k,t}))∥
≤ C_t + η (1/N) Σ_{i=1}^N E∥g_i(w^i_{k,t}) − (1/N) Σ_{j=1}^N g_j(w^j_{k,t})∥ =: C_t + η b₁.   (18)

For b₁, it holds that

b₁ ≤ (1/N) Σ_{i=1}^N E∥∇F_i(w^i_{k,t}) − (1/N) Σ_{j=1}^N ∇F_j(w^j_{k,t})∥ + (1/N) Σ_{i=1}^N E∥∇F_i(w^i_{k,t}) − g_i(w^i_{k,t})∥ + (1/N) Σ_{i=1}^N E∥(1/N) Σ_{j=1}^N (∇F_j(w^j_{k,t}) − g_j(w^j_{k,t}))∥
≤ (1/N) Σ_{i=1}^N E∥∇F_i(w^i_{k,t}) − (1/N) Σ_{j=1}^N ∇F_j(w^j_{k,t})∥ + 2σ_F,   (19)

where the first inequality is due to the triangle inequality and the last follows from Lemma 3. Combining (18) and (19) and defining α_i := ∇F_i(w^i_{k,t}) − ∇F_i(w̄_{k,t}), we have

C_{t+1} ≤ C_t + 2ησ_F + (η/N) Σ_{i=1}^N E∥∇F_i(w̄_{k,t}) − (1/N) Σ_{j=1}^N ∇F_j(w̄_{k,t})∥ + (η/N) Σ_{i=1}^N E∥α_i − (1/N) Σ_{j=1}^N α_j∥.   (20)

It follows from Lemma 1 that ∥α_i∥ ≤ L_F ∥w^i_{k,t} − w̄_{k,t}∥, and consequently (1/N) Σ_{i=1}^N E[∥α_i∥] ≤ L_F C_t. Plugging this into (20) leads to

C_{t+1} ≤ (1 + 2ηL_F) C_t + 2ησ_F + (η/N) Σ_{i=1}^N E∥∇F_i(w̄_{k,t}) − (1/N) Σ_{j=1}^N ∇F_j(w̄_{k,t})∥
≤ (1 + 2ηL_F) C_t + 2η(σ_F + γ_F),   (21)

where the last inequality is due to Lemma 2. From the recursion (21), we have

C_{t+1} ≤ (Σ_{j=0}^t (1 + 2ηL_F)^j) · 2η(σ_F + γ_F) ≤ 2η(t + 1)(1 + 2ηL_F)^t (σ_F + γ_F) ≤ 2η(t + 1)(1 + 1/(5R))^t (σ_F + γ_F) ≤ 4η(t + 1)(σ_F + γ_F),

where the third inequality uses η ≤ 1/(10L_F R) and the last is due to (1 + 1/(5R))^t ≤ e^{1/5} < 2. This proves (10). For the proof of (11), define D_t := (1/N) Σ_{i=1}^N E∥w^i_{k,t} − w̄_{k,t}∥². Since w^i_{k,0} = w_k for all i = 1, . . . , N, we have D₀ = 0.
Following (17), we have
$$ \begin{aligned} D_{t+1} &= \frac{1}{N}\sum_{i=1}^N \mathbb{E}\big[\|w^i_{k,t+1} - w_{k,t+1}\|^2\big] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}\Big[\Big\|w^i_{k,t} - \eta g_i(w^i_{k,t}) - \frac{1}{N}\sum_{j=1}^N\big(w^j_{k,t} - \eta g_j(w^j_{k,t})\big)\Big\|^2\Big] \\ &\le \frac{1+\nu}{N}\sum_{i=1}^N \mathbb{E}\Big[\Big\|w^i_{k,t} - \frac{1}{N}\sum_{j=1}^N w^j_{k,t}\Big\|^2\Big] + \eta^2\,\frac{1 + 1/\nu}{N}\sum_{i=1}^N \mathbb{E}\Big[\Big\|g_i(w^i_{k,t}) - \frac{1}{N}\sum_{j=1}^N g_j(w^j_{k,t})\Big\|^2\Big] \\ &= (1+\nu)D_t + \underbrace{\eta^2\,\frac{1 + 1/\nu}{N}\sum_{i=1}^N \mathbb{E}\Big[\Big\|g_i(w^i_{k,t}) - \frac{1}{N}\sum_{j=1}^N g_j(w^j_{k,t})\Big\|^2\Big]}_{=:b_2}, \end{aligned} \tag{23} $$
where the inequality is from $\|a + b\|^2 \le (1+\nu)\|a\|^2 + (1 + 1/\nu)\|b\|^2$ for any $\nu > 0$. For $b_2$, it holds that
$$ \begin{aligned} b_2 &\le \frac{2\eta^2(1 + 1/\nu)}{N}\sum_{i=1}^N\Big(\mathbb{E}\Big[\Big\|\nabla F_i(w^i_{k,t}) - \frac{1}{N}\sum_{j=1}^N \nabla F_j(w^j_{k,t})\Big\|^2\Big] + 2\,\mathbb{E}\Big[\Big\|g_i(w^i_{k,t}) - \nabla F_i(w^i_{k,t}) + \frac{1}{N}\sum_{j=1}^N\big(\nabla F_j(w^j_{k,t}) - g_j(w^j_{k,t})\big)\Big\|^2\Big]\Big) \\ &\le \frac{2\eta^2(1 + 1/\nu)}{N}\sum_{i=1}^N\Big(\mathbb{E}\Big[\Big\|\nabla F_i(w^i_{k,t}) - \frac{1}{N}\sum_{j=1}^N \nabla F_j(w^j_{k,t})\Big\|^2\Big] + 4\,\mathbb{E}\Big[\big\|g_i(w^i_{k,t}) - \nabla F_i(w^i_{k,t})\big\|^2 + \frac{1}{N}\sum_{j=1}^N\big\|\nabla F_j(w^j_{k,t}) - g_j(w^j_{k,t})\big\|^2\Big]\Big) \\ &\le \frac{4\eta^2(1 + 1/\nu)}{N}\sum_{i=1}^N\Big(\mathbb{E}\Big[\Big\|\nabla F_i(w_{k,t}) - \frac{1}{N}\sum_{j=1}^N \nabla F_j(w_{k,t})\Big\|^2\Big] + \mathbb{E}\Big[\Big\|\alpha_i - \frac{1}{N}\sum_{j=1}^N \alpha_j\Big\|^2\Big] + 4\sigma_F^2\Big) \\ &\le \frac{4\eta^2(1 + 1/\nu)}{N}\sum_{i=1}^N\Big(\gamma_F^2 + 2 L_F^2\Big(\mathbb{E}\big[\|w^i_{k,t} - w_{k,t}\|^2\big] + \mathbb{E}\Big[\frac{1}{N}\sum_{j=1}^N\|w^j_{k,t} - w_{k,t}\|^2\Big]\Big) + 4\sigma_F^2\Big) \\ &\le 4\eta^2(1 + 1/\nu)\big(\gamma_F^2 + 4 L_F^2 D_t + 4\sigma_F^2\big), \end{aligned} \tag{24} $$
where the first and second inequalities are due to the Cauchy–Schwarz inequality, the third inequality is from Lemma 3, the Cauchy–Schwarz inequality and $\alpha_i = \nabla F_i(w^i_{k,t}) - \nabla F_i(w_{k,t})$, the fourth inequality is from Lemmas 1 and 2, and the last inequality is obtained using the definition of $D_t$. Plugging (24) into (23) yields
$$ \begin{aligned} D_{t+1} &\le (1+\nu)D_t + 4\eta^2(1 + 1/\nu)\big(\gamma_F^2 + 4 L_F^2 D_t + 4\sigma_F^2\big) \\ &\le \Big(\sum_{j=0}^{t}\big(1 + \nu + 16(1 + 1/\nu)\eta^2 L_F^2\big)^j\Big)\cdot 4\eta^2(1 + 1/\nu)\big(\gamma_F^2 + 4\sigma_F^2\big) \\ &\le 4\eta^2(t+1)\big(1 + \nu + 16(1 + 1/\nu)\eta^2 L_F^2\big)^t(1 + 1/\nu)\big(\gamma_F^2 + 4\sigma_F^2\big). \end{aligned} \tag{25} $$
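The growth factor $(1 + \nu + 16(1+1/\nu)\eta^2 L_F^2)^t$ appearing in (25) is controlled in the next step by choosing $\nu = \frac{1}{2R}$. The sketch below verifies numerically that it stays below 4 for all $t \le R$ under the step-size condition; the value of $L_F$ is illustrative, since only the product $\eta L_F = \frac{1}{10R}$ matters.

```python
# With nu = 1/(2R) and eta = 1/(10*L_F*R), the per-step growth factor satisfies
# base <= 1 + 1/R, hence base**t <= (1 + 1/R)**R <= e < 4 for all t <= R.
L_F = 2.5  # illustrative Lipschitz constant
for R in range(1, 200):
    nu = 1.0 / (2.0 * R)
    eta = 1.0 / (10.0 * L_F * R)
    base = 1.0 + nu + 16.0 * (1.0 + 1.0 / nu) * (eta * L_F) ** 2
    assert base <= 1.0 + 1.0 / R   # intermediate estimate used in the proof
    assert base ** R <= 4.0        # hence the factor is at most 4 for t <= R
```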
Taking $\nu = \frac{1}{2R}$ and $\eta \le \frac{1}{10 L_F R}$ gives
$$ \big(1 + \nu + 16(1 + 1/\nu)\eta^2 L_F^2\big)^t = \Big(1 + \frac{1}{2R} + 16(1 + 2R)\eta^2 L_F^2\Big)^t \le \Big(1 + \frac{1}{2R} + \frac{16(1 + 2R)}{100 R^2}\Big)^t \le \Big(1 + \frac{1}{R}\Big)^t \le 4, $$
where the first inequality is due to $\eta \le \frac{1}{10 L_F R}$, the second inequality is from $1 + 2R \le 3R$, and the last inequality is based on the fact $(1 + \frac{1}{R})^R \le e$ and $t \le R$. Plugging the above inequality into (25), together with $1 + 1/\nu = 1 + 2R \le 3R$, we have
$$ D_{t+1} \le 4\eta^2(t+1)\cdot 4\cdot 3R\big(\gamma_F^2 + 4\sigma_F^2\big) \le 48(t+1)R\eta^2\big(\gamma_F^2 + 4\sigma_F^2\big). $$
We complete the proof.

Define the averaged iterate $\bar w_{k,t} := \frac{1}{S}\sum_{i\in\mathcal{S}_k} w^i_{k,t}$. Then,
$$ \bar w_{k+1,t+1} = \frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}}\big(w^i_{k+1,t} - \eta g_i(w^i_{k+1,t})\big) = \bar w_{k+1,t} - \frac{\eta}{S}\sum_{i\in\mathcal{S}_{k+1}} g_i(w^i_{k+1,t}). \tag{26} $$
From the Lipschitz continuity of the gradient of $F$ (Lemma 1), we have
$$ \begin{aligned} F(\bar w_{k+1,t+1}) &\le F(\bar w_{k+1,t}) + \nabla F(\bar w_{k+1,t})^\top\big(\bar w_{k+1,t+1} - \bar w_{k+1,t}\big) + \frac{L_F}{2}\big\|\bar w_{k+1,t+1} - \bar w_{k+1,t}\big\|^2 \\ &= F(\bar w_{k+1,t}) - \eta\,\nabla F(\bar w_{k+1,t})^\top\Big(\frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}} g_i(w^i_{k+1,t})\Big) + \frac{L_F\eta^2}{2}\Big\|\frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}} g_i(w^i_{k+1,t})\Big\|^2, \end{aligned} \tag{27} $$
where the inequality is from the Lipschitz gradient property of $F$ and the equality is due to (26). Taking expectation in (27), we obtain
$$ \mathbb{E}\big[F(\bar w_{k+1,t+1})\big] \le \mathbb{E}\big[F(\bar w_{k+1,t})\big] - \underbrace{\eta\,\mathbb{E}\Big[\nabla F(\bar w_{k+1,t})^\top\frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}} g_i(w^i_{k+1,t})\Big]}_{=:q_1} + \underbrace{\frac{L_F\eta^2}{2}\,\mathbb{E}\Big[\Big\|\frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}} g_i(w^i_{k+1,t})\Big\|^2\Big]}_{=:q_2}. \tag{28} $$
The sketch of the proof is to estimate the difference between the stochastic gradient $\frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}} g_i(w^i_{k+1,t})$ and $\nabla F(\bar w_{k+1,t})$, and to derive a decrease of $F$ with respect to $\|\nabla F(\bar w_{k+1,t})\|^2$. Once the one-step decrease is obtained, the complexity result follows by summing over all iterates. First, we use the following splitting of $g_i(w^i_{k+1,t})$, namely,
$$ \frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}} g_i(w^i_{k+1,t}) = X + Y + Z + Q + \nabla F(\bar w_{k+1,t}), $$
where
$$ \begin{aligned} X &= \frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}}\big(g_i(w^i_{k+1,t}) - \nabla F_i(w^i_{k+1,t})\big), & Y &= \frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}}\big(\nabla F_i(w^i_{k+1,t}) - \nabla F_i(w_{k+1,t})\big), \\ Z &= \frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}}\big(\nabla F_i(w_{k+1,t}) - \nabla F_i(\bar w_{k+1,t})\big), & Q &= \frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}}\nabla F_i(\bar w_{k+1,t}) - \nabla F(\bar w_{k+1,t}). \end{aligned} $$
Next, we bound the moments of $X$, $Y$, $Z$ and $Q$.

• It follows from the Cauchy–Schwarz inequality that $\|X\|^2 \le \frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}}\big\|g_i(w^i_{k+1,t}) - \nabla F_i(w^i_{k+1,t})\big\|^2$.
By the tower rule, we have
$$ \mathbb{E}\big[\|X\|^2\big] = \mathbb{E}\Big[\mathbb{E}\big[\|X\|^2 \mid \mathcal{F}^t_{k+1}\big]\Big] \le \sigma_F^2. \tag{30} $$

• Note that
$$ \mathbb{E}\big[\|Y\|^2\big] \le \mathbb{E}\Big[\frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}}\big\|\nabla F_i(w^i_{k+1,t}) - \nabla F_i(w_{k+1,t})\big\|^2\Big] \le L_F^2\,\mathbb{E}\Big[\frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}}\big\|w^i_{k+1,t} - w_{k+1,t}\big\|^2\Big] \le 48 R(R-1)\eta^2 L_F^2\big(\gamma_F^2 + 4\sigma_F^2\big), \tag{31} $$
where the first inequality is from the Cauchy–Schwarz inequality, the second inequality is due to the Lipschitz gradient of $F_i$ given in Lemma 1, and the last inequality is based on (11) and $t \le R-1$.

• Using the variance of sampling without replacement, we have
$$ \mathbb{E}\big[\|\bar w_{k+1,t} - w_{k+1,t}\|^2 \mid \mathcal{F}^t_{k+1}\big] = \frac{1}{SN}\sum_{i=1}^N\big\|w^i_{k+1,t} - w_{k+1,t}\big\|^2\cdot\Big(1 - \frac{S-1}{N-1}\Big). \tag{32} $$
By the gradient Lipschitz property of $F_i$ given in Lemma 1, we obtain
$$ \mathbb{E}\big[\|Z\|^2\big] \le L_F^2\,\mathbb{E}\big[\|\bar w_{k+1,t} - w_{k+1,t}\|^2\big] \le \frac{L_F^2}{SN}\Big(1 - \frac{S-1}{N-1}\Big)\sum_{i=1}^N\mathbb{E}\big[\|w^i_{k+1,t} - w_{k+1,t}\|^2\big] \le \frac{48(N-S)R(R-1)\eta^2 L_F^2\big(\gamma_F^2 + 4\sigma_F^2\big)}{S(N-1)}, \tag{33} $$
where the second inequality is due to (32) and the last inequality is from (11).

• From Lemma 2 and similar derivations as (32), we have
$$ \mathbb{E}\big[\|Q\|^2\big] \le \frac{1}{S}\Big(1 - \frac{S-1}{N-1}\Big)\,\mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^N\big\|\nabla F_i(\bar w_{k+1,t}) - \nabla F(\bar w_{k+1,t})\big\|^2\Big] \le \frac{(N-S)\,\gamma_F^2}{S(N-1)}. \tag{34} $$

With the above estimates, we can bound $q_1$:
$$ \begin{aligned} q_1 &= \eta\,\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^\top\big(X + Y + Z + Q + \nabla F(\bar w_{k+1,t})\big)\big] \\ &\ge \eta\,\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^\top\big(Q + \nabla F(\bar w_{k+1,t})\big)\big] + \eta\,\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^\top X\big] - \frac{\eta}{4}\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] - \eta\,\mathbb{E}\big[\|Y + Z\|^2\big] \\ &= \eta\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] + \eta\,\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^\top X\big] - \frac{\eta}{4}\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] - \eta\,\mathbb{E}\big[\|Y + Z\|^2\big], \end{aligned} \tag{36} $$
where the inequality applies Young's inequality to $\nabla F(\bar w_{k+1,t})^\top(Y+Z)$ and the last equality is from (35). For the second term in (36), we have
$$ \big|\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^\top X\big]\big| = \big|\mathbb{E}\big[\nabla F(\bar w_{k+1,t})^\top\,\mathbb{E}[X \mid \mathcal{F}^t_{k+1}]\big]\big| \le \frac{1}{4}\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] + \mathbb{E}\Big[\big\|\mathbb{E}[X \mid \mathcal{F}^t_{k+1}]\big\|^2\Big] \le \frac{1}{4}\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] + 4 r\,\sigma_G^2, \tag{37} $$
where $r := \max_{k,t,i}\big(1 - \frac{|\mathcal{D}^i_{k,t}| - 1}{|\mathcal{D}_i| - 1}\big)\big/|\mathcal{D}^i_{k,t}|$, the equality uses the tower rule, and the last inequality follows from Lemma 3. For the last term of (36), it holds that
$$ \mathbb{E}\big[\|Y + Z\|^2\big] \le 2\big(\mathbb{E}[\|Y\|^2] + \mathbb{E}[\|Z\|^2]\big) \le 96 R(R-1)\eta^2 L_F^2\big(\gamma_F^2 + 4\sigma_F^2\big)\Big(1 + \frac{N-S}{S(N-1)}\Big) \le 192 R(R-1)\eta^2 L_F^2\big(\gamma_F^2 + 4\sigma_F^2\big), \tag{38} $$
where the first inequality is from the Cauchy–Schwarz inequality and the second inequality is due to (31) and (33). Plugging (37) and (38) into (36) yields
$$ q_1 \ge \frac{\eta}{2}\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] - 192 R(R-1)\eta^3 L_F^2\big(\gamma_F^2 + 4\sigma_F^2\big) - 4 r\eta\,\sigma_G^2. \tag{39} $$
The remaining term in (28) to be bounded is $q_2$.
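The factor $1 - \frac{S-1}{N-1}$ in (32) and (34) is the classical finite-population correction: the mean of $S$ draws taken uniformly without replacement from $N$ values with population variance $\sigma^2$ has variance $\frac{\sigma^2}{S}\big(1 - \frac{S-1}{N-1}\big)$. This identity can be verified exactly by enumerating all size-$S$ subsets of a small population (the numbers below are arbitrary):

```python
import itertools
import numpy as np

def sample_mean_variance(pop, S):
    """Exact variance of the mean of S values drawn uniformly without replacement:
    all size-S subsets are equally likely, so enumerate them."""
    means = [np.mean(c) for c in itertools.combinations(pop, S)]
    return np.var(means)

pop = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
N, S = len(pop), 3
sigma2 = np.var(pop)  # population variance (denominator N)
exact = sample_mean_variance(pop, S)
formula = sigma2 / S * (1.0 - (S - 1) / (N - 1))
# exact and formula agree to machine precision
```

For $S = N$ the factor vanishes (the "sample" is the whole population and the mean is deterministic), which is why the $\frac{N-S}{S(N-1)}$ terms disappear from the final bound under full participation.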
It follows from the Cauchy–Schwarz inequality that
$$ \Big\|\frac{1}{S}\sum_{i\in\mathcal{S}_{k+1}} g_i(w^i_{k+1,t})\Big\|^2 \le 2\|X + Y + Z\|^2 + 2\big\|Q + \nabla F(\bar w_{k+1,t})\big\|^2 \le 4\big(\|X\|^2 + \|Y + Z\|^2\big) + 4\big(\|Q\|^2 + \|\nabla F(\bar w_{k+1,t})\|^2\big). \tag{40} $$
By (30), (38), (34) and (40), we have
$$ \begin{aligned} q_2 &\le 2 L_F\eta^2\Big(\sigma_F^2 + 192 R(R-1)\eta^2 L_F^2\big(\gamma_F^2 + 4\sigma_F^2\big) + \frac{N-S}{S(N-1)}\gamma_F^2 + \mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big]\Big) \\ &\le 2 L_F\eta^2\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] + 400 L_F^3\eta^4 R(R-1)\big(\gamma_F^2 + 4\sigma_F^2\big) + 2 L_F\eta^2\Big(\frac{N-S}{S(N-1)}\gamma_F^2 + \sigma_F^2\Big). \end{aligned} \tag{41} $$
Plugging (39) and (41) into (28) gives
$$ \begin{aligned} \mathbb{E}\big[F(\bar w_{k+1,t+1})\big] &\le \mathbb{E}\big[F(\bar w_{k+1,t})\big] - \eta\Big(\frac{1}{2} - 2\eta L_F\Big)\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] + 200(1 + 2\eta L_F)\eta^3 L_F^2 R(R-1)\big(\gamma_F^2 + 4\sigma_F^2\big) \\ &\quad + 2 L_F\eta^2\Big(\frac{N-S}{S(N-1)}\gamma_F^2 + \sigma_F^2\Big) + 4 r\eta\,\sigma_G^2 \\ &\le \mathbb{E}\big[F(\bar w_{k+1,t})\big] - \frac{\eta}{4}\,\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] + 400\eta^3 L_F^2 R^2\big(\gamma_F^2 + 4\sigma_F^2\big) + 2 L_F\eta^2\Big(\frac{N-S}{S(N-1)}\gamma_F^2 + \sigma_F^2\Big) + 4 r\eta\,\sigma_G^2, \end{aligned} \tag{42} $$
where the last inequality is due to $\eta \le \frac{1}{10 R L_F}$. Note that $\bar w_{k+1,0} = w_k$ and $\bar w_{k+1,R} = w_{k+1}$. Summing (42) over $t = 0, \ldots, R-1$ and $k = 0, \ldots, K-1$ yields
$$ \mathbb{E}\big[F(w_K)\big] \le F(w_0) - \frac{\eta R K}{4}\cdot\frac{1}{RK}\sum_{k=0}^{K-1}\sum_{t=0}^{R-1}\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] + 400\eta^3 L_F^2 R^3 K\big(\gamma_F^2 + 4\sigma_F^2\big) + 2 R K L_F\eta^2\Big(\frac{N-S}{S(N-1)}\gamma_F^2 + \sigma_F^2\Big) + 4 R K r\eta\,\sigma_G^2. $$
Therefore,
$$ \frac{1}{RK}\sum_{k=0}^{K-1}\sum_{t=0}^{R-1}\mathbb{E}\big[\|\nabla F(\bar w_{k+1,t})\|^2\big] \le \frac{4\big(F(w_0) - F^*\big)}{\eta R K} + 1600\eta^2 L_F^2 R^2\big(\gamma_F^2 + 4\sigma_F^2\big) + 8 L_F\eta\Big(\frac{N-S}{S(N-1)}\gamma_F^2 + \sigma_F^2\Big) + 16 r\,\sigma_G^2, $$
where $F^*$ is the minimal value of $F$. This completes the proof.
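The replacement of $\eta(\frac{1}{2} - 2\eta L_F)$ by $\frac{\eta}{4}$ in (42) only needs $2\eta L_F \le \frac{1}{4}$, which the condition $\eta \le \frac{1}{10 R L_F}$ guarantees with room to spare since $R \ge 1$. A quick numerical sanity check (the value of $L_F$ is illustrative):

```python
# Under eta <= 1/(10*R*L_F) with R >= 1 we have 2*eta*L_F <= 1/(5R) <= 1/5,
# so 1/2 - 2*eta*L_F >= 3/10 >= 1/4, justifying the eta/4 descent coefficient.
L_F = 4.0
for R in range(1, 100):
    eta = 1.0 / (10.0 * R * L_F)
    assert 2.0 * eta * L_F <= 0.25
    assert eta * (0.5 - 2.0 * eta * L_F) >= eta / 4.0
```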



and ground truth $b_{\mathrm{real}} = 0$. The observations $(x, y)$ are generated as follows to simulate the heterogeneity among clients. Let $x$

Figure 1: Results for federated Lasso with setting (I).

Figure 2: Results (cont.) for federated Lasso with setting (I).

Figure 3: Results for federated Lasso with setting (II).

Figure 4: Results (cont.) for federated Lasso with setting (II).

Figure 5: Results for federated matrix completion.

4.3 NEURAL NETWORK WITH NONSMOOTH REGULARIZATION

Consider a two-layer neural network with a hidden layer of size 100, the ReLU activation, and a softmax layer at the end. The numerical results are performed on the MNIST dataset, which consists of 70,000 handwritten digit images from 10 classes. We distribute the complete dataset to N = 20 clients. To model a heterogeneous setting in terms of local data sizes and classes, each client is allocated a different local data size in the range [1165, 3834] and only has 2 of the 10 labels. A similar setting is used in (T. Dinh et al., 2020). The loss function is the sum of the cross-entropy loss and the nonsmooth ℓ2-norm function on the weights. Therefore, the resulting problem takes the composite form (1).
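Unlike the squared ℓ2 penalty, the ℓ2-norm regularizer is nonsmooth at the origin, but its proximal operator is still closed-form (block soft-thresholding), which is what keeps the composite form (1) tractable. A minimal sketch, with the weights and threshold chosen purely for illustration:

```python
import numpy as np

def prox_l2_norm(V, tau):
    """Proximal operator of tau * ||.||_2 (block soft-thresholding):
    prox(V) = max(1 - tau / ||V||, 0) * V, with prox(0) = 0."""
    n = np.linalg.norm(V)
    if n <= tau:
        return np.zeros_like(V)
    return (1.0 - tau / n) * V

# A forward-backward step on weights W with smooth-loss gradient G would be
#   W_new = prox_l2_norm(W - step * G, step * mu)
W = np.array([3.0, 4.0])     # ||W|| = 5
print(prox_l2_norm(W, 2.5))  # scales by 1 - 2.5/5 = 0.5 -> [1.5 2. ]
print(prox_l2_norm(W, 6.0))  # threshold exceeds the norm  -> [0. 0.]
```

This shrinkage acts on the whole weight block at once, so the ℓ2-norm penalty pushes entire blocks to zero rather than individual entries as ℓ1 would.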

Figure 6: Results for the non-iid MNIST dataset.




Honglin Yuan, Manzil Zaheer, and Sashank Reddi. Federated composite optimization. In International Conference on Machine Learning, pages 12253-12266. PMLR, 2021.

Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834-2848, 2018.


