FEDERATED LEARNING ON ADAPTIVELY WEIGHTED NODES BY BILEVEL OPTIMIZATION

Anonymous

Abstract

We propose a federated learning method with weighted nodes in which the weights can be modified to optimize the model's performance on a separate validation set. The problem is formulated as a bilevel optimization problem where the inner problem is a federated learning problem with weighted nodes and the outer problem focuses on optimizing the weights based on the validation performance of the model returned from the inner problem. A communication-efficient federated optimization algorithm is designed to solve this bilevel optimization problem. We analyze the generalization performance of the output model and identify the scenarios when our method is in theory superior to training a model locally and superior to federated learning with static and evenly distributed weights.

1. INTRODUCTION

Federated learning (FL) is an emerging technique for training a model using data distributed over a network of nodes without sharing data between nodes (Konečnỳ et al., 2016; McMahan et al., 2017). In this paper, we focus on the case where data distributions across nodes are heterogeneous and each node aims at a model with an optimal local generalization performance. In the classical setting of FL, a globally shared model is learned by minimizing a weighted average loss across all nodes. However, given the heterogeneity of data distributions, a global model is likely to be sub-optimal for some nodes (Fallah et al., 2020). Alternatively, each node can train a model using only its local data, but such a local model may not generalize well either when the volume of local data is small. To achieve a good local generalization performance, each node can still exploit global training data through FL but, at the same time, identify and collaborate only with the nodes whose data distributions are similar or identical to its local distribution. One way to implement this strategy is to let each node solve its own weighted average loss minimization problem, with weights chosen based on the performance on a separate set of local (validation) data. Ideally, each node can learn a better model by allocating more weight to the peers whose data distributions are similar to its local distribution. In this paper, we formulate the choice of the weights as a bilevel optimization (BO) problem (Colson et al., 2005; Vicente & Calamai, 1994), which can be solved by a federated bilevel optimization algorithm, and analyze the generalization performance of the resulting model.

We consider a standard learning problem where the goal is to learn a vector of model parameters θ from a set Θ that minimizes a generalization loss.
This problem can be formulated as

$$\theta^* \in \arg\min_{\theta \in \Theta} \left\{ L_0(\theta) := \mathbb{E}_{z \sim p_0}[l(\theta; z)] \right\}, \tag{P}$$

where $l(\theta; z)$ is the loss of $\theta$ on a data point $z$ from a space $\mathcal{Z}$, and $\mathbb{E}_{z \sim p_0}$ represents the expectation taken over $z$ when $z$ follows an unknown ground truth distribution $p_0$. Directly solving (P) is challenging because $p_0$ is unknown and, typically, training data sampled from $p_0$ is needed for learning an approximation of $\theta^*$. In this paper, we consider the scenario where the amount of data sampled directly from $p_0$ may not be sufficient to learn a good approximation of $\theta^*$, but there exist external data distributed on $K$ nodes that can potentially help the learning of $\theta^*$. In particular, we denote the set of nodes by $\mathcal{K} := \{1, \dots, K\}$ and assume a training set $D^{\mathrm{train}}_k$ is stored in node $k$. We also define $D^{\mathrm{train}} := \bigcup_{k \in \mathcal{K}} D^{\mathrm{train}}_k$, with $D^{\mathrm{train}}_k = \{z_k^{(i)}\}_{i=1}^{n_k}$ sampled i.i.d. from a distribution $p_k$. We assume node $k$ is weighted by $w_k$ and the vector of weights $w = (w_1, \dots, w_K) \in [0, 1]^K$ is located on the capped simplex $\Delta^b_K$ defined as

$$\Delta^b_K = \left\{ w = (w_1, \dots, w_K) \;\middle|\; \sum_{k=1}^K w_k = 1,\; 0 \le w_k \le b,\; k \in \mathcal{K} \right\},$$

where $b \in [\frac{1}{K}, 1]$ is a user-defined parameter. The FL on weighted nodes can be formulated as

$$\hat{\theta}(w) \in \arg\min_{\theta \in \Theta} \sum_{k=1}^K w_k \hat{L}_k(\theta), \tag{1}$$

where $\hat{L}_k(\theta)$ is the empirical loss of $\theta$ on $D^{\mathrm{train}}_k$, namely,

$$\hat{L}_k(\theta) := \frac{1}{n_k} \sum_{i=1}^{n_k} l(\theta; z_k^{(i)}), \quad k \in \mathcal{K}. \tag{2}$$

When some $p_k$'s are different from $p_0$, $w$ in (1) must be chosen adaptively to ensure $\hat{\theta}(w)$ is a good approximation of $\theta^*$ in (P). To do so, we assume that there is a validation dataset $D^{\mathrm{valid}}$ with $|D^{\mathrm{valid}}| = n_0 = n_{\mathrm{valid}}$ and $D^{\mathrm{valid}} = \{z^{(i)}\}_{i=1}^{n_0}$, where $z^{(i)} \in \mathcal{Z}$ is an i.i.d. sample from $p_0$. We assume $D^{\mathrm{valid}}$ is stored in a node called node 0 or center, which may or may not be a node in $\mathcal{K}$. The set $D^{\mathrm{valid}}$ alone may not be sufficient for learning $\theta^*$ precisely but can be used to assist the selection of $w$.
We then propose to estimate the generalization loss of $\hat{\theta}(w)$ using the loss on $D^{\mathrm{valid}}$, i.e.,

$$\hat{L}_0(\theta) := \frac{1}{n_0} \sum_{i=1}^{n_0} l(\theta; z^{(i)}) \tag{3}$$

and use this validation loss to guide the procedure for updating $w$. Presumably, when both the training and validation sets are large enough, the weights in $w$ will be shifted towards the nodes where the data is helpful for learning $\theta^*$. Following this idea, we formulate the federated learning problem on adaptively weighted nodes as the following bilevel optimization (BO) problem:

$$\hat{w} \in \arg\min_{w \in \Delta^b_K} \left\{ \hat{F}(w) := \hat{L}_0(\hat{\theta}(w)) \right\} \quad \text{s.t.} \quad \hat{\theta}(w) \in \arg\min_{\theta \in \Theta} \sum_{k=1}^K w_k \hat{L}_k(\theta). \tag{$\hat{\mathrm{P}}$}$$

In Section 4, we will present a federated optimization algorithm for solving $(\hat{\mathrm{P}})$. Suppose an algorithm can find the optimal solution $\hat{w}$ of $(\hat{\mathrm{P}})$ and the corresponding model parameter $\hat{\theta}(\hat{w})$. We are interested in the optimality gap of the generalization loss of $\hat{\theta}(\hat{w})$, namely, $L_0(\hat{\theta}(\hat{w})) - L_0(\theta^*)$, where $L_0$ is defined as in (P). The main contribution of this paper is to establish a high-probability bound on this gap as a function of the sizes of $D^{\mathrm{train}}$ and $D^{\mathrm{valid}}$ as well as a statistical distance between $p_0$ and the $p_k$'s. Moreover, we compare our generalization bound with the bound achieved by learning only locally from $D^{\mathrm{valid}}$ and the bound achieved by solving (1) with evenly distributed weights, and identify the parameter regimes where our method is preferred in theory.

2. RELATED WORK

The work most related to ours is Chen et al. (2021a), in which the authors proposed a target-aware weighted training algorithm for cross-task learning. Although their problem is completely different from FL, the bilevel optimization model they studied contains $(\hat{\mathrm{P}})$ as a special case. In fact, some steps in the proofs of the generalization bounds in the current work are borrowed from Chen et al. (2021a) with some modifications. However, our work extends their results in several valuable directions. First, the generalization bound in Chen et al. (2021a) is shown for any weight $w$ without any small or zero components, which is not necessarily the case for the optimal solution $\hat{w}$ of $(\hat{\mathrm{P}})$. Second, their generalization bound contains a task-distance term whose convergence rate is not characterized. In contrast, we show the convergence of the entire generalization bound for $\hat{w}$ without any conditions on its components. Third, the generalization bound in Chen et al. (2021a) has a dominating term $O(1/\sqrt{n_{\mathrm{valid}}})$, which is the same as the generalization bound obtained by directly training with the local data $D^{\mathrm{valid}}$. However, we show that, when there exist identical neighbors and an error bound condition holds (Assumptions 2′ and 3), the model learned by $(\hat{\mathrm{P}})$ can be superior to a model trained locally when the $p_k$'s are similar enough to (but still different from) $p_0$, providing an insight into when a node with insufficient data should actively seek collaboration with others.

FL has become a prominent machine learning paradigm for training models with distributed data (Konečnỳ et al., 2016; McMahan et al., 2017). Many federated optimization algorithms have been developed for solving (1) or its expectation form (with $\hat{L}_k$ replaced by $L_k$). A well-known method is the federated averaging (FedAvg) method (McMahan et al., 2017), which applies a local optimization method (e.g., stochastic gradient descent (Robbins & Monro, 1951)) … et al. (2021).
In our setting, (1) is a sub-problem we need to solve multiple times with different $w$'s. We therefore apply the Local-SVRG method by Gorbunov et al. (2021) to (1) because it has the lowest communication complexity for finite-sum problems like (1). Most FL methods produce a globally shared model, which may not perform well on each node when data is heterogeneous across nodes. … Grazzi et al. (2020). However, these algorithms are designed for a single-machine setting and may not be communication efficient if implemented directly in a distributed environment. There are far fewer studies on BO in a distributed setting. The recent works (Li et al., 2022; Tarzanagh et al., 2022) consider a BO problem where both the outer and inner problems are defined with the expectation over data distributed across nodes. They analyze the communication complexity of their methods in a non-convex setting. We propose a different FL algorithm based on Local-SVRG because our problem $(\hat{\mathrm{P}})$ has a finite-sum structure that allows periodically going through all the data points in each node to obtain exact gradient information and achieving lower communication complexity than Li et al. (2022) and Tarzanagh et al. (2022). Chen et al. (2022) consider a decentralized BO problem, and their algorithm for the deterministic case can be applied to our problem and achieve the same communication complexity in the non-convex case. However, we include the results for the convex case and focus more on the generalization performance of federated learning based on BO.

3. GENERALIZATION PERFORMANCE

The following assumption on $(\hat{\mathrm{P}})$ is made for analyzing the generalization performance of $\hat{\theta}(\hat{w})$ in $(\hat{\mathrm{P}})$ and the convergence property of the optimization algorithm for solving $(\hat{\mathrm{P}})$ in Section 4.

Assumption 1 (Well-behaved function). The following statements hold. (1) $l(\theta; z) \in [0, 1]$ and $\nabla l(\theta; z)$ is $\ell_1$-Lipschitz continuous in $\theta$ for any $z \in \mathcal{Z}$. (2) $\hat{L}_k(\theta)$ and $\nabla^2 \hat{L}_k(\theta)$ are $\ell_0$- and $\ell_2$-Lipschitz continuous, respectively, for $k \in \mathcal{K}$. (3) $\hat{L}_k(\theta)$ is $\mu$-strongly convex for $k \in \mathcal{K}$.

These are standard regularity assumptions in recent literature on bilevel optimization (e.g., Ghadimi & Wang (2018)). Under strong convexity of the lower-level problem, (1) has a unique solution, so the inclusion there can be replaced by an equality. Similar to $L_0$ in (P), we define $L_k(\theta) := \mathbb{E}_{z \sim p_k}[l(\theta; z)]$ for $k \in \mathcal{K}$, and we consider the following auxiliary problem

$$W^* = \arg\min_{w \in \Delta^b_K} \left\{ F(w) := L_0(\theta(w)) \right\} \quad \text{s.t.} \quad \theta(w) \in \arg\min_{\theta \in \Theta} \sum_{k=1}^K w_k L_k(\theta). \tag{P$^*$}$$

Problem $(\hat{\mathrm{P}})$ can be viewed as an empirical approximation of (P$^*$) in both the inner and outer problems. Even if all $p_k$'s are different from $p_0$, it is still possible to learn $\theta^*$ correctly by solving (P$^*$). A simple example on mean estimation is

$$\min_{w \in \Delta^1_2} \mathbb{E}(\theta(w) - z_0)^2 \quad \text{s.t.} \quad \theta(w) \in \arg\min_{\theta} \; w_1 \mathbb{E}(\theta - z_1)^2 + w_2 \mathbb{E}(\theta - z_2)^2,$$

where $z_0$, $z_1$ and $z_2$ follow the normal distributions $N(0, 1)$, $N(a, 1)$ and $N(-a, 1)$, respectively, for any $a \neq 0$. Here $\theta(w) = (w_1 - w_2) a$, so $w^* = (0.5, 0.5)$ is the optimal weight and $\theta(w^*) = 0 = \theta^*$. Throughout the paper, we assume $\theta^*$ can be learned by solving (P$^*$), which is stated formally below.

Assumption 2 (Learnability of $\theta^*$ by (P$^*$)). $\theta(w) = \theta^*$ for any $w \in W^*$, where $\theta^*$ satisfies (P).

Besides situations like the simple example above, Assumption 2 clearly holds when $p_k = p_0$ for at least one $k \in \mathcal{K}$. In fact, the latter case happens when node 0 is a node in $\mathcal{K}$, so that $w$ equal to one on that node and zero on the others is optimal.
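The mean-estimation example can be checked numerically. The sketch below is only an illustration under stated assumptions: the inner objective is taken to be w_1 E(θ - z_1)^2 + w_2 E(θ - z_2)^2, whose minimizer is the weighted mean (w_1 - w_2) a, the value a = 2 is arbitrary, and `inner_solution` and `outer_loss` are hypothetical helper names rather than anything defined in the paper.

```python
import numpy as np

def inner_solution(w, means):
    """theta(w) for the assumed inner problem: the w-weighted mean of the node means."""
    return float(np.dot(w, means))

def outer_loss(w, means, target_mean=0.0, target_var=1.0):
    """E(theta(w) - z_0)^2 = (theta(w) - E z_0)^2 + Var(z_0) for z_0 ~ N(0, 1)."""
    theta = inner_solution(w, means)
    return (theta - target_mean) ** 2 + target_var

a = 2.0                            # arbitrary nonzero shift (assumption)
means = np.array([a, -a])          # z_1 ~ N(a, 1), z_2 ~ N(-a, 1)
grid = np.linspace(0.0, 1.0, 101)  # candidate w_1, with w_2 = 1 - w_1

losses = [outer_loss(np.array([w1, 1.0 - w1]), means) for w1 in grid]
best_w1 = float(grid[int(np.argmin(losses))])
best_theta = inner_solution(np.array([best_w1, 1.0 - best_w1]), means)
print(best_w1, best_theta)
```

Neither node has the target distribution, yet the grid search selects equal weights because the two shifts cancel in the weighted mean.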
Moreover, we will later provide a refined generalization analysis for the case where $p_k = p_0$ for some $k \in \mathcal{K}$, so we state this case as a separate assumption below.

Assumption 2′ (Existence of identical neighbors). There exists a strict subset $\mathcal{J} \subset \mathcal{K}$ with $|\mathcal{J}| = J$ such that $p_k = p_0$ for $k \in \mathcal{J}$. Moreover,

$$W^* = \left\{ w \in \Delta^b_K \;\middle|\; w_k = 0 \text{ for } k \in \mathcal{K} \setminus \mathcal{J} \right\}. \tag{4}$$

The first statement in Assumption 2′ implies that the right-hand side of (4) is contained in the left-hand side. The second statement further assumes that they are equal. Assumption 2′ implies Assumption 2 because $\sum_{k=1}^K w_k L_k(\theta) = L_0(\theta)$ for any $\theta \in \Theta$ and any $w \in W^*$ satisfying (4).

Assumption 3 (Error bound condition). There exist $C_r > 0$ and $r \ge 1$ such that

$$\mathrm{Dist}(w, W^*) := \min_{w' \in W^*} \|w - w'\| \le C_r \left( F(w) - \min_{w \in \Delta^b_K} F(w) \right)^{1/r}. \tag{5}$$

Inequality (5) means problem (P$^*$) satisfies the error bound condition, which affects the convergence property of many optimization algorithms (Johnstone & Moulin, 2020; Yang & Lin, 2018; Lewis & Pang, 1998; Pang, 1997; Lin et al., 2020). Due to the limit of space, we refer readers to Appendix B for a practical example satisfying Assumption 3.

We are interested in the generalization performance of $\hat{\theta}(\hat{w})$, represented by the gap $L_0(\hat{\theta}(\hat{w})) - L_0(\theta^*)$, as both $D^{\mathrm{valid}}$ and $D^{\mathrm{train}}$ grow. For simplicity of notation, we assume $n_k = n_{\mathrm{train}}$ for any $k \in \mathcal{K}$ for some integer $n_{\mathrm{train}} \gg n_{\mathrm{valid}}$. To facilitate the analysis, we need to introduce some notation. Given a probability measure $Q$ on $\mathcal{Z}$, let $\mathcal{H} = \{l(\theta; \cdot) : \theta \in \Theta\}$ be a pseudometric space equipped with the pseudometric $\rho_Q$, which is the $L_2$ distance with respect to $Q$, i.e., $\rho_Q(l, l') := \sqrt{\int_{\mathcal{Z}} (l(z) - l'(z))^2 \, dQ(z)}$ for $l, l' \in \mathcal{H}$. The ball with radius $\epsilon > 0$ centered at $l \in \mathcal{H}$ is defined as $B_\epsilon(l) := \{l' \in \mathcal{H} \mid \rho_Q(l, l') \le \epsilon\}$. Let $N(\mathcal{H}; \rho_Q, \epsilon)$ be the $\epsilon$-covering number of $\mathcal{H}$ with respect to $\rho_Q$, i.e., $N(\mathcal{H}; \rho_Q, \epsilon) := \min\{m \mid \exists\, l_1, \dots, l_m \in \mathcal{H},\ \mathcal{H} \subset \cup_{i=1}^m B_\epsilon(l_i)\}$. Following Chen et al. (2021a), we make the following assumption on $N(\mathcal{H}; \rho_Q, \epsilon)$, which is important for analyzing the generalization performance (Koltchinskii, 2006; Kakade et al., 2008).

Assumption 4. There exist $C_{\mathcal{H}} > 0$ and $\nu_{\mathcal{H}} > 0$ such that, for any probability measure $Q$ on $\mathcal{Z}$,

$$N(\mathcal{H}; \rho_Q, \epsilon) \le (C_{\mathcal{H}} / \epsilon)^{\nu_{\mathcal{H}}}, \quad \forall \epsilon > 0.$$

With these assumptions, we obtain the following theorems, whose proofs are in Appendix D.

Theorem 1 (Bound independent of statistical distance). Suppose Assumptions 1, 2 and 4 hold. There exists a universal constant $C_g > 0$ such that, with a probability of at least $1 - \delta$,

$$L_0(\hat{\theta}(\hat{w})) - L_0(\theta^*) \le C_g \left( \frac{\nu_{\mathcal{H}} + \log(1/\delta)}{n_{\mathrm{valid}}} \right)^{\frac{1}{2}} + C_g \frac{\ell_0}{\sqrt{\mu}} \left( \frac{\nu_{\mathcal{H}} + K + \log(1/\delta)}{n_{\mathrm{train}} / (K b^2)} \right)^{\frac{1}{4}}. \tag{7}$$

Under the same assumptions, the generalization bound by Chen et al. (2021a) becomes

$$L_0(\hat{\theta}(w)) - L_0(\theta^*) \le C'_g \left( \frac{\nu_{\mathcal{H}} + \log(1/\delta)}{n_{\mathrm{valid}}} \right)^{\frac{1}{2}} + C'_g \frac{\sqrt{\beta}}{\sqrt{\mu}} \left( \frac{\nu_{\mathcal{H}} + K \log(K) + \log(1/\delta)}{K n_{\mathrm{train}}} \right)^{\frac{1}{4}} + L_0(\theta(w)) - L_0(\theta^*) \tag{8}$$

for a universal constant $C'_g$ and any $w$ satisfying $\beta^{-1} \le w_k / w_j \le \beta$ with $k \neq j$ for some $\beta > 0$. However, it is likely that the optimal solution $w^*$ has zero components (e.g., when $p_k = p_0$ for some $k$). If so, $\beta$ on the right-hand side of (8) needs to be arbitrarily large for $\hat{w}\ (\approx w^*)$ in $(\hat{\mathrm{P}})$ to satisfy the aforementioned condition. Moreover, when $w = \hat{w}$, the convergence of the last term $L_0(\theta(w)) - L_0(\theta^*)$ in (8) is not characterized in Chen et al. (2021a). In contrast, Theorem 1 holds without any assumption on $\hat{w}$ (zero components are allowed), does not depend on $\beta$, and provides a generalization bound converging in every term. When $b = c/K$ with a constant $c \ge 1$, the right-hand side of (7) improves the first two terms on the right-hand side of (8) by a $\log(K)$ term. The bounds (7) and (8) …

$$G := \max_{\theta \in \Theta} \sqrt{\sum_{k \in \mathcal{K}} \left( L_0(\theta) - L_k(\theta) \right)^2}. \tag{9}$$

Theorem 2 (Bound dependent on statistical distance). Suppose Assumptions 1, 2′, 3 and 4 hold.
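There exist universal constants $C_e > 0$ and $C_w > 0$ such that, with a probability of at least $1 - 3\delta$,

$$\mathrm{Dist}(\hat{w}, W^*) \le \varepsilon(n_{\mathrm{valid}}, n_{\mathrm{train}}) := C_w \left( \frac{\nu_{\mathcal{H}} + \log(1/\delta)}{n_{\mathrm{valid}}} \right)^{\frac{1}{2r}} + C_w \frac{\ell_0}{\sqrt{\mu}} \left( \frac{\nu_{\mathcal{H}} + K + \log(1/\delta)}{n_{\mathrm{train}} / (K b^2)} \right)^{\frac{1}{4r}} \tag{10}$$

and

$$L_0(\hat{\theta}(\hat{w})) - L_0(\theta^*) \le C_e \sqrt{\frac{\nu_{\mathcal{H}} + J + \log(1/\delta)}{N_\varepsilon}} + C_e \frac{\varepsilon(n_{\mathrm{valid}}, n_{\mathrm{train}})(K - J)}{b \sqrt{N_\varepsilon}} + 2 \varepsilon(n_{\mathrm{valid}}, n_{\mathrm{train}}) G, \tag{11}$$

where $N_\varepsilon = \dfrac{n_{\mathrm{train}}}{b^2 J + \varepsilon^2(n_{\mathrm{valid}}, n_{\mathrm{train}})(K - J)}$ and $G$ is defined in (9).

Note that $N_\varepsilon = \Theta(n_{\mathrm{train}})$. Based on the decreasing rate of $\varepsilon(n_{\mathrm{valid}}, n_{\mathrm{train}})$ in (10) … meaning that method $(\hat{\mathrm{P}})$ has a better generalization guarantee than training locally. Since $G$ is small in this case, a natural question is whether optimizing the weights in $(\hat{\mathrm{P}})$ is still needed, because FL with equally weighted nodes may already perform well with respect to $p_0$. However, we show in Proposition 2 in Appendix E that $(\hat{\mathrm{P}})$ is still preferred to FL with equally weighted nodes for any $G$. We show the impacts of $b$, $J$ and $K$ through Corollary 2 in Appendix D.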
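To get a feel for the quantities in Theorems 1 and 2, the following sketch evaluates the effective sample size N_ε and the two terms on the right-hand side of (7). It assumes (7) reads as a validation term of order ((ν_H + log(1/δ)) / n_valid)^(1/2) plus a training term of order ((ν_H + K + log(1/δ)) / (n_train / (K b²)))^(1/4). The constants C_g and ℓ_0/√µ are unknown and set to 1 as placeholders, the numbers are illustrative only, and the function names are hypothetical.

```python
import math

def effective_sample_size(n_train, b, J, K, eps):
    """N_eps = n_train / (b^2 * J + eps^2 * (K - J)), as defined in Theorem 2."""
    return n_train / (b ** 2 * J + eps ** 2 * (K - J))

def theorem1_terms(n_valid, n_train, K, b, nu_H, delta, C_g=1.0, l0_over_sqrt_mu=1.0):
    """The two terms of (7); C_g and l0/sqrt(mu) are placeholder constants."""
    log_term = math.log(1.0 / delta)
    valid_term = C_g * math.sqrt((nu_H + log_term) / n_valid)
    train_term = C_g * l0_over_sqrt_mu * (
        (nu_H + K + log_term) / (n_train / (K * b ** 2))
    ) ** 0.25
    return valid_term, train_term

# Illustrative numbers only (not taken from the paper's experiments).
K, J, b = 15, 5, 1.0 / 3.0
n_eff_exact = effective_sample_size(n_train=1000, b=b, J=J, K=K, eps=0.0)
n_eff_loose = effective_sample_size(n_train=1000, b=b, J=J, K=K, eps=0.1)
valid_term, train_term = theorem1_terms(
    n_valid=100, n_train=10000, K=K, b=b, nu_H=10, delta=0.05
)
print(n_eff_exact, n_eff_loose, valid_term, train_term)
```

With ε = 0 the effective sample size is n_train / (b² J), so a smaller cap b spreads the weight over more identical neighbors and increases N_ε, while a larger ε lets weight leak to dissimilar nodes and decreases it.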

4. FEDERATED BILEVEL OPTIMIZATION ALGORITHM

Although our main focus is the generalization performance of $(\hat{\mathrm{P}})$, we present a federated optimization algorithm for $(\hat{\mathrm{P}})$ based on the existing techniques by Gorbunov et al. (2021) and Ghadimi & Wang (2018). Different from a single-level optimization problem, the outer objective $\hat{F}(w)$ in $(\hat{\mathrm{P}})$ depends implicitly on $w$ through the inner optimal solution $\hat{\theta}(w)$, which makes the exact gradient $\nabla \hat{F}(w)$ difficult to compute. A commonly used solution is to exploit the implicit function theorem, as shown in the following lemma, which is from Lemmas 2.1 and 2.2 in Ghadimi & Wang (2018).

Lemma 1. Under Assumption 1, $\nabla \hat{F}(w)$ is $\ell_F$-Lipschitz continuous with

$$\ell_F := \left( \frac{2 \ell_0 \ell_1}{\mu} + \frac{\ell_2 \ell_0^2}{\mu^2} \right) \frac{\sqrt{K} \ell_0}{\mu} + \frac{K \ell_1 \ell_0^2}{\mu^2}, \tag{12}$$

and $\nabla \hat{F}(w) = (\nabla_k \hat{F}(w))_{k=1,\dots,K}$, where $\nabla_k \hat{F}(w)$ is the partial derivative of $\hat{F}$ w.r.t. $w_k$ and

$$\nabla_k \hat{F}(w) = -\nabla \hat{L}_k(\hat{\theta}(w))^\top \left[ \sum_{j=1}^K w_j \nabla^2 \hat{L}_j(\hat{\theta}(w)) \right]^{-1} \nabla \hat{L}_0(\hat{\theta}(w)). \tag{13}$$

By Lemma 1, computing $\nabla_k \hat{F}(w)$ requires solving (1) exactly and taking the inverse of the Hessian matrix in (13), both of which are challenging.

Algorithm 1: Local-SVRG method for (16): Local-SVRG($\{f_{k,i}\}$, $w$, $x^{(0)}$, $\gamma$, $\tau$, $q$, $T$)
Input: functions $\{f_{k,i}\}$, weight $w$, initial vector $x^{(0)} \in \mathbb{R}^d$, learning rate $\gamma$, communication period $\tau \ge 1$, probability $q$ of updating the reference point, and the total number of iterations $T$
  Set $x_k^{(0)} = x^{(0)}$, $y_k^{(0)} = x^{(0)}$, $k = 1, \dots, K$
  for $t = 0, 1, \dots, T - 1$ do
    for $k = 1, \dots, K$ in parallel do
      Choose $i_k$ from $\{1, \dots, n_k\}$ uniformly at random
      $g_k^{(t)} = \nabla f_{k,i_k}(x_k^{(t)}) - \nabla f_{k,i_k}(y_k^{(t)}) + \nabla f_k(y_k^{(t)})$
      $y_k^{(t+1)} = x_k^{(t)}$ with probability $q$ and $y_k^{(t+1)} = y_k^{(t)}$ with probability $1 - q$
      if $(t + 1) \bmod \tau = 0$ then
        $x_k^{(t+1)} = x^{(t+1)} := \sum_{j=1}^K w_j \left( x_j^{(t)} - \gamma g_j^{(t)} \right)$
      else
        $x_k^{(t+1)} = x_k^{(t)} - \gamma g_k^{(t)}$
      end
    end
  end
Return: $\bar{x}^{(T)} = U_T^{-1} \sum_{t=0}^T u_t x^{(t)}$ with $u_t = (1 - \min\{\gamma \mu, q/4\})^{-(t+1)}$ and $U_T = \sum_{t=0}^T u_t$.
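The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal single-process simulation, not the authors' implementation: the "parallel" loop over nodes runs sequentially, the averaged point x^(t) in the returned weighted average is assumed to be the w-weighted mean of the local iterates, and the toy least-squares data are noiseless so that all nodes share one minimizer, which keeps the check simple. The names `local_svrg`, `grad_fi` and `grad_fk` are hypothetical.

```python
import numpy as np

def local_svrg(grad_fi, grad_fk, n, w, x0, gamma, tau, q, T, mu, rng):
    """Simulate Local-SVRG for min_x sum_k w_k f_k(x), f_k = mean_i f_{k,i}."""
    K = len(w)
    x = [x0.copy() for _ in range(K)]            # local iterates x_k
    y = [x0.copy() for _ in range(K)]            # local reference points y_k
    full_at_y = [grad_fk(k, y[k]) for k in range(K)]
    rho = 1.0 - min(gamma * mu, q / 4.0)
    weighted_sum, U = np.zeros_like(x0), 0.0
    for t in range(T + 1):
        u_t = rho ** (-(t + 1))                  # grows with t, so late iterates dominate
        weighted_sum += u_t * sum(w[k] * x[k] for k in range(K))
        U += u_t
        if t == T:
            break
        new_x = []
        for k in range(K):                       # runs in parallel on a real network
            i = rng.integers(n[k])
            g = grad_fi(k, i, x[k]) - grad_fi(k, i, y[k]) + full_at_y[k]
            if rng.random() < q:                 # refresh the reference point at x_k^(t)
                y[k] = x[k].copy()
                full_at_y[k] = grad_fk(k, y[k])
            new_x.append(x[k] - gamma * g)
        if (t + 1) % tau == 0:                   # communication round: weighted averaging
            avg = sum(w[k] * new_x[k] for k in range(K))
            x = [avg.copy() for _ in range(K)]
        else:
            x = new_x
    return weighted_sum / U

# Toy instance (assumption): f_{k,i}(x) = 0.5 * (a_{k,i}^T x - b_{k,i})^2 with b = A x_true.
rng = np.random.default_rng(0)
K, n_k, d = 3, 20, 3
x_true = rng.normal(size=d)
A = [rng.normal(size=(n_k, d)) for _ in range(K)]
b = [A_k @ x_true for A_k in A]
w = np.array([0.5, 0.3, 0.2])

def grad_fi(k, i, x):
    a = A[k][i]
    return a * (a @ x - b[k][i])

def grad_fk(k, x):
    return A[k].T @ (A[k] @ x - b[k]) / n_k

H = sum(w[k] * A[k].T @ A[k] / n_k for k in range(K))
mu = float(np.linalg.eigvalsh(H).min())
x_hat = local_svrg(grad_fi, grad_fk, [n_k] * K, w, np.zeros(d),
                   gamma=0.01, tau=5, q=0.1, T=5000, mu=mu, rng=rng)
err = float(np.linalg.norm(x_hat - x_true))
```

Nodes only exchange iterates every τ steps, which is where the communication saving comes from; the control variate ∇f_k(y_k) keeps the stochastic gradient unbiased while reducing its variance.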
Hence, for a given $w$, we will find an approximate solution of (1), denoted by $\tilde{\theta}(w)\ (\approx \hat{\theta}(w))$, and approximate the matrix inversion in (13) by solving a strongly convex quadratic program. In particular, we will approximate $\nabla \hat{F}(w)$ by $\tilde{\nabla} \hat{F}(w) := (\tilde{\nabla}_k \hat{F}(w))_{k=1,\dots,K}$ with

$$\tilde{\nabla}_k \hat{F}(w) := -\nabla \hat{L}_k(\tilde{\theta}(w))^\top \tilde{h}, \tag{14}$$

where

$$\tilde{h} \approx \arg\min_h \; \frac{1}{2} h^\top \left[ \sum_{j=1}^K w_j \nabla^2 \hat{L}_j(\tilde{\theta}(w)) \right] h - h^\top \nabla \hat{L}_0(\tilde{\theta}(w)). \tag{15}$$

Both (1) and (15) can be written as a distributed finite-sum minimization on $K$ weighted nodes:

$$\min_{x \in \mathbb{R}^d} f(x) := \sum_{k=1}^K w_k f_k(x), \quad \text{where } f_k(x) = \frac{1}{n_k} \sum_{i=1}^{n_k} f_{k,i}(x), \; k = 1, \dots, K. \tag{16}$$

When $f_{k,i}(\theta) = l(\theta; z_k^{(i)})$, (16) becomes (1). When $f_{k,i}(h) = \frac{1}{2} h^\top \nabla^2 l(\tilde{\theta}(w); z_k^{(i)}) h - h^\top \nabla \hat{L}_0(\tilde{\theta}(w))$, (16) becomes (15). With this observation, we apply Local-SVRG by Gorbunov et al. (2021) to these two instances of (16) to obtain $\tilde{\theta}(w)$ and $\tilde{h}$, which are used to construct the approximate gradient $\tilde{\nabla} \hat{F}(w)$ in (14). Then we update $w$ using $\tilde{\nabla} \hat{F}(w)$ based on the accelerated bilevel approximation (ABA) method by Ghadimi & Wang (2018). We choose the combination of Local-SVRG and the ABA method because it leads to the lowest communication complexity in the literature for solving $(\hat{\mathrm{P}})$. We formally present this approach in Algorithms 1 and 2. Recall that we have assumed $D^{\mathrm{valid}}$ is stored in node 0, which is called the center in Algorithm 2. In each iteration of Algorithm 2, in addition to the communication within Local-SVRG, constantly many rounds of communication are needed to exchange $\theta^{(s)}$, $h^{(s)}$, $\nabla \hat{L}_0(\theta^{(s)})$ and $\nabla \hat{L}_k(\theta^{(s)})$ between the center and node $k$.

Algorithm 2: Federated Learning Method for Bilevel Optimization $(\hat{\mathrm{P}})$
Input: initial weight $w^{(0)}$, learning rate $\eta$, training data $D^{\mathrm{train}}_k$ for $k \in \mathcal{K}$, validation data $D^{\mathrm{valid}}$, the number of outer iterations $S$, parameters $(\gamma, \tau, q)$ for Local-SVRG, and the number of inner iterations $T_s$ for $s = 0, \dots, S - 1$
  Set $w_{ag}^{(0)} = w^{(0)}$
  for $s = 0, 1, \dots, S - 1$ do
    $w_{md}^{(s)} = \frac{2}{s+2} w^{(s)} + \frac{s}{s+2} w_{ag}^{(s)}$
    Compute $\tilde{\nabla} \hat{F}(w_{md}^{(s)})$ as follows:
      Set $f_{k,i}(\theta) = l(\theta; z_k^{(i)})$, $i = 1, \dots, n_k$, $k = 1, \dots, K$
      Compute $\theta^{(s)}$ = Local-SVRG($\{f_{k,i}\}$, $w_{md}^{(s)}$, $\theta^{(s-1)}$, $\gamma$, $\tau$, $q$, $T_s$) and send it to each node.
      Compute $\nabla \hat{L}_0(\theta^{(s)})$ at the center and send it to each node.
      Set $f_{k,i}(h) = \frac{1}{2} h^\top \nabla^2 l(\theta^{(s)}; z_k^{(i)}) h - h^\top \nabla \hat{L}_0(\theta^{(s)})$, $i = 1, \dots, n_k$, $k = 1, \dots, K$
      Compute $h^{(s)}$ = Local-SVRG($\{f_{k,i}\}$, $w_{md}^{(s)}$, $\nabla \hat{L}_0(\theta^{(s)})$, $\gamma$, $\tau$, $q$, $T_s$) and send it to each node.
      Each node computes $\nabla \hat{L}_k(\theta^{(s)})$ in parallel and sends it to the center.
      Set $\tilde{\nabla}_k \hat{F}(w_{md}^{(s)}) = -\nabla \hat{L}_k(\theta^{(s)})^\top h^{(s)}$ for $k = 1, \dots, K$
    $w^{(s+1)} = \arg\min_{w \in \Delta^b_K} \langle \tilde{\nabla} \hat{F}(w_{md}^{(s)}), w \rangle + \frac{2}{\eta (s+1)} \|w - w^{(s)}\|^2$
    $w_{ag}^{(s+1)} = \arg\min_{w \in \Delta^b_K} \langle \tilde{\nabla} \hat{F}(w_{md}^{(s)}), w \rangle + \frac{1}{2\eta} \|w - w_{md}^{(s)}\|^2$
  end
Return: $w_{ag}^{(S)}$

We present the communication complexity of Algorithm 2, which can be proved by adapting the analysis in Gorbunov et al. (2021) and Ghadimi & Wang (2018) to our setting. The proofs are deferred to Sections F.1 and F.2.

Theorem 3. Suppose Assumption 1 holds and $\hat{F}(w)$ is convex. Let $R := \max_{w \in \Delta^b_K} \|\hat{\theta}(w)\|$ and

$$\gamma_0 := \min \left\{ \frac{3}{80 \ell_1}, \; \frac{1}{\ell_1 \sqrt{5e(\tau - 1)[6(\tau - 1) + 8 + 16/(1 - q)]}}, \; \frac{q}{4\mu} \right\}. \tag{17}$$

Suppose $\eta = \frac{1}{3 \ell_F}$ in Algorithm 2 with $\ell_F$ defined as in (12). There exist constants $A_1$, $A_2$ and $A_3$ that only depend on $\ell_0$, $\ell_1$, $\ell_2$, $\mu$, $R$, $q$ and $K$ but not on $\tau$ such that the following statements hold.
• Suppose $\tau = 1$, $\gamma = \gamma_0$ and $T_s = \frac{1}{\gamma_0 \mu} \ln \frac{A_1 (s+1)^4}{\gamma_0}$. Algorithm 2 finds an $\epsilon$-optimal solution of $(\hat{\mathrm{P}})$ with $\tilde{O}(\epsilon^{-0.5})$ rounds of communication.
• Suppose $\tau > 1$, $\gamma = \frac{1}{M_s}$ and $T_s = \mu^{-1} M_s \ln M_s^3$, where $M_s = \max\left\{ 1/\gamma_0, \; (s+1)^2 [A_1 + A_2 (\tau - 1) + A_3 (\tau - 1)^2] \right\}$, $s = 0, 1, \dots$. Algorithm 2 finds an $\epsilon$-optimal solution of $(\hat{\mathrm{P}})$ with $\tilde{O}(\epsilon^{-1.5})$ rounds of communication.

When $\hat{F}$ in $(\hat{\mathrm{P}})$ is non-convex, we aim at finding an $\epsilon$-stationary point of $(\hat{\mathrm{P}})$. Following Ghadimi & Wang (2018), we apply a standard proximal gradient method to $(\hat{\mathrm{P}})$ based on the approximate gradient $\tilde{\nabla} \hat{F}(w)$ in (14).
This method and its analysis are standard and we include them in Section F.3 due to the limit of space. In Remark 1 in Section F.3, we also show that the complexity of our method is lower than those of Li et al. (2022) and Tarzanagh et al. (2022) .
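The outer update can be illustrated on a toy problem. The sketch below makes several simplifying assumptions that differ from Algorithm 2: the inner problem and the linear system behind h are solved exactly with `np.linalg.solve` instead of Local-SVRG, the outer loop is a plain projected-gradient iteration rather than the accelerated scheme, and the node losses are hypothetical quadratics 0.5 (θ - c_k)ᵀ Q_k (θ - c_k) with validation loss 0.5 ‖θ - c_0‖². Each arg min over the capped simplex with a quadratic proximity term is a Euclidean projection, computed here by bisection on the shift.

```python
import numpy as np

def project_capped_simplex(v, b, iters=100):
    """Euclidean projection onto {w : sum(w) = 1, 0 <= w_k <= b} by bisection on a shift."""
    lo, hi = v.min() - b, v.max()        # clipped sum is K*b >= 1 at lo and 0 at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.clip(v - mid, 0.0, b).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    return np.clip(v - 0.5 * (lo + hi), 0.0, b)

def inner_solution(w, Q, c):
    """theta(w) minimizing sum_k w_k * 0.5 (theta - c_k)^T Q_k (theta - c_k), and its Hessian."""
    H = sum(w_k * Q_k for w_k, Q_k in zip(w, Q))
    rhs = sum(w_k * Q_k @ c_k for w_k, Q_k, c_k in zip(w, Q, c))
    return np.linalg.solve(H, rhs), H

def outer_loss(w, Q, c, c0):
    theta, _ = inner_solution(w, Q, c)
    return 0.5 * float((theta - c0) @ (theta - c0))

def hypergradient(w, Q, c, c0):
    """Implicit-differentiation formula: -grad L_k(theta)^T H^{-1} grad L_0(theta)."""
    theta, H = inner_solution(w, Q, c)
    h = np.linalg.solve(H, theta - c0)   # exact solve in place of the quadratic program
    return np.array([-(Q_k @ (theta - c_k)) @ h for Q_k, c_k in zip(Q, c)])

# Toy network (assumption): node 1 matches the target, the other three do not.
c0 = np.array([1.0, 0.0])
c = [np.array(p) for p in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0])]
Q = [np.eye(2) for _ in c]
b, eta = 0.5, 0.2
w = np.full(4, 0.25)
F_init = outer_loss(w, Q, c, c0)
for _ in range(50):
    w = project_capped_simplex(w - eta * hypergradient(w, Q, c, c0), b)
F_final = outer_loss(w, Q, c, c0)
print(w, F_init, F_final)
```

In this toy run the weight of the matching node rises to the cap b, the opposite node is driven to zero, and the remaining mass is split between the two orthogonal nodes, which is the best the cap allows.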

5. NUMERICAL EXPERIMENT

In this section, we demonstrate the performance of our method on image classification tasks. We compare our method, denoted by Bi-level, against four baselines: (1) Local-train, which solves $\min_{\theta \in \Theta} \hat{L}_0(\theta)$ locally; (2) FedAvg (McMahan et al., 2017), which solves (1) with $w_k = 1/K$; (3) Ditto (Li et al., 2021); and (4) pFedMe (T Dinh et al., 2020). Ditto and pFedMe are two personalized FL methods. We apply all methods to train a convolutional neural network (CNN) on multiple image datasets: Fashion-MNIST (Xiao et al., 2017), MNIST (Deng, 2012), CIFAR-10 (Krizhevsky et al., 2009) and downsampled 32 × 32 ImageNet (Chrabaszcz et al., 2017). See Appendix G for more details on the CNN and the computing environment we use. We use a mini-batch of size 50 to construct the stochastic gradients in all methods. In the Bi-level method, Algorithm 3 is applied to $(\hat{\mathrm{P}})$ with $b = 1/3$, and five epochs are performed within each call of Local-SVRG (i.e., $T_s = 5 n_{\mathrm{train}} / 50$). We set $q = 1/50$, choose $\gamma$ from {0.05, 0.02, 0.01} when solving (1) and from {0.0005, 0.0002, 0.0001} when solving (15), choose $\tau$ from {10, 20}, and choose $\eta$ from {0.025, 0.02, 0.015}. We choose the combination that produces the highest validation accuracy after five outer iterations. SVRG is applied to $\min_{\theta \in \Theta} \hat{L}_0(\theta)$ in Local-train, and Local-SVRG is applied to (1) with $w_k = 1/K$ in FedAvg. Parameters $\tau$, $q$ and $\eta$ in FedAvg and Local-train are set the same as in our method. Ditto is implemented by setting $S_t = K$, $r = \tau$, $s = 25$ and $\eta_g = \eta_l = \gamma$ in Algorithm 2 in Li et al. (2021), where $\tau$ and $\gamma$ are set the same as in our method. Similar to Li et al. (2021), we choose $\lambda$ in Ditto from {0.05, 0.1, 0.2} to maximize the validation accuracy after five outer iterations. pFedMe is implemented by setting $\beta = 1$, $\delta = 0.005$, $R = \tau$ and $\eta = \gamma$ in Algorithm 1 in T Dinh et al. (2020), with $\tau$ and $\gamma$ set the same as in our method. Each subproblem in pFedMe is solved by gradient descent with a maximum of 20 iterations.
Like Li et al. (2021), we choose $\lambda$ in pFedMe from {5, 10, 15} to maximize the validation accuracy after five outer iterations.

We set $\mathcal{K} = \{1, \dots, 15\}$ (i.e., $K = 15$) and partition it into two groups, a minority group $\mathcal{J}_m = \{1, \dots, 5\}$ and a majority group $\mathcal{J}_M = \{6, \dots, 15\}$. We then generate $D^{\mathrm{train}}_k$ for $k \in \mathcal{K}$ by randomly sampling data from the training sets with some artificial distributions, such that the data distributions (i.e., the $p_k$'s) are the same within each group but different between groups. In particular, we create the data distributions of $\mathcal{J}_m$ and $\mathcal{J}_M$ under four different settings. In Setting 1, we create two different distributions over the classes and use them to sample $D^{\mathrm{train}}_k$ with $k \in \mathcal{J}_m$ and $k \in \mathcal{J}_M$, respectively. In Settings 2, 3 and 4, we first sample data in the same way as in Setting 1 and, additionally, we permute the class labels among a few classes in $D^{\mathrm{train}}_k$ with $k \in \mathcal{J}_M$ under Setting 2, rotate each image in $D^{\mathrm{train}}_k$ with $k \in \mathcal{J}_M$ by 90 degrees in the same but random direction under Setting 3, and do both under Setting 4. This creates nodes with different levels of heterogeneity. To compare the performance of the methods on both groups, we conduct two sets of experiments under each setting, one with $p_0$ being the distribution of $\mathcal{J}_m$ (i.e., $\mathcal{J} = \mathcal{J}_m$) and the other with $p_0$ being the distribution of $\mathcal{J}_M$ (i.e., $\mathcal{J} = \mathcal{J}_M$). $D^{\mathrm{valid}}$ is then sampled from $p_0$. For out-of-sample evaluation, we generate testing data by sampling from the testing set of each dataset using the distribution $p_0$ described above under each setting.

We plot the test (top-1) accuracy each method obtains during its iterations for the minority group in Figure 1, where the horizontal axis represents the number of synchronizations, i.e., the rounds of communication the method performs. Since Local-train does not require any communication, we just plot a horizontal line positioned at its final accuracy.
Due to the space limit, we present the accuracy for the majority group in Figure 7 in Section G.4. We also report the same results in Figure 8 and Figure 9, but the horizontal axis there represents the cumulative number of data points each method processes in parallel. In each figure, we show the confidence intervals of the curves as shaded areas.

According to Figure 1, our Bi-level method performs better than the four baselines on the minority group on all datasets under all settings. Local-train does not perform well because it only has access to a small amount of data. The poor performance of FedAvg is due to the heterogeneity we created across nodes. In fact, FedAvg is even worse than Local-train in many cases, especially in Settings 2, 3 and 4, where the heterogeneity is high. This is consistent with the findings in the literature. Although Ditto and pFedMe are designed for heterogeneous nodes, they still use a fixed weight on each node to train a global model, which may not provide a good starting point for personalization due to the high heterogeneity. In fact, their performance drops to varying degrees as the data heterogeneity increases from Setting 1 to Settings 2, 3 and 4. In contrast, by updating the weights, our method filters the information in the network and helps the node in the minority group find its similar peers and produce a good model through intra-group collaboration. Comparing Figure 1 with Figure 7, we find that the performance of FedAvg, Ditto and pFedMe improves on the majority group. This is again because they utilize the information aggregated from all nodes, which favors the majority. However, our method performs similarly on both groups and is still the best overall for the majority group. Similar phenomena are found in Figure 8 and Figure 9. In addition, we plot in Figure 2 how the weight $w_k$ for each node evolves during the Bi-level method under Setting 1.
We show the results when $p_0$ is the distribution of the majority and the minority groups separately. In each case, we call the nodes in $\mathcal{J}$ similar nodes (to node 0) and call the others dissimilar nodes. According to Figure 2, our method successfully detects similar nodes in both cases and increases their weights while decreasing the weights of dissimilar nodes. We present the weights under Settings 2, 3 and 4 in Section G.4. Similar phenomena are observed.
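The construction of heterogeneous nodes described in this section can be sketched as follows. Everything specific here is an assumption made for illustration: the images are synthetic random arrays instead of the real datasets, the two class distributions, the swapped label pair and the node sample size are arbitrary choices, and `make_node_data` is a hypothetical helper. The configuration shown corresponds to Setting 4 (label permutation plus rotation on the majority group).

```python
import numpy as np

def make_node_data(images, labels, class_probs, n, rng, label_perm=None, rot_k=0):
    """Sample n points whose classes follow class_probs, then optionally
    permute labels and rotate every image by 90 * rot_k degrees."""
    classes = rng.choice(len(class_probs), size=n, p=class_probs)
    idx = np.array([rng.choice(np.flatnonzero(labels == c)) for c in classes])
    x, y = images[idx], labels[idx]
    if label_perm is not None:
        y = label_perm[y]                      # e.g. swap two class labels
    if rot_k:
        x = np.rot90(x, k=rot_k, axes=(1, 2))  # same rotation for every image
    return x, y

rng = np.random.default_rng(0)
num_classes, n_per_node = 10, 40
images = rng.random((500, 8, 8))               # synthetic stand-in for a real dataset
labels = np.arange(500) % num_classes          # every class is present

p_minor = np.array([0.16] * 5 + [0.04] * 5)    # minority group favors classes 0-4
p_major = np.array([0.04] * 5 + [0.16] * 5)    # majority group favors classes 5-9
perm = np.arange(num_classes)
perm[[5, 6]] = perm[[6, 5]]                    # permute labels among a few classes
rot_k = int(rng.choice([1, 3]))                # one random direction, shared by all images

nodes = {}
for k in range(1, 16):                         # K = 15 nodes
    if k <= 5:                                 # minority group J_m = {1, ..., 5}
        nodes[k] = make_node_data(images, labels, p_minor, n_per_node, rng)
    else:                                      # majority group J_M = {6, ..., 15}
        nodes[k] = make_node_data(images, labels, p_major, n_per_node, rng,
                                  label_perm=perm, rot_k=rot_k)
```

Dropping `label_perm` gives Setting 3, dropping `rot_k` gives Setting 2, and dropping both gives Setting 1.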

6. CONCLUSION

We propose an FL approach on a network with weighted nodes and develop a federated bilevel optimization algorithm to optimize the weights based on the model's performance on a validation set. We analyze the generalization performance of the resulting model and identify the scenarios where our method theoretically outperforms training with local data and FL with even weights.

$w$: Vector of weights $(w_1, \dots, w_K) \in [0, 1]^K$
$b$: A user-defined parameter in $[\frac{1}{K}, 1]$
$\Delta^b_K$: Capped simplex where $w$ comes from
$\theta$: Vector of model parameters
$\Theta$: Set where $\theta$ comes from
$l(\theta; z)$: Loss of $\theta$ on a data point $z$ from the space $\mathcal{Z}$
$\hat{L}_0(\theta)$: Empirical loss of $\theta$ on $D^{\mathrm{valid}}$
$\hat{L}_k(\theta)$: Empirical loss of $\theta$ on $D^{\mathrm{train}}_k$, $k = 1, \dots, K$
$L_0(\theta)$: Generalization loss of $\theta$ on $p_0$
$L_k(\theta)$: Generalization loss of $\theta$ on $p_k$, $k = 1, \dots, K$
$\ell_1$: Lipschitz constant of $\nabla l(\theta; z)$ in $\theta$ for any $z \in \mathcal{Z}$, where $l(\theta; z) \in [0, 1]$
$\ell_0$: Lipschitz constant of all $\hat{L}_k(\theta)$ for $k \in \mathcal{K}$
$\ell_2$: Lipschitz constant of all $\nabla^2 \hat{L}_k(\theta)$ for $k \in \mathcal{K}$
$\mu$: Parameter of strong convexity of $\hat{L}_k(\theta)$ for $k \in \mathcal{K}$
$G$: Statistical distance between $p_0$ and the $p_k$'s defined in (9)
$\hat{F}(w)$: The objective in the bilevel optimization problem $(\hat{\mathrm{P}})$
$F(w)$: The objective in the auxiliary problem (P$^*$)

B EXAMPLES SATISFYING ASSUMPTION 3

We consider (P$^*$) in the setting of linear regression. Consider data $z = (x, y)$, where $x \in \mathbb{R}^d$ is a feature vector and $y \in \mathbb{R}$ is a continuous target variable, and consider the quadratic loss $l(\theta; z) = \frac{1}{2}(x^\top \theta - y)^2$. We assume $x$ in all nodes, including node 0 (center) and the nodes in $\mathcal{K}$, follows the same distribution, and the matrix $\mathbb{E}[x x^\top]$ is non-singular. Moreover, we assume that there is a vector $\theta^*_k \in \mathbb{R}^d$ associated with node $k$, and $y$ in node $k$ is generated as $y = x^\top \theta^*_k + \epsilon_k$ for $k = 0, 1, \dots, K$, where $\epsilon_k$ is a zero-mean random noise independent of $x$. In this problem, we have

$$L_k(\theta) = \frac{1}{2} \mathbb{E}(x^\top \theta - y)^2 = \frac{1}{2} \mathbb{E}(x^\top \theta - x^\top \theta^*_k - \epsilon_k)^2 = \frac{1}{2} (\theta - \theta^*_k)^\top \mathbb{E}[x x^\top] (\theta - \theta^*_k) + \frac{1}{2} \mathbb{E}\,\epsilon_k^2.$$

We can easily show that $\theta(w)$ in (P$^*$) has the closed form $\theta(w) = \sum_{k=1}^K w_k \theta^*_k$, which implies

$$F(w) = L_0(\theta(w)) = \frac{1}{2} \left( \sum_{k=1}^K w_k \theta^*_k - \theta^*_0 \right)^\top \mathbb{E}[x x^\top] \left( \sum_{k=1}^K w_k \theta^*_k - \theta^*_0 \right) + \frac{1}{2} \mathbb{E}\,\epsilon_0^2.$$

This is a quadratic function of $w$ over the polyhedral set $\Delta^b_K$ and thus satisfies the error bound condition (5) with $r = 2$ according to Lemma 1 in Gong & Ye (2014).
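The closed form for θ(w) and the quadratic structure of F(w) in this example can be verified numerically. The sketch below works at the population level under assumed values: the second-moment matrix `Sigma`, the node parameters `theta_star`, the target `theta0` and the noise variance are randomly generated or arbitrary, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 3, 4
B = rng.normal(size=(d, d))
Sigma = B @ B.T + np.eye(d)            # stands in for E[x x^T], non-singular by construction
theta_star = rng.normal(size=(K, d))   # theta*_k for the K nodes (assumed values)
theta0 = rng.normal(size=d)            # theta*_0 for the center (assumed value)
noise_var0 = 0.1                       # E[eps_0^2] (assumed value)

def theta_of_w(w):
    """Solve the stationarity condition sum_k w_k Sigma (theta - theta*_k) = 0."""
    H = w.sum() * Sigma
    rhs = Sigma @ (w @ theta_star)
    return np.linalg.solve(H, rhs)

def F(w):
    """Population outer objective L_0(theta(w)) for the quadratic loss."""
    diff = theta_of_w(w) - theta0
    return 0.5 * float(diff @ Sigma @ diff) + 0.5 * noise_var0

def F_quadratic(w):
    """The same objective written directly as a quadratic in w (valid when sum(w) = 1)."""
    diff = w @ theta_star - theta0
    return 0.5 * float(diff @ Sigma @ diff) + 0.5 * noise_var0

w = np.array([0.1, 0.2, 0.3, 0.4])
closed_form = w @ theta_star           # sum_k w_k theta*_k
print(np.allclose(theta_of_w(w), closed_form), F(w), F_quadratic(w))
```

Because all nodes share the same second-moment matrix, it cancels out of the inner solution, which is why θ(w) reduces to a weighted average of the node parameters.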

C TECHNICAL LEMMAS

In this section, we provide some technical lemmas with proofs which are necessary for establishing the main theorems. The main steps in the proofs of Lemma 2 and 3 are borrowed from Chen et al. (2021a) . However, the generalization bound in Theorem 3.1 in Chen et al. (2021a) is for a w satisfying β -1 ≤ w k /w j ≤ β with k ̸ = j for some β > 0, and their bound increases with β. When applied to w = w with zero or nearly zero components (that happens when p k = p 0 for some k), such a β is very large or equals infinity. Therefore, we make necessary changes in the proofs to extend the results for a generic w and a w that is ε-away from W * (see Lemma 3) where the components can be nearly zero. These extensions are important for proofing our main theorems. Lemma 2. Suppose Assumptions 1 and 4 hold. There exists a universal constant C 0 > 0 such that, with a probability of at least 1 -δ, sup θ∈Θ L 0 (θ) -L 0 (θ) ≤ C 0 ν H + log(1/δ) n valid . Proof. For simplicity of notation, we write n valid as n in this proof. Let G valid (D valid ) := sup θ∈Θ L 0 (θ) -L 0 (θ) and G ′ valid (D valid ) := sup θ∈Θ L 0 (θ) -L 0 (θ) . Consider any i ∈ {1, 2, . . . , n}. Let D valid i be the same as D valid except that z (i) is replaced by another data point z ′(i) sampled from p 0 . Recall (3). We have G valid (D valid ) -G valid (D valid i ) = sup θ∈Θ L 0 (θ) -L 0 (θ) -sup θ∈Θ L 0 (θ) -L 0 (θ) + 1 n l(θ; z (i) ) -l(θ; z ′(i) ) ≤ 1 n , where the inequality is because the loss is in [0, 1] (Assumption 1). This inequality means we can apply the McDiarmid's inequality to obtain that, for any ϵ > 0, P G valid (D valid ) ≥ E[G valid (D valid )] + ϵ ≤ exp(-2ϵ 2 n), or equivalently, with a probability of at least 1 -δ, G valid (D valid ) ≤ E[G valid (D valid )] + log(1/δ) 2n . Next, we apply the standard symmetrization argument by introducing a ghost dataset D valid ghost := z ′(i) n i=1 , which is independent of D valid and sampled from p 0 . Let {σ i } n i=1 be Rademacher random variables. 
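We have
$$\begin{aligned}
\mathbb{E}[G_{\text{valid}}(D^{\text{valid}})] &= \mathbb{E}\sup_{\theta\in\Theta}\big(L_0(\theta) - \hat L_0(\theta)\big) = \mathbb{E}\sup_{\theta\in\Theta}\Big(\mathbb{E}\big[\hat L(\theta; D^{\text{valid}}_{\text{ghost}})\big] - \hat L(\theta; D^{\text{valid}})\Big) \\
&\le \mathbb{E}\sup_{\theta\in\Theta}\Big(\hat L(\theta; D^{\text{valid}}_{\text{ghost}}) - \hat L(\theta; D^{\text{valid}})\Big) = \mathbb{E}\sup_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n \sigma_i\big(l(\theta; z'^{(i)}) - l(\theta; z^{(i)})\big) \\
&\le 2\,\mathbb{E}\sup_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n \sigma_i\, l(\theta; z^{(i)}) = 2\,\mathbb{E}\hat R_n(\mathcal H),
\end{aligned}$$
where $\hat R_n(\mathcal H) = \mathbb{E}\big[\sup_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n \sigma_i\, l(\theta; z^{(i)}) \,\big|\, D^{\text{valid}}\big]$ and $\mathcal H = \{l(\theta;\cdot) : \theta\in\Theta\}$. Let $M_\theta := \frac{1}{\sqrt n}\sum_{i=1}^n \sigma_i\, l(\theta; z^{(i)})$ for $\theta\in\Theta$. By Hoeffding's Lemma, we have for any $\theta, \theta'\in\Theta$,
$$\mathbb{E}\big[\exp\big(\lambda(M_\theta - M_{\theta'})\big) \,\big|\, D^{\text{valid}}\big] = \prod_{i=1}^n \mathbb{E}\Big[\exp\Big(\frac{\lambda}{\sqrt n}\sigma_i\big(l(\theta; z^{(i)}) - l(\theta'; z^{(i)})\big)\Big) \,\Big|\, D^{\text{valid}}\Big] \le \prod_{i=1}^n \exp\Big(\frac{\lambda^2}{2n}\big(l(\theta; z^{(i)}) - l(\theta'; z^{(i)})\big)^2\Big) = \exp\Big(\frac{\lambda^2}{2}d^2(\theta,\theta')\Big),$$
where
$$d(\theta,\theta') = \sqrt{\sum_{i=1}^n \frac{1}{n}\big(l(\theta; z^{(i)}) - l(\theta'; z^{(i)})\big)^2} \le 1$$
is the $L_2$-distance between $l(\theta;\cdot)$ and $l(\theta';\cdot)$ with respect to the empirical distribution of $D^{\text{valid}}$. Hence, by Dudley's entropy integral inequality (see Corollary 13.2 in Boucheron et al. (2013)), there exists a universal constant $C_d$ such that
$$\hat R_n(\mathcal H) = \frac{1}{\sqrt n}\,\mathbb{E}\Big[\sup_{\theta\in\Theta} M_\theta \,\Big|\, D^{\text{valid}}\Big] \le \frac{C_d}{\sqrt n}\int_0^1 \sqrt{\log N(\mathcal H; d; \epsilon)}\,d\epsilon,$$
where $N(\mathcal H; d; \epsilon)$ is the $\epsilon$-covering number of $\mathcal H$ w.r.t. $d$. According to Assumption 4, we have
$$\mathbb{E}[G_{\text{valid}}(D^{\text{valid}})] \le 2\,\mathbb{E}\hat R_n(\mathcal H) \le \frac{2C_d}{\sqrt n}\int_0^1 \sqrt{\nu_{\mathcal H}\log\Big(\frac{C_{\mathcal H}}{\epsilon}\Big)}\,d\epsilon.$$
Hence, with a probability of at least $1-\delta$,
$$G_{\text{valid}}(D^{\text{valid}}) \le \frac{2C_d}{\sqrt n}\int_0^1 \sqrt{\nu_{\mathcal H}\log\Big(\frac{C_{\mathcal H}}{\epsilon}\Big)}\,d\epsilon + \sqrt{\frac{\log(1/\delta)}{2n}}.$$
Applying the same argument to $G'_{\text{valid}}(D^{\text{valid}})$, we can show that, with a probability of at least $1-\delta$,
$$G'_{\text{valid}}(D^{\text{valid}}) \le \frac{2C_d}{\sqrt n}\int_0^1 \sqrt{\nu_{\mathcal H}\log\Big(\frac{C_{\mathcal H}}{\epsilon}\Big)}\,d\epsilon + \sqrt{\frac{\log(1/\delta)}{2n}}.$$
By a union bound, we have, with a probability of at least $1-\delta$,
$$\sup_{\theta\in\Theta}\big|L_0(\theta) - \hat L_0(\theta)\big| \le \frac{2C_d}{\sqrt n}\int_0^1 \sqrt{\nu_{\mathcal H}\log\Big(\frac{C_{\mathcal H}}{\epsilon}\Big)}\,d\epsilon + \sqrt{\frac{\log(2/\delta)}{2n}},$$
which completes the proof.

Given $\varepsilon > 0$, we define
$$W^*_\varepsilon := \big\{w\in\Delta_K^b \,\big|\, \mathrm{Dist}(w, W^*) \le \varepsilon\big\}, \tag{19}$$
$$N_\varepsilon := \frac{n_{\text{train}}}{b^2 J + \varepsilon^2(K-J)}. \tag{20}$$

Lemma 3. Suppose Assumptions 1, 2 and 4 hold. There exists a universal constant $C_a > 0$ such that, with a probability of at least $1-\delta$,
$$\sup_{\theta\in\Theta,\, w\in\Delta_K^b}\Big|\sum_{k=1}^K w_k\big(L_k(\theta) - \hat L_k(\theta)\big)\Big| \le C_a\sqrt{\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{n_{\text{train}}/(Kb^2)}}. \tag{21}$$
Suppose Assumptions 1, 2′, 3 and 4 hold. There exists a universal constant $C'_a > 0$ such that, with a probability of at least $1-\delta$,
$$\sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\Big|\sum_{k=1}^K w_k\big(L_k(\theta) - \hat L_k(\theta)\big)\Big| \le C'_a\left(\sqrt{\frac{\nu_{\mathcal H} + J + \log(1/\delta)}{N_\varepsilon}} + \frac{\varepsilon(K-J)}{b\sqrt{N_\varepsilon}}\right). \tag{22}$$

Proof. We prove (22) first. Suppose Assumptions 1, 2′, 3 and 4 hold.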
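By Assumption 2′, we have $w^*_j = 0$ for $w^*\in W^*$ and $j\in\mathcal K\setminus\mathcal J$, which means $w_j \le \sqrt{\sum_{k\in\mathcal K\setminus\mathcal J} w_k^2} \le \mathrm{Dist}(w, W^*) \le \varepsilon$ for any $w\in W^*_\varepsilon$ and $j\in\mathcal K\setminus\mathcal J$. Let
$$G_{\text{train}}(D^{\text{train}}) := \sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K w_k\big(L_k(\theta) - \hat L_k(\theta)\big), \tag{23}$$
$$G'_{\text{train}}(D^{\text{train}}) := \sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K w_k\big(\hat L_k(\theta) - L_k(\theta)\big). \tag{24}$$
Consider an index $j\in\{1, 2, \ldots, K\}$ and $i\in\{1, 2, \ldots, n_j\}$. Let $D^{\text{train}}_{j,i}$ be the same as $D^{\text{train}}$ except that $z^{(i)}_j$ is replaced by another data point $z'^{(i)}_j$ sampled from $p_j$. Recall (2). We have
$$\begin{aligned}
G_{\text{train}}(D^{\text{train}}) - G_{\text{train}}(D^{\text{train}}_{j,i}) &= \sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K w_k\big(L_k(\theta) - \hat L_k(\theta)\big) - \sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\left[\sum_{k=1}^K w_k\big(L_k(\theta) - \hat L_k(\theta)\big) + \frac{w_j}{n_j}\big(l(\theta; z^{(i)}_j) - l(\theta; z'^{(i)}_j)\big)\right] \\
&\le \frac{w_j}{n_j} \le \begin{cases} b/n_j, & j\in\mathcal J, \\ \varepsilon/n_j, & j\in\mathcal K\setminus\mathcal J. \end{cases}
\end{aligned} \tag{25}$$
With this inequality, we can apply McDiarmid's inequality to show that, for any $\epsilon > 0$,
$$P\big(G_{\text{train}}(D^{\text{train}}) \ge \mathbb{E}[G_{\text{train}}(D^{\text{train}})] + \epsilon\big) \le \exp\left(-\frac{2\epsilon^2 n_{\text{train}}}{b^2 J + \varepsilon^2(K-J)}\right),$$
which implies that, with a probability of at least $1-\delta$,
$$G_{\text{train}}(D^{\text{train}}) \le \mathbb{E}[G_{\text{train}}(D^{\text{train}})] + \sqrt{\frac{\log(1/\delta)}{N_\varepsilon}}. \tag{26}$$
Next, we apply the symmetrization argument with ghost datasets, where $D^{\text{train}}_{k,\text{ghost}} := \{z'^{(i)}_k\}_{i=1}^{n_k}$ for $k = 1, \ldots, K$ is a dataset independent of $D^{\text{train}}_k$ and sampled from $p_k$. Let $\{\sigma_{k,i}\}_{i=1}^{n_k}$ for $k = 1, \ldots, K$ be Rademacher random variables. We have
$$\begin{aligned}
\mathbb{E}[G_{\text{train}}(D^{\text{train}})] &= \mathbb{E}\sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K w_k\big(L_k(\theta) - \hat L_k(\theta)\big) = \mathbb{E}\sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K w_k\Big(\mathbb{E}\big[\hat L_k(\theta; D^{\text{train}}_{k,\text{ghost}})\big] - \hat L_k(\theta)\Big) \\
&\le \mathbb{E}\sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K w_k\Big(\hat L_k(\theta; D^{\text{train}}_{k,\text{ghost}}) - \hat L_k(\theta)\Big) = \mathbb{E}\sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K\sum_{i=1}^{n_k}\sigma_{k,i}\frac{w_k}{n_k}\Big(l(\theta; z'^{(i)}_k) - l(\theta; z^{(i)}_k)\Big) \\
&\le 2\,\mathbb{E}\sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K\sum_{i=1}^{n_k}\sigma_{k,i}\frac{w_k}{n_k}\, l(\theta; z^{(i)}_k) = 2\,\mathbb{E}\hat R_{n_{\text{train}}}(\mathcal H, W^*_\varepsilon),
\end{aligned} \tag{27}$$
where
$$\hat R_{n_{\text{train}}}(\mathcal H, W^*_\varepsilon) := \mathbb{E}\left[\sup_{\theta\in\Theta,\, w\in W^*_\varepsilon}\sum_{k=1}^K\sum_{i=1}^{n_k}\sigma_{k,i}\frac{w_k}{n_k}\, l(\theta; z^{(i)}_k) \,\middle|\, D^{\text{train}}\right]$$
and $\mathcal H = \{l(\theta;\cdot) : \theta\in\Theta\}$. Let $M_{\theta,w} := \sqrt{N_\varepsilon}\sum_{k=1}^K\sum_{i=1}^{n_k}\sigma_{k,i}\frac{w_k}{n_k}\, l(\theta; z^{(i)}_k)$ for $\theta\in\Theta$ and $w\in W^*_\varepsilon$.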
Hence, by Hoeffding's Lemma, we have E exp λ(M θ,w -M θ ′ ,w ′ ) D train = K k=1 n k i=1 E exp λ N ε σ k,i n k w k l θ; z (i) k -w ′ k l θ ′ ; z (i) k D train ≤ K k=1 n k i=1 exp λ 2 N ε 2n 2 k w k l θ; z (i) k -w ′ k l θ ′ ; z (i) k 2 = exp λ 2 2 d 2 (θ, w, θ ′ , w ′ ) , where d(θ, w, θ ′ , w ′ ) = K k=1 n k i=1 N ε n 2 k w k l θ; z (i) k -w ′ k l θ ′ ; z (i) k 2 ≤ k∈J N ε b 2 n train + k∈K\J N ε ε 2 n train = 1 is a pseudo distance metric between (l(θ; •), w) and (l(θ ′ ; •), w ′ ) in H × W * ε . Hence, by Dudley's entropy integral inequality (see Corollary 13.2 in Boucheron et al. (2013) ), there exists a universal constant C d such that R ntrain (H, W * ε ) = 1 √ N ε E sup θ∈Θ,w∈W * ε M θ,w D train ≤ C d √ N ε 1 0 log (N (H × W * ε ; d; ϵ))dϵ, where N (H × W * ε ; d; ϵ) is the ϵ-covering number of H × W * ε w.r.t. d. We next need to bound N (H × W * ε ; d; ϵ). Note that d 2 (θ, w, θ ′ , w ′ ) = K k=1 n k i=1 N ε n 2 k w k l θ; z (i) k -w ′ k l θ ′ ; z (i) k 2 ≤ K k=1 n k i=1 2N ε n 2 k w 2 k l θ; z (i) k -l θ ′ ; z (i) k 2 + K k=1 n k i=1 2N ε n 2 k (w k -w ′ k ) 2 l 2 θ ′ ; z (i) k ≤ k∈J n k i=1 2N ε n 2 k b 2 l θ; z (i) k -l θ ′ ; z (i) k 2 + k∈K\J n k i=1 2N ε n 2 k ε 2 l θ; z (i) k -l θ ′ ; z (i) k 2 + K k=1 n k i=1 2N ε n 2 k (w k -w ′ k ) 2 ≤ 2N ε n train   k∈J ntrain i=1 b 2 n train l θ; z (i) k -l θ ′ ; z (i) k 2 + k∈K\J ntrain i=1 ε 2 n train l θ; z (i) k -l θ ′ ; z (i) k 2   + 2N ε n train ∥w -w ′ ∥ 2 . ( ) We then define a probability measure  Q = N ε n train   k∈J ntrain i=1 b 2 n train δ z (i) k + k∈K\J ntrain i=1 ε 2 n train δ z (i) k   (J + 1)Nε √ ntrainϵ J ε (K -J)(J + 1)Nε √ ntrainϵ K-J ≤ 2 ϵ J 2ε √ K -J bϵ K-J , where the inequality is because N ε ≤ ntrain b 2 J by the definition of N ε . 
This implies E[G train (D train )] ≤ 2 R ntrain (H, W * ε ) ≤ 2C d √ N ε 1 0 log N (H × W * ε ; d; ϵ)dϵ =O   1 √ N ε 1 0 (ν H + J) log 1 ϵ + (K -J) log 2ε √ K -J bϵ dϵ   =O ν H + J N ε + O K -J N ε 2ε √ K -J b = O ν H + J N ε + O ε(K -J) b √ N ε , where the first equality is because N (H × W * ε ; d; ϵ) ≤ 2CH ϵ ν H × 2 ϵ J 2ε √ K -J bϵ K-J according to Assumption 4 and (31) and the second equality is by changing variable ϵ to bϵ 2ε √ K-J in the integral and the fact that 1 ϵ = 0 when ϵ > 1. Combining (32) with (26), we have that, with a probability of at least 1 -δ, G train (D train ) ≤ O   ν H + J + log(1/δ) N ε   + O ε(K -J) b √ N ε . Applying the same argument to G ′ valid (D valid ), we can show that the same inequality as above holds for G ′ train (D train ) with a probability of at least 1 -δ. By a union bound, we have, with a probability of at least 1 -δ sup θ∈Θ,w∈W * ε K k=1 w k L k (θ) -L k (θ) ≤ O   ν H + J + log(1/δ) N ε   + O ε(K -J) b √ N ε , which completes the proof ( 22). Next we prove (21). Since the proof is similar to (22), we will mainly elaborate the parts that are different. Suppose Assumptions 1,2 and 4 hold. We define G train (D train ) and G ′ train (D train ) the same as in ( 23) and ( 24) except that W * ε is replaced by the entire domain ∆ b K . Following the same proof of (25), we have G train (D train ) -G train (D train j,i ) ≤ w j n j ≤ b n j for all i ∈ {1, 2, . . . , n j } and j ∈ K. Then the McDiarmid's inequality implies that, with a probability of at least 1 -δ, G train (D train ) ≤ E[G train (D train )] + Kb 2 log(1/δ) 2n train . ( ) By replacing W * ε with ∆ b K in the proof of ( 27), we can show that E[G train (D train )] ≤ 2E R ntrain (H, ∆ b K ) where R ntrain (H, ∆ b K ) := E sup θ∈Θ,w∈∆ b K K k=1 n k i=1 σ k,i w k n k l(θ; z (i) k ) D train . Let N b defined as (20) with ε replaced by b, i.e., N b = n train /(Kb 2 ). Let M θ,w := √ N b K k=1 n k i=1 σ k,i w k n k l(θ; z (i) k ) for θ ∈ Θ and w ∈ ∆ b K . 
With J replaced by ∅ and N b replaced by N ε in the proof of (28), we have E exp λ(M θ,w -M θ ′ ,w ′ ) D train = exp λ 2 2 d 2 (θ, w, θ ′ , w ′ ) , where  d(θ, w, θ ′ , w ′ ) = K k=1 n k i=1 N b n 2 k w k l θ; z (i) k -w ′ k l θ ′ ; z (i) k 2 ≤ k∈K N b b 2 n train = 1 is a pseudo distance metric between (l(θ; •), w) and (l(θ ′ ; •), w ′ ) in H × ∆ b K . R ntrain (H, ∆ b K ) = 1 √ N b E sup θ∈Θ,w∈∆ b K M θ,w D train ≤ C d √ N b 1 0 log N (H × ∆ b K ; d; ϵ) dϵ, where  N (H × ∆ b K ; d; ϵ) is the ϵ-covering number of H × ∆ b K w. d 2 (θ, w, θ ′ , w ′ ) ≤ 2N b n train k∈K ntrain i=1 b 2 n train l θ; z (i) k -l θ ′ ; z (i) k 2 + 2N b n train ∥w -w ′ ∥ 2 . Similar to ( 30  √ KN b √ ntrainϵ K = 1 ϵ K . This implies N (H × ∆ b K ; d; ϵ) ≤ 2C H ϵ ν H × 1 ϵ K and thus E[G train (D train )] ≤ 2 R ntrain (H, ∆ b K ) ≤ 2C d √ N b 1 0 log N (H × ∆ b K ; d; ϵ)dϵ =O 1 √ N b 1 0 ν H log 1 ϵ + K log 1 ϵ dϵ = O ν H + K N b . ( ) Combining ( 34) with ( 33), we have that, with a probability of at least 1 -δ, G train (D train ) ≤O   ν H + K + log(1/δ) N b   . Applying the same argument to G ′ valid (D valid ), we can show that the same inequality as above holds for G ′ train (D train ) with a probability of at least 1 -δ. By taking a union bound, we have that, with a probability of at least 1 -δ, sup θ∈Θ,w∈∆ b K K k=1 w k L k (θ) -L k (θ) ≤ O   ν H + K + log(1/δ) N b   , which completes the proof of ( 21) as N b = n train /(Kb 2 ).

D PROOFS OF MAIN THEOREMS AND COROLLARIES

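In this section, we provide the proofs of Theorem 1, Theorem 2 and Corollary 1.

Proof of Theorem 1. By the strong convexity of the loss function and the optimality of $\theta(w)$ and $\hat\theta(w)$ in the inner problems in $(P^*)$ and $(\hat P)$, we have, for any $w\in\Delta_K^b$,
$$\frac{\mu}{2}\big\|\hat\theta(w) - \theta(w)\big\|^2 \le \sum_{k=1}^K w_k L_k(\hat\theta(w)) - \sum_{k=1}^K w_k L_k(\theta(w)), \tag{35}$$
$$\frac{\mu}{2}\big\|\theta(w) - \hat\theta(w)\big\|^2 \le \sum_{k=1}^K w_k \hat L_k(\theta(w)) - \sum_{k=1}^K w_k \hat L_k(\hat\theta(w)). \tag{36}$$
Adding (35) and (36) leads to, with a probability of at least $1-\delta$,
$$\begin{aligned}
\mu\big\|\hat\theta(w) - \theta(w)\big\|^2 &\le \sum_{k=1}^K w_k L_k(\hat\theta(w)) - \sum_{k=1}^K w_k \hat L_k(\hat\theta(w)) + \sum_{k=1}^K w_k \hat L_k(\theta(w)) - \sum_{k=1}^K w_k L_k(\theta(w)) \\
&\le 2C_a\sqrt{\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{n_{\text{train}}/(Kb^2)}}, \quad \forall w\in\Delta_K^b,
\end{aligned} \tag{37}$$
where the second inequality follows from the first conclusion of Lemma 3. Let $w^* = \mathrm{Proj}_{W^*}(\hat w)$. Then we have, with a probability of $1-2\delta$, that
$$\begin{aligned}
F(\hat w) - \min_{w\in\Delta_K^b} F(w) &= L_0(\theta(\hat w)) - L_0(\theta(w^*)) \\
&\le L_0(\hat\theta(\hat w)) - L_0(\hat\theta(w^*)) + \ell_0\big\|\hat\theta(\hat w) - \theta(\hat w)\big\| + \ell_0\big\|\hat\theta(w^*) - \theta(w^*)\big\| \\
&\le \hat L_0(\hat\theta(\hat w)) - \hat L_0(\hat\theta(w^*)) + 2C_0\sqrt{\frac{\nu_{\mathcal H} + \log(1/\delta)}{n_{\text{valid}}}} + \ell_0\big\|\hat\theta(\hat w) - \theta(\hat w)\big\| + \ell_0\big\|\hat\theta(w^*) - \theta(w^*)\big\| \\
&\le 2C_0\sqrt{\frac{\nu_{\mathcal H} + \log(1/\delta)}{n_{\text{valid}}}} + 2\ell_0\sqrt{\frac{2C_a}{\mu}}\left(\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{n_{\text{train}}/(Kb^2)}\right)^{1/4},
\end{aligned} \tag{38}$$
where the first inequality is because of Assumption 1, the second is due to Lemma 2, and the last is due to (37) and the optimality of $\hat w$ for problem $(\hat P)$. Therefore, we can show that, with a probability of $1-2\delta$,
$$\begin{aligned}
L_0(\hat\theta(\hat w)) - L_0(\theta^*) &= L_0(\hat\theta(\hat w)) - L_0(\theta(\hat w)) + L_0(\theta(\hat w)) - L_0(\theta(w^*)) \\
&\le \ell_0\big\|\hat\theta(\hat w) - \theta(\hat w)\big\| + L_0(\theta(\hat w)) - L_0(\theta(w^*)) \\
&\le 2C_0\sqrt{\frac{\nu_{\mathcal H} + \log(1/\delta)}{n_{\text{valid}}}} + 3\ell_0\sqrt{\frac{2C_a}{\mu}}\left(\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{n_{\text{train}}/(Kb^2)}\right)^{1/4},
\end{aligned}$$
where the equality is because of Assumption 2, the first inequality is by Assumption 1 and the second by (37) and (38). This completes the proof.

Proof of Theorem 2. Since Assumption 2′ implies Assumption 2, the proof and the conclusion of Theorem 1 also hold under the assumptions of Theorem 2. In particular, inequality (38) holds with a probability of $1-2\delta$.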
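According to Assumption 3 and (38), we have with a probability of $1-2\delta$ that
$$C_r^{-1}\,\mathrm{Dist}(\hat w, W^*)^r \le 2C_0\sqrt{\frac{\nu_{\mathcal H} + \log(1/\delta)}{n_{\text{valid}}}} + 2\ell_0\sqrt{\frac{2C_a}{\mu}}\left(\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{n_{\text{train}}/(Kb^2)}\right)^{1/4}.$$
Applying the fact that $s + t \le (s^{1/r} + t^{1/r})^r$ for any $s > 0$ and $t > 0$ to the right-hand side of the inequality above, we obtain (10) with an appropriately defined $C_w$.

Suppose $\mathrm{Dist}(\hat w, W^*) \le \varepsilon(n_{\text{valid}}, n_{\text{train}})$, which happens with a probability of $1-2\delta$ according to the proof above. We have $\hat w\in W^*_\varepsilon$ with $\varepsilon = \varepsilon(n_{\text{valid}}, n_{\text{train}})$ according to the definition in (19). We then decompose the optimality gap of the generalization loss as follows:
$$\begin{aligned}
L_0(\hat\theta(\hat w)) - L_0(\theta^*) ={}& \underbrace{L_0(\hat\theta(\hat w)) - \sum_{k=1}^K \hat w_k L_k(\hat\theta(\hat w))}_{T_1} + \underbrace{\sum_{k=1}^K \hat w_k L_k(\hat\theta(\hat w)) - \sum_{k=1}^K \hat w_k \hat L_k(\hat\theta(\hat w))}_{T_2} \\
&+ \underbrace{\sum_{k=1}^K \hat w_k \hat L_k(\hat\theta(\hat w)) - \sum_{k=1}^K \hat w_k \hat L_k(\theta(\hat w))}_{T_3} + \underbrace{\sum_{k=1}^K \hat w_k \hat L_k(\theta(\hat w)) - \sum_{k=1}^K \hat w_k L_k(\theta(\hat w))}_{T_4} \\
&+ \underbrace{\sum_{k=1}^K \hat w_k L_k(\theta(\hat w)) - \sum_{k=1}^K \hat w_k L_k(\theta(w^*))}_{T_5} + \underbrace{\sum_{k=1}^K \hat w_k L_k(\theta^*) - L_0(\theta^*)}_{T_6}.
\end{aligned} \tag{39}$$
It is clear that $T_3 \le 0$ and $T_5 \le 0$ by the optimality of $\hat\theta(\hat w)$ and $\theta(\hat w)$ in $(\hat P)$ and $(P^*)$, respectively. Moreover, by Assumption 2′, we have $w^*_j = 0$ for $w^*\in W^*$ and $j\in\mathcal K\setminus\mathcal J$. Using the fact that $\hat w\in W^*_\varepsilon$, we have $\sum_{k\in\mathcal K\setminus\mathcal J}\hat w_k^2 \le \mathrm{Dist}^2(\hat w, W^*) \le \varepsilon^2$, which implies
$$\begin{aligned}
T_1 &= \sum_{k=1}^K \hat w_k\big(L_0(\hat\theta(\hat w)) - L_k(\hat\theta(\hat w))\big) = \sum_{k\in\mathcal K\setminus\mathcal J}\hat w_k\big(L_0(\hat\theta(\hat w)) - L_k(\hat\theta(\hat w))\big) \\
&\le \sqrt{\sum_{k\in\mathcal K\setminus\mathcal J}\hat w_k^2}\;\max_{\theta\in\Theta}\sqrt{\sum_{k\in\mathcal K\setminus\mathcal J}\big(L_0(\theta) - L_k(\theta)\big)^2} \le \varepsilon\max_{\theta\in\Theta}\sqrt{\sum_{k\in\mathcal K\setminus\mathcal J}\big(L_0(\theta) - L_k(\theta)\big)^2}.
\end{aligned}$$
Using a similar argument, we can also show
$$T_6 \le \varepsilon\max_{\theta\in\Theta}\sqrt{\sum_{k\in\mathcal K\setminus\mathcal J}\big(L_0(\theta) - L_k(\theta)\big)^2}.$$
According to Lemma 3 and the fact that $\hat w\in W^*_\varepsilon$, we have that, with a probability of at least $1-\delta$,
$$T_2,\ T_4 \le C'_a\left(\sqrt{\frac{\nu_{\mathcal H} + J + \log(1/\delta)}{N_\varepsilon}} + \frac{\varepsilon(K-J)}{b\sqrt{N_\varepsilon}}\right).$$
Note that $G = \max_{\theta\in\Theta}\sqrt{\sum_{k\in\mathcal K\setminus\mathcal J}\big(L_0(\theta) - L_k(\theta)\big)^2}$ under Assumption 2′.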
Applying the upper bounds of the six terms to (39) and taking a union bound, we can show that
$$L_0(\hat\theta(\hat w)) - L_0(\theta^*) \le 2C'_a\left(\sqrt{\frac{\nu_{\mathcal H} + J + \log(1/\delta)}{N_\varepsilon}} + \frac{\varepsilon(K-J)}{b\sqrt{N_\varepsilon}}\right) + 2\varepsilon(n_{\text{valid}}, n_{\text{train}})\,G$$
with a probability of at least $1-3\delta$, which completes the proof.

Before we prove Corollary 1, we first present another corollary of Theorem 2 in which the impact of $b$ can be seen more clearly.

Corollary 2. Suppose the assumptions of Theorem 2 hold and $n_{\text{valid}}$ and $n_{\text{train}}$ are large enough such that $\varepsilon(n_{\text{valid}}, n_{\text{train}})$ defined in (10) satisfies $\varepsilon(n_{\text{valid}}, n_{\text{train}}) \le \frac{b\sqrt J}{K-J}$. With a probability of at least $1-3\delta$, we have
$$L_0(\hat\theta(\hat w)) - L_0(\theta^*) \le O\left(\sqrt{\frac{\nu_{\mathcal H} + J + \log(1/\delta)}{n_{\text{train}}/(Jb^2)}}\right) + G\cdot O\left(\left(\frac{\nu_{\mathcal H} + \log(1/\delta)}{n_{\text{valid}}}\right)^{\frac{1}{2r}}\right) + G\cdot O\left(\left(\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{n_{\text{train}}/(Kb^2)}\right)^{\frac{1}{4r}}\right). \tag{40}$$

Proof. Theorem 2 guarantees that (11) holds with a high probability. When $\varepsilon(n_{\text{valid}}, n_{\text{train}}) \le \frac{b\sqrt J}{K-J}$, the second term on the right-hand side of (11) can be merged with the first term. Also, we have
$$N_\varepsilon = \frac{n_{\text{train}}}{b^2 J + \varepsilon^2(n_{\text{valid}}, n_{\text{train}})(K-J)} \ge \frac{n_{\text{train}}}{b^2 J + b^2 J/(K-J)} \ge \frac{n_{\text{train}}}{2b^2 J},$$
where the last inequality is because $K - J \ge 1$. As a result, the first two terms in (11) together have the order of $O\Big(\sqrt{\frac{\nu_{\mathcal H} + J + \log(1/\delta)}{n_{\text{train}}/(Jb^2)}}\Big)$. Then (40) is proved by applying the definition of $\varepsilon(n_{\text{valid}}, n_{\text{train}})$ to the third term on the right-hand side of (11).

Suppose $n_{\text{train}} \gg n_{\text{valid}}$. The first term on the right-hand side of (40) is smaller than the entire right-hand side of (7). If, in addition, $G = o\big(1/n_{\text{valid}}^{\frac12 - \frac{1}{2r}}\big)$ and $G = o\big(1/n_{\text{train}}^{\frac14 - \frac{1}{4r}}\big)$, the other two terms in (40) are also smaller than the two terms in (7), respectively, so the bound in (40) is tighter than (7).

Proof of Corollary 1. Corollary 1 follows directly from Corollary 2 by keeping only $G$, $n_{\text{train}}$ and $n_{\text{valid}}$ in the order of magnitude given in (40).
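The lower bound on $N_\varepsilon$ used in the proof of Corollary 2 can be checked numerically. The following is a minimal sketch, assuming only the definition (20) and the condition $\varepsilon \le b\sqrt J/(K-J)$; the function names and the numeric values are illustrative and do not come from the paper.

```python
import math


def effective_sample_size(n_train, b, J, K, eps):
    """N_eps = n_train / (b^2 * J + eps^2 * (K - J)), following definition (20)."""
    return n_train / (b * b * J + eps * eps * (K - J))


def eps_threshold(b, J, K):
    """Largest eps allowed by Corollary 2: b * sqrt(J) / (K - J)."""
    return b * math.sqrt(J) / (K - J)


def corollary2_lower_bound(n_train, b, J):
    """Right-hand side of the chain in the proof: n_train / (2 * b^2 * J)."""
    return n_train / (2.0 * b * b * J)


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the inequality.
    n_train, b, J, K = 4000, 0.2, 5, 20
    eps_max = eps_threshold(b, J, K)
    for frac in (0.0, 0.5, 1.0):
        n_eps = effective_sample_size(n_train, b, J, K, frac * eps_max)
        print(frac, n_eps, n_eps >= corollary2_lower_bound(n_train, b, J))
```

When $K - J = 1$ and $\varepsilon$ sits exactly at the threshold, the two sides coincide, which is why the proof needs $K - J \ge 1$.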

E GENERALIZATION PERFORMANCE BY TRAINING LOCALLY AND TRAINING WITH EQUALLY WEIGHTED NODES

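In this section, we first consider a model locally trained only with data $D^{\text{valid}}$ in node 0, namely, $\hat\theta_{\text{valid}} \in \arg\min_{\theta\in\Theta}\hat L_0(\theta)$, where $\hat L_0$ is given in (3). The generalization bound of $\hat\theta_{\text{valid}}$ is well known, so we omit the proof and state the result directly.

Proposition 1. Suppose Assumptions 1 and 4 hold. There exists a universal constant $C_0 > 0$ such that, with a probability of at least $1-\delta$,
$$L_0(\hat\theta_{\text{valid}}) - L_0(\theta^*) \le 2C_0\sqrt{\frac{\nu_{\mathcal H} + \log(1/\delta)}{n_{\text{valid}}}}.$$

For the purpose of theoretical comparison, we also consider a model trained only with data $D^{\text{train}}$ distributed over equally weighted nodes, namely,
$$\hat\theta_{\text{equal}} \in \arg\min_{\theta\in\Theta}\frac{1}{K}\sum_{k=1}^K \hat L_k(\theta), \tag{42}$$
where $\hat L_k$ is defined as in (2). It is easy to construct an example where each $p_k$ with $k\in\mathcal K\setminus\mathcal J$ is significantly different from $p_0$ so that $L_0(\hat\theta_{\text{equal}})$ does not converge to $L_0(\theta^*)$ as $n_{\text{train}}$ goes to infinity. Motivated by Corollary 1 and the discussion afterwards, it is interesting to show the generalization bound of $\hat\theta_{\text{equal}}$ when each $p_k$ with $k\in\mathcal K\setminus\mathcal J$ is similar to $p_0$ with a small $G$ defined in (9).

Proposition 2. Suppose Assumptions 1, 2′ and 4 hold. There exists a universal constant $C_a > 0$ such that, with a probability of at least $1-3\delta$,
$$L_0(\hat\theta_{\text{equal}}) - L_0(\theta^*) \le 2C_a\sqrt{\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{Kn_{\text{train}}}} + \frac{2\sqrt{K-J}}{K}G,$$
where $G$ is defined in (9).

Proof. We first define
$$\theta_{\text{equal}} \in \arg\min_{\theta\in\Theta}\frac{1}{K}\sum_{k=1}^K L_k(\theta). \tag{43}$$
We then decompose the optimality gap of the generalization loss as follows:
$$\begin{aligned}
L_0(\hat\theta_{\text{equal}}) - L_0(\theta^*) ={}& \underbrace{L_0(\hat\theta_{\text{equal}}) - \frac{1}{K}\sum_{k=1}^K L_k(\hat\theta_{\text{equal}})}_{T_1} + \underbrace{\frac{1}{K}\sum_{k=1}^K L_k(\hat\theta_{\text{equal}}) - \frac{1}{K}\sum_{k=1}^K \hat L_k(\hat\theta_{\text{equal}})}_{T_2} \\
&+ \underbrace{\frac{1}{K}\sum_{k=1}^K \hat L_k(\hat\theta_{\text{equal}}) - \frac{1}{K}\sum_{k=1}^K \hat L_k(\theta_{\text{equal}})}_{T_3} + \underbrace{\frac{1}{K}\sum_{k=1}^K \hat L_k(\theta_{\text{equal}}) - \frac{1}{K}\sum_{k=1}^K L_k(\theta_{\text{equal}})}_{T_4} \\
&+ \underbrace{\frac{1}{K}\sum_{k=1}^K L_k(\theta_{\text{equal}}) - \frac{1}{K}\sum_{k=1}^K L_k(\theta^*)}_{T_5} + \underbrace{\frac{1}{K}\sum_{k=1}^K L_k(\theta^*) - L_0(\theta^*)}_{T_6}.
\end{aligned} \tag{44}$$
It is clear that $T_3 \le 0$ and $T_5 \le 0$ by the optimality of $\hat\theta_{\text{equal}}$ and $\theta_{\text{equal}}$ in (42) and (43), respectively.
Moreover, we have
$$T_1 = \frac{1}{K}\sum_{k=1}^K\big(L_0(\hat\theta_{\text{equal}}) - L_k(\hat\theta_{\text{equal}})\big) = \sum_{k\in\mathcal K\setminus\mathcal J}\frac{1}{K}\big(L_0(\hat\theta_{\text{equal}}) - L_k(\hat\theta_{\text{equal}})\big) \le \sqrt{\frac{K-J}{K^2}}\;\max_{\theta\in\Theta}\sqrt{\sum_{k\in\mathcal K\setminus\mathcal J}\big(L_0(\theta) - L_k(\theta)\big)^2}.$$
Using a similar argument, we can also show
$$T_6 \le \sqrt{\frac{K-J}{K^2}}\;\max_{\theta\in\Theta}\sqrt{\sum_{k\in\mathcal K\setminus\mathcal J}\big(L_0(\theta) - L_k(\theta)\big)^2}.$$
Since Assumption 2′ implies Assumption 2, by the first statement of Lemma 3 with $b = \frac{1}{K}$, we have, with a probability of at least $1-\delta$, that
$$T_2,\ T_4 \le C_a\sqrt{\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{Kn_{\text{train}}}}.$$
Applying the upper bounds of the six terms to (44) and taking a union bound, we have
$$L_0(\hat\theta_{\text{equal}}) - L_0(\theta^*) \le 2C_a\sqrt{\frac{\nu_{\mathcal H} + K + \log(1/\delta)}{Kn_{\text{train}}}} + \frac{2\sqrt{K-J}}{K}G$$
with a probability of at least $1-3\delta$, which completes the proof.

Note that the bound in Proposition 2 is strictly worse than the one we showed in Corollary 1 for any value of $G$. In fact, the former is $O(1/\sqrt{n_{\text{train}}} + G)$ and the latter is $O(1/\sqrt{n_{\text{train}}}) + o(G)$.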

F COMMUNICATION COMPLEXITY OF ALGORITHM 2 AND EXTENSION TO NON-CONVEX CASE

In this section, we present the communication complexity of Algorithm 2 for convex problems as well as the corresponding algorithm and complexity for non-convex problems. To do so, we first present the convergence property of Algorithm 1, which was originally established by Gorbunov et al. (2021). Then, we combine the analyses of Gorbunov et al. (2021) and Ghadimi & Wang (2018) with some minor but necessary modifications, for example, to allow for a generic weight $w$ instead of the uniform weight in Gorbunov et al. (2021), and to handle the approximation error between the approximate gradient $\widetilde\nabla\hat F(w)$ and the exact gradient $\nabla\hat F(w)$, which differs slightly from the error considered in Ghadimi & Wang (2018).

F.1 CONVERGENCE PROPERTY OF ALGORITHM 1

As mentioned in Section 4, we need to solve subproblems (1) and (15) in Algorithm 2, both of which are instances of (16). Because of Assumption 1, problem (16) in these two cases satisfies the following assumption with $\ell_1$ and $\mu$ exactly the same as the $\ell_1$ and $\mu$ in Assumption 1.

Assumption 5. $f_{k,i}(x)$ is convex, $\nabla f_{k,i}(x)$ is $\ell_1$-Lipschitz continuous and $f_k(x)$ is $\mu$-strongly convex for $i = 1, \ldots, n_k$ and $k = 1, 2, \ldots, K$.

A unified analysis is provided in Gorbunov et al. (2021) for a large class of FL methods including Local-SVRG given in Algorithm 1. The following proposition is obtained by applying Theorem 2.1 in Gorbunov et al. (2021) to Local-SVRG under our setting after minor modifications. It characterizes the convergence property of Algorithm 1. We omit its proof because it is almost the same as the proof of Theorem G.7 in Gorbunov et al. (2021).

Proposition 3. Suppose Assumption 5 holds for (16) and $\gamma \le \gamma_0$ with $\gamma_0$ defined in (17). Algorithm 1 guarantees
$$\mathbb{E}\big[f(x^{(T)}) - f(x^*)\big] \le \frac{1}{\gamma}(1-\gamma\mu)^{T+1}\left(4 + \frac{32\gamma^2\ell_1^2}{3q} + 30e\gamma^3\ell_1^3(\tau-1)\frac{2+q}{q}\right)\big\|x^{(0)} - x^*\big\|^2 + \frac{45e}{2}\ell_1\gamma^2(\tau-1)^2\sum_{k=1}^K w_k\big\|\nabla f_k(x^*)\big\|^2, \tag{45}$$
where $x^*$ is the optimal solution of (16).

F.2 COMMUNICATION COMPLEXITY OF ALGORITHM 2

To analyze the complexity of Algorithm 2, we need to bound the error of the approximate gradient of $\hat F$, namely, the quantity $\big\|\widetilde\nabla\hat F(w^{(s)}_{md}) - \nabla\hat F(w^{(s)}_{md})\big\|$, $s = 0, 1, \ldots$, and then the convergence analysis in Ghadimi & Wang (2018) can be directly applied. This error depends on the suboptimality of $\theta^{(s)}$ and $h^{(s)}$ in iteration $s$ of Algorithm 2, which can be characterized using Proposition 3. To do so, we first bound $\|\nabla f_k(x^*)\|$ and $\|x^{(0)} - x^*\|$ appearing in Proposition 3 for these two instances. For simplicity, we assume Local-SVRG is initialized at $x^{(0)} = 0$ when it is applied to any instance of (16).

Suppose $f_{k,i}(\theta) = l(\theta; z^{(i)}_k)$, $i = 1, \ldots, n_k$, $k = 1, \ldots, K$, and $x^*$ is the optimal solution of (16), namely, $x^* = \hat\theta(w)$. Because of Assumption 1, we have that $\|\nabla f_k(x^*)\| \le \ell_0$ for any $k$ in any iteration $s$ of Algorithm 2 and $x^* = \hat\theta(w)$ is a continuous function of $w$ on $\Delta_K^b$, which means $\|x^{(0)} - x^*\|^2 = \|x^*\|^2 \le \max_{w\in\Delta_K^b}\|\hat\theta(w)\|^2$ in any iteration $s$ of Algorithm 2.

Suppose $f_{k,i}(h) = \frac12 h^\top\nabla^2 l(\theta^{(s)}; z^{(i)}_k)h - h^\top\nabla\hat L_0(\theta^{(s)})$, $i = 1, \ldots, n_k$, $k = 1, \ldots, K$. The optimal solution of (16) in this case is
$$x^* = \left[\sum_{k=1}^K w^{(s)}_k\nabla^2\hat L_k(\theta^{(s)})\right]^{-1}\nabla\hat L_0(\theta^{(s)}),$$
which means $\|x^{(0)} - x^*\|^2 = \|x^*\|^2 \le \frac{1}{\mu^2}\|\nabla\hat L_0(\theta^{(s)})\|^2 \le \frac{\ell_0^2}{\mu^2}$ because of Assumption 1. Moreover,
$$\nabla f_k(x^*) = \nabla^2\hat L_k(\theta^{(s)})\left[\sum_{k=1}^K w^{(s)}_k\nabla^2\hat L_k(\theta^{(s)})\right]^{-1}\nabla\hat L_0(\theta^{(s)}) - \nabla\hat L_0(\theta^{(s)}),$$
so $\|\nabla f_k(x^*)\| \le \frac{\ell_1\ell_0}{\mu} + \ell_0$ for any $k$ and $s$ by Assumption 1.

Since $f$ is $\mu$-strongly convex, we have $\frac{\mu}{2}\|x^{(T)} - x^*\|^2 \le f(x^{(T)}) - f(x^*)$. With this inequality and the observations above, we can derive from (45) that
$$\mathbb{E}\big\|x^{(T)} - x^*\big\|^2 \le \frac{2}{\gamma\mu}(1-\gamma\mu)^{T+1}\left(4 + \frac{32\gamma^2\ell_1^2}{3q} + 30e\gamma^3\ell_1^3(\tau-1)\frac{2+q}{q}\right)\cdot\max\left\{R^2, \frac{\ell_0^2}{\mu^2}\right\} + \frac{45e}{\mu}\ell_1\gamma^2(\tau-1)^2\Big(\frac{\ell_1}{\mu} + 1\Big)^2\ell_0^2 \tag{46}$$
with $R := \max_{w\in\Delta_K^b}\|\hat\theta(w)\|$ when Local-SVRG is applied to either (1) or (15) in any iteration of Algorithm 2. The following lemma bounds the error of the approximate gradient for $\hat F$. Lemma 4.
Suppose Assumption 5 holds and γ ≤ γ 0 with γ 0 defined in (17). We have E ∇ F (w (s) md ) -∇ F (w (s) md ) 2 ≤ C 1 γ 1 -γµ Ts+1 C 2 + C 3 (τ -1) + C 4 γ 2 (τ -1) 2 , ( ) where C 1 , C 2 and C 3 are constants that depend on ℓ 0 , ℓ 1 , ℓ 2 , µ, R, q and K but not τ , T s and γ. Consequently, • when τ = 1, γ = γ 0 and T s = 1 γ0µ ln C1C2(s+1) 4 γ0 OR • when τ > 1, γ = 1 Ms and T s = µ -1 M s ln M 3 s , where M s = max 1 γ 0 , (s + 1) 2 [C 1 (C 2 + C 3 (τ -1)) + C 4 (τ -1) 2 ] , we have E ∇ F (w (s) md ) -∇ F (w (s) md ) 2 ≤ 1 (s+1) 4 . Moreover, • when τ = 1, γ = γ 0 and T s = 1 γ0µ ln C1C2(s+1) 2 γ0 OR • when τ > 1, γ = 1 M ′ s and T s = µ -1 M ′ s ln M ′3 s , where M ′ s = max 1 γ 0 , (s + 1) [C 1 (C 2 + C 3 (τ -1)) + C 4 (τ -1) 2 ] , we have E ∇ F (w (s) md ) -∇ F (w (s) md ) 2 ≤ 1 (s+1) 2 . Proof. Let ∇ F (w (s) md ) = ( ∇ k F (w (s) md )) k=1,...,K , where ∇ k F (w (s) md ) = -∇ L k (θ (s) ) ⊤ K k=1 w (s) md,k ∇ 2 L k (θ (s) ) -1 ∇ L 0 (θ (s) ). Recall that ∇k F (w s) . (s) md ) = -∇ L k (θ (s) ) ⊤ h ( We then obtained from ( 46) that E ∇ F (w (s) md ) -∇ F (w (s) md ) 2 ≤Kℓ 2 0 E h (s) - K k=1 w (s) md,k ∇ 2 L k (θ (s) ) -1 ∇ L 0 (θ (s) ) 2 ≤ 2Kℓ 2 0 γµ 1 -γµ Ts+1 4 + 32γ 2 0 ℓ 2 1 3q + 30eγ 3 0 ℓ 3 1 (τ -1) 2 + q q • max R 2 , ℓ 2 0 µ 2 + 45eKℓ 2 0 µ ℓ 1 γ 2 (τ -1) 2 ( ℓ 1 µ + 1) 2 ℓ 2 0 ( ) According to Lemma 2.2 in Ghadimi & Wang (2018) and ( 46), we have E ∇ F (w (s) md ) -∇ F (w (s) md ) 2 ≤K 2ℓ 0 ℓ 1 µ + ℓ 2 ℓ 2 0 µ 2 2 E ∥θ (s) -θ(w (s) md )∥ 2 ≤K 2ℓ 0 ℓ 1 µ + ℓ 2 ℓ 2 0 µ 2 2 2 γµ 1 -γµ Ts+1 4 + 32γ 2 0 ℓ 2 1 3q + 30eγ 3 0 ℓ 3 1 (τ -1) 2 + q q • max R 2 , ℓ 2 0 µ 2 + K 2ℓ 0 ℓ 1 µ + ℓ 2 ℓ 2 0 µ 2 2 45e µ ℓ 1 γ 2 (τ -1) 2 ( ℓ 1 µ + 1) 2 ℓ 2 0 (50) Combining ( 49) and ( 50) by the triangle inequity leads to (47). When τ = 1, γ = γ 0 and T s = 1 γ0µ ln C1C2(s+1) 4 γ0µ , it is easy to show that E ∇ F (w (s) md ) - ∇ F (w (s) md ) 2 ≤ 1 (s+1) 4 . Suppose τ > 1, γ = 1 Ms and T s = µ -1 M s ln M 3 s . 
We have E ∇ F (w (s) md ) -∇ F (w (s) md ) 2 ≤ C 1 γ 1 -γµ Ts+1 (C 2 + C 3 (τ -1)) + C 4 γ 2 (τ -1) 2 ≤ exp - µT s M s M s C 1 (C 2 + C 3 (τ -1)) + C 4 (τ -1) 2 M 2 s ≤ exp -ln(M 3 s ) M s C 1 (C 2 + C 3 (τ -1)) + C 4 (τ -1) 2 M 2 s ≤ C 1 (C 2 + C 3 (τ -1)) M 2 s + C 4 (τ -1) 2 M 2 s ≤ 1 (s + 1) 4 . The conclusion with E ∇ F (w (s) md )-∇ F (w (s) md ) 2 ≤ 1 (s+1) 2 can be proved in the same way except that (s + 1) 4 must be changed to (s + 1) 2 , and thus we omit the proof. The complexity of Algorithm 2 when F (w) is convex can be showed using the proof in Ghadimi & Wang (2018) with their gradient approximation error replaced by the one in Lemma 4. Proof of Theorem 3. According to Lemma 2.2 in Ghadimi & Wang (2018) , F (w) is ℓ F -smooth with ℓ F defined in (12). Let E s := ∇ F (w (s) md ) -∇ F (w (s) md ) According to (2.51) in Ghadimi & Wang (2018) , we have F (w (s+1) ag ) ≤ s s + 2 F (w (s) ag ) + 2 s + 2 F ( w) + 16 η(s + 1)(s + 2) ∥ w -w (s) ∥ 2 -∥ w -w (s+1) ∥ 2 + 2 s + 2 ∥ w -w (s+1) ∥E s + η 2 E 2 s ≤ s s + 2 F (w (s) ag ) + 2 s + 2 F ( w) + 16 η(s + 1)(s + 2) ∥ w -w (s) ∥ 2 -∥ w -w (s+1) ∥ 2 + 4 s + 2 E s + η 2 E 2 s , where the second inequality is because ∥ w -w (s+1) ∥ ≤ 2 as both w and w (s+1) are on a simplex. Subtracting F ( w) from both sides of the inequality above and dividing both sides by 2 (s+1)(s+2) , we have (s + 1)(s + 2) 2 F (w (s+1) ag ) -F ( w) ≤ s(s + 1) 2 F (w (s) ag ) -F ( w) + 8 η ∥ w -w (s) ∥ 2 -∥ w -w (s+1) ∥ 2 + 2(s + 1)E s + η(s + 1)(s + 2) 4 E 2 s . Summing up this inequality for s = 0, 1, . . . , S -1 gives This means, as long as E E 2 s ≤ 1 (s+2) 4 , Algorithm 2 finds an ϵ-optimal solution of ( P) in Õ(ϵ -0.5 ) iterations. S(S + 1) 2 E F (w (S) ag ) -F ( w) ≤ 16 η + S-1 s=0 2(s + 1) E E 2 s + S-1 s=0 η(s + 1)(s + 2) 4 E E 2 s , which, when E E 2 s ≤ 1 (s+2) 4 , implies Let . By Lemma 4, we have E E 2 s ≤ 1 (s+2) 4 . 
In iteration $s$ of Algorithm 2, the number of rounds of communication needed in Local-SVRG is $O(T_s) = O(\ln(s))$, so the total number of rounds is $\tilde O(\epsilon^{-0.5})$. Here $A_1 = C_1C_2$, $A_2 = C_1C_3$ and $A_3 = C_1C_4$ with $C_1$, $C_2$, $C_3$ and $C_4$ defined in Lemma 4.

Suppose $\tau > 1$, $\gamma = \frac{1}{M_s}$ and $T_s = \mu^{-1}M_s\ln M_s^3$. By Lemma 4, we have $\mathbb{E}E_s^2 \le \frac{1}{(s+2)^4}$. In iteration $s$ of Algorithm 2, the number of rounds of communication needed in Local-SVRG is $O(T_s/\tau) = \tilde O(s^2)$, so the total number of rounds is $\tilde O(\epsilon^{-1.5})$.
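The second subproblem in this subsection is a strongly convex quadratic whose minimizer is an inverse-Hessian-vector product, and each component of the gradient in the proof of Lemma 4 is an inner product with that minimizer. The following is a minimal sketch of this computation on a toy instance, assuming exact Hessians and gradients are available as small dense lists; in Algorithm 2 the linear system is instead solved approximately by Local-SVRG. The helper names (`solve`, `hypergradient`) are hypothetical.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(A[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def hypergradient(node_grads, node_hessians, valid_grad, w):
    """Return (components, h), where h = [sum_k w_k H_k]^{-1} g_0 and
    component k equals -<grad_k, h>."""
    d = len(valid_grad)
    K = len(w)
    H = [[sum(w[k] * node_hessians[k][i][j] for k in range(K)) for j in range(d)]
         for i in range(d)]
    h = solve(H, valid_grad)
    comps = [-sum(g_i * h_i for g_i, h_i in zip(node_grads[k], h)) for k in range(K)]
    return comps, h


if __name__ == "__main__":
    # Toy data: two nodes, two parameters (illustrative values only).
    hessians = [[[2.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [0.0, 4.0]]]
    grads = [[1.0, 0.0], [0.0, 1.0]]
    comps, h = hypergradient(grads, hessians, [3.0, 6.0], [0.5, 0.5])
    print(h, comps)
```

A node whose training gradient is aligned with the solution `h` receives a negative component, so a projected gradient step on the weights would increase that node's weight.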

F.3 ALGORITHM AND COMMUNICATION COMPLEXITY FOR NON-CONVEX F

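When $\hat F$ in $(\hat P)$ is non-convex, we no longer expect any algorithm to find an $\epsilon$-optimal solution and change our goal to finding an $\epsilon$-stationary point of $(\hat P)$, which is defined as a solution $w\in\Delta_K^b$ satisfying
$$\eta^{-1}\big\|w - \mathrm{Proj}_{\Delta_K^b}\big(w - \eta\nabla\hat F(w)\big)\big\| \le \epsilon$$
for some $\eta > 0$. There exist multiple numerical techniques for finding an $\epsilon$-stationary point, among which the proximal gradient method is the simplest one. When the gradient can only be computed inexactly, there exist studies on the iteration complexity of the proximal gradient method for finding an $\epsilon$-stationary point, including Ghadimi & Wang (2018) for bilevel optimization and Gu et al. (2018) for a general problem. We will simply apply the proximal gradient method to $(\hat P)$ using the approximate gradient $\widetilde\nabla\hat F(w)$ in (14). We formally present this approach in Algorithm 3. Again, the center is node 0, i.e., the node where $D^{\text{valid}}$ is stored.

Taking $w = w^{(s)}$ in (53) gives
$$\big\langle\widetilde\nabla\hat F(w^{(s)}),\, w^{(s+1)} - w^{(s)}\big\rangle + \frac{1}{\eta}\big\|w^{(s+1)} - w^{(s)}\big\|^2 \le 0. \tag{54}$$
Since $\nabla\hat F$ is $\ell_F$-Lipschitz continuous and $\eta \le \frac{1}{\ell_F}$, we have
$$\hat F(w^{(s+1)}) - \hat F(w^{(s)}) \le \big\langle\nabla\hat F(w^{(s)}),\, w^{(s+1)} - w^{(s)}\big\rangle + \frac{1}{2\eta}\big\|w^{(s+1)} - w^{(s)}\big\|^2. \tag{55}$$
Adding (54) and (55) gives us
$$\hat F(w^{(s+1)}) - \hat F(w^{(s)}) + \frac{1}{2\eta}\big\|w^{(s+1)} - w^{(s)}\big\|^2 \le \big\langle\nabla\hat F(w^{(s)}) - \widetilde\nabla\hat F(w^{(s)}),\, w^{(s+1)} - w^{(s)}\big\rangle \le E_s\big\|w^{(s+1)} - w^{(s)}\big\| \le 2E_s,$$
which, together with (52), implies
$$\frac{1}{4\eta}\big\|w^{(s)} - \mathrm{Proj}_{\Delta_K^b}\big(w^{(s)} - \eta\nabla\hat F(w^{(s)})\big)\big\|^2 \le \hat F(w^{(s)}) - \hat F(w^{(s+1)}) + 2E_s + \frac{\eta E_s^2}{2}.$$
Summing this inequality over $s = 0, 1, \ldots, S-1$ and taking expectation give us
$$\eta^{-2}\,\mathbb{E}\big\|w^{(s)} - \mathrm{Proj}_{\Delta_K^b}\big(w^{(s)} - \eta\nabla\hat F(w^{(s)})\big)\big\|^2 \le \frac{4}{\eta S}\big(\hat F(w^{(0)}) - \hat F(\hat w)\big) + \frac{8}{\eta S}\sum_{s=0}^{S-1}\mathbb{E}E_s + \frac{2}{S}\sum_{s=0}^{S-1}\mathbb{E}E_s^2.$$
When $\mathbb{E}E_s^2 \le \frac{1}{(s+1)^2}$, the inequality above implies
$$\eta^{-2}\,\mathbb{E}\big\|w^{(s)} - \mathrm{Proj}_{\Delta_K^b}\big(w^{(s)} - \eta\nabla\hat F(w^{(s)})\big)\big\|^2 \le \frac{4}{\eta S}\big(\hat F(w^{(0)}) - \hat F(\hat w)\big) + \frac{8\log(S)}{\eta S} + \frac{\pi^2}{3S}.$$
This means, as long as $\mathbb{E}E_s^2 \le \frac{1}{(s+1)^2}$, Algorithm 3 finds an $\epsilon$-stationary solution of $(\hat P)$ in $\tilde O(\epsilon^{-2})$ iterations. Note that Lemma 4 still holds with $w^{(s)}_{md}$ replaced by $w^{(s)}$ in Algorithm 3.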
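The proximal gradient iteration with an inexact gradient can be sketched on a toy objective. The sketch below is not Algorithm 3 itself: it assumes that $\Delta_K^b$ is the capped simplex $\{w : \sum_k w_k = 1,\ 0 \le w_k \le b\}$ (an assumption about the definition given earlier in the paper), replaces the bilevel objective with a hypothetical quadratic $\frac12\|w - c\|^2$, and injects a deterministic gradient error of norm at most $1/(s+1)$ to mimic the condition $\mathbb{E}E_s^2 \le 1/(s+1)^2$.

```python
import math


def project_capped_simplex(v, b, iters=100):
    """Euclidean projection onto {w : sum(w) = 1, 0 <= w_k <= b} (needs len(v) * b >= 1).
    The projection is clip(v_k - t, 0, b) for a shift t found by bisection."""
    lo, hi = min(v) - b, max(v)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(min(max(x - mid, 0.0), b) for x in v)
        if total > 1.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return [min(max(x - t, 0.0), b) for x in v]


def stationarity(w, grad, eta, b):
    """eta^{-1} * || w - Proj(w - eta * grad) ||, the measure used to define stationarity."""
    p = project_capped_simplex([wi - eta * gi for wi, gi in zip(w, grad)], b)
    return math.sqrt(sum((wi - pi) ** 2 for wi, pi in zip(w, p))) / eta


def inexact_proximal_gradient(c, b, eta, num_iters):
    """Minimize 0.5 * ||w - c||^2 over the capped simplex with a perturbed gradient."""
    K = len(c)
    w = [1.0 / K] * K
    for s in range(num_iters):
        exact = [wi - ci for wi, ci in zip(w, c)]
        err = [0.5 / (s + 1), -0.5 / (s + 1)] + [0.0] * (K - 2)  # norm <= 1 / (s + 1)
        approx = [g + e for g, e in zip(exact, err)]
        w = project_capped_simplex([wi - eta * gi for wi, gi in zip(w, approx)], b)
    return w


if __name__ == "__main__":
    c = [0.9, 0.3, -0.2, 0.1]  # hypothetical target vector
    w = inexact_proximal_gradient(c, b=0.5, eta=0.5, num_iters=50)
    exact_grad = [wi - ci for wi, ci in zip(w, c)]
    print(w, stationarity(w, exact_grad, eta=0.5, b=0.5))
```

The stationarity measure is evaluated with the exact gradient, as in the definition above, while the iterates only ever see the perturbed one.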
Let A 1 = C 1 C 2 , A 2 = C 1 C 3 and A 3 = C 1 C 4 with C 1 , C 2 , C 3 and C 4 defined in Lemma 4. Suppose τ = 1, γ = γ 0 and T s = 1 γ0µ ln A1(s+1) 2 γ0 . By Lemma 4, we have E E 2 s ≤ 1 (s+2) 2 . In iteration s of Algorithm 3, the total number of rounds of communication needed in Local-SVRG is O(T s ) = O(ln(s)) so that means the total number of rounds is Õ(ϵ -2 ).

Suppose $\tau > 1$, $\gamma = \frac{1}{M'_s}$ and $T_s = \mu^{-1}M'_s\ln M'^3_s$. By Lemma 4, we have $\mathbb{E}E_s^2 \le \frac{1}{(s+2)^2}$. In iteration $s$ of Algorithm 3, the number of rounds of communication needed in Local-SVRG is $O(T_s/\tau) = \tilde O(s)$, so the total number of rounds is $\tilde O(\epsilon^{-4})$.

Remark 1. The federated bilevel optimization methods by Li et al. (2022) and Tarzanagh et al. (2022) can find an $\epsilon$-stationary point within $\tilde O(\epsilon^{-3})$ and $\tilde O(\epsilon^{-4})$ rounds of communication, respectively. In the first setting of Theorem 4 ($\tau = 1$), Algorithm 3 has a complexity of $\tilde O(\epsilon^{-2})$, which is better than Li et al. (2022) and Tarzanagh et al. (2022). We want to point out that the lower complexity of Algorithm 3 arises because it utilizes the finite-sum structure in $(\hat P)$, which allows computing a deterministic gradient infrequently to accelerate the convergence. In contrast, Li et al. (2022) and Tarzanagh et al. (2022) both consider objective functions given in expectation, which does not allow computing a deterministic gradient in general. One can check the following Table 2 for a comparison.
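The round counts for $\tau > 1$ can be illustrated by tabulating the schedules. The sketch below transcribes $M_s$ and $M'_s$ as printed in Lemma 4 (with a `power` argument of 2 for Algorithm 2 and 1 for Algorithm 3) and sums $\lceil T_s/\tau\rceil$ over the outer iterations. All constants ($\gamma_0$, $\mu$, $C_1, \ldots, C_4$, $\tau$) are placeholder values chosen for illustration, not values from the paper.

```python
import math


def schedule_m(s, power, gamma0, c1, c2, c3, c4, tau):
    """M_s (power=2) or M'_s (power=1): max{1/gamma0, (s+1)^power * bracket}."""
    bracket = c1 * (c2 + c3 * (tau - 1)) + c4 * (tau - 1) ** 2
    return max(1.0 / gamma0, (s + 1) ** power * bracket)


def rounds_in_iteration(s, power, mu, tau, **consts):
    """Communication rounds in outer iteration s: ceil(T_s / tau) with T_s = M ln(M^3) / mu."""
    m = schedule_m(s, power, tau=tau, **consts)
    t_s = m * math.log(m ** 3) / mu
    return math.ceil(t_s / tau)


def total_rounds(num_outer, power, mu, tau, **consts):
    """Total rounds over outer iterations s = 0, ..., num_outer - 1."""
    return sum(rounds_in_iteration(s, power, mu, tau, **consts) for s in range(num_outer))


if __name__ == "__main__":
    consts = dict(gamma0=0.1, c1=1.0, c2=1.0, c3=1.0, c4=1.0)  # placeholders
    for power in (2, 1):
        r100 = total_rounds(100, power, mu=1.0, tau=5, **consts)
        r200 = total_rounds(200, power, mu=1.0, tau=5, **consts)
        print(power, r100, r200, r200 / r100)
```

Doubling the number of outer iterations multiplies the total by roughly $2^3$ for `power=2` and $2^2$ for `power=1`, up to logarithmic factors. Combined with $S = \tilde O(\epsilon^{-0.5})$ and $S = \tilde O(\epsilon^{-2})$ outer iterations, this matches the $\tilde O(\epsilon^{-1.5})$ and $\tilde O(\epsilon^{-4})$ totals stated for the two algorithms.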

G ADDITIONAL MATERIALS FOR NUMERICAL EXPERIMENTS

In this section, we present additional details and results of our numerical experiments in Section 5. The CNN we train in the experiments consists of two layers of 2D convolution, each equipped with 2D batch normalization and ReLU activation, and followed by a fully connected layer to generate

C2: Classes 0 and 3 which include short-sleeve upper-body clothes;
C3: Classes 1 and 8 which include pants and bags;
C4: Classes 5, 7 and 9 which include only shoes.

Note that these four merged classes are only used for generating data. In the classification task, we still have ten classes. This is the same for the other three datasets. We set n train = 4000, n valid = 500 and n test = 5000. Each image is sampled from one of the four merged classes with a probability distribution (P 1 , P 2 , P 3 , P 4 ). Once a merged class is chosen, each image in that merged class has an equal chance to be sampled. For D train k with k ∈ J m , we sample data from the training set with P 1 = 0.42, P 2 = 0.08, P 3 = 0.38 and P 4 = 0.12. For D train k with k ∈ J M , we sample data with P 1 = 0.12, P 2 = 0.38, P 3 = 0.08 and P 4 = 0.42. Depending on whether p 0 is the distribution of the nodes in J m or in J M , D valid and D test are sampled from the corresponding distribution. Note that D valid and D test are sampled from the training set and the testing set of the original data, respectively, although they have the same probability distribution over the four merged classes. Since D valid and D test are generated in a similar way for the other three datasets, we only discuss the generation of the D train k 's in the subsequent sections.
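The two-stage sampling described above (first a merged class drawn from (P 1 , P 2 , P 3 , P 4 ), then an image drawn uniformly within it) can be sketched as follows. This is an illustration under assumptions: the grouping of labels uses the merged classes listed for MNIST in Section G.1.2, which applies the same procedure; sampling is done with replacement; and `sample_node_data` and the synthetic label list are hypothetical.

```python
import random

# Merged classes as listed for MNIST in Section G.1.2 (assumed here for illustration).
MERGED = {"C1": (2, 4, 6), "C2": (0, 3), "C3": (1, 8), "C4": (5, 7, 9)}


def sample_node_data(labels, probs, n, seed=0):
    """Return n dataset indices: pick a merged class with probability probs[name],
    then pick an image uniformly (with replacement) inside that merged class."""
    rng = random.Random(seed)
    pools = {name: [i for i, y in enumerate(labels) if y in classes]
             for name, classes in MERGED.items()}
    names = list(MERGED)
    weights = [probs[name] for name in names]
    picked = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        picked.append(rng.choice(pools[name]))
    return picked


if __name__ == "__main__":
    labels = [i % 10 for i in range(1000)]  # synthetic stand-in for real labels
    minority = {"C1": 0.42, "C2": 0.08, "C3": 0.38, "C4": 0.12}
    idx = sample_node_data(labels, minority, n=4000)
    share_c1 = sum(labels[i] in MERGED["C1"] for i in idx) / len(idx)
    print(len(idx), round(share_c1, 3))
```

Swapping in the second probability vector (0.12, 0.38, 0.08, 0.42) gives the sampling for nodes in J M .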

G.1.2 MNIST

MNIST (Deng, 2012) contains a training set of 60,000 images and a testing set of 10,000 images. Each image is a handwritten digit in grayscale and has a size of 28 × 28. Since MNIST has the same number of classes, the same class distribution, and the same data size as Fashion-MNIST, we directly apply the procedure in Section G.1.1 to sample data. In particular, we merge the ten digits into four classes as follows:

C1: Digits 2, 4 and 6;
C2: Digits 0 and 3;
C3: Digits 1 and 8;
C4: Digits 5, 7 and 9.

We set n train = 4000, n valid = 500 and n test = 5000. Following Section G.1.1, for D train k with k ∈ J m , we sample data from the four merged classes with P 1 = 0.42, P 2 = 0.08, P 3 = 0.38 and P 4 = 0.12. For D train k with k ∈ J M , we sample data with P 1 = 0.12, P 2 = 0.38, P 3 = 0.08 and P 4 = 0.42.

Under review as a conference paper at ICLR 2023

G.1.3 CIFAR-10

CIFAR-10 (Krizhevsky et al., 2009) contains a training set of 50,000 images and a testing set of 10,000 images. Each image is in color, has a size of 32 × 32, and is associated with a label from ten classes: 0: airplane, 1: automobile, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship and 9: truck. We merge the ten classes into four classes as follows:

C1: Classes 1 and 9 which are related to ground transportation;
C2: Classes 0 and 8 which are related to non-ground transportation;
C3: Classes 2, 3 and 4 which form a set of animals;
C4: Classes 5, 6 and 7 which form another set of animals.

We set n train = 4000, n valid = 500 and n test = 5000. Similar to the procedure with Fashion-MNIST, for D train k with k ∈ J m , we sample data from the four merged classes with P 1 = 0.36, P 2 = 0.04, P 3 = 0.54 and P 4 = 0.06. For D train k with k ∈ J M , we sample data with P 1 = 0.04, P 2 = 0.36, P 3 = 0.06 and P 4 = 0.54.
G.1.4 DOWNSAMPLED IMAGENET

Downsampled ImageNet (Chrabaszcz et al., 2017) is created by downsampling each image in ImageNet (Deng et al., 2009) to 32 × 32 pixels without changing the class labels. Just like ImageNet, downsampled ImageNet has 1000 classes, and we choose ten classes and merge them into four classes as follows. (The class labels listed below are consistent with ImageNet.) C1: Classes 7, 9, 10, 29, 54, 75, 84 and 189, which are cats or animals similar to cats; C2: Classes 61, 66, 68, 101, 114, 124, 131 and 148, which are dogs or animals similar to dogs; C3: Classes 383, 397, 403, 404, 405, 406, 412, 414, 420, 426, 433 and 434, which are all birds; C4: Classes 224, 441, 442, 443, 444, 445, 449, 453, 454, 498, 499 and 500, which are either fishes or frogs. We set n train = 4000, n valid = 1500 and n test = 1000. For D train k with k ∈ J m , we sample data from the four merged classes with P 1 = 0.36, P 2 = 0.04, P 3 = 0.54 and P 4 = 0.06. For D train k with k ∈ J M , we sample data with P 1 = 0.04, P 2 = 0.36, P 3 = 0.06 and P 4 = 0.54.

G.1.5 COVTYPE

Covtype (Blackard & Dean, 1999) is an imbalanced dataset of 581,012 instances. Each instance records ten integer-valued features and two categorical features one-hot encoded into 44 binary values, and is associated with a label from seven classes: 1: Spruce/Fir, 2: Lodgepole Pine, 3: Ponderosa Pine, 4: Cottonwood/Willow, 5: Aspen, 6: Douglas-fir, 7: Krummholz. According to the imbalance in the class labels, we merge the seven classes into four classes as follows: C1: Class 1, which accounts for approximately 36.5% of the raw data; C2: Class 2, which accounts for approximately 48.8% of the raw data; C3: Class 3, which accounts for approximately 6.2% of the raw data; C4: Classes 4, 5, 6 and 7, which account for the rest.

500 and 1000 for Fashion-MNIST, MNIST and CIFAR-10, and 1000, 1200, 1500 and 2000 for the downsampled 32 × 32 ImageNet. Similarly, we save the model generated by our Bi-level method at the iteration where the highest accuracy is achieved on the respective validation data. Then, we report the test (top-1) accuracy each method obtains during iterations for the minority group in Table 4 and the majority group in Table 5.
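The selection rule used above (keep the model from the iteration with the highest validation accuracy and report its test accuracy) can be illustrated with a short sketch. The function name `select_by_validation`, the tie-breaking rule (earliest iteration wins) and the accuracy curves are hypothetical and are not taken from the experiments.

```python
def select_by_validation(val_acc, test_acc):
    """Return (iteration, test accuracy) at the iteration whose validation
    accuracy is highest; ties go to the earliest iteration (an assumption)."""
    if not val_acc or len(val_acc) != len(test_acc):
        raise ValueError("need two non-empty curves of equal length")
    # max() returns the first maximal element, so ties pick the earliest one.
    best = max(range(len(val_acc)), key=lambda t: val_acc[t])
    return best, test_acc[best]

# Made-up curves: the test accuracy is read off where validation peaks,
# even though a later iteration happens to have a higher test accuracy.
val_curve = [0.61, 0.70, 0.74, 0.73]
test_curve = [0.60, 0.69, 0.72, 0.75]
best_iter, reported = select_by_validation(val_curve, test_curve)
```

In this toy case the reported number is 0.72 (iteration 2), not the best test accuracy 0.75, which is the point of selecting on validation data only.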




k=1 w k E(θ -z k ) 2 , We define a universal constant as a constant that does not depend on any parameter of the problem except C H and C r . This definition is made only to simplify the constant factors in our bounds. A (ρ, C ρ )-transferable assumption is needed in Chen et al. (2021a), which also holds in our case with ρ = 2 because of the Lipschitz continuity and strong convexity assumed in Assumption 1. As a by-product of our analysis, we show in (38) that L 0 (θ( w)) -L 0 (θ * ) also satisfies (7).



assume $|D^{\mathrm{train}}_k| = n_k$ and $D^{\mathrm{train}}_k = \{z^{(i)}_k\}_{i=1}^{n_k}$, where $z^{(i)}_k \in \mathcal{Z}$ is an i.i.d. sample from an unknown distribution $p_k$ for $k \in \mathcal{K}$.

(w) := L 0 ( θ(w)) s.t. θ(w) is defined as in (1) .

) to L k or L k in each node and periodically aggregates the solutions from all nodes by averaging. Many variants of FedAvg and other federated learning methods have been proposed to reduce the computation and communication complexity. A partial list includes Gorbunov et al. (2021); Lee et al. (2017); Karimireddy et al. (2020); Liang et al. (2019); Li et al. (2020); Yuan & Ma (2020); Wu & Wang (2021); Zhao

), we simplify (11) by only showing the bounds in terms of n valid , n train and G for a clear comparison with local training. Corollary 1. Suppose the assumptions of Theorem 2 hold and n valid and n train are large enough such that ε(n valid , n train ) defined in (10) satisfies ε(n valid , n train ) ≤ b √ J K-J . With a probability of at least 1 -3δ, we have L 0 ( θ( w)) -L 0 (θ * ) ≤ O 1the bound in Corollary 1 becomes o(1/n 1 2

Figure 1: Comparison in test accuracy for the minority group vs number of synchronizations.

Figure 2: How w evolves during the Bi-level method under Setting 1.

Next, we apply the standard symmetrization strategy by introducing a ghost dataset D train ghost :

cover for [0, b] corresponding to a coordinate in J and create a ntrain (K-J)(J+1)Nε ϵ 2 -cover for [0, ε] corresponding to a coordinate in K\J . (Recall that w j ≤ ε for j ∈ K\J .) Then we take the Cartesian product of these K one-dimensional covers and project it to W * ε . This provides a ntrain Nε ϵ 2 -cover for W * ε with a cardinality of b

), we define a probability measure Q = N b where δ z (i) k is a point mass at z (i) k . Then, we only need to construct an ϵ-cover for H × ∆ b K by taking the Cartesian product of an ϵ 2 -cover for H w.r.t. distance metric ρ Q and a ntrain N b ϵ 2 -cover for ∆ b K w.r.t. the Euclidean distance. According to Assumption 4, the former has a cardinality of (2C H /ϵ) ν H . To construct for [0, b], take its K-fold Cartesian product, and project it to ∆ b K . This provides a ntrain N b ϵ 2 -cover for ∆ b K with a cardinality of b

3 and C 4 defined in Lemma 4. Suppose τ = 1, γ = γ 0 and T s = 1 γ0µ ln A1(s+1) 4 γ0

Figure 3: Results for Covtype.

F-MNIST MNIST CIFAR-10 DS-ImageNet

To address this challenge, many personalized FL methods, including but not limited to Smith et al. (2017); Tan et al. (2022); Fallah et al. (2020); Li & Wang (2019); Deng et al. (2020); Li et al. (2021), have been developed, where a global model is tailored using local data to achieve good local performance. However, many personalized FL methods use a fixed weight in (1) to obtain the global model. Such a global model may be dominated by the majority of the data distributions in the network and is hard to personalize for a minority group with unique data patterns. In contrast, our method can produce a personalized weight so that a node from the minority group can still find and collaborate with its peers.

We denote the testing set by D test and let n test = |D test |. We repeat all experiments five times using different random seeds. The values of n valid , n train , n test and the details of data generation are presented in Sections G.1, G.2 and G.3.

D train k : Training set i.i.d. sampled from p k and stored in node k, k = 1, . . . , K
n k : Size of training set D train k
w : Vector of weights defined by (w 1 , . . . , w K

Notations used throughout the paper.

between mappings l(θ; •) and l(θ ′ ; •) with respect to the empirical distribution over D valid and is a pseudometric in H. Hence, by Dudley's entropy integral inequality (see Corollary 13.2 in Boucheron et al. (2013)), there exists a universal constant C d such that


Comparison in Remark 1.

predictions. The first convolution layer is set to output the same number of channels as the input, and uses kernels with a size of 4, a stride of 4 and a padding of one. The second convolution layer returns two output channels for Fashion-MNIST and MNIST and five for CIFAR-10 and ImageNet, and uses kernels with a size of 2, a stride of 2 and a padding of one. All experiments are conducted with PyTorch 1.9.0 and the CUDA 11.1 computing platform on a computer with an Intel Xeon Gold 6330 CPU @ 2.0GHz (Turbo up to 3.1GHz) and an NVIDIA GeForce RTX 2080 Ti GPU.

G.1 DATA GENERATION WITH DIFFERENT CLASS DISTRIBUTIONS (SETTING 1)

In this section, we describe in detail how we generate D valid , D train and D test from each original dataset for our experiments under Setting 1.

Test accuracy when reaching the highest validation accuracy.

7758±0.0059 0.8824±0.0198 0.5175±0.0065 0.2668±0.0079 0.8364±0.0121 0.8443±0.0096 0.5928±0.0068 0.2630±0.0091
Local-train 0.6926±0.0175 0.7850±0.0269 0.3036±0.0102 0.1108±0.0127 0.7427±0.0110 0.7550±0.0095 0.3568±0.0144 0.1150±0.0072
FedAvg 0.7507±0.0097 0.8297±0.0171 0.2965±0.0075 0.2008±0.0117 0.8327±0.0119 0.8484±0.0123 0.5705±0.0052 0.2612±0.0116
Ditto 0.7297±0.0146 0.8358±0.0201 0.4361±0.0166 0.1582±0.0037 0.8086±0.0141 0.8059±0.0053 0.5021±0.0204 0.1874±0.0126
pFedMe 0.7418±0.0235 0.8568±0.0204 0.4860±0.0067 0.2084±0.0068 0.8297±0.0070 0.8320±0.0162 0.5588±0.0110 0.2324±0.0087

7754±0.0082 0.8580±0.0283 0.5148±0.0051 0.2734±0.0052 0.8332±0.0063 0.8384±0.0196 0.5979±0.0044 0.2694±0.0104
Local-train 0.6763±0.0116 0.7874±0.0129 0.3159±0.0110 0.1260±0.0033 0.7343±0.0104 0.7650±0.0146 0.3594±0.0086 0.1276±0.0110
FedAvg 0.5686±0.0211 0.5809±0.0405 0.2870±0.0106 0.1676±0.0044 0.7824±0.0062 0.7746±0.0121 0.5431±0.0087 0.2542±0.0089

Test accuracies of the minority model obtained by the Bi-level method using validation sets of different sizes.

Test accuracies of the majority model obtained by the Bi-level method using validation sets of different sizes.

Figure 7: Comparison in test accuracy for the majority group vs number of synchronizations.

Comparison in test accuracy for the minority group vs number of points processed.


Algorithm 3: Federated Learning Method for Bilevel Optimization ( P) (Non-Convex Case)
Input: initial weight w (0) , learning rate η, training data D train k for k ∈ K, validation data D valid , the number of outer iterations S, parameters (γ, τ, q) for Local-SVRG, and the number of inner iterations T s for s = 0, . . . , S -1
for s = 0, 1, . . . , S -1 do
    Compute ∇ F (w (s) ) as follows:
    Set f k,i (θ) = l(θ; z (i) k ), i = 1, . . . , n k , k = 1, . . . , K
    Compute θ (s) = Local-SVRG({f k,i }, w (s) , θ (s-1) , γ, τ, q, T s ) and send it to each node.
    Compute ∇ L 0 (θ (s) ) at the center and send it to each node.
    ), γ, τ, q, T s ) and send it to each node.
    Each node computes ∇ L k (θ (s) ) in parallel and sends it to the center.

The convergence result of Algorithm 3 can be proved in a standard way (e.g., see Theorem 2.1 in Ghadimi & Wang (2018)). We present it below only for the sake of completeness.

Theorem 4. Suppose Assumption 1 holds. Let R := max w∈∆ b K ∥ θ(w)∥ and let ℓ F and γ 0 be defined as in (12) and (17). Suppose $\eta = \frac{1}{3\ell_F}$ in Algorithm 3. There exist constants A 1 , A 2 and A 3 that only depend on ℓ 0 , ℓ 1 , ℓ 2 , µ, R, q and K but not on τ such that the following statements hold.
• Suppose τ = 1, γ = γ 0 and $T_s = \frac{1}{\gamma_0 \mu} \ln \frac{A_1 (s+1)^2}{\gamma_0}$. Algorithm 3 finds an ϵ-stationary solution of ( P) with $\tilde{O}(\epsilon^{-2})$ rounds of communication.
• s , where Algorithm 3 finds an ϵ-stationary solution of ( P) with $\tilde{O}(\epsilon^{-4})$ rounds of communication.

Proof. , by the property of projection mapping, we have
By the definition of w (s+1) and the 1/η-strong convexity of function ∇ F (w (s) ), w

We set n train = 40000, n valid = 5000 and n test = 50000. Similar to the procedure with Fashion-MNIST, for D train k with k ∈ J m , we sample data from the four merged classes with P 1 = 0.72, P 2 = 0.14, P 3 = 0.12 and P 4 = 0.02.
For D train k with k ∈ J M , we sample data with P 1 = 0.10, P 2 = 0.82, P 3 = 0.06 and P 4 = 0.02.

We train a linear multi-class logistic regression model in this experiment using the same computing platform as before. We use a mini-batch of size 400 to construct the stochastic gradients in all methods. In the Bi-level method, Algorithm 3 is applied to ( P) with b = 1/3 and five epochs are performed within each call of Local-SVRG (i.e., T s = 5n train /400). We set q = 1/100. Other parameters are the same as our choice in Section 5. We only run Setting 1 with one random seed on this dataset. The results are shown in Figure 3.

We then describe how the class labels in D train k for k ∈ J M are permuted for each dataset. For Fashion-MNIST and MNIST, we permute the class labels in D train k for k ∈ J M by changing label 2 to 0, 0 to 1, 1 to 5, and 5 to 2. For CIFAR-10, we permute the class labels in D train k for k ∈ J M by changing label 1 to 0, 0 to 2, 2 to 5, and 5 to 1. For downsampled ImageNet, we permute the class labels in D train k for k ∈ J M by changing label 7 to 61, 61 to 383, 383 to 224, 224 to 9, 9 to 66, 66 to 397, 397 to 441, 441 to 10, 10 to 68, 68 to 403, 403 to 442, and 442 to 7.
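The label permutations above are cyclic relabelings, and they must be applied simultaneously rather than one rule after another (applying "2 to 0" and then "0 to 1" in sequence would send label 2 to 1). A minimal sketch, in which the helper name `permute_labels` and the example label list are illustrative assumptions:

```python
def permute_labels(labels, mapping):
    """Relabel every entry through `mapping` in a single pass;
    labels that do not appear in `mapping` are left unchanged."""
    return [mapping.get(y, y) for y in labels]

# Permutations stated in the text for nodes in the majority group J_M.
MNIST_PERM = {2: 0, 0: 1, 1: 5, 5: 2}    # Fashion-MNIST and MNIST
CIFAR10_PERM = {1: 0, 0: 2, 2: 5, 5: 1}  # CIFAR-10

example_labels = [0, 1, 2, 3, 5, 9]
permuted = permute_labels(example_labels, MNIST_PERM)
# permuted is [1, 5, 0, 3, 2, 9]
```

Because each mapping sends its key set onto itself, the relabeled data still uses the same label set; only the association between images and labels changes for the affected classes.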

G.3 DATA GENERATION WITH DIFFERENT CLASS DISTRIBUTIONS, LABEL PERMUTATION AND/OR RANDOM ROTATION (SETTINGS 3 AND 4)

In this section, we discuss in detail how we generate D valid , D train and D test from each original dataset for our experiments under Settings 3 and 4.

Under Setting 3, we first generate data in the same way as in Setting 1 described in Section G.1. Then, we randomly choose a rotation direction, clockwise or anti-clockwise, and rotate each image in D train k with k ∈ J M in that direction by 90 degrees. Under Setting 4, we first generate data in the same way as in Setting 2 described in Section G.2. Then, we apply the same rotation procedure as we do in Setting 3.
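The rotation step can be sketched on images stored as nested lists. The helper names `rotate90` and `rotate_dataset` are hypothetical, and drawing a single direction for the whole dataset is an assumption of this sketch, since the text does not say whether the direction is drawn once or per image.

```python
import random

def rotate90(image, clockwise):
    """Rotate a 2D image (a list of rows) by 90 degrees."""
    if clockwise:
        # Reverse the row order, then transpose.
        return [list(row) for row in zip(*image[::-1])]
    # Transpose, then reverse the row order.
    return [list(row) for row in zip(*image)][::-1]

def rotate_dataset(images, seed=0):
    """Draw one direction at random and rotate every image that way.
    Returns the rotated images and the direction that was drawn."""
    rng = random.Random(seed)
    clockwise = rng.random() < 0.5
    return [rotate90(img, clockwise) for img in images], clockwise

tile = [[1, 2],
        [3, 4]]
cw = rotate90(tile, clockwise=True)    # [[3, 1], [4, 2]]
ccw = rotate90(tile, clockwise=False)  # [[2, 4], [1, 3]]
```

For multi-channel images such as CIFAR-10, the same function would be applied to each channel separately.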

G.4 ADDITIONAL NUMERICAL RESULTS

In this section, we first plot how the weight w k for each node evolves during the Bi-level method under Settings 2, 3 and 4 in Figure 4, Figure 5 and Figure 6, respectively. We show the results when p 0 is the distribution of the majority and the minority groups separately.

Since the performance of an algorithm may fluctuate during training, we are interested in comparing the methods by the best performance they achieve during training. To do so, we save the model generated by each method at the iteration where the highest accuracy is achieved on the validation data. Then, we report the performance of the saved models of each method on the testing set in Table 3. Again, we show the results when p 0 is the distribution of the majority and the minority groups separately. Our method outperforms the baselines in most of the cases.

Next, we plot the test (top-1) accuracy each method obtains during iterations for the minority group and the majority group in Figure 8 and Figure 9, respectively, where the horizontal axis represents the cumulative number of data points each method processes in parallel.

Finally, we give the results of the ablation study. We choose the case of Setting 2 Seed 1, and consider four choices for the size of the validation dataset D valid , i.

