FEDERATED LEARNING ON ADAPTIVELY WEIGHTED NODES BY BILEVEL OPTIMIZATION

Anonymous

Abstract

We propose a federated learning method with weighted nodes in which the weights can be modified to optimize the model's performance on a separate validation set. The problem is formulated as a bilevel optimization problem where the inner problem is a federated learning problem with weighted nodes and the outer problem focuses on optimizing the weights based on the validation performance of the model returned from the inner problem. A communication-efficient federated optimization algorithm is designed to solve this bilevel optimization problem. We analyze the generalization performance of the output model and identify the scenarios when our method is in theory superior to training a model locally and superior to federated learning with static and evenly distributed weights.

1. INTRODUCTION

Federated learning (FL) is an emerging technique for training a model using data distributed over a network of nodes without sharing data between nodes (Konečnỳ et al., 2016; McMahan et al., 2017). In this paper, we focus on the case where data distributions across nodes are heterogeneous and each node aims to obtain a model with optimal local generalization performance. In the classical setting of FL, a globally shared model is learned by minimizing a weighted average loss across all nodes. However, given the heterogeneity of data distributions, a global model is likely to be sub-optimal for some nodes (Fallah et al., 2020). Alternatively, each node can train a model using only its local data, but such a local model may not generalize well either when the volume of local data is small. To achieve good local generalization performance, each node can still exploit global training data through FL but, at the same time, identify and collaborate only with the nodes whose data distributions are similar or identical to its local distribution. One way to implement this strategy is to allow each node to solve its own weighted average loss minimization problem, with weights chosen based on the performance on a separate set of local (validation) data. Ideally, each node can learn a better model by allocating more weight to the peers whose data distribution is similar to its local distribution. In this paper, we formulate the choice of the weights as a bilevel optimization (BO) problem (Colson et al., 2005; Vicente & Calamai, 1994), which can be solved by a federated bilevel optimization algorithm, and analyze the generalization performance of the resulting model.

We consider a standard learning problem where the goal is to learn a vector of model parameters $\theta$ from a set $\Theta$ that minimizes a generalization loss.
This problem can be formulated as
$$\theta^* \in \arg\min_{\theta \in \Theta} \big\{ L_0(\theta) := \mathbb{E}_{z \sim p_0}[\ell(\theta; z)] \big\}, \qquad (\mathrm{P})$$
where $\ell(\theta; z)$ is the loss of $\theta$ on a data point $z$ from a space $\mathcal{Z}$, and $\mathbb{E}_{z \sim p_0}$ represents the expectation taken over $z$ when $z$ follows an unknown ground-truth distribution $p_0$. Directly solving (P) is challenging as $p_0$ is unknown, and, typically, training data sampled from $p_0$ is needed for learning an approximation of $\theta^*$. In this paper, we consider the scenario where the amount of data sampled directly from $p_0$ may not be sufficient to learn a good approximation of $\theta^*$, but there exist external data distributed over $K$ nodes that can potentially help the learning of $\theta^*$. In particular, we denote the set of nodes by $\mathcal{K} := \{1, \dots, K\}$ and assume a training set $D_k^{\mathrm{train}}$ is stored on node $k$. We also define $D^{\mathrm{train}} := (D_k^{\mathrm{train}})_{k=1}^K$ and assume $|D_k^{\mathrm{train}}| = n_k$ and $D_k^{\mathrm{train}} = \{z_k^{(i)}\}_{i=1}^{n_k}$, where $z_k^{(i)} \in \mathcal{Z}$ is an i.i.d. sample from an unknown distribution $p_k$ for $k \in \mathcal{K}$. We assume node $k$ is weighted by $w_k$ and the vector of weights $w = (w_1, \dots, w_K) \in [0,1]^K$ lies on the capped simplex $\Delta_K^b$ defined as
$$\Delta_K^b := \Big\{ w = (w_1, \dots, w_K) \;\Big|\; \sum_{k=1}^K w_k = 1,\ 0 \le w_k \le b,\ k \in \mathcal{K} \Big\},$$
where $b \in [\tfrac{1}{K}, 1]$ is a user-defined parameter. FL on weighted nodes can be formulated as
$$\hat\theta(w) \in \arg\min_{\theta \in \Theta} \sum_{k=1}^K w_k L_k(\theta), \qquad (1)$$
where $L_k(\theta)$ is the empirical loss of $\theta$ on $D_k^{\mathrm{train}}$, namely,
$$L_k(\theta) := \frac{1}{n_k} \sum_{i=1}^{n_k} \ell(\theta; z_k^{(i)}), \quad k \in \mathcal{K}. \qquad (2)$$
When some $p_k$'s are different from $p_0$, $w$ in (1) must be chosen adaptively to ensure $\hat\theta(w)$ is a good approximation of $\theta^*$ in (P). To do so, we assume that there is a validation dataset $D^{\mathrm{valid}}$ with $|D^{\mathrm{valid}}| = n_0 = n_{\mathrm{valid}}$ and $D^{\mathrm{valid}} = \{z^{(i)}\}_{i=1}^{n_0}$, where $z^{(i)} \in \mathcal{Z}$ is an i.i.d. sample from $p_0$. We assume $D^{\mathrm{valid}}$ is stored on a node called node 0, or the center, which may or may not be a node in $\mathcal{K}$. The set $D^{\mathrm{valid}}$ alone may not be sufficient for learning $\theta^*$ precisely but can be used to assist the selection of $w$.
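As a concrete aside (not part of the paper's algorithm), the Euclidean projection onto the capped simplex $\Delta_K^b$, which any projected-gradient update of $w$ would need, can be computed by bisection on a scalar dual variable: the projection has the form $w_k = \mathrm{clip}(v_k - \tau, 0, b)$, and $\sum_k \mathrm{clip}(v_k - \tau, 0, b)$ is nonincreasing in $\tau$. A minimal sketch:

```python
import numpy as np

def project_capped_simplex(v, b, tol=1e-12):
    """Euclidean projection of v onto {w : sum(w) = 1, 0 <= w_k <= b}.

    The projection is w_k = clip(v_k - tau, 0, b) for the scalar tau
    solving sum_k clip(v_k - tau, 0, b) = 1; since that sum is
    nonincreasing in tau, tau is found by bisection.
    Feasibility requires b >= 1/K.
    """
    v = np.asarray(v, dtype=float)
    # Bracket tau: at lo every coordinate hits the cap (sum = K*b >= 1);
    # at hi every coordinate is clipped to 0 (sum = 0).
    lo, hi = v.min() - b, v.max()
    while hi - lo > tol:
        tau = 0.5 * (lo + hi)
        if np.clip(v - tau, 0.0, b).sum() > 1.0:
            lo = tau
        else:
            hi = tau
    return np.clip(v - 0.5 * (lo + hi), 0.0, b)

# Example: one coordinate hits the cap b, one is zeroed out.
w = project_capped_simplex(np.array([0.9, 0.4, -0.1]), b=0.6)
# w is approximately [0.6, 0.4, 0.0]
```

The same routine covers the uncapped simplex by setting $b = 1$, since the cap is then never active at more than one coordinate.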
We then propose to estimate the generalization loss of $\hat\theta(w)$ using the loss on $D^{\mathrm{valid}}$, i.e.,
$$\hat L_0(\theta) := \frac{1}{n_0} \sum_{i=1}^{n_0} \ell(\theta; z^{(i)}), \qquad (3)$$
and use this validation loss to guide the procedure for updating $w$. Presumably, when both the training and validation sets are large enough, the weights in $w$ will be shifted towards the nodes whose data is helpful for learning $\theta^*$. Following this idea, we formulate the federated learning problem on adaptively weighted nodes as the following bilevel optimization (BO) problem:
$$\hat w \in \arg\min_{w \in \Delta_K^b} \big\{ F(w) := \hat L_0(\hat\theta(w)) \big\} \quad \text{s.t.}\ \hat\theta(w) \text{ is defined as in (1)}. \qquad (\hat{\mathrm{P}})$$
In Section 4, we will present a federated optimization algorithm for solving $(\hat{\mathrm{P}})$. Suppose an algorithm can find the optimal solution $\hat w$ of $(\hat{\mathrm{P}})$ and the corresponding model parameter $\hat\theta(\hat w)$. We are interested in the optimality gap of the generalization loss of $\hat\theta(\hat w)$, namely, $L_0(\hat\theta(\hat w)) - L_0(\theta^*)$, where $L_0$ is defined as in (P). The main contribution of this paper is to establish a high-probability bound on this gap as a function of the sizes of $D^{\mathrm{train}}$ and $D^{\mathrm{valid}}$ as well as a statistical distance between $p_0$ and the $p_k$'s. Moreover, we compare our generalization bound with the bound achieved by learning only locally from $D^{\mathrm{valid}}$ and with the bound achieved by solving (1) with evenly distributed weights, and identify the parameter regimes where our method is preferred in theory.
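To make the bilevel structure concrete, the following self-contained toy sketch solves a two-node instance of the problem by projected gradient descent on $w$. Everything here is illustrative and not the federated algorithm of Section 4: the least-squares data, the finite-difference approximation of the hypergradient $\nabla F(w)$, and the plain simplex projection (the cap $b = 1$ is assumed inactive) are all simplifying assumptions; the inner problem (1) is solved in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: two nodes with linear-regression data. Node 1 shares the
# validation distribution (true parameter); node 2 has a shifted one.
d, n = 3, 200
theta_true = rng.normal(size=d)
theta_other = theta_true + 2.0

def make_data(theta, n):
    A = rng.normal(size=(n, d))
    return A, A @ theta + 0.01 * rng.normal(size=n)

nodes = [make_data(theta_true, n), make_data(theta_other, n)]
A_val, y_val = make_data(theta_true, 100)

def inner_solution(w):
    """theta_hat(w): minimizer of sum_k w_k ||A_k theta - y_k||^2 / n_k."""
    H = sum(wk * Ak.T @ Ak / len(yk) for wk, (Ak, yk) in zip(w, nodes))
    g = sum(wk * Ak.T @ yk / len(yk) for wk, (Ak, yk) in zip(w, nodes))
    return np.linalg.solve(H, g)

def outer_loss(w):
    """F(w): validation loss of theta_hat(w), as in eq. (3)."""
    r = A_val @ inner_solution(w) - y_val
    return r @ r / len(y_val)

def project_simplex(v):
    """Euclidean projection onto the probability simplex (cap b = 1)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.max(np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0))
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - tau, 0.0)

w = np.full(2, 0.5)  # start from even weights
for _ in range(50):
    # central finite-difference approximation of the hypergradient of F
    grad = np.array([
        (outer_loss(w + 1e-5 * e) - outer_loss(w - 1e-5 * e)) / 2e-5
        for e in np.eye(2)])
    w = project_simplex(w - 0.5 * grad)
```

Because node 1 shares the validation distribution while node 2 does not, the outer loop shifts essentially all weight onto node 1, which is the behavior the adaptive weights are meant to produce.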

2. RELATED WORK

The work most related to ours is Chen et al. (2021a), in which the authors proposed a target-aware weighted training algorithm for cross-task learning. Although their problem is completely different from FL, the bilevel optimization model they studied contains $(\hat{\mathrm{P}})$ as a special case. In fact, some steps in the proofs of the generalization bounds in the current work are borrowed from Chen et al. (2021a) with some modifications. However, our work extends their results in several valuable directions. First, the generalization bound in Chen et al. (2021a) is shown only for weights $w$ without any small or zero components, which is not necessarily the case for the optimal solution $\hat w$ of $(\hat{\mathrm{P}})$. Second, their generalization bound contains a task-distance term whose convergence rate is not characterized. In contrast, we show the convergence of the entire generalization bound for $\hat w$ without any conditions on its components. Third, the generalization bound in Chen et al. (2021a) has a dominating term $O(1/\sqrt{n_{\mathrm{valid}}})$, which is the same as the generalization bound obtained by training directly with the local data $D^{\mathrm{valid}}$. We show instead that, when there exist identical neighbors and an error-bound condition holds (Assumptions 2′ and 3), the model learned by $(\hat{\mathrm{P}})$ can be superior to a locally trained model when the $p_k$'s are similar enough to (but still different from) $p_0$, providing insight into when a node with insufficient data should actively seek collaboration with others.

FL has become a prominent machine learning paradigm for training models with distributed data (Konečnỳ et al., 2016; McMahan et al., 2017). Many federated optimization algorithms have been

