LOCAL COEFFICIENT OPTIMIZATION IN FEDERATED LEARNING

Abstract

Federated learning has emerged as a promising approach to building a large-scale cooperative learning system among multiple clients without sharing their raw data. However, given a specific global objective, finding the optimal sampling weights for each client remains largely unexplored. The task is particularly challenging when clients' data distributions are non-i.i.d. and clients participate only partially. In this paper, we model the above task as a bi-level optimization problem that takes the correlations among different clients into account. We present a double-loop primal-dual-based algorithm to solve the bi-level optimization problem, and we provide a rigorous convergence analysis for our algorithm under mild assumptions. Finally, we perform extensive empirical studies on both toy examples and learning models trained on real datasets to verify the effectiveness of the proposed method.

1. INTRODUCTION

Federated learning has achieved great success in building large-scale cooperative learning systems without sharing raw data. However, due to the large number of devices involved in the learning system, it is hard to check the data quality (e.g., the noise level) of individual devices, and training with bad-quality data degrades the model. To eliminate the influence of the 'bad' devices, it is natural to reduce the weight of those devices. In the most popular federated training algorithms (e.g., FedAvg (Li et al., 2019)), all devices are weighted equally or in proportion to the number of data points they hold. Borrowing the formulation of these federated algorithms, we introduce a new variable x to control the weight of each device, namely the coefficient of each local objective, and we introduce a validation set on the server to validate whether the coefficients improve the model. We formulate the whole problem as the following bi-level optimization problem:

    min_x  f_0(w*(x))
    s.t.   w*(x) ∈ argmin_w Σ_{i=1}^N x^{(i)} f_i(w),          (1)
           x ∈ X = {x | x ≥ 0, ‖x‖_1 = 1}.

To solve problem (1), Kolstad & Lasdon (1990) propose an algorithm that calculates the gradient with respect to x directly:

    ∂f_0(w*(x)) / ∂x^{(i)} = −∇_w f_0(w*(x))^⊤ [ Σ_{j=1}^N x^{(j)} ∇²_w f_j(w*(x)) ]^{−1} ∇_w f_i(w*(x)).

However, due to the large dimension of w, it is impractical to invert the Hessian or to solve the linear system involving the Hessian. Meanwhile, due to the large amount of data on each local device, it is hard to directly evaluate the gradient or the Hessian of a local function f_i; only stochastic gradients and stochastic Hessians can be accessed. Thus, Ghadimi & Wang (2018) propose the BSA algorithm, in which the inverse of the Hessian is approximated by a truncated series of powers of the Hessian (using Σ_{k=0}^K (I − ηH)^k to approximate (1/η) H^{−1} for a suitable η), and Khanduri et al. (2021) propose the SUSTAIN algorithm, which solves stochastic bi-level optimization problems with a smaller sample complexity.

Similar to Ghadimi & Wang (2018), these methods need an extra loop of Hessian-vector products to approximate the product of the Hessian inverse with a vector. However, it is known that in constrained optimization, following a direction that is merely a descent direction (i.e., whose inner product with the negative gradient is larger than 0) does not guarantee convergence to the optimal point, or even to a first-order stationary point (Bertsekas, 2009). Therefore, an accurate approximation of the Hessian inverse is essential. With the power-series approximation, every iteration has to start from k = 0 and run several inner steps to reach sufficient accuracy, which increases both the computation and the communication in federated learning. Fortunately, by the KKT conditions, information about the Hessian inverse can be embedded into the dual variables. Based on the smoothness of the objective, we can give the dual variables a good (warm-started) initialization rather than restarting from the same initialization (such as I in the series approximation) in each iteration. Thus, we propose a primal-dual-based algorithm to solve problem (1).

Further, when solving a constrained optimization problem with non-linear equality constraints, adding the squared norm of the equality constraint as an augmented term may not make the augmented Lagrangian convex. As a result, it is hard for a min-max optimization algorithm to find the stationary point. Instead, under the assumption of Ghadimi & Wang (2018) that the functions f_i are strongly convex, adding the functions f_i themselves as the augmented term introduces convexity and does not change the stationary point of the min-max problem. Based on this new augmented Lagrangian, we prove that with stochastic gradient descent and ascent, w and λ converge to the KKT point.
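As a concrete numerical illustration of the hypergradient formula and the power-series approximation of the Hessian inverse discussed above, consider quadratic local objectives f_i(w) = ½‖w − c_i‖², for which everything can be checked in closed form. The sketch below is our own illustrative code (the names and setup are assumptions for the toy example, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 3
C = rng.normal(size=(N, d))     # client targets: f_i(w) = 0.5 * ||w - C[i]||^2
w_val = rng.normal(size=d)      # server validation target: f_0(w) = 0.5 * ||w - w_val||^2
x = np.full(N, 1.0 / N)         # uniform client weights on the simplex

# Lower level: w*(x) minimizes sum_i x_i f_i(w); since sum_i x_i = 1, w* = sum_i x_i C[i].
w_star = x @ C

# Hypergradient (Kolstad & Lasdon form):
#   d f_0 / d x^{(i)} = -grad f_0(w*)^T [sum_j x_j H_j]^{-1} grad f_i(w*),  with H_j = I here.
g0 = w_star - w_val                           # grad_w f_0(w*)
H = sum(x_j * np.eye(d) for x_j in x)         # lower-level Hessian (= I, since sum x = 1)
grads = w_star - C                            # row i: grad_w f_i(w*)
hyper_exact = -(grads @ np.linalg.solve(H, g0))

# Truncated power series: H^{-1} v ~ eta * sum_{k=0}^K (I - eta*H)^k v
def series_solve(Hmat, v, eta=0.5, K=100):
    out, cur = v.copy(), v.copy()
    for _ in range(K):
        cur = cur - eta * (Hmat @ cur)   # apply (I - eta*H) once more
        out += cur
    return eta * out

hyper_series = -(grads @ series_solve(H, g0))
print(np.max(np.abs(hyper_exact - hyper_series)))  # the two estimates agree closely
```

With enough inner steps K the series estimate matches the direct solve, but each outer iteration restarts the series from scratch; this restart cost is what the warm-started dual variables are meant to avoid.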
Meanwhile, by the implicit function theorem, when w and λ are close to the stationary point of the min-max problem, the bias in estimating the gradient with respect to x can be reduced to 0. Thus, with the primal-dual algorithm on w and λ and stochastic projected gradient descent on x, we show the convergence of our algorithm. Finally, we compare our algorithm with other algorithms on a toy example and on real datasets (MNIST and F-MNIST with the LeNet-5 network). The experimental results show that the proposed algorithm performs well in strongly convex cases and even in some non-convex cases (neural networks). We summarize our contributions as follows:

• In federated learning, we formulate the local coefficient learning problem as a bi-level optimization problem, which gives a way to identify the dataset quality of each local client for a specific task (where a small validation set is given).

• In bi-level optimization, we introduce a primal-dual framework and show the convergence of the whole algorithm in the constrained and stochastic setting.

• For certain optimization problems with non-linear constraints, we give a new augmented term. With the new augmented term, the primal and dual variables can converge to the KKT point of the original problem.

Different from the related works discussed below, we explicitly formulate a bi-level optimization problem. By adding a validation set, the correlation between the information from the other devices and a device's own information can be identified more clearly.
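The stochastic projected gradient descent on x mentioned above requires projecting back onto the simplex X = {x | x ≥ 0, ‖x‖_1 = 1} after each step. A minimal sketch using the standard sort-based Euclidean projection (a generic routine; the paper's exact implementation may differ):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} via the sort-based rule."""
    u = np.sort(v)[::-1]                    # sort entries in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]   # last entry kept positive
    theta = (1.0 - css[rho]) / (rho + 1.0)               # common shift
    return np.maximum(v + theta, 0.0)

def projected_step(x, grad_x, lr=0.1):
    """One projected (stochastic) gradient step on the client weights x."""
    return project_simplex(x - lr * grad_x)

x = np.full(4, 0.25)
x = projected_step(x, np.array([1.0, 0.0, 0.0, 0.0]))   # down-weight client 0
print(x)  # still a valid weight vector: nonnegative, sums to 1
```

The projection keeps every iterate a valid set of client coefficients, so a client with consistently harmful gradients can have its weight driven all the way to zero.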

2. RELATED WORK

2.1. FEDERATED LEARNING

The work in federated learning most related to ours is personalized federated learning, in which a well-trained personalized model is needed for each local device. Jiang et al. (2019) and Deng et al. (2020) propose methods that train a global model and then fine-tune it to obtain each local model. T. Dinh et al. (2020) and Fallah et al. (2020) change the local objective function so that each local model can differ from the global one and handle its individual local task. Li et al. (2021) introduces a two-level optimization problem for seeking the best local model from good global models. None of these works involves a validation set as a reference; instead, they rely on a few gradient steps or simple modifications and hope that the local model can both fit the local training data and use information from the global model (i.e., from the other local devices).

2.2. STOCHASTIC BI-LEVEL OPTIMIZATION

The bi-level optimization problem has been studied for a long time. One of the simplest cases in bi-level optimization is the singleton case, where the lower-level optimization has a unique global optimum

