LOCAL COEFFICIENT OPTIMIZATION IN FEDERATED LEARNING

Abstract

Federated learning has emerged as a promising approach to building a large-scale cooperative learning system among multiple clients without sharing their raw data. However, given a specific global objective, finding the optimal sampling weights for each client remains largely unexplored. This is particularly challenging when clients' data distributions are non-i.i.d. and clients participate only partially. In this paper, we model the above task as a bi-level optimization problem that takes the correlations among different clients into account. We present a double-loop primal-dual-based algorithm to solve the bi-level optimization problem, and we provide a rigorous convergence analysis for this algorithm under mild assumptions. Finally, we perform extensive empirical studies on both toy examples and learning models trained on real datasets to verify the effectiveness of the proposed method.

1. INTRODUCTION

Federated learning has achieved great success in building large-scale cooperative learning systems without sharing raw data. However, because a large number of devices participate in the learning system, it is hard to check the data quality (e.g., the noise level) of each individual device, and training on bad-quality data degrades the model. To eliminate the influence of such 'bad' devices, it is natural to reduce their weights. In most popular federated training algorithms (e.g., FedAvg (Li et al., 2019)), all devices are weighted equally or in proportion to the number of data points they hold. Borrowing the formulation of federated algorithms, we introduce a new variable x that controls the weight of each device, i.e., the coefficient of each local objective, and a validation set on the server to check whether the coefficients improve the model. We formulate the whole problem as the following bi-level optimization problem:

    min_x  f_0(w*(x))
    s.t.   w*(x) ∈ argmin_w  Σ_{i=1}^N x^{(i)} f_i(w),                    (1)
           x ∈ X = {x | x ≥ 0, ||x||_1 = 1}.

To solve problem (1), Kolstad & Lasdon (1990) propose an algorithm that computes the gradient with respect to x directly:

    ∂f_0(w*(x)) / ∂x^{(i)} = -∇_w f_0(w*(x))^⊤ [ Σ_{j=1}^N x^{(j)} ∇²_w f_j(w*(x)) ]^{-1} ∇_w f_i(w*(x)).

However, due to the large dimension of the parameter w, it is impractical to invert this Hessian or to solve the linear system associated with it. Moreover, because each local device holds a large amount of data, it is hard to evaluate the gradient or the Hessian of a local function f_i exactly; only stochastic gradients and stochastic Hessians are accessible. Ghadimi & Wang (2018) therefore propose the BSA algorithm, in which the inverse of the Hessian is approximated by a series of powers of the Hessian (using Σ_{k=0}^K (I - ηH)^k to approximate (1/η) H^{-1} for a suitable η). Khanduri et al. (2021) propose the SUSTAIN algorithm, which solves stochastic bi-level optimization problems with a lower sample complexity.
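As a concrete illustration, the sketch below instantiates the bi-level problem with hypothetical quadratic local objectives f_i(w) = (1/2) w^⊤ A_i w - b_i^⊤ w (so that w*(x) has a closed form), and compares the exact implicit gradient of f_0(w*(x)) with its Neumann-series approximation in the spirit of BSA. All names and the specific toy setup are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

# Toy instance: N quadratic local objectives f_i(w) = 0.5 w^T A_i w - b_i^T w,
# so the lower-level solution is w*(x) = (sum_i x_i A_i)^{-1} (sum_i x_i b_i).
# Upper-level (validation) objective: f_0(w) = 0.5 ||w - w_val||^2.
rng = np.random.default_rng(0)
N, d = 3, 5
A = [M @ M.T + np.eye(d) for M in (rng.standard_normal((d, d)) for _ in range(N))]
b = [rng.standard_normal(d) for _ in range(N)]
w_val = rng.standard_normal(d)

x = np.ones(N) / N  # uniform client weights, a point on the simplex X

H = sum(xi * Ai for xi, Ai in zip(x, A))                 # lower-level Hessian
w_star = np.linalg.solve(H, sum(xi * bi for xi, bi in zip(x, b)))
g0 = w_star - w_val                                      # grad f_0 at w*(x)

# Exact hypergradient via the implicit function theorem:
#   d f_0 / d x_i = -g0^T H^{-1} grad_w f_i(w*(x))
grad_fi = [Ai @ w_star - bi for Ai, bi in zip(A, b)]
hyper_exact = np.array([-g0 @ np.linalg.solve(H, gi) for gi in grad_fi])

# Neumann-series approximation: eta * sum_{k=0}^{K} (I - eta*H)^k -> H^{-1}
# (converges when 0 < eta < 1 / lambda_max(H)); applied via mat-vec products
# only, which is what makes the approach feasible for high-dimensional w.
eta = 0.9 / np.linalg.eigvalsh(H).max()
K = 500
v = g0.copy()          # holds (I - eta*H)^k g0
Hinv_g0 = np.zeros(d)  # accumulates eta * sum_k (I - eta*H)^k g0
for _ in range(K + 1):
    Hinv_g0 += eta * v
    v = v - eta * (H @ v)
hyper_neumann = np.array([-Hinv_g0 @ gi for gi in grad_fi])

print(np.max(np.abs(hyper_exact - hyper_neumann)))  # small for large K
```

Note that the Neumann loop never forms H^{-1} explicitly: it only needs Hessian-vector products, which can be replaced by stochastic estimates in the federated setting.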

