LEARNING TO AGGREGATE: A PARAMETERIZED AGGREGATOR TO DEBIAS MODEL AGGREGATION FOR CROSS-DEVICE FEDERATED LEARNING

Anonymous

Abstract

Federated learning (FL) collaboratively trains deep models on decentralized clients under privacy constraints. The aggregation of client parameters within a communication round suffers from "client drift" due to the heterogeneity of client data distributions, resulting in unstable and slow convergence. Recent works typically impose regularization on the client parameters to reduce the local aggregation heterogeneity during optimization. However, we argue that they generally neglect the inter-communication heterogeneity of data distributions ("period drift"), leading to deviations of intra-communication optimization from the global objective. In this work, we aim to calibrate the local aggregation under "client drift" and simultaneously approach the global objective under "period drift". To achieve this goal, we propose a learning-based aggregation strategy, named FEDPA, that employs a Parameterized Aggregator rather than non-adaptive techniques (e.g., federated averaging). We frame FEDPA within a meta-learning setting, where the aggregator serves as the meta-learner and the meta-task is to aggregate the client parameters so as to generalize well on a proxy dataset. Intuitively, the meta-learner is task-specific and can thereby acquire the meta-knowledge, i.e., calibrating the parameter aggregation from a global view and approaching the global optimum for generalization.

1. INTRODUCTION

Federated Learning (FL) McMahan et al. (2017) has emerged as a privacy-preserving machine learning paradigm to collaboratively train a shared model in a decentralized manner without sharing private data. In FL, clients independently train the shared model on their private data, and the server aggregates the uploaded model parameters periodically until convergence. In FL Kairouz et al. (2021), a key challenge hindering effective model aggregation lies in the heterogeneous data of clients Zhao et al. (2018), especially in cross-device (as opposed to cross-silo) FL with a large number of clients (e.g., mobile devices). In this setting, vanilla FL algorithms such as federated averaging (FEDAVG) McMahan et al. (2017), which average the parameters of candidate clients, suffer from poor convergence and performance degradation. Such non-iid data induces client drift Karimireddy et al. (2021). To cope with it, recent works typically impose regularization on local optimization at each communication round such that the intra-round heterogeneity can be reduced. However, we argue that existing methods generally neglect the heterogeneity among different communication rounds, and round-specific regularization inevitably falls into a local optimum. Specifically, in cross-device FL, the sampled clients to be aggregated might follow different data distributions in different communication rounds. As such, the optimization direction estimated in a single round might deviate from that estimated with all clients, eventually amplifying the aggregation bias[1] and resulting in poor convergence or even oscillation. For simplicity, we term this challenge "period drift", and provide empirical evidence on real-world datasets (c.f. Figure 1). Conventional averaging schemes such as FEDAVG McMahan et al. (2017) are, however, non-trivial to extend beyond this local view to approach the optimum based solely on the intra-communication client parameters (c.f. Figure 1).
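For concreteness, the FEDAVG aggregation referenced above reduces to a data-size-weighted average of client parameters. The following sketch (function and variable names are ours, for flattened parameter vectors) illustrates the rule:

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors (the FEDAVG rule)."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()                    # normalize by data volume
    stacked = np.stack(client_params)           # (n_clients, n_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Example: three clients holding different amounts of data.
params = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 20, 70]
global_params = fedavg_aggregate(params, sizes)   # -> [4.2, 5.2]
```

Note that the weights depend only on data volume, not on data distribution, which is precisely why this rule cannot compensate for client or period drift.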
To bridge the gap, we introduce a learning-based framework, where a parameterized aggregator takes the intra-communication client parameters into consideration and learns to calibrate the direction of the aggregated parameters. Technically, we propose a novel aggregation strategy, named FEDPA, which frames the learning-to-aggregate procedure as a meta-learning setting Ravi & Larochelle (2016); Andrychowicz et al. (2016). In particular, the aggregator is treated as a meta-learner that learns to aggregate the parameters of clients into a proxy model that generalizes well on a proxy dataset. The aggregation process at each communication round constitutes one meta-task. The meta-knowledge refers to how to capture the global view under client/period drift, alleviate the aggregation bias, and calibrate the aggregated parameters towards the optimum.
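As a minimal, hypothetical instantiation of this learning-to-aggregate idea (not FEDPA's actual architecture), the sketch below assumes the aggregator is simply a vector of learnable logits, softmax-normalized into convex mixing weights over clients, and trained by gradient descent so that the mixed linear model fits a proxy dataset; all names and the linear-model setting are our own assumptions:

```python
import numpy as np

def proxy_loss_grad(theta, X, y):
    """MSE loss of a linear model on the proxy set, and its gradient."""
    r = X @ theta - y
    return (r @ r) / len(y), 2 * X.T @ r / len(y)

def learn_to_aggregate(client_thetas, X_proxy, y_proxy, steps=500, lr=0.5):
    T = np.stack(client_thetas)                  # (n_clients, d)
    logits = np.zeros(len(T))                    # the aggregator's parameters
    for _ in range(steps):
        w = np.exp(logits - logits.max())
        w /= w.sum()                             # softmax mixing weights
        theta = w @ T                            # aggregated model parameters
        _, g_theta = proxy_loss_grad(theta, X_proxy, y_proxy)
        a = T @ g_theta                          # dL/dw_i for each client
        logits -= lr * w * (a - w @ a)           # softmax chain rule
    w = np.exp(logits - logits.max())
    return w / w.sum()
```

In this toy setting, a client whose parameters already fit the proxy data receives a larger mixing weight, mimicking how a learned aggregator can down-weight drifted clients.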

2. RELATED WORK

Federated learning with non-iid data The performance of federated learning often suffers from heterogeneous data located across multiple clients. Zhao et al. (2018) demonstrate that the accuracy of federated learning reduces significantly when models are trained with highly skewed non-iid data, which they explain by weight divergence. Li et al. (2020) propose FEDPROX, which utilizes a proximal term to deal with heterogeneity. Li et al. (2021b) provide comprehensive data partitioning strategies to cover the typical non-iid data scenarios. FedNova (Wang et al., 2020) focuses on the number of epochs in local updates and proposes a normalized averaging scheme to eliminate objective inconsistency. FedBN (Li et al., 2021c) focuses on the feature-shift non-iid setting in FL, and proposes using local batch normalization to alleviate the feature shift before averaging models.
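To make FEDPROX's proximal term concrete, the sketch below (our own illustration, for a linear model with squared loss) adds the penalty (mu/2)·||theta − theta_global||² to the local objective, so local updates are pulled toward the current global model; function names are hypothetical:

```python
import numpy as np

def fedprox_local_grad(theta, theta_global, X, y, mu):
    """Gradient of the local objective f_k(theta) + (mu/2)||theta - theta_global||^2."""
    r = X @ theta - y
    grad_task = 2 * X.T @ r / len(y)             # local MSE gradient
    grad_prox = mu * (theta - theta_global)      # proximal pull toward global
    return grad_task + grad_prox

def local_update(theta_global, X, y, mu=0.1, lr=0.05, epochs=50):
    """One client's local training pass under the proximal objective."""
    theta = theta_global.copy()
    for _ in range(epochs):
        theta -= lr * fedprox_local_grad(theta, theta_global, X, y, mu)
    return theta
```

A larger mu keeps the local solution closer to theta_global, trading local fit for reduced client drift at aggregation time.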



[1] In ecological studies, aggregation bias is the expected difference between effects for the group and effects for the individual.



Existing works Hsu et al. (2019); Li et al. (2020); Karimireddy et al. (2021) depict the non-iid trap as weight divergence Zhao et al. (2018) or client drift Karimireddy et al. (2021).

Figure 1: Period drift in FL. In the left figure, we give an example of a 10-class classification task with label distribution skew Kairouz et al. (2021). We consider three degrees of non-iidness by setting the Dirichlet hyperparameter α = 1/0.1/0.01 Hsu et al. (2019), and display the distribution differences over five communication rounds within a 5 × 3 grid. The colored blocks in each histogram represent the amount of data of different labels belonging to the 10 clients selected out of 100 clients. Within a subfigure (i.e., within a communication round), client drift is exhibited by the different colors, while within a column, period drift is presented by the varying lengths of the bars. Period drift becomes more pronounced as the degree of non-iidness increases (smaller α). In the right figure, we illustrate the trajectory of FL, where the update direction under period drift may deviate from the global optimum, resulting in slow convergence and oscillation, and show how FEDPA calibrates and controls the trajectory of FL.
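The Dirichlet label-skew partition underlying the figure can be sketched as follows (our own reconstruction; `dirichlet_partition` is a hypothetical helper following the scheme of Hsu et al. (2019)). Smaller α concentrates each class on fewer clients, so the label mix of each round's sampled cohort varies across rounds, which is exactly the period drift visualized above:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, rng):
    """Split sample indices across clients with label skew controlled by
    alpha (smaller alpha -> stronger skew)."""
    n_classes = labels.max() + 1
    client_idx = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for client, part in zip(client_idx, np.split(idx, cuts)):
            client.extend(part)
    return client_idx

# Period drift: the label histogram of the sampled cohort changes per round.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=5000)
clients = dirichlet_partition(labels, n_clients=100, alpha=0.1, rng=rng)
round_label_hist = []
for _ in range(3):                               # three communication rounds
    cohort = rng.choice(100, size=10, replace=False)
    idx = np.concatenate([np.array(clients[k], dtype=int) for k in cohort])
    round_label_hist.append(np.bincount(labels[idx], minlength=10))
```

Comparing the entries of `round_label_hist` reproduces the column-wise variation in the left panel of Figure 1.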

