EXP-α: BEYOND PROPORTIONAL AGGREGATION IN FEDERATED LEARNING

Abstract

Federated Learning (FL) is a distributed learning paradigm that computes gradients of a model locally on different clients and aggregates the updates to collectively construct a new model. Typically, the updates from local clients are aggregated with weights proportional to the sizes of the clients' local datasets. In practice, clients have different local datasets suffering from data heterogeneity, such as imbalance. Although proportional aggregation still theoretically converges to the global optimum, it is provably slower when non-IID data is present (under convexity assumptions), an effect that is exacerbated in practice. We posit that this analysis ignores convergence rate, which is especially important under such settings in the more realistic non-convex real world. To account for this, we analyze a generic and time-varying aggregation strategy to reveal a surprising trade-off between convergence rate and convergence error under convexity assumptions. Inspired by the theory, we propose a new aggregation strategy, Exp-α, which weights clients differently based on the severity of their data heterogeneity. It achieves stronger convergence rates at the theoretical cost of a non-vanishing convergence error. Through a series of controlled experiments, we empirically demonstrate the superior convergence behavior (both in terms of rate and, in practice, even error) of the proposed aggregation on three types of data heterogeneity: imbalance, label-flipping, and domain shift, when combined with existing FL algorithms. For example, on our imbalance benchmark, Exp-α, combined with FedAvg, achieves a relative 12% increase in convergence rate and a relative 3% reduction in error across four FL communication settings.

1. INTRODUCTION

Federated Learning (FL) (McMahan et al., 2017) is a decentralized approach for learning a model on distributed data while preserving data privacy: because data reside on clients and are never transmitted to a central server, privacy is preserved. However, data on local clients are often correlated with client demographics and preferences. This makes training data highly non-IID or heterogeneous (Wang et al., 2021; Zhang et al., 2021; Kairouz et al., 2021), containing label imbalance, noisy labels (e.g., label-flipping), or domain shift, which can significantly impact a model's performance and, specifically, its convergence rate (Zhao et al., 2018; Li et al., 2019). To tackle the issue of data heterogeneity, the majority of federated learning research has focused on improving the local optimization (Zhao et al., 2018; Shoham et al., 2019; Karimireddy et al., 2020; Zhang et al., 2020; Acar et al., 2021) and global optimization (Hsu et al., 2019; Reddi et al., 2020) objectives in a federated learning pipeline. Few papers have paid attention to other aspects of federated learning, such as client selection (Cho et al., 2020) and model aggregation (Chen et al., 2020; Wang et al., 2020). Most existing methods use proportional aggregation (McMahan et al., 2017), whose aggregation weights are proportional to the size of the local dataset. Although proportional aggregation still theoretically converges when non-IID data is present under convexity assumptions, we posit that this analysis ignores convergence rate, which is especially important under such settings in the real world, because proportional aggregation assumes equal importance of all samples. Intuitively, non-IID data makes the equal-importance property questionable, since imbalanced data can bias predictions towards majority classes, and noise or domain shift can slow down convergence.
To study this, we start by introducing the proportional aggregation strategy and discussing its merits: equal importance and asymptotic convergence. Following prior works (Wang et al., 2021; Reisizadeh et al., 2020; Yuan et al., 2021), we define the federated global objective $F(\mathbf{W})$ of the server as a weighted sum of $N$ local objectives $F_i(\mathbf{W})$ in Eq. 1, where $F(\cdot)$ denotes a generic loss/risk function.

Definition 1. $F(\mathbf{W}) := \sum_{i=1}^{N} \rho_i F_i(\mathbf{W})$, where $F_i(\mathbf{W}) = \mathbb{E}_{\xi \sim P_i}[f(\mathbf{W}; \xi)]$. (1)

Here $\rho_i$ with $\sum_{i=1}^{N} \rho_i = 1$ are the aggregation weights, $\mathbf{W} \in \mathbb{R}^d$ denotes the global model, and $\xi$ is a sample from the local data distribution $P_i$. In the distributed learning (Stich, 2018) and federated learning (Li et al., 2019) literature, the weights are usually set to be proportional to the number of samples on a client, i.e., $\rho_i = |\Xi_i| / \sum_{j=1}^{N} |\Xi_j|$, where $|\Xi_i|$¹ is the size of the local dataset $\Xi_i$. This weighting scheme has an intuitive interpretation: the global data can be equivalently seen as the union of the local datasets, and the federated global objective is equivalent in expectation to what one would optimize centrally if data were sampled randomly from it. Proportional aggregation is then used to compute an unbiased update to Eq. 1. In summary, proportional aggregation is a statistically sound strategy, giving all data points equal importance and providing asymptotic convergence to a hypothetical centralized objective, i.e., eventually achieving zero error. However, a recent survey calls these properties into question (Wang et al., 2021). In the real world, the defining characteristics of proportional aggregation, particularly equal importance and asymptotic convergence, can be less well justified. In particular, the property of equal importance of all participating data can be undesirable when data heterogeneity is severe.
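A minimal sketch of the proportional (FedAvg-style) aggregation described above, assuming each client's model is a flat NumPy parameter vector; the function names are illustrative, not from the paper:

```python
import numpy as np

def proportional_weights(dataset_sizes):
    """rho_i = |Xi_i| / sum_j |Xi_j|: weights proportional to local dataset size."""
    sizes = np.asarray(dataset_sizes, dtype=float)
    return sizes / sizes.sum()

def aggregate(client_models, weights):
    """Weighted average of client parameter vectors (the server-side update)."""
    return sum(w * m for w, m in zip(weights, client_models))

# Example: three clients with imbalanced local dataset sizes.
sizes = [100, 300, 600]
rho = proportional_weights(sizes)       # -> [0.1, 0.3, 0.6]
models = [np.ones(4) * k for k in (1.0, 2.0, 3.0)]
global_model = aggregate(models, rho)   # each entry: 0.1*1 + 0.3*2 + 0.6*3 = 2.5
```

Note how the client with 600 samples dominates the average, which is exactly the equal-importance-per-sample property: statistically sound under IID data, but questionable when that client's data is heterogeneous.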
For example, even though the convergence of proportional aggregation (with zero error under convex settings) with non-IID clients is guaranteed, it is provably slower (Li et al., 2019), and with data poisoning (such as label-flipping), it can even be unstable (Xie et al., 2019; Jebreel et al., 2022). This is exacerbated by the limited communication rounds in FL, making the asymptotic convergence property less relevant, since asymptotic convergence can only be achieved under the assumption of unlimited communication. As a result, two algorithms with comparable asymptotic convergence can perform quite differently in practice (Wang et al., 2021). In this paper, we study a generic and time-varying aggregation strategy with $\sum_{i=1}^{N} \rho_i^t = 1$, where $\rho_i^t$ is the weight for client $i$ at time $t$, as opposed to proportional aggregation. A theoretical study of this strategy reveals a surprising trade-off between convergence rate and convergence error, allowing us to make explicit what proportional aggregation favors and to develop new algorithms that make different trade-offs. For example, proportional aggregation, when instantiated in our framework as a special case, is shown to favor convergence error at the cost of convergence rate. More specifically, we start from a theoretical analysis of the convergence of FedAvg (McMahan et al., 2017), a prototypical FL algorithm, while allowing the aggregation weights to change over time. The resultant convergence bound in this more generic setting reveals a family of aggregation strategies that 1) improve the convergence rate but 2) leave a theoretically non-vanishing error w.r.t. the proportionally weighted federated objective (Eq. 1). Subsequently, we propose a specific aggregation strategy in this family, Exp-α, which weights clients differently based on the severity of their data heterogeneity and can achieve stronger convergence rates at the theoretical cost of a non-vanishing convergence error.
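This excerpt does not spell out Exp-α's exact form, so the following is a hypothetical sketch of a time-varying weighting in its spirit: down-weight clients with more severe heterogeneity via an exponential penalty, then renormalize so the weights sum to one. The per-client severity score `h` and its exponential use here are assumptions for illustration, not the paper's definition:

```python
import numpy as np

def exp_alpha_weights(dataset_sizes, heterogeneity, alpha=1.0):
    """Hypothetical Exp-alpha-style time-varying weights rho_i^t.

    `heterogeneity[i]` is an assumed severity score h_i >= 0 for client i
    (e.g., distance of client i's update from the weighted mean update).
    Clients with larger h_i receive exponentially smaller weight."""
    sizes = np.asarray(dataset_sizes, dtype=float)
    h = np.asarray(heterogeneity, dtype=float)
    raw = sizes * np.exp(-alpha * h)   # penalize heterogeneous clients
    return raw / raw.sum()             # renormalize: sum_i rho_i^t = 1

# Clients 0 and 1 look similar (low h); client 2 is a heterogeneity outlier.
rho_t = exp_alpha_weights([100, 300, 600], heterogeneity=[0.1, 0.1, 2.0])
# Versus proportional weights [0.1, 0.3, 0.6], client 2 is sharply down-weighted.
```

Because the weights no longer match the proportional ones, the aggregated update is biased w.r.t. Eq. 1, which is precisely the non-vanishing-error side of the rate/error trade-off discussed above.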
Intuitively, this strategy puts larger weights on clients whose data distributions are more similar to each other. Empirically, we go beyond theory to test its effectiveness on three major types of local data heterogeneity: imbalance (Zhao et al., 2018), label-flipping (Xie et al., 2019), and domain shift (Li et al., 2021). Our results suggest that an aggregation strategy with a faster convergence rate can be more important in practice than one with theoretically zero error under the convexity assumption; in practice, our method achieves both better rates and better errors, owing to the fact that practical settings are non-convex. For example, on our imbalance benchmark, Exp-α, combined with FedAvg, achieves a relative 12% increase in convergence rate and a relative 3% reduction in error across four FL communication settings. To sum up, our contributions are:
• We analyze the convergence of FedAvg with a generic and time-varying aggregation strategy to reveal a trade-off between convergence rate and error under convexity assumptions, and elucidate properties of prior proportional aggregation strategies.



¹ We use the notation $|\cdot|$ to denote the size of a set.

