EXP-α: BEYOND PROPORTIONAL AGGREGATION IN FEDERATED LEARNING

Abstract

Federated Learning (FL) is a distributed learning paradigm, which computes gradients of a model locally on different clients and aggregates the updates to construct a new model collectively. Typically, the updates from local clients are aggregated with weights proportional to the size of clients' local datasets. In practice, clients have different local datasets suffering from data heterogeneity, such as imbalance. Although proportional aggregation still theoretically converges to the global optimum, it is provably slower when non-IID data is present (under convexity assumptions), the effect of which is exacerbated in practice. We posit that this analysis ignores convergence rate, which is especially important under such settings in the more realistic non-convex real world. To account for this, we analyze a generic and time-varying aggregation strategy to reveal a surprising trade-off between convergence rate and convergence error under convexity assumptions. Inspired by the theory, we propose a new aggregation strategy, Exp-α, which weights clients differently based on their severity of data heterogeneity. It achieves stronger convergence rates at the theoretical cost of a non-vanishing convergence error. Through a series of controlled experiments, we empirically demonstrate the superior convergence behavior (both in terms of rate and, in practice, even error) of the proposed aggregation on three types of data heterogeneity: imbalance, label-flipping, and domain shift when combined with existing FL algorithms. For example, on our imbalance benchmark, Exp-α, combined with FedAvg, achieves a relative 12% increase in convergence rate and a relative 3% reduction in error across four FL communication settings.

1. INTRODUCTION

Federated Learning (FL) (McMahan et al., 2017) is a decentralized approach for learning a model on distributed data. Because data reside on clients and are never transmitted to a central server, privacy is preserved. However, data on local clients are often correlated with the clients' demographics and preferences. This makes training data highly non-IID or heterogeneous (Wang et al., 2021; Zhang et al., 2021; Kairouz et al., 2021), containing label imbalance, noisy labels (e.g., label-flipping), or domain shift. Such heterogeneity can significantly degrade a model's performance, and specifically its convergence rate (Zhao et al., 2018; Li et al., 2019). To tackle data heterogeneity, the majority of federated learning works have focused on improving the local optimization (Zhao et al., 2018; Shoham et al., 2019; Karimireddy et al., 2020; Zhang et al., 2020; Acar et al., 2021) and the global optimization (Hsu et al., 2019; Reddi et al., 2020) objectives in a federated learning pipeline. Few papers have paid attention to other aspects of federated learning, such as client selection (Cho et al., 2020) and model aggregation (Chen et al., 2020; Wang et al., 2020). Most existing methods use proportional aggregation (McMahan et al., 2017), whose aggregation weights are proportional to the size of the local datasets. Although proportional aggregation still theoretically converges when non-IID data is present under convexity assumptions, we posit that this analysis ignores convergence rate, which is especially important under such settings in the real world, because proportional aggregation assumes equal importance of all samples. Intuitively, non-IID data makes the equal importance property questionable, since imbalanced data can bias predictions towards majority classes, and noise or domain shift can slow down convergence.
To study this, we start by introducing the proportional aggregation strategy and discussing its merits: equal importance and asymptotic convergence. Following prior works (Wang et al., 2021; Reisizadeh et al., 2020; Yuan et al., 2021), we define the federated global objective $F(W)$ of the server as a weighted sum of $N$ local objectives $F_i(W)$ in Eq. 1, where $F(\cdot)$ denotes a generic loss/risk function.
Definition 1
$$F(W) := \sum_{i=1}^{N} \rho_i F_i(W), \quad \text{where } F_i(W) = \mathbb{E}_{\xi \sim P_i}[f(W; \xi)]. \tag{1}$$
Here $\rho_i$, with $\sum_{i=1}^{N} \rho_i = 1$, are the aggregation weights, $W \in \mathbb{R}^d$ denotes the global model, and $\xi$ is a sample from the local data distribution $P_i$. Usually in the distributed learning (Stich, 2018) and federated learning (Li et al., 2019) literature, the weights are set to be proportional to the number of samples on a client, i.e., $\rho_i = \frac{|\Xi_i|}{\sum_{j=1}^{N} |\Xi_j|}$, where $|\Xi_i|$ is the size of the local dataset $\Xi_i$. This weighting scheme has an intuitive interpretation: the global data can be equivalently seen as the union of the local datasets, and the federated global objective is equivalent in expectation to what one would optimize centrally if data were sampled randomly from that union. Proportional aggregation is then used to compute an unbiased update to Eq. 1. In summary, proportional aggregation is a statistically sound strategy, giving all data points equal importance and providing asymptotic convergence to a hypothetical centralized objective, i.e., achieving zero error eventually. However, a recent survey calls these properties into question (Wang et al., 2021). In the real world, the defining characteristics of proportional aggregation, particularly equal importance and asymptotic convergence, can be less well justified. The property of equal importance of all participating data can be less desirable when data heterogeneity is severe.
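The interpretation of proportional weights above can be checked numerically. The following is a minimal sketch, assuming each local objective is the empirical mean of per-sample losses; the loss values and the helper names `proportional_weights` and `global_objective` are hypothetical and chosen only for illustration.

```python
def proportional_weights(dataset_sizes):
    """Return rho_i = |Xi_i| / sum_j |Xi_j| for each client."""
    total = sum(dataset_sizes)
    return [size / total for size in dataset_sizes]


def global_objective(weights, local_objectives):
    """Weighted sum of local objectives, as in Eq. 1."""
    return sum(rho * f_i for rho, f_i in zip(weights, local_objectives))


# Hypothetical per-sample losses held by three clients of different sizes.
client_losses = [[1.0, 3.0], [5.0], [0.0, 0.0, 3.0]]

sizes = [len(losses) for losses in client_losses]
local_means = [sum(losses) / len(losses) for losses in client_losses]
rho = proportional_weights(sizes)

# Federated objective with proportional weights vs. the mean over the pooled data.
federated = global_objective(rho, local_means)
pooled = sum(sum(losses) for losses in client_losses) / sum(sizes)
print(rho, federated, pooled)
```

Under this assumption the two quantities coincide, which is the sense in which every sample receives equal importance.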
For example, even though convergence with proportional aggregation (with zero error under convex settings) is guaranteed for non-IID clients, it is provably slower (Li et al., 2019), and with data poisoning (such as label-flipping) it can even be unstable (Xie et al., 2019; Jebreel et al., 2022). This is exacerbated by the limited number of communication rounds in FL, which makes the asymptotic convergence property less relevant, since asymptotic convergence can only be achieved under the assumption of unlimited communication. As a result, two algorithms with comparable asymptotic convergence can perform quite differently in practice (Wang et al., 2021). In this paper, we study a generic and time-varying aggregation strategy with weights satisfying $\sum_{i=1}^{N} \rho_i^t = 1$, where $\rho_i^t$ is the weight for client $i$ at time $t$, as opposed to proportional aggregation. A theoretical study of this strategy reveals a surprising trade-off between convergence rate and convergence error, allowing us to make more explicit what proportional aggregation favors and to develop new algorithms that make different trade-offs. For example, proportional aggregation, when instantiated in our framework as a special case, is shown to favor convergence error at the cost of convergence rate. More specifically, we start from a theoretical analysis of the convergence of FedAvg (McMahan et al., 2017), a prototypical FL algorithm, while allowing the aggregation weights to change over time. The resultant convergence bound in this more generic setting reveals a family of aggregation strategies that 1) improves convergence rate but 2) leaves a theoretically non-vanishing error w.r.t. the proportionally weighted federated objective (Eq. 1). Subsequently, we propose a specific aggregation strategy in this family, Exp-α, which weights clients differently based on their severity of data heterogeneity and can achieve stronger convergence rates at the theoretical cost of a non-vanishing convergence error.
Intuitively, this strategy puts larger weights on clients whose data distributions are more similar to each other. Empirically, we go beyond theory to test its effectiveness on three major types of local data heterogeneity: imbalance (Zhao et al., 2018), label-flipping (Xie et al., 2019) and domain shift (Li et al., 2021). Our results suggest that an aggregation strategy with a faster convergence rate can be more important in practice than one with theoretically zero error under the convex assumption; in practice, our method achieves both better rates and better errors, owing to the fact that practical settings are non-convex. For example, on our imbalance benchmark, Exp-α, combined with FedAvg, achieves a relative 12% increase in convergence rate and a relative 3% reduction in error across four FL communication settings. To sum up, our contributions are:
• We analyze the convergence of FedAvg with a generic and time-varying aggregation strategy to reveal a trade-off between convergence rate and error under convexity assumptions, and elucidate properties of prior proportional aggregation strategies.
• We propose a new aggregation strategy, Exp-α, that favors convergence rate over error under convexity assumptions. When applied to several existing FL algorithms in real-world experiments, Exp-α demonstrates superior performance in both convergence rate and error over the widely used proportional aggregation on three types of data heterogeneity.
(Li et al., 2019). However, a recent work (Cho et al., 2020) shows that a biased strategy can bring practical improvement to FL algorithms.

2. BACKGROUND AND RELATED WORKS

Client Update: After receiving the global model, the clients optimize their copies of it independently on their own local data $\xi \in \Xi_i \sim P_i(X, Y)$, where $\xi$ represents an element of the local dataset $\Xi_i$ on client $i$, sampled uniformly from the local data distribution $P_i$, for a specified number of steps $E$ to arrive at different updated local models $W_i^{t+E} \in \mathbb{R}^d$ for $i \in \{1, ..., N\}$. This is the most investigated stage in FL research due to its unique non-IID (distribution shift) challenge (Zhao et al., 2018; Li et al., 2019). The vanilla FedAvg (McMahan et al., 2017) uses plain SGD updates, which can only handle mild non-IID data. Many follow-up works design regularization techniques to improve convergence under more severe distribution shift. Please see Appendix A.1 for an introduction to those methods. Our contribution is orthogonal to FL research in this category and can be combined with it. We will demonstrate this compatibility in our experiments (Sec. 4).
Server Update: To complete this round of communication, the updated local models are sent back to the central server for aggregation, which yields the next global model. The server update can be split into two steps: aggregation and optimization (Reddi et al., 2020). Our work focuses on the aggregation step in the server update stage. Specifically, aggregation refers to how gradients are combined and optimization refers to how the aggregated gradients are applied. Please see Appendix A.1 for an introduction to FL algorithms with different server optimization techniques. Few have studied the aggregation step in the server update. FOCUS (Chen et al., 2020) measures the performance of a local model on a globally shared dataset and assigns an aggregation weight accordingly. However, the requirement of a global dataset that encompasses unknown local data distributions violates the privacy premise of FL.
A recent work (Wang et al., 2020) discovered an implicit bias in aggregation when the number of local updates differs across clients and proposed a mitigation strategy. This problem is orthogonal to our focus on non-IID data; therefore, we keep the number of local updates the same on all clients in our experiments. Nonetheless, most existing works use proportional aggregation. Our contribution is orthogonal to, and compatible with, other innovations in the optimization step. We will demonstrate this compatibility in the experiment section (Sec. 4).

3. GOING BEYOND PROPORTIONAL AGGREGATION

Existing FL convergence analyses often assume proportional aggregation in their derivation (Li et al., 2019; Khaled et al., 2020). This strategy yields asymptotically zero-error convergence under convex settings. However, as we will show in this section, revisiting the convergence analysis with a generic, time-varying aggregation strategy reveals that by carefully designing the aggregation weights, one can theoretically trade convergence error for convergence rate. This section is organized as follows. Sec. 3.1 introduces several common assumptions in FL convergence analysis and the notation necessary to understand the theoretical results; Sec. 3.2 presents a convergence bound with a generic and time-varying aggregation strategy; Sec. 3.3 discusses the trade-off between convergence rate and error with a derived corollary; finally, inspired by the corollary, Sec. 3.4 proposes a practical aggregation strategy, Exp-α, which demonstrates superior convergence behavior in terms of both rate and error on several benchmarks in Sec. 4.

3.1. TIME-VARYING AGGREGATION: ASSUMPTIONS AND NOTATIONS

We first make some common assumptions in FL convergence analysis.
Assumption 1 The local objective functions $F_1, ..., F_N$ are $\mu$-strongly convex: $F_i(W) - F_i(V) \ge (W - V)^T \nabla F_i(V) + \frac{\mu}{2} \|W - V\|_2^2 \ \forall W, V$.
Assumption 2 The local objective functions $F_1, ..., F_N$ are $L$-smooth: $F_i(W) - F_i(V) \le (W - V)^T \nabla F_i(V) + \frac{L}{2} \|W - V\|_2^2 \ \forall W, V$.
Assumption 3 Bounded local gradient variance. Let $\xi_i \sim P_i$ be a sampled data point on client $i$. The variance of gradients on all devices is bounded: $\mathbb{E}\|\nabla f_i(W^t; \xi_i) - \nabla F_i(W^t)\|^2 \le \sigma_i^2 \ \forall i \in \{1, ..., N\}$.
Assumption 4 Bounded local gradients. Let $\xi_i \sim P_i$ be a sampled data point on client $i$. The squared norm of gradients on all devices is bounded: $\mathbb{E}\|\nabla f_i(W^t; \xi_i)\|^2 \le G^2 \ \forall i \in \{1, ..., N\}$.
Apart from the strong convexity in Assumption 1, these assumptions are fairly common in the non-convex optimization literature (Reddi et al., 2016; Ward et al., 2020) and the federated learning literature (Li et al., 2019; Cho et al., 2020). There are other FL works relaxing the above assumptions. For example, FedAdaGrad (Reddi et al., 2020) relaxes the convexity assumption and shows that the expected gradient goes to zero, thus converging to a stationary point with unknown error bound. While the current assumptions are sufficient to demonstrate the hidden dependency of convergence on aggregation weights, extending our subsequent analysis to different FL assumptions is interesting future work. We will utilize the method of virtual sequences (Stich, 2018) for the proof of the main theorem. Let $I_E$ be the set of synchronization/communication steps, such that $I_E = \{nE \mid n = 0, 1, 2, \dots\}$, where $E$ denotes the number of local update iterations. The virtual sequence $\bar{W}^{t+1}$ is defined as:
$$\bar{W}^{t+1} = \sum_{i=1}^{N} \rho_i^{t+1} W_i^{t+1}, \quad \text{where } W_i^{t+1} = \begin{cases} V_i^{t+1}, & \text{if } t+1 \notin I_E \\ \sum_{j=1}^{N} \rho_j^{t+1} V_j^{t+1}, & \text{if } t+1 \in I_E, \end{cases} \tag{2}$$
where $\rho_i^{t+1} \ge 0$ with $\sum_{i=1}^{N} \rho_i^{t+1} = 1$ are the time-varying aggregation weights, and $\bar{W}^0 = W_i^0 = W^0$.
$V_i^t$ denotes the local model of client $i$ at optimization step $t$. In reality, we only have access to $\bar{W}^{t+1}$ when $t+1 \in I_E$, i.e., at the time of actual synchronization. When this happens, we write $W^T$ where $T \in I_E$. We provide a graphical illustration of the virtual sequence in Fig. 1, which also features our proposed method, Exp-α. Furthermore, let $W^*$ be the optimal solution to the federated global objective in Eq. 1, i.e., $W^* = \arg\min_W F(W) = \arg\min_W \sum_{i=1}^{N} \rho_i F_i(W)$, and let $W_i^*$ be the optimal solution for client $i$'s data distribution, i.e., $W_i^* = \arg\min_W F_i(W)$.
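To make the virtual sequence concrete, the following toy simulation is a sketch under strong simplifying assumptions that are not part of the analysis: scalar models, deterministic gradient steps, and quadratic local objectives $F_i(w) = \frac{1}{2}(w - c_i)^2$. The function names and the `weights_fn` callback are hypothetical.

```python
def local_step(v, center, lr):
    """One gradient step on the toy objective F_i(v) = 0.5 * (v - center)**2."""
    return v - lr * (v - center)


def run_virtual_sequence(centers, weights_fn, num_steps, E, lr=0.1, w0=0.0):
    """Track the virtual model at every step; clients synchronize every E steps."""
    n = len(centers)
    local = [w0] * n      # W_i^t, the models the clients start each step from
    virtual = [w0]        # virtual sequence, starting at W^0
    for t in range(num_steps):
        v = [local_step(local[i], centers[i], lr) for i in range(n)]  # V_i^{t+1}
        rho = weights_fn(t + 1, v)                                    # rho_i^{t+1}
        assert abs(sum(rho) - 1.0) < 1e-9 and min(rho) >= 0.0
        aggregate = sum(r * x for r, x in zip(rho, v))
        if (t + 1) % E == 0:
            local = [aggregate] * n   # synchronization step: all clients share one model
        else:
            local = v                 # clients keep their own local models
        virtual.append(sum(r * x for r, x in zip(rho, local)))
    return virtual, local


def uniform(step, local_models):
    """Time-invariant uniform weights, used here only as a baseline."""
    return [1.0 / len(local_models)] * len(local_models)


virtual, local = run_virtual_sequence([0.0, 2.0], uniform, num_steps=50, E=5)
print(virtual[-1], local)
```

Only the entries of `virtual` at multiples of `E` correspond to models the server actually holds; the remaining entries exist purely for analysis.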

3.2. TIME-VARYING AGGREGATION: CONSERVATIVE ERROR BOUND

Taking into account generic and time-varying aggregation weights $\rho_i^t$, we present the following theorem for the FedAvg algorithm (McMahan et al., 2017), a prototypical FL algorithm that, in the original work, uses vanilla SGD for local and global updates together with proportional aggregation.
Theorem 1 Let Assumptions 1-4 hold with $L, \mu, \sigma_i, G$ as defined therein. Choose $\gamma = \max \frac{L}{\mu}$, the learning rate $\eta_t = \frac{2}{\mu(\gamma + t)}$, and $T \in I_E$. Then FedAvg using SGD with full device participation and generic, time-varying aggregation weights satisfying $\sum_{i=1}^{N} \rho_i^t = 1$ satisfies
$$\mathbb{E}[F(W^T)] - F(W^*) \le \underbrace{\frac{L}{\gamma + T}\left(\frac{2B}{\mu^2} + \frac{\gamma}{2}\Delta_0\right)}_{\text{vanishing}} + \underbrace{\frac{L}{\mu}(\Gamma - \Omega)}_{\text{non-vanishing}}$$
where $\Delta_0 = \|W^0 - W^*\|_2^2$, $B = \max_t \left(\sum_{i=1}^{N} (\rho_i^{t+1})^2 \sigma_i^2\right) + 8(E - 1)G^2 + 6L\Omega$, $\Gamma = \max_t \sum_{i=1}^{N} \rho_i^{t+1} (F_i(W^*) - F_i(W_i^*))$, and $\Omega = \min_t \sum_{i=1}^{N} \rho_i^{t+1} (F_i(\bar{W}^t) - F_i(W_i^*)) \ \forall t \ge 0$.
We provide a complete proof of the main theorem in A.9. The convergence bound in Thm. 1 has two salient components: a vanishing term and a non-vanishing term. The vanishing term goes to zero with time and controls the convergence rate; the non-vanishing term does not decrease over time and results in a non-zero error after convergence. The convergence result is consistent with the convergence bound for proportional aggregation (Li et al., 2019). Specifically, we can show (Appendix A.13) that if $\rho_i^t = \rho_i = \frac{|\Xi_i|}{\sum_{j=1}^{N} |\Xi_j|}$, then $\Omega = \Gamma = \sum_{i=1}^{N} \rho_i (F_i(W^*) - F_i(W_i^*))$, and the non-vanishing error is zero. This demonstrates that proportional aggregation, as a special case of our more general analysis, favors convergence error at the cost of convergence rate.
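The proportional special case can be sanity-checked directly from the definitions of $\Gamma$ and $\Omega$. The lines below are only a sketch that uses $F(W) = \sum_i \rho_i F_i(W)$ with time-invariant weights; the complete argument is the one referenced in Appendix A.13.

```latex
\begin{aligned}
\Gamma &= \sum_{i=1}^{N} \rho_i \bigl(F_i(W^*) - F_i(W_i^*)\bigr)
        = F(W^*) - \sum_{i=1}^{N} \rho_i F_i(W_i^*), \\
\Omega &= \min_t \sum_{i=1}^{N} \rho_i \bigl(F_i(\bar{W}^t) - F_i(W_i^*)\bigr)
        = \min_t F(\bar{W}^t) - \sum_{i=1}^{N} \rho_i F_i(W_i^*)
        \;\ge\; \Gamma .
\end{aligned}
```

The inequality holds because $W^*$ minimizes $F$, with equality once the iterates attain $F(W^*)$. Hence $\frac{L}{\mu}(\Gamma - \Omega) \le 0$ for fixed proportional weights, which agrees with the zero non-vanishing error stated above.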

3.3. TIME-VARYING AGGREGATION: TRADE-OFF BETWEEN SPEED AND ERROR

From Thm. 1, we observe that FedAvg with a generic time-varying weighting converges at a rate of $O\left(\frac{\Omega}{T}\right) + O(\Gamma - \Omega)$, where $\Omega = \min_t \sum_{i=1}^{N} \rho_i^{t+1} (F_i(\bar{W}^t) - F_i(W_i^*))$. The quantity $\Omega$ controls both the convergence rate and the non-vanishing error. The intuitive way to improve the convergence rate is to design a strategy that gives a small $\Omega$. Let $\Omega_{pr}$ denote the quantity $\Omega$ defined by proportional aggregation, i.e., $\Omega_{pr} \triangleq \min_t \sum_{i=1}^{N} \rho_i (F_i(\bar{W}^t) - F_i(W_i^*))$, where $\rho_i = \frac{|\Xi_i|}{\sum_{j=1}^{N} |\Xi_j|}$. Specifically, if the goal is to improve over proportional aggregation, then the proposed aggregation strategy should lead to an $\Omega$ smaller than $\Omega_{pr}$. Upon close examination of $\Omega$ in Thm. 1, one can expect that the key lies in choosing the aggregation weight $\rho_i^{t+1}$ according to the relative magnitude of the quantity $\Omega_i^t \triangleq F_i(\bar{W}^t) - F_i(W_i^*)$ for all $i \in \{1, ..., N\}$. To this end, we provide the following corollary to formally justify the intuition. We show that with a specific choice of $\rho_i^{t+1}$, one can achieve $\Omega \le \Omega_{pr}$.
Corollary 1.1 Assume the quantities $\frac{N|\Xi_i|}{\sum_{j=1}^{N} |\Xi_j|} (F_i(\bar{W}^t) - F_i(W_i^*))$ are arranged in decreasing order, i.e., $\frac{N|\Xi_i|}{\sum_{j=1}^{N} |\Xi_j|} (F_i(\bar{W}^t) - F_i(W_i^*)) \ge \frac{N|\Xi_{i+1}|}{\sum_{j=1}^{N} |\Xi_j|} (F_{i+1}(\bar{W}^t) - F_{i+1}(W_{i+1}^*))$. If we choose
$$\rho_i^{t+1} \propto \frac{N|\Xi_i|}{\sum_{j=1}^{N} |\Xi_j|} \, U\!\left(\frac{N|\Xi_i|}{\sum_{j=1}^{N} |\Xi_j|} \bigl(F_i(W_i^*) - F_i(\bar{W}^t)\bigr)\right) \quad \forall t,$$
where $U(\cdot) \ge 0$ is a non-decreasing function, then $\Omega \le \Omega_{pr}$.
A detailed proof is provided in Appendix A.11. In Corollary 1.1, $\bar{W}^t$ is the virtual global model (Eq. 2) at the previous step and $W_i^*$ is the optimal local model for client $i$. Therefore, the quantity $\Omega_i^t \triangleq F_i(\bar{W}^t) - F_i(W_i^*)$ captures the performance difference between the most recent virtual global model and the optimal local model on client $i$.
Intuitively, one would expect that a client with more severe distribution shift will have a larger $\Omega_i^t$, since the current virtual global model should not work well on this severely shifted distribution, resulting in a larger discrepancy between $F_i(\bar{W}^t)$ and $F_i(W_i^*)$. In other words, the aggregation strategy in Corollary 1.1 puts smaller weights on more severely shifted clients based on the current performance difference. Consequently, it is reasonable to expect a trade-off between convergence speed and convergence error depending on how aggressively the algorithm down-weights shifted clients. In our experiments, to avoid the confounding issue of update bias due to an unequal number of local updates (Wang et al., 2020), we deliberately keep the size of local datasets equal, i.e., $|\Xi_i| = |\Xi_j|$. Therefore $\frac{N|\Xi_i|}{\sum_{j=1}^{N} |\Xi_j|} = 1$.
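The down-weighting mechanism can be illustrated numerically. The sketch below assumes equal dataset sizes, treats the per-client gaps $\Omega_i^t$ as given (randomly generated, hypothetical values), and only checks the per-step inequality for fixed gaps; it does not reproduce the proof in Appendix A.11, where the trajectories themselves depend on the weights.

```python
import math
import random


def weighted_gap(gaps, U):
    """Sum_i rho_i * gap_i with rho_i proportional to U(-gap_i)."""
    raw = [U(-g) for g in gaps]
    total = sum(raw)
    return sum(r / total * g for r, g in zip(raw, gaps))


def uniform_gap(gaps):
    """Same quantity under equal weights (proportional aggregation, equal sizes)."""
    return sum(gaps) / len(gaps)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


random.seed(0)
gaps = [random.uniform(0.0, 5.0) for _ in range(10)]  # hypothetical Omega_i^t values

# Three non-negative, non-decreasing choices of U.
choices = [("exp", math.exp), ("sigmoid", sigmoid), ("constant", lambda x: 1.0)]
for name, U in choices:
    print(name, weighted_gap(gaps, U), "<=", uniform_gap(gaps))
```

A constant `U` recovers equal weights, while steeper choices such as the exponential shift more mass toward clients with small gaps and lower the weighted sum.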

3.4. TIME-VARYING AGGREGATION: THE EXPONENTIAL FUNCTIONS

One family of functions that is non-negative and non-decreasing, as required of $U(\cdot)$, is the family of exponential functions. We will use this as the parameterization for our empirical investigation. However, at a synchronization step $T \in I_E$, according to Thm. 1 and Corollary 1.1, we need to evaluate $F_i(\bar{W}^{T-1})$ and $F_i(W_i^*)$ to calculate the aggregation weights. This is not realistic for two reasons. First, we only have access to the virtual global model at $t = T - E$, i.e., the model from the previous synchronization step, since the current one has yet to be calculated. Therefore, the closest available global model is $\bar{W}^{T-E}$. Second, in the most common setting, FL algorithms only optimize a local model for a fixed number of $E$ epochs and do not train it to convergence. Thus, the closest approximation to $F_i(W_i^*)$ is the risk of the current local model after $E$ local updates from the previous synchronization, $F_i(W_i^T)$. A graphical illustration is provided in Fig. 1. For subsequent investigation, we use the following approximation,
$$\rho_i^T \propto \exp\left(\frac{F_i(W_i^T) - F_i(\bar{W}^{T-E})}{\alpha}\right), \tag{3}$$
where $\alpha$ is a temperature hyperparameter that controls the strength of the proportionality. As $\alpha \to \infty$, the strength of proportionality decreases: in the limit, the unnormalized weights satisfy $\rho_i^t \to 1$, i.e., all clients are weighted equally. Intuitively, a small $\alpha$ increases the concentration of $\rho_i^t$, and a large $\alpha$ decreases the concentration and spreads the weights more evenly. In subsequent sections, we refer to this family of aggregation strategies as the Exp-α method. We provide a detailed algorithm description of Exp-α and a discussion of its computation in Appendix A.2.
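The weight computation described above amounts to a temperature-scaled softmax over per-client risk differences. The following is a minimal sketch, assuming the two risks per client are already available as plain numbers; the function name `exp_alpha_weights` and the example risk values are hypothetical, and the max-subtraction is a standard numerical-stability step that is not stated in the text.

```python
import math


def exp_alpha_weights(local_risk_after, global_risk_before, alpha):
    """Normalized weights proportional to exp((F_i(W_i^T) - F_i(W^{T-E})) / alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    scores = [(after - before) / alpha
              for after, before in zip(local_risk_after, global_risk_before)]
    shift = max(scores)  # subtracting the max leaves the normalized weights unchanged
    unnormalized = [math.exp(s - shift) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical risks for three clients; the third client's risk drops the most,
# so it is treated as the most shifted and receives the smallest weight.
before = [2.0, 2.0, 3.0]   # F_i evaluated at the previous global model
after = [1.9, 1.5, 1.0]    # F_i evaluated at the updated local model

sharp = exp_alpha_weights(after, before, alpha=0.2)
flat = exp_alpha_weights(after, before, alpha=1e6)
print(sharp, flat)
```

With a small temperature the weights concentrate on the first client, whereas a very large temperature yields nearly uniform weights, matching the limiting behavior described above.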

4. EXPERIMENTS

Overview. In this section, we present experiments to test the capability of the Exp-α strategy beyond theory. To this end, we surveyed existing literature and identified three dominant data heterogeneity types: imbalance (Zhao et al., 2018), label-flipping (Xie et al., 2019) and domain shift (Li et al., 2021). Each type of heterogeneity brings a specific challenge to an aggregation strategy. Specifically, imbalanced clients require the aggregation to be adaptive to the severity of imbalance; label-flipping requires the aggregation to block contributions from label-flipped clients; domain shift requires the aggregation to not disregard any client, since all domains should contribute.
Datasets. To benchmark on different types of data heterogeneity, we use popular datasets in FL research (Zhao et al., 2018; Li et al., 2021; Yuan et al., 2021). For imbalance, we adopt the popular Imbalanced CIFAR10 (Cao et al., 2019) setting from the imbalanced classification task. For label-flipping experiments, we use CIFAR10 with randomly flipped labels (Xie et al., 2019; Jebreel et al., 2022). For domain shift experiments, following Li et al. (2021), we use Digits (Li et al., 2021), Office-Caltech10 (Gong et al., 2012), and DomainNet (Peng et al., 2019), each of which consists of a range of different domains with shared labels. Specifically, Digits has five domains, Office-Caltech10 has four domains, and DomainNet has six domains. Please see Appendix A.4 for details.
Metrics. We compare different methods using the accuracy achieved at both the halfway point and the end of global training (McMahan et al., 2017). Higher accuracy means lower converged error using the same number of optimization steps. We also report the number of global iterations required to achieve X performance (given as "Accx") (McMahan et al., 2017). A lower "Accx" means that the algorithm reaches the same performance using fewer rounds of global communication and thus has a higher convergence rate.
We separate datasets into train, validation, and test splits, and report both validation and test accuracy where applicable in our experiments.
Backbone Algorithms. Exp-α is an aggregation strategy and can plug into most existing FL methods. We select five representative baselines from different categories as the backbone algorithms: FedAvg (McMahan et al., 2017), FedAvgM (Hsu et al., 2019), FedAdam (Reddi et al., 2020), FedProx (Zhao et al., 2018), and FedFor (Tian et al., 2022). Specifically, FedAvgM and FedAdam use different server-side momentum, while FedProx and FedFor have different client-side regularization. For domain shift experiments, we include the SOTA personalized FL algorithm FedBN (Li et al., 2021). For all experiments, we assume a large number of available clients and sample a fraction of them to participate in each round of communication. This corresponds to the most practical FL setting: cross-device FL with partial participation (Kairouz et al., 2021). For different experiments, we use different neural network architectures. Please refer to Appendix A.3 for more details.

4.1. IMBALANCE EXPERIMENTS

For imbalance experiments, we use the Imbalanced CIFAR10 (Cao et al., 2019) dataset, created with an artificial exponential imbalance among classes. To create this exponential imbalance, we specify a variable imbalance ratio. For example, an imbalance ratio of 0.01 means that the ratio between the number of samples in the smallest class and the largest class is 0.01.
Compatibility with other FL algorithms. To mimic the real-world situation of varying imbalance, we sample a batch of 10 clients for each round of communication, and each client is created using an imbalance ratio sampled randomly from the set {1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001}, covering a gradual increase in severity from balanced through moderately imbalanced to extremely imbalanced datasets. At the beginning of each round, we re-sample the clients and their imbalance ratios. We benchmark the performance of Exp-α and proportional aggregation in six FL algorithms under four communication-computation configurations with a trade-off between the number of global iterations (denoted as t) and local update iterations (denoted as E) in Tab. 1. More local update iterations (larger E) lead to more severe weight divergence (Li et al., 2019) but offer potential global communication savings (smaller t). Here, we report the number of steps to reach 40% accuracy (denoted as acc40), the half-time and final validation accuracy, and the final test accuracy. Moreover, we fix the temperature hyperparameter α = 0.2, chosen by a grid search on a validation set using FedAvg and FedAvgM with {t = 100, E = 80} and an imbalance ratio of 0.001. Please see Appendix A.5 for details on the effects of α. We observe that Exp-α brings improvement to all FL algorithms considered. Specifically, most algorithms using Exp-α reach 40% accuracy in fewer global steps than when using proportional aggregation, e.g., a 12.3% reduction in the number of steps using FedAvg with E = 20.
This shows that Exp-α improves convergence rate over proportional aggregation. Furthermore, all algorithms using Exp-α reach higher converged accuracy, and thus lower error, than when using proportional aggregation. This shows that in real-world settings, where functions are non-convex, faster convergence can potentially lead to lower error, in contrast to the speed and error trade-off in the theoretical convergence analysis for convex settings.
Adaptability to Severity of Imbalance. As a dynamic algorithm, Exp-α should weight each client differently based on the severity of imbalance. To provide more insight into its adaptability to the degree of imbalance, we now use more controlled sampling strategies and imbalance ratios. Instead of sampling clients with random imbalance ratios as in the previous experiment, we designate a few combinations of imbalance configurations. For example, we use the notation {1.0×4, 0.01×3, 0.1×3} to denote a composition of four balanced clients, three imbalanced clients with an imbalance ratio of 0.01, and three imbalanced clients with an imbalance ratio of 0.1. Specifically, we visualize the aggregation weights of Exp-α and compare its performance against that of proportional aggregation in four imbalance configurations. Again, we sample ten clients at the beginning of each round, but with different imbalance configurations. The hyperparameter α is set to 0.2 and the FL setting is {t = 400, E = 20}. FedAvg is used as the backbone FL algorithm and results are averaged over three runs. As we can see from Fig. 2a, Exp-α adapts to different imbalance configurations. For example, as a client becomes less imbalanced, it receives a higher weight. The balanced clients always have the highest aggregation weights. Also, as expected, Exp-α shows more performance improvement when imbalance is more severe.
In the case where all clients are balanced, Exp-α assigns roughly equal weights to all clients, with marginal variation due to stochasticity in sampling. Experiments with IID clients are provided in Appendix A.5. As expected, when clients have balanced datasets, Exp-α performs just as well as proportional aggregation.
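The imbalance ratio used in this section can be turned into per-class sample counts as in the sketch below. The exponential decay profile, the per-class budget of 500 samples, and the helper names are assumptions made for illustration; the text only specifies that the smallest-to-largest class ratio equals the imbalance ratio and that ratios are re-sampled for the ten clients of each round.

```python
import random

# Imbalance ratios used in the compatibility experiment.
RATIOS = [1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001]


def class_counts(max_per_class, num_classes, imbalance_ratio):
    """Per-class sample counts decaying exponentially from largest to smallest class."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    counts = []
    for c in range(num_classes):
        frac = c / (num_classes - 1)  # 0 for the largest class, 1 for the smallest
        counts.append(round(max_per_class * imbalance_ratio ** frac))
    return counts


def sample_round(num_clients, rng, max_per_class=500, num_classes=10):
    """Draw one imbalance ratio per client and return (ratio, counts) pairs."""
    clients = []
    for _ in range(num_clients):
        ratio = rng.choice(RATIOS)
        clients.append((ratio, class_counts(max_per_class, num_classes, ratio)))
    return clients


counts = class_counts(max_per_class=500, num_classes=10, imbalance_ratio=0.01)
round_clients = sample_round(10, random.Random(0))
print(counts)
```

An imbalance ratio of 1.0 yields a balanced client, which corresponds to the balanced configuration discussed above.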

4.2. LABEL FLIPPING EXPERIMENTS

For label flipping experiments, we use CIFAR10 as the dataset. To control the extent of label flipping, we define a corruption rate and a flip ratio. For example, a corruption rate of 1/3 means that a client has flipped labels with probability 1/3; a flip ratio of 1.0 means that all classes are incorrectly labeled, and 0.5 means that half of the classes are incorrectly labeled. The number of classes whose labels are flipped is determined by the flip ratio multiplied by the number of classes.
Compatibility with other FL algorithms. Similar to the imbalance experiments, we benchmark Exp-α and proportional aggregation using six different FL backbone algorithms. In this experiment, we keep α = 0.2, chosen by a grid search on a validation set using FedAvg and FedAvgM with a corruption rate of 1/3, a flip ratio of 1.0 and {t = 100, E = 80}. Please see Appendix A.6 for the effects of α. At the beginning of each round, we sample six clients with different data compositions, and each client has a probability of 1/3 of being corrupted with a flip ratio of 1.0, meaning that all its labels are incorrect. Therefore, the algorithm is challenged with a different flipping pattern each time. We report the number of global steps to reach 40% accuracy, the half-time and final validation accuracy, and the final test accuracy across four federated learning configurations in Tab. 2. In all experiments, Exp-α brings significant improvements over proportional aggregation. This demonstrates that 1) label flipping is detrimental to federated learning and 2) Exp-α can greatly alleviate its negative effect.
Adaptability to Severity of Label Flipping. In this experiment, we study how Exp-α responds to partially flipped clients. Specifically, we keep α = 0.2 and vary the flip ratio ∈ {1.0, 0.1}, meaning a corrupted client can have either all labels wrong or only the labels of one class incorrect. Furthermore, instead of sampling, we use a deterministic corruption rate ∈ {2/3, 1/3}.
This means that at the beginning of each round of communication, we sample six clients, and four or two of the six clients are corrupted, corresponding to a corruption rate of 2/3 or 1/3 respectively. The hyperparameter α is set to 0.2 and the FL setting is {t = 400, E = 20}. In Fig. 2b, we present four configurations covering the aforementioned variables of interest. We notice that 1) Exp-α outperforms proportional aggregation in all configurations; 2) Exp-α differentiates between clients with different levels of label flipping, e.g., it assigns higher weights to flipped clients where only a single class is incorrect than to those where all classes are incorrect. Furthermore, Exp-α works whether the flipped clients are in the majority or in the minority.

4.3. DOMAIN SHIFT EXPERIMENTS

Unlike in the previous two challenges, for domain shift, a federated learning algorithm needs to consider all clients despite their data heterogeneity. We benchmark Exp-α and proportional aggregation on three domain shift benchmarks: Digits, Office-Caltech10, and DomainNet. Each benchmark consists of several domains with a shared label space. Specifically, Digits consists of five digit-like datasets; Office-Caltech10 has four domains and DomainNet has six domains. Please see Appendix A.4 for a detailed description. We distribute the data from each domain to a separate client, such that each client has a distinct data distribution with domain shift. FedAvg (McMahan et al., 2017) and FedBN (Li et al., 2021) are used as the backbone FL algorithms. We report the test accuracy of each domain and the cross-domain average for each benchmark under the FL setting {t = 400, E = 16} in Tab. 3. A different α is used for each dataset, chosen by grid search on the validation set. Please see Appendix A.7 for the discussion on the effects of α. We observe that Exp-α leads to performance similar to proportional aggregation in most cases. This shows that Exp-α incorporates all local data despite the existence of domain shifts among clients. Furthermore, Exp-α even brings noticeable improvement in some cases. For example, Exp-α improves over FedAvg with proportional aggregation on Office-Caltech10 by a relative 4% on average across four domains. Upon close examination, the improvement mainly comes from the Webcam (W) domain (a relative 10% improvement). The Webcam domain is already the best-performing domain when using proportional aggregation, indicating that it benefits the most from federated learning across the four domains. Exp-α emphasizes it further by assigning the Webcam domain the largest aggregation weight. For this particular experiment, the average aggregation weights over the entire training trajectory for Amazon (A), Caltech (C), DSLR (D) and Webcam (W) are {0.19, 0.26, 0.22, 0.34}. While Amazon (A) received the smallest average aggregation weight, this did not deteriorate the performance of Exp-α on this domain but rather improved it by a relative 3%.

5. CONCLUSION

While proportional aggregation enjoys several theoretical advantages, e.g., equal importance and asymptotic convergence, a fixed client weighting is less sensible under non-IID settings. In this paper, we start out by removing the assumption of proportional aggregation and derive a convergence bound using a generic and time-varying aggregation strategy. This analysis reveals a surprising trade-off between convergence speed and error under convexity assumptions. The analysis motivates a family of aggregation strategies, which prioritize convergence speed and weight samples dynamically. Consequently, we propose a new aggregation strategy, Exp-α, from this family. Our extensive experiments on three types of data heterogeneity demonstrate its superior performance, its robustness, and its compatibility with existing algorithms, despite the existence of a non-zero error in theory. More importantly, the theoretical analysis opens a new direction for studying aggregation strategies that focus on convergence speed and robustness in future work.

6. REPRODUCIBILITY STATEMENT

We include a code repository to reproduce the results on imbalanced CIFAR10 reported in Tab. 1. Specifically, the codebase includes an implementation of FedAvg (McMahan et al., 2017) with the original proportional aggregation and the proposed Exp-α aggregation strategy. Readers can reproduce the results reported in the first row of Tab. 1. The code repository has a readme file with the necessary instructions to install the environment and run the experiments. The code is written to run on a single GPU.

A APPENDIX

A.1 EXTENDED RELATED WORKS

Client Update. FedProx (Zhao et al., 2018) adds a first order proximal term in the loss function; FedCurv (Shoham et al., 2019) and FedFor (Tian et al., 2022)

Algorithm 1: Exp-α
Output: $W^T$
Initialize $W^0$
for $T \in \{nE \mid n = 0, 1, 2, \dots\}$ do
    for $i \in \{1, \dots, K\}$ in parallel do
        $F_i(W^T) \leftarrow \text{CalculateRisk}(\Xi_i, W^T)$
        $W_i^{T+E} \leftarrow \text{ClientUpdate}(W^T)$    ▷ e.g., FedProx Zhao et al. (2018)
        $F_i(W_i^{T+E}) \leftarrow \text{CalculateRisk}(\Xi_i, W_i^{T+E})$
        $\rho_i^{T+E} = \exp\left(\frac{F_i(W_i^{T+E}) - F_i(W^T)}{\alpha}\right)$    ▷ Eq. 3
    end
    $\nabla F(W^T) = \frac{1}{\sum_{j=1}^{K} \rho_j^{T+E}} \sum_{i=1}^{K} \rho_i^{T+E} \left(W^T - W_i^{T+E}\right)$
    $W^{T+E} \leftarrow \text{ServerUpdate}(W^T, \nabla F(W^T))$    ▷ e.g., FedAvgM Hsu et al. (2019)
end

In this section, we describe the Exp-α algorithm (Alg. 1). Exp-α is an aggregation algorithm, so it is compatible with most existing FL algorithms. Following the convention in (Reddi et al., 2020), and to describe the algorithm as generally as possible, we abstract the client optimization and global optimization procedures as ClientUpdate and ServerUpdate. The majority of FL algorithms differ in how they change these two components. Please refer to the related works section (Sec. 2) for a brief discussion of this. In our experiments (Sec. 4), we aim to demonstrate the generality and compatibility of Exp-α in combination with innovations to these components. In Alg. 1, we implement a CalculateRisk function to calculate the risk values. This is a simple inference forward pass through the local dataset given a specific model. The computation on the client side is fairly cheap, as it only requires two additional inference passes over the local data. During local training, an FL algorithm needs to run forward-backward passes, i.e., compute, store and apply gradients for multiple epochs, each of which has many local iterations, whereas the calculation of risk only requires a simple forward pass without computing, storing or applying gradients. Therefore, the computation cost of CalculateRisk is only a small fraction of the original computation cost.
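The aggregation step of Alg. 1 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the released implementation: the function names `exp_alpha_weights` and `aggregate_pseudo_gradient`, the list-of-floats model representation, and the toy risk values are all assumptions made for the example, and the server step shown is plain averaging with a server learning rate of 1.

```python
import math


def exp_alpha_weights(risk_before, risk_after, alpha):
    """Normalized Exp-alpha weights from per-client risks (Alg. 1, Eq. 3)."""
    # Unnormalized weight of client i: exp((F_i(W_i^{T+E}) - F_i(W^T)) / alpha).
    logits = [(after - before) / alpha
              for before, after in zip(risk_before, risk_after)]
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    shift = max(logits)
    unnormalized = [math.exp(z - shift) for z in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def aggregate_pseudo_gradient(global_w, client_ws, weights):
    """Weighted average of (W^T - W_i^{T+E}), the input to ServerUpdate."""
    return [
        sum(rho * (g - w[d]) for rho, w in zip(weights, client_ws))
        for d, g in enumerate(global_w)
    ]


# Toy example (hypothetical numbers): three clients, two parameters.
risk_before = [2.0, 2.0, 2.0]   # F_i(W^T), risk of the global model
risk_after = [1.5, 0.5, 1.9]    # F_i(W_i^{T+E}), risk after local training
rho = exp_alpha_weights(risk_before, risk_after, alpha=1.0)

global_w = [0.0, 0.0]
client_ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pseudo_grad = aggregate_pseudo_gradient(global_w, client_ws, rho)

# Assumed server step: W^{T+E} = W^T - pseudo_grad (server learning rate 1).
new_w = [g - p for g, p in zip(global_w, pseudo_grad)]
print(rho, new_w)
```

With a server learning rate of 1, the new global model is simply the ρ-weighted average of the client models; other ServerUpdate choices (e.g., server momentum) would consume the same pseudo-gradient.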

A.3 IMPLEMENTATION DETAILS

For the CIFAR experiments in Sec. 4.1 and 4.2, we use ResNet20 (He et al., 2016). Specifically, we use the proper ResNet implementation for CIFAR10 (He et al., 2016). For the Digits experiments in Sec. 4.3, we use a custom CNN provided by Li et al. (2021). For the Office and DomainNet experiments in Sec. 4.3, we use ResNet18 (He et al., 2016). All models are trained with SGD, with no momentum and no weight decay. We use a constant learning rate, i.e., no learning rate decay: 0.01 for CIFAR, Digits and Office, and 0.05 for DomainNet. We summarize the statistics in Tab. 4. To avoid the issue of implicit bias due to differences in the number of local updates (Wang et al., 2020), we keep the number of optimization steps per epoch constant on all clients in one experiment. Therefore, we also report the number of steps in each local epoch in Tab. 4 for each dataset.

A.4 DATASET STATISTICS

The Digits benchmark consists of SVHN (Netzer et al., 2011), USPS (Hull, 1994), SynthDigits (Ganin & Lempitsky, 2015), MNIST-M (Ganin & Lempitsky, 2015) and MNIST (LeCun et al., 1998); the DomainNet benchmark (Peng et al., 2019) has six domains. The Office-Caltech10 dataset (Gong et al., 2012) selects three domains from Office-31 (Saenko et al., 2010), namely Amazon, DSLR and Webcam, and one domain from Caltech256 (Griffin et al., 2007). We split the datasets into training, validation and test sets. In our experiments we report validation accuracy and test accuracy if applicable. We summarize the number of samples in each split for Office, DomainNet and Digits in the domain shift experiments in Tab. 5, Tab. 6 and Tab. 7 respectively. For the CIFAR10 experiments, we have the following splits: {train: 35,000, validation: 15,000, test: 10,000}.
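As a rough illustration of the CIFAR10 split sizes above, the sketch below partitions sample indices into a training and a validation subset. It assumes that the 35,000 training and 15,000 validation samples are both drawn from the standard 50,000-image CIFAR10 training set, which the text does not state explicitly; the function name `split_train_validation` and the seed are hypothetical.

```python
import random


def split_train_validation(num_samples=50_000, num_train=35_000, seed=0):
    """Shuffle sample indices and split them into train/validation parts."""
    indices = list(range(num_samples))
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_train_validation()
print(len(train_idx), len(val_idx))
```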

A.5 ADDITIONAL RESULTS FOR IMBALANCE EXPERIMENTS IN SEC 4.1

Effects of Alpha. In this experiment, we fix the imbalance ratio to 0.001, the number of local epochs to 2 and the number of global iterations to 200, and vary the hyperparameter α in Exp-α. Specifically, at each round we sample a different set of ten clients, among which four are balanced and six are imbalanced. The total number of available clients is therefore the number of rounds of communication multiplied by ten. We compare convergence performance under different α ∈ {0.2, 1.0, 5.0, 25.0, 125.0} using two backbone FL algorithms, FedAvg (McMahan et al., 2017) and FedAvgM (Hsu et al., 2019), in Fig. 3. Compared to proportional aggregation, Exp-α with a proper selection of α can consistently improve both convergence speed and converged performance. Specifically, in this experiment we noticed that a smaller α leads to better performance because a smaller α makes the weights more concentrated on the balanced clients. We use α = 0.2 in the main paper in Sec. 4.1.
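To make the role of α concrete, the following sketch evaluates the normalized Exp-α weights of Eq. 3 for the same sweep of α values on a fixed set of risk changes. The four risk changes are made-up numbers chosen only for illustration, and the helper name `exp_alpha_weights` is an assumption; the sketch shows only how α controls concentration, not which clients end up favored in the actual experiments.

```python
import math


def exp_alpha_weights(risk_change, alpha):
    """Normalized weights proportional to exp(risk_change_i / alpha)."""
    logits = [c / alpha for c in risk_change]
    shift = max(logits)  # stabilizes the exponentials
    unnormalized = [math.exp(z - shift) for z in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical values of F_i(W_i^{T+E}) - F_i(W^T) for four clients.
risk_change = [-0.2, -0.3, -1.5, -2.0]
alphas = [0.2, 1.0, 5.0, 25.0, 125.0]

max_weight = {}
for alpha in alphas:
    weights = exp_alpha_weights(risk_change, alpha)
    max_weight[alpha] = max(weights)
    print(alpha, [round(w, 3) for w in weights])
```

A small α acts like a low softmax temperature and concentrates the weight on a few clients, while a large α pushes the weights toward uniform.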

Effects of Local Steps and Degree of Heterogeneity. With an increasing number of local steps and increasing heterogeneity among clients, a smaller α can do better. To demonstrate this we present the following experiments. Specifically, at each round of global communication, we

To show the effects of increasing heterogeneity, we fix the number of local steps to E = 160 and vary the imbalance ratio in {0.1, 0.2, 0.3, 0.4, 0.5}, with a smaller number indicating more severe imbalance. Similarly, we sweep α ∈ {0.2, 1, 5, 25, 125}. In Tab. 9 we present the test accuracy for each imbalance ratio with different α. We observe that more severe heterogeneity can benefit from a smaller α.

Exp-α in IID Setting. In the main paper, we explored Exp-α in non-IID settings, where clients are subject to different degrees of imbalance. In this section, we present results comparing FedAvg using Exp-α and proportional aggregation under IID settings, where all clients have balanced data, in Tab. 11. As expected, Exp-α and proportional aggregation perform similarly.

Effects of Alpha. In this experiment, we fix the corruption rate to 1/3 and the flip ratio to 1.0, i.e., all labels on a flipped client are incorrect, and vary the hyperparameter α. Specifically, we randomly sample data from the training set to create six balanced clients. However, each client is subject to label flipping with a chance of 1/3. We compare convergence performance under different α ∈ {0.2, 1.0, 5.0, 25.0, 125.0} using two backbone FL algorithms, FedAvg (McMahan et al., 2017) and FedAvgM (Hsu et al., 2019), in Fig. 4. We notice that Exp-α with small α values provides a significant convergence improvement compared to proportional aggregation. This is because a smaller α forces the aggregation algorithm to focus more on the clean clients. We use α = 0.2 in the main paper in Sec. 4.2.

Effects of Alpha. In this section, we sweep across a range of α ∈ {0.2, 1, 5, 25, 125} on the three domain shift benchmarks: Digits, Office-Caltech10 and DomainNet. We report validation accuracy in Tab. 13. While there is no obvious trend as to which α works best, Exp-α with a moderate α ≥ 1 outperforms proportional aggregation in most cases.

A.8 LEMMAS

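To facilitate the derivation of the main convergence bound, we will introduce some lemmas. Specifically, we will utilize the method of virtual sequences (Stich, 2018). Let $I_E$ be the set of synchronization/communication steps, such that $I_E = \{nE \mid n = 0, 1, 2, \dots\}$, where $E$ denotes the number of local update iterations. We first introduce an intermediate notation denoting a single SGD update step on a client:

$$V_i^{t+1} = W_i^t - \eta_t \nabla f_i(W_i^t; \xi_i^t),$$

where $\xi_i^t \sim P_i$ is a data point sampled on client $i$ at time $t$. Depending on whether the current iteration is a synchronization step, the local update on each client can be written as follows, for $t \geq 0$:

$$W_i^{t+1} = \begin{cases} V_i^{t+1} & \text{if } t+1 \notin I_E, \\ \sum_{j=1}^{N} \rho_j^{t+1} V_j^{t+1} & \text{if } t+1 \in I_E, \end{cases}$$

where $\rho_i^{t+1} \geq 0$ with $\sum_{i=1}^{N} \rho_i^{t+1} = 1$ is the aggregation weight for client $i$, and $W_i^0 = W^0$.

Lemma 1.

$$\|\bar{W}^{t+1} - W^*\|^2 \leq (1 - \eta_t \mu)\|\bar{W}^t - W^*\|^2 + 2\sum_{i=1}^{N} \rho_i^{t+1} \|\bar{W}^t - W_i^t\|^2 + 6L\eta_t^2 \Omega + 2\eta_t(\Gamma - \Omega) + \eta_t^2 \|\bar{g}_t - g_t\|^2,$$

where $\Gamma = \max_t \sum_{i=1}^{N} \rho_i^{t+1} (F_i(W^*) - F_i(W_i^*))$ and $\Omega = \min_t \sum_{i=1}^{N} \rho_i^{t+1} (F_i(\bar{W}^t) - F_i(W_i^*))$.

Lemma 2 Bounded variance. With Assumption 3, it follows that

$$\mathbb{E}\|g_t - \bar{g}_t\|^2 \leq \sum_{i=1}^{N} (\rho_i^{t+1})^2 \sigma_i^2 \leq \max_t \sum_{i=1}^{N} (\rho_i^{t+1})^2 \sigma_i^2.$$

Lemma 3 Bounded divergence. With Assumption 4, if $\eta_t$ is non-increasing and $\eta_t \leq 2\eta_{t+E}$, $\forall t \geq 0$, it follows that

$$\mathbb{E}\sum_{i=1}^{N} \rho_i^{t+1} \|\bar{W}^t - W_i^t\|^2 \leq 4\eta_t^2 (E-1) G^2.$$

A.9 PROOF OF THEOREM 1

Here we present the proof of the main theorem using the lemmas from the previous section. It closely follows the method in Li et al. (2019). While we do not claim novelty in the methodology of this derivation, we show that there exists an error term due to time-varying weighting that has previously been ignored. Let $\Delta_t = \mathbb{E}\|\bar{W}^t - W^*\|^2$. From Lemma 1, Lemma 2 and Lemma 3, we have that

$$\Delta_{t+1} \leq (1 - \eta_t \mu)\Delta_t + \eta_t^2 B,$$

where $B = \max_t \sum_{i=1}^{N} (\rho_i^{t+1})^2 \sigma_i^2 + 8(E-1)G^2 + 6L\Omega + \frac{2}{\eta_t}(\Gamma - \Omega)$.

Following the setting in Li et al. (2019), we set $\eta_t = \frac{\beta}{t+\gamma}$ for some $\beta > \frac{1}{\mu}$ and $\gamma > 0$ such that $\eta_1 \leq \frac{1}{4L}$ and $\eta_t \leq 2\eta_{t+E}$. Let $v = \max\left\{\frac{\beta^2 B}{\beta\mu - 1}, \gamma\Delta_0\right\}$. We assume that $\Delta_t \leq \frac{v}{\gamma+t}$ and prove by induction that this holds for all $t$. By induction,

$$\begin{aligned} \Delta_{t+1} &\leq (1 - \eta_t\mu)\Delta_t + \eta_t^2 B = \left(1 - \frac{\beta\mu}{t+\gamma}\right)\Delta_t + \frac{\beta^2}{(t+\gamma)^2} B \qquad (8) \\ &\leq \left(1 - \frac{\beta\mu}{t+\gamma}\right)\frac{v}{\gamma+t} + \frac{\beta^2}{(t+\gamma)^2} B \\ &= \frac{t+\gamma-1-\beta\mu+1}{(t+\gamma)^2} v + \frac{\beta^2}{(t+\gamma)^2} B \\ &= \frac{t+\gamma-1}{(t+\gamma)^2} v + \left[\frac{\beta^2}{(t+\gamma)^2} B - \frac{\beta\mu-1}{(t+\gamma)^2} v\right] \\ &\leq \frac{v}{t+\gamma+1}. \end{aligned}$$

The last inequality holds because $(t+\gamma-1)(t+\gamma+1) \leq (t+\gamma)^2$ and, by the definition of $v$, $\beta^2 B \leq (\beta\mu - 1)v$. By the L-smoothness assumption (Assump. 1),

$$\mathbb{E}[F(\bar{W}^t)] - F(W^*) \leq \frac{L}{2}\mathbb{E}\|\bar{W}^t - W^*\|^2 \leq \frac{L}{2}\frac{v}{\gamma+t}. \qquad (9)$$

Following Li et al. (2019), we choose $\beta = \frac{2}{\mu}$, $\gamma = \max\{8\frac{L}{\mu}, E\} - 1$, and let $\eta_t = \frac{2}{\mu}\frac{1}{\gamma+t}$; we can show that the learning rate satisfies $\eta_t \leq 2\eta_{t+E}$, $\forall t \geq 0$. Then,

$$v = \max\left\{\frac{\beta^2 B}{\beta\mu - 1}, \gamma\Delta_0\right\} \leq \frac{\beta^2 B}{\beta\mu - 1} + \gamma\Delta_0 \leq \frac{4B}{\mu^2} + \gamma\Delta_0.$$

Finally, plugging this into Eq. 9, we obtain the convergence bound

$$\begin{aligned} \mathbb{E}[F(\bar{W}^t)] - F(W^*) &\leq \frac{L}{2}\frac{1}{\gamma+t}\left(\frac{4B}{\mu^2} + \gamma\Delta_0\right) \qquad (11) \\ &= \frac{L}{\gamma+t}\left[\frac{2}{\mu^2}\left(\max_t \sum_{i=1}^{N} (\rho_i^{t+1})^2\sigma_i^2 + 8(E-1)G^2 + 6L\Omega + \frac{2}{\eta_t}(\Gamma - \Omega)\right) + \frac{\gamma}{2}\Delta_0\right] \\ &= \frac{L}{\gamma+t}\left(\frac{2B}{\mu^2} + \frac{\gamma}{2}\Delta_0\right) + \frac{L}{\mu}(\Gamma - \Omega), \end{aligned}$$

where $\Delta_0 = \|W^0 - W^*\|_2^2$, $B = \max_t \sum_{i=1}^{N} (\rho_i^{t+1})^2\sigma_i^2 + 8(E-1)G^2 + 6L\Omega$, $\Gamma = \max_t \sum_{i=1}^{N} \rho_i^{t+1}(F_i(W^*) - F_i(W_i^*))$ and $\Omega = \min_t \sum_{i=1}^{N} \rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W_i^*))$.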
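The induction argument can be sanity-checked numerically. The sketch below iterates the recursion Δ_{t+1} = (1 − η_t µ)Δ_t + η_t² B (treating the inequality as an equality, i.e., the worst case) with β = 2/µ and γ = max{8L/µ, E} − 1, and compares every iterate against v/(γ + t). It assumes a constant B, which simplifies the time-dependent term in the definition of B above; the function name `check_induction` and all numeric constants are hypothetical.

```python
def check_induction(mu, L, E, B, delta0, steps):
    """Iterate the worst-case recursion and compare it with v / (gamma + t)."""
    beta = 2.0 / mu
    gamma = max(8.0 * L / mu, E) - 1.0
    v = max(beta ** 2 * B / (beta * mu - 1.0), gamma * delta0)

    delta = delta0
    worst_ratio = 0.0
    for t in range(steps):
        # Ratio of the iterate to the claimed bound at step t (should be <= 1).
        worst_ratio = max(worst_ratio, delta / (v / (gamma + t)))
        eta = beta / (t + gamma)
        # Worst case of the recursion: inequality taken as equality.
        delta = (1.0 - eta * mu) * delta + eta ** 2 * B
    return worst_ratio, delta, v / (gamma + steps)


worst_ratio, final_delta, final_bound = check_induction(
    mu=1.0, L=2.0, E=5, B=3.0, delta0=10.0, steps=10_000
)
print(worst_ratio, final_delta, final_bound)
```

In this toy setting the bound is tight at t = 0 (because v = γΔ_0) and loose afterwards, which matches the O(1/t) decay of the first term in the final bound.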

A.10 PROOF OF LEMMAS

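Proof of Lemma 1. From the definition $\bar{W}^{t+1} = \bar{W}^t - \eta_t g_t$, we can decompose $\|\bar{W}^{t+1} - W^*\|^2$ as

$$\begin{aligned} \|\bar{W}^{t+1} - W^*\|^2 &= \|\bar{W}^t - \eta_t g_t - W^* - \eta_t\bar{g}_t + \eta_t\bar{g}_t\|^2 \qquad (12) \\ &= \underbrace{\|\bar{W}^t - W^* - \eta_t\bar{g}_t\|^2}_{A_1} + \underbrace{2\eta_t\langle \bar{W}^t - W^* - \eta_t\bar{g}_t,\ \bar{g}_t - g_t\rangle}_{A_2} + \eta_t^2\|\bar{g}_t - g_t\|^2. \end{aligned}$$

In the above expression, $\mathbb{E}[A_2] = 0$, so we only need to bound $A_1$:

$$A_1 = \|\bar{W}^t - W^* - \eta_t\bar{g}_t\|^2 = \|\bar{W}^t - W^*\|^2 \underbrace{- 2\eta_t\langle \bar{W}^t - W^*,\ \bar{g}_t\rangle}_{B_1} + \underbrace{\eta_t^2\|\bar{g}_t\|^2}_{B_2}. \qquad (13)$$

We first focus on $B_2$. From the L-smoothness assumption (Assump. 1), we have that

$$\|\nabla F_i(W_i^t)\|^2 \leq 2L(F_i(W_i^t) - F_i(W_i^*)). \qquad (14)$$

We now bound $B_2$ as follows,

$$\begin{aligned} B_2 = \eta_t^2\|\bar{g}_t\|^2 = \eta_t^2\left\|\sum_{i=1}^{N}\rho_i^{t+1}\nabla F_i(W_i^t)\right\|^2 &\leq \eta_t^2\sum_{i=1}^{N}\rho_i^{t+1}\|\nabla F_i(W_i^t)\|^2 \qquad (15) \\ &\leq 2L\eta_t^2\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W_i^*)), \end{aligned}$$

where the first inequality comes from the convexity of norms and Jensen's inequality for convex functions. To bound $B_1$, we first split it into two terms by the linearity of the inner product:

$$\begin{aligned} B_1 = -2\eta_t\langle \bar{W}^t - W^*,\ \bar{g}_t\rangle &= -2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}\langle \bar{W}^t - W^* + W_i^t - W_i^t,\ \nabla F_i(W_i^t)\rangle \qquad (16) \\ &= 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}\underbrace{\langle W_i^t - \bar{W}^t,\ \nabla F_i(W_i^t)\rangle}_{B_{1,1}} + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}\underbrace{\langle W^* - W_i^t,\ \nabla F_i(W_i^t)\rangle}_{B_{1,2}}. \end{aligned}$$

To bound $B_{1,1}$, we invoke the Cauchy-Schwarz and AM-GM inequalities:

$$\begin{aligned} B_{1,1} = \langle W_i^t - \bar{W}^t,\ \nabla F_i(W_i^t)\rangle &\leq \sqrt{\frac{1}{\eta_t}\|\bar{W}^t - W_i^t\|^2}\ \sqrt{\eta_t\|\nabla F_i(W_i^t)\|^2} \qquad (17) \\ &\leq \frac{1}{2}\left(\frac{1}{\eta_t}\|\bar{W}^t - W_i^t\|^2 + \eta_t\|\nabla F_i(W_i^t)\|^2\right). \end{aligned}$$

To bound $B_{1,2}$, we use the convexity assumption (Assump. 2), which gives

$$B_{1,2} = \langle W^* - W_i^t,\ \nabla F_i(W_i^t)\rangle \leq F_i(W^*) - F_i(W_i^t) - \frac{\mu}{2}\|W^* - W_i^t\|^2. \qquad (18)$$

Now we plug Eq. 15, 16, 17 and 18 back into $A_1$ (Eq. 13) as follows,

$$\begin{aligned} A_1 &= \|\bar{W}^t - W^* - \eta_t\bar{g}_t\|^2 \\ &\leq \|\bar{W}^t - W^*\|^2 + \eta_t\sum_{i=1}^{N}\rho_i^{t+1}\left(\frac{1}{\eta_t}\|\bar{W}^t - W_i^t\|^2 + \eta_t\|\nabla F_i(W_i^t)\|^2\right) \\ &\quad + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}\left(F_i(W^*) - F_i(W_i^t) - \frac{\mu}{2}\|W^* - W_i^t\|^2\right) + 2L\eta_t^2\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W_i^*)) \\ &= \|\bar{W}^t - W^*\|^2 - \eta_t\mu\sum_{i=1}^{N}\rho_i^{t+1}\|W^* - W_i^t\|^2 + \sum_{i=1}^{N}\rho_i^{t+1}\|\bar{W}^t - W_i^t\|^2 \\ &\quad + 2L\eta_t^2\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W_i^*)) + \eta_t^2\sum_{i=1}^{N}\rho_i^{t+1}\|\nabla F_i(W_i^t)\|^2 + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W^*) - F_i(W_i^t)) \\ &\leq (1 - \eta_t\mu)\|\bar{W}^t - W^*\|^2 + \sum_{i=1}^{N}\rho_i^{t+1}\|\bar{W}^t - W_i^t\|^2 \\ &\quad + \underbrace{4L\eta_t^2\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W_i^*)) + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W^*) - F_i(W_i^t))}_{C}. \end{aligned}$$

The last inequality uses the convexity of norms, Jensen's inequality and Eq. 14. We now rearrange $C$:

$$\begin{aligned} C &= 4L\eta_t^2\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W_i^*)) + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W^*) - F_i(W_i^t)) + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}F_i(W_i^*) - 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}F_i(W_i^*) \\ &= -2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}F_i(W_i^t) + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}F_i(W_i^*) + 4L\eta_t^2\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W_i^*)) \\ &\quad + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}F_i(W^*) - 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}F_i(W_i^*) \\ &= -2\eta_t(1 - 2L\eta_t)\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W_i^*)) + 2\eta_t\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W^*) - F_i(W_i^*)) \\ &= -\gamma_t\underbrace{\sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W^*))}_{D} + 4L\eta_t^2\Gamma_t, \end{aligned}$$

where we define $\gamma_t = 2\eta_t(1 - 2L\eta_t)$ and $\Gamma_t = \sum_{i=1}^{N}\rho_i^{t+1}(F_i(W^*) - F_i(W_i^*))$. To bound $D$, we first use the convexity assumption (Assump. 2):

$$\begin{aligned} D &= \sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(W^*)) = \sum_{i=1}^{N}\rho_i^{t+1}(F_i(W_i^t) - F_i(\bar{W}^t)) + \sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W^*)) \qquad (20) \\ &\geq \sum_{i=1}^{N}\rho_i^{t+1}\langle W_i^t - \bar{W}^t,\ \nabla F_i(\bar{W}^t)\rangle + \sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W^*)) \\ &\geq -\frac{1}{2}\sum_{i=1}^{N}\rho_i^{t+1}\left(\eta_t\|\nabla F_i(\bar{W}^t)\|^2 + \frac{1}{\eta_t}\|W_i^t - \bar{W}^t\|^2\right) + \sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W^*)) \\ &\geq -\frac{1}{2}\sum_{i=1}^{N}\rho_i^{t+1}\left(2L\eta_t(F_i(\bar{W}^t) - F_i(W_i^*)) + \frac{1}{\eta_t}\|W_i^t - \bar{W}^t\|^2\right) + \sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W^*)), \end{aligned}$$

where the second-to-last inequality uses the AM-GM inequality and the last inequality comes from the L-smoothness assumption (Assump. 1).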
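Therefore,

$$\begin{aligned} C &\leq \gamma_t\sum_{i=1}^{N}\rho_i^{t+1}\left(L\eta_t(F_i(\bar{W}^t) - F_i(W_i^*)) + \frac{1}{2\eta_t}\|W_i^t - \bar{W}^t\|^2\right) - \gamma_t\sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W^*)) + 4L\eta_t^2\Gamma_t \qquad (21) \\ &= \gamma_t\sum_{i=1}^{N}\rho_i^{t+1}\left(L\eta_t(F_i(\bar{W}^t) - F_i(W_i^*)) + \frac{1}{2\eta_t}\|W_i^t - \bar{W}^t\|^2\right) \\ &\quad - \gamma_t\sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W^*) + F_i(W_i^*) - F_i(W_i^*)) + 4L\eta_t^2\Gamma_t \\ &= \gamma_t(\eta_t L - 1)\sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W_i^*)) + (4L\eta_t^2 + \gamma_t)\Gamma_t + \frac{\gamma_t}{2\eta_t}\sum_{i=1}^{N}\rho_i^{t+1}\|W_i^t - \bar{W}^t\|^2 \\ &\leq \gamma_t(\eta_t L - 1)\underbrace{\min_t\sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W_i^*))}_{\Omega \geq 0} + 2\eta_t\underbrace{\max_t\Gamma_t}_{\Gamma \geq 0} + \frac{\gamma_t}{2\eta_t}\sum_{i=1}^{N}\rho_i^{t+1}\|W_i^t - \bar{W}^t\|^2 \\ &= (6\eta_t^2 L - 2\eta_t - 4\eta_t^3 L^2)\Omega + 2\eta_t\Gamma + \frac{\gamma_t}{2\eta_t}\sum_{i=1}^{N}\rho_i^{t+1}\|W_i^t - \bar{W}^t\|^2 \\ &\leq 6\eta_t^2 L\Omega + 2\eta_t(\Gamma - \Omega) + \sum_{i=1}^{N}\rho_i^{t+1}\|W_i^t - \bar{W}^t\|^2, \end{aligned}$$

where the second inequality holds because $\sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W_i^*)) \geq 0$ and $\eta_t L - 1 \leq -\frac{3}{4}$, and the last inequality holds because $\frac{\gamma_t}{2\eta_t} \leq 1$ and $4\eta_t^3 L^2\Omega \geq 0$. Plugging everything into $A_1$, we can bound the effect of one-step SGD as

$$\|\bar{W}^{t+1} - W^*\|^2 \leq (1 - \eta_t\mu)\|\bar{W}^t - W^*\|^2 + 2\sum_{i=1}^{N}\rho_i^{t+1}\|\bar{W}^t - W_i^t\|^2 + 6L\eta_t^2\Omega + 2\eta_t(\Gamma - \Omega) + \eta_t^2\|\bar{g}_t - g_t\|^2. \qquad (22)$$

Proof of Lemma 2. Assume Assumption 3 holds, i.e., the variance of the gradients on all devices is bounded: $\mathbb{E}\|\nabla f_i(W^t; \xi_i) - \nabla F_i(W^t)\|^2 \leq \sigma_i^2$, $\forall i \in \{1, \dots, N\}$. Then

$$\begin{aligned} \mathbb{E}\|g_t - \bar{g}_t\|^2 &= \mathbb{E}\left\|\sum_{i=1}^{N}\rho_i^{t+1}\nabla f_i(W^t; \xi_i) - \sum_{i=1}^{N}\rho_i^{t+1}\nabla F_i(W^t)\right\|^2 \qquad (23) \\ &\leq \sum_{i=1}^{N}(\rho_i^{t+1})^2\,\mathbb{E}\left\|\nabla f_i(W^t; \xi_i) - \nabla F_i(W^t)\right\|^2 \leq \sum_{i=1}^{N}(\rho_i^{t+1})^2\sigma_i^2 \leq \max_t\sum_{i=1}^{N}(\rho_i^{t+1})^2\sigma_i^2, \end{aligned}$$

where the first inequality comes from the convexity of norms and Jensen's inequality.

Proof of Lemma 3. Assume Assumption 4 holds, i.e., $\mathbb{E}\|\nabla f_i(W^t; \xi_i)\|^2 \leq G^2$, $\forall i \in \{1, \dots, N\}$. Let $t_0$ denote a synchronization step. This means that $W_i^{t_0} = \bar{W}^{t_0}$. Because FL requires synchronization every $E$ steps, we have that $t - t_0 \leq E - 1$, where $t$ is any step between now and the next synchronization step (inclusive). Furthermore, we assume the learning rate $\eta_t$ is non-increasing and $\eta_{t_0} \leq 2\eta_t$.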
Then,

$$\begin{aligned} \mathbb{E}\sum_{i=1}^{N}\rho_i^{t+1}\|\bar{W}^t - W_i^t\|^2 &= \mathbb{E}\sum_{i=1}^{N}\rho_i^{t+1}\|(\bar{W}^t - \bar{W}^{t_0}) - (W_i^t - \bar{W}^{t_0})\|^2 \qquad (24) \\ &= \mathbb{E}\,\mathbb{E}_{\rho^t}\|\mathbb{E}_{\rho^t}(W_i^t - \bar{W}^{t_0}) - (W_i^t - \bar{W}^{t_0})\|^2 \\ &\leq \mathbb{E}\,\mathbb{E}_{\rho^t}\|W_i^t - \bar{W}^{t_0}\|^2 \\ &= \mathbb{E}\left[\mathbb{E}_{\rho^t}\left\|W_i^{t_0} - \bar{W}^{t_0} - \sum_{\tau=t_0}^{t-1}\eta_\tau\nabla f_i(W_i^\tau; \xi_i^\tau)\right\|^2\right] \\ &= \mathbb{E}_{\rho^t}\left[\mathbb{E}\left\|\sum_{\tau=t_0}^{t-1}\eta_\tau\nabla f_i(W_i^\tau; \xi_i^\tau)\right\|^2\right] \\ &\leq \mathbb{E}_{\rho^t}\left[\mathbb{E}\left\|\eta_{t_0}\sum_{\tau=t_0}^{t-1}\nabla f_i(W_i^\tau; \xi_i^\tau)\right\|^2\right] \\ &\leq \mathbb{E}_{\rho^t}\left[\mathbb{E}\,\eta_{t_0}^2(t - t_0)\left\|\nabla f_i(W^t; \xi_i)\right\|^2\right] \\ &\leq \mathbb{E}_{\rho^t}\left[\eta_{t_0}^2(E-1)G^2\right] \leq 4\eta_t^2(E-1)G^2. \end{aligned}$$

$$\begin{aligned} &\frac{N|\Xi_i|}{\sum_{j=1}^{N}|\Xi_j|}(F_i(\bar{W}^t) - F_i(W_i^*)) \geq \frac{N|\Xi_{i+1}|}{\sum_{j=1}^{N}|\Xi_j|}(F_{i+1}(\bar{W}^t) - F_{i+1}(W_{i+1}^*)) \\ \Longrightarrow\ &\frac{N|\Xi_i|}{\sum_{j=1}^{N}|\Xi_j|}(F_i(W_i^*) - F_i(\bar{W}^t)) \leq \frac{N|\Xi_{i+1}|}{\sum_{j=1}^{N}|\Xi_j|}(F_{i+1}(W_{i+1}^*) - F_{i+1}(\bar{W}^t)) \\ \Longrightarrow\ &U\left(\frac{N|\Xi_i|}{\sum_{j=1}^{N}|\Xi_j|}(F_i(W_i^*) - F_i(\bar{W}^t))\right) \leq U\left(\frac{N|\Xi_{i+1}|}{\sum_{j=1}^{N}|\Xi_j|}(F_{i+1}(W_{i+1}^*) - F_{i+1}(\bar{W}^t))\right). \end{aligned}$$

Let us define $\bar{\rho}_i^{t+1} \propto U\left(\frac{N|\Xi_i|}{\sum_{j=1}^{N}|\Xi_j|}(F_i(W_i^*) - F_i(\bar{W}^t))\right)$; then we have

$$\sum_{i=1}^{N}\bar{\rho}_i^{t+1}\frac{N|\Xi_i|}{\sum_{j=1}^{N}|\Xi_j|}(F_i(\bar{W}^t) - F_i(W_i^*)) \leq \frac{1}{N}\sum_{i=1}^{N}\frac{N|\Xi_i|}{\sum_{j=1}^{N}|\Xi_j|}(F_i(\bar{W}^t) - F_i(W_i^*)),$$

where the inequality is a direct consequence of Chebyshev's sum inequality. We then rewrite the inequality above as

$$\sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W_i^*)) \leq \sum_{i=1}^{N}\rho_i(F_i(\bar{W}^t) - F_i(W_i^*)),$$

where $\rho^{t+1}$

A.12 EXTENSION TO PARTIAL PARTICIPATION

To provide a partial participation extension to Thm. 1, we need to define some additional notation. Note that the following derivation and notation largely follow prior work (Li et al., 2019), which provides an easy way to extend FL convergence analysis to the partial participation setting.

Stochasticity due to Client Sampling. Now, instead of full participation of $N$ clients, at time $t$ we sample $K$ clients from the pool of $N$ available clients, forming an active set $S_{t+1}$. This client sampling procedure introduces another level of stochasticity in addition to the data sampling stochasticity on each client.
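We use the notation $\mathbb{E}_s[\cdot]$ and $\mathbb{E}[\cdot]$ to denote the expectation w.r.t. each source of stochasticity respectively.

Assumption 5 The active set $S_{t+1}$ is constructed by sampling a client with probabilities $\{\rho_i^{t+1} \mid i = 1, \dots, N\}$ repeatedly $K$ times with replacement, and the aggregation pattern is defined as

$$\tilde{W}^{t+1} = \sum_{i=1}^{N}\rho_i^{t+1}W_i^{t+1} \quad \text{where} \quad W_i^{t+1} = \begin{cases} V_i^{t+1} & \text{if } t+1 \notin I_E, \\ \frac{1}{K}\sum_{k \in S_{t+1}} V_k^{t+1} & \text{if } t+1 \in I_E. \end{cases}$$

The virtual sequence $\tilde{W}^{t+1}$ differs from the virtual sequence $\bar{W}^{t+1}$ in Eq. 2 when $t+1 \in I_E$. Therefore, the key to incorporating partial participation is characterizing the difference between the two when $t+1 \in I_E$. To facilitate the proof we present the following two lemmas.

Lemma 4 If $t+1 \in I_E$, then $\mathbb{E}_s[\tilde{W}^{t+1}] = \bar{W}^{t+1}$.

Lemma 5 If $t+1 \in I_E$ and $\eta_t$ is non-increasing with $\eta_t \leq 2\eta_{t+E}$, $\forall t \geq 0$, then

$$\mathbb{E}_s[\|\tilde{W}^{t+1} - \bar{W}^{t+1}\|^2] \leq \frac{4}{K}\eta_t^2 E G^2.$$

Theorem 2 Assume Assumptions 1-5 hold and let $L$, $\mu$, $\sigma_i$, $G$ be defined therein. Choose $\gamma = \max \frac{L}{\mu}$ and the learning rate $\eta_t = \frac{2}{\mu(\gamma+t)}$ and $T \in I_E$. Then FedAvg using SGD with partial device participation and generic, time-varying sampling weights

Proof of Lemma 4

$$\mathbb{E}_s[\tilde{W}^{t+1}] = \mathbb{E}_s\left[\frac{1}{K}\sum_{k \in S_{t+1}} V_k^{t+1}\right] = \frac{1}{K}\sum_{k \in S_{t+1}}\mathbb{E}_s\left[V_k^{t+1}\right] = \sum_{i=1}^{N}\rho_i^{t+1}V_i^{t+1} = \bar{W}^{t+1}.$$

Proof of Lemma 5

$$\begin{aligned} \mathbb{E}_s[\|\tilde{W}^{t+1} - \bar{W}^{t+1}\|^2] &= \mathbb{E}_s\left[\left\|\frac{1}{K}\sum_{k \in S_{t+1}} V_k^{t+1} - \bar{W}^{t+1}\right\|^2\right] = \mathbb{E}_s\left[\frac{1}{K^2}\left\|\sum_{k \in S_{t+1}}\left(V_k^{t+1} - \bar{W}^{t+1}\right)\right\|^2\right] \\ &\leq \mathbb{E}_s\left[\frac{1}{K^2}\sum_{k \in S_{t+1}}\left\|V_k^{t+1} - \bar{W}^{t+1}\right\|^2\right] = \frac{1}{K^2}\sum_{k \in S_{t+1}}\mathbb{E}_s\left\|V_k^{t+1} - \bar{W}^{t+1}\right\|^2 = \frac{1}{K}\mathbb{E}_s\left\|V_i^{t+1} - \bar{W}^{t+1}\right\|^2, \end{aligned}$$

where the first inequality stems from the triangle inequality of norms. Now we introduce a new notation $t_s := t + 1 - E \in I_E$, which is the most recent aggregation moment. Therefore, $\bar{W}^{t_s}$ is the same across all clients.

$$\begin{aligned} \frac{1}{K}\mathbb{E}_s\left\|V_i^{t+1} - \bar{W}^{t_s} + \bar{W}^{t_s} - \bar{W}^{t+1}\right\|^2 &= \frac{1}{K}\mathbb{E}_s\left\|(V_i^{t+1} - \bar{W}^{t_s}) - (\bar{W}^{t+1} - \bar{W}^{t_s})\right\|^2 \\ &= \frac{1}{K}\mathbb{E}_s\left\|(V_i^{t+1} - \bar{W}^{t_s}) - \mathbb{E}_s[V_i^{t+1} - \bar{W}^{t_s}]\right\|^2 \\ &\leq \frac{1}{K}\mathbb{E}_s\left\|V_i^{t+1} - \bar{W}^{t_s}\right\|^2. \end{aligned}$$

The last inequality stems from the calculation of auto-correlation, i.e., $\mathbb{E}[\|x - \mathbb{E}[x]\|^2] = \mathbb{E}\|x\|^2 - \|\mathbb{E}[x]\|^2$.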
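Finally, we have the following,

$$\begin{aligned} \mathbb{E}\,\mathbb{E}_s[\|\tilde{W}^{t+1} - \bar{W}^{t+1}\|^2] &\leq \frac{1}{K}\sum_{i=1}^{N}\rho_i^{t+1}\,\mathbb{E}\left\|V_i^{t+1} - \bar{W}^{t_s}\right\|^2 \\ &= \frac{1}{K}\sum_{i=1}^{N}\rho_i^{t+1}\left[\mathbb{E}\left\|\sum_{j=t_s}^{t}\eta_j\nabla f_i(W_i^j; \xi_i^j)\right\|^2\right] \\ &\leq \frac{1}{K}\sum_{i=1}^{N}\rho_i^{t+1}\left[\sum_{j=t_s}^{t}\mathbb{E}\left\|\eta_j\nabla f_i(W_i^j; \xi_i^j)\right\|^2\right] \\ &\leq \frac{1}{K}\sum_{i=1}^{N}\rho_i^{t+1}\left[4\eta_t^2\sum_{j=t_s}^{t}\mathbb{E}\left\|\nabla f_i(W_i^j; \xi_i^j)\right\|^2\right] \\ &\leq \frac{1}{K}\sum_{i=1}^{N}\rho_i^{t+1}\,4\eta_t^2 E G^2 = \frac{4}{K}\eta_t^2 E G^2. \end{aligned}$$

A.13 ADDITIONAL PROOF

In the main paper, we claimed equality between Thm. 1 and the convergence bound in prior work (Li et al., 2019) if $\rho_i^t = \rho_i = \frac{|\Xi_i|}{\sum_{j=1}^{N}|\Xi_j|}$. Specifically, we want to show that $\Omega = \Gamma = \sum_{i=1}^{N}\rho_i(F_i(W^*) - F_i(W_i^*))$. In this section, we give a detailed proof of this statement. The equality holds because

$$\Omega = \min_t\sum_{i=1}^{N}\rho_i(F_i(\bar{W}^t) - F_i(W_i^*)) = \min_t\sum_{i=1}^{N}\rho_i F_i(\bar{W}^t) - \sum_{i=1}^{N}\rho_i F_i(W_i^*) = \sum_{i=1}^{N}\rho_i F_i(W^*) - \sum_{i=1}^{N}\rho_i F_i(W_i^*),$$

and

$$\Gamma = \max_t\sum_{i=1}^{N}\rho_i(F_i(W^*) - F_i(W_i^*)) = \sum_{i=1}^{N}\rho_i F_i(W^*) - \sum_{i=1}^{N}\rho_i F_i(W_i^*).$$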
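The identity Ω = Γ under fixed proportional weights can be illustrated on a toy problem. The sketch below uses one-dimensional quadratic client risks F_i(w) = ½ a_i (w − c_i)², for which F_i(W_i*) = 0, and approximates the minimum over iterates by running gradient descent on the weighted objective. The quadratic risks, the client sizes, the step size and the variable names (`gamma_term`, `omega_term`) are all assumptions made for the example and are not taken from the paper's experiments.

```python
def local_risk(a, c, w):
    """Quadratic client risk F_i(w) = 0.5 * a * (w - c)^2 with minimum value 0."""
    return 0.5 * a * (w - c) ** 2


# Hypothetical clients: dataset sizes, curvatures and local minimizers.
sizes = [100, 300, 600]
rho = [s / sum(sizes) for s in sizes]  # fixed proportional weights
curvatures = [1.0, 2.0, 0.5]
centers = [-1.0, 0.5, 2.0]


def weighted_gap(w):
    """sum_i rho_i * (F_i(w) - F_i(W_i^*)), using F_i(W_i^*) = 0."""
    return sum(r * local_risk(a, c, w)
               for r, a, c in zip(rho, curvatures, centers))


# Global minimizer of the rho-weighted objective (closed form for quadratics).
w_star = (sum(r * a * c for r, a, c in zip(rho, curvatures, centers))
          / sum(r * a for r, a in zip(rho, curvatures)))
gamma_term = weighted_gap(w_star)  # Gamma does not depend on t for fixed rho

# Omega: minimum of the weighted gap along a gradient descent trajectory.
w = 5.0
omega_term = weighted_gap(w)
for _ in range(500):
    grad = sum(r * a * (w - c) for r, a, c in zip(rho, curvatures, centers))
    w -= 0.1 * grad
    omega_term = min(omega_term, weighted_gap(w))

print(gamma_term, omega_term)
```

Because the weights do not change over time, the weighted gap is minimized exactly at W*, so the trajectory minimum approaches Γ from above and the Γ − Ω term in the bound vanishes.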



We use the notation $|\cdot|$ to denote the size of a set. Some FL algorithms require exact convergence on the local model, e.g., FedPD (Zhang et al., 2020). Exp-α can be applied to other FL algorithms such as SCAFFOLD (Karimireddy et al., 2020) and FedDyn (Acar et al., 2021). However, these algorithms are not compatible with the current benchmark because they are stateful algorithms and perform poorly in the cross-device setting (Xu et al., 2021).



Figure 1: Illustration of our proposed Exp-α (Sec. 3.4) with three local optimization steps, i.e., E = 3, and two clients. In this example, synchronization/communication steps are t = 0, 3, 6. Exp-α calculates the aggregation weights based on the latest accessible global model and local models.

Figure 2: We visualize the average aggregation weights and report the test accuracy of Exp-α for different imbalance and flip configurations. Exp-α always assigns smaller weight to more shifted clients. FedAvg is used as the backbone FL algorithm and results are averaged over three runs.

Figure 3: Convergence of FedAvg and FedAvgM using Exp-α and proportional aggregation on Imbalance CIFAR10. Results are averaged over three runs. Imbalance ratio = 0.001.

Figure 4: Convergence of FedAvg and FedAvgM using Exp-α and proportional aggregation on Flipped CIFAR10. Results are averaged over three runs. Flip ratio = 1.0.

$(F_i(\bar{W}^t) - F_i(W_i^*))$ are arranged in decreasing order, i.e., $\frac{N|\Xi_i|}{\sum_{j=1}^{N}|\Xi_j|}(F_i(\bar{W}^t) - F_i(W_i^*)) \geq \frac{N|\Xi_{i+1}|}{\sum_{j=1}^{N}|\Xi_j|}(F_{i+1}(\bar{W}^t) - F_{i+1}(W_{i+1}^*))$. If we choose $U(\cdot) \geq 0$ as a non-decreasing function, then it follows that,

$F_i(\bar{W}^t) - F_i(W_i^*))$.

$(W^*) - F_i(W_i^*))$, and $\Omega = \min_t \sum_{i=1}^{N}\rho_i^{t+1}(F_i(\bar{W}^t) - F_i(W_i^*))$, $\forall t \geq 0$.

Federated learning (FL) is a distributed machine learning paradigm developed to preserve privacy while enabling continual development of an ML model on private data (McMahan et al., 2017). FL generally consists of three stages: client selection, client update and server update. Most FL algorithms innovate on one of these components. At the beginning of a round of communication (a global iteration), the current global model $W^t \in \mathbb{R}^d$ is distributed to a set of $N$ local clients randomly sampled from a large pool of candidates $\mathcal{N}$ according to a population distribution $\mathcal{C}$ supported on $\mathcal{N}$. If $N < |\mathcal{N}|$, this is called partial participation. Most papers follow a uniform client sampling strategy

Compatibility Experiments on Imbalanced CIFAR. Results are averaged over 3 runs. E is the number of local iterations and t refers to the number of global iterations. A complete table with standard deviation is available at A.5. Backbone Strategy acc40↓ t = 200 ↑ t = 400 ↑ Test↑ acc40↓ t = 100 ↑ t = 200 ↑ Test↑ acc40↓ t = 25 ↑ t = 50 ↑ Test ↑

Compatibility Experiments on CIFAR with flipped clients. Results are averaged over 3 runs. E is the number of local iterations and t refers to the number of global iterations. A complete table with standard deviation is available at A.6.

Domain Shift Experiments on Digits, Office-Caltech10 and DomainNet. Results are averaged over 3 runs. Each benchmark has several domains. We use shorthand notation in this table. .91 85.24 78.82 56.45 93.75 93.79 80.70 78.84 71.30 43.48 80.97 73.77 84.47 72.14

Implementation Details for CIFAR10, Digits, Office and DomainNet.

Train, validation and test splits for the Office-Caltech10 dataset.

Train, validation and test splits for the DomainNet dataset.

Train, validation and test splits for the Digits dataset.

Effects of LocalSteps with EXP-α. We use a fixed imbalance ratio of 0.1.

Effects of Heterogeneity with EXP-α. We sweep different imbalance ratio and use a fixed number of local steps E = 160

Table with Standard Deviation. Here, we show the full table of Tab. 2 with standard deviation in Tab. 12.

Compatibility Experiments on Imbalanced CIFAR. Results are averaged over 3 runs. E is the number of local iterations and t refers to the number of global iterations. Backbone Strategy acc40↓ t = 200 ↑ t = 400 ↑ Test↑ acc40↓ t = 100 ↑ t = 200 ↑ Test↑ acc40↓ t = 50 ↑ t = 100 ↑ Test↑ acc40↓ t = 25 ↑ t = 50 ↑ Test ↑

Exp-α and Proportional Aggregation on IID CIFAR. Results are averaged over 3 runs. E is the number of local iterations and t refers to the number of global iterations. Test↑ acc40↓ t = 100 ↑ t = 200 ↑ Test↑ acc40↓ t = 50 ↑ t = 100 ↑ Test↑ acc40↓ t = 25 ↑ t = 50 ↑ Test ↑

$\bar{W}^{t+1}$ when $t+1 \in I_E$. We provide a graphical illustration of the virtual sequence with E = 3 and two clients in Fig. 1. The virtual sequence $\bar{W}^{t+1}$ can be viewed as a virtual single-step SGD update from $\bar{W}^t$, i.e., $\bar{W}^{t+1} = \bar{W}^t - \eta_t g_t$ where $g_t =$

Note that Lemma 2 and Lemma 3 are adaptations of lemmas from Li et al. (2019) with the addition of time-varying aggregation weights. Deferred proofs of the lemmas are in A.10.

EXP-α with Varying α on Digits, Office-Caltech10 and DomainNet. Results are averaged over 3 runs. We use two FL algorithms, FedAvg (McMahan et al., 2017) and FedBN (Li et al., 2021). The numbers reported are validation accuracy.


Therefore, the bound in Thm 1 reduces to the following,

