FOCUS: FAIRNESS VIA AGENT-AWARENESS FOR FEDERATED LEARNING ON HETEROGENEOUS DATA
Anonymous authors
Paper under double-blind review

Abstract

Federated learning (FL) provides an effective collaborative training paradigm, allowing local agents to jointly train a global model without sharing their local data, thereby protecting privacy. However, due to the heterogeneous nature of local data, it is challenging to optimize or even define the fairness of the trained global model for the agents. For instance, existing work usually considers accuracy equity across agents as fairness in FL. This notion is limited, especially under the heterogeneous setting: it is intuitively "unfair" to enforce that agents with high-quality data (e.g., hospitals with high-resolution data and fine-grained labels) achieve accuracy similar to those contributing low-quality data (e.g., hospitals with low-resolution data and noisy labels), which may discourage agents from participating in FL. In this work, we aim to address such limitations and propose a formal fairness definition for FL, fairness via agent-awareness (FAA), which takes the different contributions of heterogeneous agents into account. Under FAA, the performance of agents with high-quality data will not be sacrificed merely due to the existence of large numbers of agents with low-quality data. In addition, we propose a fair FL training algorithm based on agent clustering (FOCUS) to achieve fairness in FL as measured by FAA. Theoretically, we prove the convergence and optimality of FOCUS under mild conditions for linear models and general convex loss functions with bounded smoothness. We also prove that FOCUS always achieves higher fairness in terms of FAA than standard FedAvg under both linear models and general convex loss functions. Empirically, we evaluate FOCUS on four datasets, including synthetic data, images, and texts, under different settings, and show that FOCUS achieves significantly higher fairness in terms of FAA while maintaining similar or even higher prediction accuracy compared with FedAvg and other existing fair FL algorithms.

1. INTRODUCTION

Federated learning (FL) is emerging as a promising approach to enable scalable intelligence in distributed settings such as mobile networks (Lim et al., 2020; Hard et al., 2018). Given the wide adoption of FL in domains including medical analysis (Sheller et al., 2020; Adnan et al., 2022), recommendation systems (Minto et al., 2021; Anelli et al., 2021), and personal Internet of Things (IoT) devices (Alawadi et al., 2021), ensuring the fairness of the trained global model is of great importance before its large-scale deployment, especially when the data quality and contributions of different agents differ in the heterogeneous setting. In general, fairness concerns the protection of a specific attribute, and fair FL is usually framed as equity: no individual that joins collaborative learning should suffer poor performance due to its identity. Several studies have explored fairness in FL, mainly focusing on the fairness of the final trained model regarding protected attributes without considering the different contributions of agents (Chu et al., 2021; Hu et al., 2022), or on accuracy parity across agents (Li et al., 2020b; Donahue & Kleinberg, 2022a; Mohri et al., 2019). Some works have considered properties of the local agents, such as local data properties (Zhang et al., 2020; Kang et al., 2019) and data size (Donahue & Kleinberg, 2022b). However, fairness analysis in FL under heterogeneous data distributions is still lacking. Thus, in this paper, we ask: What notion of fairness in FL can take the different contributions of heterogeneous local agents into account? Can we enhance the fairness of FL with advanced training algorithms? To better understand fairness in FL under heterogeneous data, we aim to define and enhance fairness by explicitly considering the different contributions of heterogeneous agents.
In particular, for FL trained with the standard FedAvg protocol (McMahan et al., 2017), if we denote the data of agent e as D_e with size n_e and the total number of data points as n, the final trained global model aims to minimize the loss with respect to the global distribution P = Σ_{e=1}^{E} (n_e/n) D_e, where E is the total number of agents. In practice, some local agents may have low-quality data (e.g., free riders), so it is intuitively "unfair" to train the final model with respect to such a global distribution over all agents, which sacrifices the performance of agents with high-quality data. For example, consider FL applications for medical analysis: some hospitals have high-resolution medical data and fine-grained labels, which cost a large amount of money to collect from advanced equipment and to crowdsource data labeling. In contrast, some hospitals may have low-resolution medical data and noisy labels. In such a setting, high-quality agents may be unwilling to participate in collaborative learning with low-quality agents, because they could have achieved higher accuracy by standalone local training. Therefore, a proper fairness notion is important to encourage agents to participate in FL and to ensure fairness. In this paper, we define fairness via agent-awareness in FL (FAA) as FAA({θ_e}_{e∈[E]}) = max_{e1,e2∈[E]} (E_{e1}(θ_{e1}) − E_{e2}(θ_{e2})), measured by the maximal excess-risk difference between any pair of agents e1, e2 ∈ [E]. The excess risk of each agent is E_e(θ_e) = L_e(θ_e) − min_{θ*} L_e(θ*), i.e., the loss of agent e evaluated on the FL model θ_e minus the Bayes optimal error of the local data distribution (Opper & Haussler, 1991). For each agent, a lower excess risk E_e(θ_e) indicates more gain from the FL model θ_e w.r.t. the local distribution, because its loss L_e(θ_e) is closer to its Bayes optimal error.
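As a toy numerical illustration of this point (the loss values below are hypothetical, not from the paper): an agent with noisy labels can have a high raw loss yet a low excess risk, because its Bayes optimal error is also high.

```python
def excess_risk(model_loss: float, bayes_optimal_loss: float) -> float:
    """Excess risk E_e(theta) = L_e(theta) - min_theta* L_e(theta*)."""
    return model_loss - bayes_optimal_loss

# Agent A: high-quality (clean) data, so a low Bayes optimal error.
risk_a = excess_risk(model_loss=0.30, bayes_optimal_loss=0.05)  # ~0.25
# Agent B: low-quality (noisy) data, so a high Bayes optimal error.
risk_b = excess_risk(model_loss=0.60, bayes_optimal_loss=0.50)  # ~0.10

# B's raw loss is twice A's, yet B's excess risk is lower: B gains more
# from the model relative to what is achievable on its own distribution.
```

This is why equalizing excess risks, rather than raw accuracies, avoids penalizing agents with high-quality data.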
Notably, reducing FAA enforces the equity of excess risks among agents, following the philosophy that each agent should "gain the same" from participating in FL. Therefore, lower FAA indicates stronger fairness for FL. Based on our fairness definition FAA, we then propose a fair FL algorithm based on agent clustering (FOCUS) to improve the fairness of FL. Specifically, we first cluster the local agents based on their data distributions and then train a model for each cluster. At inference time, the final prediction is the weighted aggregation over the predictions of the models trained on the corresponding clustered local data. Theoretically, we prove that the final converged stationary point of FOCUS is exponentially close to the optimal cluster assignment under mild conditions. In addition, we prove that the fairness of FOCUS in terms of FAA is strictly higher than that of standard FedAvg under both linear models and general convex losses. Empirically, we evaluate FOCUS on four datasets, including synthetic data, images, and texts, and we show that FOCUS achieves higher fairness measured by FAA than FedAvg and SOTA fair FL algorithms while maintaining similar or even higher prediction accuracy.
Technical contributions. In this work, we define and improve FL fairness in heterogeneous settings by considering the different contributions of heterogeneous local agents. We make contributions on both theoretical and empirical fronts.
• We formally define fairness via agent-awareness (FAA) in FL based on agent-level excess risks, explicitly taking the heterogeneous nature of local agents into account.
• We propose a fair FL algorithm via agent clustering (FOCUS) to improve fairness measured by FAA, especially in the heterogeneous setting. We prove the convergence rate and optimality of FOCUS under linear models and general convex losses.
• We prove that FOCUS achieves stronger fairness measured by FAA compared with FedAvg for both linear models and general convex losses.
• Empirically, we compare FOCUS with FedAvg and SOTA fair FL algorithms on four datasets, including synthetic data, images, and texts, under heterogeneous settings. We show that FOCUS indeed achieves stronger fairness measured by FAA while maintaining similar or even higher prediction accuracy on all datasets.

2. RELATED WORK

Fair Federated Learning There have been several studies exploring fairness in FL. Li et al. (2020b) first define agent-level fairness by considering accuracy equity across agents and achieve fairness by assigning agents with worse performance higher aggregation weights during training. However, such a definition of fairness fails to capture the heterogeneous nature of local agents. Mohri et al. (2019) pursue accuracy parity by improving the performance of the worst-performing agent. Wang et al. (2021) propose to mitigate conflicting gradients from local agents to enhance fairness. Instead of pursuing fairness with a single global model, Li et al. (2021) propose to train a personalized model for each agent to achieve accuracy equity across the personalized models. Zhang et al. (2020) predefine agent contribution levels based on an oracle assumption (e.g., data volume, data collection cost, etc.) for fairness optimization, which lacks a quantitative measurement metric in practice. Xu et al. (2021) approximate the Shapley value based on gradient cosine similarity to evaluate agent contribution. However, Zhang et al. (2020) point out that the Shapley value may discourage agents with rare data, especially under heterogeneous settings. In contrast, we provide an algorithm that quantitatively measures the contribution of local data based on each agent's excess risk, which is not affected even if the agent is in the minority.
Clustered Federated Learning Clustered FL algorithms were initially designed for multitask and personalized federated learning, which assumes that agents can be naturally partitioned into clusters (Ghosh et al., 2020; Xie et al., 2021; Sattler et al., 2021; Marfoq et al., 2021).
Existing clustering algorithms usually aim to assign each agent to the cluster that provides the lowest loss (Ghosh et al., 2020), optimize the cluster centers to be close to the local models (Xie et al., 2021), or group agents with similar gradient updates (with respect to, e.g., cosine similarity (Sattler et al., 2021)) into the same cluster. In addition to these hard clustering approaches (i.e., each agent belongs to only one cluster), soft clustering has also been studied (Marfoq et al., 2021; Li et al., 2022; Ruan & Joe-Wong, 2022; Stallmann & Wilbik, 2022), enabling agents to benefit from multiple clusters. However, none of these works considers the fairness of clustered FL and its potential implications; our work makes the first attempt to bridge them.

3. FAIR FEDERATED LEARNING ON HETEROGENEOUS DATA

In this section, we first define fairness via agent-awareness (FAA) in FL with heterogeneous data, and then introduce our fair FL algorithm based on agent clustering (FOCUS) to improve it. Intuitively, the performance of agents with high-quality data (e.g., clean data with better generality) could be severely compromised under FedAvg by the existence of large numbers of agents with low-quality data (e.g., noisy data with lower generality). To address this problem and characterize the distinctions among agents' local data distributions (contributions) to ensure fairness, we propose fairness via agent-awareness in FL (FAA) as below.

3.1. FAIRNESS

Definition 1 (Fairness via agent-awareness for FL (FAA)). Given a set of agents [E] in FL, the overall fairness score among all agents is defined as the maximal difference of excess risks over any pair of agents:
FAA({θ_e}_{e∈[E]}) = max_{e1,e2∈[E]} (E_{e1}(θ_{e1}) − E_{e2}(θ_{e2})),
where θ_e is the local model for agent e ∈ [E]. The excess risk E_e(θ_e) for agent e given model θ_e is defined as the difference between the population loss L_e(θ_e) and the Bayes optimal error of the corresponding data distribution, i.e., E_e(θ_e) = L_e(θ_e) − min_{θ*} L_e(θ*), where the minimum is taken over all possible models θ*. Note that in FedAvg, each client uses the global model θ as its local model θ_e. Definition 1 represents a quantitative, data-dependent measurement of agent-level fairness. Instead of forcing accuracy equity among all agents regardless of their data distributions, we define agent-level fairness as the equity of excess risks among agents, which takes the contributions of local data into account through their Bayes errors. For instance, when a local agent has low-quality data, although the corresponding utility loss is high, the Bayes error of such low-quality data is also high, so the excess risk of this agent remains low; this allows agents with high-quality data to achieve low utility loss under a fair outcome. By this definition, lower FAA indicates stronger fairness among agents.
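Definition 1 can be evaluated directly from per-agent model losses and (estimates of) per-agent Bayes optimal losses. A minimal sketch, where the loss numbers are hypothetical placeholders:

```python
def faa(model_losses, bayes_losses):
    """FAA (Definition 1): the maximal difference of excess risks
    E_e(theta_e) = L_e(theta_e) - min_theta* L_e(theta*) over agent pairs."""
    excess = [l - b for l, b in zip(model_losses, bayes_losses)]
    return max(e1 - e2 for e1 in excess for e2 in excess)

# Three agents: loss under the FL model, and a surrogate Bayes optimal
# loss for each local distribution (hypothetical values).
model_losses = [0.30, 0.28, 0.60]
bayes_losses = [0.05, 0.06, 0.50]
score = faa(model_losses, bayes_losses)  # lower FAA => stronger fairness
```

Here the third agent's high raw loss is mostly explained by its high Bayes error, so it does not dominate the FAA score.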

3.2. FAIR FEDERATED LEARNING ON HETEROGENEOUS DATA VIA CLUSTERING (FOCUS)

Method Overview. To enhance the fairness of FL in terms of FAA, we provide an agent-clustering-based FL algorithm (FOCUS) that partitions agents conditioned on their data distributions. Intuitively, grouping agents with similar data distributions together helps to improve fairness, since it reduces the intra-cluster data heterogeneity. We will analyze the fairness achieved by FOCUS and compare it with standard FedAvg both theoretically (Section 4.2) and empirically (Section 5). Our FOCUS algorithm (Algorithm 1) leverages the Expectation-Maximization (EM) algorithm to perform agent clustering. Define M as the number of clusters and E as the number of agents.
E step. Each agent evaluates its local empirical loss on every cluster model w_m, m ∈ [M], and then updates the soft clustering labels Π according to Eq. (8).
M step. The goal of the M step in Eq. (9) is to minimize a weighted sum of empirical losses for all local agents. However, given distributed data, it is impossible to find its exact optimal solution in practice. Thus, we specify a concrete protocol in Eq. (4) - Eq. (6) to estimate the objective in Eq. (9). At the t-th communication round, each agent initializes its local model from each cluster model w_m and then updates the model using its own dataset. To reduce communication costs, each agent is allowed to run SGD locally for K steps, as shown in Eq. (5). After K local steps, each agent sends the updated models θ_{em}^{(t)}(K) back to the central server, and the server aggregates the models of all agents by a weighted average based on the soft clustering labels {π_{em}}. We provide theoretical analysis for the convergence and optimality of FOCUS under these multiple local updates in Section 4.
Clients:
θ_{em}^{(t)}(0) = w_m^{(t)}, (4)
θ_{em}^{(t)}(k+1) = θ_{em}^{(t)}(k) − η_k ∇ Σ_{i=1}^{n_e} ℓ(h_{θ_{em}^{(t)}(k)}(x_e^{(i)}), y_e^{(i)}), ∀k = 0, ..., K−1. (5)
Server:
w_m^{(t+1)} = Σ_{e=1}^{E} π_{em}^{(t+1)} θ_{em}^{(t)}(K) / Σ_{e'=1}^{E} π_{e'm}^{(t+1)}. (6)
Inference.
At inference time, each agent ensembles the M models by a weighted average of their prediction probabilities, i.e., agent e predicts Σ_{m=1}^{M} π_{em} h_{w_m}(x) for input x. Suppose a test dataset D_e^test is sampled from distribution P_e. The test loss is calculated as
L_test(W, Π) = (1/|D_e^test|) Σ_{(x,y)∈D_e^test} ℓ(Σ_{m=1}^{M} π_{em} h_{w_m}(x), y).
For unseen agents that do not participate in the training process, the clustering labels Π are unknown. Therefore, an unseen agent e computes its one-shot clustering labels π_{em}^{(1)}, m ∈ [M], according to Eq. (8), and outputs the prediction Σ_{m=1}^{M} π_{em}^{(1)} h_{w_m}(x) for a test sample x.
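For concreteness, one communication round of this EM procedure can be sketched as follows for linear regression. Since Eq. (8) is not reproduced here, the E step below uses a softmax over negative per-cluster empirical losses as a stand-in for the soft-label update; the function name, the temperature `temp`, and the hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def focus_round(W, Pi, data, eta=0.1, K=5, temp=1.0):
    """One FOCUS communication round (sketch) for linear regression.

    W: (M, d) cluster models; Pi: (E, M) soft clustering labels;
    data: list of per-agent datasets (X, y). Returns updated (W, Pi).
    """
    E, M = Pi.shape
    # E step: update soft labels from per-cluster empirical losses
    # (a softmax over negative losses stands in for Eq. (8)).
    losses = np.array([[np.mean((X @ w - y) ** 2) for w in W] for X, y in data])
    Pi = np.exp(-losses / temp)
    Pi /= Pi.sum(axis=1, keepdims=True)
    # M step: each agent starts from every cluster model (Eq. (4)) and
    # runs K local gradient steps on its own data (Eq. (5)).
    theta = np.repeat(W[None, :, :], E, axis=0)  # (E, M, d) local copies
    for e, (X, y) in enumerate(data):
        for m in range(M):
            for _ in range(K):
                grad = 2 * X.T @ (X @ theta[e, m] - y) / len(y)
                theta[e, m] = theta[e, m] - eta * grad
    # Server aggregation weighted by the soft labels (Eq. (6)).
    W = (Pi[:, :, None] * theta).sum(axis=0) / Pi.sum(axis=0)[:, None]
    return W, Pi
```

Iterating `focus_round` drives Π toward a hard clustering when the agents' underlying distributions are well separated, matching the behavior analyzed in Section 4.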

4. THEORETICAL ANALYSIS OF FOCUS

In this section, we first present the convergence and optimality guarantees of our FOCUS algorithm, and then prove that it improves the fairness of FL regarding FAA. Our analysis starts with linear models and then extends to nonlinear models with smooth and strongly convex loss functions.

4.1. CONVERGENCE ANALYSIS

Linear models. We start with linear models for analysis simplicity. Suppose there are E agents, each with a local dataset D_e = {(x_e^{(i)}, y_e^{(i)})}_{i=1}^{n_e}, where x_e^{(i)} ∼ N(0, δ²I_d) and each label y_e^{(i)} = μ_e^T x_e^{(i)} + ε_e^{(i)} is blurred by random noise ε_e^{(i)} ∼ N(0, σ²). Each agent is asked to minimize the mean squared error to estimate μ_e, so the empirical loss function for a local agent given dataset D_e is
L_emp(D_e; w) = (1/n_e) Σ_{i=1}^{n_e} (w^T x_e^{(i)} − y_e^{(i)})².
We further make the following assumption about the heterogeneous agents.
Assumption 1 (Separable distributions). Suppose there are M predefined vectors {w*_i}_{i=1}^{M}, where for any m1, m2 ∈ [M], ∥w*_{m1} − w*_{m2}∥₂ ≥ R. A set of agents E satisfies separable distributions if they can be divided into M subsets S_1, ..., S_M such that, for any agent e ∈ S_m, ∥μ_e − w*_m∥₂ ≤ r < R/2.
Assumption 1 guarantees that the heterogeneous local data distributions are separable so that an optimal clustering solution exists, in which {w*_1, ..., w*_M} are the centers of the clusters. We next present Theorem 1 to demonstrate the linear convergence rate of FOCUS to the optimal cluster centers. Detailed proofs can be found in Appendix B.1.
Theorem 1. Consider the agent set E satisfying separable distributions as in Assumption 1. Suppose π_em^{(0)} = 1/M for all e, m; the natural initialization of each model w_m^{(0)}, m ∈ [M], satisfies ∥w_m^{(0)} − w*_m∥₂ ≤ min_{m'≠m} ∥w_m^{(0)} − w*_{m'}∥₂ − 2(r + Δ₀) for some Δ₀ > 0; and |D_e| = O(d). If the learning rate η ≤ min(1/(4δ²), β/√T), FOCUS converges with
π_em^{(T)} ≥ 1 / (1 + (M − 1) exp(−2Rδ²Δ₀T)), ∀e ∈ S_m, (11)
E∥w_m^{(T)} − w*_m∥₂² ≤ (1 − 2ηγ_m δ²/M)^{KT} (∥w_m^{(0)} − w*_m∥₂² + A) + 2MKr + (Mδ²Eβ²/√T) O(K³, σ²), (12)
where T is the total number of communication rounds; K is the number of local updates in each communication round; γ_m = |S_m| is the number of agents in the m-th cluster; and A = 2EK(M−1)δ² / ((1 − 2ηδ²γ_m/M)^K − exp(−2Rδ²Δ₀)) is an error term caused by the initial inaccurate clustering.
Proof sketch.
To prove this theorem, we first analyze the E steps and M steps separately to derive corresponding convergence lemmas (Lemmas 1 and 2). In the E steps, the soft cluster labels π_em increase for all e ∈ S_m as long as ∥w_m^{(t)} − w*_m∥₂ < ∥w_{m'}^{(t)} − w*_m∥₂, ∀m' ≠ m. On the other hand, ∥w_m − w*_m∥₂ is guaranteed to shrink linearly as long as π_em is large enough for every e ∈ S_m. We then integrate Lemmas 1 and 2 and prove Theorem 1 by induction.
Remarks. Theorem 1 shows the convergence of the parameters (Π, W) to a near-optimal solution. Eq. (11) implies that the agents will be correctly clustered, since π_em converges to 1 as the number of communication rounds T increases. In Eq. (12), the first term diminishes exponentially, while the second term 2MKr reflects the intra-cluster distribution divergence r. The last term originates from the data heterogeneity among clients across different clusters; its influence is amplified by the number of local updates (O(K³)) but also diminishes to zero as the number of communication rounds T goes to infinity. Our convergence analysis is conditioned on a natural clustering initialization in which each model weight w_m^{(0)} is closest to the corresponding cluster center w*_m, which is standard in convergence analyses for mixtures of models (Yan et al., 2017; Balakrishnan et al., 2017).
Smooth and strongly convex loss functions. Next, we extend our analysis to the more general case of nonlinear models with L-smooth and µ-strongly convex loss functions.
Assumption 2 (Smooth and strongly convex loss functions). The population loss function L_e(θ) of each agent e is L-smooth, i.e., ∥∇²L_e(θ)∥₂ ≤ L, and µ-strongly convex, i.e., the eigenvalues of the Hessian matrix satisfy λ_min(∇²L_e(θ)) ≥ µ.
We further make an assumption similar to Assumption 1, following the same philosophy.
Assumption 3 (Separable distributions).
A set of agents E satisfies separable distributions if they can be partitioned into M subsets S_1, ..., S_M, with w*_1, ..., w*_M representing the center of each subset respectively, such that the optimal parameter of each local loss L_e (i.e., θ*_e = arg min_θ L_e(θ)) satisfies ∥θ*_e − w*_m∥₂ ≤ r for every e ∈ S_m. In the meantime, agents from different subsets have different data distributions, such that ∥w*_{m1} − w*_{m2}∥₂ ≥ R, ∀m1, m2 ∈ [M], m1 ≠ m2.
Theorem 2. Consider the agent set E satisfying separable distributions as in Assumption 3. Suppose the loss functions have bounded gradient variance on local datasets, i.e., E_{(x,y)∼D_e}[∥∇ℓ(x, y; θ) − ∇L_e(θ)∥₂²] ≤ σ², and the population losses are bounded, i.e., L_e ≤ G, ∀e ∈ [E]. If π_em^{(0)} = 1/M, there exists Δ₀ > 0 such that ∥w_m^{(0)} − w*_m∥₂ ≤ √µR/(√µ + √L) − r − Δ₀, and the learning rate of each agent satisfies η ≤ min(1/(2(µ+L)), β/√T), then FOCUS converges with
π_em^{(T)} ≥ 1 / (1 + (M−1) exp(−µRΔ₀T)), ∀e ∈ S_m, (16)
E∥w_m^{(T)} − w*_m∥₂² ≤ (1 − ηA)^{KT} (∥w_m^{(0)} − w*_m∥₂² + B) + O(Kr) + (MEβ/√T) O(K³, σ²/n_e), (17)
where T is the total number of communication rounds; K is the number of local updates in each communication round; γ_m = |S_m| is the number of agents in the m-th cluster; and
A = (2γ_m/M) · µL/(µ+L) (related to the convergence rate), B = GMTE(4L/µ + 6/(µ(µ+L))) / ((1 − ηA)^K − exp(−µRΔ₀)) (caused by the offset of the initial clustering). (18)
Proof sketch. We analyze the evolution of the parameters (Π, W) for E steps in Lemma 3 and M steps in Lemma 4. Lemma 3 shows that the soft cluster labels π_em increase for all e ∈ S_m in E steps as long as ∥w_m − w*_m∥₂ < √µR/(√µ + √L) − r, whereas Lemma 4 guarantees that the model weights w_m get closer to the optimal solution w*_m in M steps. We combine Lemmas 3 and 4 by induction to prove this theorem. Detailed proofs are deferred to Appendix B.2.3.
Remarks. Theorem 2 extends the convergence guarantee of (Π, W) from linear models (Theorem 1) to general models with smooth and strongly convex loss functions.
For any agent e that belongs to cluster m (e ∈ S_m), its soft cluster label π_em converges to 1 by Eq. (16), indicating clustering optimality. Meanwhile, the model weights W converge linearly to a near-optimal solution. The error term O(Kr) in Eq. (17) is expected, since r represents the data divergence within each cluster and w*_m denotes the center of each cluster. The last term in Eq. (17) implies a trade-off between communication cost and convergence speed: increasing K reduces the communication cost by O(1/K) but at the expense of slowing down convergence.

4.2. FAIRNESS ANALYSIS

To theoretically show that FOCUS achieves stronger fairness in FL based on FAA, we focus on a simple yet representative case where all agents share similar distributions except one outlier agent.
Linear models. We first concretize such a scenario for linear models. Suppose we have E agents learning weights for M linear models. Their local data D_e (e ∈ [E]) are generated by y_e^{(i)} = μ_e^T x_e^{(i)} + ε_e^{(i)} with x_e^{(i)} ∼ N(0, δ²I_d) and ε_e^{(i)} ∼ N(0, σ_e²). The first E − 1 agents learn from normal datasets with ground-truth vectors μ_1, ..., μ_{E−1} satisfying ∥μ_e − μ*∥₂ ≤ r, while the E-th agent has an outlier data distribution, with its ground-truth vector μ_E far away from the others, i.e., ∥μ_E − μ*∥₂ ≥ R. As stated in Theorem 1, the soft clustering labels and model weights (Π, W) converge linearly to the global optimum. Therefore, we analyze the fairness of FOCUS assuming an optimal (Π, W) is reached. We compare the FAA achieved by FOCUS and FedAvg to underscore how our algorithm helps improve fairness for heterogeneous agents.
Theorem 3. When a single agent has an outlier distribution, the fairness FAA achieved by the FOCUS algorithm with M = 2 clusters satisfies
FAA_focus(W, Π) ≤ δ²r²,
while the fairness FAA achieved by FedAvg satisfies
FAA_avg(W) ≥ δ²((R²(E−2) − 2Rr)/E + r²) = Ω(δ²R²).
Remarks. When a single outlier exists, the fairness gap between FedAvg and FOCUS follows from Theorem 3:
FAA_avg(W) − FAA_focus(W, Π) ≥ δ²(R²(E−2) − 2Rr)/E.
As long as R > 2r/(E−2), FOCUS is guaranteed to achieve stronger fairness (i.e., lower FAA) than FedAvg. Note that the outlier assumption only makes sense when E > 2, since one cannot tell which agent is the outlier when E = 2. Also, we naturally assume R > 2r so that the two underlying clusters are at least separable. Therefore, we conclude that FOCUS dominates FedAvg in terms of FAA.
Here we only discuss the scenario of a single outlier agent for clarity; similar conclusions can be drawn for multiple underlying clusters and M > 2, as discussed in Appendix C.1.
Smooth and strongly convex loss functions. We generalize the fairness analysis to nonlinear models with smooth and strongly convex loss functions. To illustrate the superiority of our FOCUS algorithm in terms of FAA fairness, we similarly consider training in the presence of an outlier agent. Suppose we have E agents that learn weights for M models. We assume their population loss functions are L-smooth, µ-strongly convex (as in Assumption 2), and bounded, i.e., L_e(θ) ≤ G. The first E − 1 agents learn from similar data distributions, such that the total variation distance between the distributions of any two different agents i, j ∈ [E−1] is no greater than r: D_TV(P_i, P_j) ≤ r. On the other hand, the E-th agent has an outlier data distribution, such that L_E(θ*_i) − L_E(θ*_E) ≥ R for any i ∈ [E−1]. We claim that this assumption can be reduced to a lower bound on the H-divergence (Zhao et al., 2022). Under these conditions, the fairness achieved by FOCUS with two clusters satisfies
FAA_focus(W, Π) ≤ 2Gr/(E−1).
Let B = 2Gr/(E−1). The fairness achieved by FedAvg satisfies
FAA_avg(W) ≥ ((E−1)/E − L/(µE²)) R − (1 + L(E−1)/(µE) − L²/(µ²E)) B − (2L/(µE)) √(B(R − (L/µ)B)).
Remarks. Notably, when the outlier distribution is very different from the normal distribution, such that R ≫ Gr (which implies B ≪ R), the bound simplifies to
FAA_avg(W) ≥ ((E−1)/E − L/(µE²)) R.
Since FAA_focus(W, Π) ≤ B ≪ R, the fairness FAA achieved by FedAvg is always larger (weaker) than that of FOCUS as long as E ≥ L/µ, indicating the effectiveness of FOCUS.
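The single-outlier setting of Theorem 3 is straightforward to simulate numerically. The sketch below uses closed-form least squares on pooled data as a stand-in for federated training, oracle cluster assignments in place of learned ones, and the true ground-truth vectors as Bayes-optimal surrogates; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, E = 3, 200, 5
# E-1 normal agents around mu* = 0 and one far-away outlier agent.
mus = [np.zeros(d)] * (E - 1) + [np.full(d, 2.0)]
Xs = [rng.normal(size=(n, d)) for _ in range(E)]
ys = [X @ mu + 0.1 * rng.normal(size=n) for X, mu in zip(Xs, mus)]

def fit_ls(Xs, ys):
    """Pooled least squares (stand-in for federated training)."""
    X, y = np.concatenate(Xs), np.concatenate(ys)
    return np.linalg.lstsq(X, y, rcond=None)[0]

def excess(w, X, y, mu):
    """Surrogate excess risk: loss of w minus loss of the ground truth."""
    return np.mean((X @ w - y) ** 2) - np.mean((X @ mu - y) ** 2)

w_avg = fit_ls(Xs, ys)                     # FedAvg-style: one global model
w_norm = fit_ls(Xs[:-1], ys[:-1])          # FOCUS: normal-cluster model
w_out = fit_ls(Xs[-1:], ys[-1:])           # FOCUS: outlier-cluster model
focus_models = [w_norm] * (E - 1) + [w_out]

ex_avg = [excess(w_avg, X, y, mu) for X, y, mu in zip(Xs, ys, mus)]
ex_foc = [excess(w, X, y, mu) for w, X, y, mu in zip(focus_models, Xs, ys, mus)]
faa_avg = max(ex_avg) - min(ex_avg)        # large: the outlier skews everyone
faa_focus = max(ex_foc) - min(ex_foc)      # small: clusters are separated
```

With the outlier far from the normal agents, the single global model's FAA grows on the order of the outlier distance, while the clustered models keep all excess risks near zero, mirroring the gap in Theorem 3.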

5. EXPERIMENTAL EVALUATION

We conduct extensive experiments on various heterogeneous data settings to evaluate the fairness measured by FAA for FOCUS, FedAvg (McMahan et al., 2017) , and two baseline fair FL algorithms (i.e., q-FFL (Li et al., 2020b) and AFL (Mohri et al., 2019) ). We show that FOCUS achieves significantly higher fairness measured by FAA while maintaining similar or even higher accuracy.

5.1. EXPERIMENTAL SETUP

Data and Models. We carry out experiments on four datasets with heterogeneous data settings, ranging from synthetic data for linear models to images (rotated MNIST (Deng, 2012) and rotated CIFAR (Krizhevsky, 2009)) to text data for sentiment classification on the Yelp (Zhang et al., 2015) and IMDb (Maas et al., 2011) datasets. We train a fully connected model consisting of two linear layers with ReLU activations for MNIST, a ResNet-18 model (He et al., 2016) for CIFAR, and a pre-trained BERT-base model (Devlin et al., 2019) for the text data. We refer the readers to Appendix D for more implementation details.
Evaluation Metrics and Implementation Details. We consider four evaluation metrics: average test accuracy, average test loss, FAA for fairness, and the existing fairness metric "agnostic loss" introduced by Mohri et al. (2019). For FOCUS, we follow the inference procedure in Section 3.2 and use the learned soft clustering labels to make aggregated predictions on each agent's test data. We also report the performance of existing fair FL algorithms (i.e., q-FFL (Li et al., 2020b), AFL (Mohri et al., 2019), Ditto (Li et al., 2021), and CGSV (Xu et al., 2021)) as well as existing state-of-the-art FL algorithms for heterogeneous data settings (i.e., FedMA (Wang et al., 2020), Bayesian nonparametric FL (Yurochkin et al., 2019), and FedProx (Li et al., 2020a)) in Appendix D.2. To evaluate the FAA of different algorithms, we estimate the Bayes optimal loss min_w L_e(w) for each local agent e. Specifically, we train a centralized model on the subset of agents with similar data distributions (i.e., the same ground-truth cluster) and use it as a surrogate to approximate the Bayes optimum. We then select the agent pair with the maximal difference of excess risks to measure fairness in terms of FAA, following Definition 1.

5.2. EVALUATION RESULTS

Synthetic data for linear models. We first evaluate FOCUS on linear regression models with synthetic datasets. We set up E = 10 agents with data sampled from Gaussian distributions. Each agent e is assigned a local dataset D_e = {(x_e^{(i)}, y_e^{(i)})}_{i=1}^{n_e} generated by y_e^{(i)} = μ_e^T x_e^{(i)} + ε_e^{(i)} with x_e^{(i)} ∼ N(0, I_d) and ε_e^{(i)} ∼ N(0, σ²). We study the case considered in Section 4.2 where a single agent has an outlier data distribution, and set the intra-cluster distance r = 0.01 and the inter-cluster distance R = 1 in our experiment. Note that this is a regression task, so we mainly report the average test loss instead of accuracy. Table 1 shows that FOCUS achieves an FAA of 0.001, which is much lower than the FAA of 0.958 achieved by FedAvg, 0.699 by q-FFL, and 0.780 by AFL. FOCUS also achieves the lowest FAA on both image datasets. In addition, although the existing fair FL algorithms q-FFL and AFL achieve lower FAA scores than FedAvg, their average test accuracy drops significantly. This is mainly because these fair algorithms are designed for performance parity via improving low-quality agents (i.e., agents with high training loss), thus sacrificing the accuracy of high-quality agents. Notably, FOCUS both improves FAA fairness and preserves high test accuracy. Next, we analyze the surrogate excess risk of every agent on MNIST in Fig. 1(a). We observe that the global model trained by FedAvg obtains the highest test loss of 0.61 on the outlier cluster, which is rotated by 180 degrees (i.e., cluster C3), resulting in a high excess risk for the 9th agent. Moreover, the low-quality data of the outlier cluster affect the agents in the 1st cluster via FedAvg training, which leads to a much higher excess risk than that of FOCUS. On the other hand, FOCUS successfully identifies clusters of the outlier distributions, i.e., clusters 2 and 3, rendering models trained from the outlier clusters independent of the normal cluster 1. As shown in Fig.
1, our FOCUS reduces the excess risks of all agents, especially the outliers, on different datasets, leading to strong fairness among agents in terms of FAA. Similar trends are observed on CIFAR, where FOCUS reduces the surrogate excess risk of the 9th agent from 2.74 to 0.44. We defer the loss histogram for CIFAR to Appendix D. In addition, we evaluate different numbers of outliers in Table 2. In the presence of 1, 3, and 5 outlier agents, forming 2, 3, or 4 underlying true clusters, FOCUS consistently achieves a lower FAA score and higher accuracy than the FedAvg baseline. In practice, we do not know the number of underlying clusters, so we also evaluate FOCUS with M = 2, 3, 4 when there are 3 true underlying clusters.

6. CONCLUSION

In this work, we provide an agent-level fairness measurement for FL (FAA) that takes agents' inherently heterogeneous data properties into account. Motivated by this fairness definition, we provide an effective FL training algorithm, FOCUS, to achieve high fairness. We theoretically analyze the convergence rate and optimality of FOCUS, and we prove that under mild conditions FOCUS is always fairer than the standard FedAvg protocol. We conduct thorough experiments on synthetic data with linear models as well as image and text datasets with deep neural networks, showing that FOCUS achieves stronger fairness than FedAvg with similar or higher prediction accuracy across all datasets. We believe our work will inspire new research efforts on exploring suitable fairness measurements for FL under different requirements.

A REVISION UPDATES

If the submission is accepted, we will use the extra page in the main text to add the following content.

A.1 SCALABILITY WITH MORE AGENTS

To study the scalability of FOCUS, we evaluate the performance and fairness of FOCUS and existing methods with 100 clients on MNIST. Table 3 shows that FOCUS achieves the best fairness measured by FAA and agnostic loss, higher test accuracy, and lower test loss than FedAvg and existing fair FL methods.

A.3 RUNTIME ANALYSIS

Computation time of the proposed FAA metric and its scalability to more clients. In FAA, to calculate the maximal difference of excess risks over all pairs of agents, it suffices to compute the difference between the maximal and the minimal per-client excess risk; we do not need to compute the difference for every pair of agents. We compare the computation time (averaged over 100 trials) of FAA and existing fairness criteria (i.e., Accuracy Parity (Li et al., 2020b) and Agnostic Loss (Mohri et al., 2019)) with 10 clients and 100 clients on MNIST. The results show that the computation of FAA is efficient even with a large number of agents. Moreover, calculating the difference between the maximal and minimal excess risk (i.e., FAA) is even faster than calculating the standard deviation of the accuracy across agents (i.e., Accuracy Parity).

Communication rounds analysis. We report the number of communication rounds that each method takes to reach a target accuracy on MNIST and CIFAR in Table 5. FOCUS requires a significantly smaller number of communication rounds than FedAvg, q-FFL, and AFL on both datasets, which demonstrates the small cost incurred by FOCUS.

Training time and inference time analysis. In terms of runtime, we report the training time for one FL round (averaged over 20 trials) as well as the inference time (averaged over 100 trials) in Table 7. Since the local updates and server aggregation for different cluster models can run in parallel, FOCUS has a similar training time to FedAvg, q-FFL, and AFL, which train one global FL model. For inference, FOCUS is slightly slower than existing methods, by about 0.17 seconds, due to the ensemble prediction over all cluster models at each client. However, we note that this cost is small, and the forward passes of the different cluster models for the ensemble prediction can also be made in parallel to further reduce the inference time.
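The O(E) shortcut described above (max minus min, instead of all pairwise gaps) can be sketched as follows; the function names and sample risks are ours, for illustration only:

```python
import numpy as np

def faa_pairwise(excess_risks):
    """FAA as the maximal pairwise gap: O(E^2) comparisons."""
    e = np.asarray(excess_risks, dtype=float)
    return float(max(abs(a - b) for a in e for b in e))

def faa_fast(excess_risks):
    """Equivalent O(E) form: maximal minus minimal excess risk."""
    e = np.asarray(excess_risks, dtype=float)
    return float(e.max() - e.min())

risks = np.random.rand(100)  # one excess risk per agent
assert abs(faa_pairwise(risks) - faa_fast(risks)) < 1e-12
```

Both forms agree because the largest pairwise gap is always attained by the pair (max, min), which explains why FAA scales linearly with the number of clients.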

A.4 COMPARISON TO FEDAVG WITH CLUSTERING

In this section, we construct a new method that combines clustering and FedAvg (i.e., FedAvg-HardCluster), which serves as a strong baseline. Specifically, FedAvg-HardCluster works as follows:
• Step 1: before training, each agent takes the arg max of the learned soft cluster assignment from FOCUS to obtain a hard cluster assignment (i.e., each agent belongs to exactly one cluster).
• Step 2: during training, standard FedAvg is run separately within each hard cluster to train one model per cluster.
To compare the performance of FOCUS and FedAvg-HardCluster, we consider two scenarios on MNIST:
• Scenario 1: the underlying clusters are clearly separable, where each cluster contains samples from one distribution; this is the setting used in our paper.
• Scenario 2: the underlying clusters are not separable, where each cluster has 80%, 10%, and 10% of its samples from three different distributions, respectively. For example, the first underlying cluster contains 80% samples without rotation, 10% samples rotated 90 degrees, and 10% samples rotated 180 degrees.
We observe that the learned soft cluster assignments from FOCUS align with the underlying distributions, so the hard cluster assignment in Step 1 of FedAvg-HardCluster equals the underlying ground-truth clustering in both scenarios. Table 8 presents the results of FOCUS and FedAvg-HardCluster on Rotated MNIST under the two scenarios. Under Scenario 1, the accuracy of FOCUS and FedAvg-HardCluster is similar, and FOCUS achieves better fairness in terms of FAA. The results show that hard clustering in FedAvg-HardCluster is as good as soft clustering in FOCUS when the underlying clusters are clearly separable, which verifies that clustering is one of the key steps in FOCUS and aligns with our hypothesis for fairness under heterogeneous data. Under Scenario 2, FOCUS achieves higher accuracy and better FAA fairness than FedAvg-HardCluster. These results show that when the underlying clusters are not separable, soft clustering is better than hard clustering, since each agent can benefit from multiple cluster models through the soft π learned by the EM algorithm in FOCUS.
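The difference between the hard and soft prediction rules can be illustrated with a small sketch; the logits and soft labels below are hypothetical, chosen to mimic the mixed-membership agent of Scenario 2:

```python
import numpy as np

def hard_predict(logits_per_cluster, pi_e):
    """FedAvg-HardCluster style: use only the arg-max cluster's model."""
    m = int(np.argmax(pi_e))
    return logits_per_cluster[m]

def soft_predict(logits_per_cluster, pi_e):
    """FOCUS style: ensemble all cluster models, weighted by soft labels."""
    pi = np.asarray(pi_e, dtype=float)
    pi = pi / pi.sum()
    return np.tensordot(pi, logits_per_cluster, axes=1)

# Hypothetical agent with 80%/10%/10% membership across M = 3 clusters
logits = np.array([[2.0, 0.1], [0.0, 1.5], [0.2, 0.3]])  # one row per cluster model
pi_e = np.array([0.8, 0.1, 0.1])
```

With a clearly separable agent (pi close to one-hot) the two rules coincide; with mixed membership, the soft ensemble blends information from all cluster models, matching the Scenario 2 observation above.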

A.5 EFFECT OF THE NUMBER OF THE CLUSTERS M

The performance of FOCUS would not be harmed if the selected number of clusters is larger than the number of underlying clusters, since the superfluous clusters become useless (their soft cluster assignments $\pi$ go to zero). On the other hand, when the selected number of clusters is smaller than the number of underlying clusters, FOCUS converges to a solution in which some clusters contain agents from more than one underlying cluster. Empirically, in Table 9, we have 3 true underlying clusters while we set M = 1, 2, 3, 4 in our experiments, and we see that when M = 3 and M = 4, FOCUS achieves similar accuracy and fairness, which verifies our hypothesis that the superfluous clusters become useless. When M = 2, FOCUS even achieves the highest fairness, which might be because one cluster benefits from the shared knowledge of multiple underlying clusters. When M = 1, FOCUS reduces to FedAvg, which lacks the clustering mechanism, leading to the lowest accuracy and fairness under heterogeneous data.

B.1.1 KEY LEMMAS

Lemma 1. Suppose $\|w^{(t)}_m - w^*_m\| \le \alpha < \beta \le \min_{m' \neq m} \|w^{(t)}_{m'} - w^*_m\|$. Then the E-step updates as
$$\pi^{(t+1)}_{em} \ge \frac{\pi^{(t)}_{em}}{\pi^{(t)}_{em} + (1 - \pi^{(t)}_{em}) \exp\big(-(\beta^2 - \alpha^2 - 2(\alpha+\beta)r)\delta^2\big)}. \tag{25}$$

Remark. Our assumption of proper initialization guarantees that $\|w^{(0)}_m - w^*_m\| \le \alpha$, while for all $m'$ we have $\|w_{m'} - w^*_m\|_2 \ge \|w^*_m - \mu^*_{m'}\| - \|w_{m'} - \mu^*_{m'}\| \ge R - \alpha$. Hence, substituting $\beta = R - \alpha$ and $\alpha = \frac{R}{2} - r - \Delta$ yields
$$\pi^{(t+1)}_{em} \ge \frac{\pi^{(t)}_{em}}{\pi^{(t)}_{em} + (1 - \pi^{(t)}_{em}) \exp(-2R\Delta\delta^2)}, \quad \forall e \in S_m. \tag{26}$$

For M-steps, the local agents are initialized with $\theta^{(0)}_{em} = w^{(t)}_m$. Then for $k = 1, \dots, K-1$, each agent uses local SGD to update its personal model:
$$\theta^{(k+1)}_{em} = \theta^{(k)}_{em} - \eta_k g_{em}(\theta^{(k)}_{em}) = \theta^{(k)}_{em} - \eta_k \nabla \sum_{i=1}^{n_e} \ell\big(h_{\theta^{(k)}_{em}}(x^{(i)}_e), y^{(i)}_e\big). \tag{27}$$
To analyze the aggregated model in Eq. (6), we define a sequence of virtual aggregated models $\hat{w}^{(k)}_m$:
$$\hat{w}^{(k)}_m = \frac{\sum_{e=1}^{E} \pi_{em} \theta^{(k)}_{em}}{\sum_{e'=1}^{E} \pi_{e'm}}. \tag{28}$$

Lemma 2. Suppose any agent $e \in S_m$ has a soft clustering label $\pi_{em} \ge p$ as in Eq. (29), and the learning rate satisfies $\eta_k \le \frac{1}{4\delta^2}$. Then
$$\mathbb{E}\|\hat{w}^{(k+1)}_m - w^*_m\|_2^2 \le (1 - 2\eta_k \gamma_m p \delta^2)\, \mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|_2^2 + \eta_k A_1 + \eta_k^2 A_2,$$
where
$$A_1 = 4\gamma_m r \delta^2 + 2\delta^2 E(1-p), \qquad A_2 = 16E(K-1)^2\delta^4 + O\Big(\frac{d}{n_e}\Big) E(\delta^4 + \delta^2\sigma^2). \tag{30}$$

Remark. Using the recursive relation in Lemma 2, if the learning rate $\eta_k = \eta$ is fixed, the sequence $\hat{w}^{(k)}_m$ has a convergence rate of
$$\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|_2^2 \le (1 - 2\eta \gamma_m p \delta^2)^k\, \mathbb{E}\|\hat{w}^{(0)}_m - w^*_m\|_2^2 + \eta k (A_1 + \eta A_2).$$

B.1.2 COMPLETING THE PROOF OF THEOREM 1

We now combine Lemma 1 and Lemma 2 to prove Theorem 1. The theorem is restated below.

Theorem 1. Under Assumptions 1 and 2, with $n_e = O(d)$ and learning rate $\eta \le \min\big(\frac{1}{4\delta^2}, \frac{\beta}{\sqrt{T}}\big)$,
$$\pi^{(T)}_{em} \ge \frac{1}{1 + (M-1)\exp(-2R\delta^2\Delta_0 T)}, \quad \forall e \in S_m, \tag{32}$$
$$\mathbb{E}\|w^{(T)}_m - w^*_m\|_2^2 \le \Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{KT} \big(\|w^{(0)}_m - w^*_m\|_2^2 + A\big) + 2MKr + \frac{M\delta^2 E\beta}{2\sqrt{T}}\, O(K^3, \sigma^2), \tag{33}$$
where $T$ is the total number of communication rounds, $K$ is the number of local iterations in each round, $\gamma_m = |S_m|$ is the number of agents in the $m$-th cluster, and
$$A = \frac{2EK(M-1)\delta^2}{\big(1 - \frac{2\eta\delta^2\gamma_m}{M}\big)^{K} - \exp(-2R\delta^2\Delta_0)}. \tag{34}$$

Proof. We prove Theorem 1 by induction. Suppose
$$\pi^{(t)}_{em} \ge \frac{1}{1 + (M-1)\exp(-2R\delta^2\Delta_0 t)}, \tag{35}$$
$$\mathbb{E}\|w^{(t)}_m - w^*_m\|^2 \le \Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{Kt}\|w^{(0)}_m - w^*_m\|^2 + A\Big(\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{Kt} - \exp(-2R\delta^2\Delta_0 t)\Big) + \frac{\eta B}{1 - \big(1 - \frac{2\eta\gamma_m\delta^2}{M}\big)^K}, \tag{36}$$
where $B = [16E\delta^4K^3 + EK(\delta^4 + \delta^2\sigma^2)]\eta + 4\gamma_m r\delta^2 K$. Then according to Lemma 1,
$$\pi^{(t+1)}_{em} \ge \frac{\pi^{(t)}_{em}}{\pi^{(t)}_{em} + (1 - \pi^{(t)}_{em})\exp(-2R\Delta_0\delta^2)} \tag{37}$$
$$\ge \frac{1}{1 + (M-1)\exp(-2R\delta^2\Delta_0 t)\exp(-2R\Delta_0\delta^2)} \tag{38}$$
$$\ge \frac{1}{1 + (M-1)\exp(-2R\Delta_0\delta^2(t+1))}. \tag{39}$$

We recall the virtual sequence $\hat{w}_m$ defined by Eq. (28). Since models are synchronized after $K$ local iterations, we know $\hat{w}^{(0)}_m = w^{(t)}_m$ and $w^{(t+1)}_m = \hat{w}^{(K)}_m$. We then apply Lemma 2 to complete the induction. Note that instead of proving Eq. (33) directly, we prove the stronger induction hypothesis Eq. (36):
$$\mathbb{E}\|w^{(t+1)}_m - w^*_m\|^2 = \mathbb{E}\|\hat{w}^{(K)}_m - w^*_m\|^2 \tag{40}$$
$$\le (1 - 2\eta\gamma_m p\delta^2)^K\, \mathbb{E}\|w^{(t)}_m - w^*_m\|^2 + \eta K(A_1 + \eta A_2) \tag{41}$$
$$\le (1 - 2\eta\gamma_m p\delta^2)^K\Big[\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{Kt}\|w^{(0)}_m - w^*_m\|^2 + A\Big(\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{Kt} - \exp(-2R\Delta_0\delta^2 t)\Big) + \frac{\eta B}{1 - \big(1 - \frac{2\eta\gamma_m\delta^2}{M}\big)^K}\Big] + \eta K\big(4\gamma_m r\delta^2 + 2\delta^2E(1-p)\big) + \eta^2 KA_2 \tag{42}$$
$$\le \Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{(t+1)K}\|w^{(0)}_m - w^*_m\|^2 + \underbrace{A\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{(t+1)K} - A\exp(-2R\Delta_0\delta^2 t)\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^K + 2\delta^2EK(1-p)}_{D_1} + \underbrace{\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^K\frac{\eta B}{1 - \big(1 - \frac{2\eta\gamma_m\delta^2}{M}\big)^K} + 4\eta K\gamma_m r\delta^2 + \eta^2 KA_2}_{D_2}. \tag{43}$$

Note that $1 - p \le (M-1)\exp(-2R\Delta_0\delta^2 t)$, so
$$D_1 \le A\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{(t+1)K} - A\exp(-2R\Delta_0\delta^2 t)\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^K + 2\delta^2EK(M-1)\exp(-2R\Delta_0\delta^2 t)$$
$$\le A\Big(\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{(t+1)K} - \exp(-2R\Delta_0\delta^2(t+1))\Big). \tag{44}$$
For $D_2$ we have
$$D_2 \le \Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^K\frac{\eta B}{1 - \big(1 - \frac{2\eta\gamma_m\delta^2}{M}\big)^K} + 4\eta\gamma_m r\delta^2K + 16\eta^2E\delta^4K^3 + \eta^2EK\,O(\delta^4 + \delta^2\sigma^2) = \frac{\eta B}{1 - \big(1 - \frac{2\eta\gamma_m\delta^2}{M}\big)^K}. \tag{45}$$
Finally, we combine Eqs. (43) to (45), so
$$\mathbb{E}\|w^{(t+1)}_m - w^*_m\|^2 \le \Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{(t+1)K}\|w^{(0)}_m - w^*_m\|^2 + A\Big(\Big(1 - \frac{2\eta\gamma_m\delta^2}{M}\Big)^{(t+1)K} - \exp(-2R\delta^2\Delta_0(t+1))\Big) + \frac{\eta B}{1 - \big(1 - \frac{2\eta\gamma_m\delta^2}{M}\big)^K}. \tag{46}$$
Since it is trivial to check that both induction hypotheses hold when $t = 0$, the induction holds for all $t$. Note that $K \ge 1$, so
$$\frac{\eta B}{1 - \big(1 - \frac{2\eta\gamma_m\delta^2}{M}\big)^K} \le \eta B \cdot \frac{M}{2\eta\gamma_m\delta^2} \le 2MKr + \frac{M\delta^2E\beta}{2\sqrt{T}}\,O(K^3, \sigma^2). \tag{47}$$
Combining Eq. (46) and Eq. (47) completes our proof.

B.1.3 DEFERRED PROOFS OF KEY LEMMAS

Proof of Lemma 1. For simplicity, we abbreviate the model weights $w^{(t)}_m$ by $w_m$ in the proof of this lemma. The $t$-th E-step updates the weights $\Pi$ by
$$\pi^{(t+1)}_{em} = \frac{\pi^{(t)}_{em}\exp\big(-\mathbb{E}_{(x,y)\sim D_e}(w_m^\top x - y)^2\big)}{\sum_{m'}\pi^{(t)}_{em'}\exp\big(-\mathbb{E}_{(x,y)\sim D_e}(w_{m'}^\top x - y)^2\big)}, \tag{48}$$
so
$$\pi^{(t+1)}_{em} = \frac{\pi^{(t)}_{em}\exp\big(-\|w_m - \mu_e\|^2\delta^2\big)}{\sum_{m'}\pi^{(t)}_{em'}\exp\big(-\|w_{m'} - \mu_e\|^2\delta^2\big)} \tag{49}$$
$$\ge \frac{\pi^{(t)}_{em}\exp(-(\alpha + r)^2\delta^2)}{\pi^{(t)}_{em}\exp(-(\alpha + r)^2\delta^2) + \sum_{m'\neq m}\pi^{(t)}_{em'}\exp(-(\beta - r)^2\delta^2)} \tag{50}$$
$$\ge \frac{\pi^{(t)}_{em}}{\pi^{(t)}_{em} + (1 - \pi^{(t)}_{em})\exp\big(-(\beta^2 - \alpha^2 - 2(\alpha + \beta)r)\delta^2\big)}. \tag{51}$$

Proof of Lemma 2. Notice that the local datasets are generated by $X_e \sim \mathcal{N}(0, \delta^2\mathbf{1}_{n_e\times d})$ and $y_e = X_e\mu_e + \epsilon_e$ with $\epsilon_e \sim \mathcal{N}(0, \sigma^2)$. Therefore,
$$\|\hat{w}^{(k+1)}_m - w^*_m\|^2 = \|\hat{w}^{(k)}_m - w^*_m - \eta_k g_k\|^2 \tag{52}$$
$$= \Big\|\hat{w}^{(k)}_m - w^*_m - \frac{2\eta_k}{n_e}\sum_e\pi_{em}X_e^\top X_e(\theta^{(k)}_{em} - \mu_e) + \frac{2\eta_k}{n_e}\sum_e\pi_{em}X_e^\top\epsilon_e\Big\|^2 \tag{53}$$
$$= \|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}_k\|^2 + \eta_k^2\|g_k - \hat{g}_k\|^2 + 2\eta_k\langle\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}_k,\ \hat{g}_k - g_k\rangle, \tag{54}$$
where $\hat{g}_k = \frac{2}{n_e}\sum_e\pi_{em}\mathbb{E}(X_e^\top X_e)(\theta^{(k)}_{em} - \mu_e)$. Since the expectation of the last term in Eq. (54) is zero, we only need to estimate the expectations of $\|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}_k\|^2$ and $\|\hat{g}_k - g_k\|^2$:
$$\|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}_k\|^2 = \|\hat{w}^{(k)}_m - w^*_m\|^2 + 4\eta_k^2\delta^4\sum_e\pi_{em}\|\theta^{(k)}_{em} - \mu_e\|^2 - 4\eta_k\Big\langle\hat{w}^{(k)}_m - w^*_m,\ \sum_e\pi_{em}\delta^2(\theta^{(k)}_{em} - \mu_e)\Big\rangle. \tag{55}$$
Denote the last inner-product term in the display above by $C_1$. Then
$$C_1 = -4\eta_k\sum_e\pi_{em}\langle\hat{w}^{(k)}_m - \theta^{(k)}_{em},\ \delta^2(\theta^{(k)}_{em} - \mu_e)\rangle - 4\eta_k\sum_e\pi_{em}\langle\theta^{(k)}_{em} - w^*_m,\ \delta^2(\theta^{(k)}_{em} - \mu_e)\rangle \tag{56}$$
$$\le 4\sum_e\pi_{em}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2 + 4\delta^4\eta_k^2\sum_e\pi_{em}\|\theta^{(k)}_{em} - \mu_e\|^2 - 4\eta_k\delta^2\sum_e\pi_{em}\|\theta^{(k)}_{em} - \mu_e\|^2 \underbrace{-\ 4\eta_k\delta^2\sum_e\pi_{em}\langle\mu_e - w^*_m,\ \theta^{(k)}_{em} - \mu_e\rangle}_{C_2}. \tag{57}$$
Since $\eta_k \le \frac{1}{4\delta^2}$,
$$\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}_k\|^2 \le \mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 + (8\delta^4\eta_k^2 - 4\eta_k\delta^2)\sum_e\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - \mu_e\|^2 + 4\sum_e\pi_{em}\mathbb{E}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2 + C_2 \tag{58}$$
$$\le \mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 - 2\eta_k\delta^2\sum_e\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - \mu_e\|^2 + 4\sum_e\pi_{em}\mathbb{E}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2 + C_2. \tag{59}$$
Note that
$$\sum_e\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - \mu_e\|^2 = \sum_{e\in S_m}\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - \mu_e\|^2 + \sum_{e\notin S_m}\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - \mu_e\|^2 \tag{61}$$
$$\ge \sum_{e\in S_m}\pi_{em}\big(\mathbb{E}\|\theta^{(k)}_{em} - w^*_m\|^2 - 2r\,\mathbb{E}\|\theta^{(k)}_{em} - w^*_m\| + r^2\big) + \sum_{e\notin S_m}\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - \mu_e\|^2 \tag{63}$$
$$= \sum_{e\in S_m}\pi_{em}\big(\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 + \mathbb{E}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2 - 2r\,\mathbb{E}\|\theta^{(k)}_{em} - w^*_m\| + r^2\big) + \sum_{e\notin S_m}\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - \mu_e\|^2. \tag{64}$$
And since $\hat{w}^{(k)}_m$ is the $\pi$-weighted average of the $\theta^{(k)}_{em}$ (Eq. (28)), we have
$$4\,\mathbb{E}\sum_e\pi_{em}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2 \le 4\,\mathbb{E}\sum_e\pi_{em}\|\hat{w}^{(0)}_m - \theta^{(k)}_{em}\|^2 \tag{65}$$
$$\le 4\sum_e\pi_{em}(K-1)\,\mathbb{E}\sum_{k'=0}^{k-1}\eta_{k'}^2\Big\|\frac{2}{n_e}X_e^\top X_e(\theta^{(k')}_{em} - \mu_e)\Big\|^2 \tag{66}$$
$$\le 16\eta_k^2E(K-1)^2\delta^4. \tag{67}$$
Thus,
$$\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}_k\|^2 \le \Big(1 - 2\eta_k\delta^2\sum_{e\in S_m}\pi_{em}\Big)\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 + 16\eta_k^2E(K-1)^2\delta^4 - 2\eta_k\delta^2\sum_{e\notin S_m}\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - \mu_e\|^2 \underbrace{-\ 4\eta_k\delta^2\sum_e\pi_{em}\langle\theta^{(k)}_{em} - \mu_e,\ \mu_e - w^*_m\rangle}_{C_3}. \tag{68}$$
Since
$$C_3 \le 2\eta_k\delta^2\sum_{e\notin S_m}\pi_{em}\|\mu_e - w^*_m\|_2^2 + 4\eta_k\delta^2\sum_{e\in S_m}\pi_{em}\|\theta^{(k)}_{em} - \mu_e\|_2\|\mu_e - w^*_m\|_2 \tag{69}$$
$$\le 2\eta_k\delta^2E(1-p) + 4\eta_k\delta^2\gamma_m r, \tag{70}$$
we have
$$\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}_k\|^2 \le (1 - 2\eta_k\delta^2\gamma_m p)\,\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 + 16\eta_k^2E(K-1)^2\delta^4 + 2\eta_k\delta^2E(1-p) + 4\eta_k\delta^2\gamma_m r. \tag{71}$$
Notice that
$$\mathbb{E}\|\hat{g}_k - g_k\|^2 = \mathbb{E}\sum_e\frac{4}{n_e^2}\pi_{em}\big\|\big(X_e^\top X_e - \mathbb{E}(X_e^\top X_e)\big)(\theta^{(k)}_{em} - \mu_e)\big\|^2 + \mathbb{E}\sum_e\frac{4}{n_e^2}\pi_{em}\|X_e^\top\epsilon_e\|^2 = \mathbb{E}\,\frac{O(dn_e)}{n_e^2}\delta^4 + \mathbb{E}\,\frac{O(dn_e)}{n_e^2}\delta^2\sigma^2, \tag{72}$$
so
$$\mathbb{E}\|\hat{w}^{(k+1)}_m - w^*_m\|_2^2 \le (1 - 2\eta_k\gamma_m p\delta^2)\,\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|_2^2 + \eta_kA_1 + \eta_k^2A_2,$$
where $A_1 = 4\delta^2\gamma_m r + 2\delta^2E(1-p)$ and $A_2 = 16E(K-1)^2\delta^4 + O\big(\frac{d}{n_e}\big)E(\delta^4 + \delta^2\sigma^2)$.
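The multiplicative E-step update used throughout these proofs (cf. Eq. (48)) is easy to state in code; the sketch below is ours, for illustration only:

```python
import numpy as np

def e_step(pi, losses):
    """One FOCUS E-step: reweight each agent's soft cluster label by
    exp(-loss of cluster model m on agent e's data), then normalize per agent.
    pi:     (E, M) current soft labels, each row sums to 1
    losses: (E, M) population loss of cluster model m on agent e's data
    """
    w = pi * np.exp(-losses)
    return w / w.sum(axis=1, keepdims=True)
```

For example, an agent currently at pi = (0.5, 0.5) whose data is fit much better by cluster 0 (loss 0.1 vs. 2.0) shifts most of its mass toward cluster 0 in a single step, which is the contraction Lemmas 1 and 3 quantify.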

B.2 CONVERGENCE OF MODELS WITH SMOOTH AND STRONGLY CONVEX LOSSES (THEOREM 2)

Here we present the detailed proof for Theorem 2.

B.2.1 KEY LEMMAS

We first state two lemmas, for the E-step and M-step updates respectively. The proofs of both lemmas are deferred to Appendix B.2.3.

Lemma 3. Suppose the loss function $L_{P_e}(\theta)$ is $L$-smooth and $\mu$-strongly convex for any cluster $m$. If $\|w^{(t)}_m - w^*_m\| \le \frac{\sqrt{\mu}R}{\sqrt{\mu} + \sqrt{L}} - r - \Delta$ for some $\Delta > 0$, then the E-step updates as
$$\pi^{(t+1)}_{em} \ge \frac{\pi^{(t)}_{em}}{\pi^{(t)}_{em} + (1 - \pi^{(t)}_{em})\exp(-\mu R\Delta)}.$$

For M-steps, the local agents are initialized with $\theta^{(0)}_{em} = w^{(t)}_m$. Then for $k = 1, \dots, K-1$, each agent uses local SGD to update its personal model:
$$\theta^{(k+1)}_{em} = \theta^{(k)}_{em} - \eta_kg_{em}(\theta^{(k)}_{em}) = \theta^{(k)}_{em} - \eta_k\nabla\sum_{i=1}^{n_e}\ell\big(h_{\theta^{(k)}_{em}}(x^{(i)}_e), y^{(i)}_e\big).$$
To analyze the aggregated model in Eq. (6), we define a sequence of virtual aggregated models $\hat{w}^{(k)}_m$:
$$\hat{w}^{(k)}_m = \frac{\sum_{e=1}^E\pi_{em}\theta^{(k)}_{em}}{\sum_{e'=1}^E\pi_{e'm}}. \tag{78}$$

Lemma 4. Suppose any agent $e \in S_m$ has a soft clustering label $\pi_{em} \ge p$. Then
$$\mathbb{E}\|\hat{w}^{(k+1)}_m - w^*_m\|_2^2 \le (1 - \eta_kA_0)\,\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|_2^2 + \eta_kA_1 + \eta_k^2A_2,$$
where
$$A_0 = \frac{2\gamma_mp\mu L}{\mu + L}, \qquad A_1 = 2\gamma_mLr\sqrt{\frac{2G}{\mu}} + \frac{G(1-p)E}{\mu}\Big(4L + \frac{6}{\mu + L}\Big) + O(r^2), \qquad A_2 = \frac{4E(K-1)^2GL^2}{\mu} + \frac{E\sigma^2}{n_e}.$$

Remark. Using this recursive relation, if the learning rate $\eta_k = \eta$ is fixed, the sequence $\hat{w}^{(k)}_m$ has a convergence rate of
$$\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 \le (1 - \eta A_0)^k\,\mathbb{E}\|\hat{w}^{(0)}_m - w^*_m\|^2 + \eta k(A_1 + \eta A_2).$$

B.2.2 COMPLETING THE PROOF OF THEOREM 2

We restate the conclusion of Theorem 2:
$$\pi^{(T)}_{em} \ge \frac{1}{1 + (M-1)\exp(-\mu R\Delta_0T)}, \quad \forall e \in S_m, \tag{84}$$
$$\mathbb{E}\|w^{(T)}_m - w^*_m\|_2^2 \le (1 - \eta A)^{KT}\big(\|w^{(0)}_m - w^*_m\|_2^2 + B\big) + O(Kr) + \frac{ME\beta}{\sqrt{T}}\,O\Big(K^3, \frac{\sigma^2}{n_e}\Big), \tag{85}$$
where $T$ is the total number of communication rounds, $K$ is the number of local iterations in each round, $\gamma_m = |S_m|$ is the number of agents in the $m$-th cluster, and
$$A = \frac{2\gamma_m}{M}\cdot\frac{\mu L}{\mu + L}, \qquad B = \frac{\eta GKE(M-1)\big(\frac{4L}{\mu} + \frac{6}{\mu(\mu + L)}\big)}{(1 - \eta A)^K - \exp(-\mu R\Delta_0)}. \tag{86}$$

Proof. The proof is similar to that of Theorem 1 for linear models: we use an induction argument based on Lemmas 3 and 4. Suppose Eq. (84) holds at step $t$, and suppose
$$\mathbb{E}\|w^{(t)}_m - w^*_m\|_2^2 \le (1 - \eta A)^{Kt}\|w^{(0)}_m - w^*_m\|_2^2 + B\big((1 - \eta A)^{Kt} - \exp(-\mu R\Delta_0t)\big) + \frac{\eta C}{1 - (1 - \eta A)^K}, \tag{87}$$
where
$$C = \frac{4\eta EGK^3L^2}{\mu} + \Big(2\gamma_mLr\sqrt{\frac{2G}{\mu}} + O(r^2)\Big) + \frac{\eta EK\sigma^2}{n_e}. \tag{88}$$
Then for any $e \in S_m$,
$$\pi^{(t+1)}_{em} \ge \frac{\pi^{(t)}_{em}}{\pi^{(t)}_{em} + (1 - \pi^{(t)}_{em})\exp(-\mu R\Delta_t)} \tag{89}$$
$$\ge \frac{1}{1 + (M-1)\exp(-\mu R\Delta_0t)\exp(-\mu R\Delta_t)} \tag{90}$$
$$\ge \frac{1}{1 + (M-1)\exp(-\mu R\Delta_0(t+1))}. \tag{91}$$
We recall the virtual sequence $\hat{w}^{(k)}_m$ defined in Eq. (78). Models are synchronized after $K$ rounds of local iterations, so $w^{(t+1)}_m = \hat{w}^{(K)}_m$. Thus, according to Lemma 4,
$$\mathbb{E}\|w^{(t+1)}_m - w^*_m\|_2^2 = \mathbb{E}\|\hat{w}^{(K)}_m - w^*_m\|_2^2 \tag{92}$$
$$\le (1 - \eta A_0)^K\,\mathbb{E}\|w^{(t)}_m - w^*_m\|_2^2 + \eta K(A_1 + \eta A_2) \tag{93}$$
$$\le (1 - \eta A_0)^K\Big[(1 - \eta A)^{Kt}\,\mathbb{E}\|w^{(0)}_m - w^*_m\|^2 + B\big((1 - \eta A)^{Kt} - \exp(-\mu R\Delta_0t)\big) + \frac{\eta C}{1 - (1 - \eta A)^K}\Big] + \eta K(A_1 + \eta A_2) \tag{94}$$
$$\le (1 - \eta A)^{(t+1)K}\,\mathbb{E}\|w^{(0)}_m - w^*_m\|^2 + \underbrace{(1 - \eta A)^KB\big((1 - \eta A)^{Kt} - \exp(-\mu R\Delta_0t)\big) + \eta\frac{GK(1-p)E}{\mu}\Big(4L + \frac{6}{\mu + L}\Big)}_{F_1} + \underbrace{(1 - \eta A)^K\frac{\eta C}{1 - (1 - \eta A)^K} + \eta K\Big(2\gamma_mLr\sqrt{\frac{2G}{\mu}} + O(r^2)\Big) + \eta^2KA_2}_{F_2}. \tag{95}$$
For $F_1$, we use the fact that
$$\pi^{(t+1)}_{em} \ge \frac{1}{1 + (M-1)\exp(-\mu R\Delta_0(t+1))} \ge 1 - (M-1)\exp(-\mu R\Delta_0(t+1)),$$
so
$$F_1 \le (1 - \eta A)^KB\big((1 - \eta A)^{Kt} - \exp(-\mu R\Delta_0t)\big) + \eta\frac{GKE(M-1)\exp(-\mu R\Delta_0t)}{\mu}\Big(4L + \frac{6}{\mu + L}\Big) \tag{96}$$
$$\le B\big((1 - \eta A)^{(t+1)K} - \exp(-\mu R\Delta_0(t+1))\big). \tag{97}$$
For $F_2$, we have
$$F_2 \le (1 - \eta A)^K\frac{\eta C}{1 - (1 - \eta A)^K} + \eta K\Big(2\gamma_mLr\sqrt{\frac{2G}{\mu}} + O(r^2)\Big) + \frac{4EGL^2\eta^2K^3}{\mu} + \frac{\eta^2KE\sigma^2}{n_e} \tag{98}$$
$$\le \frac{\eta C}{1 - (1 - \eta A)^K}. \tag{99}$$
Combining $F_1$ and $F_2$ finishes the induction. Moreover, since $K \ge 1$, we have
$$\frac{\eta C}{1 - (1 - \eta A)^K} \le \frac{C}{A} = O(Kr) + \frac{ME\beta}{\sqrt{T}}\,O\Big(K^3, \frac{\sigma^2}{n_e}\Big). \tag{100}$$
Combining Eq. (87) and Eq. (100) completes our proof.

B.2.3 DEFERRED PROOFS OF KEY LEMMAS

Proof of Lemma 3. According to Algorithm 1,
$$\pi^{(t+1)}_{em} = \frac{\pi^{(t)}_{em}}{\pi^{(t)}_{em} + \sum_{m'\neq m}\pi^{(t)}_{em'}\exp\big(\mathbb{E}\,\ell(x, y; w^{(t)}_m) - \mathbb{E}\,\ell(x, y; w^{(t)}_{m'})\big)} \tag{101}$$
$$\ge \frac{\pi^{(t)}_{em}}{\pi^{(t)}_{em} + (1 - \pi^{(t)}_{em})\exp\big(\max_{m'\neq m}\big(L_{P_e}(w^{(t)}_m) - L_{P_e}(w^{(t)}_{m'})\big)\big)}. \tag{102}$$
Since $L_{P_e}$ is $L$-smooth and $\mu$-strongly convex,
$$L_{P_e}(w^{(t)}_m) - L_{P_e}(w^{(t)}_{m'}) \le \frac{L}{2}\|w^{(t)}_m - \theta^*_e\|^2 - \frac{\mu}{2}\|w^{(t)}_{m'} - \theta^*_e\|^2 \le \frac{L}{2}\Big(\frac{\sqrt{\mu}R}{\sqrt{\mu} + \sqrt{L}} - \Delta\Big)^2 - \frac{\mu}{2}\Big(\frac{\sqrt{L}R}{\sqrt{\mu} + \sqrt{L}} + \Delta\Big)^2 \le -\sqrt{\mu L}R\Delta + \frac{L - \mu}{2}\Delta^2 \le -\mu R\Delta. \tag{103}$$
Combining Eq. (102) and Eq. (103) completes the proof.

Proof of Lemma 4. We define $g^{(k)}_m = \sum_e\pi_{em}\frac{1}{n_e}\sum_{i=1}^{n_e}\nabla\ell\big(h_{\theta^{(k)}_{em}}(x^{(i)}_e), y^{(i)}_e\big)$ and $\hat{g}^{(k)}_m = \sum_e\pi_{em}\nabla L_e(\theta^{(k)}_{em})$. Then
$$\mathbb{E}\|\hat{w}^{(k+1)}_m - w^*_m\|^2 = \mathbb{E}\|\hat{w}^{(k)}_m - w^*_m - \eta_kg^{(k)}_m\|^2 \tag{104}$$
$$= \mathbb{E}\|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}^{(k)}_m\|^2 + \eta_k^2\,\mathbb{E}\|g^{(k)}_m - \hat{g}^{(k)}_m\|^2 + 2\eta_k\,\mathbb{E}\langle\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}^{(k)}_m,\ \hat{g}^{(k)}_m - g^{(k)}_m\rangle \tag{105}$$
$$= \mathbb{E}\|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}^{(k)}_m\|^2 + \eta_k^2\,\mathbb{E}\|g^{(k)}_m - \hat{g}^{(k)}_m\|^2. \tag{106}$$
The first term can be decomposed into
$$\|\hat{w}^{(k)}_m - w^*_m - \eta_k\hat{g}^{(k)}_m\|^2 = \|\hat{w}^{(k)}_m - w^*_m\|^2 + \eta_k^2\|\hat{g}^{(k)}_m\|^2 - 2\eta_k\langle\hat{w}^{(k)}_m - w^*_m,\ \hat{g}^{(k)}_m\rangle. \tag{107}$$
Note that
$$\|\hat{g}^{(k)}_m\|^2 \le \sum_{e=1}^E\pi_{em}\|\nabla L_e(\theta^{(k)}_{em})\|^2, \tag{108}$$
$$-\langle\hat{w}^{(k)}_m - w^*_m,\ \hat{g}^{(k)}_m\rangle = -\sum_{e=1}^E\pi_{em}\langle\hat{w}^{(k)}_m - \theta^{(k)}_{em},\ \nabla L_e(\theta^{(k)}_{em})\rangle - \sum_{e=1}^E\pi_{em}\langle\theta^{(k)}_{em} - w^*_m,\ \nabla L_e(\theta^{(k)}_{em})\rangle. \tag{109}$$
We further bound the two terms in Eq. (109) by
$$-2\langle\hat{w}^{(k)}_m - \theta^{(k)}_{em},\ \nabla L_e(\theta^{(k)}_{em})\rangle \le \frac{1}{\eta_k}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2 + \eta_k\|\nabla L_e(\theta^{(k)}_{em})\|^2 \tag{110}$$
and
$$\langle\theta^{(k)}_{em} - w^*_m,\ \nabla L_e(\theta^{(k)}_{em})\rangle \ge \langle\theta^{(k)}_{em} - w^*_m,\ \nabla L_e(\theta^{(k)}_{em}) - \nabla L_e(w^*_m)\rangle - \|\nabla L_e(w^*_m)\|_2\|\theta^{(k)}_{em} - w^*_m\|_2 \tag{111}$$
$$\ge \frac{\mu L}{\mu + L}\|\theta^{(k)}_{em} - w^*_m\|^2 + \frac{1}{\mu + L}\|\nabla L_e(\theta^{(k)}_{em}) - \nabla L_e(w^*_m)\|^2 - \|\nabla L_e(w^*_m)\|_2\|\theta^{(k)}_{em} - w^*_m\|_2. \tag{112}$$
Therefore,
$$\mathbb{E}\|\hat{w}^{(k+1)}_m - w^*_m\|^2 \le \underbrace{\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 - \frac{2\eta_k\mu L}{\mu + L}\sum_e\pi_{em}\mathbb{E}\|\theta^{(k)}_{em} - w^*_m\|^2}_{E_1} + \underbrace{\sum_e\pi_{em}\mathbb{E}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2}_{E_2} + \underbrace{2\eta_k^2\sum_e\pi_{em}\mathbb{E}\|\nabla L_e(\theta^{(k)}_{em})\|^2 - \frac{2\eta_k}{\mu + L}\sum_e\pi_{em}\mathbb{E}\|\nabla L_e(\theta^{(k)}_{em}) - \nabla L_e(w^*_m)\|^2}_{E_3} + \underbrace{2\eta_k\,\mathbb{E}\sum_e\pi_{em}\|\theta^{(k)}_{em} - w^*_m\|_2\|\nabla L_e(w^*_m)\|_2}_{E_4} + \underbrace{\eta_k^2\,\mathbb{E}\|g^{(k)}_m - \hat{g}^{(k)}_m\|^2}_{E_5}. \tag{113}$$
For $E_1$,
$$E_1 = \mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 - \frac{2\eta_k\mu L}{\mu + L}\mathbb{E}\Big[\sum_e\pi_{em}\|\hat{w}^{(k)}_m - w^*_m\|^2 + \sum_e\pi_{em}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2\Big] \le \Big(1 - \frac{2\eta_k\mu Lp\gamma_m}{\mu + L}\Big)\mathbb{E}\|\hat{w}^{(k)}_m - w^*_m\|^2 + E_2. \tag{114}$$
For $E_2$,
$$E_2 = \mathbb{E}\sum_e\pi_{em}\|\hat{w}^{(k)}_m - \theta^{(k)}_{em}\|^2 \le \mathbb{E}\sum_e\pi_{em}\|w^{(0)}_m - \theta^{(k)}_{em}\|^2 \le \sum_e\pi_{em}(K-1)\,\mathbb{E}\sum_{k'=0}^{k-1}\eta_{k'}^2\|g_{em}(\theta^{(k')}_{em})\|^2 \le \frac{2\eta_k^2E(K-1)^2GL^2}{\mu}. \tag{115}$$
For $E_3$, assuming $\eta_k \le \frac{1}{2(\mu + L)}$,
$$E_3 \le 6\eta_k\,\mathbb{E}\sum_e\pi_{em}\frac{\|\nabla L_e(w^*_m)\|^2}{\mu + L} \le 6\eta_k\sum_{e\in S_m}\pi_{em}\frac{L^2r^2}{\mu + L} + 6\eta_k\sum_{e\notin S_m}\pi_{em}\frac{2G}{\mu(\mu + L)} \le \eta_kO(r^2) + \frac{6\eta_kG(1-p)E}{\mu(\mu + L)}. \tag{116}$$
For $E_4$,
$$E_4 = 2\eta_k\,\mathbb{E}\sum_{e\in S_m}\pi_{em}\|\theta^{(k)}_{em} - w^*_m\|_2\|\nabla L_e(w^*_m)\|_2 + 2\eta_k\,\mathbb{E}\sum_{e\notin S_m}\pi_{em}\|\theta^{(k)}_{em} - w^*_m\|_2\|\nabla L_e(w^*_m)\|_2 \le 2\eta_k\gamma_mLr\sqrt{\frac{2G}{\mu}} + 2\eta_k(1-p)EL\sqrt{\frac{2G}{\mu}}. \tag{117}$$
For $E_5$,
$$E_5 = \eta_k^2\,\mathbb{E}\|g^{(k)}_m - \hat{g}^{(k)}_m\|^2 \le \eta_k^2\,\mathbb{E}\sum_e\pi_{em}\Big\|\frac{1}{n_e}\sum_{i=1}^{n_e}\nabla\ell\big(h_{\theta^{(k)}_{em}}(x^{(i)}_e), y^{(i)}_e\big) - \nabla L_e(\theta^{(k)}_{em})\Big\|^2 \le \frac{\eta_k^2E\sigma^2}{n_e}. \tag{118}$$
Combining Eq. (114) to Eq. (118) yields the conclusion of Lemma 4.

C FAIRNESS ANALYSIS

C.1 PROOF OF THEOREM 3

Proof. Let the first cluster $m_1$ contain agents $\mu_1, \dots, \mu_{E-1}$, while the second cluster contains only the outlier $\mu_E$. Then, for $e = 1, \dots, E-1$,
$$E_e(w_{m_1}) = \delta^2\Big\|\mu_e - \frac{\sum_{e'=1}^{E-1}\mu_{e'}}{E-1}\Big\|^2 \le \delta^2r^2,$$
and for the outlier agent, the expected output is exactly its optimal solution, so $E_E(w_{m_2}) = 0$. As a result, the fairness of this algorithm is bounded by
$$\mathrm{FAA}_{focus}(P) = \max_{i,j\in[E]}|E_i(\Pi, W) - E_j(\Pi, W)| \le \delta^2r^2.$$
On the other hand, the expected final weight of the FedAvg algorithm is $w_{avg} = \bar{\mu} = \frac{\sum_{e=1}^E\mu_e}{E}$, so the expected loss for agent $e$ is
$$\mathbb{E}_{(x,y)\sim P_e}\,\ell_\theta(x) = \mathbb{E}_{x\sim\mathcal{N}(0,\delta^2I_d),\,\epsilon\sim\mathcal{N}(0,\sigma_e^2)}\big[(\mu_e^\top x + \epsilon - \bar{\mu}^\top x)^2\big] = \sigma_e^2 + \delta^2\|\mu_e - \bar{\mu}\|^2.$$
The infimum risk for agent 1 is $\sigma_1^2$, and after subtracting it from the expected loss, we have
$$E_1(w_{avg}) = \delta^2\|\mu_1 - \bar{\mu}\|^2 \tag{123}$$
$$= \delta^2\Big\|\mu_1 - \frac{\sum_{e=1}^{E-1}\mu_e}{E} - \frac{\mu_E}{E}\Big\|^2 \tag{124}$$
$$\le \delta^2\Big(r\cdot\frac{E-1}{E} + \frac{\|\mu_1 - \mu_E\|}{E}\Big)^2 \tag{125}$$
$$\le \delta^2\Big(r\cdot\frac{E-1}{E} + \frac{R + r}{E}\Big)^2 = \delta^2\Big(r + \frac{R}{E}\Big)^2. \tag{126}$$
However, for the outlier agent,
$$E_E(w_{avg}) = \delta^2\|\mu_E - \bar{\mu}\|^2 \tag{127}$$
$$= \delta^2\Big\|\frac{E-1}{E}\mu_E - \frac{\sum_{e=1}^{E-1}\mu_e}{E}\Big\|^2 \tag{128}$$
$$\ge \Big(\frac{E-1}{E}\Big)^2\delta^2R^2. \tag{129}$$
Hence,
$$\mathrm{FAA}_{avg}(P) \ge E_E(w_{avg}) - E_1(w_{avg}) \ge \delta^2\Big(\frac{R^2(E-2) - 2Rr}{E} - r^2\Big). \tag{130}$$

Remark. When there are $E_k > 1$ outliers, we can similarly derive the FAA of the FedAvg algorithm:
$$E_1(w_{avg}) \le \delta^2\Big(r + \frac{E_kR}{E}\Big)^2, \tag{131}$$
$$E_E(w_{avg}) \ge \delta^2\Big(\frac{E - E_k}{E}R - \frac{E_k}{E}r\Big)^2, \tag{132}$$
so as long as $E_k < \frac{E}{2}$,
$$\mathrm{FAA}_{avg} \ge E_E(w_{avg}) - E_1(w_{avg}) = \Omega(\delta^2R^2). \tag{133}$$
The FOCUS algorithm produces a result with
$$E_1(w_{m_1}) \le \delta^2r^2 \quad \text{and} \quad E_E(w_{m_2}) \le \delta^2r^2, \tag{134}$$
hence we still have $\mathrm{FAA}_{focus} \le \delta^2r^2$.

C.2 PROOF OF THEOREM 4

Proof. Note that the local population loss of agent $i$ with weights $\theta$ is $L_i(\theta) = \int p_i(x, y)\,\ell(f_\theta(x), y)\,dx\,dy$. Thus,
$$|L_i(\theta^*_i) - L_j(\theta^*_i)| = \Big|\int\big(p_i(x, y) - p_j(x, y)\big)\,\ell(f_{\theta^*_i}(x), y)\,dx\,dy\Big| \tag{138}$$
$$\le G\int|p_i(x, y) - p_j(x, y)|\,dx\,dy \le Gr. \tag{139}$$
Hence,
$$L_i(\theta^*_j) \le L_j(\theta^*_j) + Gr \le L_j(\theta^*_i) + Gr \le L_i(\theta^*_i) + 2Gr. \tag{140}$$
For the cluster that combines agents {1, . . .
, E-1} together, the weight converges to $\bar{\theta}' = \frac{1}{E-1}\sum_{i=1}^{E-1}\theta^*_i$. Then for all $i = 1, \dots, E-1$, the population loss of the ensemble prediction satisfies
$$L_i(\theta, \Pi) = L_i\Big(\frac{\sum_{j=1}^{E-1}\theta^*_j}{E-1}\Big) \tag{141}$$
$$\le \frac{1}{E-1}\sum_{j=1}^{E-1}L_i(\theta^*_j) \le L_i(\theta^*_i) + \frac{2Gr}{E-1}. \tag{142}$$
Therefore, for any $i = 1, \dots, E-1$,
$$E_i(\theta, \Pi) \le \frac{2Gr}{E-1}. \tag{143}$$
Since $E_E(\theta, \Pi) = 0$,
$$\mathrm{FAA}_{focus}(W, \Pi) \le \frac{2Gr}{E-1}. \tag{144}$$
Now we prove the second part of Theorem 4, on the fairness of the FedAvg algorithm. For simplicity, we define $B = \frac{2Gr}{E-1}$ in this proof. We also denote the mean of all optimal weights by $\bar{\theta} = \frac{\sum_{i=1}^E\theta^*_i}{E}$ and write $\bar{\theta}' = \frac{\sum_{i=1}^{E-1}\theta^*_i}{E-1}$. Recall that we assume the loss functions are $L$-smooth, so
$$L_E(\theta^*_i) \le L_E(\bar{\theta}') + \langle\nabla L_E(\bar{\theta}'),\ \theta^*_i - \bar{\theta}'\rangle + \frac{L}{2}\|\bar{\theta}' - \theta^*_i\|^2. \tag{146}$$
Averaging over $i = 1, \dots, E-1$, we get
$$L_E(\bar{\theta}') \ge \frac{1}{E-1}\sum_{i=1}^{E-1}L_E(\theta^*_i) - \frac{1}{E-1}\Big\langle\nabla L_E(\bar{\theta}'),\ \sum_{i=1}^{E-1}(\theta^*_i - \bar{\theta}')\Big\rangle - \frac{L}{2(E-1)}\sum_{i=1}^{E-1}\|\bar{\theta}' - \theta^*_i\|^2 \tag{147}$$
$$= \frac{1}{E-1}\sum_{i=1}^{E-1}L_E(\theta^*_i) - \frac{L}{2(E-1)}\sum_{i=1}^{E-1}\|\bar{\theta}' - \theta^*_i\|^2 \tag{148}$$
$$\ge L_E(\theta^*_E) + R - \frac{LB}{\mu}. \tag{149}$$
The last inequality uses the $\mu$-strong convexity, which implies
$$B \ge L_i(\bar{\theta}') - L_i(\theta^*_i) \ge \frac{\mu}{2}\|\bar{\theta}' - \theta^*_i\|^2. \tag{150}$$
By $L$-smoothness, we have
$$L_E(\bar{\theta}') \le L_E(\bar{\theta}) + \langle\nabla L_E(\bar{\theta}),\ \bar{\theta}' - \bar{\theta}\rangle + \frac{L}{2}\|\bar{\theta}' - \bar{\theta}\|^2, \tag{151}$$
$$L_E(\theta^*_E) \le L_E(\bar{\theta}) + \langle\nabla L_E(\bar{\theta}),\ \theta^*_E - \bar{\theta}\rangle + \frac{L}{2}\|\theta^*_E - \bar{\theta}\|^2. \tag{152}$$
Note that $\bar{\theta} = \frac{(E-1)\bar{\theta}' + \theta^*_E}{E}$, so we take a weighted sum of the above two inequalities to cancel out the inner-product terms. We thus derive
$$L_E(\bar{\theta}) \ge \frac{(E-1)L_E(\bar{\theta}') + L_E(\theta^*_E) - \frac{L}{2}(E-1)\|\bar{\theta}' - \bar{\theta}\|^2 - \frac{L}{2}\|\theta^*_E - \bar{\theta}\|^2}{E} \tag{153}$$
$$= \frac{E-1}{E}\Big(R - \frac{LB}{\mu} - \frac{L\|\theta^*_E - \bar{\theta}'\|^2}{2E}\Big) + L_E(\theta^*_E). \tag{154}$$
Note that $L_E(\cdot)$ is $\mu$-strongly convex, which gives
$$R - \frac{LB}{\mu} \ge L_E(\bar{\theta}') - L_E(\theta^*_E) \ge \frac{\mu}{2}\|\theta^*_E - \bar{\theta}'\|^2, \tag{155}$$
so
$$L_E(\bar{\theta}) \ge \Big(1 - \frac{L}{\mu E}\Big)\cdot\frac{E-1}{E}\Big(R - \frac{LB}{\mu}\Big) + L_E(\theta^*_E), \tag{156}$$
and
$$E_E(\bar{\theta}) \ge \Big(1 - \frac{L}{\mu E}\Big)\cdot\frac{E-1}{E}\Big(R - \frac{LB}{\mu}\Big). \tag{157}$$
On the other hand, for agents $i = 1, \dots, E-1$ we know
$$L_i(\bar{\theta}) \le L_i(\bar{\theta}') + \langle\nabla L_i(\bar{\theta}'),\ \bar{\theta} - \bar{\theta}'\rangle + \frac{L}{2}\|\bar{\theta} - \bar{\theta}'\|^2. \tag{158}$$
By $L$-smoothness,
$$\|\nabla L_i(\bar{\theta}')\|_2 \le L\|\bar{\theta}' - \theta^*_i\| \le L\sqrt{\frac{2B}{\mu}}. \tag{159}$$
So
$$L_i(\bar{\theta}) \le L_i(\theta^*_i) + B + \frac{2L}{\mu E}\sqrt{B\Big(R - \frac{LB}{\mu}\Big)} + \frac{L\big(R - \frac{LB}{\mu}\big)}{\mu E^2}, \tag{160}$$
$$E_i(\bar{\theta}) \le B + \frac{2L}{\mu E}\sqrt{B\Big(R - \frac{LB}{\mu}\Big)} + \frac{L\big(R - \frac{LB}{\mu}\big)}{\mu E^2}. \tag{161}$$
In conclusion, the fairness can be estimated by
$$\mathrm{FAA}_{avg}(P) \ge E_E(\bar{\theta}) - E_1(\bar{\theta}) \tag{162}$$
$$\ge \Big(1 - \frac{L}{\mu E}\Big)\frac{E-1}{E}\Big(R - \frac{LB}{\mu}\Big) - B - \frac{2L}{\mu E}\sqrt{B\Big(R - \frac{LB}{\mu}\Big)} - \frac{L\big(R - \frac{LB}{\mu}\big)}{\mu E^2}. \tag{163}$$

D.1 EXPERIMENTAL SETUP

Machines. We simulate the federated learning setup on a Linux machine with AMD Ryzen Threadripper 3990X 64-Core CPUs and 4 NVIDIA GeForce RTX 3090 GPUs.

Hyperparameters. For each FL experiment, we implement both the FOCUS and FedAvg algorithms using the SGD optimizer with the same hyperparameter settings. Detailed hyperparameter specifications are listed in Table 10 for the different datasets, including the learning rate, the number of local training steps, the batch size, and the number of training epochs.
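The linear-Gaussian comparison of Theorem 3 can also be checked numerically. The sketch below is a toy simulation of ours (the dimensions, radii, and random seed are our own choices, not from the paper's experiments): under FedAvg the excess risk of agent e is δ²∥µ_e − µ̄∥², while cluster-wise models shrink the worst gap.

```python
import numpy as np

rng = np.random.default_rng(0)
delta2 = 1.0                          # feature variance delta^2
E, d, r, R = 10, 5, 0.1, 5.0          # E agents; one outlier mean ~R away
mus = rng.normal(0.0, r / np.sqrt(d), size=(E, d))  # tight inlier cluster
mus[-1] += R / np.sqrt(d)             # agent E is the outlier

def excess(mu, w):
    """Excess risk of an agent with mean mu under linear model w."""
    return delta2 * float(np.sum((mu - w) ** 2))

# FedAvg: one global model at the overall mean of all agents
w_avg = mus.mean(axis=0)
risks_avg = [excess(mu, w_avg) for mu in mus]
faa_avg = max(risks_avg) - min(risks_avg)

# FOCUS-style clustering: inliers share one model, the outlier gets its own
w_in = mus[:-1].mean(axis=0)
risks_focus = [excess(mu, w_in) for mu in mus[:-1]] + [0.0]
faa_focus = max(risks_focus) - min(risks_focus)

assert faa_focus < faa_avg            # clustering shrinks the FAA gap
```

With these values, the FedAvg gap is on the order of δ²R² while the clustered gap stays on the order of δ²r², matching the Ω(δ²R²) versus δ²r² separation proved above.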

D.2 ADDITIONAL EXPERIMENTAL RESULTS

Histogram of loss on CIFAR. Fig. 3 shows the surrogate excess risk of every agent trained with FedAvg and FOCUS on the CIFAR dataset. For the outlier cluster whose images are rotated 180 degrees (i.e., the 2nd cluster), FedAvg yields the highest test loss for the 9th agent, resulting in a high excess risk of 2.74. In addition, the agents in the 1st cluster trained by FedAvg are influenced by the FedAvg global model and also have high excess risk. On the other hand, FOCUS successfully identifies the outlier distribution in the 2nd cluster, leading to much lower and more uniformly distributed excess risks across agents. Notably, FOCUS reduces the surrogate excess risk of the 9th agent to 0.44.

Comparison with existing fair FL methods. We present the full results of existing fair federated learning algorithms in our data settings in terms of FAA. The results in Tables 11 and 12 show that FOCUS achieves the lowest FAA score compared to existing fair FL methods. We note that fair FL methods (i.e., q-FFL (Li et al., 2020b) and AFL (Mohri et al., 2019)) have lower FAA scores than FedAvg, but their average test accuracy is worse. This is mainly because they aim to improve bad agents (i.e., those with high training loss), thus sacrificing the accuracy of agents with high-quality data.

Table 11: Comparison of FOCUS and the existing fair federated learning algorithms on the rotated MNIST dataset.

Comparison with state-of-the-art FL methods. We compare FOCUS with other SOTA FL methods, including FedMA (Wang et al., 2020), Bayesian nonparametric FL (Yurochkin et al., 2019), and FedProx (Li et al., 2020a).
Specifically, the matching algorithm in Yurochkin et al. (2019) is designed only for fully-connected layers, and the matching algorithm in Wang et al. (2020) is designed for fully-connected and convolutional layers, while our experiments on CIFAR use ResNet-18, whose batch-norm layers and residual modules are not handled by Wang et al. (2020) or Yurochkin et al. (2019). Therefore, we evaluate Li et al. (2020a), Wang et al. (2020), and Yurochkin et al. (2019) on MNIST with a fully-connected network, and Li et al. (2020a) on CIFAR with a ResNet-18 model. The results on MNIST and CIFAR in Tables 13 and 14 show that FOCUS achieves higher average test accuracy and a lower FAA score than these SOTA FL methods.

E BROADER IMPACT

This paper presents a novel definition of fairness via agent-level awareness for federated learning, which considers the heterogeneity of local data distributions among agents. We develop FAA as a fairness metric for federated learning and design the FOCUS algorithm to improve the corresponding fairness. We believe that FAA can benefit the ML community as a standard measurement of fairness for FL, based on our theoretical analyses and empirical results. A possible negative societal impact may come from misunderstanding of our work. For example, a low FAA does not necessarily mean low loss or high accuracy; additional utility metrics are required to evaluate the overall performance of different federated learning algorithms. We have tried our best to define our goal and metrics clearly in Section 3 and to state all assumptions for our theorems accurately in Section 4 to avoid potential misuse of our framework.



FAIRNESS VIA AGENT-AWARENESS IN FL (FAA) WITH HETEROGENEOUS DATA

Given a set of E agents participating in the FL training process, each agent e only has access to its local dataset $D_e = \{(x_e^{(i)}, y_e^{(i)})\}_{i=1}^{n_e}$, which is sampled from a distribution $P_e$. The goal of standard FedAvg training is to minimize the overall loss $L_E(\theta)$ based on the local loss $L_e(\theta)$ of each agent:
$$\min_\theta L_E(\theta) = \sum_{e\in[E]}\frac{|D_e|}{n}L_e(\theta), \qquad L_e(\theta) = \mathbb{E}_{(x,y)\sim P_e}\,\ell(h_\theta(x), y), \tag{1}$$
where $\ell(\cdot,\cdot)$ is a loss function given the model prediction $h_\theta(x)$ and label $y$ (e.g., the cross-entropy loss), $n = \sum_{e\in[E]}|D_e|$ is the total number of training samples, and $\theta$ denotes the parameters of the trained global model.
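As a small illustration of the data-size weighting in Eq. (1) (a sketch of ours, with hypothetical loss values):

```python
import numpy as np

def fedavg_objective(local_losses, local_sizes):
    """Eq. (1): L(theta) = sum_e |D_e|/n * L_e(theta),
    i.e. the data-size-weighted average of the local losses."""
    sizes = np.asarray(local_sizes, dtype=float)
    return float(np.dot(sizes / sizes.sum(), local_losses))

# An agent holding 90% of the data dominates the FedAvg objective:
fedavg_objective([1.0, 0.0], [90, 10])  # weighted loss 0.9
```

This weighting is exactly why, under heterogeneous data, a large group of agents with low-quality data can dominate the objective at the expense of a minority with high-quality data.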

Figure 1: The excess risk of different agents trained with FedAvg and FOCUS on MNIST (a) and Yelp/IMDb text data (b). Ci denotes the i-th cluster.

Rotated CIFAR with 10 clients.

Figure 2: The test accuracy and test loss of different methods over FL communication rounds on different datasets. FOCUS converges faster and achieves higher accuracy and lower loss than other methods.


Figure 3: The excess risk of different agents trained with FedAvg (left) and FOCUS (right) on CIFAR dataset.

The goal of FOCUS is to simultaneously optimize the soft clustering labels Π and the model weights W. Specifically, Π = {π_em}, e ∈ [E], m ∈ [M], are the dynamic soft clustering labels, representing the estimated probability that agent e belongs to cluster m; W = {w_m}, m ∈ [M], are the model weights of the M data clusters. Given E agents with datasets D_1, ..., D_E, our FOCUS algorithm follows a two-step scheme that alternately optimizes Π and W. E-step. The expectation step updates the cluster labels Π given the current estimate of (Π, W).
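The alternating scheme can be sketched as follows. This is a schematic sketch under simplified assumptions: `local_loss` and `local_update` are placeholder callbacks standing in for the paper's local SGD and aggregation in Eq. (6), and the toy quadratic losses at the bottom are ours:

```python
import numpy as np

def focus_round(pi, weights, local_loss, local_update):
    """One FOCUS round: E-step on soft labels, M-step on cluster weights.
    pi:      (E, M) soft cluster labels, rows sum to 1
    weights: (M, d) cluster model weights
    """
    E, M = pi.shape
    # E-step: pi_em proportional to pi_em * exp(-L_e(w_m)), normalized per agent
    losses = np.array([[local_loss(e, weights[m]) for m in range(M)]
                       for e in range(E)])
    pi = pi * np.exp(-losses)
    pi = pi / pi.sum(axis=1, keepdims=True)
    # M-step: pi-weighted aggregation of locally updated cluster models
    new_weights = np.empty_like(weights)
    for m in range(M):
        updates = np.stack([local_update(e, weights[m]) for e in range(E)])
        new_weights[m] = (pi[:, m, None] * updates).sum(axis=0) / pi[:, m].sum()
    return pi, new_weights

# Toy instance: 4 agents in 2 groups, quadratic local losses
mus = np.array([[0.0], [0.1], [5.0], [5.1]])          # each agent's optimum
loss = lambda e, w: float(np.sum((w - mus[e]) ** 2))
step = lambda e, w: w - 0.5 * 2.0 * (w - mus[e])      # one exact GD step -> mu_e
pi = np.full((4, 2), 0.5)
weights = np.array([[0.5], [4.5]])
pi, weights = focus_round(pi, weights, loss, step)
```

After a single round in this toy setting, the soft labels become nearly one-hot for each group and each cluster model moves to its group's mean, illustrating how the E- and M-steps reinforce each other.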

Specifically, we assume each dataset $D_e$ has a mean vector $\mu_e \in \mathbb{R}^d$. Input: data $D_1, \dots, D_E$; E remote agents and M learning models.

between distributions $P_i$ and $P_E$ such that $D_H(P_i, P_E) \ge \frac{LR}{4\mu}$. (See proofs in Appendix C.3.) Theorem 4. The fairness FAA achieved by FOCUS with two clusters (M = 2) is

al. (2019). For FedAvg, we evaluate the trained global model on each agent's test data; for FOCUS, we train M models corresponding to M clusters, and use the soft clustering labels Π = {πem} e∈[E],m∈[M ]

Comparison of FOCUS, FedAvg, and fair FL algorithms q-FFL, AFL, Ditto and CGSV, in terms of average test accuracy (Avg Acc), average test loss (Avg Loss), fairness FAA and existing fairness metric Agnostic loss. FOCUS achieves the best fairness measured by FAA compared with all baselines.

Comparison of FOCUS and FedAvg with dif-

Table 9 on MNIST. It shows that under different M, FOCUS achieves similar accuracy and fairness (see more discussion in Appendix A). We evaluate FOCUS on the sentiment classification task with text data, Yelp (restaurant reviews) and IMDb (movie reviews), which naturally form data heterogeneity among 10 agents and thus create 2 clusters. Specifically, we sample 56k reviews from the Yelp dataset distributed among seven agents and use the whole 25k IMDb dataset distributed among three agents to simulate the heterogeneous setting. From Table 1, we can see that while the average test accuracy of FOCUS, FedAvg, and other fair FL algorithms is similar, FOCUS achieves a lower average test loss. In addition, the FAA of FOCUS is significantly lower than that of the other baselines, indicating stronger fairness. We also observe from Fig. 1(b) that the excess risk of FOCUS on the outlier cluster (i.e., C2) drops significantly compared with that of FedAvg.

Comparison of different methods on MNIST 100 clients setting, in terms of average test accuracy (Avg Acc), average test loss (Avg Loss), fairness FAA and existing fairness metric Agnostic loss. FOCUS achieves the best fairness measured by FAA.



The number of communication rounds that different methods take to reach a target accuracy on Rotated MNIST. FOCUS requires a significantly smaller number of communication rounds than other methods.

Training time per FL round and inference time for different methods on Rotated MNIST.

Comparison between FOCUS and FedAvg-HardCluster on Rotated MNIST under two scenarios.

The effect of M on Rotated MNIST when the number of underlying clusters is 3.


Theorem 2. Suppose the loss functions have bounded gradient variance on local datasets, i.e., $\mathbb{E}_{(x,y)\sim D_e}[\|\nabla\ell(x, y; \theta) - \nabla L_e(\theta)\|_2^2] \le \sigma^2$. Assume population losses are bounded, i.e., $L_e \le G$ for all $e \in [E]$. With initialization satisfying Assumptions 3 and 4, if each agent chooses learning rate

C.3 PROOF OF DIVERGENCE REDUCTION

Here we prove the claim that the assumption $L_E(\theta^*_e) - L_E(\theta^*_E) \ge R$ is implied by a lower bound on the H-divergence (Zhao et al., 2022).

Dataset description and hyperparameters (columns: dataset, # training samples, # test samples, E, M, batch size, learning rate, local training epochs, total epochs).

Comparison of FOCUS and the existing fair federated learning algorithms on the rotated CIFAR dataset.

Comparison of FOCUS and other SOTA federated learning algorithms on the rotated MNIST dataset.

Comparison of FOCUS and other SOTA federated learning algorithms on the rotated CIFAR dataset.

