SEMI-VARIANCE REDUCTION FOR FAIR FEDERATED LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Ensuring fairness in Federated Learning (FL) systems, i.e., satisfactory performance for all of the diverse clients in the system, is an important and challenging problem. Several fair FL algorithms in the literature have been relatively successful in providing fairness. However, these algorithms mostly emphasize the loss functions of the worst-off clients to improve their performance, which often suppresses the well-performing ones. As a consequence, they usually sacrifice the system's overall average performance to achieve fairness. Motivated by this and inspired by two well-known risk modeling methods in finance, Mean-Variance and Mean-Semi-Variance, we propose and study two new fair FL algorithms, Variance Reduction (VRed) and Semi-Variance Reduction (Semi-VRed). VRed encourages equality between clients' loss functions by penalizing their variance. In contrast, Semi-VRed penalizes the discrepancy of only the worst-off clients' loss functions from the average loss. Through extensive experiments on multiple vision and language datasets, we show that Semi-VRed achieves state-of-the-art (SoTA) performance in scenarios with highly heterogeneous data distributions and improves both fairness and the system's overall average performance.

1. INTRODUCTION

Federated Learning (McMahan et al., 2017) is a framework consisting of a set of clients and the private data distributed among them; it allows training a shared or personalized model on the clients' data. Since the introduction of FL with the well-known FedAvg algorithm (McMahan et al., 2017), it has attracted considerable attention, and much progress has been made on its different aspects, including algorithmic innovations (Li et al., 2020b; Reddi et al., 2020a; Pathak & Wainwright, 2020; Huo et al., 2020; Wang et al., 2020; Reddi et al., 2020b; Qu et al., 2022), fairness (McMahan et al., 2017; Li et al., 2020c; Mohri et al., 2019; Li et al., 2020a; Yue et al., 2021; Zhang et al., 2022a), convergence analysis (Khaled et al., 2020; Li et al., 2020; Gorbunov et al., 2021), and personalization (Zhang et al., 2021; Chen & Chao, 2022; Oh et al., 2022; Zhang et al., 2022b; Bietti et al., 2022). Due to heterogeneity in clients' data and resources, performance fairness is an important challenge in FL systems. Several previous works address this problem. For instance, Mohri et al. (2019) proposed Agnostic Federated Learning (AFL), which aims at minimizing the largest loss among clients through a minimax optimization framework. Similarly, Li et al. (2020a) proposed an algorithm called TERM using tilted losses. Ditto (Li et al., 2021) is another existing algorithm, based on model personalization for clients. Also, q-Fair Federated Learning (q-FFL) (Li et al., 2020c) is an algorithm inspired by α-fairness in wireless networks (Lan et al., 2010). Recently, Zhang et al. (2022a) proposed PropFair, based on the concept of Proportional Fairness (PF). Interestingly, they also showed that all the aforementioned fair FL algorithms can be unified into a generalized mean framework.
GiFair (Yue et al., 2021) is another recent algorithm, which achieves fairness through a different mechanism than the previously mentioned algorithms: it penalizes the discrepancy between clients' loss functions, i.e., it encourages equality. FCFL (Cui et al., 2021) uses a constrained version of AFL to achieve both algorithmic parity and performance consistency in FL settings. Being designed for fair FL, the aforementioned algorithms usually suppress the well-performing clients, either because of the lower weights that they place on them or because of the equality that is encouraged between clients' losses (GiFair). As a consequence, they achieve an overall average performance that is either lower than or close to that of vanilla FedAvg. This is our motivation for proposing two new algorithms. Our inspiration is risk modeling, a concept from finance used for portfolio selection. There are two widely used methodologies for risk modeling: Mean-Variance (MV) (Zhang et al., 2018; Soleimani et al., 2009; Markowitz, 1952) and its extension, Mean-Semi-Variance (MSV) (Boasson et al., 2017; Plà-Santamaria & Bravo, 2013; Ballestero, 2005; Stuart & Markowitz, 1959), which are used for quantifying investment return and investment risk. Motivated by the wide usage of these methodologies and their success in financial planning, we bring the MV and MSV methods to FL by proposing our Variance Reduction (VRed) and Semi-Variance Reduction (Semi-VRed) algorithms, respectively. Through extensive experiments on popular vision and language datasets, we show that VRed achieves performance competitive with existing baseline fair FL algorithms. More importantly, Semi-VRed achieves state-of-the-art performance in terms of both fairness and the system's overall average performance.

2. BACKGROUND

Formally, we consider an FL setting with $n$ clients for the task of multi-class classification. Let $x \in \mathcal{X} \subseteq \mathbb{R}^p$ and $y \in \mathcal{Y} = \{1, \dots, C\}$ denote an input data point and its target label, respectively. Each client $i$ has its own private data with distribution $P_i(x, y)$. Let $h : \mathcal{X} \times \Theta \to \mathbb{R}^C$ be the predictor function, parameterized by $\theta \in \Theta \subseteq \mathbb{R}^d$ and shared among all clients. Also, let $\ell : \mathbb{R}^C \times \mathcal{Y} \to \mathbb{R}_+$ be the loss function, which we choose to be the cross-entropy loss. Client $i$ minimizes the loss function $f_i(\theta) = \mathbb{E}_{(x,y) \sim P_i(x,y)}[\ell(h(x, \theta), y)]$, whose minimum value is $f_i^*$. There are various fair FL algorithms in the literature; table 3 in the appendix details the most recent ones with their formulations. The existing fair FL algorithms can be grouped into two main categories:

2.1. ALGORITHMS BASED ON THE GENERALIZED MEAN

This category of algorithms includes FedAvg (McMahan et al., 2017), q-FFL (Li et al., 2020c), AFL (Mohri et al., 2019), TERM (Li et al., 2020a), and PropFair (Zhang et al., 2022a). Zhang et al. (2022a) showed that this set of existing fair FL algorithms can be unified into a generalized mean framework (Kolmogorov, 1930), where more attention is paid to the clients with larger losses.

2.2. ALGORITHMS BASED ON ENCOURAGING EQUALITY

The second category of fair FL algorithms, which includes GiFair, is based on encouraging equality between clients' loss functions. GiFair adds a regularization term to the objective of FedAvg that penalizes the discrepancy between clients' loss functions (see table 3 in the appendix), thereby encouraging equality between them. A common feature of all the aforementioned algorithms is their emphasis on the clients with relatively larger losses, which usually suppresses the well-performing clients. This might degrade the overall average performance (measured by the mean test accuracy across clients). In the next sections, we will see that Semi-VRed achieves fairness by regularizing the semi-variance of clients' loss functions and improves fairness and the system's overall performance simultaneously.

Variance regularization has appeared in the literature before. Maurer & Pontil (2009) and Namkoong & Duchi (2017) propose regularizing empirical risk minimization (ERM) with the empirical variance of losses across training samples to balance bias and variance and to improve out-of-sample (test) performance and the convergence rate. Similarly, Shivaswamy & Jebara (2010) propose boosting binary classifiers based on a variance penalty applied to the exponential loss. Variance regularization has also been used for out-of-distribution (domain) generalization: assuming access to data from multiple training domains, Krueger et al. (2021) propose penalizing the variance of training risks across the domains, as a method of distributionally robust optimization, to provide domain generalization.

3. RISK MODELING METHODS IN FINANCE: MEAN-VARIANCE AND MEAN-SEMI-VARIANCE

Mean-Variance (MV) and Mean-Semi-Variance (MSV) have been the most popular methods for modeling the risks and gains of an investment portfolio, which is a first step in financial planning.

Mean-Variance (MV) (Zhang et al., 2018; Soleimani et al., 2009; Markowitz, 1952). This method treats the return of each security in an investment portfolio as a random variable and adopts its expected value and variance to quantify the return and the risk of the portfolio, respectively. An investor either minimizes the risk for a fixed expected return level or maximizes the return for a given acceptable risk level. For instance, in the latter case, MV results in the following problem:

$$\max_{x_1, \dots, x_n} \; \mathbb{E}[x_1 S_1 + \dots + x_n S_n] \quad \text{s.t.} \quad \sigma^2[x_1 S_1 + \dots + x_n S_n] \le R, \quad \sum_i x_i = 1, \quad x_i \ge 0. \tag{1}$$

Here $\mathbb{E}$ and $\sigma^2$ denote the expected value and variance operators, respectively. Also, $S_i$ and $x_i$ denote the random return from security $i$ and the proportion of total wealth invested in security $i$, respectively. This example gives a basic view of how the MV model works. Other closely related measures of risk in the MV model include the standard deviation ($\sigma$) and the coefficient of variation ($\sigma / \mathbb{E}$). However, the Mean-Variance modeling of risk is debatable: an uncertain return above the expectation is usually not considered a risk in the common sense, but the MV model treats it as one. This shortcoming is resolved by the Mean-Semi-Variance model.

Mean-Semi-Variance (MSV) (Boasson et al., 2017; Plà-Santamaria & Bravo, 2013; Ballestero, 2005; Stuart & Markowitz, 1959). Recognizing the (often) one-sided nature of risk, the MSV model proposes a downside risk measure known as semi-variance, which we denote by $\sigma^2_{<}$. Unlike variance, it is concerned only with the downside of the return, i.e., only the cases in which the return drops below a predefined threshold. With this risk modeling method, problem 1 changes to the following:

$$\max_{x_1, \dots, x_n} \; \mathbb{E}[x_1 S_1 + \dots + x_n S_n] \quad \text{s.t.} \quad \sigma^2_{<}[x_1 S_1 + \dots + x_n S_n] \le R, \quad \sum_i x_i = 1, \quad x_i \ge 0, \tag{2}$$

where the semi-variance operator measures the downside of the return: $\sigma^2_{<}[z] = \mathbb{E}\big[(\mathbb{E}[z] - z)_+^2\big]$, with $(a)_+ = \max\{a, 0\}$. MSV is a preferable alternative to the MV model, as its modeling of risk is more consistent with our perception of investment risk. Again, the problem above gives a basic understanding of how the MSV model works. More complex variations of the MV and MSV models have been developed for complex and unpredictable financial markets (Rigamonti & Lučivjanská, 2022; Zhang et al., 2018; Ballestero, 2005).
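The difference between the two risk measures can be illustrated with a minimal Python sketch. It computes the empirical variance and the empirical semi-variance $\mathbb{E}[(\mathbb{E}[z] - z)_+^2]$ of a list of returns, using the mean as the threshold. The function names and the sample returns are hypothetical and chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def variance(returns):
    """Empirical variance: penalizes deviations on both sides of the mean."""
    m = mean(returns)
    return mean([(m - r) ** 2 for r in returns])


def semi_variance(returns):
    """Empirical semi-variance: only returns below the mean contribute."""
    m = mean(returns)
    return mean([max(m - r, 0.0) ** 2 for r in returns])


if __name__ == "__main__":
    # Hypothetical returns of a single portfolio over four periods.
    sample_returns = [0.10, 0.02, -0.06, 0.18]
    print("variance     :", variance(sample_returns))
    print("semi-variance:", semi_variance(sample_returns))
```

Because every term of the semi-variance also appears in the variance, the semi-variance never exceeds the variance; returns above the mean simply stop counting as risk.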

4. MV AND MSV MODELS FOR FAIR FL

In this section, we propose two fair FL algorithms based on the MV and MSV models. We use the two models to quantify the inequality between clients' performances. Inspired by Zhang et al. (2022a), we take a simple approach and define $u_i(\theta) = M - f_i(\theta)$ as the utility of client $i$, where $M$ can be treated as a utility baseline. The smaller the loss of a client becomes, the larger its utility becomes: the utility of a client roughly represents the test accuracy of the shared model, parameterized by $\theta$, on its local data. With this definition, we propose to model the inequality between clients by the variance and the semi-variance of their utilities, resulting in the VRed and Semi-VRed algorithms, respectively.

Algorithm 1: VRed and Semi-VRed
$p_i = \frac{n_i}{N}$ for $i \in \{0, 1, \dots, n-1\}$
for $t = 0, 1, \dots, T-1$ do
    randomly select $S_t \subseteq [n]$
    $\theta_t^{(i)} = \theta_t$ for $i \in S_t$, $N = \sum_{i \in S_t} n_i$
    for $i$ in $S_t$ do // in parallel
        starting from $\theta_t^{(i)}$, take $K_i$ local SGD steps on $f_i(\theta_t^{(i)})$ with learning rate $\eta$ to find $\theta_{t+1}^{(i)}$
        compute $\Delta_i^{(t)} = \theta_t^{(i)} - \theta_{t+1}^{(i)}$
    compute $f(\theta_t) = \sum_i p_i f_i(\theta_t)$ and $\Delta^{(t)} = \sum_i p_i \Delta_i^{(t)}$
    if VRed then
        compute $\Delta_t = \sum_i p_i \Delta_i^{(t)} + 2\beta \sum_i p_i \big(f_i(\theta_t) - f(\theta_t)\big)\big(\Delta_i^{(t)} - \Delta^{(t)}\big)$
    else if Semi-VRed then
        compute $\Delta_t = \sum_i p_i \Delta_i^{(t)} + 2\beta \sum_i p_i \big(f_i(\theta_t) - f(\theta_t)\big)_+ \big(\Delta_i^{(t)} - \Delta^{(t)}\big)$
    update $\theta_{t+1} = \theta_t - \Delta_t$
Output: global model $\theta_T$

4.1. THE VRED ALGORITHM

VRed models the inequality between clients' utilities by their variance and aims to minimize the following objective function:

$$\min_\theta F(\theta) = \mathbb{E}\big[\{f_i(\theta)\}_{i=1}^n\big] + \beta \sigma^2\big[\{u_i(\theta)\}_{i=1}^n\big] = \sum_i p_i f_i(\theta) + \beta \sum_i p_i \Big(u_i(\theta) - \sum_j p_j u_j(\theta)\Big)^2 = \sum_i p_i f_i(\theta) + \beta \sum_i p_i \Big(f_i(\theta) - \sum_j p_j f_j(\theta)\Big)^2. \tag{3}$$

This objective, in addition to minimizing the vanilla FedAvg mean loss, reduces the variance of clients' utilities. Let us derive the VRed federated learning algorithm.
By taking the gradient of equation 3 and multiplying it by the step size $\eta$, we have:

$$\eta \nabla F(\theta) = \sum_i p_i \, \eta \nabla f_i(\theta) + 2\beta \sum_i p_i \Big(f_i(\theta) - \sum_j p_j f_j(\theta)\Big)\Big(\eta \nabla f_i(\theta) - \sum_j p_j \, \eta \nabla f_j(\theta)\Big). \tag{4}$$

This equation immediately leads to an FL algorithm, by replacing the gradient $\eta \nabla f_i(\theta)$ with the pseudo-gradient (i.e., the opposite of the local update), denoted by $\Delta_i^{(t)}$:

$$\eta \nabla F(\theta) = \sum_i p_i \Delta_i^{(t)} + 2\beta \sum_i p_i \big(f_i(\theta) - f(\theta)\big)\big(\Delta_i^{(t)} - \Delta^{(t)}\big),$$

where $f(\theta) = \sum_i p_i f_i(\theta)$ and $\Delta^{(t)} = \sum_i p_i \Delta_i^{(t)}$. The corresponding algorithm is included in algorithm 1. The parameter $\beta$ controls the strength of the regularization term and needs to be tuned for better performance. Note that this is a new aggregation rule: instead of simply averaging the local models, it has an additional second term, which relates to the variance of clients' losses. If all clients are identical, this term vanishes.
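The server-side aggregation rule can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: model parameters and pseudo-gradients are represented as flat lists of floats, the function name `aggregate` and its arguments are hypothetical, and the `semi` flag switches to the Semi-VRed variant from algorithm 1, which keeps only the positive part of each loss gap.

```python
def aggregate(theta, deltas, losses, p, beta, semi=False):
    """One server update theta_{t+1} = theta_t - Delta_t.

    theta  : current global parameters (list of floats)
    deltas : per-client pseudo-gradients Delta_i = theta_t - theta_{t+1}^{(i)}
    losses : per-client losses f_i(theta_t)
    p      : per-client weights p_i (assumed to sum to 1)
    beta   : regularization strength
    semi   : False -> VRed rule, True -> Semi-VRed rule
    """
    n, d = len(deltas), len(theta)
    f_bar = sum(p[i] * losses[i] for i in range(n))
    delta_bar = [sum(p[i] * deltas[i][k] for i in range(n)) for k in range(d)]

    update = list(delta_bar)  # first term: plain weighted average (FedAvg)
    for i in range(n):
        gap = losses[i] - f_bar
        if semi:
            gap = max(gap, 0.0)  # only clients above the average loss count
        for k in range(d):
            update[k] += 2.0 * beta * p[i] * gap * (deltas[i][k] - delta_bar[k])
    return [theta[k] - update[k] for k in range(d)]


if __name__ == "__main__":
    # Two hypothetical clients with different losses and updates.
    theta = [1.0, 1.0]
    deltas = [[0.2, 0.0], [0.0, 0.2]]
    losses = [1.0, 3.0]
    p = [0.5, 0.5]
    print("VRed     :", aggregate(theta, deltas, losses, p, beta=0.1))
    print("Semi-VRed:", aggregate(theta, deltas, losses, p, beta=0.1, semi=True))
```

With `beta = 0`, or when all clients report the same loss, the extra term is zero and the rule reduces to the FedAvg average of the pseudo-gradients.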

4.1.1. AN INTERPRETATION OF VRED

With the definition of utilities in the previous section ($u_i(\theta) = M - f_i(\theta)$), the objective function of the VRed algorithm (equation 3) penalizes the variance of clients' utilities. One potential drawback is that, in reducing the variance, it might suppress the well-performing clients (those with small losses), which is the same drawback that GiFair (Yue et al., 2021) has. Hence, the final model's overall performance averaged across clients might be sacrificed to achieve fairness. In fact, GiFair minimizes an upper bound of the VRed objective function: assuming $p_i = \frac{1}{n}$ ($i = 1, \dots, n$), i.e., all clients have the same number of data points, we have the following upper bound on the VRed objective function (see equation 18 in the appendix for the derivation):

$$\sum_i f_i(\theta) + \beta \sum_i \Big(f_i(\theta) - \frac{1}{n} \sum_j f_j(\theta)\Big)^2 \le \sum_i f_i(\theta) + \frac{2\beta}{n^2} \sum_{j \ne i} \big|f_i(\theta) - f_j(\theta)\big|^2, \tag{6}$$

where the right-hand side is the same as the GiFair objective function (except for the power 2 used for measuring the pairwise distances between clients' losses; see table 3 in the appendix). This might explain why VRed usually outperforms GiFair in terms of fairness in our experiments.

In typical FL settings, the global objective function can be written as a weighted sum of clients' loss functions, i.e., $F(\theta) := \sum_{i=1}^n w_i h_i(\theta)$, where $h_i(\theta)$ is used by client $i$ as a surrogate of the global objective and is optimized using the client's local data. The weight $w_i$ represents the importance of client $i$'s loss function in the global objective $F(\theta)$. For example, FedAvg simply uses $h_i(\theta) = f_i(\theta)$ and $w_i = p_i$ ($p_i = \frac{n_i}{N}$, see algorithm 1), and q-FFL uses $h_i(\theta) = f_i^{q+1}(\theta)$ and $w_i = p_i$. A direct consequence of the above summation form for $F(\theta)$ is:

$$\nabla F(\theta) = \sum_{i=1}^n w_i \nabla h_i(\theta). \tag{7}$$

Again, the weight $w_i$ represents the importance of client $i$'s model update.
In lemma 1, we show that the gradient of the global objective of VRed in equation 3 can be written in the form of equation 7. For simplicity and easier interpretation, we assume $p_i = \frac{1}{n}$ ($i = 1, \dots, n$), i.e., all clients have the same number of data points.

Lemma 1. For any model parameter $\theta$, the gradient of the global objective $F(\theta)$ defined in equation 3 can be expressed as

$$\nabla F(\theta) = \sum_{i=1}^n w_i(\theta) \nabla f_i(\theta), \qquad w_i(\theta) = \frac{1}{n} + \frac{2\beta \big(f_i(\theta) - f(\theta)\big)}{n}, \qquad f(\theta) = \frac{1}{n} \sum_i f_i(\theta). \tag{8}$$

The proof is deferred to §A in the appendix. Lemma 1 shows that, unlike FedAvg, which assigns $w_i = \frac{1}{n}$ to all clients, VRed assigns a relatively larger weight $w_i$ to clients with larger losses and dynamically updates the weights at each communication round. We interpret this finding in the next sections. Importantly, based on equation 8, for all clients to be assigned a positive weight, the parameter $\beta$ needs to satisfy:

$$0 \le \beta < \beta^{\max}_{\text{VRed}}(\theta) \triangleq \frac{1}{2 \big(f(\theta) - \min_i f_i(\theta)\big)}.$$
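The weights of lemma 1 and the bound on $\beta$ are easy to evaluate numerically. The following minimal Python sketch assumes equal client weights $p_i = \frac{1}{n}$, as in the lemma; the function names and the example losses are hypothetical.

```python
def vred_weights(losses, beta):
    """Per-client gradient weights w_i = (1 + 2*beta*(f_i - f_bar)) / n."""
    n = len(losses)
    f_bar = sum(losses) / n
    return [(1.0 + 2.0 * beta * (f - f_bar)) / n for f in losses]


def vred_beta_max(losses):
    """Largest beta (exclusive) that keeps every VRed weight positive."""
    f_bar = sum(losses) / len(losses)
    gap = f_bar - min(losses)
    # If all losses are equal, the weights are 1/n for any beta.
    return float("inf") if gap == 0 else 1.0 / (2.0 * gap)


if __name__ == "__main__":
    losses = [0.5, 1.0, 1.5, 3.0]  # hypothetical client losses
    print("beta_max:", vred_beta_max(losses))
    print("weights :", vred_weights(losses, beta=0.25))
```

For these losses the weights still sum to one, grow with the client loss, and the best-performing client receives half of the FedAvg weight $\frac{1}{n}$, which illustrates the suppression effect discussed above.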

4.2. THE SEMI-VRED ALGORITHM

Inspired by the discussion of the superiority of MSV over MV for risk modeling in §3, we propose an extension of VRed. Consider the following objective function instead of equation 3:

$$\min_\theta F(\theta) = \mathbb{E}\big[\{f_i(\theta)\}_{i=1}^n\big] + \beta \sigma^2_{<}\big[\{u_i(\theta)\}_{i=1}^n\big] = \sum_i p_i f_i(\theta) + \beta \sum_i p_i \Big(f_i(\theta) - \sum_j p_j f_j(\theta)\Big)_+^2, \tag{9}$$

where $\sigma^2_{<}$ denotes the semi-variance of clients' utilities. This objective, in addition to minimizing the mean loss, reduces the semi-variance of clients' losses, meaning that only those clients that have relatively small utilities $u_i(\theta)$ (or, equivalently, large losses $f_i(\theta)$) contribute to the semi-variance regularization term in eq. (9). Similar to what we did for VRed, if we take the gradient of equation 9, we have:

$$\eta \nabla F(\theta) = \sum_i p_i \Delta_i^{(t)} + 2\beta \sum_i p_i \big(f_i(\theta) - f(\theta)\big)_+ \big(\Delta_i^{(t)} - \Delta^{(t)}\big),$$

where $\Delta_i^{(t)}$ is the pseudo-gradient (i.e., the opposite of the local update) of client $i$. The corresponding algorithm is included in algorithm 1. Again, the tunable parameter $\beta$ sets the strength of the regularization term and needs to be tuned for better performance. We will show in lemma 2 that, in contrast to VRed (and GiFair) and thanks to its more efficient formulation, Semi-VRed does not suppress the well-performing clients to help the worst-off ones. As in lemma 1, we assume $p_i = \frac{1}{n}$ ($i = 1, \dots, n$) for simplicity and easier interpretation.

Lemma 2. In each communication round between the clients and the server, let $\mathcal{C}_{>}$ denote the set of clients whose local loss is greater than the average loss $f(\theta)$. For any model parameter $\theta$, the gradient of the global objective $F(\theta)$ defined in equation 9 can be expressed as $\nabla F(\theta) = \sum_{i=1}^n w_i(\theta) \nabla f_i(\theta)$, where $f(\theta) = \frac{1}{n} \sum_i f_i(\theta)$ and

$$w_i(\theta) = \begin{cases} \dfrac{1}{n} + \dfrac{2\beta \big(f_i(\theta) - f(\theta)\big)}{n} - \dfrac{2\beta \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)}{n^2}, & \text{if } i \in \mathcal{C}_{>}, \\[2ex] \dfrac{1}{n} - \dfrac{2\beta \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)}{n^2}, & \text{if } i \notin \mathcal{C}_{>}. \end{cases} \tag{12}$$

Similar to VRed, there is an upper bound on $\beta$ that ensures positive weights for all clients in equation 12:

$$0 \le \beta < \beta^{\max}_{\text{Semi-VRed}}(\theta) \triangleq \frac{n}{2 \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)}.$$

Remark 1. Interesting points can be observed by comparing lemma 1 and lemma 2. First, both algorithms pay more attention to the worst-off clients by assigning larger weights to their gradients. However, Semi-VRed assigns relatively larger weights to the well-performing clients than VRed does. For VRed, $w_i(\theta) = \frac{1}{n} + \frac{2\beta (f_i(\theta) - f(\theta))}{n}$, so the better a client performs, the more it is suppressed by the algorithm. In contrast, Semi-VRed assigns weights to well-performing clients depending on how badly the worst-off clients perform compared to the average performance (equation 12). As the performance of the worst-off clients gradually improves, the algorithm also allows the well-performing ones to improve further, instead of strictly suppressing them as VRed does.
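To make the comparison in remark 1 concrete, the following Python sketch evaluates the weights of lemma 1 and lemma 2 on the same set of client losses, assuming $p_i = \frac{1}{n}$. The function names and the example losses are hypothetical, and the sketch only illustrates the two closed-form weight expressions.

```python
def vred_weights(losses, beta):
    """Lemma 1: w_i = 1/n + 2*beta*(f_i - f_bar)/n."""
    n = len(losses)
    f_bar = sum(losses) / n
    return [1.0 / n + 2.0 * beta * (f - f_bar) / n for f in losses]


def semi_vred_weights(losses, beta):
    """Lemma 2: only clients above the average loss get the extra term,
    and every client shares the same correction -2*beta*excess/n^2."""
    n = len(losses)
    f_bar = sum(losses) / n
    excess = sum(max(f - f_bar, 0.0) for f in losses)  # sum over C_>
    shared = 2.0 * beta * excess / n ** 2
    return [1.0 / n + 2.0 * beta * max(f - f_bar, 0.0) / n - shared
            for f in losses]


def semi_vred_beta_max(losses):
    """Largest beta (exclusive) that keeps every Semi-VRed weight positive."""
    n = len(losses)
    f_bar = sum(losses) / n
    excess = sum(max(f - f_bar, 0.0) for f in losses)
    return float("inf") if excess == 0 else n / (2.0 * excess)


if __name__ == "__main__":
    losses = [0.5, 1.0, 1.5, 3.0]  # hypothetical client losses
    print("VRed     :", vred_weights(losses, beta=0.25))
    print("Semi-VRed:", semi_vred_weights(losses, beta=0.25))
    print("beta_max :", semi_vred_beta_max(losses))
```

In this example the best-performing client keeps a larger weight under Semi-VRed than under VRed, and all clients at or below the average loss receive the same weight, which depends only on how far the worst-off clients are from the average.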

4.3.2. HANDLING EXTREME LABEL SHIFTS

We now provide another interpretation of Semi-VRed, related to data heterogeneity in FL. For easier interpretability, we assume $P_i(x, y) = P_i(x|y) P_i(y) = P(x|y) P_i(y)$, i.e., the class-conditional distribution of the input $x$ is identical for all clients. Under this assumption, we define $\ell_j(\theta) = \mathbb{E}_{x \sim P(x|y=j)}[\ell(h(x, \theta), j)]$ as the average loss of the predictor $h$ on class $j$. Lemma 3 then characterizes the objective function of Semi-VRed in equation 9.

Lemma 3. Assuming $P_i(x, y) = P_i(x|y) P_i(y) = P(x|y) P_i(y)$ for $i \in \{1, \dots, n\}$, for any model parameter $\theta$, the Semi-VRed global objective $F(\theta)$ defined in equation 9 can be expressed as

$$F(\theta) = \sum_{j=1}^C P(j) \ell_j(\theta) + \frac{\beta}{n} \sum_{i=1}^n \Big(\sum_{j=1}^C \big[P_i(j) - P(j)\big] \ell_j(\theta)\Big)_+^2, \tag{13}$$

where $P(j) = \frac{1}{n} \sum_{i=1}^n P_i(j)$ is the marginal distribution of class $j$ in the global dataset.

$P_i(j)$ and $P(j)$ are the ratios of class $j$ in client $i$'s local dataset and in the global dataset, respectively. Based on equation 13, Semi-VRed is capable of improving the performance of the predictor $h$ in extreme class imbalance scenarios: consider the case in which a label $j$ is over-represented in client $i$'s data (i.e., $P_i(j) \approx 1$) and under-represented in the global data (i.e., $P(j) \approx 0$). For a better understanding of how Semi-VRed does so, see example 1 in the appendix.
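The identity in lemma 3 can be checked numerically with a small Python sketch. It evaluates the Semi-VRed objective once from the client losses $f_i = \sum_j P_i(j) \ell_j$ and once from the class-wise form of equation 13, assuming $p_i = \frac{1}{n}$ and shared class-conditional distributions. The label distributions and per-class losses below are hypothetical.

```python
def relu_sq(x):
    """Squared positive part (x)_+^2."""
    return max(x, 0.0) ** 2


def objective_from_clients(P, class_losses, beta):
    """Equation 9 with p_i = 1/n, using f_i = sum_j P_i(j) * l_j."""
    n = len(P)
    f = [sum(pij * lj for pij, lj in zip(P_i, class_losses)) for P_i in P]
    f_bar = sum(f) / n
    return f_bar + beta / n * sum(relu_sq(f_i - f_bar) for f_i in f)


def objective_from_classes(P, class_losses, beta):
    """Class-wise form of lemma 3 (equation 13)."""
    n, C = len(P), len(class_losses)
    P_global = [sum(P[i][j] for i in range(n)) / n for j in range(C)]
    mean_term = sum(P_global[j] * class_losses[j] for j in range(C))
    penalty = sum(
        relu_sq(sum((P[i][j] - P_global[j]) * class_losses[j] for j in range(C)))
        for i in range(n)
    )
    return mean_term + beta / n * penalty


if __name__ == "__main__":
    # Three hypothetical clients, two classes; rows are P_i(.) and sum to 1.
    P = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
    class_losses = [0.2, 1.0]  # l_1 and l_2: class 2 is much harder
    print(objective_from_clients(P, class_losses, beta=0.5))
    print(objective_from_classes(P, class_losses, beta=0.5))
```

Only the second client, whose data is dominated by the harder class, contributes to the penalty here, so the regularizer pushes the model to reduce the loss on the class that this client over-represents.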

5. CONVERGENCE RESULTS: FULL CLIENT PARTICIPATION

In this section, we prove the convergence of our proposed Semi-VRed algorithm when all clients participate in each round. We make some standard assumptions on the objective functions $f_i$. Specifically, we assume that the functions are Lipschitz smooth and strongly convex, and that their stochastic gradients have bounded norm and bounded local variance:

Assumption 1 (smoothness, strong convexity, bounded stochastic gradient and bounded gradient variance). Each objective function $f_i$ is $L$-Lipschitz smooth and $\tau$-strongly convex: for any $\theta, \theta' \in \mathbb{R}^d$ and any $i \in [n]$, we have $\|\nabla f_i(\theta) - \nabla f_i(\theta')\| \le L \|\theta - \theta'\|$, and $f_i(\theta) - \frac{\tau}{2} \|\theta\|^2$ is convex. Also,

$$\mathbb{E}_{S \sim \mathcal{B}_i^b} \Big\| \frac{1}{|S|} \sum_{(x,y) \in S} \nabla \ell(\theta, (x, y)) \Big\|^2 \le C^2, \qquad \mathbb{E}_{S \sim \mathcal{B}_i^b} \Big\| \frac{1}{|S|} \sum_{(x,y) \in S} \nabla \ell(\theta, (x, y)) - \nabla f_i(\theta) \Big\|^2 \le \sigma_{l,i}^2.$$

Note that, according to algorithm 1, each client might have a different number ($K_i$) of mini-batches of size $b$, but here we assume $b = 1$ and that each client takes $K_i = K$ local steps. Also, all clients use the same learning rate $\eta$. To prove the convergence of Semi-VRed, we additionally assume boundedness and Lipschitz continuity of the client losses:

Assumption 2 (boundedness and Lipschitz continuity). For any $i \in [n]$, $\theta, \theta' \in \mathbb{R}^d$ and any batch $S \sim \mathcal{B}_i^b$ of $b$ i.i.d. samples, we have:

$$0 \le \ell_S(\theta) := \frac{1}{|S|} \sum_{(x,y) \in S} \ell(\theta, (x, y)) \le \frac{M}{2}, \qquad \|f_i(\theta) - f_i(\theta')\| \le L_0 \|\theta - \theta'\|. \tag{14}$$

With the above assumptions, we prove that the Semi-VRed algorithm converges to the minimum of its objective.

Theorem 1 (Semi-VRed with full participation). Given Assumptions 1 and 2, let $p_i = \frac{n_i}{N}$, $\nu = \frac{L}{\tau}$ and $\gamma = \max\{8\nu, K\}$. Assume the diminishing learning rate $\eta_t = \frac{2}{\tau (\gamma + t)}$. Then Semi-VRed with full participation satisfies:

$$\mathbb{E}[F(\theta_T)] - F^* \le \frac{2\nu}{\gamma + T} \Big( \frac{B}{\tau} + 2 \big(L + \beta M L + 2\beta L_0^2\big) \|\theta_0 - \theta^*\|^2 \Big),$$

where $F(\theta) = \sum_i p_i \big[ f_i(\theta) + \beta \big(f_i(\theta) - f(\theta)\big)_+^2 \big]$ and $F^* = \min_\theta F(\theta)$. Also, $B = \sum_{i=1}^n p_i^2 \big(2\sigma_{l,i}^2 + 8\beta^2 M^2 C^2\big) + 6 L F(\theta_0) + 8 (K - 1)^2 C^2$.
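As a small illustration of the quantities in theorem 1, the Python sketch below computes the diminishing learning-rate schedule $\eta_t = \frac{2}{\tau(\gamma + t)}$ with $\gamma = \max\{8\nu, K\}$ and $\nu = \frac{L}{\tau}$, together with the smoothness constant $L + \beta M L + 2\beta L_0^2$ derived for the regularized client objective in the appendix. The function names and the numeric constants are hypothetical; in practice the constants $L$, $\tau$, $M$ and $L_0$ are rarely known exactly.

```python
def step_sizes(L, tau, K, T):
    """Learning rates eta_t = 2 / (tau * (gamma + t)) for t = 0, ..., T-1."""
    nu = L / tau                      # condition number nu = L / tau
    gamma = max(8.0 * nu, float(K))   # gamma = max{8 * nu, K}
    return [2.0 / (tau * (gamma + t)) for t in range(T)]


def regularized_smoothness(L, M, L0, beta):
    """Smoothness constant L + beta*M*L + 2*beta*L0^2 of G_i."""
    return L + beta * M * L + 2.0 * beta * L0 ** 2


if __name__ == "__main__":
    # Hypothetical problem constants, for illustration only.
    etas = step_sizes(L=4.0, tau=1.0, K=5, T=3)
    print("first learning rates:", etas)
    print("smoothness of G_i   :", regularized_smoothness(4.0, 2.0, 1.0, 0.5))
```

The schedule decays like $1/t$, which matches the $O(1/T)$ rate in the bound, and a larger $\beta$ increases the smoothness constant that multiplies the initial distance $\|\theta_0 - \theta^*\|^2$.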

6. EXPERIMENTS

In this section, we evaluate our proposed algorithms for training fair models. The experimental results show that VRed achieves competitive fairness performance and that Semi-VRed beats almost all the existing algorithms in terms of multiple fairness metrics. Furthermore, Semi-VRed achieves state-of-the-art performance in terms of the system's overall average performance too.

6.1. EXPERIMENTAL SETUP

In this section, we provide some details about the experiments we conducted to evaluate our algorithms: the datasets, the models and their hyperparameters, and the evaluation metrics. For a detailed explanation of the experiments, see §C in the appendix.

Datasets. We use four benchmark datasets from the literature: CIFAR-10/100 (Krizhevsky et al., 2009) and CINIC-10 (Darlow et al., 2018) for image classification, and StackOverflow (The Tensorflow Federated Authors, 2019) for next-word prediction. To split the image data among clients, we use a Dirichlet distribution (Wang et al., 2019). StackOverflow has a default realistic partition for each client, which we follow.

Train-test dataset splitting. After partitioning the dataset among clients, we split the data of each client into train and test sets with a ratio chosen for each dataset. Each client uses its test data to evaluate the common trained model. For more details of the data splits, see §C in the appendix.

Models, optimizers and loss functions. For the CIFAR-10/100 and CINIC-10 datasets, we use ResNet-18 (He et al., 2016). For the language dataset (StackOverflow), we use LSTMs (Hochreiter & Schmidhuber, 1997). To optimize the model parameters, we use SGD to minimize the average cross-entropy loss. For further details, see §C in the appendix.

Other hyperparameters. We implement an FL setting where the clients participate in all communication rounds, with one local epoch per round. We use 200 communication rounds for all algorithms on the datasets to ensure their convergence. For CIFAR-10/100 and CINIC-10, we partition the data into 50 clients, and for the language dataset, we partition the data into 20 clients.

Evaluation metrics. As we discussed in §4, the ultimate goal of our algorithms is to achieve fairness without compromising the system's overall average performance.
We measure the overall performance with the mean test accuracy across clients. In order to measure the fairness in the system, we use the worst 10% test accuracies among clients, which is a standard metric for fairness in FL (Li et al., 2020a; c) . In the appendix, we also use other common metrics in the literature for measuring fairness, e.g. the standard deviation of test accuracies (see table 6 in appendix C).
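The evaluation metrics above can be sketched in plain Python. The sketch assumes that the worst-10% metric is the mean accuracy of the lowest-scoring tenth of the clients, rounded to a whole number of clients with at least one client included; this rounding convention, the function names and the sample accuracies are assumptions made for illustration.

```python
import statistics


def mean_accuracy(accuracies):
    """Overall performance: mean test accuracy across clients."""
    return sum(accuracies) / len(accuracies)


def worst_fraction_accuracy(accuracies, fraction=0.10):
    """Fairness: mean accuracy of the worst `fraction` of clients.

    Assumption: the group size is round(fraction * n), at least 1.
    """
    k = max(1, round(fraction * len(accuracies)))
    return sum(sorted(accuracies)[:k]) / k


def accuracy_std(accuracies):
    """Fairness: population standard deviation of client accuracies."""
    return statistics.pstdev(accuracies)


if __name__ == "__main__":
    # Hypothetical per-client test accuracies for ten clients.
    accs = [0.90, 0.80, 0.70, 0.60, 0.50, 0.95, 0.85, 0.75, 0.65, 0.55]
    print("mean accuracy:", mean_accuracy(accs))
    print("worst 10%    :", worst_fraction_accuracy(accs))
    print("std          :", accuracy_std(accs))
```

A fair algorithm in the sense used here raises the worst-10% accuracy and lowers the standard deviation without lowering the mean accuracy.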

6.2. COMPARISON OF VRED AND SEMI-VRED WITH OTHER BASELINE ALGORITHMS

As shown in fig. 1, Semi-VRed outperforms almost all the existing baseline algorithms in terms of fairness. Furthermore, Semi-VRed also improves the system's overall average performance (mean test accuracy) on three of the datasets. For instance, on StackOverflow (see table 6 in §C in the appendix for an evaluation in terms of various fairness metrics), Semi-VRed improves fairness and mean test accuracy by 3% and 2.7%, respectively. We can also observe the competitive performance of VRed.

6.3. SUPERIORITY OF SEMI-VRED OVER VRED AND THE OTHER BASELINE ALGORITHMS

As discussed in §2, the existing fair FL algorithms usually suppress the well-performing clients in order to improve the worst-off clients' performance. Semi-VRed, thanks to its efficient formulation, tries to avoid this. To illustrate this, table 1 compares different algorithms based on the amount of performance improvement that they provide. After running vanilla FedAvg on CIFAR-100, we divide the 50 clients into two sets: (1) suffering clients, i.e., those with test accuracies below the FedAvg mean accuracy (22 clients in our experiment), and (2) well-performing clients, i.e., those with test accuracies above the FedAvg mean accuracy (28 clients). Then, we run each of the other algorithms and compare their performance improvements. The results clearly deliver two important messages: (1) the existing algorithms either suppress the well-performing clients to some degree or cannot improve them, due to the greater attention that they pay to the worst-off clients; (2) Semi-VRed has the least suppression of well-performing clients (53.17% with a small average degradation of -0.42) and the highest average improvement of suffering clients (65.40% with an average accuracy improvement of +1.47), which improves both fairness and the system's overall average performance simultaneously.
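The grouping used for this comparison can be sketched as follows. The Python code splits clients by their FedAvg baseline accuracy and reports the mean accuracy change of each group under another algorithm. The function names and the accuracies are hypothetical, and clients that exactly match the baseline mean are placed in the well-performing group as an assumption, since the text does not specify how ties are handled.

```python
def split_by_baseline(baseline_accs):
    """Return (suffering, well_performing) client indices.

    Suffering clients have baseline accuracy below the baseline mean.
    Assumption: ties with the mean go to the well-performing group.
    """
    mean_acc = sum(baseline_accs) / len(baseline_accs)
    suffering = [i for i, a in enumerate(baseline_accs) if a < mean_acc]
    well = [i for i, a in enumerate(baseline_accs) if a >= mean_acc]
    return suffering, well


def mean_change(baseline_accs, new_accs, group):
    """Average accuracy change of the clients in `group`."""
    return sum(new_accs[i] - baseline_accs[i] for i in group) / len(group)


if __name__ == "__main__":
    # Hypothetical accuracies under FedAvg and under a fair algorithm.
    fedavg = [0.40, 0.50, 0.70, 0.80]
    fair = [0.46, 0.52, 0.69, 0.80]
    suffering, well = split_by_baseline(fedavg)
    print("suffering change      :", mean_change(fedavg, fair, suffering))
    print("well-performing change:", mean_change(fedavg, fair, well))
```

A positive change for the suffering group together with a near-zero change for the well-performing group corresponds to the behavior reported for Semi-VRed.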

7. CONCLUSION

In this work, we introduced two novel fair FL algorithms: VRed and Semi-VRed. To resolve the main drawback of most existing fair FL algorithms, namely the suppression of well-performing clients, we proposed Semi-VRed, which uses an efficient method for measuring performance inequality in an FL system. Our experimental results show that Semi-VRed improves not only the worst-off clients' performance but also the system's overall average performance. Accordingly, Semi-VRed achieves SoTA performance in terms of both the overall average accuracy and fairness, measured with various common fairness metrics.

Appendix for Semi-Variance Reduction for Fair Federated Learning

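A PROOFS

Lemma 1. For any model parameter $\theta$, the gradient of the global objective $F(\theta)$ defined in equation 3 can be expressed as

$$\nabla F(\theta) = \sum_{i=1}^n w_i(\theta) \nabla f_i(\theta), \qquad w_i(\theta) = \frac{1}{n} + \frac{2\beta \big(f_i(\theta) - f(\theta)\big)}{n}, \qquad f(\theta) = \frac{1}{n} \sum_i f_i(\theta).$$

Proof. From equation 3 and with $p_i = \frac{1}{n}$, we have:

$$\begin{aligned} n \nabla F(\theta) &= \sum_i \nabla f_i(\theta) + 2\beta \sum_i \big(f_i(\theta) - f(\theta)\big)\big(\nabla f_i(\theta) - \nabla f(\theta)\big) \\ &= \sum_i \nabla f_i(\theta) + 2\beta \sum_i \Big[ \big(f_i(\theta) - f(\theta)\big) \nabla f_i(\theta) - \big(f_i(\theta) - f(\theta)\big) \nabla f(\theta) \Big] \\ &= \sum_i \Big(1 + 2\beta \big(f_i(\theta) - f(\theta)\big)\Big) \nabla f_i(\theta) - 2\beta \sum_i \big(f_i(\theta) - f(\theta)\big) \nabla f(\theta) \\ &= \sum_i \Big(1 + 2\beta \big(f_i(\theta) - f(\theta)\big)\Big) \nabla f_i(\theta), \end{aligned}$$

where the last step uses $\sum_i \big(f_i(\theta) - f(\theta)\big) = 0$. Hence,

$$\nabla F(\theta) = \sum_i \frac{1 + 2\beta \big(f_i(\theta) - f(\theta)\big)}{n} \nabla f_i(\theta).$$

Derivation of equation 6.

$$\begin{aligned} \sum_i f_i(\theta) + \beta \sum_i \Big(f_i(\theta) - \frac{1}{n} \sum_j f_j(\theta)\Big)^2 &= \sum_i f_i(\theta) + \beta \sum_i \Big(\frac{n-1}{n} f_i(\theta) - \frac{1}{n} \sum_{j \ne i} f_j(\theta)\Big)^2 \\ &= \sum_i f_i(\theta) + \frac{\beta}{n^2} \sum_i \Big(\sum_{j \ne i} \big(f_i(\theta) - f_j(\theta)\big)\Big)^2 \\ &\le \sum_i f_i(\theta) + \frac{\beta}{n^2} \sum_i \sum_{j \ne i} \big|f_i(\theta) - f_j(\theta)\big|^2 \\ &= \sum_i f_i(\theta) + \frac{2\beta}{n^2} \sum_{j \ne i} \big|f_i(\theta) - f_j(\theta)\big|^2, \end{aligned} \tag{18}$$

where the final sum runs over unordered pairs of clients.

Lemma 2. In each communication round between the clients and the server, let $\mathcal{C}_{>}$ denote the set of clients whose local loss is greater than the average loss $f(\theta)$. For any model parameter $\theta$, the gradient of the global objective $F(\theta)$ defined in equation 9 can be expressed as $\nabla F(\theta) = \sum_{i=1}^n w_i(\theta) \nabla f_i(\theta)$, where $f(\theta) = \frac{1}{n} \sum_i f_i(\theta)$ and

$$w_i(\theta) = \begin{cases} \dfrac{1}{n} + \dfrac{2\beta \big(f_i(\theta) - f(\theta)\big)}{n} - \dfrac{2\beta \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)}{n^2}, & \text{if } i \in \mathcal{C}_{>}, \\[2ex] \dfrac{1}{n} - \dfrac{2\beta \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)}{n^2}, & \text{if } i \notin \mathcal{C}_{>}. \end{cases} \tag{12}$$

Proof.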
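From equation 9 and with $p_i = \frac{1}{n}$, we have:

$$\begin{aligned} n \nabla F(\theta) &= \sum_i \nabla f_i(\theta) + 2\beta \sum_{i \in \mathcal{C}_{>}} \big(f_i(\theta) - f(\theta)\big)\big(\nabla f_i(\theta) - \nabla f(\theta)\big) \\ &= \sum_i \nabla f_i(\theta) + 2\beta \sum_{i \in \mathcal{C}_{>}} \Big[ \big(f_i(\theta) - f(\theta)\big) \nabla f_i(\theta) - \big(f_i(\theta) - f(\theta)\big) \nabla f(\theta) \Big] \\ &= \sum_{i \notin \mathcal{C}_{>}} \nabla f_i(\theta) + \sum_{i \in \mathcal{C}_{>}} \Big(1 + 2\beta \big(f_i(\theta) - f(\theta)\big)\Big) \nabla f_i(\theta) - 2\beta \sum_{i \in \mathcal{C}_{>}} \big(f_i(\theta) - f(\theta)\big) \nabla f(\theta). \end{aligned}$$

The last term in the above equation can be written as:

$$\begin{aligned} -2\beta \sum_{i \in \mathcal{C}_{>}} \big(f_i(\theta) - f(\theta)\big) \nabla f(\theta) &= -\frac{2\beta}{n} \sum_{i \in \mathcal{C}_{>}} f_i(\theta) \times \sum_j \nabla f_j(\theta) + \frac{2\beta}{n} \sum_{i \in \mathcal{C}_{>}} f(\theta) \times \sum_j \nabla f_j(\theta) \\ &= -\frac{2\beta}{n} \sum_{i \in \mathcal{C}_{>}} \big(f_i(\theta) - f(\theta)\big) \times \sum_j \nabla f_j(\theta). \end{aligned} \tag{20}$$

Hence,

$$n \nabla F(\theta) = \sum_{i \in \mathcal{C}_{>}} \Big(1 + 2\beta \big(f_i(\theta) - f(\theta)\big) - \frac{2\beta}{n} \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)\Big) \nabla f_i(\theta) + \sum_{i \notin \mathcal{C}_{>}} \Big(1 - \frac{2\beta}{n} \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)\Big) \nabla f_i(\theta).$$

Therefore,

$$\nabla F(\theta) = \sum_{i \in \mathcal{C}_{>}} \frac{1 + 2\beta \big(f_i(\theta) - f(\theta)\big) - \frac{2\beta}{n} \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)}{n} \nabla f_i(\theta) + \sum_{i \notin \mathcal{C}_{>}} \frac{1 - \frac{2\beta}{n} \sum_{j \in \mathcal{C}_{>}} \big(f_j(\theta) - f(\theta)\big)}{n} \nabla f_i(\theta).$$

Lemma 3. Assuming $P_i(x, y) = P_i(x|y) P_i(y) = P(x|y) P_i(y)$ for $i \in \{1, \dots, n\}$, for any model parameter $\theta$, the Semi-VRed global objective $F(\theta)$ defined in equation 9 can be expressed as

$$F(\theta) = \sum_{j=1}^C P(j) \ell_j(\theta) + \frac{\beta}{n} \sum_{i=1}^n \Big(\sum_{j=1}^C \big[P_i(j) - P(j)\big] \ell_j(\theta)\Big)_+^2,$$

where $P(j) = \frac{1}{n} \sum_{i=1}^n P_i(j)$ is the marginal distribution of class $j$ in the global dataset.

Proof. From equation 9 and with $p_i = \frac{1}{n}$, we have:

$$\begin{aligned} f(\theta) = \frac{1}{n} \sum_{i=1}^n f_i(\theta) &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{(x,y) \sim P_i(x,y)}[\ell(h(x, \theta), y)] \\ &= \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^C P_i(j) \, \mathbb{E}_{x \sim P(x|y=j)}[\ell(h(x, \theta), j)] \\ &= \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^C P_i(j) \ell_j(\theta) = \sum_{j=1}^C \Big(\frac{1}{n} \sum_{i=1}^n P_i(j)\Big) \ell_j(\theta) = \sum_{j=1}^C P(j) \ell_j(\theta), \end{aligned}$$

where $P(j) = \frac{1}{n} \sum_{i=1}^n P_i(j)$ is the ratio of data points with label $j$ in the global dataset. Similarly, we have $f_i(\theta) = \sum_{j=1}^C P_i(j) \ell_j(\theta)$. Plugging these expressions for $f_i(\theta)$ and $f(\theta)$ into equation 9 gives equation 13.

Due to the similarity of Semi-VRed's objective function to that of FedAvg, we build its convergence proof on top of the convergence proof for FedAvg in Li et al. (2020d), and we refer the reader to that work for the detailed proof. We now prove the convergence of our Semi-VRed algorithm.

Theorem 1 (Semi-VRed with full participation).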
Given Assumptions 1 and 2, let $p_i = \frac{n_i}{N}$, $\nu = \frac{L}{\tau}$ and $\gamma = \max\{8\nu, K\}$. Assume the diminishing learning rate $\eta_t = \frac{2}{\tau (\gamma + t)}$. Then Semi-VRed with full participation satisfies:

$$\mathbb{E}[F(\theta_T)] - F^* \le \frac{2\nu}{\gamma + T} \Big( \frac{B}{\tau} + 2 \big(L + \beta M L + 2\beta L_0^2\big) \|\theta_0 - \theta^*\|^2 \Big),$$

where $F(\theta) = \sum_i p_i \big[ f_i(\theta) + \beta \big(f_i(\theta) - f(\theta)\big)_+^2 \big]$ and $F^* = \min_\theta F(\theta)$. Also, $B = \sum_{i=1}^n p_i^2 \big(2\sigma_{l,i}^2 + 8\beta^2 M^2 C^2\big) + 6 L F(\theta_0) + 8 (K - 1)^2 C^2$.

Proof. We first rewrite a simplified version of the Semi-VRed objective function (equation 9) as

$$F(\theta) = \sum_i p_i G_i(\theta) = \sum_i p_i \Big[ f_i(\theta) + \beta \big(f_i(\theta) - \mu\big)_+^2 \Big],$$

where $G_i(\theta) = f_i(\theta) + \beta \big(f_i(\theta) - \mu\big)_+^2$ and $\mu = \sum_i p_i f_i$ is fixed during the clients' local computations. At the beginning of each communication round, we update $\mu$ for the next round of local computations. In other words, $\mu_{t+1} = \sum_{i=1}^n p_i f_i(\theta_{t+1})$. With these notations, it suffices to find the constants in Assumption 1 for $G_i(\theta)$.

Smoothness

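We have:

$$G_i(\theta) = f_i(\theta) + \beta\,(f_i(\theta) - \mu)_+^2 = \begin{cases} f_i(\theta), & \text{if } 0 \le f_i(\theta) \le \mu \\ f_i(\theta) + \beta\,(f_i(\theta) - \mu)^2, & \text{if } f_i(\theta) > \mu \end{cases}$$

In the first case, $G_i$ is smooth with the same smoothness parameter as $f_i(\theta)$:

$$\|\nabla G_i(\theta) - \nabla G_i(\theta')\| \le L\,\|\theta - \theta'\|.$$

In the second case, we have:

$$\begin{aligned}
\nabla^2 G_i(\theta) &= \nabla^2 f_i(\theta) + 2\beta\big((f_i(\theta) - \mu)\nabla^2 f_i(\theta) + \nabla f_i(\theta)\nabla f_i(\theta)^\top\big) \\
&\preceq L I + 2\beta\Big(\frac{M}{2}\nabla^2 f_i(\theta) + \nabla f_i(\theta)\nabla f_i(\theta)^\top\Big) \\
&\preceq (L + \beta M L + 2\beta L_0^2)\,I,
\end{aligned}$$

where in the last line, we used Assumption 2 and the following:

$$\begin{aligned}
\|\nabla f_i(\theta)\nabla f_i(\theta)^\top\|_{sp} &= \sup_{\|u\|=1}\,\sup_{\|v\|=1}\,\langle \nabla f_i(\theta)\nabla f_i(\theta)^\top u;\, v\rangle \\
&= \sup_{\|u\|=1}\,\sup_{\|v\|=1}\,(\nabla f_i(\theta)^\top u)^\top\,\nabla f_i(\theta)^\top v = \|\nabla f_i(\theta)\|^2 \le L_0^2,
\end{aligned}$$

where in the second line we used the Cauchy-Schwarz inequality and Assumption 2. Hence, from eq. (28) and eq. (29), we conclude that $G_i(\theta)$ is Lipschitz smooth:

$$\|\nabla G_i(\theta) - \nabla G_i(\theta')\| \le (L + \beta M L + 2\beta L_0^2)\,\|\theta - \theta'\|.$$

Strong Convexity

From eq. (27), we have:

$$\nabla^2 G_i(\theta) = \begin{cases} \nabla^2 f_i(\theta), & \text{if } 0 \le f_i(\theta) \le \mu \\ \nabla^2 f_i(\theta) + 2\beta\,(f_i(\theta) - \mu)\nabla^2 f_i(\theta) + 2\beta\,\nabla f_i(\theta)\nabla f_i(\theta)^\top, & \text{if } f_i(\theta) > \mu \end{cases} \tag{32}$$

In the second case, the two added terms are positive semi-definite, since $f_i(\theta) - \mu > 0$, $\nabla^2 f_i(\theta) \succeq \tau I$, and $\nabla f_i(\theta)\nabla f_i(\theta)^\top \succeq 0$. From the above expression for $\nabla^2 G_i(\theta)$ and the fact that $f_i(\theta)$ is $\tau$-strongly convex, we can therefore conclude that $G_i(\theta)$ is also $\tau$-strongly convex.

Local gradient variance constants

For the local variance term, we define $\varphi(t) = t + \beta\,(t - \mu)_+^2$. We have:

$$\begin{aligned}
\|\nabla G_i(\theta) - \nabla(\varphi \circ \ell_S)(\theta)\| &= \big\|\nabla f_i(\theta) + 2\beta\,(f_i(\theta) - \mu)_+\nabla f_i(\theta) - \big[\nabla \ell_S(\theta) + 2\beta\,(\ell_S(\theta) - \mu)_+\nabla \ell_S(\theta)\big]\big\| \\
&\le \|\nabla f_i(\theta) - \nabla \ell_S(\theta)\| + 2\beta\,\|(f_i(\theta) - \mu)_+\nabla f_i(\theta)\| + 2\beta\,\|(\ell_S(\theta) - \mu)_+\nabla \ell_S(\theta)\| \\
&\le \|\nabla f_i(\theta) - \nabla \ell_S(\theta)\| + \beta M\,\|\nabla f_i(\theta)\| + \beta M\,\|\nabla \ell_S(\theta)\| \\
&\le \|\nabla f_i(\theta) - \nabla \ell_S(\theta)\| + 2\beta M C,
\end{aligned} \tag{33}$$

where the second-to-last inequality uses Assumption 2 and the last inequality uses Assumption 1. By squaring both sides and taking the expectation over $S \sim B_i^b$, we get:

$$\begin{aligned}
\mathbb{E}_{S \sim B_i^b}\,\|\nabla G_i(\theta) - \nabla(\varphi \circ \ell_S)(\theta)\|^2 &\le \mathbb{E}_{S \sim B_i^b}\big(\|\nabla f_i(\theta) - \nabla \ell_S(\theta)\| + 2\beta M C\big)^2 \\
&\le \mathbb{E}_{S \sim B_i^b}\big[2\,\|\nabla f_i(\theta) - \nabla \ell_S(\theta)\|^2 + 8\beta^2 M^2 C^2\big] \\
&= 2\sigma_{l,i}^2 + 8\beta^2 M^2 C^2.
\end{aligned}$$

In the last two steps, we used $(a + b)^2 \le 2(a^2 + b^2)$ and Assumption 1.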
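The closed-form gradient of $F(\theta)$ derived at the start of this appendix (uniform weights $p_i = \frac{1}{n}$) can be checked numerically. The sketch below does so on toy one-dimensional quadratic losses $f_i(\theta) = \frac{1}{2} a_i (\theta - c_i)^2$; these losses, the function names, and the numeric values are assumptions chosen only to illustrate the per-client weights, not part of the paper's setup.

```python
def semi_vred(theta, a, c, beta):
    """F(theta) with p_i = 1/n for toy losses f_i = 0.5 * a_i * (theta - c_i)^2."""
    f = [0.5 * ai * (theta - ci) ** 2 for ai, ci in zip(a, c)]
    mean = sum(f) / len(f)
    return mean + beta / len(f) * sum(max(fi - mean, 0.0) ** 2 for fi in f)


def weighted_gradient(theta, a, c, beta):
    """Gradient of F as a weighted sum of client gradients, plus the weights."""
    n = len(a)
    f = [0.5 * ai * (theta - ci) ** 2 for ai, ci in zip(a, c)]
    g = [ai * (theta - ci) for ai, ci in zip(a, c)]
    mean = sum(f) / n
    # Term shared by all clients: (2 * beta / n) * sum over above-average clients.
    shared = 2.0 * beta / n * sum(fi - mean for fi in f if fi > mean)
    weights = [
        (1.0 + (2.0 * beta * (fi - mean) if fi > mean else 0.0) - shared) / n
        for fi in f
    ]
    return sum(w * gi for w, gi in zip(weights, g)), weights


# Hypothetical toy problem with three clients.
a = [1.0, 2.0, 0.5]
c = [0.0, 1.0, -1.0]
theta, beta, h = 0.3, 0.5, 1e-6

grad, weights = weighted_gradient(theta, a, c, beta)
finite_diff = (semi_vred(theta + h, a, c, beta) - semi_vred(theta - h, a, c, beta)) / (2 * h)
```

In this sketch the weights always sum to one: clients above the average loss receive a larger weight, and every client's weight is reduced by the same shared amount.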

B EXAMPLES

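We borrow the following example on class imbalance in FL from Shen et al. (2022) to provide a better understanding of lemma 3. The example shows an extreme class imbalance, which Semi-VRed can handle efficiently.

Example 1. Let $u$ be the uniform distribution over the existing $C$ classes. Also, let $\delta_c$ be the Dirac distribution of class $c$. Now, without loss of generality, let us assume that $C = 2$ (binary classification problem). For the $n$ existing clients, we have:

$$p_i(y) = \begin{cases} \alpha u + (1 - \alpha)\delta_1 & \text{if } i = 1 \\ \alpha u + (1 - \alpha)\delta_2 & \text{if } i \in \{2, \dots, n\} \end{cases}$$

Accordingly, we have:

$$p_i(1) = \begin{cases} 1 - \frac{\alpha}{2} & \text{if } i = 1 \\ \frac{\alpha}{2} & \text{if } i \in \{2, \dots, n\} \end{cases} \qquad p_i(2) = \begin{cases} \frac{\alpha}{2} & \text{if } i = 1 \\ 1 - \frac{\alpha}{2} & \text{if } i \in \{2, \dots, n\} \end{cases}$$

Therefore,

$$f_i(\theta) = \begin{cases} (1 - \frac{\alpha}{2})\,\ell_1(\theta) + \frac{\alpha}{2}\,\ell_2(\theta) & \text{if } i = 1 \\ \frac{\alpha}{2}\,\ell_1(\theta) + (1 - \frac{\alpha}{2})\,\ell_2(\theta) & \text{if } i \in \{2, \dots, n\} \end{cases}$$

Hence,

$$f(\theta) = \Big(\frac{\alpha}{2} + \frac{1 - \alpha}{n}\Big)\ell_1(\theta) + \Big(\frac{\alpha}{2} + \frac{(1 - \alpha)(n - 1)}{n}\Big)\ell_2(\theta).$$

Clearly, if $\alpha \approx 0$ and $n$ is large, then $\ell_1(\theta)$, which is the loss over the minority data, has a small weight. This leads to $\ell_1(\theta)$ being larger than $\ell_2(\theta)$ and to poor performance on the minority class 1. Now, if we rewrite the Semi-VRed objective function (equation 9), we have:

$$F(\theta) = \Big(\frac{\alpha}{2} + \frac{1 - \alpha}{n}\Big)\ell_1(\theta) + \Big(\frac{\alpha}{2} + \frac{(1 - \alpha)(n - 1)}{n}\Big)\ell_2(\theta) + \frac{\beta\,(n - 1)^2 (1 - \alpha)^2}{n^3}\big(\ell_1(\theta) - \ell_2(\theta)\big)^2 \tag{40}$$

For $\alpha \approx 0$:

$$F(\theta) \approx \ell_2(\theta) + \frac{\beta}{n}\big(\ell_1(\theta) - \ell_2(\theta)\big)^2,$$

which improves $\ell_1(\theta)$, thanks to its regularization term that penalizes the gap between $\ell_1(\theta)$ and $\ell_2(\theta)$. Hence, the performance of client 1 and, consequently, fairness in the system will improve.

| Algorithm | Objective | Reference |
| GiFair | $\sum_i f_i + \lambda \sum_{i \neq j} |f_i - f_j|$ | Yue et al. ( ) |
| VRed | $\sum_i f_i + \beta \sum_i \big(f_i(\theta) - \frac{1}{n}\sum_j f_j(\theta)\big)^2$ | this work |
| Semi-VRed | $\sum_i f_i + \beta \sum_i \big(f_i(\theta) - \frac{1}{n}\sum_j f_j(\theta)\big)_+^2$ | this work |

• StackOverflow: {1e-2, 5e-2, 1e-1, 5e-1, 1}.

The best learning rate used for each (dataset, algorithm) pair is reported in Table 4.

Table 4: The best learning rates used for training each algorithm on different datasets.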

| Datasets | FedAvg | q-FFL | AFL | TERM | PropFair | GiFair | VRed | Semi-VRed |
| CIFAR-10 | 5e-3 | 5e-3 | 5e-3 | 1e-2 | 1e-2 | 5e-3 | 5e-3 | 5e-3 |
| CIFAR-100 | 2e-3 | 2e-3 | 5e-3 | 1e-2 | 1e-2 | 5e-3 | 5e-3 | 5e-3 |
| CINIC-10 | 1e-2 | 5e-3 | 1e-2 | 1e-2 | 2e-2 | 2e-2 | 5e-3 | 5e-3 |
| StackOverflow | 2e-1 | 5e-2 | 5e-2 | 2e-1 | 5e-1 | 2e-1 | 5e-1 | 5e-1 |

We now explain the algorithms we use and how we tune their hyperparameters.

We adapt TERM with only client-level fairness ($\alpha > 0$) and no sample-level fairness ($\tau = 0$). We tune the hyperparameter $\alpha$ for each dataset based on a grid search over {0.01, 0.1, 0.5, 1}. We report the best value of $\alpha$ for each dataset in Table 5.

For AFL, there are two hyperparameters: $\gamma_w$ and $\gamma_\lambda$. We tune the learning rate $\gamma_w$ from the corresponding grid and choose the default value $\gamma_\lambda = 0.1$.

For q-FFL, we use the q-FedAvg algorithm (Li et al., 2020c). We tune the hyperparameter $q$ from the grid {0.01, 0.1, 1}. We find that, for all the datasets used, $q = 0.1$ has the best performance (as reported in Table 5). We also tried larger values outside the grid, and they often led to divergence of the q-FFL algorithm.

We adopt the Global GiFair model (Yue et al., 2021), which results in a single global model. In order to have client-level fairness, we treat each client as a group of size 1. For tuning the regularization weight of GiFair ($\lambda$), we follow Yue et al. (2021). As stated in that paper, there is an upper bound for $\lambda$ (see §3 in the paper). For our experiments, the upper bound is $\lambda \le \min_i \{\frac{w_i}{n - 1}\}$, where $w_i$ is the ratio of total samples allocated to client $i$ and $n$ is the number of clients. We try four different values in the interval and choose the best one. When the number of clients is large, this upper bound is small, and for all of our datasets, the upper bound itself was the best value, as reported in Table 5.

We tune $M$ for the PropFair algorithm based on a grid search over {2, 3, 4, 5}.
Finally, for our VRed and Semi-VRed algorithms, we tune the regularization weight $\beta$ based on a grid search over {0.01, 0.05, 0.1, 0.2, 0.5, 1}. Larger values of $\beta$ often resulted in the divergence of the algorithms. We report the best values of all the hyperparameters for each dataset in Table 5.

C.3 DETAILED RESULTS

In Table 6, we report detailed results obtained from the algorithms we study in this work. We use a default batch size of 64 for all the experiments. The statistics we report include:

1. the average test accuracy across clients (overall average performance)
2. the standard deviation of test accuracies across clients
3. the lowest (worst) test accuracy among clients
4. the lowest 10% test accuracies
5. the lowest 20% test accuracies
6. the highest 10% test accuracies.

For each experiment, we report the average result obtained from three runs with different random seeds. As can be observed, our proposed algorithms VRed and Semi-VRed consistently outperform almost all the baseline algorithms across different datasets in terms of various fairness metrics. Semi-VRed also improves the overall average performance (mean test accuracy) on three of the datasets. Following Figure 1, we also compare our proposed algorithms with the baseline algorithms in terms of their worst 20% test accuracies; the results are visualized in Figure 2.

, where $P_n$ and $Q$ are the train data empirical distribution and the test data distribution. The above problem optimizes against the "worst-case" test distribution $Q$. The deviation of the distribution $Q$ from $P_n$ is penalized in the regularization term $\frac{1}{\delta} H_\phi(Q \mid P_n)$, where $H_\phi$ is a divergence measure between two distributions. The solution to this optimization problem is a model which is robust against distribution shifts between the train and test data. It was shown in (ya Gotoh et al., 2018) [see Propositions 3.1 and 3.2] that the above DRO problem is equivalent to a mean-variance problem, where the empirical average loss on the train set is regularized with the sample variance of the loss:

DRO is also an effective approach to deal with imbalanced and non-iid data.
Unlike the above sample-wise variance regularization works, Krueger et al. (2021), assuming access to data from multiple training domains, propose penalizing the variance of training risks across the domains as a method of distributionally robust optimization to provide out-of-distribution (domain) generalization. The first work proposing DRO in the FL setting is (Mohri et al., 2019), where they minimize the maximum combination of clients' local losses to address fairness in FL:

$$\min_\theta \max_i f_i(\theta).$$

Also, the work in (Deng et al., 2020) proposes a distributionally robust FL algorithm that minimizes the worst combination of clients' local losses via periodic averaging and adaptive sampling:

$$\min_\theta \max_{\lambda \in \Lambda} \sum_{i=1}^{n} \lambda_i f_i(\theta),$$

where $\lambda \in \Lambda = \{\lambda \in \mathbb{R}_+^n : \sum_{i=1}^{n} \lambda_i = 1\}$. In contrast, our proposed VRed penalizes the variance of losses across clients to improve fairness (performance consistency) in FL settings with high data heterogeneity:

$$\min_\theta \sum_{i=1}^{n} \lambda_i f_i(\theta) + \beta \sum_{i=1}^{n} \lambda_i \Big(f_i(\theta) - \sum_{j=1}^{n} \lambda_j f_j(\theta)\Big)^2,$$

where $\lambda_i = \frac{n_i}{N}$ is the fraction of the total samples held by client $i$ and is fixed. The relation between robust optimization and variance regularization in non-FL settings (eq. (43)) encourages us to interpret VRed as an equivalent form of DRO. Hence, although the variance regularization used in VRed connects it non-trivially to the previous works AFL (Mohri et al., 2019) and DRFA (Deng et al., 2020) through DRO, it does not use a minimax objective function with potential convergence problems. As we have reported in our experiments, AFL fails to converge in settings with high data heterogeneity. Similarly, the authors of (Deng et al., 2020) have evaluated the DRFA algorithm only on a logistic regression model, which is a convex problem. Furthermore, as reported in our results (and also in fig. 3 of (Deng et al., 2020)), AFL, thanks to its DRO formulation, can improve the fairness (performance consistency) in the system.
However, it clearly degrades the overall average performance. Similarly, DRFA (as reported in fig. 3 of (Deng et al., 2020)) can improve the system fairness while keeping the same level of global accuracy as FedAvg. This is our motivation for proposing our Semi-VRed algorithm, which solves the following problem instead of those of the previous DRO-based algorithms:

$$\min_\theta \sum_{i=1}^{n} \lambda_i f_i(\theta) + \beta \sum_{i=1}^{n} \lambda_i \Big(f_i(\theta) - \sum_{j=1}^{n} \lambda_j f_j(\theta)\Big)_+^2,$$

where $\lambda_i = \frac{n_i}{N}$ is the fraction of the total samples held by client $i$ and is fixed. We have shown by lemma 2, lemma 3 and example 1 that Semi-VRed penalizes only the clients whose loss exceeds the average, a more efficient formulation for achieving fairness in FL systems, which improves fairness without degrading the overall system performance.
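The difference between the two objectives can be illustrated with a few lines of code. The sketch below evaluates the VRed and Semi-VRed objectives for a fixed vector of client losses and checks the Semi-VRed value against the closed form of equation 40 in the setting of example 1. The function names and the numeric values of the per-class losses, $n$, $\alpha$ and $\beta$ are hypothetical and chosen only for illustration.

```python
def vred_objective(losses, lam, beta):
    """Weighted mean loss plus beta times the weighted variance of the losses."""
    mean = sum(l * f for l, f in zip(lam, losses))
    return mean + beta * sum(l * (f - mean) ** 2 for l, f in zip(lam, losses))


def semi_vred_objective(losses, lam, beta):
    """Weighted mean loss plus beta times the weighted semi-variance.

    Only clients whose loss exceeds the weighted mean are penalized.
    """
    mean = sum(l * f for l, f in zip(lam, losses))
    return mean + beta * sum(l * max(f - mean, 0.0) ** 2 for l, f in zip(lam, losses))


# Setting of example 1: client 1 mostly holds class 1, all others class 2.
n, alpha, beta = 10, 0.1, 0.5
l1, l2 = 1.0, 0.3  # hypothetical per-class losses with l1 > l2

client_losses = [(1 - alpha / 2) * l1 + alpha / 2 * l2]
client_losses += [alpha / 2 * l1 + (1 - alpha / 2) * l2] * (n - 1)
lam = [1 / n] * n  # uniform weights, as in the example

vred = vred_objective(client_losses, lam, beta)
semi = semi_vred_objective(client_losses, lam, beta)

# Closed form of the Semi-VRed objective for this setting (equation 40).
closed_form = (
    (alpha / 2 + (1 - alpha) / n) * l1
    + (alpha / 2 + (1 - alpha) * (n - 1) / n) * l2
    + beta * (n - 1) ** 2 * (1 - alpha) ** 2 / n ** 3 * (l1 - l2) ** 2
)
```

Because the semi-variance drops the contribution of clients below the mean, the Semi-VRed value never exceeds the VRed value for the same losses, weights and $\beta$.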



In order to have a fair comparison with our baseline algorithms, we do not use model personalization in this work.



4.3 CAN WE INTERPRET WHAT SEMI-VRED DOES? 4.3.1 OPTIMIZATION ASPECT

for any batch S ∼ B b i of b i.i.d samples from client i local data, the following inequalities hold (bounded stochastic gradient and bounded local variance):

Figure 1: Average and worst 10% test accuracies. Top left: CIFAR-10, top right: CIFAR-100, bottom left: CINIC-10, bottom right: StackOverflow. Due to divergence, results for AFL on CIFAR-10 and StackOverflow are not shown. All subfigures share the same legends and axis labels.

We compare our VRed and Semi-VRed algorithms with various fair FL algorithms existing in the literature, including: FedAvg (McMahan et al., 2017), AFL (Mohri et al., 2019), q-FFL (Li et al., 2020c), PropFair (Zhang et al., 2022a), TERM (Li et al., 2020a), GiFair (Yue et al., 2021) and Ditto (Li et al., 2021).

y)∼Q[ℓ(h(x, θ), y)] + 1 δ H ϕ (Q| Pn ) ≡ min θ E (x,y)∼ Pn ℓ(h(x, θ), y) + δ 2ϕ ′′ (1) E (x,y)∼ Pn ℓ(h(x, θ), y) -E (x,y)∼ Pn [ℓ(h(x, θ), y)] variance regularization is equivalent to DRO and can improve out-of-sample (test) performance. Maurer & Pontil (2009) and Namkoong & Duchi (2017) propose regularizing the empirical risk minimization (ERM) by the empirical variance of losses across training samples to balance bias and variance and improve out-of-sample (test) performance and convergence rate. Similarly, Shivaswamy & Jebara (2010) propose boosting binary classifiers based on a variance penalty applied to exponential loss.

Algorithm 1: VRed and Semi-VRed Input: global epoch T, client number n, loss function f_i, number of samples n_i for client i, number of total samples N, initial global model θ_0, local step number K_i for client i, learning rate η Let

Comparison between the performance of different algorithms on CIFAR-100. Second column: the percentage (%) of suffering clients with improved test accuracy. The value in parentheses shows the amount of test accuracy improvement averaged over suffering clients. Third column: the percentage (%) of well-performing clients with degraded test accuracy. The value in parentheses shows the amount of test accuracy improvement averaged over well-performing clients. Fourth column: the amount of improvement in the overall mean test accuracy

Details of the existing fair FL algorithms. f_i is the loss function of client i.

The best values of hyperparameters used for different datasets, chosen based on grid search.

C EXPERIMENTAL SETUP

In this section, we provide more experimental details that were deferred from the main paper. For each experiment, we report the average result obtained from three runs with different random seeds. For our experiments, we used an internal GPU server with six NVIDIA Tesla P100 GPUs. The experiments lasted about 4 weeks in total.

C.1 DATASETS AND MODELS

In this subsection, we describe the datasets we use in our experiments. For all the datasets, we use a batch size of 64.

CIFAR-10/100 (Krizhevsky et al., 2009) are two image classification datasets widely used in the literature as benchmarks. Each of these datasets contains 50000 sample images, with 10 and 100 balanced classes for CIFAR-10 and CIFAR-100, respectively. We use Dirichlet allocation (Wang et al., 2019) to distribute the data among 50 clients with label shift: we split the set of samples from class $k$ ($S_k$) into $n$ subsets ($S_k = S_{k,1} \cup S_{k,2} \cup \dots \cup S_{k,n}$), where $n$ is the number of clients and $S_{k,j}$ corresponds to client $j$. We do the split based on a Dirichlet distribution with parameter 0.05 (Dir(0.05)). When the split is done for all classes, we gather the samples corresponding to each client from the different classes: assuming there are $C$ classes in total, $S_{1,j} \cup S_{2,j} \cup \dots \cup S_{C,j}$ is the data allocated to client $j$. Having allocated the data of each client, we split it into a train set and a test set. The train-test split ratio is 50-50 and 60-40 for CIFAR-10 and CIFAR-100, respectively.

CINIC-10 (Darlow et al., 2018) is another benchmark vision dataset that we use in our experiments. There are a total of 270,000 sample images, which we distribute with label shift among 50 clients based on a Dir(0.5) distribution (Wang et al., 2019). We then randomly split the data of each client into train and test sets with a split ratio of 50-50.
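The label-shift allocation described for the vision datasets can be sketched with the standard library alone. The code below is a minimal illustration under stated assumptions, not the exact procedure of Wang et al. (2019) or of our experiments: the function names, the rounding of cut points, the toy label list and the seed are all hypothetical, and the symmetric Dirichlet proportions are sampled by normalizing Gamma draws.

```python
import random


def dirichlet_proportions(alpha, n_clients, rng):
    """Sample from a symmetric Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
    total = sum(draws)
    if total == 0.0:  # guard against underflow for very small alpha
        winner = rng.randrange(n_clients)
        return [1.0 if j == winner else 0.0 for j in range(n_clients)]
    return [d / total for d in draws]


def partition_by_label(labels, n_clients, alpha, seed=0):
    """Split sample indices among clients, one Dirichlet draw per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(n_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = dirichlet_proportions(alpha, n_clients, rng)
        start, cum = 0, 0.0
        for j, prop in enumerate(props):
            cum += prop
            # The last client takes whatever remains so no sample is dropped.
            end = len(idxs) if j == n_clients - 1 else round(cum * len(idxs))
            end = min(max(end, start), len(idxs))
            clients[j].extend(idxs[start:end])  # S_{k,j} for class k = y
            start = end
    return clients


def train_test_split(indices, train_ratio, rng):
    """Randomly split one client's indices into train and test parts."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(train_ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy usage: 1000 samples with 10 balanced classes, 5 clients, Dir(0.05).
toy_labels = [i % 10 for i in range(1000)]
toy_clients = partition_by_label(toy_labels, n_clients=5, alpha=0.05, seed=1)
train_idx, test_idx = train_test_split(toy_clients[0], 0.5, random.Random(1))
```

With a small concentration parameter such as 0.05, most of each class typically ends up with one or two clients, which is what produces the strong label shift; larger values such as 0.5 give a milder skew.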

StackOverflow (The Tensorflow Federated Authors, 2019) is a language dataset consisting of Shakespeare dialogues for the task of next word prediction. There is a natural heterogeneous partition of the dataset and we treat each speaking role as a client. We filter out the clients (speaking roles) with fewer than 10,000 samples from the original dataset and randomly select 20 clients from the remaining ones. Finally, we split the data of each client into train and test sets with a ratio of 50-50.

Table 2 provides a summary of the datasets we used and the models used for each of them.

(Zhang et al., 2022a), TERM (Li et al., 2020a), GiFair (Yue et al., 2021). For each pair of dataset and algorithm, we find the best learning rate based on a grid search. In the following, we report the learning rate grid we use for each dataset:

• CIFAR-10: {1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2};
• CIFAR-100: {1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2};
• CINIC-10: {1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2};

Empirical optimization is usually used as a data-driven approach for tuning models for decision making, where an expected loss is minimized based on some available train data. The trained model is then used for prediction tasks on some test data. However, if the empirical distribution of the train data is substantially different from that of the test data, our confidence in predictions made on the test data with the trained model diminishes. Robust empirical optimization has been used to address this problem (Bertsimas et al., 2018b; a; Ben-Tal et al., 2013). The work in (ya Gotoh et al., 2018) formulated a distributionally robust optimization (DRO) problem based on a minimax problem, where a model is trained on the given train data against the worst distribution shifts between the train and test data:

