FEDDEBIAS: REDUCING THE LOCAL LEARNING BIAS IMPROVES FEDERATED LEARNING ON HETEROGENEOUS DATA

Abstract

Federated Learning (FL) is a machine learning paradigm that learns from data kept locally in order to safeguard the privacy of clients, while local SGD is typically employed on the clients' devices to improve communication efficiency. However, such a scheme is currently constrained by the slow and unstable convergence induced by clients' heterogeneous data. In this work, we identify three under-explored phenomena of biased local learning that may explain these challenges caused by local updates in supervised FL. As a remedy, we propose FedDebias, a novel unified algorithm that reduces the local learning bias on features and classifiers to tackle these challenges. FedDebias consists of two components: the first component alleviates the bias in the local classifiers by balancing the output distribution of models; the second component learns client-invariant features that are close to global features but considerably distinct from those learned from other input distributions. In a series of experiments, we show that FedDebias consistently outperforms other SOTA FL and domain generalization (DG) baselines, and both of its components yield individual performance gains.

1. INTRODUCTION

Federated Learning (FL) is an emerging privacy-preserving distributed machine learning paradigm. The server transmits the model to the clients, and once the clients have completed local training, the parameter updates are sent back to the server for aggregation. Clients are not required to share local raw data during this procedure, which preserves their privacy. As the workhorse algorithm in FL, FedAvg (McMahan et al., 2016) proposes local SGD to improve communication efficiency. However, the considerable heterogeneity between local client datasets leads to inconsistent local updates and hinders convergence. Several studies propose variance reduction methods (Karimireddy et al., 2019; Das et al., 2020), or suggest regularizing local updates towards global models (Li et al., 2018b; 2021). Almost all these existing works directly regularize models by utilizing the global model collected from previous rounds, either to reduce the variance or to minimize the distance between the global and local models (Li et al., 2018b; 2021). However, it is hard to balance the trade-off between optimization and regularization, and data heterogeneity remains an open question in the community, as justified by the limited performance gains, e.g., in our Table 1.

To this end, we begin by revisiting and reinterpreting the issues caused by data heterogeneity and local updates. We identify three pitfalls of FL, termed local learning bias, from the perspective of representation learning: 1) biased local classifiers are unable to effectively classify unseen data (c.f. Figure 1(a)), due to shifted decision boundaries dominated by the local class distributions; 2) local features (extracted by a local model) differ significantly from global features (similarly extracted by a centralized global model), even for the same input data (c.f. Figure 1(b)); and 3) local features, even for data from different classes, are close to each other and cannot be accurately distinguished (c.f. Figure 1(b)).

As a remedy, we propose FedDebias, a unified method that leverages globally shared pseudo-data and two key algorithmic components to simultaneously address the three difficulties outlined above. The first component of FedDebias alleviates the first difficulty by forcing the output distribution of the pseudo-data to be close to the global prior distribution. The second component of FedDebias is designed for the second and third difficulties. In order to tackle these two difficulties simultaneously, we develop a min-max contrastive learning method to learn client-invariant local features. More precisely, instead of directly minimizing the distance between global and local features, we design a two-stage algorithm. The first stage learns a projection space (an operation that can maximize the difference between global and local features while minimizing the difference between local features of different inputs) to distinguish the features of the two types. The second stage then debiases the features by leveraging the trained projection space to enforce learned features that are closer to global features and farther from local features.

Figure 1: The difference between features extracted by client 1's local feature extractor and the global feature extractor is substantially large; however, client 2's local features are close to client 1's, even for input data from different data distributions/clients.

We examine the performance of FedDebias and compare it with other FL and domain generalization baselines on RotatedMNIST, CIFAR10, and CIFAR100. Numerical results show that FedDebias consistently outperforms other algorithms by a large margin in mean accuracy and convergence speed. Furthermore, both components have individual performance gains, and the combined approach FedDebias yields the best results.

Contributions

• We propose FedDebias, a unified algorithm that leverages pseudo-data to reduce the learning bias on local features and classifiers. We design two orthogonal key components of FedDebias that complement each other to improve the learning quality of clients with heterogeneous data.
• FedDebias considerably outperforms other FL and domain generalization (DG) baselines, as justified by extensive numerical evaluation.

2. RELATED WORKS

Federated Learning (FL). As the de facto FL algorithm, McMahan et al. (2016); Lin et al. (2020b) propose to use local SGD steps to alleviate the communication bottleneck. However, the objective inconsistency caused by local data heterogeneity considerably hinders the convergence of FL algorithms (Li et al., 2018b; Wang et al., 2020; Karimireddy et al., 2019; 2020; Guo et al., 2021). To address the issue of heterogeneity in FL, a series of works has been proposed. FedProx (Li et al., 2018b) regularizes local updates towards the global model, while other approaches rely on careful client or data selection (Tuor et al., 2021; Yoshida et al., 2019). Some knowledge distillation-based methods also require a global dataset (Lin et al., 2020a; Li & Wang, 2019), which is used to transfer knowledge from local models (teachers) to global models (students). Considering the impracticality of sharing global datasets in FL settings, some recent research uses proxy datasets with augmentation techniques. Astraea (Duan et al., 2019) uses local augmentation to create a globally balanced distribution. XorMixFL (Shin et al., 2020) encodes local data samples and decodes them on the server using the XOR operator. FedMix (Yoon et al., 2021b) creates privacy-protected augmentation data by averaging local batches and then applying Mixup in local iterations. VHL (Tang et al., 2022) relies on created virtual data with labels and forces the local features to be close to the features of same-class virtual data. Our framework significantly outperforms VHL; unlike VHL, our solution has no label constraint and uses much less pseudo-data.

Distribution Robust FL. Domain generalization is a well-studied field, aiming to learn domain-robust models that perform well on unknown distributions.
Some methods apply domain robust optimization (Sagawa et al., 2019; Hu & Hong, 2013; Michel et al., 2021) to minimize the worst-case empirical error, while others propose to learn domain-invariant features (Ganin et al., 2015; Li et al., 2018c; a; Sun & Saenko, 2016) by minimizing the distance between features from different domains. By treating each client as a domain, some existing works tackle the FL problem as a domain generalization problem. Several methods optimize the weights of different clients to lower the worst-case empirical error among all clients (Mohri et al., 2019; Deng et al., 2021). Huang et al. (2021) assumes each client has two local datasets with different distributions, and robustness is obtained by balancing the two local datasets. Xie et al. (2021) proposes collecting gradients from one segment of clients first, then combining them as a global gradient to reduce variance in the other segments. Reisizadeh et al. (2020) assumes the local distribution is perturbed by an affine function, i.e., from x to Ax + b. There are also methods that aim to learn client-invariant features (Peng et al., 2019; Wang et al., 2022; Shen et al., 2021; Sun et al., 2022; Gan et al., 2021). However, these methods are designed to learn a model that performs well on unseen deployment distributions that differ from the (seen) clients' local distributions, which is beyond the scope of this paper. Recently, Moon (Li et al., 2021) has proposed to employ a contrastive loss to reduce the distance between global and local features. However, their projection layer is only used as part of the feature extractor, and cannot contribute to distinguishing the local and global features, a crucial step identified by our investigation for better model performance.

3. THE PITFALLS OF FL ON HETEROGENEOUS DATA DISTRIBUTIONS

FL and local SGD. FL is an emerging learning paradigm that assumes learning on various clients, while clients cannot exchange data in order to protect users' privacy. Learning occurs locally on the clients, while the server collects and aggregates the model updates from the clients. The standard FL considers the following problem:

$f^* = \min_{\omega \in \mathbb{R}^d} \Big[ f(\omega) = \sum_{i=1}^{N} p_i f_i(\omega) \Big],$

where $f_i(\omega)$ is the local objective function of client $i$, and $p_i$ is the weight of $f_i(\omega)$. In practice, we set $p_i = |D_i| / |D|$ by default, where $D_i$ is the local dataset of client $i$ and $D$ is the combination of all local datasets. The global objective function $f(\omega)$ aims to find an $\omega$ that performs well on all clients. In the training process of FL, the communication cost between clients and server is an essential factor affecting training efficiency. Therefore, local SGD (McMahan et al., 2016) has been proposed to reduce the number of communication rounds: in local SGD, clients perform multiple local steps before synchronizing with the server in each communication round.

The negative impact of local update steps. Despite the success of local SGD, the non-iid nature of clients' local data leads to local gradient inconsistency, which slows down convergence (Li et al., 2018b; Karimireddy et al., 2019). A series of studies has proposed methods for client heterogeneity to address this issue. One natural idea is to use the global gradient/model of previous rounds during the local updates, either to reduce the variance or to minimize the distance between the global and local models (Karimireddy et al., 2019; 2020; Li et al., 2018b; 2021). However, the performance of such algorithms is limited in our challenging scenarios (as shown in Table 1). Using FedProx (Li et al., 2018b) as an example, setting a larger weight for the proximal term hinders further optimization steps of the local model, while setting a small weight results in only a marginal improvement over FedAvg.
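To make the weighted objective and the role of local SGD concrete, here is a minimal numpy sketch of one FedAvg-style communication round. The quadratic client objectives and all function names are illustrative assumptions standing in for the real local losses f_i, not the paper's implementation:

```python
import numpy as np

def local_sgd(omega, grad_fn, lr=0.1, local_steps=10):
    """Run `local_steps` of SGD starting from the global model `omega`."""
    w = omega.copy()
    for _ in range(local_steps):
        w = w - lr * grad_fn(w)
    return w

def fedavg_round(omega, client_grads, client_sizes, lr=0.1, local_steps=10):
    """One communication round: each client runs local SGD, then the
    server averages the local models with weights p_i = |D_i| / |D|."""
    sizes = np.asarray(client_sizes, dtype=float)
    p = sizes / sizes.sum()
    local_models = [local_sgd(omega, g, lr, local_steps) for g in client_grads]
    return sum(pi * wi for pi, wi in zip(p, local_models))

# Two toy heterogeneous clients: f_i(w) = 0.5 * ||w - c_i||^2, so grad = w - c_i.
c1, c2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
omega = np.zeros(2)
for _ in range(50):
    omega = fedavg_round(omega, [lambda w: w - c1, lambda w: w - c2], [100, 100])
# With equal client weights, the global minimizer is (c1 + c2) / 2.
```

Even in this toy setting, each client's local steps drift towards its own minimizer c_i (the local gradient inconsistency discussed above); averaging still recovers the global optimum here only because the objectives are quadratic and the weights are equal.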
Bias caused by local updates. To mitigate the negative impact of local updates, we first identify the pitfalls of FL on heterogeneous data with a sufficient number of local updates, and then design algorithms to address the issues caused by the local updates. The pitfalls can be justified by a toy experiment. More precisely, we divide the MNIST dataset into two sets. The first dataset, denoted by X_1, contains the 5 classes 0-4. The other dataset, denoted by X_2, contains the remaining five classes. We then train a CNN model on X_1 for 10 epochs and store its feature extractor as the local feature extractor F_1; a model trained centrally on the full dataset provides the global feature extractor F_g. We define the resulting observation as the "biased local feature". In detail, we calculate F_1(X_1), F_1(X_2), F_g(X_1), and F_g(X_2), and use t-SNE to project all features onto the same 2D space. We can observe that the local features of data in X_2 are so close to the local features of data in X_1 that it is non-trivial to tell which category the current input belongs to by merely looking at the local features.

• Biased local feature: For a local feature extractor F_i(·) and a centrally trained global feature extractor F_g(·), we have: 1) given the input data X, F_i(X) could deviate largely from F_g(X); 2) given inputs from different data distributions X_1 and X_2, F_i(X_1) could be very similar or almost identical to F_i(X_2).
• Biased local classifier: After a sufficient number of iterations, local models classify all samples into only the classes that appear in the local datasets.

It is worth noting that some related works also discuss the learning bias (Karimireddy et al., 2019; Li et al., 2018b; 2021). However, there is an inherent difference compared with previous works. 1) FedProx defines the local drift as the difference between model weights $\|\omega_g - \omega_i\|$, and SCAFFOLD considers the gradient difference as client drift.
Despite the theoretical success of these methods, they usually bring only minor improvements on deep models (Tang et al., 2022; Li et al., 2021; Yoon et al., 2021a; Chen & Chao, 2021; Luo et al., 2021). 2) Though MOON is a crucial first step that minimizes the distance between global and local features, its performance gain is still limited due to its improper methodology design (c.f. Table 1).

4. FEDDEBIAS: REDUCING LEARNING BIAS IN FL BY PSEUDO-DATA

Addressing the local learning bias (c.f. Definition 3.3) is crucial to improving FL on heterogeneous data. To this end, we propose FedDebias, as shown in Figure 4, a novel framework that leverages globally shared pseudo-data with two key components to reduce the local training bias: 1) reducing the local classifier's bias by balancing the output distribution of classifiers (component 1), and 2) an adversarial contrastive scheme to learn unbiased local features (component 2).

4.1. OVERVIEW OF THE FEDDEBIAS

The learning procedure of FedDebias on each client i involves the construction of globally shared pseudo-data (c.f. Section 4.2), followed by applying two key debias steps in a min-max approach that jointly form two components (c.f. Sections 4.3 and 4.4) to reduce the bias in the classifier and the features, respectively. The min-max procedure of FedDebias can be interpreted as first projecting features onto the space that best distinguishes global and local features, and then 1) minimizing the distance between the global and local features of the pseudo-data while maximizing the distance between the local features of the pseudo-data and of the local data; and 2) minimizing the classification loss of both local data and pseudo-data:

Max Step: $\max_{\theta} L_{adv}(D_p, D_i) = \mathbb{E}_{x_p \sim D_p, x \sim D_i} \big[ L_{con}(x_p, x, \phi_g, \phi_i, \theta) \big].$  (2)

Min Step: $\min_{\phi_i, \omega} L_{gen}(D_p, D_i) = \mathbb{E}_{(x,y) \sim D_i} \big[ L_{cls}(x, y, \phi_i, \omega) \big] + \lambda \mathbb{E}_{x_p \sim D_p} \big[ L_{cls}(x_p, \tilde{y}_p, \phi_i, \omega) \big] + \mu \mathbb{E}_{x_p \sim D_p, x \sim D_i} \big[ L_{con}(x_p, x, \phi_g, \phi_i, \theta) \big].$  (3)

L_cls and L_con represent the cross-entropy loss and a contrastive loss (detailed in Section 4.4), respectively. D_i denotes the distribution of the local dataset at client i, and D_p is that of the shared pseudo-dataset, where ỹ_p is the pseudo-label of the pseudo-data. The model is composed of a feature extractor ϕ and a classifier ω, where the subscripts i and g correspond to the local client i and the global parameters, respectively (e.g., ϕ_g denotes the feature extractor received from the server at the beginning of each communication round). We additionally use a projection layer θ for the max step to project features onto the space where global and local features have the largest dissimilarity. Apart from the standard classification loss on local data in Equation (3), the second term aims to overcome the biased local classifier, while the local feature is debiased by the third term. The proposed FedDebias is summarized in Algorithm 1. The global communication part is the same as in FedAvg, and synchronizing new pseudo-data to clients in each round is optional.
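The alternating max/min structure above can be sketched as follows. The gradient callbacks are toy quadratic stand-ins for ∇L_adv and ∇L_gen (all names and losses here are hypothetical, chosen only so the alternating ascent/descent dynamics are visible, not the paper's networks):

```python
def local_training(phi, omega, theta, grad_adv_theta,
                   grad_gen_omega, grad_gen_phi, lr=0.05, steps=60):
    """One client's local loop in the spirit of Algorithm 1: each step
    first ascends L_adv in the projection parameters theta (max step),
    then descends L_gen in the classifier omega and feature extractor
    phi (min step)."""
    for _ in range(steps):
        theta = theta + lr * grad_adv_theta(phi, theta)  # max step (ascent)
        omega = omega - lr * grad_gen_omega(omega)       # min step: classifier
        phi = phi - lr * grad_gen_phi(phi, theta)        # min step: features
    return phi, omega, theta

# Toy stand-ins (hypothetical): L_adv(phi, theta) = -(theta - phi)^2, so
# ascent pulls theta toward phi; L_gen has a classification part
# (omega - 1)^2 and a feature part (phi - theta)^2.
phi, omega, theta = local_training(
    phi=2.0, omega=0.0, theta=0.0,
    grad_adv_theta=lambda p, t: -2.0 * (t - p),
    grad_gen_omega=lambda w: 2.0 * (w - 1.0),
    grad_gen_phi=lambda p, t: 2.0 * (p - t),
)
```

The design choice worth noticing is the ordering: the projection parameters θ are updated first in each step, so the subsequent feature update always works against the currently strongest discriminator.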

4.2. CONSTRUCTION OF THE PSEUDO-DATA

The choice of the pseudo-data in our FedDebias framework is arbitrary. For ease of presentation, and taking the communication cost into account, we showcase two construction approaches below and detail their performance gain over all other existing baselines in Section 5:

• Random Sample Mean (RSM). Similar to the treatment in FedMix (Yoon et al., 2021b), one RSM sample of the pseudo-data is estimated through a weighted combination of a random subset of local samples, and the pseudo-label ỹ_p is set to the corresponding weighted average of the labels.

• Mixture of local samples and the RSM of a proxy dataset (Mixture). This strategy relies on applying the RSM procedure to an irrelevant, globally shared proxy dataset (refer to Algorithm 3). To control the distribution distance between the pseudo-data and the local data, one pseudo-data sample at each client is constructed by

$\tilde{x}_p = \frac{1}{K+1} \Big( x_p + \sum_{k=1}^{K} x_k \Big), \qquad \tilde{y}_p = \frac{1}{K+1} \Big( \frac{1}{C} \cdot \mathbf{1} + \sum_{k=1}^{K} y_k \Big),$  (4)

where x_p is one RSM sample of the global proxy dataset, and x_k and y_k correspond to the data and label of one local sample (varying per client). K is a constant that controls the closeness between the distributions of the pseudo-data and the local data. As we show in Section 5, setting K = 1 is data-efficient yet sufficient to achieve good results.

Algorithm 1: Algorithm Framework of FedDebias
for chosen client i = 1, . . . , M do
    ω_i^0 = ω_t, θ_i^0 = θ_t, ϕ_i^0 = ϕ_t, ϕ_g = ϕ_t
    for k = 1, . . . , K do
        # Max Step
        θ_i^k = θ_i^{k-1} + η ∇_θ L_adv
        # Min Step
        ω_i^k = ω_i^{k-1} − η ∇_ω L_gen
        ϕ_i^k = ϕ_i^{k-1} − η ∇_ϕ L_gen
    Send ω_i^K, θ_i^K, ϕ_i^K to the server.
ω_{t+1} = (1/M) Σ_{i=1}^{M} ω_i^K;  θ_{t+1} = (1/M) Σ_{i=1}^{M} θ_i^K;  ϕ_{t+1} = (1/M) Σ_{i=1}^{M} ϕ_i^K.
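A minimal numpy sketch of the two construction approaches might look as follows. The exact RSM pseudo-label rule sits behind a footnote in the paper, so averaging the one-hot labels, as in FedMix, is an assumption here; all function names are illustrative:

```python
import numpy as np

def rsm(batch_x, batch_y, num_classes):
    """Random Sample Mean: one pseudo-sample is the mean of a random
    local batch. Assumption (FedMix-style): the soft pseudo-label is
    the mean of the one-hot labels of that batch."""
    x_p = batch_x.mean(axis=0)
    y_p = np.eye(num_classes)[batch_y].mean(axis=0)
    return x_p, y_p

def mixture(x_proxy_rsm, local_x, local_y, num_classes):
    """Mixture (Equation (4)): combine one RSM sample of the global
    proxy dataset with K local samples; the proxy part carries the
    uniform label (1/C) * 1."""
    K = len(local_x)
    x_tilde = (x_proxy_rsm + local_x.sum(axis=0)) / (K + 1)
    y_tilde = (np.full(num_classes, 1.0 / num_classes)
               + np.eye(num_classes)[local_y].sum(axis=0)) / (K + 1)
    return x_tilde, y_tilde

# K = 1, as recommended in Section 5: one proxy RSM sample + one local sample.
x_p, y_p = rsm(np.ones((4, 3)), np.array([0, 1, 0, 1]), num_classes=2)
x_t, y_t = mixture(np.zeros(3), np.ones((1, 3)), np.array([0]), num_classes=2)
```

Note how the Mixture label ỹ_p blends the uniform prior with the local one-hot label, so the pseudo-data stays close to the local distribution while never collapsing onto a single class.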

4.3. COMPONENT 1: REDUCING BIAS IN LOCAL CLASSIFIERS

Due to label distribution skew or the absence of samples from the majority/minority classes, the trained local classifier tends to overfit the locally presented classes, which may further hinder the quality of the feature extractor (as justified in Figure 3 and Definition 3.3). As a remedy, we implicitly mimic the global data distribution, by using the pseudo-data constructed in Section 4.2, to regularize the outputs and thus debias the classifier: $\lambda \mathbb{E}_{x_p \sim D_p} \big[ L_{cls}(x_p, \tilde{y}_p, \phi_i, \omega) \big]$. Note that Component 1 appears as the second term of Equation (3).
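A small sketch of this regularizer, assuming a soft cross-entropy against the pseudo-label (the function names are illustrative, not the paper's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classifier_debias_loss(logits_pseudo, y_pseudo, lam=1.0):
    """Component 1: soft cross-entropy between the model's outputs on
    the pseudo-data and the pseudo-label. Because the pseudo-label
    mixes all classes (e.g. the uniform prior 1/C), a classifier that
    collapses onto the locally seen classes pays a large penalty."""
    p = softmax(logits_pseudo)
    return lam * float(-(y_pseudo * np.log(p + 1e-12)).sum(axis=-1).mean())

uniform = np.array([[0.5, 0.5]])
# A classifier collapsed onto class 0 is penalized far more than a
# balanced one when scored against the uniform pseudo-label.
loss_biased = classifier_debias_loss(np.array([[10.0, -10.0]]), uniform)
loss_balanced = classifier_debias_loss(np.array([[0.0, 0.0]]), uniform)
```

Here the balanced output attains the minimum (the entropy of the pseudo-label, log 2 for two classes), which is exactly the debiasing pressure described above.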

4.4. COMPONENT 2: REDUCING BIAS IN LOCAL FEATURES

In addition to alleviating the biased local classifier in Section 4.3, here we introduce a crucial adversarial strategy to learn unbiased local features.

Intuition of constructing an adversarial problem. As discussed in Definition 3.3, effective federated learning on heterogeneous data requires learning debiased local feature extractors that 1) extract local features close to the global features of the same input data; and 2) extract different local features for input samples from different distributions. However, existing methods that directly minimize the distance between global and local features (Li et al., 2018b; 2021) have limited performance gains (c.f. Table 1) due to the diminishing optimization objective caused by the indistinguishability between the global and local features of the same input. To this end, we propose to extend the idea of adversarial training to our FL scenario:
1. We construct a projection layer as the critical step to distinguish features extracted by the global and local feature extractors: this layer ensures that the projected features extracted by the local feature extractor are close to each other (even for distinct local data distributions), while the difference between the projected features extracted by the global and local feature extractors is considerable (even for the same input samples).
2. Constructing such a projection layer can be achieved by maximizing the local feature bias discussed in Definition 3.3. More precisely, it can be achieved by maximizing the distance between the global and local features of the pseudo-data while simultaneously minimizing the distance between the local features of the pseudo-data and of the local data.
3. We then minimize the local feature bias (discussed in Definition 3.3) under the trained projection space, so as to enforce the learned local features of the pseudo-data to be closer to the global features of the pseudo-data but far away from the local features of the real local data.
On the importance of utilizing the projection layer to construct the adversarial problem. To construct the aforementioned adversarial training strategy, we consider using an additional projection layer to map features onto the projection space. In contrast to existing works that similarly add a projection layer (Li et al., 2021), we show that 1) simply adding a projection layer as part of the feature extractor yields a trivial performance gain (c.f. Figure 5(a)); and 2) our design is the key step to reducing the feature bias and boosting federated learning on heterogeneous data (c.f. Table 3).

Objective function design. We extend the idea of Li et al. (2021) and adapt the contrastive loss initially proposed in SimCLR (Chen et al., 2020) to our challenging scenario. Different from previous works, we use the projected (global and local) features of the pseudo-data as the positive pair, and the projected local features of the pseudo-data and the local data as the negative pair:

$L_{con}(x_p, x, \phi_g, \phi_i, \theta) = -\log \frac{\exp\!\Big(\frac{\mathrm{sim}(P(\phi_i(x_p)), P(\phi_g(x_p)))}{\tau_1}\Big)}{\exp\!\Big(\frac{\mathrm{sim}(P(\phi_i(x_p)), P(\phi_g(x_p)))}{\tau_1}\Big) + \exp\!\Big(\frac{\mathrm{sim}(P(\phi_i(x_p)), P(\phi_i(x)))}{\tau_2}\Big)},$  (5)

where P is the projection layer parameterized by θ, τ_1 and τ_2 are temperature parameters, and sim is the cosine-similarity function. Our implementation ties the values of τ_1 and τ_2 for simplicity, but improved performance may be observed by tuning the two separately.
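Equation (5) can be sketched directly on feature vectors. The helper below assumes the inputs are already projected by P (single vectors rather than batches) and uses the tied default temperatures τ_1 = τ_2 = 2.0 reported in Appendix A:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def debias_contrastive_loss(z_local_pseudo, z_global_pseudo, z_local_real,
                            tau1=2.0, tau2=2.0):
    """Equation (5): the projected local/global features of the
    pseudo-data form the positive pair; the projected local features of
    the pseudo-data and of the real local data form the negative pair."""
    pos = np.exp(cos_sim(z_local_pseudo, z_global_pseudo) / tau1)
    neg = np.exp(cos_sim(z_local_pseudo, z_local_real) / tau2)
    return float(-np.log(pos / (pos + neg)))

# Debiased case: local pseudo-feature aligned with the global one and
# orthogonal to the real local feature -> small loss.
loss_debiased = debias_contrastive_loss(
    np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Biased case: local pseudo-feature stuck to the real local feature and
# far from the global one -> large loss.
loss_biased = debias_contrastive_loss(
    np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

Minimizing this loss therefore pulls the local pseudo-feature towards its global counterpart while pushing it away from the features of the real local data, exactly the min step of the adversarial scheme.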

5.1. EXPERIMENT SETTING

We elaborate on the detailed experiment settings in Appendix A.

Baseline algorithms. We compare FedDebias with both FL baselines and commonly used domain generalization (DG) baselines that can be adapted to FL scenarios. Note that we do not consider domain generalization scenarios; we include DG baselines to check whether DG methods can benefit FL on non-iid clients. For FL baselines, we choose FedAvg (McMahan et al., 2016), Moon (Li et al., 2021), FedProx (Li et al., 2018b), VHL (Tang et al., 2022), and FedMix (Yoon et al., 2021b), which are most relevant to our proposed algorithms. For DG baselines, we choose GroupDRO (Sagawa et al., 2019), Mixup (Yan et al., 2020), and DANN (Ganin et al., 2015). Unless specifically mentioned, all algorithms use FedAvg as the backbone algorithm.

Models and datasets.

We examine all algorithms on the RotatedMNIST, CIFAR10, and CIFAR100 datasets. We use a four-layer CNN for RotatedMNIST, VGG11 for CIFAR10, and the Compact Convolutional Transformer (CCT (Hassani et al., 2021)) for CIFAR100. We split the datasets following the idea introduced in Yurochkin et al. (2019). Additional results regarding the impact of hyper-parameter choices and the performance gain of FedDebias on other datasets/settings/evaluation metrics can be found in Appendix C.

Table 3: Ablation studies of FedDebias on the effects of the two components. We show the performance of the two components, and remove the max step (Line 8 in Algorithm 1) of component 2. We split RotatedMNIST, CIFAR10, and CIFAR100 into 10 clients with α = 0.1. We run 1000 communication rounds on RotatedMNIST and CIFAR10 for each algorithm and 800 communication rounds on CIFAR100. We report the mean of the maximum (over rounds) 5 test accuracies and the number of communication rounds needed to reach the target accuracy.

5.2. NUMERICAL RESULTS

The superior performance of FedDebias over existing FL and DG algorithms. In Table 1, we show the results of the baseline methods as well as our proposed FedDebias algorithm. When comparing the different FL and DG algorithms, we find that: 1) FedDebias performs best in all settings; 2) DG baselines only slightly outperform ERM, and some are even worse; 3) regularizing local models towards global models from prior rounds, as in Moon and FedProx, does not yield positive outcomes.

Comparison with VHL. We vary the size of the virtual data in VHL and compare it with our FedDebias in Table 2: our communication-efficient FedDebias only uses 32 pseudo-data samples and transfers the pseudo-data once, while the communication-intensive VHL (Tang et al., 2022) requires the size of the virtual dataset to be proportional to the number of classes and uses at least 2,000 virtual samples (the authors suggest 2,000 for CIFAR10 and 20,000 for CIFAR100 in the released official code; we use the default hyper-parameter values and implementation provided by the authors). We find that: 1) FedDebias always outperforms VHL; 2) FedDebias overcomes several shortcomings of VHL, e.g., the need for labeled virtual data and the large size of the virtual dataset.

Figure 5: Ablation studies of FedDebias, regarding the impact of the projection layer, the communication strategy of the pseudo-data, and the choice of pseudo-data. In Figure 5(a), we show the performance of algorithms with/without the additional projection layer on the CIFAR10 dataset with the VGG11 model. In Figure 5(b), we show the performance of FedDebias on the RotatedMNIST, CIFAR10, and CIFAR100 datasets when only transferring pseudo-data once (at the beginning of training) or generating new pseudo-data each round. In Figure 5(c), we show the performance of FedDebias using different types of pseudo-data (all transferred once at the beginning of training). We split each dataset into 10 clients with α = 0.1, and use a CNN for RotatedMNIST, VGG11 for CIFAR10, and CCT for CIFAR100. We run 1000 communication rounds on RotatedMNIST and CIFAR10 for each algorithm and 800 communication rounds on CIFAR100. We report the mean of the maximum 5 test accuracies.

5.3. ABLATION STUDIES

Effectiveness of the different components in FedDebias. In Table 3, we show the improvements brought by the different components of FedDebias. To highlight the importance of the two components, especially the max step (c.f. Line 8 in Algorithm 1) in component 2, we first consider the two components of FedDebias individually, and then remove the max step. We find that: 1) both components of FedDebias individually improve over FedAvg, but the combined solution FedDebias consistently achieves the best performance; 2) the projection layer is crucial: after removing the projection layers, component 2 of FedDebias performs even worse than FedAvg, an insight that may also explain the limitations of Moon (Li et al., 2021).

Performance of FedDebias on CIFAR10 with different numbers of clients. In Table 4, we vary the number of clients among {10, 30, 100}. For each setting, 10 clients are randomly chosen in each communication round. FedDebias outperforms FedAvg by a significant margin in all settings.

Reducing the communication cost of FedDebias. To reduce the communication overhead, we reduce the size of the pseudo-data and only transmit one mini-batch of pseudo-data (64 for MNIST and 32 for the others) once at the beginning of training. In Figure 5(b), we show the performance of FedDebias when the pseudo-data is only transferred to clients at the beginning of training (64 pseudo-data for RotatedMNIST, and 32 for CIFAR10 and CIFAR100). Results show that transferring the pseudo-data only once achieves a performance gain comparable to transferring pseudo-data in each round. This indicates that the performance of FedDebias does not drop even when only a small amount of pseudo-data is provided.

Regarding privacy issues caused by RSM. Because RSM may raise privacy concerns, we consider using Mixture to protect privacy.
In Figure 5(c), we show the performance of FedDebias with different types of pseudo-data (pseudo-data transferred only once at the beginning of training, as in Figure 5(b)). Results show that: 1) FedDebias consistently outperforms FedAvg for all types of pseudo-data; 2) when using Mixture as pseudo-data and setting K = 0 (Equation (4)), FedDebias still has a performance gain over FedAvg, and a more significant gain can be observed by setting K = 1.

A EXPERIMENT DETAILS

Framework and baseline algorithms. In addition to traditional FL methods, we aim to see whether domain generalization (DG) methods can help increase model performance during FL training. Thus, we use the DomainBed benchmark (Gulrajani & Lopez-Paz, 2020), which contains a series of regularly used DG algorithms and datasets. The algorithms in DomainBed can be divided into three categories:

• Infeasible methods: Some algorithms cannot be applied in FL scenarios due to privacy concerns, for example, MLDG (Li et al., 2017), MMD (Li et al., 2018a), CORAL (Sun & Saenko, 2016), and VREx (Krueger et al., 2020), which need features or data from each domain in each iteration.
• Feasible methods (with limitations): Some algorithms can be applied in FL scenarios with some limitations. For example, DANN (Ganin et al., 2015) and CDANN (Li et al., 2018c) require knowing the number of domains/clients, which is impractical in the cross-device setting.
• Feasible methods (without limitations): Some algorithms can be directly applied in FL settings, for example, ERM, GroupDRO (Sagawa et al., 2019), Mixup (Yan et al., 2020), and IRM (Arjovsky et al., 2019).

We choose several commonly used DG algorithms that can easily be applied in FL scenarios, including ERM, GroupDRO (Sagawa et al., 2019), Mixup (Yan et al., 2020), and DANN (Ganin et al., 2015). For FL baselines, we choose FedAvg (McMahan et al., 2016) (equal to ERM), Moon (Li et al., 2021), FedProx (Li et al., 2018b), SCAFFOLD (Karimireddy et al., 2019), and FedMix (Yoon et al., 2021b), which are most related to our proposed algorithms. Note that some existing works consider combining FL and domain generalization, for example, combining DRO with FL (Mohri et al., 2019; Deng et al., 2021), or combining MMD or DANN with FL (Peng et al., 2019; Wang et al., 2022; Shen et al., 2021).
The natural idea of the former two DRO-based approaches is the same as in our GroupDRO implementation, with minor differences in the weight updates; the goal of the latter series of works that combine MMD or DANN is to train models that work well on unseen distributions, which is orthogonal to our consideration (overcoming the local heterogeneity). To check the performance of this series of works, we choose to integrate FL and DANN into our environments. Note that we carefully tune all baseline methods. The implementation details of each algorithm are listed below:

• GroupDRO: The weight of each client is updated by $\omega_i^{t+1} = \omega_i^t \exp(0.01\, l_i^t)$, where $l_i^t$ is the loss value of client i at round t.
• Mixup: Local data is mixed by $x = \lambda x_i + (1 - \lambda) x_j$, where λ is sampled from Beta(0.2, 0.2).
• DANN: We use a three-layer MLP as the domain discriminator, where the width of the MLP is 256. The weight of the domain discrimination loss is tuned in {0.01, 0.1, 1}.
• FedProx: The weight of the proximal term is tuned in {0.001, 0.01, 0.1}.
• Moon: The projection layer is a two-layer MLP; the MLP width is set to 256, and the output dimension is 128. We tune the weight of the contrastive loss in {0.01, 0.1, 1, 10}.
• FedMix: The mixup weight λ used in FedMix is tuned in {0.01, 0.1, 0.2}; we construct 64 augmentation samples in each local step for RotatedMNIST, and 32 samples for CIFAR10 and CIFAR100.
• VHL: We use the same setting as in the original paper, with the weight of the augmentation classification loss α = 1.0, and use the "proxy_align_loss" provided by the authors for feature alignment. Virtual data is generated by an untrained style-GAN-v2; we sample 2,000 virtual data points for CIFAR10 and RotatedMNIST, and 20,000 for CIFAR100, following the default setting of the original work. To make a fair comparison, we sample 32 virtual samples in each local step for CIFAR10 and CIFAR100.
• FedDebias: We use a three-layer MLP as the projection layer; the MLP width is set to 256, and the output dimension is 128. By default, we set τ_1 = τ_2 = 2.0, the weight of the contrastive loss µ = 0.5, and the weight of AugMean λ = 1.0 on MNIST and CIFAR100, λ = 0.1 on CIFAR10 and PACS. We sample 64 pseudo-data in each local step for RotatedMNIST, and 32 samples for CIFAR10 and CIFAR100.

Feature correction when using proxy datasets to construct pseudo-data. When using proxy datasets to construct the pseudo-data, we additionally mix up local data with the pseudo-data so that the pseudo-data is not too far from the local distribution. However, the pseudo-data will then have a large overlap with the local data, and the term $\exp\big(\mathrm{sim}(P(\phi_i(x_p)), P(\phi_i(x))) / \tau_2\big)$ in Equation (5), which is used to maximize the distance between the local features of the local data and the pseudo-data, becomes meaningless. To address this issue, we change this term to

$\exp\!\Big(\frac{\mathrm{sim}\big(P(\phi_i(x_p) - \langle \tilde{y}_p, y \rangle \cdot \phi_i(x)), P(\phi_i(x))\big)}{\tau_2}\Big),$

where ỹ_p is the pseudo-label of x_p, and y is the one-hot label of the local data x. We can then remove the correlation between x and x_p caused by the mixup with local data.

Datasets and Models. For datasets, we choose RotatedMNIST, CIFAR10, CIFAR100, and PACS. For RotatedMNIST, CIFAR10, and CIFAR100, we split the datasets following the idea introduced in Yurochkin et al. (2019):
• RotatedMNIST: Each client's local images are rotated by an angle in {0, 15, 30, 45, 60, 75, 90, 105, 120, 135}.
• CIFAR10: We first split CIFAR10 by LDA using parameter α = 0.1 over N clients. Then, for each client, we sample q ∈ R^10 from Dir(1.0). For each image in the local data, we sample an angle in {0, 15, 30, 45, 60, 75, 90, 105, 120, 135} with probability q, and rotate the image by that angle.
• Clean CIFAR10: Unlike the previous setting, we do not rotate the samples in CIFAR10 (no inner-class non-iidness).
• CIFAR100: We split CIFAR100 by LDA using parameter α = 0.1, and transform the training data using RandomCrop, RandomHorizontalFlip, and normalization.
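Returning to the feature-correction term described above, a sketch of the corrected negative term might look as follows (identity projection and single feature vectors are assumed; all names are illustrative):

```python
import numpy as np

def corrected_negative_term(phi_i_xp, phi_i_x, y_p, y, P=lambda z: z, tau2=2.0):
    """Corrected negative term: before comparing, subtract
    <y_p, y> * phi_i(x) from phi_i(x_p) to remove the component of the
    pseudo-feature explained by the mixed-in local sample. P stands in
    for the projection layer (any callable here)."""
    corrected = phi_i_xp - float(np.dot(y_p, y)) * phi_i_x
    a, b = P(corrected), P(phi_i_x)
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return float(np.exp(sim / tau2))

# When pseudo-label and local label do not overlap (<y_p, y> = 0), the
# correction vanishes and the term matches the original Equation (5).
no_overlap = corrected_negative_term(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                                     np.array([0.0, 1.0]), np.array([1.0, 0.0]))
# When the pseudo-feature is fully explained by the local sample
# (<y_p, y> = 1 and phi_i(x_p) = phi_i(x)), the corrected similarity is zero.
full_overlap = corrected_negative_term(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                                       np.array([1.0, 0.0]), np.array([1.0, 0.0]))
```

The two cases illustrate the intent: the correction only kicks in to the extent that the pseudo-data actually contains the local sample, as measured by the label inner product.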
Each communication round includes 50 local iterations, with 1000 communication rounds for RotatedMNIST and CIFAR10, 800 communication rounds for CIFAR100, and 400 communication rounds for PACS. Notice that the number of communication rounds is carefully chosen: the accuracy of all algorithms does not significantly improve after the given communication rounds. The public data is chosen as RSM (Yoon et al., 2021b) by default, and we also provide results on other proxy datasets. We utilize a four-layer CNN for MNIST, VGG11 for CIFAR10 and PACS, and CCT (Hassani et al., 2021) (Compact Convolutional Transformer, cct_7_3x1_32_c100) for CIFAR100. For each algorithm and dataset, we employ SGD as the optimizer, and set the learning rate lr = 0.001 for MNIST and lr = 0.01 for CIFAR10, CIFAR100, and PACS. When using CCT and ResNet, we set momentum to 0.9. We set the same random seeds for all algorithms. We set the local batch size to 64 for RotatedMNIST, and 32 for CIFAR10, CIFAR100, and PACS.

B DETAILS OF AUGMENTATION DATA

We use the same data augmentation framework as FedMix, as shown in Algorithm 2. For each local dataset, we upload the mean of every M samples to the server. The constructed augmentation data is close to random noise: in Figure 6, we show randomly chosen samples from the augmentation dataset of CIFAR10.

C.1 RESULTS WITH ERROR BAR

In this section, we report the performance of our method FedDebias and other baselines with error bars to verify the performance gain of our proposed method. Table 5: Performance of algorithms with error bars. All examined algorithms use FedAvg as the backbone. We run 1000 communication rounds on RotatedMNIST and CIFAR10 for each algorithm. For each algorithm, we run three trials with different random seeds. For each trial, we report the mean of the maximum 5 test accuracies and the number of communication rounds to reach the threshold accuracy.

(Table 5 column headers: Algorithm, RotatedMNIST, CIFAR10.)

C.2 DOMAIN ROBUSTNESS

We report the worst-case accuracy of algorithms in Table 11 to show their domain robustness. We have the following findings: 1) FedDebias significantly outperforms other approaches, and its improvements over FedAvg are more pronounced here than for the mean accuracy in Table 1. This finding indicates that FedDebias helps learn domain-invariant features and improves robustness. 2) Under these settings, the DG baselines outperform FedAvg, demonstrating that DG algorithms help enhance domain robustness.

C.3 T-SNE AND CLASSCIFIER OUTPUT

Following the settings in Figure 2 and Figure 3, we investigate whether the two components of FedDebias help mitigate the proposed feature and classifier biases. Figure 8 shows the features after applying the second component of FedDebias, which implies that this component can significantly mitigate the proposed feature bias: 1) on the seen datasets, local features are close to global features; 2) on the unseen datasets, local features are far away from those of the seen datasets. Figure 9 shows the output of the local classifier on unseen classes after applying the first component of FedDebias. Notice that, compared with Figure 3, the output is more balanced.



Please refer to Section 3 for more justification of the existence of our observations. We provide the results after using FedDebias in Appendix C.3. As shown in Figure 5(b), the communication-efficient variant of FedDebias, i.e., only transferring pseudo-data at the beginning of the FL training, is on par with frequent pseudo-data synchronization. We assume a uniform distribution for the labels, and the pseudo-data do not belong to any particular class. The projection layer is not part of the feature extractor and is not used for classification, as shown in Figure 4. We give the results of CIFAR10 with ResNet18 in Table 6 of Appendix C.



Figure 1: Observation for learning bias: three pitfalls of FL on heterogeneous data with local updates. There are two clients in the figure (denoted by two colors), and each has two classes of data (red and blue points). Figure 1(a): Client 1's decision boundary cannot accurately classify data samples from client 2. Figure 1(b): The difference between features extracted by client 1's local feature extractor and the global feature extractor is substantially large. However, client 2's local features are close to client 1's, even for input data from different data distributions/clients.


Figure 2: Observation for biased local features on a shared t-SNE projection space. Local updates will cause: • Large difference in local and global features for the same input data. Colored points in sub-figures (a) & (b) denote the global and local features of data from X1, and the same color indicates data from the same class. Notice that even for data from the same class (same color), the global and local features are clustered into two distinct groups, implying a considerable distance between global and local features even for the same input data distribution. • High similarity of local features for different inputs. Notice from sub-figure (b) & (c) that X1 and X2 are two disjoint datasets (no data from the same class). However, the local features of X1 and X2 are clustered into the same group by t-SNE, indicating the relatively small distance between local features of different classes.

Example 3.2 (Observation for biased local classifiers). Figure 3 shows the output of the local model on data X2, where all data in X2 are incorrectly categorized into classes 0 to 4 of X1. This observation, i.e., that data from classes absent from the local dataset cannot be correctly classified by the local classifier, is what we refer to as "biased local classifiers". More precisely, Figure 3(a) shows the prediction result for one sample (class 8), and Figure 3(b) shows the predicted distribution over all samples in X2. Based on Example 3.1 and Example 3.2, we summarize our observations and introduce the formal definition of the "local learning bias" caused by local updates: Definition 3.3 (Local Learning Bias). We define the local learning bias below:

Figure 4: Overview of FedDebias. We illustrate how to calculate the three terms in Equations (2) and (3). We calculate the cross-entropy loss of the local data (x, y) and the pseudo-data (x_p, ỹ_p), and use the local features ϕ_i(x), ϕ_i(x_p) and the global feature ϕ_g(x_p) for the contrastive loss.

Yurochkin et al. (2019); Hsu et al. (2019); Reddi et al. (2021), where we leverage the Latent Dirichlet Allocation (LDA) to control the distribution drift with parameter α. The pseudo-data is chosen as RSM by default, and we also provide results on other types of pseudo-data (c.f. Figure 5(c)). By default, we generate one batch of pseudo-data (64 for MNIST and 32 for other datasets) in each round; we also investigate generating only one batch of pseudo-data at the beginning of training to reduce the communication cost (c.f. Figure 5(b), Figure 5(c)). We use the SGD optimizer (with momentum 0.9 for CCT), and set the learning rate to 0.001 for RotatedMNIST and 0.01 for other datasets. The local batch size is set to 64 for RotatedMNIST and 32 for other datasets (following the default setting in DomainBed (Gulrajani & Lopez-Paz, 2020)).

Figure 6: We show 20 augmentation samples of the CIFAR10 dataset here. Notice that the augmentation data is close to random noise and cannot be classified as any class.

Figure 12: We first train the global model on the whole dataset for 1 epoch, then report the local features after 10 local epochs.

FedProx (Li et al., 2018b) incorporates a proximal term into the local objective functions to reduce the gap between the local and global models. SCAFFOLD (Karimireddy et al., 2019) adopts a variance reduction method for the local updates, and Mime (Karimireddy et al., 2020) increases convergence speed by applying global momentum in the local updates.
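The proximal term used by FedProx can be written down directly. A minimal sketch (function name ours), assuming the standard form (µ/2)·||w_local − w_global||², with µ the weight tuned in {0.001, 0.01, 0.1} in our experiments:

```python
import numpy as np

def fedprox_penalty(local_params, global_params, mu=0.01):
    """Proximal regularizer added to the local loss in FedProx.

    Computes (mu / 2) * sum of squared differences over all parameter tensors,
    pulling the local model toward the last received global model.
    """
    return 0.5 * mu * sum(
        float(np.sum((w - wg) ** 2)) for w, wg in zip(local_params, global_params)
    )
```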

Require: Local datasets D1, . . . , DN; pseudo dataset Dp with |Dp| = B, where B is the batch size; number of local iterations K; number of communication rounds T; number of clients chosen in each round M; loss weights λ, µ; local learning rate η.
Ensure: Trained model ωT, θT, ϕT.
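The control flow implied by the Require/Ensure block above, one communication round of client sampling, K local steps, and FedAvg aggregation, can be sketched as follows. Models are plain floats here so the averaging logic is easy to check; `train_step` stands in for one local SGD step on the debiased objective, and all names are illustrative.

```python
import random

def feddebias_round(global_model, clients, pseudo_batch, M, K, train_step):
    """One communication round: sample M clients, run K local steps each,
    then average the local models (FedAvg aggregation)."""
    selected = random.sample(clients, M)
    updates = []
    for client in selected:
        model = global_model                   # start from the broadcast model
        for _ in range(K):
            model = train_step(model, client, pseudo_batch)
        updates.append(model)
    return sum(updates) / len(updates)          # uniform FedAvg average
```

Running T such rounds in a loop yields the trained model of the algorithm.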

Performance of algorithms. We split RotatedMNIST, CIFAR10, and CIFAR100 into 10 clients with α = 0.1, and ran 1000 communication rounds on RotatedMNIST and CIFAR10 for each algorithm, and 800 communication rounds on CIFAR100. We report the mean of the maximum (over rounds) 5 test accuracies and the number of communication rounds to reach the threshold accuracy.

Comparison with VHL. We split CIFAR10 and CIFAR100 into 10 clients with α = 0.1, and run 1000 communication rounds on CIFAR10 and 800 communication rounds on CIFAR100 for each algorithm. We report the mean of the maximum (over rounds) 5 test accuracies and the number of communication rounds to reach the threshold accuracy. We vary the number of virtual data points to check the performance of VHL, while pseudo-data are only transferred once in FedDebias (32 pseudo-data points). For CIFAR100, we choose Mixup as the backbone.

Performance of FedDebias on CIFAR10 with different numbers of clients. We split the CIFAR10 dataset into 10, 30, and 100 clients with α = 0.1. We run 1000 communication rounds for each algorithm on the VGG11 model, and report the mean of the maximum 5 accuracies (over rounds) on the test datasets.

We split the datasets following Yurochkin et al. (2019); Hsu et al. (2019); Reddi et al. (2021), where we leverage the Latent Dirichlet Allocation (LDA) to control the distribution drift with parameter α. A larger α indicates smaller non-iidness. For PACS, we divided each environment into two clients, with the first client containing data from classes 0-3 and the second client containing data from classes 4-6. Unless specially mentioned, we split RotatedMNIST, CIFAR10, and CIFAR100 into 10 clients and set α = 0.1. For PACS, we have 8 clients instead. Notice that for each client of CIFAR10, we utilize a special transformation, i.e., rotation of the local data, to simulate natural shift. In detail: • RotatedMNIST: We first split MNIST by LDA using parameter α = 0.1 to 10 clients, then for each client, we rotate the local data by an angle in {0, 15, 30, 45, 60, 75, 90, 105, 120, 135}.
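The LDA split described above can be sketched as follows: for each class, a Dirichlet(α) proportion vector decides how that class's samples are distributed over clients, so a small α (e.g., 0.1) yields highly skewed local label distributions. This is an illustrative implementation; the function name is ours.

```python
import numpy as np

def lda_split(labels, n_clients, alpha=0.1, seed=0):
    """Partition sample indices across clients with a per-class Dirichlet prior.

    For every class c, draw p ~ Dir(alpha, ..., alpha) over clients and cut
    that class's (shuffled) indices into chunks proportional to p.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        p = rng.dirichlet([alpha] * n_clients)
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, part in zip(clients, np.split(idx, cuts)):
            client.extend(part.tolist())
    return clients

# 100 samples from two classes, split over 4 clients with strong non-iidness.
parts = lda_split([0, 0, 1, 1] * 25, n_clients=4, alpha=0.1)
```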

Performance of algorithms on CIFAR10. We split the CIFAR10 dataset into 10 clients with α = 0.1, without additional rotation. For each algorithm, we run 1000 communication rounds on ResNet18 (with group normalization), and set the number of local steps to 50. Note that we set momentum to 0.9 for ResNet18.

Values of τ1 and τ2 in Component 2. In this paragraph, we investigate how the values of τ1 and τ2 affect the performance of the second component of FedDebias. In Table 7, we show the results on the RotatedMNIST dataset with different weights τ1 and τ2. Results show that: 1) Setting τ2 = 0, which only minimizes the distance between global and local features, already gives a significant performance gain compared with ERM; however, adding τ2 can further improve the performance. 2) The best weights on the RotatedMNIST dataset are τ1 = 2.0 and τ2 = 0.5.

Performance of Component 2 of FedDebias under different values of τ1, τ2. We run 1000 communication rounds on RotatedMNIST dataset. For each setting, we run three different trials with different random seeds. For each trial, we report the mean of maximum 5 accuracies for test datasets and the number of communication rounds to reach the threshold accuracy.

Performance of FedDebias under different values of τ1, τ2. We run 1000 communication rounds on the CIFAR10 dataset. For each setting, we run three different trials with different random seeds. For each trial, we report the mean of maximum 5 accuracies for test datasets and the number of communication rounds to reach the threshold accuracy.

Performance of component 1 under different weights. We run 1000 communication rounds on the CIFAR10 dataset. For each setting, we run three different trials with different random seeds. For each trial, we report the mean of maximum 5 accuracies for test datasets and the number of communication rounds to reach the threshold accuracy. We use λ as the weight of the first component of FedDebias.

Performance of algorithms. All examined algorithms use FedAvg as the backbone. We run 1000 communication rounds on RotatedMNIST and CIFAR10 for each algorithm, 800 communication rounds for CIFAR100, and 400 communication rounds for PACS. We report the mean of the maximum 5 test accuracies and the number of communication rounds to reach the final accuracy of ERM.

Worst Case Performance of algorithms. All examined algorithms use FedAvg as the backbone. We run 1000 communication rounds on RotatedMNIST and CIFAR10 for each algorithm, 800 rounds for CIFAR100, and 400 communication rounds for PACS. We calculate the worst accuracy for all clients in each round and report the mean of the top 5 worst accuracies for each method. Besides, we report the number of communication rounds to reach the final worst accuracy of FedAvg.

C.4 ILLUSTRATION OF MIN-MAX PROBLEM

In Figure 15, we illustrate the intuition behind the proposed min-max process. The projection layer is used to distinguish biased and unbiased features that cannot be distinguished well in the original feature space, and the min step learns unbiased local features that are close to the unbiased features in the projected space.
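The alternating logic of this min-max process can be shown on a scalar toy problem. This is only a conceptual sketch, not the paper's exact objective: the "projection layer" is a single scalar weight w, the max step enlarges the projected gap between the local and global feature so the projector discriminates them, and the min step shrinks that same gap by moving the local feature toward the global one. All names are ours.

```python
def projector(phi, w):
    """Toy linear projection layer P(phi) = w * phi (stand-in for the MLP)."""
    return w * phi

def minmax_round(phi_local, phi_global, w, lr=0.1):
    """One max step on the projector, then one min step on the local feature.

    Max step: gradient ascent on the squared projected gap w.r.t. w.
    Min step: gradient descent on the same gap w.r.t. phi_local.
    """
    gap = projector(phi_local, w) - projector(phi_global, w)
    # d(gap^2)/dw = 2 * gap * (phi_local - phi_global); ascend for the projector
    w = w + lr * 2 * gap * (phi_local - phi_global)
    gap = projector(phi_local, w) - projector(phi_global, w)
    # d(gap^2)/d(phi_local) = 2 * gap * w; descend for the feature
    phi_local = phi_local - lr * 2 * gap * w
    return phi_local, w

phi, w = 1.0, 1.0
for _ in range(5):
    phi, w = minmax_round(phi, 0.0, w)   # local feature is driven toward global (0.0)
```

Even as the projector grows more discriminative (w increases), the min step keeps pulling the local feature toward the global one, which is the debiasing effect the figure illustrates.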

