BENCHMARKING ALGORITHMS FOR FEDERATED DOMAIN GENERALIZATION

Abstract

In this paper, we present a unified platform to study domain generalization in the federated learning (FL) context and conduct extensive empirical evaluations of current state-of-the-art domain generalization algorithms adapted to FL. In particular, we perform a fair comparison of 11 existing algorithms, either centralized domain generalization algorithms adapted to the FL context or existing FL domain generalization algorithms, to comprehensively explore the challenges introduced by FL. These challenges include statistical heterogeneity among clients, the number of clients, the number of communication rounds, etc. The evaluations are conducted on five diverse datasets: PACS (image dataset covering photo, sketch, cartoon, and art painting domains), FEMNIST (image dataset containing handwritten digits and characters from more than 3500 users), iWildCam (image dataset with 323 domains), Py150 (natural language processing dataset with 8421 domains), and CivilComments (natural language processing dataset with 16 domains). The experiments show that the challenges brought by federated learning remain unsolved in realistic experimental settings. Furthermore, our code base supports fair and reproducible evaluation of new algorithms with little implementation overhead.

1. INTRODUCTION

Federated learning (FL) Konečnỳ et al. (2016) is a distributed machine learning approach that assumes each client or device owns a local dataset that cannot be exchanged or centrally collected because of privacy or communication constraints. Given this context, a natural paradigm for FL (e.g., FedAvg McMahan et al. (2017)) is to alternate between two stages: clients locally update the model based on their local datasets, and a central server aggregates the client models. Because the clients may be phones, network sensors, hospitals, or other local information sources, the local datasets are naturally heterogeneous across clients. Specifically, there are at least two types of realistic statistical data heterogeneity in the FL context. Client heterogeneity is the data heterogeneity between clients involved in training; e.g., hospitals may use different staining procedures or imaging equipment. Train-test heterogeneity is the data heterogeneity between the training and testing data; e.g., the performance on a new client that was not involved in training, or a natural shift in real-world test data due to changes over time, location, or context. Client heterogeneity has long been considered a statistical challenge since federated learning was introduced. FedAvg McMahan et al. (2017) experimentally showed that its method effectively mitigates some client heterogeneity. There are many other extensions based on the FedAvg framework tackling heterogeneity among clients in FL Hsieh et al. (2020); Li et al. (2020); Karimireddy et al. (2020). There is an alternative setup in FL, known as the personalized setting, which aims to learn personalized models for different clients to tackle heterogeneity. Numerous recent papers have proposed FL models and algorithms to accommodate personalization Smith et al. (2017); Chen et al. (2018); Hanzely et al. (2020); T Dinh et al. (2020); Deng et al. (2020); Acar et al. (2021).
However, these prior works only train the model on simple datasets such as MNIST, EMNIST, and CIFAR10, and the client heterogeneity is constructed mainly through class imbalance, which assumes that the ratio of data from each class differs across clients while the class-conditional distributions are homogeneous. Class imbalance is a special kind of heterogeneity called prior probability shift. In practice, due to differences between the locations of the local data collectors (cameras, sensors, etc.), real data heterogeneity is more complex than the simple class imbalance of many prior works. Furthermore, most prior FL works do not consider train-test heterogeneity but rather assume the train and test data are i.i.d., whereas the task of domain generalization (DG) explicitly targets train-test heterogeneity. However, existing centralized DG benchmarks do not consider the unique constraints of the FL context. In particular, they fail to provide insights into how client dataset heterogeneity, the number of clients, and the communication budget influence the generalization ability. From the FL side, to the best of our knowledge, there are currently only four published works: FedADG Zhang et al. (2021), FedDG Liu et al. (2021), FedSR Nguyen et al. (2022), and FedGMA Tenison et al. (2022). However, their evaluations have several limitations. FedSR and FedADG are sensitive to the number of clients, failing even on the simple dataset PACS when the number of clients is larger than 20 (see the massive-number-of-clients experiments in Sec. 4.3 for details). They also fail to show the network effect of FL; in particular, none of these works considers the influence of the number of clients on performance, and FedADG Zhang et al. (2021) does not consider the effect of the number of communication rounds. In summary, the current DG benchmarks fail to consider challenges unique to FL, and the few FL methods for DG have limited evaluations.
Therefore, more systematic evaluation is needed both to aid systematic progress at the intersection of FL and DG and to help answer many open questions. For example, how does client domain heterogeneity influence the performance of current algorithms? What is the performance of a direct translation of centralized DG algorithms (if applicable) to the FL context? How does the performance of current algorithms scale with the number of clients and communication rounds on complicated real-world datasets? Major contributions: This work addresses the above questions, and our contributions can be summarized as follows. 1) We propose a standardized definition of client domain heterogeneity that is unique to the FL context and interpolates between domain homogeneity and domain separation (see Figure 1) while limiting class imbalance. In particular, we develop an experimental setup to split dataset domain samples among any number of clients (see subsection 3.1). 2) We provide a fair comparison of multiple representative centralized DG methods adapted to the FL context as well as four prior works on federated domain generalization on five different benchmark datasets. 3) We also explore the impact on the generalization ability of client domain heterogeneity, the total number of clients, and communication rounds, which are unique to the FL context. 4) From these results, we identify significant generalization gaps between centralized domain generalization and domain generalization in the federated learning setting.

Algorithm 1: DomainSplit function, where w.l.o.g. the domains are assumed to be in descending order of size, i.e., $n_1 \ge n_2 \ge \cdots \ge n_M$.

Input: $K, n_1, \ldots, n_M$
if $K \le M$ then
    $\forall k,\ P_k \leftarrow \emptyset$
    for $m = 1, 2, \ldots, M$ do
        $k^* \in \arg\min_k \sum_{m' \in P_k} n_{m'}$
        $P_{k^*} \leftarrow P_{k^*} \cup \{m\}$
    end for
else if $K > M$ then
    $\forall k \in \{1, 2, \ldots, M\},\ P_k \leftarrow \{k\}$
    for $k = M + 1, \ldots, K$ do
        $m^* \in \arg\max_m \frac{n_m}{\sum_{k'=1}^{K} \mathbb{1}[m \in P_{k'}]}$
        $P_k \leftarrow \{m^*\}$
    end for
end if
Output: $\{P_k\}_{k=1}^K$

Implementation of domain heterogeneity

We now provide a concrete procedure for implementing domain heterogeneity for the benchmark (see Figure 1 for an illustration). Given the number of training samples for all $M$ domains, denoted by $\{n_m\}_{m=1}^M$ where $n_m$ is the number of samples for domain $m$, we first assign "primary" domains $P_k \subseteq \{1, 2, \ldots, M\}$ to each client via the domain split function defined in Algorithm 1, i.e., $\{P_k\}_{k=1}^K = \mathrm{DomainSplit}(K, \{n_m\}_{m=1}^M)$. Algorithm 1 carefully handles two cases: fewer clients than domains ($K \le M$) and more clients than domains ($K > M$). In the first case, the domains are sorted in descending order and iteratively assigned to the client $k^*$ that currently has the smallest number of training samples $\sum_{m' \in P_{k^*}} n_{m'}$. In this way, the algorithm outputs $\{P_k\}_{k=1}^K$ such that no client shares domains with the others while attempting to balance the total number of training samples between clients. In the case $K > M$, we first assign the domains one by one to the first $M$ clients. Then, starting from client $k = M + 1$, we iteratively split the largest domain $m^*$, where the samples are evenly split among all clients with $m^* \in P_k$. In this way, some clients may share one domain, but no client holds two domains simultaneously. Again, this attempts to balance the number of samples across clients as much as possible. After selecting the primary domains $P_k$ for each client, we define the training sample counts, denoted $n_{m,k}(\lambda)$, for domain $m$, client $k$, and domain balance parameter $\lambda \in [0, 1]$:
$$n_{m,k}(\lambda) = \lambda \frac{n_m}{K} + (1 - \lambda) \frac{\mathbb{1}[m \in P_k] \cdot n_m}{\sum_{k'=1}^{K} \mathbb{1}[m \in P_{k'}]},$$
where rounding to integers is carefully handled when not perfectly divisible and where $\mathbb{1}[\cdot]$ is an indicator function.
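As a concrete illustration, the two cases of Algorithm 1 can be sketched in Python as follows. This is a simplified reimplementation of the pseudocode above; the function and variable names are our own, and ties in the argmin/argmax are broken by lowest index.

```python
def domain_split(K, n):
    """Assign 'primary' domain sets P_k to K clients, given per-domain
    sample counts n[0..M-1] sorted in descending order (a sketch of
    Algorithm 1, DomainSplit)."""
    M = len(n)
    if K <= M:
        # Fewer clients than domains: greedily give each domain to the
        # client currently holding the fewest total training samples.
        P = [set() for _ in range(K)]
        for m in range(M):
            k_star = min(range(K), key=lambda k: sum(n[mp] for mp in P[k]))
            P[k_star].add(m)
    else:
        # More clients than domains: one domain per client first, then
        # each extra client shares the domain with the largest
        # per-holder sample count.
        P = [{k} for k in range(M)]
        for _ in range(M, K):
            holders = lambda m: sum(1 for Pk in P if m in Pk)
            m_star = max(range(M), key=lambda m: n[m] / holders(m))
            P.append({m_star})
    return P
```

For example, `domain_split(2, [10, 6, 5, 3])` assigns disjoint domain sets whose sample totals are as balanced as possible (13 vs. 11), while `domain_split(4, [10, 6])` makes pairs of clients share each domain.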
This is simply a convex combination between a uniform splitting of domains among clients (i.e., the $n_m/K$ term) and a splitting where each client has a disjoint set of domains (i.e., the $\mathbb{1}[m \in P_k] \cdot n_m / \sum_{k'=1}^{K} \mathbb{1}[m \in P_{k'}]$ term), unless $K > M$, in which case we try to split domains evenly based on the number of samples as defined in Algorithm 1. After defining $n_{m,k}(\lambda)$, we can denote the total number of training samples at client $k$ with domain balance parameter $\lambda$ as $n_k(\lambda) = \sum_{m=1}^{M} n_{m,k}(\lambda)$.
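The convex combination above can be sketched in a few lines (rounding to integers is ignored here, and the names are our own; `P` is a primary-domain assignment such as the one produced by Algorithm 1):

```python
def sample_counts(lam, n, P, K):
    """Per-client, per-domain sample counts n_{m,k}(lambda): a convex
    combination of a uniform split (n_m / K) and a primary-domain split
    (n_m shared evenly among the clients holding domain m)."""
    M = len(n)
    counts = [[0.0] * M for _ in range(K)]
    for m in range(M):
        holders = sum(1 for Pk in P if m in Pk)  # number of clients with m primary
        for k in range(K):
            uniform = n[m] / K
            separated = (n[m] / holders) if m in P[k] else 0.0
            counts[k][m] = lam * uniform + (1 - lam) * separated
    return counts
```

With two clients, two domains of sizes 8 and 4, and disjoint primary domains, `lam = 0` yields full domain separation, `lam = 1` yields a uniform split, and `lam = 0.5` interpolates between the two.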

3.2. FEDERATED DOMAIN GENERALIZATION

In the federated domain generalization problem, we are interested in collaboratively training a model across the clients to perform well on unseen test domains $D \in \mathcal{D}_{\text{test}}$, which differ from the training domains $1, \ldots, M$. Therefore, we focus on minimizing not the average loss on the source domains constructed from each client, $f_k(\theta) = \mathbb{E}_{(x,y) \sim D_k}[\ell((x, y); \theta)] \approx (1/n_k) \sum_{i=1}^{n_k} \ell((x_k^i, y_k^i); \theta)$, but the loss on the unseen test domains, in either an average or worst-case sense as defined below:
$$\min_\theta \; \mathbb{E}_{D \sim \mathcal{D}_{\text{test}}} \mathbb{E}_{(x,y) \sim D}\big[\ell((x, y); \theta)\big] \quad \text{or} \quad \min_\theta \; \sup_{D \in \mathcal{D}_{\text{test}}} \mathbb{E}_{(x,y) \sim D}\big[\ell((x, y); \theta)\big]. \qquad (2)$$
To solve equation 2 in the federated learning context, we need to consider client domain heterogeneity, as it can introduce unique challenges for domain generalization. In particular, local clients may not have access to all the domains, and their domains may only partially overlap; in the extreme case, the local data on one client may naturally form a unique domain. Therefore, whether the centralized generalization ability is achievable when local clients have heterogeneous domains remains unknown. Further, if some distributed algorithms can achieve the centralized generalization ability, as if the server had information from all the domains, the communication cost remains unclear. In addition, the impact of the total number of clients on the generalization ability remains unknown. Adapting Centralized DG Methods to the FL Setting. To adapt centralized methods, we simply run the centralized DG method at each client locally with its own local dataset (see subsection 3.1 for how the local datasets are created), and then compute an average of model parameters at each communication round (see next paragraph). This approach is straightforward for the homogeneous ($\lambda = 1$) and heterogeneous ($\lambda = 0.1$) settings where each client has data from all domains, albeit quite imbalanced for $\lambda = 0.1$. This can be seen as biased updates at each client based on biased local data.
Similarly, this approach can be implemented in the domain separation case if at least one client holds multiple non-overlapping domains (i.e., $\exists k, |P_k| \ge 2$, which happens if $K < M$). However, if all clients have only one primary domain, i.e., $\forall k, |P_k| = 1$, which happens if $K \ge M$, this simple approach cannot be extended to the domain separation setting ($\lambda = 0$) because centralized DG methods require data from at least two domains. In fact, these centralized DG methods degenerate to ERM if there is only one domain per client. Extending these methods to the case where all clients hold only one domain is an interesting direction for future work. Synchronization Schedule and Batch Creation. For all algorithms, we run $E$ epochs locally on each client and then the server computes a weighted average of the resulting models, i.e., $\theta^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \theta_k^{t+1}$. Because each epoch runs through the whole dataset, the $k$-th client runs $\sum_m n_{m,k}/B$ batches per epoch, where $B$ denotes the batch size. For FedAvg (ERM) and FedGMA, we uniformly sample a batch at random from the local dataset without considering the domain labels. For the FL adaptations of centralized algorithms, we use the sampling method from the WILDS benchmark Koh et al. (2021). Figure 2 plots the held-out test accuracy versus communication rounds on PACS with increasing domain heterogeneity. As seen from Figure 2: 1) most algorithms perform reasonably well in the homogeneous case ($\lambda = 1$), except FedADG and FedSR. In fact, those two fail in all three cases; they are sensitive to the number of clients and only work favorably when $K$ is small, e.g., $K = 2$ (see subsection 4.2 for a discussion of the effect of the number of clients). 2) As domain heterogeneity increases, i.e., as $\lambda$ decreases from 1 to 0.1, the algorithms consistently converge more slowly and have worse test accuracy, which demonstrates that domain heterogeneity among clients is a unique extra challenge introduced by FL.
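The synchronization step above, a standard FedAvg-style weighted average, can be sketched as follows. For illustration, each client's parameters are flattened into a plain list of floats (an assumption of this sketch; a real implementation would average each parameter tensor).

```python
def fedavg_aggregate(client_params, client_sizes):
    """Server-side weighted model average:
    theta^{t+1} = sum_k (n_k / n) * theta_k^{t+1},
    where n_k is client k's training sample count and n = sum_k n_k."""
    n = float(sum(client_sizes))
    dim = len(client_params[0])
    return [
        sum((n_k / n) * theta[i] for theta, n_k in zip(client_params, client_sizes))
        for i in range(dim)
    ]
```

For example, averaging two clients with sample counts 1 and 3 weights the second client's parameters three times as heavily.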
In particular, the centralized DG methods FISH, CORAL, IRM, and MMD extended to the FL setting perform poorly compared to ERM in the heterogeneous case, while GroupDRO outperforms ERM in both the homogeneous and heterogeneous cases. 3) In the domain separation case, because $K > M$ for PACS, each client locally holds only one training domain, and thus no centralized method is applicable. FedDG requires sharing the amplitude spectrum of images among local clients, which raises privacy concerns. Therefore, even for this dataset containing only four domains in total, prior works struggle to compete with ERM. For the domain separation case, given that $K < M$, we use Algorithm 1 to split the domains among the clients, so each client holds multiple non-overlapping domains and all of the methods are applicable on this dataset. We observe that as the domain balance parameter $\lambda$ decreases from 1 to 0, FedGMA and FedSR remain consistently comparable to ERM while the others fail. The in-domain and held-out-domain accuracies are reported in Appendix C. We compare the final test accuracy using held-out validation in Table 4a for Py150, Table 4b for CivilComments, and Table 3 for IWildCam. (Results using both validation criteria are summarized in Table 8, Table 9, and Table 10 in Appendix C.) The results show that ERM dominates the other algorithms on these three datasets. No algorithm achieves its centralized domain generalization ability, where centralized corresponds to training on the dataset gathered from all the training clients. We also plot accuracy versus communication rounds in Figure 7 for Py150, Figure 6 for CivilComments, and Figure 8 for IWildCam in Sec. C.

4.3. ADDITIONAL FL-SPECIFIC CHALLENGES FOR DOMAIN GENERALIZATION

Besides domain heterogeneity, we also investigate challenges brought by FL that are unique to the FL setting, including a massive number of clients and communication constraints. ii) Communication constraint: To show the effect of communication rounds on convergence, we plot the test accuracy versus communication rounds in Figure 5. We fix the number of clients $K = 100$ on PACS and decrease the number of communication rounds (together with increasing local epochs), that is, $C = (50, 10, 5)$ (with $E = (1, 5, 10)$). That is, if the regime restricts the communication budget, then we increase the local computation $E$ to keep the total computation the same. Therefore, the influence of communication on performance is compared fairly between algorithms because the total number of data passes is fixed. We observe that 1) the curves are relatively "flat" for most algorithms; this is predictable because we vary the number of passes over the data $E$ per round as $C$ changes, and the aggregation rule is the same, $\theta^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \theta_k^{t+1}$, where $n_k$ is the training sample size at client $k$. 2) In particular, ERM, GroupDRO, and FedDG achieve performance when the communication budget is low ($C = 10$) comparable to when the communication budget is high ($C = 50$), showing their communication efficiency in the FL context.

5. CONCLUSION AND DISCUSSION

This work evaluates multiple algorithms for solving the domain generalization task in the federated learning context. We evaluate the influence of client domain heterogeneity, the total number of clients, and communication rounds on domain generalization ability. We show that DG in the FL context is an unsolved problem that brings abundant new challenges. Specifically, the following aspects may be future directions of interest: 1) domain heterogeneity across clients dramatically impacts the generalization ability on the image datasets PACS and IWildCam as well as the NLP datasets Py150 and CivilComments; designing new FL algorithms that recover the centralized domain generalization ability remains open; 2) previous works (e.g., FedSR, FedADG) evaluate the generalization ability with only a small number of clients $K$, which is not enough; the massive-clients setting needs to be taken into consideration; 3) more realistic datasets need to be considered for domain generalization in the FL context; 4) for the domain separation case, few prior works are applicable when each client holds only one domain; new DG algorithms for the FL setting are required for this case. We list the gap table below summarizing the current DG algorithms' performance gap w.r.t. ERM in the FL context; positive means an algorithm outperforms ERM, negative means it is worse than ERM. It can be seen that on the simple dataset, the best DG method migrated from the centralized setting is better than ERM. In the domain separation case, no centralized DG algorithm can be adapted, and FedDG and FedADG perform comparably well in this setting. However, they fail on harder datasets. Federated DG algorithms that outperform ERM, support NLP datasets, and are free of data sharing are still needed.

A REPRODUCIBILITY STATEMENT

Code for reproducing the results is available at the anonymous link. We include detailed documentation on using our code to reproduce the results throughout the paper. We also provide documentation on adding a new algorithm's DG evaluation in the FL context.

B EXPERIMENT SETTINGS SUPPLEMENTARY

B.1 DATASETS AND MODELS

In this section, we introduce the datasets used in our experiments, the split method used to build heterogeneous training and testing datasets, and the heterogeneous local training datasets among clients in FL. Datasets. We use three datasets in this benchmark: PACS

B.2 HYPERPARAMETERS AND MODEL SELECTION

Hyperparameters. To make fair comparisons, we allocate the same budget during training for each algorithm on each dataset. The budget includes the number of allowed hyperparameter searches, the model architecture, local computation resources, and communication rounds. For each dataset, we fix the model architecture and initialization to be the same. We conduct eight hyperparameter searches for each algorithm and choose the set of hyperparameters that achieves the best performance. For all datasets and algorithms, we set 100% client participation in training during each communication round. For PACS, we fix the number of clients $K = 100$, with 50 communication rounds in total where each communication happens after one epoch of local training. The batch size is 64. We use the Adam optimizer with learning rate $1 \times 10^{-3}$, except for the FedSR algorithm, where we choose SGD with learning rate $0.002$, momentum $0.9$, and weight decay $5 \times 10^{-4}$. For Py150, we set $K = 100$, and the total number of communication rounds is 3 in the centralized case and 10 in the distributed case. The batch size is 96. We use the AdamW optimizer with learning rate $8 \times 10^{-5}$ and $\epsilon = 1 \times 10^{-8}$. For IWildCam, the number of clients is $K = 243$, with 12 communication rounds in the centralized case and 50 in the federated learning case; the batch size is 16. We use the Adam optimizer with learning rate $3 \times 10^{-5}$. For FEMNIST, the number of clients is $K = 100$, with 20 communication rounds in the centralized case and 40 in the federated learning case; the batch size is 64. We use the Adam optimizer with learning rate $1 \times 10^{-3}$. For CivilComments, the number of clients is $K = 100$, with 5 communication rounds in the centralized case and 10 in the federated learning case; the batch size is 16. We use the Adam optimizer with learning rate $1 \times 10^{-5}$.
For each dataset, we choose the hyperparameters starting from their default values, consistent with the choices in previous benchmarks: DomainBed Gulrajani & Lopez-Paz (2020),

C OTHER EVALUATION

We put our extra evaluations here for reference. The experiments are organized by dataset. C.1 PACS and FEMNIST We report the final test accuracy using two validation criteria in Table 6. It shows that the homogeneous case (λ = 1 column) may even slightly outperform its centralized domain generalization counterpart in the simple case (small numbers of clients and domains). This could come from the natural regularization brought by FL. In the domain separation case (λ = 0 column), although FedADG is not communication efficient as shown in Figure 2, it seems to be more robust across validation strategies, whereas FedDG performs poorly using in-domain validation. Overall, as expected, existing DG algorithms with enough communication rounds are able to perform reasonably well in this simple setting, where the dataset is simple and the numbers of clients and domains are small.



Figure1: Our benchmark simultaneously evaluates train-test heterogeneity (i.e., domain generalization) as seen in the top left and client heterogeneity across domains by splitting the domain data amongst clients. The client heterogeneity can be homogeneous (left), heterogeneous (center), or domain separated (right), where M is the number of domains, K is the number of clients, color denotes domain data, and λ is the domain balance parameter. The right three panels demonstrate the domain split for homogeneous, heterogeneous, and domain separation when K ≤ M.

DG) Blanchard et al. (2011) formalizes a special case of train-test heterogeneity in which the training algorithm has access to data from multiple source domains and the goal is to perform well on data from an unseen test domain. There is an active line of research on domain generalization in the centralized setting Muandet et al. (2013); Saito et al. (2018); Ganin & Lempitsky (2015); Long et al. (2015); Arjovsky et al. (2019); Sagawa et al. (2019); Shi et al. (2021); Li et al. (2018). Current benchmark papers Koh et al. (2021); Gulrajani & Lopez-Paz (2020) provide thorough comparisons between different algorithms for DG.

), namely each client uniformly samples two domains at random from its local dataset and then uniformly samples B/2 examples at random from each domain without replacement. Beyond simple model averaging, FedGMA Tenison et al. (2022) additionally adopts a global masking operation at the server for the update changes. We refer readers to Appendix B.2 for the detailed hyperparameters, including learning rate, batch size, and model selection.

4. MAIN RESULTS

We adapt six representative centralized DG methods to the FL context, including IRM Arjovsky et al. (2019), Fish Shi et al. (2021), MMD Li et al. (2018), Coral Sun & Saenko (2016), GroupDRO Sagawa et al. (2019), and Mixup Zhang et al. (2017), and compare them with FedDG Liu et al. (2021), FedADG Zhang et al. (2021), FedSR Nguyen et al. (2022), and FedGMA Tenison et al. (2022), which are natively designed for solving domain generalization tasks in federated learning.

4.1. BASELINE SETTING (PACS AND FEMNIST-62)

In the baseline setting, we consider three domain heterogeneity regimes on image classification tasks: PACS Li et al. (2017), widely used in domain generalization, and FEMNIST-62 (digits and characters) Cohen et al. (2017), a prototypical dataset in the FL context. PACS. We evaluate PACS with K = 100 clients, M = 2 training domains (cartoon, sketch), 1 validation domain (art painting), and 1 test domain (photo). The maximum number of communication rounds is set to 50 with 1 local epoch per communication. The domain balance parameter is varied over λ ∈ {1, 0.1, 0}. For the domain separation case λ = 0, each client locally has only one training domain (i.e., ∀k, |P_k| = 1); in this case, no centralized method is applicable (see subsection 3.2), so we only compare the four federated domain generalization methods to ERM.
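The batch-creation rule described here (pick two domains uniformly at random, then B/2 examples from each without replacement) can be sketched as follows. The dict-based layout of the local dataset and the function name are our own assumptions for illustration.

```python
import random

def sample_batch(local_data, B):
    """WILDS-style batch creation for FL-adapted centralized DG methods:
    uniformly pick two domains from the client's local data, then sample
    B/2 examples from each without replacement.
    local_data maps a domain label to its list of examples."""
    d1, d2 = random.sample(sorted(local_data), 2)  # two distinct domains
    return random.sample(local_data[d1], B // 2) \
         + random.sample(local_data[d2], B // 2)
```

Each resulting batch thus always contains examples from exactly two of the client's local domains, which is what pairwise DG penalties (e.g., domain-gap regularizers) require.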

Figure 2: Accuracy versus communication rounds for PACS; total clients and training domains (K, M ) = (100, 2); increasing domain heterogeneity from left to right: λ = (1, 0.1, 0).

Figure 4: PACS: Held-out DG test accuracy versus number of clients.

Massive number of clients: In this experiment, we explore the performance of different algorithms as the number of clients K increases on PACS. We fix the number of communication rounds C = 50 and the number of local epochs to 1 (synchronizing the models every epoch). Figure 4 plots the held-out DG test accuracy versus the number of clients for different levels of data heterogeneity. The following comments are in order. Given the communication budget, 1) current domain generalization methods all degrade substantially, particularly once K ≥ 50, while the performance of ERM and FedDG remains relatively unchanged as the number of clients increases. FedADG and FedSR are sensitive to the number of clients, and they both fail once K ≥ 20. 2) Even in the simplest homogeneous setting λ = 1, where each local client has i.i.d. training data, the current domain generalization methods IRM, FISH, Mixup, MMD, Coral, and GroupDRO perform poorly when the number of clients is large; this means new methods are needed for DG in the FL context when data are stored among a massive number of clients.

Figure 5: PACS: Held-out DG test accuracy vs. varying communication rounds (resp. varying epochs).

Figure 7: Accuracy versus communication rounds for Py150; total number of clients K = 100; increasing domain heterogeneity from left to right panel: λ = (1, 0.1, 0).

Their evaluations are restricted to the case when the number of clients is equal to the number of domains, which may be an unrealistic assumption (e.g., a hospital that has multiple imaging centers or a device that is used in multiple locations). For example, FedSR Nguyen et al. (2022) and FedADG Zhang et al. (2021) only evaluate the case where the number of clients equals the number of domains. However, we show in this paper that FedSR Nguyen et al. (2022) and FedADG Zhang et al. (2021) are sensitive to the number of clients.

Therefore it is closely related to domain adaptation in the FL context, which is different from the domain generalization setting considered in this benchmark, i.e., where the training and test domains do not overlap. Nguyen et al. (2022) proposed FedSR, which enables domain generalization while still respecting the decentralized and privacy-preserving nature of the FL context by enforcing an ℓ2-norm and a conditional mutual information regularizer on the representation. In Table 1, we give an overview of domain heterogeneity across two dimensions: 1) between training and testing datasets (i.e., standard vs. domain generalization task) and 2) among clients (i.e., domain imbalance between clients). While some work has considered the standard supervised learning task (left column), a new fair evaluation is needed to understand the behavior of domain generalization algorithms in the federated context, including the influence of data heterogeneity, communication budget, and the number of clients. Different Tasks with Domain Heterogeneity in the FL Context

Test accuracy on PACS dataset with held-out-domain validation; total client number K = 100; total training domain number M = 2. N.A. refers to not applicable.

See Appendix C.

For the homogeneous and heterogeneous cases, respectively, each client locally holds all the training domains. For the domain separation case, we use Algorithm 1 to split the domains among the clients. When K < M, each client locally holds non-overlapping domains; all of the centralized methods are applicable in this case. When K ≥ M, each client locally holds only one training domain; thus, no centralized method is applicable, and only the four natural federated domain generalization methods are applicable; we compare them with ERM. We evaluate Py150 with 100 clients, 5477 training domains, 261 validation domains, and 2471 test domains. Given that K < M, we compare all the methods except Mixup, FedDG, and FedADG in all three domain heterogeneity regimes. IWildCam: We evaluate IWildCam with 243 clients, 243 training domains, 32 validation domains, and 48 test domains. We compare all the methods in the homogeneous and heterogeneous regimes. Given that K = M, we can only compare FedDG,

Test accuracy with held-out-domain validation; total client number K = 100.

Gap Table: Current Progress in solving DG in FL context

Li et al. (2017), IWildCam Koh et al. (2021), and Py150 Koh et al. (2021). These three datasets cover different levels of difficulty as well as different types of tasks. PACS and IWildCam are both image classification datasets, and Py150 is a natural language processing (NLP) dataset. PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 images), Cartoon (2,344 images), and Sketch (3,929 images). This task requires learning the classification of a set of objects by training on totally different renditions. Py150 is a natural language processing dataset containing 150,000 Python source code files from 8,421 repositories. The goal is to predict the next token given the context of previous tokens. This is a real-world NLP dataset that contains multiple repositories which naturally form multiple domains. IWildCam contains wild animals captured by heat- or motion-activated static camera traps. Due to variation in camera model, position, color, background, and relative animal frequencies, the samples form multiple domains. It contains 203,029 images from 323 different camera traps, covering 182 different animal species. For the Py150 and IWildCam datasets, we follow the same split method as Wilds Koh et al. (2021). For PACS, we use cartoon and sketch as the training domains, art painting as the held-out-validation domain, and photo as the test domain. We use 90% of the data from the cartoon and sketch domains for training, about 5% for in-domain validation, and the other 5% for the in-domain test set. During sampling, we keep the class distribution the same among the training dataset, the in-domain validation dataset, and the in-domain test dataset. Models. For the image classification datasets PACS and IWildCam, we use the ResNet50 model He et al. (2016); for the NLP dataset Py150, we use OpenAI GPT-2 Radford et al. (2019).

Wilds Koh et al. (2021), or their values proposed in the original paper. Model Selection. In DG, model selection can significantly affect model performance. During training, we evaluate the aggregated model on the validation dataset after each communication round. After performing C communication rounds, we select the model that achieves the best performance on the validation dataset. This work adopts two model selection methods for the DG task: in-domain and held-out-domain model selection Gulrajani & Lopez-Paz (2020). The in-domain model selection method uses a validation dataset sampled i.i.d. from the same distribution as the training dataset. The held-out-domain model selection method uses a validation dataset that contains only examples from a set of domains that do not overlap with the training and testing domains.
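The per-round selection rule described here can be sketched as follows (the names are our own; `val_acc[t]` is the validation accuracy of the round-t aggregated model under the chosen criterion, in-domain or held-out-domain):

```python
def select_model(checkpoints, val_acc):
    """After C communication rounds, keep the aggregated model whose
    validation accuracy is highest; checkpoints[t] is the model saved
    after round t and val_acc[t] its validation accuracy."""
    best_round = max(range(len(checkpoints)), key=lambda t: val_acc[t])
    return checkpoints[best_round]
```

Only the criterion (which validation set produces `val_acc`) differs between the two model selection methods; the selection rule itself is identical.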



Test accuracy on IWildCam dataset with two model selection criteria: in-domain / held-out-domain validation; total client number K = 243.

Test accuracy on PACS dataset with two model selection criteria: in-domain / held-out-domain validation; total client number K = 100; total training domain number M = 2.

Test accuracy on PACS dataset with two model selection criteria: in-domain / held-out-domain validation; total client number K = 200; total training domain number M = 2.

Test accuracy on PACS dataset with two model selection criteria: in-domain / held-out-domain validation; total client number K = 200; total training domain number M = 2; Communication Rounds C = 5.

Test accuracy on PACS dataset with two model selection criteria: in-domain / held-out-domain validation; total client number K = 200; total training domain number M = 2; Communication Rounds C = 10.

TRAINING TIME, COMMUNICATION ROUNDS AND LOCAL COMPUTATION

In this section, we provide the training time per communication round in terms of wall-clock training time. Notice that for a fixed dataset, most algorithms have training time similar to ERM, while FedDG and FedADG are significantly more expensive.

Wall-clock training time per communication round (unit: s).


To observe the convergence of each algorithm, we plot Figure 7 for Py150 (resp. Figure 6 for CivilComments and Figure 8 for IWildCam). It shows that with realistic datasets as well as a non-trivial number of clients, all of the algorithms tend to be more sensitive to domain heterogeneity. Even in the heterogeneous case, where each client locally holds all the training domains, their generalization abilities on the unseen domains are much worse than their centralized counterparts, let alone in the even harder domain separation case. We also report accuracy using the in-domain and held-out-domain validation for Py150 in Table 8, CivilComments in Table 9, and IWildCam in Table 10.

C.3 ADDITIONAL FL-SPECIFIC CHALLENGES FOR DOMAIN GENERALIZATION

In this subsection, we report additional results related to Sec. 4.3, including Table 11, Table 12, Table 13, and Table 14 for test accuracy using the held-out and in-domain validation for PACS with K = 20, 50, 100, 200, respectively.

