CLIENT-AGNOSTIC LEARNING AND ZERO-SHOT ADAPTATION FOR FEDERATED DOMAIN GENERALIZATION

Abstract

Federated domain generalization (federated DG) aims to learn a client-agnostic global model from various distributed source domains and generalize the model to new clients in completely unseen domains. The main challenges of federated DG are the difficulty of building the global model from local client models of different domains while keeping data private, and low generalizability to test clients whose data distributions deviate from those of the training clients. To address these challenges, we present two strategies: (1) client-agnostic learning with mixed instance-global statistics and (2) zero-shot adaptation with estimated statistics. In client-agnostic learning, we first augment local features with the data distributions of other clients via the global statistics in the global model's batch normalization (BN) layers. This approach generates diverse domains by mixing local and global feature statistics while keeping data private. Local models then learn client-invariant representations by applying our client-agnostic objectives to the augmented data. Next, we propose a zero-shot adapter to help the learned global model directly bridge a large domain gap between seen and unseen clients. At inference time, the adapter mixes the instance statistics of a test input with the global statistics, which are vulnerable to distribution shift. With the aid of the adapter, the global model further improves generalizability by reflecting the test distribution. We comprehensively evaluate our methods on several benchmarks in federated DG.

1. INTRODUCTION

A huge amount of data is collected every second from a wide range of IoT devices, and these data have been utilized to build robust deep learning models. Federated learning (FL) has emerged as a promising paradigm to train a model while only indirectly accessing the distributed data, thereby reducing privacy leakage. Pioneering studies such as FedAvg (McMahan et al., 2017) and FedProx (Li et al., 2020) train each local model on its own data, keeping the data private, and transmit model parameters to the server to obtain a generalized global model. The parameters from local clients are aggregated in the server, and the server parameters are broadcast to the clients. This process is performed iteratively until the global model converges to a stationary point, and user privacy is ensured by sharing aggregated parameters, not the data itself, with other clients. In real-world scenarios, local data are collected from various domains across clients owing to the different characteristics of sensors and surrounding environments. For example, in autonomous driving tasks, each vehicle captures street views and infrastructure differently from others due to variations in camera sensors, region, and other factors. These local data deviate in terms of their distribution in feature space, inducing non-iid data across clients, known as domain shift (Li et al., 2021b; Jiang et al., 2021). Most studies have tried to solve the issues of FL on non-iid data, especially heterogeneous label distributions (Li et al., 2020; Karimireddy et al., 2020; Wang et al., 2020), but domain shift has not yet been fully explored in the literature. Domain shift also exists between training and test clients. After federated learning, the learned FL model is deployed to new customers outside the federation, e.g., new vehicles or medical centers, where the data distribution is shifted from those of the clients inside the federation.
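The FL cycle described above (broadcast, local training, weighted aggregation) can be sketched in a few lines. This is a minimal toy example, not the paper's implementation: `local_update` is a hypothetical stand-in for a client's training step (one gradient step of a linear model), and only parameters, never raw data, reach the server.

```python
import numpy as np

def local_update(params, data, lr=0.1):
    """Toy 'local training': one gradient step of a linear model on (X, y)."""
    X, y = data
    grad = X.T @ (X @ params - y) / len(y)  # MSE gradient
    return params - lr * grad

def fedavg_round(global_params, client_data):
    """One FedAvg round: broadcast, local training, weighted aggregation."""
    local_params, sizes = [], []
    for data in client_data:
        local_params.append(local_update(global_params.copy(), data))
        sizes.append(len(data[1]))
    weights = np.array(sizes) / sum(sizes)
    # The server aggregates parameters; raw data never leaves the clients.
    return sum(w * p for w, p in zip(weights, local_params))

rng = np.random.default_rng(0)
theta = np.zeros(3)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
for _ in range(5):  # a few communication rounds
    theta = fedavg_round(theta, clients)
```

In the real setting each client would run many SGD epochs on a deep network, but the communication pattern is the same.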
However, most works only focus on improving model performance for the clients participating in FL, while neglecting generalization to unseen clients. In this paper, we address federated domain generalization (federated DG), which aims to collaboratively learn a client-agnostic federated model from various distributed source domains and generalize the learned model to new clients in unseen domains, as illustrated in Fig. 1. Previous FL works (Li et al., 2021b; Andreux et al., 2020) also tried to solve and analyze the domain shift problem in federated learning, but they only focused on training personalized models inside the federation, not on building a generalized global model for new clients. To improve the generalizability of the FL model, federated DG has been explored by (Jiang et al., 2021; Wu & Gong, 2021; Liu et al., 2021; Yuan et al., 2021). However, these methods have limitations: a threat of privacy leakage from sharing private information with other clients (Wu & Gong, 2021; Liu et al., 2021), limited performance from focusing on aggregation rather than local training (Yuan et al., 2021), and the need for multiple target samples for test-time adaptation (Jiang et al., 2021). In the domain generalization literature for centralized learning, multi-source domain generalization (Gulrajani & Lopez-Paz, 2021) has been widely used to build a generalized model using multiple source domains. In federated learning, sharing data with other clients is strictly restricted to prevent serious privacy leakage, so these generalization methods are not applicable to federated DG. This paper presents two approaches: (1) client-agnostic learning with mixed instance-global statistics for local training and (2) zero-shot adaptation with estimated statistics for inference.
Our proposed method, named FedIG-A, allows local models to learn client-invariant representations from other clients' data distributions while preserving privacy, and lets the learned global model directly generalize to unseen domains. To this end, we adopt FedBN (Li et al., 2021b), which is designed to mitigate domain shift across clients. In FedBN, local clients use local batch normalization (BN) layers and keep them local to learn client-specific representations, while the remaining parts are aggregated in the server to learn client-invariant representations. However, it is difficult to explicitly train local models to learn client-invariant representations using only single-domain local data. To solve this issue, we propose novel client-agnostic learning with mixed statistics. In client-agnostic learning, we augment local features using the data distributions of other clients via aggregated BN statistics from the global model, i.e., global statistics. Our proposed augmentation randomly mixes instance-level and global feature statistics to produce diverse domain features. We then apply client-agnostic loss functions to learn client-invariant representations. Note that our method exploits global statistics that pose no additional privacy leakage beyond FedAvg. At inference time, we introduce a zero-shot adapter that helps the learned global model directly bridge a large domain gap between seen and unseen clients. We mix the instance statistics of a test input with the global statistics, which are vulnerable to distribution shift. The optimal interpolation values differ across test samples in each BN layer, so we design the adapter to estimate the interpolation value in an instance-wise manner. With the aid of the adapter, the global model further improves generalizability by reflecting the test distribution.
We conduct extensive experiments on several DG benchmarks in the image domain, including PACS (Li et al., 2017), VLCS (Fang et al., 2013), and OfficeHome (Venkateswara et al., 2017), in the federated setting, and show the effectiveness of our components.

2. RELATED WORKS

2.1 FEDERATED LEARNING

Federated learning (FL) has been extensively studied to train a global model using distributed datasets while ensuring user privacy and reducing communication overhead. Most recent FL approaches focus on the issues of non-iid data distribution over clients, especially heterogeneous label distributions (Li et al., 2020; Karimireddy et al., 2020; Wang et al., 2020). In particular, FedProx (Li et al., 2020) incorporates a proximal term into local loss functions to regularize the local model, reducing the gap with the global model. Only a few works (Li et al., 2021b; Andreux et al., 2020; Jiang et al., 2021) point out domain shift across different clients in FL. FedBN (Li et al., 2021b) and SiloBN (Andreux et al., 2020) keep BN statistics local without aggregating them in the server to mitigate domain shift, but both methods only focus on boosting the performance of clients inside the federation. TsmoBN (Jiang et al., 2021) instead adapts the global model to target clients at test time.

2.2 DOMAIN GENERALIZATION

Multi-source domain generalization (Li et al., 2019; Xu et al., 2021a; Seo et al., 2020; Zhao et al., 2021; Pandey et al., 2021; Nam & Kim, 2018; Chen et al., 2022; 2021; Lv et al., 2022) has been extensively explored to learn domain-invariant representations by minimizing domain discrepancy over multiple source domains. These methods could help build a client-agnostic model in federated DG. However, they require client data to be shared across clients, which leads to serious privacy issues in federated learning. Single-source domain generalization (Zhou et al., 2021; Li et al., 2021a; Xu et al., 2021b; Wang et al., 2021; Carlucci et al., 2019; Huang et al., 2020; Kim et al., 2021) has tried to learn a generalized model with single-source data. These algorithms can be applied to federated DG without privacy leakage.
They can improve the generalization ability through domain expansion or regularization, but the performance improvement is limited since they are designed to use an individual source domain and cannot fully exploit the advantage of federated learning. Recently, federated DG has been studied to treat the distributed multi-source domain setting. COPA (Wu & Gong, 2021) and FedDG (Liu et al., 2021) apply multi-source domain generalization methods (Li et al., 2019; Xu et al., 2021a) to the distributed setting, but they share private information across clients, posing a privacy threat.

Problem Formulation: Federated DG aims to learn a generalized global model $C_{\phi_G} \circ F_{\theta_G}: \mathcal{X} \rightarrow \mathcal{Y}$ by aggregating $K$ distributed clients' models $\{F_{\theta_k}, C_{\phi_k}\}_{k=1}^{K}$ trained on source data $\{D_k\}_{k=1}^{K}$, such that the global model generalizes to an unseen domain $D_t$.

Challenges: Domain shift over clients hinders obtaining a generalized global model, since local models easily over-fit to their own domains, causing large model divergence across clients (Li et al., 2021b). Even though data from various domains and a large amount of data are used through federated learning, domain shift in the distributed setting negatively affects the generalization ability both inside and outside the federation. Local models should learn domain-invariant representations so that the generalized global model can be collaboratively obtained from them.

FedBN: We adopt FedBN (Li et al., 2021b), where each client keeps its batch normalization (BN) statistics locally, denoted as the client-specific part, while the remaining parameters are aggregated in the server, denoted as the client-agnostic part. We can expect local BN statistics to learn client-specific characteristics and the client-agnostic part to learn client-invariant representations. In this paper, we use $\theta = \{\theta^a, \theta^s\}$ to indicate the client-agnostic and client-specific parts, respectively.
In the centralized setting, which accesses multiple source datasets together, using multiple BN layers forces the model to learn domain-specific and domain-invariant characteristics separately, finding the common knowledge across domains with ERM (Gulrajani & Lopez-Paz, 2021) as follows:

$\mathcal{L}_{CE}^{Centralized} = \sum_{k=1}^{K} \frac{1}{|D_k|} \sum_{i=1}^{n_k} CE(C_{\phi_G}(F_{\{\theta_G^a, \theta_k^s\}}(x_{i,k})), y_{i,k}),$

where $CE(\cdot, \cdot)$ is the cross-entropy loss. $\{\theta_G^a, \phi_G\}$ are shared across all domains, so they can learn domain-invariant characteristics, while the $k$-th domain-specific information is captured by the BN statistics $\theta_k^s$. In federated learning, unlike centralized learning, each local client only has single-domain data, so the model cannot explicitly learn domain-invariant representations. To solve this issue, we propose to use mixed instance-global statistics in BN layers to learn domain-invariant representations.
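The split between shared parameters and per-domain BN statistics can be sketched as follows. This is an illustrative toy, not the paper's code: `domain_bn` is a hypothetical helper that normalizes a domain-k batch with that domain's own statistics while the affine parameters are shared by all domains.

```python
import numpy as np

def domain_bn(x, stats_k, gamma, beta, eps=1e-5):
    """Normalize a domain-k batch with its own BN statistics (theta_k^s),
    while the affine parameters (part of the shared theta_G^a) are common."""
    mu, sigma = stats_k
    return gamma * (x - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
C = 4
gamma, beta = np.ones(C), np.zeros(C)          # shared across all domains
stats = {k: (rng.normal(size=C), 1.0 + np.abs(rng.normal(size=C)))
         for k in range(3)}                    # per-domain BN statistics

x1 = rng.normal(size=(8, C))                   # a batch from domain 1
h1 = domain_bn(x1, stats[1], gamma, beta)      # domain-specific normalization
```

During training, each domain's batch would be routed through its own statistics, and the cross-entropy losses summed over domains as in the equation above.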

3.2. CLIENT-AGNOSTIC LEARNING WITH MIXED INSTANCE-GLOBAL STATISTICS

In local training, the $k$-th local model is trained with the cross-entropy loss on the $k$-th dataset as follows:

$\mathcal{L}_{CE} = \frac{1}{|D_k|} \sum_{i=1}^{n_k} CE(C_{\phi_k}(F_{\{\theta_k^a, \theta_k^s\}}(x_{i,k})), y_{i,k}).$ (2)

The mixed instance-global statistics are defined as

$\mu_\Delta^l = u^l \mu_i^l + (1 - u^l)\mu_G^l \quad \text{and} \quad \sigma_\Delta^l = u^l \sigma_i^l + (1 - u^l)\sigma_G^l,$ (3)

where $\mu_i^l$ and $\sigma_i^l$ indicate the instance mean and standard deviation along the channel axis of the intermediate feature of the $i$-th input to the $l$-th BN layer, respectively. $u^l \in \mathbb{R}^{C_l}$ is an interpolation weight vector, where each element is independently sampled from the uniform distribution $U(0, 1)$ at each iteration, and $C_l$ is the feature dimension of the $l$-th BN layer. The feature normalized by instance statistics contains locally representative characteristics, and the feature normalized with global statistics is composed of global representations. By randomly interpolating these two statistics in all BN layers, we obtain more diverse data, fully utilizing the characteristics of local and global domains. In Fig. 2, we denote by $\hat{a}_{i,k}^l$ the intermediate feature normalized with MixIG, and the augmented feature $f_{i,\Delta}$ is obtained with $\{\mu_\Delta^l, \sigma_\Delta^l\}_{l=1}^{L}$. We train the local model using both $f_{i,k}$ and $f_{i,\Delta}$ in a client-agnostic way, which is described in the next section. While previous works (Zhou et al., 2021; Li et al., 2021a) augment features with random noise values or styles of batch samples, our method accesses the aggregated data distribution for safe and diverse augmentation, as in multi-source domain generalization. It is worth noting that our method uses global statistics in BN layers for data augmentation, which reduces the threat of privacy leakage, unlike Wu & Gong (2021); Liu et al. (2021).
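The MixIG normalization in Eq. (3) can be sketched for a single sample and a single BN layer. This is a minimal illustration with a hypothetical helper `mixig_normalize`; the actual method also applies the BN affine transform and operates inside every BN layer of the network.

```python
import numpy as np

def mixig_normalize(x, mu_g, sigma_g, rng, eps=1e-5):
    """MixIG sketch: normalize one sample's feature map with randomly
    mixed instance and global statistics. x has shape (C, H, W)."""
    c = x.shape[0]
    mu_i = x.mean(axis=(1, 2))                 # instance mean per channel
    sigma_i = x.std(axis=(1, 2))               # instance std per channel
    u = rng.uniform(0.0, 1.0, size=c)          # per-channel mixing weights
    mu_mix = u * mu_i + (1.0 - u) * mu_g       # Eq. (3), mean
    sigma_mix = u * sigma_i + (1.0 - u) * sigma_g  # Eq. (3), std
    return (x - mu_mix[:, None, None]) / (sigma_mix[:, None, None] + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))                 # toy feature map, C=8
# mu_g / sigma_g stand in for the global BN running statistics
a_hat = mixig_normalize(x, mu_g=np.zeros(8), sigma_g=np.ones(8), rng=rng)
```

Because `u` is resampled every iteration, the same input yields a different "domain style" each time, which is the source of augmentation diversity.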

Client-agnostic Learning Objectives:

We propose a client-agnostic feature loss as follows:

$\mathcal{L}_{CAFL} = \frac{1}{|D_k|} \sum_{i=1}^{n_k} \| f_{i,k} - f_{i,\Delta} \|_2^2.$ (4)

With this loss function, the local model can explicitly extract client-agnostic features by minimizing the distance between the original and augmented features. Here, we perturb the features on the client-specific part, i.e., the BN statistics, so the client-agnostic part explicitly learns client-invariant characteristics, helping to mitigate domain shift. In addition, we train the local classifier to classify features from other domains, forcing the classifier to be client-agnostic. To achieve this, the local classifier $C_{\phi_k}$ is trained with a client-agnostic classification loss as follows:

$\mathcal{L}_{CACL} = \frac{1}{|D_k|} \sum_{i=1}^{n_k} CE(C_{\phi_k}(f_{i,\Delta}), y_{i,k}).$ (5)

Our client-agnostic learning can be considered a regularization method that keeps local models from deviating largely from the global model. Unlike previous work (Li et al., 2020) that directly regularizes local weight parameters toward the global model, our proposed learning considers the importance of weight parameters for client-invariant representations with diverse domain data. The overall loss for local optimization is given in Eq. (6).

In the literature of domain generalization using multiple BN layers, prior works (Seo et al., 2020; Chen et al., 2022; Zhou et al., 2022) get the prediction from the selected BN that is most related to a test input. However, they are not applicable to the federated setting since a local client cannot access the local BN layers of other clients, i.e., only global BN layers are accessible due to user privacy. In addition, recent test-time adaptation works (Gong et al., 2022; You et al., 2021; Hu et al., 2021) use interpolated statistics between instance and learned statistics to reflect the test distribution, but their interpolation parameters are either manually fixed to suit the target domain or generated from rule-based functions containing sensitive hyper-parameters.
Here, we propose to dynamically generate instance-wise interpolation parameters for mixing instance and global statistics with a learning-based network.

$\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda_1 \cdot \mathcal{L}_{CACL} + \lambda_2 \cdot \mathcal{L}_{CAFL},$ (6)
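The client-agnostic objectives of Eqs. (4)-(6) can be sketched on toy tensors. This is an illustrative helper, not the paper's code: `client_agnostic_losses` is a hypothetical name, and the plain cross-entropy term on the original features is omitted for brevity.

```python
import numpy as np

def client_agnostic_losses(f_k, f_delta, logits_delta, y, lam1=0.1, lam2=4.0):
    """Sketch of the local objective: CAFL pulls original and MixIG-augmented
    features together (Eq. (4)); CACL classifies the augmented features
    (Eq. (5)); both are weighted into the total loss (Eq. (6))."""
    cafl = np.mean(np.sum((f_k - f_delta) ** 2, axis=1))     # feature loss
    # cross-entropy of the augmented features (softmax over logits)
    p = np.exp(logits_delta - logits_delta.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    cacl = -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))
    return lam1 * cacl + lam2 * cafl

rng = np.random.default_rng(0)
f_k = rng.normal(size=(16, 32))        # original features f_{i,k}
f_d = rng.normal(size=(16, 32))        # MixIG-augmented features f_{i,Δ}
logits = rng.normal(size=(16, 7))      # classifier outputs on f_{i,Δ}
y = rng.integers(0, 7, size=16)        # labels
loss = client_agnostic_losses(f_k, f_d, logits, y)
```

The default weights 0.1 and 4.0 follow the values reported in the experimental setup.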

Interpolated BN Statistics:

We utilize the statistics of the test input together with the global statistics as follows:

$\mu_t^l = \alpha^l \mu_i^l + (1 - \alpha^l)\mu_G^l \quad \text{and} \quad \sigma_t^l = \alpha^l \sigma_i^l + (1 - \alpha^l)\sigma_G^l,$ (7)

where $\mu_i^l$ and $\sigma_i^l$ indicate the instance mean and standard deviation of the input in the $l$-th BN layer. $\mu_t^l$ and $\sigma_t^l$ are used for normalizing the test input tensor. $\alpha^l$ is an interpolation parameter that adjusts the contribution of the instance statistics of the test sample. Ideally, the optimal $\alpha$ would be selected for each test domain or test input, but we cannot access the test domain. We propose a zero-shot adapter that is carefully designed to dynamically generate $\alpha$ for each input in both seen and unseen domains.

…


Figure 3: The zero-shot adapter takes the difference between instance and global statistics and generates estimated statistics by Eq. (7) and Eq. (9).

Design of Zero-shot Adapter: We design the zero-shot adapter $G_\varphi$ parameterized by $\varphi$, which aims to generate a proper $\alpha$ for the test sample. The zero-shot adapter is separately added to each BN layer in the feature extractor. We set the input of the adapter in the $l$-th BN layer as the channel-wise distance between instance and global statistics, $\{\mu_i^l - \mu_G^l; \sigma_i^l - \sigma_G^l\} \in \mathbb{R}^{2C_l}$, and the output is $\alpha^l$. With this design, the adapter estimates the statistics based on the distance between the input and global statistics in an instance-wise manner at test time.

Training Strategy for Zero-shot Adapter: In local training, we freeze the main model, i.e., $F_{\theta_k}$ and $C_{\phi_k}$, and train the adapter with the cross-entropy loss to classify the inputs as follows:

$\mathcal{L}_A = \frac{1}{|D_k|} \sum_{i=1}^{n_k} CE(C_{\phi_k}(F_{\{\theta_k^a, \theta_t^s\}}(x_{i,k})), y_{i,k}),$ (8)

where $\theta_t^s$ indicates the interpolated statistics described in Eq. (7), and $\alpha^l$ is generated from $G_{\varphi^l}$, as shown in Fig. 3. To prevent the zero-shot adapter from over-fitting to each local training dataset, we apply the reparameterization trick (Kingma & Welling, 2013). We generate $\alpha^l$ sampled from the Gaussian distribution reparameterized by the zero-shot adapter as follows:

$\alpha^l = T(\delta^l z^l + \epsilon^l), \quad \text{where} \quad \delta^l, \epsilon^l = G_{\varphi^l}(\{\mu_i^l - \mu_G^l; \sigma_i^l - \sigma_G^l\}),$ (9)

and $z^l$ is sampled from $\mathcal{N}(0, 1)$. $T(\cdot)$ is a clamp function that ensures $\alpha^l$ stays within the range $[0, 1]$. Here, the main network and the adapter are alternately trained with the loss functions in Eq. (6) and Eq. (8), respectively, freezing the other model.
The training procedure for the adapter does not affect the performance of the main network, since the purpose of the adapter is only to learn how to interpolate instance and global statistics to minimize the cross-entropy loss on the trained network, which simulates the test scenario on new clients. Local zero-shot adapters are also aggregated by FedAvg. Zero-shot Adapter at Inference Time: We set $\alpha^l$ to $\epsilon^l$, which is the mean of $\delta^l z^l + \epsilon^l$. The test input is normalized by the interpolated statistics using $\alpha^l$ at the $l$-th BN layer, and this process operates via one forward propagation, the same as the naive inference process. We analyze the inference cost in the experimental section. We denote our overall framework by FedIG-A.
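The adapter's forward pass can be sketched per BN layer. This is a toy stand-in, not the trained module: the linear head's weights are random placeholders, the class name is hypothetical, and the real adapter uses two fully connected layers. It shows the three ingredients described above: the statistics-distance input, the reparameterized sampling during training, and setting alpha to the mean (z = 0) at inference.

```python
import numpy as np

class ZeroShotAdapter:
    """Per-BN-layer adapter sketch: maps the channel-wise distance between
    instance and global statistics to (delta, eps); alpha is sampled via the
    reparameterization trick in training and set to eps (the mean) at test."""
    def __init__(self, c, rng):
        self.w = rng.normal(scale=0.01, size=(2 * c, 2))  # tiny linear head
        self.rng = rng

    def __call__(self, mu_i, mu_g, sigma_i, sigma_g, training=False):
        d = np.concatenate([mu_i - mu_g, sigma_i - sigma_g])  # adapter input
        delta, eps = d @ self.w                               # (δ, ε)
        z = self.rng.normal() if training else 0.0            # z = 0 → α = ε
        return float(np.clip(delta * z + eps, 0.0, 1.0))      # clamp T(·)

rng = np.random.default_rng(0)
adapter = ZeroShotAdapter(c=8, rng=rng)
# instance vs global statistics for one test input at one BN layer
alpha = adapter(np.zeros(8), np.zeros(8), np.ones(8), np.ones(8))
```

Since the adapter reads only a statistics difference, no extra forward pass is needed: alpha is produced during the single forward propagation, matching the inference-cost claim above.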

4.1. EXPERIMENTAL SETUP

Datasets and Settings: We conduct extensive experiments on three DG benchmarks: PACS (Li et al., 2017), VLCS (Fang et al., 2013), and OfficeHome (Venkateswara et al., 2017). PACS contains seven categories from the Photo (P), Art painting (A), Cartoon (C), and Sketch (S) domains, where domain shift is large across the four domains (Dou et al., 2019; Chen et al., 2022) compared to VLCS and OfficeHome (see analysis in A.4). VLCS consists of five categories collected from VOC2007 (V) (Everingham et al., 2010), LabelMe (L) (Russell et al., 2008), Caltech-101 (C) (Fei-Fei et al., 2004), and SUN (S) (Xiao et al., 2010), where domain shift stems from the type of camera. OfficeHome (Venkateswara et al., 2017) contains 65 categories collected from four domains, Artistic (A), Clipart (C), Product (P), and Real world (R), and the domain shift problem is not as severe as in the other two datasets (Du et al., 2022; Wu & Gong, 2021). In federated DG, each client has single-domain data, and there are four clients in total on each DG benchmark. The global model is collaboratively learned with three clients, and the learned model is evaluated on the remaining client.

Implementation Details: In the federated learning process, all clients use the same architecture and hyper-parameter settings. We use ResNet-18 (He et al., 2016) pretrained on ImageNet as the backbone network. For FedIG, all batch normalization (BN) layers are replaced with two BN layers, i.e., local and global BN layers. We add two fully connected layers in front of each BN layer for the zero-shot adapter. We train the network using SGD with a momentum of 0.5, a fixed learning rate of 0.01, and a batch size of 64 for 200 iterations in each round. A total of 40 rounds is conducted, following the federated DG setting (Yuan et al., 2021). We set λ1 and λ2 to 0.1 and 4.0 in Eq. (6), respectively. Following Li et al. (2019), we only update the classifier by Eq. (5).
We conduct ablation studies on the balancing parameters in A.1. IABN (Gong et al., 2022) calibrates learned statistics with instance statistics using a rule-based function when the difference between learned and instance statistics is large. IABN has sensitive hyper-parameters, and these parameters must be properly selected for each domain to consistently improve the performance on all domains. Our learning-based zero-shot adapter outperforms IABN on all domains without any sensitive hyper-parameters. MixNorm (Hu et al., 2021) augments test input data with various spatial augmentations; here we use the original data and four augmented copies, estimating the test distribution more accurately. It slightly boosts the performance, but the inference time and memory usage increase. We also compare the zero-shot adapter with naive approaches, random and fixed alpha (You et al., 2021). Compared to these two results, our zero-shot adapter generates more suitable alpha values for each BN layer in an instance-wise manner. This shows that layer-wise and instance-wise generation is more effective than using a fixed alpha. Computational Overhead Analysis: Table 4 shows the computational cost of each method. We compare with (3) augmentation-based DG methods: MixStyle (Zhou et al., 2021), SFA (Li et al., 2021a), RandConv (Xu et al., 2021b), and L2D (Wang et al., 2021); (4) regularization-based DG methods: JiGen (Carlucci et al., 2019), RSC (Huang et al., 2020), and SelfReg (Kim et al., 2021); and (5) federated DG methods: COPA (Wu & Gong, 2021), FedDG (Liu et al., 2021), and CSAC (Yuan et al., 2021). Performance Analysis: We first analyze the results on PACS, which has severe distribution shift between domains. Decentralized methods without DG achieve low performance compared to the other paradigms. FedProx regularizes local models not to deviate from the global model, but it cannot solve domain shift. FedBN mitigates domain shift with local BN layers, and it can improve the performance on unseen clients.
However, the performance gain is limited since they do not explicitly consider learning client-invariant representations. Decentralized methods with DG show performance improvement on several domains. In particular, augmentation-based DG methods achieve significant accuracy improvement, but the performance on a few domains degrades considerably. Since augmentation-based DG methods learn domain-invariant representations only within a single domain, they are not effective when augmentation on the source domain does not cover the test distribution. JiGen, RSC, and SelfReg consistently improve or maintain the performance on the four domains compared to FedAvg, but the improvement is marginal. COPA and FedDG allow the model to learn client-invariant representations by utilizing data information from other clients, and they obtain the highest performance on several domains among the competitive methods. However, they cause serious privacy issues. Our methods, FedIG and FedIG-A, achieve state-of-the-art performance without sharing private information across clients. FedIG outperforms all competitive methods except COPA by a large margin on most domains. With zero-shot adaptation, we boost the performance further without significantly increasing the computational cost. On VLCS and OfficeHome, domain shift is relatively small compared to PACS. FedIG and FedIG-A consistently improve the performance on almost all domains compared to the other methods, indicating that our methods can be safely applied to any domain (see more results in A.5). We show the results on cross-silo FL in A.6. Moreover, we analyze the performance on clients inside the federation in A.7.

REPRODUCIBILITY STATEMENT

Our experimental evaluation is conducted with the publicly available DomainBed (Gulrajani & Lopez-Paz, 2021) and CSAC (Yuan et al., 2021) codebases. We provide the data pre-processing and hyper-parameter settings in Section 4.1 and pseudo-code in A.8. Together with the references to related works and publicly available code, our paper contains sufficient detail to ensure reproducibility.

A APPENDIX

A.1 PERFORMANCE DEPENDENCY OF HYPER-PARAMETERS IN OBJECTIVES

In Eq. (6), there are two hyper-parameters, λ1 and λ2, and we analyze the performance dependency on these parameters. This is an important ablation study in DG, where the model should generalize well to several domains without depending sensitively on these parameters. In Fig. 5 and 6, we conduct experiments varying each parameter while fixing the other (λ1 = 0.1 and λ2 = 4.0). Although the optimal hyper-parameters differ for each domain, choosing λ1 in [0.1, 0.5] and λ2 in [2.0, 5.0] shows consistent results on all domains. In this range, FedIG-A consistently achieves high performance on all PACS domains compared to the competitive methods (see Table 5). We could train and evaluate models with the optimal hyper-parameters for each domain, but we set λ1 and λ2 to 0.1 and 4.0 for all datasets in our experiments.

In the server, the best round model is selected when the average validation performance on seen clients is maximized over the 40 rounds (validated every round). The server can obtain the validation performance from clients using the aggregated server parameters, so it is possible to use the average validation performance on seen clients. It is practical and effective to use single-source DG validation for local training and multi-source DG validation in the server.

A.3 MORE STUDIES ON MIXIG

In Table 6, we conduct experiments using different ranges of the uniform distribution for mixing instance and global statistics in Eq. (3). The optimal range of augmentation is different for each test domain. In other words, the desired strength or type of augmentation differs for each domain, as shown in Table 5. Unlike augmentation-based DG methods, our method interpolates instance and global information. The performance is consistently improved in all domains due to safe and diverse augmentation using features normalized with instance and global statistics. With extrapolation U(-0.1, 1.1), the performance on the S domain, which deviates severely from the P, A, and C domains, is largely improved. Using extrapolated statistics generates more diverse samples than interpolated statistics, but it induces a performance drop on domains that are not severely different.

At test time, we use the alpha values in Eq. (7) from the trained zero-shot adapter for inference. We plot the alpha values at each layer from all test samples in PACS, VLCS, and OfficeHome in Fig. 7, 8, and 9. In PACS, alpha values in the P, A, and C domains have similar distributions at each layer. In the low-level layers, different alpha values between 0.1 and 0.6 are used for test samples, i.e., different amounts of instance statistics, and in the high-level layers, alpha values are within 0.0 and 0.6. In the S domain, alpha values at the low-level layers are lower than in the other domains. Since there is a large domain gap between S and the other domains, the distribution of alpha values is different. Interestingly, alpha values in the middle-level layers are almost the same across test samples; e.g., almost all test samples get 0.2 at the 9-th BN layer in the P domain. Alpha values in VLCS are all between 0.0 and 0.2, showing that global statistics work well without instance statistics because the domain shift in VLCS is not large compared to PACS.
Similarly, alpha values in OfficeHome are smaller than those in PACS. The distribution of alpha values is similar across domains because the domain shift is not large. The degree of shift between domains can thus be inferred from the distribution of alpha values generated by the zero-shot adapter.

Experimental Results: In Tables 9, 10, and 11, we compare our FedIG-A with FedProx (Li et al., 2020) and FedBN (Li et al., 2021b). Our method works well when the number of clients is large. The performance is slightly degraded under non-iid label distribution compared to iid label distribution, but we expect this negative effect to be alleviated when a method for non-iid label distribution is added. In the case of severe non-iid label distribution, FedIG-A significantly improves the performance compared to the baselines, as shown in Table 11. It demonstrates that our method effectively solves the domain shift problem even when severe label shift exists.

We measure the performance on clients inside the federation, i.e., the personalized performance, in Table 12. In PACS, with its large domain shift across training clients, the model with local statistics, i.e., the client-specific part, achieves better performance than the model with global statistics, i.e., the global model. Global statistics reflect the data distribution of all clients, but they cause performance degradation on each test domain whose data distribution is shifted from the other clients. Our proposed zero-shot adapter can reduce the performance gap between using local and global statistics with the aid of the instance statistics of the test input. In the case of VLCS and OfficeHome, global statistics help the model generalize well to each test domain. Since the distribution shift across clients is not large, global statistics represent a more general distribution than local statistics, which improves the model performance considerably.
The zero-shot adapter also helps the model to generalize well on seen domains in VLCS and OfficeHome. 



Figure 1: Federated domain generalization: each client has data belonging to a single domain that is different from other clients' domains, and the learned global model is deployed to new clients.

Before starting local training, the client receives the global model parameters $\{\theta_G^a, \phi_G\}$ and initializes the local model $\{\theta_k^a, \phi_k\}$ with the global parameters, while $\theta_k^s$ remains local. Then, $F_{\{\theta_k^a, \theta_k^s\}}$ and $C_{\phi_k}$ are trained on local data for long epochs. Although $\{\theta_k^a, \phi_k\}$ is initialized with the generalized global model, there is no way to learn client-invariant representations using only single-domain data with the cross-entropy loss, because the direct use of other clients' data leads to privacy issues. To mitigate this issue, we propose to generate diverse domain features using the statistics in the BN layers of the global model. Mixed Instance and Global Statistics (MixIG): Global statistics, which are aggregated from local BN statistics in the server, reflect the training data distribution over clients. We exploit this property for data augmentation in local training, as illustrated in Fig. 2. In naive local training, an input tensor $a_{i,k}^l$ is normalized with the statistics of the batch samples in the $l$-th BN layer, and a local feature $f_{i,k}$ is always calculated with statistics from single-domain local data. This makes the local model learn representations only within a single domain. Here, we propose to normalize inputs with mixed statistics, exploiting global statistics beyond local statistics. We mix the mean and standard deviation of each sample with the global statistics $\{\mu_G, \sigma_G\}$, i.e., the running mean and standard deviation, as follows:

where λ1 and λ2 are balancing parameters. After local training for long epochs, the model parameters are aggregated by FedAvg (McMahan et al., 2017) in the server, i.e., $\theta_G = \sum_{k=1}^{K} \frac{n_k}{n} \theta_k$ and $\phi_G = \sum_{k=1}^{K} \frac{n_k}{n} \phi_k$. We denote by FedIG our federated model trained by Eq. (6).

3.3 ZERO-SHOT ADAPTATION

At inference time, the global model, $F_{\{\theta_G^a, \theta_G^s\}}$ and $C_{\phi_G}$, is deployed to unseen clients. It cannot generalize well to completely unseen domains, where the data distribution is shifted from the training distribution. In the literature of domain generalization using multiple BN layers, Seo et al. (2020) uses ensemble predictions from multiple BN layers, and Chen et al. (2022); Zhou et al. (2022) select the prediction from the BN branch most related to a test input.
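The server-side aggregation above can be sketched per parameter tensor. This is a toy illustration with a hypothetical `aggregate` helper; under FedBN, the client-specific BN statistics would simply be excluded from the aggregated dictionary.

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """Weighted FedAvg aggregation: theta_G = sum_k (n_k / n) * theta_k,
    applied independently to every parameter tensor in the dict."""
    n = float(sum(client_sizes))
    keys = client_params[0].keys()
    return {k: sum((sz / n) * p[k] for p, sz in zip(client_params, client_sizes))
            for k in keys}

# two clients with dataset sizes 10 and 30 → weights 0.25 and 0.75
params = [{"w": np.full(3, 1.0)}, {"w": np.full(3, 3.0)}]
theta_g = aggregate(params, client_sizes=[10, 30])
# theta_g["w"] is 0.25 * 1.0 + 0.75 * 3.0 = 2.5 per element
```

Weighting by dataset size $n_k/n$ means clients with more data pull the global parameters more strongly toward their local solution.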

Figure 4: (a) Alpha values on each layer from all test samples, and (b) t-SNE of FedIG and FedIG-A in the unseen A domain, where features are normalized by global (left) and estimated (right) statistics.

Figure 5: Ablation studies of the client-agnostic classification loss on PACS. We conduct experiments with various values of λ_1 (x-axis) and report accuracy (y-axis).

Figure 7: Alpha values on each layer from test samples in P (upper left), A (upper right), C (lower left), and S (lower right).

Figure 8: Alpha values on each layer from test samples in V (upper left), L (upper right), C (lower left), and S (lower right).

Figure 9: Alpha values on each layer from test samples in A (upper left), C (upper right), P (lower left), and R (lower right).

Figure 10: We plot the data distribution of clients using (a) an iid data partition, (b) a non-iid data partition following a Dirichlet distribution with α = 0.5, and (c) a non-iid data partition following a Dirichlet distribution with α = 0.1, when A, C, and S are source domains and P is the target. The color bar denotes the number of data samples; the x-axis indicates client ID and the y-axis indicates class ID. Each rectangle represents the number of data samples of a specific class in a client. In this setup, 30 clients from the A, C, and S domains participate in FL, and we test the FL model on the P domain.

tackles the limited generalization ability of previous works. They propose updating batch normalization statistics at test time to adapt the global model to target clients, but this requires a large amount of target-domain data. Moreover, these methods do not deal with building a client-agnostic global model, i.e., they train local clients without domain generalization algorithms. We go beyond these approaches by building a client-agnostic global model that mitigates domain shift and directly generalizes to new target clients, i.e., zero-shot adaptation.
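As a rough sketch of zero-shot adaptation at a single BN layer: instance statistics of the test input are interpolated with the global running statistics before standardization. The function name and the scalar `alpha` (a per-layer mixing coefficient; its estimation rule is not shown in this excerpt) are our illustrative assumptions:

```python
import numpy as np

def zero_shot_adapt(a, mu_g, sigma_g, alpha, eps=1e-5):
    """Standardize one test sample's feature map (C, H, W) with estimated
    statistics: an alpha-interpolation of instance and global BN statistics."""
    mu_i = a.mean(axis=(1, 2))          # instance mean per channel
    sigma_i = a.std(axis=(1, 2))        # instance std per channel
    mu_est = alpha * mu_i + (1 - alpha) * mu_g
    sigma_est = alpha * sigma_i + (1 - alpha) * sigma_g
    return (a - mu_est[:, None, None]) / (sigma_est[:, None, None] + eps)
```

With alpha = 0 this reduces to standard BN inference with global statistics; larger alpha leans on the test instance, which matters when the target distribution deviates strongly from the training domains.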

Let X and Y denote the input space and the label space, respectively. The k-th client has single-domain data D_k = {(x_{i,k}, y_{i,k})}_{i=1}^{n_k}, and each of {D_1, ..., D_K} is drawn from a distribution (X_k, Y) different from those of other clients. D_t indicates the target test-domain data from a new client outside the federation, whose distribution (X_t, Y) is shifted from that of the training data. F_θ is the feature extractor parameterized by θ, and C_ϕ is the classifier parameterized by ϕ. Federated DG aims to learn a generalized global model F_{θ_G} and C_{ϕ_G} that performs well on the unseen target domain D_t.
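Under this notation, the goal can be written compactly (our reconstruction of the objective, not the paper's exact equation): the global model is trained using only the source clients' data yet should minimize risk on the unseen target domain:

```latex
\min_{\theta_G,\, \phi_G}\;
\mathbb{E}_{(x,y)\sim \mathcal{D}_t}
\Big[\ell\big(C_{\phi_G}(F_{\theta_G}(x)),\, y\big)\Big]
\quad \text{with access only to } \mathcal{D}_1, \dots, \mathcal{D}_K \text{ during training,}
```

where ℓ is the classification loss (e.g., cross-entropy).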

The local feature f_{i,k} is extracted with batch statistics (Local BN), and the augmented feature f_{i,Δ} is extracted with mixed instance and global statistics (MixIG). MixIG is constructed by randomly interpolating instance and global statistics, and an intermediate feature â^l_{i,k} is standardized by MixIG (right) so that it can cover various feature distributions. This operation is repeated at every BN layer. With both local and augmented features, local models learn client-invariant representations by our client-agnostic objectives.
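A minimal numpy sketch of MixIG at one BN layer, under the assumption that the mixed statistics directly replace the batch statistics in standardization (the function name and tensor layout are ours):

```python
import numpy as np

def mixig_normalize(a, mu_g, sigma_g, lam1, lam2, eps=1e-5):
    """Standardize one sample's feature map (C, H, W) with mixed
    instance-global statistics (MixIG)."""
    mu_i = a.mean(axis=(1, 2))            # per-channel instance mean
    sigma_i = a.std(axis=(1, 2))          # per-channel instance std
    mu_mix = lam1 * mu_i + (1 - lam1) * mu_g            # mixed mean
    sigma_mix = lam2 * sigma_i + (1 - lam2) * sigma_g   # mixed std
    return (a - mu_mix[:, None, None]) / (sigma_mix[:, None, None] + eps)

# lam1 and lam2 would be sampled randomly per layer during local training,
# e.g., lam = np.random.uniform(0.0, 1.0).
```

Setting lam1 = lam2 = 1 recovers instance normalization, while lam1 = lam2 = 0 standardizes with the global running statistics; sampling in between yields features that look as if drawn from other clients' distributions.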

Variants of MixIG on PACS.

Ablation studies of client-agnostic learning objectives on PACS.

Comparison studies to show the effectiveness of our components on PACS.

We follow the standard DG evaluation protocols in DomainBed (Gulrajani & Lopez-Paz, 2021), including dataset splits, image augmentations, and the model selection strategy. For federated DG, we set the model selection strategy as single-source DG validation on local clients and multi-source DG validation on the server. We give the details in A.2. In this protocol, we reproduce all competitive methods for a fair comparison in the federated setting. All experiments report the average accuracy and standard deviation over four runs with different random seeds.

Since global and instance statistics are randomly mixed at each BN layer, the model can effectively learn representations within the local client, across clients, and in other clients' domains. We also analyze the effect of the distribution range in A.3.

Effectiveness of Client-agnostic Learning: We apply two loss functions, the client-agnostic classification loss (CACL) and the client-agnostic feature loss (CAFL), on augmented features. CACL builds a client-agnostic classifier, and CAFL forces the feature extractor to learn domain-invariant information by strictly minimizing the distance between two features. In Table 2, we show that each loss function is effective, and that CACL and CAFL operate complementarily to each other. MixStyle only accesses single-domain data in local training while MixIG exploits global statistics, which shows that feature augmentation with global statistics makes the model more robust across domains.

Comparison Studies of Inference Methods: In Table 3-(c), we compare our zero-shot adapter with various inference algorithms. First, we compare with BIN (Nam & Kim, 2018), denoted by learned alpha, which replaces all BN layers in the backbone network with a weighted summation of BN and IN layers. In training, the learnable interpolation parameters for BN and IN are optimized, and the learned parameters are used for test samples.
Since the interpolation parameters are fitted to the training datasets, they cannot generalize well to unseen domains. Next, we adapt various inference algorithms to the model trained with FedIG for a fair comparison. We obtain ensemble predictions from multiple local BN layers. Learned multiple BN statistics can partially reflect the distribution of unseen domains, but performance on domains that deviate severely from the training domains is degraded. Furthermore, this method raises privacy issues because multiple local statistics are shared. IABN

Comparison of computational cost. Acc. denotes the average accuracy on PACS.

Classification accuracy comparison results on PACS, VLCS, and OfficeHome. Gray color indicates methods posing privacy issues. In Table 5, we compare our FedIG and FedIG-A with representative methods of FL and DG. (1) Federated learning: FedAvg (McMahan et al., 2017), FedProx (Li et al.

We presented client-agnostic learning with mixed instance-global statistics for local training and zero-shot adaptation with estimated statistics for inference. Our mixed instance-global statistics generate diverse domain features, helping local clients learn client-invariant representations while ensuring user privacy. In addition, our proposed zero-shot adapter directly bridges a large domain gap between training and test clients at inference time. Extensive experiments on federated DG benchmarks showed the effectiveness of our methods.

Accuracy on PACS using various range of distribution for MixIG.

Performance on iid label distribution with 30 training clients.

Method    P             A             C             S             Avg.
FedProx   94.95 (0.42)  73.97 (0.48)  70.72 (0.28)  73.76 (0.24)  78.35
FedBN     94.53 (0.85)  75.95 (0.17)  70.53 (2.03)  78.45 (2.16)  79.86
FedIG-A   94.76 (1.14)  84.20 (0.31)  79.20 (2.02)  82.87 (1.76)  85.26

Performance on non-iid label distribution (α = 0.5) with 30 training clients.

Method    P             A             C             S             Avg.
FedProx   92.25 (0.25)  71.75 (3.56)  74.87 (0.61)  69.30 (6.14)  77.04
FedBN     92.82 (0.13)  74.78 (1.55)  75.50 (0.95)  71.71 (2.66)  78.70
FedIG-A   93.62 (0.30)  79.54 (0.83)  82.02 (0.15)  77.49 (1.17)  83.17

Performance on non-iid label distribution (α = 0.1) with 30 training clients.

Method    P             A             C             S             Avg.
FedProx   83.17 (0.42)  62.65 (4.42)  66.95 (0.95)  55.34 (4.33)  67.03
FedBN     84.41 (1.40)  61.62 (0.48)  69.05 (3.81)  59.85 (4.00)  68.73
FedIG-A   92.13 (0.89)  73.29 (2.90)  73.19 (1.12)  71.96 (0.34)  77.64

A.7 PERFORMANCE ON CLIENTS INSIDE THE FEDERATION

Personalized performance on PACS, VLCS, and OfficeHome.

.68)                         76.11 (1.21)  87.81 (1.32)  79.78 (1.80)  77.57
FedIG (global statistics)    68.05 (2.03)  76.77 (1.20)  88.19 (1.14)  80.53 (1.91)  78.38
FedIG-A (global statistics)  68.50 (1.82)  76.63 (1.16)  88.08 (1.25)  80.70 (1.99)  78.48

A.5 EXPERIMENTAL RESULTS ON VLCS AND OFFICEHOME

In Table 7, our FedIG and FedIG-A achieve state-of-the-art performance on VLCS. Several methods downgrade the performance on the V or S domains compared with FedAvg, e.g., CSAC achieves low performance on the V and S domains, but we obtain consistently better performance on all domains. In Table 8, both the decentralized DG and federated DG paradigms cannot achieve a large improvement on OfficeHome because the domain shift across clients is very small. FedIG-A obtains results comparable with the state-of-the-art previous methods.

We describe pseudo-codes of client-agnostic learning and zero-shot adaptation in Tables 13 and 14. Note that federated DG benchmarks only contain four clients, i.e., three for training and one for test, thus all three training clients participate in the federation at every round.

Table 13: Pseudo-code for FedIG-A training.
Global weights … Obtain f_{i,t} by Eq. (7) and (9); Obtain the prediction with C_{ϕ_t};

