GRADIENT ESTIMATION FOR UNSEEN DOMAIN RISK MINIMIZATION WITH PRE-TRAINED MODELS

Anonymous

Abstract

Domain generalization aims to build generalized models that perform well on unseen domains when only source domains are available for model optimization. Recent studies have demonstrated that large-scale pre-trained models can play an important role in domain generalization by providing their generalization power. However, large-scale pre-trained models are not fully equipped with target task-specific knowledge, due to a discrepancy between the pre-training objective and the target task. Although the task-specific knowledge can be learned from source domains by fine-tuning, this hurts the generalization power of the pre-trained models because of gradient bias toward the source domains. To address this issue, we propose a new domain generalization method that estimates unobservable gradients that reduce potential risks in unseen domains, using a large-scale pre-trained model. Our proposed method allows the pre-trained model to learn task-specific knowledge further while preserving its generalization ability with the estimated gradients. Experimental results show that our proposed method outperforms baseline methods on DOMAINBED, a standard benchmark in domain generalization. We also provide extensive analyses demonstrating that the estimated unobservable gradients relieve the gradient bias, and that the pre-trained model learns the task-specific knowledge without sacrificing its generalization power.

1. INTRODUCTION

Many machine learning studies assume that training and test data are independent and identically distributed (i.i.d.). However, this assumption does not always hold in real-world scenarios, where distribution shifts between training and test data occur frequently. As a result, traditional machine learning models often perform poorly on unseen domains shifted from the source (training) domains (Quinonero-Candela et al., 2008; Torralba & Efros, 2011). To tackle this problem, domain generalization has recently attracted much attention. Its main goal is to build generalized models that perform the target task (e.g., classification) well on unseen domains (e.g., painted images) when only source domains (e.g., realistic images) are accessible during model optimization. Early domain generalization studies (Muandet et al., 2013; Ganin et al., 2016; Li et al., 2018b) focused on learning domain-invariant representations across the source domains. However, Gulrajani & Lopez-Paz (2021) recently showed that simple empirical risk minimization (ERM) (Vapnik, 1999) outperforms the previous methods on DOMAINBED, a benchmark for domain generalization, with a pre-trained ResNet-50 (He et al., 2016). Moreover, Yu et al. (2021) provide empirical evidence that large-scale pre-trained models can play an important role in domain generalization by providing their generalization power. Motivated by this, several studies have begun to leverage the generalization power of large-scale pre-trained models. Cha et al. (2022) employ a pre-trained model for regularization, considering it an approximation of the oracle model on any domain, and Li et al. (2022) utilize a frozen pre-trained model as a feature extractor. These studies have proven the usefulness of pre-trained models in domain generalization.
However, the pre-trained models used in those studies cannot learn further task-specific knowledge, since they are frozen during model optimization to preserve their generalization ability. To learn the task-specific knowledge, one can choose fine-tuning, which updates all the parameters of the pre-trained model by optimizing it on the source domains. However, Kumar et al. (2022) demonstrate that fine-tuning distorts the generalized representations of pre-trained models; that is, fine-tuning hurts their generalization ability. In this paper, we interpret this issue in terms of gradient bias during model optimization. As shown in Figure 1a, the gradient of naive fine-tuning is biased toward the source domains because it is computed only from the source domains, disregarding unseen domains. Although this biased gradient reduces empirical risks in the source domains while learning task-specific knowledge, it likely increases risks in the unseen domains. We argue that the gradient bias would be relieved if gradients that lower the risks in the unseen domains were observable. To this end, we propose a new domain generalization method, called GESTUR, which estimates the unobservable gradients with a large-scale pre-trained model. GESTUR consists of two key components: a task expert (TE) and a generalization expert (GE).

[Figure 1: Gradient "conflicts" (Yu et al., 2020; Mansilla et al., 2021) between g and g_u (i.e., g · g_u < 0) constantly occur throughout the fine-tuning iterations due to the gradient bias. Our proposed method reduces the number of gradient conflicts by adding the estimated unobservable gradient ĝ_u to the biased gradient g. This observation indicates that the gradient bias is relieved with the estimated gradient during model optimization. More details are described in § 3.4.]
Based on ERM, where gradients tend to be biased toward the source domains, TE learns task-specific knowledge from the source domains directly in order to transfer the knowledge to GE. Meanwhile, GE learns the task-specific knowledge from TE indirectly via an exponential moving average (EMA) while preserving the generalization ability of a large-scale pre-trained model. Still, the gradient bias of TE might impair the generalization ability of GE. To mitigate this, GE is utilized to estimate the unobservable gradient that minimizes risks in unseen domains for TE, based on the assumption that large-scale pre-trained models can act as a loose approximation of the oracle model of unseen domains (§ 2). As shown in Figure 1b, the biased gradient of TE is relieved by simply adding the estimated unobservable gradient to the biased gradient, improving domain generalization performance (§ 3). Extensive experiments and analyses demonstrate that GESTUR outperforms baseline methods by learning the task-specific knowledge appropriately from source domains while preserving the generalization ability of large-scale pre-trained models.

Contributions:

(1) We propose a simple yet effective domain generalization method that learns task-specific knowledge while preserving the generalization ability of large-scale pre-trained models. Our proposed method estimates the unobservable gradients that reduce potential risks in unseen domains, relieving the gradient bias toward source domains, based on the two experts, TE and GE.

(2) We conduct extensive experiments to show the effectiveness of our proposed method in domain generalization. Through careful analyses, we demonstrate that the unobservable gradients can be estimated with a large-scale pre-trained model and that they relieve the gradient bias. We also demonstrate that our proposed method learns task-specific knowledge without sacrificing the generalization ability of the large-scale pre-trained model.

2. METHODOLOGY

2.1 PRELIMINARIES

Problem formulation. Let D_s and D_u be the sets of source domains and unseen domains, respectively. Each domain D contains n_D data samples {(x_i, y_i)}_{i=1}^{n_D} ∼ D, where each data sample (x_i, y_i) consists of an input x_i and its target label y_i; the n_D data samples are i.i.d. over some probability distribution. The main goal of domain generalization is to build a model θ that performs well on the unseen domains D_u when only the source domains D_s are available:

min_θ E_{D∼D_u} E_{(x,y)∼D} [ℓ((x, y); θ)],    (1)

where ℓ((x, y); θ) is the loss function defined for the model θ on the data sample (x, y). Note that this study focuses on classification tasks. Hence, we write the model as θ = {θ^f; θ^c}, consisting of a feature extractor θ^f and a classifier θ^c.

Motivation. With success in many downstream tasks, it has become a convention to initialize the feature extractor θ^f with a large-scale pre-trained model. Although pre-trained models provide better feature representations than randomly initialized parameters, they are not yet fully equipped with task-specific knowledge, because there is a discrepancy between the pre-training objective and the target task. For example, CLIP (Radford et al., 2021) is pre-trained to match web-crawled image-caption pairs, whereas the target task, in the case of PACS (Li et al., 2017), is to classify data into seven classes (e.g., horse and dog). Therefore, many studies have adopted fine-tuning, which updates all the parameters of the feature extractor θ^f to learn the task-specific knowledge by optimizing the model on the source domains D_s. However, Kumar et al. (2022) observe that fine-tuning impairs the generalization ability of pre-trained models while learning task-specific knowledge. We interpret this issue at the gradient level.
Based on ERM (Vapnik, 1999), the gradient g of fine-tuning is computed for the model θ on the source domains D_s as follows:

g = ∇_θ E_{(x,y)∼B} [ℓ((x, y); θ)],    (2)

where B is a mini-batch sampled from the source domains D_s. The gradient g is influenced only by the source domains D_s because the unseen domains D_u are not accessible; that is, the gradient is biased toward the source domains. We presume that this gradient bias degrades generalization performance in the unseen domains.
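This bias is later quantified via gradient conflicts (g · g_u < 0, § 3.4). The conflict check can be made concrete with a minimal pure-Python sketch; the vectors below are toy stand-ins for flattened model gradients, not the paper's implementation:

```python
# Sketch: detecting a gradient "conflict" (Yu et al., 2020) between the
# observable source-domain gradient g and a gradient g_u computed from an
# unseen domain. Toy vectors stand in for flattened model gradients.

def dot(u, v):
    """Inner product of two flattened gradient vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def conflicts(g, g_u):
    """True when the gradients point in opposing directions (g . g_u < 0)."""
    return dot(g, g_u) < 0

# A source-biased gradient and an unseen-domain gradient that disagree:
g = [1.0, 2.0, -0.5]
g_u = [-1.0, -1.5, 0.2]
assert conflicts(g, g_u)        # 1*(-1) + 2*(-1.5) + (-0.5)*0.2 = -4.1 < 0

# Aligned gradients do not conflict:
assert not conflicts(g, [0.5, 1.0, -0.2])
```

Counting how often this predicate fires over training iterations is exactly the measurement used in § 3.4.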

2.2. GESTUR: GRADIENT ESTIMATION FOR UNSEEN DOMAIN RISK MINIMIZATION WITH PRE-TRAINED MODELS

We hypothesize that the gradient bias mentioned above could be relieved if the unobservable gradient g_u that minimizes risks in the unseen domains were computable. To achieve this, we borrow the assumption of Cha et al. (2022) that large-scale pre-trained models approximate the oracle model θ*, which is optimally generalized for any domain D. Since the unobservable gradient g_u cannot be computed from the unseen domains D_u directly, we consider the direction from the current model θ to the oracle model θ* as the unobservable gradient g_u. However, the oracle model is inaccessible in practice; hence, we estimate the unobservable gradient using a large-scale pre-trained model as an approximation of the oracle model. Note that we aim to estimate the unobservable gradient g_u for the unseen domains D_u to alleviate the gradient bias, so the above assumption needs to be refined for two reasons. First, we intend to design the unobservable gradient for the unseen domains only, rather than for any domain. Second, pre-trained models do not yet have task-specific knowledge, as described in § 2.1. Therefore, we slightly modify the assumption as follows: pre-trained models are a loose approximation of the oracle model θ*_u of the unseen domains D_u, and they can get closer to the oracle model by learning task-specific knowledge. Based on this assumption, we propose a simple yet effective domain generalization method, GESTUR, which estimates the unobservable gradient g_u for unseen domain risk minimization with a large-scale pre-trained model.

Task expert and generalization expert. GESTUR consists of two classification models: a task expert (TE, θ_TE) and a generalization expert (GE, θ_GE), which are complementary to each other. Both feature extractors are initialized with a large-scale pre-trained model θ_0. TE learns task-specific knowledge from the source domains D_s directly, in order to transfer the knowledge to GE.
Meanwhile, GE also learns task-specific knowledge from TE via EMA, but it deliberately preserves the generalization ability of the pre-trained model. Here, the gradient bias of TE might hurt the generalization ability of GE, because the knowledge of TE is injected into GE. To relieve the gradient bias, GE, as the loose approximation of the oracle model θ*_u for the unseen domains D_u, is used to estimate the unobservable gradient g_u. Our proposed GESTUR is summarized in Algorithm 1.

Algorithm 1: GESTUR
1: initialize θ^f_TE and θ^f_GE with the pre-trained model θ_0
2: for each iteration do
3:   sample a mini-batch B from the source domains D_s
4:   g = ∇_θ E_{(x,y)∼B} [ℓ((x, y); θ_TE)]
5:   split g into g^f (feature extractor) and g^c (classifier)
6:   ĝ^f_u = θ^f_GE − θ^f_TE
7:   ĝ^f_u = λ‖g^f‖_2 · ĝ^f_u / ‖ĝ^f_u‖_2
8:   g^f = (g^f + ĝ^f_u) / 2
9:   update θ^f_TE with g^f and update θ^c_TE with g^c
10:  update θ_GE = m·θ_GE + (1 − m)·θ_TE
11: end for

Gradient estimation. Using Equation 2, the gradient g for TE is computed as g = ∇_θ E_{(x,y)∼B} [ℓ((x, y); θ_TE)] while learning task-specific knowledge. This gradient is biased toward the source domains D_s. A gradient that minimizes risks in the unseen domains could relieve the gradient bias, but it is unobservable. If we had access to the oracle model θ*_u of the unseen domains D_u, we could direct the current model toward the oracle model instead of empirically calculating the unobservable gradient from the unseen domains. Hence, we treat the direction from the current model θ_TE to the oracle model θ*_u as the unobservable gradient g_u:

g_u = θ*_u − θ_TE.    (3)

In practice, it is infeasible to access the oracle model. Thus, we estimate the unobservable gradient using GE, which loosely approximates the oracle model:

ĝ_u = θ_GE − θ_TE.    (4)

This estimated gradient ĝ_u is used to relieve the gradient bias during parameter optimization.

Parameter optimization. We emphasize again that GESTUR leverages the generalization power of large-scale pre-trained models to relieve the gradient bias, which distorts the generalized feature representations of the feature extractor θ^f.
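The update loop of Algorithm 1 can be sketched in a few lines of plain Python. This is a toy sketch on flattened parameter lists; all names (gestur_step, theta_te, theta_ge) are ours, not the authors' code, and a real implementation would operate on model tensors with an optimizer:

```python
import math

# Minimal sketch of one GESTUR update on flattened parameter lists.
# theta_te / theta_ge: task expert and generalization expert parameters.
# g_f: the (source-biased) gradient of the TE feature extractor.

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def gestur_step(theta_te, theta_ge, g_f, lr=0.1, lam=0.1, m=0.999):
    # 1) Estimate the unobservable gradient as the direction from TE to GE.
    g_hat = [ge - te for ge, te in zip(theta_ge, theta_te)]
    # 2) Rescale it to lam times the norm of the observed gradient.
    scale = lam * l2_norm(g_f) / (l2_norm(g_hat) + 1e-12)
    g_hat = [scale * x for x in g_hat]
    # 3) Average the biased and estimated gradients, then take a descent step.
    g_mix = [(gf + gh) / 2 for gf, gh in zip(g_f, g_hat)]
    theta_te = [te - lr * gm for te, gm in zip(theta_te, g_mix)]
    # 4) EMA: inject TE's task-specific knowledge into GE slowly.
    theta_ge = [m * ge + (1 - m) * te for ge, te in zip(theta_ge, theta_te)]
    return theta_te, theta_ge
```

The classifier parameters are omitted here for brevity: only the feature extractor receives the mixed gradient, while the classifier is updated with its original gradient, as the text explains.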
Hence, we limit the use of the estimated unobservable gradient ĝ_u to the feature extractor θ^f only, not the classifier θ^c. For TE, the estimated gradient ĝ^f_u for the feature extractor θ^f_TE is added to the biased gradient g^f for the same feature extractor:

g^f = (1/2)(g^f + λ‖g^f‖_2 · ĝ^f_u / ‖ĝ^f_u‖_2),    (5)

where λ is a gradient scale factor that controls the influence of the normalized ĝ^f_u. The feature extractor θ^f_TE is updated with the gradient g^f adjusted by ĝ^f_u, while the classifier θ^c_TE of TE is updated with its original gradient g^c. Under our assumption, GE can get closer to the oracle model θ*_u by learning task-specific knowledge. However, the generalization ability of GE decreases if we optimize GE on the source domains D_s directly. Therefore, we inject the task-specific knowledge learned by TE into GE gradually via EMA:

θ_GE = m·θ_GE + (1 − m)·θ_TE,    (6)

where m is the moving average coefficient. By encouraging the parameters of GE to change slowly, EMA helps preserve the generalization ability of GE. Since the goal of domain generalization is to build a model that minimizes the risk of the unseen domains, we choose GE, designed to approximate the oracle model of those domains, as our final model θ.

3. EXPERIMENTS

3.1. EXPERIMENTAL SETUP

Pre-trained models. GESTUR heavily relies on pre-trained models. Therefore, we employ three pre-trained models of different sizes to verify that the proposed method performs well across various pre-trained models: ResNet-50 (He et al., 2016), CLIP (Radford et al., 2021), and SWAG.

Evaluation protocol. We adopt the experimental protocol of DOMAINBED, which enforces fair and realistic evaluations (e.g., the same model selection criterion) across competitors. We divide the data from each domain into 80% and 20% splits and follow the training-domain validation set strategy for model selection and hyperparameter search in every experiment.
We also repeat every experiment three times to reduce randomness in dataset splits and parameter initialization, similar to Gulrajani & Lopez-Paz (2021), and report the mean and standard error of the experimental results.

Implementation details. Our implementation is built on the codebase of Cha et al. (2022). We use the Adam optimizer (Kingma & Ba, 2015) for parameter optimization. GESTUR has two hyperparameters, the gradient scale factor (λ) and the moving average coefficient (m). In every experiment, we search for the optimal λ in {0.01, 0.05, 0.1, 0.5} and fix m at 0.999. Other hyperparameters such as learning rate, weight decay, and dropout are searched in the same way as in Cha et al. (2022). We explain more implementation details in Appendix A.

Baselines. We exhaustively compare our proposed method with various baseline methods. For simplicity, we report only the results of baseline methods that outperform ERM (Vapnik, 1999), the simplest baseline. We describe the baseline methods and report the full version of the results in Appendix B.1.
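The per-domain 80/20 split with training-domain validation described above can be sketched as follows (toy data; the actual benchmark splits image datasets via the DOMAINBED codebase, and the helper names here are ours):

```python
import random

# Sketch of the DomainBed-style protocol: each domain's data is split 80/20,
# and the 20% splits of the source domains form the training-domain
# validation set used for model selection.

def split_domain(samples, frac=0.8, seed=0):
    """Shuffle one domain's samples and split them frac / (1 - frac)."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(frac * len(samples))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

# Toy "domains": integers stand in for images.
domains = {"art": list(range(10)), "photo": list(range(10))}
train, val = {}, {}
for name, data in domains.items():
    train[name], val[name] = split_domain(data)

assert len(train["art"]) == 8 and len(val["art"]) == 2
```

In the benchmark, one domain is held out entirely as the unseen test domain; the remaining (source) domains contribute both their 80% training splits and their 20% validation splits.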

3.2. MAIN RESULTS

Results on RN50. The first part of Table 1 shows the experimental results where RN50 is used to initialize the feature extractor. GESTUR achieves the best performance on all the datasets except DomainNet. In detail, the proposed method outperforms ERM by an average of 3.2%p and improves over the runner-up by 1.0%p on VLCS, 0.5%p on OfficeHome, and 1.3%p on TerraIncognita. Notably, the proposed method outperforms the state-of-the-art method, SWAD (Cha et al., 2021), by an average of 0.5%p.

Results on CLIP and SWAG. In the second and third parts of Table 1, we show the experimental results where the larger pre-trained models, CLIP and SWAG, are used to initialize the feature extractor, respectively. In summary, GESTUR achieves the best performance on all the datasets. In detail, the proposed method outperforms MIRO, which also leverages the generalization power of large-scale pre-trained models, by 1.8%p and 2.8%p on CLIP and SWAG, respectively. From this, we verify that the proposed method leverages the generalization ability of pre-trained models more successfully than the other baseline methods. Interestingly, we observe that the performance gap between the proposed method and ERM increases as the size of the pre-trained model increases.

3.3. COMPARISON BETWEEN THE TASK EXPERT AND THE GENERALIZATION EXPERT

Setup. GESTUR consists of two essential components: the task expert (TE) and the generalization expert (GE). In this paper, we use GE as the final model based on the assumption that GE approximates the oracle model of unseen domains. Nevertheless, TE is also designed to preserve the generalization ability of pre-trained models, since it considers the estimated unobservable gradient in every update to relieve its gradient bias. Therefore, we compare the performance of ERM, GESTUR w/ GE, and its variant GESTUR w/ TE, based on the hyperparameters searched in § 3.2, to show that both preserve the generalization ability of large-scale pre-trained models.

Results. As shown in Table 2, GESTUR w/ GE achieves the best performance in all experiments. Also, GESTUR w/ TE outperforms ERM by averages of 9.8%p and 4.5%p when using CLIP and SWAG, respectively. The performance of GESTUR w/ TE is higher when larger pre-trained models are given, similar to the observation in § 3.2. These observations demonstrate that GESTUR w/ TE preserves the generalization ability of the pre-trained models thanks to the estimated unobservable gradient, i.e., the gradient bias of TE is relieved. Moreover, GESTUR w/ GE outperforms GESTUR w/ TE, which indicates that EMA helps the model stably preserve its generalization ability while learning task-specific knowledge. These results reaffirm our choice of GE as the final model.

3.4. COMPARISON WITH ERM IN TERMS OF GRADIENT BIAS

Setup. As described in § 1, we suspect that the gradient bias degrades domain generalization performance. We further analyze how much gradient bias occurs during fine-tuning and how much of it is alleviated by our proposed method. To quantify the gradient bias, we borrow the concept of gradient conflict (Yu et al., 2020; Mansilla et al., 2021): two gradients g_i and g_j conflict if g_i · g_j < 0. For every iteration, we first sample two mini-batches, one from the source domains and one from an unseen domain.

We then compute the losses of the two mini-batches and calculate gradients g and g_u from them, respectively. Finally, we count the number of iterations where a gradient conflict (g · g_u < 0) occurs, for both ERM and GESTUR. Here, we update the model using only the gradient g, since unseen domains are inaccessible in practice.

Results. As shown in Table 3, GESTUR reduces the gradient conflicts of ERM by around 11.5%, 21%, and 28.2% for the three pre-trained models, respectively. This verifies that our proposed method relieves gradient bias by estimating unobservable gradients with the pre-trained model. Gradient conflicts occur more often in GESTUR than in ERM only in one experimental setup (OfficeHome w/ RN50), which is consistent with Table 2, where ERM outperforms GESTUR w/ TE. This observation indicates that domain generalization performance is affected by the gradient bias, represented here as gradient conflicts. Additional analysis on the similarity of the true and estimated unobservable gradients is provided in Appendix C.3.

3.5. TASK-SPECIFIC KNOWLEDGE LEARNED BY THE GENERALIZATION EXPERT

Setup. We conduct additional experiments to show that the feature extractor θ^f_GE of GE successfully learns task-specific knowledge. Linear probing, which updates only the classifier parameters while freezing those of the feature extractor, is common practice for assessing representation quality. We assume that the more task-specific knowledge the feature extractor learns, the better its linear probing performance in unseen domains targeting the same task. In detail, we first train GE on source domains and then evaluate linear probing performance on an unseen domain with the trained feature extractor of GE. We compare this against using a frozen pre-trained model as the feature extractor. For linear probing, we simply train a logistic regression classifier on the output feature representations of each feature extractor, using the unseen domain only. Note that, in this analysis, we use CLIP and SWAG, which are pre-trained with objectives significantly different from the target task, to clearly demonstrate the effectiveness of the newly learned task-specific knowledge.
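Linear probing as used in this analysis can be sketched as follows. This is a self-contained toy: the `extract` function is a hypothetical stand-in for the frozen feature extractor, and a hand-rolled logistic regression replaces the library classifier used in the actual experiments:

```python
import math

# Sketch of linear probing: the feature extractor is frozen and only a
# logistic-regression classifier is trained on its output representations.

def extract(x):
    """Stand-in for a frozen pre-trained feature extractor (never updated)."""
    return [x[0] + x[1], x[0] - x[1]]

def train_probe(inputs, labels, lr=0.5, steps=200):
    feats = [extract(x) for x in inputs]      # features are fixed up front
    w, b = [0.0] * len(feats[0]), 0.0         # only these are learned
    for _ in range(steps):
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1 / (1 + math.exp(-z))
            err = p - y                        # gradient of the logistic loss
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    f = extract(x)
    return int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0)

# Linearly separable toy data: label = 1 when x0 + x1 > 0.
X = [[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]]
y = [1, 1, 0, 0]
w, b = train_probe(X, y)
assert all(predict(w, b, xi) == yi for xi, yi in zip(X, y))
```

The design point is that probe accuracy depends entirely on the quality of the frozen features, which is why it serves as a proxy for how much task-relevant knowledge the feature extractor holds.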

Results. As shown in Table 4, GE outperforms the frozen feature extractor on all benchmark datasets except one case where both models reach nearly 99% performance. This shows that GE learns additional task-specific knowledge during training, which makes it a better approximation of the oracle model. The result supports our claim that pre-trained models are not fully equipped with target task-specific knowledge, and that injecting this knowledge further increases performance.

3.6. RELATIONSHIP BETWEEN λ AND THE SIZE OF THE PRE-TRAINED MODEL

Setup. GESTUR controls the scale of the estimated unobservable gradients, which reduce risks in unseen domains, with the gradient scale factor λ. To verify the effect of the scale factor, we observe how performance changes as λ varies.

Results. In Table 5, RN50 achieves the best performance with λ = 0.01. In contrast, the larger pre-trained models, CLIP and SWAG, achieve the best performance with relatively larger values, λ = 0.1 and λ = 0.5, respectively. We summarize more results on other datasets (i.e., VLCS, OfficeHome, and TerraIncognita) in Appendix C.1; they show a pattern similar to PACS. Intuitively, larger pre-trained models act as a better approximation of the oracle model than the small one, because they are likely to have encountered various domains in the huge web-crawled datasets used during pre-training, which helps estimate the unobservable gradients more accurately. With a larger gradient scale factor, the gradient g of TE is more strongly affected by the estimated unobservable gradient ĝ_u while optimizing the model on source domains. From this, we conclude that a larger scale factor improves generalization performance when larger pre-trained models are given.

4. RELATED WORK

4.1 DOMAIN GENERALIZATION

Domain alignment. Domain alignment aims to learn domain-invariant feature representations by removing domain-specific knowledge from the representations. Adversarial training is widely adopted to learn domain-invariant features through a min-max game between a feature extractor and a domain discriminator (Ganin et al., 2016; Li et al., 2018c; Matsuura & Harada, 2020; Zhu et al., 2022). On the other hand, several studies (Muandet et al., 2013; Sun & Saenko, 2016; Li et al., 2018b) aim to minimize feature divergence across source domains. Recently, contrastive learning-based algorithms (Kim et al., 2021; Yao et al., 2022) have been proposed to minimize distances between feature representations of samples in the same class, regardless of their domains.

Data augmentation. Many studies have employed data augmentation techniques to improve domain generalization performance. For example, Gulrajani & Lopez-Paz (2021) apply simple data augmentation techniques as a default setup in DOMAINBED, and some studies (Wang et al., 2020; Xu et al., 2020; Yan et al., 2020) utilize Mixup (Zhang et al., 2017). Recently, a few works (Zhou et al., 2021; Nam et al., 2021; Kang et al., 2022) focus on image style, based on the idea that the domain gap is closely related to image style. In single domain generalization, some works introduce adversarial data augmentation (Volpi et al., 2018; Fan et al., 2021; Qiao et al., 2020) to generate hard samples adversarially while assuring their reliability.

Gradient-based. Recently, several studies have utilized gradients to build generalized models, especially by aligning gradients from different domains. Mansilla et al. (2021) exploit gradient agreement for gradient surgery, based on the hypothesis that conflicting gradients contain domain-specific information. Shi et al. (2022) propose a training method that maximizes the inner product between source-domain gradients to match optimization paths across domains.
Similarly, Rame et al. (2022) match domain-level Hessians to align loss landscapes across domains. As another line of work, Huang et al. (2020) introduce a self-challenging algorithm that iteratively masks dominant features, selected by the scale of their gradients.

Meta-learning-based. Since simulating domain shift by dividing source domains into meta-train and meta-test domains was first introduced in MLDG (Li et al., 2018a), several approaches have been proposed in a similar setting. For example, Balaji et al. (2018) propose to learn a regularizer for classifier weights, and Zhang et al. (2021a) bring the idea of Reptile (Nichol et al., 2018) to MLDG to further increase performance with a multi-view framework. On the other hand, Zhang et al. (2021b) employ meta-learning to adaptively predict model parameters from a batch of inputs.

Others. Other works bring in concepts of causality (Lv et al., 2022), optimize the worst-case performance (Sagawa et al., 2019; Krueger et al., 2021), utilize text labels (Min et al., 2022), or average model weights from different epochs (Cha et al., 2021; Arpit et al., 2022). Our work differs from the aforementioned approaches in that we concentrate on effectively utilizing large-scale pre-trained models.

4.2. DOMAIN GENERALIZATION WITH PRE-TRAINED MODELS

Recently, Gulrajani & Lopez-Paz (2021) empirically showed that simple ERM (Vapnik, 1999) outperforms most early methods with a pre-trained ResNet-50 (He et al., 2016). Yu et al. (2021) show that using large-scale models pre-trained on massive datasets improves out-of-distribution performance. Kumar et al. (2022) find that fine-tuning distorts pre-trained features and propose linear probing followed by fine-tuning to mitigate the feature distortion. Wortsman et al. (2022) find that linearly interpolating the zero-shot and fine-tuned parameters of a pre-trained model improves performance in both source and unseen domains. Although GESTUR's EMA (Equation 6) looks similar to their interpolation, GESTUR updates the pre-trained model to inject task-specific knowledge. Li et al. (2022) propose a method to efficiently leverage a pool of large-scale pre-trained models through specialty-aware ensemble learning. Cha et al. (2022) propose MIRO, a regularization method that aims to minimize mutual information with pre-trained models that approximate the oracle model. We share a similar motivation with MIRO in that we initially approximate the oracle model with a large-scale pre-trained model; however, we iteratively inject task-specific knowledge into the approximation of the oracle model, resulting in a better approximation.

5. CONCLUSION AND FUTURE WORK

In this paper, we propose a new domain generalization method that learns task-specific knowledge while preserving the generalization ability of large-scale pre-trained models. We point out that gradient bias toward source domains hurts the generalization ability of pre-trained models during fine-tuning. To alleviate the gradient bias, our proposed method estimates unobservable gradients that minimize risks in unseen domains based on two key components: a task expert and a generalization expert. Experimental results on DOMAINBED show that our proposed method outperforms baseline methods in domain generalization. Through extensive analyses, we also demonstrate that the estimated unobservable gradients effectively reduce the gradient bias, thereby helping to learn task-specific knowledge without hurting the generalization power of large-scale pre-trained models. Although we verify the effectiveness of our proposed method, it heavily relies on the capability of pre-trained models. When unseen domains that the pre-trained models have not encountered are given (e.g., a ResNet trained on ImageNet has not seen medical images), the pre-trained models might not act as an approximation of the oracle model of those domains. We will address this issue in future work.

REPRODUCIBILITY STATEMENT

We provide the source code for reproduction in the supplementary materials. See Appendix A for the hyperparameters used for the experiments.

B.1 MAIN RESULTS

Table 8: Domain generalization accuracy (%) on the five domain generalization benchmark datasets with the three different pre-trained models. We mark *, †, and ‡ for the results from Gulrajani & Lopez-Paz (2021), Cha et al. (2021), and Cha et al. (2022), respectively. We use the reported numbers from each paper for Fish, Fishr, SelfReg, mDSDI, GVRT, and SMA.

In § 3.2, we only compare baselines superior to ERM (Vapnik, 1999) with GESTUR for simplicity. Here, we provide the entire results of the main experiment in Table 8.

Baselines. In the main experiment, we compare GESTUR against a number of baselines: MMD (Li et al., 2018b), MixStyle (Zhou et al., 2021), GroupDRO (Sagawa et al., 2019), IRM (Arjovsky et al., 2019), ARM (Zhang et al., 2021b), VREx (Krueger et al., 2021), CDANN (Li et al., 2018c), DANN (Ganin et al., 2016), RSC (Huang et al., 2020), MTL (Blanchard et al., 2021), Mixup (Wang et al., 2020; Xu et al., 2020; Yan et al., 2020), MLDG (Li et al., 2018a), Fish (Shi et al., 2022), Fishr (Rame et al., 2022), ERM (Vapnik, 1999), SagNet (Nam et al., 2021), SelfReg (Kim et al., 2021), CORAL (Sun & Saenko, 2016), mDSDI (Bui et al., 2021), GVRT (Min et al., 2022), MIRO (Cha et al., 2022), SWAD (Cha et al., 2021), and SMA (Arpit et al., 2022).

B.2 GESTUR WITH SWAD

Setup. A recent study (Cha et al., 2022) observed that SWAD (Cha et al., 2021), which seeks flat minima, is a good optimizer for domain generalization, improving the generalization performance of several baselines when applied to them. Motivated by this observation, we evaluate the performance of GESTUR combined with SWAD as an optimizer, to verify whether GESTUR and SWAD are orthogonal to each other.

Results. Table 9 shows that SWAD does not improve the performance of GESTUR. We conjecture that this is because the EMA used to transfer the knowledge of TE to GE has an effect similar to SWAD, finding flat minima by averaging the model's weights.
by averaging the text-based representations of the queries. Finally, the model predictions are computed from the text-based representations and the representations of the input images. For WiSE-FT, an ensemble of the fine-tuned and zero-shot models, we set the balance factor α to 0.5, following its original paper, since target unseen domains are inaccessible in the domain generalization setting.
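The two CLIP-based baselines above can be sketched in a few lines. This is a toy illustration under stated assumptions: the embeddings are stand-ins for the real CLIP text encoder outputs, and the parameter dictionaries are stand-ins for full model weights.

```python
import numpy as np

# Sketch of (1) CLIP zero-shot class representations via prompt ensembling
# and (2) the WiSE-FT weight-space ensemble with balance factor alpha.

def class_embedding(text_embeddings):
    """Average the text embeddings of several prompt templates for one
    class, then L2-normalize, following CLIP's prompt-ensembling recipe."""
    mean = np.mean(text_embeddings, axis=0)
    return mean / np.linalg.norm(mean)

def wise_ft(theta_zeroshot, theta_finetuned, alpha=0.5):
    """WiSE-FT: linear interpolation between zero-shot and fine-tuned
    weights, theta = (1 - alpha) * theta_zs + alpha * theta_ft."""
    return {k: (1 - alpha) * theta_zeroshot[k] + alpha * theta_finetuned[k]
            for k in theta_zeroshot}

# Toy usage: two prompt variants of one class, and scalar "weights".
templates = np.array([[1.0, 0.0], [0.0, 1.0]])
emb = class_embedding(templates)            # unit-norm mean direction
theta = wise_ft({"w": 0.0}, {"w": 2.0}, alpha=0.5)
```

With α = 0.5 the interpolation weights the zero-shot and fine-tuned models equally, which is the setting used in the comparison above.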

B.3 COMPARISON WITH CLIP-BASED BASELINES

Results. Table 10 shows the evaluation results, where GESTUR achieves the best averaged performance. In detail, GESTUR outperforms CLIP Zero-shot on VLCS, OfficeHome, and TerraInc, and shows comparable performance on PACS. Likewise, GESTUR achieves better performance than WiSE-FT on PACS, OfficeHome, and TerraInc, and comparable performance on VLCS. Interestingly, the CLIP-based methods exhibit severe performance degradation on TerraInc. We conjecture that their performance is sensitive to the pre-defined text-based queries. For example, the query "a sketch of a {}" is helpful for the "Sketch" domain of PACS. On the other hand, queries such as "a plastic {}" and "a {} in a video game" are not helpful for TerraInc, which is composed of animal images taken in the wild.

In § 3.6, we analyze the relationship between λ and the size of the pre-trained model. However, we only present the results on PACS (Li et al., 2017) in Table 5 for simplicity. Here, we provide the additional results on VLCS (Fang et al., 2013), OfficeHome (Venkateswara et al., 2017), and TerraIncognita (Beery et al., 2018) in Table 11, Table 12, and Table 13, respectively.

Setup. Domain generalization aims to improve the generalization performance on unseen domains shifted from source domains. Thus, domain generalization studies often do not consider situations where the source domains and the target domains are similar. To verify whether the estimated unobservable gradients are useful when the unseen domains are similar to the source domains, we report the performance on the training-domain validation set, simulating situations where the testing domains are exactly the same as the training domains.

Results. The evaluation results are summarized in Table 14. GESTUR w/ TE shows worse performance than ERM, indicating that the estimated unobservable gradients act as noisy gradients.
Namely, gradients biased toward the source domains are more helpful than the estimated unobservable gradients when the source domains and the target domains are similar. Nonetheless, GESTUR w/ GE performs better than ERM, demonstrating that our two-expert architecture is robust across situations, whether the source domains and unseen domains are similar or not.

C.3 SIMILARITY BETWEEN TRUE UNOBSERVABLE GRADIENTS g_u AND ESTIMATED UNOBSERVABLE GRADIENTS ĝ_u OF GESTUR

Setup. In this paper, we argue that gradient bias is a major culprit in degrading domain generalization performance (Figure 1a) and that our proposed method relieves the gradient bias by estimating unobservable gradients. To support this argument, we reported the number of iterations where gradient conflicts exist in Figure 1b and Table 3. To examine whether the estimated unobservable gradients ĝ_u are similar to the true unobservable gradients g_u, we add an analysis computing the cosine similarity between the true and estimated unobservable gradients. Note that the true unobservable gradients are computed with the cross-entropy loss using the true labels of the unseen domain datasets D_u. On the other hand, the estimated unobservable gradients are simply computed as the parameter difference between GE and TE (θ_GE - θ_TE).

Results. Figure 2 shows that our estimated gradients display positive similarity scores with the true gradients. This trend demonstrates that the estimated gradients reduce the number of gradient conflicts, leading models to reduce the risks of unseen domains without accessing unseen domain data.
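The similarity analysis above can be sketched as follows. The parameter and gradient values here are toy numbers for illustration; a real run would use the flattened parameter vectors of GE and TE and the cross-entropy gradient on D_u.

```python
import numpy as np

# Sketch of the C.3 analysis: the estimated unobservable gradient is the
# parameter difference theta_GE - theta_TE, compared against the true
# unobservable gradient with cosine similarity.

def estimated_unobservable_gradient(theta_ge, theta_te):
    """g_hat_u = theta_GE - theta_TE (flattened parameter vectors)."""
    return theta_ge - theta_te

def cosine_similarity(g_true, g_est):
    return float(np.dot(g_true, g_est) /
                 (np.linalg.norm(g_true) * np.linalg.norm(g_est)))

# Toy example: the estimate points in the same direction as the truth.
theta_ge = np.array([1.0, 2.0])
theta_te = np.array([0.0, 0.0])
g_est = estimated_unobservable_gradient(theta_ge, theta_te)  # [1, 2]
g_true = np.array([2.0, 4.0])
sim = cosine_similarity(g_true, g_est)  # -> 1.0 (perfectly aligned)
```

A positive similarity score, as reported in Figure 2, indicates that the parameter difference moves the model in roughly the same direction as the (inaccessible) true gradient would.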



https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb



Figure 1: (a): Model optimization is influenced by the gradient g biased toward the source domains, neglecting the unobservable gradient g_u that could minimize risks in the unseen domains. (b): Gradient "conflicts" (Yu et al., 2020; Mansilla et al., 2021) between g and g_u (i.e., g · g_u < 0) constantly occur throughout the fine-tuning iterations due to the gradient bias. Our proposed method reduces the number of gradient conflicts by adding the estimated unobservable gradient ĝ_u to the biased gradient g. This observation indicates that the gradient bias is relieved by the estimated gradient during model optimization. More details are described in § 3.4.
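The conflict criterion in the caption can be sketched directly. The vectors below are toy values, and `lam` stands for the gradient scale factor λ; this only illustrates the sign test g · g_u < 0, not the training loop.

```python
import numpy as np

# Sketch of the gradient-conflict check from Figure 1b: a conflict is an
# iteration where the biased gradient g and the unobservable gradient g_u
# point in opposing directions. Adding the estimated gradient
# lam * g_hat to g can flip the dot product back to positive.

def is_conflict(g, g_u):
    return float(np.dot(g, g_u)) < 0.0

g = np.array([1.0, -1.0])       # gradient biased toward source domains
g_u = np.array([0.0, 1.0])      # true unobservable gradient; g . g_u = -1
g_hat = np.array([0.0, 2.0])    # estimated unobservable gradient
lam = 1.0                       # gradient scale factor

g_corrected = g + lam * g_hat   # [1, 1]; g_corrected . g_u = 1, no conflict
```

Counting `is_conflict` over all training iterations yields the conflict percentages reported in Table 3.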

Evaluation results (%) on the five datasets with the three different pre-trained models.

Evaluation results (%) on the four datasets with the three different pre-trained models. We separate the cases where GESTUR uses TE and GE as the final model, respectively.

The percentage (%) of gradient conflicts between g and g u to the whole training iterations.

Linear probing performance (%) with the two different pre-trained feature extractors: the frozen pre-trained model θ_0 and the feature extractor θ^f_GE of GE.

Evaluation results (%) on PACS with the three different pre-trained models varying λ.

B.2 APPLICABILITY OF SWAD (CHA ET AL., 2021) TO GESTUR

Table 9: Evaluation results (%) of the combination of SWAD and GESTUR on the four datasets with the three different pre-trained models.

Table 10: Evaluation results (%) on the four datasets with CLIP. Here, we compare GESTUR with the CLIP-based baselines, CLIP Zero-shot and WiSE-FT (Wortsman et al., 2022).

These observations indicate that the CLIP-based methods require hard prompt engineering for each target dataset. Moreover, the CLIP-based methods solely depend on CLIP and cannot be extended to other architectures or learning methods trained only on the visual modality, such as ResNet with ImageNet and RegNet with SWAG. Considering these, our GESTUR achieves a meaningful performance.

C.1 RELATIONSHIP BETWEEN λ AND THE TYPES OF THE PRE-TRAINED MODEL

Table 11: Evaluation results (%) on VLCS with the three different pre-trained models varying λ.

Table 12: Evaluation results (%) on OfficeHome with the three different pre-trained models varying λ.

Table 13: Evaluation results (%) on TerraIncognita with the three different pre-trained models varying λ.

C.2 PERFORMANCE ON SOURCE DOMAINS

Table 14: Evaluation results (%) on the four datasets with RN50. Here, we average the performance over the source domains, not the performance on the unseen target domain.

APPENDIX A IMPLEMENTATION DETAILS

Hyperparameter search strategy. Similar to Cha et al. (2022), the hyperparameter tuning strategy differs depending on which pre-trained model is used. In the experiments of this work, we use three different pre-trained models: ResNet-50 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) (RN50), ViT-B/16 (Dosovitskiy et al., 2021) with CLIP (Radford et al., 2021) (CLIP), and RegNetY-16GF (Radosavovic et al., 2020) with SWAG (Singh et al., 2022) (SWAG). A two-stage hyperparameter search strategy is used for the experiments with RN50. Here, the batch size and the moving average coefficient (m) are fixed to 32 and 0.999, respectively, throughout the entire search procedure. In the first stage, we search the gradient scale factor (λ) over {0.01, 0.05, 0.1, 0.5}. In this stage, we fix the learning rate to 5e-5 and do not use weight decay or dropout (i.e., weight decay and dropout are equal to 0). In the second stage, we fix λ to the value found in the first stage. Then, we search the learning rate over {1e-5, 3e-5, 5e-5}, weight decay over {0, 1e-6, 1e-4}, and dropout over {0, 0.1, 0.5}. We provide the hyperparameters we use for RN50 in Table 6.
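The two-stage search above can be sketched as two nested grid searches. Here `evaluate` is a hypothetical stand-in for training GESTUR and measuring validation accuracy; its scoring rule is invented solely so the loops run end to end.

```python
import itertools

def evaluate(lam, lr=5e-5, weight_decay=0.0, dropout=0.0):
    # Hypothetical objective standing in for "train, then validate".
    # A real run would fine-tune the model and return validation accuracy.
    return -abs(lam - 0.1) - abs(lr - 3e-5)

# Stage 1: search the gradient scale factor lambda with the learning rate
# fixed at 5e-5 and weight decay / dropout fixed at 0.
best_lam = max([0.01, 0.05, 0.1, 0.5], key=lambda lam: evaluate(lam))

# Stage 2: fix lambda at the stage-1 value, then grid-search the learning
# rate, weight decay, and dropout.
grid = itertools.product([1e-5, 3e-5, 5e-5], [0, 1e-6, 1e-4], [0, 0.1, 0.5])
best_lr, best_wd, best_do = max(grid, key=lambda cfg: evaluate(best_lam, *cfg))
```

Fixing λ before sweeping the remaining hyperparameters keeps the search cost linear in the two grids rather than multiplicative over all four dimensions.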

