GRADIENT ESTIMATION FOR UNSEEN DOMAIN RISK MINIMIZATION WITH PRE-TRAINED MODELS

Anonymous

Abstract

Domain generalization aims to build generalized models that perform well on unseen domains when only source domains are available for model optimization. Recent studies have demonstrated that large-scale pre-trained models can play an important role in domain generalization by providing their generalization power. However, large-scale pre-trained models are not fully equipped with target task-specific knowledge due to a discrepancy between the pre-training objective and the target task. Although the task-specific knowledge can be learned from source domains by fine-tuning, this hurts the generalization power of the pre-trained models because of gradient bias toward the source domains. To address this issue, we propose a new domain generalization method that estimates unobservable gradients that reduce potential risks in unseen domains, using a large-scale pre-trained model. Our proposed method allows the pre-trained model to learn task-specific knowledge further while preserving its generalization ability with the estimated gradients. Experimental results show that our proposed method outperforms baseline methods on DOMAINBED, a standard benchmark in domain generalization. We also provide extensive analyses to demonstrate that the estimated unobserved gradients relieve the gradient bias, and that the pre-trained model learns the task-specific knowledge without sacrificing its generalization power.

1. INTRODUCTION

Many machine learning studies assume that training and test data are independent and identically distributed (i.i.d.). However, this i.i.d. assumption does not always hold in real-world scenarios, where distribution shifts between training and test data occur frequently. Thus, traditional machine learning models often show poor performance on unseen domains shifted from source (training) domains (Quinonero-Candela et al., 2008; Torralba & Efros, 2011). To tackle this problem, domain generalization has recently attracted much attention. The main goal of domain generalization is to build generalized models that perform the target task (e.g., classification) well on unseen domains (e.g., painted images) when only source domains (e.g., realistic images) are accessible during model optimization. Early domain generalization studies (Muandet et al., 2013; Ganin et al., 2016; Li et al., 2018b) focused on learning domain-invariant representations across the source domains. However, Gulrajani & Lopez-Paz (2021) have recently shown that simple empirical risk minimization (ERM) (Vapnik, 1999) outperforms the previous methods on DOMAINBED, a benchmark for domain generalization, with pre-trained ResNet-50 (He et al., 2016). Moreover, Yu et al. (2021) provide empirical evidence that large-scale pre-trained models can play an important role in domain generalization by providing their generalization power. Motivated by this, several studies have begun to leverage the generalization power of large-scale pre-trained models. Cha et al. (2022) employ a pre-trained model for regularization, considering it as an approximation of the oracle model on any domain, and Li et al. (2022) utilize a frozen pre-trained model as a feature extractor. These studies have proven the usefulness of pre-trained models in domain generalization.
However, the pre-trained models used in those studies cannot learn task-specific knowledge further, since they are frozen during model optimization to preserve their generalization ability. To learn the task-specific knowledge, one can fine-tune, updating all the parameters of the pre-trained model by optimizing it on the source domains. However, Kumar et al. (2022) demonstrate that fine-tuning distorts the generalized representations of pre-trained models; namely, fine-tuning hurts their generalization ability. In this paper, we interpret this issue in terms of gradient bias during model optimization. As shown in Figure 1a, the gradient of naive fine-tuning is biased toward the source domains because it is computed on the source domains alone, disregarding unseen domains. Although this biased gradient reduces empirical risks in the source domains through the learning of task-specific knowledge, it likely increases risks in the unseen domains. We argue that the gradient bias would be relieved if gradients that lower the risks in the unseen domains were observable. To this end, we propose a new domain generalization method, called GESTUR, which estimates the unobservable gradients with a large-scale pre-trained model. GESTUR consists of two key components: a task expert (TE) and a generalization expert (GE).

Figure 1 (caption): Gradient "conflicts" (Yu et al., 2020; Mansilla et al., 2021) between g and g_u (i.e., g · g_u < 0) occur constantly throughout the fine-tuning iterations due to the gradient bias. Our proposed method reduces the number of gradient conflicts by adding the estimated unobservable gradient ĝ_u to the biased gradient g. This observation indicates that the gradient bias is relieved with the estimated gradient during model optimization. More details are described in § 3.4.
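The notion of a gradient conflict, and the paper's remedy of adding the estimated unobservable gradient to the biased one, can be illustrated with a minimal NumPy sketch. The function names `conflicts` and `relieved_gradient` are ours, and the toy vectors are illustrative; this is not the paper's implementation.

```python
import numpy as np

def conflicts(g, g_u):
    """A gradient 'conflict' occurs when the biased source-domain gradient g
    and the unseen-domain gradient g_u point in opposing directions,
    i.e., their inner product is negative (g . g_u < 0)."""
    return float(np.dot(g, g_u)) < 0.0

def relieved_gradient(g, g_u_est):
    """The remedy described in the paper: add the estimated unobservable
    gradient to the biased gradient, so the combined update direction is
    less likely to conflict with the true unseen-domain gradient."""
    return g + g_u_est

# Toy example: the source gradient conflicts with the unseen-domain gradient.
g = np.array([1.0, -0.5])     # biased toward the source domains
g_u = np.array([-0.2, 1.0])   # unobservable in practice; known here for illustration
print(conflicts(g, g_u))                 # True: g . g_u = -0.7 < 0
g_combined = relieved_gradient(g, g_u)   # assume a perfect estimate for the demo
print(conflicts(g_combined, g_u))        # False: the conflict is resolved
```

In practice g_u is of course not observable; GESTUR's contribution is estimating it, which the sketch sidesteps by using the true g_u as the estimate.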
Based on ERM, where gradients tend to be biased toward the source domains, TE learns task-specific knowledge from the source domains directly in order to transfer the knowledge to GE. Meanwhile, GE learns the task-specific knowledge from TE indirectly via an exponential moving average (EMA) while preserving the generalization ability of a large-scale pre-trained model. Still, the gradient bias of TE might impair the generalization ability of GE. To mitigate this, GE is utilized to estimate the unobservable gradient that minimizes risks in unseen domains for TE, based on the assumption that large-scale pre-trained models can act as a loose approximation of the oracle model of unseen domains (§ 2). As shown in Figure 1b, the gradient bias of TE is relieved by simply adding the estimated unobservable gradient to the biased gradient, improving domain generalization performance (§ 3). Extensive experiments and analyses demonstrate that GESTUR outperforms baseline methods by learning the task-specific knowledge appropriately from source domains while preserving the generalization ability of large-scale pre-trained models.
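The TE/GE interaction described above can be sketched as a single optimization step. This is a hedged illustration, not the paper's exact algorithm: the EMA update of GE from TE follows the text, but the concrete estimator for the unobservable gradient (here, the direction pulling TE's parameters back toward GE) is our simplifying assumption, and `lr`, `decay`, and the function names are invented for the sketch.

```python
import numpy as np

def ema_update(theta_ge, theta_te, decay=0.999):
    """GE absorbs task-specific knowledge from TE indirectly via an
    exponential moving average of TE's parameters."""
    return decay * theta_ge + (1.0 - decay) * theta_te

def gestur_step(theta_te, theta_ge, g_source, lr=0.1, decay=0.999):
    """One hypothetical GESTUR-style step. The estimated unobservable
    gradient is assumed here to be the direction from GE to TE, so that
    descending on it pulls TE back toward the generalization expert.
    (Illustrative assumption; see § 3 of the paper for the actual rule.)"""
    g_u_est = theta_te - theta_ge                      # assumed estimator
    theta_te = theta_te - lr * (g_source + g_u_est)    # relieved update for TE
    theta_ge = ema_update(theta_ge, theta_te, decay)   # GE tracks TE via EMA
    return theta_te, theta_ge

# One step on toy 2-D "parameters":
theta_te = np.array([1.0, 1.0])   # task expert, fine-tuned on source domains
theta_ge = np.array([0.0, 0.0])   # generalization expert, near the pre-trained init
g_source = np.array([1.0, 0.0])   # biased source-domain gradient
theta_te, theta_ge = gestur_step(theta_te, theta_ge, g_source, lr=0.1, decay=0.9)
print(theta_te)  # [0.8 0.9]
print(theta_ge)  # [0.08 0.09]
```

A high `decay` keeps GE close to the pre-trained model, which is what lets it serve as a loose approximation of the oracle while still slowly absorbing task knowledge.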

Contributions:

(1) We propose a simple yet effective domain generalization method that learns task-specific knowledge while preserving the generalization ability of large-scale pre-trained models. Our proposed method estimates the unobservable gradients that reduce potential risks in unseen domains to relieve the gradient bias toward source domains, based on the two experts, TE and GE.

(2) We conduct extensive experiments to show the effectiveness of our proposed method in domain generalization. By providing careful analyses, we demonstrate that the unobservable gradients can be estimated with a large-scale pre-trained model, and that this relieves the gradient bias. We also demonstrate that our proposed method learns task-specific knowledge without sacrificing the generalization ability of the large-scale pre-trained model.




