SIMPLE: SPECIALIZED MODEL-SAMPLE MATCHING FOR DOMAIN GENERALIZATION

Abstract

In domain generalization (DG), most existing methods aspire to fine-tune a specific pretrained model through novel DG algorithms. In this paper, we propose an alternative direction, i.e., to efficiently leverage a pool of pretrained models without fine-tuning. Through extensive empirical and theoretical evidence, we demonstrate that (1) pretrained models already possess generalization ability to some extent, yet there is no single best pretrained model across all distribution shifts, and (2) the out-of-distribution (OOD) generalization error depends on the fitness between the pretrained model and the unseen test distribution. This analysis motivates us to incorporate diverse pretrained models and to dispatch the best-matched models to each OOD sample by means of recommendation techniques. To this end, we propose SIMPLE, a specialized model-sample matching method for domain generalization. First, the predictions of pretrained models are adapted to the target domain by a linear label space transformation. A matching network aware of model specialty is then proposed to dynamically recommend proper pretrained models to predict each test sample. Experiments on DomainBed show that our method achieves significant performance improvements (up to 12.2% on an individual dataset and 3.9% on average) over state-of-the-art (SOTA) methods, and a further 6.1% gain by enlarging the pretrained model pool. Moreover, our method is highly efficient, achieving more than a 1000× training speedup compared to conventional DG methods that fine-tune a pretrained model. Code and supplemental materials are available at https://seqml.github.io/simple.

1. INTRODUCTION

Distribution shift is a common problem in real-world applications; it breaks the independent and identically distributed (i.i.d.) assumption of machine learning algorithms (Wang et al., 2022). Mismatches between training and test distributions, which are quite common in reality, can largely deteriorate model performance and make machine learning models infeasible for practical applications (González & Abu-Mostafa, 2015). Therefore, enhancing the generalization ability of models has attracted increasing attention (Cha et al., 2021; Zhang et al., 2022). For its practical significance, various methods have been proposed, e.g., domain alignment (Ganin et al., 2016; Gong et al., 2019; Arjovsky et al., 2019), meta-learning (Finn et al., 2017; Dou et al., 2019; Du et al., 2020), and ensemble learning (Mancini et al., 2018; Cha et al., 2021; Arpit et al., 2021). The effectiveness of DG algorithms is generally verified by fine-tuning a pretrained ResNet (He et al., 2016) model with these algorithms (Gulrajani & Lopez-Paz, 2020). It has been demonstrated that these algorithms improve upon the empirical risk minimization (ERM) baseline on the ResNet-50 backbone (Arpit et al., 2021; Wiles et al., 2021). Meanwhile, recent studies show that neural architectures and pretraining methods have a large impact on model robustness to distribution shifts (Radford et al., 2021; Wiles et al., 2021). For example, vision transformers are more robust to texture and style shifts than ResNet-based models (Zhang et al., 2022), whereas the latter are superior to transformer-based models on dense image classification tasks (Liu et al., 2022). In terms of pretraining, using pretraining datasets other than ImageNet-1k improves generalization performance in one test domain, yet leads to performance degradation in another (Kim et al., 2022).
These findings are in line with the No Free Lunch (NFL) theorem (Wolpert, 1996), which suggests that no single model can always perform better than any other model without substantive information about the targeted problem. In DG, we usually have very limited information about the test domain, so we are particularly likely to encounter this challenge. Inspired by these attempts, in this paper, we conduct a fine-grained study on the relationship between pretrained models and distribution shifts. From both empirical and theoretical evidence, we show that there is no free lunch in terms of pretraining for domain generalization, i.e., there is no single best pretrained model across shifting test distributions. Specifically, 283 pretrained models with different network architectures, pretraining datasets, and learning objectives are compared for their generalization performance under different distribution shifts. The results reveal that pretrained models without fine-tuning generalize well to some unseen domains, but none of these models dominates in all unseen distributions. Furthermore, our theoretical analysis indicates that the OOD generalization error is determined by the fitness between the model (varying w.r.t. the network architecture and model weights) and the test distribution. For any network architecture with a fixed training distribution, such as pretrained models (Iandola et al., 2014; He et al., 2016; 2021a), it is always possible to find a beneficial or detrimental test distribution yielding a small or large generalization error, respectively. Motivated by these findings, we propose an alternative DG paradigm that leverages pretrained models with different network architectures and shifting training distributions, upon which we match the most suitable pretrained models to each OOD sample.
As shown in Figure 3, our paradigm (specialized model-sample matching for domain generalization, SIMPLE) first adopts a simple label adapter that projects the label space of the pretraining domain to that of the unseen domains, where the adapter is shared by pretrained models from the same pretraining domain. Then, a matching network, which is aware of model specialty, selects a set of proper pretrained models and aggregates them to conduct the prediction for each OOD sample. Notably, this promising alternative exhibits significant performance improvements, averaging 3.9% over the existing SOTA results, with gains of up to 12.2% on single datasets, and a significant increase in training efficiency. To summarize, this work makes the following contributions:
• We theoretically and empirically analyze the generalization of pretrained models under shifting unseen test distributions, revealing that no free lunch exists, which motivates our solution of model-sample matching.
• Complementary to traditional DG solutions, we propose a novel DG paradigm that directly leverages pretrained models without fine-tuning; it significantly improves DG performance on the mainstream benchmark over other strong baselines.
• Besides the performance gain, our method is also more efficient since it does not follow the common fine-tuning approach, shedding new light on using pretrained models in DG tasks.

2. NO FREE LUNCH IN PRETRAINING FOR DOMAIN GENERALIZATION

In this section, we investigate whether there exists a free lunch in pretraining for DG, that is, whether we can find one single best pretrained model that generalizes across all distribution shifts. To this end, we first conduct an empirical analysis of the generalization ability of pretrained models over shifting distributions in Section 2.1, followed by a theoretical analysis in Section 2.2.

2.1. GENERALIZABILITY ANALYSIS OF THE PRETRAINED MODELS

We here analyze the generalization ability possessed by different pretrained models. Note that existing DG methods generally adopt a specific ImageNet-pretrained model (e.g., ResNet-50), which has been shown to be insufficient for generalization (Kumar et al., 2021; Kim et al., 2022). Thus, for a comprehensive analysis, we incorporate 283 diverse pretrained models composed of diverse combinations of network architectures, pretraining datasets, objectives, and algorithms. Detailed information on all these models and further experimental settings is in Appendix A.4. For efficient adaptation of pretrained models from pretraining domains to unseen domains, we propose to train only a label space adapter on top of the pretrained models without modifying any pretrained parameters, which is remarkably lightweight and will be elaborated in detail in Section 3.2. Takeaway 1: Pretrained models possess decent generalization ability for some OOD samples. By grafting such a lightweight label space adapter, we find that the generalization performance of a pretrained model is already promising in some cases. As a concrete example, given a fixed DenseNet-121 model (Iandola et al., 2014) pretrained on ImageNet, we employ an adapter that converts its predictions from the original ImageNet label space to that of OfficeHome (a dataset of the DG benchmark) (Venkateswara et al., 2017), with the adapter trained on the source domains. This combination achieves an average accuracy of 78.3% on target domains, which is 5.9% higher than SOTA, as detailed in Section 4.3. Takeaway 2: No dominant pretrained model across unseen domains. Though pretrained models possess some generalization ability in some cases, their generalization performance relates to the specific unseen distributions. The left part of Figure 1 shows the relative performance of the pretrained models, with the label adapter, evaluated in different domains. Each domain represents a different data distribution.
As can be seen, pretrained models vary greatly in performance across test domains, without any single model being dominant in all domains. Figure 1: Classification performance comparison of pretrained models on different domains and different classes of the TerraIncognita dataset (Beery et al., 2018). For clarity of presentation, only partial results are shown; the complete results can be found in Appendix A.4. Takeaway 3: Pretrained models exhibit more diverse performance at finer-grained levels. We further examine whether performance divergence also exists at a finer-grained level, such as the class level. The right part of Figure 1 presents the relative model performance on 10 classes of the TerraIncognita dataset. Similar to the domain level, varying model performance among different classes is also noticeable. Moreover, there exists an even more pronounced divergence in model performance at the finer class level, as evidenced in detail in Appendix A.4. This supports the necessity of incorporating pretrained models and being aware of their specialty at a fine-grained level, which motivates us to investigate and exploit the matching metric between models and test samples.

2.2. THEORETICAL ANALYSIS OF NO FREE LUNCH FOR DG

This section provides a theoretical analysis to support the findings in Section 2.1. To alleviate the generalization difficulties associated with the train-test mismatch, mainstream DG methods seek domain-invariant features that can generalize beyond the training distribution. However, there are several issues to consider. First, domain-invariant features learned from source domains can still be biased towards the source domains and thus have limited performance on unseen domains (Cha et al., 2022). Second, domain-specific information has also been found to enhance DG, as it may be closely related to the sample labels and help generalize in certain unseen domains (Bui et al., 2021). These observations suggest that existing DG methods cannot guarantee that a model generalizes across distinct unseen target domains; the reason lies in the limitation of our knowledge of unseen domains, which is in line with the NFL theorem. The analysis of the OOD generalization of kernel methods also offers a relevant insight: a shift in the test distribution may help or hurt generalization (Canatar et al., 2021). Theorem 1. (Informal; OOD Kernel Generalization; Canatar et al. (2021)) Given the kernel $K(x, x')$ with Mercer decomposition $K(x, x') = \Phi(x)^\top \Lambda \Phi(x')$, suppose the training data are generated i.i.d. from the distribution $p(x)$, the target function is $y = \bar{a}^\top \Phi(x)$, and the training loss is kernel regression with ERM. The generalization error on a test distribution $\tilde{p}(x)$ is given by $E_g = E_g^{\text{matched}} + \kappa\,\bar{a}^\top (P\Lambda + \kappa I)^{-1} O' (P\Lambda + \kappa I)^{-1} \bar{a}$, where $O_{ij} = \int \mathrm{d}x\, \tilde{p}(x)\, \phi_i(x)\, \phi_j(x)$ and $O' = O - \frac{1-\gamma'}{1-\gamma} I$. Here $E_g^{\text{matched}}$ is the generalization error when the training and test distributions are matched (i.e., the in-distribution error), $P$ is the number of training samples, $\kappa$ and $\gamma$ are constants depending on $\Lambda$, and $\gamma'$ is a constant depending on $\Lambda$ and $O$. Theorem 1 is detailed in Appendix A.5. Remark 1.
(Interpretations of Theorem 1) Theorem 1 shows that $E_g$ takes the form $E_g = E_g^{\text{matched}} + v^\top O' v$, where the second term is due to distribution shift. Note that if a pretrained model is kept frozen to serve as a static feature extractor for a subsequent trainable linear classifier, the pretrained model acts as a kernel and Theorem 1 applies. Remark 2. (Generalization depends on the fitness between model and test distribution) The eigenvalues of the matrix $O$ depend on the alignment between the test distribution $\tilde{p}(x)$ and the kernel basis $\Phi$, which is determined by the model (network architecture and model weights). The matrix $O'$ may have negative eigenvalues when $\tilde{p}(x)$ and $\Phi$ are well matched, in which case OOD generalization is even better than i.i.d. generalization. On the contrary, when $\Phi(x)$ is fixed, $\tilde{p}(x)$ can be adversarially chosen such that $O'$ has large positive eigenvalues and the network fails to generalize to OOD data. Remark 3. (No free lunch for a single model in DG) A major focus of DG is tackling covariate shift, where $\tilde{p}(x)$ can be set arbitrarily as long as $\tilde{p}(y|x) = p(y|x)$. Under covariate shift, no single pretrained model will outperform all other models for every $\tilde{p}(x)$, implying a no-free-lunch theorem in DG. The theoretical analysis above raises a natural question: should the focus not rather be on matching pretrained models and test distributions based on their fitness? Given that a pretrained model with a fixed network architecture and training distribution always faces beneficial or detrimental test distributions, we suggest incorporating more pretrained models, covering diverse network architectures and shifting training distributions, to facilitate generalization. Depending on the fitness of the pretrained models to the target distribution, it is then possible to match OOD samples to appropriate models, thus bypassing the single-model limitation indicated by the NFL theorem.
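The quadratic form in Remark 1 can be written out explicitly under the notation of Theorem 1; the following is a compact restatement of the discussion above, not an additional result:

```latex
% Shift term of Theorem 1 in the quadratic form of Remark 1:
E_g \;=\; E_g^{\text{matched}} + v^{\top} O' v,
\qquad
v = \sqrt{\kappa}\,(P\Lambda + \kappa I)^{-1}\bar{a},
\qquad
O' = O - \tfrac{1-\gamma'}{1-\gamma}\, I .
% Negative eigenvalues of O' (test distribution well aligned with the
% kernel basis \Phi) can make the shift term negative, i.e., OOD error
% below the in-distribution error; adversarially chosen test
% distributions yield large positive eigenvalues and poor generalization.
```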

3. THE PROPOSED FRAMEWORK FOR DOMAIN GENERALIZATION

In this section, we present a novel framework, namely specialized model-sample matching for domain generalization (SIMPLE), that reformulates DG as a matching problem, in light of the analysis in Section 2. We first introduce the overall framework in Section 3.1. Then, we elaborate on the design of specialty-aware model-sample matching and ensemble in Section 3.2, followed by the learning algorithm in Section 3.3.

3.1. THE OVERALL FRAMEWORK

This section provides an overview of the SIMPLE framework, as shown in Figure 3. Following our analysis in Section 2, a single pretrained model is not sufficient to accommodate diverse OOD samples. Intuitively, more models need to be incorporated and selected to solve the problem. Analogous to recommending items (models) to users (samples) from a vast item set in a recommender system, we formulate DG as a model-sample matching problem, with a model pool containing various models and a model dispatcher responsible for assigning these models to OOD samples appropriately. Preliminaries. DG aims to tackle the shift of data distribution among different domains by transferring knowledge from seen to unseen domains. Unlike domain adaptation, samples from the unseen target domain(s) are inaccessible in DG. For a domain, its input and label spaces can be denoted as $\mathcal{X} \in \mathbb{R}^d$ and $\mathcal{Y} \in \mathbb{R}^C$, respectively, where $d$ is the input dimension and $C$ is the number of classes in $\mathcal{Y}$; its observed samples form a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ with $N$ samples. Consider that we have $S$ source domains $\mathcal{D}_s = \{D_1, \ldots, D_S\}$ and $T$ target domains $\mathcal{D}_t = \{D_1, \ldots, D_T\}$ that share the same label space but have different joint distributions on $\mathcal{X} \times \mathcal{Y}$. Pretrained model pool. As discussed in Section 2.2, the OOD generalization error depends on the fitness between the pretrained model and the test distribution, which is unknown in DG. Therefore, building a pool of diverse pretrained models is crucial for DG. Note that with abundant pretrained models being released, it is easy to construct a model pool by simply downloading pretrained models from public repositories (Wightman, 2019), without further effort such as retraining. We collect extensive pretrained models to serve SIMPLE, as detailed in Appendix A.3. We denote a model pool with $K$ models as $\{f_k\}_{k=1}^{K}$, where each $f_k$ is parameterized by $\theta_k$. Label space adapter.
As the label space of the pretraining domains generally differs from the one shared by the source and target domains, label space adapters are required to make them consistent. The adapter is a linear mapping between these two label spaces (pretraining → source/target), shared among pretrained models from the same pretraining domain (e.g., we use only one adapter for all models trained on ImageNet-1k). Specifically, given a pretrained model $f_k$, we parameterize the adapted model $h_\psi(f_k(\cdot\,; \theta_k))$ by $\theta'_k = [\psi; \theta_k]$, where $\psi$ denotes the parameters of the adapter, $h_\psi \in \mathcal{A}: \mathbb{R}^{C_o} \to \mathbb{R}^{C}$, and $C_o$ is the dimension of the label space of the original pretraining dataset of $f_k$. Through the label adapter, the output of the pretrained model $f_k$ can be transformed and adapted to target domains as $\hat{y}_{ik} = h_\psi(f_k(x_i))$, without fine-tuning $\{\theta_k\}_{k=1}^{K}$. This largely reduces the cost of adapting the pretrained models to new domains. By the above construction, SIMPLE differs from existing methods in two aspects. First, most existing DG methods strive to fine-tune a specific pretrained model, which cannot generalize well to a variety of unseen domains, as analyzed in Section 2. Second, the conventional options for utilizing pretrained models are fine-tuning and linear probing, as shown in Figure 2 (A) and (B), respectively. In detail, fine-tuning the pretrained model is costly and can hurt generalization ability (Wortsman et al., 2022), while linear probing (i.e., replacing the last layer of the pretrained model and retraining it) may achieve better OOD accuracy than fine-tuning the whole model (Kumar et al., 2021). However, the linear probing layer is not transferable across models, making this approach costly when a large number of models need to be adapted to new target domains.
Our label space adaptation, shown in Figure 2 (C), is instead remarkably lightweight since it is shared by all models from the same pretraining domain, as also empirically verified in Section 4.2. As shown in Section 2, no individual model performs best across different tasks; thus, appropriate models need to be dispatched to address each specific task. We define a model dispatcher $g_\rho$, with parameters $\rho$, that takes a sample $x_i$ as input and determines the weight $w_k$ assigned to model $f_k$ for that sample, with $\sum_{k=1}^{K} w_k = 1$. Here $w_k$ represents an estimate of the relative match between the model $f_k$ in the model pool and the sample.
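As a concrete illustration of the shared label space adapter described above, here is a minimal NumPy sketch; the class and parameter names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class LabelSpaceAdapter:
    """Linear map h_psi from the pretraining label space (C_o classes)
    to the target label space (C classes). One adapter is shared by all
    models pretrained on the same domain; the models themselves stay
    frozen, and only W and b are trained."""

    def __init__(self, c_orig, c_target, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, size=(c_orig, c_target))
        self.b = np.zeros(c_target)

    def __call__(self, logits):
        # logits: (..., C_o) output of a frozen pretrained model f_k
        return logits @ self.W + self.b

# e.g. one adapter for every ImageNet-pretrained model in the pool,
# targeting OfficeHome's 65 classes (1000 -> 65).
adapter = LabelSpaceAdapter(c_orig=1000, c_target=65)
adapted = adapter(np.zeros(1000))
assert adapted.shape == (65,)
```

Because the adapter is shared, its cost stays constant no matter how many models from the same pretraining domain are added to the pool.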


Based on the constructed model pool and the model dispatcher, the prediction for each test sample is an ensemble of the predictions of the dispatched models. The final prediction is given by $\hat{y}_i = \sum_{k=1}^{K} w_k \cdot h_\psi(f_k(x_i))$. Finally, we can define a population loss $\mathcal{E}_D(\psi, \rho) = \frac{1}{|\mathcal{D}|} \sum_{j=1}^{|\mathcal{D}|} \mathbb{E}_{x_i \sim D_j}[\,l(\hat{y}_i, y_i)\,]$ over a given set of domains $\mathcal{D}$. The objective is to minimize the task-specific loss $l$ (e.g., cross-entropy loss for classification) over both the source domains $\mathcal{D}_s$ and the target domains $\mathcal{D}_t$ by minimizing only the empirical risk $\hat{\mathcal{E}}_{\mathcal{D}_s}(\psi, \rho)$ w.r.t. $\psi$ and $\rho$. The performance on the target domains then measures the generalization ability.
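The weighted ensemble prediction above can be sketched as follows (NumPy; `models`, `adapter`, and `dispatcher` are illustrative stand-ins for the frozen pool, the label adapter, and the matching network):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ensemble_predict(x, models, adapter, dispatcher):
    """Final prediction y_hat = sum_k w_k * h_psi(f_k(x)), where the
    ensemble weights w are produced by the dispatcher for this sample."""
    w = softmax(dispatcher(x))                         # (K,), sums to 1
    preds = np.stack([adapter(f(x)) for f in models])  # (K, C)
    return w @ preds                                   # (C,)

# Toy usage: two "models", identity adapter, equal dispatcher scores.
models = [lambda x: np.array([1.0, 0.0]), lambda x: np.array([0.0, 1.0])]
y_hat = ensemble_predict(None, models, lambda z: z, lambda x: np.zeros(2))
assert np.allclose(y_hat, [0.5, 0.5])
```

Only the adapter and dispatcher parameters would receive gradients during training; the model outputs are treated as fixed inputs.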

3.2. SPECIALTY-AWARE MODEL-SAMPLE MATCHING AND ENSEMBLE

Following the paradigm introduced in Section 3.1, this section elaborates on the model dispatcher, which consists of a model-sample matching network and a specialty-aware ensemble layer. Model-sample matching. To capture the fitness of pretrained models towards OOD samples, a network is proposed to learn the model-sample matching function. We employ a simple recommendation algorithm (though not restricted to it), neural collaborative filtering (NCF) (He et al., 2017), as a proof of concept for our idea. Following NCF, the matching scores $m_i = [m_{i1}, \ldots, m_{iK}] \in \mathbb{R}^{K}$ of a sample are computed from a shared sample embedding and the model embeddings. Specialty-aware ensemble. After ranking the models by the matching metric, the subsequent goal is to conduct the prediction by selecting the most proper models. We argue that an individual model's prediction may not cover most of the target distribution; thus, instead of utilizing only the top-1 matched pretrained model, we apply a specialty-aware model ensemble to derive the final prediction for each sample. This is also motivated by the facts that ensembles have shown improved robustness (Lakshminarayanan et al., 2016) and that model specialty plays a key role in prediction across tasks (Gontijo-Lopes et al., 2021), as further elaborated in Section 3.3. Specifically, we normalize the matching scores by the softmax function $\mathrm{Softmax}(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$ to highlight the relative competition among the pretrained models. That is, the ensemble weights $w_i = [w_1, \ldots, w_k, \ldots, w_K] \in \mathbb{R}^{K}$ are computed as $w_i = \mathrm{Softmax}(m_i)$. To save inference time, we further select the models with the top-$k$ ($k < K$) matching scores before running their inference for the given sample. In this paper, we set $k = 6$; a sensitivity analysis on $k$ is given in Section 4.5. The overall inference cost is small, as illustrated in Section 4.2.
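The selection step above — softmax normalization of the matching scores followed by keeping only the top-k matched models — can be sketched as follows (a simplified NumPy sketch; the function name is ours, not the paper's):

```python
import numpy as np

def dispatch_top_k(match_scores, k=6):
    """Select the k best-matched pretrained models for a sample and
    renormalize their softmax weights, so only k forward passes are
    needed at inference time (the paper sets k = 6)."""
    top = np.argsort(match_scores)[-k:]              # indices of top-k models
    z = match_scores[top] - match_scores[top].max()  # stable softmax
    w = np.exp(z) / np.exp(z).sum()
    return top, w

scores = np.array([0.1, 3.0, 2.0, -1.0, 5.0])  # toy matching scores m_i
top, w = dispatch_top_k(scores, k=2)
assert set(top.tolist()) == {1, 4}
assert abs(w.sum() - 1.0) < 1e-9
```

The renormalization over the selected scores keeps the retained weights summing to one, so the ensemble prediction remains a convex combination of the chosen models' outputs.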

3.3. OBJECTIVE AND LEARNING FOR SPECIALIZED MODEL-SAMPLE MATCHING

Loss for ensemble learning. The classification loss $\mathcal{L}_{ens}(\psi, \rho) = l(\hat{y}_i, y_i)$ is optimized for the likelihood of the final ensemble output, updating both the matching network and the adapter. Loss for label space adapter learning. To train the general label space adapter $h_\psi$ for all pretrained models, we incorporate the weighted classification losses of the adapted predictions of the pretrained models to update the shared adapter, defined as follows: $\mathcal{L}_{adapter}(\psi) = \sum_{k=1}^{K} w_k \cdot l(\hat{y}_{ik}, y_i) = \sum_{k=1}^{K} w_k \cdot l(h_\psi(f_k(x_i)), y_i)$. (1) Loss for model specialty learning. The model dispatcher generates ensemble weights to aggregate multiple model predictions for each sample, where models vary significantly in their performance over samples, as indicated in Section 2. Thus, we expect to assign larger weights to the models with higher sample-level specialties to achieve the best utilization of the pretrained models. We use the likelihood of the ground-truth label $p(y_i \mid x_i; \theta'_k)$ on the $i$-th sample produced by model $f_k$ as the evaluation metric of its sample-level model specialty.
[Tables: per-dataset DomainBed accuracies of baseline algorithms (MMD, C-DANN, ERM, Fish, LP-FT, MIRO, SWAD, ...), and a training-cost comparison showing SIMPLE trains 0.9M parameters with a 1000× speedup versus MIRO + SWAD on ResNet-50 and RegNetY-16GF backbones.]
That is, we try to minimize the estimation risk between the estimated model specialty and the ground truth, i.e., between $w_k$ and $p(y_i \mid x_i; \theta'_k)$, as $\mathcal{L}_{specialty}(\rho) = -\sum_{k=1}^{K} \left[ p(y_i \mid x_i; \theta'_k) \cdot \ln(w_k) + (1 - p(y_i \mid x_i; \theta'_k)) \cdot \ln(1 - w_k) \right]$. $\mathcal{L}_{specialty}$ is used to optimize the model-sample matching network and the ensemble layer so that they jointly act as a specialty-aware model dispatcher. Therefore, the total loss to minimize is $\mathcal{L} = a_e \mathcal{L}_{ens}(\psi, \rho) + a_d \mathcal{L}_{adapter}(\psi) + a_s \mathcal{L}_{specialty}(\rho)$, where $a_e$, $a_d$, and $a_s$ are loss weights. It is worth noting that the only parameters updated are $\{\psi, \rho\}$, each of which is lightweight compared to the pretrained models, which remain fixed in our method yet are fine-tuned in previous works.
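The specialty loss and the total objective above can be sketched in NumPy as follows, assuming `w` holds the ensemble weights and `p_true` the per-model likelihoods of the ground-truth label (names are illustrative):

```python
import numpy as np

def specialty_loss(w, p_true):
    """L_specialty: per-model binary cross-entropy between the ensemble
    weight w_k and the model's likelihood p(y_i | x_i) of the true
    label, pushing the dispatcher to up-weight models that are
    accurate on this particular sample."""
    eps = 1e-12
    w = np.clip(w, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(p_true * np.log(w) + (1.0 - p_true) * np.log(1.0 - w))

def total_loss(l_ens, l_adapter, l_spec, a_e=1.0, a_d=1.0, a_s=1.0):
    # L = a_e * L_ens + a_d * L_adapter + a_s * L_specialty
    return a_e * l_ens + a_d * l_adapter + a_s * l_spec

# If both models receive weight 0.5 but only model 0 predicts the true
# label with certainty, the loss is -ln(0.5) - ln(0.5) = 2 ln 2.
loss = specialty_loss(np.array([0.5, 0.5]), np.array([1.0, 0.0]))
assert abs(loss - 2.0 * np.log(2.0)) < 1e-9
```

In training, only the adapter and dispatcher parameters would be updated by gradients of this total loss; the pretrained models stay frozen.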

4. EXPERIMENTS

4.1. EVALUATION PROTOCOL

We conduct experiments on the DomainBed suite (Gulrajani & Lopez-Paz, 2020), which provides like-for-like comparisons between algorithms with a standard evaluation, as detailed in Appendix A.6. Datasets. We experiment on 5 real-world benchmark datasets: PACS (4 domains, 9,991 samples, 7 classes) (Li et al., 2017), VLCS (4 domains, 10,729 samples, 5 classes) (Fang et al., 2013), OfficeHome (4 domains, 15,588 samples, 65 classes) (Venkateswara et al., 2017), TerraIncognita (4 domains, 24,778 samples, 10 classes) (Beery et al., 2018), and DomainNet (6 domains, 586,575 samples, 345 classes) (Peng et al., 2019). Baselines. We compare SIMPLE with strong DG baselines, including the state of the art. General DG methods with elaborate learning algorithms include ERM (Vapnik, 1998), CORAL (Sun & Saenko, 2016), MLDG (Li et al., 2018a), MMD (Li et al., 2018b), DANN (Ganin et al., 2016), C-DANN (Li et al., 2018c), Fish (Shi et al., 2021), LP-FT (Kumar et al., 2021), and MIRO (Cha et al., 2022). Some other works incorporate ensemble learning, including SWAD (Cha et al., 2021) and EoA (Arpit et al., 2021). MIRO has combined SWAD into their approach (MIRO + SWAD), with the result being the current SOTA on DomainBed. More details are in Appendix A.7. Model pool composition. We collect 283 pretrained models to compose the model pools, described in detail in Appendix A.3. Based on their pretraining domains, we divide them into a pure ImageNet-pretrained model pool (Model Pool-A) and one with pretrained models from different datasets (Model Pool-B). We denote SIMPLE using Model Pool-B as SIMPLE+ to distinguish it from the one using Model Pool-A.

4.2. EVALUATION RESULTS ON DOMAINBED

This section presents the evaluation results on the DomainBed suite, where we compare SIMPLE with general DG algorithms to verify its effectiveness and efficiency. Specifically, we compare algorithms using ImageNet-pretrained models only (e.g., SIMPLE using Model Pool-A) and using pretrained models from diverse pretraining datasets (e.g., SIMPLE+ using Model Pool-B), respectively. (RQ2) Is it necessary to assemble pretrained models according to their specialty? The analysis in Section 2.1 shows that pretrained models possess certain generalization ability. Thus, a natural approach is to simply aggregate pretrained models without considering their relative specialty on different samples, e.g., randomly selecting k models for each sample and averaging their outputs as the final prediction (Lakshminarayanan et al., 2017). The results of the random ensemble are shown in Table 3, illustrating that the random ensemble lags behind SIMPLE by a large margin. This verifies the necessity of selecting and ensembling pretrained models based on their specialty over samples (RQ2).

4.4. PRACTICAL TIPS FOR CONSTRUCTING MODEL POOLS

In this section, we investigate the impact of different model pool properties on generalization performance, to provide useful guidance for composing and utilizing model pools. The corresponding results are presented below. Tip 3: Increasing model diversity matters more than increasing model pool size. On top of Model Pool-A-Small, Model Pool-A incorporates 224 additional pretrained models from the ImageNet pretraining domain, while Model Pool-B-Small adds only 2 more models pretrained on YFCC100M (Radford et al., 2021). Nevertheless, SIMPLE performs better with Model Pool-B-Small, suggesting that the diversity of the model pool may be more important for generalization. These empirical observations, namely that more models and more diversity can improve generalization performance, again suggest that there is no free lunch for DG; thus, different pretrained models are needed to address it. SIMPLE provides a way to realize this goal both effectively and efficiently by (1) adapting pretrained models to unseen domains via label space adaptation at low cost and (2) dispatching the best-fit models from a large model pool to handle each OOD sample.

4.5. SENSITIVITY ANALYSIS

We conduct a sensitivity analysis for $k$ and the sample feature extractor $f_{k_0}$, detailed in Appendix A.9. The main findings are: (1) the generalization improvement brought by increasing $k$ has a marginal effect, and SIMPLE outperforms the SOTA baseline (MIRO + SWAD) even when $k = 2$; and (2) SIMPLE is robust to the selection of feature extractors.

5. CONCLUSION

Despite recent studies suggesting that network architectures and pretraining practices affect generalization ability to a large extent, no work has explored using these easy-to-obtain pretrained models to address domain generalization. Our work provides a comprehensive analysis of generalizing pretrained models to unseen domains and reveals that there is no free lunch of pretrained models in DG. Based on that, we propose a novel DG paradigm that leverages fixed pretrained models and dispatches them to OOD samples based on their matching metric to the target task. Extensive evaluations show that our proposal is a promising alternative for DG, with better generalization performance and significantly higher training efficiency compared to existing DG methods.

A.1 RELATED WORK

We here only review studies that are mainstream in DG and that relate to our method. For a more comprehensive survey, please refer to Wang et al. (2022). Data manipulation. DG arises from the lack of sufficient data to feed machine learning models, so they are biased to the training distribution, resulting in an inability to work well on the test distribution (Wang et al., 2022). Thus, various methods resort to manipulating the training samples to simulate unseen target domains, mainly via two techniques: data augmentation (Shankar et al., 2018) and data generation (Li et al., 2021). Data augmentation has been widely used in general machine learning models to enhance their generalization ability by avoiding overfitting (Shorten & Khoshgoftaar, 2019). Inspired by its importance, Tobin et al. (2017) propose domain randomization to generate diverse training data from simulated environments. Representation learning. In another promising direction, some DG studies attempt to extract domain-invariant features that can generalize to unseen domains (Ben-David et al., 2006), or to separate domain-shared and domain-specific parts of the features for generalization (Khosla et al., 2012).
Motivated by the transferability of domain-invariant features across varied domains and the expectation of their generalization to unseen domains, such methods try to obtain a feature space that is invariant to domain labels. Li et al. (2018b) first introduce adversarial training in DG, letting the generator try to fool the discriminator about the domain labels of images and thus produce domain-invariant features. Several works (Shao et al., 2019; Rahman et al., 2020; Wang et al., 2020b) follow this direction. In the line of extracting domain-invariant features, other works explicitly align the feature distributions learned from source domains, with the differences measured by Wasserstein distance (Zhou et al., 2020), maximum mean discrepancy (Wang et al., 2018; 2020a), or mean and variance (Peng et al., 2019). At a meta level, Balaji et al. (2018) consider learning a regularization function on the classifier to avoid biasing towards domain-specific information. In contrast, for robust representation learning, some works aim to separate domain-invariant variables from the domain-specific ones in the learned features (Niu et al., 2015; Ilse et al., 2020; He et al., 2021b). Ensemble learning. Ensemble learning aims at achieving better performance than any individual model alone. As a well-known technique to improve generalization performance, it has also been explored in DG. One general idea of using ensemble learning in existing DG methods is to utilize the relationship between unseen domains and source domains. Specifically, Mancini et al. (2018) train an individual classifier for each source domain and one additional classifier to predict the probability that a sample belongs to each domain, and aggregate their predictions with the corresponding weights. In addition, Segu et al. (2020) propose to maintain domain-specific batch normalization layers, which are weighted for aggregation at inference. Zhou et al.
(2021) instead trains only domain-specific classifier heads while letting them share the same feature extractor. SWAD (Cha et al., 2021) and EoA (Arpit et al., 2021) instead use model ensembling and weight ensembling directly to improve generalization. Though promising for enhancing performance, ensemble learning is criticized for its computational cost, which hinders practical application (Wang et al., 2022).

Relation to existing work. The main difference between existing DG works and SIMPLE resides in the fact that those efforts seek a single optimal model that can handle all distribution shifts. Even DG algorithms that use ensemble learning are trying to find an optimal ensemble with a flatter loss landscape and better generalization ability (Cha et al., 2021). To this end, existing works use a specific pretrained model for initialization, as this performs better than training from scratch (Wortsman et al., 2022), and then fine-tune it with training algorithms such as data manipulation, robust representation learning, or ensemble learning as discussed above. In contrast, recent studies have shown that such a specific pretrained model may not be sufficient to solve distribution shifts, and that pretraining strategies need to be chosen for different types of distribution shifts. To the best of our knowledge, SIMPLE is the first work to address DG by directly using diverse pretrained models without fine-tuning, reformulating DG as a model-sample matching problem. Moreover, SIMPLE does not fine-tune these pretrained models but uses domain-level label space adaptation, which significantly reduces the cost of training and adaptation.

A.2 MORE DISCUSSION ABOUT SIMPLE

Towards a robust algorithm that can effectively and efficiently leverage extensive pretrained models for DG, certain challenges remain in our paper.
Currently, SIMPLE cannot directly incorporate new pretrained models into the model pool without retraining the matching network. Although retraining is lightweight, given the rapid development of pretrained models, a more straightforward approach is needed. We leave this problem, i.e., the cold-start problem that has long been studied in the recommendation field, as future work.

A.3 PRETRAINED MODEL POOLS

This section presents the pretrained models used in SIMPLE. With more and more pretrained models being published, it is straightforward to build a pretrained model pool consisting of several public pretrained models for direct adaptation to novel domains. In particular, these models are categorized according to their pretraining domain into an ImageNet-pretrained model pool (Model Pool-A) and a pool (Model Pool-B) that additionally contains models pretrained on datasets other than ImageNet.

Model Pool-A. DG methods commonly use ImageNet-pretrained models to initialize the model weights for further fine-tuning (Kim et al., 2021). To present fair comparisons with those methods, we first construct an ImageNet-pretrained model pool (Model Pool-A). Specifically, we collect 244 models with diverse network architectures and learning objectives from popular third-party repositories (see the footnotes for their sources). Note that models pretrained on the same dataset, i.e., ImageNet, share the same label adapter transforming the vanilla model outputs to the target label space. Therefore, for Model Pool-A, only one label space adapter is trained, transforming the original predictions over the 1,000 ImageNet classes into predictions over the target class space.

Model Pool-B. Previous studies also found that an ImageNet-pretrained model (e.g., ResNet-50) is not sufficient for generalization (Kumar et al., 2021; Kim et al., 2022), suggesting the need for models pretrained on different datasets. Therefore, we further expand Model Pool-A by incorporating models trained on datasets including ImageNet-21k, YFCC100M (Radford et al., 2021), and Instagram 3.6B (Singh et al., 2022), obtaining Model Pool-B. In total, we add 39 models, whose weights are obtained from the same sources or from their public repositories. Thus, Model Pool-B contains 283 pretrained models with varied network architectures, learning objectives, and pretraining datasets.
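The shared label space adapter described above, one per pretraining domain, can be sketched minimally as follows. This is an illustrative sketch only: the tiny dimensions (5 pretraining classes mapped to 3 target classes instead of ImageNet's 1,000), the random initialization scale, and the plain single-sample SGD loop are all assumptions made for brevity; the real adapter maps the full ImageNet output space to the target class space and is trained on all source-domain samples.

```python
import math
import random

random.seed(0)

# Hypothetical dimensions: a "pretraining" label space of 5 classes
# mapped to a target label space of 3 classes.
N_PRETRAIN, N_TARGET = 5, 3

# The adapter is a single linear transformation W followed by a softmax;
# it is the only trainable component -- the pretrained models stay frozen.
W = [[random.gauss(0, 0.1) for _ in range(N_TARGET)] for _ in range(N_PRETRAIN)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def adapt(pretrain_probs):
    """Map a frozen model's pretraining-space prediction to the target space."""
    logits = [sum(pretrain_probs[i] * W[i][j] for i in range(N_PRETRAIN))
              for j in range(N_TARGET)]
    return softmax(logits)

def sgd_step(pretrain_probs, target_class, lr=0.5):
    """One SGD step on the cross-entropy of the adapted prediction."""
    p = adapt(pretrain_probs)
    for j in range(N_TARGET):
        grad = p[j] - (1.0 if j == target_class else 0.0)
        for i in range(N_PRETRAIN):
            W[i][j] -= lr * grad * pretrain_probs[i]

# Train the adapter on one synthetic frozen-model output.
x = softmax([random.gauss(0, 1) for _ in range(N_PRETRAIN)])
before = adapt(x)[1]
for _ in range(50):
    sgd_step(x, target_class=1)
after = adapt(x)[1]
print(before, after)  # probability of the target class increases
```

Because the adapter is shared by all models from the same pretraining domain, adding another ImageNet-pretrained model to the pool requires no new adapter parameters.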
To analyze the impact of model pool size on performance, we take a subset of the models from these two pools to construct Model Pool-A-Small and Model Pool-B-Small. Specifically, Model Pool-A-Small incorporates architectures including AlexNet (1) (Krizhevsky et al., 2012), SqueezeNet (Iandola et al., 2016), and MAE-ViT-Base/Large/Huge (3) (He et al., 2021a). On the other hand, several DG algorithms also use models pretrained on other datasets, such as IG-1B (Arpit et al., 2021) and ILSVRC12 (Thomas et al., 2021). Therefore, we build Model Pool-B-Small, which contains two more CLIP models (Radford et al., 2021), ViT-B/16 and ViT-B/32, trained on a subset of the YFCC100M dataset of roughly the same size as ImageNet. A detailed list of pretrained models is given in Table 7, with the sources of the models, the dimensions of their outputs, and their FLOPs (floating-point operations).

A.4 EMPIRICAL EVIDENCE OF NO FREE LUNCH

In this section, we elaborate on the empirical analysis in Section 2 and provide more experimental details and insights.

Transfer and performance measurement of pretrained models. As discussed in Section 2, instead of fine-tuning or conducting linear probing to adapt the pretrained models into predictive models for unseen target domains, we only train a label space adapter that learns the mapping function $h_\psi$. We train this shared adapter $h_\psi$ with the empirical loss $\hat{E}_{\mathcal{D}_s}(\psi, \rho)$ without fine-tuning the pretrained model parameters $\{\theta_k\}$, as formally described in Section 3.2. With the trained adapter, we use the likelihood of the ground-truth label $p(y_i \mid x_i; \theta'_k)$ on the $i$-th sample produced by each adapted model, which indicates the model's confidence in the ground-truth label $y_i$, with $\sum_{y \in \mathcal{Y}} p(y \mid x_i; \theta'_k) = 1$. We utilize this likelihood as the evaluation metric of sample-level model specialty.

Experimental settings. First, we analyze the specialty distribution of each pretrained model from an aggregated view (i.e., domains and classes, respectively), and verify whether there exists a dominant pretrained model that generalizes best across different unseen domains. We calculate the domain-level model specialty as the summation of the sample-level specialty over all domain samples, $\sum_{(x_i, y_i) \sim \mathcal{D}} \log p(y_i \mid x_i; \theta'_k)$, on TerraIncognita (Beery et al., 2018) with four domains. To reflect relative model performance, we perform min-max normalization of the model specialty values within the same domain. The complete results are shown in Figure 4, with only partial results presented in Section 2 for clearer presentation. Then, we further examine whether performance divergence also exists at the finer-grained class level. Similarly to the domain level, Figure 4 presents the relative model performance on 10 classes in TerraIncognita.
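The domain-level specialty and its min-max normalization can be sketched as follows, with synthetic stand-in likelihoods in place of the adapted models' actual outputs (the model names and values are hypothetical):

```python
import math
import random

random.seed(0)

# Synthetic stand-in: likelihood p(y_i | x_i) of the ground-truth label,
# for each of 4 hypothetical adapted models on 100 samples of one domain.
likelihoods = {f"model_{k}": [random.uniform(0.05, 0.95) for _ in range(100)]
               for k in range(4)}

# Domain-level specialty: sum of sample-level log-likelihoods.
specialty = {m: sum(math.log(p) for p in ps) for m, ps in likelihoods.items()}

# Min-max normalization within the domain, to compare models relatively.
lo, hi = min(specialty.values()), max(specialty.values())
relative = {m: (s - lo) / (hi - lo) for m, s in specialty.items()}

print(relative)  # best model in this domain -> 1.0, worst -> 0.0
```

The same aggregation restricted to samples of one class gives the class-level specialty used below.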
In addition, to clearly compare model specialty differences at the two levels, we present heatmaps of specialty differences (measured by Kullback-Leibler divergence) for domain pairs and class pairs, respectively, in Figure 5. From comparing the performance of all pretrained models on different domains and classes, we can draw further empirical insights: (1) Most importantly, no single model performs best across both domains and classes; (2) The divergence of the performance distribution is significantly more prominent at the class level than at the domain level, as evidenced by the comparison of Figure 5 (a) and (b); (3) As a side finding, models that perform well on the 'Bird' class generally perform well on the 'Bobcat' class, as indicated by their pairwise Kullback-Leibler divergence values. However, models that perform well on 'Bird' and 'Bobcat' usually perform poorly on the 'Dog' and 'Rabbit' classes. This can be viewed as another piece of evidence of no free lunch (NFL) under shifting distributions.
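The pairwise divergence underlying the heatmaps can be sketched as below. We assume here, as an illustration, that each class's per-model specialty values are normalized into a distribution over models before comparing; the class names and specialty values are hypothetical stand-ins.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions over the same models."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Hypothetical per-model specialty on three classes (one value per model).
spec_bird   = normalize([0.9, 0.7, 0.2, 0.1])
spec_bobcat = normalize([0.8, 0.6, 0.3, 0.1])
spec_dog    = normalize([0.1, 0.2, 0.8, 0.9])

print(kl(spec_bird, spec_bobcat))  # small: similar model preferences
print(kl(spec_bird, spec_dog))     # large: diverging model preferences
```

A small divergence between two classes means the same models tend to do well on both, which is exactly the 'Bird'/'Bobcat' versus 'Dog'/'Rabbit' pattern observed above.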

A.5 NO FREE LUNCH THEOREM

We here detail the formal version of Theorem 1 with related definitions. Consider a kernel regression task: training samples $\{x^\mu, y^\mu\}_{\mu=1}^P$ are drawn i.i.d., and the noisy labels are generated from a target function $\bar{f}$ as $y^\mu = \bar{f}(x^\mu) + \epsilon^\mu$, where the noise covariance is $\langle \epsilon^\mu \epsilon^\nu \rangle = \varepsilon^2 \delta^{\mu\nu}$. The regression model is trained by minimizing the regularized empirical (ERM) loss:
$$f^* = \arg\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{\mu=1}^{P} \big(f(x^\mu) - y^\mu\big)^2 + \Lambda \langle f, f \rangle_{\mathcal{H}},$$
where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) associated with a positive semi-definite kernel $K(x, x'): \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$, and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the Hilbert inner product. The generalization error on the test distribution $p(x)$ is $E_g(\mathcal{D}) = \langle (f^*(x) - \bar{f}(x))^2 \rangle_{p(x)}$, a random variable whose value depends on the sampled training dataset. Therefore, the generalization error is averaged over the distribution of all datasets with sample size $P$: $E_g = \langle (f^*(x) - \bar{f}(x))^2 \rangle_{p(x), \mathcal{D}}$.

Based on this problem setting, Canatar et al. (2021) prove the following proposition. The dataset-averaged OOD generalization error is given by
$$E_g = \underbrace{E_g^{0, p(x)}}_{\text{ID error}} + \underbrace{\frac{\gamma' - \gamma}{1 - \gamma} \varepsilon^2 + \kappa^2\, a^\top (P\Lambda + \kappa I)^{-1} O' (P\Lambda + \kappa I)^{-1} a}_{\text{distribution shift error}},$$
with
$$\kappa = \Lambda + \kappa \operatorname{Tr}\big[(P + \kappa \Lambda^{-1})^{-1}\big], \quad \gamma = P \operatorname{Tr}\big[(P + \kappa \Lambda^{-1})^{-2}\big], \quad \gamma' = P \operatorname{Tr}\big[O (P + \kappa \Lambda^{-1})^{-2}\big],$$
where $\kappa$ must be solved self-consistently, $a$ collects the coefficients of the target function in the kernel eigenbasis $\{\phi_\rho\}$, and the $M \times M$ overlap matrix is defined as
$$O_{\rho\gamma} = \int dx\, p(x)\, \phi_\rho(x) \phi_\gamma(x), \qquad O' = O - \frac{1 - \gamma'}{1 - \gamma} I.$$
Here $E_g^{0, p(x)}$ denotes the generalization error when both training and test distributions are matched to $p(x)$ (i.e., the in-distribution error) and is given by
$$E_g^{0, p(x)} = \frac{\gamma}{1 - \gamma} \varepsilon^2 + \frac{\kappa^2}{1 - \gamma}\, a^\top (P\Lambda + \kappa I)^{-2} a.$$
Further, the expected estimator is
$$\langle f^*(x; P) \rangle_{\mathcal{D}} = \sum_\rho \frac{P \eta_\rho}{P \eta_\rho + \kappa}\, a_\rho \phi_\rho(x).$$

A.6 EVALUATION SETTINGS

Evaluation protocol. We follow the DomainBed protocol to conduct our evaluation, for fair comparison with baselines. We use the training-domain validation set protocol for model selection.
Specifically, one domain in a dataset is selected as the target domain and the rest as source domains, from which 20% of the samples are used as the validation set. All runs are repeated 3 times with different random seeds and, thus, different train-validation splits. The out-of-domain test performance averaged over all target domains is reported for each dataset. In addition, we use the standard number of 5,000 iterations for all datasets, with early stopping based on validation accuracy.

Hyperparameter tuning. Here we state the details of hyperparameter tuning in our experiments. We use separate Adam optimizers (Zhang, 2018) for the ensemble network and the label space adapter. Table 4 lists the hyperparameters to tune and their search space. For each domain, we sweep through 48 different hyperparameter settings.

• EoA (Arpit et al., 2021): EoA combines both model ensembling and weight ensembling by taking an ensemble of moving-average models from 6 runs. They experiment with three different pretrained models as initialization. The first is pretrained on ImageNet with ResNet-50, and the second is pretrained on both ImageNet and a much larger additional dataset, IG-1B, with a more advanced backbone, ResNeXt-50 (Xie et al., 2017). Additionally, with a pretrained RegNetY-16GF (Singh et al., 2022), EoA achieves its best results. We denote the last one as EoA+ to indicate that it uses the additional dataset, and compare it with SIMPLE+.

• Random ensemble: In contrast to SIMPLE, which learns to select models for the ensemble, we also compare with an average ensemble of k models chosen randomly for each sample.
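The training-domain validation split described above can be sketched as follows. The domain names and sample counts are hypothetical; the real protocol operates on the DomainBed datasets.

```python
import random

def leave_one_domain_out(domains, holdout, val_frac=0.2, seed=0):
    """DomainBed-style split: one held-out target domain, and an 80/20
    train/validation split within each remaining source domain."""
    rng = random.Random(seed)
    target = domains[holdout]
    train, val = [], []
    for name, samples in domains.items():
        if name == holdout:
            continue
        samples = samples[:]
        rng.shuffle(samples)
        n_val = int(len(samples) * val_frac)
        val.extend(samples[:n_val])
        train.extend(samples[n_val:])
    return train, val, target

# Hypothetical 4-domain dataset with 100 samples per domain.
domains = {d: [f"{d}_{i}" for i in range(100)]
           for d in ["art", "cartoon", "photo", "sketch"]}
train, val, target = leave_one_domain_out(domains, holdout="sketch")
print(len(train), len(val), len(target))  # 240 60 100
```

Repeating this with a different seed yields a different train-validation split, matching the 3-seed protocol above.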

A.8 MORE ABLATION STUDY

Here we introduce an additional baseline that uses the same set of ensemble weights for all samples in a domain, rather than generating different ensemble weights to aggregate model predictions for each OOD sample. For ensemble-based approaches, the overall prediction is given by $\hat{y}_i = w^\top [f_1(x_i), \cdots, f_K(x_i)]$, where $w_k$ is the weight for aggregating the prediction of the $k$-th model $f_k(x_i)$. In SIMPLE, the ensemble weights are given by $w = \mathrm{MLP}(c_i^\top C)$, where $c_i$ is the embedding of the $i$-th sample and $C$ contains the embeddings of all the models. Therefore, there is a special case of the proposed ensemble approach where the ensemble weights $w \in \mathbb{R}^K$ are randomly initialized and optimized through back-propagation. In this simplified version, $w$ is shared across all the samples in the dataset and does not incorporate model or sample information. Compared with this simplified version, SIMPLE explicitly leverages sample and model information (encoded in the embedding vectors $c_i$ and $C$) to generate specialized ensemble weights for each sample, which is more fine-grained. With model and sample embeddings, SIMPLE also enjoys a much lower training cost when incorporating new pretrained models that are not in the model pool. We implement this special case and conduct experiments on the OfficeHome dataset to compare with SIMPLE. With sufficient hyperparameter tuning, this simplified version achieves an average accuracy of 82.3% on OfficeHome, which is worse than the 87.7% of SIMPLE (and even worse than the 84.6% of SIMPLE using the much smaller Model Pool-A) and the 83.9% of the existing SOTA. This suggests that incorporating model and sample information to conduct fine-grained model-sample matching, as in SIMPLE, is necessary and more effective than using a single set of weights optimized by back-propagation.
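The contrast between the shared-weight special case and per-sample matching weights can be sketched as below. The number of models, the class counts, and the stand-in score values are all hypothetical; a fixed score vector plays the role of the learned matching scores for one sample.

```python
import math
import random

random.seed(0)
K, N_CLASSES = 3, 4  # number of pretrained models, number of target classes

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def aggregate(w, preds):
    """Weighted ensemble: y_hat = w^T [f_1(x), ..., f_K(x)]."""
    return [sum(w[k] * preds[k][c] for k in range(K)) for c in range(N_CLASSES)]

# Per-model predictions for one sample (rows: models, cols: classes).
preds = [softmax([random.gauss(0, 1) for _ in range(N_CLASSES)]) for _ in range(K)]

# (1) Simplified baseline: one weight vector shared by every sample.
w_shared = softmax([0.2, -0.1, 0.5])       # stands in for learned shared weights
y_shared = aggregate(w_shared, preds)

# (2) Per-sample matching: weights produced from sample/model embeddings
# (a stand-in score vector plays the role of MLP(c_i^T C) for this sample).
scores_for_this_sample = [1.3, -0.7, 0.1]  # hypothetical matching scores
w_matched = softmax(scores_for_this_sample)
y_matched = aggregate(w_matched, preds)

print(y_shared, y_matched)
```

In the baseline, the same `w_shared` is applied to every sample; in the matched variant a new weight vector is computed per sample, which is what allows model specialties to be exploited sample by sample.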
A.9 SENSITIVITY ANALYSIS

A.9.1 ANALYSIS OF k IN TOP-k MODEL SELECTION

Intuitively, using more models in an ensemble may lead to better performance (Zhang et al., 2020). However, a larger ensemble size also means that the ensemble consumes more computation in inference. Thus, in inference, we choose to selectively activate the $k$ ($< K$) models with the highest ensemble weights. It is necessary to verify the impact of the number of models used for the prediction of each sample on the final performance, which helps balance the effectiveness and efficiency of our approach.

Settings. To measure the sensitivity of the final generalization performance to the parameter $k$ (the number of models activated in inference), we evaluate the performance of SIMPLE+ with different values of $k$ in the top-$k$ selection on the OfficeHome dataset. Specifically, we evaluate $k \in [1, \ldots, 10]$, with a limited hyperparameter sweep to save computation time; the reported accuracy may therefore not be optimal. As illustrated in Figure 6, the highest accuracy of 87.02% is obtained when the number of activated models reaches the highest value we set. This indicates that generalization performance can benefit from using more models to compose the ensemble for the final prediction. Another important finding is that the marginal effect of adding models decreases as the number of activated models increases. For example, when $k$ is increased from 1 to 2, SIMPLE+ gains 1.22% in performance. In contrast, when $k$ goes from 8 to 10, the gain is only 0.11%. Therefore, since the matching network provides a decent ranking of pretrained models, we can obtain promising generalization performance with a limited number of activated models to save computational cost. Note that with $k = 2$, our method already exceeds the existing SOTA (MIRO + SWAD).
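The top-$k$ activation at inference can be sketched as follows; the weights and per-model predictions are hypothetical, and the renormalization of the selected weights is an assumption for illustration.

```python
def topk_ensemble(weights, preds, k=2):
    """Activate only the k models with the highest ensemble weights,
    renormalize their weights, and aggregate only their predictions."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    active = ranked[:k]
    z = sum(weights[i] for i in active)
    w = {i: weights[i] / z for i in active}
    n_classes = len(preds[0])
    return [sum(w[i] * preds[i][c] for i in active) for c in range(n_classes)]

# Hypothetical weights and per-model class predictions (K=4 models, 3 classes).
weights = [0.45, 0.05, 0.35, 0.15]
preds = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6],
         [0.6, 0.3, 0.1]]
print(topk_ensemble(weights, preds, k=2))  # only models 0 and 2 are evaluated
```

In a real deployment the non-selected models are simply never run forward, which is where the inference savings come from.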

A.9.2 ANALYSIS OF FEATURE EXTRACTOR SELECTION

We perform an ablation study to verify the robustness of our matching network to the selection of the feature extractor, providing details for the claim in Section 4.5. Settings. In general, ResNet-based models are used as feature extractors. We compare the performance of our method with a ResNet-based feature extractor (ResNet-34) and a more advanced network (EfficientNet-B7 with Noisy Student) to see whether this change results in significant performance differences. Specifically, ResNet-34 and EfficientNet-B7-NS obtain ImageNet top-1 classification accuracies of 75.0% and 86.9%, respectively. This ablation study is performed on the OfficeHome dataset, with results shown in Table 5. As can be seen, using ResNet-34 or EfficientNet-B7-NS results in similar performance, with neither dominating in all domains. Therefore, feature extractor selection has little impact on the generalization performance of SIMPLE, showing its robustness. Meanwhile, the number of parameters of EfficientNet-B7-NS is three times that of ResNet-34. We therefore choose ResNet-34, which provides good performance at a lower computational cost.

A.9.3 ANALYSIS OF LABEL SPACE ADAPTER TRAINING

As detailed in Section 3.3, the label space adapter of SIMPLE is trained by two losses, $\mathcal{L}_{\text{ens}}(\psi, \rho)$ and $\mathcal{L}_{\text{adapter}}$, and the training process is influenced by the ensemble weights $w_i$. In this section, we analyze and verify whether such influence affects the training of the adapters. Note that we train the same adapter for all the pretrained models from the same pretraining domain, which is more efficient than training a separate adapter per pretrained model and also avoids unexpected training instability. When considering a large model pool with various models from different pretraining domains (e.g., Model Pool-B), we believe the influence is acceptable or even beneficial, since: (1) the adapters observe all the training samples, so their training is sufficient; (2) the number of parameters of the adapters is relatively small, so their training does not have a significant impact on the final performance; (3) although the training of the adapters is influenced by the ensemble weights, it is unbiased, because the ensemble weights are optimized according to the final objective. Furthermore, we conduct experiments to verify whether training label space adapters under the influence of ensemble weights degrades performance. Settings. We compare single-model performance with and without the influence of ensemble weights on the corresponding label space adapter. For the no-influence case, we train an individual adapter separately for each model. We select several models that are assigned small ensemble weights in SIMPLE. The evaluation is on the 'Art' domain of the OfficeHome dataset. Figure 7 shows the performance difference between (1) a model with the adapter trained with and without ensemble weights (blue bars); and (2) our method SIMPLE and the SOTA baseline, for reference (orange bar).
As can be seen: (1) single-model performance does not significantly degrade under the influence of ensemble weights (the difference is even smaller than that between SIMPLE and the existing SOTA); (2) moreover, for some models in Figure 7, training a shared adapter (i.e., influenced by the ensemble weights) makes a single model perform even better. A possible reason is that training a shared adapter for multiple models avoids overfitting, leading to better generalization.

A.10 TRAINING AND INFERENCE COST COMPARISON

We compare the training and inference costs of SIMPLE with general DG methods, including the SOTA and methods that use ensemble learning. Specifically, for training costs, we evaluate the number of learnable parameters and the overall training time; for inference, we compute GFLOPs. To fairly compare training costs, we run ERM, SWAD, and SIMPLE on a single Nvidia Tesla V100 and compare their overall back-propagation time from the start of training to the end (or early stopping). Based on the statistics of ERM and SWAD, we estimate the training times for EoA and MIRO, respectively. The results are shown in Table 6.

Training cost comparison. As shown in Table 2, training the SIMPLE paradigm takes noticeably less time. SIMPLE+ takes only 0.1% of the time of ERM on PACS. The significant training time advantage of the method, together with its superior performance, suggests that SIMPLE is an effective and efficient paradigm for domain generalization.

Inference cost comparison. Although ensemble methods like EoA and SIMPLE achieve better generalization performance at the cost of higher inference FLOPs, SIMPLE still manages to save a large amount of inference cost compared to the previous best ensemble model (half of the inference FLOPs of EoA). This is because SIMPLE only selects the models with the highest $k$ ($< K$) ensemble weights; only $k$ of the $K$ models are activated for inference per sample, which reduces the inference cost to a large extent. In addition, compared to the SOTA (MIRO + SWAD), which uses a larger network architecture to achieve better results than ResNet-50, SIMPLE+ obtains the new SOTA results with significantly lower inference GFLOPs.

SIMPLE shows excellent generalization performance while dispatching only fixed pretrained models to predict each OOD sample. Thus, we are curious about its matching preference, that is, which models are typically dispatched in specific domains.
This could also provide insights into which types of pretrained models might be more suitable for certain domains.

Settings. We analyze the preferences of our approach for network architectures and pretraining datasets. To do so, we first measure the importance of the pretrained models and then perform a refined analysis of their architectures and pretraining datasets. Specifically, taking the model dispatcher trained by SIMPLE, we compute the sum of the ensemble weights assigned to each pretrained model as its importance measure for ranking. Then, we classify these pretrained models according to their basic architecture and pretraining dataset. Treating the ranking of the pretrained models as the ranking of the associated types, we measure the importance of each type by calculating the mean reciprocal rank (MRR).

Analysis of network architecture. Figures 13, 14, 15, 16, and 17 show the raw rankings of the pretrained models on the different datasets. From the ranking information, we can see that SIMPLE assigns markedly different pretrained models to samples from different domains. This may imply that these domains, as shown in Figure 17, differ widely from each other and thus need to be handled by varying combinations of pretrained models. We then classify the pretrained models based on their basic architecture, i.e., CNN-based, ViT-based, and MLP-based, with their MRR values over different datasets and domains shown in Figure 9. It can be seen that ViT-based models are dispatched more frequently than CNN-based models in most domains, with the exception of 6 domains. Observations over these domains suggest that CNN-based models tend to be used more for real images (e.g., photo), while ViT-based models are chosen to handle stylistic or textural variations (e.g., sketch).

Analysis of pretraining datasets.
We then analyze the pretraining datasets used by the pretrained models, including ImageNet-1k, ImageNet-21k, CLIP (Radford et al., 2021), and SWAG (Singh et al., 2022). The MRR results of these pretraining datasets on five datasets are shown in Figure 10, with scores grouped by domain. On one hand, our method shows a preference for SWAG, CLIP, and ImageNet-21k over the ImageNet-1k dataset. This preference is supported by recent studies finding that pretraining on ImageNet-1k is insufficient for generalization (Kumar et al., 2022) and that changing the pretraining dataset may improve generalization performance (Kim et al., 2022). On the other hand, despite this preference, SIMPLE dispatches different models for each domain (e.g., on TerraIncognita-L100 neither SWAG nor CLIP is chosen much; ImageNet-pretrained models are used instead), suggesting that there is still no free lunch for DG in the selection of pretraining datasets.

In this section, we leverage the matching preferences of the learned model dispatcher to optimize the model pool size and study the performance of SIMPLE with different model pool sizes. Specifically, since the model dispatcher of SIMPLE learns to match the most suitable models to unseen samples at a meta level, its matching preference over unseen domains (obtained without accessing ground-truth labels) can be regarded as a measurement of model transferability on these unseen samples. Therefore, we can further optimize the model pool size used for each domain in a two-stage manner: (1) first learn with a large model pool on source domains to generate matching preferences; and (2) then reconstruct a smaller model pool that includes the models preferred by our dispatcher.

Settings. We experiment with this two-stage training on the PACS and OfficeHome datasets, using different reconstructed model pool sizes in the second stage.
Note that we construct a model pool specific to each domain in a dataset, based on the aggregated matching preferences over all samples of the unseen domain. The results on the PACS and OfficeHome datasets are shown in Figures 11 and 12, respectively. As can be seen, given that SIMPLE can automatically match models more suitable for transferring to unseen domains, we can actually surpass the existing SOTA approach even with a model pool of only two models (while the best single model fails to do so, since there is no free lunch). Moreover, with a smaller model pool, SIMPLE can even further improve its performance on the OfficeHome dataset.
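The second stage, rebuilding a smaller pool from aggregated matching preferences, can be sketched as follows. The per-sample weight vectors and model names are hypothetical stand-ins for the dispatcher's outputs on unseen-domain samples.

```python
def prune_model_pool(preferences, pool_size):
    """Stage 2: rebuild a smaller pool from the dispatcher's aggregated
    matching preferences on the unseen domain (no labels needed)."""
    total = {}
    for sample_weights in preferences:          # one weight vector per sample
        for model, w in sample_weights.items():
            total[model] = total.get(model, 0.0) + w
    ranked = sorted(total, key=total.get, reverse=True)
    return ranked[:pool_size]

# Hypothetical per-sample ensemble weights on three unseen-domain samples.
preferences = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.5, "B": 0.1, "C": 0.4},
    {"A": 0.2, "B": 0.1, "C": 0.7},
]
print(prune_model_pool(preferences, pool_size=2))  # ['A', 'C']
```

Because only the dispatcher's weights are aggregated, this pruning requires no ground-truth labels from the unseen domain.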



Footnotes. The pretraining domain is the data distribution on which the pretrained models were learned; source and target domains in DG share the same label space, which differs from that of the pretraining domains. Model sources: https://github.com/rwightman/pytorch-image-models , https://github.com/Cadene/pretrained-models.pytorch , https://github.com/facebookresearch/mae , https://github.com/openai/CLIP , https://github.com/facebookresearch/SWAG , https://image-net.org/index



Figure 2: Different training paradigms in DG.

Model dispatcher. As shown in Section 2, there is no individual model performing best across different tasks; thus, appropriate models need to be dispatched to address each specific task. We define a model dispatcher $g_\rho$, with parameters $\rho$, that takes the sample $x_i$ as input and determines the weight $w_k$ assigned to model $f_k$ for the sample $x_i$ with

Figure3: The SIMPLE framework. Based on a pool of fixed pretrained models, a recommender learns the matching of models and samples for model dispatching with the help of a label space adapter for prediction transformation. Note that only pretrained models from the same pretraining domain share the same label space adapter.

The matching scores between sample $x_i$ and models $\{f_k\}_{k=1}^K$ are computed on their latent features $c_i$ and $C = [c_1, \ldots, c_K]$, through a non-linearly activated multi-layer perceptron (MLP) as $m_i = \mathrm{MLP}(c_i^\top C)$. Specifically, the sample and the models are first embedded and then transformed into a joint latent space. The feature extractor of one pretrained model $f_{k_0}$ from our model pool, fixed as the sample feature extractor, generates the sample embedding $e_i$ for sample $x_i$. For the embedding of each model $f_k$, we introduce a learnable embedding $e_k$, included in $\rho$, which is randomly initialized and optimized during training. Both are processed by two non-linearly activated MLPs, $c_i = \mathrm{MLP}(\mathrm{MLP}(e_i))$ and $c_k = \mathrm{MLP}(\mathrm{MLP}(e_k))$, to map them into the same space for matching scoring.
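The embedding-and-matching step can be sketched as below. This is a heavily simplified stand-in: random linear maps with ReLU play the role of the trained MLPs, the dimensions are arbitrary, and the final MLP over the scores $m_i$ is replaced by a plain softmax for brevity.

```python
import math
import random

random.seed(0)
D_EMB, D_JOINT, K = 8, 4, 3  # embedding dim, joint-space dim, number of models

def linear(dim_in, dim_out):
    """A random linear map standing in for a trained MLP layer."""
    W = [[random.gauss(0, 0.5) for _ in range(dim_out)] for _ in range(dim_in)]
    return lambda v: [sum(v[i] * W[i][j] for i in range(dim_in))
                      for j in range(dim_out)]

def relu(v):
    return [max(0.0, x) for x in v]

# Two-layer projections into the joint matching space (c = MLP(MLP(e))).
proj_sample = [linear(D_EMB, D_JOINT), linear(D_JOINT, D_JOINT)]
proj_model = [linear(D_EMB, D_JOINT), linear(D_JOINT, D_JOINT)]

def embed(e, proj):
    return proj[1](relu(proj[0](e)))  # ReLU activation assumed between layers

e_i = [random.gauss(0, 1) for _ in range(D_EMB)]      # sample embedding
e_models = [[random.gauss(0, 1) for _ in range(D_EMB)] for _ in range(K)]

c_i = embed(e_i, proj_sample)
C = [embed(e_k, proj_model) for e_k in e_models]      # one column per model

# Matching scores c_i^T C, turned into ensemble weights via softmax.
scores = [sum(a * b for a, b in zip(c_i, c_k)) for c_k in C]
mx = max(scores)
exp = [math.exp(s - mx) for s in scores]
weights = [x / sum(exp) for x in exp]
print(weights)
```

The key design point illustrated here is that the model embeddings $e_k$ are free parameters, so a model's "identity" in the matching space is learned jointly with the dispatcher rather than derived from its weights.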

Ensemble learning. Ensemble learning methods (Hansen & Salamon, 1990; Zhou et al., 2018; Li et al., 2023) exploit multiple models to produce prediction results and combine the results with various techniques, e.g., boosting (Schapire, 1990; Freund, 1995; Moghimi et al., 2016) or mean aggregation (Zhou et al., 2018; Zhang et al., 2020), aiming at achieving better performance than any individual model alone.

Figure 4: Performance comparison of all pretrained models on different domains and classes of the TerraIncognita benchmark. Note that we omit the model names for clarity; each row represents the performance of one model.

Figure 5: Kullback-Leibler divergence between the performance distributions over all the leveraged pretrained models at the domain level (a) and the class level (b), respectively.

Figure 6: The impact of k values of top-k selection in inference, on the generalization performance of SIMPLE + on the OfficeHome dataset.

Figure 7 legend: (model with a separate adapter) minus (model with a shared adapter); (SIMPLE) minus (existing SOTA (MIRO)).

Figure 7: The difference of single model performance with and without the influence of ensemble weights, on the OfficeHome dataset.

Figure 8: Samples that are included in DomainBed, from Table 3 in Gulrajani & Lopez-Paz (2020).

Figure 11: The performance of SIMPLE on PACS dataset, with model pools of different sizes.

Figure 13: Ranking of the models in the four domains of PACS using the sum of the ensemble weights assigned to the models. The four columns from left to right in the figure correspond to the different domains (Art, Cartoon, Photo, Sketch) in the dataset.

Figure 14: Ranking of the models in the four domains of VLCS using the sum of the ensemble weights assigned to the models. The four columns from left to right in the figure correspond to the different domains (Caltech101, LabelMe, SUN09, VOC2007) in the dataset.

Figure 15: Ranking of the models in the four domains of OfficeHome using the sum of the ensemble weights assigned to the models. The four columns from left to right in the figure correspond to the different domains (Art, Clipart, Product, Real World) in the dataset.

Figure 16: Ranking of the models in the four domains of TerraIncognita using the sum of the ensemble weights assigned to the models. The four columns from left to right in the figure correspond to the different domains (L100, L38, L43, L46) in the dataset.

Figure 17: Ranking of the models in the six domains of DomainNet using the sum of the ensemble weights assigned to the models. The six columns from left to right in the figure correspond to the different domains (Clipart, Infographic, Painting, QuickDraw, Photo, Sketch) in the dataset.

DomainBed benchmarking. Baseline results are from original papers with the same setup.



Results of (1) the ablation study and (2) SIMPLE with model pools of different sizes.

This reduces the inference cost to a large extent, even though more than 200 pretrained models are incorporated in the model pool. As shown in Table 2, although SIMPLE uses more inference cost than a model with a ResNet-50 backbone, SIMPLE+ uses less inference time than the existing SOTA (MIRO + SWAD) with a RegNetY-16GF backbone (Singh et al., 2022). That is, SIMPLE+ obtains the new SOTA results with significantly higher training and inference efficiency. SIMPLE has thereby compensated for the drawbacks brought by the NFL property of any single model in DG, by utilizing different pretrained models and dispatching them selectively to OOD samples.

Specifically, we focus on the size and diversity of model pools by comparing four different pools, i.e., Model Pool-A-Small (ImageNet-pretrained, 15 models), Model Pool-A (ImageNet-pretrained, 244 models), Model Pool-B-Small (diverse pretraining datasets, 17 models), and Model Pool-B (diverse, 283 models). The composition of these model pools can be found in Appendix A.3.

Tip 1: Use a larger model pool. Based on the analysis in Section 2, a larger model pool is favored, as it increases the probability that the pool contains models that match each OOD sample well. This is consistent with the comparison of Model Pool-A-Small with Model Pool-A, and of Model Pool-B-Small with Model Pool-B: both types of model pools show significantly better generalization performance as the pool size increases.

Tobin et al. (2017) first uses it for DG to simulate test distributions, and subsequently, Peng et al. (2018), Khirodkar et al. (2019), and Tremblay et al. (2018) also adopt data augmentation in various ways. In addition to common feature augmentation, Peng et al. (2022) further propose to perform label augmentation. In the field of data generation, instead of augmenting, methods generate new samples or domains by means of techniques such as Mixup (Zhang et al., 2017), auto-encoders (Qiao et al., 2020), and generative adversarial networks (Rahman et al., 2019).

Hyperparameters we set or tune for SIMPLE.

Table 5: Performance comparison of SIMPLE+ using ResNet-34 and EfficientNet-B7-NS as the feature extractor, on OfficeHome (accuracy %; the first four columns are the domains, the last column is the average).
ResNet-34:          … ±0.2 | 76.7 ±0.6 | 92.8 ±0.9 | 92.6 ±0.3 | 87.7 ±0.5
EfficientNet-B7-NS: 85.8 ±0.8 | 77.0 ±0.7 | 92.2 ±0.5 | 91.9 ±0.7 | 86.7 ±0.6

The comparison of training and inference costs. Here the training time refers to the overall back-propagation time. The training times of ERM and SWAD are derived from the statistics of our runs, and we estimate the training times of EoA and MIRO based on that of ERM. The run of SWAD on DomainNet failed due to an out-of-memory error.

List of pretrained models, each from one of the following sources: timm, pretrainedmodels, clip, MAE, or SWAG. The output dimension is the number of classes in the pretraining domain.

