SIMPLE: SPECIALIZED MODEL-SAMPLE MATCHING FOR DOMAIN GENERALIZATION

Abstract

In domain generalization (DG), most existing methods aspire to fine-tune a specific pretrained model through novel DG algorithms. In this paper, we propose an alternative direction, i.e., to efficiently leverage a pool of pretrained models without fine-tuning. Through extensive empirical and theoretical evidence, we demonstrate that (1) pretrained models possess generalization ability to some extent, yet no single pretrained model is best across all distribution shifts, and (2) the out-of-distribution (OOD) generalization error depends on the fitness between the pretrained model and the unseen test distribution. This analysis motivates us to incorporate diverse pretrained models and to dispatch the best-matched models for each OOD sample by means of recommendation techniques. To this end, we propose SIMPLE, a specialized model-sample matching method for domain generalization. First, the predictions of the pretrained models are adapted to the target domain by a linear label-space transformation. A matching network aware of model specialty then dynamically recommends suitable pretrained models to predict each test sample. Experiments on DomainBed show that our method achieves significant performance improvements (up to 12.2% on individual datasets and 3.9% on average) over state-of-the-art (SOTA) methods, and achieves a further 6.1% gain by enlarging the pretrained model pool. Moreover, our method is highly efficient, achieving more than a 1000× training speedup compared to conventional DG methods that fine-tune a pretrained model. Code and supplemental materials are available at https://seqml.github.io/simple.

1. INTRODUCTION

Distribution shift is a common problem in real-world applications, as it breaks the independent and identically distributed (i.i.d.) assumption of machine learning algorithms (Wang et al., 2022). Mismatches between training and test distributions, which are quite common in reality, can largely deteriorate model performance and make machine learning models infeasible for practical applications (González & Abu-Mostafa, 2015). Therefore, enhancing the generalization ability of models has attracted increasing attention (Cha et al., 2021; Zhang et al., 2022). Given its practical significance, various methods have been proposed, e.g., domain alignment (Ganin et al., 2016; Gong et al., 2019; Arjovsky et al., 2019), meta-learning (Finn et al., 2017; Dou et al., 2019; Du et al., 2020), and ensemble learning (Mancini et al., 2018; Cha et al., 2021; Arpit et al., 2021). The effectiveness of DG algorithms is generally verified by fine-tuning a pretrained ResNet (He et al., 2016) model with these algorithms (Gulrajani & Lopez-Paz, 2020). It has been demonstrated that these algorithms improve upon the empirical risk minimization (ERM) baseline on the ResNet-50 backbone (Arpit et al., 2021; Wiles et al., 2021). Meanwhile, recent studies show that neural architectures and pretraining methods have a large impact on model robustness to distribution shifts (Radford et al., 2021; Wiles et al., 2021). For example, vision transformers are more robust to texture and style shifts than ResNet-based models (Zhang et al., 2022), while the latter are superior to transformer-based models on dense image classification tasks (Liu et al., 2022). In terms of pretraining, using pretraining datasets other than ImageNet-1k improves the generalization performance in one test domain, yet leads to performance degradation in another (Kim et al., 2022).
These findings are in line with the No Free Lunch (NFL) Theorem (Wolpert, 1996), which suggests that no single model can always perform better than any other model without substantive information about the targeted problem. In DG, we usually have very limited information about the test domain, so we are especially likely to encounter this challenge. Inspired by these attempts, in this paper we conduct a fine-grained study on the relationship between pretrained models and distribution shifts. From both empirical and theoretical evidence, we show that there is no free lunch in terms of pretraining for domain generalization, i.e., there is no single best pretrained model across shifting test distributions. Specifically, 283 pretrained models with different network architectures, pretraining datasets, and learning objectives are compared for their generalization performance under different distribution shifts. The results reveal that pretrained models without fine-tuning generalize well to some unseen domains, but none of these models dominates across all unseen distributions. Furthermore, the theoretical analysis indicates that the OOD generalization error is determined by the fitness between the model (varying w.r.t. the network architecture and model weights) and the test distribution. For any network architecture with a fixed training distribution, as is the case for pretrained models (Iandola et al., 2014; He et al., 2016; 2021a), it is always possible to find a favorable test distribution with a small generalization error, or an adverse one with a large error. Motivated by these findings, we propose an alternative DG paradigm that leverages pretrained models with different network architectures and training distributions, upon which we match the most suitable pretrained models for each OOD sample.
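The no-free-lunch behaviour described above can be illustrated with a toy example (not from the paper): two fixed "pretrained" threshold classifiers, each fit to a different training distribution, are evaluated on two shifted test distributions; each classifier wins on the shift closest to its own training domain, so neither dominates.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(mean):
    # 1-D binary task whose decision threshold is domain-specific:
    # label = 1 iff x exceeds the domain mean.
    x = rng.normal(loc=mean, scale=1.0, size=1000)
    y = (x > mean).astype(int)
    return x, y

# Two frozen predictors, each "pretrained" on a different domain.
model_a = lambda x: (x > 0.0).astype(int)   # matches a domain centred at 0
model_b = lambda x: (x > 3.0).astype(int)   # matches a domain centred at 3

err = lambda f, x, y: float(np.mean(f(x) != y))

x0, y0 = make_domain(0.0)   # test shift close to model_a's training domain
x3, y3 = make_domain(3.0)   # test shift close to model_b's training domain

errors = {("a", 0.0): err(model_a, x0, y0), ("a", 3.0): err(model_a, x3, y3),
          ("b", 0.0): err(model_b, x0, y0), ("b", 3.0): err(model_b, x3, y3)}
```

Here model_a is error-free on the shift centred at 0 but badly miscalibrated on the shift centred at 3, and vice versa for model_b, mirroring the claim that generalization error is governed by the fitness between a fixed model and the test distribution.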
As shown in Figure 3, our paradigm (specialized model-sample matching for domain generalization, SIMPLE) first adopts a simple label adapter that projects the label space of the pretraining domain¹ to that of the unseen domains², where the adapter is shared by pretrained models from the same pretraining domain. Then, a matching network, which is aware of model specialty, selects a set of suitable pretrained models and aggregates them to produce the prediction for each OOD sample. Notably, this promising alternative exhibits significant performance improvements, averaging 3.9% over existing SOTA results, with gains of up to 12.2% on single datasets, and a significant increase in training efficiency. To summarize, this work makes the following contributions:
• We theoretically and empirically analyze the generalization of pretrained models on shifting unseen test distributions, revealing a no-free-lunch phenomenon that motivates our solution of model-sample matching.
• Complementary to traditional DG solutions, we propose a novel DG paradigm that directly leverages pretrained models without fine-tuning, and it significantly improves DG performance over strong baselines on the mainstream benchmark.
• Besides the performance gain, our method is also more efficient since it does not follow the common fine-tuning approach, shedding new light on using pretrained models in DG tasks.
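The two components above can be sketched in a minimal numpy form. This is a hypothetical illustration, not the paper's implementation: the adapter and matcher weights (`label_adapter`, `matcher_W`), the dimensions, and the linear matching network are all assumptions made for the sketch; in the paper both modules are learned and the pretrained models are real networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: K frozen pretrained models, each emitting logits
# over its pretraining label space (C_pre classes), adapted to the target
# label space (C_tgt classes) by a shared linear label adapter.
K, C_pre, C_tgt, D = 4, 10, 5, 16        # D = feature dim seen by the matcher

label_adapter = rng.normal(size=(C_pre, C_tgt))  # shared per pretraining domain
matcher_W = rng.normal(size=(D, K))              # toy linear matching network

def simple_predict(model_logits, sample_feat, top_m=2):
    """Aggregate the adapted predictions of the top-m matched models.

    model_logits: (K, C_pre) frozen-model outputs for one sample.
    sample_feat:  (D,) feature the matching network sees for this sample.
    """
    adapted = softmax(model_logits @ label_adapter)  # (K, C_tgt) target-space probs
    scores = sample_feat @ matcher_W                 # (K,) model-sample match scores
    top = np.argsort(scores)[-top_m:]                # dispatch best-matched models only
    w = softmax(scores[top])                         # normalize over selected models
    return w @ adapted[top]                          # (C_tgt,) ensembled prediction

probs = simple_predict(rng.normal(size=(K, C_pre)), rng.normal(size=(D,)))
```

Because each adapted row and the matching weights are both normalized, the aggregated output is itself a valid probability distribution over the target label space; per-sample dispatch is what lets different OOD samples consult different specialists from the pool.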

2. NO FREE LUNCH IN PRETRAINING FOR DOMAIN GENERALIZATION

In this section, we investigate whether there exists a free lunch in pretraining for DG, that is, whether we can find one single best pretrained model that generalizes across all distribution shifts. To this end, we first conduct an empirical analysis of the generalization ability of pretrained models over shifting distributions in Section 2.1, followed by a theoretical analysis in Section 2.2.

2.1. GENERALIZABILITY ANALYSIS OF THE PRETRAINED MODELS

We here analyze the generalization ability possessed by different pretrained models. Note that existing DG methods generally adopt a specific ImageNet-pretrained model (e.g., ResNet-50), which has been shown to be insufficient for generalization (Kumar et al., 2021; Kim et al., 2022). Thus, for a comprehensive analysis, we first incorporate 283 pretrained models spanning diverse combinations of network architectures, pretraining datasets, objectives, and algorithms. Detailed information on all these models and further experimental settings are in Appendix A.4. For the efficient adaptation of pretrained models from pretraining domains to unseen domains, we propose to



¹ The data distribution on which the pretrained models were learned.
² Source and target domains in DG share the same label space, which differs from that of the pretraining domains.

