IN SEARCH OF LOST DOMAIN GENERALIZATION

Abstract

The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions (datasets, network architectures, and model selection criteria) render fair comparisons difficult. The goal of this paper is to understand how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks, and we argue that algorithms without a model selection criterion remain incomplete. Next, we implement DOMAINBED, a testbed for domain generalization including seven benchmarks, fourteen algorithms, and three model selection criteria. When conducting extensive experiments using DOMAINBED, we find that, when carefully implemented and tuned, ERM outperforms the state-of-the-art in terms of average performance. Furthermore, no algorithm included in DOMAINBED outperforms ERM by more than one point when evaluated under the same experimental conditions. We hope that the release of DOMAINBED, alongside contributions from fellow researchers, will streamline reproducible and rigorous advances in domain generalization.

1. INTRODUCTION

Machine learning systems often fail to generalize out-of-distribution, crashing in spectacular ways when tested outside the domain of training examples (Torralba and Efros, 2011) . The overreliance of learning systems on the training distribution manifests widely. For instance, self-driving car systems struggle to perform under conditions different to those of training, including variations in light (Dai and Van Gool, 2018) , weather (Volk et al., 2019) , and object poses (Alcorn et al., 2019) . As another example, systems trained on medical data collected in one hospital do not generalize to other health centers (Castro et al., 2019; AlBadawy et al., 2018; Perone et al., 2019; Heaven, 2020) . Arjovsky et al. (2019) suggest that failing to generalize out-of-distribution is failing to capture the causal factors of variation in data, clinging instead to easier-to-fit spurious correlations prone to change across domains. Examples of spurious correlations commonly absorbed by learning machines include racial biases (Stock and Cisse, 2018) , texture statistics (Geirhos et al., 2018) , and object backgrounds (Beery et al., 2018) . Alas, the capricious behaviour of machine learning systems out-of-distribution is a roadblock to their deployment in critical applications. Aware of this problem, the research community has spent significant efforts during the last decade to develop algorithms able to generalize out-of-distribution. In particular, the literature in Domain Generalization (DG) assumes access to multiple datasets during training, each of them containing examples about the same task, but collected under a different domain or experimental condition (Blanchard et al., 2011; Muandet et al., 2013) . The goal of DG algorithms is to incorporate the invariances across these training domains into a classifier, in hopes that such invariances will also hold in novel test domains. 
Different DG solutions assume different types of invariances, and propose algorithms to estimate them from data. Despite the enormous importance of DG, the literature is scattered: a plethora of different algorithms appear yearly, each of them evaluated under different datasets, neural network architectures, and model selection criteria. Borrowing from the success of standardized computer vision benchmarks such as ImageNet (Russakovsky et al., 2015), the purpose of this work is to perform a rigorous comparison of DG algorithms, as well as to open-source our software for anyone to replicate and extend our analyses. This manuscript investigates the question: how useful are different DG algorithms when evaluated in a consistent and realistic setting? To answer this question, we implement and tune fourteen DG algorithms carefully, and compare them across seven benchmark datasets and three model selection criteria. There are three major takeaways from our investigations:

• Claim 1: A careful implementation of ERM outperforms the state-of-the-art in terms of average performance across common benchmarks (Table 1, full list in Appendix A.5).
• Claim 2: When implementing fourteen DG algorithms in a consistent and realistic setting, no competitor outperforms ERM by more than one point (Table 3).
• Claim 3: Model selection is non-trivial for DG, yet affects results (Table 3). As such, we argue that DG algorithms should specify their own model selection criteria.

As a result of our research, we release DOMAINBED, a framework to streamline rigorous and reproducible experimentation in DG. Using DOMAINBED, adding a new algorithm or dataset is a matter of a few lines of code. A single command runs all experiments, performs all model selections, and auto-generates all the result tables included in this work.
DOMAINBED is a living project: we welcome pull requests from fellow researchers to update the available algorithms, datasets, model selection criteria, and result tables. Section 2 kicks off our exposition with a review of the DG setup. Section 3 discusses the difficulties of model selection in DG and makes recommendations for a path forward. Section 4 introduces DOMAINBED, describing the features included in the initial release. Section 5 discusses the experimental results of running the entire DOMAINBED suite, illustrating the competitive performance of ERM and the importance of model selection criteria. Finally, Section 6 offers our view on future research directions in DG. Appendix A reviews one hundred articles spanning a decade of research in DG, summarizing the experimental performance of over thirty algorithms. 

2. THE PROBLEM OF DOMAIN GENERALIZATION

The goal of supervised learning is to predict values y ∈ Y of a target random variable Y, given values x ∈ X of an input random variable X. Predictions ŷ = f(x) about x originate from a predictor f : X → Y. We often decompose predictors as f = w ∘ φ, where we call φ : X → H the featurizer, and w : H → Y the classifier. To solve the prediction task we collect a training dataset D = {(x_i, y_i)}_{i=1}^{n}, which contains identically and independently distributed (i.i.d.) examples from the joint probability distribution P(X, Y). Given a loss function ℓ : Y × Y → [0, ∞) measuring prediction error, supervised learning seeks the predictor minimizing the risk E_{(x,y)∼P}[ℓ(f(x), y)]. Since we only have access to the data distribution P(X, Y) via the dataset D, we instead search for a predictor minimizing the empirical risk (1/n) Σ_{i=1}^{n} ℓ(f(x_i), y_i) (Vapnik, 1998).

The rest of this paper studies the problem of Domain Generalization (DG), an extension of supervised learning where training datasets from multiple domains (or environments) are available to train our predictor (Blanchard et al., 2011; Patel et al., 2015; Wilson and Cook, 2018). Table 2 compares different machine learning setups to highlight the nature of DG problems. The causality literature refers to DG as learning from multiple environments (Peters et al., 2016; Arjovsky et al., 2019). Although challenging, the DG framework can capture some of the difficulty of real prediction problems, where unforeseen distributional discrepancies between training and testing data are surely expected. At the same time, the framework can be limiting: in many real-world scenarios there may be external variables informing about task relatedness (space, time, annotations) that the DG framework ignores.
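As a minimal, self-contained sketch of this baseline (a toy one-dimensional logistic regression on hypothetical data, not the paper's implementation), ERM simply pools the examples from all training domains and minimizes the average loss by gradient descent:

```python
import math
import random

def erm_logreg(domains, lr=0.5, steps=500):
    """Pool examples from all training domains and minimize the empirical
    risk (average logistic loss) over the pooled dataset."""
    data = [pair for domain in domains for pair in domain]  # pool the domains
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:  # y in {0, 1}
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(Y=1 | x)
            gw += (p - y) * x / n
            gb += (p - y) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Two toy "domains" sharing the invariant rule y = 1 iff x > 0,
# but with different input distributions:
random.seed(0)
d1 = [(x, int(x > 0)) for x in [random.gauss(0, 1) for _ in range(50)]]
d2 = [(x, int(x > 0)) for x in [random.gauss(0, 2) for _ in range(50)]]
w, b = erm_logreg([d1, d2])
```

Note that ERM ignores which domain each example came from; the DG algorithms compared later differ precisely in how they exploit the domain labels.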

3. MODEL SELECTION AS PART OF THE LEARNING PROBLEM

Here we discuss issues surrounding model selection (choosing hyperparameters, training checkpoints, architecture variants) in DG and make specific recommendations for a path forward. Because we lack access to a validation set identically distributed to the test data, model selection in DG is not as straightforward as in supervised learning. Some works adopt heuristic strategies whose behavior is not well-studied, while others simply omit a description of how to choose hyperparameters. This leaves open the possibility that hyperparameters were chosen using the test data, which is not methodologically sound. Differences in results arising from inconsistent tuning practices may be misattributed to the algorithms under study, complicating fair assessments.

We believe that much of the confusion surrounding model selection in DG arises from treating it as merely a question of experimental design. To the contrary, model selection requires making theoretical assumptions about how the test data relates to the training data. Different DG algorithms make different assumptions, and it is not clear a priori which ones are correct, or how they influence the model selection criterion. Indeed, choosing reasonable assumptions is at the heart of DG research. Therefore, a DG algorithm without a strategy to choose its hyperparameters should be regarded as incomplete.

Recommendation 1 A DG algorithm should be responsible for specifying a model selection method.

While algorithms without well-justified model selection methods are incomplete, they may be useful stepping-stones in a research agenda. In this case, instead of using an ad-hoc model selection method, we can evaluate incomplete algorithms by considering an oracle model selection method, where we select hyperparameters using some data from the test domain. Of course, it is important to avoid invalid comparisons between oracle results and baselines tuned without an oracle method.
Also, unless we restrict access to the test domain data somehow, we risk obtaining meaningless results (we could just train on such test domain data using supervised learning).

Recommendation 2 Researchers should disclaim any oracle-selection results as such and specify policies to limit access to the test domain.

Leave-one-domain-out validation Given d_tr training domains, we train d_tr models with equal hyperparameters, each holding one of the training domains out. We evaluate each model on its held-out domain, and average the accuracies of these d_tr models over their held-out domains. Finally, we choose the model maximizing this average accuracy, retrained on all d_tr domains. This strategy assumes that training and test domains follow a meta-distribution over domains, and that our goal is to maximize the expected performance under this meta-distribution. Note that leaving k > 1 domains out would greatly increase the number of experiments, and introduces a hyperparameter k.
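The leave-one-domain-out procedure can be sketched in a few lines of Python. This is a schematic, not the DomainBed implementation: the `train` and `evaluate` callables are hypothetical stubs standing in for a full training run and a held-out-domain accuracy measurement.

```python
def leave_one_domain_out_select(hparams_list, train_domains, train, evaluate):
    """Pick the hyperparameters whose models, each trained with one training
    domain held out, achieve the best average held-out accuracy; then
    retrain with those hyperparameters on all training domains."""
    def score(hp):
        accs = []
        for i, held_out in enumerate(train_domains):
            kept = train_domains[:i] + train_domains[i + 1:]
            model = train(hp, kept)              # train on d_tr - 1 domains
            accs.append(evaluate(model, held_out))  # accuracy on the held-out domain
        return sum(accs) / len(accs)

    best_hp = max(hparams_list, key=score)
    # Final model: retrain the chosen hyperparameters on all domains.
    return best_hp, train(best_hp, train_domains)
```

Each candidate hyperparameter setting thus costs d_tr training runs, which is why leaving out k > 1 domains quickly becomes expensive.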

Test-domain validation (oracle)

We choose the model maximizing the accuracy on a validation set that follows the distribution of the test domain. Following our earlier recommendation to limit test domain access, we allow one query (the last checkpoint) per choice of hyperparameters, disallowing early stopping. Recall that this is not a valid benchmarking methodology. Oracle-based results can be either optimistic, because we select models using the test distribution, or pessimistic, because the query limit reduces the number of considered hyperparameters. We also tried limiting the size of the oracle test set instead of the number of queries, but this led to unacceptably high variance.

3.2. CONSIDERATIONS FROM THE LITERATURE

Some references in prior work discuss additional strategies to choose hyperparameters in DG. For instance, Krueger et al. (2020, Appendix B.1) suggest choosing hyperparameters to maximize performance across all domains of an external dataset. This "leave-one-dataset-out" strategy is akin to the second strategy outlined above. Albuquerque et al. (2019, Section 5.3.2) suggest performing model selection based on the loss function (which often incorporates an algorithm-specific regularizer), and D'Innocente and Caputo (2018, Section 3) derive a strategy specific to their algorithm. Finally, tools from differential privacy enable multiple reuses of a validation set (Dwork et al., 2015), which could help control the power of test-domain validation (oracle).

4. DOMAINBED: A PYTORCH TESTBED FOR DOMAIN GENERALIZATION

At the heart of our large-scale experimentation is DOMAINBED, a PyTorch (Paszke et al., 2019) testbed to streamline reproducible and rigorous research in DG: https://github.com/facebookresearch/DomainBed/. The initial release comprises fourteen algorithms, seven datasets, and three model selection methods (those described in Section 3), as well as the infrastructure to run all the experiments and generate all the LaTeX tables below with a single command.

Datasets DOMAINBED currently includes downloaders and loaders for seven standard DG image classification benchmarks: Colored MNIST (Arjovsky et al., 2019), Rotated MNIST (Ghifary et al., 2015), PACS (Li et al., 2017), VLCS (Fang et al., 2013), OfficeHome (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), and DomainNet (Peng et al., 2019). The datasets based on MNIST are "synthetic", since changes across domains are well understood (colors and rotations). The remaining datasets are "real", since domains vary in unknown ways. Appendix B.2 describes these datasets.

Implementation choices

We highlight three implementation choices made towards a consistent and realistic evaluation setting. First, whereas prior work is inconsistent in its choice of network architecture, we finetune ResNet-50 models (He et al., 2016) pretrained on ImageNet for all non-MNIST experiments. We note that recent state-of-the-art results (Balaji et al., 2018; Nam et al., 2019; Huang et al., 2020) also use ResNet-50 models. Second, for all non-MNIST datasets, we augment training data using the following protocol: crops of random size and aspect ratio, resizing to 224 × 224 pixels, random horizontal flips, random color jitter, grayscaling the image with 10% probability, and normalization using the ImageNet channel statistics. This augmentation protocol is increasingly standard in state-of-the-art DG work (Nam et al., 2019; Huang et al., 2020; Krueger et al., 2020; Carlucci et al., 2019a; Zhou et al., 2020; Dou et al., 2019; Hendrycks et al., 2020; Wang et al., 2020a; Seo et al., 2020; Chattopadhyay et al., 2020). We use no augmentation for MNIST-based datasets. Third, for Rotated MNIST, we divide all the digits evenly among domains, instead of replicating the same 1000 digits to construct all domains. We deviate from standard practice for two reasons: using the same digits across training and test domains leaks test data, and reducing the amount of training data complicates the task in an unrealistic way.

5. EXPERIMENTS

We run experiments for all algorithms, datasets, and model selection criteria shipped in DOMAINBED. We consider every configuration of a dataset where one domain is hidden for testing, resulting in the training of 58,000 models. To generate the following results, we simply run sweep.py at commit 0x7df6f06 from DOMAINBED's repository.

Hyperparameter search For each algorithm and test domain, we conduct a random search (Bergstra and Bengio, 2012) of 20 trials over a joint distribution of all hyperparameters (Appendix B.4). Appendix C.4 shows that running more than 20 trials does not improve our results significantly. We use each model selection criterion to select amongst the 20 models from the random search. We split the data from each domain into 80% and 20% splits. We use the larger splits for training and final evaluation, and the smaller splits to select hyperparameters (for an illustration, see Appendix B.3). All hyperparameters are optimized anew for each algorithm and test domain, including hyperparameters like learning rates which are common to multiple algorithms.

Standard error bars While some DG literature reports error bars across seeds, randomness arising from model selection is often ignored. This is acceptable if the goal is a best-versus-best comparison, but prohibits analyses concerning the model selection process itself. Instead, we repeat our entire study three times, making every random choice anew: hyperparameters, weight initializations, and dataset splits. Every number we report is a mean (and its standard error) over these repetitions.
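The random search amounts to drawing joint hyperparameter configurations and keeping the one a model selection criterion scores highest. The following is a simplified sketch with a few distributions in the style of Appendix B.4 (e.g., log-uniform learning rates); `run_trial` is a hypothetical stand-in for training a model and scoring it with the chosen criterion.

```python
import random

def sample_hparams(rng):
    """One draw from a joint hyperparameter distribution, in the style of
    the sweep (log-uniform learning rate, power-of-two batch size)."""
    return {
        "lr": 10 ** rng.uniform(-5, -3.5),
        "batch_size": int(2 ** rng.uniform(3, 5.5)),
        "weight_decay": 10 ** rng.uniform(-6, -2),
    }

def random_search(n_trials, run_trial, seed=0):
    """Draw n_trials configurations; return the (hparams, score) pair that
    the model selection criterion (run_trial) scores highest."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        hp = sample_hparams(rng)
        score = run_trial(hp)  # e.g., training-domain validation accuracy
        if best is None or score > best[1]:
            best = (hp, score)
    return best
```

Repeating this entire loop with fresh seeds (as the paper does three times) is what accounts for the randomness of model selection in the reported error bars.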

5.1. RESULTS

Table 3 summarizes the results of our experiments. Appendix C contains the full results per dataset and domain. As anticipated in our introduction, we draw three conclusions from our results.

Claim 1: Carefully tuned ERM outperforms the previously published state-of-the-art. Table 1 (full version in Appendix A.5) shows this result, when we provide ERM with a training-domain validation set for hyperparameter selection. This state-of-the-art average performance of our ERM baseline holds even when we select the best competitor available in the literature separately for each benchmark. One reason for ERM's strong performance is that we use ResNet-50, whereas some prior work uses smaller ResNet-18 models. As recently shown in the literature (Hendrycks et al., 2020), this suggests that better in-distribution generalization is a dominant factor behind better out-of-distribution generalization. Our result does not refute prior work: it is possible that with stronger implementations, some competing methods may improve upon ERM. Rather, we provide a strong, realistic, and reproducible baseline for future work to build upon.

Claim 2: When evaluated in a consistent setting, no algorithm outperforms ERM by more than one point. We observe this result in Table 3, obtained by running from scratch every combination of dataset, algorithm, and model selection criterion in DOMAINBED. Under any model selection criterion, no method improves on the average performance of ERM by more than one point. Given the number of trials performed, no improvement over ERM is statistically significant according to a t-test at significance level α = 0.05. While new algorithms could improve upon ERM (an exciting premise!), obtaining substantial DG improvements in a rigorous way proved challenging. Most of our baselines can achieve ERM-like performance because they have hyperparameter configurations under which they behave like ERM (e.g., regularization coefficients that can be set to zero).
Our advice to DG practitioners is to use ERM (which is a safe contender) or CORAL (Sun and Saenko, 2016) (which achieved the highest average score).

Claim 3: Model selection methods matter. We observe that model selection with a training-domain validation set outperforms leave-one-domain-out cross-validation across multiple datasets and algorithms. This does not mean that using a training-domain validation set is the right way to tune hyperparameters. In fact, the stronger performance of oracle selection (+2.3 points for ERM) suggests headroom to develop improved DG model selection criteria.

5.2. ABLATION STUDY ON ERM

To better understand our ERM performance, we perform an ablation study on the neural network architecture and the data augmentation protocol. Table 5 .2 shows that using a ResNet-50 neural network architecture, instead of a smaller ResNet-18, improves DG test accuracy by 3.7 points. Using data augmentation improves DG test accuracy by 0.5 points. However, these ResNet models were pretrained on ImageNet using data augmentation, so the benefits of augmentation are partly absorbed by the model. In fact, we hypothesize that among models pretrained on ImageNet, domain generalization performance is mainly influenced by the model's original test accuracy on ImageNet.

6. DISCUSSIONS

We provide several discussions to help the reader interpret our results and motivate future work.

Our negative claims are fundamentally limited Broad negative claims (e.g., "algorithm X does not outperform ERM") do not specify an exact experimental setting and are therefore impossible to rigorously prove. In order to be verifiable, such claims must be restricted to a specific setting. This limitation is fundamental to all negative result claims, and ours (Claim 2) is no exception. We have shown that many algorithms do not substantially improve on ERM in our setting, but the relevance of that setting is a subjective matter ultimately left to the reader. In making this judgement, the reader should consider whether they agree with our methodological and implementation choices, which we have explained and motivated throughout the paper. We also note that our implementation can outperform previous results (Table 1). Finally, DomainBed is not a black box: our implementation is open-source and actively maintained, and we invite the research community to improve on our results.

Is this as good as it gets? We question whether DG is possible in some of the considered datasets. Why do we assume that a neural network should be able to classify cartoons, given only photorealistic training data? In the case of Rotated MNIST, do truly rotation-invariant features discriminative of the digit class exist? Are those features expressible by a neural network? Even with correct model selection, is the out-of-distribution performance of modern ERM implementations as good as it gets? Or is it simply as poor as every other alternative? How far are we from the achievable DG performance? Is this upper bound simply the in-domain test error?

Are these the right datasets? Most datasets considered in the DG literature do not reflect realistic situations. If one wanted to classify cartoons, the easiest option would be to collect a small labeled dataset of cartoons.
Should we consider more realistic, impactful tasks for better research in DG? Some alternatives are medical imaging in different hospitals and self-driving cars in different cities.

Generalization requires untestable assumptions Every time we use ERM, we assume that training and testing examples follow the same distribution, an assumption that is untestable in every single instance. The same applies to DG: each algorithm assumes a different (untestable) type of invariance across domains. Therefore, the performance of a DG algorithm depends on the problem at hand, and only time can tell if we have made a good choice. This is akin to the generalization of a scientific theory such as Newton's gravitation, which cannot be proven, but can only resist falsification.

7. CONCLUSION

Our extensive empirical evaluation of DG algorithms leads to three conclusions. First, a carefully tuned ERM baseline outperforms the previously published state-of-the-art results in terms of average performance (Claim 1). Second, when compared to thirteen popular DG alternatives on the exact same experimental conditions, we find out that no competitor is able to outperform ERM by more than one point (Claim 2). Third, model selection is non-trivial for DG, and it should be an integral part of any proposed method (Claim 3). Going forward, we hope that our results and DOMAINBED promote realistic and rigorous evaluation and enable advances in domain generalization.

A A DECADE OF LITERATURE ON DOMAIN GENERALIZATION

In this section, we provide an exhaustive literature review covering a decade of domain generalization research. We classify domain generalization algorithms into four strategies to learn invariant predictors: learning invariant features, sharing parameters, meta-learning, and performing data augmentation.

A.1 LEARNING INVARIANT FEATURES

Muandet et al. (2013) use kernel methods to find a feature transformation that (i) minimizes the distance between transformed feature distributions across domains, and (ii) does not destroy any of the information between the original features and the targets. Although popular, learning domain-invariant features has received some criticism (Zhao et al., 2019; Johansson et al., 2019). Some alternatives exist, as we review next.

4. Meta-Learning for Domain Generalization (MLDG, Li et al. (2018a)) leverages MAML (Finn et al., 2017) to meta-learn how to generalize across domains.
5. Domain-Adversarial Neural Networks (DANN, Ganin et al. (2016)) employ an adversarial network to match feature distributions across environments.
6. Class-conditional DANN (C-DANN, Li et al. (2018d)) is a variant of DANN matching the conditional distributions P(φ(X^d) | Y^d = y) across domains, for all labels y.
7. CORAL (Sun and Saenko, 2016) matches the mean and covariance of feature distributions.
8. MMD (Li et al., 2018b) matches the MMD (Gretton et al., 2012) of feature distributions across domains.
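As a concrete instance of these feature-matching penalties, a CORAL-style regularizer penalizes the squared distance between the first- and second-order feature statistics of two domains. The sketch below (NumPy, hypothetical feature matrices with one example per row) illustrates the idea, not any particular implementation:

```python
import numpy as np

def coral_penalty(h_a, h_b):
    """CORAL-style penalty between two domains' feature matrices:
    squared distance between their feature means plus squared distance
    between their feature covariances (rows = examples, cols = features)."""
    mean_diff = h_a.mean(axis=0) - h_b.mean(axis=0)
    cov_a = np.cov(h_a, rowvar=False)
    cov_b = np.cov(h_b, rowvar=False)
    return (mean_diff ** 2).sum() + ((cov_a - cov_b) ** 2).sum()
```

In training, such a penalty is added (with a tunable weight) to the classification loss, pulling the featurizer toward domain-matched statistics; setting the weight to zero recovers plain ERM.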

A.5 PREVIOUS STATE-OF-THE-ART NUMBERS

The network used for MNIST experiments:

#   Layer
1   Conv2D (in=d, out=64)
2   ReLU
3   GroupNorm (groups=8)
4   Conv2D (in=64, out=128, stride=2)
5   ReLU
6   GroupNorm (groups=8)
7   Conv2D (in=128, out=128)
8   ReLU
9   GroupNorm (groups=8)
10  Conv2D (in=128, out=128)
11  ReLU
12  GroupNorm (groups=8)
13  Global average-pooling

For "ResNet-50", we replace the final (softmax) layer of a ResNet-50 pretrained on ImageNet and fine-tune the entire network. Since minibatches from different domains follow different distributions, batch normalization degrades domain generalization algorithms (Seo et al., 2020). Therefore, we freeze all batch normalization layers before fine-tuning. We insert a dropout layer before the final ResNet-50 linear layer. Table 6 lists all algorithm hyperparameters, their default values, and their random search distributions. We optimize all models using Adam (Kingma and Ba, 2015).
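The MNIST network tabulated above can be sketched in PyTorch as follows. Kernel size 3 with padding 1 is an assumption on our part, since the table lists only channel counts and strides.

```python
import torch
import torch.nn as nn

class MNISTConvNet(nn.Module):
    """Sketch of the small MNIST featurizer; kernel size/padding assumed."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1),
            nn.ReLU(),
            nn.GroupNorm(8, 64),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.GroupNorm(8, 128),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(),
            nn.GroupNorm(8, 128),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(),
            nn.GroupNorm(8, 128),
            nn.AdaptiveAvgPool2d((1, 1)),  # global average pooling
            nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)  # (batch, 128) feature vectors
```

Note the use of GroupNorm rather than BatchNorm, consistent with the concern above that batch statistics differ across domain minibatches.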

Table 6: Hyperparameters, their default values, and distributions for random search.

Condition    | Parameter                   | Default value | Random distribution
ResNet       | learning rate               | 0.00005       | 10^Uniform(-5, -3.5)
ResNet       | batch size                  | 32            | 2^Uniform(3, 5.5)
ResNet       | batch size (if ARM)         | 8             | 8
ResNet       | ResNet dropout              | 0             | RandomChoice([0, 0.1, 0.5])
ResNet       | generator learning rate     | 0.00005       | 10^Uniform(-5, -3.5)
ResNet       | discriminator learning rate | 0.00005       | 10^Uniform(-5, -3.5)
not ResNet   | learning rate               | 0.001         | 10^Uniform(-4.5, -3.5)
not ResNet   | batch size                  | 64            | 2^Uniform(3, 9)
not ResNet   | generator learning rate     | 0.001         | 10^Uniform(-4.5, -2.5)
not ResNet   | discriminator learning rate | 0.001         | 10^Uniform(-4.5, -2.5)
MNIST        | weight decay                | 0             | 0
MNIST        | generator weight decay      | 0             | 0
not MNIST    | weight decay                | 0             | 10^Uniform(-6, -2)
not MNIST    | generator weight decay      | 0             | 10^Uniform(-6, -2)
DANN, C-DANN | lambda                      | 1.0           | 10^Uniform(-2, 2)
DANN, C-DANN | discriminator weight decay  | 0             | 10^Uniform(-6, -2)
DANN, C-DANN | discriminator steps         | 1             | 2^Uniform(0, 3)
DANN, C-DANN | discriminator width         | 256           | int(2^Uniform(6, 10))
DANN, C-DANN | discriminator depth         | 3             | RandomChoice([3, 4, 5])
DANN, C-DANN | discriminator dropout       | 0             | RandomChoice([0, 0.1, 0.5])
DANN, C-DANN | discriminator grad penalty  | 0             | 10^Uniform(-2, 1)
DANN, C-DANN | Adam beta_1                 | 0.5           | RandomChoice([0, 0.5])
DRO          | eta                         | 0.01          | 10^Uniform(-1, 1)
IRM          | lambda                      | 100           | 10^Uniform(-1, 5)
IRM          | warmup iterations           | 500           | 10^Uniform(0, 4)
Mixup        | alpha                       | 0.2           | 10^Uniform(0, 4)
MLDG         | beta                        | 1             | 10^Uniform(-1, 1)
MMD          | gamma                       | 1             | 10^Uniform(-1, 1)
MTL          | ema                         | 0.99          | RandomChoice([0.5, 0.9, 0.99, 1])
RSC          | feature drop percentage     | 1/3           | Uniform(0, 0.5)
RSC          | batch drop percentage       | 1/3           | Uniform(0, 0.5)
SagNet       | adversary weight            | 0.1           | 10^Uniform(-2, 1)
VREx         | lambda                      | 10            | 10^Uniform(-1, 5)
VREx         | warmup iterations           | 500           | 10^Uniform(0, 4)

B.5 EXTENDING DOMAINBED

Algorithms are classes that implement two methods: .update(minibatches) and .predict(x). The update method receives a list of minibatches, one minibatch per training domain, each containing one input tensor and one output tensor.
For example, to implement group DRO (Sagawa et al., 2019, Algorithm 1), we simply write the following in algorithms.py:

    class GroupDRO(ERM):
        def __init__(self, input_shape, num_classes, num_domains, hparams):
            super().__init__(input_shape, num_classes, num_domains, hparams)
            self.register_buffer("q", torch.Tensor())

        def update(self, minibatches):
            device = "cuda" if minibatches[0][0].is_cuda else "cpu"
            if not len(self.q):
                self.q = torch.ones(len(minibatches)).to(device)
            losses = torch.zeros(len(minibatches)).to(device)
            for m in range(len(minibatches)):
                x, y = minibatches[m]
                losses[m] = F.cross_entropy(self.predict(x), y)
                self.q[m] *= (self.hparams["dro_eta"] * losses[m].data).exp()
            self.q /= self.q.sum()
            loss = torch.dot(losses, self.q) / len(minibatches)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            return {'loss': loss.item()}

    ALGORITHMS.append('GroupDRO')

By inheriting from ERM, the new GroupDRO class has access to a default classifier .network, optimizer .optimizer, and prediction method .predict(x). Finally, we tell DOMAINBED about the default value and hyperparameter search distribution of the hyperparameter of this new algorithm, by adding the following to the function hparams in hparams_registry.py:

    hparams['dro_eta'] = (1e-2, 10 ** random_state.uniform(-3, -1))

Adding a new dataset is similar; for example:

    class MyDataset(MultipleEnvironmentImageFolder):
        ENVIRONMENTS = ['Env1', 'Env2', 'Env3']
        def __init__(self, root, test_envs, augment=True):
            self.dir = os.path.join(root, "MyDataset/")
            super().__init__(self.dir, test_envs, augment)

DATASETS.append('MyDataset')

We are now ready to train our new algorithm on our new dataset, using the second domain as test:

    python train.py --model DRO --dataset MyDataset --data_dir /root --test_envs 1 \
        --hparams '{"dro_eta": 0.2}'

Finally, we can run a fully automated sweep over all datasets, algorithms, test domains, and model selection criteria by simply invoking python sweep.py, after extending the file command_launchers.py to your computing infrastructure. When the sweep finishes, the script collect_results.py automatically generates all the result tables shown in this manuscript.

Rotated MNIST results, per rotation domain:

Algorithm | 0          | 15         | 30         | 45         | 60         | 75         | Avg
ERM       | 95.9 ± 0.1 | 98.9 ± 0.0 | 98.8 ± 0.0 | 98.9 ± 0.0 | 98.9 ± 0.0 | 96.4 ± 0.0 | 98.0
IRM       | 95.5 ± 0.1 | 98.8 ± 0.2 | 98.7 ± 0.1 | 98.6 ± 0.1 | 98.7 ± 0.0 | 95.9 ± 0.2 | 97.7
GroupDRO  | 95.6 ± 0.1 | 98.9 ± 0.1 | 98.9 ± 0.1 | 99.0 ± 0.0 | 98.9 ± 0.0 | 96.5 ± 0.2 | 98.0
Mixup     | 95.8 ± 0.3 | 98.9 ± 0.0 | 98.9 ± 0.0 | 98.9 ± 0.0 | 98.8 ± 0.1 | 96.5 ± 0.3 | 98.0
MLDG      | 95.8 ± 0.1 | 98.9 ± 0.1 | 99.0 ± 0.0 | 98.9 ± 0.1 | 99.0 ± 0.0 | 95.8 ± 0.3 | 97.9
CORAL     | 95.8 ± 0.3 | 98.8 ± 0.0 | 98.9 ± 0.0 | 99.0 ± 0.0 | 98.9 ± 0.1 | 96.4 ± 0.2 | 98.0
MMD       | 95.6 ± 0.1 | 98.9 ± 0.1 | 99.0 ± 0.0 | 99.0 ± 0.0 | 98.9 ± 0.0 | 96.0 ± 0.2 | 97.9
DANN      | 95.0 ± 0.5 | 98.9 ± 0.1 | 99.0 ± 0.0 | 99.0 ± 0.1 | 98.9 ± 0.0 | 96.3 ± 0.2 | 97.8
CDANN     | 95.7 ± 0.2 | 98.8 ± 0.0 | 98.9 ± 0.1 | 98.9 ± 0.1 | 98.9 ± 0.1 | 96.1 ± 0.3 | 97.9
MTL       | 95.6 ± 0.1 | 99.0 ± 0.1 | 99.0 ± 0.0 | 98.9 ± 0.1 | 99.0 ± 0.1 | 95.8 ± 0.2 | 97.9
SagNet    | 95.9 ± 0.3 | 98.9 ± 0.1 | 99.0 ± 0.1 | 99.1 ± 0.0 | 99.0 ± 0.1 | 96.3 ± 0.1 | 98.0
ARM       | 96.7 ± 0.2 | 99.1 ± 0.0 | 99.0 ± 0.0 | 99.0 ± 0.1 | 99.1 ± 0.1 | 96.5 ± 0.4 | 98.2
VREx      | 95.9 ± 0.2 | 99.0 ± 0.1 | 98.9 ± 0.1 | 98.9 ± 0.1 | 98.7 ± 0.1 | 96.2 ± 0.2 | 97.9
RSC       | 94.8 ± 0.5 | 98.7 ± 0.1 | 98.8 ± 0.1 | 98.8 ± 0.0 | 98.9 ± 0.1 | 95.9 ± 0.  |



3.1. THREE MODEL SELECTION METHODS FOR DG

Having made broad recommendations, we review and justify three model selection criteria for DG. Appendix B.3 illustrates these with a specific example.

Training-domain validation We split each training domain into training and validation subsets. We train models using the training subsets, and choose the model maximizing the accuracy on the union of validation subsets. This strategy assumes that the training and test examples follow similar distributions. For example, Ben-David et al. (2010) bound the test error of a classifier with the divergence between training and test domains.
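Training-domain validation can be sketched in a few lines. As before, `train` and `accuracy` are hypothetical stubs standing in for a training run and an accuracy measurement; the 80/20 split ratio follows the experimental protocol described in Section 5.

```python
import random

def training_domain_validation(domains, candidates, train, accuracy, seed=0):
    """Split each training domain 80/20, train each candidate on the union
    of the training splits, and pick the candidate with the best accuracy
    on the union of the validation splits."""
    rng = random.Random(seed)
    train_split, val_split = [], []
    for domain in domains:
        examples = list(domain)
        rng.shuffle(examples)
        cut = int(0.8 * len(examples))
        train_split.extend(examples[:cut])   # 80%: training
        val_split.extend(examples[cut:])     # 20%: model selection
    models = [train(hp, train_split) for hp in candidates]
    return max(models, key=lambda m: accuracy(m, val_split))
```

Note that the validation examples come from the training domains only; the test domain is never touched, which is what distinguishes this criterion from the oracle.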

In their pioneering work, Ganin et al. (2016) propose Domain-Adversarial Neural Networks (DANN), a domain adaptation technique which uses generative adversarial networks (GANs, Goodfellow et al. (2014)) to learn a feature representation that matches across training domains. Akuzawa et al. (2019) extend DANN to cases where there exists a statistical dependence between the domain and class label variables. Albuquerque et al. (2019) extend DANN by considering one-versus-all adversaries that try to predict to which training domain each example belongs. Li et al. (2018b) employ GANs and the maximum mean discrepancy criterion (Gretton et al., 2012) to align feature distributions across domains. Matsuura and Harada (2019) leverage clustering techniques to learn domain-invariant features even when the separation between training domains is not given. Li et al. (2018c;d) learn a feature transformation φ such that the conditional distributions P(φ(X^d) | Y^d = y) match for all training domains d and label values y. Shankar et al. (2018) use a domain classifier to construct adversarial examples for a label classifier, and use a label classifier to construct adversarial examples for the domain classifier. This results in a label classifier with better domain generalization. Li et al. (2019a) train a robust feature extractor and classifier. The robustness comes from (i) asking the feature extractor to produce features such that a classifier trained on domain d can classify instances from a different domain d′ ≠ d, and (ii) asking the classifier to predict labels on domain d using features produced by a feature extractor trained on a different domain d′ ≠ d. Li et al. (2020) adopt a lifelong learning strategy to attack the problem of domain generalization. Motiian et al.
(2017) learn a feature representation such that (i) examples from different domains but the same class are close, (ii) examples from different domains and classes are far, and (iii) training examples can be correctly classified. Ilse et al. (2019) train a variational autoencoder (Kingma and Welling, 2014) where the bottleneck representation factorizes knowledge about domain, class label, and residual variations in the input space. Fang et al. (2013) learn a structural SVM metric such that the neighborhood of each example contains examples from the same category and all training domains. The algorithms of Sun and Saenko (2016); Sun et al. (2016); Rahman et al. (2019a) match the feature covariance (second-order statistics) across training domains at some level of representation. The algorithms of Ghifary et al. (2016); Hu et al. (2019) use kernel-based multivariate component analysis to minimize the mismatch between training domains while maximizing class separability.
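The covariance-matching family of penalties admits a compact formulation. The following is a NumPy sketch in the spirit of CORAL (Sun and Saenko, 2016), not the authors' exact implementation: it penalizes the squared distance between the feature means and covariances of two domains.

```python
import numpy as np

def coral_penalty(features_a, features_b):
    """Squared distance between the feature means and covariances of two
    domains. Driving this toward zero aligns first- and second-order
    feature statistics across training domains, as in CORAL-style methods."""
    mu_a, mu_b = features_a.mean(axis=0), features_b.mean(axis=0)
    cov_a = np.cov(features_a, rowvar=False)  # features are columns
    cov_b = np.cov(features_b, rowvar=False)
    return ((mu_a - mu_b) ** 2).sum() + ((cov_a - cov_b) ** 2).sum()
```

In a deep network, the penalty is typically applied to the activations of an intermediate layer and added to the classification loss with a tunable weight.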

Peters et al. (2016); Rojas-Carulla et al. (2018) consider that one should search for features that lead to the same optimal classifier across training domains. In their pioneering work, Peters et al. (2016) link this type of invariance to the causal structure of data, and provide a basic algorithm, based on feature selection, to learn invariant linear models. Arjovsky et al. (2019) extend this idea to general gradient-based models, including neural networks, in their Invariant Risk Minimization (IRM) principle. Teney et al. (2020) build on IRM to learn a feature transformation that minimizes the relative variance of classifier weights across training datasets; the authors apply their method to reduce the learning of spurious correlations in Visual Question Answering (VQA) tasks. Ahuja et al. (2020) analyze IRM under a game-theoretic perspective to develop an alternative algorithm. Krueger et al. (2020) propose an approximation to the IRM problem consisting in reducing the variance of error averages across domains. Bouvier et al. (2019) attack the same problem as IRM by re-weighting data samples.

A.2 SHARING PARAMETERS

Blanchard et al. (2011) build classifiers f(x^d, µ_d), where µ_d is a kernel mean embedding (Muandet et al., 2017) that summarizes the dataset associated with the example x^d. Since the distributional identity of test instances is unknown, these embeddings are estimated from single test examples at test time. See Blanchard et al. (2017); Deshmukh et al. (2019) for theoretical results on this family of algorithms (only applicable when using RKHS-based learners). Zhang et al. (2020) extend Blanchard et al. (2011) with a separate CNN that computes the domain embedding, appended to the input image as additional channels. Khosla et al. (2012) learn one max-margin linear classifier w_d = w + ∆_d per domain d, from which they distill their final, invariant predictor w. Ghifary et al. (2015) use a multitask autoencoder to learn invariances across domains.
To achieve this, the authors assume that each training dataset contains the same examples; for instance, photographs of the same objects under different views. Mancini et al. (2018b) train a deep neural network with one set of dedicated batch-normalization layers (Ioffe and Szegedy, 2015) per training dataset; then, a softmax domain classifier predicts how to linearly combine the batch-normalization layers at test time. Seo et al. (2020) combine instance normalization with batch normalization to learn a normalization module per domain, enhancing out-of-distribution generalization. Similarly, Mancini et al. (2018a) learn a softmax domain classifier used to linearly combine domain-specific predictors at test time. D'Innocente and Caputo (2018) explore more sophisticated ways of aggregating domain-specific predictors. Li et al. (2017) extend Khosla et al. (2012) to deep neural networks by extending each of their parameter tensors with one additional dimension, indexed by the training domains, and set to a neutral value to predict domain-agnostic test examples. Ding and Fu (2017) implement parameter-tying and low-rank reconstruction losses to learn a predictor that relies on common knowledge across training domains. Hu et al. (2016); Sagawa et al. (2019) weight the importance of the minibatches from the training distributions proportionally to their error. Chattopadhyay et al. (2020) overlay multiple weight masks over a single network to learn domain-invariant and domain-specific features.

A.3 META-LEARNING

Li et al. (2018a) employ Model-Agnostic Meta-Learning, or MAML (Finn et al., 2017), to build a predictor that learns how to adapt fast between training domains. Dou et al. (2019) use a similar MAML strategy, together with two regularizers that encourage features from different domains to respect inter-class relationships, and to be compactly clustered by class labels. Li et al.
(2019b) extend the MAML meta-learning strategy to instances of domain generalization where the categories vary from domain to domain. Balaji et al. (2018) use MAML to meta-learn a regularizer encouraging the model trained on one domain to perform well on another domain.

A.4 AUGMENTING DATA

Data augmentation is an effective strategy to address domain generalization (Zhang et al., 2019). Unfortunately, how to design efficient data augmentation routines depends on the type of data at hand, and demands a significant amount of work from human experts. Xu et al. (2019); Yan et al. (2020); Wang et al. (2020b) use mixup (Zhang et al., 2018) to blend examples from the different training distributions. Carlucci et al. (2019a) construct an auxiliary classification task aimed at solving jigsaw puzzles of image patches; the authors show that this self-supervised learning task learns features that improve domain generalization. Similarly, Wang et al. (2020a) use metric learning and self-supervised learning to augment the out-of-distribution performance of an image classifier. Albuquerque et al. (2020) introduce the self-supervised task of predicting responses to Gabor filter banks, in order to learn more transferable features. Wang et al. (2019) remove textural information from images to improve domain generalization. Volpi et al. (2018) show that training with adversarial data augmentation on a single domain is sufficient to improve domain generalization. Nam et al. (2019) promote representations of data that ignore image style and focus on content. Rahman et al. (2019b); Zhou et al. (2020); Carlucci et al. (2019a) are three alternatives that use GANs to augment the data available during training. Representation Self-Challenging (Huang et al., 2020) learns robust neural networks by iteratively dropping out important features. Hendrycks et al. (2020)
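The cross-domain mixup strategy mentioned above can be sketched in a few lines. This is a NumPy sketch of the general recipe of Zhang et al. (2018) applied across domains, not any of the cited authors' exact implementations; labels are assumed to be one-hot encoded:

```python
import numpy as np

def interdomain_mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Blend a minibatch from one training domain with one from another:
    the model is trained on convex combinations of both inputs and labels,
    with the mixing weight drawn from a Beta(alpha, alpha) distribution."""
    rng = rng or np.random.default_rng(0)
    lam = float(rng.beta(alpha, alpha))  # mixing weight in [0, 1]
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b    # soft labels from both domains
    return x, y, lam
```

Because the blended pairs interpolate between domains, the resulting decision boundaries tend to vary more smoothly across the training distributions.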

Figure 1: Data configuration for a benchmark with four domains A, B, C, D, where the test domain is D. We shuffle and divide the data from each domain into a big split and a small split.

Our ERM baseline outperforms the state-of-the-art in terms of average domain generalization performance, even when picking the best competitor per dataset.

Learning setups. L^d and U^d denote the labeled and unlabeled distributions from domain d.

  Setup                            | Training data                  | Test data
  Multi-task learning              | L^1, ..., L^{d_tr}             | U^1, ..., U^{d_tr}
  Continual (or lifelong) learning | L^1, ..., L^∞                  | U^1, ..., U^∞
  Domain adaptation                | L^1, ..., L^{d_tr}, U^{d_tr+1} | U^{d_tr+1}
  Transfer learning                | U^1, ..., U^{d_tr}, L^{d_tr+1} | U^{d_tr+1}
  Domain generalization            | L^1, ..., L^{d_tr}             | U^{d_tr+1}

DG accuracy for all algorithms, datasets and model selection criteria in DOMAINBED. These experiments compare fourteen popular DG algorithms across seven benchmarks in the exact same conditions, showing the competitive performance of ERM.

Ablation study on ERM showing the impact of (i) using raw images versus data augmentation, and (ii) using ResNet-18 versus ResNet-50 models. Model selection: training-domain validation set.

This table compiles the best out-of-distribution test accuracies reported across a decade of domain generalization research.

Previous state-of-the-art in the literature of domain generalization.

of feature distributions.

9. Invariant Risk Minimization (IRM, Arjovsky et al. (2019)) learns a feature representation φ(X^d) such that the optimal linear classifier on top of that representation matches across domains.

10. Risk Extrapolation (VREx, Krueger et al. (2020)) approximates IRM with a variance penalty.

11. Marginal Transfer Learning (MTL, Blanchard et al. (2011; 2017)) estimates a mean embedding per domain, passed as a second argument to the classifier.

12. Adaptive Risk Minimization (ARM, Zhang et al. (2020)) extends MTL with a separate embedding CNN.

13. Style-Agnostic Networks (SagNets, Nam et al. (2019)) learn neural networks by keeping image content and randomizing style.

14. Representation Self-Challenging (RSC, Huang et al. (2020)) learns robust neural networks by iteratively discarding (challenging) the most activated features.

The label is a noisy function of the digit and color, such that color bears correlation d with the label and the digit bears correlation 0.75 with the label. This dataset contains 70,000 examples of dimension (2, 28, 28) and 2 classes.

2. Rotated MNIST (Ghifary et al., 2015) is a variant of MNIST where domain d ∈ {0, 15, 30, 45, 60, 75} contains digits rotated by d degrees. Our dataset contains 70,000 examples of dimension (1, 28, 28) and 10 classes.

3. PACS (Li et al., 2017) comprises four domains d ∈ {art, cartoons, photos, sketches}. This dataset contains 9,991 examples of dimension (3, 224, 224) and 7 classes.

4. VLCS (Fang et al., 2013) comprises photographic domains d ∈ {Caltech101, LabelMe, SUN09, VOC2007}. This dataset contains 10,729 examples of dimension (3, 224, 224) and 5 classes.

5. OfficeHome (Venkateswara et al., 2017) includes domains d ∈ {art, clipart, product, real}. This dataset contains 15,588 examples of dimension (3, 224, 224) and 65 classes.

6. Terra Incognita (Beery et al., 2018) contains photographs of wild animals taken by camera traps at locations d ∈ {L100, L38, L43, L46}.
Our version of this dataset contains 24,788 examples of dimension (3, 224, 224) and 10 classes.

7. DomainNet (Peng et al., 2019) has six domains d ∈ {clipart, infograph, painting, quickdraw, real, sketch}.

B.3 MODEL SELECTION CRITERIA, ILLUSTRATED

Consider Figure 1, and let T_i = {A_i, B_i, C_i} for i ∈ {1, 2}. Training-domain validation trains each hyperparameter configuration on T_1 and chooses the configuration with the highest performance on T_2. Leave-one-out validation trains one clone F_Z of each hyperparameter configuration on T_1 \ Z, for Z ∈ T_1; then, it chooses the configuration with the highest Σ_{Z ∈ T_1} Performance(F_Z, Z). Test-domain validation trains each hyperparameter configuration on T_1 and chooses the configuration with the highest performance on D_2, looking only at its final epoch. Finally, result tables show the performance of the selected models on D_1.
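The leave-one-out criterion can be sketched generically; as before, `train_fn` and `accuracy_fn` are hypothetical hooks standing in for a real training loop:

```python
def select_by_leave_one_out(configs, train_domains, train_fn, accuracy_fn):
    """Leave-one-domain-out selection: for each hyperparameter config, train
    one clone per held-out training domain Z on the remaining domains, score
    that clone on Z, and keep the config with the best summed score."""
    def total_score(cfg):
        score = 0.0
        for i, held_out in enumerate(train_domains):
            # T_1 \ Z: pool every training domain except the held-out one
            rest = [ex for j, dom in enumerate(train_domains)
                    if j != i for ex in dom]
            clone = train_fn(cfg, rest)          # the clone F_Z
            score += accuracy_fn(clone, held_out)
        return score
    return max(configs, key=total_score)
```

Note the cost: with n training domains, each configuration is trained n times, which is why this criterion is considerably more expensive than training-domain validation.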

Extension to UDA One can use DOMAINBED to perform experimentation on unsupervised domain adaptation by extending the .update(minibatches) methods to accept unlabeled examples from the test domain.
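A minimal sketch of such an extension follows. The `Algorithm` base class and the bookkeeping inside `update` are illustrative assumptions, not the actual DOMAINBED API; a real implementation would use the unlabeled minibatches in, e.g., an entropy or feature-alignment penalty:

```python
class Algorithm:
    """Hypothetical stand-in for a DOMAINBED-style algorithm base class."""
    def update(self, minibatches):
        raise NotImplementedError

class UDAAlgorithm(Algorithm):
    """Sketch of the UDA extension: update() additionally accepts
    unlabeled minibatches drawn from the test domain."""
    def __init__(self):
        self.labeled_seen = 0
        self.unlabeled_seen = 0

    def update(self, minibatches, unlabeled=None):
        # minibatches: list of (x, y) pairs, one per training domain
        self.labeled_seen += sum(len(x) for x, _ in minibatches)
        # unlabeled: input-only minibatches from the test domain; here we
        # only count them, a real method would add an adaptation loss
        if unlabeled is not None:
            self.unlabeled_seen += sum(len(x) for x in unlabeled)
        return {"labeled": self.labeled_seen,
                "unlabeled": self.unlabeled_seen}
```

Keeping `unlabeled` as an optional keyword argument lets the same training loop drive both the DG and UDA settings.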

[Appendix tables C.2–C.3: per-dataset, per-domain test accuracies (± standard errors) for each of the fourteen algorithms (ERM, IRM, GroupDRO, Mixup, MLDG, CORAL, MMD, DANN, CDANN, MTL, SagNet, ARM, VREx, RSC); sections C.2.8 and C.3.8 report per-benchmark averages.]

