IN SEARCH OF LOST DOMAIN GENERALIZATION

Abstract

The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions (datasets, network architectures, and model selection criteria) render fair comparisons difficult. The goal of this paper is to understand how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks, and we argue that algorithms without a model selection criterion remain incomplete. Next, we implement DOMAINBED, a testbed for domain generalization including seven benchmarks, fourteen algorithms, and three model selection criteria. When conducting extensive experiments using DOMAINBED, we find that, when carefully implemented and tuned, ERM outperforms the state-of-the-art in terms of average performance. Furthermore, no algorithm included in DOMAINBED outperforms ERM by more than one point when evaluated under the same experimental conditions. We hope that the release of DOMAINBED, alongside contributions from fellow researchers, will streamline reproducible and rigorous advances in domain generalization.

1. INTRODUCTION

Machine learning systems often fail to generalize out-of-distribution, crashing in spectacular ways when tested outside the domain of training examples (Torralba and Efros, 2011). The overreliance of learning systems on the training distribution manifests widely. For instance, self-driving car systems struggle to perform under conditions different from those of training, including variations in light (Dai and Van Gool, 2018), weather (Volk et al., 2019), and object poses (Alcorn et al., 2019). As another example, systems trained on medical data collected in one hospital do not generalize to other health centers (Castro et al., 2019; AlBadawy et al., 2018; Perone et al., 2019; Heaven, 2020).

Arjovsky et al. (2019) suggest that failing to generalize out-of-distribution is failing to capture the causal factors of variation in data, clinging instead to easier-to-fit spurious correlations prone to change across domains. Examples of spurious correlations commonly absorbed by learning machines include racial biases (Stock and Cisse, 2018), texture statistics (Geirhos et al., 2018), and object backgrounds (Beery et al., 2018). Alas, the capricious behaviour of machine learning systems out-of-distribution is a roadblock to their deployment in critical applications.

Aware of this problem, the research community has spent significant efforts during the last decade to develop algorithms able to generalize out-of-distribution. In particular, the literature in Domain Generalization (DG) assumes access to multiple datasets during training, each of them containing examples about the same task, but collected under a different domain or experimental condition (Blanchard et al., 2011; Muandet et al., 2013). The goal of DG algorithms is to incorporate the invariances across these training domains into a classifier, in hopes that such invariances will also hold in novel test domains.
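The DG setup above, and the ERM baseline at the center of our claims, can be illustrated with a minimal sketch: pool the examples from several training domains, fit a single classifier by empirical risk minimization, and evaluate it on an unseen test domain. The two-feature synthetic task, the logistic learner, and the particular domain shifts below are illustrative assumptions, not the paper's experimental configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(shift, n=200):
    """Synthetic binary task: the label signal is stable across domains,
    while each domain adds its own mean shift (a domain-specific nuisance)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2))
    X[:, 0] += 2.0 * y - 1.0   # invariant, label-correlated feature
    X[:, 1] += shift           # domain-specific shift, uncorrelated with y
    return X, y

# Three training domains and one unseen test domain.
train = [make_domain(s) for s in (-1.0, 0.0, 1.0)]
X_test, y_test = make_domain(5.0)

# ERM ignores domain labels: pool everything and minimize average loss.
X = np.vstack([Xd for Xd, _ in train])
y = np.concatenate([yd for _, yd in train])

w, b = np.zeros(2), 0.0
for _ in range(500):  # plain gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

acc = ((X_test @ w + b > 0).astype(int) == y_test).mean()
print(f"test-domain accuracy: {acc:.2f}")
```

Because the shifted feature carries no label information in the pooled data, even plain ERM puts most of its weight on the invariant feature here; the hard cases motivating DG research are ones where the nuisance feature *does* correlate with the label during training.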
Different DG solutions assume different types of invariances, and propose algorithms to estimate them from data. Despite the enormous importance of DG, the literature is scattered: a plethora of different algorithms appear yearly, each of them evaluated under different datasets, neural network architectures, and model selection criteria. Borrowing from the success of standardized computer vision benchmarks (Russakovsky et al., 2015), the purpose of this work is to perform a rigorous comparison of DG algorithms, as well as to open-source our software for anyone to replicate and extend our analyses. This manuscript investigates the question: How useful are different DG algorithms when evaluated in a consistent and realistic setting? To answer this question, we implement and tune fourteen DG algorithms carefully, and compare them across seven benchmark datasets and three model selection criteria. There are three major takeaways from our investigations:

• Claim 1: A careful implementation of ERM outperforms the state-of-the-art in terms of average performance across common benchmarks (Table 1, full list in Appendix A.5).

• Claim 2: When implementing fourteen DG algorithms in a consistent and realistic setting, no competitor outperforms ERM by more than one point (Table 3).

• Claim 3: Model selection is non-trivial for DG, yet it affects results (Table 3). As such, we argue that DG algorithms should specify their own model selection criteria.

As a result of our research, we release DOMAINBED, a framework to streamline rigorous and reproducible experimentation in DG. Using DOMAINBED, adding a new algorithm or dataset is a matter of a few lines of code. A single command runs all experiments, performs all model selections, and auto-generates all the result tables included in this work. DOMAINBED is a living project: we welcome pull requests from fellow researchers to update the available algorithms, datasets, model selection criteria, and result tables.
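Claim 3 argues that a DG algorithm is incomplete without a model selection criterion. One natural criterion is training-domain validation: hold out a validation split from each training domain and pick the hyperparameters with the best average validation accuracy across those splits. The sketch below shows only the selection logic; `val_accuracy` is a hypothetical stand-in for a full training run, and the hyperparameter grid and scores are fabricated for illustration, not taken from DOMAINBED:

```python
# Illustrative hyperparameter grid (not DOMAINBED's actual search space).
candidates = [{"lr": lr, "wd": wd} for lr in (1e-4, 1e-3) for wd in (0.0, 1e-4)]

def val_accuracy(hparams, domain):
    """Hypothetical stand-in for: train with `hparams` on the training splits
    of all training domains, then evaluate on the held-out validation split
    of `domain`. Scores are made up; a real run would train a network."""
    base = {(1e-4, 0.0): 0.70, (1e-4, 1e-4): 0.72,
            (1e-3, 0.0): 0.78, (1e-3, 1e-4): 0.80}[(hparams["lr"], hparams["wd"])]
    return base + 0.01 * domain  # small per-domain variation

training_domains = [0, 1, 2]

def select(candidates):
    """Training-domain validation: average held-out accuracy over the
    validation splits of every *training* domain; the test domain is
    never touched during selection."""
    scores = {i: sum(val_accuracy(h, d) for d in training_domains)
                 / len(training_domains)
              for i, h in enumerate(candidates)}
    return candidates[max(scores, key=scores.get)]

best = select(candidates)
print(best)  # -> {'lr': 0.001, 'wd': 0.0001}
```

The key property is that the criterion uses only training-domain data, so it is available in realistic deployments; criteria that peek at test-domain labels give an oracle upper bound rather than a practical selection rule.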
Section 2 kicks off our exposition with a review of the DG setup. Section 3 discusses the difficulties of model selection in DG and makes recommendations for a path forward. Section 4 introduces DOMAINBED, describing the features included in the initial release. Section 5 discusses the experimental results of running the entire DOMAINBED suite, illustrating the competitive performance of ERM and the importance of model selection criteria. Finally, Section 6 offers our view on future research directions in DG. Appendix A reviews one hundred articles spanning a decade of research in DG, summarizing the experimental performance of over thirty algorithms.



Table 1: Our ERM baseline outperforms the state-of-the-art in terms of average domain generalization performance, even when picking the best competitor per dataset.

