LOST DOMAIN GENERALIZATION IS A NATURAL CONSEQUENCE OF LACK OF TRAINING DOMAINS

Anonymous

Abstract

We show a hardness result for the number of training domains required to achieve a small population error in the test domain. Although many domain generalization algorithms have been developed under various domain-invariance assumptions, there is significant evidence that the out-of-distribution (o.o.d.) test accuracy of state-of-the-art o.o.d. algorithms is on par with empirical risk minimization and random guessing on domain generalization benchmarks such as DomainBed. In this work, we analyze the cause and attribute the lost domain generalization to the lack of training domains. We show, via a minimax lower bound, that any learning algorithm that outputs a classifier with an ϵ excess error over the Bayes optimal classifier requires at least poly(1/ϵ) training domains, even when the number of training samples drawn from each training domain is large. Experiments on the DomainBed benchmark demonstrate that o.o.d. test accuracy increases monotonically with the number of training domains. Our result sheds light on the intrinsic hardness of domain generalization and suggests benchmarking o.o.d. algorithms on datasets with a sufficient number of training domains.

1. INTRODUCTION

Domain generalization, where the training distribution differs from the test distribution (Mahajan et al., 2021; Dou et al., 2019; Yang et al., 2021; Bui et al., 2021; Robey et al., 2021; Wald et al., 2021; Recht et al., 2019), has been a central research topic in machine learning (Blanchard et al., 2021; Chuang et al., 2020; Zhou et al., 2021), computer vision (Piratla et al., 2020; Gan et al., 2016; Huang et al., 2021; Song et al., 2019; Taori et al., 2020), and natural language processing (Wang et al., 2021; Fried et al., 2019). In machine learning, the study of domain generalization has led to significant advances in the development of new algorithms for out-of-distribution (o.o.d.) generalization (Li et al., 2022b; Bitterwolf et al., 2022; Thulasidasan et al., 2021). In computer vision and natural language processing, new benchmarks such as DomainBed (Gulrajani & Lopez-Paz, 2021) and WILDs (Koh et al., 2021; Sagawa et al., 2021) have been built toward closing the gap between the developed methodology and real-world deployment. In both cases, the problem can be stated as follows: given a set of training domains {P_e}_{e=1}^E drawn from a domain distribution P, and, for each domain P_e, a set of training data {(x_i^e, y_i^e)}_{i=1}^n drawn from P_e, the goal is to develop an algorithm based on the training data and their domain labels e that, in expectation, performs well on unseen test domains drawn from P. Despite progress on domain generalization, many fundamental questions remain unresolved. For example, in search of lost domain generalization, Gulrajani & Lopez-Paz (2021) conducted extensive experiments using DomainBed and found that, when carefully implemented, empirical risk minimization (ERM) shows state-of-the-art performance across all datasets, even though many of the competing algorithms are carefully designed for out-of-distribution tasks.
For example, when an algorithm is trained on the "+90%"¹ and "+80%" domains of the ColoredMNIST dataset (Arjovsky et al., 2019) and tested on the "-90%" domain, the best-known o.o.d. algorithm achieves test accuracy no better than random guessing under all three model selection methods in Gulrajani & Lopez-Paz (2021). Thus, it is natural to ask what causes the lost domain generalization and how to recover it.
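To make the setting concrete, the setup above can be sketched as a toy simulation: each training domain P_e is drawn from a meta-distribution P (here, a hypothetical random mean shift on a Gaussian mixture), ERM pools data from E training domains, and accuracy is measured on an unseen test domain. The distributions and the logistic-regression learner are illustrative assumptions, not the benchmark's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain():
    # A "domain" is a distribution P_e; drawing one from the meta-distribution P
    # is modeled here as drawing a random mean shift mu_e (toy assumption).
    mu = rng.normal(0.0, 1.0, size=2)
    def sample(n):
        y = rng.integers(0, 2, size=n)  # labels in {0, 1}
        # First coordinate carries the label signal; mu shifts the whole domain.
        x = rng.normal(0.0, 1.0, size=(n, 2)) + mu + np.outer(2 * y - 1, [1.0, 0.0])
        return x, y
    return sample

def erm_logistic(x, y, steps=500, lr=0.1):
    # Plain ERM: logistic regression fit by gradient descent on the pooled data.
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-x @ w))
        w -= lr * x.T @ (p - y) / len(y)
    return w

E, n = 5, 200  # E training domains, n samples from each
train = [sample_domain()(n) for _ in range(E)]
x_pool = np.vstack([x for x, _ in train])
y_pool = np.concatenate([y for _, y in train])
w = erm_logistic(x_pool, y_pool)

x_test, y_test = sample_domain()(1000)  # unseen test domain drawn from P
acc = np.mean((x_test @ w > 0) == (y_test == 1))
print(f"o.o.d. test accuracy: {acc:.2f}")
```

Because the test domain carries its own random shift mu, the pooled-ERM classifier can degrade on it even when in-domain accuracy is high, which is the gap the paper studies.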



¹ The number refers to the degree of correlation between color and label.



Table 1: The number of domains in the o.o.d. benchmarks WILDs (Koh et al., 2021; Sagawa et al., 2021) and DomainBed (Gulrajani & Lopez-Paz, 2021). Most of the datasets in the two benchmarks contain only a small number of domains, which might not be sufficient to learn a classifier with good domain generalization.

In this paper, we attribute the lost domain generalization to the lack of training domains. Our study is motivated by the observation that off-the-shelf benchmarks often contain few training domains. For example, the number of training domains in DomainBed (Gulrajani & Lopez-Paz, 2021) is at most 6 across all of its 7 datasets; in WILDs (Koh et al., 2021; Sagawa et al., 2021), 7 out of 10 datasets have fewer than 350 training domains (see Table 1). Therefore, one may conjecture that increasing the number of training domains might significantly improve the empirical performance of existing domain generalization algorithms. In this paper, we show that, information-theoretically, any learning algorithm requires at least Ω(1/ϵ²) training domains in order to achieve a small excess error ϵ. This is in sharp contrast to many existing benchmarks, in which the number of training domains is limited.

Risk Extrapolation (V-REx) (Krueger et al., 2021) proposes to reduce differences in risk across training domains. Derivative Invariant Risk Minimization (DIRM) (Bellot & van der Schaar, 2020) maintains the invariance of the gradient of training risks across different domains. Another line of research uses different metrics to tackle the o.o.d. problem. For example, Maximum Mean Discrepancy-Adversarial AutoEncoder (Li et al., 2018b) employs generative adversarial networks and the maximum mean discrepancy metric (Gretton et al., 2012) to align feature distributions across domains. Mixture of Multiple Latent Domains (Matsuura & Harada, 2020) learns domain-invariant features by clustering techniques, without knowing which domain the training samples belong to.
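The risk-variance idea behind reducing risk differences across training domains can be illustrated with a minimal sketch. The per-domain risk values and the penalty weight beta below are toy assumptions; this shows only the shape of a variance-penalized objective, not the cited authors' implementation.

```python
import numpy as np

def variance_penalized_objective(per_domain_risks, beta=10.0):
    # Mean training risk plus a penalty on the variance of the per-domain
    # risks, which pushes the learner toward equal risk across domains.
    risks = np.asarray(per_domain_risks, dtype=float)
    return risks.mean() + beta * risks.var()

# Toy example: three training domains with unequal vs. equal empirical risks.
print(variance_penalized_objective([0.20, 0.25, 0.90]))  # variance adds a large penalty
print(variance_penalized_objective([0.45, 0.45, 0.45]))  # zero variance, no penalty
```

In practice the per-domain risks would be differentiable losses of a shared model, so the variance term backpropagates into the model parameters; here plain numbers keep the sketch self-contained.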
Recently, Meta-Learning Domain Generalization (Li et al., 2020) employs a lifelong learning method to tackle the sequential problem of newly incoming domains. To explore the o.o.d. problem, one line of research focuses on the case where only one training domain is accessible. The Causal Semantic Generative model (CSG) (Liu et al., 2021) uses two sets of correlated latent variables, i.e., semantic and non-semantic features, to model the relation between the data and the corresponding labels. Under their assumption, the semantic features relate the data to their labels, while the non-semantic features only affect the generation of the data. CSG decouples the semantic and non-semantic features to improve o.o.d. generalization given only one training domain. However, recent work (Gulrajani & Lopez-Paz, 2021) claims that no existing algorithm captures the true invariant feature and observes that their performance is on par with ERM and random guessing on several datasets. In this paper, to explain why this occurs, we theoretically analyze the o.o.d. generalization problem and provide a minimax lower bound on the number of training domains required to achieve a small population error in the test domain. Massart & Nédélec (2006a) proved that at least Ω(1/ϵ²) samples from a distribution are required to estimate the success probability of a Bernoulli variable with an ϵ error. Motivated by this, we observe a similar phenomenon and prove that learning algorithms need at least Ω(1/ϵ²) training domains. Recently, a
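The Ω(1/ϵ²) estimation rate invoked above can be illustrated numerically: the error of the empirical estimate of a Bernoulli success probability shrinks like 1/√n, so quadrupling the sample size roughly halves the error. This is a toy simulation of the rate, not the paper's lower-bound proof; the value p = 0.6 and the trial counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.6  # true success probability of the Bernoulli variable (toy choice)

def estimation_error(n, trials=2000):
    # Average |p_hat - p| over many repetitions, each using n samples.
    samples = rng.random((trials, n)) < p
    return np.mean(np.abs(samples.mean(axis=1) - p))

for n in [100, 400, 1600]:  # each quadrupling of n should roughly halve the error
    print(n, estimation_error(n))
```

Reading the ~1/√n decay backwards gives the hardness statement: achieving error ϵ requires on the order of 1/ϵ² samples, and the paper's result plays the same game with training domains in place of samples.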

