LOST DOMAIN GENERALIZATION IS A NATURAL CONSEQUENCE OF LACK OF TRAINING DOMAINS

Anonymous

Abstract

We prove a hardness result on the number of training domains required to achieve a small population error in the test domain. Although many domain generalization algorithms have been developed under various domain-invariance assumptions, there is significant evidence that the out-of-distribution (o.o.d.) test accuracy of state-of-the-art o.o.d. algorithms is on par with empirical risk minimization and with random guessing on domain generalization benchmarks such as DomainBed. In this work, we analyze the cause and attribute the lost domain generalization to the lack of training domains. We show, via a minimax lower bound, that any learning algorithm that outputs a classifier with ϵ excess error over the Bayes optimal classifier requires at least poly(1/ϵ) training domains, even when the number of training examples sampled from each training domain is large. Experiments on the DomainBed benchmark demonstrate that o.o.d. test accuracy increases monotonically with the number of training domains. Our result sheds light on the intrinsic hardness of domain generalization and suggests benchmarking o.o.d. algorithms on datasets with a sufficient number of training domains.

1. INTRODUCTION

Domain generalization (Mahajan et al., 2021; Dou et al., 2019; Yang et al., 2021; Bui et al., 2021; Robey et al., 2021; Wald et al., 2021; Recht et al., 2019) - where the training distribution differs from the test distribution - has been a central research topic in machine learning (Blanchard et al., 2021; Chuang et al., 2020; Zhou et al., 2021), computer vision (Piratla et al., 2020; Gan et al., 2016; Huang et al., 2021; Song et al., 2019; Taori et al., 2020), and natural language processing (Wang et al., 2021; Fried et al., 2019). In machine learning, the study of domain generalization has led to significant advances in the development of new algorithms for out-of-distribution (o.o.d.) generalization (Li et al., 2022b; Bitterwolf et al., 2022; Thulasidasan et al., 2021). In computer vision and natural language processing, new benchmarks such as DomainBed (Gulrajani & Lopez-Paz, 2021) and WILDS (Koh et al., 2021; Sagawa et al., 2021) have been built toward closing the gap between the developed methodology and real-world deployment. In both cases, the problem can be stated as follows: given a set of training domains {P_e}_{e=1}^E drawn from a domain distribution P, and, for each domain, a set of training data {(x_i^e, y_i^e)}_{i=1}^n drawn from P_e, the goal is to develop an algorithm, based on the training data and their domain labels e, that in expectation performs well on unseen test domains drawn from P. Despite progress on domain generalization, many fundamental questions remain unresolved. For example, in search of lost domain generalization, Gulrajani & Lopez-Paz (2021) conducted extensive experiments on DomainBed and found that, when carefully implemented, empirical risk minimization (ERM) shows state-of-the-art performance across all datasets, even though many of the competing algorithms are carefully designed for out-of-distribution tasks.
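The problem setup above can be made concrete with a minimal sketch of pooled ERM across training domains. The toy domain family here (a 1-d two-class Gaussian problem whose class means are offset by a domain-specific shift) is our own hypothetical construction for illustration, not the paper's setting, and the simple threshold classifier stands in for ERM.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(shift, n=500):
    """One toy domain P_e: classes centered at -1 + shift and +1 + shift
    (hypothetical construction, not the paper's setting)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=(2.0 * y - 1.0) + shift, scale=1.0)
    return x, y

# E training domains, each drawn from a "domain distribution" over shifts.
E = 5
train_shifts = rng.normal(0.0, 0.3, size=E)
train_domains = [sample_domain(s) for s in train_shifts]

# Pooled ERM: merge all domains and fit a single threshold classifier.
x_all = np.concatenate([x for x, _ in train_domains])
y_all = np.concatenate([y for _, y in train_domains])
threshold = 0.5 * (x_all[y_all == 0].mean() + x_all[y_all == 1].mean())

# Evaluate on an unseen test domain drawn from the same domain distribution.
x_test, y_test = sample_domain(rng.normal(0.0, 0.3))
test_acc = ((x_test > threshold).astype(int) == y_test).mean()
```

The point of the sketch is only the data-flow: the learner sees several domains and their labels, pools them, and is judged on a fresh domain from the same domain distribution.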
For example, when an algorithm is trained on the "+90%"[1] and "+80%" domains of the ColoredMNIST dataset (Arjovsky et al., 2019) and is tested on the "-90%" domain, the best-known o.o.d. algorithm achieves test accuracy no better than random guessing under all three model selection methods in Gulrajani & Lopez-Paz (2021). Thus, it is natural to ask: what causes the lost domain generalization, and how can it be found?
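The ColoredMNIST correlation structure can be sketched synthetically, without the actual MNIST images: in the construction of Arjovsky et al. (2019), the label carries 25% noise, and the color agrees with the noisy label with a per-domain probability (90% in "+90%", 10% in "-90%"). The binary labels below are drawn at random rather than derived from digits, so this is a stand-in for the real pipeline, not a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(0)

def colored_domain(color_noise, n=10000):
    """Synthetic stand-in for one ColoredMNIST domain: color agrees with
    the 25%-noisy label with probability 1 - color_noise."""
    y = rng.integers(0, 2, size=n)                       # true binary label
    y_noisy = np.where(rng.random(n) < 0.25, 1 - y, y)   # 25% label noise
    color = np.where(rng.random(n) < color_noise, 1 - y_noisy, y_noisy)
    return color, y

# "+90%" training domain: color matches the noisy label 90% of the time.
color_train, y_train = colored_domain(0.1)
# "-90%" test domain: the color-label correlation is reversed.
color_test, y_test = colored_domain(0.9)

# A classifier that predicts the label from color alone looks good in
# training but fails badly when the spurious correlation flips at test time.
acc_train = (color_train == y_train).mean()
acc_test = (color_test == y_test).mean()
```

This is exactly the failure mode the benchmark probes: a color-based predictor is rewarded on the training domains and punished on the test domain, so beating random guessing there requires ignoring the spurious feature.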

[1] The number refers to the degree of correlation between color and label.

