LOST DOMAIN GENERALIZATION IS A NATURAL CONSEQUENCE OF LACK OF TRAINING DOMAINS

Anonymous

Abstract

We show a hardness result for the number of training domains required to achieve a small population error in the test domain. Although many domain generalization algorithms have been developed under various domain-invariance assumptions, there is significant evidence that the out-of-distribution (o.o.d.) test accuracy of state-of-the-art o.o.d. algorithms is on par with empirical risk minimization and random guessing on domain generalization benchmarks such as DomainBed. In this work, we analyze the cause and attribute the lost domain generalization to the lack of training domains. We show, in a minimax lower bound fashion, that any learning algorithm that outputs a classifier with an ϵ excess error over the Bayes optimal classifier requires at least poly(1/ϵ) training domains, even when the number of training data points sampled from each training domain is large. Experiments on the DomainBed benchmark demonstrate that o.o.d. test accuracy increases monotonically with the number of training domains. Our result sheds light on the intrinsic hardness of domain generalization and suggests benchmarking o.o.d. algorithms on datasets with a sufficient number of training domains.

1. INTRODUCTION

Domain generalization, where the training distribution differs from the test distribution (Mahajan et al., 2021; Dou et al., 2019; Yang et al., 2021; Bui et al., 2021; Robey et al., 2021; Wald et al., 2021; Recht et al., 2019), has been a central research topic in machine learning (Blanchard et al., 2021; Chuang et al., 2020; Zhou et al., 2021), computer vision (Piratla et al., 2020; Gan et al., 2016; Huang et al., 2021; Song et al., 2019; Taori et al., 2020), and natural language processing (Wang et al., 2021; Fried et al., 2019). In machine learning, the study of domain generalization has led to significant advances in new algorithms for out-of-distribution (o.o.d.) generalization (Li et al., 2022b; Bitterwolf et al., 2022; Thulasidasan et al., 2021). In computer vision and natural language processing, new benchmarks such as DomainBed (Gulrajani & Lopez-Paz, 2021) and WILDs (Koh et al., 2021; Sagawa et al., 2021) have been built toward closing the gap between the developed methodology and real-world deployment. In both cases, the problem can be stated as follows: given a set of training domains {P_e}_{e=1}^E drawn from a domain distribution P and, for each domain, a set of training data {(x_i^e, y_i^e)}_{i=1}^n drawn from P_e, the goal is to develop an algorithm based on the training data and their domain labels e that, in expectation, performs well on unseen test domains drawn from P.

Despite progress on domain generalization, many fundamental questions remain unresolved. For example, in search of lost domain generalization, Gulrajani & Lopez-Paz (2021) conducted extensive experiments using DomainBed and found that, when carefully implemented, empirical risk minimization (ERM) shows state-of-the-art performance across all datasets, even though many competing algorithms are carefully designed for out-of-distribution tasks.
For example, when the algorithm is trained on the "+90%" and "+80%" domains of the ColoredMNIST dataset (Arjovsky et al., 2019) and is tested on the "-90%" domain, the best-known o.o.d. algorithm achieves test accuracy no better than a random-guess algorithm under all three model selection methods in Gulrajani & Lopez-Paz (2021). Thus, it is natural to ask what causes the lost domain generalization and how to find it.

Table 1: The number of domains in the o.o.d. benchmarks WILDs (Koh et al., 2021; Sagawa et al., 2021) and DomainBed (Gulrajani & Lopez-Paz, 2021). It shows that most of the datasets in the two benchmarks have a small number of domains, which might not be sufficient to learn a classifier with good domain generalization.

In this paper, we attribute the lost domain generalization to the lack of training domains. Our study is motivated by the observation that off-the-shelf benchmarks often contain few training domains. For example, the number of training domains in DomainBed (Gulrajani & Lopez-Paz, 2021) is at most 6 across all of its 7 datasets; in WILDs (Koh et al., 2021; Sagawa et al., 2021), 7 out of 10 datasets have fewer than 350 training domains (see Table 1). Therefore, one may conjecture that increasing the number of training domains might significantly improve the empirical performance of existing domain generalization algorithms. In this paper, we show that, information-theoretically, any learning algorithm requires at least poly(1/ϵ) training domains in order to achieve a small excess error ϵ. This is in sharp contrast to many existing benchmarks, in which the number of training domains is limited.
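The two-stage sampling protocol above (domains drawn from P, then data drawn from each P_e) can be sketched on a toy problem. The setup below (a threshold classifier with per-domain label noise, and pooled ERM over a grid of thresholds) is our own illustrative assumption, not the paper's construction:

```python
import random

random.seed(0)

def sample_domain():
    # Stage 1: draw a domain from the domain distribution P; here a
    # domain is parameterized by a single label-flip probability.
    return random.uniform(0.0, 0.4)

def sample_data(flip_prob, n):
    # Stage 2: draw n labeled points (x_i^e, y_i^e) from the domain P_e.
    data = []
    for _ in range(n):
        x = random.random()
        y_true = 1 if x > 0.5 else 0
        y = y_true if random.random() > flip_prob else 1 - y_true
        data.append((x, y))
    return data

def pooled_erm(datasets):
    # ERM on the pooled data: pick the threshold classifier with the
    # lowest empirical 0-1 loss over a fixed grid of thresholds.
    pooled = [pt for d in datasets for pt in d]
    best_t, best_err = None, float("inf")
    for t in [i / 20 for i in range(21)]:
        err = sum((1 if x > t else 0) != y for x, y in pooled) / len(pooled)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

E, n = 5, 200
domains = [sample_domain() for _ in range(E)]
threshold = pooled_erm([sample_data(q, n) for q in domains])
```

With enough pooled data the learned threshold sits near the Bayes-optimal 0.5; the question studied in this paper is how the number of domains E, rather than the per-domain sample size n, controls generalization to unseen domains.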

2. RELATED WORK

Out-of-distribution (o.o.d.) generalization (Hendrycks & Dietterich, 2019; Shankar et al., 2018; Zhou et al., 2021) has received extensive attention in recent years. One representative approach is the causal modelling inspired by Invariant Risk Minimization (IRM) (Arjovsky et al., 2019). IRM tries to learn an invariant feature representation that captures the underlying causal mechanism of interest across domains, such that a classifier based on this invariant representation is invariant across all domains. Given multiple training domains, IRM learns invariant representations approximately by adding a regularization term. The results of IRM indicate that failing to generalize to o.o.d. data stems from failing to capture the causal factors of variation across domains. Following IRM, Risk Extrapolation (REx) (Krueger et al., 2021) proposes to reduce differences in risk across training domains. Derivative Invariant Risk Minimization (DIRM) (Bellot & van der Schaar, 2020) maintains the invariance of the gradient of the training risks across different domains. Another line of research uses different metrics to tackle the o.o.d. problem. For example, Maximum Mean Discrepancy-Adversarial AutoEncoder (Li et al., 2018b) employs Generative Adversarial Networks and the maximum mean discrepancy metric (Gretton et al., 2012) to align different feature distributions. Mixture of Multiple Latent Domains (Matsuura & Harada, 2020) learns domain-invariant features with clustering techniques, without knowing which domain the training samples belong to. Recently, Meta-Learning Domain Generalization (Li et al., 2020) employs a lifelong learning method to tackle the sequential problem of newly incoming domains. To explore the o.o.d. problem, one line of research focuses on the case where only one training domain is accessible.
Causal Semantic Generative model (CSG) (Liu et al., 2021) uses two sets of correlated latent variables, i.e., semantic and non-semantic features, to model the relation between the data and the corresponding labels. Under their assumption, the semantic features relate the data to their labels, while the non-semantic features only affect the generation of the data. CSG decouples the semantic and non-semantic features to improve o.o.d. generalization given only one training domain. However, recent work (Gulrajani & Lopez-Paz, 2021) claims that no existing algorithm captures the true invariant features and observes that their performance is on par with ERM and random guessing on several datasets. In this paper, to explain why this occurs, we theoretically analyze the o.o.d. generalization problem and provide a minimax lower bound on the number of training domains required to achieve a small population error in the test domain. Massart & Nédélec (2006a) proved that at least Ω(1/ϵ²) samples from a distribution are required to estimate the success probability of a Bernoulli variable with an ϵ error. Motivated by this, we observe a similar phenomenon and prove that learning algorithms need at least Ω(1/ϵ²) training domains. Recently, a concurrent work (Li et al., 2022a) presents an upper bound on the expected excess error of the ERM algorithm using Rademacher complexity. Similarly, Blanchard et al. (2021) give a high-probability upper bound on the excess error of general learning algorithms and show that the sample size of each domain is inversely proportional to the excess error. In contrast, while previous work (Li et al., 2022a; Blanchard et al., 2021) showed positive results for domain generalization, we present a negative result (i.e., a lower bound on the number of training domains) on the expected excess error for all possible learning algorithms.

3. MINIMAX LOWER BOUND FOR DOMAIN GENERALIZATION

In this section, we provide a minimax lower bound for domain generalization. Our results lower bound the number of training domains required for good o.o.d. generalization.

Notation. We use bold capital letters such as X to represent a random vector, bold lower-case letters such as x to represent a realization of a random vector, capital letters such as Y to represent a random variable, and lower-case letters such as y to represent a realization of a random variable. Specifically, we denote by X the random vector of the instance, by x a realization of X, by Y the random variable of the label, and by y ∈ {0, 1} a realization of Y. We use L(f) to represent the expected 0-1 loss of a classifier f w.r.t. the mixture of the data distributions of all domains, i.e., L(f) = Pr_{(X,Y)}(f(X) ≠ Y). Throughout the paper, we use P to represent the distribution over distributions (i.e., the domain distribution), P_e to represent the data distribution of the e-th domain, and (x^e, y^e) to represent data sampled from the e-th domain P_e. We call e ∈ {1, 2, 3, ...} the domain labels, which are accessible to the learner.

Problem setup. In our hard instance, we view the e-th domain as a data distribution P_e given by Pr(X, Y | B_e = b_e), where e is the domain label and the B_e's are i.i.d. Bernoulli random vectors that parameterize the data distribution of the e-th domain. We regard Pr(X, Y | B_{e_1}) and Pr(X, Y | B_{e_2}) as two different domains whenever e_1 ≠ e_2. We assume that each domain is sampled from a domain distribution P (i.e., the distribution of B_e), and that the data in the e-th domain are sampled from the data distribution P_e given by Pr(X, Y | B_e = b_e). Let f* be the Bayes optimal classifier of the mixture of data distributions across all domains, and assume f* ∈ F, where F can be any function class, such as deep neural networks.
For any h ∈ [0, 1], we define a class of domain distributions by P(h, F) := {P : |2 Pr(B_e = 1) − 1| ≥ h}. Note that the margin parameter h controls the randomness of a domain: a large h (e.g., h = 1) means that Pr(B_e = 1) is bounded away from 1/2. We will investigate the following minimax risk:

R_{E,n}(h, F) := inf_{f̂_{E,n} ∈ F} sup_{P} E_{P_e ∼ P} E_{(X^e, Y^e) ∼ P_e} [ L(f̂_{E,n}) − L(f*) ],   (1)

where E is the number of training domains, n is the number of training samples from each domain, and the two expectations are taken over the sampling of the training domains and of the training data used to learn f̂_{E,n}. The minimax problem in Equation (1) characterizes the excess risk of the best learning algorithm with access to E training domains and n data samples per domain under the worst-case domain distribution. Let V be the VC dimension of F, defined as the maximum number of points that can be arranged so that F shatters them. Our main result is as follows:

Theorem 1. For n = ∞, any h ∈ [0, 1], and any E ≥ V, we have the lower bound

R_{E,∞}(h, F) ≥ c · min{ (V − 1)/(Eh), √((V − 1)/E) },

where c > 0 is an absolute constant.

We defer the proof of Theorem 1 to Appendix A. The theorem lower bounds the number of training domains required to achieve a small population error, even when one can sample as many data points as desired from each domain. The case n = ∞ captures the "easiest" setting for the learner, in which the learning algorithm has full knowledge of each training domain. The case of finite n is harder, as the learner has only partial knowledge of each training domain; indeed, for general n ≥ 1, R_{E,n}(h, F) ≥ R_{E,∞}(h, F). Theorem 1 implies that, information-theoretically, any learning algorithm over F requires at least poly(V/ϵ) training domains in order to achieve a small excess error ϵ. This is in sharp contrast to many existing benchmarks, in which the number of training domains is limited (see Table 1).
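To get a quantitative feel for Theorem 1, one can solve for the number of domains E that drives the right-hand side of the bound below a target excess error ϵ. The constant c is unspecified in the theorem, so the sketch below sets c = 1 purely for illustration, and `domains_needed` is our own helper name:

```python
import math

def lower_bound(E, V, h, c=1.0):
    # Right-hand side of Theorem 1 (with the unspecified absolute
    # constant c set to 1 for illustration).
    return c * min((V - 1) / (E * h), math.sqrt((V - 1) / E))

def domains_needed(eps, V, h, c=1.0):
    # Smallest E (found by doubling from E = V) whose lower bound
    # does not exceed the target excess error eps.
    E = V
    while lower_bound(E, V, h, c) > eps:
        E *= 2
    return E

# Halving the target excess error roughly doubles the number of
# required domains on the (V - 1)/(E h) branch, and quadruples it
# on the sqrt((V - 1)/E) branch.
V, h = 11, 0.5
required = [domains_needed(eps, V, h) for eps in (0.2, 0.1, 0.05)]
```

The doubling search is crude but suffices here, since the bound is monotonically decreasing in E.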
For example, in the celebrated ColoredMNIST dataset (Arjovsky et al., 2019), there are only 2 training domains. When the algorithm is trained on the "+90%" and "+80%" domains and is tested on the "-90%" domain, the best-known o.o.d. algorithm achieves test accuracy no better than random guessing under all three model selection methods in Gulrajani & Lopez-Paz (2021). Theorem 1 predicts the failures of future algorithms on these datasets and attributes the poor performance of existing o.o.d. algorithms to the lack of training domains.

Our proof technique is inspired by Massart & Nédélec (2006b). The major differences include: 1) the construction of the hard instance, i.e., a two-stage data generative procedure, and 2) the strategy of splitting the hard problem into two sub-problems (see Figure 1). These two aspects are original and separate our contributions from previous work. For 1), our data generative model first samples E domains from the domain distribution P by generating the domain-specific label b_e for all e ∈ [E], and then samples the training data from each sampled domain. In contrast, Massart & Nédélec (2006b) considered a different scenario: they investigated the effect of the training sample size on the excess risk in the single-domain problem, where the training and test data are i.i.d. For 2), our proof has to handle two expectations, given our novel two-stage recovery strategy. Our approach splits the hard problem into two simpler problems, which estimate a binary string a and the strings b_e for all e ∈ [E], whereas Massart & Nédélec (2006b) considered only one binary string estimation problem. The two estimation problems are entangled, making our analysis more challenging. Finally, recall that R_{E,n}(h, F) ≥ R_{E,∞}(h, F) for any n ≥ 1, so the lower bound of Theorem 1 carries over to finite n.

4. EXPERIMENTS

Theorem 1 shows that any learning algorithm that outputs a classifier with an ϵ excess error over the Bayes optimal classifier requires at least poly(1/ϵ) training domains, even when the number of training data points sampled from each training domain is large. In this section, we complement our theoretical results with an empirical study evaluating the impact of the number of training domains.

4.1. DATASETS

We conducted extensive experiments on two datasets from DomainBed, i.e., ColoredMNIST (Arjovsky et al., 2019) and RotatedMNIST (Ghifary et al., 2015). We note that there are other popular domain generalization datasets, e.g., PACS (Li et al., 2017), VLCS (Fang et al., 2013), Office-Home (Venkateswara et al., 2017), and Terra Incognita (Beery et al., 2018). However, it is hard to synthetically generate more training domains for these datasets, as their data generation processes cannot be parameterized by a single variable (e.g., the correlation between color and label in ColoredMNIST, or the rotation degree in RotatedMNIST). Thus, we do not consider these datasets in our paper.

ColoredMNIST (Arjovsky et al., 2019) is a variant of the MNIST handwritten digit classification dataset (LeCun et al., 1998). It is a synthetic dataset containing three domains p_e ∈ {0.1, 0.2, 0.9}, colored either red or green, comprising 70,000 examples of dimension (2, 28, 28) and 2 classes. The label is a noisy function of the digit and color, such that the color bears correlation p_e with the label and the digit bears correlation 0.75 with the label. Following the protocol introduced in the DomainBed repository, we randomly split the training dataset into 10 subdatasets with equal numbers of training samples. Each domain of ColoredMNIST is generated as follows: 1) assign a preliminary binary label y′ to the image based on the digit: y′ = 0 for digits 0-4 and y′ = 1 for digits 5-9; 2) obtain the final label y by flipping y′ with probability 0.25; 3) sample the color id z by flipping y with probability p_e; 4) color the image red if z = 1 or green if z = 0. The only parameter of a training domain is p_e. We use the domain with p_e = 0.5 as the test domain and uniformly sample E parameters p_e from (0, 1) \ {0.5} to form E training domains. Each domain uses a random subset of the 10 subdatasets of the whole MNIST to generate its data.
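The four generation steps can be written down directly. The sketch below operates on digit labels only (no actual images), and the function name is our own:

```python
import random

def colored_mnist_label_color(digit, p_e, rng):
    # Follows the four generation steps described above, returning the
    # final label y and the sampled color (images are omitted).
    y_prelim = 0 if digit <= 4 else 1                      # step 1: binary label from digit
    y = y_prelim if rng.random() > 0.25 else 1 - y_prelim  # step 2: flip w.p. 0.25
    z = y if rng.random() > p_e else 1 - y                 # step 3: color id, flip w.p. p_e
    return y, ("red" if z == 1 else "green")               # step 4: color assignment

# In a strongly correlated domain (p_e = 0.1), the color agrees with the
# label 90% of the time, while the digit agrees only 75% of the time.
rng = random.Random(0)
samples = [colored_mnist_label_color(rng.randrange(10), 0.1, rng)
           for _ in range(10000)]
```

The spurious color feature is thus more predictive than the true digit feature within a training domain, which is precisely what makes the "-90%" test domain adversarial.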
RotatedMNIST (Ghifary et al., 2015) is a variant of MNIST in which each domain rotates the digit images by a fixed angle; the rotation degree is the single parameter of a domain.

4.2. ALGORITHMS

We evaluate 9 algorithms: ERM (Empirical Risk Minimization) (Vapnik, 1991), IRM (Invariant Risk Minimization) (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2020), Mixup (Xu et al., 2020), MLDG (Li et al., 2018a), CORAL (Sun & Saenko, 2016), MMD (Li et al., 2018b), DANN (Ganin et al., 2016), and C-DANN (Li et al., 2018c). The details of the algorithms are given in the appendix. For each algorithm, we use the default hyperparameters introduced in Section D.2 of DomainBed (Gulrajani & Lopez-Paz, 2021), as our goal is not to show the best performance of the algorithms but to show the correlation with our theoretical results. Following DomainBed (Gulrajani & Lopez-Paz, 2021), we use MUNIT (Table 4) for ColoredMNIST and RotatedMNIST.

4.3. EVALUATION SETTINGS

We train models using 9 different domain generalization algorithms with a varying number of training domains on ColoredMNIST and RotatedMNIST. Each trial is run with 5 different random seeds, and we report the average results. We use the code repository of DomainBed (Gulrajani & Lopez-Paz, 2021) with PyTorch (Paszke et al., 2019).

Model evaluation. Following DomainBed (Gulrajani & Lopez-Paz, 2021), we employ and adapt three different model selection methods, as shown below.

• Leave-one-domain-out cross-validation. E models are trained on the E training domains with the same hyperparameters, with each experiment holding out one of the training domains. Each model is evaluated on its held-out domain, and we choose the model that maximizes the accuracy on the held-out domain. This method assumes that training and test domains are drawn from a meta-distribution over domains and that our goal is to maximize the expected performance under this meta-distribution; it therefore corresponds to our data generation model. However, it requires substantial computational resources, so we only use this method to select models when the number of domains varies from 2 to 30.

• Training-domain validation set. Each training domain is split into training and validation subsets, and we choose the model that maximizes the accuracy on the union of the validation subsets.

• Test-domain validation set (oracle). We choose the model that maximizes the accuracy on a validation set that follows the distribution of the test domain. All models are trained for the same fixed number of steps, and the final checkpoints are used for evaluation. This method assumes that models have access to the test domain, which might not be possible in real-world applications.
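The leave-one-domain-out rule above can be sketched as a generic selection loop; `train_fn`/`eval_fn` and the toy majority-vote example are our own placeholders, not DomainBed's API:

```python
def leave_one_domain_out(domains, train_fn, eval_fn):
    # domains: list of per-domain datasets; train_fn(list of datasets) -> model;
    # eval_fn(model, dataset) -> accuracy. Trains one model per fold and
    # returns the model with the highest held-out-domain accuracy.
    best_model, best_acc = None, -1.0
    for held_out in range(len(domains)):
        train_sets = [d for i, d in enumerate(domains) if i != held_out]
        model = train_fn(train_sets)
        acc = eval_fn(model, domains[held_out])
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc

# Toy usage: each "domain" is a list of binary labels, and the "model"
# is simply the majority label of the pooled training domains.
toy_domains = [[1] * 8 + [0] * 2, [1] * 7 + [0] * 3, [1] * 9 + [0] * 1]
train_fn = lambda sets: max({0, 1}, key=sum(sets, []).count)
eval_fn = lambda model, data: sum(y == model for y in data) / len(data)
model, acc = leave_one_domain_out(toy_domains, train_fn, eval_fn)
```

Note that the computational cost grows linearly with E (one training run per held-out domain), which is why this method is only used for small numbers of domains.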

4.4. EXPERIMENTAL RESULTS ON COLOREDMNIST AND ROTATEDMNIST

We first present the average results on the two datasets using 9 algorithms with the number of training domains varying from 2 to 192, and then present the results with a limited number of domains. Due to space limitations, we present the most important results in the main paper and defer the remaining results to the Appendix.

4.4.1. EVALUATING THE EFFECT OF NUMBER OF TRAINING DOMAINS

Results. We run the experiments on ColoredMNIST and RotatedMNIST with ERM, IRM, GroupDRO, Mixup, MLDG, CORAL, MMD, DANN, and C-DANN, while the number of training domains varies from 2 to 192. The average accuracy w.r.t. the number of training domains is shown in Tables 2 and 3 in the main paper and in Tables 6 and 8 in the Appendix.

Training-domain validation set analysis. The results are shown in Tables 2 and 8 and Figure 5. We observe that the test accuracy of almost all the algorithms on both ColoredMNIST and RotatedMNIST is monotonically increasing as the number of training domains grows, while the accuracy of IRM on both datasets, MMD on ColoredMNIST, and DANN and C-DANN on RotatedMNIST experiences slight drops for certain numbers of training domains. We also find that the standard deviations of MMD are quite large, which might be due to the hyperparameter setting, as we did not tune the hyperparameters for the best performance. Besides, the standard deviations of all the algorithms in the first experiments (the smallest number of training domains) are quite large.

Test-domain validation set (oracle) analysis. Figure 4 and Tables 3 and 6 show the results using the oracle model selection method. Similar observations hold. The accuracy of ERM, DANN, C-DANN, CORAL, GroupDRO, and Mixup on ColoredMNIST and RotatedMNIST increases with the number of training domains, while there are fluctuations in the curves of MMD and IRM on both datasets, which might be because MMD and IRM are sensitive to the hyperparameters, which we did not tune for the best performance. The curve of IRM on RotatedMNIST drops slightly when the number of training domains exceeds 100. This might be caused by the limited number of training images n in our theorem: in that case, algorithms might not be able to extract general patterns and might learn biased information, which causes the performance drop.
Besides, as we only conduct 5 trials per experiment, the randomness of the experiments might be another reason why the performance of IRM on RotatedMNIST drops slightly. Overall, the results under the test-domain and training-domain validation set model selection methods are consistent, which supports our theoretical results.

4.4.2. EVALUATING ON THE LIMITED NUMBER OF TRAINING DOMAINS

As leave-one-domain-out cross-validation requires substantial computational resources, we only conduct experiments with the number of training domains ranging from 2 to 30 in steps of 2. The results are shown in Figure 2 in the main paper and in Figure 6, Table 9, and Table 10 in the Appendix. The reason we only choose even numbers is that we try to sample the domains evenly. For example, for ColoredMNIST with 3 training domains and the 0.5 domain as the test domain, we would have to sample two domains whose parameters are larger than 0.5 and only one smaller than 0.5 (or vice versa), which might cause a domain sampling drift and hence biased results.

Analysis of leave-one-domain-out cross-validation results. From the two tables and the figure, we conclude that the test accuracy of most algorithms increases with the number of training domains, with some exceptions, e.g., IRM and GroupDRO on ColoredMNIST. For the results on RotatedMNIST, we observe that they match our theoretical results well even without any hyperparameter tuning; in particular, the test accuracy of IRM on RotatedMNIST grows as the number of training domains increases. There are some fluctuations in the curves of GroupDRO, CORAL, and Mixup, which might be due to the randomness of the experiments, as we only conduct 5 runs per trial. On the other hand, the results on ColoredMNIST also increase with the number of training domains for all algorithms except IRM, and we observe fluctuations for almost every algorithm. This may be because the number of data points from each domain, n, is 1,000, which is much smaller than the 10,000 used for RotatedMNIST.
The limited number of data points from each domain might hinder the algorithms from learning enough patterns, so they may only capture biased patterns. For IRM, similarly, the sensitivity to hyperparameter tuning causes the accuracy to go up and down as the number of training domains increases, and one might have to spend substantial computational resources tuning it. Compared with the other two model selection methods, this one involves more trials, so the standard deviations are smaller.

4.5. ABLATION STUDY

Analysis of different architectures. To test how our theoretical results generalize to other neural network architectures, we further conduct experiments on ColoredMNIST (Table 4) with VGG11 (Simonyan & Zisserman, 2014) under the oracle model selection method. We use a learning rate of 5e-5 while keeping the other hyperparameters the same. The corresponding results are shown in Tables 15, 16, 17, and 18 and in the left two figures of Figure 3. A similar conclusion holds: the test accuracy still grows with the number of training domains even under a totally different architecture. However, we observe more fluctuation in the VGG11 curves, as we only conduct one trial for VGG11. There is a big "valley" around E = 100 in the ERM experiments on ColoredMNIST, which is unusual, as it is much larger than the other fluctuations; it might be caused by randomness or a hardware failure, as we only observed it once. We also conducted experiments with ResNet18 (He et al., 2016), but after trying different sets of hyperparameters, ResNet18 did not converge under any of them.

Analysis of the number n of data points per domain. To test the effect of the number of data points from each domain, we conduct experiments on ColoredMNIST using ERM and IRM with n from 1,000 to 20,000 under the oracle model selection method, where the original n is 7,000. The experimental results are shown in Tables 11, 12, 13, and 14 and in the right two figures of Figure 3. When n is relatively small compared with the original 7,000, especially when n = 1,000, the accuracy curve fluctuates considerably. Randomness may be the main reason, as we only conducted one trial for the ablation study; nevertheless, the overall trend of accuracy increasing with the number of training domains persists.

Discussion on E < V. When E < V, there would be many domains that the learning algorithm has never seen in the training phase.
Under this assumption, the lower bound on the excess error might be higher than the current result (Theorem 1), but we might still be able to draw a similar conclusion. The experimental results in Table 9, Table 10, and Figure 2 indicate that, even when the number of domains (fewer than 30) is small relative to the dimension of the training data, i.e., the case E < V, the performance still increases with the number of training domains E in most cases, which supports our theoretical results (Theorem 1).

5. CONCLUSION

In this paper, we investigated the out-of-distribution generalization problem and analyzed how many training domains are required to achieve a small population error in the test domain under reasonable assumptions. Our results theoretically characterize the phenomenon of lost domain generalization observed by Gulrajani & Lopez-Paz (2021). We showed, in a minimax lower bound fashion, that any learning algorithm with an ϵ excess error over the Bayes optimal classifier requires at least poly(1/ϵ) training domains, even when the number of training data points sampled from each training domain is large. Our findings correlate strongly with empirical results in prior work (Arjovsky et al., 2019; Liu et al., 2021; Krueger et al., 2021).

A PROOF OF THEOREM 1

A.1 PREPARATION

We first construct a "hard" instance over the supremum to lower bound the minimax problem, and then reduce it to a label recovery problem. We begin with the definition of the domain distribution and some lemmas; we assume h > 0 throughout the proof.

Definition 2. A domain distribution is said to satisfy the Massart noise condition with margin h if |2η(X) − 1| ≥ h with probability 1. We denote the set of domain distributions satisfying the Massart noise condition by P(h, F) = {P : |2η(X) − 1| ≥ h with probability 1}.

Lemma 3. For any classifier f : X → {0, 1}, any distribution P on X × {0, 1}, and the Bayes optimal classifier f* on P, we have

L(f) − L(f*) = E[ |2η(X) − 1| · |f(X) − f*(X)| ].

In particular, if P ∈ P(h, F), we have

L(f) − L(f*) ≥ h E|f(X) − f*(X)| = h ∥f − f*∥_{L1},

where the L1 norm is computed w.r.t. the distribution of X, i.e., ∥f − f*∥_{L1} = Σ_{x ∈ X} Pr(X = x) |f(x) − f*(x)| if X is drawn from a discrete distribution, or ∥f − f*∥_{L1} = ∫_{x ∈ X} p(x) |f(x) − f*(x)| dx if X is drawn from a continuous distribution.

Lemma 4.
Let f̂ ∈ F be the output of any learning algorithm. For a binary string β ∈ {0, 1}^{V−1}, let f*_β denote the classifier on x_1, x_2, ..., x_{V−1} with β = [f*_β(x_1), ..., f*_β(x_{V−1})], and let β̂ = argmin_{β′ ∈ {0,1}^{V−1}} ∥f*_{β′} − f̂∥_{L1} index the classifier of this form that is closest to f̂ in L1 norm w.r.t. the distribution of the sampled data x. Then, for any β ∈ {0, 1}^{V−1}, we have ∥f*_{β̂} − f*_β∥_{L1} ≤ 2 ∥f̂ − f*_β∥_{L1}.

The goal of f̂_{E,n} is to estimate the ground-truth labels a of the supporting data by observing and learning from the E · n data points sampled from the E domains. We observe that, given any domain distribution D ∈ P(h, F), the minimax risk R_{E,n}(h, F) in Equation (1) can be lower bounded as

R_{E,n}(h, F) ≥ inf_{f̂_{E,n} ∈ F} sup_{D ∈ P(h,F)} E_{P_e ∼ D} E_{(X^e, Y^e) ∼ P_e} [ L(f̂_{E,n}) − L(f*) ].

We will construct D such that its support (X, a) contains V − 1 data points (x_i, a_i), i ∈ [V − 1]. We denote by {x_i}_{i ∈ [V−1]} and {a_i}_{i ∈ [V−1]} the feature space and the label space, respectively. The label space of D can be indexed by the vertices of a binary hypercube, and the expected excess risk can be reduced to a label recovery problem. The feature support x_1, x_2, ..., x_V is shared by all the domains in D, and any function class with VC dimension at least V can shatter these data points.

Theorem 5. Let {P_θ : θ ∈ {0, 1}^m} be a collection of probability distributions on some set Z indexed by the vertices of the binary hypercube Θ = {0, 1}^m. Suppose that there exists some constant α > 0 such that

H²(P_θ, P_θ′) ≤ α   if d_H(θ, θ′) = 1.   (4)

Consider the problem of estimating the parameter θ ∈ Θ based on n i.i.d. observations from P_θ, where the loss is measured by the Hamming distance. Then the corresponding minimax risk, which we denote by M_n(Θ), is lower bounded as

M_n(Θ) ≥ (m/2) (1 − √(αn)).

A.2 CONSTRUCTING HARD INSTANCE

In this part, we lower bound the minimax problem by constructing a "hard" domain distribution D ∈ P(h, F). We first show how to sample E training domains from the domain set D. Then, we illustrate the procedure of sampling data points from domain e by picking the marginal distribution Pr_X of the feature X and specifying the conditional distributions Pr^e_Y of the binary label Y given X.

Theorem 7 (Reduction to a Label Recovery Problem). Given a set of domains D ∈ P(h, F) constructed as in Section A.2, the o.o.d. minimax problem can be reduced to an estimation problem on a binary hypercube whose vertices index the label space of D (i.e., a, the underlying labels of X), and the minimax risk R_{E,n}(h, F) in Equation (1) satisfies

R_{E,n}(h, F) ≥ (h/2) inf_{β̂_{E,n} ∈ {0,1}^{V−1}} max_{β ∈ {0,1}^{V−1}} E_{P_e ∼ D} E_{(X^e, Y^e) ∼ P_e} ∥f*_{β̂_{E,n}} − f*_β∥_{L1},

where the L1 norm is computed w.r.t. the distribution of X, and β̂_{E,n} and β are two binary strings.

Proof. We first apply Lemma 3. The o.o.d. generalization minimax risk R_{E,n}(h, F) in Equation (1) becomes

inf_{f̂_{E,n} ∈ F} sup_{D ∈ P(h,F)} E_{P_e ∼ D} E_{(X^e, Y^e) ∼ P_e} [ L(f̂_{E,n}) − L(f*) ]
≥ h inf_{f̂_{E,n} ∈ F} max_{β ∈ {0,1}^{V−1}} E_{P_e ∼ D} E_{(X^e, Y^e) ∼ P_e} ∥f̂_{E,n} − f*_β∥_{L1},

where the L1 norm is w.r.t. the distribution of samples Pr^e_X, and f̂_{E,n} and f*_β are, respectively, any classifier in F trained on the En training samples from the E domains and the Bayes optimal classifier. Next, applying Lemma 4 with f̂ = f̂_{E,n}, we have

h inf_{f̂_{E,n} ∈ F} max_{β ∈ {0,1}^{V−1}} E_{P_e ∼ D} E_{(X^e, Y^e) ∼ P_e} ∥f̂_{E,n} − f*_β∥_{L1}
≥ (h/2) inf_{β̂_{E,n} ∈ {0,1}^{V−1}} max_{β ∈ {0,1}^{V−1}} E_{P_e ∼ D} E_{(X^e, Y^e) ∼ P_e} ∥f*_{β̂_{E,n}} − f*_β∥_{L1},

where β̂_{E,n} = [f*_{β̂_{E,n}}(x_1), ..., f*_{β̂_{E,n}}(x_{V−1})] is the binary string indexing the element of {f*_{β̂} : β̂ ∈ {0,1}^{V−1}}. Hence, we have reduced the o.o.d. minimax problem to an estimation problem on a binary hypercube (i.e., a label recovery problem).
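The single-domain member Pr_β of the hard family can be sampled directly. The sketch below is our reading of the construction (mass p on each of the V − 1 support points, an anchor point absorbing the remaining mass, and Y | x_j ~ Bernoulli((1 + (2β_j − 1)h)/2)); the anchor handling is an assumption where the text is ambiguous:

```python
import random

def sample_from_P_beta(beta, p, h, rng):
    # One draw (feature index, label) from the hard instance Pr_beta:
    # each informative point x_j (j < V-1) has mass p, an anchor point
    # absorbs the remaining mass with a deterministic label, and
    # Y | x_j ~ Bernoulli((1 + (2*beta[j] - 1) * h) / 2).
    j = int(rng.random() // p)
    if j >= len(beta):                 # anchor point, deterministic label
        return len(beta), 0
    q = (1 + (2 * beta[j] - 1) * h) / 2
    return j, (1 if rng.random() < q else 0)

rng = random.Random(0)
beta = [1, 0, 1, 1, 0]   # ground-truth labels on the V - 1 support points
p, h = 0.1, 0.6          # per-point mass and Massart margin
draws = [sample_from_P_beta(beta, p, h, rng) for _ in range(20000)]
```

Each flipped bit of β moves the conditional from Bernoulli((1 − h)/2) to Bernoulli((1 + h)/2), which is exactly the pair whose Hellinger distance is bounded in Theorem 8.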
By definition, given $n = \infty$, we have the following results:
\[
\frac{h}{2} \inf_{\hat\beta_{E,\infty}} \max_{\beta} \mathbb E_{P_e \sim \mathcal D}\, \mathbb E_{(X^e, Y^e) \sim P_e} \big\| f^*_{\hat\beta_{E,\infty}} - f^*_\beta \big\|_{L_1}
\ge \frac{h}{2} \inf_{\hat\beta_{E,\infty}} \max_{\beta} \mathbb E_{P_e \sim \mathcal D} \big\| f^*_{\hat\beta_{E,\infty}} - f^*_\beta \big\|_{L_1}
= \frac{h}{2} \inf_{\hat\beta_{E,\infty}} \max_{\beta} \mathbb E_\beta \big\| f^*_{\hat\beta_{E,\infty}} - f^*_\beta \big\|_{L_1},
\]
where the infimum and maximum range over $\{0,1\}^{V-1}$, and $\mathbb E_\beta$ represents the expectation with respect to $\Pr_\beta$.

Now, we are ready to analyze the $L_1$ norm $\| f^*_{\hat\beta_{E,\infty}} - f^*_\beta \|_{L_1}$ for all $\hat\beta_{E,\infty}, \beta \in \{0,1\}^{V-1}$. By definition, we derive the following results:
\[
\big\| f^*_{\hat\beta_{E,\infty}} - f^*_\beta \big\|_{L_1}
= \sum_{j=1}^{V} \Pr{}_X(X = x_j)\, \big| f^*_{\hat\beta_{E,\infty}}(x_j) - f^*_\beta(x_j) \big|
= p \sum_{j=1}^{V-1} \big| \hat\beta_{E,\infty,j} - \beta_j \big|
= p \cdot d_H\big(\hat\beta_{E,\infty}, \beta\big),
\]
where $d_H$ is the Hamming distance, i.e., $d_H(\beta^1, \beta^2) = \sum_j |\beta^1_j - \beta^2_j|$, and $\beta^1_j$ and $\beta^2_j$ are the $j$-th entries of $\beta^1$ and $\beta^2$. The minimax problem thus becomes one of measuring the distance between the two strings $\hat\beta_{E,\infty}$ and $\beta$:
\[
\inf_{\hat f_{E,\infty} \in \mathcal F} \sup_{\mathcal D \in \mathcal P(h,\mathcal F)} \mathbb E_{P_e \sim \mathcal D}\, \mathbb E_{(X^e, Y^e) \sim P_e} \big[ L(\hat f_{E,\infty}) - L(f^*) \big]
\ge \frac{ph}{2} \inf_{\hat\beta_{E,\infty} \in \{0,1\}^{V-1}} \max_{\beta \in \{0,1\}^{V-1}} \mathbb E_\beta\, d_H\big(\hat\beta_{E,\infty}, \beta\big).
\]
To analyze this problem, we apply Theorem 5 and obtain the following theorem.

Theorem 8 (Minimax Bound). Given $H^2(\Pr_{\hat\beta_{E,\infty}}, \Pr_\beta) \le \alpha = 2p\big(1 - \sqrt{1 - h^2}\big) \le 2ph^2$, we have
\[
\mathbb E_{P_e \sim \mathcal D}\, \mathbb E_{(X^e, Y^e) \sim P_e} \big[ L(\hat f_{E,\infty}) - L(f^*) \big] \ge \frac{V-1}{54Eh},
\]
with $p \in (0, 1/(V-1)]$.

Proof. We need to upper-bound the squared Hellinger distance$^2$ $H^2(\Pr_{\hat\beta_{E,\infty}}, \Pr_\beta)$ for all $\hat\beta_{E,\infty}, \beta$ satisfying $d_H(\hat\beta_{E,\infty}, \beta) = 1$. Based on the definition of the squared Hellinger distance, we have
\[
H^2\big(\Pr{}_{\hat\beta_{E,\infty}}, \Pr{}_\beta\big)
= \sum_{i=1}^{V} \sum_{b \in \{0,1\}} \Big( \sqrt{\Pr{}_{\hat\beta_{E,\infty}}(x_i, b)} - \sqrt{\Pr{}_\beta(x_i, b)} \Big)^2
= p \sum_{i=1}^{V-1} H^2\Big( \mathrm{Bernoulli}\big(\tfrac{1 + (2\hat\beta_{E,\infty,i} - 1)h}{2}\big),\, \mathrm{Bernoulli}\big(\tfrac{1 + (2\beta_i - 1)h}{2}\big) \Big).
\]
For $j \in [V-1]$, the $j$-th term in the above summation is nonzero if and only if $\hat\beta_{E,\infty,j} \ne \beta_j$, in which case it equals the squared Hellinger distance between the $\mathrm{Bernoulli}\big(\tfrac{1-h}{2}\big)$ and $\mathrm{Bernoulli}\big(\tfrac{1+h}{2}\big)$ distributions. Thus,
\[
H^2\big(\Pr{}_{\hat\beta_{E,\infty}}, \Pr{}_\beta\big)
= p \cdot d_H\big(\hat\beta_{E,\infty}, \beta\big)\, H^2\Big( \mathrm{Bernoulli}\big(\tfrac{1-h}{2}\big), \mathrm{Bernoulli}\big(\tfrac{1+h}{2}\big) \Big)
= 2p \cdot d_H\big(\hat\beta_{E,\infty}, \beta\big) \Big( \sqrt{\tfrac{1-h}{2}} - \sqrt{\tfrac{1+h}{2}} \Big)^2
= 2p \cdot d_H\big(\hat\beta_{E,\infty}, \beta\big) \big( 1 - \sqrt{1 - h^2} \big).
\]
Applying Theorem 5 with $H^2(\Pr_{\hat\beta_{E,\infty}}, \Pr_\beta) = 2p\big(1 - \sqrt{1 - h^2}\big) = \alpha \le 2ph^2$ and $n = E$, we obtain
\[
\inf_{\hat f_{E,\infty}} \sup_{\mathcal D \in \mathcal P(h,\mathcal F)} \mathbb E_{P_e \sim \mathcal D}\, \mathbb E_{(X^e, Y^e) \sim P_e} \big[ L(\hat f_{E,\infty}) - L(f^*) \big]
\ge \frac{ph}{2} \inf_{\hat\beta_{E,\infty}} \sup_{\beta} \mathbb E_\beta\, d_H(\hat\beta_{E,\infty}, \beta)
\ge \frac{ph}{2} \cdot \frac{V-1}{2} \big(1 - \sqrt{\alpha E}\big)
\ge \frac{p(V-1)h}{4} \big(1 - \sqrt{2Eph^2}\big).
\]
Setting $p = \frac{2}{9h^2 E}$, so that $\sqrt{2Eph^2} = 2/3$, we now have
\[
\inf_{\hat f_{E,\infty}} \sup_{\mathcal D \in \mathcal P(h,\mathcal F)} \mathbb E_{P_e \sim \mathcal D}\, \mathbb E_{(X^e, Y^e) \sim P_e} \big[ L(\hat f_{E,\infty}) - L(f^*) \big] \ge \frac{V-1}{54Eh}.
\]
Next, we discuss the above theorem for different choices of $h$.
First, given $h \ge \sqrt{(V-1)/E}$, we have
\[
R_{E,\infty}(\mathcal F) \ge \frac{V-1}{54Eh}, \quad \text{if } h \ge \sqrt{(V-1)/E}.
\]
When $0 \le h < \sqrt{(V-1)/E}$, consider $\tilde h = \sqrt{(V-1)/E}$. As $\mathcal P(\tilde h, \mathcal F) \subseteq \mathcal P(h, \mathcal F)$, we have
\[
R_{E,\infty}(\mathcal F) \ge \frac{V-1}{54E\tilde h} = \frac{1}{54} \sqrt{\frac{V-1}{E}}, \quad \text{if } 0 \le h < \sqrt{(V-1)/E}.
\]
Combining the two cases of $h$ completes the proof.

$^2$ The squared Hellinger distance $H^2(P, Q)$ between $P$ and $Q$ is defined as $H^2(P,Q) = \frac{1}{2} \int_\lambda \Big( \sqrt{\tfrac{dP}{d\lambda}} - \sqrt{\tfrac{dQ}{d\lambda}} \Big)^2 d\lambda$.
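As a numerical sanity check on the two key quantities above (a sketch of ours, not part of the proof): the per-coordinate squared Hellinger distance $2(1 - \sqrt{1-h^2})$ is indeed bounded by $2h^2$, and inverting the final bound $R \ge (V-1)/(54Eh)$ shows that driving the excess error below $\epsilon$ requires $E \ge (V-1)/(54h\epsilon)$ training domains, i.e., poly$(1/\epsilon)$ many.

```python
import math

def hellinger_sq_bernoulli(h):
    """Unnormalized squared Hellinger distance between Bernoulli((1-h)/2)
    and Bernoulli((1+h)/2); the closed form is 2 * (1 - sqrt(1 - h^2))."""
    p1, p2 = (1 - h) / 2, (1 + h) / 2
    return (math.sqrt(p1) - math.sqrt(p2)) ** 2 + (math.sqrt(1 - p1) - math.sqrt(1 - p2)) ** 2

def domains_needed(V, h, eps):
    """Smallest E with (V-1)/(54*E*h) <= eps, i.e. E >= (V-1)/(54*h*eps)."""
    return math.ceil((V - 1) / (54 * h * eps))

for h in (0.1, 0.3, 0.5, 0.9):
    closed_form = 2 * (1 - math.sqrt(1 - h * h))
    assert math.isclose(hellinger_sq_bernoulli(h), closed_form, rel_tol=1e-12)
    assert closed_form <= 2 * h * h          # the alpha <= 2*p*h^2 step, after scaling by p

# Halving the target excess error doubles the required number of domains.
assert domains_needed(V=101, h=0.5, eps=0.01) == 371
assert domains_needed(V=101, h=0.5, eps=0.005) == 741
```

The hypothetical values of `V`, `h`, and `eps` are ours, chosen only to illustrate the $1/\epsilon$ scaling.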

B PROOFS OF USEFUL LEMMAS AND THEOREMS

B.1 PROOF OF LEMMA 3

For any classifier $f : \mathcal X \to \{0,1\}$ and any distribution $P$ on $\mathcal X \times \{0,1\}$, we have
\[
L(f) - L(f^*) = \mathbb E\big[ \mathbb 1(f(X) \ne Y) - \mathbb 1(f^*(X) \ne Y) \big] = \mathbb E\big[ |2\eta(X) - 1|\, |f(X) - f^*(X)| \big],
\]
where $\mathbb 1(\cdot)$ is the indicator function: $\mathbb 1(\text{cond}) = 1$ if cond holds and $\mathbb 1(\text{cond}) = 0$ otherwise. Assuming that $P \in \mathcal P(h, \mathcal F)$, we have
\[
L(f) - L(f^*) \ge h\, \mathbb E\big[ |f(X) - f^*(X)| \big] = h \|f - f^*\|_{L_1},
\]
where the $L_1$ norm is computed w.r.t. the distribution of the sampled data $x$, i.e., $\|f - f^*\|_{L_1} = \sum_x \Pr(X = x) |f(x) - f^*(x)|$.
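The excess-risk identity above can be checked exactly on a small discrete example. In the sketch below (our toy instance, not from the paper), $X$ takes three values, $\eta(x) = \Pr(Y = 1 \mid X = x)$ is fixed, and we compare $L(f) - L(f^*)$ with $\mathbb E[|2\eta(X)-1|\,|f(X)-f^*(X)|]$ for every classifier $f$ on the support.

```python
import itertools
import math

px = [0.5, 0.3, 0.2]           # marginal of X over three support points
eta = [0.9, 0.4, 0.7]          # eta(x) = Pr(Y = 1 | X = x)

def risk(f):
    """0-1 population risk of classifier f (a tuple of labels, one per point)."""
    return sum(p * (e if y == 0 else 1 - e) for p, e, y in zip(px, eta, f))

f_star = tuple(1 if e >= 0.5 else 0 for e in eta)   # Bayes optimal classifier

# The identity holds for every classifier on the support.
for f in itertools.product([0, 1], repeat=3):
    excess = risk(f) - risk(f_star)
    rhs = sum(p * abs(2 * e - 1) * abs(yf - ys)
              for p, e, yf, ys in zip(px, eta, f, f_star))
    assert math.isclose(excess, rhs, abs_tol=1e-12)
```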

B.2 PROOF OF LEMMA 4

Let $\hat f$ be any classifier in $\mathcal F$, and let $\tilde f$ be the member of $\{f^*_\beta : \beta \in \{0,1\}^{V-1}\}$ closest to $\hat f$ in $L_1$ norm, i.e., $\tilde f = \operatorname{argmin}_{f^*_{\beta'},\, \beta' \in \{0,1\}^{V-1}} \|\hat f - f^*_{\beta'}\|_{L_1}$, where $f^*_\beta$ is indexed by $\beta \in \{0,1\}^{V-1}$. For any $\beta \in \{0,1\}^{V-1}$, we have
\[
\|\tilde f - f^*_\beta\|_{L_1} \le \|\tilde f - \hat f\|_{L_1} + \|\hat f - f^*_\beta\|_{L_1} \le 2 \|\hat f - f^*_\beta\|_{L_1},
\]
where the first inequality is the triangle inequality and the second follows from the definition of $\tilde f$.

B.3 PROOF OF LEMMA 6

First, by Eq. (6), we have $|2\eta_{b^e}(x) - 1| \ge h$ for all $x \in \mathcal X$. Second, any hypothesis class with VC dimension larger than $V$ can shatter these data points, so there exists at least one $f \in \mathcal F$ such that $f^*_a(x) = f(x)$ for all $x \in \{x_1, \ldots, x_V\}$. Thus, $\mathcal D \subseteq \mathcal P(h, \mathcal F)$.

B.4 PROOF OF THEOREM 5

As the total variation distance can be both upper- and lower-bounded by the Hellinger distance, we have
\[
\tfrac{1}{2} H^2(P, Q) \le \|P - Q\|_{TV} \le H(P, Q).
\]
For any $\theta \in \Theta$, let $P^n_\theta$ denote the product of $n$ copies of $P_\theta$, i.e., the joint distribution of $n$ i.i.d. samples from $P_\theta$. For any two $\theta, \theta' \in \Theta$ with $d_H(\theta, \theta') = 1$, by letting $P = P^n_\theta$ and $Q = P^n_{\theta'}$, we have
\[
\|P^n_\theta - P^n_{\theta'}\|_{TV} \le H(P^n_\theta, P^n_{\theta'}).
\]
Besides, for any $n$ pairs of distributions $(P_{\theta,1}, P_{\theta',1}), \ldots, (P_{\theta,n}, P_{\theta',n})$, where $P_{\theta,i}$ and $P_{\theta',i}$ are copies of $P_\theta$ and $P_{\theta'}$, we have
\[
H(P^n_\theta, P^n_{\theta'}) = H(P_{\theta,1} \times \cdots \times P_{\theta,n},\, P_{\theta',1} \times \cdots \times P_{\theta',n}) \le \sqrt{ \sum_{i=1}^{n} H^2(P_{\theta,i}, P_{\theta',i}) }.
\]
With assumption (4) on the squared Hellinger distance, we have
\[
\|P^n_\theta - P^n_{\theta'}\|_{TV} \le \sqrt{ \sum_{i=1}^{n} H^2(P_{\theta,i}, P_{\theta',i}) } \le \sqrt{\alpha n}.
\]
Together with Lemma 11, the proof is completed.
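The product-measure step can be verified numerically for Bernoulli distributions using the exact relation $1 - \bar H^2(P^n, Q^n) = (1 - \bar H^2(P, Q))^n$ for the normalized squared Hellinger distance, which implies the subadditivity $\bar H^2(P^n, Q^n) \le n\,\bar H^2(P, Q)$ used above. A sketch under these conventions (ours, via brute-force enumeration of the product measure):

```python
import itertools
import math

def bern_pmf(p, x):
    """Bernoulli(p) probability mass at x in {0, 1}."""
    return p if x == 1 else 1 - p

def hellinger_sq(P, Q):
    """Normalized squared Hellinger distance between two finite pmfs
    (dicts outcome -> probability): H^2 = 1 - sum sqrt(P * Q)."""
    keys = set(P) | set(Q)
    return 1 - sum(math.sqrt(P.get(k, 0) * Q.get(k, 0)) for k in keys)

p, q, n = 0.3, 0.7, 4
P1 = {0: 1 - p, 1: p}
Q1 = {0: 1 - q, 1: q}
# Explicit n-fold product measures over {0, 1}^n.
Pn = {xs: math.prod(bern_pmf(p, x) for x in xs)
      for xs in itertools.product([0, 1], repeat=n)}
Qn = {xs: math.prod(bern_pmf(q, x) for x in xs)
      for xs in itertools.product([0, 1], repeat=n)}

h1, hn = hellinger_sq(P1, Q1), hellinger_sq(Pn, Qn)
assert math.isclose(hn, 1 - (1 - h1) ** n, rel_tol=1e-12)  # exact product rule
assert hn <= n * h1 + 1e-12                                # subadditivity used in the proof
```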

B.5 BACKGROUND AND LEMMAS

We begin this section with some definitions. The minimax risk $M(\Theta)$ is defined as
\[
M(\Theta) = \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb E_\theta\big[ d(\theta, \hat\theta(Z)) \big], \tag{16}
\]
where $\Theta$ is a parameter set and $\hat\theta = \hat\theta(Z)$ is an estimator recovering $\theta$ from the observation of a sample $Z$ drawn from an indexed set $\{P_\theta : \theta \in \Theta\}$ of probability distributions on a finite set $Z$. $\mathbb E_\theta$ represents the expectation with respect to $P_\theta$, i.e.,
\[
\mathbb E_\theta\big[ d(\theta, \hat\theta(Z)) \big] = \sum_{z \in Z} P_\theta(z)\, d(\theta, \hat\theta(z)).
\]
Besides, the distance $d(\cdot,\cdot) : \Theta \times \Theta \to \mathbb R_+$ is a pseudometric on $\Theta$ and satisfies the following three properties:
1. Symmetry: $d(\theta, \theta') = d(\theta', \theta)$, $\forall \theta, \theta' \in \Theta$;
2. Triangle inequality: $d(\theta, \theta') \le d(\theta, \theta^*) + d(\theta^*, \theta')$, $\forall \theta, \theta', \theta^* \in \Theta$;
3. Non-negativity: $d(\theta, \theta') \ge 0$, $\forall \theta, \theta' \in \Theta$.
The minimax risk (Eq. (16)) takes the infimum over all estimators $\hat\theta = \hat\theta(Z)$. In other words, it seeks an estimator $\hat\theta$ minimizing the worst-case risk $\sup_{\theta \in \Theta} \mathbb E_\theta[d(\theta, \hat\theta(Z))]$.

We next introduce the total variation distance.

Definition 9 (Total Variation Distance). For any two probability distributions $P, Q \in \mathcal P(Z)$, the total variation distance is
\[
\|P - Q\|_{TV} = \frac{1}{2} \sum_{z \in Z} |P(z) - Q(z)|,
\]
and it can equivalently be expressed as
\[
\|P - Q\|_{TV} = 1 - \sum_{z \in Z} \min\big(P(z), Q(z)\big).
\]

In the literature, there is an important lemma, the two-point method introduced by LeCam (1973), for obtaining lower bounds on the minimax risk.

Lemma 10 (LeCam's Lemma). For any $\theta, \theta' \in \Theta$ and any estimator $\hat\theta$, we have
\[
\mathbb E_\theta\big[ d(\theta, \hat\theta(Z)) \big] + \mathbb E_{\theta'}\big[ d(\theta', \hat\theta(Z)) \big]
\ge d(\theta, \theta') \sum_{z \in Z} \min\big(P_\theta(z), P_{\theta'}(z)\big)
= d(\theta, \theta') \big( 1 - \|P_\theta - P_{\theta'}\|_{TV} \big).
\]

Proof. Given a point $z \in Z$, assuming $P_{\theta'}(z) \ge P_\theta(z)$, we have
\[
P_\theta(z)\, d(\theta, \hat\theta(z)) + P_{\theta'}(z)\, d(\theta', \hat\theta(z))
= P_\theta(z)\big( d(\theta, \hat\theta(z)) + d(\theta', \hat\theta(z)) \big) + \big( P_{\theta'}(z) - P_\theta(z) \big)\, d(\theta', \hat\theta(z))
\ge P_\theta(z)\big( d(\theta, \hat\theta(z)) + d(\theta', \hat\theta(z)) \big)
\ge P_\theta(z)\, d(\theta, \theta').
\]
Similarly, if $P_\theta(z) > P_{\theta'}(z)$, we have
\[
P_\theta(z)\, d(\theta, \hat\theta(z)) + P_{\theta'}(z)\, d(\theta', \hat\theta(z)) \ge P_{\theta'}(z)\, d(\theta, \theta').
\]
In both cases the left-hand side is at least $\min(P_\theta(z), P_{\theta'}(z))\, d(\theta, \theta')$. Summing over $z \in Z$ and using the definition of the total variation distance completes the proof.
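LeCam's lemma can be instantiated exactly for two Bernoulli product measures: with a uniform prior over $\{\theta, \theta'\}$, the best achievable average error of any test equals $(1 - \|P^n_\theta - P^n_{\theta'}\|_{TV})/2$. The sketch below (ours) computes both sides by enumeration over the sufficient statistic, the number of ones among the $n$ observations.

```python
import math

def tv_bernoulli_products(p, q, n):
    """Total variation distance between Bernoulli(p)^n and Bernoulli(q)^n,
    computed over the sufficient statistic k = number of ones."""
    tv = 0.0
    for k in range(n + 1):
        pk = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        qk = math.comb(n, k) * q ** k * (1 - q) ** (n - k)
        tv += abs(pk - qk)
    return tv / 2

def bayes_error(p, q, n):
    """Minimum average error (uniform prior) of any test between the two
    product measures: (1/2) * sum over outcomes of min(P, Q)."""
    err = 0.0
    for k in range(n + 1):
        pk = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        qk = math.comb(n, k) * q ** k * (1 - q) ** (n - k)
        err += min(pk, qk)
    return err / 2

p, q, n = 0.4, 0.6, 5
assert math.isclose(bayes_error(p, q, n),
                    (1 - tv_bernoulli_products(p, q, n)) / 2, rel_tol=1e-12)
```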
Next, we introduce an important lemma.

Lemma 11. Suppose $\Theta = \{0,1\}^m$ and, for $\theta, \theta' \in \Theta$, consider the Hamming metric
\[
d_H(\theta, \theta') = \sum_{i \in [m]} |\theta_i - \theta'_i|,
\]
where $\theta_i$ and $\theta'_i$ are the $i$-th entries of $\theta$ and $\theta'$. Then, we can lower-bound the minimax problem as
\[
M(\Theta) \ge \frac{m}{2} \Big( 1 - \max_{d_H(\theta, \theta') = 1} \|P_\theta - P_{\theta'}\|_{TV} \Big).
\]

Proof. Let $\pi$ be the uniform distribution on $\Theta = \{0,1\}^m$ and, for each $i \in [m]$, let $\mu_i$ be the joint distribution of a random pair $(\theta, \theta') \in \Theta \times \Theta$ differing only in the $i$-th coordinate, such that the marginal distributions of both $\theta$ and $\theta'$ are equal to $\pi$. Then, the minimax risk can be lower-bounded as
\[
M(\Theta) \ge \inf_{\hat\theta} \mathbb E_\pi\big[ d_H(\theta, \hat\theta(Z)) \big]
= \inf_{\hat\theta} \sum_{i \in [m]} \mathbb E_\pi\big[ |\theta_i - \hat\theta_i(Z)| \big]
\ge \sum_{i \in [m]} \inf_{\hat\theta} \mathbb E_\pi\big[ |\theta_i - \hat\theta_i(Z)| \big]
\ge \frac{1}{2} \sum_{i \in [m]} \mathbb E_{\mu_i}\big[ d(\theta_i, \theta'_i) \big( 1 - \|P_\theta - P_{\theta'}\|_{TV} \big) \big].
\]
The first inequality holds because the supremum over $\theta$ dominates the average over the prior $\pi$, and the last follows from LeCam's lemma (Lemma 10). Next, since $d(\theta_i, \theta'_i) = 1$ under $\mu_i$ for all $i \in [m]$, we have
\[
M(\Theta) \ge \frac{1}{2} \sum_{i \in [m]} \mathbb E_{\mu_i}\big[ 1 - \|P_\theta - P_{\theta'}\|_{TV} \big]
\ge \frac{1}{2} \sum_{i \in [m]} \min_{\theta, \theta' : d_H(\theta, \theta') = 1} \big( 1 - \|P_\theta - P_{\theta'}\|_{TV} \big)
= \frac{m}{2} \Big( 1 - \max_{\theta, \theta' : d_H(\theta, \theta') = 1} \|P_\theta - P_{\theta'}\|_{TV} \Big).
\]

C EXPERIMENTS

Due to the limitation of space, we present the rest of the experiments in this part.

C.1 DETAILS OF ALGORITHMS

We include the following algorithms for the two multi-domain image classification tasks: • ERM (Vapnik, 1991) is a classic machine learning algorithm that minimizes the sum of errors across domains and examples. • IRM (Arjovsky et al., 2019) tries to learn an invariant feature representation ϕ(•) capturing the underlying causal mechanism of interest across domains, such that the optimal linear classifier on top of that representation matches across domains. • GroupDRO (Sagawa et al., 2020) learns models minimizing the worst-case training loss over a set of pre-defined groups, increasing the importance of domains with larger errors. • Mixup (Xu et al., 2020) guarantees domain invariance in a continuous latent space and guides the domain discriminator in judging samples' difference relative to source and target domains. • MLDG (Li et al., 2018a) simulates train/test domain shift during training by synthesizing virtual testing domains within each mini-batch. • CORAL (Sun & Saenko, 2016) aligns correlations of layer activations in deep neural networks to learn domain-invariant features. • MMD (Li et al., 2018b) extends adversarial autoencoders by imposing the Maximum Mean Discrepancy (MMD; Gretton et al., 2012) measure to align the distributions among different domains, matching the aligned distribution to an arbitrary prior distribution via adversarial feature learning. • DANN (Ganin et al., 2016) encourages, with an adversarial network, the emergence of features that are discriminative for the main learning task on the source domain and indiscriminate with respect to the shift between the domains. • C-DANN (Li et al., 2018c) is a variant of DANN matching the conditional distributions of features given labels across domains, for all labels.
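As a concrete reference for the first and third objectives in the list, the following sketch (ours, in plain Python; not the DomainBed implementations) contrasts the ERM loss, which averages over all domains, with a GroupDRO-style update, which exponentially up-weights the domains with larger losses.

```python
import math

def erm_loss(per_domain_losses):
    """ERM objective: average the loss over all training domains
    (equal domain sizes assumed for simplicity)."""
    return sum(per_domain_losses) / len(per_domain_losses)

def group_dro_loss(per_domain_losses, weights, eta=0.1):
    """One GroupDRO-style update: exponentially up-weight domains with larger
    loss, renormalize, and return the weighted loss plus the new weights."""
    w = [wi * math.exp(eta * li) for wi, li in zip(weights, per_domain_losses)]
    z = sum(w)
    w = [wi / z for wi in w]
    return sum(wi * li for wi, li in zip(w, per_domain_losses)), w

losses = [0.2, 0.9, 0.4]            # hypothetical per-domain losses
dro, w = group_dro_loss(losses, [1 / 3] * 3)
assert dro >= erm_loss(losses)      # DRO is never below the plain average
assert w[1] == max(w)               # the worst domain receives the largest weight
```

The per-domain losses and step size `eta` are hypothetical placeholders, chosen only to show the re-weighting behavior.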
C.2 DETAILS OF MUNIT



The number refers to the degree of correlation between color and label.



Figure 1: Illustration of our o.o.d. generalization problem. We show how data are sampled and how learning algorithms learn. When generating training data, E domains are first sampled from the domain distribution P; after that, for each domain, n training data points are sampled to form the training dataset, which is then fed into a learning algorithm. The learning algorithm recovers the underlying label a by estimating the domain labels b^e, e ∈ [E], from the observed training data.

Training-domain validation set. Each training domain is split into training and validation subsets, and the overall validation set consists of the validation subsets of all training domains. Finally, we choose the model maximizing the accuracy on the overall validation set. This method assumes that the training and test examples follow similar distributions.
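The selection rule described above can be written in a few lines of Python (a sketch of ours; the `models`, `accuracy`, and toy data below are stand-ins, not DomainBed code):

```python
def training_domain_validation(models, domain_val_sets, accuracy):
    """Pick the model with the best accuracy on the union of the per-domain
    validation splits. `accuracy(model, data)` returns a fraction in [0, 1];
    validation sets come from the *training* domains only."""
    def overall_acc(model):
        n_correct = sum(accuracy(model, d) * len(d) for d in domain_val_sets)
        n_total = sum(len(d) for d in domain_val_sets)
        return n_correct / n_total
    return max(models, key=overall_acc)

# Toy usage: "models" are decision thresholds, data are (x, y) pairs.
val_sets = [[(0.2, 0), (0.8, 1)], [(0.4, 0), (0.9, 1), (0.6, 1)]]
acc = lambda t, data: sum((x > t) == bool(y) for x, y in data) / len(data)
best = training_domain_validation([0.1, 0.5, 0.95], val_sets, acc)
assert best == 0.5   # the middle threshold classifies all validation points correctly
```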

The results are shown in Figures 4 and 5 in the Appendix. They show that the test accuracy increases with the number of training domains for all the algorithms on both ColoredMNIST and RotatedMNIST, which is consistent with our theoretical results (Theorem 1).

Figure 2: The experimental results on ColoredMNIST and RotatedMNIST using ERM, IRM, DANN, and C-DANN w.r.t. the number of training domains, using the leave-one-domain-out cross-validation model selection method.

Figure 3: The experimental results on ColoredMNIST using ERM and IRM w.r.t. the number of training domains with the oracle model selection method. The left two figures show the results with different architectures, i.e., MUNIT and VGG11 (Simonyan & Zisserman, 2014), while the right three figures present the corresponding results with different numbers of training samples n.

, out=128, kernels = 3 × 3, padding=1) + ReLU() + GroupNorm(groups=8)
10-12 Conv2D(in=128, out=128, kernels = 3 × 3, padding=1) + ReLU() + GroupNorm(groups=8)

Figure 4: The experimental results on ColoredMNIST and RotatedMNIST using ERM (Vapnik, 1991), IRM (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2020), Mixup (Xu et al., 2020), CORAL (Sun & Saenko, 2016), MMD (Li et al., 2018b), MLDG (Li et al., 2018a), DANN (Ganin et al., 2016), and C-DANN (Li et al., 2018c) w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method.

Figure 5: The experimental results on ColoredMNIST and RotatedMNIST using ERM (Vapnik, 1991), IRM (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2020), Mixup (Xu et al., 2020), MLDG (Li et al., 2018a), CORAL (Sun & Saenko, 2016), MMD (Li et al., 2018b), DANN (Ganin et al., 2016), and C-DANN (Li et al., 2018c) w.r.t. the number of training domains, using the training-domain validation set model selection method.

Figure 6: The experimental results on ColoredMNIST and RotatedMNIST using GroupDRO (Sagawa et al., 2020), Mixup (Xu et al., 2020), MLDG (Li et al., 2018a), CORAL (Sun & Saenko, 2016), and MMD (Li et al., 2018b) w.r.t. the number of training domains, with the leave-one-domain-out cross-validation method.


RotatedMNIST is another variant of MNIST with 6 domains containing digits rotated by {0, 15, 30, 45, 60, 75} degrees. It contains 70,000 examples of dimension (1, 28, 28) and 10 classes. Similar to ColoredMNIST, we use the domain with 45-degree rotation as the test domain and uniformly sample E rotation degrees from [0, 90) \ {45} to form E training domains.
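The domain-sampling step for RotatedMNIST can be sketched as follows (our illustration; function and parameter names are ours, not the experiment code):

```python
import random

def sample_training_rotations(E, test_degree=45.0, lo=0.0, hi=90.0, seed=0):
    """Uniformly sample E rotation degrees from [lo, hi) excluding the test
    degree, via rejection sampling (a continuous draw hits 45.0 with
    probability zero, so the check is mostly defensive)."""
    rng = random.Random(seed)
    degrees = []
    while len(degrees) < E:
        d = rng.uniform(lo, hi)
        if d != test_degree:
            degrees.append(d)
    return degrees

rotations = sample_training_rotations(E=4)
assert len(rotations) == 4
assert all(0.0 <= d < 90.0 and d != 45.0 for d in rotations)
```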

The experimental results on ColoredMNIST with ERM, IRM, GroupDRO, Mixup, MLDG, and CORAL w.r.t. the number of training domains, using the training-domain validation set model selection method.

The experimental results on ColoredMNIST with ERM, IRM, GroupDRO, Mixup, MLDG, and CORAL w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method.

in the o.o.d. area. Besides, though we used Bernoulli (discrete) random variables to present our theoretical results, our lower bounds hold for broader distribution classes, as we look at the worst-case distributions. To complement our theoretical results, we conducted experiments on two o.o.d. benchmarks, i.e., ColoredMNIST and RotatedMNIST, with several o.o.d. methods, showing that for the methods used in this paper, the test accuracy on the test domain increased with the number of training domains under three different model selection methods. That matches our theoretical results. There are several future directions for our work. Our theorem assumed that the number of data samples n from each domain is ∞. This assumption was used to lower-bound the case of general n because, intuitively, the case of n = ∞ should be easier for the learner than the case where n is finite. It would be interesting to understand how n affects a tight minimax lower bound. Another future direction is to explore the case where the numbers of samples from each domain differ. It would be interesting to see which domain dominates the training procedure and how to design o.o.d. training algorithms under this scenario. Moreover, in our case, the instance support (feature space) was shared

we have the following lower bound for the o.o.d minimax problem,

Details of our MUNIT architecture. We use MUNIT for all the experiments.

w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method.

The experimental results on ColoredMNIST with MMD, DANN, and C-DANN w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method.

The experimental results on RotatedMNIST with ERM, IRM, GroupDRO, Mixup, MLDG, CORAL, MMD, DANN, and C-DANN w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method.

w.r.t. the number of training domains, using the training-domain validation set model selection method.

The experimental results on ColoredMNIST with MMD, DANN, and C-DANN w.r.t. the number of training domains, using the training-domain validation set model selection method.

The experimental results on RotatedMNIST with ERM, IRM, GroupDRO, Mixup, MLDG, CORAL, MMD, DANN, and C-DANN w.r.t. the number of training domains, using the training-domain validation set model selection method.

The experimental results on ColoredMNIST with ERM w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method, while varying the number of training images from each domain.

The experimental results on ColoredMNIST with IRM w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method, while varying the number of training images from each domain.

The experimental results on ColoredMNIST with ERM w.r.t. the number of training domains, using the training-domain validation set model selection method, while varying the number of training images from each domain.

The experimental results on ColoredMNIST with IRM w.r.t. the number of training domains, using the training-domain validation set model selection method, while varying the number of training images from each domain.

The experimental results on ColoredMNIST with ERM w.r.t. the number of training domains, using the training-domain validation set model selection method, with MUNIT and VGG11.

The experimental results on ColoredMNIST with IRM w.r.t. the number of training domains, using the training-domain validation set model selection method, with MUNIT and VGG11.
MUNIT: 0.5910 0.6278 0.6685 0.6968 0.6709 0.6777 0.7031 0.6957 0.6935 0.6908 0.6995 0.6997 0.7113 0.7219 0.7182 0.7287
VGG11: 0.5589 0.6312 0.5716 0.6466 0.6659 0.6476 0.6585 0.6548 0.6694 0.6571 0.6897 0.6743 0.6994 0.7029 0.7025 0.7379 0.7603

The experimental results on ColoredMNIST with ERM w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method, with MUNIT and VGG11.

The experimental results on ColoredMNIST with IRM w.r.t. the number of training domains, using the test-domain validation set (oracle) model selection method, with MUNIT and VGG11.
MUNIT: 0.5915 0.6278 0.6685 0.6968 0.6709 0.6777 0.7031 0.6958 0.6935 0.6908 0.6995 0.6997 0.7113 0.7219 0.7182 0.7287
VGG11: 0.5589 0.6312 0.5803 0.6466 0.6659 0.6476 0.6608 0.6548 0.6694 0.6571 0.6897 0.6743 0.6994 0.7036 0.7025 0.7382 0.7603

ETHICS STATEMENT

We did not see obvious negative ethical impacts in our work. On the contrary, our work might have a positive impact on society regarding the fairness (Barocas et al., 2019) and security (Zhang et al., 2019; Kawaguchi et al., 2017; Bubeck et al., 2020) of machine learning. Achieving a small population error in the test domain promotes fairness with respect to dataset biases in race, gender, age, etc., as most public datasets, e.g., CelebA (Liu et al., 2015), are biased, which in turn leads to biased machine learning models (Barocas et al., 2019). Moreover, security issues such as adversarial robustness (Hendrycks et al., 2020; 2021) are also related to our study of domain generalization, where clean examples and adversarial examples come from different domains. Improving the population error of machine learning models in the test domain may lead to models that are robust against adversarial attacks.

and the domain. Finally, we show that the label space of $\mathcal D$, i.e., the underlying label $a \in \{0,1\}^{V-1}$ of $X$, can be naturally indexed by the vertices of a binary hypercube of dimension $V-1$.

Generating E domains from the set $\mathcal D$. As the data generation procedure of o.o.d. generalization has two stages, we first show how to sample domains based on the feature space $\mathcal X$. The domain-specific label $b^e$ of the $e$-th domain is generated coordinate-wise from the underlying label $a$ as
\[
b_{e,j} \sim \mathrm{Bernoulli}\Big( \frac{1 + (2a_j - 1)h}{2} \Big), \quad j \in [V-1].
\]

Sampling data from the e-th domain. Now that $E$ domains are sampled from $\mathcal D$, we present how to sample the training data from the $e$-th domain. Given $p \in (0, 1/(V-1)]$, which will be specified later, $\Pr^e_X$ is constructed as follows:
\[
\Pr{}^e_X(X = x_j) = p \ \text{ for } j \in [V-1], \qquad \Pr{}^e_X(X = x_V) = 1 - (V-1)p.
\]
In this way, we ensure that $\Pr_X(\{x_1, \ldots, x_V\}) = 1$. The labels of the samples follow the distribution $\Pr(Y \mid X, B^e)$ given $B^e = b^e \in \{0,1\}^{V-1}$ and $X = x^e$. Similarly, for a fixed $b^e$, the conditional distribution of $Y^e$ given $x_j$ is given by two Bernoulli distributions: $\mathrm{Bernoulli}\big(\tfrac{1+h}{2}\big)$ if $b_{e,j} = 1$ and $\mathrm{Bernoulli}\big(\tfrac{1-h}{2}\big)$ if $b_{e,j} = 0$, where $b_{e,j}$ is the $j$-th entry of $b^e$.
Thus, we have
\[
\eta_{b^e}(x_j) = \Pr(Y^e = 1 \mid X = x_j) =
\begin{cases}
\frac{1+h}{2}, & \text{if } j \in [V-1] \text{ and } b_{e,j} = 1;\\
\frac{1-h}{2}, & \text{if } j \in [V-1] \text{ and } b_{e,j} = 0;\\
0, & \text{otherwise.}
\end{cases} \tag{6}
\]
The corresponding Bayes optimal classifiers on the $e$-th domain and on the mixed data distribution over all domains, denoted by $f^*_{b^e}$ and $f^*$, respectively, are given by
\[
f^*_{b^e}(x_j) = b_{e,j} \ \text{ and } \ f^*(x_j) = a_j \ \text{ for } j \in [V-1], \qquad f^*_{b^e}(x_V) = f^*(x_V) = 0. \tag{7}
\]
From each domain, we draw $n$ i.i.d. samples; that is, the learning algorithm $\hat f_{E,n}$ can access $E \cdot n$ samples in total. Next, we have the following lemma showing that $\mathcal D = \{P_{b^e} : b^e \in \{0,1\}^{V-1}\} \subseteq \mathcal P(h, \mathcal F)$.

Lemma 6 (Property of the "Hard" Case). All the instances of $\mathcal D$ satisfy the Massart noise condition with margin $h$, and the domain distribution $\mathcal D$ belongs to $\mathcal P(h, \mathcal F)$, i.e., $\mathcal D \in \mathcal P(h, \mathcal F)$.

Now, the domain distribution $\mathcal D$ has been constructed. In the following part, we show that the problem of learning a classifier in our setting is at least as difficult as recovering the label $a$. With the Bayes optimal classifiers in Eq. (7) on the domain set $\mathcal D$, we can reduce the problem to a label-recovery problem.
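To make the two-stage sampling concrete, here is a small generator (our sketch, not the paper's code): draw the underlying label $a$, flip it through $\mathrm{Bernoulli}\big(\tfrac{1+(2a-1)h}{2}\big)$ to obtain each domain's label $b^e$, then sample $(X, Y)$ with $\Pr(X = x_j) = p$ for $j \le V-1$ and the remaining mass on $x_V$, whose label is fixed to 0 following our reading of Eq. (6).

```python
import random

def sample_hard_instance(V, E, n, h, p, seed=0):
    """Two-stage sampler for the hard domain distribution D.
    Returns the ground-truth string a and, per domain, (b_e, samples)."""
    assert 0 < p <= 1 / (V - 1)
    rng = random.Random(seed)
    a = [rng.randint(0, 1) for _ in range(V - 1)]            # underlying label
    domains = []
    for _ in range(E):
        # Stage 1: domain labels b_e ~ Bernoulli((1 + (2a - 1) h) / 2).
        b = [int(rng.random() < (1 + (2 * aj - 1) * h) / 2) for aj in a]
        # Stage 2: n i.i.d. samples (x_index, y) from this domain.
        samples = []
        for _ in range(n):
            u = rng.random()
            j = int(u // p) if u < p * (V - 1) else V - 1    # x_V gets the leftover mass
            if j < V - 1:
                y = int(rng.random() < (1 + (2 * b[j] - 1) * h) / 2)
            else:
                y = 0                                        # fixed label on x_V (assumption)
            samples.append((j, y))
        domains.append((b, samples))
    return a, domains

a, domains = sample_hard_instance(V=6, E=3, n=100, h=0.4, p=0.1)
```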

