EMPIRICAL OR INVARIANT RISK MINIMIZATION? A SAMPLE COMPLEXITY PERSPECTIVE

Abstract

Recently, invariant risk minimization (IRM) was proposed as a promising solution to address out-of-distribution (OOD) generalization. However, it is unclear when IRM should be preferred over the widely employed empirical risk minimization (ERM) framework. In this work, we analyze both frameworks from the perspective of sample complexity, thus taking a firm step towards answering this important question. We find that, depending on the type of data generation mechanism, the two approaches might have very different finite sample and asymptotic behavior. For example, in the covariate shift setting the two approaches not only arrive at the same asymptotic solution but also have similar finite sample behavior, with no clear winner. For other distribution shifts, such as those involving confounders or anti-causal variables, however, the two approaches arrive at different asymptotic solutions: IRM is guaranteed to be close to the desired OOD solution in the finite sample regime for polynomial generative models, while ERM is biased even asymptotically. We further investigate how different factors (the number of environments, the complexity of the model, and the IRM penalty weight) impact the sample complexity of IRM in relation to its distance from the OOD solutions.

1. INTRODUCTION

A recent study shows that models trained to detect COVID-19 from chest radiographs rely on spurious factors such as the source of the data rather than the lung pathology (DeGrave et al., 2020). This is just one of many alarming examples of spurious correlations failing to hold outside a specific training distribution. In one commonly cited example, Beery et al. (2018) trained a convolutional neural network (CNN) to distinguish camels from cows. In the training data, most pictures of cows had green pastures, while most pictures of camels were in the desert. The CNN picked up the spurious correlation and associated green pastures with cows, thus failing to classify cows on beaches. Recently, Arjovsky et al. (2019) proposed a framework called invariant risk minimization (IRM) to address the problem of models inheriting spurious correlations. They showed that when data is gathered from multiple environments, one can learn to exploit invariant causal relationships rather than relying on varying spurious relationships, thus learning robust predictors. More recent work suggests that empirical risk minimization (ERM) is still state-of-the-art on many problems requiring OOD generalization (Gulrajani & Lopez-Paz, 2020). This gives rise to a fundamental question: when is IRM better than ERM (and vice versa)? In this work, we seek to answer this question through a systematic comparison of the sample complexity of the two approaches under different types of train and test distributional mismatches. The distribution shifts P_train(X, Y) ≠ P_test(X, Y) that we consider satisfy, informally stated, an invariance condition: there exists a representation Φ* of the covariates such that P_train(Y | Φ*(X)) = P_test(Y | Φ*(X)) = P(Y | Φ*(X)). A special case occurs when Φ* is the identity: P_train(X) ≠ P_test(X) but P_train(Y | X) = P_test(Y | X); such a shift is known as a covariate shift (Gretton et al., 2009).
In many other settings Φ* may not be the identity (denoted I); examples include settings with confounders or anti-causal variables (Pearl, 2009), where covariates appear spuriously correlated with the label and P_train(Y | X) ≠ P_test(Y | X). We use causal Bayesian networks to illustrate these shifts in Figure 1. Suppose X^e = [X_1^e, X_2^e] represents the image, where X_1^e is the shape of the animal and X_2^e is the background color, Y^e is the label of the animal, and e is the index of the environment/domain. In Figure 1a), X_2^e is independent of (Y^e, X_1^e); this represents the covariate shift case (Φ* = I). In Figure 1b), X_2^e is spuriously correlated with Y^e through the confounder ε^e. In Figure 1c), X_2^e is spuriously correlated with Y^e as it is anti-causally related to Y^e. In both Figure 1b) and c), Φ* ≠ I; Φ* is a block diagonal matrix that selects X_1^e. Our setup assumes we are given data from multiple training environments satisfying the invariance condition, i.e., P(Y | Φ*(X)) is the same across all of them. Ideally, we want to learn and predict using E[Y | Φ*(X)]; this predictor has a desirable OOD behavior, as we show later when we prove min-max optimality with respect to (w.r.t.) unseen test distributions satisfying the invariance condition. Our goal is to analyze and compare ERM and IRM's ability to learn E[Y | Φ*(X)] from finite training data acquired from a fixed number of training environments. Our analysis has two parts. 1) Covariate shift case (Φ* = I): We prove (Proposition 4) that the sample complexity of both methods is similar, so there is no clear winner between the two in the finite sample regime. For the setup in Figure 1a), both ERM and IRM learn a model that only uses X_1^e. 2) Confounder/Anti-causal variable case (Φ* ≠ I): We consider a family of structural equation models (linear and polynomial) that may contain confounders and/or anti-causal variables.
For the class of models we consider, the asymptotic solution of ERM is biased and not equal to the desired E[Y | Φ*(X)]. We prove that IRM can learn a solution that is within O(√ε) distance of E[Y | Φ*(X)] with a sample complexity that increases as O(1/ε^2) and increases polynomially in the complexity of the model class (Propositions 5, 6); ε (defined later) is the slack in the IRM constraints. For the setups in Figure 1b) and c), IRM gets close to only using X_1^e, while ERM even with infinite data (Proposition 17 in the supplement) continues to use X_2^e. We summarize the results in Table 1. Arjovsky et al. (2019) proposed the colored MNIST (CMNIST) dataset; comparisons on it showed how ERM-based models exploit spurious factors (background color). The CMNIST dataset relied on anti-causal variables. Many supervised learning datasets may not contain anti-causal variables (e.g., human-labeled images). Therefore, we propose and analyze three new variants of CMNIST in addition to the original one that map to different real-world settings: i) covariate shift based CMNIST (CS-CMNIST): relies on selection bias to induce spurious correlations; ii) confounded CMNIST (CF-CMNIST): relies on confounders to induce spurious correlations; iii) anti-causal CMNIST (AC-CMNIST): the original CMNIST proposed by Arjovsky et al. (2019); and iv) anti-causal and confounded (hybrid) CMNIST (HB-CMNIST): relies on both confounders and anti-causal variables to induce spurious correlations. On the latter three datasets, which belong to the Φ* ≠ I class described above, IRM has much better OOD behavior than ERM, which performs poorly regardless of the data size. However, IRM and ERM have similar performance on CS-CMNIST with no clear winner. These results are consistent with our theory and are also validated in regression experiments.

2. RELATED WORKS

IRM based works. Following the original IRM work of Arjovsky et al. (2019), there have been several interesting works (Teney et al., 2020; Krueger et al., 2020; Ahuja et al., 2020; Chang et al., 2020; Mahajan et al., 2020, an incomplete representative list) that build new methods inspired by IRM to address the OOD generalization problem. Arjovsky et al. (2019) prove OOD guarantees for linear models with access to infinite data from finite environments. We generalize these results in several ways. We provide the first finite sample analysis of IRM. We characterize the impact of hypothesis class complexity, the number of environments, and the weight of the IRM penalty on the sample complexity and its distance from the OOD solution for linear and polynomial models.

Theory of domain generalization and domain adaptation. Following the seminal works (Ben-David et al., 2007; 2010), there have been many interesting works (Muandet et al., 2013; Ajakan et al., 2014; Zhao et al., 2019; Albuquerque et al., 2019; Li et al., 2017; Piratla et al., 2020; Matsuura & Harada, 2020; Deng et al., 2020; David et al., 2010; Pagnoni et al., 2018, an incomplete representative list; see Redko et al. (2019) for further references) that build the theory of domain adaptation and generalization and construct new methods based on it. While many of these works develop bounds on the loss over the target domain using train data and unlabeled target data, some (Ben-David & Urner, 2012; David et al., 2010; Pagnoni et al., 2018) analyze finite sample (PAC) guarantees for domain adaptation under covariate shifts. These works (Ben-David & Urner, 2012; David et al., 2010; Pagnoni et al., 2018) access unlabeled data from a target domain, which we do not. Instead, we have data from multiple training domains (as in domain generalization). In these works, the guarantees are w.r.t. a specific target domain, while we provide (for linear and polynomial models) worst-case guarantees w.r.t.
all the unseen domains satisfying the invariance condition. Also, we consider a larger family of distribution shifts, including covariate shifts. The above two categories are not exhaustive; e.g., there are recent works that characterize how certain inductive biases favor extrapolation and can be better for OOD generalization (Xu et al., 2021).

3.1. INVARIANT RISK MINIMIZATION

We start with some background on IRM (Arjovsky et al., 2019). Consider a dataset D = {D^e}_{e∈E_tr}, a collection of datasets D^e = {(x_i^e, y_i^e, e)}_{i=1}^{n_e} obtained from a set of training environments E_tr, where e is the index of the environment, i is the index of the data point in the environment, n_e is the number of points from environment e, x_i^e ∈ X ⊆ R^n is the feature value, and y_i^e ∈ Y ⊆ R is the corresponding label. Define a probability distribution {π_e}_{e∈E_tr}, where π_e is the probability that a training data point is from environment e. Define the distribution of points conditional on environment e as P_e, (X^e, Y^e) ∼ P_e, and the joint distribution P, (X^e, Y^e, e) ∼ P, with dP(X^e, Y^e, e) = π_e dP_e(X^e, Y^e). D is a collection of i.i.d. samples from P. Define a predictor f : X → R and the space F of all possible maps from X to R. Define the risk achieved by f in environment e as R^e(f) = E_e[ℓ(f(X^e), Y^e)], where ℓ is the loss, f(X^e) is the predicted value, Y^e is the corresponding label, and E_e is the expectation conditional on environment e. The overall expected risk across the training environments is R(f) = Σ_{e∈E_tr} π_e R^e(f). We are interested in two settings: regression (square loss) and binary classification (cross-entropy loss). In the main body, our focus is regression (square loss) and we mention wherever the results extend to binary classification (cross-entropy). We discuss these extensions in the supplement.

OOD generalization problem. We want to construct a predictor f that performs well across many unseen environments E_all, where E_all ⊇ E_tr. For o ∈ E_all \ E_tr, the distribution P_o can be very different from the train environments. The OOD problem is

min_{f∈F} max_{e∈E_all} R^e(f)    (1)

The above problem is very challenging to solve since we only have access to data from the training environments E_tr but are required to find the robust solution over all environments E_all.
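As a concrete illustration of the plug-in risk estimator used later, the overall risk R(f) = Σ_e π_e R^e(f) can be estimated by pooling the per-environment samples; averaging over the pooled data implicitly weights each environment by its empirical probability. The function name and data layout below are ours, not the paper's.

```python
import numpy as np

def empirical_risk(f, envs):
    """Plug-in estimate of R(f) = sum_e pi_e R^e(f) under the square loss.

    `envs` maps an environment index e to arrays (x, y). The mixture
    weights pi_e are estimated implicitly: averaging the loss over the
    pooled sample weights each environment by pi_e = n_e / n.
    """
    total_loss, total_n = 0.0, 0
    for x, y in envs.values():
        total_loss += np.sum((f(x) - y) ** 2)  # sum of per-sample square losses
        total_n += len(y)
    return total_loss / total_n
```

For example, with two environments `envs = {1: (x1, y1), 2: (x2, y2)}`, `empirical_risk(lambda x: x @ w, envs)` estimates the risk of a linear predictor.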
Next, we make assumptions on E_all and characterize the optimal solution to equation 1.

Assumption 1. Invariance condition. There exists a representation Φ* that transforms X^e to Z^e = Φ*(X^e) such that ∀e, o ∈ E_all and ∀z ∈ Φ*(X), E_e[Y^e | Z^e = z] = E_o[Y^o | Z^o = z]. Also, ∀e ∈ E_all, ∀z ∈ Φ*(X), Var_e[Y^e | Z^e = z] = ξ^2, where Var_e is the conditional variance.

The above assumption is inspired by causality (Pearl, 2009). Φ* acts as the causal feature extractor, and from the definition of causal features it follows that E_e[Y^e | Z^e = z] does not vary across environments. When a human labels a cow, she uses Φ* to extract causal features from the pixels to identify the cow while ignoring the background. The first part of the above assumption encompasses a large class of distribution shifts, including standard covariate shifts (Gretton et al., 2009). Define a map m : Φ*(X) → R as follows:

∀z ∈ Φ*(X), m(z) = E_e[Y^e | Z^e = z], where Z^e = Φ*(X^e)    (2)

Assumption 2. Existence of an environment where the invariant representation is sufficient. ∃ an environment e ∈ E_all such that Y^e ⊥ X^e | Z^e.

Assumption 2 states there exists an environment where the information that X^e has about Y^e is also contained in Z^e. Define a composition m ∘ Φ*: ∀x ∈ X, m ∘ Φ*(x) = E_e[Y^e | Z^e = Φ*(x)].

Proposition 1. If ℓ is the square loss, and Assumptions 1 and 2 hold, then m ∘ Φ* solves the OOD problem (equation 1).

The proofs of all the propositions are in the supplement. A similar result holds for the cross-entropy loss (discussion in supplement). For the rest of the paper, we focus on learning m ∘ Φ* as it solves the OOD problem. For covariate shifts, Φ* = I and m(x) = E_e[Y^e | X^e = x] is the OOD solution. In Arjovsky et al. (2019), a proof connecting m ∘ Φ* and OOD was not stated. Recently, in Koyama & Yamaguchi (2020), a result similar to Proposition 1 was shown but with a few differences.
The authors assume conditional probabilities are invariant, unlike our assumption that only requires conditional expectations and variances to be invariant. However, their result applies to more losses. m ∘ Φ* is the target we want to learn. Arjovsky et al. (2019) proposed IRM since standard min-max optimization over the training environments E_tr and ERM fail to learn m ∘ Φ* in many cases. The authors identify a crucial property of m ∘ Φ* and use it to define an object called an invariant predictor, which we define next.

Invariant predictor and IRM optimization. Define a representation map Φ : X → Z from feature space to representation space Z ⊆ R^q and a classifier map w : Z → R from representation space to real values. Define H_Φ and H_w as the spaces of representations and classifiers, respectively. A data representation Φ elicits an invariant predictor w ∘ Φ across environments e ∈ E_tr if there is a classifier w that achieves the minimum risk simultaneously in all the environments, i.e., ∀e ∈ E_tr, w ∈ arg min_{w̄∈H_w} R^e(w̄ ∘ Φ). Observe that if we transform the data with the representation Φ*, then m achieves the minimum risk simultaneously in all the environments. Hence, if Φ* ∈ H_Φ and m ∈ H_w, then m ∘ Φ* is an invariant predictor. IRM selects the invariant predictor with the least sum of risks across environments (the results presented later can be adapted if the invariant predictor is instead selected based on the worst-case risk over the environments):

min_{Φ∈H_Φ, w∈H_w} R(w ∘ Φ) = Σ_{e∈E_tr} π_e R^e(w ∘ Φ)  s.t.  w ∈ arg min_{w̄∈H_w} R^e(w̄ ∘ Φ), ∀e ∈ E_tr    (3)

From the above discussion we know m ∘ Φ* is a feasible solution to equation 3. It is also the ideal solution we want IRM to find since it solves equation 1. Later, in Propositions 4, 5, and 6, we show that IRM actually solves equation 1. For the setups in Propositions 5 and 6, conventional ERM based approaches fail, justifying the need for the above formulation.

3.2. SAMPLE COMPLEXITY OF GRADIENT CONSTRAINT FORMULATION OF IRM

In Arjovsky et al. (2019), a gradient-constrained alternative (derived below in equation 4) to equation 3 was proposed, which focuses on linear and scalar classifiers (Z = R, Φ : X → R, H_w = R). In this case, the composite predictor w ∘ Φ is a multiplication of w and Φ, written as w · Φ. (For binary classification the predictor's output w · Φ(x) represents logits.) From the definition of invariant predictors and H_w = R it follows that if ∀w̄ ∈ R, R^e(1 · Φ) ≤ R^e(w̄ · Φ), then Φ is an invariant predictor. For square and cross-entropy losses, R^e(w · Φ) is convex in w. Therefore, the gradient constraint ∇_{w|w=1.0} R^e(w · Φ) = 0 is equivalent to the condition that ∀w̄ ∈ R, R^e(1 · Φ) ≤ R^e(w̄ · Φ), which implies Φ is an invariant predictor. Recall that IRM aims to search among invariant predictors and find one that minimizes the risk. We state this as a gradient-constrained optimization:

min_{Φ∈H_Φ} R(Φ)  s.t.  ∇_{w|w=1.0} R^e(w · Φ) = 0, ∀e ∈ E_tr    (4)

We propose an approximation of the above with slack in the constraint. Define R'(Φ) = Σ_{e∈E_tr} π_e |∇_{w|w=1.0} R^e(w · Φ)|^2 and a set S^IV(ε) = {Φ | R'(Φ) ≤ ε, Φ ∈ H_Φ}. Note that R' is very similar to the penalty defined in Arjovsky et al. (2019). The approximation of equation 4 is

min_{Φ∈S^IV(ε)} R(Φ)    (5)

If ε = 0, then equation 4 and equation 5 are equivalent. In all the optimizations so far, the expectations are computed w.r.t. the distributions P_e, which are unknown. Therefore, we develop an empirical version of equation 5 below (in equation 6) and call it empirical IRM (EIRM). We replace R and R' with empirical estimators R̂ and R̂' respectively. For R̂ we use a simple plug-in estimator (sample mean of the loss across all the samples in D). For R̂' we construct a new estimator that enables the use of standard concentration inequalities. Define a set Ŝ^IV(ε) = {Φ | R̂'(Φ) ≤ ε, Φ ∈ H_Φ}. The EIRM problem is

min_{Φ∈Ŝ^IV(ε)} R̂(Φ)    (6)

If we replaced Ŝ^IV(ε) with H_Φ in equation 6, then we would recover standard ERM, which aims to solve min_{Φ∈H_Φ} R̂(Φ).
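For the square loss, the per-environment gradient at w = 1 has the closed form E_e[2(Φ(X^e) − Y^e)Φ(X^e)], so a naive plug-in estimate of the penalty R'(Φ) can be sketched as below. (The paper's actual estimator R̂' is constructed more carefully to enable concentration bounds; the names and data layout here are ours.)

```python
import numpy as np

def irm_penalty(phi, envs):
    """Naive plug-in estimate of R'(Phi) = sum_e pi_e |grad_w R^e(w . Phi)|^2
    at w = 1, for the square loss. `phi` maps an array of inputs to scalar
    outputs; `envs` maps environment index -> (x, y) arrays."""
    n_total = sum(len(y) for _, y in envs.values())
    penalty = 0.0
    for x, y in envs.values():
        out = phi(x)
        # d/dw (w*out - y)^2 at w = 1 equals 2*(out - y)*out; the gradient
        # of the environment risk is the sample mean of this quantity.
        grad = np.mean(2.0 * (out - y) * out)
        pi_e = len(y) / n_total          # empirical environment weight
        penalty += pi_e * grad ** 2
    return penalty
```

A representation that predicts perfectly in every environment has zero gradient, hence zero penalty; a representation whose optimal scaling w differs across environments incurs a positive penalty.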
The sample complexity analysis of ERM aims to understand the distance between the empirical solutions and the expected solutions as a function of the number of samples. Similarly, we seek to understand the relationship between the solutions of equation 6 and equation 5.

Assumption 3. Bounded loss and bounded gradient of the loss. ∃ L < ∞, L' < ∞ such that ∀Φ ∈ H_Φ, ∀x ∈ X, ∀y ∈ Y, |ℓ(Φ(x), y)| ≤ L and |∂ℓ(w · Φ(x), y)/∂w |_{w=1.0}| ≤ L'.

If every Φ in the hypothesis class H_Φ is bounded by M and the label space Y is bounded, then for both the square and cross-entropy loss, ℓ(Φ(·), ·) and ∂ℓ(w · Φ(·), ·)/∂w |_{w=1.0} are bounded. Define κ = min_{Φ∈H_Φ} |R'(Φ) − ε|; κ measures how close any penalty can get to the boundary ε, and thus quantifies how good the finite sample approximation R̂' needs to be in order to get Ŝ^IV(ε) = S^IV(ε). Define ν to quantify the approximation w.r.t. the optimal risk.

Proposition 2. For every ν > 0, ε > 0 and δ ∈ (0, 1), if H_Φ is a finite hypothesis class, Assumption 3 holds, κ > 0, and the number of samples |D| is greater than max{16L'^4/κ^2, 8L^2/ν^2} log(4|H_Φ|/δ), then with probability at least 1 − δ, every solution Φ̂ to EIRM (equation 6) is a ν approximation of IRM, i.e., Φ̂ ∈ S^IV(ε) and R(Φ*) ≤ R(Φ̂) ≤ R(Φ*) + ν, where Φ* is a solution to IRM (equation 5).

Proof sketch. The standard analysis in learning theory on ERM or regularized/constrained ERM typically relies on linearly separable loss functions. In such cases, we can use standard plug-in estimators and analyze their behavior using concentration inequalities. In our setting, R' is a weighted sum of squares of expectations and thus not linearly separable. We develop a new way of expressing R' that allows us to make it linearly separable. Next, in order to ensure R(Φ*) ≤ R(Φ̂) ≤ R(Φ*) + ν, we first need to guarantee that the set of invariant predictors is exactly recovered, i.e., Ŝ^IV(ε) = S^IV(ε) (exact recovery is typically not required in existing constrained analyses such as Woodworth et al.
(2017); Agarwal et al. (2018)). We show that if the number of samples grows as 1/κ^2, even the closest points on either side of the boundary of the set S^IV(ε) are correctly discriminated, which guarantees exact recovery of S^IV(ε). Once the exact set is recovered, we use standard learning theory tools to ensure R(Φ*) ≤ R(Φ̂) ≤ R(Φ*) + ν.

The above result holds for both the square and cross-entropy loss. For ease of exposition, we use the standard setting of a finite hypothesis class and extend all the results to infinite hypothesis classes in the supplement (a summary of insights from the extension is in Section 3.3.2). Next, we state a standard result on ERM's sample complexity. Define Φ⁺ ∈ arg min_{Φ∈H_Φ} R(Φ).

Proposition 3. (Shalev-Shwartz & Ben-David, 2014) For every ν > 0 and δ ∈ (0, 1), if H_Φ is a finite hypothesis class, Assumption 3 holds, and the number of samples |D| is greater than (8L^2/ν^2) log(2|H_Φ|/δ), then with probability at least 1 − δ, every solution Φ† to ERM is a ν approximation of expected risk minimization, i.e., R(Φ⁺) ≤ R(Φ†) ≤ R(Φ⁺) + ν.

Proposition 2 vs. 3. Since κ ≤ ε, the sample complexity of EIRM grows at least as O(max{1/ε^2, 1/ν^2}). Let us look at the two terms inside the max: i) the 1/ν^2 term is similar to ERM's; it ensures ν-approximate optimality in the overall risk R; ii) the 1/ε^2 term ensures the IRM penalty R' is less than ε. A direct comparison of the sample complexities in Propositions 2 and 3 suggests that the sample complexity of EIRM is higher than that of ERM, but this is not the complete picture. The two approaches may not converge to the same solutions, and IRM may converge to a solution with better OOD behavior than the one achieved by ERM. Therefore, a fair comparison is only possible when we also study the OOD properties of the solutions achieved by the two approaches, which is the subject of the next section.

3.3. OOD PERFORMANCE: ERM VS. IRM

We divide the comparisons based on distributional shift assumptions that decide whether ERM and IRM arrive at the same asymptotic solutions or not.

3.3.1. COVARIATE SHIFT

Assumption 4. Invariance w.r.t. all the features. ∀e, o ∈ E_all and ∀x ∈ X, E[Y^e | X^e = x] = E[Y^o | X^o = x]. ∀e ∈ E_all, X^e ∼ P_{X^e} and the support of P_{X^e} is equal to X.

As stated earlier, the first part of the above assumption follows from standard covariate shift assumptions (Gretton et al., 2009). We consider the following structural equation model:

Y^e ← g(X^e) + ε^e, E[ε^e] = 0, ε^e ⊥ X^e, E[(ε^e)^2] = σ^2    (7)

In the above model X^e is the cause, Y^e is the effect, and g is a general non-linear function (the model satisfies Assumption 4 with m = g); Figure 1a) illustrates this setting. Next, we compare ERM and IRM's ability to learn m under covariate shifts. Define κ = min_{Φ1,Φ2∈H_Φ, Φ1≠Φ2} |R(Φ1) − R(Φ2)|, which measures the minimum separation between the risks of any two distinct hypotheses in H_Φ.

Proposition 4. Let ℓ be the square loss. For every ν > 0, ε > 0 and δ ∈ (0, 1), if H_Φ is a finite hypothesis class, m ∈ H_Φ, and Assumptions 3, 4 hold, then
• if the number of samples |D| is greater than max{(8L^2/ν^2) log(4|H_Φ|/δ), (16L'^4/ε^2) log(2/δ)}, then with probability at least 1 − δ, every solution Φ̂ to EIRM (equation 6) satisfies R(m) ≤ R(Φ̂) ≤ R(m) + ν. If also ν < κ, then Φ̂ = m.
• if the number of samples |D| is greater than (8L^2/ν^2) log(2|H_Φ|/δ), then with probability at least 1 − δ, every solution Φ† to ERM satisfies R(m) ≤ R(Φ†) ≤ R(m) + ν. If also ν < κ, then Φ† = m.

Implications of Proposition 4. ERM and EIRM both asymptotically achieve the ideal OOD solution; the above proposition helps compare them in the finite sample regime. The second term inside the max for EIRM, (16L'^4/ε^2) log(2/δ), is the extra price of enforcing the empirical invariance constraint, so neither method is a clear winner in this setting. In Proposition 4, we assumed the square loss, but a similar result extends to the cross-entropy loss as well.
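A small simulation, assuming a linear g, illustrates the covariate shift setting of equation 7: the environments shift P(X^e) while E[Y^e | X^e] stays fixed, so per-environment least squares recovers the same regression function in every environment. All names and parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([2.0, -1.0])          # coefficients of the (linear) g

def sample_env(mean, n=20000):
    # Environments shift P(X^e) (different covariate means) but share the
    # regression function: Y^e <- g(X^e) + eps^e with E[eps^e] = 0.
    x = rng.normal(mean, 1.0, size=(n, 2))
    y = x @ beta + rng.normal(0.0, 0.5, size=n)
    return x, y

def ols(x, y):
    return np.linalg.lstsq(x, y, rcond=None)[0]

b1 = ols(*sample_env(mean=0.0))
b2 = ols(*sample_env(mean=3.0))
# Despite different covariate distributions, both environments yield
# (approximately) the same coefficients, so pooled ERM also recovers beta.
```

Pooling the two environments and running ERM would likewise recover beta, matching the asymptotic claim of Proposition 4.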

3.3.2. DISTRIBUTIONAL SHIFT WITH CONFOUNDERS AND (OR) ANTI-CAUSAL VARIABLES

In this section, we consider more general models than equation 7, which only contained the cause X^e and the effect Y^e. We also allow confounders and anti-causal variables; however, we restrict g to polynomials. We start with linear models from Arjovsky et al. (2019).

Assumption 5. e ∼ Categorical({π_o}_{o∈E_tr}), ∀o ∈ E_tr, π_o > 0.

Y^e ← γ^T Z_1^e + ε^e, ε^e ⊥ Z_1^e, E[ε^e] = 0, E[(ε^e)^2] = σ^2, |ε^e| ≤ ε_sup
X^e ← S(Z_1^e, Z_2^e)

We assume that the Z_1 component of S is invertible, i.e., ∃ S̃ such that S̃(S(Z_1, Z_2)) = Z_1, and γ ≠ 0. ∀e ∈ E_tr, π_e ≥ π_min/|E_tr| > 0. Define Σ_e = E[X^e X^{e,T}]; ∀e ∈ E_tr, Σ_e is positive definite. The support of the distribution of Z^e = (Z_1^e, Z_2^e), P_{Z^e}, is bounded, and the norm of S, ‖S‖ = σ_max(S) (the maximum singular value of S), is also bounded.

In the above model, Z_1^e is the cause of X^e and Y^e but may not be directly observed. Z_2^e may be arbitrarily correlated with Z_1^e and ε^e. We observe a scrambled transformation X^e of (Z_1^e, Z_2^e).

Assumption 6. Linear general position. For some r ∈ N and for all non-zero x ∈ R^n,

dim( span( { E_e[X^e X^{e,T}] x − E_e[X^e ε^e] }_{e∈E_tr} ) ) > n − r    (9)

where span is the linear span, dim is the dimension, and recall n is the dimension of X^e. This assumption checks for diversity in the environments and holds almost everywhere (Arjovsky et al., 2019).

Assumption 7. Inductive bias. H_Φ is a finite set of linear models parametrized by Φ ∈ R^n (output Φ^T X^e). S̃^T γ ∈ H_Φ. ∃ ω > 0, Ω > 0, s.t. ∀Φ ∈ H_Φ, ω ≤ ‖Φ‖^2 ≤ Ω and 2ω ≤ ‖S̃^T γ‖^2 ≤ (2/(3+2√2)) Ω.

Informally stated, the above assumption requires the OOD-optimal predictor S̃^T γ to lie in the interior of the search space and not on the boundary. If Assumptions 5 and 7 hold, then Assumption 3 holds; hence, we can use the bounds L and L' on the loss and its gradient.

Proposition 5. Let ℓ be the square loss.
For every ε ∈ (0, ε_th) and δ ∈ (0, 1), if Assumptions 5, 6 (with r = 1), 7 hold and the number of data points |D| is greater than (16L'^4/ε^2) log(2|H_Φ|/δ), then with probability at least 1 − δ, every solution Φ̂ to EIRM (equation 6) satisfies Φ̂ = (S̃^T γ)α, where α ∈ [1/(1+τ√ε), 1/(1−τ√ε)].

Proof sketch. In learning theory it is common to analyze the concentration of empirical risks around the expected risks. In our case, we have a target ideal solution to equation 1 (S̃^T γ) and we want our empirical solutions to concentrate around it. A direct finite sample approximation of equation 4 is hard to analyze. Therefore, we introduce an intermediate problem in equation 5 and then develop a finite sample approximation of it in equation 6. We first show that solving equation 5 leads to solutions in the neighborhood of the target; to show this we use the linear general position assumption. Next, we connect equation 6 and equation 5 using our new estimator for R' and Hoeffding's inequality.

Implications of Proposition 5. 1. Convergence rate of ERM vs. EIRM: Recall that ε is the slack on the IRM penalty R'. If ε is sufficiently small and the data grows as O(1/ε^2), every solution Φ̂ to EIRM (equation 6) is within a √ε radius of the OOD solution, i.e., ‖Φ̂ − S̃^T γ‖ = O(√ε). We contrast these rates with the ones in the covariate shift setting (Section 3.3.1). Let E[Y^e | X^e = x] = Ψ^T x. If the data grows as O(1/ν^2), then both the ERM and EIRM solutions converge to Ψ as ‖Φ̂ − Ψ‖ = O(√ν) (from Proposition 4). This shows that EIRM works in more settings (Propositions 4, 5) than ERM while matching the convergence rate of ERM.

2. Sample complexity grows polynomially in the data dimension to ensure OOD generalization:

Next, we set ε = µ ε_th with µ ∈ [0, 1) and |E_tr| = 2n (which satisfies Assumption 6 for r = 1). A simple manipulation of terms in Proposition 5 shows that a sample complexity with quadratic growth in the data dimension n, O((n^2/µ^2) log(2|H_Φ|/δ)), ensures Φ̂ = (S̃^T γ)α with α ∈ [1/(1+√µ(√2−1)), 1/(1−√µ(√2−1))].

3. Comparison with Proposition 2: Lastly, we contrast the sample complexity of EIRM in Proposition 5, O(1/ε^2), with that in Proposition 2, O(max{1/ε^2, 1/ν^2}); the additional distributional assumptions in Proposition 5 help arrive at the lower sample complexity of O(1/ε^2). The bound in Proposition 2, O(max{1/ε^2, 1/ν^2}), is larger than the ones in Proposition 4, O(1/ν^2), and Proposition 5, O(1/ε^2), but is more general as it is agnostic to the distributional assumptions.

A simple illustration summarizing Propositions 4 and 5: Set S to the identity in Assumption 5 and recall Z_1^e and Z_2^e. Since S is the identity, X^e can be written as [X_1^e, X_2^e], where X_1^e = Z_1^e and X_2^e = Z_2^e. If X_2^e ⊥ ε^e, then E[Y^e | X^e] is invariant and Assumption 4 holds; this corresponds to the setup in Figure 1a). We can then use Proposition 4 and deduce that ERM and IRM have the same sample complexity and end up learning the ideal model that only uses the causal features X_1^e. If X_2^e ← ε^e + N^e, then this corresponds to the setup in Figure 1b): X_1^e is the cause and X_2^e is spuriously correlated with the label Y^e through the confounder ε^e. If X_2^e ← Y^e + N^e, then this corresponds to the setup in Figure 1c): X_1^e is the cause and X_2^e is anti-causally related to the label Y^e. In both these cases, the ideal OOD solution that solves equation 1 only exploits X_1^e to make predictions. From Proposition 5, it follows that IRM, when fed O(1/ε^2) samples, is within a √ε radius of the target OOD solution, while ERM is asymptotically biased and exploits X_2^e (Proposition 17). We define a polynomial version of the model in Assumption 5 next.
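Before moving to the polynomial case, the anti-causal illustration above (X_2^e ← Y^e + N^e) can be checked numerically: pooled least squares (ERM) places substantial weight on the spurious coordinate X_2^e even with ample data, while the OOD-optimal predictor uses only X_1^e. The coefficients and noise levels below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_env(noise_std, n=50000):
    # Figure 1c)-style model: X1 causes Y; X2 is generated anti-causally
    # from Y with environment-dependent noise (the spurious channel).
    x1 = rng.normal(size=n)
    y = 1.5 * x1 + rng.normal(0.0, 1.0, size=n)
    x2 = y + rng.normal(0.0, noise_std, size=n)   # X2 <- Y + N^e
    return np.column_stack([x1, x2]), y

xa, ya = sample_env(noise_std=0.1)
xb, yb = sample_env(noise_std=0.3)
x, y = np.vstack([xa, xb]), np.concatenate([ya, yb])
coef = np.linalg.lstsq(x, y, rcond=None)[0]
# Pooled ERM puts most of its weight on the spurious feature X2, whereas
# the invariant (OOD-optimal) predictor would use only X1 with weight 1.5.
```

When the X2 channel breaks at test time (e.g., a large noise_std), a predictor leaning on `coef[1]` degrades sharply, which is exactly the failure mode IRM is designed to avoid.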
We only need to change Y^e ← γ^T Z_1^e + ε^e to Y^e ← γ^T ζ_p(Z_1^e) + ε^e, where ζ_p is a polynomial feature map of degree p defined as ζ_p : R^c → R^c̄, c is the dimension of the input Z_1^e, ζ_p(W) = [W, W ⊗ W, . . . , (W ⊗ W · · · p times · · · ⊗ W)] = (W^{⊗i})_{i=1}^p, ⊗ is the Kronecker product, and c̄ = Σ_{i=1}^p c^i. Can we directly use the analysis from the linear case by transforming X^e appropriately? No; we first need to find an appropriate transformation of the scrambling matrix S that satisfies the conditions (invertibility) in Assumption 5 while maintaining a linear relationship between the transformations of X^e and Z^e. We present the main result informally below (details are in the supplement).

Proposition 6. (Informal statement) For sufficiently small ε and δ ∈ (0, 1), if assumptions similar to those of Proposition 5 hold and |D| ≥ (16L'^4/ε^2) log(2|H_Φ|/δ), then with probability at least 1 − δ, every solution Φ̂ to EIRM (equation 6) satisfies Φ̂ = (S̃^T γ)α, where S̃^T γ is the OOD-optimal model (defined in the supplement) and α ∈ [1/(1+τ√ε), 1/(1−τ√ε)].

Insights from the polynomial case and the infinite hypothesis case. In the polynomial case, we adapt the linear general position Assumption 6; the number of environments |E_tr| is now required to grow as O(n^p). As a result, in the sample complexity analysis discussed above we replace n with n^p to obtain that a sample complexity of O((n^{2p}/µ^2) log(2|H_Φ|/δ)) ensures Φ̂ = (S̃^T γ)α with α ∈ [1/(1+√µ(√2−1)), 1/(1−√µ(√2−1))]. In the infinite hypothesis case, the main change in the results is that we replace |H_Φ| with an appropriate model complexity metric (Shalev-Shwartz & Ben-David, 2014). For Proposition 5, a sample complexity of O((n^3/µ^2) log(n/µ)) ensures Φ̂ = (S̃^T γ)α with α ∈ [1/(1+√µ(√2−1)), 1/(1−√µ(√2−1))], in contrast to O((n^2/µ^2) log(2|H_Φ|/δ)) in the finite hypothesis case. We showed the benefits of IRM for polynomial models; other extensions (non-linear S) are future work.
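The polynomial feature map ζ_p can be sketched via iterated Kronecker products, a direct transcription of the definition above (the implementation details are ours):

```python
import numpy as np

def zeta(w, p):
    """Degree-p polynomial feature map: concatenates W, W (x) W, ..., W^{(x)p}.
    For a c-dimensional input the output dimension is sum_{i=1}^p c^i."""
    powers = [w]
    cur = w
    for _ in range(p - 1):
        cur = np.kron(cur, w)   # next Kronecker power W^{(x)(i+1)}
        powers.append(cur)
    return np.concatenate(powers)
```

For example, with c = 2 and p = 3 the output has dimension 2 + 4 + 8 = 14, matching c̄ = Σ_{i=1}^p c^i.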
In the supplement, we provide a dialogue explaining how our work fits in the big picture.

4. EXPERIMENTS

In this section, we discuss classification experiments (regression experiments with similar qualitative findings are in the supplement). We introduce three new variants of the colored MNIST (CMNIST) dataset from Arjovsky et al. (2019). We divide the MNIST training data equally into two environments (e = 1, 2) and assign the MNIST test data to a third environment (e = 3). X_g^e: grayscale image of the digit; Y_g^e: label of the grayscale digit (digits ≥ 5 have Y_g^e = 1 and digits < 5 have Y_g^e = 0). The final colored image X^e and the final label Y^e are generated as follows. Define Bernoulli variables G, N, N^e that take the value 1 with probabilities θ, β, and β_e respectively, and 0 otherwise. Define a color variable C^e, where C^e = 0 is red and C^e = 1 is green. Let ⊕ denote the xor operation.

Y_g^e ← L(X_g^e)                        (L: human labeling)
Y^e ← Y_g^e ⊕ N                          (corrupt the original labels with noise)
C^e ← G(Y^e ⊕ N^e) + (1 − G)(N ⊕ N^e)    (use G to select between anti-causal and confounded)
X^e ← T(X_g^e, C^e)                      (T: transformation to color the image)    (10)

If θ = 1, then G = 1 and we recover the original CMNIST of Arjovsky et al. (2019), which we call anti-causal CMNIST (AC-CMNIST). If θ = 0, we get confounded colored MNIST (CF-CMNIST). If 0 < θ < 1, we get a hybrid dataset (HB-CMNIST). The model in equation 10 has the features of the model in Assumption 5, where X_g^e, C^e, L, T take the roles of Z_1^e, Z_2^e, γ, S. We set the noise parameter β = 0.25 for N, and the parameters for N^e in the three environments to [β_1, β_2, β_3] = [0.1, 0.2, 0.9]. Color is spuriously correlated with the label; P(C^e = 1 | Y^e = 1) varies drastically across the three environments ([0.9, 0.8, 0.1] for AC-CMNIST). In the CMNIST variants discussed so far, P(Y^e | X^e) varies across the environments. We now define a covariate shift based CMNIST (CS-CMNIST), in which P(Y^e | X^e) is invariant, using selection bias to induce spurious correlations.
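A minimal sketch of the label/color mechanism in equation 10, omitting the final image-coloring step T (the function name and interface are ours):

```python
import numpy as np

def sample_color_and_label(y_g, theta, beta, beta_e, rng):
    """One draw of the label/color mechanism in equation 10. y_g in {0, 1}
    is the noiseless digit label; theta, beta, beta_e parametrize the
    Bernoulli variables G, N, N^e."""
    g = int(rng.random() < theta)    # G: 1 -> anti-causal, 0 -> confounded
    n = int(rng.random() < beta)     # label noise N
    ne = int(rng.random() < beta_e)  # environment-specific noise N^e
    y = y_g ^ n                                 # Y^e <- Y_g^e XOR N
    c = g * (y ^ ne) + (1 - g) * (n ^ ne)       # C^e per equation 10
    return y, c
```

Setting theta = 1.0 recovers the AC-CMNIST mechanism (color depends on the noisy label), while theta = 0.0 gives the CF-CMNIST mechanism (color depends on the label noise N, the confounder).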
Generate a color C^e uniformly at random. Select the pair (X^e_g, C^e) with probability 1 - ψ^e if the label Y^e_g and the color C^e are the same, and with probability ψ^e otherwise. If the pair is selected, color the image, X^e ← T(X^e_g, C^e), and set Y^e ← Y^e_g. The selection probabilities ψ^e for the three environments are [ψ^1, ψ^2, ψ^3] = [0.1, 0.2, 0.9]. Due to the selection bias, color is spuriously correlated with the label: P(C^e = 1|Y^e = 1) varies drastically across environments ([0.9, 0.8, 0.1]). We provide the graphical models for the CMNIST variants and the computations of P(C^e|Y^e) and P(Y^e|X^e) in the supplement.
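To make the generation mechanisms concrete, the label/color sampling step (the part of the pipeline before the digits are colored) can be sketched as follows; the function names and the rejection-sampling formulation of the selection bias are ours, not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_label_color(y_g, theta, beta, beta_e, rng):
    """Sample (Y^e, C^e) for one grayscale label y_g following equation 10.

    theta: P(G = 1); G selects anti-causal (G = 1) vs confounded (G = 0).
    beta:  parameter of the label noise N; beta_e: parameter of N^e.
    """
    G = rng.random() < theta
    N = int(rng.random() < beta)
    N_e = int(rng.random() < beta_e)
    y = y_g ^ N                          # Y^e = L(X^e_g) xor N
    c = (y ^ N_e) if G else (N ^ N_e)    # C^e: anti-causal or confounded
    return y, c

def sample_cs_pair(y_g, psi_e, rng):
    """CS-CMNIST: draw a uniform color, keep the pair with probability
    1 - psi_e when color and label agree (selection bias)."""
    while True:
        c = int(rng.integers(2))
        keep = (1.0 - psi_e) if c == y_g else psi_e
        if rng.random() < keep:          # U^e = 1: pair is selected
            return y_g, c

# sanity check on AC-CMNIST (theta = 1, no label noise): with beta_e = 0.1
# we expect P(C^e = 1 | Y^e = 1) = 1 - beta_e = 0.9
draws = [sample_label_color(1, 1.0, 0.0, 0.1, rng) for _ in range(20000)]
p_green = np.mean([c for y, c in draws if y == 1])
```

Coloring the image (the map T) is then a deterministic function of (X^e_g, C^e), so the spurious correlation is entirely captured by this label/color step.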

4.1. RESULTS

We use the first two environments (e = 1, 2) for training and the third environment (e = 3) for testing. Other training details (models, hyperparameters, etc.) are in the supplement. For each of the above datasets, we run the experiments for different amounts of training data, from 1000 up to 60000 samples (10 trials for each data size). In Figure 2, we compare the models trained using IRM and ERM in terms of the classification error on the test environment e = 3 (poor performance indicates that the model exploits the color) for a varying number of training samples. We also provide the performance of the ideal hypothetical optimal invariant model. Observe that, except in the covariate shift setting, where IRM and ERM are similar, as seen in Figure 2a (as predicted by Proposition 4), IRM outperforms ERM on the remaining three datasets (as predicted by Proposition 5), as seen in Figures 2b-d. We further validate this claim through the regression experiments provided in the supplement. On CF-CMNIST, IRM achieves an error of 0.45, which is much better than the error of ERM (0.7) but only marginally better than a random guess. This suggests that confounder-induced spurious correlations are harder to mitigate and may need more samples than in the anti-causal case (AC-CMNIST).

5. CONCLUSION

We presented a sample complexity analysis of IRM to answer the question: when is IRM better than ERM (and vice versa)? For distribution shifts such as covariate shifts, we proved that IRM and ERM have similar sample complexity and both arrive at the desired OOD solution asymptotically. For distribution shifts involving confounders and/or anti-causal variables and polynomial generative models, we proved that IRM is guaranteed to achieve the desired OOD solution while ERM can be asymptotically biased. We proposed new variants of the original colored MNIST dataset from Arjovsky et al. (2019), which are more comprehensive and better capture how spurious correlations occur in reality. To the best of our knowledge, this is the first work that rigorously characterizes the impact of factors such as model complexity and the number of environments on the sample complexity of IRM and on its distance from the OOD solution under distribution shifts that go beyond covariate shifts.

7.1. A DIALOGUE ON HOW OUR WORK FITS IN THE BIG PICTURE

• IRMA: But after reading this manuscript I understand that there are settings where, no matter how one selects the model, ERM is bound to be at a disadvantage.
• ERIC: What you say seems to contradict my understanding of Gulrajani & Lopez-Paz (2020).
• IRMA: Actually, they ... [IRMA's audio and video freeze for a minute, leaving ERIC to ponder this conundrum on his own for a little bit. IRMA stops her video, continues with audio only, and the conversation resumes.]
• IRMA: Sorry, how much did you hear?
• ERIC: You were just starting to explain why ERM can be at a disadvantage, but I didn't hear anything after that.
• IRMA: Sure! Consider a dataset from domain 1, say photographs of birds, and domain 2, say sketches of birds. Imagine I have access to the oracle representation Φ*. Now I pass the data from domains 1 and 2 through Φ* to get the representations. If these representations from the two domains live in very different parts of the representation space, then we cannot hope that IRM will offer any advantage. This is the other explanation, which I was going to come to, for why IRM did not outperform ERM in Gulrajani & Lopez-Paz (2020).
• ERIC: How about when there is a complete overlap?
• IRMA: Yes, if there is a strong or complete overlap in the representations from the two domains, then it is possible that IRM can help. In fact, I believe that in such settings, if the ERM models trained on the two domains disagree a lot, that is a strong indication that the models are heavily exploiting spurious correlations. In such cases, I believe IRM can again offer some advantage.
• ERIC: I will try to put these ideas down on paper and meet you again for a discussion.
• IRMA: Great! I truly hope that it can be in person at the café in Palais-Royal where we first started this conversation. So long for now!
[The students end their call.]

7.2. SUPPLEMENTARY MATERIALS FOR EXPERIMENTS

In this section, we cover the supplementary materials for the experiments. The code to reproduce the results presented in this work can be found at https://github.com/IBM/OoD.

7.2.1. CLASSIFICATION

We first describe the model and other training details.

Choice of H_Φ and other training details. We use the same architecture for both ERM and IRM, namely the architecture used in Arjovsky et al. (2019): an MLP whose two hidden layers have 390 nodes each and whose output layer has two nodes (for the two classes). We use ReLU activations, and an ℓ2 regularization weight of 0.0011 for each layer. We use a learning rate of 4.9e-4 and a batch size of 512 for both ERM and IRM. We use 1000 gradient steps for IRM. As in the original IRM work (Arjovsky et al., 2019), we use a threshold on the number of steps (190) after which a large penalty is imposed for violating the IRM constraint. We use the train-domain validation set procedure described in Gulrajani & Lopez-Paz (2020) to select the penalty value from the set {1e4, 3.3e4, 6.6e4, 1e5} (with a 4:1 train-validation split). With the same learning rate, we observed that ERM was slower at learning than IRM. To ensure that ERM always converges, we set the number of epochs to a very high value, 100 (118k steps).
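The step-threshold schedule described above can be sketched as follows; the function names are ours, and the rescaling of the total loss once the large weight kicks in is a common IRMv1 practice (so that the effective step size stays comparable), stated here as an assumption rather than a quote of the released code.

```python
def irm_penalty_weight(step, warmup_steps=190, high_weight=1e4):
    """Weight 1.0 during warm-up; afterwards a large penalty is imposed
    for violating the IRM constraint (threshold on steps, see above)."""
    return 1.0 if step < warmup_steps else high_weight

def irmv1_objective(erm_loss, irm_penalty, step):
    w = irm_penalty_weight(step)
    loss = erm_loss + w * irm_penalty
    # rescale when the large weight kicks in so the gradient magnitude
    # (and hence the effective learning rate) stays comparable
    return loss / w if w > 1.0 else loss
```

The warm-up phase lets the network first fit the data with ERM-like updates before the invariance constraint is enforced strongly.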

A. Covariate shift based CMNIST

We provide the generative model for CS-CMNIST below.

Y^e_g ← L(X^e_g)
C^e ← Uniform({0, 1})
U^e ← Bernoulli( (C^e ⊕ Y^e_g) ψ^e + (1 - (C^e ⊕ Y^e_g))(1 - ψ^e) )
X^e ← T(X^e_g, C^e) | U^e = 1,   Y^e ← Y^e_g | U^e = 1

From the above model we gather that P(C^e = 1|Y^e = 1) is 0.9, 0.8, and 0.1 in the three environments. Define p_l = P(Y^e_g = 1|X^e_g). For MNIST data it is reasonable to assume deterministic labeling, i.e., p_l = 1 or p_l = 0. From the way the data is constructed, we can assume that T is invertible, i.e., from the colored image X^e we can recover the grayscale image and the color (X^e_g, C^e). Computing P(Y^e|X^e) (assuming Y^e_g = 1 and deterministic labeling) then shows that it is the same in all environments, consistent with the covariate shift construction.

B. Anti-causal CMNIST

We first compute P(C^e|Y^e): P(C^e = 1|Y^e = 1) = P(Y^e ⊕ N^e = 1|Y^e = 1) = P(N^e = 0|Y^e = 1) = 1 - β^e. Therefore, P(C^e = 1|Y^e = 1) is 0.9, 0.8, and 0.1 in environments 1, 2, and 3 respectively. Next, computing P(Y^e|X^e) (assuming Y^e_g = 1, deterministic labeling, and β = 0.25) shows that it varies across the environments. In Figure 4, we provide the graphical model for confounded CMNIST described in equation 10 (for 0 < θ < 1).

D.3. Results with numerical values and standard errors

In Table 5, we provide the numerical values for the results shown in Figure 2, along with the standard errors.

7.2.2. REGRESSION

The generative model for the regression experiments is given below.

H^e ← N(0, σ_e^2 I_s)
X^e_1 ← N(0, σ_e^2 I_s) + W_{h→1} H^e
Y^e ← W_{1→y} X^e_1 + N(0, σ_e^2) + W_{h→y} H^e
X^e_2 ← W_{y→2} Y^e + N(0, I_s) + W_{h→2} H^e   (20)

H^e is the hidden confounder, X^e = [X^e_1, X^e_2] is the observed covariate vector, and Y^e is the label. The different W's are the weight vectors that multiply the covariates and the confounders. The four datasets differ in the weight vectors W, as we describe below.
σ_e is environment-dependent.

• Covariate shift case (CS-regression): We fix W_{h→2}, W_{h→y}, and W_{y→2} to zero, draw each entry of W_{1→y} from (1/s)N(0, 1), and set W_{h→1} to the identity.
• Confounded variable case (CF-regression): We set W_{y→2} to zero and draw each entry of W_{1→y}, W_{h→1}, W_{h→2}, and W_{h→y} from (1/s)N(0, 1).
• Anti-causal variable case (AC-regression): We set W_{h→y}, W_{h→1}, and W_{h→2} to zero and draw each entry of W_{1→y} and W_{y→2} from (1/s)N(0, 1).
• Hybrid confounded and anti-causal variable case (HB-regression): We draw each entry of W_{h→y}, W_{h→1}, W_{h→2}, W_{1→y}, and W_{y→2} from (1/s)N(0, 1).

We also present the graphical models for the four types of models used in Figure 5.
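The four regression datasets can be sampled with a short script. The sketch below is ours; it assumes the (1/s)N(0, 1) draws scale a standard normal by 1/s and that W_{h→1} and W_{h→2} are s × s matrices (identity in the CS case), which matches the dimensions in equation 20.

```python
import numpy as np

def make_regression_env(n, s, sigma_e, variant, rng):
    """Sample n points from the model in equation 20 for one environment.
    variant in {"CS", "CF", "AC", "HB"} selects which weights are zeroed."""
    vec = lambda: rng.normal(0.0, 1.0, size=s) / s          # (1/s) N(0, 1)
    mat = lambda: rng.normal(0.0, 1.0, size=(s, s)) / s
    zv, zm = np.zeros(s), np.zeros((s, s))

    W_1y = vec()
    if variant == "CS":
        W_h1, W_hy, W_h2, W_y2 = np.eye(s), zv, zm, zv
    elif variant == "CF":
        W_h1, W_hy, W_h2, W_y2 = mat(), vec(), mat(), zv
    elif variant == "AC":
        W_h1, W_hy, W_h2, W_y2 = zm, zv, zm, vec()
    else:  # "HB": hybrid confounded + anti-causal
        W_h1, W_hy, W_h2, W_y2 = mat(), vec(), mat(), vec()

    H = rng.normal(0.0, sigma_e, size=(n, s))               # hidden confounder
    X1 = rng.normal(0.0, sigma_e, size=(n, s)) + H @ W_h1.T
    Y = X1 @ W_1y + rng.normal(0.0, sigma_e, size=n) + H @ W_hy
    X2 = np.outer(Y, W_y2) + rng.normal(0.0, 1.0, size=(n, s)) + H @ W_h2.T
    return np.hstack([X1, X2]), Y, W_1y
```

The returned W_{1→y} (padded with zeros for the X^e_2 block) is the desired invariant predictor W* used to measure estimation error.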

7.2.3. CHOICE OF H Φ AND OTHER TRAINING DETAILS

We use a linear model that takes X^e as input. For ERM, we use standard linear regression from sklearn. For IRM, we use 50k gradient steps with a learning rate of 1e-3; the batch size equals the size of the training data. We use the train-domain validation set procedure described by Gulrajani & Lopez-Paz (2020) to select the penalty value from the set {0, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1} (with a 4:1 train-validation split). We average the results over 25 trials.
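As a sketch of what such an IRM training loop computes in the linear, square-loss setting, the following is our own minimal full-batch implementation of the IRMv1 objective (the learning rate, step count, and warm start here are illustrative, not the exact values above):

```python
import numpy as np

def train_irm_linear(envs, lam=1.0, lr=1e-2, steps=5000):
    """Full-batch gradient descent on R(phi) + lam * sum_e g_e^2, where
    g_e = grad_w R^e(w * phi) at w = 1 (linear model, square loss).
    envs: list of (X, y) arrays, one pair per training environment.
    phi is warm-started at the pooled least squares (ERM) solution,
    mirroring the warm-up phase used in the classification experiments."""
    Xall = np.vstack([X for X, _ in envs])
    yall = np.concatenate([y for _, y in envs])
    phi = np.linalg.lstsq(Xall, yall, rcond=None)[0]
    # sufficient statistics per environment: Sigma_e = X'X/n, b_e = X'y/n
    stats = [(X.T @ X / len(y), X.T @ y / len(y)) for X, y in envs]
    for _ in range(steps):
        grad = np.zeros_like(phi)
        for Sig, b in stats:
            r = Sig @ phi - b                 # (1/2) d R^e / d phi
            grad += 2.0 * r                   # ERM term
            g = 2.0 * phi @ r                 # grad_w R^e(w*phi) at w = 1
            grad += lam * 4.0 * g * (2.0 * Sig @ phi - b)  # d g_e^2 / d phi
        phi -= lr * grad
    return phi

# sanity check: one environment with y ~ 2x; both settings stay near 2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=500)
phi_erm = train_irm_linear([(X, y)], lam=0.0, steps=500)
phi_irm = train_irm_linear([(X, y)], lam=1.0, steps=500)
```

The warm start matters in practice: the penalized objective is non-convex, and gradient descent from an arbitrary initialization can stall at spurious stationary points.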

7.2.4. RESULTS

We discuss results for the case when the length of the covariate vector X^e is 10. The desired optimal invariant predictor is W* = [W_{1→y}, 0]. We compare ERM and IRM in terms of the model estimation error, i.e., the distance ||Ŵ - W*||_2 between the model Ŵ estimated by each method and the true model. In Figures 6, 7, 8, and 9, we compare the model estimation error against the number of training samples. In these comparisons, consistent with the classification experiments and the predictions from Proposition 4, in the covariate shift case (see Figure 6) there is no clear winner between the two approaches. There are gains from using IRM in the other cases (Figures 7, 8, 9). However, in the confounder case in Figure 7, the gains from IRM appear in the low sample regime but vanish in the high sample regime; this is because, for this setup, the asymptotic bias of ERM is also very small. In addition to Figures 6, 7, 8, and 9, we provide the corresponding tables.

7.3. SUPPLEMENTARY MATERIALS FOR THE PROOFS

In the results to follow, we will rely on Hoeffding's inequality, which we restate below for convenience.

Lemma 1. (Hoeffding's inequality). Let θ_1, ..., θ_m be a sequence of i.i.d. random variables and assume that for all i, E[θ_i] = μ and P[a ≤ θ_i ≤ b] = 1. Then, for any ε > 0,

P( |(1/m) Σ_{i=1}^m θ_i - μ| > ε ) ≤ 2 exp( -2mε² / (b - a)² )

We restate the propositions from the main body as we prove them. First, we prove Proposition 1 from the main body of the manuscript. Consider an environment q that satisfies Assumption 2. Let us simplify the expression for the risk achieved by every predictor f ∈ F in environment q, following steps similar to equation 21.
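For intuition about the magnitudes Hoeffding's inequality implies, inverting the bound gives the smallest m that drives the failure probability below a target δ; the small helper below (ours) performs that computation.

```python
import math

def hoeffding_sample_size(eps, delta, a, b):
    """Smallest m such that 2 * exp(-2 * m * eps**2 / (b - a)**2) <= delta
    (Lemma 1 rearranged for m)."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. a loss bounded in [0, 1], accuracy eps = 0.05, confidence 1 - delta = 0.99
m = hoeffding_sample_size(0.05, 0.01, 0.0, 1.0)   # -> 1060 samples
```

The 1/ε² dependence is the source of the ν² and κ² factors in the sample complexity bounds that follow.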
R^q(f) = E_q[(Y^q - E_q[Y^q|Z^q])²] + E_q[(E_q[Y^q|Z^q] - f(X^q))²] + 2E_q[(Y^q - E_q[Y^q|Z^q])(E_q[Y^q|Z^q] - f(X^q))]
= E_q[(Y^q - m(Z^q))²] + E_q[(m(Z^q) - f(X^q))²] + 2E_q[(Y^q - m(Z^q))(m(Z^q) - f(X^q))]
= E_q[(Y^q - m(Z^q))²] + E_q[(m(Z^q) - f(X^q))²]
= ξ² + E_q[(m(Z^q) - f(X^q))²]   (25)

In the simplification in equation 25, we use the following equation 26:

E_q[(Y^q - m(Z^q))(m(Z^q) - f(X^q))] = E_q[ E_q[(Y^q - m(Z^q))(m(Z^q) - f(X^q)) | Z^q] ]
= E_q[(E_q[Y^q|Z^q] - m(Z^q))(m(Z^q) - f(X^q))]
= 0   (the last equality follows from Assumption 2)   (26)

Therefore, for an environment q satisfying Assumption 2, it follows from equation 25 that R^q(f) ≥ ξ² for all f ∈ F. Therefore, we can write

∀f ∈ F, max_{e∈E_all} R^e(f) ≥ R^q(f) ≥ ξ²   (27)

from which it directly follows that

min_{f∈F} max_{e∈E_all} R^e(f) ≥ ξ²   (28)

We showed in equation 24 that R^e(m ∘ Φ*) = ξ² for all the environments. Hence, f = m ∘ Φ* achieves the RHS of equation 28. This completes the proof.

Some of the proofs that we describe next take a few intermediate steps to build. Here we give a brief preview of the key ingredients developed for these propositions. • In Proposition 2, our goal is to carry out a sample complexity analysis of EIRM in the same spirit as ERM. However, we face two key challenges: i) the IRM penalty R^{IV} is not separable (it is composed of terms involving squares of expectations), and ii) unlike ERM, IRM is a constrained optimization problem. To deal with i), we develop an estimator in the next section that allows us to re-express the IRM penalty in a separable fashion. To deal with ii), we define a parameter κ that measures the minimum separation between the IRM penalty and ε.
We show that as long as this separation is positive for all the predictors in the hypothesis class, we can rely on the representativeness property from Shalev-Shwartz & Ben-David (2014), applied to the new estimator that we build, to show that the set of empirical invariant predictors Ŝ^{IV}(ε) is the same as the set of exact invariant predictors S^{IV}(ε). • In Proposition 5, our goal is to show that an approximate OOD solution can be achieved by IRM in the finite sample regime. This result builds on the infinite sample result from Arjovsky et al. (2019). In Theorem 9 of Arjovsky et al. (2019), it was shown that for linear models (defined in Assumption 5) obeying linear general position (Assumption 6), if the gradient constraints of the exact-constraint IRM (equation 4) are satisfied, then the OOD solution is achieved. We extend this result to show that if the constraints of the approximate penalty-based IRM (equation 5) are satisfied, then we are guaranteed to be in a √ε-neighborhood of the OOD solution. Note that this result is again in the infinite sample regime, as it characterizes the solutions of problem equation 5, which involves expectations w.r.t. the true distributions. We then exploit tools similar to those introduced for Proposition 2 to prove the finite sample extension. • In later sections, we show generalizations to infinite hypothesis classes. In particular, we focus on parametric model families that are Lipschitz continuous. The extension to infinite hypothesis classes is based on carefully exploiting covering number based techniques (Shalev-Shwartz & Ben-David, 2014) for the IRM penalty estimator that we introduced. We also provide generalizations of the results for linear models to polynomial models. To arrive at these results, we exploit some standard properties of tensor products.

7.3.1. EMPIRICAL ESTIMATOR OF R^{IV}

Next, we define an estimator for R^{IV}. We first simplify R^{IV} as follows. Observe that

∇_{w|w=1.0} R^e(w·Φ) = ∂E^e[ℓ(w·Φ(X^e), Y^e)]/∂w |_{w=1.0} = E^e[ ∂ℓ(w·Φ(X^e), Y^e)/∂w |_{w=1.0} ]

where we used the Leibniz integral rule to take the derivative inside the expectation. Also, we can write E[X]² = E[AB], where A and B are independent random variables with the same distribution as X. Therefore, we consider two independent data points (X^e, Y^e) ∼ P^e and (X̃^e, Ỹ^e) ∼ P^e:

|∇_{w|w=1.0} R^e(w·Φ)|² = E^e[ (∂ℓ(w·Φ(X^e), Y^e)/∂w |_{w=1.0}) (∂ℓ(w·Φ(X̃^e), Ỹ^e)/∂w |_{w=1.0}) ]

where the expectation E^e is taken over the joint distribution of the two independent pairs (X^e, Y^e) and (X̃^e, Ỹ^e) from the same environment e. We write

R^{IV}(Φ) = Σ_{e∈E_tr} π^e |∇_{w|w=1.0} R^e(w·Φ)|² = Ẽ[ ℓ̃(h, (X^e, Y^e), (X̃^e, Ỹ^e)) ]   (34)

where ℓ̃ (equation 33) is the product of the two per-sample gradient terms above and Ẽ is the expectation w.r.t. the distribution P̃ (equation 32) over (e, (X^e, Y^e), (X̃^e, Ỹ^e)). We construct a simple estimator R̂^{IV}(Φ) by pairing the data points in each environment. For simplicity, assume that each environment has an even number of points. In environment e, which has n_e points, we construct n_e/2 pairs. Define the set of such pairs as

D̃ = { {(x^e_{2i-1}, y^e_{2i-1}), (x^e_{2i}, y^e_{2i})}_{i=1}^{n_e/2} }_{e∈E_tr}   (36)

R̂^{IV}(Φ) = (2/|D|) Σ_{e∈E_tr} Σ_{i=1}^{n_e/2} ℓ̃( h, (x^e_{2i-1}, y^e_{2i-1}), (x^e_{2i}, y^e_{2i}) )   (37)

There can be other estimators of R^{IV}(Φ) in which each term |∇_{w|w=1.0} R^e(w·Φ)|² and π^e in the summation is estimated separately. We rely on the estimator in equation 37 because its separability allows us to use standard concentration inequalities, e.g., Hoeffding's inequality.
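For the square loss and a linear representation, the estimator in equation 37 takes a particularly simple form: each per-sample gradient is 2·(Φ^T x)(Φ^T x − y), and products over pairs of independent samples are averaged. A minimal sketch (our code):

```python
import numpy as np

def grad_w(phi, X, y):
    """Per-sample d/dw (y - w * phi^T x)^2 evaluated at w = 1."""
    p = X @ phi
    return 2.0 * p * (p - y)

def paired_penalty_estimate(phi, envs):
    """Estimator of sum_e pi_e |grad_w R^e(w.phi)|^2 (equation 37):
    pair consecutive points within each environment and average the
    product of the two independent per-sample gradients."""
    total, pairs = 0.0, 0
    for X, y in envs:
        g = grad_w(phi, X, y)
        n = len(g) - len(g) % 2          # drop a leftover odd point
        total += (g[0:n:2] * g[1:n:2]).sum()
        pairs += n // 2
    return total / pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(100000, 2))
y = X[:, 0]                              # only the first feature is causal
phi = np.array([1.0, 0.5])               # phi leans on a spurious feature
est = paired_penalty_estimate(phi, [(X, y)])
# population value: (2 * (phi' Sigma phi - phi' E[xy]))^2 = (2 * 0.25)^2 = 0.25
```

Because each pair contributes one i.i.d. product term, Hoeffding's inequality applies directly to this average, which is exactly what the concentration arguments below exploit.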

7.3.2. ε-REPRESENTATIVE TRAINING SET FOR R AND R^{IV}

We use the definition of an ε-representative sample from Shalev-Shwartz & Ben-David (2014). A training set D is called ε-representative (w.r.t. domain X, hypothesis class H, loss ℓ, and distribution D) if ∀h ∈ H, |R̂(h) - R(h)| ≤ ε, where R(h) = E_D[ℓ(h(X), Y)] and (X, Y) ∼ D. Following the above definition, we apply it to the set of pairs D̃ defined above in equation 36. D̃ is called ε-representative w.r.t. the domain X, the hypothesis class H_Φ, the loss ℓ̃ (equation 33), and the distribution P̃ (equation 32) if ∀Φ ∈ H_Φ, |R̂^{IV}(Φ) - R^{IV}(Φ)| ≤ ε, where R^{IV}(Φ) = Ẽ[ℓ̃(h, (X^e, Y^e), (X̃^e, Ỹ^e))] (from equation 34) and (e, (X^e, Y^e), (X̃^e, Ỹ^e)) ∼ P̃.

Recall the definition of κ: κ = min_{Φ∈H_Φ} |R^{IV}(Φ) - ε|. Next, we show that if D̃ is κ/2-representative w.r.t. X, H_Φ, ℓ̃, and P̃, then the set of empirical invariant predictors Ŝ^{IV}(ε) (equation 6) and the set of exact invariant predictors S^{IV}(ε) (equation 5) are equal.

Lemma 2. If κ > 0 and D̃ is κ/2-representative w.r.t. X, H_Φ, loss ℓ̃, and distribution P̃, then Ŝ^{IV}(ε) = S^{IV}(ε).

Proof. First we show S^{IV}(ε) ⊆ Ŝ^{IV}(ε). From the definition of κ, it follows that for all Φ ∈ H_Φ

|R^{IV}(Φ) - ε| ≥ κ  ⟹  R^{IV}(Φ) ≥ ε + κ or R^{IV}(Φ) ≤ ε - κ   (40)

Consider any Φ ∈ S^{IV}(ε), i.e.,

R^{IV}(Φ) ≤ ε   (41)

Given the definition of κ and equation 40, we obtain R^{IV}(Φ) ≤ ε ⟹ R^{IV}(Φ) ≤ ε - κ. Therefore, S^{IV}(ε) ⊆ S^{IV}(ε - κ). Also, it follows from the definition of the set S^{IV}(·) that S^{IV}(ε - κ) ⊆ S^{IV}(ε). Hence,

S^{IV}(ε) = S^{IV}(ε - κ)   (43)

Consider any Φ ∈ S^{IV}(ε). Then

R^{IV}(Φ) ≤ ε - κ   (from equation 43)
R̂^{IV}(Φ) = R̂^{IV}(Φ) - R^{IV}(Φ) + R^{IV}(Φ) ≤ ε - κ + |R̂^{IV}(Φ) - R^{IV}(Φ)|   (44)

From the definition of κ/2-representativeness, it follows that |R̂^{IV}(Φ) - R^{IV}(Φ)| ≤ κ/2, and substituting this into equation 44 we get R̂^{IV}(Φ) ≤ ε - κ/2 ⟹ R̂^{IV}(Φ) ≤ ε ⟹ S^{IV}(ε) ⊆ Ŝ^{IV}(ε).

Next, we show Ŝ^{IV}(ε) ⊆ S^{IV}(ε). Consider any Φ ∈ Ŝ^{IV}(ε), i.e., R̂^{IV}(Φ) ≤ ε. From κ/2-representativeness, R^{IV}(Φ) ≤ R̂^{IV}(Φ) + |R^{IV}(Φ) - R̂^{IV}(Φ)| ≤ ε + κ/2 < ε + κ. From equation 40, either R^{IV}(Φ) ≥ ε + κ or R^{IV}(Φ) ≤ ε - κ; the former is ruled out, so R^{IV}(Φ) ≤ ε - κ ≤ ε, i.e., Φ ∈ S^{IV}(ε). This completes the proof.

For the sample complexity result that follows, define the events A: {D: ∀h ∈ H_Φ, |R̂(h) - R(h)| ≤ ν/2} and B: {D̃: ∀h ∈ H_Φ, |R̂^{IV}(h) - R^{IV}(h)| ≤ κ/2}. If we bound P(A^c) ≤ δ/2 and P(B^c) ≤ δ/2, then we know the probability of success, P(A ∩ B), is at least 1 - δ.

We write

P(A) = P(D: ∀h ∈ H_Φ, |R̂(h) - R(h)| ≤ ν/2) = 1 - P(D: ∃h ∈ H_Φ, |R̂(h) - R(h)| > ν/2)

P(D: ∃h ∈ H_Φ, |R̂(h) - R(h)| > ν/2) = P( ∪_{h∈H_Φ} {D: |R̂(h) - R(h)| > ν/2} ) ≤ Σ_{h∈H_Φ} P(D: |R̂(h) - R(h)| > ν/2)   (49)

The loss function is bounded, |ℓ(Φ(·), ·)| ≤ L. From Hoeffding's inequality in Lemma 1 it follows that

P(D: |R̂(h) - R(h)| > ν/2) ≤ 2 exp( -|D|ν² / (8L²) )   (50)

Using equation 50 in equation 49, we require

2|H_Φ| exp( -|D|ν² / (8L²) ) ≤ δ/2  ⟹  |D| ≥ (8L²/ν²) log(4|H_Φ|/δ)   (51)

Similarly,

P(B) = P(D̃: ∀h ∈ H_Φ, |R̂^{IV}(h) - R^{IV}(h)| ≤ κ/2) = 1 - P(D̃: ∃h ∈ H_Φ, |R̂^{IV}(h) - R^{IV}(h)| > κ/2)

P(D̃: ∃h ∈ H_Φ, |R̂^{IV}(h) - R^{IV}(h)| > κ/2) ≤ Σ_{h∈H_Φ} P(D̃: |R̂^{IV}(h) - R^{IV}(h)| > κ/2)   (52)

The gradient of the loss function is bounded, |∂ℓ(h(·), ·)/∂w |_{w=1.0}| ≤ L'. From the definition of ℓ̃(h, ·, ·) in equation 34, we can infer that |ℓ̃(h, ·, ·)| ≤ L'². Recall that R^{IV}(h) = Ẽ[ℓ̃(h, (X^e, Y^e), (X̃^e, Ỹ^e))]. From Hoeffding's inequality in Lemma 1 it follows that

P(D̃: |R̂^{IV}(h) - R^{IV}(h)| > κ/2) ≤ 2 exp( -|D̃|κ² / (8L'⁴) ) = 2 exp( -|D|κ² / (16L'⁴) )   (53)

since |D̃| = |D|/2. Using equation 53 in equation 52, we require

2|H_Φ| exp( -|D|κ² / (16L'⁴) ) ≤ δ/2  ⟹  |D| ≥ (16L'⁴/κ²) log(4|H_Φ|/δ)   (54)

Combining the two conditions in equation 51 and equation 54, we get that if |D| ≥ max{16L'⁴/κ², 8L²/ν²} log(4|H_Φ|/δ), then with probability at least 1 - δ the event A ∩ B occurs.

7.3.3. PROPERTY OF LEAST SQUARES OPTIMAL SOLUTIONS

We first remind ourselves of a simple property of least squares minimization. Consider the least squares setting, where R(h) = E[(Y - h(X))²].

E[(Y - h(X))²] = E[(Y - E[Y|X] + E[Y|X] - h(X))²]
= E_X[ E[(Y - E[Y|X])² | X] ] + E_X[(E[Y|X] - h(X))²]
= E_X[Var[Y|X]] + E_X[(E[Y|X] - h(X))²]   (55)

In the above simplification, we use the law of total expectation (the cross term vanishes). Both terms in the above equation are non-negative, and the first does not depend on h, which implies the minimization can focus on the second term only:

min_h R(h) = E_X[Var[Y|X]] + min_h E_X[(E[Y|X] - h(X))²]   (56)

Assume that P has full support over X. Define h*(x) = E[Y|X = x] for all x ∈ X. Then ∀h, R(h) ≥ E_X[Var[Y|X]], and since E_X[(E[Y|X] - h*(X))²] = 0 we have R(h*) = E_X[Var[Y|X]]. Therefore,

h* ∈ argmin_h E_X[(E[Y|X] - h(X))²]   (57)

Moreover, h* is essentially the unique minimizer: from Theorem 1.6.6 in Ash et al. (2000), it follows that any other minimizer is the same as h* except over a set of measure zero.

Proof. R(Φ) = Σ_{e∈E_tr} π^e R^e(Φ). From Assumption 4 and the observation in equation 57, it follows that the unique optimal solution to expected risk minimization for each R^e is m. Therefore, m also minimizes the weighted combination R. To show the latter part of the lemma, if we can show that m ∈ S^{IV}(ε), then the rest of the proof follows from the previous part, as we already showed that m is a minimizer among all the functions in H_Φ and S^{IV}(ε) ⊆ H_Φ. Suppose m ∉ S^{IV}(ε). This implies there exists at least one environment for which |∇_{w|w=1.0} R^e(w·m)|² > 0, i.e., ∇_{w|w=1.0} R^e(w·m) ≠ 0. As a result, there exists a w in the neighborhood of w = 1.0 where R^e(w·m) < R^e(m) (if such a point did not exist and all the points in the neighborhood of w = 1.0 had risk greater than or equal to R^e(m), then we would have ∇_{w|w=1.0} R^e(w·m) = 0, which would be a contradiction). Therefore, R^e(w·m) < R^e(m).
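The decomposition in equation 55 can be checked numerically on a small discrete joint distribution (our example): the predictor h*(x) = E[Y|X = x] never loses to any perturbation.

```python
import numpy as np

# joint P(X, Y) on X in {0, 1}, Y in {0, 1, 2}; rows index x, columns y
P = np.array([[0.10, 0.25, 0.05],
              [0.20, 0.10, 0.30]])
ys = np.array([0.0, 1.0, 2.0])

def risk(h):
    """R(h) = E[(Y - h(X))^2] for a predictor h given by its two values."""
    return sum(P[x, j] * (ys[j] - h[x]) ** 2
               for x in range(2) for j in range(3))

# conditional means h*(x) = E[Y | X = x]
h_star = (P * ys).sum(axis=1) / P.sum(axis=1)
base = risk(h_star)

rng = np.random.default_rng(0)
for _ in range(200):
    h = h_star + rng.normal(0.0, 0.5, size=2)   # arbitrary perturbation
    assert risk(h) >= base - 1e-12              # h* is a minimizer
```

The gap risk(h) - base equals exactly the second term of equation 55, which is zero only when h agrees with h* wherever P puts mass.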
However, this is a contradiction, as we know that m is the unique optimizer for each environment. Hence, m ∉ S^{IV}(ε) cannot be true, and thus m ∈ S^{IV}(ε). This completes the proof.

Next, we prove Proposition 4 from the main body of the manuscript.

Proposition 9. Let ℓ be the square loss. For every ν > 0, ε > 0 and δ ∈ (0, 1), if H_Φ is a finite hypothesis class, m ∈ H_Φ, and Assumptions 3 and 4 hold, then:
• if the number of samples |D| is greater than max{ (8L²/ν²) log(4|H_Φ|/δ), (16L'⁴/ε²) log(2/δ) }, then with probability at least 1 - δ, every solution Φ̂ to EIRM (equation 6) satisfies R(m) ≤ R(Φ̂) ≤ R(m) + ν. If also ν < κ, then Φ̂ = m.
• if the number of samples |D| is greater than (8L²/ν²) log(2|H_Φ|/δ), then with probability at least 1 - δ, every solution Φ† to ERM satisfies R(m) ≤ R(Φ†) ≤ R(m) + ν. If also ν < κ, then Φ† = m.

Before stating the proof of Proposition 5, we will prove an intermediate proposition. For clarity, we first restate Theorem 9 from Arjovsky et al. (2019).

Proposition 10. (Theorem 9 in Arjovsky et al. (2019)). Suppose Assumption 5 holds and E_tr lies in linear general position (Assumption 6 with r = 1). If Φ ∈ R^{n×1} (Φ ≠ 0), then Φ^T E^e[X^e X^{e,T}]Φ = Φ^T E^e[X^e Y^e] holds for all e ∈ E_tr iff Φ = S̃^T γ.

Next, we propose an ε-approximation of Proposition 10. Define ε_0 = (π_min/|E_tr|)(ωλ_min)²(12 - 8√2).

Proposition 11. Let ℓ be the square loss. If Assumptions 5, 6 (with r = 1), and 7 hold, then for all 0 < ε < ε_0, every solution Φ of

R^{IV}(Φ) ≤ ε   (63)

satisfies Φ = S̃^T γ·α, where α ∈ [ 1/(1 + (1/(2ωλ_min))√(ε|E_tr|/π_min)), 1/(1 - (1/(2ωλ_min))√(ε|E_tr|/π_min)) ].

Proof. Let us start by simplifying ∇_{w|w=1.0} R^e(w·Φ), using the square loss for ℓ and a linear representation Φ ∈ R^{n×1}.
∇_{w|w=1.0} R^e(w·Φ) = ∂E^e[(Y^e - w·Φ^T X^e)²]/∂w |_{w=1.0} = 2E^e[(Φ^T X^e)²] - 2E^e[Φ^T X^e Y^e] = 2Φ^T E^e[X^e X^{e,T}]Φ - 2Φ^T E^e[X^e Y^e]   (64)

Substituting this into the constraint

R^{IV}(Φ) = Σ_{e∈E_tr} π^e |∇_{w|w=1.0} R^e(w·Φ)|² ≤ ε   (65)

and using the bound on π^e in Assumption 5, it follows that for each e ∈ E_tr

(Φ^T E^e[X^e X^{e,T}]Φ - Φ^T E^e[X^e Y^e])² ≤ ε|E_tr|/(4π_min)
⟹ Φ^T E^e[X^e X^{e,T}]Φ - Φ^T E^e[X^e Y^e] ≤ √( ε|E_tr|/(4π_min) )   (66)

Note that if the condition in equation 66 were not true, then the preceding condition in equation 65 could not hold, as the contribution from that one term in the summation would by itself exceed ε. In equation 66, we use the positive square root since the RHS has to be greater than or equal to zero. We compute the second derivative of the risk w.r.t. w:

∇²_w R^e(w·Φ) = ∂²E^e[(Y^e - w·Φ^T X^e)²]/∂w² = ∂( 2wE^e[(Φ^T X^e)²] - 2E^e[Φ^T X^e Y^e] )/∂w = 2Φ^T E^e[X^e X^{e,T}]Φ = 2Φ^T Σ_e Φ   (67)

Since Σ_e is symmetric, we can use the eigenvalue decomposition Σ_e = UΛU^T in equation 67. Substituting Φ̄ = U^T Φ, we get Φ^T Σ_e Φ = Φ̄^T Λ Φ̄ ≥ λ_min(Σ_e)||Φ̄||² = λ_min(Σ_e)Φ^T UU^T Φ = λ_min(Σ_e)||Φ||². From Assumption 5, we have λ_min(Σ_e) ≥ λ_min, and from Assumption 7 we have ||Φ||² ≥ ω. Therefore, 2Φ^T Σ_e Φ ≥ 2λ_min ω, and the second derivative in equation 67 satisfies

∇²_w R^e(w·Φ) ≥ 2λ_min ω > 0   (68)

Let ε' = 2√( ε|E_tr|/(4π_min) ) = √( ε|E_tr|/π_min ). We rewrite equation 66 in terms of ∇_{w|w=1.0} R^e(w·Φ) and ε' to get

|∇_{w|w=1.0} R^e(w·Φ)| ≤ ε'  ⟺  -ε' ≤ ∇_{w|w=1.0} R^e(w·Φ) ≤ ε'   (69)

Since the second derivative (equation 68) is strictly positive and bounded below by 2λ_min ω, there exists a w^e in the neighborhood of w = 1.0 at which ∇_{w|w=w^e} R^e(w·Φ) = 0. This holds for all the environments in E_tr. Define c(w) = ∇_w R^e(w·Φ) and c'(w) = ∂c(w)/∂w = ∇²_w R^e(w·Φ). Suppose ∇_{w|w=1.0} R^e(w·Φ) = c(1) < 0. Since c'(w) > 0 for all w (from equation 68), there exists w^e > 1 where c(w^e) = 0.
Using the fundamental theorem of calculus, we write

c(w) - c(1) = ∫_1^w c'(u) du ≥ 2λ_min ω (w - 1)

Substituting w = w^e and using c(w^e) = 0,

-c(1) ≥ 2λ_min ω (w^e - 1)
⟹ w^e ≤ 1 - c(1)/(2λ_min ω) ≤ 1 + ε'/(2λ_min ω) = 1 + √(ε|E_tr|/π_min)/(2λ_min ω)   (70)

and, by the analogous argument for c(1) > 0,

w^e ≥ 1 - ε'/(2λ_min ω) = 1 - √(ε|E_tr|/π_min)/(2λ_min ω)   (71)

Define
η_1 = [ (1/(2ωλ_min))√(ε|E_tr|/π_min) ] / [ 1 - (1/(2ωλ_min))√(ε|E_tr|/π_min) ],
η_2 = [ -(1/(2ωλ_min))√(ε|E_tr|/π_min) ] / [ 1 + (1/(2ωλ_min))√(ε|E_tr|/π_min) ].

Combining equation 70 and equation 71 and using the definitions of η_1 and η_2, we conclude that w^e ∈ [1/(1+η_1), 1/(1+η_2)]. If we reparametrize w^e = 1/(1+η^e), then η^e ∈ [η_2, η_1]. We expand the condition ∇_{w|w=w^e} R^e(w·Φ) = 0:

∂E^e[(Y^e - w·Φ^T X^e)²]/∂w |_{w=w^e} = 2w^e E^e[(Φ^T X^e)²] - 2E^e[Φ^T X^e Y^e]
= 2Φ^T( E^e[X^e X^{e,T}]Φ w^e - E^e[X^e Y^e] )
= 2Φ^T( E^e[X^e X^{e,T}](Φ w^e - S̃^T γ) - E^e[X^e ε^e] ) = 0   (72)

where we used Y^e = (S̃^T γ)^T X^e + ε^e, so that E^e[X^e Y^e] = E^e[X^e X^{e,T}]S̃^T γ + E^e[X^e ε^e].

Assume that Φ/(1+η^e) ≠ S̃^T γ for all e ∈ E_tr. If this assumption is not true and Φ/(1+η^e) = S̃^T γ for some e ∈ E_tr, then it already establishes the claim we set out to prove in this proposition (since η^e ∈ [η_2, η_1]). Define q^e = E^e[X^e X^{e,T}]( Φ/(1+η^e) - S̃^T γ ) - E^e[X^e ε^e]. From Assumption 6, and since Φ/(1+η^e) ≠ S̃^T γ, we know that dim span({q^e}_{e∈E_tr}) > n - 1. From the rank-nullity theorem, the dimension of the kernel space of Φ^T (the rank of Φ is 1) is n - 1. From equation 72, it follows that each q^e lies in the kernel space of Φ^T. Therefore, dim(Ker(Φ^T)) = n - 1 ⟹ dim span({q^e}_{e∈E_tr}) ≤ n - 1, which leads to a contradiction. Therefore, Φ/(1+η^e) = S̃^T γ for at least one environment.

If Φ = S̃^T γ (1+η^e), then ||Φ||² = ||S̃^T γ||²(1+η^e)² ≥ ||S̃^T γ||²(1+η_2)² ≥ ||S̃^T γ||²·(1/2) ≥ ω. In this simplification, we use

(1+η_2)² = [ 1/(1 + (1/(2ωλ_min))√(ε|E_tr|/π_min)) ]² ≥ [ 1/(1 + (1/(2ωλ_min))√(ε_0|E_tr|/π_min)) ]² = [ 1/(1+√(3-2√2)) ]² = 1/2

and ||S̃^T γ||² ≥ 2ω (from Assumption 7).
This ensures that for any solution of the form Φ = S̃^T γ (1+η^e), the assumption ||Φ||² ≥ ω is automatically satisfied. Similarly, if Φ = S̃^T γ (1+η^e), then ||Φ||² = ||S̃^T γ||²(1+η^e)² ≤ ||S̃^T γ||²(1+η_1)² ≤ ||S̃^T γ||²·(3+2√2)/2 ≤ Ω. In this simplification, we use

(1+η_1)² = [ 1/(1 - (1/(2ωλ_min))√(ε|E_tr|/π_min)) ]² ≤ [ 1/(1 - (1/(2ωλ_min))√(ε_0|E_tr|/π_min)) ]² = [ 1/(1-√(3-2√2)) ]² = (3+2√2)/2

and ||S̃^T γ||² ≤ 2Ω/(3+2√2) (from Assumption 7). This ensures that for any solution of the form Φ = S̃^T γ (1+η^e), the assumption ||Φ||² ≤ Ω is automatically satisfied.

The entire proof so far has characterized the properties of a Φ that satisfies R^{IV}(Φ) ≤ ε. But how do we know that such a Φ exists? Recall that S̃^T γ ∈ H_Φ. For each environment e ∈ E_tr,

∇_{w|w=1.0} R^e(w·Φ) |_{Φ=S̃^T γ} = 2( γ^T S̃ E^e[X^e X^{e,T}] S̃^T γ - γ^T S̃ E^e[X^e Y^e] )
= 2( γ^T E^e[Z^e_1 Z^{e,T}_1] γ - γ^T E^e[Z^e_1 Y^e] )
= 2( γ^T E^e[Z^e_1 Z^{e,T}_1] γ - γ^T E^e[Z^e_1 Z^{e,T}_1] γ - γ^T E^e[Z^e_1 ε^e] ) = 0   (73)

We use Assumption 5 in the simplification above (S̃X^e = Z^e_1, Y^e = γ^T Z^e_1 + ε^e, and ε^e ⊥ Z^e_1). Therefore, R^{IV}(S̃^T γ) = 0, and as a result the existence of a Φ that satisfies R^{IV}(Φ) ≤ ε is guaranteed. This completes the proof.

Before proving Proposition 5, we first establish that Assumptions 5 and 7 are sufficient to ensure that Assumption 3 holds, so that we can use the bounds L and L' defined in Assumption 3.

Proving Assumption 3 for the square loss from Assumptions 5 and 7. From Assumption 5, we have Z^e = (Z^e_1, Z^e_2) and X^e = SZ^e, so

||X^e|| ≤ ||S|| ||Z^e||   (74)

From Assumption 5, S is bounded and Z^e is bounded, which implies that X^e is bounded as well (from equation 74). Therefore, there exists X_sup < ∞ such that ||X^e|| ≤ X_sup. Next, Y^e = (S̃^T γ)^T X^e + ε^e, so

|Y^e| ≤ ||S̃^T γ|| ||X^e|| + |ε^e|  ⟹  |Y^e| ≤ √(2Ω/(3+2√2)) X_sup + ε_sup   (75)

In the last step of equation 75, we use ||S̃^T γ||² ≤ 2Ω/(3+2√2) (Assumption 7), ||X^e|| ≤ X_sup derived above (equation 74), and |ε^e| ≤ ε_sup (Assumption 5). Therefore, Y^e is bounded and there exists a K such that |Y^e| ≤ K ≤ √Ω X_sup + ε_sup.
Therefore, for all Φ ∈ H_Φ and for all (X^e, Y^e) sampled from the model in Assumption 5, we have

ℓ(Φ(X^e), Y^e) = (Y^e - Φ^T X^e)² ≤ (K + √Ω X_sup)²   (76)

|∂ℓ(w·Φ^T X, Y)/∂w |_{w=1.0}| = |Φ^T X (Φ^T X - Y)| ≤ (√Ω X_sup)(√Ω X_sup + K)   (77)

From equation 76, we conclude that ℓ(Φ(·), ·) is bounded and there exists an L such that |ℓ(Φ(·), ·)| ≤ L ≤ (K + √Ω X_sup)². From equation 77, we conclude that ∂ℓ(w·Φ(·), ·)/∂w |_{w=1.0} is bounded and there exists an L' such that |∂ℓ(w·Φ(·), ·)/∂w |_{w=1.0}| ≤ L' ≤ (√Ω X_sup)(√Ω X_sup + K).

Define ε_th = ((24 - 16√2)/3)(π_min/|E_tr|)(ωλ_min)² and τ = (1/(2ωλ_min))√(3|E_tr|/(2π_min)).

Next, we prove Proposition 5 from the main body of the manuscript.

Proposition 12. Let ℓ be the square loss. For every ε ∈ (0, ε_th) and δ ∈ (0, 1), if Assumptions 5, 6 (with r = 1), and 7 hold, and if the number of data points |D| is greater than (16L'⁴/ε²) log(2|H_Φ|/δ), then with probability at least 1 - δ, every solution Φ̂ to EIRM (equation 6) satisfies Φ̂ = S̃^T γ·α, where α ∈ [1/(1+τ√ε), 1/(1-τ√ε)].

Proof. Define an event

A: { D̃: ∀Φ ∈ H_Φ, |R̂^{IV}(Φ) - R^{IV}(Φ)| ≤ ε/2 }. If event A happens, then every solution Φ̂ to EIRM (equation 6), which satisfies R̂^{IV}(Φ̂) ≤ ε, also satisfies

R^{IV}(Φ̂) = R^{IV}(Φ̂) - R̂^{IV}(Φ̂) + R̂^{IV}(Φ̂) ≤ ε + |R^{IV}(Φ̂) - R̂^{IV}(Φ̂)| ≤ 3ε/2   (78)

From equation 78, we see that we should substitute ε with 3ε/2 in Proposition 11. If Assumptions 5, 6 (with r = 1), and 7 hold and event A happens, then for all 0 < 3ε/2 < ε_0 the output of EIRM (equation 6) satisfies Φ̂ = S̃^T γ·α, where

α ∈ [ 1/(1 + (1/(2ωλ_min))√(3ε|E_tr|/(2π_min))), 1/(1 - (1/(2ωλ_min))√(3ε|E_tr|/(2π_min))) ] = [1/(1+τ√ε), 1/(1-τ√ε)]

Note that 3ε/2 < ε_0 ⟺ ε < ε_th. Hence, all that remains to be shown is that event A occurs with probability at least 1 - δ. Next, we show that if |D| ≥ (16L'⁴/ε²) log(2|H_Φ|/δ), then with probability 1 - δ event A happens. We find an upper bound on the failure probability using Hoeffding's inequality (Lemma 1) and the bound L' (derived in equation 77), redoing the analysis of equation 54 for the reader's convenience:

P(A^c) = P(D̃: ∃Φ ∈ H_Φ, |R̂^{IV}(Φ) - R^{IV}(Φ)| > ε/2) = P( ∪_{Φ∈H_Φ} {D̃: |R̂^{IV}(Φ) - R^{IV}(Φ)| > ε/2} )
≤ Σ_{Φ∈H_Φ} P(D̃: |R̂^{IV}(Φ) - R^{IV}(Φ)| > ε/2) ≤ 2|H_Φ| exp( -ε²|D|/(16L'⁴) )

2|H_Φ| exp( -ε²|D|/(16L'⁴) ) ≤ δ  ⟸  |D| ≥ (16L'⁴/ε²) log(2|H_Φ|/δ)  ⟹  P(A^c) ≤ δ

Hence, we know that if |D| ≥ (16L'⁴/ε²) log(2|H_Φ|/δ), then with probability 1 - δ event A happens. The proof characterized the properties of a Φ̂ that satisfies R̂^{IV}(Φ̂) ≤ ε; but how do we know that such a Φ̂ exists? We show that one always exists. Consider any Φ that satisfies R^{IV}(Φ) ≤ ε/2. Then

R̂^{IV}(Φ) = R̂^{IV}(Φ) - R^{IV}(Φ) + R^{IV}(Φ) ≤ ε/2 + |R̂^{IV}(Φ) - R^{IV}(Φ)| ≤ ε   (follows from event A)   (80)

From equation 73 in the proof of Proposition 11, we know that R^{IV}(S̃^T γ) = 0 ≤ ε/2. From equation 80, S̃^T γ ∈ H_Φ satisfies R̂^{IV}(S̃^T γ) ≤ ε.

Since equation 6 is a constrained optimization problem, a penalty-based version (IRMv1) was proposed by Arjovsky et al. (2019) that minimizes R(Φ) + λR^{IV}(Φ). Both Proposition 4 and Proposition 5 can be extended to IRMv1.
Below we show the sample complexity analysis for IRMv1 applied to the setting assumed in Proposition 5. The analysis shows that the distance to the OOD solution decays as O(√(1/λ)). Define λ_th = max{5σ²/(3ε_th), 1}.

Corollary 1. For every δ ∈ (0, 1) and λ > λ_th, if Assumptions 5, 6 (with r = 1), and 7 hold, and |D| ≥ max{ (64L'⁴λ²/(25σ⁴)) log(4|H_Φ|/δ), (32L²λ²/(25σ⁴)) log(4/δ) }, then with probability at least 1 - δ, every solution Φ̂ of IRMv1 satisfies Φ̂ = S̃^T γ·α, where α ∈ [ 1/(1 + τσ√(5/(3λ))), 1/(1 - τσ√(5/(3λ))) ].
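To see the O(√(1/λ)) rate in Corollary 1 concretely, the bracket for α can be evaluated for a few penalty weights; the values of τ and σ below are illustrative placeholders, not constants from the paper.

```python
import math

def alpha_interval(lam, tau, sigma):
    """Interval for alpha in Corollary 1:
    [1/(1 + tau*sigma*sqrt(5/(3*lam))), 1/(1 - tau*sigma*sqrt(5/(3*lam)))]."""
    r = tau * sigma * math.sqrt(5.0 / (3.0 * lam))
    assert r < 1.0, "lambda below the regime where the bound is informative"
    return 1.0 / (1.0 + r), 1.0 / (1.0 - r)

# the interval shrinks toward {1} (the exact OOD scaling) as lambda grows
widths = []
for lam in (10.0, 100.0, 1000.0):
    lo, hi = alpha_interval(lam, tau=0.5, sigma=1.0)
    widths.append(hi - lo)
```

Each tenfold increase in λ shrinks the interval width by roughly √10, matching the stated O(√(1/λ)) decay.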

Proof. The empirical version of

Define event $A$: $\{D : |\hat{R}(S^T\gamma) - R(S^T\gamma)| \le \frac{\epsilon}{2}\}$. If event $A$ holds, then
$$\hat{R}(S^T\gamma) \le \sigma^2 + \frac{\epsilon}{2} \quad (82)$$
Define event $B$: $\{D : \forall \Phi \in H_\Phi,\ |\hat{R}'(\Phi) - R'(\Phi)| \le \frac{\epsilon}{2}\}$. If event $B$ holds, then $|\hat{R}'(S^T\gamma) - R'(S^T\gamma)| \le \frac{\epsilon}{2}$, and plugging in $R'(S^T\gamma) = 0$ (from equation 73) gives
$$\hat{R}'(S^T\gamma) \le \frac{\epsilon}{2} \quad (83)$$
Define success as $A \cap B$. If event $A \cap B$ occurs, then from equations 82 and 83 the following is true for any solution $\hat{\Phi}$ of IRMv1:
$$\hat{R}(\hat{\Phi}) + \lambda\hat{R}'(\hat{\Phi}) \le \hat{R}(S^T\gamma) + \lambda\hat{R}'(S^T\gamma) \le \sigma^2 + \frac{\epsilon}{2} + \frac{\lambda\epsilon}{2} \quad (84)$$
Since $\hat{R}(\hat{\Phi}) \ge 0$, it follows that
$$\hat{R}'(\hat{\Phi}) \le \frac{\sigma^2 + \epsilon/2}{\lambda} + \frac{\epsilon}{2} \implies R'(\hat{\Phi}) = R'(\hat{\Phi}) - \hat{R}'(\hat{\Phi}) + \hat{R}'(\hat{\Phi}) \le \frac{\sigma^2 + \epsilon/2}{\lambda} + \epsilon \quad (85)$$
In the last implication in equation 85, we use the condition that event $B$ occurs. Let $\epsilon = \frac{\sigma^2}{\lambda}$ and substitute in equation 85 to get
$$R'(\hat{\Phi}) \le \frac{2\sigma^2}{\lambda} + \frac{\sigma^2}{2\lambda^2} \le \frac{5\sigma^2}{2\lambda} \quad \Big(\text{since } \lambda \ge 1,\ \frac{\sigma^2}{2\lambda^2} \le \frac{\sigma^2}{2\lambda}\Big)$$
Recall Proposition 11 and observe that $\frac{5\sigma^2}{2\lambda}$ takes the role of $\epsilon$. If $\frac{5\sigma^2}{2\lambda} \le \epsilon_0$, i.e., $\lambda \ge \frac{5\sigma^2}{3\epsilon_{th}}$, then the condition in Proposition 11 is true, and every solution of IRMv1 is $S^T\gamma(\alpha)$, where
$$\alpha \in \Big[\frac{1}{1+\frac{1}{2\omega\lambda_{\min}}\sqrt{\frac{5\sigma^2|E_{tr}|}{2\lambda\pi_{\min}}}},\ \frac{1}{1-\frac{1}{2\omega\lambda_{\min}}\sqrt{\frac{5\sigma^2|E_{tr}|}{2\lambda\pi_{\min}}}}\Big] = \Big[\frac{1}{1+\tau\sigma\sqrt{\frac{5}{3\lambda}}},\ \frac{1}{1-\tau\sigma\sqrt{\frac{5}{3\lambda}}}\Big]$$
We arrived at the above result assuming $A \cap B$ occurs. We now show that if $|D| \ge \max\Big\{\frac{16L'^4\lambda^2}{\sigma^4}\log\big(\frac{4|H_\Phi|}{\delta}\big),\ \frac{8L^2\lambda^2}{\sigma^4}\log\big(\frac{4}{\delta}\big)\Big\}$, then with probability at least $1-\delta$, $A \cap B$ occurs. We write $P(A) = 1 - P\big(D : |\hat{R}(S^T\gamma) - R(S^T\gamma)| > \frac{\epsilon}{2}\big)$. From Hoeffding's inequality in Lemmas 1 and 3, it follows that
$$P\Big(D : |\hat{R}(S^T\gamma) - R(S^T\gamma)| > \frac{\epsilon}{2}\Big) \le 2\exp\Big(-\frac{|D|\epsilon^2}{8L^2}\Big)$$
$$2\exp\Big(-\frac{|D|\epsilon^2}{8L^2}\Big) \le \frac{\delta}{2} \implies P(A^c) \le \frac{\delta}{2}, \qquad |D| \ge \frac{8L^2}{\epsilon^2}\log\frac{4}{\delta} \implies P(A^c) \le \frac{\delta}{2}$$
Next, we show that if $|D| \ge \frac{16L'^4}{\epsilon^2}\log\big(\frac{4|H_\Phi|}{\delta}\big)$, then with probability at least $1-\frac{\delta}{2}$ event $B$ happens. Now we need to bound the probability $P(B)$.
We find an upper bound on the failure probability using Hoeffding's inequality and Lemma 3 as follows:
$$P(B^c) = P\Big(D : \exists\,\Phi \in H_\Phi,\ |\hat{R}'(\Phi) - R'(\Phi)| > \frac{\epsilon}{2}\Big) = P\Big(\bigcup_{\Phi \in H_\Phi}\Big\{D : |\hat{R}'(\Phi) - R'(\Phi)| > \frac{\epsilon}{2}\Big\}\Big) \le \sum_{\Phi \in H_\Phi}P\Big(D : |\hat{R}'(\Phi) - R'(\Phi)| > \frac{\epsilon}{2}\Big) \le 2|H_\Phi|e^{-\frac{\epsilon^2|D|}{16L'^4}}$$
$$2|H_\Phi|e^{-\frac{\epsilon^2|D|}{16L'^4}} \le \frac{\delta}{2} \implies P(B^c) \le \frac{\delta}{2}, \qquad |D| \ge \frac{16L'^4}{\epsilon^2}\log\frac{4|H_\Phi|}{\delta} \implies P(B^c) \le \frac{\delta}{2} \quad (88)$$
Define $\tilde{S} = \mathrm{diag}\big((S^{\otimes i})_{i=1}^{p}\big)$ and $\bar{\tilde{S}} = \mathrm{diag}\big((\bar{S}^{\otimes i})_{i=1}^{p}\big)$. Then
$$\bar{\tilde{S}}\tilde{X}^e = \bar{\tilde{S}}\tilde{S}\tilde{Z}^e = \mathrm{diag}\big((\bar{S}^{\otimes i})_{i=1}^{p}\big)\,\mathrm{diag}\big((S^{\otimes i})_{i=1}^{p}\big)\,\big(Z^{e,\otimes i}\big)_{i=1}^{p} = \mathrm{diag}\big(((\bar{S}S)^{\otimes i})_{i=1}^{p}\big)\big(Z^{e,\otimes i}\big)_{i=1}^{p} = \big((\bar{S}SZ^e)^{\otimes i}\big)_{i=1}^{p} = \big((Z_1^e)^{\otimes i}\big)_{i=1}^{p} = \zeta_p(Z_1^e) = \tilde{Z}_1^e \quad (92)$$
The dimensionality of $\tilde{X}^e$ is $\tilde{n} = \sum_{i=1}^{p}n^i = \frac{n^{p+1}-n}{n-1}$.
Assumption 9. Inductive bias. $H_\Phi$ is a finite set of (bounded) linear models parametrized by $\tilde{\Phi} \in R^{\tilde{n}}$, with $\tilde{S}^T\tilde{\gamma} \in H_\Phi$ and $\exists\,\omega > 0,\ \Omega > 0$ such that $\forall\,\tilde{\Phi} \in H_\Phi$, $\omega \le \|\tilde{\Phi}\|^2 \le \Omega$ and $2\omega \le \|\tilde{S}^T\tilde{\gamma}\|^2 \le \frac{2}{3+2\sqrt{2}}\Omega$.
We next compute the norm of $\tilde{S}$ in terms of the norm of $S$. Recall we are using the operator norm, defined as $\|S\| = \sigma_{\max}(S)$. Since $\tilde{S} = \mathrm{diag}\big((S^{\otimes i})_{i=1}^p\big)$ is block-diagonal, $\|\tilde{S}\| = \max_{i=1,\ldots,p}\|S^{\otimes i}\|$. Also, note that $\|S^{\otimes i}\| = \|S\|^i$ (Laub, 2005). Therefore, $\|\tilde{S}\| = \max_{i=1,\ldots,p}\|S\|^i$. Hence, if $\|S\|$ is bounded, $\|\tilde{S}\|$ is also bounded. Also, $\|\tilde{Z}^e\|^2 = \sum_i\|Z^{e,\otimes i}\|^2$, and observe that $\|Z^{e,\otimes i}\| = \|Z^e\|^i$; hence, if $Z^e$ is bounded, $\tilde{Z}^e$ is also bounded. Since $\tilde{X}^e = \tilde{S}\tilde{Z}^e$, we can conclude that $\tilde{X}^e$ is also bounded. We can now follow the same line of reasoning as in equations 74, 75, 76, and 77 to conclude that the loss and the gradient of the loss are bounded. We rewrite the model in Assumption 8 as a linear model in terms of the transformed features:
$$e \sim \text{Categorical}(\pi^e),\ \pi^e > 0\ \forall e \in E_{tr}; \qquad Y^e = \tilde{\gamma}^T\tilde{Z}_1^e + \epsilon^e,\ \epsilon^e \perp \tilde{Z}_1^e,\ E[\epsilon^e] = 0,\ E[(\epsilon^e)^2] = \sigma^2,\ |\epsilon^e| \le \epsilon_{sup}; \qquad \tilde{X}^e = \tilde{S}\tilde{Z}^e$$
We showed above in equation 92 that the component $\tilde{Z}_1^e$ of $\tilde{Z}^e$ (defined in equation 91) can be recovered linearly: $\bar{\tilde{S}}\tilde{S}\tilde{Z}^e = \tilde{Z}_1^e$.
We have also shown above that the support of $\tilde{Z}^e$ is bounded and the norm of $\tilde{S}$ is bounded; as a result, $\tilde{X}^e$, the loss, and the gradient of the loss are bounded (the conditions in Assumption 3 are satisfied). We adapt the linear general position assumption (Assumption 6) to the polynomial case below.
Assumption 10. Linear general position of training environments. A set of training environments $E_{tr}$ is said to lie in linear general position of degree $r$ for some $r \in N$ if $|E_{tr}| > \tilde{n} - r + \tilde{n}/r$ and, for all non-zero $x \in R^{\tilde{n}}$,
$$\dim\Big(\mathrm{span}\Big(\big\{E_e[\tilde{X}^e\tilde{X}^{e,T}]x - E_e[\tilde{X}^e\epsilon^e]\big\}_{e \in E_{tr}}\Big)\Big) > \tilde{n} - r$$
We denote $E[\tilde{X}^e\tilde{X}^{e,T}] = \tilde{\Sigma}^e$.
Assumption 11. For all the environments $e \in E_{tr}$, $\tilde{\Sigma}^e$ is positive definite. Define the minimum eigenvalue over all the matrices $\tilde{\Sigma}^e$ as $\tilde{\lambda}_{\min} = \min_{e \in E_{tr}}\lambda_{\min}(\tilde{\Sigma}^e)$. Define $\bar{\epsilon}_{th} = \frac{24-16\sqrt{2}}{3}\frac{\pi_{\min}}{|E_{tr}|}(\omega\tilde{\lambda}_{\min})^2$.
From the analysis in this section, we see that we have been able to construct a linear model identical to Assumption 5, where the roles of $X^e, Z^e, S, \bar{S}, \Sigma^e, \lambda_{\min}, \epsilon_{th}$ are taken by $\tilde{X}^e, \tilde{Z}^e, \tilde{S}, \bar{\tilde{S}}, \tilde{\Sigma}^e, \tilde{\lambda}_{\min}, \bar{\epsilon}_{th}$. We are now ready to use the result already proven for the linear model and state the next proposition in terms of the parameters of the polynomial model.
Proposition 13. Let $\ell$ be the square loss. For every $\epsilon \in (0, \bar{\epsilon}_{th})$ and $\delta \in (0,1)$, if Assumptions 8, 9, 10 (with $r=1$), 11 hold and the number of data points $|D|$ is greater than $\frac{16L'^4}{\epsilon^2}\log\frac{2|H_\Phi|}{\delta}$, then with probability at least $1-\delta$ every solution $\hat{\Phi}$ to EIRM (equation 6) satisfies $\hat{\Phi} = \tilde{S}^T\tilde{\gamma}(\alpha)$, where $\alpha \in \big[\frac{1}{1+\tau\sqrt{\epsilon}},\ \frac{1}{1-\tau\sqrt{\epsilon}}\big]$.
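The lifting used throughout this section, which turns a degree-$p$ polynomial model in $z$ into a linear model in the concatenated Kronecker powers $\tilde{Z} = (z^{\otimes 1}, \ldots, z^{\otimes p})$, is easy to sketch. The code below is our own illustration (names are ours) and also checks the dimension formula $\tilde{n} = \frac{n^{p+1}-n}{n-1}$:

```python
def kron(u, v):
    """Kronecker product of two vectors given as flat lists."""
    return [a * b for a in u for b in v]

def poly_features(z, p):
    """Concatenation (z^{(x)1}, ..., z^{(x)p}) of Kronecker powers of z,
    so a degree-p polynomial in z becomes linear in the output."""
    feats, power = [], z[:]
    for _ in range(p):
        feats.extend(power)          # append z^{(x)i}
        power = kron(power, z)       # build z^{(x)(i+1)}
    return feats
```

For $z \in R^n$, the output has $\sum_{i=1}^{p} n^i$ coordinates, matching the dimensionality $\tilde{n}$ stated above.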

7.4.2. INFINITE HYPOTHESIS CLASSES

In the work so far, we have assumed that the hypothesis class $H_\Phi$ is finite. In this section, we discuss infinite hypothesis class extensions. Before we do that, we state an important result on covering numbers that we will use shortly.
Lemma 5. (Shalev-Shwartz & Ben-David, 2014) Define a set $A = \{a \in R^k : \|a\|_2 \le A_{sup}\}$. The covering number of an $\eta$-cover of $A$, denoted $N_\eta(A)$, is bounded as $N_\eta(A) \le \big(\frac{2A_{sup}\sqrt{k}}{\eta}\big)^k$.
A. Infinite Hypothesis Class: Confounders and Anti-causal variables. In this section, we seek to extend Proposition 5 to infinite hypothesis classes. We restate Assumption 7 for linear models.
Assumption 12. Inductive bias. $H_\Phi$ is a set of (bounded) linear models parametrized by $\Phi \in R^n$: $H_\Phi = \{\Phi \in R^n : 0 < \omega \le \|\Phi\|^2 \le \Omega\}$, with $S^T\gamma \in H_\Phi$ and $2\omega \le \|S^T\gamma\|^2 \le \frac{2}{3+2\sqrt{2}}\Omega$.
Note that the only difference between Assumption 7 and Assumption 12 is that the hypothesis class is no longer required to be finite. We already established in equations 75, 76, and 77 that from Assumptions 5 and 7, the conditions in Assumption 3 hold, i.e., the loss and the gradient of the loss are bounded, and $X^e, Y^e$ are bounded. The same conclusion follows from Assumptions 5 and 12. Hence, for the rest of this section, we can state that $\|X^e\| \le X_{sup}$.
Proof. The output of the model satisfies $|\Phi^TX| \le \|\Phi\|\|X\| \le \sqrt{\Omega}X_{sup}$ (from Cauchy-Schwarz).
$$R'(\Phi) = \sum_e \pi^e\Big(E_e\big[\Phi^TX^e(\Phi^TX^e - Y^e)\big]\Big)^2$$
$$|R'(\Phi_1) - R'(\Phi_2)| \le \sum_e \pi^e\Big|\Big(E_e\big[\Phi_1^TX^e(\Phi_1^TX^e - Y^e)\big]\Big)^2 - \Big(E_e\big[\Phi_2^TX^e(\Phi_2^TX^e - Y^e)\big]\Big)^2\Big| \quad (96)$$
We bound each term in the summation in equation 96 above. Using $|a^2 - b^2| = |a - b||a + b|$ and $\big|E_e[\Phi^TX^e(\Phi^TX^e - Y^e)]\big| \le \sqrt{\Omega}X_{sup}(\sqrt{\Omega}X_{sup} + K)$ for every $\Phi \in H_\Phi$:
$$\Big|\Big(E_e\big[\Phi_1^TX^e(\Phi_1^TX^e - Y^e)\big]\Big)^2 - \Big(E_e\big[\Phi_2^TX^e(\Phi_2^TX^e - Y^e)\big]\Big)^2\Big| \le \Big|E_e\big[\Phi_1^TX^e(\Phi_1^TX^e - Y^e)\big] - E_e\big[\Phi_2^TX^e(\Phi_2^TX^e - Y^e)\big]\Big|\cdot 2\sqrt{\Omega}X_{sup}(\sqrt{\Omega}X_{sup} + K)$$
Adding and subtracting $E_e\big[\Phi_2^TX^e(\Phi_1^TX^e - Y^e)\big]$ inside the first factor,
$$\le \Big(\big|E_e\big[(\Phi_1 - \Phi_2)^TX^e(\Phi_1^TX^e - Y^e)\big]\big| + \big|E_e\big[\Phi_2^TX^e(\Phi_1 - \Phi_2)^TX^e\big]\big|\Big)\cdot 2\sqrt{\Omega}X_{sup}(\sqrt{\Omega}X_{sup} + K) \quad (97)$$
We bound each term in the last line of equation 97:
$$\big|E_e\big[(\Phi_1 - \Phi_2)^TX^e(\Phi_1^TX^e - Y^e)\big]\big| \le (\sqrt{\Omega}X_{sup} + K)\,E_e\big[|(\Phi_1 - \Phi_2)^TX^e|\big] \le (\sqrt{\Omega}X_{sup} + K)X_{sup}\|\Phi_1 - \Phi_2\| \quad \text{(Cauchy-Schwarz)} \quad (98)$$
$$\big|E_e\big[\Phi_2^TX^e(\Phi_1 - \Phi_2)^TX^e\big]\big| \le \sqrt{\Omega}X_{sup}\,E_e\big[|(\Phi_1 - \Phi_2)^TX^e|\big] \le \sqrt{\Omega}(X_{sup})^2\|\Phi_1 - \Phi_2\| \quad \text{(Cauchy-Schwarz)} \quad (99)$$
Substituting equations 98 and 99 into equation 97, we get
$$|R'(\Phi_1) - R'(\Phi_2)| \le 2\sqrt{\Omega}(X_{sup})^2(\sqrt{\Omega}X_{sup} + K)(2\sqrt{\Omega}X_{sup} + K)\|\Phi_1 - \Phi_2\| \quad (100)$$
Therefore, $R'$ is Lipschitz with a constant $C' \le 2\sqrt{\Omega}(X_{sup})^2(\sqrt{\Omega}X_{sup} + K)(2\sqrt{\Omega}X_{sup} + K)$. We just showed that $R'$ is Lipschitz continuous, and we denote its Lipschitz constant by $C'$.
Proposition 14. Let $\ell$ be the square loss.
For every $\epsilon \in (0, \epsilon_{th})$ and $\delta \in (0,1)$, if Assumptions 5, 6 (with $r=1$), 12 hold and the number of samples $|D|$ is greater than $\frac{32L'^4}{\epsilon^2}\big(n\log\frac{16C'\sqrt{\Omega n}}{\epsilon} + \log\frac{2}{\delta}\big)$, then with probability at least $1-\delta$ every solution $\hat{\Phi}$ to EIRM (equation 6) satisfies $\hat{\Phi} = S^T\gamma(\alpha)$, where $\alpha \in \big[\frac{1}{1+\tau\sqrt{\epsilon}},\ \frac{1}{1-\tau\sqrt{\epsilon}}\big]$.
Proof. Following the proof of Proposition 5, our goal is to compute the probability of event $A$: $\{\forall\Phi \in H_\Phi,\ |\hat{R}'(\Phi) - R'(\Phi)| \le \frac{\epsilon}{2}\}$. We construct a minimum cover of size $N_\eta(H_\Phi)$ (see Lemma 5) with points $C = \{\Phi_j\}_{j=1}^{b}$. The probability of failure at one point $\Phi_j$ of the cover is
$$P\Big(D : |\hat{R}'(\Phi_j) - R'(\Phi_j)| > \frac{\epsilon}{4}\Big) < 2e^{-\frac{\epsilon^2|D|}{32L'^4}}$$
We use the union bound to bound the probability of failure over the cover $C$:
$$P\Big(D : \max_{\Phi_j \in C}|\hat{R}'(\Phi_j) - R'(\Phi_j)| > \frac{\epsilon}{4}\Big) < 2N_\eta(H_\Phi)e^{-\frac{\epsilon^2|D|}{32L'^4}} \quad (102)$$
Now consider any $\Phi \in H_\Phi$ and let $\Phi_j$ be the nearest point to it in the cover:
$$|\hat{R}'(\Phi) - R'(\Phi)| = |\hat{R}'(\Phi) - \hat{R}'(\Phi_j) + \hat{R}'(\Phi_j) - R'(\Phi_j) + R'(\Phi_j) - R'(\Phi)| \le |\hat{R}'(\Phi_j) - R'(\Phi_j)| + 2\eta C'$$
In the above simplification, we exploited the Lipschitz continuity of $R'$. Therefore, for each $\Phi \in H_\Phi$,
$$|\hat{R}'(\Phi) - R'(\Phi)| \le \max_{\Phi_j \in C}|\hat{R}'(\Phi_j) - R'(\Phi_j)| + 2\eta C' \implies \max_{\Phi \in H_\Phi}|\hat{R}'(\Phi) - R'(\Phi)| \le \max_{\Phi_j \in C}|\hat{R}'(\Phi_j) - R'(\Phi_j)| + 2\eta C' \quad (104)$$
Set $\eta = \frac{\epsilon}{8C'}$ in equation 104; then from equation 102, with probability at least $1 - 2N_\eta(H_\Phi)e^{-\frac{\epsilon^2|D|}{32L'^4}}$,
$$\max_{\Phi \in H_\Phi}|\hat{R}'(\Phi) - R'(\Phi)| \le \frac{\epsilon}{2} \quad \Big(\text{since } \max_{\Phi_j \in C}|\hat{R}'(\Phi_j) - R'(\Phi_j)| \le \frac{\epsilon}{4}\Big)$$
We bound $2N_\eta(H_\Phi)e^{-\frac{\epsilon^2|D|}{32L'^4}}$ by $\delta$; using Lemma 5 with $A_{sup} = \sqrt{\Omega}$ and $\eta = \frac{\epsilon}{8C'}$, this holds whenever $|D| \ge \frac{32L'^4}{\epsilon^2}\big(n\log\frac{16C'\sqrt{\Omega n}}{\epsilon} + \log\frac{2}{\delta}\big)$ (106). Therefore, if the condition in equation 106 holds, then event $A$ occurs, and following the same argument as in the proof of Proposition 5, the proof is complete.
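The proof strategy above (discretize the ball, union-bound over the cover, extend by Lipschitz continuity) reduces to simple arithmetic once the covering-number bound of Lemma 5 is plugged in. The sketch below is our own numerical illustration of the resulting sample-size expression (function names are ours):

```python
import math

def covering_log_size(radius, k, eta):
    """log N_eta for an eta-cover of the l2 ball of radius `radius` in R^k,
    using the bound N_eta <= (2 * radius * sqrt(k) / eta)**k from Lemma 5."""
    return k * math.log(2 * radius * math.sqrt(k) / eta)

def eirm_infinite_bound(L_prime, C_prime, Omega, k, eps, delta):
    """Sample size making uniform convergence hold over the infinite ball
    of linear models: eta = eps/(8 C'), union bound over the cover, then
    Lipschitz extension, as in the proof of Proposition 14."""
    eta = eps / (8 * C_prime)
    return (32 * L_prime**4 / eps**2) * (
        covering_log_size(math.sqrt(Omega), k, eta) + math.log(2 / delta))
```

The bound grows linearly in the dimension $k$ (through the log covering number) and as $1/\epsilon^2$ through the Hoeffding term, mirroring the statement of Proposition 14.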

B. Infinite hypothesis class: Lipschitz continuous functions

In this section, we seek to extend Propositions 2 and 4 to the infinite hypothesis class of Lipschitz continuous functions, which we formally define next. Define a map $\Phi : P \times X \to R$ from the parameter space $P$ and the feature space $X$ to the reals. Each $p \in P$ is a possible choice for the representation $\Phi(p, \cdot)$. Taking neural networks as an example, $P$ represents the set of values the weights of the network can take.
Assumption 13. $\Phi : P \times X \to R$ is a Lipschitz continuous function (with Lipschitz constant, say, $Q$). $P \subset R^k$ is closed and bounded; thus there exists a $P < \infty$ such that $\forall p \in P$, $\|p\|_2 \le P$. $X \subset R^n$ is closed and bounded; thus there exists an $X_{sup} < \infty$ such that $\forall x \in X$, $\|x\| \le X_{sup}$. $Y \subset R$ is closed and bounded; thus there exists a $K < \infty$ such that $\forall y \in Y$, $|y| \le K$.
Assumption 14. Lipschitz loss and gradient of the loss. $R(\Phi)$ is Lipschitz with a constant $C$, and $R'(\Phi)$ is Lipschitz with a constant $C'$.
From Assumption 13 we derive the conditions in Assumption 3 and Assumption 14. $\Phi : P \times X \to R$ is a continuous function defined over the closed and bounded domain $P \times X$ (the domain is compact), and as a result $\Phi$ is bounded, say by $M$. Consider the square loss: $\ell(\Phi(p,X), Y) = (Y - \Phi(p,X))^2 \le (M+K)^2$. Hence, there exists an $L$ such that $|\ell(\Phi(\cdot),\cdot)| \le L \le (M+K)^2$. $\frac{\partial\ell(w\cdot\Phi(\cdot),\cdot)}{\partial w}\big|_{w=1.0}$ is bounded: $\big|\frac{\partial\ell(w\cdot\Phi(\cdot),\cdot)}{\partial w}\big|_{w=1.0}\big| = |(Y - \Phi(p,X))\Phi(p,X)| \le (K+M)M$. Hence, there exists an $L'$ such that $\big|\frac{\partial\ell(w\cdot\Phi(\cdot),\cdot)}{\partial w}\big|_{w=1.0}\big| \le L' \le (K+M)M$.
Lemma 7. If Assumption 13 holds, then $R(\Phi(p,\cdot))$ and $R'(\Phi(p,\cdot))$ are Lipschitz continuous in $p$.
$R$ is Lipschitz:
$$|R(\Phi(p,\cdot)) - R(\Phi(q,\cdot))| = \Big|\sum_e\pi^e\Big(E_e\big[(\Phi(p,X^e) - Y^e)^2\big] - E_e\big[(\Phi(q,X^e) - Y^e)^2\big]\Big)\Big| \le \sum_e\pi^e\Big|E_e\big[(\Phi(p,X^e) - \Phi(q,X^e))(\Phi(p,X^e) + \Phi(q,X^e) - 2Y^e)\big]\Big| \le \sum_e\pi^e\,E_e\big[|\Phi(p,X^e) - \Phi(q,X^e)|\big]\,2(M+K) \le 2(M+K)Q\|p-q\| \quad (107)$$
Therefore, $R$ is Lipschitz with a constant $C \le 2(M+K)Q$.
$R'$ is Lipschitz:
$$|R'(\Phi(p,\cdot)) - R'(\Phi(q,\cdot))| \le \sum_e\pi^e\Big|\Big(E_e\big[\Phi(p,X^e)(\Phi(p,X^e) - Y^e)\big]\Big)^2 - \Big(E_e\big[\Phi(q,X^e)(\Phi(q,X^e) - Y^e)\big]\Big)^2\Big| \quad (108)$$
We bound each term in the summation in equation 108 above. Since $\big|E_e[\Phi(\cdot,X^e)(\Phi(\cdot,X^e) - Y^e)]\big| \le M(M+K)$, using $|a^2 - b^2| = |a-b||a+b|$:
$$\Big|\Big(E_e\big[\Phi(p,X^e)(\Phi(p,X^e) - Y^e)\big]\Big)^2 - \Big(E_e\big[\Phi(q,X^e)(\Phi(q,X^e) - Y^e)\big]\Big)^2\Big| \le \Big|E_e\big[\Phi(p,X^e)(\Phi(p,X^e) - Y^e)\big] - E_e\big[\Phi(q,X^e)(\Phi(q,X^e) - Y^e)\big]\Big|\,2M(M+K)$$
Adding and subtracting $E_e\big[\Phi(q,X^e)(\Phi(p,X^e) - Y^e)\big]$,
$$\le \Big(\big|E_e\big[(\Phi(p,X^e) - \Phi(q,X^e))(\Phi(p,X^e) - Y^e)\big]\big| + \big|E_e\big[\Phi(q,X^e)(\Phi(p,X^e) - \Phi(q,X^e))\big]\big|\Big)\,2M(M+K) \quad (109)$$
We bound each term in the last line of equation 109:
$$\big|E_e\big[(\Phi(p,X^e) - \Phi(q,X^e))(\Phi(p,X^e) - Y^e)\big]\big| \le E_e\big[|\Phi(p,X^e) - \Phi(q,X^e)|\,|\Phi(p,X^e) - Y^e|\big] \le (M+K)Q\|p-q\| \quad (110)$$
$$\big|E_e\big[\Phi(q,X^e)(\Phi(p,X^e) - \Phi(q,X^e))\big]\big| \le E_e\big[|\Phi(q,X^e)|\,|\Phi(p,X^e) - \Phi(q,X^e)|\big] \le MQ\|p-q\| \quad (111)$$
Substituting equations 110 and 111 into equation 109, we get
$$|R'(\Phi(p,\cdot)) - R'(\Phi(q,\cdot))| \le 2M(M+K)(2M+K)Q\|p-q\| \quad (112)$$
Therefore, $R'$ is Lipschitz with a constant $C' \le 2M(M+K)(2M+K)Q$.

C. EIRM: Sample complexity with no distributional assumptions

In this section, we discuss the extension of Proposition 2 to the infinite hypothesis class case. Consider the problem in equation 6 and replace $\epsilon$ with $\epsilon + \kappa$. Define the distance between the set $S^{IV}(\epsilon)$ and its approximation $S^{IV}(\epsilon + \kappa)$ as
$$\mathrm{dis}(\kappa) = \max_{g \in S^{IV}(\epsilon+\kappa)}\ \min_{h \in S^{IV}(\epsilon)} d(g,h)$$
where $d(g,h)$ is some metric that measures the distance between the functions $g$ and $h$. Observe that if $a \le b$, then $\mathrm{dis}(a) \le \mathrm{dis}(b)$.
Assumption 15. $\lim_{\kappa \to 0}\mathrm{dis}(\kappa) = 0$.
Define
$$D^* = \max\Big\{\frac{32L^2}{\nu^2}\Big(k\log\frac{16C\sqrt{Pk}}{\nu} + \log\frac{2}{\delta}\Big),\ \frac{8L'^4}{\kappa^2}\Big(k\log\frac{8C'\sqrt{Pk}}{\kappa} + \log\frac{2}{\delta}\Big)\Big\}$$
Proposition 15. For every $\nu > 0$ and $\delta \in (0,1)$, if Assumptions 13 and 15 hold, then $\exists\,\kappa > 0$ such that if the number of samples $|D|$ is greater than $D^*$, then with probability at least $1-\delta$, every solution $\hat{\Phi}$ to EIRM (with $\epsilon$ replaced by $\epsilon + \kappa$ in equation 6) is in $S^{IV}(\epsilon + 2\kappa)$ and $|R(\hat{\Phi}) - R(\Phi^*)| \le \nu$, where $\Phi^*$ is a solution of IRM in equation 5.
Proof. We divide the proof into two parts.

Define an event

Define the event $A$: $\{D : \forall p \in P,\ |\hat{R}'(\Phi(p,\cdot)) - R'(\Phi(p,\cdot))| \le \kappa\}$. In the first half, we show that if event $A$ occurs, then $S^{IV}(\epsilon) \subseteq \hat{S}^{IV}(\epsilon+\kappa) \subseteq S^{IV}(\epsilon+2\kappa)$, and we then bound the probability of $A$ not occurring.
$$R'(\Phi(p,\cdot)) \le \epsilon \implies \hat{R}'(\Phi(p,\cdot)) \le \epsilon + |\hat{R}'(\Phi(p,\cdot)) - R'(\Phi(p,\cdot))| \le \epsilon + \kappa \quad (114)$$
Therefore, $S^{IV}(\epsilon) \subseteq \hat{S}^{IV}(\epsilon+\kappa)$. Similarly,
$$\hat{R}'(\Phi(p,\cdot)) \le \epsilon + \kappa \implies R'(\Phi(p,\cdot)) \le \epsilon + \kappa + |R'(\Phi(p,\cdot)) - \hat{R}'(\Phi(p,\cdot))| \le \epsilon + 2\kappa$$
Therefore, $\hat{S}^{IV}(\epsilon+\kappa) \subseteq S^{IV}(\epsilon+2\kappa)$. We now bound the probability of event $A$ not occurring. Using the covering number (from Lemma 5), we construct a minimum cover of size $b = N_\eta(P)$ with points $C_1 = \{p_j\}_{j=1}^{b}$. The probability of failure at one point $p_j$ of the cover is
$$P\Big(D : |\hat{R}'(\Phi(p_j,\cdot)) - R'(\Phi(p_j,\cdot))| > \frac{\kappa}{2}\Big) < 2e^{-\frac{\kappa^2|D|}{8L'^4}}$$
We use the union bound to bound the probability of failure over the entire cover $C_1$:
$$P\Big(D : \max_{p_j \in C_1}|\hat{R}'(\Phi(p_j,\cdot)) - R'(\Phi(p_j,\cdot))| > \frac{\kappa}{2}\Big) < 2N_\eta(P)e^{-\frac{\kappa^2|D|}{8L'^4}} \quad (117)$$
Now consider any $p \in P$ and let $p_j$ be the nearest point to it in the cover:
$$|\hat{R}'(\Phi(p,\cdot)) - R'(\Phi(p,\cdot))| = |\hat{R}'(\Phi(p,\cdot)) - \hat{R}'(\Phi(p_j,\cdot)) + \hat{R}'(\Phi(p_j,\cdot)) - R'(\Phi(p_j,\cdot)) + R'(\Phi(p_j,\cdot)) - R'(\Phi(p,\cdot))| \le |\hat{R}'(\Phi(p_j,\cdot)) - R'(\Phi(p_j,\cdot))| + 2\eta C'$$
In the above simplification, we exploit the Lipschitz continuity of $R'$. Therefore, $\forall p \in P$,
$$\max_{p \in P}|\hat{R}'(\Phi(p,\cdot)) - R'(\Phi(p,\cdot))| \le \max_{p_j \in C_1}|\hat{R}'(\Phi(p_j,\cdot)) - R'(\Phi(p_j,\cdot))| + 2\eta C' \quad (119)$$
Set $\eta = \frac{\kappa}{4C'}$ in equation 119; from equation 117, with probability at least $1 - 2N_\eta(P)e^{-\frac{\kappa^2|D|}{8L'^4}}$,
$$\max_{p \in P}|\hat{R}'(\Phi(p,\cdot)) - R'(\Phi(p,\cdot))| \le \kappa \quad \Big(\text{since } \max_{p_j \in C_1}|\hat{R}'(\Phi(p_j,\cdot)) - R'(\Phi(p_j,\cdot))| \le \frac{\kappa}{2}\Big) \quad (120)$$
We bound $2N_\eta(P)e^{-\frac{\kappa^2|D|}{8L'^4}}$ by $\frac{\delta}{2}$, which holds for $|D|$ at least the second term in $D^*$. In the second half, under the corresponding concentration event for $R$, every solution $\hat{\Phi} = \Phi(\hat{p},\cdot)$ of EIRM (with slack $\epsilon+\kappa$) satisfies
$$R(\Phi(\hat{p},\cdot)) - \frac{\nu}{2} \le \hat{R}(\Phi(\hat{p},\cdot)) \le \hat{R}(\Phi(p^*,\cdot)) \le R(\Phi(p^*,\cdot)) + \frac{\nu}{2} \implies R(\Phi(\hat{p},\cdot)) \le R(\Phi(p^*,\cdot)) + \nu \quad (122)$$
Using the covering number (from Lemma 5), we construct a minimum cover of size $b = N_\eta(P)$ with points $C_1 = \{p_j\}_{j=1}^{b}$.
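The sandwich of invariant sets established in the proof above, $S^{IV}(\epsilon) \subseteq \hat{S}^{IV}(\epsilon+\kappa) \subseteq S^{IV}(\epsilon+2\kappa)$ whenever every empirical penalty is within $\kappa$ of its population value, can be checked mechanically. The sketch below is our own toy illustration (penalty values and names are ours):

```python
def invariant_set(penalties, eps):
    """Indices of representations whose penalty is at most eps,
    i.e. a finite stand-in for S_IV(eps)."""
    return {i for i, r in enumerate(penalties) if r <= eps}

# True penalties R'(Phi) and empirical estimates, each within kappa = 0.05.
true_pen = [0.0, 0.05, 0.12, 0.3]
emp_pen  = [0.03, 0.02, 0.10, 0.33]
eps, kappa = 0.05, 0.05

S_eps    = invariant_set(true_pen, eps)             # S_IV(eps)
S_hat    = invariant_set(emp_pen,  eps + kappa)     # empirical S_IV(eps + kappa)
S_eps_2k = invariant_set(true_pen, eps + 2 * kappa) # S_IV(eps + 2 kappa)
```

Under the stated concentration event, both inclusions hold: relaxing the empirical constraint by $\kappa$ keeps every truly invariant solution feasible, while admitting nothing farther than $2\kappa$ from the true set.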
Let us bound the probability of failure at one point $p_j$ of the cover:
$$P\Big(D : |\hat{R}(\Phi(p_j,\cdot)) - R(\Phi(p_j,\cdot))| > \frac{\nu}{4}\Big) < 2e^{-\frac{\nu^2|D|}{32L^2}}$$
We use the union bound:
$$P\Big(D : \max_{p_j \in C_1}|\hat{R}(\Phi(p_j,\cdot)) - R(\Phi(p_j,\cdot))| > \frac{\nu}{4}\Big) < 2N_\eta(P)e^{-\frac{\nu^2|D|}{32L^2}}$$
Now consider any $p \in P$ and let $p_j$ be the nearest point to it in the cover; we then bound $2N_\eta(P)e^{-\frac{\nu^2|D|}{32L^2}}$ by $\frac{\delta}{2}$, which holds for $|D|$ at least the first term in $D^*$.
We begin by showing how to extend Proposition 4 to binary classification (cross-entropy). Recall that the entropy of a distribution $P_X$ is $H(P) = -E_P[\log(dP)]$, and the cross entropy of $Q$ relative to $P$ is $H(P,Q) = -E_P[\log(dQ)] = H(P) + KL(dP\,\|\,dQ)$. The cross-entropy loss for binary classification with a predictor $f : X \to [0,1]$ ($f(X^e)$ is the probability of label 1 conditional on $X^e$) is given as
$$\ell(f(X^e), Y^e) = -\big(Y^e\log f(X^e) + (1-Y^e)\log(1-f(X^e))\big)$$
If Assumption 16 holds, then from the cross-entropy decomposition in equation 140 it is clear that $m$ solves the OOD problem (as it is optimal w.r.t. each environment). It is also the unique minimizer, which we can justify by the same argument presented in equation 56: suppose there were another optimizer that differed from $m$ over a set of non-zero measure; over such a set, the KL divergence term inside equation 140 would be greater than zero, making the second term in equation 140 positive and thus contradicting optimality. This shows $m$ is the unique optimizer. The rest of the arguments presented in the proof of Proposition 4 carry over to this case. Therefore, Proposition 4 extends to the cross-entropy loss. Note that the proof of Proposition 2 was agnostic to the type of loss and only used boundedness, which also holds for cross-entropy as long as the probability outputs lie in a strict subinterval $[p_{\min}, p_{\max}] \subset (0,1)$. We could not generalize Proposition 5 to the cross-entropy loss; that is left as future work. We derive a relationship as follows for the environment $q$ satisfying Assumption 18.
$$P(Y^q|Z^q, X^q) = P(Y^q|Z^q) \quad \text{(follows from the conditional independence in Assumption 18)} \quad (144)$$
Also, note that since $Z^q = \Phi^*(X^q)$, we have
$$P(Y^q|Z^q, X^q) = P(Y^q|X^q) \quad (145)$$
From equations 144 and 145, we have
$$P(Y^q|X^q) = P(Y^q|Z^q) \quad (146)$$
We use equation 146 in the cross-entropy decomposition from equation 140:
$$R^q(f) = E_q\big[H(P(Y^q|Z^q))\big] + E_q\big[KL\big(P(Y^q|Z^q)\,\|\,Q(Y^q|X^q)\big)\big] \quad (147)$$
Recall $Q(Y^q = 1|X^q) = f(X^q)$ (and $Q(Y^q = 0|X^q) = 1 - f(X^q)$). Also, recall $w^*(Z^q) = P(Y^q = 1|Z^q)$. From the above, it is clear that $f = w^* \circ \Phi^*$ is the optimal predictor for environment $q$, and
$$R^q(w^* \circ \Phi^*) = E_q\big[H(P(Y^q|X^q))\big] \quad (148)$$
The expected conditional entropy for environment $e$, defined as $\bar{H}^e = E_e\big[H(P(Y^e|Z^e))\big]$, is the risk achieved by $w^* \circ \Phi^*$. $\bar{H}^e$ also measures the amount of noise in the environment, much like the variance that remains in least-squares minimization. In the next assumption, we state that the noise in all the environments is bounded above; we also assume that one of the environments achieving the maximum noise level is environment $q$, which satisfies Assumption 18.
Proposition 17. If Assumption 5 holds, $H_\Phi$ is a linear hypothesis class with parameter $\Phi$, and the rank of $\rho$ is at least one, then ERM is asymptotically biased, i.e., even with infinite data ERM will not achieve the desired solution $S^T\gamma$, except over a set of probability distributions $\pi$ of measure zero.
Proof. Consider the case when ERM has access to infinite data, i.e., we solve the expected risk minimization problem stated as $\min_{\Phi \in H_\Phi}R(\Phi)$. We consider the linear model in Assumption 5 and assume $H_\Phi$ is a linear hypothesis class parametrized by $\Phi \in R^n$. We simplify $\nabla_\Phi R(\Phi)$ for the square loss below. Since the rank of $\rho$ is greater than zero, at least one of the columns of $\rho$ is non-zero. As a result, a uniform random draw from this set of probability distributions would have zero probability of satisfying $\rho\pi = 0$.
Therefore, $S^T\gamma$ is not the optimal solution to ERM, and thus the solution of ERM is biased away from $S^T\gamma$. For the square loss, $\nabla_\Phi R(\Phi) = 2\sum_e\pi^e\big(E_e[X^eX^{e,T}]\Phi - E_e[X^eY^e]\big)$; evaluated at $\Phi = S^T\gamma$, this gradient equals $-2\rho\pi$, which is non-zero whenever $\rho\pi \ne 0$. In Proposition 5, we had assumed that Assumptions 5, 6, 7 hold. If we also assume that the rank of $\rho$ is at least one, then Proposition 5 continues to hold. If for at least one $e \in E_{tr}$, $E_e[\epsilon^eX^e]$ is non-zero, then the rank of $\rho$ is at least one. From the proof of Theorem 10 in Arjovsky et al. (2019), linear general position continues to hold except over a set of covariance matrices of measure zero, even when one of the $E_e[\epsilon^eX^e]$ is non-zero. In the above Proposition 17, we only required that the rank of $\rho$ be at least one; if we make the additional Assumptions 5, 6, 7, the result of the above proposition continues to hold. Therefore, if Assumptions 5, 6, 7 hold and the rank of $\rho$ is at least one, then ERM is asymptotically biased, while IRM can be within a $\sqrt{\epsilon}$ neighborhood of the ideal solution with the sample complexity shown in Proposition 5.
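The asymptotic bias of ERM can be exhibited in closed form in a two-feature toy model of our own (not from the paper): $Y = Z + \epsilon$ with an invariant feature $X_1 = Z$ and an anti-causal feature $X_2 = Y + \eta$. Even with infinite data, the population least-squares solution places non-zero weight on $X_2$ whenever the label noise $\sigma^2 > 0$, whereas the invariant solution is $(1, 0)$:

```python
def erm_population_weights(sigma2, tau2):
    """Population (infinite-data) least-squares weights for
    Y = Z + eps (Var eps = sigma2), X1 = Z, X2 = Y + eta (Var eta = tau2),
    Var Z = 1, all noises independent. Solves the 2x2 normal equations
    Sigma w = c by hand (Cramer's rule)."""
    Sigma = [[1.0, 1.0],                      # E[X1 X1], E[X1 X2]
             [1.0, 1.0 + sigma2 + tau2]]      # E[X2 X1], E[X2 X2]
    c = [1.0, 1.0 + sigma2]                   # E[X1 Y], E[X2 Y]
    det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
    w1 = (c[0] * Sigma[1][1] - Sigma[0][1] * c[1]) / det
    w2 = (Sigma[0][0] * c[1] - c[0] * Sigma[1][0]) / det
    return w1, w2
```

One finds $w_2 = \sigma^2/(\sigma^2 + \tau^2)$: the anti-causal feature is used exactly to the extent that it leaks label noise, and only when $\sigma^2 = 0$ does ERM recover the invariant weights.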



• ERIC: After reading Arjovsky et al. (2019) and Gulrajani & Lopez-Paz (2020), I was not sure how to understand the settings where IRM is beneficial over ERM and vice versa. From Gulrajani & Lopez-Paz (2020), I understood that ERM continues to be the state-of-the-art if sufficient care is taken in performing model selection.



Figure 1: Causal Bayesian networks for different distribution shifts. 1) Covariate shift case ($\Phi^* = I$): ERM and IRM achieve the same asymptotic solution $E[Y|X]$. We prove (Proposition 4) that the sample complexity of the two methods is similar; thus there is no clear winner between them in the finite sample regime. For the setup in Figure 1a), both ERM and IRM learn a model that only uses $X_1^e$. 2) Confounder/Anti-causal variable case ($\Phi^* \ne I$): We consider a family of structural equation models (linear and polynomial) that may contain confounders and/or anti-causal variables. For the class of models we consider, the asymptotic solution of ERM is biased and not equal to the desired $E[Y|\Phi^*(X)]$. We prove that IRM can learn a solution that is within $O(\sqrt{\epsilon})$ distance from $E[Y|\Phi^*(X)]$ with a sample complexity that increases as $O(1/\epsilon^2)$ and increases polynomially in the complexity of the model class (Propositions 5, 6); $\epsilon$ (defined later) is the slack in the IRM constraints. For the setups in Figures 1b) and c), IRM gets close to only using $X_1^e$, while ERM continues to use $X_2^e$ even with infinite data (Proposition 17 in the supplement). We summarize the results in Table 1. Arjovsky et al. (2019) proposed the colored MNIST (CMNIST) dataset; comparisons on it showed how ERM-based models exploit spurious factors (background color). The CMNIST dataset relied on anti-causal variables. Many supervised learning datasets may not contain anti-causal variables (e.g., human-labeled images). Therefore, we propose and analyze three new variants of CMNIST in addition to the original one that map to different real-world settings: i) covariate shift based CMNIST (CS-CMNIST): relies on selection bias to induce spurious correlations, ii) confounded CMNIST (CF-CMNIST): relies on confounders to induce spurious correlations, iii) anti-causal CMNIST (AC-CMNIST): this is the original CMNIST proposed by Arjovsky et al.
(2019), and iv) anti-causal and confounded (hybrid) CMNIST (HB-CMNIST): relies on both confounders and anti-causal variables to induce spurious correlations. On the latter three datasets, which belong to the $\Phi^* \ne I$ class described above, IRM has much better OOD behavior than ERM, which performs poorly regardless of the data size. However, IRM and ERM have similar performance on CS-CMNIST, with no clear winner. These results are consistent with our theory and are also validated in regression experiments.

$w=1.0$, respectively, in our next result. Define the minimum eigenvalue across all $\Sigma^e$ as $\lambda_{\min} = \min_{e \in E_{tr}}\lambda_{\min}(\Sigma^e)$, $\epsilon_{th} = \frac{24-16\sqrt{2}}{3}\frac{\pi_{\min}}{|E_{tr}|}(\omega\lambda_{\min})^2$, and $\tau = \frac{1}{2\omega\lambda_{\min}}\sqrt{\frac{3|E_{tr}|}{2\pi_{\min}}}$. Next, we analyze how EIRM learns $S^T\gamma$.

Figure 2: Comparisons: a) CS-CMNIST, b) CF-CMNIST, c) AC-CMNIST and d) HB-CMNIST. 4.1 RESULTS. We use the first two environments ($e = 1, 2$) to train and the third environment ($e = 3$) to test. Other details of the training (models, hyperparameters, etc.) are in the supplement. For each of the above datasets, we run the experiments for different amounts of training data, from 1000 up to 60000 samples (10 trials for each data size). In Figure 2, we compare the models trained using IRM and ERM in terms of the classification error on the test environment $e = 3$ (poor performance indicates the model exploits the color) for a varying number of train samples. We also provide the performance of the ideal hypothetical optimal invariant model. Observe that except in the covariate shift setting, where IRM and ERM are similar as seen in Figure 2a (as predicted by Proposition 4), IRM outperforms ERM on the remaining three datasets (as predicted by Proposition 5), as seen in Figures 2b-d. We further validate this claim through the regression experiments provided in the supplement. On CF-CMNIST, IRM achieves an error of 0.45, which is much better than the error of ERM (0.7) but only marginally better than a random guess. This suggests that confounder-induced spurious correlations are harder to mitigate and may need more samples than in the anti-causal case (AC-CMNIST).

A.1. Compute $P(C^e|Y^e)$ and $P(Y^e|X^e)$. We compute $P(C^e|Y^e)$ and $P(Y^e|X^e)$ for the covariate shift based CMNIST described by equation 11. $P(C^e|Y^e)$ helps us understand how the spurious correlations vary across the environments. $P(Y^e|X^e)$ helps us understand whether the covariate shift condition is satisfied. Compute the probability $P(C^e|Y^e = 1) = P(C^e|Y_g^e = 1, U^e = 1)$ as follows:
$$P(C^e = 1|Y_g^e = 1, U^e = 1) = \frac{P(C^e = 1, Y_g^e = 1|U^e = 1)}{P(C^e = 1, Y_g^e = 1|U^e = 1) + P(C^e = 0, Y_g^e = 1|U^e = 1)}$$
$$P(C^e = 1, Y_g^e = 1|U^e = 1) = \frac{P(C^e = 1, Y_g^e = 1, U^e = 1)}{\sum_{a,b}P(C^e = a, Y_g^e = b, U^e = 1)}, \qquad P(C^e = 0, Y_g^e = 1|U^e = 1) = \frac{P(C^e = 0, Y_g^e = 1, U^e = 1)}{\sum_{a,b}P(C^e = a, Y_g^e = b, U^e = 1)}$$
$$P(C^e = 1|Y_g^e = 1, U^e = 1) = 1 - \psi^e$$

Figure 3: Graphical model for CS-CMNIST

$$P(Y^e = 1, C^e = 0|X_g^e) = P(Y^e = 1|Y_g^e = 1)P(C^e = 0|Y^e = 1) = 0.75\beta^e$$
$$P(Y^e = 0, C^e = 0|X_g^e) = P(Y^e = 0|Y_g^e = 1)P(C^e = 0|Y^e = 0) = 0.25(1-\beta^e)$$
$$P(Y^e = 1|X^e) = P(Y^e = 1|X_g^e, C^e = 0) = \frac{3\beta^e}{2\beta^e + 1}$$
$$P(Y^e = 1|X^e) = 0.25 \text{ (for environment 1)}, \quad 0.428 \text{ (for environment 2)}, \quad 0.964 \text{ (for environment 3)} \quad (14)$$
B.2 Graphical model for anti-causal CMNIST. In Figure 4, we provide the graphical model for AC-CMNIST described in equation 10 (for $G = 1$). B.3. Results with numerical values and standard errors. In Table 3, we provide the numerical values for the results shown in Figure 2c) along with the standard errors. C. Colored MNIST with confounded variables (CF-CMNIST). C.1. Compute $P(C^e|Y^e)$ and $P(Y^e|X^e)$. We start by computing $P(C^e|Y^e)$. Recall that $C^e = (N \oplus N^e)$. In the simplification that follows, we use $\beta = 0.25$.
$$P(N = 0|Y^e = 1) = \frac{P(N = 0, Y^e = 1)}{P(N = 0, Y^e = 1) + P(N = 1, Y^e = 1)}$$
We use the above to compute
$$P(C^e = 0|Y^e = 1) = P(N = 0, N^e = 0|Y^e = 1) + P(N = 1, N^e = 1|Y^e = 1) = (1-\beta^e)\cdot 0.75 + 0.25\beta^e = 0.75 - 0.5\beta^e$$
For environments 1, 2, and 3, the probability $P(C^e = 0|Y^e = 1)$ is 0.7, 0.65, and 0.30, respectively. Next, we compute the probability $P(Y^e|X^e)$.

$$P(Y^e = 1, C^e = 1|X_g^e) = P(Y^e = 1|Y_g^e = 1)P(C^e = 1|N = 0) = 0.75\beta^e$$
$$P(Y^e = 0, C^e = 1|X_g^e) = P(Y^e = 0|Y_g^e = 1)P(C^e = 1|N = 1) = 0.25(1-\beta^e)$$
$$P(Y^e = 1|X^e) = P(Y^e = 1|X_g^e, C^e = 1) = \frac{3\beta^e}{2\beta^e + 1}$$
$$P(Y^e = 1|X^e) = 0.25 \text{ (for environment 1)}, \quad 0.428 \text{ (for environment 2)}, \quad 0.964 \text{ (for environment 3)} \quad (16)$$
C.2 Graphical model for confounded CMNIST. In Figure 4, we provide the graphical model for confounded CMNIST described in equation 10 (for $G = 0$). C.3. Results with numerical values and standard errors. In Table 4, we provide the numerical values for the results shown in Figure 2b) along with the standard errors. D. Colored MNIST with anti-causal variables and confounded variables (HB-CMNIST). D.1. Compute $P(C^e|Y^e)$ and $P(Y^e|X^e)$. We start by computing $P(C^e|Y^e)$. Recall that $C^e = G(Y^e \oplus N^e) + (1-G)(N \oplus N^e)$, where $G = 1$ with probability $\theta$ and 0 otherwise.
$$P(C^e = 1|Y^e = 1) = \frac{P(C^e = 1, Y^e = 1)}{P(C^e = 1, Y^e = 1) + P(C^e = 0, Y^e = 1)} \quad (17)$$
$$P(C^e = 1, Y^e = 1) = \theta(1-\beta^e) + (1-\theta)(0.25 + 0.5\beta^e) \quad (18)$$
$$P(C^e = 0, Y^e = 1) = \theta\beta^e + (1-\theta)(0.75 - 0.25\beta^e) \quad (19)$$
We used $\theta = 0.8$ in the experiments. $P(Y^e|X^e)$ can be computed along the same lines as shown for the anti-causal and confounded models, and it varies significantly across the environments. D.2 Graphical model for HB-CMNIST.

Figure 4: Graphical model for CF-CMNIST, AC-CMNIST, HB-CMNIST

Figure 6: Comparisons: n = 10 CS-regression

Compute $R^e(w \circ \Phi^*)$ using the square loss for $\ell$. Recall $Z^e = \Phi^*(X^e)$:
$$R^e(w \circ \Phi^*) = E_e\Big[\big(Y^e - E_e[Y^e|Z^e] + E_e[Y^e|Z^e] - (w \circ \Phi^*)(X^e)\big)^2\Big] = E_e\big[(Y^e - m(Z^e))^2\big] + E_e\big[(m(Z^e) - w(Z^e))^2\big] + 2E_e\big[(Y^e - m(Z^e))(m(Z^e) - w(Z^e))\big] = \xi^2 + E_e\big[(m(Z^e) - w(Z^e))^2\big] \quad (21)$$
In the above simplification in equation 21, we use the following equations 22 and 23, which rely on the law of total expectation:
$$E_e\big[(Y^e - m(Z^e))(m(Z^e) - w(Z^e))\big] = E_e\Big[E_e\big[(Y^e - m(Z^e))(m(Z^e) - w(Z^e))\,\big|\,Z^e\big]\Big] = E_e\Big[\big(E_e[Y^e|Z^e] - m(Z^e)\big)\big(m(Z^e) - w(Z^e)\big)\Big] = 0 \quad (22)$$
$$E_e\big[(Y^e - m(Z^e))^2\big] = E_e\Big[E_e\big[(Y^e - m(Z^e))^2\,\big|\,Z^e\big]\Big] = E_e\big[\mathrm{Var}_e(Y^e|Z^e)\big] = \xi^2 \quad (23)$$
In the last equality in equation 23, we use Assumption 1 and obtain $\xi^2$. Therefore, substituting $w = m$ in equation 21 achieves a risk of $\xi^2$ for all the environments:
$$\forall e \in E_{all},\quad R^e(m \circ \Phi^*) = \xi^2 \quad (24)$$
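The decomposition in equation 21 can be verified numerically on a tiny discrete model of our own design (values and names are ours): $Z$ uniform on $\{0,1\}$, $Y = m(Z) \pm \xi$ with a fair coin, so that the risk of any $w$ splits exactly into the irreducible noise $\xi^2$ plus the excess term $E[(m(Z)-w(Z))^2]$:

```python
def risk(w, m, xi):
    """E[(Y - w(Z))^2] for Z uniform on {0,1} and Y = m(Z) +/- xi,
    each sign with probability 1/2 (so Var(Y|Z) = xi^2)."""
    total = 0.0
    for z in (0, 1):
        for s in (-1.0, 1.0):
            y = m[z] + s * xi
            total += 0.25 * (y - w[z]) ** 2
    return total

m = {0: 1.0, 1: 3.0}   # true conditional mean m(Z) = E[Y|Z]
w = {0: 0.5, 1: 2.0}   # an arbitrary competing predictor
xi = 0.7
excess = 0.5 * ((m[0] - w[0]) ** 2 + (m[1] - w[1]) ** 2)  # E[(m - w)^2]
```

With these definitions, `risk(w, m, xi)` equals `xi**2 + excess` exactly, and plugging in $w = m$ leaves only $\xi^2$, matching equation 24.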

$$\big\|\nabla_{w|w=1.0}R^e(w \cdot \Phi)\big\|^2 = \Big(\frac{\partial E_e\big[\ell(w \cdot \Phi(X^e), Y^e)\big]}{\partial w}\Big|_{w=1.0}\Big)^2 = \Big(E_e\Big[\frac{\partial\ell(w \cdot \Phi(X^e), Y^e)}{\partial w}\Big|_{w=1.0}\Big]\Big)^2$$

simplification, we used equation 30. Define a joint distribution $\tilde{P}$ over the tuple $(e, (X^e, Y^e), (\tilde{X}^e, \tilde{Y}^e))$, where $e \sim \{\pi^o\}_{o \in E_{tr}}$, $(X^e, Y^e) \sim P^e$ and $(\tilde{X}^e, \tilde{Y}^e) \sim P^e$ independently. Also,
$$\tilde{P}\big(e, (X^e, Y^e), (\tilde{X}^e, \tilde{Y}^e)\big) = \pi^eP^e(X^e, Y^e)P^e(\tilde{X}^e, \tilde{Y}^e) \quad (32)$$
We rewrite the expression in equation 31 in terms of an expectation w.r.t. $\tilde{P}$, which we denote $\tilde{E}$:
$$R'(\Phi) = \tilde{E}\Big[\frac{\partial\ell(w \cdot \Phi(X^e), Y^e)}{\partial w}\Big|_{w=1.0}\cdot\frac{\partial\ell(w \cdot \Phi(\tilde{X}^e), \tilde{Y}^e)}{\partial w}\Big|_{w=1.0}\Big]$$

Recall the definition $m(x) = E[Y^e|X^e = x]$.
Lemma 4. Let $\ell$ be the square loss. If Assumption 4 holds and $m \in H_\Phi$, then $m$ uniquely solves expected risk minimization, $m \in \arg\min_{\Phi \in H_\Phi}R(\Phi)$, and also uniquely solves IRM (equation 5).

)) next. Proposition 10. (Theorem 9 Arjovsky et al. (2019)) If Assumptions 5 and 6 (with r = 1) hold and let

Plug the above equation 64 into the condition $R'(\Phi) \le \epsilon$ to get
$$R'(\Phi) = \sum_e\pi^e\big\|\nabla_{w|w=1.0}R^e(w \cdot \Phi)\big\|^2 = 4\sum_e\pi^e\Big(\Phi^TE_e\big[X^eX^{e,T}\big]\Phi - \Phi^TE_e\big[X^eY^e\big]\Big)^2 \le \epsilon \quad (65)$$

$\nabla_{w|w=1.0}R^e(w \cdot \Phi) = c(1) > 0$. Using the fundamental theorem of calculus, we can write
$$c(1) - c(w) = \int_w^1c'(u)\,du \ge 2\lambda_{\min}\omega(1-w)$$
Substituting $w = w^e$ in the above,
$$c(1) - c(w^e) \ge 2\lambda_{\min}\omega(1-w^e) \implies c(1) \ge 2\lambda_{\min}\omega(1-w^e)$$

IRMv1 minimizes $R(\Phi) + \lambda R'(\Phi)$. Let us compute the risk achieved by the ideal invariant predictor $S^T\gamma$:
$$R(S^T\gamma) = \sum_{e \in E_{tr}}\pi^eE_e\big[(Y^e - \gamma^TSX^e)^2\big] = \sum_{e \in E_{tr}}\pi^eE_e\big[(Y^e - \gamma^TZ_1^e)^2\big] = \sum_{e \in E_{tr}}\pi^eE_e\big[(\epsilon^e)^2\big] = \sigma^2 \quad (81)$$
Define event $A$:

$\|X^e\| \le X_{sup}$, $|Y^e| \le K$, the square loss $\ell(\Phi(\cdot),\cdot)$ is bounded by $L$, and $\frac{\partial\ell(w \cdot \Phi(\cdot),\cdot)}{\partial w}\big|_{w=1.0}$ is bounded by $L'$. In the next lemma, we aim to show that if Assumptions 5 and 12 hold, then $R'(\Phi)$ is Lipschitz continuous.
Lemma 6. If Assumptions 5 and 12 hold, then $R'(\Phi)$ is Lipschitz continuous in $\Phi$.

$$|\hat{R}(\Phi(p,\cdot)) - R(\Phi(p,\cdot))| = |\hat{R}(\Phi(p,\cdot)) - \hat{R}(\Phi(p_j,\cdot)) + \hat{R}(\Phi(p_j,\cdot)) - R(\Phi(p_j,\cdot)) + R(\Phi(p_j,\cdot)) - R(\Phi(p,\cdot))| \le |\hat{R}(\Phi(p_j,\cdot)) - R(\Phi(p_j,\cdot))| + 2\eta C \quad (125)$$
Therefore, for each $p \in P$,
$$|\hat{R}(\Phi(p,\cdot)) - R(\Phi(p,\cdot))| \le \max_{p_j \in C_1}|\hat{R}(\Phi(p_j,\cdot)) - R(\Phi(p_j,\cdot))| + 2\eta C \implies \max_{p \in P}|\hat{R}(\Phi(p,\cdot)) - R(\Phi(p,\cdot))| \le \max_{p_j \in C_1}|\hat{R}(\Phi(p_j,\cdot)) - R(\Phi(p_j,\cdot))| + 2\eta C \quad (126)$$
Set $\eta = \frac{\nu}{8C}$ in equation 126; from equation 124, with probability at least $1 - 2N_\eta(P)e^{-\frac{\nu^2|D|}{32L^2}}$,
$$\max_{p \in P}|\hat{R}(\Phi(p,\cdot)) - R(\Phi(p,\cdot))| \le \frac{\nu}{2} \quad \Big(\text{since } \max_{p_j \in C_1}|\hat{R}(\Phi(p_j,\cdot)) - R(\Phi(p_j,\cdot))| \le \frac{\nu}{4}\Big) \quad (127)$$

For the discussion below, $Q(Y^e \mid X^e)$ is defined in terms of $f$ as follows: $Q(Y^e = 1 \mid X^e) = f(X^e)$ and $Q(Y^e = 0 \mid X^e) = 1 - f(X^e)$.

$$\begin{aligned}
R_e(f) &= \mathbb{E}_e\big[\ell(Y^e, f(X^e))\big] = -\mathbb{E}_e\big[Y^e \log f(X^e) + (1 - Y^e)\log(1 - f(X^e))\big] \\
&= -\mathbb{E}_e\big[\mathbb{E}[Y^e \mid X^e]\log f(X^e) + (1 - \mathbb{E}[Y^e \mid X^e])\log(1 - f(X^e))\big] \\
&= -\mathbb{E}_e\big[\mathbb{P}(Y^e = 1 \mid X^e)\log f(X^e) + \mathbb{P}(Y^e = 0 \mid X^e)\log(1 - f(X^e))\big] \\
&= \mathbb{E}_e\big[H\big(\mathbb{P}(Y^e \mid X^e), Q(Y^e \mid X^e)\big)\big] \\
&= \mathbb{E}_e\big[H\big(\mathbb{P}(Y^e \mid X^e)\big)\big] + \mathbb{E}_e\big[\mathrm{KL}\big(\mathbb{P}(Y^e \mid X^e)\,\|\,Q(Y^e \mid X^e)\big)\big] \quad (140)
\end{aligned}$$

From the above it is clear that $Q(Y^e \mid X^e) = \mathbb{P}(Y^e \mid X^e)$ minimizes the risk in each individual environment.

Assumption 16 (Invariance w.r.t. all the features). For all $e, o \in \mathcal{E}_{all}$ and for all $x \in \mathcal{X}$, $\mathbb{E}[Y^e \mid X^e = x] = \mathbb{E}[Y^o \mid X^o = x]$; $X^e \sim \mathbb{P}^e_{X^e}$, and for all $e \in \mathcal{E}_{all}$ the support of $\mathbb{P}^e_{X^e}$ equals $\mathcal{X}$.

Observe that in the binary classification setting the above assumption amounts to equating the conditional probabilities $\mathbb{P}(Y^e \mid X^e)$ and $\mathbb{P}(Y^o \mid X^o)$. Recall that the map $m$ (from equation 2) simplifies to: for all $x \in \mathcal{X}$,

$$m(x) = \mathbb{E}[Y^e \mid X^e = x] = \mathbb{P}(Y^e = 1 \mid X^e = x) \quad (141)$$
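The decomposition in equation 140 is the standard identity "cross-entropy = entropy + KL"; for Bernoulli conditionals it can be checked pointwise. A minimal numerical sketch (the helper names are ours):

```python
import numpy as np

def H(p):
    # binary entropy of a Bernoulli(p) distribution
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def cross_entropy(p, q):
    # cross-entropy between Bernoulli(p) and Bernoulli(q)
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

def kl(p, q):
    # KL divergence KL(Bernoulli(p) || Bernoulli(q))
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, q = 0.3, 0.7
lhs = cross_entropy(p, q)
rhs = H(p) + kl(p, q)
print(lhs, rhs)  # equal: cross-entropy = entropy + KL
```

Since the KL term is nonnegative and vanishes only at $q = p$, matching the true conditional minimizes the per-environment risk, exactly as stated after equation 140.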

Next, we show how Proposition 1 can be generalized to binary classification.

Assumption 17 (Existence of an invariant representation). $\exists\, \Phi^*: \mathcal{X} \to \mathcal{Z}$ such that $\forall e, o \in \mathcal{E}_{all}$ and $\forall x \in \mathcal{X}$, $\mathbb{E}[Y^e \mid \Phi^*(x)] = \mathbb{E}[Y^o \mid \Phi^*(x)]$.

Recall $m$ defined in equation 2: for all $z \in \Phi^*(\mathcal{X})$,

$$m(z) = \mathbb{E}[Y^e \mid Z^e = z] = \mathbb{P}(Y^e = 1 \mid Z^e = z) \quad (142)$$

Define a composite predictor $w \circ \Phi^*$. Substituting $f = w \circ \Phi^*$ in equation 140 we get the following. For the discussion below, a distribution $\mathsf{R}(Y^e \mid X^e)$ is defined in terms of $w \circ \Phi^*$ as follows: $\mathsf{R}(Y^e = 1 \mid X^e) = w \circ \Phi^*(X^e)$ and $\mathsf{R}(Y^e = 0 \mid X^e) = 1 - w \circ \Phi^*(X^e)$.

$$R_e(w \circ \Phi^*) = \mathbb{E}_e\big[H\big(\mathbb{P}(Y^e \mid Z^e)\big)\big] + \mathbb{E}_e\big[\mathrm{KL}\big(\mathbb{P}(Y^e \mid Z^e)\,\|\,\mathsf{R}(Y^e \mid Z^e)\big)\big] \quad (143)$$

If all the data is transformed by $\Phi^*$, then from the decomposition in equation 143 it is clear that $\mathsf{R}(Y^e \mid Z^e) = \mathbb{P}(Y^e \mid Z^e)$ is the optimal predictor for each environment. Hence, $w^*(Z^e) = \mathbb{P}(Y^e = 1 \mid Z^e)$ is the best choice for $w$.

Assumption 18 (Existence of an environment where the invariant representation is sufficient). There exists an environment $q \in \mathcal{E}_{all}$ such that $Y^q \perp X^q \mid Z^q$, where $Z^q = \Phi^*(X^q)$.

Assumption 19. $\forall e \in \mathcal{E}_{all}$, $H_e \le H_{\sup}$, and $H_q = H_{\sup}$.

Therefore,
$$R_q(w^* \circ \Phi^*) = H_{\sup} \quad (149)$$
From equation 143, for all environments,
$$R_e(w^* \circ \Phi^*) = H_e \quad (150)$$
Observe that $\max_{e \in \mathcal{E}_{all}} R_e(w^* \circ \Phi^*) = H_{\sup}$. From the above assumption it is clear that for all predictors $f: \mathcal{X} \to [0,1]$,
$$\max_{e \in \mathcal{E}_{all}} R_e(f) \ge R_q(f) \ge H_q = H_{\sup} = \max_{e \in \mathcal{E}_{all}} R_e(w^* \circ \Phi^*),$$
so we conclude that $w^* \circ \Phi^*$ is the predictor that solves the OOD problem in equation 1. This completes the extension of Proposition 1 to cross-entropy.

7.4.4. ON THE BIASEDNESS OF ERM

Consider the model in Assumption 5. For each environment $e \in \mathcal{E}_{tr}$, define a vector $\rho_e = \mathbb{E}_e[\varepsilon^e X^e]$. Define a matrix $\rho$ with the $\rho_e$ as column vectors, $\rho = [\rho_1, \ldots, \rho_{|\mathcal{E}_{tr}|}]$. Define a vector $\pi = [\pi_1, \ldots, \pi_{|\mathcal{E}_{tr}|}]$, where (recall from Assumption 5) $\pi_o$ is the probability that a point comes from environment $o$.

$$\nabla_\Phi R(\Phi) = -2\sum_{e \in \mathcal{E}_{tr}} \pi_e \mathbb{E}_e\big[(Y^e - \Phi^{\mathsf{T}} X^e)\, X^e\big] \quad (152)$$

We compute the gradient at $\Phi = \tilde{S}^{\mathsf{T}}\gamma$:

$$\nabla_\Phi\big|_{\Phi = \tilde{S}^{\mathsf{T}}\gamma}\, R(\Phi) = -2\sum_{e \in \mathcal{E}_{tr}} \pi_e \mathbb{E}_e\big[(Y^e - \gamma^{\mathsf{T}}\tilde{S}X^e)\, X^e\big] \;\;(\text{use Assumption 5})\; = -2\sum_{e \in \mathcal{E}_{tr}} \pi_e \mathbb{E}_e\big[\varepsilon^e X^e\big] \quad (153)$$

Recall $\rho_e = \mathbb{E}_e[\varepsilon^e X^e]$, $\rho = [\rho_1, \ldots, \rho_{|\mathcal{E}_{tr}|}]$, and $\pi = [\pi_1, \ldots, \pi_{|\mathcal{E}_{tr}|}]$. Setting the gradient in equation 153 to zero and using the above matrix notation, we get

$$\rho\pi = 0, \quad \mathbf{1}^{\mathsf{T}}\pi = 1, \quad \pi \ge 0 \quad (154)$$

If $\pi$ satisfies equation 154, then ERM is unbiased; otherwise it is not. Consider the set of vectors in the probability simplex $\{\pi \mid \mathbf{1}^{\mathsf{T}}\pi = 1, \pi \ge 0\}$ and define a uniform probability distribution over it.
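Checking the condition $\rho\pi = 0$ in equation 154 is a one-line computation once the correlations $\rho_e = \mathbb{E}_e[\varepsilon^e X^e]$ are known. A sketch with made-up values of $\rho$ (in the anti-causal model these correlations are generically nonzero, and for a generic $\rho$ no vector in the simplex annihilates them):

```python
import numpy as np

# columns are rho_e = E_e[eps^e X^e] for two training environments
# (toy values we assume, not computed from any real model)
rho = np.array([[0.3, -0.1],
                [0.2,  0.4]])
pi = np.array([0.5, 0.5])  # environment mixture weights

bias = rho @ pi  # the quantity rho * pi from equation 154
print(bias)  # nonzero, so ERM is asymptotically biased for this pi
```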

Summary of (empirical) IRM vs. ERM for a finite hypothesis class $\mathcal{H}_\Phi$. $\epsilon$: slack in the IRM constraints; $\nu$: approximation w.r.t. the optimal risk; $\delta$: failure probability; $\mathcal{E}_{tr}$: set of training environments; $n$: data dimension; $p$: degree of the generative polynomial; $L, L'$: bounds on the loss and its gradient.

Covariate shift assumes that $\forall e, o \in \mathcal{E}_{all}$ and $\forall x \in \mathcal{X}$, $\mathbb{P}(Y^e \mid X^e = x) = \mathbb{P}(Y^o \mid X^o = x)$, which implies $\mathbb{E}[Y^e \mid X^e = x] = \mathbb{E}[Y^o \mid X^o = x]$. Therefore, for covariate shifts, $\Phi^*$ is the identity in Assumption 1. A simple instance illustrating Assumption 1 with $\Phi^* = I$ is $Y^e \leftarrow g(X^e) + \varepsilon^e$, where $\mathbb{E}[\varepsilon^e] = 0$, $\mathbb{E}[(\varepsilon^e)^2] = \sigma^2$, and $\varepsilon^e \perp X^e$. Using Assumption 1, we define the invariant map m
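The instance $Y^e \leftarrow g(X^e) + \varepsilon^e$ can be sketched directly: the environments differ only in $P(X^e)$, while the conditional mean is the same everywhere. A minimal simulation (the cubic $g$ and the scale parameterization are our own assumed choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x):
    # fixed structural function shared by all environments (assumed cubic)
    return x ** 3 - x

def sample_env(n, x_scale):
    # covariate shift: P(X^e) differs via x_scale, but
    # E[Y^e | X^e = x] = g(x) is identical in every environment
    X = x_scale * rng.normal(size=n)
    Y = g(X) + rng.normal(size=n)  # eps independent of X, mean 0
    return X, Y

for scale in (0.5, 2.0):  # two environments with shifted covariates
    X, Y = sample_env(100_000, scale)
    # check invariance: E[Y | X near 0.5] matches g(0.5) in both
    mask = np.abs(X - 0.5) < 0.05
    print(scale, Y[mask].mean())  # both close to g(0.5) = -0.375
```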

and is a special case of the first part of Assumption 1 with $\Phi^*$ set to $I$. If $\Phi^* = I$, then $m$ (equation 2) simplifies to $m(x) = \mathbb{E}[Y^e \mid X^e = x]$ and solves the OOD problem in equation 1. A generative model that satisfies the above Assumption 4 is given as Y

Linear general position. A set of training environments $\mathcal{E}_{tr}$ is said to lie in linear general position of degree $r$, for some $r \in \mathbb{N}$, if $|\mathcal{E}_{tr}| > n - r + n/r$ and for all non

• IRMA: Okay, let me start my explanation again. There is no contradiction. The current manuscript says that in covariate shift settings, there may be no clear winner. Many of the datasets considered in Gulrajani & Lopez-Paz (2020) are perhaps similar to covariate shift settings. Also, there could be another reason why IRM did not outperform ERM in Gulrajani & Lopez-Paz (2020), and I will come to it in a bit. • ERIC: Yes, you are right! The datasets are human-labeled images if I recall correctly. In these datasets it's safe to assume P(Y^e | X • ERIC: Wow! These different variants of CMNIST sound exciting. Going back to the covariate shift case, do you think that we can perhaps come up with a finer criterion to say when IRM is better than ERM and vice versa? • IRMA: Yes, actually I have been thinking about this problem and would be happy to share my initial thoughts. Identifying a representation Φ* that leads to invariant conditional distributions is crucial to the success of IRM. A subtle factor that is implicitly assumed is that the representations obtained from multiple domains should overlap. • ERIC: Can you clarify what you mean by overlap?

Comparison of ERM vs IRM: CS-CMNIST

Comparison of ERM vs IRM: AC-CMNIST

Comparison of ERM vs IRM: CF-CMNIST

Comparison of ERM vs IRM: HB-CMNIST

7.2.2 REGRESSION

We use the same structure for the generative model as described by Arjovsky et al. (2019). We work with four variants along the same lines as CMNIST (covariate shift based, confounded, anti-causal, hybrid). The comparisons in Arjovsky et al. (2019) were for the anti-causal and hybrid models. The general model is written as

(Tables 6, 7, 8, 9), with the numerical values for the mean model estimation error (and the standard error) shown in the figures.
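As one illustration of these regression variants, the anti-causal (AC) case can be sketched as a linear structural equation model in which the $Y \to Z_2$ mechanism varies across environments; fitting least squares separately per environment then shows how much weight ERM places on the anti-causal block. The functional form and coefficients below are our own simplification, not the exact model of Arjovsky et al. (2019):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_ac(n, a_e, dim=5):
    # anti-causal variant, sketched: Z1 -> Y -> Z2, X = (Z1, Z2);
    # a_e is an environment-specific coefficient, so the Y -> Z2
    # relation varies across environments (assumed form)
    w_inv = np.ones(dim)  # invariant coefficients
    Z1 = rng.normal(size=(n, dim))
    Y = Z1 @ w_inv + rng.normal(size=n)
    Z2 = a_e * Y[:, None] + rng.normal(size=(n, dim))
    return np.hstack([Z1, Z2]), Y

# two training environments with different Y -> Z2 strengths
envs = [sample_ac(20_000, a) for a in (1.0, 0.1)]
weights = []
for X, Y in envs:
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]  # per-env ERM (least squares)
    weights.append(np.abs(beta[5:]).sum())       # weight on the Z2 block
print(weights)  # the weight ERM puts on Z2 differs across environments
```

Because the optimal within-environment weight on $Z_2$ depends on $a_e$, a predictor that leans on $Z_2$ cannot be simultaneously optimal in both environments, which is exactly the varying relationship IRM is designed to discard.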

Comparison of ERM vs IRM: n = 10 CS-regression

Comparison of ERM vs IRM: n = 10 AC-regression

Comparison of ERM vs IRM: n = 10 HB-regression

and state it appropriately for both $R$ and $R'$. Definition 1. A training set $S$ is called $\epsilon$-representative (w.r.t. domain $\mathcal{Z}$, hypothesis class $\mathcal{H}$, loss $\ell$ and distribution


Therefore, if the condition in equation 121 holds, then event A occurs. If event A occurs, then $\mathcal{S}^{IV}(\epsilon) \subseteq \hat{\mathcal{S}}^{IV}(\epsilon + \kappa) \subseteq \mathcal{S}^{IV}(\epsilon + 2\kappa)$. $\Phi(p^*, \cdot)$ is a solution to IRM (equation 5) and it satisfies, for all $\Phi(p, \cdot) \in \mathcal{S}^{IV}(\epsilon)$, $R(\Phi(p^*, \cdot)) \le R(\Phi(p, \cdot))$. If event B occurs, then for a solution $\Phi(\hat{p}, \cdot)$ of equation 6 (where $\epsilon$ is replaced with $\epsilon$ +

$e^{-\frac{\nu^2 |D|}{32 L^2}} \le \delta$ and solve for the bound on $|D|$ to get
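The solve step can be sketched as follows (assuming the standard finite-class union bound, so the full failure probability is $2|\mathcal{H}_\Phi|\, e^{-\nu^2|D|/(32L^2)}$; the prefix is our assumption, since it is truncated above):

```latex
2|\mathcal{H}_\Phi|\, e^{-\frac{\nu^2 |D|}{32L^2}} \le \delta
\;\iff\; \frac{\nu^2 |D|}{32L^2} \ge \log\frac{2|\mathcal{H}_\Phi|}{\delta}
\;\iff\; |D| \ge \frac{32L^2}{\nu^2}\,\log\frac{2|\mathcal{H}_\Phi|}{\delta}.
```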

6. ACKNOWLEDGEMENTS

This work was supported in part by the Rensselaer-IBM AI Research Collaboration (part of the IBM AI Horizons Network).

7. APPENDIX

From the definition of $\frac{\kappa}{2}$-representativeness it follows that $|R'(\Phi) - \hat{R}'(\Phi)| \le \frac{\kappa}{2}$, and substituting this in equation 46 we get $R'(\Phi) \le \epsilon + \frac{\kappa}{2}$. From equation 40, it follows that $R'(\Phi) \le \epsilon + \frac{\kappa}{2} \implies R'(\Phi) \le \epsilon$. Therefore, $\Phi \in \mathcal{S}^{IV}(\epsilon)$. This proves the second part, $\hat{\mathcal{S}}^{IV}(\epsilon) \subseteq \mathcal{S}^{IV}(\epsilon)$, and completes the proof.

Lemma 3. If $\kappa > 0$, $D$ is $\frac{\nu}{2}$-representative w.r.t. $\mathcal{X}$, $\mathcal{H}_\Phi$, loss $\ell$ and distribution $\mathbb{P}$ (the joint distribution over $(e, X^e, Y^e)$ defined in Section 3.1), and $D$ is $\frac{\kappa}{2}$-representative w.r.t. $\mathcal{X}$, $\mathcal{H}_\Phi$, loss $\ell$ and distribution $\tilde{\mathbb{P}}$, then every solution $\hat{\Phi}$ to EIRM (equation 6) satisfies $\hat{\Phi} \in \mathcal{S}^{IV}(\epsilon)$ and $R(\Phi^*) \le R(\hat{\Phi}) \le R(\Phi^*) + \nu$, where $\Phi^*$ is the solution of IRM in equation 5.

Proof. Given the condition in the above lemma, we can use the previous Lemma 2 to deduce that $\hat{\mathcal{S}}^{IV}(\epsilon) = \mathcal{S}^{IV}(\epsilon)$. This makes the set of predictors satisfying the constraints in EIRM (equation 6) and IRM (equation 5) the same. $\Phi^*$ solves equation 5 and $\hat{\Phi}$ solves equation 6. From $\frac{\nu}{2}$-representativeness we know that $R(\hat{\Phi}) - \frac{\nu}{2} \le \hat{R}(\hat{\Phi})$. From the optimality of $\hat{\Phi}$ we know that $\hat{R}(\hat{\Phi}) \le \hat{R}(\Phi^*)$ (since $\Phi^* \in \hat{\mathcal{S}}^{IV}(\epsilon) = \mathcal{S}^{IV}(\epsilon)$). Moreover, from $\frac{\nu}{2}$-representativeness we know that $\hat{R}(\Phi^*) \le R(\Phi^*) + \frac{\nu}{2}$. We combine these conditions as follows: comparing the first and third inequalities above, we get $R(\hat{\Phi}) \le R(\Phi^*) + \nu$. From the optimality of $\Phi^*$ over the set $\mathcal{S}^{IV}(\epsilon)$, and since $\hat{\Phi} \in \mathcal{S}^{IV}(\epsilon)$, we also have $R(\Phi^*) \le R(\hat{\Phi})$. This completes the proof.

Next, we prove Proposition 2 from the main body of the manuscript.

Proposition 8. For every $\nu > 0$, $\epsilon > 0$ and $\delta \in (0, 1)$, if $\mathcal{H}_\Phi$ is a finite hypothesis class, Assumption 3 holds, $\kappa > 0$, and the number of samples $|D|$ is greater than

Proof. We first cover the second part of the Proposition. From Proposition 3, we know that the output of ERM satisfies $R(\Phi^+) \le R(\Phi^\dagger) \le R(\Phi^+) + \nu$. In this case, from Lemma 4, it follows that $\Phi^+ = m$. From the definition of $\kappa$ and the fact that $\nu < \kappa$, it follows that $\Phi^\dagger = m$.

We now move to the first part of the Proposition. For EIRM we will derive a tighter bound on the sample complexity than the one in Proposition 2, since we can now use Assumption 4. Observe that $\forall e \in \mathcal{E}_{tr}$, $\nabla_w\big|_{w=1.0} R_e(w \cdot m) = 0$ (see the proof of Lemma 4). Therefore, $R'(m) = 0$.

Define an event

We combine these conditions as follows. From the above we have the combined bound. Next, we bound the probability of success: if $P(A^c) \le \frac{\delta}{2}$ and $P(B^c) \le \frac{\delta}{2}$, then we know the probability of success is at least $1 - \delta$.

We write

From equation 51, if the condition is true, then event $A^c$ occurs with probability at most $\frac{\delta}{2}$. We write

The gradient of the loss function is bounded. We bound the above equation 60 by $\frac{\delta}{2}$. Combining the two conditions, equation 59 and equation 61,

Assume that the $Z_1^e$ component of $S$ is invertible, i.e., $\exists\, \tilde{S}$ such that $\tilde{S}(S(Z_1^e, Z_2^e)) = Z_1^e$, and also $\tilde{S}^{\mathsf{T}}\gamma \ne 0$. In the above, $\zeta_p^a$ is a polynomial feature map of degree $p$, defined as $\zeta_p^a: \mathbb{R}^a \to \mathbb{R}^{\bar{a}}$, where $a$ denotes the dimension of the input to the map, and $\otimes$ is the Kronecker product. Also, $\bar{a} = \sum_{i=1}^p a^i$. $\forall e \in \mathcal{E}_{tr}$, $\pi_e \ge \frac{\pi_{\min}}{|\mathcal{E}_{tr}|}$. The support of the distribution of $Z^e = [Z_1^e, Z_2^e]$, $\mathbb{P}^e_{Z^e}$, is bounded, and the operator norm of $S$, $\|S\| = \sigma_{\max}(S)$ ($\sigma_{\max}(S)$ is the maximum singular value of $S$), is also bounded.

Define $Z^e = (Z_1^e, Z_2^e)$ and say $Z_1^e \in \mathbb{R}^c$ and $Z_2^e \in \mathbb{R}^d$. Can we directly use the analysis from the linear case? We cannot directly use the polynomial map for the features, as we also need to find an appropriate transformation of the matrix $S$ that preserves the linear relationship between the transformed features and the transformed variables $Z$. We carry out this exercise below. From the model we know that $X^e = SZ^e$. We remind the reader of the mixed-product property of the Kronecker product: for matrices $A$, $B$, $C$, $D$ of compatible dimensions, $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$. In the expressions that follow, we exploit this mixed-product property.

then the event $A \cap B$ occurs with probability at least $1 - \delta$.

Project $\Phi(p, \cdot) \in \hat{\mathcal{S}}^{IV}(\epsilon + \kappa)$ onto $\mathcal{S}^{IV}(\epsilon)$, i.e., find the closest function in terms of the metric $\mathrm{dis}$, to obtain $\Phi(\tilde{p}, \cdot)$. If event $A$ occurs, then $p \in \mathcal{S}^{IV}(\epsilon + 2\kappa)$. The distance satisfies $\|p - \tilde{p}\| \le \mathrm{dis}(2\kappa)$. We choose $\kappa_0$ with $\kappa < \kappa_0$ (use Assumption 15) such that $C\,\mathrm{dis}(2\kappa$

Therefore, by combining equation 122 and equation 131, we can conclude that if event

From the conditions on $|D|$, we know that $A \cap B$ occurs with probability $1 - \delta$. We substitute $\Phi(p^*, \cdot)$ as $\Phi^*$ and $\Phi(p, \cdot)$ as $\hat{\Phi}$, and this completes the proof.
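The mixed-product property invoked above, $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$, is easy to verify numerically. A minimal sketch with random matrices of compatible shapes:

```python
import numpy as np

rng = np.random.default_rng(4)

# mixed-product property of the Kronecker product:
# (A kron B)(C kron D) = (A C) kron (B D), whenever A C and B D are defined
A, B = rng.normal(size=(2, 3)), rng.normal(size=(4, 2))
C, D = rng.normal(size=(3, 5)), rng.normal(size=(2, 3))

lhs = np.kron(A, B) @ np.kron(C, D)   # shape (8, 15)
rhs = np.kron(A @ C, B @ D)           # shape (8, 15)
print(np.allclose(lhs, rhs))  # True
```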

D. OOD Performance: Covariate shift case

In this section, we discuss the extension of Proposition 4 to the infinite hypothesis class case.

Proposition 16. If Assumptions 4 and 13 hold, $m \in \mathcal{H}_\Phi$, and the number of samples $|D|$ is greater than $D_1^*$, then with probability at least $1 - \delta$, every solution

If Assumptions 4 and 13 hold, $m \in \mathcal{H}_\Phi$, and the number of samples $|D|$ is greater than $D_2^*$, then with probability at least $1 - \delta$, every solution

Proof. We begin with the first part. Following the proof of Proposition 5, our goal is to compute the probability of event A:

Using the covering number (from Lemma 5), we construct a minimum cover of size $b = N_\eta(\mathcal{P})$ with points $C = \{p_j\}_{j=1}^b$. We compute the probability of failure at one point $p_j$ in the cover, and we use the union bound to bound the probability of failure over the cover $C$. Now consider any $p \in \mathcal{P}$ and suppose $p_j$ is the nearest point to it in the cover. In the above simplification, we used the Lipschitz continuity of $R$. Therefore, set $\eta = \frac{\nu}{8C}$ in equation 135. From equation 133, the stated bound holds with probability at least $1 - N_\eta(\mathcal{P}) \cdot 2e^{(\cdot)}$. We bound $N_\eta(\mathcal{P}) \cdot 2e^{(\cdot)}$ by $\frac{\delta}{2}$.

We now move to the second part. From the first part of the proof, we conclude that when the condition on $|D|$ holds, with probability at least $1 - \frac{\delta}{2}$, event A occurs. Define an event B:

This completes the proof.

7.4.3. EXTENSIONS TO BINARY CLASSIFICATION (CROSS-ENTROPY)

In the main body of the manuscript, we focused on regression (square-loss). In this section, we discuss the results that can be extended to binary classification (cross-entropy) loss. We will not go in the order in which the results were introduced in the manuscript but in an order that makes for easier exposition for the classification case.

