EMPIRICAL OR INVARIANT RISK MINIMIZATION? A SAMPLE COMPLEXITY PERSPECTIVE

Abstract

Recently, invariant risk minimization (IRM) was proposed as a promising solution to address out-of-distribution (OOD) generalization. However, it is unclear when IRM should be preferred over the widely-employed empirical risk minimization (ERM) framework. In this work, we analyze both frameworks from the perspective of sample complexity, thus taking a firm step towards answering this important question. We find that depending on the type of data generation mechanism, the two approaches can have very different finite-sample and asymptotic behavior. For example, in the covariate shift setting the two approaches not only arrive at the same asymptotic solution, but also have similar finite-sample behavior with no clear winner. For other distribution shifts, such as those involving confounders or anti-causal variables, however, the two approaches arrive at different asymptotic solutions: IRM is guaranteed to be close to the desired OOD solutions in the finite-sample regime for polynomial generative models, while ERM is biased even asymptotically. We further investigate how different factors (the number of environments, the complexity of the model, and the IRM penalty weight) impact the sample complexity of IRM in relation to its distance from the OOD solutions.

1. INTRODUCTION

A recent study shows that models trained to detect COVID-19 from chest radiographs rely on spurious factors, such as the source of the data, rather than the lung pathology (DeGrave et al., 2020). This is just one of many alarming examples of spurious correlations failing to hold outside a specific training distribution. In one commonly cited example, Beery et al. (2018) trained a convolutional neural network (CNN) to classify camels from cows. In the training data, most pictures of the cows had green pastures, while most pictures of camels were in the desert. The CNN picked up the spurious correlation and associated green pastures with cows, thus failing to classify cows on beaches. Recently, Arjovsky et al. (2019) proposed a framework called invariant risk minimization (IRM) to address the problem of models inheriting spurious correlations. They showed that when data is gathered from multiple environments, one can learn to exploit invariant causal relationships, rather than relying on varying spurious relationships, thus learning robust predictors. More recent work suggests that empirical risk minimization (ERM) is still state-of-the-art on many problems requiring OOD generalization (Gulrajani & Lopez-Paz, 2020). This gives rise to a fundamental question: when is IRM better than ERM (and vice versa)? In this work, we seek to answer this question through a systematic comparison of the sample complexity of the two approaches under different types of train and test distributional mismatches. The distribution shifts P_train(X, Y) ≠ P_test(X, Y) that we consider, informally stated, satisfy an invariance condition: there exists a representation Φ* of the covariates such that P_train(Y|Φ*(X)) = P_test(Y|Φ*(X)) = P(Y|Φ*(X)). A special case of this occurs when Φ* is the identity, i.e., P_train(X) ≠ P_test(X) but P_train(Y|X) = P_test(Y|X); such a shift is known as a covariate shift (Gretton et al., 2009).
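The covariate shift case just described can be made concrete with a minimal numpy sketch; the data-generating function `covariate_shift_env`, its coefficients, and noise scales are illustrative assumptions, not the paper's models. Because P(Y|X) is shared across environments, a predictor fit by least squares on one environment keeps its low error when P(X) shifts:

```python
import numpy as np

rng = np.random.default_rng(0)

def covariate_shift_env(n, x_scale):
    """P(X) differs across environments (via x_scale),
    but P(Y|X) is fixed: Y = 2*X + small noise."""
    x = x_scale * rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)
    return x, y

x_tr, y_tr = covariate_shift_env(10000, 1.0)   # training environment
x_te, y_te = covariate_shift_env(10000, 3.0)   # shifted test environment

# ERM (one-dimensional least squares) fit on the training environment alone.
slope = np.sum(x_tr * y_tr) / np.sum(x_tr ** 2)

# Since P(Y|X) is invariant, the fitted predictor transfers: the test
# error stays near the noise floor (0.01) despite the shift in P(X).
test_mse = np.mean((slope * x_te - y_te) ** 2)
print(slope, test_mse)
```

This is exactly the regime where the paper finds no clear winner: ERM already targets the invariant conditional, so IRM has no spurious correlation to remove.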
In many other settings Φ* may not be the identity (denoted I); examples include settings with confounders or anti-causal variables (Pearl, 2009), where covariates appear spuriously correlated with the label and P_train(Y|X) ≠ P_test(Y|X). We use causal Bayesian networks to illustrate these shifts in Figure 1. Suppose X^e = [X^e_1, X^e_2] represents the image, where X^e_1 is the shape of the animal and X^e_2 is the background color, Y^e is the label of the animal, and e is the index of the environment/domain. In Figure 1a), X^e_2 is independent of (Y^e, X^e_1); this represents the covariate shift case (Φ* = I). In Figure 1b), X^e_2 is spuriously correlated with Y^e through the confounder ε^e. In Figure 1c), X^e_2 is spuriously correlated with Y^e as it is anti-causally related to Y^e. In both Figure 1b) and c), Φ* ≠ I; Φ* is a block diagonal matrix that selects X^e_1. Our setup assumes we are given data from multiple training environments satisfying the invariance condition, i.e., P(Y|Φ*(X)) is the same across all of them. Ideally, we want to learn and predict using E[Y|Φ*(X)]; this predictor has a desirable OOD behavior, as we show later by proving min-max optimality with respect to (w.r.t.) unseen test distributions satisfying the invariance condition. Our goal is to analyze and compare ERM and IRM's ability to learn E[Y|Φ*(X)] from finite training data acquired from a fixed number of training environments. Our analysis has two parts. 1) Covariate shift case (Φ* = I): ERM and IRM achieve the same asymptotic solution E[Y|X]. We prove (Proposition 4) that the sample complexity of the two methods is similar; thus there is no clear winner between them in the finite-sample regime. For the setup in Figure 1a), both ERM and IRM learn a model that only uses X^e_1. 2) Confounder/Anti-causal variable case (Φ* ≠ I): We consider a family of structural equation models (linear and polynomial) that may contain confounders and/or anti-causal variables. For the class of models we consider, the asymptotic solution of ERM is biased and not equal to the desired E[Y|Φ*(X)].
We prove that IRM can learn a solution that is within O(√ε) distance of E[Y|Φ*(X)] with a sample complexity that increases as O(1/ε²) and grows polynomially in the complexity of the model class (Propositions 5, 6); ε (defined later) is the slack in the IRM constraints. For the setups in Figure 1b) and c), IRM gets close to only using X^e_1, while ERM, even with infinite data (Proposition 17 in the supplement), continues to use X^e_2. We summarize the results in Table 1. Arjovsky et al. (2019) proposed the colored MNIST (CMNIST) dataset; comparisons on it showed how ERM-based models exploit spurious factors (background color). The CMNIST dataset relied on anti-causal variables. Many supervised learning datasets may not contain anti-causal variables (e.g., human-labeled images). Therefore, we propose and analyze three new variants of CMNIST in addition to the original one, which map to different real-world settings: i) covariate shift based CMNIST (CS-CMNIST): relies on selection bias to induce spurious correlations; ii) confounded CMNIST (CF-CMNIST): relies on confounders to induce spurious correlations; iii) anti-causal CMNIST (AC-CMNIST): the original CMNIST proposed by Arjovsky et al. (2019); and iv) anti-causal and confounded (hybrid) CMNIST (HB-CMNIST): relies on confounders and anti-causal variables to induce spurious correlations. On the latter three datasets, which belong to the Φ* ≠ I class described above, IRM has much better OOD behavior than ERM, which performs poorly regardless of the data size. However, IRM and ERM have similar performance on CS-CMNIST, with no clear winner. These results are consistent with our theory and are also validated in regression experiments.
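The confounder/anti-causal case can be sketched with a toy linear anti-causal SEM of the Figure 1c) kind; `sample_env`, the coefficient 1.5, the noise scales, and the squared-loss form of the IRMv1 penalty from Arjovsky et al. (2019) are illustrative assumptions, not the paper's exact model class. Pooled least squares (ERM) keeps weight on the spurious feature X2 no matter how much data it sees, while the IRM penalty flags that solution as non-invariant:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_env(n, sigma_e):
    """Toy linear anti-causal SEM: X1 -> Y -> X2. The mechanism
    E[Y|X1] = 1.5*X1 is invariant; the noise scale sigma_e on the
    anti-causal feature X2 varies across environments."""
    x1 = rng.normal(size=n)
    y = 1.5 * x1 + rng.normal(size=n)
    x2 = y + sigma_e * rng.normal(size=n)
    return np.column_stack([x1, x2]), y

envs = [sample_env(20000, 0.5), sample_env(20000, 1.5)]

# ERM: ordinary least squares on the pooled environments.
X = np.vstack([X_e for X_e, _ in envs])
y = np.concatenate([y_e for _, y_e in envs])
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)
w_inv = np.array([1.5, 0.0])  # the desired invariant predictor E[Y|Phi*(X)]

def irm_penalty(w):
    # Squared-loss IRMv1 penalty: sum over environments of
    # |d/ds R_e(s*w)|^2 at s=1, which equals (2*E[(w.x - y)*(w.x)])^2.
    total = 0.0
    for X_e, y_e in envs:
        yhat = X_e @ w
        total += (2 * np.mean((yhat - y_e) * yhat)) ** 2
    return total

# ERM keeps a large weight on the spurious X2 even with n = 40000 pooled
# samples, and its IRM penalty is far from zero; the invariant predictor
# has a penalty near zero.
print(w_erm, irm_penalty(w_erm), irm_penalty(w_inv))
```

The gap between the two penalty values is what the IRM constraint exploits: driving the penalty toward zero pushes the learned weight off X2, at the cost of the sample complexity in ε analyzed above.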

2. RELATED WORKS

IRM based works. Following the original IRM work of Arjovsky et al. (2019), there have been several interesting works that build new methods inspired by IRM to address the OOD generalization problem; (Teney et al., 2020; Krueger et al., 2020; Ahuja et al., 2020; Chang et al., 2020; Mahajan et al., 2020) is an incomplete representative list. Arjovsky et al. (2019) prove OOD guarantees for linear models with access to infinite data from a finite number of environments. We generalize these results in several ways: we provide the first finite-sample analysis of IRM, and we characterize the impact of hypothesis class complexity, the number of environments, and the weight of the IRM penalty on the sample complexity and its distance from the OOD solution for linear and polynomial models.
Theory of domain generalization and domain adaptation. Following the seminal works of Ben-David et al. (2007; 2010), there have been many interesting works that build the theory of domain adaptation and generalization and construct new methods based on it; (Muandet et al., 2013; Ajakan et al., 2014; Zhao et al., 2019; Albuquerque et al., 2019; Li et al., 2017; Piratla et al., 2020; Matsuura & Harada, 2020; Deng et al., 2020; David et al., 2010; Pagnoni et al., 2018) is an incomplete representative list (see Redko et al. (2019) for further references). While many of these works develop bounds on the loss over the target domain using train data and unlabeled target data,




Figure 1: Causal Bayesian networks for different distribution shifts.


