EMPIRICAL OR INVARIANT RISK MINIMIZATION? A SAMPLE COMPLEXITY PERSPECTIVE

Abstract

Recently, invariant risk minimization (IRM) was proposed as a promising solution to address out-of-distribution (OOD) generalization. However, it is unclear when IRM should be preferred over the widely-employed empirical risk minimization (ERM) framework. In this work, we analyze both frameworks from the perspective of sample complexity, thus taking a firm step towards answering this important question. We find that depending on the type of data generation mechanism, the two approaches might have very different finite sample and asymptotic behavior. For example, in the covariate shift setting we see that the two approaches not only arrive at the same asymptotic solution, but also have similar finite sample behavior with no clear winner. For other distribution shifts, however, such as those involving confounders or anti-causal variables, the two approaches arrive at different asymptotic solutions: IRM is guaranteed to be close to the desired OOD solutions in the finite sample regime for polynomial generative models, while ERM is biased even asymptotically. We further investigate how different factors (the number of environments, the complexity of the model, and the IRM penalty weight) impact the sample complexity of IRM in relation to its distance from the OOD solutions.

1. INTRODUCTION

A recent study shows that models trained to detect COVID-19 from chest radiographs rely on spurious factors, such as the source of the data, rather than the lung pathology (DeGrave et al., 2020). This is just one of many alarming examples of spurious correlations failing to hold outside a specific training distribution. In one commonly cited example, Beery et al. (2018) trained a convolutional neural network (CNN) to classify camels and cows. In the training data, most pictures of cows had green pastures, while most pictures of camels were in the desert. The CNN picked up the spurious correlation and associated green pastures with cows, thus failing to classify cows on beaches. Recently, Arjovsky et al. (2019) proposed a framework called invariant risk minimization (IRM) to address the problem of models inheriting spurious correlations. They showed that when data is gathered from multiple environments, one can learn to exploit invariant causal relationships rather than relying on varying spurious relationships, thus learning robust predictors. More recent work suggests that empirical risk minimization (ERM) is still state-of-the-art on many problems requiring OOD generalization (Gulrajani & Lopez-Paz, 2020). This gives rise to a fundamental question: when is IRM better than ERM (and vice versa)? In this work, we seek to answer this question through a systematic comparison of the sample complexity of the two approaches under different types of train and test distributional mismatches. The distribution shifts P_train(X, Y) ≠ P_test(X, Y) that we consider, informally stated, satisfy an invariance condition: there exists a representation Φ* of the covariates such that P_train(Y | Φ*(X)) = P_test(Y | Φ*(X)) = P(Y | Φ*(X)). A special case occurs when Φ* is the identity, i.e., P_train(X) ≠ P_test(X) but P_train(Y | X) = P_test(Y | X); such a shift is known as covariate shift (Gretton et al., 2009).
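To make the IRM idea concrete, the practical IRMv1 objective of Arjovsky et al. (2019) augments the pooled empirical risk with a penalty measuring how far each environment's risk is from being stationary under a scalar rescaling of the predictor. The sketch below (a minimal illustration with squared loss and a linear predictor, not the paper's implementation; the function name and setup are our own) computes that objective:

```python
import numpy as np

def irmv1_objective(envs, w, lam):
    """ERM risk plus the IRMv1 invariance penalty for a linear predictor.

    envs : list of (X, y) pairs, one per training environment
    w    : weight vector of the linear predictor
    lam  : penalty weight (lambda)
    """
    risk, penalty = 0.0, 0.0
    for X, y in envs:
        pred = X @ w
        resid = pred - y                 # prediction error in this environment
        risk += np.mean(resid ** 2)      # squared-loss risk R_e(w)
        # IRMv1 penalty: squared gradient of R_e(s * w) w.r.t. the scalar
        # multiplier s, evaluated at s = 1
        grad_s = 2 * np.mean(resid * pred)
        penalty += grad_s ** 2
    return risk + lam * penalty
```

A predictor that uses only invariant features has (near-)zero gradient in every environment, so the penalty vanishes; a predictor leaning on environment-varying spurious features pays a penalty that grows with lam.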
In many other settings Φ* may not be the identity (denoted I); examples include settings with confounders or anti-causal variables (Pearl, 2009), where covariates appear spuriously correlated with the label and P_train(Y | X) ≠ P_test(Y | X). We use causal Bayesian networks to illustrate these shifts in Figure 1. Suppose X^e = [X^e_1, X^e_2] represents the image, where X^e_1 is the shape of the animal and X^e_2 is the background color, Y^e is the label of the animal, and e is the index of the environment/domain. In Figure 1a, X^e_2 is independent of (Y^e, X^e_1); this represents the covariate shift setting.
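The anti-causal case can be simulated in a few lines. In the hypothetical sketch below (all numbers are illustrative, not from the paper), the label Y^e is generated from the causal feature X^e_1, while the spurious feature X^e_2 is generated from Y^e with environment-dependent noise; pooling environments and running ordinary least squares (plain ERM) yields a predictor that places substantial weight on the spurious coordinate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_env(n, sigma_e):
    """Anti-causal environment: X1 -> Y -> X2, with env-dependent noise on X2."""
    x1 = rng.normal(size=n)                # causal feature (e.g., animal shape)
    y = x1 + 0.5 * rng.normal(size=n)      # label generated from the causal feature
    x2 = y + sigma_e * rng.normal(size=n)  # spurious feature (e.g., background color)
    return np.column_stack([x1, x2]), y

# pool two training environments that differ only in the spurious noise level
Xa, ya = sample_env(5000, 0.1)
Xb, yb = sample_env(5000, 0.5)
X = np.vstack([Xa, Xb])
y = np.concatenate([ya, yb])

# ERM = pooled least squares; it loads on the spurious coordinate x2
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_erm)
```

If a test environment reverses or removes the Y -> X2 dependence, this ERM solution degrades, whereas the invariant predictor (weight on x1 only) is unaffected; this is the asymptotic bias of ERM referred to in the abstract.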

