MALIGN OVERFITTING: INTERPOLATION CAN PROVABLY PRECLUDE INVARIANCE

Abstract

Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e., interpolate) the training data. This suggests that the phenomenon of "benign overfitting," in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work, we provide a theoretical justification for these observations. We prove that, even in the simplest of settings, any interpolating learning rule (with an arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that, in the same setting, successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.

1. INTRODUCTION

Modern machine learning applications often call for models which are not only accurate, but which are also robust to distribution shifts or satisfy fairness constraints. For example, we might wish to avoid using hospital-specific traces in X-ray images (DeGrave et al., 2021; Zech et al., 2018), as they rely on spurious correlations that will not generalize to a new hospital, or we might seek "Equal Opportunity" models attaining similar error rates across protected demographic groups, e.g., in the context of loan applications (Byanjankar et al., 2015; Hardt et al., 2016). A developing paradigm for fulfilling such requirements is learning models that satisfy some notion of invariance (Peters et al., 2016; 2017) across environments or sub-populations. For example, in the X-ray case, spurious correlations can be formalized as relationships between a feature and a label which vary across hospitals (Zech et al., 2018). Equal Opportunity (Hardt et al., 2016) can be expressed as a statistical constraint on the outputs of the model, where the false negative rate is invariant to membership in a protected group. Many techniques for learning invariant models have been proposed, including penalties that encourage invariance (Arjovsky et al., 2019; Krueger et al., 2021; Veitch et al., 2021; Wald et al., 2021; Puli et al., 2021; Makar et al., 2022; Rame et al., 2022; Kaur et al., 2022), data re-weighting (Sagawa et al., 2020a; Wang et al., 2021; Idrissi et al., 2022), causal graph analysis (Subbaswamy et al., 2019; 2022), and more (Ahuja et al., 2020). While the invariance paradigm holds promise for delivering robust and fair models, many current invariance-inducing methods fail to improve over naive approaches.
This is especially noticeable when these methods are used with overparameterized deep models capable of interpolating, i.e., perfectly fitting the training data (Gulrajani & Lopez-Paz, 2021; Dranker et al., 2021; Guo et al., 2022; Zhou et al., 2022; Menon et al., 2021; Veldanda et al., 2022; Cherepanova et al., 2021). Existing theory explains why overparameterization hurts invariance for standard interpolating learning rules, such as empirical risk minimization and max-margin classification (Sagawa et al., 2020b; Nagarajan et al., 2021; D'Amour et al., 2022), and also why reweighting and some types of distributionally robust optimization face challenges when used with overparameterized models (Byrd & Lipton, 2019; Sagawa et al., 2020a). In contrast, training overparameterized models to interpolate the training data typically results in good in-distribution generalization, and such "benign overfitting" (Kini et al., 2021; Wang et al., 2021) is considered a key characteristic of modern deep learning (Cao et al., 2021; Wang & Thrampoulidis, 2021; Shamir, 2022). Consequently, a number of works attempt to extend benign overfitting to robust or fair generalization by designing new interpolating learning rules (Cao et al., 2019; Kini et al., 2021; Wang et al., 2021; Lu et al., 2022). In this paper, we demonstrate that such attempts face a fundamental obstacle, because all interpolating learning rules (and not just maximum-margin classifiers) fail to produce invariant models in certain high-dimensional settings where invariant learning (without interpolation) is possible. This does not occur because there are no invariant models that separate the data, but because interpolating learning rules cannot find them. In other words, beyond identically-distributed test sets, overfitting is no longer benign.
More concretely, we consider linear classification in a basic overparameterized Gaussian mixture model with invariant "core" features as well as environment-dependent "spurious" features, similar to models used in previous work to gain insight into robustness and invariance (Schmidt et al., 2018; Rosenfeld et al., 2021; Sagawa et al., 2020b). We show that any learning rule producing a classifier that separates the data with non-zero margin must necessarily rely on the spurious features in the data, and therefore cannot be invariant. Moreover, in the same setting we analyze a simple two-stage algorithm that can find accurate and nearly invariant linear classifiers, i.e., with almost no dependence on the spurious feature. Thus, we establish a separation between the level of invariance attained by interpolating and non-interpolating learning rules. We believe that learning rules which fail in the simple overparameterized linear classification setting we consider are not likely to succeed in more complicated, real-world settings. Therefore, our analysis provides useful guidance for future research into robust and fair machine learning models, as well as theoretical support for the recent success of non-interpolating robust learning schemes (Rosenfeld et al., 2022; Veldanda et al., 2022; Kirichenko et al., 2022; Menon et al., 2021; Kumar et al., 2022; Zhang et al., 2022; Idrissi et al., 2022; Chatterji et al., 2022).

Paper organization. The next section formally states our full result (Theorem 1). In Section 3 we outline the arguments leading to the negative part of Theorem 1, i.e., the failure of interpolating classifiers to be invariant in our model. In Section 4 we establish the positive part of Theorem 1, by providing and analyzing a non-interpolating algorithm that, in our model, achieves low robust error.
We validate our theoretical findings with simulations and experiments on the Waterbirds dataset in Section 5, and conclude with a discussion of additional related results and directions for future research in Section 6.

2. STATEMENT OF MAIN RESULT

2.1. PRELIMINARIES

Data model. Our analysis focuses on learning linear models over covariates $x$ distributed as a mixture of two Gaussian distributions corresponding to the label $y$.

Definition 1 (Environment). An environment is a distribution parameterized by $(\mu_c, \mu_s, d, \sigma, \theta)$, where $\theta \in [-1, 1]$ and $\mu_c, \mu_s \in \mathbb{R}^d$ satisfy $\mu_c \perp \mu_s$, with samples generated according to
$$P_\theta(y) = \mathrm{Unif}\{-1, 1\}, \qquad P_\theta(x \mid y) = \mathcal{N}\left(y \mu_c + y \theta \mu_s,\; \sigma^2 I\right). \tag{1}$$

Our goal is to find a (linear) classifier that predicts $y$ from $x$ and is robust to the value of $\theta$ (we discuss the specific robustness metric below). To do so, the classifier will need to have significant inner product with the "core" signal component $\mu_c$ and be approximately orthogonal to the "spurious" component $\mu_s$. We focus on learning problems where we are given access to samples from two environments that share all their parameters other than $\theta$, as we define next. We illustrate our setting with Figure 3 in Appendix A.

Definition 2 (Linear Two Environment Problem). In a Linear Two Environment Problem we have datasets $S_1 = \{x_i^{(1)}, y_i^{(1)}\}_{i=1}^{N_1}$ and $S_2 = \{x_i^{(2)}, y_i^{(2)}\}_{i=1}^{N_2}$ of sizes $N_1, N_2$ drawn from $P_{\theta_1}$ and $P_{\theta_2}$, respectively. A learning algorithm is a (possibly randomized) mapping from the tuple $(S_1, S_2)$ to a linear classifier $w \in \mathbb{R}^d$. We let $S = \{x_i, y_i\}_{i=1}^{N}$ denote the dataset pooled from $S_1$ and $S_2$, where $N = N_1 + N_2$. Finally, we let $r_c := \|\mu_c\|$ and $r_s := \|\mu_s\|$.

We study settings where $\theta_1, \theta_2$ are fixed and $d$ is large compared to $N$, i.e., the overparameterized regime. We refer to the two distributions $P_{\theta_e}$ for $e \in \{1, 2\}$ as "training environments", following Peters et al. (2016); Arjovsky et al. (2019). In the context of Out-of-Distribution (OOD) generalization, environments correspond to different experimental conditions, e.g., collection of medical data in two hospitals. In a fairness context, we may think of these distributions as subpopulations (e.g., demographic groups).foot_0 While these are different applications that require specialized methods, the underlying formalism of solutions is often similar (see, e.g., Creager et al., 2021, Table 1), where we wish to learn a classifier that in one way or another is invariant to the environment variable.

Robust performance metric.
An advantage of the simple model defined above is that many of the common invariance criteria boil down to the same mathematical constraint: learning a classifier that is orthogonal to $\mu_s$, which induces a spurious correlation between the environment and the label. These include Equalized Odds (Hardt et al., 2016), conditional distribution matching (Li et al., 2018), calibration on multiple subsets of the data (Hébert-Johnson et al., 2018; Wald et al., 2021), Risk Extrapolation (Krueger et al., 2021), and CVaR fairness (Williamson & Menon, 2019). In terms of predictive accuracy, the goal of learning a linear model that aligns with $\mu_c$ (the invariant part of the data generating process for the label) and is orthogonal to $\mu_s$ coincides with providing guarantees on the robust error, i.e., the error when data is generated with values of $\theta$ that are different from the $\theta_1, \theta_2$ used to generate the training data.foot_1

Definition 3 (Robust error). The robust error of a linear classifier $w \in \mathbb{R}^d$ is $\max_{\theta \in [-1, 1]} \epsilon_\theta(w)$, where
$$\epsilon_\theta(w) := \mathbb{E}_{x, y \sim P_\theta}\left[\mathrm{sign}(\langle w, x \rangle) \neq y\right]. \tag{2}$$

Normalized margin. We study whether algorithms that perfectly fit (i.e., interpolate) their training data can learn models with low robust error. Ideally, we would like to give a result on all classifiers that attain training error zero in terms of the 0-1 loss. However, the inherent discontinuity of this loss would make any such statement sensitive to instabilities and pathologies. For instance, if we do not limit the capacity of our models, we can turn any classifier into an interpolating one by adding "special cases" for the training points, yet intuitively this is not the type of interpolation that we would like to study. To avoid such issues, we replace the 0-1 loss with a common continuous surrogate, the normalized margin, and require it to be strictly positive.

Definition 4 (Normalized margin).
Let $\gamma > 0$. We say a classifier $w \in \mathbb{R}^d$ separates the set $S = \{x_i, y_i\}_{i=1}^{N}$ with normalized margin $\gamma$ if for every $(x_i, y_i) \in S$,
$$\frac{y_i \langle w, x_i \rangle}{\|w\|} \geq \gamma \sqrt{\sigma^2 d}.$$
The $\sqrt{\sigma^2 d}$ scaling of $\gamma$ is roughly proportional to $\|x\|$ under our data model in Equation (1), and keeps the value of $\gamma$ comparable across growing values of $d$.
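To make Definitions 1, 2 and 4 concrete, here is a small NumPy sketch of the sampling process and the normalized margin. The helper names and all parameter values ($d$, $r_c$, $r_s$, $\sigma$, sample sizes) are our own illustrative choices, not values from the paper:

```python
import numpy as np

def sample_environment(mu_c, mu_s, theta, sigma, n, rng):
    """Draw n samples from P_theta: y ~ Unif{-1,1}, x|y ~ N(y*mu_c + y*theta*mu_s, sigma^2 I)."""
    d = mu_c.shape[0]
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * (mu_c + theta * mu_s) + sigma * rng.standard_normal((n, d))
    return x, y

def normalized_margin(w, x, y, sigma):
    """min_i y_i <w, x_i> / (||w|| * sqrt(sigma^2 d)), as in Definition 4."""
    d = x.shape[1]
    return np.min(y * (x @ w)) / (np.linalg.norm(w) * np.sqrt(sigma**2 * d))

rng = np.random.default_rng(0)
d = 1000
mu_c = np.zeros(d); mu_c[0] = 1.0   # "core" direction, r_c = 1
mu_s = np.zeros(d); mu_s[1] = 2.0   # "spurious" direction, r_s = 2, orthogonal to mu_c
x1, y1 = sample_environment(mu_c, mu_s, theta=1.0, sigma=0.1, n=50, rng=rng)
x2, y2 = sample_environment(mu_c, mu_s, theta=0.0, sigma=0.1, n=50, rng=rng)
```

With this weak noise level, the invariant classifier $w = \mu_c$ separates the pooled sample with positive normalized margin while ignoring $\mu_s$ entirely.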

2.2. MAIN RESULT

Equipped with the necessary definitions, we now state and discuss our main result.

Theorem 1. For any sample sizes $N_1, N_2 > 65$, margin lower bound $\gamma \leq \frac{1}{4\sqrt{N}}$, target robust error $\epsilon > 0$, and coefficients $\theta_1 = 1$ and $\theta_2 \geq -\frac{\gamma N_1}{\sqrt{288 N_2}}$, there exist parameters $r_c, r_s > 0$, $d > N$, and $\sigma > 0$ such that the following holds for the Linear Two Environment Problem (Definition 2) with these parameters.

1. Invariance is attainable. Algorithm 1 maps $(S_1, S_2)$ to a linear classifier $w$ such that with probability at least $\frac{99}{100}$ (over the draw of $S$), the robust error of $w$ is less than $\epsilon$.

2. Interpolation is attainable.

With probability at least $\frac{99}{100}$, the estimator $w^{\mathrm{mean}} = \frac{1}{N} \sum_{i \in [N]} y_i x_i$ separates $S$ with normalized margin (Definition 4) greater than $\frac{1}{4\sqrt{N}}$.

3. Interpolation precludes invariance. Given $\mu_c$ uniformly distributed on the sphere of radius $r_c$ and $\mu_s$ uniformly distributed on a sphere of radius $r_s$ in the subspace orthogonal to $\mu_c$, let $w$ be any classifier learned from $(S_1, S_2)$ as per Definition 2. If $w$ separates $S$ with normalized margin $\gamma$, then with probability at least $\frac{99}{100}$ (over the draw of $\mu_c$, $\mu_s$, and the sample), the robust error of $w$ is at least $\frac{1}{2}$.

Theorem 1 shows that if a learning algorithm for overparameterized linear classifiers always separates its training data, then there exist natural settings for which the algorithm completely fails to learn a robust classifier, and will therefore fail on multiple other invariance and fairness objectives. Furthermore, in the same setting this failure is avoidable, as there exists an algorithm (that necessarily does not always separate its training data) which successfully learns an invariant classifier. This result has deep implications for theoreticians attempting to prove finite-sample invariant learning guarantees: it shows that, in the fundamental setting of linear classification, no interpolating algorithm can have guarantees as strong as those of non-interpolating algorithms such as Algorithm 1. Importantly, Theorem 1 requires interpolating invariant classifiers to exist, and shows that these classifiers are information-theoretically impossible to learn. In particular, the first part of the theorem implies that the Bayes-optimal invariant classifier $w = \mu_c$ has robust test error at most $\epsilon$. Therefore, for all $\epsilon < \frac{1}{100 N}$ we have that $\mu_c$ interpolates $S$ with probability greater than $\frac{99}{100}$. Furthermore, a short calculation (see Appendix C.1) shows that (for $r_c$, $r_s$, $d$ and $\sigma$ satisfying Theorem 1) the normalized margin of $\mu_c$ is $\Omega\big((N + \sqrt{N_2}/\gamma)^{-1/2}\big)$.
However, we prove that, due to the high-dimensional nature of the problem, no algorithm can use $(S_1, S_2)$ to reliably distinguish the invariant interpolator from other interpolators with similar or larger margin. This learnability barrier strongly leverages our random choice of $\mu_c, \mu_s$, without which the (fixed) vector $\mu_c$ would be a valid learning output. We establish Theorem 1 with three propositions, each corresponding to an enumerated claim in the theorem: (1) Proposition 2 (in §4) establishes that invariance is attainable, (2) Proposition 3 (Appendix C) establishes that interpolation is attainable, and (3) Proposition 1 (in §3) establishes that interpolation precludes invariance. We choose to begin with the latter proposition since it is the main conceptual and technical contribution of our paper. Conversely, Proposition 3 is an easy byproduct of the developments leading up to Proposition 1, and we defer it to the appendix. With Propositions 1, 2 and 3 in hand, the proof of Theorem 1 simply consists of choosing the free parameters in the theorem ($r_c$, $r_s$, $d$ and $\sigma$) based on these propositions such that all the claims in the theorem hold simultaneously. For convenience we take $\sigma^2 = 1/d$. Then (ignoring constant factors) we pick $r_s^2 \propto \frac{1}{N}$ and $r_c^2 \propto r_s^2 / \big(1 + \frac{\sqrt{N_2}}{\gamma N_1}\big)$ in order to satisfy the requirements in Propositions 1 and 3. Finally, we take $d$ to be sufficiently large so as to satisfy the remaining requirements, resulting in
$$d \propto \max\left\{ N^2,\; \frac{N_2}{N_1^2 r_c^2},\; \frac{(Q^{-1}(\epsilon))^2}{N_{\min} r_c^4},\; \frac{1}{N_{\min}^2 r_c^4} \right\},$$
where $N_{\min} = \min\{N_1, N_2\}$ and $Q$ is the Gaussian tail function (see Appendix E for the full proof). We conclude this section with remarks on the range of parameters under which Theorem 1 holds. The impossibility results in Theorem 1 are strongest when $N_2$ is smaller than $N_1^2 \gamma^2$.
In particular, when $N_2 \leq N_1^2 \gamma^2 / 288$, our result holds for all $\theta_2 \in [-1, 1]$, and moreover the core and spurious signal strengths $r_c$ and $r_s$ can be chosen to be of the same order. The ratio $N_2 / (N_1^2 \gamma^2)$ is small either when one group is under-represented (i.e., $N_2 \ll N_1$) or when considering large-margin classifiers (i.e., $\gamma$ of the order $1/\sqrt{N}$). Moreover, unlike prior work on barriers to robustness (e.g., Sagawa et al., 2020b; Nagarajan et al., 2021), our result continues to hold even for balanced data and arbitrarily low margin, provided $\theta_2$ is close to 0 and the core signal is sufficiently weaker than the spurious signal. Notably, the normalized margin $\gamma$ can be arbitrarily small, while the maximum achievable margin is always at least of the order of $\frac{1}{\sqrt{N}}$. Therefore, we believe that Theorem 1 essentially precludes any interpolating learning rule from being consistently invariant.
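Part 2 of the theorem (the mean estimator interpolates) is easy to probe numerically. The sketch below draws an overparameterized sample and checks that $w^{\mathrm{mean}}$ attains positive normalized margin on every training point; for simplicity we use a single environment with $\theta = 1$, and the parameter values are our own illustration rather than the ones constructed in the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 5000, 40, 1.0          # d >> n: the overparameterized regime
mu_c = np.zeros(d); mu_c[0] = 1.0    # weak core signal relative to the noise

y = rng.choice([-1, 1], size=n)
x = y[:, None] * mu_c + sigma * rng.standard_normal((n, d))

# The estimator from part 2 of Theorem 1: w_mean = (1/N) sum_i y_i x_i.
w_mean = (y[:, None] * x).mean(axis=0)

# Normalized margins of Definition 4. In high dimension the noise vectors are
# nearly orthogonal, so each point's own contribution ~||x_i||^2 / N dominates
# the cross terms and every margin is positive, i.e. w_mean interpolates.
margins = y * (x @ w_mean) / (np.linalg.norm(w_mean) * np.sqrt(sigma**2 * d))
```

Shrinking $d$ toward $n$ makes the cross terms comparable to $\|x_i\|^2 / N$ and the interpolation property disappears, matching the theorem's requirement that $d$ be large.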

3. INTERPOLATING MODELS CANNOT BE INVARIANT

In this section we prove the third claim in Theorem 1: for essentially any nonzero value of the normalized margin $\gamma$, there are instances of the Linear Two Environment Problem (Definition 2) where, with high probability, learning algorithms that return linear classifiers attaining normalized margin at least $\gamma$ must incur a large robust error. The following proposition formalizes the claim; we sketch the proof below and provide a full derivation in Appendix B.3.

Proposition 1. For $\sigma = 1/\sqrt{d}$ and $\theta_1 = 1$, there are universal constants $c_r \in (0, 1)$ and $C_d, C_r \in (1, \infty)$ such that, for any target normalized margin $\gamma$, coefficient $\theta_2 \geq -\frac{\gamma N_1}{\sqrt{288 N_2}}$, and failure probability $\delta \in (0, 1)$, if
$$\max\{r_s^2, r_c^2\} \leq \frac{c_r}{N}, \qquad \frac{r_s^2}{r_c^2} \geq C_r \left(1 + \frac{\sqrt{N_2}}{\gamma N_1}\right), \qquad \text{and} \qquad d \geq C_d \frac{N_2}{N_1^2 r_c^2} \log\frac{1}{\delta},$$
then with probability at least $1 - \delta$ over the drawing of $\mu_c$, $\mu_s$ and $(S_1, S_2)$ as described in Theorem 1, any $\hat{w} \in \mathbb{R}^d$ that is a measurable function of $(S_1, S_2)$ and separates the data with normalized margin larger than $\gamma$ has robust error at least 0.5.

Proof sketch. We begin by noting that, for any fixed $\theta$, the error of a linear classifier $w$ is
$$\epsilon_\theta(w) = Q\left(\frac{\langle w, \mu_c \rangle + \theta \langle w, \mu_s \rangle}{\sigma \|w\|}\right) = Q\left(\frac{\langle w, \mu_c \rangle}{\sigma \|w\|}\left(1 + \theta \frac{\langle w, \mu_s \rangle}{\langle w, \mu_c \rangle}\right)\right),$$
where $Q(t) := \mathbb{P}(\mathcal{N}(0, 1) > t)$ is the Gaussian tail function. Consequently, when $\langle w, \mu_s \rangle / \langle w, \mu_c \rangle \geq 1$ it is easy to see that $\epsilon_\theta(w) = 1/2$ for some $\theta \in [-1, 1]$, and therefore the robust error is at least $\frac{1}{2}$; we prove that $\langle w, \mu_s \rangle / \langle w, \mu_c \rangle \geq 1$ indeed holds with high probability under the proposition's assumptions. Our proof has two key parts: (a) restricting the set of classifiers to the linear span of the data, and (b) lower bounding the minimum value of $\langle w, \mu_s \rangle / \langle w, \mu_c \rangle$ for classifiers in that linear span.
For the first part of the proof we use the spherical distribution of $\mu_c$ and $\mu_s$ and concentration of measure to show that (with high probability) any component of $w$ chosen outside the linear span of $\{x_i\}_{i \in [N]}$ will have a negligible effect on the predictions of the classifier. To explain this fact, let $P_\perp$ denote the projection operator onto the orthogonal complement of the data, so that $P_\perp w$ is the component of $w$ orthogonal to the data and $\langle P_\perp w, \mu_c \rangle = \big\langle w, \frac{P_\perp \mu_c}{\|P_\perp \mu_c\|} \big\rangle \|P_\perp \mu_c\|$. Conditional on $(S_1, S_2)$ and the learning rule's random seed, the vector $P_\perp \mu_c / \|P_\perp \mu_c\|$ is uniformly distributed on a unit sphere of dimension $d - N$, while the vector $w$ is deterministic. Assuming without loss of generality that $\|w\| = 1$, concentration of measure on the sphere implies that $\big|\big\langle w, \frac{P_\perp \mu_c}{\|P_\perp \mu_c\|} \big\rangle\big|$ is (with high probability) bounded by roughly $1/\sqrt{d}$, and therefore $|\langle P_\perp w, \mu_c \rangle|$ is roughly of the order $r_c / \sqrt{d}$. For sufficiently large $d$ (as required by the proposition), this inner product is negligible, meaning that $\langle w, \mu_c \rangle$ is roughly the same as $\langle (I - P_\perp) w, \mu_c \rangle$, and $(I - P_\perp) w$ is in the span of the data. The same argument applies to $\mu_s$ as well. In the second part of the proof, we consider classifiers of the form $w = \sum_{i \in [N]} \alpha_i y_i x_i$ (which parameterizes the linear span of the data) and minimize $\langle w, \mu_s \rangle / \langle w, \mu_c \rangle$ over $\alpha \in \mathbb{R}^N$, subject to the constraint that $w$ has normalized margin of at least $\gamma$. To do so, we first use concentration of measure to argue that it is sufficient to lower bound $\sum_{i \in [N_1]} \alpha_i$ subject to the margin constraint and $\|w\|^2 \leq 1$, which is convex in $\alpha$; we obtain this lower bound by analyzing the Lagrange dual of the problem of minimizing $\sum_{i \in [N_1]} \alpha_i$ subject to these constraints. Overall, we show a high-probability lower bound on $\frac{\langle w, \mu_s \rangle}{\langle w, \mu_c \rangle}$ that (for sufficiently high dimensions) scales roughly as $\frac{r_s^2}{r_c^2} \big(1 + \frac{\sqrt{N_2}}{\gamma N_1}\big)^{-1}$, and therefore exceeds 1 under the proposition's assumptions.

Implication for invariance-inducing algorithms. Our proof implies that any interpolating algorithm should fail at learning invariant classifiers.
This alone does not necessarily imply that specific algorithms proposed in the literature for learning invariant classifiers fail, as they may not be interpolating. Yet our simulations in Section 5 show that several popular algorithms proposed for eliminating spurious features are indeed interpolating in the overparameterized regime. We also give a formal statement in Appendix G regarding the IRMv1 penalty (Arjovsky et al., 2019), showing that it is biased toward large margins when applied to separable datasets. Our results may seem discouraging for the development of invariance-inducing techniques using overparameterized models. It is natural to ask what type of methods can provably learn such models, which is the topic of the next section.

[Algorithm 1, Stage 2: learn $f_v(x) = \langle v_1 \cdot w_1 + v_2 \cdot w_2, x \rangle$ that solves: maximize $\sum_{(x, y) \in S^{\mathrm{post}}} y f_v(x)$ subject to $\|v\|_1 = 1$ and $f_v \in \mathcal{F}(S_1^{\mathrm{fine}}, S_2^{\mathrm{fine}})$.]
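The key concentration step in the sketch above, that a fixed unit vector has inner product of order $1/\sqrt{d}$ with a uniformly random direction, can be checked directly. The dimensions below are our own illustrative choices, not the proof's parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r_c = 20000, 1.0

# mu_c uniform on the sphere of radius r_c: normalize a standard Gaussian draw.
g = rng.standard_normal(d)
mu_c = r_c * g / np.linalg.norm(g)

# Any fixed unit vector, standing in for the direction of the out-of-span
# component P_perp w, which is deterministic conditional on the data.
w = np.zeros(d); w[0] = 1.0

# Concentration of measure: |<w, mu_c>| concentrates around r_c / sqrt(d),
# here roughly 0.007, which is negligible next to the order-r_c inner product
# an aligned classifier would have.
overlap = abs(np.dot(w, mu_c))
```

Doubling $d$ shrinks the typical overlap by a factor of $\sqrt{2}$, which is why the proposition's lower bound on $d$ makes the out-of-span component irrelevant.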

4. A PROVABLY INVARIANT OVERPARAMETERIZED ESTIMATOR

We now turn to propose and analyze an algorithm (Algorithm 1) that provably learns an overparameterized linear model with good robust accuracy in our setup. Our approach is a two-stage learning procedure that is conceptually similar to some recently proposed methods (Rosenfeld et al., 2022; Veldanda et al., 2022; Kirichenko et al., 2022; Menon et al., 2021; Kumar et al., 2022; Zhang et al., 2022). In Section 5 we validate our algorithm in simulations and on the Waterbirds dataset (Sagawa et al., 2020a), but we leave a thorough empirical evaluation of the techniques described here to future work. Let us describe the operation of Algorithm 1. First, we evenlyfoot_3 split the data from each environment into the sets $S_e^{\mathrm{train}}$ and $S_e^{\mathrm{fine}}$. In the first stage we use $S_1^{\mathrm{train}}$ and $S_2^{\mathrm{train}}$ to compute per-environment estimators $w_1$ and $w_2$; in the second stage we fit a two-dimensional classifier over the representation they induce, subject to an invariance constraint computed on $S_1^{\mathrm{fine}}$ and $S_2^{\mathrm{fine}}$. Crucially, the invariance penalty is only used in the second stage, in which we are no longer in the overparameterized regime, since we are only fitting a two-dimensional classifier. In this way we overcome the negative result from Section 3. While our approach is general and can handle a variety of invariance notions (we discuss some of them in Appendix F), we analyze the algorithm under the Equal Opportunity (EOpp) criterion (Hardt et al., 2016). Namely, for a model $f : \mathbb{R}^d \to \mathbb{R}$ we write
$$\mathcal{F}(S_1^{\mathrm{fine}}, S_2^{\mathrm{fine}}) = \left\{ f : \hat{T}_1(f) = \hat{T}_2(f) \right\}, \qquad \text{where } \hat{T}_e(f) := \frac{4}{N_e} \sum_{(x, y) \in S_e^{\mathrm{fine}} : y = 1} f(x).$$

This is the empirical version of the constraint

$$\mathbb{E}_{P_{\theta_1}}[f(x) \mid y = 1] = \mathbb{E}_{P_{\theta_2}}[f(x) \mid y = 1].$$
From a fairness perspective (e.g., thinking of a loan application), this constraint ensures that the "qualified" members (i.e., those with $y = 1$) of each group receive similar predictions, on average over the entire group. We now turn to providing conditions under which Algorithm 1 successfully learns an invariant predictor. The full proof of the following proposition can be found in Section D.1 of the appendix. While we do not consider the following proposition very surprising, the fact that it gives a finite-sample learning guarantee means it does not directly follow from existing work (discussed in §6 below), which mostly assumes infinite sample size.

Figure 1: Numerical validation of our theoretical claims. Invariance-inducing methods improve robust accuracy compared to ERM for low values of d, but their ability to do so is diminished as d grows (top plot) and they enter the interpolation regime, as seen in the bottom plot for d > 100. Algorithm 1 learns robust predictors as d grows and does not interpolate.

Proposition 2. Consider the Linear Two Environment Problem (Definition 2), and further suppose that $|\theta_1 - \theta_2| > 0.1$.foot_4 There exist universal constants $C_p, C_c, C_s \in (1, \infty)$ such that the following holds for every target robust error $\epsilon > 0$ and failure probability $\delta \in (0, 1)$. If $N_{\min} := \min\{N_1, N_2\} \geq C_p \log(4/\delta)$,foot_5
$$r_s^2 \geq C_s \sigma^2 \sqrt{\log\tfrac{68}{\delta}} \cdot \frac{\sqrt{d}}{N_{\min}}, \qquad r_c^2 \geq C_c \sigma^2 \sqrt{\log\tfrac{68}{\delta}} \cdot \max\left\{ Q^{-1}(\epsilon) \frac{\sqrt{d}}{N_{\min}},\; \frac{\sqrt{d}}{N_{\min}},\; \frac{r_s^2}{N_{\min} r_c^2} \right\},$$
and $d \geq \log\tfrac{68}{\delta}$, then, with probability at least $1 - \delta$ over the draw of the training data and the split of the data between the two stages of learning, the robust error of the model returned by Algorithm 1 does not exceed $\epsilon$.

Proof sketch. Writing down the error of $f_v$, whose learned direction is $v_1 w_1 + v_2 w_2$, under $P_\theta$, it can be shown that to obtain the desired bound on the robust error of the classifier returned by Algorithm 1, we must upper bound the ratio
$$\frac{(v_1^\star \theta_1 + v_2^\star \theta_2) \|\mu_s\|^2 + \langle \mu_s, v_1^\star \bar{n}_1 + v_2^\star \bar{n}_2 \rangle}{(v_1^\star + v_2^\star) \|\mu_c\|^2 + \langle \mu_c, v_1^\star \bar{n}_1 + v_2^\star \bar{n}_2 \rangle},$$
where $\bar{n}_e$ is the mean of the Gaussian noise vectors in environment $e$, and $v_1^\star$ and $v_2^\star$ are the solutions to the optimization problem in Stage 2 of Algorithm 1. The terms involving inner products with the noise are zero-mean and can be bounded using standard Gaussian concentration arguments. Therefore, the main effort of the proof is upper bounding
$$\frac{v_1^\star \theta_1 + v_2^\star \theta_2}{v_1^\star + v_2^\star} \cdot \frac{\|\mu_s\|^2}{\|\mu_c\|^2}.$$
To this end, we leverage the EOpp constraint. The population version of this constraint (corresponding to infinite $N_1$ and $N_2$) implies that $v_1^\star \theta_1 + v_2^\star \theta_2 = 0$. For finite sample sizes, we use standard Gaussian concentration and the Hanson-Wright inequality to show that the empirical EOpp constraint implies that $|v_1^\star \theta_1 + v_2^\star \theta_2|$ goes to zero as the sample sizes increase. Furthermore, we argue that $|v_1^\star + v_2^\star| \geq |\theta_1 - \theta_2| / 2$, implying that, for appropriately large sample sizes, the above ratio indeed goes to zero.
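Because Stage 2 optimizes over $v \in \mathbb{R}^2$ subject to a single linear EOpp constraint and $\|v\|_1 = 1$, the feasible set is a line through the origin and only the sign of $v$ is free, so the stage can be solved in closed form. The NumPy sketch below implements both stages under this observation; all parameter values, helper names, and the closed-form robust-error formula parameters are our own illustrative choices:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
d, n, sigma = 2000, 500, 1.0
theta1, theta2 = 1.0, 0.0
mu_c = np.zeros(d); mu_c[0] = 3.0    # core signal, r_c = 3
mu_s = np.zeros(d); mu_s[1] = 6.0    # stronger spurious signal, r_s = 6

def sample(theta, n):
    y = rng.choice([-1, 1], size=n)
    return y[:, None] * (mu_c + theta * mu_s) + sigma * rng.standard_normal((n, d)), y

def robust_error(w):
    # eps_theta(w) = Q((<w,mu_c> + theta<w,mu_s>) / (sigma ||w||)); the argument
    # is linear in theta, so the max over theta in [-1,1] is at an endpoint.
    q = lambda t: 0.5 * (1 - erf(t / sqrt(2)))
    a, b = np.dot(w, mu_c), np.dot(w, mu_s)
    return max(q((a + t * b) / (sigma * np.linalg.norm(w))) for t in (-1.0, 1.0))

(x1, y1), (x2, y2) = sample(theta1, n), sample(theta2, n)

# Stage 1: per-environment mean classifiers on the first half of each dataset.
w1 = (y1[:n // 2, None] * x1[:n // 2]).mean(axis=0)
w2 = (y2[:n // 2, None] * x2[:n // 2]).mean(axis=0)

# Stage 2: the empirical EOpp constraint T1(f_v) = T2(f_v) is linear in v,
# <v, a> = 0, with a_k the gap in mean positive-class scores of w_k.
def T(x, y, w):
    return (x[y == 1] @ w).mean()

a = np.array([T(x1[n // 2:], y1[n // 2:], w) - T(x2[n // 2:], y2[n // 2:], w)
              for w in (w1, w2)])
v = np.array([a[1], -a[0]])
v /= np.abs(v).sum()                 # enforce ||v||_1 = 1
x_pool = np.vstack([x1[n // 2:], x2[n // 2:]])
y_pool = np.concatenate([y1[n // 2:], y2[n // 2:]])
if (y_pool * (x_pool @ (v[0] * w1 + v[1] * w2))).sum() < 0:
    v = -v                           # pick the sign maximizing sum_i y_i f_v(x_i)
w_alg = v[0] * w1 + v[1] * w2

# Interpolating baseline: the pooled mean estimator, which inherits the
# spurious correlation and suffers high robust error.
w_pool = (np.concatenate([y1, y2])[:, None] * np.vstack([x1, x2])).mean(axis=0)
```

With these settings the constrained second stage nearly cancels the $\mu_s$ component, so `w_alg` attains small robust error while `w_pool` is badly wrong for the worst-case $\theta$.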

5. EMPIRICAL VALIDATION

The empirical observations that motivated this work can be found across the literature. We therefore focus our simulations on validating the theoretical results in our simplified model. We also evaluate Algorithm 1 on the Waterbirds dataset, where the goal is not to show state-of-the-art results, but rather to observe whether our claims hold beyond the Linear Two Environment Problem.

5.1. SIMULATIONS

Setup. We generate data as described in Theorem 1 with two environments, where $\theta_1 = 1$ and $\theta_2 = 0$ (see Figure 4 in the appendix for results of the same simulation with $\theta_2 = \frac{1}{2}$). We further fix $r_c = 1$ and $r_s = 2$, while $N_1 = 800$ and $N_2 = 100$. We then take growing values of $d$, while adjusting $\sigma$ so that $(r_c / \sigma)^2 \propto \sqrt{d} / N$.foot_6 For each value of $d$ we train linear models with IRMv1 (Arjovsky et al., 2019), VREx (Krueger et al., 2021), MMD (Li et al., 2018), CORAL (Sun & Saenko, 2016), and GroupDRO (Sagawa et al., 2020a), as implemented in the DomainBed package (Gulrajani & Lopez-Paz, 2021). We also train a classifier with the logistic loss to minimize empirical error (ERM), and apply Algorithm 1, where the "post-processing" stage trains a linear model over the two-dimensional representation using the VREx penalty to induce invariance. We repeat this for 15 random seeds for drawing $\mu_c$, $\mu_s$, and the training set.

Evaluation and results. We compare the robust accuracy and the train-set accuracy of the learned classifiers as $d$ grows. First, we observe that all methods except for Algorithm 1 attain perfect training accuracy for large enough $d$, i.e., they interpolate. We further note that while invariance-inducing methods give a desirable effect in low dimensions (the non-interpolating regime), significantly improving the robust error over ERM, they become aligned with ERM in terms of robust accuracy as they go deeper into the interpolation regime (indeed, IRM essentially coincides with ERM for larger $d$). This is an expected outcome considering our findings in Section 3, as we set $N_1$ here to be considerably larger than $N_2$.

5.2. WATERBIRDS DATASET

We evaluate Algorithm 1 on the Waterbirds dataset (Sagawa et al., 2020a), which has been previously used to evaluate the fairness and robustness of deep learning models.

Setup. Waterbirds is a synthetically created dataset containing images of water- and land-birds overlaid on water and land backgrounds. Most of the waterbirds (landbirds) appear on water (land) backgrounds, with a smaller minority of waterbirds (landbirds) appearing on land (water) backgrounds. We set up the problem following previous work (Sagawa et al., 2020b; Veldanda et al., 2022), where a logistic regression model is trained over random features extracted from a fixed pretrained ResNet-18. Please see Appendix H for details.

Fairness. We use the image background type (water or land) as the sensitive feature, denoted $A$, and consider the fairness desideratum of Equal Opportunity (Hardt et al., 2016), i.e., the false negative rate (FNR) should be similar for both groups. Toward this, we use the MinDiff penalty term (Prost et al., 2019).

Evaluation. We compare the following methods: (1) Baseline: learning a linear classifier $w$ by minimizing $L_p + \lambda \cdot L_M$, where $L_p$ is the standard binary cross-entropy loss and $L_M$ is the MinDiff penalty; (2) Algorithm 1: in the first stage, we learn group-specific linear classifiers $w_0, w_1$ by minimizing $L_p$ on the examples from $A = 0$ and $A = 1$, respectively. In the second stage we learn $v \in \mathbb{R}^2$ by minimizing $L_p + \lambda \cdot L_M$ on the entire dataset, where the new representation of an example $x$ is $[\langle w_0, x \rangle, \langle w_1, x \rangle] \in \mathbb{R}^2$.foot_7

Results. Our main objective is to understand the effect of the fairness penalty. Toward this, for each method we compare both the test error and the test FNR gap when using either $\lambda = 0$ (no regularization) or $\lambda = 5$. The results are summarized in Figure 2. We can see that for the baseline approach, the fairness penalty successfully reduces the FNR gap when the classifier is not interpolating.
However, as our negative result predicts, and as previously reported in Veldanda et al. (2022), the fairness penalty becomes ineffective in the interpolating regime ($d \geq 1000$). On the other hand, for our two-phase algorithm, the addition of the fairness penalty does reduce the FNR gap, with an average relative improvement of 20%; crucially, this improvement is independent of $d$.

Figure 2: Results for the Waterbirds dataset (Sagawa et al., 2020a). Top row: train error (left) and test error (right). The train error is used to identify the interpolation threshold for the baseline method (approximately d = 1000). Bottom row: comparison of the FNR gap on the test set (left), with zoomed-in versions on the right.

6. DISCUSSION AND ADDITIONAL RELATED WORK

In terms of formal results, most existing guarantees about invariant learning algorithms rely on the assumption that infinite training data is available (Arjovsky et al., 2019; Wald et al., 2021; Veitch et al., 2021; Puli et al., 2021; Rosenfeld et al., 2021; Diskin et al., 2021). Wang et al. (2022) and Chen et al. (2022) analyze algorithms that bear resemblance to Algorithm 1, as they first project the data to a lower dimension and then fit a classifier. While these algorithms deal with more general assumptions in terms of the number of environments, number of spurious features, and noise distribution, the fact that their guarantees assume infinite data prevents them from being directly applicable to Algorithm 1. A few works with results on finite data are Ahuja et al. (2021) and Parulekar et al. (2022) (and also Efroni et al. (2022), who work on related problems in the context of sequential decision making), which characterize the sample complexity of methods that learn invariant classifiers. However, they do not analyze the overparameterized cases we are concerned with. Negative results about learning overparameterized robust classifiers have been shown for methods based on importance weighting (Zhai et al., 2022) and max-margin classifiers (Sagawa et al., 2020b). Our result is more general, applying to any learning algorithm that separates the data with an arbitrarily small margin, instead of focusing on max-margin classifiers or specific algorithms. While we focus on the linear case, we believe it is instructive, as any reasonable method is expected to succeed in that case. Nonetheless, we believe our results can be extended to non-linear classifiers, and we leave this to future work. One take-away from our result is that while low training loss is generally desirable, overfitting to the point of interpolation can significantly hinder invariance-inducing objectives.
This means one cannot assume that a typical deep learning model with an added invariance penalty will indeed achieve any form of invariance; this fact also motivates using held-out data for imposing invariance, as in our Algorithm 1 and in several other two-stage approaches mentioned above. Our work focuses on the theory underlying a wide array of algorithms, and there are natural follow-up topics to explore. One is to conduct a comprehensive empirical comparison of two-stage methods along with other methods that avoid interpolation, e.g., by subsampling data (Idrissi et al., 2022; Chatterji et al., 2022). Another interesting topic is whether there are other model properties that are incompatible with interpolation. For instance, recent work (Carrell et al., 2022) connects the generalization gap and calibration error on the training distribution. We also note that our focus in this paper was not on types of invariance that can be satisfied by clever data augmentation techniques (e.g., invariance to image translation) or by the design of special architectures (e.g., Cohen & Welling (2016); Lee et al. (2019); Maron et al. (2019)). These methods carefully incorporate a priori known invariances, and their empirical success when applied to large models may suggest that there are lessons to be learned for the type of invariant learning considered in our paper. These connections seem like an exciting avenue for future research.



Footnotes:
1. We note that in some settings, more commonly in the fairness literature, e is treated as a feature given to the classifier as input. Our focus is on cases where this is either impossible or undesired, for instance because at test time e is unobserved or ill-defined (e.g., we obtain data from a new hospital). However, we emphasize that the learning rules we consider have full knowledge of which environment produced each training example.
2. In fact, as we show in Equation (5) in Section 3, learning a model orthogonal to μ_s is also a necessary condition for minimizing the robust error. Thus, guarantees on the robust error also have consequences for the invariance of the model, as defined by these criteria. We discuss this further in Section F of the appendix.
3. The even split is used here for simplicity of exposition, and our full proof does not assume it. In practice, allocating more data to the first-stage split would likely perform better.
4. Intuitively, if |θ₁ − θ₂| = 0 then the two training environments are indistinguishable and we cannot hope to identify that the correlation induced by μ_s is spurious. Otherwise, we expect |θ₁ − θ₂| to have a quantifiable effect on our ability to generalize robustly. For simplicity of exposition we assume that the gap is bounded away from zero; the full result in the Appendix is stated in terms of |θ₁ − θ₂|.
5. This assumption ensures we have some positive labels in each environment.
6. This is to keep our parameters within the regime where benign overfitting occurs.
7. This is essentially Algorithm 1 with the following minor modifications: (1) the w_e's are computed via ERM, rather than simply taken to be the mean estimators; (2) since the FNR gap penalty is already computed w.r.t. a small number of samples, we avoid splitting the data and use the entire training set for both phases; (3) we convert the constrained optimization problem into an unconstrained one with a penalty term.
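The conversion of the constrained problem into a penalized one, mentioned in the last footnote, can be sketched as follows. The logistic surrogate, the sigmoid relaxation of the FNR, and the penalty weight `lam` are illustrative assumptions on our part, not the paper's exact objective.

```python
import numpy as np

def penalized_objective(v, feats, y, group, lam=1.0):
    """Logistic loss plus lam times a soft FNR gap (labels y in {-1, +1}).

    Replaces the hard invariance constraint with a penalty term; `lam`
    and the sigmoid relaxation of the FNR are illustrative choices.
    """
    logits = feats @ v
    loss = np.mean(np.log1p(np.exp(-y * logits)))  # logistic loss
    fnrs = []
    for g in (0, 1):
        pos = (y == 1) & (group == g)              # positives in group g
        # soft FNR: average predicted probability of the negative class
        fnrs.append(np.mean(1.0 / (1.0 + np.exp(logits[pos]))))
    return loss + lam * abs(fnrs[0] - fnrs[1])
```

Increasing `lam` trades classification loss for a smaller (soft) FNR gap, mirroring the penalty-strength sweep in the experiments.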



Algorithm 1: Two-Phase Learning of Overparameterized Invariant Classifiers
Input: datasets (S₁, S₂) and an invariance constraint function family F(·, ·)
Output: a classifier f_v(x)
Draw subsets of data without replacement S_e^train ⊂ S_e for e ∈ {1, 2}, where |S_e^train| = N_e/2
Stage 1: calculate w_e = (2/N_e) · Σ_{(x,y) ∈ S_e^train} y·x for each e ∈ {1, 2}
Define S_e^post = S_e \ S_e^train for e ∈ {1, 2} and S^post = S₁^post ∪ S₂^post

The two stages of the algorithm operate on different splits of the data as follows. 1. "Training" stage: we use {S_e^train} to fit overparameterized, interpolating classifiers {w_e} separately for each environment e ∈ {1, 2}. 2. "Post-processing" stage: we use the second portion of the data, S₁^post and S₂^post, to learn an invariant linear classifier over a new representation that concatenates the outputs of the first-stage classifiers. In particular, we learn this classifier by maximizing a score (i.e., minimizing an empirical loss), subject to an empirical version of an invariance constraint. For generality, we denote this constraint by membership in some set of functions F(S₁^post, S₂^post).
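The two stages described above can be sketched as follows. This is a simplified illustration with names of our own choosing, in which an unconstrained least-squares fit stands in for the constrained invariant fit over F.

```python
import numpy as np

def two_phase_fit(S1, S2):
    """Two-phase fit over environments S_e = (X_e, y_e), labels in {-1, +1}.

    Stage 1 fits a per-environment mean classifier w_e on the first half
    of each environment; stage 2 fits a 2-d combiner v on the held-out
    halves, over the representation (x·w_1, x·w_2). The least-squares fit
    below is a stand-in for the constrained invariant fit of the paper.
    """
    ws, post = [], []
    for X, y in (S1, S2):
        n = len(y) // 2                               # even split for simplicity
        Xtr, ytr = X[:n], y[:n]
        ws.append((ytr[:, None] * Xtr).mean(axis=0))  # w_e = (2/N_e) sum y*x
        post.append((X[n:], y[n:]))
    W = np.stack(ws)                                  # shape (2, d)
    Z = np.concatenate([Xpo @ W.T for Xpo, _ in post])  # new representation
    yz = np.concatenate([ypo for _, ypo in post])
    v, *_ = np.linalg.lstsq(Z, yz, rcond=None)        # stand-in for stage 2
    return W, v
```

Note that w_e reduces to the mean of y·x over the N_e/2 training samples, matching the mean-estimator form in the algorithm box.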

