EXCHANGING LESSONS BETWEEN ALGORITHMIC FAIRNESS AND DOMAIN GENERALIZATION

Anonymous authors

Abstract

Standard learning approaches are designed to perform well on average for the data distribution available at training time. Developing learning approaches that are not overly sensitive to the training distribution is central to research on domain or out-of-distribution generalization, robust optimization, and fairness. In this work we focus on links between research on domain generalization and algorithmic fairness, where performance under distinct but related test distributions is studied, and show how the two fields can be mutually beneficial. While domain generalization methods typically rely on knowledge of disjoint "domains" or "environments", "sensitive" label information indicating which demographic groups are at risk of discrimination is often used in the fairness literature. Drawing inspiration from recent fairness approaches that improve worst-case performance without knowledge of sensitive groups, we propose a novel domain generalization method that handles the more realistic scenario where environment partitions are not provided. We then show theoretically and empirically how different partitioning schemes can lead to increased or decreased generalization performance, enabling us to outperform Invariant Risk Minimization with handcrafted environments in multiple cases. We also show how a re-interpretation of IRMv1 allows us for the first time to directly optimize a common fairness criterion, group sufficiency, and thereby improve performance on a fair prediction task.

1. INTRODUCTION

Machine learning achieves super-human performance on many tasks when the test data is drawn from the same distribution as the training data. However, when the two distributions differ, model performance can severely degrade, even to below-chance predictions (Geirhos et al., 2020). Tiny perturbations can derail classifiers, as shown by adversarial examples (Szegedy et al., 2014) and common corruptions in image classification (Hendrycks & Dietterich, 2019). Even new test sets collected from the same data acquisition pipeline induce distribution shifts that significantly harm performance (Recht et al., 2019; Engstrom et al., 2020). Many approaches have been proposed to overcome model brittleness in the face of input distribution changes. Robust optimization aims to achieve good performance on any distribution close to the training distribution (Goodfellow et al., 2015; Duchi et al., 2016; Madry et al., 2018). Domain generalization, on the other hand, tries to go one step further: to generalize to distributions potentially far away from the training distribution. The field of algorithmic fairness meanwhile primarily focuses on developing metrics to track and mitigate performance differences between different sub-populations or across similar individuals (Dwork et al., 2012; Corbett-Davies & Goel, 2018; Chouldechova & Roth, 2018). Like domain generalization, evaluation using data related to but distinct from the training set is needed to characterize model failure. These evaluations are curated through the design of audits, which play a central role in revealing unfair algorithmic decision making (Buolamwini & Gebru, 2018; Obermeyer et al., 2019). While the ultimate goals of domain generalization and algorithmic fairness are closely aligned, little research has focused on their similarities and how they can inform each other constructively.
One of their main common goals can be characterized as: Learning algorithms robust to changes across domains or population groups.

Table 1: Results on CMNIST, a digit classification task where color is a spurious feature correlated with the label during training but anti-correlated at test time.

Method | Handcrafted Environments | Train accs | Test accs
ERM | No | 86.3 ± 0.1 | 13.8 ± 0.6
IRM | Yes | 71.1 ± 0.8 | 65.5 ± 2.3
EIIL+IRM | No | 73.7 ± 0.5 | 68.4 ± 2.7

Our method Environment Inference for Invariant Learning (EIIL), taking inspiration from recent themes in the fairness literature, augments IRM to improve test set performance without knowledge of pre-specified environment labels, by instead finding worst-case environments using aggregated data and a reference classifier.

Achieving this not only allows models to generalize to different and unobserved but related distributions, it also mitigates unequal treatment of individuals solely based on group membership. In this work we explore independently developed concepts from the domain generalization and fairness literatures and exchange lessons between them to motivate new methodology for both fields. Inspired by fairness approaches for unknown group memberships (Kim et al., 2019; Hébert-Johnson et al., 2018; Lahoti et al., 2020), we develop a new domain generalization method that does not require domain identifiers and yet can outperform manual specification of domains (Table 1). Leveraging domain generalization insights in a fairness context, we show the regularizer from IRMv1 (Arjovsky et al., 2019) optimizes a fairness criterion termed "group sufficiency", which for the first time enables us to explicitly optimize this criterion for non-convex losses in fair classification. The following contributions show how lessons can be exchanged between the two fields:
• We draw several connections between the goals of domain generalization and those of algorithmic fairness, suggesting fruitful research directions in both fields (Section 2).
• Drawing inspiration from recent methods on inferring worst-case sensitive groups from data, we propose a novel domain generalization algorithm-Environment Inference for Invariant Learning (EIIL)-for cases where training data does not include environment partition labels (Section 3). Our method outperforms IRM on the domain generalization benchmark ColorMNIST without access to environment labels (Section 4). • We also show that IRM, originally developed for domain generalization tasks, affords a differentiable regularizer for the fairness notion of group sufficiency, which was previously hard to optimize for non-convex losses. On a variant of the UCI Adult dataset where confounding bias is introduced, we leverage this insight with our method EIIL to improve group sufficiency without knowledge of sensitive groups, ultimately improving generalization performance for large distribution shifts compared with a baseline robust optimization method (Section 4). • We characterize both theoretically and empirically the limitations of our proposed method, concluding that while EIIL can correct a baseline ERM solution that uses a spurious feature or "shortcut" for prediction, it is not suitable for all settings (Sections 3 and 4).

2. DOMAIN GENERALIZATION AND ALGORITHMIC FAIRNESS

Here we lay out some connections between the two fields. Table 2 provides a high-level comparison of the objectives and assumptions of several relevant methods. Loosely speaking, recent approaches from both areas share the goal of matching some chosen statistic across a conditioning variable e, representing sensitive group membership in algorithmic fairness or an environment/domain indicator in domain generalization. The statistic in question informs the learning objective for the resulting model, and is motivated differently in each case. In domain generalization, learning is informed by the properties of the test distribution where good generalization should be achieved. In algorithmic fairness the choice of statistic is motivated by a context-specific fairness notion that likewise encourages a particular solution achieving "fair" outcomes (Chouldechova & Roth, 2018). Empty spaces in Table 2 suggest areas for future work, and bold-faced entries suggest connections we show in this paper.

Table 2: Domain Generalization (Dom-Gen) and Fairness methods can be understood as matching some statistic across conditioning variable e, representing "environments" or "domains" in the Dom-Gen literature and "sensitive" group membership in the Fairness literature. We boldface new connections highlighted in this work, with blank spaces suggesting future work.

Notation Let X be the input space, E the set of environments (a.k.a. "domains"), and Y the target space. Let x, y, e ∼ p_obs(x, y, e) be observational data, with x ∈ X, y ∈ Y, and e ∈ E. H denotes a representation space, on top of which a classifier w ∘ Φ (mapping to the pre-image of ∆(Y) via a linear map w) can be applied. Φ : X → H denotes the parameterized mapping or "model" that we optimize, and we refer to Φ(x) ∈ H as the "representation" of example x. ŷ ∈ Y denotes a hard prediction derived from the classifier by stochastic sampling or probability thresholding. ℓ : H × Y → R denotes the scalar loss, which guides the learning.
The empirical risk minimization (ERM) solution is found by minimizing the global risk, expressed as the expected loss over the observational distribution: C^ERM(Φ) = E_{p_obs(x,y,e)}[ℓ(Φ(x), y)].

Domain Generalization Domain generalization is concerned with achieving low error rates on unseen test distributions. One way to achieve domain generalization is by casting it as a robust optimization problem (Ben-Tal et al., 2009), where one aims to minimize the worst-case loss for every subset of the training set, or for other well-defined perturbation sets around the data (Duchi et al., 2016; Madry et al., 2018). Other approaches tackle domain generalization by adversarially learning representations invariant (Zhang et al., 2017; Hoffman et al., 2018; Ganin et al., 2016) or conditionally invariant (Li et al., 2018) to the environment. Distributionally Robust Optimization (DRO) (Duchi et al., 2016) seeks good performance for all nearby distributions by minimizing the worst-case loss sup_q E_q[ℓ] s.t. D(q||p) < ε, where D measures the similarity between two distributions (e.g. the χ² divergence) and ε is a hyperparameter. The objective can be computed as an expectation over p via per-example importance weights γ_i = q(x_i, y_i)/p(x_i, y_i). Recently, Invariant Learning approaches such as Invariant Risk Minimization (IRM) (Arjovsky et al., 2019) and Risk Extrapolation (REx) (Krueger et al., 2020) were proposed to overcome the limitations of domain invariant representation learning (Zhao et al., 2019) by discovering invariant relationships between inputs and targets across domains. Invariance serves as a proxy for causality, as features representing "causes" of target labels rather than effects will generalize well under intervention. In IRM, a representation Φ(x) is learned that performs optimally within each environment, and is thus invariant to the choice of environment e ∈ E, with the ultimate goal of generalizing to an unknown test dataset p(x, y|e_test).
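To make the worst-case reweighting in DRO concrete, here is a minimal numpy sketch (ours, not from the cited works). Instead of the χ² ball described above, it uses a CVaR-style perturbation set in which the adversary places all probability mass uniformly on the worst α-fraction of examples, which puts the inner sup in closed form; the data and the choice of α are purely illustrative.

```python
import numpy as np

def average_risk(losses):
    """ERM risk: uniform weights over the observed examples."""
    return losses.mean()

def worst_case_risk(losses, alpha=0.2):
    """Inner sup of a DRO objective over a CVaR-style perturbation set:
    the adversary puts all mass on the worst alpha-fraction of examples
    (importance weights gamma_i = 1/alpha there, 0 elsewhere)."""
    n = len(losses)
    k = max(1, int(np.ceil(alpha * n)))
    worst = np.sort(losses)[-k:]  # k highest per-example losses
    return worst.mean()

rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=1000)  # synthetic per-example losses

avg = average_risk(losses)
wc = worst_case_risk(losses, alpha=0.2)
assert wc >= avg  # adversarial reweighting can only increase the risk
```

Shrinking α enlarges the perturbation set, interpolating between the ERM risk (α = 1) and the single worst example (α → 0).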
Because optimal classifiers under standard loss functions can be realized via a conditional label distribution (f*(x) = E[y|x]), an invariant representation Φ(x) must satisfy the following Environment Invariance Constraint:

E[y|Φ(x) = h, e_1] = E[y|Φ(x) = h, e_2]  ∀ h ∈ H, ∀ e_1, e_2 ∈ E. (EI-CONSTR)

Intuitively, the representation Φ(x) encodes features of the input x that induce the same conditional distribution over labels across each environment. Because trivial representations such as mapping all x onto the same value may satisfy environment invariance, other objectives must be introduced to encourage the predictive utility of Φ. Arjovsky et al. (2019) propose IRM as a way to satisfy (EI-CONSTR) while achieving a good overall risk. As a practical instantiation, the authors introduce IRMv1, a gradient-penalty regularized objective enforcing simultaneous optimality of the same classifier w ∘ Φ in all environments. Denoting by R^e = E_{p_obs(x,y|e)}[ℓ] the per-environment risk, the objective to be minimized is

C^IRM(Φ) = Σ_{e∈E} R^e(Φ) + λ ||∇_{w|w=1.0} R^e(w ∘ Φ)||². (2)

Krueger et al. (2020) propose the related Risk Extrapolation (REx) principle, which dictates a stronger preference to exactly equalize R^e for all e (e.g. by penalizing variance across e), which is shown to improve generalization in several settings.

Fairness Early approaches to learning fair representations (Zemel et al., 2013; Edwards & Storkey, 2015; Louizos et al., 2015; Zhang et al., 2018; Madras et al., 2018) leveraged statistical independence regularizers from domain adaptation (Ben-David et al., 2010; Ganin et al., 2016), noting that marginal or conditional independence from domain to prediction relates to the fairness notions of demographic parity, ŷ ⊥ e (Dwork et al., 2012), and equal opportunity, ŷ ⊥ e | y (Hardt et al., 2016).
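For intuition about the IRMv1 penalty in Equation 2, the following sketch (ours, not from the paper) evaluates it for a binary task with a scalar logit representation z = Φ(x) and a frozen classifier w = 1.0. With binary cross-entropy, d/dw ℓ(σ(wz), y) = (σ(wz) − y)z, so at w = 1 the per-environment penalty is simply (E[(σ(z) − y)z])²; the synthetic environments are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irmv1_penalty(z, y):
    """||grad_w R^e(w . Phi)||^2 at w = 1.0 for binary cross-entropy:
    d/dw BCE(sigmoid(w*z), y) = (sigmoid(w*z) - y) * z, evaluated at w = 1."""
    grad = np.mean((sigmoid(z) - y) * z)
    return grad ** 2

rng = np.random.default_rng(0)

# Environment where the representation is well calibrated: y ~ Bernoulli(sigmoid(z))
z1 = rng.normal(size=5000)
y1 = (rng.random(5000) < sigmoid(z1)).astype(float)

# Environment where the same logits are anti-correlated with the label
z2 = rng.normal(size=5000)
y2 = (rng.random(5000) < sigmoid(-z2)).astype(float)

p1, p2 = irmv1_penalty(z1, y1), irmv1_penalty(z2, y2)
assert p1 < p2  # the miscalibrated environment violates invariance far more
```

A representation satisfying (EI-CONSTR) drives both penalties toward zero simultaneously, which is exactly what the λ-weighted term in Equation 2 rewards.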
The goal here is to boost the predictions of a pre-trained classifier through multiple rounds of auditing (searching for worst-case subgroups using an auxiliary model) rather than learning an invariant representation. A related line of work also leverages inferred subgroup information to improve worst-case model performance using the framework of DRO. Hashimoto et al. (2018) applied DRO to encourage long-term fairness in a dynamical setting where the average loss for a subpopulation influences their propensity to continue engaging with the model. Lahoti et al. (2020) proposed Adversarially Reweighted Learning (ARL), which extends DRO using an auxiliary model to compute the importance weights γ_i mentioned above. Amortizing this computation mitigates the tendency of DRO to overfit its reweighting strategy to noisy outliers. Wang et al. (2020) proposed a group DRO method for adaptively estimating soft assignments to groups, suitable for the setting where group labels are noisy.

Limitations of generalization-first fairness One exciting direction for future work is to apply methods developed in the domain generalization literature to tasks where distribution shift is related to some societal harm that should be mitigated. However, researchers should be wary of blind "solutionism", which can be ineffectual or harmful when the societal context surrounding the machine learning system is ignored (Selbst et al., 2019). Moreover, many aspects of algorithmic discrimination are not simply a matter of achieving few errors on unseen distributions. Unfairness due to task definition or dataset collection, as discussed in the study of target variable selection by Obermeyer et al. (2019), may not be reversible by novel algorithmic developments.

3. INVARIANCE WITHOUT DEMOGRAPHICS OR ENVIRONMENTS

In this section we draw inspiration from recent work on fair prediction without sensitive labels (Kearns et al., 2018; Hébert-Johnson et al., 2018; Hashimoto et al., 2018; Lahoti et al., 2020) to propose a novel domain generalization algorithm that does not require a priori domain/environment knowledge. To motivate the study of this setting and show the fairness and invariance considerations at play, consider the task of using a high dimensional medical image x to predict a target label y ∈ {0, 1} indicating the presence of COVID-19 in the imaged patient. DeGrave et al. (2020) describe the common use of a composite dataset for this task, where the process of aggregating data across two source hospitals e ∈ {H_1, H_2} leads to a brittle neural net classifier w ∘ Φ(x) that fixates on spurious low-level artifacts in x as predictive features. Now we will consider a slightly different scenario. Consider a single hospital serving two different demographic populations e ∈ {P_1, P_2}. While P_1 has mostly sick patients at time t = 0 due to the prevalence of COVID-19 in this subpopulation, P_2 currently has mostly well patients. Then p(x, y|e = P_1, t = 0) and p(x, y|e = P_2, t = 0) will differ considerably, and moreover a classifier using a spurious feature indicative of subpopulation membership, either a low-level image artifact or an attribute of the medical record, may achieve low average error on the available data. Of course such a classifier may generalize poorly. Consider temporal distribution shifts: suppose at time t = 1, due to the geographic density of the virus changing over time, P_1 has mostly well patients while patients from P_2 are now mostly sick. Now the spurious classifier may suffer worse-than-chance error rates and imply unfair outcomes for disadvantaged groups. In reality the early onset and frequency of exposure to COVID-19 has been unequally distributed along many social dimensions (class, race, occupation, etc.) that could constitute protected groups (Tai et al., 2020), raising concerns of additional algorithmic discrimination. Learning to be invariant to spurious features encoding demographics would prevent errors due to such temporal shifts. While loss reweighting as in DRO/ARL can upweight error cases, without an explicit invariance regularizer the model may still do best on average by making use of the spurious feature. IRM can remove the spurious feature in this particular case, but a method for discovering environment partitions directly may occasionally be needed. This need is clear when demographic makeup is not directly observed: a method to sort each example into the environments that maximally separate the spurious feature, i.e. to infer the populations {P_1, P_2}, is needed for effective invariant learning.
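The temporal-shift failure mode in this scenario can be simulated in a few lines (a toy sketch with invented prevalence numbers, not from the paper): a classifier that predicts "sick" from subpopulation membership alone looks excellent at t = 0 and drops below chance at t = 1 once the prevalences swap.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
group = rng.integers(0, 2, size=n)  # e in {P1, P2}, encoded as 0/1

def sample_labels(group, p_sick_P1, p_sick_P2):
    """Draw sickness labels with a per-subpopulation prevalence."""
    p = np.where(group == 0, p_sick_P1, p_sick_P2)
    return (rng.random(len(group)) < p).astype(int)

# t=0: P1 mostly sick, P2 mostly well; t=1: the prevalences swap
y_t0 = sample_labels(group, p_sick_P1=0.9, p_sick_P2=0.1)
y_t1 = sample_labels(group, p_sick_P1=0.1, p_sick_P2=0.9)

# "Spurious" classifier: predict sick iff the patient belongs to P1
y_hat = (group == 0).astype(int)

acc_t0 = (y_hat == y_t0).mean()
acc_t1 = (y_hat == y_t1).mean()
assert acc_t0 > 0.8  # looks excellent on the available data
assert acc_t1 < 0.5  # worse than chance after the shift
```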

3.1. ENVIRONMENT INFERENCE FOR INVARIANT LEARNING

We now derive a principle for inferring environments from observational data. Our exposition extends IRMv1 (Equation 2), but we emphasize that our method EIIL is applicable more broadly to any environment-based learning objective. We begin by introducing u_i(e') = p_obs(e'|x_i, y_i) = 1(e_i = e') as an indicator of the hand-crafted environment assignment per example. Noting that N_e := Σ_i u_i(e) represents the number of examples in environment e, we can re-express the IRM objective to make its dependence on the environment labels explicit:

C^IRM(Φ, u) = Σ_{e∈E} (1/N_e) Σ_i u_i(e) ℓ(Φ(x_i), y_i) + Σ_{e∈E} λ ||∇_{w|w=1.0} (1/N_e) Σ_i u_i(e) ℓ(w ∘ Φ(x_i), y_i)||². (3)

Our general strategy is to replace the binary indicator u_i(e) with a probability distribution q(e|x_i, y_i), representing a soft assignment of the i-th example to the e-th environment. q(e|x_i, y_i) should capture worst-case environments w.r.t. the invariant learning objective; rewriting q(e|x_i, y_i) as q_i(e) for consistency with the above expression, we arrive at the following bi-level optimization:

min_Φ max_q C^IRM(Φ, q). (EIIL)

We leave the full exploration of this bi-level optimization to future work, but for now propose the following practical sequential approach, which we call EIILv1 (see Appendix A for pseudocode):
1. Input reference model Φ̃;
2. Fix Φ ← Φ̃ and fully optimize the inner loop of (EIIL) to infer environments q̃_i(e) = q̃(e|x_i, y_i);
3. Fix q ← q̃ and fully optimize the outer loop to yield the new model Φ.
Instead of requiring hand-crafted environments, we instead require a trained reference model Φ̃, which is arguably easier to produce and could be found using ERM on p_obs(x, y), for example. In our experiments we consider binary environments and explicitly parameterize q(e|x, y) as a vector of probabilities for each example in the training data.
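As a rough illustration of step 2 (the inner maximization), here is a self-contained numpy sketch for the scalar-logit binary case; this is our simplification, not the paper's implementation. With binary cross-entropy, the per-example gradient of the risk w.r.t. w at w = 1 is g_i = (σ(z_i) − y_i)z_i, so each environment's soft IRMv1 penalty is the squared weighted mean of g. For simplicity we ascend only the penalty terms (the large-λ regime, dropping the R_e terms), on illustrative data where the reference model depends only on a spurious feature s.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def penalty_terms(w1, g):
    """Soft per-environment IRMv1 penalties for scalar-logit BCE, where
    g_i = (sigmoid(z_i) - y_i) * z_i is the per-example gradient wrt w at w=1."""
    w2 = 1.0 - w1
    A1 = np.sum(w1 * g) / np.sum(w1)  # grad of env-1 risk wrt w at w=1
    A2 = np.sum(w2 * g) / np.sum(w2)
    return A1, A2, A1 ** 2 + A2 ** 2

rng = np.random.default_rng(0)
n = 200
s = rng.integers(0, 2, size=n)               # spurious feature
y = np.where(rng.random(n) < 0.8, s, 1 - s)  # label agrees with s 80% of the time
z = 2.0 * (2 * s - 1)                        # reference model: logits from s alone
g = (sigmoid(z) - y) * z                     # per-example penalty gradients

e = rng.normal(scale=1.0, size=n)            # environment logits, q_i = sigmoid(e_i)
_, _, penalty_init = penalty_terms(sigmoid(e), g)

lr = 10.0
for _ in range(3000):                        # gradient ascent on the inner max of (EIIL)
    w1 = sigmoid(e)
    S1, S2 = np.sum(w1), np.sum(1 - w1)
    A1 = np.sum(w1 * g) / S1
    A2 = np.sum((1 - w1) * g) / S2
    # analytic gradient of A1^2 + A2^2 wrt the logits e
    de = (2 * A1 * (g - A1) / S1 - 2 * A2 * (g - A2) / S2) * w1 * (1 - w1)
    e += lr * de

_, _, penalty_final = penalty_terms(sigmoid(e), g)
assert penalty_final > penalty_init          # inferred split violates invariance more
```

In this setup the penalty is maximized by grouping examples according to whether y and s agree, in line with Theorem 1 below: g_i takes one value on agreeing examples and another on disagreeing ones, so the pure agreement split yields the largest per-environment gradients.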

3.2. ANALYZING THE EIIL SOLUTION

To characterize the ability of EIILv1 to generalize to unseen test data, we now examine the inductive bias for generalization provided by the reference model Φ̃. We state the main result here and defer the proofs to Appendix B. Consider a dataset with some feature(s) z which are spurious and other(s) v which are valuable/causal w.r.t. the label y. Our proof considers binary features/labels and two environments, but the same argument extends to other cases. Our goal is to find a model Φ whose representation Φ(v, z) is invariant w.r.t. z and focuses solely on v.

Theorem 1 Consider environments that differ in the degree to which the label y agrees with the spurious features z, i.e. P(1(y = z)|e_1) ≠ P(1(y = z)|e_2): then a reference model Φ̃_Spurious that is invariant to the valuable features v and focuses solely on the spurious features z maximally violates the invariance principle (EI-CONSTR). Likewise, consider the case of a fixed representation Φ̃ that focuses on the spurious features: then a choice of environments that maximally violates (EI-CONSTR) is e_1 = {v, z, y | 1(y = z)} and e_2 = {v, z, y | 1(y ≠ z)}.

If environments are split according to agreement of y and z, then the constraint from (EI-CONSTR) is satisfied by a representation that ignores z: Φ(x) ⊥ z. Unfortunately this requires a priori knowledge of either the spurious feature z or a reference model Φ̃_Spurious that extracts it. When the wrong solution Φ̃_Spurious is not a priori known, it can sometimes be recovered directly from the training data; for example in CMNIST we find that Φ̃_ERM approximates Φ̃_Color. This allows EIIL to find environment partitions providing the starkest possible contrast for invariant learning. Even if environment partitions are available, it may be possible to improve performance by inferring new partitions from scratch.
It can be shown (see Appendix B.2) that the environments provided in the CMNIST dataset (Arjovsky et al., 2019) do not maximally violate (EI-CONSTR) for a reference model Φ̃_Color, and are thus not maximally informative for learning to ignore color. Accordingly, EIIL improves test accuracy for IRM compared with the hand-crafted environments (Table 1). Under medium label noise (0.1 < θ_y < 0.2), IRM(e_EIIL) is worse than IRM(e_HC) but better than ERM, the logical approach if environments are not available. Under low label noise (θ_y < 0.1), where color is less predictive than shape at train time, ERM performs well and IRM(e_EIIL) fails. GRAYSCALE indicates the oracle solution using shape alone, while Φ̃_Color uses color alone.

4. EXPERIMENTS

We defer a proof-of-concept synthetic regression experiment to Appendix E for lack of space. We proceed with the established domain generalization benchmark ColorMNIST, and then discuss a variant of the algorithmic fairness dataset UCI Adult. We note that benchmarking model performance on a shifted test distribution without access to validation samples, especially during model selection, is a difficult open problem, a solution to which is beyond the scope of this paper. Accordingly we use the default IRM hyperparameters wherever appropriate, and otherwise follow a recently proposed model selection strategy (Gulrajani & Lopez-Paz, 2020) (see Appendix D).

4.1. COLORMNIST

ColorMNIST (CMNIST) is a noisy digit recognition task where color is a spurious feature that correlates with the label at train time but anti-correlates at test time, with the correlation strength at train time varying across two pre-specified environments (Arjovsky et al., 2019). Crucially, label noise is applied by flipping y with probability θ_y; the default setting (θ_y = 0.25) implies that shape (the correct feature) is marginally less reliable than color in the train set, so naive ERM ignores shape to focus on color and suffers from below-chance performance at test time. We evaluated the performance of the following methods. ERM: a naive MLP that does not make use of environment labels e, but instead optimizes the average loss on the aggregated environments. IRM(e_HC): the method of Arjovsky et al. (2019) using hand-crafted environment labels. IRM(e_EIIL): our proposed method (a.k.a. EIILv1), which infers useful environments (not using hand-crafted environment labels) based on the naive ERM reference model, then applies IRM to the inferred environments.

After noting that EIILv1, denoted IRM(e_EIIL) above, outperforms IRM without access to environment labels in the default setting (see Tables 1 and 6), we examine how the various methods perform as a function of θ_y. This parameter influences the ERM solution, since low θ_y implies shape is more reliable than color in the aggregated training data (thus ERM generalizes well), while the opposite trend holds for high θ_y. Because EIILv1 relies on a reference model Φ̃, its performance is also affected when Φ̃ = Φ̃_ERM (Figure 1). We find that IRM(e_EIIL) generalizes better than IRM(e_HC) with sufficiently high label noise θ_y > 0.2, but generalizes poorly under low label noise. This is precisely due to the success of ERM in this setting, where shape is a more reliable feature in the training data than color. We verify this conclusion by evaluating IRM(e_EIIL) when Φ̃ = Φ̃_Color, i.e. a hand-coded color-based predictor as reference. This does relatively well across all settings of θ_y, approaching the performance of the (oracle) baseline that classifies using grayscale inputs.
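The bias structure of CMNIST can be mimicked with a tiny synthetic generator (our sketch, abstracting images to binary shape/color features; the environment flip probabilities 0.2 and 0.1 follow the conditional label means in Appendix B.2, while the 0.9 test-time flip is an illustrative choice for the anti-correlated test set):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env(n, theta_y, theta_color):
    """CMNIST-style environment: shape is the true feature, color is spurious.
    theta_y: label-noise rate; theta_color: probability color disagrees with y."""
    shape = rng.integers(0, 2, size=n)                       # binarized digit class
    y = np.where(rng.random(n) < theta_y, 1 - shape, shape)  # noisy label
    color = np.where(rng.random(n) < theta_color, 1 - y, y)  # color tracks y
    return shape, color, y

n = 20000
shape1, color1, y1 = make_env(n, theta_y=0.25, theta_color=0.2)     # train env e1
shape2, color2, y2 = make_env(n, theta_y=0.25, theta_color=0.1)     # train env e2
shape_t, color_t, y_t = make_env(n, theta_y=0.25, theta_color=0.9)  # test (assumed)

# A color-only classifier beats a shape-only classifier at train time...
assert (color1 == y1).mean() > (shape1 == y1).mean()
# ...but collapses below chance at test time, while shape holds up
assert (color_t == y_t).mean() < 0.5 < (shape_t == y_t).mean()
```

Lowering theta_y below the color flip rates reverses the first inequality, which is exactly the regime where ERM succeeds and EIILv1 with an ERM reference fails.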

4.2. CENSUS DATA

We now study a fair prediction problem using a variant of the UCI Adult dataset, which comprises 48,842 individual census records collected from the United States in 1994. The task commonly used as an algorithmic fairness benchmark is to predict a binarized income indicator (thresholded at $50,000) as the target label, possibly considering sensitive attributes such as age, sex, and race. Because the original task measures in-distribution test performance, we instead construct a variant of this dataset suitable for measuring out-of-distribution test performance, which we call ConfoundedAdult.

Table 3: Results on ConfoundedAdult (train and test accuracies).

Method | Train accs | Test accs
Baseline | 92.7 ± 0.5 | 31.1 ± 4.4
ARL (Lahoti et al., 2020) | – | –

Lahoti et al. (2020) demonstrate the benefit of per-example loss reweighting on UCI Adult using their method ARL to improve predictive performance for undersampled subgroups. Following Lahoti et al. (2020), we consider the effect of four sensitive subgroups, defined by composing binarized race and sex labels, on model performance, assuming the model does not know a priori which features are sensitive. However, we focus on a distinct generalization problem where a pernicious dataset bias confounds the training data, making subgroup membership predictive of the label on the training data. At test time these correlations are reversed, so a predictor that infers subgroup membership to make predictions will perform poorly at test time (see Appendix C for details). Dwork et al. (2012) described a similar motivating scenario where the conditional distribution mapping features to target labels varies across demographic groups due to cultural differences, so the most predictive predictor for one group may not generalize to the others. The large distribution shift of our test set can be understood as a worst-case audit to determine whether the classifier uses subgroup information in its predictions.
Using EIILv1 to first infer worst-case environments and then ensure invariance across them performs favorably on the audit test set, compared with ARL and a baseline MLP (Table 3). We also find that, without access to sensitive group information, using the IRMv1 penalty on the EIIL environments improves subgroup sufficiency (Figure 2). Appendix E.3 provides an ablation showing that all components of the EIILv1 approach are needed to achieve the best performance.

Figure 2: We examine subgroup sufficiency, i.e. whether calibration curves match across demographic subgroups, on the ConfoundedAdult dataset. Whereas ARL is not subgroup-sufficient (a), EIIL infers worst-case environments and regularizes their calibration to be similar (b), ultimately improving subgroup sufficiency (c). This helps EIIL generalize better to a shifted test set (e) compared with ARL (d). Note that neither method uses sensitive group information during learning.

5. CONCLUSION

We discussed the common goals of algorithmic fairness and domain generalization, compared related methods from each literature, and suggested how lessons can be exchanged between the two fields to inform future research. The most concrete outcome of this discussion was our novel domain generalization method, Environment Inference for Invariant Learning (EIIL). Drawing inspiration from fairness methods that optimize worst-case performance without access to demographic information, EIIL improves the performance of IRM on CMNIST without requiring a priori knowledge of the environments. On a variant of the UCI Adult dataset, EIIL makes use of the IRMv1 regularizer to improve group sufficiency, a fairness criterion previously difficult to optimize for non-convex losses, without requiring knowledge of the sensitive groups.

A EIILV1 PSEUDOCODE

Algorithm 1: The first stage of EIILv1 infers two environments that maximally violate the IRM objective. The inferred environments are then used to train an IRM solution from scratch.

Input: Reference model Φ̃, dataset D = {x_i, y_i} with x_i, y_i ∼ p_obs i.i.d., loss function ℓ, duration N_steps.
Output: Worst-case data splits D_1, D_2 for use with IRM.

Randomly initialize e ∈ R^|D| as the vectorized logit of the posterior, with σ(e_i) := q(e|x_i, y_i).
for n ∈ 1 ... N_steps do
    R_1 = (1 / Σ_i σ(e_i)) Σ_i σ(e_i) ℓ(Φ̃(x_i), y_i)                                  // D_1 risk
    G_1 = ||∇_{w|w=1} (1 / Σ_i σ(e_i)) Σ_i σ(e_i) ℓ(w ∘ Φ̃(x_i), y_i)||²               // D_1 invariance regularizer
    R_2 = (1 / Σ_i (1 − σ(e_i))) Σ_i (1 − σ(e_i)) ℓ(Φ̃(x_i), y_i)                      // D_2 risk
    G_2 = ||∇_{w|w=1} (1 / Σ_i (1 − σ(e_i))) Σ_i (1 − σ(e_i)) ℓ(w ∘ Φ̃(x_i), y_i)||²   // D_2 invariance regularizer
    L = (1/2) Σ_{e∈{1,2}} (R_e + λ G_e)
    e ← OptimUpdate(e, ∇_e L)
end
ê ∼ Bernoulli(σ(e))                                      // sample splits
D_1 ← {x_i, y_i | ê_i = 1}, D_2 ← {x_i, y_i | ê_i = 0}   // split data

B PROOFS

B.1 PROOF OF THEOREM 1

Consider a dataset with some feature(s) z which are spurious and other(s) v which are valuable/causal w.r.t. the label y. This includes data generated by models where v → y → z, such that P(y|v, z) ≠ P(y|v). Assume further that the observations x are functions of both spurious and valuable features: x := f(v, z). The aim of invariant learning is to form a classifier that predicts y from x while focusing solely on the causal features, i.e., one that is invariant to z and focuses solely on v. Consider a classifier that produces a score S(x) for example x. In the binary classification setting S is analogous to the model Φ, while the score S(x) is analogous to the representation Φ(x).
To quantify the degree to which the constraint in the invariance principle (EI-CONSTR) holds, we introduce a measure called the group sufficiency gap:

∆(S, e) = E[|E[y|S(x), e_1] − E[y|S(x), e_2]|].

Now consider the notion of an environment: some setting in which the x → y relationship varies (based on spurious features). Assume a single binary spurious feature z. We restate Theorem 1 as follows.

Claim: If environments are defined based on the agreement of the spurious feature z and the label y, then a classifier that predicts based on z alone maximizes the group sufficiency gap (and vice versa: if a classifier predicts y directly by predicting z, then defining two environments based on agreement of label and spurious feature, e_1 = {v, z, y | 1(y = z)} and e_2 = {v, z, y | 1(y ≠ z)}, maximizes the gap).

We can show this by first noting that if the environment is based on spurious-feature/label agreement, then with e ∈ {0, 1} we have e = 1(y = z). If the classifier predicts z, i.e. S(x) = z, then we have

∆(S, e) = E[|E[y|z(x), 1(y = z)] − E[y|z(x), 1(y ≠ z)]|].

For each instance x, either z = 0 or z = 1. When z = 1 we have E[y|z, 1(y = z)] = 1 and E[y|z, 1(y ≠ z)] = 0, while when z = 0 we have E[y|z, 1(y = z)] = 0 and E[y|z, 1(y ≠ z)] = 1. Therefore for each example |E[y|z(x), 1(y = z)] − E[y|z(x), 1(y ≠ z)]| = 1, contributing to an overall ∆(S, e) = 1, which is the maximum value for the sufficiency gap.

B.2 GIVEN CMNIST ENVIRONMENTS ARE SUBOPTIMAL W.R.T. SUFFICIENCY GAP

The regularizer from IRMv1 encourages a representation for which the sufficiency gap is minimized between the available environments. Therefore when faced with a new task it is natural to measure the natural sufficiency gap between these environments, mediated through a naive or baseline method. Here we show that for CMNIST, when considering a naive color-based classifier as the reference model, the given environment splits are actually suboptimal w.r.t. the sufficiency gap, which motivates the proposed EIIL approach for inferring environments that have a more severe sufficiency gap for the reference model.

We begin by computing ∆(S, e), the sufficiency gap for the color-based classifier g over the given train environments {e_1, e_2}. We introduce an auxiliary color variable z, which is not observed but can be sampled from via the color-based classifier g: p(y|g(x) = x', e) = E_{p(z|x')}[p(y|z, e, x')]. Denote by GREEN and RED the sets of green and red images, respectively, with z = 1 iff x ∈ RED and z = 0 iff x ∈ GREEN. The sufficiency gap is then

∆(S, e) = E_{p(x,e)}[|E[y|g(x), e_1] − E[y|g(x), e_2]|]
= E_{p(z,e)}[|E[y|z, e_1] − E[y|z, e_2]|]
= (1/2) Σ_{z∈{GREEN,RED}} |E[y|z, e_1] − E[y|z, e_2]|
= (1/2) (|E[y|z = GREEN, e_1] − E[y|z = GREEN, e_2]| + |E[y|z = RED, e_1] − E[y|z = RED, e_2]|)
= (1/2) (|0.1 − 0.2| + |0.9 − 0.8|) = 1/10.

The regularizer in IRMv1 is trying to reduce the sufficiency gap, so in some sense we can think about this gap as a learning signal for the IRM learner. A natural question is whether a different environment partition {e} can be found such that this learning signal is stronger, i.e. such that the sufficiency gap is increased. We find the answer is yes. Consider an environment distribution q(e|x, y, z) that assigns each data point to one of two environments. Any assignment suffices so long as its marginal matches the observed data: Σ_z Σ_e q(x, y, z, e) = p_obs(x, y).
We can now express the sufficiency gap (given the color-based classifier g) as a function of the environment assignment q:

∆(S, e ∼ q) = E_q(x)[ |E_q[y | g(x), e1] − E_q[y | g(x), e2]| ]
            = E_q(x)[ |E_{q(y|z,e1) p(z|x)}[y | z, e1] − E_{q(y|z,e2) p(z|x)}[y | z, e2]| ],

where we use the same change-of-variables trick as above to replace g(x) with samples from p(z|x) (note that this is the color factor from the generative process p, in accordance with our assumption that g matches this distribution). We want to show there exists a q yielding a higher sufficiency gap than the given environments. Consider a q that yields the conditional label distribution

q(y | x, e, z) := q(y | e, z) = 1(y = z) if e = e1,  1(y ≠ z) if e = e2.

This can be realized by an encoder/auditor q(e | x, y, z) that ignores the image features in x and partitions examples based on whether or not the label y and color z agree. We also note that z is deterministically the color of the image in the generative process: p(z = 1 | x) = 1(x ∈ GREEN). Now we can compute the sufficiency gap. For x ∈ RED (so z = 0): under e1 the label agrees with the color, so E[y | z, e1] = 0, while under e2 it disagrees, so E[y | z, e2] = 1, giving a per-color gap of |0 − 1| = 1. Symmetrically, for x ∈ GREEN (z = 1): E[y | z, e1] = 1 and E[y | z, e2] = 0, again a gap of 1. Hence

∆(S, e ∼ q) = 1/2 · E_{x ∈ RED}[|0 − 1|] + 1/2 · E_{x ∈ GREEN}[|1 − 0|] = 1/2 + 1/2 = 1.

Note that 1 is the maximal sufficiency gap, meaning that the described environment partition maximizes the sufficiency gap w.r.t. the color-based classifier g.
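Both gap values can be checked with a quick simulation; a minimal sketch, assuming a simplified CMNIST-style generative process in which labels are uniform and color matches the label with probability 0.8 in e1 and 0.9 in e2:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def sample_env(p_agree, n):
    """Labels y uniform; color z matches y with probability p_agree."""
    y = rng.integers(0, 2, n)
    agree = rng.random(n) < p_agree
    z = np.where(agree, y, 1 - y)
    return y, z

y1, z1 = sample_env(0.8, N)  # e1: color matches label 80% of the time
y2, z2 = sample_env(0.9, N)  # e2: 90% of the time

# Sufficiency gap of the color classifier g(x) = z under the given split:
# average over colors of |E[y | z, e1] - E[y | z, e2]|.
gap_given = 0.5 * sum(abs(y1[z1 == c].mean() - y2[z2 == c].mean()) for c in (0, 1))

# Adversarial split: pool the data, assign to e1 iff label and color agree.
y, z = np.concatenate([y1, y2]), np.concatenate([z1, z2])
in_e1 = (y == z)
gap_adv = 0.5 * sum(
    abs(y[in_e1 & (z == c)].mean() - y[~in_e1 & (z == c)].mean()) for c in (0, 1)
)

print(round(gap_given, 2))  # ≈ 0.1, matching the 1/10 computed above
print(gap_adv)              # 1.0, the maximal gap
```

The agreement-based partition drives the per-color label expectations to 0 and 1 in the two environments, which is why the gap saturates at its maximum.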

C DATASET DETAILS

Constructing the ConfoundedAdult dataset To create our semi-synthetic dataset, called ConfoundedAdult, we start by observing that the conditional distribution over labels varies across the sensitive subgroups, and in some cases subgroup membership alone can be used to achieve relatively low prediction error. The four sensitive subgroups are defined following the procedure of Lahoti et al. (2020): the sex (recorded as binary: Male/Female) and binarized race (Black/non-Black) attributes compose to make four possible subgroups: Non-Black Males (SG1), Non-Black Females (SG2), Black Males (SG3), and Black Females (SG4). Because these correlations should be considered "spurious" in order to mitigate unequal treatment across groups, we create a semi-synthetic variant of the UCI Adult dataset, which we call ConfoundedAdult, where these spurious correlations are exaggerated. Table 4 shows various conditional label distributions for the original dataset and our proposed variant. The test set for ConfoundedAdult reverses the correlation strengths, which can be thought of as a worst-case audit ensuring that the model does not rely on subgroup membership alone in its predictions. We generate samples for ConfoundedAdult using importance sampling, keeping the original train/test splits from UCI Adult as well as the subgroup sizes, but under/over-sampling individual examples according to importance weights.

We refer the interested reader to Gulrajani & Lopez-Paz (2020) for an extensive discussion of possible model selection strategies. They also provide a large empirical study showing that ERM is a difficult baseline to beat when all methods are put on equal footing w.r.t. model selection.
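The resampling step can be sketched as follows; the subgroup-conditional label rates below are hypothetical placeholders (the actual source/target rates come from Table 4), and the helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source and target label rates P(y=1 | subgroup); the real
# values for ConfoundedAdult come from Table 4 (not reproduced here).
p_src = {"SG1": 0.31, "SG2": 0.11, "SG3": 0.19, "SG4": 0.06}
p_tgt = {"SG1": 0.94, "SG2": 0.06, "SG3": 0.94, "SG4": 0.06}  # exaggerated

# Toy population: subgroup membership and labels drawn from the source rates.
n = 100_000
sg = rng.choice(list(p_src), size=n)
y = (rng.random(n) < np.array([p_src[s] for s in sg])).astype(int)

# Importance weights: w_i = q(y_i | sg_i) / p(y_i | sg_i).
def cond(p, s, yi):
    return p[s] if yi == 1 else 1.0 - p[s]

w = np.array([cond(p_tgt, s, yi) / cond(p_src, s, yi) for s, yi in zip(sg, y)])

# Resample with replacement within each subgroup, preserving subgroup sizes.
idx = []
for s in p_src:
    members = np.flatnonzero(sg == s)
    probs = w[members] / w[members].sum()
    idx.extend(rng.choice(members, size=members.size, p=probs))
idx = np.array(idx)

# Empirical P(y=1 | SG1) in the resampled data approaches the target rate.
p_sg1 = y[idx][sg[idx] == "SG1"].mean()  # ≈ 0.94
```

Resampling within each subgroup is what keeps the subgroup sizes fixed while shifting only the subgroup-conditional label distribution.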
In our case, we use the most relaxed model selection method proposed by Gulrajani & Lopez-Paz (2020), which amounts to allowing each method 20 test evaluations with hyperparameters chosen at random from a reasonable range, selecting the best hyperparameter setting for each method. While none of the methods is given an unfair advantage in the search over hyperparameters, this basic model selection premise does not translate to real-world applications, since information about the test-time distribution is required to select hyperparameters. These results should thus be understood as overly optimistic for each method, although the relative ordering of the methods can still be compared.

CMNIST IRM is trained on these two environments and tested on a holdout environment constructed from 10,000 test images in the same way as the training environments, except that colour is predictive of the noisy label only 10% of the time. Using color as a feature to predict the label will therefore lead to an accuracy of roughly 10% on the test environment, while it yields 80% and 90% accuracy on the respective training environments. To evaluate IRM(e_EIIL) we remove the environment identifier from the training set, leaving a single training set comprising 50,000 images from both original training environments. We then train an MLP with binary cross-entropy loss on the training environments, freeze its weights, and use the obtained model to learn environment splits that maximally violate the IRM penalty. When optimizing the inner loop of EIIL, we use Adam with learning rate 0.001 for 10,000 steps, with full data batches used to compute gradients. The obtained environment partitions are then used to train a new model from scratch with IRM. Following Arjovsky et al. (2019), we allow the representation to train for several hundred annealing steps before applying the IRMv1 penalty.

Census data Following Lahoti et al.
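The environment construction just described can be sketched as follows, assuming the standard CMNIST recipe (binary label from digit group, 25% label noise); exact preprocessing details may differ from the released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env(digits, p_color, noise=0.25):
    """Binary label from digit group, 25% label noise, and a color that
    matches the noisy label with probability p_color."""
    y_clean = (digits >= 5).astype(int)
    flip = rng.random(digits.size) < noise
    y = np.where(flip, 1 - y_clean, y_clean)
    agree = rng.random(digits.size) < p_color
    color = np.where(agree, y, 1 - y)  # 0/1 stand in for the two colors
    return y, color

digits = rng.integers(0, 10, 50_000)
envs = {
    "e1": make_env(digits[:25_000], 0.9),
    "e2": make_env(digits[25_000:], 0.8),
    "test": make_env(rng.integers(0, 10, 10_000), 0.1),
}

# A color-only classifier (predict y = color) tracks the color-label
# correlation of each environment: ~90%/80% on train, ~10% on test.
accs = {name: (y == color).mean() for name, (y, color) in envs.items()}
```

This makes concrete why a color-reliant model collapses at test time: its accuracy is exactly the color-label agreement rate of the environment.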
(2020), we use a two-hidden-layer MLP architecture for all methods, with 64 and 32 hidden units respectively, and a linear adversary for ARL. We optimize all methods using Adagrad; learning rates, numbers of steps, and batch sizes are chosen by the model selection strategy described above (with 20 test evaluations per method), as are the penalty weight for the IRMv1 regularizer and standard weight decay. For the inner loop of EIIL (inferring the environments), we use the same settings as in CMNIST. We find that the performance of EIIL is somewhat sensitive to the number of steps taken with the IRMv1 penalty applied. To limit the number of test queries needed during model selection, we use an early stopping heuristic, enforcing the IRMv1 penalty only during the final 500 steps of training, with the previous steps serving as an annealing period to learn a baseline representation to be regularized. Unlike in the previous datasets, here we use minibatches to compute gradients during IRM training (for consistency with the ARL method, which uses minibatches). However, full-batch gradients are still used for inferring environments in EIIL.

              Causal MSE       Noncausal MSE
ERM           0.827 ± 0.185    0.824 ± 0.013
ICP(e_HC)     1.000 ± 0.000    0.756 ± 0.378
IRM(e_HC)     0.666 ± 0.073    0.644 ± 0.061
IRM(e_EIIL)   0.148 ± 0.185    0.145 ± 0.177

Table 5: IRM using EIIL-discovered environments (e_EIIL) outperforms IRM in a synthetic regression setting without the need for hand-crafted environments (e_HC). This is because the reference representation Φ = Φ_ERM uses the spurious feature for prediction. MSE ± standard deviation across 5 runs reported.

E.1 SYNTHETIC DATA

We begin with a regression setting originally used as a toy dataset for evaluating IRM (Arjovsky et al., 2019). The features x ∈ R^N comprise a "causal" feature v ∈ R^{N/2} concatenated with a "non-causal" feature z ∈ R^{N/2}: x = [v, z]. Noise varies across hand-crafted environments e:

v = ε_v,      ε_v ∼ N(0, 25)
y = v + ε_y,  ε_y ∼ N(0, e²)
z = y + ε_z,  ε_z ∼ N(0, 1).

We evaluated the performance of the following methods:
• ERM: a naive regressor that does not make use of environment labels e, but instead optimizes the average loss on the aggregated environments;
• IRM(e_HC): the method of Arjovsky et al. (2019) using hand-crafted environment labels;
• ICP(e_HC): the method of Peters et al. (2016) using hand-crafted environment labels;
• IRM(e_EIIL): our proposed method, which does not use hand-crafted environment labels, but instead infers useful environments based on the naive ERM solution, then applies IRM to the inferred environments.

The regression methods fit a scalar target ỹ = 1ᵀy via a regression model ŷ ≈ wᵀx, minimizing ||y − ŷ|| w.r.t. w, plus an invariance penalty as needed. The optimal (causally correct) solution is w* = [1, 0], i.e. unit weight on the causal dimensions and zero weight on the non-causal dimensions. Given a solution [ŵ_v, ŵ_z] from one of the methods, we report the mean squared error for the causal and non-causal dimensions as ||ŵ_v − 1||²₂ and ||ŵ_z − 0||²₂ (Table 5). Because v is marginally noisier than z, ERM focuses on the spurious z. IRM using hand-crafted environments, denoted IRM(e_HC), exploits variability in the noise level of the non-causal feature (which depends on the variability of σ_y) to achieve lower error. Using EIILv1 instead of hand-crafted environments improves the resulting IRM solution (denoted IRM(e_EIIL)) by learning worst-case environments for invariant training.
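To see why ERM prefers the spurious z, the sketch below pools two environments with illustrative noise levels e ∈ {1.5, 2.5} (hypothetical values, chosen only for this demonstration) and fits ordinary least squares with scalar v and z (N = 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(e, n):
    v = rng.normal(0, 5, n)      # eps_v ~ N(0, 25), i.e. std 5
    y = v + rng.normal(0, e, n)  # eps_y ~ N(0, e^2)
    z = y + rng.normal(0, 1, n)  # eps_z ~ N(0, 1)
    return np.column_stack([v, z]), y

# Pool data from the two hand-crafted environments, discarding env labels.
Xs, ys = zip(*(sample(e, 50_000) for e in (1.5, 2.5)))
X, y = np.vstack(Xs), np.concatenate(ys)

w_v, w_z = np.linalg.lstsq(X, y, rcond=None)[0]
# ERM puts most of its weight on the spurious z (w_z > w_v), even though
# the causal solution is w = [1, 0]: z = y + small noise is the better
# in-distribution predictor.
```

Analytically, the pooled OLS solution is w_z = σ̄²/(1 + σ̄²) and w_v = 1 − w_z, where σ̄² is the average label-noise variance, so any σ̄² > 1 pushes the majority of the weight onto z.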
We show in a follow-up experiment that the EIILv1 solution is indeed sensitive to the choice of reference representation: it can only discover useful environments (environments that allow IRM(e_EIIL) to learn the correct causal representation) when the reference representation encodes the incorrect inductive bias by focusing on the spurious feature. We explore this dependence of EIILv1 on the mix of spurious and non-spurious features in the reference model by constructing a Φ that varies in the degree to which it focuses on the spurious feature, according to a convex mixing parameter α ∈ [0, 1]: α = 0 indicates focusing entirely on the correct causal feature, while α = 1 indicates focusing entirely on the spurious feature. We refer to this variant as IRM(e_EIIL | Φ = Φ_α-SPURIOUS), and measure its performance as a function of α (Figure 3). Environment inference only yields good test-time performance for high values of α, where the reference model captures the incorrect inductive bias.

We extend the CMNIST results of Table 1 by adding additional methods discussed in Section 3. In Table 7 we measure the performance of some alternative strategies for optimizing the bi-level problem from Equation (EIIL). In particular, we consider alternating updates to the representation Φ and environment assignments q, as well as solving the inner/outer loops of EIIL multiple times. On the CMNIST dataset, none of these variants offers a performance benefit over the method used in Section 4. EIIL_loops=k indicates that the inner and outer loops of the EIIL objective in Equation (EIIL) are successively optimized k times, with k = 1 corresponding to IRM(e_EIIL), the method studied in the main experiments section. In other words, Φ_loops=1 is solved using IRM(e_EIIL), then this representation is used as a reference classifier to find Φ_loops=k+1 = IRM(e_EIIL | Φ = Φ_loops=k) in the next "loop" of computation. This also means that the training time needed is k times the training time of IRM(e_EIIL).
As we expect from our theoretical analysis, using the IRM(e_EIIL) solution as a reference classifier for another round of EIIL is detrimental: since the reference classifier already relies on the correct shape feature, environments that encourage invariance to this feature are found in the second round, so the EIIL_loops=2 classifier uses color rather than shape. EIIL_AltUpdates consists of optimizing Equation 1 using alternating updates to Φ and q. Unfortunately, whereas this strategy works well for other bi-level optimization problems such as GANs, it does poorly in this setting: the method outperforms ERM but does not exceed chance-level predictions on the test set.

E.3 CENSUS DATA

Ablation Here we provide an ablation study extending the results of Section 4.2, demonstrating that both ingredients in the EIILv1 solution, finding worst-case environment splits and regularizing using the IRMv1 penalty, are necessary to achieve good test-time performance on the ConfoundedAdult dataset. From Lahoti et al. (2020) we see that ARL can perform favorably compared with DRO (Hashimoto et al., 2018) by adaptively computing how much each example should contribute to the overall loss, i.e. computing the per-example γ_i in C = E_{x_i,y_i∼p}[γ_i ℓ(Φ(x_i), y_i)]. Because all per-environment risks in IRM are weighted equally (see Equation 2), and each per-environment risk comprises an average over the per-example losses within that environment, each example contributes its loss to the overall objective in inverse proportion to the size of its assigned environment. For example, with two environments e1 and e2 of sizes |e1| and |e2|, we implicitly have per-example weights γ_i = 1/|e1| for i ∈ e1 and γ_i = 1/|e2| for i ∈ e2, indicating that examples in the smaller environment count more towards the overall objective. Because EIILv1 is known to discover worst-case environments of unequal sizes, we measure the performance of EIILv1 using only this reweighting, without the gradient-norm penalty typically used in IRM (i.e. setting λ = 0). To determine the benefit of worst-case environment discovery, we also measure IRM with random assignment of environments. Table 8 shows the results, confirming that both ingredients are required to attain good performance with EIILv1.
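The implicit per-example weighting can be verified directly: an equally weighted mean of per-environment average risks equals a per-example weighted sum with γ_i = 1/|e_k|, up to division by the number of environments. A toy check with made-up losses:

```python
import numpy as np

losses = np.array([0.2, 0.9, 0.4, 0.1, 0.7])
env = np.array([0, 0, 0, 1, 1])  # environment assignment; |e1| = 3, |e2| = 2

# IRM-style objective: equally weighted mean of per-environment average risks.
per_env = [losses[env == k].mean() for k in (0, 1)]
irm_obj = np.mean(per_env)

# Equivalent per-example weighted sum with gamma_i = 1 / |e_k|.
gamma = 1.0 / np.array([np.sum(env == e) for e in env])
weighted = np.sum(gamma * losses) / 2  # divide by the number of environments

assert np.isclose(irm_obj, weighted)  # both equal 0.45 here
```

The example in the smaller environment e2 carries weight 1/2 versus 1/3 for e1 members, which is exactly the reweighting EIILv1 exploits when λ = 0.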

F.1 OPTIMAL SOFT PARTITIONS MAXIMALLY VIOLATE THE INVARIANCE PRINCIPLE

We want to show that finding environment assignments that maximize the violation of the softened version of the regularizer from Equation 3 also maximally violates the invariance principle. Because the invariance principle E[Y | Φ(X), e] = E[Y | Φ(X), e′] ∀ e, e′ is difficult to quantify for continuous Φ(X), we consider a binned version of the representation, with b denoting the discrete index of the bin in representation space. Let q_i ∈ [0, 1] denote the soft assignment of example i to environment 1, and 1 − q_i its converse, the assignment of example i to environment 2. Denote by y_i ∈ {0, 1} the binary target for example i, and by ŷ_i ∈ [0, 1] the model prediction on this example. Assume that ℓ represents a cross-entropy or squared-error loss, so that ∇_w ℓ(ŷ, y) = (ŷ − y)Φ(x).
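The gradient identity can be verified by finite differences for the cross-entropy case, taking Φ(x) as the logit and ŷ = σ(w · Φ(x)), with the scalar classifier weight evaluated at w = 1.0 as in IRMv1:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(w, phi, y):
    """Binary cross-entropy with logit w * phi."""
    p = sigmoid(w * phi)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

w, phi, y = 1.0, 0.7, 1  # IRMv1 evaluates the gradient at w = 1.0
analytic = (sigmoid(w * phi) - y) * phi  # (yhat - y) * Phi(x)
eps = 1e-6
numeric = (loss(w + eps, phi, y) - loss(w - eps, phi, y)) / (2 * eps)

assert abs(analytic - numeric) < 1e-6
```

The same identity holds for squared error on the logit, which is why the binned argument below only needs the (ŷ − y)Φ(x) form.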



Here w ∘ Φ yields a classification decision via a linear weighting of the representation features.

Analogous to REx, Williamson & Menon (2019) adapt Conditional Value at Risk (CVaR) (Rockafellar & Uryasev, 2002) to equalize risk across demographic groups. Kearns et al. (2018) separately utilize boosting to equalize subgroup errors without sensitive attributes. In a variant of the first example where hospitals e ∈ {H1, H2} are known, the given environments could be improved by a method that sorts examples by whether spurious artifacts are present, i.e. by inferring the equipment type. The theoretical analysis of IRM suggests that the more (statistically independent) environments, the better, in terms of generalization guarantees. This suggests that, in settings where these analyses apply, extending EIIL to find more than two environments (with a term promoting diversity amongst the inferred environments) may further help out-of-domain generalization; we leave this for future investigation.

Footnotes: Note that under this parameterization, when optimizing the inner loop with fixed Φ the number of parameters equals the number of data points (which is small relative to standard neural net training); we leave amortization of q to future work. Following the suggestion of Gulrajani & Lopez-Paz (2020), we note that Section 4.2 contains "oracle" results that are overly optimistic for each method (see Appendix D for model selection details). MNIST digits are grouped into {0, 1, 2, 3, 4} and {5, 6, 7, 8, 9} so that the CMNIST target label y is binary. The UCI Adult dataset is available at https://archive.ics.uci.edu/ml/datasets/adult. This setting was previously used in a fairness context by Liu et al. (2019) to measure differing calibration curves across groups.



Figure 1: CMNIST with varying label noise θ_y. Under high label noise (θ_y > 0.2), where the spurious feature color correlates with the label more than shape does on the train data, IRM(e_EIIL) matches or exceeds the performance of IRM(e_HC) on the test set without relying on hand-crafted environments. Under medium label noise (0.1 < θ_y < 0.2), IRM(e_EIIL) is worse than IRM(e_HC) but better than ERM, the logical approach if environments are not available. Under low label noise (θ_y < 0.1), where color is less predictive than shape at train time, ERM performs well and IRM(e_EIIL) fails. GRAYSCALE indicates the oracle solution using shape alone, while Φ_Color uses color alone.

Figure 3: MSE of the causal feature v and non-causal feature z. IRM(e_EIIL) applied to the ERM solution (Black) outperforms IRM based on the hand-crafted environments (Green vs. Blue). To examine the inductive bias of the reference model Φ, we hard-code a model Φ_α-SPURIOUS where α controls the degree of spurious feature representation in the reference classifier; IRM(e_EIIL) outperforms IRM(e_HC) when the reference Φ focuses on the spurious feature, e.g. with Φ as ERM or α-SPURIOUS for high α.

Accuracy on ConfoundedAdult, a variant of the UCI Adult dataset where some sensitive subgroups correlate with the label at train time and this correlation pattern is reversed at test time.

Model selection Krueger et al. (2020) discussed the pitfalls of achieving good test performance on CMNIST by using test data to tune hyperparameters. Because our primary interest is in the properties of the inferred environments rather than the final test performance, we sidestep this issue in the Synthetic Regression and CMNIST experiments by using the default parameters of IRM without further tuning. However, for the ConfoundedAdult dataset a specific model selection strategy is needed, since ConfoundedAdult is a variant of the UCI Adult dataset that emphasizes test-time distribution shift.

Accuracy across ten runs with label noise θ_y = 0.25. GRAYSCALE hard-codes out the color feature and thus represents an oracle solution to CMNIST.

                           Train accs     Test accs
…                          … ± 0.5        67.6 ± 0.7
EIIL_AltUpdates            82.3 ± 0.4     24.0 ± 1.1

We measure performance on CMNIST of various alternative approaches to optimizing the EIIL objective, ultimately concluding that none of the alternatives outperforms the method studied in Section 4. See text for details.

                           Train accs     Test accs
EIILv1                     68.7 ± 1.7     79.8 ± 1.1
EIILv1 (no regularizer)    78.6 ± 2.0     69.2 ± 2.8
IRM (random environments)  94.7 ± 0.1     17.6 ± 1.6

Our ablation study shows that both ingredients of EIILv1 (finding worst-case environments and regularizing invariance across them) are required to achieve good test-time performance on the ConfoundedAdult dataset.


Consider the IRMv1 regularizer with soft environment assignment. For environment 1 the softened per-environment risk is

R̃₁(w ∘ Φ) = (1 / Σ_i q_i) Σ_i q_i ℓ(ŷ_i, y_i),

and analogously for environment 2 with weights 1 − q_i; the regularizer penalizes ||∇_{w|w=1.0} R̃_e||² for each environment. Now consider that the space of Φ(X) is discretized into disjoint bins b over its support, using z_{i,b} ∈ {0, 1} to indicate whether example i falls into bin b according to its mapping Φ(x_i). Thus we have

∇_w R̃₁ = (1 / Σ_i q_i) Σ_b Σ_i z_{i,b} q_i (ŷ_i − y_i) Φ(x_i).

The important point is that within a bin, all examples have roughly the same value of Φ(x_i), and the same value of ŷ_i as well. So, denoting by K_b^{(1)} the relevant constant within-bin summations for environment 1 (and K_b^{(2)} for environment 2), we have the following objective to be maximized by EIIL:

Σ_b K_b^{(1)} ( Σ_i z_{i,b} q_i (ŷ_b − y_i) )² + Σ_b K_b^{(2)} ( Σ_i z_{i,b} (1 − q_i)(ŷ_b − y_i) )².

One way to maximize this is to assign all examples with y_i = 1 to environment 1 (q_i = 1 for these examples) and all examples with y_i = 0 to the other environment (q_i = 0). We can show this is a maximum by considering all examples except the i-th to have been assigned this way, and observing that the objective is then maximized by assigning the i-th example according to the same rule.

Now we want to show that the same assignment maximally violates the invariance principle (i.e., that this soft EIIL solution provides maximal non-invariance). Intuitively, within each bin the difference between E[y | e = 1] and E[y | e = 2] is maximized if one of these expected label distributions is 1 while the other is 0. This is achieved by assigning all the y_i = 1 examples to the first environment and the y_i = 0 examples to the second. Thus a global optimum for the relaxed version of EIIL (using the IRMv1 regularizer) also maximally violates the invariance principle.
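A small numeric illustration of this argument, under simplifying assumptions of our own (a single bin, scalar Φ(x) = 1, and an uncertain reference prediction ŷ = 0.5 for every example): the label-based assignment yields a strictly larger soft IRMv1 penalty than the uniform assignment.

```python
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0])  # binary labels within a single bin
yhat = 0.5                        # reference prediction, constant in the bin
phi = 1.0                         # representation value for the bin (scalar)

def soft_penalty(q):
    """Sum over the two soft environments of ||grad_w risk||^2 at w = 1,
    using grad_w loss = (yhat - y) * phi."""
    g = (yhat - y) * phi
    g1 = np.sum(q * g) / np.sum(q)            # environment 1 gradient
    g2 = np.sum((1 - q) * g) / np.sum(1 - q)  # environment 2 gradient
    return g1**2 + g2**2

q_uniform = np.full_like(y, 0.5, dtype=float)
q_label = y.astype(float)  # env 1 iff y == 1

print(soft_penalty(q_uniform))  # 0.0 (per-environment gradients cancel)
print(soft_penalty(q_label))    # 0.5 (maximal for this setup)
```

Under the uniform split both environments see a balanced mix of labels, so the per-environment gradients vanish; the label-based split pushes E[y | e] to 1 and 0, maximizing both the penalty and the invariance-principle violation.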

