MODELING THE DATA-GENERATING PROCESS IS NECESSARY FOR OUT-OF-DISTRIBUTION GENERALIZATION

Abstract

Recent empirical studies on domain generalization (DG) have shown that DG algorithms that perform well on some distribution shifts fail on others, and no state-of-the-art DG algorithm performs consistently well on all shifts. Moreover, real-world data often has multiple distribution shifts over different attributes; hence we introduce multi-attribute distribution shift datasets and find that the accuracy of existing DG algorithms falls even further. To explain these results, we provide a formal characterization of generalization under multi-attribute shifts using a canonical causal graph. Based on the relationship between spurious attributes and the classification label, we obtain realizations of the canonical causal graph that characterize common distribution shifts and show that each shift entails different independence constraints over observed variables. As a result, we prove that any algorithm based on a single, fixed constraint cannot work well across all shifts, providing theoretical evidence for mixed empirical results on DG algorithms. Based on this insight, we develop Causally Adaptive Constraint Minimization (CACM), an algorithm that uses knowledge about the data-generating process to adaptively identify and apply the correct independence constraints for regularization. Results on fully synthetic, MNIST, small NORB, and Waterbirds datasets, covering binary and multi-valued attributes and labels, show that adaptive dataset-dependent constraints lead to the highest accuracy on unseen domains whereas incorrect constraints fail to do so. Our results demonstrate the importance of modeling the causal relationships inherent in the data-generating process.

1. INTRODUCTION

To perform reliably in real-world settings, machine learning models must be robust to distribution shifts, where the training distribution differs from the test distribution.
Given data from multiple domains that share a common optimal predictor, the domain generalization (DG) task (Wang et al., 2021; Zhou et al., 2021) encapsulates this challenge by evaluating accuracy on an unseen domain. Recent empirical studies of DG algorithms (Wiles et al., 2022; Ye et al., 2022) have characterized different kinds of distribution shifts across domains. Using MNIST as an example, a diversity shift occurs when domains are created by adding new values of a spurious attribute like rotation (e.g., the Rotated-MNIST dataset (Ghifary et al., 2015; Piratla et al., 2020)), whereas a correlation shift occurs when domains exhibit different values of correlation between the class label and a spurious attribute like color (e.g., the Colored-MNIST dataset (Arjovsky et al., 2019)).



Real-world data, however, often exhibits distribution shifts over multiple attributes simultaneously across data distributions. We find that existing DG algorithms, which are often targeted at a specific shift, fail to generalize in such settings: best accuracy falls from 50-62% on the individual-shift MNIST datasets to below 50% (lower than a random guess) on the multi-attribute shift dataset.

To explain such failures, we propose a causal framework for generalization under multi-attribute distribution shifts. We use a canonical causal graph to model commonly observed distribution shifts. Under this graph, we characterize a distribution shift by the type of relationship between spurious attributes and the classification label, leading to different realized causal DAGs. Using d-separation on the realized DAGs, we show that each shift entails distinct constraints over observed variables and prove that no conditional independence constraint is valid across all shifts. As a special case of multi-attribute shifts, when datasets exhibit a single-attribute shift across domains, this result explains the inconsistent performance of DG algorithms reported by Wiles et al. (2022) and Ye et al. (2022). It implies that any algorithm based on a single, fixed independence constraint cannot work well across all shifts: there will be a dataset on which it fails (Section 3.3).

We go on to ask whether we can develop an algorithm that generalizes to different kinds of individual shifts as well as simultaneous multi-attribute shifts. For the common shifts modeled by the canonical graph, we show that identifying the correct regularization constraints requires knowing only the type of relationship between attributes and the label, not the full graph. As we discuss in Section 3.1, the type of shift for an attribute is often available or can be inferred for real-world datasets.
Based on this, we propose Causally Adaptive Constraint Minimization (CACM), an algorithm that leverages knowledge about the data-generating process (DGP) to identify and apply the correct independence constraints for regularization. Given a dataset with auxiliary attributes and their relationship with the target label, CACM constrains the model's representation to obey the conditional independence constraints satisfied by the causal features of the label, generalizing past work on causality-based regularization (Mahajan et al., 2021; Veitch et al., 2021; Makar et al., 2022) to multi-attribute shifts. We evaluate CACM on novel multi-attribute shift datasets based on MNIST, small NORB, and Waterbirds images. Across all datasets, applying the incorrect constraint, often through an existing DG algorithm, leads to significantly lower accuracy than the correct constraint. Further, CACM achieves substantially better accuracy than existing algorithms on datasets with multi-attribute shifts as well as individual shifts. Our contributions include:

• A theoretical result that an algorithm using a fixed independence constraint cannot yield an optimal classifier on all datasets.
• An algorithm, Causally Adaptive Constraint Minimization (CACM), that adaptively derives the correct regularization constraint(s) based on the causal graph and outperforms existing DG algorithms.
• Multi-attribute shift benchmarks for domain generalization on which existing algorithms fail.

2. GENERALIZATION UNDER MULTI-ATTRIBUTE SHIFTS

We consider the supervised learning setup from Wiles et al. (2022) where each row of training data (x_i, a_i, y_i)_{i=1}^n contains input features x_i (e.g., X-ray pixels), a set of nuisance or spurious attributes a_i (e.g., vertical shift, hospital), and a class label y_i (e.g., disease diagnosis). The attributes represent variables that are often recorded or implicit in data collection procedures. Some attributes represent a property of the input (e.g., vertical shift) while others represent the domain from which the input was collected (e.g., hospital). The attributes affect the observed input features x_i but do not cause the target label, and hence are spurious attributes. The final classifier g(x) is expected to use only the input features. However, as new values of attributes are introduced or as the correlation of attributes with the label changes, we obtain different conditional distributions P(Y|X). Given a set of data distributions P, we assume that the training data is sampled from a subset of distributions, P_{E_tr} = {P_{E_1}, P_{E_2}, ...} ⊂ P, while the test data is sampled from a single unseen distribution, P_{E_te} ∈ P. Attributes and class labels are assumed to be discrete.
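As a concrete illustration of this setup, the following sketch (our own toy example, not one of the paper's benchmarks) samples (x_i, a_i, y_i) from a DGP in which a latent causal feature drives the label, a spurious attribute agrees with the label at a strength p_corr that varies across distributions, and both affect the observed features:

```python
import numpy as np

def sample_dataset(n, p_corr, rng):
    """Toy instance of the setup above (our own example): latent causal
    feature X_c drives the label; a spurious binary attribute A agrees
    with Y with probability p_corr (which varies across distributions);
    X mixes X_c with the attribute's effect, but A never causes Y."""
    xc = rng.normal(size=n)
    y = (xc + 0.5 * rng.normal(size=n) > 0).astype(int)        # X_c -> Y
    a = np.where(rng.random(n) < p_corr, y, 1 - y)             # P(A|Y) shifts
    x = np.stack([xc, a + 0.1 * rng.normal(size=n)], axis=1)   # X_c, A -> X
    return x, a, y
```

Sampling several datasets while varying p_corr mimics the change in P(A|Y) that distinguishes the distributions in P.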

2.1. RISK INVARIANT PREDICTOR FOR GENERALIZATION UNDER SHIFTS

The goal is to learn a classifier g(x) using the training domains such that it generalizes, achieving a similar, small risk on test data from the unseen P_{E_te} as on the training data. Formally, given a set of distributions P, we define a risk-invariant predictor (Makar et al., 2022) as follows.

Definition 2.1 (Optimal Risk-Invariant Predictor for P; from Makar et al. (2022)). Define the risk of a predictor g on distribution P ∈ P as R_P(g) = E_{x,y∼P} ℓ(g(x), y), where ℓ is cross-entropy or another classification loss. The risk-invariant predictors obtain the same risk across all distributions in P, and the optimal risk-invariant predictors are the risk-invariant predictors that obtain minimum risk on all distributions:

g_rinv ∈ argmin_{g ∈ G_rinv} R_P(g)  ∀ P ∈ P,  where  G_rinv = {g : R_P(g) = R_{P'}(g) ∀ P, P' ∈ P}.   (1)

An intuitive way to obtain a risk-invariant predictor is to use only the parts of the input features X that cause the label Y and ignore any variation due to the spurious attributes. Let such latent, unobserved causal features be X_c. Due to the independence and stability of causal mechanisms (Peters et al., 2017), we can assume that P(Y|X_c) remains invariant across distributions. Using the notion of risk invariance, we can now define the multi-attribute generalization problem.

Definition 2.2 (Generalization under multi-attribute shifts). Given a target label Y, input features X, attributes A, and latent causal features X_c, consider a set of distributions P such that P(Y|X_c) remains invariant while P(A|Y) changes across individual distributions. Using a training dataset (x_i, a_i, y_i)_{i=1}^n sampled from a subset of distributions P_{E_tr} ⊂ P, the generalization goal is to learn an optimal risk-invariant predictor over P.

Special case of single-attribute shift. When |A| = 1, we obtain the widely studied single-attribute shift problem (Wiles et al., 2022; Ye et al., 2022; Gulrajani & Lopez-Paz, 2021).
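Definition 2.1 can be checked empirically by computing R_P(g) on each domain; a minimal sketch (our own, for binary labels, with `predict_proba` as an assumed interface):

```python
import numpy as np

def domain_risks(predict_proba, domains):
    """Per-domain cross-entropy risk R_P(g) for a binary predictor; a
    predictor is risk-invariant (Definition 2.1) when these values agree
    across domains. `domains` maps a name to (x, y) arrays; predict_proba(x)
    returns P(y = 1 | x). The interface is our own illustrative convention."""
    risks = {}
    for name, (x, y) in domains.items():
        p = np.clip(predict_proba(x), 1e-12, 1 - 1e-12)
        risks[name] = float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())
    return risks
```

For instance, the constant predictor p = 0.5 attains risk ln 2 in every domain, so it is risk-invariant but clearly not optimal; this is why Definition 2.1 additionally requires minimum risk.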

2.2. A GENERAL PRINCIPLE FOR NECESSARY CONDITIONAL INDEPENDENCE CONSTRAINTS

Table 1: Independence constraints and the statistics enforced by popular DG algorithms (ℓ_d denotes a domain-classification loss on top of a network h).

Constraint      Statistic enforced                    Algorithm
ϕ ⊥ E           match E[ϕ(x)|E] ∀E                    MMD
                max_E E[ℓ_d(h(ϕ(x)), E)]              DANN
                match Cov[ϕ(x)|E] ∀E                  CORAL
Y ⊥ E | ϕ       match E[Y|ϕ(x), E] ∀E                 IRM
                match Var[ℓ(g(x), y)|E] ∀E            VREx
ϕ ⊥ E | Y       match E[ϕ(x)|E, Y = y] ∀E             C-MMD
                max_E E[ℓ_d(h(ϕ(x)), E)|Y = y]        CDANN

In practice, the causal features X_c are unobserved and a key challenge is to learn X_c using the observed (X, Y, A). We focus on representation-learning-based DG algorithms (Wang et al., 2021), typically characterized by a regularization constraint that is added to a standard ERM loss such as cross-entropy. Table 1 shows three independence constraints that form the basis of many popular DG algorithms, treating the environment/domain E as the attribute and ℓ as the main classifier loss.

We now provide a general principle for deciding which constraints to choose for learning a risk-invariant predictor on a dataset. Following past work (Mahajan et al., 2021; Veitch et al., 2021), we use the graph structure of the underlying data-generating process (DGP). We assume that the predictor can be written as g(x) = g_1(ϕ(x)), where ϕ is the representation. To learn a risk-invariant ϕ, we identify the conditional independence constraints satisfied by the causal features X_c in the causal graph and enforce that the learnt representation ϕ follows the same constraints. If ϕ satisfies the constraints, then any function g_1(ϕ) will also satisfy them. Below we show that the constraints are necessary under simple assumptions on the causal DAG representing the DGP for a dataset. All proofs are in Suppl. B.

Theorem 2.1. Consider a causal DAG G over ⟨X_c, X, A, Y⟩ and a corresponding generated dataset (x_i, a_i, y_i)_{i=1}^n, where X_c is unobserved. Assume that G has the following property: X_c is the set of all parents of Y (X_c → Y), and X_c and A together cause X (X_c → X and A → X). The graph may have any other edges (see, e.g., the DAG in Figure 1(b)). Let P_G be the set of distributions consistent with G, obtained by changing P(A|Y) but not P(Y|X_c). Then the conditional independence constraints satisfied by X_c are necessary for a (cross-entropy) risk-invariant predictor over P_G. That is, if a predictor for Y does not satisfy one of these constraints, then there exists a distribution P' ∈ P_G on which the predictor's risk is higher than its risk on the other distributions.

Thus, given a causal DAG, using d-separation between X_c and the observed variables, we can derive the correct regularization constraints to apply to ϕ. This yields a general principle for learning a risk-invariant predictor. We use it to theoretically explain the inconsistent results of existing DG algorithms (Sec. 3.3) and to propose an out-of-distribution generalization algorithm, CACM (Sec. 4). Note that the constraints are necessary but not sufficient, as X_c is not identifiable.

3.1. CANONICAL CAUSAL GRAPH FOR COMMON DISTRIBUTION SHIFTS

We consider a canonical causal graph (Figure 2) to specify the common data-generating processes that can lead to a multi-attribute shift dataset. Shaded nodes represent the observed variables X, Y, and the attribute sets A_corr, A_ind, and E such that A_corr ∪ A_ind ∪ {E} = A. A_corr represents the attributes correlated with the label, A_ind the attributes that are independent of the label, while E is a special attribute denoting the domain/environment from which a data point was collected. Not all attributes need to be observed: in some cases, only E and a subset of A_corr, A_ind may be observed; in others, only A_corr and A_ind may be observed while E is not available. Regardless, we assume that all attributes, along with the causal features X_c, determine the observed features X, and that X_c are the only features that cause Y. In the simplest case, we assume no label shift across environments, i.e., the marginal distribution of Y is constant across train domains and test, P_{E_tr}(y) = P_{E_te}(y) (Figure 2a). More generally, different domains may have different distributions of causal features (in the X-ray example, more women visit one hospital), as shown by the E ↔ X_c edge (Figure 2b).

Under the canonical graph, we characterize different kinds of shifts based on the relationship between the spurious attributes A and the classification label Y. Specifically, A_ind is independent of the class label, and a change in P(A_ind) leads to an Independent distribution shift. For A_corr, the correlation with Y can arise through a Causal, Confounded, or Selected relationship. While the canonical graph in Figure 2a is general, resolving each dashed edge into a specific type of shift (causal mechanism) leads to a realized causal DAG for a particular dataset. As we shall see, knowledge of these shift types is sufficient to determine the correct independence constraints between observed variables. Our canonical multi-attribute graph generalizes the DG graph from Mahajan et al. (2021), which considered an Independent domain/environment as the only attribute.

Under the special case of a single attribute (|A| = 1), the canonical graph helps interpret the validity of popular DG methods for a dataset by considering the type of attribute-label relationship in the data-generating process. For example, consider two common constraints from prior work on independence between ϕ and a spurious attribute: unconditional (ϕ(x) ⊥ A) (Veitch et al., 2021; Albuquerque et al., 2020; Ganin et al., 2016) or conditional on the label (ϕ(x) ⊥ A|Y) (Ghifary et al., 2016; Hu et al., 2019; Li et al., 2018c;d). Under the canonical graph in Figure 2a, the unconditional constraint is true when A ⊥ Y (A ∈ A_ind) but not always for A_corr (it holds only under Confounded shift). If the relationship is Causal or Selected, the conditional constraint is correct. Critically, as Veitch et al. (2021) show for a single-attribute graph, the conditional constraint is not always better: it is an incorrect constraint (not satisfied by X_c) under the Confounded setting. Further, under the canonical graph with the E-X_c edge (Figure 2b), none of these constraints are valid due to a correlation path between X_c and E. This shows the importance of considering the generating process for a dataset.

Inferring the attribute-label relationship type. Whether an attribute belongs to A_ind or A_corr can be learned from data (since A_ind ⊥ Y). Under some special conditions with the graph in Figure 2a (assuming all attributes are observed and all attributes in A_corr are of the same type), we can also identify the type of A_corr shift: Y ⊥ E | A_corr implies Selected; if not, then X ⊥ E | A_corr, A_ind, Y implies Causal; otherwise it is Confounded. In the general case of Figure 2b, however, it is not possible to differentiate between A_cause, A_conf, and A_sel using observed data, and manual input is needed. Fortunately, unlike the full causal graph, the type of relationship between the label and an attribute is easier to obtain. For example, in text toxicity classification, toxicity labels are found to be spuriously correlated with certain demographics (A_corr) (Dixon et al., 2018; Koh et al., 2021; Park et al., 2018); in medical applications where data is collected from a small number of hospitals, shifts arise due to different methods of slide staining and image acquisition (A_ind) (Koh et al., 2021; Komura & Ishikawa, 2018; Tellez et al., 2019). Suppl. A contains additional real-world examples with attributes.
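The data-driven part of this check, i.e., whether an attribute belongs to A_ind (A ⊥ Y), can be sketched with a simple contingency-table statistic. This is an illustrative heuristic using Cramér's V with an ad hoc threshold, not the paper's procedure; a calibrated chi-squared test would be the standard choice:

```python
import numpy as np

def looks_independent(a, y, threshold=0.1):
    """Heuristic check that attribute a is independent of label y, via
    Cramér's V on the empirical contingency table (illustrative only;
    the 0.1 threshold is arbitrary, not calibrated)."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    table = np.zeros((len(a_vals), len(y_vals)))
    np.add.at(table, (a_idx, y_idx), 1)          # counts of (a, y) pairs
    n = table.sum()
    expected = np.outer(table.sum(1), table.sum(0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))  # Cramér's V in [0, 1]
    return v < threshold
```

On synthetic data, an attribute drawn independently of the label passes the check, while a strongly label-correlated attribute fails it.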

3.2. INDEPENDENCE CONSTRAINTS DEPEND ON ATTRIBUTE↔LABEL RELATIONSHIP

We list the independence constraints between ⟨X_c, A, Y⟩ under the canonical graphs from Figure 2, which can be used to derive the correct regularization constraints to apply to ϕ (Theorem 2.1).

Proposition 3.1. Given a causal DAG realized by specifying the target-attribute relationships in Figure 2a, the correct constraint depends on the relationship of the label Y with the attributes A. As shown, A can be split into A_ind, A_corr, and E, where A_corr can be further split into subsets that have a causal (A_cause), confounded (A_conf), or selected (A_sel) relationship with Y (A_corr = A_cause ∪ A_conf ∪ A_sel). Then the (conditional) independence constraints that X_c should satisfy are:

1. Independent: X_c ⊥ A_ind; X_c ⊥ E; X_c ⊥ A_ind | Y; X_c ⊥ A_ind | E; X_c ⊥ A_ind | Y, E
2. Causal: X_c ⊥ A_cause | Y; X_c ⊥ E; X_c ⊥ A_cause | Y, E
3. Confounded: X_c ⊥ A_conf; X_c ⊥ E; X_c ⊥ A_conf | E
4. Selected: X_c ⊥ A_sel | Y; X_c ⊥ A_sel | Y, E

Corollary 3.1. All the constraints derived above are valid for Graph 2a. However, in the presence of a correlation between E and X_c (Graph 2b), only the constraints conditioned on E hold.

Corollary 3.1 implies that if we are unsure about an E-X_c correlation, the E-conditioned constraints should be used. By considering independence constraints over attributes that may represent any observed variable, our graph-based characterization unifies the single-domain (group-wise) (Sagawa et al., 2020) and multi-domain generalization tasks. Whether attributes represent auxiliary attributes, group indicators, or data sources, Proposition 3.1 provides the correct regularization constraint.
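Proposition 3.1 and Corollary 3.1 translate directly into a lookup table from relationship type to conditioning sets. A sketch, in our own encoding: each entry is a set Z such that X_c ⊥ A | Z should hold, with frozenset() denoting the unconditional constraint (the separate X_c ⊥ E constraints are omitted for brevity):

```python
# Conditioning sets Z with X_c ⊥ A | Z valid, per Proposition 3.1
# (our own encoding; X_c ⊥ E constraints omitted for brevity).
CONSTRAINTS = {
    "independent": [frozenset(), frozenset({"Y"}), frozenset({"E"}),
                    frozenset({"Y", "E"})],
    "causal":      [frozenset({"Y"}), frozenset({"Y", "E"})],
    "confounded":  [frozenset(), frozenset({"E"})],
    "selected":    [frozenset({"Y"}), frozenset({"Y", "E"})],
}

def valid_constraints(rel_type, e_xc_correlated=False):
    """Per Corollary 3.1, keep only E-conditioned constraints when E and
    X_c may be correlated (the Figure 2b graph)."""
    cs = CONSTRAINTS[rel_type]
    return [z for z in cs if "E" in z] if e_xc_correlated else cs
```

For example, `valid_constraints("selected", e_xc_correlated=True)` keeps only X_c ⊥ A_sel | Y, E, matching the corollary.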

3.3. A FIXED CONDITIONAL INDEPENDENCE CONSTRAINT CANNOT WORK FOR ALL SHIFTS

Combined with Theorem 2.1, Proposition 3.1 shows that the necessary constraints on a risk-invariant predictor's representation ϕ(X) differ across types of attributes. This leads to our key result: under multi-attribute shifts, a single (conditional) independence constraint cannot be valid for all kinds of shifts. Remarkably, this result holds even for single-attribute shifts: any algorithm with a fixed conditional independence constraint (e.g., those listed in Table 1 (Gretton et al., 2012; Arjovsky et al., 2019; Li et al., 2018b; Sun & Saenko, 2016)) cannot work for all datasets.

Theorem 3.1. Under the canonical causal graph in Figure 2(a,b), there exists no (conditional) independence constraint over ⟨X_c, A, Y⟩ that is valid for all realized DAGs as the type of multi-attribute shift varies. Hence, for any prediction algorithm for Y that uses a single (conditional) independence constraint over its representation ϕ(X), A, and Y, there exists a realized DAG G and a corresponding training dataset such that the learned predictor cannot be a risk-invariant predictor for the distributions in P_G, where P_G is the set of distributions obtained by changing P(A|Y).

Corollary 3.2. Even when |A| = 1, an algorithm using a single independence constraint over ⟨ϕ(X), A, Y⟩ cannot yield a risk-invariant predictor for all kinds of single-attribute shift datasets.

Corollary 3.2 adds theoretical evidence for past empirical demonstrations of the inconsistent performance of DG algorithms (Wiles et al., 2022; Ye et al., 2022). To demonstrate its significance, we provide OoD generalization results on a simple "slab" setup (Shah et al., 2020) with three datasets (Causal, Confounded, and Selected shifts) in Suppl. E.2. We evaluate two constraints motivated by the DG literature (Mahajan et al., 2021): unconditional X_c ⊥ A | E, and conditional on the label X_c ⊥ A | Y, E. As predicted by Corollary 3.2, neither constraint obtains the best accuracy on all three datasets (Table 6).

4. CAUSALLY ADAPTIVE CONSTRAINT MINIMIZATION (CACM)

Motivated by Sec. 3, we present CACM, an algorithm that adaptively chooses regularizing constraints for multi-attribute shift datasets (the full algorithm for any general DAG is in Suppl. C). It has two phases.

Phase I. Derive correct independence constraints. If a dataset's DGP satisfies the canonical graph, CACM requires a user to specify the relationship type for each attribute and uses the constraints from Proposition 3.1. For other datasets, CACM requires a causal graph describing the dataset's DGP and derives the independence constraints with the following steps. Let V be the set of observed variables in the graph except Y, and C the list of constraints.

1. For each observed variable V ∈ V, check whether (X_c, V) are d-separated. If so, add X_c ⊥ V to C.
2. If not, check whether (X_c, V) are d-separated conditioned on some subset Z of the remaining observed variables, Z ⊆ {Y} ∪ V \ {V}. For each subset Z with d-separation, add X_c ⊥ V | Z to C.

Phase II. Apply regularization penalty using derived constraints. In Phase II, CACM applies these constraints as a regularizer added to the standard ERM loss,

g_1, ϕ = argmin_{g_1, ϕ} ℓ(g_1(ϕ(x)), y) + RegPenalty,

where ℓ is the cross-entropy loss. The regularizer optimizes for the valid constraints over all observed variables V ∈ V. Below we provide the regularizer term for datasets following the canonical graphs from Figure 2 (V = A). We choose Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) to apply the penalty; in principle, any metric for conditional independence would work. Since A includes multiple attributes, the regularizer penalty depends on the type of distribution shift for each attribute. For instance, for A ∈ A_ind (Independent), to enforce ϕ(x) ⊥ A, we minimize the distributional discrepancy between P(ϕ(x)|A = a_i) and P(ϕ(x)|A = a_j) for all values i, j of A. However, since the same constraint applies to E, it is statistically more efficient to apply the constraint on E (if available), as a domain may contain multiple closely related values of A (e.g., slide stains collected from one hospital may be spread over similar colors, but not exactly the same). Hence, we apply the constraint on the distributions P(ϕ(x)|E = E_i) and P(ϕ(x)|E = E_j) if E is observed (whether or not A is), and otherwise apply the constraint over A.
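The two Phase I steps can be sketched for a general DAG as follows; the graph encoding (a parents dict) and the moralization-based d-separation test are our own illustrative choices:

```python
from itertools import chain, combinations

def d_separated(parents, x, y, z):
    """Test whether nodes x and y are d-separated given set z in a DAG.
    `parents` maps each node to its set of parents. Classic criterion:
    restrict to ancestors of {x, y} and z, moralize (marry co-parents,
    drop directions), delete z, and check that x and y are disconnected."""
    relevant, stack = set(), [x, y, *z]        # ancestral subgraph
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, ()))
    adj = {n: set() for n in relevant}         # moralized undirected graph
    for n in relevant:
        ps = parents.get(n, set()) & relevant
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for p, q in combinations(sorted(ps), 2):
            adj[p].add(q); adj[q].add(p)       # marry co-parents
    for n in z:                                # delete conditioning set
        adj.pop(n, None)
    seen, stack = set(), [x]                   # reachability x -> y
    while stack:
        n = stack.pop()
        if n == y:
            return False                       # connected: not d-separated
        if n not in seen:
            seen.add(n)
            stack.extend(adj.get(n, ()) - set(z))
    return True

def derive_constraints(parents, observed, xc="Xc"):
    """CACM Phase I, steps 1-2, for a general DAG: for each observed
    variable V (label Y excluded), collect every conditioning set Z drawn
    from {Y} and the remaining observed variables with X_c d-separated
    from V given Z."""
    constraints = []
    for v in observed:
        others = ({"Y"} | set(observed)) - {v}
        subsets = chain.from_iterable(
            combinations(sorted(others), r) for r in range(len(others) + 1))
        for z in subsets:
            if d_separated(parents, xc, v, set(z)):
                constraints.append((v, frozenset(z)))
    return constraints
```

On a Causal-shift DAG with X_c → Y → A and X_c, A → X, the loop recovers exactly the constraint X_c ⊥ A | Y from Proposition 3.1.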

Penalty_{A_ind} = Σ_{i=1}^{|A_ind|} Σ_{j>i} MMD(P(ϕ(x) | a_{i,ind}), P(ϕ(x) | a_{j,ind}))

For A ∈ A_cause (Causal), following Proposition 3.1, we consider the distributions P(ϕ(x)|A = a_i, Y = y) and P(ϕ(x)|A = a_j, Y = y). We additionally condition on E since there may be a correlation between E and X_c (Figure 2b), which renders the other constraints incorrect (Corollary 3.1):

Penalty_{A_cause} = Σ_{e∈E} Σ_{y∈Y} Σ_{i=1}^{|A_cause|} Σ_{j>i} MMD(P(ϕ(x) | a_{i,cause}, y, e), P(ϕ(x) | a_{j,cause}, y, e))

We similarly obtain regularization terms for Confounded and Selected shifts (Suppl. C). The final penalty is a sum over all attributes, RegPenalty = Σ_{A∈A} λ_A Penalty_A, where the λ_A are hyperparameters. Unlike prior work (Makar et al., 2022; Veitch et al., 2021), we do not restrict ourselves to binary-valued attributes and classes.

CACM's relationship with existing DG algorithms. Table 1 shows common constraints used by popular DG algorithms. CACM's strength lies in adaptively selecting constraints based on the causal relationships in the DGP. Thus, depending on the dataset, applying CACM to a single-attribute shift may amount to applying the same constraint as the MMD or C-MMD algorithms. For example, on the Rotated-MNIST dataset with E = A_ind = rotation, the effective constraint of the MMD, DANN, and CORAL algorithms (ϕ ⊥ E) is the same as CACM's constraint ϕ ⊥ A_ind for an Independent shift.
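A numpy sketch of these penalties (RBF-kernel, biased MMD estimate; in the experiments the penalty is applied to logits via PyTorch, so treat this as illustrative):

```python
import numpy as np

def mmd2(p, q, gamma=1.0):
    """Squared MMD between samples p, q (rows = representations) with an
    RBF kernel; biased V-statistic estimate, kept simple for exposition."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(p, p).mean() + k(q, q).mean() - 2.0 * k(p, q).mean()

def penalty_a_ind(phi, a):
    """Penalty_{A_ind}: pairwise MMDs between P(phi | a_i) and P(phi | a_j)."""
    vals = np.unique(a)
    return sum(mmd2(phi[a == vi], phi[a == vj])
               for i, vi in enumerate(vals) for vj in vals[i + 1:])

def penalty_a_cause(phi, a, y, e):
    """Penalty_{A_cause}: pairwise MMDs between P(phi | a_i, y, e) and
    P(phi | a_j, y, e), conditioning on both label and environment."""
    total = 0.0
    for ev in np.unique(e):
        for yv in np.unique(y):
            m = (e == ev) & (y == yv)
            vals = np.unique(a[m])
            for i, vi in enumerate(vals):
                for vj in vals[i + 1:]:
                    total += mmd2(phi[m & (a == vi)], phi[m & (a == vj)])
    return total
```

Shifting the representation by the attribute value inflates the penalty, which is what the regularizer pushes against during training.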

5. EMPIRICAL EVALUATION

We perform experiments on MNIST, small NORB, and Waterbirds datasets to demonstrate our main claims: existing DG algorithms perform worse on multi-attribute shifts; CACM with the correct graph-based constraints significantly outperforms these algorithms; and incorrect constraints cannot match this accuracy. While Proposition 3.1 provides constraints for all shifts, our empirical experiments focus on the commonly occurring Causal and Independent shifts. All experiments are performed in PyTorch 1.10 with NVIDIA Tesla P40 and P100 GPUs, building on DomainBed (Gulrajani & Lopez-Paz, 2021) and OoD-Bench (Ye et al., 2022). Regularizing the model's logit scores provides better accuracy than regularizing ϕ(x); hence we adopt it in all experiments.

5.1. DATASETS & BASELINE DG ALGORITHMS

We introduce three new datasets for the multi-attribute shift problem. For all datasets, details of environments, architectures, visualizations, and setup generation are in Suppl. D.1.

MNIST. Colored (Arjovsky et al., 2019) and Rotated (Ghifary et al., 2015) MNIST present Causal (A_cause = color) and Independent (A_ind = rotation) distribution shifts, respectively. We combine these to obtain a multi-attribute dataset with A_cause and A_ind (col+rot). For comparison, we also evaluate on the single-attribute A_cause (Colored) and A_ind (Rotated) MNIST datasets.

small NORB (LeCun et al., 2004). This dataset was used by Wiles et al. (2022) to create a challenging DG task with single-attribute shifts, having multi-valued classes and attributes over realistic 3D objects. We create a multi-attribute shift dataset (light+azi) consisting of a causal connection A_cause = lighting between lighting and object category Y, and A_ind = azimuth, which varies independently across domains. We also evaluate on the single-attribute A_cause (lighting) and A_ind (azimuth) datasets.

Waterbirds. We use the original dataset (Sagawa et al., 2020), where bird type (water or land) (Y) is spuriously correlated with the background (A_cause). To create a multi-attribute setup, we add different weather effects (A_ind) to train and test data with probability p = 0.5 and 1.0, respectively.

Baseline DG algorithms & implementation. We consider baseline algorithms optimizing different constraints and statistics to compare against causally adaptive regularization: IRM (Arjovsky et al., 2019), IB-ERM and IB-IRM (Ahuja et al., 2021), VREx (Krueger et al., 2021), MMD (Li et al., 2018b), CORAL (Sun & Saenko, 2016), DANN (Ganin et al., 2016), Conditional-MMD (C-MMD) (Li et al., 2018b), Conditional-DANN (CDANN) (Li et al., 2018d), GroupDRO (Sagawa et al., 2020), Mixup (Yan et al., 2020), MLDG (Li et al., 2018a), SagNet (Nam et al., 2021), and RSC (Huang et al., 2020).
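The attribute-assignment recipe behind the multi-attribute MNIST setup can be sketched as follows (illustrative only: binary labels/colors and made-up per-environment rotation angles, not the benchmark's exact settings; the real datasets then color and rotate the images accordingly):

```python
import numpy as np

def sample_attributes(labels, env, corr_cause, rng):
    """col+rot recipe, sketched: a causal attribute (color) agrees with the
    binarized label at an environment-specific rate, while an independent
    attribute (rotation) is drawn from a per-environment support. Angles
    and correlation values are illustrative choices of ours."""
    agree = rng.random(len(labels)) < corr_cause[env]      # env-varying P(A|Y)
    color = np.where(agree, labels % 2, 1 - labels % 2)    # A_cause
    angles = {0: [0, 15], 1: [30, 45], 2: [60, 75]}        # A_ind support per env
    rotation = rng.choice(angles[env], size=len(labels))
    return color, rotation
```

Varying `corr_cause` across environments produces the Causal shift, while the disjoint angle supports produce the Independent shift.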
Following DomainBed (Gulrajani & Lopez-Paz, 2021) , a random search is performed 20 times over the hyperparameter distribution for 3 seeds. The best models obtained across the three seeds are used to compute the mean and standard error. We use a validation set that follows the test domain distribution consistent with previous work on these datasets (Arjovsky et al., 2019; Sagawa et al., 2020; Wiles et al., 2022; Ye et al., 2022) . Further details are in Suppl. D. 
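A minimal sketch of this selection protocol, with a hypothetical `train_and_eval` callback and illustrative hyperparameter ranges (not DomainBed's exact distributions):

```python
import random

def random_search(train_and_eval, n_trials=20, seeds=(0, 1, 2), rng=None):
    """DomainBed-style sweep (sketch): sample hyperparameters n_trials
    times per seed, keep each seed's best validation score, then report
    mean and standard error across seeds. train_and_eval(hp, seed) is a
    hypothetical callback returning a validation accuracy."""
    rng = rng or random.Random(0)
    best = []
    for seed in seeds:
        scores = [train_and_eval({"lr": 10 ** rng.uniform(-5, -3),
                                  "lambda": 10 ** rng.uniform(-1, 2)}, seed)
                  for _ in range(n_trials)]
        best.append(max(scores))                  # best model per seed
    mean = sum(best) / len(best)
    var = sum((b - mean) ** 2 for b in best) / (len(best) - 1)
    return mean, (var / len(best)) ** 0.5          # mean, standard error
```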

5.2. RESULTS

Correct constraint derived from the causal graph matters. Table 2 shows the test-domain accuracy on all datasets. Comparing the three prediction tasks for MNIST and small NORB, for all algorithms, accuracy on the unseen test domain is highest under the A_ind shift and lowest under the two-attribute shift (A_ind ∪ A_cause), reflecting the difficulty of a multi-attribute distribution shift. On the two-attribute MNIST task, all DG algorithms obtain less than 50% accuracy whereas CACM obtains a 5% absolute improvement. Results on small NORB are similar: CACM obtains 69.6% accuracy on the two-attribute task while the nearest baseline, MLDG, achieves 64.2%. On both MNIST and small NORB, CACM also obtains the highest accuracy on the A_cause task. On MNIST, even though IRM and VREx were originally evaluated on the color-only (A_cause) task, under an extensive hyperparameter sweep as recommended in past work (Gulrajani & Lopez-Paz, 2021; Krueger et al., 2021; Ye et al., 2022), we find that CACM achieves substantially higher accuracy (70%) than these methods, just 5 points below the optimal 75%. While the A_ind task is relatively easier, algorithms optimizing the correct constraint achieve the highest accuracy. Note that MMD, CORAL, DANN, and CACM are based on the same independence constraint here (see Table 1). As mentioned in Section 4, we use the domain attribute E for CACM's regularization constraint on the A_ind task, for full comparability with other algorithms that also use E. The results indicate the importance of adaptive regularization for generalization.

Table 2 also shows the OoD accuracy of algorithms on the original Waterbirds dataset (Sagawa et al., 2020) and its multi-attribute shift variant. Here we evaluate worst-group accuracy, consistent with past work (Sagawa et al., 2020; Yao et al., 2022). We observe that on the single-attribute (A_cause) as well as the multi-attribute shift, CACM significantly outperforms the baselines (∼6% absolute improvement).

Incorrect constraints hurt generalization. We now directly compare the effect of using correct versus incorrect (but commonly used) constraints for a dataset. To isolate the effect of a single constraint, we first consider the single-attribute shift on A_cause and compare the application of different regularizer constraints. Proposition 3.1 provides the correct constraint for A_cause: X_c ⊥ A_cause | Y, E. In addition, using d-separation on the Causal-realized DAG from Figure 2, we see that the following constraints are invalid: X_c ⊥ A_cause | E and X_c ⊥ A_cause. Without knowing that the DGP corresponds to a Causal shift, one may apply these constraints, which do not condition on the class. Results on small NORB (Table 3) show that using the incorrect constraint has an adverse effect: the correct constraint yields 85% accuracy while the best incorrect constraint achieves 79.7%. Moreover, unlike the correct constraint, the incorrect constraint is sensitive to the regularization weight λ: as λ increases, accuracy drops below 40% (Suppl. E.3, Figure 7).

Comparing these constraints on small NORB and MNIST (Table 4) reveals the importance of making the right structural assumptions. Typically, DG algorithms assume that the distribution of causal features X_c does not change across domains (as in the graph in Fig. 2a). Then both X_c ⊥ A_cause | Y, E and X_c ⊥ A_cause | Y should be correct constraints. However, conditioning on both Y and E provides a 5-point gain over conditioning on Y alone on small NORB, while the accuracy is comparable on MNIST. Information about the data-generating process explains this result: different domains in MNIST include samples from the same distribution, whereas small NORB domains are sampled from different sets of toy objects, creating a correlation between X_c and E that corresponds to the graph in Fig. 2b. Without information about the correct DGP, such gains are difficult to obtain.

Finally, we replicate the above experiment in the multi-attribute shift setting for small NORB. To construct an incorrect constraint, we interchange the variables before inputting them to CACM (A_ind is used as A_cause and vice versa). Accuracy with the interchanged variables (65.1 ± 1.6) is lower than that of correct CACM (69.6 ± 1.6). Additional ablations, in which baseline DG algorithms are provided CACM-like attributes as environments, are in Suppl. E.4.

6. RELATED WORK

Improving the robustness of models in the face of distribution shifts is a key challenge. Several works have tackled the domain generalization problem (Wang et al., 2021; Zhou et al., 2021) using different approaches, with data augmentation (Cubuk et al., 2020; He et al., 2016; Zhu et al., 2017) and representation learning (Arjovsky et al., 2019; Deng et al., 2009; Higgins et al., 2017) being popular ones. Gauging the progress made by these approaches, Gulrajani & Lopez-Paz (2021) find that existing state-of-the-art DG algorithms do not improve over ERM. More recent work (Wiles et al., 2022; Ye et al., 2022) uses datasets with different single-attribute shifts and empirically shows that different algorithms perform well on different distribution shifts, but no single algorithm performs consistently across all. We provide (1) multi-attribute shift benchmark datasets; (2) a causal interpretation of different kinds of shifts; and (3) an adaptive algorithm to identify the correct regularizer. While we focus on images, OoD generalization on graph data is also challenged by multiple types of distribution shifts (Chen et al., 2022).

Causally-motivated learning. There has been recent work on causal representation learning (Arjovsky et al., 2019; Krueger et al., 2021; Locatello et al., 2020; Schölkopf et al., 2021) for OoD generalization. While these works attempt to learn the constraints for causal features from input features, we show that it is necessary to model the data-generating process and have access to auxiliary attributes to obtain a risk-invariant predictor, especially in multi-attribute distribution shift setups. Recent research has shown how causal graphs can be used to characterize and analyze the different kinds of distribution shifts that occur in real-world settings (Makar et al., 2022; Veitch et al., 2021).
Our approach is similar in motivation, but we extend the single-domain, single-attribute setups of past work to formally introduce multi-attribute distribution shifts in more complex, real-world settings. Additionally, we do not restrict ourselves to binary-valued classes and attributes.

7. DISCUSSION

We introduced CACM, an adaptive OoD generalization algorithm that characterizes multi-attribute shifts and applies the correct independence constraints. Through empirical experiments and theoretical analysis, we show the importance of modeling the causal relationships in the data-generating process. That said, our work has limitations: the constraints from CACM are necessary but not sufficient for a risk-invariant predictor (e.g., they cannot remove the influence of unobserved spurious attributes).

Our work on modeling the data-generating process for improved out-of-distribution generalization is an important advance in building robust predictors for practical settings. Such prediction algorithms, including methods building on representation learning, are increasingly a key element of decision-support and decision-making systems. We expect our approach to creating a robust predictor to be particularly valuable in real-world setups where spurious attributes and multi-attribute settings lead to biases in data. While not the focus of this paper, CACM may be applied to mitigate social biases (e.g., in language and vision datasets) whose structures can be approximated by the graphs in Figure 2. Risks of using methods such as CACM include over-reliance or a false sense of confidence. While methods such as CACM ease the process of building robust models, there remain many ways in which an application may still fail (e.g., incorrect structural assumptions). AI applications must still be designed appropriately with the support of all stakeholders and potentially affected parties, tested in a variety of settings, etc.

9. REPRODUCIBILITY STATEMENT

We provide all required experimental details in Suppl. D, including dataset details, training details, and hyperparameter search sweeps. We additionally submit our code as part of the supplementary material, which can be used to reproduce the experiments. We provide a demo notebook for prediction using CACM in the DoWhy library (footnote 2). We provide proofs for all our theoretical results in Suppl. B.

A PRESENCE OF AUXILIARY ATTRIBUTE INFORMATION IN DATASETS

Unlike the full causal graph, attribute values, as well as the relationships between class labels and attributes, are often known. CACM assumes access to attribute labels A only at training time; these are collected as part of the data collection process (e.g., as metadata with training data (Makar et al., 2022)). We start by discussing the availability of attributes in WILDS (Koh et al., 2021), a set of real-world datasets adapted for the domain generalization setting. Attribute labels available in the datasets include the time (year) and region associated with satellite images in the FMoW dataset (Christie et al., 2018) for predicting land use category, the hospital from which the tissue patch was collected for tumor detection in the Camelyon17 dataset (Bandi et al., 2018), and the demographic information for the CivilComments dataset (Borkan et al., 2019). Koh et al. (2021) create different domains in WILDS using this metadata, consistent with our definition of E ∈ A as a special domain attribute. In addition, CACM requires the type of relationship between the label Y and the attributes. This is often known, either based on how the dataset was collected or inferred from domain knowledge or observation. While the distinction between A_ind and A_dep can be established using a statistical test of independence on a given dataset, in general, the distinction between A_cause, A_sel, and A_conf within A_dep needs to be provided by the user. As we show for the above datasets, the type of relationship can be inferred based on common knowledge or information on how the dataset was collected. For the FMoW dataset, time can be considered an Independent attribute (A_ind), since it reflects the time at which images are captured, which is not correlated with Y; whereas region is a Confounded attribute, since certain regions associated with certain Y labels are over-represented due to ease of data collection.
Note that region cannot lead to a Causal shift, since the decision to capture images in a region was not determined by the final label; nor can it lead to a Selected shift, for the same reason that selection was not based on values of Y. Similarly, for the Camelyon17 dataset, it is known that differences in slide staining or image acquisition lead to variation in tissue slides across hospitals, implying that hospital is an Independent attribute (A_ind) (Koh et al., 2021; Komura & Ishikawa, 2018; Tellez et al., 2019). As another example from healthcare, a study in MIT Technology Review (footnote 3) discusses biased data in which a person's position (A_conf) was spuriously correlated with disease prediction, as patients lying down were more likely to be ill. As another example, Sagawa et al. (2020) adapt the MultiNLI dataset for OoD generalization due to the presence of a spurious correlation between negation words (attribute) and the contradiction label between "premise" and "hypothesis" inputs. Here, negation words are a result of the contradiction label (Causal shift); however, this relationship between negation words and label may not always hold. Finally, for the CivilComments dataset, we expect the demographic features to be Confounded attributes, as there could be biases that result in a spurious correlation between comment toxicity and demographic information. To provide examples showing the availability of attributes and their type of relationship with the label, Table 5 lists some popular datasets used for DG and the associated auxiliary information present as metadata. In addition to the above datasets, we include the widely used Waterbirds dataset (Sagawa et al., 2020), where the type of background (land/water) is assigned to bird images based on the bird label, making it a Causal attribute (results on the Waterbirds dataset are in Table 2).
Table 5 (excerpt): dataset, auxiliary attribute, and type of relationship with the label.
Waterbirds (Sagawa et al., 2020): background (land/water), A_cause
MultiNLI (Sagawa et al., 2020): negation word, A_cause
CivilComments-WILDS (Koh et al., 2021): demographic, A_conf

where the third and fourth terms cancel out because P_1(X_c, Y) = P_2(X_c, Y), and thus the risk of h(X_c) is the same across P_1 and P_2. However, the risk of g'(X, X_c) is not the same, since P_1(Y | g'(X, X_c)) ≠ P_2(Y | g'(X, X_c)). Thus the absolute risk difference is non-zero,

|R_P2(g) - R_P1(g)| > 0    (5)

and g is not a risk-invariant predictor. Hence, satisfying the conditional independencies that X_c satisfies is necessary for a risk-invariant predictor.

Remark. In the above theorem, we considered the case where P(A|Y) changes across distributions. In the case where A and Y are independent, P(A|Y) = P(A), and thus P(A) would change across distributions while P(Y|A) = P(Y) remains constant. Since g'(X, X_c) depends on A and X_c, we obtain P_1(Y | g'(X, X_c)) = P_2(Y | g'(X, X_c)). However, the risk difference can still be non-zero, since P_1(A) ≠ P_2(A) and the risk expectation E_P[Σ_y y log g'(X, X_c)] is over P(Y, X_c, A).

B.2 PROOF OF PROPOSITION 3.1

Proposition 3.1. Given a causal DAG realized by specifying the target-attribute relationship in Figure 2a, the correct constraint depends on the relationship of the label Y with the attributes A. As shown, A can be split into A_ind, A_dep, and E, where A_dep can be further split into subsets that have a causal (A_cause), confounded (A_conf), or selected (A_sel) relationship with Y (A_dep = A_cause ∪ A_conf ∪ A_sel). Then, the (conditional) independence constraints X_c should satisfy are:
1. Independent: X_c ⊥⊥ A_ind; X_c ⊥⊥ E; X_c ⊥⊥ A_ind | Y; X_c ⊥⊥ A_ind | E; X_c ⊥⊥ A_ind | Y, E
2. Causal: X_c ⊥⊥ A_cause | Y; X_c ⊥⊥ E; X_c ⊥⊥ A_cause | Y, E
3. Confounded: X_c ⊥⊥ A_conf; X_c ⊥⊥ E; X_c ⊥⊥ A_conf | E
4. Selected: X_c ⊥⊥ A_sel | Y; X_c ⊥⊥ A_sel | Y, E

Proof.
The proof follows from d-separation (Pearl, 2009) on the causal DAGs realized from Figure 2a. For each condition (Independent, Causal, Confounded, and Selected), we provide the realized causal graphs below and derive the constraints.

Independent: As we can see in Figure 3a, we have a collider X on the path from X_c to A_ind and on the path from X_c to E. Since there is a single path in each case, we obtain the independence constraints X_c ⊥⊥ A_ind and X_c ⊥⊥ E. Additionally, we see that conditioning on Y or E would not block the path from X_c to A_ind, which results in the remaining constraints: X_c ⊥⊥ A_ind | Y; X_c ⊥⊥ A_ind | E; X_c ⊥⊥ A_ind | Y, E.

Given a data distribution (and hence, dataset) with specific types of multi-attribute shifts such that X_c satisfies a (conditional) independence constraint w.r.t. a subset of attributes A_s ⊆ A, it is always possible to change the type of at least one of those attributes' shifts to create a new data distribution (dataset) where the same constraint will not hold.

Second claim. To prove the second claim, suppose that there exists a predictor for Y based on a single conditional independence constraint over its representation, ψ(φ(X), A_s, Y), where A_s ⊆ A. Since the same constraint is not valid across all attribute shifts, we can always construct a realized graph G (and a corresponding data distribution) by changing the type of at least one attribute shift A ∈ A_s, such that X_c would not satisfy the same constraint as φ(X). Further, under this G, X_c would satisfy a different constraint on the same attributes. From Theorem 2.1, all conditional independence constraints satisfied by X_c under G must be satisfied by a risk-invariant predictor. Hence, for the class of distributions P_G, a single constraint-based predictor cannot be a risk-invariant predictor.

Corollary 3.2. Even when |A| = 1, an algorithm using a single independence constraint over ⟨φ(X), A, Y⟩ cannot yield a risk-invariant predictor for all kinds of single-attribute shift datasets.

Proof.
Given a fixed (conditional) independence constraint over a predictor's representation, ψ(φ(X), A_s, Y), the proof of Theorem 3.1 relied on changing the target-relationship type (and hence the distribution shift type) for the attributes involved in the constraint. When |A| = 1, the constraint is over a single attribute A, ψ(φ(X), A, Y), and the same proof logic follows. From Proposition 3.1, given a fixed constraint, we can always choose a single-attribute shift type (and realized DAG G) such that the constraint is not valid for X_c. Moreover, under G, X_c would satisfy a different conditional independence constraint w.r.t. the same attribute. From Theorem 2.1, since the predictor does not satisfy a conditional independence constraint satisfied by X_c, it cannot be a risk-invariant predictor for datasets sampled from P_G.
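The Causal-shift constraints in Proposition 3.1 can also be checked numerically. The following sketch (our own toy simulation, not from the paper) generates data from a Causal-realized DAG, with A_cause drawn from Y, and confirms that X_c is marginally dependent on A_cause but independent of it once we stratify by Y:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x_c = rng.standard_normal(n)            # causal feature X_c
y = (x_c > 0).astype(int)               # Y is a (noise-free) function of X_c
flip = rng.random(n) < 0.1
a_cause = np.where(flip, 1 - y, y)      # A_cause generated from Y (Causal shift)

def corr(u, v):
    return float(np.corrcoef(u, v)[0, 1])

marginal = abs(corr(x_c, a_cause))      # X_c and A_cause are clearly dependent
within_y = max(abs(corr(x_c[y == k], a_cause[y == k])) for k in (0, 1))
# stratifying by Y removes the dependence: within_y is close to zero
```

The same style of check can be repeated for the Confounded and Selected mechanisms by generating A from a common cause of Y, or by subsampling on a function of Y and A, respectively.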

C CACM ALGORITHM

We provide the CACM algorithm for a general graph G below (Algorithm 1).

Algorithm 1 CACM
Input: Dataset {(x_i, a_i, y_i)}_{i=1}^n, causal DAG G
Output: Function g(x) = g_1(φ(x)) : X → Y

A ← set of observed variables in G except Y and E (the special domain attribute)
C ← {}   ▷ mapping of A to A_s

Phase I: Derive correct independence constraints
for A ∈ A do
    if (X_c, A) are d-separated then
        X_c ⊥⊥ A is a valid independence constraint
    else if (X_c, A) are d-separated conditioned on some subset A_s of the remaining observed variables in A \ {A} ∪ {Y} then
        X_c ⊥⊥ A | A_s is a valid independence constraint
        C[A] ← A_s
    end if
end for

Phase II: Apply regularization penalty using derived constraints
for A ∈ A do
    if X_c ⊥⊥ A then
        RegPenalty_A = Σ_{i=1}^{|E||A|} Σ_{j>i} MMD(P(φ(x)|A_i), P(φ(x)|A_j))
    else if A ∈ C then
        A_s ← C[A]
        RegPenalty_A = Σ_{a∈A_s} Σ_{i=1}^{|E||A|} Σ_{j>i} MMD(P(φ(x)|A_i, a), P(φ(x)|A_j, a))
    end if
end for
RegPenalty = Σ_{A∈A} λ_A RegPenalty_A
g_1, φ = argmin_{g_1,φ} ℓ(g_1(φ(x)), y) + RegPenalty

Remark. If E is observed, we always condition on E because of Corollary 3.1.

For the special case of Figure 2, CACM uses the following regularization penalties (RegPenalty) for Independent, Causal, Confounded, and Selected shifts:

RegPenalty_{A_ind} = Σ_{i=1}^{|E|} Σ_{j>i} MMD(P(φ(x)|a_{i,ind}), P(φ(x)|a_{j,ind}))
RegPenalty_{A_cause} = Σ_{y∈Y} Σ_{i=1}^{|E||A_cause|} Σ_{j>i} MMD(P(φ(x)|a_{i,cause}, y), P(φ(x)|a_{j,cause}, y))

and corr_1 = 0.9, corr_2 = 0.8, corr_3 = 0.1. All environments have 25% label noise, as in Arjovsky et al. (2019). For all experiments on MNIST, we use a two-layer perceptron, consistent with previous works (Arjovsky et al., 2019; Krueger et al., 2021).

small NORB. Moving beyond simple binary classification, we use small NORB (LeCun et al., 2004), an object recognition dataset, to create a challenging setup with multi-valued classes and attributes over realistic 3D objects. It consists of images of toys from five categories with varying lighting, elevation, and azimuths.
The objective is to classify unseen samples of the five categories. Wiles et al. (2022) introduced single-attribute shifts for this dataset. We combine the Causal shift, A_cause = lighting, wherein there is a correlation between lighting condition lighting_i and toy category y_i, and the Independent shift, A_ind = azimuth, which varies independently across domains, to generate our multi-attribute dataset light + azi. Training domains have 0.9 and 0.95 spurious correlation with lighting, whereas there is no correlation in the test domain. We add 5% label noise in all environments. We use ResNet-18 (pre-trained on ImageNet) for all settings and fine-tune it for our task.

Waterbirds. We use the Waterbirds dataset from Sagawa et al. (2020). The task is to classify birds as "waterbird" or "landbird", where the bird type (Y) is spuriously correlated with the background: "waterbird" images are spuriously correlated with "water" backgrounds (ocean, natural lake) and "landbird" images with "land" backgrounds (bamboo forest, broadleaf forest). Since the background is assigned based on Y, A_cause = background. The dataset is created by pasting bird images from the CUB dataset (Wah et al., 2011) onto backgrounds from the Places dataset (Zhou et al., 2018). There is 0.95 correlation between the bird type and background during training, i.e., 95% of all waterbirds are placed against a water background, while 95% of all landbirds are placed against a land background. We create training domains based on background (|E_tr| = |A_cause| = 2) as in Yao et al. (2022). We evaluate using worst-group error, consistent with past work, where a group is defined as (background, y). We generate the dataset using the official code from Sagawa et al. (2020) and use the same train-validation-test splits.
RegPenalty_{A_conf} = Σ_{i=1}^{|E||A_conf|} Σ_{j>i} MMD(P(φ(x)|a_{i,conf}), P(φ(x)|a_{j,conf}))
RegPenalty_{A_sel} = Σ_{y∈Y} Σ_{i=1}^{|E||A_sel|} Σ_{j>i} MMD(P(φ(x)|a_{i,sel}, y), P(φ(x)|a_{j,sel}, y))

To create the multi-attribute shift variant of Waterbirds, we add weather effects (A_ind) using the Automold library (footnote 5). We add a darkness effect (darkness coefficient = 0.7) with 0.5 probability during training and a rain effect (rain type = 'drizzle', slant = 20) with 1.0 probability during test. Hence, |A_ind| = 3 ({no effect, darkness, rain}). The weather effect is applied independently of the class label Y. Our training domains are based on background and we perform worst-group evaluation, the same as the setup described above. Examples from the train and test domains of the multi-attribute shift dataset are provided in Figure 6. We use ResNet-50 (pre-trained on ImageNet) for all settings, consistent with past work (Sagawa et al., 2020; Yao et al., 2022). All models were evaluated at the best early-stopping epoch (as measured on the validation set), again consistent with Sagawa et al. (2020).

Model selection. We create 90% and 10% splits from each domain, used for training/evaluation and model selection (as needed), respectively. We use a validation set that follows the test-domain distribution, consistent with previous work on these datasets (Arjovsky et al., 2019; Ye et al., 2022; Wiles et al., 2022; Yao et al., 2022). Specifically, we adopt the test-domain validation from DomainBed for the Synthetic, MNIST, and small NORB datasets, where early stopping is not allowed and all models are trained for the same fixed number of steps to limit test-domain access. For Waterbirds, we perform early stopping using the validation set, consistent with past work (Sagawa et al., 2020; Yao et al., 2022).

MMD implementation details. We use the radial basis function (RBF) kernel to compute the MMD penalty. Our implementation is adopted from DomainBed (Gulrajani & Lopez-Paz, 2021).
The kernel bandwidth is a hyperparameter, and we perform a sweep over the hyperparameter search space to select the best RBF kernel bandwidth. The search space for hyperparameter sweeps is provided in Table 9, where γ corresponds to 1/bandwidth.

CACM and baselines implementation details. We provide the regularization constraints for different shifts used by CACM in Section C. For statistical efficiency, we use a single λ value as the hyperparameter for the MNIST and small NORB datasets. The search space for hyperparameters is given in Table 9. In MNIST and NORB, we input images and domains (E = A_ind) to all baseline methods; CACM receives the additional input A_cause. Hence, in the Independent single-attribute shift, CACM and all baselines have access to exactly the same information. In Waterbirds, since E is not defined in the original dataset, we follow the setup from Yao et al. (2022) to create domains based on backgrounds. Here, we provide images and domains (E = background) as input to all baselines except GroupDRO; to ensure a fair comparison with GroupDRO, we follow Sagawa et al. (2020) and provide 4 groups as input based on (background, y), along with images. For CACM, we do not use background domains but provide the attribute A_cause = background for the single-attribute dataset, and both A_cause = background and A_ind = weather for the multi-attribute shift dataset. Hence, for the single-shift Waterbirds dataset, all baselines receive the same information as CACM.
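Phase II of CACM can be sketched in a few lines of NumPy. The snippet below is our own illustration (function names and toy data are assumptions, and a biased MMD estimate stands in for the DomainBed implementation): it computes an RBF-kernel MMD and applies it the way the Causal-shift penalty does, comparing attribute values within every (y, e) stratum:

```python
import numpy as np

def rbf_mmd2(x, z, gamma=1.0):
    """Biased estimate of squared MMD between samples x and z under the
    RBF kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    def k(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * d2)
    return float(k(x, x).mean() + k(z, z).mean() - 2 * k(x, z).mean())

def causal_shift_penalty(phi, a, y, e, gamma=1.0):
    """Causal-shift penalty: within every (y, e) stratum, match P(phi | a)
    across all pairs of attribute values (enforcing X_c indep A_cause | Y, E)."""
    total = 0.0
    for yv in np.unique(y):
        for ev in np.unique(e):
            m = (y == yv) & (e == ev)
            vals = np.unique(a[m])
            for i in range(len(vals)):
                for j in range(i + 1, len(vals)):
                    total += rbf_mmd2(phi[m & (a == vals[i])],
                                      phi[m & (a == vals[j])], gamma)
    return total

rng = np.random.default_rng(0)
n = 1200
y = rng.integers(0, 2, n)
e = rng.integers(0, 2, n)
a = np.where(rng.random(n) < 0.9, y, 1 - y)                  # Causal attribute
phi_good = y[:, None] + 0.05 * rng.standard_normal((n, 3))   # ignores a
phi_bad = a[:, None] + 0.05 * rng.standard_normal((n, 3))    # encodes a

penalty_good = causal_shift_penalty(phi_good, a, y, e)
penalty_bad = causal_shift_penalty(phi_bad, a, y, e)
```

On this toy data, a representation that ignores A_cause incurs a near-zero penalty, while one that encodes it is heavily penalized; in training, the penalty is added to the classification loss with weight λ.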

D.3 HYPERPARAMETER SEARCH

Following DomainBed (Gulrajani & Lopez-Paz, 2021), we perform a random search of 20 trials over the hyperparameter distribution, and this process is repeated for a total of 3 seeds. The best models are obtained for each of the three seeds, over which we compute the mean and standard error. The hyperparameter search space for all datasets and algorithms is given in Table 9.
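For concreteness, the search protocol above can be sketched as follows (a minimal sketch; the search space shown is hypothetical, with the real per-algorithm spaces listed in Table 9):

```python
import random
import statistics

# Hypothetical two-parameter search space (stand-in for Table 9).
SPACE = {
    "lr": lambda r: 10 ** r.uniform(-5, -2.5),
    "lambda": lambda r: r.choice([0.01, 0.1, 1, 10, 100]),
}

def random_search(evaluate, n_trials=20, seeds=(0, 1, 2)):
    """Sample n_trials random configs per seed; keep the best per seed."""
    best_per_seed = {}
    for seed in seeds:
        r = random.Random(seed)
        trials = []
        for _ in range(n_trials):
            cfg = {name: draw(r) for name, draw in SPACE.items()}
            trials.append((evaluate(cfg, seed), cfg))
        best_per_seed[seed] = max(trials, key=lambda t: t[0])
    return best_per_seed

# toy objective standing in for validation accuracy
best = random_search(lambda cfg, seed: -abs(cfg["lr"] - 1e-3))
scores = [score for score, _ in best.values()]
mean = statistics.mean(scores)
stderr = statistics.stdev(scores) / len(scores) ** 0.5
```

The reported numbers are then the mean and standard error over the per-seed best models, mirroring the protocol described above.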

E.1 SYNTHETIC DATASET

Our synthetic dataset is constructed based on the data-generating process of the slab dataset (Mahajan et al., 2021; Shah et al., 2020). The original slab dataset was introduced by Shah et al. (2020) to demonstrate the simplicity bias of neural networks: they learn the linear feature, which is easier to learn than the slab feature. Our extended slab dataset adds to the setting of Mahajan et al. (2021) by using non-binary attributes and class labels to create a more challenging task, and it allows us to study DG algorithms in the presence of linear spurious features. Our dataset consists of a 2-dimensional input X comprising the features X_c and A_ind. This is consistent with the graph in Figure 2, where attributes and causal features together determine the observed features X.

With different spurious correlation values for the training environments, we have consistent findings: application of the incorrect constraint is sensitive to the λ (regularization weight) parameter, with accuracy dropping to less than 40% as λ increases, while accuracy with the correct constraint stays invariant across different values of λ.
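A hypothetical construction in the spirit of the extended slab dataset is sketched below (the slab boundaries, the five-way label, and the way the spurious feature tracks the label are our assumptions; the exact generative equations are in the released code):

```python
import numpy as np

def make_env(n, corr, n_classes=5, seed=0):
    """One environment: causal feature X_c with a slab relationship to Y, and
    a spurious linear feature that agrees with Y with probability `corr`."""
    rng = np.random.default_rng(seed)
    x_c = rng.random(n)                                   # X_c ~ Uniform[0, 1]
    y = np.minimum((x_c * n_classes).astype(int), n_classes - 1)  # slab label
    agree = rng.random(n) < corr
    a = np.where(agree, y, rng.integers(0, n_classes, n)) / (n_classes - 1)
    return np.stack([x_c, a], axis=1), y                  # X = (X_c, A)

x_tr, y_tr = make_env(1000, corr=0.9, seed=0)  # training: strong correlation
x_te, y_te = make_env(1000, corr=0.1, seed=1)  # test: correlation broken
```

A model that latches onto the linear feature generalizes poorly to the test environment, which is exactly the failure mode the dataset is designed to expose.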

E.4 PROVIDING ATTRIBUTE INFORMATION TO DG ALGORITHMS FOR A FAIRER COMPARISON

CACM leverages attribute labels to apply the correct independence constraints derived from the causal graph. However, existing DG algorithms use only the input features X and the domain attribute. Here, we provide this attribute information to existing DG algorithms to create a more favorable setting for their application. We show that even in this setup, these algorithms are unable to close the performance gap with CACM, showing the importance of providing causal information through graphs.

E.4.1 SYNTHETIC DATASET

We consider our Synthetic dataset with Causal distribution shift, where the observed features are X = (X_c, A_cause). Note that by construction of X, one of the input dimensions already consists of A_cause; hence, all baselines do receive information about A_cause in addition to the domain attribute E. However, to provide a fairer comparison with CACM, we now additionally make A_cause explicitly available to all DG algorithms for applying their respective constraints, by creating domains based on A_cause in this new setup. Using the same underlying data distribution, we group the data (i.e., create environments/domains) based on A_cause, i.e., each environment E has samples with the same value of A_cause. In this setup (Table 7, third column), we see that IB-IRM, DANN, and Mixup show significant improvements in accuracy, but the best performance is still 14% lower than CACM. We additionally observe that baselines show higher estimate variance in this setup. This reinforces our motivation to use the causal graph of the data-generating process to derive the constraint, as the attribute values alone are not sufficient. We also see that MMD, CORAL, GroupDRO, and MLDG perform much worse than earlier, highlighting the sensitivity of DG algorithms to the domain definition. In contrast, CACM uses the causal graph to study the structural relationships and derive the regularization penalty, which remains the same for this new dataset too.
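The regrouping described above amounts to re-indexing environments by attribute value; a sketch (our own helper, with toy data):

```python
import numpy as np

def regroup_by_attribute(x, y, attr):
    """Map each attribute value to the (x, y) samples forming its environment."""
    return {int(v): (x[attr == v], y[attr == v]) for v in np.unique(attr)}

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 2))
y = rng.integers(0, 5, 100)
a_cause = np.where(rng.random(100) < 0.9, y, (y + 1) % 5)  # spuriously tied to Y
envs = regroup_by_attribute(x, y, a_cause)  # one environment per A_cause value
```

Baselines then treat each attribute value as a domain, while CACM's penalty is unchanged because it is derived from the graph, not from the domain grouping.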

E.4.2 WATERBIRDS

We perform a similar analysis on the Waterbirds multi-attribute shift dataset. In order to provide the same information to other DG algorithms as to CACM, we create domains based on A_cause × A_ind in this setup (Table 8). We observe mixed results: while some algorithms show significant improvements (ERM, IRM, VREx, MMD, MLDG, RSC), there is a performance drop for some others (IB-ERM, IB-IRM, CORAL, C-MMD, GroupDRO, Mixup). CACM uses the knowledge of the causal relationships between attributes and the label, and hence its evaluation remains the same. We thus empirically demonstrate the importance of using information from the causal graph in addition to the attributes.

F E-X_c RELATIONSHIP

The E-X_c edge shown in Figure 2b represents a correlation of E with X_c, which can change across environments. As we saw for the Y-A_dep edge, this correlation can be due to a causal relationship (Figure 8a), confounding with a common cause (Figure 8b), or selection (Figure 8c); all our results (Proposition 3.1, Corollary 3.1) hold for any of these relationships. To see why, note that no collider is introduced on X_c or E in any of the above cases.

G ANTI-CAUSAL SETTING

Figure 9 shows causal graphs used for specifying multi-attribute distribution shifts in an anti-causal setting. These graphs are identical to Figure 2, except that the direction of the causal arrow is reversed from X_c → Y to Y → X_c. We derive the (conditional) independence constraints of the anti-causal DAG for Independent, Causal, Confounded, and Selected shifts.

Proposition G.1. Given a causal DAG realized from the canonical graph in Figure 9a, the correct constraint depends on the relationship of the label Y with the nuisance attributes A. As shown, A can be split into A_ind, A_dep, and E, where A_dep can be further split into subsets that have a causal (A_cause), confounded (A_conf), or selected (A_sel) relationship with Y (A_dep = A_cause ∪ A_conf ∪ A_sel).
Then, the (conditional) independence constraints that X_c should satisfy are,

Table 9: Search space for random hyperparameter sweeps.



Footnotes:
1. Note that for Selected, satisfying the assumptions of Theorem 2.1 implies that X_c is fully predictive of Y, or that the noise in the Y-X_c relationship is independent of the features driving the selection process.
2. https://github.com/py-why/dowhy
3. https://www.technologyreview.com/2021/07/30/1030329/machine-learning-ai-failed-covid-hospital-diagnosis-pandemic/
4. In practice, the constraint may be evaluated on an intermediate representation of g, such that g can be written as g(X) = g_1(φ(X)), where φ denotes the representation function. However, for simplicity, we assume it is applied on g(X).
5. https://github.com/UjjwalSaxena/Automold--Road-Augmentation-Library



Figure 1: (a) Our multi-attribute distribution shift dataset Col+Rot-MNIST. We combine Colored MNIST (Arjovsky et al., 2019) and Rotated MNIST (Ghifary et al., 2015) to introduce distinct shifts over the Color and Rotation attributes. (b) The causal graph representing the data-generating process for Col+Rot-MNIST. Color has a correlation with Y that changes across environments, while Rotation varies independently. (c) Comparison with DG algorithms optimizing for different constraints shows the superiority of Causally Adaptive Constraint Minimization (CACM) (full table in Section 5).

Figure 2: (a) Canonical causal graph for specifying multi-attribute distribution shifts; (b) canonical graph with E-X_c correlation. The anti-causal graph is shown in Suppl. G. Shaded nodes denote observed variables; since not all attributes may be observed, we use a dotted boundary. Dashed lines denote correlation, between X_c and E, and between Y and A_dep. The E-X_c correlation can be due to confounding, selection, or a causal relationship; all our results hold for any of these relationships (see Suppl. F). (c) Different mechanisms for the Y-A_dep relationship that lead to Causal, Confounded, and Selected shifts.

Figure 3: Causal graphs for distinct distribution shifts based on Y -A relationship.

Rotated MNIST (Ghifary et al., 2015) and Colored MNIST (Arjovsky et al., 2019) present distinct distribution shifts. While Rotated MNIST only has A_ind w.r.t. the rotation attribute (R), Colored MNIST only has A_cause w.r.t. the color attribute (C). We combine these datasets to obtain a multi-attribute dataset with A_cause = {C} and A_ind = {R}. Each domain E_i has a specific rotation angle r_i and a specific correlation corr_i between color C and label Y. Our setup consists of 3 domains: E_1, E_2 ∈ E_tr (training), E_3 ∈ E_te (test). We define corr_i = P(Y = 1|C = 1) = P(Y = 0|C = 0) in E_i. In our setup, r_1 = 15°, r_2 = 60°, r_3 = 90°

Figure 4: (a), (b) Train and (c) Test domains for MNIST.

Figure 5: (a), (b) Train and (c) Test domains for small NORB.

Figure 6: (a), (b) Train and (c) Test domains for Waterbirds.

Figure 7: Accuracy of CACM (X_c ⊥⊥ A_cause | Y, E) and the incorrect constraint (X_c ⊥⊥ A_cause | E) on small NORB Causal shift with varying λ ∈ {1, 10, 100} and spurious correlation in training environments (in parentheses in the legend).

Figure 8: (a) Causal, (b) Confounded and (c) Selection mechanisms leading to E-X c correlation.

Table 9 (excerpt):
weight decay: 10^Uniform(-6,-2); generator weight decay: 10^Uniform(-6,-2)
IRM: λ ∈ [0.01, 0.1, 1, 10, 100]; iterations of annealing ∈ [10, 100, 1000]
IB-ERM, IB-IRM: λ_IB ∈ [0.01, 0.1, 1, 10, 100]; iterations of annealing (IB) ∈ [10, 100, 1000]; λ_IRM ∈ [0.01, 0.1, 1, 10, 100]; iterations of annealing (IRM) ∈ [10, 100, 1000]
generator learning rate ∈ [1e-2, 1e-3, 1e-4, 1e-5]; discriminator learning rate ∈ [1e-2, 1e-3, 1e-4, 1e-5]; discriminator weight decay: 10^Uniform(-6,-2); λ ∈ [0.1, 1, 10, 100]; discriminator steps ∈ [1, 2, 4, 8]; gradient penalty ∈ [0.01, 0.1, 1, 10]; Adam β



Statistic optimized by DG algorithms. "match" matches the statistic across E. h is a domain classifier (loss ℓ_d) using the shared representation φ.

There are three mechanisms which can introduce the dashed-line correlation between A_dep and Y (Figure 2c): direct-causal (Y causing A_dep), confounding between Y and A_dep due to a common cause, or selection during the data-generating process. Overall, we define four kinds of shifts based on the causal graph: Independent, Causal, Confounded, and Selected (footnote 1). While the canonical graph in Figure

Colored + Rotated MNIST: Accuracy on the unseen domain for single-attribute (color, rotation) and multi-attribute (col + rot) distribution shifts; small NORB: Accuracy on the unseen domain for single-attribute (lighting, azimuth) and multi-attribute (light + azi) distribution shifts. Waterbirds: Worst-group accuracy on the unseen domain for single- and multi-attribute shifts.

small NORB Causal shift. Comparing X c ⊥ ⊥ A cause |Y, E with incorrect constraints.

Commonly used DG datasets include auxiliary information.

Synthetic dataset. Accuracy on unseen domain for Causal distribution shift when A cause is provided in input (column 2) and when A cause is additionally used to create domains (column 3).

Waterbirds. Accuracy on the unseen domain for multi-attribute distribution shift when A_cause is used to create domains (column 2) and when A_cause × A_ind is used to create domains (column 3).

10. ACKNOWLEDGEMENTS

We thank Abhinav Kumar, Adith Swaminathan, Yiding Jiang, and Dhruv Agarwal for helpful feedback and comments on the draft. We would also like to thank the anonymous reviewers for their valuable feedback.

B PROOFS

B.1 PROOF OF THEOREM 2.1

Theorem 2.1. Consider a causal DAG G over ⟨X_c, X, A, Y⟩ and a corresponding generated dataset {(x_i, a_i, y_i)}_{i=1}^n, where X_c is unobserved. Assume that the graph G has the following property: X_c is defined as the set of all parents of Y (X_c → Y); and X_c, A together cause X (X_c → X and A → X). The graph may have any other edges (see, e.g., the DAG in Figure 1(b)). Let P_G be the set of distributions consistent with graph G, obtained by changing P(A|Y) but not P(Y|X_c). Then the conditional independence constraints satisfied by X_c are necessary for a (cross-entropy) risk-invariant predictor over P_G. That is, if a predictor for Y does not satisfy any of these constraints, then there exists a data distribution P' ∈ P_G such that the predictor's risk will be higher than its risk on other distributions.

Proof. We consider X, Y, X_c, A as random variables generated according to the data-generating process corresponding to the causal graph G. We assume that X_c represents all the parents of Y. X_c also causes the observed features X, but X may be additionally affected by the attributes A. Let ŷ = g(x) be a candidate predictor. Then g(X) represents a random vector based on a deterministic function g of X.

Suppose there is an independence constraint ψ that is satisfied by X_c but not by g(X) (footnote 4). Below we show that such a predictor g is not risk-invariant: there exist two data distributions with different P(A|Y) such that the risk of g differs between them. Without loss of generality, we can write g(x) as

g(x) = h(x_c) + g'(x, x_c),

where h is an arbitrary, non-zero, deterministic function of the random variable X_c. Since X_c satisfies the (conditional) independence constraint ψ and h is a deterministic function, h(X_c) also satisfies ψ. Also, since the predictor g(X) does not satisfy the constraint ψ, it follows that the random vector g'(X, X_c) cannot satisfy the constraint ψ. Thus, g'(X, X_c) cannot be a function of X_c only; it needs to depend on X too.
Since X has two parents in the causal graph, X_c and A, this implies that g′(X, X_c) must also depend on A, and hence g′(X, X_c) and A are not independent. Writing the risk over any distribution P using the cross-entropy loss, the risk difference between two distributions in P_G that differ in P(A|Y) is therefore non-zero: the contribution of the A-dependent component g′(X, X_c) changes across the two distributions, while that of h(X_c) does not. Hence g is not risk-invariant over P_G.

B.2 PROOF OF PROPOSITION 3.1

Independent: From Figure 3a, every path from X_c to A_ind contains the collider X, so X_c ⊥⊥ A_ind. Conditioning on Y or E does not open any path from X_c to A_ind, which results in the remaining constraints: X_c ⊥⊥ A_ind | Y, E.

Causal: From Figure 3b, we see that while the path X_c → X ← A_cause from X_c to A_cause contains a collider X, X_c ̸⊥⊥ A_cause due to the presence of node Y as a chain (X_c → Y → A_cause). By the d-separation criterion, X_c and A_cause are conditionally independent given Y ⟹ X_c ⊥⊥ A_cause | Y. Additionally, conditioning on E is valid since E does not appear as a collider on any path between X_c and A_cause. Hence, we obtain X_c ⊥⊥ A_cause | Y, E.

Confounded: From Figure 3c, we see that all paths connecting X_c and A_conf contain a collider (X on the path X_c → X ← A_conf, and Y on the path through the confounder), so X_c ⊥⊥ A_conf; conditioning on Y would open the collider Y, so the constraint cannot condition on Y. Additionally, conditioning on E is valid since E does not appear as a collider on any path between X_c and A_conf. Hence, we obtain X_c ⊥⊥ A_conf | E.

Selected: For the observed data, the selection variable is always conditioned on, with S = 1 indicating inclusion of a sample in the data. The selection variable S is a collider in Figure 3d and we condition on it; hence X_c ̸⊥⊥ A_sel. Conditioning on Y blocks the chain X_c → Y → S, and hence all paths between X_c and A_sel now contain a blocked node (the collider X on X_c → X ← A_sel, and the chain node Y on the path through S) ⟹ X_c ⊥⊥ A_sel | Y. Additionally, conditioning on E is valid since E does not appear as a collider on any path between X_c and A_sel ⟹ X_c ⊥⊥ A_sel | Y, E. Hence, we obtain these as the valid constraints for Selected shifts.

B.2.1 PROOF OF COROLLARY 3.1

Corollary 3.1. All the above derived constraints are valid for Graph 2a. However, in the presence of a correlation between E and X_c (Graph 2b), only the constraints conditioned on E hold true.

Proof. If there is a correlation between X_c and E, then X_c ̸⊥⊥ E. We can see from Figure 3 that in the presence of this correlation, there is an open path between X_c and each attribute through E (E acts as a fork, not a collider, on such paths), so the unconditional constraints no longer hold. Hence, conditioning on environment E is required for the valid independence constraints.

B.3 PROOF OF THEOREM 3.1

Theorem 3.1.
Under the canonical causal graph in Figure 2(a, b), there exists no (conditional) independence constraint over ⟨X_c, A, Y⟩ that is valid for all realized DAGs as the type of multi-attribute shift varies. Hence, for any predictor algorithm for Y that uses a single (conditional) independence constraint over its representation ϕ(X), A, and Y, there exists a realized DAG G and a corresponding training dataset such that the learned predictor cannot be a risk-invariant predictor for distributions in P_G, where P_G is the set of distributions obtained by changing P(A|Y).

Proof. The proof follows from an application of Proposition 3.1 and Theorem 2.1.

First claim. Under the canonical graph from Figure 2(a or b), the four types of attribute shifts possible are Independent, Causal, Confounded, and Selected. From the constraints provided for these four types of attribute shifts in Proposition 3.1, it is easy to observe that there is no single constraint that is satisfied across all four shifts. Thus, for any fixed (conditional) independence constraint, there exists a data distribution (and hence, dataset) with a specific type of multi-attribute shift such that X_c does not satisfy that constraint; by Theorem 2.1, a predictor enforcing it cannot be risk-invariant over P_G.

E.1 SYNTHETIC DATASETS

We concatenate X_c and A_ind to generate X in our synthetic setup. The causal feature X_c has a non-linear "slab" relationship with Y, while A_ind has a linear relationship with Y. We create three different datasets with Causal (E.1.1), Confounded (E.1.2), and Selected (E.1.3) A_ind–Y relationships respectively.

Implementation details. In all setups, X_c is a single-dimensional variable with a Uniform[0, 1] distribution across all environments. We use the default 3-layer MLP architecture from DomainBed and use the mean difference (L2) instead of MMD as the regularization penalty, given the simplicity of the data. We use a batch size of 128 for all datasets.

E.1.1 Causal SHIFT

A_cause = y with prob. = p; abs(y − 1) with prob. = 1 − p

Hence, we have a five-way classification setup (|Y| = 5) with multi-valued attributes.
Following Mahajan et al. (2021), the two training domains have p = 0.9 and p = 1.0, and the test domain has p = 0.0. We add 10% noise to Y in all environments.
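As a concrete illustration, the A_cause generation above can be sketched as follows. The paper's non-linear slab function from X_c to Y is not reproduced in this excerpt, so `label_from_xc` below is a hypothetical placeholder; only the A_cause rule, the 10% label noise, and the domain values of p follow the text.

```python
import random

def sample_causal_env(n, p, n_classes=5, label_noise=0.1, seed=0):
    """Sketch of the Causal-shift setup: A_cause is generated from Y."""
    rng = random.Random(seed)
    # Hypothetical stand-in for the paper's non-linear slab function X_c -> Y.
    label_from_xc = lambda x_c: int(x_c * n_classes) % n_classes
    rows = []
    for _ in range(n):
        x_c = rng.uniform(0.0, 1.0)
        y = label_from_xc(x_c)
        if rng.random() < label_noise:          # 10% label noise on Y
            y = rng.randrange(n_classes)
        # A_cause = y with prob. p, abs(y - 1) with prob. 1 - p
        a = y if rng.random() < p else abs(y - 1)
        rows.append((x_c, a, y))
    return rows

# Two training domains (p = 0.9, 1.0) and one test domain (p = 0.0)
train = sample_causal_env(1000, p=0.9) + sample_causal_env(1000, p=1.0, seed=1)
test = sample_causal_env(1000, p=0.0, seed=2)
```

With p = 1.0 the attribute deterministically copies the label, so any model that exploits A_cause in training fails at p = 0.0, where the correlation is fully reversed.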

E.1.2 Confounded SHIFT

We have three environments, E_1, E_2 ∈ E_tr (training) and E_3 ∈ E_te (test). X_c has a Uniform[0, 1] distribution across all environments. Our confounding variable c has different functional relationships with Y and A_conf, which vary across environments:

c_{E1,E2} = 1 with prob. = 0.25; 0 with prob. = 0.75
c_{E3} = 1 with prob. = 0.75; 0 with prob. = 0.25

The true label y_true is a non-linear function of X_c. The observed Y and A_conf are functions of the confounding variable c, and their distributions change across environments as described below:

y_{E1,E2} = y_true + c with prob. = 0.9; y_true with prob. = 0.1
y_{E3} = y_true
A_conf = 2c with prob. = p; 0 with prob. = 1 − p, where p_{E1} = 1.0, p_{E2} = 0.9, p_{E3} = 0.8
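A minimal sketch of this generative process, with `y_true_from_xc` as a hypothetical stand-in for the unspecified non-linear function of X_c:

```python
import random

def sample_confounded_env(n, c_prob, obs_noise_p, a_p, seed=0):
    """Sketch of the Confounded-shift setup: c confounds Y and A_conf."""
    rng = random.Random(seed)
    # Hypothetical placeholder for the paper's non-linear function X_c -> y_true.
    y_true_from_xc = lambda x_c: int(x_c >= 0.5)
    rows = []
    for _ in range(n):
        x_c = rng.uniform(0.0, 1.0)
        c = 1 if rng.random() < c_prob else 0       # confounder
        y_true = y_true_from_xc(x_c)
        # observed label: y_true + c with prob. obs_noise_p, else y_true
        y = y_true + c if rng.random() < obs_noise_p else y_true
        # A_conf = 2c with prob. a_p, else 0
        a = 2 * c if rng.random() < a_p else 0
        rows.append((x_c, a, y))
    return rows

# E1, E2 (train): P(c=1)=0.25, noisy labels; E3 (test): P(c=1)=0.75, y = y_true
e1 = sample_confounded_env(1000, c_prob=0.25, obs_noise_p=0.9, a_p=1.0)
e2 = sample_confounded_env(1000, c_prob=0.25, obs_noise_p=0.9, a_p=0.9, seed=1)
e3 = sample_confounded_env(1000, c_prob=0.75, obs_noise_p=0.0, a_p=0.8, seed=2)
```

Because c affects both Y and A_conf, the observed association between them is spurious; at test time (E_3) y = y_true exactly, so a model relying on A_conf degrades.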

E.1.3 Selected SHIFT

Selected shifts arise due to a selection effect in the data-generating process and induce an association between Y and A_sel. A data point is included in the sample only if the selection variable S = 1; S is a function of Y and A_sel. The selection criterion may differ between domains (Veitch et al., 2021).

We construct three environments, E_1, E_2 ∈ E_tr (training) and E_3 ∈ E_te (test). X_c has a Uniform[0, 1] distribution across all environments. Our selection variable S is a function of Y and A_sel. We add 10% noise to Y in all environments. The true label is a non-linear function of X_c, and the function used to decide the selection variable S (and hence the selection shift) varies across environments through the parameter p.

E.2 RESULTS

We train a model using ERM (cross-entropy) where the representation is regularized using each of the constraints in turn. As predicted by Theorem 3.1, neither constraint obtains the best accuracy on all three datasets. The conditional constraint performs better on the A_cause and A_sel datasets, whereas the unconditional constraint performs better on A_conf, consistent with Proposition 3.1. Predictors with the correct constraint are also more risk-invariant, with a lower gap between train and test accuracy.
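The regularizer used in these experiments, mean difference (L2) in place of MMD, can be sketched as a penalty on group-wise mean representations. Scalar representations and the exact grouping/aggregation scheme are simplifying assumptions; `group_keys` encodes the conditioning set, e.g. (y, e) for ϕ(X) ⊥⊥ A | Y, E or (e,) for ϕ(X) ⊥⊥ A | E.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def mean_diff_penalty(reps, attrs, group_keys):
    """L2 mean-difference penalty (a simple stand-in for MMD).

    reps: scalar representations phi(x) per sample.
    attrs: attribute value per sample.
    group_keys: conditioning key per sample, e.g. (y, e) or (e,).
    Penalizes differences in the mean representation across attribute
    values within each conditioning group.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for r, a, k in zip(reps, attrs, group_keys):
        groups[k][a].append(r)
    penalty = 0.0
    for by_attr in groups.values():
        means = [mean(v) for v in by_attr.values()]
        # squared distance of every attribute-group mean to the first one
        penalty += sum((m - means[0]) ** 2 for m in means[1:])
    return penalty

# Training sketch: total_loss = cross_entropy + lam * mean_diff_penalty(...)
```

When the representation's mean is identical across attribute values within every conditioning group, the penalty is zero, matching the enforced independence constraint.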

E.3 EFFECT OF VARYING REGULARIZATION PENALTY COEFFICIENT

To understand how incorrect constraints affect model generalization, we study the Causal shift setup in small NORB. From Theorem 3.1, we know the correct constraint for A_cause: X_c ⊥⊥ A_cause | Y, E. In addition, we consider the following invalid constraint: X_c ⊥⊥ A_cause | E. We compare the performance of these two conditional independence constraints while varying the regularization penalty coefficient λ (Figure 7). We perform our evaluation across three setups.
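The correct and incorrect constraints compared here can be checked mechanically with a d-separation test. Below is a minimal sketch (simple-path enumeration with the standard blocking rules) on an illustrative Causal-shift DAG; the node names and the absence of any extra edges are assumptions of this example, not a reproduction of Figure 3b.

```python
def descendants(edges, node):
    """All nodes reachable from `node` following directed edges."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    seen, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def undirected_paths(edges, a, b):
    """All simple paths between a and b, ignoring edge direction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    out, stack = [], [(a, [a])]
    while stack:
        n, path = stack.pop()
        if n == b:
            out.append(path)
            continue
        for m in adj.get(n, ()):
            if m not in path:
                stack.append((m, path + [m]))
    return out

def d_separated(edges, a, b, given):
    """True iff every path between a and b is blocked given `given`."""
    edgeset = set(edges)
    for path in undirected_paths(edges, a, b):
        blocked = False
        for i in range(1, len(path) - 1):
            x, m, y = path[i - 1], path[i], path[i + 1]
            collider = (x, m) in edgeset and (y, m) in edgeset
            if collider:
                # a collider blocks unless it (or a descendant) is conditioned on
                if m not in given and not (descendants(edges, m) & given):
                    blocked = True
                    break
            elif m in given:  # a chain/fork node blocks when conditioned on
                blocked = True
                break
        if not blocked:
            return False
    return True

# Illustrative Causal-shift DAG: Xc -> Y -> Acause, Xc -> X <- Acause, E -> Acause
causal = [("Xc", "Y"), ("Xc", "X"), ("Y", "Acause"),
          ("Acause", "X"), ("E", "Acause")]

print(d_separated(causal, "Xc", "Acause", {"Y", "E"}))  # True: valid constraint
print(d_separated(causal, "Xc", "Acause", {"E"}))       # False: invalid constraint
```

Conditioning on {Y, E} blocks the chain X_c → Y → A_cause while leaving the collider X closed, so the first constraint holds; conditioning on E alone leaves the chain through Y open, which is why X_c ⊥⊥ A_cause | E is invalid for this graph.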

