BREAKING CORRELATION SHIFT VIA CONDITIONAL INVARIANT REGULARIZER

Abstract

Recently, generalization on out-of-distribution (OOD) data with correlation shift has attracted great attention. Correlation shift is caused by spurious attributes that correlate with the class label, as the correlation between them may vary between training and test data. For such a problem, we show that, given the class label, models that are conditionally independent of the spurious attributes are OOD generalizable. Based on this, a metric Conditional Spurious Variation (CSV), which controls the OOD generalization error, is proposed to measure such conditional independence. To improve OOD generalization, we regularize the training process with the proposed CSV. Under mild assumptions, our training objective can be formulated as a nonconvex-concave minimax problem. An algorithm with a provable convergence rate is proposed to solve it. Extensive empirical results verify our algorithm's efficacy in improving OOD generalization.

1. INTRODUCTION

The success of standard learning algorithms relies heavily on the assumption that training and test data are identically distributed. However, in the real world, this assumption is often violated due to varying circumstances, selection bias, and other reasons (Meinshausen & Bühlmann, 2015). Thus, learning a model that generalizes on out-of-distribution (OOD) data has attracted great attention. OOD data (Ye et al., 2022) can be categorized into data with diversity shift or correlation shift: roughly speaking, the two shifts correspond to a mismatch of the spectrum and a spurious correlation between training and test distributions, respectively. Compared with diversity shift, correlation shift is less explored (Ye et al., 2022), although a spurious correlation that holds on training data may deteriorate the model's performance on test data (Beery et al., 2018). Under correlation shift, the (spurious) correlation between the class label and the spurious attributes in the data varies from training to test data (Figure 1). Based on a theoretical characterization of this shift, we show that, given the class label, a model that is conditionally independent of the spurious attributes has stable performance across training and OOD test data. We then propose a metric, Conditional Spurious Variation (CSV, Definition 2), to measure such conditional independence. Notably, in contrast to existing metrics related to OOD generalization (Hu et al., 2020; Mahajan et al., 2021), our CSV controls the OOD generalization error. To improve OOD generalization, we regularize the training process with an estimate of CSV. With observable spurious attributes, we propose a direct estimator of CSV. When this observability condition is violated, we propose another estimator that approximates a sharp upper bound of CSV. We regularize the training process with one of the two, depending on whether the spurious attributes are observable.
Our method weakens the observability condition required in (Sagawa et al., 2019). Under mild smoothness assumptions, the regularized training objective can be formulated as a specific nonconvex-concave minimax problem. A stochastic gradient descent based algorithm with a provable convergence rate of order O(T^{-2/5}) is proposed to solve it, where T is the number of iterations.

Figure 1: Examples from the datasets CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2019), MultiNLI (Williams et al., 2018), and CivilComments (Borkan et al., 2019) involved in this paper. The class labels and spurious attributes are respectively colored red and blue. Their correlation may vary from training set to test set. More details are shown in Section 6.

Finally, extensive experiments are conducted to empirically verify the effectiveness of our methods on OOD data with spurious correlation. Concretely, we conduct experiments on the benchmark classification datasets CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2019), MultiNLI (Williams et al., 2018), and CivilComments (Borkan et al., 2019). Empirical results show that our algorithm consistently improves the model's generalization on OOD data with correlation shift.

2. RELATED WORKS AND PRELIMINARIES

2.1. RELATED WORKS

OOD Generalization. The appearance of OOD data (Hendrycks & Dietterich, 2018) has been widely observed in the machine learning community (Recht et al., 2019; Schneider et al., 2020; Salman et al., 2020; Tu et al., 2020; Lohn, 2020). To tackle this, researchers have proposed various algorithms from different perspectives, e.g., distributionally robust optimization (Sinha et al., 2018; Volpi et al., 2018; Sagawa et al., 2019; Yi et al., 2021b; Levy et al., 2020) or causal inference (Arjovsky et al., 2019; He et al., 2021; Liu et al., 2021b; Mahajan et al., 2021; Wang et al., 2022; Ye et al., 2021). Ye et al. (2022) point out that OOD data can be categorized into data with diversity shift (e.g., PACS (Li et al., 2018)) and correlation shift (e.g., Waterbirds (Sagawa et al., 2019)); we focus on the latter in this paper, as it deteriorates the performance of the model on OOD test data (Geirhos et al., 2018; Beery et al., 2018; Xie et al., 2020; Wald et al., 2021).

Domain Generalization. The goal of domain generalization is extrapolating the model to test data from unseen domains to capture OOD generalization. The problem we explore can be treated by domain generalization methods, as data with different spurious attributes can be regarded as coming from different domains. The core idea in domain generalization is to learn a domain-invariant model. To this end, Arjovsky et al. (2019); Hu et al. (2020); Li et al. (2018); Mahajan et al. (2021); Heinze-Deml & Meinshausen (2021); Krueger et al. (2021); Wald et al. (2021); Seo et al. (2022) propose plenty of invariance metrics as training regularizers. However, unlike our CSV, none of these metrics controls the OOD generalization error. Moreover, none of these methods captures the invariance corresponding to the correlation shift we discuss (see Section 4.1). This motivates us to reconsider the effectiveness of these methods.
Finally, in contrast to ours, these methods require observable domain labels, which is often impractical. The techniques in (Liu et al., 2021b; Devansh Arpit, 2019; Sohoni et al., 2020; Creager et al., 2021) are also applicable without domain information, but they are built on strong assumptions (mixture Gaussian data (Liu et al., 2021b) and linear models (Devansh Arpit, 2019)) or require a high-quality spurious attribute classifier (Sohoni et al., 2020; Creager et al., 2021).

Distributional Robustness. Distributional robustness (Ben-Tal et al., 2013) based methods minimize the worst-case loss over different groups of data (Sagawa et al., 2019; Liu et al., 2021a; Zhou et al., 2022). The groups are decided via certain rules, e.g., data with the same spurious attributes (Sagawa et al., 2019) or annotations from validation sets with observable spurious attributes (Liu et al., 2021a; Zhou et al., 2022). However, Sagawa et al. (2019) find that directly minimizing the worst-group loss results in an unstable training process. In contrast, our method has a stable training process, as it balances the objectives of accuracy and robustness over spurious attributes (see Section 5).

2.2. PROBLEM SETUP

We collect the notations used in this paper. ∥·∥ is the ℓ2-norm of vectors. O(·) is the order of a number. The sample (X, Y) ∈ X × Y, where X and Y are respectively the input data and its label. The integer set from 1 to K is [K]. The cardinality of a set A is |A|. The loss function is L(·,·) : R^K × Y → R_+ with 0 ≤ L(·,·) ≤ M for positive K, M. For any distribution P, let R_pop(P, f) = E_P[L(f(X), Y)] and R_emp(P, f) = (1/n) Σ_{i=1}^{n} L(f(x_i), y_i) respectively be the population risk under P and its empirical counterpart. Here {(x_i, y_i)}_{i=1}^{n} are n i.i.d. samples from distribution P, and f(·) : X → R^K is the model, potentially with parameter space Θ ⊂ R^d (in which case f(·) becomes f_θ(·)). For random variables V1, V2 with joint distribution P_{V1,V2}, P_{V1} and P_{V2|V1} are the marginal distribution of V1 and the conditional distribution of V2 given V1; P_{V1}(v1) and P_{V2|V1}(v1, v2) are their probability measures. Suppose the training and OOD test data are respectively drawn from distributions P_{X,Y,Z} and Q_{X,Y,Z}. We may omit the subscript when there is no ambiguity. There usually exist similarities between P and Q that make OOD generalization possible (Kpotufe & Martinet, 2021). The similarity we explore is that there is only a correlation shift in the OOD test data, formulated as follows. For each input X, there exist spurious attributes Z ∈ Z that are not causal for predicting the class label, but Z is potentially related to the class label Y. Z can be some feature of X, e.g., the gender of the celebrity in Figure 1. The correlation between Z and Y (i.e., the spurious correlation) can vary from training to test data, and the correlation in the training distribution may become a misleading signal for the model's predictions on test data. For example, for the celebrity faces in Figure 1, if most males in the training set have dark hair, the model may overfit this spurious correlation and mispredict males with blond hair.
Thus we should learn a model that is robust to correlation shift, defined as follows.

Definition 1. Given training distribution P, a test distribution Q ∈ P has correlation shift, where

P = {Q_{X,Y,Z} : Q_Y = P_Y, Q_{X|Y,Z} = P_{X|Y,Z}}.   (1)

Our definition characterizes the distributions with correlation shift. The first equality in (1) obviates the mismatch caused by label shift (i.e., P_Y ≠ Q_Y), which is unrelated to spurious correlation; more discussion of it is in Appendix A. The second equality in (1) states the invariance of the conditional distribution of the data given the class label and spurious attributes, which is reasonable as the unstable spurious correlations are decided by the joint distribution of Y and Z. Finally, since

Q_{X,Y}(x, y) = Q_{X|Y}(x | y) Q_Y(y) = Q_Y(y) ∫_Z Q_{X|Y,Z}(x | y, z) dQ_{Z|Y}(z | y) = P_Y(y) ∫_Z P_{X|Y,Z}(x | y, z) dQ_{Z|Y}(z | y),

the two constraints in (1) together imply that the correlation shift of Q ∈ P comes from the variation of the conditional distribution Q_{Z|Y}, which is consistent with intuition. Our definition differs from the ones in (Mahajan et al., 2021; Makar & D'Amour, 2022), as they rely on a causal directed acyclic graph and the existence of a sufficient statistic through which alone Y affects X.
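Definition 1 can be made concrete with a small simulation. The sketch below (a hypothetical toy instantiation, not a construction from the paper) builds a training distribution P and a test distribution Q that share the label marginal P_Y and the conditional P_{X|Y,Z}, while only the spurious conditional Q_{Z|Y} changes, flipping the sign of the correlation between Y and Z:

```python
import numpy as np

# Toy instantiation of Definition 1: Y, Z binary, X | Y, Z a fixed Gaussian.
# P and Q share P_Y and P_{X|Y,Z}; only P_{Z|Y} vs. Q_{Z|Y} differ.
rng = np.random.default_rng(0)

P_Y = np.array([0.5, 0.5])                    # shared label marginal
P_Z_given_Y = np.array([[0.9, 0.1],           # training: Z strongly follows Y
                        [0.1, 0.9]])
Q_Z_given_Y = np.array([[0.1, 0.9],           # test: the correlation is flipped
                        [0.9, 0.1]])

def sample(n, Z_given_Y):
    """Draw (X, Y, Z) with the shared mechanism X | Y, Z ~ N(y + 2z, 1)."""
    y = rng.choice(2, size=n, p=P_Y)
    z = np.array([rng.choice(2, p=Z_given_Y[yi]) for yi in y])
    x = y + 2 * z + rng.standard_normal(n)    # same conditional under P and Q
    return x, y, z

x_tr, y_tr, z_tr = sample(5000, P_Z_given_Y)  # from P
x_te, y_te, z_te = sample(5000, Q_Z_given_Y)  # from a Q in the family (1)
```

On the training draw the empirical correlation between Y and Z is close to +0.8, while on the test draw it is close to -0.8, exactly the kind of shift Figure 1 depicts.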

3. GENERALIZING ON OOD DATA

In this section, we show that a misleading spurious correlation can hurt OOD generalization. Then we give a condition under which the model is OOD generalizable.

3.1. SPURIOUS CORRELATION MISLEADS MODELS

A common way to train a model is empirical risk minimization (ERM; Vapnik, 1999), i.e., approximating the minimizer of R_pop(P, f), which generalizes well on in-distribution samples, by minimizing its empirical counterpart R_emp(P, f). However, the following proposition shows that the minimizer of R_pop(P, f) may not generalize well on OOD data from the other distributions in P.

Proposition 1. There exists a population risk R_pop(P, f) whose minimizer has nearly perfect performance on data from P, while it fails to generalize to OOD data drawn from another Q ∈ P.

Similar results also appear in (Xie et al., 2020; Krueger et al., 2021), though not for the minimizer of the population risk. The proof of this proposition is in Appendix B; it indicates that the spurious correlation in the training data can become a misleading supervision signal that deteriorates the model's performance on OOD data. Hence, it is crucial to learn a model that is independent of such spurious correlation, even though it can sometimes be helpful on the training set (Xie et al., 2020).
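The phenomenon behind Proposition 1 is easy to reproduce on a toy problem (this is an illustrative setup of ours, not the construction from Appendix B): the causal feature x1 is noisy while the spurious feature x2 tracks the label almost perfectly under P, so ERM leans on x2 and collapses once the spurious correlation flips under Q:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, flip_spurious):
    y = rng.choice([-1.0, 1.0], size=n)
    x1 = y + 1.5 * rng.standard_normal(n)      # causal feature, noisy
    s = -y if flip_spurious else y
    x2 = s + 0.1 * rng.standard_normal(n)      # spurious feature, nearly clean
    return np.column_stack([x1, x2]), y

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient descent on the (unregularized) logistic ERM objective."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))
        w += lr * ((1 - p) * y) @ X / len(y)
    return w

X_p, y_p = make_data(4000, flip_spurious=False)  # training distribution P
X_q, y_q = make_data(4000, flip_spurious=True)   # OOD test distribution Q
w = fit_logistic(X_p, y_p)
acc_p = np.mean(np.sign(X_p @ w) == y_p)
acc_q = np.mean(np.sign(X_q @ w) == y_q)
```

Here `acc_p` is near-perfect while `acc_q` drops below chance, since the learned weight on x2 dominates the weight on x1.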

3.2. CONDITIONAL INDEPENDENCE ENABLES OOD GENERALIZATION

Next, we give a sufficient condition, proved in Appendix B, that makes the model OOD generalizable. The condition is also necessary under some specific data generating structures (Veitch et al., 2021).

Theorem 1. For a model f(·) satisfying f(X) ⊥ Z | Y, the conditional distribution Y | f(X) and the population risk E_Q[L(f(X), Y)] are invariant over (X, Y) ~ Q_{X,Y} such that Q ∈ P.

Here f(X) ⊥ Z | Y means that, given Y, f(X) is conditionally independent of Z. Theorem 1 shows that this conditional independence obviates the impact of correlation shift, as the prediction error (the gap between Y and f(X), decided by Y | f(X)) and the population risk of the model are invariant over test distributions Q ∈ P. Thus we propose to obtain a model that is conditionally independent of the spurious attributes.

Remark 1. If the spurious attributes are domain labels, the conditional independence in Theorem 1 becomes the one in (Liu et al., 2015; Hu et al., 2020; Mahajan et al., 2021), though these works do not explore its connection with OOD generalization. Besides, the counterexample in Mahajan et al. (2021) violates our conditional invariance assumption in (1) and hence does not contradict our theorem.
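The risk invariance in Theorem 1 can be checked numerically on a discrete toy (an assumed setup of ours): with X = (X_c, Z), a model that reads only the causal coordinate X_c satisfies f(X) ⊥ Z | Y, so its population risk is identical for every Q in the family, no matter how Q_{Z|Y} shifts:

```python
import numpy as np

P_Y = np.array([0.4, 0.6])
# X_c equals Y with probability 0.8, independently of Z (so f(X) = X_c ignores Z).
P_Xc_given_Y = np.array([[0.8, 0.2],
                         [0.2, 0.8]])

def risk(Z_given_Y):
    """Population 0-1 risk of the Z-ignoring predictor f(x_c, z) = x_c."""
    r = 0.0
    for y in range(2):
        for z in range(2):
            for xc in range(2):
                p = P_Y[y] * Z_given_Y[y, z] * P_Xc_given_Y[y, xc]
                r += p * (xc != y)
    return r

risk_P = risk(np.array([[0.9, 0.1], [0.1, 0.9]]))   # training correlation
risk_Q = risk(np.array([[0.2, 0.8], [0.7, 0.3]]))   # an arbitrary shift of Q_{Z|Y}
```

Both risks equal 0.2 exactly: the Z_given_Y factor sums out, which is the discrete analogue of the invariance claimed in the theorem.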

4. LEARNING OOD GENERALIZABLE MODEL

In Theorem 1 we propose an independence condition to break correlation shift. In this section, a metric, Conditional Spurious Variation (CSV), is proposed to quantitatively measure this independence. As CSV controls the OOD generalization error, a smaller CSV leads to better OOD generalization. Finally, two estimators of CSV are proposed, depending on whether the spurious attributes are observable.

4.1. GUARANTEED OOD GENERALIZATION ERROR

Theorem 1 shows that conditional independence between the model and the spurious attributes guarantees OOD generalization. We propose the following metric to measure such independence.

Definition 2 (Conditional Spurious Variation). The conditional spurious variation of a model f(·) is

CSV(f) = E_Y [ sup_{z1,z2} ( E_X[L(f(X), Y) | Y, Z = z1] - E_X[L(f(X), Y) | Y, Z = z2] ) ].

As can be seen, CSV is a functional of f(·) that measures the intra-class conditional variation of the model over spurious attributes, given the class label Y. It can be computed from the training distribution and is invariant across Q ∈ P due to (1). It is worth noting that a model satisfying the conditional independence in Theorem 1 has zero CSV, but not vice versa. However, the following theorem, proved in Appendix C.1, shows that CSV(f) is sufficient to control the OOD generalization error.

Theorem 2. We have

sup_{Q ∈ P} |R_emp(f, P) - R_pop(f, Q)| ≤ |R_emp(f, P) - R_pop(f, P)| + CSV(f).   (4)

The term |R_emp(f, P) - R_pop(f, P)| is the in-distribution generalization error, which is well explored (Vershynin, 2018). Thus, we upper bound the OOD generalization error by the in-distribution one plus CSV. The OOD generalization error is also connected to many other metrics, e.g., (Hu et al., 2020; Mahajan et al., 2021; Ben-David et al., 2007; 2010; Muandet et al., 2013; Ganin et al., 2016), but none of them directly controls the OOD generalization error. Besides, these metrics are proposed to obtain invariance conditioned on Z, i.e., invariant P_{f(X),Y|Z} or P_{f(X)|Z}, and such invariances cannot handle correlation shift (Definition 1). Indeed, 1) invariant P_{f(X),Y|Z} implies invariant P_{Y|Z}, which is incompatible with correlation shift; 2) invariant P_{f(X)|Z} = ∫_Y P_{f(X)|Z,Y}(f(x) | z, y) dP_{Y|Z}(y | z) does not imply invariant P_{f(X)|Y,Z}, which is what guarantees OOD generalization.
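Definition 2 and the role of CSV in Theorem 2 can be illustrated on a discrete toy (assumed values of ours): we compute CSV(f) exactly from the conditional risks and check that the population-risk gap between P and any Q in the family never exceeds CSV(f):

```python
import numpy as np

P_Y = np.array([0.5, 0.5])
# Conditional risks r[y, z] = E_X[L(f(X), y) | Y=y, Z=z] of some fixed model f.
r = np.array([[0.1, 0.4],
              [0.3, 0.2]])

# CSV(f) = E_Y sup_{z1,z2} ( r[Y, z1] - r[Y, z2] )
csv = np.sum(P_Y * (r.max(axis=1) - r.min(axis=1)))

def pop_risk(Z_given_Y):
    # R_pop under a member of the family: only Q_{Z|Y} varies.
    return np.sum(P_Y[:, None] * Z_given_Y * r)

rng = np.random.default_rng(0)
risk_P = pop_risk(np.array([[0.9, 0.1], [0.1, 0.9]]))
gaps = []
for _ in range(200):                       # random Q_{Z|Y} within the family
    q = rng.random((2, 1))
    gaps.append(abs(pop_risk(np.hstack([q, 1 - q])) - risk_P))
max_gap = max(gaps)
```

Since each class-conditional risk is a convex combination of the r[y, ·] entries, the gap per class is at most the within-class range, so `max_gap` is bounded by `csv` (here 0.2), matching the population version of (4).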
As our bound (4) involves both CSV and the in-distribution generalization error, it motivates us to explore whether the conditional independence is compatible with in-distribution generalization. The following information-theoretic bound, proved in Appendix C.1, gives a positive answer. Let I(V1, V2) be the mutual information between variables V1, V2, E_gen(f_{θ_S}, P) = |E[R_emp(f_{θ_S}, P)] - R_pop(f_{θ_S}, P)|, g(·) be any measurable function, S_{x-g(z)} = {x_i - g(z_i)}_{i=1}^{n}, and S_y = {y_i}_{i=1}^{n}.

Theorem 3. If f_{θ_S}(X) ⊥ Z | Y, then E_gen(f_{θ_S}, P) ≤ inf_g M √( I(S_{x-g(z)}, θ_S | S_y) / (4n) ).

Our bound improves the classical result E_gen(f_{θ_S}, P) ≤ M √( I(S, θ_S) / (4n) ), which holds without conditional independence (Steinke & Zakynthinou, 2020), because taking g(·) as a constant function and then applying the data processing inequality (Xu & Raginsky, 2017) recovers the classical bound.

Assumption 1. Y = [K_y] and Z = [K_z] for some positive integers K_y, K_z. Besides that, the number of observations A_{kz} = {i : y_i = k, z_i = z} for each pair (k, z) ∈ [K_y] × [K_z] satisfies |A_{kz}| = n_{kz} > 0.

Assumption 2. The model is parameterized by θ ∈ Θ ⊂ R^d. The loss function L(f_θ(x), y) is Lipschitz continuous and smooth w.r.t. θ with coefficients L_0 and L_1, i.e., for any (x, y) ∈ X × Y and θ_1, θ_2 ∈ Θ,

|L(f_{θ_1}(x), y) - L(f_{θ_2}(x), y)| ≤ L_0 ∥θ_1 - θ_2∥;  ∥∇_θ L(f_{θ_1}(x), y) - ∇_θ L(f_{θ_2}(x), y)∥ ≤ L_1 ∥θ_1 - θ_2∥.

In Assumption 1, we require the spurious attribute space to be finite. This is reasonable as Z is a "label" of spurious attributes, e.g., the gender label "male" or "female" in the CelebA dataset (Figure 1) when classifying hair color. Assumption 1 also requires that data with all possible combinations of label and spurious attributes are collected in the training set. This is a mild condition since we do not restrict the magnitude of n_{kz}; for example, to satisfy it, we can synthesize some of the missing data with generative models as in (Wang et al., 2022; Zhu et al., 2017).
Let L_{kz}(f_θ) = E[L(f_θ(X), k) | Y = k, Z = z], L̂_{kz}(f_θ) = (1/n_{kz}) Σ_{i ∈ A_{kz}} L(f_θ(x_i), k), and p̂_k = n_k/n with n_k = Σ_{z ∈ [K_z]} n_{kz}. Then the following empirical counterpart of CSV,

CSV̂(f_θ) = Σ_{k=1}^{K_y} p̂_k sup_{z1,z2 ∈ [K_z]} ( L̂_{kz1}(f_θ) - L̂_{kz2}(f_θ) ),   (7)

is a natural estimator of CSV. The following theorem quantifies its approximation error.

Theorem 4. Under Assumptions 1 and 2, if inf_{k ∈ [K_y], z ∈ [K_z]} n_{kz}/n_k = O(1), then

CSV(f_θ) ≤ CSV̂(f_θ) + O( √(log(1/δ)/n) )   (8)

holds with probability at least 1 - δ for any θ ∈ Θ, δ > 0. This theorem implies that CSV is upper bounded by CSV̂(f_θ). As shown in the proof in Appendix C.2, we hide a factor related to the covering number (Vershynin, 2018) in the O(·) notation.

Algorithm 1: Regularizing training with CSV.
Require: step sizes η_θ, γ; coefficients λ, ρ; unbiased estimators R̂_emp(f_θ, P) of R_emp(f_θ, P) and F̂_k(θ) of F_k(θ).
1: for t = 0, ..., T do
2:   Solve the maximization problem:
3:   for k = 1, ..., K_y do
4:     F̄_{t+1}^k = (1 - γ) F̄_t^k + γ F̂_k(θ(t));
5:     u_k(t + 1) = Softmax(F̄_{t+1}^k / ρ).
6:   end for
7:   Minimization step via SGD:
8:   θ(t + 1) = θ(t) - η_θ Σ_{k=1}^{K_y} p̂_k ∇_θ( R̂_emp(f_{θ(t)}, P) + λ u_k(t + 1)^⊤ F̂_k(θ(t)) ).
9: end for
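A minimal numpy sketch of the Algorithm 1 loop on a hypothetical toy problem (our own illustrative losses, not the paper's experimental setup): one class (K_y = 1), two spurious groups with losses L_z(θ) = (θ - c_z)², and F(θ) holding all pairwise loss differences. The names `eta`, `gamma`, `rho`, `lam` follow the text:

```python
import numpy as np

c = np.array([0.0, 1.0])                    # toy group "centers"

def group_losses(theta):                    # hat-L_{kz}(f_theta)
    return (theta - c) ** 2

def group_grads(theta):                     # their gradients w.r.t. theta
    return 2 * (theta - c)

pairs = [(0, 1), (1, 0)]                    # vectorized (z1, z2) differences
eta, gamma, rho, lam, T = 0.05, 0.5, 0.5, 1.0, 500
theta, F_bar = 3.0, np.zeros(len(pairs))

for t in range(T):
    L, G = group_losses(theta), group_grads(theta)
    F = np.array([L[z1] - L[z2] for z1, z2 in pairs])   # hat-F_k(theta(t))
    F_bar = (1 - gamma) * F_bar + gamma * F             # Line 4: moving average
    e = np.exp((F_bar - F_bar.max()) / rho)
    u = e / e.sum()                                     # Line 5: Softmax(F_bar/rho)
    grad_F = np.array([G[z1] - G[z2] for z1, z2 in pairs])
    theta -= eta * (G.mean() + lam * u @ grad_F)        # Line 8: gradient step
```

The mean loss and the CSV penalty |L_0 - L_1| are both minimized at θ = 0.5, and the iterate converges there; the moving average plays exactly the variance-reduction role described above Theorem 5.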

4.3. ESTIMATING CSV WITH UNOBSERVABLE SPURIOUS ATTRIBUTES

Computing the empirical CSV (7) requires observable spurious attributes, which may not be available in practice (Liu et al., 2021a). Thus, we need to estimate CSV in the absence of spurious attributes. Let P_{kz} = P_{X|Y=k,Z=z} be the conditional distribution of X given Y, Z. The core difficulty of estimating CSV with unobservable spurious attributes is to estimate sup_z E_{P_{kz}}[L(f_θ(X), k)] - inf_z E_{P_{kz}}[L(f_θ(X), k)] from the n_k independent samples {(x_i, y_i)}_{i ∈ A_k} drawn from the mixture distribution P_k = Σ_{z ∈ [K_z]} π_{kz} P_{kz}. Here A_k = ∪_{z ∈ [K_z]} A_{kz} for k ∈ [K_y], P_k = P_{X|Y=k}, π_{kz} = P_{Z|Y}(Z = z | Y = k), and we cannot tell which A_{kz} each sample in A_k comes from. To proceed, suppose π_{kz} ≥ c > 0, which is a necessary condition for Assumption 1 to hold. We show in Appendix C.3 that the quantile conditional expectation

E_{P_k}[L(f_θ(X), k) | L(f_θ(X), k) ≥ q_{P_k}(1 - c)] - E_{P_k}[L(f_θ(X), k) | L(f_θ(X), k) ≤ q_{P_k}(c)]   (9)

is an upper bound (which is sharp for K_z ≥ 3) of sup_z E_{P_{kz}}[L(f_θ(X), k)] - inf_z E_{P_{kz}}[L(f_θ(X), k)]. Here q_{P_k}(·) is the quantile function defined as q_{P_k}(s) = inf{p : P_k(L(f_θ(X), k) ≤ p) ≥ s}. For large n_k, we must have π_{kz} ≥ 1/n_k = c for each z ∈ [K_z]. Thus, substituting the expectations over P_k in (9) with their empirical counterparts for c = 1/n_k, we get the following estimator

CSV̂_U(f_θ) = Σ_{k=1}^{K_y} p̂_k ( max_{i ∈ A_k} L(f_θ(x_i), k) - min_{i ∈ A_k} L(f_θ(x_i), k) ).

The subscript "U" means "unobservable spurious attributes". Besides, CSV̂_U(f_θ) is an upper bound of the estimator CSV̂(f_θ), which gives another, more straightforward way to obtain it.
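The two estimators, and the upper-bound relation between them, can be sketched with hypothetical per-sample losses: with observed Z we average within each (k, z) group as in (7); without Z, the unobservable-attribute estimator falls back to the max-minus-min per-sample loss within each class:

```python
import numpy as np

rng = np.random.default_rng(0)
Ky, Kz, n_per = 2, 2, 50
# Hypothetical losses L(f_theta(x_i), k) for each (class, spurious-group) cell.
losses = {(k, z): rng.random(n_per) * (0.2 + 0.3 * z)
          for k in range(Ky) for z in range(Kz)}
p_hat = np.full(Ky, 1.0 / Ky)

# Observable Z: hat-CSV from per-group mean losses, as in (7).
csv_hat = sum(
    p_hat[k] * (max(losses[k, z].mean() for z in range(Kz))
                - min(losses[k, z].mean() for z in range(Kz)))
    for k in range(Ky))

# Unobservable Z: per-class max minus min of individual losses.
csv_u = sum(
    p_hat[k] * (np.concatenate([losses[k, z] for z in range(Kz)]).max()
                - np.concatenate([losses[k, z] for z in range(Kz)]).min())
    for k in range(Ky))
```

Since the per-class maximum loss is at least the largest group mean and the minimum is at most the smallest, `csv_u >= csv_hat` always holds, which is the straightforward upper-bound argument mentioned above.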

5. REGULARIZING TRAINING WITH CSV (RCSV)

The previous results show that a model with small CSV generalizes well on OOD data. Theorem 4 and the discussion in Section 4.3 approximate CSV by CSV̂(f_θ) and CSV̂_U(f_θ), respectively. Thus we can regularize the training process with one of them to improve OOD generalization, depending on whether the spurious attributes are observable. Notably, both regularized training objectives can be formulated as the following minimax problem for positive constants m and λ:

min_{θ ∈ Θ} Σ_{k=1}^{K_y} p̂_k ( R_emp(f_θ, P) + λ max_{u ∈ Δ_m} u^⊤ F_k(θ) ).   (11)

Here Δ_m = {u = (u_1, ..., u_m) ∈ R_+^m : Σ_i u_i = 1}, F_k(θ) ∈ R^m, and each coordinate of F_k(θ) is a Lipschitz continuous function with Lipschitz gradient. Under Assumptions 1 and 2, the training process of empirical risk minimization regularized with CSV̂(f_θ) or CSV̂_U(f_θ) can be formulated as the above problem by setting F_k(θ) to the vectorization of, respectively, the two following matrices (the m for CSV̂(f_θ) and CSV̂_U(f_θ) are respectively K_z^2 and |A_k|^2):

( L̂_{kz1}(f_θ) - L̂_{kz2}(f_θ) )_{z1,z2 ∈ [K_z]},   ( L(f_θ(x_{i1}), k) - L(f_θ(x_{i2}), k) )_{i1,i2 ∈ A_k}.   (12)

Before solving (11), we clarify the difference between regularizing training with CSV and distributionally robust optimization (DRO) based methods, which minimize the worst-case expected loss over data with the same spurious attributes; e.g., GroupDRO (Sagawa et al., 2019) minimizes max_{k,z} E_{P_{kz}}[L(f_θ(X), k)]. Theoretically, an OOD generalizable model has both perfect in-distribution test accuracy and robustness over different spurious attributes, as in (4). Our regularized training objective splits these two goals, while DRO based methods mix them into one objective. Though both objectives theoretically upper bound the loss on OOD data, we empirically observe that the two goals are in contradiction with each other (see Section 6).
We also observe that splitting the two goals, as in our objective, lets us easily balance them, which guarantees a stable training process. In contrast, the mixed training objective can easily be dominated by one of the two goals, resulting in an unstable training process. Similar phenomena are also observed in (Sagawa et al., 2019); this motivates the early stopping or large weight decay regularization used in GroupDRO.

5.1. SOLVING THE MINIMAX PROBLEM

Let ϕ^k(θ, u) = R_emp(f_θ, P) + λ u^⊤ F_k(θ) and Φ^k(θ) = max_{u ∈ Δ_m} ϕ^k(θ, u). Under Assumptions 1 and 2, (11) is a nonconvex-concave minimax problem. As explored in (Lin et al., 2020), nonconvex-strongly-concave minimax problems are much easier to solve than nonconvex-concave ones. Thus we consider the surrogate of ϕ^k(θ, u) defined as

ϕ_ρ^k(θ, u) = ϕ^k(θ, u) - λρ Σ_{j=1}^{m} u(j) log(m u(j))

for ρ > 0, which is strongly concave w.r.t. u, and which approximates ϕ^k(θ, u) well for small ρ. Next, we consider the following nonconvex-strongly-concave problem instead of (11):

min_{θ ∈ Θ} Σ_{k=1}^{K_y} p̂_k max_{u ∈ Δ_m} ϕ_ρ^k(θ, u) = min_{θ ∈ Θ} Σ_{k=1}^{K_y} p̂_k Φ_ρ^k(θ),   (13)

where Φ_ρ^k(θ) = max_{u ∈ Δ_m} ϕ_ρ^k(θ, u). To solve (13), we propose Algorithm 1. In Algorithm 1, lines 3-6 solve the maximization problem in (13), which has the closed-form solution u_k^*(t + 1) = Softmax(F_k(θ(t))/ρ) (Yi et al., 2021a), where Softmax(·) is the softmax function (Epasto et al., 2020). As the estimator F̂_k(θ) may have large variance, directly substituting it for F_k(θ) in u_k^*(t + 1) in Line 8 (the minimization step) would induce a large deviation. Thus we use the moving-average correction F̄_{t+1}^k (Line 4) to estimate F_k(θ(t)), which guarantees our convergence result in Theorem 5. The convergence rate of Algorithm 1 is evaluated via approximating a first-order stationary point, which is standard for nonconvex problems (Ghadimi & Lan, 2013; Lin et al., 2020).

Theorem 5. Under Assumptions 1 and 2, if R̂_emp(f_θ, P) and F̂_k(θ) are all unbiased estimators with bounded variance σ^2, and θ(t) is updated by Algorithm 1 with η_θ = O(T^{-3/5}) and γ = T^{-2/5}, then

min_{1 ≤ t ≤ T} E[ ∥ Σ_{k=1}^{K_y} p̂_k ∇Φ_ρ^k(θ(t)) ∥^2 ] ≤ O(T^{-2/5}).   (14)

Besides, for any θ(t) and ρ, we have | Σ_{k=1}^{K_y} p̂_k (Φ_ρ^k(θ(t)) - Φ^k(θ(t))) | ≤ λρ(1/(me) + 2 log m).
The theorem, proved in Appendix D, says that θ(t) in Algorithm 1 approximates a first-order stationary point of the surrogate loss Φ_ρ^k(·) at rate O(T^{-2/5}) (improvable to O(T^{-1/2}) when σ^2 = O(T^{-1/2})). As the gap between Φ^k(·) and Φ_ρ^k(·) is O(ρ), taking a small ρ yields a small Φ^k(θ(T)). The unbiased estimators in our theorem are constructed in the next section.
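The closed-form inner maximizer used in lines 3-6 of Algorithm 1 can be verified directly: over the simplex, u* = Softmax(F/ρ) maximizes u^⊤F - ρ Σ_j u(j) log(m u(j)) (the entropic surrogate up to the λ factor). Below, `F` and `rho` are arbitrary test values of our choosing, and the maximizer is compared against random points of the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
m, rho = 5, 0.3
F = rng.standard_normal(m)

def objective(u):
    # u^T F - rho * sum_j u_j log(m u_j), the entropy-regularized inner problem
    return u @ F - rho * np.sum(u * np.log(m * u))

e = np.exp((F - F.max()) / rho)     # numerically stable Softmax(F / rho)
u_star = e / e.sum()

# No random simplex point should beat the closed-form solution.
best_random = max(objective(rng.dirichlet(np.ones(m))) for _ in range(2000))
```

Setting the Lagrangian stationarity condition F_j - ρ(log(m u_j) + 1) = const gives u_j ∝ exp(F_j/ρ), which is exactly the softmax used in Line 5.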

6. EXPERIMENTS

In this section, we empirically evaluate the efficacy of the proposed Algorithm 1 in terms of breaking spurious correlation. More experiments are shown in Appendix E. For RCSV, in each step t, we let R̂_emp(f_{θ(t)}, P) be the empirical risk over a uniformly drawn batch (of size S) of data. We then randomly sample another batch (size S) with replacement, where the probability of each sample with class label k and spurious attribute z being drawn is 1/(K_y K_z n_{kz}), and estimate L̂_{kz}(f_{θ(t)}) in (12) as the conditional expected risk over this batch. For RCSV_U, R̂_emp(f_{θ(t)}, P) in each step t is as in RCSV. We again randomly sample another mini-batch (size S) with replacement, but the probability of a sample with label k being drawn is 1/(K_y n_k), and we estimate F_k(θ) in (12) by its empirical counterpart over the sampled data. As can be seen, all the resulting R̂_emp(f_{θ(t)}, P) and F̂_k(θ(t)) are unbiased estimators with variance of order O(1/S), as required in Theorem 5. Besides, our RCSV (resp. RCSV_U) can be implemented with (resp. without) observable spurious attributes. The complete implementations of RCSV and RCSV_U are shown in Appendix G.1. When estimating CSV, the data are sampled with weights inversely proportional to n_{kz} or n_k. This sampling strategy also appears in (Sagawa et al., 2019; Idrissi et al., 2021; Arjovsky et al., 2019) and significantly improves OOD generalization according to our ablation study in Appendix F.

Data. We use the following benchmark datasets with correlation shift (see details in Appendix G.2). CelebA (Liu et al., 2015). An image classification task to recognize a celebrity's hair color ("dark" or "blond"), which is spuriously correlated with the celebrity's gender ("male" or "female"). The data are categorized into 4 groups via the combination of hair color and gender, e.g., "dark-female" (D-F). Waterbirds (Sagawa et al., 2019).
An image classification task to recognize a bird as "waterbird" or "landbird", where the label is spuriously correlated with the background, "land" or "water". The data are categorized into 4 groups, e.g., "landbird-water" (L-W). MultiNLI (Williams et al., 2018). Given a sentence pair, the task is to recognize the relationship between the two sentences, i.e., "entailment", "neutrality", or "contradiction". The relationship is spuriously correlated with the presence of negation words. The data are categorized into 6 groups, e.g., "entailment-without negation" (E-W), "contradiction-negation" (C-N). CivilComments (Borkan et al., 2019). A text classification task to check whether a sentence is toxic or not, with the label spuriously correlated with whether any of 8 certain demographic identities are mentioned. The data have 4 groups, e.g., "nontoxic-identity" (N-I), "toxic-nonidentity" (T-N).

Models. The backbone models are pre-trained on ImageNet (Deng et al., 2009) for the image datasets and pre-trained BERT (Devlin et al., 2019) for the text datasets. The hyperparameters are in Appendix G.4.

Main Results. Our goal is to verify whether these methods can break the spurious correlation in the data. Thus, for each dataset, we report the test accuracies on each of its groups, as the groups are divided via the combination of class label and spurious attribute. We also report the averaged test accuracy over groups ("Avg"), the test accuracy on the whole test set ("Total", which is the in-distribution test accuracy except for Waterbirds), and the worst test accuracy over groups ("Worst"). The results are in Tables 1, 2, 3. The "√" and "×" for SA (spurious attributes) respectively indicate whether the method requires observable spurious attributes. The observations from these tables are as follows. To check OOD generalization, a direct way is to compare the "Avg" and "Worst" columns, which summarize the results over groups. In terms of these two criteria, the proposed RCSV (resp.
RCSV_U) consistently achieves better performance than the baseline methods with observable (resp. unobservable) spurious attributes. This verifies that our methods improve OOD generalization. On the other hand, leveraging observable spurious attributes benefits OOD generalization, since the methods with them consistently exhibit better performance than the ones without; for example, the discussion in Section 4.3 shows that the estimator of CSV in RCSV with observable spurious attributes is more accurate than the one in RCSV_U. There is a trade-off between the robustness of the model over spurious attributes and the test accuracies on the groups with the same class label, especially for CelebA and Waterbirds; see "D-F" vs. "D-M" in CelebA for example. The phenomenon illustrates that some spurious correlations are captured by all methods. However, compared to the other methods, our methods have better averaged test accuracies and a smaller gap between the test accuracies over groups with the same spurious attributes. The robustness and test accuracies here respectively correspond to the goals of "robustness" and "in-distribution test accuracy" in Section 5; the improvements support our discussion in Section 5 that splitting the goals of accuracy and robustness lets us easily balance them.

A LABEL SHIFT

In the sequel, we may omit the subscript if there is no ambiguity. Our discussion in the main body of this paper is built upon the assumption that the marginal distribution of the label Y is invariant, i.e., P_Y = Q_Y. In this section, we explore OOD generalization without this invariance assumption. Before presenting our discussion, we give the definition of the total variation distance.

Definition 3. The total variation distance between two distributions P, Q defined on the same measurable space X is TV(P, Q) = (1/2) ∫_X |dP(x) - dQ(x)|.

In Theorem 1, we show that the gap between the performance of the model on training and OOD test data disappears if the model satisfies the conditional independence f(X) ⊥ Z | Y. However, the following example shows that the gap does not disappear if the marginal distribution of Y also varies between training and test data.

Example 1. Suppose Y, Z ∈ {-1, 1} and consider the specialized loss function L(f(X), Y) = 1_{Y=1}(5 - f(X)) + 1_{Y=-1}(2 + f(X)), where f(·) is any classifier whose output is in {-1, 1}. Let P, Q be two distributions such that P_{X|Y,Z} = Q_{X|Y,Z} but P_Y ≠ Q_Y. We suppose X ⊥ Z | Y, and thus f(X) ⊥ Z | Y. Then

P_{X|Y}(x | y) = Σ_{z ∈ Z} P_{X,Z|Y}(x, z | y) = Σ_{z ∈ Z} P_{X|Y,Z}(x | y, z) P_{Z|Y}(z | y) = P_{X|Y,Z}(x | y, z),

which is unrelated to P_{Z|Y}; together with P_{X|Y,Z} = Q_{X|Y,Z}, this gives P_{X|Y} = Q_{X|Y}. Thus

E_P[L(f(X), Y)] = P_Y(Y = 1) E_P[L(f(X), Y) | Y = 1] + P_Y(Y = -1) E_P[L(f(X), Y) | Y = -1],
E_Q[L(f(X), Y)] = Q_Y(Y = 1) E_Q[L(f(X), Y) | Y = 1] + Q_Y(Y = -1) E_Q[L(f(X), Y) | Y = -1].

Since P_{X|Y} = Q_{X|Y}, the conditional risks agree under P and Q, so

|E_P[L(f(X), Y)] - E_Q[L(f(X), Y)]| = |(P_Y(Y = 1) - Q_Y(Y = 1)) (E_P[L(f(X), Y) | Y = 1] - E_P[L(f(X), Y) | Y = -1])| ≥ |4 - 3| × |P_Y(Y = 1) - Q_Y(Y = 1)| = TV(P_Y, Q_Y),

where the inequality uses E_P[L | Y = 1] = 5 - E_P[f(X) | Y = 1] ≥ 4 and E_P[L | Y = -1] = 2 + E_P[f(X) | Y = -1] ≤ 3, and TV(P_Y, Q_Y) is the total variation distance between the marginal distributions P_Y, Q_Y. This inequality holds for any f(X), and hence the gap can never be removed by representation learning as in Theorem 1.
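Example 1's lower bound can be checked numerically (the label marginals below are our own test values; the loss is the one specified in the example). Sweeping over all conditional mean predictions of a {-1, 1}-valued classifier, the risk gap between P and Q never falls below TV(P_Y, Q_Y):

```python
import numpy as np

# L(f(X), Y) = 1{Y=1}(5 - f(X)) + 1{Y=-1}(2 + f(X)), with f(X) in {-1, 1}.
pP, pQ = 0.7, 0.3                  # P_Y(Y=1), Q_Y(Y=1); X | Y is shared
tv = abs(pP - pQ)                  # binary total variation TV(P_Y, Q_Y)

def risk(p1, f1, fm1):
    """Population risk when E[f(X) | Y=1] = f1 and E[f(X) | Y=-1] = fm1."""
    return p1 * (5 - f1) + (1 - p1) * (2 + fm1)

gaps = []
for f1 in np.linspace(-1, 1, 21):          # all achievable conditional means
    for fm1 in np.linspace(-1, 1, 21):
        gaps.append(abs(risk(pP, f1, fm1) - risk(pQ, f1, fm1)))
min_gap = min(gaps)
```

The gap equals |P_Y(1) - Q_Y(1)| · |3 - f1 - fm1| ≥ |P_Y(1) - Q_Y(1)|, so `min_gap` matches `tv` exactly (attained at f1 = fm1 = 1), confirming the example's claim.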
The example indicates that under a shifted label distribution, a conditionally independent model cannot generalize on OOD data. Thus, we consider a reweighted loss to correct the bias brought by the shifted label distribution. The formal result is stated as follows.

Theorem 6. Let P, Q be two distributions such that P_{X|Y,Z} = Q_{X|Y,Z}, while P_Y does not necessarily equal Q_Y. Let w(y): Y → R_+ be a weighting function satisfying E_P[w(Y)] = 1. Then, if f(X) ⊥ Z | Y,
|E_P[w(Y)L(f(X), Y)] − E_Q[L(f(X), Y)]| ≤ 2M TV(P^w_Y, Q_Y),
where P^w_Y is the reweighted label distribution defined as P^w_Y(A) = ∫_A w(y) dP_Y(y) for any measurable set A ⊂ Y.

Proof. Because P_{X|Y,Z} = Q_{X|Y,Z} and f(X) ⊥ Z | Y, as in Appendix C.1, we have P(f(X) | Y) = Q(f(X) | Y). Thus
|E_P[w(Y)L(f(X), Y)] − E_Q[L(f(X), Y)]| = |∫_Y w(y) E_P[L(f(X), Y) | Y = y] dP_Y(y) − ∫_Y E_Q[L(f(X), Y) | Y = y] dQ_Y(y)| ≤ ∫_Y M |w(y) dP_Y(y) − dQ_Y(y)| = 2M TV(P^w_Y, Q_Y).

Remark 2. The total variation distance TV(P^w_Y, Q_Y) appears in the upper bound on the gap between the two population risks in (21). Moreover, this term seems inevitable, since it also appears in the lower bound in (19). According to Theorem 6, we have the invariance E_P[w(Y)L(f(X), Y)] = E_Q[L(f(X), Y)] if we can take w(y) = dQ_Y(y)/dP_Y(y). Thus, if the label distribution of the test data is available, minimizing the reweighted loss with weights given by the ratio of the two label distributions guarantees the OOD generalization capability of the model. However, the label distribution of test data is usually unavailable in practice. Thus, for an unknown test label distribution, we alternatively choose the weight w(·) to minimize the worst-case upper bound
sup_Q TV(P^w_Y, Q_Y) = (1/2) sup_Q ∫_Y |w(y) dP_Y(y) − dQ_Y(y)|,
given the training distribution P, where the supremum is taken over all distributions Q such that Q_{X|Y,Z} = P_{X|Y,Z}.
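Theorem 6 implies that with w(y) = dQ_Y(y)/dP_Y(y), the reweighted training risk matches the test risk. The sketch below checks this numerically for the specialized loss of Example 1 with a fixed classifier; the particular label marginals and sample size are hypothetical choices for illustration only:

```python
import random

# Label-reweighted risk (as in Theorem 6): with w(y) = Q_Y(y)/P_Y(y), the
# reweighted empirical risk over samples from P is an unbiased estimate of the
# population risk under Q. Loss is the specialized L from Example 1.

def loss(fx, y):
    return 5 - fx if y == 1 else 2 + fx

def reweighted_risk(samples, w):
    # samples: list of (x, y) drawn from P; w: dict mapping y -> Q_Y(y)/P_Y(y)
    f = lambda x: 1  # an arbitrary fixed classifier with output in {-1, 1}
    return sum(w[y] * loss(f(x), y) for x, y in samples) / len(samples)

random.seed(0)
p_y1, q_y1 = 0.8, 0.5            # P_Y(Y=1) = 0.8 on train, Q_Y(Y=1) = 0.5 on test
w = {1: q_y1 / p_y1, -1: (1 - q_y1) / (1 - p_y1)}
# Note E_P[w(Y)] = 0.8 * 0.625 + 0.2 * 2.5 = 1, as Theorem 6 requires.
samples = [(0.0, 1 if random.random() < p_y1 else -1) for _ in range(100000)]
# Test risk under Q: 0.5 * L(1, 1) + 0.5 * L(1, -1) = 0.5 * 4 + 0.5 * 3 = 3.5
print(reweighted_risk(samples, w))  # close to 3.5
```

Without the weights, the plain empirical risk would concentrate near 0.8 · 4 + 0.2 · 3 = 3.8, i.e., biased away from the test risk.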
Then, by minimizing the reweighted loss under such a weight w(·), we obtain a model with minimized worst-case risk over different distributions.

Proposition 2. Suppose Y is a discrete space. If P_Y(Y = y) > 0 for all y ∈ Y and w*(y) = 1/(|Y| P_Y(Y = y)), then
w*(·) ∈ argmin_{w(·): E_P[w(Y)]=1} sup_{Q∈P} TV(P^w_Y, Q_Y),
where |Y| is the cardinality of Y.

Proof. From Section A.6.2 in (van der Vaart & Wellner, 2000), we know that TV(P^w_Y, Q_Y) = sup_{A⊂Y} |Σ_{y∈A} (w(y) P_Y(Y = y) − Q_Y(Y = y))|.
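The worst-case-optimal weights of Proposition 2 admit a one-line computation; the sketch below (with a hypothetical imbalanced label marginal) checks the two defining properties: the normalization constraint E_P[w(Y)] = 1 holds, and the reweighted marginal P^w_Y is uniform, which is what minimizes the worst case of TV(P^w_Y, Q_Y) over an unknown test Q_Y:

```python
# Worst-case-optimal weights of Proposition 2: w*(y) = 1 / (|Y| * P_Y(Y = y)).

def worst_case_weights(p_y):
    """p_y: list of training label probabilities P_Y(Y = y), all positive."""
    k = len(p_y)  # |Y|
    return [1.0 / (k * p) for p in p_y]

p_y = [0.7, 0.2, 0.1]                            # imbalanced training marginal
w = worst_case_weights(p_y)
# Normalization constraint E_P[w(Y)] = 1:
print(sum(wi * pi for wi, pi in zip(w, p_y)))    # 1.0
# Reweighted marginal P^w_Y(y) = w(y) * P_Y(y) is uniform:
print([wi * pi for wi, pi in zip(w, p_y)])       # each entry 1/3
```

Intuitively, since the adversary can concentrate Q_Y on any label, the best the learner can do is to make every label equally weighted after reweighting.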

B PROOFS IN SECTION 3

In this section, we present the proofs of results in Section 3. Proposition 1. There exists a population risk R pop (P, f ) whose minimizer has nearly perfect performance on the data from P , while it fails to generalize to OOD data drawn from another Q ∈ P. Proof. Let us consider the following example that X = Y • µ 1 Z • µ 2 + ξ, We may take δ = ∥µ∥ 2 /2 + ϵ∥µ∥ 2 , and σ Y Z (P ) → 1, due to ∥θ(σ Y Z (P ))∥ ≤ 1 and for a large enough ∥µ∥, with a high probability, we have Y ∥µ∥ 2 - 1 2 + ϵ ∥µ∥ 2 ≤ θ * ⊤ (P )X ≤ Y ∥µ∥ 2 + 1 2 + ϵ ∥µ∥ 2 . ( ) Since ϵ → 0 for σ Y Z (P ) → 1, we have proved that the population minimizer θ * (P ) has nearly perfect performance on the data from training distribution. However, a similar argument of (33) shows that for data drawn from distribution Q ∈ P Q X|Y θ * ⊤ (P )X -Y (∥µ 1 ∥ 2 -∥µ 2 ∥ 2 ) ≥ δ | Y ≤ exp - (δ -ϵ∥µ∥ 2 ) 2 2 ∥θ * (P )∥ 2 1 -σY Z (Q) 2 + 1 + σY Z (Q) 2 for any δ > 0. Again, by taking σ Y Z (Q) → -1 we get Y (∥µ 1 ∥ 2 -∥µ 2 ∥ 2 ) - 1 2 + ϵ ∥µ 1 ∥ 2 + ∥µ 2 ∥ ≤ θ * ⊤ (P )X ≤ Y (∥µ 1 ∥ 2 -∥µ 2 ∥ 2 ) + 1 2 + ϵ ∥µ 1 ∥ 2 + ∥µ 2 ∥ (36) with high probability. We can take, for example, ∥µ 2 ∥ 2 > 3+2ϵ 1-ϵ ∥µ 1 ∥ 2 for ϵ → 0, then under Y = -1, the inequality becomes 0 < 1 2 -ϵ ∥µ 2 ∥ 2 - 3 2 + ϵ ∥µ 1 ∥ 2 ≤ θ * ⊤ (P )X ≤ 3 2 + ϵ ∥µ 2 ∥ 2 - 1 2 -ϵ ∥µ 1 ∥ 2 , which shows the prediction given by f θ * (P ) (•) for Y = -1 is incorrect with high probability. A similar argument can be verified for Y = 1. Then we complete our proof. Theorem 1. For model f (•) satisfying f (X) ⊥ Z | Y , the conditional distribution Y | f (X) and population risk E Q [L(f (X), Y )] are invariant with (X, Y ) ∼ Q X,Y such that Q ∈ P. Proof. The difference of Y | f (X) for any (X, Y ) ∼ Q with Q ∈ P originates from the different spurious correlation i.e., the different Q Z|Y . Thus to obtain our result, it is suffice to prove that the distribution of Y | f (X) is independent of Q Z|Y . 
To see this, for any measurable sets A, B ⊂ Y,
Q_{Y|X}(Y ∈ A | f(X) ∈ B) = Q_{X|Y}(f(X) ∈ B | Y ∈ A) Q_Y(Y ∈ A) / [Q_{X|Y}(f(X) ∈ B | Y ∈ A) Q_Y(Y ∈ A) + Q_{X|Y}(f(X) ∈ B | Y ∉ A) Q_Y(Y ∉ A)] = 1 / (1 + [Q_{X|Y}(f(X) ∈ B | Y ∉ A) Q_Y(Y ∉ A)] / [Q_{X|Y}(f(X) ∈ B | Y ∈ A) Q_Y(Y ∈ A)]). (38)
The ratio Q_Y(Y ∉ A)/Q_Y(Y ∈ A) is invariant across Q ∈ P. For the ratio Q_{X|Y}(f(X) ∈ B | Y ∉ A)/Q_{X|Y}(f(X) ∈ B | Y ∈ A), we see
Q_{X|Y}(f(X) ∈ B | Y ∉ A) / Q_{X|Y}(f(X) ∈ B | Y ∈ A) = ∫_Z Q_{X,Z|Y}(f(X) ∈ B, z | Y ∉ A) dz / ∫_Z Q_{X,Z|Y}(f(X) ∈ B, z | Y ∈ A) dz = [Q_{X|Y}(f(X) ∈ B | Y ∉ A) ∫_Z Q_{Z|Y}(z | Y ∉ A) dz] / [Q_{X|Y}(f(X) ∈ B | Y ∈ A) ∫_Z Q_{Z|Y}(z | Y ∈ A) dz],
where the second equality follows from the conditional independence f(X) ⊥ Z | Y. From this calculation, we conclude that the distribution of Y | f(X) is independent of the spurious correlation Q_{Z|Y}, due to the arbitrariness of A, B ⊂ Y. Hence Y | f(X) is invariant over Q ∈ P.

To prove the invariance of E_Q[L(f(X), Y)], it suffices to show that the joint distribution of (f(X), Y) is invariant w.r.t. Q for Q ∈ P. For any sets A, B ⊂ Y and (X, Y) ∼ Q ∈ P,
Q_{X,Y}(Y ∈ A, f(X) ∈ B) = Q_{X|Y}(f(X) ∈ B | Y ∈ A) Q_Y(Y ∈ A) = Q_Y(Y ∈ A) ∫_Z Q_{X,Z|Y}(f(X) ∈ B, z | Y ∈ A) dz = Q_Y(Y ∈ A) Q_{X|Y}(f(X) ∈ B | Y ∈ A) ∫_Z Q_{Z|Y}(z | Y ∈ A) dz. (40)
Since f(X) ⊥ Z | Y, Q_{X,Y}(Y ∈ A, f(X) ∈ B) is independent of the spurious correlation Q_{Z|Y}. Then, due to the arbitrariness of A, B ⊂ Y, the joint distribution of (f(X), Y) is invariant w.r.t. Q for Q ∈ P. The proof is completed.

C PROOFS IN SECTION 4

C.1 PROOFS IN SECTION 4.1

Theorem 2. For any Q ∈ P, we have
sup_{Q∈P} |R_emp(f, P) − R_pop(f, Q)| ≤ |R_emp(f, P) − R_pop(f, P)| + CSV(f).

Proof. This theorem follows from the assumption in (1). We have
E_P[L(f(X), Y)] = E_P[E[L(f(X), Y) | Y, Z]] = E_Y[∫_Z E_X[L(f(X), Y) | Y, Z = z] dP(z | Y)].
Due to (1), the inner conditional expectation E_X[L(f(X), Y) | Y, Z = z] is invariant across P, Q ∈ P, as it is a function of (Y, z) independent of the choice of P. Thus
|E_P[L(f(X), Y)] − E_Q[L(f(X), Y)]| ≤ E[sup_{z1,z2} |E[L(f(X), Y) | Y, z1] − E[L(f(X), Y) | Y, z2]|] ≤ CSV(f),
where the last inequality is due to the non-negativity of the loss function. Then, due to
sup_{Q∈P} |R_emp(f, P) − R_pop(f, Q)| ≤ |R_emp(f, P) − R_pop(f, P)| + sup_{Q∈P} |R_pop(f, P) − R_pop(f, Q)| ≤ |R_emp(f, P) − R_pop(f, P)| + CSV(f),
we get the theorem.

Next, we provide the definitions of KL-divergence, mutual information, and conditional mutual information, which are useful for proving Theorem 3.

Definition 4 (KL-Divergence). Let P, Q be two distributions with the same support such that P is absolutely continuous w.r.t. Q. Then the KL-divergence from Q to P is D_KL(P ∥ Q) = E_{V∼P}[log (dP/dQ)(V)], where dP/dQ is the Radon-Nikodym derivative of P w.r.t. Q.

Definition 5 (Mutual Information). For random variables V1, V2 with joint distribution P_{V1,V2}, the mutual information between them is I(V1; V2) = D_KL(P_{V1,V2} ∥ P_{V1} × P_{V2}).

Definition 6 (Conditional Mutual Information). For three random variables U, V, W, the mutual information between U and V conditioned on W is I(U; V | W) = E_{w∼P_W}[I(U | W = w; V | W = w)].

Before presenting the proof of Theorem 3, we need the following lemma.

Lemma 1. Let U, V, W be three random variables such that U and W are independent of each other. Then I(W; U + V) ≤ I(W; V | U).

Proof. By the Data Processing Inequality (Xu & Raginsky, 2017), we have
I(W; U + V) ≤ I(W; U, V) = I(W; U) + I(W; V | U) = I(W; V | U). (48)
Thus the proof is completed.
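The empirical CSV controlled in Theorems 2 and 4 can be computed directly from per-sample losses: for each class k, take the largest gap between group-average losses over spurious attributes z, then average over classes with the empirical label frequencies. The sketch below is a minimal version under hypothetical losses and group assignments, assuming every (y, z) group is non-empty:

```python
from collections import defaultdict

# Empirical CSV: sum_k p_hat_k * (max_z Lhat_kz - min_z Lhat_kz), where
# Lhat_kz is the average loss over samples with label k and spurious value z.

def empirical_csv(losses, labels, spurious):
    # losses[i]: per-sample loss; labels[i] = y_i; spurious[i] = z_i
    group = defaultdict(list)
    for l, y, z in zip(losses, labels, spurious):
        group[(y, z)].append(l)
    mean = {g: sum(v) / len(v) for g, v in group.items()}
    n = len(labels)
    csv = 0.0
    for k in set(labels):
        per_z = [m for (y, _), m in mean.items() if y == k]
        p_k = sum(1 for y in labels if y == k) / n  # empirical label frequency
        csv += p_k * (max(per_z) - min(per_z))
    return csv

losses   = [0.2, 0.4, 1.0, 0.1, 0.3, 0.9]
labels   = [0, 0, 0, 1, 1, 1]
spurious = [0, 0, 1, 0, 1, 1]
# Class 0: gap 1.0 - 0.3 = 0.7; class 1: gap 0.6 - 0.1 = 0.5; each p_k = 0.5.
print(empirical_csv(losses, labels, spurious))  # 0.6
```

A model whose group-average losses are equal across z for every class has zero empirical CSV, matching the conditional-independence intuition.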
Published as a conference paper at ICLR 2023 Now we are ready to give the proof of Theorem 3. Theorem 3. Let model f θ (•) parameterized by θ ∈ Θ ⊂ R d , and is trained on S = {(x i , y i )} n i=1 from distribution P , with the spurious attributes of x i is z i . If the learned model f θ S (•) ⊥ S z | S y 3 Egen(f θ S , P ) ≤ inf g M 2 4n I(S x-g(z) , Sy; f θ S | Sy, S g(z) ) + I(Sy; f θ S ) , where E gen (f θ S , P ) = |E[R emp (f θ S , P )] -R pop (f θ S , P )|, g(•) is any measurable function, S x-g(z) = {x i -g(z i )} n i=1 , S y = {y i } n i=1 . Proof. Let S = {(x i , ỹi )} be another n samples drawn from P independent of S. W.o.l.g., we assume E P ×f θ S [L(f θ S (x), y)] = 0, otherwise we can replace L(f θ S (x i ), y i ) with L(f θ S (x i ), y i )- E P ×f θ S [L(f θ S (x), y)]. For any λ > 0 by Donsker-Varadhan's inequality, DKL(P S×f θ S ∥ PS × P f θ S ) ≥ E S×f θ S λ n n i=1 L(f θ S (xi), yi) -log E S×f θ S exp λ n n i=1 L(f θ S (xi), ỹi) . ( ) Then for any θ, λ > 0, and Lebesgue measurable function g(•), λE [Remp(f θ S , P )] ≤ DKL(P S×f θ S ∥ PS × P f θ S ) + log E S×θ S exp λ n n i=1 L(f θ S (xi), ỹi) a ≤ I(Sx, Sy; f θ S ) + λ 2 M 2 8n = I(Sx; f θ S | Sy) + I(Sy; f θ S ) + λ 2 M 2 8n b ≤ I(S x-g(z) ; f θ S | Sy, S g(z) ) + I(Sy; f θ S ) + λ 2 M 2 8n , ( ) where a is due to the definition of mutual information, L(f θ S (x i ), y i ) is M 2 -sub Gaussian, b is from Lemma 1, and the last equality is due to the conditional independence of the model. Thus, we conclude that E [Remp(f θ S , P )] ≤ inf g M 2 4n I(S x-g(z) ; f θ S | Sy, S g(z) ) + I(Sy; f θ S ) . Thus, we complete the proof.

C.2 PROOFS IN SECTION 4.2

Let F(Θ) and ∥ • ∥ L∞ respectively be the parameterized function class and L ∞ -norm on F(Θ) defined as ∥f θ 1 -f θ 2 ∥L ∞ = sup x |f θ 1 (x) -f θ 2 (x)| (52) for any f θ1 , f θ2 ∈ F(Θ). To provide the proof of Theorem 4, we need the following definition of covering number. Definition 7. A ϵ-cover of metric space (ϵ, F(Θ), ∥ • ∥ L∞ ) is any point set {f θi (•)} ⊆ F(Θ) such that for any f θ (•) ∈ F(Θ), there exists θ i satisfies ∥f θ -f θi ∥ L∞ ≤ ϵ. The covering number N (ϵ, F(Θ), ∥ • ∥ L∞ ) is the cardinality of the smallest ϵ-cover. Theorem 4. Under Assumption 1 and 2, if inf k∈[Ky],z∈[Kz] n kz /n k = O(1), then CSV(f θ ) ≤ CSV(f θ ) + O log (1/δ) √ n (8) holds with probability at least 1 -δ for any θ ∈ Θ, δ > 0. Proof. First, for any given θ ∈ Θ and given Y = k, Z = z, due to 0 ≤ L(f θ (X), Y ) ≤ M , by Azuma-Hoeffding's inequality (Corollary 2.20 in (Wainwright, 2019 )), we know that the with probability at least 1 -δ Lkz (f θ ) -M log (2/δ) 2n kz ≤ L kz (f θ ) ≤ Lkz (f θ ) + M log (2/δ) 2n kz . ( ) Then we see sup z 1 ,z 2 (L kz 1 (f θ ) -L kz 2 (f θ )) ≤ sup z 1 ,z 2 L kz 1 (f θ ) -Lkz 1 (f θ ) -L kz 2 (f θ ) -Lkz 2 (f θ ) + sup z 1 ,z 2 Lkz 1 (f θ ) -Lkz 2 (f θ ) ≤ M log 2 Kzδ sup z 1 ,z 2 1 2n kz 1 + 1 2n kz 2 + sup z 1 ,z 2 Lkz 1 (f θ ) -Lkz 2 (f θ ) holds with probability at least 1 -δ. Since the function class F(Θ) is bounded by M under ∥ • ∥ L∞ , it has finite covering number N (ϵ, F(Θ), ∥ • ∥ L∞ ). Let f θ1 (•), • • • , f θ N (•) ∈ F(Θ) be a ϵ-covering of F(Θ) with N ≤ N (ϵ, F(Θ), ∥ • ∥ L∞ ) such that ∀f θ ∈ F(Θ), ∃q ∈ {1, • • • , N }, ∥f θ -f θq ∥ L∞ ≤ ϵ. 
Thus combining the above inequality, for any f θ (•) and its corresponded f θq (•), we have sup z 1 ,z 2 (L kz 1 (f θ ) -L kz 2 (f θ )) ≤ sup z 1 ,z 2 L kz 1 (f θ ) -L kz 1 (f θq ) + L kz 2 (f θq ) -L kz 2 (f θ ) + sup z 1 ,z 2 L kz 1 (f θq ) -L kz 2 (f θq ) a ≤ 2ϵ + M log 2 Kzδ + N (F(Θ), ϵ, ∥ • ∥L ∞ ) sup z 1 ,z 2 1 2n kz 1 + 1 2n kz 2 + sup z 1 ,z 2 Lkz 1 (f θq ) -Lkz 2 (f θq ) (55) holds with probability at least 1 -δ for any ϵ > 0. Here the inequality a is due to the definition of L ∞ -norm on F(Θ). On the other hand, as CSV(f θ ) = Ky k=1 sup z 1 ,z 2 (L kz 1 (f θ ) -L kz 2 (f θ )) P (Y = k), We estimate the P (Y = k) with its empirical counterpart n k = Kz z=1 n kz /n. For bounded sub-Gaussian variable 1 {Y =k} , we have E 1 {Y =k} - 1 n n i=1 1 {y i =k} = P (Y = k) -pk ≤ log (1/δ) 2n holds with probability at least 1 -δ. Plugging this into (56) and combining (55) we get CSV(f θ ) ≤ 1 n Ky k=1 sup z 1 ,z 2 (L kz 1 (f θ ) -L kz 2 (f θ )) n k + Ky log (2Ky/δ) 2n ≤ Ky k=1 sup z 1 ,z 2 Lkz 1 (f θ ) -Lkz 2 (f θ ) pk + Ky log (2Ky/δ) 2n + inf ϵ 2ϵ + M log 2 Kzδ + N (F(Θ), ϵ, ∥ • ∥L ∞ ) Ky k=1 sup z 1 ,z 2 1 2n kz 1 + 1 2n kz 2 pk (58) holds with probability at least 1 -δ due to the definition of pk . Then suppose inf k∈[Ky],z∈[Kz] n kz /n k ≥ α, we have Ky k=1 sup z 1 ,z 2 1 2n kz 1 + 1 2n kz 2 pk ≤ 1 n k∈[Ky ] √ 2n k min k∈[Ky ],z∈[Kz ] √ n kz ≤ 2 α k∈[Ky ] √ n k n ≤ 2Ky αn , Lemma 2. Under Assumption 1-2, we have the following conclusions 1. For ϕ k ρ (θ, u) with any ρ and k we have ∇ θ ϕ k ρ (θ1, u) -∇ θ ϕ k ρ (θ2, u) ≤ (1 + 2λKzL1)∥θ1 -θ2∥ = L11∥θ1 -θ2∥; ∇ θ ϕ k ρ (θ, u1) -∇ θ ϕ k ρ (θ, u2) ≤ 2λKzL0∥u1 -u2∥ = L12∥u1 -u2∥; ∇uϕ k ρ (θ1, u) -∇uϕ k ρ (θ2, u) ≤ 2λKzL0∥θ1 -θ2∥ = L12∥u1 -u2∥. (68) 2. Let u * k (θ, ρ) = arg max u∈∆ K 2 z ϕ k ρ (θ, u) and û * k (θ, ρ) = arg max u∈∆ K 2 z φk ρ (θ, u) then ∥u * k (θ, ρ) -û * k (θ, ρ)∥ ≤ 1 ρ F (θ) -F (θ) . ( ) 3. Φ k ρ (θ) is L 1 + λ L 11 + L 2 12 /ρ -smoothness Proof. Let us proof the conclusions by order. 
For the first conclusion, we have ∇ θ ϕ k ρ (θ, u) = ∇ θ Remp(f θ , P ) + λ∇ θ F k (θ) ⊤ u; ∇uϕ k ρ (θ, u) = λF k (θ) -ρλ log K 2 z u + e , ( ) where e = (1, • • • , 1). Thus, by Schwarz's inequality, one can verify ∇ θ ϕ k ρ (θ1, u) -∇ θ ϕ k ρ (θ2, u) ≤ ∥∇ θ Remp(f θ 1 , P ) -∇ θ Remp(f θ 2 , P )∥ + λ u ⊤ ∇ θ F k (θ1) -∇ θ F k (θ2) ≤ λ   i∈[K 2 z ],j∈[K 2 z ] u(i)u(j) sup (z 1 ,z 2 ) ∇ Lkz 1 (f θ 1 ) -∇ Lkz 2 (f θ 1 ) -∇ Lkz 1 (f θ 2 ) -∇ Lkz 2 (f θ 2 ) 2   1 2 + L1∥θ1 -θ2∥ ≤ (1 + 2λKz)L1∥θ1 -θ2∥, and ∇ θ ϕ k ρ (θ, u1) -∇ θ ϕ k ρ (θ, u2) ≤ λ∥u1 -u2∥ ∇ θ F k (θ) ≤ 2λKzL0∥u1 -u2∥, and ∇uϕ k ρ (θ1, u) -∇uϕ k ρ (θ2, u) ≤ λ F k (θ1) -F k (θ2) ≤ 2λKzL0∥θ1 -θ2∥. ( ) Thus we complete the proof to the first conclusion. For the second conclusion, by the Lagrange's multiplier method or Theorem in (Yi et al., 2021a) , we have the unique closed-form solution of u * k (θ, ρ) ∈ ∆ K 2 z that u * k (θ, ρ) = exp 1 ρ F k (θ)(j) K 2 z j=1 exp 1 ρ F k (θ)(j) = Softmax F k (θ) ρ ; û * k (θ, ρ) = exp 1 ρ F k (θ)(j) K 2 z j=1 exp 1 ρ F k (θ)(j) = Softmax F k (θ) ρ . (74) On the other hand, due to u ∈ ∆ K 2 z we have ∇ 2 uu ϕ k ρ (θ, u) = -ρλdiag 1 u(i) , • • • , 1 u(K 2 z ) ⪯ -ρλI, where A ⪰ B means that A -B is a semi-positive definite matrix and I is the identity matrix. The similar conclusion holds for φk ρ (θ, u). Thus both ϕ k ρ (θ, u) and φk ρ (θ, u) are ρ-strongly concave w.r.t. u. Then ϕ k ρ (θ, u * k (θ, ρ)) ≤ ϕ k ρ (θ, û * k (θ, ρ)) + ∇uϕ k ρ (θ, û * k (θ, ρ)), u * k (θ, ρ) -û * k (θ, ρ) - ρλ 2 ∥u * k (θ, ρ) -û * k (θ, ρ)∥ 2 ; ϕ k ρ (θ, û * k (θ, ρ)) ≤ ϕ k ρ (θ, u * k (θ, ρ)) + ∇uϕ k ρ (θ, u * k (θ, ρ)), û * k (θ, ρ) -u * k (θ, ρ) - ρλ 2 ∥u * k (θ, ρ) -û * k (θ, ρ)∥ 2 . ( ) Plugging the two above inequalities, we have that ∇uϕ k ρ (θ, û * k (θ, ρ)), u * k (θ, ρ) -û * k (θ, ρ) ≥ ρλ ∥u * k (θ, ρ) -û * k (θ, ρ)∥ 2 (77) due to the ∇ u ϕ k ρ (θ, û * k (θ, ρ)), u * k (θ, ρ) -û * k (θ, ρ) ≤ 0. On the other hand, as ∇u φk ρ (θ, û * k (θ, ρ)), u * k (θ, ρ) -û * k (θ, ρ) ≤ 0. 
Plugging this into the above inequality, we get ρλ ∥u * k (θ, ρ) -û * k (θ, ρ)∥ 2 ≤ ∇uϕ k ρ (θ, û * k (θ, ρ)) -∇u φk ρ (θ, û * k (θ, ρ)), u * k (θ, ρ) -û * k (θ, ρ) = λ F k (θ) - F k (θ), u * k (θ, ρ) -û * k (θ, ρ) ≤ λ F k (θ) - F k (θ) ∥u * k (θ, ρ) -û * k (θ, ρ)∥ . Thus the conclusion is proofed. Finally, we prove the third conclusion. Similar to the proof of the second conclusion, we have ∥u * k (θ1, ρ) -u * k (θ2, ρ)∥ ≤ 1 ρ ∥θ1 -θ2∥. Since ∆ K 2 z is convex, bounded, and u * k (θ, ρ) is unique for any θ, by Danskin's Theorem (Bernhard & Rapaport, 1995) , we have ∇Φ k ρ (θ1) -∇Φ k ρ (θ2) ≤ ∥∇ θ Remp(f θ 1 , P ) -∇ θ Remp(f θ 2 , P )∥ + λ ∇ θ F k (θ1) ⊤ u * k (θ1, ρ) -∇ θ F k (θ2) ⊤ u * k (θ2, ρ) ≤ L1 ∥θ1 -θ2∥ + λ ∇ θ F k (θ1) ⊤ (u * k (θ1, ρ) -u * k (θ2, ρ)) + λ u * k (θ2, ρ) ⊤ ∇ θ F k (θ1) -∇ θ F k (θ2) ≤ L1 ∥θ1 -θ2∥ + λL 2 12 ρ ∥θ1 -θ2∥ + λL11 ∥θ1 -θ2∥ = L1 + λL 2 12 ρ + λL11 ∥θ1 -θ2∥ , which implies our conclusion. We present the following lemma to state the descent property of the obtained iterates via Algorithm 1. Let us define φk (θ, u) = Remp(f θ , P ) + λu ⊤ F k As we have assume that unbiased estimators Remp (f θ , P ) and F k (θ) have bounded variance, then according to E φk (θ, u) -ϕ k (θ, u) 2 ≤ 2E Remp(fθ, P ) -Remp(f θ , P ) 2 + 2λ 2 E F k (θ) -F k (θ) 2 , w.o.l.g. we assume that max E φk (θ, u) -ϕ k (θ, u) 2 , E F k (θ) -F k (θ) 2 ≤ σ 2 . ( ) Lemma 3. Let L 1 + λ L 11 + L 2 12 /ρ = L, if we have the estimation such that E[ φk (θ, u)] = ϕ k (θ, u), E[( φk (θ, u) -ϕ k (θ, u)) 2 ] ≤ σ 2 then E   Ky k=1 pk Φ k ρ (θ(t + 1))   ≤ E   Ky k=1 pk Φ k ρ (θ(t))   - η θ 2 -Lη 2 θ E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   + Ky k=1 LKyL 2 12 pk η 2 θ ρ + σ 2 + L11 L 2 12 η θ 2ρ 2 pk E F k (θ(t)) -F k t 2 + Lη 2 θ σ 2 2 . ( ) Proof. 
Due to the L-smoothness of Φ k ρ (•) for any k and ρ we have Ky k=1 pk Φ k ρ (θ(t + 1)) ≤ Ky k=1 pk Φ k ρ (θ(t)) + Ky k=1 pk ∇Φ k ρ (θ(t)), θ(t + 1) -θ(t) + L 2 ∥θ(t + 1) -θ(t)∥ 2 = Ky k=1 pk Φ k ρ (θ(t)) -η θ Ky k=1 pk ∇Φ k ρ (θ(t)), Ky k=1 pk ∇ θ φk ρ (θ(t), û * k (θ(t), ρ)) + Lη 2 θ 2 Ky k=1 pk ∇ θ φk ρ (θ(t), û * k (θ(t), ρ)) 2 ≤ Ky k=1 pk Φ k ρ (θ(t)) -η θ Ky k=1 pk ∇Φ k ρ (θ(t)) 2 + η θ Ky k=1 pk ∇Φ k ρ (θ(t)), Ky k=1 pk ∇Φ k ρ (θ(t)) -∇ θ φk ρ (θ(t), u * k (θ(t), ρ)) + η θ Ky k=1 pk ∇Φ k ρ (θ(t)), Ky k=1 pk ∇ θ φk ρ (θ(t), u * k (θ(t), ρ)) -∇ θ φk ρ (θ(t), u k (t)) + Lη 2 θ 2 Ky k=1 pk ∇ θ φk ρ (θ(t), u k (t)) 2 . ( ) On the other hand, using the fact that E φk ρ (θ, u) = ϕ k ρ (θ, u), E φk ρ (θ, u) -ϕ k ρ (θ, u) 2 ≤ σ 2 , and Young's inequality E   Ky k=1 pk ∇ θ φk ρ (θ(t), u k (t) 2   = E   Ky k=1 pk ∇ θ φk ρ (θ(t), u k (t) -∇ θ ϕ k ρ (θ(t), u k (t)) + Ky k=1 pk ∇ θ ϕ k ρ (θ(t), u k (t)) 2   ≤ σ 2 + E   Ky k=1 pk ∇ θ ϕ k ρ (θ(t), u k (t)) 2   ≤ σ 2 + 2E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   + 2E   Ky k=1 pk ∇Φ k ρ (θ(t)) -∇ θ ϕ k ρ (θ(t), u k (t)) 2   . Due to Danskin's theorem and (i), (ii) in Lemma 2 we have E   Ky k=1 pk ∇Φ k ρ (θ(t)) -∇ θ ϕ k ρ (θ(t), u k (t)) 2   ≤ E     Ky k=1 pk L12 ∥u * k (θ(t), ρ) -u k (t)∥   2   ≤ E     Ky k=1 pk L12 ρ F k (θ(t)) -F k t   2   ≤ KyL 2 12 ρ 2 Ky k=1 p2 k E F k (θ(t)) -F k t 2 . Finally, we have E   Ky k=1 pk ∇Φ k ρ (θ(t)), Ky k=1 pk ∇ θ φk ρ (θ(t), u * k (θ(t), ρ)) -∇ θ φk ρ (θ(t), u k (t))   = Ky k=1 pk E   Ky k=1 pk ∇Φ k ρ (θ(t)), ∇ θ φk ρ (θ(t), u * k (θ(t), ρ)) -∇ θ φk ρ (θ(t), u k (t))   ≤ 1 2 E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   + 1 2 Ky k=1 pk E   Ky k=1 ∇ θ φk ρ (θ(t), u * k (θ(t), ρ)) -∇ θ φk ρ (θ(t), u k (t)) 2   ≤ 1 2 E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   + 1 2 Ky k=1 pk E ∇ F k (θ(t)) 2 ∥u k (t) -u * k (θ(t), ρ)∥ 2 ≤ 1 2 E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   + σ 2 + L11 L 2 12 2ρ 2 Ky k=1 pk E F k (θ(t)) -F k t 2 . 
( ) Plugging the three above inequalities into (86) and taking expectation to the both sides of the equality, we get E   Ky k=1 pk Φ k ρ (θ(t + 1))   ≤ E   Ky k=1 pk Φ k ρ (θ(t))   - η θ 2 -Lη 2 θ E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   + Ky k=1 LKyL 2 12 pk η 2 θ ρ + σ 2 + L11 L 2 12 η θ 2ρ 2 pk E F k (θ(t)) -F k t 2 + Lη 2 θ σ 2 2 . This completes the proof of our theorem. Then we proceed to the next lemma to characterize the dynamic of E F k (θ(t)) -F k t 2 . Lemma 4. For the F k t defined in Algorithm 1, and let δ k (t) = E F k (θ(t)) -F k t 2 , and δ(t) = Ky k=1 pk δ k (t), by choosing γ ≤ 2/3, we have δ(t + 1) ≤ 1 - γ 2 + 4η 2 θ L 2 11 KyL 2 12 γρ 2 δ(t) + 2γ 2 + 2η 2 θ L 2 11 γ σ 2 + 4η 2 θ L 2 11 γ E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   . Proof. W.o.l.g., we fix the k during our proof. According to E F k (θ(t)) = F k (θ(t)) and E F k (θ(t)) -F k (θ(t)) 2 ≤ σ 2 we have E F k t+1 -F k (θ(t)) 2 ≤ (1 -γ) 2 δ k (t) + γ 2 σ 2 ≤ (1 -γ)δ k (t) + γ 2 σ 2 , ( ) where we use the fact γ < 1. Then due to the update rule of F k t , Young's inequality and the above inequality, δ k (t + 1) = E F k t+1 -F k (θ(t)) + F k (θ(t)) -F k (θ(t + 1)) 2 ≤ 1 + γ 2 -2γ E F k t+1 -F k (θ(t)) 2 + 1 + 2 -2γ γ E F k (θ(t + 1)) -F k (θ(t)) 2 ≤ 1 - γ 2 δ k (t) + 2γ 2 σ 2 + 2 γ E F k (θ(t + 1)) -F k (θ(t)) 2 . ( ) On the other hand, due to the L 11 -continuity of F k (•) and ( 87), (88) we see E F k (θ(t + 1)) -F k (θ(t)) 2 ≤ L 2 11 E ∥θ(t + 1) -θ(t)∥ 2 = η 2 θ L 2 11 E   Ky k=1 pk ∇ θ φk ρ (θ(t), u k (t) 2   ≤ η 2 θ L 2 11 σ 2 + 2η 2 θ L 2 11 E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   + 2η 2 θ L 2 11 KyL 2 12 ρ 2 Ky k=1 p2 k δ k (t). Plugging this into the above inequality and weighted summing over k (by pk ) we get Ky k=1 pk δ k (t + 1) ≤ 1 - γ 2 + 4η 2 θ L 2 11 KyL 2 12 γρ 2 Ky k=1 pk δ k (t) + 2γ 2 + 2η 2 θ L 2 11 γ σ 2 + 4η 2 θ L 2 11 γ E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   which completes the our proof.  min 1≤t≤T E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   ≤ O T -2 5 . 
( ) Besides that, for any θ(t) and ρ, we have | Ky k=1 pk (Φ k ρ (θ(t)) -Φ k (θ(t)))| ≤ λρ(1/me + 2 log m). Proof. First, note that m = K 2 z , for any θ, Φ k ρ (θ) ≤ M -λρKz inf x {x log (K 2 z x)} = M + λρ Kze . ( ) Due to the value of η θ ≤ γ 2 √ 6KyL11L12 we have that 1 - γ 2 + 4η 2 θ L 2 11 KyL 2 12 γρ 2 ≤ 1 - γ 3 . ( ) Thus we have δ(t) ≤ 1 - γ 3 t 4M 2 + 2γ 2 + 2η 2 θ L 2 11 γ σ 2 t-1 j=0 1 - γ 3 j + 4η 2 θ L 2 11 γ t-1 j=0 1 - γ 3 t-j-1 E   Ky k=1 pk ∇Φ k ρ (θ(j)) 2   from Lemma 4. Plugging this into (85) in Lemma 3 and summing up it over t = 0, • • • , T , we have E   Ky k=1 pk Φ k ρ (θ(T ))   ≤ E   Ky k=1 pk Φ k ρ (θ(0))   - η θ 2 -Lη 2 θ T -1 t=0 E   Ky k=1 pk ∇Φ k ρ (θ(t)) θ ρ + σ 2 + L11 L 2 12 η θ 2ρ 2 T -1 t=0 T -1-t j=0 1 - γ 3 j E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   . It can be verified that for any t, t-1 j=0 (1 -γ/3) j ≤ 3/γ, and plugging this into the above inequality we get E   Ky k=1 pk Φ k ρ (θ(T ))   ≤ E   Ky k=1 pk Φ k ρ (θ(0))   - η θ 3 T -1 t=0 E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   + 12η θ M 2 Mρ γ + T Lη 2 θ 2 + η θ Mρ 2γ 2 + 2η 2 θ L11 γ 3T γ σ 2 , where Mρ = LKyL 2 12 ρ + (σ 2 +L11)L 2 12 2ρ 2 and we use the fact η θ ≤ min 1 12 L , γ 12 √ MρL11 . Finally from (96) and γ = T -2 5 , η θ ≤ T -3 5 we get that 1 T T -1 t=0 E   Ky k=1 pk ∇Φ k ρ (θ(t)) 2   ≤ 3 η θ T M + ρ Kze + 36M 2 Mρ γT + 3 Lη θ σ 2 2 + 18γ Mρσ 2 + 18η 2 θ L11 Mρσ 2 γ 2 ≤ 3T -2 5 M + ρ Kze + 36M 2 MρT -3 5 + 3 LT -3 5 σ 2 2 + 18T -2 5 Mρσ 2 + 18T -2 5 L11 Mρσ 2 = O T -2 5 . This completes our proof to the first conclusion. We highlight that the value of η θ satisfies η θ = min    1 12 L , T -2 5 12 MρL11 , T -3 5 , T -2 5 2 6KyL11L12    = O T -3 5 . ( ) To see the last conclusion, similar to (96) we have that Φ k ρ (θ) -Φ k (θ) = λF k (θ) ⊤ (u * k (θ) -u * k (θ, ρ)) -λρu * k (θ, ρ) ⊤ log K 2 z u * k (θ, ρ) ≤ λF k (θ) ⊤ (u * k (θ, ρ) -u * k (θ)) + λρ Kze (103) where u * k (θ) = arg max{i : Φ k (θ)(i)}. 
Due to Theorem 1 in (Epasto et al., 2020), we have F_k(θ)ᵀ(u*_k(θ) − u*_k(θ, ρ)) ≤ 2ρ log K_z. Thus we can conclude
Φ^k_ρ(θ) − Φ^k(θ) ≤ λρ (1/(K_z e) + 2 log K_z), (105)
which implies our conclusion.
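The closed-form inner maximizer that drives these bounds (Lemma 2) is the softmax: for the entropy-smoothed objective uᵀF − ρ Σ_j u_j log(m u_j) over the simplex, the unique maximizer is u* = Softmax(F/ρ), and as ρ → 0 the smoothed value uᵀF approaches max_j F_j. The sketch below checks this numerically on a hypothetical gap vector F:

```python
import math

# Entropy-smoothed maximum: u* = Softmax(F / rho) maximizes u^T F minus an
# entropy penalty over the simplex; small rho recovers the hard maximum.

def softmax(v):
    m = max(v)                       # subtract max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

F = [0.3, 1.2, 0.7, 1.1]             # hypothetical group-gap vector
for rho in [1.0, 0.1, 0.01]:
    u = softmax([f / rho for f in F])
    smoothed = sum(ui * fi for ui, fi in zip(u, F))
    print(rho, round(smoothed, 4))   # approaches max(F) = 1.2 as rho shrinks
```

This mirrors the λρ(1/(K_z e) + 2 log K_z) smoothing gap above: the price of a differentiable inner maximization vanishes linearly in ρ (up to a log factor).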

E MORE EXPERIMENTS

In this section, we conduct more experiments on a synthetic dataset and a real-world dataset to further verify the effectiveness of our proposed methods.

E.1 TOY EXAMPLE

In this section, we apply the proposed RCSV and RCSV U to a constructed toy example with spurious correlation. Data. The data are constructed as in the example in Appendix B. For two 5-dimensional vectors µ 1 , µ 2 , the training data X follow the normal distribution N ((Y µ ⊤ 1 , Zµ ⊤ 2 ) ⊤ , I 10 ), where I 10 is the 10 × 10 identity matrix. The label Y and spurious attribute Z take values in {-1, 1} and are drawn from a symmetric binomial distribution (i.e., P Y (Y = 1) = P Y (Y = -1) = 0.5). As in (1), the spurious correlation coefficient σ Y Z between Y and Z varies across distributions. We generate 1000 (resp. 200) training (resp. test) samples. Concretely, σ train Y Z is fixed as 0.99 for the unique training distribution, while there are 6 constructed test distributions with σ test Y Z in {0.00, -0.20, -0.40, -0.60, -0.80, -0.99}, respectively. As can be seen, the spurious correlations in the test sets are opposite to the one in the training set. Thus, overfitting the spurious correlation will mislead the trained model. Setup. We use the linear model f θ (x) = θ ⊤ x, whose prediction on Y is sign(f θ (x)). We compare the proposed methods RCSV and RCSV U with the baseline methods as in the main body of this paper. The domain generalization methods can be applied here with observed spurious attributes because the data can be viewed as coming from two domains, i.e., data drawn under the conditions Y = Z and Y ̸ = Z. The loss function L(•, •) is cross-entropy. The hyperparameters of the baseline methods follow the ones in the original papers. Our methods are trained by SGD with the hyperparameters deferred to Appendix G.4. Main Results. In Table 4, we report the test accuracies of the trained models evaluated on OOD data to see whether the aforementioned methods can break the spurious correlation. From the results, we have the following observations. For all methods, the test accuracies consistently improve as the gap σ train Y Z -σ test Y Z decreases.
This is because a decreased σ train Y Z -σ test Y Z leads to a smaller mismatch between training and test distributions, thus improving accuracy. The models trained by the proposed two methods and the domain generalization methods (IRM and GroupDRO) can break the spurious correlation (i.e., generalize on OOD test data), which verifies their effectiveness. On the other hand, let θ 2 denote the last 5-dimensional block of the linear model's parameters. Since X ∼ N ((Y µ ⊤ 1 , Zµ ⊤ 2 ) ⊤ , I 10 ), one can verify that when θ ⊤ 2 µ 2 ≈ 0, the model output θ ⊤ X is unrelated to Z with high probability, so the model can break the spurious correlation. To see this, in Table 5 we present the cosine similarity ⟨θ 2 , µ 2 ⟩/(∥θ 2 ∥∥µ 2 ∥) of the models trained by the methods in Table 4 (the cosine similarity is used to alleviate the interference caused by the scales of the two vectors). The results show that the models trained by OOD-generalizable methods have smaller cosine similarities.
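The toy-data construction above can be sketched in a few lines; the sampling scheme below (Z agrees with Y with probability (1 + σ_YZ)/2) is a hypothetical but standard way to realize a prescribed correlation coefficient between two ±1 variables:

```python
import random

# Toy data of Appendix E.1: X ~ N((Y*mu1, Z*mu2), I_10) with Y, Z in {-1, 1}
# and correlation sigma_yz, realized via P(Z = Y) = (1 + sigma_yz) / 2 so that
# E[Y * Z] = sigma_yz.

def make_toy_data(n, sigma_yz, mu1, mu2, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])                      # P(Y=1) = P(Y=-1) = 0.5
        z = y if rng.random() < (1 + sigma_yz) / 2 else -y
        x = [y * m + rng.gauss(0, 1) for m in mu1] \
          + [z * m + rng.gauss(0, 1) for m in mu2]   # 10-dimensional input
        data.append((x, y, z))
    return data

mu1, mu2 = [1.0] * 5, [2.0] * 5                      # hypothetical mean vectors
train = make_toy_data(1000, sigma_yz=0.99, mu1=mu1, mu2=mu2)
# Empirical correlation between Y and Z should be close to 0.99:
print(sum(y * z for _, y, z in train) / len(train))
```

Flipping the sign of sigma_yz produces the anti-correlated test distributions used in Table 4.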

E.2 CO L O R E D-MNIST

In this section, we empirically verify the effectiveness of the proposed methods on a constructed real-world dataset, Colored-MNIST. Data. Our dataset is constructed on MNIST (LeCun et al., 1998), which consists of 60,000 training samples and 10,000 test samples. Each sample is a grey-scale hand-written digit from ten categories, i.e., 0 to 9. We construct our Colored-MNIST (C-MNIST) by inducing spurious correlation in the training and test sets. Concretely, for each digit, we assign two colors as spurious attributes, respectively for its foreground and background. The spurious correlation can be induced into the dataset by tying the label of a digit to the two colors. We pick 20 specific colors; the first and the last 10 colors are respectively used as the 10 categories of the two spurious attributes, i.e., the colors of the foreground and background of a digit. We consider datasets with two kinds of spurious correlations. The first is fixed spurious correlation, which means data from each specific digit category are assigned two specific colors, respectively for the foreground and background. The other is random spurious correlation, which means that for each sample, two randomly sampled colors are respectively assigned to its foreground and background, regardless of its category. We construct two C-MNIST datasets with different but fixed spurious correlations (abbrev. C-MNIST-F1 and C-MNIST-F2), and one C-MNIST with random spurious correlation (abbrev. C-MNIST-R). Samples from the generated datasets are shown in Figure 2. As can be seen, the three versions of C-MNIST have different spurious correlations between the digit label and the colors of the foreground and background. Besides, the spurious correlations in C-MNIST-F1 and C-MNIST-F2 are fixed, while C-MNIST-R has randomized spurious correlation. Setup. We construct various training sets based on the original 60,000 training samples of MNIST.
Concretely, we choose α ∈ {0.8, 0.85, 0.90, 0.95, 0.99}; then for each α, we construct a training set in which ⌊60,000 × α⌋ samples are from C-MNIST-F1 while the other ⌊60,000 × (1 -α)⌋ are constructed as C-MNIST-R. We use two test sets, which are respectively the 10,000 test samples constructed as C-MNIST-F2 and as C-MNIST-R. Obviously, the data from C-MNIST-R in the training set alleviate the misleading signal brought by C-MNIST-F1 due to the spurious correlation between color and digit in it, and the existence of these data meets Assumption 1. Our model is a five-layer convolutional neural network as in (Devansh Arpit, 2019). The models are trained over the 5 aforementioned datasets with different α by the methods that appeared in the previous section. One can verify that the training set can be viewed as a mixture of data from two domains, i.e., C-MNIST-F1 and C-MNIST-R; thus the domain-generalization-based methods IRM and GroupDRO can be applied here. The loss function L(•, •) is cross-entropy, and the detailed hyperparameters are presented in Appendix G.4. Main Results. To see whether the models trained by these methods can break the induced misleading spurious correlation, we report their test accuracies on C-MNIST-R and C-MNIST-F2. The results are summarized in Table 6, from which we make the following observations. The test accuracies of all these methods increase as α decreases. This is a natural result, since a smaller α corresponds to more training samples from C-MNIST-R, which alleviates the misleading signal from the data with spurious correlation in C-MNIST-F1. Thus, models trained over the training set with smaller α exhibit improved generalization on OOD data with correlation shift. Similar to the results in Section 6, our RCSV (resp. RCSV U ) consistently improves OOD generalization, compared with the methods with (resp. without) observed spurious attributes.
More surprisingly, RCSV U beats IRM and GroupDRO, which use observed spurious attributes, for a large α. These observations again verify the efficacy of our proposed methods. The model trained by the most commonly used method, ERM, on datasets with small α also generalizes on OOD data. Thus, a relatively large number of data without spurious correlation in the training set also breaks the spurious correlation brought by the other data. Finally, we observe that the performance of the models on C-MNIST-R is consistently better than on C-MNIST-F2. This is because data drawn from C-MNIST-R exist in the training set, while data from C-MNIST-F2 do not appear in the training set.
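The C-MNIST construction described above can be sketched as a coloring routine: under the "fixed" correlation the foreground/background colors are determined by the digit's label, under the "random" one they are sampled independently of it. The palette and the tiny 2 × 2 "digit" below are hypothetical stand-ins (the paper does not specify its 20 colors):

```python
import random

N_COLORS = 10
# Hypothetical palettes standing in for the paper's 20 unspecified colors.
PALETTE_FG = [(25 * i, 10 * i, 255 - 25 * i) for i in range(N_COLORS)]
PALETTE_BG = [(255 - 25 * i, 10 * i, 25 * i) for i in range(N_COLORS)]

def colorize(image, label, fixed, rng):
    # image: 2-D list of grey intensities in [0, 1]; label: digit 0-9.
    # fixed=True ties both colors to the label (C-MNIST-F style);
    # fixed=False samples them independently of it (C-MNIST-R style).
    fg_idx = label if fixed else rng.randrange(N_COLORS)
    bg_idx = label if fixed else rng.randrange(N_COLORS)
    fg, bg = PALETTE_FG[fg_idx], PALETTE_BG[bg_idx]
    # Pixel intensity blends between background (p=0) and foreground (p=1).
    return [[tuple(int(p * f + (1 - p) * b) for f, b in zip(fg, bg))
             for p in row] for row in image]

rng = random.Random(0)
img = [[0.0, 1.0], [1.0, 0.0]]                 # tiny stand-in for a digit
colored = colorize(img, label=3, fixed=True, rng=rng)
print(colored[0][0])  # pure-background pixel gets PALETTE_BG[3]
```

Building C-MNIST-F1 and C-MNIST-F2 then amounts to calling `colorize` with `fixed=True` under two different label-to-palette assignments, and C-MNIST-R with `fixed=False`.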

F ABLATION STUDY

We have discussed in Section 6 that the reweighted sampling trick improves OOD generalization. Thus, we explore the effect of this trick in this section. We follow the settings in the main part of this paper, except that the reweighted sampling strategy is replaced by uniform sampling; thus the methods ERMRS YZ and ERMRS Y reduce to ERM. The results are summarized in Tables 7, 8, and 9. As can be seen from these tables, the OOD generalization performance of the models drops for all these methods compared with the results in Section 6, especially for CelebA and Waterbirds; see the columns "Avg" and "Worst" in each table. We speculate this is because the reweighted sampling strategy makes the data in each group appear equally often during training, and this operation by itself can break the spurious correlation in the training data. Further evidence for the degenerated OOD generalization is the improved test accuracies on the groups aligned with the spurious attributes in the training data, e.g., the better performances on the groups D-F, D-M of CelebA and L-L, W-W of Waterbirds. The other observation is that even without this trick, our methods improve OOD generalization compared with the other baseline methods, as shown by their better mean and worst test accuracies. Finally, the trade-off between robustness over spurious attributes and in-distribution test accuracy is more clearly observed in these tables. This follows from comparing the accuracy gap between data with the same spurious attributes and the total accuracy, which is the in-distribution test accuracy for CelebA, MultiNLI, and CivilComments.
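The group-reweighted sampling trick ablated above can be sketched directly: each sample in group (y, z) is drawn with probability proportional to 1/n_yz, so every group appears equally often regardless of its size. The dataset below is a hypothetical imbalanced example:

```python
import random
from collections import Counter

# Group-balanced sampling: a sample in group (y, z) of size n_yz is drawn
# with probability 1 / (#groups * n_yz), so groups are equally represented.

def group_balanced_sample(dataset, batch_size, rng):
    # dataset: list of (x, y, z); returns a batch sampled with replacement.
    counts = Counter((y, z) for _, y, z in dataset)
    n_groups = len(counts)
    weights = [1.0 / (n_groups * counts[(y, z)]) for _, y, z in dataset]
    return rng.choices(dataset, weights=weights, k=batch_size)

rng = random.Random(0)
# Hypothetical dataset: group (0, 0) has 90 samples, group (0, 1) has 10.
data = [(i, 0, 0) for i in range(90)] + [(i, 0, 1) for i in range(10)]
batch = group_balanced_sample(data, 10000, rng)
frac = sum(1 for _, _, z in batch if z == 1) / len(batch)
print(round(frac, 2))  # close to 0.5 despite the 9:1 group imbalance
```

Replacing the weights with a constant recovers the uniform sampling used in this ablation, under which the minority group appears only about 10% of the time.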

G SETUP FOR EXPERIMENTS

G.1 IMPLEMENTATION OF TWO PROPOSED ALGORITHMS

In this section, we present the detailed algorithm flows of the proposed RCSV and RCSV U given in the main body of this paper. The critical part is their estimators of the F_k(θ) defined in Section 5.

Algorithm 2 (RCSV), inner loop:
1: for t = 0, ..., T do
2:   Compute the estimator R_emp(f_θ(t), P):
3:     R_emp(f_θ(t), P) is the empirical risk over a uniformly drawn batch (size S) of data.
4:   Compute the estimators F_k(θ(t)), k ∈ [K_y]:
5:     Initialize the K_z-dimensional vectors L_k = 0, k ∈ [K_y];
6:     Reweighted-sample a mini-batch {(x_t,i, y_t,i, z_t,i)} with replacement, where the probability of a data point satisfying y_t,i = k and z_t,i = z is 1/(K_y K_z n_kz);
7:     Update L_k(z) as the empirical risk over the sampled {(x_t,i, y_t,i)} in A_kz, k ∈ [K_y], z ∈ [K_z];
8:     Compute F_k(θ(t)) = K_y K_z (L_k(i) − L_k(j))_{i,j ∈ [K_z]}, k ∈ [K_y];
9:   Solve the maximization problem:
10:    F_k^{t+1} = (1 − γ)F_k^t + γ F_k(θ(t));
11:    u_k(t + 1) = Softmax(F_k^{t+1}/ρ);
12:  Update the model parameters θ(t) via SGD:
13:    θ(t + 1) = θ(t) − η_θ Σ_{k=1}^{K_y} p_k ∇_θ (R_emp(f_θ(t), P) + λ u_k(t + 1)^⊤ F_k^{t+1});
14: end for

Algorithm 3 (RCSV U), inner loop:
1: for t = 0, ..., T do
2:   Compute the estimator R_emp(f_θ(t), P):
3:     R_emp(f_θ(t), P) is the empirical risk over a uniformly drawn batch (size S) of data.
4:   Compute the estimators F_k(θ(t)), k ∈ [K_y]:
5:     Initialize the |A_k|^2-dimensional vector F_k(θ(t)) = 0, k ∈ [K_y];
6:     Reweighted-sample a mini-batch {(x_t,i, y_t,i)} with replacement, where the probability of a data point satisfying y_t,i = k is 1/(K_y n_k);
7:     Update F_k(j) with L(f_θ(x_t,i), y_t,i) if (x_t,i, y_t,i) is the j-th data point in A_k, k ∈ [K_y];
8:   Solve the maximization problem:
9:     F_k^{t+1} = (1 − γ)F_k^t + γ F_k(θ(t));
10:    u_k(t + 1) = Softmax(F_k^{t+1}/ρ);
11:  Update the model parameters θ(t) via SGD:
12:    θ(t + 1) = θ(t) − η_θ Σ_{k=1}^{K_y} p_k ∇_θ (R_emp(f_θ(t), P) + λ u_k(t + 1)^⊤ F_k^{t+1});
13: end for
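One update of Algorithm 2 for a single class k can be sketched as follows (a schematic reading of the inner steps: form the K_z^2 vector of pairwise group-loss gaps, apply the moving-average correction with γ, and take the softmax with temperature ρ; the K_y K_z scaling constant is omitted and all names are ours, not the paper's):

```python
import numpy as np

def rcsv_step(F_prev, group_losses, rho=0.1, gamma=0.5):
    """Inner steps of RCSV for one class k (schematic sketch).

    group_losses: per-spurious-group empirical losses L_k(z), z in [K_z].
    F_prev:       running estimate of F_k from the previous iteration.
    """
    Kz = len(group_losses)
    # K_z^2 vector of pairwise loss gaps between spurious groups
    F_hat = np.array([group_losses[i] - group_losses[j]
                      for i in range(Kz) for j in range(Kz)])
    # moving-average correction of the estimator (the gamma step)
    F_new = (1 - gamma) * F_prev + gamma * F_hat
    # closed-form maximizer of the rho-smoothed inner max: u = Softmax(F/rho)
    u = np.exp((F_new - F_new.max()) / rho)
    u /= u.sum()
    penalty = float(u @ F_new)  # the term multiplied by lambda in the SGD step
    return F_new, u, penalty

F_new, u, penalty = rcsv_step(np.zeros(4), [0.2, 0.5])
# unequal group losses produce a positive CSV penalty
```

The penalty vanishes only when all spurious groups of class k incur the same loss, which is exactly the conditional-invariance condition the regularizer targets.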



Conditional independence is a strong sufficient condition for making a model OOD generalizable. However, the proof of Theorem 1 shows that a model invariant to the spurious correlation Q(Z | Y) is already sufficient for OOD generalizability, and this invariance is characterized both by zero CSV and by conditional independence. θ_S denotes the learned parameters, which depend on the training set S. f_{θ_S}(·) is a random element that takes values in a functional space (i.e., the model space); for details, see Shiryaev (2016). ⌊·⌋ is the floor of a number.

CONCLUSION

In this paper, we explore OOD generalization for data with correlation shift. After a formal characterization, we give a sufficient condition for a model to be OOD generalizable: the conditional independence of the model given the class label. Conditional Spurious Variation, which controls the OOD generalization error, is proposed to measure such independence. Based on this metric, we propose an algorithm with a provable convergence rate that regularizes the training process with one of two estimators of CSV (i.e., RCSV and RCSV U), depending on whether the spurious attributes are observable. Finally, the experiments conducted on the datasets CelebA, Waterbirds, MultiNLI, and CivilComments verify the efficacy of our methods.



Figure 1: Examples of CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2019), MultiNLI (Williams et al., 2018), and CivilComments (Borkan et al., 2019) involved in this paper. The class labels and spurious attributes are colored red and blue, respectively. Their correlation may vary from the training set to the test set. More details are given in Section 6.

Finally, extensive experiments are conducted to empirically verify the effectiveness of our methods on OOD data with spurious correlation. Concretely, we conduct experiments on the benchmark classification datasets CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2019), MultiNLI (Williams et al., 2018), and CivilComments (Borkan et al., 2019). Empirical results show that our algorithm consistently improves the model's generalization on OOD data with correlation shift.

Thus

    sup_{Q ∈ P} Σ_{y ∈ {y' : w(y')P_Y(Y = y') ≥ Q_Y(Y = y')}} ( w(y)P_Y(Y = y) − Q_Y(Y = y) ),

subject to w(y)P_Y(Y = y) ≥ 0 and E_P[w(Y)] = 1. Then, taking min_{w(·) : E_P[w(Y)] = 1}, the bound is due to

    |Y| min_{y ∈ Y} w(y)P_Y(Y = y) ≤ Σ_{y ∈ Y} w(y)P_Y(Y = y) = 1,

and the equality is attained when w(·) = w*(·).

Now we are ready to state the convergence rate for the nonconvex-concave optimization problem. Theorem 5. Under Assumptions 1 and 2, if R_emp(f_θ, P) and F_k(θ) are all unbiased estimators with bounded variance, and θ(t) is updated by Algorithm 1 with η_θ = O(T^{−3/5}) and γ = T^{−2/5}, then

Figure 2: Images of the three C-MNIST datasets with different spurious correlations. The first two have a fixed spurious correlation between colors and digit labels, while the spurious correlation in the last one is random.
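The spurious color assignment behind such datasets can be sketched as follows (our illustrative construction, not necessarily the exact recipe used to build C-MNIST; `corr` controls how strongly the color tracks the digit label):

```python
import random

def assign_spurious_color(label, corr, rng):
    """Pick a color index for a digit: with probability `corr` the color
    matches the label (fixed spurious correlation, C-MNIST-F style);
    otherwise it is drawn uniformly (random correlation, C-MNIST-R style)."""
    if rng.random() < corr:
        return label
    return rng.randrange(10)

rng = random.Random(0)
fixed = [assign_spurious_color(y, corr=1.0, rng=rng) for y in range(10)]
# with corr = 1.0 every digit receives its matched color
```

Lowering `corr` toward chance level interpolates between the fixed-correlation and random-correlation variants.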

Algorithm 2 Regularize training with CSV (RCSV).
Input: Training samples {(x_i, y_i)}_{i=1}^n, numbers of labels K_y and spurious attributes K_z, batch size S, learning rate η_θ, training iterations T, model f_θ(·) parameterized by θ; initializations θ_0, {F_k^0}; positive regularization constant λ, surrogate constant ρ, and correction constant γ.
1: for t = 0, ..., T do

13:   θ(t + 1) = θ(t) − η_θ Σ_{k=1}^{K_y} p_k ∇_θ (R_emp(f_θ(t), P) + λ u_k(t + 1)^⊤ F_k^{t+1});
14: end for

Algorithm 3 Regularize training with CSV U (RCSV U).
Input: Training samples {(x_i, y_i)}_{i=1}^n, numbers of labels K_y and spurious attributes K_z, batch size S, learning rate η_θ, training iterations T, model f_θ(·) parameterized by θ; initializations θ_0, {F_k^0}; positive regularization constant λ, surrogate constant ρ, and correction constant γ.
1: for t = 0, ..., T do

we have the following result. Theorem 3. Let the model f_θ(·) be parameterized by θ ∈ Θ ⊂ R^d and trained on S = {(x_i, y_i)}_{i=1}^n drawn from distribution P, where the spurious attribute of x_i is z_i. If the learned model satisfies f_{θ_S}(·) ⊥ S_z | S_y

The class label and spurious attributes are discrete, i.e., Y = [K_y] and Z = [K_z].

required to estimate CSV in a high-dimensional space. Besides, if the condition inf_{k,z} n_kz/n_k = O(1) does not hold, the order of the error is O(1/min_{k,z} n_kz) (see Appendix C.2 for details).

Algorithm 1 Regularize training with CSV.
Input: Training set {(x_i, y_i)}_{i=1}^n, numbers of labels K_y and spurious attributes K_z, training steps T, model f_θ(·) parameterized by θ; initializations θ_0, {F_k^0}; positive regularization constant λ, surrogate constant ρ, and correction constant γ; estimators R_emp(f_θ

Test accuracy (%) of ResNet50 on each group of CelebA and Waterbirds.

Test accuracy (%) of BERT on each group of MultiNLI.

Test accuracy (%) of BERT on each group of CivilComments.

Setup. We compare our methods RCSV and RCSV U with four baseline methods (see Appendix G.3 for details): ERM with reweighted sampling (ERMRS) (Idrissi et al., 2021), IRM (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2019), and Correlation (Devansh Arpit, 2019). GroupDRO and IRM use the same reweighted sampling strategy as RCSV, while Correlation uses the same one as RCSV U. Since these sampling strategies themselves improve OOD generalization (Idrissi et al., 2021), to make a fair comparison we also run ERMRS with the two sampling strategies; the two variants are denoted ERMRS Y and ERMRS YZ, respectively. The 7 methods thus fall into two groups: those conducted with observable spurious attributes (RCSV, IRM, GroupDRO, ERMRS YZ) and those with unobservable spurious attributes (RCSV U, Correlation, ERMRS Y).

Test accuracy (%) of linear model on the OOD test data of Toy example. The OOD test data are drawn from distributions with different σ test Y Z . The results are the mean of five independent runs.

Cosine similarity ⟨θ_2, μ_2⟩/(∥θ_2∥∥μ_2∥) of linear models trained with different methods. A model with smaller cosine similarity theoretically exhibits better OOD generalization ability.
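This cosine similarity can be illustrated with a minimal simulation of the toy example (all constants, dimensions, and the plain gradient-descent fit below are our own illustrative choices): training with ERM under a strong spurious correlation yields a θ_2 that aligns with μ_2.

```python
import numpy as np

def toy_data(n, sigma_yz, mu1, mu2, rng):
    """X = (Y*mu1, Z*mu2) + Gaussian noise, with corr(Y, Z) = sigma_yz."""
    Y = rng.choice([-1, 1], size=n)
    Z = np.where(rng.random(n) < (1 + sigma_yz) / 2, Y, -Y)  # P(Z=Y)=(1+s)/2
    X = np.concatenate([Y[:, None] * mu1, Z[:, None] * mu2], axis=1)
    return X + rng.standard_normal(X.shape), Y

def fit_exp_loss(X, Y, steps=500, lr=0.1):
    """Plain gradient descent on the exponential loss exp(-Y * theta^T X)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        margin = Y * (X @ theta)
        grad = -(Y[:, None] * X * np.exp(-margin)[:, None]).mean(axis=0)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
mu1, mu2 = np.ones(5), np.ones(5)
X, Y = toy_data(2000, sigma_yz=0.9, mu1=mu1, mu2=mu2, rng=rng)
theta = fit_exp_loss(X, Y)
theta2 = theta[5:]  # coordinates on the spurious block
cos = theta2 @ mu2 / (np.linalg.norm(theta2) * np.linalg.norm(mu2))
# ERM exploits the spurious direction, so cos is large here
```

A method that breaks the spurious correlation should instead drive this cosine similarity toward zero, matching the metric reported in the table above.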



Test accuracy (%) of ResNet50 on each group of CelebA and Waterbirds. The experiments are conducted without the reweighted sampling trick.

Test accuracy (%) of BERT on each group of MultiNLI. The experiments are conducted without reweighted sampling trick.

Test accuracy (%) of BERT on each group of CivilComments. The experiments are conducted without reweighted sampling trick.

Hyperparameters on Toy example.

Hyperparameters on C-MNIST.


where ξ ∼ N(0, I_{d_1+d_2}), and Y, Z ∈ {−1, 1} follow the standard binomial distribution. Denote the training distribution as P. In this example, Z is the spurious attribute. The correlation coefficient between Y and Z is denoted by σ_YZ(Q) for Q ∈ P, and one can verify its value from the construction above.

Let us consider the linear classifier f_θ(x) = θ^⊤x, whose loss on data (X, Y) is the exponential loss (Soudry et al., 2018)

    L(f_θ(X), Y) = e^{−Y f_θ(X)}.    (28)

Thus we can compute the population risk R_pop(P, f_θ). Since R_pop(P, f_θ) is continuous w.r.t. σ_YZ(P) and θ, θ*(P) = argmin_θ R_pop(P, f_θ) is continuous in σ_YZ(P). Since σ_YZ(P) ∈ [−1, 1], we conclude that ∥θ*(P)∥ is upper bounded. W.l.o.g., assume ∥θ*(P)∥ ≤ 1 for every σ_YZ(P) ∈ [−1, 1]; then for any σ_YZ(P), θ*(P) satisfies the first-order optimality condition, where μ = (μ_1^⊤, −μ_2^⊤)^⊤. Thus, for σ_YZ(P) ≠ ±1, we can take σ_YZ(P) → 1 to make ∥θ*(P) − μ∥ ≤ ϵ∥μ∥ for any small ϵ, where ϵ can be chosen independently of μ. Then, under the condition Y = Z, the sub-Gaussian property of Gaussian random variables yields, for any δ > 0, a tail bound whose last inequality is due to the Cauchy-Schwarz inequality. Combining this with (58) completes the proof.

C.3 PROOFS IN SECTION 4.3

In this section, we prove that (9) is a sharp estimator of the CSV when a lower bound c on π_kz is known. Note that our problem is equivalent to estimating the supremum in question, and the equality can be attained by some distribution whose density function with respect to P can be written explicitly. One can verify the first inequality, and a similar argument proves the second; combining the two inequalities implies (61). On the other hand, if K ≥ 3, one can construct a distribution for which it is easy to verify that equality holds. According to this proposition, we can use the quantile conditional expectation to estimate the proposed CSV, as we did in the main body of this paper.

D SOLVING THE PROPOSED MINIMAX PROBLEM (11)

In this section, we provide the convergence analysis of the proposed Algorithm 1 for solving (11). We illustrate it in the regime of regularizing training with CSV(f_θ), i.e., with the F_k(θ) defined in Section 5; then m = K_z^2 in this regime. Let us define the auxiliary quantities used below. We have the following lemma, which states some continuity properties.
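The softmax step u_k = Softmax(F_k/ρ) can be read as the closed-form solution of an entropy-smoothed inner maximization over the simplex, max_u ⟨u, F⟩ + ρH(u), whose optimal value is ρ·logsumexp(F/ρ). The snippet below checks this numerically (our reading of the role of the surrogate constant ρ; the vector F is arbitrary):

```python
import math

def softmax(F, rho):
    """u = Softmax(F / rho), computed stably."""
    m = max(f / rho for f in F)
    e = [math.exp(f / rho - m) for f in F]
    s = sum(e)
    return [x / s for x in e]

def smoothed_max(F, rho):
    """rho * logsumexp(F / rho): the value of max_u <u, F> + rho*H(u)
    over the probability simplex, attained at u = Softmax(F / rho)."""
    m = max(f / rho for f in F)
    return rho * (m + math.log(sum(math.exp(f / rho - m) for f in F)))

F = [0.2, 0.5, 0.1]
rho = 0.05
u = softmax(F, rho)
val = smoothed_max(F, rho)
# val lies between max(F) and max(F) + rho*log(len(F)),
# and shrinks toward max(F) as rho -> 0
```

Smoothing makes the inner maximum differentiable in F, which is what allows the minimax problem to be attacked with first-order updates.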

G.2 DATASET

In this section, we give more details on the datasets that appear in the main part of this paper.

CelebA. This is a celebrity-face dataset (Liu et al., 2015) with 162770 training samples and 20362 test samples. For each sample, the hair color {Dark, Blond} is the class label, while the gender {Female, Male} is the spurious attribute. Both the training and test sets can be divided into 4 groups: "Dark-Female" (D-F), "Dark-Male" (D-M), "Blond-Female" (B-F), and "Blond-Male" (B-M). The numbers of training and test samples in the 4 groups are respectively {71629, 9767}, {66874, 7535}, {22880, 2880}, and {1387, 180}. Our goal is to train a model that correctly recognizes the hair color of celebrities independently of their gender. The most difficult group to generalize to is B-M, owing to its extremely small proportion among males in the training set.

Waterbirds. This is a synthetic dataset (Sagawa et al., 2019) with 4795 training samples and 6993 test samples, constructed by combining bird photographs from the Caltech-UCSD Birds-200-2011 (CUB) dataset (Wah et al., 2011) with image backgrounds from the Places dataset (Zhou et al., 2017). For each image, the class label is from {Waterbird, Landbird}, and each bird is placed on a spurious background from {Land background, Water background}. As in CelebA, the dataset can be categorized into 4 groups: "Landbird-Land background" (L-L), "Landbird-Water background" (L-W), "Waterbird-Water background" (W-W), and "Waterbird-Land background" (W-L). The numbers of training and test samples in the groups are respectively {3498, 2255} (L-L), {184, 2255} (L-W), {1057, 642} (W-W), and {56, 642} (W-L). As can be seen, the spurious correlations in the training and test sets are quite different: in the training set, most landbirds are on land and most waterbirds are on water.
But in the test set, waterbirds and landbirds are uniformly assigned to the two backgrounds. Thus, we wish to train a model that breaks the spurious correlation between bird type and background. The group proportions in the training set indicate that the most difficult groups to generalize to are L-W and W-L.

MultiNLI. This is a dataset for natural language inference (Williams et al., 2018) with 206175 training samples and 123712 test samples. The dataset consists of pairs of sentences, and our goal is to recognize whether the second sentence is entailed by, neutral with, or contradicts the first sentence. Gururangan et al. (2018) showed that spurious correlations exist in the dataset; for example, contradiction is related to the presence of the negation words nobody, no, never, and nothing. Thus we take such presence as the spurious attribute, and the dataset can be categorized into 6 groups: "Contradiction-Without Negation" (C-WN), "Contradiction-Negation" (C-N), "Entailment-Without Negation" (E-WN), "Entailment-Negation" (E-N), "Neutrality-Without Negation" (N-WN), and "Neutrality-Negation" (N-N). Our goal is to learn a model whose predictions are independent of the presence of negation. The numbers of training and test samples in the 6 groups are respectively {57498, 34597}, {11158, 6655}, {67376, 40496}, {1521, 886}, {66630, 39930}, and {1992, 1146}.

CivilComments. This is a dataset of collected online comments (Borkan et al., 2019), with 269038 training samples and 133782 test samples. Our goal is to recognize whether a comment is toxic or not. Toxicity can be spuriously correlated with annotation attributes such as the presence of 8 demographic identities: male, female, White, Black, LGBTQ, Muslim, Christian, and other religion.
Thus we take the presence of any of the aforementioned identities as the spurious attribute and divide the dataset into 4 groups: "Nontoxic-Nonidentity" (N-N), "Nontoxic-Identity" (N-I), "Toxic-Nonidentity" (T-N), and "Toxic-Identity" (T-I). The numbers of training and test samples in the 4 groups are respectively {148186, 72373}, {90337, 46185}, {12731, 6063}, and {17784, 9161}. As the group sizes show, there exists a spurious correlation between toxicity and the identity attributes in the training set.

For all these datasets, the group sizes show that a dominating spurious correlation exists in CelebA and Waterbirds. This is not the case in MultiNLI and CivilComments, especially MultiNLI, where a strong spurious correlation exists only between the groups "C-WN" and "C-N". Thus, for MultiNLI and CivilComments, the model must extract features other than the spurious ones to guarantee good performance.
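The group counts quoted above make the skew easy to quantify; for instance, for CelebA (counts copied from the text):

```python
def group_fractions(counts):
    """Fraction of training data in each (label, attribute) group."""
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

# CelebA training-group sizes from the description above
celeba_train = {"D-F": 71629, "D-M": 66874, "B-F": 22880, "B-M": 1387}
frac = group_fractions(celeba_train)
# Blond-Male is under 1% of the training data, which is why it is
# the hardest group to generalize to
```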

G.3 BENCHMARK ALGORITHMS

Empirical Risk Minimization (ERM, Vapnik, 1999) pools the data from all domains together and minimizes the empirical loss to train the model.

Empirical Risk Minimization with reweighted sampling (ERMRS, Idrissi et al., 2021) is similar to ERM, but it reweights the sampling probability of each sample, with pre-defined weights on each data point.

Invariant Risk Minimization (IRM, Arjovsky et al., 2019) learns a feature representation such that the optimal classifier on top of the representation is the same across domains.

Group Distributionally Robust Optimization (GroupDRO, Sagawa et al., 2019) minimizes the worst-case loss over different domains.

Correlation (Devansh Arpit, 2019) minimizes the intra-class variance of data from the same category to break the spurious correlation.
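For concreteness, the GroupDRO reweighting can be sketched as an exponentiated-gradient ascent step on the group weights (a schematic sketch following Sagawa et al. (2019) in spirit only; the step size and online form are our own simplifications):

```python
import math

def group_dro_weights(q, group_losses, eta=0.1):
    """One exponentiated-gradient step on the group weights: groups with
    higher loss get exponentially up-weighted, so minimizing the weighted
    loss approximates minimizing the worst-group loss."""
    q = [qi * math.exp(eta * li) for qi, li in zip(q, group_losses)]
    s = sum(q)
    return [qi / s for qi in q]

q = group_dro_weights([0.25] * 4, [0.1, 0.9, 0.2, 0.3])
# the worst group (loss 0.9) now carries the largest weight
```

The model parameters are then updated on the loss weighted by these q, which is how the worst-case objective is attacked with ordinary SGD.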

G.4 TRAINING DETAILS

As clarified in Section 6, the backbone models for the image datasets (CelebA, Waterbirds) and the textual datasets (MultiNLI, CivilComments) are respectively ResNet-50 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) and a pre-trained BERT Base model (Devlin et al., 2019). The loss function L(·, ·) is cross-entropy for all methods. The experiments on the image datasets are conducted without learning rate decay, while the results on the textual datasets are obtained with a linearly decayed learning rate via the AdamW optimizer (Loshchilov & Hutter, 2018). The hyperparameters of the baseline methods follow the original ones in (Gulrajani & Lopez-Paz, 2020; Sagawa et al., 2019; Devansh Arpit, 2019; Arjovsky et al., 2019; Idrissi et al., 2021). The hyperparameters of the proposed RCSV and RCSV U on CelebA, Waterbirds, MultiNLI, CivilComments, the Toy example, and C-MNIST are respectively summarized in Tables 10, 11, 12, 13, 14, and 15.

