PARETO INVARIANT RISK MINIMIZATION: TOWARDS MITIGATING THE OPTIMIZATION DILEMMA IN OUT-OF-DISTRIBUTION GENERALIZATION

Abstract

Recently, there has been a growing surge of interest in enabling machine learning systems to generalize well to Out-of-Distribution (OOD) data. Most efforts are devoted to advancing optimization objectives that regularize models to capture the underlying invariance; however, compromises often arise in the optimization process of these OOD objectives: i) many OOD objectives have to be relaxed into penalty terms of Empirical Risk Minimization (ERM) for ease of optimization, while the relaxed forms can weaken the robustness of the original objective; ii) the penalty terms also require careful tuning of the penalty weights due to the intrinsic conflicts between the ERM and OOD objectives. Consequently, these compromises can easily lead to suboptimal performance of either the ERM or the OOD objective. To address these issues, we introduce a multi-objective optimization (MOO) perspective to understand the OOD optimization process, and propose a new optimization scheme called PAreto Invariant Risk Minimization (PAIR). PAIR improves the robustness of OOD objectives by optimizing them cooperatively with other OOD objectives, thereby bridging the gaps caused by the relaxations. PAIR then approaches a Pareto optimal solution that properly trades off the ERM and OOD objectives. Extensive experiments on the challenging WILDS benchmark show that PAIR alleviates the compromises and yields top OOD performance.

1. INTRODUCTION

The interplay between optimization and generalization is crucial to the success of deep learning (Zhang et al., 2017; Arora et al., 2019; Allen-Zhu et al., 2019; Jacot et al., 2021; Allen-Zhu & Li, 2021). Guided by empirical risk minimization (ERM) (Vapnik, 1991), simple optimization algorithms can find uneventful descent paths in the non-convex loss landscape of deep neural networks (Sagun et al., 2018). However, when distribution shifts are present, the optimization is usually biased by spurious signals, such that the learned models can fail dramatically on Out-of-Distribution (OOD) data (Beery et al., 2018; DeGrave et al., 2021; Geirhos et al., 2020). Therefore, overcoming the OOD generalization challenge has drawn much attention recently. Most efforts are devoted to proposing better optimization objectives (Rojas-Carulla et al., 2018; Koyama & Yamaguchi, 2020; Parascandolo et al., 2021; Krueger et al., 2021; Creager et al., 2021; Liu et al., 2021; Pezeshki et al., 2021; Ahuja et al., 2021a; Wald et al., 2021; Shi et al., 2022; Rame et al., 2021; Chen et al., 2022b) that regularize the gradient signals produced by ERM, while it has long been neglected that the interplay between optimization and generalization under distribution shifts has already changed its nature. In fact, the optimization process of the OOD objectives turns out to be substantially more challenging than that of ERM, and there are often compromises when applying the OOD objectives in practice. Due to the optimization difficulty, many OOD objectives have to be relaxed as penalty terms of ERM in practice (Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Krueger et al., 2021; Pezeshki et al., 2021; Ahuja et al., 2021a; Rame et al., 2021), but the relaxed formulations can behave very differently from the original objective (Kamath et al., 2021) (Fig. 1(a)). Moreover, due to the generally existing gradient conflicts between ERM and OOD objectives (Fig. 1(b)), trade-offs between ERM and OOD performance during the optimization are often needed. Sagawa* et al. (2020); Zhai et al. (2022) suggest that ERM performance usually needs to be sacrificed for better OOD generalization. On the other hand, it usually requires careful tuning of the OOD penalty hyperparameters (Zhang et al., 2022a) (Fig. 1(d)), which either weakens the power of the OOD objectives or makes them so strong that they prevent models from capturing all desirable patterns. Consequently, using the traditional optimization wisdom to train and select models can easily lead to suboptimal performance of either the ERM or OOD objectives. Most OOD objectives still struggle with distribution shifts or even underperform ERM (Gulrajani & Lopez-Paz, 2021; Koh et al., 2021).

Figure 1: Optimization issues in OOD algorithms. (a) OOD objectives such as IRM usually require several relaxations for ease of optimization, which however introduce huge gaps. The ellipsoids denote solutions that satisfy the invariance constraints of the practical IRM variant IRMv1. When optimized with ERM, IRMv1 prefers f_1 instead of f_IRM (the predictor produced by IRM). (b) Gradient conflicts between ERM and OOD objectives generally exist for different objectives at different penalty weights (x-axis). (c) The typically used linear weighting scheme for combining ERM and OOD objectives requires careful tuning of the weights to approach the solution, and cannot reach any solution in the non-convex part of the Pareto front. In contrast, PAIR finds an adaptive descent direction under gradient conflicts that leads to the desired solution. (d) Due to the optimization dilemma, the best OOD performance (e.g., of IRMv1 on a modified COLOREDMNIST from Sec. 5) usually requires exhaustive tuning of hyperparameters (y-axis: penalty weights; x-axis: pretraining epochs), while PAIR robustly yields top performance by resolving the compromises.
This phenomenon calls for a better understanding of the optimization in OOD generalization, and raises a challenging question: How can one obtain a desired OOD solution under the conflicts of ERM and OOD objectives? To answer this question, we take a multi-objective optimization (MOO) perspective on OOD optimization. Specifically, using the representative OOD objective IRM (Arjovsky et al., 2019) as an example, we find that the failures in OOD optimization can be attributed to two issues. The first is the compromised robustness of OOD objectives due to the relaxations in the practical variants. In fact, the relaxation can even eliminate the desired invariant solution from the Pareto front w.r.t. the ERM loss and the OOD penalty (Fig. 1(a)). Therefore, merely optimizing the ERM loss and the relaxed OOD penalty can hardly approach the desired solution. On the other hand, even when the Pareto front contains the desired solution, as shown in Fig. 1(c), the traditional linear weighting scheme that linearly reweights the ERM and OOD objectives cannot reach the solution if it lies in the non-convex part of the front (Boyd & Vandenberghe, 2014). Even when the OOD solution is reachable (i.e., lies in the convex part), it still requires careful tuning of the OOD penalty weights to approach the solution, as shown in Fig. 1(d). To address these issues, we propose a new optimization scheme for OOD generalization, called PAreto Invariant Risk Minimization (PAIR), which includes a new optimizer (PAIR-o) and a new model selection criterion (PAIR-s). Owing to the MOO formulation, PAIR-o allows for cooperative optimization with other OOD objectives to improve the robustness of practical OOD objectives. Despite the huge gaps between IRMv1 and IRM, we show that incorporating VREx (Krueger et al., 2021) into IRMv1 provably recovers the causal invariance (Arjovsky et al., 2019) for a family of problem instances (Sec. 3.2).
When given robust OOD objectives, PAIR-o finds a descent path with adaptive penalty weights, which leads to a Pareto optimal solution that trades off ERM and OOD performance properly (Sec. 4). In addition, the MOO analysis also motivates PAIR-s, which facilitates OOD model selection by considering the trade-offs between ERM and OOD objectives. We conduct extensive experiments on challenging OOD benchmarks. Empirical results show that PAIR-o successfully alleviates the objective conflicts and empowers IRMv1 to achieve high performance on 6 datasets from WILDS (Koh et al., 2021). PAIR-s effectively improves the performance of selected OOD models by up to 10% across 3 datasets from DOMAINBED (Gulrajani & Lopez-Paz, 2021), demonstrating the significance of considering the ERM and OOD trade-offs in optimization.

2. BACKGROUND AND RELATED WORK

We first briefly introduce the background of our work (more details are given in Appendix B.1). Problem setup. The problem of OOD generalization typically considers a supervised learning setting based on the data D = {D_e}_{e∈E_all} collected from multiple causally related environments E_all, where a subset of samples D_e = {X^e_i, Y^e_i} from a single environment e ∈ E_all is drawn independently from an identical distribution P_e (Peters et al., 2016). Given the data from training environments {D_e}_{e∈E_tr}, the goal of OOD generalization is to find a predictor f : X → Y that generalizes well to all (unseen) environments, i.e., to minimize max_{e∈E_all} L^e(f), where L^e is the empirical risk under environment e. The predictor f = w ∘ φ is usually composed of a featurizer φ : X → Z that learns to extract useful features, and a classifier w : Z → Y that makes predictions from the extracted features. Existing solutions to OOD generalization. There is a rich literature aiming to overcome the OOD generalization challenge; the solutions usually appear as additional regularizations of ERM (Vapnik, 1991). Ganin et al. (2016); Sun & Saenko (2016); Li et al. (2018); Dou et al. (2019) regularize the learned features to be domain-invariant. Namkoong & Duchi (2016); Hu et al. (2018); Sagawa* et al. (2020) regularize the models to be robust to mild distributional perturbations of the training distributions, and Zhang et al. (2022c); Liu et al. (2021); Zhang et al. (2022b); Yao et al. (2022) improve the robustness with additional assumptions. Recently, there has been increasing interest in adopting causality theory (Pearl, 2009; Schölkopf et al., 2021) and introducing the causal invariance to representation learning (Peters et al., 2016; Arjovsky et al., 2019; Creager et al., 2021; Parascandolo et al., 2021; Wald et al., 2021; Ahuja et al., 2021a).
They require φ to learn causally invariant representations such that a predictor w acting on φ minimizes the risks of all environments simultaneously. This work focuses on resolving the optimization issue in learning the causal invariance. In addition, Koyama & Yamaguchi (2020); Krueger et al. (2021); Shi et al. (2022); Rame et al. (2021) implement the invariance by encouraging agreement at various levels across environments. However, these works mostly focus on developing better objectives while neglecting the optimization process of the objectives. Optimization dilemma in OOD generalization. Along with the development of OOD methods, the OOD optimization dilemma has gradually been perceived in the literature. Gulrajani & Lopez-Paz (2021) find it hard to select a proper model in OOD generalization given the ERM performance at different environments. Sagawa* et al. (2020); Zhai et al. (2022) find that ERM performance needs to be sacrificed for satisfactory OOD performance. Some initial remedies have been proposed. Lv et al. (2021) use the guidance of data from distributions similar to the test environment in MOO to resolve gradient conflicts and achieve better performance in domain adaptation. Zhang et al. (2022a) propose to construct diverse initializations for stabilizing OOD performance under the dilemma. However, why such a dilemma exists in OOD generalization and whether we can resolve it remain elusive. Multi-Objective Optimization (MOO). MOO considers solving m objectives with losses {L_i}_{i=1}^m, i.e., min_θ L(θ) = (L_1(θ), ..., L_m(θ))^T (Kaisa, 1999). A solution θ dominates another solution θ', written L(θ) ⪯ L(θ'), if L_i(θ) ≤ L_i(θ') for all i and L(θ) ≠ L(θ'). A solution θ* is called Pareto optimal if no other θ dominates θ*. The set of Pareto optimal solutions is called the Pareto set (P), and its image in loss space is called the Pareto front. In practice, one often cannot find a solution that is globally optimal for all objectives, hence Pareto optimal solutions are of particular value. Although MOO has been widely applied to improving multi-task learning (Sener & Koltun, 2018), how to model and mitigate objective conflicts in OOD generalization from the MOO perspective remains underexplored.
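The dominance and Pareto-set definitions above translate directly into code. A minimal sketch (the loss vectors below are invented for illustration; function names are ours):

```python
# Pareto dominance and the Pareto set over a finite candidate set of
# loss vectors, following the MOO definitions above.

def dominates(la, lb):
    """la ⪯ lb with la != lb: no worse on every objective, strictly better on one."""
    return all(a <= b for a, b in zip(la, lb)) and la != lb

def pareto_set(candidates):
    """Candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

# Invented (L_ERM, L_OOD) pairs: the last point is dominated by (2.0, 2.0).
losses = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
print(pareto_set(losses))
```

Each point on the resulting Pareto front embodies a different trade-off between the two losses, which is exactly why a preference over objectives is needed later.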

3. OPTIMIZATION CHALLENGES IN IRM AND ITS EFFECTIVE FIX

This work focuses on one of the most representative OOD objectives in learning the causal invariance, IRM, to show how we can understand and mitigate the optimization dilemma through the MOO lens.

3.1. DRAWBACKS OF IRM IN PRACTICE

We first introduce the basics of IRM and the drawbacks of its practical variants, and leave theoretical details to Appendix C.1. Specifically, the IRM framework approaches OOD generalization by finding an invariant representation φ, such that there exists a classifier acting on φ that is simultaneously optimal across E_tr. Hence, IRM leads to a challenging bi-level optimization problem:

min_{w,φ} Σ_{e∈E_tr} L^e(w ∘ φ), s.t. w ∈ argmin_{w̄:Z→Y} L^e(w̄ ∘ φ), ∀e ∈ E_tr. (1)

Given the training environments E_tr, and functional spaces W for w and Φ for φ, predictors f = w ∘ φ satisfying the constraint in Eq. 1 are called invariant predictors, denoted as I(E_tr). When solving for invariant predictors, characterizing I(E_tr) is particularly difficult in practice, hence it is natural to restrict W to the space of linear functions on Z = R^d (Jacot et al., 2021). Furthermore, Arjovsky et al. (2019) argue that linear classifiers actually do not provide additional representation power over scalar classifiers, i.e., d = 1, W = S = R^1. The scalar restriction elicits a practical variant, IRM_S:

min_φ Σ_{e∈E_tr} L^e(φ), s.t. ∇_{w|w=1} L^e(w · φ) = 0, ∀e ∈ E_tr. (2)

Since Eq. 2 remains a constrained program, Arjovsky et al. (2019) further introduce a soft-constrained variant, called IRMv1:

min_φ Σ_{e∈E_tr} L^e(φ) + λ |∇_{w|w=1} L^e(w · φ)|^2. (3)

Theoretical failure of practical IRM variants. Although the practical variants seem promising, the relaxations introduce huge gaps between IRM and the practical variants, so that both IRM_S and IRMv1 can fail to capture the invariance (Kamath et al., 2021). The failure case is illustrated by the two-bit environment with α_e, β_e ∈ [0, 1]. Each environment D_e = {X^e, Y^e} is generated following

Y^e := Rad(0.5), X^e := (X^e_1, X^e_2), X^e_1 := Y^e · Rad(α_e), X^e_2 := Y^e · Rad(β_e), (4)

where Rad(σ) is a random variable taking value -1 with probability σ and +1 with probability 1 - σ.
The environments are denoted as E_α = {(α, β^e) : 0 < β^e < 1}, where X^e_1 is the invariant feature, as α is fixed across environments e, and X^e_2 is the spurious feature, as β^e varies across environments. Let I_S(E_tr) denote the set of invariant predictors elicited by the relaxed constraint in IRM_S. It follows that I(E_tr) ⊆ I_S(E_tr). Consequently, there exist some undesired predictors that are nevertheless considered "invariant" by IRM_S and IRMv1. For example, with E_tr = {(0.1, 0.11), (0.1, 0.4)}, the solutions satisfying the constraint in IRM_S are the intersection points in Fig. 1(a) (the ellipsoids are the constraints). Although f_1, f_IRM ∈ I_S(E_tr), both IRM_S and IRMv1 prefer f_1 over f_IRM (the predictor produced by IRM), as f_1 has the smallest ERM loss. In fact, Kamath et al. (2021) show that this failure can happen in a wide range of environments even given an infinite number of environments and samples, demonstrating the huge gap between the practical variants and the original IRM. Empirical drawback of practical IRM variants. In addition, the optimization of IRMv1 introduces further challenges due to the conflicts between the IRMv1 penalty and the ERM objective. As shown in Fig. 1(d), it often requires significant effort to tune hyperparameters such as pretraining epochs and the penalty weight λ in Eq. 3. Otherwise, the IRMv1 penalty can be either too weak to enforce the invariance required by IRM, or so strong that it prevents ERM from learning all desirable patterns.
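The two-bit environment of Eq. 4 is easy to simulate. The sketch below is our own illustration (sample sizes and seeds are arbitrary); it checks that the spurious correlation E[X_2 · Y] = 1 - 2β shifts across environments while α, and hence the invariant correlation, stays fixed:

```python
import random

# Two-bit environment (Eq. 4): Rad(sigma) takes -1 with probability sigma.
# alpha controls the invariant feature X1, beta the spurious feature X2.

def rad(sigma, rng):
    return -1 if rng.random() < sigma else 1

def two_bit_env(alpha, beta, n, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rad(0.5, rng)
        x1 = y * rad(alpha, rng)  # invariant: E[x1*y] = 1 - 2*alpha everywhere
        x2 = y * rad(beta, rng)   # spurious: E[x2*y] = 1 - 2*beta, varies per env
        data.append(((x1, x2), y))
    return data

# E_tr = {(0.1, 0.11), (0.1, 0.4)}: same alpha, different beta.
e1 = two_bit_env(0.1, 0.11, 1000, seed=1)
e2 = two_bit_env(0.1, 0.40, 1000, seed=2)
corr_x2_e1 = sum(x[1] * y for x, y in e1) / len(e1)
corr_x2_e2 = sum(x[1] * y for x, y in e2) / len(e2)
print(corr_x2_e1, corr_x2_e2)  # the spurious correlation shifts across environments
```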

3.2. PARETO OPTIMIZATION FOR IRM

Since both IRM_S and IRMv1 fail to properly trade off between the ERM and IRM objectives, we switch to a new perspective, i.e., the lens of MOO, to understand the failures of IRM in practice. Specifically, optimizing ERM with the IRMv1 penalty can be viewed as solving

min_f (L_ERM, L_IRM)^T, (5)

where L_IRM denotes the IRMv1 penalty. In the failure case above, f_IRM ∉ P(L_ERM, L_IRM), as f_IRM is dominated by f_1. Therefore, no matter how carefully we control the optimization process, we cannot obtain f_IRM by merely minimizing the objectives in Eq. 5. This is essentially because of the weakened OOD robustness of IRM_S and IRMv1 caused by the relaxations. Thus, choosing robust objectives for optimization is of great importance to OOD generalization. The ideal objectives should at least constitute a Pareto front that contains the desired OOD solution. Improving OOD robustness of practical IRM variants. In pursuit of proper optimization objectives, we resort to the OOD extrapolation explanation of IRM (Bottou et al., 2019). A solution that is simultaneously optimal for all training environments (i.e., satisfying the original IRM constraint) is also a stationary point of the ERM loss w.r.t. some OOD distribution:

∂L_t/∂f_IRM = 0, L_t ∈ {Σ_{e∈E_tr} λ_e L^e | Σ_{e∈E_tr} λ_e = 1}, (6)

where L_t is the ERM loss under the OOD distribution. Different from Distributionally Robust Optimization approaches (Namkoong & Duchi, 2016), Eq. 6 allows some λ_e to be negative, hence its solutions are expected to extrapolate better (Bottou et al., 2019). The previous failure case implies that both IRM_S and IRMv1 fail in the extrapolation due to the relaxations; nevertheless, we can introduce additional objectives to directly improve the OOD extrapolation power of the practical IRM variants. To this end, we introduce the REx objective to IRMv1, which is derived by directly minimizing the worst-case ERM loss under all OOD distributions up to a certain distance from the training distributions (Krueger et al., 2021). More formally, REx minimizes the worst-case L_t under the additional constraint λ_e ≥ -β, ∀e ∈ E_tr, in Eq. 6.
For ease of optimization, they also propose an alternative objective, L_VREx := Var({L^e}_{e∈E_tr}). In Fig. 3, we plot the distribution of L_VREx in the failure case of Fig. 1(a): f_IRM lies in the low-variance region. Similarly, in Fig. 2, the zero-variance solutions (the purple line in the middle) point to the underlying f_IRM beyond the Pareto front. Therefore, incorporating L_VREx into Eq. 5 can relocate f_IRM onto the Pareto front, which implies the desirable objectives:

(IRMX) min_φ (L_ERM, L_IRM, L_VREx)^T. (7)

By resolving a large class of failure cases of IRM_S and IRMv1 (Kamath et al., 2021), solutions to Eq. 7 are more powerful than those of IRM_S and IRMv1 in OOD extrapolation. In fact, we have:

Proposition 1. (Informal) Under Setting A (Kamath et al., 2021), for all α ∈ (0, 1), let E := {(α, β^e) : β^e ∈ (0, 1)} be any instance of the two-bit environment (Eq. 4), and let I_X denote the invariant predictors produced by Eq. 7; it holds that I_X(E) = I(E).

The formal statement and proof of Proposition 1 are given in Appendix E.1. Proposition 1 implies that Eq. 7 provides ideal objectives for optimization. However, Eq. 7 can even add to the difficulty of OOD penalty tuning: it introduces one more penalty to the overall objective, making the Pareto front more complicated for the linear weighting scheme to locate the desired solution. Pareto optimization for IRMX. Ideally, the set of Pareto optimal solutions is small, such that each f ∈ P(L_ERM, L_IRM, L_VREx) satisfies the invariance constraints of IRMv1 and VREx, i.e., L_IRM = 0 and L_VREx = 0, with a minimal L_ERM, thereby eliciting the desired OOD solutions. However, the ideal constraints might be too strong to achieve when there is noise between the invariant features and labels (Duchin et al., 2020; Ahuja et al., 2021b), which further enlarges the set of Pareto optimal solutions.
Therefore, it is natural to relax the constraints to L_IRM ≤ ϵ_IRM and L_VREx ≤ ϵ_VREx; as ϵ_IRM → 0 and ϵ_VREx → 0, the ideal invariance is recovered. To obtain a desired solution under these circumstances, the optimization process needs to meet two requirements: (i) The additional objective in IRMX can make the Pareto front more complicated, so the desired solutions are more likely to appear in its non-convex part, which is not reachable by the linear weighting scheme (Boyd & Vandenberghe, 2014). Therefore, the optimizer needs to be able to reach any Pareto optimal solution on the front, e.g., via MGDA-style algorithms (Désidéri, 2012). (ii) When both ϵ_IRM, ϵ_VREx > 0, there can be many Pareto optimal solutions but few desired OOD solutions, hence a preference over the ERM and OOD objectives is usually needed. As the optimality of each OOD objective usually appears as a necessary condition for satisfactory OOD performance, the preferences for OOD objectives are expected to be higher. Given the two requirements, we leverage a preference-aware MOO solver (Mahapatra & Rajan, 2020) to solve IRMX for the desired Pareto optimal solution. We summarize the overall solution as PAreto Invariant Risk Minimization (PAIR). When assigning high preferences to L_IRM and L_VREx in IRMX (Eq. 7), PAIR approaches a Pareto optimal solution that minimizes the OOD losses while not sacrificing the ERM performance too much, and achieves good OOD performance, as shown in Table 1.
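To make the IRMX loss vector and the relaxed constraints concrete, here is a small self-contained sketch (our own illustration, not the authors' code): for a scalar classifier w on 1-d features z with squared loss, the IRMv1 gradient penalty has the closed form d/dw L_e(w·z)|_{w=1} = 2·mean(z·(z - y)), and the ϵ-relaxed rule keeps candidates with small penalties and prefers the lowest ERM loss. All helper names and candidate values are invented:

```python
# IRMX loss vector (L_ERM, L_IRMv1, L_VREx) for per-environment (zs, ys)
# data under squared loss, plus the eps-relaxed invariance selection rule.

def env_terms(zs, ys):
    n = len(zs)
    risk = sum((z - y) ** 2 for z, y in zip(zs, ys)) / n        # L_e at w = 1
    grad = 2.0 * sum(z * (z - y) for z, y in zip(zs, ys)) / n   # d/dw L_e at w = 1
    return risk, grad

def irmx_losses(envs):
    """Return (L_ERM, L_IRMv1, L_VREx) across environments."""
    risks, grads = zip(*(env_terms(zs, ys) for zs, ys in envs))
    l_erm = sum(risks) / len(risks)
    l_irm = sum(g ** 2 for g in grads)                          # IRMv1 penalty
    l_vrex = sum((r - l_erm) ** 2 for r in risks) / len(risks)  # VREx penalty
    return l_erm, l_irm, l_vrex

def select_relaxed(cands, eps_irm, eps_vrex):
    """Keep candidates with L_IRM <= eps_irm and L_VREx <= eps_vrex,
    then prefer the smallest ERM loss."""
    feasible = [c for c in cands if c[1] <= eps_irm and c[2] <= eps_vrex]
    return min(feasible, key=lambda c: c[0]) if feasible else None

# Features matching labels in both environments give a zero loss vector.
print(irmx_losses([([1.0, -1.0], [1.0, -1.0]), ([1.0, -1.0], [1.0, -1.0])]))

# Invented candidate loss vectors: the relaxed rule discards the low-ERM
# candidate that violates the invariance constraints.
cands = [(0.05, 0.30, 0.10), (0.20, 0.01, 0.02), (0.45, 0.00, 0.00)]
print(select_relaxed(cands, eps_irm=0.05, eps_vrex=0.05))
```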

3.3. RECOVERY OF CAUSAL INVARIANCE

To better understand how PAIR bridges the gaps between the practical and original IRM objectives, we examine to what extent PAIR can recover the causal invariance specified by Arjovsky et al. (2019) in a more difficult case. More formally, the causal invariance is defined as follows. Definition 3.1. (Causal Invariance) Given a predictor f := w ∘ φ, the representation produced by the featurizer φ is invariant over E_all if and only if, for all e_1, e_2 ∈ E_all, it holds that E_{D_{e_1}}[Y | φ(X) = z] = E_{D_{e_2}}[Y | φ(X) = z] for all z ∈ Z^{e_1}_φ ∩ Z^{e_2}_φ, where Z^e_φ := {φ(X) | (X, Y) ∈ supp(D_e)}. Following Definition 3.1, we construct a regression problem. As shown in Fig. 4, Y = sin(X_1) + 1 is solely determined by X_1, i.e., the x-axis values, while X_2, the y-axis values, does not influence Y. Different colors indicate different values of Y. In this problem, the invariant representation φ should keep only X_1 and discard X_2. We sample two training environments, denoted by the red ellipsoids, whose overlap in the invariant feature X_1 is [-2, 2]. Hence the prediction produced by an invariant predictor following Definition 3.1 is expected to be independent of X_2; in other words, the plotted belts need to be perpendicular to the x-axis within the overlapping invariant features [-2, 2]. More details can be found in Appendix C.3. We plot the predictions with the best MSE losses of IRMv1 and VREx in Fig. 4(b) and Fig. 4(c), respectively. Although both IRMv1 and VREx fail to achieve the causal invariance, perhaps surprisingly, PAIR almost recovers it, as shown in Fig. 4(d).
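Our reconstruction of this regression setup as code (the sampling ranges beyond the stated [-2, 2] overlap are assumptions, and we use axis-aligned boxes instead of ellipsoids for simplicity):

```python
import math
import random

# Regression problem of Fig. 4: the label depends only on the invariant
# feature X1 via Y = sin(X1) + 1; X2 is a distractor whose distribution
# differs per environment. Ranges and sample sizes are our assumptions.

def sample_env(x1_range, x2_range, n, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.uniform(*x1_range)
        x2 = rng.uniform(*x2_range)
        data.append(((x1, x2), math.sin(x1) + 1.0))
    return data

# Two environments overlapping on x1 in [-2, 2]; x2 distributions differ.
e1 = sample_env((-4.0, 2.0), (0.0, 2.0), 500, seed=1)
e2 = sample_env((-2.0, 4.0), (-2.0, 0.0), 500, seed=2)

# A causally invariant featurizer keeps only x1, so E[Y | phi(x)] matches
# across environments on the overlap regardless of x2.
```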

4. PARETO INVARIANT RISK MINIMIZATION

The success of PAIR in empowering the non-robust IRMv1 to achieve the causal invariance of IRM demonstrates the significance of considering the trade-offs between ERM and OOD objectives in the optimization. Next, we summarize our findings and elaborate on PAIR in more detail.

4.1. METHODOLOGY OUTCOMES

Key takeaways from the IRM example. To summarize, the failures of OOD optimization can be attributed to: i) using non-robust objectives for optimization; ii) using an unreliable scheme to approach the desired solution. Nevertheless, we can improve the robustness of the OOD objectives by introducing additional guidance such that the desired solution is relocated onto the Pareto front of the new objectives. Having obtained robust objectives, we then leverage a preference-aware MOO solver to find Pareto optimal solutions that maximally satisfy the invariance constraints, by assigning the OOD objectives higher preferences while remaining aware of retaining ERM performance. More formally, let f_ood be the desired OOD solution and F the functional class of f_ood. A group of OOD objectives {L^i_ood}_{i=1}^m is robust if their composite objective L_ood satisfies

L_ood(f_ood) ⪯ L_ood(f), ∀f ≠ f_ood ∈ F. (8)

Given a robust OOD objective L_ood, our target is to solve the following MOO problem:

min_f (L_ERM, L_ood)^T, (9)

where L_ood corresponds to an ϵ_ood-relaxed invariance constraint, i.e., L_ood(f_ood) = ϵ_ood ⪯ L_ood(f), ∀f ≠ f_ood ∈ F. Denoting by ϵ_inv the empirical loss of predicting the labels with the underlying invariant features, the optimal values of the desired OOD solution w.r.t. Eq. 9 are (ϵ_inv, ϵ_ood)^T = (L_ERM(f_ood), L_ood(f_ood))^T, which corresponds to an ideal preference (or OOD preference) over the objectives, p_ood = (ϵ_inv^{-1}, ϵ_ood^{-1})^T. The optimal solutions of Eq. 9 that satisfy the exact Pareto optimality, i.e., p_ood,i L_i = p_ood,j L_j, ∀L_i, L_j ∈ L, are expected to recover f_ood in Eq. 8. PAIR-o as an optimizer for OOD generalization. To find the desired Pareto optimal solution specified by p_ood, we adopt a two-stage optimization scheme consisting of a "descent" phase and a "balance" phase, following common practice (Gulrajani & Lopez-Paz, 2021).
In the "descent" phase, we train the model with the ERM loss so that it approaches the Pareto front by merely minimizing L_ERM. Then, in the "balance" phase, we adjust the solution to maximally satisfy the exact Pareto optimality specified by p_ood. We adopt the off-the-shelf preference-aware MOO solver EPO (Mahapatra & Rajan, 2020) to find the desired Pareto optimal solutions under the given p_ood. Specifically, at each step, p_ood implies a balance direction g_b that maximally increases the satisfaction of the exact Pareto optimality. We then find an objective weight vector to reweight the ERM and OOD objectives (and thus their gradients), such that the reweighted descent direction g_dsc is maximally aligned with g_b. Meanwhile, to avoid divergence, g_dsc also needs to maintain a positive angle with the gradient of the objective that deviates most from the preferred direction. We provide detailed descriptions and theoretical discussions of the algorithm in Appendix D.1. PAIR-s for OOD model selection. Model selection in OOD generalization is known to be challenging, as the validation data used to evaluate model performance is no longer necessarily identically distributed with the test data (Gulrajani & Lopez-Paz, 2021). The IRM example also implies that traditional model selection methods depending merely on validation performance, i.e., ERM performance, can easily compromise OOD performance due to the conflicts with the ERM objective, especially when there is a large gap between the validation and test sets (cf. CMNIST in Table 3). Given no additional assumptions, we posit that the OOD loss values can serve as a proxy for OOD performance, which essentially corresponds to the underlying prior assumed by the OOD methods. This naturally resembles the PAIR optimization problem and therefore motivates PAIR-s.
PAIR-s jointly considers and trades off the ERM and OOD performance in model selection, and selects the models that maximally satisfy the exact Pareto optimality. We leave more details and discussions to Appendix D.2.
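To make the two-phase scheme concrete, below is a toy run on two quadratic objectives. The "balance" phase here descends the currently worst preference-weighted loss, a crude subgradient stand-in for the EPO solver (Mahapatra & Rajan, 2020); it illustrates the descent/balance structure and the exact Pareto optimality target p_1 L_1 = p_2 L_2, not EPO itself:

```python
# Toy two-phase PAIR-o-style loop. L_1 plays the role of L_ERM and L_2 the
# role of the OOD loss; both are simple quadratics with known gradients.

def losses(t):
    return ((t[0] - 1.0) ** 2 + t[1] ** 2,      # L_1: "ERM" loss
            t[0] ** 2 + (t[1] - 1.0) ** 2)      # L_2: "OOD" loss

def grad(t, i):
    if i == 0:
        return (2.0 * (t[0] - 1.0), 2.0 * t[1])
    return (2.0 * t[0], 2.0 * (t[1] - 1.0))

p = (1.0, 10.0)                                 # higher preference on the OOD loss
t = [0.0, 0.0]

for step in range(200):                         # "descent" phase: minimize L_1 only
    g = grad(t, 0)
    t = [t[k] - 0.05 * g[k] for k in range(2)]

for step in range(5000):                        # "balance" phase: equalize p_i * L_i
    l = losses(t)
    i = 0 if p[0] * l[0] > p[1] * l[1] else 1   # worst preference-weighted loss
    g = grad(t, i)
    lr = 0.2 / (step + 1) ** 0.5                # diminishing subgradient steps
    t = [t[k] - lr * g[k] for k in range(2)]

l1, l2 = losses(t)
print(p[0] * l1 / (p[1] * l2))                  # ratio approaches 1
```

On this toy problem the ratio p_1 L_1 / (p_2 L_2) approaches 1, i.e., the exact Pareto optimality condition, whereas the phase-1 solution alone leaves the OOD-like loss large.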

4.2. THEORETICAL DISCUSSIONS AND PRACTICAL CONSIDERATIONS

Essentially, both PAIR-o and PAIR-s aim to solve Eq. 9 up to the exact Pareto optimality. However, in practice, the ideal preference is usually unknown, and the exact Pareto optimality could be too strict to achieve. Therefore, we develop an ϵ-approximated formulation of Eq. 9, i.e., |p_ood,i L_i - p_ood,j L_j| ≤ ϵ, ∀L_i, L_j ∈ L, which might be of independent interest. Built upon the relaxed variant, we analyze the OOD performance of PAIR in terms of sample complexity, given the empirical risk and an imprecise OOD preference, and prove the following theorem in Appendix E.2.

Theorem 4.1. (Informal) For γ ∈ (0, 1) and any ϵ, δ > 0, suppose F is a finite hypothesis class and both the ERM and OOD losses are bounded above. Let I_PAIR be the index set of all losses, p_max := max_{i∈I_PAIR} p_i and L_max := max_{i∈I_PAIR} L_i. If the number of training samples |D| ≥ (32 L_max^2 p_max^2 / δ^2) log[2(m + 1)|F|/γ], then with probability at least 1 - γ, PAIR-o and PAIR-s yield an ϵ-approximated solution of f_ood.

Practical considerations. Theorem 4.1 establishes the theoretical guarantees of PAIR-o and PAIR-s given only an imprecise OOD preference. Empirically, we find that assigning a sufficiently large preference to the OOD objectives is generally enough for PAIR-o to find a desired OOD solution. For example, in most experiments PAIR-o yields a satisfactory OOD solution with a relative preference of (1, 1e10, 1e12) for ERM, IRMv1, and VREx, respectively. For PAIR-s, we can estimate empirical upper bounds on (ϵ_inv, ϵ_ood) from the running history and set the OOD preference slightly larger. We provide a detailed discussion of the preference choice in practice in Appendix D.3. Besides, the requirement of whole-network gradients in PAIR-o can be a bottleneck when deployed to models with a prohibitively large number of parameters (Sener & Koltun, 2018).
To this end, we can use only the gradients of the classifier w to solve for the objective weights, or freeze the featurizer after the "descent" phase to further reduce the resource requirements (Zhang et al., 2022a). We discuss more practical options, and how PAIR can be applied to other OOD methods, in Appendix D.4.
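As an illustration of the selection rule (our reading of PAIR-s, with invented checkpoint losses and preference; the actual criterion and its details are in Appendix D.2), one can score each checkpoint by how far its preference-weighted losses deviate from the exact Pareto optimality condition p_i L_i = p_j L_j:

```python
# Hypothetical PAIR-s-style model selection over candidate checkpoints.
# Each checkpoint records its (L_ERM, L_OOD) losses; lower score = closer
# to the exact Pareto optimality condition under the given preference.

def pair_s_score(losses, pref):
    """Deviation from exact Pareto optimality; lower is better."""
    w = [p * l for p, l in zip(pref, losses)]
    return max(w) - min(w)

pref = (1.0, 100.0)                        # (ERM, OOD) preference, invented
checkpoints = {
    "ckpt_a": (0.10, 0.0100),              # low ERM loss, large OOD penalty
    "ckpt_b": (0.25, 0.0024),              # balanced under the preference
    "ckpt_c": (0.60, 0.0001),              # OOD penalty crushed, ERM sacrificed
}
best = min(checkpoints, key=lambda k: pair_s_score(checkpoints[k], pref))
print(best)  # → "ckpt_b"
```

Selecting on validation accuracy alone would pick `ckpt_a` here, mirroring the compromise on OOD performance discussed above.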

5. EXPERIMENTS

We conduct extensive experiments on COLOREDMNIST, WILDS, and DOMAINBED to verify the effectiveness of PAIR-o and PAIR-s in finding better OOD solutions under objective conflicts. Can PAIR-o effectively find better OOD solutions under realistic distribution shifts? We evaluate PAIR-o implemented with IRMX on 6 challenging datasets from the WILDS benchmark (Koh et al., 2021), and compare it with other state-of-the-art OOD methods from different lines of work (Sec. 2), including CORAL (Sun & Saenko, 2016), GroupDRO (Sagawa* et al., 2020), IRM (Arjovsky et al., 2019), V-REx (Krueger et al., 2021), Fish (Shi et al., 2022), and an advanced importance-aware data augmentation method, LISA (Yao et al., 2022). By default, we assign a relative preference of (1, 1e10, 1e12) to the ERM, IRMv1, and VREx objectives, respectively, and restrict the search space of the preference. Our implementation and evaluation protocol follow the exact configurations of previous works (Koh et al., 2021; Shi et al., 2022; Yao et al., 2022); details can be found in Appendix F.3. Table 2 shows that PAIR-o substantially improves over IRMv1 as well as IRMX and yields top-ranking OOD performance among all state-of-the-art methods across different realistic distribution shifts, demonstrating the effectiveness and significance of resolving the optimization dilemma in OOD generalization. Besides, the advantage of PAIR over IRMX also confirms the effectiveness of PAIR-o in finding a better trade-off between the ERM and OOD objectives. How can PAIR-o mitigate the objective conflicts? We conduct ablation studies with the modified COLOREDMNIST (more details and results are given in Appendix F.2). First, as shown in Fig. 5(a), PAIR-o effectively finds a better solution than exhaustive tuning of the penalty weights in IRMX. That is because PAIR can adaptively adjust the penalty weights (Fig. 5(b)), which leads to a Pareto optimal solution with lower OOD losses that does not compromise the ERM loss too much (Fig. 5(c)).
Another reason is that PAIR-o is generally robust to different preference choices (Fig. 5(d)), which makes it adaptable to various scenarios, confirming our discussion in Sec. 4.2. Can PAIR-s effectively select better OOD solutions under realistic distribution shifts? To verify the effectiveness of PAIR-s, we apply it to multiple representative OOD methods discussed in Sec. 2, and examine whether PAIR-s can improve model selection under rigorous hyperparameter tuning (Gulrajani & Lopez-Paz, 2021) on COLOREDMNIST (Kamath et al., 2021), PACS (Li et al., 2017), and TERRAINCOGNITA (Beery et al., 2018). Intuitively, models selected merely based on ERM performance tend to perform better on environments whose distribution is similar to that of the corresponding validation set, which leads to a higher variance of performance across environments or a lower worst-environment performance. Hence we use training-domain validation accuracy for COLOREDMNIST and TERRAINCOGNITA, and test-domain validation accuracy for PACS, to validate the existence of this issue under different scenarios (Teney et al., 2021). More details and results are provided in Appendix G. Table 3 shows that the models selected only based on validation accuracy indeed exhibit high variance in performance across environments. In contrast, by jointly considering and trading off the ERM and OOD performance in model selection, PAIR-s substantially mitigates this variance, improving the worst-environment performance of all methods under all setups by up to 10%. This serves as strong evidence for the importance of considering the ERM and OOD trade-offs.

6. CONCLUSION

In this work, we provided a new understanding of the optimization dilemma in OOD generalization from the MOO perspective, and attributed the failures of OOD optimization to the compromised robustness of relaxed OOD objectives and the unreliable optimization scheme. We highlighted the importance of trading off the ERM and OOD objectives, and proposed a new optimizer, PAIR-o, and a new model selection criterion, PAIR-s, to mitigate the dilemma. We provided extensive theoretical and empirical evidence to show the necessity and significance of properly handling the ERM and OOD trade-offs.

ETHICS STATEMENT

Considering the wide applications and high sensitivity of deep neural networks to distribution shifts and spurious correlations, it is important to develop new methods that are able to generalize to OOD data, especially for human-centered AI scenarios such as autonomous driving and social welfare. By understanding and mitigating the optimization dilemma in OOD generalization, our work could serve as an initial step towards a new foundation of optimization for OOD generalization, with the hope of building more trustworthy AI systems to facilitate broader AI applications and social benefits. Besides, this paper does not raise any ethical concerns. This study does not involve any human subjects, dataset releases, potentially harmful insights, methodologies or applications, potential conflicts of interest or sponsorship, discrimination/bias/fairness concerns, privacy or security issues, legal compliance issues, or research integrity issues.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of our theoretical results, we provide detailed proofs for our propositions and theorems in Appendix E. To ensure the reproducibility of our methods and experimental results, we provide a detailed description of the IRM case in Appendix C.1, the algorithms in Appendix D, and the experimental settings in Appendix F, in addition to the main text. Besides, we will further provide a link to an anonymous repository that contains the source code for reproducing the results in our paper during the discussion phase.

A NOTATIONS

We first list the notations for key concepts in our paper.

Table 4: Notations
X = R^n : the input space
Y = R : the label space
Z = R^d : the latent space
φ : the featurizer φ : X → Z, which learns a latent representation for each input example
w : the classifier w : Z → Y, which makes predictions from the learned representations
f ∈ F : the predictor f = w • φ : X → Y, composed of a featurizer and a classifier

In this section, we provide more details of the backgrounds and closely related works to ours, in complementary to Sec. 2.

The problem of OOD generalization. The problem of OOD generalization typically considers a supervised learning setting based on the data D = {D^e}_{e∈E_all} collected from multiple causally related environments E_all, where a subset of samples D^e = {(X^e_i, Y^e_i)} from a single environment e ∈ E_all is drawn independently from an identical distribution P^e (Peters et al., 2016). Given the data from the training environments {D^e}_{e∈E_tr}, the goal of OOD generalization is to find a predictor f : X → Y that generalizes well to all (unseen) environments, i.e., to minimize max_{e∈E_all} L_e(f), where L_e is the empirical risk (Vapnik, 1991) under environment e, and X and Y are the input and label spaces, respectively. The predictor f = w • φ is usually composed of a featurizer φ : X → Z that learns to extract useful features, and a classifier w : Z → Y that makes predictions from the extracted features.
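The worst-environment objective min_f max_{e∈E_all} L_e(f) above can be sketched in a few lines. This is a toy illustration with squared loss and hand-made environments; all names and the data are ours, not the paper's:

```python
# Toy sketch of the OOD objective: evaluate a predictor f = w . phi by its
# worst-environment risk max_e L_e(f). Each environment is a list of (x, y)
# pairs; the per-environment loss is mean squared error.
def env_risk(f, data):
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)

def worst_env_risk(f, envs):
    return max(env_risk(f, d) for d in envs)

# predictor composed of a featurizer (picks the invariant coordinate x1)
# and a linear classifier
phi = lambda x: x[0]
w = lambda z: 2.0 * z
f = lambda x: w(phi(x))

envs = [
    [((1.0, 5.0), 2.0), ((2.0, -3.0), 4.0)],  # e1: y = 2 * x1 exactly
    [((1.0, 0.0), 2.5), ((3.0, 9.0), 6.0)],   # e2: slightly noisy labels
]
worst = worst_env_risk(f, envs)  # dominated by the noisy environment e2
```

A predictor relying on the second (spurious) coordinate would score well in one environment but poorly in the other, inflating the max.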
In practice, φ is commonly implemented as a deep feature extractor, while w is generically implemented as a simple dense linear classifier (Gulrajani & Lopez-Paz, 2021; Koh et al., 2021; Rame et al., 2021; Rosenfeld et al., 2022). Existing solutions to OOD generalization. There exists a rich literature aiming to overcome the OOD generalization challenge, and the solutions usually appear as additional regularizations of ERM (Vapnik, 1991). The first line consists of Domain Generalization works (Ganin et al., 2016; Sun & Saenko, 2016; Li et al., 2018; Dou et al., 2019) that try to regularize the learned features to be domain-invariant. However, Zhao et al. (2019) show that domain-invariant features alone are not sufficient to guarantee good OOD generalization. We refer readers to Gulrajani & Lopez-Paz (2021) for more details of the Domain Generalization literature. Moreover, Namkoong & Duchi (2016); Hu et al. (2018); Sagawa* et al. (2020) aim to regularize models to be robust to mild distributional perturbations of the training distributions, such that the models are expected to perform well in unseen test environments. Following the line of distributional robustness, Liu et al. (2021); Zhang et al. (2022b); Yao et al. (2022) further propose advanced strategies to improve robustness by assuming that models trained with ERM have a strong reliance on spurious features. Recently, there has been increasing interest in adopting the theory of causality (Pearl, 2009; Peters et al., 2017; Schölkopf et al., 2021) and introducing causal invariance into the learned representations (Peters et al., 2016; Rojas-Carulla et al., 2018; Arjovsky et al., 2019). Causal invariance is inspired by the assumption of Independent Causal Mechanisms (ICM) in causality (Peters et al., 2017). ICM assumes that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other conditional distributions (Pearl, 2009; Peters et al., 2017).
Peters et al. (2016) introduce the concept of environments, which are generated by different interventions on certain variables involved in the underlying data generation process of (X, Y). Despite the changes to the intervened variables, the conditional distribution of Y given its direct causes (usually the direct parents of Y in the underlying causal graph) remains invariant. Therefore, the invariant relationship can be leveraged to predict Y and generalize to different environments. We refer interested readers to Peters et al. (2016); Schölkopf et al. (2021); Ahuja et al. (2021a) for more details. Inspired by the causal invariance principle, Arjovsky et al. (2019) propose the framework of Invariant Risk Minimization (IRM) that enables the adoption of causal invariance in neural networks. It further inspires plentiful invariant learning works (Parascandolo et al., 2021; Mahajan et al., 2021; Creager et al., 2021; Wald et al., 2021; Ahuja et al., 2021a; Chen et al., 2022b; Lin et al., 2022b). At the heart of these works is the intuition that, when a predictor w acting on φ minimizes the risks in all of the environments simultaneously, φ is expected to discard the spurious signals while keeping the causally invariant signals. Additionally, there can be more definitions and implementations of the invariance (Koyama & Yamaguchi, 2020; Krueger et al., 2021; Shi et al., 2022; Rame et al., 2021) that further encourage agreement at various levels across different environments. We refer interested readers to Rame et al. (2021) for a detailed comparison and discussion.
As most of the existing approaches encounter the optimization dilemma when learning the causal invariance, this work mainly focuses on resolving the optimization issue in learning the causal invariance defined by the framework of Invariant Risk Minimization (Arjovsky et al., 2019), which differs from the literature of IRM variants or other OOD objectives that focus on proposing better objectives to learn the causal invariance. Optimization Dilemma in OOD Algorithms. Along with the developments of OOD methods, the optimization dilemma in OOD generalization has gradually been perceived in the literature and raises new puzzles to the community. In fact, several recent works also notice the optimization dilemma in OOD algorithms, specifically, the trade-off between discovering the statistical correlations (i.e., ERM) and preventing the usage of spurious correlations (e.g., IRM). Empirically, Gulrajani & Lopez-Paz (2021) observe that, with careful hyperparameter tuning and evaluation settings, many OOD algorithms cannot outperform ERM in domain generalization, demonstrating the difficulty of properly mitigating the trade-offs between OOD and ERM objectives in practice. Moreover, different from works such as Sagawa* et al. (2020) that focus on the optimization consequences, we focus on the optimization process of OOD objectives. In addition, Zhang et al. (2022a) find that the performance of OOD algorithms largely relies on choosing proper pretraining epochs, which aligns with our findings in Fig. 1(d), and hence propose to construct ready-to-use features for stable OOD generalization performance. Orthogonal to Zhang et al. (2022a), we focus on developing a better optimization scheme for OOD algorithms, including choosing the proper objectives and the achievability of the invariant predictors. Besides, Lv et al. (2021) propose ParetoDA to leverage MOO to resolve the gradient conflicts among the objectives in Domain Adaptation.
ParetoDA uses the guidance of a validation loss, based on data that has an identical distribution to the test distribution, to trade off the conflicts among the domain adaptation objectives. However, there can be multiple test domains, and data with a distribution identical to the test domain is usually unavailable in OOD generalization. Therefore, ParetoDA is unsuitable for general OOD generalization methods. Despite the increasing literature that perceives the OOD optimization dilemma, it remains an open problem why such a dilemma exists, and how to effectively mitigate the conflicts between ERM and OOD objectives and obtain an OOD generalizable solution. Further implications of the OOD optimization dilemma. In addition to preventing the discovery of a proper OOD solution, the OOD optimization dilemma also raises significant challenges for the model selection of OOD algorithms. Gulrajani & Lopez-Paz (2021) highlight this challenge with a rigorous evaluation of OOD algorithms. Similar to PAIR-o, PAIR-s resolves the dilemma by leveraging the OOD loss values and explicitly considering the trade-offs between ERM and OOD performance. We present more details in Sec. G.1. Multi-Objective Optimization (MOO) and its applications in Multi-Task Learning. MOO considers solving m objectives w.r.t. the losses {L_i}_{i=1}^m, i.e., min_θ L(θ) = (L_1(θ), ..., L_m(θ))^T (Kaisa, 1999). A solution θ is said to dominate another solution θ̃, i.e., L(θ) ⪯ L(θ̃), if L_i(θ) ≤ L_i(θ̃) for all i and L(θ) ≠ L(θ̃). A solution θ* is called Pareto optimal if no other solution dominates θ*. The set of Pareto optimal solutions is called the Pareto set, denoted as P, and its image is called the Pareto front. As it is usually impossible to find a global optimum for all objectives in practice, Pareto optimal solutions are of particular value. The multiple-gradient descent algorithm (MGDA) is one of the commonly used approaches to efficiently find Pareto optimal solutions (Désidéri, 2012), but it is limited to low-dimensional data. Sener & Koltun (2018) resolve this issue and apply MGDA to high-dimensional multi-task learning scenarios, where objective conflicts may degrade performance when using linear scalarization. As pure MGDA cannot find a Pareto optimal solution specified by certain objective preferences, Lin et al. (2019); Zhang & Golovin (2020); Ma et al. (2020) propose efficient methods to explore the Pareto set, and Mahapatra & Rajan (2020) propose EPO to find the exact Pareto optimal solution under the specified objective preferences. Although MOO has gained success in mitigating task conflicts in multi-task learning, it remains underexplored whether and how MOO can be leveraged to model and resolve the conflicts between ERM and OOD objectives. Without a proper set of objectives and preference guidance, existing MOO solvers are unable to obtain a desired solution for OOD generalization.
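The dominance and Pareto-set definitions above can be sketched directly over a finite set of candidate loss vectors. This is a minimal illustration with our own names, not the paper's implementation:

```python
# Minimal sketch of Pareto dominance and Pareto-set extraction over a finite
# candidate set; loss vectors are plain tuples of floats (lower is better).

def dominates(la, lb):
    """True iff `la` dominates `lb`: la_i <= lb_i for all i and la != lb."""
    return all(a <= b for a, b in zip(la, lb)) and la != lb

def pareto_set(candidates):
    """Return candidates whose loss vectors are not dominated by any other."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

losses = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
front = pareto_set(losses)  # (2.5, 2.5) is dominated by (2.0, 2.0)
```

Here the first three vectors form the Pareto set: each trades one objective against the other, while (2.5, 2.5) is strictly worse than (2.0, 2.0) in both.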

B.2 LIMITATIONS AND FUTURE DIRECTIONS

Although PAIR is shown to effectively mitigate the objective conflicts and boost the OOD performance via better optimization and model selection, the performance gains can sometimes decrease due to the limitations of PAIR. We believe future works can be built upon resolving these limitations, as detailed below. From the optimizer perspective, the improvements of PAIR-o can decrease on some datasets. We hypothesize this is because of the inevitable stochastic gradient bias in all MGDA-based MOO solvers (Liu & Vicente, 2021), and the potentially large variance in estimating the IRMv1 penalties (e.g., on RXRX1, where both IRMv1 and VREx are shown to perform poorly), as discussed in Appendix D.4.2. For PAIR-s, as discussed in Sec. 4, PAIR-s mitigates the drawbacks of selecting models using an unreliable validation set (one that has a large gap from the test domain); hence the improvements become smaller when the gap narrows (e.g., PACS using test-domain validation accuracy). Besides, the estimation of the satisfaction of Pareto optimality in PAIR-s can also be affected by the variance in estimating loss values in the stochastic setting (e.g., TERRAINCOGNITA), as discussed in Appendix D.2. Additionally, PAIR can also be applied to other scenarios where gradient conflicts exist, such as the trade-off between the adversarial power and unnoticeability of attacks (Chen et al., 2022a), as well as improving the quality of representations in contrastive learning (Ma et al., 2021).

C MORE DETAILS ON IRM FAILURES AND FIX

In this section, we provide more details about the failure case of IRM and its effective fix from the perspective of MOO, in complementary to Sec. 3.

C.1 MORE DETAIL ABOUT FAILURE CASE OF IRM

We follow Kamath et al. (2021) to discuss the failure case of IRM. Specifically, given the problem setup as in Sec. B.1, we are interested in linear classification/regression under the following setting, where loss values are measured as the population loss in each environment. Setting A (identical to Kamath et al. (2021)): Ŷ = R, Y ⊆ R, and ℓ is either the square loss ℓ_sq(ŷ, y) := (1/2)(ŷ − y)^2, or the logistic loss ℓ_log(ŷ, y) := log(1 + exp(−ŷy)) when Y = {−1, 1} (binary classification). IRM approaches the problem by finding an invariant representation φ : X → Z, such that there exists a predictor w : Z → Y acting on φ that is simultaneously optimal among E_all. Hence, IRM leads to a challenging bi-level optimization problem (Arjovsky et al., 2019):

min_φ Σ_{e∈E_tr} L_e(φ), s.t. ∇_{w|w=1} L_e(w • φ) = 0, ∀e ∈ E_tr.

The failure case is constructed on the two-bit environments: Y := Rad(0.5), X_1 := Y • Rad(α_e), X_2 := Y • Rad(β_e), where Rad(σ) is a random variable taking value −1 with probability σ and +1 with probability 1 − σ. We denote an environment e by (α_e, β_e) for simplicity. The setup in IRM can be denoted as E_α = {(α, β_e) : 0 < β_e < 1}, where X_1 is the invariant feature as α is fixed across environments. In the example given by Arjovsky et al. (2019), i.e., E_tr := {(0.25, 0.1), (0.25, 0.2)}, IRM_S and IRMv1 are shown to learn the invariant predictor f_IRM as the original IRM does, despite the relaxation. However, due to I(E_tr) ⊆ I_S(E_tr), Kamath et al. (2021) show that the set of "invariant predictors" produced by IRM_S and IRMv1 is broader than our intuitive sense. For example, given E_tr := {(0.1, 0.11), (0.1, 0.4)}, the solutions satisfying the constraint in IRM_S are the intersected points in Fig. 1(a) (the ellipsoids are the constraints). Although f_0, f_1, f_2, f_IRM ∈ I_S(E_tr), both IRM_S and IRMv1 prefer f_1 instead of f_IRM (the predictor elicited by the original IRM), as f_1 has the smallest ERM loss. In fact, Kamath et al.
(2021) prove that the failure can happen in a wide range of environments with α < 0.1464 and α > 0.8356, even given an infinite number of additional environments, under the MSE loss. It follows that I(E_tr) ⊊ I_S(E_tr). In other words, the relaxations in IRM_S and IRMv1 introduce additional "invariant predictors" that do not satisfy the original IRM constraint. Both IRM_S and IRMv1 will prefer those "invariant predictors" whenever they have lower ERM loss than f_IRM, demonstrating the significant theoretical gap between the practical variants and the original IRM. Practical Drawbacks of the Practical IRM Variants. In addition to the theoretical gap, the optimization of IRMv1 is also difficult due to the conflicts between the IRM penalty and the ERM term in Eq. 12. It often requires significant effort to choose proper hyperparameters, such as the number of pretraining epochs and the IRM penalty weight λ; otherwise, IRMv1 may not enforce the constraint in IRM_S and hence leads to unsatisfactory performance, as shown in Fig. 1(d). We argue that gradient conflicts generally exist in OOD optimization for various objectives. In Fig. 1(b), we visualize the cosine similarity between the gradients produced by ERM and OOD objectives, averaged over 50 epochs after pretraining. It can be found that all of the OOD objectives (Arjovsky et al., 2019; Krueger et al., 2021; Ahuja et al., 2021a; Koyama & Yamaguchi, 2020; Rame et al., 2021; Wald et al., 2021; Pezeshki et al., 2021) tend to yield gradients that have low cosine similarity with those of ERM. These generally existing conflicts can further lead to suboptimal performances of the OOD objectives in practice, even with exhaustive hyperparameter tuning. In complementary to Fig. 1(d), we provide full results in Fig.
8, where we show the results of IRMv1 under different penalty weights (y-axis) and pretraining epochs (x-axis) on COLOREDMNIST (Arjovsky et al., 2019) (CMNIST), as well as the failure case (Kamath et al., 2021) (CMNIST-m), i.e., E_tr := {(0.1, 0.2), (0.1, 0.25)} described in the two-bit environments. It can be found that the performance of IRMv1 is highly dependent on the proper tuning of the pretraining epochs and the penalty weights. The dependence grows even stronger on CMNIST-m, where IRMv1 is shown to be non-robust. We also provide more detailed results of IRMv1 on CMNIST-m in Fig. 8(c), where the dependence can be clearly observed. In contrast, PAIR performs robustly well under different pretraining epochs, using a default preference of (1, 1e10, 1e12) for the ERM, IRMv1 and VREx objectives, respectively. In Sec. 5, we provide more evidence to demonstrate the power of PAIR-o. In Sec. 3.2, we derive a group of ideal objectives for improving the robustness of IRMv1, shown as the following IRMX:

(IRMX) min_φ (L_ERM, L_IRM, L_VREx)^T. (14)

We prove in Proposition 2 that IRMX is able to solve a large number of failure cases of IRM_S and IRMv1, and recovers the set of invariant predictors produced by the original IRM. However, motivated readers might wonder why IRMv1 is kept in IRMX, since VREx alone could resolve the two-bit environment failure case. Theoretically, Proposition 2 also requires the invariant predictors produced by IRM_S, i.e., I_S(E), to recover the invariant predictors yielded by IRM. Nevertheless, this considers only the ideal case. In the following, we elaborate with a detailed discussion from the empirical side. Drawbacks of Robust Minimization in Practice. After showing that REx (Krueger et al., 2021) can help avoid the failure cases of IRM_S, a natural question is: does L_IRM remain necessary? We find the answer is "Yes". In Fig.
9, we use a modified example of E_tr = {(0.25, 0.1), (0.25, β)} with COLOREDMNIST (Arjovsky et al., 2019), where we vary the variance between the two environments through different β. It can be found that, as the two environments get closer, the performance of REx (Krueger et al., 2021) (denoted as vrex) drops more sharply than that of IRMv1 (denoted as irmv1). The main reason is that, as the variation of the spurious signals between the two environments becomes smaller, the gradient signal of var({L_e}_{e∈E_tr}) tends to vanish, while the signal from L_IRM remains. This issue can be more serious in stochastic gradient descent, where the minibatch estimates of the variance of {L_e}_{e∈E_tr} tend to be noisy, leading to even weaker signals.
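The vanishing-signal argument above can be sketched numerically on the two-bit environments: sample E_tr = {(0.25, 0.1), (0.25, β)}, score the purely spurious predictor f(X) = X_2 by its 0-1 risk per environment, and compare var({L_e}) for β far from vs. close to 0.1. This is a hedged illustration with our own helper names, not the paper's code:

```python
# Sketch: the VREx penalty var({L_e}) shrinks rapidly as the environments
# (0.25, 0.1) and (0.25, beta) get closer, so its gradient signal nearly
# vanishes, while the IRMv1 stationarity penalty does not rely on this gap.
import random

def rad(sigma, rng):
    """Rad(sigma): -1 with probability sigma, +1 otherwise."""
    return -1 if rng.random() < sigma else 1

def env_risk_of_spurious(alpha, beta, n, seed):
    """0-1 risk of predicting Y directly from X2 in environment (alpha, beta)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        y = rad(0.5, rng)
        x2 = y * rad(beta, rng)  # spurious feature, flips with probability beta
        errors += (x2 != y)
    return errors / n

def var_penalty(losses):
    """Population variance of the per-environment losses (the REx penalty)."""
    mean = sum(losses) / len(losses)
    return sum((l - mean) ** 2 for l in losses) / len(losses)

n = 50_000
far = [env_risk_of_spurious(0.25, 0.10, n, 1),   # risks ~0.1 and ~0.4
       env_risk_of_spurious(0.25, 0.40, n, 2)]
near = [env_risk_of_spurious(0.25, 0.10, n, 3),  # risks ~0.1 and ~0.12
        env_risk_of_spurious(0.25, 0.12, n, 4)]
# var_penalty(near) is more than two orders of magnitude below var_penalty(far)
```

With β = 0.4 the penalty is roughly ((0.1 − 0.25)^2 + (0.4 − 0.25)^2)/2 ≈ 0.0225, while with β = 0.12 it is about 1e-4, matching the claim that the VREx gradient signal weakens as the environments converge.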

C.3 MORE DETAILS ON THE EXTRAPOLATION EXAMPLE

In this section, we provide more details and results about the extrapolation example that examines the recovery of causal invariance, in complementary to Sec. 3. Recall that a featurizer φ elicits the causal invariance across environments e_1 and e_2 if E_{D^{e_1}}[Y | φ(X) = z] = E_{D^{e_2}}[Y | φ(X) = z] for all z ∈ Z_φ^{e_1} ∩ Z_φ^{e_2}, where Z_φ^e := {φ(X) | (X, Y) ∈ supp(D^e)}. Then, we construct a regression example from X : R^2 → Y : R. The input X is two-dimensional, i.e., X = (X_1, X_2), and X_1 is designed to be the invariant feature, i.e., Y = sin(X_1) + 1, while X_2 serves as the spurious feature.

Key takeaways from the IRM example. Recall that the key failures of OOD optimization can be attributed to: i) using unrobust objectives for optimization; ii) using an unreliable scheme to approach the desired solution. Nevertheless, we can improve the robustness of the OOD objectives by introducing additional guidance, such that the desired solution can be relocated in the Pareto front w.r.t. the new objectives. After obtaining robust objectives to optimize, we then leverage a preference-aware MOO solver to find the Pareto optimal solutions that maximally satisfy the invariance constraints, by assigning the OOD objectives a higher preference while being aware of retaining the ERM performance. More formally, let f_ood be the desired OOD solution. A group of OOD objectives L_ood = {L^i_ood}_{i=1}^m is robust if it satisfies

L_ood(f_ood) ⪯ L_ood(f), ∀f ∈ F, f ≠ f_ood, (15)

where F denotes the functional class of possible predictors. Given a robust OOD objective L_ood, our target is to solve the following MOO problem

min_f (L_ERM, L_ood)^T, (16)

where L_ood corresponds to an ϵ_ood-relaxed invariance constraint, i.e., L_ood(f_ood) = ϵ_ood ⪯ L_ood(f), ∀f ∈ F, f ≠ f_ood.
Denote by ϵ_inv the empirical loss of using the underlying invariant features to predict the labels; then the optimal values of the desired OOD solution are (ϵ_inv, ϵ_ood)^T = (L_ERM(f_ood), L_ood(f_ood))^T, which corresponds to an ideal OOD preference for the objectives of p_ood = (1/ϵ_inv, 1/ϵ_ood)^T. Then the solution of Eq. 9 needs to maximally satisfy the OOD preference, i.e., to maximize L(f)^T p_ood.
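The ideal preference above assigns each objective the reciprocal of its target loss value, so at the desired solution the preference-weighted losses are all equal, which is the EPO-style notion of an exact Pareto optimal solution w.r.t. a preference. A tiny hedged sketch (the target values are illustrative, not from the paper):

```python
# Sketch of the ideal OOD preference p_ood = (1 / eps_inv, 1 / eps_ood)^T:
# at the desired solution, each loss times its preference entry equals 1,
# so no objective diverges from the preferred direction.
def ideal_preference(eps_inv, eps_ood):
    return (1.0 / eps_inv, 1.0 / eps_ood)

p = ideal_preference(0.25, 1e-4)            # illustrative target losses
weighted = [l * pi for l, pi in zip((0.25, 1e-4), p)]  # all equal at optimum
```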

D.1 DETAILED DESCRIPTION OF PAIR-O FOR OOD OPTIMIZATION

To find a Pareto optimal solution that satisfies the OOD preference p_ood, we leverage the preference-aware MOO solver (Mahapatra & Rajan, 2020). Different from Mahapatra & Rajan (2020), we adopt an explicit two-stage "descent" and "balance" scheme, following the common practice in OOD generalization (Gulrajani & Lopez-Paz, 2021).


Figure 15: Illustration of PAIR-o.

As illustrated in Fig. 15, in the "descent" phase, we train the model to approach the Pareto front by merely minimizing L_ERM first. Then, in the "balance" phase, we adjust the solution to maximally satisfy the OOD preference p_ood. Meanwhile, to avoid diverging from the Pareto front, at each step the descent direction g_des not only needs to maximize L(f)^T p_ood, but also needs to avoid ascending any of the loss values. More formally, let G denote the gradient signals produced by L. At step t of the "balance" phase, PAIR-o solves the following LP for the objective weights β*:

β* = arg max_{β ∈ S^{m+1}} (Gβ)^T g_b, s.t. (Gβ)^T G_j ≥ g_b^T G_j, ∀j ∈ J̄ − J*; (Gβ)^T G_j ≥ 0, ∀j ∈ J*, (17)

where S^{m+1} = {β ∈ R_+^{m+1} | Σ_{i=1}^{m+1} β_i = 1}, g_b is the adjustment direction that leads to the Pareto optimal solution preferred by p_ood, J = {j | G_j^T g_b > 0} contains the indices of the objectives that do not conflict with g_b, J̄ = {j | G_j^T g_b ≤ 0} contains those that do conflict with g_b, and J* = {j | L_j p_ood,j = max_{j'} (L_{j'} p_ood,j')} is the index of the objective that diverges most from the preference. Specifically, Mahapatra & Rajan (2020) show that the following g_b provably leads the solution to converge to the desired preferred Pareto optimal solution:

g_b = p ⊙ (log((m + 1) L̄) − µ(L)), (18)

where ⊙ is the element-wise product operator, and µ(L) is the quantitative divergence of the current solution from the preferred direction, calculated through the losses at the current step as

µ(L) = KL(L̄ ∥ 1/(m + 1)) = Σ_{i=1}^{m+1} L̄_i log((m + 1) L̄_i), (19)

where L̄ is the normalized loss, i.e., L̄_i = p_ood,i L_i / Σ_{j=1}^{m+1} p_ood,j L_j. Accordingly, each step of the "balance" phase in Algorithm 1 proceeds as follows:

14: Calculate the empirical and OOD losses L_ERM and L_ood and obtain the overall losses L;
15: Obtain the gradients G = ∂L/∂θ;
16: Calculate the OOD divergence µ(L) using Eq. 19;
17: Obtain the adjustment direction g_b using Eq. 18;
18: Obtain the index sets J, J*, J̄ required by Eq. 17;
19: Solve Eq. 17 for the loss weights β*;
20: Update the parameters θ_{i+1} = θ_i − ηGβ*;
21: end for

We elaborate the detailed algorithm of PAIR-o implemented via the EPO solver (Mahapatra & Rajan, 2020) in Algorithm 1. We now state an informal version of the convergence guarantee. Theorem D.1. (Informal) Given L_ERM along with m differentiable OOD losses L_ood, at each step of the "balance" phase (lines 9 to 21 in Algorithm 1), there exists a step size η_0 such that the set of new loss values L^{(t+1)} = (L_ERM, L_1, ..., L_m)^T with the updated parameters θ^{(t+1)} for any η ∈ [0, η_0], denoted as A_t, has the following properties: (i) A_t contains the exact Pareto optimal solution satisfying the OOD preference vector, i.e., L* ∈ A_t; (ii) A_t grows monotonically smaller. From (i) and (ii) in Theorem D.1, it suffices to know that, as the optimization continues, A_t converges to the losses of the exact Pareto optimal solution, and hence so do the parameters. The proof of Theorem D.1 simply follows Theorem 1 to Corollary 1 in Mahapatra & Rajan (2020). Note that PAIR-o provides a general framework to find a better OOD solution that properly trades off the ERM and OOD objectives. In experiments, we find that using a slightly modified variant of the EPO solver (Mahapatra & Rajan, 2020) in PAIR-o can effectively find a descent path under the gradient conflicts that leads to a better OOD solution. Nevertheless, more sophisticated preference-aware MOO solvers can be developed and integrated into the framework of PAIR-o, which we believe is a promising future direction (Zhao & Zhang, 2015; Zhou et al., 2018; 2020).
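The "balance"-phase quantities (the normalized losses L̄, the divergence µ(L), and the adjustment direction g_b) can be sketched in a few lines. This is a hedged illustration following Mahapatra & Rajan (2020), not the authors' released code, and all names are ours; it assumes strictly positive losses and preferences:

```python
# Sketch of the preference-divergence quantities used in the "balance" phase:
#   L_bar_i = p_i L_i / sum_j p_j L_j            (normalized losses)
#   mu(L)   = KL(L_bar || uniform)               (divergence from preference)
#   g_b     = p * (log((m+1) L_bar) - mu(L))     (adjustment direction)
import math

def balance_direction(losses, pref):
    m1 = len(losses)  # m + 1 objectives, including ERM
    wsum = sum(p * l for p, l in zip(pref, losses))
    lbar = [p * l / wsum for p, l in zip(pref, losses)]
    mu = sum(lb * math.log(m1 * lb) for lb in lbar)
    gb = [p * (math.log(m1 * lb) - mu) for p, lb in zip(pref, lbar)]
    return lbar, mu, gb

# A solution that exactly matches the preference (p_i L_i all equal):
# mu = 0 and the adjustment direction vanishes.
lbar, mu, gb = balance_direction([0.2, 0.1], [5.0, 10.0])

# An imbalanced solution: mu > 0, and the objective that diverges most from
# the preference (largest p_i L_i) receives a positive entry in g_b.
lbar2, mu2, gb2 = balance_direction([1.0, 1.0], [3.0, 1.0])
```

A full step would additionally form the index sets J, J̄, J* from G^T g_b and solve the LP of Eq. 17 (e.g., with an off-the-shelf LP solver) for the loss weights β*.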

D.2 DETAILED DESCRIPTION OF PAIR-S FOR OOD MODEL SELECTION

In this section, we provide a detailed description of PAIR-s for OOD model selection, complementing Sec. 4.1. Before starting, we also provide a detailed description of the critical reasons for designing PAIR-s in Appendix G.1. From the IRM example, it is obvious that traditional model selection methods that merely use validation performance, i.e., ERM performance, are not suitable for selecting a desired solution for OOD generalization; otherwise, the OOD performance can easily be compromised due to its conflicts with the ERM objective. This issue is more serious when the validation set has a large gap from the test set (cf. training-domain validation set selection for COLOREDMNIST in Table 3). Intuitively, models selected merely based on ERM performance tend to have a high preference for, or better performance on, environments whose distribution is similar to that of the corresponding validation set, which leads to a higher variance of performances across environments or a lower worst-environment performance. Therefore, it is natural to jointly consider the ERM and OOD performances in model selection. Specifically, the selected model is expected to maximally satisfy the exact Pareto optimality. Since the focus of PAIR-s is mainly to validate the existence of the aforementioned model selection issues, we simply incorporate the PAIR score as an additional model selection criterion. More specifically, given an OOD preference p_ood, we calculate the PAIR selection score as

s_PAIR = L^T p̄_ood, (20)

where p̄_ood is the normalized OOD preference, i.e., p_ood / Σ_{i=1}^{m+1} p_ood,i. With the PAIR score, we can then apply it to the DOMAINBED model selection algorithms (Gulrajani & Lopez-Paz, 2021). Specifically, model selection in DOMAINBED aims to select models from several rigorous hyperparameter trials according to the validation accuracy. For the model selection in each run, one can obtain all training-domain validation accuracies but only one test-domain validation accuracy, for fairness.
The algorithm is detailed in Algorithm 2. The PAIR score is mainly used to select models among the logged steps within one run. To avoid trivial cases, we expect the models participating in the selection to have converged; to this end, we heuristically use a threshold c to filter out the first c steps and find it empirically effective. To select models from different runs, we first use the validation accuracy to filter out unreliable cases, and then adopt the PAIR score to finalize the model selection. The only exception is the test-domain validation accuracy, which is more likely to be a reliable indicator than the PAIR score. The main limitation of PAIR-s lies in the estimation of the loss values. In stochastic gradient descent, one can only obtain a stochastic estimate of the loss values based on a minibatch sample of D_tr. When the stochastic estimates of the loss values are unbiased, the PAIR score is unbiased, too. However, there can be a certain variance in the stochastic estimates, which can severely affect the precision of the score and thus the comparison of different models. Although Theorem E.1 establishes certain theoretical guarantees that allow for some degree of uncertainty, the variance is usually unavoidable. An instant fix for the issue is to afford some additional evaluation time to obtain a better estimate of the loss values. Besides, one could also jointly consider the uncertainty of the estimation and derive a more accurate model selection (Wald et al., 2021), which we leave for future work.
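The PAIR score (Eq. 20) and the within-run step selection of Algorithm 2 can be sketched as follows; this is a hedged illustration with our own names (`logged_losses`, `c`), following the arg-max selection stated in Algorithm 2, not the authors' code:

```python
# Sketch of the PAIR selection score s_PAIR = L^T p_bar_ood and the
# within-run step selection: drop the first c (non-converged) logged steps,
# then pick the step with the maximum PAIR score, as in Algorithm 2.
def pair_score(losses, pref):
    z = sum(pref)
    pbar = [p / z for p in pref]  # preference normalized to sum to one
    return sum(l * pb for l, pb in zip(losses, pbar))

def select_step(logged_losses, pref, c):
    """logged_losses[t] holds the loss vector logged at step t."""
    scores = [pair_score(l, pref) for l in logged_losses]
    candidates = range(c, len(scores))
    return max(candidates, key=lambda t: scores[t])

# e.g., three logged steps, uniform preference, filtering out step 0
chosen = select_step([[10.0, 10.0], [1.0, 1.0], [5.0, 5.0]], [1.0, 1.0], 1)
```

Selecting across runs would then combine these per-run winners with the validation-accuracy filtering described above.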

D.3 DISCUSSION ON THE PRACTICAL CHOICES OF OOD PREFERENCE

Essentially, the performances of both PAIR-o and PAIR-s depend to a certain extent on the quality of the OOD preference p_ood; however, the ideal OOD preference is usually unknown. It is therefore desirable to analyze the performances of PAIR-o and PAIR-s under an imprecise OOD preference. Mahapatra & Rajan (2020) briefly discuss that, when the exact Pareto optimal solution under the preference does not exist, the EPO solver can still find a Pareto optimal solution that is closest to the preferred direction. We discuss this in a more general way by developing a new MOO formulation of Eq. 16 under an approximated preference, up to some approximation error ϵ. The theoretical discussion can be found in Sec. E.2. In this section, we focus on the practical side of the choice of p_ood. We first discuss some heuristics that can be leveraged to obtain a proper OOD preference under two scenarios: (i) one has little-to-no knowledge about the OOD loss values; (ii) one has access to some running histories, i.e., some empirical knowledge about the OOD loss values.

For reference, the per-run selection steps of Algorithm 2 are:
3: Calculate the PAIR scores using p_ood for all T steps as S = {s_t}_{t=1}^T using Eq. 20;
4: Filter out the first c steps to avoid trivial cases and get S = {s_t}_{t=c}^T;
5: Store the step with the maximum PAIR score as s* = arg max_t S;
6: end for
7: Obtain the selected steps from the R runs as S = {s_r*}_{r=1}^R;
8: Obtain the validation accuracies for all selected steps A_val = {A_r}_{r=1}^R;

In practice, (i) mostly fits PAIR-o while (ii) mostly fits PAIR-s. When (i) one has little-to-no knowledge about the OOD loss values, one can leverage certain theoretical inductive biases about the OOD losses. In fact, it is usually the case that the theoretical conditions for the optimality of OOD objectives do not hold in practice (Ganin et al., 2016; Sagawa* et al., 2020; Krueger et al., 2021; Shi et al., 2022; Rame et al., 2021).
In this case, minimizing the OOD losses acts more like a necessary condition for a satisfactory OOD solution. Therefore, one could assign a sufficiently larger preference to the OOD objectives than to the ERM objective. For example, throughout all experiments in the paper, we mostly assign (1, 1e10, 1e12) to the ERM, IRMv1, and VREx losses, which works under many scenarios. Besides, among different OOD objectives, one can often tell which is easier to optimize than another. Therefore, to ensure all OOD losses are equally maximally optimized, we can assign the easily optimizable OOD objectives a higher preference. For example, in IRMX, VREx tends to be easier to optimize than IRMv1, so we assign a higher preference to VREx. Moreover, if one knows the performances of the individual OOD objectives, it is natural to assign a higher preference to those that perform better on their own. When (ii) one has access to some running histories, i.e., some empirical knowledge about the OOD loss values, one can obtain an empirical estimate of the OOD loss values w.r.t. the ERM loss values at convergence. Since the estimate is obtained under gradient conflicts, one can expect the ratios of the OOD losses w.r.t. the ERM loss to be higher once the gradient conflicts are properly resolved. Therefore, one can assign a slightly higher preference to the OOD losses than the empirically estimated ratios. In the model selection experiments, we directly increase the ratio by 1e2 and find it works well as expected. In fact, both (i) and (ii) are discussed under minimal assumptions about the external knowledge of the optimization process, the task, and the data. We expect that a better estimate of the OOD preference can be obtained when more external inductive biases are incorporated. For instance, PAIR-o generalizes to ParetoDA (Lv et al., 2021) when one can obtain a validation set that has a distribution similar to the test data.
Even when such data are not available, one could also adopt techniques such as Mixup (Zhang et al., 2018) to obtain an approximation. We believe that obtaining a better estimate of the ideal OOD preference would be a promising future development based on our work. Similar to other MOO algorithms (Sener & Koltun, 2018; Lin et al., 2019; Mahapatra & Rajan, 2020), PAIR-o requires the full gradients of the predictor to make an accurate derivation of the objective weights β*, which could be a bottleneck when deployed to large-scale networks, as they usually involve a prohibitively massive number of parameters. Sener & Koltun (2018) develop an approximation of the full gradients using the gradients w.r.t. the latent representation produced by the featurizer, i.e., ∂L/∂φ(X). However, it requires a strong assumption on the structure of the data and the model. Moreover, for complex network architectures such as DenseNet (Huang et al., 2017) or DistilBERT (Sanh et al., 2019) in WILDS, the approximation, and even the full gradients, can be imprecise, as the gradients of complex neural networks cannot be directly concatenated like those of simple linear networks. To this end, we develop another approximation that takes only the gradients w.r.t. the classifier, which usually appears as a linear classification layer in the predictor. Interestingly, we empirically find that ∂L/∂w can produce even more useful signals for OOD generalization than the gradients w.r.t. the featurizer representation, as shown in Table 1. In more resource-restricted scenarios, such as iWildCam and RxRx1 in WILDS, we freeze the featurizer after the "descent" phase, which further resolves the memory and computation overheads. This also aligns with recent discoveries that a featurizer trained merely with ERM may already discover all useful patterns (Rosenfeld et al., 2022). Zhang et al. (2022a) also find this technique useful on the Camelyon17 dataset of WILDS.
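The classifier-only gradient approximation can be illustrated with a minimal NumPy sketch. It assumes a squared loss on fixed featurizer outputs, and `classifier_gradients` is our illustrative name, not PAIR's released implementation:

```python
import numpy as np

def classifier_gradients(z_envs, y_envs, w):
    """Per-environment loss gradients taken only w.r.t. the final linear
    classifier w: the featurizer parameters are ignored, so the gradient
    matrix handed to the MOO solver has one row of size len(w) per
    objective, instead of one row per full-network parameter vector.
    A squared loss on fixed features z is assumed for concreteness."""
    grads = []
    for z, y in zip(z_envs, y_envs):
        residual = z @ w - y
        # d/dw of 0.5 * mean((z @ w - y)^2)
        grads.append(z.T @ residual / len(y))
    return np.stack(grads)  # shape: (num_objectives, classifier_dim)

rng = np.random.default_rng(0)
z_envs = [rng.normal(size=(32, 8)) for _ in range(2)]  # featurizer outputs
y_envs = [rng.normal(size=32) for _ in range(2)]
G = classifier_gradients(z_envs, y_envs, np.zeros(8))
```

The resulting gradient matrix scales with the classifier dimension only, which is the source of the memory and computation savings discussed above.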

D.4.2 LOSS VALUE ESTIMATION

Similar to other MOO algorithms (Sener & Koltun, 2018; Lin et al., 2019; Mahapatra & Rajan, 2020), PAIR-o is described and analyzed in the full-batch setting, i.e., with full gradient descent. In practice, however, the stochastic setting appears more often than vanilla gradient descent due to scalability considerations. As also discussed in Sec. 4.1, variance is unavoidable regardless of whether the estimated values are biased or unbiased. Fortunately, the robustness of PAIR-o to the preference can partially mitigate the issue. Another potential limitation of PAIR-o is the possibly negative estimates of some OOD losses, such as the stochastic estimates of IRMv1, since general MOO algorithms, PAIR-o included, only accept non-negative loss values as inputs. To this end, we use IRMv1 as an example to explain how one could handle the potentially negative values in loss value estimation. We first introduce the unbiased empirical estimator of IRMv1, following Arjovsky et al. (2019); Ahuja et al. (2021b). More specifically, consider the IRMv1 objective, whose simplification is derived by taking the derivative inside the expectation using the Leibniz integral rule. Obviously, the naive stochastic estimate of Eq. 22 is biased, as it squares a minibatch mean of the gradient. To obtain an unbiased estimate of the IRMv1 penalty, observe that E[X]^2 = E[AB] if A and B are independent unbiased estimates of E[X]; since the gradient E_e[∂ℓ(w∘φ(X_e), Y_e)/∂w |_{w=1.0}] can be estimated separately on disjoint minibatches without bias, Eq. 23 essentially provides a practical unbiased estimator of IRMv1. However, different from IRMv1, Eq. 23 does not have any guarantee of non-negativity, though the expectation of Eq. 23 is non-negative. To this end, we propose two heuristics to mitigate the issue. The first heuristic is to add a constant C to all minibatch estimates of E_e ∂ℓ(w∘φ(X_e), Y_e)/∂w |_{w=1.0}. Moreover, as the constant does not affect the calculation of the gradients, when IRMv1 is minimized to 0, the shifted estimate is also optimized to C.
The other heuristic is to multiply the negative minibatch estimates of E_e ∂ℓ(w∘φ(X_e), Y_e)/∂w |_{w=1.0} by a proper negative constant -C, which makes all estimates non-negative. On the other hand, however, this can dramatically affect the variance of the estimates. Essentially, the multiplication enlarges the expectation of the estimated IRMv1 penalty and may cause instability in training, due to the unrobustness of IRMv1. Therefore, we heuristically search the value of C from 1 to 1e-4 by observing the early training dynamics: if the training is unstable, we heuristically shrink C by a factor of 1e-2. Although neither of the heuristics above can rigorously recover a non-negative estimate of the IRMv1 penalty (which is essentially impossible for formulations like IRMv1), we empirically find them effective, which we hypothesize is due to the robustness of PAIR-o to the preference in OOD generalization.
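The batch-splitting estimator and the first (constant-shift) heuristic can be sketched as follows; the scalar Gaussian "gradients" are a toy stand-in for the per-sample gradients appearing in the IRMv1 penalty:

```python
import numpy as np

def split_penalty_estimate(grad_samples):
    """Unbiased minibatch estimate of ||E[g]||^2 using E[X]^2 = E[A B]:
    split the per-sample gradients into two independent halves and take
    the inner product of the two half-means. Unlike squaring a single
    minibatch mean, this is unbiased -- but it can be negative, which
    the two heuristics in the text then have to handle."""
    half = len(grad_samples) // 2
    a = np.mean(grad_samples[:half], axis=0)
    b = np.mean(grad_samples[half:], axis=0)
    return float(np.dot(a, b))

def shift_nonnegative(estimate, c=1.0):
    """First heuristic: add a constant C to every minibatch estimate.
    The constant does not change parameter gradients, and the shifted
    penalty is optimized to C when the true penalty reaches 0."""
    return estimate + c

# Toy scalar "gradient" samples with mean 2.0, so the true penalty is 4.0.
rng = np.random.default_rng(0)
batches = rng.normal(loc=2.0, scale=1.0, size=(50000, 4))

# Averaged over many minibatches of size 4, the split estimator centers
# on 4.0, while squaring the minibatch mean is biased up by var/4 = 0.25.
split_avg = float(np.mean(batches[:, :2].mean(1) * batches[:, 2:].mean(1)))
naive_avg = float(np.mean(batches.mean(1) ** 2))
```

The second heuristic (rescaling negative estimates by -C) would replace `shift_nonnegative` with a sign-dependent multiplication, at the cost of the variance blow-up discussed above.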

D.4.3 GENERALIZING TO OTHER OOD METHODS

As shown in Fig. 1(b), gradient conflicts between the ERM and OOD objectives generally exist (Arjovsky et al., 2019; Krueger et al., 2021; Wald et al., 2021; Pezeshki et al., 2021; Rame et al., 2021). This implies that, on the one hand, the optimization dilemma generally exists for all OOD objectives; on the other hand, both PAIR-o and PAIR-s are generically applicable to all OOD methods. In the experiments (Sec. 5), we validate the generality of PAIR-s only for several OOD methods from the four main lines discussed in the related works (Sec. B.1); although PAIR-o essentially has a similar generality to PAIR-s, we leave the verification of its performance on real-world datasets for future work due to limited computational resources. Nevertheless, we can theoretically discuss the implementation options for applying PAIR-o to different OOD methods. First, for Domain Generalization based methods (Ganin et al., 2016; Sun & Saenko, 2016; Li et al., 2018; Dou et al., 2019), such as DANN (Ganin et al., 2016), PAIR-o can directly take the domain classification loss and the label classification loss as inputs. Second, for Distributionally Robust Optimization methods (Namkoong & Duchi, 2016; Hu et al., 2018; Sagawa* et al., 2020), PAIR-o can take the worst group loss, or some more sophisticated regularizations, together with the ERM loss as inputs. Third, the causal invariance based methods (Peters et al., 2016; Rojas-Carulla et al., 2018; Arjovsky et al., 2019; Creager et al., 2021; Parascandolo et al., 2021; Wald et al., 2021; Ahuja et al., 2021a; Chen et al., 2022b) and the agreement based methods (Koyama & Yamaguchi, 2020; Krueger et al., 2021; Shi et al., 2022; Rame et al., 2021) can be handled by PAIR-o similarly to IRMX.

E THEORETICAL DISCUSSIONS

E.1 PROOF FOR PROPOSITION 1

We first restate the proposition with the formally defined Setting A of Kamath et al. (2021).

Setting A (identical to Kamath et al. (2021)): Consider the task of linear classification/regression X → Y, where the quality of predictors f : X → Ŷ is measured by a population loss ℓ : Ŷ × Y → R_{≥0}, with Ŷ = R and Y ⊆ R; ℓ is either the square loss ℓ_sq(ŷ, y) := (1/2)(ŷ - y)^2, or the logistic loss ℓ_log(ŷ, y) := log(1 + exp(-ŷy)) when Y = {-1, 1} (binary classification).

Proposition 2. Under Setting A (Kamath et al. (2021)), for all α ∈ (0, 1), let E := {(α, β_e) : β_e ∈ (0, 1)} be any instance of the two-bit environment (Eq. 13) and let I_X denote the invariant predictors produced by Eq. 7. It holds that I_X(E) ∩ I_S(E) = I(E).

Our proof proceeds by first characterizing the set of invariant predictors elicited by an ideal V-REx (Krueger et al., 2021) objective, I_X(E) (in a more general way), and then intersecting I_X(E) with the set elicited by IRM_S or IRMv1 (Arjovsky et al., 2019), I_S(E), for the two-bit failure case (Eq. 13). We first discuss the invariant predictors produced by the invariance constraints ideally elicited by V-REx. Recall that V-REx (Krueger et al., 2021) aims to minimize the variance of the ERM losses across environments: L_VREx := var({L_e}_{e∈E_tr}). Therefore, when L_VREx is minimized, we have L_{e_1} = L_{e_2}, ∀e_1, e_2 ∈ E_tr. We can thus define the invariant predictors produced by V-REx as follows.

VREx_0: Define I_X(E) := {f : X → Ŷ | L_{e_1}(f) = L_{e_2}(f), ∀e_1, e_2 ∈ E}. VREx_0 is the objective min_{f ∈ I_X(E_tr)} Σ_{e∈E_tr} L_e(f).

Then, we characterize the set I_X through the following lemma.

Lemma 1. Under Setting A, let f = w ∘ φ be the predictor elicited by I(E) and (X_e, Y_e) ∼ D_e. If either (a) ℓ = ℓ_sq, E_{D_e}[Y_e^2] is identical across e, and the distribution of φ(X_e) is identical across e (or f ≡ 0); or (b) ℓ = ℓ_log and H(Y_e | φ(X_e)) is identical for all e ∈ E, then I(E) ⊆ I_X(E).

Proof.
For any f = w ∘ φ ∈ I(E), using Observation 2 in Kamath et al. (2021), we have that

E_{D_{e_1}}[Y | φ(X) = z] = E_{D_{e_2}}[Y | φ(X) = z], for all e_1, e_2 ∈ E and all z ∈ Z. (24)

(i) For the square loss ℓ_sq,

L_e(f) = (1/2) E_{D_e}[(f(X) - Y)^2] = (1/2) E_{D_e}[f(X)^2 - 2f(X)Y + Y^2] = (1/2) E_{D_e}[ E_{D_e}[w ∘ φ(X)^2 - 2 w ∘ φ(X) Y | φ(X)] ] + (1/2) E_{D_e}[Y^2],

where w is the simultaneously optimal classifier for all e ∈ E. Then, note that for all z ∈ Z, it holds that E_{D_e}[w(z)^2 - 2w(z)Y | φ(X) = z] = w(z)^2 - 2w(z) E_{D_e}[Y | φ(X) = z]. Using equation 24 and the assumptions that E_{D_e}[Y^2] is identical and the distribution of φ(X) is identical (or f ≡ 0) for all e ∈ E, we can conclude that L_{e_1}(f) = L_{e_2}(f) for all e_1, e_2 ∈ E.

(ii) For the logistic loss ℓ_log, note that the simultaneously optimal w has the form

w(z) = log( Pr_{D_e}[Y = 1 | φ(X) = z] / Pr_{D_e}[Y = -1 | φ(X) = z] ) = log( (1 + E_{D_e}[Y | φ(X) = z]) / (1 - E_{D_e}[Y | φ(X) = z]) ),

for all e ∈ E and all z ∈ Z. We can thus conclude that in this case L_e(f) = E_{D_e}[H(Y | φ(X) = z)] = H(Y | φ(X)), which is identical across e ∈ E by assumption. This completes the proof.

Remarks. We formulate Lemma 1 in a general setting that covers Two-Bit-Env as a special case. It can be easily verified that the assumptions in this lemma are all satisfied in Two-Bit-Env (Eq. 13). Moreover, we can show that other environment settings (e.g., those in IB-IRM (Ahuja et al., 2021a)) also satisfy the assumptions.

Proposition 3. Under Setting A, for all α ∈ (0, 1), let E := {(α, β_e) : β_e ∈ (0, 1)} and let f be an odd (or linear) predictor. It holds that I_X(E) ∩ I_S(E) = I(E).

Proof. From the proof of Proposition 5 in Kamath et al. (2021), we know that there are only two predictors in I(E): the zero predictor f_0 ≡ 0 (for both ℓ_sq and ℓ_log), and f_IRM(x_1, x_2) = (1 - 2α) · x_1 (for ℓ = ℓ_sq) or f_IRM(x_1, x_2) = log((1 - α)/α) · x_1 (for ℓ = ℓ_log).

(i) For the square loss ℓ_sq, L_e(f) = (1/2) E_{D_e}[f(X)^2 - 2f(X)Y + Y^2]. Note that in Two-Bit-Env, Y^2 ≡ 1.
Thus, in this case, f ∈ I_X(E) implies that E_{D_e}[f(X)^2 - 2f(X)Y] is identical for all e ∈ E. Moreover, f ∈ I_S(E) ⇒ ∇_{w|w=1} L_e(f) = 0 for all e ∈ E ⇒ E_{D_e}[f(X)^2] = E_{D_e}[f(X)Y] for all e ∈ E. We can conclude that for any f ∈ I_X(E) ∩ I_S(E), both E_{D_e}[f(X)^2] and E_{D_e}[f(X)Y] are identical for all e ∈ E, and E_{D_e}[f(X)^2] = E_{D_e}[f(X)Y] for all e ∈ E. (25)

Denote f_{(1,1)} := f(X_1 = 1, X_2 = 1), and define f_{(1,-1)}, f_{(-1,1)}, f_{(-1,-1)} similarly. For condition 25,

E_{D_e}[f(X)^2] = ((1-α)/2)(f_{(1,1)}^2 + f_{(-1,-1)}^2) + (α/2)(f_{(1,-1)}^2 + f_{(-1,1)}^2) + (β_e(1-2α)/2)(-f_{(1,1)}^2 - f_{(-1,-1)}^2 + f_{(1,-1)}^2 + f_{(-1,1)}^2),

E_{D_e}[f(X)Y] = ((1-α)/2)(f_{(1,1)} - f_{(-1,-1)}) + (α/2)(f_{(-1,1)} - f_{(1,-1)}) - (β_e/2)((f_{(1,1)} - f_{(-1,-1)}) + (f_{(-1,1)} - f_{(1,-1)})). (26)

To enforce condition 25 for any α, β_e ∈ (0, 1), it is required that

f_{(1,1)} - f_{(-1,-1)} + f_{(-1,1)} - f_{(1,-1)} = 0 and -f_{(1,1)}^2 - f_{(-1,-1)}^2 + f_{(1,-1)}^2 + f_{(-1,1)}^2 = 0,

⇒ f_{(1,1)} - f_{(-1,-1)} = -(f_{(-1,1)} - f_{(1,-1)}) and f_{(1,1)}^2 + f_{(-1,-1)}^2 = f_{(1,-1)}^2 + f_{(-1,1)}^2.

In this case, condition 26 implies that f_{(1,1)}^2 + f_{(-1,-1)}^2 = (1 - 2α)(f_{(1,1)} - f_{(-1,-1)}). Without restricting f to be an odd (or equivalently, linear) predictor, this constraint is a circle passing through f_0 and f_IRM. Requiring that f is odd, i.e., f_{(1,1)} = -f_{(-1,-1)} and f_{(1,-1)} = -f_{(-1,1)}, we can conclude that there are only two predictors left in I_X(E) ∩ I_S(E):

f_{(1,1)} = f_{(-1,-1)} = f_{(1,-1)} = f_{(-1,1)} = 0, and (f_{(1,1)} = 1 - 2α, f_{(-1,-1)} = 2α - 1, f_{(1,-1)} = 1 - 2α, f_{(-1,1)} = 2α - 1) ⇒ f(x_1, x_2) = (1 - 2α) · x_1.

(ii) For the logistic loss ℓ_log, L_e(f) = E_{D_e}[log(1 + exp(-f(X)Y))]. Similarly, f ∈ I_X(E) ∩ I_S(E) implies that

E_{D_e}[log(1 + exp(-f(X)Y))] is identical for all e ∈ E, (28)

E_{D_e}[ -f(X)Y / (1 + exp(f(X)Y)) ] = 0. (29)
From condition 28 and the fact that f is an odd predictor (f_{(1,1)} = -f_{(-1,-1)} and f_{(1,-1)} = -f_{(-1,1)}), we can conclude that

(1 + e^{f_{(1,1)}})^{2α} (1 + e^{-f_{(1,1)}})^{2-2α} = (1 + e^{f_{(1,-1)}})^{2α} (1 + e^{-f_{(1,-1)}})^{2-2α} ⇒ f_{(1,1)} = f_{(1,-1)},

since x ↦ (1 + e^x)^{2α}(1 + e^{-x})^{2-2α} is a one-to-one function. In this case, condition 29 can be simplified as

f_{(1,1)} (α e^{f_{(1,1)}} - (1 - α)) = 0 ⇒ f_{(1,1)} = 0 or f_{(1,1)} = log((1 - α)/α).

Thus, the only predictors in I_X(E) ∩ I_S(E) are f_0 and f_IRM.

Corollary 1. Under Setting A, for all α ∈ (0, 1) and E_tr = {(α, β_{e_1}), (α, β_{e_2})} for any two distinct β_{e_1}, β_{e_2} ∈ (0, 1), I_X(E_tr) ∩ I_S(E_tr) = I_X(E) ∩ I_S(E).

Proof. This directly follows from the observation that, in the proof of Proposition 3, enforcing conditions 25 and 28 for two distinct β_{e_1}, β_{e_2} imposes identical constraints on f.
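As a numeric sanity check on the square-loss analysis above, an exact enumeration over the eight outcomes of the two-bit environment (assuming the standard construction: Y uniform over {-1, +1}; X_1 flips Y with probability α; X_2 flips Y with probability β_e, independently, which we take to match Eq. 13) confirms that f_IRM attains identical square loss across environments, i.e., f_IRM ∈ I_X(E), while a spurious predictor does not:

```python
from itertools import product

def env_loss(f, alpha, beta):
    """Exact population square loss (1/2) E[(f(X1, X2) - Y)^2] under the
    assumed two-bit construction: Y uniform over {-1, +1}; X1 flips Y
    with probability alpha; X2 flips Y with probability beta."""
    loss = 0.0
    for y, flip1, flip2 in product([-1, 1], [0, 1], [0, 1]):
        prob = 0.5
        prob *= alpha if flip1 else (1 - alpha)
        prob *= beta if flip2 else (1 - beta)
        x1 = -y if flip1 else y
        x2 = -y if flip2 else y
        loss += prob * 0.5 * (f(x1, x2) - y) ** 2
    return loss

alpha = 0.25
f_irm = lambda x1, x2: (1 - 2 * alpha) * x1  # invariant predictor
f_spu = lambda x1, x2: x2                    # spurious predictor

# f_irm attains identical loss in environments with different beta_e,
# while the spurious predictor's loss (= 2 * beta_e) varies with beta_e.
l1, l2 = env_loss(f_irm, alpha, 0.1), env_loss(f_irm, alpha, 0.2)
s1, s2 = env_loss(f_spu, alpha, 0.1), env_loss(f_spu, alpha, 0.2)
```

The enumeration also matches the closed form: for f_IRM the loss is (1 - (1 - 2α)^2)/2 in every environment.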

E.2 PROOF FOR THEOREM 4.1

We first restate the informal version of the theorem as follows; the formal description will be given in Theorem E.4 with the necessary formal definitions.

Theorem E.1. (Informal) For γ ∈ (0, 1) and any ϵ, δ > 0, if F is a finite hypothesis class and both the ERM and OOD losses are bounded above, let I_PAIR be the index set of all losses, p_max := max_{i∈I_PAIR} p_i and L_max := max_{i∈I_PAIR} L_i. If the number of training samples |D| ≥ (32 L_max^2 p_max^2 / δ^2) log(2(m+1)|F| / γ), then with probability at least 1 - γ, PAIR-o and PAIR-s yield an ϵ-approximated solution of f_ood.

The proof of Theorem 4.1 is also a theoretical discussion of the performances of PAIR-o and PAIR-s under an approximated OOD preference. Essentially, the performances of both PAIR-o and PAIR-s have a certain dependence on the quality of the OOD preference p_ood; however, the ideal OOD preference is often unknown. It is desirable to analyze the performances of PAIR-o and PAIR-s under an imprecise OOD preference. Mahapatra & Rajan (2020) briefly discussed that, when the exact Pareto optimal solution under the preference does not exist, the EPO solver can still find a Pareto optimal solution that is closest to the preferred direction. We discuss it in a more general way by developing a new MOO formulation of Eq. 16 under an approximated preference, up to some approximation error ϵ. Without loss of generality, given an OOD preference p_ood = (p_ERM, p_1, ..., p_m)^T = (1/ϵ_inv, 1/ϵ_ood)^T, the ERM loss L_ERM, and m OOD losses L_ood = (L_ood^1, L_ood^2, ..., L_ood^m)^T, Eq. 16 can be reformulated as

f_PAIR := argmin_{f∈F} L_ERM(f) s.t. p_ERM L_ERM(f) = p_1 L_ood^1(f) = p_2 L_ood^2(f) = ... = p_m L_ood^m(f). (30)

We remark that, under the ideal OOD preference, the optimal solution of Eq. 30 is also the optimal solution to Eq. 16 (i.e., the unconstrained version); in other words, f_PAIR = f_ood.
We will use f_PAIR to differentiate it from the solution to the unconstrained version. We focus on Eq. 30 because it is more convenient for establishing the discussion on the approximated OOD preference from the perspective of optimization constraints. Exactly enforcing the above preference constraint is too restrictive both practically and theoretically; instead, we incorporate the approximation by relaxing the constraint on the loss values w.r.t. the OOD preference. The ϵ-approximated problem of Eq. 30 is the following:

f_PAIR^ϵ := argmin_{f∈F} L_ERM(f) s.t. ∀i, j ∈ I_PAIR, i ≠ j, |p_i L_i(f) - p_j L_j(f)| ≤ ϵ, (31)

where I_PAIR := {ERM, ood_1, ood_2, ..., ood_m} is the index set of all losses. We denote the relaxed constraint set in Eq. 31 as P_PAIR^ϵ := {f | ∀i, j ∈ I_PAIR, i ≠ j, |p_i L_i(f) - p_j L_j(f)| ≤ ϵ}. Clearly, the solution sets satisfy f_PAIR^0 = f_PAIR. Then we define the empirical version of the ϵ-approximated problem Eq. 31 with the preference vector p_ood as follows:

f̂_PAIR^ϵ := argmin_{f∈F} L̂_ERM(f) s.t. ∀i, j ∈ I_PAIR, i ≠ j, |p_i L̂_i(f) - p_j L̂_j(f)| ≤ ϵ. (32)

Similarly, we denote the above constraint set as P̂_PAIR^ϵ := {f | ∀i, j ∈ I_PAIR, i ≠ j, |p_i L̂_i(f) - p_j L̂_j(f)| ≤ ϵ}. Assume a finite hypothesis class F and define

δ := min_{f∈F, i,j∈I_PAIR, i≠j} | |p_i L_i(f) - p_j L_j(f)| - ϵ |.

Definition E.2 (ν-representative). A training set S is ν-representative (w.r.t. a domain X, a hypothesis class F, a distribution D, and a loss ℓ) if ∀f ∈ F, |L̂(f) - L(f)| ≤ ν, where L(f) := E_{(X,Y)∼D}[ℓ(f(X), Y)] and L̂(f) := (1/|S|) Σ_{(X_i,Y_i)∈S} ℓ(f(X_i), Y_i).

Equipped with this definition, we can now characterize the condition under which the constraint sets in equation 31 and equation 32 contain exactly the same predictors.

Lemma 2. For any ϵ > 0, assuming δ > 0 and denoting p_max := max_{i∈I_PAIR} p_i, if the training set D_tr is δ/(4p_max)-representative w.r.t. the domain X, hypothesis class F, distribution D, and all the ERM and OOD losses {L_ERM, L_ood}, then P_PAIR^ϵ = P̂_PAIR^ϵ.

Proof. We first show that P_PAIR^ϵ ⊆ P̂_PAIR^ϵ.
By the definition of δ, for all f ∈ F and all i, j ∈ I_PAIR, i ≠ j, we have either

|p_i L_i(f) - p_j L_j(f)| ≤ ϵ - δ or |p_i L_i(f) - p_j L_j(f)| ≥ ϵ + δ. (33)

Using this property, for any f ∈ P_PAIR^ϵ, we can conclude that

∀i, j ∈ I_PAIR, i ≠ j, |p_i L_i(f) - p_j L_j(f)| ≤ ϵ ⇒ |p_i L_i(f) - p_j L_j(f)| ≤ ϵ - δ.

This inequality further implies that

|p_i L̂_i(f) - p_j L̂_j(f)| = |p_i L_i(f) - p_j L_j(f) + p_i L̂_i(f) - p_i L_i(f) + p_j L_j(f) - p_j L̂_j(f)| ≤ |p_i L_i(f) - p_j L_j(f)| + p_i |L̂_i(f) - L_i(f)| + p_j |L̂_j(f) - L_j(f)|,

which is based on the triangle inequality of the absolute value function. From the definition of δ/(4p_max)-representativeness, we have |L̂_i(f) - L_i(f)| ≤ δ/(4p_max), ∀i ∈ I_PAIR. Substituting this into the above inequality, we obtain

|p_i L̂_i(f) - p_j L̂_j(f)| ≤ ϵ - δ + p_i δ/(4p_max) + p_j δ/(4p_max) ≤ ϵ - δ/2,

which implies that f ∈ P̂_PAIR^ϵ. Then, we prove that P̂_PAIR^ϵ ⊆ P_PAIR^ϵ. For any f ∈ P̂_PAIR^ϵ, it holds that ∀i, j ∈ I_PAIR, i ≠ j,

|p_i L̂_i(f) - p_j L̂_j(f)| ≤ ϵ ⇒ |p_i L_i(f) - p_j L_j(f)| ≤ ϵ + p_i |L̂_i(f) - L_i(f)| + p_j |L̂_j(f) - L_j(f)| ≤ ϵ + p_i δ/(4p_max) + p_j δ/(4p_max) ≤ ϵ + δ/2,

which is again based on the triangle inequality of the absolute value function and the definition of δ/(4p_max)-representativeness. Together with equation 33, we conclude that |p_i L_i(f) - p_j L_j(f)| ≤ ϵ - δ, hence f ∈ P_PAIR^ϵ, which implies P̂_PAIR^ϵ ⊆ P_PAIR^ϵ.
Based on the above discussion, we have proven that P_PAIR^ϵ = P̂_PAIR^ϵ.

Assumption E.3. For all f ∈ F, X ∈ X, Y ∈ Y, the ERM loss is bounded, i.e., |ℓ(f(X), Y)| ≤ L_ERM < ∞, and all the OOD objectives L_ood can be written as expectations of bounded loss functions, i.e., ∀i ∈ [m], L_ood^i(f) = E_{(X,Y)∼D}[ℓ_ood^i(f(X), Y)] and |ℓ_ood^i(f(X), Y)| ≤ L_ood^i < ∞.

We remark that the assumption is natural and generally holds for many OOD objectives, including IRMv1 (Arjovsky et al., 2019) and VREx (Krueger et al., 2021).

Theorem E.4. For any ϵ > 0, γ ∈ (0, 1), if Assumption E.3 holds and δ > 0, denoting p_max := max_{i∈I_PAIR} p_i and L_max := max_{i∈I_PAIR} L_i, if the number of training samples |D_tr| ≥ (32 L_max^2 p_max^2 / δ^2) log(2(m+1)|F| / γ), then with probability at least 1 - γ, we have for any f_PAIR^ϵ ∈ f_PAIR^ϵ and f̂_PAIR^ϵ ∈ f̂_PAIR^ϵ,

L_ERM(f_PAIR^ϵ) ≤ L_ERM(f̂_PAIR^ϵ) ≤ L_ERM(f_PAIR^ϵ) + δ/(2p_max).

Proof. We proceed by first assuming that the training set D is δ/(4p_max)-representative w.r.t. the domain X, hypothesis class F, distribution D, and all the ERM and OOD losses {L_ERM, L_ood}, and then we establish the sample complexity required for this condition. From Lemma 2, we know that, given this condition and the assumptions of the theorem, P_PAIR^ϵ = P̂_PAIR^ϵ. Then, since the training set D_tr is δ/(4p_max)-representative w.r.t. the ERM loss L_ERM, we have for any f_PAIR^ϵ ∈ f_PAIR^ϵ and f̂_PAIR^ϵ ∈ f̂_PAIR^ϵ,

|L_ERM(f_PAIR^ϵ) - L̂_ERM(f_PAIR^ϵ)| ≤ δ/(4p_max), |L_ERM(f̂_PAIR^ϵ) - L̂_ERM(f̂_PAIR^ϵ)| ≤ δ/(4p_max).

Moreover, based on the optimality of problem equation 32, we can conclude that

L_ERM(f̂_PAIR^ϵ) - δ/(4p_max) ≤ L̂_ERM(f̂_PAIR^ϵ) ≤ L̂_ERM(f_PAIR^ϵ) ≤ L_ERM(f_PAIR^ϵ) + δ/(4p_max) ⇒ L_ERM(f̂_PAIR^ϵ) ≤ L_ERM(f_PAIR^ϵ) + δ/(2p_max).

Then, using the optimality of problem equation 31, it holds that

L_ERM(f_PAIR^ϵ) ≤ L_ERM(f̂_PAIR^ϵ) ≤ L_ERM(f_PAIR^ϵ) + δ/(2p_max).
It remains to analyze the sample complexity ensuring that the training set D_tr is δ/(4p_max)-representative w.r.t. X, F, D, and all the ERM and OOD losses {L_ERM, L_ood}. For any i ∈ I_PAIR, based on Assumption E.3, we can write L_i(f) = E_{(X,Y)∼D}[ℓ_i(f(X), Y)] and L̂_i(f) = (1/|D|) Σ_{(X_j,Y_j)∈D} ℓ_i(f(X_j), Y_j), with |ℓ_i(f(X), Y)| ≤ L_i ≤ L_max for all f, X, Y. Using Hoeffding's inequality, we can conclude that for any f ∈ F,

Pr( |L̂_i(f) - L_i(f)| ≥ δ/(4p_max) ) ≤ 2 exp( -|D| δ^2 / (32 L_max^2 p_max^2) ).

Thus, for any γ ∈ (0, 1), if we require

|D| ≥ (32 L_max^2 p_max^2 / δ^2) log( 2(m+1)|F| / γ ),

it holds that

Pr( ∃f ∈ F, |L̂_i(f) - L_i(f)| ≥ δ/(4p_max) ) ≤ Σ_{f∈F} Pr( |L̂_i(f) - L_i(f)| ≥ δ/(4p_max) ) ≤ γ/(m+1).

Thus,

Pr( ∃i ∈ I_PAIR, ∃f ∈ F, |L̂_i(f) - L_i(f)| ≥ δ/(4p_max) ) ≤ Σ_{i∈I_PAIR} Pr( ∃f ∈ F, |L̂_i(f) - L_i(f)| ≥ δ/(4p_max) ) ≤ γ.

Finally, we can conclude that with probability at least 1 - γ, ∀i ∈ I_PAIR, ∀f ∈ F, |L̂_i(f) - L_i(f)| ≤ δ/(4p_max), which completes the proof.

Remarks. The ϵ-approximated formulation has a close relationship to another relaxation:

f_PAIR := argmin_{f∈F} L_ERM(f) s.t. L_PAIR^i(f) ≤ ϵ_i, ∀i ∈ [m].

Essentially, both the ϵ-approximated formulation and the above formulation are natural relaxations of the original problem (Eq. 30 or Eq. 16). As ϵ_i → ϵ_{ood_i}, the above formulation also yields the optimal solution f_ood. In this work, since we focus on approximations of the preference, the ϵ-approximated formulation provides a convenient handle, which could be of independent interest for future discussions.
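The sample-size requirement of Theorem E.4 is easy to evaluate numerically; the helper name and the toy instance below are ours, not from the paper:

```python
import math

def pair_sample_bound(l_max, p_max, delta, m, f_size, gamma):
    """Sample-size bound of Theorem E.4:
    |D| >= 32 * L_max^2 * p_max^2 / delta^2 * log(2*(m+1)*|F| / gamma)."""
    return (32 * l_max ** 2 * p_max ** 2 / delta ** 2
            * math.log(2 * (m + 1) * f_size / gamma))

# Toy instance: bounded losses (L_max = 1), uniform preference (p_max = 1),
# m = 2 OOD objectives, |F| = 1000 hypotheses, confidence 95%.
n = pair_sample_bound(l_max=1.0, p_max=1.0, delta=0.1, m=2,
                      f_size=1000, gamma=0.05)
```

Note the quadratic dependence on p_max: with the large preference values used in practice, the bound is correspondingly loose, which is consistent with reading the theorem as a qualitative rather than a tight quantitative guarantee.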

F MORE DETAILS ON EXPERIMENTS

In this section, we provide more details about the experiments (Sec. 5) in the main paper.

F.1 MORE DETAILS ON COLOREDMNIST EXPERIMENTS

In the proof-of-concept experiments with COLOREDMNIST, we follow the evaluation settings of IRM (Arjovsky et al., 2019) and the test-domain model selection of DomainBed (Gulrajani & Lopez-Paz, 2021). Specifically, we use a 4-layer MLP with a hidden dimension of 256. By default, we use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-3 and a weight decay of 1e-3 to train the model for 500 epochs, and select the last epoch as the output model for each hyperparameter setting. We choose the final model across different hyperparameter setups as the one that maximizes the accuracy on the validation set, which shares the same distribution as the test domain. We then perform a grid search over the corresponding hyperparameters. For the pretraining epochs, we search from {0, 50, 100, 150, 200, 250}. For the OOD penalty weight, we search from {1e1, 1e2, 1e3, 1e4, 1e5}. We evaluate each hyperparameter configuration 10 times and report the mean and standard deviation of the performances. Besides, for IRMv1, we refresh the state of the Adam optimizer when the pretraining finishes, following the practice in Gulrajani & Lopez-Paz (2021). We also empirically compare IRMX with IRMv1; as shown in Fig. 8, IRMX can substantially improve the OOD performances on both COLOREDMNIST and the modified COLOREDMNIST, confirming our theoretical results. However, the OOD performances of IRMX turn out to be upper bounded by those optimized with PAIR-o at each pretraining epoch. In other words, PAIR-o requires substantially less parameter tuning effort to achieve the top OOD performances, confirming the advantages of PAIR-o. In more complex tasks where exhaustive parameter tuning is prohibitively expensive, such as in the experiments with WILDS (Koh et al., 2021), IRMX performs worse than PAIR, which further validates the effectiveness of PAIR-o. To better demonstrate the advantages of PAIR-o over the linear weighting scheme, we replicate the previous study on two datasets from WILDS, i.e., CIVILCOMMENTS and FMOW.
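The grid search protocol above can be sketched as follows; `train_and_eval` is a hypothetical stand-in for one full training run returning validation accuracy, and the grids are those stated in the text:

```python
from itertools import product
from statistics import mean, stdev

# Search grids stated in the text.
PRETRAIN_EPOCHS = [0, 50, 100, 150, 200, 250]
PENALTY_WEIGHTS = [1e1, 1e2, 1e3, 1e4, 1e5]

def grid_search(train_and_eval, n_runs=10):
    """Evaluate every (pretraining epochs, penalty weight) configuration
    n_runs times and select the one maximizing mean validation accuracy,
    mirroring the protocol above. `train_and_eval(epochs, weight, seed)`
    is a hypothetical callable, not part of the released code."""
    results = {}
    for epochs, weight in product(PRETRAIN_EPOCHS, PENALTY_WEIGHTS):
        accs = [train_and_eval(epochs, weight, seed=s) for s in range(n_runs)]
        results[(epochs, weight)] = (mean(accs), stdev(accs))
    best = max(results, key=lambda cfg: results[cfg][0])
    return best, results
```

Each configuration's (mean, std) pair corresponds to the "mean and standard deviation of the performances" reported in the text.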
Due to computational resource limits, we restrict the search scope of the IRMv1 and VREx penalty weights to {1e-2, 1, 1e2}. It can be found that, even with a broader hyperparameter search space, IRMX optimized via the linear weighting scheme remains outperformed by PAIR-o.

Penalty weights trajectory. To examine whether PAIR-o can effectively adjust the penalty weights of the ERM and OOD objectives, especially when the model has not yet arrived at the Pareto front (i.e., when the gradient conflicts are expected to be more intense), we plot the trajectories of the penalty weights generated by PAIR-o in both CMNIST and CMNIST-m, as shown in Fig. 18.

[Figure 18: trajectories of the penalty weights generated by PAIR-o in (a) CMNIST and (b) CMNIST-m, annotated with the "Fitting", "Adaption", and "Generalization" phases.]

It can be found that the whole training process can be divided into three phases: the "Fitting" phase, the "Adaption" phase, and the "Generalization" phase. In the "Fitting" phase, the model is trained with only the ERM objective and is expected to approach the Pareto front first (cf. Fig. 15). It also corresponds to the "descent" phase in the PAIR-o algorithm; hence the penalty weight for the ERM objective is 1, while those for the OOD objectives are 0. Then, when PAIR-o enters the "balance" phase, it begins to yield high weights for the OOD objectives, while not diminishing the weight for the ERM objective. That is the "Adaption" phase, where PAIR-o begins to adjust the solution towards the Pareto front as well as the preferred direction. When the solution is close to the Pareto front, PAIR-o enters the "Generalization" phase, incorporating the invariance into the features by assigning high weights to the OOD objectives.

Preference sensitivity analysis under a strict hyperparameter configuration. Another reason for the high performance of PAIR-o on both COLOREDMNIST and the realistic datasets from WILDS is its robustness to different preference choices. Complementary to the theoretical discussion in Theorem E.1, we also conducted preference sensitivity experiments under a strict hyperparameter configuration: the hyperparameter search space is restricted to a single point, i.e., a learning rate of 0.01 and a pretraining epoch of 150. The results are shown in Fig. 19 for both the original and the modified COLOREDMNIST datasets. It can be found that PAIR-o maintains high performance and robustness under different preference choices. This also aligns with our discussion on the preference choice in practice (Sec. D.3): we need to assign a higher preference to robust and easier-to-optimize objectives, i.e., VREx. When the relative preferences are given within a reasonable scale, PAIR-o easily yields top OOD performances.
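The three phases can be caricatured in a toy weight scheduler. To be clear, the softmax rule below is NOT PAIR-o's actual EPO update; it is only a stand-in that reproduces the qualitative behavior (ERM-only weights during the descent phase, then upweighting the objectives whose preference-scaled losses are largest):

```python
import numpy as np

def phase_weights(step, losses, preference, descent_steps):
    """Toy illustration of the phase behavior described above. During
    the "descent"/"Fitting" phase only the ERM objective (index 0) is
    weighted; afterwards a crude stand-in for the exact-Pareto-optimal
    solver upweights the objectives whose preference-scaled losses
    p_i * L_i are largest, i.e. those furthest from satisfying the
    preference constraint. Not PAIR-o's actual solver."""
    if step < descent_steps:
        w = np.zeros(len(losses))
        w[0] = 1.0
        return w
    scaled = np.asarray(preference, dtype=float) * np.asarray(losses, dtype=float)
    scaled = scaled / scaled.max()  # normalize for numerical stability
    w = np.exp(5.0 * scaled)
    return w / w.sum()
```

Under this caricature, the "Adaption" phase corresponds to the regime where the preference-scaled OOD losses still dominate, so the OOD objectives receive most of the weight without the ERM weight being zeroed out.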
Additional ablation study on COLOREDMNIST with "perfect" initialization. We also conduct experiments with "perfect" initializations for different methods to check whether the OOD constraints can enforce the invariance, following Zhang et al. (2022a). Besides the OOD methods used in the paper, we also include another OOD method, IGA (Koyama & Yamaguchi, 2020), to give a more comprehensive overview of their performances with "perfect" initialization. We also introduce another variant of ColoredMNIST, i.e., CMNIST-11: {(0.25, 0.10), (0.25, 0.20)}, to complement the comparison. All methods are initialized with an ERM model trained on gray-scale ColoredMNIST data, which is expected to learn to use the digit shapes in the image to make predictions. The learning rate is 1e-3 and the penalty weight is 1e5. Different from Zhang et al. (2022a), we use SGD to optimize the models, as Adam would generate larger step sizes when the gradients remain within a small range under the "perfect" initialization. The results are shown in Fig. 20 (OOD performances with "perfect" initializations). It can be found that, in CMNIST-10, IRM, IRMX, and IGA cannot enforce the invariance, while V-REx and PAIR maintain the invariance, which is consistent with our previous findings. Moreover, IGA fails to maintain the invariance in CMNIST-11 and CMNIST-25, demonstrating the relatively low robustness of the IGA objective. Besides, V-REx consistently maintains the invariance even in CMNIST-11, because the gradient signals of the variance penalty tend to vanish under the "perfect" initialization. In contrast, PAIR improves over both IRM and IRMX to maintain the invariance, confirming the effectiveness of PAIR. Additional ablation study on the performance of PAIR-o and PAIR-s with more OOD objectives and their composites with IRMv1.
Besides VREx, we conduct additional ablation studies of PAIR with IB (Ahuja et al., 2021a), Fishr (Rame et al., 2021), CLOvE (Wald et al., 2021), IGA (Koyama & Yamaguchi, 2020), and SD (Pezeshki et al., 2021), based on COLOREDMNIST and the modified COLOREDMNIST. We focus on the cases with no fewer than 2 OOD objectives, as one could simply obtain a low OOD loss for a single OOD objective, where the linear weighting scheme is likely to approach the desired OOD solution since the Pareto front is simpler. However, it is often the case that a single OOD objective is not sufficiently robust to locate the desired OOD solution on the Pareto front. In the experiments, we follow the same evaluation protocol as the previous experiments on COLOREDMNIST. Due to the resource limits of the NVIDIA RTX 3090Ti used for the original COLOREDMNIST experiments in the previous sections, we switch the hardware and software platform to Linux servers with NVIDIA V100 graphics cards and CUDA 10.2; hence the results here can differ slightly from those in the previous sections. Note that even when the learning rate is included in the hyperparameter search space, PAIR still uses a smaller scope than that of the linear weighting scheme. Besides, we follow our previous discussion in Appendix D.3 to set up the preferences of the different OOD objectives. Specifically, for Fishr, we use a larger preference of 1e12 than that for IRMv1 (1e8), since the agreement-based methods tend to have a smaller loss than IRMv1, while for the other objectives, we use a smaller preference of 1e8 than that for IRMv1 (1e12). Note that this is only a heuristic setup, and the performance of PAIR can be further improved if the preferences are tuned. The results are given in Table 6. It can be found that not all OOD objectives can improve the performance of IRMv1. For the OOD objectives that can enhance the OOD robustness when incorporated into IRMv1, PAIR can further improve over the combined OOD objectives optimized via the linear weighting scheme.
For unrobust combinations, intuitively, it is hard to improve the OOD performance for the following reasons: (i) when the new objective combination is unrobust, the desired solution may not lie on the new Pareto front; (ii) even when the desired solution lies on the new Pareto front, the weakened OOD robustness introduces more local minima that have low OOD losses but worse OOD generalization performance; (iii) as an extra objective is involved, the OOD preference used in PAIR tends to have a higher divergence from the ideal one. Therefore, given unrobust OOD objective combinations, the performance gain of PAIR is not theoretically guaranteed. Nevertheless, PAIR-o can still improve some of the unrobust objective combinations, demonstrating its robustness. Notably, PAIR-s can further improve the performance of PAIR-o in most cases, demonstrating the generality of PAIR. To study which OOD objectives are suitable to be combined with IRMv1, and whether using more OOD objectives brings more performance improvements, we additionally conduct experiments with all possible composites of IRMv1 and IB (Ahuja et al., 2021a), Fishr (Rame et al., 2021), and VREx (Krueger et al., 2021). In the experiments, similar to the previous study, PAIR-o adopts a slightly broader learning rate search scope of {0.01, 0.02, 0.04, 0.1, 0.2} at stage 2, in order to prevent divergence. Note that even when the learning rate is included in the hyperparameter search space, PAIR still uses a smaller search scope than that of the linear weighting scheme. PAIR-s adopts the training-domain validation accuracy to perform the model selection. Both PAIR-o and PAIR-s adopt a heuristic preference setup that uses preferences decreasing from 1e12 down to 1e8 by a factor of 1e2 as more objectives are involved. For example, in the composite of IB, IRMv1, and VREx, we adopt the preference of (1e8, 1e10, 1e12) for the OOD objectives. The choice of preferences follows the previous discussion in Appendix D.3.
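The heuristic preference ladder can be written as a one-line helper; the function name is ours, and the ladder is only the heuristic used for the multi-objective composites (e.g., (1e8, 1e10, 1e12) for the IB, IRMv1, VREx composite):

```python
def heuristic_preferences(n_ood, top=1e12, factor=1e2):
    """Heuristic preference ladder for n_ood OOD objectives: values
    decrease from `top` by `factor` per objective, ordered from the
    hardest- to the easiest-to-optimize objective, e.g. (1e8, 1e10, 1e12)
    for the (IB, IRMv1, VREx) composite. The helper name is ours."""
    return tuple(top / factor ** (n_ood - 1 - i) for i in range(n_ood))
```

Ordering the ladder so that the easiest-to-optimize objective receives the largest preference follows the earlier guidance in Appendix D.3.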
The results are shown in Table 7. The best and second best results are in bold and underlined, respectively. It can be found that incorporating more OOD objectives does not necessarily bring more performance improvements to IRMv1. The linear weighting scheme can further exacerbate the performance drops of unrobust OOD objective combinations. For example, when incorporating the IB objective into IRMv1, the OOD performance drops, since IB is proposed to mitigate a specific type of distribution shift instead of directly improving the learning of invariance in the original IRMv1 setting. In contrast, incorporating Fishr brings performance increases in most cases, since minimizing the Fishr loss approximately minimizes the VREx loss, as shown by Rame et al. (2021). We suspect that the reason for the performance drop could be that more objectives make the Pareto front more complicated, and also lead to a higher divergence of the OOD preference (since we are less likely to know the ideal preference given more objectives). Hence, the preferred compositions of objectives are those that have theoretical guarantees and are as concise as possible. Interestingly, we also find that, although incorporating more objectives into PAIR-o does not necessarily bring performance increases, a combination of PAIR-o and PAIR-s can further improve the OOD performance, despite the simple implementation of PAIR-o. This serves as strong evidence for the generality and significance of PAIR.
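The decreasing-preference heuristic described above (each additional OOD objective scaled down by a factor of 1e2 from 1e12) can be written as a small helper; the function name is ours, not from the released code:

```python
def heuristic_preferences(num_ood, top=1e12, factor=1e2):
    """Decreasing preference vector for `num_ood` OOD objectives: the most
    trusted objective gets `top`, and each earlier one is `factor` smaller,
    mirroring the heuristic used for the composite-objective experiments."""
    return [top / factor ** (num_ood - 1 - i) for i in range(num_ood)]

# Composite of IB, IRMv1 and VREx -> preferences (1e8, 1e10, 1e12).
print(heuristic_preferences(3))
```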

F.3 MORE DETAILS ABOUT EXPERIMENTS ON WILDS

In this section, we provide more details about the WILDS datasets as well as the evaluation setups in the experiments.

F.3.1 DATASET DESCRIPTION

We select 6 challenging datasets from the WILDS (Koh et al., 2021) benchmark for evaluating the PAIR-o performance under realistic distribution shifts. The datasets cover domain distribution shifts, subpopulation shifts and their mixtures. A summary of the basic information and statistics of the WILDS datasets can be found in Table 8 and Table 9, respectively. In the following, we give a brief introduction to each of the datasets; more details can be found in the WILDS paper (Koh et al., 2021). Camelyon17. We follow the WILDS splits and data processing pipeline for the Camelyon17 dataset (Bándi et al., 2019). It provides 450,000 lymph-node scans from 5 hospitals. The task in Camelyon17 is to take a 96 × 96 medical image as input and predict whether there exists tumor tissue in the image. The domain d refers to the index of the hospital where the image was taken. The training data are sampled from the first 3 hospitals, while the OOD validation and test data are sampled from the 4th and 5th hospitals, respectively. We use the average accuracy as the evaluation metric and a DenseNet-121 (Huang et al., 2017) as the backbone for the featurizer. CivilComments. We follow the WILDS splits and data processing pipeline for the CivilComments dataset (Borkan et al., 2019). It provides 450,000 comments collected from online articles. The task is to classify whether an online comment is toxic or non-toxic. The domains d are defined according to demographic features, including male, female, LGBTQ, Christian, Muslim, other religions, Black, and White. CivilComments is used to study subpopulation shifts, so we use the worst group/domain accuracy as the evaluation metric. As for the backbone of the featurizer, we use a DistilBERT (Sanh et al., 2019) following WILDS (Koh et al., 2021). FMoW.
We follow the WILDS splits and data processing pipeline for the FMoW dataset (Christie et al., 2018). It provides satellite images from 16 years and 5 regions. The task in FMoW is to classify the images into 62 classes of building or land use categories. The domain is split according to the year in which the satellite image was collected, as well as the region in the image, which can be Africa, the Americas, Asia, Europe or Oceania. Distribution shifts can happen across different years and regions. The training data contain images collected before 2013, the validation data contain images collected from 2013 to 2015, and the test data contain images collected after 2015. The evaluation metric for FMoW is the worst region accuracy, and the backbone model for the featurizer is a DenseNet-121 (Huang et al., 2017). iWildCam. We follow the WILDS splits and data processing pipeline for the iWildCam dataset (Beery et al., 2020). It consists of 203,029 heat- or motion-activated photos of animal species from 323 different camera traps across different countries around the world. The task of iWildCam is to classify the animal species in the photos. The domains are split according to the locations of the camera traps, which can introduce distribution shifts. We use the macro F1 score as the evaluation metric and a ResNet-50 (He et al., 2016) as the backbone for the featurizer. PovertyMap. We follow the WILDS splits and data processing pipeline for the PovertyMap dataset (Yeh et al., 2020). It consists of satellite imagery and survey data at 19,669 villages from 23 African countries between 2009 and 2016. Different from the other datasets, the task in PovertyMap is a regression task that asks the model to predict the real-valued asset wealth index computed from Demographic and Health Surveys (DHS) data. The domain is split according to the country in which the image was taken and whether the image is of an urban or rural area.
The relatively small size of PovertyMap allows for cross-fold evaluation, where each fold defines a different set of OOD countries (Koh et al., 2021). We use the Pearson correlation of the worst urban/rural subpopulation as the evaluation metric and a ResNet-18 (He et al., 2016) as the backbone for the featurizer. RxRx1. We follow the WILDS splits and data processing pipeline for the RxRx1 dataset (Taylor et al., 2019). The input is an image of cells taken by fluorescent microscopy. The cells can be genetically perturbed by siRNA, and the task of RxRx1 is to predict the class of the siRNA that has treated the cells. There are 1,139 genetic treatments, and the domain shifts are introduced by the experimental batches. We use the average accuracy on the OOD experimental batches as the evaluation metric and a ResNet-50 (He et al., 2016) as the backbone for the featurizer.

F.3.2 TRAINING AND EVALUATION DETAILS.

We follow previous works to implement and evaluate our models (Koh et al., 2021; Shi et al., 2022; Yao et al., 2022). The information of the referred papers and code is listed in Table 10. The general hyperparameter settings are inherited from the referred code and papers, and are shown in Table 11. We use the same backbone models to implement the featurizer (He et al., 2016; Huang et al., 2017; Sanh et al., 2019). By default, we repeat the experiments for 3 runs with the random seeds 0, 1 and 2. For Camelyon17, we follow the official guide and repeat 10 times with the random seeds 0 to 9, and for PovertyMap, we repeat the experiments 5 times with the random seeds 0 to 4. Note that PovertyMap uses cross-fold validation, hence each run uses different training and evaluation splits, following the WILDS official guideline. For the evaluation of baselines, we take the previous results from the literature (Koh et al., 2021; Shi et al., 2022; Yao et al., 2022) by default, while we rerun Fish (Shi et al., 2022) and LISA (Yao et al., 2022) to validate the reported results. Since the original implementation of Fish does not support the evaluation of the updated PovertyMap dataset, we mildly adjust the hyperparameter settings to reproduce the corresponding results, as shown in Table 11. We also reduce the batch size on FMoW due to memory limits and find it does not affect the reproducibility of Fish and LISA. Besides, the original implementation of LISA does not support PovertyMap, which differs as a regression task that may not be suitable for Mixup (Zhang et al., 2018); however, we find the "group by label" strategy in LISA works particularly well on it and reaches state-of-the-art performance.
For IRMX, we implement it as the simple addition of the IRMv1 and VREx penalties based on the Fish implementation (Shi et al., 2022), and search the penalty weights using the same space as for the other objectives (Koh et al., 2021) to ensure fairness. Besides, since previously reported results did not cover the performance of VREx on iWildCam and PovertyMap, we implement VREx and report the results based on the Fish implementation (Shi et al., 2022). For PAIR-o, we implement it based on the Fish code (Shi et al., 2022). The detailed algorithm can be found in Algorithm 1. We leverage the same number of pretraining steps as in Fish to fulfill the first "descent" phase of the PAIR-o algorithm. Then, during the "balance" phase, at each training step, we sample k batches of data from different domains, calculate the losses, and conduct the back-propagation. By default, we use only the gradients of the classifier to solve for the objective weights during the "balance" phase, except for the iWildCam and RxRx1 datasets, where, due to memory limits as discussed in Sec. D.4.1, we use the freeze technique to ensure the consistency of the batch size and the number of sampled domains with Table 11. Moreover, as discussed in Sec. D.4.2, the unbiased stochastic estimates of the IRMv1 penalty cannot guarantee the non-negativity of the estimated loss values, which is however not compatible with MOO theory (Kaisa, 1999) (and thus with PAIR-o). Therefore, we manually adjust the negative values to be positive by multiplying them with an adjustment rate (denoted as Neg. IRMv1 adj. rate in Table 12). The adjustment rate is tuned from 1 to 1e-4 with a multiplicative step of 1e-2 to avoid training divergence and instability. Following the discussion in Sec. D.3, we tune the relative OOD preference by merely varying the preference for the IRMv1 objective from the default choice of (1, 1e10, 1e12) by a step size of 1e2.
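The negative-value adjustment described above can be sketched as a one-line helper; the exact form of the adjustment in the released code may differ, and this version simply multiplies the magnitude of a negative estimate by the rate:

```python
def adjust_negative_penalty(penalty, adj_rate=1e-2):
    """Map a negative stochastic IRMv1 penalty estimate to a small positive
    value, since the MOO solver assumes non-negative losses.  `adj_rate`
    plays the role of the Neg. IRMv1 adj. rate, tuned in {1, 1e-2, 1e-4}."""
    return -penalty * adj_rate if penalty < 0 else penalty
```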
We find the performances of IRMv1 and VREx highly correlate with the corresponding relative preference weights. We list the hyperparameters of PAIR-o in Table 12. Although we did not tune the hyperparameters heavily, we find that PAIR-o generically works well across the different challenging datasets and realistic distribution shifts in WILDS. As discussed in Sec. D.3, there could be more sophisticated approaches to further improve the search and estimation of the OOD preference, which we leave for future developments based on PAIR.

DOMAINBED is proposed by Gulrajani & Lopez-Paz (2021) to highlight the importance of model selection in OOD generalization. Specifically, they empirically show that, under rigorous hyperparameter tuning, ERM (Vapnik, 1991) achieves state-of-the-art performances. Although recent progress has been made to outperform ERM under the rigorous DOMAINBED evaluation protocol (Rame et al., 2021), whether there exists a proper model selection method for OOD algorithms remains elusive. The difficulty of proper model selection for OOD algorithms is mainly that we lack access to a validation set that has a distribution similar to the test data. Therefore, Gulrajani & Lopez-Paz (2021) provide 3 options to choose and construct a validation set from: training domain data; leave-one-out validation data; test domain data. However, all three validation set construction approaches have their own limitations, as they essentially posit different assumptions on the test distribution (Gulrajani & Lopez-Paz, 2021; Teney et al., 2021; Rame et al., 2021). PAIR-s tries to address the limitations caused by the difficulty of finding a proper validation set for model selection in domain generalization by leveraging the prior assumed within the OOD algorithm. Essentially, the different lines of OOD algorithms discussed in Sec. B.1 adopt different priors and assumptions on the causes of the distribution shifts.
The main purpose of the OOD evaluation is to validate the correctness of the posed assumptions. To this end, the selected models should properly reflect the preferences implied by the assumptions, i.e., the OOD loss values. When considering the loss values during model selection, it is natural to leverage the MOO perspective and explicitly consider the trade-offs between the ERM and OOD performance. The detailed description, implementation options, and potential uses of PAIR-s are provided in Appendix D.

G.2 TRAINING AND EVALUATION DETAILS

Since the main purpose of our DOMAINBED experiments is to validate the existence of the problem and the effectiveness of PAIR-s, we apply PAIR-s to representative methods of the four OOD solutions discussed in Sec. B.1. Specifically, we choose the following methods out of all implemented algorithms in DOMAINBED (https://github.com/facebookresearch/DomainBed):

• ERM: Empirical Risk Minimization (Vapnik, 1991)
• IRM: Invariant Risk Minimization (Arjovsky et al., 2019)
• GroupDRO: Group Distributionally Robust Optimization (Sagawa* et al., 2020)
• DANN: Domain Adversarial Neural Network (Ganin et al., 2016)
• Fishr: Invariant Gradient Variances for OOD Generalization (Rame et al., 2021)

Due to the limits of computational resources, we select 3 out of the 7 datasets from DOMAINBED. We follow Rame et al. (2021) for the details, listed as follows:

1. Colored MNIST (Arjovsky et al., 2019) is a variant of the MNIST handwritten digit classification dataset (Lecun et al., 1998). Domain d ∈ {90%, 80%, 10%} contains a disjoint set of colored digits: the correlation strengths between color and label vary across domains. The dataset contains 70,000 examples of dimension (2, 28, 28) and 2 classes. Most importantly, the network, the hyperparameters, the image shapes, etc., are not the same as in the IRM setup for the COLOREDMNIST experiments.
2. PACS (Li et al., 2017) includes domains d ∈ {art, cartoons, photos, sketches}, with 9,991 examples of dimension (3, 224, 224) and 7 classes.
3. TerraIncognita (Beery et al., 2018) contains photographs of wild animals taken by camera traps at locations d ∈ {L100, L38, L43, L46}, with 24,788 examples of dimension (3, 224, 224) and 10 classes.

Note that the CMNIST dataset in DOMAINBED uses a convolutional neural network as the backbone for the featurizer, which is not the same MLP as in the COLOREDMNIST experiments. By default, all real datasets leverage a ResNet-50 (He et al., 2016) pretrained on ImageNet, with a dropout layer before the newly added dense layer, fine-tuned with frozen batch normalization layers. During training, we strictly follow the evaluation protocol of DOMAINBED.
Note that the hyperparameter configurations of Fishr differ somewhat from the default configurations, hence we refer to the configuration tables from Rame et al. (2021) directly.



Readers might be interested in the necessity of keeping IRMv1 in the objectives: Proposition 1 considers only the ideal case, and we provide additional empirical reasons in Appendix C.2. Our results can also be extended to the multi-class case following typical machine learning theory practice. We leave more sophisticated Pareto front exploration methods (Zhang & Golovin, 2020; Ma et al., 2020) to future investigation. We assume that the support of φ(X) (denoted as Z) is identical in each environment for simplicity.



Figure 2: Pareto front of ERM losses w.r.t. environments.

Figure 3: Variance distribution.

Figure 4: Recovery of causal invariance. The causal invariance (Definition 3.1) requires the model predictions to be independent of the spurious features within the overlapped invariant features. In this example, intuitively, it requires the colored belts to be perpendicular to the x-axis within [-2, 2]. It can be found that PAIR succeeds, while IRMv1 and VREx fail, in recovering the causal invariance.

(a) PAIR vs. IRMX.

Figure 5: (a) Each point is the best-performing IRMX for the corresponding pretraining epochs (x-axis) and IRMv1 penalty weights (y-axis), over all possible VREx penalty weights. Despite the substantial tuning efforts, IRMX performs no better than PAIR. That is because (b) PAIR can adaptively adjust the penalty weights during the optimization process, which leads to a (c) Pareto optimal solution. (d) The robustness of PAIR-o to different preference choices makes it adaptable to various scenarios.

φ, w: the featurizer and the classifier; when w is linear, f can be simply represented via the dot product w • φ
E_all: the set of indices for all environments
E_tr: the subset of indices of training environments
e: the index of a specific environment
D^e: the dataset from environment e, containing samples {X_i^e, Y_i^e} considered as i.i.d. draws from P^e
D: the overall dataset containing data from all environments, D = {D^e}_{e ∈ E_all}
I(E): the set of invariant predictors w.r.t. some OOD objectives (e.g., IRM) and environments E
L_e: the empirical risk calculated based on D^e, e.g., the square loss or logistic loss
L: the vector of losses {L_i}_{i=1}^m considered in the m objectives of a MOO problem, sharing a set of parameters θ
P(L): the set of Pareto optimal solutions w.r.t. the objectives L
p_ood: the vector of objective preferences
G ∈ R^{m×d}: the matrix of gradients w.r.t. the m objectives L and the parameters θ ∈ R^d; each objective L_i corresponds to a gradient vector g_i ∈ R^d
S^{m+1}: the m-simplex corresponding to the m OOD objectives, {β ∈ R_+^{m+1} | Σ_{i=1}^{m+1} β_i = 1}

B MORE DISCUSSIONS ON BACKGROUND AND FUTURE DIRECTIONS

B.1 BACKGROUND AND RELATED WORK

Zhai et al. (2022) find that regularization on ERM, or sacrificing ERM performance, is usually needed for achieving satisfactory OOD performance. A similar phenomenon has also been observed by Zhao et al. (2020); Xie et al. (2021); Sadeghi et al. (2022); Sener & Koltun (2022); Teney et al. (2022), which aligns with our findings through the Pareto front as shown in Fig. 6(a) and Fig. 7(a). Besides, Lin et al. (2022a) find that IRM can easily overfit and learn unexpected features when applied to large neural networks. Zhou et al. (2022) propose to alleviate this problem by imposing sparsity constraints. Orthogonal to Lin et al. (2022a); Zhou et al. (2022),

Let $I_S(\mathcal{E}_{tr})$ denote the set of invariant predictors elicited by the relaxed constraint in IRM$_S$. It follows that $I(\mathcal{E}_{tr}) \subseteq I_S(\mathcal{E}_{tr})$ (Kamath et al., 2021). Yet, Eq. 11 remains a constrained program. Hence, Arjovsky et al. (2019) introduce a soft-constrained variant, IRMv1:

$$\min_{\varphi} \sum_{e \in \mathcal{E}_{tr}} L_e(\varphi) + \lambda \,\|\nabla_{w|w=1} L_e(w \cdot \varphi)\|^2. \qquad (12)$$

Theoretical Failure of Practical IRM Variants. Although the practical variants seem promising, Kamath et al. (2021) show that there exist huge gaps between the variants and the original IRM, such that both IRM$_S$ and IRMv1 can fail to capture the desired invariance, even given the population loss and an infinite number of training environments. The failure case, called the two-bit environment (Kamath et al., 2021), follows the setup of ColoredMNIST in IRM (Arjovsky et al., 2019), and defines environments with two parameters $\alpha_e, \beta_e \in [0, 1]$. Each $\mathcal{D}_e$ is defined as
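The penalty term in Eq. 12 above can be made concrete with a scalar dummy classifier and the squared loss; the following toy evaluation is our own illustration, not the paper's code:

```python
import numpy as np

def irmv1_penalty(phi, y):
    """IRMv1 penalty ||grad_{w|w=1} L_e(w*phi)||^2 for the squared loss
    L_e(w*phi) = mean((w*phi(x) - y)^2), at the dummy classifier w = 1."""
    grad_w = np.mean(2.0 * (phi - y) * phi)  # dL_e/dw evaluated at w = 1
    return float(grad_w ** 2)

phi = np.array([0.5, -1.0, 2.0])
print(irmv1_penalty(phi, phi))          # perfectly calibrated feature: 0.0
print(irmv1_penalty(phi, np.zeros(3)))  # mis-calibrated feature: > 0
```

A feature whose predictions are already calibrated to the labels incurs zero penalty, which is exactly what makes w = 1 "simultaneously optimal" in the IRMv1 sense.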

(a) Pareto front under MSE loss. (b) Failure case under MSE loss. (c) Variance distribution under MSE loss.

Figure 6: Counterparts of Fig. 1(a), Fig. 3 and Fig. 2 implemented in MSE loss.

Failure case under Logistic loss. Variance distribution under Logistic loss.

Figure 7: Counterparts of Fig. 1(a), Fig. 3 and Fig. 2 implemented in Logistic loss.

More visualization results of the failure cases. In the main paper, we visualize the Pareto front, the ERM loss distribution, and the variance distribution of the failure case under MSE losses, given the environment setup E_tr := {(0.1, 0.11), (0.1, 0.4)}. We plot Fig. 1(a) and Fig. 3 based on the Mathematica code provided by Kamath et al. (2021), where we focus on the odd predictors due to the symmetry in two-bit environments, i.e., predictors satisfying φ(1, -1) = -φ(-1, 1) and φ(1, 1) = -φ(-1, -1). Since Fig. 1(a), Fig. 3 and Fig. 2 are implemented in MSE loss, to complete the discussion under Setting A (Kamath et al., 2021), we also give their logistic counterparts in Fig. 7.

Figure 8: Performances of IRMv1 in CMNIST and CMNIST-m under different hyperparameters.

Figure 9: Drawbacks of V-REx in practice.

We first restate the definition of causal invariance specified by Peters et al. (2016); Arjovsky et al. (2019); Kamath et al. (2021) in Definition C.1.

Definition C.1. (Causal Invariance) Given a predictor f := w • φ, the representation produced by the featurizer φ is invariant over E_all if and only if for all e_1, e_2 ∈ E_all, it holds that

$$\mathbb{E}^{e_1}[Y \mid \varphi(X) = z] = \mathbb{E}^{e_2}[Y \mid \varphi(X) = z], \quad \forall z \in \mathcal{Z}.$$
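The definition can be checked numerically on a toy discrete example: below, a featurizer keeping only the stable feature yields matching conditional expectations of Y across two environments, while the spurious feature does not. The feature names and probabilities are our own construction, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_env(spu_corr, n=200_000):
    """Binary toy environment: Y flips x_inv with prob 0.25 in every
    environment, while the x_spu-Y correlation varies with `spu_corr`."""
    x_inv = rng.integers(0, 2, n)
    y = x_inv ^ (rng.random(n) < 0.25).astype(np.int64)
    x_spu = y ^ (rng.random(n) < spu_corr).astype(np.int64)
    return x_inv, x_spu, y

def cond_means(feature, y):
    """E[Y | feature = z] for z in {0, 1}."""
    return [y[feature == z].mean() for z in (0, 1)]

x1, s1, y1 = sample_env(0.1)
x2, s2, y2 = sample_env(0.4)
print(cond_means(x1, y1), cond_means(x2, y2))  # approx [0.25, 0.75] in both
print(cond_means(s1, y1), cond_means(s2, y2))  # differ across environments
```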

Pseudo code for PAIR-o.
1: Input: Training data D_tr = {X_i, Y_i}_{i=1}^N with environment partitions D_tr = {D^e}_{e∈E_tr}; learning rate η; batch size b; number of sampled environments d; OOD preference p_ood for the ERM loss L_ERM and the m OOD losses L_ood = {L_ood^i}_{i=1}^m; pre-training epochs e_p; maximum training epochs for the "balance" phase e_b; trainable parameters at the "balance" phase θ;
2: Randomly initialize the parameters in the model f = w • φ;
3: for i = 1 to e_p do
4:   Sample batches of data {X_j, Y_j}_{j=1}^b;
5:   Make predictions with f: {Ŷ_j}_{j=1}^b = f({X_j}_{j=1}^b);
6:   Calculate the empirical loss L_ERM with {Ŷ_j}_{j=1}^b;
7:   Update the parameters of f with the empirical loss L_ERM using the learning rate η;
8: end for
9: for i = 1 to e_b do
10:   for D^e ∈ permute({D^e}_{e∈E_tr}) do
11:     Sample a batch of data from D^e: {X_j^e, Y_j^e}_{j=1}^b ∼ D^e;
12:     Make predictions with f: {Ŷ_j^e}_{j=1}^b = f({X_j^e}_{j=1}^b);
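The two-phase structure of Algorithm 1, an ERM-only "descent" phase followed by a preference-weighted "balance" phase, can be mimicked on a one-parameter toy problem. The sketch below replaces the EPO solver with a fixed preference-weighted scalarization, so it illustrates only the control flow, not the actual solver:

```python
import numpy as np

# Toy problem: two environments with risks L_e(w) = s_e * (w - c_e)^2.
# ERM alone settles where the mean risk is lowest; the "balance" phase then
# trades a little average risk for nearly equal per-environment risks
# (a VREx-style variance term stands in for the OOD objective).
c = np.array([1.0, 2.0])   # per-environment optima
s = np.array([1.0, 2.0])   # per-environment curvatures

def env_losses(w):
    return s * (w - c) ** 2

def scalarized(w, pref_ood):
    le = env_losses(w)
    return le.mean() + pref_ood * le.var()

def num_grad(w, pref_ood, eps=1e-5):
    return (scalarized(w + eps, pref_ood) - scalarized(w - eps, pref_ood)) / (2 * eps)

w = 0.0
for _ in range(2000):              # phase 1: "descent" (ERM only)
    w -= 0.01 * num_grad(w, 0.0)
gap_erm = abs(env_losses(w)[0] - env_losses(w)[1])
for _ in range(2000):              # phase 2: "balance" (preference-weighted)
    w -= 0.01 * num_grad(w, 10.0)
gap_pair = abs(env_losses(w)[0] - env_losses(w)[1])
print(gap_erm, gap_pair)           # the per-environment risk gap shrinks
```

On this toy, the descent phase settles at the ERM optimum, and the balance phase then shrinks the gap between the per-environment risks at a small cost in average risk.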

9: Compute the validation selection bar as Ā_val = (max A_val - min A_val) · p + min A_val;
10: Filter out all runs that have a validation accuracy lower than Ā_val and obtain H̃;
11: Find the run with the highest PAIR score as r* = argmax_{r ∈ H̃} s_r;
12: Return the associated history of r*;
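The bar-then-score pattern of these steps can be sketched as follows. The PAIR score itself is treated here as precomputed per run, since its exact form depends on the OOD preference; the run dictionaries and their field names are our own illustration:

```python
def pair_select(runs, p=0.6):
    """Select a run: first filter by the validation bar
    A_bar = (max - min) * p + min, then take the highest PAIR score."""
    accs = [r["val_acc"] for r in runs]
    bar = (max(accs) - min(accs)) * p + min(accs)
    eligible = [r for r in runs if r["val_acc"] >= bar]
    return max(eligible, key=lambda r: r["score"])

runs = [
    {"val_acc": 0.60, "score": 0.9},   # high score but below the bar
    {"val_acc": 0.85, "score": 0.4},
    {"val_acc": 0.90, "score": 0.7},
]
print(pair_select(runs)["val_acc"])    # the bar excludes the first run
```

Note how the bar prevents a run with a high PAIR score but poor validation accuracy from being selected, which is the point of step 10.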

D.4 DISCUSSION ON THE USE OF PAIR IN PRACTICE

D.4.1 SCALABILITY

Similar to other MOO algorithms

$$\min_{\varphi} \sum_{e \in \mathcal{E}_{tr}} L_e(\varphi) + \lambda \,\|\nabla_{w|w=1} L_e(w \cdot \varphi)\|^2. \qquad (21)$$

Observe that

$$\nabla_{w|w=1.0} L_e(w \cdot \varphi) = \frac{\partial \mathbb{E}^e[\ell(w \cdot \varphi(X^e), Y^e)]}{\partial w}\Big|_{w=1.0} = \mathbb{E}^e\Big[\frac{\partial \ell(w \cdot \varphi(X^e), Y^e)}{\partial w}\Big|_{w=1.0}\Big],$$

and

$$\|\nabla_{w|w=1.0} L_e(w \cdot \varphi)\|^2 = \Big(\mathbb{E}^e\Big[\frac{\partial \ell(w \cdot \varphi(X^e), Y^e)}{\partial w}\Big|_{w=1.0}\Big]\Big)^2. \qquad (22)$$

and $\tilde{X}$ are i.i.d. random variables w.r.t. the same distribution $\mathcal{X}$. Equipped with this observation, we can further write Eq. 22 as

$$\|\nabla_{w|w=1.0} L_e(w \cdot \varphi)\|^2 = \mathbb{E}^e\Big[\frac{\partial \ell(w \cdot \varphi(X^e), Y^e)}{\partial w}\Big|_{w=1.0} \frac{\partial \ell(w \cdot \varphi(\tilde{X}^e), \tilde{Y}^e)}{\partial w}\Big|_{w=1.0}\Big] = \mathbb{E}^e\Big[\frac{\partial \ell(w \cdot \varphi(X^e), Y^e)}{\partial w}\Big|_{w=1.0}\Big] \, \mathbb{E}^e\Big[\frac{\partial \ell(w \cdot \varphi(\tilde{X}^e), \tilde{Y}^e)}{\partial w}\Big|_{w=1.0}\Big], \qquad (23)$$

where $(X^e, Y^e) \sim \mathbb{P}^e$ and $(\tilde{X}^e, \tilde{Y}^e) \sim \mathbb{P}^e$ are i.i.d. samples from the $\mathbb{P}^e$ of environment $e$. As $\mathbb{E}^e\big[\frac{\partial \ell(w \cdot \varphi(X^e), Y^e)}{\partial w}\big|_{w=1.0}\big]$ and $\mathbb{E}^e\big[\frac{\partial \ell(w \cdot \varphi(\tilde{X}^e), \tilde{Y}^e)}{\partial w}\big|_{w=1.0}\big]$

a large constant $C$, such that the minimum value of $\mathbb{E}^e\big[\frac{\partial \ell(w \cdot \varphi(X^e), Y^e)}{\partial w}\big|_{w=1.0}\big] + C$ is non-negative.
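The identity in Eq. 23, that the product of gradient estimates from two independent batches is an unbiased estimate of the squared expected gradient, whereas squaring a single batch estimate is biased upward by its variance, can be sanity-checked with a generic Monte Carlo sketch (the distribution of the per-sample "gradients" is our own choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-sample "gradients" with E[g] = 2, so the target (E[g])^2 = 4.
def batch_grad(n=64):
    return rng.normal(loc=2.0, scale=1.0, size=n).mean()

# Squaring a single batch estimate is biased upward by Var(g_hat) = 1/64 ...
biased = float(np.mean([batch_grad() ** 2 for _ in range(50_000)]))
# ... while the product of two independent batch estimates is unbiased.
unbiased = float(np.mean([batch_grad() * batch_grad() for _ in range(50_000)]))
print(biased, unbiased)
```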

we recall the definition of a ν-representative sample from Shalev-Shwartz & Ben-David (2014). Definition E.2. (Shalev-Shwartz & Ben-David (2014)) A training set S is called ν-representative (w.r.t. domain X, hypothesis class F, loss ℓ and distribution D) if for all f ∈ F, |L_S(f) - L_D(f)| ≤ ν.

Figure 18: Penalty weights trajectory

Figure 19: Preference sensitivity under the strict hyperparameter configuration. The x-axis is the preference for VREx and the y-axis is the preference for IRMv1.



OOD Performance on COLOREDMNIST

OOD generalization performances using DOMAINBED evaluation protocol.

Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 525-536, 2018. (Cited on pages 3, 8, 20, 29 and 30)
Ozan Sener and Vladlen Koltun. Domain generalization without excess empirical risk. In Advances in Neural Information Processing Systems, 2022. (Cited on page 19)
Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, 2014. (Cited on page 35)
Yuge Shi, Jeffrey Seely, Philip Torr, Siddharth N, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. In International Conference on Learning Representations, 2022. (Cited on pages 1, 3, 8, 19, 29, 31, 44 and 45)
Baochen Sun and Kate Saenko. Deep CORAL: correlation alignment for deep domain adaptation. In European Conference on Computer Vision, volume 9915, pp. 443-450, 2016. (Cited on pages 3, 8, 18 and 31)
Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning,

Algorithm 2 Pseudo code for PAIR-s.
1: Input: Running history H from R runs, where each running history consists of a loss history L = {L_1^t, L_2^t, ..., L_{m+1}^t}_{t=1}^T of (m + 1) losses, i.e., L_ERM and L_ood = {L_ood^i}_{i=1}^m, and a training and validation accuracy history A = {A_tr^t, A_val^t}_{t=1}^T, over T logging steps; OOD preference p_ood; convergence step c; validation accuracy percentile p;
2: for r = 1 to R do

Comparison between the linear weighting scheme and PAIR-o in WILDS.

Loss value distributions at convergence. For the loss distribution experiments (Fig. 16(c), 16(d)), we plot the ERM, IRMv1 and VREx loss values at convergence of the best-performing algorithms. The plotted values are in log scale and normalized to [0, 1]. It can be found that PAIR-o effectively finds a better solution in terms of the IRMv1 and VREx losses, without degrading the ERM performance too much, which confirms our motivations for the design of PAIR.
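"Better in terms of both losses" is the notion of Pareto dominance; over a finite pool of candidate solutions, the non-dominated set can be computed with a generic sketch like the following (not the solver used by PAIR):

```python
import numpy as np

def pareto_set(losses):
    """Indices of Pareto optimal rows: no other row is <= in every
    objective and strictly < in at least one."""
    L = np.asarray(losses, dtype=float)
    return [
        i for i, li in enumerate(L)
        if not any(np.all(lj <= li) and np.any(lj < li)
                   for j, lj in enumerate(L) if j != i)
    ]

# Candidate (L_IRMv1, L_VREx) pairs: the middle one is dominated.
print(pareto_set([[1.0, 3.0], [2.0, 2.5], [1.5, 2.0]]))  # -> [0, 2]
```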

The results in Table 6 and Table 7 are not directly comparable with those in Table 1. Similar to previous experiments, for the stability of the MOO solver under heterogeneous objectives, we search the learning rate for VREx and Fishr from {0.01, 0.02, 0.04, 0.1, 0.2} at stage 2, while using a larger scope {0.1, 0.2, 0.4, 0.8, 1} for the other objectives.

Generality study of PAIR for IRMv1 with other objectives in COLOREDMNIST.

Generality study of PAIR for composite objectives in COLOREDMNIST.

A summary of datasets information from WILDS.

A summary of datasets statistics from WILDS.

The information of the referred paper and code.

General hyperparameter settings for the experiments on WILDS.

Hyperparameter settings of PAIR-o for the experiments on WILDS.

We implement our methods with PyTorch (Paszke et al., 2019). For the software and hardware configurations, we ensure consistent environments for each dataset. Specifically, we run the COLOREDMNIST experiments on Linux servers with NVIDIA RTX 3090Ti graphics cards with CUDA 11.3, 40-core Intel(R) Xeon(R) Silver 4114 CPUs @ 2.20GHz, 256 GB memory, and Ubuntu 18.04 LTS installed. For the WILDS and DOMAINBED experiments, we run on Linux servers with NVIDIA V100 graphics cards with CUDA 10.2.


G.3.2 TRAINING DOMAIN VALIDATION SET

OOD generalization performances with training domain validation set on COLOREDM-NIST.

OOD generalization performances with training domain validation set on PACS.

± 1.6 | 79.2 ± 1.0 | 97.2 ± 0.5 | 74.9 ± 2.6 | 83.5
DANN | 84.7 ± 1.8 | 75.8 ± 0.9 | 97.3 ± 0.1 | 72.3 ± 1.0 | 82.5
DANN ✓ | 86.5 ± 0.9 | 77.0 ± 1.8 | 97.0 ± 0.2 | 73.0 ± 0.5 | 83.3 (+0.7)
GroupDRO | 83.4 ± 1.7 | 77.1 ± 0.3 | 97.6 ± 0.2 | 78.2 ± 1.3 | 84.1
GroupDRO ✓ | 83.4 ± 1.7 | 78.3 ± 0.3 | 97.6 ± 0.2 | 78.2 ± 1.3 | 84.4

OOD generalization performances with training domain validation set on TERRAINCOG-NITA.

OOD generalization performances with test domain validation set on COLOREDMNIST.

OOD generalization performances with test domain validation set on PACS.

± 0.7 | 82.5 ± 0.8 | 97.3 ± 0.5 | 81.8 ± 0.7 | 87.0
DANN | 86.5 ± 0.8 | 79.9 ± 0.4 | 97.1 ± 0.1 | 75.3 ± 1.1 | 84.7
DANN ✓ | 87.0 ± 0.2 | 81.4 ± 0.7 | 96.8 ± 0.5 | 77.5 ± 1.3 | 85.7 (+2.2)
GroupDRO | 87.7 ± 0.4 | 82.1 ± 0.7 | 98.0 ± 0.2 | 79.6 ± 0.7 | 86.9
GroupDRO ✓ | 86.7 ± 0.3 | 83.2 ± 1.1 | 97.8 ± 0.1 | 81.4 ± 0.5 | 87.3

OOD generalization performances with test domain validation set on TERRAINCOGNITA.

ACKNOWLEDGEMENTS

We thank the reviewers for their valuable comments. This work was supported by CUHK direct grant 4055146. YZ and BH were supported by the NSFC Young Scientists Fund No. 62006202, Guangdong Basic and Applied Basic Research Foundation No. 2022A1515011652, RGC Early Career Scheme No. 22200720, and Tencent AI Lab Rhino-Bird Gift Fund. 


* Work done during an internship at Tencent AI Lab.

Code

WILDS (Koh et al., 2021) | v2.0.0 | https://wilds.stanford.edu/
Fish (Shi et al., 2022) | commit 333efa24572d99da0a4107ab9cc4af93a915d2a9 | https://github.com/YugeTen/fish
LISA (Yao et al., 2022) | commit bc424c47df6f072986b63cd906c44975bd34d9ff | https://github.com/huaxiuyao/LISA

Appendix of "Pareto Invariant Risk Minimization"

mainly) from the regions marked in red, and evaluate the predictions across the whole region from (-4, -4) to (4, 4). A predictor following the invariance defined in IRM (Arjovsky et al., 2019) requires the predictions to be independent of the spurious features within the overlapped invariant features; in this example, intuitively, it requires the colored lines to be perpendicular to the x-axis within [-2, 2]. (b) and (d) show the performances of ERM under the two sampling methods; it can be found that ERM fails to recover the causal invariance and incurs a high MSE loss.

X_2 is designed to be the spurious feature that can be controlled to be spuriously correlated with the label Y. The environments are synthesized according to different sampling methods. As shown in Fig. 10, we leverage two sampling methods: i) uniform sampling and ii) Gaussian sampling, where the latter is more difficult than the former. For uniform sampling, we uniformly sample the rectangle region {(-3, -3), (-2, 1)} as environment 1 and {(-1, 2), (3, 3)} as environment 2, shown as the red regions marked in Fig. 10(a). For Gaussian sampling, we sample from two Gaussian distributions: the first has center (-0.9, -2.2) with covariance matrix {(0.9, 0.11), (0.11, 0.1)}; the second has center (1, 2) with covariance matrix {(1, -0.3), (-0.3, 0.1)}, shown as the red regions marked in Fig. 10(c). Therefore, in these two examples, the invariant representation φ should only take X_1 and discard the spurious feature X_2 within the overlapped invariant features, i.e., [-2, 2]. As denoted by the different colors, the prediction produced by the invariant predictor following Definition C.1 is expected to be independent of X_2. In other words, the plotted lines need to be perpendicular to the x-axis within the overlapped invariant features [-2, 2]. We implement the predictor with a 3-layer perceptron that has a hidden dimension of 128.
We use the MSE loss and Adam (Kingma & Ba, 2015) to optimize the neural network. We sample 2,500 training data points from each environment and evaluate with 1,000 data points uniformly sampled across all regions. For a fair comparison, we train all algorithms for 10,000 epochs until convergence. Following the common practice (Gulrajani & Lopez-Paz, 2021), we use 150 annealing iterations for the OOD penalties of all methods. For IRMv1, VREx and IRMX, we search the penalty weights starting from 1e-4 and find they generically perform well with penalty weights from 1e-2 to 1e1. For PAIR, we search the relative preferences across 6 choices, (1, 1e4, 1e16), (1, 1e4, 1e12), (1, 1e6, 1e8), (1, 1e8, 1e4), (1, 1e4, 1e4), (1, 1e8, 1e8), and find that (1, 1e4, 1e12), (1, 1e8, 1e4), (1, 1e4, 1e4), (1, 1e8, 1e8) have lower validation losses. We also find that refreshing the optimizer after pretraining can bring a better performance of IRMv1 in COLOREDMNIST, while for VREx, the refreshing is not needed. For the implementation of IRMX, we change the penalty to be the sum of the IRMv1 and VREx losses and conduct the same hyperparameter search as for IRMv1 for a fair comparison. As for the implementation of PAIR, we use SGD with a momentum of 0.9 (Sutskever et al., 2013) after pretraining, to avoid the interference of Adam with the gradient direction and the convergence of the EPO (Mahapatra & Rajan, 2020) solver. Moreover, we also empirically find that SGD requires a larger learning rate (we search over two choices, i.e., 0.01 and 0.1) for approaching the preference direction. This is because of the design of the EPO solver: it first fits to the preference direction and then performs "pure" gradient descent, while the intrinsically conflicting directions pointed to by the objectives can make the loss surface steeper.
We leave an in-depth understanding of the above phenomenon and more sophisticated optimizer designs for more complex tasks and network architectures to future work (Zhao & Zhang, 2015; Zhou et al., 2020).

F.2 MORE DETAILS ABOUT ABLATION STUDIES

Comparison between PAIR-o and the linear weighting scheme under exhaustive parameter search. In the main paper, to investigate how PAIR-o can find a better OOD solution under objective conflicts, we first conduct an ablation study to compare the OOD performances of PAIR-o and the exhaustively tuned IRMX. Specifically, we tune both the IRMv1 and VREx penalty weights over a substantially larger scope, i.e., {1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6}. As for the pretraining epochs, we search from {0, 50, 100, 150, 200, 250}. The results of IRMX in COLOREDMNIST and the modified COLOREDMNIST are shown in Fig. 16(a) and (b), where we further enlarge the search scope of the penalty weights to {1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10, 1e11, 1e12}.

Published as a conference paper at ICLR 2023

Table 13: Hyperparameters, their default values and distributions for random search (Gulrajani & Lopez-Paz, 2021; Rame et al., 2021).

Condition        | Parameter                 | Default value | Random distribution
PACS /           | learning rate             | 0.00005       | 10^Uniform(-5, -3.5)
TERRAINCOGNITA   | batch size                | 32            | 2^Uniform(3, 5.5) if not DomainNet else 2^Uniform(3, 5)
                 | weight decay              | 0             | 10^Uniform(-6, -2)
                 | dropout                   | 0             | RandomChoice([0, 0.1, 0.5])
COLOREDMNIST     | learning rate             | 0.001         | 10^Uniform(-4.5, -3.5)
                 | batch size                | 64            | 2^Uniform(3, 9)
                 | weight decay              | 0             | 0
All              | steps                     | 5000          | 5000
Fishr            | regularization strength λ | 1000          | 10^Uniform(1, 4)
                 | ema γ                     | 0.95          | Uniform(0.9, 0.99)
                 | warmup iterations         | 1500          | Uniform(0, 5000)

As for the construction of the validation set, we test with the training domain validation set and the test domain validation set, as leave-one-out domain selection requires more runs and more computational resources than our limits allow. Specifically, to construct the validation set, the data from each domain is first split into 80% (for training and evaluation) and 20% (for validation and model selection). For the training domain validation set, the validation data consists of the 20% split from each training domain.
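The random search distributions in Table 13 can be sampled as in the following sketch (function names are hypothetical; 10^Uniform(a, b) denotes a log-uniform draw, following the DomainBed convention):

```python
import random

def sample_hparams(dataset, rng):
    """Draw one DomainBed-style hyperparameter configuration per Table 13."""
    if dataset in ("PACS", "TERRAINCOGNITA"):
        return {
            "lr": 10 ** rng.uniform(-5, -3.5),
            "batch_size": int(2 ** rng.uniform(3, 5.5)),  # (3, 5) for DomainNet
            "weight_decay": 10 ** rng.uniform(-6, -2),
            "dropout": rng.choice([0, 0.1, 0.5]),
            "steps": 5000,
        }
    if dataset == "COLOREDMNIST":
        return {
            "lr": 10 ** rng.uniform(-4.5, -3.5),
            "batch_size": int(2 ** rng.uniform(3, 9)),
            "weight_decay": 0,
            "steps": 5000,
        }
    raise ValueError(f"unknown dataset: {dataset}")

def sample_fishr_hparams(rng):
    """Fishr-specific hyperparameters (bottom rows of Table 13)."""
    return {
        "lambda": 10 ** rng.uniform(1, 4),
        "ema": rng.uniform(0.9, 0.99),
        "warmup_iters": rng.uniform(0, 5000),
    }
```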
For the test domain validation set, the validation data consists of the 20% split from each test domain. The whole evaluation is repeated 3 times; in each repeat, there are 20 samplings of hyperparameters from the distributions shown in Table 13. Therefore, there are 20 runs in each repeat, and 1 model is selected from the 20 runs. For the implementation of PAIR-s, we follow Algorithm 2. Since training domain validation accuracy tends to be a less reliable indicator than test domain validation accuracy, i.e., it reflects the OOD generalization performance worse due to its high similarity with the training data (Teney et al., 2021), during the selection within each run we filter out the models before the last 5 steps in COLOREDMNIST and the last 10 steps in PACS and TERRAINCOGNITA. During the selection within one repeat (across different runs), we use a percentage of 50% for step 9 in Algorithm 2 and finalize the selection according to the PAIR score. Except for GroupDRO and DANN, whose objective values tend to have higher variance and relatively low OOD robustness, we aggregate the models within each repeat by the validation accuracy. In contrast, for the test domain validation accuracy, we filter out the models before the last 5 steps for DANN and the last 10 steps for the others, according to the robustness of the objectives, during the selection within each run. During the selection within one repeat (across different runs), we directly adopt the validation accuracy to finalize the selected model.
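The within-repeat selection described above can be sketched as follows (a hypothetical helper; the actual PAIR score is the one defined in Algorithm 2 of the paper and is passed in here as a placeholder callable):

```python
def select_within_repeat(runs, pair_score, keep_percent=50):
    """Across-run selection in one repeat (a sketch of Algorithm 2, step 9):
    keep the top keep_percent% of runs by validation accuracy, then finalize
    the choice by the PAIR score. Each run is assumed to be a dict with at
    least a 'val_acc' entry."""
    ranked = sorted(runs, key=lambda r: r["val_acc"], reverse=True)
    k = max(1, round(len(ranked) * keep_percent / 100))
    return max(ranked[:k], key=pair_score)
```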
Note that Gulrajani & Lopez-Paz (2021) argue that test domain validation is more likely to be an invalid benchmarking methodology, since it requires access to the test domain, which is usually inaccessible in realistic applications. For the selection of loss values L, we use the values reported at each logging step, which are evaluated every 100 steps with a minibatch of the training data, listed as follows:
• ERM: N/A.
• IRM: ERM and IRMv1 losses (nll, penalty).
• GroupDRO: worst-group ERM loss (losses.min()).
• DANN: weighted ERM and domain discrimination loss (gen_loss).
• Fishr: ERM and Fishr penalty (nll, penalty).

G.3 FULL DOMAINBED RESULTS

In this section, we provide the full results of the DOMAINBED experiments. To begin with, we present the overall results on the three datasets, including the averages and the improvements of the worst domain accuracies, in Table 14 and Table 15. From the results we can see that PAIR-s consistently improves the OOD performance across all datasets and validation set options. Remarkably, in the most challenging setting, which uses the training domain validation set on COLOREDMNIST, PAIR-s improves the worst domain performances of IRMv1 and Fishr by a large margin of up to 14.3%. In the realistic dataset PACS, PAIR-s improves the worst domain performance of IRMv1 by a large margin of up to 7.3%. In TERRAINCOGNITA, PAIR-s improves the worst domain performance of DANN by a large margin of up to 3.1%. Besides the worst domain performance, PAIR-s improves the average domain performance by up to 1.0% and empowers the OOD methods to reach a new state of the art. When using the test domain validation set, since the validation set itself can reflect the OOD generalization performance, the improvements can be lower; for OOD objectives with relatively low robustness, the worst domain performance can also be lower. We report the detailed results for each domain, together with variances, in the next section.

