OUT-OF-DISTRIBUTION GENERALIZATION ANALYSIS VIA INFLUENCE FUNCTION Anonymous

Abstract

The mismatch between training and target data is one major challenge for current machine learning systems. When training data is collected from multiple domains and the set of target domains includes all training domains as well as new ones, we face an Out-of-Distribution (OOD) generalization problem that aims to find the model with the best OOD accuracy. One common definition of OOD accuracy is worst-domain accuracy. In general, the set of target domains is unknown, and the worst-case target domain may be unobserved when the number of available domains is limited. In this paper, we show that the worst accuracy over the observed domains may dramatically fail to identify the OOD accuracy. To this end, we introduce the influence function, a classical tool from robust statistics, into the OOD generalization problem and suggest using the variance of the influence function to monitor the stability of a model across training domains. We show that the accuracy on test domains together with the proposed index can help us discern whether OOD algorithms are needed and whether a model achieves good OOD generalization.

1. INTRODUCTION

Most machine learning systems assume that training and test data are independently and identically distributed, which does not always hold in practice (Bengio et al. (2019)). Consequently, performance is often greatly degraded when the test data come from a different domain (distribution). A classical example is the problem of identifying cows and camels (Beery et al. (2018)), where empirical risk minimization (ERM, Vapnik (1992)) may classify images by background color instead of object shape. As a result, when the test domain is "out-of-distribution" (OOD), e.g. when the background color is changed, performance drops significantly. OOD generalization aims to obtain a predictor that is robust to this distribution shift. Suppose that we have training data collected from $m$ domains:
$$\mathcal{S} = \{S^e : e \in \mathcal{E}_{tr}\}, \quad |\mathcal{E}_{tr}| = m, \quad S^e = \{z^e_1, z^e_2, \dots, z^e_{n_e}\} \text{ with } z^e_i \sim P^e, \tag{1}$$
where $P^e$ is the distribution corresponding to domain $e$, $\mathcal{E}_{tr}$ is the set of all available domains (including validation domains), and $z^e_i$ is a data point. The OOD problem we consider is to find a model $f_{OOD}$ such that
$$f_{OOD} = \arg\min_f \sup_{P^e \in \mathcal{E}_{all}} \ell(f, P^e), \tag{2}$$
where $\mathcal{E}_{all}$ is the set of all target domains and $\ell(f, P^e)$ is the expected loss of $f$ on the domain $P^e$. Recent algorithms address this OOD problem by recovering invariant (causal) features and building the optimal model on top of these features, such as Invariant Risk Minimization (IRM, Arjovsky et al. (2019)), Risk Extrapolation (REx, Krueger et al. (2020)), Group Distributionally Robust Optimization (gDRO, Sagawa et al. (2019)) and Inter-domain Mixup (Mixup, Xu et al. (2020); Yan et al. (2020); Wang et al. (2020)). Most works evaluate on Colored MNIST (see Section 5.1 for details), where the worst-domain accuracy over $\mathcal{E}_{all}$ can be obtained directly.
Gulrajani & Lopez-Paz (2020) assembled many algorithms and multi-domain datasets and found that OOD algorithms cannot outperform ERM on some domain generalization tasks, e.g. VLCS (Torralba & Efros (2011)) and PACS (Li et al. (2017)). This is not surprising, since these tasks only require high performance on certain domains, while an OOD algorithm is expected to learn truly invariant features and excel on a large set of target domains $\mathcal{E}_{all}$. This phenomenon is described as the "accuracy-vs-invariance trade-off" in Akuzawa et al. (2019). Two questions arise in the min-max problem (2). First, previous works assume that there is sufficient diversity among the domains in $\mathcal{E}_{all}$. The supremum of $\ell(f, P^e)$ may then be much larger than the average, which implies that ERM may fail to discover $f_{OOD}$. But in reality, we do not know whether this is true. If not, the distribution of $\ell(f, P^e)$ is concentrated around its expectation, and ERM is sufficient to find an invariant model for $\mathcal{E}_{all}$. Therefore, we call for a method to judge whether an OOD algorithm is needed. Second, how do we judge a model's OOD performance? Traditionally, we consider test domains $\mathcal{E}_{test} \subset \mathcal{E}_{tr}$ and use the worst-domain accuracy over $\mathcal{E}_{test}$ (which we call test accuracy) to approximate the OOD accuracy. However, test accuracy is a biased estimate of the OOD accuracy unless $\mathcal{E}_{tr}$ is close to $\mathcal{E}_{all}$. More seriously, it may be irrelevant or even negatively correlated to the OOD accuracy. This phenomenon is not uncommon, especially when there are features that are spurious in $\mathcal{E}_{all}$ but show a strong correlation with the target in $\mathcal{E}_{tr}$. We give a toy example in Colored MNIST where the test accuracy fails to approximate the OOD accuracy; for details, please refer to Section 5.1 and Appendix A.4. We choose three domains from Colored MNIST and use cross-validation (Gulrajani & Lopez-Paz (2020)) to select models, i.e.
we take turns selecting a domain $S \in \mathcal{E}_{tr}$ as the test domain, train on the rest, and select the model with the maximal average test accuracy. Figure 1 shows the comparison between ERM and IRM. One can see that, no matter which domain is the test domain, the ERM model uniformly outperforms the IRM model on the test domain. However, the IRM model achieves consistently better OOD accuracy. The shortcomings of the test accuracy here are obvious, regardless of whether cross-validation is used. In short, naive use of the test accuracy may result in a non-OOD model.

Figure 1: Experiments in Colored MNIST showing that test accuracy is not enough to reflect a model's OOD accuracy. The top-left panel shows the test accuracy of ERM and IRM. The other three panels present the relationship between test accuracy (x-axis) and OOD accuracy (y-axis) in three setups.

To address this obstacle, we hope to find a metric that correlates better with a model's OOD property, even when $\mathcal{E}_{tr}$ is much smaller than $\mathcal{E}_{all}$ and the "worst" domain remains unseen. Without any assumption on $\mathcal{E}_{all}$, this goal is unrealistic. Therefore, we assume that features that are invariant across $\mathcal{E}_{tr}$ are also invariant across $\mathcal{E}_{all}$. This assumption is necessary; otherwise, the only thing we can do is collect more domains. We thus need to focus on what features the model has learnt. Specifically, we want to check whether the model learns invariant features and avoids varying features. The influence function (Cook & Weisberg (1980)) serves this purpose. The influence function was proposed to measure the parameter change when a data point is removed or upweighted by a small perturbation (details in Section 3.2). Modified to the domain level, it measures the influence of a domain instead of a data point on the model. Note that we are not emulating the change of the parameters when a domain is removed. Instead, we care precisely about upweighting the domain by $\delta \to 0^+$ (to be specified later).
Based on this, the variance of the influence function allows us to measure the OOD property and overcome the obstacle. Contributions. We summarize our contributions here: (i) We extend the influence function to the domain level and propose the index $V_{\gamma|\hat\theta}$ (formula (6)) based on the influence function of the model $f_{\hat\theta}$. Our index can measure the OOD extent of the available domains, i.e. how different these domains (distributions) are. This measurement provides a basis for deciding whether to adopt an OOD algorithm and whether to collect more diverse domains. See Section 4.1 and Section 5. We organize our paper as follows: Section 2 reviews related work and Section 3 introduces the preliminaries of OOD methods and the influence function. Section 4 presents our proposal and a detailed analysis. Section 5 shows our experiments. The conclusion is given in Section 6.

2. RELATED WORK

The mismatch between the development dataset and the target domain is one major challenge in machine learning (Castro et al. (2020); Kuang et al. (2020)). One line of work (Li et al. (2018); Koyama & Yamaguchi (2020)) uses a discriminator to look for features that are independent of domains and uses these features for further prediction. The influence function is a classic method from the robust statistics literature (Robins et al. (2008; 2017); Van der Laan et al. (2003); Tsiatis (2007)). It can be used to track the impact of a training sample on the prediction. Koh & Liang (2017) proposes a second-order optimization technique to approximate the influence function, and verifies the method under different assumptions on the empirical risk, ranging from strictly convex and twice-differentiable to non-convex and non-differentiable losses. Koh et al. (2019) also estimates the effect of removing a subgroup of training points via the influence function, and finds that the approximation computed by the influence function is correlated with the actual effect. The influence function has been used in many machine learning tasks. Cheng et al. (2019) proposes an explanation method, Fast Influence Analysis, that employs the influence function on latent factor models to address the lack of interpretability of collaborative filtering approaches for recommender systems. Cohen et al. (2020) uses the influence function to detect adversarial attacks. Ting & Brochu (2018) proposes an asymptotically optimal sampling method via an asymptotically linear estimator and the associated influence function. Alaa & Van Der Schaar (2019) develops a model validation procedure that estimates the estimation error of causal inference methods. Besides, Fang et al. (2020) leverages the influence function to select a subset of normal users who are influential to the recommendations.

3. PRELIMINARIES

3.1. ERM, IRM AND REX

In this section, we give some notation and introduce some recent OOD methods. Recall the multiple-domain setup (1) and the OOD problem (2). For a domain $P^e$ and a hypothetical model $f$, the population loss is $\ell(f, P^e) = \mathbb{E}_{z\sim P^e}[L(f,z)]$, where $L(f,z)$ is the loss function on $z$. The empirical loss, which is the objective of ERM, is $\ell(f, \mathcal{S}) = \frac{1}{m}\sum_{e\in\mathcal{E}_{tr}} \ell(f, S^e)$ with $\ell(f, S^e) = \frac{1}{n_e}\sum_{i=1}^{n_e} L(f, z^e_i)$. Recent OOD methods propose novel regularized objective functions of the form
$$L(f, \mathcal{S}) = \ell(f, \mathcal{S}) + \lambda R(f, \mathcal{S}), \tag{3}$$
in order to discover $f_{OOD}$ in (2). Here $R(f,\mathcal{S})$ is a regularization term and $\lambda$ is the tuning parameter which controls the degree of penalty. Note that ERM is a special case obtained by setting $\lambda = 0$. For simplicity, we will use $L(f, \mathcal{S})$ to denote the total loss when there is no ambiguity. Arjovsky et al. (2019) focuses on the stability of $f_{OOD}$ and considers the IRM regularization:
$$R(f, \mathcal{S}) = \sum_{e\in\mathcal{E}_{tr}} \big\| \nabla_w \ell(w \cdot f, S^e)\big|_{w=1.0} \big\|^2, \tag{4}$$
where $w$ is a scalar and fixed "dummy" classifier. Arjovsky et al. (2019) shows that the scalar fixed classifier $w$ is sufficient to monitor invariance and corresponds to the idealistic IRM problem, which decomposes the entire predictor into a data representation and one shared optimal top classifier for all training domains. On the other hand, Krueger et al. (2020) encourages uniform performance of $f_{OOD}$ and proposes the V-REx penalty:
$$R(f, \mathcal{S}) = \sum_{e\in\mathcal{E}_{tr}} \big( \ell(f, S^e) - \ell(f, \mathcal{S}) \big)^2. \tag{5}$$
Krueger et al. (2020) derives the invariant prediction from robustness to spurious features and finds that REx is more robust than group distributional robustness (Sagawa et al. (2019)). In this work, we also decompose the entire predictor into a feature extractor and a classifier on top of the learnt features. As we will see, different from Arjovsky et al. (2019) and Krueger et al. (2020), we directly monitor the invariance of the top model.
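The regularized objectives above are straightforward to compute from per-domain quantities. The sketch below (helper names are ours, not from the paper) illustrates the V-REx penalty (5), the IRM penalty (4) for the squared loss, where the dummy-classifier gradient has the closed form $\nabla_w \mathbb{E}[(y - w f(x))^2]\big|_{w=1} = -2\,\mathbb{E}[f(x)(y - f(x))]$, and the total objective (3):

```python
import numpy as np

def rex_penalty(domain_risks):
    # V-REx penalty (5): sum over domains of (risk_e - mean risk)^2.
    risks = np.asarray(domain_risks, dtype=float)
    return float(np.sum((risks - risks.mean()) ** 2))

def irm_penalty(preds_by_domain, targets_by_domain):
    # IRM penalty (4) for the squared loss with a scalar dummy classifier w:
    # grad_w E[(y - w*f(x))^2] at w = 1 is -2*E[f(x)*(y - f(x))]; sum the squares.
    # General losses would need autograd; this analytic form is an assumption
    # made for the squared loss only.
    total = 0.0
    for f_x, y in zip(preds_by_domain, targets_by_domain):
        f_x, y = np.asarray(f_x, float), np.asarray(y, float)
        grad_w = -2.0 * np.mean(f_x * (y - f_x))
        total += grad_w ** 2
    return total

def total_objective(domain_risks, penalty, lam):
    # Regularized objective (3): mean risk + lambda * penalty (ERM: lam = 0).
    return float(np.mean(domain_risks)) + lam * penalty
```

Setting `lam = 0` recovers the plain ERM objective, matching the remark that ERM is the special case $\lambda = 0$.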

3.2. INFLUENCE FUNCTION AND GROUP EFFECT

Consider a parametric hypothesis $f = f_\theta$ and the corresponding solution $\hat\theta = \arg\min_\theta L(f_\theta, \mathcal{S})$. By a quadratic approximation of $L(f_\theta, \mathcal{S})$ around $\hat\theta$, the influence function takes the form
$$\mathrm{IF}(\hat\theta, z) = -H_{\hat\theta}^{-1} \nabla_\theta L(f_{\hat\theta}, z), \quad H_{\hat\theta} = \nabla^2_\theta L(f_{\hat\theta}, \mathcal{S}).$$
When the sample size of $\mathcal{S}$ is sufficiently large, the parameter change due to removing a data point $z$ can be approximated by $-\mathrm{IF}(\hat\theta, z)/\sum_{e\in\mathcal{E}_{tr}} |S^e|$ without retraining the model. Here $|S^e| = n_e$ stands for the cardinality of the set $S^e$. Furthermore, Koh et al. (2019) shows that the influence function can also predict the effect of removing a large group of training points $Z = \{z_1, \dots, z_k\}$, even though the model changes significantly. The parameter change due to removing the group can be approximated via
$$\mathrm{IF}(\hat\theta, Z) = -H_{\hat\theta}^{-1} \nabla_\theta \frac{1}{|Z|} \sum_{z\in Z} L(f_{\hat\theta}, z).$$
Motivated by the work of Koh et al. (2019), we introduce the influence function into the OOD problem to address our obstacles.
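The group influence formula can be evaluated exactly for a small linear regression, where the Hessian of the squared loss is available in closed form. A minimal sketch (our own helper, not code from the paper):

```python
import numpy as np

def group_influence(X, y, theta, group_idx, damping=0.0):
    """Group influence IF(theta_hat, Z) = -H^{-1} * grad of the mean loss
    over the group Z = X[group_idx], for the squared loss
    L(theta, (x, y)) = (y - x^T theta)^2. An optional damping term can be
    added when H is ill-conditioned."""
    n, d = X.shape
    # Hessian of the empirical risk (1/n) sum (y - x^T theta)^2 is (2/n) X^T X.
    H = 2.0 * X.T @ X / n + damping * np.eye(d)
    Xg, yg = X[group_idx], y[group_idx]
    grad = -2.0 * Xg.T @ (yg - Xg @ theta) / len(group_idx)  # mean gradient over Z
    return -np.linalg.solve(H, grad)
```

For instance, with two points $x = 1$, $y \in \{1, 3\}$ and the least-squares fit $\hat\theta = 2$, the first point pulls the parameter downward, so its group influence is negative.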

4.1. INFLUENCE OF DOMAINS

We decompose a parametric hypothesis $f_\theta(x)$ into a top model $g$ and a feature extractor $\Phi$, i.e. $f_\theta(x) = g(\Phi(x, \beta), \gamma)$ with $\theta = (\gamma, \beta)$. Such a decomposition coincides with the common understanding of DNNs: a DNN extracts features and builds a top model on the extracted features. When upweighting a domain $e$ by a small perturbation $\delta$, we do not upweight the regularization term, i.e.
$$L_+(\theta, \mathcal{S}, \delta) = L(\theta, \mathcal{S}) + \delta \cdot \ell(f_\theta, S^e),$$
since the stability across different domains, which the regularization encourages, should not depend on the sample size of a domain. For a learnt model $f_{\hat\theta}$, fixing the feature extractor $\Phi$, i.e. fixing $\beta = \hat\beta$, the change of the top model $g$ caused by upweighting the domain is
$$\mathrm{IF}(\gamma, S^e \mid \hat\theta) := \lim_{\delta\to 0^+} \frac{\Delta\hat\gamma}{\delta} = -H_{\gamma}^{-1} \nabla_\gamma \ell(f_{\hat\theta}, S^e), \quad e \in \mathcal{E}_{tr},$$
where $H_\gamma = \nabla^2_\gamma L(f_{\hat\theta}, \mathcal{S})$ and we assume $L$ is twice-differentiable in $\gamma$. Please see Appendix A.3 for the detailed derivation and why $\beta$ should be fixed. For a regularized method, e.g. IRM and REx, the influence of the regularization term is reflected in $H_\gamma$ and in the learnt model $f_{\hat\theta}$. As mentioned above, $\mathrm{IF}(\gamma, S^e \mid \hat\theta)$ measures the change of the model caused by upweighting domain $e$. Therefore, if $g(\Phi, \gamma)$ is invariant across domains, the entire model $f_{\hat\theta}$ treats all domains equally, and a small perturbation on different domains should cause the same model change. This leads to our proposal.
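A minimal sketch of the domain-level influence under simplifying assumptions of ours (a frozen feature extractor, a linear top model $g(\Phi, \gamma) = \Phi^\top\gamma$ and squared loss): the Hessian $H_\gamma$ is computed on the pooled data, while the gradient is taken per domain.

```python
import numpy as np

def domain_influence(features_by_domain, y_by_domain, gamma, damping=1e-6):
    """IF(gamma, S^e | theta_hat) = -H_gamma^{-1} grad_gamma loss(S^e), with
    the feature extractor frozen: each domain e contributes a feature matrix
    Phi = Phi(X^e) and targets y^e; the top model is linear under squared
    loss. H_gamma is the Hessian of the pooled loss over all domains."""
    d = len(gamma)
    H = np.zeros((d, d))
    n_total = 0
    for Phi in features_by_domain:
        H += 2.0 * Phi.T @ Phi          # per-point Hessian contributions
        n_total += len(Phi)
    H = H / n_total + damping * np.eye(d)
    influences = []
    for Phi, y in zip(features_by_domain, y_by_domain):
        grad = -2.0 * Phi.T @ (y - Phi @ gamma) / len(Phi)  # per-domain gradient
        influences.append(-np.linalg.solve(H, grad))
    return influences
```

If the top model fits every domain equally well, the per-domain gradients (and hence the influences) coincide, which is exactly the invariance the index in Section 4.2 quantifies.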

4.2. PROPOSED INDEX AND ITS UTILITY

Based on the domain-level influence function $\mathrm{IF}(\gamma, S^e \mid \hat\theta)$, we propose our index to measure the fluctuation of the parameter change when different domains are upweighted:
$$V_{\gamma|\hat\theta} := \ln \big\| \mathrm{Cov}_{e\in\mathcal{E}_{tr}} \big( \mathrm{IF}(\gamma, S^e \mid \hat\theta) \big) \big\|_2. \tag{6}$$
High accuracy on the test domains means the model truly learns some features useful for those domains. However, this is not enough, since we do not know whether the useful features are invariant across $\mathcal{E}_{all}$ or just spurious features on $\mathcal{E}_{test}$. On the other hand, avoiding varying features means that different domains are effectively identical to the learnt model, so, by the argument in Section 4.1, $V_{\gamma|\hat\theta}$ should be small. Combining these, we arrive at our proposal: if a learnt model $f_{\hat\theta}$ manages to simultaneously achieve a small $V_{\gamma|\hat\theta}$ and high accuracy over $\mathcal{E}_{test}$, it should have good OOD accuracy. We prove our proposal in a simple but illuminating case, and we conduct various experiments (Section 5) to support it. Several issues should be clarified. First, not all OOD problems demand that models learn invariant features: for example, the set of target domains may be small, so that the varying features are always strongly correlated with the labels, or the objective may be the mean accuracy over $\mathcal{E}_{all}$ rather than the worst-domain accuracy. To our concern, however, we regard the OOD problem in (2) as a bridge to causal discovery. Thus the set of target domains is large, and such "weak" OOD problems are outside our consideration. To a large extent, invariant features are still the major target, and our proposal remains a good criterion for a model's OOD property. Second, we admit that there is a gap between being stable on $\mathcal{E}_{tr}$ (small $V_{\gamma|\hat\theta}$) and avoiding all spurious features on $\mathcal{E}_{all}$. However, to our knowledge, for features that vary in $\mathcal{E}_{all}$ but are invariant in $\mathcal{E}_{tr}$, demanding that a model avoid them is somewhat unrealistic.
Therefore, we make a step forward by measuring whether the learnt model successfully avoids features that vary across $\mathcal{E}_{tr}$; an index for features varying over $\mathcal{E}_{all}$ is left to future work. The shuffled $\tilde V_{\gamma|\hat\theta}$. As mentioned above, a smaller $V_{\gamma|\hat\theta}$ means stronger stability across $\mathcal{E}_{tr}$, and hence should indicate better OOD accuracy. However, the proposed metric depends on the dataset $\mathcal{S}$ and the learnt model $f_{\hat\theta}$, so there is no uniform baseline to check whether the metric is "small" enough. To this end, we propose a baseline value obtained by shuffling the multi-domain data. Consider pooling all data points in $\mathcal{S}$ and randomly redistributing them into $m$ new synthetic domains $\tilde{\mathcal{S}} := \{\tilde S^1, \tilde S^2, \dots, \tilde S^m\}$. We compute the shuffled version of the metric for a learnt model $f_{\hat\theta}$ over the shuffled data $\tilde{\mathcal{S}}$:
$$\tilde V_{\gamma|\hat\theta} := \ln \big\| \mathrm{Cov}_{e\in\mathcal{E}_{tr}} \big( \mathrm{IF}(\gamma, \tilde S^e \mid \hat\theta) \big) \big\|_2, \tag{7}$$
and refer to $V_{\gamma|\hat\theta}$ and $\tilde V_{\gamma|\hat\theta}$ as the standard and shuffled versions of the metric, respectively. For any algorithm that obtains relatively good test accuracy, if $V_{\gamma|\hat\theta}$ is much larger than $\tilde V_{\gamma|\hat\theta}$, then $f_{\hat\theta}$ has learnt features that vary across $e \in \mathcal{E}_{tr}$ and cannot treat the domains in $\mathcal{E}_{tr}$ equally. This implies that $f_{\hat\theta}$ may not be an invariant predictor over $\mathcal{E}_{all}$. Otherwise, if the two values are similar, the model has avoided varying features in $\mathcal{E}_{tr}$ and may be invariant across $\mathcal{E}_{tr}$: either the model captures the invariance over diverse domains, or the domains are not diverse at all. Note that this process is suitable for any algorithm, hence providing a baseline for judging whether $V_{\gamma|\hat\theta}$ is small. Here we also obtain a method to judge whether an OOD algorithm is needed. Consider $f_{\hat\theta}$ learnt by ERM. If $V_{\gamma|\hat\theta}$ is relatively larger than $\tilde V_{\gamma|\hat\theta}$, then ERM fails to avoid varying features, and one should consider an OOD algorithm to achieve better OOD generalization. Otherwise, ERM is enough, and any attempt to achieve better OOD accuracy should start with finding more domains instead of using OOD algorithms.
This coincides with the experiments in Gulrajani & Lopez-Paz (2020) (Section 5.2); our understanding is that the domains in $\mathcal{S}$ there are similar. The difference between the shuffled and standard versions of the metric therefore reflects how many varying features a learnt model uses. We show how to use the two versions of $V_{\gamma|\hat\theta}$ in Section 5.1.1 and Section 5.2.
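The two versions of the metric can be sketched as follows: the index is the log spectral norm of the covariance of the per-domain influence vectors, and the baseline comes from pooling and redistributing the data before recomputing the influences (the redistribution helper is our own illustration, assuming each domain is stored as an array of points):

```python
import numpy as np

def v_index(influences):
    """V = ln || Cov_e( IF(gamma, S^e | theta_hat) ) ||_2, the log spectral
    norm of the covariance of the per-domain influence vectors (formula (6))."""
    IF = np.stack(influences)                      # shape (m, d)
    centered = IF - IF.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(IF)          # (d, d) covariance over domains
    return np.log(np.linalg.norm(cov, ord=2))      # ord=2: spectral norm

def shuffled_domains(domains, rng):
    """Pool all points and redistribute them into m synthetic domains of the
    original sizes; recomputing the influences on these gives the baseline
    (formula (7))."""
    sizes = [len(d) for d in domains]
    pooled = np.concatenate(domains)
    pooled = pooled[rng.permutation(len(pooled))]
    return np.split(pooled, np.cumsum(sizes)[:-1])
```

Identical influence vectors give a zero covariance and hence $V \to -\infty$, matching the intuition that an invariant model treats all domains equally.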

4.3. INFLUENCE CALCULATION

There is a question surrounding the influence function: how to efficiently calculate and invert the Hessian? Koh & Liang (2017) suggests Conjugate Gradient and stochastic estimation to solve this problem. However, when $\hat\theta$ is obtained by running SGD, it hardly arrives at a global minimum. Although adding a damping term (i.e. letting $\hat H_\theta = H_\theta + \lambda I$) can moderately alleviate the problem by restoring convexity, for a large neural network with non-linear activation functions like ReLU this may still work poorly, since the damping term required for the transformation is so large that it significantly distorts the result. Most importantly, the eigenvalues of the Hessian vary enormously, making the influence function calculation slow and inaccurate (Basu et al. (2020)). In our metric, we circumvent the problem by excluding most parameters $\beta$ and directly calculating the Hessian with respect to $\gamma$ to obtain an accurate influence function. This modification not only speeds up the calculation but also coincides with our expectation: that an OOD algorithm should learn invariant features does not mean that the influence function of all parameters should be identical across domains. For example, if $g(\Phi)$ is to extract the same features in different domains, the influence function on $\Phi(\cdot)$ should differ. Therefore, if we used all parameters to calculate the influence, given that $\gamma$ is relatively small compared with $\beta$, the information about learnt features carried by $\gamma$ would be hard to capture. On the contrary, considering only the influence of the top model manifests the influence of different domains at the feature level, thus enabling us to achieve our goal. As our experiments show, after this modification the influence calculation can be 2000 times faster, and the utility (correlation with the OOD property) can be even higher.
One may not be surprised by this, given the huge number of parameters in the embedding model $\Phi(\cdot)$: they slow down the calculation and overshadow the top model's influence value.
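For completeness, the conjugate-gradient route of Koh & Liang (2017), which our restriction to $\gamma$ largely renders unnecessary, solves $Hx = g$ using only Hessian-vector products, never forming $H^{-1}$. A generic sketch (our own implementation of the standard algorithm):

```python
import numpy as np

def conjugate_gradient(hvp, g, tol=1e-10, max_iter=100):
    """Solve H x = g for symmetric positive-definite H, given only a
    Hessian-vector product callable hvp(v) = H @ v."""
    x = np.zeros_like(g)
    r = g - hvp(x)        # residual
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

When $H$ is only the small top-model Hessian $H_\gamma$, a direct solve is usually cheaper; conjugate gradient matters when $H$ covers all parameters.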

5. EXPERIMENT

In this section, we experimentally show that: (1) a model $f_{\hat\theta}$ reaches a small $V_{\gamma|\hat\theta}$ if it has a good OOD property, while a non-OOD model does not; (2) the metric $V_{\gamma|\hat\theta}$ provides additional information on the stability of a learnt model, which overcomes the weakness of the test accuracy; (3) the comparison of $V_{\gamma|\hat\theta}$ and $\tilde V_{\gamma|\hat\theta}$ can check whether a better OOD algorithm is needed. We consider experiments on a Bayesian network, Colored MNIST and VLCS. The synthetic data generated by the Bayesian network includes domain-dependent noise and fake associations between features and response. For Colored MNIST, we already know that the digit is the causal feature and the color is non-causal; these known causal relationships help us determine the worst domain and obtain the OOD accuracy. VLCS is a real dataset, on which we show the utility of $V_{\gamma|\hat\theta}$ step by step. Due to space limitations, we put the Bayesian network experiments in the appendix. Generally, cross-validation (Gulrajani & Lopez-Paz (2020)) is used to judge a model's OOD property; in the introduction, we have already shown that leave-one-domain-out cross-validation may fail to discern OOD properties. We also consider another two potential competitors: conditional mutual information and the IRM penalty. The comparison between our metric and the two competitors is postponed to the appendix.

5.1. COLORED MNIST

Figure 2: The index $V_{\gamma|\hat\theta}$ is highly correlated with $x$. The plot contains 501 learnt ERM models with $x = 2\times 10^{-4}\, i$, $i = 0, 1, \dots, 500$. The dashed line is the baseline value when the difference between domains is eliminated by pooling and redistributing the training data. The blue solid line is the linear regression of $x$ versus $V_{\gamma|\hat\theta}$.

Colored MNIST (Arjovsky et al. (2019)) introduces a synthetic binary classification task. The images are colored according to their label, making color a spurious feature for predicting the label. Specifically, for a domain $e$, we assign a preliminary binary label $\tilde y = 1_{\text{digit} \le 4}$ and randomly flip $\tilde y$ with probability $p = 0.25$ to obtain the label; the color is then generated from the label with a domain-dependent flip probability $p^e$. The spread of $p^e$ across the training domains is controlled by $x \in [0.0, 0.1]$, which is therefore positively related to the diversity among the training domains. If $x$ is zero, all data points are generated from the same domain ($p^e = 0.2$), so the learning task on $\mathcal{E}_{tr}$ is not an OOD problem. On the contrary, a larger $x$ means that the training domains are more diverse. We repeat the ERM training 501 times. Given each learnt model $f_{\hat\theta}$ and the training data, we compute $V_{\gamma|\hat\theta}$ and check the correlation between $V_{\gamma|\hat\theta}$ and $x$. Figure 2 presents the results: our index $V_{\gamma|\hat\theta}$ is highly correlated with $x$, with a Pearson coefficient of 0.9869 and a Spearman coefficient of 0.9873. Also, the baseline value of $V_{\gamma|\hat\theta}$ for a model learnt on the same training domains ($\tilde{\mathcal{S}}$ in Section 4.2) can be derived from the raw data by pooling and redistributing all data points; we mark it by the black dashed line. If $V_{\gamma|\hat\theta}$ is much higher than this baseline, indicating that $x$ is not small, an OOD algorithm should be considered if better OOD generalization is demanded. Otherwise, the present algorithm (like ERM) is sufficient. The results coincide with our expectation that $V_{\gamma|\hat\theta}$ can discern whether the $P^e$ are different.
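The domain construction can be sketched as follows, following the Colored MNIST recipe of Arjovsky et al. (2019): the preliminary label is flipped with probability 0.25, and the color is the label flipped with a domain-dependent probability $p^e$ (the helper is our own illustration, not the paper's code):

```python
import numpy as np

def color_domain(digits, p_e, rng):
    """Build one Colored MNIST domain: preliminary label y_tilde = 1[digit <= 4]
    is flipped with probability 0.25 to give the label y; the color matches y
    except with probability p_e, so color is a spurious feature whose
    reliability varies by domain."""
    y_tilde = (digits <= 4).astype(int)
    y = np.where(rng.random(len(digits)) < 0.25, 1 - y_tilde, y_tilde)
    color = np.where(rng.random(len(digits)) < p_e, 1 - y, y)
    return y, color
```

With $p^e = 0$ the color determines the label perfectly, which is exactly why an ERM model is tempted to rely on it.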

5.1.2. RELATIONSHIP BETWEEN V AND OOD ACCURACY

In this section, we use an experiment to support the proposal in Section 4.2. As proposed, if a model shows high test accuracy and a small $V_{\gamma|\hat\theta}$ simultaneously, it captures invariant features and avoids varying features, so it deserves to be an OOD model. In this experiment, we consider models with high test accuracy and show that a smaller $V_{\gamma|\hat\theta}$ generally corresponds to better OOD accuracy, which supports our proposal. Consider two setups: $p^e \in \{0.0, 0.1\}$ and $p^e \in \{0.1, 0.15, 0.2, 0.25, 0.3\}$. We implement IRM and REx with different penalties (note that ERM corresponds to $\lambda = 0$) to check the relationship between $V_{\gamma|\hat\theta}$ and OOD accuracy. For IRM and REx, we run 190 epochs of pre-training with $\lambda = 1$ and use early stopping to prevent over-fitting. With this technique, all models successfully achieve good test accuracy (within 0.1 of the oracle accuracy) and meet our requirement. Figure 3 presents the results. We can see that $V_{\gamma|\hat\theta}$ is highly correlated with OOD accuracy for IRM and REx, with the absolute value of the Pearson coefficient never less than 0.8417. Models learnt with larger $\lambda$ present better OOD properties, learning fewer varying features and showing a smaller $V_{\gamma|\hat\theta}$. The results are consistent with our proposal, except that when $\lambda$ is large in IRM, $V_{\gamma|\hat\theta}$ is slightly unstable. We have carefully examined this phenomenon and found that it is caused by computational instability when inverting a Hessian with eigenvalues quite close to 0. This unstable inversion happens with low probability and can be addressed by repeating the experiment once or twice.

5.2. VLCS

In this section, we implement the proposed metric for 4 algorithms: ERM, gDRO, Mixup and IRM on the VLCS image dataset, which is widely used for domain generalization. We emulate a real scenario with $\mathcal{E}_{all} = \{V, L, C, S\}$ and $\mathcal{E}_{tr} = \mathcal{E}_{all}\setminus\{S\}$. As in Gulrajani & Lopez-Paz (2020), we use the "training-domain validation set" method, i.e.
we split a validation set from each $S \in \mathcal{E}_{tr}$, and the test accuracy is defined as the average accuracy over the three validation sets. Note that our goal is to use the test accuracy and $V_{\gamma|\hat\theta}$ to measure OOD generalization, rather than to tune for SOTA performance on the unseen domain $\{S\}$. Therefore, we do not apply any model selection method and just use the default hyper-parameters in Gulrajani & Lopez-Paz (2020).

5.2.1. STEP 1: TEST ACCURACY COMPARISON

For each algorithm, we run the naive training process 12 times and report the average test accuracy in Table 1. Before calculating $V_{\gamma|\hat\theta}$, the learnt model should at least achieve a good test accuracy; otherwise, there is no need to discuss its OOD performance, since the OOD accuracy is no larger than the test accuracy. In the table, the test accuracies of ERM, Mixup and gDRO are good, but that of IRM is not, so IRM is eliminated. If an algorithm fails to reach high test accuracy, one should first change the hyper-parameters until a relatively high test accuracy is observed. The standard and shuffled metrics are reported in Table 4. For ERM and Mixup, the two values are nearly the same. In this case, we expect that the ERM and Mixup models are invariant and should have relatively high OOD accuracy, so no further algorithm is needed. For gDRO, we can clearly see that $\tilde V_{\gamma|\hat\theta}$ is uniformly smaller than $V_{\gamma|\hat\theta}$. Therefore, gDRO models do not treat different domains equally, and hence we predict that their OOD accuracy will be relatively low.

5.2.2. STEP 2: SHUFFLE AND STANDARD METRIC COMPARISON

In this case, one who starts with gDRO should turn to other algorithms if better OOD performance is demanded. Note that, in the whole process, we know nothing about $\{S\}$, so the OOD accuracy is unseen. However, from the above analysis, we conclude: (1) in this setting, ERM and Mixup are better than gDRO; (2) one who uses gDRO can turn to other algorithms (like Mixup) for better OOD performance; (3) one who uses ERM should consider collecting more environments to further improve OOD performance. This finishes the judgement using the test accuracy and the proposed metric. In this step, we fortunately obtain $\mathcal{E}_{all}$ and can check whether our judgement is reasonable; normally, this step is not available. We show the OOD accuracy of the four algorithms in Table 2. In line with our judgement, the ERM and Mixup models achieve higher OOD accuracy than gDRO, and the performance of IRM (under these hyper-parameters) is lower than its test accuracy. During the above process, we can also compare the metric for models from the same algorithm but with different hyper-parameters (as in Section 5.1.2). Besides, one may notice that even the highest OOD accuracy is just 63.91%; that is to say, to obtain an OOD accuracy larger than 70%, one should consider collecting more environments. In Appendix A.6, we continue this real scenario to see where our metric leads us if $\mathcal{E}_{tr}$ is initially more diverse. The full VLCS results can also be found in the same appendix, together with the comparison of the proposed metric with the IRM penalty in formula (4); the comparison with conditional mutual information is in Appendix A.5. In summary, we use a realistic task to show how to judge the OOD property of a learnt model using the proposed metric and test accuracy; the judgement coincides well with the real OOD performance.

6. CONCLUSION

In this paper, we focus on two presently unsolved problems: how to discern the OOD property of multiple domains, and how to discern the OOD property of learnt models. To this end, we introduce the influence function into the OOD problem and propose a metric to help solve these issues. Our metric can not only discern whether a multi-domain problem is OOD but also, combined with the test accuracy, judge a model's OOD property. To make the calculation more meaningful, accurate and efficient, we modify the influence function to the domain level and propose to use only the top model to calculate the influence. Our method is proved in simple cases, and it works well in experiments. We sincerely hope that, with the help of this index, our understanding of OOD generalization will become more precise and thorough.

A APPENDIX

A.1 SIMPLE BAYESIAN NETWORK

In this section, we show that a model with better OOD accuracy achieves a smaller $V_{\gamma|\hat\theta}$. We assume the data is generated from the following Bayesian network:
$$x_1 \leftarrow N(0, \sigma_e^2 I), \quad y \leftarrow x_1 W_{1\to y} + N(0, I), \quad x_2 \leftarrow y W_{y\to 2} + N(0, \sigma_e^2 I), \tag{8}$$
where $x_1, x_2 \in \mathbb{R}^5$ are the features, $y \in \mathbb{R}^5$ is the target vector, and $W_{1\to y} \in \mathbb{R}^{5\times 5}$ and $W_{y\to 2} \in \mathbb{R}^{5\times 5}$ are the underlying parameters that are invariant across domains. The variance of the Gaussian noise, $\sigma_e^2$, depends on the domain; for simplicity, we denote $e = \sigma_e$ to represent a domain. The goal is to linearly regress the response $y$ on the input vector $(x_1, x_2)$, i.e. $\hat y = x_1 \hat W_1 + x_2 \hat W_2$. According to the Bayesian network (8), $x_1$ is the invariant feature, while the correlation between $x_2$ and $y$ is spurious and unstable, since $e = \sigma_e$ varies across domains. Clearly, a model based only on $x_1$ is an invariant model: any invariant estimator should achieve $\hat W_1 \approx W_{1\to y}$ and $\hat W_2 \approx 0$. We estimate three linear models using ERM, IRM and REx respectively, and record the parameter error as well as $V_{\gamma|\hat\theta}$ (note that $\gamma$ is $\theta$ here). Table 3 presents the results over 500 repetitions. As expected, IRM and REx learn more invariant relationships than ERM (smaller causal error) and better avoid non-causal variables ($\hat W_2 \approx 0$). Furthermore, the proposed measurement $V_{\gamma|\hat\theta}$ is highly related to invariance, i.e. the model with the better OOD property achieves the smaller $V_{\gamma|\hat\theta}$. These results coincide with our understanding.
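Sampling from the Bayesian network (8) is straightforward; a sketch (our own helper, with caller-supplied matrices standing in for the unknown $W_{1\to y}$ and $W_{y\to 2}$):

```python
import numpy as np

def sample_domain(n, sigma_e, W1y, Wy2, rng):
    """Sample n points from the Bayesian network (8) for domain e = sigma_e:
    x1 <- N(0, sigma_e^2 I), y <- x1 W_{1->y} + N(0, I),
    x2 <- y W_{y->2} + N(0, sigma_e^2 I).
    x1 is the invariant (causal) feature; the x2-to-y correlation is spurious
    because its noise scale varies with the domain."""
    x1 = sigma_e * rng.standard_normal((n, 5))
    y = x1 @ W1y + rng.standard_normal((n, 5))
    x2 = y @ Wy2 + sigma_e * rng.standard_normal((n, 5))
    return x1, x2, y
```

Regressing $y$ on $(x_1, x_2)$ pooled over domains with different $\sigma_e$ then exhibits exactly the ERM-vs-invariance behavior reported in Table 3.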

A.2 PROOF OF AN EXAMPLE

In this section, we use a simple model to illuminate the validity of $V_{\gamma|\hat\theta}$ proposed in Section 4. Consider a structural equation model (Wright (1921)):
$$x_1 \sim P^e_x, \quad y \leftarrow x_1 + N(0, 1), \quad x_2 \leftarrow y + N(0, \sigma_e^2),$$
where $P^e_x$ is a distribution with a finite second-order moment, i.e. $\mathbb{E}x_1^2 < +\infty$, and $\sigma_e^2$ is the variance of the noise term in $x_2$. Both $P^e_x$ and $\sigma_e^2$ vary across domains. For simplicity, we assume there are infinitely many training data points collected from two training domains $\mathcal{E}_{tr} = \{(P^1_x, \sigma_1^2), (P^2_x, \sigma_2^2)\}$. Our goal is to predict $y$ from $x := (x_1, x_2)^\top$ using a least-squares predictor $\hat y = x^\top\hat\beta := x_1\hat\beta_1 + x_2\hat\beta_2$. Here we consider two algorithms: ERM and IRM with $\lambda \to +\infty$. According to Arjovsky et al. (2019), using IRM we obtain $\beta_{IRM} \to (1, 0)^\top$. Intuitively, ERM will exploit both $x_1$ and $x_2$, thus achieving a better regression model. However, since the relationship between $y$ and $x_2$ varies across domains, our index will be huge in this condition. Conversely, $\beta_{IRM}$ only uses the invariant feature $x_1$, thus $V_{\gamma|\hat\theta} \to -\infty$. Note that we do not have an embedding model here, so $V_{\gamma|\hat\theta} = V_{\hat\beta}$.

ERM. We denote $\ell(\beta) = \frac{1}{|\mathcal{E}_{tr}|}\sum_{e\in\mathcal{E}_{tr}} \ell_e(\beta)$ with $\ell_e(\beta) = \mathbb{E}_e(y - x^\top\beta)^2$. Note that in $\mathbb{E}_e$, $x_1$ is sampled from $P^e_x$. We then have
$$\frac{\partial \ell(\beta)}{\partial \beta} = -\frac{2}{|\mathcal{E}_{tr}|}\sum_{e\in\mathcal{E}_{tr}} \begin{pmatrix} \mathbb{E}_e[x_1(y - x^\top\beta)] \\ \mathbb{E}_e[x_2(y - x^\top\beta)] \end{pmatrix}.$$
To proceed further, we denote $\bar d = \frac{1}{|\mathcal{E}_{tr}|}\sum_{e\in\mathcal{E}_{tr}} \mathbb{E}_e x_1^2$ and $s = \sum_{e\in\mathcal{E}_{tr}} \sigma_e^2 = \sigma_1^2 + \sigma_2^2$. By solving the equations
$$\frac{1}{|\mathcal{E}_{tr}|}\sum_{e\in\mathcal{E}_{tr}} \mathbb{E}_e[x_1(y - x^\top\beta)] = \bar d(1 - \beta_1 - \beta_2) = 0,$$
$$\frac{1}{|\mathcal{E}_{tr}|}\sum_{e\in\mathcal{E}_{tr}} \mathbb{E}_e[x_2(y - x^\top\beta)] = (\bar d + 1)(1 - \beta_1 - \beta_2) + \beta_1 - \frac{s}{|\mathcal{E}_{tr}|}\beta_2 = 0,$$
we have $\hat\beta = (\hat\beta_1, \hat\beta_2)^\top$ with
$$\hat\beta_1 = \frac{s}{s+2}, \qquad \hat\beta_2 = \frac{2}{s+2}.$$
Now we calculate our index. It is easy to see that
$$\frac{\partial \ell_e(\beta)}{\partial \beta_1} = -2\,\mathbb{E}_e[x_1(y - x^\top\beta)] = -2\,\mathbb{E}_e x_1^2\,(1 - \beta_1 - \beta_2),$$
$$\frac{\partial \ell_e(\beta)}{\partial \beta_2} = -2\,\mathbb{E}_e[x_2(y - x^\top\beta)] = -2\big[(\mathbb{E}_e x_1^2 + 1)(1 - \beta_1 - \beta_2) + \beta_1 - \sigma_e^2\beta_2\big].$$
Therefore, ∇ 1 (β) -∇ 2 (β) = 0 2β 2 (σ 2 1 -σ 2 2 ) and ∇ 1 ( β) -∇ 1 ( β) = 0 4(σ 2 1 -σ 2 2 ) s+2 (9) On the other hand, calculate the hessian and we have H ERM = 2 d 2 d 2 d 2 d + s + 2 and H -1 = 1 2 d(s + 2) 2 d + s + 2 -2 d -2 d 2 d . Then we have (note that IF( β, S e ) = H -1 ∇ e ( β)) V β = ln( Cov e∈E (IF( β, S e )) 2 ) = ln( 1 4 (IF 1 -IF 2 )(IF 1 -IF 2 ) 2 ) = ln( 1 4 IF 1 -IF 2 2 ) = 2 ln( 1 2 H -1 (∇ 1 ( β) -∇ 2 ( β)) ) = 2 ln( 1 4 d(s + 2) 2 d + s + 2 -2 d -2 d 2 d 0 4(σ 2 1 -σ 2 2 ) s+2 ) = 2 ln( 2 √ 2|σ 2 1 -σ 2 2 | (s + 2) 2 ) where the third equation holds because the rank of matrix is 1. Clearly, when |σ 2 1 -σ 2 2 | → 0 (means two domains become identical), our index V β → -∞. Otherwise, given σ 1 = σ 2 , we have V β > -∞, showing that ERM captures varied features. IRM We now turn to IRM model and show that V β → -∞ when λ → +∞, thus proving IRM learnt model βIRM does achieve smaller V β compared with β in ERM. Under IRM model, assuming the tuning parameter is λ, we have L(β) = 1 |E tr | e∈Etr E e [(y -x β) 2 ] + 4λ E e [x β(y -x β)] 2 . Then we have the gradient with respect β: ∇L(β) = 1 |E tr | e∈Etr -2E e [x(y -x β)] + 8λE e [x β(y -x β)]E e [x(y -2x β)] , and the Hessian matrix H = H ERM + 8λ |E| e∈E E e [x(y -2x β)]E e [x(y -2x β)] -2E e [x β(y -x β)]E e [xx ] . Denote β λ the solution of IRM algorithm on E tr when penalty is λ. From Arjovsky et al. ( 2019) we know β λ → β IRM := (1, 0) . To show lim λ→+∞ V β λ = -∞, we only need to show that lim λ→+∞ H -1 (∇ 1 (β λ ) -∇ 2 (β λ )) = 0 We prove this by showing that lim λ→+∞ H -1 (λ) = 0 and lim λ→+∞ ∇ 1 (β λ ) -∇ 2 (β λ ) = 0 simultaneously. We add (λ) after H -1 to show that H -1 is a continuous function of λ. Rewrite H in formula 10 as H(λ, β λ ) = H ERM + λF (β λ ) where F (β) = 8 |E tr | e∈Etr E e [x(y -2x β)]E e [x(y -2x β)] -2E e [x β(y -x β)]E e [xx ] lim β λ →βIRM F (β λ ) = 4 |E tr | e∈Etr -E e x 2 1 1 -E e x 2 1 -E e x 2 1 1 -E e x 2 1 = F (β IRM ) exists. 
Obviously, F (β IRM ) is positive definite. Therefore, we have lim λ→+∞ H(λ, β λ ) -1 = lim λ→+∞ lim β λ →βIRM [H ERM + λF (β λ )] -1 = lim λ→+∞ [H ERM + λF (β IRM )] -1 = 0 The first equation holds because lim λ→+∞ F (β λ ) = F (β IRM ) has the limit and is not 0, and the last equation holds because the eigenvalue of H goes to +∞ when λ → +∞. Now consider ∇ 1 (β λ ) -∇ 2 (β λ ). According to formula 9, we have lim λ→+∞ ∇ 1 (β λ ) -∇ 2 (β λ ) = lim β λ →βIRM ∇ 1 (β λ ) -∇ 2 (β λ ) = ∇ 1 (β IRM ) -∇ 2 (β IRM ) = 0 2β 2 (σ 2 1 -σ 2 2 ) = 0 Hence we finish proof of formula 11 and show that V β → -∞ in IRM.
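The closed-form quantities of the ERM part above are easy to check numerically. The sketch below (with illustrative moment values $\mathbb{E}_1 x_1^2 = 1$, $\mathbb{E}_2 x_1^2 = 2$, $\sigma_1^2 = 1$, $\sigma_2^2 = 4$; the function name `erm_index` is ours) evaluates the population gradients, the Hessian and the domain-level influence functions, and confirms that $V_\beta$ matches $2\ln\!\big(2\sqrt{2}|\sigma_1^2-\sigma_2^2|/(s+2)^2\big)$:

```python
import numpy as np

def erm_index(a, sig2):
    """Population ERM solution and index V_beta for the two-domain example of A.2.

    a    : per-domain values of E_e[x1^2] (illustrative assumptions)
    sig2 : per-domain noise variances sigma_e^2
    """
    a, sig2 = np.asarray(a, float), np.asarray(sig2, float)
    m = len(a)
    d, s = a.mean(), sig2.sum()
    beta = np.array([s / (s + 2), 2 / (s + 2)])        # closed-form ERM solution

    def grad(e, b):                                     # population gradient of l_e
        r = 1 - b[0] - b[1]
        return -2 * np.array([a[e] * r,
                              (a[e] + 1) * r + b[0] - sig2[e] * b[1]])

    H = 2 * np.array([[d, d], [d, d + 1 + s / m]])      # = [[2d,2d],[2d,2d+s+2]] for m = 2
    IFs = np.stack([np.linalg.inv(H) @ grad(e, beta) for e in range(m)])
    C = np.cov(IFs.T, bias=True)                        # covariance over domains
    return beta, np.log(np.linalg.norm(C, 2))

beta, V = erm_index(a=[1.0, 2.0], sig2=[1.0, 4.0])
s = 1.0 + 4.0
V_closed = 2 * np.log(2 * np.sqrt(2) * abs(1.0 - 4.0) / (s + 2) ** 2)
print(beta, V, V_closed)   # the two values of V agree
```

At $\hat\beta$ the residual term $1-\beta_1-\beta_2$ vanishes, so the per-domain gradients reduce to $(0,\, -2(\beta_1 - \sigma_e^2\beta_2))^\top$, exactly as the derivation predicts.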

A.3 FORMULA (5)

This section shows the derivation of expression (5). Recall that the training dataset is $S = \{S^1, \dots, S^m\}$ and the objective function is $L(f, S) = \ell(f, S) + \lambda R(f, S)$, where the second term on the right-hand side is the regularization (for ERM, the regularization term is zero). With the feature extractor (parameters $\beta$) fixed, we upweight a domain $S^e$. The new objective function is
\[
L_+(\gamma, S, \delta) = L(\gamma, S) + \delta\,\ell(\gamma, S^e).
\]
Notice that when upweighting a domain, we only upweight the empirical loss on that domain. Further, denote by $\hat\gamma, \hat\gamma_+$ the optimal solutions before and after upweighting; it is easy to see that $\hat\gamma_+ - \hat\gamma \to 0$ as $\delta \to 0$. Following the derivation in Koh & Liang (2017), a first-order Taylor expansion of $\nabla_\gamma L_+$ around $\hat\gamma$, combined with the optimality conditions $\nabla_\gamma L_+(\hat\gamma_+, S, \delta) = 0$ and $\nabla_\gamma L(\hat\gamma, S) = 0$, gives
\[
0 = \nabla_\gamma\big[L(\hat\gamma_+, S) + \delta\,\ell(\hat\gamma_+, S^e)\big]
= \nabla_\gamma\big[L(\hat\gamma, S) + \delta\,\ell(\hat\gamma, S^e)\big] + \nabla^2_\gamma\big[L(\hat\gamma, S) + \delta\,\ell(\hat\gamma, S^e)\big](\hat\gamma_+ - \hat\gamma) + o(\|\hat\gamma_+ - \hat\gamma\|)
\]
\[
= \delta\,\nabla_\gamma \ell(\hat\gamma, S^e) + \nabla^2_\gamma\big[L(\hat\gamma, S) + \delta\,\ell(\hat\gamma, S^e)\big](\hat\gamma_+ - \hat\gamma) + o(\|\hat\gamma_+ - \hat\gamma\|).
\]
Assuming $\nabla^2_\gamma\big[L(\hat\gamma, S) + \delta\,\ell(\hat\gamma, S^e)\big]$ is invertible, we have
\[
\frac{\hat\gamma_+ - \hat\gamma}{\delta} = -\Big[\nabla^2_\gamma\big(L(\hat\gamma, S) + \delta\,\ell(\hat\gamma, S^e)\big)\Big]^{-1}\nabla_\gamma \ell(\hat\gamma, S^e) + o\Big(\frac{\|\hat\gamma_+ - \hat\gamma\|}{\delta}\Big),
\qquad
\lim_{\delta\to 0}\frac{\hat\gamma_+ - \hat\gamma}{\delta} = -\Big[\nabla^2_\gamma L(\hat\gamma, S)\Big]^{-1}\nabla_\gamma \ell(\hat\gamma, S^e).
\]
Note that this derivation is not fully rigorous; please refer to Van der Vaart (2000) for a more rigorous discussion of influence functions.

The reason that $\beta$ should be fixed is as follows. First, if $\beta$ is allowed to vary, then the change of $\theta$ becomes
\[
\begin{pmatrix} H_{\gamma\gamma} & H_{\gamma\beta} \\ H_{\beta\gamma} & H_{\beta\beta} \end{pmatrix}^{-1}
\begin{pmatrix} \nabla_\gamma \ell(\hat\theta, S^e) \\ \nabla_\beta \ell(\hat\theta, S^e) \end{pmatrix},
\]
so the computational cost is comparable to computing and inverting the whole Hessian. Most importantly, without fixing $\beta$, the change of $\gamma$ can be uninformative. Say that when upweighting $S^e$, the use of a feature decreases. It is possible that the parameter in $\gamma$ corresponding to the feature increases while the corresponding part of $\beta$ decreases on a larger scale; in this case, the use of the feature decreases although $\gamma$ increases. Without fixing $\beta$, the change of $\gamma$ calculated by the influence function may therefore provide no information about the use of a feature. We thus argue that fixing $\beta$ is a "double-win" choice: it saves computation and preserves interpretability.
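As an illustration of the recipe above — freeze the feature extractor, keep only the top-layer parameters $\gamma$, and compute per-domain influence functions with a single small Hessian inverse — here is a minimal numpy sketch for a linear head with squared loss. The helper names (`domain_influences`, `v_index`) and the least-squares setting are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def domain_influences(Xs, ys, gamma):
    """Domain-level influence functions for a linear head gamma with squared loss.

    Xs, ys : per-domain feature matrices (outputs of the frozen extractor,
             i.e. Phi(x)) and targets.
    Implements IF(gamma, S_e) = -H^{-1} grad_gamma l(gamma, S_e), where H is the
    Hessian of the averaged loss taken over gamma only (beta frozen).
    """
    p = gamma.shape[0]
    H = np.zeros((p, p))
    grads = []
    for X, y in zip(Xs, ys):
        r = X @ gamma - y
        grads.append(2 * X.T @ r / len(y))       # gradient of the domain MSE
        H += 2 * X.T @ X / len(y) / len(Xs)      # Hessian of the averaged MSE
    Hinv = np.linalg.inv(H)
    return np.stack([-Hinv @ g for g in grads])

def v_index(IFs):
    """V = ln ||Cov_e(IF)||_2, the proposed stability index."""
    C = np.atleast_2d(np.cov(IFs.T, bias=True))
    return np.log(np.linalg.norm(C, 2))

# Toy usage: the second feature flips its correlation with y across domains.
rng = np.random.default_rng(1)
Xs, ys = [], []
for flip in (+1.0, -1.0):
    X = rng.normal(size=(2000, 2))
    Xs.append(X)
    ys.append(X[:, 0] + flip * X[:, 1] + 0.1 * rng.normal(size=2000))
gamma = np.linalg.lstsq(np.vstack(Xs), np.hstack(ys), rcond=None)[0]
IFs = domain_influences(Xs, ys, gamma)
print(v_index(IFs))
```

In the toy usage the per-domain influences point in opposite directions, so $V$ is large; with identically distributed domains the per-domain gradients nearly vanish at the pooled optimum and $V$ drops sharply.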

A.4 ACCURACY IS NOT ENOUGH

In the Introduction, we gave an example where test accuracy misleads us. In this section, we first supplement examples showing that test accuracy not only misjudges different algorithms, but also misjudges the OOD property of models learnt with different penalties within the same algorithm. After that, we discuss the universality of these problems and why test accuracy fails.

We construct Colored MNIST training domains as in Section 5.1.2 and a test domain with flip rate denoted by p_test. We implement IRM and REx with penalty λ ∈ {0, 50, 100, 500, 1000} to check the relationship between test accuracy and OOD accuracy; the training process is identical to the experiment in Section 5.1.2. As the results in Figure 5 show, as the OOD property of the model gradually improves (caused by gradually increasing λ), its relationship with test accuracy is either completely (when p_test is 0.2) or partly (when p_test is 0.3) negatively correlated. This phenomenon reveals the weakness of test accuracy. If one selects λ at p_test = 0.3 judged by test accuracy, λ = 50 may seem the best choice, for both IRM and REx. However, the model learnt with λ = 50 has OOD accuracy even lower than a random-guess model.

Whether test accuracy is positively correlated, negatively correlated or irrelevant to a model's OOD property mainly depends on the "distance" between the test domain and the "worst" domain for the model. If test accuracy happens to be the lowest among all domains, then OOD accuracy directly equals test accuracy. In practice, however, their distance may be huge, and this is precisely the difficulty of OOD generalization. For example, we may have access to images of cows in grasslands, woods and forests, but cows in the desert are rare. In that case, the "worst" domain is certainly far from what we can access. If we expect a model to capture the real features of cows, the model should avoid any use of background color.

However, a model based on color will perform consistently well (better than any OOD model) in grasslands, woods and forests alike, since all of the available domains have broadly green backgrounds. In Colored MNIST, test accuracy fails in the same way. Such situations are quite common: within the domains we have, there may be features that are strongly correlated with the prediction target but vary slightly across domains. These features are spurious, given that their relationship with the target is significantly different in the other domains to which we want to generalize. Yet using these features easily achieves high test accuracy. Consequently, it is extremely risky to judge models merely by test accuracy.
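The cows-and-camels argument can be made concrete with a toy simulation. Below, a "color classifier" and a "shape classifier" are compared on domains where the background color tracks the label with domain-dependent reliability (all numbers are invented for illustration): color beats shape on the available domains yet collapses on the unseen worst domain.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_domain(n, p_color):
    """Labels plus two binary features: background color (reliability p_color,
    domain-dependent) and object shape (reliability 0.9, invariant)."""
    y = rng.integers(0, 2, n)                            # 1 = cow, 0 = camel
    color = np.where(rng.random(n) < p_color, y, 1 - y)
    shape_feat = np.where(rng.random(n) < 0.9, y, 1 - y)
    return y, color, shape_feat

# Available domains are all mostly green for cows; the worst domain reverses this.
accs = {}
for name, p_color in [("grassland", 0.95), ("woods", 0.92),
                      ("forest", 0.91), ("desert (worst)", 0.10)]:
    y, color, shape_feat = toy_domain(100_000, p_color)
    accs[name] = ((color == y).mean(), (shape_feat == y).mean())
    print(f"{name:16s} color-model acc {accs[name][0]:.3f} "
          f"shape-model acc {accs[name][1]:.3f}")
```

The invariant shape model keeps its 0.9 accuracy everywhere, while the color model's accuracy simply tracks the domain-dependent reliability of the background.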

A.5 CONDITIONAL MUTUAL INFORMATION

A possible alternative to $V_{\gamma|\theta}$ is Conditional Mutual Information (CMI). For three continuous random variables $X$, $Y$, $Z$, the CMI is defined as
\[
I(X; Y \mid Z) = \int p(x, y, z)\,\log\frac{p(x, y, z)}{p(x, z)\,p(y \mid z)}\,dx\,dy\,dz, \tag{12}
\]
where $p(\cdot)$ is the probability density function. Consider $I(e; y \mid \Phi(x))$ or $I(e; y \mid \hat y)$, i.e. the mutual information between the domain $e$ and the true label $y$ given the features $\Phi(x)$ or the prediction $\hat y$. The insight is that if the model is invariant across domains, then given $\Phi(x)$, little information about $e$ should be contained in $y$; otherwise, if the prediction $\hat y$ is highly correlated with $e$, the conditional mutual information will be high. This metric seems promising. However, the numerical estimation of CMI remains a challenge, and previous works have made substantial progress on this problem, including CCMI proposed in Mukherjee et al. (2020) and CCIT proposed in Sen et al. (2017). In this part, we first calculate the true $I(e; y \mid \hat y)$ in a simple Colored MNIST experiment to show that, absent estimation error, CMI could be a potential metric to judge the OOD property of the learnt model, at least in a simple, discrete task. We then run the code provided by Sen et al. (2017) (https://github.com/rajatsen91/CCIT) to show that even in this simple task, the estimation of CMI may severely affect its performance.

Specifically, the experimental setting is similar to that in subsection 5.1.2, with two OOD algorithms and the number of training domains in {2, 5}. For each algorithm, we consider the penalty weight λ ∈ {0, 10, 100, 1000}, run the algorithm 50 times, and record the OOD accuracy as well as the true CMI value or the CCIT value. The results are shown in Figure 6. When the true CMI can be easily calculated, especially when the number of domains is small and the task is discrete (not continuous), CMI is highly correlated with OOD accuracy. However, in a regression task, or in a task where directly calculating the CMI is impractical, the estimation process may severely destroy this correlation and may even invert it. We therefore conclude that the difficulty of estimating CMI limits its utility. We leave a fine-grained analysis of the relationship between CMI, estimated CMI and the OOD property to future work.
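For completeness, here is how the "true" CMI can be computed by a plug-in estimate when $e$, $y$ and $\hat y$ are all discrete, as in the simple Colored MNIST setting. This is a generic sketch of the computation, not the paper's code:

```python
import numpy as np
from collections import Counter

def discrete_cmi(e, y, yhat):
    """Plug-in estimate of I(e; y | yhat) in nats for discrete samples.

    A direct specialisation of definition (12) to discrete variables:
    sums p(z) * p(a,b|z) * log[ p(a,b|z) / (p(a|z) p(b|z)) ].
    """
    e, y, yhat = (np.asarray(v) for v in (e, y, yhat))
    cmi = 0.0
    for z in np.unique(yhat):
        mask = yhat == z
        nz = int(mask.sum())
        pz = nz / len(yhat)
        joint = Counter(zip(e[mask].tolist(), y[mask].tolist()))
        pe = Counter(e[mask].tolist())
        py = Counter(y[mask].tolist())
        for (a, b), c in joint.items():
            # c*nz / (pe[a]*py[b]) equals p(a,b|z) / (p(a|z) p(b|z))
            cmi += pz * (c / nz) * np.log(c * nz / (pe[a] * py[b]))
    return cmi

rng = np.random.default_rng(0)
e = rng.integers(0, 2, 10_000)            # domain index
y = rng.integers(0, 2, 10_000)            # label, independent of e
print(discrete_cmi(e, y, y))              # yhat = y: no extra information, ~0
print(discrete_cmi(e, y, (y + e) % 2))    # prediction leaks the domain, ~ln 2
```

The two sanity checks mirror the intuition above: an invariant prediction leaves no information about $e$ in $y$, while a domain-dependent prediction makes $y$ reveal $e$ almost perfectly.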
A.6 RESULTS ON VLCS

A.6.1 CONTINUED SCENARIO

This is a continuation of section 5.2. In this task, E_all remains the four domains but E_tr = {L, S, V} (empirically we find this split more diverse). As before, we start with test accuracy, shown in Table 4. In this step the situation is the same, i.e. IRM should be eliminated until proper hyper-parameters are found. In step 2, we compare V_γ|β and Ṽ_γ|β for the three remaining algorithms in Figure 7. This time, the two values are similar for all three algorithms, including gDRO, which differs from the case when S is unseen. We therefore predict that all three algorithms should achieve high OOD accuracy. Indeed, if we act as the oracle and calculate their OOD performance, our judgement is close to reality: ERM, Mixup and gDRO achieve OOD accuracies from 70.55% to 72.87%, and according to the confidence intervals the differences are not statistically significant. As for IRM, the OOD accuracy is 38.64%. A user of ERM, Mixup or gDRO should be satisfied with this performance, since a higher demand is somewhat impractical.

We report the full results here and compare our metric V_γ|θ with the IRM penalty in formula (4). Throughout these experiments, E_all = {V, L, C, S}. We construct four experimental settings; in each, one domain is removed and the rest constitute E_tr. For each domain in E_tr, we split off a validation set, and test accuracy is the average accuracy among the validation sets. The results are shown in Table 5. First, our results coincide with Gulrajani & Lopez-Paz (2020) in that ERM outperforms nearly every other algorithm: the OOD accuracy of ERM is either the highest or only slightly lower than Mixup, while ERM has a relatively small V_γ|θ. Second, higher OOD accuracy corresponds to lower V_γ|θ. In addition, we notice that IRM has relatively low test accuracy and OOD accuracy.

We attribute this phenomenon to improper hyper-parameters in IRM, although we did not change the default hyper-parameters in the code of Gulrajani & Lopez-Paz (2020) (https://github.com/facebookresearch/DomainBed). In any case, this phenomenon provides a good example for comparing our metric with the IRM penalty and discussing their advantages and disadvantages. Although IRM can be a good OOD algorithm, using the IRM penalty as a metric to judge the OOD property of a learnt model has several weaknesses, some of them severe. First, the value of λ needed to obtain an OOD model may differ across tasks, as may other hyper-parameters such as the "anneal steps" in the IRM code. Without an exhaustive search over these hyper-parameters, IRM easily overfits the penalty term (which is what happens in VLCS). When IRM overfits, the IRM penalty becomes quite small (higher λ often leads to a smaller penalty), but overfitting the penalty term certainly does not yield good OOD accuracy. The balance between loss and penalty is therefore important. But how do we find the balance point? This is a model selection problem, and Gulrajani & Lopez-Paz (2020) argue that an OOD algorithm without model selection is incomplete. Whatever metric is used for selection, it cannot be the IRM penalty, since we cannot use a quantity that is part of the training objective to select training hyper-parameters. Second, the IRM penalty is biased across algorithms. In Table 5, the IRM penalty of IRM is smaller than that of most algorithms; moreover, although the OOD accuracy of Mixup is similar to ERM, its IRM penalty is significantly higher. This is not strange, but it limits the usage of the IRM penalty. As for our metric, we have mentioned that a small V_γ|θ is better; however, the notion of "smallness" is based on the relative values of the shuffled and standard versions of V_γ|θ.

As mentioned in section 5.2, when E_all\E_tr = {S}, the shuffled version is clearly smaller than the standard version for gDRO, whereas for ERM and Mixup the two values are close or indistinguishable. In this case, we conclude that gDRO captures fewer invariant features and is less OOD-robust than the other two algorithms. Throughout this process, we can avoid directly comparing V_γ|θ across different algorithms, which is quite important. In summary, the IRM penalty makes IRM a good algorithm, but using it as a general metric of OOD performance is another matter entirely.



Figure 3: The relationship between V_γ|θ and OOD accuracy in REx (left) and IRM (right) with λ ∈ {0, 50, 100, 500, 1000}. We train 400 models for each λ. The OOD accuracy and V_γ|θ are strongly correlated, with Pearson coefficients -0.9745 (upper left), -0.9761 (lower left), -0.8417 (upper right), -0.9476 (lower right). The coefficients are negative because a lower V_γ|θ indicates a better OOD property.

Figure 4: The standard and shuffled versions of the metric, i.e. V_γ|β and Ṽ_γ|β, for ERM, Mixup and gDRO. For each algorithm and each version of the metric, we run the experiment more than 12 times to control statistical error. Similar V_γ|β and Ṽ_γ|β indicate invariance across E_tr, which is the case for ERM and Mixup. For gDRO, Ṽ_γ|β is clearly smaller.

Figure 5: Experiments on Colored MNIST showing that test accuracy (x-axis) cannot be used to judge models learnt with different penalties. We consider two test domains with p_test = 0.2 (upper panels) and p_test = 0.3 (lower panels). For each λ, we run IRM and REx 500 times. As λ increases from 0 to 1000, the OOD accuracy also increases, but test accuracy does not. When p_test = 0.3, their relationship becomes even more tangled.


Figure 6: The relationship between OOD accuracy and CMI (true, or estimated using the method of Sen et al. (2017)). Models are trained by REx (left) and IRM (right) with λ ∈ {0, 10, 100, 1000}. We train 50 models for each λ and calculate the true CMI I(e; y|ŷ) or the CCIT value. As analyzed in Appendix A.5, the true CMI is highly correlated with OOD accuracy, with Pearson coefficients -0.9923 (left) and -0.9858 (right). However, the estimated value shows a completely different picture, with Pearson coefficients -0.0768 (left) and -0.1193 (right).

Figure 7: The standard and shuffled versions of the metric, i.e. V_γ|β and Ṽ_γ|β, for ERM, Mixup and gDRO. This time, all three algorithms show similar V_γ|β and Ṽ_γ|β.

(ii) We show that our index V_γ|θ can remedy the weakness of test accuracy. Specifically, under most OOD generalization problems, using test accuracy and our index together, we can discern the OOD property of a model. See Section 4.2 for details. (iii) We propose to use only a small but important part of the model to calculate the influence function. This overcomes the huge computational cost of inverting the Hessian. It is not merely for efficiency and accuracy; it also coincides with our understanding that only these parameters capture which features a model has learnt (Section 4.3).

). Many works assume that the ground truth can be represented by a causal Directed Acyclic Graph (DAG), and they use the DAG structure to dis-

$\|\cdot\|_2$ is the matrix 2-norm, i.e. the largest singular value of the matrix (equal to the largest eigenvalue for the positive semi-definite covariance matrix here), $\mathrm{Cov}_{e\in E_{tr}}(\cdot)$ refers to the covariance matrix of the domain-level influence functions over $E_{tr}$, and $\ln(\cdot)$ is a nonlinear transformation that works well in practice.

Then, we color the image according to ỹ but with a flip rate of p_e. Clearly, when p_e < 0.25 or p_e > 0.75, color is more correlated with ỹ than the real digit is. Therefore, the oracle OOD model f_OOD attains accuracy 0.75 in all domains, while an ERM model may attain high training accuracy but a poor OOD property if p_e in the training domains is too small or too large. Throughout the Colored MNIST experiments, we use a three-layer MLP with ReLU activations and hidden dimension 256. Although this MLP has relatively many parameters and is non-convex due to the activations, thanks to the technique in Section 4.3 the influence calculation remains fast and accurate: a single influence computation takes less than 2 seconds.
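For reference, the label and color assignment described above can be sketched as follows (using the 0.25 label noise and a domain-dependent color flip rate p_e; the function name is our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def color_mnist_labels(digits, p_e, label_noise=0.25):
    """Binary label y~ (digit < 5, flipped w.p. 0.25) and color (y~ flipped w.p. p_e)."""
    y = (digits < 5).astype(int)
    y_tilde = np.where(rng.random(len(y)) < label_noise, 1 - y, y)
    color = np.where(rng.random(len(y)) < p_e, 1 - y_tilde, y_tilde)
    return y_tilde, color

digits = rng.integers(0, 10, 100_000)
y_tilde, color = color_mnist_labels(digits, p_e=0.1)
# With p_e = 0.1, color agrees with y~ 90% of the time, while the digit itself
# agrees only 75% of the time, so ERM is tempted to rely on color.
print((color == y_tilde).mean(), (y_tilde == (digits < 5)).mean())
```

Setting p_e close to 0.5 instead would make color uninformative, which is exactly the regime where the invariant digit-based model becomes the best predictor.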





Table 3: Average parameter error ‖Ŵ − W‖₂ and the stability measurement V_γ|θ of 500 models from ERM, IRM and REx. Here, "Causal Error" denotes ‖Ŵ₁ − W_{1→y}‖₂ and "Non-causal Error" denotes ‖Ŵ₂‖₂.



Table 5: Experiments on VLCS with 4 algorithms. OOD accuracy means the minimum accuracy over E_all. We use the training-domain validation method of Gulrajani & Lopez-Paz (2020), so test accuracy is the average accuracy over the three split validation sets. "Domain" indicates which domain is excluded, i.e. which domain is in E_all\E_tr. In each setting, we run each algorithm 12 times and report the mean and (std). Note that in a real implementation, the IRM penalty can be negative.

