MEASURING THE PREDICTIVE HETEROGENEITY

Abstract

As an intrinsic and fundamental property of big data, data heterogeneity arises in a variety of real-world applications, such as agriculture, sociology, and health care. For machine learning algorithms, ignoring data heterogeneity can significantly hurt generalization performance and algorithmic fairness, since the prediction mechanisms of different sub-populations are likely to differ. In this work, we focus on the data heterogeneity that affects the prediction of machine learning models, and formalize it as the Predictive Heterogeneity, a measure that takes model capacity and computational constraints into account. We prove that it can be reliably estimated from finite data with PAC bounds, even in high dimensions. Additionally, we propose the Information Maximization (IM) algorithm, a bi-level optimization algorithm, to explore the predictive heterogeneity of data. Empirically, the explored predictive heterogeneity provides insights for sub-population division in agriculture, sociology, and object recognition, and leveraging such heterogeneity benefits out-of-distribution generalization performance.

1. INTRODUCTION

Big data bring great opportunities to modern society and promote the development of machine learning, facilitating human life in a wide variety of areas, such as the digital economy, healthcare, and scientific discovery. Along with this progress, the intrinsic heterogeneity of big data introduces new challenges to machine learning systems and data scientists (Fan et al., 2014; He, 2017). In general, data heterogeneity, as a fundamental property of big data, refers to any diversity inside data, including the diversity of data sources, data generation mechanisms, sub-populations, data structures, etc. When not properly treated, data heterogeneity can bring pitfalls to machine learning systems, especially in high-stakes applications such as precision medicine, autonomous driving, and financial risk management (Dzobo et al., 2018; Breitenstein et al., 2020; Challen et al., 2019), leading to poor out-of-distribution generalization performance and fairness issues. For example, in supervised learning tasks, where machine learning models learn from data to predict a target variable from given covariates, when the whole dataset consists of multiple sub-populations with distributional shifts or different prediction mechanisms, traditional machine learning algorithms will mainly focus on the majority and ignore the minority. This hurts the generalization ability and compromises algorithmic fairness, as shown in (Kearns et al., 2018; Sagawa et al., 2019; Duchi & Namkoong, 2021). Another well-known example is Simpson's paradox, which brings false discoveries to social research (Wagner, 1982; Hernán et al., 2011). Despite its widespread existence, due to its complexity, data heterogeneity has no unified formulation so far and carries different meanings in different fields. Li & Reynolds (1995) define heterogeneity in ecology based on system properties and complexity or variability.
Rosenbaum (2005) views the uncertainty of the potential outcome as unit heterogeneity in observational studies in economics. More recently, in machine learning, several works in causal learning (Peters et al., 2016; Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Liu et al., 2021; Creager et al., 2021) and robust learning (Sagawa et al., 2019; Liu et al., 2022) leverage heterogeneous data from multiple environments to improve out-of-distribution generalization. However, previous works have not provided a precise definition or sound quantification of data heterogeneity. In this work, we propose a new type of data heterogeneity measure from the perspective of prediction power. From the machine learning perspective, the main concern is the possible negative effect of data heterogeneity on prediction. Therefore, given the complexity of data heterogeneity, we focus on the data heterogeneity that affects the prediction of machine learning models, which could facilitate the building of machine learning systems, and we name it the predictive heterogeneity. We give a precise definition: the predictive heterogeneity is quantified as the maximal additional predictive information that can be gained by dividing the whole data distribution into sub-populations. The new measure takes model capacity and computational constraints into account, and can be reliably estimated from finite samples, even in high dimensions, with PAC bounds. We theoretically analyze its properties and examine it under typical cases of data heterogeneity (Fan et al., 2014). Additionally, we design the information maximization (IM) algorithm to empirically explore the predictive heterogeneity inside data. Empirically, we find that the explored heterogeneity is explainable and provides insights for sub-population division in many fields, including agriculture, sociology, and object recognition.
The explored sub-populations can also be leveraged to enhance the out-of-distribution generalization performance of machine learning models, which we verify on both simulated and real-world data.

2. PRELIMINARIES ON MUTUAL INFORMATION AND PREDICTIVE V-INFORMATION

In this section, we briefly introduce mutual information and the predictive V-information (Xu et al., 2020), which are the preliminaries of our proposed predictive heterogeneity.

Notations. For a probability triple (S, F, P), define random variables X : S → X and Y : S → Y, where X is the covariate space and Y is the target space. Accordingly, x ∈ X denotes the covariates and y ∈ Y denotes the target. Denote the set of random categorical variables by C = {C : S → N | supp(C) is finite}. Additionally, P(X), P(Y) denote the sets of all probability measures over the Borel algebras on the spaces X, Y respectively. H(·) denotes the Shannon entropy of a random variable, and H(·|·) denotes the conditional entropy of two random variables.

In information theory, the mutual information of two random variables X, Y measures the dependence between them, quantifying the reduction of entropy of one variable when observing the other: I(X; Y) = H(Y) − H(Y|X). It is known that mutual information is associated with the predictability of Y (Cover & Thomas, 1991). However, the standard definition of mutual information unrealistically assumes unbounded computational capacity of the predictor, rendering it hard to estimate, especially in high dimensions. To mitigate this problem, Xu et al. (2020) propose the predictive V-information under realistic computational constraints, where the predictor is only allowed to use models in a predictive family V to predict the target variable Y.

Definition 1 (Predictive Family (Xu et al., 2020)). Let Ω = {f : X ∪ {∅} → P(Y)}. We say that V ⊆ Ω is a predictive family if it satisfies: ∀f ∈ V, ∀P ∈ range(f), ∃f′ ∈ V, s.t. ∀x ∈ X, f′[x] = P and f′[∅] = P.

A predictive family contains all predictive models that are allowed to be used, which forms computational or statistical constraints. The additional condition in Definition 1 means that the predictor can always ignore the input covariates x if it chooses to (using only ∅).
Definition 2 (Predictive V-information (Xu et al., 2020)). Let X, Y be two random variables taking values in X × Y and V be a predictive family. The predictive V-information from X to Y is defined as: I_V(X → Y) = H_V(Y|∅) − H_V(Y|X), where H_V(Y|∅) and H_V(Y|X) are the predictive conditional V-entropies, defined as:

H_V(Y|X) = inf_{f∈V} E_{x,y∼X,Y}[−log f[x](y)],
H_V(Y|∅) = inf_{f∈V} E_{y∼Y}[−log f[∅](y)].

Note that f ∈ V is a function X ∪ {∅} → P(Y), so f[x] ∈ P(Y) is a probability measure on Y, and f[x](y) ∈ R is its density evaluated at y ∈ Y. H_V(Y|∅) is also denoted by H_V(Y). Compared with mutual information, the predictive V-information restricts the computational power and is much easier to estimate in high-dimensional cases. When the predictive family V contains all possible models, i.e. V = Ω, it is proved that I_V(X → Y) = I(X; Y) (Xu et al., 2020).
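As a concrete illustration, I_V(X → Y) can be estimated with a few lines of numpy for a linear-Gaussian predictive family with fixed variance. The data, the choice σ = 1, and the least-squares fit below are illustrative choices for this sketch, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: Y depends linearly on X plus independent noise.
n = 2000
X = rng.normal(size=(n, 1))
Y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=n)

SIGMA = 1.0  # fixed standard deviation of the Gaussian predictive family

def avg_neg_log_density(mu, y):
    """Average negative log density of N(mu, SIGMA^2) evaluated at y."""
    return np.mean(0.5 * np.log(2 * np.pi * SIGMA**2)
                   + (y - mu) ** 2 / (2 * SIGMA**2))

# H_V(Y | empty set): ignoring X, the family does best with a constant mean.
H_null = avg_neg_log_density(np.mean(Y), Y)

# H_V(Y | X): fit the best linear predictor by least squares.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
H_cond = avg_neg_log_density(A @ coef, Y)

# Predictive V-information: reduction in predictive V-entropy.
I_v = H_null - H_cond
print(f"I_V(X -> Y) ~= {I_v:.3f} nats")
```

Because both entropies are computed within the same constrained family, the difference is well defined even though the family cannot represent the true conditional distribution.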

3. PREDICTIVE HETEROGENEITY

In this paper, from the machine learning perspective, we quantify the data heterogeneity that affects decision making, named Predictive Heterogeneity, which is easy to integrate with machine learning algorithms and could help analyze big data and build more rational algorithms.

3.1. INTERACTION HETEROGENEITY

To formally define the predictive heterogeneity, we begin with the formulation of the interaction heterogeneity.

Definition 3 (Interaction Heterogeneity). Let X, Y be random variables taking values in X × Y. Denote the set of random categorical variables by C, and take a subset E ⊆ C. Then E is an environment set iff there exists E ∈ E such that (X, Y) ⊥ E. Each E ∈ E is called an environment variable. The interaction heterogeneity between X and Y w.r.t. the environment set E is defined as:

H_E(X, Y) = sup_{E∈E} I(Y; X|E) − I(Y; X).

Each environment variable E represents a stochastic 'partition' of X × Y, and the condition on the environment set implies that the joint distribution of X, Y can be preserved in each environment. In information theory, I(Y; X|E) − I(Y; X) is called the interaction information, which measures the influence of the environment variable E on the amount of information shared between the target Y and the covariates X. The interaction heterogeneity defined above quantifies the maximal additional information that can be gained by uncovering the environment variable E. Intuitively, a large H_E(X, Y) indicates that the predictive power from X to Y is enhanced by E, which means that uncovering the latent sub-populations associated with the environment partition E will benefit the X → Y prediction.
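For discrete variables, the interaction information I(Y; X|E) − I(Y; X) can be computed directly from empirical counts. The toy Simpson's-paradox-style construction below (an illustration, not from the paper) has two sub-populations with opposite X → Y mechanisms: marginally X carries almost no information about Y, while conditioning on E recovers it:

```python
import numpy as np
from collections import Counter

def mutual_info(pairs):
    """Plug-in estimate of I(A;B) in nats from a list of (a, b) samples."""
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(c / n * np.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

rng = np.random.default_rng(0)
# Two sub-populations with opposite X -> Y mechanisms:
# E=0: Y = X;  E=1: Y = 1 - X.  Marginally, X and Y look independent.
E = rng.integers(0, 2, size=4000)
X = rng.integers(0, 2, size=4000)
Y = np.where(E == 0, X, 1 - X)

I_xy = mutual_info(list(zip(X, Y)))
# Conditional MI: within-environment MI weighted by P(E = e).
I_xy_given_e = sum((E == e).mean()
                   * mutual_info(list(zip(X[E == e], Y[E == e])))
                   for e in (0, 1))
interaction = I_xy_given_e - I_xy
print(f"I(Y;X) = {I_xy:.3f}, I(Y;X|E) = {I_xy_given_e:.3f}, "
      f"interaction heterogeneity ~= {interaction:.3f} nats")
```

Here the interaction information approaches H(Y) ≈ log 2 nats, since within each environment Y is a deterministic function of X.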

3.2. PREDICTIVE HETEROGENEITY

Based on mutual information, the interaction heterogeneity is hard to compute, since standard mutual information is notoriously difficult to estimate, especially in big data scenarios. Moreover, even if the mutual information could be accurately estimated, the prediction model might not be able to make use of it. Inspired by Xu et al. (2020), we propose the Predictive Heterogeneity, which measures the interaction heterogeneity that can be captured under computational constraints and that affects the prediction of models within a specified predictive family. To begin with, we propose the Conditional Predictive V-information, which generalizes the predictive V-information.

Definition 4 (Conditional Predictive V-information). Let X, Y be two random variables taking values in X × Y and E be an environment variable. The conditional predictive V-information is defined as: I_V(X → Y|E) = H_V(Y|∅, E) − H_V(Y|X, E), where H_V(Y|∅, E) and H_V(Y|X, E) are defined as:

H_V(Y|X, E) = E_{e∼E}[ inf_{f∈V} E_{x,y∼X,Y|E=e}[−log f[x](y)] ],
H_V(Y|∅, E) = E_{e∼E}[ inf_{f∈V} E_{y∼Y|E=e}[−log f[∅](y)] ].

Intuitively, the conditional predictive V-information is the weighted average of the predictive V-information within each environment. We are now ready to formalize the predictive heterogeneity measure.

Definition 5 (Predictive Heterogeneity). Let X, Y be random variables taking values in X × Y and E be an environment set. The predictive heterogeneity for the prediction X → Y with respect to E is defined as:

H^E_V(X → Y) = sup_{E∈E} I_V(X → Y|E) − I_V(X → Y),

where I_V(X → Y) is the predictive V-information from Definition 2. Leveraging the predictive V-information, the predictive heterogeneity defined above characterizes the maximal additional information that can be used by the prediction model when the environment variable E is introduced.
It restricts the prediction models to V, and the explored additional information can benefit the prediction performance of models f ∈ V, hence the name predictive heterogeneity. Next, we present some basic properties of the interaction heterogeneity and the predictive heterogeneity.

Proposition 1 (Basic Properties of Predictive Heterogeneity). Let X, Y be random variables taking values in X × Y, V be a predictive family, and E, E_1, E_2 be environment sets.
1. Monotonicity: If E_1 ⊆ E_2, then H^{E_1}_V(X → Y) ≤ H^{E_2}_V(X → Y).
2. Nonnegativity: H^E_V(X → Y) ≥ 0.
3. Boundedness: H^E_V(X → Y) ≤ H_V(Y|X).
4. Corner Case: If the predictive family V is the largest possible predictive family, including all possible models, i.e. V = Ω, then H_E(X, Y) = H^E_Ω(X → Y).

For further theoretical properties of the predictive heterogeneity, in Section 3.3 we derive its explicit forms under endogeneity, a common reflection of data heterogeneity. We then demonstrate in Section 3.4 that our proposed predictive heterogeneity can be empirically estimated with guarantees if the complexity of V is bounded (e.g., its Rademacher complexity).
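For a fixed candidate environment variable E, the quantity I_V(X → Y|E) − I_V(X → Y) inside the supremum of Definition 5 can be estimated directly from data. The sketch below uses illustrative choices (a linear-Gaussian family with fixed σ = 1 and an oracle two-environment split of a two-slope mixture) and yields a clearly positive value:

```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA = 1.0

def v_info(X, Y):
    """Empirical I_V(X -> Y) for a linear-Gaussian family with fixed sigma."""
    A = np.column_stack([X, np.ones(len(Y))])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    mse = np.mean((Y - A @ coef) ** 2)
    # H_V(Y) - H_V(Y|X): the Gaussian log-normalizer terms cancel.
    return (np.mean((Y - Y.mean()) ** 2) - mse) / (2 * SIGMA**2)

# Mixture of two sub-populations with opposite slopes.
n = 2000
E = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 1))
Y = np.where(E == 0, 2.0, -2.0) * X[:, 0] + rng.normal(scale=0.3, size=n)

I_marginal = v_info(X, Y)
# Conditional predictive V-information: environment-weighted average.
I_cond = sum((E == e).mean() * v_info(X[E == e], Y[E == e]) for e in (0, 1))
H_pred = I_cond - I_marginal   # value inside the sup of Definition 5
print(f"candidate split's predictive heterogeneity ~= {H_pred:.3f} nats")
```

Pooled over both sub-populations, the opposite slopes cancel and the linear family extracts almost no information; within each environment the fit is nearly perfect, so the gap is large.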

3.3. THEORETICAL PROPERTIES IN LINEAR CASES

In this section, we analyze the theoretical properties of the predictive heterogeneity in several linear settings, including (1) a homogeneous case with independent noise and (2) heterogeneous cases with endogeneity brought by selection bias or hidden variables. Under these typical settings, we can approximate the analytical forms of the proposed measure, and the conclusions provide insights for general cases.

First, for a homogeneous case with no data heterogeneity, Theorem 1 proves that our measure is bounded by the scale of the label noise (which is usually small) and reduces to 0 in the linear case under mild assumptions. This indicates that the predictive heterogeneity is insensitive to independent noise. Note that in the linear case we only deal with environment variables satisfying X ⊥ ε|E, since in common prediction tasks the independent noise is unknown and cannot realistically be exploited for the inference of latent environments E.

Theorem 1 (Homogeneous Case with Independent Noise). For a prediction task X → Y where X, Y are random variables taking values in R^n × R, consider the data generation process Y = g(X) + ε, ε ∼ N(0, σ²), where g : R^n → R is a measurable function.
1) For a function class G such that g ∈ G, define the predictive family as V_G = {f | f[x] = N(φ(x), σ_V²), φ ∈ G, σ_V ∈ R_+}. For any environment set E, we have H^E_{V_G}(X → Y) ≤ πσ².
2) Take n = 1 and g(x) = βx, β ∈ R. Assume E[X] = 0 and that E[X²] exists. Given the predictive family V_σ = {f | f[x] = N(θx, σ²), θ ∈ R, σ fixed} and the environment set E = {E | E ∈ C, |supp(E)| = 2, X ⊥ ε|E}, we have H^E_{V_σ}(X → Y) = 0.

Second, we examine the proposed measure under two typical cases of data heterogeneity (Fan et al., 2014): endogeneity from selection bias (Heckman, 1979; Winship & Mare, 1992; Cui & Athey, 2022) and endogeneity with hidden variables (Fan et al., 2014; Arjovsky et al., 2019).
To begin with, in Theorem 2 we consider the prediction task X → Y with X, Y taking values in R² × R. Let X = [S, V]^T. The predictive family is specified as:

V = {f | f[x] = N(θ_S S + θ_V V, σ²), θ_S, θ_V ∈ R, σ = 1}.

The data distribution P(X, Y) is a mixture of latent sub-populations, which can be formulated by an environment variable E* ∈ C such that P(X, Y) = Σ_{e∈supp(E*)} P(E* = e) P(X, Y | E* = e). For each e ∈ supp(E*), P(X, Y | E* = e) is the distribution of a homogeneous sub-population. Note that the task is to predict Y from the covariates X, and the sub-population structure is latent; that is, P(E* | X, Y) is unknown to models. In the following, we derive the analytical form of our measure under the first typical case.

Theorem 2 (Endogeneity with Selection Bias). For the prediction task X = [S, V]^T → Y with a latent environment variable E*, the data generation process with selection bias is defined as:

Y = βS + f(S) + ε_Y, ε_Y ∼ N(0, σ_Y²);  V = r(E*) f(S) + σ(E*) · ε_V, ε_V ∼ N(0, 1),

where f : R → R and r, σ : supp(E*) → R are measurable functions and β ∈ R. Assume that E[f(S)S] = 0 and that there exists L > 1 such that L σ²(E*) < r²(E*) E[f²]. For the predictive family defined above and the environment set E = C, the predictive heterogeneity of the prediction task [S, V]^T → Y approximates to:

H^C_V(X → Y) ≈ (Var(r(E*)) E[f²] + E[σ²(E*)]) / (E[r²(E*)] E[f²] + E[σ²(E*)]) · E[f²(S)],

with error bounded by (1/2) max(σ_Y², R(r, σ, f)), where

R(r(E*), σ(E*), f) = E[(r² E[f²]/σ² + 1)^{−2}] E[f²] + E_{E*}[(r/σ + σ/(r E[f²]))^{−2}] < E[f²] (1/(L+1)² + 1/(L + 2 + 1/L)).

Intuitively, the data generation process in Theorem 2 introduces a spurious correlation between the spurious feature V and the target Y, which varies across sub-populations (i.e., r(E*) and σ(E*) vary) and thus brings about data heterogeneity.
Here E[f(S)S] = 0 indicates a model misspecification, since the nonlinear term f(S) cannot be captured by the linear predictive family using the stable feature S. The constant L characterizes the strength of the spurious correlation between V and Y; a larger L means V provides more information for prediction. From the approximation in Theorem 2, we can see that our proposed predictive heterogeneity is dominated by two terms: (1) Var[r(E*)] / E[r²(E*)], which characterizes the variation of r(E*) among sub-populations; and (2) E[f²(S)], which reflects the strength of the model misspecification. These two components account for the two sources of data heterogeneity under selection bias, which validates the rationality of our proposed measure. According to the theorem, the more r(E*) varies among sub-populations and the stronger the model misspecification, the larger the predictive heterogeneity. In general, Theorems 1 and 2 indicate that (1) our proposed measure is insensitive to homogeneous cases and (2) for the two typical sources of data heterogeneity, our measure accounts for the key components reflecting the latent heterogeneity. The theoretical results therefore validate the rationality of our measure.
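Both conclusions can be checked numerically. The sketch below uses illustrative choices (f(S) = sin(3S), r(E*) = ±2, σ(E*) = 0.3, a fixed-σ linear-Gaussian family), not the paper's exact constants: a Theorem-2-style generation process yields a clearly positive predictive-information gain under the oracle partition, while homogeneous data with a noise-independent random split (the setting of Theorem 1) yields a gain near zero:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4000
SIGMA = 1.0

def split_gain(X, Y, E):
    """I_V(X -> Y | E) - I_V(X -> Y) for a fixed-sigma linear-Gaussian family."""
    def info(Xs, Ys):
        A = np.column_stack([Xs, np.ones(len(Ys))])
        coef, *_ = np.linalg.lstsq(A, Ys, rcond=None)
        return (np.var(Ys) - np.mean((Ys - A @ coef) ** 2)) / (2 * SIGMA**2)
    cond = sum((E == e).mean() * info(X[E == e], Y[E == e])
               for e in np.unique(E))
    return cond - info(X, Y)

# Theorem 2 style: spurious feature V whose tie to Y flips across environments.
E = rng.integers(0, 2, size=N)
S = rng.normal(size=N)
f_S = np.sin(3 * S)                      # nonlinear term with E[f(S) S] ~ 0
Y = 1.0 * S + f_S + rng.normal(scale=0.3, size=N)
V = np.where(E == 0, 2.0, -2.0) * f_S + 0.3 * rng.normal(size=N)
gain_het = split_gain(np.column_stack([S, V]), Y, E)

# Theorem 1 style: homogeneous linear data, split independent of the noise.
Y_hom = 1.0 * S + rng.normal(scale=0.3, size=N)
gain_hom = split_gain(S.reshape(-1, 1), Y_hom, rng.integers(0, 2, size=N))

print(f"selection-bias gain ~= {gain_het:.3f}, homogeneous gain ~= {gain_hom:.4f}")
```

Within each environment, the linear family can exploit V to recover the misspecified term f(S), which is exactly the extra information the theorem attributes to the varying r(E*).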

3.4. PAC GUARANTEES FOR PREDICTIVE HETEROGENEITY ESTIMATION

Defined under explicit computational constraints, our predictive heterogeneity can be empirically estimated with guarantees if the complexity of the model family V is bounded. In this work, we provide finite-sample generalization bounds based on the Rademacher complexity. First, we give the definition of the empirical predictive heterogeneity; its explicit formula can be found in Definition 7 in the Appendix.

Definition 6 (Empirical Predictive Heterogeneity (informal)). For the prediction task X → Y with X, Y taking values in X × Y, a dataset D = {(x_i, y_i)}_{i=1}^N is independently and identically drawn from the population X, Y. Given the predictive family V and the environment set E_K = {E | E ∈ C, |supp(E)| = K}, where K ∈ N_+ is the number of environments, the empirical predictive heterogeneity Ĥ^{E_K}_V(X → Y; D) with respect to D is obtained by estimating H^{E_K}_V(X → Y) on D.

Theorem 3 (PAC Bound). Assume the predictive family V satisfies ∀x ∈ X, ∀y ∈ Y, ∀f ∈ V, log f[x](y) ∈ [−B, B] for some B > 0. For given K ∈ N, define the environment set E_K = {E | E ∈ C, |supp(E)| = K}, and let Q be the set of all probability distributions of (X, Y, E) with E ∈ E_K. Take an e ∈ supp(E) and define the function class G_V = {g | g(x, y) = log f[x](y) Q(E = e | x, y), f ∈ V, Q ∈ Q}. Denote the Rademacher complexity of G_V with N samples by R_N(G_V). Then for any δ ∈ (0, 1/(2K + 2)), with probability at least 1 − 2(K + 1)δ over a dataset D independently and identically drawn from X, Y, we have:

|H^{E_K}_V(X → Y) − Ĥ^{E_K}_V(X → Y; D)| ≤ 4(K + 1) R_{|D|}(G_V) + 2(K + 1) B sqrt(2 log(1/δ) / |D|),

where R_{|D|}(G_V) = O(|D|^{−1/2}) (Bartlett & Mendelson, 2002).

4. ALGORITHM

To empirically estimate the predictive heterogeneity in Definition 6, we derive the Information Maximization (IM) algorithm from the formal definition (Definition 7 in the Appendix) to infer the distribution of E that maximizes the empirical predictive heterogeneity Ĥ^{E_K}_V(X → Y; D).

Objective Function. Given a dataset D = {X_N, Y_N} = {(x_i, y_i)}_{i=1}^N, denote supp(E) = {e_1, ..., e_K}. We parameterize the distribution of E | (X_N, Y_N) with a weight matrix W ∈ W_K, where K is the pre-defined number of environments and W_K = {W : W ∈ R_+^{N×K} and W 1_K = 1_N} is the allowed weight space. Each element w_ij of W represents P(E = e_j | x_i, y_i), i.e., the probability of the i-th data point belonging to the j-th sub-population. For a predictive family V, the supremum problem in Definition 7 is equivalent to the following objective:

min_{W∈W_K} R_V(W, θ*_1(W), ..., θ*_K(W)) = (1/N) Σ_{i=1}^N Σ_{j=1}^K w_ij ℓ_V(f_{θ*_j}(x_i), y_i) + U_V(W, Y_N),
s.t. θ*_j(W) ∈ argmin_θ L_V(W, θ) = Σ_{i=1}^N w_ij ℓ_V(f_θ(x_i), y_i), for j = 1, ..., K,

where f_θ : X → Y denotes a predicting function parameterized by θ, ℓ_V(·, ·) : Y × Y → R is a loss function, and U_V(W, Y_N) is a regularizer. Specifically, f_θ, ℓ_V and U_V are determined by the predictive family V. Here we provide implementations for two typical and general machine learning tasks, regression and classification.

(1) For the regression task, the predictive family is typically modeled as:

V_1 = {g : g[x] = N(f_θ(x), σ²), f_θ is the predicting function with learnable θ, σ is a constant}.

The corresponding loss function is ℓ_{V_1}(f_θ(X), Y) = (f_θ(X) − Y)², and the regularizer becomes

U_{V_1}(W, Y_N) = Var_{j∈[K]}(Ȳ^j_N) = Σ_{j=1}^K (Σ_{i=1}^N w_ij y_i)² / (N Σ_{i=1}^N w_ij) − ((1/N) Σ_{i=1}^N y_i)²,

where Ȳ^j_N denotes the mean value of the label Y given E = e_j, and U_{V_1}(W, Y_N) computes the variance of Ȳ^j_N among the sub-populations e_1, ..., e_K.
(2) For the classification task, the predictive family is typically modeled as:

V_2 = {g : g[x] = f_θ(x) ∈ Δ_c, f_θ is the classification model with learnable θ},

where c is the number of classes and Δ_c denotes the c-dimensional simplex; each model in V_2 outputs a discrete distribution in the form of a point in the c-dimensional simplex. In this case, the corresponding loss function ℓ_{V_2}(·, ·) is the cross-entropy loss, and the regularizer becomes

U_{V_2}(W, Y_N) = − Σ_{j=1}^K (1/N) (Σ_{i=1}^N w_ij) H(Ȳ^j_N),

where H(Ȳ^j_N) is the entropy of Y given E = e_j.

Optimization. The bi-level optimization above can be solved by projected gradient descent w.r.t. W. The gradient of W is (omitting the subscript V):

∇_W R = ∇_W U + [ℓ(f_{θ*_j}(x_i), y_i)]^{N×K}_{i,j} + Σ_{j=1}^K ∇_{θ_j} R|_{θ*_j} ∇_W θ*_j,

where

∇_{θ_j} R|_{θ*_j} ∇_W θ*_j ≈ ∇_{θ_j} R|_{θ^t_j} Σ_{h≤t} [ Π_{k<h} (I − ∂²L/(∂θ_j ∂θ_j^T)|_{θ^{t−k−1}_j}) ] ∂²L/(∂θ_j ∂W^T)|_{θ^{t−h−1}_j}
≈ ∇_{θ_j} R|_{θ^t_j} ∂²L/(∂θ_j ∂W^T)|_{θ^{t−1}_j}, for j = 1, ..., K,

and [ℓ(f_{θ_j}(x_i), y_i)]^{N×K}_{i,j} is the N × K matrix whose (i, j)-th element is ℓ(f_{θ_j}(x_i), y_i). The first approximation replaces θ*_j with θ^t_j obtained from t steps of inner-loop gradient descent, and the second performs 1-step truncated backpropagation (Shaban et al., 2019; Zhou et al., 2022). Our information maximization algorithm then updates W by a projected gradient step, projecting W − η ∇_W R back onto W_K for a step size η. We further prove that minimizing this objective exactly solves the supremum w.r.t. E in the formal Definition 7 of the empirical predictive heterogeneity.

Theorem 4 (Justification of the IM Algorithm). For the regression task with predictive family V_1 and the classification task with V_2, the optimization of the objective above is equivalent to the supremum problem of the empirical predictive heterogeneity Ĥ^{E_K}_{V_1}(X → Y; D) and Ĥ^{E_K}_{V_2}(X → Y; D), respectively, with the pre-defined environment number K (i.e., |supp(E)| = K).

Remark 1 (Difference from Expectation Maximization).
The expectation maximization (EM) algorithm infers the latent variables of a statistical model so as to maximize the likelihood, while our information maximization (IM) algorithm infers the latent variables W that bring the maximal predictive heterogeneity, associated with the maximal additional predictive information. Due to the regularizer U_V in our objective function, the EM algorithm cannot efficiently solve our problem, and we therefore adopt bi-level optimization techniques.
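The pieces above can be assembled into a minimal regression sketch of the IM idea. Everything here is illustrative: the inner minimization is solved in closed form by weighted least squares, and the projected-gradient step on W is replaced by a softmax responsibility update (which also keeps each row of W on the simplex), rather than the paper's truncated-backpropagation update; U_{V_1} is simply evaluated on the learned assignment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression mixture: two opposite linear mechanisms (latent environments).
n = 400
e_true = rng.integers(0, 2, size=n)
x = rng.normal(size=n)
y = np.where(e_true == 0, 2.0, -2.0) * x + rng.normal(scale=0.2, size=n)

K = 2
thetas = np.array([1.0, -1.0])           # distinct initial slopes break symmetry

def u_regression(W, y):
    """Regularizer U_V1: between-sub-population variance of the label means."""
    p = W.sum(axis=0) / len(y)                            # P(E = e_j)
    means = (W * y[:, None]).sum(axis=0) / W.sum(axis=0)  # E[Y | E = e_j]
    return float((p * means ** 2).sum() - y.mean() ** 2)

for _ in range(30):
    # Per-point, per-environment squared losses under the current models.
    losses = (y[:, None] - x[:, None] * thetas[None, :]) ** 2
    # Soft assignment: softmax responsibilities keep each row of W on the
    # simplex (a smooth stand-in for the projected-gradient update on W).
    W = np.exp(-5.0 * losses)
    W /= W.sum(axis=1, keepdims=True)
    # Inner problem: weighted least squares per environment, in closed form.
    thetas = np.array([(W[:, j] * x * y).sum() / (W[:, j] * x * x).sum()
                       for j in range(K)])

losses = (y[:, None] - x[:, None] * thetas[None, :]) ** 2
fit_term = (W * losses).sum() / n        # (1/N) sum_ij w_ij * loss_ij
pred = W.argmax(axis=1)
acc = max((pred == e_true).mean(), (pred != e_true).mean())  # up to permutation
print(f"slopes {np.round(thetas, 2)}, fit term {fit_term:.3f}, "
      f"U_V1 {u_regression(W, y):.3f}, recovery accuracy {acc:.2f}")
```

Despite the simplifications, the alternating scheme recovers the two latent mechanisms (slopes near ±2) and assigns most points to their true sub-population.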

5. EXPERIMENTS

5.1. PROVIDING INSIGHTS FOR SUB-POPULATION DIVISION

The predictive heterogeneity can provide insights for sub-population division and benefit decision-making, which we illustrate on prediction tasks from various fields, including agricultural research, sociological research, and object recognition. The illustrative examples show that the learned sub-population division is highly explainable and relevant to decision-making.

Example: Agriculture. It is known that climate affects crop yields and crop suitability (Lobell et al., 2008). We leverage data from the NOAA database, which contains daily weather records from weather stations around the world. Following Zhao et al. (2021), we summarize the weather sequence of the year 2018 into summary statistics, including the average yearly temperature, humidity, wind speed and number of rainy days. The task is to predict the crop yield at each location from the weather summary statistics and location covariates (i.e., longitude and latitude). For ease of illustration, we focus on locations growing wheat or rice. Note that the input covariates do not contain the crop type. We use MLP models for this task and set K = 2 for our IM algorithm. Since the crop yield prediction mechanism is closely related to the crop type, which is unknown in the prediction task, we expect this to cause data heterogeneity in the entire dataset, and the recognized predictive heterogeneity should relate to it. In Figure 1(a), we plot the real distribution map of wheat and rice planting areas, and in Figure 1(b), we plot the two sub-populations learned by our IM algorithm. From the results, we find that the division given by our algorithm is strikingly similar to the real division of the two crops, indicating the rationality of our measure. For the areas where the divisions disagree (e.g., the Tibetan Plateau in Asia), we attribute the mismatch to missing features (e.g., population density, altitude) that significantly affect crop yields.
Example: Sociology. We use the UCI Adult dataset (Kohavi & Becker, 1996), which is derived from the 1994 Current Population Survey conducted by the US Census Bureau and is widely used in the study of algorithmic fairness. The task is to predict whether a person's income is greater or less than 50k US dollars from personal features. We use linear models for this task and set K = 2. In this example, we investigate whether there exist sub-population structures inside the data that affect the learning of machine learning models. In Figure 2(a), we plot summary statistics for the two learned sub-populations, whose main difference lies in capital gain. In Figure 2(b), we plot the feature importance given by the linear models for the two sub-populations. For people with high capital gain, the prediction model focuses mainly on capital gain, which is fair. However, for people with low capital gain, the model also places weight on sensitive attributes such as sex and marital status, which tends to cause discrimination. Our results correspond with those of Zhao et al. (2021) and can help identify potential inequality in decision-making. For example, our results indicate potential discrimination against people with low capital gain, which could further inform algorithm design and improve policy fairness.

Example: Object Recognition. Finally, we use the Waterbird dataset (Sagawa et al., 2019), which is widely used to evaluate model robustness in the robust learning field. The task is to recognize waterbirds versus landbirds. However, the image backgrounds are spuriously correlated with the target label: for the majority, waterbirds appear on water and landbirds on land, while for the minority the correlation is reversed. This spurious correlation causes predictive heterogeneity in the dataset, since it affects the predictions of machine learning models.
In this example, we use ResNet18 and set K = 2 in our IM algorithm. In Figure 3, to show the learned sub-populations of our method, we randomly sample 50 images for each class (waterbird or landbird) and each learned sub-population. In sub-population 1, the majority of landbirds are on the ground and waterbirds are on the water, while in sub-population 2, the majority of landbirds are on the water and waterbirds are on the ground; that is, the spurious correlation between object and background is reversed across the two learned sub-populations. Our measure thus captures the spurious correlation. The learned sub-populations can be leveraged by many robust learning methods (Sagawa et al., 2019; Koyama & Yamaguchi, 2020) to learn models with better generalization ability, since they help eliminate the influence of backgrounds on model predictions.

5.2. BENEFITING OOD GENERALIZATION

The predictive heterogeneity can benefit the out-of-distribution (OOD) generalization of machine learning models. Here we investigate the empirical performance of our IM algorithm w.r.t. OOD generalization on simulated data and real-world Colored MNIST data.

Baselines. We first compare with empirical risk minimization (ERM) and environment inference for invariant learning (EIIL, Creager et al. (2021)), which infers environments for learning invariance. We also compare with the well-known KMeans clustering algorithm.

Data Generation of Simulated Data. The input features X = [S, T, V]^T ∈ R^10 consist of stable features S ∈ R^5, noisy features T ∈ R^4 and a spurious feature V ∈ R:

S ∼ N(0, 2I_5), T ∼ N(0, 2I_4), Y = θ_S^T S + h(S) + N(0, 0.5), V ∼ Laplace(sign(r) · Y, 1/(5 ln|r|)),

where θ_S ∈ R^5 is the coefficient and h(S) = S_1 S_2 S_3 is the nonlinear term. |r| > 1 is a factor for each sub-population, and here the data heterogeneity is brought by endogeneity with hidden variables (Fan et al., 2014). V is the spurious feature whose relationship with Y is unstable across sub-populations and is controlled by the factor r: sign(r) controls whether the spurious correlation between V and Y is positive or negative, and |r| controls its strength (a larger |r| means a stronger spurious correlation). In training, we generate 10000 points, where the major group contains 80% of the data with r = 1.9 (strong positive spurious correlation) and the minor group contains 20% with r = −1.9 (strong negative spurious correlation). In testing, we evaluate the two groups separately, and we additionally set r = −2.3 and r = −2.7 to simulate stronger distributional shifts. We use linear regression, set K = 2 for all methods, and report mean squared errors (MSE).

Data Generation of Colored MNIST. Following Arjovsky et al.
(2019), we design a binary classification task constructed on the MNIST dataset. First, digits 0-4 are labeled Y = 0 and digits 5-9 are labeled Y = 1. Second, noisy labels Ỹ are induced by randomly flipping Y with probability 0.2. Then we sample the color id V, spuriously correlated with Ỹ, as V = +Ỹ with probability r and V = −Ỹ with probability 1 − r; thus r controls the spurious correlation between Ỹ and V. In training, we randomly sample 10000 data points and set r = 0.85, meaning that for 85% of the data V is positively correlated with Ỹ, while for the remaining 15% the spurious correlation is negative, which causes data heterogeneity w.r.t. V and Ỹ. In testing, we set r = 0 (strong negative spurious correlation), introducing a strong shift between training and testing.

Analysis. From the results in Table 1, on both the simulated and Colored MNIST data, the two backbones equipped with our IM algorithm achieve the best OOD generalization performance. For the simulated data, the learned predictive heterogeneity also enables the backbone algorithms to treat the majority and minority equally (i.e., a low performance gap between 'Major' and 'Minor'), which significantly benefits OOD generalization. Furthermore, for both experiments we plot the learned sub-populations of our IM algorithm in Figures 4 and 5. From Figure 4, compared with KMeans and EIIL, our predictive heterogeneity captures the spurious correlation between V and Y and enables the backbone algorithms to eliminate it. From Figure 5, the learned sub-populations of our method reflect the different directions of the spurious correlation between digit labels Y and colors (red or green), which helps backbone methods avoid using colors to predict digits.
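The two data-generation processes above can be sketched as follows. The coefficient θ_S, the per-group sample sizes per call, and the binary agree/disagree encoding of the color id are illustrative readings of the setup (the paper does not specify θ_S):

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated data (Eq. 23); theta_S = ones is an illustrative choice. ---
def simulate(n, r):
    theta_s = np.ones(5)
    S = rng.normal(0.0, np.sqrt(2.0), size=(n, 5))
    T = rng.normal(0.0, np.sqrt(2.0), size=(n, 4))
    Y = S @ theta_s + S[:, 0] * S[:, 1] * S[:, 2] \
        + rng.normal(0.0, np.sqrt(0.5), size=n)
    V = rng.laplace(np.sign(r) * Y, 1.0 / (5.0 * np.log(abs(r))))
    return np.column_stack([S, T, V]), Y

X_major, Y_major = simulate(8000, r=1.9)     # 80% majority sub-population
X_minor, Y_minor = simulate(2000, r=-1.9)    # 20% minority sub-population
corr_major = np.corrcoef(X_major[:, 9], Y_major)[0, 1]
corr_minor = np.corrcoef(X_minor[:, 9], Y_minor)[0, 1]

# --- Colored MNIST labels; color agrees with the noisy label w.p. r. ---
def colored_labels(digits, r):
    y = (digits >= 5).astype(int)                       # 0-4 -> 0, 5-9 -> 1
    y = np.where(rng.random(len(y)) < 0.2, 1 - y, y)    # 20% label noise
    agree = rng.random(len(y)) < r
    return y, np.where(agree, y, 1 - y)

digits = rng.integers(0, 10, size=20000)
y_tr, c_tr = colored_labels(digits, r=0.85)   # training: mostly aligned
y_te, c_te = colored_labels(digits, r=0.0)    # testing: correlation reversed
print(corr_major, corr_minor, (y_tr == c_tr).mean(), (y_te == c_te).mean())
```

The printed correlations make the heterogeneity explicit: V is strongly positively correlated with Y in the majority group and strongly negatively correlated in the minority group, and the train/test color-label agreement flips from 0.85 to 0.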

6. CONCLUSION

We define the predictive heterogeneity as the first quantitative formulation of the data heterogeneity that affects the prediction of machine learning models. We demonstrate its theoretical properties and show that leveraging it benefits out-of-distribution generalization performance.

A FORMAL DEFINITION OF EMPIRICAL PREDICTIVE HETEROGENEITY

In this section, we derive the explicit formula for the empirical estimation of the predictive heterogeneity described in Definition 6. The dataset D = {(x_i, y_i)}_{i=1}^{|D|} is independently and identically drawn from the population X, Y. Given a function family V and an environment set E_K, let Q be the set of all probability distributions of (X, Y, E) where E ∈ E_K. For a given E, denote supp(E) = {e_k}_{k=1}^K. The empirical predictive heterogeneity Ĥ^{E_K}_V(X → Y; D) is given by:

Ĥ^{E_K}_V(X → Y; D) = sup_{E ∈ E_K} Î_V(X → Y | E; D) − Î_V(X → Y; D)   (24)
= sup_{Q ∈ Q} Σ_{k=1}^K [Q(E = e_k) Ĥ_V(Y | E = e_k; D) − Q(E = e_k) Ĥ_V(Y | X, E = e_k; D)]   (25)
  − [Ĥ_V(Y; D) − Ĥ_V(Y | X; D)].   (26)

Specifically,

Q(E = e_k) Ĥ_V(Y | X, E = e_k; D)   (27)
= inf_{f ∈ V} Q(E = e_k) Σ_{(x_i, y_i) ∈ D} −log f[x_i](y_i) · Q(x_i, y_i | E = e_k) / Σ_{(x_j, y_j) ∈ D} Q(x_j, y_j | E = e_k)   (28)
= inf_{f ∈ V} Q(E = e_k) Σ_{(x_i, y_i) ∈ D} −log f[x_i](y_i) · Q(E = e_k | x_i, y_i) Q(x_i, y_i) / Σ_{(x_j, y_j) ∈ D} Q(E = e_k | x_j, y_j) Q(x_j, y_j)   (29)
= inf_{f ∈ V} Q(E = e_k) Σ_{(x_i, y_i) ∈ D} −log f[x_i](y_i) · Q(E = e_k | x_i, y_i) Q(x_i, y_i) / Q(E = e_k)   (30)
= inf_{f ∈ V} Σ_{(x_i, y_i) ∈ D} −log f[x_i](y_i) · Q(E = e_k | x_i, y_i) Q(x_i, y_i)   (31)
= inf_{f ∈ V} (1/|D|) Σ_{(x_i, y_i) ∈ D} −log f[x_i](y_i) · Q(E = e_k | x_i, y_i).   (32)

The explicit formulas for Q(E = e_k) Ĥ_V(Y | E = e_k; D), Ĥ_V(Y | X; D) and Ĥ_V(Y; D) can be derived similarly. Here we are ready to formally define the empirical predictive heterogeneity.

Definition 7 (Empirical Predictive Heterogeneity (formal)). For the prediction task X → Y with X, Y taking values in X × Y, a dataset D = {(x_i, y_i)}_{i=1}^N is independently and identically drawn from the population X, Y. Given the predictive family V and the environment set E_K = {E | E ∈ C, |supp(E)| = K} where K ∈ N, let Q be the set of all probability distributions of (X, Y, E) where E ∈ E_K.
The empirical predictive heterogeneity Ĥ^{E_K}_V(X → Y; D) with respect to D is defined as:

Ĥ^{E_K}_V(X → Y; D) = sup_{Q ∈ Q} Σ_{k=1}^K [Q(E = e_k) Ĥ_V(Y | E = e_k; D) − Q(E = e_k) Ĥ_V(Y | X, E = e_k; D)] − [Ĥ_V(Y; D) − Ĥ_V(Y | X; D)],   (33)

where

Q(E = e_k) Ĥ_V(Y | X, E = e_k; D) = inf_{f ∈ V} (1/|D|) Σ_{(x_i, y_i) ∈ D} −log f[x_i](y_i) · Q(E = e_k | x_i, y_i),   (34)
Q(E = e_k) Ĥ_V(Y | E = e_k; D) = inf_{f ∈ V} (1/|D|) Σ_{(x_i, y_i) ∈ D} −log f[∅](y_i) · Q(E = e_k | x_i, y_i),   (35)
Ĥ_V(Y | X; D) = inf_{f ∈ V} (1/|D|) Σ_{(x_i, y_i) ∈ D} −log f[x_i](y_i),   (36)
Ĥ_V(Y; D) = inf_{f ∈ V} (1/|D|) Σ_{(x_i, y_i) ∈ D} −log f[∅](y_i).   (37)

B SENSITIVITY OF K

In the experiments of Section 5, we set K = 2 for easy illustration. In this section, we add results with different choices of K for the simulated experiment in Section 5.2, showing that the OOD generalization performances of typical algorithms combined with our proposed method are not sensitive to the choice of K. In Figure 6, we show the out-of-distribution generalization error of our method with Sub-population Balancing, IRM and IGA as backbones. We plot the OOD testing performances under r = −2.7, which induces a strong distributional shift from the training distribution. From the results, we can see that the performances of the three OOD generalization methods are not much affected by the choice of K, and from Table 1, our method significantly outperforms all the baselines. Also, we add one more experiment to show that (1) when the chosen K is smaller than the ground truth, the performance of our method drops but is still better than ERM, and (2) when the chosen K is larger, the performance is not much affected (consistent with the results above).
Experiment Setting: The input features X = [S, T, V]^T ∈ R^10 consist of stable features S ∈ R^5, noisy features T ∈ R^4 and the spurious feature V ∈ R:

S ∼ N(2, 2I_5),  T ∼ N(0, 2I_4),  Y = θ_S^T S + S_1 S_2 S_3 + N(0, 0.5),

and we generate the spurious feature via:

V = θ_V^e Y + N(0, 0.3),

where θ_V^e varies across sub-populations and depends on which sub-population the data point belongs to. In training, we sample 8000 data points from e_1 with θ_V^1 = 3.0, 1000 points from e_2 with θ_V^2 = −1.0, 1000 points from e_3 with θ_V^3 = −2.0 and 1000 points from e_4 with θ_V^4 = −3.0. Therefore, the ground-truth number of sub-populations is 4. In testing, we test the performance on e_4 with θ_V^4 = −3.0, which has strong distributional shifts from the training data. The average MSE over 10 runs is shown in Figure 7. From the results, we can see that when K is smaller than the ground truth, increasing K benefits the OOD generalization performance, and when K is larger, the performance is not much affected, which is consistent with the results in Figure 6. For our IM algorithm, we think there are mainly two ways to choose K:

• According to the predictive heterogeneity index: when the chosen K is smaller than the ground truth, our measure tends to increase quickly as K increases; when K is larger than the ground truth, the increase slows down, which could guide people to choose an appropriate K.

• According to the prediction models: since our IM algorithm aims to learn sub-populations with different prediction mechanisms, one could compare the learned model parameters θ_1, ..., θ_K to judge whether K is much larger than the ground truth, i.e., if two resultant models are quite similar, K may be too large (one sub-population has been divided into two). For linear models, one can directly compare the coefficients. For deep models, one can calculate the transfer losses across sub-populations.

For a detailed analysis of the best choice of K, we leave it for future work.
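Definition 7 is directly computable once the soft assignments Q(E = e_k | x_i, y_i) are fixed. For a linear-Gaussian predictive family with σ = 1 (as in Appendix H), every term reduces to a weighted least-squares fit, so a plug-in estimate takes only a few lines. The following is a hedged sketch: the assignment matrix `q` is an input here, not the output of the full IM optimization, and the small ridge term is for numerical stability only.

```python
import numpy as np

GAUSS_CONST = 0.5 * np.log(2 * np.pi)  # constant part of the unit-variance Gaussian NLL

def wls(X, y, w):
    """Weighted least squares: argmin_theta sum_i w_i (y_i - x_i^T theta)^2."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw + 1e-8 * np.eye(X.shape[1]), Xw.T @ y)

def empirical_predictive_heterogeneity(X, y, q):
    """Plug-in estimate of Definition 7 for a linear-Gaussian family (sigma = 1).

    q[i, k] plays the role of Q(E = e_k | x_i, y_i); rows sum to one.
    Returns the heterogeneity value and the per-environment coefficients.
    """
    n, K = q.shape
    X1 = np.column_stack([X, np.ones(n)])  # add a bias column
    # Unconditional terms H_V(Y|X; D) and H_V(Y; D), Equations 36-37
    theta = wls(X1, y, np.ones(n))
    H_y_given_x = np.mean(0.5 * (y - X1 @ theta) ** 2) + GAUSS_CONST
    H_y = np.mean(0.5 * (y - y.mean()) ** 2) + GAUSS_CONST
    # Conditional terms, Equations 34-35: one weighted fit per environment
    cond_info, thetas = 0.0, []
    for k in range(K):
        w = q[:, k]
        theta_k = wls(X1, y, w)
        thetas.append(theta_k)
        mu_k = np.sum(w * y) / np.sum(w)  # weighted constant predictor f[None]
        H_k_yx = np.mean(w * (0.5 * (y - X1 @ theta_k) ** 2 + GAUSS_CONST))
        H_k_y = np.mean(w * (0.5 * (y - mu_k) ** 2 + GAUSS_CONST))
        cond_info += H_k_y - H_k_yx
    return cond_info - (H_y - H_y_given_x), np.array(thetas)
```

On a two-mechanism mixture, an oracle assignment yields a strictly positive value while the uninformative assignment q ≡ 1/K yields zero; comparing the per-environment coefficients returned in `thetas` also implements the model-comparison heuristic for choosing K described above.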

C RELATED WORK

To the best of our knowledge, data heterogeneity has not converged to a uniform formulation so far, and has different meanings in different fields. Li & Reynolds (1995) define heterogeneity in ecology based on the system property and complexity or variability. Rosenbaum (2005) views the uncertainty of the potential outcome as unit heterogeneity in observational studies in economics. For graph data, heterogeneity refers to various types of nodes and edges (Wang et al. (2019)). More recently, in machine learning, several works in causal learning (Peters et al., 2016; Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Creager et al., 2021) and robust learning (Sagawa et al., 2019) leverage heterogeneous data from multiple environments to improve the out-of-distribution generalization ability. Specifically, invariant learning methods (Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Creager et al., 2021; Zhou et al., 2022) leverage heterogeneous environments to learn invariant predictors with uniform performances across environments, and in the distributionally robust optimization field, Sagawa et al. (2019); Duchi et al. (2022) propose to optimize the worst-group prediction error to guarantee the OOD generalization performance. However, in machine learning, previous works have not provided a precise definition or sound quantification of data heterogeneity, which makes it confusing and hard to leverage when developing more rational machine learning algorithms. As for clustering algorithms, most only focus on the covariates X, typified by KMeans and the Gaussian Mixture Model (GMM, (Reynolds, 2009)). However, the clusters learned by KMeans cannot reflect the predictive heterogeneity, as shown by our experiments. Expectation maximization (EM, (Moon, 1996)) can also be used for clustering. However, our IM algorithm has essential differences from EM: our IM algorithm infers latent variables that maximize the predictive heterogeneity, while EM maximizes the likelihood. Also, there are methods from the invariant learning field (Creager et al., 2021) that infer environments. Though they can benefit OOD generalization, they lack a theoretical foundation and only work in some settings.
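The point that clusters learned from the covariates alone need not reflect predictive heterogeneity can be illustrated with a toy example (hypothetical data; a tiny hand-rolled 1-D KMeans stands in for the library version to keep the sketch self-contained): two sub-populations share the same P(X) but follow opposite mechanisms Y = X and Y = −X, so a clustering of X recovers the true split only at chance level.

```python
import numpy as np

def kmeans_1d(x, k=2, iters=50, seed=0):
    """Plain Lloyd's algorithm on scalar inputs."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(0)
n = 2000
group = np.repeat([0, 1], n // 2)   # true sub-populations
X = rng.normal(size=n)              # identical P(X) in both groups
Y = np.where(group == 0, X, -X)     # opposite prediction mechanisms
labels = kmeans_1d(X)
# Agreement between covariate clusters and the true split stays near chance (~0.5)
agreement = max((labels == group).mean(), (labels != group).mean())
```

Any partition based on X alone is blind to the flip in P(Y|X); the IM objective, in contrast, scores a candidate split by how much it improves per-environment prediction.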

D PROOF OF PROPOSITION 1

Proof of Proposition 1.

1. Monotonicity:

Since E_1 ⊆ E_2,

H^{E_1}_V(X → Y) = sup_{E ∈ E_1} I_V(X → Y | E) − I_V(X → Y)   (38)
≤ sup_{E ∈ E_2} I_V(X → Y | E) − I_V(X → Y)   (39)
= H^{E_2}_V(X → Y).   (40)

2. Nonnegativity:

According to the definition of the environment set, there exists E_0 ∈ E such that for any e ∈ supp(E_0), X, Y | E_0 = e is identically distributed as X, Y. Thus, we have

H^E_V(X → Y) = sup_{E ∈ E} [H_V(Y | ∅, E) − H_V(Y | X, E)] − [H_V(Y | ∅) − H_V(Y | X)]   (41)
≥ [H_V(Y | ∅, E_0) − H_V(Y | X, E_0)] − [H_V(Y | ∅) − H_V(Y | X)].   (42)

Specifically,

H_V(Y | X, E_0) = E_{e∼E_0}[inf_{f ∈ V} E_{x,y∼X,Y|E_0=e}[−log f[x](y)]]   (43)
= E_{e∼E_0}[inf_{f ∈ V} E_{x,y∼X,Y}[−log f[x](y)]]   (44)
= H_V(Y | X).   (45)

Similarly, H_V(Y | ∅, E_0) = H_V(Y | ∅). Thus, H^E_V(X → Y) ≥ 0.

3. Boundedness:

First, we have

H_V(Y | X, E) = E_{e∼E}[inf_{f ∈ V} E_{x,y∼X,Y|E=e}[−log f[x](y)]]   (46)
= E_{e∼E}[inf_{f ∈ V} E_{x∼X|E=e} E_{y∼Y|x,e}[−log f[x](y)]]   (47)
≥ 0,   (48)

by noticing that E_{y∼Y|x,e}[−log f[x](y)] is the cross entropy between Y|x,e and f[x]. Also,

H_V(Y | ∅, E) = E_{e∼E}[inf_{f ∈ V} E_{y∼Y|E=e}[−log f[∅](y)]]   (49)
≤ inf_{f ∈ V} E_{e∼E} E_{y∼Y|E=e}[−log f[∅](y)]   (50)
= inf_{f ∈ V} E_{y∼Y}[−log f[∅](y)]   (51)
= H_V(Y | ∅),   (52)

where Equation 50 is due to Jensen's inequality. Combining the above inequalities,

H^E_V(X → Y) = sup_{E ∈ E} [H_V(Y | ∅, E) − H_V(Y | X, E)] − [H_V(Y | ∅) − H_V(Y | X)]   (53)
≤ sup_{E ∈ E} H_V(Y | ∅, E) − [H_V(Y | ∅) − H_V(Y | X)]   (54)
≤ H_V(Y | ∅) − [H_V(Y | ∅) − H_V(Y | X)]   (55)
= H_V(Y | X).   (56)

4. Corner Case:

According to Proposition 2 in Xu et al. (2020),

H_Ω(Y | ∅) = H(Y),   (57)
H_Ω(Y | X) = H(Y | X).   (58)

By taking random variables R, S identically distributed as X, Y | E = e for e ∈ supp(E), we have

H_Ω(Y | X, E = e) = H_Ω(S | R) = H(S | R) = H(Y | X, E = e).   (59)

Thus, H_Ω(Y | X, E) = E_{e∼E}[H_Ω(Y | X, E = e)] = E_{e∼E}[H(Y | X, E = e)] = H(Y | X, E). Similarly, we have H_Ω(Y | ∅, E) = H(Y | E). Thus,

H^E_Ω(X → Y) = sup_{E ∈ E} [H_Ω(Y | ∅, E) − H_Ω(Y | X, E)] − [H_Ω(Y | ∅) − H_Ω(Y | X)]   (61)
= sup_{E ∈ E} [H(Y | E) − H(Y | X, E)] − [H(Y) − H(Y | X)]   (62)
= sup_{E ∈ E} I(Y; X | E) − I(Y; X)   (63)
= H^E(X, Y).   (64)

E PROOF OF THEOREM 1

Proof of Theorem 1.
1)

H_{V_G}(Y | X) = inf_{f ∈ V_G} E_{x∼X} E_{y∼Y|x}[−log f[x](y)]   (65)
≤ E_{x∼X} E_{y∼Y|x}[−log ((1/√(2π · (1/2π))) exp(−(y − g(x))² / (2 · (1/2π))))]   (66)
= E_{x∼X} E_{y∼Y|x}[π (y − g(x))²]   (67)
= πσ².   (68)

Equation 66 holds by taking f[x] = N(g(x), 1/(2π)).
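As a quick numerical sanity check (not part of the proof), the choice f[x] = N(g(x), 1/(2π)) in Equation 66 makes the log-normalizer vanish, so the negative log-density is exactly π(y − g(x))²; the values of y and the mean below are arbitrary illustrative inputs.

```python
import numpy as np

def gaussian_nll(y, mu, var):
    """Negative log-density of N(mu, var) evaluated at y."""
    return 0.5 * np.log(2 * np.pi * var) + (y - mu) ** 2 / (2 * var)

y, mu = 1.7, 0.4
var = 1.0 / (2.0 * np.pi)          # the variance used in Equation 66
nll = gaussian_nll(y, mu, var)     # 0.5 * log(1) + pi * (y - mu)^2
quadratic = np.pi * (y - mu) ** 2  # the right-hand side of Equation 67
```

Taking the expectation of the quadratic over y ∼ N(g(x), σ²) then gives πσ², which is the bound in Equation 68.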

2)

Given the function family V_σ = {f | f[x] = N(θx, σ²), θ ∈ R, σ fixed}, by expanding the Gaussian probability density function in the definition of predictive V-information, it could be shown that

I_{V_σ}(X → Y) ∝ min_{k ∈ R} E[(Y − kX)²] − Var(Y),   (69)

i.e., the predictive V-information is proportional to the mean squared error minus the variance of the target, with a coefficient depending only on σ. The minimization problem is solved by k = E[XY]/E[X²] = 1. Substituting k = 1 and Y = X + ε into Equation 69,

I_{V_σ}(X → Y) ∝ E[ε²] − Var(X + ε)   (71)
= −Var(X) = −E[X²].   (72)

Denote supp(E) = {E_1, E_2}. Let Q be the joint distribution of (X, ε, E). Let Q(E_1) = α and Q(E_2) = 1 − α be the marginal of E. Abbreviate Q(X, ε | E = E_1) by P_1(X, ε) and Q(X, ε | E = E_2) by P_2(X, ε). Similar to Equation 69,

I_{V_σ}(X → Y | E) ∝ min_k E[(Y − kX)² | E] − Var(Y | E).   (73)

For E = E_1, the minimization problem is solved by k = E_{P_1}[XY]/E_{P_1}[X²]. Thus,

I_{V_σ}(X → Y | E = E_1) ∝ E_{P_1}[(Y − (E_{P_1}[XY]/E_{P_1}[X²]) X)²] − Var_{P_1}(Y)   (75)
= E_{P_1}[Y²] − E²_{P_1}[XY]/E_{P_1}[X²] − (E_{P_1}[Y²] − E²_{P_1}[Y])   (76)
= E²_{P_1}[Y] − E²_{P_1}[XY]/E_{P_1}[X²].   (77)

Similarly, we have

I_{V_σ}(X → Y | E = E_2) ∝ E²_{P_2}[Y] − E²_{P_2}[XY]/E_{P_2}[X²].   (78)

Notably, E_{P_1}[X²] and E_{P_2}[X²] are constrained by α and E[X²]:

E[X²] = E[E[X² | E]] = α E_{P_1}[X²] + (1 − α) E_{P_2}[X²].   (79)

Similarly,

E[X²] = E[XY] = α E_{P_1}[XY] + (1 − α) E_{P_2}[XY],   (80)
0 = E[Y] = α E_{P_1}[Y] + (1 − α) E_{P_2}[Y].   (81)

The moments of P_2 can thereafter be represented by those of P_1:

E_{P_2}[X²] = (E[X²] − α E_{P_1}[X²]) / (1 − α),   (82)
E_{P_2}[XY] = (E[X²] − α E_{P_1}[XY]) / (1 − α),   (83)
E_{P_2}[Y] = −α E_{P_1}[Y] / (1 − α).   (84)

Substituting into Equation 78,

I_{V_σ}(X → Y | E = E_2) ∝ (α²/(1 − α)²) E²_{P_1}[Y] − (1/(1 − α)) (E[X²] − α E_{P_1}[XY])² / (E[X²] − α E_{P_1}[X²]).   (85)
Thus,

H^E_{V_σ}(X → Y) = sup_{E ∈ E} I_{V_σ}(X → Y) − α I_{V_σ}(X → Y | E = E_1) − (1 − α) I_{V_σ}(X → Y | E = E_2)   (86)
∝ sup_{E ∈ E} −E[X²] − α E²_{P_1}[Y] + α E²_{P_1}[XY]/E_{P_1}[X²] − (α²/(1 − α)) E²_{P_1}[Y] + (E[X²] − α E_{P_1}[XY])² / (E[X²] − α E_{P_1}[X²])   (87)
= sup_{E ∈ E} −(α/(1 − α)) E²_{P_1}[Y] + α (E_{P_1}[X²] − E_{P_1}[XY])² E[X²] / (E_{P_1}[X²] (E[X²] − α E_{P_1}[X²]))   (88)
= sup_{E ∈ E} −(α/(1 − α)) E²_{P_1}[X + ε] + α E²_{P_1}[Xε] E[X²] / (E_{P_1}[X²] (E[X²] − α E_{P_1}[X²])).   (89)

Assuming X ⊥ ε | E,

H^E_{V_σ}(X → Y) = sup_{E ∈ E} −(α/(1 − α)) E²_{P_1}[X + ε] ≤ 0.   (90)

From Proposition 1, we have H^E_{V_σ}(X → Y) ≥ 0. Thus, H^E_{V_σ}(X → Y) = 0.

F PROOF OF LINEAR CASES (THEOREM 2 AND ??)

Proof of Theorem 2. For ease of notation, we denote r(E*) as r_e, σ(E*) as σ_e, and σ(E*) · v as ε_e, and we omit the superscript C of H^C_V. Firstly, we calculate H_V[Y|∅] as:

H_V[Y|∅] = (1/(2σ²)) Var(Y) + log σ + (1/2) log 2π,   (91)
H_V[Y|∅, E*] = (1/(2σ²)) E_{E*}[Var(Y|E*)] + log σ + (1/2) log 2π.   (92)

Therefore, we have

H_V[Y|∅, E*] − H_V[Y|∅] = −(1/(2σ²)) Var(E[Y|E*]) ≤ 0.   (93)

As for H_V[Y|X], we have

H_V[Y|X] = inf_{h_S, h_V} E_{X,Y}[‖Y − (h_S S + h_V V)‖²] · (1/(2σ²))   (94)
= inf_{h̃_S, h_V} E_{X,Y}[‖f(S) + ε_Y − (h̃_S S + h_V V)‖²] · (1/(2σ²))   (95)
= inf_{h̃_S, h_V} E_{E*}[E[‖f(S) + ε_Y − (h̃_S S + h_V (r_e f(S) + ε_e))‖² | E*]] · (1/(2σ²)),   (96)

where we let h̃_S = h_S − β here. Then we have

2σ² H_V[Y|X] = inf_{h̃_S, h_V} E_{E*}[E[‖(1 − h_V r_e) f(S) + ε_Y − h̃_S S − h_V ε_e‖² | E*]]   (97)
= inf_{h̃_S, h_V} E_{E*}[E[‖(1 − h_V r_e) f(S) − h̃_S S‖² | E*]] + σ²_Y + h²_V E_{E*}[σ²_e],   (98)

notably that here for e_i, e_j ∈ supp(E*), we assume P_{e_i}(S, ε_Y) = P_{e_j}(S, ε_Y) (we choose such an E* as one possible split). And the solution of h̃_S, h_V is

h̃_S = (Var(r_e) E[f²(S)] E[f(S)S] + E[σ²_e] E[f(S)S]) / (E[r²_e] E[f²(S)] E[S²] + E[σ²_e] E[S²] − E²[r_e] E²[f(S)S]),   (99)
h_V = E[r_e] (E[f²(S)] E[S²] − E²[f(S)S]) / (E[r²_e] E[f²(S)] E[S²] + E[σ²_e] E[S²] − E²[r_e] E²[f(S)S]).   (100)

According to the assumption that E[f(S)S] = 0, we have

h̃_S = 0,  h_V = E[r(E*)] E[f²] / (E[r²(E*)] E[f²] + E[σ²(E*)]).   (102)

Therefore, we have

2σ² H_V[Y|X] = E_{E*}[E[‖(1 − h_V r_e) f(S)‖² | E*]] + σ²_Y + h²_V E_{E*}[σ²_e]   (103)
= ((Var(r_e) E[f²] + E[σ²(E*)]) / (E[r²_e] E[f²] + E[σ²(E*)])) E[f²(S)] + σ²_Y,   (104)
2σ² H_V[Y|X, E*] = σ²_Y + E[(1/(r²_e E[f²]/σ²_e + 1))²] E[f²] + E_{E*}[(1/(r_e/σ_e + σ_e/(r_e E[f²])))²].   (105)

Note that here we simply set σ = 1 in the main body. And we have:

H_V(X → Y) ≈ ((Var(r_e) E[f²] + E[σ²(E*)]) / (E[r²_e] E[f²] + E[σ²(E*)])) E[f²(S)].   (106)

The approximation error is bounded by (1/2) max(σ²_Y, R(r(E*), σ(E*), E[f²])), where R(r(E*), σ(E*), E[f²]) is defined as:

R(r(E*), σ(E*), E[f²]) = E[(1/(r²_e E[f²]/σ²_e + 1))²] E[f²] + E_{E*}[(1/(r_e/σ_e + σ_e/(r_e E[f²])))²].   (107)

Proof of Theorem ??. As proved above, we have

h_S = β + E[f(S)S] (Var(r_e)(E[f²(S)] + σ²_Y) + E[σ²_e]) / (E[r²_e] E[f²(S)] E[S²] + E[r²_e] σ²_Y E[S²] + E[σ²_e] E[S²] − E²[r_e] E²[f(S)S]),   (108)
h_V = (E[r_e](σ²_Y + E[f²(S)]) E[S²] − E[r_e] E²[f(S)S]) / (E[r²_e] E[f²(S)] E[S²] + E[r²_e] σ²_Y E[S²] + E[σ²_e] E[S²] − E²[r_e] E²[f(S)S]).   (109)

• For the model misspecification case, we further assume that (1) E[f(S)S] = 0 and (2) E[σ²_e] ≪ E[f²(S)] E[S²], and then we have

h_S = β,  h_V = E[r_e]/E[r²_e],

and for the heterogeneity, we have

(Var(r_e)/E[r²_e]) (E[f²(S)] + E[σ²_Y]) + h²_V E_{E*}[σ²_e] + σ²_Y ≥ 2σ² H_V(X → Y) ≥ (Var(r_e)/E[r²_e]) (E[f²(S)] + E[σ²_Y]) + h²_V E_{E*}[σ²_e] − E_{E*}[(1/r²_e) σ²_e].

• Without the model misspecification, we assume that f ≡ 0, and then we have

h_S = β,  h_V = E[r_e] σ²_Y / (E[r²_e] σ²_Y + E[σ²_e]),

and for the heterogeneity we have

2σ² H_V(X → Y) ≥ σ²_Y (1 − 2 h_V E[r_e] + h²_V E[r²_e]) + h²_V E[σ²_e] − E[(1/r²_e) σ²_e],
2σ² H_V(X → Y) ≤ σ²_Y (1 − 2 h_V E[r_e] + h²_V E[r²_e]) + h²_V E[σ²_e].
G PROOF OF THE ERROR BOUND FOR FINITE SAMPLE ESTIMATION (THEOREM 3)

In this section, we prove the error bound for estimating the predictive heterogeneity with the empirical predictive heterogeneity. Before the proof of Theorem 3, which is inspired by Xu et al. (2020), we introduce three lemmas.

Lemma 1. Assume ∀x ∈ X, ∀y ∈ Y, ∀f ∈ V, log f[x](y) ∈ [−B, B] where B > 0. Define the function class G^k_V = {g | g(x, y) = log f[x](y) q(E = e_k | x, y), f ∈ V, q ∈ Q}. Denote the Rademacher complexity of G with N samples by R_N(G). Define f̂_k = arg inf_f (1/|D|) Σ_{(x_i,y_i)∈D} −log f[x_i](y_i) q(E = e_k | x_i, y_i). Then for any q ∈ Q and any δ ∈ (0, 1), with probability over 1 − δ, we have

q(E = e_k) H_V(Y | X, E = e_k) − (1/|D|) Σ_{(x_i,y_i)∈D} −log f̂_k[x_i](y_i) q(E = e_k | x_i, y_i)   (117)
≤ 2 R_{|D|}(G^k_V) + B √(2 log(1/δ) / |D|).   (118)

Proof. Apply McDiarmid's inequality to the function Φ(D) defined as:

Φ(D) = sup_{f∈V, q∈Q} { q(E = e_k) E_q[−log f[x](y) | E = e_k] − (1/|D|) Σ_{(x_i,y_i)∈D} −log f[x_i](y_i) q(E = e_k | x_i, y_i) }.   (119)

Let D and D' be two identical datasets except for one data point x_j ≠ x'_j. We have:

Φ(D) − Φ(D')   (120)
≤ sup_{f∈V, q∈Q} { [q(E = e_k) E_q[−log f[x](y) | E = e_k] − (1/|D|) Σ_{(x_i,y_i)∈D} −log f[x_i](y_i) q(E = e_k | x_i, y_i)]   (121)
  − [q(E = e_k) E_q[−log f[x](y) | E = e_k] − (1/|D'|) Σ_{(x'_i,y'_i)∈D'} −log f[x'_i](y'_i) q(E = e_k | x'_i, y'_i)] }   (122)
≤ sup_{f∈V, q∈Q} (1/|D|) | Σ_{(x_i,y_i)∈D} −log f[x_i](y_i) q(E = e_k | x_i, y_i) − Σ_{(x'_i,y'_i)∈D'} −log f[x'_i](y'_i) q(E = e_k | x'_i, y'_i) |   (123)
= sup_{f∈V, q∈Q} (1/|D|) | log f[x_j](y_j) q(E = e_k | x_j, y_j) − log f[x'_j](y'_j) q(E = e_k | x'_j, y'_j) |   (124)
≤ 2B / |D|.   (125)

According to McDiarmid's inequality, for any δ ∈ (0, 1), with probability over 1 − δ, we have:

Φ(D) ≤ E_D[Φ(D)] + B √(2 log(1/δ) / |D|).   (126)

Next we derive a bound for E_D[Φ(D)]. Consider a dataset D' independently and identically drawn from q(X, Y) = P(X, Y) with the same size as D.
We notice that

q(E = e_k) E_q[−log f[x](y) | E = e_k]   (127)
= Σ_{x,y} q(x, y | E = e_k) q(E = e_k) (−log f[x](y))   (128)
= Σ_{x,y} q(x, y) q(E = e_k | x, y) (−log f[x](y))   (129)
= E_q[−log f[x](y) q(E = e_k | x, y)]   (130)
= E_{D'}[(1/|D'|) Σ_{(x'_i,y'_i)∈D'} −log f[x'_i](y'_i) q(E = e_k | x'_i, y'_i)].   (131)

Thus, E_D[Φ(D)] can be reformulated as:

E_D[Φ(D)]   (132)
= E_D[ sup_{f∈V, q∈Q} { E_{D'}[(1/|D'|) Σ_{D'} −log f[x'_i](y'_i) q(E = e_k | x'_i, y'_i)] − (1/|D|) Σ_D −log f[x_i](y_i) q(E = e_k | x_i, y_i) } ]   (133)
≤ E_{D,D'}[ sup_{f∈V, q∈Q} (1/|D|) { Σ_D log f[x_i](y_i) q(E = e_k | x_i, y_i) − Σ_{D'} log f[x'_i](y'_i) q(E = e_k | x'_i, y'_i) } ]   (137)
= E_{D,D',σ}[ sup_{f∈V, q∈Q} (1/|D|) { Σ_D σ_i log f[x_i](y_i) q(E = e_k | x_i, y_i) − Σ_{D'} σ_i log f[x'_i](y'_i) q(E = e_k | x'_i, y'_i) } ]   (139)
≤ E_{D,σ}[ sup_{f∈V, q∈Q} (1/|D|) Σ_D σ_i log f[x_i](y_i) q(E = e_k | x_i, y_i) ] + E_{D',σ}[ sup_{f∈V, q∈Q} (1/|D'|) Σ_{D'} σ_i log f[x'_i](y'_i) q(E = e_k | x'_i, y'_i) ]   (141)
= 2 R_{|D|}(G^k_V),   (143)

where the σ_i are independent Rademacher variables. Equation 137 follows from Jensen's inequality and the convexity of sup. Equation 139 holds due to the symmetry of log f[x_i](y_i) q(E = e_k | x_i, y_i) − log f[x'_i](y'_i) q(E = e_k | x'_i, y'_i) and the argument that Rademacher variables preserve the expected sum of symmetric random variables under a convex mapping (Ledoux & Talagrand (1991), Lemma 6.3). Substituting Equation 143 into Equation 126, we have for any δ ∈ (0, 1), with probability over 1 − δ, ∀f ∈ V, ∀q ∈ Q:

q(E = e_k) E_q[−log f[x](y) | E = e_k] − (1/|D|) Σ_{(x_i,y_i)∈D} −log f[x_i](y_i) q(E = e_k | x_i, y_i)   (144)
≤ 2 R_{|D|}(G^k_V) + B √(2 log(1/δ) / |D|).   (145)

Let f*_k = arg inf_f { q(E = e_k) E_q[−log f[x](y) | E = e_k] }.
Let f̂_k = arg inf_f { (1/|D|) Σ_{(x_i,y_i)∈D} −log f[x_i](y_i) q(E = e_k | x_i, y_i) }. Now we have

q(E = e_k) E_q[−log f*_k[x](y) | E = e_k] − (1/|D|) Σ_{(x_i,y_i)∈D} −log f*_k[x_i](y_i) q(E = e_k | x_i, y_i)   (146)
≤ q(E = e_k) H_V(Y | X, E = e_k) − (1/|D|) Σ_{(x_i,y_i)∈D} −log f̂_k[x_i](y_i) q(E = e_k | x_i, y_i)   (147)
≤ q(E = e_k) E_q[−log f̂_k[x](y) | E = e_k] − (1/|D|) Σ_{(x_i,y_i)∈D} −log f̂_k[x_i](y_i) q(E = e_k | x_i, y_i).   (148)

Combining Equations 144–145 with Equations 146–148, the lemma is proved.

Lemma 2. Assume ∀x ∈ X, ∀y ∈ Y, ∀f ∈ V, log f[∅](y) ∈ [−B, B] where B > 0. The definitions of G^k_V and R_N(G) follow from Lemma 1. Define f̂_k = arg inf_f (1/|D|) Σ_{(x_i,y_i)∈D} −log f[∅](y_i) q(E = e_k | x_i, y_i). Then for any q ∈ Q, any δ ∈ (0, 1), with probability over 1 − δ, we have

q(E = e_k) H_V(Y | E = e_k) − (1/|D|) Σ_{(x_i,y_i)∈D} −log f̂_k[∅](y_i) q(E = e_k | x_i, y_i)   (149)
≤ 2 R_{|D|}(G^k_V) + B √(2 log(1/δ) / |D|).   (150)

Proof. Similar to Lemma 1, we can prove that

q(E = e_k) H_V(Y | E = e_k) − (1/|D|) Σ_{(x_i,y_i)∈D} −log f̂_k[∅](y_i) q(E = e_k | x_i, y_i)   (151)
≤ 2 R_{|D|}(G^k_{V∅}) + B √(2 log(1/δ) / |D|),   (152)

where G^k_{V∅} = {g | g(x, y) = log f[∅](y) q(E = e_k | x, y), f ∈ V, q ∈ Q}. According to the definition of the predictive family V ((Xu et al., 2020), Definition 1), ∀f ∈ V, there exists f' ∈ V such that ∀x ∈ X, f'[x] = f[∅]. Thus, G^k_{V∅} ⊂ G^k_V, and therefore R_{|D|}(G^k_{V∅}) ≤ R_{|D|}(G^k_V). Substituting into Equation 151, the lemma is proved.

Lemma 3 ((Xu et al., 2020), Theorem 1). Assume ∀x ∈ X, ∀y ∈ Y, ∀f ∈ V, log f[x](y) ∈ [−B, B] where B > 0. Define the function class G*_V = {g | g(x, y) = log f[x](y), f ∈ V}. The definition of R_N(G) follows from Lemma 1. Then for any δ ∈ (0, 0.5), with probability over 1 − 2δ, we have

|I_V(X → Y) − Î_V(X → Y; D)| ≤ 4 R_{|D|}(G*_V) + 2B √(2 log(1/δ) / |D|).   (153)

Finally, we are prepared to prove Theorem 3.

Proof of Theorem 3. We first bound the estimation error by the sum of the items in Lemmas 1, 2 and 3.
|H^{E_K}_V(X → Y) − Ĥ^{E_K}_V(X → Y; D)|   (154)
= |sup_{E∈E_K} I_V(X → Y | E) − I_V(X → Y) − sup_{E∈E_K} Î_V(X → Y | E; D) + Î_V(X → Y; D)|   (155)
≤ |sup_{E∈E_K} I_V(X → Y | E) − sup_{E∈E_K} Î_V(X → Y | E; D)| + |I_V(X → Y) − Î_V(X → Y; D)|   (156)
≤ sup_{E∈E_K} |I_V(X → Y | E) − Î_V(X → Y | E; D)| + |I_V(X → Y) − Î_V(X → Y; D)|   (157)
≤ Σ_{k=1}^K sup_{q∈Q} |q(E = e_k) H_V(Y | E = e_k) − q(E = e_k) Ĥ_V(Y | E = e_k; D)|
 + Σ_{k=1}^K sup_{q∈Q} |q(E = e_k) H_V(Y | X, E = e_k) − q(E = e_k) Ĥ_V(Y | X, E = e_k; D)|   (162)
 + |I_V(X → Y) − Î_V(X → Y; D)|   (163)
= Σ_{k=1}^K sup_{q∈Q} |q(E = e_k) H_V(Y | E = e_k) − (1/|D|) Σ_{(x_i,y_i)∈D} −log f̂'_k[∅](y_i) q(E = e_k | x_i, y_i)|   (164)
 + Σ_{k=1}^K sup_{q∈Q} |q(E = e_k) H_V(Y | X, E = e_k) − (1/|D|) Σ_{(x_i,y_i)∈D} −log f̂_k[x_i](y_i) q(E = e_k | x_i, y_i)|   (165)
 + |I_V(X → Y) − Î_V(X → Y; D)|,   (166)

where f̂_k = arg inf_f (1/|D|) Σ_{(x_i,y_i)∈D} −log f[x_i](y_i) q(E = e_k | x_i, y_i) and f̂'_k = arg inf_f (1/|D|) Σ_{(x_i,y_i)∈D} −log f[∅](y_i) q(E = e_k | x_i, y_i), for any q ∈ Q and 1 ≤ k ≤ K. Denote the K summands in Equation 164 by Err'_k, the K summands in Equation 165 by Err_k, and Err* = |I_V(X → Y) − Î_V(X → Y; D)|. By Lemmas 1 and 2, each Err_k and Err'_k exceeds 2 R_{|D|}(G^k_V) + B √(2 log(1/δ)/|D|) with probability at most δ; by Lemma 3, Err* exceeds 4 R_{|D|}(G*_V) + 2B √(2 log(1/δ)/|D|) with probability at most 2δ. Since G^k_V = G_V and G*_V ⊂ G_V, we have R_{|D|}(G^k_V) ≤ R_{|D|}(G_V) and R_{|D|}(G*_V) ≤ R_{|D|}(G_V), so the event |H^{E_K}_V(X → Y) − Ĥ^{E_K}_V(X → Y; D)| > 4(K + 1) R_{|D|}(G_V) + 2(K + 1) B √(2 log(1/δ)/|D|) implies that at least one of the 2K + 1 events above occurs. Hence, by a union bound,

Pr[ |H^{E_K}_V(X → Y) − Ĥ^{E_K}_V(X → Y; D)| ≤ 4(K + 1) R_{|D|}(G_V) + 2(K + 1) B √(2 log(1/δ)/|D|) ] ≥ 1 − 2(K + 1)δ.   (178)

H PROOF OF THEOREM 4

Proof of Theorem 4. The objective function of our IM algorithm is directly derived from the definition of the empirical predictive heterogeneity in Definition 6. For the regression task, we assume the predictive family to be

V_1 = {g : g[x] = N(f_θ(x), σ²), f_θ is the regression model with learnable θ, σ = 1.0 (fixed)},

where we only care about the output of the model and the noise scale of the Gaussian distribution is often ignored, for which we simply set σ = 1.0 as a fixed term.
Then for each environment e ∈ supp(E*), I_V(X → Y | E* = e) becomes

I_V(X → Y | E* = e) ∝ min_θ E[‖Y − f_θ(X)‖² | E* = e] − Var(Y | E* = e),

which corresponds to the MSE loss and the proposed regularizer in Equation 17. For the classification task, the derivation is similar: the regularizer becomes the entropy of Y in sub-population e and the loss function becomes the cross-entropy loss.

I DISCUSSION ON DIFFERENCES WITH SUB-GROUP DISCOVERY

Subgroup discovery (SD, (Helal, 2016)) aims at extracting "interesting" relations among different variables (X) with respect to a target variable Y; the coverage and precision of each discovered group are the focus of such methods. To be specific, it learns a partition on P(X) such that some target label y dominates within each group. The most significant gap between subgroup discovery and our predictive heterogeneity lies in the pattern of distributional shift among clusters: for subgroup discovery, P(X) and P(Y) vary across subgroups but there is a universal P(Y|X), while for predictive heterogeneity P(Y|X) differs across sub-populations, which indicates diversified prediction mechanisms. It is this disparity of prediction mechanisms that inhibits the performance of a universal predictive model on a heterogeneous dataset, which is the emphasis of the OOD problem and group fairness. We think subgroup discovery is more applicable to settings where the distributional shift is minor but high explainability is required, since it generates simplified rules that people can understand. Also, subgroup discovery methods are suitable for settings that only involve tabular data (typically from a relational database), where the input features have clear semantics, while our proposed method can deal with general machine learning settings, including complicated data (e.g., image data) that involves representation learning. Also, when data heterogeneity w.r.t. the prediction mechanism exists inside data, our method is more applicable. However, both kinds of methods can be used to help people understand data and make more reasonable decisions.

J DISCUSSION ON THE POTENTIAL FOR FAIRNESS

We find combining our measure with algorithmic fairness an interesting and promising direction, and we think our measure has the potential to deal with algorithmic bias. Our method generates sub-populations with possibly different prediction mechanisms, which can help in the following aspects:

• Risk feature selection: we can select features according to our predictive heterogeneity measure to see which features bring the largest heterogeneity. If they are sensitive features, people should avoid their effects; if they are not, they can guide people to build better machine learning models.

• Examining algorithmic fairness: we can use the learned sub-populations to examine whether a given algorithm is fair by calculating the performance gap across the sub-populations.
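The second point can be sketched as follows (the predictions, labels and learned sub-population indices are illustrative placeholders; MSE is used as the per-group metric):

```python
import numpy as np

def subpopulation_gap(y_true, y_pred, env):
    """Largest gap in mean squared error across learned sub-populations."""
    errors = (y_true - y_pred) ** 2
    per_env = np.array([errors[env == k].mean() for k in np.unique(env)])
    return per_env.max() - per_env.min(), per_env

# Toy audit: a model that fits sub-population 0 well but sub-population 1 poorly
rng = np.random.default_rng(0)
env = np.repeat([0, 1], 500)                          # learned sub-population indices
y_true = rng.normal(size=1000)
noise = np.where(env == 0, 0.1, 1.0) * rng.normal(size=1000)
y_pred = y_true + noise                               # heteroscedastic prediction error
gap, per_env = subpopulation_gap(y_true, y_pred, env)
```

A large gap flags that the model's accuracy is concentrated on one learned sub-population, which is exactly the kind of disparity a fairness audit would investigate further.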



W ← Proj_{W_K}(W − η ∇_W R),  where η is the learning rate for W.   (22)
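In this update, each row of W is a probability vector over the K environments, so Proj_{W_K} can be implemented as a row-wise Euclidean projection onto the probability simplex. A minimal sketch (the sort-based projection is a standard algorithm; the gradient `grad_R` below is a random placeholder for ∇_W R computed by the inner loop):

```python
import numpy as np

def project_simplex(v):
    """Row-wise Euclidean projection onto the probability simplex
    (sort-based algorithm, O(K log K) per row)."""
    u = np.sort(v, axis=1)[:, ::-1]               # sort each row descending
    css = np.cumsum(u, axis=1) - 1.0
    idx = np.arange(1, v.shape[1] + 1)
    rho = (u - css / idx > 0).sum(axis=1)         # support size of the projection
    theta = css[np.arange(v.shape[0]), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)

def update_W(W, grad_R, eta=0.1):
    """One step of Equation 22: gradient step on R, then projection onto W_K."""
    return project_simplex(W - eta * grad_R)

rng = np.random.default_rng(0)
W = project_simplex(rng.random((6, 3)))           # 6 samples, K = 3 environments
W_new = update_W(W, grad_R=rng.normal(size=(6, 3)))
```

After every step each row of W remains a valid soft assignment (nonnegative, summing to one), which is what makes the bi-level IM optimization well posed.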



expectations replaced by statistics of finite samples. The formal definition is placed in Definition 7.

Theorem 3 (PAC Bound). Consider the prediction task X → Y where X, Y are random variables taking values in X × Y. Assume that the predictive family V satisfies ∀x ∈ X, ∀y ∈ Y, ∀f ∈ V, log f[x](y) ∈ [−B, B] with B > 0. Then for any δ ∈ (0, 0.5), with probability at least 1 − 2(K + 1)δ,

|H^{E_K}_V(X → Y) − Ĥ^{E_K}_V(X → Y; D)| ≤ 4(K + 1) R_{|D|}(G_V) + 2(K + 1) B √(2 log(1/δ)/|D|),

where R_{|D|}(G_V) is the Rademacher complexity of G_V = {g | g(x, y) = log f[x](y) q(E = e_k | x, y), f ∈ V, q ∈ Q} with |D| samples.

Figure 1: Results on the crop yield data. We color each region according to its main crop type, and the shade represents the proportion of the main crop type after smoothing via k-means (k = 3).

Figure 2: Results on the Adults data. Here we show the average of features and the feature coefficients of the two learned sub-populations.

Figure 3: Results on the Waterbird data. Here we randomly sample 50 images for each class and each learned sub-population.

Figure 4: Sub-population division on the simulated data of three methods, where two colors denote two sub-populations.

Figure 5: Sub-population division on the MNIST data of our IM algorithm.

For our IM algorithm and KMeans, we involve three algorithms as backbones to leverage the learned sub-populations, including sub-population balancing and invariant learning methods. Sub-population balancing simply equally weighs the learned sub-populations. Invariant risk minimization (IRM, (Arjovsky et al., 2019)) and inter-environment gradient alignment (IGA, (Koyama & Yamaguchi, 2020)) are typical methods in OOD generalization, which take the sub-populations as input environments to learn invariant models.

Figure 6: The out-of-distribution generalization error of our methods with Sub-population Balancing, IRM and IGA as backbones. Here we plot the errors of different backbones under r = -2.7, which introduces strong distributional shifts with training data.

Figure 7: The out-of-distribution generalization error of our methods with Sub-population Balancing, IRM and IGA as backbones for the added experiments. The ground-truth sub-population number is 4.





Table 1: Results of the experiments on out-of-distribution generalization, including the simulated data and colored MNIST data.


ACKNOWLEDGEMENTS

We would like to thank Yuting Pan, Jiaming Song, Fan Bao and anonymous reviewers for helpful feedback. Peng Cui's research was supported in part by the National Key R&D Program of China (No. 2018AAA0102004, No. 2020AAA0106300), the National Natural Science Foundation of China (No. U1936219, 62141607), and the Beijing Academy of Artificial Intelligence (BAAI). Bo Li's research was supported by the National Natural Science Foundation of China (No. 72171131, 72133002) and the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grants 2020AAA0108400 and 2020AAA0108403.

