MEASURING THE PREDICTIVE HETEROGENEITY

Abstract

As an intrinsic and fundamental property of big data, data heterogeneity exists in a variety of real-world applications, such as agriculture, sociology, and health care. For machine learning algorithms, ignoring data heterogeneity can significantly hurt generalization performance and algorithmic fairness, since the prediction mechanisms of different sub-populations are likely to differ. In this work, we focus on the data heterogeneity that affects the prediction of machine learning models, and formalize the Predictive Heterogeneity, a measure that takes model capacity and computational constraints into account. We prove that it can be reliably estimated from finite data with PAC bounds, even in high dimensions. Additionally, we propose the Information Maximization (IM) algorithm, a bi-level optimization algorithm, to explore the predictive heterogeneity of data. Empirically, the explored predictive heterogeneity provides insights into sub-population divisions in agriculture, sociology, and object recognition, and leveraging such heterogeneity benefits out-of-distribution generalization performance.

1. INTRODUCTION

Big data bring great opportunities to modern society and promote the development of machine learning, facilitating human life in a wide variety of areas, such as the digital economy, healthcare, and scientific discovery. Along with this progress, the intrinsic heterogeneity of big data introduces new challenges to machine learning systems and data scientists (Fan et al., 2014; He, 2017). In general, data heterogeneity, as a fundamental property of big data, refers to any diversity inside data, including the diversity of data sources, data generation mechanisms, sub-populations, data structures, etc. When not properly treated, data heterogeneity can bring pitfalls to machine learning systems, especially in high-stake applications such as precision medicine, autonomous driving, and financial risk management (Dzobo et al., 2018; Breitenstein et al., 2020; Challen et al., 2019), leading to poor out-of-distribution generalization performance and fairness issues. For example, in supervised learning tasks where machine learning models learn from data to predict the target variable from given covariates, when the whole dataset consists of multiple sub-populations with distribution shifts or different prediction mechanisms, traditional machine learning algorithms will mainly focus on the majority and ignore the minority. This hurts the generalization ability and compromises algorithmic fairness, as shown in (Kearns et al., 2018; Sagawa et al., 2019; Duchi & Namkoong, 2021). Another well-known example is Simpson's paradox, which brings false discoveries to social research (Wagner, 1982; Hernán et al., 2011).

Despite its widespread existence, due to its complexity, data heterogeneity has not converged to a uniform formulation so far, and has different meanings in different fields. Li & Reynolds (1995) define heterogeneity in ecology in terms of system properties and their complexity or variability.
Rosenbaum (2005) views the uncertainty of potential outcomes as unit heterogeneity in observational studies in economics. More recently, in machine learning, several works on causal learning (Peters et al., 2016; Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Liu et al., 2021; Creager et al., 2021) and robust learning (Sagawa et al., 2019; Liu et al., 2022) leverage heterogeneous data from multiple environments to improve out-of-distribution generalization ability. However, previous works have not provided a precise definition or sound quantification of data heterogeneity.

From the machine learning perspective, the main concern is the possible negative effect of data heterogeneity on making predictions. Therefore, given the complexity of data heterogeneity, in this work we focus on the data heterogeneity that affects the prediction of machine learning models, which could facilitate the building of machine learning systems, and we name it the predictive heterogeneity. We give a precise definition of predictive heterogeneity, quantified as the maximal additional predictive information that can be gained by dividing the whole data distribution into sub-populations. The new measure takes model capacity and computational constraints into account, and can be reliably estimated from finite samples with PAC bounds, even in high dimensions. We theoretically analyze its properties and examine it under typical cases of data heterogeneity (Fan et al., 2014). Additionally, we design the information maximization (IM) algorithm to empirically explore the predictive heterogeneity inside data. Empirically, we find that the explored heterogeneity is explainable and provides insights for sub-population divisions in many fields, including agriculture, sociology, and object recognition. Moreover, the explored sub-populations can be leveraged to enhance the out-of-distribution generalization performance of machine learning models, which we verify on both simulated and real-world data.
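As a toy illustration of the phenomenon described above (this example is illustrative only, not taken from the paper): when a dataset pools two sub-populations whose prediction mechanisms point in opposite directions, a single model fit on the pooled data can be nearly useless, while per-subpopulation models fit well. All variable names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two sub-populations with opposite prediction mechanisms: y = +x vs. y = -x.
x1 = rng.normal(size=n); y1 = x1 + 0.1 * rng.normal(size=n)
x2 = rng.normal(size=n); y2 = -x2 + 0.1 * rng.normal(size=n)

x_pool = np.concatenate([x1, x2])
y_pool = np.concatenate([y1, y2])

def r2_of_ls_fit(x, y):
    """R^2 of the least-squares line y ~ a*x + b."""
    a, b = np.polyfit(x, y, deg=1)
    resid = y - (a * x + b)
    return 1.0 - resid.var() / y.var()

# The pooled fit is nearly useless; the per-group fits are nearly perfect.
print(round(r2_of_ls_fit(x_pool, y_pool), 2))  # close to 0
print(round(r2_of_ls_fit(x1, y1), 2))          # close to 1
print(round(r2_of_ls_fit(x2, y2), 2))          # close to 1
```

The gap between the pooled and per-group fits is exactly the kind of "additional predictive information from sub-population division" that the proposed measure formalizes.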

2. PRELIMINARIES ON MUTUAL INFORMATION AND PREDICTIVE V-INFORMATION

In this section, we briefly introduce mutual information and the predictive V-information (Xu et al., 2020), which are the preliminaries of our proposed predictive heterogeneity.

Notations. For a probability triple (S, F, P), define random variables X : S → X and Y : S → Y, where X is the covariate space and Y is the target space. Accordingly, x ∈ X denotes the covariates and y ∈ Y denotes the target. Denote the set of random categorical variables as C = {C : S → N | supp(C) is finite}. Additionally, P(X) and P(Y) denote the sets of all probability measures over the Borel algebras on the spaces X and Y, respectively. H(·) denotes the Shannon entropy of a random variable, and H(·|·) denotes the conditional entropy of two random variables.

In information theory, the mutual information of two random variables X and Y measures the dependence between them, quantifying the reduction in entropy of one variable when the other is observed: I(X; Y) = H(Y) − H(Y|X). It is known that mutual information is associated with the predictability of Y (Cover Thomas & Thomas Joy, 1991). However, the standard definition of mutual information unrealistically assumes unbounded computational capacity of the predictor, rendering it hard to estimate, especially in high dimensions. To mitigate this problem, Xu et al. (2020) propose the predictive V-information under realistic computational constraints, where the predictor is only allowed to use models in a predictive family V to predict the target variable Y.

Definition 1 (Predictive Family (Xu et al., 2020)). Let Ω = {f : X ∪ {∅} → P(Y)}. We say that V ⊆ Ω is a predictive family if it satisfies: ∀f ∈ V, ∀P ∈ range(f), ∃f′ ∈ V, s.t. ∀x ∈ X, f′[x] = P and f′[∅] = P.

A predictive family contains all predictive models that the predictor is allowed to use, which forms computational or statistical constraints. The additional condition means that the predictor can always ignore the input covariates x if it chooses to (by using only ∅).
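For concreteness, the identity I(X; Y) = H(Y) − H(Y|X) can be computed directly for discrete variables from a joint probability table. The sketch below (illustrative; function names are our own) uses entropies in bits:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I(X;Y) = H(Y) - H(Y|X), from a joint probability table joint[i, j] = p(x_i, y_j)."""
    p_x = joint.sum(axis=1)  # marginal of X
    p_y = joint.sum(axis=0)  # marginal of Y
    h_y = entropy(p_y)
    # H(Y|X) = sum_x p(x) * H(Y | X = x)
    h_y_given_x = sum(p_x[i] * entropy(joint[i] / p_x[i])
                      for i in range(len(p_x)) if p_x[i] > 0)
    return h_y - h_y_given_x

# X and Y perfectly dependent: observing X removes all uncertainty in Y,
# so I(X;Y) = H(Y) = 1 bit.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(mutual_information(joint))  # 1.0
```

This direct computation is only feasible when the joint table is available, which is exactly what breaks down in high dimensions and motivates the V-information below.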
Definition 2 (Predictive V-information (Xu et al., 2020)). Let X, Y be two random variables taking values in X × Y and let V be a predictive family. The predictive V-information from X to Y is defined as:

I_V(X → Y) = H_V(Y|∅) − H_V(Y|X),

where H_V(Y|∅) and H_V(Y|X) are the predictive conditional V-entropies, defined as:

H_V(Y|X) = inf_{f∈V} E_{x,y∼X,Y} [−log f[x](y)],
H_V(Y|∅) = inf_{f∈V} E_{y∼Y} [−log f[∅](y)].

Note that f ∈ V is a function X ∪ {∅} → P(Y), so f[x] ∈ P(Y) is a probability measure on Y, and f[x](y) ∈ R is the density evaluated at y ∈ Y. H_V(Y|∅) is also denoted as H_V(Y).
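In practice, both V-entropies are empirical risk minimization problems: H_V(Y|X) is the best achievable log-loss over the family V, and H_V(Y|∅) is the log-loss of the best covariate-free (constant) predictor. The sketch below estimates I_V(X → Y) for binary Y with V taken, for illustration, as one-dimensional logistic models trained by plain gradient descent; the function names and hyperparameters are our own assumptions, not from Xu et al. (2020):

```python
import numpy as np

def cross_entropy(p, y):
    """Empirical -E[log f[x](y)] (natural log) for binary labels y and predicted P(Y=1)."""
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def h_v_given_x(x, y, steps=2000, lr=0.5):
    """H_V(Y|X): best log-loss over V = {x -> sigmoid(w*x + b)}, via gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        g = p - y  # per-sample gradient of the log-loss w.r.t. the logit
        w -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return cross_entropy(1.0 / (1.0 + np.exp(-(w * x + b))), y)

def h_v_marginal(y):
    """H_V(Y|∅): log-loss of the best constant predictor f[∅] = mean(y)."""
    return cross_entropy(np.full_like(y, y.mean(), dtype=float), y)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = (x + 0.5 * rng.normal(size=2000) > 0).astype(float)

# I_V(X -> Y) = H_V(Y|∅) - H_V(Y|X): positive when X helps V predict Y.
iv = h_v_marginal(y) - h_v_given_x(x, y)
print(round(iv, 2))
```

Unlike mutual information, this quantity depends on the chosen family V: a richer V can only increase the measured information, which is precisely how the definition encodes model capacity and computational constraints.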

