ACCOUNTING FOR UNOBSERVED CONFOUNDING IN DOMAIN GENERALIZATION

Abstract

The ability to extrapolate, or generalize, from observed to new related environments is central to any form of reliable machine learning, yet most methods fail when moving beyond i.i.d. data. In some cases, the reason lies in a misappreciation of the causal structure that governs the data, and in particular in the influence of unobserved confounders that drive changes in observed distributions and distort correlations. In this paper, we argue for defining generalization with respect to a broader class of distribution shifts (defined as arising from interventions in the underlying causal model), including changes in observed, unobserved and target variable distributions. We propose a new robust learning principle that may be paired with any gradient-based learning algorithm. This learning principle has explicit generalization guarantees, and relates robustness to certain invariances in the causal model, clarifying why, in some cases, test performance lags training performance. We demonstrate the empirical performance of our approach on healthcare data from different modalities, including image and speech data.

1. INTRODUCTION

Prediction algorithms use data, necessarily sampled under specific conditions, to learn correlations that extrapolate to new or related data. If successful, the performance gap between these two domains is small, and we say that algorithms generalize beyond their training data. Doing so is difficult, however: some form of uncertainty about the distribution of new data is unavoidable. The set of potential distributional changes that we may encounter is mostly unknown, and in many cases may be large and varied. Examples include covariate shifts (Bickel et al., 2009), interventions in the underlying causal system (Pearl, 2009), varying levels of noise (Fuller, 2009) and confounding (Pearl, 1998). All of these feature in modern applications, and while learning systems are increasingly deployed in practice, generalization of predictions and their reliability in a broad sense remains an open question. A common approach to formalize learning with uncertain data is, instead of optimizing for correlations in a fixed distribution, to do so simultaneously for a range of different distributions in an uncertainty set P (Ben-Tal et al., 2009),

minimize_f  sup_{P ∈ P}  E_{(x,y)∼P} [L(f(x), y)],     (1)

for some measure of error L of the function f that relates input and output examples (x, y) ∼ P. Choosing different sets P leads to estimators with different properties. This formulation includes as special cases many approaches in domain adaptation, covariate shift, robust statistics and optimization (Kuhn et al., 2019; Bickel et al., 2009; Duchi et al., 2016; 2019; Sinha et al., 2017; Wozabal, 2012; Abadeh et al., 2015; Duchi & Namkoong, 2018). Robust solutions to problem (1) generalize if shifted test distributions are contained in P, but larger sets P in general yield conservative solutions (i.e. with sub-optimal performance) on data sampled from distributions away from worst-case scenarios.
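For a finite collection of training environments, the uncertainty set P can be taken as the empirical distributions {P_1, ..., P_k} themselves, in which case problem (1) reduces to minimizing the maximum per-environment risk. The sketch below, in NumPy, is a minimal illustration of this idea with a linear model and a subgradient method that at each step follows the gradient of the currently worst environment; the data, model class and step size are illustrative assumptions, not the method proposed in this paper.

```python
import numpy as np

def worst_case_risk(w, envs):
    """The robust risk: the maximum mean-squared error over environments."""
    return max(np.mean((X @ w - y) ** 2) for X, y in envs)

def group_dro(envs, n_steps=500, lr=0.01):
    """Minimize the maximum risk over environments by taking, at each
    step, a gradient step on the currently worst-performing environment."""
    d = envs[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(n_steps):
        # identify the environment with the highest current risk
        risks = [np.mean((X @ w - y) ** 2) for X, y in envs]
        X, y = envs[int(np.argmax(risks))]
        # subgradient step on that environment's MSE
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# two toy environments sharing one linear mechanism, with different input scales
envs = []
for scale in (1.0, 2.0):
    X = rng.normal(0, scale, size=(200, 2))
    y = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.1, 200)
    envs.append((X, y))

w = group_dro(envs)
```

Because both environments here share the same underlying mechanism, the worst-case solution coincides with the shared coefficients; the interesting regimes discussed next arise when environments disagree.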
One formulation of causality is in fact also a version of this problem, for P defined as the set of distributions arising from arbitrary interventions on observed covariates x leading to shifts in their distribution P_x (see e.g. sections 3.2 and 3.3 in (Meinshausen, 2018)). The invariance of causal solutions to changes in covariate distributions is powerful for generalization, but implicitly assumes that all covariates or other drivers of the outcome subject to change at test time are observed. Often shifts occur elsewhere, for example in the distribution of unobserved confounders, in which case conditional distributions P_{y|x} may also shift. Perhaps surprisingly, in the presence of unobserved confounders, the goals of achieving robustness and learning a causal model can be different (similar behaviour also occurs with varying measurement noise): there is in general an inherent trade-off in generalization performance, with causal and correlation-based solutions each optimal in different regimes, depending on the shift in the underlying generating mechanism from which new data is drawn. Consider a simple example, illustrated in Figure 1, to show this explicitly.
We assume access to observations of variables (X_1, X_2, Y) in two training datasets, each sampled with a different variance (σ² = 1 and σ² = 2) from the following structural model F:

X_2 := −H + E_{X_2},    Y := X_2 + 3H + E_Y,    X_1 := Y + X_2 + E_{X_1},

where E_{X_1}, E_{X_2} ∼ N(0, σ²) and E_Y ∼ N(0, 1) are exogenous variables. In a first scenario (leftmost panel) we consider all data (training and testing) to be generated without unobserved confounders, H := 0; in a second scenario (remaining panels), all data are generated with unobserved confounders, H := E_H ∼ N(0, 1). Each panel of Figure 1 shows performance on new data obtained after manipulating the underlying data generating system; the magnitude and type of intervention appear on the horizontal axis. We consider the following learning paradigms: Ordinary Least Squares (OLS), which learns the linear mapping that minimizes average training risk; Domain Robust Optimization (DRO), which minimizes the maximum training risk among the two available datasets; and the causal solution, assumed known, with fixed coefficients (0, 1) for (X_1, X_2). Two important observations motivate this paper. First, OLS and DRO absorb spurious correlations (due to H, and the fact that X_1 is caused by Y), with unstable performance under shifts in p(X_1, X_2) but, as a consequence, good performance under shifts in p(H). Causal solutions, by contrast, are robust to shifts in p(X_1, X_2), even on new data with large shifts, but underperform substantially under changes in the distribution of unobserved confounders p(H). Second, the presence of unobserved confounding hurts generalization performance in general, with higher errors for all methods; contrast, e.g., the middle and leftmost panels.
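The structural model above can be simulated directly to reproduce the trade-off. The sketch below takes σ = 1 for simplicity and chooses illustrative intervention magnitudes (a do-intervention replacing X_1 with pure noise of standard deviation 10, and a tripling of the standard deviation of H); it fits OLS on confounded training data and compares it with the causal coefficients (0, 1) under each shift.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

def sample(n, h_std=1.0, do_x1=None):
    """Draw from the structural model F (with sigma = 1); optionally
    rescale the confounder H or intervene on X1's assignment, do(X1)."""
    H = h_std * rng.normal(size=n)
    e_x2, e_y, e_x1 = (rng.normal(size=n) for _ in range(3))
    X2 = -H + e_x2
    Y = X2 + 3 * H + e_y
    X1 = Y + X2 + e_x1 if do_x1 is None else do_x1 * rng.normal(size=n)
    return np.column_stack([X1, X2]), Y

def mse(beta, X, Y):
    return np.mean((X @ beta - Y) ** 2)

# fit OLS on confounded training data (H ~ N(0, 1))
X, Y = sample(n)
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
beta_causal = np.array([0.0, 1.0])  # true coefficients of (X1, X2) on Y

# shift in p(X1): the causal solution is robust, OLS degrades
Xa, Ya = sample(n, do_x1=10.0)
# shift in p(H): OLS is robust, the causal solution degrades
Xb, Yb = sample(n, h_std=3.0)
```

Under do(X_1) the causal predictor's error stays at Var(3H + E_Y) = 10 no matter how large the intervention, while the OLS error grows with the intervention scale; under the shift in p(H) the ordering reverses, matching the two regimes in Figure 1.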
To the best of our knowledge, the influence of unobserved confounders has been little explored in the context of generalization of learning algorithms, even though, as Figure 1 shows, different shifts in distribution may have important consequences for predictive performance.

Our Contributions. In this paper we provide a new choice of P for learning problem (1) that we show to be justified by certain statistical invariances across training and testing data, to be expected in the presence of unobserved confounders. This leads us to define a new differentiable, regularized objective for representation learning. Our proposal defines P as the set of affine combinations of the available training distributions, and we show that solutions to this problem are robust to more general shifts in distribution than previously considered, spanning robustness to shifts in observed, unobserved, and target variables, depending on the properties of the available training distributions. This approach has benefits for out-of-sample performance, but also for tasks involving variable selection, where important features are consistently replicated across experiments with our objective.
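To make the affine-combination idea concrete, one natural parameterization (an assumption for illustration here, not necessarily the paper's exact set) is P = {Σ_e α_e P_e : Σ_e α_e = 1, α_e ≥ −η}, which for η > 0 extrapolates beyond the convex hull of the training distributions. Since the expected loss is linear in α, its supremum over this set is attained at a vertex and has a closed form in the per-environment risks, so it is differentiable almost everywhere and can be plugged into any gradient-based learner.

```python
import numpy as np

def affine_worst_case(risks, eta):
    """sup over {a : sum(a) = 1, a_e >= -eta} of sum_e a_e * risks[e].

    The maximizer puts weight -eta on every environment except the worst
    one, which receives the remaining mass 1 + (k - 1) * eta.
    """
    risks = np.asarray(risks, dtype=float)
    k = len(risks)
    worst = risks.max()
    return (1 + (k - 1) * eta) * worst - eta * (risks.sum() - worst)
```

With η = 0 this recovers the maximum over training environments (standard DRO over the convex hull); larger η additionally penalizes the spread between environments, enforcing robustness to distributions outside the convex hull of the training data.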



Figure 1: The challenges of generalization. In the presence of unobserved confounders, there is an inherent trade-off in performance: causal and correlation-based solutions are both optimal in different regimes, depending on the shift from which new data is generated. The proposed approach, DIRM, is a relaxation of the causal solution that naturally interpolates between the causal solution and Ordinary Least Squares (OLS), and is described in Section 3. The data generating mechanism, methods, and a discussion of the results are given in the paragraphs above our contributions.

