PARTIAL TRANSPORTABILITY FOR DOMAIN GENERALIZATION

Abstract

Learning prediction models that generalize to related domains is one of the most fundamental challenges in artificial intelligence. There exists a growing literature that argues for learning invariant associations using data from multiple source domains. However, whether invariant predictors generalize to a given target domain depends crucially on the assumed structural changes between domains. Using the perspective of transportability theory, we show that invariance learning, and the settings in which invariant predictors are optimal in terms of worst-case losses, is a special case of a more general partial transportability task. Specifically, the partial transportability task seeks to identify / bound a conditional expectation $E_{P^*}[Y \mid x]$ in an unseen domain $\pi^*$ using knowledge of qualitative changes across domains in the form of causal graphs and data from source domains $\pi_1, \ldots, \pi_k$. We show that solutions to this problem have a much wider generalization guarantee that subsumes those of invariance learning and other robust optimization methods that are inspired by causality. For computations in practice, we develop an algorithm that provably provides tight bounds asymptotically in the number of data samples from source domains for any partial transportability problem with discrete observables and illustrate its use on synthetic datasets.

1. INTRODUCTION

Generalization guarantees are central to the design of reliable machine learning models, as the predictions and conclusions obtained in one or several source domains $\pi_1, \ldots, \pi_k$ (e.g., in controlled laboratory circumstances, from a specific study or population, etc.) are transported and applied elsewhere, in a domain $\pi^*$ that may differ in several aspects from the source domains. Which structural assumptions are imposed on the relationship between domains determines whether a model will generalize as intended. For example, if the target environment is arbitrary, or substantially different from the study environment, transporting predictions is difficult or even impossible. A structural account of causation provides suitable semantics for reasoning about the structural invariances across different domains, and has been studied under the umbrella of transportability theory (Pearl & Bareinboim, 2011; Bareinboim et al., 2013; Bareinboim & Pearl, 2016). Each domain $\pi_i$ is associated with a different structural causal model (SCM) $M_i$ that differs in one or more of its components with respect to other domains and defines different distributions over the observed variables. In practice, the SCMs are usually not fully observable, which leads to the transportability challenge of using data from one (or more) SCMs to make inferences about distributions from another SCM. A query, e.g., $E_{P^*}[Y \mid x]$, is said to be point identified if it can be uniquely computed given the available data (from one or more domains) and qualitative knowledge about the causal changes between domains in the form of selection diagrams. However, in problems of transportability, especially when no data in the target domain can be collected, the combination of qualitative assumptions and data often does not permit one to uniquely determine a given query, which is then said to be non-identifiable.
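As a toy illustration of why such a query may fail to transport, the following sketch (all mechanisms, parameters, and variable names here are invented for illustration) simulates two SCMs that share their structural functions but differ only in the distribution of the exogenous variable $U$; the conditional expectation $E[Y \mid X = 1]$ then differs between the two domains, so no function of the source distribution alone can pin it down in the target:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(p_u, n=100_000):
    # Exogenous noise U; its distribution is what changes across domains.
    u = rng.binomial(1, p_u, size=n)
    # Shared structural mechanisms for X and Y (invariant across domains).
    x = rng.binomial(1, 0.2 + 0.6 * u)            # X depends on U
    y = rng.binomial(1, 0.1 + 0.4 * x + 0.4 * u)  # Y depends on X and U
    return x, y

def cond_exp_y(x, y, x_val=1):
    # Empirical estimate of E[Y | X = x_val].
    return y[x == x_val].mean()

x_src, y_src = sample_scm(p_u=0.2)  # source domain pi_1
x_tgt, y_tgt = sample_scm(p_u=0.8)  # target domain pi*

# Same mechanisms, different P(U): the conditional expectations disagree.
print(cond_exp_y(x_src, y_src), cond_exp_y(x_tgt, y_tgt))
```

Because $U$ confounds $X$ and $Y$, shifting $P(U)$ shifts $E[Y \mid X = 1]$ even though every structural function is held fixed.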
In such cases, partial identification methods bound a given query, e.g., $l < E_{P^*}[Y \mid x] < u$, in non-identifiable problems, and may still serve an informative purpose for decision-making if $0 < l < u < 1$. Both settings have been studied in the literature. In particular, there exists an extensive set of graphical conditions and algorithms for the identifiability of observational, interventional, and counterfactual distributions across domains from a combination of datasets in various settings (Pearl & Bareinboim, 2011; Bareinboim et al., 2013; Bareinboim & Pearl, 2014; 2016; Lee et al., 2020; Correa & Bareinboim, 2019). For example, Lee et al. (2020) investigate the transportability of conditional causal effects, while Correa & Bareinboim (2020) investigate the transportability of soft interventions, or policies, from an arbitrary combination of datasets collected under different conditions. Several methods also exist for the partial identification of causal effects and counterfactuals (Balke & Pearl, 1997; Chickering & Pearl, 1996; Zhang et al., 2021) that aim at bounding, instead of point-identifying, a particular causal effect. Despite the generality of these results, there is still no treatment of, or algorithms for, the partial identification of transportability queries.
In the machine learning literature, notably, a version of the transportability task is also widely studied as the problem of domain generalization (Wang et al., 2022). The objective is to learn a prediction function with a minimum performance guarantee on any distribution in some uncertainty set that includes potential test / target distributions (Ben-Tal et al., 2009; Gulrajani & Lopez-Paz, 2020). This problem has implicit connections to causality and SCMs if uncertainty sets of distributions are defined on the basis of "invariant correlations", such as stable conditional expectations $E_{P_1}[Y \mid x] = \cdots = E_{P_k}[Y \mid x]$ across training domains $\pi_1, \ldots, \pi_k$, to be used for prediction in a target domain $\pi^*$, and that may be learned from data sampled across sufficiently many different domains with statistical tests (Peters et al., 2016) or custom loss functions (Magliacane et al., 2018; Arjovsky et al., 2019; Rojas-Carulla et al., 2018; Bellot & van der Schaar, 2020). For instance, Arjovsky et al. (2019) argue for learning representations that define an invariant optimal classifier across several training datasets. Subbaswamy et al. (2019) and Subbaswamy & Saria (2020) use causal graphs and identifiable interventional distributions to define invariant prediction rules across domains. Notwithstanding their wide applicability, there is little theoretical understanding of the extrapolation guarantees that can be expected from invariant prediction rules given a finite set of domains. Correlations invariant across source domains need not be invariant in a target domain; and performance guarantees, in general, depend on the structural invariances assumed for the respective SCMs. In this paper, we start by describing, from first principles using the semantics of structural causal models (Pearl, 2009; Pearl & Bareinboim, 2011), the conditions under which invariant prediction rules can be expected to perform well in an arbitrary target domain. We then introduce a broader optimization problem, the task of partial transportability, whose objective is to bound, instead of point-estimate, a query in an arbitrary target domain of interest, such as $E_{P^*}[Y \mid x]$, given data from one or more source domains and qualitative knowledge about the causal changes between domains in the form of selection diagrams. We demonstrate that solutions to this problem subsume various instantiations of invariant predictors (in the conditions where these are adequate) and enjoy a wider distributional robustness guarantee with respect to any distribution in the target domain that is compatible with the assumed selection diagrams. For computations in practice, we show that the partial transportability task can be solved approximately, for systems of variables with finite domains, with a Markov chain Monte Carlo sampling approach.
The resulting bounds are sound and tight, and provide the most informative inference on a target query given the available information.
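For intuition on what such bounds look like, here is a minimal sketch (all mechanisms and numbers are made-up toy choices, with the structural functions assumed known; only the exogenous distribution $P^*(U)$ is allowed to change in the target) that brute-forces the range of $E_{P^*}[Y \mid X = 1]$ over every target distribution compatible with that assumed change:

```python
import numpy as np

# Toy discrete SCM: binary U -> (X, Y), mechanisms fixed across domains;
# the target domain pi* may carry an arbitrary P*(U).
p_x_given_u = {0: 0.2, 1: 0.8}               # P(X=1 | U=u)
p_y_given_xu = {(1, 0): 0.5, (1, 1): 0.9,    # P(Y=1 | X=x, U=u)
                (0, 0): 0.1, (0, 1): 0.5}

def e_y_given_x1(p_u1):
    # P(U=1 | X=1) by Bayes' rule under the candidate P*(U=1) = p_u1,
    # then E[Y | X=1] by averaging the fixed mechanism over U.
    num = p_x_given_u[1] * p_u1
    den = num + p_x_given_u[0] * (1 - p_u1)
    w = num / den
    return (1 - w) * p_y_given_xu[(1, 0)] + w * p_y_given_xu[(1, 1)]

# Sweep all candidate target values of P*(U=1); report the induced bounds.
grid = np.linspace(0.0, 1.0, 1001)
vals = [e_y_given_x1(t) for t in grid]
lower, upper = min(vals), max(vals)
print(f"E_P*[Y | X=1] in [{lower:.2f}, {upper:.2f}]")  # [0.50, 0.90]
```

The grid search is only viable for this one-parameter toy; the paper's algorithm replaces it with a principled sampling scheme over all model parameters compatible with the source data and the selection diagram.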

1.1. PRELIMINARIES

We introduce in this section some basic notation and definitions that will be used throughout the paper. We use capital letters to denote variables ($X$), small letters for their values ($x$), bold letters for sets of variables ($\mathbf{X}$) and their values ($\mathbf{x}$), and $\Omega$ to denote their domains of definition ($x \in \Omega_X$). A conditional independence statement in a distribution $P$ is written $(X \perp\!\!\!\perp Y \mid Z)_P$. A d-separation statement in a graph $G$ is written $(X \perp\!\!\!\perp Y \mid Z)_G$. For convenience, we write $P(x)$ for the probability $P(X = x)$, and $\mathbb{1}\{\cdot\}$ for the indicator function, equal to 1 if the statement in $\{\cdot\}$ evaluates to true and 0 otherwise. All proofs are given in the Appendix. We use the language of structural causal models (SCMs) (Definition 7.1.1, Pearl, 2009) to define the semantics of causality. An SCM $M$ is a tuple $M = \langle \mathbf{V}, \mathbf{U}, \mathcal{F}, P \rangle$ where $\mathbf{V}$ is a set of endogenous variables and $\mathbf{U}$ is a set of exogenous variables. Each exogenous variable $U \in \mathbf{U}$ is distributed according to a probability measure $P(u)$. $\mathcal{F}$ is a set of functions, where each $f_V \in \mathcal{F}$ determines the deterministic dependence of $V$ on other parts of the system. That is, $v := f_V(\mathit{pa}_V, u_V)$, with $\mathit{Pa}_V \subset \mathbf{V}$ and $\mathbf{U}_V \subset \mathbf{U}$ the exogenous sources of variation that influence $V$. With this construction, we define the potential response $\mathbf{V}(\mathbf{u})$ to be the solution of $\mathbf{V}$ in the model $M$ given $\mathbf{U} = \mathbf{u}$. Moreover, drawing values of the exogenous variables $\mathbf{U}$ according to the probability measure $P$ induces a distribution over the endogenous variables $\mathbf{V}$.
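To make the definition concrete, a minimal discrete SCM and its potential responses can be sketched as follows (the particular functions, variable names, and exogenous distribution are arbitrary toy choices, not anything prescribed by the paper):

```python
# Minimal discrete SCM M = <V, U, F, P>: each f_V maps (pa_V, u_V) to v,
# and P is a distribution over the single exogenous variable U (binary).
P_U = {0: 0.6, 1: 0.4}                      # P(u)
F = {
    "X": lambda pa, u: u,                   # x := f_X(u)
    "Y": lambda pa, u: pa["X"] ^ u,         # y := f_Y(pa_Y, u), XOR as a toy mechanism
}

def potential_response(u):
    # Solve V(u): evaluate each f_V in topological order given U = u.
    v = {}
    v["X"] = F["X"]({}, u)
    v["Y"] = F["Y"]({"X": v["X"]}, u)
    return v

# Distribution over V induced by drawing U ~ P(u).
P_V = {}
for u, p in P_U.items():
    v = tuple(sorted(potential_response(u).items()))
    P_V[v] = P_V.get(v, 0.0) + p
print(P_V)
```

Note that because $f_Y$ here is $X \oplus U$ and $X = U$, the induced distribution puts all its mass on $Y = 0$: different mechanisms can induce the same observed distribution, which is exactly the ambiguity the transportability machinery must contend with.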