PARTIAL TRANSPORTABILITY FOR DOMAIN GENERALIZATION

Abstract

Learning prediction models that generalize to related domains is one of the most fundamental challenges in artificial intelligence. A growing literature argues for learning invariant associations using data from multiple source domains. However, whether invariant predictors generalize to a given target domain depends crucially on the assumed structural changes between domains. Using the perspective of transportability theory, we show that invariance learning, and the settings in which invariant predictors are optimal in terms of worst-case losses, is a special case of a more general partial transportability task. Specifically, the partial transportability task seeks to identify or bound a conditional expectation E_{P*}[Y | x] in an unseen domain π* using knowledge of qualitative changes across domains in the form of causal graphs and data from source domains π_1, ..., π_k. We show that solutions to this problem enjoy a much wider generalization guarantee that subsumes those of invariance learning and other robust optimization methods inspired by causality. For computations in practice, we develop an algorithm that provably provides tight bounds, asymptotically in the number of data samples from the source domains, for any partial transportability problem with discrete observables, and illustrate its use on synthetic datasets.

1. INTRODUCTION

Generalization guarantees are central to the design of reliable machine learning models, as the predictions and conclusions obtained in one or several source domains π_1, ..., π_k (e.g., in controlled laboratory circumstances, from a specific study or population, etc.) are transported and applied elsewhere, in a domain π* that may differ in several aspects from the source domains. What structure and what assumptions are imposed on the relationship between domains determines whether a model will generalize as intended. For example, if the target environment is arbitrary, or substantially different from the study environment, transporting predictions is difficult or even impossible. A structural account of causation provides suitable semantics for reasoning about structural invariances across different domains, and has been studied under the umbrella of transportability theory (Pearl & Bareinboim, 2011; Bareinboim et al., 2013; Bareinboim & Pearl, 2016). Each domain π_i is associated with a different structural causal model (SCM) M_i that differs in one or more of its components from the other domains and defines a different distribution over the observed variables. In practice, the SCMs are usually not fully observable, which leads to the transportability challenge of using data from one (or more) SCMs to make inferences about distributions from another SCM. A query, e.g., E_{P*}[Y | x], is said to be point-identified if it can be uniquely computed given the available data (from one or more domains) and qualitative knowledge about the causal changes between domains in the form of selection diagrams. However, in problems of transportability, especially when no data in the target domain can be collected, the combination of qualitative assumptions and data often does not permit one to uniquely determine a given query, which is then said to be non-identifiable.
In such cases, partial identification methods aim to bound a given query, e.g., l < E_{P*}[Y | x] < u, in non-identifiable problems, and may still serve an informative purpose for decision-making if 0 < l < u < 1. Both settings have been studied in the literature. In particular, there exists an extensive set of graphical conditions and algorithms for the identifiability of observational, interventional, and counterfactual distributions across domains from a combination of datasets in various settings (Pearl & Bareinboim, 2011; Bareinboim et al., 2013; Bareinboim & Pearl, 2014; 2016; Lee et al., 2020; Correa & Bareinboim, 2019). For example, Lee et al. (2020) investigate the transportability of conditional causal effects, while Correa & Bareinboim (2020) investigate the transportability of soft interventions or policies, from an arbitrary combination of datasets collected under different conditions. Several methods also exist for partial identification of causal effects and counterfactuals (Balke & Pearl, 1997; Chickering & Pearl, 1996; Zhang et al., 2021) that aim at bounding, instead of point-identifying, a particular causal effect. Despite the generality of these results, there is still no treatment or algorithm for the partial identification of transportability queries. In the machine learning literature, notably, a version of the transportability task is also widely studied as the problem of domain generalization (Wang et al., 2022). The objective is to learn a prediction function with a minimum performance guarantee on any distribution in some uncertainty set that includes potential test/target distributions (Ben-Tal et al., 2009; Gulrajani & Lopez-Paz, 2020). This problem has implicit connections to causality and SCMs if uncertainty sets of distributions are defined on the basis of "invariant correlations", such as stable conditional expectations E_{P_1}[Y | x] = ... = E_{P_k}[Y | x] across training domains π_1, ..., π_k, to be used for prediction in a target domain π*, and that may be learned from data sampled across sufficiently many different domains with statistical tests (Peters et al., 2016; Subbaswamy et al., 2019; Subbaswamy & Saria, 2020) or custom loss functions (Magliacane et al., 2018; Arjovsky et al., 2019; Rojas-Carulla et al., 2018; Bellot & van der Schaar, 2020). For instance, Arjovsky et al. (2019) argue for learning representations that define an invariant optimal classifier across several training datasets. Subbaswamy et al. (2019); Subbaswamy & Saria (2020) use causal graphs and identifiable interventional distributions to define invariant prediction rules across domains. Notwithstanding their wide applicability, there is little theoretical understanding of the extrapolation guarantees that can be expected from invariant prediction rules given a finite set of domains. Correlations that are invariant across source domains need not be invariant in a target domain; and performance guarantees, in general, depend on the structural invariances assumed for the respective SCMs. In this paper, we start by describing, from first principles, the conditions under which invariant prediction rules can be expected to perform well in an arbitrary target domain, using the semantics of structural causal models (Pearl, 2009; Pearl & Bareinboim, 2011). We then introduce a broader optimization problem, the task of partial transportability, whose objective is to bound, instead of point-estimate, a query in an arbitrary target domain of interest, such as E_{P*}[Y | x], given data from one or more source domains and qualitative knowledge about the causal changes between domains in the form of selection diagrams.
We demonstrate that solutions to this problem subsume various instantiations of invariant predictors (in the conditions where these are adequate) and enjoy a wider distributional robustness guarantee with respect to any distribution in the target domain that is compatible with the assumed selection diagrams. For computations in practice, we show that the partial transportability task can be solved approximately for systems of variables with finite domains with a Markov chain Monte Carlo sampling approach. The resulting bounds are sound and tight, and provide the most informative inference on a target query given the available information.

1.1. PRELIMINARIES

We introduce in this section some basic notation and definitions that will be used throughout the paper. We use capital letters to denote variables (X), small letters for their values (x), bold letters for sets of variables (X) and their values (x), and Ω to denote their domains of definition (x ∈ Ω_X). A conditional independence statement in a distribution P is written as (X ⊥⊥ Y | Z)_P. A d-separation statement in a graph G is written as (X ⊥⊥ Y | Z)_G. For convenience, we write P(x) for the probability P(X = x), and 1{·} for the indicator function equal to 1 if the statement in {·} evaluates to true, and equal to 0 otherwise. All proofs are given in the Appendix. We use the language of structural causal models (SCMs) (Definition 7.1.1, Pearl (2009)) to define the semantics of causality. An SCM M is a tuple M = ⟨V, U, F, P⟩ where V is a set of endogenous variables and U is a set of exogenous variables. Each exogenous variable U ∈ U is distributed according to a probability measure P(u). F is a set of functions, where each f_V ∈ F determines the deterministic dependence of V on other parts of the system. That is, v := f_V(pa_V, u_V), with Pa_V ⊂ V and U_V ⊂ U the exogenous sources of variation that influence V. With this construction, we define the potential response V(u) to be the solution of V in the model M given U = u. Moreover, drawing values of the exogenous variables U following the probability measure P induces a joint distribution over the observables given by,

P(v) = ∫_{Ω_U} ∏_{V∈V} 1{f_V(pa_V, u_V) = v} dP(u). (1)

Figure 1: (a) causal graph G* over X_1, W, Y, X_2; selection diagrams (b) G^{*,a} with selection node S_W, (c) G^{*,b} with selection nodes S_W and S_{X_1}, and (d) G^{a,b} with selection node S_{X_1}.

An SCM induces a causal graph G in which each variable in V is associated with a node; we draw a directed edge between two variables, X → Y, if X ∈ Pa_Y appears as an argument of f_Y in the SCM, and a bidirected arrow, X ↔ Y, if U_X ∩ U_Y ≠ ∅, that is, if X and Y share an unobserved confounder. The set of parent nodes of X in G is denoted pa(X)_G = ∪_{X∈X} pa(X)_G. Its capitalized version Pa includes the argument as well, e.g., Pa(X)_G = pa(X)_G ∪ X. Similar definitions are used for children ch, descendants de, etc.
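As a concrete illustration of Eq. (1), the following sketch samples a small discrete SCM and estimates its induced observational distribution by Monte Carlo. The variable names and mechanisms are illustrative placeholders, not those of the paper's examples.

```python
import random
from collections import Counter

random.seed(0)

def sample_scm():
    # exogenous draws from P(u); u_wy is a shared source of variation for W and Y
    u_wy = random.random() < 0.3
    u_x = random.random() < 0.5
    # deterministic mechanisms v := f_V(pa_V, u_V)
    x = int(u_x)
    w = int(x or u_wy)
    y = int(w and not u_wy)
    return x, w, y

# Monte-Carlo estimate of the induced observational distribution P(x, w, y)
n = 100_000
counts = Counter(sample_scm() for _ in range(n))
p = {v: c / n for v, c in counts.items()}
```

Each draw of the exogenous variables determines all endogenous values, so the empirical frequencies approximate the integral in Eq. (1).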

2. DOMAIN GENERALIZATION THROUGH THE LENS OF TRANSPORTABILITY

We adopt the setting of domain generalization. We assume access to k source domains π_1, π_2, ..., π_k with associated data distributions P_1(v), P_2(v), ..., P_k(v) over a common set of variables V. Our focus is on a query, such as E_{P*}[Y | x], to be evaluated in a target domain π* (potentially) different from the source domains, where typically Y is an outcome variable, X is a set of covariates, and Y ∪ X = V. For concreteness, consider a medical study where patient data was collected under different treatment protocols in an attempt to assess, in a target hospital π*, the prognosis of neurodegenerative diseases such as Alzheimer's in patients with a number of existing conditions. In the causal graph in Fig. 1a, X_1 and X_2 are treatments for hypertension and clinical depression, respectively, both known to be causes of neurodegenerative diseases Y. In the case of hypertension, the effect is mediated by blood pressure W, whose effect on neurodegenerative diseases is confounded, since both conditions share important confounding factors such as physical activity levels and diet patterns (Skoog & Gustafson, 2006) (graphically encoded through the bidirected arrows). Hypertension and clinical depression are not known to affect each other (no direct link between them), although it is common for patients with clinical depression to simultaneously be at risk of hypertension (Meng et al., 2012). In this example, we have access to an observational study conducted in a hospital π_a, and to a hospital π_b following different guidelines for the administration of X_1; both hospitals, however, are known to have a different incidence of high blood pressure W than π*. These differences are called domain discrepancies (Pearl & Bareinboim, 2011).

Definition 1 (Domain Discrepancy, (Pearl & Bareinboim, 2011)). Let π_a and π_b be domains associated, respectively, with SCMs M_a and M_b and causal diagrams G_a and G_b.
We denote by ∆_{a,b} ⊂ V a set of variables such that, for every V_i ∈ ∆_{a,b}, there might exist a discrepancy, namely f^a_{V_i} ≠ f^b_{V_i} or P_a(U_i) ≠ P_b(U_i).

Definition 2 (Selection diagram, (Pearl & Bareinboim, 2011)). Given domain discrepancies ∆_{a,b} between two domains π_a and π_b and a causal graph G_a = ⟨V, E⟩, let S = {S_V : V ∈ ∆_{a,b}} be called selection nodes. Then, a selection diagram G^{a,b} is defined as the graph ⟨V ∪ S, E ∪ {S_V → V}_{S_V ∈ S}⟩.

Selection nodes locate the mechanisms where structural discrepancies between the two domains are suspected to take place. The absence of a selection node pointing to a variable represents the assumption that the mechanism responsible for assigning a value to that variable is identical in both domains. In the medical example above, Fig. 1b shows a selection diagram comparing domains π_a and π*, in which the node S_W indicates a structural difference in the assignment of W, i.e., f_W ≠ f^a_W and/or P*(u_W) ≠ P_a(u_W), but not in the assignment of other variables; for instance, f_Y = f^a_Y and P*(u_Y) = P_a(u_Y). Fig. 1c and Fig. 1d are selection diagrams that compare domains (π_b, π*) and (π_a, π_b), respectively.

Figure 2: (a) selection diagram G^{a,b} over X_1, ..., X_5, Y with selection node S_{X_1}; (b) G^{1,2} over X_1, X_2, Y with S_{X_1}; (c) G^{*,1} and (d) G^{*,2}, each with selection nodes S_{X_1} and S_{X_2}.

2.1. INVARIANCE LEARNING FOR DOMAIN GENERALIZATION

It is apparent that there is a degree of unidentifiability in optimal prediction rules in a target domain, depending on the structural differences between it and the available data. A natural objective for a chosen prediction function is to minimize the worst-case loss over an uncertainty set of potential target distributions that are compatible with a set of selection diagrams {G^{i,*} : i = 1, ..., k},

argmin_f max_{M ∈ M(G*)} E_{P^M}[(Y - f(X))^2], (2)

where M(G*) is the family of SCMs compatible with the causal graph G*. In the literature on domain generalization, selection diagrams {G^{i,*} : i = 1, ..., k} are mostly implicit, and it is common to define predictors agnostic of assumptions on the underlying causal structure of the target domain, instead exploiting invariances with respect to source domains; see, e.g., the proposals of Arjovsky et al. (2019); Peters et al. (2016); Lu et al. (2021); Rojas-Carulla et al. (2018); Magliacane et al. (2018). This section studies the generalization guarantees of a common class of invariant predictors in the language of selection diagrams.

Definition 3 (Invariant predictor). Given selection diagrams {G^{i,j} : i, j = 1, ..., k}, an invariant predictor is given by E_P[Y | z] where (Y ⊥⊥ S | Z)_{G^{i,j}} for i, j = 1, ..., k and the expectation is taken with respect to any P among the source domain distributions. Invariant predictors define stable conditional expectations, i.e., E_{P_1}[Y | z] = ... = E_{P_k}[Y | z]. We use the notion of domain-independent Markov blankets to define optimal invariant predictors.

Definition 4 (Domain-independent Markov blankets). Given a set of selection diagrams {G^{i,j} : i, j = 1, ..., k}, the set of domain-independent Markov blankets for Y ∈ V is given by the sets Z ⊂ V such that (1) (Y ⊥⊥ S | Z)_{G^{i,j}} for i, j = 1, ..., k, and (2) (W ⊥̸⊥ Y | Z∖{W})_{G^{i,j}} for i, j = 1, ..., k and all W ∈ Z.
Domain-independent Markov blankets are designed to be minimal, in the sense that no proper subset of them satisfies conditions (1) and (2), and informative for predicting Y, while defining stable conditional distributions across source domains. In general, such a set is not guaranteed to exist. For example, in Fig. 1b there is no set (and, by implication, no invariant predictor) that separates Y from all selection nodes, i.e., condition (1) in Def. 4 is violated for any subset of V. Moreover, contrary to the conventional Markov blanket (Pearl & Paz, 1985), it is not guaranteed to be unique. For example, in Fig. 2a both {X_1, X_2, X_5} and {X_1, X_3, X_4} are domain-independent Markov blankets. Which one is most informative for predicting Y is undecidable from the graph structure alone, i.e., it depends on the exact functional associations between variables.

Proposition 1 (Optimal invariant predictor). Given selection diagrams {G^{i,j} : i, j = 1, ..., k}, the optimal invariant predictor is defined as the minimizer of E_{P_i}[(Y - f(Z))^2] across all i = 1, ..., k, and belongs to the set of invariant predictors for which Z is a domain-independent Markov blanket for Y ∈ V.

Invariant predictors may be desirable due to their stability. However, the extent to which predictors will generalize outside of the source domains depends on the structure of M(G*) and, in particular, on its differences in structure with respect to the source domains. In general, structural invariances across source domains need not hold outside of the source domains. For example, given two source domains π_1, π_2 described by G^{1,2} in Fig. 2b, E_{P_1}[Y | x_1, x_2] = E_{P_2}[Y | x_1, x_2] is the invariant predictor, which may not be optimal in a target domain π* if the same invariance does not hold there. For example, for the selection diagrams G^{*,1} and G^{*,2} given in Fig. 2c and Fig. 2d, E_{P_1}[Y | x_1, x_2] ≠ E_{P*}[Y | x_1, x_2].
In fact, the generalization error of the optimal invariant predictor, here denoted E_{P_1}[Y | z], can be written as

max_{M ∈ M(G*)} E_{P^M}[(Y - E_{P_1}[Y | Z])^2] = max_{M ∈ M(G*)} ( E_{P^M}[(Y - E_{P^M}[Y | X])^2] + E_{P^M}[(E_{P^M}[Y | X] - E_{P_1}[Y | Z])^2] ),

where the first term on the RHS is the expected conditional variance, which is in general irreducible, and the second term on the RHS quantifies the difference between the invariant predictor and the optimal prediction rule. This second term may be arbitrarily large for a general class of SCMs M(G*) with arbitrary differences with respect to the source domains. As a consequence, the optimality of invariant predictors as solutions to Eq. (2) is limited, in general, to specific scenarios.

Proposition 2 (Generalization guarantees of optimal invariant predictors). Given a set of selection diagrams {G^{i,j} : i, j = 1, ..., k}, let ∆ = ∪_{i,j} ∆_{i,j} be the set of variables in V whose causal mechanisms differ between any two source domains, and let S = {S_V : V ∈ ∆}. Consider the robust optimization problem in Eq. (2). The optimal invariant predictor is a solution if the selection nodes in all selection diagrams {G^{i,*} : i = 1, ..., k} are given by S with edges {S_V → V}_{S_V ∈ S}.

In words, an optimal invariant predictor has the lowest generalization error in the sense of Eq. (2) only in the space of target SCMs M(G*) with the same structural invariances observed across the source domains. Otherwise, better predictors are in general achievable. This observation includes predictors using causal parents as a conditioning set (often understood as desirable for domain generalization) which, similarly, define robust predictors for a target domain if invariance in the association between causal parents and outcomes is assumed. For example, in Fig. 1, E_{P_a}[Y | pa_Y] ≠ E_{P_b}[Y | pa_Y], E_{P_a}[Y | pa_Y] ≠ E_{P*}[Y | pa_Y], and E_{P_b}[Y | pa_Y] ≠ E_{P*}[Y | pa_Y], and thus predictors based on causal parents may not be robust or optimal in general.
In particular, a prediction function of the form E_{P_1}[Y | pa_Y] is a solution to the robust optimization problem in Eq. (2) if and only if it is the optimal invariant predictor and {G^{i,*} : i = 1, ..., k} is defined as in Prop. 2. Moreover, independently of whether solutions to a worst-case optimization problem can be found, they say nothing about the range of values that optimal prediction functions E_{P*}[Y | x] may take under other distributions P* away from the worst case. In the following section, we attempt to define predictors, and ranges of predictors, with guarantees for arbitrary sets M(G*).
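The stability property of Def. 3 can be checked empirically. In the simulation below, two source domains differ only in the mechanism of X_1 (an S_{X_1} discrepancy) while the mechanism of Y is shared, so the conditional expectation of Y given (X_1, X_2) is invariant across them; all names and numbers are illustrative, not taken from the paper's examples.

```python
import random

random.seed(1)

def sample(domain):
    # two source domains that differ only in the mechanism of X1
    x2 = int(random.random() < 0.5)
    p_x1 = 0.2 if domain == 1 else 0.8          # discrepancy in f_{X1} / P(u_{X1})
    x1 = int(random.random() < p_x1)
    # f_Y is invariant: E[Y | x1, x2] = 0.1 + 0.4*x1 + 0.3*x2 in both domains
    y = int(random.random() < 0.1 + 0.4 * x1 + 0.3 * x2)
    return x1, x2, y

def cond_mean_y(samples, x1, x2):
    sel = [y for (a, b, y) in samples if a == x1 and b == x2]
    return sum(sel) / len(sel)

d1 = [sample(1) for _ in range(200_000)]
d2 = [sample(2) for _ in range(200_000)]
# E_{P1}[Y | x1, x2] and E_{P2}[Y | x1, x2] agree (up to sampling noise)
# even though the marginal of X1 differs sharply across the two domains.
```

This illustrates the premise, not the caveat: if the target domain also changes f_Y (as in Fig. 2c-2d), the same conditional expectation would no longer transport.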

3. PARTIAL TRANSPORTABILITY OF STATISTICAL RELATIONS

The uncertainty and inherent under-identifiability of solutions to domain generalization problems motivate us to define the task of partial transportability, which extends the literature on domain generalization by considering bounds on the value of arbitrary queries E_{P*}[Y | x] in arbitrary target domains π* defined by a set of selection diagrams {G^{i,*} : i = 1, ..., k}.

Task (Partial Transportability). Derive a tight bound [l, u] over a query of the form E_{P*}[Y | x] with knowledge of selection diagrams {G^{*,i} : i = 1, ..., k}, a corresponding collection of data distributions {P_i(v) : i = 1, ..., k}, and a set of intervals {I_j : V_j ∈ ∪_i ∆_{*,i}} that define potential constraints on probabilities in the target domain.

Algorithmically, this may be written as a solution to the following optimization problem,

min / max_{M ∈ M(G*)} E_{P^M}[Y | x], such that ∀V ∉ ∆_{*,i} : f_V = f^i_V, P*(u_V) = P_i(u_V), and ∀V ∈ ∪_i ∆_{*,i} : P*(v | pa_V) ∈ I_V.

In words, the task is to evaluate the minimum and maximum values over all possible SCMs M compatible with {G^{*,i} : i = 1, ..., k}, which define the structurally invariant mechanisms in the system, and with (potentially uninformative, i.e., I_V = [0, 1]) assumptions about target-specific probabilities. For example, given the causal description of the protocols presented in the introductory medical example and Fig. 1, the question might be: how can these various datasets be combined to predict an individual's risk of developing neurodegenerative diseases in π*?
The optimal prediction function is given by E_{P*}[Y | w, x_1, x_2] (under mean squared error loss), which may be written as,

E_{P*}[Y | w, x_1, x_2] = Σ_{y∈Ω_Y} y P*(y, w, x_1, x_2) / Σ_{y∈Ω_Y} P*(y, w, x_1, x_2),

where P*(y, w, x_1, x_2) is equal to,

∫_{Ω_U} 1{f_Y(w, x_2, u_wy) = y} · 1{f_W(x_1, u_wy) = w} · 1{f_{X_1,X_2}(u_{x_1x_2}) = (x_1, x_2)} dP*(u), (6)

where the first factor matches the RCT π_b, the second is specific to π*, and the third matches hospital π_a. This is a mixture of terms for which data from source domains can be leveraged, for example f_Y(w, x_2, u_wy) = f^a_Y(w, x_2, u_wy) (superscripts denote domains), but it also involves unobserved confounders u_wy and u_{x_1x_2} that cannot be marginalized out, and terms that are specific to π*. In addition, although P*(w | x_1) is known to differ in our target medical study, we may have some domain knowledge that constrains it, e.g., P*(w | x_1) ∈ [0.2, 0.7] =: I_w (if left undetermined, I_w := [0, 1]), which can be used to further inform a target query. The following proposition shows that the solution of the partial transportability task defines an interval that contains the invariant predictor and, by definition, also the optimal "worst-case" predictor across M(G*).

Proposition 3. For a given set of selection diagrams, let [l(x), u(x)] denote the solution of the partial transportability task for the query E_{P^M}[Y | x], M ∈ M(G*), and let E_{P_1}[Y | z], Z ⊆ X, be the invariant predictor. Then, E_{P_1}[Y | z] ∈ [l(x), u(x)]. Moreover, by definition, E_{P^M}[Y | x] ∈ [l(x), u(x)] for a particular "worst-case" member M ∈ M(G*).

In general, there is no reason to believe that the invariant predictor has any special performance guarantee among other solutions in [l(x), u(x)]. For example, the worst-case loss in Eq. (3) is not, in general, smallest when E_{P^M}[Y | x] ≠ E_{P_1}[Y | z].
An alternative is to exploit the solutions to the partial transportability task and define the median of [l(x), u(x)] as a general predictor for domain generalization problems.

Proposition 4. For a given set of selection diagrams and data, let [l(x), u(x)] denote the solution of the partial transportability task for the query E_{P^M}[Y | x], M ∈ M(G*). Then,

max_{M ∈ M(G*)} E_{P^M}[(Y - med_{M ∈ M(G*)} E_{P^M}[Y | X])^2] ≤ max_{M ∈ M(G*)} ( E_{P^M}[(Y - E_{P^M}[Y | X])^2] + (1/4) E_{P^M}[(u(X) - l(X))^2] ).

Under the condition that the irreducible error E_{P^M}[(Y - E_{P^M}[Y | X])^2] is constant across M ∈ M(G*), med_{M ∈ M(G*)} E_{P^M}[Y | X] provably solves the robust optimization problem in Eq. (2).

This proposition says that the error of the median is, at most, off from that of the optimal predictor by "half the range of possible values of E_{P^M}[Y | x] compatible with the data and assumptions", and that this error is optimal in the worst case (under some assumptions on how the expected conditional variance is allowed to vary). This result is important because it applies to any set of target causal graphs, source domains, and selection diagrams. Note, however, that this does not mean that the median is always superior to the optimal invariant predictor: in selected settings where the expected conditional variance changes across domains, the optimal invariant predictor may still be a better worst-case solution.
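As a minimal illustration of the task's interval constraints (a toy model of our own, not the paper's algorithm), consider a two-variable chain X → Y where the selection diagram places a selection node only on X: P*(y | x) then transports from the source domain, while P*(X = 1) is only known to lie in an assumed interval I_X. Sweeping the free target parameter bounds the query E_{P*}[Y]; all numbers are illustrative.

```python
# Transported factor: P1(Y = 1 | x), assumed estimated from source data
p1_y_given_x = {0: 0.1, 1: 0.6}
# Interval constraint on the non-transportable target factor P*(X = 1)
I_X = (0.2, 0.7)

def query(px1):
    # E_{P*}[Y] = sum_x P*(x) * P1(Y = 1 | x)
    return (1 - px1) * p1_y_given_x[0] + px1 * p1_y_given_x[1]

# sweep the free parameter over its interval to obtain [l, u]
grid = [I_X[0] + i * (I_X[1] - I_X[0]) / 1000 for i in range(1001)]
vals = [query(p) for p in grid]
lower, upper = min(vals), max(vals)
# here lower = query(0.2) = 0.20 and upper = query(0.7) = 0.45
```

With I_X = [0, 1] the bounds widen to [0.1, 0.6], i.e., the uninformative case; the median of the interval is the predictor suggested by Prop. 4.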

4. ALGORITHMS FOR PARTIAL TRANSPORTABILITY

This section presents algorithms to solve the partial transportability task for SCMs with discrete observables, that is, with each V ∈ V taking values in a finite space of outcomes, while each U ∈ U is associated with an arbitrary probability density function P(u). A first step in our argument will be to decompose a chosen query into smaller factors, so as to infer which factors can be matched across domains and point-identified from data, and subsequently to re-parameterize the unmatched factors by a special family of SCMs that makes the bounding problem tractable. We use the concept of c-components and C-factors developed by Tian & Pearl (2002). The set V can be partitioned into c-components such that two variables are assigned to the same set C ⊂ V if and only if they are connected by a bidirected path in G. In addition, let U_C = ∪_{V_i∈C} U_i denote the set of exogenous variables that are parents of any V ∈ C. For example, the graph in Fig. 1a induces the c-components {X_1, X_2} and {W, Y}. For any set C ⊆ V, let Q_i[C](pa(C)) denote the C-factor of C in domain π_i, which is defined by,

Q_i[C](pa_C) = ∫_{Ω_{U_C}} ∏_{V∈C} 1{f^i_V(pa_V, u_V) = v} dP_i(u_C). (7)

Moreover, let C denote the collection of c-components; then P(v) = ∏_{C∈C} Q[C] and Q[C] = P(c | pa_C) (we omit the dependence of each C-factor on pa(C) for readability). This construction is useful because the joint distribution may be factorized according to the c-components of G and its factors matched across domains (Tian & Pearl, 2002; Correa & Bareinboim, 2019).

Lemma 1. Let G^{a,b} be a selection diagram for the SCMs M_a and M_b. Then Q_a[C] = Q_b[C] if G^{a,b} does not contain a selection node S_V pointing to any variable V ∈ C.

For example, for the selection diagram in Fig. 1b, P*(v) = Q*[X_1, X_2] Q*[W, Y], where by Lem. 1, Q*[X_1, X_2] = Q_a[X_1, X_2] = P_a(x_1, x_2), since there is no S-node pointing to X_1 or X_2. In turn, Q*[W, Y] ≠ Q_a[W, Y] because of the selection node pointing to W.
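The c-component partition is simply the set of connected components of the bidirected part of G, and the condition in Lemma 1 reduces to checking whether any selection node points into a component. A short sketch over the graph of Fig. 1a/1b (variable names as in the figure):

```python
def c_components(nodes, bidirected):
    """Connected components of the bidirected part of the graph."""
    adj = {v: set() for v in nodes}
    for a, b in bidirected:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for v in nodes:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            w = stack.pop()
            if w not in comp:
                comp.add(w)
                stack.extend(adj[w] - comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps

def transports(comp, s_targets):
    """Lemma 1: Q[C] matches across domains iff no S-node points into C.
    s_targets is the set of variables some selection node points to."""
    return not (comp & s_targets)

# Fig. 1a: bidirected edges X1 <-> X2 and W <-> Y; Fig. 1b adds S_W.
comps = c_components(["X1", "X2", "W", "Y"], [("X1", "X2"), ("W", "Y")])
# Q*[X1, X2] transports from pi_a; Q*[W, Y] does not (S_W points into it).
```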
Note, however, that Q*[W, Y] as defined in Eq. (7) involves terms, e.g., 1{f_Y(w, x_2, u_wy) = y}, that are invariant across domains, since the absence of an S-node into Y denotes invariance of the causal mechanism, and for which P_a(v) may be used for estimation. We discuss next a re-parameterization of the C-factors Q*[C] that cannot be matched across domains, with the goal of defining a tractable constrained optimization problem to bound Q*[C].

Proposition 5. Let M be an arbitrary SCM with graph G and let C be any c-component. Then, there exists a corresponding SCM N with finite exogenous domains, compatible with G, such that Q_M[C] = Q_N[C], where every exogenous variable U ∈ U_C has cardinality d_U = |Ω_{Pa(C)}|.

This proposition shows that SCMs with discretely-valued exogenous variables are expressive enough to represent C-factors Q[C] irrespective of the true underlying data-generating mechanism. From an optimization perspective, this is useful because it allows us to consistently parameterize C-factors and make inference on their distribution in a well-defined latent variable model (Rosset et al., 2017; Zhang et al., 2021). As an example, consider the introductory example with {X_1, X_2, Y, W} binary and the causal graphs in Fig. 1. Q*[W, Y], defined using Eq. (7), can also be written as:

Σ_{u_wy, u_y, u_w} 1{f^a_Y(w, x_2, u_wy, u_y) = y} 1{f_W(x_1, u_wy, u_w) = w} P_a(u_wy, u_y) P*(u_w), (8)

where |Ω_{U_wy}| = |Ω_{U_w}| = |Ω_{U_y}| = |Ω_{X_1}| · |Ω_{X_2}| · |Ω_W| · |Ω_Y| = 16, and each function f_V is a mapping between finite domains Ω_{Pa_V} × Ω_{U_V} ↦ Ω_V for V ∈ {W, Y}. Moreover, we have used the structural invariances encoded by the selection diagrams in Fig. 1 to match causal mechanisms and exogenous probabilities between domains. In particular, P_a(u_y) = P_b(u_y) = P*(u_y) by definition of the selection diagrams G^{a,*} and G^{b,*}.
Although discretely-valued causal mechanisms and exogenous probabilities imply well-defined parameters to optimize over, the partial transportability task remains a difficult constrained optimization problem.
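To make the re-parameterization of Eq. (8) concrete, the sketch below evaluates a C-factor by explicit summation over a finite exogenous domain. The mechanisms ξ and probabilities θ are random placeholders rather than fitted quantities, and, for brevity, a single shared exogenous variable of small cardinality stands in for (u_wy, u_y, u_w).

```python
import random

random.seed(0)
D_U = 4  # cardinality of the (single, for brevity) exogenous variable U

# placeholder causal mechanisms xi_V : (pa_V, u) -> v
xi_W = {(x1, u): random.randrange(2) for x1 in (0, 1) for u in range(D_U)}
xi_Y = {(w, x2, u): random.randrange(2)
        for w in (0, 1) for x2 in (0, 1) for u in range(D_U)}

# placeholder exogenous probabilities theta (a point on the simplex)
theta = [random.random() for _ in range(D_U)]
s = sum(theta)
theta = [t / s for t in theta]

def Q_WY(w, y, x1, x2):
    # Q[{W, Y}](w, y | x1, x2) = sum_u 1{f_W(x1, u) = w} 1{f_Y(w, x2, u) = y} theta_u
    return sum(theta[u] for u in range(D_U)
               if xi_W[(x1, u)] == w and xi_Y[(w, x2, u)] == y)
```

For any fixed (x_1, x_2), the factor sums to one over (w, y), as a C-factor must; the optimization of Section 4 searches over such (ξ, θ) subject to the data-matching constraints.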

4.1. APPROXIMATIONS VIA GIBBS SAMPLING

We follow Chickering & Pearl (1996); Zhang et al. (2021); Bellot et al. (2022) and take a Bayesian perspective to approximating the bounds [l(x), u(x)]. We evaluate credible intervals for the query given finite samples v := (v_{π_1}, ..., v_{π_k}), where v_{π_i} = {v^{(j)}_{π_i} : j = 1, ..., n_i} are n_i independent samples collected in domain π_i, and a set of selection diagrams {G^{*,i} : i = 1, 2, ..., k}, using Gibbs sampling. Following the arguments in the previous section, the query may be reduced to bounding a C-factor of the form,

ω_obj := Q*[C] = Σ_{U∈U_C} Σ_{u=1,...,d_U} ∏_{V∈C} 1{ξ^{(pa_V, u_V)}_V = v} ∏_{U∈U_C} θ_u, (10)

parameterized by ξ = {ξ^{(pa_V, u_V)}_V : V ∈ C, Pa_V ⊂ V, U_V ⊂ U_C} and θ = {θ_u : U ∈ U_C}, which represent the causal functional assignments and the exogenous probabilities, respectively. We have dropped the domain indicator "*" from the parameters for readability. For every V ∈ V and every (pa_V, u_V), the functional assignment parameter ξ^{(pa_V, u_V)}_V is drawn uniformly in the discrete domain Ω_V. For every U ∈ U, the exogenous probability vector θ_U of dimension d_U = |Ω_{Pa(C)}| is drawn from a prior Dirichlet distribution, θ_U = (θ_1, ..., θ_{d_U}) ~ Dirichlet(α_1, ..., α_{d_U}), with hyperparameters α_1, ..., α_{d_U}. The Gibbs sampler starts with some initial value for all latent quantities (u, ξ, θ) in the expression of ω_obj, and iterates over the following sampling steps, each parameter conditioned on the current values of the remaining terms in the parameter vector.

1. Sample u. Let u ∈ Ω_U, U ∈ U_C. For each observed data example v^{(n)} ∈ v across all domains, n = 1, ..., Σ_i n_i, we sample the corresponding exogenous variables U ∈ U_C from the conditional distribution,

P(u^{(n)} | v^{(n)}, ξ, θ) ∝ P(u^{(n)}, v^{(n)} | ξ, θ) = ∏_{V∈C} 1{ξ^{(pa^{(n)}_V, u^{(n)}_V)}_V = v^{(n)}} ∏_{U∈U_C} θ_u. (12)

2. Sample ξ. The parameters ξ define deterministic causal mechanisms.
For a given parameter ξ^{(pa_V, u_V)}_V ∈ ξ, its conditional distribution is given by P(ξ^{(pa_V, u_V)}_V = v | v, ū) = 1 if there exists a sample (v^{(n)}, pa^{(n)}_V, u^{(n)}) for some n, where n iterates over the samples of u from step 1 and v is associated with the subset of domains in which the exogenous probabilities match the target domain, such that ξ^{(pa^{(n)}_V, u^{(n)}_V)}_V = v^{(n)}. Otherwise, P(ξ^{(pa_V, u_V)}_V = v | v, ū) is given by a uniform discrete distribution over its domain Ω_V.

3. Sample θ. Let θ_U = (θ_1, ..., θ_{d_U}) ∈ θ be the parameters that define the probability vector of possible values of the variable U ∈ U_C. Its conditional distribution is given by,

θ_1, ..., θ_{d_U} | v, ū ~ Dirichlet( α_1 + Σ_n 1{u^{(n)} = u_1}, ..., α_{d_U} + Σ_n 1{u^{(n)} = u_{d_U}} ),

where, similarly, n iterates over the samples of u from step 1 associated with the subset of domains in which the exogenous probabilities match the target domain.

In the above, we have described the conditional distributions of parameters that can be matched across domains, and that can therefore be estimated from the subset of relevant available data. By the definition of the partial transportability task, parameters that are specific to the target domain π* are constrained to lie in an assumed interval, e.g., P*(v | pa_V) = Σ_{u_V} 1{ξ^{(pa_V, u_V)}_V = v} ∏_{U∈U_V} θ_u ∈ I_V ⊆ [0, 1], or else left unspecified. In the first case, such parameters are sampled independently and uniformly in the space defined by the constraints; in the second case, they are sampled independently and uniformly in their domain of definition, i.e., ξ^{(pa_V, u_V)}_V ∈ Ω_V, θ_u ∈ Ω_U, in every step of the sampler. Iterating this procedure forms a Markov chain whose invariant distribution is the target posterior distribution P(u, ξ, θ | v). P(ω_obj | v) is then approximated by plugging the MCMC samples into Eq. (10).
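The three steps can be exercised end-to-end on the smallest possible latent-variable model: a single exogenous U ∈ {0, ..., d_U - 1} with probabilities θ and one observed binary V with v = ξ^{(u)}. This toy sketch is our own drastic simplification of the sampler (one variable, no domains or C-factors), but it mirrors steps 1-3 exactly:

```python
import random

random.seed(0)
data = [1, 0, 1, 1, 0, 1, 1, 1]                 # observed samples of V
D_U, alpha = 4, 1.0

def dirichlet(alphas):
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

xi = [random.randrange(2) for _ in range(D_U)]  # init mechanism u -> v
theta = dirichlet([alpha] * D_U)

for _ in range(500):
    # 1. sample u^(n) | v^(n), xi, theta: indicator-weighted categorical draw
    u = []
    for v in data:
        ws = [theta[k] if xi[k] == v else 0.0 for k in range(D_U)]
        s = sum(ws)
        if s == 0.0:                             # no u consistent with v under xi
            u.append(random.randrange(D_U))
            continue
        r, acc = random.random() * s, 0.0
        for k in range(D_U):
            acc += ws[k]
            if r <= acc:
                u.append(k)
                break
    # 2. sample xi | v, u: pinned where a (v, u) pair was observed, else uniform
    for k in range(D_U):
        seen = {v for v, uk in zip(data, u) if uk == k}
        xi[k] = seen.pop() if len(seen) == 1 else random.randrange(2)
    # 3. sample theta | u from its Dirichlet posterior
    counts = [sum(1 for uk in u if uk == k) for k in range(D_U)]
    theta = dirichlet([alpha + c for c in counts])

# one posterior draw of the query P(V = 1) = sum_u 1{xi[u] = 1} theta_u
q = sum(theta[k] for k in range(D_U) if xi[k] == 1)
```

Repeating the loop and recording q at each iteration yields draws from the posterior of the query, from which the credible intervals below are computed.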
The upper and lower α quantiles among T samples of P(ω_obj | v), combined with the identified C-factors that form E_{P^*}[Y | x], give a (1 − α) credible interval l̂_α(x) < E_{P^*}[Y | x] < û_α(x), with upper end

    û_α(x) = inf{ x̄ : (1/T) Σ_t 1{E_{P^*}[Y | x]^(t) ≤ x̄} ≥ 1 − α/2 },

and l̂_α(x) defined analogously. The following theorem shows that the credible intervals [l̂_0(x), û_0(x)] converge to the true bounds [l(x), u(x)] for the unknown query E_{P^*}[Y | x] and are, moreover, maximally informative, in the sense that we can always construct two data-generating mechanisms M_1, M_2 for the target domain that are compatible with our current knowledge of the world such that E_{P^{M_1}}[Y | x] = l(x) and E_{P^{M_2}}[Y | x] = u(x).

Theorem 1. The solution [l(x), u(x)] to the partial transportability task defined over discrete SCMs is a tight bound on the target query E_{P^*}[Y | x]. The credible interval [l̂_0(x), û_0(x)] coincides with [l(x), u(x)] as n_i → ∞ in all observable domains π_i, i = 1, ..., k.
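A minimal sketch of turning T posterior draws into the quantile-based interval (function and variable names are illustrative, and the uniform draws stand in for actual posterior samples):

```python
import numpy as np

def credible_interval(draws, alpha):
    """(1 - alpha) credible interval for E_{P*}[Y | x] from T posterior draws:
    the empirical alpha/2 and 1 - alpha/2 quantiles."""
    draws = np.asarray(draws)
    return np.quantile(draws, alpha / 2), np.quantile(draws, 1 - alpha / 2)

draws = np.random.default_rng(1).uniform(0.2, 0.7, size=5000)  # stand-in posterior draws
l0, u0 = credible_interval(draws, alpha=0.0)   # alpha = 0 gives [min, max], i.e. [l^_0, u^_0]
l5, u5 = credible_interval(draws, alpha=0.05)  # 95% credible interval
```

With α = 0 the interval is the full range of the posterior draws, matching the [l̂_0, û_0] intervals reported in the experiments.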

5.1. SMOKING AND LUNG CANCER

Our first experiment is inspired by the debate around the relationship between smoking and lung cancer in the 1950s (US Department of Health and Human Services, 2014). We use a scientifically-grounded variation of the front-door graph that includes an individual's smoking status S, presence of tar in the lungs T, wealth W, and lung cancer status C, using the fact that smoking and lung cancer may be confounded by an individual's unobserved genetic profile. In this example, the objective is to make inference on cancer probability distributions in the French population π_FR from corresponding data in π_UK, where the prevalence of smoking is known to be lower. The selection diagram is given in Fig. 3a and details on the SCMs used to generate data are given in Appendix B.

Probability of cancer among smokers P_FR(C = 1 | S = 1). The C-factor decomposition and parameterization are given by the following derivation,

    P_FR(c | s) = P_FR(c, s) / Σ_c P_FR(c, s) = Σ_{t,w} P_FR(c, s, t, w) / Σ_{c,t,w} P_FR(c, s, t, w) = Σ_{t,w} Q_FR[s, c] Q_FR[w] Q_FR[t] / Σ_{c,t,w} Q_FR[s, c] Q_FR[w] Q_FR[t],

where Q_FR[t] = Q_UK[t] = P_UK(t | s, w) and Q_FR[w] = Q_UK[w] = P_UK(w), and

    Q_FR[s, c] = Σ_{u_sc, u_s} 1{ξ_C^{(w,t,u_sc),UK} = c} 1{ξ_S^{(w,u_sc,u_s),FR} = s} θ^UK_{u_sc} θ^FR_{u_s}.

Figure 4: Bounding the probability of cancer.

In Fig. 4, we report estimated 100% credible intervals l̂_0 < P_FR(C = 1 | S = 1) < û_0 as a function of the number of samples, without prior information (purple) and with the prior information that P_FR(s | w) lies in an interval of width 0.1 around its true value (pink). The black and gray dotted lines are the actual values P_FR(C = 1 | S = 1) and P_UK(C = 1 | S = 1), respectively. Notice that a relatively small number of samples is required to converge to stable bounds, and that the prior information narrows the credible interval, reflecting this additional constraint.
We also show, for illustration, that our Gibbs sampler recovers the true values P_FR(C = 1 | S = 1) (pink) and P_FR(C = 1 | S = 0) (purple) when trained on data from π_FR, i.e. when the probabilities are identified.

Prediction performance across domains. Consider the task of designing cancer prediction rules for optimal performance in the French population π_FR. We introduce an additional training domain to be able to define invariant predictors: data from the Swedish population π_SW, whose structural differences with π_UK and with π_FR are given in Fig. 3. Across π_UK and π_SW, the optimal invariant predictor (Def. 3) is given by E_{P_UK}[C | t, w, s] = E_{P_SW}[C | t, w, s], which, however, is not equal to E_{P_FR}[C | t, w, s], as no set blocks the open path between the selection node S_S and the cancer variable C in G^{FR,UK}. We also consider the common strategy of using causal parents for prediction, i.e. the prediction rule E_{P_UK}[C | t, w] (which, similarly, is not equal to E_{P_FR}[C | t, w]). For comparison, we consider the median value med(l̂_0, û_0) for the optimal prediction rule E_{P_FR}[C | t, w, s] computed using data from π_UK and π_SW. We observe in Fig. 3d that the prediction rule E_{P_UK}[C | t, w, s] indeed underperforms in π_FR (for reference, E_{P_FR}[C | t, w, s] has a mean error of 0.1220), cautioning against naively transporting invariant prediction rules across domains. Similarly, using causal parents for prediction underperforms. In contrast, the median of the derived bound proves to be a slightly better predictor in this case and has a guarantee of optimal performance in the "worst-case" domain compatible with the selection diagrams (Prop. 4).

5.2. PREDICTION OF NEURODEGENERATIVE DISEASES ACROSS HOSPITALS

Our second experiment reconsiders the introductory example describing the design of prediction rules for the development of neurodegenerative diseases in a target hospital π^* in which no data has been recorded. Instead, we have access to data from two related studies conducted in hospitals π^a and π^b which, however, are known to differ from the target domain, notably in the distribution of blood pressure W, a known cause of neurodegenerative diseases. The causal protocol is given in Fig. 1 and is described in more depth in Sec. 2. Details on the SCMs used to generate data are given in Appendix B. Given this information, we consider the task of designing a prediction rule for optimal mean squared error in the target hospital π^*. Here, invariant predictors are well defined and given by the function f(w, x_1, x_2) = E_{P^a}[Y | w, x_1, x_2] = E_{P^b}[Y | w, x_1, x_2], although note that, in this example, this conditional expectation is not invariant in the target domain due to the difference in the causal mechanisms associated with blood pressure W, see Fig. 1d. Similarly, we can define causal predictors E_{P^a}[Y | w, x_2] and E_{P^b}[Y | w, x_2], which in this case are not equal across hospitals π^a and π^b due to the open path between S_{X_1} and Y once we condition on W. The partial transportability task instead argues for approximating E_{P^*}[Y | w, x_1, x_2], which, using the C-factor decomposition, is parameterized by P^*(y, w, x_1, x_2) = Q^*[X_1, X_2] Q^*[W, Y], where Q^*[X_1, X_2] = P^a(x_1, x_2) by Lem. 1 and

    Q^*[W, Y] = Σ_{u_wy, u_w} 1{ξ_Y^{(w,x_2,u_wy),a} = y} 1{ξ_W^{(x_1,u_wy,u_w),*} = w} θ^a_{u_wy} θ^*_{u_w}.

    Predictor                     Mean squared error
    E_{P^a}[Y | w, x_1, x_2]      .3640 (.003)
    E_{P^a}[Y | w, x_2]           .4244 (.002)
    E_{P^b}[Y | w, x_2]           .4013 (.002)
    med(l̂_0, û_0)                 .2961 (.008)
    E_{P^*}[Y | w, x_1, x_2]      .2434 (.002)

Figure 5: Performance comparisons.
The median value of the resulting interval, which encodes the uncertainty in the computation of E_{P^*}[Y | w, x_1, x_2], as well as all baseline predictors, are given in Fig. 5. We add the actual optimal (not computable) prediction rule E_{P^*}[Y | w, x_1, x_2] for reference. Fig. 5 shows that the median outperforms the baselines, which, although common strategies for prediction, can result in significantly worse out-of-distribution performance in examples where unobserved confounding as well as structural differences between domains play a role.

6. CONCLUSIONS

This paper investigated the problem of domain generalization from the perspective of transportability theory. We introduced the task of partial transportability, which seeks to bound the value of an arbitrary conditional expectation E_{P^*}[Y | x] in an unseen domain π^* using selection diagrams and data from source domains. Using this formalism, we showed that invariant predictors and more general solutions to robust optimization problems derived in the literature are special cases of solutions to this task. Moreover, for systems of discrete observables, we showed that we can design a provably consistent algorithm for inferring bounds that are sound and tight, and illustrated its performance on synthetic data.

A PROOFS

Proposition 6 (Prop. 1 restated). Given selection diagrams {G^{i,j} : i, j = 1, ..., k}, the optimal invariant predictor is defined as the minimizer of E_{P^i}[(Y − f(Z))^2] across all i = 1, ..., k, and belongs to the set of invariant predictors for which Z is a domain-independent Markov blanket for Y ∈ V.

Proof. Suppose, for contradiction, that there exists an optimal invariant predictor E_{P^i}[Y | Z], i = 1, ..., k, distinct from any invariant predictor defined conditional on a domain-independent Markov blanket, i.e. with Z not a domain-independent Markov blanket for Y ∈ V. Then, by the definition of a domain-independent Markov blanket, either Y is not d-separated from S given Z in some G^{i,j}, in which case E_{P^i}[Y | Z] is not invariant across source domains, or there exists a W ∈ Z such that (W ⊥⊥ Y | Z∖W)_{G^{i,j}} for all i, j = 1, ..., k, in which case E_{P^i}[Y | Z] = E_{P^i}[Y | Z∖W]. Now, if Z∖W is not a domain-independent Markov blanket, we can continue removing independent variables from Z∖W until we reach a domain-independent Markov blanket, concluding that E_{P^i}[Y | Z] is not distinct from an invariant predictor defined conditional on a domain-independent Markov blanket.

Proposition 7 (Prop. 2 restated). Given a set of selection diagrams {G^{i,j} : i, j = 1, ..., k}, let Δ = ∪_{i,j} Δ_{i,j} be the set of variables in V whose causal mechanisms differ between any two source domains, and let S = {S_V : V ∈ Δ}. The optimal invariant predictor solves the robust optimization problem in Eq. (2) if the selection nodes in all selection diagrams {G^{i,*} : i = 1, ..., k} are given by S, with edges {S_V → V}_{S_V ∈ S}.

Proof. Given a set of selection diagrams {G^{i,j} : i, j = 1, ..., k}, let Δ = ∪_{i,j} Δ_{i,j} be the set of variables in V whose causal mechanisms differ between any two source domains, and let S = {S_V : V ∈ Δ}. Assume that the selection nodes in all selection diagrams {G^{i,*} : i = 1, ..., k} are given by S (with edges {S_V → V}_{S_V ∈ S}).
In that case, the optimal invariant predictor may be written E_{P^1}[Y | Z] = E_{P^M}[Y | Z] for any M ∈ M(G). Any additional variable W in the conditioning set is either irrelevant for prediction, i.e. E_{P^M}[Y | Z] = E_{P^M}[Y | Z, W], or breaks the independence between Y and the selection nodes, which implies that E_{P^M}[Y | Z, W] varies as a function of M. Since the functional form of M (beyond the arguments of its functions) is not constrained by the selection diagrams, for any fixed prediction function E_{P^M}[Y | Z, W] we can always find a domain M' ∈ M(G) that makes the error E_{P^{M'}}[(Y − E_{P^M}[Y | Z, W])^2] arbitrarily large, and thus larger than E_{P^{M'}}[(Y − E_{P^M}[Y | Z])^2], which is fixed for any M'.

Proposition 8 (Prop. 3 restated). For a given set of selection diagrams, let [l(x), u(x)] denote the solution of the partial transportability task for the query E_{P^M}[Y | x], M ∈ M(G^*), and let E_{P^1}[Y | z], Z ⊆ X, be the invariant predictor. Then E_{P^1}[Y | z] ∈ [l(x), u(x)]. Moreover, by definition, E_{P^M}[Y | x] ∈ [l(x), u(x)] for a particular "worst-case" member M ∈ M(G^*).

Proof. The set M(G^*) represents all SCMs compatible with a target causal graph G^* that is constrained only by the selection diagrams {G^{i,*} : i = 1, ..., k}. A selection node indicates a potential change between two domains; therefore, in principle, all source-domain SCMs {M_i : i = 1, ..., k} are possible candidates for the target domain, and thus M_i ∈ M(G^*), i = 1, ..., k. Then, by the definition of the partial transportability task, E_{P^1}[Y | z] ∈ [l(x), u(x)], where z is the value of Z in X.

Proposition 9 (Prop. 4 restated). For a given set of selection diagrams and data, let [l(x), u(x)] denote the solution of the partial transportability task for the query

Theorem 2 (Prop. 5 restated). Let M be an arbitrary SCM with graph G and let C be any c-component.
Then there exists a corresponding SCM N with finite exogenous domains, compatible with G, such that Q^M[C] = Q^N[C], where every exogenous variable U ∈ U_C has cardinality d_U = |Ω_{Pa(C)}|.

Proof. The proof follows from Rosset et al. (2017) and Zhang et al. (2021). We include it below for completeness. We first introduce some necessary notation and concepts. The probability distribution of every exogenous variable U ∈ U is characterized by a probability space, frequently designated ⟨Ω_U, F_U, P_U⟩, where Ω_U is a sample space containing all possible outcomes; F_U is a σ-algebra containing subsets of Ω_U; and P_U is a probability measure on F_U, normalized such that P_U(Ω_U) = 1. Elements of F_U are called events, which are closed under set complement and countable unions. By means of P_U, a real number P_U(A) ∈ [0, 1] is assigned to every event A ∈ F_U; it is called the probability of event A. For an arbitrary set of exogenous variables U, a realization U = u is an element of the Cartesian product ×_{U ∈ U} Ω_U. We may be interested in inferring whether an event A_U occurs for every U ∈ U. Such an event is represented by a subset ×_{U ∈ U} A_U ⊆ ×_{U ∈ U} Ω_U, which in turn generates a product of σ-algebras ⊗_{U ∈ U} F_U. Define the product measure ⊗_{U ∈ U} P_U to satisfy the following mutual independence condition, given by the definition of the SCM,

    P(×_{U ∈ U} A_U) = Π_{U ∈ U} P_U(A_U).

Such a P is a probability measure. Moreover,

    ⟨ ×_{U ∈ U} Ω_U, ⊗_{U ∈ U} F_U, ⊗_{U ∈ U} P_U ⟩

defines a product of the probability spaces ⟨Ω_U, F_U, P_U⟩ that describes measurable events over all exogenous variables U, partitioned into c-components. Let C be the collection of all c-components in G. The c-components in C form a partition {∪_{V ∈ C} U_V | C ∈ C} over the exogenous variables U. Therefore, for every U ∈ U there must exist a unique c-component, denoted C_U, containing U.
For any c-component C ∈ C, let U_C = ∪_{V ∈ C} U_V be the set of exogenous variables affecting (at least one of) the endogenous variables in C. By the definition of c-components, the exogenous variables do not overlap between c-components, and it holds that

    P(∩_{U ∈ U} A_U) = Π_{C ∈ C(G)} P(∩_{U ∈ U_C} A_U).

For any SCM M compatible with the causal graph G, the joint distribution may be factorized into c-components,

    P(v) = Π_{C ∈ C} Q[C](c, pa_C),

where Q[C] is a C-factor and is a function of (c, pa_C). To parameterize this joint distribution it thus suffices to consider each C-factor separately. Let C be a generic c-component in G. Denote by m = |U_C| the number of exogenous variables related to C. For convenience, we consistently write ⟨Ω_i, F_i, P_i⟩ for the probability space of the i-th exogenous variable in C. The product of these probability spaces is thus written

    ⟨ ×_{i=1}^m Ω_i, ⊗_{i=1}^m F_i, ⊗_{i=1}^m P_i ⟩.

Each C-factor may thus be written

    Q[C] = ∫_{×_{i=1}^m Ω_i} Π_{V ∈ C} 1{f_V(pa_V, u_V) = v} d ⊗_{i=1}^m P_i.

Our goal is to show that all probabilities Q[C] induced by exogenous variables described by arbitrary probability spaces can be produced by a "simpler" generative process with discrete exogenous domains. Q[C] defines a mapping from the space of possible realizations of the variables Pa(C) to the [0, 1] interval. Since Pa(C) are discrete variables with finite domains, the cardinality of the class of probability assignments that must be defined is also finite: it is at most the number of possible combinations of realizations of Pa(C), i.e. Π_{V ∈ Pa(C)} |Ω_V|. Let P̄ be a vector representing the probabilities Q[C](c, pa_C). Counting all possible combinations of outcomes for all possible conditioning sets, P̄ is therefore a vector of size at most d = Π_{V ∈ Pa(C)} |Ω_V|. And since Q[C](c, pa_C) is a probability mass function, a vector of d − 1 dimensions suffices to uniquely determine it.
P̄ may thus be interpreted as a point in (d − 1)-dimensional real space; similarly, (P̄, 1) is a vector in d-dimensional space whose d-th element is equal to 1. Now consider sampling a value U_1 = u_1 from the underlying SCM, and let Q_{u_1} be the probability model with U_1 = u_1,

    Q_{u_1}[C](c | pa_C) = [ ∫_{×_{i=2}^m Ω_i} Π_{V ∈ C} 1{f_V(pa_V, u_V) = v} d ⊗_{i=2}^m P_i ]_{U_1 = u_1},

and let P̄_{u_1} be the (d − 1)-dimensional probability vector representing the probabilities of each combination of Pa(C) given that U_1 = u_1. We will show that P_1 may equally well be represented by a discrete distribution. For this, let U = {P̄_{u_1} : u_1 ∈ Ω_1} ⊂ R^d be the set of probability points that can be constructed as u_1 varies in Ω_1. The average ∫_{Ω_1} P̄_{u_1} dP_1 is a convex mixture of points in U by (Rubin & Wesler, 1958) that equals Q[C], since

    P̄ = ∫_{Ω_1} [ ∫_{×_{i=2}^m Ω_i} Π_{V ∈ C} 1{f_V(pa_V, u_V) = v} d ⊗_{i=2}^m P_i ]_{U_1 = u_1} dP_1.
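The discretization argument can be illustrated numerically: a mechanism driven by a continuous exogenous variable induces the same C-factor as a two-state discrete counterpart. The threshold model below is an assumed toy example, not taken from the paper:

```python
from math import erf, sqrt
import numpy as np

rng = np.random.default_rng(3)
c = 0.3                                   # arbitrary threshold

# Continuous-exogenous model: V = 1{U > c}, U ~ N(0, 1)
u = rng.normal(size=1_000_000)
p_continuous = (u > c).mean()             # Monte Carlo estimate of Q[C] = P(V = 1)

# Discrete counterpart: U' in {0, 1} with P(U' = 1) = P(U > c), and V = U'
theta = 0.5 * (1 - erf(c / sqrt(2)))      # P(U > c) in closed form
u_disc = rng.random(1_000_000) < theta
p_discrete = u_disc.mean()
```

Both generative processes induce the same distribution over the binary V, mirroring how the proof replaces each continuous P_i with a discrete mixture without changing Q[C].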

B.1 SCMS FOR THE SMOKING AND LUNG CANCER EXAMPLE

Using the functional dependencies specified by the selection diagrams in Fig. 3, we define the SCMs for domains π_UK, π_FR, and π_SW as follows.

For π_UK we generate samples of u_w, u_s, u_t, and u_sc from independent Gaussian distributions with mean 0 and variance 1. Each generated (u_w, u_s, u_t, u_sc) leads to a sample (w, s, t, c) as follows: w ← 1{u_w > 0}, s ← 1{w + u_sc + u_s − 2 > 0}, t ← 1{s − 0.5 u_t − 1 > 0}, c ← 1{t − 0.5 w + u_sc − 1 > 0}.

For π_FR the exogenous variables are generated in the same way, and each (u_w, u_s, u_t, u_sc) leads to a sample (w, s, t, c) as follows: w ← 1{u_w > 0}, s ← 1{w + u_sc + 1.5 u_s − 1 > 0}, t ← 1{s − 0.5 u_t − 1 > 0}, c ← 1{t − 0.5 w + u_sc − 1 > 0}. Notice that the causal mechanism for S has changed while everything else is unchanged.

For π_SW the exogenous variables are generated in the same way, and each (u_w, u_s, u_t, u_sc) leads to a sample (w, s, t, c) as follows: w ← 1{u_w > 0.5}, s ← 1{w + u_sc + u_s − 2 > 0}, t ← 1{s − 0.5 u_t − 1 > 0}, c ← 1{t − 0.5 w + u_sc − 1 > 0}.
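A sketch of this data-generating process (seed and sample size are arbitrary choices):

```python
import numpy as np

def sample_domain(domain, n=400_000, seed=5):
    # Structural equations for pi_UK, pi_FR, pi_SW as specified above
    rng = np.random.default_rng(seed)
    u_w, u_s, u_t, u_sc = rng.normal(size=(4, n))
    w = (u_w > (0.5 if domain == "SW" else 0.0)).astype(int)
    if domain == "FR":
        s = (w + u_sc + 1.5 * u_s - 1 > 0).astype(int)
    else:  # UK and SW share the mechanism for S
        s = (w + u_sc + u_s - 2 > 0).astype(int)
    t = (s - 0.5 * u_t - 1 > 0).astype(int)
    c = (t - 0.5 * w + u_sc - 1 > 0).astype(int)
    return w, s, t, c

_, s_uk, _, c_uk = sample_domain("UK")
_, s_fr, _, c_fr = sample_domain("FR")
p_uk = c_uk[s_uk == 1].mean()  # Monte Carlo P_UK(C = 1 | S = 1)
p_fr = c_fr[s_fr == 1].mean()  # Monte Carlo P_FR(C = 1 | S = 1)
```

Consistent with the setup of the experiment, smoking is less prevalent under π_UK than under π_FR in these equations.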

B.2 SCMS FOR THE NEURODEGENERATIVE DISEASE PREDICTION EXAMPLE

Using the functional dependencies specified by the selection diagrams in Fig. 1, we define the SCMs for domains π^*, π^a, and π^b as follows.

For the target domain π^* we generate samples of u_wy, u_x1,x2, u_x2, u_w, and u_x1 from independent Gaussian distributions with mean 0 and variance 1. Each generated draw leads to a sample (x_1, x_2, w, y) as follows: x_1 ← 1{u_x1 > 0}, x_2 ← 1{u_x1,x2 + u_x2 > 0}, w ← 1{x_1 + u_wy + 1.5 u_w − 1 > 0}, y ← 1{w − u_wy + 0.1 x_1 − 1 > 0}.

For source domain π^a, the distributions of the exogenous variables as well as the structural assignments agree with π^*, except for the assignment of W, which is given by w ← 1{x_1 + u_wy − u_w + 1 > 0}.

For source domain π^b, the distributions of the exogenous variables as well as the structural assignments agree with π^*, except for the assignments of W and X_1. The selection diagram specifies that the assignment of W agrees with π^a and is thus given by w ← 1{x_1 + u_wy − u_w + 1 > 0}, while the assignment of X_1 changes and is given by x_1 ← 1{u_x1 − 0.5 > 0}. All other components of the SCM are the same.
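A sketch of these equations, which also checks numerically that E[Y | w, x_1, x_2] coincides across π^a and π^b but differs in π^* (sample size, seed, and the particular conditioning values are arbitrary choices):

```python
import numpy as np

def simulate(domain, n=600_000, seed=4):
    # Structural equations for pi*, pi^a, pi^b as specified above
    rng = np.random.default_rng(seed)
    u_wy, u_x1x2, u_x2, u_w, u_x1 = rng.normal(size=(5, n))
    x1 = (u_x1 - (0.5 if domain == "b" else 0.0) > 0).astype(int)
    x2 = (u_x1x2 + u_x2 > 0).astype(int)
    if domain == "star":
        w = (x1 + u_wy + 1.5 * u_w - 1 > 0).astype(int)
    else:  # pi^a and pi^b share the mechanism for W
        w = (x1 + u_wy - u_w + 1 > 0).astype(int)
    y = (w - u_wy + 0.1 * x1 - 1 > 0).astype(int)
    return x1, x2, w, y

def cond_mean(domain):
    x1, x2, w, y = simulate(domain)
    mask = (w == 1) & (x1 == 1) & (x2 == 1)
    return y[mask].mean()   # Monte Carlo E[Y | W = 1, X1 = 1, X2 = 1]

e_a, e_b, e_star = cond_mean("a"), cond_mean("b"), cond_mean("star")
```

The agreement between e_a and e_b, and the gap to e_star, reflect the invariance structure discussed in the experiment: only W's mechanism separates the source domains from the target.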



Other invariance assumptions have also been made, e.g. (Z ⊥⊥ S)_{G^{i,j}} and (Y ⊥⊥ S | X)_{G^{i,j}} for problems where only the distribution of some covariates is expected to change across domains (Muandet et al., 2013), or (Z ⊥⊥ S | Y)_{G^{i,j}} (Li et al., 2018). Problems where the magnitude of changes is assumed to be bounded, i.e. |P^*(y | x) − P^i(y | x)| ≤ c, instead of restricting the d-separation statements involving S, have been studied by (Rothenhäusler et al., 2021).



Figure 1: Example of graphs: (a) Causal graph of target domain π ˚, (b) selection diagram that compares domains π ˚with π a , (c) selection diagram that compares domains π ˚with π b , (d) selection diagram that compares domains π a with π b .

Figure 2: Graphs used in Sec. 2.1.

The credible interval P(l(x) < E_{P^*}[Y | x] < u(x) | v) = 1 − α on the posterior of E_{P^*}[Y | x] is obtained by approximating the expectation

    E[1{l(x) < E_{P^*}[Y | x] < u(x)} | v] = P(l(x) < E_{P^*}[Y | x] < u(x) | v).    (9)

Figure 3: (a–c) Selection diagrams comparing π_FR with π_UK, π_FR with π_SW, and π_SW with π_UK, respectively. (d) Mean squared error for cancer prediction on a sample of data from P_FR.

E_{P^M}[Y | x], M ∈ M(G^*). Under the condition that the irreducible error E_{P^M}[(Y − E_{P^M}[Y | X])^2] is constant across M ∈ M(G^*), med_{M ∈ M(G^*)} E_{P^M}[Y | X] provably solves the robust optimization problem in Eq. (2).

Proof. For a given set of selection diagrams and data, let [l(x), u(x)] denote the solution of the partial transportability task for the query E_{P^M}[Y | x], M ∈ M(G^*). Then, writing med for med_{M' ∈ M(G^*)} E_{P^{M'}}[Y | X],

    E_{P^M}[(Y − med)^2] = E_{P^M}[(Y − E_{P^M}[Y | X] + E_{P^M}[Y | X] − med)^2]
                         = E_{P^M}[(Y − E_{P^M}[Y | X])^2] + E_{P^M}[(E_{P^M}[Y | X] − med)^2]
                         ≤ E_{P^M}[(Y − E_{P^M}[Y | X])^2] + E_{P^M}[((u(X) − l(X))/2)^2].

The second equality holds because the cross term in the expansion of the square equals 0, as E_{P^M}[Y − E_{P^M}[Y | X] | X] = 0. The inequality holds because the largest distance that can arise between E_{P^M}[Y | X] and the median of the values E_{P^M}[Y | X], as a function of M ∈ M(G^*), is half the distance between the maximum and minimum values of E_{P^M}[Y | X] across M ∈ M(G^*), that is, (u(X) − l(X))/2. If E_{P^M}[(Y − E_{P^M}[Y | X])^2] is equal to a constant value independent of M, it can be taken out of the maximization and we are left with the optimization problem

    min_f max_{M ∈ M(G^*)} E_{P^M}[(E_{P^M}[Y | X] − f(X))^2].    (17)

For any x and any f, we can always choose M such that |E_{P^M}[Y | x] − f(x)| ≥ |E_{P^M}[Y | x] − med_{M' ∈ M(G^*)} E_{P^{M'}}[Y | x]|, for example by choosing M such that E_{P^M}[Y | x] = max_{M' ∈ M(G^*)} E_{P^{M'}}[Y | x] or E_{P^M}[Y | x] = min_{M' ∈ M(G^*)} E_{P^{M'}}[Y | x], depending on which distance is larger. Therefore, f(x) := med_{M ∈ M(G^*)} E_{P^M}[Y | x] solves the problem.
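A toy numeric check of the final step, under assumed bounds [l, u]: when E_{P^M}[Y | x] can take any value in the interval, the worst case of (E_{P^M}[Y | x] − f(x))^2 is attained at an endpoint, and the minimizing constant prediction is the midpoint:

```python
import numpy as np

l, u = 0.2, 0.8                       # hypothetical bounds [l(x), u(x)]
f_grid = np.linspace(0.0, 1.0, 1001)  # candidate constant predictions f(x)

# Worst-case squared error over E[Y|x] in [l, u] is attained at an endpoint
worst = np.maximum((f_grid - l) ** 2, (f_grid - u) ** 2)
best_f = f_grid[np.argmin(worst)]     # minimax prediction
```

best_f recovers (l + u)/2, and the worst-case error at the optimum is ((u − l)/2)^2, matching the (u(X) − l(X))/2 term in the proof.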


By construction, P̄ itself is a convex mixture of at most d + 1 points in U. That is, by Carathéodory's theorem (Carathéodory, 1911), there exist points P̄_{u_{1,1}}, ..., P̄_{u_{1,d}} ∈ U and non-negative weights w_1, ..., w_d summing to one such that P̄ = Σ_k w_k P̄_{u_{1,k}}. Replacing the definition of P̄_{u_{1,k}}, we obtain that the continuous measure P_1 can be replaced by a discrete probability distribution with outcomes {u_{1,1}, ..., u_{1,d}} and corresponding probabilities {w_1, ..., w_d}, of cardinality d, yielding a probability model equivalent to the original P. This procedure can be repeated for all m exogenous variables in the c-component C. We are thus left with a model equivalent to its discrete counterpart.

Theorem 3 (Thm. 1 restated). The solution [l, u] to the partial transportability task defined over discrete SCMs is a tight bound on the target query E_{P^{π^*}}[y | x]. The credible interval [l̂_0, û_0] coincides with [l, u] as n_i → ∞ in all observable domains π_i, i = 1, ..., k.

Proof. The proof strategy follows (Zhang et al., 2021) and shows convergence of the posterior by way of convergence of the likelihood of the data given an SCM M ∈ M(G). We consider 'convergence' in a frequentist sense: as the sample size increases, the posterior will, with increasing probability, be low for any parameter configuration corresponding to an SCM M ∉ M(G). By the definition of the optimal bounds [l, u] given by the solution to the partial transportability task, if the prior on the parameters (ξ, θ) defining SCMs is non-zero for every M ∈ M(G), then the posterior also converges, which yields the credible intervals [l̂_0, û_0], defined as the 0th and 100th quantiles of the posterior distribution, coinciding with [l, u] asymptotically.

B EXPERIMENTAL DETAILS

All experiments use 1000 burn-in MCMC samples, which are discarded, and 5000 subsequent MCMC samples, treated as independent draws from the posterior distribution and used to approximate the target queries.

