LEARNING COUNTERFACTUALLY INVARIANT PREDICTORS

Abstract

We propose a method to learn predictors that are invariant under counterfactual changes of certain covariates. This method is useful when the prediction target is causally influenced by covariates that should not affect the predictor output. For instance, this could prevent an object recognition model from being influenced by position, orientation, or scale of the object itself. We propose a model-agnostic regularization term based on conditional kernel mean embeddings to enforce counterfactual invariance during training. We prove the soundness of our method, which can handle mixed categorical and continuous multivariate attributes. Empirical results on synthetic and real-world data demonstrate the efficacy of our method in a variety of settings.

1. INTRODUCTION AND RELATED WORK

Invariance, or equivariance, to certain transformations of data has proven essential in numerous applications of machine learning (ML), since it can lead to better generalization capabilities (Arjovsky et al., 2019; Chen et al., 2020; Bloem-Reddy & Teh, 2020). For instance, in image recognition, predictions ought to remain unchanged under scaling, translation, or rotation of the input image. Data augmentation is one of the earliest heuristics developed to promote this kind of invariance, and it has become indispensable for training successful models such as deep neural networks (DNNs) (Shorten & Khoshgoftaar, 2019; Xie et al., 2020). Well-known examples of "invariance by design" include convolutional neural networks (CNNs) for translation invariance (Krizhevsky et al., 2012), group-equivariant CNNs for other group transformations (Cohen & Welling, 2016), recurrent neural networks (RNNs) and transformers for sequential data (Vaswani et al., 2017), DeepSet (Zaheer et al., 2017) for sets, and graph neural networks (GNNs) for different types of geometric structures (Battaglia et al., 2018). Many real-world applications in modern ML, however, call for an arguably stronger notion of invariance based on causality, called counterfactual invariance. This case has been made for image classification, algorithmic fairness (Hardt et al., 2016; Mitchell et al., 2021), robustness (Bühlmann, 2020), and out-of-distribution generalization (Lu et al., 2021). These applications require predictors to exhibit invariance with respect to hypothetical manipulations of the data generating process (DGP) (Peters et al., 2016; Heinze-Deml et al., 2018; Rojas-Carulla et al., 2018; Arjovsky et al., 2019; Bühlmann, 2020). In image classification, for instance, we want a model that "would have made the same prediction, if the object position had been different with everything else being equal". Similarly, in algorithmic fairness, Kilbertus et al. (2017); Kusner et al.
(2017) introduce notions of interventional and counterfactual fairness, based on certain invariances in the DGP of the causal relationships between observed variables. Counterfactual invariance has the significant advantage that it incorporates structural knowledge of the DGP. Enforcing this notion in practice is very challenging, however, since it is untestable in real-world observational settings unless strong prior knowledge of the DGP is available. Inspired by problems in natural language processing (NLP), Veitch et al. (2021) provide a method to achieve counterfactual invariance based on distribution matching via the maximum mean discrepancy (MMD). This method enforces a necessary, but not sufficient, condition of counterfactual invariance during training. Consequently, it is unclear whether this method achieves actual invariance in practice, or just an arguably weaker proxy. Furthermore, the work by Veitch et al. (2021) only considers discrete random variables when enforcing counterfactual invariance, and it only applies to specific, selected causal graphs. To overcome the aforementioned problems, we propose a general definition of counterfactual invariance and a novel method to enforce it. Our main contributions can be summarized as follows:
• Based on a structural causal model (SCM), we provide a new definition of counterfactual invariance (cf. Definition 2.2) that is more general than that of Veitch et al. (2021).
• We establish a connection between counterfactual invariance and conditional independence that is provably sufficient for counterfactual invariance (cf. Theorem 3.2).
• We propose a new objective function composed of the predictive loss and the flexible Hilbert-Schmidt Conditional Independence Criterion (HSCIC) (Park & Muandet, 2020) to enforce counterfactual invariance in practice. Our method works well for both categorical and continuous covariates and outcomes, as well as in multivariate settings.

2. PRELIMINARIES AND BACKGROUND

Counterfactual invariance. We introduce structural causal models as in Pearl (2000). Definition 2.1 (Structural causal model (SCM)). A structural causal model is a tuple (U, V, F, P_U) such that (i) U is a set of background variables that are exogenous to the model; (ii) V is a set of observable (endogenous) variables; (iii) F = {f_V}_{V ∈ V} is a set of functions from (the domains of) pa(V) ∪ U_V to (the domain of) V, where U_V ⊆ U and pa(V) ⊆ V \ {V} such that V = f_V(pa(V), U_V); (iv) P_U is a probability distribution over the domain of U. Further, the subsets pa(V) ⊆ V \ {V} are chosen such that the graph G over V, where the edge V′ → V is in G if and only if V′ ∈ pa(V), is a directed acyclic graph (DAG). We always denote with Y ⊂ V the outcome (or prediction target), and with Ŷ a predictor for that target. The predictor Ŷ is not strictly part of the SCM, because we get to tune f_Ŷ. Since it takes inputs from V, we often treat it as an observed variable in the SCM. As such, it also "derives its randomness from the exogenous variables", i.e., it is defined on the same probability space. Each SCM implies a unique observational distribution over V (Pearl, 2000), but it also entails interventional distributions. Given a variable A ∈ V, an intervention A ← a amounts to replacing f_A in F with the constant function A = a. This yields a new SCM, which induces the interventional distribution under intervention A ← a. Similarly, we can intervene on multiple variables A ⊆ V, written A ← a. We then write Y*_a for the outcome in the intervened SCM, also called the potential outcome. Note that the interventional distribution P_{Y*_a}(y) differs in general from the conditional distribution P_{Y|A}(y | a); this can happen, for instance, due to unobserved confounding effects. We can also condition on a set of variables W ⊆ V in the (observational distribution of the) original SCM before performing an intervention, which we denote by P_{Y*_a|W}(y | w).
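The mechanics of Definition 2.1 and of interventions can be illustrated with a minimal sketch. All class and variable names below are illustrative, not from the paper's implementation; the toy structural equations A = U_A, X = A + U_X, Y = X + U_Y are an assumption for the example.

```python
import numpy as np

class SCM:
    """Minimal structural causal model sketch (Definition 2.1).

    `functions` maps each endogenous variable to (parent names, f_V), given
    in topological order (dict insertion order); `noise` maps each variable
    to a sampler for its exogenous U_V."""

    def __init__(self, functions, noise):
        self.functions = functions
        self.noise = noise

    def sample(self, n, do=None, rng=None):
        """Draw n samples; `do={'A': a}` replaces f_A by the constant a."""
        rng = rng or np.random.default_rng(0)
        do = do or {}
        data = {}
        for name, (parents, f) in self.functions.items():
            if name in do:                         # intervention A <- a
                data[name] = np.full(n, do[name], dtype=float)
            else:
                u = self.noise[name](rng, n)
                data[name] = f(*[data[p] for p in parents], u)
        return data

# Toy SCM: A = U_A, X = A + U_X, Y = X + U_Y, all noises standard normal.
gauss = lambda rng, n: rng.normal(0.0, 1.0, n)
scm = SCM(
    functions={"A": ((), lambda u: u),
               "X": (("A",), lambda a, u: a + u),
               "Y": (("X",), lambda x, u: x + u)},
    noise={"A": gauss, "X": gauss, "Y": gauss},
)
post = scm.sample(10_000, do={"A": 2.0})  # samples from the interventional distribution
```

Sampling with `do={"A": 2.0}` draws from the distribution of Y*_a with a = 2; here E[Y*_2] = 2, while the observational mean of Y is 0.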
This is a counterfactual distribution: "Given that we have observed W = w, what would Y have been had we set A ← a, instead of the value A had actually taken?" Note that the sets A and W need not be disjoint. We can now define counterfactual invariance. Definition 2.2 (Counterfactual invariance). Let A, W be (not necessarily disjoint) sets of nodes in a given SCM. A predictor Ŷ is counterfactually invariant in A with respect to W if P_{Ŷ*_a|W}(y | w) = P_{Ŷ*_{a′}|W}(y | w) almost surely, for all a, a′ in the domain of A and all w in the domain of W. A counterfactually invariant predictor can be viewed as robust to changes of A, in the sense that the (conditional) post-interventional distribution of Ŷ does not change across values of the intervention. Our Definition 2.2 is more general than previously considered notions of counterfactual invariance. For instance, the invariance in Definition 1.1 of Veitch et al. (2021) requires Ŷ*_a = Ŷ*_{a′} almost surely for all a, a′ in the domain of A. First, it does not allow conditioning on observed evidence, i.e., it cannot consider "true counterfactuals" and is thus unable to promote, for example, counterfactual fairness, see Definition 3.6 (Kusner et al., 2017). Second, it may appear stronger in that it asks for equality of random variables instead of equality of distributions. However, (a) contrary to their definition, Veitch et al. (2021) only enforce equality of distributions in practice (via MMD), and (b) since Ŷ*_a and Ŷ*_{a′} are (deterministic) functions of the same exogenous (unobserved) random variables, distributional equality is a natural choice for counterfactual invariance. We remark that a related notion of invariance has been studied by Mouli & Ribeiro (2022), who focus on learning classifiers that are counterfactually invariant to distribution shift using asymmetry learning.
To the best of our knowledge, however, our work is the first attempt to provide a general graphical criterion for invariance that can be verified from observational data. Kernel mean embeddings and conditional measures. Our new objective function relies heavily on kernel mean embeddings (KMEs). We now highlight the main concepts pertaining to KMEs and refer the reader to Smola et al. (2007); Schölkopf et al. (2002); Berlinet & Thomas-Agnan (2011); Muandet et al. (2017) for more details. Fix a measurable space Y with σ-algebra F_Y, and consider a probability measure P on (Y, F_Y). Let H be a reproducing kernel Hilbert space (RKHS) with a bounded kernel k_Y : Y × Y → R, i.e., k_Y is such that sup_{y∈Y} k_Y(y, y) < +∞. The kernel mean embedding μ_P of P is defined as the expected value of the function k_Y(·, y) with respect to y, i.e., μ_P := E[k_Y(·, y)]. The definition of KMEs can be extended to conditional distributions (Fukumizu et al., 2013; Grünewälder et al., 2012; Song et al., 2009; 2013). Consider two random variables Y, Z, and denote with (Ω_Y, F_Y) and (Ω_Z, F_Z) the respective measurable spaces. These random variables induce a probability measure P_{Y,Z} on the product space Ω_Y × Ω_Z. Let H_Y be an RKHS with a bounded kernel k_Y(·, ·) on Ω_Y. We define the KME of a conditional distribution P_{Y|Z}(· | z) via μ_{Y|Z=z} := E[k_Y(·, y) | Z = z], where the expected value is taken over y. KMEs of conditional measures can be estimated from samples. To illustrate this, consider i.i.d. samples (y_1, z_1), ..., (y_n, z_n), and denote with K_Y the kernel matrix with entries [K_Y]_{i,j} := k_Y(y_i, y_j). Furthermore, let k_Z be a bounded kernel on Ω_Z with kernel matrix K_Z. Then, μ_{Y|Z=z} can be estimated as

μ̂_{Y|Z=z} := Σ_{i=1}^n ŵ^{(i)}_{Y|Z}(z) k_Y(·, y_i),   ŵ_{Y|Z}(·) := (K_Z + nλI)^{-1} [k_Z(·, z_1), ..., k_Z(·, z_n)]^T,   (1)

where I is the identity matrix and λ > 0 is a regularization parameter. Here, ŵ^{(i)}_{Y|Z}(·), the i-th entry of ŵ_{Y|Z}(·), are the coefficients of kernel ridge regression (Grünewälder et al., 2012).
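The estimator in eq. (1) can be sketched in a few lines of numpy. The Gaussian kernel, the length scale, and all function names are illustrative assumptions for the example, not the paper's implementation.

```python
import numpy as np

def gaussian_kernel(X1, X2, ls=1.0):
    """Gaussian kernel matrix between rows of X1 (n,d) and X2 (m,d)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def cme_weights(Z_train, z_query, lam=1e-2, ls=1.0):
    """Kernel-ridge weights ŵ_{Y|Z}(z) from eq. (1): (K_Z + nλI)^{-1} k_Z(z)."""
    n = len(Z_train)
    K_Z = gaussian_kernel(Z_train, Z_train, ls)
    k_z = gaussian_kernel(Z_train, z_query, ls)            # shape (n, m)
    return np.linalg.solve(K_Z + n * lam * np.eye(n), k_z)  # shape (n, m)

def conditional_mean_embedding(Y_train, Z_train, z_query, y_eval, lam=1e-2, ls=1.0):
    """Evaluate μ̂_{Y|Z=z}(y) = Σ_i ŵ^{(i)}(z) k_Y(y, y_i) at the points y_eval."""
    w = cme_weights(Z_train, z_query, lam, ls)             # (n, m)
    K_y = gaussian_kernel(y_eval, Y_train, ls)             # (p, n)
    return K_y @ w                                         # (p, m)

# Toy check: with Y ≈ Z, the embedding at z = 0 puts more mass near y = 0 than y = 2.
rng = np.random.default_rng(1)
Z_tr = rng.normal(size=(200, 1))
Y_tr = Z_tr + 0.1 * rng.normal(size=(200, 1))
mu = conditional_mean_embedding(Y_tr, Z_tr, np.array([[0.0]]), np.array([[0.0], [2.0]]))
```

The embedding evaluated at y = 0 should exceed its value at y = 2, reflecting that P_{Y|Z=0} concentrates near zero.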

3. COUNTERFACTUALLY INVARIANT PREDICTORS

We now establish a simple graphical criterion to express counterfactual invariance as a conditional independence in the observational distribution, rendering it estimable from i.i.d. data. We first recall the notion of blocked paths (Pearl, 2000). Definition 3.1. Consider a path π of a causal graph G. A set of nodes Z blocks π if π contains a triple of consecutive nodes connected in one of the following ways: N_i → Z → N_j or N_i ← Z → N_j with Z ∈ Z, or N_i → M ← N_j with N_i, N_j ∉ Z and neither M nor any descendant of M in Z. Theorem 3.2. Let G be a causal diagram, and let A, W be two (not necessarily disjoint) sets of nodes in G. Let Z be a set of nodes that blocks all non-causal paths from A ∪ W to Y. Then, for any SCM compatible with G, any predictor Ŷ that satisfies Ŷ ⊥⊥ A | Z is counterfactually invariant in A with respect to W. The proof is deferred to Appendix A. Our key observation is that a valid set Z as in Theorem 3.2 acts as a d-separator for certain random variables in a graph that allows reasoning about dependencies among pre- and post-interventional random variables. This graph simplifies the counterfactual graph of Shpitser & Pearl (2008) and generalizes the augmented graph structure described in Theorem 1 of Shpitser & Pearl (2009). We can then combine the Markov property with covariate adjustment to prove the claim. Importantly, our proof does not rely on identification of the counterfactual distributions (e.g., by simply applying the do-calculus (Pearl, 2000)). Crucially, the existence of a set Z as in Theorem 3.2 does not imply identifiability of the counterfactual distributions P_{Y*_a}(y) (see Figure 1(a) for a counterexample). In particular, the assumptions do not rule out hidden confounding in the model. Hence, our method is applicable even when (certain parameters of) P_{Y*_a}(y) or P_{Y*_a|W}(y) cannot be learned from observational data.

Figure 1: (a) An example for Theorem 3.2 where P_{Y*_a}(y) is unidentifiable, due to the confounding effect denoted by the dashed double arrow. However, by Theorem 3.2, any predictor Ŷ such that Ŷ ⊥⊥ A | Z with Z = {X} is counterfactually invariant in A. Hence, the criterion in Theorem 3.2 is weaker than identifiability. (b)-(c) Causal and anti-causal structures as in Veitch et al. (2021). The variable X is decomposed into three parts: X⊥_A, the part of X that is not causally influenced by A; X⊥_Y, the part that does not causally influence Y; and X∧, the remaining part, which is both influenced by A and influences Y.

Counterfactually invariant predictors. We propose to use HSCIC(Ŷ, A | Z) to promote counterfactual invariance by encouraging the sufficient conditional independence Ŷ ⊥⊥ A | Z. Given a loss function L(Ŷ), we propose to minimize the total loss

L_TOTAL(Ŷ) = L(Ŷ) + γ · HSCIC(Ŷ, A | Z),   (2)

with a parameter γ ≥ 0 regulating the trade-off between accuracy and counterfactual invariance. There is no principled way to choose the "right" trade-off parameter. In practice, however, we can use a standard two-step approach: (i) learn a collection of models on the Pareto frontier using different values of γ; (ii) choose the best model from this collection following, e.g., Muandet et al. (2021). Next, we introduce the HSCIC and justify its use (see Corollary 3.5): HSCIC(Ŷ, A | Z) = 0 if and only if Ŷ ⊥⊥ A | Z. Hilbert-Schmidt Conditional Independence Criterion (HSCIC). Informally, HSCIC(Ŷ, A | Z) is the RKHS norm of the difference μ_{(Ŷ,A)|Z} − μ_{Ŷ|Z} ⊗ μ_{A|Z} between the joint conditional embedding and the product of the marginal ones. Our Definition 3.3 differs slightly from Park & Muandet (2020), who define the HSCIC using the Bochner conditional expected value. While functionally equivalent (with the same implementation, see eq. (3)), our definition bypasses some technical assumptions required by Park & Muandet (2020). We refer readers to Appendices C and D for a comparison with previous approaches.
The HSCIC has the following important property. Theorem 3.4 (following Theorem 5.4 of Park & Muandet (2020)). If the kernel k of H_Ŷ ⊗ H_A is characteristic, then HSCIC(Ŷ, A | Z) = 0 almost surely if and only if Ŷ ⊥⊥ A | Z. A proof is given in Appendix B. We remark that "most interesting" kernels, such as the Gaussian and Laplacian kernels, are characteristic. Furthermore, if kernels are translation-invariant and characteristic, then their tensor product is also a characteristic kernel (Szabó & Sriperumbudur, 2017). Hence, this natural assumption is non-restrictive in practice. Combining Theorems 3.2 and 3.4, we can now use the HSCIC to promote counterfactual invariance. Corollary 3.5. Consider an SCM with causal diagram G and fix two (not necessarily disjoint) sets of nodes A, W. Let Z be a set of nodes in G that blocks all non-causal paths from A ∪ W to Y. Then, any predictor Ŷ that satisfies HSCIC(Ŷ, A | Z) = 0 almost surely is counterfactually invariant in A with respect to W. We do not require A, W, Z, or Y to be binary or categorical, a major improvement over existing methods (Chiappa, 2019; Xu et al., 2020), which cannot handle continuous multivariate attributes. Estimating the HSCIC from samples. Given n samples {(y_i, a_i, z_i)}_{i=1}^n, denote by K_Ŷ and K_A the corresponding kernel matrices for Ŷ and A (see eq. (1)). We then estimate H_{Ŷ,A|Z} as

Ĥ²_{Ŷ,A|Z}(·) = ŵ_{Ŷ,A|Z}(·)^T (K_Ŷ ⊙ K_A) ŵ_{Ŷ,A|Z}(·) − 2 ŵ_{Ŷ,A|Z}(·)^T [(K_Ŷ ŵ_{Ŷ|Z}(·)) ⊙ (K_A ŵ_{A|Z}(·))] + [ŵ_{Ŷ|Z}(·)^T K_Ŷ ŵ_{Ŷ|Z}(·)] [ŵ_{A|Z}(·)^T K_A ŵ_{A|Z}(·)],   (3)

where ⊙ is element-wise multiplication. The vectors ŵ_{Ŷ|Z}(·), ŵ_{A|Z}(·), and ŵ_{Ŷ,A|Z}(·) are found via kernel ridge regression as in eq. (1). Results by Caponnetto & De Vito (2007) establish the convergence (and rates) of Ĥ²_{Ŷ,A|Z}(·) to H²_{Ŷ,A|Z}(·) under mild conditions. In practice, computing the HSCIC approximation Ĥ²_{Ŷ,A|Z}(·) may be time-consuming. To speed it up, we can use random Fourier features to approximate the matrices K_Ŷ and K_A (Rahimi & Recht, 2007; Avron et al., 2017).
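A sketch of the estimator in eq. (3) is given below. We make one simplifying assumption for the example: with a common kernel and regularizer on Z, the ridge weights ŵ_{Ŷ|Z}, ŵ_{A|Z}, and ŵ_{Ŷ,A|Z} coincide, so a single weight matrix suffices. Kernel choice, length scales, and names are illustrative, not the paper's code.

```python
import numpy as np

def rbf(X, Y, ls=1.0):
    """Gaussian kernel matrix between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def hscic_values(Y, A, Z, lam=1e-2, ls=1.0):
    """Ĥ²(z_j) at every sample z_j, a sketch of eq. (3).

    Column j of W holds the kernel-ridge weights ŵ(z_j) from eq. (1);
    the same weights are reused for the joint and both marginal embeddings."""
    n = len(Z)
    K_Y, K_A, K_Z = rbf(Y, Y, ls), rbf(A, A, ls), rbf(Z, Z, ls)
    W = np.linalg.solve(K_Z + n * lam * np.eye(n), K_Z)
    t1 = np.einsum('ij,ik,kj->j', W, K_Y * K_A, W)         # ŵᵀ(K_Y ⊙ K_A)ŵ
    t2 = ((K_Y @ W) * (K_A @ W) * W).sum(axis=0)            # ŵᵀ[(K_Y ŵ) ⊙ (K_A ŵ)]
    t3 = np.einsum('ij,ik,kj->j', W, K_Y, W) * np.einsum('ij,ik,kj->j', W, K_A, W)
    return t1 - 2 * t2 + t3

# Toy check: A ⊥ Y | Z vs. strong conditional dependence.
rng = np.random.default_rng(0)
Z = rng.normal(size=(150, 1))
Y = Z + 0.3 * rng.normal(size=(150, 1))
A_indep = Z + 0.3 * rng.normal(size=(150, 1))  # conditionally independent of Y given Z
A_dep = Y.copy()                               # conditionally dependent on Y given Z
h_indep = hscic_values(Y, A_indep, Z)
h_dep = hscic_values(Y, A_dep, Z)
```

Averaged over the sample, the conditionally dependent pair should yield a larger HSCIC value than the conditionally independent one.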
We emphasize that eq. (3) allows us to consistently estimate the HSCIC from observational i.i.d. samples, without prior knowledge of the counterfactual distributions. Measuring counterfactual invariance. Besides predictive performance, e.g., mean squared error (MSE) for regression, our key metric of interest is the level of counterfactual invariance achieved by the predictor Ŷ, i.e., a measure of how the distribution of Ŷ*_a changes for different values of a and across all conditioning values w. We quantify the overall counterfactual variance as a single scalar, the Variance of CounterFactuals (VCF):

VCF(Ŷ) = E_{w∼P_W} [ var_{a′∼P_A} ( E[Ŷ*_{a′} | W = w] ) ].   (4)

That is, we look at how the average outcome varies with the interventional value a′ at conditioning value w, and average this variance over w. For deterministic predictors (i.e., point estimators), which we use in all our experiments, the variance term in eq. (4) is zero if and only if counterfactual invariance holds at w for (almost) all a′ in the support of P_A (i.e., the prediction remains constant). Since the variance is non-negative, VCF (we often drop the argument) is then zero if and only if counterfactual invariance holds almost surely. To estimate VCF in practice, we pick d datapoints (w_i)_{i=1}^d from the observed data, and for each compute the counterfactual outcomes Ŷ*_{a′} | w_i for k different values of a′. The inner expectation is simply the predictor output. We use empirical variances with k examples for each of the d chosen datapoints, and the empirical mean of the d variances for the outer expectation. Crucially, VCF requires access to ground-truth counterfactual distributions, which by their very nature are unavailable in practice (neither for training nor at test time). Hence, we can only assess VCF, as a direct measure of counterfactual invariance, in synthetic scenarios.
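The Monte-Carlo estimation of eq. (4) described above can be sketched as follows. The helper `counterfactual_input` stands in for the ground-truth counterfactual computation (which requires the true SCM, so this is only runnable in synthetic settings); all names and the toy mechanism X*_a = a + u_x are illustrative assumptions.

```python
import numpy as np

def vcf(predict, counterfactual_input, w_samples, a_grid):
    """Monte-Carlo estimate of VCF, eq. (4), for a deterministic predictor.

    predict: maps model inputs to a scalar prediction ŷ
    counterfactual_input(w, a): model inputs had A been set to `a`,
        given observed evidence w (requires the ground-truth SCM)."""
    variances = [np.var([predict(counterfactual_input(w, a)) for a in a_grid])
                 for w in w_samples]            # inner variance over a' ~ P_A
    return float(np.mean(variances))            # outer expectation over w ~ P_W

# Toy mechanism X = A + U_X; evidence w = (a_obs, u_x) recovered by abduction.
cf_x = lambda w, a: a + w[1]                    # counterfactual mediator X*_a
w_samples = [(0.3, 0.1), (-1.0, 0.5), (0.7, -0.2)]
a_grid = np.linspace(-2, 2, 11)

vcf_invariant = vcf(lambda x: 0.0, cf_x, w_samples, a_grid)  # constant predictor
vcf_variant = vcf(lambda x: x, cf_x, w_samples, a_grid)      # uses the mediator
```

The constant predictor attains VCF = 0 exactly, while the predictor that passes the A-dependent mediator through attains a strictly positive VCF.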
We demonstrate in our experiments that the HSCIC (estimable from the observed data) empirically serves as a proxy for VCF. Applications of counterfactual invariance. We briefly outline potential applications of counterfactual invariance, which we subsequently study empirically on real-world data. Image classification. Counterfactual invariance serves as a strong notion of robustness in high-dimensional settings such as image classification: "Would the truck have been classified correctly had it been winter in this exact situation instead of summer?" As a concrete demonstration, we use the dSprites dataset (Matthey et al., 2017), which consists of relatively simple, yet high-dimensional, square black-and-white images of different shapes (squares, ellipses, etc.), sizes, and orientations (rotation) in different xy-positions. Counterfactual fairness. The popular definition of counterfactual fairness (Kusner et al., 2017) is an incarnation of our counterfactual invariance (see Definition 2.2). It is captured informally by the following question after receiving a consequential decision: "Would I have gotten the same outcome had I been a different gender, race, or age with all else being equal?" Again, we denote the outcome by Y ⊂ V and the so-called protected attributes by A ⊆ V \ Y, such as gender, race, or age protected under anti-discrimination laws (see, e.g., Barocas & Selbst (2016)). Collecting the remaining observed attributes in W, counterfactual fairness then amounts to counterfactual invariance in A with respect to W. Text classification. Veitch et al. (2021) motivate the importance of counterfactual invariance in text classification tasks. We now provide a detailed comparison with our method on the (limited) causal models studied by Veitch et al. (2021), emphasizing that our definition and method apply to a much broader class of causal models. Specifically, we consider the causal and anti-causal structures (Figure 1 in Veitch et al. (2021)). Both diagrams consist of a protected attribute Z, an observed covariate X, and the outcome Y. Veitch et al.
(2021) provide necessary conditions for counterfactual invariance. They prove that if Ŷ*_z = Ŷ*_{z′} almost surely, then under their proposed causal and anti-causal structures (see Figure 1(b-c)) it holds that Ŷ ⊥⊥ Z and Ŷ ⊥⊥ Z | Y, respectively. That is, in practice they enforce a consequence of the desired criterion instead of a prerequisite. Our work complements the results by Veitch et al. (2021), as shown in the following corollary, which is a direct consequence of Theorem 3.2. Corollary 3.7. Under the causal and anti-causal graphs, suppose that Z and Y are not confounded. If Ŷ ⊥⊥ Z, it holds that P_{Ŷ*_z}(y) = P_{Ŷ*_{z′}}(y) almost surely, for all z, z′ in the domain of Z. We remark that the notion of invariance studied by Veitch et al. (2021) is enforced on the true counterfactuals. Our definition, and Corollary 3.7, on the other hand, only require invariance of the resulting distribution, which is a weaker requirement than that of Veitch et al. (2021). However, we provide experiments in Appendix E.6 to show that our method, based on Corollary 3.7, and the method by Veitch et al. (2021) have a similar effect in practice.

Figure 2: Causal graphs for the synthetic experiments (a), the fairness experiment (b), and the image classification experiment (c).

4. EXPERIMENTS

4.1. SYNTHETIC EXPERIMENTS

We begin our empirical assessment of the HSCIC on synthetic datasets, where the ground truth is known, and systematically study how performance is influenced by the dimensionality of the variables. We simulate different datasets following the causal graphical structure in Figure 2(a). The datasets are composed of four sets of observed continuous variables: (i) the prediction target Y, (ii) the variables A in which we want to be counterfactually invariant, (iii) observed covariates X that mediate effects from A on Y, and (iv) observed confounding variables Z. The goal is to learn a predictor Ŷ that is counterfactually invariant in A with respect to W := A ∪ X ∪ Z. We consider various artificially generated datasets for this example, which mainly differ in the dimension of the observed variables and their correlations. A detailed description of each dataset is deferred to Appendix E. Model choices and parameters. For all synthetic experiments, we train fully connected neural networks (MLPs). We use the MSE loss L_MSE(Ŷ) as the predictive loss L in eq. (2) for continuous outcomes Y. We generate 4k samples from the observational distribution in each setting and use an 80/20 train/test split. All metrics reported are on the test set. We tune the MLP hyperparameters via random search (see Appendix E for details). The HSCIC(Ŷ, A | Z) term is computed as in eq. (3) using a Gaussian kernel with amplitude 1.0 and length scale 0.1. The regularization parameter λ for the ridge regression coefficients in eq. (1) is set to λ = 0.01. We set d = 1000 and k = 500 in the estimation of VCF. Model performance. We first perform a set of experiments to study the effect of the HSCIC and to highlight the trade-off between accuracy and counterfactual invariance. For this set of experiments, we generate a dataset as described in Appendix E.1.
Figure 3 (left) shows the variance term of VCF for different regularization parameters γ, as a function of the values of Z (i.e., before taking the outer expectation). We observe that increasing values of γ lead to a consistent decrease of the variances with respect to the interventional value a. Figure 3 (center) shows the values attained by the HSCIC and the MSE for increasing γ, demonstrating the expected trade-off between raw predictive performance and enforcing counterfactual invariance. Finally, Figure 3 (right) highlights the usefulness of the HSCIC as a measure of counterfactual invariance, being in strong agreement with VCF (see the discussion after eq. (4)). Comparison with baselines. We now compare our method against baselines in two settings, which we refer to as Scenario 1 and Scenario 2. These two settings differ in how the conditioning set Z affects the outcome Y, with Scenario 1 exhibiting higher correlation of Z with both the mediator X and the outcome Y (see Appendix E.2 for details). Since counterfactually invariant training has not received much attention yet, our choice of baselines for experimental comparison is limited. Corollary 3.7, together with the fact that Veitch et al. (2021) in practice only enforce distributional equality, implies that our method subsumes theirs in the causal settings they proposed. We benchmarked our method against Veitch et al. (2021) in the limited causal and anti-causal settings of Figure 6(b-c) in Appendix E.6, showing that our method performs on par with or better than theirs. Since counterfactual fairness is a special case of counterfactual invariance, we also compare against two methods proposed by Kusner et al. (2017) (in applicable settings): the Level 1 approach (only use non-descendants of A as inputs to Ŷ) and the Level 3 approach (assume an additive noise model and, in addition to non-descendants, use the residuals of descendants of A after regressing them on A as inputs to Ŷ).
We refer to these two baselines as CF1 and CF2, respectively. We summarize the results on the two scenarios in Table 1. For a suitable choice of γ, our method outperforms the baseline CF2 in both MSE and VCF simultaneously. While CF1 satisfies counterfactual invariance perfectly by construction (VCF = 0), its MSE is substantially higher than that of all other methods. Our method allows us to flexibly trade predictive performance for counterfactual invariance via the single tuning knob γ and Pareto-dominates existing methods. Multi-dimensional variables. We perform a third set of experiments to assess the HSCIC's performance in higher dimensions. We consider simulated datasets (described in Appendix E.3) where we independently increase the dimension of Z and A in two different simulations, leaving the rest of the variables unchanged. The results in Table 2, for different regularization coefficients γ and different dimensions of A and Z, demonstrate that the HSCIC can handle multi-dimensional variables while maintaining performance, as counterfactual invariance is approached when γ increases. We provide results where both A and Z are multivariate in Appendix E.3.
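The residual construction behind the CF2 baseline can be sketched as follows, assuming a linear additive-noise model for illustration; the function name and data are hypothetical, not from Kusner et al. (2017)'s implementation.

```python
import numpy as np

def cf2_features(A, X):
    """Sketch of the Level-3 (CF2) baseline of Kusner et al. (2017) under a
    linear additive-noise assumption: regress each descendant in X on the
    protected attributes A and keep only the residuals as predictor inputs."""
    A1 = np.column_stack([np.ones(len(A)), A])       # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A1, X, rcond=None)    # least-squares fit X ~ A
    return X - A1 @ beta                             # residuals, linearly independent of A

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 1))
X = 2.0 * A + rng.normal(size=(500, 1))              # descendant of A
R = cf2_features(A, X)
```

By construction, the residuals R are (linearly) uncorrelated with A, so a predictor trained on R cannot pick up the linear effect of the protected attribute.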

4.2. HIGH-DIMENSIONAL IMAGE EXPERIMENTS

We consider the image classification task on the dSprites dataset (Matthey et al., 2017). Since this dataset is fully synthetic and labelled (we know all factors for each image), we consider a causal model as depicted in Figure 2(c). The full structural equations are provided in Appendix E.4, where we assume a causal graph over the determining factors of the image and essentially look up the corresponding image in the simulated dataset. This experiment is particularly challenging due to the mixed categorical and continuous variables in C (shape, y-pos) and X (color, orientation), and the continuous A (x-pos). All variables except for scale are assumed to be observed, and all variables, jointly with the actual high-dimensional image, determine the outcome Y. Our goal is to learn a predictor Ŷ that is counterfactually invariant in the x-position with respect to all other observed variables. In the chosen causal structure, {shape, y-pos} ⊆ C blocks all non-causal paths from A ∪ C ∪ U to Y. Hence, we seek to achieve Ŷ ⊥⊥ x-pos | {shape, y-pos} via the HSCIC regularizer. To accommodate the mixed input types, Ŷ puts an MLP on top of features extracted from the images via convolutional layers, concatenated with features extracted from the remaining inputs via an MLP. Figure 4 demonstrates that the HSCIC achieves improved VCF as γ increases, up to a certain point, while affecting the MSE, an inevitable trade-off.

4.3. FAIRNESS WITH CONTINUOUS PROTECTED ATTRIBUTES

We apply our method to the popular UCI Adult dataset (Kohavi & Becker, 1996). Our goal is to predict whether an individual's income is above a certain threshold (Y ∈ {0, 1}) based on a collection of (demographic) information, including protected attributes such as gender and age. We follow Nabi & Shpitser (2018); Chiappa (2019), where a subset of variables is selected from the dataset and a causal structure is assumed as in Figure 2(b) (see Appendix E.5 and Figure 7 for details). We choose gender (considered binary in this dataset) and age (considered continuous) as the protected attributes A. We denote the marital status, level of education, occupation, working hours per week, and work class jointly by X, and combine the remaining observed attributes in C. Our goal is to learn a predictor Ŷ that is counterfactually invariant in A with respect to W = C ∪ X. We remark that achieving fairness for continuous or even mixed categorical and continuous protected attributes is an ongoing line of research (even for non-causal fairness notions) (Mary et al., 2019; Chiappa & Pacchiano, 2021), but it is directly supported by the HSCIC. We use an MLP with the binary cross-entropy loss for Ŷ. Since this experiment is based on real data, the true counterfactual distribution cannot be known. Hence, we follow Chiappa & Pacchiano (2021) and estimate a possible true SCM by inferring the posterior distribution over the unobserved variables.

5. DISCUSSION AND FUTURE WORK

We studied the problem of learning predictors Ŷ that are counterfactually invariant under changes of certain covariates. We put forth a formal definition of counterfactual invariance and described how it generalizes existing notions. Next, we provided a novel sufficient graphical criterion to characterize counterfactual invariance and reduce it to a conditional independence statement in the observational distribution. Our method does not require identifiability of the counterfactual distribution, nor does it exclude the possibility of unobserved confounders. Finally, we proposed an efficiently estimable, model-agnostic regularization term, based on kernel mean embeddings of conditional measures, to enforce this conditional independence (and thus counterfactual invariance); it works for mixed continuous/categorical, multi-dimensional variables. We demonstrated the efficacy of our method in regression and classification tasks involving controlled simulation studies, high-dimensional images, and a fairness application, where it outperforms existing baselines. The main limitation of our work, shared by all studies in this domain, is the assumption that the causal graph is known. Another limitation is that our methodology is applicable only when our graphical criterion is satisfied, which requires a certain set of variables to be observed (albeit unobserved confounders are not generally excluded). From an ethics perspective, the increased robustness of counterfactually invariant (or the societal benefits of counterfactually fair) predictors is certainly desirable. However, this presupposes that all the often untestable assumptions are valid. Overall, causal methodology should not be applied lightly, especially in high-stakes and consequential decisions. A critical analysis of the broader context or systemic factors may hold more promise for societal benefit than a well-crafted algorithmic predictor.
In light of these limitations, an interesting and important direction for future work is to assess the sensitivity of our method to misspecifications of the causal graph or to insufficient knowledge of the required blocking set. Finally, we believe our graphical criterion and KME-based regularization can also be useful for causal representation learning, where one aims to isolate causally relevant, autonomous factors underlying the data generating process of high-dimensional data.

A MISSING PROOF OF THEOREM 3.2

In this section, we give the proof of Theorem 3.2, which we restate for completeness.

A.1 OVERVIEW OF THE PROOF TECHNIQUES

Theorem 3.2. Let G be a causal diagram, and let A, W be two (not necessarily disjoint) sets of nodes in G. Let Z be a set of nodes that blocks all non-causal paths from A ∪ W to Y. Then, for any SCM compatible with G, any predictor Ŷ that satisfies Ŷ ⊥⊥ A | Z is counterfactually invariant in A with respect to W.

Our proof technique generalizes the work of Shpitser & Pearl (2009). To understand the proof technique, note that conditional counterfactual distributions of the form $P_{Y^*_a \mid W}(y \mid w)$ involve quantities from two different worlds: the variables W belong to the pre-interventional world, whereas the variable $Y^*_a$ belongs to the world after performing the intervention A ← a. Hence, we study the identification of conditional counterfactual distributions using a diagram that embeds the causal relationships between the pre- and the post-interventional world. After defining this diagram, we prove that certain conditional measures in this new model provide an estimate for $P_{Y^*_a \mid W}(y \mid w)$. We then combine this result with the properties of Z to prove the desired result.

A.2 A GRAPHICAL CRITERION FOR CONDITIONAL INDEPENDENCE

In this section, we discuss a well-known criterion for conditional independence, which we will then use to prove Theorem 3.2. To this end, we use the notion of a blocked path, which we restate for clarity.

Definition 3.1. Consider a path π of a causal graph G. A set of nodes Z blocks π if π contains a triple of consecutive nodes connected in one of the following ways: $N_i \to Z \to N_j$, $N_i \leftarrow Z \to N_j$, or $N_i \to M \leftarrow N_j$, with $N_i, N_j \notin \mathbf{Z}$, $Z \in \mathbf{Z}$,

Definition A.1 is a graphical criterion for conditional independence. In fact, the following well-known result holds (Pearl, 2000).

A.3 IDENTIFIABILITY OF CONDITIONAL COUNTERFACTUAL DISTRIBUTIONS

A natural way to study the relationships between the pre- and the post-interventional world is to use the counterfactual graph (Shpitser & Pearl, 2008). However, the construction of the counterfactual graph is rather intricate, and for our purposes it is sufficient to consider a simpler construction. Consider an SCM with causal graph G, and fix a set of observable random variables W. We define the corresponding graph G′ in the following three steps: 1. define G′ to be the same graph as G; 2. add a new node W̄ to G′ for each node W of the set W; 3. for each node W of the set W and for each parent W′ of W, if W′ ∈ W then add an edge W̄′ → W̄ to G′; if W′ ∉ W, add an edge W′ → W̄ to G′. An illustration of this graph is presented in Figure 6. Note that any node W̄ defined as above does not have any descendants in G′. In the remainder of this section, we denote with W̄ the set of all nodes W̄ in G′ defined as above. We remark that this construction generalizes the work by Shpitser & Pearl (2009).

To prove Theorem A.3, we introduce additional concepts: the notion of a C-forest and the notion of a hedge (Shpitser & Pearl, 2006).

Definition A.4 (C-Forest). Let G be a causal diagram, and consider a complete sub-graph H of G. Denote with R the maximal root set of H. We say that H is an R-rooted C-forest if a subset of its bi-directed arcs forms a spanning tree over all vertices in H, and all the observable nodes of H have at most one child.

In our analysis we also use the following definition.

Definition A.5 (Hedge). Let G be a causal diagram, and fix a set of nodes A. Consider two R-rooted C-forests H, K of G such that (i) H is a sub-graph of K; (ii) H and K do not contain any variable in A; (iii) the nodes of R are ancestors of Y in the graph $G_{\mathbf{a}}$. Then, we say that H and K form a hedge for $P_{Y^*_a}(y)$ in G.

There is a connection between these concepts and the identifiability of counterfactual distributions, as shown in the following theorem.
Theorem A.6 (Theorem 4 by Shpitser & Pearl (2006)). Let G be a causal diagram, and fix a set of nodes A. Suppose that there exist two sub-graphs of G that form a hedge for $P_{Y^*_a}(y)$. Then, $P_{Y^*_a}(y)$ is not identifiable in G.

Using these concepts, we can prove Theorem A.3.

Case 2: W̄ is not d-separated from Y in $G'_{\mathbf{a}}$. Assume without loss of generality that no variable of A ∪ W is a descendant of Y in G (otherwise it has no effect on Y). Under this assumption, there exists a set of random variables U ⊆ W such that there exists a hedge for $P_{(Y \cup U)^*_a}(y, u) = P^*_{Y,U}(y, u)$ in G′. It follows that $P^*_{Y,U}(y, u)$ is not identifiable. Since U ⊆ W, the distribution $P^*_{Y,W}(y, w)$ is also not identifiable. We conclude by showing that the estimand expression is correct. To this end, note that since G′ is just a causal diagram, the estimand of the post-interventional distribution $P^*_{Y \mid \bar{W}}(y \mid w)$ is correct for $P_{Y^*_a \mid A, X}(y \mid a', x)$. Since the set W̄ only contains copies of the variables A ∪ X, the claim follows.

A.4 VALID ADJUSTMENTS FOR CONDITIONAL INTERVENTIONAL DISTRIBUTIONS

Here we discuss a criterion for the identification of conditional distributions, which we will then use to prove Theorem 3.2. We follow Pearl (2000) to this end, and use the d-separation criterion to define valid adjustment sets for conditional counterfactual distributions.

Proof. Denote with π any path from I to Y in $G_I$. Since I is a parent of every node in A, and since I has no parents, any path π can be decomposed into two paths π₁, π₂, where π₁ is a single edge from I to A, and π₂ is a path from A to Y. If π is a causal path, then A acts as a separator for this path. If π is not a causal path from I to Y, then π₂ is a non-causal path from A to Y, which is d-separated by Z. Hence, π is d-separated. We conclude that $Y \perp\!\!\!\perp_{G_I} I \mid A, Z$. The second part of the claim follows by Lemma A.7.

A.5 PROOF OF THEOREM 3.2

Theorem A.3 tells us that we can identify conditional counterfactual distributions in G by identifying distributions on G′. We can combine this observation with the notion of a valid adjustment set to derive a closed formula for the identification of the distributions of interest.

Proof of Theorem 3.2. Let G′ be the augmented graph obtained by adding the nodes W̄ to G, as described in Section A.3. Denote with P* the induced measure on $G'_{\mathbf{a}}$. Suppose that it holds

$P_{Y^*_a \mid W}(y \mid w) = \int P_{Y \mid A, Z}(y \mid a, z)\, dP^*_{Z \mid \bar{W}}(z \mid w)$ (6)

for any intervention A ← a and any value w attained by W. Then the claim follows. In fact, assuming that eq. (6) holds, we have

$P_{Y^*_a \mid W}(y \mid w) = \int P_{Y \mid A, Z}(y \mid a, z)\, dP^*_{Z \mid \bar{W}}(z \mid w)$ (assuming eq. (6))
$= \int P_{Y \mid A, Z}(y \mid a', z)\, dP^*_{Z \mid \bar{W}}(z \mid w)$ (since Y ⊥⊥ A | Z)
$= P_{Y^*_{a'} \mid W}(y \mid w)$. (assuming eq. (6))

Hence, the proof of Theorem 3.2 boils down to proving eq. (6). To this end, since Z blocks all non-causal paths from W to Y, by construction Z is a d-separator between W̄ and Y in the post-interventional graph $G'_{\mathbf{a}}$.
Hence, it holds

$P^*_{Y \mid \bar{W}}(y \mid w) = \int P^*_{Y \mid \bar{W}, Z}(y \mid w, z)\, dP^*_{Z \mid \bar{W}}(z \mid w)$ (by conditioning)
$= \int P^*_{Y \mid Z}(y \mid z)\, dP^*_{Z \mid \bar{W}}(z \mid w)$ (since $Y \perp\!\!\!\perp_{G'_{\mathbf{a}}} \bar{W} \mid Z$)
$= \int P_{Y \mid A, Z}(y \mid a, z)\, dP^*_{Z \mid \bar{W}}(z \mid w)$. (by Lemma A.7)

The claim follows by applying Theorem A.3 to the equation above, since it holds $P^*_{Y \mid \bar{W}}(y \mid w) = P_{Y^*_a \mid W}(y \mid w)$.

B MISSING PROOF OF THEOREM 3.4

We prove that the HSCIC can be used to promote conditional independence, using a similar technique as Park & Muandet (2020). The following theorem holds.

Theorem 3.4 (following Theorem 5.4 by Park & Muandet (2020)). If the kernel k of $\mathcal{H}_Y \otimes \mathcal{H}_A$ is characteristic⁶, HSCIC(Y, A | Z) = 0 almost surely if and only if Y ⊥⊥ A | Z.

Proof. By definition, we can write HSCIC(Y, A | Z) = $H_{Y,A|Z} \circ Z$, where $H_{Y,A|Z}$ is a real-valued deterministic function. Hence, the HSCIC is a real-valued random variable, defined over the same domain $\Omega_Z$ as the random variable Z. We first prove that if HSCIC(Y, A | Z) = 0 almost surely, then it holds Y ⊥⊥ A | Z. To this end, consider an event $\Omega' \subseteq \Omega_Z$ that occurs almost surely, and such that $(H_{Y,A|Z} \circ Z)(\omega) = 0$ for all ω ∈ Ω′. Fix a sample ω ∈ Ω′, and consider the corresponding value $z_\omega = Z(\omega)$ in the support of Z. It holds

$\int k(y \otimes a, \cdot)\, dP_{Y,A|Z=z_\omega} = \mu_{Y,A|Z=z_\omega}$ (by definition)
$= \mu_{Y|Z=z_\omega} \otimes \mu_{A|Z=z_\omega}$ (since ω ∈ Ω′)
$= \int k_Y(y, \cdot)\, dP_{Y|Z=z_\omega} \otimes \int k_A(a, \cdot)\, dP_{A|Z=z_\omega}$ (by definition)
$= \int k_Y(y, \cdot) \otimes k_A(a, \cdot)\, d(P_{Y|Z=z_\omega} P_{A|Z=z_\omega})$, (by Fubini's theorem)

with $k_Y$ and $k_A$ the kernels of $\mathcal{H}_Y$ and $\mathcal{H}_A$ respectively. Since the kernel k of the tensor product space $\mathcal{H}_Y \otimes \mathcal{H}_A$ is characteristic, the kernels $k_Y$ and $k_A$ are also characteristic. Hence, it holds $P_{Y,A|Z=z_\omega} = P_{Y|Z=z_\omega} P_{A|Z=z_\omega}$ for all ω ∈ Ω′. Since the event Ω′ occurs almost surely, $P_{Y,A|Z=z_\omega} = P_{Y|Z=z_\omega} P_{A|Z=z_\omega}$ almost surely, that is, Y ⊥⊥ A | Z.

Assume now that Y ⊥⊥ A | Z. By definition there exists an event $\Omega'' \subseteq \Omega_Z$ that occurs almost surely, such that $P_{Y,A|Z=z_\omega} = P_{Y|Z=z_\omega} P_{A|Z=z_\omega}$ for all samples ω ∈ Ω″, with $z_\omega = Z(\omega)$.
It holds

$\mu_{Y,A|Z=z_\omega} = \int k(y \otimes a, \cdot)\, dP_{Y,A|Z=z_\omega}$ (by definition)
$= \int k(y \otimes a, \cdot)\, d(P_{Y|Z=z_\omega} P_{A|Z=z_\omega})$ (since ω ∈ Ω″)
$= \int k_Y(y, \cdot)\, k_A(a, \cdot)\, d(P_{Y|Z=z_\omega} P_{A|Z=z_\omega})$ (by definition of k)
$= \int k_Y(y, \cdot)\, dP_{Y|Z=z_\omega} \otimes \int k_A(a, \cdot)\, dP_{A|Z=z_\omega}$ (by Fubini's theorem)
$= \mu_{Y|Z=z_\omega} \otimes \mu_{A|Z=z_\omega}$. (by definition)

The claim follows.
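To make the regularizer concrete, the following is a minimal numpy sketch of one way to estimate an HSCIC-type quantity from samples, using kernel ridge regression weights for the conditional mean embeddings. The function names, the RBF kernel choice, and the ridge parameter `lam` are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian RBF kernel for a 1-D sample array."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def hscic_values(y, a, z, lam=0.1):
    """Pointwise estimates of ||mu_{Y,A|z_i} - mu_{Y|z_i} (x) mu_{A|z_i}||^2.

    Conditional mean embeddings are estimated by kernel ridge regression:
    column i of W holds the sample weights approximating P(. | Z = z_i).
    """
    n = len(z)
    Ky, Ka, Kz = rbf_gram(y), rbf_gram(a), rbf_gram(z)
    W = np.linalg.solve(Kz + n * lam * np.eye(n), Kz)
    vals = np.empty(n)
    for i in range(n):
        w = W[:, i]
        uy, ua = Ky @ w, Ka @ w
        # squared RKHS norm, expanded in terms of Gram matrices
        vals[i] = w @ ((Ky * Ka) @ w) - 2 * w @ (uy * ua) + (w @ uy) * (w @ ua)
    return vals
```

Averaging `vals` over the sample then yields a single scalar that could be added to a training loss as a γ-weighted regularization term.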

C CONDITIONAL KERNEL MEAN EMBEDDINGS AND THE HSCIC

The notion of conditional kernel mean embeddings has already been studied in the literature. We show that, under stronger assumptions, our definition is equivalent to the definition by Park & Muandet (2020).

C.1 CONDITIONAL KERNEL MEAN EMBEDDINGS AND CONDITIONAL INDEPENDENCE

We show that, under stronger assumptions, the HSCIC can be defined using the Bochner conditional expectation, which is defined as follows.

Definition C.1. Fix two random variables Y, Z taking values in a Banach space H, and denote with (Ω, F, P) their joint probability space. The Bochner conditional expectation of Y given Z is any H-valued random variable X such that $\int_E Y\, dP = \int_E X\, dP$ for all E ∈ σ(Z) ⊆ F, with σ(Z) the σ-algebra generated by Z. We denote with E[Y | Z] the Bochner conditional expectation; any random variable X as above is a version of E[Y | Z].

The existence and almost sure uniqueness of the conditional expectation is shown in Dinculeanu (2000). Given an RKHS H with kernel k over the support of Y, Park & Muandet (2020) define the corresponding conditional kernel mean embedding as $\mu_{Y|Z} := \mathbb{E}[k(\cdot, y) \mid Z]$. Note that, according to this definition, $\mu_{Y|Z}$ is an H-valued random variable, not a single point of H. Park & Muandet (2020) use this notion to define the HSCIC as follows.

Definition C.2 (The HSCIC according to Park & Muandet (2020)). Consider (sets of) random variables Y, A, Z, and two RKHSs $\mathcal{H}_Y$, $\mathcal{H}_A$ over the supports of Y and A respectively. The HSCIC between Y and A given Z is defined as the real-valued random variable $\omega \mapsto \lVert \mu_{Y,A|Z}(\omega) - \mu_{Y|Z}(\omega) \otimes \mu_{A|Z}(\omega) \rVert$, for all samples ω in the domain $\Omega_Z$ of Z. Here, ∥·∥ is the norm induced by the inner product of the tensor product space $\mathcal{H}_Y \otimes \mathcal{H}_A$.

We show that, under more restrictive assumptions, Definition C.2 can be used to promote conditional independence. To this end, we use the notion of a regular version.

Definition C.3 (Regular Version, following Definition 2.4 by Çınlar (2011)). Consider two random variables Y, Z, and the induced measurable spaces $(\Omega_Y, \mathcal{F}_Y)$ and $(\Omega_Z, \mathcal{F}_Z)$.
A regular version Q for $P_{Y|Z}$ is a mapping $Q : \Omega_Z \times \mathcal{F}_Y \to [0, +\infty] : (\omega, y) \mapsto Q_\omega(y)$ such that: (i) the map $\omega \mapsto Q_\omega(y)$ is $\mathcal{F}_Z$-measurable for all y; (ii) the map $y \mapsto Q_\omega(y)$ is a measure on $(\Omega_Y, \mathcal{F}_Y)$ for all ω; (iii) the function $Q_\omega(y)$ is a version of $\mathbb{E}[\mathbf{1}_{\{Y = y\}} \mid Z]$.

The following theorem shows that the random variable of Definition C.2 can be used to promote conditional independence.

Theorem C.4 (Theorem 5.4 by Park & Muandet (2020)). With the notation introduced above, suppose that the kernel k of the tensor product space $\mathcal{H}_Y \otimes \mathcal{H}_A$ is characteristic. Furthermore, suppose that $P_{Y,A|Z}$ admits a regular version. Then, $\lVert \mu_{Y,A|Z}(\omega) - \mu_{Y|Z}(\omega) \otimes \mu_{A|Z}(\omega) \rVert = 0$ almost surely if and only if Y ⊥⊥ A | Z.

Note that the assumption of the existence of a regular version is essential in Theorem C.4. In this work, the HSCIC is not used for conditional independence testing but as a conditional independence measure.

C.2 EQUIVALENCE WITH OUR APPROACH

The following theorem shows that, under the existence of a regular version, conditional kernel mean embeddings can be defined using the Bochner conditional expectation.

Theorem C.5 (Following Proposition 2.5 by Çınlar (2011)). Following the notation introduced in Definition C.3, suppose that $P_{Y|Z}$ admits a regular version $Q_\omega(y)$. Consider a kernel k over the support of Y. Then, the mapping $\omega \mapsto \int k(\cdot, y)\, dQ_\omega(y)$ is a version of $\mathbb{E}[k(\cdot, y) \mid Z]$.

As a consequence of Theorem C.5, we prove the following result.

Lemma C.6. Fix two random variables Y, Z. Suppose that $P_{Y|Z}$ admits a regular version. Denote with $\Omega_Z$ the domain of Z. Then, there exists a subset $\Omega \subseteq \Omega_Z$ that occurs almost surely, such that $\mu_{Y|Z}(\omega) = \mu_{Y|Z=Z(\omega)}$ for all ω ∈ Ω. Here, $\mu_{Y|Z=Z(\omega)}$ is the embedding of conditional measures as in Section 2.

Proof. Let $Q_\omega(y)$ be a regular version of $P_{Y|Z}$. Without loss of generality we may assume that $P_{Y|Z}(y \mid \{Z = Z(\omega)\}) = Q_\omega(y)$. By Theorem C.5 there exists an event $\Omega \subseteq \Omega_Z$ that occurs almost surely such that

$\mu_{Y|Z}(\omega) = \mathbb{E}[k(y, \cdot) \mid Z](\omega) = \int k(y, \cdot)\, dQ_\omega(y)$, (7)

for all ω ∈ Ω. Then, for all ω ∈ Ω it holds

$\mu_{Y|Z}(\omega) = \int k(y, \cdot)\, dQ_\omega(y)$ (by eq. (7))
$= \int k(y, \cdot)\, dP_{Y|Z}(y \mid \{Z = Z(\omega)\})$ (since $Q_\omega(y) = P_{Y|Z}(y \mid \{Z = Z(\omega)\})$)
$= \mu_{Y|Z=Z(\omega)}$, (by definition as in Section 2)

as claimed.

As a consequence of Lemma C.6, we can prove that the definition of the HSCIC by Park & Muandet (2020) coincides with ours (Corollary C.7): $\lVert \mu_{Y,A|Z}(\omega) - \mu_{Y|Z}(\omega) \otimes \mu_{A|Z}(\omega) \rVert = (H_{Y,A|Z} \circ Z)(\omega)$. Here, $H_{Y,A|Z}$ is a real-valued deterministic function, defined as $H_{Y,A|Z}(z) := \lVert \mu_{Y,A|Z=z} - \mu_{Y|Z=z} \otimes \mu_{A|Z=z} \rVert$, and ∥·∥ is the norm induced by the inner product of the tensor product space $\mathcal{H}_Y \otimes \mathcal{H}_A$. We remark that the assumption of the existence of a regular version is essential in Corollary C.7.

D CONDITIONAL INDEPENDENCE AND THE CROSS-COVARIANCE OPERATOR

In this section, we show that, under more restrictive assumptions, our definition of conditional KMEs is equivalent to the definition based on the cross-covariance operator. The latter definition requires the following well-known result.

Lemma D.1. Fix two RKHSs $\mathcal{H}_X$ and $\mathcal{H}_Z$, and let $\{\varphi_i\}_{i=1}^\infty$ and $\{\psi_j\}_{j=1}^\infty$ be orthonormal bases of $\mathcal{H}_X$ and $\mathcal{H}_Z$ respectively. Denote with $HS(\mathcal{H}_X, \mathcal{H}_Z)$ the set of Hilbert-Schmidt operators between $\mathcal{H}_X$ and $\mathcal{H}_Z$. There is an isometric isomorphism between the tensor product space $\mathcal{H}_X \otimes \mathcal{H}_Z$ and $HS(\mathcal{H}_X, \mathcal{H}_Z)$, given by the map $T : \sum_{i=1}^\infty \sum_{j=1}^\infty c_{i,j}\, \varphi_i \otimes \psi_j \mapsto \sum_{i=1}^\infty \sum_{j=1}^\infty c_{i,j}\, \langle \cdot, \varphi_i \rangle_{\mathcal{H}_X}\, \psi_j$.

For a proof of this result see, e.g., Park & Muandet (2020). This lemma allows us to define the cross-covariance operator between two random variables, using the operator T.

Definition D.2 (Cross-Covariance Operator). Consider two random variables X, Z, with corresponding mean embeddings $\mu_{X,Z}$, $\mu_X$ and $\mu_Z$, as defined in Section 3. The cross-covariance operator is defined as $\Sigma_{X,Z} := T(\mu_{X,Z} - \mu_X \otimes \mu_Z)$, where T is the isometric isomorphism of Lemma D.1.

It is well-known that the cross-covariance operator can be decomposed into the covariances of the marginals and the correlation. That is, there exists a unique bounded operator $\Lambda_{Y,Z}$ such that $\Sigma_{Y,Z} = \Sigma_{Y,Y}^{1/2} \circ \Lambda_{Y,Z} \circ \Sigma_{Z,Z}^{1/2}$. Using this notation, we define the normalized conditional cross-covariance operator. Given three random variables Y, A, Z and corresponding kernel mean embeddings, this operator is defined as

$\Lambda_{Y,A|Z} := \Lambda_{Y,A} - \Lambda_{Y,Z} \circ \Lambda_{Z,A}$. (8)

This operator was introduced by Fukumizu et al. (2007). The normalized conditional cross-covariance operator can be used to promote statistical independence, as shown in the following theorem.

Theorem D.3 (Theorem 3 by Fukumizu et al. (2007)). Following the notation introduced above, define the random variable Ä := (A, Z).
Let $P_Z$ be the distribution of the random variable Z, and denote with $L^2(P_Z)$ the space of square-integrable functions with respect to $P_Z$. Suppose that the tensor product kernel $k_Y \otimes k_A \otimes k_Z$ is characteristic. Furthermore, suppose that $\mathcal{H}_Z + \mathbb{R}$ is dense in $L^2(P_Z)$. Then, it holds $\Lambda_{Y,\ddot{A}|Z} = 0$ if and only if Y ⊥⊥ A | Z. Here, $\Lambda_{Y,\ddot{A}|Z}$ is an operator defined as in eq. (8).

By Theorem D.3, the operator $\Lambda_{Y,\ddot{A}|Z}$ can also be used to promote conditional independence. However, our method is more straightforward, since it requires fewer assumptions. In fact, Theorem D.3 requires embedding the variable Z in an RKHS, whereas our method only requires embedding the variables Y and A.

E EXPERIMENT SETTINGS

Additional information about the experiments is provided below; the interested reader may also refer to the source code in the supplementary material. All experiments were performed on an Apple M1 Pro; no external GPUs were used.

E.1 DATASETS FOR MODEL PERFORMANCE WITH THE USE OF THE HSCIC

The first set of synthetic experiments involves three different dataset simulations. The data-generating mechanism corresponding to the results in Figure 3 is the following:

Z ∼ N(0, 1)
A = Z² + ε_A
X = exp(−(1/5)A) · A + sin(2Z) + (1/5)ε_X
Y = (1/2) exp(−XZ) · sin(2XZ) + 5A + (1/5)ε_Y,

where ε_A ∼ N(0, 1) and ε_Y, ε_X i.i.d. ∼ N(0, 0.1). In the first experiment, Figure 3 shows the results of feed-forward neural networks consisting of 8 hidden layers with 20 nodes each, connected with rectified linear (ReLU) activation functions and a linear final layer. A mini-batch size of 256 and the Adam optimizer with a learning rate of 10⁻³ for 100 epochs were used. We set the range of the trade-off parameter to γ ∈ [0, 1] for all experiments except the comparison against the baselines CF1 and CF2.
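As a sketch, the data-generating process above can be sampled with a few lines of numpy. This is our own illustration of the (reconstructed) equations, reading N(0, 0.1) as a variance; the helper name `simulate` is hypothetical and this is not the authors' released code.

```python
import numpy as np

def simulate(n, seed=0):
    """Sample n points from the synthetic DGP of Appendix E.1 (our reading)."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(0.0, 1.0, n)
    A = Z**2 + rng.normal(0.0, 1.0, n)            # eps_A ~ N(0, 1)
    eps_X = rng.normal(0.0, np.sqrt(0.1), n)      # N(0, 0.1) read as variance
    eps_Y = rng.normal(0.0, np.sqrt(0.1), n)
    X = np.exp(-A / 5) * A + np.sin(2 * Z) + eps_X / 5
    Y = 0.5 * np.exp(-X * Z) * np.sin(2 * X * Z) + 5 * A + eps_Y / 5
    return Z, A, X, Y
```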

E.2 DATASETS FOR COMPARISON WITH BASELINES

The simulation procedures for Scenario 1 and Scenario 2 in Table 1 are the following.

Scenario 1:
Z ∼ N(0, 1)
A = Z² + ε_A
X = (1/2)A · ε_X + 2Z
Y = (1/2) exp(−XZ) · sin(2XZ) + 5A + (1/5)ε_Y,

where ε_A, ε_X i.i.d. ∼ N(0, 1) and ε_Y ∼ N(0, 0.1).

Scenario 2:
Z ∼ N(0, 1)
A = Z² + ε_A
X = (1/5)A · ε_X + 2 exp(−(1/2)Z²)
Y = exp(−Z²) + AX + (1/5)ε_Y,

where ε_A, ε_X i.i.d. ∼ N(0, 1) and ε_Y ∼ N(0, 0.1). For the results in Table 1, the same hyperparameters as in the previous setting were used; this also holds for the baseline methods CF1 and CF2. Both Scenario 1 and Scenario 2 were considered in this table. The results shown in Figure 3 and Table 1 are the average and standard deviation over 10 and 4 random-seed runs, respectively.

The simulation procedure for the results shown in Section 4.2 is the following:

shape ∼ P(shape)
y-pos ∼ P(y-pos)
color ∼ P(color)
orientation ∼ P(orientation)
x-pos = round(x), where x ∼ N(shape + y-pos, 1)
scale = round((x-pos/24 + y-pos/24) · shape + ε_S)
Y = e^shape · x-pos + scale² · sin(y-pos) + ε_Y,

where ε_S ∼ N(0, 1) and ε_Y ∼ N(0, 0.01). The data has been generated via a matching procedure on the original dSprites dataset. Table 4 presents the hyperparameters of the layers of the convolutional neural network; each of the convolutional groups also has a ReLU activation function and a dropout layer. Two MLP architectures have been used. The former takes as input the observed tabular features and is composed of two hidden layers of 16 and 8 nodes respectively, connected with ReLU activation functions and dropout layers. The latter takes as input the concatenated outcomes of the CNN and the other MLP, and consists of three hidden layers of 8, 8, and 16 nodes, respectively. Figure 4 presents the averaged results of four random-seed runs with newly sampled data.
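The dSprites equations above were reconstructed from the extracted text, so the following numpy sketch of the label-generation step should be read with caution. It covers only the synthetic labels, not the matching against dSprites images, and the sampling ranges for shape and y-pos are our own assumptions (dSprites has 3 shapes and a 32-position grid); the actual priors P(shape), P(y-pos), etc. are not specified here.

```python
import numpy as np

def simulate_dsprites_labels(n, seed=0):
    """Sketch of the (reconstructed) dSprites label-generating mechanism."""
    rng = np.random.default_rng(seed)
    shape = rng.integers(1, 4, n)        # assumption: shapes coded 1..3
    y_pos = rng.integers(0, 32, n)       # assumption: 32 grid positions
    x_pos = np.rint(rng.normal(shape + y_pos, 1.0))
    eps_s = rng.normal(0.0, 1.0, n)
    scale = np.rint((x_pos / 24 + y_pos / 24) * shape + eps_s)
    eps_y = rng.normal(0.0, np.sqrt(0.01), n)
    Y = np.exp(shape) * x_pos + scale**2 * np.sin(y_pos) + eps_y
    return shape, y_pos, x_pos, scale, Y
```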

E.5 FAIRNESS WITH CONTINUOUS PROTECTED ATTRIBUTES

The pre-processing of the UCI Adult dataset was based on the work of Chiappa & Pacchiano (2021). Referring to the causal graph in Figure 7, a variational autoencoder (Kingma & Welling, 2014) was trained for each of the unobserved variables H_m, H_l and H_r. The prior distribution of these latent variables is assumed to be standard Gaussian. The posterior distributions P(H_m|V), P(H_r|V), P(H_l|V) are modelled as 10-dimensional Gaussian distributions, whose means and variances are the outputs of the encoder. The encoder architecture consists of a hidden layer of 20 nodes with hyperbolic tangent activation functions, followed by a linear layer. The decoders have two linear layers with hyperbolic tangent activation functions. The training loss of the variational autoencoder consists of a reconstruction term (mean-squared error for continuous variables and cross-entropy loss for binary ones) and the Kullback-Leibler divergence between the posterior and the prior distribution of the latent variables. For training, we used the Adam optimizer with a learning rate of 10⁻² and mini-batch size 128, for 30 epochs. The predictor Ŷ is the output of a feed-forward neural network consisting of a hidden layer with hyperbolic tangent activation functions and a linear final layer, trained with the Adam optimizer with learning rate 10⁻³ and mini-batch size 128 for 100 epochs. The choice of the network architecture is based on the work of Chiappa & Pacchiano (2021). The estimation of counterfactual outcomes is based on a Monte-Carlo approach: given a data point, 500 values of the unobserved variables are sampled from the estimated posterior distribution; given an interventional value for A, a counterfactual outcome is estimated for each of the sampled unobserved values; the final counterfactual outcome is the average of these counterfactual predictions. In this experiment setting, we have k = 100 and d = 1000.
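The Monte-Carlo estimation step described above can be sketched in a few lines. Here `sample_posterior` and `predict` are hypothetical stand-ins for, respectively, drawing latents from the trained VAE posterior and evaluating the fitted mechanism for Y under the intervention A ← a_int; they are not functions from the paper's code.

```python
import numpy as np

def counterfactual_outcome(x_obs, a_int, sample_posterior, predict, n_mc=500, seed=0):
    """Monte-Carlo estimate of a counterfactual outcome under A <- a_int."""
    rng = np.random.default_rng(seed)
    H = sample_posterior(x_obs, n_mc, rng)            # (n_mc, latent_dim) draws
    preds = np.array([predict(a_int, h) for h in H])  # one prediction per draw
    return preds.mean()                               # average over posterior draws
```

With toy stand-ins (a posterior that returns zero latents and a linear mechanism), the estimate reduces to the interventional value itself, which makes the averaging step easy to sanity-check.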
In the causal graph presented in Figure 7, A includes the variables age and gender, C includes nationality and race, M marital status, L level of education, R the set of working class, occupation, and hours per week, and Y the income class. Compared to Chiappa & Pacchiano (2021), we include the race variable in the dataset as part of the baseline features C. The loss function is the same as in Equation 2, but the binary cross-entropy loss is used instead of the mean-squared error loss:

L_TOTAL(Ŷ) = L_BCE(Ŷ) + γ · HSCIC(Ŷ, {Age, Gender} | Z),

where the set Z blocks all the non-causal paths from W ∪ A to Y.

The data-generating mechanism of the causal structure (see Figure 1(b)) is the following:

Z ∼ N(0, 1)
A = Z² + ε_A
X = exp(−(1/2)A) · sin(A) + (1/10)ε_X
Y = (1/2) exp(−XZ) · sin(2XZ) + 5A + (1/10)ε_Y,

where ε_A, ε_X, ε_Y ∼ N(0, 0.1). The data-generating mechanism of the anti-causal structure (see Figure 1(c)) is the following:

Z ∼ N(0, 1)
Y = exp((1/5)Z) + (3/10)ε_Y
A = Z² + (3/10)ε_A
X = exp(−(1/2)A²) + (1/5)Y + (1/10)ε_X,

where ε_Y, ε_A i.i.d. ∼ N(0, 0.1) and ε_X ∼ N(0, 1). We compare our method, with different choices of the trade-off parameter γ, against the method by Veitch et al. (2021). In the causal settings presented in Figure 1(b-c), an unobserved confounder Z between A and Y is included. In the graphical structure of Figure 1(b), our method uses HSIC(Ŷ, A) as the regularization term during model training, since the independence Ŷ ⊥⊥ A is enforced. Here, HSIC is the Hilbert-Schmidt Independence Criterion, which is commonly used to promote independence (see, e.g., Gretton et al. (2005); Fukumizu et al. (2007)).

Table 5: Results of the MSE, HSCIC, VCF of our method and the baseline Veitch et al. (2021) applied to the causal and anti-causal structures in Figure 1(b-c). Although the graphical assumptions are not satisfied, our method shows an overall decrease of HSCIC and VCF in both graphical structures, outperforming Veitch et al. (2021) in terms of accuracy and counterfactual invariance.
In the selected graphical structure, this is the same independence criterion enforced by Veitch et al. (2021), leading the two methods to coincide. In the anti-causal graphical setting presented in Figure 1(c), proposed by Veitch et al. (2021), the regularization term used in our method is still HSIC(Ŷ, A), while that of Veitch et al. (2021) is HSCIC(Ŷ, A | Y). In Table 5, the results of accuracy, HSCIC(Ŷ, A | Z) and VCF are presented. In these experiments, the predictor Ŷ is a feed-forward neural network consisting of 8 hidden layers with 20 nodes each, connected with rectified linear (ReLU) activation functions and a linear final layer. A mini-batch size of 256 and the Adam optimizer with a learning rate of 10⁻⁴ for 500 epochs were used.
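For reference, the (unconditional) HSIC used as the regularizer above admits a standard biased empirical estimator, (1/n²)·tr(K_Y H K_A H) with H the centering matrix (Gretton et al., 2005). The numpy sketch below is our own illustration with a Gaussian RBF kernel, not the paper's implementation.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian RBF kernel for a 1-D sample array."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def hsic(y, a, sigma=1.0):
    """Biased empirical HSIC: (1/n^2) tr(K_y H K_a H), H = I - (1/n) 11^T."""
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n
    Ky, Ka = rbf_gram(y, sigma), rbf_gram(a, sigma)
    return np.trace(Ky @ H @ Ka @ H) / n**2
```

As a sanity check, a constant second argument carries no statistical dependence and yields an HSIC of (numerically) zero, while perfectly dependent inputs yield a strictly positive value.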

E.7 COMPARISON HEURISTIC METHODS EXPERIMENTS

We provide an experimental comparison of the proposed method with some heuristic, data-augmentation based methods. We consider the same data-generating procedure and causal structure as presented above. The heuristic methods considered are data augmentation and causal data augmentation. In the former, data augmentation is performed by generating N = 50 samples for every data point by sampling new values of A as a₁, ..., a_N i.i.d. ∼ P_A and leaving Z, X, Y unchanged. In the latter, causal data augmentation also takes into account the causal structure given by the known DAG: when manipulating the variable A, its descendants (in this example X) must also change. In this experiment, a predictor X̂ = f_θ(A, Z) for X is trained on 80% of the original dataset. In the augmentation mechanism, for every data point {a, x, z, y}, N = 50 samples are generated by sampling new values a₁, ..., a_N i.i.d. ∼ P_A and estimating the values of X as x̂₁ = f_θ(a₁, z), ..., x̂_N = f_θ(a_N, z), while leaving the values of Z and Y unchanged. Heuristic data-augmentation methods come with no theoretical guarantee of producing counterfactually invariant predictors. The results of an empirical comparison are shown in Table 6: these theoretical insights are supported by the experimental results, as the VCF metric, which measures counterfactual invariance, is substantially lower in both settings of the proposed method (γ = 1/2 and γ = 1). In these experiments, a dataset of n = 1000 is used, along with k = 500 and d = 500. The architectures used for predicting X and Y are feed-forward neural networks consisting of 8 hidden layers with 20 nodes each, connected with rectified linear (ReLU) activation functions and a linear final layer. A mini-batch size of 256 and the Adam optimizer with a learning rate of 10⁻³ for 100 epochs were used.
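The causal data-augmentation heuristic above can be sketched as follows. Resampling A from the empirical marginal is our stand-in for drawing a₁, ..., a_N ∼ P_A, and `f_theta` is the fitted mechanism X̂ = f_θ(A, Z); the function itself is an illustration, not the paper's code.

```python
import numpy as np

def causal_augment(A, X, Z, Y, f_theta, n_aug=50, seed=0):
    """Causal data augmentation: resample A, re-estimate X, keep Z and Y fixed."""
    rng = np.random.default_rng(seed)
    A_out, X_out, Z_out, Y_out = [], [], [], []
    for z, y in zip(Z, Y):
        a_new = rng.choice(A, size=n_aug)          # a_1..a_N ~ empirical P_A
        x_new = f_theta(a_new, np.full(n_aug, z))  # descendant X responds to new A
        A_out.append(a_new)
        X_out.append(x_new)
        Z_out.append(np.full(n_aug, z))
        Y_out.append(np.full(n_aug, y))
    return (np.concatenate(A_out), np.concatenate(X_out),
            np.concatenate(Z_out), np.concatenate(Y_out))
```

The plain (non-causal) variant is obtained by skipping the `f_theta` step and keeping the observed x, which is exactly why it ignores the downstream effect of manipulating A.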



We use P for distributions, as is common in the kernel literature (Muandet et al., 2021), and the potential outcome notation $Y^*_a$ instead of Y | do(a) for conciseness when mixing conditioning with interventions. ² With a mild abuse of notation, if W = ∅ then the requirement of conditional counterfactual invariance becomes $P_{\hat{Y}^*_a}(y) = P_{\hat{Y}^*_{a'}}(y)$ almost surely, for all a, a′ in the domain of A. ⁴ A non-causal path from A ∪ W to Y is a path connecting A ∪ W and Y in which at least one edge points against the causal ordering (Shpitser et al., 2012). ⁶ The tensor product kernel k is characteristic if the mapping $P_{Y,A} \mapsto \mathbb{E}_{y,a}[k(\cdot, y \otimes a)]$ is injective.



Figure 2: (a) Assumed causal structure for the synthetic experiments (see Section 4.1). The precise corresponding generative random variables are described in Appendix E. (b) Assumed causal structure for the Adult dataset, where A consists of the protected attributes gender and age. (c) Causal structure for the constructed dSprites ground truth, where A = {Pos.X}, U = {Scale}, C = {Shape, Pos.Y}, X = {Color, Orientation}, and Y = {Outcome}. U is unobserved.

Grouping the remaining observed covariates into W := V \ Y, the definition of counterfactual fairness clearly becomes a special case of our counterfactual invariance.

Definition 3.6 (Counterfactual Fairness, Definition 5 by Kusner et al. (2017)). A predictor Ŷ is counterfactually fair if under any context W = w and A = a, it holds $P_{\hat{Y}^*_a \mid W, A}(y \mid w, a) = P_{\hat{Y}^*_{a'} \mid W, A}(y \mid w, a)$, for all y and for any value a′ attainable by A.

Text classification. Veitch et al. (2021) motivate the importance of counterfactual invariance in text classification tasks. We now provide a detailed comparison with our method on the (limited) causal models studied by Veitch et al. (2021), emphasizing that our definition and method apply to a much broader class of causal models. Specifically, we consider the causal and anti-causal structures (Figure 1 in Veitch et al. (2021)). Both diagrams consist of a protected attribute Z, an observed covariate X, and the outcome Y. Veitch et al. (2021) provide necessary conditions for counterfactual invariance: they prove that if $\hat{Y}^*_z = \hat{Y}^*_{z'}$ almost surely, then assuming their proposed causal and anti-causal structures (see Figure 1(b-c)) it holds Ŷ ⊥⊥ Z and Ŷ ⊥⊥ Z | Y, respectively. That is, in practice they enforce a consequence of the desired criterion rather than a sufficient condition for it. Our work complements the results by Veitch et al. (2021), as shown in the following corollary, which is a direct consequence of Theorem 3.2. Corollary 3.7.
Under the causal and anti-causal graph, suppose that Z and Y are not confounded. If Ŷ ⊥ ⊥ Z, it holds P Ŷ *

Figure 3: (Left) Variance of the counterfactual distributions for 100 random data points, with lines representing 3rd-order polynomial regressions and shaded areas 95% confidence intervals. (Center) Trade-off between accuracy and counterfactual invariance: the HSCIC decreases as the MSE increases. Vertical bars denote standard errors over 15 different random seeds. (Right) Correspondence between the HSCIC and the VCF for increasing γ. Again, vertical bars denote standard errors over 15 different random seeds.

Figure 4: Results of MSE, the HSCIC operator, and VCF for the dSprites image dataset experiment. The HSCIC operator decreases steadily with higher values of γ; correspondingly, a necessary increase of MSE can be observed. For both γ = 0.5 and γ = 1, an overall decrease of VCF is observed compared to the non-regularized setting.

Figure 6: (a) A causal graph G, which embeds information about the random variables of the model in the pre-interventional world. (b) The corresponding graph G′ for the set W = {A, X}. The variables Ā and X̄ are copies of A and X respectively. (c) The post-interventional graph $G'_{\mathbf{a}}$. By construction, any intervention of the form A ← a does not affect the group W̄ = {Ā, X̄}.

and neither M nor any descendant of M is in Z. Using Definition 3.1, we define the concept of d-separation as follows.

Definition A.1 (d-Separation). Consider a causal graph G. Two sets of nodes X and Y of G are said to be d-separated by a third set Z if every path from any node of X to any node of Y is blocked by Z.

Lemma A.2 (d-Separation Criterion). Consider a causal graph G, and suppose that two sets of nodes X and Y of G are d-separated by Z. Then, X is independent of Y given Z in any model induced by the graph G. We use the notation $X \perp\!\!\!\perp_G Y \mid Z$ to indicate that X and Y are d-separated by Z in G.

Consider a set A of nodes of G, and consider interventions of the form A ← a, as in the statement of Theorem 3.2. We prove that any conditional counterfactual distribution of the form $P_{Y^*_a \mid W}(y \mid w)$ is identifiable in G if and only if the corresponding probability distribution $P_{Y \mid \bar{W}}(y \mid w)$ is identifiable in $G'_{\mathbf{a}}$. The following theorem generalizes Theorem 1 by Shpitser & Pearl (2009).

Theorem A.3. Let G be a causal diagram, and consider two (not necessarily disjoint) sets of nodes A, W. Consider the corresponding graph G′ as defined above. If the distribution $P_{Y^*_a \mid W}(y \mid w)$ is identifiable in G, then the distribution $P_{Y \mid \bar{W}}(y \mid w)$ is identifiable in $G'_{\mathbf{a}}$, for any model induced by G. Furthermore, the estimand of $P_{Y \mid \bar{W}}(y \mid w)$ in $G'_{\mathbf{a}}$ is correct for $P_{Y^*_a \mid W}(y \mid w)$.

Consider a model with causal graph G, and fix a set of observed variables A. We define an auxiliary graph G I by adding to G an additional node I. This node is a parent of the nodes in the set A, and it has no other neighbour. We modify the structural assignments of the nodes A so that for I = 0 the values of A are determined as in G, whereas for I = 1 the values are set to A = a. The node I follows a Bernoulli distribution, with I = 1 indicating that the intervention took place. This construction is important because the following lemma holds.

Lemma A.7. Consider an SGM G and fix two disjoint groups of observed variables X and Y. Denote with G I the auxiliary graph as defined above, with respect to an intervention A ← a. Let Z be a set of nodes of G such that Y ⊥ ⊥ G I I | A, Z. Then, it holds P * Y|Z (y | z) = P Y|A,Z (y | a, z). Here, P * denotes the post-interventional distribution, after assigning A ← a.

Proof. It holds
P * Y|Z (y | z) = P * Y|A,Z (y | a, z)    (by definition of P *)
= P Y|I,A,Z (y | i = 1, a, z)    (by the definition of I)
= P Y|I,A,Z (y | i = 0, a, z)    (by Lemma A.2)
= P Y|A,Z (y | a, z),    (by the definition of I)
as claimed.

The following lemma characterizes all sets Z that fulfill the condition of Lemma A.7.

Lemma A.8. Consider an SGM G, and denote with G I the auxiliary graph as defined above, with respect to an intervention A ← a. Let Z be a set of nodes that blocks all non-causal paths from A to Y. Then, it holds Y ⊥ ⊥ G I I | A, Z. In particular, it holds P * Y|Z (y | z) = P Y|A,Z (y | a, z).
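The indicator construction can be made concrete with a toy SGM Z → A → Y, in which the same sampler covers both regimes: for i = 0 the observational assignment of A is used, for i = 1 the intervention overrides it. This is only an illustrative sketch; the graph, the coefficients, and the names sample_scm and a_star are assumptions for the example, not part of the paper.

```python
import numpy as np


def sample_scm(n, i, a_star=1.0, rng=None):
    """Toy SGM Z -> A -> Y with an intervention indicator I.

    For i == 0 the observational assignment A := Z + noise is used;
    for i == 1 the intervention A <- a_star overrides it. The graph
    and all coefficients are illustrative choices.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.normal(size=n)
    if i == 0:
        a = z + 0.1 * rng.normal(size=n)   # pre-interventional regime
    else:
        a = np.full(n, float(a_star))      # post-interventional: A <- a_star
    y = a + z + 0.1 * rng.normal(size=n)
    return z, a, y
```

In this toy model, Lemma A.7 says that whenever Y ⊥ ⊥ I | A, Z holds in G I, conditioning the observational regime (i = 0) on A = a recovers the post-interventional conditional distribution of Y.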

Figure 7: Assumed causal graph for the Adult dataset, as in Chiappa & Pacchiano (2021). The variables H m , H l , H r are unobserved, and jointly trained with the predictor Ŷ.

In this example we have W = C ∪ M ∪ L ∪ R. The results in Figure 5 (center, right) refer to one run with conditioning set Z = {Race, Nationality}. The results in Table 5 (right) are the average and standard deviation over four random seeds.

E.6 BASELINE EXPERIMENTS

We provide an experimental comparison against the method by Veitch et al. (2021). To this end, we consider the following artificial causal structure (see Figure 1(b)):

Consider two random variables Y and A, and denote with (Ω Y , F Y ) and (Ω A , F A ) the respective measurable spaces. Suppose that we are given two RKHSs H Y , H A over the support of Y and A respectively. The HSCIC between Y and A given Z is defined pointwise as HSCIC(Y, A | Z)(z) = ∥ µ Y,A|Z=z − µ Y|Z=z ⊗ µ A|Z=z ∥, with ∥•∥ the norm induced by the inner product of the tensor product space H Y ⊗ H A .
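For concreteness, this quantity can be estimated from samples by fitting the conditional mean embeddings with kernel ridge regression and expanding the squared RKHS norm in Gram matrices. The following is a minimal numpy sketch under assumed Gaussian kernels; the bandwidth sigma, ridge penalty lam, and function names are illustrative choices, not the paper's exact implementation.

```python
import numpy as np


def gaussian_gram(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for the rows of x (shape (n, d))."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))


def hscic(y, a, z, lam=1e-3, sigma=1.0):
    """Empirical HSCIC(Y, A | Z), averaged over the observed z-values.

    Conditional mean embeddings are estimated by kernel ridge
    regression; lam (ridge penalty) and sigma (bandwidth) are
    illustrative hyperparameters.
    """
    n = len(z)
    K_y, K_a, K_z = (gaussian_gram(v, sigma) for v in (y, a, z))
    # Column j of W holds the ridge weights of the conditional
    # embedding at z_j: W = (K_z + lam * n * I)^{-1} K_z.
    W = np.linalg.solve(K_z + lam * n * np.eye(n), K_z)
    Yw, Aw = K_y @ W, K_a @ W
    # Expansion of ||mu_{Y,A|z} - mu_{Y|z} (x) mu_{A|z}||^2 in Gram matrices.
    t1 = np.einsum("ij,ik,kj->j", W, K_y * K_a, W)
    t2 = np.einsum("ij,ij,ij->j", W, Yw, Aw)
    t3 = np.einsum("ij,ij->j", W, Yw) * np.einsum("ij,ij->j", W, Aw)
    hsq = np.clip(t1 - 2.0 * t2 + t3, 0.0, None)
    return float(np.sqrt(hsq).mean())
```

During training, this penalty (or its square) can be computed on each mini-batch and added to the task loss, weighted by the trade-off parameter γ used throughout the experiments.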

Performance of the HSCIC against baselines CF1 and CF2 on two synthetic datasets (see Appendix E.2). Notably, for γ = 5 in Scenario 1 and γ = 13 in Scenario 2 we outperform CF2 in MSE and VCF simultaneously.

Results of the MSE (×10^5 for readability), HSCIC, and VCF for increasing dimension of A (top) and Z (bottom), on synthetic datasets as in Appendix E.3. All other variables are one-dimensional in both cases.

Proof of Theorem A.3. For simplicity, denote with P * the induced measure on G ′ a . Case 1: W is d-separated from Y in G ′ a . By Lemma A.2, we have that Y is independent of W.

is equivalent to ours. The following corollary holds. Corollary C.7. Consider (sets of) random variables Y, A, Z, and two RKHSs H Y , H A over the support of Y and A respectively. Suppose that P Y,A|Z (• | Z) admits a regular version. Then, there exists a set Ω ⊆ Ω A that occurs almost surely, such that

Results of MSE, HSCIC, and VCF (all ×10^5 for readability) on synthetic data with bi-dimensional A and Z. Here dimZ = 2, dimA = 2, dimX = 1, dimY = 1.

Architecture of the convolutional neural network used for the image dataset, as described in Appendix E.4.

Results of MSE and VCF (all ×10^2 for readability) on synthetic data, comparing our method with trade-off parameters γ = 1/2 and γ = 1 against the heuristic methods of data augmentation and causal-based data augmentation.

E.3 DATASETS FOR MULTI-DIMENSIONAL VARIABLES EXPERIMENTS

The data-generating mechanisms for the multi-dimensional settings of Table 2 are now shown. For the results in Table 2 (top), given dimA = D 1 ≥ 2, the datasets were generated from a structural process with noise terms ∼ N (0, 1). In this setting, the mini-batch size is 64 and the same hyperparameters are used as in the previous setting. The neural network is trained for 70 epochs.

For the results in Table 2 (bottom), the data-generating process has noise terms ∼ N (0, 0.1). Here, we used a mini-batch size of 32, a learning rate of 10^-4, and 500 epochs. The results in Table 2 are the average over three runs with random seeds on the same data split.

We tested the method on a further setting, consisting of bi-dimensional Z and A (dimA = 2, dimZ = 2). Specifically, we have Z = {Z 1 , Z 2 } and A = {A 1 , A 2 }, with a data-generating mechanism whose noise terms are ∼ N (0, 0.1). Table 3 once again shows the trade-off between accuracy and counterfactually invariant predictions, implying that the proposed method can also be applied in settings where both Z and A are multi-dimensional. Table 3 reports the average and standard deviation over four runs with random seeds and re-sampled data.

