LEARNING ROBUST MODELS USING THE PRINCIPLE OF INDEPENDENT CAUSAL MECHANISMS

Anonymous

Abstract

Standard supervised learning breaks down under data distribution shift. However, the principle of independent causal mechanisms (ICM; Peters et al., 2017) can turn this weakness into an opportunity: one can take advantage of distribution shift between different environments during training in order to obtain more robust models. We propose a new gradient-based learning framework whose objective function is derived from the ICM principle. We show theoretically and experimentally that neural networks trained in this framework focus on relations remaining invariant across environments and ignore unstable ones. Moreover, we prove that the recovered stable relations correspond to the true causal mechanisms under certain conditions. In both regression and classification, the resulting models generalize well to unseen scenarios where traditionally trained models fail.

1. INTRODUCTION

Standard supervised learning has shown impressive results when training and test samples follow the same distribution. However, many real-world applications do not conform to this setting, so that research successes do not readily translate into practice (Lake et al., 2017). The task of Domain Generalization (DG) addresses this problem: it aims at training models that generalize well under domain shift. In contrast to domain adaptation, where a few labeled and/or many unlabeled examples are provided for each target test domain, in DG no data at all is available from the test domains' distributions, making the problem unsolvable in general.

In this work, we view the problem of DG specifically through the lens of causal discovery. To make DG well-posed from this viewpoint, we assume that there exists a feature vector h(X) whose relation to the target variable Y is invariant across all environments; consequently, the conditional probability p(Y | h(X)) has predictive power in every environment. From a causal perspective, changes between domains or environments can be described as interventions, and causal relationships, unlike purely statistical ones, remain invariant across environments unless they are explicitly changed by an intervention. This is due to the fundamental principle of "Independent Causal Mechanisms", which will be discussed in Section 3. From a causal standpoint, finding robust models is therefore a causal discovery task (Bareinboim & Pearl, 2016; Meinshausen, 2018).

Taking a causal perspective on DG, we aim at identifying features which (i) have an invariant relationship to the target variable Y and (ii) are maximally informative about Y. This problem has been addressed before under simplifying assumptions and with a discrete combinatorial search (Magliacane et al., 2018; Rojas-Carulla et al., 2018); we make weaker assumptions and use gradient-based optimization.
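As a toy illustration of this invariance assumption (our own sketch, not one of the paper's experiments; all names such as `sample_env` and `slope` are made up), consider two environments where the mechanism generating Y from its cause X1 stays fixed, while the distribution of X1 and of an "effect" variable X2 shift:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_env(n, x1_mean, x2_noise_sd):
    """One environment of the toy SCM X1 -> Y -> X2. Only the distribution
    of X1 and the noise on X2 change across environments; the mechanism
    p(Y | X1) is invariant."""
    x1 = rng.normal(x1_mean, 1.0, n)
    y = 2.0 * x1 + rng.normal(0.0, 0.5, n)    # invariant causal mechanism
    x2 = y + rng.normal(0.0, x2_noise_sd, n)  # unstable "effect" feature
    return x1, x2, y

def slope(x, y):
    # ordinary least-squares slope of y on x
    return np.cov(x, y)[0, 1] / np.var(x)

x1a, x2a, ya = sample_env(5000, 0.0, 0.3)   # environment A
x1b, x2b, yb = sample_env(5000, 2.0, 3.0)   # environment B (shifted)

# Regressing on the causal feature gives the same relation in both
# environments (both slopes close to 2.0) ...
print(slope(x1a, ya), slope(x1b, yb))
# ... while the relation to the effect feature changes markedly.
print(slope(x2a, ya), slope(x2b, yb))
```

A predictor built on X1 transfers to the shifted environment; one built on the (often more predictive in-distribution) feature X2 does not, which is exactly the failure mode DG must avoid.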
Gradient-based optimization is attractive because it readily scales to high dimensions and offers the possibility to learn very informative features, instead of merely selecting among predefined ones. Approaches to invariant relations similar to ours were taken by Ghassami et al. (2017), who restrict themselves to linear relations, and by Arjovsky et al. (2019); Krueger et al. (2020), who minimize an invariant empirical risk objective. Problems (i) and (ii) are quite intricate because the search space has combinatorial complexity and testing for conditional independence in high dimensions is notoriously difficult. Our main contributions to this problem are the following:

• By connecting invariant (causal) relations with normalizing flows, we propose a differentiable two-part objective of the form I(Y; h(X)) + λ_I L_I, where I is the mutual information and L_I enforces the invariance of the relation between h(X) and Y across all environments. This objective operationalizes the ICM principle with a trade-off between feature informativeness and invariance, controlled by the parameter λ_I. Our formulation generalizes existing work because our objective is not restricted to linear models.

• We provide a theoretical result connecting normalizing flows and invariant relations, which in turn allows for gradient-based learning of the problem. To exploit our formulation, we also use the Hilbert-Schmidt Independence Criterion (HSIC), which was employed for robust learning in the single-environment setting by Greenfeld & Shalit (2019).

• We take advantage of the continuous objective in three important ways: (1) We can learn new invariant features, whereas graph-based methods can only select features from a predefined set. (2) Our approach does not suffer from the scalability problems of combinatorial optimization methods as proposed, e.g., in Peters et al. (2016) and Rojas-Carulla et al. (2018). (3) Our optimization via normalizing flows, i.e. in the form of a density estimation task, facilitates accurate maximization of the mutual information.

• We show how our objective simplifies in important special cases and under which conditions its optimal solution identifies the true causal parents of the target variable Y. We empirically demonstrate that the new method achieves good results on two datasets proposed in the literature.

2. RELATED WORK

Arjovsky et al. (2019) propose a gradient-based learning framework which exploits a weaker notion of invariance: their definition is only a necessary condition and does not guarantee the stronger causal notion of invariance we treat in this work. The connection between DG, invariances and causality has been pointed out, for instance, by Meinshausen (2018); Rojas-Carulla et al. (2018); Zhang et al. (2015); Huang et al. (2020). From a causal perspective, DG is a causal discovery task (Meinshausen, 2018). Most of these approaches rely on combinatorial methods based on graphical models or are restricted to linear mechanisms, whereas our model defines a continuous objective for very general non-linear models.

A common alternative is to require the marginal distribution of the features h(X) to be invariant across environments. However, this form of invariance is problematic, since, for instance, the distribution of the target variable might change between environments; in that case we should expect the distribution of h(X) to change as well. A more plausible and theoretically justified type of invariance is the invariance of relations (Peters et al., 2016; Magliacane et al., 2018; Rojas-Carulla et al., 2018). A relation between a target Y and some features is invariant across environments if the conditional distribution of Y given the features is the same in all environments. Existing approaches model a conditional distribution for each feature selection and check for the invariance property (Peters et al., 2016; Rojas-Carulla et al., 2018; Magliacane et al., 2018). However, this does not scale well.

The distinctive property of causal relations to remain invariant across environments in the absence of direct interventions has been known since at least the 1930s (Frisch, 1938; Heckman & Pinto, 2013). However, its crucial role as a tool for causal discovery was, to the best of our knowledge, only recently recognized by Peters et al. (2016). Their estimator, Invariant Causal Prediction (ICP), returns the intersection of all subsets of variables that have an invariant relation w.r.t. Y; under suitable conditions, this output is exactly the set of direct causes of Y. However, their method assumes an underlying linear model and must perform an exhaustive search over all possible variable sets X_T, which does not scale. Extensions to time series and non-linear additive noise models were studied by Heinze-Deml et al. (2018); Pfister et al. (2019). Our treatment of invariance is inspired by these papers and also discusses identifiability results, i.e. conditions under which the identified variables are indeed the direct causes. Key differences between ICP and our approach are the following: first, we propose a formulation that allows for gradient-based learning without strong assumptions on the underlying causal model, such as linearity; second, while ICP tends to exclude features from the parent set when in doubt, our algorithm prefers to err in the direction of best prediction performance.
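To make the scaling problem of exhaustive search concrete, the following is a hedged toy sketch of the idea behind ICP (our own crude plug-in invariance test on residual means and variances, not ICP's actual statistical procedure; names like `make_env` and `invariant` and the tolerance values are our assumptions): fit a pooled linear model for every subset of candidate variables, accept the subsets whose residuals look identically distributed across environments, and intersect the accepted subsets.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def make_env(n, x1_mean, x2_noise_sd):
    # Ground truth: X1 is the only direct cause of Y; X2 is an effect of Y.
    x1 = rng.normal(x1_mean, 1.0, n)
    y = x1 + rng.normal(0.0, 0.5, n)
    x2 = y + rng.normal(0.0, x2_noise_sd, n)
    return np.column_stack([x1, x2]), y

Xa, ya = make_env(3000, 0.0, 0.25)
Xb, yb = make_env(3000, 2.0, 3.0)
X, y = np.vstack([Xa, Xb]), np.concatenate([ya, yb])
env = np.array([0] * 3000 + [1] * 3000)

def residuals(cols):
    # pooled least-squares fit of y on the chosen columns (plus intercept)
    A = np.column_stack([X[:, list(cols)], np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

def invariant(cols, mean_tol=0.2, var_ratio_tol=2.0):
    # crude plug-in test: residual mean and variance must match across envs
    r = residuals(cols)
    r0, r1 = r[env == 0], r[env == 1]
    ratio = max(r0.var(), r1.var()) / min(r0.var(), r1.var())
    return abs(r0.mean() - r1.mean()) < mean_tol and ratio < var_ratio_tol

accepted = [set(s) for k in range(3)
            for s in itertools.combinations([0, 1], k) if invariant(s)]
# ICP-style output: variables shared by every accepted subset
parents = set.intersection(*accepted) if accepted else set()
print(parents)   # expected: {0}, i.e. X1
```

With two candidate variables this loop is harmless, but the number of subsets grows as 2^d, which is why a continuous, gradient-based formulation is needed at scale.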


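The Hilbert-Schmidt Independence Criterion mentioned above can be estimated directly from samples with Gaussian kernels. A minimal sketch of the standard biased estimator tr(KHLH)/(n-1)^2 with a median-heuristic bandwidth (our own illustration; the paper's actual penalty may differ in kernel and scaling):

```python
import numpy as np

def gaussian_gram(z):
    """Gaussian kernel Gram matrix with median-heuristic bandwidth."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    med = np.median(sq[sq > 0])          # median of squared distances
    return np.exp(-sq / med)

def hsic(x, y):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2, where H is the
    centering matrix. Tends to zero under independence as n grows."""
    n = len(x)
    K, L = gaussian_gram(x), gaussian_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y_dep = x ** 2 + 0.1 * rng.normal(size=300)  # nonlinearly dependent on x
y_ind = rng.normal(size=300)                 # independent of x
h_dep, h_ind = hsic(x, y_dep), hsic(x, y_ind)
print(h_dep, h_ind)   # dependent pair scores clearly higher
```

Because HSIC also detects nonlinear dependence (here Y = X² + noise, which is uncorrelated with X), it is a natural differentiable building block for an invariance penalty.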