DIFFERENTIABLE AND TRANSPORTABLE STRUCTURE LEARNING

Abstract

Directed acyclic graphs (DAGs) encode a great deal of information about a distribution in their structure. However, the compute required to infer these structures is typically super-exponential in the number of variables, as inference requires a sweep of a combinatorially large space of potential structures. This was the case until recent advances made it possible to search this space using a differentiable metric, drastically reducing search time. While this technique, named NOTEARS, is widely considered a seminal work in DAG discovery, it concedes an important property in favour of differentiability: transportability. To be transportable, the structures discovered on one dataset must apply to another dataset from the same domain. In this paper, we introduce D-Struct, which recovers transportability in the discovered structures through a novel architecture and loss function, while remaining completely differentiable. Because D-Struct remains differentiable, our method can be easily adopted in existing differentiable architectures, as was previously done with NOTEARS. In our experiments, we empirically validate D-Struct with respect to edge accuracy and structural Hamming distance in a variety of settings.

1. INTRODUCTION

Machine learning has proven to be a crucial tool in many disciplines. With successes in medicine [1-5], economics [6-8], physics [9-14], robotics [15-18], and even entertainment [19-21], machine learning is transforming the way in which experts interact with their field. These successes are in large part due to increased accuracy of diagnoses, marketing campaigns, analyses of experiments, and so forth. However, machine learning has much more to offer than improved accuracy alone. Indeed, recent advances support this claim, as machine learning is slowly being recognised as a tool for scientific discovery [22-25]. In these successes, machine learning helped to uncover previously unknown relationships between variables. Discovering such relationships is the first step in the long process of scientific discovery, and it is the focus of our paper: D-Struct, the model we propose in this paper, aims to help through differentiable and transportable structure learning. The structures. We focus on discovering directed acyclic graphs (DAGs) in a domain X. A DAG helps us understand how different variables in X interact with each other. Consider a three-variable domain X := {X, Y, Z}, governed by a joint distribution, P_X. A DAG explicitly models variable interactions in P_X. For example, consider the DAG G: X → Z → Y, which depicts P_X as a DAG. Such a DAG allows useful analysis of dependence and independence of variables in P_X [26, 27]. From G, we learn that X does not directly influence Y, and that X ⊥⊥ Y | Z, as X does not give us any additional information on Y once we know Z. While DAGs are the model of choice in causality [28], it is impossible to discover a causal DAG from observational data alone [29-32]. As we only wish to assume access to observational data, our goal is not causal discovery.
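The independence statement above is easy to see in a quick simulation. Below is a minimal sketch (not from the paper; the sample size, coefficients, and the use of partial correlation as an independence proxy are our own illustrative choices) that generates Gaussian data from the chain X → Z → Y and checks that X and Y are strongly correlated marginally, but nearly uncorrelated once Z is regressed out.

```python
# Minimal simulation of the chain DAG X -> Z -> Y.
# Claim to illustrate: X and Y are marginally dependent, yet X is
# independent of Y given Z (here checked via partial correlation,
# which is exact for linear-Gaussian data).
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=n)
Z = 2.0 * X + rng.normal(size=n)    # Z depends on X
Y = -1.5 * Z + rng.normal(size=n)   # Y depends on Z only

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

marginal = np.corrcoef(X, Y)[0, 1]
conditional = partial_corr(X, Y, Z)
print(f"corr(X, Y)     = {marginal:.3f}")     # strongly non-zero
print(f"corr(X, Y | Z) = {conditional:.3f}")  # approximately zero
```

A CIT-based structure learner uses exactly this kind of test (in nonparametric form) to prune candidate DAGs.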
The above forms the basis for conventional DAG-structure learning [33]. In particular, X ⊥⊥ Y | Z strongly limits the possible DAGs that model P_X. Given more independence statements, we limit the potential DAGs further. However, independence tests are computationally expensive, which is problematic as the number of potential DAGs increases super-exponentially in |X| [34]. This limitation strongly impacted the adoption of DAG learning, until Zheng et al. [35] proposed NOTEARS, which incorporates a differentiable metric to evaluate whether or not a discovered structure is a DAG [35, 36]. Using automatic differentiation, NOTEARS learns a DAG structure in a much more efficient way than earlier methods based on conditional independence tests (CITs). While NOTEARS makes DAG inference tractable, we recognise an important limitation in the approach: a discovered DAG does not generalise to equally factorisable distributions, i.e. NOTEARS is not transportable. While we explain why this is the case in Section 2.2 (and confirm it empirically in Section 4), we give a brief description of the problem below, helping us to state our contribution. Transportability. Consider Fig. 1, depicting two hospitals, named hospitals A and B from here on. Each hospital hosts patients described by the same set of features, such as age and gender. However, the hospitals may have different patient distributions, e.g. patients in A are older compared to B, yet their underlying biology remains the same. Using NOTEARS to learn a DAG from data in hospital A does not guarantee the same DAG is discovered from data in hospital B. Learning from multiple data sources is not new.
In particular, papers focusing on federated structure learning pursue a similar objective to the one described above [37, 38]. However, we believe transportability is a more general property than merely training from multiple data sources. Crucially, transportability is very explicit about the domains we learn from, allowing their distributions to vary across domains. Interestingly, despite computational limitations, transportability is actually guaranteed when using a CIT-based discovery method [39, 40], assuming that patients in both hospitals exhibit the same (in)dependencies in X. Being unable to transport findings across distributions is a major shortcoming, as replicating a discovery is considered a hallmark of the scientific method [41-44]. Contributions. In this paper, we present D-Struct, the first transportable differentiable structure learner. Transportability grants D-Struct several advantages over the state of the art: D-Struct is more robust, even in the conventional single-dataset case (Table 1); D-Struct is fast; in fact, we report time-to-convergence often up to 20 times faster than NOTEARS (Fig. 5); and given its completely differentiable architecture, D-Struct is easily incorporated in existing architectures (e.g. [45-49]). While transportable methods have clear benefits over non-transportable methods in settings with multiple datasets (as illustrated in Fig. 1), we emphasise that our method is not limited to these settings alone. In fact, we find that enforcing transportability significantly increases performance in settings with one dataset, which is arguably the most common case. In Section 3 we introduce D-Struct and show how to use our ideas in the single-dataset setting. We then empirically validate D-Struct in Section 4.
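To make the failure of transportability concrete, one of the metrics used in our experiments, the structural Hamming distance (SHD), directly quantifies how much two discovered graphs disagree: an SHD of zero means the structure learned in hospital A was exactly replicated in hospital B. Below is a minimal sketch of SHD for binary adjacency matrices (a common convention that counts a reversed edge as a single change; the example graphs are hypothetical, not results from the paper).

```python
# Structural Hamming distance (SHD) between two binary adjacency
# matrices: the number of edge insertions, deletions, or reversals
# needed to turn one graph into the other.
import numpy as np

def shd(A: np.ndarray, B: np.ndarray) -> int:
    diff = (A != B).astype(int)
    # a reversed edge flips two entries of diff; count it only once
    reversed_pairs = np.triu(diff & diff.T).sum()
    return int(diff.sum() - reversed_pairs)

G_a = np.array([[0, 1, 0],
                [0, 0, 1],
                [0, 0, 0]])  # X1 -> X2 -> X3 (e.g. learned on hospital A)
G_b = np.array([[0, 1, 0],
                [0, 0, 0],
                [0, 1, 0]])  # X1 -> X2 <- X3 (e.g. learned on hospital B)

print(shd(G_a, G_a))  # 0: identical structures
print(shd(G_a, G_b))  # 1: one edge reversed
```

A transportable learner run on the two hospitals' datasets should yield graphs with SHD close to zero.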

2. PRELIMINARIES AND RELATED WORK

Our goal is to build a transportable and differentiable DAG learner. Without loss of generality, we focus our discussion mostly on NOTEARS [35] (and its refinements [36, 50-52]) as it is the most widely adopted differentiable DAG learner. For a more in-depth overview of structure learners (CIT-based as well as score-based), we refer to Appendix G or the relevant literature [26, 28, 34]. First, we formally introduce transportability, and then explain how NOTEARS works and why it is not transportable.
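As background for that discussion, recall the core of NOTEARS: Zheng et al. [35] show that a weighted adjacency matrix W over d variables encodes a DAG exactly when h(W) = tr(e^(W ∘ W)) − d = 0, where ∘ is the Hadamard product, and this smooth h can be used as a differentiable acyclicity penalty. A minimal sketch of h (the example matrices are ours, for illustration):

```python
# NOTEARS acyclicity function h(W) = tr(exp(W ∘ W)) - d (Zheng et al., 2018).
# h(W) = 0 exactly when the weighted adjacency matrix W encodes a DAG;
# h(W) > 0 signals the presence of a cycle.
import numpy as np
from scipy.linalg import expm

def h(W: np.ndarray) -> float:
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)  # W * W is the Hadamard product

W_dag = np.array([[0.0, 1.0],
                  [0.0, 0.0]])    # X1 -> X2: acyclic
W_cycle = np.array([[0.0, 1.0],
                    [1.0, 0.0]])  # X1 <-> X2: a two-node cycle

print(h(W_dag))    # ≈ 0: W_dag is a DAG
print(h(W_cycle))  # > 0: cycle detected
```

Because h is differentiable in W, it can be used as an equality constraint in a continuous optimisation problem, which is what makes DAG search tractable for NOTEARS.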



Figure 1: Transportability in DAG discovery. Different patients go to different hospitals (left), yet we wish to infer a general structure (right) across hospitals. A structure can only be considered a discovery if it generalizes in distributions over the same domain. For example, the way blood pressure interacts with heart disease is the same for all humans and should be reflected in the structure.

Factorisation and independence. Consider a distribution, P_X, which we can factorise into

P_X = ∏_{i ∈ [d]} P(X_i | X_{i+1:d}),    (1)

with i ∈ [d], where [d] := {1, . . . , d}, and X_i representing the i-th element in X. Eq. (1) may get quite long with increasing d, as the conditions may contain up to d − 1 different variables. This becomes

