DAG LEARNING ON THE PERMUTAHEDRON

Abstract

We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called permutahedron, to learn a topological ordering. Edges can be optimized jointly with the ordering, or learned conditionally on it via a non-differentiable subroutine. Compared to existing continuous optimization approaches, our formulation has a number of advantages, including: 1. validity: it optimizes over exact DAGs, as opposed to relaxations that optimize over approximate DAGs; 2. modularity: it accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: it either alternately iterates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the structural intervention distance (SID) and the structural Hamming distance (SHD).

1. INTRODUCTION

In many domains, including cell biology (Sachs et al., 2005), finance (Sanford & Moosa, 2012), and genetics (Zhang et al., 2013), the data-generating process is thought to be represented by an underlying directed acyclic graph (DAG). Many models rely on DAG assumptions; e.g., causal modeling uses DAGs to model distribution shifts, ensure predictor fairness among subpopulations, or learn agents more sample-efficiently (Kaddour et al., 2022). A key question, with implications ranging from better modeling to causal discovery, is how to recover this unknown DAG from observed data alone. While there are methods for identifying the underlying DAG if given additional interventional data (Eberhardt, 2007; Hauser & Bühlmann, 2014; Shanmugam et al., 2015; Kocaoglu et al., 2017; Brouillard et al., 2020; Addanki et al., 2020; Squires et al., 2020; Lippe et al., 2022), it is not always practical or ethical to obtain such data (e.g., if one aims to discover links between dietary choices and deadly diseases).

Learning DAGs from observational data alone is fundamentally difficult for two reasons. (i) Estimation: different graphs can produce similar observed data, either because the graphs are Markov equivalent (they represent the same set of data distributions) or because not enough samples have been observed to distinguish candidate graphs; this riddles the search space with local minima. (ii) Computation: DAG discovery is a costly combinatorial optimization problem over an exponentially large solution space, subject to global acyclicity constraints.

To address issue (ii), recent work has proposed continuous relaxations of the DAG learning problem. These allow one to use well-studied continuous optimization procedures to search the space of DAGs given a score function (e.g., the likelihood). While these methods are more efficient than combinatorial ones, current approaches have one or more of the following downsides: 1.
Invalidity: existing methods based on penalizing the exponential of the adjacency matrix (Zheng et al., 2018; Yu et al., 2019; Zheng et al., 2020; Ng et al., 2020; Lachapelle et al., 2020; He et al., 2021) are not guaranteed to return a valid DAG in practice (see Ng et al. (2022) for a theoretical analysis), but require post-processing to correct the learned graph to a DAG; how the learning method and the post-processing method interact is not currently well understood. 2. Non-modularity: the DAG learning problem is often continuously relaxed in order to leverage gradient-based optimization (Zheng et al., 2018; Ng et al., 2020; Cundy et al., 2021; Charpentier et al., 2022). This requires all training operations to be differentiable, preventing the use of certain well-studied black-box estimators for learning edge functions. 3. Error propagation: methods that break the DAG learning problem into two stages risk propagating errors from one stage to the next (Teyssier & Koller, 2005; Bühlmann et al., 2014; Gao et al., 2020; Reisach et al., 2021; Rolland et al., 2022).

Following the framework of Friedman & Koller (2003), we propose a new differentiable DAG learning procedure based on a decomposition of the problem into (i) learning a topological ordering (i.e., a total ordering of the variables) and (ii) selecting the best-scoring DAG consistent with this ordering. Whereas previous differentiable order-based works (Cundy et al., 2021; Charpentier et al., 2022) implemented step (i) through permutation matrices, we take a more straightforward approach and work directly in the space of vector orderings. Overall, we make the following contributions to score-based methods for DAG learning:

• We propose a novel vector parametrization that associates a single scalar value with each node. This parametrization is (i) intuitive: the higher the score, the lower the node is in the order; and (ii) stable: small perturbations in the parameter space result in small perturbations in the DAG space.

• With this parameterization in place, we show how to learn DAG structures end-to-end from observational data, with any choice of edge estimator (we do not require differentiability). To do so, we leverage recent advances in discrete optimization (Niculae et al., 2018; Correia et al., 2020) and derive a novel top-k oracle over permutations, which may be of independent interest.

• We show that DAGs learned with our proposed framework lie on the Pareto front of two key metrics (the SHD and SID) on two real-world tasks, and perform favorably on several synthetic tasks.

These contributions allow us to develop a framework that addresses the issues of prior work. Specifically, our approach: 1. models sparse distributions over DAG topological orderings, ensuring that all considered graphs are DAGs (also during training); 2. separates the learning of topological orderings from the learning of edge functions, but 3. optimizes them end-to-end, either jointly or by alternately iterating between ordering and edge updates.

Continuous relaxation. To address the complexity of the combinatorial search, more recent methods have proposed exact characterizations of DAGs that allow the problem to be tackled by continuous optimization (Zheng et al., 2018; Yu et al., 2019; Zheng et al., 2020; Ng et al., 2020; Lachapelle et al., 2020; He et al., 2021). To do so, the acyclicity constraint is expressed as a smooth function (Zheng et al., 2018; Yu et al., 2019) and used as a penalization term to enable efficient optimization.
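To make the last point concrete: the smooth acyclicity function of Zheng et al. (2018) is h(W) = tr(exp(W ∘ W)) − d, which vanishes exactly when the weighted adjacency matrix W encodes a DAG. Below is a minimal NumPy sketch (function and variable names are ours, not the original implementation); the exponential's power series is truncated at d terms, which is exact for detecting acyclicity since W is acyclic iff tr((W ∘ W)^k) = 0 for all k = 1..d:

```python
import numpy as np
from math import factorial

def acyclicity_penalty(W: np.ndarray) -> float:
    """Truncated h(W) = tr(exp(W * W)) - d (Zheng et al., 2018).

    Returns 0 iff W encodes a DAG: the elementwise square B = W * W is
    non-negative, and tr(B^k) counts (weighted) length-k cycles.
    """
    d = W.shape[0]
    B = W * W  # elementwise square of the weighted adjacency matrix
    return sum(np.trace(np.linalg.matrix_power(B, k)) / factorial(k)
               for k in range(1, d + 1))

W_dag = np.array([[0.0, 1.0], [0.0, 0.0]])  # edge 0 -> 1 only: acyclic
W_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])  # edges 0 -> 1 and 1 -> 0: a 2-cycle

print(acyclicity_penalty(W_dag))  # 0.0
print(acyclicity_penalty(W_cyc))  # 1.0  (= tr(B)/1! + tr(B^2)/2! = 0 + 2/2)
```

Penalization-based methods add a multiple of h(W) to the training score and drive it toward zero, but, as noted above, nothing guarantees that it reaches exactly zero at convergence, hence the need for post-processing.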
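By contrast, the vector parametrization proposed above sidesteps such penalties entirely: each node i carries a scalar score, and an edge i → j is admissible only if node i precedes node j in the ordering induced by sorting the scores, so every graph restricted to the resulting mask is acyclic by construction. A minimal sketch of this idea (the function name and the index-based tie-breaking are our own illustration, not the paper's exact procedure):

```python
import numpy as np

def ordering_mask(theta: np.ndarray) -> np.ndarray:
    """Binary mask M with M[i, j] = 1 iff node i precedes node j in the
    topological order induced by the score vector theta (ties broken by
    node index). Any graph whose edges respect M is a DAG by construction.
    """
    order = np.argsort(theta, kind="stable")  # node indices, earliest first
    rank = np.empty_like(order)
    rank[order] = np.arange(len(theta))       # rank[i] = position of node i
    return (rank[:, None] < rank[None, :]).astype(float)

theta = np.array([0.3, -1.2, 0.9])  # induced order: node 1, node 0, node 2
M = ordering_mask(theta)
print(M)  # allowed edges: 1->0, 1->2, 0->2
# [[0. 0. 1.]
#  [1. 0. 1.]
#  [0. 0. 0.]]
```

An edge estimator can then be fit on the masked adjacency structure (e.g., W ∘ M), which is why validity holds for every graph considered, also during training.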



Critically, the use of permutation matrices maintains a fully differentiable path from the loss to the parameters (of the permutation matrices), via Sinkhorn iterations or other (inexact) relaxation methods.



Combinatorial methods. These methods are either constraint-based, relying on conditional independence tests to select the sets of parents (Spirtes et al., 2000), or score-based, evaluating how well candidate graphs fit the data (Geiger & Heckerman, 1994) (see Kitson et al. (2021) for a survey). Constraint-based methods, while elegant, require conditional independence testing, which is known to be a hard statistical problem (Shah & Peters, 2020). For this reason, we focus our attention in this paper on score-based methods. Among these, exact combinatorial algorithms exist only for a small number of nodes d (Singh & Moore, 2005; Xiang & Kim, 2013; Cussens, 2011), because the space of DAGs grows super-exponentially in d and finding the optimal solution is NP-hard (Chickering, 1995). Approximate methods (Scanagatta et al., 2015; Aragam & Zhou, 2015; Ramsey et al., 2017) rely on global or local search heuristics in order to scale to problems with thousands of nodes.

