DAG LEARNING ON THE PERMUTAHEDRON

Abstract

We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly with the ordering, or learned conditional on it via a non-differentiable subroutine. Compared to existing continuous optimization approaches, our formulation has a number of advantages, including: 1. validity: it optimizes over exact DAGs, as opposed to other relaxations, which optimize over approximate DAGs; 2. modularity: it accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: it either alternates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the structural intervention distance (SID) and the structural Hamming distance (SHD).

1. INTRODUCTION

In many domains, including cell biology (Sachs et al., 2005), finance (Sanford & Moosa, 2012), and genetics (Zhang et al., 2013), the data-generating process is thought to be represented by an underlying directed acyclic graph (DAG). Many models rely on DAG assumptions, e.g., causal modeling uses DAGs to model distribution shifts, ensure predictor fairness among subpopulations, or train agents more sample-efficiently (Kaddour et al., 2022). A key question, with implications ranging from better modeling to causal discovery, is how to recover this unknown DAG from observed data alone. While there are methods for identifying the underlying DAG when additional interventional data is available (Eberhardt, 2007; Hauser & Bühlmann, 2014; Shanmugam et al., 2015; Kocaoglu et al., 2017; Brouillard et al., 2020; Addanki et al., 2020; Squires et al., 2020; Lippe et al., 2022), it is not always practical or ethical to obtain such data (e.g., if one aims to discover links between dietary choices and deadly diseases).

Learning DAGs from observational data alone is fundamentally difficult for two reasons. (i) Estimation: different graphs can produce similar observed data, either because the graphs are Markov equivalent (they represent the same set of data distributions) or because not enough samples have been observed to distinguish between candidate graphs. This riddles the search space with local minima. (ii) Computation: DAG discovery is a costly combinatorial optimization problem over an exponentially large solution space, subject to global acyclicity constraints.

To address issue (ii), recent work has proposed continuous relaxations of the DAG learning problem. These allow one to use well-studied continuous optimization procedures to search the space of DAGs given a score function (e.g., the likelihood). While these methods are more efficient than combinatorial methods, the current approaches have one or more of the following downsides: 1. Invalidity: existing methods based on penalizing the exponential of the adjacency matrix (Zheng

