DAG LEARNING ON THE PERMUTAHEDRON

Abstract

We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly, or learned conditionally on the ordering via a non-differentiable subroutine. Compared to existing continuous optimization approaches, our formulation has a number of advantages, including: 1. validity: it optimizes over exact DAGs, as opposed to relaxations that optimize approximate DAGs; 2. modularity: it accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: it either alternates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world data problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the SID and SHD.

* Work done prior to joining Amazon.

1. INTRODUCTION

In many domains, including cell biology (Sachs et al., 2005), finance (Sanford & Moosa, 2012), and genetics (Zhang et al., 2013), the data generating process is thought to be represented by an underlying directed acyclic graph (DAG). Many models rely on DAG assumptions; e.g., causal modeling uses DAGs to model distribution shifts, ensure predictor fairness among subpopulations, or learn agents more sample-efficiently (Kaddour et al., 2022). A key question, with implications ranging from better modeling to causal discovery, is how to recover this unknown DAG from observed data alone. While there are methods for identifying the underlying DAG given additional interventional data (Eberhardt, 2007; Hauser & Bühlmann, 2014; Shanmugam et al., 2015; Kocaoglu et al., 2017; Brouillard et al., 2020; Addanki et al., 2020; Squires et al., 2020; Lippe et al., 2022), it is not always practical or ethical to obtain such data (e.g., if one aims to discover links between dietary choices and deadly diseases). Learning DAGs from observational data alone is fundamentally difficult for two reasons. (i) Estimation: it is possible for different graphs to produce similar observed data, either because the graphs are Markov equivalent (they represent the same set of data distributions) or because not enough samples have been observed to distinguish possible graphs. This riddles the search space with local minima. (ii) Computation: DAG discovery is a costly combinatorial optimization problem over an exponentially large solution space, subject to global acyclicity constraints. To address issue (ii), recent work has proposed continuous relaxations of the DAG learning problem. These allow one to use well-studied continuous optimization procedures to search the space of DAGs given a score function (e.g., the likelihood). While these methods are more efficient than combinatorial methods, current approaches have one or more of the following downsides: 1. 
Invalidity: existing methods based on penalizing the exponential of the adjacency matrix (Zheng et al., 2018; Yu et al., 2019; Zheng et al., 2020; Ng et al., 2020; Lachapelle et al., 2020; He et al., 2021) are not guaranteed to return a valid DAG in practice (see Ng et al. (2022) for a theoretical analysis), but require post-processing to correct the graph to a DAG. How the learning method and the post-processing method interact with each other is not currently well understood; 2. Non-modularity: continuously relaxing the DAG learning problem is often done to leverage gradient-based optimization (Zheng et al., 2018; Ng et al., 2020; Cundy et al., 2021; Charpentier et al., 2022). This requires all training operations to be differentiable, preventing the use of certain well-studied black-box estimators for learning edge functions; 3. Error propagation: methods that break the DAG learning problem into two stages risk propagating errors from one stage to the next (Teyssier & Koller, 2005; Bühlmann et al., 2014; Gao et al., 2020; Reisach et al., 2021; Rolland et al., 2022). Following the framework of Friedman & Koller (2003), we propose a new differentiable DAG learning procedure based on a decomposition of the problem into: (i) learning a topological ordering (i.e., a total ordering of the variables) and (ii) selecting the best scoring DAG consistent with this ordering. Whereas previous differentiable order-based works (Cundy et al., 2021; Charpentier et al., 2022) implemented step (i) through the use of permutation matrices, we take a more straightforward approach by working directly in the space of vector orderings. Overall, we make the following contributions to score-based methods for DAG learning: • We propose a novel vector parametrization that associates a single scalar value to each node. 
This parametrization is (i) intuitive: the higher the score, the later the node appears in the ordering; and (ii) stable: small perturbations in the parameter space result in small perturbations in the DAG space. • With such a parameterization in place, we show how to learn DAG structures end-to-end from observational data, with any choice of edge estimator (we do not require differentiability). To do so, we leverage recent advances in discrete optimization (Niculae et al., 2018; Correia et al., 2020) and derive a novel top-k oracle over permutations, which may be of independent interest. • We show that DAGs learned with our proposed framework lie on the Pareto front of two key metrics (the SHD and SID) on two real-world tasks, and perform favorably on several synthetic tasks. These contributions allow us to develop a framework that addresses the issues of prior work. Specifically, our approach: 1. models sparse distributions over DAG topological orderings, ensuring all considered graphs are DAGs (also during training); 2. separates the learning of topological orderings from the learning of edge functions, but 3. optimizes them end-to-end, either jointly or by alternating between learning the ordering and the edges.

2. RELATED WORK

The work on DAG learning can be largely categorized into four families of approaches: (a) combinatorial methods, (b) continuous relaxation, (c) two-stage, and (d) differentiable, order-based. Combinatorial methods. These methods are either constraint-based, relying on conditional independence tests for selecting the sets of parents (Spirtes et al., 2000), or score-based, evaluating how well possible candidates fit the data (Geiger & Heckerman, 1994) (see Kitson et al. (2021) for a survey). Constraint-based methods, while elegant, require conditional independence testing, which is known to be a hard statistical problem (Shah & Peters, 2020). For this reason, we focus our attention in this paper on score-based methods. Of these, exact combinatorial algorithms exist only for a small number of nodes d (Singh & Moore, 2005; Xiang & Kim, 2013; Cussens, 2011), because the space of DAGs grows super-exponentially in d and finding the optimal solution is NP-hard (Chickering, 1995). Approximate methods (Scanagatta et al., 2015; Aragam & Zhou, 2015; Ramsey et al., 2017) rely on global or local search heuristics in order to scale to problems with thousands of nodes. Continuous relaxation. To address the complexity of the combinatorial search, more recent methods have proposed exact characterizations of DAGs that allow one to tackle the problem by continuous optimization (Zheng et al., 2018; Yu et al., 2019; Zheng et al., 2020; Ng et al., 2020; Lachapelle et al., 2020; He et al., 2021). To do so, the constraint on acyclicity is expressed as a smooth function (Zheng et al., 2018; Yu et al., 2019) and then used as a penalization term to allow efficient optimization. However, this procedure no longer guarantees the absence of cycles at any stage of training, and solutions often require post-processing. Concurrently to this work, Bello et al. (2022) introduce a log-determinant characterization and an optimization procedure that is guaranteed to return a DAG at convergence; in practice, this relies on thresholding to reduce false positives in edge prediction. Two-stage. The third prominent line of work learns DAGs in two stages: (i) finding an ordering of the variables, and (ii) selecting the best scoring graph among (or marginalizing over) the structures that are consistent with the found ordering (Teyssier & Koller, 2005; …).

3. SETUP

3.1. THE PROBLEM

Let X ∈ R^{n×d} be a matrix of n observed data inputs generated by an unknown Structural Equation Model (SEM) (Pearl, 2000). An SEM describes the functional relationships between d features as edges between nodes in a DAG G ∈ D[d] (where D[d] is the space of all DAGs with d nodes). Each feature x_j ∈ R^n is generated by some (unknown) function f_j of its (unknown) parents pa(j) as x_j = f_j(x_{pa(j)}). (1) We keep track of whether an edge exists in the graph G using an adjacency matrix A ∈ {0, 1}^{d×d} (i.e., A_ij = 1 if and only if there is a directed edge i → j). For example, a special case is when the structural equations are linear with Gaussian noise: x_j = f_j(X, A_j) = X(w_j • A_j) + ε, with ε ∼ N(0, ν), where w_j ∈ R^d, A_j is the jth column of A, and ν is the noise variance. This is just to add intuition; our framework is compatible with non-linear, non-Gaussian structural equations.
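To make the linear-Gaussian special case concrete, the sketch below samples data from a known toy DAG. This is our own illustration, not code from the paper; `sample_linear_sem` and the chain graph are hypothetical names chosen for exposition.

```python
import numpy as np

def sample_linear_sem(A, W, n, noise_std=1.0, rng=None):
    """Draw n samples from a linear-Gaussian SEM.

    A : (d, d) binary adjacency matrix, A[i, j] = 1 iff edge i -> j.
    W : (d, d) weight matrix; only entries where A == 1 are used.
    Assumes A is acyclic; nodes are generated in a topological order.
    """
    rng = np.random.default_rng(rng)
    d = A.shape[0]
    X = np.zeros((n, d))
    # Compute a topological order: repeatedly pick nodes whose parents are done.
    order, done = [], set()
    while len(order) < d:
        for j in range(d):
            if j not in done and all(p in done for p in np.flatnonzero(A[:, j])):
                order.append(j)
                done.add(j)
    for j in order:
        # x_j = X (w_j * A_j) + eps, matching the linear special case above
        X[:, j] = X @ (W[:, j] * A[:, j]) + noise_std * rng.standard_normal(n)
    return X

# Tiny chain DAG: x0 -> x1 -> x2, all edge weights 2.
A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
W = 2.0 * A
X = sample_linear_sem(A, W, n=1000, rng=0)
```

Note how, in the chain example, deeper nodes accumulate variance — a property that Section 5 discusses as a flaw of purely synthetic benchmarks.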

3.2. OBJECTIVE

Given X, our goal is to recover the unknown DAG that generated the observations. To do so, we must learn (a) the connectivity parameters of the graph, represented by the adjacency matrix A, and (b) the functional parameters Φ = {ϕ_j}_{j=1}^d that define the edge functions {f_{ϕ_j}}_{j=1}^d. Score-based methods (Kitson et al., 2021) learn these parameters via a constrained non-linear mixed-integer optimization problem

min_{A ∈ D[d], Φ} Σ_{j=1}^d ℓ(x_j, f_{ϕ_j}(X • A_j)) + λ Ω(Φ), (2)

where ℓ : R^n × R^n → R is a loss that describes how well each feature x_j is predicted by f_{ϕ_j}. As in eq. (1), the adjacency matrix defines the parents of each feature x_j as pa(j) = {i ∈ [d] \ {j} | A_ij = 1}. Only these features are used by f_{ϕ_j} to predict x_j, via X • A_j. The constraint A ∈ D[d] enforces that the connectivity parameters A describe a valid DAG. Finally, Ω(Φ) is a regularization term encouraging sparseness. So long as this regularizer takes the same value for all DAGs within a Markov equivalence class, the consistency results of Brouillard et al. (2020, Theorem 1) prove that the solution to Problem (2) is Markov equivalent to the true DAG, given standard assumptions. 
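The score in Problem (2) can be illustrated with a minimal sketch that instantiates ℓ as the squared error and each f_{ϕ_j} as per-node least squares (the sparsity term Ω is omitted for brevity). `dag_score` is an illustrative name of ours, not the paper's API.

```python
import numpy as np

def dag_score(X, A):
    """Fit term of Problem (2) for a fixed candidate adjacency A:
    per-node least squares on the parents, summed squared error."""
    n, d = X.shape
    total = 0.0
    for j in range(d):
        parents = np.flatnonzero(A[:, j])
        if parents.size == 0:
            # No parents: the best constant predictor is the mean.
            total += np.sum((X[:, j] - X[:, j].mean()) ** 2)
            continue
        P = X[:, parents]
        w, *_ = np.linalg.lstsq(P, X[:, j], rcond=None)
        total += np.sum((X[:, j] - P @ w) ** 2)
    return total
```

On data generated from a chain x0 → x1 → x2, the true adjacency should achieve a lower score than the empty graph, since each child is well explained by its parent.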
Figure 1: DAGuerreotype: Our end-to-end approach to DAG learning works by (a) learning a sparse distribution over node orderings via structure parameters θ, and (b) learning a sparse predictor w⋆ to estimate the data. Any black-box predictor can be used to learn w⋆; differentiability is not necessary.

3.3. SPARSE RELAXATION METHODS

In developing our method, we leverage recent work on sparse relaxation. In particular, we make use of the SparseMAP (Niculae et al., 2018) and top-k sparsemax (Correia et al., 2020) operators, which we briefly describe below. At a high level, the goal of both approaches is to relax structured problems of the form α⋆ := arg max_{α ∈ △_D} s⊤α so that ∂α⋆/∂s is well-defined (where △_D := {α ∈ R^D | α ⪰ 0, Σ_{i=1}^D α_i = 1} is the D-dimensional simplex). This allows s to be learned by gradient-based methods. Note that both approaches require querying an oracle that finds the best-scoring structures. We are unaware of such an oracle for DAG learning, i.e., for D[d] being the vertices of △_D. However, we will show that by decomposing the DAG learning problem, we can find an oracle for the decomposed subproblem. We derive this oracle in Section 4 and prove its correctness.

Top-k sparsemax (Correia et al., 2020). This approach works by (i) regularizing α, and (ii) constraining the number of non-zero entries of α to be at most k, as follows: arg max_{α ∈ △_D, ∥α∥_0 ≤ k} s⊤α − ∥α∥_2^2. To solve this optimization problem, top-k sparsemax requires an oracle that returns the k structures with the highest scores s⊤α.

SparseMAP (Niculae et al., 2018). Assume s has a low-dimensional parametrization s = B⊤r, where B ∈ R^{q×D} and q ≪ D. SparseMAP relaxes α⋆ = arg max_{α ∈ △_D} r⊤Bα by regularizing in the lower-dimensional 'marginal space': arg max_{α ∈ △_D} r⊤Bα − ∥Bα∥_2^2. The relaxed problem can be solved using the active set algorithm (Nocedal & Wright, 1999), which at iteration t + 1 queries an oracle for the best-scoring structure under (r − Bα^(t))⊤Bα.
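For intuition, here is a small NumPy sketch of the sparsemax projection and its top-k variant over an already-enumerated, already-scored set of structures (feasible only for tiny D; function names are ours, and the regularizer is taken up to a constant factor):

```python
import numpy as np

def sparsemax(s):
    """Euclidean projection of the score vector s onto the simplex:
    argmax_{a in simplex} s^T a - (1/2)||a||^2. Sparse: low scores get 0."""
    z = np.sort(s)[::-1]                 # scores in decreasing order
    css = np.cumsum(z) - 1.0
    ks = np.arange(1, len(s) + 1)
    support = z - css / ks > 0           # entries that stay positive
    k = ks[support][-1]
    tau = css[support][-1] / k           # threshold
    return np.maximum(s - tau, 0.0)

def topk_sparsemax(s, k):
    """Top-k sparsemax: restrict the support to the k highest-scoring
    structures (returned by the oracle), then project onto the simplex."""
    idx = np.argsort(-s)[:k]
    out = np.zeros_like(s)
    out[idx] = sparsemax(s[idx])
    return out
```

In the full method, enumerating and scoring structures is exactly the job of the oracle derived in Section 4; the projection itself is cheap.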

4. DAG LEARNING VIA SPARSE RELAXATIONS

A key difficulty in learning DAG structures is the characterization of the set of all valid DAGs: as soon as some edges are added, other edges become prohibited. However, note the following key observation: any DAG can be decomposed as follows: (i) assign to each of the d nodes a rank and reorder the nodes according to this rank (this is called a topological ordering); (ii) only allow edges from lower to higher nodes in the order, i.e., from a node i to a node j only if x_i ≺ x_j. This decomposition lies at the core of our method for DAG learning, which we dub DAGuerreotype, shown in Figure 1. In this section we derive the framework, present a global sensitivity result, show how to learn both the structure and the edge-equation parameters by leveraging the sparse relaxation methods introduced above, and study the computational complexity of our method.
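The two-step decomposition above can be sketched in a few lines: scores induce an ordering, the ordering induces a complete DAG, and a binary mask then drops edges. Both function names are illustrative, not the paper's implementation.

```python
import numpy as np

def complete_dag(theta):
    """Step (i): the scores theta induce an ordering (sort ascending);
    the complete DAG has an edge i -> j iff node i comes before node j."""
    d = len(theta)
    rank = np.empty(d, dtype=int)
    rank[np.argsort(theta, kind="stable")] = np.arange(d)
    return (rank[:, None] < rank[None, :]).astype(int)

def masked_dag(theta, keep):
    """Step (ii): drop edges of the complete DAG via a binary mask `keep`.
    The result is acyclic by construction, for any mask."""
    return complete_dag(theta) * keep

theta = np.array([0.3, -1.0, 0.7])  # ordering: node 1, then node 0, then node 2
A = complete_dag(theta)
```

Any subgraph of the complete DAG is itself a DAG, which is why no acyclicity penalty is ever needed during training.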

4.1. LEARNING ON THE PERMUTAHEDRON

Given d nodes, let Σ_d be the set of all permutations of the node indices {1, . . . , d}. Given a vector v ∈ R^d, let v_σ := [v_{σ(1)}, . . . , v_{σ(d)}]⊤ be the vector v reordered according to the permutation σ ∈ Σ_d. Similarly, for a matrix M ∈ R^{d×d}, let M_σ be the matrix obtained by permuting the rows and columns of M by σ.

Let D_C[d] be the set of complete DAGs (i.e., DAGs with all possible edges). Let R ∈ {0, 1}^{d×d} be the binary strictly upper-triangular matrix whose upper triangle is all ones (R_ij = 1 for i < j, and R_ij = 0 otherwise). Then D_C[d] can be fully enumerated given R and Σ_d, as follows:

D_C[d] = {R_σ : σ ∈ Σ_d}. (3)

Therefore it is sufficient to learn σ in step (i), and then learn which edges to drop in step (ii).

The vector parameterization. Imagine now that θ ∈ R^d defines a score for each node, and these scores induce an ordering σ(θ): the smaller the score, the earlier the node should be in the ordering. Formally, the following optimization problem finds such an ordering:

σ(θ) ∈ arg max_{σ ∈ Σ_d} θ⊤ρ_σ, where ρ = [1, 2, . . . , d]. (4)

Note that a simple oracle solves this optimization problem: sort θ into increasing order (as given by the Rearrangement Inequality (Hardy et al., 1952, Thms. 368–369)). We emphasize that we write '∈' in eq. (4) since the r.h.s. can be a set: this happens exactly when some components of θ are equal.

Besides being intuitive and efficient, the parameterization of D_C[d] given by θ → R_σ(θ) allows us to upper bound the structural Hamming distance (SHD) between any two complete DAGs (R_σ(θ) and R_σ(θ′)) by the number of hyperplanes of "equal coordinates" (H_i,j = {x ∈ R^d : x_i = x_j}) that are traversed by the segment connecting θ and θ′ (see Figure 4 in the Appendix for a schematic). More formally, we state the following theorem. Theorem 4.1 (Global sensitivity). 
For any θ ∈ R^d and θ′ ∈ R^d,

SHD(R_σ(θ), R_σ(θ′)) ≤ ∫_{t ∈ [0,1]} Σ_i Σ_{j>i} δ_{H_i,j}(θ + t(θ′ − θ)) dt,

where δ_A(x) is the (generalized) Dirac delta that evaluates to infinity if x ∈ A and 0 otherwise. In particular, Theorem 4.1 shows that we can expect small changes in θ (e.g., due to gradient-based iterative optimization) to lead to small changes in the complete-DAG space, offering a result reminiscent of Lipschitz smoothness in smooth optimization. We defer the proof and further commentary (including a comparison to the parameterization based on permutation matrices) to Appendix C.

Learning θ with gradients. Notice that we cannot take gradients through eq. (4) because (a) σ(θ) is not even a function (it may be a set), and (b) even if we restrict the parameter space to exclude ties, the mapping is piecewise constant and uninformative for gradient-based learning. To circumvent these issues, Blondel et al. (2020) propose to relax problem (4) by optimizing over the convex hull of all permutations of ρ, that is, the order-d Permutahedron P[d] := conv{ρ_σ | σ ∈ Σ_d}, and adding a convex regularizer. These alterations yield a class of differentiable mappings (soft permutations), indexed by τ ∈ R_+,

μ(θ) = arg max_{μ ∈ P[d]} θ⊤μ − (τ/2) ∥μ∥_2^2, (6)

which, in the absence of ties, are exact for τ → 0. This technique is, however, unsuitable for our case, as the μ(θ)'s do not describe valid permutations except when they take values on vertices of P[d]. Instead, we show next how to obtain meaningful gradients whilst maintaining validity by adapting the SparseMAP and top-k sparsemax operators to our setting.

Leveraging sparse relaxation methods. Let D = d! be the total number of permutations of d elements, and △_D be the D-dimensional simplex. We can (non-uniquely) decompose μ = Σ_{σ ∈ Σ_d} α_σ ρ_σ for some α ∈ △_D. Plugging this into eq. 
(6) leads to

α_sparseMAP(θ) ∈ arg max_{α ∈ △_D} θ⊤E_{σ∼α}[ρ_σ] − (τ/2) ∥E_{σ∼α}[ρ_σ]∥_2^2. (7)

Alternatively, because the only term in the regularization that matters for optimization is α, we can regularize it alone, and directly restrict the number of non-zero entries of α to some k > 2, as follows:

α_top-k sparsemax(θ) ∈ arg max_{α ∈ △_D, ∥α∥_0 ≤ k} θ⊤E_{σ∼α}[ρ_σ] − (τ/2) ∥α∥_2^2, (8)

where we assume ties are resolved arbitrarily.

Algorithm 1: Top-k permutations.
  Data: k ∈ [d!], θ ∈ R^d
  Result: the top-k permutations T_k(θ)
  T_k(θ) ← ∅; P(θ) ← {σ_1}, where σ_1 ∈ arg max_{σ ∈ Σ_d} g_θ(σ);
  while |T_k(θ)| ≤ k do
      σ ← an element of arg max_{σ ∈ P(θ) \ T_k(θ)} g_θ(σ);
      P(θ) ← P(θ) ∪ {σ with positions j and j+1 swapped | j ∈ [d−1]};
      T_k(θ) ← T_k(θ) ∪ {σ};
  end

This is a formulation of the top-k sparsemax introduced in Section 3.3 that we can efficiently employ to learn DAGs, provided we have access to a fast algorithm returning a set of k permutations with the highest values of g_θ(σ) = θ⊤ρ_σ. Algorithm 1 describes such an oracle, which, to our knowledge, has not been derived before and may be of independent interest. The algorithm restricts the search for the next best solution to the set of permutations that are one adjacent transposition away from the best solutions found so far. We refer the reader to Appendix A for the notation and a proof of correctness of Algorithm 1.

Remark: Top-k sparsemax optimizes over the highest-scoring permutations, while SparseMAP draws any set of permutations that the marginal solution can be decomposed into. As an example, for θ = 0, SparseMAP returns two permutations, σ and its inverse σ^{-1}. On the other hand, because at 0 all permutations have the same probability, top-k sparsemax returns an arbitrary subset of k permutations which, when using the oracle presented above, will lie on the same face of the permutahedron. In Appendix D we provide an empirical comparison of these two operators when applied to the DAG learning problem.
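A compact way to realize Algorithm 1 is a best-first search: start from the global argmax (sort θ), and repeatedly expand the current best candidate by all adjacent transpositions. The sketch below is ours; it replaces the sets P(θ) and T_k(θ) with a heap and a visited set, and relies on the property claimed above that each next-best permutation is one adjacent transposition away from an already-selected one.

```python
import heapq
import numpy as np

def topk_permutations(theta, k):
    """Return k permutations with the highest g(sigma) = theta^T rho_sigma,
    exploring candidates via adjacent transpositions (sketch of Algorithm 1)."""
    d = len(theta)
    rho = np.arange(1, d + 1)

    def g(perm):
        # theta reordered by perm, paired with positions 1..d
        return float(theta[list(perm)] @ rho)

    best = tuple(int(i) for i in np.argsort(theta))  # global argmax: sort theta
    heap = [(-g(best), best)]                        # max-heap via negated scores
    seen = {best}
    out = []
    while heap and len(out) < k:
        _, perm = heapq.heappop(heap)
        out.append(perm)
        # Candidate pool: all permutations one adjacent transposition away.
        for j in range(d - 1):
            nb = list(perm)
            nb[j], nb[j + 1] = nb[j + 1], nb[j]
            nb = tuple(nb)
            if nb not in seen:
                seen.add(nb)
                heapq.heappush(heap, (-g(nb), nb))
    return out
```

For small d this can be checked against brute-force enumeration of all d! permutations.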

4.2. DAG LEARNING

In order to select which edges to drop from R, we regularize the set of edge functions {f_{ϕ_j}}_{j=1}^d to be sparse via Ω(Φ) in eq. (2), e.g., ∥Φ∥_0 or ∥Φ∥_1. Incorporating the sparse decompositions of eq. (7) or eq. (8) into the original problem in eq. (2) yields

min_{θ,Φ} E_{σ∼α⋆(θ)} [ Σ_{j=1}^d ℓ(x_j, f_{ϕ_j}(X • (R_σ)_j)) + λΩ(Φ) ], (9)

where (R_σ)_j is the jth column of R_σ and α⋆ is the top-k sparsemax or the SparseMAP distribution. Notice that for both sparse operators, in the limit τ → 0+ the distribution α⋆ puts all probability mass on one permutation, σ(θ) (the sorting of θ), and thus eq. (9) is a generalization of eq. (2). We can solve the above optimization problem for the optimal θ, Φ jointly via gradient-based optimization. The downside, however, is that training may move towards poor permutations purely because Φ is far from optimal: at each iteration, the distribution over permutations α⋆(θ) is updated based on functional parameters Φ that are, on average, good for all selected permutations. In early iterations, this approach can be highly suboptimal, as it cannot escape from high-error local minima. To address this, we may push the optimization over Φ inside the objective:

min_θ E_{σ∼α⋆(θ)} [ Σ_{j=1}^d ℓ(x_j, f_{ϕ⋆(σ)_j}(X • (R_σ)_j)) ] (10)
s.t. Φ⋆(σ) = arg min_Φ Σ_{j=1}^d ℓ(x_j, f_{ϕ_j}(X • (R_σ)_j)) + λΩ(Φ).

This is a bi-level optimization problem (Franceschi et al., 2018; Dempe & Zemkoho, 2020) where the inner problem fits one set of structural equations {f_{ϕ_j}}_{j=1}^d per σ ∼ α⋆(θ). Note that, as opposed to many other settings (e.g., in meta-learning), the outer objective depends on θ only through the distribution α⋆(θ), and not through the inner problem over Φ. 
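The inner problem of eq. (10) can be sketched as follows: given a permutation, each node is regressed only on the nodes that precede it in the order. We use a closed-form ridge fit here as an illustrative stand-in for the Lasso or L0 solvers discussed in the text; the function name and the `lam` parameter are ours.

```python
import numpy as np

def fit_edges_for_order(X, perm, lam=0.1):
    """Inner problem of eq. (10) for one permutation `perm` (node indices in
    topological order): fit one linear model per node using only earlier
    nodes as candidate parents; return the weights and the total loss."""
    n, d = X.shape
    W = np.zeros((d, d))
    loss = 0.0
    rank = np.empty(d, dtype=int)
    rank[perm] = np.arange(d)
    for j in range(d):
        parents = [i for i in range(d) if rank[i] < rank[j]]
        if not parents:
            loss += np.sum(X[:, j] ** 2)
            continue
        P = X[:, parents]
        # Ridge-regularized least squares, a differentiability-free solver.
        w = np.linalg.solve(P.T @ P + lam * np.eye(len(parents)), P.T @ X[:, j])
        W[parents, j] = w
        loss += np.sum((X[:, j] - P @ w) ** 2)
    return W, loss
```

The outer problem then only needs the loss values of the fitted models, weighted by α⋆(θ), which is why no gradients of the inner solver are ever required.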
In practice, this means that the outer optimization does not require gradients (or differentiability) of the inner solutions at all, saving computation and allowing for greater flexibility in picking a solver for fitting Φ⋆(σ). For example, for Ω(Φ) = ∥Φ∥_1 we may invoke any Lasso solver, and for Ω(Φ) = ∥Φ∥_0 we may use the algorithm of Louizos et al. (2017), detailed in Appendix B. The downside of this bi-level formulation is that it is tractable only when the support of α⋆(θ) contains few permutations; optimizing θ and Φ jointly is more efficient.

4.3. COMPUTATIONAL ANALYSIS

The overall complexity of our framework depends on the choice of sparse operator for learning the topological order and on the choice of estimator for learning the edge functions. We analyze the complexity of learning topological orderings specific to our framework, and refer the reader to previous works for the analyses of particular estimators (e.g., Efron et al. (2004)). Note that, independently of the choice of sparse operator, the space complexity of DAGuerreotype is at least of the order of the edge masking matrix R, hence O(d²). This is in line with most methods based on continuous optimization and can be improved by imposing additional constraints on the in-degree and out-degree of a node. SparseMAP. Each SparseMAP iteration involves a length-d argsort and a Cholesky update of an s-by-s matrix, where s is the size of the active set (the number of selected permutations), which by Carathéodory's convex hull theorem (Reay, 1965) can be bounded by d + 1. Given a fixed number of iterations K (as in our implementation), this leads to a time complexity of O(Kd²) and a space complexity of O(s² + sd) for SparseMAP. Furthermore, we warm-start the sorting algorithm with the last selected permutation. Both in theory and in practice, this is better than the O(d³) complexity of maximization over the Birkhoff polytope. Top-k sparsemax. The complexity of top-k sparsemax is dominated by that of the top-k oracle. In our implementation, the top-k oracle has an overall time complexity of O(K²d²) and space complexity of O(Kd²) (as detailed in Appendix A) when searching for the best K permutations. When K is fixed, as in our implementation, this leads to an overall complexity of the top-k sparsemax operator of O(d²). In practice, K has to be of order smaller than √d for our framework to be more efficient than existing end-to-end approaches.

4.4. RELATIONSHIP TO PREVIOUS DIFFERENTIABLE ORDER-BASED METHODS

The advantages of our method over Cundy et al. (2021) and Charpentier et al. (2022) are: (a) our parametrization, based on sorting, improves efficiency in practice (Appendix D) and has theoretically stabler learning dynamics, as measured by our bound on the SHD (Theorem C.1); (b) our method allows for any downstream edge estimator, including non-differentiable ones, critically enabling off-the-shelf estimators; (c) empirically, our method vastly improves over both approaches in terms of SID and especially SHD on both real-world and synthetic data (Section 5). Reisach et al. (2021) recently demonstrated that commonly studied synthetic benchmarks have a key flaw. Specifically, for linear additive synthetic DAGs, the marginal variance of each node increases the 'deeper' the node is in the DAG (i.e., child nodes generally have larger marginal variance than their parents). They empirically show that a simple baseline that sorts nodes by increasing marginal variance and then applies sparse linear regression matches or outperforms state-of-the-art DAG learning methods. Given the triviality of simulated DAGs, here we evaluate all methods on two real-world tasks: Sachs (Sachs et al., 2005), a dataset of cytometric measurements of phosphorylated protein and phospholipid components in human immune system cells, with d = 11 variables; and SynTReN, simulated gene-expression data from transcriptional regulatory networks, for which we use the networks made publicly available by Lachapelle et al. (2020). In Appendix D, however, we also compare different configurations of our method on synthetic datasets, as they constitute an ideal test-bed for assessing the quality of the ordering-learning step independently from the choice of structural equation estimator.

Datasets

Baselines We benchmark our framework against the following state-of-the-art methods: NoTears (both its linear (Zheng et al., 2018) and nonlinear (Zheng et al., 2020) models), the first continuous optimization method, which optimizes the Frobenius reconstruction loss and enforces the DAG constraint via an Augmented Lagrangian approach; Golem (Ng et al., 2020), another continuous optimization method, which optimizes the likelihood under Gaussian non-equal-variance error assumptions, regularized by NoTears's DAG penalty; CAM (Bühlmann et al., 2014), a two-stage approach that estimates the variable order by maximum likelihood estimation based on an additive structural equation model with Gaussian noise; NPVAR (Gao et al., 2020), an iterative algorithm that learns topological generations and then prunes edges based on node residual variances (with a Generalized Additive Models (Hastie & Tibshirani, 2017) regressor backend to estimate conditional variances); sortnregress (Reisach et al., 2021), a two-step strategy that orders nodes by increasing variance and selects the parents of a node among all its predecessors using Least Angle Regression (Efron et al., 2004); BCDNets (Cundy et al., 2021) and VI-DP-DAG (Charpentier et al., 2022), the two differentiable, probabilistic methods described in Section 2. Before evaluation, we post-process the graphs found by NoTears and Golem by first removing all edges with absolute weights smaller than 0.3 and then iteratively removing edges ordered by increasing weight until obtaining a DAG, as the learned graphs often contain cycles.
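The post-processing step described above can be sketched as follows. This is a minimal implementation for illustration; the exact thresholds and tie-breaking of the baselines' released code may differ.

```python
import numpy as np

def postprocess_to_dag(W, threshold=0.3):
    """Post-process a weighted graph into a DAG.

    First zero out edges with |weight| < threshold, then repeatedly delete
    the remaining edge with the smallest absolute weight until the graph
    is acyclic (acyclicity checked with Kahn's topological-sort peeling).
    """
    A = np.where(np.abs(W) >= threshold, W, 0.0)

    def is_dag(M):
        # Kahn's algorithm: the graph is a DAG iff all nodes can be peeled off
        B = (M != 0).astype(int)
        indeg = B.sum(axis=0)
        stack = [i for i in range(len(B)) if indeg[i] == 0]
        seen = 0
        while stack:
            u = stack.pop()
            seen += 1
            for v in np.nonzero(B[u])[0]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    stack.append(v)
        return seen == len(B)

    while not is_dag(A):
        nz = np.argwhere(A != 0)
        i, j = min(nz, key=lambda e: abs(A[e[0], e[1]]))  # weakest surviving edge
        A[i, j] = 0.0
    return A
```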
Metrics We compare the methods by two metrics assessing the quality of the estimated graphs: the Structural Hamming Distance (SHD) between the true and estimated graphs, which counts the number of edges that need to be added, removed, or reversed to obtain the true DAG from the predicted one; and the Structural Intervention Distance (SID, Peters & Bühlmann, 2015), which counts the number of causal paths that are broken in the predicted DAG. It is standard in the literature to compute both metrics, given their complementarity: SHD evaluates the correctness of individual edges, while SID evaluates the preservation of causal orderings. We further remark that these metrics privilege opposite trivial solutions. Because true DAGs are usually sparse (i.e., their number of edges is much smaller than the number of possible ones), SHD favors sparse solutions such as the empty graph. On the contrary, given a topological ordering, SID favors dense solutions (complete DAGs in the limit), as they are less likely to break causal paths. For this reason, we report both metrics and highlight the solutions on the Pareto front in Figure 2.
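As a concrete reference, the SHD between two DAG adjacency matrices can be computed as follows. This is a minimal sketch following the definition above, where a reversed edge counts as a single error; it assumes the inputs are DAGs (no 2-cycles).

```python
import numpy as np

def shd(A_true, A_pred):
    """Structural Hamming Distance between two DAG adjacency matrices.

    A pair of nodes contributes 1 whenever the two graphs disagree on it:
    a missing edge, an extra edge, or a reversed edge each count once.
    """
    D = np.asarray(A_true) != np.asarray(A_pred)
    mismatch = D | D.T  # symmetric: the pair differs in either direction
    return int(np.triu(mismatch, 1).sum())
```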

Hyper-parameters and training details

We set the hyper-parameters of all methods to their default values. For our method, we tuned them by Bayesian Optimization based on the performance, in terms of SHD and SID, averaged over several synthetic problems from different SEMs. For our method, we optimize the data likelihood (under Gaussian equal-variance error assumptions, as derived in eq. 2 of Ng et al. (2020)) and we instantiate f_{ϕ_j} as a masked linear function (linear) or as a masked MLP, as for NoTears-nonlinear. Because our approach allows for modularly solving for the functional parameters Φ, we also experiment with Least Angle Regression (LARS) (Efron et al., 2004). We additionally apply ℓ2 regularization on {θ, Φ} to stabilize training, and we standardize all datasets to ensure that all variables have comparable scales. More details are provided in Appendix D. The code for running the experiments is available at https://github.com/vzantedeschi/DAGuerreotype.
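For intuition on the training loss, a sketch of the profiled negative log-likelihood under the equal-variance Gaussian error assumption follows. This is our reading of the score (constants dropped, noise variance profiled out); the exact form used in the paper follows eq. 2 of Ng et al. (2020) and may differ by additive constants.

```python
import numpy as np

def gaussian_ev_nll(R):
    """Profiled Gaussian equal-variance negative log-likelihood (sketch).

    R is the n x d matrix of residuals x_j - f_j(X). With a single shared
    noise variance profiled out at its maximum-likelihood value, the NLL
    reduces, up to additive constants, to a log of the mean squared residual.
    """
    n, d = R.shape
    return 0.5 * n * d * np.log((R ** 2).mean())
```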

5.2. RESULTS

We report the main results in Figure 2, where we omit some baselines (e.g., DAGuerreotype with sparseMAP) for the sake of clarity. We present the complete comparison in Appendix D, together with additional metrics and studies on the sensitivity of DAGuerreotype to the choice of key hyper-parameters. To give a better idea of the learned graphs, we also plot in Figure 3 the graphs learned by the best-performing methods on Sachs. We observe that the solutions found by NPVAR and the matrix-exponential regularized methods (NoTears and Golem) are the best in terms of SHD, but have the worst SIDs. This can be explained by the high sparsity of their predicted graphs (see the number of edges in Tables 1 and 2). On the contrary, the high density of the solutions of VI-DP-DAG makes them among the best in terms of SID and the worst in terms of SHD. DAGuerreotype provides solutions with a good trade-off between these two metrics, lying on the Pareto front. Inevitably, its performance strongly depends on the choice of edge estimator for the problem at hand. For instance, the linear estimator is better suited for Sachs than for SynTReN. In Appendix D, we assess our method's performance independently from the quality of the estimator with experiments on synthetic data, where the underlying SEM is known and the estimator can be chosen accordingly.

6. CONCLUSION, LIMITATIONS AND FUTURE WORK

In this work, we presented DAGuerreotype, a permutation-based method for end-to-end learning of directed acyclic graphs. While our approach shows promising results in identifying DAGs, the optimization procedure can still be improved. Alternative choices of estimators, such as Generalized Additive Models (Hastie & Tibshirani, 2017) as used in CAM and NPVAR, could be considered to improve identification. Another avenue for improvement would be to include interventional datasets at training time. It would be interesting to study in this context whether our framework is more sample-efficient, i.e., whether it allows learning the DAG with fewer interventions or observations.

A TOP-K ORACLE

In this section, we propose an efficient algorithm for finding the top-k scoring rankings of a vector and prove its correctness. We leverage this result in order to apply the top-k sparsemax operator (Correia et al., 2020) to our DAG learning problem. We are not aware of existing work on this topic and believe this result is of independent interest.

A.1 NOTATION AND DEFINITIONS

Let us denote by g_θ(σ) = ⟨θ, ρ_σ⟩ the objective function evaluating the quality of a permutation σ, and denote one of its maximizers as σ¹ ∈ argmax_{σ∈Σ_d} g_θ(σ), corresponding to the argsort of θ. Notice that we can equivalently write this objective as g_θ(σ) = ⟨θ_σ, ρ⟩ by applying the permutation to θ.

Definition A.1 (Top-k permutations). We denote by T_k(θ) ⊆ Σ_d a sequence of k highest-scoring permutations according to g_θ, i.e., T_k(θ) = {σ¹, σ², …, σᵏ} with g_θ(σ¹) ≥ g_θ(σ²) ≥ … ≥ g_θ(σᵏ) ≥ g_θ(σ′) for any σ′ ∉ T_k(θ).

In this section, we will make use of the following definition and lemma.

Definition A.2 (Adjacent transposition). σ^{(j j+1)} denotes the permutation obtained from σ by the 2-cycle (j j+1), which transposes (flips) the two consecutive elements σ_j and σ_{j+1}.

Lemma A.1. Let σ ∈ Σ_d be a permutation. Exactly one of the following holds: 1. g_θ(σ) = g_θ(σ¹); 2. there exists an adjacent transposition (j j+1) such that g_θ(σ^{(j j+1)}) > g_θ(σ).

Proof. Denote by θ_σ the permutation of θ by σ and let J = {j ∈ [d−1] | θ_{σ_j} > θ_{σ_{j+1}}}. If J = ∅ then θ_σ is in increasing order, so by the Rearrangement Inequality (Hardy et al., 1952, Thms. 368-369) σ is a 1-best permutation. Otherwise, applying the adjacent transposition (j j+1) for j ∈ J increases the score: g_θ(σ^{(j j+1)}) − g_θ(σ) = θ_{σ_j} − θ_{σ_{j+1}} > 0.

A.2 BEST-FIRST SEARCH ALGORITHM

Algorithm 2 finds the set of k best-scoring permutations given θ. Starting from an optimum of g_θ, the algorithm grows the set of candidate permutations P(θ) by adding all those that are one adjacent transposition away from the i-th best permutation at iteration i. It then selects the best-scoring permutation in P(θ) to be the top-(i+1) solution. Theorem A.2 proves that an (i+1)-th best solution must lie in this set P(θ), hence that Algorithm 2 is correct.

Theorem A.2 (Correctness of Algorithm 2). Given a sequence of (k−1)-best permutations T_{k−1}(θ) = {σ¹, σ², …, σ^{k−1}}, there exists a k-th best σᵏ, i.e., one satisfying g_θ(σ^{k−1}) ≥ g_θ(σᵏ) ≥ g_θ(σ′) for any σ′ ∉ T_{k−1}(θ), with the property that σᵏ = (σⁱ)^{(j j+1)} for some i ∈ {1, …, k−1} and j ∈ {1, …, d−1}.

Proof. Let σᵏ be a k-th best permutation. Case 1: g_θ(σᵏ) ≠ g_θ(σ¹). Invoking Lemma A.1, there is an adjacent transposition with g_θ((σᵏ)^{(j j+1)}) > g_θ(σᵏ). Since the inequality is strict, we must have (σᵏ)^{(j j+1)} ∈ T_{k−1}(θ). Case 2: g_θ(σᵏ) = g_θ(σ¹). In this case, we have at least k permutations tied for first place. Any two permutations with equal score can only differ in indices that correspond to ties in θ. Therefore, any two tied permutations are connected by a trajectory of transpositions of equal score. This is a face of the permutahedron, of size c ≥ k, containing T_{k−1}(θ). This face must contain at least one permutation that is one adjacent transposition away from one of T_{k−1}(θ), and we may as well take this one as the k-th best instead of σᵏ.

Algorithm 2: Top-k permutations.
Data: k ∈ {1, …, d!}, θ ∈ R^d
Result: top-k permutations T_k(θ)
P(θ) ← {σ¹ ∈ argmax_{σ∈Σ_d} g_θ(σ)}  /* initialize the candidate set with an optimum */
while |T_k(θ)| < k do
  σ ← a maximizer in argmax_{σ∈P(θ)∖T_k(θ)} g_θ(σ)

The most expensive operation is checking that the selected best candidate is not already in T_{k−1}(θ). At worst, this requires going through the whole P(θ) and comparing its elements, in order, to all top-k solutions, hence no more than K × |P(θ)| comparisons of cost O(d) each. This leads to an overall time complexity of O(K²d²). The space complexity is dominated by the size of P(θ), which contains at most Kd vectors of size d, leading to O(Kd²).
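The best-first search can be sketched in a few lines of Python. This is an illustrative implementation, not the paper's released code; it assumes ρ = (1, …, d), so that the argsort of θ maximizes g_θ, and uses a heap over candidates instead of an explicit set scan.

```python
import heapq

def topk_permutations(theta, k):
    """Best-first search for the k highest-scoring permutations of theta.

    g(sigma) = <theta[sigma], rho> with rho = (1, ..., d); the argsort of
    theta (increasing) is a maximizer by the rearrangement inequality.
    Candidates are generated by adjacent transpositions of selected
    permutations, mirroring Algorithm 2.
    """
    d = len(theta)
    rho = list(range(1, d + 1))  # assumed strictly increasing position scores

    def score(s):
        return sum(theta[s[i]] * rho[i] for i in range(d))

    best = tuple(sorted(range(d), key=lambda i: theta[i]))  # argsort of theta
    heap = [(-score(best), best)]  # max-heap via negated scores
    seen = {best}
    out = []
    while heap and len(out) < k:
        neg, sigma = heapq.heappop(heap)
        out.append((-neg, list(sigma)))
        for j in range(d - 1):  # expand all adjacent transpositions
            nxt = list(sigma)
            nxt[j], nxt[j + 1] = nxt[j + 1], nxt[j]
            t = tuple(nxt)
            if t not in seen:
                seen.add(t)
                heapq.heappush(heap, (-score(t), t))
    return out
```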

B L0 REGULARIZATION

In the linear and non-linear variants of DAGuerreotype, we implement the regularization term Ω of the inner problem of eq. (10) with an approximate L0 regularizer. The exact L0 norm

∥w∥_0 = ∑_{i=1}^n 1[w_i ≠ 0]   (11)

counts the number of non-zero entries of w ∈ R^n and, when used as a regularizer, favors sparse solutions without injecting other priors. However, its combinatorial nature and its non-differentiability make its optimization intractable. Following Louizos et al. (2017), we reparameterize (11) by introducing a set of binary variables z ∈ {0,1}^n and letting w = ŵ ∘ z, so that ∥w∥_0 = ∑_i z_i. Next, we let z ∼ p(z; π) = Bernoulli(π), where π ∈ [0,1]^n. For linear SEMs, we can now reformulate the inner problem of (10) as follows:

min_{ŵ,π} ∑_{j=1}^d { E_{z_j∼p(z_j; π_j)} [ ℓ(x_j, f_{ŵ_j∘z_j}(X, (M_σ)_j)) ] + ξ ∑_{i=1}^d π_{ji} },   (12)

where the (inner) decision variables are now matrices and ξ ≥ 0 is a hyper-parameter. In the non-linear (MLP) case, we achieve sparsity at the graph level by group-regularizing the parameters corresponding to each input variable. In the experiments, we optimize (12) using the one-sample Monte Carlo straight-through estimator and set the final functional parameters as

w⋆ = ŵ⋆ ∘ MAP[p(·; π)] = ŵ⋆ ∘ H(π − ½·1),   (13)

where H is the Heaviside step function. We leave the implementation of more sophisticated strategies, such as the relaxation with the hard-concrete distribution presented in Louizos et al. (2017) or other estimators (e.g., Paulus et al., 2020; Niepert et al., 2021), to future work.

In the remainder, we characterize the complete DAG M_{σ⋆} induced by a MAP permutation σ⋆ and how it varies as a function of the score vector θ ∈ R^d. The permutation σ⋆ is a MAP state (or mode) of both the sparseMAP and the sparsemax distributions (as well as the standard categorical/softmax distribution). For brevity, we rename M_{σ⋆} := M(θ) in the following.
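The one-sample straight-through scheme of the L0 reparameterization above can be sketched as follows, assuming PyTorch; the helper name is illustrative, not the paper's released implementation.

```python
import torch

def l0_gate(w_hat, logits):
    """Straight-through Bernoulli gates for approximate L0 regularization.

    w = w_hat * z with z ~ Bernoulli(pi); the expected L0 penalty is
    sum(pi). The forward pass uses a hard {0,1} sample, while the backward
    pass routes gradients through pi (straight-through estimator).
    """
    pi = torch.sigmoid(logits)        # gate probabilities
    z_hard = torch.bernoulli(pi)      # hard sample, non-differentiable
    z = z_hard + pi - pi.detach()     # value = z_hard, gradient flows via pi
    return w_hat * z, pi.sum()        # masked weights, expected L0 penalty
```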
We choose to work on the space of complete DAGs because there is a one-to-one correspondence between topological orderings and complete DAGs. This is not generally true when analyzing the space of DAGs, as a permutation does not uniquely identify a DAG and vice versa.

C.1 PRELIMINARY

For the results of this section, we will use the following relationship between transpositions (adjacent or not) and the SHD.

Proposition C.0.1 (SHD difference after a flip). Consider θ ∈ R^d and θ′ obtained from θ by applying a flip (i j), with the convention that i < j. All the edges from nodes ranked between i and j directed towards i or j need to be reversed. Thus, the SHD between the complete DAGs is SHD(M(θ), M(θ′)) = 2(j−i) − 1 (no undirected edges are added or removed, and 2(j−i) − 1 edges are reversed).

From Proposition C.0.1 we deduce that the SHD difference after applying an adjacent flip is SHD(M(θ), M(θ′)) = 1.

C.2 ANALYSIS

Recall that σ⋆ sorts the elements of θ in increasing order. Then, the points θ ∈ R^d where argmax_σ ⟨θ, ρ_σ⟩ is non-singleton (degeneracy) are exactly those that have at least one tie among their entries (i.e., ∃ i, j such that θ_i = θ_j). Following this simple observation, we can populate R^d with (d choose 2) hyper-planes (of dimensionality d − 1), denoted H_{i,j} with i < j, such that ∀θ ∈ H_{i,j}, θ_i = θ_j. Intersections of such hyper-planes are lower-dimensional subspaces where more than 2 entries of θ are equal. For d = 3, Figure 4 depicts the (3 choose 2) = 3 hyper-planes in orange, green and blue, and their intersection (a line) in gray.

The hyper-planes {H_{i,j}}_{i<j} delimit exactly d! open cones of the form C_σ = {θ ∈ R^d | θ_{σ_1} < θ_{σ_2} < ⋯ < θ_{σ_d}}, σ ∈ Σ_d, i.e., each cone is the space of all points that induce the same MAP permutation and do not contain any ties. This partition of the space allows us to reason about the sensitivity of our MAP estimates to changes in the parameters. For instance, since the points of a cone do not have any ties, changes to the score vector within a cone do not affect the SHD: ∀θ, θ′ ∈ C, SHD(M(θ), M(θ′)) = 0.

We first analyze the sensitivity to single-entry changes, hence assessing how much an entry of θ can be individually perturbed without changing its MAP. Proposition C.0.2 provides the entry-wise ranges in which the optimal solution set does not change.

Proposition C.0.2 (Entry-wise intervals). For any θ ∈ R^d and ε_i ∈ R, argmax ⟨θ_σ, ρ⟩ = argmax ⟨θ_σ + ε_i e_i, ρ⟩ if and only if:
• for i ∈ {2, …, d−1}: ε_i ∈ (θ_{σ⋆_{i−1}} − θ_{σ⋆_i}, θ_{σ⋆_{i+1}} − θ_{σ⋆_i});
• for i = 1: ε_i ∈ (−∞, θ_{σ⋆_{i+1}} − θ_{σ⋆_i});
• for i = d: ε_i ∈ (θ_{σ⋆_{i−1}} − θ_{σ⋆_i}, ∞);
where e_i denotes the i-th standard unit vector and σ⋆ ∈ argmax ⟨θ, ρ_σ⟩.

We can relate this result to changes in terms of the SHD by making the following observation: if we gradually increase one coordinate θ_i, initially ranked σ⋆_i, its rank in the optimal ordering changes as soon as it becomes greater than the coordinate right after it in the ordering, entailing an adjacent transposition between the two coordinates. Greater perturbations entail a longer sequence of adjacent transpositions. By Proposition C.0.1, this implies an increase in SHD of 1 for each transposition. Visually, the SHD increases by 1 every time the perturbed vector crosses a hyper-plane along its perturbation direction, as this corresponds to swapping two components that are adjacent when optimally sorted. For comparison, we can apply the same reasoning to the matrix parametrization of the linear assignment problem, deployed by Cundy et al.
(2021) for learning permutation matrices. In this context, we can leverage the edge sensitivity analysis reviewed in Michael et al. (2020, Equations 6-7). As a general remark, it is not as intuitive to determine the entry-wise intervals for this problem, not only because we have d² variables instead of d, but especially because it requires solving a different assignment problem per entry. Furthermore, the minimal perturbation that changes the MAP does not necessarily result in flipping two adjacent nodes, entailing an SHD relative to the optimal permutation of at least 1. Consider, for example, the following 3 × 3 matrix parameter:

Θ =
  16 16 15
   5 16 10
  16  9 10

Its optimal permutation matrix corresponds to the rank (3, 1, 2), with a score of 29. The minimal perturbations of Θ_{1,3} or Θ_{3,2} that change the MAP result in the solution (2, 1, 3), which has an SHD of 3 w.r.t. the optimum (and the second-best score of 31). For problems of scale larger than this example, the resulting SHD can take values up to 2d − 3 for non-adjacent transpositions. We now generalize and formalize this relationship between perturbations and SHD changes for general vector perturbations (global sensitivity). Theorem C.1 upper-bounds the SHD between the complete DAGs of any pair of score vectors by the number of hyper-planes crossed by the segment connecting them.

Theorem C.1 (Global sensitivity). For any θ, θ′ ∈ R^d,

SHD(M(θ), M(θ′)) ≤ ∫₀¹ ∑_{i<j} δ_{H_{i,j}}(θ + t(θ′ − θ)) dt,   (15)

where δ_A(x) is the (generalized) Dirac delta that evaluates to infinity if x ∈ A and to 0 otherwise.

Proof. Let us first consider the case where θ ∈ C and θ′ ∈ C′, i.e., the two vectors do not contain ties. Let us denote by σ (respectively σ′) the permutation that sorts the components of a point of C (C′) in increasing order.
The minimal-length sequence of adjacent flips that need to be applied to σ to obtain σ′ has length equal to the number of times the segment connecting their parameters crosses a hyper-plane. The SHD between their complete DAGs then equals this minimal number of adjacent flips, which proves the result. When either θ or θ′ lies on a hyper-plane, the minimal required number of adjacent flips might be smaller, hence the upper bound in Theorem C.1. In Figure 4, the segment connecting the two red dots intersects the green and blue hyper-planes, and hence the resulting complete DAGs have an SHD of at most 2 (in fact, exactly 2 in this case). This intuitive characterization links the SHD in the complete-DAG space to a partition of R^d induced by the "sorting" operator (i.e., the MAP, or maximizer of the linear program) applied to the score vector θ. This implies that, during optimization, if the iterates θ_k lie in the interior of some cone (which happens almost surely), then it is very likely that updating the parameters results in small changes to the SHD, unless the parameters all have similar values. A deeper analysis of the vector parameterization is the object of future work, as we hope it can fuel further improvements in the optimization algorithm, such as better initialization strategies or reparameterizations. For instance, optimizing over S^{d−1}_r (polar coordinates) would expose the role of the radius r as similar to the temperature parameter of the distributions we consider (the smaller the radius, the higher the "temperature"). Typically, the temperature is kept constant during training or annealed, suggesting this might be advantageous also in our scenario. We leave the exploration of this strategy to future work.
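For tie-free score vectors, the bound of Theorem C.1 is attained and equals the number of index pairs ordered differently by the two argsorts (the Kendall-tau distance between the orderings). A quick check, under the assumption that both vectors lie in the interior of a cone:

```python
def complete_dag_shd(theta, theta_prime):
    """SHD between the complete DAGs M(theta) and M(theta_prime).

    Assumes neither vector has ties: each pair (i, j) ranked differently by
    the two vectors corresponds to one crossed hyper-plane H_{i,j}, i.e.,
    one reversed edge, so the SHD is the Kendall-tau distance between the
    two induced orderings.
    """
    d = len(theta)
    return sum(
        1
        for i in range(d)
        for j in range(i + 1, d)
        if (theta[i] - theta[j]) * (theta_prime[i] - theta_prime[j]) < 0
    )
```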

D ADDITIONAL EXPERIMENTS

We provide a detailed description of the experimental setup and report additional results. The method is implemented in PyTorch (Paszke et al., 2019), and the code used for carrying out the experiments is included in the supplementary material. All experiments were run on a machine with 16 cores, 32GB of RAM and an NVIDIA A100-SXM4-80GB GPU.

DAGuerreotype's optimization and evaluation For our method, we optimize the data likelihood (under Gaussian equal-variance error assumptions, as derived in eq. 2 of Ng et al. (2020)). We optimize the bi-level problem (10) when not specified otherwise. We report results for the following three variants of DAGuerreotype, differing in the edge estimator: (linear) with L0 regularization, f_{ϕ_j}(X, A_j) = X(ϕ_j ∘ A_j) with ϕ_j ∈ R^d; (non-linear) with L0 regularization, f_{ϕ_j}(X, A_j) = h_j(g_j(X, A_j)), with h_j a locally connected MLP with one hidden layer, 50 hidden units and sigmoid activation, g_j a linear layer with d × 50 hidden units (50 per parent i) masking out all non-parents of j (according to A_j), and ϕ_{ji} = ∥g_{ji}∥_2 as for NoTears-nonlinear; (LARS) with Least Angle Regression (Efron et al., 2004), f_{ϕ_j}(X, A_j) = X(ϕ_j ∘ A_j) with ϕ_j ∈ R^d. We additionally apply ℓ2 regularization on {θ, ϕ} to stabilize training, and we standardize all datasets to ensure that all variables have comparable scales. With any variant, the outer problem is optimized for at most 5,000 iterations by gradient descent and early-stopped at approximate convergence. When using the (linear) and (non-linear) back-ends, the graph of each permutation is also optimized by gradient descent, for 1,000 epochs and likewise with early stopping based on approximate convergence. After training, the graph for the mode permutation is further fine-tuned, and the final evaluation is carried out with this model.
We also experiment with the approximate, but faster, joint optimization of θ and ϕ in problem (2) instead of the bi-level formulation; in this case we add the suffix joint to the method name. As we then need a differentiable graph estimator for jointly updating the ordering and the graph, we instantiate the method only with the (linear) and (non-linear) back-ends. These models are optimized by gradient descent for 5,000 epochs with early stopping based on approximate convergence. The default hyper-parameters of our method were chosen as follows. We set the sparse operators' temperature τ = 1 and K = 100, the strength of the ℓ2 regularizations to 0.0005, and tuned the learning rates for the outer and inner optimization in [10⁻⁴, 10⁻¹] and the pruning strength λ in [10⁻⁶, 10⁻¹]. The tuning was carried out by Bayesian Optimization using Optuna (Akiba et al., 2019) for 50 trials on synthetic problems, consisting of data generated from different types of random graphs (Scale-Free, Erdős-Rényi, Bipartite) and noise models (e.g., Gaussian, Gumbel, Uniform) with 20 nodes and 20 or 40 expected edges. For each setting, three datasets are generated by drawing a DAG, its edge weights uniformly in [−2, −0.5] ∪ [0.5, 2], and 1,000 data points. A set of hyper-parameters is then evaluated by averaging its performance on all the generated datasets, and the tuning is carried out to minimize SHD and SID jointly. The default value of a hyper-parameter was then set to the average value among those lying on the Pareto front, rounded to a single significant digit.

Baseline optimization and evaluation

All baseline methods are optimized using the code released by their authors, except for CAM, which is included in the Causal Discovery Toolbox (Kalainathan & Goudet, 2019). Before evaluation, we post-process the graphs found by NoTears and Golem by first removing all edges with absolute weights smaller than 0.3 and then iteratively removing edges ordered by increasing weight until obtaining a DAG, as the learned graphs often contain cycles. For the probabilistic baselines (VI-DP-DAG and BCDNets), we use the mode model (in particular, the mode permutation matrix) for evaluation. We set the hyper-parameters of all methods to their default values, released together with the source code, apart from the Least Angle Regression module of sortnregress, which uses the Bayesian Information Criterion for model selection.

Additional results on real-world tasks In Figures 5 and 6 and Tables 1 and 2, we extend the evaluation on real-world tasks provided in the main text. More precisely, we also report the results for DAGuerreotype with sparseMAP, and provide additional metrics for comparison: the F1 score, using the existence of an edge as the positive class, and the number of predicted edges, to assess the density of the solutions. Initializing θ with the marginal variances of the nodes consistently improves performance, compared to initializing it with the all-zeros vector.
For these experiments, we set K = 10, use the linear edge estimator, and jointly optimize all of DAGuerreotype's parameters. Compared to other differentiable order-based methods, DAGuerreotype consistently provides a significantly better trade-off between SHD and SID, confirming our findings on the real-world data. Indeed, these baselines generally discover DAGs with high false-positive rates. DAGuerreotype equipped with the sparseMAP operator also improves upon the linear continuous methods based on matrix-exponential regularization, but when equipped with the top-k sparsemax operator, its results in these settings depend on a good initialization of θ (the marginal variances in this case) and worsen with the number of nodes. A higher value of K would be required to improve DAGuerreotype-sparsemax's performance in these settings, as shown in Figure 8. In terms of running times, DAGuerreotype is aligned with NoTears and is generally faster than CAM, Golem and VI-DP-DAG. Of course, DAGuerreotype's running times strongly depend on the value of K, the choice of edge estimator, and whether the joint or the bi-level problem is optimized.



Footnotes:
• Critically, the usage of permutation matrices makes it possible to maintain a fully differentiable path from the loss to the parameters (of the permutation matrices) via Sinkhorn iterations or other (inexact) relaxation methods.
• We can recognize in (7) an instance of the SparseMAP operator introduced in Section 3.3. Among all possible decompositions, we will favor sparse ones. This is achieved by employing the active set algorithm (Nocedal & Wright, 1999), which only requires access to an oracle solving eq. (4), i.e., sorting θ.
• This computational advantage is shared with the score-function estimator (SFE; Rubinstein, 1986; Williams, 1992; Paisley et al., 2012; Mohamed et al., 2020). We are not aware of any applications of the SFE to permutation learning, likely due to the #P-completeness of marginal inference over the Birkhoff polytope (Valiant, 1979; Taskar, 2004).
• A similar reasoning applies when decreasing the value of one coordinate instead.



Figure 2: SHD vs SID on real datasets Sachs and SynTReN. Results are averaged over 10 seeds and the solutions lying on the Pareto front are circled.

Figure 3: Sachs. True DAG and DAGs predicted by the best-performing methods. We plot on the left of the bar correct and missing edges and on the right of the bar wrong edges found by each method. DAGuerreotype strikes a good balance between SID and SHD; other methods focus overly on one over the other by either predicting too few (sortnregress) or too many edges (VI-DP-DAG).

/* retrieve a best-scoring solution among the candidates that has not yet been retrieved */
P(θ) ← P(θ) ∪ {σ^{(j j+1)} | j ∈ {1, …, d−1}}  /* add the permutations one adjacent transposition away to the candidates */
T_k(θ) ← T_k(θ) ∪ {σ}  /* update the set of best permutations */
end

A.3 COMPUTATIONAL ANALYSIS

Finding σ¹ requires sorting a vector of size d (hence O(d log d) complexity). Then, at each iteration k of Algorithm 2, the maximum among the best candidates P(θ) needs to be found, which requires O(dk) operations since |P(θ)| ≤ (d−2)k, and O(d) adjacent flip operations are applied to it.

Figure 4: Representation of the degeneracy hyper-planes in R³ and of two parameter points (in red) whose connecting segment intersects two hyper-planes.

C CHARACTERIZATION AND SENSITIVITY OF THE VECTOR PARAMETRIZATION

Figure 5: SHD vs SID on Sachs. The solutions lying on the Pareto front are colored in orange and the others in blue.

Figure 7: Comparison of different strategies for learning topological orderings, on data (1,000 samples) generated from a linear SEM with Scale-Free graph and Gaussian noise, and a varying number of nodes d. In order from top to bottom, we plot SHD, SID, F1, training time in seconds, number of permutations, and number of predicted edges (all at the end of training) as a function of the sparse operators' parameter K, which corresponds to the maximal number of sampled permutations for sparsemax and to the maximal number of iterations of the active set algorithm for sparseMAP. We also include two simple variants of DAGuerreotype, where the ordering is fixed to one true ordering (true) or to a random one (random). Results are averaged over 10 seeds.

Figure 8: Comparison of different strategies for learning topological orderings, on samples of varying size (n ∈ [100, 5,000] on the x-axes) generated from a linear SEM with Scale-Free graph and Gaussian noise, and number of nodes d = 20. In order from top to bottom, we plot SHD, SID, F1, training time in seconds, number of permutations, and number of predicted edges (all at the end of training) for 4 values of the sparse operators' parameter K, which corresponds to the maximal number of sampled permutations for sparsemax and to the maximal number of iterations of the active set algorithm for sparseMAP. We also include two simple variants of DAGuerreotype, where the ordering is fixed to one true ordering (true) or to a random one (random). Results are averaged over 10 seeds.

Figure 9: Sachs. Effect of L0 pruning intensity (controlled by λ) on SHD, SID, F1 and number of learned edges for DAGuerreotype jointly optimizing a linear estimator and the topological ordering distribution either with sparsemax (sparsemax) or sparseMAP (sparseMAP).

Figure 10: SynTReN. Effect of L0 pruning intensity (controlled by λ) on SHD, SID, F1 and number of learned edges for DAGuerreotype jointly optimizing a linear estimator and the topological ordering distribution either with sparsemax (sparsemax) or sparseMAP (sparseMAP).

Figure 11: SHD vs SID on synthetic datasets generated from scale-free DAGs with Gaussian (top), Gumbel (middle) and MLP (bottom) SEMs. DAGuerreotype's (ours) θ is either initialized with a zero vector (zeros) or with the marginal variances (variances). For ease of reading, we split the full comparison (top) into two, to focus on the comparison with differentiable order-based methods (middle) and with differentiable methods based on the matrix exponential constraint (bottom).

Sachs. We report Structural Hamming Distance (SHD, the lower the better), Structural Interventional Distance (SID, the lower the better), F1 score (the higher the better), and the number of predicted edges for all methods.

SynTReN. We report Structural Hamming Distance (SHD, the lower the better), Structural Interventional Distance (SID, the lower the better), Topological Ordering Pearson Correlation (TOPC, the higher the better), F1 score (the higher the better), and the number of predicted edges, all averaged over the 10 networks.
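For concreteness, the two structural metrics reported in these tables can be computed from binary adjacency matrices as sketched below. We follow a common convention (a reversed edge counts once towards SHD; F1 is over directed edges), which may differ in detail from the exact implementation used in the paper:

```python
import numpy as np

def shd(A_true, A_pred):
    """Structural Hamming Distance between two binary adjacency
    matrices (A[i, j] = 1 iff edge i -> j). Counts missing, extra,
    and reversed edges, with a reversal counted once."""
    A_true = np.asarray(A_true, dtype=int)
    A_pred = np.asarray(A_pred, dtype=int)
    diff = (A_true != A_pred).astype(int)
    # a reversed edge appears twice in `diff`; subtract one per reversal
    rev = ((A_true == 1) & (A_true.T == 0) &
           (A_pred == 0) & (A_pred.T == 1))
    return int(diff.sum() - rev.sum())

def f1(A_true, A_pred):
    """F1 score over directed edges."""
    A_true = np.asarray(A_true, dtype=int)
    A_pred = np.asarray(A_pred, dtype=int)
    tp = int(((A_true == 1) & (A_pred == 1)).sum())
    fp = int(((A_true == 0) & (A_pred == 1)).sum())
    fn = int(((A_true == 1) & (A_pred == 0)).sum())
    return 2 * tp / max(2 * tp + fp + fn, 1)
```

For example, predicting the single true edge 0 → 1 with its orientation flipped yields SHD = 1 under this convention, the same cost as missing the edge entirely. SID, in contrast, requires comparing intervention distributions entailed by the two graphs and is not reproduced here.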

ACKNOWLEDGEMENTS

We are grateful to Mathieu Blondel, Caio Corro, Alexandre Drouin and Sébastien Paquet for discussions. Part of this work was carried out when VZ was affiliated with INRIA-London and University College London. Experiments presented in this paper were partly performed using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr). VN acknowledges support from the Dutch Research Council (NWO) project VI.Veni.212.228.

Sparsemax vs sparseMAP comparison

In Figures 7 and 8, we report an analysis of the effect of DAGuerreotype's hyper-parameter K and of the sample size n on the quality of the learned DAG. Recall that K corresponds to the maximal number of selected permutations for sparsemax and to the maximal number of iterations of the active set algorithm for sparseMAP, and that it is the principal parameter controlling the computational cost of the ordering-learning step in our framework. This analysis is carried out on data generated by a linear SEM from a scale-free graph with equal-variance Gaussian noise. We choose this simple setting for two reasons: the true DAG can be identified from (enough) observational data alone (Proposition 7.5, Peters et al., 2017), and, provided with the true topological ordering, the LARS estimator can identify the true edges. In this setting, we can thus assess the quality of sparsemax and sparseMAP independently of the quality of the estimator. For reference, in Figures 7 and 8 we also report the performance of optimizing LARS with a random ordering and with a true one. We observe that sparsemax and sparseMAP provide MAP orderings that are significantly better than random ones for K > 2 and for any sample size. For sufficiently large K, these orderings yield DAGs that are almost as good as those obtained when knowing the variable ordering. Furthermore, except when K = 2, increasing the sample size generally results in a (moderate) improvement in the performance of sparsemax and sparseMAP. However, the gap from true's performance does not shrink when increasing n or K, which can be explained by the non-convexity of the search space and DAGuerreotype getting stuck in local minima. We further observe that sparsemax generally provides better solutions than sparseMAP at comparable training time. The only settings where this is not the case are d = 30 and K > 35; notice that sparseMAP's performance peaks at K = 35 and degrades for higher K in this setting.
This phenomenon may be due to the inclusion of unnecessary orderings in sparseMAP's set, which ends up hurting training. We further study in Figures 9 and 10 the effect of the pruning strength (controlled by λ) on the performance of both operators on the real datasets. For this experiment, we instantiate DAGuerreotype with the linear estimator and train it by joint optimization. As a general remark, when strongly penalizing dense graphs (higher λ), SHD generally improves while SID degrades. On Sachs, the two operators do not provide significantly different results, while on SynTReN we find that sparsemax provides better SHD at comparable SID.
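As background on the sparse operators compared above: sparsemax (Martins & Astudillo, 2016) is the Euclidean projection onto the probability simplex, and unlike softmax it returns exactly sparse distributions. A minimal NumPy version of its sort-based closed form is given below; note that DAGuerreotype applies such operators over the permutahedron (with sparseMAP using an active set algorithm), so this simplex version is only an illustrative analogue:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex
    (Martins & Astudillo, 2016). Returns a sparse probability vector."""
    z = np.asarray(z, dtype=float)
    u = np.sort(z)[::-1]           # sort scores in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(z) + 1)
    # support condition: 1 + k * u_k > sum of the k largest scores
    support = 1 + ks * u > css
    k = ks[support][-1]            # size of the support
    tau = (css[support][-1] - 1) / k
    return np.maximum(z - tau, 0.0)
```

For instance, sparsemax([1.0, 0.5, -1.0]) puts zero mass on the last coordinate (returning [0.75, 0.25, 0.0]), whereas softmax would assign it strictly positive probability; this exact sparsity is what limits the number of orderings kept in the support.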

