SPACETIME REPRESENTATION LEARNING

Abstract

Much of the data we encounter in the real world can be represented as directed graphs. In this work, we introduce a general family of representations for directed graphs through connected time-oriented Lorentz manifolds, called "spacetimes" in general relativity. Spacetimes intrinsically contain a causal structure that indicates whether or not there exists a causal or even chronological order between points of the manifold, called events. This chronological order allows us to naturally represent directed edges via imposing the correct ordering when the nodes are embedded as events in the spacetime. Previous work in machine learning only considers embeddings lying on the simplest Lorentz manifold or does not exploit the connection between Lorentzian pre-length spaces and directed graphs. We introduce a well-defined approach to map data onto a general family of spacetimes. We empirically evaluate our framework in the tasks of hierarchy extraction of undirected graphs, directed link prediction and representation of directed graphs.

1. INTRODUCTION

Most of the machine learning literature has focused on learning representations that lie on a Riemannian manifold such as the Euclidean space, the d-sphere (e.g., $\ell_2$-normalized representations) (Wang et al., 2017; Tapaswi et al., 2019), hyperbolic geometry to represent graphs without cycles (Nickel & Kiela, 2017), or a statistical manifold in information geometry (Amari, 1998). Concepts of Euclidean geometry, such as distances, are naturally generalized to Riemannian geometry, which remains easy to interpret. In contrast, recent approaches have considered learning representations that lie on a pseudo-Riemannian manifold to extract hierarchies in graphs with cycles (Law & Stam, 2020; Law, 2021) or represent directed graphs (Clough & Evans, 2017; Sim et al., 2021). Pseudo-Riemannian manifolds are generalizations of Riemannian manifolds where the constraint of positive definiteness of the nondegenerate metric tensor is relaxed. The machine learning literature on pseudo-Riemannian manifolds can be divided into two categories. The first category focuses on how to optimize a given function whose domain is a pseudo-Riemannian manifold and does not take into account whether the manifold is time-oriented or not (Law & Stam, 2020; Law, 2021). The second category exploits the interpretation of a specific family of pseudo-Riemannian manifolds called "spacetimes" in general relativity (Clough & Evans, 2017; Sim et al., 2021). More specifically, spacetimes are connected time-oriented Lorentz manifolds. They intrinsically contain a causal structure that indicates whether or not there exists a causal order between points of the manifold, called events. This causal structure has been utilized to represent directed graphs where each node is an event and the existence of an arc (i.e., directed edge) between two nodes depends on the causal character of the curves joining them (Bombelli et al., 1987).
In particular, Clough & Evans (2017) consider learning representations via the Minkowski spacetime which is the simplest such manifold. On the other hand, Sim et al. (2021) use three types of spacetimes and propose an ad hoc method based on the sign of some time coordinate difference function to determine the orientation of edges. The sign of such a function is not always meaningful as it for instance alternates periodically when the manifold is non-chronological and does not generalize to all spacetimes. Moreover, the distance function that they optimize is constant when two points cannot be joined by a geodesic. Contributions. We propose a framework inspired by Lorentzian causality theory (Kronheimer & Penrose, 1967; Minguzzi, 2019) , and in particular Lorentzian pre-length spaces (Kunzinger & Sämann, 2018) , to learn directed graph representations lying on a large family of spacetimes. To this end, we present tools to account for time-orientation and exploit distances specific to Lorentz geometry. In particular, we propose to restrict the existence of edges to pairs of nodes whose representations lie in an open globally hyperbolic convex normal neighborhood. Such a neighborhood can be defined for any spacetime (see Theorem 2.7 of Minguzzi (2019) ) and admits simple distance and time separation functions whose sign determines the direction of edges. We experimentally show that spacetimes can extract hierarchies in social networks better than standard approaches. Our framework also outperforms existing methods in link prediction on graphs with directed cycles.

2. SPACETIME DIFFERENTIAL GEOMETRY

We introduce some differential geometry background about spacetimes. Spacetimes have been widely studied, and we refer the reader to Hawking & Ellis (1973); Uhlenbeck (1975); O'Neill (1995); Beem et al. (1996); Wolf (2011); Gourgoulhon (2016), and Chapters 5-8 and 14 of O'Neill (1983).

Pseudo-Riemannian Manifold. A $d$-dimensional pseudo-Riemannian manifold $(M, g)$ is a smooth manifold such that every point $x \in M$ has a $d$-dimensional tangent space $T_x M$ whose metric tensor $g_x : T_x M \times T_x M \to \mathbb{R}$ is a nondegenerate symmetric bilinear form (called a scalar product). Nondegeneracy means that $(\forall v \in T_x M,\ g_x(u, v) = 0) \implies u = 0$. When the context is clear and to simplify the notation, we write $\langle \cdot, \cdot \rangle := g_x(\cdot, \cdot)$ to define the metric tensor at $x$. We also write $M$ instead of $(M, g)$. We write points $x \in M$ of the manifold in bold serif font, and tangent vectors $u \in T_x M$ in bold sans-serif font when we want to distinguish them from points.

Lorentz manifold. Every tangent space $T_x M$ of a $d$-dimensional pseudo-Riemannian manifold $M$ admits an orthonormal basis $\{e_1, \dots, e_d\}$ that satisfies $\forall i,\ \langle e_i, e_i \rangle = \pm 1$ and $\forall i \neq j,\ \langle e_i, e_j \rangle = 0$. The index $\nu \leq d$ of $M$ is the number of vectors $e_i$ that satisfy $\langle e_i, e_i \rangle = -1$. If $\nu = 0$, $M$ is Riemannian and its metric tensor is positive definite (i.e., $\forall x \in M, \forall u \in T_x M$, $\langle u, u \rangle \geq 0$ and $\langle u, u \rangle = 0 \iff u = 0$). If $\nu = 1$, $M$ is a Lorentz manifold and $T_x M$ is a Lorentz vector space.

Future timecone. A nonzero tangent vector $u$ is called timelike (or chronological), null, spacelike or non-spacelike (or causal) if $\langle u, u \rangle$ is negative, zero, positive or nonpositive, respectively. The type into which $u$ falls is called its causal character. If $u = 0$, then $u$ is spacelike. Every Lorentz tangent space contains two timecones. Some timelike tangent vector $t \in T_x M$ can arbitrarily be used to define the future timecone as the following set: $C^+_x(t) := \{v \in T_x M : \langle v, v \rangle < 0,\ \langle t, v \rangle < 0\}$, whereas $-t$ defines the past timecone $C^-_x(t) := C^+_x(-t)$.
Two timelike tangent vectors $u$ and $v$ are in the same timecone iff $\langle u, v \rangle < 0$. They belong to different timecones if $v = -u$.

Time-orientability and time-orientation. A continuous vector field $X$ is a function that assigns to each point $x \in M$ a tangent vector of $M$ at $x$ denoted by $X(x) \in T_x M$. $X$ and $-X$ are timelike if $\forall x \in M,\ \langle X(x), X(x) \rangle < 0$. A Lorentz manifold is time-orientable iff there exists a timelike vector field. If $M$ is assigned such a timelike vector field $X$, it is time-oriented by $X$. In this case, non-spacelike tangent vectors $u$ at each point $x$ can be divided into two separate classes: future-directed if $\langle X(x), u \rangle < 0$, and past-directed if $\langle X(x), u \rangle > 0$. A curve $\gamma_{x \to u} : I \to M$ where $I \subseteq \mathbb{R}$ is defined such that its initial point is $\gamma_{x \to u}(0) = x$ and its initial velocity is $\gamma'_{x \to u}(0) = u \in T_x M$. We denote it by $\gamma$ when its initial conditions are clear from the context. If its acceleration is zero, then it is called a geodesic. The curve $\gamma$ is called timelike, null or spacelike if its velocity on the whole domain $I$ is timelike, null or spacelike, respectively. It is called future-directed (or future-pointing) if its velocity is future-directed (and causal) on $I$.

Completeness. The manifolds that we consider in this paper are geodesically complete (i.e., $I = \mathbb{R}$), and the exponential map $\exp_x : T_x M \to M$ of $M$ at $x$ is defined as $\exp_x(u) := \gamma_{x \to u}(1)$ where $\gamma_{x \to u}$ is a geodesic. The maximal normal neighborhood of $x$ is the maximal subset $U_x \subseteq M$ where the logarithmic map $\log_x := \exp_x^{-1} : U_x \to T_x M$ is a diffeomorphism; it satisfies $U_x := \{y \in M : \exp_x(\exp_x^{-1}(y)) = y\}$. To simplify the notation, we also write $\overrightarrow{xy} := \log_x(y)$ where $y \in U_x$. $U_x$ is convex if $\forall y \in U_x$, there exists a unique geodesic totally contained in $U_x$ from $x$ to $y$.

We now present some pseudo-Riemannian manifolds that will be relevant. Their differential geometry tools (e.g., exponential/logarithmic map and parallel transport) are provided in Appendix B.

• Pseudo-Euclidean space $\mathbb{R}^d_\nu$. The flat $d$-dimensional pseudo-Riemannian manifold of index $\nu$ is denoted by $\mathbb{R}^d_\nu$. In particular, $\mathbb{R}^d_0$ is the Euclidean space and $\mathbb{R}^d_1$ is the Minkowski space (or Minkowski spacetime). Since it is a vector space, we can identify its tangent space with the space itself by means of the natural isomorphism $\mathbb{R}^d_\nu \approx T_x \mathbb{R}^d_\nu$. $\mathbb{R}^d_\nu$ is equipped with the following scalar product: for all $x = (x_{1-\nu}, \dots, x_{d-\nu})$ and $y = (y_{1-\nu}, \dots, y_{d-\nu})$,

$$\langle x, y \rangle_\nu := -\sum_{i=1-\nu}^{0} x_i y_i + \sum_{j=1}^{d-\nu} x_j y_j. \tag{1}$$

The maximal normal neighborhood for all $x \in \mathbb{R}^d_\nu$ is $U_x = \mathbb{R}^d_\nu$. In special relativity, the first $\nu$ elements of $x \in \mathbb{R}^d_\nu$ are called time coordinates and the other ones are called space coordinates.

• The pseudo-sphere $S^d_\nu(r)$ of radius $r > 0$ is called the de Sitter space when $\nu = 1$ and defined as: $S^d_\nu(r) := \{x \in \mathbb{R}^{d+1}_\nu : \langle x, x \rangle_\nu = r^2\}$. It is not time-orientable if $d - \nu$ is odd, and $U_x = \{y \in S^d_\nu(r) : \langle x, y \rangle_\nu > -r^2\}$.

• The pseudo-hyperbolic space $H^d_\nu(r)$ is called the anti-de Sitter space when $\nu = 1$ and defined as: $H^d_\nu(r) := \{x \in \mathbb{R}^{d+1}_{\nu+1} : \langle x, x \rangle_{\nu+1} = -r^2\}$. The anti-de Sitter space is time-orientable for all $d$, and $U_x = \{y \in H^d_\nu(r) : \langle x, y \rangle_{\nu+1} < r^2\}$.

• The cylindrical Minkowski space $L^d_1(C) := \mathbb{R}^d_1 / \sim$ is a quotient set defined such that $x \in \mathbb{R}^d_1$ and $y \in \mathbb{R}^d_1$ are equivalent (i.e., $x \sim y$) iff $\forall i > 0,\ y_i = x_i$ and $\exists k \in \mathbb{Z},\ y_0 = x_0 + kC$, where $C > 0$ is a circumference hyperparameter. See page 148 of O'Neill (1983) for other types of Lorentz cylinders. We have $U_x = \{y = (y_0, \dots, y_{d-1}) \in \mathbb{R}^d_1 : y_0 \in (x_0 - C/2,\ x_0 + C/2)\}$.
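As a minimal sketch of the definitions above, the scalar product of equation 1 and the resulting causal character of a tangent vector can be computed with NumPy. The function names are illustrative and not taken from the paper, and the sketch assumes that the time coordinates are stored first, as in equation 1.

```python
import numpy as np

def scalar_product(x, y, nu=1):
    """Scalar product of Eq. (1): the first `nu` coordinates are time coordinates."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(-np.dot(x[:nu], y[:nu]) + np.dot(x[nu:], y[nu:]))

def causal_character(u, nu=1):
    """Classify a tangent vector as 'timelike', 'null' or 'spacelike'."""
    u = np.asarray(u, dtype=float)
    if not np.any(u):
        return "spacelike"  # the zero vector is spacelike by convention
    q = scalar_product(u, u, nu)
    if q < 0:
        return "timelike"
    # exact comparison with zero; a tolerance may be preferable with noisy data
    return "null" if q == 0 else "spacelike"

def in_future_timecone(v, t, nu=1):
    """Membership in C^+_x(t) = {v : <v, v> < 0 and <t, v> < 0}."""
    return scalar_product(v, v, nu) < 0 and scalar_product(t, v, nu) < 0

t = [1.0, 0.0, 0.0]  # timelike vector chosen to define the future timecone
print(causal_character([2.0, 1.0, 0.0]))
print(in_future_timecone([2.0, 1.0, 0.0], t))
print(in_future_timecone([-2.0, 1.0, 0.0], t))
```

With `nu=0` the same function reduces to the Euclidean dot product, which mirrors the remark that $\mathbb{R}^d_0$ is the Euclidean space.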

3. SPACETIME GRAPH REPRESENTATION

A spacetime is a connected time-oriented Lorentz manifold $M$ whose points are called events. Informally, time-oriented is often weakened to time-orientable, and Hawking & Ellis (1973) ignore the time-orientability criterion to define spacetimes although it is required to study their causal structure. Our main contribution is to exploit this causal structure via Lorentzian pre-length spaces to represent graphs. In the following, we always assume that $M$ is a spacetime unless stated otherwise. A finite directed graph can be given the structure of a Lorentzian pre-length space (Kunzinger & Sämann, 2018). To define our graphs, we then choose a special type of Lorentzian pre-length space that is easy to optimize. We now provide and follow the definitions of Kunzinger & Sämann (2018).

Causal space. Let $X$ be a set with a reflexive and transitive relation $\leq$, and a transitive relation $\ll$ contained in $\leq$ (i.e., $\ll\ \subseteq\ \leq$, so $x \ll y \implies x \leq y$). Then $(X, \ll, \leq)$ is called a causal space. This definition is more general than the one given in the seminal work of Kronheimer & Penrose (1967). Following general relativity, the event $x \in M$ causally (resp. chronologically) precedes the event $y \in M$, and we write $x < y$ (resp. $x \ll y$) iff there exists a future-directed causal (resp. timelike) curve from $x$ to $y$. This condition might be difficult to verify in general. However, if $y$ is in a convex normal neighborhood of $x$ denoted by $V_x \subseteq U_x$ (and we always assume in the following that $x \in V_x$), we have $x < y$ (resp. $x \ll y$) iff there exists a nonconstant future-directed causal (resp. timelike) geodesic from $x$ to $y$ (see Proposition 4.5.1 of Hawking & Ellis (1973)). We note $x \leq y \iff x = y$ or $x < y$. Any open subset $W \subseteq M$ of a spacetime is a spacetime in its own right, and the intrinsic causal relations of $W$ imply the corresponding ones in $M$. In the case where $W \subseteq M$ is a convex open set, the intrinsic causality of $W$ is as simple as that of Minkowski space (see page 403 of O'Neill (1983)). Moreover, both $(M, \ll, \leq)$ and $(W, \ll, \leq)$ are causal spaces.

Lorentzian pre-length space. Let $(X, \ll, \leq)$ be a causal space and $d$ a metric on $X$. Let $\tau : X \times X \to [0, \infty]$ be a lower semicontinuous map (w.r.t. the metric topology induced by $d$) that satisfies the reverse triangle inequality: $\forall x, y, z \in X$ with $x \leq y \leq z$, $\tau(x, z) \geq \tau(x, y) + \tau(y, z)$. Suppose that $\tau(x, y) = 0$ if $x \nleq y$ and $\tau(x, y) > 0$ iff $x \ll y$. Then $(X, d, \leq, \ll, \tau)$ is a Lorentzian pre-length space and $\tau$ is called the time separation function. See details about $\tau$ in Section 3.3.
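The reverse triangle inequality can be checked numerically on a small example. The sketch below assumes the Minkowski space $\mathbb{R}^2_1$ with the time coordinate first, and takes $\tau$ to be the Lorentzian distance between chronologically ordered events (zero otherwise); the function name and the three events are illustrative choices, not values from the paper.

```python
import math

def time_separation(x, y):
    """Candidate tau on Minkowski space (time coordinate first)."""
    dt = y[0] - x[0]
    beta = -dt ** 2 + sum((b - a) ** 2 for a, b in zip(x[1:], y[1:]))
    # positive only when y lies in the chronological future of x
    return math.sqrt(-beta) if (dt > 0 and beta < 0) else 0.0

# three events with x << y << z
x, y, z = (0.0, 0.0), (2.0, 1.0), (5.0, 0.0)
direct = time_separation(x, z)
via_y = time_separation(x, y) + time_separation(y, z)
print(direct, via_y, direct >= via_y)
```

The direct separation from x to z is at least the sum of the separations through y, which is the opposite of the usual triangle inequality and reflects that timelike geodesics are the longest causal curves.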

3.1. GRAPH CONSTRUCTION VIA LORENTZIAN PRE-LENGTH SPACES

We now explain our methodological contribution, which is the construction of directed graphs from the properties mentioned above. Let us define a directed graph $G = (V, E)$ where $V = \{v_i\}_{i=1}^n$ is the node set and $E$ is the set of arcs. Our goal is to represent each node $v_i$ by a point $x_i \in M$ so that there exists an arc from $v_i$ to $v_j$ (i.e., $(v_i, v_j) \in E$) only if $x_i \ll x_j$. We propose to consider subgraphs $G_i = (V, E_i)$ defined such that $E = \bigcup_{i=1}^n E_i$. Each subgraph $G_i$ is [...] (see [...] O'Neill (1983)), and we draw an arc from $v_i$ to $v_j$ iff $x_j \in I^+(x_i, V_{x_i})$. Assuming $V_{x_i}$ is a convex normal neighborhood and $x_j \in V_{x_i}$, we have $x_j \in I^+(x_i, V_{x_i})$ only if $\overrightarrow{x_i x_j}$ is defined and future-directed timelike (see Proposition 4.12 of Beem et al. (1996)). Our framework produces a subgraph $G = \bigcup_{i=1}^n G_i$ of the graph $G'$ described by the chronological relations of $M$, and the causal relations between events depend on the choice of $M$. It is worth noting that for any spacetime, chronological order is transitive (i.e., $x_i \ll x_j$ and $x_j \ll x_k \implies x_i \ll x_k$ (i.e., $x_k \in I^+(x_i, M)$), see Chapter 3.2 of Beem et al. (1996)). $v_i$ is then connected by an arc to all the successors of $v_j$ in $G'$. We avoid this degenerate case by drawing an arc from $v_i$ to $v_k$ in $G$ iff $x_k \in I^+(x_i, V_{x_i})$. The choice of $M$ and $V_{x_i} \subseteq M$ for all $i$ is then crucial to construct $G \subseteq G'$.

Existence of directed cycles. Our framework can represent graphs with directed cycles only if $M$ is non-chronological (i.e., there exists at least one closed timelike curve: $\exists x \in M,\ x \ll x$, see Fig. 1), which is the case if $M$ is the anti-de Sitter space $H^d_1(r)$ or the cylindrical Minkowski space $L^d_1(C)$. Spacetimes that do not contain closed timelike curves are called chronological (i.e., $\nexists x \in M,\ x \ll x$) and can represent only Directed Acyclic Graphs (DAGs) in our framework.
Some chronological spacetimes such as $S^d_1(r)$ (when $d \geq 3$ is odd) or $\mathbb{R}^d_1$ are called globally hyperbolic and satisfy: $x \leq y \implies$ $x$ and $y$ can be joined by a (longest) causal geodesic that is not necessarily unique (see page 66 of Beem et al. (1996)). Nonetheless, if $M$ is $S^d_1(r)$ or $\mathbb{R}^d_1$, we also have $y \in I^+(x, M)$ iff $y \in U_x$ and $\overrightarrow{xy}$ is future-directed timelike (see page 411 of O'Neill (1983) for details on conditions). Although DAGs do not contain directed cycles, they can contain undirected cycles.

For simplicity, we consider that $V_x \subseteq U_x$ is the convex normal neighborhood of $x$ that contains points $y$ such that the arc length $d_\gamma$ of the geodesic $\gamma$ from $x = \gamma(0)$ to $y = \gamma(1)$ is smaller than some arbitrary threshold $\varepsilon \in (0, \infty]$ and can be formulated as $d_\gamma(x, y) := \sqrt{|\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle|}$. We have:

$$I^+(x, V_x) = \{y \in U_x : -\varepsilon^2 < \langle \overrightarrow{xy}, \overrightarrow{xy} \rangle < 0,\ \overrightarrow{xy} \in C^+_x(t)\}, \tag{4}$$

where $C^+_x(t)$ is the future timecone parametrized by some arbitrary timelike tangent vector $t \in T_x M$. The motivation is to choose or learn $\varepsilon$ small enough that $v_i$ is not connected to undesired successors of $v_j$. However, in most of our experiments we simply consider that $V_x = U_x$ (see Section C.4).

3.2. EXAMPLES OF SPACETIMES

We first illustrate how to represent directed graphs with the simplest spacetime, which is the (flat) Minkowski space $\mathbb{R}^d_1$. It is used in special relativity, which is the special case of general relativity where the spacetime is isometric to $\mathbb{R}^4_1$. It is also the geometry induced on each fixed tangent space of an arbitrary Lorentz manifold. It was used in Clough & Evans (2017) to represent DAGs due to its global hyperbolicity and the fact that any pair of points of $\mathbb{R}^d_1$ can be joined by a geodesic. It is worth noting that the Hopf-Rinow theorem does not hold for non-Riemannian manifolds such as spacetimes. Therefore, for many spacetimes, there exist pairs of points that cannot be joined by a geodesic even if the spacetime is geodesically complete. Working with convex normal neighborhoods $V_{x_i}$ allows us to constrain chronological order between points only via timelike geodesics as explained in Section 3.1.

• Minkowski spacetime $\mathbb{R}^d_1$. We recall that $\forall \nu,\ \mathbb{R}^d_\nu \approx T_x \mathbb{R}^d_\nu$. Its geodesic $\gamma_{x \to y} : \mathbb{R} \to \mathbb{R}^d_\nu$ is $\gamma_{x \to y}(t) := x + ty$. The exponential map at $x$ is $\exp_x(y) := x + y$ and its inverse is $\overrightarrow{xy} := y - x$. $\mathbb{R}^d_1$ is time-oriented by the vector field $\partial / \partial x_0$ (e.g., $\forall x, y,\ \tau(x, y) := y_0 - x_0$; in theory we should define $\tau(x, y) := 0$ if $x \not\ll y$ to properly follow the definition of Lorentzian pre-length spaces, but we ignore this criterion during training for optimization purposes). Let us define $t := (1, 0, \dots, 0)$, $\alpha := -(y_0 - x_0) + \sqrt{\sum_{i=1}^{d-1} (y_i - x_i)^2}$ and $\beta := \langle \overrightarrow{xy}, \overrightarrow{xy} \rangle = -(y_0 - x_0)^2 + \sum_{i=1}^{d-1} (y_i - x_i)^2$. According to equation 4, we have $y \in I^+(x, V_x)$ iff $\overrightarrow{xy}$ is future-directed timelike (i.e., $\beta < 0$ and $\langle \overrightarrow{xy}, t \rangle < 0$, or equivalently $\alpha < 0$) and $d_\gamma(x, y)$ is smaller than $\varepsilon$ (i.e., $-\varepsilon^2 < \beta$). There might exist a path but no arc between $x \ll y$ (i.e., $y \in I^+(x, M) \setminus I^+(x, V_x)$) iff $\langle \overrightarrow{xy}, t \rangle < 0$ and $\beta \leq -\varepsilon^2$. There exists no path between $x$ and $y$ (i.e., $x \not\ll y$ and $y \not\ll x$) iff $\beta \geq 0$.

• de Sitter spacetime $S^d_1(r)$. The original de Sitter spacetime $S^4_1(r)$ is not time-orientable (see Corollary 11.2.6 of Wolf (2011)). However, when $d \geq 3$, $\nu > 0$ and $d - \nu$ is even, $S^d_\nu(r)$ is orientable and time-orientable, while the projective space $S^d_\nu(r) / {\pm 1}$ used in Law (2021) is orientable but not time-orientable (see page 247 of O'Neill (1983)). We assume in the following that conditions hold so that $S^d_\nu(r)$ is a spacetime and we refer the reader to the appendix for details. For any pair of points $x \in S^d_\nu(r)$ and $y \in S^d_\nu(r)$, $\overrightarrow{xy}$ is defined iff $y \in U_x$ (i.e., $\langle x, y \rangle_1 > -r^2$), and $\overrightarrow{xy}$ is timelike iff $\langle x, y \rangle_1 > r^2$. Let $p := (0, \dots, 0, r) \in S^d_1(r)$ denote the positive pole, and $\Gamma^x_p : T_p S^d_1(r) \to T_x S^d_1(r)$ denote the parallel transport from $T_p S^d_1(r)$ to $T_x S^d_1(r)$. Since the parallel transport preserves the causal character of any tangent vector $v$, we can define the future timecone $C^+_x(\overrightarrow{xy})$ with respect to the timelike tangent vector $t := (1, 0, \dots, 0) \in T_{\pm p} S^d_1(r)$ as follows:

Lemma 3.1. Assuming $\overrightarrow{xy}$ is timelike, $\overrightarrow{xy}$ is future-directed iff $\Gamma^x_p(t) \in C^+_x(\overrightarrow{xy})$ if $\Gamma^x_p$ is defined (i.e., $p \in U_x$), and $\Gamma^x_{-p}(t) \in C^+_x(\overrightarrow{xy})$ otherwise (i.e., $-p \in U_x$). See proof in Appendix C.1.2.

• The anti-de Sitter spacetime $H^d_1(r)$ and its projective version $P^d_1(r) := H^d_1(r) / {\pm 1}$ are non-chronological and satisfy $x \ll y \implies y \ll x$, which is convenient to represent graphs with directed cycles. Sim et al. (2021) use $H^d_1(r)$ to represent directed graphs but also promote arcs (i.e., causal relation) between pairs of nodes that are not connected by any geodesic, which makes their problem hard to optimize. In this work, we only consider the existence of arcs if there exists a timelike geodesic in the convex normal neighborhood $V_x$ joining two events. See Appendix C for details.

• A warped product is a manifold $(M_1 \times M_2,\ g_1 \oplus f g_2)$ denoted by $M_1 \times_f M_2$ where $(M_1, g_1)$ and $(M_2, g_2)$ are pseudo-Riemannian manifolds and $f : M_1 \to (0, \infty)$ is a smooth function called the warping function (e.g., $f = 1$). Let $M = B \times_f F$ be a Lorentz warped product where $B$ is Lorentz and $F$ is a complete Riemannian manifold. $M$ is time-orientable iff $B$ is (see page 417 of O'Neill (1983)). $M$ satisfies the chronology, causality or strong causality iff $B$ does. Our framework can then be extended to warped products.
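For the Minkowski case described in this section, the arc rule can be sketched as follows. This is a minimal illustration under the stated definitions of $\alpha$ and $\beta$ (time coordinate first); the function names and the three example events are hypothetical and not taken from the paper.

```python
import numpy as np

def is_arc(x, y, eps=np.inf):
    """True iff y is in I^+(x, V_x) on Minkowski space (time coordinate first)."""
    diff = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    space_sq = float(np.sum(diff[1:] ** 2))
    alpha = -diff[0] + np.sqrt(space_sq)   # alpha < 0: future-directed timelike
    beta = -diff[0] ** 2 + space_sq        # <xy, xy>
    return bool(alpha < 0 and beta > -eps ** 2)

def arcs(points, eps=np.inf):
    """All arcs (i, j) such that the j-th event is in I^+ of the i-th event."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and is_arc(points[i], points[j], eps)}

# three events on a chain 0 -> 1 -> 2 in R^2_1
points = [(0.0, 0.0), (1.0, 0.5), (3.0, 0.0)]
print(sorted(arcs(points)))           # V_x = U_x: the transitive arc (0, 2) appears
print(sorted(arcs(points, eps=2.0)))  # smaller neighborhood: only (0, 1) and (1, 2)
```

With an infinite threshold, transitivity of the chronological order connects event 0 to event 2 as well; a finite threshold removes that arc, which is the role that the text assigns to $\varepsilon$.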

3.3. LORENTZIAN DISTANCE AND LORENTZIAN LENGTH SPACES

The Lorentzian distance indicates chronological order (hence causality) between events when it is positive, satisfies the reverse triangle inequality, and its squared function is of class $C^2$ on a normal neighborhood. These properties make it ideal for optimization, and it can be used as time separation function. We refer the reader to Chapter 4 of Beem et al. (1996) and Section 5 of Minguzzi (2019).

Proposition 4.5.3 of Hawking & Ellis (1973): Let $x$ and $y$ lie in a convex normal neighborhood $V_x \subseteq U_x$. If $x$ and $y$ can be joined by a causal curve in $V_x$, the longest such curve is the unique causal geodesic in $V_x$ from $x$ to $y$.

This is in contrast with Riemannian geometry where the (spacelike) geodesic corresponds to the shortest curve joining points. Since $V_x$ is a convex normal neighborhood, we can define the arc length of such a curve by $\chi_V(x, y) := \sqrt{-\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle} \geq 0$. If $x \ll y$, $\chi_V(x, y)$ is called the Lorentzian distance from $x$ to $y$ on $V_x$; it corresponds to the elapsed proper time between the events $x$ and $y$ (i.e., as measured by a clock along the geodesic $\gamma_{x \to \overrightarrow{xy}}$) (Gourgoulhon, 2016). If we define $\tau := \chi_V$, then $(V_x, d, \leq, \ll, \tau)$ is called a Lorentzian length space. In this paper, we consider the squared Lorentzian distance which is defined as $y \in I^+(x, V_x) \implies \chi^2_V(x, y) = -\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle$. In the literature (Hawking & Ellis, 1973; Beem et al., 1996), the (squared) Lorentzian distance is defined such that $\chi^2_V(x, y) = 0$ if $x$ and $y$ are not joined by a causal curve due to lack of causality. When evaluating learned representations (i.e., at test time), we therefore consider that $\chi^2_V(x, y) = 0$ if this is the case. However, during training and for optimization purposes, we propose to consider: if $x \not\ll y$ and $y \not\ll x$,

$$\chi^2_V(x, y) = \begin{cases} -\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle & \text{if } M = \mathbb{R}^d_\nu \\ 2(\langle x, y \rangle_\nu - r^2) & \text{if } M = S^d_\nu(r) \\ -2(|\langle x, y \rangle_{\nu+1}| - r^2) & \text{if } M = P^d_\nu(r). \end{cases} \tag{5}$$

This makes the function $\chi^2_V$ differentiable everywhere (except when $\langle x, y \rangle_{\nu+1} = 0$ if $M = P^d_\nu(r)$), equal to $-\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle > 0$ if $\overrightarrow{xy}$ is timelike, and non-positive otherwise. $\chi^2_V$ is defined for any pair of points $(x, y)$ by using extrinsic geometry, whether $\overrightarrow{xy}$ is defined or not (see details in Section C.3).
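The three cases of equation 5 can be written directly in terms of the ambient scalar product. The sketch below is an illustration only: the function names and the string labels "R", "S" and "P" are hypothetical conventions, points are assumed to be given in ambient coordinates with time coordinates first, and the example points were chosen to lie on the corresponding manifolds with $r = 1$.

```python
import numpy as np

def scalar_product(x, y, nu):
    """Ambient scalar product with `nu` leading time coordinates."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(-np.dot(x[:nu], y[:nu]) + np.dot(x[nu:], y[nu:]))

def sq_lorentzian_distance_train(x, y, manifold, nu=1, r=1.0):
    """Training-time surrogate of Eq. (5); positive for timelike pairs."""
    if manifold == "R":   # pseudo-Euclidean space R^d_nu
        diff = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
        return -scalar_product(diff, diff, nu)
    if manifold == "S":   # pseudo-sphere S^d_nu(r), embedded in R^{d+1}_nu
        return 2.0 * (scalar_product(x, y, nu) - r ** 2)
    if manifold == "P":   # projective pseudo-hyperbolic space, embedded in R^{d+1}_{nu+1}
        return -2.0 * (abs(scalar_product(x, y, nu + 1)) - r ** 2)
    raise ValueError("manifold must be 'R', 'S' or 'P'")

# Minkowski: timelike separation gives a positive value
print(sq_lorentzian_distance_train((0.0, 0.0), (2.0, 1.0), "R"))
# de Sitter S^2_1(1): timelike pair, then spacelike pair
x = (0.0, 1.0, 0.0)
print(sq_lorentzian_distance_train(x, (np.sinh(1.0), np.cosh(1.0), 0.0), "S"))
print(sq_lorentzian_distance_train(x, (0.0, 0.0, 1.0), "S"))
# projective anti-de Sitter P^2_1(1): timelike pair, then spacelike pair
p = (1.0, 0.0, 0.0)
print(sq_lorentzian_distance_train(p, (np.cos(0.5), np.sin(0.5), 0.0), "P"))
print(sq_lorentzian_distance_train(p, (np.cosh(1.0), 0.0, np.sinh(1.0)), "P"))
```

Because only ambient scalar products are used, the value is available even when the logarithmic map is undefined, which matches the remark about extrinsic geometry.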

4. RELATED WORK

Exploiting spacetimes to represent directed graphs was proposed by Clough & Evans (2017) to extend MultiDimensional Scaling (MDS) (Kruskal, 1964) to the simplest spacetime $\mathbb{R}^d_1$. As in standard MDS, a target distance matrix between nodes is given as input and the goal of the method is to find the vector representations of nodes that return the best approximation of the target distance matrix when using the squared Lorentzian distance as the dissimilarity function. However, as discussed in Section 3.3, it is difficult to define target distances between pairs of nodes that are not connected. Moreover, the formulation in Clough & Evans (2017) considers only $\mathbb{R}^d_1$ and does not account for future or past-direction between pairs of nodes during training. We solve this problem in Section 3.2 by enforcing future timelike geodesics to have their length smaller than some threshold $\varepsilon \in (0, \infty]$.

Sim et al. (2021) extended Clough & Evans (2017) to the anti-de Sitter space and Lorentz cylinder. Although our motivation is similar, our contributions are methodological, rely on the intuitions of Lorentzian pre-length spaces, and provide an easier interpretation of the learned representations as we explain below. First, Sim et al. (2021) do not address clearly the case when there is no geodesic between pairs of points, and their optimization framework leads to a distance loss term with a zero gradient in this case, which is difficult to optimize. Moreover, their prediction of an arc between a pair of nodes is determined via a Triple Fermi-Dirac (TFD) probability function that accounts for the distance between the nodes, the time coordinate difference $\Delta t$ and its opposite value $-\Delta t$ (i.e., TFD accounts for both the chronological future and past of a given node). In contrast, we restrict the representation of nodes connected by an arc to belong to $I^+(x, V_x)$ where $V_x$ is an open convex normal neighborhood, which is simple to interpret and optimize. In general, the Lorentzian distance function from $x$ to $y$ on $M$ is defined to be infinite if $M$ is non-chronological and there exists a closed timelike curve joining $x$ and $y$. Moreover, the Hopf-Rinow theorem does not hold for spacetimes. Working with convex normal neighborhoods allows us to restrict the existence of arcs between nodes to the existence of geodesics joining events. The definition of time separation functions also becomes straightforward and we use their sign to determine the direction of an edge. Our framework shares similarities with Sim et al. (2021) when $M = \mathbb{R}^d_1 = V_x$ because $\mathbb{R}^d_1$ is globally hyperbolic. Otherwise, we do not formulate our time separation function in the same way and we create arcs differently. Further comparisons with Sim et al. (2021) are provided in Appendix E.

The negative of the pseudo-Riemannian gradient is generally not a descent direction. We then use the optimization tools introduced in Gao et al. (2018); Law & Stam (2020); Law (2021) as described in Appendix F to learn our representations. Our approach could be extended to time-oriented pseudo-Riemannian manifolds of index $\nu > 1$ (see pages 240-242 of O'Neill (1983)). We also have $x \ll y$ if there exists a future-directed timelike curve joining them. Lemma 3.1 that exploits parallel transport can for instance be generalized to timelike tangent vectors $t_1, \dots, t_\nu$ to define different types of arcs.

Some Riemannian approaches (Bordes et al., 2013; Ganea et al., 2018; Vilnis et al., 2018) have been used to represent DAGs. Ganea et al. (2018) consider entailment relations between a concept $x$ and its subconcept $y$ by constructing cones for each node representation in hyperbolic geometry. The DAGs they consider are directed trees and their underlying undirected graphs are trees (i.e., graphs without cycles). We propose to consider the chronological future and past that are intrinsic to our manifolds and have been studied for decades to define the causal structure. For instance, it is known that causal relations induce partially ordered sets (called causal sets) on causal spacetimes and can then represent DAGs (Bombelli et al., 1987). Our framework can represent DAGs that are not directed trees (see example in Appendix G.1) and is not limited to DAGs (see examples in Appendix G.2).

5. EXPERIMENTS

We show how our framework can represent graphs with directed cycles and predict links effectively in Section 5.1. We also show how the causal interpretation of our model can be used to represent hierarchical graphs with cycles in Section 5.2. We provide experiments on DAGs in Appendix D.1.

5.1. LINK PREDICTION ON GRAPHS WITH DIRECTED CYCLES

We now consider the link prediction task on the Saccharomyces cerevisiae, in silico and Escherichia coli DREAM5 datasets (Marbach et al., 2012) (see values of hyperparameters, results on Escherichia coli with standard deviations and more discussion in Appendix D.2). We follow the experimental protocol of Sim et al. (2021) by considering only the positive-regulatory nodes from the networks mentioned above while omitting the gene-expression data itself. Each network is randomly split into train and test sets, following 85/15 splits, and a part of the training set is used for validation.

For the cylindrical Minkowski space, we propose to define the chronological future $I^+(x, V_x)$ of $x \in L^d_1(C)$ such that if $\overrightarrow{xy}$ is timelike, we have $y \in I^+(x, V_x)$ if $\exists k \in \mathbb{Z},\ y_0 + kC \in (x_0,\ x_0 + C/2)$. Similarly, the chronological past $I^-(x, V_x)$ is defined such that if $\overrightarrow{xy}$ is timelike, we have $y \in I^-(x, V_x)$ if $\exists k \in \mathbb{Z},\ y_0 + kC \in (x_0 - C/2,\ x_0)$. In detail, we define the time separation function for $L^d_1(C)$ as follows:

$$\tau(x, y) := \left( \left( y_0 - x_0 + \tfrac{C}{2} \right) \bmod C \right) - \tfrac{C}{2} \in \left[ -\tfrac{C}{2}, \tfrac{C}{2} \right)$$

where we use the modulo operation for real values, which can be written as $a \bmod b := a - b \lfloor a / b \rfloor$, and $\lfloor \cdot \rfloor$ is the floor function.

Following our spacetime interpretation, we propose to learn our representations by optimizing:

$$\min_{\{x_k \in M\}_{k=1}^n}\ -\sum_{(v_i, v_j) \in E} \log\big(F(x_i, x_j)\big) - \sum_{(v_a, v_b) \notin E} \log\big(1 - F(x_a, x_b)\big) \tag{6}$$

where $F(x_i, x_j) := \sigma^m_{\theta_1}\big(\chi^2_V(x_i, x_j)\big) \cdot \sigma^m_{\theta_2}\big(\tau(x_i, x_j)\big)$, $\theta_1, \theta_2 > 0$ are temperature hyperparameters, $m > 0$ is an exponent, and $\sigma_\theta(x) := 1 / (1 + e^{-x/\theta})$ is the sigmoid function. We refer to Sections C.2 and C.3 for the formulations of the time separation function $\tau$ and the squared Lorentzian distance $\chi^2_V$, respectively. This promotes future-directed timelike geodesics only between nodes connected by an arc. We report in Table 1 the Average Precision scores of our method (i.e., using equation 6), which outperforms the baselines reported in Sim et al. (2021).
The non-chronological cylindrical Minkowski space obtains a higher performance gap in low-dimensional space. This suggests it accounts for directed cycles better than chronological spacetimes. On the other hand, the gap is reduced in higher dimension except on Escherichia coli where the gap increases (see Table 4 ).
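The time separation function on the Lorentz cylinder and the objective of equation 6 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the matrices `chi2` and `tau` of pairwise values are assumed to be precomputed, self-pairs are excluded from the sum over non-arcs, and the clipping constant `tiny` is a numerical safeguard added here.

```python
import numpy as np

def tau_cylinder(x0, y0, C):
    """Wrapped time difference in [-C/2, C/2); np.mod is a - b * floor(a / b)."""
    return np.mod(y0 - x0 + C / 2.0, C) - C / 2.0

def sigmoid(z, theta):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float) / theta))

def arc_score(chi2, tau, theta1=1.0, theta2=1.0, m=1.0):
    """F = sigma_theta1(chi2)^m * sigma_theta2(tau)^m."""
    return sigmoid(chi2, theta1) ** m * sigmoid(tau, theta2) ** m

def objective(chi2, tau, adjacency, tiny=1e-12, **kwargs):
    """Negative log-likelihood over arcs and non-arcs (self-pairs excluded: assumption)."""
    F = np.clip(arc_score(chi2, tau, **kwargs), tiny, 1.0 - tiny)
    off_diag = ~np.eye(len(F), dtype=bool)
    pos = adjacency & off_diag
    neg = ~adjacency & off_diag
    return float(-np.sum(np.log(F[pos])) - np.sum(np.log(1.0 - F[neg])))

C = 4.0
print(tau_cylinder(0.0, 1.0, C), tau_cylinder(0.0, 3.0, C), tau_cylinder(0.0, 5.0, C))

# two nodes with a timelike separation and node 1 in the future of node 0
chi2 = np.array([[0.0, 4.0], [4.0, 0.0]])
tau = np.array([[0.0, 2.0], [-2.0, 0.0]])
A = np.array([[False, True], [False, False]])  # single arc 0 -> 1
print(objective(chi2, tau, A), objective(chi2, tau, A.T))
```

In the toy example, the objective is smaller when the arc direction agrees with the sign of the time separation, which is the behavior that the product of the two sigmoids is meant to encourage.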

5.2. HIERARCHY EXTRACTION ON A SOCIAL NETWORK DATASET

Pseudo-Riemannian manifolds were introduced in the machine learning literature to extract hierarchies in graphs with cycles (Law & Stam, 2020). Spacetimes are a special family of pseudo-Riemannian manifolds whose time-orientation was not taken into account in Law & Stam (2020). In this paper, we found that spacetimes are able to consistently outperform other pseudo-Riemannian manifold approaches by exploiting the causality interpretation implicit in the squared Lorentzian distance.

Table 2 (partial): rank of the two leaders and Spearman's rank correlation coefficient ρ for the 5 and 10 most important nodes.

Manifold                  Problem      Dissimilarity       Leader rank   Leader rank   ρ (top 5)     ρ (top 10)
[...]                     [...]        [...]               1.0 ± 0.0     2.3 ± 0.5     0.70 ± 0.17   0.80 ± 0.11
$P^3_1(r)$                equation 8   $\chi^2_U(x, y)$    1.3 ± 0.5     2.4 ± 0.5     0.69 ± 0.29   0.90 ± 0.07
$S^3_1(r)$ (de Sitter)    equation 8   $\chi^2_U(x, y)$    1.1 ± 0.3     2.1 ± 0.3     0.86 ± 0.15   0.88 ± 0.05
$\mathbb{R}^4_1$ (Minkowski)  equation 8   $\chi^2_U(x, y)$    1.0 ± 0.0     2.3 ± 0.9     0.70 ± 0.22   0.81 ± 0.13
$P^4_1(r)$                equation 8   $\chi^2_U(x, y)$    1.1 ± 0.3     2.4 ± 0.5     0.78 ± 0.13   0.91 ± 0.07

Graph dataset. We consider an undirected graph $G = (V, E)$ where $V = \{v_i\}_{i=1}^n$ is the node set and $E$ is the edge set such that $(v_i, v_j) \in E$ indicates that $v_i$ and $v_j$ are connected by an edge. Zachary's karate club dataset (Zachary, 1977) is a social network of $n = 34$ members of the karate club, each represented by a node $v_i$. The club was split in two factions due to a conflict between the instructor (node $v_1$) and the administrator (node $v_{34}$) that are the two most important members (i.e., leaders) of the dataset. The other members had to decide between joining the new club created by $v_1$ or staying with $v_{34}$. Two nodes are joined by an edge $(v_i, v_j) \in E$ if the members are friends. In the original ultrahyperbolic approaches (Law & Stam, 2020; Law, 2021), each node $v_i$ is represented by a point $x_i \in M$ on some pseudo-Riemannian manifold $M$. The goal is to learn embeddings $\{x_i\}_{i=1}^n$ so that pairs of nodes joined by an edge $(v_i, v_j) \in E$ are closer to each other than pairs of nodes not joined by an edge.
To this end, the embeddings are learned in Law & Stam (2020); Law (2021) by minimizing the following problem:

$$\min_{\{x_k \in M\}_{k=1}^n}\ -\sum_{(v_i, v_j) \in E} \log \frac{e^{-d(x_i, x_j)/\theta}}{e^{-d(x_i, x_j)/\theta} + \sum_{(v_a, v_b) \notin E} e^{-d(x_a, x_b)/\theta}} \tag{7}$$

where the dissimilarity function $d(x_i, x_j)$ is the arc length $d_\gamma(x_i, x_j) := \sqrt{|\langle \overrightarrow{x_i x_j}, \overrightarrow{x_i x_j} \rangle|}$ of the geodesic joining $x_i$ and $x_j$. If $M$ is Riemannian, $d_\gamma$ is the Riemannian distance. In this paper, we propose instead to consider in equation 7 that $d$ is the squared Lorentzian distance $\chi^2_U$, as defined in Section 3.3, where $U_x$ is the (convex) maximal normal neighborhood of $x \in M$. We also propose to simply enforce the embeddings of $v_i$ and $v_j$ to be joined by a timelike geodesic iff $(v_i, v_j) \in E$. Since $G$ is an undirected graph, the future- or past-direction is not provided, so we do not constrain the future-direction of the geodesics during training. Our problem formulation is then:

$$\min_{\{x_k \in M\}_{k=1}^n}\ \sum_{(v_a, v_b) \notin E} \sigma_\theta\big(d(x_a, x_b)\big) + \lambda \sum_{(v_i, v_j) \in E} \sigma_\theta\big(-d(x_i, x_j)\big) \tag{8}$$

where $\lambda > 0$ is a regularization parameter and we set $d(x_i, x_j) = \chi^2_U(x_i, x_j)$. The main goal of equation 8 is to enforce pairs of nodes $(v_i, v_j) \in E$ to be joined by a timelike geodesic and pairs $(v_a, v_b) \notin E$ to be joined by a spacelike geodesic since there is no causality between them. In Lorentz geometry, a timelike geodesic joining two points is the longest timelike curve in a given convex normal neighborhood. This translates into the high-level nodes $v_1$ and $v_{34}$ being the furthest from the rest of the nodes.

Figure 2 (caption fragment): The ground truth edges are plotted in yellow and the node color corresponds to the joined faction. A small number of spacelike edges are visible (those edges more than 45 degrees from vertical).

Evaluation metrics. We report in Table 2 the results obtained when optimizing equation 7 or equation 8 using different dissimilarity functions.
At test time, a score $\delta_i = \sum_{j=1}^n d(x_i, x_j)$ summing pairwise distances is assigned to each node $v_i$ and used as an indicator of importance in the hierarchical graph. Following Section 3.3, when the dissimilarity function is the arc length $d_\gamma$ (resp. squared Lorentzian distance $\chi^2_U$), we sort the scores $\delta_1, \ldots, \delta_n$ in ascending (resp. descending) order and report the ranks of the two leaders $v_1$ and $v_n$ (in no particular order) in the first two columns of Table 2, averaged over 10 different initializations. We also report in Table 2 Spearman's rank correlation coefficient $\rho$ (Spearman, 1904) between the ordered $\delta_i$ scores and the order of the 5 and 10 most important nodes, which are 34, 1, 33, 3, 2, 32, 24, 4, 9, 14 (in that order, see Law & Stam (2020)).

Results. In general, the squared Lorentzian distance $\chi^2_U$ returns better performance than the arc length $d_\gamma$. Moreover, the optimization problem in equation 8, which enforces timelike geodesics between connected nodes (i.e., causality) and spacelike geodesics between unconnected nodes (i.e., noncausality), extracts the most important nodes in low-dimensional spacetimes more effectively. The predicted $\rho$ also shows higher rank correlation with $\chi^2_U$. It is worth noting that hyperbolic geometry was shown to be relevant to represent graphs without cycles (Gromov, 1987), but our hierarchical graph contains cycles, which explains why geometries other than hyperbolic may be more relevant. The performance gap with non-Riemannian baselines (Law & Stam, 2020; Law, 2021) can be explained by the fact that those baselines compare the arc lengths of different types of geodesics (i.e., timelike or spacelike) that are not necessarily comparable. On the other hand, our framework sets non-causal distances to zero at test time, and compares only the lengths of causal geodesics. Figure 2 illustrates spacetime diagrams of the learned representations for $\mathbb{R}^2_1$ and $S^3_1(r)$.
The most important nodes tend to be further away from the other nodes with respect to the Lorentzian distance. Most ground truth edges are timelike geodesics and belong to the chronological past or future of an edge representation. We report in Appendix D.3 the same kind of hierarchy extraction experiment on a dataset that describes co-authorship information from papers published at NIPS from 1988 to 2003 (Globerson et al., 2007) . The number of nodes/authors is |V | = 2715 and the number of edges is |E| = 4733.
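As an illustration of equation 8 and of the test-time score $\delta_i$, the following minimal sketch evaluates both on Minkowski space using only the Python standard library. It is not the implementation used in the experiments: the function names are hypothetical, the graph is a toy example, and clamping non-causal values of $\chi^2_U$ to zero in the score is our reading of the remark in the Results paragraph.

```python
import math

def chi2_minkowski(x, y):
    """Squared Lorentzian distance on Minkowski space (positive iff y - x is timelike)."""
    dt = y[0] - x[0]
    return dt * dt - sum((b - a) ** 2 for a, b in zip(x[1:], y[1:]))

def sigmoid(z, theta):
    """sigma_theta(z) = 1 / (1 + exp(-z / theta)), written to avoid overflow."""
    t = z / theta
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def equation8_loss(points, edges, theta=1.0):
    """Equation 8 for an undirected graph; edges is a set of frozenset({i, j})."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pos = [p for p in pairs if frozenset(p) in edges]
    neg = [p for p in pairs if frozenset(p) not in edges]
    lam = len(neg) / len(pos)  # lambda = |E^c| / |E|, the value used in the experiments
    loss_neg = sum(sigmoid(chi2_minkowski(points[a], points[b]), theta) for a, b in neg)
    loss_pos = sum(sigmoid(-chi2_minkowski(points[i], points[j]), theta) for i, j in pos)
    return loss_neg + lam * loss_pos

def importance_scores(points):
    """delta_i = sum_j d(x_i, x_j), with non-causal pairs clamped to 0 (assumption)."""
    return [sum(max(0.0, chi2_minkowski(xi, xj)) for xj in points) for xi in points]
```

With a hub connected to two leaves, an embedding that places the leaves in the timecone of the hub and spacelike to each other yields a lower loss than one that does the opposite, and the hub receives the largest score.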

6. CONCLUSION

We have proposed a general framework to represent data, and in particular nodes of a directed graph, in a spacetime. Compared to previous work, our framework is properly endowed with the structure of a Lorentzian pre-length space, which makes it easy to optimize and applicable to a large family of spacetimes. In the case of hierarchical undirected graphs, we show that we can enforce causality between nodes connected by an edge to extract the most important nodes in the hierarchy.

where arccos is the inverse of the cosine function, and arccosh is the inverse of the hyperbolic cosine function. One can verify from equation 10 and equation 11 that $\overrightarrow{xy}$ is timelike (i.e., $\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle < 0$) iff $\langle x, y \rangle_\nu > r^2$.

Geodesic "distance". As explained in (Law & Stam, 2020) and Chapter 5 of (O'Neill, 1983), when the logarithmic map exists for some pseudo-Riemannian manifold $M$, the arc length of the geodesic $\gamma_{x \to \overrightarrow{xy}}$ from $x \in M$ to $y \in M$ corresponds to the radius function $\sqrt{|g_x(\overrightarrow{xy}, \overrightarrow{xy})|}$ where $g_x : T_x M \times T_x M \to \mathbb{R}$ is the metric tensor at $x$ and $\overrightarrow{xy} = \log_x(y)$ is the logarithmic map. The geodesic distance $d_\gamma : S^d_\nu(r) \times S^d_\nu(r) \to \mathbb{R}$ is then:

$$d_\gamma(x, y) := \sqrt{|\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle_\nu|} = \begin{cases} r \, \operatorname{arccosh}\!\left(\frac{\langle x, y \rangle_\nu}{r^2}\right) & \text{if } \frac{\langle x, y \rangle_\nu}{r^2} \geq 1 \\ r \, \arccos\!\left(\frac{\langle x, y \rangle_\nu}{r^2}\right) & \text{if } \frac{\langle x, y \rangle_\nu}{r^2} \in (-1, 1) \end{cases}$$

$d_\gamma$ is not a "distance metric" but a symmetric premetric: it satisfies (i) $d_\gamma(x, y) = d_\gamma(y, x) \geq 0$ and (ii) $d_\gamma(x, x) = 0$. In (O'Neill, 1983), the "minimizing geodesic" is defined by its arc length and then also corresponds to the geodesic distance mentioned above.

Parallel transport. Given the minimizing geodesic $\gamma$ connecting $x$ to $y$, the parallel transport $\Gamma^y_x : T_x S^d_\nu(r) \to T_y S^d_\nu(r)$ is a linear isometry such that $\forall u, v$, $\langle u, v \rangle_\nu = \langle \Gamma^y_x(u), \Gamma^y_x(v) \rangle_\nu$. The parallel transport along $\gamma$ from $x = \gamma(0)$ to $y = \gamma(1)$ (where $x$ and $y$ satisfy $\langle x, y \rangle_\nu > -r^2$) is:

$$\Gamma^y_x(u) := u - \frac{\langle y, u \rangle_\nu}{\langle x, y \rangle_\nu + r^2} (y + x)$$

Theorem B.1 (Diffeomorphism (Wolf, 2011)). There exists a diffeomorphism $\psi : S^d_\nu(r) \to \mathbb{R}^\nu \times S^{d-\nu}_0(r)$.
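Let us note $x = \begin{pmatrix} t \\ s \end{pmatrix} \in S^d_\nu(r)$ with $t \in \mathbb{R}^\nu$ and $s \in \mathbb{R}^{d-\nu+1}_*$. Let us note $z = \begin{pmatrix} t \\ v \end{pmatrix} \in \mathbb{R}^\nu \times S^{d-\nu}_0(r)$ where $v \in S^{d-\nu}_0(r)$. The mapping $\psi$ and its inverse $\psi^{-1}$ can be formulated:

$$\psi(x) = \begin{pmatrix} t \\ r \frac{s}{\|s\|} \end{pmatrix} \quad \text{and} \quad \psi^{-1}(z) = \begin{pmatrix} t \\ \frac{\sqrt{r^2 + \|t\|^2}}{r} v \end{pmatrix}$$

where $\|\cdot\|$ is the standard Euclidean norm (i.e., $\|s\| := \sqrt{s^\top s}$).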
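The mapping $\psi$ of Theorem B.1 and its inverse can be sketched in a few lines of Python. This is only an illustration under the conventions above (the first $\nu$ coordinates of $x$ form $t$, the remaining ones form $s$); the function names are hypothetical.

```python
import math

def psi(x, nu, r):
    """Map x = (t, s) on the pseudo-sphere to (t, r * s / ||s||)."""
    t, s = list(x[:nu]), list(x[nu:])
    norm_s = math.sqrt(sum(c * c for c in s))
    return t + [r * c / norm_s for c in s]

def psi_inv(z, nu, r):
    """Map z = (t, v), with ||v|| = r, back to (t, sqrt(r^2 + ||t||^2) / r * v)."""
    t, v = list(z[:nu]), list(z[nu:])
    scale = math.sqrt(r * r + sum(c * c for c in t)) / r
    return t + [scale * c for c in v]
```

The round trip works because every point of $S^d_\nu(r)$ satisfies $\|s\|^2 = r^2 + \|t\|^2$, so rescaling $v$ by $\sqrt{r^2 + \|t\|^2}/r$ recovers $s$.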

B.3 PSEUDO-HYPERBOLOID H d ν (r)

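We recall the differential geometry tools (from (Law & Stam, 2020)) specific to the pseudo-hyperboloid, which is defined as the set $H^d_\nu(r) := \{x \in \mathbb{R}^{d+1}_{\nu+1} : \langle x, x \rangle_{\nu+1} = -r^2\}$. The tangent space $T_x H^d_\nu(r)$ of $H^d_\nu(r)$ at $x$ can be defined as $T_x H^d_\nu(r) := \{u \in \mathbb{R}^{d+1}_{\nu+1} : \langle u, x \rangle_{\nu+1} = 0\}$. In the case of $H^d_\nu(r)$, we have $\forall u \in T_x H^d_\nu(r), \forall v \in T_x H^d_\nu(r)$, $g_x(u, v) = \langle u, v \rangle = \langle u, v \rangle_{\nu+1}$.

Geodesic. The geodesic $\gamma_{x \to u} : \mathbb{R} \to H^d_\nu(r)$ satisfying $\gamma_{x \to u}(0) = x$ and $\gamma'_{x \to u}(0) = u \in T_x H^d_\nu(r)$ is formulated for all $t \in \mathbb{R}$:

$$\gamma_{x \to u}(t) = \begin{cases} \cos\!\left(\frac{t \sqrt{|\langle u, u \rangle_{\nu+1}|}}{r}\right) x + \frac{r}{\sqrt{|\langle u, u \rangle_{\nu+1}|}} \sin\!\left(\frac{t \sqrt{|\langle u, u \rangle_{\nu+1}|}}{r}\right) u & \text{if } \langle u, u \rangle_{\nu+1} < 0 \\ x + t u & \text{if } \langle u, u \rangle_{\nu+1} = 0 \\ \cosh\!\left(\frac{t \sqrt{|\langle u, u \rangle_{\nu+1}|}}{r}\right) x + \frac{r}{\sqrt{|\langle u, u \rangle_{\nu+1}|}} \sinh\!\left(\frac{t \sqrt{|\langle u, u \rangle_{\nu+1}|}}{r}\right) u & \text{if } \langle u, u \rangle_{\nu+1} > 0 \end{cases} \qquad (15)$$

Exponential map. The exponential map $\exp_x : T_x H^d_\nu(r) \to H^d_\nu(r)$ is defined such that $\forall u \in T_x H^d_\nu(r)$, $\exp_x(u) = \gamma_{x \to u}(1)$. We then have:

$$\exp_x(u) = \begin{cases} \cos\!\left(\frac{\sqrt{|\langle u, u \rangle_{\nu+1}|}}{r}\right) x + \frac{r}{\sqrt{|\langle u, u \rangle_{\nu+1}|}} \sin\!\left(\frac{\sqrt{|\langle u, u \rangle_{\nu+1}|}}{r}\right) u & \text{if } \langle u, u \rangle_{\nu+1} < 0 \\ x + u & \text{if } \langle u, u \rangle_{\nu+1} = 0 \\ \cosh\!\left(\frac{\sqrt{|\langle u, u \rangle_{\nu+1}|}}{r}\right) x + \frac{r}{\sqrt{|\langle u, u \rangle_{\nu+1}|}} \sinh\!\left(\frac{\sqrt{|\langle u, u \rangle_{\nu+1}|}}{r}\right) u & \text{if } \langle u, u \rangle_{\nu+1} > 0 \end{cases} \qquad (16)$$

Logarithmic map. The logarithmic map $\log_x$ is defined as the inverse of the exponential map $\exp_x$ on a normal neighborhood of $x \in H^d_\nu(r)$ denoted by $U_x = \{y \in H^d_\nu(r) : \frac{\langle x, y \rangle_{\nu+1}}{r^2} < 1\}$. It is formulated:

$$\forall y \in U_x, \; \overrightarrow{xy} = \log_x(y) = \begin{cases} \frac{\operatorname{arccosh}\left(-\frac{\langle x, y \rangle_{\nu+1}}{r^2}\right)}{\sqrt{\left(\frac{\langle x, y \rangle_{\nu+1}}{r^2}\right)^2 - 1}} \left(y + \frac{\langle x, y \rangle_{\nu+1}}{r^2} x\right) & \text{if } \frac{\langle x, y \rangle_{\nu+1}}{r^2} < -1 \\ y - x & \text{if } \frac{\langle x, y \rangle_{\nu+1}}{r^2} = -1 \\ \frac{\arccos\left(-\frac{\langle x, y \rangle_{\nu+1}}{r^2}\right)}{\sqrt{1 - \left(\frac{\langle x, y \rangle_{\nu+1}}{r^2}\right)^2}} \left(y + \frac{\langle x, y \rangle_{\nu+1}}{r^2} x\right) & \text{if } \frac{\langle x, y \rangle_{\nu+1}}{r^2} \in (-1, 1) \end{cases} \qquad (17)$$

One can verify that a nonconstant geodesic from $x$ to $y$ is timelike iff $\langle x, y \rangle_{\nu+1} \in (-r^2, r^2)$ or $y = \pm x$.

Geodesic "distance". The geodesic distance $d_\gamma : H^d_\nu(r) \times H^d_\nu(r) \to \mathbb{R}$ is then:

$$d_\gamma(x, y) = \sqrt{|\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle_{\nu+1}|} = \begin{cases} r \, \operatorname{arccosh}\!\left(-\frac{\langle x, y \rangle_{\nu+1}}{r^2}\right) & \text{if } \frac{\langle x, y \rangle_{\nu+1}}{r^2} \leq -1 \\ r \, \arccos\!\left(-\frac{\langle x, y \rangle_{\nu+1}}{r^2}\right) & \text{if } \frac{\langle x, y \rangle_{\nu+1}}{r^2} \in (-1, 1) \end{cases}$$

Parallel transport on $H^d_\nu(r)$.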
The parallel transport connecting $x \in H^d_\nu(r)$ to $y \in H^d_\nu(r)$ is:

$$\Gamma^y_x(u) := u - \frac{\langle y, u \rangle_{\nu+1}}{\langle x, y \rangle_{\nu+1} - r^2} (y + x) \quad \text{where } \langle x, y \rangle_{\nu+1} < r^2 \qquad (19)$$

Theorem B.2 (Diffeomorphism (Wolf, 2011)). There exists a diffeomorphism $\psi : H^d_\nu(r) \to S^\nu_0(r) \times \mathbb{R}^{d-\nu}$. Let us note $x = \begin{pmatrix} t \\ s \end{pmatrix} \in H^d_\nu(r)$ with $t \in \mathbb{R}^{\nu+1}_*$ and $s \in \mathbb{R}^{d-\nu}$. Let us note $z = \begin{pmatrix} u \\ v \end{pmatrix} \in S^\nu_0(r) \times \mathbb{R}^{d-\nu}$ where $u \in S^\nu_0(r)$ and $v \in \mathbb{R}^{d-\nu}$. The mapping $\psi$ and its inverse $\psi^{-1}$ can be formulated:

$$\psi(x) = \begin{pmatrix} r \frac{t}{\|t\|} \\ s \end{pmatrix} \quad \text{and} \quad \psi^{-1}(z) = \begin{pmatrix} \frac{\sqrt{r^2 + \|v\|^2}}{r} u \\ v \end{pmatrix}$$

The anti-de Sitter spacetime $H^d_1(r)$ is non-chronological and satisfies $x \ll y \implies y \ll x$, which is convenient to represent graphs with directed cycles. It was used in Sim et al. (2021) to represent directed graphs. Nonetheless, it is worth noting that the problem formulation of Sim et al. (2021) for $H^d_1(r)$ also promotes arcs (i.e., causal relations) between pairs of nodes that are not connected by any geodesic, which makes their problem hard to optimize. Following general relativity, we only consider the existence of arcs if there exists a timelike geodesic in $V_x$ joining two events $x$ and $y$. See Appendix C for details.
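The exponential and logarithmic maps of equations 16 and 17 can be checked against each other with a short sketch. The code below is an illustration only: the helper names are hypothetical, `q` stands for $\nu + 1$ (the number of negative signs in the scalar product), and exact float comparisons are used for the degenerate cases purely for brevity.

```python
import math

def inner(u, v, q):
    """<u, v>_q: the first q coordinates carry a negative sign."""
    return sum((-a * b if i < q else a * b) for i, (a, b) in enumerate(zip(u, v)))

def exp_map(x, u, q, r):
    """Equation 16 on the pseudo-hyperboloid of radius r."""
    uu = inner(u, u, q)
    if uu == 0:
        return [a + b for a, b in zip(x, u)]
    s = math.sqrt(abs(uu))
    if uu < 0:  # timelike tangent vector
        c, k = math.cos(s / r), (r / s) * math.sin(s / r)
    else:       # spacelike tangent vector
        c, k = math.cosh(s / r), (r / s) * math.sinh(s / r)
    return [c * a + k * b for a, b in zip(x, u)]

def log_map(x, y, q, r):
    """Equation 17, defined for y in the normal neighborhood U_x."""
    w = inner(x, y, q) / (r * r)
    if w >= 1:
        raise ValueError("y is outside the normal neighborhood U_x")
    if w == -1:
        return [b - a for a, b in zip(x, y)]
    if w < -1:
        k = math.acosh(-w) / math.sqrt(w * w - 1)
    else:
        k = math.acos(-w) / math.sqrt(1 - w * w)
    return [k * (b + w * a) for a, b in zip(x, y)]
```

For instance, on $H^2_1(1)$ with `x = [1, 0, 0]`, applying `log_map` to `exp_map(x, u, 2, 1)` returns `u` up to rounding for both timelike and spacelike tangent vectors `u` of moderate length.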

B.4 PROJECTIVE SPACE $P^d_1(r)$

We refer the reader to (Law, 2021) for details about $P^d_1(r) := H^d_1(r)/\pm 1$. The manifold $P^d_1(r) := H^d_1(r)/\pm 1$ used in Law (2021) is time-orientable for all $d \geq 2$ (see page 214 of O'Neill (1983)).

C FUTURE DIRECTION, TIME SEPARATION FUNCTION AND LORENTZIAN DISTANCE

C.1 FUTURE DIRECTION

We discuss in this section how to constrain future direction for our spacetimes.

C.1.1 MINKOWSKI SPACE

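The explanation can be found in Section 3.2 when $M = \mathbb{R}^d_1$. We recall that we defined $\mathbf{t} := (1, 0, 0, 0)^\top$, $\overrightarrow{xy} = y - x$, $\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle = \langle \overrightarrow{xy}, \overrightarrow{xy} \rangle_1$ and $\alpha := -(y_0 - x_0) + \sqrt{\sum_{i=1}^{d-1} (y_i - x_i)^2}$. If $\alpha$ is negative, then $y_0 - x_0$ exceeds the Euclidean norm of the spatial part of $y - x$, so $\overrightarrow{xy}$ is timelike (i.e., $\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle < 0$) and $\overrightarrow{xy} \in C^+_x(\mathbf{t})$. By definition, $\overrightarrow{xy}$ is then future-directed timelike.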
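The test on $\alpha$ can be written as a short function. This sketch assumes that $\alpha$ compares the time difference $y_0 - x_0$ with the Euclidean norm of the spatial difference (i.e., the sum of squares is under a square root); the function names are hypothetical.

```python
import math

def alpha(x, y):
    """alpha = -(y_0 - x_0) + Euclidean norm of the spatial part of y - x."""
    spatial = math.sqrt(sum((b - a) ** 2 for a, b in zip(x[1:], y[1:])))
    return -(y[0] - x[0]) + spatial

def is_future_directed_timelike(x, y):
    """True iff y - x is timelike and lies in the future timecone of t = (1, 0, ..., 0)."""
    return alpha(x, y) < 0
```

A negative `alpha` forces $y_0 - x_0 > 0$, so a single comparison covers both the timelike condition and the future direction; past-directed timelike vectors and spacelike vectors both give a nonnegative value.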

C.1.2 DE SITTER SPACE

We provide the proof of Lemma 3.1, which states: when $M = S^d_1(r)$ and $\overrightarrow{xy}$ is timelike, $\overrightarrow{xy}$ is future-directed iff $\Gamma^x_p(\mathbf{t}) \in C^+_x(\overrightarrow{xy})$ if $\Gamma^x_p$ is defined (i.e., $p \in U_x$), and $\Gamma^x_{-p}(\mathbf{t}) \in C^+_x(\overrightarrow{xy})$ otherwise (i.e., $-p \in U_x$).

We first recall a property from page 146 of (O'Neill, 1983): if $M$ is a Lorentz manifold and a piecewise smooth curve $\alpha$ is timelike, this means not only that $\alpha'(t)$ is timelike, but that at each break $t_i$ of $\alpha$:

$$\langle \alpha'(t_i^-), \alpha'(t_i^+) \rangle < 0. \qquad (21)$$

Here the first vector derives from $\alpha|[t_{i-1}, t_i]$ and the second from $\alpha|[t_i, t_{i+1}]$. Thus $\alpha$ does not switch timecones at the break.

Let us note the poles $p = (0, \ldots, 0, r)^\top \in S^d_1(r)$, $-p = (0, \ldots, 0, -r)^\top \in S^d_1(r)$ and $\mathbf{t} = (1, 0, \ldots, 0)^\top$ where $\mathbf{t} \in T_p S^d_1(r)$ and $\mathbf{t} \in T_{-p} S^d_1(r)$. We also note $x := (x_0, x_1, \ldots, x_d)^\top \in S^d_1(r)$. If $x_d \in (-r, r)$, we show below that equation 21 can be satisfied by defining $\alpha$ such that $\alpha(t_0) = p$, $\alpha(t_1) = x$, $\alpha(t_2) = -p$, $\alpha'(t_0) = \mathbf{t}$, $\alpha'(t_1^-) = \Gamma^x_p(\mathbf{t})$, $\alpha'(t_1^+) = \Gamma^x_{-p}(\mathbf{t})$, $\alpha'(t_2) = \mathbf{t}$. For completeness, we also have $\forall t \in [t_0, t_1^-]$, $\alpha'(t) = \Gamma^{\alpha(t)}_p(\mathbf{t})$ and $\forall t \in [t_1^+, t_2]$, $\alpha'(t) = \Gamma^{\alpha(t)}_{-p}(\mathbf{t})$.

Let us arbitrarily consider that $\mathbf{t} \in T_p S^d_1(r)$ defines the future direction of the manifold. For all $x \in S^d_1(r)$ that satisfy $\langle x, p \rangle_1 > -r^2$ (i.e., $p \in U_x$), we note $\Gamma^x_p(\mathbf{t})$ the parallel translate of $\mathbf{t} \in T_p S^d_1(r)$ to $T_x S^d_1(r)$. As explained in Appendix B.1.2 of (Law, 2021), it is formulated:

$$\Gamma^x_p(\mathbf{t}) := \mathbf{t} - \frac{\langle x, \mathbf{t} \rangle_1}{\langle x, p \rangle_1 + r^2} (p + x) = \mathbf{t} + \frac{x_0}{r x_d + r^2} (p + x) \qquad (22)$$

We know from page 66 of (O'Neill, 1983) that parallel transport (also called parallel translation) is a linear isometry that satisfies:

$$\forall u \in T_x S^d_1(r), \forall v \in T_x S^d_1(r), \; \langle u, v \rangle = \langle \Gamma^p_x(u), \Gamma^p_x(v) \rangle = \langle \Gamma^p_x(u), \Gamma^p_x(v) \rangle_1$$

and $u = \Gamma^x_p(\Gamma^p_x(u))$. This implies $\Gamma^x_p(\mathbf{t}) \in C^+_x(u) \iff \Gamma^p_x(u) \in C^+_p(\mathbf{t})$. Similarly, if $\langle x, -p \rangle_1 > -r^2$ (i.e., $-p \in U_x$), we have $\Gamma^x_{-p}(\mathbf{t}) \in C^+_x(u) \iff \Gamma^{-p}_x(u) \in C^+_{-p}(\mathbf{t})$.

Let us assume that $x$ satisfies both $p \in U_x$ (i.e., $\langle x, p \rangle_1 > -r^2$) and $-p \in U_x$ (i.e., $\langle x, p \rangle_1 < r^2$); this is equivalent to $x$ satisfying $x_d \in (-r, r)$. If $x_d \in (-r, r)$, then $\Gamma^x_p(\mathbf{t}) \in C^+_x(u), \Gamma^x_{-p}(\mathbf{t}) \in C^+_x(u) \iff \langle \Gamma^x_p(\mathbf{t}), \Gamma^x_{-p}(\mathbf{t}) \rangle_1 < 0$. By definition, we have:

$$\Gamma^x_{-p}(\mathbf{t}) := \mathbf{t} - \frac{\langle x, \mathbf{t} \rangle_1}{\langle x, -p \rangle_1 + r^2} (-p + x) = \mathbf{t} + \frac{x_0}{-r x_d + r^2} (-p + x)$$

$$\langle \Gamma^x_p(\mathbf{t}), \Gamma^x_{-p}(\mathbf{t}) \rangle_1 = -1 - \frac{x_0^2}{r x_d + r^2} - \frac{x_0^2}{-r x_d + r^2} = -1 - \frac{2 x_0^2}{(x_d + r)(-x_d + r)} < 0$$

We have shown that $\Gamma^x_p(\mathbf{t})$ and $\Gamma^x_{-p}(\mathbf{t})$ have the same future direction by showing that $\langle \Gamma^x_p(\mathbf{t}), \Gamma^x_{-p}(\mathbf{t}) \rangle_1 < 0$. A (piecewise) smooth curve that preserves causality can then be found by using $\Gamma^x_p(\mathbf{t})$ or $\Gamma^x_{-p}(\mathbf{t})$.

C.1.3 ANTI-DE SITTER SPACE

When $M = H^d_1(r)$, $\overrightarrow{xy}$ is future-directed iff $\Gamma^x_p(\mathbf{t}) \in C^+_x(\overrightarrow{xy})$ if $\Gamma^x_p$ is defined (i.e., $p \in U_x$), and $\Gamma^x_{-p}(-\mathbf{t}) \in C^+_x(\overrightarrow{xy})$ otherwise (i.e., $-p \in U_x$). The proof is similar to the one in Section C.1.2. Let us note the poles $p = (r, 0, \ldots, 0)^\top \in H^d_1(r)$, $-p = (-r, 0, \ldots, 0)^\top \in H^d_1(r)$ and $\mathbf{t} = (0, 1, 0, \ldots, 0)^\top$ where $\mathbf{t} \in T_p H^d_1(r)$ and $\mathbf{t} \in T_{-p} H^d_1(r)$. Let $x = (x_{-1}, x_0, \ldots, x_{d-1})^\top \in H^d_1(r)$. We recall that $p \in U_x$ iff $\langle x, p \rangle_2 < r^2$ (i.e., $x_{-1} > -r$) and $-p \in U_x$ iff $\langle x, -p \rangle_2 < r^2$ (i.e., $x_{-1} < r$). We define $x$ such that $x_{-1} \in (-r, r)$ (i.e., $p \in U_x$ and $-p \in U_x$).

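Let us note $\Gamma^x_p(\mathbf{t})$ the parallel translate of $\mathbf{t} \in T_p H^d_1(r)$ to $T_x H^d_1(r)$. As explained in Appendix B.3 of (Law, 2021), it is formulated:

$$\Gamma^x_p(\mathbf{t}) := \mathbf{t} - \frac{\langle x, \mathbf{t} \rangle_2}{\langle x, p \rangle_2 - r^2} (p + x) = \mathbf{t} + \frac{x_0}{-r x_{-1} - r^2} (p + x) \qquad (27)$$

We show below that $\Gamma^x_p(\mathbf{t})$ and $\Gamma^x_{-p}(-\mathbf{t})$ have the same future direction by showing that $\langle \Gamma^x_p(\mathbf{t}), \Gamma^x_{-p}(-\mathbf{t}) \rangle_2 < 0$.

$$\Gamma^x_{-p}(-\mathbf{t}) := -\mathbf{t} - \frac{\langle x, -\mathbf{t} \rangle_2}{\langle x, -p \rangle_2 - r^2} (-p + x) = -\mathbf{t} - \frac{x_0}{r x_{-1} - r^2} (-p + x)$$

$$\langle \Gamma^x_p(\mathbf{t}), \Gamma^x_{-p}(-\mathbf{t}) \rangle_2 = 1 + \frac{x_0^2}{r x_{-1} - r^2} + \frac{x_0^2}{-r x_{-1} - r^2} = 1 - \frac{2 x_0^2}{r^2 - x_{-1}^2} \qquad (29)$$

$$= \frac{r^2 - x_{-1}^2 - 2 x_0^2}{r^2 - x_{-1}^2} < 0. \qquad (30)$$

Equation 30 is negative because we have by definition of $H^d_1(r)$: $-x_{-1}^2 - x_0^2 + \sum_{i=1}^{d-1} x_i^2 = -r^2$, which implies $x_0^2 \geq r^2 - x_{-1}^2 > 0$, and the denominator is positive because we defined $x_{-1} \in (-r, r)$.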
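The two scalar products computed in Sections C.1.2 and C.1.3 can be checked numerically. The sketch below is only a sanity check under the conventions of Appendix B: the helper names are hypothetical, and `sign = +1` selects the pseudo-sphere form of the parallel transport while `sign = -1` selects the pseudo-hyperboloid form of equation 19.

```python
import math

def inner(u, v, q):
    """<u, v>_q: the first q coordinates carry a negative sign."""
    return sum((-a * b if i < q else a * b) for i, (a, b) in enumerate(zip(u, v)))

def transport(src, dst, u, q, r, sign):
    """Parallel transport of u from T_src to T_dst:
    u - <dst, u>_q / (<src, dst>_q + sign * r^2) * (dst + src)."""
    coef = inner(dst, u, q) / (inner(src, dst, q) + sign * r * r)
    return [a - coef * (b + c) for a, b, c in zip(u, dst, src)]

def de_sitter_product(x, r):
    """<Gamma^x_p(t), Gamma^x_{-p}(t)>_1 for x on S^d_1(r)."""
    d = len(x) - 1
    p = [0.0] * d + [r]
    t = [1.0] + [0.0] * d
    g_plus = transport(p, x, t, 1, r, +1)
    g_minus = transport([-c for c in p], x, t, 1, r, +1)
    return inner(g_plus, g_minus, 1)

def anti_de_sitter_product(x, r):
    """<Gamma^x_p(t), Gamma^x_{-p}(-t)>_2 for x on H^d_1(r)."""
    n = len(x)
    p = [r] + [0.0] * (n - 1)
    t = [0.0, 1.0] + [0.0] * (n - 2)
    g_plus = transport(p, x, t, 2, r, -1)
    g_minus = transport([-c for c in p], x, [-c for c in t], 2, r, -1)
    return inner(g_plus, g_minus, 2)
```

For a point with $x_d \in (-r, r)$ on de Sitter space the first product equals $-1 - 2x_0^2 / ((x_d + r)(r - x_d))$, and for a point with $x_{-1} \in (-r, r)$ on anti-de Sitter space the second equals $1 - 2x_0^2 / (r^2 - x_{-1}^2)$; both are negative, as claimed.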

C.2 TIME SEPARATION FUNCTION

We give the formulations we use for the time separation function $\tau(x, y)$, which is positive if $y$ is in the chronological future of $x$. There is no unique way to define it. We assume here that $\nu = 1$.

Globally hyperbolic spacetimes. If the spacetime $M$ is globally hyperbolic, there exists a Cauchy time function $c : M \to \mathbb{R}$ (i.e., for any $t \in \mathbb{R}$, the set $\Sigma_t = \{y \in M : c(y) = t\}$ is a Cauchy hypersurface) that can be used to define $\tau(x, y) := c(y) - c(x)$ since it satisfies the reverse triangle inequality $\tau(x, z) \geq \tau(x, y) + \tau(y, z)$ when $x \leq y \leq z$. With this formulation, it actually satisfies $\tau(x, z) = \tau(x, y) + \tau(y, z)$ when $x \leq y \leq z$. In theory, to have a Lorentzian pre-length space, one could define $\tau$ such that it is 0 if $x \nleq y$, but this is not easy to optimize in practice, so we ignore this constraint for optimization purposes.

C.2.1 MINKOWSKI SPACE $\mathbb{R}^d_1$

If $M = \mathbb{R}^d_1$ then we arbitrarily define:

$$\tau(x, y) := -\langle \overrightarrow{xy}, \mathbf{t} \rangle = -\langle \overrightarrow{xy}, \mathbf{t} \rangle_1 = y_0 - x_0$$

where $\mathbf{t} := (1, 0, \ldots, 0)^\top$ defines the future timecone $C^+_x(\mathbf{t})$. It is worth noting that the function $c : \mathbb{R}^d_1 \to \mathbb{R}$ defined such that $\forall x = (x_0, \ldots, x_{d-1})^\top \in \mathbb{R}^d_1$, $c(x) := x_0$ is a Cauchy time function.

C.2.2 DE SITTER SPACE $S^d_1(r)$

Following our discussion in Section C.1.2 and using the formulation of the parallel transport in equation 22, we can formulate the time function as:

$$\tau(x, y) := \begin{cases} -\langle \Gamma^p_x(\overrightarrow{xy}), \mathbf{t} \rangle_\nu & \text{if } \langle x, p \rangle_\nu \geq 0 \\ -\langle \Gamma^{-p}_x(\overrightarrow{xy}), \mathbf{t} \rangle_\nu & \text{otherwise} \end{cases} \qquad (32)$$

where $\nu = 1$. One limitation of equation 32 is that it assumes that $\overrightarrow{xy}$ is defined, which might not be the case (if $y \notin U_x$). From the discussion in Chapter 5.2 and Figure 16 (i) of (Hawking & Ellis, 1973), which uses Cartesian coordinates to formulate space coordinates on the pseudo-sphere, the function $c : S^d_1(r) \to \mathbb{R}$ defined such that $\forall x = (x_0, \ldots, x_d)^\top$, $c(x) := x_0$ is a Cauchy time function. One can then also simply define:

$$\tau(x, y) := y_0 - x_0. \qquad (33)$$

In practice, we found that using equation 33 returns better performance than equation 32 in the experiments of Section D.1. We then also use it for the experiment of Section 5.1. See Section D.2 for details.

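C.2.3 ANTI-DE SITTER SPACE $H^d_1(r)$

When $\nu > 0$, $H^d_\nu(r)$ is non-chronological and satisfies $x \ll y \implies y \ll x$, which is convenient to represent graphs with directed cycles. There exists a future-directed timelike geodesic from $x$ to $y$ only if $\overrightarrow{xy}$ is timelike or $y = \pm x$. Following our discussion in Section C.1.3 and using the formulation of the parallel transport in equation 27, we can formulate the time function as:

$$\tau(x, y) := \begin{cases} -\langle \Gamma^p_x(\overrightarrow{xy}), \mathbf{t} \rangle_{\nu+1} & \text{if } \langle x, p \rangle_{\nu+1} \leq 0 \\ -\langle \Gamma^{-p}_x(\overrightarrow{xy}), -\mathbf{t} \rangle_{\nu+1} & \text{otherwise} \end{cases} \qquad (34)$$

C.2.4 PROJECTIVE VERSION OF THE ANTI-DE SITTER SPACE $P^d_1(r)$

We recall that each point of $P^d_1(r)$ is an unordered pair $\{-x, x\} \in P^d_1(r)$ where $x \in H^d_1(r)$. By using the formulation of the parallel transport in equation 27 and assuming $\nu = 1$, we can then define:

$$\tau(\{-x, x\}, \{-y, y\}) := \begin{cases} -\langle \Gamma^p_x(\overrightarrow{xy}), \mathbf{t} \rangle_{\nu+1} & \text{if } \langle x, p \rangle_{\nu+1} \leq 0 \text{ and } \langle x, y \rangle_{\nu+1} \leq 0 \\ -\langle \Gamma^{-p}_x(\overrightarrow{xy}), -\mathbf{t} \rangle_{\nu+1} & \text{if } \langle x, p \rangle_{\nu+1} > 0 \text{ and } \langle x, y \rangle_{\nu+1} \leq 0 \\ -\langle \Gamma^p_x(\overrightarrow{x(-y)}), \mathbf{t} \rangle_{\nu+1} & \text{if } \langle x, p \rangle_{\nu+1} \leq 0 \text{ and } \langle x, y \rangle_{\nu+1} > 0 \\ -\langle \Gamma^{-p}_x(\overrightarrow{x(-y)}), -\mathbf{t} \rangle_{\nu+1} & \text{if } \langle x, p \rangle_{\nu+1} > 0 \text{ and } \langle x, y \rangle_{\nu+1} > 0 \end{cases} \qquad (35)$$

Let us assume that $p = (r, 0, \ldots, 0)^\top$. We can choose $x$ such that $x_{-1} \geq 0$ so that $\langle x, p \rangle_2 \leq 0$ is satisfied, and we can also choose $y$ so that $\langle x, y \rangle_2 \leq 0$ is satisfied. We can define $\mathbf{t} := (0, 1, 0, \ldots, 0)^\top \in T_p P^d_1(r)$. If we define $u = (u_{-1}, u_0, \ldots, u_{d-1})^\top = \Gamma^p_x(\overrightarrow{xy})$, then equation 35 can be rewritten:

$$\tau(\{-x, x\}, \{-y, y\}) = u_0 \qquad (36)$$

Another possibility is to use the following time separation function:

$$\tau(\{-x, x\}, \{-y, y\}) = u_0 - \sqrt{\sum_{i=1}^{d-1} u_i^2} \qquad (37)$$

which is positive only if $\overrightarrow{xy}$ is future-directed timelike.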

C.2.5 CYLINDRICAL MINKOWSKI SPACE $L^d_1(C)$

As we explain in the main paper, we propose to define the chronological future $I^+(x, V_x)$ of $x \in L^d_1(C)$ such that if $\overrightarrow{xy}$ is timelike, we have $y \in I^+(x, V_x)$ if $\exists k \in \mathbb{Z}$, $y_0 + kC \in (x_0, x_0 + C/2)$. Similarly, the chronological past $I^-(x, V_x)$ is defined such that if $\overrightarrow{xy}$ is timelike, we have $y \in I^-(x, V_x)$ if $\exists k \in \mathbb{Z}$, $y_0 + kC \in (x_0 - C/2, x_0)$. We define the time separation function for $L^d_1(C)$ as follows:

$$\tau(x, y) := \left( \left( y_0 - x_0 + \frac{C}{2} \right) \bmod C \right) - \frac{C}{2} \in \left[ -\frac{C}{2}, \frac{C}{2} \right) \qquad (38)$$

where we use the modulo operation for real values, which can be written as $a \bmod b := a - b \lfloor a / b \rfloor$, and $\lfloor \cdot \rfloor$ is the floor function.
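The wrap-around behavior of this time separation function is easy to see in code. The sketch below is an illustration with a hypothetical function name; it relies on the fact that Python's `%` operator on floats with a positive modulus returns a value in $[0, C)$, which matches the floor-based definition of the modulo used here.

```python
def tau_cylinder(x, y, C):
    """Time separation on the Minkowski cylinder: the time difference wrapped into [-C/2, C/2)."""
    return (y[0] - x[0] + C / 2) % C - C / 2
```

With $C = 3$, the time coordinates 1, 2, 3 give `tau_cylinder((1, 0), (2, 0), 3) = 1`, `tau_cylinder((2, 0), (3, 0), 3) = 1` and `tau_cylinder((3, 0), (1, 0), 3) = 1`, so each point lies in the future of the previous one around the cycle, while the reversed pairs give $-1$.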

C.3 SQUARED LORENTZIAN DISTANCE

We give the formulation of the squared Lorentzian distance for the different spacetimes that we use in the main paper depending on the nature of $M$:

If $M = \mathbb{R}^d_1$,
$$\chi^2_U(x, y) := -\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle = -\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle_1 := (y_0 - x_0)^2 - \sum_{j=1}^{d-1} (y_j - x_j)^2. \qquad (39)$$

If $M = S^d_1(r)$,
$$\chi^2_U(x, y) := -\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle = \begin{cases} r^2 \operatorname{arccosh}^2\!\left(\frac{\langle x, y \rangle_1}{r^2}\right) & \text{if } \frac{\langle x, y \rangle_1}{r^2} \geq 1 \\ 2(\langle x, y \rangle_1 - r^2) & \text{otherwise.} \end{cases} \qquad (40)$$

If $M = P^d_1(r)$,
$$\chi^2_U(x, y) := -\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle = \begin{cases} r^2 \arccos^2\!\left(\frac{|\langle x, y \rangle_2|}{r^2}\right) & \text{if } \frac{\langle x, y \rangle_2}{r^2} \in [-1, 1] \\ -2(|\langle x, y \rangle_2| - r^2) & \text{otherwise.} \end{cases} \qquad (41)$$

If $M = L^d_1(C)$,
$$\chi^2_U(x, y) := \left( \left( \left( y_0 - x_0 + \frac{C}{2} \right) \bmod C \right) - \frac{C}{2} \right)^2 - \sum_{j=1}^{d-1} (y_j - x_j)^2. \qquad (42)$$

The above equations are equal to 0 when $x$ is equivalent to $y$, which is the case if $y = x$ in general, or if $y = \pm x$ when $M = P^d_1(r)$, or if $y \sim x$ when $M = L^d_1(C)$. Otherwise, the equations are positive iff there exists at least one timelike geodesic from $x$ to $y$. If the timelike geodesic from $x$ to $y \neq x$ is uniquely defined in $U_x$, we call it $\overrightarrow{xy}$, and these equations are positive whether $\overrightarrow{xy}$ is future-directed timelike or past-directed timelike.

Squared (Lorentzian) distance in Sim et al. (2021). Sim et al. (2021) acknowledge that there exists no geodesic from $x$ to $y$ if $M = H^d_1(r)$ and $\frac{\langle x, y \rangle_2}{r^2} > 1$. Therefore, they consider that their squared Lorentzian distance is equal to $\pi^2$ in this case to keep their function smooth. This promotes causality between $x$ and $y$. However, constraining the future direction between $x$ and $y$ when $\frac{\langle x, y \rangle_2}{r^2} > 1$ becomes problematic without the explicit formulation of a timelike curve. This is why they formulate their time function based on a different criterion than ours, which uses parallel transport (see Section C.2). Moreover, the fact that their distance function becomes a constant in this case makes it difficult to optimize as the gradient is zero.
Instead, we propose to use the manifold M = P d 1 (r) which can represent the same types of relationships between points as H d 1 (r) since P d 1 (r) contains elliptic and hyperbolic parts (Law, 2021) , and any pair of points of P d 1 (r) can be joined by a geodesic.
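The four squared Lorentzian distances of equations 39 to 42 can be sketched as follows. This is an illustrative transcription with hypothetical function names, not the code used in the experiments; points are plain sequences of floats and the first coordinates are the timelike ones.

```python
import math

def inner(u, v, q):
    """<u, v>_q: the first q coordinates carry a negative sign."""
    return sum((-a * b if i < q else a * b) for i, (a, b) in enumerate(zip(u, v)))

def chi2_minkowski(x, y):
    """Equation 39."""
    diff = [b - a for a, b in zip(x, y)]
    return -inner(diff, diff, 1)

def chi2_de_sitter(x, y, r):
    """Equation 40: arccosh branch for timelike pairs, ambient chord otherwise."""
    w = inner(x, y, 1)
    if w / r**2 >= 1:
        return r**2 * math.acosh(w / r**2) ** 2
    return 2 * (w - r**2)

def chi2_projective(x, y, r):
    """Equation 41: invariant under y -> -y through the absolute value."""
    w = abs(inner(x, y, 2))
    if w / r**2 <= 1:
        return r**2 * math.acos(w / r**2) ** 2
    return -2 * (w - r**2)

def chi2_cylinder(x, y, C):
    """Equation 42: Minkowski form with the time difference wrapped into [-C/2, C/2)."""
    dt = (y[0] - x[0] + C / 2) % C - C / 2
    return dt * dt - sum((b - a) ** 2 for a, b in zip(x[1:], y[1:]))
```

In each case a positive value indicates that the two points are joined by a timelike geodesic, regardless of its time direction, and a negative value indicates a spacelike separation.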

C.4 ABOUT THE CHOICE OF ε IN EQUATION 4

We explain here why our formulation of equation 4 is general and can be extended to other spacetimes. As explained in Theorem 2.7 of Minguzzi (2019), an open globally hyperbolic convex normal neighborhood $V_x$ can be defined for any point $x \in M$ where $M$ is a spacetime. As we mention in Section 3, any open subset of a spacetime is a spacetime. Therefore, $V_x$ is a spacetime that can be chosen to be globally hyperbolic. We recall that a globally hyperbolic spacetime is strongly causal, and that the chronological future $I^+(x, V_x)$ is the set of points $y \in V_x$ that satisfy $x \ll y$ (i.e., there exists a timelike curve from $x$ to $y$). Since we define $V_x$ to be a convex normal neighborhood, we know from Proposition 4.5.3 of Hawking & Ellis (1973) that any pair of its points $x, y$ that satisfy $x \ll y$ can be joined by a unique future-directed timelike geodesic $\gamma_{x \to \overrightarrow{xy}}$, which is also the unique longest curve joining $x$ to $y$. Since $\gamma_{x \to \overrightarrow{xy}}$ is future-directed timelike, it satisfies $\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle < 0$ and $\langle \overrightarrow{xy}, \mathbf{t} \rangle < 0$ (i.e., $\overrightarrow{xy} \in C^+_x(\mathbf{t})$) where the timelike tangent vector $\mathbf{t} \in T_x M$ defines the future direction of $M$. Moreover, the arc length of the geodesic $\gamma_{x \to \overrightarrow{xy}}$ from $x$ to $y$ is $\sqrt{-\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle}$, and is called its Lorentzian distance. From Theorem 5.6 and Lemma 5.7 of Minguzzi (2019), there exists a strongly causal open set containing $x$ such that the Lorentzian distance $\sqrt{-\langle \overrightarrow{xy}, \overrightarrow{xy} \rangle}$ between $x$ and any other point $y$ in that set is upper bounded by some constant $\varepsilon > 0$. We can then define that open set to be the convex normal neighborhood $V_x$ and we can choose $\varepsilon > 0$ such that all the points $y \in V_x$ that satisfy $x \ll y$ also satisfy $-\varepsilon^2 < \langle \overrightarrow{xy}, \overrightarrow{xy} \rangle < 0$. Since $V_x$ is a subset of the maximal normal neighborhood $U_x$, we then obtain exactly the definition of equation 4, which is: $I^+(x, V_x) = \{y \in U_x : -\varepsilon^2 < \langle \overrightarrow{xy}, \overrightarrow{xy} \rangle < 0, \overrightarrow{xy} \in C^+_x(\mathbf{t})\}$. In our experiments, we consider that $V_x = U_x$.
When M is the Minkowski space R d 1 or the de Sitter space S d 1 (r), we then have ε = +∞ but we describe some case where ε might be finite in Section 3.2. Similarly, we have ε = rπ/2 when M = P d 1 (r), and ε = C/2 when M = L d 1 (C). 

D ADDITIONAL EXPERIMENTS AND EXPERIMENTAL DETAILS

We now report additional experiments and provide experimental details. Setup. We ran all our experiments on a single desktop with 64 GB of RAM, a 6-core Intel i7-7800X CPU and a NVIDIA GeForce RTX 3090 GPU.

D.1 CHRONOLOGICAL ORDER IN DIRECTED GRAPHS

Our goal in this subsection is to represent a directed graph with spacetimes. As in Clough & Evans (2017), we select the 200 and 1000 most cited papers in the Arxiv High-energy physics theory (HEP-TH) citation network (Gehrke et al., 2003). HEP-TH is originally a dataset of 27,770 papers (each represented by a node) with 352,807 edges. The graph contains an arc from $v_i$ to $v_j$ if paper $i$ cites paper $j$. When selecting the 200 or 1000 most cited papers, the graph is not a Directed Acyclic Graph (DAG) as there exist pairs of papers that cite each other. We ignore these pairs of arcs. To simplify the notation, we write $v_i \prec v_j$ either if there exists a path from $v_i$ to $v_j$, or if there exists an arc from $v_i$ to $v_j$ but not from $v_j$ to $v_i$ (i.e., there can exist a longer path from $v_j$ to $v_i$). We also write $v_a \parallel v_b$ if there exists no path from $v_a$ to $v_b$ or from $v_b$ to $v_a$ (i.e., $v_a \parallel v_b \iff v_b \parallel v_a$). We optimize the problem:

$$\min_{\{x_k \in M\}_{k=1}^n} \; \sum_{v_a \parallel v_b} \sigma_{\theta_1}\!\left(\chi^2_U(x_a, x_b)\right) + \lambda \sum_{v_i \prec v_j} \left[ \sigma_{\theta_1}\!\left(-\chi^2_U(x_i, x_j)\right) + \sigma_{\theta_2}\!\left(-\tau(x_i, x_j)\right) \right]$$

where $\sigma_\theta(x) := 1/(1 + e^{-x/\theta})$ is the sigmoid function, $\theta_1, \theta_2 > 0$ are temperature parameters, $\lambda$ is a regularization parameter and $\tau(x, y)$ is a time separation function that is positive (resp. negative) if $y$ is in the chronological future (resp. past) of $x$. For instance, if $M = \mathbb{R}^d_1$ then we arbitrarily define $\tau(x, y) := -\langle \overrightarrow{xy}, \mathbf{t} \rangle = y_0 - x_0$, where $\mathbf{t} := (1, 0, \ldots, 0)^\top$ defines the future timecone $C^+_x(\mathbf{t})$. In this experiment, our temperature hyperparameter values are $\theta_1 = \theta_2 = 1$, and we fix the radius to $r = 1$. We run our experiments for $10^8$ iterations with a step size of $10^{-6}$ by using the optimization tools of (Law, 2021; Law & Stam, 2020). The regularization parameter $\lambda$ is set to $\lambda = \frac{|E^c|}{|E|}$ where $|E|$ is the number of pairs that satisfy $v_i \prec v_j$ and $|E^c|$ is the number of pairs that satisfy $v_a \parallel v_b$.
We report in Table 3 how well the learned representations manage to preserve chronological order when we select the $|V| = 200$ or 1000 most cited papers. For instance, if $M = \mathbb{R}^d_1$, we report the percentage of ordered pairs of nodes $(v_i, v_j)$ represented by $x_i$ and $x_j$ that satisfy $\langle \overrightarrow{x_i x_j}, \mathbf{t} \rangle < 0$. The chronological manifolds $\mathbb{R}^d_1$ and $S^d_1(r)$ manage to predict chronological order better than the nonchronological manifold $P^d_1(r)$. This suggests that chronological spacetimes are more appropriate to represent graphs that are almost DAGs. The time separation function in equation 33 returns better performance than equation 32, so we use it in the rest of our experiments.

Algorithm 1 (steps recovered from the source):
Calculate $\nabla f(x)$, i.e., the Euclidean gradient of $f$ at $x$ in the Euclidean ambient space
3: $\chi \leftarrow \Pi_x(G \, \Pi_x(G \, \nabla f(x)))$
4: $x \leftarrow \exp_x(-\eta \chi)$ where $\eta > 0$ is a step size
5: end while

F OPTIMIZATION

The optimizers that we use in the main paper for the different spacetimes can all be implemented as described in Algorithm 1. The goal is to minimize some function $f : M \to \mathbb{R}$ by using differential geometry tools described in (Gao et al., 2018; Law, 2021; Law & Stam, 2020). $\nabla f(x)$ is the Euclidean gradient of $f$ at $x$, $\Pi_x(z)$ is the orthogonal projection of an arbitrary vector $z$ onto $T_x M$, and $G$ is an involutory matrix (i.e., $G^{-1} = G$). We consider in the following that the step size $\eta > 0$ is fixed and given. We give details for the different spacetimes that we consider in the main paper.
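One iteration of Algorithm 1 on the pseudo-hyperboloid can be sketched as below. Several ingredients are assumptions made for illustration and are not stated in this section: $G$ is taken to be the diagonal sign matrix of the scalar product $\langle \cdot, \cdot \rangle_q$ with $q = \nu + 1$, the projection is taken to be $\Pi_x(z) = z - \frac{\langle z, x \rangle_q}{\langle x, x \rangle_q} x$, and all function names are hypothetical. The exponential map is the one of equation 16.

```python
import math

def inner(u, v, q):
    """<u, v>_q: the first q coordinates carry a negative sign."""
    return sum((-a * b if i < q else a * b) for i, (a, b) in enumerate(zip(u, v)))

def apply_G(z, q):
    """Assumed G: flips the sign of the first q coordinates (so G is its own inverse)."""
    return [-a if i < q else a for i, a in enumerate(z)]

def project(x, z, q):
    """Assumed projection onto T_x M: z - <z, x>_q / <x, x>_q * x."""
    c = inner(z, x, q) / inner(x, x, q)
    return [a - c * b for a, b in zip(z, x)]

def exp_map(x, u, q, r):
    """Exponential map of equation 16."""
    uu = inner(u, u, q)
    if uu == 0:
        return [a + b for a, b in zip(x, u)]
    s = math.sqrt(abs(uu))
    if uu < 0:
        c, k = math.cos(s / r), (r / s) * math.sin(s / r)
    else:
        c, k = math.cosh(s / r), (r / s) * math.sinh(s / r)
    return [c * a + k * b for a, b in zip(x, u)]

def descent_step(x, grad, q, r, eta):
    """chi <- Pi_x(G Pi_x(G grad)); x <- exp_x(-eta * chi)."""
    zeta = project(x, apply_G(grad, q), q)
    chi = project(x, apply_G(zeta, q), q)
    return exp_map(x, [-eta * c for c in chi], q, r)
```

Under these assumptions the updated point stays on the manifold because $\chi$ is tangent at $x$, and for a small step size the value of a linear function $f(x) = g^\top x$ decreases, since the Euclidean derivative of $f$ along $\chi$ equals the squared Euclidean norm of the intermediate vector $\Pi_x(G \nabla f(x))$.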

G EXAMPLES OF GRAPHS

Since our general framework is fairly abstract, we give some explicit examples of directed graphs with or without cycles that can be described by our framework.

G.1 DIRECTED ACYCLIC GRAPHS (DAGS)

We first consider a directed acyclic graph whose underlying undirected graph contains a cycle (Figure 6). The DAG $G = (V, E)$ is defined as $V = \{v_i\}_{i=1}^6$ and the set of arcs is $E = \{(v_1, v_4), (v_2, v_4), (v_2, v_5), (v_3, v_5), (v_1, v_6), (v_3, v_6)\}$. One can see that $G$ is not a directed tree since the undirected path $v_1, v_4, v_2, v_5, v_3, v_6, v_1$ is cyclic. We recall that if $M$ is globally hyperbolic, $M$ is also chronological and can then only describe DAGs with our framework. Let us give an example where $M = \mathbb{R}^{d+1}_1$ or $M = S^d_1(r) \subset \mathbb{R}^{d+1}_1$ with $d = 3$ and $r > 0$ (e.g., $r = 1$). Both $\mathbb{R}^{d+1}_1$ and $S^d_1(r)$ are globally hyperbolic. Let us consider any value $\varepsilon > 0$, $a = r + \varepsilon$ and $b = \sqrt{2a^2 - r^2}$. We take the embeddings: $x_1 = (0, r, 0, 0)^\top$, $x_2 = (0, 0, r, 0)^\top$, $x_3 = (0, 0, 0, r)^\top$, $x_4 = (b, a, a, 0)^\top$, $x_5 = (b, 0, a, a)^\top$, $x_6 = (b, a, 0, a)^\top$. When $M = S^d_1(r)$, we know from Section B.2 that $\overrightarrow{x_i x_j}$ is timelike (i.e., $\langle \overrightarrow{x_i x_j}, \overrightarrow{x_i x_j} \rangle < 0$) iff $\langle x_i, x_j \rangle_1 > r^2$. By using the time separation function in equation 33 and assuming $\overrightarrow{x_i x_j}$ is timelike, we only need to compare the first coordinate of $x_i$ and $x_j$ to determine the direction of the edge between $v_i$ and $v_j$. The case $M = \mathbb{R}^{d+1}_1$ is similar except that $\overrightarrow{x_i x_j} := x_j - x_i$ is timelike iff $\langle \overrightarrow{x_i x_j}, \overrightarrow{x_i x_j} \rangle_1 < 0$. One can verify that $\forall i, j$, $(v_i, v_j) \in E \iff x_j \in U_{x_i}$ and the tangent vector $\overrightarrow{x_i x_j}$ is future-directed timelike.

G.2.1 MINKOWSKI CYLINDER $L^d_1(C)$

We now illustrate a simple example of a graph with a directed cycle that can be drawn with $L^d_1(C)$ (Figure 7). The graph $G = (V, E)$ is defined as $V = \{v_i\}_{i=1}^3$ and $E = \{(v_1, v_2), (v_2, v_3), (v_3, v_1)\}$, which is a graph with a directed cycle. Let us consider that $C = 3$ and $d = 2$. The maximal normal neighborhood of every point $x \in \mathbb{R}^d_1$ is $U_x = \{y = (y_0, \ldots, y_{d-1})^\top \in \mathbb{R}^d_1 : y_0 \in (x_0 - 1.5, x_0 + 1.5)\}$. We also consider that $V_x = U_x$. Since $L^d_1(C)$ is a quotient set, its points are equivalence classes. We consider three equivalence classes $[x_i] := \{(i + 3k, 0)^\top : k \in \mathbb{Z}\}$ where $i \in \{1, 2, 3\}$ and we define $x_i := (i, 0)^\top \in \mathbb{R}^d_1$. In other words, the five points $x_0 = (0, 0)^\top$, $x_1 = (1, 0)^\top$, $x_2 = (2, 0)^\top$, $x_3 = (3, 0)^\top$, $x_4 = (4, 0)^\top$ in $\mathbb{R}^d_1$ actually correspond to three points in $L^d_1(C)$ due to the equivalence relation (i.e., $x_3 \sim x_0$ and $x_4 \sim x_1$). We can then compare those three equivalence classes by comparing $x_i$ with $x_{i-1}$ and $x_{i+1}$ only. One can verify that $\forall i \in \{1, 2, 3\}$, we have $x_{i+1} \in U_{x_i}$, $x_{i+1} \in I^+(x_i, U_{x_i})$, and $x_{i-1} \in U_{x_i}$. However, we also have $x_{i-1} \notin I^+(x_i, U_{x_i})$. We then obtain the graph illustrated in Figure 7.
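The DAG example of Section G.1 can be verified numerically on de Sitter space. The sketch below rebuilds the six embeddings from $a = r + \varepsilon$ and $b = \sqrt{2a^2 - r^2}$ and recovers the arcs with the rule stated above: an arc from $v_i$ to $v_j$ is predicted when $\langle x_i, x_j \rangle_1 > r^2$ (timelike) and the first coordinate increases (equation 33). The function names and the default values $r = 1$, $\varepsilon = 0.5$ are choices made for illustration.

```python
import math

def inner1(x, y):
    """Lorentzian scalar product <x, y>_1 (first coordinate is timelike)."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def build_example(r=1.0, eps=0.5):
    """Embeddings x_1, ..., x_6 of the DAG example on S^3_1(r)."""
    a = r + eps
    b = math.sqrt(2 * a * a - r * r)
    return [(0, r, 0, 0), (0, 0, r, 0), (0, 0, 0, r),
            (b, a, a, 0), (b, 0, a, a), (b, a, 0, a)]

def recovered_arcs(points, r):
    """Arcs (i, j), 1-indexed, such that x_i -> x_j is future-directed timelike."""
    arcs = set()
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            timelike = inner1(xi, xj) > r * r
            future = xj[0] - xi[0] > 0  # equation 33
            if i != j and timelike and future:
                arcs.add((i + 1, j + 1))
    return arcs
```

Running `recovered_arcs(build_example(), 1.0)` returns exactly the six arcs of $E$: each of $x_4, x_5, x_6$ shares one spatial coordinate with two of $x_1, x_2, x_3$, which gives $\langle x_i, x_j \rangle_1 = ra > r^2$ for those pairs only.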



Figure 1: Geodesics of the de Sitter space S d 1 (r) (left) and of the anti-de Sitter space H d 1 (r) (right).

Sim et al. (2021) use the Minkowski space R d 1 , the anti de-Sitter space H d 1 (r), and the Cylindrical Minkowski space L d 1 (C). Both H d 1 (r) and L d 1 (C) are non-chronological and can then represent graphs with directed cycles. Sim et al. (2021) design a probability function that they call Triple Fermi-Dirac (TFD) specifically for their Cylindrical Minkowski space.

Figure 2: (left) Coordinates of 2-dimensional embeddings $x = (x_0, x_1)$ learned with equation 8 when $M = \mathbb{R}^2_1$. (right) First three coordinates of embeddings $x = (x_0, x_1, x_2, x_3)$ learned with equation 8 when $M = S^3_1(r)$. In Lorentz geometry, a timelike geodesic joining two points is the longest timelike curve in a given convex normal neighborhood. This translates into the high-level nodes $v_1$ and $v_{34}$ being the furthest from the rest of the nodes. The ground truth edges are plotted in yellow and the node color corresponds to the joined faction. A small number of spacelike edges are visible (those edges more than 45 degrees from vertical).

Figure 3 illustrates the embeddings learned when M = R 2 1 . Article representations tend to satisfy the chronological order along the time coordinate.

both equation 7 and equation 8, we set $\theta = 10^{-2}$, $r = 1$ and train the model for $10^5$ iterations/epochs. In equation 8, the regularization parameter $\lambda$ is set to $\lambda = \frac{|E^c|}{|E|}$ where $|E|$ is the number of pairs that satisfy $(v_i, v_j) \in E$ and $|E^c|$ is the number of pairs that satisfy $(v_a, v_b) \notin E$.

Figure 4: Coordinates of embeddings {-x, x} ∈ P 2 1 (1) ⊂ R 3 2 learned with equation 8.

Figure 5: (top) Coordinates of 2-dimensional embeddings $x = (x_0, x_1)$ learned with equation 8 when $M = \mathbb{R}^2_1$. (bottom) First three coordinates of embeddings $x = (x_0, x_1, x_2, x_3)$ learned with equation 8 when $M = S^3_1(r)$. In Lorentz geometry, a timelike geodesic joining two points is the longest timelike curve in a given convex normal neighborhood. This translates into the high-level nodes $v_1$ and $v_{34}$ being the furthest from the rest of the nodes. The ground truth edges are plotted in yellow and the node color corresponds to the joined faction. A small number of spacelike edges are visible (those edges more than 45 degrees from vertical).

Figure 6: DAG containing an undirected cycle.

or M = S d 1 (r) ⊂ R d+1 1 with d = 3 and r > 0 (e.g., r = 1). Both R d+1 1 and S d 1 (r) are globally hyperbolic. Let us consider any value ε > 0, a = r + ε and b =

Figure 7: A simple directed cycle.

given by the structure of the Lorentzian pre-length space $(V_{x_i}, d, \leq, \ll, \tau)$ where $V_{x_i} \subseteq U_{x_i}$ is an open subset of the maximal normal neighborhood $U_{x_i}$. More precisely, we consider the chronological future $I^+(x_i, V_{x_i}) := \{y \in V_{x_i} : x_i \ll y\}$ of the point $x_i$ relative to some given set $V_{x_i} \subseteq U_{x_i}$ (see page 402 of O'Neill (

Link prediction for directed graphs. Median average precision (AP) percentages across 20 random initializations on a held-out test set.

Evaluation scores for the different learned representations (mean ± standard deviation). ↓ the lower the metric, the better. ↑ the larger the metric (in absolute value), the better.

Preservation of chronological order between pairs of articles with one citing the other.


Evaluation scores for the different learned representations (mean ± standard deviation). ↓ the lower the metric, the better. ↑ the larger the metric, the better. Whole dataset ρ (↑) top si ≥ 10 ρ (↑) Top si ≥ 20 ρ (↑)

ACKNOWLEDGMENTS

We thank Yuval Atzmon, Gal Chechik and Haggai Maron for helpful discussions during the initial phase of this project. We also thank Aaron Sim for sharing his source code, Rafid Mahmood, Jos Stam and the anonymous reviewers for valuable feedback on early versions of the manuscript.

We report scores for the NIPS dataset in Table 5. We report Spearman's rank correlation coefficient $\rho$ (Spearman, 1904) for all the authors (left), the authors with at least 10 coauthors (middle) and authors with at least 20 coauthors (right). Spacetimes return better performance for the subset of authors with at least 10 coauthors.

A SUMMARY OF SUPPLEMENTARY MATERIAL

The supplementary material is structured as follows:
• Section B gives the differential geometry tools that are general to pseudo-Riemannian manifolds of constant curvature (called space forms).
• Section C gives the differential geometry tools that are specific to our spacetimes such as the time separation function and the squared Lorentzian distance.
• Section D presents additional results and further experimental details.
• Section E is an extended related work section.
• Section F gives the details of the optimizers we use.
• Section G gives examples of directed graphs described with our model.

B DIFFERENTIAL GEOMETRY OF PSEUDO-RIEMANNIAN SPACE FORMS

We provide the necessary differential geometry tools to work on the pseudo-Euclidean space $\mathbb{R}^d_\nu$, the pseudo-sphere $S^d_\nu(r)$ and the pseudo-hyperboloid $H^d_\nu(r)$. Most of them are explained in (Law, 2021; Law & Stam, 2020). For further details, we refer the reader to (Hawking & Ellis, 1973; O'Neill, 1983; Wolf, 2011).

B.1 PSEUDO-EUCLIDEAN SPACE

We recall that $\forall \nu$, $\mathbb{R}^d_\nu \approx T_x\mathbb{R}^d_\nu$. Its geodesic $\gamma_{x \to y} : \mathbb{R} \to \mathbb{R}^d_\nu$ is $\gamma_{x \to y}(t) := x + ty$. The exponential map at $x$ is defined as $\exp_x(y) := \gamma_{x \to y}(1) = x + y$ and its inverse is $\log_x(y) := y - x$.

We give here the differential geometry tools specific to the pseudo-sphere, which is defined as the following hypersurface: $S^d_\nu(r) := \{x \in \mathbb{R}^{d+1}_\nu : \langle x, x \rangle_\nu = r^2\}$. The tangent space $T_x S^d_\nu(r)$ of $S^d_\nu(r)$ at $x$ can be defined as: $T_x S^d_\nu(r) := \{u \in \mathbb{R}^{d+1}_\nu : \langle u, x \rangle_\nu = 0\}$.

In the case of the pseudo-sphere, we have ∀u ∈ T. We then have:

Logarithmic map. The logarithmic map $\log_x$ is the inverse of the exponential map $\exp_x$ on a normal neighborhood of $x \in S^d_\nu(r)$ denoted by $U_x = \{y \in S^d_\nu(r) : \frac{\langle x, y \rangle_\nu}{r^2} > -1\}$. It is formulated:

In this section, we provide details about the experiments in Section 5.1.

We report in Table 4 the link prediction scores on the Saccharomyces cerevisiae, in silico and Escherichia coli DREAM5 datasets (Marbach et al., 2012). Following the evaluation protocol of Sim et al. (2021), we report the median and standard deviation across 20 random initializations. We use the same training and test splits as Sim et al. (2021).

We use the following hyperparameters on the DREAM5 datasets to define equation 6:

When $d = 10$, 50 or 100, $S^d_1(r)$ is not time-oriented (see explanation in Section 3.2).
We report the scores obtained in these dimensionalities for a fair comparison with the baselines. We also reran the experiments of Minkowski + TFD on the Escherichia coli DREAM5 dataset.

It is worth noting that the DREAM5 datasets contain a relatively small number of cycles:
• The Saccharomyces cerevisiae DREAM5 dataset contains 9 nodes that are part of at least one directed cycle (0.5% of 1,994 nodes) and 19 edges that are part of at least one directed cycle (0.5% of 3,940).
• The Escherichia coli DREAM5 dataset contains 18 nodes that are part of at least one directed cycle (1.7% of 1,081 nodes) and 23 edges that are part of at least one directed cycle (1.1% of 2,066).
• The in silico DREAM5 dataset contains 28 nodes that are part of at least one directed cycle (1.8% of 1,565 nodes) and 39 edges that are part of at least one directed cycle (1.0% of 4,012).

The choice of a specific manifold acts as an inductive bias. When the manifold is chosen to be non-chronological, the created graph does not necessarily contain directed cycles, but their existence is allowed. On the other hand, chronological manifolds are more appropriate for Directed Acyclic Graphs as they ensure that the created graph does not contain directed cycles. From our results, it seems that the (non-chronological) Cylindrical Minkowski spacetime obtains much better performance in the low-dimensional case, as its lack of causality allows some flexibility that is less beneficial in the high-dimensional case.

Sim et al. (2021) also use the synthetic "Duplication-Divergence" (Dupdiv) dataset (Ispolatov et al., 2005), but their dataset contains only 100 nodes and 1,026 edges. We generated a bigger version of Dupdiv that contains 1,000 nodes and 26,649 edges (22,651 for training/validation and 3,998 for test) following the same setup. More precisely, 748 nodes (74.8%) are part of at least one directed cycle and 22,409 edges are part of at least one directed cycle (84.1% of 26,649). Sim et al. (2021) obtain their best performance on their Dupdiv dataset with the Cylindrical Minkowski + Triple Fermi-Dirac (TFD). We compare it on our larger dataset with Cylindrical Minkowski + equation 6. The results are reported in Table 4 and show a consistent gain of 2% mean Average Precision by using a proper time separation function.

We use the following hyperparameters for Dupdiv: $\mathcal{M} = \mathbb{R}^d_1$, $\theta_1 = 0.4$, $\theta_2 = 0.07$, exponent $m = 0.5$, learning rate = 0.02, number of epochs = 500.
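The cycle statistics quoted above (nodes and edges that are part of at least one directed cycle) can be obtained from the strongly connected components of the graph. The text does not say how these counts were computed, so the following is only a minimal sketch under that assumption: a node lies on a directed cycle iff its component has more than one node (or it has a self-loop), and an edge lies on a directed cycle iff both endpoints are in the same component. The function names are hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (iterative). Returns a dict: node -> component id."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)

    # First pass: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for s in nodes:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(fwd[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(fwd[v])))
                    break
            else:
                order.append(u)
                stack.pop()

    # Second pass: sweep the reversed graph in decreasing completion order.
    comp = {}
    for s in reversed(order):
        if s in comp:
            continue
        comp[s] = s
        stack = [s]
        while stack:
            u = stack.pop()
            for v in bwd[u]:
                if v not in comp:
                    comp[v] = s
                    stack.append(v)
    return comp


def cycle_statistics(nodes, edges):
    """Count nodes and edges that lie on at least one directed cycle."""
    comp = strongly_connected_components(nodes, edges)
    size = defaultdict(int)
    for c in comp.values():
        size[c] += 1
    self_loops = {u for u, v in edges if u == v}
    cyc_nodes = sum(1 for u in nodes if size[comp[u]] > 1 or u in self_loops)
    cyc_edges = sum(1 for u, v in edges if comp[u] == comp[v])
    return cyc_nodes, cyc_edges


# Toy graph: a directed triangle 1 -> 2 -> 3 -> 1 plus a dangling edge 3 -> 4.
toy_stats = cycle_statistics([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)])
```

On the toy graph, the three triangle nodes and the three triangle edges are counted, while node 4 and the edge 3 → 4 are not.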

E EXTENDED RELATED WORK

To help explain the contributions of our work relative to prior art, we use this section to provide a more detailed comparison of our contributions to those of Sim et al. (2021). Sim et al. (2021) extended Clough & Evans (2017) to the anti-de Sitter space and Lorentz cylinder. Although our motivation is similar, our contributions are methodological, rely on a simpler use of the intuitions of general relativity and Lorentzian pre-length spaces, and provide an easier interpretation of the learned representations as we explain below.

First, Sim et al. (2021) do not clearly address the case when there is no geodesic between pairs of points, and their optimization framework leads to a distance loss term with a zero gradient in this case, which makes it difficult to optimize. Moreover, in Sim et al. (2021), the prediction of an arc between a pair of nodes is determined via a Triple Fermi-Dirac (TFD) probability function that accounts for the squared (Lorentzian) distance between the nodes, the time coordinate difference $\Delta t$ and its opposite value $-\Delta t$. In other words, TFD accounts for both the chronological future and past (with different weights) of a given node. The major methodological difference with Sim et al. (2021) is that we restrict the representation of nodes connected by an arc to belong to $I^+(x, V_x)$ where $V_x$ is a convex normal neighborhood. Although subtle, this difference makes the optimization and interpretation of results easier.

In general, the Lorentzian distance function from $x$ to $y$ on $\mathcal{M}$ is defined to be infinite (i.e., $\chi_{\mathcal{M}}(x, y) = +\infty$) if $\mathcal{M}$ is non-chronological and there exists a closed timelike curve joining $x$ and $y$. Moreover, the Hopf-Rinow theorem does not hold for spacetimes. We propose to work with convex normal neighborhoods, which allows us to restrict the existence of arcs between nodes to the existence of geodesics joining events. Our distance function $\chi_V$ is called a local distance function (see Definition 4.25 of Beem et al. (1996)) when its domain is restricted to a convex normal neighborhood. Moreover, $\chi_V$ is continuous and differentiable on $V_x \times I^+(x, V_x)$ (see Lemma 4.26 of Beem et al. (1996)). In some cases, it might be easier to optimize its square $\chi_V^2$, which is of class $C^2$ on $V_x \times V_x$ (see Theorem 2.6 of Minguzzi (2019)). It is worth noting that $V_x$ can be defined to be globally hyperbolic for any spacetime (see Theorem 2.7 of Minguzzi (2019)). This means that $V_x$ admits a Cauchy time function that can be used to define a time separation function $\tau$ (see explanation in Section C.2) whose sign defines the direction of edges.

Our framework shares similarities with Sim et al. (2021) when $\mathcal{M} = \mathbb{R}^d_1$, since $\mathbb{R}^d_1$ is globally hyperbolic and any pair of points of $\mathbb{R}^d_1$ can be joined by a geodesic. However, the way we define the time separation $\tau$ (instead of using the same $\Delta t$) is different when $\mathcal{M} = L^d_1(C)$ because we restrict it to be calculated on the maximal convex normal neighborhood. We construct $\tau$ so that it is positive if $x_j \in I^+(x_i, V_{x_i})$ and negative if $x_j \in I^-(x_i, V_{x_i})$. We then enforce $\tau$ to be positive if we want an arc from $v_i$ to $v_j$. The fact that we work only with an open convex set instead of the whole manifold $\mathcal{M}$ is crucial because $x_j$ can belong to both the chronological future $I^+(x_i, \mathcal{M}) := \{y \in \mathcal{M} : x_i \ll y\}$ and past $I^-(x_i, \mathcal{M}) := \{y \in \mathcal{M} : y \ll x_i\}$ if $\mathcal{M}$ is non-chronological. When the entire manifold is used, as in Sim et al. (2021), the sign of $\Delta t$ is less informative. Our approach allows us to determine the direction of the arc joining $v_i$ and $v_j$ only via the sign of $\tau$.

We also define a general way of optimizing $\tau$ via the parallel transport (see Appendix C.2) instead of working only with Cartesian coordinates, which is not meaningful for some spacetimes such as $P^d_1(r)$.
Moreover, since we restrict our Lorentzian distances to be calculated in the convex normal neighborhood $V_x$, we also have the nice interpretation that the Lorentzian distance corresponds to the length of the longest causal curve joining points.

Another contribution is the connection of our work with the theory of Lorentzian pre-length spaces (Kunzinger & Sämann, 2018), which does not require notions of differential geometry to be understood and can be applied to discrete topological spaces (see Example 2.16 of Kunzinger & Sämann (2018)). The framework proposed by Sim et al. (2021) is not a Lorentzian pre-length space because their time coordinate difference function does not satisfy the properties of a time separation function (especially when $\mathcal{M}$ is non-chronological).
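To illustrate how the sign of a time separation can encode the direction of an arc, the sketch below treats only the simplest case $\mathcal{M} = \mathbb{R}^d_1$, where the whole space is a convex normal neighborhood. It is not the function $\tau$ of Section C.2, which is not reproduced here; it is a stand-in with the same sign behavior, built from the textbook definition of the chronological future in Minkowski spacetime. The function names are hypothetical, and we assume that the time coordinate is $x_0$ and that the future direction is that of increasing $x_0$.

```python
import math


def lorentz_inner(u, v):
    """Inner product of R^d_1: minus sign on the first (time) coordinate."""
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))


def signed_time_separation(x, y):
    """Signed proper time from x to y in Minkowski spacetime.

    Positive if y is in the chronological future of x, negative if y is in
    its chronological past, and 0.0 if the pair is not timelike-separated.
    Assumption: the future direction is that of increasing x_0.
    """
    diff = [b - a for a, b in zip(x, y)]
    q = lorentz_inner(diff, diff)
    if q >= 0.0:  # spacelike or lightlike separation: no timelike geodesic
        return 0.0
    return math.copysign(math.sqrt(-q), diff[0])


def has_arc(x_i, x_j):
    """Predict an arc from v_i to v_j iff x_j lies in the chronological future of x_i."""
    return signed_time_separation(x_i, x_j) > 0.0
```

For instance, `has_arc((0.0, 0.0), (2.0, 1.0))` holds while the reversed pair does not, and a spacelike pair such as `(0.0, 0.0)` and `(1.0, 2.0)` yields no arc in either direction.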


This optimizer was explained in (Gao et al., 2018). When $\mathcal{M} = \mathbb{R}^d_1$, $\Pi_x$ is the identity function, $G$ is the diagonal matrix with the first diagonal element equal to $-1$ and the remaining ones equal to 1, and $\exp_x(y) := x + y$. Algorithm 1 corresponds to the standard Euclidean gradient descent because $\chi = \nabla f(x)$ in this case.

We recall that $L^d_1(C) = \mathbb{R}^d_1/\sim$, a quotient set defined such that $x \in \mathbb{R}^d_1$ and $y \in \mathbb{R}^d_1$ are equivalent (i.e., $x \sim y$) iff $\forall i > 0$, $y_i = x_i$ and $\exists k \in \mathbb{Z}$, $y_0 = x_0 + kC$ where $C > 0$ is a circumference hyperparameter. When $\mathcal{M} = L^d_1(C)$, we use the same optimizer as in Section F.1. Although it is optional, we also reproject the time coordinate of $x = (x_0, \dots, x_{d-1})$ at the end of each iteration as follows: $x_0 \leftarrow \left(\left(x_0 + \frac{C}{2}\right) \bmod C\right) - \frac{C}{2}$, where we use the modulo operation for real values, which can be written as $a \bmod b := a - b \lfloor \frac{a}{b} \rfloor$, and $\lfloor \cdot \rfloor$ is the floor function. If the initial value of $x_0$ is not in $[-\frac{C}{2}, \frac{C}{2})$, this projects $x$ to an equivalent point.

This optimizer was introduced in (Law & Stam, 2020). We recall that $S^d_1(r) := \{x \in \mathbb{R}^{d+1}_1 : \langle x, x \rangle_1 = r^2\}$. $G$ is the diagonal matrix with the first diagonal element equal to $-1$ and the remaining ones equal to 1. We have: $\Pi_x(z) := z - \frac{\langle z, x \rangle_1}{\langle x, x \rangle_1} x = z - \frac{\langle z, x \rangle_1}{r^2} x$. The exponential map is defined in equation 10.
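The optional reprojection of the time coordinate on $L^d_1(C)$ can be sketched as follows. We assume that the update maps $x_0$ to its equivalent representative in $[-\frac{C}{2}, \frac{C}{2})$ by shifting by $\frac{C}{2}$, applying the real-valued modulo $a \bmod b := a - b \lfloor a/b \rfloor$, and shifting back; this form is inferred from the interval stated in the text, and the function names are hypothetical.

```python
import math


def real_mod(a, b):
    """Real-valued modulo: a mod b := a - b * floor(a / b), for b > 0."""
    return a - b * math.floor(a / b)


def reproject_time(x, C):
    """Return the point equivalent to x whose time coordinate lies in [-C/2, C/2).

    Only x[0] changes, and it changes by an integer multiple of C, so the
    result is in the same equivalence class of L^d_1(C) as the input.
    """
    x0 = real_mod(x[0] + C / 2.0, C) - C / 2.0
    return [x0] + list(x[1:])


wrapped = reproject_time([3.5, 0.7], C=2.0)
```

In this example, 3.5 is shifted by two full circumferences (k = -2), so the time coordinate of `wrapped` lies in [-1, 1) while the space coordinate is unchanged.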

F.4 ANTI-DE SITTER

$G$ is the diagonal matrix with the first two diagonal elements equal to $-1$ and the remaining ones equal to 1. We have: $\Pi_x(z) := z - \frac{\langle z, x \rangle_2}{\langle x, x \rangle_2} x = z + \frac{\langle z, x \rangle_2}{r^2} x$. The exponential map is defined in equation 15.
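As a sanity check on the projection $\Pi_x$, the following sketch implements the orthogonal projection onto the tangent space for a pseudo-Euclidean inner product with $\nu$ negative signs ($\nu = 2$ for the anti-de Sitter case above). The function names are hypothetical, and plain Python lists are used instead of a tensor library to keep the example self-contained.

```python
def inner(u, v, nu):
    """Pseudo-Euclidean inner product: the first nu coordinates carry a minus sign."""
    return sum((-a * b if k < nu else a * b) for k, (a, b) in enumerate(zip(u, v)))


def project_to_tangent(x, z, nu):
    """Orthogonal projection of z onto the tangent space at x: z - (<z,x>/<x,x>) x."""
    coeff = inner(z, x, nu) / inner(x, x, nu)
    return [zk - coeff * xk for zk, xk in zip(z, x)]


# Point on the anti-de Sitter space with r = 1, i.e. <x, x>_2 = -1.
x = [1.0, 0.0, 0.0]
z = [0.3, 2.0, -1.0]
u = project_to_tangent(x, z, nu=2)
```

The projected vector `u` satisfies $\langle u, x \rangle_2 = 0$, and it coincides with $z + \frac{\langle z, x \rangle_2}{r^2} x$ because $\langle x, x \rangle_2 = -r^2$ at this point.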

F.5 PROJECTIVE VERSION OF THE ANTI-DE SITTER

The neural network optimizer is given in (Law, 2021). Otherwise, the embedding optimizer is the same as in Section F.4. The main difference is how the points are compared to calculate the distance and time separation function.

We now consider the graph $G = (V, E)$ defined as $V = \{v_i\}_{i=1}^{4}$ and $E = \{(v_1, v_4), (v_2, v_1), (v_3, v_1), (v_3, v_2), (v_4, v_2), (v_4, v_3)\}$. This is a graph with directed cycles (e.g., $v_1 \to v_4 \to v_2 \to v_1$, see Figure 8).

We recall that every point of $P^d_1(r)$ can be written as the unordered pair $\{-x, x\}$ where $x \in H^d_1(r) \subset \mathbb{R}^{d+1}_2$. Taking $d = 2$, the maximal normal neighborhood of every point $\{-x, x\}$ can be written $U_{\{-x, x\}} = \{\{-y, y\} : y \in H^d_1(r), \langle x, y \rangle_2 < 0\}$.

Let us consider $r = 1$, $\varepsilon_1 = -0.1$, $\varepsilon_2 = \varepsilon_3 = -0.5$, $a = -1$, $b = -\sqrt{(r + \varepsilon_1)^2 - r^2 + a^2}$, $c = -2$, $e = -\sqrt{(r + \varepsilon_2)^2 - r^2 + c^2}$, $g = 1$, $h = \sqrt{(r + \varepsilon_3)^2 - r^2 + g^2}$. We construct the four following points: $x_1 = p = (r, 0, 0)$, $x_2 = (r + \varepsilon_1, a, b)$, $x_3 = (r + \varepsilon_2, c, e)$, $x_4 = (r + \varepsilon_3, g, h)$.

To define the future direction, we consider the timelike tangent vector $t = (0, 1, 0) \in T_p H^d_1(r)$. In our framework, there exists an edge between $x_i$ and $x_j$ iff $|\langle x_i, x_j \rangle_2| \in (0, r^2)$. The direction of the edge is determined by using Section C.2.4.
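The construction above can be checked numerically. The sketch below assumes that $b$, $e$ and $h$ are the square roots that place each point on the hyperboloid $\langle x, x \rangle_2 = -r^2$, with the signs given in the text. It verifies that the four points lie on $H^2_1(r)$ and that every pair satisfies the edge criterion $|\langle x_i, x_j \rangle_2| \in (0, r^2)$. It does not compute edge directions, which rely on Section C.2.4. The variable names are hypothetical.

```python
import math
from itertools import combinations


def inner2(u, v):
    """Inner product of R^3_2: the first two coordinates carry a minus sign."""
    return -u[0] * v[0] - u[1] * v[1] + u[2] * v[2]


r = 1.0
eps1, eps2, eps3 = -0.1, -0.5, -0.5
a, c, g = -1.0, -2.0, 1.0
b = -math.sqrt((r + eps1) ** 2 - r ** 2 + a ** 2)
e = -math.sqrt((r + eps2) ** 2 - r ** 2 + c ** 2)
h = math.sqrt((r + eps3) ** 2 - r ** 2 + g ** 2)

points = {
    1: (r, 0.0, 0.0),
    2: (r + eps1, a, b),
    3: (r + eps2, c, e),
    4: (r + eps3, g, h),
}

# Every point should satisfy <x, x>_2 = -r^2 (membership in the hyperboloid).
on_hyperboloid = all(abs(inner2(x, x) + r ** 2) < 1e-9 for x in points.values())

# Undirected adjacency from the criterion |<x_i, x_j>_2| in (0, r^2).
adjacent = {
    (i, j)
    for i, j in combinations(sorted(points), 2)
    if 0.0 < abs(inner2(points[i], points[j])) < r ** 2
}

# The edge set E of the example graph, with orientation dropped.
expected = {tuple(sorted(arc)) for arc in [(1, 4), (2, 1), (3, 1), (3, 2), (4, 2), (4, 3)]}
```

With these values, all six pairs of points are adjacent, which matches the six arcs of $E$ once their orientation is ignored.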

