AUTOENCODING HYPERBOLIC REPRESENTATION FOR ADVERSARIAL GENERATION

Abstract

With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information in data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real, complex data. In this work, we propose a hyperbolic generative network with novel architecture and layers designed to improve training stability. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces a hyperbolic embedding of the input data; second, a hyperbolic generative adversarial network (GAN) that generates the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of its layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.

1. INTRODUCTION

High-dimensional data often show an underlying geometric structure that cannot be easily captured by neural networks designed for Euclidean spaces. Recently, there has been intense interest in learning good representations for hierarchical data, for which the most natural underlying geometry is hyperbolic. A hyperbolic space is a Riemannian manifold with constant negative curvature (Anderson, 2006). The exponential growth of the volume of a ball with its radius provides the hyperbolic space with high capacity, which makes it particularly suitable for modeling tree-like hierarchical structures. Hyperbolic representations have been successfully applied to, for instance, social network data in product recommendation (Wang et al., 2019), molecular data in drug discovery (Yu et al., 2020; Wu et al., 2021), and skeletal data in action recognition (Peng et al., 2020). Many recent works (Ganea et al., 2018; Shimizu et al., 2021; Chen et al., 2021) have successfully designed hyperbolic neural operations. These operations have been used in generative models for generating samples in the hyperbolic space. For instance, several recent works (Nagano et al., 2019; Mathieu et al., 2019; Dai et al., 2021b) have built hyperbolic variational autoencoders (VAE) (Kingma & Welling, 2014). On the other hand, Lazcano et al. (2021) have generalized generative adversarial networks (GAN) (Goodfellow et al., 2014; Arjovsky et al., 2017) to the hyperbolic space. However, the above hyperbolic generative models are known to suffer from gradient explosion when the networks are deep. In order to build hyperbolic networks that can generate real data, it is desirable to have a framework that has both representation power and numerical stability. To this end, we design a novel hybrid model which learns complex structures and hyperbolic embeddings from data, and then generates examples by sampling from random noise in the hyperbolic space.
Altogether, our model contains three parts: first, we use a hyperbolic autoencoder (AE) to learn the embedding of the training data in a latent hyperbolic space; second, we use a hyperbolic GAN to learn to generate the latent hyperbolic distribution by passing wrapped normal noise through the generator; third, we generate samples by applying sequentially the generator of the GAN and the decoder of the AE. We name our model the Hyperbolic AE-GAN, or HAEGAN for short. The advantage of this architecture is twofold: first, it enjoys expressivity, since the noise passes through the layers of both the generator and the decoder; second, it allows flexible design of the AE according to the type of input data, which does not affect the sampling power of the GAN. In addition, HAEGAN avoids the complicated form of the ELBO in hyperbolic VAEs, which is one source of numerical instability. We highlight the main contributions of this paper as follows:
• HAEGAN is a novel hybrid AE-GAN framework for learning hyperbolic distributions that aims for both expressivity and numerical stability.
• We validate the Wasserstein GAN formulation in HAEGAN, especially the way of sampling from the geodesic connecting a real sample and a generated sample.
• We design a novel concatenation layer in the hyperbolic space. We extensively investigate its numerical stability via theoretical and experimental comparisons.
• In the experiments, we illustrate that HAEGAN is not only able to faithfully generate synthetic hyperbolic data, but also able to generate real data with sound quality. In particular, we consider the molecular generation task and show that HAEGAN achieves state-of-the-art performance, especially in metrics related to structural properties.

2. BACKGROUND IN HYPERBOLIC NEURAL NETWORKS

2.1. HYPERBOLIC GEOMETRY

Hyperbolic geometry is a special kind of Riemannian geometry with constant negative curvature (Cannon et al., 1997; Anderson, 2006). To extract hyperbolic representations, it is necessary to choose a "model", or coordinate system, for the hyperbolic space. Popular choices include the Poincaré ball model and the Lorentz model, of which the latter is found to be numerically more stable (Nickel & Kiela, 2018). We work with the Lorentz model L^n_K = (L, g) with constant negative curvature K, which is an n-dimensional manifold L embedded in the (n+1)-dimensional Minkowski space, together with the Riemannian metric tensor g = diag([−1, 1_n⊤]), where 1_n denotes the n-dimensional vector whose entries are all 1. Every point in L^n_K is represented by x = [x_t, x_s⊤]⊤, with x_t > 0 and x_s ∈ R^n, and satisfies ⟨x, x⟩_L = 1/K, where ⟨·, ·⟩_L is the Lorentz inner product induced by g: ⟨x, y⟩_L := x⊤ g y = −x_t y_t + x_s⊤ y_s for x, y ∈ L^n_K. In the rest of the paper, we refer to x_t as the "time component" and x_s as the "spatial component". In the following, we describe our notation; extensive details are provided in Appendix A.

Notation We use d_L(x, y) to denote the length of a geodesic ("distance" along the manifold) connecting x, y ∈ L^n_K. For each point x ∈ L^n_K, the tangent space at x is denoted by T_x L^n_K, and the norm is ∥v∥_L = √⟨v, v⟩_L. For x ∈ L^n_K and v ∈ T_x L^n_K, we use exp^K_x(v) to denote the exponential map of v at x; conversely, we use log^K_x : L^n_K → T_x L^n_K to denote the logarithmic map, which satisfies log^K_x(exp^K_x(v)) = v. For two points x, y ∈ L^n_K, we use PT^K_{x→y} to denote the parallel transport map, which "transports" a vector from T_x L^n_K to T_y L^n_K along the geodesic from x to y.
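As a concrete reference for these definitions, the following self-contained sketch (plain Python, with the curvature fixed at K = −1 for readability; the function names are ours, not from any library) implements the Lorentz inner product, the induced geodesic distance, and the exponential map at the origin o = [1/√(−K), 0, …, 0]:

```python
import math

K = -1.0  # constant negative curvature

def lorentz_inner(x, y):
    """Lorentz inner product <x, y>_L = -x_t y_t + x_s . y_s."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def origin(n):
    """The origin o = [1/sqrt(-K), 0, ..., 0] of L^n_K."""
    return [1.0 / math.sqrt(-K)] + [0.0] * n

def dist(x, y):
    """Geodesic distance d_L(x, y) = arccosh(K <x, y>_L) / sqrt(-K)."""
    return math.acosh(max(1.0, K * lorentz_inner(x, y))) / math.sqrt(-K)

def expmap_o(v_s):
    """Exponential map at the origin of a tangent vector v = [0, v_s]."""
    r = math.sqrt(sum(c * c for c in v_s))  # = ||v||_L for v in T_o L^n_K
    if r == 0.0:
        return origin(len(v_s))
    a = math.sqrt(-K) * r
    # exp_o(v) = cosh(a) o + sinh(a) v / a; note sqrt(-K) r = a.
    return [math.cosh(a) / math.sqrt(-K)] + [math.sinh(a) / a * c for c in v_s]
```

One can check that exp^K_o maps any tangent vector onto the manifold (⟨x, x⟩_L = 1/K) and that d_L(o, exp^K_o(v)) = ∥v∥_L, as the definitions above require.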

2.2. FULLY HYPERBOLIC LAYERS

One way to define hyperbolic neural operations is through the tangent space, which is Euclidean. However, working in the tangent space requires taking exponential and logarithmic maps, which cause numerical instability. Moreover, tangent spaces are only local approximations of the hyperbolic space, whereas neural network operations are usually not local. Since generative networks have complex structures, we want to avoid using the tangent space whenever possible. The following hyperbolic layers take a "fully hyperbolic" approach and perform operations directly in hyperbolic spaces. The most fundamental hyperbolic neural layer, the hyperbolic linear layer (Chen et al., 2021), is a trainable "linear transformation" that maps from L^n_K to L^m_K. We remark that "linear" is used by analogy with the Euclidean counterpart; the layer also contains activation, bias and normalization. For an input x ∈ L^n_K, the hyperbolic linear layer outputs

y = HLinear_{n,m}(x) = [√(∥h(Wx, v)∥² − 1/K), h(Wx, v)⊤]⊤, where h(Wx, v) = (λσ(v⊤x + b′) / ∥Wτ(x) + b∥) (Wτ(x) + b).

Here, v ∈ R^{n+1} and W ∈ R^{m×(n+1)} are trainable weights, b and b′ are trainable biases, σ is the sigmoid function, τ is the activation function, and the trainable parameter λ > 0 scales the range. While maps between hyperbolic spaces such as HLinear are used most frequently in hyperbolic networks, it is also often necessary to output Euclidean features. A numerically stable way of obtaining Euclidean output features is the hyperbolic centroid distance layer (Liu et al., 2019), which maps points from L^n_K to R^m. Given an input x ∈ L^n_K, it first initializes m trainable centroids {c_i}_{i=1}^m ⊂ L^n_K, then produces a vector of distances y = HCDist_{n,m}(x) = [d_L(x, c_1), …, d_L(x, c_m)]⊤. We introduce more hyperbolic neural layers used in our model in Appendix A.2.
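To make the two layers concrete, here is a minimal plain-Python sketch of HLinear and HCDist with K = −1. The activation τ is taken as the identity, and the toy weights in the usage example are our own illustrative choices, not values from Chen et al. (2021) or Liu et al. (2019):

```python
import math

K = -1.0

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def lorentz_dist(x, y):
    inner = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(1.0, K * inner)) / math.sqrt(-K)

def hlinear(x, W, b, v, b_prime, lam):
    """HLinear sketch: y = [sqrt(||h||^2 - 1/K), h], with
    h = lam * sigmoid(v.x + b') / ||W tau(x) + b|| * (W tau(x) + b),
    where tau is taken to be the identity (a simplifying assumption)."""
    Wx = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    nrm = math.sqrt(sum(c * c for c in Wx))
    gate = lam * sigmoid(sum(vi * xi for vi, xi in zip(v, x)) + b_prime)
    h = [gate * c / nrm for c in Wx]
    return [math.sqrt(sum(c * c for c in h) - 1.0 / K)] + h

def hcdist(x, centroids):
    """Hyperbolic centroid distance layer: distances to the m centroids."""
    return [lorentz_dist(x, c) for c in centroids]

# Toy usage: a point of L^3 (time component recomputed from the spatial one).
xs = [0.5, -1.0, 0.25]
x = [math.sqrt(sum(c * c for c in xs) - 1.0 / K)] + xs
W = [[0.2, -0.1, 0.3, 0.05], [0.0, 0.4, -0.2, 0.1]]  # maps L^3 to L^2
y = hlinear(x, W, [0.1, -0.3], [0.05, 0.2, -0.1, 0.3], 0.0, 1.5)
```

By construction the output satisfies ⟨y, y⟩_L = −(∥h∥² − 1/K) + ∥h∥² = 1/K for any weights, which is why no projection step is needed.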

3.1. HYPERBOLIC GAN

With the hyperbolic neural layers defined in §2.2, it is not difficult to define a hyperbolic GAN whose generator and critic contain hyperbolic linear layers. The critic also contains a centroid distance layer so that its output is a one-dimensional score. To produce generated samples, we sample z^(0) from G(o, I), the wrapped normal distribution (Nagano et al., 2019), and pass it through the generator. We follow the Wasserstein GAN with gradient penalty (WGAN-GP) framework (Gulrajani et al., 2017) to foster easy and stable training. The loss function, adapted from the Euclidean WGAN-GP, is

L_WGAN = E_{x̃∼P_g}[D(x̃)] − E_{x∼P_r}[D(x)] + λ E_{x̂∼P_x̂}[(∥∇D(x̂)∥_L − 1)²],

where D is the critic, ∇D(x̂) is the Riemannian gradient of D at x̂, P_g is the generator distribution and P_r is the data distribution. Most importantly, P_x̂ samples uniformly along the geodesic between pairs of points sampled from P_g and P_r, instead of along a linear interpolation. This manner of sampling is validated by the following proposition, whose proof can be found in §B.

Proposition 3.1. Let P_r and P_g be two distributions on L^n_K and f* be an optimal solution of max_{∥f∥_L≤1} E_{y∼P_r}[f(y)] − E_{x∼P_g}[f(x)], where ∥·∥_L is the Lipschitz norm. Let π be the optimal coupling between P_r and P_g that minimizes W(P_r, P_g) = inf_{π∈Π(P_r,P_g)} E_{(x,y)∼π}[d_L(x, y)], where Π(P_r, P_g) is the set of joint distributions π(x, y) whose marginals are P_r and P_g, respectively. Let x_t = γ(t), 0 ≤ t ≤ 1, be the geodesic between x and y such that γ(0) = x, γ(1) = y, γ′(t) = v_t ∈ T_{x_t} L^n_K, ∥v_t∥_L = d_L(x, y). If f* is differentiable and π(x = y) = 0, then it holds that P_{(x,y)∼π}[∇f*(x_t) = v_t / d_L(x, y)] = 1.

The WGAN-GP formulation of the hyperbolic GAN is capable of sampling distributions in low-dimensional hyperbolic spaces. We illustrate learning 2D distributions in §B.
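The sampling of P_x̂ along geodesics can be sketched as follows (plain Python with K = −1, for which the geodesic of the unit hyperboloid has a closed form; helper names are ours):

```python
import math
import random

def inner(x, y):
    """Lorentz inner product (curvature K = -1 assumed throughout)."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def point(spatial):
    """Lift a spatial component onto the hyperboloid: x_t = sqrt(1 + ||x_s||^2)."""
    return [math.sqrt(1.0 + sum(c * c for c in spatial))] + list(spatial)

def geodesic(x, y, t):
    """gamma(t) on the geodesic from x (t = 0) to y (t = 1), in closed form:
    gamma(t) = (sinh((1-t) d) x + sinh(t d) y) / sinh(d), d = d_L(x, y)."""
    d = math.acosh(max(1.0, -inner(x, y)))
    if d == 0.0:
        return list(x)
    return [(math.sinh((1.0 - t) * d) * a + math.sinh(t * d) * b) / math.sinh(d)
            for a, b in zip(x, y)]

def sample_interpolate(x_real, x_fake):
    """Draw x_hat ~ P_x_hat: a uniformly random point on the geodesic."""
    return geodesic(x_real, x_fake, random.random())
```

A quick sanity check is that γ(t) stays on the manifold and that d_L(x, γ(t)) = t · d_L(x, y), i.e. the interpolation is uniform in geodesic distance, which is exactly the property the gradient penalty relies on.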

3.2. ARCHITECTURE OF HAEGAN

Although the hyperbolic GAN defined above can faithfully generate hyperbolic distributions, its training is difficult. Specifically, numerical instability is observed when we incorporate complex network architectures into the generator and the critic. To this end, we design our HAEGAN model to contain both a hyperbolic AE and a hyperbolic GAN. First, we train a hyperbolic AE and use the encoder to embed the dataset into a latent hyperbolic space. Then, we use our hyperbolic GAN to learn the latent distribution of the embedded data. Finally, we sample hyperbolic embeddings using the generator and use the decoder to obtain samples in the original space. An illustration of HAEGAN is shown in Figure 1. In addition to expressivity, HAEGAN enjoys flexibility in choosing the AE for embedding the hyperbolic distribution. If the original dataset is not already represented in a hyperbolic domain, the hyperbolic AE can also learn the hyperbolic representation. In this case, we adopt the hyperbolic embedding operation (Nagano et al., 2019) that maps from the Euclidean to the hyperbolic space. We review the details of this embedding operation in Appendix A. Concatenation and split are essential operations in neural networks for feature combination, parallel computation, etc. For data with complex structures, the decoder of the hyperbolic AE in HAEGAN needs to produce parts of features and then combine them, which requires concatenation. However, there is no obvious way of performing concatenation in the hyperbolic space. Shimizu et al. (2021) proposed the Poincaré β-concatenation and β-split in the Poincaré model. Specifically, they first use the logarithmic map to lift hyperbolic points to the tangent space at the origin, then perform Euclidean concatenation and split in this tangent space, and finally apply β regularization and the exponential map to bring the result back to the Poincaré ball.
+ I 8 = " > A A A C A 3 i c b V C 7 T s M w F H V 4 l v I K s M F i U S E x V U l B g r G C h b F I 9 C G 1 U e Q 4 T m v V s S P b Q S p R J B Z + h Y U B h F j 5 C T b + B q f N A C 1 H s n x 0 z r 3 2 v S d I G F X a c b 6 t p e W V 1 b X 1 y k Z 1 c 2 t 7 Z 9 f e 2 + 8 o k U p M 2 l g w I X s B U o R R T t q a a k Z 6 i S Q o D h j p B u P r w u / e E 6 m o 4 H d 6 k h A v R k N O I 4 q R N p J v H w 4 C w U I 1 i c 2 V P e R + N p B x Z h 5 g e e 7 b N a f u T A E X i V u S G i j R 8 u 2 v Q S h w G h O u M U N K 9 V 0 n 0 V 6 G p K a Y k b w 6 S B V J E B 6 j I e k b y l F M l J d N d 8 j h i V F C G A l p D t d w q v 7 u y F C s i j F N Z Y z 0 S M 1 7 h f i f 1 0 9 1 d O l l l C e p J h z P P o p S B r W A R S A w p J J g z S a G I C y p m R X i E Z I I a x N b 1 Y T g z q + 8 S D q N u n t W b 9 y e 1 5 p X Z R w V c A S O w S l w w Q V o g h v Q A m 2 A w S N 4 B q / g z X q y X q x 3 6 2 N W u m S V P Q f g D 6 z P H 8 t 2 m O Y = < / l a t e x i t > z fake < l a t e x i t s h a 1 _ b a s e 6 4 = " v E i O w / Y l 9 h 5 D 6 v B I C T j J 2 1 l y z o Q = " > A A A C A 3 i c b V D L S s N A F J 3 4 r P U V d a e b w S K 4 K k k V d F l 0 4 7 K C f U A b w m Q y a Y f O Z M L M R K g h 4 M Z f c e N C E b f + h D v / x k m b h b Y e G O Z w z r 3 c e 0 + Q M K q 0 4 3 x b S 8 s r q 2 v r l Y 3 q 5 t b 2 z q 6 9 t 9 9 R I p W Y t L F g Q v Y C p A i j M W l r q h n p J Z I g H j D S D c b X h d + 9 J 1 J R E d / p S U I 8 j o Y x j S h G 2 k i + f T g I B A v V h J s v e 8 j 9 b C B 5 F q E x y X P f r j l 1 Z w q 4 S N y S 1 E C J l m 9 / D U K B U 0 5 i j R l S q u 8 6 i f Y y J D X F j O T V Q a p I g v A Y D U n f 0 B h x o r x s e k M O T 4 w S w k h I 8 2 I N p + r v j g x x V a x p K j n S I z X v F e J / X j / V 0 a W X 0 T h J N Y n x b F C U M q g F L A K B I Z U E a z Y x B G F J z a 4 Q j 5 B E W J v Y q i Y E d / 7 k R d J X I E U Y F a S t q W a k l 0 i C e M B I N x h f F 3 7 3 n k h F Y 3 G n J w n x O B o K G l G M t J F 8 + 3 A Q x C x U E 
2 6 + 7 C H 3 s 4 H k G e F B m O e + X X P q z h R w k b g l q Y E S L d / + G o Q x T j k R G j O k V N 9 1 E u 1 l S G q K G c m r g 1 S R B O E x G p K + o Q J x o r x s e k M O T 4 w S w i i W 5 g k N p + r v j g x x V a x p K j n S I z X v F e J / X j / V 0 a W X U Z G k m g g 8 G x S l D O o Y F o H A k E q C N Z s Y g r C k Z l e I R 0 g i r E 1 s V R O C O 3 / y I u k 0 6 u 5 Z v X F 7 X m t e l X F U w B E 4 B q f A B R e g C W 5 A C 7 Q B B o / g G b y C N + v J e X I E U Y F a S t q W a k l 0 i C e M B I N x h f F 3 7 3 n k h F Y 3 G n J w n x O B o K G l G M t J F 8 + 3 A Q x C x U E 2 6 + 7 C H 3 s 4 H k 2 Z C I P P f t m l N 3 p o C L x C 1 J D Z R o + f b X I I x x y o n Q m C G l + q 6 T a C 9 D U l P M S F 4 d p I o k C I / R k P Q N F Y g T 5 W X T E 3 J 4 Y p Q Q R r E 0 T 2 g 4 V X 9 3 Z I i r Y k t T y Z E e q X m v E P / z + q m O L r 2 M i i T V R O D Z o C h l U M e w y A O G V B K s 2 c Q Q h C U 1 u 0 I 8 Q h J h b V K r m h D c + Z M X S a d R d 8 / q j d v z W v O q j K M C j s A x O A U u u A B N c Since we use the Lorentz model, the above operations are not useful and we need to define concatenation and split in the Lorentz space. One could define operations in the tangent space similarly to the Poincaré β-concatenation and β-split. More specifically, if we want to concatenate the input vectors {x i } N i=1 where each x i ∈ L ni K , we could follow a "Lorentz tangent concatenation": first lift each x i to v i = log K o (x i ) = vi t vi s ∈ R ni+1 , and then perform the Euclidean concatenation to get v := 0, v ⊤ 1s , . . . , v ⊤ Ns ⊤ . Finally, we would get y = exp K o (v) as a concatenated vector in the hyperlolic space. We denote y = HTCat({x i } N i=1 ). Similarly, we could perform the "Lorentz tangent split" on an input x i ∈ L n K with split sub-dimensions N i=1 n i = n to get v = log K o (x) = 0, v ⊤ 1s ∈ R n1 , . . . , v ⊤ Ns ∈ R n N ⊤ , v i = 0 vi s ∈ T o L ni K ,

and taking the split vectors y_i = exp^K_o(v_i) successively. Unfortunately, there are two problems with the Lorentz tangent concatenation and the Lorentz tangent split. First, they are not "regularized": the norm of the spatial component increases after concatenation and decreases after split. This makes the hidden embeddings numerically unstable. This problem could be partially addressed by adding a hyperbolic linear layer after each concatenation and split, similarly to Ganea et al. (2018), so that a trainable scaling factor λ regularizes the norm of the output. The second and more important problem is that using the Lorentz tangent concatenation and split in a deep neural network introduces many exponential and logarithmic maps. On one hand, this suffers from severe precision issues due to inexact floating-point representation (Yu & De Sa, 2019; 2021); on the other hand, it easily leads to gradient explosion. Moreover, the tangent space is taken at o; if the points to concatenate are not close to o, their hyperbolic relation may not be captured well. Therefore, we abandon the use of the tangent space and propose more direct and numerically stable operations, which we call the "Lorentz direct concatenation and split", defined as follows. Given input vectors {x_i}_{i=1}^N, where each x_i ∈ L^{n_i}_K and M = Σ_{i=1}^N n_i, the Lorentz direct concatenation of {x_i}_{i=1}^N is the vector y ∈ L^M_K given by

y = HCat({x_i}_{i=1}^N) = [√(Σ_{i=1}^N x_{it}² + (N−1)/K), x_{1s}⊤, …, x_{Ns}⊤]⊤.

Note that each x_{is} is the spatial component of x_i. If we consider x_i ∈ L^{n_i}_K as a point in R^{n_i+1}, the projection of x_i onto the Euclidean subspace {0} × R^{n_i}, i.e. the closest point in that subspace, is x_{is}. The Lorentz direct concatenation can thus be viewed as a Euclidean concatenation of projections, where the concatenated point is mapped back to L^M_K by the inverse of the projection.

We remark that this concatenation directly inherits from the Lorentz model. We also define the Lorentz split for completeness, though our main focus is on concatenation. Given an input x ∈ L^n_K, the Lorentz direct split of x with sub-dimensions n_1, …, n_N, where Σ_{i=1}^N n_i = n, is {y_i}_{i=1}^N, where each y_i ∈ L^{n_i}_K is obtained by first splitting x as x = [x_t, y_{1s}⊤, …, y_{Ns}⊤]⊤ and then computing the corresponding time component: y_i = [√(∥y_{is}∥² − 1/K), y_{is}⊤]⊤.
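The Lorentz direct concatenation and split can be written in a few lines. The sketch below (plain Python, K = −1; helper names are ours) also shows that the direct split exactly inverts the direct concatenation, since each time component is recomputed from the spatial block it accompanies:

```python
import math

K = -1.0  # curvature; any K < 0 works with these formulas

def hcat(points):
    """Lorentz direct concatenation:
    y = [sqrt(sum_i x_it^2 + (N - 1)/K), x_1s, ..., x_Ns]."""
    N = len(points)
    t = math.sqrt(sum(p[0] ** 2 for p in points) + (N - 1) / K)
    return [t] + [c for p in points for c in p[1:]]

def hsplit(y, dims):
    """Lorentz direct split: cut the spatial component into blocks of sizes
    dims and recompute each time component as sqrt(||y_is||^2 - 1/K)."""
    out, i = [], 1
    for n in dims:
        s = y[i:i + n]
        out.append([math.sqrt(sum(c * c for c in s) - 1.0 / K)] + s)
        i += n
    return out

def point(spatial):
    """Lift a spatial vector onto the manifold (time component implied)."""
    return [math.sqrt(sum(c * c for c in spatial) - 1.0 / K)] + list(spatial)
```

Note that no exponential or logarithmic map appears: both operations touch only the time components, which is the source of their stability.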

4.2. ADVANTAGE OF LORENTZ DIRECT CONCATENATION

We show the advantages of the direct concatenation in this section. First, we state the following theoretical result regarding the exploding gradient of the Lorentz tangent concatenation.

Theorem 4.1. Let {x_i}_{i=1}^N, where x_i ∈ L^{n_i}_K, denote the input features. Let y = HCat({x_i}_{i=1}^N) denote the output of the Lorentz direct concatenation and z = HTCat({x_i}_{i=1}^N) denote the output of the Lorentz tangent concatenation. Fix j ∈ {1, …, N}. The following results hold:
1. For any {x_i}_{i=1}^N and any entry y* of y, ∥∂y*/∂x_{js} |_{x_1,…,x_N}∥ ≤ 1.
2. For any M > 0, there exist {x_i}_{i=1}^N and an entry z* of z for which ∥∂z*/∂x_{js} |_{x_1,…,x_N}∥ ≥ M.

This theorem shows that while the Lorentz direct concatenation has bounded gradients, there is no control over the gradients of the Lorentz tangent concatenation. The proof can be found in Appendix D.1 and we give a simple numerical validation in Appendix D.2. We remark that, in addition to ensuring bounded gradients, discarding the exponential and logarithmic maps also ensures that the concatenation does not suffer from inexact floating-point representation (Yu & De Sa, 2019; 2021). There are two additional benefits: first, the direct concatenation is less complex and thus more efficient than the tangent concatenation; second, it contains no exponential functions and is thus GPU scalable (Choudhary & Reddy, 2022). Next, we design the following simple experiment to show the advantage of our Lorentz direct concatenation over the Lorentz tangent concatenation when they are used in simple neural networks. The hyperbolic neural network in this experiment consists of a cascade of L blocks. A d-dimensional input is fed into two different hyperbolic linear layers, whose outputs are then concatenated by the Lorentz direct concatenation and the Lorentz tangent concatenation, respectively.
Then, the concatenated output goes through another hyperbolic linear layer whose output is again d-dimensional. Specifically, for l = 0, …, L−1:

h_1^(l) = HLinear_{d,d}(x^(l)), h_2^(l) = HLinear_{d,d}(x^(l)); h^(l) = HCat(h_1^(l), h_2^(l)); x^(l+1) = HLinear_{2d,d}(h^(l)).

In our test, we take d = 64. We sample input and output data from two wrapped normal distributions with different means (input: the origin o; output: E2H(1_64)) and variances (input: diag(1_64); output: 3 × diag(1_64)). Taking the input as x^(0), we fit x^(L) to the output data. We record the average gradient norm of the three hyperbolic linear layers in each block. The results for L = 64 and L = 128 blocks are shown in Figure 3. Clearly, for the first 20 blocks, the Lorentz tangent concatenation leads to significantly larger gradient norms, and this difference in norms is clearer when the network is deeper. The gradients from the Lorentz direct concatenation are much more stable. Finally, we remark that the Lorentz direct concatenation is numerically stable in practice: in the generation experiments introduced in the next section, if we used the Lorentz tangent concatenation, the gradients would explode very rapidly (producing NaN), even during the first epoch. More analysis of hyperbolic concatenations, particularly regarding their impact on hyperbolic distances and their stability, can be found in Appendix D.3.
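The contrast between the two concatenations can also be checked numerically without any training. The sketch below (plain Python, K = −1, two hypothetical inputs at geodesic distance about 5 from the origin, N = 2) compares central finite-difference gradients of the output time component with respect to a spatial input coordinate; the direct version respects the bound of Theorem 4.1, while the tangent version already exceeds it:

```python
import math

def direct_time(s1, s2):
    """Time component of HCat for K = -1, N = 2:
    sqrt(x_1t^2 + x_2t^2 - 1), with x_it = sqrt(1 + ||s_i||^2)."""
    t1 = math.sqrt(1.0 + sum(c * c for c in s1))
    t2 = math.sqrt(1.0 + sum(c * c for c in s2))
    return math.sqrt(t1 * t1 + t2 * t2 - 1.0)

def tangent_time(s1, s2):
    """Time component of HTCat for K = -1: cosh(||v||), where v concatenates
    the lifted tangent vectors and ||v_i|| = d_L(o, x_i) = arcsinh(||s_i||)."""
    d1 = math.asinh(math.sqrt(sum(c * c for c in s1)))
    d2 = math.asinh(math.sqrt(sum(c * c for c in s2)))
    return math.cosh(math.sqrt(d1 * d1 + d2 * d2))

def grad_first_coord(f, s1, s2, h=1e-5):
    """Central finite difference w.r.t. the first spatial coordinate of x_1."""
    up = list(s1); up[0] += h
    dn = list(s1); dn[0] -= h
    return (f(up, s2) - f(dn, s2)) / (2.0 * h)

# Two points at distance about 5 from the origin (hypothetical inputs).
s1 = [math.sinh(5.0), 0.0]
s2 = [0.0, math.sinh(5.0)]
g_direct = abs(grad_first_coord(direct_time, s1, s2))
g_tangent = abs(grad_first_coord(tangent_time, s1, s2))
```

Pushing the inputs further from the origin makes g_tangent grow without bound while g_direct stays below 1, mirroring the two parts of Theorem 4.1.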

5. EXPERIMENTS

We perform the following two experiments: random tree generation and molecular generation.

5.1. RANDOM TREE GENERATION

Recent studies (Boguná et al., 2010; Krioukov et al., 2010; Sala et al., 2018; Sonthalia & Gilbert, 2020) have found that hyperbolic spaces are suitable for tree-like graphs. We first perform an experiment in which we generate random trees, and we compare the performance of HAEGAN with other hyperbolic models. In this experiment, the AE consists of a tree encoder and a tree decoder. The structure of the AE is explained in detail in Appendix E. We remark that both the tree encoder and the tree decoder are also used in the molecular generation experiment.

Dataset Our dataset consists of 500 randomly generated trees. Each tree is created by converting a uniformly random Prüfer sequence (Prüfer, 1918). The number of nodes in each tree is uniformly sampled from [20, 50]. The dataset is randomly split into 400 trees for training and 100 for testing.
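For reference, such a uniformly random tree can be produced by decoding a random Prüfer sequence. The following sketch uses the standard decoding algorithm (function names are ours):

```python
import heapq
import random

def prufer_to_tree(seq):
    """Decode a Prüfer sequence over nodes 0..n-1 (length n-2) into the
    edge list of the unique labelled tree it encodes."""
    n = len(seq) + 2
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    leaves = [i for i in range(n) if degree[i] == 1]
    heapq.heapify(leaves)  # always attach the smallest remaining leaf
    edges = []
    for v in seq:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def random_tree(n, rng):
    """A uniformly random labelled tree on n nodes via a random Prüfer sequence."""
    return prufer_to_tree([rng.randrange(n) for _ in range(n - 2)])
```

Since Prüfer sequences are in bijection with labelled trees, sampling the sequence uniformly samples the tree uniformly, which matches the dataset construction above.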

Baselines and Ablations

We compare HAEGAN with the following baseline hyperbolic generation methods: HGAN, where we only use a hyperbolic Wasserstein GAN without the AE structure, with the tree decoder as the generator and the tree encoder as the critic; and HVAE-w and HVAE-r, where we use the same AE but follow the ELBO loss function used by Mathieu et al. (2019) instead of having a GAN ("w" and "r" refer to using wrapped and Riemannian normal distributions, respectively). Although we mainly focus on hyperbolic methods, we also compare with the following Euclidean generation methods: GraphRNN (You et al., 2018b) and AEGAN. The latter has the same architecture as HAEGAN, but all layers and operations are in the Euclidean space. As discussed in previous sections, our default choice in HAEGAN is to use the Lorentz direct concatenation and fully hyperbolic linear layers (Chen et al., 2021). We also consider the following ablations of HAEGAN: HAEGAN-H, where the fully hyperbolic linear layers are replaced with the tangent linear layers defined by Ganea et al. (2018); HAEGAN-β, where the concatenation in HAEGAN is replaced by the β-concatenation (Shimizu et al., 2021); and HAEGAN-T, where the concatenation is replaced by the Lorentz tangent concatenation discussed in §4.

Metrics We use the following metrics from You et al. (2018b) to evaluate the models: the Maximum Mean Discrepancy (MMD) of the degree distribution (Degree), the MMD of the orbit counts statistics distribution (Orbit), the average difference of orbit counts statistics, Betweenness Centrality (Betweenness), and Closeness Centrality (Closeness). All metrics are calculated between the test dataset and 100 samples generated from the models.

Results

The results and runtime of all models are shown in Table 1. For all metrics, a smaller number implies a better result. Our default choice of HAEGAN (with direct concatenation and fully hyperbolic linear layers) performs the best across all metrics except the MMD of the orbit counts statistics distribution, in which it falls only marginally behind the β-concatenation. In particular, the Lorentz direct concatenation generally performs better and more efficiently than the Lorentz tangent concatenation and the β-concatenation. Also, the fully hyperbolic linear layer is superior to the tangent linear layer in both effectiveness and efficiency. Our results also show the advantage of the overall framework over either a single GAN or a VAE. On one hand, the performance of HAEGAN is much better than that of HGAN. On the other hand, we note that the hyperbolic VAE-based methods suffer from numerical instability on this simple dataset even when using fully hyperbolic linear layers and direct concatenation, possibly because of the complicated ELBO loss. Finally, it is clear from the results that the hyperbolic models are better at generating trees than the Euclidean ones.

5.2. DE NOVO MOLECULAR GENERATION WITH HAEGAN

Learning the structure of molecules is a crucial task in machine learning, with important applications in the discovery of drugs and proteins (Elton et al., 2019). Since molecules naturally have a graph structure, many recent works use graph neural networks to extract their information and accordingly train molecular generators (Simonovsky & Komodakis, 2018; De Cao & Kipf, 2018; Jin et al., 2018; 2019). In particular, Jin et al. (2018; 2019) proposed a bi-level representation of molecules in which both a junction-tree skeleton and a molecular graph represent the original molecular data. In this way, a molecule is represented in a hierarchical manner with a tree-structured scaffold. Given that hyperbolic spaces can embed such hierarchical and tree-structured data well (Peng et al., 2021), we expect HAEGAN to leverage this structural information. To validate its effectiveness, in this section we test HAEGAN on molecular generation tasks, where the latent distribution is embedded in a hyperbolic manifold. In our experiments, we design both a hyperbolic tree AE and a hyperbolic graph AE in our HAEGAN to embed the structural information of the atoms in each molecule. Specifically, our model takes a molecular graph as input, passes the original graph to the graph encoder and feeds the corresponding junction tree to the tree encoder, acquiring a hyperbolic latent representation z_G of the graph, as well as z_T for the junction tree. Then, the junction tree decoder constructs a tree from z_T autoregressively. Finally, the graph decoder recovers the molecular graph using the generated junction tree and z_G. Within the HAEGAN framework, this network contains fully hyperbolic layers and embedding layers, as well as the concatenation layers described in §4. We describe the detailed structure of each component in HAEGAN and include an illustration in Appendix E.
The hyperbolic representation is expected to leverage the hierarchical structure of the junction tree better than the graph neural networks of Jin et al. (2018; 2019).

Dataset We train and test our model on the MOSES benchmarking platform (Polykovskiy et al., 2020), which is refined from the ZINC dataset (Sterling & Irwin, 2015).

Baselines We compare our model with the following baselines: non-neural models including the Hidden Markov Model (HMM), the N-Gram generative model (NGram) and the combinatorial generator; and neural methods including CharRNN (Segler et al., 2018), AAE (Kadurin et al., 2017a; b; Polykovskiy et al., 2018), VAE (Gómez-Bombarelli et al., 2018; Blaschke et al., 2018), JTVAE (Jin et al., 2018) and LatentGAN (Prykhodko et al., 2019). The benchmark results are taken from Polykovskiy et al. (2020).

Ablations On one hand, we consider a Euclidean counterpart of HAEGAN, named AEGAN, to examine whether the hyperbolic setting indeed contributes. The architecture of AEGAN is the same as that of HAEGAN, except that the hyperbolic layers are replaced with Euclidean ones. On the other hand, we also report the following alternative hyperbolic methods: HVAE-w and HVAE-r, where we use the same tree and graph AE but follow the ELBO loss function used by Mathieu et al. (2019) instead of having a GAN ("w" and "r" refer to using wrapped and Riemannian normal distributions, respectively); HGAN, where we train an end-to-end hyperbolic WGAN with the graph and tree decoder as the generator and the graph and tree encoder as the critic; and HAEGAN-H, HAEGAN-β and HAEGAN-T as introduced in §5.1.

Metrics We briefly describe how the models are evaluated. Detailed descriptions of the following metrics can be found in the MOSES benchmarking platform (Polykovskiy et al., 2020). We generate a set of 30,000 molecules, which we call the "generated set".
On one hand, we report the Validity, Unique(ness) and Novelty scores, which are the percentages of valid, unique and novel molecules in the generated set, respectively. These are standard metrics widely used to represent the quality of the generation. On the other hand, we evaluate the following structure-related metrics by comparing the generated set with the test set and the scaffold set: Similarity to a Nearest Neighbor (SNN) and Scaffold similarity (Scaf). SNN is the average Tanimoto similarity (Tanimoto, 1958) between a generated molecule and its nearest neighbor in the reference set. Scaf is the cosine distance between the scaffold frequency vectors (Bemis & Murcko, 1996) of the generated and reference sets. In particular, SNN compares detailed structures while Scaf compares skeleton structures. By computing them against both the test and the scaffold test sets, we measure both the structural similarity to the training data and the capability of searching for novel structures.
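For illustration, the SNN score reduces to a nearest-neighbor average of Tanimoto similarities. A minimal sketch (operating on abstract sets of fingerprint on-bits rather than the actual RDKit fingerprints that the MOSES pipeline uses; names are ours):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits:
    |A & B| / |A | B|."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def snn(generated, reference):
    """SNN: average Tanimoto similarity between each generated fingerprint
    and its nearest neighbor in the reference set."""
    return sum(max(tanimoto(g, r) for r in reference)
               for g in generated) / len(generated)
```

For example, fingerprints {1, 2, 3} and {2, 3, 4} share 2 of 4 total bits, so their Tanimoto similarity is 0.5; SNN then averages the best such match over the generated set.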

Results

We describe the detailed settings and architectures of HAEGAN in Appendix F.5 and present some generated examples in Appendix G. We report in Table 2 the performance of HAEGAN and the baselines. For each metric described above, we take the mean and standard deviation over three independent samples. For all metrics, a larger number implies a better result. We use bold font to highlight the best-performing model for each criterion. First, HAEGAN achieves perfect validity and uniqueness scores, which implies that the hyperbolic embedding adopted by HAEGAN does not break the rules of molecular structure and does not induce mode collapse. Moreover, our model significantly outperforms the baseline models in the SNN metric. This means that the molecules generated by our model are more similar to the reference set, implying that our model better captures the underlying structure of the molecules and that our hyperbolic latent space is more suitable for embedding molecules than its Euclidean counterparts. Our model also achieves competitive performance in the Scaf metric when the reference set is the scaffold test set. This shows that our model is better at searching the manifold of scaffolds and can generate examples with novel core structures. Next, although AEGAN also achieves very good validity, uniqueness and novelty, we note the large margin HAEGAN has over AEGAN in the structure-related metrics. This suggests that working in the hyperbolic space is necessary in our approach and that the hyperbolic space better represents structural information. Lastly, the alternative hyperbolic models all suffer from numerical instability, and their training produces NaN. This is not surprising, since hyperbolic neural operations are known to easily make training unstable, especially in deep and complex networks. The result reveals the stronger numerical stability of HAEGAN, which highlights the importance of (1) the overall framework of HAEGAN (vs. HVAE); (2) the fully hyperbolic layers (vs. HAEGAN-H); (3) the Lorentz direct concatenation (vs. HAEGAN-β, HAEGAN-T).

6. CONCLUSION AND FUTURE WORK

In this paper, we proposed HAEGAN, a hybrid AE-GAN framework, and showed its capability of generating faithful hyperbolic examples. We showed that HAEGAN is able to generate both synthetic hyperbolic data and real molecular data. In particular, HAEGAN delivers state-of-the-art results for molecular data in structure-related metrics. It is not only the first hyperbolic GAN model that achieves such effectiveness, but also a deep hyperbolic model that does not suffer from numerical instability. We expect that HAEGAN can be applied to broader scenarios due to the flexibility in designing the hyperbolic AE and the possibility of building deep models. Despite the promising results, we point out two possible limitations of the current model. First, not all complex modules are directly compatible with HAEGAN. Indeed, if gated recurrent units (GRU) were used in our molecular generation task, the complex structure of the GRU would cause unstable training, which is why a hyperbolic linear layer is used instead. Nevertheless, we expect that defining more efficient hyperbolic operations that incorporate recurrent computation may alleviate this problem, and we leave it to future work. Second, although the hyperbolic operations in HAEGAN do not require going back and forth between the hyperbolic and the tangent spaces, we need to use exponential maps when sampling from the wrapped normal distribution. We will also work on more efficient ways of sampling from the hyperbolic Gaussian.

APPENDIX

The Appendix is organized as follows. In §A we describe the preliminaries on hyperbolic geometry. In §B we describe the details of hyperbolic GAN and demonstrate generating toy distributions. In §C we show detailed results from MNIST generation using HAEGAN. In §D we present more analysis for the Lorentz direct concatenation. In §E we provide a detailed description of the AE used in HAEGAN for molecular generation. In §F we carefully describe all the experimental details and neural network architectures that we did not cover in the main text for reproducing our work. In §G we illustrate a subset of examples of molecules generated by HAEGAN.

A PRELIMINARIES

A.1 HYPERBOLIC GEOMETRY

We describe some fundamental concepts in hyperbolic geometry related to this work.

The Lorentz Model The Lorentz model $\mathbb{L}^n_K = (\mathcal{L}, g)$ of an $n$-dimensional hyperbolic space with constant negative curvature $K$ is an $n$-dimensional manifold $\mathcal{L}$ embedded in the $(n+1)$-dimensional Minkowski space, together with the Riemannian metric tensor $g = \mathrm{diag}([-1, \mathbf{1}_n^\top])$, where $\mathbf{1}_n$ denotes the $n$-dimensional vector whose entries are all 1's. Every point in $\mathbb{L}^n_K$ is represented by $x = [x_t, x_s^\top]^\top$ with $x_t > 0$ and $x_s \in \mathbb{R}^n$, and satisfies $\langle x, x \rangle_{\mathcal{L}} = 1/K$, where $\langle \cdot, \cdot \rangle_{\mathcal{L}}$ is the Lorentz inner product induced by $g$: $\langle x, y \rangle_{\mathcal{L}} := x^\top g y = -x_t y_t + x_s^\top y_s$ for $x, y \in \mathbb{L}^n_K$.

Geodesics and Distances Geodesics are shortest paths in a manifold, which generalize the notion of "straight lines" in Euclidean geometry. In particular, the length of the geodesic (the "distance") between $x, y \in \mathbb{L}^n_K$ is given by $d_{\mathcal{L}}(x, y) = \frac{1}{\sqrt{-K}} \cosh^{-1}(K \langle x, y \rangle_{\mathcal{L}})$.

Tangent Space For each point $x \in \mathbb{L}^n_K$, the tangent space at $x$ is $T_x \mathbb{L}^n_K := \{ y \in \mathbb{R}^{n+1} \mid \langle y, x \rangle_{\mathcal{L}} = 0 \}$. It is a first-order approximation of the hyperbolic manifold around $x$ and is a subspace of $\mathbb{R}^{n+1}$. We denote by $\|v\|_{\mathcal{L}} = \sqrt{\langle v, v \rangle_{\mathcal{L}}}$ the norm of $v \in T_x \mathbb{L}^n_K$.
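These definitions translate directly into code. The following minimal pure-Python sketch fixes the curvature at $K = -1$ (the helper names are ours, not from the paper's implementation):

```python
import math

K = -1.0  # constant negative curvature

def linner(x, y):
    """Lorentz inner product <x, y>_L = -x_t * y_t + <x_s, y_s>."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(xs):
    """Lift a spatial part x_s in R^n to the point [x_t, x_s] on L^n_K
    by solving <x, x>_L = 1/K for the time component x_t > 0."""
    xt = math.sqrt(sum(a * a for a in xs) - 1.0 / K)
    return [xt] + list(xs)

def ldist(x, y):
    """Geodesic distance d_L(x, y) = arccosh(K <x, y>_L) / sqrt(-K)."""
    return math.acosh(max(K * linner(x, y), 1.0)) / math.sqrt(-K)
```

For example, `lift([0.0, 0.0])` gives the hyperbolic origin $[1, 0, 0]$, and any lifted point satisfies the on-manifold constraint $\langle x, x \rangle_{\mathcal{L}} = 1/K$ by construction.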

Exponential and Logarithmic Maps

The exponential and logarithmic maps are maps between a hyperbolic space and its tangent spaces. For $x, y \in \mathbb{L}^n_K$ and $v \in T_x \mathbb{L}^n_K$, the exponential map $\exp^K_x: T_x \mathbb{L}^n_K \to \mathbb{L}^n_K$ maps tangent vectors to the hyperbolic space by assigning to $v$ the point $\exp^K_x(v) := \gamma(1)$, where $\gamma$ is the geodesic satisfying $\gamma(0) = x$ and $\gamma'(0) = v$. Specifically,
$$\exp^K_x(v) = \cosh(\phi)\, x + \sinh(\phi)\, \frac{v}{\phi}, \qquad \phi = \sqrt{-K}\, \|v\|_{\mathcal{L}}.$$
The logarithmic map $\log^K_x: \mathbb{L}^n_K \to T_x \mathbb{L}^n_K$ is the inverse map that satisfies $\log^K_x(\exp^K_x(v)) = v$. Specifically,
$$\log^K_x(y) = \frac{\cosh^{-1}(\psi)}{\sqrt{-K}}\, \frac{y - \psi x}{\|y - \psi x\|_{\mathcal{L}}}, \qquad \psi = K \langle x, y \rangle_{\mathcal{L}}.$$

Parallel Transport For two points $x, y \in \mathbb{L}^n_K$, the parallel transport from $x$ to $y$ defines a map $\mathrm{PT}^K_{x \to y}$ which "transports" a vector from $T_x \mathbb{L}^n_K$ to $T_y \mathbb{L}^n_K$ along the geodesic from $x$ to $y$. Parallel transport preserves the metric, i.e. for all $u, v \in T_x \mathbb{L}^n_K$, $\langle \mathrm{PT}^K_{x \to y}(v), \mathrm{PT}^K_{x \to y}(u) \rangle_{\mathcal{L}} = \langle v, u \rangle_{\mathcal{L}}$. In particular, the parallel transport in $\mathbb{L}^n_K$ is given by
$$\mathrm{PT}^K_{x \to y}(v) = v + \frac{\langle y, v \rangle_{\mathcal{L}}}{-1/K - \langle x, y \rangle_{\mathcal{L}}}\, (x + y).$$
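A small numerical sketch of these three maps (pure Python, $K = -1$, helper names ours) can verify the round-trip identity $\log_x(\exp_x(v)) = v$ and the metric-preservation property of parallel transport:

```python
import math

K = -1.0

def linner(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def expmap(x, v):
    """exp_x(v) = cosh(phi) x + sinh(phi) v / phi, with phi = sqrt(-K) ||v||_L."""
    phi = math.sqrt(-K) * math.sqrt(max(linner(v, v), 0.0))
    if phi < 1e-12:
        return list(x)
    return [math.cosh(phi) * a + math.sinh(phi) * b / phi for a, b in zip(x, v)]

def logmap(x, y):
    """log_x(y) = (arccosh(psi)/sqrt(-K)) * (y - psi x)/||y - psi x||_L."""
    psi = K * linner(x, y)
    u = [b - psi * a for a, b in zip(x, y)]
    d = math.acosh(max(psi, 1.0)) / math.sqrt(-K)
    n = math.sqrt(max(linner(u, u), 1e-30))
    return [d * a / n for a in u]

def ptransp(x, y, v):
    """PT_{x->y}(v) = v + <y, v>_L / (-1/K - <x, y>_L) * (x + y)."""
    c = linner(y, v) / (-1.0 / K - linner(x, y))
    return [a + c * (b + e) for a, b, e in zip(v, x, y)]
```

Starting from the origin $o = [1, 0, 0]$ with tangent vector $v = [0, 0.3, 0.4]$, `logmap(o, expmap(o, v))` recovers $v$ up to floating-point error, and the transported vector stays tangent at the target point.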

A.2 MORE HYPERBOLIC LAYERS

We reviewed the Lorentz linear layer and the centroid distance layer in the main text. In this section, we review more hyperbolic layers. The notion of the "centroid" of a set of points is important in formulating attention mechanisms and feature aggregation. In the Lorentz model, with the squared Lorentzian distance defined as $d^2_{\mathcal{L}}(x, y) = 2/K - 2\langle x, y \rangle_{\mathcal{L}}$ for $x, y \in \mathbb{L}^n_K$, the hyperbolic centroid (Law et al., 2019) is defined as the minimizer that solves
$$\min_{\mu \in \mathbb{L}^n_K} \sum_{i=1}^N \nu_i\, d^2_{\mathcal{L}}(x_i, \mu) \quad \text{subject to } x_i \in \mathbb{L}^n_K,\ \nu_i \ge 0,\ \textstyle\sum_i \nu_i > 0, \quad i = 1, \cdots, N.$$
A closed form of the centroid is given by
$$\mu = \mathrm{HCent}(X, \nu) = \frac{\sum_{i=1}^N \nu_i x_i}{\sqrt{-K}\, \big\| \sum_{i=1}^N \nu_i x_i \big\|_{\mathcal{L}}},$$
where $X$ is the matrix whose $i$-th row is $x_i$. Composing hyperbolic linear layers with the hyperbolic centroid produces a hyperbolic graph convolutional network (GCN) layer (Chen et al., 2021):
$$x^{(l)}_v = \mathrm{HGCN}(X^{(l-1)})_v = \mathrm{HCent}\big(\{\mathrm{HLinear}_{d_{l-1}, d_l}(x^{(l-1)}_u) \mid u \in \mathcal{N}(v)\}, \mathbf{1}\big),$$
where $x^{(l)}_v$ is the feature of node $v$ in layer $l$, $d_l$ denotes the dimensionality of layer $l$, and $\mathcal{N}(v)$ is the set of neighbors of node $v$.
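A sketch of the closed-form centroid (pure Python, $K = -1$, names ours). Note that the weighted sum is a time-like vector, so the Lorentz "norm" in the denominator is $\sqrt{|\langle s, s \rangle_{\mathcal{L}}|}$:

```python
import math

K = -1.0

def linner(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def hcent(points, weights=None):
    """Closed-form Lorentzian centroid mu = s / (sqrt(-K) * ||s||_L),
    where s = sum_i nu_i x_i and ||s||_L = sqrt(|<s, s>_L|) since s is time-like."""
    if weights is None:
        weights = [1.0] * len(points)
    dim = len(points[0])
    s = [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]
    norm = math.sqrt(abs(linner(s, s)))
    return [a / (math.sqrt(-K) * norm) for a in s]
```

Two easy sanity checks: the centroid of a single point is that point itself, and the centroid of any set of points lands back on the manifold.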

A.3 EMBEDDING FROM EUCLIDEAN TO HYPERBOLIC SPACES

It is possible that a dataset is originally represented as Euclidean, albeit having a hierarchical structure. In this case, the most obvious way of processing is to use the exponential or logarithmic maps so that we can represent the data in the hyperbolic space. In order to map $t \in \mathbb{R}^n$ to the hyperbolic space $\mathbb{L}^n_K$, Nagano et al. (2019) prepend a zero to $t$ to make it a vector in $T_o \mathbb{L}^n_K$, and then apply the exponential map. This Euclidean-to-Hyperbolic (E2H) operation was originally used for sampling, but can also be used more generally to map from the Euclidean to the hyperbolic space. Specifically,
$$y = \mathrm{E2H}_n(t) = \exp^K_o\Big(\Big[\begin{matrix} 0 \\ t \end{matrix}\Big]\Big),$$
where $o = [\sqrt{-1/K}, 0, \dots, 0]^\top$ is the hyperbolic origin. For better expressivity, especially when the input $x \in \mathbb{R}^n$ is one-hot, one can first map the input to a hidden embedding $h = Wx \in \mathbb{R}^m$ with a trainable embedding matrix $W \in \mathbb{R}^{m \times n}$, which is then mapped to the hyperbolic space by the E2H operation defined above. That is,
$$y = \mathrm{HEmbed}_{n,m}(x) = \mathrm{E2H}_m(Wx).$$
This layer was previously used by Nagano et al. (2019) for word embedding. We remark that besides the E2H operation, other embedding methods (Nickel & Kiela, 2017; 2018; Sala et al., 2018; Sonthalia & Gilbert, 2020) exist. We use E2H in our experiments since it is simple to incorporate into HAEGAN.
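A minimal sketch of E2H and HEmbed (pure Python, $K = -1$; the matrix `W` below is a fixed toy matrix standing in for the trainable embedding):

```python
import math

K = -1.0

def e2h(t):
    """E2H: zero-pad t in R^n to a tangent vector at the origin o,
    then apply the exponential map exp_o."""
    n = len(t)
    o = [1.0 / math.sqrt(-K)] + [0.0] * n
    r = math.sqrt(sum(a * a for a in t))  # Euclidean norm of t
    phi = math.sqrt(-K) * r               # = sqrt(-K) ||[0, t]||_L at the origin
    if phi < 1e-12:
        return o
    v = [0.0] + list(t)
    return [math.cosh(phi) * a + math.sinh(phi) * b / phi for a, b in zip(o, v)]

def hembed(x, W):
    """HEmbed: an (m x n) matrix W maps x in R^n to h = W x in R^m,
    followed by E2H. In the network, W would be learned."""
    h = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    return e2h(h)
```

Both outputs satisfy the on-manifold constraint $\langle y, y \rangle_{\mathcal{L}} = 1/K$ by construction of the exponential map.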

A.4 MORE RELATED WORKS

Machine Learning in Hyperbolic Spaces A central topic in machine learning is to find methods and architectures that incorporate the geometric structure of data (Bronstein et al., 2021). Due to the representation capacity of the hyperbolic space, many machine learning methods have been designed for hyperbolic data. Such methods include hyperbolic dimensionality reduction (Chami et al., 2021) and hyperbolic kernel methods (Fang et al., 2021). Besides these works, deep neural networks have also been proposed in the hyperbolic domain. One of the earliest such models was the Hyperbolic Neural Network (Ganea et al., 2018), which works with the Poincaré ball model of the hyperbolic space. This was recently refined in Hyperbolic Neural Networks++ (Shimizu et al., 2021). Another popular choice is to use the Lorentz model of the hyperbolic space (Chen et al., 2021; Yang et al., 2022). Our model also uses the Lorentz model for numerical stability.

Hyperbolic Graph Neural Networks Graph neural networks (GNNs) are successful models for learning representations of graph data. Recent studies (Boguná et al., 2010; Krioukov et al., 2010; Sala et al., 2018; Sonthalia & Gilbert, 2020) have found that hyperbolic spaces are suitable for tree-like graphs, and a variety of hyperbolic GNNs have been proposed. In particular, Chami et al. (2019); Liu et al. (2019); Bachmann et al. (2020) all performed message passing, the fundamental operation in GNNs, in the tangent space of the hyperbolic space, while Dai et al. (2021a); Chen et al. (2021) designed fully hyperbolic operations so that message passing can be done completely in the hyperbolic space. Some recent works address special GNNs. For instance, Sun et al. (2021) applied a hyperbolic time embedding to temporal GNNs, while Zhang et al. (2021) designed a hyperbolic graph attention network. We also note the recent survey on hyperbolic GNNs by Yang et al. (2022).

Molecular Generation State-of-the-art methods for molecular generation usually treat molecules as abstract graphs whose nodes represent atoms and whose edges represent chemical bonds.
Early methods for molecular graph generation usually generate adjacency matrices via simple multilayer perceptrons (Simonovsky & Komodakis, 2018; De Cao & Kipf, 2018). Recently, Jin et al. (2018; 2019) proposed to treat a molecule as a multiresolution representation with a junction-tree scaffold, whose nodes represent valid molecular substructures. Other molecular graph generation methods include You et al. (2018a); Shi* et al. (2020). Methods that work with SMILES (Simplified Molecular Input Line Entry System) notations instead of graphs include Segler et al. (2018); Gómez-Bombarelli et al. (2018); Blaschke et al. (2018); Kadurin et al. (2017a;b); Polykovskiy et al. (2018); Prykhodko et al. (2019). Since the hyperbolic space is promising for tree-like structures, hyperbolic GNNs have also recently been used for molecular generation (Liu et al., 2019; Dai et al., 2021a).

Wrapped normal distribution

The wrapped normal distribution is a hyperbolic distribution whose density can be evaluated analytically and is differentiable with respect to its parameters (Nagano et al., 2019). Given $\mu \in \mathbb{L}^n_K$ and $\Sigma \in \mathbb{R}^{n \times n}$, to sample $z \in \mathbb{L}^n_K$ from the wrapped normal distribution $\mathcal{G}(\mu, \Sigma)$, we first sample a vector $\tilde{v}$ from the Euclidean normal distribution $\mathcal{N}(0, \Sigma)$, then identify $\tilde{v}$ with the element $v = [0, \tilde{v}^\top]^\top \in T_o \mathbb{L}^n_K$. We parallel transport this $v$ to $u = \mathrm{PT}^K_{o \to \mu}(v)$ and finally map $u$ to $z = \exp^K_\mu(u) \in \mathbb{L}^n_K$.
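The sampling procedure can be sketched as follows (pure Python, $K = -1$, diagonal covariance only for simplicity; helper names are ours):

```python
import math
import random

K = -1.0

def linner(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def expmap(x, v):
    phi = math.sqrt(-K) * math.sqrt(max(linner(v, v), 0.0))
    if phi < 1e-12:
        return list(x)
    return [math.cosh(phi) * a + math.sinh(phi) * b / phi for a, b in zip(x, v)]

def ptransp(x, y, v):
    c = linner(y, v) / (-1.0 / K - linner(x, y))
    return [a + c * (b + e) for a, b, e in zip(v, x, y)]

def sample_wrapped_normal(mu, sigmas, rng):
    """Sample z ~ G(mu, diag(sigmas^2)): Euclidean normal draw ->
    tangent vector at the origin -> parallel transport to mu -> exp map."""
    n = len(mu) - 1
    o = [1.0 / math.sqrt(-K)] + [0.0] * n
    v = [0.0] + [rng.gauss(0.0, s) for s in sigmas]  # v in T_o L^n_K
    u = ptransp(o, mu, v)                            # u in T_mu L^n_K
    return expmap(mu, u)
```

Every sample lands on the manifold by construction, which is the key invariant to check.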

Generator and critic

The generator pushes forward a wrapped normal distribution $\mathcal{G}(o, I)$ to a hyperbolic distribution via a cascade of hyperbolic linear layers. The critic aims to distinguish fake data generated by the generator from real data. It contains a cascade of hyperbolic linear layers and a centroid distance layer whose output is a score in $\mathbb{R}$.

Training We adopt the framework of the Wasserstein GAN (Arjovsky et al., 2017), which aims to minimize the Wasserstein-1 ($W_1$) distance between the distribution pushed forward by the generator and the data distribution. Since $d_{\mathcal{L}}$ is a valid metric, the $W_1$ distance between two hyperbolic distributions $P_r, P_g$ defined on the Lorentz space is
$$W_1(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[d_{\mathcal{L}}(x, y)],$$
where $\Pi(P_r, P_g)$ is the set of all joint distributions whose marginals are $P_r$ and $P_g$, respectively. By the Kantorovich-Rubinstein duality (Villani, 2009), we have the more tractable form
$$W_1(P_r, P_g) = \sup_{\|D\|_{\mathrm{Lip}} \le 1} \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{x \sim P_g}[D(x)],$$
where the supremum is over all 1-Lipschitz functions $D: \mathbb{L}^n_K \to \mathbb{R}$, represented by the critic. To enforce the 1-Lipschitz constraint, we adopt a penalty term on the gradient following Gulrajani et al. (2017). The loss function is thus
$$\mathcal{L}_{\mathrm{WGAN}} = \mathbb{E}_{\hat{x} \sim P_g}[D(\hat{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla D(\hat{x})\|_{\mathcal{L}} - 1)^2\big],$$
where $\nabla D(\hat{x})$ is the Riemannian gradient of $D$ at $\hat{x}$, $P_g$ is the generator distribution, $P_r$ is the data distribution, and $P_{\hat{x}}$ samples uniformly along the geodesic between pairs of points sampled from $P_g$ and $P_r$. Next, we prove Proposition 3.1 to validate this sampling regime. The proof is obtained by carefully transferring the Euclidean case (Gulrajani et al., 2017) to the hyperbolic space.

B.2 PROOF OF PROPOSITION 3.1

Proof. For the optimal solution $f^*$, we have
$$\mathbb{P}_{(x, y) \sim \pi}\big[f^*(y) - f^*(x) = d_{\mathcal{L}}(y, x)\big] = 1. \tag{19}$$
Let $\psi(t) = f^*(x_t) - f^*(x)$ for $0 \le t \le 1$. Following Gulrajani et al. (2017), it is clear that $\psi$ is $d_{\mathcal{L}}(x, y)$-Lipschitz, and
$$f^*(x_t) - f^*(x) = \psi(t) = t\, d_{\mathcal{L}}(x, y), \qquad f^*(x_t) = f^*(x) + t\, d_{\mathcal{L}}(x, y) = f^*(x) + t \|v_t\|_{\mathcal{L}}.$$
Let $u_t = \frac{v_t}{d_{\mathcal{L}}(x, y)} \in T_{x_t} \mathbb{L}^n_K$ be the unit-speed directional vector of the geodesic at the point $x_t$. Let $\alpha: [-1, 1] \to \mathbb{L}^n_K$ be a differentiable curve with $\alpha(0) = x_t$ and $\alpha'(0) = u_t$. Note that $\gamma'(t) = d_{\mathcal{L}}(x, y)\, \alpha'(0)$. Therefore,
$$\lim_{h \to 0} \alpha(h) = \lim_{h \to 0} \gamma\Big(t + \frac{h}{d_{\mathcal{L}}(x, y)}\Big) = \lim_{h \to 0} x_{t + h/d_{\mathcal{L}}(x, y)}. \tag{20}$$
The directional derivative can thus be calculated as
$$\nabla_{u_t} f^*(x_t) = \frac{d}{d\tau} f^*(\alpha(\tau))\Big|_{\tau = 0} = \lim_{h \to 0} \frac{f^*(\alpha(h)) - f^*(\alpha(0))}{h} = \lim_{h \to 0} \frac{f^*\big(x_{t + h/d_{\mathcal{L}}(x, y)}\big) - f^*(x_t)}{h} = \lim_{h \to 0} \frac{f^*(x) + \big(t + \frac{h}{d_{\mathcal{L}}(x, y)}\big) d_{\mathcal{L}}(x, y) - f^*(x) - t\, d_{\mathcal{L}}(x, y)}{h} = \lim_{h \to 0} \frac{h}{h} = 1. \tag{21}$$
Since $f^*$ is 1-Lipschitz, we have $\|\nabla f^*(x_t)\|_{\mathcal{L}} \le 1$. This implies
$$1 \ge \|\nabla f^*(x_t)\|^2_{\mathcal{L}} = \langle u_t, \nabla f^*(x_t) \rangle^2_{\mathcal{L}} + \|\nabla f^*(x_t) - \langle u_t, \nabla f^*(x_t) \rangle_{\mathcal{L}}\, u_t\|^2_{\mathcal{L}} = |\nabla_{u_t} f^*(x_t)|^2 + \|\nabla f^*(x_t) - u_t \nabla_{u_t} f^*(x_t)\|^2_{\mathcal{L}} = 1 + \|\nabla f^*(x_t) - u_t\|^2_{\mathcal{L}} \ge 1.$$
Therefore $1 = 1 + \|\nabla f^*(x_t) - u_t\|^2_{\mathcal{L}}$, so $\nabla f^*(x_t) = u_t$. This yields $\nabla f^*(x_t) = \frac{v_t}{d_{\mathcal{L}}(x, y)}$.
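The penalty distribution $P_{\hat{x}}$ samples uniformly along geodesics between generator and data samples; one way to realize this, sketched here in pure Python for $K = -1$ (helper names ours), is $x_t = \exp_x(t \cdot \log_x(y))$ with $t \sim U[0, 1]$:

```python
import math

K = -1.0

def linner(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def ldist(x, y):
    return math.acosh(max(K * linner(x, y), 1.0)) / math.sqrt(-K)

def expmap(x, v):
    phi = math.sqrt(-K) * math.sqrt(max(linner(v, v), 0.0))
    if phi < 1e-12:
        return list(x)
    return [math.cosh(phi) * a + math.sinh(phi) * b / phi for a, b in zip(x, v)]

def logmap(x, y):
    psi = K * linner(x, y)
    u = [b - psi * a for a, b in zip(x, y)]
    d = math.acosh(max(psi, 1.0)) / math.sqrt(-K)
    n = math.sqrt(max(linner(u, u), 1e-30))
    return [d * a / n for a in u]

def geodesic_point(x, y, t):
    """x_t = exp_x(t * log_x(y)): the point a fraction t along the
    geodesic from x to y, used to sample P_xhat for the penalty."""
    v = logmap(x, y)
    return expmap(x, [t * a for a in v])
```

The defining property of the geodesic interpolation is that $d_{\mathcal{L}}(x, x_t) + d_{\mathcal{L}}(x_t, y) = d_{\mathcal{L}}(x, y)$, with the midpoint equidistant from both endpoints.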

B.3 TOY DISTRIBUTIONS

We use a set of challenging toy 2D distributions explored by Rozen et al. (2021) to test the effectiveness of the hyperbolic GAN. We create the dataset in the same way using their code. For our experiment, the training data are prepared in the following manner. We first sample 5,000 points from the toy 2D distributions and scale the coordinates to [-1, 1]. Then, we use the E2H operation (14) to map the points to the hyperbolic space. These points are treated as the input data of the hyperbolic GAN. Next, we use the hyperbolic GAN to learn the hyperbolic toy distributions. The generator and the critic both contain 3 hyperbolic linear layers with 64 hidden dimensions at each layer. After we train the hyperbolic GAN, we sample from it and compare with the input data. Note that the input data and the generated samples are both in the hyperbolic space. To illustrate them, we map both the input data and the generated samples to the tangent space of the origin by applying the logarithmic map. We present the mapped input data and generated samples in Figure 4. Clearly, the hyperbolic GAN can faithfully represent the challenging toy distributions in the hyperbolic space.

C MORE RESULTS FROM MNIST GENERATION WITH HAEGAN

In the HAEGAN for generating MNIST, the encoder of the AE consists of three convolutional layers, followed by an E2H layer and three hyperbolic linear layers, while the decoder consists of three hyperbolic linear layers, a logarithmic map to the Euclidean space, and three deconvolutional layers. We describe the training procedure as follows. First, we normalize the MNIST dataset and train the AE by minimizing the reconstruction loss. Second, we use the encoder to embed MNIST in the hyperbolic space and train the hyperbolic GAN on the hyperbolic embeddings. Finally, we sample a hyperbolic embedding using the generator and produce an image by applying the decoder. We describe the detailed architecture and settings in Appendix F.4. The training curves of the hyperbolic GAN of HAEGAN in the MNIST generation task are shown in Figure 5, where we compare with a Euclidean Wasserstein GAN. The critic loss includes the gradient penalty term. We observe that the trend of the loss in the hyperbolic GAN is similar to the Euclidean one (for both the generator and the critic), with no instability from the hyperbolic model. In Table 3 and Table 4, we report the quantitative results for the MNIST generation task. First, we compare the negative log-likelihood (NLL) between our method and hyperbolic VAEs (Mathieu et al., 2019; Nagano et al., 2019; Bose et al., 2020); the results are taken directly from the respective papers. Second, we compare the Fréchet inception distance (FID) with the hyperbolic GANs of Lazcano et al. (2021).

Table 4: Quantitative comparison between HAEGAN and HGAN (Lazcano et al., 2021) in the MNIST generation task. Fréchet inception distance (±std) is reported. Results for (Lazcano et al., 2021) are taken directly from the paper.

Model | FID
HGAN (Lazcano et al., 2021) | 54.95
HWGAN (Lazcano et al., 2021) | 12.50
HCGAN (Lazcano et al., 2021) | 12.43
HAEGAN (Ours) | 8.05±0.37

D MORE ANALYSIS OF CONCATENATION

D.1 PROOF OF THEOREM 4.1

Proof. 1. First, consider the case where $y^*$ is one of the spatial components of $y$. According to (5), $y^*$ is a copy of an entry of $x_{is}$ for some $i \in \{1, \cdots, N\}$. Therefore, if $i \ne j$, $\partial y^*/\partial x_{js}$ is the zero vector; if $i = j$, $\partial y^*/\partial x_{js}$ is a one-hot vector. In both cases, $\|\partial y^*/\partial x_{js}\| \le 1$. Next, consider the case where $y^* = y_t$ is the time component of $y$. According to (5),
$$\frac{\partial y^*}{\partial x_{js}} = \frac{x_{js}}{\sqrt{\sum_{i=1}^N \|x_{is}\|^2 - \frac{1}{K}}}, \quad \text{and thus} \quad \Big\| \frac{\partial y^*}{\partial x_{js}} \Big\| = \frac{\|x_{js}\|}{\sqrt{\sum_{i=1}^N \|x_{is}\|^2 - \frac{1}{K}}} \le 1,$$
since the curvature $K < 0$. We conclude that $\|\partial y^*/\partial x_{js}\| \le 1$ for any entry $y^*$ of $y$.

2. Write $z = [z_t, z_{1s}^\top, \cdots, z_{Ns}^\top]^\top$. According to the definition of the Lorentz tangent concatenation, together with formulas (9) and (10), for $l \in \{1, \cdots, N\}$ the $p$-th entry of the vector $z_{ls}$ takes the following concrete form:
$$z_{lsp} = \frac{\sinh(\Delta)}{\Delta}\, \frac{\cosh^{-1}\big(\sqrt{-K}\sqrt{\|x_{ls}\|^2 - 1/K}\big)}{\sqrt{-K}}\, \frac{x_{lsp}}{\|x_{ls}\|} = \frac{\sinh(\Delta)}{\Delta}\, \frac{\sinh^{-1}\big(\sqrt{-K}\, \|x_{ls}\|\big)}{\sqrt{-K}}\, \frac{x_{lsp}}{\|x_{ls}\|},$$
where
$$\Delta = \Big( \sum_{i=1}^N \sinh^{-1}\big(\sqrt{-K}\, \|x_{is}\|\big)^2 \Big)^{1/2}. \tag{24}$$
Let $j \ne l$. Differentiating $z_{lsp}$ with respect to $x_{jsk}$, the $k$-th entry of $x_{js}$, yields
$$\frac{\partial z_{lsp}}{\partial x_{jsk}} = C_l \Big( \frac{\cosh(\Delta)}{\Delta^2} - \frac{\sinh(\Delta)}{\Delta^3} \Big)\, \sinh^{-1}\big(\sqrt{-K}\, \|x_{js}\|\big)\, \frac{x_{jsk}}{\|x_{js}\| \sqrt{\|x_{js}\|^2 - 1/K}}, \tag{25}$$
where $C_l = \frac{1}{\sqrt{-K}} \sinh^{-1}\big(\sqrt{-K}\, \|x_{ls}\|\big) \frac{x_{lsp}}{\|x_{ls}\|}$ does not depend on $x_{js}$. Arbitrarily fix $x_{is}$ for all $i \ne l$, with in particular $x_{jsk} > 0$, and arbitrarily fix $x_{lsq}$ for $q \ne p$. We claim that $\partial z_{lsp}/\partial x_{jsk} \to \infty$ as $x_{lsp} \to \infty$. To prove the claim, first note that
$$\frac{\partial C_l}{\partial x_{lsp}} = \frac{x_{lsp}^2}{\|x_{ls}\|^2 \sqrt{1 - K\|x_{ls}\|^2}} + \Big(1 - \frac{x_{lsp}^2}{\|x_{ls}\|^2}\Big) \frac{\sinh^{-1}\big(\sqrt{-K}\, \|x_{ls}\|\big)}{\sqrt{-K}\, \|x_{ls}\|}. \tag{27}$$
Since $x_{lsp}^2 \le \|x_{ls}\|^2$ and $\sinh^{-1}(\sqrt{-K}\|x_{ls}\|)/\|x_{ls}\| > 0$, the second term in (27) is nonnegative and the first is positive. Consequently, $\partial C_l/\partial x_{lsp} > 0$, so $C_l$ is a positive term that increases with $x_{lsp}$; indeed $C_l \to \infty$ as $x_{lsp} \to \infty$. Moreover, the factor $\sinh^{-1}\big(\sqrt{-K}\|x_{js}\|\big) \frac{x_{jsk}}{\|x_{js}\| \sqrt{\|x_{js}\|^2 - 1/K}}$ in (25) is a fixed positive term that does not depend on $x_{lsp}$. Hence, we only need to show that $\frac{\cosh(\Delta)}{\Delta^2} - \frac{\sinh(\Delta)}{\Delta^3} \to \infty$ as $x_{lsp} \to \infty$. Since $\Delta \to \infty$ as $x_{lsp} \to \infty$, this reduces to proving that
$$f(t) = \frac{\cosh(t)}{t^2} - \frac{\sinh(t)}{t^3} \to \infty \quad \text{as } t \to \infty,$$
which is immediate once we write explicitly
$$f(t) = \frac{1}{2}\Big( \frac{e^{-t}}{t^3} + \frac{e^{-t}}{t^2} + \frac{(t - 1)e^t}{t^3} \Big).$$
We conclude that for any $M > 0$, there exist $\{x_i\}_{i=1}^N$ and an entry $z^*$ of $z$ (namely $z_{lsp}$ above) for which all entries of $\partial z^*/\partial x_{js}$ are no less than $M$. Thus it also holds that $\|\partial z^*/\partial x_{js}|_{x_1, \cdots, x_N}\| \ge M$.

We also illustrate the gradients of the two concatenation methods numerically. Let $z$ denote the concatenated vector of $x$ and $y$, and consider the range $x_s, y_s \in [-100, 100]$. We plot the gradients in Figure 6; due to symmetry, the graphs for $\partial/\partial y_s$ are the same. From Figure 6, we observe unbounded gradients in the components of the gradient of the Lorentz tangent concatenation. On the other hand, all components of the gradient of the Lorentz direct concatenation have absolute value bounded by 1. This numerical experiment validates the conclusion of Theorem 4.1.
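The bounded-gradient property of the time component of the Lorentz direct concatenation can also be checked by finite differences (pure Python, $K = -1$, toy inputs; this is an independent sanity check, not the paper's experiment):

```python
import math

K = -1.0

def cat_time(parts):
    """Time component of the Lorentz direct concatenation:
    y_t = sqrt(sum_i ||x_is||^2 - 1/K)."""
    s = sum(a * a for xs in parts for a in xs)
    return math.sqrt(s - 1.0 / K)

def grad_time(parts, j, eps=1e-6):
    """Central finite-difference gradient of y_t w.r.t. the spatial part x_js."""
    grad = []
    for k in range(len(parts[j])):
        plus = [list(xs) for xs in parts]
        minus = [list(xs) for xs in parts]
        plus[j][k] += eps
        minus[j][k] -= eps
        grad.append((cat_time(plus) - cat_time(minus)) / (2 * eps))
    return grad

def norm(v):
    return math.sqrt(sum(a * a for a in v))
```

Analytically $\partial y_t/\partial x_{js} = x_{js}/y_t$, whose norm is $\|x_{js}\|/y_t \le 1$, and the finite-difference gradient reproduces this bound even for large inputs.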

D.3 ANALYSIS OF EFFECT ON HYPERBOLIC DISTANCES

In this section, we perform additional analysis of the Lorentz direct concatenation and the Lorentz tangent concatenation, particularly their effect on hyperbolic distances. First, we study the hyperbolic distance to the origin under both concatenation methods. Suppose we have $x \in \mathbb{L}^n_K$ and $y \in \mathbb{L}^m_K$. Let $z = \mathrm{HCat}(x, y)$ and $z' = \mathrm{HTCat}(x, y)$ be their Lorentz direct concatenation and Lorentz tangent concatenation, respectively. We compare $d_{\mathcal{L}}(z, o)$ and $d_{\mathcal{L}}(z', o)$ as follows. Note that the distance between an arbitrary point $x \in \mathbb{L}^n_K$ and the origin depends only on the time component:
$$d_{\mathcal{L}}(x, o) = \frac{1}{\sqrt{-K}} \cosh^{-1}(K \langle x, o \rangle_{\mathcal{L}}) = \frac{1}{\sqrt{-K}} \cosh^{-1}\big(\sqrt{-K}\, x_t\big).$$
Hence, the distance information is completely contained in the time component. After the direct concatenation, the time component is $\sqrt{x_t^2 + y_t^2 + 1/K}$; consequently,
$$d_{\mathcal{L}}(z, o) = \frac{1}{\sqrt{-K}} \cosh^{-1}\Big(\sqrt{-K}\, \sqrt{x_t^2 + y_t^2 + 1/K}\Big).$$
For the Lorentz tangent concatenation, since both the logarithmic and exponential maps preserve distances to the origin, one has
$$d^2_{\mathcal{L}}(z', o) = d^2_{\mathcal{L}}(x, o) + d^2_{\mathcal{L}}(y, o), \quad \text{i.e.} \quad d_{\mathcal{L}}(z', o) = \frac{1}{\sqrt{-K}} \sqrt{\cosh^{-1}\big(\sqrt{-K}\, x_t\big)^2 + \cosh^{-1}\big(\sqrt{-K}\, y_t\big)^2}.$$
This relation agrees with the Euclidean concatenation. However, norm preservation is not why concatenation works in the Euclidean domain, so we do not consider this an advantage of the Lorentz tangent concatenation. The Lorentz direct concatenation is more efficient and stable, and no information is lost during concatenation; it is therefore still preferred as a neural layer.

More importantly, we study how concatenation changes relative distances, which is closely related to stability. Specifically, we perform the following experiments: we concatenate $x$ and $y$ each with a common vector $c$, obtaining $x_c = \mathrm{HCat}(x, c)$, $y_c = \mathrm{HCat}(y, c)$ and $x'_c = \mathrm{HTCat}(x, c)$, $y'_c = \mathrm{HTCat}(y, c)$, and check whether $d_{\mathcal{L}}(x_c, y_c)$ and $d_{\mathcal{L}}(x'_c, y'_c)$ deviate much from $d_{\mathcal{L}}(x, y)$. We describe our experiments as follows. Take $K = -1$. We randomly sample three points independently from $\mathbb{L}^n_K$ as $x$, $y$ and $c$, respectively.
We have two scenarios for sampling the points: (1) "spatial normal": the points are sampled so that their spatial components follow the standard normal distribution; (2) "wrapped normal": the points are sampled from the wrapped normal distribution with unit variance. In each scenario, for $n \in \{3, 16, 64\}$, we repeat the experiment 10,000 times. We report the distances $|d_{\mathcal{L}}(x_c, y_c) - d_{\mathcal{L}}(x, y)|$ and $|d_{\mathcal{L}}(x'_c, y'_c) - d_{\mathcal{L}}(x, y)|$ in Figures 7, 8 and 9, as well as their differences in Figure 10. Our experiments clearly show that, especially in large dimensions, $d_{\mathcal{L}}(x_c, y_c)$ is much closer to $d_{\mathcal{L}}(x, y)$ than $d_{\mathcal{L}}(x'_c, y'_c)$ is. In particular, in many cases $|d_{\mathcal{L}}(x_c, y_c) - d_{\mathcal{L}}(x, y)|$ is around zero, while $|d_{\mathcal{L}}(x'_c, y'_c) - d_{\mathcal{L}}(x, y)|$ tends to be large for $n = 16, 64$, especially when the samples follow the wrapped normal distribution. From this result, the Lorentz direct concatenation should be preferred to the Lorentz tangent concatenation. In particular, the significant expansion of distances when concatenating with the same vector, in the case $n = 64$, may be one cause of numerical instability (note that we use the same dimensionality for the experiment in §4.2 and in the molecular generation task).
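The relative-distance experiment can be sketched as follows (pure Python, $K = -1$; `hcat`/`htcat` are our simplified stand-ins for HCat and HTCat, and a single "spatial normal" draw is shown rather than the paper's 10,000 repetitions):

```python
import math
import random

K = -1.0

def linner(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def ldist(x, y):
    return math.acosh(max(K * linner(x, y), 1.0)) / math.sqrt(-K)

def lift(xs):
    return [math.sqrt(sum(a * a for a in xs) - 1.0 / K)] + list(xs)

def hcat(x, y):
    """Lorentz direct concatenation: keep spatial parts, recompute time."""
    return lift(x[1:] + y[1:])

def expmap(x, v):
    phi = math.sqrt(-K) * math.sqrt(max(linner(v, v), 0.0))
    if phi < 1e-12:
        return list(x)
    return [math.cosh(phi) * a + math.sinh(phi) * b / phi for a, b in zip(x, v)]

def logmap(x, y):
    psi = K * linner(x, y)
    u = [b - psi * a for a, b in zip(x, y)]
    d = math.acosh(max(psi, 1.0)) / math.sqrt(-K)
    n = math.sqrt(max(linner(u, u), 1e-30))
    return [d * a / n for a in u]

def htcat(x, y):
    """Lorentz tangent concatenation: log at the origin, concatenate the
    spatial tangent parts, exp back at the larger-dimensional origin."""
    on = [1.0] + [0.0] * (len(x) - 1)   # origin for K = -1
    om = [1.0] + [0.0] * (len(y) - 1)
    vs = logmap(on, x)[1:] + logmap(om, y)[1:]
    o = [1.0] + [0.0] * len(vs)
    return expmap(o, [0.0] + vs)

rng = random.Random(0)
n = 16
x, y, c = [lift([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(3)]
dev_direct = abs(ldist(hcat(x, c), hcat(y, c)) - ldist(x, y))
dev_tangent = abs(ldist(htcat(x, c), htcat(y, c)) - ldist(x, y))
```

For a typical draw, `dev_tangent` is noticeably larger than `dev_direct` in higher dimensions, matching Figures 7-9, though this need not hold for every individual sample.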

E DETAILS OF AE IN MOLECULAR GENERATION

In this section, we carefully describe the encoders and decoders, which compose the AE used in the HAEGAN for molecular generation. The basic structure of the AE in the molecular generation task is illustrated in Figure 11 . Notation We denote a molecular graph as G = (V G , E G ), where V G is the set of nodes (atoms) and E G is the set of edges (bonds). Each node (atom) v ∈ V G has a node feature x v describing its atom type and properties. The molecular graph is decomposed into a junction tree T = (V T , E T ) where V T is the set of atom clusters. We use u, v, w to represent graph nodes and i, j, k to represent tree nodes, respectively. The dimensions of the node features of the graph x v and the tree x i are denoted by d G0 and d T0 , respectively. The hidden dimensions of graph and tree embeddings are d G , d T , respectively. 


Figure 11 : Illustration of the autoencoder used in the HAEGAN for molecular generation. The input molecular graph is firstly coarsened into the junction tree. Then both of them are encoded using graph and tree encoders to their respective hyperbolic embeddings z T and z G . To reconstruct the molecule, we first decode the junction tree from z T , and then reconstruct the molecular graph using the junction tree and z G .

E.1 GRAPH AND TREE ENCODER

The graph encoder for the molecular graph $G$ contains the following layers. First, each node feature $x_v$ is mapped to the hyperbolic space via $x^{(0)}_v = \mathrm{E2H}_{d_{G0}}(x_v)$. Next, the hyperbolic features are passed through a hyperbolic GCN with $l_G$ layers:
$$x^{(l)} = \mathrm{HGCN}(x^{(l-1)}), \quad l = 1, \cdots, l_G. \tag{33}$$
Finally, we take the centroid of the embeddings of all vertices to get the hyperbolic embedding of the entire graph, $z_G = \mathrm{HCent}(x^{(l_G)})$. The tree encoder is similar to the graph encoder: it encodes the junction tree into a hyperbolic embedding $z_T$ with a hyperbolic GCN of depth $l_T$. The only difference is that its input features $x_i$ are one-hot vectors representing the atom clusters in the cluster vocabulary, so we use a hyperbolic embedding layer as the first layer of the network. Altogether, the tree encoder contains the following layers in sequence:
$$x^{(0)}_i = \mathrm{HEmbed}_{d_{T0}, d_T}(x_i), \quad x^{(l)} = \mathrm{HGCN}(x^{(l-1)}),\ l = 1, \cdots, l_T, \quad z_T = \mathrm{HCent}(x^{(l_T)}). \tag{35}$$

E.2 TREE DECODER

Similar to Jin et al. (2018; 2019), we generate a junction tree $T = (V_T, E_T)$ using a tree recurrent neural network in a top-down, node-by-node fashion. The generation process resembles a depth-first traversal over the tree $T$. Starting from the root, at each time step $t$, the model decides whether to continue generating a child node or to backtrack to its parent node. If it decides to generate a new node, it further predicts the cluster label of the child node. It makes these decisions based on the messages passed from the neighboring nodes. We remark that we do not use the gated recurrent unit (GRU) for message passing: the complex structure of the GRU would make the training process numerically unstable for our hyperbolic neural network. We simply replace it with a hyperbolic linear layer.

Message Passing Let $\tilde{E} = \{(i_1, j_1), \dots, (i_m, j_m)\}$ denote the collection of the edges visited in a depth-first traversal over $T$, where $m = 2|E_T|$.
We store a hyperbolic message $h_{i_t, j_t}$ for each edge in $\tilde{E}$. Let $\tilde{E}_t$ be the set of the first $t$ edges in $\tilde{E}$. Suppose at time step $t$ the model visits node $i_t$ and will visit node $j_t$ at the next time step. The message $h_{i_t, j_t}$ is updated using the node feature $x_{i_t}$ and the inward messages $h_{k, i_t}$. We first use the hyperbolic centroid to gather the inward messages,
$$z_{\mathrm{nei}} = \mathrm{HCent}\big(\mathrm{HLinear}_{d_T, d_T}(\{h_{k, i_t}\}_{(k, i_t) \in \tilde{E},\, k \ne j_t})\big),$$
and then map the tree node feature to the hyperbolic space,
$$z_{\mathrm{cur}} = \mathrm{HEmbed}_{d_{T0}, d_T}(x_{i_t}). \tag{37}$$
Finally, we combine them using the Lorentz direct concatenation and pass the result through a hyperbolic linear layer to get the message
$$h_{i_t, j_t} = \mathrm{HLinear}_{2 \times d_T, d_T}\big(\mathrm{HCat}(\{z_{\mathrm{cur}}, z_{\mathrm{nei}}\})\big). \tag{38}$$

Topological Prediction At each time step $t$, the model makes a binary decision on whether to generate a child node, using the tree embedding $z_T$, the node feature $x_{i_t}$ and the inward messages $h_{k, i_t}$, by applying the following layers successively:
$$z_{\mathrm{nei}} = \mathrm{HCent}\big(\mathrm{HLinear}_{d_T, d_T}(\{h_{k, i_t}\}_{(k, i_t) \in \tilde{E}})\big), \quad z_{\mathrm{cur}} = \mathrm{HEmbed}_{d_{T0}, d_T}(x_{i_t}), \quad z_{\mathrm{all}} = \mathrm{HLinear}_{3 \times d_T, d_T}\big(\mathrm{HCat}(\{z_{\mathrm{cur}}, z_{\mathrm{nei}}, z_T\})\big), \quad p_t = \mathrm{Softmax}\big(\mathrm{HCDist}_{d_T, 2}(z_{\mathrm{all}})\big). \tag{39}$$

Label Prediction If a child node $j_t$ is generated, we use the tree embedding $z_T$ and the outward message $h_{i_t, j_t}$ to predict its label by applying the following two layers successively:
$$z_{\mathrm{all}} = \mathrm{HLinear}_{2 \times d_T, d_T}\big(\mathrm{HCat}(\{h_{i_t, j_t}, z_T\})\big), \quad q_t = \mathrm{Softmax}\big(\mathrm{HCDist}_{d_T, d_{T0}}(z_{\mathrm{all}})\big). \tag{40}$$
The output $q_t$ is a distribution over the label vocabulary. When $j_t$ is a root node, its parent $i_t$ is a dummy node and the message is padded with the origin of the hyperbolic space, $h_{i_t, j_t} = o$.

Training The topological and label predictions induce two losses. Suppose $\hat{p}_t, \hat{q}_t$ are the ground-truth topological and label values, obtained by a depth-first traversal of the real junction tree. The decoder minimizes the following cross-entropy losses:
$$\mathcal{L}_{\mathrm{topo}} = \sum_{t=1}^m \mathcal{L}_{\mathrm{cross}}(\hat{p}_t, p_t), \qquad \mathcal{L}_{\mathrm{label}} = \sum_{t=1}^m \mathcal{L}_{\mathrm{cross}}(\hat{q}_t, q_t),$$
where $\mathcal{L}_{\mathrm{cross}}$ is the cross-entropy loss. During the training phase, we use the teacher-forcing strategy: after the predictions at each time step, we replace them with the ground truth.
This allows the model to learn from the correct history information. 



We take the most updated results available from https://github.com/molecularsets/moses.
https://github.com/noamroze/moser_flow




Figure 1: Overview of HAEGAN. (a) The hyperbolic AE. (b) The hyperbolic GAN for generating the latent embeddings. The encoders in (b) are identical to those in (a). (c) The process for sampling molecules. The generator in (c) is identical to that in (b) and the decoders in (c) are identical to those in (a).

As a sanity check, we train a HAEGAN on the MNIST dataset (LeCun et al., 2010) and present some generated samples in Figure 2. It is clear that HAEGAN can faithfully generate synthetic examples. We describe the details and perform a quantitative comparison with other hyperbolic models on this task in Appendix C.

Figure 2: Samples generated from the HAEGAN trained on MNIST.

Figure 3: Average gradient norm of each block during training. (a) 64 blocks. (b) 128 blocks.

have been proposed. In particular, Chami et al. (2019), Liu et al. (2019), and Bachmann et al. (2020) all performed message passing, the fundamental operation in GNNs, in the tangent space of the hyperbolic space. On the other hand, Dai et al. (2021a) and Chen et al. (

Figure 4: Input data and generated samples from the hyperbolic GAN. The hyperbolic data points are transformed to the tangent space of the origin by the logarithmic map.

Figure 5: The training loss of generator and critic in the MNIST generation task. Left: hyperbolic Wasserstein GAN. Right: Euclidean Wasserstein GAN.

Figure 6: Illustration of the gradients of both concatenation methods: (a) Lorentz tangent concatenation; (b) Lorentz direct concatenation. (a1) & (b1): $\partial z_t / \partial x_s$; (a2) & (b2): $\partial z_{s_0} / \partial x_s$; (a3) & (b3): $\partial z_{s_1} / \partial x_s$.

Figure 7: Difference between concatenated distances and original distances with n = 3. Left: spatial normal. Right: wrapped normal.

Although the hyperbolic distance $d_{\mathcal{L}}(z, \mathbf{o})$ is not the square root of the sum of the squares of $d_{\mathcal{L}}(x, \mathbf{o})$ and $d_{\mathcal{L}}(y, \mathbf{o})$, $d_{\mathcal{L}}(z, \mathbf{o})$ is larger than each of $d_{\mathcal{L}}(x, \mathbf{o})$ and $d_{\mathcal{L}}(y, \mathbf{o})$. On the other hand, after concatenation,

Figure 9: Difference between concatenated distances and original distances with n = 64. Left: spatial normal. Right: wrapped normal.
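The qualitative behavior described above, where the concatenated point moves farther from the origin than either input without the distances combining Euclidean-style, can be checked numerically. The sketch below assumes the Lorentz model with $K = -1$ and our reading of the direct concatenation (concatenate the space-like parts, recompute the time coordinate); all function names are illustrative:

```python
import math

def lift(xs):
    # place spatial coordinates on the hyperboloid (K = -1)
    return [math.sqrt(1.0 + sum(a * a for a in xs))] + list(xs)

def dist_to_origin(p):
    # d_L(p, o) = arccosh(p_0), since the origin is o = (1, 0, ..., 0)
    return math.acosh(p[0])

def direct_concat(x, y):
    # concatenate the space-like parts and recompute the time coordinate
    return lift(x[1:] + y[1:])

x, y = lift([0.8, -0.3]), lift([0.1, 1.2])
z = direct_concat(x, y)
dx, dy, dz = dist_to_origin(x), dist_to_origin(y), dist_to_origin(z)
# dz is at least max(dx, dy), yet differs from sqrt(dx**2 + dy**2);
# instead, cosh^2(dz) = cosh^2(dx) + cosh^2(dy) - 1
```

This follows because the new time coordinate satisfies $z_0^2 = x_0^2 + y_0^2 - 1$, and the distance to the origin depends only on the time coordinate.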



.1 ARCHITECTURE DETAILS OF HYPERBOLIC GENERATIVE ADVERSARIAL NETWORK

Generator
• Input: points in $\mathbb{L}^{256}_K$ sampled from $\mathcal{G}(\mathbf{o}, \operatorname{diag}(\mathbf{1}_{256}))$
• Hyperbolic linear layers:
  - Input dimension: 3
  - Hidden dimension: 128
  - Depth: 3
  - Output dimension: 128

Critic
• Hyperbolic centroid distance layer: $\mathbb{L}^{128}_K \to \mathbb{R}$
• Output: score in $\mathbb{R}$

Hyperparameters
• Manifold curvature: $K = -1.0$
• Gradient penalty coefficient: $\lambda = 10$
• For all hyperbolic linear layers:
  - Dropout: 0.0
  - Use bias: True
• Optimizer: Riemannian Adam ($\beta_1 = 0$, $\beta_2 = 0.9$)
• Learning rate: 1e-4
• Batch size: 128
• Number of epochs: 20

F.4 EXPERIMENT DETAILS FOR MNIST GENERATION

We describe the detailed architecture for the MNIST generation experiment.
• Map to Euclidean space: $\mathbb{L}^{64}_K \to \mathbb{R}^{64}$
• Transposed convolutional neural network decoder
  - Transposed convolutional layer

Results of the tree generation experiments. "NaN" indicates NaN reported during training.

and contains about 1.58M training, 176k test, and 176k scaffold test molecules. The molecules in the scaffold test set have different Bemis-Murcko scaffolds (Bemis & Murcko, 1996), which represent the core structures of compounds, from both the training and the test sets. They are used to determine whether a model can generate novel scaffolds absent in the training set.

Performance in Validity, Unique(ness), Novelty, SNN, and Scaf metrics. Reported (mean ± std) over three independent samples. HVAE-w, HVAE-r, HGAN, HAEGAN-H, HAEGAN-β, HAEGAN-T are not included in the table as they all produce "NaN", implying instability.

and calculate the gradients of each entry of


the respective papers, which were produced from different numbers of samples: in Nagano et al. (2019), the NLL is calculated with 500 samples, while Mathieu et al. (2019) used 3,000 samples. We generate 5,000 samples and compare the NLL with them. We also calculate the FID and compare it with HGAN (Lazcano et al., 2021). Our NLL results are comparable with those of the hyperbolic VAEs, while our FID is slightly better than that of HGAN.

Table 3: Quantitative comparison between HAEGAN and other methods in the MNIST generation task. Log-likelihood (±std) for different embedding dimensions is reported. We use 5,000 samples to estimate the log-likelihood. Results for Mathieu et al. (2019), Nagano et al. (2019), and Bose et al. (2020) are taken directly from the respective papers. ⋆ indicates numerically unstable settings.

E.3 GRAPH DECODER

The graph decoder assembles a molecular graph given a junction tree $\hat{\mathcal{T}} = (\hat{\mathcal{V}}, \hat{\mathcal{E}})$ and the graph embedding $z_G$. Let $\mathcal{G}_i$ be the set of possible candidate subgraphs around tree node $i$, i.e., the different ways of attaching neighboring clusters to cluster $i$. We want to design a scoring function for each candidate subgraph $G^{(i)}_j \in \mathcal{G}_i$. To this end, we first use the hyperbolic GCN and the hyperbolic centroid to acquire the hyperbolic embedding $z_{G^{(i)}_j}$. Then, the embedding of the subgraph is combined with the embedding of the molecular graph $z_G$ by the Lorentz Direct Concatenation, and the result is passed to the hyperbolic centroid distance layer to get a score.

Training We define the loss for the graph decoder to be the sum of the cross-entropy losses in each $\mathcal{G}_i$, where each cross-entropy loss is computed between the candidate scores and the correct subgraph. Similar to the tree decoder, we also use teacher forcing when training the graph decoder.
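As an illustration of the candidate-scoring loss, here is a minimal sketch: one score per candidate subgraph in $\mathcal{G}_i$, with a softmax cross-entropy against the index of the ground-truth candidate. The scores and names are hypothetical, not the model's actual outputs:

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of raw scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    # loss between candidate-subgraph scores and the ground-truth index,
    # which teacher forcing supplies during training
    return -math.log(softmax(scores)[target])

# hypothetical scores for three candidate subgraphs in G_i
scores = [2.0, 0.5, -1.0]
loss = cross_entropy(scores, target=0)  # small when the true candidate scores highest
```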

F DETAILED SETTINGS OF EXPERIMENTS

F.1 OPTIMIZATION For all the experiments, we use the Geoopt package (Kochurov et al., 2020) for Riemannian optimization. In particular, we use the Riemannian Adam function for gradient descent. We also use Geoopt for initializing the weights in all hyperbolic linear layers of our model with the wrapped normal distribution.
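For intuition about what Riemannian optimization does here, the core of a single Riemannian gradient step on the Lorentz model ($K = -1$) can be sketched in pure Python: convert the Euclidean gradient to a Riemannian one, project it onto the tangent space, and move along the exponential map. This is a plain gradient step under our stated conventions, not Geoopt's actual Riemannian Adam implementation, and all names are illustrative:

```python
import math

def linner(u, v):
    # Lorentzian inner product: <u, v>_L = -u_0 v_0 + sum_i u_i v_i
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))

def riemannian_grad(x, egrad):
    # Euclidean grad -> Riemannian grad on the hyperboloid (K = -1):
    # flip the sign of the time component, then project onto the tangent space at x
    h = [-egrad[0]] + list(egrad[1:])
    coef = linner(x, h)
    return [hi + coef * xi for hi, xi in zip(h, x)]

def expmap(x, v):
    # exponential map: move from x along the tangent vector v
    n = math.sqrt(max(linner(v, v), 1e-15))
    return [math.cosh(n) * xi + math.sinh(n) * vi / n for xi, vi in zip(x, v)]

x = [math.sqrt(2.0), 1.0, 0.0]  # a point on the hyperboloid: <x, x>_L = -1
g = [0.1, -0.2, 0.3]            # a made-up Euclidean gradient
v = riemannian_grad(x, g)       # lies in the tangent space: <x, v>_L = 0
x_new = expmap(x, [-0.01 * vi for vi in v])  # one descent step, lr = 0.01
```

The step keeps the iterate exactly on the manifold, which is why parameters can be stored as manifold points rather than being re-projected after each Euclidean update.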

F.2 ENVIRONMENTS

All the experiments in this paper are conducted with the following environment.
• GPU: RTX 3090
• CUDA version: 11.1
• PyTorch version: 1.9.0
• RDKit version: 2020.09.1.0

F.3 EXPERIMENT DETAILS FOR TOY DISTRIBUTION

We describe the details for the toy distribution experiment.

Hyperparameters

• Manifold curvature: $K = -1.0$
• For all hyperbolic linear layers:
  - Dropout: 0.0
  - Use bias: True
• Optimizer: Riemannian Adam ($\beta_1 = 0.9$, $\beta_2 = 0.9$)
• Learning rate: 1e-4
• Batch size: 32

