HYPERBOLIC NEURAL NETWORKS++

Abstract

Hyperbolic spaces, which have the capacity to embed tree structures without distortion owing to their exponential volume growth, have recently been applied to machine learning to better capture the hierarchical nature of data. In this study, we generalize the fundamental components of neural networks in a single hyperbolic geometry model, namely, the Poincaré ball model. This novel methodology constructs a multinomial logistic regression, fully-connected layers, convolutional layers, and attention mechanisms under a unified mathematical interpretation, without increasing the parameters. Experiments show the superior parameter efficiency of our methods compared to conventional hyperbolic components, and stability and outperformance over their Euclidean counterparts. Note that, in this study, we equate g with g x if g x is constant, and denote the Euclidean inner product, norm, and unit vector for any real vector u, v ∈ R n as u, v , v , and [v] = v/ v , respectively. Hyperbolic space. A hyperbolic space is a Riemannian manifold with a constant negative curvature, the coordinates of which can be represented in several isometric models. The most basic model is an 1 composed of one time-like axis and n space-like axes. The manifolds of Poincaré ball model B n c and Beltrami-Klein model K n c are the projections of the hyperboloid model onto the different n-dimensional space-like hyperplanes, as depicted in Figure 1. For their mathematical definitions and the isometric isomorphism between their coordinates, see Appendix A. ℝ

1. INTRODUCTION

Shifting the arithmetic stage of a neural network to a non-Euclidean geometry such as a hyperbolic space is a promising way to find more suitable geometric structures for representing or processing data. Owing to its exponential growth in volume with respect to its radius (Krioukov et al., 2009; 2010) , a hyperbolic space has the capacity to continuously embed tree structures with arbitrarily low distortion (Krioukov et al., 2010; Sala et al., 2018) . It has been directly utilized, for instance, to visualize large taxonomic graphs (Lamping et al., 1995) , to embed scale-free graphs (Blasius et al., 2018) , or to learn hierarchical lexical entailments (Nickel & Kiela, 2017) . Compared to the Euclidean space, a hyperbolic space shows a higher embedding accuracy under fewer dimensions in such cases. Because a wide variety of real-world data encompasses some type of latent hierarchical structures (Katayama & Maina, 2015; Newman, 2005; Lin & Tegmark, 2017; Krioukov et al., 2010) , it has been empirically proven that a hyperbolic space is able to capture such intrinsic features through representation learning (Krioukov et al., 2010; Ganea et al., 2018b; Nickel & Kiela, 2018; Tifrea et al., 2019; Law et al., 2019; Balazevic et al., 2019; Gu et al., 2019) . Motivated by such expressive characteristics, various machine learning methods, including support vector machines (Cho et al., 2019) and neural networks (Ganea et al., 2018a; Gulcehre et al., 2018; Micic & Chu, 2018; Chami et al., 2019) have derived the analogous benefits from the introduction of a hyperbolic space, aiming to improve the performance on advanced tasks beyond just representing data. One of the pioneers in this area is Hyperbolic Neural Networks (HNNs), which introduced an easy-to-interpret and highly analytical coordinate system of hyperbolic spaces, namely, the Poincaré ball model, with a corresponding gyrovector space to smoothly connect the fundamental functions common to neural networks into valid functions in a hyperbolic geometry (Ganea et al., 2018a) . Built upon the solid foundation of HNNs, the essential components for neural networks covering the multinomial logistic regression (MLR), fully-connected (FC) layers, and Recurrent Neural Networks have been realized. In addition to the formalism, the methods for graphs (Liu et al., 2019) , sequential classification (Micic & Chu, 2018) , or Variational Autoencoders (Nagano et al., 2019; Mathieu et al., 2019; Ovinnikov, 2019; Skopek et al., 2020 ) are further constructed. Such studies have applied the Poincaré ball model as a natural and viable option in the area of deep learning. Despite such progress, however, there still remain some unsolved problems and uncovered regions. In terms of the network architectures, the current formulation of hyperbolic MLR (Ganea et al., 2018a) requires almost twice the number of parameters compared to its Euclidean counterpart. This makes both the training and inference costly in cases in which numerous embedded entities should be classified or where large hidden dimensions are employed, such as in natural language processing. The lack of convolutional layers must also be mentioned, because their application is now ubiquitous and is no longer limited to the field of computer vision. For the individual functions that are commonly used in machine learning, the split and concatenation of vectors have yet to be realized in a hyperbolic space in a manner that can fully exploit such space and allow sub-vectors to achieve a commutative property. Additionally, although several types of closed-form centroids in a hyperbolic space have been proposed, their geometric relationships have not yet been analyzed enough. Because a centroid operation has been utilized in many recent attention-based architectures, the theoretical background for which type of hyperbolic centroid should be used would be required in order to properly convert such operations into the hyperbolic geometry. Based on the previous analysis, we reconsider the flow of several extensions to bridge Euclidean operations into hyperbolic operations and construct alternative or novel methods on the Poincaré ball model. Specifically, the main contributions of this paper are summarized as follows: 1. We reformulate a hyperbolic MLR to reduce the number of parameters to the same level as a Euclidean version while maintaining the same range of representational properties. 2. We further exploit the knowledge of 1 as a replacement of an affine transformation and propose a novel generalization of the FC layers that can more properly make use of the hyperbolic nature compared with a previous research (Ganea et al., 2018a) . 3. We generalize the split and concatenation of coordinates to the Poincaré ball model by setting the invariance of the expected value of the vector norm as a criterion. 4. By combining 2 and 3, we further define a novel generalization scheme of arbitrary dimensional convolutional layers in the Poincaré ball model. 5. We prove the equivalence of the hyperbolic centroids defined in three different hyperbolic geometry models, and expand the condition of non-negative weights to entire real values. Moreover, integrating this finding and previous contributions 1, 2, and 3, we give a theoretical insight into hyperbolic attention mechanisms realized in the Poincaré ball model. We experimentally demonstrate the effectiveness of our methods over existing HNNs and Euclidean equivalents based on a performance test of MLR functions and experiments with Set Transformer (Lee et al., 2019) and convolutional sequence to sequence modeling (Gehring et al., 2017) .foot_0 

2. HYPERBOLIC GEOMETRY

Riemannian geometry. An n-dimensional manifold M is an n-dimensional topological space that can be linearly approximated to an n-dimensional real space at any point x ∈ M, and each local linear space is called a tangent space T x M. A Riemannian manifold is a pairing of a differentiable manifold and a metric tensor field g as a function of each point x, which is expressed as (M, g). Here, g defines an inner product on each tangent space such that ∀ u, v ∈ T x M, u, v x = u g x v, where g x is a positive definite symmetric matrix defined on T x M. The norm of a tangent vector derived from the inner product is defined as v x = | v, v x |. A metric tensor g x provides local information regarding the angle and length of the tangent vectors in T x M, which induces the global length of the curves on M through an integration. The shortest path connecting two arbitrary points on M at a constant speed is called a geodesic, the length of which becomes the distance. Along a geodesic where one of the endpoints is x, the function projecting a tangent vector v ∈ T x M as an initial velocity vector onto M is denoted as an exponential map exp x , and its inverse function is called a logarithmic map log x . In addition, the concept of parallel transport P x→y : T x M → T y M is generalized to the specially conditioned unique linear isometry between two tangent spaces. For more details, please refer to Spivak (1979) ; Petersen et al. (2006) ; Andrews & Hopper (2010) . n-dimensional hyperboloid model, which is a hypersurface H n c in an (n + 1)-dimensional Minkowski space R n+1 To operate the coordinates as vector-like mathematical objects, the Möbius gyrovector space provides an algebra that treats them as gyrovectors, equipped with various operations including the generalized vector addition, that is, a noncommutative and non-associative binary operation called the Möbius addition ⊕ c (Ungar, 2009) . lim c→0 ⊕ c converges to + in connection with a Euclidean geometry, the curvature of which is zero. For more details, see Appendix B. Considering one bias point q a,r per one discriminative hyperplane solves this over-parameterization. Aiming to overcome the difficulties discussed in Section 1, we build a novel scheme of hyperbolic neural networks in the Poincaré ball model. The core concept is re-generalization of a, x -b type equations with no increase in the number of parameters, which has the potential to replace any affine transformation based on the same mathematical principle. Specifically, this section starts from the reformulation of the hyperbolic MLR, from which the variants to the FC, convolutional, and multi-head attention layers are derived. Several other modifications are also proposed to support neural networks with broad architectures.

3.1. UNIDIRECTIONAL REPARAMETERIZATION OF HYPERBOLIC MLR LAYER

Given an input x ∈ R n , MLR is an operation used to predict the probabilities of all target outcomes k ∈ {1, 2, ..., K} for the objective variable y as a log-linear model and is described as follows: p(y = k | x) ∝ exp (v k (x)) , where v k (x) = a k , x -b k , a k ∈ R n , b k ∈ R. (1) Circumvention of the double vectorization. To generalize the linear function v k to the Poincaré ball model, Ganea et al. (2018a) first re-parameterized the scalar term b k as a vector p k ∈ R n in the form a k , x -b k = a k , -p k + x , where b k = a k , p k , and then discussed the properties which must be satisfied when such vectors become Möbius gyrovectors. However, this causes an undesirable increase in the parameters from n + 1 to 2n in each class k. As illustrated in Figure 2 (a), this reformulation is redundant from the viewpoint that there exist countless choices of p k to determine the same discriminative hyperplane key of this step is to replace all variables with vectors attributed to the same manifold, we introduce another scalar parameter r k ∈ R instead, which makes the bias vector q a k ,r k parallel to a k : H a k ,b k = {x ∈ R n | a k , x -b k = 0}. Because the (b) F c (x; Z, r) (ours) a k , x -b k = a k , -q a k ,r k + x , where q a k ,r k = r k [a k ] s.t. b k = r k a k . (2) One possible realization of p k is adopted to reduce the previously mentioned redundancies without a loss of generality or representational properties compared to the original affine transformation, and induces another notation: Ha k ,r k := {x ∈ R n | a k , -q a k ,r k + x = 0} = H a k ,r k a k . Based on distance d from a point to a hyperplane, Equation 2 can be rewritten as with Lebanon & Lafferty (2004) in the following form: a k , -q a k ,r k + x = sign( a k , -q a k ,r k + x ) d(x, Ha k ,r k ) a k , which decomposes the inner product into the product of the norm of an orientation vector a k and the signed distance between an input vector x ∈ R n and the hyperplane Ha k ,r k . Unidirectional Poincaré MLR. Based on the observation that q a k ,r k starts from the origin and the concept of Poincaré hyperplanes, we can now generalize v k for x, q a k ,r k ∈ B n c and a k ∈ T qa k ,r k B n c : v k (x) = sign( a k , c q a k ,r k ⊕ c x ) d c x, Hc a k ,r k a k c qa k ,r k , where q a k ,r k = exp c 0 (r k [a k ]), Hc a k ,r k := {x ∈ B n c | a k , c q a k ,r k ⊕ c x = 0}, which are shown in Figure 2 (b). Importantly, the circular reference between a k ∈ T qa k ,r k B n c and q a k ,r k can be unraveled by considering the tangent vector at the origin, z k ∈ T 0 B n c , from which a k is parallel transported by P c x→y : T x B n c → T y B n c described in Appendix C.3 as follows: a k = P c 0→qa k ,r k (z k ) = sech 2 √ c r k z k , q a k ,r k = exp c 0 (r k [z k ]) = q z k ,r k . ( ) Combining Equations 3, 5, and 23, we conclude the derivation of the unidirectional re-generalization of MLR, the parameters of which are r k ∈ R and z k ∈ T 0 B n c = R n for each class k: v k (x) = 2 c -1 2 z k sinh -1 λ c x √ c x, [z k ] cosh 2 √ c r k -(λ c x -1) sinh 2 √ c r k . (6) For more detailed deformation, see Appendix D.1. Note that we recover the form of the standard Euclidean MLR in lim c→0 v k (x) = 4( a k , x -b k ), which is proven in Appendix D.2.

3.2. REFORMULATING FC LAYERS TO PROPERLY EXPLOIT THE HYPERBOLIC PROPERTIES

We next discuss the FC layers, described as a simple affine transformation y = Axb, in an element-wise manner with respect to the output space as y k = a k , x -b k , where x, a k ∈ R n and b k ∈ R. This can be interpreted as an operation that linearly transforms the input x and treats the output score y k as the coordinate value at, or the signed distance from the hyperplane containing the origin and orthogonal to, the k-th axis of the output space R m . Therefore, combining them with a generalized linear transformation, as described in Section 3.1, we can now generalize the FC layers: Poincaré FC layer. Given an input x ∈ B n c , with the generalized linear transformation v k in Equation 6 and the parameters composed of Z = {z k ∈ T 0 B n c = R n } m k=1 , which is a generalization of A and r = {r k ∈ R} m k=1 representing the bias terms, the Poincaré FC layer outputs the following: 7) It can be proven that the signed distance from y to each Poincaré hyperplane containing the origin, and orthogonal to the k-th axis, equals v k (x), as shown in Appendix D.3, satisfying the aforementioned properties. We also recover a FC layer in lim c→0 y k = 4 ( a k , x -r k a k ). y = F c (x; Z, r) := w(1 + 1 + c w 2 ) -1 , where w := (c -1 2 sinh √ c v k (x) ) m k=1 . ( Comparison with a previous method. Ganea et al. (2018a) proposed a hyperbolic FC layer operating a matrix-vector multiplication in a tangent space and adding a bias through the following Möbius addition: y = exp c 0 (A log c 0 (x)) ⊕ c b, which indicates that the discriminative hyperplane determined in T 0 B m c is projected back to B m c by the exponential map at the origin. However, such a surface is no longer a Poincaré hyperplane, except for b = 0. Moreover, the basic shape of the contour surfaces in the output space B m c is determined only by the orientation of each row vector a k in A, whereas their norms and a bias term b contribute to the total scale and shift. Conversely, the parameters in our method cooperate to realize more various contour surfaces. Notably, discriminative hyperplanes become Poincaré hyperplanes, i.e., the set of all geodesics orthogonal to the orientation z k and containing a point exp c 0 (r k [z k ]). As shown in Figure 3 , the input space B n c is separated in a more meaningful manner as a hyperbolic space for each dimension of the output space B m c .

3.3. REGULARIZING SPLIT AND CONCATENATION

Split and concatenation are essential operations for realizing small process branches in parallel or combining feature vectors. However, in the Poincaré ball model, merely splitting the coordinates lowers the norms of the output gyrovectors and limits the representational power, and concatenating them is invalid because the norm of the output can easily exceed the domain of the ball. One simple solution is to conduct an operation in the tangent space. The aforementioned problem regarding a split operation, however, remains. Moreover, as the number of inputs to be concatenated increases, the output gyrovector approaches the boundary of the ball even if the norm of each input is adequately small. The norm of the gyrovector is crucial in the Poincaré ball model owing to its metric. Therefore, reflecting the orientation of inputs while preserving the scale of the norm is considered to be desirable.

Generalization criterion.

In Euclidean neural networks, keeping the variance of feature vectors constant is an essential criterion (He et al., 2015) . As an analogy, keeping the expected values of the norms constant is a worthy criterion in the Poincaré ball because the norm of any Möbius gyrovector is upper-bounded by the ball radius and the variance of the coordinates cannot necessarily remain intact when the dimensions in each layer vary. Such a replacement of the statistic invariance target from each coordinate to the norm is also suggested by Becigneul & Ganea (2019) . To satisfy this criterion, we propose the following generalization scheme with a scalar coefficient β n = B( n 2 , 1 2 ), where B indicates a beta function. Poincaré β-split. First, the input x ∈ B n c is split in the tangent space with integers s.t. N i=1 n i = n: x → v = log c 0 (x) = (v 1 ∈ R n1 , . . . , v N ∈ R n N ) . Each split tangent vector is then properly scaled and projected back to the Poincaré ball as follows: v i → y i = exp c 0 β ni β -1 n v i . Poincaré β-concatenation. Likewise, the inputs {x i ∈ B ni c } N i=1 are first properly scaled and concatenated in the tangent space, and then projected back to the Poincaré ball in the following manner: x i → v i = log c 0 (x i ) ∈ T 0 B ni c , v := (β n β -1 n1 v 1 , . . . , β n β -1 n N v N ) → y = exp c 0 (v). We prove the previously mentioned properties under a certain assumption in Appendix D.4. One can also confirm that the Poincaré β-concatenation is the inverse function of the Poincaré β-split. Discussion about the concatenation. Ganea et al. (2018a) generalized a vector concatenation under the premise that the output must be followed by an FC layer, but such an assumption possibly limits its usage. Furthermore, it requires Möbius additions N -1 times sequentially due to the noncommutative and non-associative properties of the Möbius addition, which incurs a heavy computational cost and an unbalanced priority in each input gyrovector. Alternatively, our method with a pair of exponential and logarithmic maps has a lower computational cost regardless of N and treats every input fairly.

3.4. ARBITRARY DIMENSIONAL CONVOLUTIONAL LAYER

The activation of D-dimensional convolutional layers with kernel sizes of {K i } D i=1 is generally described as an affine transformation y k = a k , x -b k for each channel k, where x ∈ R nK is an input vector per pixel, and is a concatenation of K = i K i feature vectors contained in a receptive field of the kernel. This notation also includes a dilated operation. It is now natural to generalize the convolutional layers with Poincaré β-concatenation and a Poincaré FC layer. Poincaré convolutional layer. At each pixel in the given feature map, the gyrovectors {x s ∈ B n c } K s=1 contained in a receptive field of the kernel are concatenated into a single gyrovector x ∈ B nK c in the manner proposed in Section 3.3, which is then operated in the same way as a Poincaré FC layer.

3.5. ANALYSIS OF HYPERBOLIC ATTENTION MECHANISMS IN THE POINCARÉ BALL MODEL

As preparation for constructing hyperbolic attention mechanisms, it is necessary to theoretically consider the midpoint operation of multiple coordinates in a hyperbolic space. For the Poincaré ball model and Beltrami-Klein model, Ungar (2009) proposed the Möbius gyromidpoint and Einstein gyromidpoint built upon the framework of gyrovector spaces, respectively, which are represented in different coordinate systems but are geometrically the same, as shown in Appendix D.5. On the other hand, Law et al. (2019) proposed another type of hyperbolic centroid based on the minimization problem of the squared Lorentzian distance defined in the hyperboloid model. Based on the above situation, for the major concern of which formulation to utilize, we proved the following theorem. Theorem 1. (The equivalence of three hyperbolic midpoints) The Möbius gyromidpoint, Einstein gyromidpoint, and the centroid of the squared Lorentzian distance exactly match each other, which indicates they are the same midpoint operations projected on each manifold. The proofs are given in Appendix D.5 and D.6. Furthermore, based on this equivalence, we can characterize the Möbius gyromidpoint as a minimizer of the weighted sum of calibrated squared gyrometrics, which we proved in Appendix D.7. With Theorem 1, we can now exploit the Möbius gyromidpoint as a unified option to realize hyperbolic attention mechanisms. Moreover, we further generalized the Möbius gyromidpoint by extending the condition of non-negative weights to entire real values by regarding a negative weight as an additive inverse operation: The midpoint b ∈ B n c of Möbius gyrovectors {b i ∈ B n c } N i=1 with the real scalar weights {ν i ∈ R} N i=1 is given by b = N i=1 [b i , ν i ] c := 1 2 ⊗ c N i=1 ν i λ c bi b i N i=1 |ν i | λ c bi -1 , ( ) which is shown in Appendix D.8. Note that the sum of weights does not need to be normalized to one because any scalar scale is cancelled between the numerator and denominator. On the basis of this insight, in the following, we describe the construction of a multi-head attention as a specific example, aiming at a general approach that can be applied to other arbitrary attention schemes. Multi-head attention. Given a source sequence S ∈ R Ls×n of length L s and target sequence T ∈ R Lt×m of length L t , the module first projects the target onto query Q ∈ R Lt×hd and the source onto key K ∈ R Ls×hd and value V ∈ R Ls×hd with the corresponding FC layers. These are split into d-dimensional vectors of h heads, which is followed by a similarity function between Q i and K i producing a weight Π i = {softmax(d -1 2 q i t K i )} Lt t=1 for 1 ≤ i ≤ h. The weights are utilized to aggregate V i into a centroid, giving X i = Π i V i . Finally, the features in all heads are concatenated. Poincaré multi-head attention. Given the source and target as sequences of gyrovectors, they are projected with three Poincaré FC layers, followed by Poincaré β-splits to produce Q i = {q i t ∈ B d c } Lt t=1 , K i = {k i s ∈ B d c } Ls s=1 and V i = {v i s ∈ B d c } Ls s=1 . Applying a similarity function f c and activation g, each weight π i t,s = g(f c (q i t , k i s )) is obtained and the values are aggregated as follows: x i t = 1≤s≤Ls v i s , π i t,s c . Finally, the features in all heads are Poincaré β-concatenated. For the similarity function f c , there are mainly two choices to exploit. One is the inner product in the tangent space indicated by Micic & Chu (2018) , which is the naive generalization of the Euclidean version. Another choice is based on the distance of two points: f c (q, k) = -τ d c (q, k) -γ, where τ is an inverse temperature and γ is a bias parameter, which was proposed by Gulcehre et al. (2018) . As for the activation g, g(x) = exp (x) is the most basic option because it turns to be a softmax operation due to the property of gyromidpoint. Gulcehre et al. (2018) also suggested g(x) = σ(x). In light of the property of the generalized gyromidpoint, g as an identity function is also exploitable. 

4. EXPERIMENTS

In this section, we evaluate our methods in comparisons with HNNs and Euclidean counterparts. The implementation of hyperbolic architectures is based on the Geoopt (Kochurov et al., 2020) .

4.1. VERIFICATION OF THE MLR CLASSIFICATION CAPACITY

We first evaluated the performance of our unidirectional Poincaré MLR on the same conditioned experiment designed for the MLR of HNNs, that is, a sub-tree classification on the Poincaré ball model. In this task, the Poincaré embeddings of the WordNet noun hierarchy (Nickel & Kiela, 2017) are utilized as the data set, which contains 82,115 nodes and 743,241 hypernymy relations. We pre-trained the Poincaré embeddings of the same dimensions as the experimental settings in HNNs, i.e., two, three, five, and ten dimensions, using the open-source implementationfoot_1 to extract several sub-trees whose root nodes are certain abstract hypernymies, e.g., animal. For each sub-tree, MLR layers learn the binary classification to predict whether each given node is included. All nodes are divided into 80% training nodes and 20% testing nodes. We trained each model for 30 epochs using Riemannian Adam (Becigneul & Ganea, 2019) with a learning rate of 0.001 and a batch size of 16. The F1 scores for the test sets are shown in Table 1 . From the results, we can confirm the tendency of the hyperbolic MLRs to outperform the Euclidean version in all settings, which illustrates that MLR considering the hyperbolic geometry are better suited to the hyperbolic embeddings. In particular, our parameter-reduced approach obtains the same level of performance as a conventional hyperbolic MLR in a more stable training, as can be seen from the relatively narrower confidence intervals.

4.2. AMORTIZED CLUSTERING OF MIXTURE OF GAUSSIANS WITH SET TRANSFORMERS

For the evaluation of a Poincaré multi-head attention, we utilize Set Transformer, which we consider is a proper test case to eliminate the implicit influence of unessential operations, e.g., positional encoding. The task is an amortized clustering of a mixture of Gaussians (MoG). In each sample in a mini-batch, models take hundreds of two-dimensional points randomly generated by the same K-component MoG, and directly estimate all the parameters, i.e., the ground truth probabilities, means, and standard deviations, in a single forward step. We basically follow the model architectures and experimental settings of the official implementationfoot_2 , except that we employed the hyperbolic Gaussian distribution (Ovinnikov, 2019) as well as the Euclidean distribution aiming to verify the performance of the hyperbolic architectures both for the ordinary settings and for their desirable data distributions. When the hyperbolic models need to deal with Euclidean coordinates, the inputs or outputs are projected by an exponential map or logarithmic map, respectively, with a scalar parameter for scaling the values to the fixed-radius Poincaré ball B n 1 . Note that we omit ReLU activations for our models because the hyperbolic operations are inherently non-linear. We also remove normalization layers because there does not exist enough research on the normalizing criterion in the hyperbolic space and existing methods possibly reduce the representational power of gyrovectors by neutralizing their norms. For the hyperbolic attentions, based on a preliminary experiment, we choose to utilize the distance based similarity function and exponential activation. Table 2 : Negative log-likelihood on the test set. For each setting, we report the 95% confidence intervals for five trials. The numbers in brackets indicate the diverged trials, the final scores of which were higher than 10.0, and those trials are not accounted into the reported scores. The results are shown in Table 2 . For the Euclidean distribution, our models achieved almost the same performance as Set Transformers, while the training of those without Layer Normalization for the same conditioned comparison often failed under all settings. This result suggests the intrinsic normalization properties of our methods, which we attribute to the computation using vector norms. For the hyperbolic distribution, our models outperformed the Euclidean counterparts with an order of magnitude smaller confidence intervals, which indicates that our hyperbolic architectures are indeed suited to their assumed data distribution.

4.3. CONVOLUTIONAL SEQUENCE TO SEQUENCE MODELING

Finally, we experimented with the convolutional sequence-to-sequence modeling task for machine translation of WMT'17 English-German (Bojar et al., 2017) . Because the architecture is composed of convolutional layers and attention layers, the hyperbolic version of which has already been verified in Section 4.2, it would provide a comparison focusing on convolutional operations. It also has a practical aspect as a task of natural language processing, in which lexical entities are known to form latent hierarchical structures. We follow the open-source implementation of Fairseq (Ott et al., 2019) , where preprocessed training data contains 3.96M sentence pairs with 40K sub-word tokenization in each language. In our hyperbolic models, feature vectors are completely treated as Möbius gyrovectors because token embeddings can be learned directly on the Poincaré ball model. Note that the inputs for the sigmoid functions in Gated Linear Units are logarithmically mapped just like hyperbolic Gated Recurrent Units proposed by Ganea et al. (2018a) . We train various scaled-down models to verify the representational capacity of our hyperbolic architectures, with Riemannian Adam for 100K iterations. For more implementation details, please check Appendix E. The results are shown in Table 3 . Our model demonstrates the significant improvements compared to the usual Euclidean models in the fewer dimensions, which reflects the immense embedding capacity of hyperbolic spaces. On the other hand, there is no salient differences observed in higher dimensions, which implies that the Euclidean models with higher dimensions than a certain level can obtain a sufficient computational complexity through the optimization. This would fill the gap with the representational properties of hyperbolic spaces. It also implies that the proper construction of neural networks with the product space of multiple small hyperbolic spaces using our methods has the potential for the further improvements even in higher dimensional architectures.

5. CONCLUSION

We showed a novel generalization and construction scheme of the wide range of hyperbolic neural network architectures in the Poincaré ball model, including a parameter-reduced MLR, geodesicaware FC layers, convolutional layers, and attention mechanisms. These were achieved under a unified mathematical backbone based on the concepts of Riemannian geometry and the Möbius gyrovector space, which endow our hyperbolic architectures with theoretical consistency. Through the experiments, we verified the effectiveness of our approaches from diversified tasks and perspectives, such as an embedded sub-tree classification, amortized clustering of distributed points both on the Euclidean space and the Poincaré ball model, and neural machine translation. We hope that this study will pave the way for future research in the field of geometric deep learning.

A HYPERBOLIC GEOMETRY

In this section, we review the definition of the hyperbolic geometry models other than the Poincaré ball model and the relationships between their coordinates. Hyperboloid model. The n-dimensional hyperboloid model is a hypersurface in an (n + 1)dimensional Minkowski space R n+1 1 , which is equipped with an inner product x, y L = x g L y for ∀ x, y ∈ R n+1 1 , where g L = diag(-1, 1 n ). Given a constant negative curvature -c, the manifold of the hyperboloid model is defined by H n c = {x = (x 0 , . . . , x n ) ∈ R n+1 1 | c x, x L = -1, x 0 > 0}. Note that, in this standard (n + 1)-dimensional coordinate system, the metric tensor as a positive definite matrix for the n-dimensional hyperboloid manifold cannot be defined. Instead, when the hyperboloid model is represented in a specific n-dimensional coordinates, e.g., hyperbolic polar coordinates, then its metric tensor has the corresponding representation as the n-dimensional positive definite matrix. Isometric isomorphism with the Poincaré ball model. The bijection between an arbitrary point h = (z, k ) ∈ H n c and its unique corresponding point b ∈ B n c , depicted in Figure 1 , is given by the following: H n c → B n c : b = b (h) = k 1 + √ cz , B n c → H n c : h = h (b) = (z (b) , k (b)) = 1 √ c 1 + c b 2 1 -c b 2 , 2b 1 -c b 2 . ( ) Beltrami-Klein model. The n-dimensional Beltrami-Klein model of a constant negative curvature -c is defined by (K n c , ĝc ), where K n c = {x ∈ R n | c x 2 < 1} and ĝc x = (1 -c x 2 ) -1 I n + (1 - c x 2 ) -2 xx . Here, K n c is an open ball of radius 1/ √ c. Isometric isomorphism with the Poincaré ball model. The bijection between an arbitrary point n ∈ K n c and its unique corresponding point b ∈ B n c , depicted in Figure 1 , is given by the following: K n c → B n c : b = b (n) = n 1 + 1 -c n 2 , ( ) B n c → K n c : n = n (b) = 2b 1 + c b 2 . ( )

B MÖBIUS GYROVECTOR SPACE

In this section, we briefly introduce the concept of the Möbius gyrovector space, which is a specific type of gyrovector spaces. For a rigorous theoretical and detailed mathematical background of this system, please refer to Ungar (2005; 2009; 2001; 2012) . A gyrovector space is an algebraic structure that endows the points in a hyperbolic space with vector-like properties based on a special concept called a gyrogroup. This gyrogroup is similar to ordinary vector spaces that provides a Euclidean space with the well-known vector operations based on the notion of groups. As a particular example in physics, this helps to understand the mathematical structure of the Einstein's theory of special relativity where no possible velocity vectors including the sum of velocities in an arbitrary additive order can exceed the speed of light (Ungar, 2008; 2013) . Because hyperbolic geometry has several isometric models, a gyrovector space also has some variants where the Möbius gyrovector space is a variant for the Poincaré ball model. As an abstract mathematical system, a gyrovector space is constructed through the following steps: (1) Start from a set G. (2) With a certain binary operation ⊕, create a tuple called a groupoid, or magma (G, ⊕). (3) Based on five axioms, define a specific type of magma as a gyrogroup. These axioms include several important properties of gyrovector spaces, such as the left gyroassociative law and an operator called a gyrator gyr : G × G → Aut(G, ⊕), which generates an automorphism Aut(G, ⊕) gyr[x, y] : G → G given by z → gyr[x, y]z, called a gyration, from two arbitrary points x and y ∈ G. The notion of the gyrocommutative law and gyrogroup cooperation are given in this step. ( 4) Adding ten more axioms related to the statements about a real inner product space and a scalar multiplication ⊗, the gyrovector space (G, ⊕, ⊗) is thus defined. Some of the important properties of a gyrovector space are listed below. Here, x, y, z ∈ G. Gyroassociative laws. Although the binary operation ⊕ is not necessarily associative in general, it obeys the left gyroassociative law x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ gyr[x, y]z and right gyroassociative law (x ⊕ y) ⊕ z = x ⊕ (y ⊕ gyr[y, x]z). These equations also provide a general closed-form expression of the gyrations: gyr [x, y]z = (x ⊕ y) ⊕ (x ⊕ (y ⊕ z)). Cases in which gyrations become identity maps. If at least one element for gyr is 0 ∈ G, the gyrations become an identity map I: gyr[x, 0] = gyr[0, x] = I. With the loop properties of the gyrations given by gyr[x, y] = gyr[x ⊕ y, y] = gyr[x, y ⊕ x], many other cases can be also derived. Gyrocommutative law. Although a binary operation ⊕ is not necessarily commutative in general, if it obeys the equation x ⊕ y = gyr[x, y](y ⊕ x), the gyrogroup is called gyrocommutative. Gyrogroup cooperation. Regarding ⊕ as the primal binary addition, the second binary addition in G is defined as the gyrogroup cooperation , which is given by x y = x ⊕ gyr[x, y]y. This has duality symmetries with the first binary operation ⊕, such that x ⊕ y = x gyr[x, y]y. In addition, corresponding to the left cancellation law x ⊕ (x ⊕ y) = y inherent in ⊕, the gyrogroup cooperation induces two types of the right cancellation laws: (x ⊕ y) y = (x y) y = x. In this formalism, the Möbius gyrovector space is then defined as (B n c , ⊕ c , ⊗ c ), where B n c is as previously introduced in Section 2, and ⊕ c and ⊗ c are as shown in the following subsections.

B.1 MÖBIUS ADDITION

In the Möbius gyrovector space, the primary binary operation is denoted as the Möbius addition ⊕ c : B n c × B n c → B n c , which is a noncommutaive and nonassociative addition, given by the following: x ⊕ c y = 1 + 2c x, y + c y 2 x + 1 -c x 2 y 1 + 2c x, y + c 2 x 2 y 2 , x c y = x ⊕ c (-y).

B.2 MÖBIUS GYRATOR

The expression of gyrations in the Möbius gyrovector space can be expanded using the equation of the Möbius addition ⊕ c , which is described by Ungar (2009) as follows: gyr[x, y] : z → z -2c c x, z y 2 -y, z (1 + 2c x, y ) x + c y, z x 2 + x, z y 1 + 2c x, y + c 2 x 2 y 2 . ( ) By writing down all the special operators ⊕ c for the gyrovectors in B n c into the normal vector operations, the expression of the gyrations can be now seen as a general function for the any real vector z ∈ R n . Indeed, gyrations are extended to invertible linear maps of R n (Ungar, 2009) . The Möbius gyrator endows the Möbius gyrovector space with a gyrocommutative nature.

B.3 MÖBIUS COADDITION

The gyrogroup cooperation in the Möbius gyrovector space is called the Möbius coaddition, and is given by the following: x c y = x ⊕ c gyr[x, c y]y = 1 -c y 2 x + 1 -c x 2 y 1 -c 2 x 2 y 2 . With the gamma factor γ x = ( 1 -c x 2 ) -1 for x ∈ B n c , this is also described in the following manner: x c y = γ 2 x x + γ 2 y y γ 2 x + γ 2 y -1 . ( ) Note that the Möbius coaddition is not associative but is commutative.

B.4 MÖBIUS SCALAR MULTIPLICATION

The Möbius scalar multiplication for x ∈ B n c and r ∈ R is given by the following: r ⊗ c x = 1 √ c tanh -1 r tanh √ c x [x] = exp c 0 (r log c 0 (x)). In terms of the Riemannian geometry, the Möbius scalar multiplication adjusts the distance of x from the origin by the scalar multiplier r. The expressions of the logarithmic map log c x and distance in the Möbius gyrovector space are described in the following subsections.

C POINCARÉ BALL MODEL

Owing to the algebraic structure provided by the Möbius gyrovector space, many properties related to the geometry of the Poincaré ball model can be described in implementation-friendly closed-form expressions.

C.1 EXPONENTIAL AND LOGARITHMIC MAPS

The exponential map exp c x : T x B n c → B n c is described in (Ganea et al., 2018a , Lemma 2) as follows: exp c x (v) = x ⊕ c 1 √ c tanh √ cλ c x v 2 [v], ∀ x ∈ B n c , v ∈ T x B n c . ( ) The logarithmic map log c x = (exp c x ) -1 : B n c → T x B n c is also given by the following: log c x (y) = 2 √ cλ c x tanh -1 √ c c x ⊕ c y [ c x ⊕ c y], ∀ x, y ∈ B n c . C.2 DISTANCE

C.2.1 POINCARÉ DISTANCE BETWEEN TWO ARBITRARY POINTS

The distance function d c is originally and preliminary defined as a binary operation for indicating the distance between two arbitrary points x, y ∈ B n c . Based on the notion of the Möbius addition, the distance d c : B n c × B n c → R is succinctly described as follows: d c (x, y) = 2 √ c tanh -1 √ c c x ⊕ c y = log c x (y) c x . Despite the noncommutative aspect of the Möbius addition ⊕ c , this distance function in Equation 19 becomes commutative thanks to the commutative aspect of the Euclidean norm of the Möbius addition, which is expressed as follows: x ⊕ c y = x 2 + 2 x, y + y 2 1 + 2c x, y + c 2 x 2 y 2 , ∀ x, y ∈ B n c .

C.2.2 DISTANCE FROM A POINT TO POINCARÉ HYPERPLANE

In the Euclidean geometry, the generalized concept of two-dimensional plane to a higher dimensional space R n is a hyperplane containing an arbitrary point p ∈ R n and is the set of all straight lines orthogonal to an arbitrary orientation vector a ∈ R n . Because straight lines in Euclidean spaces are geodesics in terms of the Riemannian geometry, a hyperplane can be generalized as another Riemannian manifold M n such that the hyperplane contains an arbitrary point p ∈ M n and is the set of all geodesics orthogonal to an arbitrary orientation vector at p, namely, the tangent vector a ∈ T p M n . This concept in the Poincaré ball model has been rigorously defined in (Ganea et al., 2018a, Definition 3.1) for p ∈ B n c , a ∈ T p B n c as follows: Hc a,p = {x ∈ B n c | log c p (x), a c p = 0} = exp c p {a} ⊥ (21) = {x ∈ B n c | c p ⊕ c x, a = 0}. Note that {a} ⊥ is the set of all tangent vectors at p and orthogonal to a.  d c (x, w) = 1 √ c sinh -1 2 √ c| c p ⊕ c x, a | (1 -c c p ⊕ c x 2 ) a . ( ) C.3 PARALLEL TRANSPORT The concept of a parallel transport is traditionally derived from differential geometry. In the hyperbolic geometry, the gyrovector space provides the algebra to formulate the parallel transport of a gyrovector (Ungar, 2012) . When a gyrovector c x ⊕ c w ∈ B n c rooted at a point x ∈ B n c is transported parallel to another gyrovector c y ⊕ c z ∈ B n c rooted at a point y ∈ B n c along a geodesic connecting x and y, the equation below is satisfied: c y ⊕ c z = gyr[y, c x] ( c x ⊕ c w) . ( ) Because the exponential map in the Poincaré ball model is a bijective function, the parallel transported gyrovectors w and z can be regarded as the exponential mapped tangent vectors v ∈ T x B n c rooted at x and u ∈ T y B n c rooted at y, respectively, that is, c y ⊕ c exp c y (u) = gyr[y, c x] ( c x ⊕ c exp c x (v)) . ( ) With Equations 17 and 18 and the properties of the Möbius gyration described in Appendix B, a succinct expression of the tangent parallel transport P c x→y : T x B n c → T y B n c can be obtained as follows: P c x→y (v) := u = λ c x λ c y gyr[y, c x]v. Note that, in a special case in which x = 0 and v ∈ T 0 B n c , this equation is simplified as follows: P c 0→y (v) = λ c 0 λ c y v = 1 -c y 2 v. One can confirm that Equation 26 deserves to be called a parallel transport in terms of the differential or Riemannian geometry by checking the covariant derivative associated with the Levi-Civita connection of P c x→y along a tangent vector field γ(t) on a smooth curve γ(t) from x to y vanishes to 0.

D.1 FINAL DEFORMATION OF THE PROPOSED UNIDIRECTIONAL POINCARÉ MLR

Proof. First, we clarify the relation between the Poincaré hyperplane Hc a,p , described in Appendix C.2.2, and the variants Hc a,r introduced in Section 3.1: Hc a,r = Hc a,qa k ,r k . ( ) We then start the derivation of Equation 6 from the variables a k and q a k ,r k described in Section 3.1. Following Equation 28 and the concept of the distance from a point to a Poincaré hyperplane described in Equation 23, the generalized MLR score function v k in Equation 3 can be written as follows: v k (x) = λ c qa k ,r k a k √ c sinh -1 2 √ c c q a k ,r k ⊕ c x, a k (1 -c c q a k ,r k ⊕ c x 2 ) a k , ∀ x ∈ B n c . ( ) With Equation 20, we obtain c q a k ,r k ⊕ c x 2 = x 2 -2 x, q a k ,r k + q a k ,r k 2 1 -2c x, q a k ,r k + c 2 x 2 q a k ,r k 2 . ( ) Therefore, we can expand the term inside the sinh -1 function in Equation 29 in the following manner: 2 √ c c q a k ,r k ⊕ c x, a k (1 -c c q a k ,r k ⊕ c x 2 ) a k = 2 √ c a k -1 -2c x, q a k ,r k + c x 2 q a k ,r k , a k + 1 -c q a k ,r k 2 x, a k 1 -2c x, q a k ,r k + c 2 x 2 q a k ,r k 2 -c x 2 -2 x, q a k ,r k + q a k ,r k 2 (31) = 2 √ c -1 -2c x, q a k ,r k + c x 2 q a k ,r k , [a k ] + 1 -c q a k ,r k 2 x, [a k ] 1 -c q a k ,r k 2 -c x 2 + c 2 x 2 q a k ,r k 2 (32) = 2 1 -c x 2 - √ c 1 -2c x, q a k ,r k + c x 2 q a k ,r k , [a k ] 1 -c q a k ,r k 2 + √ c x, [a k ] . ( ) With Equations 5 and 17, the term in the outer brackets in Equation 33 can be further expanded into the form using r k and z k described in Section 3.1: - √ c 1 -2c x, q a k ,r k + c x 2 q a k ,r k , [a k ] 1 -c q a k ,r k 2 + √ c x, [a k ] = - 1 -2 √ c tanh ( √ c r k ) x, [z k ] + c x 2 tanh ( √ c r k ) 1 -tanh 2 ( √ c r k ) + √ c x, [z k ] (34) = -1 + c x 2 sinh √ c r k cosh √ c r k + √ c x, [z k ] 1 + 2 sinh 2 √ c r k (35) = - 1 + c x 2 2 sinh 2 √ c r k + √ c x, [z k ] cosh 2 √ c r k . In addition, we can also expand the term outside the sinh -1 function in Equation 29using Equations 5 and 17 as follows: λ c qa k ,r k a k √ c = 2 a k √ c (1 -c q a k ,r k 2 ) = 2 sech 2 ( √ c r k ) z k √ c 1 -tanh 2 ( √ c r k ) = 2 z k √ c . ( ) Combining Equations 29, 33, 36, and 37, we finally conclude the proof through the following: v k (x) = 2 z k √ c sinh -1 2 √ c x, [z k ] 1 -c x 2 cosh 2 √ c r k - 1 + c x 2 1 -c x 2 sinh 2 √ c r k (38) = 2 z k √ c sinh -1 λ c x √ c x, [z k ] cosh 2 √ c r k -(λ c x -1) sinh 2 √ c r k .

D.2 CONVERGENCE PROOF OF UNIDIRECTIONAL POINCARÉ MLR TO EUCLIDEAN MLR

Proof. For the intended proof, we first introduce the following proposition: Proposition 1. For x = 0, sinh(x) over x converges to 1 in the limit x → 0: lim x→0 sinh(x) x = 1. ( ) Proof. The result can be obtained based on the definition of the differentiation of a scalar function: lim x→0 sinh(x) x = lim x→0 e x -e -x 2x = 1 2 lim x→0 e x -1 x + e -x -1 -x (41) = lim x→0 e x -e 0 x = de x dx x=0 = 1. From Proposition 1, we derive the following two propositions. Proposition 2. For t ∈ R, x = 0, sinh(tx) over x converges to t in the limit x → 0: lim x→0 sinh(tx) x = t. Proof. We divide this proof into two cases: lim x→0 sinh(tx) x =    0 = t (t = 0) t lim tx→0 sinh(tx) tx = t (t = 0, Proposition 1) . ( ) Proposition 3. For t ∈ R, x = 0, sinh -1 (tx) over x converges to t in the limit x → 0: lim x→0 sinh -1 (tx) x = t. Proof. We can directly utilize Proposition 1 as follows: lim x→0 sinh -1 (tx) x = lim s→0 ts sinh(s) (s = sinh -1 (tx)) (46) = t lim s→0 sinh(s) s -1 = t (Proposition 1). ( ) With Propositions 2 and 3, we can now take the limit of Equation 6as follows: lim c→0 v k (x) = lim c→0 2 z k √ c sinh -1 2 √ c x, [z k ] 1 -c x 2 cosh 2 √ c r k - 1 + c x 2 1 -c x 2 sinh 2 √ c r k (48) = lim c→0 2 z k √ c sinh -1 √ c 2 x, [z k ] 1 -c x 2 cosh 2 √ c r k - 1 + c x 2 1 -x 2 sinh (2 √ c r k ) √ c (49) = 2 z k (2 x, [z k ] -2r k ) = 4 ( x, z k -r k z k ) . ( ) Moreover, with Equation 5, we can confirm that z k matches a k in the limit c → 0: lim c→0 a k = lim c→0 sech 2 √ c r k z k = lim c→0 1 cosh 2 ( √ c r k ) z k = z k . ( ) Combining it with Equations 2 and 50, we finally conclude the proof as follows: lim c→0 v k (x) = 4 ( x, a k -r k a k ) = 4 ( a k , x -b k ) , where b k := r k a k . ( ) Here, the factor 4 is derived from the squared conformal factor (λ 0 x ) 2 degenerating into a constant value. This corresponds to the fact that the Poincaré ball model B n c converges to the Euclidean space R n in the limit c → 0 except for the same multiplier lim c→0 (λ c x ) 2 = 4 owing to its metric tensor.

D.3 PROOF OF THE PROPERTIES OF OUTPUT COORDINATES OF POINCARÉ FC LAYER

Proof. To check the properties of the Poincaré FC layer described in Section 3.2, we first clarify the Poincaré hyperplane containing the origin and orthogonal to the k-th axis in B m c . The k-th axis is a geodesic passing through the origin and any point on it except the origin has a non-zero element in only the k-th coordinates. Therefore, an arbitrary point x ∈ B m c along the k-th axis can be written as follows: x = re (k) , where e (k) = (δ ik ) m i=1 , r ∈ - 1 √ c , 1 √ c ⊂ R, which is as intuitive as in a Euclidean space. Specifically, r = 0 represents the origin. We can then easily describe the intended Poincaré hyperplane as follows: D.4 PROOF OF THE PROPERTIES OF POINCARÉ β-SPLIT AND β-CONCATENATION In this section, we prove the properties of the Poincaré β-split and the Poincaré β-concatenation described in Section 3.3. The Poincaré ball model is different from Euclidean neural networks, on the simple calculation of the expected value and the variance of a particular value related to a feature vector or weight matrix owing to the linearity in their operations. In the Poincaré ball model, calculating such values without any postulate for the probabilistic distribution that the feature gyrovectors or tangent vectors follow is difficult owing to the nonlinear transformations in the exponential and logarithmic maps. Thus, we first make the following naive assumption: Assumption 1. Each coordinate of an n-dimensional tangent vector in T 0 B n c follows a normal distribution centered at zero with a certain variance σ 2 n c . The reasons why we assume the distribution on the tangent space rather than on the Poincaré ball model itself are as follows: 1. It is improper to assume a continuous and smooth distribution onto the space with an upperbounded radius because there must be no probability density on or outside the boundary. The rough idea of discontinuing such probabilities outside the domain of the Poincaré ball and discretely taking only the inside into account seems to lack rationality. 2. One simple way to avoid the above issue is to apply a uniform distribution from zero to the ball radius based on the norm of the gyrovector. However, there is no guarantee that such constancy in the distribution can be realized on a complexly curved geometric structure of the Poincaré ball model. 3. Conversely, a tangent space is a linear space that is attached to the manifold and can be treated as an ordinary vector space. 4. The Poincaré ball model is conformal to the Euclidean space, i.e., preserving the same angles, and at the origin, the gyrovectors having the same norms are projected onto the tangent vectors which also have the same norms with their angles unchanged. 5. In Euclidean neural networks, the normal distribution is one of the most popularly considered priors. The multivariate normal distribution is occasionally approximated as an independent and identically distributed distribution for easier calculation. Because the Poincaré β-split and the Poincaré β-concatenation are inverse functions to each other, it is sufficient to prove the properties of either one of these operations. Here, we show a proof for the Poincaré β-concatenation. Recalling that β n = B( n 2 , 1 2 ) and considering the following: Poincaré β-concatenation. The input gyrovectors {x i ∈ B ni c } N i=1 are first scaled by certain coefficients and concatenated in the tangent space, and then projected back to the Poincaré ball as follows: x i → v i = log c 0 (x i ) ∈ T 0 B ni c , v := β n β n1 v 1 , . . . , β n β n N v N → y = exp c 0 (v) ∈ B n c . ( ) Proof. At first, we consider the expected value of the norm of each tangent vector v i , which is the target of the Poincaré β-concatenation. Because the value t i := c vi 2 σ 2 n i follows a χ 2 distribution based on Assumption 1, the expected value of v i can be obtained as follows: E[ v i ] = 1 2 n i 2 Γ ni 2 ∞ 0 v i e -t i 2 t n i 2 -1 i dt i (64) = σ ni 2 n i 2 Γ ni 2 √ c ∞ 0 e -t i 2 t n i -1 2 i dt i (65) = 2 n i +1 2 Γ ni+1 2 2 n i 2 Γ ni 2 σ ni √ c (66) = 2π c σ ni B ni 2 , 1 2 (67) Therefore, in the general cases in which n i 1, we can obtain the following approximation: Γ ni 2 2j-2 Γ ni+1 2 2j-1 Γ j + n i -1 2 = Γ j + n i -1 2 i (n i + 1) n i 2   2j (86) = 2 -j n j i (2e) j n j(ni-1) i (n i + 1) jni (87) = e j n i n i + 1 nij (88) e j e -j (89) = 1. Note that, for the calculation between Equations 84 and 85, we utilize Stirling's approximation, i.e.,

Γ(z)

2π z z e z . In addition, we utilize the definition of Napier's constant for the approximation between Equations 88 and 89, i.e., lim x→∞ (1 + 1 x ) x = e. Combining Equations 79 and 90, the expected value of x i can be approximately expressed by the following: E[ x i ] 1 √ c ∞ j=1 2 2j 2 2j -1 B 2j (2j)! √ 2πC 2j-1 (91) = 1 √ c tanh √ 2πC (92) = 1 √ c tanh √ c E[ v i ] . In the same way, the expected value of the Poincaré β-concatenated gyrovector x is obtained by the following: E[ x ] 1 √ c tanh √ 2πC (94) = 1 √ c tanh √ c E[ v ] , which concludes the proof. Unidirectional Poincaré MLR. When the dimensions of the input gyrovector is n, each element of the weight parameter Z is initialized by a normal distribution centered at zero with a standard deviation n -1 2 . The bias parameter r is initialized as a zero vector. Poincaré FC layer. When the dimensions of the input gyrovector and the output gyrovector are n and m, respectively, each element of the weight parameter Z is initialized by a normal distribution centered at zero with a standard deviation (2nm) -1 2 . The bias parameter r is initialized as a zero vector. Poincaré convolutional layer. When the dimensions of the input gyrovector and the output gyrovector are n and m, respectively, and the total kernel size is K, each element of the weight parameter Z is initialized by a normal distribution centered at zero with a standard deviation (2nKm) -1 2 . The bias parameter r is initialized as a zero vector. Embedding on the Poincaré ball model. As mentioned by Ganea et al. (2018a) , we confirmed the tendency of the parameters in the Poincaré ball model to adjust their angles at the first phase of the training before increasing their norms. In addition, we consider that, due to the exponentially growing distance metric of the hyperbolic space, the farther a gyrovector parameter is placed from the origin, the more costly it moves such a point to another point through the optimization. Therefore, the embedding parameters on the Poincaré ball model should be initialized with a particular small gain E , given as a hyperparameter, aiming to accelerate such an adjustment and make the later optimization smooth. We set the value E to be 10 -2 in the experiment in Section 4.3.

E.2 HYPERPARAMETERS OF THE EXPERIMENT IN SECTION 4.2

Optimization. We used the Riemannian Adam optimizer with β 1 = 0.9, β 2 = 0.999 and = 10 -8 for both of the Euclidean and our hyperbolic architectures. The learning rate η was set to 10 -3 .

E.3 HYPERPARAMETERS OF THE EXPERIMENT IN SECTION 4.3

Model architectures. Let D be the dimension of the source and target token embeddings. Each model for the experiment in Section 4.3 has the encoder and decoder, both of which are composed of five convolutional layers with a kernel size of three and a channel size of D, five convolutional layers with a kernel size of three and a channel size of 2D, and two convolutional layers with a kernel size of one and a channel size of 4D. The output feature maps of the last convolutional layer in the encoder are projected into D-dimensional feature maps. They are utilized as the key for the encoder-decoder attentions. Likewise, the output feature maps of the last convolutional layer in the decoder are projected into D-dimensional feature maps for the final token classification. Training. In each iteration of the training phase, we fed each model a mini-batch containing approximately 10,000 tokens at most. In this setting, the batch size, or the number of the sentence pairs in a mini-batch, dynamically changes. As a loss function, we utilized the cross entropy function with a label smoothing of 0.1. Optimization. We used the Riemannian Adam optimizer with β 1 = 0.9, β 2 = 0.98 and = 10 -9 for both of the Euclidean and our hyperbolic architectures. For the scheduling of the learning rate η, we linearly increased the learning rate for the first 4000 iterations as a warm-up, and utilized the inverse square root decay with respect to the number of iterations t thereafter as η = (Dt) -1 2 .



The code is available at https://github.com/mil-tokyo/hyperbolic_nn_plusplus. https://github.com/facebookresearch/poincare-embeddings https://github.com/juho-lee/set_transformer http://www.editage.com



Figure 1: Geometric relationship between H n c , B n c and K n c

Figure 2: Whichever pair of a and p is chosen, it determines the same discriminative hyperplane.Considering one bias point q a,r per one discriminative hyperplane solves this over-parameterization.

Figure 3: Comparison of FC layers in input spaces B n c . The values at a certain dimension of output spaces are illustrated as contour plots. Black arrows depict the orientation parameters, and they are fixed for the comparison. Their orthogonal curves show discriminative hyperplanes where the values are zeros. As a bias parameter b or r k changes, the outline of the contour landscape in (a) remains unchanged, whereas in (b) the focused regions are dynamically squeezed according to the geodesics.

Ganea et al. (2018a)  have proven the closed-form description of the distance from a point x ∈ B n c to an arbitrary Poincaré hyperplane Hc a,p by considering the minimum distance between x and any point in Hc a,p : d c x, Hc a,p := inf w∈ Hc a,p

Let the point b ∈ B n c be a projection of the weighted centroid h ∈ H n c . With Equation 101 and the notation of Equation10, we obtain the following straightforward expression:z( b) -h i , h( b) . (109)Considering the points{b i ∈ B n c } N i=1 , which correspond to {h i } N i=1, respectively, we can transform Equation 109 into an expression with only the coordinates in the Poincaré ball model:i -b 2 (1 -c b i 2 )(1 -c b 2 ) GENERALIZATION OF THE MÖBIUS GYROMIDPOINTAs mentioned in Section 3.5, we extend the condition of the weights of the Möbius gyromidpoint to all real values {ν i ∈ R} N i=1 by regarding a negative weight as an additive inverse operation, that is, regarding any pair (ν i , b i ) as (|ν i |, sign(ν i )b i ): N i=1 |ν i | λ c sign(νi)bi sign(ν i )b

Test F1 scores for four sub-trees of the WordNet noun hierarchy. The first column indicates the number of nodes in each sub-tree for the training and test times. For each setting, we report the 95% confidence intervals for three different trials. Note that the number of parameters of the Euclidean MLR and our approach is D + 1, whereas for the MLR layer of HNNs, it is 2D.

BLEU-4 scores(Papineni et al., 2002) on the test sets newstest2013. The target sentences were decoded using beam search with a beam size of five. D indicates the dimensions of token embeddings and the final MLR layer.

ACKNOWLEDGMENTS

We would like to thank Naoyuki Gunji and Yuki Kawana from the University of Tokyo for their helpful discussions and constructive advice. We would also like to show our gratitude to Dr. Lin Gu from RIKEN AIP, Kenzo Lobos-Tsunekawa from the University of Tokyo, and Editage 4 for proofreading the manuscript for English language. This work was partially supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, JST CREST Grant Number JPMJCR2015, JSPS KAKENHI Grant Number JP19H01115, and Basic Research Grant (Super AI) of Institute for AI and Beyond of the University of Tokyo.

annex

Definition 1. (Poincaré hyperplane containing the origin and orthogonal to the k-th axis)which is also intuitively obtained.With Definition 1, the preparation for constructing y in Equation 7 is complete.Derivation of y. Let x ∈ B n c and y = (y 1 , y 2 , . . . , y m ) ∈ B m c be the input and output of the Poincaré FC layer, respectively. Below, we start the proof with the score functions v k (x) for ∀ k = {1, 2, . . . , m} already obtained in the same way as in Equation 6.To endow y the properties described in Section 3.2, i.e., the signed distance from y to each Poincaré hyperplane containing the origin and orthogonal to the k-th axis is equal to v k (x), we generate a simultaneous equation for ∀ k as follows:With Equations 54 and 28 and the notion of the distance from a point to a Poincaré hyperplane described in Equation 23, these equations are expanded as follows:Therefore, we obtain the following notation of the coordinates:When considering the Euclidean norm of y using Equation 57, the equation for y can be derived as follows:This can be succinctly rewritten asBy solving this quadratic equation, the closed form of y is obtained through the following:Substituting Equations 59 and 60 for Equation 57 leads to Equation 7 in the notation of the coordinates:Confirmation of the existence of y. Finally, we conclude the proof by checking that y is always within the domain of the Poincaré ballTherefore, when the norm of each input tangent vector v i is kept the same by the former part of neural networks before applying this operation, the standard deviation σ ni must be expressed as follows:In addition, using Equation 63, the squared norm of the Poincaré β-concatenated tangent vector v is obtained as follows:This leads the value t, where σ n = Cβ n , which is expressed as follows:Here, t also follows a χ 2 distribution, and the expected value of the norm of v is obtained as follows:which is the same as the norms of the input tangent vectors. This indicates that each coordinate of v follows a normal distribution centered at zero with a variance σ 2 n c , satisfying the Assumption 1. Based on the results above, the expected value of the norm of each input gyrovector x i is expressed by the following:Note that, for the calculation between Equations 75 and 76, we utilize the Taylor series expansion of tanh for a real value. Furthermore, considering the Laurent series expansion at infinity, we can obtain the following expressions:

D.5 THE MÖBIUS GYROMIDPOINT AND THE EINSTEIN GYROMIDPOINT

Einstein gyromidpoint. In the Beltrami-Klein model, the midpoint nand the non-negative scalar weights {ν i ∈ R + } N i=1 is given as follows:, whereThis operation is called the Einstein gyromidpoint (Ungar, 2009) .Based on the above, we prove the equivalence of the Möbius gyromidpoint and Einstein gyromidpoint.Proof. Let the points, respectively, i.e., b i is a projection of n i to the Poincaré ball model using Equation 11. From Equations 12 and 96, we obtain the following:Substituting Equations 12 and 97 for Equation 96 leads to the representation of the Einstein midpoint using the coordinates in the Poincaré ball model:Therefore, the point b ∈ B n c , which is a projection of n to the Poincaré ball model using Equation 11, is expressed in the following manner:This concludes the proof.

D.6 MÖBIUS GYROMIDPOINT AND CENTROID OF SQUARED LORENTZIAN DISTANCE

Weighted centroid in the hyperboloid model (Law et al., 2019) . With a Lorentzian normand the non-negative scalar weights {ν i ∈ R + } N i=1 is given as follows:This is based on the minimization problem of the weighted sum of squared Lorentzian distances expressed as follows:In the following, we prove the equivalence of the Möbius gyromidpoint and the weighted centroid in the hyperboloid model.Proof. Expanding Equation 100 with the coordinates, we obtain the following:The point b ∈ B n c , which is a projection of h to the Poincaré ball model using Equation 9, is expressed in the following manner:Dividing both the numerator and denominator by i ν i z i , this can be rewritten as follows:Next, considering the points {b i ∈ B n c } N i=1 , which also correspond to {h i } N i=1 , respectively, we can transform the expression of b into an expression with only the coordinates in the Poincaré ball model:This concludes the proof.

D.7 MÖBIUS GYROMIDPOINT AS A SOLUTION OF THE MINIMIZATION PROBLEM

The discovery of the equivalence between the weighted centroid in the hyperboloid model and the Möbius gyromidpoint enables us to discuss what the Möbius gyromidpoint is a minimizer of. In the following, we prove that the Möbius gyromidpoint can be regarded as a minimizer of the weighted sum of calibrated squared gyrometrics. Theorem 2. The Möbius gyromidpoint is a solution of the minimization problem of the weighted sum of calibrated squared gyrometrics, which is expressed as follows: (Ungar, 2009) . In addition, each λ c c b⊕cb i is a conformal factor of the metric tensor of the Poincaré ball model for such a gyrovector. Therefore, the minimization objective in Equation 106 can be interpreted as the weighted sum of squared gyrometrics, each of which is calibrated by a scaling factor at the respective point.

