LAPLACIAN EIGENSPACES, HOROCYCLES AND NEURON MODELS ON HYPERBOLIC SPACES

Abstract

We use the hyperbolic Poisson kernel to construct the horocycle neuron model on hyperbolic spaces, which is a spectral generalization of the classical neuron model. We prove a universal approximation theorem for horocycle neurons. As a corollary, we obtain a state-of-the-art result on the expressivity of f^1_{a,p}, the function used in the hyperbolic multiple linear regression. Our experiments achieve state-of-the-art results on the Poincaré-embedding subtree classification task and on the classification accuracy of two-dimensional visualizations of images.

1. INTRODUCTION

Conventional deep network techniques attempt to use architectures based on compositions of simple functions to learn representations of Euclidean data (LeCun et al., 2015). They have achieved remarkable successes in a wide range of applications (Hinton et al., 2012; He et al., 2016). Geometric deep learning, a niche field that has caught the attention of many authors, attempts to generalize conventional learning techniques to non-Euclidean spaces (Bronstein et al., 2017; Monti et al., 2017). There has been growing interest in using hyperbolic spaces in machine learning tasks because they are well-suited for tree-like data representation (Ontrup & Ritter, 2005; Alanis-Lobato et al., 2016; Nickel & Kiela, 2017; Chamberlain et al., 2018; Nickel & Kiela, 2018; Sala et al., 2018; Ganea et al., 2018b; Tifrea et al., 2019; Chami et al., 2019; Liu et al., 2019; Balazevic et al., 2019; Yu & Sa, 2019; Gulcehre et al., 2019; Law et al., 2019). Many authors have introduced hyperbolic analogs of classical learning tools (Ganea et al., 2018a; Cho et al., 2019; Nagano et al., 2019; Grattarola et al., 2019; Mathieu et al., 2019; Ovinnikov, 2020; Khrulkov et al., 2020; Shimizu et al., 2020).

Spectral methods are successful in machine learning, from nonlinear dimensionality reduction (Belkin & Partha, 2002) to clustering (Shi & Malik, 2000; Ng et al., 2002), hashing (Weiss et al., 2009), graph CNNs (Bruna et al., 2014), spherical CNNs (Cohen et al., 2018), and inference networks (Pfau et al., 2019). Spectral methods have been applied to learning tasks on spheres (Cohen et al., 2018) and graphs (Bruna et al., 2014), but not yet on hyperbolic spaces.

This paper studies a spectral generalization of the FC (affine) layer on hyperbolic spaces. Before presenting this generalization, we introduce some notation. Let (•, •)_E be the Euclidean inner product, | • | the Euclidean norm, and ρ an activation function.
The Poincaré ball model of the hyperbolic space H^n (n ≥ 2) is the manifold {x ∈ R^n : |x| < 1} equipped with the Riemannian metric ds²_{H^n} = Σ_{i=1}^n 4(1−|x|²)^{−2} dx_i². The boundary of H^n under its canonical embedding in R^n is the unit sphere S^{n−1}. The classical neuron y = ρ((x, w)_E + b) has input x ∈ R^n and output y ∈ R, with trainable parameters w ∈ R^n, b ∈ R. An affine layer R^n → R^m is a concatenation of m neurons. An alternative representation of the neuron x ↦ ρ((x, w)_E + b) is given by

x ∈ R^n ↦ ρ(λ(x, ω)_E + b), ω ∈ S^{n−1}, λ, b ∈ R. (1)

This neuron is constant over any hyperplane that is perpendicular to a fixed direction ω. In H^n, a horocycle is an (n−1)-dimensional sphere (with one point deleted) that is tangential to S^{n−1}. Horocycles are hyperbolic counterparts of hyperplanes (Bonola, 2012). Horocyclic waves

⟨x, ω⟩_H := (1/2) log((1−|x|²)/|x−ω|²)

are constant over any horocycle that is tangential to S^{n−1} at ω. Therefore,

x ∈ H^n ↦ ρ(λ⟨x, ω⟩_H + b), ω ∈ S^{n−1}, λ, b ∈ R (2)

generalizes the classical neuron model (1), and a concatenation of finitely many (2) generalizes the FC (affine) layer. We call (2) a horocycle neuron. Figure 1 (middle) is an example on H². The neuron models in (1, 2) are related to spectral theory because (•, ω)_E (respectively ⟨•, ω⟩_H) are building blocks of the Euclidean (respectively hyperbolic) Laplacian eigenspaces. Moreover, many L² spaces have a basis given by Laplacian eigenfunctions (Einsiedler & Ward, 2017). On one side, all Euclidean (respectively hyperbolic) eigenfunctions are some kind of "superposition" of (•, ω)_E (respectively ⟨•, ω⟩_H). On the other side, neural networks based on (1) (respectively (2)) represent functions that are another kind of "superposition" of (•, ω)_E (respectively ⟨•, ω⟩_H). This heuristically explains why the universal approximation property is likely to hold for networks constructed from (1) and (2).
By using the Hahn-Banach theorem, an injectivity theorem of Helgason, and an integral formula, we prove that finite sums of horocycle neurons (2) are universal approximators (Theorem 2). Let p ∈ H^n, let T_p(H^n) be the tangent space of H^n at p, let a ∈ T_p(H^n), and let ⊕ be the Möbius addition (Ungar, 2008). We remind the reader that the functions

f^1_{a,p}(x) = (2|a|/(1−|p|²)) sinh^{−1}( 2(−p ⊕ x, a)_E / ((1−|−p ⊕ x|²)|a|) ) (3)

are building blocks of many hyperbolic learning tools (Ganea et al., 2018a; Mathieu et al., 2019; Shimizu et al., 2020). Figure 1 illustrates examples of the different neuron models (1, 2, 3) on H². In Lemma 1, we shall present a close relationship between (2) and (3). Using this relationship and Theorem 2, we obtain a novel result on the expressivity of f^1_{a,p} (Corollary 1). This article contributes to hyperbolic learning. We are the first to apply spectral methods, such as the horocycle, to hyperbolic deep learning. We prove results on the expressivity of horocycle neurons (2) and of f^1_{a,p} (3). With horocycle neurons, we obtain state-of-the-art results on the Poincaré-embedding subtree classification task and on the classification accuracy of the 2-D visualization of images in the experiments.
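As a concrete illustration, the horocycle neuron (2) is easy to evaluate numerically. The following NumPy sketch is illustrative only (it is not the paper's implementation, and it omits the ε-stabilization discussed later in Section 4.2); the function names are ours. It checks that ⟨x, ω⟩_H vanishes at the origin and equals λ/2 at the point tanh(λ/2)·ω:

```python
import numpy as np

def horocycle_inner(x, omega):
    """<x, omega>_H = 0.5 * log((1 - |x|^2) / |x - omega|^2),
    for x in the Poincare ball H^n and omega on the unit sphere S^{n-1}."""
    return 0.5 * np.log((1 - np.sum(x**2)) / np.sum((x - omega)**2))

def horocycle_neuron(x, omega, lam, b, rho=np.tanh):
    """The horocycle neuron (2): rho(lam * <x, omega>_H + b)."""
    return rho(lam * horocycle_inner(x, omega) + b)

omega = np.array([1.0, 0.0])
# At the origin, (1 - 0)/|0 - omega|^2 = 1, so <o, omega>_H = 0.
assert abs(horocycle_inner(np.zeros(2), omega)) < 1e-12
# At x = tanh(lam/2) * omega, <x, omega>_H = lam/2 (relation (6) below).
assert np.isclose(horocycle_inner(np.tanh(0.35) * omega, omega), 0.35)
```

The second assertion follows from (1 − t²)/(1 − t)² = (1 + t)/(1 − t) = e^λ with t = tanh(λ/2).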

2. RELATED WORK

Universal approximation

There is a vast literature on universal approximation (Cybenko, 1989; Hornik et al., 1989; Funahashi, 1989; Leshno et al., 1993). Cybenko (1989)'s existential approach uses the Hahn-Banach theorem and the Fourier transform of Radon measures. To prove Theorem 2, we also use the Hahn-Banach theorem, together with an integral formula (7) and an injectivity theorem of Helgason (Theorem 1). Generalizing integral formulas and injectivity theorems is easier than generalizing the Fourier transform of Radon measures on most non-Euclidean spaces. Carroll & Dickinson (1989) use the inverse Radon transform to prove universal approximation theorems. This method relates to ours, as injectivity theorems are akin to inverse Radon transforms. However, using the injectivity theorem is an existential approach, while using the inverse Radon transform is a constructive one.

Spectral methods
Spectral methods in Bronstein et al. (2017); Bruna et al. (2014); Cohen et al. (2018) use a basis of L²(X) given by eigenfunctions, where X is a finite graph or the sphere. Because L²(H^n) has no basis of eigenfunctions, our approach is different from theirs.

Hyperbolic deep learning
One part of hyperbolic learning concerns embedding data into the hyperbolic space (Nickel & Kiela, 2017; Sala et al., 2018). Another part concerns learning architectures with hyperbolic data as the input (Ganea et al., 2018a; Cho et al., 2019). Ganea et al. (2018a) propose two ways to generalize the affine layer to hyperbolic spaces: one replaces the linear and bias parts of an affine map with (25, 26) of their paper; the other uses a concatenation of f^1_{a,p} in their hyperbolic multiple linear regression (MLR). The latter is more relevant to ours. A level set of f^1_{a,p} is a hypercycle, whose points have the same distance to a chosen geodesic hypersurface, while a level set of a horocycle neuron is a horocycle, whose points have the same "spectral" distance to an ideal point at infinity.
Ganea et al. (2018a) and Cho et al. (2019) take geodesics as decision hyperplanes, while we (initially) take horocycles. We shall construct the horocycle multiple linear regression (MLR), whose decision hypersurfaces are geodesics. The geodesic decision hyperplanes of Ganea et al. (2018a); Cho et al. (2019) and the geodesic decision hypersurfaces here arise from different methods. Khrulkov et al. (2020) investigate hyperbolic image embedding, where the prototypes (or models) of each class are center-based. We study a different setting, and we shall call our prototypes end-based.

3. HYPERBOLIC SPACES

This section reviews facts from hyperbolic geometry that are used in the proof of Theorem 2. For the reader who is not interested in the proof, (4) is enough for the implementation.

Hyperbolic metric

We use the Poincaré model. The hyperbolic space H^n (n ≥ 2) is the manifold {x ∈ R^n : |x| < 1} equipped with the Riemannian metric ds² = Σ_{i=1}^n 4(1−|x|²)^{−2} dx_i². Let o be the origin of H^n. The distance function d_{H^n} satisfies d_{H^n}(o, x) = 2 arctanh |x|.

Geodesics, horocycles and corresponding points
Geodesics in H^n are precisely the circular arcs that are orthogonal to S^{n−1}. Horocycles in H^n are precisely the (n−1)-dimensional spheres that are tangential to S^{n−1} (Helgason, 1970). Horocycles are hyperbolic analogs of hyperplanes. Figure 2 illustrates geodesics and horocycles on H².

Hyperbolic Poisson kernel
The Poisson kernel for H^n is P(x, ω) = ((1−|x|²)/|x−ω|²)^{n−1}, where x ∈ H^n, ω ∈ S^{n−1} (Helgason (1970)[p.108]). The function ⟨•, ω⟩_H defined by

⟨x, ω⟩_H = (1/(2(n−1))) log P(x, ω) = (1/2) log((1−|x|²)/|x−ω|²) (4)

is constant over any horocycle that is tangential to S^{n−1} at ω (Figure 1 (middle), (6)).
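The identity (4) relating ⟨•, ω⟩_H to the Poisson kernel can be checked directly. A minimal NumPy sketch (illustrative only; the function names are ours):

```python
import numpy as np

def poisson_kernel(x, omega, n):
    """Hyperbolic Poisson kernel P(x, omega) = ((1-|x|^2)/|x-omega|^2)^(n-1)."""
    return ((1 - np.sum(x**2)) / np.sum((x - omega)**2)) ** (n - 1)

def h_inner(x, omega):
    """<x, omega>_H = 0.5 * log((1-|x|^2)/|x-omega|^2)."""
    return 0.5 * np.log((1 - np.sum(x**2)) / np.sum((x - omega)**2))

# (4): <x, omega>_H = log P(x, omega) / (2(n-1)).
n = 3
x = np.array([0.3, -0.2, 0.1])
omega = np.array([0.0, 0.0, 1.0])
assert np.isclose(h_inner(x, omega),
                  np.log(poisson_kernel(x, omega, n)) / (2 * (n - 1)))
```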

Riemannian volume

The Riemannian volume induced by the metric ds² on H^n is

dVol = 2^n (1−|x|²)^{−n} dx_1 … dx_n. (5)

Horocycles
Let Ξ be the set of horocycles of H^n, and let Ξ_ω be the set of all horocycles that are tangential to S^{n−1} at ω. Given λ ∈ R, we let ξ_{λ,ω} be the unique horocycle that connects ω and tanh(λ/2)·ω. We have Ξ_ω = ∪_{λ∈R} {ξ_{λ,ω}} and Ξ = ∪_{ω∈S^{n−1}} Ξ_ω. The length of any geodesic (ending at ω) line segment cut by ξ_{λ1,ω} and ξ_{λ2,ω} equals |λ1 − λ2| (A.2). Therefore |λ1 − λ2| is a natural distance function on Ξ_ω, and the map λ ↦ ξ_{λ,ω} is an isometry between R and Ξ_ω. This isometry is closely related to ⟨•, ω⟩_H (A.3): for any x ∈ ξ_{λ,ω},

⟨x, ω⟩_H = λ/2. (6)

The annoying /2 in (6) is a tradeoff for using a metric different from that in Helgason (2000).

Integral formula
For fixed ω ∈ S^{n−1}, H^n = ∪_{λ∈R} ξ_{λ,ω}. Let dVol_{ξ_{λ,ω}} be the measure induced by ds² on ξ_{λ,ω}. Let L be a family of geodesics that end at ω, let δ > 0, and let U = L ∩ (∪_{λ≤α≤λ+δ} ξ_{α,ω}). For l ∈ L, d_H(l ∩ ξ_{λ,ω}, l ∩ ξ_{λ+δ,ω}) = δ (A.2); hence dVol(U) = δ · dVol_{ξ_{λ,ω}}(U ∩ ξ_{λ,ω}) and therefore

∫_{H^n} f(x) dVol(x) = ∫_R ( ∫_{ξ_{λ,ω}} f(z) dVol_{ξ_{λ,ω}}(z) ) dλ. (7)

The above proof (for H^n) is essentially the same as that in (Helgason, 2000)[p.37] (for H²). To further convince the reader that (7) holds for all n, we give another simple proof in A.4.

Injectivity theorem
With respect to the canonical measure on Ξ, Helgason (1970)[p.13] proved

Theorem 1 (Helgason). If f ∈ L¹(H^n) and ∫_ξ f(z) dVol_ξ(z) = 0 for a.e. ξ ∈ Ξ, then f = 0 a.e.

Theorem 1 states that if the integral of f ∈ L¹(H^n) over almost every horocycle is zero, then f is zero almost everywhere. This theorem and the integral formula (7) are essential for the proof of Theorem 2.
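Relation (6) can be verified numerically: by the computation in A.3, ξ_{λ,ω} is the Euclidean sphere with center ((1+tanh(λ/2))/2)ω and radius (1−tanh(λ/2))/2, and ⟨•, ω⟩_H should equal λ/2 on all of it. A small NumPy check (illustrative only, not part of the paper's code):

```python
import numpy as np

def h_inner(x, omega):
    """<x, omega>_H = 0.5 * log((1-|x|^2)/|x-omega|^2)."""
    return 0.5 * np.log((1 - np.sum(x**2)) / np.sum((x - omega)**2))

# xi_{lam,omega} is the circle tangent to S^1 at omega through tanh(lam/2)*omega:
# center (1+tanh(lam/2))/2 * omega, radius (1-tanh(lam/2))/2  (cf. A.3).
lam = 0.7
omega = np.array([1.0, 0.0])
t = np.tanh(lam / 2)
center, radius = (1 + t) / 2 * omega, (1 - t) / 2
# Sample the horocycle, avoiding the deleted tangent point omega (theta = 0).
for theta in np.linspace(0.1, 2 * np.pi - 0.1, 50):
    x = center + radius * np.array([np.cos(theta), np.sin(theta)])
    assert np.isclose(h_inner(x, omega), lam / 2)
```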

4. LEARNING ARCHITECTURES AND EIGENFUNCTIONS OF THE LAPLACIAN

In this section, we discuss a heuristic connection between the representation properties of eigenfunctions and classical neurons, and then we define some horocycle-related learning tools.

4.1. EIGENSPACES AND NEURON MODELS

On a Riemannian manifold X, the Laplace-Beltrami operator L_X is the divergence of the gradient, and it has a well-known representation property (Einsiedler & Ward, 2017): if X is a compact Riemannian manifold or a bounded domain in R^n, then L²(X) has a basis given by eigenfunctions. This statement is false if X is R^n or H^n (Hislop, 1994).

Eigenspaces of L on R^n and H^n
Our work is motivated by the theory of eigenspaces, in which Euclidean (respectively hyperbolic) eigenfunctions are obtained from (x, ω)_E (respectively ⟨x, ω⟩_H) by some kind of superposition. For example, all smooth eigenfunctions of L_{R^n} are precisely the functions (M. Hashizume & Okamoto, 1972)[p.543]

f(x) = ∫_{S^{n−1}} e^{λ(x,ω)_E} dT(ω), (8)

and eigenfunctions of L_{H^n} are precisely the functions (Helgason, 1970)[Theorem 1.7, p.139]

f(x) = ∫_{S^{n−1}} e^{λ⟨x,ω⟩_H} dT(ω), (9)

where T in (8) and (9) are certain technical linear forms on suitable function spaces on S^{n−1}.

Neuron models
By (8) and (1), Euclidean eigenfunctions (respectively classical neurons) are superpositions of (•, ω)_E and exp (respectively ρ), with homogeneity and additivity. By (9) and (2), hyperbolic eigenfunctions (respectively horocycle neurons) are superpositions of ⟨•, ω⟩_H and exp (respectively ρ). The representation property of eigenfunctions on compact manifolds and bounded domains suggests that the universal approximation property is likely to hold for networks constructed from (•, ω)_E or ⟨•, ω⟩_H. However, this heuristic is not a proof (A.5).

4.2. HOROCYCLE BASED LEARNING ARCHITECTURES

Horocycle neuron
In the implementation of the horocycle neuron (2), we take (1/2) log((1−|x|²)/(|x−ω|²+ε) + ε) for ⟨x, ω⟩_H, where ε is a small constant that ensures numerical stability. For updating ω, we use the sphere optimization algorithm (Absil et al., 2008; Bonnabel, 2013) (A.6).

Horocycle feature and horocycle decision hypersurface
Given a non-origin point x ∈ H^n, for y ∈ H^n we define h_x(y) = ⟨y, x/|x|⟩_H and call it the horocycle feature attached to x. This feature is useful in the Poincaré-embedding subtree classification task (see the experiments and Figure 3 [left]). The horocycle is the hyperbolic analog of the Euclidean hyperplane, and therefore it is a possible choice of decision hypersurface, which may arise as a level set of a horocycle feature.

End-based clustering and end prototype
Natural clustering is a topic in representation learning (Bengio et al., 2013), and the common prototype-based clusters are center-based (Tan et al., 2005). We propose a type of clustering that embeds high-dimensional data in H^n and places prototypes on S^{n−1}. Figure 3 [right] is an example for n = 2. For ω ∈ S^{n−1} and any b ∈ R, the function x ∈ H^n ↦ −log((1−|x|²)/|x−ω|²) + b measures the relative distance of H^n from ω in Gromov's bordification theory (Bridson & Haefliger (2009)[II.8], A.18). Accordingly, we define Dist : H^n × S^{n−1} × R → R by

Dist(x, ω, b) = −log((1−|x|²)/|x−ω|²) + b = −2⟨x, ω⟩_H + b. (10)

It is a relative distance function, which is why Dist may assume negative values and why there is a bias term b in (10). Consider classes Cls = {C_1, C_2, …, C_M} and labeled training examples {(X_1, Y_1), …, (X_N, Y_N)}, where the X_i ∈ R^D are D-dimensional input features and Y_i ∈ {1, 2, …, M}. Each example X_i belongs to the class C_{Y_i}. In light of (10), our goal is to find a neural network NN_θ : R^D → H^n parameterized by θ, prototypes ω_1, …, ω_M ∈ S^{n−1}, and real numbers b_1, . . .
, b_M ∈ R such that the accuracy

#{1 ≤ i ≤ N : Y_i = arg min_{1≤j≤M} Dist(NN_θ(X_i), ω_j, b_j)} / N (11)

is maximized. We call {NN_θ(X_j) : 1 ≤ j ≤ N} the end-based clustering and the ω_i end prototypes (in hyperbolic geometry, an end is an equivalence class of parallel lines; see Figure 2 [left]). In experiments, we take NN_θ = Exp ∘ NN'_θ, where NN'_θ : R^D → R^n is a standard neural network parameterized by θ and Exp : R^n → H^n is the exponential map of the hyperbolic space.

Horocycle layer, horocycle multiple linear regression (MLR) and geodesic decision hypersurfaces
We call a concatenation of (2) a horocycle layer, and we now carefully describe a prototypical learning framework for end-based clusterings. Using the same notation as in the previous paragraph, the classification task has M classes, and NN_θ = Exp ∘ NN'_θ : R^D → H^n is a deep network. For prototypes ω_1, …, ω_M ∈ S^{n−1}, real numbers b_1, …, b_M ∈ R, and any example X, the feedforward for prediction is

x = NN_θ(X), (Feature descriptor)
SC_j(X) = −Dist(x, ω_j, b_j), (Scores; similarity)
X ∈ C_{arg max_{1≤j≤M} SC_j(X)}. (Classifier)

The goal is to maximize the accuracy (11), so we need a loss function for the backpropagation. Following the convention of prototypical networks (Snell et al., 2017; Yang et al., 2018), we choose an increasing function ρ (in our experiments, ρ(x) = x or ρ = tanh) and let the distribution over classes for an input X (with label Y) be

p_θ(Y = C_j | X) ∝ e^{−ρ(Dist(NN_θ(X), ω_j, b_j))} = e^{−ρ(−SC_j(X))}.

Therefore, given a batch of training examples, the loss function is

L = −( Σ_{(X_j, Y_j)∈Batch} log p_θ(Y = C_{Y_j} | X_j) ) / #Batch. (12)

The training proceeds by minimizing L, and we call this framework a horocycle MLR. The set of parameters of the framework is {θ} ∪ {ω_1, …, ω_M} ∪ {b_1, …, b_M}.
It is worth mentioning that the decision boundaries of the horocycle MLR are geodesics, which follows from

SC_i(X) = SC_j(X) ⟺ log((1−|x|²)/|x−ω_i|²) − b_i = log((1−|x|²)/|x−ω_j|²) − b_j ⟺ |x−ω_i|/|x−ω_j| = e^{(b_j−b_i)/2}

and the theorem of Apollonian circles (A.7).

Poisson neuron and Poisson multiple linear regression (MLR)
Although ⟨x, ω⟩_H (4) is well motivated by the theory of eigenspaces (9) and fits naturally into metric learning (see (10) or Corollary 1), it is only defined on H^n. Some readers might not be convinced that the neuron has to be defined on hyperbolic spaces. Therefore, we remove the log in (4) and define the Poisson neuron model by

P^ρ_{w,λ,b}(x) = ρ(λ(|w|²−|x|²)/|x−w|² + b), w ∈ R^n, λ, b ∈ R,

which is well defined on R^n \ {w}. Notice that if |x| < |w| then (|w|²−|x|²)/|x−w|² = e^{2⟨x/|w|, w/|w|⟩_H}. In A.8, Figure 7 illustrates an example of a Poisson neuron on R². In the implementation, we take (|w|²−|x|²)/|x−w|² + ε for (|w|²−|x|²)/|x−w|², where ε is a small constant for numerical stability. We call a concatenation of Poisson neurons a Poisson layer, and we use it with a deep neural network NN_θ : R^D → R^n to construct the Poisson MLR, which is similar to the horocycle MLR. Let w_1, …, w_M ∈ R^n and b_1, …, b_M ∈ R; the feedforward for prediction of our framework is

x = NN_θ(X), SC_j(X) = BatchNorm(P^ρ_{w_j,−1,b_j}(x)), X ∈ C_{arg max_{1≤j≤M} SC_j(X)}. (13)

We let p_θ(Y = C_j | X) ∝ e^{SC_j(X)} and take (12) as the loss. This framework is called a Poisson MLR. We use the usual optimization algorithms to update the parameters of the Poisson neuron. The BatchNorm (Ioffe & Szegedy, 2015) seems crucial for (13) in the experiments.
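The score function (and the algebraic identity behind the geodesic decision boundaries) can be sketched in a few lines of NumPy. This is an illustration under our own naming, not the paper's code: for any x, SC_i(X) − SC_j(X) = 2 log(|x−ω_j|/|x−ω_i|) + (b_j − b_i), so the boundary SC_i = SC_j is exactly the Apollonian locus |x−ω_i|/|x−ω_j| = e^{(b_j−b_i)/2}:

```python
import numpy as np

def dist(x, omega, b):
    """Relative distance (10): Dist(x, omega, b) = -2<x, omega>_H + b."""
    return -np.log((1 - np.sum(x**2)) / np.sum((x - omega)**2)) + b

def scores(x, omegas, bs):
    """Horocycle-MLR scores SC_j(X) = -Dist(x, omega_j, b_j)."""
    return np.array([-dist(x, w, b) for w, b in zip(omegas, bs)])

omegas = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
bs = [0.2, -0.1]
x = np.array([0.3, -0.4])
sc = scores(x, omegas, bs)
lhs = sc[0] - sc[1]
rhs = 2 * np.log(np.linalg.norm(x - omegas[1]) / np.linalg.norm(x - omegas[0])) \
      + (bs[1] - bs[0])
assert np.isclose(lhs, rhs)  # the identity behind the geodesic boundaries
```

Note that at the origin o, Dist(o, ω_j, b_j) = b_j for every prototype (since |o − ω_j| = 1), so the scores there reduce to −b_j.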

5. REPRESENTATIONAL POWER

In this section, ρ is a continuous sigmoidal function (Cybenko, 1989), ReLU (Nair & Hinton, 2010), ELU (Clevert et al., 2016), or Softplus (Dugas et al., 2001). We remind the reader that ρ is sigmoidal if lim_{t→∞} ρ(t) = 1 and lim_{t→−∞} ρ(t) = 0. The following theorem justifies the representational power of horocycle neurons.

Theorem 2. Let K be a compact set in H^n, and let 1 ≤ p < ∞. Then finite sums of the form

F(x) = Σ_{i=1}^N α_i ρ(λ_i⟨x, ω_i⟩_H + b_i), ω_i ∈ S^{n−1}, α_i, λ_i, b_i ∈ R (14)

are dense in L^p(K, µ), where µ is either dVol (5) or the induced Euclidean volume.

We provide a sketch of the proof here and go through the details in A.9. It suffices to prove the theorem for a sigmoidal function ρ and µ = dVol, as the other cases follow from this one. Assume that these finite sums are not dense in L^p(K, dVol). By the Hahn-Banach theorem, there exists some nonzero h ∈ L^q(K, dVol), where q = p/(p−1) if p > 1 and q = ∞ if p = 1, such that ∫_K F(x)h(x) dVol(x) = 0 for all finite sums of the form (14). Extend h to a function H defined on H^n by setting H(x) = h(x) if x ∈ K and H(x) = 0 if x ∈ H^n \ K. Using the property of sigmoidal functions, the dominated convergence theorem, and the integral formula (7), we prove that the integral of H over almost every horocycle is zero. By the injectivity Theorem 1, H is almost everywhere zero, which contradicts our assumption and completes the proof. In A.10, we prove the same result for Poisson neurons. In A.11, we prove the following lemma, which demonstrates a close relationship between horocycle neurons and the widely used f^1_{a,p} (3).

Lemma 1. Let K be a compact set in H^n, ω ∈ S^{n−1}, and ε > 0. There are c, d ∈ R, p ∈ H^n, and a ∈ T_p(H^n) such that the function D(x) = c f^1_{a,p}(x) + d − ⟨x, ω⟩_H satisfies ||D||_{L^p(K,dVol)} < ε.

This lemma suggests that ⟨•, ω⟩_H is a boundary point of some "compactification" of the space of f^1_{a,p}. The above lemma together with Theorem 2 implies

Corollary 1. Let K be a compact set in H^n and 1 ≤ p < ∞.
Finite sums of the form

F(x) = Σ_{i=1}^N α_i ρ(c_i f^1_{a_i,p_i}(x) + d_i), p_i ∈ H^n, a_i ∈ T_{p_i}(H^n), α_i, c_i, d_i ∈ R,

are dense in L^p(K, µ), where µ = dVol or µ is the induced Euclidean volume.

This result provides novel insights into the hyperbolic neural network (Ganea et al., 2018a), the gyroplane layer (Mathieu et al., 2019), and the Poincaré FC layer (Shimizu et al., 2020). Although level sets of f^1_{a,p} are hypercycles, our proof of Lemma 1 relies on the theory of horocycles. It would be interesting to have more natural approaches to the expressivity of f^1_{a,p}.
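The constructive side of Lemma 1 (the parameter family of A.11) can be tested numerically. The sketch below implements Möbius addition and f^1_{a,p} in NumPy (our own illustrative code, not the paper's); with p_t = tω, a_t = −ω, c_t = (t²−1)/4, d_t = (1/2)log((1+t)/(1−t)), the quantity c_t f^1_{a_t,p_t}(x) + d_t should approach ⟨x, ω⟩_H as t → 1⁻:

```python
import numpy as np

def mobius_add(u, v):
    """Mobius addition on the Poincare ball (Ungar, 2008)."""
    uv, u2, v2 = np.dot(u, v), np.dot(u, u), np.dot(v, v)
    return ((1 + 2 * uv + v2) * u + (1 - u2) * v) / (1 + 2 * uv + u2 * v2)

def f1(x, a, p):
    """f^1_{a,p}(x) of the hyperbolic MLR (3)."""
    y = mobius_add(-p, x)
    na = np.linalg.norm(a)
    return (2 * na / (1 - np.dot(p, p))) * \
        np.arcsinh(2 * np.dot(y, a) / ((1 - np.dot(y, y)) * na))

def h_inner(x, omega):
    return 0.5 * np.log((1 - np.dot(x, x)) / np.dot(x - omega, x - omega))

omega = np.array([1.0, 0.0])
x = np.array([0.3, -0.2])

def approx(t):
    # c_t * f^1_{a_t,p_t}(x) + d_t with p_t = t*omega, a_t = -omega (cf. A.11).
    return (t**2 - 1) / 4 * f1(x, -omega, t * omega) \
        + 0.5 * np.log((1 + t) / (1 - t))

errs = [abs(approx(t) - h_inner(x, omega)) for t in (0.9, 0.99, 0.999)]
assert errs[2] < errs[1] < errs[0] and errs[2] < 0.05  # convergence as t -> 1
```

At the origin the approximation is exact for every t, since asinh(2t/(1−t²)) = log((1+t)/(1−t)).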

6. EXPERIMENTS

In this section, we first experiment on MNIST as a toy task. Next, we apply a horocycle feature to the Poincaré-embedding subtree classification task. After that, we construct 2-D clusterings of image datasets by using the horocycle MLR. Finally, we provide evidence for further possible applications of the Poisson MLR. We use the frameworks or individual functions of Tensorflow, Keras, and scikit-learn (Abadi et al., 2015; Chollet et al., 2015; Pedregosa et al., 2011).

6.1. MNIST

The MNIST (LeCun et al., 1998) task is popular for testing hyperbolic learning tools (Ontrup & Ritter, 2005; Nagano et al., 2019; Mathieu et al., 2019; Grattarola et al., 2019; Ovinnikov, 2020; Khrulkov et al., 2020). We train two different classifiers; A.12, A.14, and the code contain the details. The first classifier is a single horocycle layer followed by a softmax classifier. Prior work points out that the traditional CNN is good at linearly separating feature representations, but the learned features are of large intra-class variations. The horocycle MLR leads to inter-class separability in the same way (angle accounts for label difference) a traditional CNN does. At the same time, it also obtains intra-class compactness (Figures 5, 6). A.17 and the code contain the details. This provides evidence for further applications of horocycles. In the earlier epochs of the training, the feature vectors NN_θ(X) of the Poisson model are not even close to its compact high-confidence prediction regions (deep red areas), and therefore on the test data it is not much better than a random guess. In the end, feature vectors lying in these compact regions are of small intra-class variations, which is good for generalization (Yang et al., 2018).

7. CONCLUSION

Based on the spectral theory of hyperbolic spaces, we introduce several horocycle-related learning tools. They find applications in hyperbolic neural networks, the Poincaré-embedding subtree classification task, and the visualization and classification of image datasets. We give an existential proof of a universal approximation theorem for shallow networks constructed from horocycle neurons or f^1_{a,p}. We hope this will trigger further research on expressivity problems for horocycle neurons, f^1_{a,p}, and similar functions on more general manifolds, such as constructive approaches, quantitative results, and the benefits of depth (Mhaskar & Poggio, 2016).

A APPENDIX

A.1 NOTATIONS AND SYMBOLS

Default Notations

Notation : Description; related formula

R : the set of real numbers
R^n : n-dimensional Euclidean space; x ∈ R^n, x = (x_1, …, x_n)
(•, •)_E : Euclidean inner product; (x, y)_E = Σ_{i=1}^n x_i y_i for x, y ∈ R^n
⟨•, •⟩_H : hyperbolic analogue of (•, •)_E; ⟨x, ω⟩_H = (1/2) log((1−|x|²)/|x−ω|²) for x ∈ H^n, ω ∈ S^{n−1}
| • | : Euclidean norm; |x| = √((x, x)_E)
H^n : n-dimensional hyperbolic space; as a set, H^n = {x ∈ R^n : |x| < 1}
T_p(X) : tangent space of X at p
T(X) : tangent space of X; T(X) = ∪_{p∈X} T_p(X)
ds²_{H^n} : the canonical metric on H^n with curvature −1; ds²_{H^n} = Σ_{i=1}^n 4(1−|x|²)^{−2} dx_i²
dVol : Riemannian volume on H^n; dVol = 2^n (1−|x|²)^{−n} dx_1 … dx_n
L^p(K, dVol) : L^p space; L^p(K, dVol) = {f : ∫_K |f|^p dVol < ∞}
|| • ||_{L^p(K,dVol)} : L^p norm; ||f||_{L^p(K,dVol)} = (∫_K |f|^p dVol)^{1/p}
S^{n−1} : (n−1)-dimensional sphere; as a set, S^{n−1} = {x ∈ R^n : |x| = 1}
P(•, •) : hyperbolic Poisson kernel; P(x, ω) = ((1−|x|²)/|x−ω|²)^{n−1} for x ∈ H^n, ω ∈ S^{n−1}
f^1_{a,p} : model in the hyperbolic MLR; f^1_{a,p}(x) = (2|a|/(1−|p|²)) sinh^{−1}(2(−p⊕x, a)_E/((1−|−p⊕x|²)|a|))
d_{H^n} : the hyperbolic distance function
Ξ : the space of horocycles
Ξ_ω : the set of horocycles that are tangential to S^{n−1} at ω

A.2 GEODESIC SEGMENTS CUT BY HOROCYCLES

Given ω ∈ S^{n−1} and λ ∈ R, we let ξ_{λ,ω} be the unique horocycle that connects ω and tanh(λ/2)·ω. The length of any geodesic (ending at ω) line segment cut by ξ_{λ1,ω} and ξ_{λ2,ω} equals |λ1 − λ2|. This fact is obvious in the half-space model. There is a Riemannian isometry F : {z ∈ R^n : |z| < 1} → {(x_1, …, x_n) : x_1 > 0} (the latter with the metric ds² = (dx_1² + … + dx_n²)/x_1²) such that F(ω) = ∞ and F(o) = (1, 0, …, 0). Using d_{H^n}(o, tanh(λ_i/2)ω) = |λ_i|, d_{{(x_1,…,x_n): x_1>0}}((1, 0, …, 0), (e^{±λ_i}, 0, …, 0)) = |λ_i|, F(ω) = ∞ and F(o) = (1, 0, …, 0), we have F(tanh(λ_i/2)ω) = (e^{λ_i}, 0, …, 0). Therefore, F maps ξ_{λ_i,ω} to {(x_1, x_2, …, x_n) : x_1 = e^{λ_i}}. Any geodesic (ending at ω) line segment cut by ξ_{λ1,ω} and ξ_{λ2,ω} is mapped by F to {(t, α_2, …, α_n) : (t − e^{λ1})(t − e^{λ2}) < 0} for some fixed α_j. It is easy to check that the length of this segment with respect to (dx_1² + … + dx_n²)/x_1² (as the α_i are constants, the metric reduces to dx_1²/x_1² on this segment) is |λ1 − λ2|.

A.3 PROOF OF (6)

Because x is on ξ_{λ,ω}, which is a sphere with center ((1 + tanh(λ/2))/2)ω and radius (1 − tanh(λ/2))/2, we have |x − ((1 + tanh(λ/2))/2)ω|² = ((1 − tanh(λ/2))/2)², which leads to |x|² − (1 + tanh(λ/2))(x, ω)_E + tanh(λ/2)|ω|² = 0, then ((1 + tanh(λ/2))/2)|x − ω|² = ((1 − tanh(λ/2))/2)(|ω|² − |x|²), and finally

⟨x, ω⟩_H = (1/2) log((|ω|² − |x|²)/|x − ω|²) = (1/2) log((1 + tanh(λ/2))/(1 − tanh(λ/2))) = λ/2.

A.4 ANOTHER PROOF OF THE INTEGRAL FORMULA (7)

We use H^n for the upper half-space model {(x_1, …, x_n) : x_1 > 0} with the Riemannian volume dx_1 … dx_n / x_1^n. Let ω = ∞ and o = (1, 0, …, 0) as in A.2; then ξ_{λ,ω} = {(x_1, x_2, …, x_n) : x_1 = e^λ}. The induced Riemannian metric on ξ_{λ,ω} (respectively the volume dVol_{ξ_{λ,ω}}) is (dx_2² + … + dx_n²)/e^{2λ} (respectively dx_2 … dx_n / e^{(n−1)λ}). For any integrable function f on H^n, the change of variable x_1 = e^λ gives

∫_{H^n} f(x_1, …, x_n) dx_1 … dx_n / x_1^n
= ∫_λ ∫_{(x_2,…,x_n)∈R^{n−1}} f(e^λ, x_2, …, x_n) (dx_2 … dx_n / e^{nλ}) e^λ dλ
= ∫_λ ∫_{(x_2,…,x_n)∈R^{n−1}} f(e^λ, x_2, …, x_n) dx_2 … dx_n / e^{(n−1)λ} dλ
= ∫_λ ∫_{ξ_{λ,ω}} f(z) dVol_{ξ_{λ,ω}}(z) dλ.

The above identity is equivalent to the integral formula

∫_{H^n} f(x) dVol(x) = ∫_R ( ∫_{ξ_{λ,ω}} f(z) dVol_{ξ_{λ,ω}}(z) ) dλ

presented in (7), via the Riemannian isometry in A.2.

A.5 THE HEURISTIC IS NOT A PROOF

The spectral theory does not directly lead to universal approximation theorems, for the following reasons: (1) the superpositions in (1, 2) and in (8, 9) are different (similarly, although another kind of superposition in Hilbert's 13th problem (Hilbert, 1935; Arnold, 2009) was a driving force for universal approximation theorems (Nielsen, 1987), the former is hardly relevant for networks (Girosi & Poggio, 1989)); (2) the desired representation properties of hyperbolic eigenfunctions are unknown, partially because H^n is non-compact; (3) results in spectral theory favor Hilbert spaces, while universal approximation theorems embrace more than the L² space.

A.6 OPTIMIZATION

The parameter updates for the horocycle unit (2) involve optimization on the sphere (for ω) and on the hyperbolic space (for x). We use a standard algorithm of sphere optimization (Absil et al., 2008) to update ω, and in the supplement we present an optimization approach based on geodesic polar coordinates to update x. In the implementation of a horocycle layer, the forward propagation is trivial, while the backpropagation involves optimization on the sphere and the hyperbolic space. In the following, η is the learning rate, α_t is the value of α (α may be η, s, z, ω, …) at the t-th step, T_p X is the tangent fiber at p, ∇ is the gradient, and ∇_H is the hyperbolic gradient. It suffices to consider the layer s = ⟨z, ω⟩.

Optimization on the sphere
The parameter update of ω in s = ⟨z, ω⟩ involves optimization on the sphere. The projection of (∂L_θ/∂s)∇s(ω_t) = (∂L_θ/∂s)(z_t − ω_t)/|z_t − ω_t|² ∈ T_{ω_t}R^n onto T_{ω_t}S^{n−1} is given by (Absil et al. (2008)[p.48])

v_t = (∂L_θ/∂s)(z_t − ω_t)/|z_t − ω_t|² − ((∂L_θ/∂s)(z_t − ω_t)/|z_t − ω_t|², ω_t)_E ω_t = (∂L_θ/∂s)(z_t − (z_t, ω_t)_E ω_t)/|z_t − ω_t|².

Two well-known update algorithms for ω_t (Absil et al. (2008)[p.76]) are:

ω_{t+1} = cos(η_t|v_t|)ω_t − sin(η_t|v_t|)|v_t|^{−1}v_t;
ω_{t+1} = (ω_t − η_t v_t)/|ω_t − η_t v_t|.

A.7 APOLLONIAN SPHERES

Let ω_1, ω_2 ∈ S^{n−1} and λ > 0. The locus of points x with |x − ω_1| = λ|x − ω_2| is orthogonal to S^{n−1}.

Proof. If λ is one, the locus is a hyperplane through the origin and the claim is trivial. We now assume λ is not one. From |x − ω_1| = λ|x − ω_2| we obtain

|x − (ω_1 − λ²ω_2)/(1 − λ²)|² = |ω_1 − λ²ω_2|²/(1 − λ²)² − 1.

The locus is a sphere with center O = (ω_1 − λ²ω_2)/(1 − λ²) and radius R = (|ω_1 − λ²ω_2|²/(1 − λ²)² − 1)^{1/2}. The theorem of Apollonius (in all dimensions) claims that this sphere is orthogonal to S^{n−1}. To prove this, it suffices to prove |oO|² = 1 + R² (recall o is the origin of H^n), which follows from

|(ω_1 − λ²ω_2)/(1 − λ²)|² = (|ω_1 − λ²ω_2|²/(1 − λ²)² − 1) + 1.
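The two sphere updates of A.6 can be sketched as follows. This is an illustrative NumPy fragment (our own naming, assuming the tangential gradient component is nonzero); both updates return points on the unit sphere, the first moving along a geodesic via the exponential map and the second using the cheaper projection retraction:

```python
import numpy as np

def tangent_project(omega, grad):
    """Project an ambient gradient onto the tangent space T_omega S^{n-1}."""
    return grad - np.dot(grad, omega) * omega

def sphere_updates(omega, grad, eta):
    """The two updates of A.6 (cf. Absil et al., 2008): exponential-map step
    and projection retraction. Assumes the tangential part v is nonzero."""
    v = tangent_project(omega, grad)
    nv = np.linalg.norm(v)
    exp_step = np.cos(eta * nv) * omega - np.sin(eta * nv) * v / nv
    retr_step = (omega - eta * v) / np.linalg.norm(omega - eta * v)
    return exp_step, retr_step

omega = np.array([0.6, 0.8])                       # a point on S^1
exp_step, retr_step = sphere_updates(omega, np.array([0.5, -0.3]), eta=0.1)
assert np.isclose(np.linalg.norm(exp_step), 1.0)   # both updates stay on S^1
assert np.isclose(np.linalg.norm(retr_step), 1.0)
```

The exponential-map step preserves the norm exactly because v ⊥ ω, so cos²+sin² = 1 in the expansion of |ω_{t+1}|².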

A.8 INVERSION

On R^n ∪ {∞}, given the sphere {x : |x − w_0| = r}, the corresponding inversion is given by

Iv(x) = w_0 + r²(x − w_0)/|x − w_0|².

For x ∈ R^n ∪ {∞}, Iv(x) is called the inverse of x with respect to {x : |x − w_0| = r}.

A.9 PROOF OF THEOREM 2

Theorem 2. Let K be a compact set in H^n, and let 1 ≤ p < ∞. Then finite sums of the form

F(x) = Σ_{i=1}^N α_i ρ(λ_i⟨x, ω_i⟩_H + b_i), ω_i ∈ S^{n−1}, α_i, λ_i, b_i ∈ R (14)

are dense in L^p(K, µ), where µ is either dVol (5) or the induced Euclidean volume.

Proof. We first treat the case where ρ is sigmoidal and µ = dVol. Assume that these finite sums are not dense in L^p(K, dVol). By the Hahn-Banach theorem, there exists some nonzero h ∈ L^q(K, dVol), where q = p/(p−1) if p > 1 and q = ∞ if p = 1, such that ∫_K F(x)h(x) dVol(x) = 0 for all finite sums of the form (14). As K is a compact set, by Hölder's inequality, ∫_K |h(x)| dVol ≤ (∫_K dVol)^{1/p} ||h||_{L^q(K,dVol)}, which gives h ∈ L¹(K, dVol). Extend h to a function H defined on H^n by setting H(x) = h(x) if x ∈ K and H(x) = 0 if x ∈ H^n \ K. Then H ∈ L¹(H^n, dVol) ∩ L^q(H^n, dVol) and

∫_{H^n} F(x)H(x) dVol(x) = 0 (15)

for all finite sums of the form (14). For any ω ∈ S^{n−1} and λ, b ∈ R, we set F_{ω,λ,b}(x) = ρ(λ(⟨x, ω⟩_H − b)). These functions are uniformly bounded, as |F_{ω,λ,b}(x)| ≤ 1. Moreover,

lim_{λ→∞} F_{ω,λ,b}(x) = 1 if ⟨x, ω⟩_H > b, and 0 if ⟨x, ω⟩_H < b. (16)

According to (15), for all ω, λ, b, we have ∫_{H^n} F_{ω,λ,b}(x)H(x) dVol(x) = 0. The functions {F_{ω,λ,b}H}_{λ∈R} converge pointwise a.e. as λ → ∞ and are dominated by |H| ∈ L¹(H^n, dVol). By the dominated convergence theorem, for all ω ∈ S^{n−1} and b ∈ R, we have

∫_{{x : ⟨x,ω⟩_H > b}} H(x) dVol(x) = 0. (17)

By the integral formula (7) (with the notation defined there) and (6), for all b ∈ R,

∫_{2b}^∞ ( ∫_{ξ_{t,ω}} H(z) dVol_{ξ_{t,ω}}(z) ) dt = 0. (18)

Taking the derivative of (18) with respect to b, we deduce that ∫_{ξ_{2b,ω}} H(z) dVol_{ξ_{2b,ω}}(z) = 0 for a.e. b ∈ R. In other words, the integral of H over a.e. ξ ∈ Ξ_ω is zero. This fact is valid for all ω ∈ S^{n−1}.
Therefore, the integral of H over a.e. ξ ∈ Ξ is zero. By the injectivity Theorem 1, H = 0 a.e., which contradicts our assumption. Therefore, finite sums of the form (14) are dense in L^p(K, dVol). The case where ρ is ReLU, ELU or Softplus and µ = dVol follows from the above case and the fact that x ↦ ρ(x + 1) − ρ(x) is sigmoidal. The case where µ is the Euclidean volume follows from the previous cases and the fact that the Euclidean volume on the compact set K is bounded from above by λ dVol for some constant λ.

A.10 UNIVERSAL APPROXIMATION THEOREM FOR POISSON NEURONS

In this section, ρ is a continuous sigmoidal function (Cybenko, 1989), ReLU (Nair & Hinton, 2010), ELU (Clevert et al., 2016), or Softplus (Dugas et al., 2001). We also recall the Poisson neuron:

P^ρ_{w,λ,b}(x) = ρ(λ(|w|² − |x|²)/|x − w|² + b), w ∈ R^n, λ, b ∈ R.

Theorem 4. Let K be a compact set in H^n, and let 1 ≤ p < ∞. Then finite sums of the form

F(x) = Σ_{i=1}^N α_i P^ρ_{ω_i,λ_i,b_i}(x), ω_i ∈ S^{n−1}, α_i, λ_i, b_i ∈ R (19)

are dense in L^p(K, µ), where µ is either dVol (5) or the induced Euclidean volume.

Proof. We first treat the case where ρ is sigmoidal and µ = dVol. Assume that these finite sums are not dense in L^p(K, dVol). By the Hahn-Banach theorem, there exists some nonzero h ∈ L^q(K, dVol), where q = p/(p−1) if p > 1 and q = ∞ if p = 1, such that ∫_K F(x)h(x) dVol(x) = 0 for all finite sums of the form (19). As K is a compact set, by Hölder's inequality, ∫_K |h(x)| dVol ≤ (∫_K dVol)^{1/p} ||h||_{L^q(K,dVol)}, which gives h ∈ L¹(K, dVol). Extend h to a function H defined on H^n by setting H(x) = h(x) if x ∈ K and H(x) = 0 if x ∈ H^n \ K. Then H ∈ L¹(H^n, dVol) ∩ L^q(H^n, dVol) and

∫_{H^n} F(x)H(x) dVol(x) = 0 (20)

for all finite sums of the form (19). For any ω ∈ S^{n−1}, λ ∈ R, and b > 0, we set F_{ω,λ,b}(x) = P^ρ_{ω,λ,−λb}(x) = ρ(λ((1 − |x|²)/|x − ω|² − b)). These functions are uniformly bounded, as |F_{ω,λ,b}(x)| ≤ 1. Moreover,

lim_{λ→∞} F_{ω,λ,b}(x) = 1 if (1 − |x|²)/|x − ω|² > b, and 0 if (1 − |x|²)/|x − ω|² < b. (21)
According to (20), for all $\omega, \lambda, b$, we have
$$\int_{\mathbb{H}^n} F_{\omega,\lambda,b}(x)H(x)\,dVol(x) = 0.$$
The functions $\{F_{\omega,\lambda,b}\}_{\lambda\in\mathbb{R}}$ converge pointwise as $\lambda \to \infty$, and the integrands are dominated by $|H| \in L^1(\mathbb{H}^n, dVol)$. By the dominated convergence theorem, for all $\omega \in S^{n-1}$ and $b > 0$, we have
$$\int_{\{x :\, \frac{1-|x|^2}{|x-\omega|^2} > b\}} H(x)\,dVol(x) = 0. \tag{22}$$
By the integral formula (7) (with notations defined there), (6) and (22), for all $b > 0$,
$$\int_{\log b}^{\infty} \Big(\int_{\xi_{t,\omega}} H(z)\,dVol_{\xi_{t,\omega}}(z)\Big)\,dt = 0. \tag{23}$$
Taking the derivative of the left-hand side of (23) with respect to $b$, we deduce that $\int_{\xi_{\log b,\omega}} H(z)\,dVol_{\xi_{\log b,\omega}}(z) = 0$ for a.e. $b > 0$. In other words, the integral of $H$ over a.e. $\xi \in \Xi_\omega$ is zero. This holds for all $\omega \in S^{n-1}$. Therefore, the integral of $H$ over a.e. $\xi \in \Xi$ is zero. By the injectivity Theorem 1, $H = 0$ a.e., which contradicts our assumption. Therefore, finite sums of the form (19) are dense in $L^p(K, dVol)$. The case where $\rho$ is ReLU, ELU or Softplus and $\mu = dVol$ follows from the above case and the fact that $x \mapsto \rho(x+1) - \rho(x)$ is sigmoidal. The case where $\mu$ is the Euclidean volume follows from the previous cases and the fact that the Euclidean volume on the compact set $K$ is bounded from above by $\lambda\,dVol$ for some constant $\lambda$.

We refer the reader to the differences between (16) and (21), (17) and (22), and (18) and (23); apart from these, the proofs are essentially the same. The key points are the integral formula (7), the injectivity Theorem 1, and the fact that the level sets of horocycle/Poisson neurons are horocycles. Moreover, as a corollary of Theorem 4, we have

Corollary 2. Let $K$ be a compact set in $\mathbb{R}^n$, and $1 \le p < \infty$. Then finite sums of the form
$$F(x) = \sum_{i=1}^{N} \alpha_i P^\rho_{w_i,\lambda_i,b_i}(x), \qquad w_i \in \mathbb{R}^n,\ \alpha_i, \lambda_i, b_i \in \mathbb{R}$$
are dense in $L^p(K, \mu)$, where $\mu$ is the Euclidean volume.

Proof. Because $K$ is compact, there exists a positive number $R$ such that $K \subset \{x \in \mathbb{R}^n : |x| < R\}$. By the above theorem, finite sums of the form $F(x) = \sum_{i=1}^{N} \alpha_i P^\rho_{w_i,\lambda_i,b_i}(x)$ with $w_i \in S^{n-1}$ are dense in $L^p(K/R, \mu)$. The corollary then follows from $P^\rho_{w,\lambda,b}(x) = P^\rho_{w/R,\lambda,b}(x/R)$.

A.11 PROOF OF LEMMA 1

Recall
$$f^1_{a,p}(x) = \frac{2|a|}{1-|p|^2}\,\sinh^{-1}\Big(\frac{2(-p \oplus x, a)_E}{(1-|-p\oplus x|^2)\,|a|}\Big).$$
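The rescaling identity used at the end of the proof of Corollary 2 is easy to verify numerically. The sketch below (our illustration, not the paper's code) implements the Poisson neuron and checks that $P^\rho_{w,\lambda,b}(x) = P^\rho_{w/R,\lambda,b}(x/R)$, i.e. the neuron is invariant under simultaneous rescaling of $x$ and $w$.

```python
import numpy as np

def poisson_neuron(x, w, lam, b, rho=np.tanh):
    """P^rho_{w,lam,b}(x) = rho(lam * (|w|^2 - |x|^2) / |x - w|^2 + b)."""
    d = x - w
    return rho(lam * (np.dot(w, w) - np.dot(x, x)) / np.dot(d, d) + b)

rng = np.random.default_rng(1)
x = rng.normal(size=4)
w = rng.normal(size=4)
lam, b, R = 0.7, -0.2, 5.0

# The rescaling identity behind Corollary 2: both sides agree exactly,
# because (|w|^2 - |x|^2)/|x - w|^2 is homogeneous of degree 0.
assert np.isclose(poisson_neuron(x, w, lam, b), poisson_neuron(x / R, w / R, lam, b))
```

The degree-0 homogeneity of the quotient is what lets Corollary 2 reduce the general case $w_i \in \mathbb{R}^n$ to the unit-sphere case of Theorem 4.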
The proof of Lemma 1 follows from the following direct computation.

Proof. Let $t \in (0, 1)$. Take $p_t = t\omega$ and $a_t = -\omega$; then we have
$$-p_t \oplus x = \frac{-t(1 - 2t(\omega, x)_E + |x|^2)\,\omega + (1-t^2)\,x}{1 - 2t(\omega, x)_E + t^2|x|^2}.$$
Let $F_t(x) = \frac{2(-p_t\oplus x,\, a_t)_E}{(1-|-p_t\oplus x|^2)\,|a_t|}$. Then
$$F_t(x) = \frac{2\,\frac{t(1-2t(\omega,x)_E+|x|^2) - (1-t^2)(x,\omega)_E}{1-2t(\omega,x)_E+t^2|x|^2}}{1 - \frac{|-t(1-2t(\omega,x)_E+|x|^2)\omega + (1-t^2)x|^2}{(1-2t(\omega,x)_E+t^2|x|^2)^2}}
= \frac{2t(1-2t(\omega,x)_E+t^2|x|^2)(1-2t(\omega,x)_E+|x|^2) - 2(1-t^2)(1-2t(\omega,x)_E+t^2|x|^2)(x,\omega)_E}{(1-2t(\omega,x)_E+t^2|x|^2)^2 - |-t(1-2t(\omega,x)_E+|x|^2)\omega + (1-t^2)x|^2} = A_t(x)/B_t(x),$$
where $A_t, B_t$ are defined as the corresponding numerator and denominator. We have
$$A_t(x)|_{t=1} = 2|x-\omega|^4, \qquad B_t(x)|_{t=1} = 0, \qquad \partial B_t(x)/\partial t|_{t=1} = 2|x-\omega|^2(|x|^2-1).$$
Let $G_t(x) = \sinh^{-1}(F_t(x)) + \log\frac{1-t}{1+t}$. Then
$$G_t(x) = \log\Big(\frac{A_t(x)}{B_t(x)} + \sqrt{1 + \frac{A_t^2(x)}{B_t^2(x)}}\Big) + \log\frac{1-t}{1+t}
= \log\Big(\frac{(1-t)A_t}{(1+t)B_t} + \sqrt{\frac{(1-t)^2}{(1+t)^2} + \frac{(1-t)^2 A_t^2(x)}{(1+t)^2 B_t^2(x)}}\Big).$$
By L'Hôpital's rule,
$$\lim_{t<1,\,t\to 1} \frac{(1-t)A_t(x)}{(1+t)B_t(x)} = \frac{-A_t(x) + (1-t)A_t'(x)}{B_t(x) + (1+t)B_t'(x)}\Big|_{t=1} = \frac{|x-\omega|^2}{2 - 2|x|^2}.$$
Therefore,
$$\lim_{t<1,\,t\to 1} G_t(x) = \log\frac{|x-\omega|^2}{1-|x|^2}.$$
For $t < 1$, we take $p_t = t\omega$, $a_t = -\omega$, $c_t = \frac{t^2-1}{4}$, $d_t = \frac{1}{2}\log\frac{1+t}{1-t}$; then for all $x \in K$,
$$\lim_{t<1,\,t\to 1} \big(c_t f^1_{a_t,p_t}(x) + d_t\big) = \lim_{t<1,\,t\to 1} -\frac{1}{2}G_t(x) = \frac{1}{2}\log\frac{1-|x|^2}{|x-\omega|^2} = \langle x, \omega\rangle_H.$$
If there exist $c_1, c_2$ such that $|c_t f^1_{a_t,p_t}(x) + d_t|\ (= |G_t(x)|/2) \le c_2$ for all $t \in (c_1, 1)$ and $x \in K$, then by the dominated convergence theorem, there exists $t$ such that $\|c_t f^1_{a_t,p_t} + d_t - \langle\cdot, \omega\rangle_H\|_{L^p(K,m)} < \epsilon$, which proves the lemma.
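The limit just computed can be checked numerically. The sketch below (a minimal numpy check under our assumptions, not the paper's code) implements the closed form of $-p_t \oplus x$ from the proof and verifies that $c_t f^1_{a_t,p_t}(x) + d_t$ approaches $\langle x, \omega\rangle_H$ as $t \to 1$.

```python
import numpy as np

def horocycle_feature(x, omega):
    """<x, omega>_H = (1/2) log((1 - |x|^2) / |x - omega|^2)."""
    return 0.5 * np.log((1 - np.dot(x, x)) / np.dot(x - omega, x - omega))

def f1(x, t, omega):
    """f^1_{a_t,p_t}(x) with p_t = t*omega, a_t = -omega (so |a_t| = 1),
    using the closed form of -p_t (+) x from the proof."""
    ip = np.dot(omega, x)
    denom = 1 - 2 * t * ip + t * t * np.dot(x, x)
    m = (-t * (1 - 2 * t * ip + np.dot(x, x)) * omega + (1 - t * t) * x) / denom
    F = 2 * np.dot(m, -omega) / (1 - np.dot(m, m))
    return (2.0 / (1 - t * t)) * np.arcsinh(F)

rng = np.random.default_rng(2)
x = rng.normal(size=2); x *= 0.6 / np.linalg.norm(x)   # a point in the unit ball
omega = np.array([1.0, 0.0])

t = 1 - 1e-7
c_t = (t * t - 1) / 4
d_t = 0.5 * np.log((1 + t) / (1 - t))
# c_t * f1 + d_t -> <x, omega>_H as t -> 1 (Lemma 1).
assert abs(c_t * f1(x, t, omega) + d_t - horocycle_feature(x, omega)) < 1e-4
```

Note that both $c_t f^1$ and $d_t$ blow up individually as $t \to 1$; only their sum stabilizes, which is why the uniform-boundedness argument below is needed.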
Note that
$$\frac{(1-t)A_t(x)}{(1+t)B_t(x)} = \frac{2|x-\omega|^4(1-t) + \sum_{j=1}^{4} U_j(x,\omega)(1-t)^{j+1}}{-2|x-\omega|^2(|x|^2-1)(1-t)(1+t) + \sum_{l=2}^{4} L_l(x,\omega)(1-t)^l(1+t)}
= \frac{2|x-\omega|^4 + \sum_{j=1}^{4} U_j(x,\omega)(1-t)^j}{2|x-\omega|^2(1-|x|^2)(1+t) + \sum_{l=2}^{4} L_l(x,\omega)(1-t)^{l-1}(1+t)},$$
where $U_j$ and $L_l$ are continuous functions defined on $K \times \{\omega\}$. There exist positive numbers $c_3, c_4$ and $c_1 \in (0, 1)$ such that for all $x \in K$ and $t \in (c_1, 1)$,
$$c_3 \le 2|x-\omega|^4 \le c_4, \qquad c_3 \le 2|x-\omega|^2(1-|x|^2)(1+t) \le c_4,$$
$$\frac{c_3}{2} \ge \Big|\sum_{j=1}^{4} U_j(x,\omega)(1-t)^j\Big|, \qquad \frac{c_3}{2} \ge \Big|\sum_{l=2}^{4} L_l(x,\omega)(1-t)^{l-1}(1+t)\Big|.$$
Therefore, for $x \in K$ and $t \in (c_1, 1)$, we have
$$\frac{c_3}{2c_4 + c_3} \le \frac{(1-t)A_t(x)}{(1+t)B_t(x)} \le \frac{2c_4 + c_3}{c_3}.$$
This implies that for $t \in (c_1, 1)$, $G_t|_K$, and therefore $|c_t f^1_{a_t,p_t} + d_t|\,|_K$, are uniformly bounded, which finishes the proof of the lemma.

A.12 THE FIRST MNIST CLASSIFIER IN 6.1

At the preprocessing stage, we project the $28 \times 28$ input pattern onto the first 40 principal components and then scale the result so that the scaled 40-dimensional PCA features lie within the unit ball. Our network is:

1. Input layer: scaled 40-dimensional PCA features;
2. First layer: 40 inputs/1000 outputs horocycle layer (tanh activation);
3. Last layer: 1000 inputs/10 outputs affine layer;
4. Loss: cross entropy loss.

We take learning rate = 1, learning rate decay = 0.999, and batch size = 128, and run the experiment three times. The average test error rate after 600 epochs is 1.96%. Our setting is comparable to that of LeCun et al. (1998)(C.3), where 40 PCA components are also used for the quadratic network. The quadratic network has a structure similar to ours, because our neurons are constructed by a quotient of quadratic functions followed by a log.
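The first (horocycle) layer of this network can be sketched in numpy as follows; this is an illustrative forward pass under our assumptions (random parameters, tanh activation), not the paper's training code.

```python
import numpy as np

def horocycle_layer(x, omegas, lams, bs):
    """Forward pass of a horocycle layer: out_j = tanh(lam_j * <x, omega_j>_H + b_j),
    with <x, omega>_H = (1/2) log((1 - |x|^2) / |x - omega|^2)."""
    feats = 0.5 * np.log((1 - np.dot(x, x)) / np.sum((x - omegas) ** 2, axis=1))
    return np.tanh(lams * feats + bs)

rng = np.random.default_rng(6)
n_in, n_out = 40, 1000
omegas = rng.normal(size=(n_out, n_in))
omegas /= np.linalg.norm(omegas, axis=1, keepdims=True)   # directions on S^{39}
lams = rng.normal(size=n_out)
bs = rng.normal(size=n_out)

x = rng.normal(size=n_in); x *= 0.5 / np.linalg.norm(x)   # a scaled PCA feature in the unit ball
out = horocycle_layer(x, omegas, lams, bs)
assert out.shape == (n_out,) and np.all(np.abs(out) <= 1)
```

In the actual classifier, a 1000-to-10 affine layer and a cross-entropy loss would follow this layer.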

A.13 APPROXIMATION OF THE MNIST CLASSIFICATION FUNCTION

Suppose the MNIST classification function $\mathcal{M}$ is defined on $\cup_{j=0}^{9} K_j \subset \mathbb{H}^{40}$, where the $K_j$ are relatively compact and $\mathcal{M}|_{K_j} = j$. By Theorem 2, for $0 \le j \le 9$, there exist $F_j(x) = \sum_{i=1}^{N_j} \alpha_{j,i}\,\rho(\lambda_{j,i}\langle x, \omega_{j,i}\rangle_H + b_{j,i})$ such that $F_j$ approximates $I_{K_j}$, where $I$ is the indicator function. Therefore, a network whose first (horocycle) layer is given by $\rho(\lambda_{j,i}\langle x, \omega_{j,i}\rangle_H + b_{j,i})$ ($0 \le j \le 9$, $1 \le i \le N_j$), followed by a classical MLR with parameters given by $\alpha_{j,i}$ ($0 \le j \le 9$, $1 \le i \le N_j$) (with argmax for prediction), approximates $\mathcal{M}$.

A.14 THE SECOND MNIST CLASSIFIER IN 6.1

At the preprocessing stage, we perform data augmentation by shifting each image one step toward each of its four corners, so that our training set has 300,000 examples. In our network,

1. Input layer: (28, 28, 1);
2. First block: 32-filter 3 × 3 convolution, ReLU, 2 × 2 max-pooling, BatchNorm;
3. Second block: 64-filter 3 × 3 convolution, ReLU, BatchNorm;
4. Third block: 64-filter 3 × 3 convolution, ReLU, 2 × 2 max-pooling, BatchNorm;

In each training, x is one of {animal, group, location, mammal, worker}, dim is one of {2, 3, 5, 10}, and Poincaré embeddings are from the animation_train.py of Ganea et al. (2018b) (with tree=wordnet_full, model=poincare, dim=dim, and seed randomly chosen from {7, 8, 9}). All nodes in the subtree rooted at x are divided into training nodes (80%) and test nodes (20%). The same splitting procedure applies to the remaining nodes. We choose the s that has the best training F1, and then record the corresponding test F1. For each x and dim, we run the training 100 times. The average test F1 classification scores are recorded in Table 2. The horocycle feature performs well here because it is compatible with the Poincaré embedding algorithm. Let x be a node that is not at the origin. The Poincaré embedding algorithm appears to pull all nodes of the subtree rooted at x toward the direction of $\frac{x}{|x|}$; therefore $y \mapsto \langle y, \frac{x}{|x|}\rangle_H$ is a suitable feature for this task.
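The intuition in the last sentence can be illustrated with a toy example. The sketch below (our illustration with a hypothetical, synthetic "embedding"; not a real Poincaré embedding) shows that when the subtree of x clusters around the direction $x/|x|$, the single horocycle feature already separates subtree nodes from the rest, so a one-predictor logistic regression suffices.

```python
import numpy as np

def horocycle_feature(y, omega):
    """<y, omega>_H = (1/2) log((1 - |y|^2) / |y - omega|^2)."""
    return 0.5 * np.log((1 - np.dot(y, y)) / np.dot(y - omega, y - omega))

# Hypothetical toy data: nodes of the subtree rooted at x cluster around
# the direction x/|x|; the remaining nodes point elsewhere.
x = np.array([0.5, 0.0])
omega = x / np.linalg.norm(x)

rng = np.random.default_rng(3)
subtree = 0.8 * omega + 0.02 * rng.normal(size=(50, 2))
others = -0.8 * omega + 0.02 * rng.normal(size=(50, 2))

# The feature values of the two groups do not overlap.
lo = min(horocycle_feature(y, omega) for y in subtree)
hi = max(horocycle_feature(y, omega) for y in others)
assert lo > hi
```

With real Poincaré embeddings the groups overlap more, which is why the paper fits a logistic regression on the feature rather than a hard threshold.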

A.16 END-BASED CLUSTERING IN $\mathbb{H}^2$

For MNIST, at the preprocessing stage, we perform data augmentation by shifting each image one step toward each of its four corners, so that our training set has 300,000 examples. Our network for the $\mathbb{H}^2$ embedding of the MNIST dataset is:

1. Input layer: (28, 28, 1);
2. First block: 32-filter 3 × 3 convolution, ReLU, 2 × 2 max-pooling, BatchNorm;

where Exp is the exponential map $T_o\mathbb{H}^2 (= \mathbb{R}^2) \to \mathbb{H}^2$. We apply the data augmentation as in A.14. In optimization, the learning rate is 0.1, the learning rate decay is 0.99, the batch size is 128, and the number of epochs is 50. Our network, data augmentation, and optimization for the $\mathbb{H}^2$ embedding of the Fashion-MNIST dataset are exactly the same as those for MNIST. For MNIST and Fashion-MNIST we use sphere optimization. We would like to remark that sphere optimization has interesting new features. Because $S^1$ is compact, for any continuous function $f$ there exists $x = \mathrm{argmax}_{S^1} f$. The derivative of $f$ at $x$ vanishes, so the usual optimization algorithm for finding the minimum fails in the general case. In our experiments, we solve this problem with the following tricks:

1. Observation: if the class $C_\alpha$ lies entirely close to some $\omega \in S^1$ and the end prototype $\omega_\alpha$ for the class $C_\alpha$ is around $-\omega$, then $\omega_\alpha$ is a maximum point of the loss function and therefore cannot be improved through normal SGD. We solve this problem by adopting an idea (a supervised variation) of k-means clustering. In each early epoch, optimization consists of two parts. In the first part, normal SGD applies. In the second part, we move the end prototypes $\omega_i$ to the average direction of their class (using training data).
2. Observation: if the classes $C_\alpha$ and $C_\beta$ are both close to some $\omega \in S^1$, and the end prototypes $\omega_\alpha, \omega_\beta$ are also both around $\omega$, then all points in the classes $C_\alpha$ and $C_\beta$, as well as the end prototypes $\omega_\alpha, \omega_\beta$, will be pulled toward $\omega$ by SGD, and the network will eventually fail to distinguish $C_\alpha$ from $C_\beta$. We solve this problem by adding a loss when two prototypes are close.
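The prototype update in the first observation can be sketched as follows; this is our minimal numpy illustration of the supervised k-means-style step (function names and the toy data are ours), not the paper's code.

```python
import numpy as np

def recenter_prototypes(features, labels, num_classes):
    """Supervised k-means-style update (Observation 1): move each end prototype
    omega_i to the average direction of its class, projected back onto S^1."""
    protos = np.zeros((num_classes, 2))
    for i in range(num_classes):
        mean = features[labels == i].mean(axis=0)
        protos[i] = mean / np.linalg.norm(mean)
    return protos

# Toy check: two classes concentrated on opposite sides of the circle.
rng = np.random.default_rng(4)
angles = np.concatenate([rng.normal(0.0, 0.1, 100), rng.normal(np.pi, 0.1, 100)])
feats = np.stack([np.cos(angles), np.sin(angles)], axis=1)
labels = np.array([0] * 100 + [1] * 100)

protos = recenter_prototypes(feats, labels, 2)
assert np.allclose(np.linalg.norm(protos, axis=1), 1.0)   # prototypes stay on S^1
assert np.dot(protos[0], np.array([1.0, 0.0])) > 0.99     # class-0 prototype near its class
```

This step moves a badly initialized prototype out of the loss maximum that plain SGD cannot escape.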
With these small tricks, our 2D end-based clustering algorithm is very stable for MNIST and Fashion-MNIST. We ran it on MNIST 10 times, and every run reached a test accuracy around 99% within 20 epochs. Suppose the classification task has $M$ classes and the prototype of the $i$-th class is $\omega_i$. The additional loss function for the second observation is as follows:
$$i = \mathrm{RandomChoice}(\{1, \ldots, M\}), \qquad j = \mathrm{RandomChoice}(\{1, \ldots, M\} \setminus \{i\}),$$
$$d = (\omega_i, \omega_j)_E, \qquad L_{\mathrm{Observation2}} = \mathrm{arctanh}(10 \times \mathrm{ReLU}(d - 0.9 - \epsilon)),$$
where $\epsilon$ is a small constant for numerical stability.

For CIFAR-10, our network for the $\mathbb{H}^2$ embedding of the CIFAR-10 dataset is of a ResNet-32 structure; see the code for details.

A.18.2 THE BUSEMANN FUNCTION VIEWPOINT

Fix $\omega \in S^{n-1}$, and let $c : [0, \infty) \to \mathbb{H}^n$ be the unique unit-speed geodesic ray that satisfies $c(0) = (0, \ldots, 0)$ and $c(\infty) = \omega$. For the purposes of this section, we do not need the definition of Busemann functions (we refer the interested reader to Bridson & Haefliger (2009)[II.8]). Instead, it suffices to know the following result of the theory: let $d_{\mathbb{H}^n}$ be the hyperbolic distance function; then for any $x \in \mathbb{H}^n$,
$$\lim_{t\to\infty}\big(d_{\mathbb{H}^n}(x, c(t)) - d_{\mathbb{H}^n}(c(0), c(t))\big) = -2\langle x, \omega\rangle_H. \tag{25}$$
We read (25) in the following way: for fixed $t > 0$,
$$x \in \mathbb{H}^n \mapsto d_{\mathbb{H}^n}(x, c(t)) - d_{\mathbb{H}^n}(c(0), c(t))$$
is a function that measures the relative distance of $\{x, c(0)\}$ from $c(t)$; therefore the left-hand side of (25),
$$x \in \mathbb{H}^n \mapsto \lim_{t\to\infty}\big(d_{\mathbb{H}^n}(x, c(t)) - d_{\mathbb{H}^n}(c(0), c(t))\big),$$
is a function that measures the relative distance of $\{x, c(0)\}$ from $\lim_{t\to\infty} c(t) = \omega$; and finally the right-hand side of (25),
$$x \in \mathbb{H}^n \mapsto -2\langle x, \omega\rangle_H,$$
is a function that measures the relative distance of $\{x, c(0)\}$ from $\omega$. Moreover, if the geodesic ray $c$ starts from a different $y \in \mathbb{H}^n$, then a corresponding bias term is added to (25). Therefore, for any $b \in \mathbb{R}$ and $\omega \in S^{n-1}$,
$$x \in \mathbb{H}^n \mapsto \mathrm{Dist}(x, \omega, b) = -2\langle x, \omega\rangle_H + b$$
is a function that measures the relative distance of points of $\mathbb{H}^n$ from $\omega$.
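The limit (25) is easy to check numerically in the Poincaré ball. The sketch below assumes the standard curvature $-1$ distance formula $d(u,v) = \mathrm{arccosh}\big(1 + 2|u-v|^2/((1-|u|^2)(1-|v|^2))\big)$ and the unit-speed ray $c(t) = \tanh(t/2)\,\omega$; these are our assumptions, not formulas quoted from the paper.

```python
import numpy as np

def d_H(u, v):
    """Hyperbolic distance in the Poincare ball model (curvature -1)."""
    return np.arccosh(1 + 2 * np.dot(u - v, u - v) /
                      ((1 - np.dot(u, u)) * (1 - np.dot(v, v))))

def horocycle_feature(x, omega):
    """<x, omega>_H = (1/2) log((1 - |x|^2) / |x - omega|^2)."""
    return 0.5 * np.log((1 - np.dot(x, x)) / np.dot(x - omega, x - omega))

omega = np.array([0.6, 0.8])                  # a boundary point on S^1
rng = np.random.default_rng(7)
x = rng.normal(size=2); x *= 0.4 / np.linalg.norm(x)
o = np.zeros(2)

t = 20.0
c_t = np.tanh(t / 2) * omega                  # unit-speed geodesic ray from the origin toward omega

# (25): d(x, c(t)) - d(o, c(t)) -> -2 <x, omega>_H as t -> infinity.
lhs = d_H(x, c_t) - d_H(o, c_t)
assert abs(lhs - (-2 * horocycle_feature(x, omega))) < 1e-5
```

At $t = 20$ the ray is already so close to the boundary that the difference of distances agrees with $-2\langle x,\omega\rangle_H$ to within floating-point accuracy.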



If $w \ne (0, \ldots, 0)$, one can take $\omega = w/|w|$ and $\lambda = |w|$; otherwise, one can take $\lambda = 0$ and any $\omega \in S^{n-1}$. One often takes $\rho(x) = x^2$ in metric learning, which is improper here because $\mathrm{Dist}(x)$ could be negative.

https://github.com/dalab/hyperbolic_cones



Figure 1: (Left) $\rho((\cdot, \omega)_E)$; (middle) $\rho(\langle \cdot, \omega\rangle_H)$; (right) $\rho(f^1_{a,p}(\cdot))$. In this figure, $\omega = (1, 0)$, $a = (1, 0)$, $p = (0.5, 0)$, and $\rho$ is tanh. The colorbar represents function values.

Figure 2: (Left) $A_iO$ are pairwise parallel lines in the sense of Gauss (Bonola, 2012); (Middle) a family of horocycles tangential to $O$; (Right) each horocycle is a locus of corresponding points (Bonola, 2012)[p.73] of parallel lines. This fact justifies that horocycles are analogs of hyperplanes.

Based on functions similar to $f^1_{a,p}$, Mathieu et al. (2019) and Shimizu et al. (2020) build the gyroplane layer and the Poincaré FC layer, respectively. Ganea et al. (2018a); Cho et al. (

Figure 3: (Left) Horocycle decision hypersurface. The colored points form a Poincaré embedding of WordNet nouns in $\mathbb{H}^2$. The yellow point is group.n.01. Points from the subtree rooted at group.n.01 are in red, and the rest are in blue. The green horocycle separates the blue points from the red. (Right) End-based clustering and geodesic decision hypersurfaces for an embedding of CIFAR-10 in $\mathbb{H}^2$. Different classes are in different colors. Thin diamonds on $S^1$ are prototypes. Decision regions are separated by geodesic decision hypersurfaces.

Figure 4: Prediction probabilities of classifiers. Suppose the classification task has 10 classes. Let $NN_\theta : \mathbb{R}^D \to \mathbb{R}^2$ be the feature descriptor, $X$ the input, and $x = NN_\theta(X)$. Let $w_i = (\cos((i-1)\pi/5), \sin((i-1)\pi/5))$ ($1 \le i \le 10$). (Left) $p_\theta(Y = C_1|x)$ on $\mathbb{R}^2$ when the score functions are Poisson neurons $SC_j(X) = \mathrm{BatchNorm}(P^\rho_{w_j,-1,0}(x))$. (Right) $p_\theta(Y = C_1|x)$ on $\mathbb{R}^2$ when the score functions are $SC_j(X) = (x, w_j)_E$.

Figure 4 illustrates that the high-confidence prediction regions (deep red areas) of the Poisson MLR are compact sets, in contrast to classical classifiers (Hein et al., 2019)[Theorem 3.1]. We shall use this figure to explain an experiment in Section 6.4.

END-BASED CLUSTERING FOR 2D DIMENSION REDUCTION

In this experiment, we use the horocycle MLR (Section 4.2) to construct end-based clusterings $NN_\theta : \mathbb{R}^D \to \mathbb{H}^2$ for MNIST, Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky, 2012). We take $NN_\theta = \mathrm{Exp} \circ \overline{NN}_\theta$, where Exp is the exponential map of $\mathbb{H}^2$ and $\overline{NN}_\theta : \mathbb{R}^D \to \mathbb{R}^2$ is a network with four convolutional blocks for MNIST/Fashion-MNIST or a ResNet-32 structure for CIFAR-10. A.16 and the code contain details.

Figure 5: End-based clusters of MNIST, Fashion-MNIST, and CIFAR-10. Test error rates from left to right: 0.63%, 9.98%, 8.42%. The performance for CIFAR-10 is better than for Fashion-MNIST because $\overline{NN}_\theta$ is of a ResNet-32 structure for CIFAR-10 but has only four convolutional layers for Fashion-MNIST. Thin diamonds are prototypes, and shaded areas are decision regions.

Figure 6: Keras' MLR and our Poisson MLR on the task of flowers. Figure 4 explains this experiment. In the earlier epochs of training, the feature vectors $NN_\theta(X)$ of the Poisson model are not yet close to its compact high-confidence prediction regions (deep red areas), and therefore, on the test data, it is not much better than a random guess. In the end, feature vectors lying in these compact regions have small intra-class variations, which is good for generalization (Yang et al., 2018).

Figure 7: Poisson neuron $P^{\tanh}_{(1,0),-1/3,0}$. The level sets of a Poisson neuron $P^\rho_{w,\lambda,b}$ are horocycles of the ball $\{x : |x| < |w|\}$ that are tangential to $\{x : |x| = |w|\}$ at $w$, together with their inverses with respect to $\{x : |x| = |w|\}$.

3. Second block: 64-filter 3 × 3 convolution, ReLU, BatchNorm;
4. Third block: 64-filter 3 × 3 convolution, ReLU, 2 × 2 max-pooling, BatchNorm;
5. Fourth block: 128-filter 3 × 3 convolution, ReLU, 2 × 2 max-pooling, BatchNorm;
6. Fifth block: FC 1000, ReLU, BatchNorm;
7. Sixth block: FC 2, ReLU, BatchNorm, Exp;
8. Last block: 2 inputs/10 outputs horocycle layer, sigmoid;
9. Loss: cross entropy loss.

The average test error rate after 600 epochs is 1.96%, and Theorem 2 provides the rationale for this experiment (A.13). The second one is a Poisson MLR. It is the best hyperbolic-geometry-related MNIST classifier (Table 1). In this table, Ontrup & Ritter (2005) use the hyperbolic SOM, Grattarola et al. (2019) use the adversarial autoencoder, and Khrulkov et al. (2020) use the hyperbolic MLR. That our experiment performs well on MNIST suggests that horocycle and Poisson neurons are computationally efficient and coordinate easily with classical learning tools (such as the convolutional layer and the softmax).

Table 1: Test error rates of hyperbolic-geometry-related MNIST classifiers.

Given a Poincaré embedding (Nickel & Kiela, 2017) $PE : \{\text{WordNet noun}\} \to \mathbb{H}^D$ of 82,114 nouns and a node $x \in \{\text{WordNet noun}\}$, the task is to classify all other nodes as being part of the subtree rooted at $x$ or not (Ganea et al., 2018a). Our model is logistic regression, where the horocycle feature $p \in \{\text{WordNet noun}\} \mapsto h_{PE(x)}(PE(p)/s)$ ($s$ is a hyperparameter lying in $[1, 1.5]$) is the only predictor, and the dependent variable is whether $p$ is in the subtree rooted at $x$. The decision hypersurface of this model is a horocycle, as illustrated in Figure 3 (left). In the experiment, we pre-train three different Poincaré embeddings in each of $\mathbb{H}^2$, $\mathbb{H}^3$, $\mathbb{H}^5$, $\mathbb{H}^{10}$. For each $x \in \{\text{animal, group, location, mammal, worker}\}$ and $D \in \{2, 3, 5, 10\}$, we randomly select one of the three pre-trained Poincaré embeddings $PE : \{\text{WordNet noun}\} \to \mathbb{H}^D$ and then test the model. Table 2 reports the F1 classification scores and two standard deviations over 100 trials for each $\{x, D\}$. Different Poincaré embeddings account for most of the variance in performance. Our model is different from the existing ones. Firstly, we take the horocycle as the decision hypersurface, while others take the geodesic.
Secondly, we train a logistic regression on top of the horocycle feature attached to $PE(x)$, which is efficiently calculated, while others train the hyperbolic MLR with different parametrizations. In terms of the number of parameters, we have three (independent of $D$), Ganea et al. (2018a) have $2D$, and Shimizu et al. (2020) have $D + 1$. The number of parameters explains why our model is prominent in low dimensions.

Table 2: Average test F1 classification scores (%) for five subtrees of the WordNet noun tree.

Figure 5 illustrates end-based clusterings for MNIST, Fashion-MNIST, and CIFAR-10, with performance reported in the caption. Our accuracy for Fashion-MNIST is 8% higher than all numbers presented in McInnes et al. (2020). Moreover, Table 3 compares the numbers of Yang et al. (2018), Ghosh & Kirby (2020), and ours for MNIST; our methods are similar. We all use convolutional networks as the feature descriptor and prototype-based functions as the loss. However, Yang et al.

Table 3: Test error rates on 2D-embedded MNIST by dimensionality reduction techniques.


