RANDOM LAPLACIAN FEATURES FOR LEARNING WITH HYPERBOLIC SPACE

Abstract

Due to its geometric properties, hyperbolic space can support high-fidelity embeddings of tree- and graph-structured data, upon which various hyperbolic networks have been developed. Existing hyperbolic networks encode geometric priors not only for the input, but also at every layer of the network. This approach involves repeatedly mapping to and from hyperbolic space, which makes these networks complicated to implement, computationally expensive to scale, and numerically unstable to train. In this paper, we propose a simpler approach: learn a hyperbolic embedding of the input, then map once from it to Euclidean space using a mapping that encodes geometric priors by respecting the isometries of hyperbolic space, and finish with a standard Euclidean network. The key insight is to use a random feature mapping via the eigenfunctions of the Laplace operator, which we show can approximate any isometry-invariant kernel on hyperbolic space. Our method can be used together with any graph neural network: using even a linear graph model yields significant improvements in both efficiency and performance over other hyperbolic baselines in both transductive and inductive tasks.

1. INTRODUCTION

Real-world data contains various structures that resemble non-Euclidean spaces: for example, data with tree or graph structure such as citation networks (Sen et al., 2008), social networks (Hoff et al., 2002), biological networks (Rossi & Ahmed, 2015), and natural language (e.g., taxonomies and lexical entailment) where latent hierarchies exist (Nickel & Kiela, 2017). Graph-style data features in a range of problems, including node classification, link prediction, relation extraction, and text classification. It has been shown both theoretically and empirically (Bowditch, 2006; Nickel & Kiela, 2017; 2018; Chien et al., 2022) that hyperbolic space, the geometry with constant negative curvature, is naturally suited for representing (i.e., embedding) such data and capturing implicit hierarchies, outperforming Euclidean baselines. For example, Sala et al. (2018) show that hyperbolic space can embed trees without loss of information (with arbitrarily low distortion), which cannot be achieved by Euclidean space of any dimension (Chen et al., 2013; Ravasz & Barabási, 2003).

Presently, most well-known and well-established deep neural networks are built in Euclidean space. The standard approach is to pass the input to a Euclidean network and hope the model can learn the features and embeddings. But this flat-space approach can encode the wrong prior in tasks for which we know the underlying data has a different geometric structure, such as the hyperbolic-space structure implicit in tree-like graphs. Motivated by this, there is an active line of research on developing ML models in hyperbolic space $\mathbb{H}^n$. Starting from hyperbolic neural networks (HNN) by Ganea et al. (2018), a variety of hyperbolic networks were proposed for different applications, including HNN++ (Shimizu et al., 2020), hyperbolic variational auto-encoders (HVAE, Mathieu et al. (2019)), hyperbolic attention networks (HATN, Gulcehre et al. (2018)), hyperbolic graph convolutional networks (HGCN, Chami et al.
(2019)), hyperbolic graph neural networks (HGNN, Liu et al. (2019)), and hyperbolic graph attention networks (HGAT, Zhang et al. (2021a)). The strong empirical results of HGCN and HGNN in particular on node classification, link prediction, and molecular and chemical property prediction show the power of hyperbolic geometry for graph learning.

These hyperbolic networks adopt hyperbolic geometry at every layer of the model. Since hyperbolic space is not a vector space, operations such as addition and multiplication are not well-defined; neither are matrix-vector multiplication and convolution, which are key components of a deep model that uses hyperbolic geometry at every layer. A common solution is to treat hyperbolic space as a gyro-vector space by equipping it with a non-commutative, non-associative addition and multiplication, allowing hyperbolic points to be processed as features in a neural network forward pass. However, this complicates the use of hyperbolic geometry in neural networks because it imposes an extra structure on hyperbolic space beyond its manifold properties, making the approach somewhat non-geometric. A second problem with using hyperbolic points as intermediate features is that these points can stray far from the origin (just as Euclidean DNNs require high dynamic range; Kalamkar et al., 2019), especially for deeper networks. This can cause significant numerical issues when the space is represented with ordinary floating-point numbers: the representation error is unbounded and grows exponentially with the distance from the origin. Much careful hyperparameter tuning is required to avoid this "NaN problem" (Sala et al., 2018; Yu & De Sa, 2019; 2021). These issues call for a simpler and more principled way of using hyperbolic geometry in DNNs.

In this paper, we propose such a simple approach for learning with hyperbolic space.
The insight is to (1) encode the hyperbolic geometric priors only at the input via an embedding into hyperbolic space, which is then (2) mapped once into Euclidean space by a random feature mapping $\phi : \mathbb{H}^n \to \mathbb{R}^d$ that (3) respects the geometry of hyperbolic space in that its induced kernel $k(x, y) = \mathbb{E}[\langle\phi(x), \phi(y)\rangle]$ is isometry-invariant, i.e., $k(x, y)$ depends only on the hyperbolic distance between $x$ and $y$, followed by (4) passing these Euclidean features through some downstream Euclidean network. This approach both avoids the numerical issues common in previous approaches (since hyperbolic space is only used once early in the network, numerical errors will not compound) and eschews the need for augmenting hyperbolic space with any additional non-geometric structure (since we base the mapping only on geometric distances in hyperbolic space). Our contributions are as follows:
• In Section 4 we propose a random feature extraction called HyLa which can be sampled to be an unbiased estimator of any isometry-invariant kernel on hyperbolic space. This generalizes the classic method of random Fourier features proposed for Euclidean space by Rahimi et al. (2007).
• In Section 5 we show how to adopt HyLa in an end-to-end graph learning architecture that simultaneously learns the embedding of the initial objects and the Euclidean graph learning model.
• In Section 6, we evaluate our approach empirically. Our HyLa-networks demonstrate better performance, scalability and computation speed than existing hyperbolic networks: HyLa-networks consistently outperform HGCN, even on a tree dataset, with 12.3% improvement while being 4.4× faster. Meanwhile, we argue that our method is an important hyperbolic baseline to compare against due to its simple implementation and compatibility with any graph learning model.

2. RELATED WORK

Hyperbolic space. $n$-dimensional hyperbolic space $\mathbb{H}^n$ is usually defined and used via a model, a representation of $\mathbb{H}^n$ within Euclidean space. Common choices include the Poincaré ball (Nickel & Kiela, 2017) and the Lorentz hyperboloid model (Nickel & Kiela, 2018). We develop our approach using the Poincaré ball model, but our methodology is independent of the model and can be applied to other models. The Poincaré ball model is the Riemannian manifold $(\mathbb{B}^n, g_p)$ with $\mathbb{B}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ being the open unit ball, and the Riemannian metric $g_p$ and metric distance $d_p$ being

$$g_p(x) = 4\,(1 - \|x\|^2)^{-2}\, g_e \quad\text{and}\quad d_p(x, y) = \operatorname{arcosh}\left(1 + \frac{2\,\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right),$$

where $g_e$ is the Euclidean metric. To encode geometric priors into neural networks, many versions of hyperbolic neural networks have been proposed. But while (matrix-) addition and multiplication are essential to develop a DNN, hyperbolic space is not a vector space with well-defined addition and multiplication. To handle this issue, several approaches were proposed in the literature.

Gyrovector space. Many hyperbolic networks, including HNN (Ganea et al., 2018), HNN++ (Shimizu et al., 2020), HVAE (Mathieu et al., 2019), HGAT (Zhang et al., 2021a), and GIL (Zhu et al., 2020), adopt the framework of gyrovector space as an algebraic formalism for hyperbolic geometry, by equipping hyperbolic space with non-associative addition and multiplication: Möbius addition $\oplus$ and Möbius scalar multiplication $\otimes$, which are defined for $x, y \in \mathbb{B}^n$ and a scalar $r \in \mathbb{R}$ as

$$x \oplus y := \frac{(1 + 2\langle x, y\rangle + \|y\|^2)\,x + (1 - \|x\|^2)\,y}{1 + 2\langle x, y\rangle + \|x\|^2\|y\|^2}, \qquad r \otimes x := \tanh\left(r \tanh^{-1}(\|x\|)\right)\frac{x}{\|x\|}.$$

However, Möbius addition and multiplication are complicated and have a high computation cost; high-level operations such as Möbius matrix-vector multiplication are even more complicated and numerically unstable (Yu & De Sa, 2021; Yu et al., 2022), due to the use of ill-conditioned functions like $\tanh^{-1}$.
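As a minimal sketch of the formulas above, the following standard-library Python evaluates the Poincaré distance, Möbius addition, and Möbius scalar multiplication on plain lists of floats. The function names (`dot`, `poincare_dist`, `mobius_add`, `mobius_scalar`) and the sample points are illustrative assumptions, not part of any released implementation.

```python
import math

def dot(x, y):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def poincare_dist(x, y):
    """Hyperbolic distance d_p(x, y) between two points of the open unit ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - dot(x, x)) * (1.0 - dot(y, y))
    return math.acosh(1.0 + 2.0 * diff2 / denom)

def mobius_add(x, y):
    """Mobius addition of x and y in the Poincare ball."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * xy + y2
    coef_y = 1.0 - x2
    denom = 1.0 + 2.0 * xy + x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def mobius_scalar(r, x):
    """Mobius scalar multiplication of x by the real number r."""
    norm = math.sqrt(dot(x, x))
    if norm == 0.0:
        return list(x)
    scale = math.tanh(r * math.atanh(norm)) / norm
    return [scale * a for a in x]

# Illustrative points inside the unit disk.
x, y = [0.3, 0.1], [-0.2, 0.5]
print("d_p(x, y) =", poincare_dist(x, y))
print("x (+) y   =", mobius_add(x, y))
print("y (+) x   =", mobius_add(y, x))  # differs from x (+) y: not commutative
print("2 (x) x   =", mobius_scalar(2.0, x))
```

Even in this small form, each operation needs several inner products and a division, and `math.atanh` becomes ill-conditioned as the norm approaches 1, which is consistent with the cost and stability concerns raised above.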
Also problematic is the way hyperbolic space is treated as a gyrovector space rather than a manifold, meaning this approach cannot be generalized to other manifolds that lack this structure.

Push-forward & Pull-backward. Since many operations are well-defined in Euclidean space but not in hyperbolic space, a natural idea is to map $\mathbb{H}^n$ to $\mathbb{R}^d$ via some mappings, apply well-defined operations in $\mathbb{R}^d$, then map the results back to hyperbolic space. Many works, including HATN (Gulcehre et al., 2018), HGCN (Chami et al., 2019), HGNN (Liu et al., 2019), HGAT (Zhang et al., 2021a), Lorentzian GCN (Zhang et al., 2021b), and GIL (Zhu et al., 2020), adopt this method:
• pull the hyperbolic points to Euclidean space with a "pull-backward" mapping;
• apply operations such as multiplication and convolution in Euclidean space; and then
• push the resulting Euclidean points to hyperbolic space with a "push-forward" mapping.

Since hyperbolic space and Euclidean space are different spaces, no isomorphic maps exist between them. A common choice of the mappings (Chami et al., 2019; Liu et al., 2019) is the exponential map $\exp_x(\cdot)$ and logarithm map $\log_x(\cdot)$, where $x$ is usually chosen to be the origin $O$. The exponential and logarithm maps are mappings between the hyperbolic space and its tangent space $T_x\mathbb{H}^n$, which is a Euclidean space containing gradients. The exponential map maps a vector $v \in T_x\mathbb{H}^n$ at $x$ to another point $\exp_x(v)$ in $\mathbb{H}^n$ (intuitively, $\exp_x(v)$ is the point reached by starting at $x$ and moving in the direction of $v$ a distance $\|v\|$ along the manifold), while the logarithm map inverts this. This approach is more straightforward and natural in the sense that hyperbolic space is only treated as a manifold object with no additional structure, so it can be generalized to general manifolds (although it does privilege the origin). However, $\exp_O(\cdot)$ and $\log_O(\cdot)$ are still complicated and numerically unstable.
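To illustrate the instability, here is a small sketch of the two maps at the origin. It assumes the common closed forms for the curvature $-1$ Poincaré ball, $\exp_O(v) = \tanh(\|v\|)\,v/\|v\|$ and $\log_O(x) = \tanh^{-1}(\|x\|)\,x/\|x\|$ with $\|\cdot\|$ the Euclidean norm (the convention of Ganea et al. (2018) with $c = 1$). The names `exp0` and `log0` are hypothetical.

```python
import math

def exp0(v):
    """Exponential map at the origin of the Poincare ball (assumed curvature -1)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * a for a in v]

def log0(x):
    """Logarithm map at the origin; inverse of exp0 for points strictly inside the ball."""
    norm = math.sqrt(sum(a * a for a in x))
    if norm == 0.0:
        return list(x)
    scale = math.atanh(norm) / norm  # math.atanh raises ValueError at norm == 1.0
    return [scale * a for a in x]

# A moderate tangent vector survives the round trip up to rounding error.
v = [0.6, -0.8]
print("round trip:", log0(exp0(v)))

# A tangent vector of Euclidean norm 20 does not: tanh(20) rounds to 1.0 in
# double precision, so the image lies numerically on the boundary of the ball.
far = [20.0, 0.0]
try:
    log0(exp0(far))
except ValueError:
    print("round trip failed: the point was rounded onto the boundary")
```

In an array library the same failure would typically surface as `inf` or `nan` rather than an exception, which is the "NaN problem" mentioned in the introduction.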
Both push-forward and pull-backward mappings are used at every hyperbolic layer, which incurs a high computational cost in both the forward and backward passes of the model. As a result, this prevents hyperbolic networks from scaling to large graphs. Moreover, the push-forward and pull-backward mappings act more like nonlinearities than producers of meaningful features.

Kernel Methods and Horocycle Features. Cho et al. (2019) proposed hyperbolic kernel SVM for nonlinear classification without resorting to ill-fitting tools developed for Euclidean space. Their approach differs from ours in that they map hyperbolic points to another (higher-dimensional) hyperbolic feature space, rather than a Euclidean feature space. They also constructed feature mappings only for the Minkowski inner product kernel: it is unknown how to construct feature mappings of their type for general kernels. Another work by Fang et al. (2021) develops several valid positive definite kernels in hyperbolic spaces and investigates their usages; they do not provide any sampling-based features to approximate these kernels. Wang (2020) constructed hyperbolic neuron models using a push-forward mapping along with the hyperbolic Poisson kernel $P_n(x, \omega) = \left(\frac{1 - \|x\|^2}{\|x - \omega\|^2}\right)^{n-1}$ for $x \in \mathbb{B}^n$, $\omega \in \partial\mathbb{B}^n$ as the backbone of an even more complicated feature function. Sonoda et al. (2022) theoretically propose a continuous version of shallow fully-connected networks on noncompact symmetric spaces (including hyperbolic space) using the Helgason-Fourier transform, where some network functions coincidentally share some similarities with the features proposed in this paper.

3. BACKGROUND

Laplace operator (Euclidean). The Laplace operator $\Delta$ on Euclidean space $\mathbb{R}^n$ is defined as the divergence of the gradient of a scalar function $f$, i.e., $\Delta f = \nabla \cdot \nabla f = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}$. The eigenfunctions of $\Delta$ are the solutions of the Helmholtz equation $-\Delta f = \lambda f$, $\lambda \in \mathbb{R}$, and can form an orthonormal basis for the Hilbert space $L^2(\Omega)$ when $\Omega \subset \mathbb{R}^n$ is compact (Gilbarg & Trudinger, 2015), i.e., a linear combination of them can represent any function/model that is $L^2$-integrable. A notable parameterization of these eigenfunctions is given by the plane waves $f(x) = \exp(i\langle\omega, x\rangle)$, where $\omega \in \mathbb{R}^n$ and $\langle\cdot, \cdot\rangle$ is the Euclidean inner product. A standard result given in Helgason (2022) states that any eigenfunction of $\Delta$ can be written as a linear combination of these plane waves (Theorem A.1). The famous result of Rahimi et al. (2007) used these eigenfunctions to construct feature maps, called random Fourier features, for arbitrary shift-invariant kernels in Euclidean space.

Theorem 3.1 (Bochner's theorem, Rudin (2017)). For any shift-invariant continuous kernel $k(x, y) = k(x - y)$ on $\mathbb{R}^n$, let $p(\omega)$ be its Fourier transform and $\xi_\omega(x) = \exp(i\langle\omega, x\rangle)$. Then $k$ is positive definite if and only if $p \ge 0$, in which case, if we draw $\omega$ proportionally to $p$,

$$k(x - y) = \int_{\mathbb{R}^n} p(\omega) \exp(i\langle\omega, x - y\rangle)\, d\omega = k(0) \cdot \mathbb{E}_{\omega \sim p}\left[\xi_\omega(x)\,\xi_\omega(y)^*\right].$$

Since both the probability distribution $p(\omega)$ and the kernel $k(x - y)$ are real, the integral is unchanged when we replace the exponential with a cosine. Rahimi et al. (2007) leveraged this to produce real-valued features by setting $z_{\omega,b}(x) = \sqrt{2}\cos(\langle\omega, x\rangle + b)$, arriving at a real-valued mapping that satisfies the condition $\mathbb{E}_{\omega \sim p,\, b \sim \mathrm{Unif}[0, 2\pi]}\left[z_{\omega,b}(x)\, z_{\omega,b}(y)\right] = k(x - y)$. Rahimi et al. (2007) approximated functions such as Gaussian, Laplacian and Cauchy kernels with this technique.

Laplace-Beltrami operator (Hyperbolic).
The Laplace-Beltrami operator $L$ is the generalization of the Laplace operator to Riemannian manifolds, defined as the divergence of the gradient for any twice-differentiable real-valued function $f$, i.e., $Lf = \nabla \cdot \nabla f$. In the $n$-dimensional Poincaré ball model $\mathbb{B}^n$, the Laplace-Beltrami operator takes the form (Agmon, 1987)

$$L = \frac{1}{4}(1 - \|x\|^2)^2 \sum_{i=1}^{n} \frac{\partial^2}{\partial x_i^2} + \frac{n-2}{2}(1 - \|x\|^2) \sum_{i=1}^{n} x_i \frac{\partial}{\partial x_i}.$$

Just as in the Euclidean case, the eigenfunctions of $L$ in hyperbolic space can be derived by solving the Helmholtz equation.

[Figure 1: the hyperplane $\xi(\omega, x)$ through $x$ perpendicular to $\omega$, at signed distance $\langle\omega, x\rangle$ from the origin $O$.]

We might hope to find analogs of the "plane waves" in hyperbolic space that are eigenfunctions of $L$. One way to approach this is via a geometric interpretation of plane waves. In the Euclidean case, for unit vector $\omega$ and scalar $\lambda$, $f(x) = \exp(i\lambda\langle\omega, x\rangle)$ is called a "plane wave" because it is constant on each hyperplane perpendicular to $\omega$. We can interpret $\langle\omega, x\rangle$ as the signed distance from the origin $O$ to the hyperplane $\xi(\omega, x)$ which contains $x$ and is perpendicular to $\omega$ (Figure 1). In the Poincaré ball model of hyperbolic space, the geometric analog of the hyperplane is the horocycle. For any $z \in \mathbb{B}^n$ and unit vector $\omega$ (i.e., $\omega \in \partial\mathbb{B}^n$), the horocycle $\xi(\omega, z)$ is the Euclidean circle that passes through $\omega$ and $z$ and is tangent to the boundary $\partial\mathbb{B}^n$ at $\omega$, as indicated in Figure 2. We let $\langle\omega, z\rangle_H$ denote the signed hyperbolic distance from the origin $O$ to the horocycle $\xi(\omega, z)$. In the Poincaré ball model, this takes the form

$$\langle\omega, z\rangle_H = \log\left(\frac{1 - \|z\|^2}{\|z - \omega\|^2}\right).$$

We define the "hyperbolic plane waves" $\exp(\mu\langle\omega, z\rangle_H)$, where $\mu \in \mathbb{C}$. Unsurprisingly, they are indeed eigenfunctions of the hyperbolic Laplacian (Agmon, 1987):

$$L \exp(\mu\langle\omega, z\rangle_H) = \mu(\mu - n + 1) \exp(\mu\langle\omega, z\rangle_H).$$

Since we are interested in finding real eigenfunctions (via the same exp-to-cosine trick used in Rahimi et al. (2007)), we restrict our attention to $\mu$ that yield a real eigenvalue.
This happens when $\mu = \frac{n-1}{2} + i\lambda$ for real $\lambda$, in which case the eigenvalue is $\mu(\mu - n + 1) = -\lambda^2 - (n-1)^2/4$. Just as the Euclidean plane waves $\exp(i\langle\omega, x\rangle)$ span the eigenspaces of the Euclidean Laplacian, the same result holds for these "hyperbolic plane waves" (Theorem A.2).
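The eigenvalue stated above can be checked directly. The following worked computation is a short sketch that uses only the relation $L \exp(\mu\langle\omega, z\rangle_H) = \mu(\mu - n + 1)\exp(\mu\langle\omega, z\rangle_H)$; the decomposition $\mu = a + ib$ with real $a, b$ is notation introduced here for illustration.

```latex
% Imaginary part of the eigenvalue for a general exponent \mu = a + ib:
\mu(\mu - n + 1) = (a + ib)^2 - (n-1)(a + ib)
  = \bigl(a^2 - b^2 - (n-1)a\bigr) + i\,b\,\bigl(2a - (n-1)\bigr).
% The eigenvalue is therefore real iff b = 0 or a = (n-1)/2.
% Taking a = (n-1)/2 and b = \lambda:
\mu(\mu - n + 1)
  = \Bigl(\tfrac{n-1}{2} + i\lambda\Bigr)\Bigl(i\lambda - \tfrac{n-1}{2}\Bigr)
  = (i\lambda)^2 - \Bigl(\tfrac{n-1}{2}\Bigr)^2
  = -\lambda^2 - \frac{(n-1)^2}{4}.
```

The same value results for $\mu = \frac{n-1}{2} - i\lambda$, since the expression depends on $\lambda$ only through $\lambda^2$.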

4. HYLA: EUCLIDEAN FEATURES FROM HYPERBOLIC EMBEDDINGS

In this section, we present HyLa, a feature mapping that can approximate an isometry-invariant kernel over hyperbolic space $\mathbb{H}^n$ in the same way that the random Fourier features of Rahimi et al. (2007) approximate any shift-invariant kernel over $\mathbb{R}^n$. In place of the Euclidean plane waves, which are the eigenfunctions of the Euclidean Laplacian, here we derive our feature extraction using the hyperbolic plane waves, which are eigenfunctions of the hyperbolic Laplacian. Since the hyperbolic plane wave $\exp\left(\left(\frac{n-1}{2} - i\lambda\right)\langle\omega, z\rangle_H\right)$ is an eigenfunction of the real operator $L$ with real eigenvalue, so is this function multiplied by any phase $\exp(-ib)$, as is its real part. Call the result

$$\mathrm{HyLa}_{\lambda,b,\omega}(z) = \exp\left(\frac{n-1}{2}\langle\omega, z\rangle_H\right)\cos\left(\lambda\langle\omega, z\rangle_H + b\right). \tag{1}$$

This parameterization, which we call HyLa (for Hyperbolic Laplacian features), yields real-valued eigenfunctions of the Laplace-Beltrami operator with eigenvalue $-\lambda^2 - (n-1)^2/4$. HyLa eigenfunctions have the nice property that they are bounded in almost every direction: as $z$ approaches any point on the boundary of $\mathbb{B}^n$ except $\omega$, $\langle\omega, z\rangle_H$ approaches $-\infty$, so the factor $\exp\left(\frac{n-1}{2}\langle\omega, z\rangle_H\right)$ approaches 0. Note that HyLa eigenfunctions are invariant to isometries of the space: any isometric transformation of HyLa yields another HyLa eigenfunction with the same $\lambda$ but a transformed $\omega$ (depending on how the isometry acts on the boundary $\partial\mathbb{B}^n$). HyLa is easy to compute and is parameterized by continuous instead of discrete parameters.

[Figure: plot with horizontal axis $d(x, y)$.]

For a fixed $\lambda$, the kernel induced by these features (cf. Lemma A.3) is

$$k_\lambda(x, y) = \frac{1}{2} \cdot {}_2F_1\left(\frac{n-1}{2} + i\lambda,\ \frac{n-1}{2} - i\lambda;\ \frac{n}{2};\ \frac{1}{2}\left(1 - \cosh(d_H(x, y))\right)\right),$$

where ${}_2F_1$ is the hypergeometric function, defined via analytic continuation by the power series

$${}_2F_1(a, b; c; z) = \sum_{n=0}^{\infty} \frac{(a)_n (b)_n}{(c)_n} \frac{z^n}{n!} = 1 + \frac{ab}{c}\frac{z}{1!} + \frac{a(a+1)\,b(b+1)}{c(c+1)}\frac{z^2}{2!} + \cdots \qquad (|z| < 1).$$
Note that despite the presence of an $i$ in the formula, this kernel is clearly real because the hypergeometric function satisfies the properties ${}_2F_1(a, b; c; z) = {}_2F_1(b, a; c; z)$ and ${}_2F_1(a, b; c; z)^* = {}_2F_1(a^*, b^*; c^*; z^*)$. In practice, as with random Fourier features, instead of choosing one single $\lambda$, we select them at random from some distribution $\rho(\lambda)$. The resulting kernel will be an isometry-invariant kernel that depends on the distribution of $\lambda$, as follows:

$$k(x, y) = \frac{1}{2}\int_{-\infty}^{\infty} {}_2F_1\left(\frac{n-1}{2} + i\lambda,\ \frac{n-1}{2} - i\lambda;\ \frac{n}{2};\ \frac{1}{2}\left(1 - \cosh(d_H(x, y))\right)\right) \rho(\lambda)\, d\lambda. \tag{2}$$

This formula gives a way to derive the isometry-invariant kernel from a distribution $\rho(\lambda)$; if we are interested in finding a feature map for some particular kernel, we can invert this mapping to get a distribution for $\lambda$ which will produce the desired kernel.

Theorem 4.2. Suppose that $k(x, y) = k(d_H(x, y))$ is an isometry-invariant positive semidefinite kernel. Assume that a density $\rho(\lambda)$ associated with the kernel exists. Then

$$\rho(\lambda) = \lambda \tanh\left(\frac{\pi\lambda}{2}\right) \int_{\mathbb{B}^n}\int_{\partial\mathbb{B}^n} k(d_H(z, O)) \exp\left(\left(\frac{n-1}{2} - i\lambda\right)\langle\omega, z\rangle_H\right) d\omega\, dz,$$

i.e., $\rho(\lambda)$ is the spherical transform of the kernel, and if we draw $\lambda$ proportional to $\rho$, $\omega$ uniformly on $\partial\mathbb{B}^n$, and $b$ uniformly on $[0, 2\pi]$, then

$$k(0) \cdot \mathbb{E}\left[\mathrm{HyLa}_{\lambda,b,\omega}(x)\, \mathrm{HyLa}_{\lambda,b,\omega}(y)\right] = k_\lambda(x, y) = k(x, y).$$

Although Theorem 4.2 lets us find a HyLa distribution for any isometry-invariant kernel, the closed forms of many kernels in $\mathbb{H}^n$ are not available. For simplicity, rather than arriving at a distribution via this inverse, in this paper we instead focus on the case where $\rho$ is a Gaussian. This corresponds closely to a heat kernel (Grigor'yan & Noguchi, 1998), as illustrated in Figure 3 (left).

We will use HyLa eigenfunctions to produce Euclidean features from hyperbolic embeddings, using the same random-features approach as Rahimi et al. (2007). Concretely, to map from $\mathbb{H}^n$ to $\mathbb{R}^D$, we draw $D$ independent samples $\lambda_1, \ldots, \lambda_D$ from $\rho$, $D$ independent samples $\omega_1, \ldots, \omega_D$ uniformly from $\partial\mathbb{B}^n$, and $D$ independent samples $b_1, \ldots, b_D$ uniformly from $[0, 2\pi]$, and then output a feature map $\phi$ whose $k$th coordinate is

$$\phi_k(x) = \frac{1}{\sqrt{D}}\, \mathrm{HyLa}_{\lambda_k, b_k, \omega_k}(x).$$

It is easy to see that this will yield feature vectors with $\mathbb{E}[\langle\phi(x), \phi(y)\rangle] = k(x, y)$ as given in Eq. 2. We visualize the kernel $k(x, O)$ for $\rho(\lambda) = \mathcal{N}(0, 0.25)$ in the 2-dimensional Poincaré disk in Figure 3, evaluating the integral in Eq. 2 using Gauss-Hermite quadrature. In Figure 3 (right), we sample random HyLa features with $D = 1000$ and plot $\langle\phi(x), \phi(O)\rangle$. Visibly, the HyLa features approximate the kernel well. A discussion of the estimation error is provided in Appendix B.
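The sampling recipe above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' implementation: the helper names (`horocycle`, `sample_boundary`, `sample_hyla`, `hyla_features`) are hypothetical, $\rho$ is assumed to be a zero-mean Gaussian with standard deviation `s`, and the value `s = 0.5` is only an example.

```python
import math
import random

def horocycle(omega, z):
    """Signed horocycle distance <omega, z>_H = log((1 - |z|^2) / |z - omega|^2)."""
    z2 = sum(c * c for c in z)
    diff2 = sum((a - b) ** 2 for a, b in zip(z, omega))
    return math.log((1.0 - z2) / diff2)

def sample_boundary(n, rng):
    """Uniform sample from the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def sample_hyla(n, D, s, rng):
    """Draw D triples (lambda_k, omega_k, b_k); rho is assumed to be N(0, s^2)."""
    lams = [rng.gauss(0.0, s) for _ in range(D)]
    omegas = [sample_boundary(n, rng) for _ in range(D)]
    biases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    return lams, omegas, biases

def hyla_features(z, params):
    """Map a point z of the Poincare ball to a D-dimensional Euclidean feature vector."""
    lams, omegas, biases = params
    n, D = len(z), len(lams)
    feats = []
    for lam, omega, b in zip(lams, omegas, biases):
        p = horocycle(omega, z)
        value = math.exp(0.5 * (n - 1) * p) * math.cos(lam * p + b)
        feats.append(value / math.sqrt(D))
    return feats

rng = random.Random(0)
params = sample_hyla(n=2, D=4000, s=0.5, rng=rng)
origin, x = [0.0, 0.0], [0.4, -0.3]
phi_o, phi_x = hyla_features(origin, params), hyla_features(x, params)
print("k(O, O) estimate:", sum(a * a for a in phi_o))
print("k(x, O) estimate:", sum(a * b for a, b in zip(phi_x, phi_o)))
```

At the origin $\langle\omega, O\rangle_H = 0$, so each feature reduces to $\cos(b_k)/\sqrt{D}$ and the first estimate should be close to $1/2$, matching $k_\lambda(O, O) = \frac{1}{2}\,{}_2F_1(\cdot,\cdot;\cdot;0) = \frac{1}{2}$.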

Connection to Euclidean

In the general form $\mathrm{HyLa}_{\lambda,b,\omega}(z) = \sigma\left(\lambda\langle\omega, z\rangle_H + b\right)\exp\left(\frac{n-1}{2}\langle\omega, z\rangle_H\right)$, with a nonlinearity $\sigma$ in place of the cosine, HyLa generalizes Euclidean activations to hyperbolic space, with an extra factor $\exp\left(\frac{n-1}{2}\langle\omega, z\rangle_H\right)$ from the curvature of $\mathbb{H}^n$. From a functional perspective, any $f \in L^2(\mathbb{H}^n)$ can be expanded as an infinite linear combination (integral form) of HyLa (Theorem 4.3 in Sonoda et al. (2022)). This statement holds whenever the non-linearity $\sigma$ is a tempered distribution on $\mathbb{R}$, i.e., an element of the topological dual of the Schwartz test functions, which includes ReLU and $\cos$. Though the features will not approximate a kernel on hyperbolic space if $\sigma \ne \cos$, this suggests variants of HyLa with other nonlinearities may be interesting to study.

5. HYLA FOR GRAPH LEARNING

In this section, we show how to use HyLa to encode geometric priors for end-to-end graph learning.

Background on Graph Learning. A graph is defined formally as $G = (V, A)$, where $V$ represents the vertex set consisting of $n$ nodes and $A \in \mathbb{R}^{n \times n}$ represents the symmetric adjacency matrix. Besides the graph structure, each node in the graph has a corresponding $d$-dimensional feature vector: we let $X \in \mathbb{R}^{n \times d}$ denote the entire feature matrix for all nodes. A fraction of nodes are associated with a label indicating one (or multiple) categories they belong to. The node classification task is to predict the labels of nodes without labels or even of nodes newly added to the graph.

An important class of Euclidean graph learning models is the graph convolutional neural network (GCN) (Kipf & Welling, 2016; Defferrard et al., 2016). The GCN is widely used in graph tasks including semi-supervised learning for node classification, supervised learning for graph-level classification, and unsupervised learning for graph embedding. Many complex graph networks and GCN variants have been developed, such as the graph attention networks (GAT, Veličković et al. (2017)), FastGCN (Chen et al., 2018), GraphSage (Hamilton et al., 2017), and others (Velickovic et al., 2019; Xu et al., 2018). An interesting work for understanding GCN is simplifying GCN (SGC, Wu et al. (2019)): a linear model derived by removing the non-linearities in a $K$-layer GCN, $f(A, X) = \mathrm{softmax}(S^K X W)$, where $S$ is the "normalized" adjacency matrix with added self-loops and $W$ is the trainable weight. Note that the pre-processed features $S^K X$ can be computed before training, which enables large graph learning and greatly saves memory and computation cost.

End-to-End Learning with HyLa.
We propose a feature-extracted architecture via the following recipe: embed the data objects (graph nodes or features, detailed below) into some space (e.g., Euclidean or hyperbolic), map the embedding to Euclidean features $\bar{X}$ via the kernel transformation (e.g., RFF or HyLa), and finally apply a Euclidean graph learning model $f(A, \bar{X})$. This recipe only manipulates the input of the graph learning model and hence, this architecture can be used with any graph learning model. The graph model and the hyperbolic embedding are learned simultaneously with backpropagation. In theory, the embedding space can be any desired space; one just uses the same Laplacian recipe to construct features. Below we only show the pipeline for hyperbolic space, while we also include Euclidean space (with random Fourier features) as a baseline in our experiments.

Algorithm 1 End-to-End HyLa
input: $n$ objects, Poincaré ball $\mathbb{B}^{d_0}$, HyLa feature dimension $d_1$, adjacency matrix $A$, node feature matrix $X$, graph neural network $f$
initialize $Z \in \mathbb{R}^{n \times d_0}$ {hyperbolic embeddings}
sample boundary points matrix $\Omega \in \mathbb{R}^{d_1 \times d_0}$, eigenvalues $\Lambda \in \mathbb{R}^{d_1}$ and biases $B \in \mathbb{R}^{d_1}$ (broadcast across rows)
compute $P = \langle\Omega, Z\rangle_H \in \mathbb{R}^{n \times d_1}$ {horocycle distance}
compute $\bar{X} = \exp\left(\frac{d_0 - 1}{2} P\right) \cos(\Lambda \odot P + B)$ {HyLa}
if embedding features: $\bar{X} = X\bar{X} \in \mathbb{R}^{n \times d_1}$
return $Y = f(A, \bar{X})$ {e.g., SGC}

Directly Embed Graph Nodes. We embed each node into a low-dimensional hyperbolic space $\mathbb{B}^{d_0}$, giving a hyperbolic embedding $Z \in \mathbb{R}^{n \times d_0}$ for all nodes in the graph, which can be either a pretrained fixed embedding or parameters learned together with the subsequent graph learning model during training. To compute with the HyLa eigenfunctions, first sample $d_1$ points uniformly from the boundary $\partial\mathbb{B}^{d_0}$ to get $\Omega \in \mathbb{R}^{d_1 \times d_0}$, then sample $d_1$ eigenvalues and biases separately from $\mathcal{N}(0, s^2)$ and $\mathrm{Uniform}([0, 2\pi])$ to get $\Lambda, B \in \mathbb{R}^{d_1}$, where $s$ is a scale constant. The computation of the resulting feature matrix $\bar{X} \in \mathbb{R}^{n \times d_1}$ follows; please refer to Algorithm 1 for a detailed breakdown.
The mapped features $\bar{X}$ are then fed into the chosen graph learning model $f(A, \bar{X})$ for prediction.

Embed Features. One can also embed the given features into hyperbolic space so as to derive the node features implicitly. Specifically, when a node feature matrix $X \in \mathbb{R}^{n \times d}$ is given, we initialize a hyperbolic embedding for each of the $d$ dimensions to derive hyperbolic embeddings $Z \in \mathbb{R}^{d \times d_0}$. The Euclidean features $\bar{X} \in \mathbb{R}^{d \times d_1}$ can be computed in the same manner following Algorithm 1. However, to get the new node feature matrix, an extra aggregation step $\bar{X} = X\bar{X} \in \mathbb{R}^{n \times d_1}$ is required before feeding it into a graph learning model.

Embedding graph nodes is better-suited for tasks where no feature matrix is available or meaningful features are hard to derive. However, the size of the embedding $Z$ will be proportional to the graph size, hence it may not scale to very large graphs due to memory and computation constraints. Furthermore, this method can only be used in a transductive setting, where nodes in the test set are seen during training. In comparison, embedding features can be used even for large graphs since the dimension $d$ of the original feature matrix is usually fixed and much lower than the number of nodes $n$. Note that as the hyperbolic embeddings are built for each feature dimension, they can be used in both transductive and inductive settings, as long as the test data shares the same set of features as the training data. One limitation is that its performance depends on the existence of a feature matrix $X$ that contains sufficient information for learning.

Any graph learning model can be used in the proposed feature-extracted architecture. In our experiments, we focus primarily on the simple linear graph network SGC, which takes the form $f(A, \bar{X}) = \mathrm{softmax}(A^K \bar{X} W)$ with a trainable weight matrix $W$. Note that just as in the vanilla SGC case, $A^K$ or $A^K X$ can be pre-computed in the same way before training and inference.
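The pre-computation mentioned above can be sketched as follows. This is a minimal illustration with nested lists in place of a tensor library; it assumes the symmetric normalization $S = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ used by Wu et al. (2019), and the helper names and the toy graph, features, and weights are hypothetical.

```python
import math

def normalized_adjacency(A):
    """S = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(A)
    A_loop = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_loop]
    return [[A_loop[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def sgc_precompute(A, X, K):
    """Propagate features K times: returns S^K X without forming S^K explicitly."""
    S = normalized_adjacency(A)
    for _ in range(K):
        X = matmul(S, X)
    return X

def softmax_rows(M):
    """Row-wise softmax with the usual max-subtraction for stability."""
    out = []
    for row in M:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        total = sum(e)
        out.append([v / total for v in e])
    return out

# Toy path graph on three nodes with two features and two classes.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.5], [-0.25, 0.25]]

SKX = sgc_precompute(A, X, K=2)        # computed once, before training
probs = softmax_rows(matmul(SKX, W))   # the only part that depends on W
print(probs)
```

Because the propagation is linear, only the final product with $W$ has to be recomputed during training, which is where the memory and compute savings come from.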
For the purpose of end-to-end learning, we jointly learn the embedding parameter Z and weight W in SGC during the training time. It's also possible to adopt a two-step approach, i.e., first pretrain a hyperbolic embedding following (Nickel & Kiela, 2017; 2018; Sala et al., 2018; Sonthalia & Gilbert, 2020; Chami et al., 2019) , then fix the embedding and train the graph learning model only. We defer this discussion to Appendix D due to page limit.
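The feature-embedding variant of the pipeline can be sketched as a forward pass. This sketch follows the steps of Algorithm 1 but is not the authors' code: it omits training, uses nested lists instead of tensors, and all names, sizes, and toy inputs are assumptions made for illustration.

```python
import math
import random

def horocycle(omega, z):
    """Signed horocycle distance <omega, z>_H in the Poincare ball."""
    z2 = sum(c * c for c in z)
    diff2 = sum((a - b) ** 2 for a, b in zip(z, omega))
    return math.log((1.0 - z2) / diff2)

def hyla_matrix(Z, lams, omegas, biases):
    """HyLa features for every row of Z; rows of Z are points of the ball B^{d0}."""
    d0 = len(Z[0])
    rows = []
    for z in Z:
        row = []
        for lam, omega, b in zip(lams, omegas, biases):
            p = horocycle(omega, z)
            row.append(math.exp(0.5 * (d0 - 1) * p) * math.cos(lam * p + b))
        rows.append(row)
    return rows

def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def sample_boundary(dim, rng):
    """Uniform sample from the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

rng = random.Random(0)
n, d, d0, d1 = 5, 3, 2, 8   # nodes, input features, embedding dim, HyLa dim

# Hypothetical toy inputs: node features X (n x d) and one hyperbolic embedding
# per feature dimension, Z (d x d0), initialized well inside the unit ball.
X = [[rng.random() for _ in range(d)] for _ in range(n)]
Z = [[rng.uniform(-0.3, 0.3) for _ in range(d0)] for _ in range(d)]

# Random HyLa parameters, sampled once and then held fixed.
lams = [rng.gauss(0.0, 0.5) for _ in range(d1)]
omegas = [sample_boundary(d0, rng) for _ in range(d1)]
biases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(d1)]

feature_hyla = hyla_matrix(Z, lams, omegas, biases)   # d x d1
node_feats = matmul(X, feature_hyla)                  # n x d1, the aggregation step

# Inductive use: an unseen node only needs its own row of input features.
new_node = [[0.2, 0.0, 0.7]]
new_feats = matmul(new_node, feature_hyla)            # 1 x d1
print(len(node_feats), len(node_feats[0]), len(new_feats[0]))
```

In an end-to-end setting, `Z` and the downstream weights would be updated by backpropagation while `lams`, `omegas`, and `biases` stay fixed; an automatic-differentiation framework would replace the list arithmetic used here.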

6.1. NODE CLASSIFICATION

Task and Datasets. The goal of this task is to classify each node into a correct category. We use transductive datasets: Cora, Citeseer and Pubmed (Sen et al., 2008), which are standard citation network benchmarks, following the standard splits adopted in Kipf & Welling (2016). We also include datasets adopted in HGCN (Chami et al., 2019) for comparison: the disease propagation tree (Disease) and Airport. The former contains tree networks simulating the SIR disease spreading model (Anderson & May, 1992), while the latter contains airline routes between airports from OpenFlights. To measure scalability, we supplement our experiment by predicting community structure on a large inductive dataset, Reddit, following Wu et al. (2019). More experimental details are provided in Appendix E.

Experiment Setup. Since all datasets contain node features, we choose to embed features most of the time, since this applies to both small and large graphs, and to both transductive and inductive tasks. The only exception is the Airport dataset, which contains only 4-dimensional features; here, we use HyLa/RFF after embedding the graph nodes to produce better features $\bar{X}$. We then use the SGC model $\mathrm{softmax}(A^K \bar{X} W)$, where both $W$ and $Z$ are jointly learned.

Baselines. On the Disease, Airport, Pubmed, Citeseer and Cora datasets, we compare our HyLa/RFF-SGC model against both Euclidean models (GCN, SGC and GAT) and hyperbolic models (HGCN, LGCN and HNN) using their publicly released versions in Table 1, where all hyperbolic models adopt a 16-dimensional hyperbolic space for consistency and a fair comparison. For the largest dataset, Reddit, a 50-dimensional hyperbolic space is used. We also compare against the reported performance of supervised and unsupervised variants of GraphSAGE and FastGCN in Table 1. Note that GCN-based models (e.g., HGCN, LGCN) could not be trained on Reddit because its adjacency matrix is too large to fit in memory, unless sampling is used for training.
It is worth mentioning that, although the standard Euclidean GCN literature (Kipf & Welling, 2016; Wu et al., 2019) trains the model for 100 epochs on the node classification task, most hyperbolic (graph) networks, including HGCN, LGCN, and GIL, report results of training for (5-)thousand epochs with early stopping. For a fair comparison, we train all models for a maximum of 100 epochs with early stopping, except for HGCN, whose results were taken from the original paper, where it was trained for 5,000 epochs.

Analysis. Table 1 shows that our feature-extracted architecture is particularly strong and expressive at encoding geometric priors for graph learning. Together with SGC, HyLa-SGC outperforms state-of-the-art hyperbolic models on nearly all datasets. In particular, HyLa-SGC beats not only Euclidean SGC/GCN, but also the attention model GAT, except on the Cora dataset, where its performance is comparable. For the tree network Disease, which has the lowest hyperbolicity value (i.e., is the most hyperbolic), the improvement of HyLa-SGC over HGCN is about 12.3%. The results suggest that the more hyperbolic the graph is, the larger the improvement gained with HyLa. On Reddit, HyLa-SGC outperforms the sampling-based GCN variants SAGE-GCN and FastGCN by more than 1%. However, the performance is close to SGC, which may indicate that the extra weights and nonlinearities are unnecessary for this particular dataset. Notably, RFF-SGC, which embeds into Euclidean space and uses RFF, can sometimes be better than GCN/SGC, while HyLa-SGC is consistently better than RFF-SGC.

Visualization. In Figure 4, we visualize the learned Euclidean node features using HyLa on the Cora, Airport and Disease datasets with t-SNE (Van der Maaten & Hinton, 2008) and PCA projection. This shows that HyLa achieves great label class separation (indicated by different colors).

6.2. TEXT CLASSIFICATION

Task and Datasets. We further evaluate HyLa on transductive and inductive text classification tasks, which assign labels to documents. We conducted experiments on 4 standard benchmarks, including R52 and R8 of the Reuters 21578 dataset, Ohsumed, and Movie Review (MR), following the same data split as Yao et al. (2019) and Wu et al. (2019). Detailed dataset statistics are provided in Table 4.

Experiment Setup. In the transductive case, previous works (Yao et al., 2019; Wu et al., 2019) apply GCN and SGC by creating a corpus-level graph where both documents and words are treated as nodes in the graph. For weights and connections in the graph, word-word edge weights are calculated as pointwise mutual information (PMI) and word-document edge weights as normalized TF-IDF scores. The weights of document-document edges are unknown and left as 0. We follow the same data processing setup for the transductive setting, and embed the whole graph since only the adjacency matrix is available. In the inductive setting, we take the sub-matrix of the large matrix in the transductive setting that includes only the document-word edges as the node representation feature matrix $X$, then follow the procedure in Section 5 to embed features and apply HyLa/RFF to get $\bar{X}$. Since the adjacency matrix of documents is unknown, we replace SGC with a logistic regression (LR) formalized as $Y = \mathrm{softmax}(\bar{X} W)$. We train all models for a maximum of 200 epochs and compare them against TextSGC and TextGCN in Table 2.

Performance Analysis. In the transductive setting, HyLa-based models can match the performance of TextGCN and TextSGC. The corpus-level graph may contain sufficient information to learn the task, and hence HyLa-SGC does not seem to outperform the baselines, but still has comparable performance. HyLa shows extraordinary performance in the inductive setting, where less information (only a submatrix of the corpus-level graph) is used compared to the (transductive) node-level case.
With a logistic regression model, it already outperforms inductive TextGCN and is sometimes even better than the transductive TextGCN model, which indicates that the corpus-level graph does contain redundant information. From the results on the inductive text classification task, we argue that HyLa (with feature embedding) is particularly useful in three ways. First, it avoids the OOM problem of the classic GCN model on large graphs and requires less memory during training, since the number of lower-level features is limited (and X is usually sparse) and there is no longer any need to build a corpus-level graph. Second, it is naturally inductive: because HyLa is built at the feature level (for each word in this task), it generalizes to any unseen node (document) that uses the same set of words. Third, the model is simple: HyLa is followed by a logistic regression model, which computes faster than classical GCN models.

Efficiency. Following Wu et al. (2019), we measure the training time of HyLa-based models on the Pubmed dataset and compare against both Euclidean models (SGC, GCN, GAT) and hyperbolic models (HGCN, HNN). In Figure 5, we plot the timing performance of the various models, counting each model's precomputation time as part of its training time. We measure training time on an NVIDIA GeForce RTX 2080 Ti GPU and report the detailed timing statistics in Appendix F. HyLa-based models achieve the best performance while incurring only a minor computational slowdown, and they are 4.4× faster than HGCN.
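The word-word edge weights of the corpus-level graph described in the experiment setup can be sketched as follows. This is a minimal illustration, assuming the sliding-window PMI estimate and the positive-PMI filter of TextGCN (Yao et al., 2019); the function name `pmi_edges`, the toy corpus and the window size are hypothetical and are not taken from the released code.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(docs, window=3):
    """Return {(word_a, word_b): PMI} for word pairs with positive PMI.

    docs is a list of token lists. Co-occurrence is counted over sliding
    windows (an assumption borrowed from TextGCN, not stated in this paper).
    """
    windows = []
    for tokens in docs:
        if len(tokens) <= window:
            windows.append(tokens)
        else:
            windows.extend(tokens[i:i + window]
                           for i in range(len(tokens) - window + 1))

    n_windows = len(windows)
    word_count = Counter()   # number of windows containing a word
    pair_count = Counter()   # number of windows containing both words
    for win in windows:
        unique = sorted(set(win))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))

    edges = {}
    for (a, b), n_ab in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with window-frequency estimates.
        pmi = math.log(n_ab * n_windows / (word_count[a] * word_count[b]))
        if pmi > 0:  # keep only positively associated pairs
            edges[(a, b)] = pmi
    return edges


toy_corpus = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
print(pmi_edges(toy_corpus))
```

In the inductive setting no such word-word graph is needed, because only the document-word sub-matrix is used as the feature matrix.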

7. CONCLUSION

We propose a simple and efficient approach to using hyperbolic space in neural networks by deriving HyLa, an expressive feature map based on the Laplacian eigenfunctions. Empirical results on graph learning tasks show that HyLa can outperform SOTA hyperbolic networks. HyLa offers a principled way of utilizing hyperbolic geometry that differs entirely from previous work. Possible future directions include (1) using HyLa with non-linear graph networks such as GCN to derive even more expressive models; and (2) adopting more numerically stable representations of the hyperbolic embeddings to avoid potential "NaN problems" when learning with HyLa.

A THEOREMS

Theorem A.1 (Helgason (2022); Zelditch (2017)). All smooth eigenfunctions of the Euclidean Laplacian $\Delta$ on $\mathbb{R}^n$ are

$$f(x) = \int_{S^{n-1}} e^{i\lambda\langle\omega, x\rangle}\, dT(\omega),$$

where $\lambda \in \mathbb{C} \setminus \{0\}$ and $T$ is an analytic functional (or hyperfunction), i.e., an element of the dual space of the space of analytic functions on $S^{n-1}$.

Theorem A.2 (Helgason (2022); Zelditch (2017)). All smooth eigenfunctions of the hyperbolic Laplacian $L$ on $\mathbb{B}^n$ are

$$f(z) = \int_{\partial\mathbb{B}^n} \exp\!\left(\left(i\lambda + \tfrac{n-1}{2}\right)\langle\omega, z\rangle_H\right) dT(\omega),$$

where $\lambda \in \mathbb{C}$ and $T$ is an analytic functional (or hyperfunction).

Lemma A.3 (Expression of $k(x, O)$). Denote $\zeta_{\lambda,\omega}(z) = \exp\!\left(\left(\tfrac{n-1}{2} + i\lambda\right)\langle\omega, z\rangle_H\right)$. Then the corresponding kernel $k_\lambda(x, O)$ for a particular value of $\lambda$, defined as

$$k_\lambda(x, O) = \mathbb{E}\left[\mathrm{HyLa}_{\lambda,b,\omega}(x)\, \mathrm{HyLa}_{\lambda,b,\omega}(O)\right] = \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\zeta_{\lambda,\omega}(O)^{*}\, \zeta_{\lambda,\omega}(x)\right] = \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\zeta_{\lambda,\omega}(x)\right], \tag{3}$$

takes the form

$$k_\lambda(x, O) = \tfrac{1}{2}\, {}_2F_1\!\left(\tfrac{n-1}{2} + i\lambda,\ \tfrac{n-1}{2} - i\lambda;\ \tfrac{n}{2};\ \tfrac{1}{2}\bigl(1 - \cosh(d_H(x, O))\bigr)\right),$$

where the expectation in Equation 3 is taken over $\omega$ drawn uniformly from the unit sphere in $n$ dimensions.

Proof. Recall that for $x$ in the $n$-dimensional Poincaré ball $\mathbb{B}^n$ and $\omega \in \partial\mathbb{B}^n$,

$$\langle\omega, x\rangle_H = \log\frac{1 - \|x\|^2}{\|x - \omega\|^2}.$$
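Let $\gamma = \tfrac{n-1}{2} + i\lambda$. Expanding Equation 3 out,

$$\begin{aligned}
k_\lambda(x, O) &= \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\exp\!\left(\gamma \log\frac{1 - \|x\|^2}{\|x - \omega\|^2}\right)\right] \\
&= \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \mathbb{E}_\omega\!\left[\exp\!\left(\gamma \log\frac{1 + \|x\|^2}{\|x - \omega\|^2}\right)\right] \\
&= \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \mathbb{E}_\omega\!\left[\left(\frac{\|x - \omega\|^2}{1 + \|x\|^2}\right)^{-\gamma}\right] \\
&= \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \mathbb{E}_\omega\!\left[\left(\frac{\|x\|^2 - 2\omega^\top x + \|\omega\|^2}{1 + \|x\|^2}\right)^{-\gamma}\right],
\end{aligned}$$

and if we let

$$u = \frac{-2\|x\|}{1 + \|x\|^2},$$

and let $\hat{x}$ denote the unit vector in the direction of $x$, then this becomes

$$k_\lambda(x, O) = \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \mathbb{E}_\omega\!\left[\left(1 + u\, \hat{x}^\top\omega\right)^{-\gamma}\right].$$

Expanding this out using the Binomial Formula (which is valid here because $|u| < 1$), we get

$$\begin{aligned}
k_\lambda(x, O) &= \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \mathbb{E}_\omega\!\left[\sum_{k=0}^{\infty} \binom{-\gamma}{k} u^k \left(\hat{x}^\top\omega\right)^k\right] \\
&= \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \sum_{k=0}^{\infty} \binom{-\gamma}{k} u^k\, \mathbb{E}_\omega\!\left[\left(\hat{x}^\top\omega\right)^k\right].
\end{aligned}$$

Since this expected value is 0 for odd $k$, this becomes

$$k_\lambda(x, O) = \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \sum_{k=0}^{\infty} \binom{-\gamma}{2k} u^{2k}\, \mathbb{E}_\omega\!\left[\left(\hat{x}^\top\omega\right)^{2k}\right].$$

But $\hat{x}^\top\omega$ has the same distribution as the inner product of two uniform random unit vectors in $n$ dimensions. The square of this is well known to be Beta-distributed with parameters $\left(\tfrac{1}{2}, \tfrac{n-1}{2}\right)$. So we can write this as

$$k_\lambda(x, O) = \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \sum_{k=0}^{\infty} \binom{-\gamma}{2k} u^{2k}\, \mathbb{E}_w\!\left[w^k\right],$$

where $w \sim \mathrm{Beta}\!\left(\tfrac{1}{2}, \tfrac{n-1}{2}\right)$. The moments of the Beta distribution are well known to be

$$\mathbb{E}_w\!\left[w^k\right] = \frac{\left(\tfrac{1}{2}\right)^{(k)}}{\left(\tfrac{n}{2}\right)^{(k)}},$$

where $x^{(k)}$ denotes the Pochhammer symbol representing the rising factorial. On the other hand,

$$\binom{-\gamma}{2k} = \frac{(-\gamma)_{(2k)}}{(2k)!},$$

where $x_{(k)}$ denotes the Pochhammer symbol representing the falling factorial. Since $x_{(k)} = (-1)^k (-x)^{(k)}$, we can write this in terms of rising factorials as

$$\binom{-\gamma}{2k} = \frac{\gamma^{(2k)}}{(2k)!}.$$

So substituting everything in, we have

$$k_\lambda(x, O) = \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \sum_{k=0}^{\infty} \frac{\gamma^{(2k)}}{(2k)!} \cdot u^{2k} \cdot \frac{\left(\tfrac{1}{2}\right)^{(k)}}{\left(\tfrac{n}{2}\right)^{(k)}}.$$

Next, observe that

$$\left(\tfrac{1}{2}\right)^{(k)} = \prod_{m=0}^{k-1} \left(\tfrac{1}{2} + m\right) = 2^{-k} \prod_{m=0}^{k-1} (2m + 1) = 4^{-k} \cdot \frac{(2k)!}{k!}.$$

So this becomes

$$k_\lambda(x, O) = \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \sum_{k=0}^{\infty} \frac{\gamma^{(2k)}}{k!} \cdot u^{2k} \cdot 4^{-k} \cdot \frac{1}{\left(\tfrac{n}{2}\right)^{(k)}}.$$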
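Next, we leverage the well-known identity $\gamma^{(2k)} = 4^k \left(\tfrac{\gamma}{2}\right)^{(k)} \left(\tfrac{\gamma+1}{2}\right)^{(k)}$ to get

$$\begin{aligned}
k_\lambda(x, O) &= \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \sum_{k=0}^{\infty} \frac{1}{k!} \cdot \left(\tfrac{\gamma}{2}\right)^{(k)} \cdot \left(\tfrac{\gamma+1}{2}\right)^{(k)} \cdot u^{2k} \cdot \frac{1}{\left(\tfrac{n}{2}\right)^{(k)}} \\
&= \tfrac{1}{2} \exp\!\left(\gamma \log\frac{1 - \|x\|^2}{1 + \|x\|^2}\right) \cdot {}_2F_1\!\left(\tfrac{\gamma+1}{2},\ \tfrac{\gamma}{2};\ \tfrac{n}{2};\ u^2\right).
\end{aligned}$$

Note that

$$u^2 = \frac{4\|x\|^2}{1 + 2\|x\|^2 + \|x\|^4} \quad\text{and}\quad \frac{\sqrt{1 - u^2} - 1}{2\sqrt{1 - u^2}} = \frac{-\|x\|^2}{1 - \|x\|^2}.$$

Recall the classic formula (https://functions.wolfram.com/HypergeometricFunctions/Hypergeometric2F1/17/ShowAll.html) that

$${}_2F_1\!\left(a,\ a + \tfrac{1}{2};\ c;\ z\right) = (1 - z)^{-a}\, {}_2F_1\!\left(2a,\ 2c - 2a - 1;\ c;\ \frac{\sqrt{1 - z} - 1}{2\sqrt{1 - z}}\right).$$

Substituting $z = u^2$, $a = \tfrac{\gamma}{2}$, and $c = \tfrac{n}{2}$ yields

$$k_\lambda(x, O) = \tfrac{1}{2}\, {}_2F_1\!\left(\gamma,\ n - \gamma - 1;\ \tfrac{n}{2};\ \frac{-\|x\|^2}{1 - \|x\|^2}\right) = \tfrac{1}{2}\, {}_2F_1\!\left(\tfrac{n-1}{2} + i\lambda,\ \tfrac{n-1}{2} - i\lambda;\ \tfrac{n}{2};\ \frac{-\|x\|^2}{1 - \|x\|^2}\right).$$

Moreover, since $\|x\| = \tanh(D/2)$,

$$\frac{-\|x\|^2}{1 - \|x\|^2} = \frac{-\tanh^2(D/2)}{1 - \tanh^2(D/2)} = \frac{-\tanh^2(D/2)}{\operatorname{sech}^2(D/2)} = -\sinh^2(D/2) = \tfrac{1}{2}\bigl(1 - \cosh(D)\bigr),$$

where $D = d_H(x, O)$. So we get the final expression

$$k_\lambda(x, O) = \tfrac{1}{2}\, {}_2F_1\!\left(\tfrac{n-1}{2} + i\lambda,\ \tfrac{n-1}{2} - i\lambda;\ \tfrac{n}{2};\ \tfrac{1}{2}\bigl(1 - \cosh(d_H(x, O))\bigr)\right).$$

This completes the proof.

We further make a remark. Recall that for the Poincaré ball model,

$$d(x, O) = 2 \log\!\left(\frac{\|x\| + 1}{\sqrt{1 - \|x\|^2}}\right) = \log\frac{(\|x\| + 1)^2}{1 - \|x\|^2} = \log\frac{1 + \|x\|}{1 - \|x\|} = 2 \operatorname{artanh}(\|x\|).$$

Therefore,

$$\log\frac{1 - \|x\|^2}{1 + \|x\|^2} = \log\frac{1 - \tanh^2(D/2)}{1 + \tanh^2(D/2)} = \log\frac{\operatorname{sech}^2(D/2)}{1 + \tanh^2(D/2)} = \log\frac{1}{\sinh^2(D/2) + \cosh^2(D/2)} = -\log\bigl(\cosh(D)\bigr).$$

On the other hand,

$$u = \frac{-2\tanh(D/2)}{1 + \tanh^2(D/2)} = -\tanh(D),$$

so $u^2 = \tanh^2(D)$, and we can also write the kernel as

$$k_\lambda(x, O) = \tfrac{1}{2} \exp\!\left(-\left(\tfrac{n-1}{2} + i\lambda\right) \log\bigl(\cosh(D)\bigr)\right) \cdot {}_2F_1\!\left(\tfrac{n+1}{4} + i\tfrac{\lambda}{2},\ \tfrac{n-1}{4} + i\tfrac{\lambda}{2};\ \tfrac{n}{2};\ \tanh^2(D)\right),$$

where $D = d_H(x, O)$.

Lemma A.4 (Isometry-invariance). The kernel defined by

$$k_\lambda(x, y) = \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\zeta_{\lambda,\omega}(x)^{*}\, \zeta_{\lambda,\omega}(y)\right] = \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\langle x, \omega\rangle_H + \left(\tfrac{n-1}{2} + i\lambda\right)\langle y, \omega\rangle_H\right)\right]$$

is isometry-invariant.

Proof. Let $g$ be any isometry of the space, then observe the geometric identity (Helgason, 2022):

$$\langle g \cdot x, g \cdot \omega\rangle = \langle x, \omega\rangle + \langle g \cdot O, g \cdot \omega\rangle. \tag{4}$$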
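Replacing $\omega$ with $g^{-1} \cdot \omega$ in Equation 4, it follows that

$$\langle g \cdot x, \omega\rangle = \langle x, g^{-1} \cdot \omega\rangle + \langle g \cdot O, \omega\rangle.$$

Taking $x = g^{-1} \cdot O$ in Equation 4, it follows that

$$0 = \langle O, g \cdot \omega\rangle = \langle g^{-1} \cdot O, \omega\rangle + \langle g \cdot O, g \cdot \omega\rangle,$$

i.e., $\langle g^{-1} \cdot O, \omega\rangle = -\langle g \cdot O, g \cdot \omega\rangle$. Replacing $g^{-1}$ with $g$ then gives $\langle g \cdot O, \omega\rangle = -\langle g^{-1} \cdot O, g^{-1} \cdot \omega\rangle$. By definition,

$$k_\lambda(x, y) = \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\langle x, \omega\rangle_H + \left(\tfrac{n-1}{2} + i\lambda\right)\langle y, \omega\rangle_H\right)\right]. \tag{5}$$

Now assume $g \cdot y = O$, and consider

$$\begin{aligned}
k_\lambda(g \cdot x, O) &= \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\zeta_{\lambda,\omega}(g \cdot x)^{*}\, \zeta_{\lambda,\omega}(O)\right] = \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\zeta_{\lambda,\omega}(g \cdot x)^{*}\right] \\
&= \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\langle g \cdot x, \omega\rangle\right)\right] \\
&= \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, g^{-1} \cdot \omega\rangle + \langle g \cdot O, \omega\rangle\bigr)\right)\right] \\
&= \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, g^{-1} \cdot \omega\rangle - \langle g^{-1} \cdot O, g^{-1} \cdot \omega\rangle\bigr)\right)\right] \\
&= \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, g^{-1} \cdot \omega\rangle - \langle y, g^{-1} \cdot \omega\rangle\bigr)\right)\right] \\
&= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, g^{-1} \cdot \omega\rangle - \langle y, g^{-1} \cdot \omega\rangle\bigr)\right) \rho_1(\omega)\, d\omega,
\end{aligned}$$

where $\rho_1(\omega)$ is the uniform distribution over the sphere and we used $g^{-1} \cdot O = y$. Using $\omega \to g^{-1} \cdot \omega$ as a change of variable,

$$\begin{aligned}
k_\lambda(g \cdot x, O) &= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, g^{-1} \cdot \omega\rangle - \langle y, g^{-1} \cdot \omega\rangle\bigr)\right) \rho_1(\omega)\, d\omega \\
&= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, \omega\rangle - \langle y, \omega\rangle\bigr)\right) \rho_1(g \cdot \omega)\, d(g \cdot \omega).
\end{aligned}$$

We claim that the mapping $g$ acts on the boundary with the Jacobian given by

$$\frac{d(g \cdot \omega)}{d\omega} = \exp\!\bigl((n-1) \cdot \langle g^{-1} \cdot O, \omega\rangle\bigr) = \left(\frac{1 - \|g^{-1} \cdot O\|^2}{\|g^{-1} \cdot O - \omega\|^2}\right)^{n-1}. \tag{6}$$

Then

$$\begin{aligned}
k_\lambda(g \cdot x, O) &= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, \omega\rangle - \langle y, \omega\rangle\bigr)\right) \rho_1(g \cdot \omega)\, d(g \cdot \omega) \\
&= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, \omega\rangle - \langle y, \omega\rangle\bigr)\right) \rho_1(g \cdot \omega) \exp\!\bigl((n-1) \cdot \langle g^{-1} \cdot O, \omega\rangle\bigr)\, d\omega \\
&= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\bigl(\langle x, \omega\rangle - \langle y, \omega\rangle\bigr) + (n-1) \cdot \langle y, \omega\rangle\right) \rho_1(g \cdot \omega)\, d\omega \\
&= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\langle x, \omega\rangle + \left(n - 1 - \tfrac{n-1}{2} + i\lambda\right)\langle y, \omega\rangle\right) \rho_1(g \cdot \omega)\, d\omega \\
&= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\langle x, \omega\rangle + \left(\tfrac{n-1}{2} + i\lambda\right)\langle y, \omega\rangle\right) \rho_1(g \cdot \omega)\, d\omega.
\end{aligned}$$

Since $\rho_1$ is a uniform distribution, this is

$$k_\lambda(g \cdot x, O) = \tfrac{1}{2}\, \mathbb{E}_\omega\!\left[\exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\langle x, \omega\rangle + \left(\tfrac{n-1}{2} + i\lambda\right)\langle y, \omega\rangle\right)\right].$$

Comparing with Equation 5, it follows that $k_\lambda(g \cdot x, O) = k_\lambda(x, y)$. Since, by Lemma A.3, $k_\lambda(g \cdot x, O)$ depends only on $d_H(g \cdot x, O) = d_H(x, g^{-1} \cdot O) = d_H(x, y)$, the kernel $k_\lambda(x, y)$ is distance-invariant, and hence isometry-invariant.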
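Lemma A.3 and Lemma A.4 can be sanity-checked numerically. The sketch below is an illustration only, not the experiment code: it estimates $\tfrac{1}{2}\mathbb{E}_\omega[\zeta_{\lambda,\omega}(x)^{*}\zeta_{\lambda,\omega}(y)]$ by Monte Carlo and compares it with the hypergeometric expression evaluated at $d_H(x, y)$. The helper names are hypothetical, the hypergeometric function is evaluated with a plain power series (so the check assumes $\tfrac{1}{2}|1 - \cosh d_H| < 1$), and the Poincaré distance formula $\operatorname{arcosh}\bigl(1 + 2\|x-y\|^2 / ((1-\|x\|^2)(1-\|y\|^2))\bigr)$ is the standard one rather than something stated in this appendix.

```python
import numpy as np


def hyp2f1_series(a, b, c, z, terms=300):
    """Power series of the Gauss hypergeometric function (needs |z| < 1)."""
    total, term = 0.0 + 0.0j, 1.0 + 0.0j
    for k in range(terms):
        total += term
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
    return total


def poincare_distance(x, y):
    """Hyperbolic distance between two points of the open unit ball."""
    sq_diff = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq_diff / denom)


def kernel_closed_form(dist, lam, n):
    """(1/2) 2F1((n-1)/2 + i lam, (n-1)/2 - i lam; n/2; (1 - cosh d)/2)."""
    a = (n - 1) / 2 + 1j * lam
    z = 0.5 * (1 - np.cosh(dist))
    return 0.5 * hyp2f1_series(a, np.conj(a), n / 2, z)


def busemann(x, omega):
    """<omega, x>_H = log((1 - |x|^2) / |x - omega|^2), one value per omega."""
    return np.log((1 - np.sum(x ** 2)) / np.sum((x - omega) ** 2, axis=1))


def kernel_monte_carlo(x, y, lam, num_samples, rng):
    """Estimate (1/2) E[conj(zeta(x)) zeta(y)], omega uniform on the sphere."""
    n = x.shape[0]
    omega = rng.normal(size=(num_samples, n))
    omega /= np.linalg.norm(omega, axis=1, keepdims=True)
    gamma = (n - 1) / 2 + 1j * lam
    values = np.exp(np.conj(gamma) * busemann(x, omega)
                    + gamma * busemann(y, omega))
    return 0.5 * values.mean()


x = np.array([0.3, 0.1, -0.2])
y = np.array([-0.1, 0.4, 0.2])
lam = 1.0
estimate = kernel_monte_carlo(x, y, lam, 400_000, np.random.default_rng(0))
exact = kernel_closed_form(poincare_distance(x, y), lam, n=3)
print("Monte Carlo:", estimate, "closed form:", exact)
```

For $n = 3$ the hypergeometric expression reduces to $\sin(\lambda D) / (2\lambda \sinh D)$, which gives an independent check of the series evaluation.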
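It suffices to prove Equation 6, i.e.,

$$\frac{d(g \cdot \omega)}{d\omega} = \left(\frac{1 - \|g^{-1} \cdot O\|^2}{\|g^{-1} \cdot O - \omega\|^2}\right)^{n-1}.$$

Since this holds trivially when $g$ is a rotation, it suffices to show it for a translation isometry. Denote $\mathrm{Inv}(x) = \frac{x}{\|x\|^2}$; then in the Poincaré ball model, every translation isometry takes the form $T_a(x) = -a + (1 - \|a\|^2)\, \mathrm{Inv}(\mathrm{Inv}(x) - a)$, where both $x$ and $a$ lie in the Poincaré ball, $T_a(a) = O$ and $T_a^{-1}(O) = a$. Thus,

$$T_a(\omega) = -a + (1 - \|a\|^2)\, \mathrm{Inv}(\mathrm{Inv}(\omega) - a) = -a + (1 - \|a\|^2)\, \mathrm{Inv}(\omega - a) = -a + (1 - \|a\|^2)\, \frac{\omega - a}{\|\omega - a\|^2},$$

where we use the fact $\|\omega\| = 1$. Since the integral is taken over the unit sphere with $\|\omega\| = 1$ and $\|T_a(\omega)\| = 1$, we consider only the mapping $T_a$ restricted to the first $n - 1$ (free) dimensions and, with an abuse of notation, regard $\omega$ as an $(n-1)$-dimensional vector. The Jacobian of $T_a(\omega)$ with respect to $\omega$ is obtained from

$$dT_a(\omega) = (1 - \|a\|^2)\, d\!\left(\frac{\omega - a}{\|\omega - a\|^2}\right),$$

and we can calculate that

$$\begin{aligned}
d\!\left(\frac{\omega - a}{\|\omega - a\|^2}\right) &= \frac{\|\omega - a\|^2\, d\omega - (\omega - a) \cdot 2(\omega - a)^\top d\omega}{\|\omega - a\|^4} \\
&= \frac{1}{\|\omega - a\|^2} \cdot \frac{\|\omega - a\|^2 - 2(\omega - a)(\omega - a)^\top}{\|\omega - a\|^2}\, d\omega \\
&= \frac{d\omega}{\|\omega - a\|^2} \cdot \left(I_{n-1,n-1} - \frac{2(\omega - a)(\omega - a)^\top}{\|\omega - a\|^2}\right),
\end{aligned}$$

then

$$\frac{dT_a(\omega)}{d\omega} = \frac{1 - \|a\|^2}{\|\omega - a\|^2} \left(I_{n-1,n-1} - \frac{2(\omega - a)(\omega - a)^\top}{\|\omega - a\|^2}\right).$$

Note the relation $\det(I + xy^\top) = 1 + x^\top y$. Applying it with $x = -2(\omega - a)/\|\omega - a\|^2$ and $y = \omega - a$ shows that the matrix in parentheses has determinant $1 - 2 = -1$, so the absolute value of the determinant of the Jacobian is $\left(\frac{1 - \|a\|^2}{\|\omega - a\|^2}\right)^{n-1}$, which is Equation 6 with $a = g^{-1} \cdot O$.

If we draw HyLa features using a distribution $\rho$ over $\lambda$, then the resulting approximated kernel will be

$$k(x, y) = \int_{-\infty}^{\infty} k_\lambda(x, y) \cdot \rho(\lambda)\, d\lambda = \tfrac{1}{2} \int_{-\infty}^{\infty} {}_2F_1\!\left(\tfrac{n-1}{2} + i\lambda,\ \tfrac{n-1}{2} - i\lambda;\ \tfrac{n}{2};\ \tfrac{1}{2}\bigl(1 - \cosh(d_H(x, y))\bigr)\right) \cdot \rho(\lambda)\, d\lambda.$$

Proof of Theorem 4.2. Denote

$$\phi_\lambda(z) = \int_{S^{n-1}} \zeta_{\lambda,\omega}(z)\, d\omega = \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} + i\lambda\right)\langle\omega, z\rangle_H\right) d\omega,$$

which are the basic spherical functions. For any $x, y \in \mathbb{B}^n$, let $g_y$ be an isometry that maps $y$ to the origin, i.e., $g_y \cdot y = O$, and denote $\tilde{x} = g_y \cdot x$. Then $k_\lambda(x, y) = k_\lambda(\tilde{x}, O)$ for any $\lambda$. Note that

$$\begin{aligned}
k_\lambda(\tilde{x}, O) &= \tfrac{1}{2} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} + i\lambda\right)\langle\omega, \tilde{x}\rangle_H\right) \rho_1(\omega)\, d\omega \\
&= \tfrac{1}{2} \cdot \frac{1}{\mathrm{Area}(S^{n-1})} \int_{S^{n-1}} \exp\!\left(\left(\tfrac{n-1}{2} + i\lambda\right)\langle\omega, \tilde{x}\rangle_H\right) d\omega = \tfrac{1}{2} \cdot \frac{1}{\mathrm{Area}(S^{n-1})}\, \phi_\lambda(\tilde{x}),
\end{aligned}$$

where we use the fact that $\rho_1(\omega)$ is a uniform distribution over the sphere. Assume the existence of a density $\rho(\lambda)$ associated with the kernel; then

$$\begin{aligned}
k(x, y) = k(d_H(x, y)) = k(d_H(\tilde{x}, O)) &= \int_{-\infty}^{\infty} k_\lambda(x, y) \cdot \rho(\lambda)\, d\lambda = \int_{-\infty}^{\infty} k_\lambda(\tilde{x}, O) \cdot \rho(\lambda)\, d\lambda \\
&= \tfrac{1}{2} \int_{-\infty}^{\infty} \frac{1}{\mathrm{Area}(S^{n-1})}\, \phi_\lambda(\tilde{x}) \cdot \rho(\lambda)\, d\lambda = \tfrac{1}{2} \cdot \frac{1}{\mathrm{Area}(S^{n-1})} \int_{-\infty}^{\infty} \phi_\lambda(\tilde{x}) \cdot \rho(\lambda)\, d\lambda \\
&= \frac{1}{\pi\, \mathrm{Area}(S^{n-1})} \int_{-\infty}^{\infty} \frac{\rho(\lambda)}{\lambda \tanh\!\left(\tfrac{\pi\lambda}{2}\right)} \cdot \phi_\lambda(\tilde{x}) \cdot |c(\lambda)|^{-2}\, d\lambda,
\end{aligned}$$

where $|c(\lambda)|^{-2} = \tfrac{\pi\lambda}{2} \tanh\!\left(\tfrac{\pi\lambda}{2}\right)$ when $\lambda \in \mathbb{R}$.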
Note that the last integral is exactly the inverse spherical transform (Helgason, 2022) of $\frac{\rho(\lambda)}{\lambda \tanh(\pi\lambda/2)}$; hence it can be derived in the reverse direction by taking the spherical transform of $k(d_H(\tilde{x}, O))$, i.e.,

$$\frac{\rho(\lambda)}{\lambda \tanh\!\left(\tfrac{\pi\lambda}{2}\right)} = \int_{\mathbb{B}^n} k(d_H(\tilde{x}, O))\, \phi_{-\lambda}(\tilde{x})\, d\tilde{x}.$$

Hence $\rho(\lambda)$ can be derived as

$$\rho(\lambda) = \lambda \tanh\!\left(\tfrac{\pi\lambda}{2}\right) \int_{\mathbb{B}^n} k(d_H(\tilde{x}, O))\, \phi_{-\lambda}(\tilde{x})\, d\tilde{x} = \lambda \tanh\!\left(\tfrac{\pi\lambda}{2}\right) \int_{\mathbb{B}^n} \int_{\partial\mathbb{B}^n} k(d_H(z, O)) \exp\!\left(\left(\tfrac{n-1}{2} - i\lambda\right)\langle\omega, z\rangle_H\right) d\omega\, dz.$$

Therefore, given an isometry-invariant positive semidefinite kernel $k(x, y) = k(d_H(x, y))$, we can compute $\rho(\lambda)$ from the above expression if it exists, and the rest of the theorem follows.

The reader may wonder whether the random variable $\langle\phi(x), \phi(y)\rangle$ concentrates around $k(x, y)$. Unfortunately, the eigenfunction $\phi(x)$ is not sub-Gaussian, so a concentration bound cannot be derived in a straightforward way, but we do provide a numerical experiment to measure the estimation behavior. For the estimation in Figure 3, we sampled 1,000 different eigenfunctions. We first fix a set of 1,000 points $x_i$ by sampling uniformly over 2-dimensional hyperbolic space, then approximate the kernel $k(o, x_i)$ using $\langle\phi(x_i), \phi(o)\rangle$ with an increasing number of eigenfunctions, ranging from 50 to 1,000. Finally, we compute the mean absolute error of the estimate $\langle\phi(x_i), \phi(o)\rangle$ against the true kernel value $k(o, x_i)$. We plot the mean estimation error in Figure 6; it appears to decay exponentially, and investigating this estimation error is interesting future work.

C NODE EMBEDDING VS FEATURE EMBEDDING

When HyLa is adopted at the node level, each vertex/node $v_i$ in the graph is associated with a hyperbolic embedding parameter $z_i \in \mathbb{B}^{d_0}$. The inner product $\langle\phi(z_i), \phi(z_j)\rangle$ of the HyLa features of vertices $v_i, v_j$ then approximates some kernel $k(z_i, z_j)$, and optimizing the $z_i$ encourages learning a kernel on hyperbolic space that solves the task. When HyLa is adopted at the feature level, each column dimension of the node feature matrix $X \in \mathbb{R}^{n \times d}$ is associated with a hyperbolic embedding parameter $z_i \in \mathbb{B}^{d_0}$. The HyLa feature associated with each vertex/node $v_i$ is then computed as $\sum_{k=1}^{d} X_{ik}\, \phi(z_k)$, where $\sum_{k=1}^{d} X_{ik} = 1$ if row-normalization is applied to the original node features. Therefore, the inner product of two node HyLa features is

$$\left\langle \sum_{k=1}^{d} X_{ik}\, \phi(z_k),\ \sum_{l=1}^{d} X_{jl}\, \phi(z_l) \right\rangle = \sum_{k,l=1}^{d} X_{ik} X_{jl}\, \langle\phi(z_k), \phi(z_l)\rangle,$$

which in expectation equals a linear combination of kernels, i.e., $\sum_{k,l=1}^{d} X_{ik} X_{jl}\, k(z_k, z_l)$. It therefore captures a much more complicated kernel relation on hyperbolic space than directly embedding nodes.
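The feature-level construction can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the released implementation: it assumes the HyLa eigenfunction has the form $\exp\bigl(\tfrac{n-1}{2}\langle\omega, z\rangle_H\bigr)\cos\bigl(\lambda\langle\omega, z\rangle_H + b\bigr)$, i.e., the real part of $\zeta_{\lambda,\omega}$ with a random phase, and it omits any normalization by the number of features. The function names, sizes and the scale 0.5 are hypothetical.

```python
import numpy as np


def sample_hyla_constants(dim, num_features, scale, rng):
    """Sample boundary points omega, eigenvalue constants lam and phases b."""
    omega = rng.normal(size=(num_features, dim))
    omega /= np.linalg.norm(omega, axis=1, keepdims=True)  # uniform on sphere
    lam = rng.normal(0.0, scale, size=num_features)
    bias = rng.uniform(0.0, 2 * np.pi, size=num_features)
    return omega, lam, bias


def hyla_features(z, omega, lam, bias):
    """Map points z (m, dim) of the open unit ball to features (m, D)."""
    dim = z.shape[1]
    sq_dist = np.sum((z[:, None, :] - omega[None, :, :]) ** 2, axis=-1)
    busemann = np.log((1 - np.sum(z ** 2, axis=1, keepdims=True)) / sq_dist)
    return np.exp(0.5 * (dim - 1) * busemann) * np.cos(lam * busemann + bias)


rng = np.random.default_rng(0)
num_nodes, num_raw, dim, num_features = 5, 8, 2, 64
omega, lam, bias = sample_hyla_constants(dim, num_features, 0.5, rng)

# One hyperbolic embedding z_k per column of the raw feature matrix X.
z = rng.uniform(-0.3, 0.3, size=(num_raw, dim))
phi = hyla_features(z, omega, lam, bias)        # shape (num_raw, num_features)

x_raw = rng.random((num_nodes, num_raw))
x_raw /= x_raw.sum(axis=1, keepdims=True)       # row-normalization
node_feats = x_raw @ phi                        # sum_k X_ik phi(z_k)

# Inner products of node features are X-weighted sums of <phi(z_k), phi(z_l)>.
gram_nodes = node_feats @ node_feats.T
gram_mixed = x_raw @ (phi @ phi.T) @ x_raw.T
print(np.allclose(gram_nodes, gram_mixed))
```

Only `num_raw` embeddings are stored, independent of the number of nodes, which is why this variant extends to unseen nodes that share the same raw feature columns.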

D TWO-STEP APPROACH

For the purpose of end-to-end learning, in our experiments we jointly learn the embedding parameter Z and the weight W in SGC during training, as detailed in subsection E.2. It is also possible to adopt a two-step approach, i.e., first pretrain a hyperbolic embedding, then fix the embedding and train only the graph learning model. In the first step, for example, optimization-based methods (Nickel & Kiela, 2017; 2018) and combinatorial construction methods (Sala et al., 2018; Sonthalia & Gilbert, 2020) can be adopted, supervised by the graph connectivity. However, these methods utilize only the graph structure and ignore the node feature information X, which naturally degrades performance. In comparison, as shown in the experiments and analyzed in Appendix C, our end-to-end learning of HyLa can be used to embed features and enables learning a complex kernel representation that avoids this shortcoming. Intuitively, the graph connectivity information can be too general for downstream tasks that rely more on semantic information. It is not clear to us how to encode the semantic information (node features) into the embedding when following, e.g., Nickel & Kiela (2017; 2018). Another way of deriving a pretrained hyperbolic embedding that might take semantic information into consideration is to train a link prediction model (Chami et al., 2019); however, this method is not efficient, as the timing of HGCN in Figure 5 shows.

E EXPERIMENT DETAILS

E.1 TASK AND DATASET

We provide a detailed description of the datasets used in Table 3 and Table 4.

1. Citation Networks. Cora, Citeseer and Pubmed (Sen et al., 2008) are standard citation network benchmarks, where nodes represent papers connected to each other via citations. We follow the standard splits (Kipf & Welling, 2016) with 20 nodes per class for training, 500 nodes for validation and 1000 nodes for test.
2. Disease propagation tree (Chami et al., 2019). These are tree networks simulating the SIR disease spreading model (Anderson & May, 1992), where the label is whether a node was infected or not and the node features indicate the susceptibility to the disease. We use dataset splits of 30/10/60% for the train/val/test sets.
3. Airport. We take this dataset from Chami et al. (2019). This is a transductive dataset where nodes represent airports and edges represent airline routes from OpenFlights. Airport contains 3,188 nodes; each node has a 4-dimensional feature consisting of geographic information (longitude, latitude and altitude) and the GDP of the country to which the airport belongs. For node classification, labels are chosen to be the population of the country to which the airport belongs. We use dataset splits of 524/524 nodes for the val/test sets.
4. Reddit. This is a much larger graph dataset built from Reddit posts, where the label is the community, or "subreddit", that a post belongs to. Two nodes are connected if the same user comments on both. We use a dataset split of 152K/24K/55K following Hamilton et al. (2017) and Chen et al. (2018); similarly, we evaluate HyLa inductively by following Wu et al.

We use HyLa together with the SGC model as $\mathrm{softmax}(A^K X W)$, where the HyLa feature matrix $X \in \mathbb{R}^{n \times d_1}$ is derived from the hyperbolic embedding $Z \in \mathbb{R}^{n \times d_0}$ using Algorithm 1. Specifically, we randomly sample the constants of the HyLa features $X$ by sampling the boundary points $\omega$ uniformly from the boundary $\partial\mathbb{B}^n$, the eigenvalue constants $\lambda$ from a zero-mean Gaussian with standard deviation $s$, and the biases $b$ uniformly from $[0, 2\pi]$. These constants remain fixed throughout training.
We use cross-entropy as the loss function and jointly optimize the low-dimensional hyperbolic embedding $Z$ and the linear weight $W$ during training. Specifically, we use the Riemannian SGD (RSGD) optimizer (Bonnabel, 2013) with learning rate $lr_1$ for $Z$ and the Adam optimizer (Kingma & Ba, 2014) with learning rate $lr_2$ for $W$. RSGD naturally scales to very large graphs because the graph connectivity pattern is sufficiently sparse. We adopt early stopping as regularization. We tune the hyper-parameters via grid search over the parameter space. Each hyperbolic embedding is initialized around the origin by sampling each coordinate uniformly at random from $[-10^{-5}, 10^{-5}]$.
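The training setup above can be sketched as follows. This is an illustrative sketch, not the released code: it assumes that the propagation matrix in $\mathrm{softmax}(A^K X W)$ is the symmetrically normalized adjacency with self-loops used by SGC (Wu et al., 2019), and that the Riemannian SGD step on the Poincaré ball rescales the Euclidean gradient by $(1 - \|z\|^2)^2 / 4$ and projects back into the ball, as in Nickel & Kiela (2017). All function names are hypothetical, and the random feature matrix is only a stand-in for HyLa features.

```python
import numpy as np


def init_embedding(num_points, dim, rng, radius=1e-5):
    """Initialize each coordinate uniformly in [-radius, radius]."""
    return rng.uniform(-radius, radius, size=(num_points, dim))


def rsgd_step(z, euclid_grad, lr, eps=1e-5):
    """One Riemannian SGD step on the Poincare ball (assumed retraction form)."""
    scale = (1 - np.sum(z ** 2, axis=1, keepdims=True)) ** 2 / 4
    z_new = z - lr * scale * euclid_grad
    norms = np.linalg.norm(z_new, axis=1, keepdims=True)
    # Rescale any point that left the open unit ball to norm 1 - eps.
    factor = np.where(norms >= 1, (1 - eps) / np.maximum(norms, 1.0), 1.0)
    return z_new * factor


def sgc_forward(adj, feats, weight, num_hops):
    """softmax(S^K X W) with S = D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1)
    s = a_hat / np.sqrt(np.outer(deg, deg))
    hidden = feats
    for _ in range(num_hops):
        hidden = s @ hidden
    logits = hidden @ weight
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)     # path graph on 4 nodes
z = init_embedding(4, 2, rng)
feats = rng.normal(size=(4, 6))                  # stand-in for HyLa features
weight = rng.normal(size=(6, 3))
print(sgc_forward(adj, feats, weight, num_hops=2))
```

Because the propagation $S^K X$ has no trainable parameters between hops, the only learned quantities in this sketch are the embedding and the linear weight, matching the two optimizers described above.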

E.3 HYPER-PARAMETERS

We provide the detailed values of the hyper-parameters for node classification and text classification in Table 5 and Table 6, respectively. In particular, we fix K = 2 for the text classification task and train the model for a maximum of 200 epochs without any regularization (e.g., early stopping). Also note that in the transductive text classification setting, HyLa is used at the node level, so the number of parameters is proportional to the size of the graph; in this case, $d_0$ and $d_1$ cannot be too large if OOM is to be avoided. In the inductive text classification setting, there is no such constraint, since the dimension of the lower-level features is itself not very large. Please check the code for more details.

F TIMING

We show the detailed training timing statistics of the different models on the Pubmed dataset in Table 7. For the HGCN model in particular, in order to achieve the reported performance, we follow the same training procedure using the public code, which is divided into two stages: (1) a link prediction task on the dataset to learn hyperbolic embeddings, and (2) training an MLP classifier on the pretrained embeddings. Hence, we report the sum of the timings of both stages as the timing for HGCN.



We cannot replicate results of GIL from their public code. HGCN requires pretraining embeddings from a link prediction task to achieve reported results on node classification task for Pubmed and Cora.



Figure 1: Euclidean hyperplane.

Figure 2: Hyperbolic horocycle.

Figure 3: Visualization of the kernel $k(x, y)$ when $\rho(\lambda) = \mathcal{N}(0, 0.5^2)$. (left) Distributions of $k(x, y)$ and the heat kernel (at temperature $t = 6$) in 3D hyperbolic space; (middle) $k(x, O)$ in the 2D Poincaré disk model; (right) $\langle\phi(x), \phi(y)\rangle$ for HyLa features based on $D = 1000$ samples.

and are analogous to random Fourier features. Moreover, HyLa can be extended to eigenfunctions on other manifolds (e.g. symmetric spaces) since we only use manifold properties of $\mathbb{H}^n$. We show that for any $\lambda \in \mathbb{R}$, under uniform sampling of $\omega$ and $b$, the product $\mathrm{HyLa}_{\lambda,b,\omega}(x)\, \mathrm{HyLa}_{\lambda,b,\omega}(y)$ is an unbiased estimate of an isometry-invariant kernel $k(x, y)$.

Theorem 4.1. Let $\omega$ be sampled uniformly (under the Euclidean metric) from the boundary $\partial\mathbb{B}^n$ and let $b$ be sampled uniformly from $[0, 2\pi]$. Then $\mathbb{E}\left[\mathrm{HyLa}_{\lambda,b,\omega}(x)\, \mathrm{HyLa}_{\lambda,b,\omega}(y)\right] = k_\lambda(x, y)$ for any $\lambda \in \mathbb{R}$ and $x, y \in \mathbb{H}^n$, where the function $k_\lambda$ is an isometry-invariant kernel given by

$$k_\lambda(x, y) = \tfrac{1}{2}\, {}_2F_1\!\left(\tfrac{n-1}{2} + i\lambda,\ \tfrac{n-1}{2} - i\lambda;\ \tfrac{n}{2};\ \tfrac{1}{2}\bigl(1 - \cosh(d_H(x, y))\bigr)\right).$$

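Activation. There is a close connection between the HyLa eigenfunction and the activations used in Euclidean fully connected networks. Given a data point $x \in \mathbb{R}^n$, a weight $\omega \in \mathbb{R}^n$, a bias $b \in \mathbb{R}$ and a nonlinearity $\sigma$, a Euclidean DNN activation can be written as $\sigma(\langle\omega, x\rangle + b) = \sigma\!\left(\|\omega\| \left\langle \tfrac{\omega}{\|\omega\|}, x \right\rangle + b\right)$. In hyperbolic space, for $z \in \mathbb{B}^n$, $\omega \in \partial\mathbb{B}^n$, $\lambda \in \mathbb{R}$, $b \in \mathbb{R}$ and $\sigma = \cos$, we can reformulate the HyLa eigenfunction as $\mathrm{HyLa}_{\lambda,b,\omega}$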

Figure 4: Visualization of node HyLa features on Cora, Airport and Disease datasets, where nodes of different classes are indicated by different colors. (left) t-SNE on Cora; (middle) t-SNE on Airport; (right) PCA on Disease.

Figure 5: Performance over training time on Pubmed. HyLa-SGC achieves best performance with minor computation slowdown.

when $g$ is a rotation, it suffices to show this for a translation isometry. Denote $\mathrm{Inv}(x) = \frac{x}{\|x\|^2}$; then in the Poincaré ball model, every translation isometry takes the form $T_a(x) = -a + (1 - \|a\|^2)\, \mathrm{Inv}(\mathrm{Inv}(x) - a)$, where both $x$ and $a$ lie in the Poincaré ball, $T_a(a) = O$ and $T_a^{-1}(O) = a$. Thus,

$$T_a(\omega) = -a + (1 - \|a\|^2)\, \mathrm{Inv}(\mathrm{Inv}(\omega) - a) = -a + (1 - \|a\|^2)\, \mathrm{Inv}(\omega - a) = -a + (1 - \|a\|^2)\, \frac{\omega - a}{\|\omega - a\|^2},$$

Figure 6: Averaged estimation error of HyLa to the kernel

(2017); Chen et al. (2018); similarly, we evaluate HyLa inductively by following Wu et al.

Table 1: Test accuracy/Micro F1 Score (%) averaged over 10 runs on the node classification task. Performance of some baselines is taken from their original papers. OOM: Out of memory.

Table 2: Test accuracy (%) averaged over 10 runs on the transductive and inductive text classification tasks, except for the LR model. Bold numbers: best in both the transductive and inductive settings; underlined numbers: best in the inductive setting.

Table 3: Node classification dataset statistics.

Table 4: Text classification dataset statistics.

Table 5: Hyper-parameters for node classification.

Table 6: Hyper-parameters for text classification.

Table 7: Training time on Pubmed.


then [...] then the absolute value of the determinant of the Jacobian is [...], with which a change of variable would give [...], which finishes the proof.

Proof of Theorem 4.1. As a result of Lemma A.3 and Lemma A.4, the expression for $k_\lambda(x, y)$ follows: [...] where ${}_2F_1$ is the hypergeometric function. We can also apply the Euler transformation to get [...], which is then written more succinctly as [...]. We can also write this as a Legendre function, [...] and similarly [...]. This is manifestly real because [...] $(1 - \cosh(d_H(x, y)))$.

