NODE EMBEDDING FROM NEURAL HAMILTONIAN ORBITS IN GRAPH NEURAL NETWORKS

Abstract

In the graph node embedding problem, embedding spaces can vary significantly for different data types, leading to the need for different GNN model types. In this paper, we model the embedding update of a node feature as a Hamiltonian orbit over time. Since the Hamiltonian orbits generalize the hyperbolic exponential maps, this approach allows us to learn the underlying manifold of the graph in training, in contrast to most of the existing literature that assumes a fixed graph embedding manifold. Our proposed node embedding strategy can automatically learn, without extensive tuning, the underlying geometry of any given graph dataset even if it has diverse geometries. We test Hamiltonian functions of different forms and verify the performance of our approach on two graph node embedding downstream tasks: node classification and link prediction. Numerical experiments demonstrate that our approach adapts better to different types of graph datasets than popular state-of-the-art graph node embedding GNNs.

1. INTRODUCTION

Graph neural networks (GNNs) (Yue et al., 2019; Ashoor et al., 2020; Kipf & Welling, 2017b; Zhang et al., 2022; Wu et al., 2021) have achieved good inference performance on graph-structured data such as social media networks, citation networks, and molecular graphs in chemistry. Most existing GNNs embed graph nodes in Euclidean spaces without further consideration of the dataset's graph geometry. For some graph structures, such as tree-like graphs (Liu et al., 2019), Euclidean space may not be a proper choice for node embedding. Recently, hyperbolic GNNs (Chami et al., 2019; Liu et al., 2019) have been proposed to embed nodes into a hyperbolic space instead of the conventional Euclidean space, and it has been shown that tree-like graphs can be inferred more accurately by hyperbolic GNNs. Furthermore, works like Zhu et al. (2020b) have attempted to embed graph nodes in a mixture of Euclidean and hyperbolic spaces, where the intrinsic local graph geometry is captured by the mixing weight. Embedding nodes in a hyperbolic space is achieved through the exponential map (Chami et al., 2019), which is essentially a geodesic curve on the hyperbolic manifold, i.e., the projected curve of the cogeodesic orbits on the manifold's cotangent bundle (Lee, 2013; Klingenberg, 2011). In our work, we propose to embed the nodes, via more general Hamiltonian orbits, into a general manifold. This generalizes the hyperbolic embedding space, which is a strongly constrained Riemannian manifold of constant sectional curvature equal to -1. From the physics perspective, cotangent bundles are the natural phase spaces in classical mechanics (De León & Rodrigues, 2011), where a physical system evolves according to the basic laws of physics modeled as differential equations on the phase spaces. In this paper, we propose a new GNN paradigm based on Hamiltonian mechanics (Goldstein et al., 2001) with flexible Hamiltonian functions.
Our objective is to design a new node embedding strategy that can automatically learn, without extensive tuning, the underlying geometry of any given graph dataset even if it has diverse geometries. We enable the node features to evolve on the manifold under the influence of neighbors. The learnable Hamiltonian function on the manifold guides the node embedding evolution to follow a learnable law analogous to basic physical laws. Main contributions. Our main contributions are summarized as follows: 1. We take the graph as a discretization of an underlying manifold and enable node embedding through a learnable Hamiltonian orbit associated with the Hamiltonian scalar function on its cotangent bundle. 2. Our node embedding strategy can automatically learn, without extensive tuning, the underlying geometry of any given graph dataset even if it has diverse geometries. We empirically demonstrate its ability by testing on two graph node embedding downstream tasks: node classification and link prediction. 3. From empirical experiments, we observe that the oversmoothing problem of GNNs can be mitigated if the node features evolve through Hamiltonian orbits. By the conservative nature of the Hamiltonian equations, our model enables a stable training and inference process while updating the node features over time and layers.

2. RELATED WORK

While our paper is related to Hamiltonian neural networks in the literature, we are the first, to the best of our knowledge, to model graph-structured data with Hamiltonian equations. In what follows, we briefly review Hamiltonian neural networks, Riemannian manifold GNNs, and physics-inspired GNNs. Hamiltonian neural networks. Among physics-inspired deep learning approaches, Hamiltonian equations have been applied to conserve an energy-like quantity when training neural networks. The papers Greydanus et al. (2019); Zhong et al. (2020); Chen et al. (2021) train a neural network to infer the Hamiltonian dynamics of a physical system, where the Hamiltonian equations are solved using neural ODE solvers. The work Haber & Ruthotto (2017) builds a Hamiltonian-inspired neural ODE to stabilize the gradients so as to avoid vanishing and exploding gradients. The paper Huang et al. (2022) further studies the adversarial robustness of Hamiltonian ODEs. In this paper, we focus on applying Hamiltonian equations to graph neural networks, which has not been investigated in the above-mentioned works. Riemannian manifold GNNs. Most GNNs in the literature (Yue et al., 2019; Ashoor et al., 2020; Kipf & Welling, 2017b; Zhang et al., 2022; Wu et al., 2021) embed graph nodes in Euclidean spaces; in what follows, we simply call them (vanilla) GNNs. They perform well on datasets like Cora (McCallum et al., 2004), whose δ-hyperbolicity (Chami et al., 2019) is high. When dealing with datasets whose δ-hyperbolicity is low (hence embedding is more appropriately done in a hyperbolic space), such as the Disease and Airport datasets (Chami et al., 2019), those GNNs suffer from improper node embedding. To better handle hierarchical graph data, (Liu et al., 2019; Chami et al., 2019; Zhang et al., 2021b; Zhu et al., 2020b) propose to embed nodes into a hyperbolic space, thus yielding hyperbolic GNNs. Moreover, Zhu et al.
(2020b) proposes a mixture of embeddings from Euclidean and hyperbolic spaces. This mixing operation relaxes the strong assumption of using only one type of space for a dataset. In this paper, we embed nodes into a general learnable manifold via the Hamiltonian orbit on its symplectic cotangent bundle. This allows our model to flexibly adapt to the inherent geometry of the dataset. Graph neural diffusion: Neural partial differential equations (PDEs) have been applied to graph-structured data (Chamberlain et al., 2021b;a; Song et al., 2022), where different diffusion schemes are assumed when performing message passing on graphs. More specifically, the heat diffusion model is assumed in (Chamberlain et al., 2021b), and the Beltrami diffusion model is assumed in (Chamberlain et al., 2021a; Song et al., 2022). (Rusch et al., 2022) models the nodes in the graph as coupled oscillators, i.e., via a second-order ODE. While the above-mentioned graph neural diffusion schemes and our model all use ODEs, there is a fundamental difference between our model and graph neural flows. Graph PDEs wrap the message passing function, e.g., a constant aggregation function like the one in GCN, or an attention-based aggregation function like the one in GAT, into an ODE function. In contrast, our model treats the node embedding process and the node aggregation process as two independent processes: we use the ODE function only to learn a suitable node embedding space, which is then followed by a node aggregation step. To sum up, our ODE is a node embedding layer taking node features as input, whereas graph PDEs are node aggregation layers taking both node features and the graph adjacency matrix as input. Notations: We use the Einstein summation convention (Lee, 2013) for expressions with tensor indices.

3. MOTIVATIONS AND PRELIMINARIES

In this section, we briefly review the concepts of the geodesic curve on a Riemannian manifold from the principle of stationary action in the form of Lagrange's equations. We then further generalize the geodesic curve to the Hamiltonian orbit associated with an energy function H, which is a conserved quantity along the orbit. We first summarize the motivation of our work as follows. Motivation I: from the hyperbolic exponential map to Riemannian geodesic. The geodesic curve gives rise to the exponential map that maps points from the tangent space to the manifold and has been utilized in (Chami et al., 2019) to enable graph node embedding in a special Riemannian manifold known as the hyperbolic space. From this perspective, by using the geodesic curve, we generalize the graph node embedding to an arbitrary (pseudo-)Riemannian manifold with learnable local geometry g using Lagrange's equations. Motivation II: from Riemannian geodesic to Hamiltonian orbit. Despite the above conceptual generalization for node embedding using geodesic curves, the specific curve formulation involving minimization of curve length may result in a loss of generality for node feature evolution along the curve. We thus further generalize the geodesic curve to the Hamiltonian orbit associated with an energy function H that is conserved along the orbit. In Section 4, we propose graph node embedding without an explicit metric by using Hamiltonian orbits with learnable energy functions H.

3.1. MANIFOLD AND RIEMANNIAN METRIC

Manifold and local chart representation. On a d-dimensional manifold M, for each point on M, there exists a triple {q, U, V}, called a chart, such that U is an open neighborhood of the point in M, V is an open subset of R^d, and q : U → V is a homeomorphism, which gives us a coordinate representation for a local area in M. Tangent and cotangent vector spaces. For any point q on M (we identify each point covered by a local chart on M by its representation q), we may assign two vector spaces named the tangent vector space T_q M and the cotangent vector space T*_q M. The vectors from the tangent and cotangent spaces can be interpreted as representing a velocity and a generalized momentum of motion in classical mechanics, respectively. Riemannian metric. A Riemannian manifold is a manifold M equipped with a Riemannian metric g, where we assign to any point q ∈ M and pair of vectors u, v ∈ T_q M an inner product ⟨u, v⟩_{g(q)}. This assignment is assumed to be smooth with respect to the base point q ∈ M. The length of a tangent vector u ∈ T_q M is then defined as
$$\|u\|_{g(q)} := \langle u, u \rangle_{g(q)}^{1/2}. \quad (1)$$
• Local coordinates representation: In local coordinates with $q = (q^1, \dots, q^d)^\intercal \in M$, $u = (u^1, \dots, u^d)^\intercal \in T_q M$ and $v = (v^1, \dots, v^d)^\intercal \in T_q M$, the Riemannian metric g = g(q) is a real symmetric positive definite matrix and the inner product above is given by
$$\langle u, v \rangle_{g(q)} := g_{ij}(q)\, u^i v^j. \quad (2)$$
• Pseudo-Riemannian metric: We may generalize the Riemannian metric to a metric tensor that only requires a non-degeneracy condition (Lee, 2018) instead of the stringent positive definiteness condition in the inner product. One example of a pseudo-Riemannian manifold is the Lorentzian manifold, which is important in applications of general relativity.
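Two standard instances of the inner product (2) may help fix ideas; the second is the Minkowski metric, a Lorentzian (pseudo-Riemannian) example of the kind just mentioned:

```latex
% Euclidean metric: positive definite, signature (0, d)
g_{ij}(q) = \delta_{ij}, \qquad
\langle u, v \rangle_{g(q)} = \sum_{i=1}^{d} u^i v^i .
% Minkowski metric: non-degenerate but indefinite, signature (1, d-1);
% positive definiteness fails, e.g. for u = (1, 0, \dots, 0),
% \langle u, u \rangle_g = -1 < 0.
g = \mathrm{diag}(-1, 1, \dots, 1), \qquad
\langle u, v \rangle_{g} = -u^1 v^1 + \sum_{i=2}^{d} u^i v^i .
```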

3.2. GEODESIC CURVES AND EXPONENTIAL MAPS

Length and energy of a curve. Let q : [a, b] → M be a smooth curve. We define the following:
• length of the curve:
$$\ell(q) := \int_a^b \|\dot q(t)\|_{g(q(t))}\, dt. \quad (3)$$
• energy of the curve:
$$E(q) := \frac{1}{2} \int_a^b \|\dot q(t)\|^2_{g(q(t))}\, dt. \quad (4)$$
Geodesic curves. On a Riemannian manifold, geodesic curves are defined as curves that have minimal length as given by (3) with two fixed endpoints q(a) and q(b). However, computing the curves by minimizing the length directly is difficult. It turns out that the minimizers of E(q) also minimize ℓ(q) (Malham, 2016). Consequently, the geodesic curve formulation may be obtained by minimizing the energy of a smooth curve on M. Principle of stationary action and Euler-Lagrange equation. A smooth curve q(t) is a stationary point of the following functional of the Lagrangian function L(q(t), q̇(t)) (in physics, the functional is known as an action),
$$S(q) = \int_a^b L(q(t), \dot q(t))\, dt, \quad (5)$$
with two fixed endpoints at t = a and t = b only if the following Euler-Lagrange equation is satisfied:
$$\frac{\partial L}{\partial q^i}(q(t), \dot q(t)) - \frac{d}{dt} \frac{\partial L}{\partial \dot q^i}(q(t), \dot q(t)) = 0. \quad (6)$$
Geodesic equation for geodesic curves. The Euler-Lagrange equation derived from minimizing the energy (4) in local coordinates, with
$$L = \frac{1}{2} \|\dot q(t)\|^2_{g(q(t))} = \frac{1}{2} g_{ik}(q)\, \dot q^i \dot q^k, \quad (7)$$
is expressed as the following system of ordinary differential equations, called the geodesic equation:
$$\ddot q^i + \Gamma^i_{jk}\, \dot q^j \dot q^k = 0, \quad \text{for all } i = 1, \dots, d, \quad (8)$$
where the Christoffel symbols are
$$\Gamma^i_{jk} = \frac{1}{2} g^{i\ell} \left( \frac{\partial g_{\ell j}}{\partial q^k} + \frac{\partial g_{k\ell}}{\partial q^j} - \frac{\partial g_{jk}}{\partial q^\ell} \right), \quad (9)$$
and $[g^{ij}]$ denotes the inverse matrix of the matrix $[g_{ij}]$. The solutions to the geodesic equation (8) give us the geodesic curves. Exponential map. Given the geodesic curves, at each point x ∈ M and for each velocity vector v ∈ T_x M, the exponential map returns the point on M reached at time t = 1 by the unique geodesic that passes through x with velocity v (Lee, 2018). Formally, we have $\exp_x(v) = \gamma(1)$, where γ(t) is the curve given by the geodesic equation (8) with initial conditions q(0) = x and q̇(0) = v.
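A quick sanity check of (7)-(8): for the flat Euclidean metric the Christoffel symbols vanish, geodesics are straight lines, and the exponential map reduces to vector addition:

```latex
g_{ij}(q) = \delta_{ij}
\;\Rightarrow\; \Gamma^i_{jk} = 0
\;\Rightarrow\; \ddot q^i = 0
\;\Rightarrow\; q(t) = x + t\, v,
\qquad \exp_x(v) = \gamma(1) = x + v .
```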
With regard to Motivation I, we note that (Chami et al., 2019) considers graph node embedding over a homogeneous negative-curvature Riemannian manifold called the hyperboloid manifold. In contrast, we generalize the embedding of nodes to an arbitrary pseudo-Riemannian manifold through the geodesic equation (8) with a learnable metric g that derives the local graph geometry from the nodes and their neighbors.

3.3. FROM GEODESICS TO GENERAL HAMILTONIAN ORBITS

The geodesic curves and the derived exponential map essentially come from (5) with L in (7) specified by the curve energy (4). However, the curves derived from this specific action may sacrifice efficacy for the graph node embedding task, since we do not know what action formulation reasonably guides the evolution of the node feature in this task. Therefore, we follow the principle of stationary action but consider a learnable action that is more flexible than the length or energy of the curve. To better model the conserved quantity during the feature evolution, we reformulate the Lagrange equation as the Hamiltonian equation. This is our Motivation II. Hamiltonian function and equation. The Hamiltonian orbit (q(t), p(t)) is given by the following Hamiltonian equations with a Hamiltonian function H:
$$\dot q^i = \frac{\partial H}{\partial p_i}, \qquad \dot p_i = -\frac{\partial H}{\partial q^i}, \quad (10)$$
where q is the local chart coordinate on the manifold while p can be interpreted as a vector of generalized momenta in the cotangent vector space. In classical mechanics, the 2d-dimensional pair (q, p) is called the phase space coordinates, which fully specify the state of a dynamical system. Later, we consider the node feature evolution following the trajectory specified by the phase space coordinates. Hamiltonian function vs. Lagrangian function. The Hamiltonian function can be taken as the Legendre transform of the Lagrangian function:
$$H(q, p) = p_i \dot q^i - L(q, \dot q), \quad \text{with } \dot q = \dot q(p) \text{ such that } p = \frac{\partial L}{\partial \dot q}. \quad (11)$$
If H is restricted to strictly convex functions, the Hamiltonian formalism is equivalent to a Lagrangian formalism (De León & Rodrigues, 2011). Geodesic equation reformulated via a Hamiltonian function. If H is set as
$$H(q, p) = \frac{1}{2} g^{ij}(q)\, p_i p_j, \quad (12)$$
where $[g^{ij}]$ denotes the inverse matrix of the matrix $[g_{ij}]$, we have the following Hamiltonian equations:
$$\dot q^i = g^{ij} p_j, \qquad \dot p_i = -\frac{1}{2}\, \partial_i g^{jk}\, p_j p_k. \quad (13)$$
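As a worked check, applying the Legendre transform (11) to the energy Lagrangian (7) recovers the Hamiltonian (12):

```latex
L(q, \dot q) = \tfrac{1}{2}\, g_{ik}(q)\, \dot q^i \dot q^k
\;\Rightarrow\;
p_i = \frac{\partial L}{\partial \dot q^i} = g_{ik}(q)\, \dot q^k
\;\Rightarrow\;
\dot q^i = g^{ij}(q)\, p_j ,
% substituting back into H = p_i \dot q^i - L:
H(q, p) = p_i\, g^{ij} p_j - \tfrac{1}{2}\, g^{ij} p_i p_j
        = \tfrac{1}{2}\, g^{ij}(q)\, p_i p_j .
```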
The Hamiltonian orbit (q(t), p(t)), as the solution of (13), again gives us the geodesic curves q(t) on the manifold M if we look only at the first d coordinates. Theorem 1 (Conservation of energy (Da Silva, 2008)). H(q(t), p(t)) is constant along the Hamiltonian orbit given as a solution of (10). In physics, H typically represents the total energy of the system, and Theorem 1 indicates that the time evolution of the system follows the law of conservation of energy.
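Theorem 1 follows in one line from the Hamiltonian equations (10) and the chain rule:

```latex
\frac{d}{dt} H(q(t), p(t))
= \frac{\partial H}{\partial q^i}\, \dot q^i + \frac{\partial H}{\partial p_i}\, \dot p_i
= \frac{\partial H}{\partial q^i}\, \frac{\partial H}{\partial p_i}
  - \frac{\partial H}{\partial p_i}\, \frac{\partial H}{\partial q^i}
= 0 .
```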

4. PROPOSED FRAMEWORK

We consider an undirected graph G = (V, E) consisting of a finite set V of vertices, together with a subset E ⊂ V × V of edges. Since the input node features in most datasets are sparse, a fully connected (FC) layer is first applied to compress the raw input node features. Let n q denote the d-dimensional compressed node feature for node n after the FC layer. However, empirical experiments (see the "MLP" results in Section 5.1) indicate that, for the graph node embedding task, such raw compression without any consideration of the graph topology does not render a good embedding. A further graph neural network architecture is thus required to update the node embedding. We consider the node features { n q}_{n∈V} to be located in a local chart of an embedding manifold M and take the node features as the chart coordinate representations of points on the manifold. In Motivations I and II in Section 3, we have provided the rationale for generalizing graph node embedding from the hyperbolic exponential map to the Riemannian geodesic, and further to the Hamiltonian orbit. To enforce the graph node feature update on a manifold with well-adapted learnable local geometry, we make use of the concepts from Section 3.

[Figure 1 here: a schematic of L stacked Hamiltonian layers; within each layer, node features undergo Hamiltonian orbit evolution under the Hamiltonian equation (15), followed by aggregation.]

Figure 1: HamGNN architecture: in each layer, each node is assigned a learnable "momentum" vector n p (cf. (14)) at time t = 0, which initializes the evolution of the node feature. The node features evolve on a manifold following (15) to ( n q(T), n p(T)) at time t = T. We only take n q(T) as the embedding and input it to the next layer. After L layers, we take n q^(L)(T) as the final node embedding.

4.1. MODEL ARCHITECTURE

Node feature evolution along Hamiltonian orbits in a Hamiltonian layer. As introduced in Section 3.3, the 2d-dimensional phase space coordinates (q, p) fully specify a system's state. Consequently, for a node feature as a point on the manifold M, we associate to each point n q a learnable momentum vector n p, acting as an external force that enables the node feature to evolve along Hamiltonian orbits on the manifold. More specifically, we set
$${}_n p = Q_{\mathrm{net}}({}_n q), \quad (14)$$
where Q_net is instantiated by an FC layer. We consider a learnable Hamiltonian function H_net that specifies the node feature evolution trajectory in the phase space via the Hamiltonian equations
$$\dot q^i = \frac{\partial H_{\mathrm{net}}}{\partial p_i}, \qquad \dot p_i = -\frac{\partial H_{\mathrm{net}}}{\partial q^i}, \quad (15)$$
with a learnable Hamiltonian energy function
$$H_{\mathrm{net}} : (q, p) \mapsto H_{\mathrm{net}}(q, p) \in \mathbb{R}. \quad (16)$$
The node features are updated along the Hamiltonian orbits, which are curves starting from each node ( n q, n p) at t = 0. In other words, they are the solutions of (15) with the initial conditions ( n q(0), n p(0)) = ( n q, n p) at t = 0. The solution of (15) in the phase space for each node n ∈ V at time T is given by a differential equation solver (Chen et al., 2018a) and denoted by ( n q(T), n p(T)). The canonical projection π( n q(T), n p(T)) = n q(T) is applied to obtain the node feature positions on the manifold at time T. The aforementioned operations are performed within one layer, which we call a Hamiltonian layer. Neighborhood aggregation. After the node features are updated along the Hamiltonian orbits, we perform neighborhood aggregation on the features { n q^{(ℓ)}(T)}_{n∈V}, where ℓ indicates the ℓ-th layer. Let N(n) = {m : (n, m) ∈ E} denote the set of neighbors of node n ∈ V. We perform a simple yet efficient aggregation for node n as follows:
$${}_n q^{(\ell+1)} = {}_n q^{(\ell)}(T) + \frac{1}{|\mathcal{N}(n)|} \sum_{m \in \mathcal{N}(n)} {}_m q^{(\ell)}(T). \quad (17)$$
Layer stacking for local geometry learning. We stack multiple Hamiltonian layers with neighborhood aggregation in between them.
We first give an intuitive explanation for the case where H_net is set as (12), in which a learnable metric g_net for the manifold is involved (see Section 4.2.1 for more details) and the features evolve along geodesic curves of minimal length (see Section 3). Within each Hamiltonian layer, the metric g_net, instantiated by a smooth FC layer, depends only on the local node position on a pseudo-Riemannian manifold and hence varies from point to point. Note that with layer stacking, these features contain information aggregated from their neighbors. The metric g_net therefore learns from the graph topology, and each node is embedded with a local geometry that depends on its neighbors. In contrast, (Chami et al., 2019) considers graph node embedding using geodesic curves over a homogeneous negative-curvature hyperboloid manifold without adjustment of the local geometry. At the beginning of Section 4, we assumed the node features { n q}_n to be located in a local chart of a preliminary embedding manifold M. The basic philosophy is that the embedding manifold evolves with a metric structure that adapts successively with neighborhood aggregation along multiple layers, while each node's features evolve along the curves to the most appropriate position as the embedding on the manifold. For a general learnable H_net, the Hamiltonian orbit that starts from a node has aggregated information from its neighbors, which guides the learning of the curve along which the node evolves on the manifold. Therefore, each node is embedded into a manifold with good adaptation to the underlying geometry of any given graph dataset even if it has diverse geometries. Conservation of H_net. By Theorem 1, updating the features along the orbit implies that H_net is conserved along the curve. Model summary. Our model is called HamGNN as we use Hamiltonian orbits for node feature updating on the manifold. We summarize the HamGNN model architecture in Fig. 1.
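The per-layer computation described above — assign momenta via (14), integrate (15), project to q(T), then aggregate neighbors — can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the trainable H_net is replaced by a fixed separable toy Hamiltonian H(q, p) = ½|p|² + ½|q|² with hand-coded gradients, Q_net by an untrained linear map, and the neural ODE solver of (Chen et al., 2018a) by a leapfrog integrator.

```python
import numpy as np

def energy(q, p):
    """Toy stand-in for H_net: H(q, p) = 0.5*|p|^2 + 0.5*|q|^2."""
    return 0.5 * float((p ** 2).sum() + (q ** 2).sum())

def momentum_net(Q, seed=0):
    """Eq. (14): p = Q_net(q); Q_net is an untrained linear layer here."""
    d = Q.shape[1]
    W = np.random.default_rng(seed).normal(scale=0.1, size=(d, d))
    return Q @ W

def evolve(Q, P, T=1.0, steps=100):
    """Integrate the Hamiltonian equations (15) with a leapfrog scheme.
    For the toy H above: dq/dt = dH/dp = p, dp/dt = -dH/dq = -q."""
    q, p = Q.copy(), P.copy()
    h = T / steps
    for _ in range(steps):
        p -= 0.5 * h * q          # half kick: -dH/dq * h/2
        q += h * p                # drift:      dH/dp * h
        p -= 0.5 * h * q          # half kick
    return q, p                   # the canonical projection would keep q only

def aggregate(Q, neighbors):
    """Neighborhood aggregation: q_n <- q_n + mean of neighbor features."""
    out = Q.copy()
    for n, nbrs in neighbors.items():
        if nbrs:
            out[n] = Q[n] + Q[list(nbrs)].mean(axis=0)
    return out
```

Because leapfrog is a symplectic integrator, the toy H is conserved along the computed orbit up to O(h²) oscillations, mirroring Theorem 1.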
The forms of the Hamiltonian function H net are given in Section 4.2.

4.2. DIFFERENT HAMILTONIAN ORBITS

We next propose different forms for H net from which the corresponding Hamiltonian orbit and its variations are obtained in our GNN model. The node features are updated along the Hamiltonian orbits, which are curves starting from each node ( n q, n p) at t = 0.

4.2.1. LEARNABLE METRIC g net

In this subsection, we consider node embedding into a pseudo-Riemannian manifold and set H_net as (12), where a learnable metric g_net for the manifold is involved. Within each Hamiltonian layer, the metric g_net, instantiated by a smooth FC layer, depends only on the local node position on the pseudo-Riemannian manifold and hence varies from point to point. The output of g_net at position q represents the local representation of the inverse metric [g^{ij}]. However, from (12), the space complexity is of order d³ because the partial derivative of g_net's output is a d × d matrix. We therefore consider only diagonal metrics to mitigate the space complexity. More specifically, we define
$$g_{\mathrm{net}}(q) = \mathrm{diag}\big([\underbrace{-1, \dots, -1}_{r}, \underbrace{1, \dots, 1}_{s}] \odot h_{\mathrm{net}}(q)\big), \quad (18)$$
where h_net : R^d → R^d consists of non-linear trainable layers and ⊙ denotes element-wise multiplication. To ensure non-degeneracy of the metric, the output of h_net is kept away from 0 by choosing a strictly positive final activation function. The vector [-1, …, -1, 1, …, 1] controls the signature (r, s) of the metric g with r + s = d, where r and s are the numbers of -1s and 1s, respectively. The signature of the metric is set as a hyperparameter. According to (13), we have
$$\dot q^i = g^{ij}_{\mathrm{net}}\, p_j, \qquad \dot p_i = -\frac{1}{2}\, \partial_i g^{jk}_{\mathrm{net}}\, p_j p_k. \quad (19)$$
Intuitively, the node features evolve through the "shortest" curves on the manifold. The exponential map used in hyperbolic GNNs (Chami et al., 2019) is essentially the geodesic curve on a hyperbolic manifold with an explicit formulation due to the hyperbolic assumption. We do not enforce any such assumption here and let the model learn the embedding geometry.
$$\dot q^i = \frac{\partial H_{\mathrm{net}}}{\partial p_i}, \qquad \dot p_i = -\frac{\partial H_{\mathrm{net}}}{\partial q^i} + f_{\mathrm{net}}(q).$$
Instead of conserving the energy during the feature update along the Hamiltonian orbit, we now also include an additional force term f_net(q) during the node feature update.
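A minimal sketch of the diagonal metric construction in Section 4.2.1, with h_net instantiated as an untrained single linear map followed by softplus plus a constant shift (one simple way to keep the output strictly positive; the paper's h_net is a trainable non-linear network whose exact final activation is not specified here):

```python
import numpy as np

def h_net(q, seed=0, eps=0.1):
    """Stand-in for h_net: R^d -> R^d with strictly positive output.
    softplus(z) + eps keeps every entry >= eps, so the metric stays
    non-degenerate (bounded away from 0)."""
    d = q.shape[-1]
    W = np.random.default_rng(seed).normal(size=(d, d))
    z = q @ W
    return np.log1p(np.exp(z)) + eps   # softplus(z) + eps > 0

def g_net_diag(q, r):
    """Diagonal inverse metric of signature (r, s) with s = d - r:
    diag entries = [-1, ..., -1, 1, ..., 1] * h_net(q)."""
    d = q.shape[-1]
    signs = np.concatenate([-np.ones(r), np.ones(d - r)])
    return signs * h_net(q)
```

Only the d diagonal entries are produced, which avoids the d³ storage needed for the partial derivatives of a full d × d metric output.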

4.2.5. LEARNABLE H net WITH A FLEXIBLE SYMPLECTIC FORM

Hamiltonian equations have a coordinate-free representation using the symplectic 2-form. We present a self-contained review of the differential geometry related to symplectic 2-forms and Hamiltonian equations in Appendix A. The chart coordinate representation (q, p) may not be a Darboux coordinate system for the symplectic 2-form. Even though our learnable H_net may be able to learn the energy representation under the chosen chart coordinate system, we consider a learnable symplectic 2-form to act in concert with H_net. More specifically, following (Chen et al., 2021), we set
$$\theta^1_{\mathrm{net}} = f_{i,\mathrm{net}}\, dq^i,$$
where f_{i,net} : M → R is the i-th component of the output of a neural-network-parameterized function f_net, from which the Hamiltonian orbit is given by (Chen et al., 2021) as
$$(\dot q^i, \dot p_i)^\intercal = W^{-1}(q, p)\, \nabla H_{\mathrm{net}}(q, p), \quad (23)$$
where the skew-symmetric 2d × 2d matrix W, whose elements are written in terms of (∂_i f_{j,net} - ∂_j f_{i,net}), is given in (50) due to space constraints.
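As a consistency check (our own, under the usual sign conventions): choosing f_{i,net}(q, p) = p_i recovers the tautological 1-form, W becomes the canonical symplectic matrix, and (23) reduces to the standard Hamiltonian equations (15):

```latex
\theta^1 = p_i\, dq^i
\;\Rightarrow\;
\omega = d\theta^1 = dp_i \wedge dq^i ,
% canonical symplectic matrix in the (q, p) ordering:
W = \begin{pmatrix} 0 & -I_d \\ I_d & 0 \end{pmatrix}, \qquad
W^{-1} = \begin{pmatrix} 0 & I_d \\ -I_d & 0 \end{pmatrix},
% so (23) becomes
\begin{pmatrix} \dot q \\ \dot p \end{pmatrix}
= W^{-1}
\begin{pmatrix} \partial H_{\mathrm{net}} / \partial q \\
                \partial H_{\mathrm{net}} / \partial p \end{pmatrix}
= \begin{pmatrix} \partial H_{\mathrm{net}} / \partial p \\
                 -\partial H_{\mathrm{net}} / \partial q \end{pmatrix}.
```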

5. EXPERIMENTS

In this section, we implement the proposed HamGNNs with the different settings shown in Section 4.2 and Appendix B. We select datasets with various geometries as the benchmark datasets, including the three citation networks Cora (McCallum et al., 2004), Citeseer (Sen et al., 2008), and Pubmed (Namata et al., 2012), and two low-hyperbolicity datasets (Chami et al., 2019), named Disease and Airport. To fairly compare the performance of the proposed HamGNN, we select several popular GNN models as baselines, including the Euclidean GNNs GCN (Kipf & Welling, 2017a), GAT (Veličković et al., 2018), SAGE (Hamilton et al., 2017), and SGC (Wu et al., 2019); the hyperbolic GNNs (Chami et al., 2019; Liu et al., 2019) HGNN, HGCN, and HGAT, as well as GIL (Zhu et al., 2020b), which learns a weighting between Euclidean and hyperbolic space embeddings; and the graph neural diffusion GNNs GRAND (Chamberlain et al., 2021b), GraphCON (Rusch et al., 2022), and LGCN (Zhang et al., 2021b). The MLP baseline does not utilize the graph topology information. To further demonstrate the advantage of HamGNN, we also include one vanilla ODE system, whose formulation is given in Appendix D.1, without the crucial Hamiltonian layer. Due to space constraints, we refer the readers to Appendix C for the datasets and implementation details. We stress at the outset that we do not aim to outperform all the baselines or other general-purpose GNN models on specific datasets. Instead, our objective is to design a new node embedding strategy that can automatically learn, without extensive tuning, the underlying geometry of any given graph dataset even if it has diverse geometries. To demonstrate that HamGNN can adapt well automatically, we include two graph node embedding downstream tasks: node classification and link prediction. We then compare the HamGNN variants using (20) and (23).
The difference between these two settings is that the symplectic form ω² in (20) is set to be the special Poincaré 2-form, while in (23) the symplectic form is learnable. We observe, however, that the two HamGNNs achieve similar performance, and the more flexible symplectic form does not improve the model performance. This may be due to the fundamental Darboux theorem (Theorem 3) in symplectic geometry, which states that we can always find a Darboux coordinate system in which any symplectic form takes the form of the Poincaré 2-form. The feature-compressing FC layer may have the network capacity to approximate the Darboux coordinate map, while the flexible learnable H_net also has the network capacity to obtain the energy representation under the chosen chart coordinate system.

6. CONCLUSION

In this paper, we have designed a new node embedding strategy from Hamiltonian orbits that can automatically learn, without extensive tuning, the underlying geometry of any given graph dataset even if it has diverse geometries. We demonstrate empirically that our approach adapts better than popular state-of-the-art graph node embedding GNNs to different types of graph datasets via two graph node embedding downstream tasks, including node classification and link prediction. From experiments, we observe that the over-smoothing problem can be mitigated if the node features evolve through the Hamiltonian orbits.

A DIFFERENTIAL GEOMETRY AND HAMILTONIAN SYSTEM

In this supplementary material, we review some concepts from a differential geometry perspective. We hope that this overview makes the paper more accessible for readers from the graph learning community.

A.1 MANIFOLD, BUNDLES, AND FIELDS

Roughly speaking, a manifold is a topological space that locally looks like Euclidean space. More precisely, a topological space (M, O), where O is the collection of open sets on the space M, is called a d-dimensional manifold if for every point x ∈ M we can find an open neighborhood U ∈ O of x and a coordinate map q : U → q(U) ⊆ R^d that is a homeomorphism, where R^d is the d-dimensional Euclidean space with the standard topology. (U, q) is called a chart of the manifold, which gives us a numerical representation for a local area in M. In this work, we only consider smooth manifolds (Lee, 2013), i.e., manifolds in which any two overlapping charts are smoothly compatible. The set of all smooth functions from M to R is denoted by C^∞(M). On top of smooth manifolds, we can define other related manifolds such as the tangent and cotangent bundles and, more generally, tensor bundles. From bundles, we can define vector and covector fields and, more generally, tensor fields. In this work, we mainly consider manifolds with 2-forms, which are special (0, 2) smooth tensor fields with an antisymmetry constraint. More specifically, we mainly consider the symplectic form (Lee, 2013), with some additional light shed on the metric tensor, which is another type of (0, 2) smooth tensor field.

Definition 1 (tangent vector and tangent space). Let γ : I → M be a smooth curve through x ∈ M such that γ(0) = x, where I is an interval neighborhood of 0. The tangent vector is the directional derivative operator at x along γ, i.e., the linear map
$$v_{\gamma,x} : C^\infty(M) \to \mathbb{R}, \qquad f \mapsto (f \circ \gamma)'(0).$$
We also call this directional derivative operator the "velocity" of γ at 0 and denote it by γ̇(0). (More generally, the "velocity" of γ at t, denoted γ̇(t), is a tangent vector at the point γ(t) ∈ M via the curve reparametrization trick (Fecko, 2006).) Correspondingly, the tangent space to M at x is the vector space over R with the underlying set
$$T_x M := \{ v_{\gamma,x} \mid \gamma \text{ is a smooth curve and } \gamma(0) = x \}.$$
Note that for f ∈ C^∞(M), using the chart (U, q) with x ∈ U, we have the local representation
$$v_{\gamma,x}(f) := (f \circ \gamma)'(0) = (f \circ q^{-1} \circ q \circ \gamma)'(0) = (q^i \circ \gamma)'(0) \cdot \partial_i (f \circ q^{-1}) \big|_{q(x)},$$
where q^i is the i-th component of q, ∘ is function composition, and (·)|_{q(x)} means evaluating (·) at q(x). Therefore, for a local chart around x, we have a basis of T_x M,
$$\left\{ \frac{\partial}{\partial q^1}\Big|_x, \dots, \frac{\partial}{\partial q^d}\Big|_x \right\},$$
which we call the chart-induced basis, where $\frac{\partial}{\partial q^i}\big|_x f := \partial_i (f \circ q^{-1})|_{q(x)}$.

Definition 2 (tangent bundle). Given a smooth manifold M, the tangent bundle of M is the disjoint union of all the tangent spaces to M, i.e.,
$$TM := \bigsqcup_{x \in M} T_x M,$$
equipped with the canonical projection map π : TM → M, X ↦ π(X), where π sends each vector in T_x M to the point x at which it is tangent, and ⊔ denotes the disjoint union. Furthermore, the tangent bundle is a 2d-dimensional manifold. If X ∈ π^{-1}(U) ⊆ TM with a local chart (U, q) on M, then X ∈ T_{π(X)} M by definition. Since π(X) ∈ U, X can be written in terms of the chart-induced basis:
$$X = \hat p^i(X)\, \frac{\partial}{\partial q^i}\Big|_{\pi(X)},$$
where $\hat p^1, \dots, \hat p^d$ are smooth scalar functions. We can then define the following map as a local chart for the manifold TM induced from the chart (U, q) on M:
$$\xi : \pi^{-1}(U) \to q(U) \times \mathbb{R}^d \subseteq \mathbb{R}^{2d}, \qquad X \mapsto (q^1(\pi(X)), \dots, q^d(\pi(X)), \hat p^1(X), \dots, \hat p^d(X)),$$
and the topological structure on TM is derived from the initial topology to ensure continuity.

Definition 3 (vector field). A vector field on M is a smooth section (Lee, 2013) of the tangent bundle, i.e., a smooth map σ : M → TM such that π ∘ σ = id_M, where id_M is the identity map on M. We denote the set of all vector fields on M by
$$\Gamma(TM) := \{ \sigma : M \to TM \mid \sigma \text{ is smooth and } \pi \circ \sigma = \mathrm{id}_M \}.$$

Definition 4 (cotangent vector, cotangent bundle, dual basis, and covector field). For the vector space T_x M, a continuous linear functional from T_x M to R is called a cotangent vector at x.
The set of all such linear maps is denoted $T^*_xM$; it is the dual vector space of $T_xM$. For $f \in C^\infty(M)$, at each point $x$, we define the following linear operator in $T^*_xM$:
$$(df)_x : T_xM \to \mathbb{R}, \qquad X_x \mapsto (df)_x(X_x) := X_x(f).$$
Given a chart $(U, q)$ with $x \in U$ and its chart-induced basis, the dual basis for the dual space $T^*_xM$ is the set $\big((dq^1)_x, \ldots, (dq^d)_x\big)$, where
$$(dq^a)_x\Big(\frac{\partial}{\partial q^b}\Big|_x\Big) = \frac{\partial}{\partial q^b}\Big|_x(q^a) = \delta^a_b,$$
with $\delta^a_b = 1$ iff $a = b$ and $\delta^a_b = 0$ otherwise. We call it the chart-induced dual basis. Analogous to the above definition of the tangent bundle, we can define the cotangent bundle of $M$ as
$$T^*M := \coprod_{x \in M} T^*_xM,$$
which is again a $2d$-dimensional manifold, equipped with the canonical projection map $\pi : T^*M \to M$ that sends each covector in $T^*_xM$ to the point $x$ at which it is cotangent. If $\omega \in \pi^{-1}(U) \subseteq T^*M$ for a local chart $(U, q)$ with $x \in U$, then $\omega \in T^*_{\pi(\omega)}M$ by definition. Since $\pi(\omega) \in U$, $\omega$ can be written in terms of the chart-induced dual basis:
$$\omega = p_i(\omega)\,(dq^i)_{\pi(\omega)},$$
where $p_1, \ldots, p_d$ are smooth scalar functions. We can then define the following map as a local chart for the manifold $T^*M$, induced from the chart $(U, q)$ on $M$:
$$\xi : \pi^{-1}(U) \to q(U) \times \mathbb{R}^d \subseteq \mathbb{R}^{2d}, \qquad \omega \mapsto \big(q^1(\pi(\omega)), \ldots, q^d(\pi(\omega)), p_1(\omega), \ldots, p_d(\omega)\big),$$
and the topological structure on $T^*M$ is derived from the initial topology to ensure continuity. The covector fields are smooth sections of $T^*M$; the set of all covector fields is denoted $\Gamma(T^*M)$.

Definition 5 (tensor field and 2-form). Tensor fields can be defined as smooth sections of tensor bundles, analogously to vector and covector fields; we refer readers to (Lee, 2013) for more details. Here, instead of a rigorous definition, we state a basic property of tensor fields: a $(r, s)$ tensor field $\tau$ is, at each $x \in M$, a multilinear map
$$\tau_x : \underbrace{T^*_xM \times \cdots \times T^*_xM}_{r \text{ copies}} \times \underbrace{T_xM \times \cdots \times T_xM}_{s \text{ copies}} \to \mathbb{R}.$$
A differential $k$-form $\omega$ is a $(0, k)$ tensor field that is alternating (Lee, 2013). Specifically, for a 2-form $\omega$, at each point $x$, $\omega_x$ is an antisymmetric $(0,2)$ tensor
$$\omega_x : T_xM \times T_xM \to \mathbb{R} \quad \text{s.t.} \quad \omega_x(X_1, X_2) = -\omega_x(X_2, X_1) \quad \forall\, X_1, X_2 \in T_xM;$$
in other words, $\omega$ satisfies
$$\omega : \Gamma(TM) \times \Gamma(TM) \to C^\infty(M) \quad \text{s.t.} \quad \omega(X, Y) = -\omega(Y, X) \quad \forall\, X, Y \in \Gamma(TM).$$
In local chart representation, every $k$-form $\omega$ can be expressed locally on $U$ as
$$\omega = \omega_{a_1 \cdots a_k}\, dx^{a_1} \wedge \cdots \wedge dx^{a_k},$$
where $\omega_{a_1 \cdots a_k} \in C^\infty(U)$, $1 \le a_1 < \cdots < a_k \le \dim M$ are increasing index sequences, and $dx^{a_1} \wedge \cdots \wedge dx^{a_k}$ is the wedge product (Lee, 2013). Here we may abstractly view the set $\{dx^{a_1} \wedge \cdots \wedge dx^{a_k}\}_{a_1, \ldots, a_k}$, with the $a_i$ enumerated from $1$ to $d$, as a basis, without further discussion of the wedge product (we refer readers to (Lee, 2013) for more details).

Definition 6 (integral curve). Given a vector field $X$ on $M$, an integral curve of $X$ is a differentiable curve $\gamma : I \to M$, where $I \subseteq \mathbb{R}$ is an interval, whose velocity at each point is equal to the value of $X$ at that point:
$$\dot\gamma(t) = X_{\gamma(t)} \quad \text{for all } t \in I.$$
If $0 \in I$, the point $\gamma(0)$ is called the starting point of $\gamma$. From Picard's theorem (Hartman, 2002), we know that locally we always have an interval $I$ on which the solution exists and is necessarily unique.

Definition 7 (exterior derivative and closed form). The exterior derivative is a linear operator that maps $k$-forms to $(k+1)$-forms. In local chart representation, if $\omega$ is a $k$-form on $M$ with local representation $\omega = \omega_{a_1 \cdots a_k}\, dx^{a_1} \wedge \cdots \wedge dx^{a_k}$, then the exterior derivative is
$$d\omega = d\omega_{a_1 \cdots a_k} \wedge dx^{a_1} \wedge \cdots \wedge dx^{a_k} = \partial_b\, \omega_{a_1 \cdots a_k}\, dx^b \wedge dx^{a_1} \wedge \cdots \wedge dx^{a_k}.$$
A form $\omega$ is called closed if $d\omega = 0$.

Theorem 2. From (Rudin et al., 1976; Lee, 2013), we know that $d \circ d \equiv 0$, which is the dual statement, via Stokes' theorem, that the boundary of the boundary of a manifold is empty.
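Definition 6 can be made concrete with a small numerical sketch (an illustration under our own choice of vector field, not the paper's implementation): the integral curve of a vector field is approximated with a fixed-step explicit Euler scheme, the same family of solver used later in Appendix C.1.

```python
import numpy as np

def integral_curve(X, x0, t_max, n_steps):
    """Approximate the integral curve gamma'(t) = X(gamma(t)), gamma(0) = x0,
    with fixed-step explicit Euler. Picard's theorem guarantees local
    existence/uniqueness of the exact curve; Euler only approximates it."""
    dt = t_max / n_steps
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * X(xs[-1]))
    return np.array(xs)

# Example vector field X(x) = -x on M = R; the exact integral curve
# starting at x0 is gamma(t) = x0 * exp(-t).
traj = integral_curve(lambda x: -x, [1.0], t_max=1.0, n_steps=10000)
```

With 10,000 steps the endpoint is close to the exact value $e^{-1}$, which is the standard first-order convergence behavior of Euler's method.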

A.2 HAMILTONIAN VECTOR FIELDS ON SYMPLECTIC COTANGENT BUNDLE

Definition 8 (symplectic vs. Riemannian). Let $M$ be a smooth manifold.
1. Symplectic form: a 2-form (hence antisymmetric) $\omega$ is said to be a symplectic form on $M$ if it is closed, i.e., $d\omega = 0$, and non-degenerate, i.e., $(\forall\, Y \in \Gamma(TM) : \omega(X, Y) = 0) \Rightarrow X = 0$.
2. Metric tensor: a $(0,2)$ tensor field $g$ is said to be a Riemannian metric on $M$ if it is non-degenerate and symmetric at each $x$, i.e.,
$$g_x : T_xM \times T_xM \to \mathbb{R} \quad \text{s.t.} \quad g_x(X_1, X_2) = g_x(X_2, X_1) \quad \forall\, X_1, X_2 \in T_xM.$$

Remark 1. A manifold equipped with a symplectic form $\omega$ is called a symplectic manifold, while a manifold equipped with a metric tensor $g$ is called a pseudo-Riemannian manifold. The Riemannian metric $g$ is a $(0,2)$-tensor field measuring the norms of tangent vectors and the angles between them; to some extent, the "shape structure" of the manifold $M$ is only available once $M$ is equipped with a metric $g$.
• From the above definition, the symplectic form and the metric tensor are both non-degenerate bilinear $(0,2)$ tensor fields. One difference is that the symplectic form is antisymmetric while the metric tensor is symmetric. At each point $x \in M$, in a local chart coordinate representation, the $(0,2)$ tensor evaluated on two tangent vectors can be represented as the matrix product $u^\top W_x v$, where the $d \times d$ matrix $W_x$ is symmetric for a metric tensor and skew-symmetric for a symplectic form, and $u, v$ are the numerical representations of the two tangent vectors in the local chart.
• Because of non-degeneracy, on a symplectic manifold $M$ we can define an isomorphism between $\Gamma(TM)$ and $\Gamma(T^*M)$ by mapping a vector field $X \in \Gamma(TM)$ to a 1-form $\eta_X \in \Gamma(T^*M)$, where $\eta_X(\cdot) := \omega^2(\cdot, X)$. Similarly, on a pseudo-Riemannian manifold, we can define an isomorphism between $\Gamma(TM)$ and $\Gamma(T^*M)$ by mapping a vector field to a 1-form through $\alpha_g : TM \to T^*M$.
In a local chart coordinate representation, $\alpha_g$ has components $g_{ij}$ and its inverse $\alpha_g^{-1}$ has components $g^{ij}$, with $\sum_{j=1}^{d} g_{ij} g^{jk} = \delta^k_i$, where $\delta^k_i = 1$ iff $i = k$ and $0$ otherwise. Note that the components of the metric and the inverse metric are all taken in a given chart without mentioning it explicitly.

Definition 9 (Hamiltonian flow and orbit). For a general symplectic manifold $M$ with a symplectic form $\omega^2$, if we have $H \in C^\infty(M)$, then $dH$ is a differential 1-form on $M$. We define the vector field called the Hamiltonian flow $X_H$ associated to the Hamiltonian $H$ as the vector field satisfying $\eta_{X_H}(\cdot) = dH(\cdot)$. The integral curves of $X_H$ are called Hamiltonian orbits of $H$:
$$\dot\gamma(t) = (X_H)_{\gamma(t)} \quad \text{for all } t \in I,$$
where $(X_H)_{\gamma(t)}$ is the tangent vector at $\gamma(t) \in M$.

Definition 10 (Poincaré 1-form and 2-form). On the cotangent bundle $T^*M$ of a manifold $M$, there is a natural 1-form, the Poincaré (tautological) 1-form
$$\theta^1_{\text{Poincaré}} = p_i\, dq^i.$$
Applying the exterior derivative then gives the Poincaré 2-form
$$\omega^2_{\text{Poincaré}} = d\theta^1 = d(p_i\, dq^i) = \sum_i dp_i \wedge dq^i, \tag{45}$$
which is closed by (38). Therefore the Poincaré 2-form is a symplectic form on the cotangent bundle, and cotangent bundles are the natural phase spaces of classical mechanics (De León & Rodrigues, 2011). For more general symplectic forms on the cotangent bundle, we can again use (38) to construct a closed 2-form from a 1-form that is potentially symplectic:

Corollary 1 ((Chen et al., 2021)). According to (38), on the cotangent bundle $T^*M$ of a manifold $M$, from a 1-form $\theta^1 = f_i\, dq^i$ we derive a closed 2-form using the exterior derivative (37), with local representation
$$\omega^2 = d\theta^1 = d(f_i\, dq^i) = \sum_{i<j} (\partial_i f_j - \partial_j f_i)\, dq^i \wedge dq^j.$$

Remark 2. Note, strictly speaking, we only obtain the necessary "closed" condition for $\omega^2$ to be potentially symplectic. However, this is sufficient for our use in our proposed framework.
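The matrix picture in the remark after Definition 8 can be checked directly in code (a toy sketch with hand-picked matrices, not tied to the paper's learned tensors): in a chart, a $(0,2)$ tensor at a point is a $d \times d$ matrix $W$ acting as $(u, v) \mapsto u^\top W v$, symmetric for a metric and skew-symmetric for a symplectic form, and non-degeneracy corresponds to $\det W \ne 0$.

```python
import numpy as np

g = np.array([[2.0, 1.0],
              [1.0, 3.0]])        # symmetric, det = 5 != 0: a metric tensor
omega = np.array([[0.0, 1.0],
                  [-1.0, 0.0]])   # skew-symmetric, det = 1 != 0: symplectic

def pairing(W, u, v):
    # evaluate the (0,2) tensor represented by W on the pair (u, v)
    return u @ W @ v

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
```

Evaluating `pairing` on swapped arguments exhibits the symmetry of $g$ and the antisymmetry of $\omega$; in particular $\omega(u, u) = 0$ for every $u$, which is the algebraic fact behind Corollary 2 below.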
We now state one of the most fundamental results in symplectic geometry, which links a general symplectic form to the special Poincaré 2-form.

Theorem 3 (Darboux (Lee, 2013)). Let $(M, \omega^2)$ be a $2d$-dimensional symplectic manifold. For any $x \in M$, there are smooth coordinates $(q^1, \ldots, q^d, p_1, \ldots, p_d)$ centered at $x$ in which $\omega^2$ has the coordinate representation
$$\omega^2 = \sum_{i=1}^{d} dp_i \wedge dq^i, \tag{46}$$
i.e., the form of a Poincaré 2-form; any coordinates satisfying (46) are called Darboux or symplectic coordinates.

Remark 3. Therefore, for any symplectic form $\omega^2$ on the above cotangent bundle $T^*M$ of $M$, we can always find symplectic coordinates in which it takes the form of the Poincaré 2-form.

Corollary 2 ((Da Silva, 2008)). According to (33), by the antisymmetry of 2-forms, we have
$$\omega^2(X_H, X_H) = 0, \tag{47}$$
which implies that Hamiltonian vector fields preserve their Hamiltonian functions $H$.

Remark 4. In physics, Hamiltonian functions are typically energy functions of a physical system. Corollary 2 indicates that if the system updates its state according to the Hamiltonian vector field, the time evolution of the system follows the law of conservation of energy.

Definition 11 (Hamiltonian orbit generated from the Poincaré 2-form (Lee, 2013)). The Hamiltonian orbit generated by the Hamiltonian flow $X_H$ on the cotangent bundle equipped with the Poincaré 2-form is given in local coordinates $(q, p)$ by
$$\dot q^i = \frac{\partial H}{\partial p_i}, \qquad \dot p_i = -\frac{\partial H}{\partial q^i}. \tag{48}$$

Definition 12 (cogeodesic orbits). If additionally $M$ is equipped with a metric tensor $g$, i.e., if $M$ is a pseudo-Riemannian manifold with metric $g$, and we set the Hamiltonian on $T^*M$ to $H(q, p) = \frac{1}{2} g^{ij}(q)\, p_i p_j$, then the Hamiltonian orbit generated from the Poincaré 2-form is given by
$$\dot q^i = \frac{\partial H}{\partial p_i} = g^{ij} p_j, \qquad \dot p_i = -\frac{\partial H}{\partial q^i} = -\frac{1}{2}\, \partial_i g^{jk}\, p_j p_k. \tag{49}$$
These are called the cogeodesic orbits of $(M, g)$.
The canonical projection of cogeodesic orbits under $\pi$ (30) is called a geodesic on the base manifold $M$; geodesics generalize the notion of a "straight or shortest line" to manifolds, where length is measured by the metric tensor.
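The cogeodesic equations (49) can be integrated numerically. As a toy sketch (our own illustrative choice of manifold, not the paper's learned metric), take the hyperbolic upper half-plane with metric $g = \operatorname{diag}(1/y^2, 1/y^2)$, so $g^{ij} = y^2 I$ and $H(q, p) = \tfrac{1}{2} y^2 (p_x^2 + p_y^2)$; Hamilton's equations become $\dot x = y^2 p_x$, $\dot y = y^2 p_y$, $\dot p_x = 0$, $\dot p_y = -y\,(p_x^2 + p_y^2)$.

```python
import numpy as np

def hamiltonian(q, p):
    # H(q, p) = 0.5 * g^{ij}(q) p_i p_j on the upper half-plane (y = q[1])
    return 0.5 * q[1] ** 2 * (p[0] ** 2 + p[1] ** 2)

def step(q, p, dt):
    # one explicit Euler step of the cogeodesic equations (approximation)
    dq = q[1] ** 2 * p
    dp = np.array([0.0, -q[1] * (p[0] ** 2 + p[1] ** 2)])
    return q + dt * dq, p + dt * dp

# start at (x, y) = (0, 1) with momentum (1, 0); the projected curve is a
# hyperbolic geodesic, and H should stay (approximately) constant
q, p = np.array([0.0, 1.0]), np.array([1.0, 0.0])
H0 = hamiltonian(q, p)
for _ in range(10000):
    q, p = step(q, p, 1e-4)
```

Corollary 2 says the exact orbit conserves $H$; with a small fixed Euler step the numerical drift stays small over this horizon, and the projected point stays in the upper half-plane.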

B SOME FORMULATIONS AND MORE HAMILTONIAN ORBITS

In this section, we first present the formulations that are not shown in detail in the main paper due to space constraints. More Hamiltonian-related flows are also presented, which, however, do not strictly follow the Hamiltonian orbits on the cotangent bundle $T^*M$.

B.1 W IN SECTION 4.2.5

$$W = \begin{pmatrix} 0 & \partial_1 f_{2,\text{net}} - \partial_2 f_{1,\text{net}} & \partial_1 f_{3,\text{net}} - \partial_3 f_{1,\text{net}} & \cdots \\ \partial_2 f_{1,\text{net}} - \partial_1 f_{2,\text{net}} & 0 & \partial_2 f_{3,\text{net}} - \partial_3 f_{2,\text{net}} & \cdots \\ \partial_3 f_{1,\text{net}} - \partial_1 f_{3,\text{net}} & \partial_3 f_{2,\text{net}} - \partial_2 f_{3,\text{net}} & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \tag{50}$$

B.2 POINCARÉ 2-FORM $\omega^2_{\text{Poincaré}}$ AND LEARNABLE METRIC $g_{\text{net}}$ WITH RELAXATION

Similar to Section 4.2.4, we now impose additional system biases along the curve compared to the cogeodesic orbits of Section 4.2.1:
$$\dot q^i = g^{ij}_{\text{net}} p_j, \qquad \dot p_i = -\frac{1}{2}\, \partial_i g^{jk}_{\text{net}}\, p_j p_k + f_{\text{net}}(q). \tag{51}$$
Therefore, the projection of the curve from (51) no longer follows a geodesic curve on the base manifold equipped with the metric $g_{\text{net}}$.
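The matrix in (50) has entries $W_{ij} = \partial_i f_{j,\text{net}} - \partial_j f_{i,\text{net}}$, i.e., $W = J^\top - J$ where $J_{ij} = \partial_j f_i$ is the Jacobian, so it is skew-symmetric by construction. A small sketch (the paper would use autograd on the actual $f_{\text{net}}$; here `f` is a hypothetical stand-in and the Jacobian is taken by finite differences):

```python
import numpy as np

def jacobian_fd(f, x, h=1e-6):
    """Central finite-difference Jacobian J[i, j] = d f_i / d x_j."""
    d = x.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

def W_matrix(f, x):
    # W_ij = d_i f_j - d_j f_i, i.e. W = J^T - J: skew-symmetric by
    # construction, matching the matrix in (50).
    J = jacobian_fd(f, x)
    return J.T - J

# hypothetical stand-in for f_net
f = lambda x: np.array([x[1] ** 2, x[0] * x[2], np.sin(x[0])])
W = W_matrix(f, np.array([0.3, -0.7, 1.1]))
```

For this `f`, $W_{01} = \partial_0 f_1 - \partial_1 f_0 = x_2 - 2x_1 = 1.1 + 1.4 = 2.5$ at the chosen point, which the finite-difference construction recovers to high accuracy.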

B.3 HAMILTONIAN RELAXATION FLOW WITH HIGHER DIMENSIONAL "MOMENTUM"

In the main paper, we present a new type of Hamiltonian-related flow, which does not strictly follow the Hamiltonian equations. Inspired by the work of Haber & Ruthotto (2017), we now associate to each node $q \in \mathbb{R}^d$ an additional learnable momentum vector $p \in \mathbb{R}^k$, which, however, is not strictly a cotangent vector of the manifold if $d \ne k$. We update the node features using the following equations:
$$\dot q = \phi\big(h^1_{\text{net}}(p)\big) - \rho q, \qquad \dot p = \phi\big(h^2_{\text{net}}(q)\big) - \rho p,$$
where $h^1_{\text{net}}$ and $h^2_{\text{net}}$ are neural networks with $d$-dimensional and $k$-dimensional outputs respectively, $\phi$ is a non-linear activation function, and $\rho$ is a scalar hyperparameter.
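A toy sketch of this relaxation flow with $d \ne k$ (the linear maps `A` and `B` are hypothetical stand-ins for $h^1_{\text{net}}$ and $h^2_{\text{net}}$, $\phi$ is taken as ReLU, and the Euler discretization is our own simplification):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
d, k, rho, dt = 4, 2, 0.5, 0.01      # node dim d, momentum dim k (d != k)
A = rng.standard_normal((d, k))      # stand-in for h1_net: R^k -> R^d
B = rng.standard_normal((k, d))      # stand-in for h2_net: R^d -> R^k

q, p = rng.standard_normal(d), rng.standard_normal(k)
for _ in range(1000):
    # q_dot = phi(h1(p)) - rho*q,  p_dot = phi(h2(q)) - rho*p
    q_dot = relu(A @ p) - rho * q
    p_dot = relu(B @ q) - rho * p
    q, p = q + dt * q_dot, p + dt * p_dot
```

The $-\rho q$ and $-\rho p$ terms act as damping; without them, the coupled updates have no mechanism to keep the feature magnitudes controlled.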

C MAIN PAPER EXPERIMENTS SETTING

We select the citation networks Cora (McCallum et al., 2004), Citeseer (Sen et al., 2008), and Pubmed (Namata et al., 2012), and the low-hyperbolicity (Chami et al., 2019) Disease and Airport graphs as the benchmark datasets. The citation datasets are widely used in graph representation learning tasks. For each dataset, we randomly generate 20 samples per class for training, 500 samples for validation, and 1000 samples for testing, following the same setting as Kipf & Welling (2017a). The low-hyperbolicity datasets Disease and Airport are proposed in Chami et al. (2019), where Euclidean GNN models cannot learn the node embeddings effectively. We follow the same data splitting and pre-processing as Chami et al. (2019) for the Disease and Airport datasets. For the link prediction tasks, we report the best results from different versions of HamGNN: HamGNN (19) for the Disease dataset and HamGNN (21) for the other datasets. This is because we get asynchronous CUDA kernel errors on Disease on our A5000 GPU during this rebuttal period when using (21); we will update all results to (21) in the final version after this bug is resolved. We tune the model parameters of HamGNN based on the results on the validation data. We use the ADAM optimizer (Kingma & Ba, 2014) with weight decay 0.001. We set the learning rate to 0.01 for the citation networks and 0.001 for the Disease and Airport datasets. The results presented in Table 1 are under the 3-layer HamGNN setting. We report results from running the experiments 10 times with different initial random seeds. HamGNN first compresses the dimension of the input features to a fixed hidden dimension (e.g., 64) through a fully connected (FC) layer. The obtained hidden features are then input to the stacked $H_{\text{net}}$ ODE layers and aggregation layers. The $q$ in the Hamiltonian flow is initialized with the node embeddings after the FC layer. Table 7 shows the implementation details of the layers in HamGNN.
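The Planetoid-style split described above (20 training nodes per class, 500 validation, 1000 test) can be sketched as follows; the function name and the toy label array are illustrative, not from the paper's code:

```python
import numpy as np

def public_split(labels, per_class=20, n_val=500, n_test=1000, seed=0):
    """Sketch of the split: `per_class` random training nodes per class,
    then n_val validation and n_test test nodes from the remainder."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train.extend(rng.choice(idx, size=per_class, replace=False))
    train = np.array(train)
    rest = rng.permutation(np.setdiff1d(np.arange(labels.size), train))
    return train, rest[:n_val], rest[n_val:n_val + n_test]

labels = np.repeat(np.arange(7), 400)   # toy stand-in for Cora's 7 classes
tr, va, te = public_split(labels)
```

The three index sets are disjoint by construction, and the training set is class-balanced.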

C.1 ODE SOLVER FOR HAMILTONIAN EQUATIONS

We employ the ODE solver of (Chen, 2018) in the implementation of HamGNN. For computational efficiency and performance, the fixed-step explicit Euler solver (Chen et al., 2018a) is used in HamGNN. We also compare the influence of different ODE solvers and report the results in Table 4. One drawback of the ODE solvers provided in (Chen, 2018) is that they are not guaranteed to have the energy-preserving property when solving the Hamiltonian equations. However, this flaw does not significantly deteriorate our model performance regarding the embedding adaptation to datasets with various structures: our extensive experiments on the node classification and link prediction tasks have demonstrated that the solvers provided in (Chen, 2018) are sufficient for our use. To demonstrate the advantage of HamGNN's design, we also conduct experiments that replace the Hamiltonian layer in HamGNN with a vanilla ODE:
$$\dot q(t) = f_{\text{net}}(q(t)), \tag{53}$$
where $f_{\text{net}}$ is composed of two FC layers and a non-linear activation function. Compared to the Hamiltonian orbits in Section 4.2, equation (53) does not include a learnable "momentum" vector for each node and does not follow Hamiltonian orbits on the cotangent bundle (Di Giovanni et al., 2022).

To further demonstrate that HamGNN can automatically learn the underlying geometry of datasets with different structures, we include more experiments on the node classification task using heterophilic graph datasets. We select the heterophilic graph datasets Cornell, Texas, and Wisconsin from the CMU WebKB project, with the randomly generated data splits provided by Pei et al. (2020). The edges in these graphs represent hyperlinks between webpage nodes. The nodes are manually labeled with five classes: student, project, course, staff, and faculty. The node features are bag-of-words representations of the web pages. For the heterophilic graph datasets, we include the baselines GCN, GAT, SAGE, APPNP (Klicpera et al., 2019), GCNII (Chen et al., 2020b), GPRGNN (Chien et al., 2020), and H2GCN (Zhu et al., 2020a), which are common baselines for heterophilic graph datasets (Bi et al., 2022).
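The energy-preservation caveat for the fixed-step explicit Euler solver noted in Appendix C.1 can be seen on a toy example (our own illustration, unrelated to the paper's datasets): on the harmonic oscillator $H(q,p) = \tfrac{1}{2}(q^2 + p^2)$, the Euler map $(q, p) \mapsto (q + \Delta t\, p,\; p - \Delta t\, q)$ multiplies $H$ by exactly $(1 + \Delta t^2)$ per step, so the energy drifts upward.

```python
def euler_energy_drift(dt, n_steps, q=1.0, p=0.0):
    """Energy drift of fixed-step explicit Euler on H = 0.5*(q^2 + p^2).
    Each step scales H by (1 + dt^2), so the drift is positive and
    equals 0.5 * ((1 + dt^2)**n_steps - 1) for these initial conditions."""
    H0 = 0.5 * (q * q + p * p)
    for _ in range(n_steps):
        q, p = q + dt * p, p - dt * q
    return 0.5 * (q * q + p * p) - H0
```

This is why symplectic or other energy-preserving integrators are a natural direction for future work, as discussed above.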
Additionally, we also include GraphCON (Rusch et al., 2022) and GRAND (Chamberlain et al., 2021b) in the comparisons. We report results from running the experiments 10 times with different initial random seeds for GraphCON and HamGNN, while for the other baselines on the heterophilic graph datasets, we use the results reported in (Bi et al., 2022) (since we use the same experimental setting), due to the time limitation of the rebuttal period.

D.3 OVER-SMOOTHING

We continue from Section 5.2 with more experiments on the Cora and Pubmed datasets to demonstrate the resilience of HamGNN against over-smoothing. From Table 6, we observe that if the $H_{\text{net}}$ in (21) is convex, HamGNN can retain its classification ability even better than with the vanilla $H_{\text{net}}$ in (20). One possible reason is indicated in Section 4.2.3: the Hamiltonian formalism then degenerates to a Lagrangian formalism with a possible minimization of the dual energy functional (5), and in physics, lower energy in most cases indicates a more stable system equilibrium. Moreover, to further show that HamGNN with the convex $H_{\text{net}}$ in (21) performs better against over-smoothing than with the vanilla $H_{\text{net}}$ in (20), we include more choices of convex networks $H_{\text{net}}$ with different layer sizes and different activation functions (as long as the layer weights from the second layer onward in $H_{\text{net}}$ are non-negative, and all activation functions in $H_{\text{net}}$ are convex and non-decreasing). The network details of $H_{\text{net}}$ are given in Table 7, and the experimental results are shown in Table 6. We clearly observe that, for different convex functions and on different datasets, HamGNN with convex $H_{\text{net}}$ nearly keeps its full node classification ability even with 20 stacked Hamiltonian layers.

Fig. 3 shows how the node features evolve over 10 layers. We sample a node from the test set of the Pubmed dataset and input it to three neural ODE-based GNN models: 1) HamGNN, 2) an ODE using a positive-definite linear layer as the ODE function, and 3) an ODE using a negative-definite linear layer as the ODE function. Each of these three GNNs contains 10 layers, and we compute the node feature magnitude, defined as the $L_2$-norm of the feature vector, and the node feature phase, defined as the cosine similarity between the output feature at the current layer and the input feature to the first layer. We observe from Fig. 3 that HamGNN's learned node features change steadily and slowly, while the node features learned by GIL and GCN change abruptly, especially the feature phases. The features of two nodes from different classes converge to each other much faster under GIL and GCN than under HamGNN.
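The two diagnostics used in Fig. 3 can be written out directly (a straightforward sketch of the stated definitions; the toy vectors are our own):

```python
import numpy as np

def feature_magnitude(h):
    """L2 norm of a node feature vector (the 'magnitude' in Fig. 3/4)."""
    return np.linalg.norm(h)

def feature_phase(h_layer, h_input):
    """Cosine similarity between a layer's output feature and the input
    feature to the first layer (the 'phase' in Fig. 3/4)."""
    return float(h_layer @ h_input /
                 (np.linalg.norm(h_layer) * np.linalg.norm(h_input)))

h0 = np.array([1.0, 0.0])   # input feature to the first layer
h1 = np.array([2.0, 2.0])   # feature after some layer
```

For these toy vectors the magnitude of `h1` is $\sqrt{8}$ and the phase relative to `h0` is $1/\sqrt{2}$, i.e., the feature has both grown and rotated by 45 degrees.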

D.4 STABILITY

It is well known that a Hamiltonian system is an energy-conservative system, and HamGNN inherits this conservation or stability property. The results are shown in Fig. 4, where the experimental setting is the same as in Fig. 3. The trends are very clear: the feature magnitude (which may be thought of as a feature energy) learned by HamGNN is conserved over layers, while the feature magnitude learned by the positive-definite ODE explodes after a few layers, and the feature magnitude learned by the negative-definite ODE quickly shrinks toward zero. Regarding the feature phase, the feature phases of two nodes approach each other steadily and slowly when using HamGNN, while
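The three qualitative regimes in Fig. 4 can be reproduced on a toy linear system $\dot h = A h$ (our own illustration with hand-picked matrices, not the trained models): a skew-symmetric $A$ conserves $\|h\|$ (a conservative, Hamiltonian-like system), a positive-definite $A$ makes $\|h\|$ explode, and a negative-definite $A$ drives $\|h\|$ toward zero.

```python
import numpy as np

def evolve(A, h0, dt=1e-3, n=5000):
    """Evolve h_dot = A h with explicit Euler and return the final norm."""
    h = np.asarray(h0, dtype=float)
    for _ in range(n):
        h = h + dt * (A @ h)
    return np.linalg.norm(h)

skew = np.array([[0.0, 1.0],
                 [-1.0, 0.0]])   # skew-symmetric: norm (nearly) conserved
pos = np.eye(2)                  # positive-definite: norm explodes
neg = -np.eye(2)                 # negative-definite: norm decays to ~0
h0 = [1.0, 0.0]
```

For the skew-symmetric case, explicit Euler inflates the norm by only $\sqrt{1 + \Delta t^2}$ per step, so over this horizon the conservation is visible despite the non-symplectic solver.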



Footnotes:
1. We abuse notation in denoting both the chart coordinate map as q and the curve as q(t); which one is meant will be clear from the context.
2. We put the node index to the left of a variable to distinguish it from the manifold dimension index.
3. The subscript "net" of a function F_net indicates that the function F is parameterized by a neural network.
4. In this paper, we may use the word "bundle" to refer to the total space of the bundle.
5. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/



Figure 2: Visualization of a 1-dimensional manifold and its tangent bundle.

Table 1: Node classification accuracy (%). The best and second-best results for each criterion are highlighted in bold and underlined italic, respectively.

5.1 PERFORMANCE RESULTS AND ABLATION STUDIES

Node classification. The node classification performance on the benchmark datasets using the baseline models and the proposed HamGNNs with different $H_{\text{net}}$s is shown in Table 1. We observe that HamGNN adapts well to all datasets with various geometries. As an illustration, HamGNN (19) achieves the third-best performance on the Cora dataset, which is Euclidean in nature. According to (Chami et al., 2019; Liu et al., 2019), the Airport dataset has a tree-like structure that cannot be properly embedded into Euclidean spaces. We observe that HamGNN (19) has a well-adapted performance that beats all the baselines on this dataset, thanks to the learnable embedding manifold without many constraints. As discussed in Section 4.1, HamGNN with the learnable metric structure $g_{\text{net}}$ or another general $H_{\text{net}}$ learns the graph local geometry successively with neighborhood aggregation. The superior performance of HamGNNs over the other baselines on all the datasets demonstrates the advantage of embedding the node features along Hamiltonian orbits on the manifold. Furthermore, the comparison between HamGNNs and the GNN with the vanilla ODE in (53), without any Hamiltonian mechanism, indicates the indispensability of the Hamiltonian layer.

Link prediction. In Table 2, we report the averaged ROC for the link prediction task. We observe that HamGNN adapts well to all datasets and is the best performer on Airport, Citeseer, and Cora.

Table 2: Link prediction ROC (%). The best and second-best results for each criterion are highlighted in bold and underlined italic, respectively.

Comparison between different $H_{\text{net}}$s. We compare HamGNNs with the different $H_{\text{net}}$s elaborated in Section 4.2 on the node classification task. In Section 3.3, we argue that the geodesic curves derived from the action given by curve length (3) may potentially sacrifice efficacy for the graph node embedding task, since we do not know what a reasonable action formulation guiding the evolution of node features should be in this task. We therefore also include the more flexible $H_{\text{net}}$ in (20) with other variations, e.g., (21) to (23). From Table 1, we observe that the basic (20) adapts well for node embedding on various datasets. The flexibility in (20) helps the node classification task slightly on the Citeseer dataset, but the improvement is not significant. Other variants of H

Table 3: Node classification accuracy (%) when increasing the number of layers on the Cora dataset.

5.2 OBSERVATION OF RESILIENCE TO OVER-SMOOTHING

As a side benefit of HamGNN, we observe from Table 3 that when more Hamiltonian layers are stacked, HamGNN still retains its node classification ability, while the classification accuracies of other GNNs decrease significantly. This is known as the over-smoothing problem (Chen et al., 2020a) in GNNs. One reason for HamGNN's stability against over-smoothing is that the node feature energy, indicated by the Hamiltonian $H_{\text{net}}$, is fixed during the feature update along the Hamiltonian orbit.

More ablation studies and experiments. See Appendix D, where we include experimental results on heterophilic graph datasets, more empirical analysis of over-smoothing, etc.

We leave the use of energy-preserving Hamiltonian equation solvers for future work, to investigate whether such solvers can better help graph node embedding or mitigate the over-smoothing problem.

Table 4: Node classification accuracy (%) under different ODE solvers in HamGNN (20).

To demonstrate the advantage of HamGNN's design, we also conduct experiments that replace the Hamiltonian layer in HamGNN with the vanilla ODE in (53).

Node classification accuracy(%) on heterophilic datasets. Due to insufficient time during the rebuttal period, the results for GRAND are reported from

Node classification accuracy (%) when increasing the number of layers on the Pubmed dataset (fragment):
(…) | …29±2.29 | 69.87±1.12 | 26.50±4.68 | 23.97±5.42
HGCN | 78.70±0.96 | 38.13±6.20 | 31.90±0.00 | 26.23±9.87
HamGNN (20) | 81.52±1.27 | 81.58±0.73 | 79.00±2.17 | 76.20±0.13
HamGNN (21) | 81.84±0.88 | 81.08±0.16 | 81.40±0.44 | 80.58±0.30
HamGNN (21) type 2 | 82.10±0.80 | 81.10±0.10 | 81.06±1.49 | 79.26±1.07
(…) type 2 | 79.03±0.58 | 78.46±0.11 | 78.53±0.31 | 77.50±0.44

