ACMP: ALLEN-CAHN MESSAGE PASSING WITH ATTRACTIVE AND REPULSIVE FORCES FOR GRAPH NEURAL NETWORKS

Abstract

Neural message passing is a basic feature extraction unit for graph-structured data that considers neighboring node features in network propagation from one layer to the next. We model this process by an interacting particle system with attractive and repulsive forces, together with the Allen-Cahn force arising in the modeling of phase transitions. The dynamics of the system is a reaction-diffusion process which can separate particles without blowing up. This induces an Allen-Cahn message passing (ACMP) for graph neural networks, in which the numerical iteration for the particle system solution constitutes the message passing propagation. ACMP, which has a simple implementation with a neural ODE solver, can propel the network depth up to one hundred layers with a theoretically proven, strictly positive lower bound on the Dirichlet energy. It thus provides a deep model of GNNs that circumvents the common GNN problem of oversmoothing. GNNs with ACMP achieve state-of-the-art performance on real-world node classification tasks for both homophilic and heterophilic datasets.

1. INTRODUCTION

Graph neural networks (GNNs) have received great attention in the past five years due to their powerful expressiveness for learning graph-structured data, with broad applications from recommendation systems to drug and protein design (Atz et al., 2021; Baek et al., 2021; Bronstein et al., 2021; 2017; Gainza et al., 2020; Wu et al., 2020). Neural message passing (Gilmer et al., 2017) serves as a fundamental feature extraction unit for graph-structured data that aggregates the features of neighbors in network propagation. We develop a GNN message passing, called Allen-Cahn message passing (ACMP), using interacting particle dynamics, where nodes are particles and edges represent the interactions of particles. The system is driven by both attractive and repulsive forces, plus the Allen-Cahn double-well potential from phase transition modeling. This model is motivated by the collective behaviors of particle systems common in nature and human society, for example, insects forming swarms to work, birds forming flocks to migrate, and humans forming parties to express public opinions. Various mathematical models have been proposed to describe these behaviors (Albi et al., 2019; Motsch & Tadmor, 2011; Castellano et al., 2009; Proskurnikov & Tempo, 2017; Degond & Motsch, 2008). There are two major components in this model.

Figure 1: An illustration of one-step ACMP. Graph G_t with features x(t) in the purple and green blocks receives different treatment of attraction or repulsion. The same color indicates similar node features. The node feature x(t) is updated by one step to x(t + ∆t) via an ODE solver. Nodes in the green block tend to attract each other; in the other block, nodes of different colors repel each other, and thus both colors are strengthened during propagation. This gives rise to bi-cluster flocking. The double-well potential turns features darker under the gradient flow to circumvent blowup of energy.
First, while the attractive force pushes all particles into one cluster, the repulsive forces allow particles to separate into two different clusters, which is essential to avoid oversmoothing. However, repulsive forces alone could make the Dirichlet energy diverge. We therefore augment the model with the Allen-Cahn (Allen & Cahn, 1979) term (or Rayleigh friction (Rayleigh, 1894)), which is crucial in preventing the Dirichlet energy from becoming unbounded during the evolution, allowing us to prove mathematically that the lower bound of the Dirichlet energy is strictly bigger than zero, hence avoiding oversmoothing. Specifically, we will prove that under suitable conditions on the parameters, the dynamics of the ACMP particle system will time-asymptotically form $2^d$ different clusters and the Dirichlet energy has a strictly positive lower bound. The structure of ACMP can handle two problems in GNNs: oversmoothing and heterophily. Oversmoothing (Nt & Maehara, 2019; Oono & Suzuki, 2019; Konstantin Rusch et al., 2022) means that all node features become indistinguishable; equivalently, in the formulation of particle systems, features form only one consensus. The heterophily problem means that GNNs perform worse on heterophilic graphs (Lim et al., 2021; Yan et al., 2021), because neighboring nodes of different classes are mistaken for the same class in GNNs such as GCN and GAT. The presence of repulsion in ACMP makes particles separate into two different clusters, hence providing a simple and neat solution to prediction tasks affected by both problems. Overall, the benefit of the Allen-Cahn message passing with repulsion is manifold. 1) It circumvents the oversmoothing issue, namely the Dirichlet energy is bounded from below. 2) The network is stable in the sense that features and Dirichlet energy are bounded from above.
3) Feature smoothness (energy decreasing) and the balance between node features and edge features can be adjusted by network parameters that control the attraction, repulsion and phase transition. The model can then reach an acceptable trade-off between self-features and neighbor effects, as shown in Figure 1. Our model can thus handle node classification tasks for both homophilic and heterophilic datasets using only one-hop neighbour information. 4) The proposed model can be implemented by neural ODE solvers for the system with attractive and repulsive forces. In theory, we prove that the Dirichlet energy of GNNs with ACMP has a lower bound above zero (limiting oversmoothing), as well as an upper bound (circumventing blow-up), under specific conditions. This agrees with the experimental results (Section 6). We also prove that ACMP is a process by which the features generate clusters thanks to the double-well potential, which provides an interpretable theory for node classification.

2. BACKGROUND

Message Passing in GNNs Graph neural networks are a kind of deep neural network that takes graph data as input. Neural Message Passing (MP) (Gilmer et al., 2017; Battaglia et al., 2018) is the most widely used propagator for node feature updates in GNNs. For an undirected graph G = (V, E) with node set V and edge set E, with $x_i^{(k-1)} \in \mathbb{R}^d$ denoting the features of node i in layer (k-1) and $a_{j,i} \in \mathbb{R}^D$ the edge features from node j to node i, it takes the form
$$x_i^{(k)} = \gamma^{(k)}\Big(x_i^{(k-1)},\ \square_{j \in \mathcal{N}_i}\, \phi^{(k)}\big(x_i^{(k-1)}, x_j^{(k-1)}, a_{j,i}\big)\Big),$$
where $\square$ denotes a differentiable, (node) permutation invariant function, e.g., sum, mean or max, $\gamma$ and $\phi$ denote differentiable functions such as MLPs (MultiLayer Perceptrons), and $\mathcal{N}_i$ is the set of one-hop neighbors of node i. The message passing updates the feature of each node by aggregating the self-feature with neighbors' features. Many GNN feature extraction modules such as GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018) and GIN (Xu et al., 2018) can be written as message passing. For example, the MP of GCN reads, with learnable parameter matrix $\Theta$,
$$x_i' = \Theta^\top \sum_{j \in \mathcal{N}_i \cup \{i\}} \frac{a_{j,i}}{\sqrt{\hat d_j \hat d_i}}\, x_j,$$
where $\hat d_i = 1 + \sum_{j \in \mathcal{N}(i)} a_{j,i}$ and $\hat D = \mathrm{diag}(\hat d_1, \ldots, \hat d_N)$ is the degree matrix of A + I. Graph attention network (GAT) uses attention coefficients $\alpha_{i,j}$ as similarity information between nodes in the MP update $x_i' = \alpha_{i,i} \Theta x_i + \sum_{j \in \mathcal{N}_i} \alpha_{i,j} \Theta x_j$, with
$$\alpha_{i,j} = \frac{\exp\big(\mathrm{LeakyReLU}\big(a^\top [\Theta x_i \,\|\, \Theta x_j]\big)\big)}{\sum_{k \in \mathcal{N}_i \cup \{i\}} \exp\big(\mathrm{LeakyReLU}\big(a^\top [\Theta x_i \,\|\, \Theta x_k]\big)\big)}. \qquad (1)$$
The MP framework was also developed as a PDE solver in Brandstetter et al. (2022b) by embedding differential equations as a parameter into message passing, as in Brandstetter et al. (2022a). This paper regards particle system evolution (an ODE) as the message passing propagation, and the appropriate design of the particle system offers desired properties for the resulting GNN.
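The GCN propagation rule above can be written in a few lines; the following is a minimal NumPy sketch (function and variable names are ours, not from the paper), using the symmetric normalization with self-loops:

```python
import numpy as np

def gcn_layer(X, A, Theta):
    """One GCN message-passing step: for each node i,
    x'_i = Theta^T sum_{j in N(i) ∪ {i}} a_ji / sqrt(d_j d_i) x_j,
    with d_i = 1 + sum_j a_ji (degrees after adding self-loops)."""
    A_hat = A + np.eye(A.shape[0])           # adjacency with self-loops
    d = A_hat.sum(axis=1)                    # d_i = 1 + sum_j a_ji
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ Theta
```

On a regular graph, a constant feature matrix is preserved (every row of the normalized adjacency sums to one); this averaging fixed point is precisely what drives the oversmoothing discussed below.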

Graph neural diffusion

Neural diffusion equations on graphs (GRAND) were proposed by Chamberlain et al. (2021), providing a unified mathematical framework for several message passings:
$$\frac{\partial}{\partial t} x(t) = \mathrm{div}\big[G(x(t), t) \nabla x(t)\big], \qquad (2)$$
where $G = \mathrm{diag}(a(x_i(t), x_j(t), t))$, $a$ is a function reflecting the similarity between nodes i and j, $x_i$ is the scalar-valued feature of node i, and $x = \oplus x_i$.
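The smoothing behavior of pure graph diffusion can be seen in a few lines. Below is a small NumPy sketch (our illustration, not the paper's code): an explicit Euler discretization of the diffusion system drives the Dirichlet energy toward zero.

```python
import numpy as np

def dirichlet_energy(X, A):
    """E(x) = (1/N) sum_i sum_{j in N(i)} a_ij ||x_i - x_j||^2."""
    diff = X[:, None, :] - X[None, :, :]
    return (A * (diff ** 2).sum(-1)).sum() / A.shape[0]

def diffusion_step(X, A, tau=0.1):
    """Explicit Euler step of dx_i/dt = sum_j a_ij (x_j - x_i)."""
    deg = A.sum(axis=1, keepdims=True)
    return X + tau * (A @ X - deg * X)

rng = np.random.default_rng(0)
A = np.array([[0.,1.,0.,0.],[1.,0.,1.,0.],[0.,1.,0.,1.],[0.,0.,1.,0.]])  # path graph
X = rng.normal(size=(4, 2))
energies = [dirichlet_energy(X, A)]
for _ in range(50):
    X = diffusion_step(X, A)
    energies.append(dirichlet_energy(X, A))
```

With a stable step size, the energy decreases monotonically and decays exponentially, which is the discrete analogue of the oversmoothing phenomenon analyzed in Section 5.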

3.1. ATTRACTIVE AND REPULSIVE FORCES

Equation (2) itself can be interpreted in a formulation different from diffusion. In this paper, we study the neural equations of an interacting particle system, which has a similar structure to (2). We rewrite (2) into a component-wise version and obtain a particle system
$$\frac{\partial}{\partial t} x_i(t) = \sum_{j \in \mathcal{N}_i} a(x_i, x_j)(x_j - x_i). \qquad (3)$$
In the formulation of particle systems, one can easily discover the evolution trend of the features. If $a(x_i, x_j) > 0$, the direction of $x_i$'s velocity is towards $x_j$, which means that $x_i$ is attracted by $x_j$. In contrast, if $a(x_i, x_j) < 0$, $x_i$ tends to move away from $x_j$. Hence, $a(x_i, x_j)$ serves as the attractiveness or repulsiveness of the force between $x_i$ and $x_j$. In the diffusion model above, all $a(x_i, x_j)$'s are positive, therefore all the node features in one connected component attract each other. If the weight matrix $(a(x_i, x_j))_{N \times N}$ is right-stochastic, one can prove that the convex hull of the features will not dilate in time (see Motsch & Tadmor (2014); Chamberlain et al. (2021)). Such feature aggregation means that messages propagate along the edges of the graph and some potential consensus forms in the process. However, message propagation is not limited to consensus (corresponding to diffusion). Information interaction can lead to polarization of the final judgement when negative messages matter in some problems rather than positive ones. For instance, in a node classification task on a bipartite graph, the neighbour message is negative since connected nodes belong to different classes. In the formulation of particle systems, the mechanism of positive and negative messages can be modelled by adding a bias $\beta_{i,j}$ into (3):
$$\frac{\partial}{\partial t} x_i(t) = \sum_{j \in \mathcal{N}_i} \big(a(x_i, x_j) - \beta_{i,j}\big)(x_j - x_i). \qquad (4)$$
The coefficient $a(x_i, x_j) - \beta_{i,j}$ corresponds to the interactive force. By adjusting $\beta_{i,j}$, attractive and repulsive forces co-exist in the system. If $a(x_i, x_j) - \beta_{i,j} > 0$, $x_i$ is attracted by $x_j$; if $a(x_i, x_j) - \beta_{i,j} < 0$, $x_i$ is repelled by $x_j$. If the coefficient equals zero, there is no interaction between $x_i$ and $x_j$. The dynamics is thereby enabled to adopt both positive and negative message passing. In this way, the neural message passing can handle either homophilic or heterophilic datasets (see Section 6 for a detailed discussion).
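The sign mechanism of the biased interaction system above can be seen on a toy example (a sketch with our own naming; not the paper's code): two connected nodes contract when the coefficient is positive and spread when it is negative.

```python
import numpy as np

def interaction_step(X, A, beta, tau=0.05):
    """Explicit Euler step of dx_i/dt = sum_{j in N(i)} (a_ij - beta)(x_j - x_i).
    A positive coefficient attracts x_i toward x_j; a negative one repels."""
    W = (A - beta) * (A > 0)              # the bias only acts on existing edges
    deg = W.sum(axis=1, keepdims=True)
    return X + tau * (W @ X - deg * X)

A = np.array([[0., 1.], [1., 0.]])
X = np.array([[-0.1], [0.1]])
gap = lambda Y: abs(Y[0, 0] - Y[1, 0])
attracted = interaction_step(X, A, beta=0.0)  # a - beta = 1 > 0: nodes move closer
repelled = interaction_step(X, A, beta=2.0)   # a - beta = -1 < 0: nodes move apart
```

Without the damping term introduced next, repeated repulsive steps would let the gap grow without bound, which motivates the Allen-Cahn term of Section 3.2.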

3.2. PSEUDO-GINZBURG-LANDAU ENERGY

However, adding the repulsion term may cause the particles to separate infinitely far apart, so the Dirichlet energy becomes unbounded. To avoid this problem, we add a damping term $\delta x_i (1 - x_i^2)$, which we call an Allen-Cahn term. The coefficient $\alpha > 0$ is multiplied just for technical convenience:
$$\frac{\partial}{\partial t} x_i(t) = \alpha \sum_{j \in \mathcal{N}_i} \big(a(x_i, x_j) - \beta_{i,j}\big)(x_j - x_i) + \delta x_i (1 - x_i^2). \qquad (5)$$

Gradient Flow

The variational principle governing many PDE models states that the equilibrium state is the minimizer of a specific energy. We first introduce the Dirichlet energy and show that (3) can be characterized by looking into the corresponding Euler-Lagrange equation of the Dirichlet energy. Let the adjacency matrix A represent the undirected connectivity between nodes $x_i$ and $x_j$, with $a_{i,j} = 1$ for $(i,j) \in E$ and $a_{i,j} = 0$ for $(i,j) \notin E$. The Dirichlet energy E in terms of G = (V, E) and node features $x \in \mathbb{R}^{N \times d}$ takes the form
$$E(x) = \frac{1}{N} \sum_{i \in V} \sum_{j \in \mathcal{N}_i} a_{i,j} \|x_i - x_j\|^2. \qquad (6)$$
By the calculus of variations, we can formulate the corresponding particle equation $\frac{\partial x}{\partial t} = -\nabla_x E$, i.e.,
$$\frac{\partial x_i}{\partial t} = -\frac{\partial E}{\partial x_i} = \frac{2}{N} \sum_{j \in \mathcal{N}_i} a_{i,j}(x_j - x_i). \qquad (7)$$
On the RHS of (7), the summation is taken over the one-hop neighbors $\mathcal{N}_i$ of node i, which aggregates the impact of the neighboring nodes. Equation (7) coincides with (3) (up to a constant factor) when one takes the adjacency matrix A as the weight matrix $(a(x_i, x_j))_{N \times N}$.

Particle equation with the double-well potential To avoid blow-up of the solution, one can design an external potential to keep the solutions bounded. Here, we define the pseudo-Ginzburg-Landau energy on graph G, denoted by $\Phi : L^2(V) \to \mathbb{R}$, as a combination of the interacting energy and the double-well potential $W : \mathbb{R} \to \mathbb{R}_+$, with $W(x) = (\delta/4)(1 - \|x\|^2)^2$:
$$\Phi(x) = \frac{\alpha}{2} \sum_{i \in V} \sum_{j \in \mathcal{N}_i} (a_{i,j} - \beta_{i,j}) \|x_i - x_j\|^2 + \sum_{i \in V} W(x_i), \qquad (8)$$
where the parameters $\alpha, \delta > 0$ are used to balance the two types of energy. From now on, we denote $a(x_i, x_j)$ by $a_{i,j}$ for simplicity. The pseudo-Ginzburg-Landau energy is not a true energy because the matrix $(a_{i,j} - \beta_{i,j})_{N \times N}$ can be non-positive definite. If the $\beta_{i,j}$'s all equal zero, it becomes the Ginzburg-Landau energy defined in Bertozzi & Flenner (2012); Luo & Bertozzi (2017). Using this combined energy, we can obtain the Allen-Cahn equation with repulsion on a graph as $\frac{\partial x}{\partial t} = -\nabla_x \Phi$, which is equivalent to (5).
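Both energies are direct to compute; the following NumPy sketch (our naming, one feature vector per row) implements the Dirichlet energy and the pseudo-Ginzburg-Landau energy defined above:

```python
import numpy as np

def dirichlet_energy(X, A):
    """E(x) = (1/N) sum_i sum_{j in N(i)} a_ij ||x_i - x_j||^2."""
    diff = X[:, None, :] - X[None, :, :]
    return (A * (diff ** 2).sum(-1)).sum() / A.shape[0]

def pseudo_gl_energy(X, A, beta, alpha=1.0, delta=1.0):
    """Phi(x) = (alpha/2) sum_i sum_j (a_ij - beta)||x_i - x_j||^2 + sum_i W(x_i),
    with the double-well potential W(x) = (delta/4)(1 - ||x||^2)^2."""
    diff = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    interaction = 0.5 * alpha * ((A - beta) * (A > 0) * diff).sum()
    well = (delta / 4.0) * ((1.0 - (X ** 2).sum(-1)) ** 2).sum()
    return interaction + well
```

At the wells (features of unit norm) the potential term vanishes and Phi reduces to the interaction energy, which can be negative when the bias dominates; this is why the text calls it a "pseudo" energy.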

4. ALLEN-CAHN MESSAGE PASSING

We propose the Allen-Cahn Message Passing (ACMP) neural network based on equation (5), where the message is updated by the evolution of the equation via a neural ODE solver. To the best of our knowledge, this is the first message passing designed to amplify the difference between connected nodes by a repulsive force.

Figure 2: We compare the evolution of node features in GCN and ACMP. The initial position is represented by the 2-dimensional position of the nodes, shown in the first column. The GCN aggregates all node features by taking the weighted average of its neighbors' features. As the number of propagation steps increases, all the node features shrink to a point, which gives rise to oversmoothing. With ACMP, node features are grouped around four attractors, which helps circumvent oversmoothing. More details can be found in Section 6.

Network Architecture Suppose the d-dimensional node-wise features are represented by a matrix $x^{\mathrm{in}}$ whose row i represents the feature of node i. Our scheme first embeds the node features as $x(0) = \mathrm{MLP}(x^{\mathrm{in}})$ by a simple multi-layer perceptron (MLP), which is treated as the input for the ACMP propagation $A : \mathbb{R}^d \to \mathbb{R}^d$, $x(0) \mapsto x(T)$, where
$$x(T) = x(0) + \int_0^T \frac{\partial x(t)}{\partial t}\, dt, \qquad x(0) = \mathrm{MLP}(x^{\mathrm{in}}),$$
and $\frac{\partial x(t)}{\partial t}$ is given by ACMP defined on G based on (5). The node features x(T) at the ending time are fed into an MLP-based classifier. We define the Allen-Cahn message passing by
$$\frac{\partial}{\partial t} x_i(t) = \alpha \odot \sum_{j \in \mathcal{N}_i} \big(a(x_i(t), x_j(t)) - \beta\big)(x_j(t) - x_i(t)) + \delta \odot x_i(t) \odot \big(1 - x_i(t) \odot x_i(t)\big). \qquad (9)$$
Here $\alpha, \delta \in \mathbb{R}^d$ are learnable vectors of the same length as the node feature $x_i$. While we could use the more general setting in which each edge (i, j) has its own trainable $\beta_{i,j}$, we simplify to a single hyperparameter $\beta \in \mathbb{R}_+ \cup \{0\}$, which makes the network and its optimization easier.

The $\beta$ in our model is a crucial parameter, which can be adjusted so that attractive and repulsive forces are both present to enrich the message passing effect. If one chooses $\delta = 0$ and $\beta = 0$, our model reduces to the graph neural diffusion network (GRAND) of Chamberlain et al. (2021). In experiments, we make significant use of nontrivial $\delta$ and $\beta$. The operations of all terms are channel-wise, involving d channels, except $a(x_i(t), x_j(t))$, and $\odot$ represents channel-wise multiplication over the d feature channels. Figure 1 illustrates the one-step ACMP mechanism (9): nodes with close colors attract each other and otherwise repel. Nodes in the same block tend to attract each other, and both colors are strengthened during message passing propagation. The double-well potential prevents the features and the Dirichlet energy from blowing up. In this process, the node feature x(t) is updated to x(t + ∆t) for a time increment ∆t. Ultimately, a bi-cluster flock is formed for node classification.

In the propagation of ACMP in (9), we need to specify how the neighbors interact, that is, how $a(x_i(t), x_j(t))$ evolves with time. There are many ways to update the edge weights; two typical choices follow GCN (Kipf & Welling, 2017) and GAT (Veličković et al., 2018).

ACMP-GCN: this model uses a deterministic $a(x_i(t), x_j(t))$, given by the adjacency matrix $A = (a_{i,j})$ of the original input graph G, which does not change with time: the coefficients are those of GCN, $a^{\mathrm{GCN}}_{i,j} := a_{i,j} / \sqrt{d_i d_j}$. The message passing (9) then reduces to
$$\frac{\partial}{\partial t} x_i(t) = \alpha \odot \sum_{j \in \mathcal{N}_i} \big(a^{\mathrm{GCN}}_{i,j} - \beta\big)(x_j(t) - x_i(t)) + \delta \odot x_i(t) \odot \big(1 - x_i(t) \odot x_i(t)\big). \qquad (10)$$
ACMP-GAT: we can replace $a^{\mathrm{GCN}}_{i,j}$ in (10) by the attention coefficients (1) of GAT, which, with extra trainable parameters, measure the similarity between two nodes by taking account of both node and structure features. The system then drives the edges to update in each iteration of message passing.
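The right-hand side of the ACMP-GCN variant (10) can be sketched channel-wise in NumPy (our simplified rendering, not the paper's implementation; alpha and delta broadcast over channels, beta is the scalar hyperparameter):

```python
import numpy as np

def acmp_gcn_rhs(X, A, alpha, beta, delta):
    """dx_i/dt = alpha ⊙ sum_{j in N(i)} (a^GCN_ij - beta)(x_j - x_i)
               + delta ⊙ x_i ⊙ (1 - x_i ⊙ x_i),  a^GCN_ij = a_ij / sqrt(d_i d_j)."""
    d = 1.0 + A.sum(axis=1)                              # degrees with self-loop, as in GCN
    W = (A / np.sqrt(np.outer(d, d)) - beta) * (A > 0)   # a^GCN_ij - beta on existing edges
    force = W @ X - W.sum(axis=1, keepdims=True) * X     # sum_j w_ij (x_j - x_i)
    return alpha * force + delta * X * (1.0 - X * X)     # Allen-Cahn double-well term
```

Constant features sitting at a well, e.g. all $x_i = 1$, form a fixed point: the pairwise differences vanish and the double-well term is zero.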
Neural ODE Solver Our method uses an ODE solver to numerically solve equations (9) and (10). To obtain the node features x(T), we need a stable numerical integrator that solves the ODE efficiently and supports backpropagation of gradients. Since our model is stable with respect to the evolution time, most explicit and implicit numerical methods, such as explicit Euler, 4th-order Runge-Kutta, midpoint, and Dormand-Prince5 (Chen et al., 2018; Lu et al., 2018; Norcliffe et al., 2020; Chamberlain et al., 2021), work well as long as the step size $\tau$ is small enough. In experiments, we implement ACMP using the Dormand-Prince5 method, which provides a fast and stable numerical solver. The network depth of ACMP-GNN is equal to the numerical iteration number $n_t$ set in the solver.

Computational Complexity The computational complexity of ACMP is $O(NEdn_t)$, where $n_t$, N, E and d are the number of time steps in the time interval [0, T], the number of nodes, the number of edges and the feature dimension, respectively. Since our model only considers nearest (one-hop) neighbors, E is significantly smaller than that of graph rewiring (Gasteiger et al., 2019; Alon & Yahav, 2021) and multi-hop (Zhu et al., 2020) methods.

Channel Mixer Although our model is written above in channel-wise form, channel mixing can be introduced naturally from the perspective of diffusion coefficients. Whether channel mixing happens depends on the specific GNN driver chosen for ACMP. When the coefficients $a(x_i(t), x_j(t))$ in (9) do not update with time and are scalars or vectors, as in ACMP-GCN, the operations of the message passing propagator are channel-wise and no channel mixing is incorporated. On the other hand, ACMP-GAT with a graph attention driver incorporates learnable channel mixing when the coefficients are tensors.
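The depth/iteration correspondence can be made concrete with a fixed-step explicit Euler integrator (a sketch of ours; the paper itself uses torchdiffeq's adaptive Dormand-Prince solver):

```python
import numpy as np

def acmp_euler(X0, A, alpha=1.0, beta=0.0, delta=1.0, T=5.0, n_t=100):
    """Integrate the ACMP ODE (9) with n_t explicit Euler steps of size T/n_t;
    the iteration number n_t plays the role of network depth."""
    tau = T / n_t
    X = X0.copy()
    W = (A - beta) * (A > 0)                 # fixed coefficients, GCN-style driver
    for _ in range(n_t):
        force = W @ X - W.sum(axis=1, keepdims=True) * X
        X = X + tau * (alpha * force + delta * X * (1.0 - X * X))
    return X
```

As a sanity check, an isolated node started at x = 0.5 is pushed by the double-well term toward the stable root x = 1 rather than diverging, reflecting the boundedness of Proposition 1.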
The channel mixer can be introduced by generalizing the Dirichlet energy to high dimensions, for example,
$$E(x) := \frac{1}{N} \sum_{i \in V} \sum_{j \in \mathcal{N}_i} (x_i - x_j)^\top a_{i,j} (x_i - x_j),$$
when the $a_{i,j} \in \mathbb{R}^{d \times d}$ are connectivity tensors.

5. DIRICHLET ENERGY

The dynamics (5) can circumvent the oversmoothing issue of GNNs (Nt & Maehara, 2019; Oono & Suzuki, 2019; Konstantin Rusch et al., 2022). The oversmoothing phenomenon means that all node features converge to the same constant (consensus forms) as the network deepens; equivalently, the Dirichlet energy decays to zero exponentially. This characterization was first introduced in Cai & Wang (2020), and Konstantin Rusch et al. (2022) gives an explicit form of oversmoothing. In our model, as we show below, the node features in each channel tend to evolve into two clusters departing from each other under certain conditions. This implies a strictly positive lower bound on the Dirichlet energy. In addition, the system will not blow up thanks to the Allen-Cahn term. We put all the proofs and some related supplementary results in the appendix.

Proposition 1. If $\delta > 0$, the node features $x_i$ in (5) are bounded in terms of $\|\cdot\|$ and energy for all t > 0, i.e., $E(x(t)) \le C$ and $\|x\| \le C$, where the constant C only depends on N and $\lambda_{\max}$.

In the following propositions, we imitate the emergent behavior analysis in Fang et al. (2019) (see Appendix for details). For a graph G with N nodes, its vertices are said to form bi-cluster flocking if there exist two disjoint vertex subsets $\{x^{(1)}_i\}_{i=1}^{N_1}$ and $\{x^{(2)}_j\}_{j=1}^{N_2}$ satisfying
(i) $\sup_{0 \le t < \infty} \max_{1 \le i, j \le N_1} |x^{(1)}_i(t) - x^{(1)}_j(t)| < \infty$ and $\sup_{0 \le t < \infty} \max_{1 \le i, j \le N_2} |x^{(2)}_i(t) - x^{(2)}_j(t)| < \infty$;
(ii) there exist $C', T^{**} > 0$ such that $\min_{1 \le i \le N_1,\, 1 \le j \le N_2} |x^{(1)}_i(t) - x^{(2)}_j(t)| \ge C'$ for all $t > T^{**}$,
where $x^{(1)}_i$, $x^{(2)}_j$ denote any (channel) component of the corresponding feature vectors.

We now show the long-time behaviour of model (5), following the analysis of Fang et al. (2019), for coupling strengths $(\alpha, \delta)$ that satisfy the following condition: there exist $\{\beta_{i,j}\}$ such that $I := \{1, \ldots, N\}$ can be divided into two disjoint groups $I_1$, $I_2$ with $N_1$ and $N_2$ particles respectively, and
$$0 < S \le \tilde a_{i,j},\ \tilde a_{i,j} := a_{i,j} - \beta_{i,j} \text{ for } i, j \in I_1; \quad 0 < S \le \tilde a_{i,j},\ \tilde a_{i,j} := a_{i,j} - \beta_{i,j} \text{ for } i, j \in I_2; \quad 0 \le \hat a_{i,j} \le D,\ \hat a_{i,j} := -(a_{i,j} - \beta_{i,j}) \text{ otherwise}, \qquad (12)$$
where S and D are independent of time t. The constants S and D in (12) bound the intra-group and inter-group forces, respectively. We prove that if S > D, the system is guaranteed to exhibit bi-cluster flocking, as shown in Proposition 2 below. For time $t \ge 0$, suppose $x^{(1)}_c(t)$ and $x^{(2)}_c(t)$ are the feature centers of the two groups of particles $\{x^{(1)}_i(t)\}_{i=1}^{N_1}$ and $\{x^{(2)}_j(t)\}_{j=1}^{N_2}$ partitioned as above from the whole vertex set V, given by
$$x^{(1)}_c(t) := \frac{1}{N_1} \sum_{i=1}^{N_1} x^{(1)}_i(t), \qquad x^{(2)}_c(t) := \frac{1}{N_2} \sum_{i=1}^{N_2} x^{(2)}_i(t).$$
Suppose $x^{(s)}_c(t)$ has a d-dimensional feature, and let $x^{(s)}_{c,k}(t)$, $k = 1, \ldots, d$, be the k-th (dimension) component of the feature $x^{(s)}_c(t)$, s = 1, 2.

Proposition 2. The system (5) exhibits bi-cluster flocking if, for each k = 1, ..., d, the initial separation satisfies $|x^{(1)}_{c,k}(0) - x^{(2)}_{c,k}(0)| \gg 1$, and if there exists a positive constant $\eta$ such that
$$\alpha (S - D) \min\{N_1, N_2\} \ge \delta + \eta, \qquad (13)$$
where $\delta$ is the weight factor of the double-well potential in equation (5).

Proposition 3. For the system (5) with bi-cluster flocking, there exist a constant C > 0 and some time $T^*$ such that for all $t \ge T^*$, $|x^{(1)}_i(t) - x^{(2)}_j(t)| \ge C > 0$ for all i, j. Thus, if the non-zero $a_{i,j}$ are all positive, the Dirichlet energy of ACMP is bounded below by a positive constant.

6. EXPERIMENTS

Dirichlet Energy We first illustrate the evolution of the Dirichlet energy of ACMP on an undirected synthetic random graph. The synthetic graph has 100 nodes in two classes, with 2D features sampled from normal distributions with the same standard deviation σ = 2 and two means µ1 = -0.5 and µ2 = 0.5. Nodes in the same class are connected randomly with probability p = 0.9, while nodes in different classes are connected with probability p = 0.1. We compare the performance of GNN models with four message passing propagators: GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), GRAND (Chamberlain et al., 2021) and ACMP-GCN. In Figure 2, we visualize how the node features evolve from their initial state to their final steady state when 50 layers of GNN are applied. Additionally, in Figure 3, we show the Dirichlet energy of each layer's output on a logarithmic scale. Traditional GNNs such as GCN and GAT suffer from oversmoothing, as the Dirichlet energy decays exponentially to zero within the first ten layers. GRAND relieves this problem by multiplying by a small constant, which delays the collapse of all node features to the same value. For ACMP, after a slight decay in the first two layers, the energy stabilizes at a level determined by the roots of the double-well potential in (9).
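The synthetic graph described above can be generated in a few lines (a sketch with our variable names; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                             # 100 nodes, two classes of 50
labels = np.repeat([0, 1], N // 2)
means = np.where(labels[:, None] == 0, -0.5, 0.5)   # class means mu1 = -0.5, mu2 = 0.5
X = rng.normal(loc=means, scale=2.0, size=(N, 2))   # 2D features with sigma = 2
same = labels[:, None] == labels[None, :]
P = np.where(same, 0.9, 0.1)                        # intra-class p = 0.9, inter p = 0.1
U = np.triu(rng.random((N, N)) < P, k=1)            # sample upper triangle, no self-loops
A = (U | U.T).astype(float)                         # symmetric undirected adjacency
```

The resulting adjacency is symmetric with a clear two-block community structure, so attraction within blocks and repulsion across blocks are both exercised.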

Node Classification

We compare the performance of ACMP with several popular GNN architectures on various node classification benchmarks, containing both homophilic and heterophilic datasets. Graph data is considered homophilic (Pei et al., 2020) if similar nodes in the graph tend to connect to each other. Conversely, graph data is said to be heterophilic if it has a small homophily level, i.e., most neighbors do not share the label of the source node. We aim to demonstrate that ACMP is a flexible GNN model which can learn both kinds of datasets well by balancing the attractive and repulsive forces. GCN, for example, cannot perform well on heterophilic datasets as its message passing aggregates only the neighboring (1-hop) nodes. The neural ODE is solved by the Torchdiffeq package with the Dormand-Prince adaptive step size scheme. Only a few hyperparameters need to be tuned in our model. For all the experiments, we fine-tune the learning rate, weight decay, dropout, hidden dimension, and the β which controls the repulsive force between nodes. We outline the details of the hyperparameter search space in Appendix D.

Homophilic datasets Our results are presented for the most widely used citation networks: Cora (McCallum et al., 2000), Citeseer (Sen et al., 2008) and Pubmed (Namata et al., 2012). Moreover, we evaluate our model on the Amazon co-purchasing graphs Computer and Photo (Namata et al., 2012), and CoauthorCS (Shchur et al., 2018). We compare our model with traditional GNN models: Graph Convolutional Network (GCN) (Kipf & Welling, 2017), Graph Attention Network (GAT) (Veličković et al., 2018), Mixture Model Networks (Monti et al., 2017) and GraphSage (Hamilton et al., 2017). We also compare our results with recent ODE-based GNNs: Continuous Graph Neural Networks (CGNN) (Xhonneux et al., 2020), Graph Neural Ordinary Differential Equations (GDE) (Poli et al., 2020) and Graph Neural Diffusion (GRAND) (Chamberlain et al., 2021). To address the limitations of the evaluation methodology noted by Shchur et al. (2018), we report results for all datasets using 100 random splits with 10 random initializations, and show the results in Table 1.

Heterophilic datasets

We evaluate ACMP-GCN on the heterophilic graphs Cornell, Texas and Wisconsin from the WebKB dataset. In this case, the assumption of common neighbors does not hold. The poor performance of the GCN and GAT models shown in Table 2 indicates that many GNN models struggle in this setting. Introducing repulsion improves the performance of GNNs on heterophilic datasets significantly: ACMP-GCN scores 30% higher than the original GCN on the Texas dataset, which has the smallest homophily level among the datasets in the table.

Attractive and Repulsive interpretation As shown in Table 2 and Table 1, ACMP-GCN and ACMP-GAT achieve better performance than GCN and GAT on both homophilic and heterophilic datasets. The majority of the $a_{i,j} - \beta$ in the homophilic datasets are positive, which means most nodes are attracted to each other. Conversely, most $a_{i,j} - \beta$ in the heterophilic datasets are negative, which means that most nodes are repelled by their neighbors. Several GNNs exploiting multi-hop information can achieve high performance in node classification (Zhu et al., 2020; Luan et al., 2021). However, high-order neighbor information makes the adjacency matrix dense and therefore cannot be extended to large graphs, due to the heavier computational cost. In our model, we take only one-hop information into account and add a repulsive force ($\beta \ge 0$) to the message passing, which achieves the same or a higher level of accuracy than multi-hop models on heterophilic datasets.

Performance of ACMP with respect to β The hyperparameter β is a signal of the repulsive force: when $a_{i,j} - \beta$ is negative, the two nodes repel one another. To illustrate the impact of β, we use GCN as the diffusion term, so that the $a_{i,j}$ do not change during the ODE process and all changes are due to β. As shown in Figure 4, ACMP performs best on Cora (orange curve) when all nodes are attracted to one another, i.e., all $a_{i,j} - \beta$ are positive. As β increases, the performance of the model degrades. In contrast, for the Texas dataset, when all forces are attractive, ACMP achieves only 70% accuracy (blue curve). As β increases, most $a_{i,j} - \beta$ become negative, and the model's performance improves.

7. RELATED WORK

Neural differential equations The topic of neural ODEs has been an emerging field since E (2017) and Chen et al. (2018), with many follow-up works in the GNN field (Avelar et al., 2019; Poli et al., 2020; Sanchez-Gonzalez et al., 2019). GRAND (Chamberlain et al., 2021) propagates GNNs by the graph diffusion equation, and Wu et al. (2023) developed an energy-constrained diffusion transformer. GraphCON (Konstantin Rusch et al., 2022) employs a second-order system to conquer oversmoothing in deep graph neural networks. By exploiting the fixed point of the dynamical system, Gallicchio & Micheli (2020) proposed FDGNN as an approach to graph classification.

8. CONCLUSION

We develop a new message passing method with a simple implementation. The method is based on the Allen-Cahn particle system with repulsive force. The proposed ACMP inherits the characteristic dynamics of the particle system and thus adapts well to node classification tasks with high homophily difficulty. It also propels networks to dozens of layers without oversmoothing. A strictly positive lower bound on the Dirichlet energy, which guarantees the non-oversmoothing of ACMP, is established by both theoretical and experimental results. Experiments show the excellent performance of the model on various real datasets.

Published as a conference paper at ICLR 2023

The Appendix is structured as follows:
• In Appendix A we state more related works on flocking and consensus models.
• In Appendix B we introduce several model variants with damping terms different from (9).
• In Appendix C we give more analysis of the GRAND model and of the attraction-only case of equation (5). In addition, we prove the statements in Section 5, i.e., Propositions 1, 2 and 3.
• In Appendix D we show additional experimental details and an ablation study for ACMP.

A FLOCKING AND CONSENSUS

The microscopic (agent-based particle system) modeling of flocking and consensus has been extensively studied. Motsch & Tadmor (2014) review a general class of models for self-organized dynamics and show the relationship between heterophily and consensus. Castellano et al. (2009) present a series of social dynamics models under the formulation of statistical physics. The flocking problem is to some degree similar to the general consensus problem (Olfati-Saber et al., 2007), which studies emergent behaviours of multi-agent systems. The Cucker-Smale (C-S for short) model (Cucker & Smale, 2007) is a famous model in this field, considering a second-order system adapted from classical dynamics. Ha & Tadmor (2008)

More clusters

We can simply replace the double well potential W by a multi-well potential to generate more equilibria. We provide two alternatives here. One can use a higher-order polynomial to construct additional wells. In general, a (2k + 1) th order polynomial can produce k + 1 stable equilibria in a proper form, which gives rise to more stable clusters. One can also use sin(( 3 2 + l)πx + π 2 ), l = 0, • • • , k, defined on the interval [-1, 1] as the multi-well potential, which has l + 2 stable equilibria. Stronger trapping force As the consensus state (i.e., x i = x j for all i, j) might not be a global equilibrium of (10), particles could escape from one well of the potential of W to another well. We can circumvent this instability by enhancing the attraction of the wells, which can be achieved by reducing the diffusion power around wells: ∂ ∂t x i (t) = α⊙ j∈Ni (a GNN (x i (t), x j (t))-β)(x j (t)-x i (t)) 1 -x i (t) ⊙2 ⊙2 +δ⊙x i (t)⊙ 1 -x i (t) ⊙2 . (14) where 'GNN' in a GNN can be GCN or attn, and z ⊙2 is z ⊙ z. With this modification in ( 14), in any channel k, if any particle x (k) i gets caught in one potential well, then it is not likely to escape: Proposition 4. For ( 14), there exists a proper δ ′ > 0 such that x (k) i ∈ [-1, -1 + δ ′ ) ∪ (1 -δ ′ , 1], then particle x (k) i cannot transition into another well. Proof. For the β = 0 case, assume x i = -1 + ϵ for ϵ ≤ δ ′ < 1 at a certain time t 0 , that is, x i ∈ [-1, -1 + δ ′ ). We want to show dxi dt t=t0 < 0, which means α j∈Ni a i,j (x j -x i )(1 -x 2 i ) 2 < -δx i (1 -x 2 i ). By j∈Ni a i,j = 1 from (26), the above inequality is equivalent to j∈Ni a i,j x j < δ α 1 -ϵ 2 -ϵ 1 ϵ + ϵ -1 ≤ δ 2α 1 ϵ + ϵ -1 ≤ δ 2αϵ . ( ) Since {x j } N j=1 are bounded (See Proposition 1.), ( 15) is satisfied for a sufficiently small δ ′ . The other case x i = 1 -ϵ can be similarly proved. For the β ̸ = 0 case, we also assume x i = -1 + ϵ for ϵ ≤ δ ′ < 1 at a certain time t 0 . 
Similarly to (15), we need
$$\sum_{j \in N_i} (a_{i,j} - \beta)\, x_j < \frac{\delta}{\alpha} \cdot \frac{1-\epsilon}{(2-\epsilon)\epsilon} + (1 - d_i\beta)(\epsilon - 1),$$
whose right-hand side again tends to $+\infty$ as $\epsilon \to 0^+$. By the boundedness of $\{x_j\}_{j=1}^N$, a properly small $\delta'$ can be found. ■
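The trapping behavior of the modified dynamics (14) can be checked with a small simulation. This is a minimal sketch, assuming a hypothetical dense graph with fixed row-normalized weights in place of the learned GNN coefficients, and explicit Euler steps instead of the adaptive ODE solver; the parameters are illustrative only.

```python
import numpy as np

def acmp_step_strong_trap(x, A, alpha=1.0, beta=0.0, delta=1.0, dt=0.01):
    """One explicit-Euler step of the modified dynamics (14) for a single channel.

    The diffusion term is damped by (1 - x^2)^2 near the wells at +/-1,
    so a particle caught in a well is unlikely to escape.
    x : (N,) feature channel; A : (N, N) weights with sum_j A[i, j] = 1.
    """
    diff = alpha * ((A - beta) * (x[None, :] - x[:, None])).sum(axis=1)
    dx = diff * (1 - x**2) ** 2 + delta * x * (1 - x**2)
    return x + dt * dx

rng = np.random.default_rng(0)
N = 8
A = rng.random((N, N))
A = A / A.sum(axis=1, keepdims=True)  # enforce sum_j a_{i,j} = 1 as in (26)

x = rng.uniform(-0.5, 0.5, size=N)
x[0] = -0.999                          # particle 0 starts deep in the left well
for _ in range(2000):
    x = acmp_step_strong_trap(x, A, dt=0.01)
print(x[0] < 0)  # prints True: particle 0 never leaves the left well
```

Because the diffusion power carries the factor $(1-x^2)^2$, it is of order $\epsilon^2$ near the wells while the Allen-Cahn force is of order $\epsilon$, which is exactly the mechanism used in the proof of Proposition 4.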

C SUPPLEMENTARY RESULTS AND PROOFS OF PROPOSITIONS IN SECTION 5

We assume that $a_{i,j}$ is symmetric, and $a_{i,j} > 0$ if $a_{i,j} \ne 0$. This condition means that the graph is undirected. Since we deal with each channel independently, we abuse notation and let $x_i$ denote one feature component of node $x_i$ to simplify the proofs.

C.1 THE GRAND MODEL

First, we consider the oversmoothing phenomenon when there is only the diffusion process with diffusion coefficients independent of $x_i$, which is a specific model of the graph neural diffusion network (GRAND) (Chamberlain et al., 2021):
$$\dot{x}_i = \alpha \sum_{j:(i,j)\in E} a_{i,j}(x_j - x_i). \tag{16}$$

Proposition 5. Let $D$ denote the degree matrix, i.e., $D := \mathrm{diag}(d_1, \dots, d_N)$, where $d_i = \sum_j a_{i,j}$. Then $D - A$ is symmetric positive semi-definite with eigenvalues $0 = \lambda_0 \le \lambda_1 \le \dots \le \lambda_{\max} < \infty$. Let $\lambda_{\min} > 0$ be the smallest positive eigenvalue. Then for all $t \ge 0$ there exists a constant $C > 0$ such that $E(x(t)) \le C \exp(-2\lambda_{\min} t)$.

Proof. Let $L := D - A$; then $x(t) = e^{-Lt} x(0)$. Using the eigenvalue decomposition $L = U^\top \Lambda U$, the solution writes
$$x(t) = U^\top e^{-\Lambda t}\, U x(0). \tag{17}$$
Since the Dirichlet energy can also be written as
$$E(x(t)) = x(t)^\top L\, x(t), \tag{18}$$
taking (17) into (18) gives
$$E(x(t)) = x(0)^\top U^\top e^{-\Lambda t}\, \Lambda\, e^{-\Lambda t}\, U x(0). \tag{19}$$
Therefore, $E(x(t)) \le C \exp(-2\lambda_{\min} t)$ for some constant $C > 0$. ■

Proposition 6. We also consider the more general case
$$\frac{d}{dt} x_i(t) = \sum_{j:(i,j)\in E} a(x_i, x_j)(x_j - x_i), \tag{20}$$
with $a(x_i, x_j) = a(x_j, x_i) \ge a_{\min} > 0$ for any $x_i, x_j$. Let the mass center be $x_c = \frac{1}{N}\sum_{i\in V} x_i$. From the symmetry of $a(x_i, x_j)$ and (20), we obtain $dx_c/dt = 0$ for any $t > 0$. Without loss of generality, we may assume
$$x_c(0) = 0, \tag{21}$$
and that the graph $G$ is connected, i.e., for all $(i,j) \in V \times V$, $G$ contains a path from $i$ to $j$. Then we have
$$\|x(t)\|^2 \le \|x(0)\|^2 e^{-2a_{\min}\lambda_{\min} t} \quad \text{and} \quad E(x(t)) \le \lambda_{\max}\|x(0)\|^2 e^{-2a_{\min}\lambda_{\min} t}.$$
Note that the above estimates hold for any initial condition $x_c(0) = c$, since $x$ satisfies the ODE system (20) up to a constant; if $x_c(0) = c$, each $x_i$ converges to $c$ in time. If $G$ is not connected, we consider each connected sub-graph $G' = (V', E')$ separately with the assumption $x_{c'}(0) = \frac{1}{N'}\sum_{i \in V'} x_i = c'$; the features in each sub-graph converge to the constant $c'$ independently.

Proof.
We multiply both sides of equation (20) by $x_i$ and sum over $i$ to obtain
$$\sum_i x_i \frac{dx_i}{dt} = \sum_i \sum_{j \in N_i} a(x_i, x_j)(x_j - x_i)\, x_i \tag{22}$$
$$\Rightarrow \quad \frac{d}{dt}\|x\|^2 = -2 \sum_{(i,j)\in E} a(x_i, x_j)(x_j - x_i)^2 \tag{23}$$
$$\Rightarrow \quad \frac{d}{dt}\|x\|^2 \le -2a_{\min} \sum_{(i,j)\in E} (x_j - x_i)^2. \tag{24}$$
The right-hand side of (24) can be written in matrix form with $L := D - A$:
$$\sum_{(i,j)\in E} (x_j - x_i)^2 = \sum_{(i,j)\in V\times V} a_{i,j}(x_j - x_i)^2 = x^\top L x.$$
Since $G$ is a connected graph, $\mathbf{1}$ is the only eigenvector spanning the kernel space of $L$; therefore $x^\top L x \ge \lambda_{\min}\|x\|^2$ for any $x$ satisfying $\sum_{i\in V} x_i = 0$. Then (24) leads to
$$\frac{d}{dt}\|x\|^2 \le -2a_{\min}\lambda_{\min}\|x\|^2. \tag{25}$$
This yields the decay estimates for $\|x\|$ and $E(x(t))$:
$$\|x(t)\|^2 \le \|x(0)\|^2 e^{-2a_{\min}\lambda_{\min} t}, \qquad E(x(t)) \le \lambda_{\max}\|x(0)\|^2 e^{-2a_{\min}\lambda_{\min} t}. \qquad ■$$

C.2 THE MODEL WITH ALLEN-CAHN TERM

Next, we consider the case $\beta = 0$ but with the Allen-Cahn term:
$$\begin{cases} \dfrac{d}{dt} x_i(t) = \alpha \sum_{j:(i,j)\in E} a(x_i, x_j)(x_j - x_i) + \delta x_i (1 - x_i^2), \\ a(x_i, x_j) = a(x_j, x_i) \ge 0, \quad \forall i, j \in V, \\ \sum_i a(x_i, x_j) = 1, \quad \forall j \in V. \end{cases} \tag{26}$$

Proposition 7. Suppose $x^* = (x_1^*, \dots, x_N^*)$ is a global equilibrium (or steady-state solution) of (26) on $\mathbb{R}$; then $x_i^* \in [-1, 1]$.

Proof. Suppose $x^*$ achieves the equilibrium of (26) and $x_k^* \ge x_i^*$ for all $i$. If $x_k^* > 1$, then $\alpha \sum_{j:(k,j)\in E} a(x_k^*, x_j^*)(x_j^* - x_k^*) \le 0$ and $x_k^*\big(1 - (x_k^*)^2\big) < 0$, which contradicts $\frac{\partial}{\partial t} x_k^* = 0$. The lower bound follows by the same argument applied to $\min_i x_i^*$.

■

The emergence of clusters depends on the distribution of the initial features. If all the initial features lie in only one potential well, then intuitively it is impossible to produce more than one cluster under the dynamics (26). As a simple transference of Lemma 3.2 in Ha et al. (2010), we can prove this. Set $x_M(t) := \max_i x_i(t)$ and $x_m(t) := \min_i x_i(t)$, where $x_i$ still denotes one component of the node feature $x_i$. Assume $x_m, x_M$ are both Lipschitz continuous, and therefore differentiable almost everywhere in time $t$.

Proposition 8. Let $\{x_i\}$ be the solutions of (26). Then the following holds: (i) if $x_m(0) > 0$, then $x_m(t) \ge 0$ for all $t > 0$; (ii) if $x_M(0) < 0$, then $x_M(t) \le 0$ for all $t > 0$.

Proof. The proof was essentially given by Ha et al. (2010); for the sake of completeness, we give it here. (i) If $x_m(0) > 0$, we assert there exists a time sequence $\{t_j\}_{j=0}^\infty$ satisfying $t_0 = 0 < t_1 < \dots < t_j < \dots$ such that $x_m(t)$ is differentiable in each time interval $(t_{j-1}, t_j)$ and $x_m \ge 0$ for $t \in [0, t_1]$. We proceed by induction: suppose $x_m(t) \ge 0$ for $t \in [0, t_l]$. If $x_m$ became negative in the time interval $(t_l, t_{l+1})$, there would exist $t^* \in (t_l, t_{l+1})$ such that $x_m(t^*) = 0$, by the continuity of $x_m(t)$. One can assume $x_m(t) \equiv x_i(t)$ for some node $i$ in some subinterval of $(t_l, t_{l+1})$. At that moment,
$$\frac{dx_i}{dt}(t^*) = \alpha \sum_j a(x_j, x_i)\big(x_j(t^*) - x_i(t^*)\big) + \delta x_i(t^*)\big(1 - x_i^2(t^*)\big) = \alpha \sum_j a(x_j, x_i)\, x_j(t^*) \ge 0.$$
Hence, the trajectory $x_m$ is non-decreasing at $t = t^*$, and by induction we derive (i). (ii) can be proved by the same argument as (i). ■

Now we consider the second kinetic model (14). We can prove that if any particle $x_i$ gets caught in one potential well, then it will not escape from that well.
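The well-preservation property of Proposition 8 is easy to observe numerically. The following is a minimal sketch of model (26) with explicit Euler integration on a hypothetical random graph; the weights here are symmetric and non-negative but only row-normalized, which approximately matches the normalization in (26) — the positivity argument needs only $a_{i,j} \ge 0$.

```python
import numpy as np

def allen_cahn_mp(x, A, alpha=1.0, delta=1.0, dt=0.01, steps=3000):
    """Integrate d/dt x_i = alpha * sum_j a_ij (x_j - x_i) + delta * x_i (1 - x_i^2),
    i.e. model (26) for one channel, with explicit Euler steps."""
    for _ in range(steps):
        diff = (A * (x[None, :] - x[:, None])).sum(axis=1)
        x = x + dt * (alpha * diff + delta * x * (1 - x**2))
    return x

rng = np.random.default_rng(1)
N = 10
A = rng.random((N, N))
A = (A + A.T) / 2                      # symmetric, a_ij >= 0
A = A / A.sum(axis=1, keepdims=True)   # row-normalize (approximate version of (26))

# Proposition 8 (i): if all initial features are positive, they stay nonnegative.
x0 = rng.uniform(0.05, 0.9, size=N)
x_final = allen_cahn_mp(x0, A)
print(np.all(x_final >= 0))  # prints True: no particle crosses into the left well
```

With all features starting in the right well, the attraction and the double-well force jointly drive every particle toward the stable equilibrium at $+1$, so only one cluster emerges.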

C.3 THE ATTRACTIVE-REPULSIVE MODEL

We first show that the solution features of the graph in the Allen-Cahn model below are bounded. For simplicity of the proof, we rewrite (10) in component form, where we let $a(x_i, x_j) := a_{i,j} - \beta_{i,j}$:
$$\frac{d}{dt} x_i(t) = \alpha \sum_{j:(i,j)\in E} a(x_i, x_j)(x_j - x_i) + \delta x_i (1 - x_i^2). \tag{29}$$
Model (29) allows negative $a(x_i, x_j)$, which is different from the condition in (26).

Proof of Proposition 1. Taking $\alpha = \delta = 1$ for notational simplicity, we multiply both sides of the equation by $x_i$ and sum over $i$ to obtain
$$\frac{dx_i}{dt} = \sum_{j\in N_i} a(x_i, x_j)(x_j - x_i) - x_i^3 + x_i$$
$$\Rightarrow \quad \frac{1}{2}\frac{dx_i^2}{dt} = \sum_{j\in N_i} a(x_i, x_j)(x_j - x_i)\, x_i - x_i^4 + x_i^2$$
$$\Rightarrow \quad \frac{1}{2}\sum_{i\in V}\frac{dx_i^2}{dt} = \sum_{i\in V}\Big[\sum_{j\in N_i} a(x_i, x_j)(x_j - x_i)\, x_i - x_i^4 + x_i^2\Big]. \tag{30}$$
By grouping the terms $a(x_i, x_j)(x_j - x_i)\, x_i$ pairwise,
$$\frac{1}{2}\frac{d}{dt}\|x\|^2 = -\frac{1}{2}\sum_{i\in V}\sum_{j\in N_i} a(x_i, x_j)(x_j - x_i)^2 - \sum_{i\in V} x_i^4 + \|x\|^2. \tag{31}$$
Note that $a(x_i, x_j)$ is bounded for any $(x_i, x_j)$; let $|a(x_i, x_j)| < D_1$ for a constant $D_1$ depending on the hyperparameters $\beta_{i,j}$. By the Cauchy-Schwarz inequality, $|a(x_i, x_j)(x_j - x_i)^2| \le 2D_1(x_i^2 + x_j^2)$. Hence,
$$-\sum_{i\in V}\sum_{j\in N_i} a(x_i, x_j)(x_j - x_i)^2 \le c_4\|x\|^2.$$
Also, $\sum_{i\in V} x_i^4 \ge c_3\|x\|^4$ for a constant $c_3$ depending only on $N$. Taking the above estimates into (31) gives
$$\frac{d}{dt}\|x\|^2 \le -2c_3\|x\|^4 + (c_4 + 2)\|x\|^2.$$
If $\|x\|$ blew up, then $\|x\| \to \infty$ as time increases and $\frac{d}{dt}\|x\|^2 > 0$ for all $t$ before the blow-up time $T_{\mathrm{end}}$. However, one can find $t^* < T_{\mathrm{end}}$ such that $\|x(t^*)\|$ is large enough that $-2c_3\|x(t^*)\|^4 + (c_4+2)\|x(t^*)\|^2 < 0$, which produces a contradiction. Thus $\|x\|^2 \le c_5$ for a constant $c_5$ depending only on $N$ and $D_1$, and $E(x) \le \lambda_{\max}\|x\|^2 \le \lambda_{\max} c_5$, where $\lambda_{\max}$ is the largest eigenvalue of $L := D - A$. This proves the assertion of Proposition 1.

■
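Proposition 1 can also be checked numerically. Below is a minimal sketch under assumed parameters (a hypothetical small dense graph, an explicit Euler integrator, and an illustrative repulsion strength, none of which come from the paper's experiments): even though the shifted weights $a_{i,j} - \beta$ make some couplings repulsive, the cubic Allen-Cahn term keeps the trajectories bounded.

```python
import numpy as np

# Numerical check of Proposition 1: with repulsion (some a_ij - beta < 0),
# the Allen-Cahn term still prevents blow-up of ||x||.
rng = np.random.default_rng(2)
N, beta, dt = 12, 0.7, 0.005
A = rng.random((N, N))
A = (A + A.T) / 2 / N        # symmetric attraction weights (scaled for stability)
W = A - beta / N             # shifted weights; some entries are negative (repulsive)

x = rng.normal(0, 0.1, size=N)
norms = []
for _ in range(5000):
    diff = (W * (x[None, :] - x[:, None])).sum(axis=1)
    x = x + dt * (diff + x * (1 - x**2))   # model (29) with alpha = delta = 1
    norms.append(np.linalg.norm(x))
print(max(norms) < 10)  # prints True: the trajectory stays in a bounded set
```

The quartic confinement dominates the at-most-quadratic growth from the repulsive couplings, which is exactly the balance $-2c_3\|x\|^4 + (c_4+2)\|x\|^2$ used in the proof.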

Recall (5) under (12) and rewrite it as
$$\begin{cases} \dfrac{d}{dt} x_i^{(1)} = \alpha \displaystyle\sum_{k=1}^{N_1} a_{k,i}\big(x_k^{(1)} - x_i^{(1)}\big) - \alpha \sum_{k=1}^{N_2} a_{k,i}\big(x_k^{(2)} - x_i^{(1)}\big) + \delta x_i^{(1)}\big(1 - (x_i^{(1)})^2\big), & i = 1, \dots, N_1, \\ \dfrac{d}{dt} x_j^{(2)} = \alpha \displaystyle\sum_{k=1}^{N_2} a_{k,j}\big(x_k^{(2)} - x_j^{(2)}\big) - \alpha \sum_{k=1}^{N_1} a_{k,j}\big(x_k^{(1)} - x_j^{(2)}\big) + \delta x_j^{(2)}\big(1 - (x_j^{(2)})^2\big), & j = 1, \dots, N_2. \end{cases} \tag{32}$$
For the attractive-repulsive model (32), we can refer to the proof of Theorem 5.1 in Fang et al. (2019). We define the following notation for the proof:
$$V := \{\text{nodes indexed by } I_1\}, \quad W := \{\text{nodes indexed by } I_2\}, \quad N_1 := |V|, \quad N_2 := |W|,$$
$$\hat{x}_i^{(1)} := x_i^{(1)} - x_c^{(1)}, \quad \hat{x}_i^{(2)} := x_i^{(2)} - x_c^{(2)}, \quad x_c^{(1)} := \frac{1}{N_1}\sum_{i=1}^{N_1} x_i^{(1)}, \quad x_c^{(2)} := \frac{1}{N_2}\sum_{i=1}^{N_2} x_i^{(2)},$$
$$M_2(V) := \frac{1}{N_1}\sum_{i=1}^{N_1} (x_i^{(1)})^2, \quad M_2(W) := \frac{1}{N_2}\sum_{i=1}^{N_2} (x_i^{(2)})^2, \quad M_2 := M_2(V) + M_2(W), \quad \hat{M}_2 := M_2(\hat{V}) + M_2(\hat{W}).$$

Remark 1. (13) indicates that the repulsive force between the particles should be weaker than the attractive force ($S > D$).

To prove Proposition 2, we need the following two lemmas, whose proofs we postpone.

Lemma 1. Let $\{x_i\}$ be a solution to (32). Then $M_2$ satisfies
$$\begin{aligned} \frac{d}{dt} M_2 = {} & -\frac{\alpha}{N_1}\sum_{i,k=1}^{N_1} a_{k,i}\big(x_k^{(1)} - x_i^{(1)}\big)^2 - \frac{2\alpha}{N_1}\sum_{k=1}^{N_2}\sum_{i=1}^{N_1} a_{i,k}\big(x_k^{(2)} - x_i^{(1)}\big) x_i^{(1)} + \frac{2\delta}{N_1}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2\big(1 - (x_i^{(1)})^2\big) \\ & -\frac{\alpha}{N_2}\sum_{j,k=1}^{N_2} a_{k,j}\big(x_k^{(2)} - x_j^{(2)}\big)^2 - \frac{2\alpha}{N_2}\sum_{k=1}^{N_1}\sum_{j=1}^{N_2} a_{j,k}\big(x_k^{(1)} - x_j^{(2)}\big) x_j^{(2)} + \frac{2\delta}{N_2}\sum_{j=1}^{N_2}\big(x_j^{(2)}\big)^2\big(1 - (x_j^{(2)})^2\big). \end{aligned}$$
Suppose the system parameters satisfy $S \ge 0$, $D > 0$, $\delta > 0$; then there exists a positive constant $M_2^\infty$ such that $\sup_{0 \le t < \infty} M_2(t) \le M_2^\infty < \infty$.

Proof of Lemma 1.
$$\frac{d}{dt} M_2(V) = \frac{2}{N_1}\sum_{i=1}^{N_1} x_i^{(1)} \dot{x}_i^{(1)} = -\frac{\alpha}{N_1}\sum_{i,k=1}^{N_1} a_{k,i}\big(x_k^{(1)} - x_i^{(1)}\big)^2 - \frac{2\alpha}{N_1}\sum_{k=1}^{N_2}\sum_{i=1}^{N_1} a_{i,k}\big(x_k^{(2)} - x_i^{(1)}\big) x_i^{(1)} + \frac{2\delta}{N_1}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2\big(1 - (x_i^{(1)})^2\big). \tag{35}$$
Similarly,
$$\frac{d}{dt} M_2(W) = \frac{2}{N_2}\sum_{j=1}^{N_2} x_j^{(2)} \dot{x}_j^{(2)} = -\frac{\alpha}{N_2}\sum_{j,k=1}^{N_2} a_{k,j}\big(x_k^{(2)} - x_j^{(2)}\big)^2 - \frac{2\alpha}{N_2}\sum_{k=1}^{N_1}\sum_{j=1}^{N_2} a_{j,k}\big(x_k^{(1)} - x_j^{(2)}\big) x_j^{(2)} + \frac{2\delta}{N_2}\sum_{j=1}^{N_2}\big(x_j^{(2)}\big)^2\big(1 - (x_j^{(2)})^2\big). \tag{36}$$
Summing (35) and (36), and noting that $a_{ij} = a_{ji}$, we obtain
$$\frac{d}{dt} M_2 \le \frac{D\alpha}{N_1}\sum_{k=1}^{N_2}\sum_{i=1}^{N_1}\Big[\big(x_k^{(2)} - x_i^{(1)}\big)^2 + \big(x_i^{(1)}\big)^2\Big] + \frac{D\alpha}{N_2}\sum_{k=1}^{N_1}\sum_{j=1}^{N_2}\Big[\big(x_k^{(1)} - x_j^{(2)}\big)^2 + \big(x_j^{(2)}\big)^2\Big] + \frac{2\delta}{N_1}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2\big(1 - (x_i^{(1)})^2\big) + \frac{2\delta}{N_2}\sum_{i=1}^{N_2}\big(x_i^{(2)}\big)^2\big(1 - (x_i^{(2)})^2\big). \tag{37}$$
By the Cauchy-Schwarz inequality,
$$\Big(\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2\Big)^2 \le N_1 \sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^4, \qquad \Big(\sum_{i=1}^{N_2}\big(x_i^{(2)}\big)^2\Big)^2 \le N_2 \sum_{i=1}^{N_2}\big(x_i^{(2)}\big)^4, \qquad \big(x_i^{(1)} - x_j^{(2)}\big)^2 \le 2\Big(\big(x_i^{(1)}\big)^2 + \big(x_j^{(2)}\big)^2\Big).$$
These relations and (37) yield a Riccati-type differential inequality
$$\frac{d}{dt} M_2 \le (\alpha C_m + 2\delta) M_2 - \delta (M_2)^2, \tag{38}$$
where $C_m$ is a constant depending only on $D$, $N_1$ and $N_2$. Let $y$ be a solution of the comparison ODE
$$y' = (\alpha C_m + 2\delta)\, y - \delta y^2. \tag{39}$$
Then the solution $y(t)$ of (39) satisfies
$$M_2(t) \le y(t) \le \max\Big\{\frac{\alpha C_m}{\delta} + 2,\; M_2(0)\Big\} =: M_2^\infty. \qquad ■$$

Lemma 2. Let $\{x_i\}$ be a solution to (32) with $\delta > 0$. Then $\hat{M}_2$ satisfies
$$\frac{d}{dt}\hat{M}_2 \le -2\eta \hat{M}_2 + 2\alpha D \zeta \big|x_c^{(1)} - x_c^{(2)}\big| \sqrt{\hat{M}_2},$$
where $\zeta = \max\{N_1, N_2\}$ and $\eta$ is the positive constant in Proposition 2.

Proof of Lemma 2. By computation,
$$\dot{x}_c^{(1)} = \frac{1}{N_1}\sum_{i=1}^{N_1}\dot{x}_i^{(1)} = \frac{\alpha}{N_1}\sum_{i,k=1}^{N_1} a_{k,i}\big(x_k^{(1)} - x_i^{(1)}\big) - \frac{\alpha}{N_1}\sum_{k=1}^{N_2}\sum_{i=1}^{N_1} a_{k,i}\big(x_k^{(2)} - x_i^{(1)}\big) + \frac{\delta}{N_1}\sum_{i=1}^{N_1} x_i^{(1)}\big(1 - (x_i^{(1)})^2\big) = -\frac{\alpha}{N_1}\sum_{k=1}^{N_2}\sum_{i=1}^{N_1} a_{k,i}\big(x_k^{(2)} - x_i^{(1)}\big) + \frac{\delta}{N_1}\sum_{i=1}^{N_1} x_i^{(1)}\big(1 - (x_i^{(1)})^2\big),$$
where the first double sum vanishes by the symmetry of $a_{k,i}$. Note that $\dot{\hat{x}}_i^{(1)} = \dot{x}_i^{(1)} - \dot{x}_c^{(1)}$. Take the product of the above equation with $2\hat{x}_i^{(1)}$ and sum over $i = 1, \dots, N_1$, using $\sum_i \hat{x}_i^{(1)} = 0$.
Then,
$$\frac{d}{dt} M_2(\hat{V}) = \frac{1}{N_1}\Big[-\alpha\sum_{i,k=1}^{N_1} a_{k,i}\big(\hat{x}_k^{(1)} - \hat{x}_i^{(1)}\big)^2 - 2\alpha\sum_{k=1}^{N_2}\sum_{i=1}^{N_1} a_{k,i}\big(x_c^{(2)} - x_c^{(1)} + \hat{x}_k^{(2)} - \hat{x}_i^{(1)}\big)\hat{x}_i^{(1)} + 2\delta\sum_{i=1}^{N_1}\hat{x}_i^{(1)} x_i^{(1)}\big(1 - (x_i^{(1)})^2\big)\Big].$$
Similarly,
$$\frac{d}{dt} M_2(\hat{W}) = \frac{1}{N_2}\Big[-\alpha\sum_{j,k=1}^{N_2} a_{k,j}\big(\hat{x}_k^{(2)} - \hat{x}_j^{(2)}\big)^2 - 2\alpha\sum_{k=1}^{N_1}\sum_{j=1}^{N_2} a_{k,j}\big(x_c^{(1)} - x_c^{(2)} + \hat{x}_k^{(1)} - \hat{x}_j^{(2)}\big)\hat{x}_j^{(2)} + 2\delta\sum_{j=1}^{N_2}\hat{x}_j^{(2)} x_j^{(2)}\big(1 - (x_j^{(2)})^2\big)\Big].$$
Combining the two equations, $\frac{d}{dt}\hat{M}_2 = \sum_{i=1}^{6} I_i$, where
$$I_1 := -\frac{\alpha}{N_1}\sum_{i,k=1}^{N_1} a_{k,i}\big(\hat{x}_k^{(1)} - \hat{x}_i^{(1)}\big)^2 \le -2\alpha S N_1 M_2(\hat{V}), \qquad I_2 := -\frac{\alpha}{N_2}\sum_{j,k=1}^{N_2} a_{k,j}\big(\hat{x}_k^{(2)} - \hat{x}_j^{(2)}\big)^2 \le -2\alpha S N_2 M_2(\hat{W}),$$
$$I_1 + I_2 \le -2\alpha S \min\{N_1, N_2\}\,\hat{M}_2,$$
$$I_3 := -\frac{2\alpha}{N_1}\sum_{k=1}^{N_2}\sum_{i=1}^{N_1} a_{k,i}\big(\hat{x}_k^{(2)} - \hat{x}_i^{(1)}\big)\hat{x}_i^{(1)} - \frac{2\alpha}{N_2}\sum_{k=1}^{N_1}\sum_{j=1}^{N_2} a_{j,k}\big(\hat{x}_k^{(1)} - \hat{x}_j^{(2)}\big)\hat{x}_j^{(2)} \le 2\alpha D \max\Big\{\frac{1}{N_1}, \frac{1}{N_2}\Big\}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\big(\hat{x}_i^{(1)} - \hat{x}_j^{(2)}\big)^2 = 2\alpha D \max\Big\{\frac{1}{N_1}, \frac{1}{N_2}\Big\} N_1 N_2\, \hat{M}_2 = 2\alpha D \zeta\, \hat{M}_2,$$
$$I_4 := -\frac{2\alpha}{N_1}\sum_{k=1}^{N_2}\sum_{i=1}^{N_1} a_{k,i}\big(x_c^{(2)} - x_c^{(1)}\big)\hat{x}_i^{(1)} - \frac{2\alpha}{N_2}\sum_{k=1}^{N_1}\sum_{j=1}^{N_2} a_{j,k}\big(x_c^{(1)} - x_c^{(2)}\big)\hat{x}_j^{(2)} \le 2\alpha D \zeta \big|x_c^{(1)} - x_c^{(2)}\big|\sqrt{\hat{M}_2},$$
$$I_5 := \frac{2\delta}{N_1}\sum_{i=1}^{N_1}\hat{x}_i^{(1)} x_i^{(1)}\big(1 - (x_i^{(1)})^2\big), \qquad I_6 := \frac{2\delta}{N_2}\sum_{j=1}^{N_2}\hat{x}_j^{(2)} x_j^{(2)}\big(1 - (x_j^{(2)})^2\big).$$
Using $x_i^{(1)} = \hat{x}_i^{(1)} + x_c^{(1)}$ and $\sum_i \hat{x}_i^{(1)} = 0$, we obtain $\sum_i \hat{x}_i^{(1)} x_i^{(1)} = \sum_i \big(\hat{x}_i^{(1)}\big)^2 = N_1 M_2(\hat{V})$, and hence
$$I_5 = 2\delta M_2(\hat{V}) - \frac{2\delta}{N_1}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^3 \hat{x}_i^{(1)} \le 2\delta M_2(\hat{V}).$$
The last inequality holds because
$$\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^3 \hat{x}_i^{(1)} = \sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2\Big(\big(x_i^{(1)}\big)^2 - x_c^{(1)} x_i^{(1)}\Big) = \frac{1}{2}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2\Big(\big(x_i^{(1)}\big)^2 - \big(x_c^{(1)}\big)^2 + \big(x_i^{(1)} - x_c^{(1)}\big)^2\Big) \ge \frac{1}{2}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^4 - \frac{1}{2}\big(x_c^{(1)}\big)^2\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2 \ge \frac{1}{2}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^4 - \frac{1}{2N_1}\Big(\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2\Big)^2 \ge 0.$$
Similarly for $I_6$, one has $I_6 \le 2\delta M_2(\hat{W})$; thus $I_5 + I_6 \le 2\delta \hat{M}_2$. Note that
$$I_1 + I_2 + I_3 + I_5 + I_6 \le -2\alpha S \min\{N_1, N_2\}\hat{M}_2 + 2\alpha D \zeta \hat{M}_2 + 2\delta \hat{M}_2 \le -2\big[\alpha(S - D)\min\{N_1, N_2\} - \delta\big]\hat{M}_2 \le -2\eta \hat{M}_2. \tag{46}$$
(a) By Cauchy's inequality and Lemma 1,
$$\big|x_c^{(1)} - x_c^{(2)}\big| = \Big|\frac{1}{N_1}\sum_{i=1}^{N_1} x_i^{(1)} - \frac{1}{N_2}\sum_{i=1}^{N_2} x_i^{(2)}\Big| \le \frac{1}{N_1}\sum_{i=1}^{N_1}\big|x_i^{(1)}\big| + \frac{1}{N_2}\sum_{i=1}^{N_2}\big|x_i^{(2)}\big| \le 2\sqrt{\frac{1}{N_1}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^2 + \frac{1}{N_2}\sum_{i=1}^{N_2}\big(x_i^{(2)}\big)^2} = 2\sqrt{M_2(t)} \le 2\sqrt{M_2^\infty}. \tag{47}$$
(b) (Uniform boundedness of $\hat{M}_2$.) By Lemma 2 and (47),
$$\frac{d}{dt}\sqrt{\hat{M}_2} \le -\eta\sqrt{\hat{M}_2} + \alpha D \zeta \big|x_c^{(1)} - x_c^{(2)}\big| \le -\eta\sqrt{\hat{M}_2} + 2D\alpha\zeta\sqrt{M_2^\infty}.$$
Using Gronwall's lemma,
$$\sqrt{\hat{M}_2(t)} \le \sqrt{\hat{M}_2(0)}\,e^{-\eta t} + \frac{2D\alpha\zeta\sqrt{M_2^\infty}}{\eta}\big(1 - e^{-\eta t}\big) \le \max\Big\{\sqrt{\hat{M}_2(0)},\; \frac{2D\alpha\zeta\sqrt{M_2^\infty}}{\eta}\Big\} := C_3. \tag{49}$$
(c) (Separation of the particle centers.) By (32) and the symmetry of $(a_{i,j})$, which gives $\sum_{i=1}^{N_p}\sum_{k=1}^{N_p} a_{k,i}\big(x_k^{(p)} - x_i^{(p)}\big) = 0$ for $p = 1, 2$, we have
$$\frac{d}{dt}\big(x_c^{(1)} - x_c^{(2)}\big) = -\frac{\alpha}{N_1}\sum_{i=1}^{N_1}\sum_{k=1}^{N_2} a_{k,i}\big(x_k^{(2)} - x_i^{(1)}\big) + \frac{\alpha}{N_2}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1} a_{k,j}\big(x_k^{(1)} - x_j^{(2)}\big) + \frac{\delta}{N_1}\sum_{i=1}^{N_1} x_i^{(1)}\big(1 - (x_i^{(1)})^2\big) - \frac{\delta}{N_2}\sum_{j=1}^{N_2} x_j^{(2)}\big(1 - (x_j^{(2)})^2\big).$$
Thus,
$$\frac{d}{dt}\big|x_c^{(1)} - x_c^{(2)}\big|^2 = 2\big(x_c^{(1)} - x_c^{(2)}\big)\frac{d}{dt}\big(x_c^{(1)} - x_c^{(2)}\big),$$
and expanding $x_k^{(p)} = \hat{x}_k^{(p)} + x_c^{(p)}$ in the inter-cluster sums, by an estimate similar to Lemma 2 one obtains
$$\frac{d}{dt}\big|x_c^{(1)} - x_c^{(2)}\big|^2 \ge 2\Big[\alpha\Big(\frac{1}{N_1} + \frac{1}{N_2}\Big)\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} a_{i,j} + 2\delta\Big]\big(x_c^{(1)} - x_c^{(2)}\big)^2 + I_{c1} + I_{c2}, \tag{50}$$
where
$$I_{c1} := 2\alpha\Big(\frac{1}{N_1} + \frac{1}{N_2}\Big)\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} a_{i,j}\big(\hat{x}_j^{(2)} - \hat{x}_i^{(1)}\big)\big(x_c^{(2)} - x_c^{(1)}\big), \qquad I_{c2} := -\frac{2\delta}{N_1}\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^3\big(x_c^{(1)} - x_c^{(2)}\big) + \frac{2\delta}{N_2}\sum_{i=1}^{N_2}\big(x_i^{(2)}\big)^3\big(x_c^{(1)} - x_c^{(2)}\big).$$
By the Cauchy-Schwarz inequality,
$$|I_{c1}| \le 2\alpha\Big(\frac{1}{N_1} + \frac{1}{N_2}\Big) D \sqrt{N_1 N_2}\,\big|x_c^{(1)} - x_c^{(2)}\big|\sqrt{\sum_{i,j}\big(\hat{x}_j^{(2)} - \hat{x}_i^{(1)}\big)^2} \le 2\alpha\Big(\frac{1}{N_1} + \frac{1}{N_2}\Big) D\, N_1 N_2\, \big|x_c^{(1)} - x_c^{(2)}\big|\sqrt{\hat{M}_2}. \tag{51}$$
For $I_{c2}$, note that
$$\frac{2\delta}{N_1}\Big|\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^3\Big| \le 2\delta\sqrt{N_1}\, M_2(V)^{3/2}, \qquad \frac{2\delta}{N_2}\Big|\sum_{i=1}^{N_2}\big(x_i^{(2)}\big)^3\Big| \le 2\delta\sqrt{N_2}\, M_2(W)^{3/2}.$$
Then one gets
$$I_{c2} \ge -2\big|x_c^{(1)} - x_c^{(2)}\big|\Big(\frac{\delta}{N_1}\Big|\sum_{i=1}^{N_1}\big(x_i^{(1)}\big)^3\Big| + \frac{\delta}{N_2}\Big|\sum_{i=1}^{N_2}\big(x_i^{(2)}\big)^3\Big|\Big) \ge -2\delta\sqrt{\max\{N_1, N_2\}}\,\big|x_c^{(1)} - x_c^{(2)}\big|\, M_2(t)^{3/2}.$$
Hence,
$$\frac{d}{dt}\big|x_c^{(1)} - x_c^{(2)}\big|^2 \ge \Big[2\alpha\Big(\frac{1}{N_1} + \frac{1}{N_2}\Big)\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} a_{i,j} + 4\delta\Big]\big|x_c^{(1)} - x_c^{(2)}\big|^2 - 2\alpha D(N_1 + N_2)\big|x_c^{(1)} - x_c^{(2)}\big|\sqrt{\hat{M}_2} - 2\delta\sqrt{\max\{N_1, N_2\}}\,\big|x_c^{(1)} - x_c^{(2)}\big|\, M_2^{3/2}. \tag{52}$$
Combining with Lemma 1 and (49), one obtains the estimate
$$\frac{d}{dt}\big|x_c^{(1)} - x_c^{(2)}\big| \ge \Big[\alpha\Big(\frac{1}{N_1} + \frac{1}{N_2}\Big)\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} a_{i,j} + 2\delta\Big]\big|x_c^{(1)} - x_c^{(2)}\big| - \alpha D(N_1 + N_2)\, C_3 - \delta\sqrt{\max\{N_1, N_2\}}\,\big(M_2^\infty\big)^{3/2}. \tag{53}$$
By Gronwall's lemma, if the initial data satisfy
$$\big|x_c^{(1)}(0) - x_c^{(2)}(0)\big| \ge \frac{\alpha D(N_1 + N_2)\, C_3 + \delta\sqrt{\max\{N_1, N_2\}}\,\big(M_2^\infty\big)^{3/2}}{2\delta} := \frac{C_4}{\delta},$$
then
$$\big|x_c^{(1)}(t) - x_c^{(2)}(t)\big| \ge \frac{C_4}{\delta} + \Big(\big|x_c^{(1)}(0) - x_c^{(2)}(0)\big| - \frac{C_4}{\delta}\Big)e^{\delta t} \ge \frac{C_4}{\delta}. \tag{54}$$
(d) (Spatial separation of the two sub-ensembles.) For any $i = 1, \dots, N_1$ and $j = 1, \dots, N_2$,
$$\big|x_i^{(1)}(t) - x_j^{(2)}(t)\big| \ge \big|x_c^{(1)}(t) - x_c^{(2)}(t)\big| - \big|\hat{x}_i^{(1)}(t) - \hat{x}_j^{(2)}(t)\big| \ge \big|x_c^{(1)}(t) - x_c^{(2)}(t)\big| - 2\sqrt{\max\{N_1, N_2\}}\sqrt{\hat{M}_2} \ge \frac{C_4}{\delta} + \Big(\big|x_c^{(1)}(0) - x_c^{(2)}(0)\big| - \frac{C_4}{\delta}\Big)e^{\delta t} - 2\sqrt{\max\{N_1, N_2\}}\Big(\sqrt{\hat{M}_2(0)}\,e^{-\eta t} + \frac{2D\alpha\zeta\sqrt{M_2^\infty}}{\eta}\big(1 - e^{-\eta t}\big)\Big).$$
Then there exists some time $T^*$ such that for all $t \ge T^*$, $\big|x_i^{(1)}(t) - x_j^{(2)}(t)\big| \ge C' > 0$ for all $i, j$. Combining with Proposition 1, we finish the proof. ■

Remark 2. The proof of Proposition 3 is included in part (d) of the proof of Proposition 2. Now denote $\eta_2 := \sum_{i \in I_1, j \in I_2} a_{i,j} > 0$ in some channel; then the Dirichlet energy in this channel has a lower bound:
$$E(x) = \frac{1}{N}\sum_{i,j} a_{i,j}(x_i - x_j)^2 = \frac{1}{N}\Big[\sum_{i,j\in I_1} a_{i,j}\big(x_i^{(1)} - x_j^{(1)}\big)^2 + \sum_{i,j\in I_2} a_{i,j}\big(x_i^{(2)} - x_j^{(2)}\big)^2 + \sum_{i\in I_1, j\in I_2} a_{i,j}\big(x_i^{(1)} - x_j^{(2)}\big)^2\Big] \ge \frac{1}{N}\sum_{i\in I_1, j\in I_2} a_{i,j}\big(x_i^{(1)} - x_j^{(2)}\big)^2 \ge \frac{(C')^2\,\eta_2}{N}. \tag{55}$$

We use the Dormand-Prince adaptive step size scheme (DOPRI5) as the neural ODE solver for all datasets. Hyperparameter search used Ray Tune with one hundred trials, using an asynchronous hyperband scheduler with a grace period of 50 epochs. All the details to reproduce our results have been included in the submission and will be publicly available after publication.

D.2 ABLATION STUDY FOR ACMP

Message passing performance vs. depth. We compare ACMP with various GNN models, such as GRAND, GCN, GAT, and GraphSAGE, at different depths on the Planetoid datasets. Table 5 lists the node classification accuracy on Cora, Citeseer and Pubmed. We observe that ACMP maintains its performance as the network deepens and achieves top test accuracy among all listed models at the same depth. ACMP can thus overcome oversmoothing.

Model parameter comparison. We compare the number of parameters of our model with different benchmark models on the Cora dataset in Table 6. The depth of each model is chosen as the one achieving its best performance on Cora. We show that ACMP is a light-weight neural network architecture which achieves good classification performance with fewer parameters.

Allen-Cahn term. We now show in Figure 5 how the Allen-Cahn term can stabilize training and prevent node features from blowing up. The first row shows the evolution of the diffusion equation without the Allen-Cahn term, while the second row has the Allen-Cahn term added. We observe that introducing the repulsive term is essential for bounding GNN outputs, particularly when learning heterophilic datasets; however, naively adding $\beta$ to message passing results in all node features diverging. In the first row of Figure 5, where the Allen-Cahn term is not incorporated, the node features grow to $3 \times 10^3$ at $T = 10$, from $0.1$ at $T = 1$; by $T = 30$, the largest node feature reaches $1 \times 10^{20}$, which the neural ODE solver and message passing can hardly handle numerically. When we introduce the Allen-Cahn term, the system contains two strong attractors at $\pm 1$, and the nodes are attracted to the two ends $1$ and $-1$ according to their own features.
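The ablation above can be reproduced qualitatively with a toy simulation. This is a sketch under assumed parameters (a hypothetical dense graph, an illustrative repulsion strength, and explicit Euler steps), not the paper's actual experiment: with net-repulsive coupling alone, the features diverge, while the Allen-Cahn term keeps them near $\pm 1$.

```python
import numpy as np

def evolve(x, W, dt=0.01, steps=1500, allen_cahn=False):
    """Euler-integrate message passing with weights W; optionally add the
    Allen-Cahn force x * (1 - x^2)."""
    for _ in range(steps):
        diff = (W * (x[None, :] - x[:, None])).sum(axis=1)
        dx = diff + (x * (1 - x**2) if allen_cahn else 0.0)
        x = x + dt * dx
        if np.max(np.abs(x)) > 1e12:   # stop once clearly blown up
            break
    return x

rng = np.random.default_rng(3)
N, beta = 10, 2.0
A = rng.random((N, N)); A = (A + A.T) / 2 / N
W = A - beta / N                       # net-repulsive message passing

x0 = rng.normal(0, 0.1, size=N)
without = evolve(x0.copy(), W, allen_cahn=False)
with_ac = evolve(x0.copy(), W, allen_cahn=True)
print(np.max(np.abs(without)) > np.max(np.abs(with_ac)))  # prints True
```

The repulsion acts as anti-diffusion and amplifies feature differences exponentially; the double-well term caps this growth because the cubic restoring force eventually dominates any linear repulsion.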



http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/



Figure 3: Evolution of Dirichlet energy E(X n ) of layer-wise node features X n propagated by GCN, GAT, GRAND, ACMP-GCN.

Figure 4: Significance plot for β in terms of test accuracy on Cora (orange) and Texas (blue) with 10 fixed random splits.

In Bertozzi & Flenner (2012); Luo & Bertozzi (2017); Merkurjev et al. (2013) and references therein, the authors extended Allen-Cahn related potentials to the graph setting and developed a class of variational algorithms to solve clustering, semi-supervised learning and graph cutting problems. The new ingredient of graph neural networks, which enables us to combine learnable attraction and repulsion, separates our method from the classical variational graph models.

and Ha et al. (2010) discuss asymptotic flocking for the C-S model with Rayleigh friction. Fang et al. (2019) furthermore study frameworks leading to bi-cluster flocking for the C-S model with Rayleigh friction and attractive-repulsive coupling.



Figure 5: Example of how adding the Allen-Cahn term can prevent node features from becoming infinite. We plot the first channel of the 150-dimensional node features. In the first row, the repulsive force is added to message passing without the Allen-Cahn term; in the second row, the Allen-Cahn term is added. The first, second and third columns show the neural ODE's initial state and the states at T = 10 and T = 30.

Test accuracy and std for 10 initializations and 100 random train-val-test splits on six node classification benchmarks. Red (first), blue (second), and violet (third) denote the best three methods.

Node classification results on heterophilic datasets. We use the 10 fixed splits for training, validation and test from Pei et al. (2020) and report the mean and std of test accuracy. Red (first), blue (second), and violet (third) denote the best three methods.

Information for Graph Datasets Used in Experiments

Hyperparameter Search Space

Test Accuracy of Models with Different Depth

Model | Depth | Cora | Citeseer | Pubmed
… | … | … ± 0.52 | 74.61 ± 1.04 | 79.74 ± 0.24
ACMP (ours) | 16 | 83.19 ± 0.67 | 73.13 ± 0.85 | 79.16 ± 0.36
ACMP (ours) | 32 | 83.11 ± 0.81 | 72.76 ± 1.05 | 79.81 ± 1.61
ACMP (ours) | 64 | 80.48 ± 1.21 | 68.92 ± 1.37 | 78.01 ± 0.01
ACMP (ours) | 128 | 80.30 ± 1.18 | 67.83 ± 0.02 | 77.98 ± 0.01

The number of parameters for different models

ACKNOWLEDGEMENTS

This work was supported by the Shanghai Municipal Science and Technology Major Project, and Science and Technology Commission of Shanghai Municipality grant No. 20JC1414100, (2021SHZDZX0102), and Shanghai Artificial Intelligence Laboratory (P22KN00524). S. Jin was partially supported by the NSFC grant No. 12031013.

AVAILABILITY

Codes are available at https://github.com/ykiiiiii/ACMP.

D EXPERIMENTS

The code for the experiments is available at https://github.com/ykiiiiii/ACMP. We will replace this anonymous link with a non-anonymous GitHub link after acceptance. We implement all experiments in Python 3.8.13 with PyTorch Geometric on one NVIDIA® Tesla A100 GPU with 6,912 CUDA cores and 80GB HBM2, mounted on an HPC cluster. In addition, we take the official implementation of Graph Neural Diffusion (GRAND) as the diffusion term in (9) from the repository: https://github.com/twitter-research/graph-neural-pde

D.1 DETAILS FOR EXPERIMENTS

Datasets. We consider two types of datasets, homophilic and heterophilic, differentiated by the homophily level of a graph (Pei et al., 2020):
$$\mathcal{H} = \frac{\text{number of } v\text{'s neighbors who have the same label as } v}{\text{number of } v\text{'s neighbors}},$$
averaged over nodes $v$. In the experiments, we have used six homophilic datasets, including Cora (McCallum et al., 2000), Citeseer (Sen et al., 2008), Pubmed (Namata et al., 2012), Computers and Photo (Namata et al., 2012), and CoauthorCS (Shchur et al., 2018), and three heterophilic datasets: Cornell, Texas and Wisconsin from the WebKB dataset. For completeness, we list the numbers of classes, features, nodes and edges of each dataset, together with their homophily levels, in Table 3. A low homophily level means the dataset is more heterophilic, with most neighbors not in the same class; a high homophily level indicates the dataset is closer to homophilic, where similar nodes tend to be connected. The datasets in Table 3 cover various homophily levels.

Experiment setup. For homophilic datasets, we use 10 random weight initializations and 100 random splits, giving 1,000 runs in total; each split randomly selects 20 nodes per class for training. For heterophilic data, we use the original fixed 10-split datasets. We fine-tune our model
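The homophily level defined above can be computed directly from an edge list; the following is a minimal sketch (the function name and the toy graph are ours, for illustration only):

```python
import numpy as np

def homophily_level(edges, labels):
    """Node-averaged homophily (Pei et al., 2020): for each node v, the fraction
    of v's neighbors sharing v's label, averaged over nodes with neighbors."""
    n = len(labels)
    neighbors = [[] for _ in range(n)]
    for u, v in edges:          # undirected graph
        neighbors[u].append(v)
        neighbors[v].append(u)
    fracs = [np.mean([labels[w] == labels[v] for w in nbrs])
             for v, nbrs in enumerate(neighbors) if nbrs]
    return float(np.mean(fracs))

# Toy graph: a homophilic triangle of class 0 attached to one class-1 node.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
labels = [0, 0, 0, 1]
print(round(homophily_level(edges, labels), 3))  # prints 0.667
```

Values near 1 indicate a homophilic graph (e.g. Cora), while values near 0 indicate a heterophilic one (e.g. Texas).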

