DEEPGRAND: DEEP GRAPH NEURAL DIFFUSION

Abstract

We propose Deep Graph Neural Diffusion (DeepGRAND), a class of continuous-depth graph neural networks based on the diffusion process on graphs. DeepGRAND leverages a data-dependent scaling term and a perturbation to the graph diffusivity to make the real parts of all eigenvalues of the diffusivity matrix negative, which ensures two favorable theoretical properties: (i) the node representations do not exponentially converge to a constant vector as the model depth increases, thus alleviating the over-smoothing issue; (ii) the stability of the model is guaranteed by controlling the norm of the node representations. Compared to the baseline GRAND, DeepGRAND mitigates the accuracy drop-off with increasing depth and improves the overall accuracy of the model. We empirically corroborate the advantage of DeepGRAND over many existing graph neural networks on various graph deep learning benchmark tasks.

1. INTRODUCTION

Graph neural networks (GNNs) and machine learning on graphs (Bronstein et al., 2017; Scarselli et al., 2008) have been successfully applied in a wide range of applications including physical modeling (Duvenaud et al., 2015; Gilmer et al., 2017; Battaglia et al., 2016), recommender systems (Monti et al., 2017; Ying et al., 2018), and social networks (Zhang & Chen, 2018; Qiu et al., 2018). Recently, more advanced GNNs have been developed to further improve the performance of the models and extend their application beyond machine learning; these include graph convolutional networks (GCNs) (Kipf & Welling, 2017), ChebyNet (Defferrard et al., 2016), GraphSAGE (Hamilton et al., 2017), neural graph fingerprints (Duvenaud et al., 2015), message passing neural networks (Gilmer et al., 2017), graph attention networks (GATs) (Veličković et al., 2018), and hyperbolic GNNs (Liu et al., 2019). A well-known problem of GNNs is that the performance of the model decreases significantly with increasing depth. This phenomenon is a common plight of most GNN architectures and is referred to as the over-smoothing issue of GNNs (Li et al., 2018; Oono & Suzuki, 2020; Chen et al., 2020).

1.1. MAIN CONTRIBUTIONS AND OUTLINE

In this paper, we propose Deep Graph Neural Diffusion (DeepGRAND), a class of continuous-depth graph neural networks based on the diffusion process on graphs that improves on various aspects of the baseline Graph Neural Diffusion (GRAND) (Chamberlain et al., 2021b). At its core, DeepGRAND introduces a data-dependent scaling term and a perturbation to the diffusion dynamic. With this design, DeepGRAND attains the following advantages:
1. DeepGRAND inherits the diffusive characteristic of GRAND while significantly mitigating the over-smoothing issue.
2. DeepGRAND achieves remarkably greater performance than existing GNNs and other GRAND variants when fewer nodes are labelled as training data, meriting its use in low-labelling-rate situations.
3. Feature representations under the dynamic of DeepGRAND are guaranteed to remain bounded, ensuring numerical stability.
Organization. In Section 2, we give a concise description of the over-smoothing issue of general graph neural networks and of the GRAND architecture. A rigorous treatment and further discussion of the inherent over-smoothing issue in GRAND is given in Section 3. The formulation of DeepGRAND is given in Section 4, where we also provide theoretical guarantees on the stability of DeepGRAND and its ability to mitigate over-smoothing. Finally, we demonstrate the practical advantages of DeepGRAND in Section 5, showing reduced accuracy drop-off at higher depth and improved overall accuracy compared to variants of GRAND and other popular GNNs.

1.2. RELATED WORK

Neural ODEs. Chen et al. (2018) introduced Neural ODEs, a class of continuous-depth neural networks with inherent residual connections. Follow-up works extended this framework through techniques such as augmentation (Dupont et al., 2019), regularization (Finlay et al., 2020) and momentum (Xia et al., 2021). Continuous-depth GNNs were first proposed by Xhonneux et al. (2020).
Over-smoothing. The over-smoothing issue of GNNs was first recognized by Li et al. (2018). Many recent works have attempted to address this problem (Oono & Suzuki, 2020; Cai & Wang, 2020; Zhao & Akoglu, 2020). A connection between the over-smoothing issue and the homophily property of graphs was also recognized by Yan et al. (2021).
PDE-inspired graph neural networks. Partial differential equations (PDEs) are ubiquitous in modern mathematics, arising naturally in both pure and applied mathematical research. A number of recent works in graph machine learning have taken inspiration from classical PDEs, including the diffusion equation (Chamberlain et al., 2021b; Thorpe et al., 2022), the wave equation (Eliasof et al., 2021) and Beltrami flow (Chamberlain et al., 2021a). Such approaches have led to interpretable architectures, and provide new solutions to traditional problems in GNNs such as over-smoothing (Rusch et al., 2022) and over-squashing (Topping et al., 2022).

2. BACKGROUND

Notation. Let G = (V, E) denote a graph, where V is the vertex set with |V| = n and E is the edge set with |E| = e. For a vertex u in V, denote by N_u the set of vertices adjacent to u in G. We denote by d the dimensionality of the features, i.e., the number of features for each node. Following the convention of Goodfellow et al. (2016), we denote scalars by lower- or upper-case letters and vectors and matrices by lower- and upper-case boldface letters, respectively.

2.1. GRAPH NEURAL NETWORKS AND THE OVER-SMOOTHING ISSUE

As noted by Bronstein et al. (2021), the vast majority of the literature on graph neural networks can be derived from just three basic flavours: convolutional, attentional and message-passing. The cornerstone architectures for these flavours are GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018) and MPNN (Gilmer et al., 2017), the last of which forms an overarching design over the other two. An update rule for all message passing neural networks can be written in the form

H_u = ξ(X_u, ⊕_{v∈N_u} µ(X_u, X_v)),

where µ is a learnable message function, ⊕ is a permutation-invariant aggregation function, and ξ is the update function. In the case of GCN or GAT, this can be further simplified to

H_u = σ(Σ_{v∈N_u∪{u}} a_uv µ(X_v)), (1)

where a_uv is given either by the normalized augmented adjacency matrix (GCN) or by the attention mechanism (GAT), µ is a linear transformation, and σ is an activation function. The learning mechanism behind GCN was first analyzed by Li et al. (2018), who showed that graph convolution is essentially a type of Laplacian smoothing. It was also noted that as deeper layers are stacked, the architecture risks suffering from over-smoothing. This issue makes features indistinguishable and hurts the classification accuracy, which goes against the common understanding that the deeper the model, the better its learning capacity should be. Furthermore, it prevents the stacking of many layers and impairs the ability to model long-range dependencies. Over the past few years, over-smoothing has been observed in many traditional GNNs, prompting a flurry of research into understanding (Oono & Suzuki, 2020; Cai & Wang, 2020; Yan et al., 2021) and alleviating (Luan et al., 2019; Zhao & Akoglu, 2020; Rusch et al., 2022) the issue.
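As a concrete instance of equation 1, a GCN-style update can be sketched in a few lines of NumPy. This is a hypothetical minimal example (row normalization instead of GCN's symmetric normalization, a fixed weight matrix, and a tanh activation), not the reference implementation:

```python
import numpy as np

def gcn_layer(X, A, W, sigma=np.tanh):
    """One GCN-style update H = sigma(A_hat X W), where A_hat is a
    row-normalized adjacency with self-loops (the N_u ∪ {u} sum in
    equation 1). Illustrative sketch only; GCN proper uses the
    symmetric normalization D^{-1/2}(A + I)D^{-1/2} and learned W."""
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)   # weights a_uv, rows sum to 1
    return sigma(A_hat @ X @ W)                        # aggregate, transform, activate

# toy graph: a 3-node path 0-1-2 with 2-dimensional features
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.eye(2)
H = gcn_layer(X, A, W)
```

Because each row of H is a (squashed) average of a node's neighborhood, stacking many such layers drives the representations toward indistinguishability, which is exactly the over-smoothing behaviour discussed above.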

2.2. GRAPH NEURAL DIFFUSION

GRAND is a continuous-depth architecture for deep learning on graphs proposed by Chamberlain et al. (2021b). It drew inspiration from the heat diffusion process in mathematical physics, and follows the same vein as other PDE-inspired neural networks (Rusch et al., 2022; Eliasof et al., 2021; Chamberlain et al., 2021a). Central to the formulation of GRAND is the diffusion equation on graphs

∂X(t)/∂t = div[G(X(t), t)∇X(t)], (2)

where G = diag(a(X_i(t), X_j(t), t)) is an e × e diagonal matrix giving the diffusivity between connected vertices, which describes the thermal conductance property of the graph. The architecture utilises the encoder-decoder design Y = ψ(X(T)), where X(T) ∈ R^{n×d} is computed as

X(T) = X(0) + ∫_0^T ∂X/∂t (t) dt, (3)

with X(0) = ϕ(X). In the simplest case, when G depends only on the initial node features, the differential in equation 3 simplifies to

∂X/∂t (t) = (A(X) − I)X(t), (4)

where A(X) = (a(X_i, X_j)) is an n × n matrix with the same sparsity structure as the adjacency matrix of the graph. From now on, we omit the term X when writing A(X). The entries of A are exactly those in G, and thus determine the diffusivity. Furthermore, A can informally be thought of as the attention weights between vertices. Building upon this heuristic, GRAND models the attention matrix A in equation 4 by the multi-head self-attention mechanism

A = (1/h) Σ_{l=1}^h A^l(X), (5)

where h is the number of heads and A^l(X) = (a^l(X_i, X_j)), for l = 1, ..., h, are the attention matrices. With this specific implementation, which is called GRAND-l, we obtain A as a right-stochastic matrix with positive entries. As pointed out by Chamberlain et al. (2021b) and Thorpe et al. (2022), many GNN architectures, including GAT and GCN, can be formalised as discretisation schemes of equation 2 if no non-linearity is used between the layers.
The intuition behind the connection can readily be seen from equation 1, where each subsequent node representation is computed as (approximately) a weighted average of the neighboring node features. This gives rise to the diffusive nature of these architectures, resonating with the work of Gasteiger et al. (2019), which asserts that diffusion improves graph learning. The choice of a time-independent attention function amounts to all the layers sharing the same parameters, making the model lightweight and less prone to over-fitting. The original authors claimed that this implementation solves the over-smoothing problem and performs well with many layers. However, our analysis shows that this is not the case.
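To illustrate the diffusive behaviour, the simplified dynamic of equation 4 with a fixed attention matrix can be integrated with forward Euler. The matrix, step size and horizon below are illustrative choices, not values from the paper:

```python
import numpy as np

def grand_euler(X0, A, T, dt=0.01):
    """Forward-Euler integration of dX/dt = (A - I) X, the simplified
    GRAND dynamic of equation 4 with a fixed right-stochastic A.
    Toy sketch; GRAND itself learns A via multi-head attention and
    integrates with an adaptive ODE solver."""
    X = X0.copy()
    for _ in range(int(T / dt)):
        X = X + dt * (A - np.eye(A.shape[0])) @ X
    return X

# a right-stochastic attention matrix on a 3-node graph
A = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])
X0 = np.array([[1.0], [0.0], [-1.0]])
X_T = grand_euler(X0, A, T=50.0)
# with a large integration limit, all node features collapse to a
# common constant: the over-smoothing analysed in Section 3
spread = float(X_T.max() - X_T.min())
```

The spread of the node features after a long horizon is numerically zero, previewing the spectral argument of Section 3.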

3. DOES GRAND SUFFER FROM OVER-SMOOTHING?

Various works (Oono & Suzuki, 2020; Cai & Wang, 2020; Rusch et al., 2022) have attributed the occurrence of the over-smoothing issue to the exponential convergence of node representations. We formally define this phenomenon for continuous-depth graph neural networks.

Definition 1. Let X(t) ∈ R^{n×d} denote the feature representation at time t ≥ 0. X is said to experience over-smoothing if there exists a vector v ∈ R^d and constants C_1, C_2 > 0 such that, for V = (v, v, ..., v)^⊤,

∥X(t) − V∥_∞ ≤ C_1 e^{−C_2 t}. (6)

This definition is similar in spirit to the one given by Rusch et al. (2022). We will show that the GRAND architecture suffers from over-smoothing.

Proposition 1. Given a positive right-stochastic matrix A ∈ R^{n×n}, there exists an invertible matrix P ∈ C^{n×n} such that

P^{−1}(A − I)P = diag(0, J_1, ..., J_m), (7)

where each J_i is a Jordan block associated with some eigenvalue β_i of A − I with Re β_i < 0, and P can be chosen such that its first column is the vector (1, 1, ..., 1)^⊤.

The Jordan decomposition of A allows us to study the long-term behaviour of GRAND through its matrix spectrum. The fact that the real parts of all eigenvalues of A − I are less than or equal to 0 will be essential in our analysis.

Proposition 2. With the dynamic given by equations 3, 4 and 5, X experiences over-smoothing.

Remark 1. A random walk viewpoint of GRAND was given by Thorpe et al. (2022). Using this, the authors were able to show that the node representations in the forward Euler discretization of the GRAND dynamic,

X(kδ_t) = X((k − 1)δ_t) + δ_t(A − I)X((k − 1)δ_t), for k = 1, 2, ..., K and δ_t < 1,

converge in distribution to a stationary state. Our spectral method can be adapted to this discretized setting (see Appendix D). We note that our analysis is strictly stronger through both the use of the ∞-norm and the exact exponential rate of convergence.
Proposition 2 conclusively shows that GRAND-l still suffers from over-smoothing. Of course, if we consider more general variants of GRAND, the above arguments no longer hold. Although we cannot rigorously assert that over-smoothing affects all implementations of GRAND, it can be argued that the occurrence of this phenomenon in GRAND intuitively makes sense, since diffusion has the tendency to 'even out' distributions over time (further discussion can be found in Appendix B). Hence, a purely diffusion-based model like GRAND is inherently ill-suited for deep networks.
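The spectral facts underlying Propositions 1 and 2 are easy to check numerically. The sketch below, with an illustrative size and random seed, draws a positive right-stochastic matrix and verifies that A − I has a single eigenvalue at 0 while all other eigenvalues have strictly negative real part:

```python
import numpy as np

# Numerical check of Proposition 1: for a positive right-stochastic A,
# the Perron-Frobenius eigenvalue 1 of A shifts to a simple eigenvalue 0
# of A - I, and every other eigenvalue has negative real part.
rng = np.random.default_rng(0)
M = rng.random((5, 5)) + 0.1             # strictly positive entries
A = M / M.sum(axis=1, keepdims=True)     # rows sum to 1: right-stochastic
eigs = np.linalg.eigvals(A - np.eye(5))
top_real = float(eigs.real.max())        # should be (numerically) zero
second_real = float(sorted(eigs.real)[-2])  # strictly negative
```

This is exactly why the solution e^{t(A − I)}X(0) decays onto the constant-vector eigenspace at an exponential rate.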

4. DEEPGRAND

4.1. MODEL FORMULATION

We propose DeepGRAND, a new class of continuous-depth graph neural networks based on GRAND that retains the advantage of diffusion and is capable of learning at much greater depth. It leverages a perturbation to the graph diffusivity, A − (1 + ϵ)I, and a scaling factor ⟨X(t)⟩_α to make the model both more stable and more resilient to the over-smoothing issue. Denote the i-th column of X by X_i, and the usual 2-norm by ∥·∥. For a constant α > 0, we define the column-wise norm matrix ⟨X(t)⟩_α ∈ R^{n×d} as the matrix whose n identical rows are each

(∥X_1∥^α, ∥X_2∥^α, ..., ∥X_d∥^α). (8)

The dynamic of DeepGRAND is given by equation 3, with ∂X/∂t given by

∂X/∂t (t) = [(A − (1 + ϵ)I)X(t)] ⊙ ⟨X(t)⟩_α, (9)

where ⊙ is the Hadamard product. Our proposed model thus represents an alteration to the central GRAND dynamic given by equation 4. With these modifications, DeepGRAND no longer has the form of a pure diffusion process like GRAND, while still retaining its diffusive characteristics: at every infinitesimal instance, the information from graph nodes is aggregated for use in updating the feature representation.
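A minimal sketch of the right-hand side of equation 9, assuming a fixed attention matrix A and illustrative values of ϵ and α (the learned attention and the ODE solver are omitted):

```python
import numpy as np

def deepgrand_rhs(X, A, eps=0.1, alpha=1e-3):
    """Right-hand side of the DeepGRAND dynamic (equation 9):
    dX/dt = [(A - (1 + eps) I) X] ⊙ <X>_alpha,
    where <X>_alpha repeats the column norms ||X_i||^alpha over the rows.
    Illustrative sketch with a fixed matrix A and stand-in eps, alpha."""
    n = A.shape[0]
    col_norms = np.linalg.norm(X, axis=0) ** alpha   # one scalar per feature column
    scale = np.tile(col_norms, (n, 1))               # the matrix <X>_alpha
    return ((A - (1 + eps) * np.eye(n)) @ X) * scale # perturbed diffusion, then ⊙

A = np.array([[0.6, 0.4], [0.5, 0.5]])               # right-stochastic toy attention
X = np.array([[1.0, 2.0], [3.0, -1.0]])
dX = deepgrand_rhs(X, A)
```

With a tiny α the scaling factor is close to 1 far from the origin, so the dynamic behaves like perturbed diffusion; the factor only bites as the column norms shrink, which is what slows the convergence.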

4.2. THEORETICAL ADVANTAGES OF DEEPGRAND

To explain the theoretical motivation behind DeepGRAND, we first note that the convergence of all nodes to a pre-determined feature vector is not necessarily an undesirable trait: it guarantees the model will remain bounded and not 'explode'. Our perturbation by ϵ serves to slightly strengthen this behaviour. Its purpose is to reduce the real parts of all eigenvalues to negative values, making all node representations converge to 0 as the integration limit T tends to infinity. We observe that it is in fact the rate of convergence that chiefly determines the range of effective depth. If the model converges very slowly, we can train it at high depth without ever having to worry about over-smoothing. As such, if we can control the convergence rate, we can alleviate the over-smoothing issue. The addition of the term ⟨X(t)⟩_α ∈ R^{n×d} serves this purpose: as the node representations come close to their limits, it acts as a scaling factor that slows down the convergence. An exact bound on the convergence rate of our model is presented in Proposition 3.

Proposition 3. Assume A is a right-stochastic and radial matrix. With the dynamic given in equation 9, we have, for each column X_i, the bound

((2 + ϵ)αT + ∥X_i(0)∥^{−α})^{−1/α} ≤ ∥X_i(T)∥ ≤ (ϵαT + ∥X_i(0)∥^{−α})^{−1/α}. (10)

Equation 10 shows that each column X_i is bounded on both sides by polynomial-like terms. Hence, it can no longer converge at exponential speed, and the over-smoothing problem is thus alleviated. Note also that the right-hand side of equation 10 converges to 0 as T goes to infinity. This ensures the node representations will not explode in the long run and helps with numerical stability.
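The two-sided bound of equation 10 can be checked numerically on a toy example. The sketch below uses a symmetric doubly-stochastic 2 × 2 matrix (hence radial) and a small-step Euler scheme; all constants are illustrative, and the small discretization error is ignored:

```python
import numpy as np

# Numerical check of Proposition 3 on one feature column X_i.
eps, alpha, dt, T = 0.1, 0.5, 1e-3, 5.0
A = np.array([[0.5, 0.5], [0.5, 0.5]])   # right-stochastic and symmetric (radial)
x = np.array([1.0, -0.5])                # a single column X_i(0)
x0_norm = np.linalg.norm(x)

# integrate dX_i/dt = (A - (1+eps) I) X_i ||X_i||^alpha with forward Euler
for _ in range(int(T / dt)):
    x = x + dt * ((A - (1 + eps) * np.eye(2)) @ x) * np.linalg.norm(x) ** alpha

lower = ((2 + eps) * alpha * T + x0_norm ** (-alpha)) ** (-1 / alpha)
upper = (eps * alpha * T + x0_norm ** (-alpha)) ** (-1 / alpha)
norm_T = float(np.linalg.norm(x))        # should land between lower and upper
```

The norm at time T sits strictly between the two polynomial-like bounds, decaying far more slowly than the exponential rate of the unmodified GRAND dynamic.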

5. EMPIRICAL ANALYSIS

In this section, we conduct experiments to compare the performance of our proposed method DeepGRAND and the baseline GRAND. We aim to show that DeepGRAND achieves better accuracy when trained with higher depth and a limited number of labelled nodes per class. We also compare our results with those of GRAND++ (Thorpe et al., 2022) and popular GNN architectures such as GCN, GAT and GraphSAGE.

5.1.1. DATASETS

Cora (McCallum et al., 2000): A citation network consisting of more than 2700 publications with more than 5000 connections and 7 classes. Each node is a vector of 0s and 1s indicating the presence or absence of the corresponding words in a corpus of 1433 words.

Citeseer (Giles et al., 1998): Similar to Cora, the Citeseer dataset is a network of 3312 scientific publications with 4732 connections, each publication classified into one of 6 classes. The node vectors consist of 0s and 1s indicating the presence or absence of the corresponding words in a library of 3703 words.

Pubmed (Sen et al., 2008): Contains 19717 scientific publications related to diabetes from the Pubmed database, separated into 3 classes. The network consists of 44338 connections, and each node vector is the TF/IDF representation over a corpus of 500 unique words.

CoauthorCS (Monti et al., 2016)

5.1.2. SETUP

Experiment setup. For most of the experiments, we used the same set of hyper-parameters as GRAND and only changed some settings for the ablation studies specified in sections 5.2, 5.3 and 5.4. Similar to GRAND-l, we denote the specific implementation of DeepGRAND in which A depends only on the initial node features as DeepGRAND-l. The experiments are conducted on the common benchmarks for node classification tasks (Cora, Citeseer, Pubmed, Computers, Photo and CoauthorCS) using an NVIDIA RTX 3090 GPU. One thing to note is that we have made slight adjustments to the original implementation of GRAND. Firstly, in the original work, the authors of GRAND included a learnable scaling term α in front of equation 4. Secondly, they also added the value of X(0), scaled by a learnable scalar β, to the equation. The final GRAND dynamic used in the original implementation is

∂X/∂t (t) = α(A(X) − I)X(t) + βX(0). (11)

The initial value X(0) is treated as a source term added to the overall dynamics before each evaluation step of the ODE solver. This modification is task-specific. For a fair comparison of GRAND and DeepGRAND, we have removed it from the original implementation and from all of our experiments.
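For reference, the source-term dynamic of equation 11 can be sketched as follows, with fixed stand-in scalars for the learnable α and β (hypothetical values; the original implementation learns them):

```python
import numpy as np

def grand_source_rhs(X, X0, A, a=1.0, b=0.1):
    """Right-hand side of equation 11, the original GRAND implementation:
    dX/dt = a (A - I) X + b X(0), where a and b stand in for the
    learnable scalars alpha and beta, and X(0) acts as a source term.
    Illustrative sketch with fixed scalars and a fixed matrix A."""
    n = A.shape[0]
    return a * (A - np.eye(n)) @ X + b * X0

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy right-stochastic attention
X0 = np.array([[1.0], [0.0]])
dX = grand_source_rhs(X0, X0, A)         # evaluated at t = 0, where X = X(0)
```

The `b * X0` term continually re-injects the initial features, which is precisely the task-specific behaviour removed in our experiments for a fair comparison.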

5.1.3. EVALUATION

We aim to demonstrate the advantages of DeepGRAND by evaluating our proposed method using the following experiments:

Ablation study with different depths (5.2): For this experiment, we trained variants of GRAND and DeepGRAND using the same integration limits T across a wide range of values and compared the respective test accuracies.
• For Cora, Citeseer and Pubmed: we used the default Planetoid split available for these datasets. For each T value, we trained GRAND-l, GRAND-nl and DeepGRAND-l with 10 random seeds (weight initializations) per method, then calculated the mean and standard deviation of the test accuracies across the random seeds.
• For Computers, Photo and CoauthorCS: we used random splitting for the 3 remaining benchmarks. For each T value, we trained all methods with 10 random seeds and calculated the mean and standard deviation of the test accuracies across the random seeds.

Ablation study with different label rates (5.3): For all of the benchmarks, we used random splitting and trained GRAND-l, GRAND-nl and DeepGRAND-l with different numbers of labelled nodes per class. For each experiment, we ran 10 random seeds per method and calculated the mean and standard deviation of the test accuracies across the seeds.

Ablation study on the effect of α (5.4): In this section we aimed to study the effect of the exponent α on the ODE solver's ability to integrate. For each α value, we trained DeepGRAND-l with 10 random seeds and calculated the mean and standard deviation of the test accuracies across the seeds.

Note: For hyper-parameter tuning of DeepGRAND-l, we reused most of the settings tuned for GRAND and only tuned the hyper-parameters specific to DeepGRAND, namely the α and ϵ values in equation 9.
Summary: For both experiment sets presented in Table 1 and Table 2, we observed that our proposed dynamics achieved substantially higher test accuracies than the baseline method and is less prone to performance decay when trained with deep architectures (large depth T). Furthermore, the results in Table 3 reveal that DeepGRAND-l is also more robust when the number of labelled nodes per class is low.

5.2. DEEPGRAND IS MORE ADAPTABLE TO DEEP ARCHITECTURES

Figure 1: Effect of depth T on the performance of DeepGRAND-l and GRAND-l on the Cora, Citeseer, Pubmed, Computers, Photo and CoauthorCS benchmarks. For the first three benchmarks, we used the default Planetoid split; for the latter three, we used random splits.

We provide empirical evidence for our argument that the dynamics of DeepGRAND is more resilient at deeper architectures. Specifically, we compared the change in performance of both GRAND and DeepGRAND as the integration limit T increases. In Table 1, for each T value in {4, 16, 32, 64, 128}, we trained both methods on the Cora, Citeseer and Pubmed datasets using the default Planetoid split with 10 random initializations. For the other benchmarks (Computers, Photo and CoauthorCS), we split the datasets randomly and trained with 10 random initializations for each integration limit T in {1, 2, 4, 8, 16, 32}, recording the results in Table 2. For each initialization, we used the default label rate of 20 labelled nodes per class, as in the original GRAND implementation. On all benchmarks, we observe that our proposed dynamics performs substantially better for larger values of T, indicating less performance degradation under the effect of over-smoothing.

5.3. DEEPGRAND IS MORE RESILIENT UNDER LIMITED LABELLED TRAINING DATA

In previous experiments, we showed that DeepGRAND outperforms the baseline GRAND when trained with high integration limits. In the following experiments, we aim to demonstrate that DeepGRAND also achieves superior results with a limited number of labelled nodes per class. For all of the benchmarks (Cora, Citeseer, Pubmed, Computers, Photo and CoauthorCS), we used grid search to find the optimal T values and evaluated the performance under different numbers of labelled nodes per class. For each dataset, we experimented with 1, 2, 5, 10 and 20 labelled nodes per class and compared the test accuracies of GRAND, GRAND++ and DeepGRAND-l. On top of the GRAND variants, our results recorded in Table 3 also show that DeepGRAND outperforms common GNN architectures such as GCN, GAT and GraphSAGE. More experimental results, in which we used identical values of T to those used in GRAND, can be found in Appendix E. Even without optimal T values, DeepGRAND is still the top-performing design more often than not.

5.4. EFFECT OF α ON THE DYNAMICS OF DEEPGRAND

The exponent α defined in our DeepGRAND dynamics directly influences the ODE solver's ability to integrate. Specifically, a high α value (from 0.5 to 1.5) will likely cause the ODE solver to raise a MaxNFE error (maximum number of function evaluations exceeded), thereby harming the overall performance. In the following experiment, for each exponent value α in {0.0001, 0.001, 0.01, 0.1, 0.5}, we trained with 10 random weight initializations and increased the integration depth T from 4 to 128 to observe the behavior of the test accuracies. The results in Figure 2 suggest that the performance of DeepGRAND under smaller α values is quite stable. This gives us a rough idea of the range of α values to use when performing hyper-parameter tuning for DeepGRAND.

6. CONCLUSION

We propose DeepGRAND, a class of continuous-depth graph neural networks that leverages a novel data-dependent scaling term and a perturbation to the graph diffusivity to decrease the saturation rate of the underlying diffusion process, thus alleviating the over-smoothing issue. We also prove that the proposed method stabilizes the learning of the model by controlling the norm of the node representations. We theoretically and empirically showed its advantage over GRAND and other popular GNNs in terms of resilience to over-smoothing and overall performance. DeepGRAND is a first-order system of ODEs. It is natural to explore an extension of DeepGRAND to a second-order system of ODEs and study the dynamics of that system. It is also interesting to leverage advanced methods for improving Neural ODEs (Chen et al., 2018) to further develop DeepGRAND.

A PROOFS

Theorem 1 (Perron-Frobenius). Let M be a positive matrix. There is a positive real number r, called the Perron-Frobenius eigenvalue, such that r is an eigenvalue of M and any other (possibly complex) eigenvalue λ is strictly smaller in absolute value, |λ| < r. This eigenvalue is simple and its eigenspace is one-dimensional.

Proof of Proposition 1. Since A is right-stochastic, its Perron-Frobenius eigenvalue is α_1 = 1, and its eigenspace has the basis u_1 = (1, 1, ..., 1)^⊤. Suppose {α_1, α_2, ..., α_k} is the complex spectrum of A, i.e. the set of all complex eigenvalues of A. The matrix A − I has eigenvalues β_i = α_i − 1 for i = 1, ..., k, so β_1 = 0 and Re β_i < 0 for i = 2, ..., k. There exists a basis of C^n containing u_1 that consists of generalized eigenvectors of A. By rearranging the vectors if needed, the transition matrix P from the standard basis to this basis satisfies

P^{−1}(A − I)P = diag(0, J_1, ..., J_m),

where each J_i is a Jordan block associated with some eigenvalue β_i.

Proof of Proposition 2.
Let J = P^{−1}(A − I)P and Z(t) = P^{−1}X(t), with P as in Proposition 1. We can rewrite equation 4 as

∂Z/∂t (t) = J Z(t).

The solution to this matrix differential equation is Z(t) = exp(tJ)Z(0), where

exp(tJ) = diag(1, exp(tJ_1), exp(tJ_2), ..., exp(tJ_m)).

Let Z̄ be the matrix of the same size as Z(0) obtained by setting every entry in all but the first row of Z(0) equal to 0. We have

Z(t) − Z̄ = diag(0, exp(tJ_1), ..., exp(tJ_m)) Z(0).

Denoting the size of each Jordan block J_i by n_i, we have the well-known equality

exp(tJ_i) = e^{β_i t} U_i(t),

where U_i(t) is the upper-triangular n_i × n_i matrix with (p, q)-entry t^{q−p}/(q − p)! for q ≥ p and 0 otherwise. Hence, every non-zero entry of Z(t) − Z̄ has the form e^{β_i t} p(t) for some β_i with Re β_i < 0 and some polynomial p. Each entry can thus be bounded above in norm by an exponential term of the form c_1 e^{−c_2 t} for some c_1, c_2 > 0. A simple calculation shows that ∥X(t) − P Z̄∥_∞ = ∥P(Z(t) − Z̄)∥_∞ is also bounded by an exponential term; that is, there exist C_1, C_2 > 0 such that

∥X(t) − P Z̄∥_∞ ≤ C_1 e^{−C_2 t}. (12)

Recall that the first column of P is u_1 = (1, 1, ..., 1)^⊤, so every row of P Z̄ equals the first row of Z(0). Let v = (Z(0)_{1,1}, Z(0)_{1,2}, ..., Z(0)_{1,d})^⊤ and V = (v, v, ..., v)^⊤ = P Z̄. We can rewrite equation 12 as

∥X(t) − V∥_∞ ≤ C_1 e^{−C_2 t}.

The claim that v ∈ R^d follows from the fact that it is the limit of real-valued vectors with respect to the ∥·∥_∞ norm.

Proof of Proposition 3.
Let Y_i(t) = ∥X_i(t)∥^4. We have

Y'_i(t) = 4∥X_i(t)∥^2 ⟨X'_i(t), X_i(t)⟩
        = 4∥X_i(t)∥^2 ⟨(A − (1 + ϵ)I)X_i(t)∥X_i(t)∥^α, X_i(t)⟩
        = 4∥X_i(t)∥^{2+α} ⟨(A − (1 + ϵ)I)X_i(t), X_i(t)⟩
        = 4∥X_i(t)∥^{2+α} ⟨AX_i(t), X_i(t)⟩ − 4(1 + ϵ)∥X_i(t)∥^{4+α}.

Since A is a right-stochastic and radial matrix, we have ⟨AX_i(t), X_i(t)⟩ ≤ ∥A∥∥X_i(t)∥^2 = ∥X_i(t)∥^2 and −⟨AX_i(t), X_i(t)⟩ ≤ ∥A∥∥X_i(t)∥^2 = ∥X_i(t)∥^2, so that

−∥X_i(t)∥^2 ≤ ⟨AX_i(t), X_i(t)⟩ ≤ ∥X_i(t)∥^2.

Hence, we deduce that

−4(2 + ϵ)∥X_i(t)∥^{4+α} ≤ Y'_i(t) ≤ −4ϵ∥X_i(t)∥^{4+α}. (14)

Multiplying both sides of equation 14 by −(α/4)∥X_i(t)∥^{−4−α} = −(α/4)Y_i(t)^{−1−α/4}, and noting that −(α/4)Y'_i Y_i^{−1−α/4} = (Y_i^{−α/4})', we obtain

α(2 + ϵ) ≥ (Y_i^{−α/4})' ≥ αϵ.

Integrating from 0 to T and rearranging the appropriate terms, we get

(2 + ϵ)αT + Y_i(0)^{−α/4} ≥ Y_i(T)^{−α/4} ≥ ϵαT + Y_i(0)^{−α/4}.

Finally, noting that Y_i^{−α/4} = ∥X_i∥^{−α}, we easily obtain the bound in equation 10:

((2 + ϵ)αT + ∥X_i(0)∥^{−α})^{−1/α} ≤ ∥X_i(T)∥ ≤ (ϵαT + ∥X_i(0)∥^{−α})^{−1/α}.

B PHYSICAL INTERPRETATION OF GRAND

Informally, diffusion describes the process by which a diffusing material is transported from a region of higher density to one of lower density through random microscopic motions. A natural example is heat diffusion, which occurs when a hot object touches a cold object: heat diffuses between them until both objects are at the same temperature. GRAND approaches deep learning on graphs as a continuous diffusion process and treats GNNs as discretisations of an underlying PDE. In doing so, the dynamic in GRAND is the same as that in (heat) diffusion problems. We use this viewpoint to consider the over-smoothing issue. Consider each node in a given graph as a point in space containing some amount of thermal energy, with each pair of points connected through a heat pipe. The thermal conductivity (also known as the diffusivity) of each pipe is specific to each pair of points; a thermal conductivity of 0 represents two points that are not connected in the graph. The exact formulation of graph diffusion up to equation 4 represents a closed system; that is, the total amount of energy across all nodes is invariant in time. In node classification problems, nodes within the same class are often hypothesized to share strong connections with each other. This is known as the homophily assumption, i.e. 'like attracts like'. For example, friends are more likely to share an interest, and papers from the same research area tend to cite each other. Homophily is a key principle of many real-world networks, and its effect on GNNs has gained traction as a research direction (Zhu et al., 2020; Yan et al., 2021; Chen et al., 2022; Luan et al., 2022; Ma et al., 2022). In the context of GRAND, we can view nodes within the same class as sharing a highly conductive heat pipe. As such, thermal energy is transferred effectively between them, quickly bridging any gap in temperature.
This allows for rapid clustering of nodes into classes of different thermal energy levels, which can then be passed into a fully connected layer to perform classification. This interpretation gives a surprisingly intuitive explanation for the occurrence of the over-smoothing issue in GRAND. If the dynamic carries on for too long, the energy level of every node will exponentially converge to the average thermal energy, given that the graph is sufficiently connected. Hence, we argue that GRAND (and, more generally, any GNN based purely on diffusion) is inherently prone to over-smoothing.

C ON NEURAL ODES

Traditional neural networks such as residual networks, normalizing flows and recurrent neural networks (RNNs) learn complicated mappings via the composition of multiple transformations of the hidden states. For example, a residual network updates the hidden state using the following equation:

h_{t+1} = h_t + f(h_t, t, θ), (15)

where t ∈ {1, ..., T} is the layer index of the neural network and h_t is the hidden state at layer t. This iterative update rule can be seen as the Euler discretisation of a continuous transformation. Neural ODEs (Chen et al., 2018) are a class of continuous-depth (or continuous-time) neural networks in which the transformation step is infinitesimally small. Specifically, the hidden state h_t is parameterized by a continuous dynamic with respect to time t:

dh_t/dt = f(h_t, t, θ), (16)

where f(h_t, t, θ) is specified by a neural network parameterized by θ. Starting from the initial state h(0), Neural ODEs learn the final representation h(T) by solving equation 16 with a numerical integrator (often an adaptive step-size solver, or adaptive solver for short) given an error tolerance. Integrating equation 16 from 0 to T in a single forward pass requires the adaptive solver to evaluate f(h_t, t, θ) at multiple time-steps; the computational complexity of the forward pass is determined by the number of function evaluations. The difficulty with updating the hidden state by solving equation 16 numerically is that the black-box ODE solver is not a differentiable operation, so the usual back-propagation method for optimizing traditional neural networks does not work for Neural ODEs. The adjoint sensitivity method (or adjoint method) is a memory-efficient alternative to traditional back-propagation for training Neural ODEs. We denote by h(T) the prediction of the Neural ODE and by L the loss between h(T) and the ground truth.
Then, defining the adjoint state as a(t) = ∂L/∂h(t), we have

dL/dθ = ∫_0^T a(t)^⊤ (∂f(h(t), t, θ)/∂θ) dt,   (17)

where the adjoint state a(t) satisfies the following dynamics:

da(t)/dt = −a(t)^⊤ ∂f(h(t), t, θ)/∂h(t).   (18)

Since the adjoint state in equation 18 can be solved numerically using an ODE solver, the gradient of the loss function L with respect to θ in equation 17 can be evaluated. The computational complexity of the backward pass is determined by the number of function evaluations used by the ODE solver to solve for the adjoint state.
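The forward Euler integration and the adjoint dynamics above can be sketched in NumPy for the linear vector field f(h, t, θ) = W h with loss L = ½∥h(T)∥². This toy setup (the matrix W and the quadratic loss are hypothetical choices made for the sketch, not part of DeepGRAND) lets the adjoint gradient dL/dW = ∫ a(t) h(t)^⊤ dt be checked against finite differences:

```python
import numpy as np

def euler_forward(W, h0, T, n_steps):
    """Integrate dh/dt = W h with forward Euler, storing the trajectory."""
    dt = T / n_steps
    hs = [h0]
    h = h0
    for _ in range(n_steps):
        h = h + dt * (W @ h)
        hs.append(h)
    return hs

def adjoint_grad(W, h0, T, n_steps):
    """Gradient of L = 0.5*||h(T)||^2 w.r.t. W via the adjoint method."""
    dt = T / n_steps
    hs = euler_forward(W, h0, T, n_steps)
    a = hs[-1].copy()                     # a(T) = dL/dh(T) = h(T)
    grad = np.zeros_like(W)
    for k in range(n_steps, 0, -1):
        grad += dt * np.outer(a, hs[k])   # dL/dW = ∫ a(t) h(t)^T dt
        a = a + dt * (W.T @ a)            # da/dt = -W^T a, run backwards
    return grad

rng = np.random.default_rng(0)
W = 0.3 * rng.normal(size=(3, 3))
h0 = rng.normal(size=3)
g_adj = adjoint_grad(W, h0, T=1.0, n_steps=2000)

# Finite-difference check of the same (discretised) loss.
def loss(Wp):
    return 0.5 * np.sum(euler_forward(Wp, h0, 1.0, 2000)[-1] ** 2)

eps = 1e-5
g_fd = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = eps
        g_fd[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)
```

Note that the adjoint pass only needs the stored (or recomputed) trajectory and one backward ODE solve, which is the source of the method's memory efficiency.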

D EULER DISCRETIZATION OF GRAND ALSO SUFFERS FROM OVER-SMOOTHING

Recall that the forward Euler discretization of the GRAND dynamic was given by Thorpe et al. (2022) as

X(kδ_t) = X((k − 1)δ_t) + δ_t (A − I) X((k − 1)δ_t),   (19)

where 1 > δ_t > 0 is the fixed step size, k = 1, 2, . . . , K denotes the layers from 1 to K, and X_k := X(kδ_t) is the node feature at the k-th layer. We mirror the analysis in Section 3 almost completely; some calculations are omitted for clarity. First, we give a definition for discrete GNNs analogous to Definition 1.

Definition 2. Let X_k ∈ ℝ^{n×d} denote the feature representation at the k-th layer of some discrete GNN dynamic. (X_k) is said to experience over-smoothing if there exist a vector v ∈ ℝ^d and constants C_1, C_2 > 0 such that, for V = (v, v, . . . , v)^⊤,

∥X_k − V∥_∞ ≤ C_1 e^{−C_2 k}.

Utilising Proposition 1, we can show:

Proposition 4. With the dynamic given by equation 5 and equation 19, (X_k) experiences over-smoothing.

Proof. Let J = P^{−1}AP and Z_k = P^{−1}X_k as in Proposition 1. We can rewrite equation 19 as

Z_k = Z_{k−1} + δ_t (J − I) Z_{k−1} = ((1 − δ_t)I + δ_t J) Z_{k−1} = ((1 − δ_t)I + δ_t J)^k Z_0,

where

((1 − δ_t)I + δ_t J)^k = diag(1, ((1 − δ_t)I + δ_t J_1)^k, . . . , ((1 − δ_t)I + δ_t J_m)^k).

Let Z be the matrix of the same size as Z_0 obtained by setting every entry in all but the first row of Z_0 equal to 0. We have

Z_k − Z = diag(0, ((1 − δ_t)I + δ_t J_1)^k, . . . , ((1 − δ_t)I + δ_t J_m)^k) Z_0.

Each block (1 − δ_t)I + δ_t J_i has spectral radius less than 1. Hence, every non-zero entry of Z_k − Z can be bounded above in norm by an exponential term of the form c_1 e^{−c_2 k} for some c_1, c_2 > 0. We deduce that there exist C_1, C_2 > 0 such that

∥X_k − PZ∥_∞ ≤ C_1 e^{−C_2 k}.   (22)

Recall that the first column of P is u = (1, 1, . . . , 1)^⊤. Set v to be the first row of Z_0, so that PZ = V. We can rewrite equation 22 as

∥X_k − V∥_∞ ≤ C_1 e^{−C_2 k}.
The claim that v ∈ ℝ^d follows from the fact that it is the limit of real-valued vectors with respect to the ∥·∥_∞ norm.
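The exponential collapse established by Proposition 4 is easy to reproduce numerically. The following minimal NumPy sketch iterates the discrete dynamic X_k = X_{k−1} + δ_t (A − I) X_{k−1} for a random positive right-stochastic matrix A (a hypothetical stand-in for the learned attention matrix) and tracks the spread of node features across nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4

# Random positive right-stochastic matrix (rows sum to 1), standing in
# for the attention matrix A.
A = rng.random((n, n))
A = A / A.sum(axis=1, keepdims=True)

X = rng.normal(size=(n, d))
delta_t = 0.5                       # fixed step size, 0 < delta_t < 1
spreads = []
for k in range(400):
    X = X + delta_t * (A - np.eye(n)) @ X
    # max over features of the per-feature spread across nodes
    spreads.append(np.max(X.max(axis=0) - X.min(axis=0)))
```

The spread decays geometrically toward 0, i.e. every row of X_k converges to a common vector v, which is over-smoothing in the sense of Definition 2.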

E DEEPGRAND PERFORMANCE ON DIFFERENT NUMBERS OF LABELLED NODES PER CLASS

In this section, we provide additional experimental results to complement Table 3. Apart from the test accuracies of DeepGRAND run with optimal T values, we add a column showing DeepGRAND's results when trained with the same T values as were used in the GRAND paper (Chamberlain et al. (2021b)). The experiments suggest that, with the same T values, DeepGRAND already outperforms GRAND on most of the benchmarks with a limited number of labeled nodes. We also observe that, with optimal T values, DeepGRAND has much lower variance in test accuracy compared to all other designs.



Figure 2: The change in test accuracies of DeepGRAND-l as the depth T increases under different α values. Left - effect of α on the accuracies on Cora. Right - effect of α on the accuracies on Citeseer. Each α value is trained on 10 random seeds for both benchmarks.

most notable research fields or studies. This graph consists of 18333 nodes and 163788 edges with 15 different classes. Computers (McAuley et al., 2015): A co-purchase graph extracted from Amazon, where each node represents a product and the edges represent co-purchase relations. The node features are bag-of-words vectors extracted from the product reviews. The dataset consists of 13752 nodes with 491722 edges. All the nodes are classified into 10 different product categories. Photo (McAuley et al., 2015): Similar to Computers, the node features in the Photo dataset are bag-of-words vectors derived from the product reviews, and nodes (products) are connected if frequently co-purchased. The dataset consists of 7487 nodes with 119043 edges. The nodes are classified into 8 classes representing 8 product categories.



The means and standard deviations of test accuracies of DeepGRAND-l and variants of GRAND experimented with different depths on the Computers, Photo and CoauthorCS datasets under random split. NA: ODE solver failed. (Note: The results for GRAND++-l were imported from Thorpe et al. (2022)).

The means and standard deviations of test accuracies of DeepGRAND-l, variants of GRAND, and other common GNNs experimented with different numbers of labelled nodes per class. Best results are written in bold. (Note: The results for GRAND++-l, GCN, GAT and GraphSage were imported from Thorpe et al. (2022)).

The means and standard deviations of test accuracies of DeepGRAND-l and other variants of GRAND experimented with different numbers of labelled nodes per class. Best results are written in bold. (Note: The results for GRAND++-l were imported from Thorpe et al. (2022)).

T values of DeepGRAND used for different benchmarks.

Appendix for "DeepGRAND: Deeper Graph Neural Diffusion"

A TECHNICAL PROOFS

Consider the dynamic given by equation 3 and equation 4. We say that a matrix is positive if and only if all of its entries are positive. Clearly, A is a positive right-stochastic matrix. We recall the well-known Perron-Frobenius theorem

