GRAPH NEURAL NETWORKS AS GRADIENT FLOWS: UNDERSTANDING GRAPH CONVOLUTIONS VIA ENERGY Anonymous

Abstract

Gradient flows are differential equations that minimize an energy functional and constitute the main descriptors of physical systems. We apply this formalism to Graph Neural Networks (GNNs) to develop new frameworks for learning on graphs as well as to provide a better theoretical understanding of existing ones. We derive GNNs as a gradient flow equation of a parametric energy that provides a physics-inspired interpretation of GNNs as learning particle dynamics in the feature space. In particular, we show that in graph convolutional models (GCN), the positive/negative eigenvalues of the channel-mixing matrix correspond to attractive/repulsive forces between adjacent features. We rigorously prove how the channel mixing can learn to steer the dynamics towards low or high frequencies, which allows the model to deal with heterophilic graphs. We show that the same class of energies is decreasing along a larger family of GNNs; albeit not gradient flows, they retain their inductive bias. We experimentally evaluate an instance of the gradient flow framework that is principled, more efficient than GCN, and achieves competitive performance on graph datasets of varying homophily, often outperforming recent baselines specifically designed to target heterophily.

1. INTRODUCTION

Graph neural networks (GNNs) (Sperduti, 1993; Goller & Kuchler, 1996; Gori et al., 2005; Scarselli et al., 2008; Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2017; Battaglia et al., 2016; Gilmer et al., 2017) have become the standard ML tool for dealing with different types of relations and interactions. Limitations of GNNs that have recently attracted attention in the literature are over-smoothing (node features becoming increasingly similar with the depth of the model, see Nt & Maehara (2019); Oono & Suzuki (2020); Cai & Wang (2020); Zhou et al. (2021)), over-squashing (the difficulty of message passing to propagate information on the graph, see Alon & Yahav (2021); Topping et al. (2022)), and poor performance on heterophilic data (i.e. where adjacent nodes tend to have different labels, see Pei et al. (2020); Zhu et al. (2020); Bo et al. (2021); Yan et al. (2021)).

General motivations and contributions. In the spirit of neural ODEs (Haber & Ruthotto, 2018; Chen et al., 2018), we regard (residual) GNNs as discrete dynamical systems. A fundamental idea in physics is that particles evolve by minimizing an energy: one can then study the dynamics through the functional expression of the energy. The class of differential equations that minimize an energy are called gradient flows, and their extension and analysis in the context of GNNs represent the main focus of this work. We study two ways of understanding the dynamics induced by GNNs: starting from the energy functional or from the evolution equations.

From energy to evolution equations: a new conceptual approach to GNNs. We propose a general framework where one parameterises an energy functional and then takes the GNN equations to follow the direction of steepest descent of such energy.
We introduce a class of energy functionals that extend those adopted for label propagation (Zhou & Schölkopf, 2005) and whose gradient flow equations consist of generalized graph convolutions (GCN-type architectures (Kipf & Welling, 2017)) with symmetric weights. We provide a physical interpretation of GNNs as multi-particle dynamics: this new framework sheds light on the role of the 'channel-mixing' matrix used in graph convolutional models as an edge-wise potential inducing attraction (repulsion) via its positive (negative) eigenvalues. We conduct a theoretical analysis of the dynamics, including explicit expansions of the GNN learned features, showing that, differently from other continuous models, the gradient flow can learn to magnify either the low or the high frequencies. This also establishes new links to techniques like residual connections and negative edge weights that have been previously used in heterophilic settings. We experimentally evaluate our framework using gradient flow equations, yielding a principled variant of GCNs that is also more efficient due to weight symmetry and sharing across layers. Our experiments demonstrate competitive performance on homophilic and heterophilic graphs of varying size.

From evolution equations to energy: understanding graph convolutions via multi-particle dynamics. Recent works of Cai & Wang (2020); Bodnar et al. (2022) studied the behaviour of the Dirichlet energy in graph convolutional models in order to determine whether (over)smoothing of the features is occurring. The key idea is that the monotonicity of an energy functional along a system of equations conveys significant information about the dynamics, both in terms of its dominating effects and its limit points. However, these results are restricted to the classical Dirichlet energy and assume (non-residual) graph convolutions activated by the ReLU nonlinearity.
We extend this approach by proving that a much more general multi-particle energy is in fact decreasing along residual graph convolutions with symmetric weights and with respect to a more general class of nonlinear activation functions. Our result sheds light on the dynamics of non-linear graph convolutions, showing that the 'channel-mixing' matrix used in GCN-type models can be interpreted as a potential in feature space that promotes alignment (repulsion) of adjacent node features depending on its spectrum.

Outline. In Section 2 we review non-parametric instances of gradient flows on graphs: the heat equation and label propagation. In Section 3 we extend this approach to the parametric case by introducing a class of energies that generalize the one used for label propagation and whose associated gradient flows are continuous graph convolutions. We provide a physical interpretation of the energy, showing that it can induce attraction and repulsion along edges. In Section 4 we discretize the gradient flow into GNN update equations and derive explicit expansions of the learned node representations, highlighting how the spectrum of the channel-mixing W controls whether the dynamics is dominated by the low or high frequencies of the graph Laplacian. To our knowledge, ours is the first analysis that studies the interplay of the spectral properties of the graph Laplacian and the channel-mixing matrix. In Section 5 we extend the theory by showing that the same multi-particle energy introduced in Section 3 still decreases along more general graph convolutions with symmetric weights, meaning that the physics interpretation is preserved. In Section 6 we evaluate the framework for node classification on a broad range of datasets.

Related work. Our analysis is related to studying GNNs as filters (Defferrard et al., 2016; Hammond et al., 2019; Balcilar et al., 2020; He et al., 2021) and adopts techniques similar to Oono & Suzuki (2020); Cai & Wang (2020).
Gradient flows were adapted from geometry (Eells & Sampson, 1964) to image processing (Kimmel et al., 1997), label propagation (Zhou & Schölkopf, 2005) and recently in ML (Sander et al., 2022) for the analysis of Transformers (Vaswani et al., 2017). Our work follows the spirit of GNNs as continuous dynamical systems (Xhonneux et al., 2020; Zang & Wang, 2020; Chamberlain et al., 2021a; Eliasof et al., 2021; Chamberlain et al., 2021b; Bodnar et al., 2022; Rusch et al., 2022).

Notations. Let G = (V, E) be an undirected graph with n nodes. Its adjacency matrix A is defined as a_ij = 1 if (i, j) ∈ E and zero otherwise. We let D = diag(d_i) be the degree matrix and define the normalized adjacency Ā := D^{-1/2} A D^{-1/2}. We denote by F ∈ R^{n×d} the matrix of d-dimensional node features, by f_i ∈ R^d its i-th row (transposed), by f^r ∈ R^n its r-th column, and by vec(F) ∈ R^{nd} the vectorization of F obtained by stacking its columns. Given a symmetric matrix B, we let λ_+^B and λ_-^B denote its most positive and most negative eigenvalues, respectively, and ρ_B its spectral radius. Ḟ(t) denotes the temporal derivative, ⊗ is the Kronecker product, and 'a.e.' means almost everywhere w.r.t. the Lebesgue measure. Proofs and additional results appear in the Appendix.
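The vectorization identity vec(ĀFW) = (W ⊗ Ā)vec(F) for symmetric W, used repeatedly in the following sections, can be checked numerically. This is an illustrative sketch (not part of the paper's code); all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3

# Random undirected graph: symmetric 0/1 adjacency without self-loops.
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(axis=1), 1)              # guard against isolated nodes
D_inv_sqrt = np.diag(deg ** -0.5)
A_bar = D_inv_sqrt @ A @ D_inv_sqrt             # Ā = D^{-1/2} A D^{-1/2}

F = rng.standard_normal((n, d))
W = rng.standard_normal((d, d)); W = (W + W.T) / 2   # symmetric channel mixing

# vec() stacks columns, matching the paper's convention.
vec = lambda M: M.reshape(-1, order="F")

lhs = vec(A_bar @ F @ W)
rhs = np.kron(W, A_bar) @ vec(F)                # (W ⊗ Ā) vec(F); here W^T = W
assert np.allclose(lhs, rhs)
```

The general identity is vec(AFB) = (B^T ⊗ A)vec(F); symmetry of W is what lets the paper write W rather than W^T in the Kronecker factor.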

2. GRADIENT FLOWS ON GRAPHS: THE NON-PARAMETRIC CASE

In this Section we review important concepts on graphs and two examples of non-parametric gradient flows that partly motivate our approach to the GNN framework.

What is a gradient flow? Consider an N-dimensional dynamical system governed by the evolution equation Ḟ(t) = ODE(F(t)) that evolves some initial state F(0) for time t ≥ 0. In deep learning, the discretisation of such differential equations using the Euler method allows one to draw a parallel between the iterations of a numerical solver and the layers of a neural network (Haber & Ruthotto, 2018; Chen et al., 2018). We say that the evolution equation is a gradient flow if there exists an energy functional E : R^N → R such that ODE(F(t)) = -∇E(F(t)). Since Ė(F(t)) = -||∇E(F(t))||^2, the energy E is decreasing along the solution F(t) of such equations. The existence of E and the knowledge of its functional expression allow for a better understanding of the dynamical system.

A prototypical gradient flow: heat equation. Let F ∈ R^{n×d} be the matrix representation of vector features assigned to each node in G. Its graph gradient is defined edge-wise as (∇F)_ij := f_j/√d_j - f_i/√d_i. We can then set the Laplacian as ∆ := -div∇/2 (the divergence div is the adjoint of ∇), represented by ∆ = I - Ā ⪰ 0. We refer to the eigenvalues of ∆ as frequencies: the lowest frequency is always 0, while the highest frequency is ρ_∆ ≤ 2 (Chung & Graham, 1997). The heat equation on each channel is the system

ḟ^r(t) = -∆f^r(t), for 1 ≤ r ≤ d.  (1)

This is an example of gradient flow: if we stack the columns of F into vec(F) ∈ R^{nd}, we can rewrite the heat equation as vec(Ḟ(t)) = -∇E_Dir(vec(F(t))), where E_Dir : R^{nd} → R is the (graph) Dirichlet energy defined by (Zhou & Schölkopf, 2005)

E_Dir(F) := (1/4) Σ_{(i,j)∈E} ||(∇F)_ij||^2 = (1/2) trace(F^T ∆F) = (1/2) ⟨vec(F), (I_d ⊗ ∆)vec(F)⟩.  (2)

E_Dir measures the smoothness of the signal, since it accounts for the variations of F (i.e., its gradient) on the edges.
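As an illustrative sketch (not from the paper), the trace form of Equation (2) and the gradient-flow property of the heat equation can be verified numerically: Euler steps of Equation (1) monotonically decrease E_Dir.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 2

# Build Ā and the normalized Laplacian Δ = I − Ā on a random undirected graph.
A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)
A_bar = np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)
Delta = np.eye(n) - A_bar

def dirichlet(F):
    # E_Dir(F) = 1/2 trace(F^T Δ F), as in Equation (2)
    return 0.5 * np.trace(F.T @ Delta @ F)

# Explicit Euler steps of the heat equation ḟ^r = −Δ f^r.
F = rng.standard_normal((n, d))
tau, energies = 0.1, [dirichlet(F)]
for _ in range(50):
    F = F - tau * Delta @ F
    energies.append(dirichlet(F))

# The Dirichlet energy is non-increasing along the discretized flow.
assert all(e1 <= e0 + 1e-12 for e0, e1 in zip(energies, energies[1:]))
assert energies[-1] < energies[0]
```

Since the eigenvalues of Δ lie in [0, ρ_Δ] with ρ_Δ ≤ 2, any step size τ < 1 makes the Euler update a contraction on all non-zero frequency components, which is why the monotone decrease survives discretization here.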
The evolution of F(t) by the heat equation (1) decreases the Dirichlet energy E_Dir(F(t)); in the limit, E_Dir(F(t → ∞)) = 0, attained by the projection of the initial state F(0) onto ker(∆).

A more general gradient flow: label propagation. Assume we have a graph G, node features F^0 and labels {y_i} on V_train ⊂ V, and that we want to predict the labels on V_test ⊂ V. Zhou & Schölkopf (2005) proposed label propagation (LP), where one first extends the input labels Y(0) outside the training set as y_i(0) = 0 for each i ∈ V \ V_train and then solves the equation Ẏ(t) = -∆Y(t) - 2µ(Y(t) - Y(0)). This is another example of gradient flow; in fact, Zhou & Schölkopf (2005) originally introduced the following energy and then derived the aforementioned update in order to minimize it:

Ẏ(t) = -∇E_LP(Y(t)), E_LP(Y) := E_Dir(Y) + µ||Y - Y(0)||^2.  (3)

The prediction is then attained by the signal that minimizes both E_Dir, which enforces smoothness, and the fitting term arising from the available labels (a form of soft boundary conditions).

Motivations and goals. In graph ML problems, we often also have node features that can be leveraged for the label prediction. Our goal is to extend the gradient flow formalism from the non-parametric case (heat equation and label propagation) to a deep learning setting, where we (i) parameterise an energy functional and let the GNN equations be the associated gradient flow, and (ii) investigate when existing GNNs admit an energy that is decreasing along their evolution equations.
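A minimal sketch of LP as the discretized gradient flow of E_LP in Equation (3); the graph, labels and parameter values below are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)
Delta = np.eye(n) - np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)

def E_lp(Y, Y0, mu):
    # E_LP(Y) = E_Dir(Y) + μ ||Y − Y(0)||^2
    return 0.5 * np.trace(Y.T @ Delta @ Y) + mu * np.sum((Y - Y0) ** 2)

# One-hot labels on a training subset, zero elsewhere (the LP initialization).
Y0 = np.zeros((n, 2))
Y0[0, 0] = Y0[1, 1] = 1.0            # two labelled nodes, two classes
Y, mu, tau = Y0.copy(), 0.5, 0.05
for _ in range(200):
    # Ẏ = −ΔY − 2μ(Y − Y(0)), i.e. explicit Euler on the gradient flow of E_LP.
    Y = Y - tau * (Delta @ Y + 2 * mu * (Y - Y0))

assert E_lp(Y, Y0, mu) <= E_lp(Y0, Y0, mu) + 1e-12
pred = Y.argmax(1)                    # predicted class per node
```

The step size τ = 0.05 is well below 2/L for this quadratic objective (its Hessian eigenvalues are bounded by ρ_Δ + 2μ ≤ 3), so the discrete iteration decreases E_LP monotonically.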

3. GRADIENT FLOWS ON GRAPHS: THE PARAMETRIC CASE

We can think of a (residual) graph neural network as a parametric evolution equation Ḟ(t) = GNN_θ(t)(G, F(t)) discretized using the Euler method with fixed time step 0 < τ ≤ 1:

F(t + τ) = F(t) + τ GNN_θ(t)(G, F(t)).  (4)

Each iteration corresponds to a GNN layer, which in general can have a different set of parameters θ(t). We choose GNN_θ to be the gradient flow of some parametric class of energies E_θ : R^{n×d} → R generalizing E_Dir, resulting in feature evolution by Ḟ(t) = -∇E_θ(F(t)) starting from the input features F(0), with {θ} learned via backpropagation on the task loss function. This approach extends the LP technique to a framework where the parameters we learn can be interpreted as 'finding the right notion of smoothness' for our task. In fact, minimizing E_LP as in Equation (3) works only if the labels are smooth, an assumption known as homophily. We investigate how learning a more general energy yields gradient flow GNNs that can also perform well on heterophilic data.

3.1. ENERGIES GIVING RISE TO GRAPH-CONVOLUTIONAL MODELS

Similarly to the LP approach in Equation (3), our first step consists in choosing a parametric class of energy functionals {E_θ} giving rise to the GNN equations via gradient flow. GNNs of the convolutional flavour (Bronstein et al., 2021) evolve the features via Equation (4) using some parametric rule GNN_θ(G, F^0) typically consisting of two operations: applying a shared linear transformation to the features ('channel mixing') and propagating them along the edges ('diffusion'). Accordingly, we introduce the class of (generalized) graph convolutions

F(t + τ) = F(t) + τσ(-F(t)Ω_t + ĀF(t)W_t - F(0)W̃_t),  (5)

where the learnable parameters {θ(t)} are the d × d weight matrices Ω_t, W_t, and W̃_t acting on each node feature vector independently and performing channel mixing; the normalized adjacency Ā performs the diffusion of features from adjacent nodes. The setting τ = 1, no residual connection, and Ω_t = W̃_t = 0 corresponds to GCN (Kipf & Welling, 2017). The case Ω_t ≠ 0 results in an anisotropic instance of GraphSAGE (Hamilton et al., 2017), while choosing Ω_t = 0 and W_t and W̃_t as convex combinations with the identity recovers GCNII (Chen et al., 2020). We consider a class of energies {E_θ} consisting of quadratic terms

E_θ(F) = (1/2) Σ_i ⟨f_i, Ωf_i⟩ - (1/2) Σ_{i,j} ā_ij ⟨f_i, Wf_j⟩ + φ^0(F, F(0)),  (6)

where the three terms play the role of an external energy E^ext_Ω, a pairwise energy E^pair_W, and a source energy E^source_{φ^0}, respectively, parameterised by d × d weight matrices Ω, W. We motivate our choice by first recovering the non-parametric cases of Section 2. If Ω = W = I_d and φ^0 = 0, then E_θ = E_Dir as per Equation (2); choosing φ^0 as an L^2-penalty gives E_θ = E_LP as per Equation (3). We can also recover manifold harmonic energies applied to graphs (see Appendix B.5). More importantly, if φ^0(F, F(0)) = Σ_i ⟨f_i, W̃f_i(0)⟩, for W̃ ∈ R^{d×d}, we can rewrite

E_θ(F) = ⟨vec(F), (1/2)(Ω ⊗ I_n - W ⊗ Ā)vec(F) + (W̃ ⊗ I_n)vec(F(0))⟩  (7)

and then derive its gradient flow as

Ḟ(t) = -∇_F E_θ(F(t)) = -F(t)(Ω + Ω^T)/2 + ĀF(t)(W + W^T)/2 - F(0)W̃.  (8)
Since Ω, W appear in Equation (8) in a symmetrized way, without loss of generality we can assume Ω and W to be symmetric d × d channel-mixing matrices. Therefore, Equation (8) simplifies as

Ḟ(t) = -F(t)Ω + ĀF(t)W - F(0)W̃.  (9)

Thus, a quadratic energy as in Equation (6) leads to continuous linear graph convolutions with symmetric weights shared over time. Equivalently, for generalized graph convolutions to fit the gradient flow formalism, the channel-mixing matrices must be symmetric. Importantly, while reducing the number of parameters and offering a gradient flow interpretation of the GNN, this symmetry constraint does not diminish the expressive power (Hu et al., 2019). Next, we show that E_θ has a simple interpretation in terms of pairwise forces among adjacent features.
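The claim behind Equation (9), namely that the convolution right-hand side equals -∇E_θ, can be checked against a finite-difference gradient of E_θ. An illustrative sketch, not the authors' code; for simplicity we take W̃ symmetric as well, so that the source gradient is exactly F(0)W̃.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 3

A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)
A_bar = np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)

# Symmetric channel-mixing matrices and a (here symmetric) source weight.
Om = rng.standard_normal((d, d)); Om = (Om + Om.T) / 2
W = rng.standard_normal((d, d)); W = (W + W.T) / 2
Wt = rng.standard_normal((d, d)); Wt = (Wt + Wt.T) / 2
F0 = rng.standard_normal((n, d))

def energy(F):
    # E_θ(F) = 1/2 Σ_i <f_i, Ω f_i> − 1/2 Σ_ij ā_ij <f_i, W f_j> + Σ_i <f_i, W̃ f_i(0)>
    return (0.5 * np.sum(F * (F @ Om))
            - 0.5 * np.sum(F * (A_bar @ F @ W))
            + np.sum(F * (F0 @ Wt)))

F = rng.standard_normal((n, d))
rhs = -F @ Om + A_bar @ F @ W - F0 @ Wt   # right-hand side of Equation (9)

# Central finite-difference gradient of E_θ, entry by entry.
eps, G = 1e-6, np.zeros_like(F)
for idx in np.ndindex(F.shape):
    E1, E2 = F.copy(), F.copy()
    E1[idx] += eps; E2[idx] -= eps
    G[idx] = (energy(E1) - energy(E2)) / (2 * eps)

assert np.allclose(rhs, -G, atol=1e-5)    # Ḟ = −∇E_θ
```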

3.2. ATTRACTION AND REPULSION: A PHYSICS-INSPIRED FRAMEWORK

Why gradient flows? A multi-particle point of view. Consider the node features as particles in R^d with energy E_θ. The first term E^ext_Ω is independent of the pairwise interactions and hence represents an 'external' energy in the feature space. The second term E^pair_W instead accounts for pairwise interactions along edges via the symmetric matrix W and hence represents an 'internal' energy. We set the source term φ^0 to zero and write W = Θ_+^T Θ_+ - Θ_-^T Θ_-, by decomposing it into components with positive and negative eigenvalues. We can then rewrite E_θ in Equation (6) as

E_θ(F) = (1/2) Σ_i ⟨f_i, (Ω - W)f_i⟩ + (1/4) Σ_{i,j} ||Θ_+(∇F)_ij||^2 - (1/4) Σ_{i,j} ||Θ_-(∇F)_ij||^2,  (10)

where the three terms act as dampening, attraction, and repulsion, respectively; we derive this decomposition in Appendix B. To understand the dynamics induced by the minimization of E_θ by the gradient flow (9), recall that the edge gradient (∇F)_ij measures the difference between the features f_i and f_j. We note that: (i) if Ω commutes with W, then the projections of (∇F)_ij onto ker(W) are preserved along the gradient flow; (ii) the channel-mixing W encodes attractive edge-wise interactions via its positive-definite component Θ_+, since the gradient terms ||Θ_+(∇F)_ij|| decrease along the solution of Equation (9), resulting in a smoothing effect where adjacent node features f_i and f_j are 'aligned'; (iii) the channel-mixing W encodes repulsive edge-wise interactions via its negative-definite component Θ_-, since the gradient terms ||Θ_-(∇F)_ij|| increase along the solution of Equation (9), resulting in a sharpening effect, which could be desirable on heterophilic graphs where we need to disentangle adjacent node representations. Next, we formalize the smoothing vs sharpening effects by introducing a new quantity to monitor along a GNN to assess whether the latter is magnifying the low or high frequencies.

Low vs high frequency enhancement.
Attractive forces minimize the edge gradients and are associated with smoothing effects which magnify the low frequencies, while repulsive forces increase the edge gradients and hence afford a sharpening action enhancing the high frequencies. Since we are interested in finding which frequency is dominating the dynamics, we monitor the Dirichlet energy along the normalized solution, E_Dir(F(t)/||F(t)||): we call the dynamics low-frequency-dominant (LFD) if this quantity vanishes as t → ∞, and high-frequency-dominant (HFD) if it approaches its maximal value ρ_∆/2. In the homophilic case, the low-frequency components of the features are typically the most useful for prediction (Wu et al., 2019; Klicpera et al., 2019). In the opposite case of heterophily, the high-frequency components might contain more relevant information for separating classes (Bo et al., 2021), the classical example being the eigenvector of ∆ with largest frequency ρ_∆ separating a bipartite graph. Accordingly, an ideal framework for learning on graphs must at least accommodate both of these opposite scenarios by being able to induce either an LFD or an HFD dynamics. We can now investigate the gradient flow equations of the energy in Equation (10).

Theorem 3.2 (Informal). The continuous gradient flow in Equation (9) can learn to be either LFD (mostly edge-wise attractive) or HFD (mostly edge-wise repulsive) depending on the spectrum of W. (A precise result, along with convergence rates, is stated as Theorem B.3 in the Appendix.)

Informally, Theorem 3.2 shows that the gradient flow in Equation (9) is expressive enough to induce repulsion along edges if needed, as expected based on the decomposition of E_θ in Equation (10). As argued above, a dynamical system that can never be HFD might instead struggle on heterophilic graphs, where the feature signal needs to be sharpened rather than smoothed out. The property that the gradient flow can be HFD is non-trivial; in fact, some continuous-time GNNs can never induce an HFD dynamics. We refer to Theorem B.4 for a statement including convergence rates and over-smoothing results.
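The LFD/HFD dichotomy of Theorem 3.2 can be illustrated numerically (a sketch under our own toy setup, not from the paper): running the linear discrete flow with a sign-definite W and tracking E_Dir of the normalized features recovers the two regimes.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, tau, steps = 8, 3, 0.2, 1000

A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)
A_bar = np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)
Delta = np.eye(n) - A_bar
rho = np.linalg.eigvalsh(Delta).max()          # largest frequency ρ_Δ

def normalized_dirichlet(F):
    F = F / np.linalg.norm(F)
    return 0.5 * np.trace(F.T @ Delta @ F)     # lies in [0, ρ_Δ/2]

def run(W):
    F = rng.standard_normal((n, d))
    for _ in range(steps):
        F = F + tau * A_bar @ F @ W            # residual convolution, no Ω, no source
        F = F / np.linalg.norm(F)              # renormalize to avoid overflow
    return normalized_dirichlet(F)

W0 = rng.standard_normal((d, d))
e_attract = run(W0 @ W0.T)                     # positive semi-definite W: attraction
e_repulse = run(-(W0 @ W0.T))                  # negative semi-definite W: repulsion

assert e_attract < 0.05                        # LFD: normalized E_Dir near 0
assert e_repulse > 0.75 * rho / 2              # HFD: normalized E_Dir near ρ_Δ/2
```

The per-step renormalization only rescales the (linear) dynamics, so it does not change which frequency dominates; it merely keeps the iterates bounded.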
Message of Section 3: We introduced an energy E θ allowing to learn attractive/repulsive forces along edges via the spectrum of the channel-mixing inducing an LFD/HFD dynamics as per Theorem 3.2. We argue that energies rather than evolution equations should be the object to parameterise for deriving more principled GNNs that are easier to interpret and analyse. This is studied next.

4. FROM ENERGY TO EVOLUTION EQUATIONS: GNNS AS GRADIENT FLOWS

In order to connect our theory to practice, we discretize Equation (9) as in Equation (4), replacing continuous time by fixed steps corresponding to GNN layers.

4.1. DISCRETE GRADIENT FLOWS AND SPECTRAL ANALYSIS

As in Equation (4), we use the Euler scheme with step size τ to solve Equation (9). In our framework we parameterise the energy rather than the equations, which leads to symmetric channel-mixing matrices Ω, W ∈ R^{d×d}. The use of the explicit Euler discretization yields a residual architecture:

F(t + τ) = F(t) + τ(-F(t)Ω + ĀF(t)W - F(0)W̃), F(0) = ψ_EN(F^0).  (11)

Interaction between the graph and channel-mixing spectra. We restrict our theoretical analysis to the gradient flows in Equation (11) where we remove dampening and source term effects (i.e., Ω = W̃ = 0, which corresponds to a residual GCN). Our technique consists in vectorizing the solution F(t) → vec(F(t)) and rewriting the update as vec(F(t + τ)) = vec(F(t)) + τ(W ⊗ Ā)vec(F(t)) (see Appendix A.2 for details). In particular, once we choose bases {ϕ_r^W} and {ϕ_ℓ^∆} of orthonormal eigenvectors for W and ∆ respectively, we can write the solution after m layers explicitly:

vec(F(mτ)) = Σ_{r=1}^{d} Σ_{ℓ=0}^{n-1} (1 + τλ_r^W(1 - λ_ℓ^∆))^m c_{r,ℓ}(0) ϕ_r^W ⊗ ϕ_ℓ^∆,  (12)

where c_{r,ℓ}(0) := ⟨vec(F(0)), ϕ_r^W ⊗ ϕ_ℓ^∆⟩. We see that the interaction of the spectra {λ_r^W} and {λ_ℓ^∆} is the 'driving' factor for the dynamics, with positive (negative) eigenvalues of W magnifying the frequencies λ_ℓ^∆ < 1 (λ_ℓ^∆ > 1, respectively). In the following we let λ_±^W denote the most positive/negative eigenvalue of W, with associated eigenvectors ϕ_±^W. Note that ϕ_{n-1}^∆ is the Laplacian eigenvector associated with the largest frequency ρ_∆. We now consider the following condition:

λ_+^W(ρ_∆ - 1)^{-1} < |λ_-^W| < 2(τ(2 - ρ_∆))^{-1}.  (13)

The first inequality means that the negative eigenvalues of W dominate the positive ones (once we factor in the graph spectrum contribution), while the second is a constraint on the step size: if τ is too large, then we no longer approximate the gradient flow in Equation (9).

Theorem 4.1. Let m be the number of layers. Consider F(t + τ) = F(t) + τĀF(t)W, with symmetric W.
If Equation (13) holds, then there exists δ < 1 s.t. for all i ∈ V we have

f_i(mτ) = (1 + τ|λ_-^W|(ρ_∆ - 1))^m (c_{-,n-1}(0) ϕ_{n-1}^∆(i) · ϕ_-^W + O(δ^m)).  (14)

Conversely, if λ_+^W(ρ_∆ - 1)^{-1} > |λ_-^W|, then

f_i(mτ) = (1 + τλ_+^W)^m (c_{+,0}(0) √d_i · ϕ_+^W + O(δ^m)).  (15)

We report the explicit value of δ in Equation (38) in Appendix C.1. We now comment on the consequences of Theorem 4.1. Equation (14) implies that if the negative eigenvalues of W are sufficiently larger than the positive ones (in absolute value, as per Equation (13)), then repulsive forces, and hence high frequencies, dominate. Indeed, for i ∈ V we have f_i(mτ) ∼ ϕ_{n-1}^∆(i) · ϕ_-^W at the fastest scale, up to lower-order terms in the number of layers. Thus, as we increase the depth, any feature representation f_i(mτ) becomes dominated by a multiple of ϕ_-^W ∈ R^d depending only on the value taken by the Laplacian eigenvector ϕ_{n-1}^∆ at node i. On the other hand, if Equation (15) holds, then at the largest scale we have f_i(mτ) ∼ √d_i · ϕ_+^W ∈ R^d, meaning that the node representation becomes dominated by a multiple of ϕ_+^W depending only on the degree of i, which recovers the over-smoothing phenomenon (Nt & Maehara, 2019; Oono & Suzuki, 2020).

Corollary 4.2. If Equation (14) holds, then the system is HFD for a.e. F(0), and F(mτ)/||F(mτ)|| → F_∞ s.t. ∆f_∞^r = ρ_∆ f_∞^r for each r. Conversely, if Equation (15) holds, then the system is LFD for a.e. F(0), and F(mτ)/||F(mτ)|| → F_∞ s.t. ∆f_∞^r = 0 for each r.

Remark. In general, neither the highest nor the lowest frequency Laplacian eigenvectors constitute ideal classifiers, and in fact we always have a finite depth, so that F(mτ) also depends on the lower-order terms of the asymptotic expansion in Theorem 4.1. Whether the dynamics is LFD or HFD will affect whether the lower or the higher frequencies have a larger contribution to the prediction; indeed, we can compute the 'impact' of each graph frequency explicitly thanks to Equation (12).
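The closed-form expansion in Equation (12) is easy to check numerically against the iterated update, under the same assumptions as in the analysis (Ω = W̃ = 0, symmetric W). An illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, tau, m = 6, 3, 0.3, 10

A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)
A_bar = np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)
Delta = np.eye(n) - A_bar

W = rng.standard_normal((d, d)); W = (W + W.T) / 2
F = rng.standard_normal((n, d))
vec = lambda M: M.reshape(-1, order="F")

# Left-hand side: m explicit Euler layers F(t+τ) = F(t) + τ Ā F(t) W.
Fm = F.copy()
for _ in range(m):
    Fm = Fm + tau * A_bar @ Fm @ W

# Right-hand side: spectral expansion over the eigenpairs of W and Δ, Equation (12).
lam_W, phi_W = np.linalg.eigh(W)
lam_D, phi_D = np.linalg.eigh(Delta)
out = np.zeros(n * d)
for r in range(d):
    for l in range(n):
        basis = np.kron(phi_W[:, r], phi_D[:, l])      # ϕ_r^W ⊗ ϕ_ℓ^Δ
        c0 = basis @ vec(F)                            # c_{r,ℓ}(0)
        out += (1 + tau * lam_W[r] * (1 - lam_D[l])) ** m * c0 * basis

assert np.allclose(vec(Fm), out)
```

The check works because ϕ_r^W ⊗ ϕ_ℓ^Δ is an eigenvector of I + τ(W ⊗ Ā) with eigenvalue 1 + τλ_r^W(1 - λ_ℓ^Δ), and Ā shares its eigenvectors with Δ.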

4.2. CONNECTIONS TO EXISTING RESULTS

Residual connection. The following result shows that the residual connection is crucial.

Theorem 4.3. If G is not bipartite and we remove the residual connection, i.e. F(t + τ) = τĀF(t)W, with W symmetric, then the dynamics is LFD for a.e. F(0), independently of the spectrum of W.

Differently from the previous over-smoothing results of Oono & Suzuki (2020); Cai & Wang (2020), here we have no constraints on the spectral radius of W coming from the graph topology. In other words, the residual connection fully enables the channel mixing to steer the evolution towards low or high frequencies depending on the task. If we drop the residual connection, W is less powerful; this is also confirmed by our ablation studies (see Figure 3 in the Appendix).

Negative eigenvalues flip the edge signs. Let W = Φ^W Λ^W (Φ^W)^T be the eigendecomposition of W, yielding the Fourier coefficients Z(t) = F(t)Φ^W. We rewrite the discretized gradient flow F(t + τ) = F(t) + τĀF(t)W in the Fourier domain of W as Z(t + τ) = Z(t) + τĀZ(t)Λ^W and note that, along the eigenvectors of W, if λ_r^W < 0, then the dynamics is equivalent to flipping the sign of the edges. This shows that the negative edge-weight mechanisms proposed in Li et al. (2020); Bo et al. (2021); Yan et al. (2021) for heterophilic graphs can be achieved with a simple GCN model where the channel-mixing matrix W has negative eigenvalues. We refer to Equation (39) in the Appendix for a thorough discussion and derivation.

The message of Section 4: Discrete gradient flows of E_θ are equivalent to linear graph convolutions with symmetric weights shared across layers. This provides a 'multi-particle' interpretation for graph convolutions and sheds light on the dynamics they generate. We can derive simple expansions of the learned features and show that the interaction between the eigenvectors and spectra of W and ∆ is what drives the dynamics and determines its dominating effects.
Convolutional GNN models can deal with heterophily if the channel mixing matrix has negative eigenvalues.
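The sign-flip observation above admits a direct numerical illustration (a sketch with names of our own choosing): along an eigenvector of W with negative eigenvalue, one update with Ā equals an update with the sign-flipped adjacency -Ā and the positive weight |λ|.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, tau = 6, 3, 0.2

A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)
A_bar = np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)

# Symmetric W with a guaranteed negative eigenvalue, via an explicit spectrum.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
W = Q @ np.diag([-1.5, 0.5, 1.0]) @ Q.T
lam, Phi = np.linalg.eigh(W)           # W = Φ Λ Φ^T, eigenvalues ascending

F = rng.standard_normal((n, d))
Z = F @ Phi                            # Fourier coefficients of W: Z = F Φ^W

# Update in the W-eigenbasis: Z(t+τ) = Z(t) + τ Ā Z(t) Λ^W.
Z_next = Z + tau * A_bar @ Z @ np.diag(lam)

# For the channel r with λ_r < 0, this equals propagating on the sign-flipped
# graph −Ā with the positive weight |λ_r|.
r = 0
assert lam[r] < 0
flipped = Z[:, r] + tau * (-A_bar) @ Z[:, r] * abs(lam[r])
assert np.allclose(Z_next[:, r], flipped)
```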

5. FROM EVOLUTION EQUATIONS TO ENERGY: INTERPRETING GNNS VIA E θ

Analysing energies along GNNs is one approach to investigate their dynamics. Cai & Wang (2020); Bodnar et al. (2022) showed that E_Dir is decreasing (exponentially) along some classes of graph convolutions, implying over-smoothing; see also Rusch et al. (2022). In this Section, we start from time-continuous graph convolutions as in Equation (5), with σ acting elementwise:

Ḟ(t) = σ(-F(t)Ω + ĀF(t)W - F(0)W̃).  (16)

Although this is no longer necessarily a gradient flow due to σ, we prove that if the weights are symmetric, then E_θ in Equation (6) still decreases along Equation (16).

Theorem 5.1. Consider σ : R → R satisfying xσ(x) ≥ 0 for all x. If F solves Equation (16) with Ω, W symmetric, then t → E_θ(F(t)) is decreasing. If we discretize the system using the Euler method with step size τ and C_+ denotes the most positive eigenvalue of Ω ⊗ I_n - W ⊗ Ā, then E_θ(F(t + τ)) - E_θ(F(t)) ≤ C_+ · ||F(t + τ) - F(t)||^2.

An important consequence of Theorem 5.1 is that for non-linear graph convolutions with symmetric weights the physical interpretation is preserved, since the same multi-particle energy E_θ in Equation (10) is decreasing along the solution. Note that in the discrete case the energy monotonicity can be interpreted as a Lipschitz regularity result. Namely, the channel-mixing W still induces attraction/repulsion along edges via its positive/negative eigenvalues (more explicitly, see Lemma D.1). We again emphasize that the requirement of symmetric weights is not restrictive, thanks to the universal approximation results of Hu et al. (2019). Theorem 5.1 differs from Cai & Wang (2020); Bodnar et al. (2022) in two ways: (i) it asserts monotonicity of an energy E_θ more general than E_Dir, since it is parametric and in fact also able to enhance the high frequencies; and (ii) it holds for an infinite class of non-linear activations (beyond ReLU). So far we have considered time-independent energies.
We can generalize our discussion to energies of the form E_θ(·, t) whose potentials vary in time. The equations then take the form of Equation (5) with Ω_t and W_t symmetric.

The message of Section 5: We show that graph convolutions with symmetric weights identify curves along which the multi-particle energy E_θ decreases, hence acting as 'approximate' gradient flows. Despite the non-linear activation σ, we can still interpret the learning dynamics of convolution on graphs as finding the 'right' edge-wise attractive/repulsive potentials through channel mixing.
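The mechanism behind the continuous part of Theorem 5.1 can be seen directly: with symmetric weights, Equation (16) reads Ḟ = σ(-∇E_θ) entrywise, so Ė_θ = ⟨∇E_θ, σ(-∇E_θ)⟩ = -Σ yσ(y) ≤ 0 whenever xσ(x) ≥ 0. A numerical sketch (illustrative setup; for simplicity W̃ = W here so the source gradient is F(0)W):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 6, 3

A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)
A_bar = np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)

Om = rng.standard_normal((d, d)); Om = (Om + Om.T) / 2
W = rng.standard_normal((d, d)); W = (W + W.T) / 2
F0 = rng.standard_normal((n, d))
F = rng.standard_normal((n, d))

grad_E = F @ Om - A_bar @ F @ W + F0 @ W   # ∇E_θ with symmetric source weight

for sigma in (np.tanh, lambda x: np.maximum(x, 0.0)):  # both satisfy xσ(x) ≥ 0
    F_dot = sigma(-grad_E)                 # right-hand side of Equation (16)
    dE_dt = np.sum(grad_E * F_dot)         # Ė_θ = <∇E_θ, Ḟ> = −Σ y σ(y), y = −∇E_θ
    assert dE_dt <= 0.0
```

Note this verifies the continuous-time claim pointwise; the discrete Euler scheme only satisfies the weaker bound involving C_+ stated in the theorem.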

6. EXPERIMENTS

In Theorem 4.1 we have shown that the subset of linear graph convolutions in Equation (5) with shared, symmetric weights is characterized by the strong inductive bias that the multi-particle energy E_θ in Equation (10) is being minimized along the equations. Although no longer a gradient flow, even when we activate the equations with σ as in Theorem 5.1, we preserve such inductive bias, since the same energy is decreasing. This means that both frameworks can provably induce attraction or repulsion along edges thanks to the spectrum of the channel mixing. We validate our theoretical analysis by testing whether these principled (and more efficient) classes of convolutional models, along which E_θ decreases, can compete with baselines designed to target heterophilic graphs.

The model and the parameterisation. In the following we evaluate a subclass of the gradient flows in Equation (11), giving rise to a framework termed GRAFF (Gradient Flow Framework):

GRAFF: F(t + τ) = F(t) + τ(-F(t)diag(ω) + ĀF(t)W - βF(0)),  (17)

where ω ∈ R^d and β ∈ R, W is a symmetric d × d matrix shared across layers, and the (node-wise) encoder and decoder are MLPs. We consider two possible implementations for W: (i) diagonally-dominant (see Appendix E), where we learn an off-diagonal symmetric matrix and the diagonal terms separately, and (ii) a diagonal W; we report the best numbers over these two configurations. Both parameterisations allow the model to control the spectrum of W more easily, which we know to be essential from Theorem 4.1; we refer to the methodology description in Appendix E for further details. By Theorem 5.1, if we 'activate' Equation (17) as

GRAFF_NL: F(t + τ) = F(t) + τσ(-F(t)diag(ω) + ĀF(t)W - βF(0)),  (18)

with σ s.t. xσ(x) ≥ 0, then E_θ in Equation (6) is decreasing, so that we can think of such equations as more general 'approximate' gradient flows, termed GRAFF_NL (where NL stands for non-linear).

Complexity.
GRAFF scales as O(|V|pd + m|E|d), where p and d are the input feature and hidden dimensions respectively (with p ≥ d usually) and m is the number of layers. Note that GCN has complexity O(m|E|(p + d)), and in fact our model is slightly faster than GCN, mainly due to the preliminary encoding being performed node-wise rather than edge-wise; the main baselines on heterophilic graphs, like GGCN and Sheaf, learn edge-wise weights based on features, which is slower (as confirmed in Figure 5 in Appendix E). Moreover, for GRAFF the number of parameters scales as O(pd + d^2), while for other baselines it scales with the number of layers at least as O(pd + md^2).

Synthetic experiments. To investigate our claims, we first use the synthetic Cora dataset of (Zhu et al., 2020, Appendix G), where graphs are generated for target levels of homophily; see Appendix E.3. Figure 2 reports the test accuracy vs the true label homophily. For Neg-prod we set W = -W_0 W_0^T, so as to only have non-positive eigenvalues: we see that this is better than the opposite case of Prod (W = W_0 W_0^T) on low homophily (and vice versa on high homophily). This confirms Theorem 4.1, where we have shown that the gradient flow can be HFD, which is generally desirable with low homophily, through the negative eigenvalues of W. Results for additional 'non-signed' variants used in practice are reported in Appendix E.
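The GRAFF update in Equation (17) amounts to a few matrix products per layer. A minimal sketch with illustrative parameter values, omitting the MLP encoder/decoder and any training loop:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, tau, layers = 10, 4, 0.5, 8

A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T
deg = np.maximum(A.sum(1), 1)
A_bar = np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)

# Learnable parameters, shared across layers: ω ∈ R^d, symmetric W, scalar β.
omega = rng.standard_normal(d)
W0 = rng.standard_normal((d, d))
W = (W0 + W0.T) / 2                     # symmetric channel mixing
beta = 0.1

def graff_forward(F0, nonlinear=False):
    F = F0.copy()
    for _ in range(layers):
        # F diag(ω) is a per-channel rescaling, i.e. row-wise F * ω.
        update = -F * omega + A_bar @ F @ W - beta * F0
        if nonlinear:
            update = np.tanh(update)    # any σ with xσ(x) ≥ 0 keeps the energy bias
        F = F + tau * update
    return F

F0 = rng.standard_normal((n, d))
out_lin = graff_forward(F0)             # GRAFF, Equation (17)
out_nl = graff_forward(F0, True)        # GRAFF_NL, Equation (18)
assert out_lin.shape == out_nl.shape == (n, d)
```

Weight sharing across layers is what gives the O(pd + d^2) parameter count discussed above: the loop reuses the same ω, W, β at every step.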

Real-world experiments.

In Table 1 we test GRAFF and GRAFF_NL on datasets with varying homophily (Sen et al., 2008; Rozemberczki et al., 2021; Pei et al., 2020) (details in Appendix E.4). We use the results provided in (Yan et al., 2021, Table 1), which include GCN models, GAT (Veličković et al., 2018), PairNorm (Zhao & Akoglu, 2019) and models designed for heterophily (GGCN (Yan et al., 2021), Geom-GCN (Pei et al., 2020), H2GCN (Zhu et al., 2020) and GPRGNN (Chien et al., 2021)). For Sheaf (Bodnar et al., 2022), a recent strong baseline for heterophily, we took the best performing variant (out of six) for each dataset. We also include the continuous baselines CGNN (Xhonneux et al., 2020) and GRAND (Chamberlain et al., 2021a). Results. GRAFF and GRAFF_NL are both versions of graph convolutions with a stronger 'inductive bias' given by the energy E_θ decreasing along the solution; in fact, we can recover them from graph convolutions by simply requiring that the channel-mixing is symmetric and shared across layers. Nonetheless, they achieve competitive results on all datasets, often outperforming slower and more complex models. They are extremely competitive on the more homophilic datasets as well, in contrast with the performance of models like Sheaf that are mainly designed to handle heterophily.

7. CONCLUSIONS

We argued that when studying and developing GNNs we should focus on energy functionals rather than on the evolution equations. We introduced a new framework for GNNs where the evolution is a gradient flow of a multi-particle learnable energy. This gives rise to principled graph convolutions where the channel-mixing is a symmetric matrix that induces attraction (repulsion) along edges via its positive (negative) eigenvalues. We explored the theoretical implications by investigating the dominating terms in the learned feature expansion and corroborated that this graph convolutional framework can perform well in heterophilic settings. We proved that existing (generalized) graph convolutions maintain the dynamics induced by the same class of multi-particle energies if the channel-mixing is symmetric, even when they are not strictly gradient flows due to non-linear activations; this provides a deeper connection between energy functionals and GNNs and extends several recent results that have monitored the classical Dirichlet energy along GCNs to shed light on their dynamics. Limitations and future works. We limited our attention to a class of energy functionals whose gradient flows give rise to evolution equations of the generalized graph convolution type. In future work, we plan to study other families of energies that generalize different GNN architectures and provide new models that are more 'physics'-inspired. We will also investigate time-dependent energy functionals and how to generalize our results to this setting. To the best of our knowledge, our analysis is a first step towards studying the interaction of the graph and 'channel-mixing' spectra. In future work, we will explore other, more general dynamics (i.e., dynamics that are neither LFD nor HFD).

OVERVIEW OF THE APPENDIX

To facilitate navigating the appendix, where we report several additional theoretical results, analyses of different cases, and further experiments and ablation studies, we provide the following detailed outline.

• In Appendix A.1 we review properties of the classical Dirichlet energy on manifolds that inspired traditional PDE variational methods for image processing, whose extension to graphs and GNNs partly constitutes one of the main motivations of our work. In Appendix A.2 we review important elementary properties of the Kronecker product of matrices that are used throughout our proofs. We also comment on the choice of the normalization (and symmetrization) of the graph Laplacian, briefly mentioning the impact of different choices.

• In Appendix B.1 we derive the energy decomposition reported in Equation (10). In Appendix B.2 we derive additional rigorous results to justify our characterization of LFD and HFD dynamics in Definition 3.1, along with explicit examples; we also formalize Theorem 3.2 more explicitly and quantitatively in Theorem B.3. In Appendix B.3 we report a more explicit statement with convergence rates and over-smoothing results, which covers the informal version in Theorem 3.3. In Appendix B.4 we explore the special case Ω = W, which is equivalent to choosing −∆ rather than Ā as the message-passing matrix, providing new arguments as to why propagating messages with Ā rather than the graph Laplacian is actually 'more robust'. Finally, in Appendix B.5 we formally derive an analogy between the continuous energy used for manifolds (images) and a subset of the parametric energies in Equation (6).

• In Appendix C we prove the main results of Section 4, namely Theorem 4.1, Corollary 4.2, and Theorem 4.3.
• In Appendix D we prove Theorem 5.1 and an extra result confirming that even in the non-linear case the channel-mixing W still induces attraction and repulsion via its spectrum, hence magnifying the low or high frequencies respectively.

• In Appendix E we report additional details on hyperparameter tuning, the datasets adopted, and further synthetic and ablation studies, along with extra experiments on larger heterophilic datasets in Appendix E.6.

Additional notations and conventions used throughout the appendix. Any graph G is taken to be connected. We order the eigenvalues of the graph Laplacian as 0 = λ_0^∆ ≤ λ_1^∆ ≤ … ≤ λ_{n−1}^∆ = ρ_∆ ≤ 2, with associated orthonormal basis of eigenvectors {φ_ℓ^∆}_{ℓ=0}^{n−1}, so that in particular ∆φ_0^∆ = 0. Moreover, given a symmetric matrix B, we denote the spectrum of B by spec(B), and if B ⪰ 0, then gap(B) denotes the smallest positive eigenvalue of B. Finally, if we write F(t)/||F(t)||, we always take the norm to be the Frobenius norm and tacitly assume that the dynamics is such that the solution is not zero.

A PROOFS AND ADDITIONAL DETAILS OF SECTION 2

A.1 DISCUSSION ON CONTINUOUS DIRICHLET ENERGY AND HARMONIC MAPS

Starting point: a geometric parallelism. To motivate a gradient-flow approach for GNNs, we start from the continuous case. Consider a smooth map f : R^n → (R^d, h), with h a constant metric represented by H ⪰ 0. The Dirichlet energy of f is defined by

E(f, h) = (1/2) ∫_{R^n} ||∇f||_h^2 dx = (1/2) Σ_{q,r=1}^d Σ_{j=1}^n ∫_{R^n} h_{qr} ∂_j f^q ∂_j f^r (x) dx

and measures the 'smoothness' of f. A natural approach to finding minimizers of E (called harmonic maps) was introduced in Eells & Sampson (1964) and consists in studying the gradient flow of E, wherein a given map f(0) = f_0 is evolved according to ḟ(t) = −∇_f E(f(t)). These types of evolution equations have historically been at the core of variational and PDE-based image processing; in particular, gradient flows of the Dirichlet energy were shown in Kimmel et al. (1997) to recover the Perona-Malik nonlinear diffusion (Perona & Malik, 1990). In this subsection we briefly expand on the formulation of the continuous Dirichlet energy in Section 2 to provide more context. Consider a smooth map f : (M, g) → (N, h), where N is usually a larger manifold we embed M into, and g, h are Riemannian metrics on the domain and codomain respectively. The Dirichlet energy of f is defined by

E(f, g, h) := (1/2) ∫_M |df|_g^2 dµ(g),

with |df|_g the norm of the Jacobian of f measured with respect to g and h. If (M, g) is standard Euclidean space R^n, N = R^d and h is a constant positive semi-definite matrix, then we can rewrite the Dirichlet energy in a more familiar form as

E(f, h) = (1/2) ∫_{R^n} trace(Df^⊤ h Df) dµ = (1/2) Σ_{q,r=1}^d Σ_{j=1}^n ∫_{R^n} h_{qr} ∂_j f^q ∂_j f^r (x) dx.

The Dirichlet energy measures the smoothness of the map f; indeed, if h is the identity in R^d, then we recover the classical definition E(f) = (1/2) Σ_{r=1}^d ∫_{R^n} ||∇f^r||^2(x) dx. Gradient flow of Dirichlet energy.
Minimizers of E, referred to as harmonic maps, are important objects in geometry: to mention a few, geodesics, minimal isometric immersions, and maps f : M → R^d solving ∆_g f = 0 are all instances of harmonic maps. To identify such critical points, one computes the first variation of the energy E along an arbitrary direction ∂_t f, which can be written as

dE_f(∂_t f) = −∫_M ⟨τ_g(f), ∂_t f⟩_h dµ(g)

for some tensor field τ with explicit form

(τ_{g_M}(f))^α := ∆_{g_M} f^α + Γ^α_{βγ}(h_N) ∂_i f^β ∂_j f^γ g_M^{ij}, for 1 ≤ α ≤ dim(N),

with {y^α} local coordinates on N and Γ^α_{βγ} the Christoffel symbols. It follows that harmonic maps are identified by the condition τ_g(f) = 0. In Eells & Sampson (1964), the pivotal idea of the harmonic map flow (which has shaped much of modern research in geometric analysis) was introduced for the first time: in order to identify minimizers of E, an input map f_0 is evolved along the direction of (minus) the gradient of the energy E, leading to the dynamics ∂_t f = τ_g(f). (20) As a special case, when the target space is classical Euclidean space, one recovers the heat equation induced by the input Riemannian structure. We also note that when (M, g) is a surface representing an image and f : (u^1, u^2) → (u^1, u^2, φ(u^1, u^2)) with φ a color map, then Equation (20) becomes ∂_t φ = div(C_g ∇φ), with C_g a constant depending on the metric on M. If we now let g depend on φ, one recovers the celebrated Perona-Malik flow (Kimmel et al., 1997).

A.2 REVIEW OF KRONECKER PRODUCT AND PROPERTIES OF LAPLACIAN KERNEL

Kronecker product. In this subsection we summarize a few relevant notions pertaining to the Kronecker product of matrices that are applied throughout our spectral analysis of gradient flow equations for GNNs, in both the continuous and discrete time settings.
Given a matrix equation of the form Y = AXB, we can vectorize X and Y by stacking columns into vec(X) and vec(Y) respectively, and rewrite the previous system as vec(Y) = (B^⊤ ⊗ A) vec(X). If A and B are symmetric with spectra spec(A) and spec(B) respectively, then the spectrum of B ⊗ A is given by the pairwise products of spec(A) and spec(B). Namely, if Ax = λ_A x and By = λ_B y for non-zero vectors x and y, then λ_B λ_A is an eigenvalue of B ⊗ A with eigenvector y ⊗ x: (B ⊗ A)(y ⊗ x) = (λ_B λ_A) y ⊗ x. One can also define the Kronecker sum of matrices A ∈ R^{n×n} and B ∈ R^{d×d} as A ⊕ B := A ⊗ I_d + I_n ⊗ B, with spectrum spec(A ⊕ B) = {λ_A + λ_B : λ_A ∈ spec(A), λ_B ∈ spec(B)}. Additional details on E^Dir and the choice of Laplacian. We recall that the classical graph Dirichlet energy E^Dir is defined by E^Dir(F) = (1/2) trace(F^⊤ ∆F), where the (unusual) extra factor of 1/2 is there to avoid rescaling the gradient flow by 2, which is the more common convention. We can use the Kronecker product to rewrite the Dirichlet energy as E^Dir(F) = (1/2) vec(F)^⊤ (I_d ⊗ ∆) vec(F), from which we immediately derive that ∇_{vec(F)} E^Dir(F) = (I_d ⊗ ∆) vec(F) (since ∆ is symmetric) and hence recover the gradient flow in Equation (1), leading to the graph heat equation across each channel. Before we further comment on the characterizations of LFD and HFD dynamics, we review the main choices of graph Laplacian and the associated harmonic signals (i.e. how we can characterize the kernel spaces of a given Laplacian operator). Recall that throughout the appendix we always assume that the underlying graph G is connected. The symmetrically normalized Laplacian ∆ = I − Ā is symmetric and positive semi-definite, with harmonic space of the form (Chung & Graham, 1997) ker(∆) = span(D^{1/2} 1_n), with 1_n = (1, …, 1)^⊤.
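The two Kronecker identities above can be checked numerically; the following numpy snippet (our own illustration, using the column-stacking convention for vec) does so on random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3)); A = 0.5 * (A + A.T)   # symmetric
B = rng.normal(size=(2, 2)); B = 0.5 * (B + B.T)
X = rng.normal(size=(3, 2))

vec = lambda M: M.flatten(order="F")   # stack columns

# vec(AXB) = (B^T ⊗ A) vec(X)
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))

# spec(B ⊗ A) consists of all pairwise products of eigenvalues
prods = np.sort(np.outer(np.linalg.eigvalsh(B), np.linalg.eigvalsh(A)).ravel())
assert np.allclose(prods, np.sort(np.linalg.eigvalsh(np.kron(B, A))))
```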
This confirms that if a given GNN evolution Ḟ(t) = GNN_θ(F(t), t) with initial condition F(0) over-smooths, meaning that ∆f^r(t) → 0 as t → ∞ for each column 1 ≤ r ≤ d, then the only information persisting in the asymptotic regime is the degree, and any dependence on the input features is lost, as studied in Oono & Suzuki (2020); Cai & Wang (2020). A slightly different behaviour occurs if, instead of ∆, we consider the unnormalized Laplacian L = D − A with kernel span(1_n): if Lf^r(t) → 0 as t → ∞ for each 1 ≤ r ≤ d, then every node is embedded to a single point, making any separation task impossible. The same consequence applies to the random walk Laplacian ∆_RW = I − D^{−1}A. In particular, we note that a row-stochastic matrix is generally not symmetric (if it were, it would in fact be doubly stochastic), and the same applies to the random-walk Laplacian (a special exception is given by the class of regular graphs). In fact, in general any dynamical system governed by ∆_RW (or simply D^{−1}A) is not the gradient flow of an energy, due to the lack of symmetry, as further confirmed below in Equation (27).
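The kernel characterizations above are easy to verify numerically; a small sketch (our own, on a three-node star graph):

```python
import numpy as np

A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])        # a small connected graph
d = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(d, d))    # Ā = D^{-1/2} A D^{-1/2}
Delta = np.eye(3) - A_bar

# ker(∆) = span(D^{1/2} 1): the degree-square-root vector is harmonic
phi0 = np.sqrt(d)
assert np.allclose(Delta @ phi0, 0)

# the unnormalized Laplacian L = D - A instead annihilates the constant vector
L = np.diag(d) - A
assert np.allclose(L @ np.ones(3), 0)
```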

B PROOFS AND ADDITIONAL DETAILS OF SECTION 3 B.1 ATTRACTION VS REPULSION: A PHYSICS-INSPIRED FRAMEWORK

We first note that the system in Equation (36) can be written using the Kronecker product as

vec(Ḟ(t)) = −(Ω ⊗ I_n)vec(F(t)) + (W ⊗ Ā)vec(F(t)) − (W̃ ⊗ I_n)vec(F(0)).

If this is the gradient flow of F ↦ E_θ(F), then we would have ∇²_{vec(F)} E_θ(F) = Ω ⊗ I_n − W ⊗ Ā, which must be symmetric since the Hessian of a function is symmetric. The latter means (Ω^⊤ − Ω) ⊗ I_n = (W^⊤ − W) ⊗ Ā, which is satisfied if and only if both Ω and W are symmetric. This shows that Equation (36) is the gradient flow of E_θ if and only if Ω and W are symmetric. We now rely on the spectral decomposition of W to rewrite E_θ explicitly in terms of attractive and repulsive interactions. Given a spectral decomposition W = Φ_W Λ_W (Φ_W)^⊤, we can separate the positive eigenvalues from the negative ones and write W = Φ_W Λ_+ (Φ_W)^⊤ + Φ_W Λ_− (Φ_W)^⊤ := W_+ − W_−. Since W_+ ⪰ 0 and W_− ⪰ 0, we can use the Cholesky decomposition to write W_+ = Θ_+^⊤ Θ_+ and W_− = Θ_−^⊤ Θ_−, with Θ_+, Θ_− ∈ R^{d×d}. Equation (10) then follows by direct computation:

E_θ(F) = (1/2) Σ_i ⟨f_i, Ω f_i⟩ − (1/2) Σ_{i,j} ā_ij ⟨f_i, W f_j⟩
       = (1/2) Σ_i ⟨f_i, (Ω − W) f_i⟩ + (1/2) Σ_i ⟨f_i, W f_i⟩ − (1/2) Σ_{i,j} ā_ij ⟨Θ_+ f_i, Θ_+ f_j⟩ + (1/2) Σ_{i,j} ā_ij ⟨Θ_− f_i, Θ_− f_j⟩
       = (1/2) Σ_i ⟨f_i, (Ω − W) f_i⟩ + (1/4) Σ_{i,j} ||Θ_+ (∇F)_ij||² − (1/4) Σ_{i,j} ||Θ_− (∇F)_ij||²,

where we have used that Σ_{i,j} (a_ij/d_i) ||Θ_+ f_i||² = Σ_i ||Θ_+ f_i||².
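The decomposition W = W₊ − W₋ and the factors Θ± can be computed directly from the eigendecomposition of W; a short numpy sketch of our own follows (we use a symmetric square-root factor in place of a strict Cholesky factorization, which would require positive definiteness):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 4)); W = 0.5 * (W + W.T)   # symmetric channel-mixing

lam, Phi = np.linalg.eigh(W)
W_plus = Phi @ np.diag(np.maximum(lam, 0.0)) @ Phi.T    # attractive part, PSD
W_minus = Phi @ np.diag(np.maximum(-lam, 0.0)) @ Phi.T  # repulsive part, PSD
assert np.allclose(W, W_plus - W_minus)

# a factor Θ₊ with W₊ = Θ₊ᵀ Θ₊, built from the eigendecomposition
Theta_plus = np.diag(np.sqrt(np.maximum(lam, 0.0))) @ Phi.T
assert np.allclose(Theta_plus.T @ Theta_plus, W_plus)
```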

B.2 ADDITIONAL DETAILS ON LFD AND HFD CHARACTERIZATIONS

In this subsection we provide further details and justifications for Definition 3.1. We first prove the following simple properties. Lemma B.1. Assume we have a (continuous) process t → F(t) ∈ R n×d , for t ≥ 0. The following equivalent characterizations hold: (i) E Dir (F(t)) → 0 for t → ∞ if and only if ∆f r (t) → 0, for 1 ≤ r ≤ d. (ii) E Dir (F(t)/||F(t)||) → ρ ∆ /2 for t → ∞ if and only if for any sequence t j → ∞ there exist a subsequence t j k → ∞ and a unit limit F ∞ -depending on the subsequence -such that ∆f r ∞ = ρ ∆ f r ∞ , for 1 ≤ r ≤ d. Proof. (i) Given F(t) ∈ R n×d , we can vectorize it and decompose it in the orthonormal basis {e r ⊗ ϕ ∆ ℓ : 1 ≤ r ≤ d, 0 ≤ ℓ ≤ n -1}, with {e r } d r=1 canonical basis in R d , and write vec(F(t)) = r,ℓ c r,ℓ (t)e r ⊗ ϕ ∆ ℓ , c r,ℓ (t) := ⟨vec(F(t)), e r ⊗ ϕ ∆ ℓ ⟩. We can then use Equation ( 25) to compute the Dirichlet energy as E Dir (F(t)) = 1 2 d r=1 n-1 ℓ=0 c 2 r,ℓ (t)λ ∆ ℓ ≡ 1 2 d r=1 n-1 ℓ=1 c 2 r,ℓ (t)λ ∆ ℓ ≥ 1 2 gap(∆) d r=1 n-1 ℓ=1 c 2 r,ℓ (t), where we have used the convention above that the eigenvector ϕ ∆ 0 is in the kernel of ∆. Therefore E Dir (F(t)) → 0 ⇐⇒ d r=1 n-1 ℓ=1 c 2 r,ℓ (t) → 0, t → ∞, which occurs if and only if (I d ⊗ ∆)vec(F(t)) = d r=1 n-1 ℓ=1 c r,ℓ (t)λ ∆ ℓ e r ⊗ ϕ ∆ ℓ → 0. (ii) The argument here is similar. Indeed we can write Q(t) = F(t)/||F(t)|| with Q(t) a unit-norm signal. Namely, we can vectorize and write vec(Q(t)) = r,ℓ q r,ℓ (t)e r ⊗ ϕ ∆ ℓ , r,ℓ q 2 r,ℓ (t) = 1. Then E Dir (Q(t)) → ρ ∆ /2 if and only if r,ℓ q 2 r,ℓ (t)λ ∆ ℓ → ρ ∆ , t → ∞, which holds if and only if r q 2 r,ρ ∆ (t) → 1 q 2 r,ℓ (t) → 0, ℓ : λ ∆ ℓ < ρ ∆ , given the unit norm constraint. This is equivalent to the Rayleigh quotient of I d ⊗ ∆ converging to its maximal value ρ ∆ . When this occurs, for any sequence t j → ∞ we have that q 2 r,ℓ (t j ) ≤ 1, meaning that we can extract a converging subsequence that due to Equation (28) will converge to a unit eigenvector Q ∞ of I d ⊗ ∆ satisfying (I d ⊗ ∆)Q ∞ = ρ ∆ Q ∞ . 
Conversely, assume for a contradiction that there exists a sequence t_j → ∞ such that E^Dir(F(t_j)/||F(t_j)||) < ρ_∆/2 − ε for some ε > 0. Then Equation (28) fails to be satisfied along the sequence, meaning that no subsequence converges to a unit-norm eigenvector F_∞ of I_d ⊗ ∆ with associated eigenvalue ρ_∆, which contradicts our assumption. Before we address the formulation of low (high) frequency dominated dynamics, we solve explicitly the system Ḟ(t) = ĀF(t) in R^{n×d}, with some initial condition F(0). We can vectorize the equation and solve vec(Ḟ(t)) = (I_d ⊗ Ā)vec(F(t)), meaning that

vec(F(t)) = Σ_{r=1}^d Σ_{ℓ=0}^{n−1} e^{(1−λ_ℓ^∆)t} c_{r,ℓ}(0) e_r ⊗ φ_ℓ^∆, c_{r,ℓ}(0) := ⟨vec(F(0)), e_r ⊗ φ_ℓ^∆⟩.

Consider any initial condition F(0) such that Σ_{r=1}^d |c_{r,0}| = Σ_{r=1}^d |⟨vec(F(0)), e_r ⊗ φ_0^∆⟩| > 0, which is satisfied for each vec(F(0)) ∈ R^{nd} \ U^⊥, where U^⊥ is the orthogonal complement of R^d ⊗ span(φ_0^∆). Since U^⊥ is a lower-dimensional subspace, its complement is dense. Accordingly, for a.e. F(0) we find that the solution satisfies

||vec(F(t))||² = e^{2t}( Σ_{r=1}^d c²_{r,0} + O(e^{−2 gap(∆) t}) ) = e^{2t}( ||P^⊥_{ker(∆)} vec(F(0))||² + O(e^{−2 gap(∆) t}) ),

with P^⊥_{ker(∆)} the projection onto R^d ⊗ ker(∆). We see that the norm of the solution increases exponentially; however, the dominant term is given by the projection onto the lowest-frequency signal, and in fact

vec(F(t))/||vec(F(t))|| = ( P^⊥_{ker(∆)} vec(F(0)) + O(e^{−gap(∆) t})(I − P^⊥_{ker(∆)})vec(F(0)) ) / ( ||P^⊥_{ker(∆)} vec(F(0))||² + O(e^{−2 gap(∆) t}) )^{1/2} → vec(F_∞),

such that (I_d ⊗ ∆)vec(F_∞) = 0, which means ∆f^r_∞ = 0 for each column 1 ≤ r ≤ d. Equivalently, one can compute E^Dir(F(t)/||F(t)||) and conclude by the very same argument that this quantity converges to zero as t → ∞. In fact, this further motivates the nomenclature LFD and HFD. Without loss of generality, we now focus on the high-frequency case. Assume that we have a HFD dynamics t → F(t), i.e.
E^Dir(F(t)/||F(t)||) → ρ_∆/2; then we can vectorize the solution and write vec(F(t)) = ||F(t)|| vec(Q(t)) for some time-dependent unit vector vec(Q(t)) ∈ R^{nd}:

vec(Q(t)) = Σ_{r,ℓ} q_{r,ℓ}(t) e_r ⊗ φ_ℓ^∆, Σ_{r,ℓ} q²_{r,ℓ}(t) = 1.

By Lemma B.1, and more explicitly Equation (28), we derive that the coefficients {q_{r,ρ_∆}} associated with the eigenvectors e_r ⊗ φ_{ρ_∆}^∆ are dominant in the evolution, hence justifying the name high-frequency-dominated dynamics. The next result provides a theoretical justification for the characterization of low (high) frequency dominated dynamics in Definition 3.1.

Lemma B.2. Consider a dynamical system Ḟ(t) = GNN_θ(F(t), t) with initial condition F(0). (i) GNN_θ is LFD if and only if (I_d ⊗ ∆) vec(F(t))/||F(t)|| → 0, if and only if for each sequence t_j → ∞ there exist a subsequence t_{j_k} → ∞ and F_∞ (depending on the subsequence) s.t. F(t_{j_k})/||F(t_{j_k})|| → F_∞ satisfying ∆f^r_∞ = 0 for each 1 ≤ r ≤ d. (ii) GNN_θ is HFD if and only if for each sequence t_j → ∞ there exist a subsequence t_{j_k} → ∞ and F_∞ (depending on the subsequence) s.t. F(t_{j_k})/||F(t_{j_k})|| → F_∞ satisfying ∆f^r_∞ = ρ_∆ f^r_∞ for each 1 ≤ r ≤ d.

Proof. (i) Since ∆f^r(t) → 0 for each 1 ≤ r ≤ d if and only if (I_d ⊗ ∆)vec(F(t)) → 0, we conclude that the dynamics is LFD if and only if (I_d ⊗ ∆) vec(F(t))/||F(t)|| → 0, due to (i) in Lemma B.1. Consider a sequence t_j → ∞. Since vec(F(t_j))/||F(t_j)|| is a bounded sequence, we can extract a converging subsequence t_{j_k}: vec(F(t_{j_k}))/||F(t_{j_k})|| → vec(F_∞). If the dynamics is LFD, then (I_d ⊗ ∆) vec(F(t_{j_k}))/||F(t_{j_k})|| → 0 and hence vec(F_∞) ∈ ker(I_d ⊗ ∆). Conversely, assume that for any sequence t_j → ∞ there exist a subsequence t_{j_k} and F_∞ such that F(t_{j_k})/||F(t_{j_k})|| → F_∞ satisfying ∆f^r_∞ = 0 for each 1 ≤ r ≤ d.
If, for a contradiction, we had ε > 0 and t_j → ∞ such that E^Dir(F(t_j)/||F(t_j)||) ≥ ε for j large enough, then by (i) in Lemma B.1 there exist 1 ≤ r ≤ d, ℓ > 0 and a subsequence t_{j_k} satisfying |⟨vec(F(t_{j_k}))/||F(t_{j_k})||, e_r ⊗ φ_ℓ^∆⟩| > δ(ε) > 0, meaning that no subsequence of {t_{j_k}} satisfies (I_d ⊗ ∆)vec(F(t_{j_k}))/||F(t_{j_k})|| → 0, providing a contradiction. (ii) This is equivalent to (ii) in Lemma B.1.

Remark. We note that in Lemma B.2 an LFD dynamics does not necessarily mean that the normalized solution converges to the kernel of I_d ⊗ ∆; in general one always has to pass to subsequences. Indeed, consider the simple example t ↦ vec(F(t)) := cos(t) e_r ⊗ φ_0^∆ for some 1 ≤ r ≤ d, which satisfies ∆f^r(t) = 0 for each r but is not a convergent function, due to its oscillatory nature. The same argument applies to HFD. We will now show that Equation (36) can lead to a HFD dynamics. To this end, we assume that Ω = W̃ = 0, so that Equation (36) becomes Ḟ(t) = ĀF(t)W. According to Equation (10), the negative eigenvalues of W lead to repulsion. We show that the latter can induce HFD dynamics as per Definition 3.1. We let P_{ρ_−}^W be the orthogonal projection onto the eigenspace of W ⊗ Ā associated with the eigenvalue ρ_− := |λ_−^W|(ρ_∆ − 1). We recall that λ_+^W and λ_−^W are the most positive and most negative eigenvalues of W, respectively. We define ε_HFD by

ε_HFD := min{ρ_− − λ_+^W, |λ_−^W| gap(ρ_∆ I − ∆), gap(|λ_−^W| I + W)(ρ_∆ − 1)}.

Theorem B.3. If ρ_− > λ_+^W, then Ḟ(t) = ĀF(t)W is HFD for a.e. F(0):

E^Dir(F(t)) = e^{2tρ_−} ( (ρ_∆/2) ||P_{ρ_−}^W F(0)||² + O(e^{−2tε_HFD}) ), t ≥ 0,

and F(t)/||F(t)|| converges to F_∞ ∈ R^{n×d} such that ∆f^r_∞ = ρ_∆ f^r_∞ for 1 ≤ r ≤ d. If instead ρ_− < λ_+^W, then the dynamics is LFD and F(t)/||F(t)|| converges, exponentially fast, to F_∞ ∈ R^{n×d} such that ∆f^r_∞ = 0 for 1 ≤ r ≤ d. Proof of Theorem B.3.
Once we compute the spectrum of W ⊗ Ā via Equation (23), we can write the solution as follows (recall that Ā = I_n − ∆, so we can rephrase the eigenvalues of Ā in terms of the eigenvalues of ∆):

vec(F(t)) = Σ_{r,ℓ} e^{λ_r^W (1−λ_ℓ^∆) t} c_{r,ℓ}(0) φ_r^W ⊗ φ_ℓ^∆, with W φ_r^W = λ_r^W φ_r^W for 1 ≤ r ≤ d,

where {φ_r^W}_r is an orthonormal basis of eigenvectors in R^d. We can then calculate the Dirichlet energy along the solution as

E^Dir(F(t)) = (1/2) ⟨vec(F(t)), (I_d ⊗ ∆)vec(F(t))⟩ = (1/2) Σ_{r,ℓ} e^{2λ_r^W (1−λ_ℓ^∆) t} c²_{r,ℓ}(0) λ_ℓ^∆.

We now consider two cases: (i) if λ_r^W > 0, then λ_r^W (1 − λ_ℓ^∆) ≤ λ_+^W; (ii) if λ_r^W < 0, then λ_r^W (1 − λ_ℓ^∆) ≤ |λ_−^W|(ρ_∆ − 1) := ρ_−, with eigenvectors φ_r^W ⊗ φ_{ρ_∆}^∆ for each r s.t. W φ_r^W = λ_−^W φ_r^W (without loss of generality, we can assume that ρ_∆ is a simple eigenvalue of ∆). In particular, if λ_r^W < 0 and λ_r^W (1 − λ_ℓ^∆) < ρ_−, then λ_r^W (1 − λ_ℓ^∆) < max{|λ_−^W|(λ_{n−2}^∆ − 1), |λ_{−,2}^W|(ρ_∆ − 1)}, where λ_{−,2}^W is the second most negative eigenvalue of W and λ_{n−2}^∆ is the second largest eigenvalue of ∆. In particular, we can write

λ_{n−2}^∆ = ρ_∆ − gap(ρ_∆ I_n − ∆), |λ_{−,2}^W| = |λ_−^W| − gap(|λ_−^W| I_d + W). (29)

From (i) and (ii) we derive that if λ_r^W (1 − λ_ℓ^∆) ≠ ρ_−, then

λ_r^W (1 − λ_ℓ^∆) − ρ_− < −min{ρ_− − λ_+^W, ρ_− − |λ_−^W|(λ_{n−2}^∆ − 1), ρ_− − |λ_{−,2}^W|(ρ_∆ − 1)}
= −min{ρ_− − λ_+^W, |λ_−^W| gap(ρ_∆ I − ∆), gap(|λ_−^W| I + W)(ρ_∆ − 1)} = −ε_HFD,

where we have used Equation (29). Accordingly, if ρ_− > λ_+^W, then

E^Dir(F(t)) = e^{2tρ_−} ( (ρ_∆/2) Σ_{r: λ_r^W = λ_−^W} c²_{r,ρ_∆}(0) + (1/2) Σ_{r,ℓ: λ_r^W(1−λ_ℓ^∆) ≠ ρ_−} e^{2(λ_r^W(1−λ_ℓ^∆) − ρ_−)t} c²_{r,ℓ}(0) λ_ℓ^∆ )
= e^{2tρ_−} ( (ρ_∆/2) ||P_{ρ_−}^W F(0)||² + O(e^{−2tε_HFD}) ).

By the same argument we can factor out the dominant term and derive the following limit for t → ∞ and for a.e.
F(0), since P_{ρ_−}^W vec(F(0)) = 0 only if vec(F(0)) belongs to a lower-dimensional subspace of R^{nd}:

vec(F(t))/||vec(F(t))|| = ( P_{ρ_−}^W vec(F(0)) + O(e^{−ε_HFD t})(I − P_{ρ_−}^W)vec(F(0)) ) / ( ||P_{ρ_−}^W vec(F(0))||² + O(e^{−2ε_HFD t}) )^{1/2} → P_{ρ_−}^W vec(F(0)) / ||P_{ρ_−}^W vec(F(0))||,

where the limit is a unit vector vec(F_∞) satisfying (I_d ⊗ ∆)vec(F_∞) = ρ_∆ vec(F_∞), which completes the proof. For the opposite case, the proof can be adapted without effort, as explicitly derived in the proof of Theorem 4.1.
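Theorem B.3 can be illustrated numerically on the smallest possible example; the following numpy sketch (our own construction, with a two-node graph and a diagonal W chosen so that ρ₋ > λ₊^W) shows the normalized Dirichlet energy approaching ρ_∆/2:

```python
import numpy as np

# two-node graph with one edge: ∆ = I - Ā has eigenvalues {0, 2}, so ρ_∆ = 2
A_bar = np.array([[0.0, 1.0], [1.0, 0.0]])
Delta = np.eye(2) - A_bar

# symmetric channel-mixing with λ₊^W = 0.5 and λ₋^W = -2, hence
# ρ₋ = |λ₋^W|(ρ_∆ - 1) = 2 > λ₊^W: Theorem B.3 predicts HFD
W = np.diag([0.5, -2.0])

M = np.kron(W, A_bar)          # vec(Ḟ(t)) = (W ⊗ Ā) vec(F(t))
lam, U = np.linalg.eigh(M)
f0 = np.random.default_rng(3).normal(size=4)

def dirichlet_of_normalized(t):
    f = U @ (np.exp(lam * t) * (U.T @ f0))
    f /= np.linalg.norm(f)
    return 0.5 * f @ np.kron(np.eye(2), Delta) @ f

# E^Dir(F(t)/||F(t)||) approaches ρ_∆/2 = 1 for large t, i.e. the flow is HFD
```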

B.3 COMPARISON WITH CONTINUOUS GNNS: DETAILS AND PROOFS

Comparison with some continuous GNN models In contrast with Theorem 3.2, we show that three main linearized continuous GNN models are either smoothing or more generally LFD. The linearized PDE-GCN D model Eliasof et al. (2021) corresponds to choosing W = 0 and Ω = W = K(t) ⊤ K(t) in Equation ( 36), for some time-dependent family t → K(t) ∈ R d×d : ḞPDE-GCND (t) = -∆F(t)K(t) ⊤ K(t). The CGNN model Xhonneux et al. (2020) can be derived from Equation (36) by setting Ω = I -Ω, W = -W = I: ḞCGNN (t) = -∆F(t) + F(t) Ω + F(0). Finally, in linearized GRAND Chamberlain et al. (2021a) a row-stochastic matrix A(F(0)) is learned from the encoding via an attention mechanism and we have ḞGRAND (t) = -∆ RW F(t) = -(I -A(F(0)))F(t). We note that if A is not symmetric, then GRAND is not a gradient flow. Theorem B.4. PDE -GCN D , CGNN and GRAND satisfy the following: (i) PDE -GCN D is a smoothing model: ĖDir (F PDE-GCN D (t)) ≤ 0. (ii) For a.e. F(0) it holds: CGNN is never HFD and if we remove the source term, then E Dir (F CGNN (t)/||F CGNN (t)||) ≤ e -gap(∆)t . (iii) If G is connected, F GRAND (t) → µ as t → ∞, with µ r = mean(f r (0)), 1 ≤ r ≤ d. By (ii) the source-free CGNN-evolution is LFD independent of Ω. Moreover, by (iii), over-smoothing occurs for GRAND. On the other hand, Theorem 3.2 shows that the negative eigenvalues of W can make the source-free gradient flow in Equation ( 36) HFD. Experiments in Section 6 confirm that the gradient flow model outperforms CGNN and GRAND on heterophilic graphs. We prove the following result which covers Theorem 3.3. Proof of Theorem B.4. We structure the proof by following the numeration in the statement. (i) From direct computation we find dE Dir (F(t)) dt = 1 2 d dt (⟨vec(F(t)), (I d ⊗ ∆)vec(F(t))⟩) = -⟨vec(F(t)), (K ⊤ (t)K(t) ⊗ ∆ 2 )vec(F(t))⟩ ≤ 0, since K ⊤ (t)K(t) ⊗ ∆ 2 ⪰ 0. Note that we have used that (A ⊗ B)(C ⊗ D) = AC ⊗ BD. (ii) We consider the dynamical system ḞCGNN (t) = -∆F(t) + F(t) Ω + F(0). 
We can write vec(F(t)) = Σ_{r,ℓ} c_{r,ℓ}(t) φ_r^Ω ⊗ φ_ℓ^∆, leading to the system

ċ_{r,ℓ}(t) = (λ_r^Ω − λ_ℓ^∆) c_{r,ℓ}(t) + c_{r,ℓ}(0), 0 ≤ ℓ ≤ n−1, 1 ≤ r ≤ d.

We can solve the system explicitly as

c_{r,ℓ}(t) = c_{r,ℓ}(0) ( e^{(λ_r^Ω − λ_ℓ^∆)t} (1 + 1/(λ_r^Ω − λ_ℓ^∆)) − 1/(λ_r^Ω − λ_ℓ^∆) ), if λ_r^Ω ≠ λ_ℓ^∆;
c_{r,ℓ}(t) = c_{r,ℓ}(0)(1 + t), otherwise.

We now see that for a.e. F(0) the projection (I_d ⊗ φ_{ρ_∆}^∆ (φ_{ρ_∆}^∆)^⊤) vec(F(t)) is never the dominant term. In fact, if there exists r s.t. λ_r^Ω ≥ ρ_∆, then λ_r^Ω − λ_ℓ^∆ > λ_r^Ω − ρ_∆ for any non-maximal graph Laplacian eigenvalue λ_ℓ^∆. It follows that there is no Ω̃ s.t. the normalized solution maximizes the Rayleigh quotient of I_d ⊗ ∆, proving that CGNN is never HFD. If we have no source, then the CGNN equation becomes

Ḟ(t) = −∆F(t) + F(t)Ω̃ ⟺ vec(Ḟ(t)) = (Ω̃ ⊕ (−∆)) vec(F(t)),

using the Kronecker sum notation in Equation (24). It follows that we can write the vectorized solution in the basis {φ_r^Ω ⊗ φ_ℓ^∆}_{r,ℓ} as

vec(F(t)) = e^{λ_+^Ω t} ( Σ_{r: λ_r^Ω = λ_+^Ω} c_{r,0}(0) φ_r^Ω ⊗ φ_0^∆ + O(e^{−gap(λ_+^Ω I_d − Ω̃)t}) Σ_{r: λ_r^Ω < λ_+^Ω} c_{r,0}(0) φ_r^Ω ⊗ φ_0^∆ + O(e^{−gap(∆)t}) Σ_{r, ℓ>0} c_{r,ℓ}(0) φ_r^Ω ⊗ φ_ℓ^∆ ),

meaning that the dominant term is given by the lowest-frequency component; in fact, after normalizing, we find E^Dir(F(t)/||F(t)||) ≤ e^{−gap(∆)t}. (iii) Finally, we consider the dynamical system induced by linear GRAND,

Ḟ_GRAND(t) = −∆_RW F(t) = −(I − A(F(0)))F(t).

Since there is no channel-mixing, without loss of generality we can assume d = 1; one can then extend the argument to any entry. We can use the Jordan form of A to write the solution of the GRAND dynamical system as f(t) = P diag(e^{J_1 t}, …, e^{J_n t}) P^{−1} f(0), for some invertible matrix P of generalized eigenvectors, where each block e^{J_k t} = e^{−(1−λ_k^A)t} T_k(t), with T_k(t) the upper-triangular matrix whose s-th superdiagonal entries equal t^s/s! for s = 0, …, m_k − 1, and m_k the eigenvalue multiplicities.
Since by assumption G is connected and augmented with self-loops, the row-stochastic attention matrix A computed in Chamberlain et al. (2021a) with a softmax activation is regular, meaning that there exists m ∈ N such that (A^m)_ij > 0 for each entry (i, j). Accordingly, we can apply the Perron theorem to derive that every eigenvalue of A has real part smaller than one, except the eigenvalue λ_0^A = 1 of multiplicity one, associated with the Perron eigenvector 1_n. We thus find that each block e^{J_k t} decays to zero as t → ∞, with the exception of the block e^{J_0 t} associated with the Perron eigenvector. In particular, the projection of f_0 onto the Perron eigenvector is µ1_n, with µ the average of the initial feature condition. This completes the proof.
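The collapse described in (iii) is easy to reproduce numerically; the following sketch (our own, with a random softmax-style row-stochastic matrix standing in for the learned attention) runs explicit Euler on the source-free GRAND dynamics:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
# dense row-stochastic 'attention': all entries positive (softmax of logits),
# so the matrix is regular and the Perron argument applies
Att = np.exp(rng.normal(size=(n, n)))
Att /= Att.sum(axis=1, keepdims=True)

f = rng.normal(size=n)
tau = 0.1
for _ in range(20_000):          # explicit Euler for ḟ = -(I - A) f
    f = f + tau * (Att @ f - f)

# all nodes collapse to one common value: the projection of f(0) onto the
# Perron eigenvector 1_n (the plain mean when A is doubly stochastic)
assert np.std(f) < 1e-8
```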

B.4 PROPAGATING WITH THE LAPLACIAN

In this subsection we briefly review the special case of Equation (36) where Ω = W, and comment on why we generally expect a framework where the propagation is governed by Ā to be more flexible than one governed by −∆. If Ω = W and we suppress the source term, i.e. W̃ = 0, the gradient flow in Equation (36) becomes

Ḟ(t) = −∆F(t)W. (31)

We note that, once vectorized, the solution to the dynamical system can be written as

vec(F(t)) = Σ_{r=1}^d Σ_{ℓ=0}^{n−1} e^{−λ_r^W λ_ℓ^∆ t} c_{r,ℓ}(0) φ_r^W ⊗ φ_ℓ^∆.

In particular, we immediately deduce the following counterpart to Theorem 3.2:

Corollary B.5. If spec(W) ∩ R_− ≠ ∅, then Equation (31) is HFD for a.e. F(0).

Differently from Equation (36), the lowest-frequency component is always preserved, independent of the spectrum of W. This means that the system cannot learn eigenvalues of W that either magnify or suppress the low-frequency projection. In contrast, this can be done if Ω = 0, or equivalently if one replaces −∆ with Ā, providing a further justification, in terms of the interaction between the graph spectrum and the channel-mixing spectrum, for why graph convolutional models use the normalized adjacency rather than the Laplacian to propagate messages (Kipf & Welling, 2017).
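The contrast between propagating with −∆ and with Ā can be seen directly in a single channel; a numpy sketch (our own construction, on a two-node graph with d = 1 and scalar channel-mixing w):

```python
import numpy as np

A_bar = np.array([[0.0, 1.0], [1.0, 0.0]])
Delta = np.eye(2) - A_bar
phi0 = np.ones(2) / np.sqrt(2.0)    # lowest-frequency eigenvector, ∆φ₀ = 0

w = 1.5                             # a single channel (d = 1), W = [w]
f0 = np.array([2.0, -0.5])
t = 3.0

# closed-form solutions of ḟ = -∆ f w  and  ḟ = Ā f w
lam_D, U = np.linalg.eigh(Delta)
f_lap = U @ (np.exp(-lam_D * w * t) * (U.T @ f0))
lam_A, V = np.linalg.eigh(A_bar)
f_adj = V @ (np.exp(lam_A * w * t) * (V.T @ f0))

# -∆ propagation: the projection onto ker(∆) is frozen at its initial value
assert np.isclose(phi0 @ f_lap, phi0 @ f0)
# Ā propagation: the same projection is rescaled by e^{wt}, so it can be
# suppressed (w < 0) or magnified (w > 0) by learning the sign of w
assert np.isclose(phi0 @ f_adj, np.exp(w * t) * (phi0 @ f0))
```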

B.5 REVISITING THE CONNECTION WITH THE MANIFOLD CASE

In Equation ( 19) a constant nontrivial metric h in R d leads to the mixing of the feature channels. We adapt this idea by considering a symmetric positive semi-definite H = W ⊤ W with W ∈ R d×d and using it to generalize E Dir by suitably weighting the norm of the edge gradients as E Dir W (F) := 1 4 d q,r=1 i j:(i,j)∈E h qr (∇f q ) ij (∇f r ) ij = 1 4 (i,j)∈E ||W(∇F) ij || 2 . ( ) We note the analogy with Equation ( 19), where the sum over the nodes replaces the integration over the domain and the j-th derivative at some point i is replaced by the gradient along the edge (i, j) ∈ E. We generally treat W as learnable weights and study the gradient flow of E Dir W : Ḟ(t) = -∇ F E Dir W (F(t)) = -∆F(t)W ⊤ W. We see that Equation (33) generalizes Equation (1). Proposition B.6. Let P ker W be the projection onto ker(W ⊤ W). Equation ( 33) is smoothing since E Dir (F(t)) ≤ e -2tgap(W ⊤ W)gap(∆) ||F(0)|| 2 + E Dir ((P ker W ⊗ I n )vec(F(0))), t ≥ 0. In fact F(t) → F ∞ s.t. ∃ ϕ ∞ ∈ R d : for each i ∈ V we have (f ∞ ) i = √ d i ϕ ∞ + P ker W f i (0). Proof of Proposition B.6. We can vectorize the gradient flow system in Equation ( 33) and use the spectral characterization of W ⊤ W ⊗ ∆ in Equation ( 23) to write the solution explicitly as vec(F(t)) = r,ℓ e -(λ W r λ ∆ ℓ )t c r,ℓ (0)ϕ W r ⊗ ϕ ∆ ℓ , where {λ W r } r = spec(W ⊤ W) ⊂ R ≥0 with associated basis of orthonormal eigenvectors given by {ϕ W r } r . Then E Dir (F(t)) = 1 2 ⟨vec(F(t)), (I d ⊗ ∆)vec(F(t))⟩ = 1 2 r,ℓ e -2t(λ W r λ ∆ ℓ ) c 2 r,ℓ (0)λ ∆ ℓ = 1 2 r:λ W r =0,ℓ c 2 r,ℓ (0)λ ∆ ℓ + 1 2 r:λ W r >0,ℓ>0 c 2 r,ℓ (0)e -2t(λ W r λ ∆ ℓ ) λ ∆ ℓ = E Dir ((P ker W ⊗ I n )vec(F(0))) + 1 2 r:λ W r >0,ℓ>0 c 2 r,ℓ (0)e -2t(λ W r λ ∆ ℓ ) λ ∆ ℓ ≤ E Dir ((P ker W ⊗ I n )vec(F(0))) + ρ ∆ 2 e -2tgap(W ⊤ W)gap(∆) ||F(0)|| 2 , where we recall that P ker W is the projection onto ker(W ⊤ W) and that by convention the index ℓ = 0 is associated with the lowest graph frequency λ ∆ 0 = 0 -by assumption G is connected. 
This proves that the dynamics is in fact smoothing. By the very same argument we find that
$$\mathrm{vec}(\mathbf{F}(t)) \to (\mathbf{I}_{d}\otimes P^{\ker}_{\Delta})\mathrm{vec}(\mathbf{F}(0)) + (P^{\ker}_{W}\otimes \mathbf{I}_{n})\mathrm{vec}(\mathbf{F}(0)), \quad t \to \infty,$$
with $P^{\ker}_{\Delta}$ the orthogonal projection onto $\ker\Delta$; the other terms decay exponentially to zero. We first focus on the first quantity, which we can write as
$$(\mathbf{I}_{d}\otimes P^{\ker}_{\Delta})\mathrm{vec}(\mathbf{F}(0)) = \sum_{r} c_{r,0}(0)\,\boldsymbol{\phi}^{W}_{r}\otimes\boldsymbol{\phi}^{\Delta}_{0},$$
which has matrix representation $\boldsymbol{\phi}^{\Delta}_{0}\boldsymbol{\phi}^{\top}_{\infty}\in\mathbb{R}^{n\times d}$ with $\boldsymbol{\phi}_{\infty} := \sum_{r} c_{r,0}(0)\boldsymbol{\phi}^{W}_{r}$. By Equation (26) we deduce that the $i$-th row of $\boldsymbol{\phi}^{\Delta}_{0}\boldsymbol{\phi}^{\top}_{\infty}\in\mathbb{R}^{n\times d}$ is the $d$-dimensional vector $\sqrt{d_{i}}\,\boldsymbol{\phi}_{\infty}$. We now focus on the term
$$(P^{\ker}_{W}\otimes\mathbf{I}_{n})\mathrm{vec}(\mathbf{F}(0)) = \sum_{r:\,\lambda^{W}_{r}=0,\ j} c_{r,j}(0)\,\boldsymbol{\phi}^{W}_{r}\otimes\boldsymbol{\phi}^{\Delta}_{j},$$
which has matrix representation $\sum_{r:\lambda^{W}_{r}=0,\,j} c_{r,j}(0)\,\boldsymbol{\phi}^{\Delta}_{j}(\boldsymbol{\phi}^{W}_{r})^{\top}$. In particular, the $i$-th row is given by $\sum_{r:\lambda^{W}_{r}=0,\,j} c_{r,j}(0)(\boldsymbol{\phi}^{\Delta}_{j})_{i}\,\boldsymbol{\phi}^{W}_{r} = P^{\ker}_{W}\mathbf{f}_{i}(0)$. This completes the proof of Proposition B.6.

Proposition B.6 implies that no weight matrix $\mathbf{W}$ in Equation (33) can separate the limit embeddings $\mathbf{F}_{\infty}$ of nodes with the same degree and the same input features. In particular, we have the following characterization:
• Projections of the edge gradients $(\nabla\mathbf{F})_{ij}(0)\in\mathbb{R}^{d}$ onto the eigenvectors of $\mathbf{W}^\top\mathbf{W}$ with positive eigenvalues shrink along the GNN and converge to zero exponentially fast as the integration time (depth) increases.
• Projections of the edge gradients $(\nabla\mathbf{F})_{ij}(0)\in\mathbb{R}^{d}$ onto the kernel of $\mathbf{W}^\top\mathbf{W}$ stay invariant.
If $\mathbf{W}$ has a trivial kernel, then nodes with the same degree converge to the same representation and over-smoothing occurs. Differently from Nt & Maehara (2019); Oono & Suzuki (2020); Cai & Wang (2020), over-smoothing occurs independently of the spectral radius of the 'channel-mixing' whenever its eigenvalues are positive, even for equations which lead to residual GNNs when discretized (Chen et al., 2018). According to Proposition B.6, we do not expect Equation (33) to succeed on heterophilic graphs, where smoothing processes are generally harmful; this is confirmed in Figure 3 (see the prod curve).
To deal with heterophily, one needs negative eigenvalues to generate repulsive forces among adjacent features.

A more general energy. Since one generally also needs repulsive forces to deal with heterophilic graphs, we extend the Dirichlet energy associated with $\mathbf{H} = \mathbf{W}^\top\mathbf{W}\succeq 0$ to an energy accounting for mutual, possibly repulsive, interactions in the feature space $\mathbb{R}^{d}$. We first rewrite the energy $\mathcal{E}^{\mathrm{Dir}}_{W}$ in Equation (32) as
$$\mathcal{E}^{\mathrm{Dir}}_{W}(\mathbf{F}) = \frac{1}{2}\sum_{i}\langle \mathbf{f}_{i}, \mathbf{W}^\top\mathbf{W}\mathbf{f}_{i}\rangle - \frac{1}{2}\sum_{i,j}\bar{a}_{ij}\langle \mathbf{f}_{i}, \mathbf{W}^\top\mathbf{W}\mathbf{f}_{j}\rangle. \tag{34}$$
If we replace the occurrences of $\mathbf{W}^\top\mathbf{W}$ with arbitrary symmetric matrices $\boldsymbol{\Omega},\mathbf{W}\in\mathbb{R}^{d\times d}$, we obtain
$$\mathcal{E}_{\theta}(\mathbf{F}) := \frac{1}{2}\sum_{i}\langle \mathbf{f}_{i}, \boldsymbol{\Omega}\mathbf{f}_{i}\rangle - \frac{1}{2}\sum_{i,j}\bar{a}_{ij}\langle \mathbf{f}_{i}, \mathbf{W}\mathbf{f}_{j}\rangle \equiv \mathcal{E}^{\mathrm{ext}}_{\Omega}(\mathbf{F}) + \mathcal{E}^{\mathrm{pair}}_{W}(\mathbf{F}), \tag{35}$$
with associated gradient flow of the form (see Appendix B)
$$\dot{\mathbf{F}}(t) = -\nabla_{\mathbf{F}}\mathcal{E}_{\theta}(\mathbf{F}(t)) = -\mathbf{F}(t)\boldsymbol{\Omega} + \bar{\mathbf{A}}\mathbf{F}(t)\mathbf{W}.$$
If we include the source term, we fully recover the general energy in Equation (6) and its associated gradient flow.
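As a sanity check (our own toy example, with arbitrarily chosen symmetric matrices, not from the paper's code), one can verify that $\mathcal{E}_{\theta}$ decreases along an explicit-Euler discretization of its gradient flow even when $\boldsymbol{\Omega}$ and $\mathbf{W}$ are indefinite, provided the step size is small:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3

# Toy graph: complete graph on 5 nodes minus the edge (0, 1).
A = np.ones((n, n)) - np.eye(n)
A[0, 1] = A[1, 0] = 0.
deg = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(deg, deg))

# Arbitrary *symmetric* (possibly indefinite) channel matrices.
S1, S2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Omega, W = 0.5 * (S1 + S1.T), 0.5 * (S2 + S2.T)

def energy(F):
    # E_theta(F) = 1/2 sum_i <f_i, Omega f_i> - 1/2 sum_ij a_ij <f_i, W f_j>
    return 0.5 * np.trace(F.T @ F @ Omega) - 0.5 * np.trace(F.T @ A_bar @ F @ W)

F = rng.standard_normal((n, d))
tau = 0.01
energies = [energy(F)]
for _ in range(300):
    F = F + tau * (-F @ Omega + A_bar @ F @ W)   # Euler step of the gradient flow
    energies.append(energy(F))

assert all(b <= a + 1e-10 for a, b in zip(energies, energies[1:]))
```

Note that $\mathcal{E}_{\theta}$ is quadratic and generally unbounded below once $\mathbf{W}$ has negative eigenvalues, so "decreasing" here does not imply convergence of the features.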

C PROOFS AND ADDITIONAL DETAILS OF SECTION 4

We first explicitly report the expansion of the discrete gradient flow in Equation (11) after $m$ layers, to further highlight how this is not equivalent to a single linear layer with message-passing matrix $\bar{\mathbf{A}}^{m}$ as in SGCN (Wu et al., 2019). For simplicity we suppress the source term:
$$\mathbf{F}(t+\tau) = \mathbf{F}(t) + \tau\big(-\mathbf{F}(t)\boldsymbol{\Omega} + \bar{\mathbf{A}}\mathbf{F}(t)\mathbf{W}\big)$$
$$\mathrm{vec}(\mathbf{F}(t+\tau)) = \Big(\mathbf{I}_{nd} + \tau\big(-\boldsymbol{\Omega}\otimes\mathbf{I}_{n} + \mathbf{W}\otimes\bar{\mathbf{A}}\big)\Big)\mathrm{vec}(\mathbf{F}(t))$$
$$\mathrm{vec}(\mathbf{F}(m\tau)) = \sum_{k=0}^{m}\binom{m}{k}\tau^{k}\big(-\boldsymbol{\Omega}\otimes\mathbf{I}_{n} + \mathbf{W}\otimes\bar{\mathbf{A}}\big)^{k}\,\mathrm{vec}(\mathbf{F}(0)), \tag{37}$$
and we see how the message-passing matrix $\bar{\mathbf{A}}$ actually enters the expansion after $m$ layers with each power $0 \leq k \leq m$. This is not surprising: since we are discretizing a linear dynamical system, we are approximating a matrix exponential.
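The vectorized expansion can be checked numerically. The sketch below (our own, using the column-stacking convention so that $\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^{\top}\otimes\mathbf{A})\mathrm{vec}(\mathbf{X})$) verifies both the single-step identity and the $m$-step binomial form:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
n, d, tau, m = 4, 3, 0.1, 5

# Small connected toy graph.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
deg = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(deg, deg))

S1, S2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Omega, W = 0.5 * (S1 + S1.T), 0.5 * (S2 + S2.T)   # symmetric channel matrices
F0 = rng.standard_normal((n, d))

vec = lambda F: F.flatten(order='F')              # column-stacking vectorization

B = -np.kron(Omega, np.eye(n)) + np.kron(W, A_bar)
U = np.eye(n * d) + tau * B                       # one discrete gradient-flow layer

# Single step: vec(F + tau(-F Omega + A_bar F W)) == U vec(F)
F1 = F0 + tau * (-F0 @ Omega + A_bar @ F0 @ W)
assert np.allclose(vec(F1), U @ vec(F0))

# m steps equal the binomial expansion of (I + tau B)^m.
F = F0
for _ in range(m):
    F = F + tau * (-F @ Omega + A_bar @ F @ W)
expansion = sum(comb(m, k) * tau**k * np.linalg.matrix_power(B, k)
                for k in range(m + 1))
assert np.allclose(vec(F), expansion @ vec(F0))
```

The symmetry of $\boldsymbol{\Omega}$ and $\mathbf{W}$ is what lets $\boldsymbol{\Omega}^{\top}\otimes\mathbf{I}_{n}$ and $\mathbf{W}^{\top}\otimes\bar{\mathbf{A}}$ be written without transposes.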

C.1 FROM ENERGY TO EVOLUTION EQUATIONS: EXACT EXPANSION OF THE GNN SOLUTIONS

We first address the proof of the main result.

Proof of Theorem 4.1. We consider a linear dynamical system $\mathbf{F}(t+\tau) = \mathbf{F}(t) + \tau\bar{\mathbf{A}}\mathbf{F}(t)\mathbf{W}$, with $\mathbf{W}$ symmetric. We vectorize the system and rewrite it as $\mathrm{vec}(\mathbf{F}(t+\tau)) = (\mathbf{I}_{nd} + \tau\,\mathbf{W}\otimes\bar{\mathbf{A}})\mathrm{vec}(\mathbf{F}(t))$, which in particular leads to $\mathrm{vec}(\mathbf{F}(m\tau)) = (\mathbf{I}_{nd} + \tau\,\mathbf{W}\otimes\bar{\mathbf{A}})^{m}\mathrm{vec}(\mathbf{F}(0))$. We can then write the solution explicitly as
$$\mathrm{vec}(\mathbf{F}(m\tau)) = \sum_{r,\ell}\big(1 + \tau\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})\big)^{m}\, c_{r,\ell}(0)\,\boldsymbol{\phi}^{W}_{r}\otimes\boldsymbol{\phi}^{\Delta}_{\ell}.$$
We now verify that, by the assumption in Equation (13), the dominant term of the solution is the projection onto the eigenspace associated with the eigenvalue $\rho_{-} = |\lambda^{W}_{-}|(\rho_{\Delta}-1)$. The following argument has the same structure as the proof of Theorem B.3, with the extra condition given by the step size. First, we note that for any $r$ such that $\lambda^{W}_{r} > 0$ we have
$$|1+\tau\rho_{-}| > |1+\tau\lambda^{W}_{+}| \geq |1+\tau\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})|,$$
since we required $\rho_{-} > \lambda^{W}_{+}$ in Equation (13). Conversely, if $\lambda^{W}_{r} < 0$, then
$$|1+\tau\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})| \leq \max\{|1+\tau\rho_{-}|,\ |1+\tau\lambda^{W}_{-}|\}.$$
Assume that $\tau|\lambda^{W}_{-}| > 1$, otherwise there is nothing to prove. Then $|1+\tau\rho_{-}| > \tau|\lambda^{W}_{-}| - 1$ if and only if $\tau|\lambda^{W}_{-}|(2-\rho_{\Delta}) < 2$, which is precisely the right inequality in Equation (13). We can then argue exactly as in the proof of Theorem B.3 to derive that for each index $r$ such that $\lambda^{W}_{r} < 0$ and $\lambda^{W}_{r} \neq \lambda^{W}_{-}$,
$$|1+\tau\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})| \leq \max\{|1+\tau|\lambda^{W}_{-,2}|(\rho_{\Delta}-1)|,\ |1+\tau|\lambda^{W}_{-}|(\lambda^{\Delta}_{n-2}-1)|\},$$
with $\lambda^{W}_{-,2}$ and $\lambda^{\Delta}_{n-2}$ defined in Equation (29). We can then introduce
$$\delta_{\mathrm{HFD}} := \max\Big\{\lambda^{W}_{+},\ \rho_{-} - |\lambda^{W}_{-}|\,\mathrm{gap}(\rho_{\Delta}\mathbf{I}-\Delta),\ \rho_{-} - (\rho_{\Delta}-1)\,\mathrm{gap}(|\lambda^{W}_{-}|\mathbf{I}+\mathbf{W}),\ |\lambda^{W}_{-}| - \frac{2}{\tau}\Big\}$$
and conclude that
$$\begin{aligned}
\mathbf{f}_{i}(m\tau) &= \sum_{r,\ell}\big(1+\tau\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})\big)^{m}\, c_{r,\ell}(0)\,\boldsymbol{\phi}^{\Delta}_{\ell}(i)\,\boldsymbol{\phi}^{W}_{r} \\
&= (1+\tau\rho_{-})^{m}\Bigg(c_{-,n-1}(0)\,\boldsymbol{\phi}^{\Delta}_{n-1}(i)\cdot\boldsymbol{\phi}^{W}_{-} + O\Big(\Big(\frac{1+\tau\delta_{\mathrm{HFD}}}{1+\tau\rho_{-}}\Big)^{m}\Big)\sum_{\ell,r:\,\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})\neq\rho_{-}} c_{r,\ell}(0)\,\boldsymbol{\phi}^{\Delta}_{\ell}(i)\,\boldsymbol{\phi}^{W}_{r}\Bigg) \\
&= (1+\tau\rho_{-})^{m}\, c_{-,n-1}(0)\,\boldsymbol{\phi}^{\Delta}_{n-1}(i)\cdot\boldsymbol{\phi}^{W}_{-} + O(\delta^{m}),
\end{aligned}$$
which completes the proof of Equation (14).
Conversely, if $\rho_{-} < \lambda^{W}_{+}$, then the projection onto the eigenspace spanned by $\boldsymbol{\phi}^{W}_{+}\otimes\boldsymbol{\phi}^{\Delta}_{0}$ dominates the dynamics, with exponential growth $(1+\tau\lambda^{W}_{+}(1-0))^{m}$. We can then adapt the very same argument above by factoring out the dominant term; once we note that, due to the choice of the symmetric normalized Laplacian $\Delta$, we have $\boldsymbol{\phi}^{\Delta}_{0}(i) = \sqrt{d_{i}}$, this yields Equation (15).

We can now also address the proof of Corollary 4.2.

Proof of Corollary 4.2. Once we have the node-wise expansion, we can simply compute the Rayleigh quotient of $\mathbf{I}_{d}\otimes\Delta$. We report the explicit details for the HFD case, since the argument for the LFD case extends without relevant modifications. Using Equation (12), we can compute the Dirichlet energy along a solution of $\mathbf{F}(t+\tau) = \mathbf{F}(t) + \tau\bar{\mathbf{A}}\mathbf{F}(t)\mathbf{W}$ satisfying Equation (13) as
$$\begin{aligned}
\mathcal{E}^{\mathrm{Dir}}(\mathbf{F}(m\tau)) &= \frac{1}{2}\sum_{r,\ell}\big(1+\tau\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})\big)^{2m}\, c^{2}_{r,\ell}(0)\,\lambda^{\Delta}_{\ell} \\
&= (1+\tau\rho_{-})^{2m}\Bigg(\frac{\rho_{\Delta}}{2}\sum_{r:\,\lambda^{W}_{r}=\lambda^{W}_{-}} c^{2}_{r,\rho_{\Delta}}(0) + O\Big(\Big(\frac{1+\tau\delta_{\mathrm{HFD}}}{1+\tau\rho_{-}}\Big)^{2m}\Big)\sum_{\ell,r:\,\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})\neq\rho_{-}} c^{2}_{r,\ell}(0)\,\lambda^{\Delta}_{\ell}\Bigg) \\
&= (1+\tau\rho_{-})^{2m}\,\frac{\rho_{\Delta}}{2}\,\lVert P^{\rho_{-}}_{W}\mathbf{F}(0)\rVert^{2} + O\Big(\Big(\frac{1+\tau\delta_{\mathrm{HFD}}}{1+\tau\rho_{-}}\Big)^{2m}\Big),
\end{aligned}$$
where $P^{\rho_{-}}_{W}$ is the orthogonal projector onto the eigenspace associated with the eigenvalue $\rho_{-} = |\lambda^{W}_{-}|(\rho_{\Delta}-1)$. In particular, since $\mathrm{vec}(\mathbf{F}(m\tau)) = (1+\tau\rho_{-})^{m}\,P^{\rho_{-}}_{W}\mathrm{vec}(\mathbf{F}(0)) + O(\delta^{m})$, we find that the dynamics is HFD, with $\mathrm{vec}(\mathbf{F}(t))/\lVert\mathrm{vec}(\mathbf{F}(t))\rVert$ converging to the normalized projection $P^{\rho_{-}}_{W}\mathrm{vec}(\mathbf{F}(0))/\lVert P^{\rho_{-}}_{W}\mathrm{vec}(\mathbf{F}(0))\rVert$, provided that such projection is non-zero, which is satisfied for a.e. initial condition $\mathbf{F}(0)$.
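The HFD regime can be illustrated concretely. In the sketch below (our construction, not from the paper's code) we take a triangle graph, for which $\rho_{\Delta} = 3/2$, and a diagonal $\mathbf{W}$ with eigenvalues $-2$ and $0.1$, so that $\rho_{-} = |\lambda^{W}_{-}|(\rho_{\Delta}-1) = 1 > \lambda^{W}_{+}$ and the step-size condition holds for $\tau = 0.1$; the normalized Dirichlet energy then converges to its maximum $\rho_{\Delta}/2$:

```python
import numpy as np

# Triangle graph (3-cycle, non-bipartite): Laplacian spectrum {0, 3/2, 3/2}.
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
deg = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(deg, deg))
Delta = np.eye(3) - A_bar
rho_Delta = np.linalg.eigvalsh(Delta).max()      # = 1.5

W = np.diag([-2.0, 0.1])                         # lambda_-^W = -2, lambda_+^W = 0.1
tau = 0.1                                        # tau * |lambda_-^W| * (2 - rho_Delta) = 0.1 < 2

def rayleigh(F):
    """Normalized Dirichlet energy E^Dir(F) / ||F||^2."""
    return 0.5 * np.trace(F.T @ Delta @ F) / (F ** 2).sum()

rng = np.random.default_rng(3)
F = rng.standard_normal((3, 2))
for _ in range(400):
    F = F + tau * A_bar @ F @ W                  # residual convolutional step
    F = F / np.linalg.norm(F)                    # renormalize to avoid overflow

# HFD: the Rayleigh quotient converges to its maximum rho_Delta / 2.
assert abs(rayleigh(F) - rho_Delta / 2) < 1e-8
```

The per-step renormalization leaves the Rayleigh quotient unchanged, so it does not affect the LFD/HFD classification.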

C.2 COMPARISON WITH EXISTING RESULTS: PROOFS

Proof of Theorem 4.3. If we drop the residual connection and simply consider $\mathbf{F}(t+\tau) = \tau\bar{\mathbf{A}}\mathbf{F}(t)\mathbf{W}$, then $\mathrm{vec}(\mathbf{F}(m\tau)) = (\tau\,\mathbf{W}\otimes\bar{\mathbf{A}})^{m}\mathrm{vec}(\mathbf{F}(0))$. Since $\mathsf{G}$ is not bipartite, the Laplacian spectral radius satisfies $\rho_{\Delta} < 2$. Therefore, for each pair of indices $(r,\ell)$ we have the bound
$$|\lambda^{W}_{r}(1-\lambda^{\Delta}_{\ell})| \leq \max\{\lambda^{W}_{+},\ |\lambda^{W}_{-}|\},$$
and the inequality becomes strict if $\ell > 0$, i.e. $\lambda^{\Delta}_{\ell} > 0$. The eigenvalues $\lambda^{W}_{+}$ and $\lambda^{W}_{-}$ are attained along the eigenvectors $\boldsymbol{\phi}^{W}_{+}\otimes\boldsymbol{\phi}^{\Delta}_{0}$ and $\boldsymbol{\phi}^{W}_{-}\otimes\boldsymbol{\phi}^{\Delta}_{0}$ respectively. Accordingly, the dominant terms of the evolution lie in the kernel of $\mathbf{I}_{d}\otimes\Delta$, meaning that for any $\mathbf{F}_{0}$ with non-zero projection onto $\ker(\mathbf{I}_{d}\otimes\Delta)$ (which is satisfied by all initial conditions except those belonging to a lower-dimensional subspace) the dynamics is LFD. In fact, assume without loss of generality that $|\lambda^{W}_{-}| > \lambda^{W}_{+}$; then
$$\mathrm{vec}(\mathbf{F}(m\tau)) = |\lambda^{W}_{-}|^{m}\sum_{r:\,\lambda^{W}_{r}=\lambda^{W}_{-}} (-1)^{m} c_{r,0}(0)\,\boldsymbol{\phi}^{W}_{-}\otimes\boldsymbol{\phi}^{\Delta}_{0} + |\lambda^{W}_{-}|^{m}\, O(\varphi(m))\Big(\mathbf{I}_{nd} - \sum_{r:\,\lambda^{W}_{r}=\lambda^{W}_{-}} (\boldsymbol{\phi}^{W}_{-}\otimes\boldsymbol{\phi}^{\Delta}_{0})(\boldsymbol{\phi}^{W}_{-}\otimes\boldsymbol{\phi}^{\Delta}_{0})^{\top}\Big)\mathrm{vec}(\mathbf{F}(0)),$$
with $\varphi(m) \to 0$ as $m \to \infty$, which completes the proof.

Gradient flow as spectral GNNs. We finally discuss Equation (11) from the perspective of spectral GNNs as in Balcilar et al. (2020). Let us assume that $\tilde{\mathbf{W}} = 0$ and $\boldsymbol{\Omega} = 0$. If we let $\Delta = \boldsymbol{\Phi}^{\Delta}\boldsymbol{\Lambda}^{\Delta}(\boldsymbol{\Phi}^{\Delta})^{\top}$ be the eigendecomposition of the graph Laplacian, let $\{\lambda^{W}_{r}\}$ be the spectrum of $\mathbf{W}$ with associated orthonormal basis of eigenvectors $\{\boldsymbol{\phi}^{W}_{r}\}$, and introduce $z^{r}(t): \mathsf{V}\to\mathbb{R}$ defined by $z^{r}_{i}(t) = \langle \mathbf{f}_{i}(t), \boldsymbol{\phi}^{W}_{r}\rangle$, then we can rewrite the discretized gradient flow as
$$z^{r}(t+\tau) = \boldsymbol{\Phi}^{\Delta}\big(\mathbf{I} + \tau\lambda^{W}_{r}(\mathbf{I}-\boldsymbol{\Lambda}^{\Delta})\big)(\boldsymbol{\Phi}^{\Delta})^{\top} z^{r}(t) = z^{r}(t) + \tau\lambda^{W}_{r}\bar{\mathbf{A}}z^{r}(t), \quad 1 \leq r \leq d. \tag{39}$$
Accordingly, for each projection onto the $r$-th eigenvector of $\mathbf{W}$, we have a spectral response in the graph frequency domain given by $\lambda^{\Delta} \mapsto 1 + \tau\lambda^{W}_{r}(1-\lambda^{\Delta})$. If $\lambda^{W}_{r} > 0$ we have a low-pass filter, while if $\lambda^{W}_{r} < 0$ we have a high-pass filter.
Moreover, we see that along the eigenvectors of $\mathbf{W}$, if $\lambda^{W}_{r} < 0$ then the dynamics is equivalent to flipping the sign of the edge weights, which offers a direct comparison with the methods proposed in Bo et al. (2021); Yan et al. (2021), where an 'attentive' mechanism is proposed to learn negative edge weights based on feature information. Equation (39) simply follows from
$$z^{r}_{i}(t+\tau) = \langle \mathbf{f}_{i}(t+\tau), \boldsymbol{\phi}^{W}_{r}\rangle = \langle \mathbf{f}_{i}(t) + \tau\mathbf{W}(\bar{\mathbf{A}}\mathbf{F}(t))_{i}, \boldsymbol{\phi}^{W}_{r}\rangle = z^{r}_{i}(t) + \tau\lambda^{W}_{r}\sum_{j}\bar{a}_{ij}\, z^{r}_{j}(t),$$
which concludes the derivation of Equation (39).
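The decoupling in Equation (39) can be verified directly: projecting the full $d$-dimensional update onto an eigenvector of $\mathbf{W}$ yields an independent scalar recursion per channel eigenvalue. A small sketch (our own toy example):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, tau = 5, 3, 0.2

A = np.ones((n, n)) - np.eye(n)                  # complete graph K5
deg = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(deg, deg))

S = rng.standard_normal((d, d))
W = 0.5 * (S + S.T)                              # symmetric channel mixing
lam_W, Phi_W = np.linalg.eigh(W)

F = rng.standard_normal((n, d))
F_next = F + tau * A_bar @ F @ W                 # one gradient-flow layer

# Equation (39): each projection z^r = F phi_r^W evolves independently.
for r in range(d):
    phi_r = Phi_W[:, r]
    z_r = F @ phi_r
    assert np.allclose(F_next @ phi_r, z_r + tau * lam_W[r] * (A_bar @ z_r))

# Spectral response 1 + tau * lam_W * (1 - lam_Delta): decreasing in the graph
# frequency for positive lam_W (low-pass), increasing for negative (high-pass).
lams = np.linspace(0.0, 2.0, 50)
resp_pos = 1 + tau * 1.0 * (1 - lams)
resp_neg = 1 + tau * (-1.0) * (1 - lams)
assert np.all(np.diff(resp_pos) < 0) and np.all(np.diff(resp_neg) > 0)
```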

D PROOFS AND ADDITIONAL DETAILS OF SECTION 5

Proof of Theorem 5.1. First we check that, if time is continuous, then $\mathcal{E}_{\theta}$ in Equation (35) is decreasing. We use the Kronecker product formalism to rewrite the gradient $\nabla_{\mathbf{F}}\mathcal{E}_{\theta}(\mathbf{F})$ as a vector in $\mathbb{R}^{nd}$: explicitly,
$$\nabla_{\mathbf{F}}\mathcal{E}_{\theta}(\mathbf{F}) = (\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\mathrm{vec}(\mathbf{F}) + (\tilde{\mathbf{W}}\otimes\mathbf{I}_{n})\mathrm{vec}(\mathbf{F}(0)).$$
It then follows that
$$\frac{d\mathcal{E}_{\theta}(\mathbf{F}(t))}{dt} = (\nabla_{\mathbf{F}}\mathcal{E}_{\theta}(\mathbf{F}(t)))^{\top}\mathrm{vec}(\dot{\mathbf{F}}(t)) = (\nabla_{\mathbf{F}}\mathcal{E}_{\theta}(\mathbf{F}(t)))^{\top}\sigma\big(-\nabla_{\mathbf{F}}\mathcal{E}_{\theta}(\mathbf{F}(t))\big).$$
If we introduce the notation $\mathbf{Z}(t) = -\nabla_{\mathbf{F}}\mathcal{E}_{\theta}(\mathbf{F}(t))$, then we can rewrite the derivative as
$$\frac{d\mathcal{E}_{\theta}(\mathbf{F}(t))}{dt} = -\mathbf{Z}(t)^{\top}\sigma(\mathbf{Z}(t)) = -\sum_{\alpha}\mathbf{Z}(t)_{\alpha}\,\sigma(\mathbf{Z}(t)_{\alpha}) \leq 0,$$
by the assumption on $\sigma$. The discrete case follows similarly. With the same notation, we write $\mathbf{F}(t+\tau) = \mathbf{F}(t) + \tau\sigma(\mathbf{Z}(t))$, with $\mathbf{Z}(t) = -\nabla_{\mathbf{F}}\mathcal{E}_{\theta}(\mathbf{F}(t))$. Then
$$\begin{aligned}
\mathcal{E}_{\theta}(\mathbf{F}(t+\tau)) &= \langle \mathrm{vec}(\mathbf{F}(t+\tau)),\ \tfrac12(\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\mathrm{vec}(\mathbf{F}(t+\tau)) + (\tilde{\mathbf{W}}\otimes\mathbf{I}_{n})\mathrm{vec}(\mathbf{F}(0))\rangle \\
&= \langle \mathrm{vec}(\mathbf{F}(t)) + \tau\sigma(\mathbf{Z}(t)),\ \tfrac12(\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\big(\mathrm{vec}(\mathbf{F}(t)) + \tau\sigma(\mathbf{Z}(t))\big) + (\tilde{\mathbf{W}}\otimes\mathbf{I}_{n})\mathrm{vec}(\mathbf{F}(0))\rangle \\
&= \langle \mathrm{vec}(\mathbf{F}(t)),\ \tfrac12(\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\mathrm{vec}(\mathbf{F}(t)) + (\tilde{\mathbf{W}}\otimes\mathbf{I}_{n})\mathrm{vec}(\mathbf{F}(0))\rangle \\
&\quad + \tau\langle \mathrm{vec}(\mathbf{F}(t)),\ \tfrac12(\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\sigma(\mathbf{Z}(t))\rangle \\
&\quad + \tau\langle \sigma(\mathbf{Z}(t)),\ \tfrac12(\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\mathrm{vec}(\mathbf{F}(t)) + (\tilde{\mathbf{W}}\otimes\mathbf{I}_{n})\mathrm{vec}(\mathbf{F}(0))\rangle \\
&\quad + \tau^{2}\langle \sigma(\mathbf{Z}(t)),\ \tfrac12(\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\sigma(\mathbf{Z}(t))\rangle.
\end{aligned}$$
By using that $\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}}$ is symmetric, we find
$$\begin{aligned}
\mathcal{E}_{\theta}(\mathbf{F}(t+\tau)) &= \mathcal{E}_{\theta}(\mathbf{F}(t)) + \tau\langle \sigma(\mathbf{Z}(t)),\ (\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\mathrm{vec}(\mathbf{F}(t)) + (\tilde{\mathbf{W}}\otimes\mathbf{I}_{n})\mathrm{vec}(\mathbf{F}(0))\rangle \\
&\quad + \tau^{2}\Big\langle \tfrac{1}{\tau}\big(\mathbf{F}(t+\tau)-\mathbf{F}(t)\big),\ \tfrac12(\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})\tfrac{1}{\tau}\big(\mathbf{F}(t+\tau)-\mathbf{F}(t)\big)\Big\rangle \\
&= \mathcal{E}_{\theta}(\mathbf{F}(t)) - \tau\langle \sigma(\mathbf{Z}(t)), \mathbf{Z}(t)\rangle + \langle \mathbf{F}(t+\tau)-\mathbf{F}(t),\ \tfrac12(\boldsymbol{\Omega}\otimes\mathbf{I}_{n} - \mathbf{W}\otimes\bar{\mathbf{A}})(\mathbf{F}(t+\tau)-\mathbf{F}(t))\rangle \\
&\leq \mathcal{E}_{\theta}(\mathbf{F}(t)) + C_{+}\lVert\mathbf{F}(t+\tau)-\mathbf{F}(t)\rVert^{2},
\end{aligned}$$
where again we have used that $\mathbf{Z}^{\top}\sigma(\mathbf{Z}) \geq 0$. This completes the proof.

We perform an ablation study to further corroborate the behaviour seen in Figure 3. For heterophilic datasets we used the splits from Pei et al. (2020). For homophilic datasets we used the methodology in Shchur et al.
(2018): each split randomly selects 1,500 nodes for the development set; from the development set, 20 nodes per class are taken as the training set and the remainder are allocated to the validation set. The nodes outside the development set are used as the test set. This gives a lower percentage (3-6%) of training nodes. This approach was taken because less training information is needed in the homophilic setting, and performance can become less sensitive to other factors, meaning less signal from the controlled variable. We tested the structures of W against the real-world datasets with known homophily; again neg-prod outperforms prod in the heterophilic setting and vice versa, due to the sign of their spectra. To validate the complexity analysis in Section 6 we performed a runtime ablation for the models between standard GCN and GRAFF described in the GCN ablation of Figure 4. The average inference runtime over 100 runs for 1 split of Cora was recorded. We also include runtimes for the provided dense and sparse implementations of GGCN Yan et al. (2021). Adding the encoder/decoder (step 1) speeds up the model due to dimensionality reduction. Subsequent steps also reduce complexity and offer speedup, with GRAFF performing the fastest. Most MPNNs struggle significantly, with the partial exception of MixHop, which accounts directly for 2-hop information at each layer, resulting in worse complexity than GRAFF. The reason why MixHop is more competitive than other MPNNs is that these datasets have a strong 'monophily' type of bias, such that architectures that directly access the more homophilic 2-hop neighbourhood are at an advantage; indeed, it has already been noted that LINK is meant to work well in this setting (Altenburger & Ugander, 2018). We also note that the labelling in arXiv-year was introduced in Lim et al. (2021). We point out that GRAFF manages to stay competitive with other MPNNs and almost consistently outperforms them.
This confirms the validity of our theoretical analysis and motivates investigating energy functionals that would allow incorporating higher-order terms, as in MixHop, in a gradient flow framework. We also note that, by losing the inductive bias of message passing, LINKX is not an optimal framework for dealing with homophilic graphs, as reported in Lim et al. (2021); on the other hand, an advantage of the gradient flow graph convolutional equations is that they manage to adapt to the underlying homophily of the graph in a way that is provable and justifiable. The latter point is also in stark contrast with LINKX, which instead works as a 'black box' on both adjacency and features.

Remark. Both the arXiv-year and snap-patents graphs come as directed, and this information is essential for the task, as already noted in Lim et al. (2021). Strictly speaking, by using the directed (and hence non-symmetric) adjacency matrix $\bar{\mathbf{A}}$ in Equation (17) and Equation (18), we no longer have a gradient flow, due to the lack of symmetry of the gradient of the quadratic energy $\mathcal{E}_{\theta}$, as noted in Appendix B.1. It is still interesting to observe that symmetrizing and sharing the weights have enabled our framework to beat both GCN and GCNII by a margin, even though we no longer have a gradient flow. We reserve a more thorough investigation of gradient flows for directed, non-symmetric relations to future work.

E.7 FURTHER COMPARISONS WITH BASELINES

In this subsection we provide a further comparison with recent baselines (Luan et al., 2021; Lingam et al., 2021; Maurya et al., 2021) that specifically target heterophily. We report their best numbers without reproducing them. A few comments are in order. First, we have reported the best numbers over all (several) configurations of the reported baselines. We note that, despite its simplicity, our framework is very competitive on the small heterophilic graphs and stronger on the larger dataset Film as well as on all the homophilic ones. On the other hand, HLP and FSGNN are much stronger baselines on Squirrel and Chameleon. This is mainly because both architectures handle the graphs as directed without self-loops, which helps performance massively and, we suspect, effectively reduces the heterophily of the graph, hence making the task easier. We reserve the investigation of heterophily in the context of directed graphs for future work.

E.8 SPECTRAL PROPERTIES OF THE LEARNT CHANNEL-MIXING

In Table 9 we report a few spectral properties of the channel-mixing $\mathbf{W}$ learnt on 4 real datasets (two homophilic and two heterophilic), along with the value of the normalized Dirichlet energy $\mathbf{X}\mapsto\mathcal{E}^{\mathrm{Dir}}(\mathbf{X})/\lVert\mathbf{X}\rVert^{2}$ used for characterizing LFD and HFD dynamics. Some important comments are in order. The quantities $\mathcal{E}^{\mathrm{Dir}}_{\mathrm{in}}$ and $\mathcal{E}^{\mathrm{Dir}}_{\mathrm{fin}}$ are the normalized Dirichlet energies of the encoded node features before and after the diffusion layers respectively; recall that these are values between 0 and 2 that we introduced in Section 3 to characterize whether a given GNN is mostly smoothing or not. $\lambda^{W}_{+}/|\lambda^{W}_{-}|$ is instead the ratio between the most positive and the most negative eigenvalue. (i) We note how the initial encoder provides node-wise features that are increasingly less smooth depending on the underlying graph heterophily, as we read from $\mathcal{E}^{\mathrm{Dir}}_{\mathrm{in}}$. (ii) The model learns to adapt to the underlying homophily, as we see from the fact that the normalized Dirichlet energy decreases much more after the message passing for Cora and Citeseer. (iii) The ratio $\lambda^{W}_{+}/|\lambda^{W}_{-}|$, which is a partial indicator of the amount of attraction vs. repulsion exerted by the diffusion, is also an indicator of performance, as we read when comparing the case of Cora with the other datasets. (iv) As we increase the heterophily, the most negative eigenvalue increases in absolute value, since most likely more repulsion is needed. We argue that the model might benefit from inducing fast attraction along some directions and fast repulsion along others, since on heterophilic graphs we have larger positive and negative eigenvalues. Connected to this last point, we note that at the end of the architecture we have a decoder that can also potentially discard some node-feature projections, hence keeping only the smoother (or less smooth) components.



Our arguments extend trivially to the degenerate case.




Figure 1: Gradient flow dynamics: attractive and repulsive forces lead to a process able to separate heterophilic labels.

Figure 2: Synthetic experiments with controlled homophily.

$\mathcal{E}^{\mathrm{Dir}}(\mathbf{F}(t))/\lVert\mathbf{F}(t)\rVert^{2}$. This is the Rayleigh quotient of $\mathbf{I}_{d}\otimes\Delta$, and so it satisfies $0 \leq \mathcal{E}^{\mathrm{Dir}}(\mathbf{F})/\lVert\mathbf{F}\rVert^{2} \leq \rho_{\Delta}/2$ (see Appendix A.2). If the normalized Dirichlet energy is approaching its minimum, then the lowest-frequency component is dominating, whereas if it is converging to its maximum, then the dynamics is dominated by the highest frequencies. This allows us to introduce the following definition.
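The stated bound on the normalized Dirichlet energy is easy to verify numerically (our own sketch, on a toy path graph):

```python
import numpy as np

# Path graph on 4 nodes; the Rayleigh quotient of I_d x Delta is bounded by rho_Delta / 2.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
deg = A.sum(axis=1)
Delta = np.eye(4) - A / np.sqrt(np.outer(deg, deg))
rho_Delta = np.linalg.eigvalsh(Delta).max()

rng = np.random.default_rng(5)
for _ in range(100):
    F = rng.standard_normal((4, 3))
    rq = 0.5 * np.trace(F.T @ Delta @ F) / (F ** 2).sum()
    assert -1e-12 <= rq <= rho_Delta / 2 + 1e-12
```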

Models like CGNN, GRAND and PDE-GCN$_{\mathrm{D}}$ are never HFD.

where an encoder $\psi_{\mathrm{EN}}: \mathbb{R}^{n\times p}\to\mathbb{R}^{n\times d}$ processes the input features $\mathbf{F}_{0}$ and the prediction $\psi_{\mathrm{DE}}(\mathbf{F}(T))$ is produced by a decoder $\psi_{\mathrm{DE}}: \mathbb{R}^{n\times d}\to\mathbb{R}^{n\times k}$. Here, $k$ is the number of label classes, $T = m\tau$ is the integration time, and $m$ is the number of layers. We note that (i) non-linear activations can be included in $\psi_{\mathrm{EN}}, \psi_{\mathrm{DE}}$, making the entire model non-linear; (ii) since the framework is residual, even if the message passing is linear, this is not equivalent to collapsing the dynamics into a single layer with diffusion matrix $\bar{\mathbf{A}}^{m}$ as done in Wu et al. (2019); see Equation (37) in the Appendix.
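A minimal sketch of this pipeline (our own illustration with random, untrained weights; linear $\psi_{\mathrm{EN}}/\psi_{\mathrm{DE}}$ and $m$ residual layers of the linear gradient-flow update without source term):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, d, k, m, tau = 6, 10, 4, 3, 8, 0.1

# Toy graph: a 6-cycle.
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.
deg = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(deg, deg))

# Shared, symmetric channel matrices (the gradient-flow constraint).
S1, S2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Omega, W = 0.5 * (S1 + S1.T), 0.5 * (S2 + S2.T)
W_en = rng.standard_normal((p, d))               # psi_EN: R^{n x p} -> R^{n x d}
W_de = rng.standard_normal((d, k))               # psi_DE: R^{n x d} -> R^{n x k}

def graff_forward(F0_raw):
    F = F0_raw @ W_en                            # encode
    for _ in range(m):                           # m residual diffusion layers
        F = F + tau * (-F @ Omega + A_bar @ F @ W)
    return F @ W_de                              # decode to class scores

out = graff_forward(rng.standard_normal((n, p)))
assert out.shape == (n, k)
```

The weight sharing across layers is what makes the discrete model a genuine Euler discretization of a single autonomous gradient flow.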

… ± 4.55 86.86 ± 3.29 85.68 ± 6.63 37.54 ± 1.56 55.17 ± 1.58 71.14 ± 1.84 77.14 ± 1.45 89.15 ± 0.37 87.95 ± 1.05
GPRGNN 78.38 ± 4.36 82.94 ± 4.21 80.27 ± 8.11 34.63 ± 1.22 31.61 ± 1.24 46.58 ± 1.71 77.13 ± 1.67 87.54 ± 0.38 87.95 ± 1.18
H2GCN 84.86 ± 7.23 87.65 ± 4.98 82.70 ± 5.28 35.70 ± 1.00 36.48 ± 1.86 60.11 ± 2.15 77.11 ± 1.57 89.49 ± 0.38 87.87 ± 1.20
GCNII 77.57 ± 3.83 80.39 ± 3.40 77.86 ± 3.79 37.44 ± 1.30 38.47 ± 1.58 63.86 ± 3.04 77.33 ± 1.48 90.15 ± 0.43 88.37 ± 1.25
Geom-GCN 66.76 ± 2.72 64.51 ± 3.66 60.54 ± 3.67 31.59 ± 1.15 38.15 ± 0.92 60.00 ± 2.81 78.02 ± 1.15 89.95 ± 0.47 85.35 ± 1.57
PairNorm 60.27 ± 4.34 48.43 ± 6.14 58.92 ± 3.15 27.40 ± 1.24 50.44 ± 2.04 62.74 ± 2.82 73.59 ± 1.47 87.53 ± 0.44 85.79 ± 1.01
GraphSAGE 82.43 ± 6.14 81.18 ± 5.56 75.95 ± 5.01 34.23 ± 0.99 41.61 ± 0.74 58.73 ± 1.68 76.04 ± 1.30 88.45 ± 0.50 86.90 ± 1.04
Sheaf (max) 85.95 ± 5.51 89.41 ± 4.74 84.86 ± 4.71 37.81 ± 1.15 56.34 ± 1.32 68.04 ± 1.58 76.70 ± 1.57 89.49 ± 0.40 86.90 ± 1.13
GRAFF 88.38 ± 4.53 88.83 ± 3.29 84.05 ± 6.10 37.11 ± 1.08 58.72 ± 0.84 71.08 ± 1.75 77.30 ± 1.85 90.04 ± 0.41 88.01 ± 1.03
GRAFFNL 86.49 ± 4.84 87.26 ± 2.52 77.30 ± 3.24 35.96 ± 0.95 59.01 ± 1.31 71.38 ± 1.47 76.81 ± 1.12 89.81 ± 0.50 87.81 ± 1.13
Node-classification results. Top three models are coloured by First, Second, Third. Models like GRAFF are more flexible and outperform GCN at low homophily, confirming Theorem 4.3, where we have shown that without a residual connection convolutional models are LFD irrespectively of the spectrum of W; further results in Figure 4.

Summary of properties of the synthetic Cora dataset. All other datasets are provided as undirected but without self-loops. Each split uses 48/32/20% of the nodes for the training, validation and test sets respectively. Table 6 summarises each of the datasets.

Summary of properties of real-world datasets. All LCC except *

Ablation with controlled spectrum of W on real-world datasets

Performance on larger heterophilic datasets. (M) stands for out of memory.

ACMbest 87.84 ± 3.87 88.43 ± 3.22 85.95 ± 5.64 36.89 ± 1.18 54.4 ± 1.88 67.08 ± 2.04 77.15 ± 1.45 90.00 ± 0.52 88.01 ± 1.08
HLPbest 87.57 ± 5.44 86.67 ± 4.22 84.05 ± 4.67 34.59 ± 1.32 74.17 ± 1.83 77.48 ± 0.80 NA NA NA
FSGNN 87.30 ± 5.55 88.43 ± 3.22 87.03 ± 5.77 35.67 ± 0.69 73.48 ± 2.13 78.14 ± 1.25 77.19 ± 1.35 89.73 ± 0.39 87.73 ± 1.36
GRAFF 88.38 ± 4.53 88.83 ± 3.29 84.05 ± 6.10 37.11 ± 1.08 58.72 ± 0.84 71.08 ± 1.75 77.30 ± 1.85 90.04 ± 0.41 88.01 ± 1.03
GRAFFNL 86.49 ± 4.84 87.26 ± 2.52 77.30 ± 3.24 35.96 ± 0.95 59.01 ± 1.31 71.38 ± 1.47 76.81 ± 1.12 89.81 ± 0.50 87.81 ± 1.13

Node-classification results. Test accuracy 87.69 ± 0.86 76.75 ± 1.33 70.44 ± 1.47 59.60 ± 1.49

Spectral analysis.


To further support the principle that the effects induced by $\mathbf{W}$ are similar even in this non-linear setting, we consider a simplified scenario.

Lemma D.1. If we choose $\boldsymbol{\Omega} = \mathbf{W} = \mathrm{diag}(\boldsymbol{\omega})$ with $\omega_{r} \leq 0$ for $1 \leq r \leq d$ and $\tilde{\mathbf{W}} = 0$, i.e. $t \mapsto \mathbf{F}(t)$ solves the dynamical system
$$\dot{\mathbf{F}}(t) = \sigma\big(-\Delta\mathbf{F}(t)\,\mathrm{diag}(\boldsymbol{\omega})\big),$$
with $x\sigma(x) \geq 0$, then the standard graph Dirichlet energy satisfies $\frac{d\mathcal{E}^{\mathrm{Dir}}(\mathbf{F}(t))}{dt} \geq 0$.

Proof. This again simply follows from directly computing the derivative.

Important consequence: the previous Lemma implies that, even with non-linear activations, negative eigenvalues of the channel-mixing induce repulsion, and indeed the solution becomes less smooth, as measured by the classical Dirichlet energy increasing along the evolution. Generalising this result to more arbitrary choices is not immediate, and we reserve this for future work.
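Lemma D.1 can also be probed numerically; the sketch below (our own toy example, with $\sigma = \tanh$ so that $x\sigma(x) \geq 0$, and all $\omega_{r} \leq 0$) integrates the non-linear system with explicit Euler and verifies that the Dirichlet energy never decreases:

```python
import numpy as np

# Small connected toy graph on 4 nodes.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])
deg = A.sum(axis=1)
Delta = np.eye(4) - A / np.sqrt(np.outer(deg, deg))

omega = np.array([-1.0, -0.5, -2.0])             # all non-positive channel weights

def dirichlet(F):
    return 0.5 * np.trace(F.T @ Delta @ F)

rng = np.random.default_rng(7)
F = rng.standard_normal((4, 3))
tau = 0.01
energies = [dirichlet(F)]
for _ in range(300):
    F = F + tau * np.tanh(-Delta @ F @ np.diag(omega))   # Euler step of Lemma D.1
    energies.append(dirichlet(F))

# Negative channel eigenvalues induce repulsion: the energy is non-decreasing.
assert all(b >= a - 1e-10 for a, b in zip(energies, energies[1:]))
```

Here the discrete monotonicity follows for any step size, since both the first-order term and the quadratic correction (with $\Delta$ positive semi-definite) are non-negative.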

E ADDITIONAL DETAILS ON EXPERIMENTS

E.1 GENERAL EXPERIMENTAL DETAILS

GRAFF is implemented in PyTorch Paszke et al. (2019), using PyTorch Geometric Fey & Lenssen (2019) and torchdiffeq Chen et al. (2018). Code and instructions to reproduce the experiments are available on GitHub. Hyperparameters were tuned using wandb Biewald (2020) and random grid search. Experiments were run on AWS p2.8xlarge machines, each with 8 Tesla V100-SXM2 GPUs.

Methodology.

Throughout the experiments and ablations we rely on the following parameterisations. We implement $\psi_{\mathrm{EN}}, \psi_{\mathrm{DE}}$ as single linear layers or MLPs, and we set $\boldsymbol{\Omega}$ to be diagonal. For the real-world experiments we consider diagonally-dominant (DD) and diagonal (D) choices for the structure of $\mathbf{W}$ that offer explicit control over its spectrum. In the (DD) case, we consider a symmetric $\mathbf{W}_{0}\in\mathbb{R}^{d\times d}$ with zero diagonal and $\mathbf{w}\in\mathbb{R}^{d}$ defined by $w_{\alpha} = q_{\alpha}\sum_{\beta}|W_{0,\alpha\beta}| + r_{\alpha}$, and set $\mathbf{W} = \mathrm{diag}(\mathbf{w}) + \mathbf{W}_{0}$. By the Gershgorin Theorem, the eigenvalues of $\mathbf{W}$ belong to $\bigcup_{\alpha}\big[w_{\alpha} - \sum_{\beta}|W_{0,\alpha\beta}|,\ w_{\alpha} + \sum_{\beta}|W_{0,\alpha\beta}|\big]$, so the model 'can' easily re-distribute mass in the spectrum of $\mathbf{W}$ via $q_{\alpha}, r_{\alpha}$. This generalizes the decomposition of $\mathbf{W}$ in Chen et al. (2020), providing a justification in terms of its spectrum. For (D) we take $\mathbf{W}$ to be diagonal. To investigate the role of the spectrum of $\mathbf{W}$ on synthetic graphs, we construct three additional variants, named sum, prod and neg-prod respectively: $\mathbf{W} = \mathbf{W}' + \mathbf{W}'^{\top}$, $\mathbf{W} = \mathbf{W}'^{\top}\mathbf{W}'$ and $\mathbf{W} = -\mathbf{W}'^{\top}\mathbf{W}'$, where the prod (neg-prod) variant has only non-negative (non-positive) eigenvalues.
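The (DD) parameterisation and its Gershgorin bound can be sketched as follows (our own code; the choice of $q, r$ is arbitrary and only illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
d = 6

S = rng.standard_normal((d, d))
W0 = 0.5 * (S + S.T)
np.fill_diagonal(W0, 0.0)                        # symmetric, zero diagonal

R = np.abs(W0).sum(axis=1)                       # Gershgorin radii
q = rng.uniform(-1, 1, d)
r = rng.uniform(-1, 1, d)
w = q * R + r                                    # w_a = q_a * sum_b |W0_ab| + r_a
W = np.diag(w) + W0

eigs = np.linalg.eigvalsh(W)
# All eigenvalues lie in the union of the Gershgorin discs [w_a - R_a, w_a + R_a].
assert eigs.min() >= (w - R).min() - 1e-10
assert eigs.max() <= (w + R).max() + 1e-10

# With q = 1, r = 0 the matrix is diagonally dominant with non-negative
# diagonal, hence positive semi-definite.
W_psd = np.diag(R) + W0
assert np.linalg.eigvalsh(W_psd).min() >= -1e-10
```

Shifting $q$ and $r$ thus moves the admissible spectral interval, giving the model direct control over how much attraction (positive part) versus repulsion (negative part) it can express.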

E.2 ABLATION STUDIES ON SYNTHETIC HOMOPHILY

Synthetic experiments and ablation studies. To investigate our claims in a controlled environment, we use the synthetic Cora dataset of (Zhu et al., 2020, Appendix G). Graphs are generated for target levels of homophily via preferential attachment; see Appendix E.3 for details. Figure 3 confirms the spectral analysis and offers a better understanding in terms of performance and smoothness of the predictions. Each curve (except GCN) represents one variant of $\mathbf{W}$ as in 'methodology', and we implement Equation (11) with $\tilde{\mathbf{W}} = 0$, $\boldsymbol{\Omega} = 0$. Figure 3 (top) reports the test accuracy vs. true label homophily. Neg-prod is better than prod at low homophily and vice versa at high homophily. This confirms Theorem 4.1, where we have shown that the gradient flow can lead to HFD dynamics, which is generally desirable at low homophily, through the negative eigenvalues of $\mathbf{W}$. Conversely, the prod configuration (where we have an attraction-only dynamics) struggles in low-homophily scenarios even though a residual connection is present. Both prod and neg-prod are 'extreme' choices and serve the purpose of highlighting that turning off one side of the spectrum can be more or less damaging depending on the underlying homophily. In general, 'neutral' variants like sum and (DD) are indeed more flexible and better performing. In fact, (DD) outperforms GCN especially in low-homophily scenarios, confirming Theorem 4.3, where we have shown that without a residual connection convolutional models are LFD, and hence more sensitive to the underlying homophily, irrespectively of the spectrum of $\mathbf{W}$. This is further confirmed in Figure 4. In Figure 3 (bottom) we compute the homophily of the prediction (cross) for a given method and compare it with the homophily (circle) of the prediction read from the encoding (i.e.
graph-agnostic). The homophily here is a proxy to assess whether the evolution is smoothing, the goal being to explain the smoothness of the prediction via the spectrum of $\mathbf{W}$, as per our theoretical analysis. For neg-prod, the homophily after the evolution is lower than that of the encoding, supporting the analysis that negative eigenvalues of $\mathbf{W}$ enhance the high frequencies. The opposite behaviour occurs in the case of prod and explains why, in the low-homophily regime, prod under-performs, the prediction being smoother than the true homophily. The (DD) and sum variants adapt better to the true homophily. We note how the encoding compensates when the dynamics can only either attract or repulse (i.e. when the spectrum of $\mathbf{W}$ has a sign) by decreasing or increasing the initial homophily respectively. The synthetic Cora dataset is provided by (Zhu et al., 2020, Appendix G). They use a modified preferential attachment process to generate graphs for target levels of homophily. Nodes, edges and features are sampled from Cora proportionally to a mix of class compatibility and node degree, resulting in a graph with the required homophily and an appropriate feature/label distribution. To validate the provided data before use, we provide Table 2 summarising the properties of the synthetic Cora dataset. All rows/levels of homophily have the same number of nodes (1,490), edges (5,936), features (1,433) and classes (5).
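For reference, edge homophily, the quantity controlled in the synthetic dataset, is the fraction of edges joining equally-labelled nodes; a minimal sketch (our own, on a hypothetical toy graph):

```python
import numpy as np

def edge_homophily(A, labels):
    """Fraction of (undirected) edges whose endpoints share a label."""
    src, dst = np.nonzero(np.triu(A))
    return float(np.mean(labels[src] == labels[dst]))

# Toy example: two triangles joined by one edge, labelled by triangle.
A = np.zeros((6, 6))
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.
labels = np.array([0, 0, 0, 1, 1, 1])

# 6 of the 7 edges are intra-class.
assert abs(edge_homophily(A, labels) - 6 / 7) < 1e-12
```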

E.3 ADDITIONAL DETAILS ON SYNTHETIC ABLATION STUDIES

As well as the ablation shown in Figure 3, we used this dataset to perform an ablation using GCN as the baseline. We assess the impact of each of the steps necessary to augment a standard GCN model into GRAFF. This involves 5 steps: 1) add an encoder/decoder; 2) add a residual connection; 3) share the weights of W and Ω across time/layers; 4) symmetrize W and Ω; 5) remove the non-linearity between layers. The results are shown in Figure 4 and corroborate Theorem 4.1 in that adding a residual term is beneficial, especially in low-homophily scenarios. We also note that augmentations 3, 4 and 5 are not "costly" in terms of performance.

E.4 ADDITIONAL DETAILS ON REAL-WORLD ABLATION STUDIES

For the real-world experiments in Table 1 we performed 10 repetitions over the splits taken from Pei et al. (2020). For all datasets we used the largest connected component (LCC), apart from Citeseer, where the 5th and 6th splits are LCC and the others require the full dataset. For Chameleon and Squirrel we added self-loops and made the edges undirected as a preprocessing step. Using wandb Biewald (2020), we performed a random grid search with uniform sampling of the continuous variables. We provide the hyperparameters that achieved the best results from the random grid search in Table 5. An implementation that uses these hyperparameters is available in the provided code, with hyperparameters provided in graff_params.py. Input dropout and dropout are the rates applied to the encoder and decoder respectively, with no dropout applied in the ODE block. Further hyperparameters decide the use of non-linearities, batch normalisation, the parameter vector ω and the source-term multiplier β, which are specified in the code. Both datasets concern predicting 'publication time' in a citation network. We report baselines as in Lim et al. (2021) and the best performing number among GRAFF and GRAFF NL in Table 7. Before commenting on the results, we observe that LINK (Zheleva & Getoor, 2009) is a method that only acts on the input adjacency and completely ignores the node features. The facts that LINK is such a strong baseline on these datasets and that the MLP is instead very poor denote that the features actually carry very little valuable information. Indeed, most of the MPNNs struggle significantly.

