AFFINITY-AWARE GRAPH NETWORKS

Abstract

Graph Neural Networks (GNNs) have emerged as a powerful technique for learning on relational data. Owing to the relatively limited number of message passing steps they perform (and hence their smaller receptive field), there has been significant interest in improving their expressivity by incorporating structural aspects of the underlying graph. In this paper, we explore the use of affinity measures as features in graph neural networks, in particular measures arising from random walks, including effective resistance, hitting times, and commute times. We propose message passing networks based on these features and evaluate their performance on a variety of node and graph property prediction tasks. Our architecture has low computational complexity, and our features are invariant to permutations of the underlying graph. The measures we compute allow the network to exploit the connectivity properties of the graph, thereby allowing us to outperform relevant benchmarks on a wide variety of tasks, often with significantly fewer message passing steps. On one of the largest publicly available graph regression datasets, OGB-LSC-PCQM4Mv1, we obtain the best known single-model validation MAE at the time of writing.

1. INTRODUCTION

Graph Neural Networks (GNNs) constitute a powerful tool for learning meaningful representations in non-Euclidean domains. GNN models have achieved significant successes in a wide variety of node prediction Hamilton et al. (2017); Luan et al. (2019), link prediction Zhang & Chen (2018); You et al. (2019), and graph prediction Duvenaud et al. (2015); Ying et al. (2019) tasks. These tasks naturally emerge in a wide range of applications, including autonomous driving Chen et al. (2019), neuroimaging Parisot et al. (2018), combinatorial optimization Gasse et al. (2019); Nair et al. (2020), and recommender systems Ying et al. (2018), and they have enabled significant scientific advances in the fields of biomedicine Wang et al. (2021a), structural biology Jumper et al. (2021), molecular chemistry Stokes et al. (2020), and physics Bapst et al. (2020). Despite the predictive power of GNNs, it is known that the expressive power of standard GNNs is limited by the 1-Weisfeiler-Lehman (1-WL) test Xu et al. (2018). Intuitively, GNNs possess at most the same power as the 1-WL test in terms of distinguishing between non-isomorphic (sub-)graphs, while having the added benefit of adapting to the given data distribution. For some architectures, two nodes with different local structures have the same computational graph, thus thwarting distinguishability in a standard GNN. Even though some attempts have been made to address this limitation with higher-order GNNs Morris et al. (2019), most traditional GNN architectures fail to distinguish between such nodes. A common approach to improving the expressive power of GNNs involves encoding richer structural/positional properties. For example, distance-based approaches form the basis for works such as Position-aware Graph Neural Networks You et al. (2019), which capture positions/locations of nodes with respect to a set of anchor nodes, as well as Distance Encoding Networks Li et al. (2020), which use the first few powers of the normalized adjacency matrix as node features associated with a set of target nodes. Here, we take an approach that is inspired by this line of work but departs from it in some crucial ways: we seek to capture both distance and connectivity information using general-purpose node and edge features, without the need to specify any anchor or target nodes.

Contributions:

We propose the use of affinity measures as features in a graph neural network. Specifically, we consider statistics that arise from random walks in graphs, such as the hitting time and commute time between pairs of vertices (see Sections 3.1 and 3.2). We present a means of incorporating these statistics as scalar edge features in a message passing neural network (MPNN) Gilmer et al. (2017) (see Section 3.4). In addition to these scalar features, we present richer vector-valued resistive embeddings (see Section 3.3), which can be incorporated as node or edge feature vectors in the network. Resistive embeddings are a natural way of embedding each node into Euclidean space such that the squared L2-distance between nodes recovers the commute time. We show that such embeddings can be efficiently approximated, even for larger graphs, using sketching and dimensionality reduction techniques (see Section 4). Moreover, we evaluate our networks on a number of benchmark datasets of diverse scales (see Section 5). First, we show that our networks outperform other baselines on the PNA dataset Corso et al. (2020), which comprises 6 node and graph algorithmic tasks, demonstrating the ability of affinity measures to exploit structural properties of graphs. We also evaluate performance on a number of graph and node tasks for datasets in the Open Graph Benchmark (OGB) collection Hu et al. (2020), including molecular and citation graphs. In particular, our networks with scalar effective resistance edge features achieve the state of the art on the OGB-LSC PCQM4Mv1 dataset, which was featured in a KDD Cup 2021 competition for large-scale graph representation learning. Finally, we provide intuition for why affinity-based measures are fundamentally different from the aforementioned distance-based approaches (see Section 3.5) and bolster it with detailed theoretical and empirical results (see Appendix D) showing favorable outcomes for affinity-based measures.

2. RELATED WORK

Our work builds upon a wealth of graph-theoretic and graph representation learning works, while we focus on a supervised, inductive setting. Even though GNN architectures were originally classified as spectral or spatial, we abstain from this division, as recent research has demonstrated some equivalence of the graph convolution process regardless of the choice of convolution kernels (Balcilar et al., 2021; Bronstein et al., 2021). Spectrally motivated methods are often theoretically founded on the eigendecomposition of the graph Laplacian matrix (or an approximation thereof); hence, the corresponding convolutions capture different frequencies of the graph signal. Early works in this space include ChebNet Defferrard et al. (2016) and its more efficient 1-hop version by Kipf et al. (Kipf & Welling, 2017), which offers a linear function on the graph Laplacian spectrum. Levie et al. (Levie et al., 2018) proposed CayleyNets, an alternative rational filter. Message passing neural networks (MPNNs) Gilmer et al. (2017) perform a transformation of node and edge representations before and after an arbitrary aggregator (e.g., sum). Graph attention networks (GATs) Veličković et al. (2018) aimed to augment the computations of GNNs by allowing graph nodes to "attend" differently to different edges, inspired by the success of transformers in NLP tasks. One of the most relevant works is that of Beaini et al. (Beaini et al., 2021), i.e., directional graph networks (DGN). DGN uses the gradients of the low-frequency eigenvectors of the graph Laplacian, which are known to capture key information about the global structure of the graph, and the authors prove that the aggregators they construct using these gradients lead to more discriminative models than standard GNNs according to the 1-WL test. Prior work Morris et al. (2019) used higher-order (k-dimensional) GNNs, based on k-WL, and a hierarchical variant, and proved theoretically and experimentally the improved expressivity in comparison to other models. Other notable works include Graph Isomorphism Networks (GINs) Xu et al. (2018), which represent a simple, maximally-powerful GNN over discrete-featured inputs; this work also brought to light the expressivity limitations of GNNs. Hamilton et al. (Hamilton et al., 2017) proposed a method to construct node representations by sampling a fixed-size neighborhood of each node and then applying a specific aggregator over it, which led to impressive performance on large-scale inductive benchmarks. Bouritsas et al. (Bouritsas et al., 2020) use topologically-aware message passing to detect and count graph substructures, while Bodnar et al. (Bodnar et al., 2021) propose a message-passing procedure on cell complexes, motivated by a novel colour refinement algorithm for testing their isomorphism, which proves to be powerful for molecular benchmarks.

Expressivity. Techniques that improve a GNN's expressive power largely fall under three broad directions. While we focus on the feature-based direction in this paper, we also acknowledge that it in no way compels the GNN to use the additionally provided features. Hence, we briefly survey the other two, as an indication of the future research in affinity-based GNNs we hope this work will inspire. Another avenue involves modulating the message passing rule to take advantage of the desired computations. Popular recent examples of this include DGN (Beaini et al., 2021) and LSPE (Dwivedi et al., 2021). DGNs leverage the graph's Laplacian eigenvectors, but they do not merely use them as input features; instead, they define a directional vector field based on the eigenvectors and use it explicitly to anisotropically aggregate neighbourhoods. LSPE features a "bespoke pipeline" for processing positional inputs.
The final direction is to modulate the graph over which messages are passed, usually by adding new nodes that correspond to desired substructures. An early proponent of this is the work of (Morris et al., 2019) , which explicitly performs message passing over k-tuples of nodes at once. Recently, scalable efforts in this direction focus on carefully chosen substructures, e.g., junction trees (Fey et al., 2020) , cellular complexes (Bodnar et al., 2021) .

3. AFFINITY MEASURES AND GNNS

3.1. RANDOM WALKS, HITTING AND COMMUTE TIMES

Let G = (V, E) be a graph with vertices V and edges E ⊆ V × V between them. We define several natural properties of a graph that arise from a random walk. A random walk on G starting from a node u is a Markov chain on the vertex set V such that the initial vertex is u, and at each time step one moves from the current vertex to a neighbor, chosen with probability proportional to the weight of the outgoing edges. We will use π to denote the stationary distribution of this Markov chain. For random walks on weighted, undirected graphs, we know that π_u = d_u / (2M), where d_u is the weighted degree of node u and M is the sum of edge weights. The hitting time H_uv from u to v is defined as the expected number of steps for a random walk starting at u to hit v. We can also define the commute time between u and v as K_uv = H_uv + H_vu, the expected round-trip time for a random walk starting at u to reach v and then return to u.
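The definitions above can be checked numerically. The following is a minimal numpy sketch (illustrative only, not part of the paper's model code) that computes hitting times on a small path graph by first-step analysis: H_vv = 0 and H_uv = 1 + Σ_w P_uw H_wv for u ≠ v, where P = D⁻¹A is the transition matrix.

```python
import numpy as np

# Hitting and commute times for a random walk on a small unweighted,
# undirected graph, by solving the first-step linear system per target.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph 0 - 1 - 2
d = A.sum(axis=1)                        # (weighted) degrees
P = A / d[:, None]                       # transition probabilities
n = len(A)

H = np.zeros((n, n))                     # H[u, v] = hitting time from u to v
for v in range(n):
    others = [u for u in range(n) if u != v]
    Q = P[np.ix_(others, others)]        # walk restricted to non-target nodes
    # solve (I - Q) h = 1 for the expected number of steps to reach v
    h = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
    H[others, v] = h

K = H + H.T                              # commute times K_uv = H_uv + H_vu
print(H[0, 2], H[1, 2], K[0, 2])
```

On this path, the end-to-end hitting time is 4 and the commute time is 8, consistent with K_uv = 2M Res(u, v) for M = 2 and Res = 2.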

3.2. EFFECTIVE RESISTANCE

A closely related quantity is the effective resistance in undirected graphs. This quantity corresponds to the electrical resistance between two nodes if the whole graph were replaced with a circuit in which each edge becomes a resistor with resistance equal to the reciprocal of its weight. We will use Res(u, v) to denote the effective resistance between nodes u and v. For undirected graphs, it is known (Lovász, 1993) that the effective resistance is proportional to the commute time: Res(u, v) = K_uv / (2M). Our broad goal is to incorporate effective resistances and hitting times as edge features in an MPNN. In Section 3.4, we will show they can provably improve MPNNs' expressivity.
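As a small illustration (a numpy sketch, independent of our architecture), effective resistances can be computed from the Laplacian pseudoinverse formula given in Section 3.3, and they obey the familiar series/parallel circuit rules:

```python
import numpy as np

# Effective resistance via Res(u, v) = (1_u - 1_v)^T L^+ (1_u - 1_v),
# on a triangle {0, 1, 2} with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
Lp = np.linalg.pinv(L)                   # pseudo-inverse L^+

def res(u, v):
    e = np.zeros(len(A)); e[u], e[v] = 1.0, -1.0
    return e @ Lp @ e

print(res(0, 1))   # inside the triangle: 1 ohm parallel with 2 ohms = 2/3
print(res(2, 3))   # a bridge edge: resistance exactly 1
print(res(0, 3))   # series composition: 2/3 + 1 = 5/3
```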

3.3. RESISTIVE EMBEDDINGS

Effective resistances allow us to define the resistive embedding, a mapping that associates each node v of a graph G = (V, E, W), where W are the non-negative edge weights, with an embedding vector. Before we specify the resistive embedding, we define a few terms. Let L = D − A be the graph Laplacian of G, where D ∈ R^{n×n} is the diagonal matrix containing the weighted degree of each node and A ∈ R^{n×n} is the adjacency matrix, whose (i, j)-th entry is equal to the edge weight between i and j, if the edge exists, and 0 otherwise. Let B be the m × n edge-node incidence matrix, where |V| = n and |E| = m, defined as follows: the i-th row of B corresponds to the i-th edge e_i = (u_i, v_i) of G and has a +1 in the u_i-th column and a −1 in the v_i-th column, while all other entries are zero. Finally, we will use C ∈ R^{m×m} to denote the conductance matrix, a diagonal matrix with C_ii being the weight of the i-th edge. It is easy to verify that B^T C B = L. Even though L is not invertible, its null-space consists of the indicator vectors of the connected components of G. For example, if G is connected, then L's null-space is spanned by the all-ones vector 1. Hence, for any vector x orthogonal to the all-ones vector, L L† x = x, where L† is the pseudo-inverse. We can express the effective resistance between any pair of nodes using Laplacian matrices Lovász (1993) as Res(u, v) = (1_u − 1_v)^T L† (1_u − 1_v), where 1_v is the n-dimensional indicator vector of node v. We are now ready to define the resistive embedding.

Definition 3.1 (Effective Resistance Embedding). r_v = C^{1/2} B L† 1_v.

A key property is that the effective resistance between two nodes in the graph can be obtained easily from the distance between their corresponding embeddings (see proof in Appendix C):

Lemma 3.2. For any pair of nodes u, v, we have ∥r_u − r_v∥₂² = Res(u, v).
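Definition 3.1 and Lemma 3.2 can be verified directly; the following numpy sketch (illustrative only, on a small hand-picked graph) builds B, C, and L, checks B^T C B = L, and confirms that squared embedding distances recover effective resistances:

```python
import numpy as np

# Resistive embedding r_v = C^{1/2} B L^+ 1_v on a 4-node graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
n = len(A)
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] > 0]
m = len(edges)

B = np.zeros((m, n))                      # edge-node incidence matrix
for idx, (u, v) in enumerate(edges):
    B[idx, u], B[idx, v] = 1.0, -1.0
C = np.diag([A[u, v] for u, v in edges])  # conductance (edge weight) matrix

L = np.diag(A.sum(axis=1)) - A
assert np.allclose(B.T @ C @ B, L)        # sanity check: B^T C B = L
Lp = np.linalg.pinv(L)
R = (np.sqrt(C) @ B @ Lp).T               # row v of R is the embedding r_v

def res(u, v):
    e = np.zeros(n); e[u], e[v] = 1.0, -1.0
    return e @ Lp @ e

# Lemma 3.2: squared embedding distance equals effective resistance
print(np.sum((R[0] - R[3]) ** 2), res(0, 3))
```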
One can easily check that any rotation of r also satisfies Lemma 3.2, since rotations preserve Euclidean distances; more generally, if U is an orthonormal matrix, then U r is also a valid resistive embedding. This poses a challenge if we want to use the resistive embeddings as node or edge features: we want a way to ensure that a (G)NN using them will do so in a way that is invariant or equivariant to any rotation of the embeddings. In the current work, we rely on data augmentation: at every training iteration, we apply random rotations to the input ER embeddings.

Remark 3.3. While data augmentation is a popular approach for promoting invariant and equivariant predictions, it only hints to the network that such predictions are favourable. It is also possible, in the spirit of the geometric deep learning blueprint (Bronstein et al., 2021), to combine ER embeddings with an O(n)-equivariant GNN, which rigorously enforces rotational equivariance. A popular approach to building equivariant GNNs has been proposed by (Satorras et al., 2021), though it focuses on the full Euclidean group E(n) rather than O(n). We leave this exploration to future work.

Definition 3.4. Let p := Σ_u π_u r_u be the mean of the effective resistance embedding. We might view p as a "weighted mean" of r.

We will define the hitting time radius, H_max, of a given graph as the maximum hitting time between any two nodes:

Definition 3.5 (Hitting Time Radius). H_max := max_{u,v} H_uv.

We will need the following to bound the hitting times we compute:

Lemma 3.6. For any node u, ∥r_u − p∥₂ ≤ √(H_max / M).

The proof follows easily from the fact that p is a convex combination of all the r's and Jensen's inequality.
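The random-rotation augmentation can be sketched as follows (illustrative numpy code; the embedding array R here is a random stand-in for actual ER embeddings). A random orthonormal matrix is drawn via QR decomposition of a Gaussian matrix, and applying it leaves all pairwise distances, and hence all recovered effective resistances, unchanged:

```python
import numpy as np

# Random-rotation augmentation of ER embeddings: if U is orthonormal,
# U r is an equally valid embedding, so each training step can apply a
# fresh random rotation to all embeddings at once.
rng = np.random.default_rng(0)

def random_rotation(R):
    k = R.shape[1]
    Q, _ = np.linalg.qr(rng.normal(size=(k, k)))  # random orthonormal matrix
    return R @ Q.T                                # rotate every row

R = rng.normal(size=(5, 4))          # stand-in for 5 nodes' 4-dim embeddings
R_rot = random_rotation(R)

# Pairwise distances are preserved exactly by the rotation:
d0 = np.linalg.norm(R[:, None] - R[None, :], axis=-1)
d1 = np.linalg.norm(R_rot[:, None] - R_rot[None, :], axis=-1)
print(np.max(np.abs(d0 - d1)))       # numerically zero
```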

3.4. INCORPORATING FEATURES INTO MPNNS

We reiterate that our main aim is to demonstrate (theoretically and empirically) that there are good reasons to incorporate affinity-based measures into GNN computations. In the simplest instance, a method that improves a GNN's expressive power may compute additional features (positional or structural) which assist the GNN in discriminating between examples it otherwise wouldn't (easily) be able to. These features are then appended to the GNN's inputs for further processing. For example, it has been shown that endowing nodes with a one-hot based identity is already sufficient for improving expressive power (Murphy et al., 2019); this was then relaxed to any randomly-sampled scalar feature by Sato et al. (2021). It is, of course, possible to create dedicated features that even count substructures of interest (Bouritsas et al., 2020). Further, the adjacency information can be factorised (Qiu et al., 2018) or eigendecomposed (Dwivedi et al., 2021) to provide useful structural embeddings for the GNN. We will focus our attention on exactly this class of methods, as it is a lightweight and direct way of demonstrating improvements from these computations. Hence, our baselines will all be instances of the MPNN framework (Gilmer et al., 2017), which we will attempt to improve by endowing them with affinity-based features. We start by theoretically proving that these features indeed improve expressive power:

Theorem 3.7. MPNNs that make use of any one of (a) effective resistances, (b) hitting times, or (c) resistive embeddings are strictly more powerful than the 1-WL test.

Proof. Since the networks in question arise from augmenting standard MPNNs with additional node/edge features, these networks are at least as powerful as the 1-WL test.
In order to show that these networks are strictly more powerful than the 1-WL test, it suffices to exhibit a graph for which our affinity-measure-based networks can distinguish between certain nodes that a standard GNN (limited by the 1-WL test) cannot. We present an example of a 3-regular graph on 8 nodes in Figure 1. It is well known that a standard GNN limited by the 1-WL test cannot distinguish any pair of nodes in a regular graph, as the computation tree rooted at any node in the graph looks identical. However, there are three isomorphism classes of nodes in the above graph (denoted by different colors), namely, V_1 = {1, 2}, V_2 = {3, 4, 7, 8}, and V_3 = {5, 6}. We now show that GNNs with affinity-based measures can distinguish between a node in V_i and a node in V_j, for i ≠ j. We note that the effective resistance between a and b depends only on the isomorphism classes of a and b. Thus, we write r_{i,j} for the effective resistance between a node in V_i and a node in V_j. Note that r_{i,j} = r_{j,i}, and it is easy to verify that: r_{1,1} = 2/3, r_{2,2} = 15/28, r_{3,3} = 4/7, r_{1,2} = r_{2,1} = 185/336, r_{2,3} = r_{3,2} = 209/336. Hence, it follows that in a message passing step of an MPNN that uses effective resistances, vertices in V_1, V_2, and V_3 will aggregate feature multisets {r_{1,1}, r_{1,2}, r_{1,2}} = {2/3, 185/336, 185/336}, {r_{2,1}, r_{2,2}, r_{2,3}} = {185/336, 15/28, 209/336}, and {r_{2,3}, r_{2,3}, r_{3,3}} = {209/336, 209/336, 4/7}, respectively, all of which are distinct multisets. Hence, such an MPNN can distinguish nodes in V_i and V_j, i ≠ j, for a suitable aggregation function. If, instead of effective resistances, we use hitting time features or resistive embeddings, our results still hold. This is because, as we showed previously, the effective resistance between nodes is a function of the two hitting times in either direction, as well as of the resistive embeddings of the nodes.
In other words, if either hitting time features or resistive embeddings are used as input features for an MPNN, this MPNN would be able to compute the effective resistance features by applying an appropriate function (e.g., Lemma 3.2 for the case of resistive embeddings). Having computed these features, the MPNN can distinguish any two graphs that the MPNN with effective resistance features can.

3.5. EFFECTIVE RESISTANCE VS. SHORTEST PATH DISTANCE

It is interesting to ask how effective resistances compare with shortest path distances (SPDs) in GNNs, given the plethora of recent works that make use of SPDs (e.g., (Ying et al., 2021b; You et al., 2019; Li et al., 2020)). The most direct comparison for our effective resistance-based MPNNs would be to use SPDs as edge features in the MPNNs. However, note that SPDs along graph edges are trivial (unlike effective resistances, which incorporate useful information about the global graph structure). An alternative to edge features would be to use (a) SPDs to a small set of anchor nodes as features in an MPNN (e.g., P-GNN You et al. (2019)) or (b) a dense featurization incorporating shortest paths between all pairs of nodes (e.g., the dense attention mechanism in Graphormer Ying et al. (2021b)). We remark that the latter approach typically incurs an O(n²) overhead, which our MPNN-based approach avoids. We empirically compare our models to MPNNs that use approaches (a) and (b). Results on the PNA dataset show that our effective resistance-based GNNs outperform these approaches. Furthermore, we complement these empirical results with a theoretical result showing that, under a limited number of message passing steps, effective resistance features allow one to distinguish structures that cannot be distinguished using shortest path features. We point the reader to Appendix D for these results.
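To make the triviality of edge-level SPDs concrete, consider this small numpy sketch (illustrative only): on a 6-cycle with one chord, both (0, 1) and (0, 3) are edges, so both have SPD 1, yet their effective resistances differ because the two edges participate differently in the global connectivity.

```python
import numpy as np

# 6-cycle plus a chord between nodes 0 and 3.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0  # cycle edges
A[0, 3] = A[3, 0] = 1.0                          # chord

L = np.diag(A.sum(axis=1)) - A
Lp = np.linalg.pinv(L)

def res(u, v):
    e = np.zeros(n); e[u], e[v] = 1.0, -1.0
    return e @ Lp @ e

# Both (0, 1) and (0, 3) are edges (SPD = 1), but:
#   res(0, 3): chord in parallel with two 3-edge arcs -> 1 / (1 + 1/3 + 1/3) = 3/5
#   res(0, 1): edge in parallel with the 2.75-ohm detour -> 11/15
print(res(0, 1), res(0, 3))
```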

4. EFFICIENT COMPUTATION OF AFFINITY MEASURES

In order to use our features, it is important that they be computable efficiently. In this section, we show how to compute or approximate the various random walk-based affinity measures. We share results of our method on large-scale graphs in Appendix B.

4.1. REDUCING DIMENSIONALITY OF RESISTIVE EMBEDDINGS

Given their higher dimensionality, resistive embeddings can carry more information than scalar effective resistances, so we might find it beneficial to use them as GNN features. The difficulty with using resistive embeddings directly as GNN features is that the embeddings have dimension m, which can be quite large, e.g., up to n² for dense graphs. It was shown by (Spielman & Srivastava, 2011) that one can reduce the dimensionality of the embedding while approximately preserving Euclidean distances. The idea is to use a random projection via a constructive version of the Johnson-Lindenstrauss Lemma:

Lemma 4.1 (Constructive Johnson-Lindenstrauss). Let x_1, x_2, ..., x_n ∈ R^d be a set of n points, and let α_1, α_2, ..., α_m ∈ R^n be fixed linear combinations. Suppose Π is a k × d matrix whose entries are chosen i.i.d. from a Gaussian N(0, 1), and consider x̃_i := (1/√k) Π x_i. Then, for k ≥ C log(mn)/ε², with probability 1 − o(1), we have ∥Σ_j α_{i,j} x̃_j∥₂² = (1 ± ε) ∥Σ_j α_{i,j} x_j∥₂² for every 1 ≤ i ≤ m.

Using the triangle inequality, we can see that inner products between fixed linear combinations are preserved up to some additive error:

Corollary 4.2. For any fixed vectors α, β ∈ R^n, if we let X := Σ_i α_i x_i, X̃ := Σ_i α_i x̃_i and similarly Y := Σ_i β_i x_i, Ỹ := Σ_i β_i x̃_i, then: |⟨X̃, Ỹ⟩ − ⟨X, Y⟩| ≤ (ε/2) (∥X∥₂² + ∥Y∥₂²).

The proof of Corollary 4.2 can be found in Appendix C. Therefore, we can choose a desired ε > 0 and k = O(log(n)/ε²) and instead use r̃ : V → R^k as the embedding, where r̃_v = (1/√k) Π C^{1/2} B L† 1_v for a randomly chosen k × m matrix Π whose entries are i.i.d. Gaussians from N(0, 1). Then, by Lemma 3.2 and Lemma 4.1, we have that for every edge (u, v) ∈ E, ∥r̃_u − r̃_v∥₂² = (1 ± 3ε) Res(u, v) with probability at least 1 − 1/n². The computation of the sketched embeddings r̃ requires solving O(log n/ε²) Laplacian linear systems. By using one of the nearly-linear-time Laplacian solvers Koutis et al. (2014), we can compute the sketched embeddings in near-linear time. Hence the total running time becomes O((n + m) log^{3/2} n · poly(log log n)/ε²).
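The projection step itself is simple; the following numpy sketch (illustrative only, with a random stand-in for the exact embeddings and a fixed seed) projects m-dimensional embeddings down to k dimensions with a Gaussian matrix Π/√k and checks that a pairwise squared distance is approximately preserved:

```python
import numpy as np

# Johnson-Lindenstrauss sketch of resistive embeddings: project rows of R
# (n points in R^m) to R^k with a random Gaussian matrix scaled by 1/sqrt(k).
rng = np.random.default_rng(1)

def jl_project(R, k):
    m = R.shape[1]
    Pi = rng.normal(size=(k, m)) / np.sqrt(k)
    return R @ Pi.T

R = rng.normal(size=(50, 1000))     # stand-in for 50 exact m-dim embeddings
R_small = jl_project(R, k=400)      # sketched k-dim embeddings

u, v = 3, 17
d_exact = np.sum((R[u] - R[v]) ** 2)
d_approx = np.sum((R_small[u] - R_small[v]) ** 2)
print(abs(d_approx / d_exact - 1))  # small relative distortion
```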

4.2. FAST COMPUTATION OF HITTING TIMES

Note that it is not clear whether there is a fast method for computing hitting times analogous to that for commute times / effective resistances. The naive approach involves solving a linear system for each edge, resulting in a running time of at least Ω(nm), which is prohibitive. One of our technical contributions in this paper is a method for the fast computation of hitting times. In particular, we show how to use the approximate effective resistance embeddings r̃ to obtain an estimate of the hitting times with additive error. Let p̃ := Σ_u π_u r̃_u. Just as r̃ is an approximation of r, p̃ is an approximation of p. Consider the quantity H̃_{u,v} := 2M ⟨r̃_v − r̃_u, r̃_v − p̃⟩, which we will use as an approximation of H_{u,v}. In the following, we bound the difference between H̃_{u,v} and H_{u,v} (proofs can be found in Appendix C). Our starting point is expressing H_{u,v} in terms of the effective resistance embeddings.

Lemma 4.3. H_{u,v} = 2M ⟨r_v − r_u, r_v − p⟩, where p := Σ_u π_u r_u.

Lemma 4.4. |H̃_{u,v} − H_{u,v}| ≤ 3ε H_max.
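Lemma 4.3 can be checked numerically. The following numpy sketch (illustrative only) builds the exact embeddings for a 3-node path, so the identity holds exactly; with JL-sketched embeddings, the same formula incurs only the additive error of Lemma 4.4:

```python
import numpy as np

# Hitting times from resistive embeddings: H_uv = 2M <r_v - r_u, r_v - p>,
# with p the pi-weighted mean of the embeddings (exact version, no sketch).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph 0 - 1 - 2
n = len(A)
d = A.sum(axis=1)
M = A.sum() / 2.0                        # total edge weight
pi = d / (2.0 * M)                       # stationary distribution

L = np.diag(d) - A
Lp = np.linalg.pinv(L)
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] > 0]
B = np.zeros((len(edges), n))
for idx, (u, v) in enumerate(edges):
    B[idx, u], B[idx, v] = 1.0, -1.0
C = np.diag([A[u, v] for u, v in edges])
R = (np.sqrt(C) @ B @ Lp).T              # rows are embeddings r_v

p = pi @ R                               # weighted mean of the embeddings

def hit(u, v):
    return 2.0 * M * np.dot(R[v] - R[u], R[v] - p)

print(hit(0, 2), hit(1, 2), hit(2, 0))   # matches first-step analysis
```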

5. EXPERIMENTS

As previously discussed, our empirical evaluation seeks to show benefits from endowing standard expressive GNNs with additional affinity-based features. All architectures we experiment with therefore conform to the message passing neural network (MPNN) blueprint (Gilmer et al., 2017), which we now briefly describe for convenience. Assume that our input graph G = (V, E) has node features x_u ∈ R^n, edge features x_uv ∈ R^m and graph-level features x_G ∈ R^l, for nodes u, v ∈ V and edges (u, v) ∈ E. We provide encoders f_n : R^n → R^k, f_e : R^m → R^k and f_g : R^l → R^k that transform these inputs into a latent space:

h^(0)_u = f_n(x_u)    h^(0)_uv = f_e(x_uv)    h^(0)_G = f_g(x_G)

Our MPNN then performs several message passing steps: H^(t+1) = P_{t+1}(H^(t)), where H^(t) = ({h^(t)_u}_{u∈V}, {h^(t)_uv}_{(u,v)∈E}, h^(t)_G) contains all of the latents at a particular processing step t ≥ 0. This process is iterated for T steps, recovering final latents H^(T). These can then be decoded into node-, edge-, and graph-level predictions (as required), using analogous decoder functions g_n, g_e and g_g:

y_u = g_n(h^(T)_u)    y_uv = g_e(h^(T)_uv)    y_G = g_g(h^(T)_G)    (5)

Generally, f and g are simple MLPs, whereas we use the MPNN update rule for P. It computes message vectors m^(t)_uv, to be sent across the edge (u, v), and then aggregates them in the receiver nodes as follows:

m^(t+1)_uv = ψ_{t+1}(h^(t)_u, h^(t)_v, h^(0)_uv)    (6)

h^(t+1)_u = φ_{t+1}(h^(t)_u, Σ_{v∈N_u} m^(t+1)_vu)    (7)

The message function ψ_{t+1} and the update function φ_{t+1} are both MLPs. All of our models have been implemented using the jraph library (Godwin et al., 2020). We incorporate edge-based affinity features (e.g., effective resistances and hitting times) in f_e and node-based affinity features (e.g., resistive embeddings) in f_n. Note that node-based affinity features may also naturally be incorporated as edge features by concatenating the node features at the endpoints.
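A single message passing step of this blueprint can be sketched in a few lines of numpy (illustrative only; the paper's models are implemented in jraph, and real ψ and φ are MLPs, for which single random linear layers with ReLU stand in here):

```python
import numpy as np

# One MPNN step: messages m_uv = psi(h_u, h_v, h0_uv), aggregated by sum
# at the receiver, then h'_u = phi(h_u, sum of incoming messages).
rng = np.random.default_rng(0)
k = 8                                        # latent width
n = 3
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]     # directed edges (u, v)

H = rng.normal(size=(n, k))                  # node latents h_u
E = {e: rng.normal(size=k) for e in edges}   # initial edge latents h0_uv
W_msg = 0.1 * rng.normal(size=(3 * k, k))    # stand-in for the MLP psi
W_upd = 0.1 * rng.normal(size=(2 * k, k))    # stand-in for the MLP phi

relu = lambda x: np.maximum(x, 0.0)

# psi acts on sender latent, receiver latent, and the edge's input latent
msgs = {(u, v): relu(np.concatenate([H[u], H[v], E[(u, v)]]) @ W_msg)
        for (u, v) in edges}

def aggregate(u):
    # sum of messages arriving at u, i.e. over all edges (v, u)
    return sum(msgs[(v, w)] for (v, w) in edges if w == u)

H_next = np.stack([relu(np.concatenate([H[u], aggregate(u)]) @ W_upd)
                   for u in range(n)])
print(H_next.shape)
```

Affinity features simply enter through the inputs: scalar effective resistances or hitting times would be concatenated into the edge latents, and resistive embeddings into the node latents, before this step runs.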
Occasionally, the dataset in question will be easy to overfit with the most general form of message function (Equation 6). In these cases, we resort to assuming that ψ factorises into an attention mechanism:

m^(t+1)_uv = a_{t+1}(h^(t)_u, h^(t)_v, h^(0)_uv) ψ_{t+1}(h^(t)_u)    (8)

where the attention function a is scalar-valued. We will refer to this particular MPNN baseline as a graph attention network (GAT) (Veličković et al., 2018). When relevant, we may also recall the results that a particular strong baseline (such as DGN (Beaini et al., 2021) or Graphormer (Ying et al., 2021a)) achieves on a dataset of interest. Note that these baselines modulate the message passing procedure rather than appending features, and hence fall in a different category to our method; their performance is provided for indicative reasons only. Where appropriate, we will use "DGN (features)" to refer to an MPNN that uses the eigenvector flows as additional edge features, without modulating the mechanism.

Under review as a conference paper at ICLR 2023

The ogbg-molhiv dataset is a molecular property prediction dataset comprised of molecular graphs without spatial information (such as atom coordinates). Each graph corresponds to a molecule, with nodes representing atoms and edges representing chemical bonds. Each node has an associated 9-dimensional feature vector, containing atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring. The goal is to predict whether a molecule inhibits HIV virus replication. Our results are given in Table 2. Note that we provide topological and random walk methods as baselines, including GSN (Bouritsas et al., 2020), HIMP (Fey et al., 2020), and GRWNN (Nikolentzos & Vazirgiannis, 2022). On this dataset, effective resistances provide an improvement over the standard MPNN. We achieve the best performance using ER node embeddings and hitting times with random rotations.
With these features, our network achieves 79.13% ± 0.358 test ROC-AUC, which is close to DGN. The ogbg-molpcba dataset likewise comprises molecular graphs without spatial information (such as atom coordinates). The aim is to classify them across 128 different biological activities. We follow the baseline MPNN architecture from Godwin et al. (2022), including the use of Noisy Nodes.

5.3. MULTI-TASK MOLECULAR CLASSIFICATION: OGBG-MOLPCBA

Mirroring the evaluation protocol of Godwin et al. (2022), Table 3 compares the performance of incorporating ER and hitting time (HT) features into the baseline MPNN models with Noisy Nodes, at various depths. We observe that models utilising affinity-based features are capable of reaching, and exceeding, peak test performance (in terms of mean average precision). More important, however, is the effect of these features at lower depths: it is possible to achieve comparable or better levels of performance with half the layers when utilising ER or HT features. This result illustrates the potential benefit affinity-based computations can have on molecular benchmarks, especially when no geometry is provided as input. PCQM4Mv1 comprises molecular graphs which consist of bonds and atom types, with no 3D or 2D coordinates. We reuse the experimental setup and architecture from Godwin et al. (2022), with only one difference: appending the effective resistance to the edge features. Additionally, we compare against an equivalent model which uses molecular conformations estimated by RDKit as an additional feature. This gives us a baseline which leverages an explicit estimate of the molecular geometry.

5.4. LARGE-SCALE MOLECULAR REGRESSION: PCQM4MV1

Our results are summarised in Table 5. We once again see a powerful synergy of effective resistance-endowed GNNs and Noisy Nodes Godwin et al. (2022), allowing us to significantly reduce the number of layers (to 32) and outperform the 50-layer MPNN result in Godwin et al. (2022). Further, we improve on the single-model performance of both the Graphormer Ying et al. (2021a) (which won the original PCQM4M contest after ensembling) and an equivalent model to ours which uses molecular conformers from RDKit. This illustrates how ER features can be competitive in geometry-relevant tasks even against features that inherently encode an estimate of the molecule's spatial geometry. Lastly, we remark that, to the best of our knowledge, our result is the best published single-model result on the large-scale PCQM4Mv1 benchmark to date, and the only single-model result with validation MAE under 0.120. We hope this will inspire future investigation of affinity-based GNNs for molecular tasks, especially in settings where spatial geometry is not reliably available.

6. CONCLUSIONS

In this paper, we proposed a message passing network based on random-walk-based affinity measures. We believe that the comprehensive theoretical and practical results presented in our paper solidify affinity-based computations as a strong component of a graph representation learner's toolbox. Our proposal carefully balances theoretical expressive power, empirical performance, and scalability to large graphs. Specifically, in future work we would like to see variants of GNN message functions that explicitly make use of affinity-based computations, rather than providing them as additional hints to the model.

We also explore the use of the centrality encoding (in-degree and out-degree embeddings) from Graphormer as additional node features.

• The second baseline is the Position-Aware GNN (P-GNN), which makes use of "anchor sets" of nodes and encodes distances to these nodes.

The results of these baselines are shown in Table 7. In particular, we note that our ER-based MPNNs outperform all aforementioned baselines. In addition to the experimental results, we would like to provide some theory for why effective resistances can capture structure in GNNs that SPDs are unable to. We will call an initialization function u → h_u on the nodes of a graph node-based if it assigns values that are independent of the edges of the graph. Such an initialization is, however, allowed to depend on node identities (e.g., for the single-source shortest path problem from a source s, one might find it natural to define h^(0)_s = 0 and h^(0)_u = +∞ for all u ≠ s). Consider the task of computing "single-source effective resistances," i.e., the effective resistance from a particular node to every other node. We show that a GNN with a limited number of message passing steps cannot possibly learn single-source effective resistances, even to nearby nodes.

Theorem D.1. Suppose we fix k > 0.
Then, given any node-based initialization function h_u^(0), it is impossible for a GNN to compute single-source effective resistances from a given node w, even to nodes within its k-hop neighborhood. More specifically, for any update rule

m_uv^(t+1) = ψ_{t+1}(h_u^(t), h_v^(t), f_e(x_uv)),
h_u^(t+1) = φ_{t+1}(h_u^(t), f({m_uv^(t+1) : v ∈ N(u)})),

there exists a graph G = (V, E) and u ∈ V such that, after k rounds of message passing, h_v^(k) ≠ Res(u, v) for some v ≠ u within the k-hop neighborhood of u. On the other hand, there exists an initialization with respect to which k rounds of message passing compute the correct shortest path distances to all nodes within the k-hop neighborhood.

Note that the restriction on the initialization function in the above theorem is reasonable: enabling the use of arbitrary, unrestricted functions would allow for the possibility of precomputing effective resistances in the graph and trivially incorporating them as node features, which would defeat the purpose of computing them via message passing. We now prove the theorem.

Proof. Consider the following pair of graphs, each on 4k + 1 nodes:

Figure 2: Both graphs are on 4k + 1 vertices, labeled v_0, v_1, ..., v_4k. The only difference is a single edge: the graph on the left (a cycle) has an edge between v_2k and v_2k+1, while the one on the right (a path) does not.

Let V = {v_0, v_1, ..., v_4k}. The first graph G = (V, E) is a cycle, while the second graph G' = (V, E') is a path, obtained by removing a single edge from the first graph (namely, the one between v_2k and v_2k+1). Suppose the edge weights are all 1 in both graphs. Let w = v_0 be the source and let {h_v^(0) : v ∈ V} be a node-based feature initialization.
Note that for any GNN (i.e., any update and aggregation rules of the above form), the computation tree after k rounds of message passing is identical for the nodes v_0, v_1, ..., v_k, v_3k+1, v_3k+2, ..., v_4k (i.e., the nodes within the k-hop neighborhood of v_0) in both G and G'. This is because the only difference between G and G' is the existence of the edge between v_2k and v_2k+1, and this edge is beyond the k-hop neighborhood centered at any one of the aforementioned nodes. Therefore, h_{v_i}^(k) is necessarily identical in both G and G' for i = 1, ..., k, 3k + 1, 3k + 2, ..., 4k. However, it is easy to calculate the effective resistances in both graphs. In G, we have Res_G(v_0, v_i) = i(4k + 1 − i)/(4k + 1), while in G', we have Res_G'(v_0, v_i) = min{i, 4k + 1 − i}. Therefore, Res_G(v_0, v_i) ≠ Res_G'(v_0, v_i) for all i = 1, 2, ..., k, 3k + 1, 3k + 2, ..., 4k. It follows that for any such i, the execution of k message passing steps of a GNN cannot result in h_{v_i}^(k) = Res(v_0, v_i) in both G and G', which proves the first claim of the theorem.

For the second part (regarding single-source shortest paths), observe that single-source shortest path distances can indeed be realized via the aggregation and update rules of a message passing network. In particular, with k rounds of message passing, it is possible to compute shortest path distances to all nodes within a k-hop neighborhood. Specifically, for a source w, we can use the following setup: take h_w^(0) = 0 and h_u^(0) = +∞ for all u ≠ w. Moreover, for any edge (u, v), let the edge feature x_uv ∈ R simply be the weight of (u, v) in the graph.
Under review as a conference paper at ICLR 2023

Then, take the update rule with

f_e(x_uv) = x_uv,
ψ_{t+1}(h_u^(t), h_v^(t), f_e(x_uv)) = h_v^(t) + x_uv,
f(S) = min{s : s ∈ S},
φ_{t+1}(a, b) = min{a, b}.

It is clear that this update rule simply simulates one iteration of the Bellman-Ford algorithm. Therefore, k message passing steps simulate k iterations of Bellman-Ford, resulting in correct shortest path distances from the source w for every node within the k-hop neighborhood.
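As a concrete illustration (our sketch, not the paper's code), the min-aggregation update above can be simulated directly, with math.inf playing the role of the +∞ node-based initialization:

```python
import math

def bellman_ford_mp(n, edges, source, k):
    """k rounds of min-aggregation message passing (Bellman-Ford).

    edges: (u, v, weight) triples of an undirected graph. Returns h,
    where h[u] is the shortest-path distance from source to u,
    guaranteed correct for every u within a k-hop neighborhood.
    """
    # Node-based initialization: depends only on node identities.
    h = [math.inf] * n
    h[source] = 0.0
    nbrs = [[] for _ in range(n)]
    for u, v, w in edges:
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))
    for _ in range(k):
        # psi: m_uv = h_v + x_uv;  f: min over neighbors;  phi: min.
        agg = [min((h[v] + w for v, w in nbrs[u]), default=math.inf)
               for u in range(n)]
        h = [min(h[u], agg[u]) for u in range(n)]
    return h
```

On a 5-node unit-weight path with source 0, two rounds yield [0, 1, 2, inf, inf]: nodes beyond the 2-hop neighborhood keep their initial value, exactly as the theorem's second claim describes.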



Throughout this section, we will assume that our graph is connected; however, everything applies to disconnected graphs as well. Note that the average of all the r_u's is 0. If the graph is regular, then p is also 0.



Figure 1: A degree-3 graph on 8 nodes, with isomorphism classes indicated by colors. While nodes of the same color are structurally identical, nodes of different colors are not. A standard GNN limited by the 1-WL test cannot distinguish between nodes of different colors. However, affinity-based networks that use effective resistances, hitting times, or resistive embeddings can distinguish every pair of such nodes.


The PNA dataset is a set of structured graph tasks, and it complements our other datasets. As we can see in Table 1, even adding a single feature, effective resistance (ER GNN), yields the best average score compared to the other models. Using hitting times as edge features improves upon effective resistances alone. However, once we combine all ER features, which include effective resistances, hitting times, and node and edge embeddings, we get the best scores. On these structured tasks, the affinity-based measures provide a significant advantage.

Test % AUC-ROC on the MolHIV dataset. Our results are averaged over 5 seeds.

ogbg-molpcba performance for various model depths. Best performance across all models is underlined. NN refers to Noisy Nodes.

LARGE-SCALE GRAPH REGRESSION: OGB-LSC PCQM4MV1

The task is to predict the HOMO-LUMO gap, an important quantum-chemical property. It is anticipated that structural features such as ER could be of great help on this task, as the v1 version of the dataset is provided without any 3D structural information, and the molecule's geometry is assumed critical for predicting the gap. We report single-model validation performance on this dataset, in line with previous works Godwin et al. (2022); Ying et al. (2021a); Addanki et al. (2021).

Results on the PNA dataset for MPNNs with Graphormer-based features (yellow) as well as SPD-based P-GNNs (orange). Here, CE refers to the centrality encoding, which is incorporated in the relevant MPNNs as additional node features. Similarly, SPD refers to shortest path distance features: in the relevant MPNNs, shortest path distances between all pairs of nodes in the graph are incorporated as edge features, along with an additional edge feature indicating whether an edge exists in the input graph. Therefore, the MPNN baselines are all variants of the same model with additional node/edge features. Similarly, P-GNN (You et al., 2019) uses SPD features with respect to a set of chosen anchor nodes. The average score metric is, as before, the average of the log(MSE) metric over all six tasks, as in Table 1.


A HYPERPARAMETERS FOR THE PNA DATASET

In this section, we provide the hyperparameters used for the different models on the PNA multitask benchmark. We train all models for 2000 steps and with 3 layers. The remaining hyperparameters, namely the hidden size of each layer, the learning rate, the number of message passing steps (only valid for MPNN models), the number of rotation matrices, and the same-example frequency (when relevant), are provided in Table 6 .

B SCALING TO LARGER GRAPHS: OGBN-ARXIV

Most expressive GNNs that rely on the computation of structural features have not been scaled beyond small molecular datasets (such as the ones discussed in prior sections). This is because computing these features requires time or storage complexity that is at least quadratic in the graph size, making them inapplicable even for modest-sized graphs. This is, however, not the case for our proposed affinity-based metrics. We demonstrate this by scalably computing them on a larger-scale node classification benchmark, ogbn-arXiv (a citation network with the goal of predicting the arXiv category of each paper). As MPNN models overfit this transductive dataset quite easily, the dominant approach to tackling it is graph attention networks (GATs) (Veličković et al., 2018). Accordingly, we trained a simple four-layer GAT on this dataset, achieving 72.02% ± 0.05 test accuracy. This compares with 71.97% ± 0.24 reported for a related attentional baseline on the leaderboard (Zhang et al., 2018), indicating that our baseline performance is relevant. ER embeddings on ogbn-arXiv would need to be exceptionally high-dimensional to achieve accurate ER estimates (∼11,000 dimensions), hence we were unable to use them here. However, incorporating ER scalar features into our GAT model yielded a statistically significant improvement, to 72.14% ± 0.03 test accuracy. Hitting time features improve this result further, to 72.25% ± 0.04 test accuracy.
This demonstrates that our affinity-based metrics can yield useful improvements even on larger-scale graphs, which are traditionally out of reach for methods like DGN (Beaini et al., 2021) due to computational complexity limitations. Reliable global leaderboarding with respect to ogbn-arXiv is difficult, as state-of-the-art approaches rely on privileged information (such as the raw text of the paper abstracts), incorporate node labels as features (Wang et al., 2021b), post-process the predictions (Huang et al., 2020), or use various related tricks (Wang et al., 2021b). With that in mind, we report for convenience that the current state-of-the-art performance for ogbn-arXiv without using raw text is 76.11% ± 0.09 test accuracy, achieved by GIANT-XRT+DRGAT.
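For intuition on the dimensionality issue mentioned above: resistive embeddings are typically approximated via Johnson-Lindenstrauss random projections, whose dimension must grow like O(log n / ε²) to give accurate ER estimates. The sketch below is our own illustration, not the paper's pipeline; it uses a dense pseudoinverse for clarity, whereas a production implementation on a graph the size of ogbn-arXiv would instead solve Laplacian systems with a fast solver:

```python
import numpy as np

def approx_er_embedding(edges, n, k, rng):
    """k-dimensional approximate resistive embedding via a
    Johnson-Lindenstrauss sketch: ||z_u - z_v||^2 concentrates
    around Res(u, v) once k = O(log n / eps^2)."""
    m = len(edges)
    B = np.zeros((m, n))               # signed edge-node incidence
    for i, (u, v) in enumerate(edges):
        B[i, u], B[i, v] = 1.0, -1.0
    L = B.T @ B                        # graph Laplacian, unit weights
    Lp = np.linalg.pinv(L)             # dense for clarity; at scale,
                                       # solve Laplacian systems instead
    Q = rng.choice([-1.0, 1.0], size=(k, m)) / np.sqrt(k)
    return (Q @ B @ Lp).T              # row u is z_u

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]        # 4-cycle
Z = approx_er_embedding(edges, n=4, k=500, rng=rng)
# Exact Res(0, 2) on the 4-cycle is 1.0 (two parallel 2-ohm paths).
est = float(np.sum((Z[0] - Z[2]) ** 2))
```

With k = 500 sketch dimensions the estimate lands close to the exact value of 1.0; driving the approximation error down uniformly over a large graph is what pushes the required dimension into the thousands.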

C OMITTED PROOFS

Lemma 3.2. For any pair of nodes u, v, we have ‖r_u − r_v‖₂² = Res(u, v).

Proof. Here, equation (9) follows from Lemma 4.1 with probability 1 − o(1), by our choice of k (as Lemma 4.1 guarantees each of the two bounds with probability 1 − o(1), and one can take a union bound over the two events).

Lemma 4.3. H_{u,v} = 2M ⟨r_v − r_u, r_v − p⟩, where p := Σ_u π_u r_u.

Proof. Consider the following expression of hitting times in terms of commute times, due to Tetali (1991):

H_{u,v} = (1/2) [ K_{u,v} + Σ_w π_w (K_{v,w} − K_{u,w}) ].    (10)

Dividing both sides of Equation (10) by 2M and using the relation K_{u,v} = 2M · Res(u, v), we see that:

(1/(2M)) H_{u,v} = (1/2) [ Res(u, v) + Σ_w π_w (Res(v, w) − Res(u, w)) ].    (11)

Let us focus on the inner summation. After expanding out the squared norms, we see that:

Σ_w π_w (Res(v, w) − Res(u, w)) = Σ_w π_w (‖r_v − r_w‖² − ‖r_u − r_w‖²) = ‖r_v‖² − ‖r_u‖² − 2 ⟨r_v − r_u, p⟩.

Substituting this back into Equation (11), we can express (1/(2M)) H_{u,v} as:

(1/(2M)) H_{u,v} = (1/2) [ ‖r_u − r_v‖² + ‖r_v‖² − ‖r_u‖² − 2 ⟨r_v − r_u, p⟩ ] = ⟨r_v − r_u, r_v⟩ − ⟨r_v − r_u, p⟩ = ⟨r_v − r_u, r_v − p⟩,

where we used Corollary 4.2 in the first step and Definition 3.5 in the last step.
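Both identities can be sanity-checked numerically. The sketch below is our own illustration (assumptions: unit edge weights; r_u taken as the u-th column of the symmetric square root of L⁺, which realizes Lemma 3.2 exactly); it compares the Lemma 4.3 formula against hitting times computed independently from the standard absorbing-walk linear system:

```python
import numpy as np

def embedding_and_walk_data(edges, n):
    """Laplacian-derived quantities for an unweighted graph."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    d = A.sum(axis=1)
    L = np.diag(d) - A
    lam, U = np.linalg.eigh(L)
    inv = np.zeros_like(lam)
    nz = lam > 1e-9
    inv[nz] = 1.0 / lam[nz]
    R = (U * np.sqrt(inv)) @ U.T   # column u is r_u = (L^+)^{1/2} e_u
    M = A.sum() / 2.0              # number of edges
    pi = d / (2.0 * M)             # stationary distribution
    return A, d, R, M, pi

def hitting_time(A, d, u, v):
    """E[steps] for a random walk from u to first reach v,
    via the standard absorbing linear system."""
    n = len(d)
    P = A / d[:, None]
    idx = [i for i in range(n) if i != v]
    h = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)],
                        np.ones(n - 1))
    return 0.0 if u == v else float(h[idx.index(u)])

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # 4-cycle + chord
A, d, R, M, pi = embedding_and_walk_data(edges, 4)
p = R @ pi                                  # p = sum_u pi_u r_u
res_02 = float(np.sum((R[:, 0] - R[:, 2]) ** 2))        # Lemma 3.2
lemma_43 = 2.0 * M * float((R[:, 2] - R[:, 0]) @ (R[:, 2] - p))
```

On this graph, Res(0, 2) is 1/2 (a 1-ohm edge in parallel with two 2-ohm paths), H(0, 2) is 5/2, and the Lemma 4.3 expression reproduces the hitting time; the columns of R also average to zero, as noted earlier.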

D COMPARISON: EFFECTIVE RESISTANCES VS. SHORTEST PATH DISTANCES

Given that effective resistance (ER) captures times associated with random walks in a graph, it is natural to ask how effective resistances compare to shortest path distances (SPDs) between nodes in a graph. Indeed, for some simple graphs, e.g., trees, shortest path distances and effective resistances turn out to be identical. In general, however, effective resistances and shortest path distances behave quite differently. Nevertheless, it is tempting to ask how effective resistance features compare to SPD features in GNNs, especially as a number of recent model architectures make use of SPD features (e.g., Graphormer (Ying et al., 2021b), Position-Aware GNNs (You et al., 2019), DE-GNN (Li et al., 2020)). We first note that the most natural direct comparison of our ER-based MPNNs with SPD-based networks does not quite make sense. The reason is that the analogous comparison would be to determine the effect of replacing ERs with SPDs as features in our MPNNs. However, since our networks only use ER features along edges of the given graph, the corresponding SPD features would be trivial: the SPD between two nodes directly connected by an edge is 1, resulting in a constant feature on every edge. As a result, graph learning architectures that use SPDs typically either (a) use a densely-connected network (e.g., Graphormer (Ying et al., 2021b), which uses a densely-connected attention mechanism) that incurs O(n²) overhead, or (b) pick a small set of anchor or landmark nodes to which SPDs from all other nodes are computed and incorporated as node features (e.g., Position-Aware GNNs (You et al., 2019), DE-GNN (Li et al., 2020)).
We stress that the former approach generally modifies the graph (by connecting all pairs of nodes) and therefore does not fall within the standard MPNN approach, while the latter includes architectures that do fall within the MPNN paradigm. Furthermore, we note that DE-GNNs are arguably one of the proposals closest to ours, as they compute distance-encoded features. These features can be at least as powerful as our proposed affinity-based features if polynomially many powers of the adjacency matrix are used. However, for all but the smallest graphs, using this many powers is impractical; in fact, Li et al. (2020) only use powers of A up to 3, which would not be able to reliably approximate affinity-based features. We also observe that the DE-GNN paper is concerned with learning representations of small sets of nodes (e.g., node-, link-, and triangle-prediction) and does not show how to handle graph prediction tasks, which the authors mention as possible future work. This makes a direct comparison of our methods with DE-GNNs difficult.
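The tree/general-graph contrast mentioned above is easy to verify numerically; a minimal sketch (our illustration, unit edge weights, dense pseudoinverse):

```python
import numpy as np

def res_and_spd(edges, n):
    """Effective resistance and shortest-path distance matrices
    for an unweighted graph."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)
    dg = np.diag(Lp)
    # Res(u, v) = L+_uu + L+_vv - 2 L+_uv
    Res = dg[:, None] + dg[None, :] - 2.0 * Lp
    # Floyd-Warshall for shortest path distances
    D = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for w in range(n):
        D = np.minimum(D, D[:, [w]] + D[[w], :])
    return Res, D

# On a tree (here a path graph), ER equals SPD for every pair...
Res_t, D_t = res_and_spd([(0, 1), (1, 2), (2, 3)], 4)
# ...but on a 4-cycle they differ: Res(0,1) = 1*3/4 = 0.75, SPD = 1.
Res_c, D_c = res_and_spd([(0, 1), (1, 2), (2, 3), (3, 0)], 4)
```

The cycle case illustrates the general phenomenon: every additional path between two nodes lowers their effective resistance, while the shortest path distance only sees the single best route.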

D.1 EMPIRICAL RESULTS

In an effort to empirically compare the expressivity of ER features with that of SPD features, we once again perform experiments on the PNA dataset, picking the following baselines that make use of SPD features:

• The first baseline is roughly an MPNN with Graphormer-based features. More precisely, it is a densely-connected MPNN with SPDs from the original graph as edge features. In order to retain the structure of the original graph, we also use additional edge features to indicate whether or not an edge in the dense (complete) graph is a true edge of the original graph.

