FOSR: FIRST-ORDER SPECTRAL REWIRING FOR ADDRESSING OVERSQUASHING IN GNNS

Abstract

Graph neural networks (GNNs) are able to leverage the structure of graph data by passing messages along the edges of the graph. While this allows GNNs to learn features depending on the graph structure, for certain graph topologies it leads to inefficient information propagation and a problem known as oversquashing. This has recently been linked with the curvature and spectral gap of the graph. On the other hand, adding edges to the message-passing graph can lead to increasingly similar node representations and a problem known as oversmoothing. We propose a computationally efficient algorithm that prevents oversquashing by systematically adding edges to the graph based on spectral expansion. We combine this with a relational architecture, which lets the GNN preserve the original graph structure and provably prevents oversmoothing. We find experimentally that our algorithm outperforms existing graph rewiring methods in several graph classification tasks.

1. INTRODUCTION

Graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2008) are a broad class of models which process graph-structured data by passing messages between nodes of the graph. Due to the versatility of graphs, GNNs have been applied to a variety of domains, such as chemistry, social networks, knowledge graphs, and recommendation systems (Zhou et al., 2020; Wu et al., 2020). GNNs broadly follow a message-passing framework, meaning that each layer of the GNN aggregates the representations of a node and its neighbors, and transforms these features into a new representation for that node. The aggregation function used by the GNN layer is taken to be locally permutation-invariant, since the ordering of the neighbors of a node is arbitrary, and its specific form is a key component of the GNN architecture; varying it gives rise to several common GNN variants (Kipf and Welling, 2017; Veličković et al., 2018; Li et al., 2015; Hamilton et al., 2017; Xu et al., 2019). The output of a GNN can be used for tasks such as graph classification or node classification. Although GNNs are successful in computing dependencies between nodes of a graph, they have been found to suffer from a limited capacity to capture long-range interactions. For a fixed graph, this is caused by a variety of problems depending on the number of layers in the GNN. Since graph convolutions are local operations, a GNN with a small number of layers can only provide a node with information from nodes close to itself. For a GNN with l layers, the receptive field of a node (the set of nodes it receives messages from) is exactly the ball of radius l about the node. For small values of l, this results in "underreaching", and directly limits which functions the GNN can represent. On a related note, the functions representable by GNNs with l layers are limited to those computable by l steps of the Weisfeiler-Lehman (WL) graph isomorphism test (Morris et al., 2019; Xu et al., 2019; Barceló et al., 2020).
On the other hand, increasing the number of layers leads to its own set of problems. In contrast to other architectures that benefit from the expressivity of deeper networks, GNNs experience a decrease in accuracy as the number of layers increases (Li et al., 2018; Chen et al., 2020). This phenomenon has partly been attributed to "oversmoothing", where repeated graph convolutions eventually render node features indistinguishable (Li et al., 2018; Oono and Suzuki, 2020; Cai and Wang, 2020; Zhao and Akoglu, 2020; Rong et al., 2020; Di Giovanni et al., 2022). Separate from oversmoothing is the problem of "oversquashing", first pointed out by Alon and Yahav (2021). As the number of layers of a GNN increases, information from (potentially) exponentially growing receptive fields needs to be concurrently propagated at each message-passing step. This leads to a bottleneck that causes oversquashing, when an exponential amount of information is squashed into fixed-size node vectors (Alon and Yahav, 2021). Consequently, for prediction tasks relying on long-range interactions, the GNN can fail. Oversquashing usually occurs when there are enough layers in the GNN to reach any node (the receptive fields are large enough), but few enough that the GNN cannot process all of the necessary relations between nodes. Hence, for a fixed graph, the problems of underreaching, oversquashing, and oversmoothing occur in three different regimes, depending on the number of layers of the GNN.

Figure 1: Top: Schematic showing different rewiring methods, FoSR (ours), SDRF (Topping et al., 2022), and G-RLEF (Banerjee et al., 2022) for alleviating structural bottlenecks in the input graph. Our method adds new edges that are labeled differently from the existing ones so that the GNN can distinguish them in training. Bottom: Normalized spectral gap and training accuracy as functions of the number of rewiring iterations for a learning task modeled on the NEIGHBORSMATCH problem for a path-of-cliques input (for details, see Appendix B.1.1).

A common approach to addressing oversquashing is to rewire the input graph, making changes to its edges so that it has fewer structural bottlenecks. A simple approach to rewiring is to make the last layer of the GNN fully adjacent, allowing all nodes to interact with one another (Alon and Yahav, 2021). Alternatively, one can make changes to edges of the input graph, feeding the modified graph into all layers of the GNN (Topping et al., 2022; Banerjee et al., 2022). The latter approaches can be viewed as optimizing the spectral gap of the input graph for alleviating structural bottlenecks and improving the overall quality of signal propagation across nodes (see Figure 1). While these rewiring methods improve the connectivity of the graph, there are drawbacks to making too many modifications to the input. The most obvious problem is that we lose topological information about the original graph. If the structure of the original graph is indeed relevant, adding and removing edges diminishes that benefit to the task. Another issue arises from the smoothing effects of adding edges: if we add too many edges to the input graph, an ordinary GCN will suffer from oversmoothing (Li et al., 2018). In other words, if we use this natural approach to rewiring, we experience a trade-off between oversquashing and oversmoothing. This observation, which does not seem to have been pointed out in earlier works, is the main motivation for the approach that we develop in this work.

1.1. MAIN CONTRIBUTIONS

This paper presents a new framework for rewiring a graph to reduce oversquashing in GNNs while preventing oversmoothing. Here are our main contributions:
• We introduce a framework for graph rewiring which can be used with any rewiring method that sequentially adds edges. In contrast to previous approaches that only modify the input graph (e.g., Topping et al., 2022; Banerjee et al., 2022; Bober et al., 2022), our solution gives special labels to the added edges. We then use a relational GNN on this new graph, with the relations corresponding to whether the edge was originally in the input graph or added during the rewiring. This allows us to preserve the input graph topology while using the new edges to improve its connectivity. In Theorem 3 we show that this approach also prevents oversmoothing.
• We introduce a new rewiring method, FoSR (First-order Spectral Rewiring), aimed at optimizing the spectral gap of the graph input to the GNN (Algorithm 1). This algorithm computes the first-order change in the spectral gap from adding each edge, and then adds the edge which maximizes this (Theorem 4 and Proposition 5).
• We empirically demonstrate that the proposed method results in faster spectral expansion (a marker of reduced oversquashing) and improved test accuracy against several baselines on several graph classification tasks (see Table 1). Experiments demonstrate that the relational structure preserving the original input graph significantly boosts test accuracy.

1.2. RELATED WORKS

Past approaches to reducing oversquashing have hinged upon choosing a measure of oversquashing, and modifying the edges of the graph to minimize it. Topping et al. (2022) argue that negatively curved edges are responsible for oversquashing, drawing on curvature notions from Forman (2003) and Ollivier (2009). They introduce a rewiring method known as stochastic discrete Ricci flow (SDRF), which aims to increase the balanced Forman curvature of negatively curved edges by adding new edges. Bober et al. (2022) extend this line of investigation by considering the same type of rewiring but using different notions of discrete curvature. Banerjee et al. (2022) approach oversquashing from an information-theoretic viewpoint, measuring it in terms of the spectral gap of the graph, and demonstrate empirically that this can increase accuracy for certain graph classification tasks. They propose a rewiring algorithm, greedy random local edge flip (G-RLEF), motivated by an expander graph construction employing an edge sampling strategy based on effective resistance (Lyons and Peres, 2017). The work of Alon and Yahav (2021), which first pointed out oversquashing, also introduced an approach to rewiring, where they made the last GNN layer an expander, namely the complete graph, which allows every pair of nodes to connect to each other. They also experimented with making the last layer partially adjacent (randomly including any potential edge). This can be thought of as a form of spectral expansion in the final layer, since random graphs have high spectral gap (Friedman, 1991). In contrast to these works, our method gives a practical way of achieving the largest possible increase in the spectral gap with the smallest possible modification of the input graph, while in fact preserving the input graph topology via a relational structure. Although they are not as closely related, we also find it worthwhile to point to the following works in this general context.
Prior to the diagnosis of the oversquashing problem, Klicpera et al. (2019) used graph diffusion to rewire the input graph, improving long-range connectivity for the GNN. Rewiring can also be performed while training a GNN. Arnaiz-Rodríguez et al. (2022) use first-order spectral methods to define a loss function depending on the adjacency matrix, allowing a GNN to learn a rewiring that alleviates oversquashing. We should mention that aside from rewiring the input graph, some works pursue different approaches to solve oversquashing, such as creating positional embeddings for the nodes or edges inspired by the transformer architecture (Vaswani et al., 2017). The most direct generalization of this approach to graphs is using Laplacian embeddings (Kreuzer et al., 2021; Dwivedi and Bresson, 2020). Brüel-Gabrielsson et al. (2022) combine this with adding neighbors to encode the edges which are the result of multiple hops.

2.1. BACKGROUND ON SPECTRAL GRAPH THEORY

Let $G = (V, E, R)$ be an undirected graph with node set $V$, $|V| = n$, edge set $E$, $|E| = m$, and relation set $R$. The set $R$ is a finite set of relation types, and elements $(u, v, r) \in E$ consist of a pair of nodes $u, v \in V$ together with an associated relation type $r \in R$. When the relation type of an edge is not relevant, we will simply write $(u, v)$ for an edge. For each $v \in V$ we define $N(v)$ to consist of all neighbors of $v$, that is, all $u \in V$ such that there exists an edge $(u, v) \in E$. For each $r \in R$ and $v \in V$, we define $N_r(v)$ to consist of all neighbors of $v$ of relation type $r$. The degree $d_v$ of a node $v \in V$ is the number of neighbors of $v$. We define the adjacency matrix $A = A(G)$ by $A_{ij} = 1$ if $(i, j) \in E$, and $A_{ij} = 0$ otherwise. Let $D = D(G)$ denote the diagonal matrix of degrees given by $D_{ii} = d_i$. The normalized Laplacian $L = L(G)$ is defined as $L = I - D^{-1/2} A D^{-1/2}$. We will often add self-loops (edges $(i, i)$ for $i \in V$) to the graphs we consider, so we define augmented versions of the above matrices corresponding to graphs with self-loops added. If $G$ is a graph without self-loops, we define its augmented adjacency matrix $\tilde{A} := I + A$, its augmented degree matrix $\tilde{D} := I + D$, and its augmented Laplacian $\tilde{L} := I - \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. We denote the eigenvalues of the normalized Laplacian $L$ by $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n \le 2$. Let $\mathbf{1}$ denote the constant function which assumes the value 1 on each node. Then $D^{1/2} \mathbf{1}$ is an eigenfunction of $L$ with eigenvalue 0. The spectral gap of $G$ is $\lambda_2 - \lambda_1 = \lambda_2$. We say that $G$ has good spectral expansion if it has a large spectral gap. In Appendix A, we review the relation between the spectral gap and a related measure of graph expansion, the Cheeger constant.
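To make these definitions concrete, the following minimal sketch (an illustration, not part of the original implementation) builds the normalized Laplacian with NumPy and compares the spectral gap of a path graph with that of a complete graph. The helper names `normalized_laplacian`, `spectral_gap`, `path_graph`, and `complete_graph` are hypothetical, and the code assumes a connected graph given as a dense symmetric 0/1 adjacency matrix.

```python
import numpy as np

def normalized_laplacian(A):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    degrees = A.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degrees)  # assumes there are no isolated nodes
    return np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def spectral_gap(A):
    """Second-smallest eigenvalue lambda_2 of the normalized Laplacian."""
    eigenvalues = np.linalg.eigvalsh(normalized_laplacian(A))  # ascending order
    return eigenvalues[1]

def path_graph(n):
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return A

def complete_graph(n):
    return np.ones((n, n)) - np.eye(n)

gap_path = spectral_gap(path_graph(6))
gap_complete = spectral_gap(complete_graph(6))
print(f"path: {gap_path:.3f}, complete: {gap_complete:.3f}")
```

The complete graph on $n$ nodes has spectral gap $n/(n-1)$, which is 1.2 for $n = 6$, while the path, whose every edge is a bottleneck, has a much smaller gap.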

2.2. BACKGROUND ON RELATIONAL GNNS

The framework we propose fundamentally relies on relational GNNs (R-GNNs) (Battaglia et al., 2018), so we review their formulation here. We define a general R-GNN layer by

$$h_v^{(k+1)} = \phi_k\Big( h_v^{(k)}, \sum_{r \in R} \sum_{u \in N_r(v)} \psi_{k,r}\big(h_u^{(k)}, h_v^{(k)}\big) \Big),$$

where the $\psi_{k,r} : \mathbb{R}^{d_k} \times \mathbb{R}^{d_k} \to \mathbb{R}^{d}$ are relation-specific message functions and the $\phi_k$ are update functions. As a special case of R-GNNs, R-GCN layers are defined by the update

$$h_v^{(k+1)} = \sigma\Big( W^{(k)} h_v^{(k)} + \sum_{r \in R} \sum_{u \in N_r(v)} \frac{1}{c_{u,v}} W_r^{(k)} h_u^{(k)} \Big),$$

where $\sigma$ is a nonlinear activation, $c_{u,v}$ is a normalization factor, and $W^{(k)}$, $W_r^{(k)}$ are relation-specific learned linear maps. One can interpret the $W^{(k)} h_v^{(k)}$ term as adding self-loops to the original graph, and encoding these self-loops as their own relation. We often take $\sigma = \mathrm{ReLU}$ and $c_{u,v} = \sqrt{(1 + d_u)(1 + d_v)}$. Other specializations of R-GNNs include graph convolutional networks (GCNs) (Kipf and Welling, 2017), defined by

$$h_v^{(k+1)} = \sigma\Big( \sum_{u \in N(v) \cup \{v\}} \frac{1}{c_{u,v}} W^{(k)} h_u^{(k)} \Big),$$

and graph isomorphism networks (GINs) (Xu et al., 2019), defined by

$$h_v^{(k+1)} = \mathrm{MLP}^{(k)}\Big( \sum_{u \in N(v) \cup \{v\}} h_u^{(k)} \Big).$$
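As an illustration of the R-GCN update above, here is a minimal NumPy sketch, not the authors' implementation. The function name `rgcn_layer` and its argument layout are hypothetical. It assumes $\sigma = \mathrm{ReLU}$, the normalization $c_{u,v} = \sqrt{(1 + d_u)(1 + d_v)}$, row-vector features (so weights multiply from the right), and that each undirected edge appears once, under exactly one relation.

```python
import numpy as np

def rgcn_layer(H, edges_by_relation, W_self, W_rel):
    """One R-GCN update on an undirected graph.

    H: (n, d_in) array of node features, one row per node.
    edges_by_relation: dict mapping a relation name to a list of edges (u, v).
    W_self: (d_in, d_out) weights of the self-loop term.
    W_rel: dict mapping a relation name to its (d_in, d_out) weights.
    """
    n = H.shape[0]
    degree = np.zeros(n)
    for edges in edges_by_relation.values():
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1

    out = H @ W_self  # self-loop term, treated as its own relation
    for relation, edges in edges_by_relation.items():
        for u, v in edges:
            c = np.sqrt((1 + degree[u]) * (1 + degree[v]))
            out[v] += (H[u] @ W_rel[relation]) / c  # message u -> v
            out[u] += (H[v] @ W_rel[relation]) / c  # message v -> u
    return np.maximum(out, 0.0)  # ReLU

# Two nodes joined by one edge of a single (hypothetical) relation "bond".
H = np.array([[1.0], [2.0]])
out = rgcn_layer(
    H,
    {"bond": [(0, 1)]},
    W_self=np.array([[1.0]]),
    W_rel={"bond": np.array([[2.0]])},
)
print(out)
```

In this two-node example both nodes have degree 1, so $c_{u,v} = 2$ and each node receives its own feature plus the neighbor's feature scaled by $2/2 = 1$.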

3. RELATIONAL REWIRING OF GNNS

We introduce a graph rewiring framework to improve connectivity while retaining the original graph via a relational structure, which demonstrably allows us to also control the rate of smoothing. The rate of smoothing measures the similarity of neighboring node features in the GNN output graph. Adding separate weights for the rewired edges allows the network more flexibility in learning an appropriate rate of smoothing. Our main result in this section, Theorem 3, makes this precise.

3.1. RELATIONAL REWIRING

We incorporate relations into our architecture in the following way. Suppose that we rewire $G = (V, E_1)$ by adding edges, yielding a rewired graph $G' = (V, E_1 \cup E_2)$. We equip $G'$ with a relational structure by assigning each edge in $E_1$ the edge type 1, and each edge in $E_2$ the edge type 2. For example, in an R-GCN, this relational structure would result in the following layer form:

$$h_v^{(k+1)} = \sigma\Big( W^{(k)} h_v^{(k)} + \sum_{(u,v) \in E_1} \frac{1}{c_{u,v}} W_1^{(k)} h_u^{(k)} + \sum_{(u,v) \in E_2} \frac{1}{c_{u,v}} W_2^{(k)} h_u^{(k)} \Big).$$

The original layer (before relational rewiring) would include only the first two terms. A rewired graph with no relational structure would add the third term but use the same weights $W_2^{(k)} = W_1^{(k)}$. A relational structure serves to provide the GNN with more flexibility, since it remembers both the original structure of the graph and the rewired structure. In the worst-case scenario, if we add too many edges, the GNN may counterbalance this by setting most of the weights $W_2^{(k)}$ to be 0, focusing its attention on the original graph. In a more realistic scenario, the GNN can use the original edge weights $W_1^{(k)}$ simply to transmit information along the graph. Since the weights assigned to the original and rewired edges are separate, their purposes can be served in parallel without sacrificing the topology of the graph. Finally, this R-GCN model allows us to better regulate the rate of smoothing, as we demonstrate next.
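The two limiting cases described above (shared weights and zeroed weights on the added edges) can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: the names `adjacency` and `rewired_rgcn_layer` are hypothetical, features are stored as rows, and the normalization $c_{u,v} = \sqrt{(1 + d_u)(1 + d_v)}$ is assumed to use degrees in the rewired graph $G'$.

```python
import numpy as np

def adjacency(n, edges):
    """Dense symmetric adjacency matrix of an undirected edge list."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return A

def rewired_rgcn_layer(H, A1, A2, W_self, W1, W2):
    """ReLU(H W_self + N(A1) H W1 + N(A2) H W2) with N(A)_uv = A_uv / c_uv.

    Assumption: c_uv = sqrt((1 + d_u)(1 + d_v)), degrees taken in the rewired
    graph (original edges A1 plus added edges A2).
    """
    degree = (A1 + A2).sum(axis=1)
    scale = 1.0 / np.sqrt(1.0 + degree)

    def normalize(A):
        return scale[:, None] * A * scale[None, :]

    pre_activation = H @ W_self + normalize(A1) @ H @ W1 + normalize(A2) @ H @ W2
    return np.maximum(pre_activation, 0.0)

n = 4
A1 = adjacency(n, [(0, 1), (1, 2), (2, 3)])  # original path graph, edge type 1
A2 = adjacency(n, [(0, 3)])                  # added edge, edge type 2

rng = np.random.default_rng(0)
H = rng.random((n, 3))                       # positive toy features
W_self, W1, W2 = (rng.random((3, 2)) for _ in range(3))

relational = rewired_rgcn_layer(H, A1, A2, W_self, W1, W2)
# Sharing weights (W2 = W1) collapses to a non-relational layer on the union graph.
shared = rewired_rgcn_layer(H, A1, A2, W_self, W1, W1)
union = rewired_rgcn_layer(H, A1 + A2, np.zeros((n, n)), W_self, W1, W2)
# Zeroing W2 silences messages along the added edges.
muted = rewired_rgcn_layer(H, A1, A2, W_self, W1, np.zeros((3, 2)))
print(np.allclose(shared, union), np.allclose(relational, muted))
```

Only nodes 0 and 3, the endpoints of the added edge, see a difference between `relational` and `muted`. Note that under the normalization assumed here, zeroing $W_2^{(k)}$ removes the messages on added edges but the degrees in $c_{u,v}$ still reflect the rewired graph.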

3.2. RATE OF SMOOTHING FOR R-GCN WITH REWIRING

In this section, we analyze the smoothing effects of repeatedly applying R-GCN layers with rewiring. Here, we consider R-GCN layers $\varphi : \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ of the form

$$\varphi(X) = X\Theta + \sum_{r \in R} D^{-1/2} A_r D^{-1/2} X \Theta_r, \qquad (1)$$

where $\Theta, \Theta_r \in \mathbb{R}^{p \times p}$ are weight matrices, $A_r$ is the adjacency matrix of the $r$-th relation, and $D$ is the matrix of degrees of $G$. For a vanilla GCN without relations or residual connections, one problem with rewiring the graph is that adding too many edges will result in oversmoothing (Di Giovanni et al., 2022). Indeed, adding edges to a graph will by definition increase its Cheeger constant. For input graphs with a high spectral gap (and hence high Cheeger constant), a GCN will produce similar representations for each node (Xu et al., 2018; Oono and Suzuki, 2020). As an extreme case, if we replace the graph with a complete graph, the GCN layer will assign each node the same representation. We will show that R-GCNs are robust to this effect, allowing us to add edges without the adverse side effect of oversmoothing.

Definition 1. Let $G$ be a connected graph with adjacency matrix $A$ and normalized Laplacian $L$. For $i \in V$, let $d_i$ denote the degree of node $i$. Given a scalar field $f \in \mathbb{R}^n$, its Dirichlet energy with respect to $G$ is defined as

$$\mathcal{E}(f) := \frac{1}{2} \sum_{i,j} A_{i,j} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2 = f^T L f.$$

For a vector field $X \in \mathbb{R}^{n \times p}$, we define

$$\mathcal{E}(X) := \frac{1}{2} \sum_{i,j,k} A_{i,j} \left( \frac{X_{i,k}}{\sqrt{d_i}} - \frac{X_{j,k}}{\sqrt{d_j}} \right)^2 = \mathrm{Tr}(X^T L X).$$

The Dirichlet energy of a unit-norm function on the nodes of a graph is a measure of how "non-smooth" it is (Chung and Graham, 1997). The Dirichlet energy of a function $f$ is small when $f_i / \sqrt{d_i}$ is close in value to $f_j / \sqrt{d_j}$ for all edges $(i, j)$. In the case where $G$ is $d$-regular (all nodes have degree $d$), this reduces to the familiar notion of adjacent nodes having representations close in value. Hence, we make the following definition:

Definition 2. Let $G$ be a graph and $\varphi : \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ be a mapping.
We define the rate of smoothing of $\varphi$ with respect to $G$ as

$$\mathrm{RS}_G(\varphi) := 1 - \left( \frac{\sup_{X : \mathcal{E}(X) \neq 0} \mathcal{E}(\varphi(X)) / \mathcal{E}(X)}{\sup_{X : X \neq 0} \|\varphi(X)\|_F^2 / \|X\|_F^2} \right)^{1/2}.$$

The numerator of the fraction above indicates the rate of decay (or expansion) of the Dirichlet energy upon applying $\varphi$. We take a supremum over $X$ to find the largest possible relative change in energy upon applying $\varphi$. We would also like our notion of the rate of smoothing to be scale-invariant; multiplying $\varphi$ by a scalar should not change its rate of smoothing. To impose scale-invariance, we divide by a factor in the denominator which captures how much $\varphi$ scales up the entries of $X$. By defining the rate of smoothing as a ratio of two norms, we can estimate them separately, which gives us a good theoretical handle on smoothing. Note that if $\varphi$ is linear,

$$\mathrm{RS}_G(\varphi) = 1 - \left( \frac{\sup_{\mathcal{E}(X) = 1} \mathcal{E}(\varphi(X))}{\sup_{\|X\|_F = 1} \|\varphi(X)\|_F^2} \right)^{1/2},$$

since $\| \cdot \|_F^2$ and $\mathcal{E}$ are quadratic forms. The following theorem shows that R-GCN layers are flexible in their ability to choose an appropriate rate of smoothing for a graph:

Theorem 3. Let $G_1 = (V, E_1)$ be a graph and $G_2 = (V, E_1 \cup E_2)$ be a rewiring of $G_1$. Consider an R-GCN layer $\varphi$ defined as in (1), with relations $r_1 = E_1$, $r_2 = E_2$. Then for any $\lambda \in [0, \lambda_2(L(G_2))]$, there exist values of $\Theta, \Theta_1, \Theta_2$ for which $\varphi$ smooths with rate $\mathrm{RS}_{G_2}(\varphi) = \lambda$ with respect to $G_2$.

Proof. The map $\varphi$ is given by

$$\varphi(X; \Theta, \Theta_1, \Theta_2) = X\Theta + D^{-1/2} A_1 D^{-1/2} X \Theta_1 + D^{-1/2} A_2 D^{-1/2} X \Theta_2,$$

where $A = A_1 + A_2$ is the adjacency matrix of $G_2$. Here $D$ is the degree matrix of $G_2$. Fix $\alpha \in [0, 1]$, and take $\Theta_1 = \Theta_2 = \alpha I$ and $\Theta = (1 - \alpha) I$. Then

$$\varphi(X) = (1 - \alpha) X + \alpha D^{-1/2} A_1 D^{-1/2} X + \alpha D^{-1/2} A_2 D^{-1/2} X = (1 - \alpha) X + \alpha D^{-1/2} A D^{-1/2} X = (I - \alpha L) X,$$

where $L$ is the normalized Laplacian of $G_2$. We will show that $\varphi$ smooths with rate $\alpha \lambda_2(L)$. Let $L = U \Sigma U^T$ be an orthogonal diagonalization of $L$. Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $L$ in ascending order and recall that $0 \le \lambda_i \le 2$ for all $i$ with $\lambda_1 = 0$.
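We have

$$\begin{aligned} \mathcal{E}(\varphi(X)) &= \mathrm{Tr}\big(X^T L (I - \alpha L)^2 X\big) = \mathrm{Tr}\big(X^T U \Sigma (I - \alpha \Sigma)^2 U^T X\big) \\ &= \sum_{i,j} \lambda_i (1 - \alpha \lambda_i)^2 \big((U^T X)_{i,j}\big)^2 \\ &\overset{(a)}{\le} \sum_{i,j} \lambda_i (1 - \alpha \lambda_2)^2 \big((U^T X)_{i,j}\big)^2 \\ &= (1 - \alpha \lambda_2)^2 \, \mathrm{Tr}\big(X^T U \Sigma U^T X\big) = (1 - \alpha \lambda_2)^2 \, \mathcal{E}(X), \end{aligned}$$

with equality in (a) when $(U^T X)_{i,j} = 0$ for $i \neq 2$. Hence,

$$\sup_{\mathcal{E}(X) \neq 0} \frac{\mathcal{E}(\varphi(X))}{\mathcal{E}(X)} = (1 - \alpha \lambda_2)^2. \qquad (2)$$

Next, we compute

$$\begin{aligned} \|\varphi(X)\|_F^2 &= \mathrm{Tr}\big(X^T (I - \alpha L)^2 X\big) = \mathrm{Tr}\big(X^T U (I - \alpha \Sigma)^2 U^T X\big) \\ &= \sum_{i,j} (1 - \alpha \lambda_i)^2 \big((U^T X)_{i,j}\big)^2 \overset{(b)}{\le} \sum_{i,j} \big((U^T X)_{i,j}\big)^2 = \|U^T X\|_F^2 = \|X\|_F^2, \end{aligned}$$

with equality in (b) when $(U^T X)_{i,j} = 0$ for $i \neq 1$. Hence

$$\sup_{X \neq 0} \frac{\|\varphi(X)\|_F^2}{\|X\|_F^2} = 1. \qquad (3)$$

Combining (2) and (3) yields $\mathrm{RS}_{G_2}(\varphi) = \alpha \lambda_2 = \alpha \lambda_2(L)$. Since this holds for any $\alpha \in [0, 1]$, the result follows.

In Appendix B.3, we empirically demonstrate that R-GNNs achieve lower Dirichlet energy than GNNs.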
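The quantities in this section are easy to evaluate numerically. The following sketch is an illustration only (the helper names and the small test graph are assumptions, not taken from the paper). It checks that the pairwise-sum and trace forms of the Dirichlet energy agree, and that the layer $\varphi(X) = (I - \alpha L) X$ from the proof contracts the energy of the second eigenvector by exactly $(1 - \alpha \lambda_2)^2$ while having operator norm 1.

```python
import numpy as np

def normalized_laplacian(A):
    inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))  # assumes no isolated nodes
    return np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def dirichlet_energy(A, X):
    """1/2 * sum_{i,j,k} A_ij * (X_ik / sqrt(d_i) - X_jk / sqrt(d_j))^2."""
    Y = X / np.sqrt(A.sum(axis=1))[:, None]
    diff = Y[:, None, :] - Y[None, :, :]      # shape (n, n, p)
    return 0.5 * np.sum(A[:, :, None] * diff ** 2)

# Toy graph: a 5-cycle with one chord (connected, so lambda_2 > 0).
n = 5
A = np.zeros((n, n))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]:
    A[u, v] = A[v, u] = 1.0
L = normalized_laplacian(A)

# The two expressions for the Dirichlet energy agree.
X = np.random.default_rng(0).standard_normal((n, 3))
energy_sum = dirichlet_energy(A, X)
energy_trace = np.trace(X.T @ L @ X)

# phi = I - alpha * L applied to the second eigenvector of L.
eigenvalues, U = np.linalg.eigh(L)            # eigenvalues in ascending order
lam2, x2 = eigenvalues[1], U[:, [1]]          # x2 kept as an (n, 1) column
alpha = 0.5
phi = np.eye(n) - alpha * L
energy_ratio = dirichlet_energy(A, phi @ x2) / dirichlet_energy(A, x2)
operator_norm = np.linalg.norm(phi, 2)        # largest singular value
print(energy_ratio, (1 - alpha * lam2) ** 2, operator_norm)
```

For this input, $1 - \sqrt{\texttt{energy\_ratio}} / \texttt{operator\_norm}$ equals $\alpha \lambda_2$, matching the rate obtained in the proof. The sketch evaluates the ratio only at the second eigenvector and does not search over all $X$.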

4. FOSR: FIRST ORDER SPECTRAL REWIRING

In order to use the R-GCN framework, we need to make a choice of rewiring algorithm that is capable of adding edges without removing any. We propose a first-order spectral rewiring algorithm (FoSR) with the goal of improving graph connectivity. The rate of spectral expansion, i.e., the rate at which the spectral gap improves as a function of the rewiring iterations, is one of the key determinants of oversquashing (Banerjee et al., 2022; Deac et al., 2022). Compared to existing approaches such as SDRF (Topping et al., 2022) and G-RLEF (Banerjee et al., 2022), the proposed algorithm has a faster rate of spectral expansion. See Figure 1 for an illustration. Our algorithm FoSR follows a similar philosophy as Chan and Akoglu (2016), optimizing the spectral gap of the input graph by sequentially adding edges such that the rate of spectral expansion is maximized. At each step, we wish to add the edge that maximizes the spectral gap of the resulting graph. However, computing the spectral gap for each edge addition is expensive, as it requires us to compute eigenvalues for $O(n^2)$ different matrices. Instead of computing this quantity directly, we use a first-order approximation of the spectral gap based on matrix perturbation theory. The following theorem determines the first-order change of the eigenvalues of a perturbed symmetric matrix:

Theorem 4. For symmetric matrices $M \in \mathbb{R}^{n \times n}$ with distinct eigenvalues, the $i$-th largest eigenvalue $\lambda_i$ satisfies $\nabla_M \lambda_i(M) = x_i x_i^T$, where $x_i$ denotes the (normalized) eigenvector for the $i$-th largest eigenvalue of $M$.

We provide a proof in Appendix C. Stated differently, the previous theorem allows us to make the linear approximation

$$\lambda_2(M + \delta M) \approx \lambda_2(M) + \mathrm{Tr}\big((\nabla \lambda_2)^T \delta M\big) = \lambda_2(M) + x_2^T (\delta M) x_2. \qquad (4)$$

Let us apply the approximation (4) to the spectrum of a graph $G$. We wish to sequentially add an edge $(u, v)$ in a way that maximizes the spectral gap of the resulting graph.
Equivalently, the second-largest eigenvalue of $D^{-1/2} A D^{-1/2}$ should be minimized, since this matrix is equal to $I - L$. Using Theorem 4 for the matrix $D^{-1/2} A D^{-1/2}$, we obtain the following result.

Proposition 5. The first-order change in $\lambda_2 = \lambda_2(D^{-1/2} A D^{-1/2})$ from adding the edge $(u, v)$ is

$$\frac{2 x_u x_v}{\sqrt{1 + d_u} \sqrt{1 + d_v}} + 2 \lambda_2 x_u^2 \left( \frac{\sqrt{d_u}}{\sqrt{1 + d_u}} - 1 \right) + 2 \lambda_2 x_v^2 \left( \frac{\sqrt{d_v}}{\sqrt{1 + d_v}} - 1 \right), \qquad (5)$$

where $x$ denotes the second eigenvector of $D^{-1/2} A D^{-1/2}$, and $x_u$ denotes the $u$-th entry of $x$.

We provide a proof in Appendix C. We choose to minimize the first term in (5),

$$\frac{2 x_u x_v}{\sqrt{(1 + d_u)(1 + d_v)}}, \qquad (6)$$

since this is simpler and cheaper to compute. To see why this approximation is reasonable, note that $\frac{1}{\sqrt{1 + d_u} \sqrt{1 + d_v}} \sim (d_u d_v)^{-1/2}$ as $d_u, d_v \to \infty$, while $\frac{\sqrt{d_u}}{\sqrt{1 + d_u}} - 1 \sim d_u^{-2}$. So if the degrees of the nodes are sufficiently large and comparable in size, the first term in (5) dominates. Again, our decision to optimize the first term is made for computational reasons. In Appendix B.2, we show that FoSR has significantly smaller computational overhead compared to SDRF on real and synthetic datasets. Further, we show in Appendix B.1.2 that FoSR has a much faster rate of spectral expansion compared to SDRF. Finally, in Appendix B.4, we discuss the approximation error of FoSR.

For each iteration that it runs, FoSR computes an approximation of the second eigenvector $x \in \mathbb{R}^n$ of $D^{-1/2} A D^{-1/2}$ and chooses an edge $(u, v)$ which minimizes (6). To produce an initial approximation of $x$, we use power iteration (rather than computing a full eigendecomposition). Let $d \in \mathbb{R}^n$ denote the vector of degrees of $G$. Let $\sqrt{d}$ denote the entry-wise square root of $d$, i.e., with components $\sqrt{d_i}$. Recall that $\sqrt{d}$ is the first eigenvector of $D^{-1/2} A D^{-1/2}$. Since $x$ is the second eigenvector of $D^{-1/2} A D^{-1/2}$, we may approximate it by repeatedly applying $D^{-1/2} A D^{-1/2}$ and then subtracting the component in the direction of the first eigenvector.
This map is given by

$$x \mapsto D^{-1/2} A D^{-1/2} x - \frac{\langle x, \sqrt{d} \rangle}{2m} \sqrt{d}.$$

After applying this map, we normalize $x$ back to a unit vector, $x \mapsto x / \|x\|$. FoSR operates by repeatedly alternating between adding an edge and computing a new approximation of $x$ via power iteration. Importantly, our algorithm avoids computing a full eigendecomposition of $G$ by only approximating its second eigenvector. We compute increasingly accurate estimates of this eigenvector using power iteration. The steps of our method are outlined in Algorithm 1.

Algorithm 1 FoSR: First-order Spectral Rewiring
Input: $G = (V, E)$, iteration count $k$, initial number of power iterations $r$
Output: Rewired graph $G' = (V, E')$
1: Initialize $x \in \mathbb{R}^n$ arbitrarily
2: for $i = 1, 2, \ldots, r$ do
3: $x \leftarrow D^{-1/2} A D^{-1/2} x - \frac{\langle x, \sqrt{d} \rangle}{2m} \sqrt{d}$
4: $x \leftarrow x / \|x\|_2$
5: end for
6: for $j = 1, 2, \ldots, k$ do
7: Add an edge $(u, v)$ which minimizes $\frac{x_u x_v}{\sqrt{(1 + d_u)(1 + d_v)}}$
8: $x \leftarrow D^{-1/2} A D^{-1/2} x - \frac{\langle x, \sqrt{d} \rangle}{2m} \sqrt{d}$ ▷ Power iteration to update second eigenvector
9: $x \leftarrow x / \|x\|_2$
10: end for

The number of edge additions $k$ and the initial number of power iterations $r$ are hyperparameters. The number of power iterations $r$ only needs to be chosen large enough to produce an initial approximation of the eigenvector $x$. In practice, we found that taking $r$ between 5 and 10 is sufficient. The proper choice of iteration count $k$ is problem-specific, and can be chosen such that the spectral gap increases sufficiently. We tried multiple configurations of $k$ when training models.

Computational complexity. Taking advantage of the sparsity of $G$, FoSR requires $O(m)$ floating-point operations for each power iteration and $O(n^2)$ operations for each edge added. The $O(n^2)$ operations come from searching over all node pairs $(i, j)$ for the minimal value of $y_i y_j$, where $y_i = x_i / \sqrt{1 + d_i}$. This results in a cost of $O(kn^2)$. When the graph is sparse, we can often compute the minimal value of $y_i y_j$ in a faster way. We can choose $i = \arg\min_k y_k$ and $j = \arg\max_k y_k$ if $\min_k y_k \le 0$ and the $(i, j)$ chosen in this way is not already an edge.
Indeed, the probability of $(i, j)$ not already being an edge is higher if $G$ is sparse. In this case, the edge addition only takes $O(m)$ operations. More generally, we can relax the minimization by choosing $i = \arg\min_k y_k$ and $j = \arg\max_{k \notin N(i)} y_k$ when $\min_k y_k \le 0$. We can use a similar relaxation when $\min_k y_k > 0$; take $i = \arg\min_k y_k$ and $j = \arg\min_{k \notin N(i)} y_k$. If we use the above relaxations, the total cost is $O(km)$.
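The procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration with dense matrices and the exhaustive $O(n^2)$ edge search, not the released implementation; the names `power_step`, `fosr`, `spectral_gap`, and `dumbbell` are hypothetical, and the sketch assumes a connected input graph with no isolated nodes.

```python
import numpy as np

def power_step(A, x):
    """One deflated power-iteration step for the second eigenvector of D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    sqrt_d = np.sqrt(d)
    two_m = d.sum()  # sum of degrees equals 2m
    x = (A @ (x / sqrt_d)) / sqrt_d - (x @ sqrt_d) / two_m * sqrt_d
    return x / np.linalg.norm(x)

def fosr(A, k, r=10, seed=0):
    """Add k edges, each minimizing x_u x_v / sqrt((1 + d_u)(1 + d_v))."""
    A = A.astype(float).copy()
    x = np.random.default_rng(seed).standard_normal(len(A))
    for _ in range(r):  # initial approximation of the second eigenvector
        x = power_step(A, x)
    added = []
    for _ in range(k):
        y = x / np.sqrt(1.0 + A.sum(axis=1))
        score = np.outer(y, y)
        score[A > 0] = np.inf            # skip existing edges
        np.fill_diagonal(score, np.inf)  # skip self-loops
        u, v = np.unravel_index(np.argmin(score), score.shape)
        if not np.isfinite(score[u, v]):  # graph is already complete
            break
        A[u, v] = A[v, u] = 1.0
        added.append((int(u), int(v)))
        x = power_step(A, x)  # refresh the eigenvector estimate
    return A, added

def spectral_gap(A):
    inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return np.linalg.eigvalsh(L)[1]

def dumbbell(n):
    """Two copies of K_n joined by the single bridge (n - 1, n)."""
    A = np.zeros((2 * n, 2 * n))
    A[:n, :n] = 1.0
    A[n:, n:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[n - 1, n] = A[n, n - 1] = 1.0
    return A

A0 = dumbbell(4)
A_new, added = fosr(A0, k=3)
print(added, spectral_gap(A0), spectral_gap(A_new))
```

On the dumbbell graph the estimated second eigenvector takes opposite signs on the two cliques, so the minimizer of $y_u y_v$ is an edge across the bottleneck. The relaxed $O(m)$ selection rule described above could replace the exhaustive search over `score`.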

5. EXPERIMENTS

We compare our rewiring methods to existing ones to demonstrate the efficacy of relational rewiring as well as rewiring informed by spectral expansion on several graph classification tasks. Datasets We consider graph classification tasks REDDIT-BINARY, IMDB-BINARY, MUTAG, ENZYMES, PROTEINS, COLLAB from the TUDataset (Morris et al., 2020) where the topology of the graphs in relation to the task has been identified to require long-range interactions. While certain node classification tasks have also been considered in previous works, these have been found to be tractable with nearest neighbor information (Brockschmidt, 2020) .

Compared methods

We focus on approaches that preprocess the input graph by adding edges. Diffusion Improves Graph Learning (DIGL) (Klicpera et al., 2019 ) is a diffusion-based rewiring scheme that computes a kernel evaluation of the adjacency matrix, followed by sparsification. SDRF (Topping et al., 2022) surgically rewires a graph by adding edges to support other edges with low curvature, which are locations of bottlenecks. For SDRF, we include results for both the original method (configured to only add edges), and our relational method (again only add edges, and include the added edges with their own relation). Fully adjacent layers (Alon and Yahav, 2021) rewire the graph by adding all possible edges, setting E 2 = (V × V) \ E 1 . We include results for rewiring only the last layer (last layer FA) and rewiring every layer (every layer FA).

Experimental details

We test each rewiring algorithm on an R-GCN (Schlichtkrull et al., 2018), with the relations varying depending on the rewiring method. For rewirings which do not explicitly assign relations to edges, we use two relations: one for the default edge type, and one for self-loops. In other words, we simply use the rewired graph and add residual connections to the architecture. For each task, we use the same GNN architecture for all rewiring methods to illustrate how the choice of rewiring affects test performance. That is, we hold the number of layers, hidden dimension, and dropout probability constant across all rewiring methods for a given dataset. The hyperparameter values are reported in Appendix D.1. However, for each dataset, we separately tune all hyperparameters specific to the rewiring method (such as the number of iterations for FoSR and SDRF). We train all models with an Adam optimizer with a learning rate of $10^{-3}$ and a scheduler which reduces the learning rate by a factor of 10 after 10 epochs with no improvement in validation loss. For FoSR and SDRF, we only tuned the number of rewiring iterations. For DIGL (Klicpera et al., 2019), we tuned the sparsification threshold ($\epsilon$) and teleport probability ($\alpha$). For optimizing hyperparameters and evaluating each configuration, we first select a test set consisting of 10 percent of the graphs and a development set consisting of the other 90 percent of the graphs. We determine accuracies of each configuration using 100 random train/validation splits of the development set consisting of 80 and 10 percent of the graphs, respectively. When training the GNNs, we use a stopping patience of 100 epochs based on the validation loss. We evaluate each hyperparameter configuration using the validation accuracy, so the test set is only used once to generate results for the best hyperparameter configurations.
For the test results, we record 95 percent confidence intervals for the hyperparameters with the best validation accuracy using the 100 runs. Results Table 1 shows the results of our experiments on various GNN layer types. Shown is the test accuracy. For each rewiring method, using a relational GNN type improves classification accuracy. The boost in accuracy from using an R-GNN is greatest for the rewiring methods which surgically add edges (FoSR and SDRF). Out of all of the rewiring methods, FoSR typically achieves the best classification accuracy when relational rewiring is used. When relational rewiring is not used, FoSR still outperforms DIGL and SDRF for most datasets. We note that +FA is sometimes competitive, but this is highly dependent on the GNN architecture. In particular, it does not perform well for GCNs and R-GCNs. Our results are also very competitive against results reported for the DiffWire methods by Arnaiz-Rodríguez et al. (2022, Table 1 ), although they use a different type of architecture that is not directly comparable to preprocessing methods like ours or SDRF which sequentially add edges. 

6. CONCLUSIONS

We proposed an efficient graph rewiring method for preventing oversquashing based on iterative first-order maximization of the spectral gap. Further, we identified a shortcoming of existing rewiring approaches that can cause oversmoothing and proposed a relational rewiring method to overcome this with theoretical guarantees. Experiments on several graph classification benchmarks demonstrate that the proposed methods can significantly improve the performance of GNNs. In future work, it will be interesting to study the effects of oversmoothing and oversquashing in relation to training dynamics and investigate rewiring strategies that aid training.

Reproducibility Statement

The computer implementation of the proposed methods, along with scripts to re-run our experiments, is made publicly available at https://github.com/kedar2/FoSR. The experimental settings are described in detail in Section 5 and Appendix D.1, which also details the compute infrastructure and selected hyperparameter values.

APPENDIX A SPECTRAL GAP AND CHEEGER CONSTANT

A.1 QUANTIFYING STRUCTURAL BOTTLENECKS VIA THE CHEEGER CONSTANT

Since traditional GNNs use the input graph to propagate neural messages, structural characteristics of the input graph play a crucial role in the quality of signal propagation across nodes. Our method is motivated by the key idea that structural bottlenecks in the input graph lead to information oversquashing in GNNs. A classic measure of "bottlenecked-ness" of a graph is the Cheeger constant (Chung and Graham, 1997)

$$h(G) := \min_{\emptyset \neq S \subset V} \frac{|\partial S|}{\min\{\operatorname{vol} S, \operatorname{vol} \bar{S}\}},$$

where $\operatorname{vol} S = \sum_{v \in S} d_v$, $\bar{S} = V \setminus S$, and $\partial S = \{(u, v) : u \in S, v \in \bar{S}, (u, v) \in E\}$ is the edge boundary of $S \subset V$. This definition considers the number of edges connecting two complementary subsets of nodes and their volume as measured by the sum of the degrees of the nodes, and then takes the minimum of this value over all possible non-trivial bipartitions of the graph. Note that $0 \le h(G) \le 1$, since $|\partial S| \le \min\{\operatorname{vol} S, \operatorname{vol} \bar{S}\}$ for every $S \subset V$. $G$ is connected if and only if $h(G) > 0$. Intuitively, when $h(G)$ is large, $G$ has no structural bottlenecks in the sense that every part of $G$ is connected to the rest of it by a large fraction of its edges. A large $h(G)$ implies that there is a low probability of being trapped in a small subset of the nodes and, consequently, a simple random walk over the nodes of $G$ is rapidly mixing (Levin and Peres, 2017). Well-connected graphs such as the complete graph $K_n$ over $n$ nodes have a high Cheeger constant, while easily disconnected graphs have a low Cheeger constant. An example of the latter is the dumbbell graph $K_n - K_n$, consisting of two cliques joined by a bridge (see Figure 1), which has a Cheeger constant of $1/(1 + n(n-1))$.
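As a small sanity check of the dumbbell example, the following sketch computes the Cheeger constant exactly by enumerating every non-trivial subset of nodes. This is feasible only for very small graphs, since the number of subsets grows exponentially. The function names `dumbbell_edges` and `cheeger_constant` are illustrative and are not taken from the released code.

```python
from fractions import Fraction
from itertools import combinations


def dumbbell_edges(n):
    """Two n-cliques (nodes 0..n-1 and n..2n-1) joined by a single bridge edge."""
    edges = set()
    for offset in (0, n):
        for u, v in combinations(range(offset, offset + n), 2):
            edges.add((u, v))
    edges.add((n - 1, n))  # the bridge between the two cliques
    return edges


def cheeger_constant(num_nodes, edges):
    """Exact Cheeger constant by brute force over all non-trivial subsets S."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total_volume = sum(degree)

    best = None
    for size in range(1, num_nodes):
        for subset in combinations(range(num_nodes), size):
            members = set(subset)
            # An edge is in the boundary if exactly one endpoint lies in S.
            boundary = sum((u in members) != (v in members) for u, v in edges)
            vol_s = sum(degree[v] for v in members)
            denominator = min(vol_s, total_volume - vol_s)
            if denominator == 0:
                continue  # skip subsets made only of isolated nodes
            ratio = Fraction(boundary, denominator)
            if best is None or ratio < best:
                best = ratio
    return best


for n in (3, 4):
    print(n, cheeger_constant(2 * n, dumbbell_edges(n)), Fraction(1, 1 + n * (n - 1)))
```

Exact rational arithmetic is used so that the brute-force value can be compared with the closed form $1/(1 + n(n-1))$ without floating-point tolerance.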

A.2 BOUNDING THE CHEEGER CONSTANT VIA THE SPECTRAL GAP

In general, computing the exact value of $h(G)$ is known to be a hard problem; see, e.g., Leighton and Rao (1999); Garey et al. (1974); Mohar (1989); Kaibel (2004). Nonetheless, it is possible to bound the Cheeger constant from below and from above in terms of the spectral gap, as we explain next. Recall from Section 2.1 that the spectral gap of a graph is the difference $\lambda_2 - \lambda_1$ between the first two eigenvalues of the Laplacian (in increasing order), whereby $\lambda_1 = 0$. The discrete Cheeger inequality (Cheeger, 1970; Alon and Milman, 1984) shows that the spectral gap of $G$ provides an estimate of its Cheeger constant:

$$\frac{\lambda_2}{2} \le h(G) \le \sqrt{2\lambda_2}.$$

To intuitively understand this inequality, we can consider the variational formulation of the spectral gap (Chung and Graham, 1997):

$$\lambda_2 = \inf_{x \perp D^{1/2}\mathbf{1}} \frac{x^T L x}{x^T x}.$$

We can write the numerator above as

$$x^T L x = \sum_{(i,j) \in E} \left( \frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}} \right)^2.$$

Let $S$ be a subset of the nodes. We may encode $S$ as a "zero-one vector", writing $x_i = \sqrt{d_i}$ if $i \in S$ and $x_i = 0$ if $i \notin S$. This gives us

$$x^T L x = \sum_{(i,j) \in E} \mathbf{1}_{(i,j) \in \partial S} = |\partial S|.$$

the same label as T. We take the input G to be a path-of-cliques, which consists of three cliques, each of size 10, connected in a path by two edges; see Figure 2(b). The set S consists of the first 9 nodes in clique-1, and the target T is the last node of clique-3. The range of the interaction, i.e., the maximum distance between T and a node in S, is 5. We trained a graph attention network (Veličković et al., 2018) with 6 layers, each of width 64, on a training dataset comprising 10000 copies of G, where each copy has a different mapping matching the target. Figure 1 (bottom) shows the evolution of the normalized spectral gap and training accuracy as a function of the number of rewiring iterations for three algorithms: our FoSR, SDRF (Topping et al., 2022), and G-RLEF (Banerjee et al., 2022). SDRF is motivated by the idea that negatively curved edges lead to oversquashing.
The notion of curvature, which can be construed as a "local" Cheeger constant, plays a key role in the SDRF rewiring process. SDRF cycles between adding a supporting edge around a negatively curved edge and removing a redundant positively curved edge; see Figure 1 (top). Since the addition and removal of edges are done independently of each other, SDRF can potentially disconnect a connected input graph. G-RLEF, on the other hand, seeks to improve the "global" Cheeger constant $h(G)$ by sampling edges according to inverse triangle counts (a proxy for curvature) and using a local edge flip mechanism so that the input graph is never disconnected. In our experiments, we configured SDRF to only add edges. For all three algorithms, we observe that both the normalized spectral gap and training accuracy increase monotonically in the number of rewiring steps until they saturate at around 150 iterations. The rate of spectral expansion, i.e., the rate at which the spectral gap improves as a function of the rewiring iterations, is the fastest for FoSR.

B.1.2 THE TUDATASET

Figure 3 shows the normalized spectral gap as a function of the number of rewiring iterations for FoSR and SDRF on the TUDataset (Morris et al., 2020). Again, we observe that FoSR has a much faster rate of spectral expansion than SDRF (Topping et al., 2022).

Figure 3: Normalized spectral gap as a function of the number of rewiring iterations for FoSR and SDRF on the TUDataset graphs (Morris et al., 2020). We record the average spectral gap across all graphs in the dataset for each iteration count.

B.2 COMPARISON OF THE COMPUTATION TIME FOR FOSR AND SDRF

Our First-order Spectral Rewiring (FoSR) method computes the first-order change in the spectral gap from adding each edge, and then adds the edge which maximizes this change. Optimizing only the first-order change gives our method a significant computational advantage over competing curvature-based methods like SDRF (Topping et al., 2022). Figure 4 and Table 2 show the computation time of FoSR and SDRF for a synthetic dataset consisting of Erdős-Rényi graphs and for the TUDataset graphs (Morris et al., 2020), respectively. We observe that the run times for FoSR are almost an order of magnitude lower than those of SDRF.

Table 3 shows that the relational architecture indeed helps to reduce oversmoothing, since the relational rewiring leads to higher Dirichlet energies. Here the Dirichlet energy is measured as an average over all graphs in the dataset after training, summed over 100 runs. However, the tradeoff between oversquashing and oversmoothing still exists. Recall that input graphs with small spectral gaps are prone to oversquashing, whereas inputs with large spectral gaps are prone to oversmoothing (see Section 3.2). We use R-GNNs to alleviate oversmoothing, but there is still an optimal number of edges to add, beyond which performance declines. Figure 5 shows that the performance for R-GCNs peaks at around 25 iterations, which is reflected in both the test accuracy and the Dirichlet energy of the final layer.

B.4 DISCUSSION ON THE APPROXIMATION ERROR OF FOSR

When rewiring a graph using FoSR, the objective at each iteration is to add the edge which maximally increases the spectral gap. Since FoSR makes this calculation using a first-order approximation of the spectral gap, there is some error incurred between the estimated spectral gap and the actual spectral gap. This error is caused by two factors. The first is the error from using the first-order change of the spectral gap instead of the actual change. The second error accrues from using the dominant term of the first-order change instead of the first-order change. We record each of these three quantities (actual change, first-order change, FoSR approximated change) for graphs in the ENZYMES dataset (Morris et al., 2020) in Figure 6. We see that the actual change is well-correlated with the approximations, and that the approximations are well-correlated with each other. While FoSR accumulates some error relative to the actual change in the spectral gap, we are primarily interested in how this affects its ability to choose good edges for rewiring. Hence, it is natural to ask how FoSR compares to a greedy algorithm which computes the spectral gap exactly. We test the performance of FoSR on a synthetic dumbbell graph, consisting of two cliques of size 50, connected via a path of length 3. The results are shown in Figure 7. We see that the spectral evolution of FoSR is indistinguishable from that of the greedy method, indicating that the approximation is strong and that the two methods choose similar edges.

Figure 7: Evolution of the spectral gap of a dumbbell graph under rewiring, evaluated at every iteration count between 0 and 100. The performance of FoSR is nearly identical to a method which selects edges to add via exact computation of eigenvalues.
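The three quantities compared in this discussion can be computed directly on a small graph. The sketch below is illustrative only: it uses exact eigendecompositions from NumPy rather than an approximate eigenvector, a much smaller dumbbell graph (two 4-cliques joined by one edge) than the one used for Figure 7, and hypothetical function names. The explicit first-order expression is a reconstruction from Theorem 4 and the entry-wise changes in the proof of Proposition 5, so it should be read as an assumption. All changes are measured in $\lambda_2(A_N)$, so a negative value corresponds to a larger spectral gap.

```python
from itertools import combinations

import numpy as np


def dumbbell_adjacency(n):
    """Adjacency matrix of two n-cliques joined by a single bridge edge."""
    adj = np.zeros((2 * n, 2 * n))
    adj[:n, :n] = 1.0
    adj[n:, n:] = 1.0
    np.fill_diagonal(adj, 0.0)
    adj[n - 1, n] = adj[n, n - 1] = 1.0
    return adj


def second_eigenpair(adj):
    """Second-largest eigenvalue of D^{-1/2} A D^{-1/2} and its unit eigenvector."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    normalized = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    values, vectors = np.linalg.eigh(normalized)  # eigenvalues in ascending order
    return values[-2], vectors[:, -2]


def candidate_changes(adj):
    """Map each non-edge (u, v) to (actual, first_order, dominant) changes."""
    deg = adj.sum(axis=1)
    lam, x = second_eigenpair(adj)
    shrink = np.sqrt(deg / (1.0 + deg)) - 1.0  # rescaling of rows u and v
    changes = {}
    for u, v in combinations(range(len(adj)), 2):
        if adj[u, v] > 0:
            continue
        # Dominant term: the quantity minimized when selecting an edge.
        dominant = 2.0 * x[u] * x[v] / np.sqrt((1.0 + deg[u]) * (1.0 + deg[v]))
        # Full first-order change (reconstructed formula, see lead-in).
        first_order = dominant + 2.0 * lam * (x[u] ** 2 * shrink[u] + x[v] ** 2 * shrink[v])
        # Actual change, from an exact eigendecomposition of the rewired graph.
        rewired = adj.copy()
        rewired[u, v] = rewired[v, u] = 1.0
        actual = second_eigenpair(rewired)[0] - lam
        changes[(u, v)] = (actual, first_order, dominant)
    return changes


changes = candidate_changes(dumbbell_adjacency(4))
fosr_edge = min(changes, key=lambda e: changes[e][2])    # minimizes dominant term
greedy_edge = min(changes, key=lambda e: changes[e][0])  # minimizes exact change
print("FoSR-style choice:", fosr_edge, changes[fosr_edge])
print("Exact greedy choice:", greedy_edge, changes[greedy_edge])
```

On this toy graph the second eigenvector has opposite signs on the two cliques, so the product $x_u x_v$ is negative exactly for edges that cross the bottleneck, which is why minimizing the dominant term selects a cross-clique edge.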

C DETAILS ON SECTION 4

Proof of Theorem 4. We refer to Stewart and Sun (1990, Theorem 2.3) for a proof that $\lambda_i$ is a continuously differentiable function and a computation of its gradient for general classes of matrices. We provide a calculation for our case here. Fix a symmetric matrix $M_0 \in \mathbb{R}^{n \times n}$ and let $M_0 = U \Sigma U^{-1}$ be an orthonormal diagonalization of $M_0$ with the diagonal entries of $\Sigma$ sorted in descending order. Fix $j, k \in [n]$. Let $E_{j,k}$ denote the matrix whose $(j, k)$ entry is 1 and whose remaining entries are 0. Define a curve $\gamma : \mathbb{R} \to \mathbb{R}^{n \times n}$ by $\gamma(t) := M_0 + t U E_{j,k} U^{-1}$. Then

$$\lambda_i(\gamma(t)) = \lambda_i(U(\Sigma + t E_{j,k})U^{-1}) = \lambda_i(\Sigma + t E_{j,k}) = \lambda_i + \delta_{i,j}\delta_{i,k}\, t, \qquad (10)$$

where $\delta_{i,j}$ is equal to 1 if $i = j$ and 0 otherwise. The final equality follows since the $i$-th diagonal entry of $\Sigma$ is $\lambda_i$, and adding an off-diagonal entry does not change its eigenvalues. For sufficiently small $t$ the order of the eigenvalues does not change. By the chain rule,

$$\frac{d}{dt}\Big|_{t=0} \lambda_i(\gamma(t)) = (d\lambda_i)_{M_0}(\dot{\gamma}(0)).$$

By equation (10), the left-hand side of the above equation is equal to $\delta_{i,j}\delta_{i,k}$. So

$$\delta_{i,j}\delta_{i,k} = (d\lambda_i)_{M_0}(\dot{\gamma}(0)) = (d\lambda_i)_{M_0}(U E_{j,k} U^{-1}).$$

By the definition of the gradient,

$$\operatorname{Tr}(U E_{j,k} U^{-1} \nabla\lambda_i(M_0)) = (d\lambda_i)_{M_0}(U E_{j,k} U^{-1}) = \delta_{i,j}\delta_{i,k} = \operatorname{Tr}(E_{i,i} E_{j,k}).$$

Rewriting the left-hand side of the above equation gives us

$$\operatorname{Tr}(U^{-1} \nabla\lambda_i(M_0) U E_{j,k}) = \operatorname{Tr}(E_{i,i} E_{j,k}).$$

Since this holds for all $j, k \in [n]$ and the $E_{j,k}$ span $\mathbb{R}^{n \times n}$, this implies that

$$U^{-1} \nabla\lambda_i(M_0) U = E_{i,i}, \qquad \nabla\lambda_i(M_0) = U E_{i,i} U^{-1}, \qquad \nabla\lambda_i(M_0) = x_i x_i^T,$$

as desired.

Proof of Proposition 5. Let $A$ be the adjacency matrix of a graph $G$, and let $A_N$ be the normalized adjacency matrix. Let $x$ be an eigenvector of $A_N$ with eigenvalue $\lambda$. Suppose that we add an edge $(u, v)$ to $G$. Let $\delta A_N$ denote the entry-wise change in the normalized adjacency matrix from adding the edge $(u, v)$. If $i \neq u$ and $j \neq v$, then $(\delta A_N)_{ij} = 0$. If $i \neq u$ and $j = v$, then

$$(\delta A_N)_{ij} = \frac{A_{ij}}{\sqrt{d_i}} \left( \frac{1}{\sqrt{1 + d_j}} - \frac{1}{\sqrt{d_j}} \right).$$

If $i = u$ and $j \neq v$, then

$$(\delta A_N)_{ij} = \frac{A_{ij}}{\sqrt{d_j}} \left( \frac{1}{\sqrt{1 + d_i}} - \frac{1}{\sqrt{d_i}} \right).$$

If $i = u$ and $j = v$, then

$$(\delta A_N)_{ij} = \frac{1}{\sqrt{(1 + d_i)(1 + d_j)}}.$$

Table 5: Rewiring iteration counts for FoSR and SDRF.

Sparsification threshold (ϵ)

Architecture   REDDIT-BINARY   IMDB-BINARY   MUTAG       ENZYMES     PROTEINS    COLLAB
GCN            $10^{-3}$       $10^{-4}$     $10^{-3}$   $10^{-4}$   $10^{-4}$   $10^{-4}$
R-GCN          $10^{-4}$       $10^{-3}$     $10^{-4}$   $10^{-3}$   $10^{-3}$   $10^{-3}$
GIN            $10^{-3}$       $10^{-4}$     $10^{-4}$   $10^{-3}$   $10^{-3}$   $10^{-4}$
R-GIN          $10^{-3}$       $10^{-4}$     $10^{-3}$   $10^{-4}$   $10^{-4}$   $10^{-3}$

D.2 COMPUTER INFRASTRUCTURE AND LIBRARIES

In this section, we provide details of our implementation. All the experiments were implemented in Python using PyTorch (Paszke et al., 2019), NumPy (Harris et al., 2020), and PyG (PyTorch Geometric) (Fey and Lenssen, 2019), with plots created using Matplotlib (Hunter, 2007). PyTorch, PyG and NumPy are made available under the BSD license, and Matplotlib under the PSF license. We conducted all our experiments on a local server with 2x 22-Core/44-Thread Intel Xeon Scalable Gold 6152 processors (2.1/3.7 GHz) and 8x NVIDIA GeForce RTX 2080 Ti graphics cards (11 GB, GDDR6).



compute important structural features of the graph, and the weights W

for i = 1, 2, ..., k do
    Add edge (i, j) which minimizes $\frac{x_i x_j}{\sqrt{(1 + d_i)(1 + d_j)}}$

Figure 4: Compute time of FoSR and SDRF for Erdős-Rényi graphs. For a graph with $n$ nodes, we add an edge $(i, j)$ with probability $p = \frac{5 \log n}{n}$.
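A graph of the kind described in this caption could be sampled as in the sketch below. The function names are hypothetical, and two details are assumptions that the caption does not specify: the logarithm is taken to be the natural logarithm, and the probability is capped at 1 for very small $n$.

```python
import math
import random


def edge_probability(n):
    """p = 5 log(n) / n, capped at 1 (natural log and the cap are assumptions)."""
    return min(1.0, 5.0 * math.log(n) / n)


def sample_erdos_renyi(n, seed=None):
    """Sample an undirected graph on n nodes as a list of edges (i, j) with i < j."""
    rng = random.Random(seed)
    p = edge_probability(n)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


edges = sample_erdos_renyi(100, seed=0)
print(f"p = {edge_probability(100):.3f}, sampled {len(edges)} edges")
```

With this choice of $p$ the expected degree is roughly $5 \log n$, which lies above the connectivity threshold $\log n / n$ for Erdős-Rényi graphs, so the sampled graphs are connected with high probability for large $n$.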

Figure 5: Trade-off between oversquashing and oversmoothing as a function of rewiring iterations for the R-GCN on the IMDB-BINARY graph classification task. Left: Test accuracy as a function of the rewiring iterations. Right: Dirichlet energy of the final layer as a function of rewiring iterations. The performance for R-GCNs peaks at around 25 iterations, which is reflected in both the test accuracy and Dirichlet energy.

Figure 6: Change in the spectral gap, first-order change, and the FoSR approximated change for rewired graphs. The changes are recorded for each graph in the ENZYMES dataset after 1 rewiring iteration.



k+1 are message-passing functions, and the $\phi_k : \mathbb{R}^{d_{k+1}} \to \mathbb{R}^{d_{k+1}}$ are update functions. The $\phi_k$ and $\psi_{k,r}$ can either be fixed mappings or learned during training. All GNN layer types we consider will be special cases of this general R-GNN layer. We recover a classical (non-relational) GNN when there is only one relation, $|R| = 1$.

Results of rewiring methods for GCN and GIN comparing standard and relational. The best results in each setting are highlighted in bold font and best across settings are highlighted red.

± 1.098   49.770 ± 0.817   72.150 ± 2.442   27.667 ± 1.164   70.982 ± 0.737   33.784 ± 0.488
Last layer FA    68.485 ± 0.945   48.980 ± 0.945   70.050 ± 2.027   26.467 ± 1.204   71.018 ± 0.963   33.320 ± 0.435
Every layer FA   48.490 ± 1.044   48.170 ± 0.801   70.450 ± 1.960   18.333 ± 1.038   60.036 ± 0.925   51.798 ± 0.419
DIGL             49.980 ± 0.680   49.910 ± 0.841   71.350 ± 2.391   27.517 ± 1.053   70.607 ± 0.731   15.530 ± 0.294
SDRF             68.620 ± 0.851   49.400 ± 0.904   71.050 ± 1.872   28.367 ± 1.174   70.920 ± 0.792   33.448 ± 0.472
FoSR             70.330 ± 0.727   49.660 ± 0.864   80.000 ± 1.574   25.067 ± 0.994   73.420 ± 0.811   33.836 ± 0.584
Last layer FA    90.220 ± 0.4750  70.910 ± 0.788   83.450 ± 1.742   47.400 ± 1.387   72.304 ± 0.666   75.056 ± 0.406
Every layer FA   50.360 ± 0.648   49.160 ± 0.870   72.550 ± 3.016   28.383 ± 1.052   70.375 ± 0.910   32.894 ± 0.390
Last layer FA    89.995 ± 0.647   69.710 ± 1.025   80.600 ± 1.639   48.183 ± 1.401   70.304 ± 0.844   75.434 ± 0.491
Every layer FA   56.855 ± 0.943   71.480 ± 0.876   83.050 ± 1.518   54.950 ± 1.331   71.045 ± 0.909   75.432 ± 0.475

Runtime to rewire every graph for 10 iterations (in seconds).

Dirichlet energy of the final layer for TUDataset graphs after training. The relational architectures tend to provide higher energy, which indicates that they are less prone to oversmoothing.

DIGL hyperparameters.

ACKNOWLEDGMENTS

This project has been supported by NSF CAREER Grant DMS-2145630, ERC Starting Grant 757983, DFG SPP 2298 (FoDL) Grant 464109215. This work is partly supported by BMBF in DAAD project 57616814 (SECAI).


That is, we can encode terms from $h(G)$ into the expression for $\lambda_2$ if we use this variational formulation. The discrete Cheeger inequality tells us that the spectral gap is bounded away from zero if and only if $h(G)$ is bounded away from zero. $G$ is connected if and only if $\lambda_2 > 0$. Since the spectral gap can be computed efficiently, we use it as a measure of graph bottlenecked-ness in lieu of the Cheeger constant.
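To make the relationship concrete, the following sketch computes the spectral gap of the normalized Laplacian for a small dumbbell graph and checks it against the discrete Cheeger inequality in its standard form, $\lambda_2 / 2 \le h(G) \le \sqrt{2\lambda_2}$. It also evaluates the quadratic form $x^T L x$ on the "zero-one vector" of one clique, which should equal the size of the edge boundary (here, the single bridge). The function names are illustrative, and the closed-form value of $h(G)$ for the dumbbell graph is the one stated in Appendix A.1.

```python
import numpy as np


def dumbbell_adjacency(n):
    """Adjacency matrix of two n-cliques joined by a single bridge edge."""
    adj = np.zeros((2 * n, 2 * n))
    adj[:n, :n] = 1.0
    adj[n:, n:] = 1.0
    np.fill_diagonal(adj, 0.0)
    adj[n - 1, n] = adj[n, n - 1] = 1.0
    return adj


def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2}; assumes every node has positive degree."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]


def spectral_gap(adj):
    """Difference between the two smallest eigenvalues of the normalized Laplacian."""
    eigenvalues = np.linalg.eigvalsh(normalized_laplacian(adj))  # ascending order
    return eigenvalues[1] - eigenvalues[0]


def indicator_quadratic_form(adj, subset):
    """x^T L x for x_i = sqrt(d_i) on the subset and 0 elsewhere."""
    index = list(subset)
    x = np.zeros(len(adj))
    x[index] = np.sqrt(adj.sum(axis=1)[index])
    return x @ normalized_laplacian(adj) @ x


n = 5
adj = dumbbell_adjacency(n)
gap = spectral_gap(adj)
h = 1.0 / (1 + n * (n - 1))  # closed form for the dumbbell graph
print(f"lambda_2 / 2 = {gap / 2:.4f}, h(G) = {h:.4f}, sqrt(2 lambda_2) = {np.sqrt(2 * gap):.4f}")
print("x^T L x for S = first clique:", indicator_quadratic_form(adj, range(n)))
```

A dense eigendecomposition is used here for simplicity; for larger graphs, an iterative method that only approximates the relevant eigenvector would be the natural choice.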

B SPECTRAL GAP AND OVERSQUASHING

In this section, we provide some more intuition behind our idea of optimizing the spectral gap of the input graph for alleviating structural bottlenecks and improving the overall quality of information propagation across nodes. The rate of spectral expansion, i.e., the rate at which the spectral gap improves as a function of the rewiring iterations, is one of the key determinants of oversquashing (Banerjee et al., 2022), and a faster rate is one of the main features of our new algorithm. We show that FoSR has a faster rate of spectral expansion compared to existing curvature-based methods such as SDRF (Topping et al., 2022) for both the synthetic benchmark NEIGHBORSMATCH (Alon and Yahav, 2021) (Appendix B.1.1) and the TUDataset (Morris et al., 2020) (Appendix B.1.2). In Appendix B.2, we show that optimizing only the first-order change in the spectral gap gives FoSR a significant computational advantage over SDRF. In Appendix B.3, we empirically explore the trade-off between oversquashing and oversmoothing as a function of the number of rewiring iterations. Finally, in Appendix B.4, we discuss the approximation error of FoSR.

B.1.1 THE NEIGHBORSMATCH PROBLEM (DETAILS ON FIGURE 1)

Alon and Yahav (2021) introduced the synthetic benchmark NEIGHBORSMATCH to test the impact of oversquashing in GNNs. Given an input graph G, suppose that we wish to predict the label for a node T; see Figure 2(a). The correct label is the label of the green node that has the same number of blue neighbors as T. For the example in Figure 2(a), the answer is B, which happens to reside at the opposite end of the graph. A correct prediction of the label for this example will require a number of GNN layers l that is equal to the diameter of the graph. With increasing l, however, at each message-passing layer, information from exponentially growing receptive fields needs to be concurrently propagated. This leads to oversquashing, and the GNN fails to propagate long-range signals and fit the training dataset perfectly. We consider a learning task modeled on the NEIGHBORSMATCH problem. Given an input graph G, a target node T, and a subset S of the nodes of G, we assign each node in S a different random |S|-dimensional one-hot vector encoding the number of blue neighbors. Likewise, we represent the label of T as a random |S|-dimensional one-hot vector and the goal is to predict the node T' ∈ S with

By Theorem 4, the first-order change in $\lambda_2(A_N)$ is given by $x^T (\delta A_N) x$. Since $x$ is an eigenvector of $A_N$ with eigenvalue $\lambda$, the sums $\sum_j (A_N)_{ij} x_j$ appearing in this expression simplify to $\lambda x_i$. Applying this equality to the second eigenvector of $A_N$ yields the result.
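The computation behind this last step can be sketched as follows. This is a reconstruction from Theorem 4 and the entry-wise changes $(\delta A_N)_{ij}$ listed in the proof, not a quotation of the original display equations. It assumes that $(u, v)$ is not already an edge (so $A_{uv} = 0$), that $x$ has unit norm, and that $\delta A_N$ is symmetric, so that the changes in rows $u$, $v$ are mirrored in columns $u$, $v$.

```latex
% First-order change in an eigenvalue \lambda of A_N with unit eigenvector x.
\begin{align*}
x^T (\delta A_N)\, x
  &= 2 x_u x_v (\delta A_N)_{uv}
     + 2 x_u \sum_{j \neq v} x_j (\delta A_N)_{uj}
     + 2 x_v \sum_{i \neq u} x_i (\delta A_N)_{iv} \\
  &= \frac{2 x_u x_v}{\sqrt{(1 + d_u)(1 + d_v)}}
     + 2 x_u \left( \frac{\sqrt{d_u}}{\sqrt{1 + d_u}} - 1 \right) \sum_{j} (A_N)_{uj} x_j
     + 2 x_v \left( \frac{\sqrt{d_v}}{\sqrt{1 + d_v}} - 1 \right) \sum_{i} (A_N)_{iv} x_i \\
  &= \frac{2 x_u x_v}{\sqrt{(1 + d_u)(1 + d_v)}}
     + 2 \lambda x_u^2 \left( \frac{\sqrt{d_u}}{\sqrt{1 + d_u}} - 1 \right)
     + 2 \lambda x_v^2 \left( \frac{\sqrt{d_v}}{\sqrt{1 + d_v}} - 1 \right).
\end{align*}
```

The second line uses $(\delta A_N)_{uj} = (A_N)_{uj} \left( \sqrt{d_u} / \sqrt{1 + d_u} - 1 \right)$ for $j \neq v$, and the third line uses $A_N x = \lambda x$. The first term is the one that depends jointly on both endpoints, which is consistent with selecting the edge that minimizes $x_u x_v / \sqrt{(1 + d_u)(1 + d_v)}$.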

D DETAILS ON THE EXPERIMENTS FROM SECTION 5

The computer implementation of the proposed methods, along with scripts to re-run our experiments, is made publicly available at https://anonymous.4open.science/r/FoSR-0CE3/. In the following, we list details on hyperparameters, libraries, and the compute infrastructure we used in these experiments.

