RED-GCN: REVISIT THE DEPTH OF GRAPH CONVOLUTIONAL NETWORK

Abstract

Finding the proper depth d of a GNN that provides strong representation power has drawn significant attention, yet largely remains an open problem for the graph learning community. Although noteworthy progress has been made, the depth, i.e., the number of layers, of a GCN is realized by a series of graph convolution operations, which naturally makes d a positive integer (d ∈ N+). An interesting question is whether breaking the constraint of N+ by making d a real number (d ∈ R) can bring new insights into graph learning mechanisms. In this work, by redefining GCN's depth d as a trainable parameter continuously adjustable within (-∞, +∞), we open a new door to controlling its expressiveness on graph signal processing to model graph homophily/heterophily (nodes with similar/dissimilar labels/attributes tend to inter-connect). A simple and powerful GCN model, RED-GCN, is proposed to retain the simplicity of GCN while automatically searching for the optimal d without prior knowledge of whether the input graph is homophilic or heterophilic. A negative-valued d intrinsically enables high-pass frequency filtering functionality for graph heterophily. Variants extending the model's flexibility/scalability are also developed. The theoretical feasibility of having a real-valued depth with explainable physical meanings is ensured via eigen-decomposition of the graph Laplacian and a properly designed transformation function, from the perspective of functional calculus. Extensive experiments demonstrate the superiority of RED-GCN on node classification tasks for a variety of graphs. Furthermore, by introducing the concept of eigengraph, a novel graph augmentation method is obtained: the optimal d effectively generates a new topology through a properly weighted combination of eigengraphs, which dramatically boosts the performance even for a vanilla GCN.

1. INTRODUCTION

Graph convolutional networks (GCNs) (Kipf & Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017) have exhibited great power in a variety of graph learning tasks, such as node classification (Kipf & Welling, 2016; Luan et al., 2019; 2022a), link prediction (Zhang & Chen, 2018), community detection (Chen et al., 2020), and many more. Since the representation power of a GCN is largely determined by its depth, i.e., the number of graph convolution layers, tremendous research effort has been devoted to finding the optimal depth that strengthens the model's ability on downstream tasks. Upon increasing the depth, the over-smoothing issue arises: a GCN's performance deteriorates once its depth exceeds a certain threshold (Kipf & Welling, 2016). It is unveiled in (Li et al., 2018) that a graph convolution operation is a special form of Laplacian smoothing (Taubin, 1995): the similarity between node embeddings grows with the depth, so that these embeddings eventually become indistinguishable. Various techniques have been developed to alleviate this issue, e.g., applying pairwise normalization can keep distant nodes dissimilar (Zhao & Akoglu, 2019), and dropping sampled edges during training slows the growth of embedding smoothness as depth increases (Rong et al., 2019). Beyond the over-smoothing issue caused by large GCN depth, another fundamental phenomenon widely existing in real-world graphs is homophily and heterophily. In a homophilic graph, nodes with similar labels or attributes tend to inter-connect, while in a heterophilic graph, connected nodes usually have distinct labels or dissimilar attributes. Most graph neural networks (GNNs) are developed under the homophily assumption (Yang et al., 2016), while models able to perform well on heterophilic graphs often require special treatment and complex designs (Bianchi et al., 2021; Zhu et al., 2020).
Despite the achievements made by these methodologies, little correlation has been found between an adopted GNN model's depth and its capability of characterizing graph heterophily. For most GNNs, if not all, the depth needs to be manually set as a hyper-parameter before training, and finding the proper depth usually requires a considerable number of trials or good prior knowledge of the graph dataset. Since the depth represents the number of graph convolution operations and naturally takes only positive integer values, little attention has been paid to the questions of whether a non-integer depth is realizable, and if so, whether it is practically meaningful and whether it can bring unique advantages to current graph learning mechanisms. This work revisits the GCN depth from spectral and spatial perspectives and explains the interdependencies between the following key ingredients in graph learning: the depth of a GCN, the spectrum of the graph signal, and the homophily/heterophily of the underlying graph. Firstly, through eigen-decomposition of the symmetrically normalized graph Laplacian, we present the correlation between graph homophily/heterophily and the eigenvector frequencies. Secondly, by introducing the concept of eigengraph, we show that the graph topology is equivalent to a weighted linear combination of eigengraphs, and the weight values determine the GCN's capability of capturing homophilic/heterophilic graph signals. Thirdly, we reveal that the eigengraph weights can be controlled by the GCN's depth, so an automatically tunable depth parameter is needed to adjust the eigengraph weights into a distribution matching the underlying graph homophily/heterophily. To realize the adaptive GCN depth, we extend its definition from a positive integer to an arbitrary real number, with theoretical feasibility guaranteed by functional calculus (Shah & Okutmuştur, 2020).
With a trainable depth parameter, we propose a simple and powerful model, Redefined Depth-GCN (RED-GCN), with two variants. Extensive experiments demonstrate its ability to automatically search for the optimal depth, and it is found that negative-valued depth plays the key role in handling heterophilic graphs. A systematic investigation of the optimal depth is conducted in both the spectral and spatial domains, which in turn inspires the development of a novel graph augmentation methodology: with clear geometric explainability, the augmented graph structure possesses clear advantages over the raw input topology, especially for graphs with heterophily. The main contributions of this paper are summarized as follows:
• The interdependence between negative GCN depth and graph heterophily is discovered, with in-depth geometric and spectral explanations.
• A novel problem of automatic GCN depth tuning for graph homophily/heterophily detection is formulated. To the best of our knowledge, this work presents the first attempt to make GCN's depth trainable by redefining it on the real number domain.
• A simple and powerful model, RED-GCN, with two variants (RED-GCN-S and RED-GCN-D), is proposed, and a novel graph augmentation method is discussed.
• Our model achieves superior performance on semi-supervised node classification tasks on 11 graph datasets.

2. PRELIMINARIES

Notations. We utilize bold uppercase letters for matrices (e.g., A), bold lowercase letters for column vectors (e.g., u), and lowercase letters for scalars (e.g., α). The superscript ⊤ denotes the transpose of matrices and vectors (e.g., A^⊤ and u^⊤). An attributed undirected graph G = {A, X} contains an adjacency matrix A ∈ R^{n×n} and an attribute matrix X ∈ R^{n×q}, where n is the number of nodes and q is the dimension of node attributes. D denotes the diagonal degree matrix of A. The adjacency matrix with self-loops is given by Ã = A + I (I is the identity matrix), and all variables derived from Ã are decorated with the symbol ˜, e.g., D̃ represents the diagonal degree matrix of Ã. M^d stands for the d-th power of matrix M, while the parameter and node embedding matrices in the d-th layer of a GCN are denoted by W^{(d)} and H^{(d)}.

Graph convolutional network (GCN) and simplified graph convolutional network (SGC). The layer-wise message-passing and aggregation of GCN (Kipf & Welling, 2016) is given by

H^{(d+1)} = σ(D̃^{-1/2} Ã D̃^{-1/2} H^{(d)} W^{(d)}),

where H^{(d)}/H^{(d+1)} stands for the embedding matrix in the d-th/(d+1)-th layer (H^{(0)} = X); W^{(d)} is the trainable parameter matrix; and σ(·) is the non-linear activation function. With σ(·) removed from each layer, SGC (Wu et al., 2019) is obtained as

H^{(d)} = S^d X W, where S = D̃^{-1/2} Ã D̃^{-1/2},

and the per-layer parameters W^{(i)} are compressed into one trainable matrix W = ∏_{i=0}^{d-1} W^{(i)}.

Figure 1: Decomposing the symmetrically normalized adjacency matrix S = D̃^{-1/2} Ã D̃^{-1/2} into three eigengraphs.

Graph Laplacian and spectrum.
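The SGC propagation rule above can be sketched in a few lines of NumPy. This is an illustrative toy example on the 3-node graph of Figure 1, not the authors' implementation; the random X and W are placeholders.

```python
import numpy as np

def sgc_propagate(A, X, W, d):
    """SGC forward pass H = S^d X W, with S the symmetrically
    normalized adjacency matrix with self-loops (no nonlinearity)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops: A + I
    deg = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # D~^-1/2 (A + I) D~^-1/2
    return np.linalg.matrix_power(S, d) @ X @ W

# toy triangle graph from Figure 1 (3 mutually connected nodes)
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
X = np.random.randn(3, 4)   # placeholder node attributes
W = np.random.randn(4, 2)   # placeholder compressed parameter matrix
H = sgc_propagate(A, X, W, d=2)
```

With d = 0 the propagation degenerates to a plain linear layer X W, since S^0 = I.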
In graph theory, the graph Laplacian L = D − A and its symmetrically normalized counterpart L_sym = I − D^{-1/2} A D^{-1/2} possess critical properties of the underlying graph G. L_sym has eigenvalues [λ_1, λ_2, …, λ_n], where λ_i ∈ [0, 2), ∀i ∈ {1, 2, …, n} (Chung & Graham, 1997).^1 Here they are put in ascending order: 0 = λ_1 ≤ λ_2 ≤ … ≤ λ_n < 2. L_sym can be eigen-decomposed as L_sym = UΛU^⊤, where U = [u_1, u_2, …, u_n] is the eigenvector matrix (u_i ⊥ u_j, ∀i ≠ j), and Λ = diag(λ_1, …, λ_n) is the diagonal eigenvalue matrix. For each eigenvector u_i, we have u_i u_i^⊤ ∈ R^{n×n}. As we will show in Section 3, this n × n matrix can be viewed as the weighted adjacency matrix of a graph with possibly negative edges, and we name u_i u_i^⊤ the i-th eigengraph of G. Accordingly, L_sym can be written as the linear combination of all eigengraphs weighted by the corresponding eigenvalues (Chung & Graham, 1997):

L_sym = λ_1 u_1 u_1^⊤ + … + λ_i u_i u_i^⊤ + … + λ_n u_n u_n^⊤,

where the first eigenvalue λ_1 = 0, and the corresponding eigengraph u_1 u_1^⊤ has the identical value 1/n in all entries (Shuman et al., 2013). Thus, for SGC, we have

S = I − L̃_sym = Ũ(I − Λ̃)Ũ^⊤ = Σ_{i=1}^{n} (1 − λ̃_i) ũ_i ũ_i^⊤.

An SGC with d layers requires d consecutive graph convolution operations, which involves multiplying by S d times. Due to the orthogonality of Ũ, namely Ũ^⊤Ũ = I, we obtain

S^d = Ũ(I − Λ̃)Ũ^⊤ Ũ(I − Λ̃)Ũ^⊤ ⋯ Ũ(I − Λ̃)Ũ^⊤ = Ũ(I − Λ̃)^d Ũ^⊤ = Σ_{i=1}^{n} (1 − λ̃_i)^d ũ_i ũ_i^⊤,

where 1 − λ̃_i ∈ (−1, 1], and the depth d of SGC serves as the power on S's eigenvalues. S^d can thus be viewed as the sum of eigengraphs ũ_i ũ_i^⊤ weighted by coefficients (1 − λ̃_i)^d.

Graph homophily and heterophily. Graph homophily describes to what extent edges tend to link nodes with the same labels and similar features.
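The eigengraph decomposition of S^d can be verified numerically. The sketch below (a toy example, assuming the 3-node triangle graph; not from the paper's code) builds the eigengraphs ũ_i ũ_i^⊤ and checks that their (1 − λ̃_i)^d-weighted sum reproduces d repeated graph convolutions:

```python
import numpy as np

# Decompose L~_sym into eigengraphs u_i u_i^T and verify that
# S^d equals the eigengraph sum weighted by (1 - lambda_i)^d.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])   # toy graph
A_t = A + np.eye(3)                                        # self-loops
d_inv_sqrt = np.diag(A_t.sum(1) ** -0.5)
S = d_inv_sqrt @ A_t @ d_inv_sqrt
L_sym_t = np.eye(3) - S          # normalized Laplacian of the self-looped graph

lam, U = np.linalg.eigh(L_sym_t)  # eigenvalues ascending, orthonormal U
eigengraphs = [np.outer(U[:, i], U[:, i]) for i in range(3)]

d = 3
S_d = sum((1 - lam[i]) ** d * eigengraphs[i] for i in range(3))
assert np.allclose(S_d, np.linalg.matrix_power(S, d))
```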
In this work, we focus on edge homophily (Zhu et al., 2020):

h(G) = ( Σ_{i,j: A[i,j]=1} ⟨y[i] = y[j]⟩ ) / ( Σ_{i,j} A[i,j] ) ∈ [0, 1],

where ⟨x⟩ = 1 if x is true and 0 otherwise. A graph is more homophilic as h(G) approaches 1 and more heterophilic as h(G) approaches 0.
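The edge homophily ratio is straightforward to compute; a minimal NumPy sketch (the toy adjacency matrix and labels are illustrative assumptions):

```python
import numpy as np

def edge_homophily(A, y):
    """Edge homophily h(G): fraction of edges joining same-label nodes.
    For an undirected A, each edge appears in both directions, which
    cancels out in the ratio."""
    src, dst = np.nonzero(A)
    return float((y[src] == y[dst]).sum() / len(src))

# toy graph: node 0 linked to nodes 1 and 2; labels [0, 0, 1]
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
y = np.array([0, 0, 1])
h = edge_homophily(A, y)   # edge (0,1) same label, edge (0,2) different
```

Here h = 0.5, i.e., a graph halfway between homophilic and heterophilic under this measure.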

3. MODEL

Firstly, we establish the intrinsic connections between eigengraphs with small/large weights, graph signals with high/low frequencies, and graphs with homophilic/heterophilic properties. Secondly, we show how the positive/negative depth d of a GCN affects the eigengraph weights and in turn determines the algorithm's expressive power to process homophilic/heterophilic graph signals. Thirdly, with the help of functional calculus (Shah & Okutmuştur, 2020), we present the theoretical feasibility of extending the domain of d from N+ to R. Finally, by making d a trainable parameter, we present our model RED-GCN and its variants, which are capable of automatically detecting the homophily/heterophily of the input graph and finding the corresponding optimal depth.

The eigenvectors of a graph Laplacian form a complete set of basis vectors in the n-dimensional space, capable of expressing the original node attributes X as a linear combination. From the perspective of graph spectral analysis (Shuman et al., 2013), the frequency of eigenvector u_i reflects how much the j-th entry u_i[j] deviates from the k-th entry u_i[k] for each connected node pair v_j and v_k in G. This deviation is measured by the set of zero crossings of u_i:

Z(u_i) := {e = (v_j, v_k) ∈ E : u_i[j] u_i[k] < 0},

where E is the set of edges in graph G. A larger/smaller |Z(u_i)| indicates a higher/lower eigenvector frequency, and a zero crossing corresponds to a negatively weighted edge in an eigengraph. Due to the widely existing positive correlation between λ_i and |Z(u_i)| (Shuman et al., 2013), large/small eigenvalues mostly correspond to high/low frequencies of the related eigenvectors.
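Zero crossings can be counted directly from an eigenvector's signs over the edge set. A small sketch on the triangle graph (illustrative only; note that for degenerate eigenvalues the eigenvector basis, and hence the exact counts, are not unique):

```python
import numpy as np

def zero_crossings(u, edges):
    """|Z(u)|: number of edges whose endpoints take opposite signs in u."""
    return sum(1 for j, k in edges if u[j] * u[k] < 0)

# eigenvectors of the toy triangle's L_sym; frequency grows with eigenvalue
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
d_inv_sqrt = np.diag(A.sum(1) ** -0.5)
L_sym = np.eye(3) - d_inv_sqrt @ A @ d_inv_sqrt
lam, U = np.linalg.eigh(L_sym)            # lam[0] = 0 (constant eigenvector)
edges = [(0, 1), (0, 2), (1, 2)]
counts = [zero_crossings(U[:, i], edges) for i in range(3)]
```

The constant eigenvector u_1 has no zero crossings, while the higher-frequency eigenvectors (λ = 3/2) each cross zero on at least one edge.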
As illustrated by the toy example with n = 3 in Figure 1, for λ_1 = 0 we have |Z(u_1)| = 0, and eigengraph u_1 u_1^⊤ is well-connected with identical edge weight 1/n; negative edge weights exist in the 2nd and 3rd eigengraphs, indicating more zero crossings (|Z(u_2)| = 1 and |Z(u_3)| = 2) and higher eigenvector frequencies. Since node labels correlate with node attributes (Zheng et al., 2022a), node attribute similarities indicate the extent of smoothness/homophily (Luan et al., 2020; 2021), and node attributes can be expressed by eigenvectors, the deviation between eigenvector entry pairs naturally implies the extent of heterophily. High-frequency eigenvectors and their corresponding eigengraphs thus have an advantage in capturing graph heterophily. Accordingly, high-frequency eigengraphs should take larger weights when modeling heterophilic graphs, while low-frequency ones should carry larger weights when dealing with homophilic graphs. In turn, the eigengraph weights are controlled by the GCN/SGC depth d: for an SGC of depth d, the weight of the i-th eigengraph is (1 − λ̃_i)^d, and changing the depth d adjusts the weights of the different eigengraphs. Therefore, the depth d controls the model's expressive power to effectively filter low/high-frequency signals for graph homophily/heterophily. A question is naturally raised: instead of manually setting the depth d, can d be built into the model as a trainable parameter, so that a proper set of eigengraph weights matching the graph homophily/heterophily is automatically reached by finding the optimal d in an end-to-end fashion during training? Differentiable variables require continuity, which demands extending the depth d from the discrete positive integer domain (N+) to the continuous real number domain R.
According to functional calculus (Shah & Okutmuştur, 2020), applying an arbitrary function f to a graph Laplacian L_sym is equivalent to applying the same function to the eigenvalue matrix Λ alone:

f(L_sym) = U f(Λ) U^⊤ = U diag(f(λ_1), …, f(λ_n)) U^⊤,

which also applies to L̃_sym and S. Armed with this, we seek to realize an arbitrary-depth SGC via a power function:

f(S) = S^d = Ũ(I − Λ̃)^d Ũ^⊤ = Σ_{i=1}^{n} (1 − λ̃_i)^d ũ_i ũ_i^⊤ (d ∈ R).

However, since λ̃_i ∈ [0, 2), we have 1 − λ̃_i ≤ 0 when 1 ≤ λ̃_i < 2, and when 1 − λ̃_i takes zero or negative values, (1 − λ̃_i)^d is either not well-defined or involves complex-number-based calculations for a real-valued d (e.g., (−0.5)^{3/8}) (Shah & Okutmuştur, 2020). Moreover, even for integer-valued d, under which (1 − λ̃_i)^d is easy to compute, the behavior of (1 − λ̃_i)^d versus λ̃_i is complicated and diverges at λ̃_i = 1 for negative d, as shown in Figure 2a. Thus, the favored weight distribution may be hard to obtain by tuning d.

To avoid such complications and ease the manipulation of the eigengraph weights, a transformation function g(·) operating on the graph Laplacian L_sym or L̃_sym is needed to shift g(λ_i) or g(λ̃_i) into a value range where its power with a real-valued d is easy to obtain and well-behaved versus λ_i or λ̃_i. Without loss of generality, the following analysis focuses on L_sym and λ_i. There may exist multiple choices of g(·) satisfying the requirements; in this work, we focus on the following form:

Ŝ = g(L_sym) = (1/2)(2I − L_sym).

This choice of g(·) holds three properties: (1) Positive eigenvalues. Since L_sym's i-th eigenvalue satisfies λ_i ∈ [0, 2), the corresponding eigenvalue of Ŝ is g(λ_i) = (1/2)(2 − λ_i) ∈ (0, 1].
Thus, the d-th power of g(λ_i) is computable for any d ∈ R. (2) Monotonicity versus the eigenvalue λ. As shown in Figure 2b, g(λ)^d = (1 − λ/2)^d is monotonically increasing/decreasing as λ varies between 0 and 2 under negative/positive depth. (3) Geometric interpretability. The filter Ŝ can be expressed as

Ŝ = U Λ̂ U^⊤ = U(I − (1/2)Λ)U^⊤ = (1/2)I + (1/2)(I − L_sym) = (1/2)(I + D^{-1/2} A D^{-1/2}).

As shown in Figure 3, in the spatial domain Ŝ is obtained via 3 operations on the adjacency matrix A: normalization, adding self-loops, and scaling all edge weights by 1/2 (a type of lazy random walk (Luan et al., 2020)), while S in vanilla GCN/SGC involves 2 operations: adding self-loops and normalization. With the help of the transformation g, the depth d is redefined on the real number domain, and message propagation with depth d is realized via the following steps: (1) eigen-decompose L_sym; (2) calculate Ŝ^d as the sum of all eigengraphs weighted by g(λ_i)^d:

Ŝ^d = U Λ̂^d U^⊤ = Σ_{i=1}^{n} g(λ_i)^d u_i u_i^⊤;

(3) multiply Ŝ^d with the original node attributes X.

Figure 3: The difference between S (left) for GCN/SGC and Ŝ (right) for RED-GCN.

Negative depth explained. An intuitive explanation of negative d can be obtained from the perspective of matrix inverse and the message diffusion process when d takes integer values. Since Ŝ^{-1}Ŝ = U Λ̂^{-1} U^⊤ U Λ̂ U^⊤ = I, Ŝ^{-1} is the inverse matrix of Ŝ. In diffusion dynamics, X can be viewed as an intermediate state generated in a series of message propagation steps. ŜX effectively propagates the message one step forward, while Ŝ^{-1} can cancel the effect of Ŝ on X and recover the original message by moving backward: Ŝ^{-1}ŜX = X. Accordingly, Ŝ^{-1}X traces back to the message's previous state in the series. However, neither A nor L is invertible in general, due to zero or negative eigenvalues.
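The three steps above can be sketched directly in NumPy. Since g(λ_i) ∈ (0, 1], the real-valued power is well-defined, so both the backward step (d = −1) and fractional depths behave as the text describes. A toy sketch on the triangle graph (illustrative, not the authors' code):

```python
import numpy as np

def s_hat_power(A, d):
    """Compute S_hat^d = U diag(g(lambda_i)^d) U^T for any real d,
    where g(lambda) = 1 - lambda/2 and lambda_i are the eigenvalues
    of L_sym = I - D^-1/2 A D^-1/2."""
    deg_inv_sqrt = np.diag(A.sum(1) ** -0.5)
    L_sym = np.eye(A.shape[0]) - deg_inv_sqrt @ A @ deg_inv_sqrt
    lam, U = np.linalg.eigh(L_sym)
    # g(lambda) lies in (0, 1], so its real power is well-defined
    return U @ np.diag((1.0 - 0.5 * lam) ** d) @ U.T

A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
S_hat = s_hat_power(A, 1.0)      # equals (I + D^-1/2 A D^-1/2) / 2
X = np.random.randn(3, 2)

# negative depth inverts forward propagation: S_hat^-1 (S_hat X) = X
assert np.allclose(s_hat_power(A, -1.0) @ (S_hat @ X), X)
# fractional depths compose: S_hat^0.5 S_hat^0.5 = S_hat
assert np.allclose(s_hat_power(A, 0.5) @ s_hat_power(A, 0.5), S_hat)
```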
More discussion of the impact of negative depth in the spatial domain is presented in Section 4.4. A non-integer d indicates that the backward or forward propagation can be a continuous process.

RED-GCN-S. By further making d a trainable parameter, we present our model, Redefined Depth-GCN-Single (RED-GCN-S), whose final node embedding matrix is given by

H = σ(Ŝ^d X W),

where σ(·) is the nonlinear activation function, W is a trainable parameter matrix, and d is the trainable depth parameter. As observed from Figure 2b, the weight distribution over different frequencies/eigengraphs is tuned via d: (1) for d = 0, the weight is uniformly distributed among all frequency components (g(λ_i)^d = 1), implying that no particular frequency is preferred by the graph signal; (2) for d > 0, the weight g(λ_i)^d decreases with the corresponding frequency, indicating that the low-frequency components are favored, so RED-GCN-S effectively functions as a low-pass filter and captures graph homophily; (3) for d < 0, high-frequency components gain amplified weights, so RED-GCN-S serves as a high-pass filter capturing graph heterophily. During training, RED-GCN-S tunes its frequency filtering functionality to suit the underlying graph signal by automatically finding the optimal d.

RED-GCN-D. During optimization, RED-GCN-S embraces a single depth d unified over all eigengraphs and selects its preference for either homophily or heterophily. However, RED-GCN-S requires a full eigen-decomposition of L_sym, which can be expensive for large graphs. Additionally, the high- and low-frequency components in a graph signal may not be mutually exclusive; a graph may simultaneously possess homophilic and heterophilic components. Therefore, we propose the second variant of RED-GCN, RED-GCN-D (Dual), which introduces two separate trainable depths, d_h and d_l, to weight the high- and low-frequency eigengraphs more flexibly.
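What makes the depth trainable is that g(λ)^d is differentiable in d, with d/dd g^d = ln(g) · g^d. The sketch below shows a RED-GCN-S-style forward pass and the analytic depth gradient of the pre-activation, verified against a finite difference; the ReLU activation, toy graph, and random X, W are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def red_gcn_s_forward(U, lam, X, W, d):
    """RED-GCN-S-style forward pass H = relu(S_hat^d X W); d is a real
    scalar that would be trained jointly with W (sketch only)."""
    S_hat_d = U @ np.diag((1.0 - 0.5 * lam) ** d) @ U.T  # weights g(lam)^d
    return np.maximum(S_hat_d @ X @ W, 0.0)              # assumed ReLU

def depth_gradient(U, lam, X, W, d):
    """Analytic gradient of the pre-activation S_hat^d X W w.r.t. d,
    using d/dd g^d = ln(g) * g^d."""
    g = 1.0 - 0.5 * lam
    dS = U @ np.diag(np.log(g) * g ** d) @ U.T
    return dS @ X @ W

# toy setup: triangle graph, random attributes and parameters
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
deg = np.diag(A.sum(1) ** -0.5)
lam, U = np.linalg.eigh(np.eye(3) - deg @ A @ deg)
rng = np.random.default_rng(0)
X, W = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
H = red_gcn_s_forward(U, lam, X, W, d=1.3)
```

In an autodiff framework this gradient comes for free, since g(λ_i)^d can be written as exp(d · ln g(λ_i)).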
The Arnoldi method (Lehoucq et al., 1998) is adopted to conduct eigen-decomposition on L_sym and obtain the top-K largest and smallest eigen-pairs (λ_i, u_i). Denoting U_l = U[:, 0:K] and U_h = U[:, n−K:n] (U_l, U_h ∈ R^{n×K}), we define a new diffusion matrix Ŝ_dual(d_l, d_h, K) as

Ŝ_dual(d_l, d_h, K) = U_l Λ̂_l^{d_l} U_l^⊤ + U_h Λ̂_h^{d_h} U_h^⊤,

where Λ̂_l ∈ R^{K×K} and Λ̂_h ∈ R^{K×K} are the diagonal matrices of the transformed top-K smallest and largest eigenvalues.^2 The final node embedding of RED-GCN-D is

H = σ(Ŝ_dual(d_l, d_h, K) X W),

where the depths d_l and d_h are trainable, and W is a trainable parameter matrix. We make RED-GCN-D scalable to large graphs by choosing K ≪ n, so that Ŝ_dual(d_l, d_h, K) approximates the full diffusion matrix while covering only a small subset of all eigengraphs. For small graphs, we use K = ⌊n/2⌋ to include all eigengraphs, and Ŝ_dual(d_l, d_h, K) then gains higher flexibility than Ŝ thanks to the two separate depth parameters instead of a unified one.

Differences from ODE-based GNNs. Previous attempts at GNNs with continuous diffusion are mostly inspired by the graph diffusion equation, an Ordinary Differential Equation (ODE) characterizing the dynamical message propagation process over time. In contrast, our framework starts from discrete graph convolution operations without explicitly involving an ODE. CGNN (Xhonneux et al., 2020) aims to build a deep GNN immune to over-smoothing by adopting the neural ODE framework (Chen et al., 2018), but its time parameter t is a non-trainable hyper-parameter predefined within the positive domain, which is the key difference from RED-GCN. A critical CGNN component for preventing over-smoothing, the restart distribution (a skip connection from the first layer), is not needed in our framework. Moreover, CGNN applies the same depth to all frequency components, while RED-GCN-D has the flexibility to adopt two different depths, adapting to high- and low-frequency components separately.
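The dual diffusion matrix can be sketched as follows. This toy version takes a precomputed dense eigen-decomposition; on large graphs one would instead use a sparse Arnoldi/Lanczos solver for only the K extreme eigenpairs. The 4-node test graph is an illustrative assumption:

```python
import numpy as np

def s_dual(lam, U, d_l, d_h, K):
    """S_dual from the K smallest (low-freq) and K largest (high-freq)
    eigenpairs of L_sym; lam is assumed sorted ascending (as
    np.linalg.eigh returns). Eigenvalues are transformed by
    g(lam) = 1 - lam/2 before taking the real-valued powers."""
    n = len(lam)
    low = U[:, :K] @ np.diag((1 - 0.5 * lam[:K]) ** d_l) @ U[:, :K].T
    high = U[:, n - K:] @ np.diag((1 - 0.5 * lam[n - K:]) ** d_h) @ U[:, n - K:].T
    return low + high

# 4-node non-bipartite toy graph (triangle plus a pendant node)
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 0.],
              [0., 1., 0., 0.]])
deg = np.diag(A.sum(1) ** -0.5)
lam, U = np.linalg.eigh(np.eye(4) - deg @ A @ deg)

# with K = n/2 and d_l = d_h = d, all eigengraphs are covered,
# so S_dual reduces to the full S_hat^d
full = U @ np.diag((1 - 0.5 * lam) ** 1.7) @ U.T
S = s_dual(lam, U, 1.7, 1.7, 2)
```

With K ≪ n the same function gives the truncated approximation described in the text, keeping only the frequency extremes.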
GRAND (Chamberlain et al., 2021) introduces non-Euler multi-step schemes with adaptive step size to obtain more precise solutions of the diffusion equation. Its depth (the total integration time) is continuous but still predefined/non-trainable and takes only positive values. DGC (Wang et al., 2021) decouples the SGC depth into two predefined non-trainable hyper-parameters: a positive real-valued T controlling the total time and a positive integer-valued K_dgc corresponding to the number of diffusion steps. However, realizing negative depth in DGC is not applicable, since its implementation propagates messages K_dgc times, rather than applying an arbitrary real-valued exponent d to the eigengraph weights as in RED-GCN.

4. EXPERIMENT

In this section, we evaluate the proposed RED-GCN on the semi-supervised node classification task on both homophilic and heterophilic graphs.

4.1. EXPERIMENT SETUP

Datasets. We use 11 datasets for evaluation, including 4 homophilic graphs: Cora (Kipf & Welling, 2016), Citeseer (Kipf & Welling, 2016), Pubmed (Kipf & Welling, 2016), and DBLP (Bojchevski & Günnemann, 2017); and 7 heterophilic graphs: Cornell (Pei et al., 2020), Texas (Pei et al., 2020), Wisconsin (Pei et al., 2020), Actor (Pei et al., 2020), Chameleon (Rozemberczki et al., 2021), Squirrel (Rozemberczki et al., 2021), and cornell5 (Fey & Lenssen, 2019). We collect all datasets from the public GNN platform PyTorch Geometric (Fey & Lenssen, 2019). For Cora, Citeseer, and Pubmed, whose data splits are provided in PyTorch Geometric, we keep the training/validation/testing splits as in GCN (Kipf & Welling, 2016). For the remaining 8 datasets, we randomly split every dataset into 20%/20%/60% for training, validation, and testing. The statistics of all datasets are presented in the Appendix. Baselines and Metrics. We compare our model with 7 baseline methods, including 4 classic GNNs: GCN (Kipf & Welling, 2016), SGC (Wu et al., 2019), APPNP (Klicpera et al., 2018), and ChebNet (Defferrard et al., 2016); and 3 GNNs tailored for heterophilic graphs: FAGCN (Bo et al., 2021), GPRGNN (Chien et al., 2020), and H2GCN (Zhu et al., 2020). Accuracy (ACC) is used as the evaluation metric. We report the average ACC with the standard deviation (std) for all methods, each obtained from 5 runs with different initializations. Implementation Details. See the Appendix due to the page limit.

4.2. NODE CLASSIFICATION

The semi-supervised node classification performance on homophilic and heterophilic graphs is shown in Table 1 and Table 2, respectively. Homophilic graphs. From Table 1, it is observed that different methods have similar performance on homophilic graphs. RED-GCN-S achieves the best accuracies on two datasets, Cora and DBLP. On the remaining two datasets, RED-GCN-S is only 1.1% and 0.4% below the best baselines (APPNP on Citeseer and SGC on Pubmed). RED-GCN-D obtains performance similar to the other methods, even though it only uses the top-K largest/smallest eigen-pairs. Heterophilic graphs. RED-GCN-S and RED-GCN-D outperform every baseline on all heterophilic graphs, as shown in Table 2. These results demonstrate that, without manually setting the model depth and without prior knowledge of the input graph, RED-GCN is capable of automatically detecting the underlying graph heterophily. We also make an interesting observation: on the 3 large datasets Squirrel, Chameleon, and cornell5, even with only a small portion of the eigengraphs, RED-GCN-D achieves better performance than RED-GCN-S with the complete set of eigengraphs. This suggests that the graph signal in some real-world graphs may be dominated by a few low- and high-frequency components, and allowing two independent depth parameters in RED-GCN-D brings the flexibility to capture the low and high frequencies at the same time.

4.3. TRAINABLE DEPTH

A systematic study is conducted on the node classification performance w.r.t. the trainable depth d. Optimal depth. In Figure 4, the optimal depths and their corresponding classification accuracies are annotated. For the two homophilic graphs, Cora and Citeseer, the optimal depths are positive (5.029 and 3.735) in terms of the best ACC, while for the two heterophilic graphs, Actor and Squirrel, the optimal depths are negative (-0.027 and -3.751). These results demonstrate that our model indeed automatically captures graph heterophily/homophily by finding the suitable depth to suppress or amplify the relative weights of the corresponding frequency components; namely, high/low frequency components are suppressed for homophilic/heterophilic graphs, respectively. Close-to-zero depth. For the two homophilic graphs in Figures 4a and 4b, a sharp performance drop is observed when the depth d approaches 0, since the eigengraphs then gain close-to-uniform weights. For the heterophilic Actor dataset, the optimal depth -0.027 is close to 0, as shown in Figure 4c. In addition, the performance of RED-GCN-D (27.6%) is similar to that of GCN (27.5%), both of which are much worse than RED-GCN-S (35.3%). This result indicates that Actor is a special graph in which all frequency components have similar importance. Due to the absence of the intermediate frequency components between the high- and low-end ones, the performance of RED-GCN-D is severely impacted. For the vanilla GCN, the suppressed weights of the high frequency components deviate from the near-uniform spectrum and thus lead to low ACC on this dataset.

4.4. GRAPH AUGMENTATION

With the optimal d fixed, replacing the diffusion matrix in Eq. 1 by Ŝ^d is equivalent to applying the vanilla GCN to a new topology. This topology effectively plays the role of a structural augmentation of the original graph. The impact of this augmentation on performance is tested on 3 heterophilic graphs, Texas, Cornell, and Wisconsin, as shown in Table 3.
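The augmentation recipe can be sketched in a few lines: build Ŝ^d at the chosen (possibly negative) depth and hand it to a vanilla GCN as the new weighted topology. The toy triangle graph below is an illustrative assumption; d = -0.362 is the optimal depth the paper reports for Cornell:

```python
import numpy as np

def augmented_topology(A, d):
    """New topology S_hat^d obtained from the spectral transform
    g(lambda) = 1 - lambda/2 at a real-valued depth d (sketch)."""
    deg_inv_sqrt = np.diag(A.sum(1) ** -0.5)
    L_sym = np.eye(A.shape[0]) - deg_inv_sqrt @ A @ deg_inv_sqrt
    lam, U = np.linalg.eigh(L_sym)
    return U @ np.diag((1.0 - 0.5 * lam) ** d) @ U.T

A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
S_hat = augmented_topology(A, 1.0)
S_aug = augmented_topology(A, -0.362)   # reported optimal d for Cornell
delta = S_aug - S_hat                   # the quantity shown in the heat map
# on this toy graph, negative d strengthens self-loops relative to S_hat,
# mirroring the diagonal pattern discussed for the Cornell heat map
```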
For the vanilla GCN, the performance obtained with this new topology is clearly superior to that with the raw input graph: it brings dramatic 20%-30% lifts in ACC. Moreover, the augmented topologies also make the vanilla GCN outperform RED-GCN-S and RED-GCN-D on 2 out of the 3 datasets. By nature, the augmented graph is a re-weighted linear combination of the eigengraphs, and its topological structure intrinsically assigns higher weights to eigengraphs corresponding to higher frequencies, as shown in Figure 5. Geometric properties. To further understand how the topology of Ŝ^d with a negative optimal d differs from that of Ŝ, and why the performance is significantly boosted, a heat map of (Ŝ^d − Ŝ) is presented in Figure 6 for Cornell.^3 First, the dark red diagonal line in the heat map indicates that the weights of self-loops are significantly strengthened in the augmented graph; as a result, consistent with previous findings (Zheng et al., 2022a), the raw node attributes contribute more to determining their labels. These strengthened self-weights also play a similar role to the restart distribution or skip connections (Xhonneux et al., 2020) in preventing the node embeddings from becoming over-smoothed. In addition, the horizontal and vertical lines in the heat map in Figure 6 (light yellow, marked by dashed ovals) correspond to the hub node in the graph, namely the node with the largest degree. Interestingly, the connections between this node and most other nodes in the graph experience a negative weight change, so the influence of the hub node on most other nodes is systematically reduced. Consequently, the augmentation amplifies the deviations between node embeddings and facilitates the characterization of graph heterophily. Graph Convolutional Network (GCN). GCN models can be mainly divided into two categories: (1) spectral graph convolutional networks and (2) spatial graph convolutional networks.
In (1), Spectral CNN (Bruna et al., 2013) borrows the idea of convolutional neural networks (Goodfellow et al., 2016) and constructs a diagonal matrix as the convolution kernel. ChebNet (Defferrard et al., 2016) adopts a polynomial approximation of the convolution kernel. GCN (Kipf & Welling, 2016) further simplifies ChebNet via a first-order approximation. Recently, (He et al., 2021; Bianchi et al., 2021; Wang & Zhang, 2022) proposed more advanced filters as the convolution kernel. Most works in (2) follow the message-passing mechanism. GraphSAGE (Hamilton et al., 2017) iteratively aggregates features from local neighborhoods. GAT (Veličković et al., 2017) applies self-attention to the neighbors. APPNP (Klicpera et al., 2018) deploys personalized PageRank (Tong et al., 2006) to sample nodes for aggregation. MoNet (Monti et al., 2017) unifies GCNs in the spatial domain.

5. RELATED WORKS

The Depth of GCN and Over-smoothing. A large body of work focuses on the over-smoothing issue. Its intrinsic cause has been demystified: a linear GCN layer is a Laplacian smoothing operator (Li et al., 2018; Wu et al., 2019). PairNorm (Zhao & Akoglu, 2019) forces distant nodes to be distinctive by adding an intermediate normalization layer. DropEdge (Rong et al., 2019), DeepGCN (Li et al., 2019), AS-GCN (Huang et al., 2018), and JK-Net (Xu et al., 2018) borrow the idea of ResNet (He et al., 2016) to attenuate smoothing. DeeperGXX (Zheng et al., 2021) adopts a topology-guided graph contrastive loss over connected node pairs to obtain discriminative representations. Most works aim to build deep GCNs (i.e., d is a large positive integer) by reducing over-smoothing, while RED-GCN extends the depth from N+ to R and explores negative depth. Node Classification on Homophilic and Heterophilic Graphs. GCN/GNN models mostly follow the homophily assumption that connected nodes tend to share similar labels (Kipf & Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017). Recently, heterophilic graphs, in which neighbors often have disparate labels, have attracted considerable attention. Geom-GCN (Pei et al., 2020) and H2GCN (Zhu et al., 2020) extend the neighborhood for aggregation. FAGCN (Bo et al., 2021) and GPRGNN (Chien et al., 2020) adaptively integrate high/low frequency signals with trainable parameters. Alternative message-passing mechanisms have been proposed in HOG-GCN (Wang & Zhang, 2022) and CPGNN (Zhu et al., 2021). The latest related works include ACM-GCN (Luan et al., 2021; 2022b), LINKX (Lim et al., 2021), BernNet (He et al., 2021), GloGNN (Li et al., 2022), and GBKGNN (Du et al., 2022). Other works can be found in a recent survey (Zheng et al., 2022b).

6. CONCLUSION AND FUTURE WORK

To the best of our knowledge, this work presents the first effort to make GCN's depth trainable by redefining it on the real number domain. We unveil the interdependence between negative GCN depth and graph heterophily. A novel problem of automatic GCN depth tuning for graph homophily/heterophily detection is formulated, and we propose a simple and powerful solution named RED-GCN with two variants (RED-GCN-S and RED-GCN-D). An effective graph augmentation method is also discussed via the new understanding of the message propagation mechanism generated by negative depth. The superior performance of our method is demonstrated via extensive experiments on semi-supervised node classification over 11 graph datasets. The new insights on GCN's depth obtained in this work may open a new direction for future research on spectral and spatial GNNs. Since RED-GCN requires eigen-decomposition of the graph Laplacian, it is not directly applicable to inductive and dynamic graph learning problems, which we leave for future exploration.



Footnotes:
1. This work focuses on connected graphs without bipartite components (i.e., connected components that are bipartite graphs).
2. The case λ_i = 2, namely g(λ_i) = 0, is excluded since it corresponds to the existence of bipartite components.
3. Heat maps for Texas and Wisconsin are in the Appendix, with similar observations.



Figure 2: Eigengraph weight versus eigenvalue for (a) SGC and (b) RED-GCN-S under different depth ds.

Figure 4: Node classification accuracy w.r.t. the trainable depth d on four datasets: Cora, Citeseer, Actor and Squirrel. (the optimal d, accuracy) is annotated (e.g., (-0.027, 36.9%) for Actor).

Figure 5: The weights of eigengraphs w.r.t. eigenvalues on the augmented diffusion matrix Ŝd and original Ŝ for Cornell (d = -0.362).

Figure 6: The difference between the augmented diffusion matrix and the original one, Ŝ^d − Ŝ, for Cornell, shown as a heat map. Best viewed in color.

Table 1: Performance comparison (mean±std accuracy) on homophilic graphs.

Table 2: Performance comparison (mean±std accuracy) on heterophilic graphs.

Table 3: The performance of one-layer vanilla GCN over the augmented Ŝ^d.

It is especially interesting to analyze what changes a negative depth brings to the spatial domain and how such changes impact the subsequent model performance. Graph augmentation. By picking the optimal depth d according to the best performance on the validation set, a new diffusion matrix Ŝ^d is obtained.

