NODE CLASSIFICATION BEYOND HOMOPHILY: TOWARDS A GENERAL SOLUTION

Abstract

Graph neural networks (GNNs) have become core building blocks behind a myriad of graph learning tasks. The vast majority of existing GNNs are built upon, either implicitly or explicitly, the homophily assumption, which does not always hold and can heavily degrade the performance of learning tasks. In response, GNNs tailored for heterophilic graphs have been developed. However, most of the existing works address heterophily through specific GNN model designs, which lack generality. In this paper, we study the problem from the structure learning perspective and propose a family of general solutions named ALT. It can work hand in hand with most of the existing GNNs to decently handle graphs with either low or high homophily. The core of our method is learning to (1) decompose a given graph into two components, (2) extract complementary graph signals from these two components, and (3) adaptively merge the graph signals for node classification. Moreover, analysis based on graph signal processing shows that our framework can empower a broad range of existing GNNs with adaptive filter characteristics and further modulate the input graph signals, which is critical for handling complex homophilic/heterophilic patterns. The proposed ALT brings significant and consistent performance improvement in node classification for a wide range of GNNs over a variety of real-world datasets.

1. INTRODUCTION

Graph neural networks (GNNs) have demonstrated great power as building blocks for a variety of graph learning tasks, such as node classification (Kipf & Welling, 2017), graph classification (Xu et al., 2018), link prediction (Zhang & Chen, 2018), clustering (Bianchi et al., 2020), and many more. Most of the existing GNNs follow the homophily assumption, i.e., edges tend to connect nodes with the same labels and similar node features. Such an assumption holds true for networks such as citation networks (Yang et al., 2016; Bojchevski & Günnemann, 2018), where a paper tends to cite related literature. However, in many other cases, heterophilic settings arise. For instance, to form a protein structure, different types of amino acids are more likely to be linked together (Zhu et al., 2020). On such heterophilic networks, the performance of classic GNN models (Klicpera et al., 2018; Veličković et al., 2018; Hamilton et al., 2017) can degrade greatly and might even be worse than that of an MLP, which does not utilize any topology information at all (Zhu et al., 2020). In response, researchers have analyzed the limitations of existing GNNs in the presence of node heterophily and further proposed specific models to address it from both the spatial and spectral perspectives. For instance, an important design in H2GCN (Zhu et al., 2020) is that high-order neighbors should be considered during message aggregation. GPRGNN (Chien et al., 2021) also aggregates messages from multi-hop neighbors, but it emphasizes that messages can also be negative via a set of learnable aggregation weights. From the spectral perspective, FAGCN (Bo et al., 2021) points out that low-pass filter-based GNNs smooth the node representations between connected nodes, which is not desirable for heterophilic settings where connected nodes are more likely to have different labels.
Hence, FAGCN (Bo et al., 2021) adaptively mixes the low-pass graph filter with the high-pass graph filter via an attention mechanism to tackle this problem. A more detailed review of related work can be found in Section 5. Despite the theoretical insights and empirical performance gains, most of the existing works focus on the model level, i.e., they aim to propose better GNN models to handle heterophilic graphs. In other words, the success of their methods relies on specific designs of GNN models. In this paper, we take a step further and ask: how can we develop a generic method that benefits a broad range of GNNs for node classification beyond homophily, even if they are not originally tailored for heterophilic graphs? To this end, we address this problem from a structure learning (Zhu et al., 2021b) perspective; that is, we optimize the given graph structure to benefit downstream tasks (e.g., node classification). Different from the existing approaches that refine specific GNN models, our approach focuses on the data level by optimizing the input graph topology to tackle heterophily. Challenges. In pursuing such a data-centric general solution, the key challenges are as follows. First (model diversity), our goal is to strengthen a broad range of established GNNs so that they can handle graphs with arbitrary homophily. However, the aggregation mechanisms and graph convolution kernels differ between various GNN models. It is unknown how to accommodate diverse GNNs seamlessly. Second (theoretical foundation), analyses of the success of some specific GNNs on heterophilic graphs have recently emerged (e.g., from the graph signal processing perspective (Shuman et al., 2013)). However, few works focus on the theoretical foundation of structure learning and its connection to dealing with graphs with low homophily.
Our main contributions are listed as follows: (1) We propose a general graph structure learning-based framework named duAL sTructure learning (ALT), which can accommodate a variety of GNN models. Specifically, after removing the activation function from the last layer, any GNN can be plugged into our framework and be trained end-to-end with common optimizers. (2) We provide a detailed analysis from the graph signal processing perspective. Our analysis guides the design of ALT and validates its effectiveness theoretically. (3) Experiments show that with the help of ALT, the node classification accuracy of a broad range of existing GNNs is boosted on heterophilic graphs, and meanwhile kept competitive on homophilic graphs.

2. PRELIMINARIES

Notations. We use bold uppercase letters for matrices (e.g., A), bold lowercase letters for column vectors (e.g., u), lowercase and uppercase letters in regular font for scalars (e.g., d, K), and calligraphic letters for sets (e.g., T). We use A[i, j] to represent the entry of matrix A at the i-th row and the j-th column, A[i, :] to represent the i-th row of matrix A, and A[:, j] to represent the j-th column of matrix A. Similarly, u[i] denotes the i-th entry of vector u. Superscript ⊤ denotes the transpose of matrices and vectors. ⊙ denotes the Hadamard product. An attributed graph can be represented as G = {A, X}, which is composed of an adjacency matrix A ∈ R^{n×n} and an attribute matrix X ∈ R^{n×d}, where n is the number of nodes and d is the node feature dimension. In total, nodes can be categorized into a set of classes C. The normalized Laplacian matrix is L = I − D^{−1/2} A D^{−1/2}, where D is the diagonal degree matrix of A. It can be decomposed as L = U Λ U^⊤, where U ∈ R^{n×n} is the eigenvector matrix and Λ ∈ R^{n×n} is the diagonal eigenvalue matrix. In graph signal processing (Shuman et al., 2013), the diagonal entries of Λ represent frequencies, with Λ[i, i] = λ_i. Given a signal x ∈ R^n, its graph Fourier transform (Shuman et al., 2013) is x̂ = U^⊤ x, and its inverse graph Fourier transform is x = U x̂. For a diffusion matrix C ∈ R^{n×n}, its frequency response (or profile (Balcilar et al., 2021)) is defined as Φ_fp = diag^{−1}(U^⊤ C U), where diag^{−1}(·) returns the diagonal entries. This frequency response is also known as the filter or the convolution kernel. Semi-supervised Node Classification. In this paper, we study semi-supervised node classification (Yang et al., 2016; Kipf & Welling, 2017), where the graph topology A, all node features X, and a part of the node labels are given, and our goal is to predict the labels of the unlabelled nodes.
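These spectral definitions can be made concrete with a few lines of NumPy. The following sketch (the 4-node path graph is our own illustrative example, not from the paper) builds the normalized Laplacian, performs the graph Fourier transform, and computes the frequency response of the diffusion matrix C = D^{−1/2} A D^{−1/2}:

```python
import numpy as np

# Adjacency of a 4-node path graph (illustrative example)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian

# Eigendecomposition L = U Lambda U^T (U orthonormal since L is symmetric)
lam, U = np.linalg.eigh(L)

# Graph Fourier transform of a signal x: x_hat = U^T x; inverse: x = U x_hat
x = np.array([1.0, 2.0, 3.0, 4.0])
x_hat = U.T @ x
assert np.allclose(U @ x_hat, x)              # inverse transform recovers x

# Frequency response of the diffusion matrix C = D^{-1/2} A D^{-1/2} = I - L
C = np.eye(4) - L
phi = np.diag(U.T @ C @ U)
assert np.allclose(phi, 1.0 - lam)            # decreasing in frequency: low-pass
```

Since `np.linalg.eigh` returns eigenvalues in ascending order, the response `1 - lam` is monotonically decreasing, i.e., this diffusion matrix is a low-pass filter in the sense used throughout the paper.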
Numerous works (Kipf & Welling, 2017; Veličković et al., 2018; Klicpera et al., 2018) achieve impressive performance on this problem. However, recent studies show that their successes heavily rely upon the homophily assumption of the given graphs (Zheng et al., 2022; Zhu et al., 2020). In general, homophily describes to what extent edges tend to link nodes with the same labels and similar features. Following previous works (Zhu et al., 2020; Pei et al., 2019), this paper focuses on node label homophily. There are various homophily metrics, and we introduce one of them, named edge homophily (Zhu et al., 2020), as: h(G) = (Σ_{i,j: A[i,j]=1} 1[y[i] = y[j]]) / (Σ_{i,j} A[i, j]) ∈ [0, 1], where 1[x] = 1 if x is true and 0 otherwise. The more homophilic a given graph is, the closer its h(G) is to 1.
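The edge homophily metric is a direct ratio count; a minimal NumPy implementation (the toy graph and labels below are our own illustration):

```python
import numpy as np

def edge_homophily(A, y):
    """Fraction of edges whose two endpoints share the same label, h(G)."""
    src, dst = np.nonzero(A)                       # all edges A[i, j] = 1
    return float((y[src] == y[dst]).sum()) / len(src)

# Toy undirected graph: 4 nodes, labels [0, 0, 1, 1]
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]])
y = np.array([0, 0, 1, 1])
# Undirected edges: (0,1) same label, (0,2) different, (2,3) same -> h = 2/3
assert abs(edge_homophily(A, y) - 2.0 / 3.0) < 1e-12
```

Counting each undirected edge in both directions leaves the ratio unchanged, so the function works on symmetric adjacency matrices as written.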

3. PROPOSED METHODS

In this section, we first propose a flexible method named ALT-global, which empowers any GNN with adaptive filter characteristics. Next, we carefully analyze the expressiveness of ALT-global from the graph signal processing perspective (Shuman et al., 2013). This analysis guides the design of another, more advanced method named ALT-local, which enhances the spectral expressiveness of any GNN to that of a local adaptive filter by modulating the input graph signals.

3.1. ALT-GLOBAL: A GLOBAL ADAPTIVE METHOD

Intuitively, nodes with different labels should be located as far as possible in the embedding space, and nodes with the same labels should be placed closely. This intuition aligns well with the utility of many classic GNNs (e.g., GCN (Kipf & Welling, 2017)) on homophilic graphs. That is because, on homophilic graphs, many same-label nodes are connected, whose embeddings will be smoothed by those classic low-pass filter GNNs (Bo et al., 2021; Balcilar et al., 2021). In contrast, the performance of low-pass filter GNNs degrades significantly on heterophilic graphs, since the connected nodes' embeddings should not be smoothed. Many efforts (Bo et al., 2021; Chien et al., 2021) point out that a key design to deal with graphs with unknown homophily is to equip GNNs with an adaptive filter. We aim to propose a data-centric solution such that minimal modification of the given GNN (e.g., a low-pass filter GNN) is needed. As we do not make any assumption about the model structure of the given GNN, its filter can be low-pass, high-pass, band-pass, or otherwise. To equip the given GNN with an adaptive filter, our core idea is to adaptively combine signals from two filters with complementary filter characteristics. For example, if a low-pass filter GNN is given, it should be adaptively combined with another high-pass filter. To find such a complementary filter, a two-step modification of the frequency response is needed. Figure 1 shows that we can first reflect the frequency response curve over the frequency axis and then apply an appropriate offset to the reflected frequency response. Guided by this idea, the mathematical details of the proposed ALT-global are as follows:

H_1 = GNN(wA, X, θ_1),  (1a)
H_2 = GNN((1 − w)A, X, θ_2),  (1b)
H_offset = MLP(X, θ_3),  (1c)
Z = softmax(H_1 − H_2 + η H_offset),  (1d)

where θ_1 and θ_2 are the parameters of the backbone dual GNNs (i.e., the GNNs from Eq. 1a and Eq. 1b), θ_3 is the parameter of a multi-layer perceptron (MLP), η ∈ R and w ∈ [0, 1] are learnable parameters, and Z ∈ R^{n×|C|} is the prediction matrix. Here the softmax is applied row-wise. For models using the normalized adjacency matrix (e.g., Ã = (D + I)^{−1/2}(A + I)(D + I)^{−1/2}) as the diffusion matrix (e.g., GCN (Kipf & Welling, 2017)), the re-weighting can be set over the normalized adjacency matrix (i.e., wÃ and (1 − w)Ã). We elaborate more on the design of ALT-global. First, all the insights we obtained from Figure 1 are still applicable to the convolution kernel directly. Nonetheless, since our method works in a plug-and-play fashion that does not modify the backbone GNNs, it uses a well-designed aggregation (i.e., Eq. 1d) to achieve an equivalent effect. Specifically, (1) H_1 is the signal from a backbone GNN with positive re-scaling; (2) −H_2 is the negative signal that corresponds to the signal from a reflected filter; (3) η H_offset is the offset term, which is equivalent to the signal from an all-pass filter. Second, the adaptive mixture of the above three sets of graph signals is controlled by the learnable parameters w and η. Other aggregation functions are also applicable. One option is an MLP whose input is the concatenation of H_1, H_2, and H_offset. However, it is not used in this paper because (1) it increases the analysis difficulty dramatically and (2) empirically, no performance advantage is observed in the ablation study (Section 4.3). Analysis in the following section shows that ALT-global bears strong flexibility in filter characteristics.
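The plug-and-play aggregation of Eqs. 1a-1d can be sketched in PyTorch. This is a minimal sketch under our own assumptions, not the paper's released implementation: `backbone_cls` stands for any GNN that accepts a (re-weighted) dense adjacency, and `TinyGAMLP` is a stand-in GA-MLP-style backbone invented for this demo:

```python
import torch
import torch.nn as nn

class ALTGlobal(nn.Module):
    """Sketch of ALT-global (Eqs. 1a-1d): dual re-weighted backbones plus an
    all-pass offset MLP, merged via learnable scalars w and eta."""

    def __init__(self, backbone_cls, in_dim, n_classes, hidden=16):
        super().__init__()
        self.gnn1 = backbone_cls(in_dim, n_classes)   # theta_1
        self.gnn2 = backbone_cls(in_dim, n_classes)   # theta_2
        self.offset = nn.Sequential(                  # theta_3
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
        self.w_raw = nn.Parameter(torch.zeros(1))     # sigmoid -> w in [0, 1]
        self.eta = nn.Parameter(torch.zeros(1))

    def forward(self, A, X):
        w = torch.sigmoid(self.w_raw)
        h1 = self.gnn1(w * A, X)                      # Eq. 1a
        h2 = self.gnn2((1 - w) * A, X)                # Eq. 1b
        h_off = self.offset(X)                        # Eq. 1c
        return torch.softmax(h1 - h2 + self.eta * h_off, dim=-1)  # Eq. 1d

# One-layer GA-MLP-style backbone used only for this demo (our assumption,
# not the paper's choice): H = C @ MLP(X) with C the (re-weighted) adjacency.
class TinyGAMLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, A, X):
        return A @ self.lin(X)

A = torch.eye(5) + torch.rand(5, 5)
X = torch.rand(5, 8)
Z = ALTGlobal(TinyGAMLP, in_dim=8, n_classes=3)(A, X)
assert Z.shape == (5, 3)
assert torch.allclose(Z.sum(dim=-1), torch.ones(5), atol=1e-5)  # row-wise softmax
```

Note that, as the paper states, the only change required of the backbone is exposing pre-softmax outputs; the sketch therefore applies the softmax once, after the three signals are merged.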

3.2. ANALYSIS OF ALT-GLOBAL

For clarity and brevity, in the following analysis, we assume that the backbone GNNs are graph-augmented MLPs (GA-MLPs) as defined below. This is because, first, many GNNs fall into the GA-MLP family if part of the nonlinear functions is removed; and second, GA-MLPs have shown strong empirical performance while enjoying provable expressiveness (Chen et al., 2021). Definition 1. Graph-Augmented Multi-Layer Perceptron (GA-MLP) (Chen et al., 2021) is a family of GNNs that first conduct feature transformation via an MLP and then diffuse the features. Mathematically, they compute node embeddings as H = C · MLP(X), where C is the diffusion matrix. The (full) frequency profile (Balcilar et al., 2021) is closely related to the filter characteristics of GNNs and is introduced as follows. Definition 2. The frequency profile (Balcilar et al., 2021) is defined as Φ_fp = diag^{−1}(U^⊤ C U), where diag^{−1}(·) returns the diagonal entries, if U^⊤ C U is a diagonal matrix. In case U^⊤ C U is not a diagonal matrix, the full frequency profile (Balcilar et al., 2021) is defined as Φ = U^⊤ C U. It is well known that the frequency profile of a diffusion matrix (if diagonal) is a filter/convolution kernel for the input graph signal. Next, we show that ALT is indeed equipped with an adaptive filter. Lemma 1. The filter characteristic of the proposed ALT-global (Eq. 1d) is adaptive regardless of the frequency filtering functionality of the backbone GNNs (Eq. 1a and Eq. 1b). Proof. For analysis convenience, we assume (1) the learnable weight w is multiplied with the diffusion matrix, and (2) the backbone GNNs are GA-MLPs whose MLP modules (from Eq. 1a and Eq. 1b) share common parameters with the offset MLP (from Eq. 1c). We start from the case where the backbone GNNs are fixed low-pass filters. Without loss of generality, their corresponding full frequency profiles can be represented as Φ = I − ξ(Λ), where ξ is a monotonically increasing function.
Then, in this case, the diffusion matrices of the two GNNs are re-weighted as wC and (1 − w)C, respectively. Considering the offset MLP as a special GA-MLP whose diffusion matrix is I, the combined graph signal is wC · MLP(X) − (1 − w)C · MLP(X) + ηI · MLP(X) = C′ · MLP(X), where the combined diffusion matrix is C′ = wC − (1 − w)C + ηI. Hence the diagonal entry of the corresponding full frequency profile is Φ[i, i] = Φ(λ_i) = (2w − 1)(1 − ξ(λ_i)) + η. When w > 0.5, i.e., 2w − 1 > 0, Φ(λ_i) is a monotonically decreasing function, and the proposed method is a low-pass filter when η > 0. Similarly, it is a high-pass filter when w is close to 0 and η > 1. The above conditions are sufficient, and in fact there are many other combinations of w and η that can produce low-pass/high-pass filters. Similar results can be obtained when the backbone GNNs are fixed high-pass filters, and we omit that part for brevity. Remarks. The filter characteristics of ALT-global can also be interpreted from the Graph Diffusion Equation (GDE) (Newman, 2018) perspective, and we provide the GDE-related analysis in the Appendix.
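Lemma 1 can be checked numerically for the GA-MLP case. The NumPy sketch below takes the backbone diffusion matrix to be the normalized adjacency, i.e., ξ(λ) = λ (our own choice of instantiation, on a random graph of our own construction), and verifies that the combined diffusion matrix has frequency response (2w − 1)(1 − λ_i) + η:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Random symmetric adjacency with self-loops (illustrative graph)
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(n)

d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))     # normalized adjacency: a low-pass C
L = np.eye(n) - A_norm
lam, U = np.linalg.eigh(L)               # lam in ascending order

w, eta = 0.9, 0.3
C_comb = w * A_norm - (1 - w) * A_norm + eta * np.eye(n)  # wC - (1-w)C + eta*I
phi = np.diag(U.T @ C_comb @ U)

# Lemma 1 with xi(lambda) = lambda: Phi(lambda_i) = (2w - 1)(1 - lambda_i) + eta,
# monotonically decreasing (low-pass) since w > 0.5
assert np.allclose(phi, (2 * w - 1) * (1 - lam) + eta)
assert np.all(np.diff(phi) <= 1e-9)
```

Setting w close to 0 instead flips the sign of the slope, turning the same combination into a high-pass filter, exactly as the proof argues.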

3.3. GLOBAL FILTERS VS. LOCAL FILTERS

We have shown that ALT-global is equipped with adaptive filter characteristics. However, ALT-global fundamentally applies a global filter to every node, which could lead to suboptimal performance. Recent studies (Zhu et al., 2021a; Wang et al., 2022a) reveal that heterophilic connection patterns differ between different nodes. Take gender classification on a dating network as an example. While linked node pairs are often of different labels (i.e., genders), same-label connections also exist between some node pairs. Therefore, simply applying a global low-pass or high-pass filter over all the nodes can degrade the overall classification performance. Next, we study how to generalize our proposed ALT-global to a local (i.e., node-specific) and adaptive filter. Before that, let us take a closer look at the full frequency profile (Balcilar et al., 2021): Φ = U^⊤ C U. In the following proposition, we point out that Φ can describe both the filter and the modulator characteristics of a given diffusion matrix C. Proposition 1. The diagonal entries of the full frequency profile Φ of the diffusion matrix serve as the filter, and the non-zero off-diagonal entries serve as the frequency modulator. Proof. The diffusion of the input graph signal X_in = MLP(X) can be represented as C X_in = U Φ U^⊤ X_in = U (Φ X̂_in), where X̂_in = U^⊤ X_in is the input graph signal in the spectral domain. According to the definitions of graph signal processing (Shuman et al., 2013), (Φ X̂_in)[i, :] represents the amplitude of the output graph signal whose frequency is λ_i. We further expand the computation and obtain (Φ X̂_in)[i, :] = Σ_j Φ[i, j] · X̂_in[j, :]. In the summation, the i = j term represents the filter/convolution kernel, which has been adopted by many spectral GNNs (Balcilar et al., 2021).
If i ≠ j (i.e., if non-zero off-diagonal entries of Φ exist), the λ_i-component of the output graph signal is merged with the scaled (by Φ[i, j]) λ_j-component of the input graph signal, which is essentially modulation (Shuman et al., 2013). Based on the above property of the full frequency profile Φ, the following proposition points out the key design for local filter characteristics. Proposition 2. Modulation of the input graph signal (i.e., non-zero off-diagonal entries in the full frequency profile) is necessary for local filters. Proof. We follow the terminology used in the proof of Proposition 1. If the full frequency profile Φ only contains non-zero diagonal entries, we obtain (Φ X̂_in)[i, :] = Φ[i, i] · X̂_in[i, :]. Hence, if we define the scaling of the λ_i-frequency signal over node p after and before the operator Φ as SCALING(i, p, Φ) = (Φ X̂_in)[i, p] / X̂_in[i, p], (2) then based on Eq. 2 we obtain ∀ i, p, q: SCALING(i, p, Φ) = SCALING(i, q, Φ), i.e., for any specific frequency (e.g., λ_i), its scalings over any two nodes (p and q) are equal. In other words, the filter Φ works globally over every node. Suppose instead we expect the filter Φ to not work globally, i.e., ∃ i, p, q: SCALING(i, p, Φ) ≠ SCALING(i, q, Φ). The above inequality is equivalent to Σ_{k≠i} Φ[i, k] · X̂_in[k, p] / X̂_in[i, p] ≠ Σ_{k≠i} Φ[i, k] · X̂_in[k, q] / X̂_in[i, q]. Assume that Φ[i, k] = 0 for all k ≠ i; then the left-hand side equals the right-hand side, which leads to a contradiction. Hence, non-zero off-diagonal entries of the full frequency profile Φ must exist if we expect the filter to not work globally. Notice that the above definition of scaling (i.e., (Φ X̂_in)[i, p] / X̂_in[i, p]) is not fully aligned with classic graph filtering (Shuman et al., 2013) but is a combination of filtering and modulation, as mentioned in Proposition 1.
Next, we present a family of GA-MLPs whose spectral expressiveness is limited to a global filter. Proposition 3. A family of GA-MLPs are global filters if their diffusion matrices take the form C = Σ_k a_k Ã^k + bI, in which case their full frequency profiles only contain non-zero diagonal entries. We prove Proposition 3 in the Appendix. A wide range of GA-MLPs (e.g., SGC (Wu et al., 2019), APPNP (Klicpera et al., 2018)) follow the above form and therefore cannot modulate the graph signal. Unfortunately, even when they are equipped with our proposed ALT-global, they remain global filters because ALT-global assigns the same weight to every edge (i.e., wÃ and (1 − w)Ã).
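The contrast between Proposition 3 (uniform re-weighting keeps Φ diagonal) and Lemma 2 in the next subsection (non-uniform re-weighting introduces off-diagonal entries) can be verified directly. A NumPy sketch on a random graph of our own construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(n)   # symmetric, with self-loops
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))         # normalized adjacency

L = np.eye(n) - A_norm
_, U = np.linalg.eigh(L)

def max_off_diag(C):
    """Largest off-diagonal magnitude of the full frequency profile of C."""
    Phi = U.T @ C @ U
    return np.abs(Phi - np.diag(np.diag(Phi))).max()

# Uniform re-weighting (as in ALT-global): Phi stays diagonal -> global filter
assert max_off_diag(0.7 * A_norm) < 1e-10

# Non-uniform edge weights (Lemma 2): off-diagonal entries appear -> modulation
W = rng.random((n, n)); W = (W + W.T) / 2    # symmetric random edge weights
assert max_off_diag(W * A_norm) > 1e-6
```

Intuitively, w·Ã shares eigenvectors with L, so U diagonalizes it; W ⊙ Ã generally does not commute with L, which is exactly where the off-diagonal (modulation) entries come from.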

3.4. ALT-LOCAL: A LOCAL ADAPTIVE METHOD

In this subsection, we propose a more flexible method based on ALT-global. Our goal is to empower the backbone GNNs with local adaptive signal filtering capabilities, which is an essential property for capturing complex heterophilic connection patterns (Zhu et al., 2021a; Wang et al., 2022a). According to Proposition 3, we know that if all the edges are assigned the same weight (e.g., wÃ), the corresponding full frequency profile will only contain non-zero diagonal entries. Lemma 2 provides a clue on how to introduce non-zero off-diagonal entries into full frequency profiles. Lemma 2. By re-weighting the edge weights non-uniformly (i.e., re-weighting by W ⊙ Ã with ∃ i, j, k, l: W[i, j] ≠ W[k, l]), the off-diagonal entries of Φ can be non-zero. We prove Lemma 2 in the Appendix. Guided by Lemma 2, we modify ALT-global as follows so that the edge weights can differ:

H_1 = GNN(W ⊙ A, X, θ_1),  (3a)
H_2 = GNN((1 − W) ⊙ A, X, θ_2),  (3b)
H_offset = MLP(X, θ_3),  (3c)
Z = softmax(H_1 − H_2 + η H_offset).  (3d)

One option is to set W as a learnable parameter, which is prone to overfitting since the number of parameters equals the number of edges. Therefore, we parameterize the edge weights W by an edge augmenter as follows:

H = GNN_aug(A, X, ϕ_1),  (4a)
W[i, j] = w_ij = sigmoid(MLP(H[i, :] || H[j, :], ϕ_2)),  (4b)

where ϕ_1 and ϕ_2 are the parameters of the augmenter GNN and a multi-layer perceptron (MLP), respectively. Here we first obtain the node embedding matrix via the augmenter GNN (i.e., GNN_aug) in Eq. 4a. Then we concatenate node embeddings into edge embeddings (i.e., H[i, :] || H[j, :]). The edge weight (i.e., w_ij) is computed via an MLP with sigmoid activation. Naturally, the node embeddings from the augmenter GNN (Eq. 4a) should be as discriminative as possible so that the edge importance can be better measured.
Thus, we use a two-layer high-pass filter GNN as GNN_aug, whose mathematical formulation is as follows:

GNN_aug(A, X, ϕ_1) = Ã_high^2 MLP(X, ϕ_1),  (5a)
Ã_high = ϵI − D^{−1/2} A D^{−1/2},  (5b)

where ϵ is a scaling hyper-parameter to adjust the amplitude of the high-pass filter. We name the above model (i.e., Eqs. 3a-5b) ALT-local, which is summarized in Figure 2. Remarks. Our method is partly inspired by FAGCN (Bo et al., 2021), and we state the uniqueness and advantages of our work compared with FAGCN as follows. From the method perspective, FAGCN explicitly mixes high-frequency and low-frequency signals. ALT generalizes this idea to a 'mixture of complementary filters'; thus, even when the backbone GNN's convolution kernel is unknown, ALT can still boost its performance decently, which provides great generality. For the theoretical contribution, Bo et al. (2021) analyze the spatial effects of signals with different frequencies. Our analysis takes a solid step forward to reveal the connections between the full frequency profile, graph signal modulation, and local adaptive filters.
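The edge augmenter of Eqs. 4a-5b can be sketched compactly in PyTorch. This is a dense-matrix sketch under our own simplifications (a single-linear edge MLP, a toy random graph); it is not the paper's released code:

```python
import torch
import torch.nn as nn

class EdgeAugmenter(nn.Module):
    """Sketch of Eqs. 4a-5b: a two-layer high-pass augmenter GNN produces node
    embeddings H; an MLP over concatenated endpoint embeddings scores each edge."""

    def __init__(self, in_dim, emb_dim, eps=0.5):
        super().__init__()
        self.eps = eps
        self.mlp_node = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU(),
                                      nn.Linear(emb_dim, emb_dim))   # phi_1
        self.mlp_edge = nn.Linear(2 * emb_dim, 1)                    # phi_2

    def forward(self, A, X):
        d = A.sum(dim=1)
        A_high = self.eps * torch.eye(len(A)) - A / torch.sqrt(torch.outer(d, d))
        H = A_high @ (A_high @ self.mlp_node(X))     # Eq. 5a: A_high^2 MLP(X)
        src, dst = A.nonzero(as_tuple=True)          # score existing edges only
        e = self.mlp_edge(torch.cat([H[src], H[dst]], dim=-1)).squeeze(-1)
        W = torch.zeros_like(A)
        W[src, dst] = torch.sigmoid(e)               # Eq. 4b: w_ij in (0, 1)
        return W

# Toy symmetric graph with self-loops (our own illustration)
A = torch.eye(5) + (torch.rand(5, 5) > 0.5).float()
A = ((A + A.T) > 0).float()
X = torch.rand(5, 8)
W = EdgeAugmenter(in_dim=8, emb_dim=16)(A, X)
assert W.shape == (5, 5)
assert ((W >= 0) & (W <= 1)).all()                  # valid edge weights
assert (W[A == 0] == 0).all()                       # non-edges stay zero
```

The resulting W feeds Eqs. 3a-3b as W ⊙ A and (1 − W) ⊙ A, which is precisely the non-uniform re-weighting Lemma 2 calls for.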

3.5. TRAINING PROCEDURE

To train our models, we formulate the following bi-level optimization problem:

ϕ* = argmin_ϕ L_upper(g(G, ϕ), θ*, Y_valid)
s.t. θ* = argmin_θ L_lower(g(G, ϕ), θ, Y_train),

where the augmenter is denoted as g(·), whose parameter is ϕ, and the dual backbone GNNs are parameterized by θ for brevity. Specifically, for ALT-global, θ = {θ_1, θ_2, θ_3} and ϕ = w are from Eq. 1a, Eq. 1b, and Eq. 1c. For ALT-local, θ = {θ_1, θ_2, θ_3} is from Eq. 3a, Eq. 3b, and Eq. 3c; ϕ = {ϕ_1, ϕ_2} is from Eq. 4a and Eq. 4b. Both L_upper and L_lower are cross-entropy losses between the classification results (Eq. 1d for ALT-global and Eq. 3d for ALT-local) and the labelled nodes. The difference lies in that for the lower-level objective we compute the loss over the training nodes, while for the upper-level one we compute the loss over the validation nodes. To solve such a bi-level optimization problem, we resort to the classic first-order approximation (Nichol et al., 2018) to compute the hyper-gradient ∇_ϕ L_upper, and any gradient descent-based method can then be used. If all the feature dimensions of the different layers (including the input layers) of the different backbone GNNs and MLPs are denoted as d, and all the models (GNNs and MLPs) contain 2 feature transformation matrices, the number of trainable parameters of ALT-local is composed of three parts: (1) GNN_aug (2d^2), (2) the MLP from Eq. 4b (2d^2 + d), and (3) GNN_1, GNN_2, and the offset MLP (3d^2 + 3dc), where c is the number of classes. In practice, the parameter count is much smaller than this estimate; for example, for datasets with d > 500, setting the hidden dimension to 32 is empirically enough. However, compared with vanilla backbone GNNs (e.g., a simple GCN (Kipf & Welling, 2017)), ALT-local inevitably contains more parameters, as it is composed of 3 GNNs and 2 MLPs in total. Even ALT-global is composed of 2 GNNs and 1 MLP.
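The alternating, first-order approximation of this bi-level problem can be sketched as a training loop. The stand-in linear "augmenter" and "backbone" below, and the `forward` composition function, are our own toys for illustration, not the paper's architecture; only the loop structure (lower-level loss on training nodes updating θ, upper-level loss on validation nodes updating ϕ with θ treated as fixed) follows the text:

```python
import torch

def train_alt(model_theta, model_phi, forward, A, X, y,
              train_mask, valid_mask, outer_steps=50, inner_steps=1, lr=0.05):
    """First-order bi-level training: no second-order hyper-gradient is computed;
    the upper-level step simply backpropagates the validation loss into phi."""
    loss_fn = torch.nn.CrossEntropyLoss()
    opt_theta = torch.optim.Adam(model_theta.parameters(), lr=lr)
    opt_phi = torch.optim.Adam(model_phi.parameters(), lr=lr)
    for _ in range(outer_steps):
        for _ in range(inner_steps):                 # lower level: fit theta
            opt_theta.zero_grad()
            Z = forward(model_phi, model_theta, A, X)
            loss_fn(Z[train_mask], y[train_mask]).backward()
            opt_theta.step()
        opt_phi.zero_grad()                          # upper level: update phi
        Z = forward(model_phi, model_theta, A, X)
        loss_fn(Z[valid_mask], y[valid_mask]).backward()
        opt_phi.step()
    return model_theta, model_phi

# Minimal usage with stand-in linear models (our own toy setup)
n, d, c = 20, 8, 3
A = torch.eye(n); X = torch.rand(n, d); y = torch.randint(0, c, (n,))
train_mask = torch.arange(n) < 8
valid_mask = (torch.arange(n) >= 8) & (torch.arange(n) < 16)
aug = torch.nn.Linear(d, 1)      # phi: per-node scores re-weighting A (stand-in)
gnn = torch.nn.Linear(d, c)      # theta: backbone classifier (stand-in)

def forward(aug, gnn, A, X):
    w = torch.sigmoid(aug(X))    # crude stand-in for the edge-weight matrix W
    return (w * A) @ gnn(X)

train_alt(gnn, aug, forward, A, X, y, train_mask, valid_mask, outer_steps=5)
```

Because the upper-level step ignores how θ* depends on ϕ, this matches the first-order approximation the paper cites (Nichol et al., 2018) rather than an exact hyper-gradient.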
Hence, the increased number of parameters is a potential limitation of ALT-local and ALT-global. Besides, our theoretical analysis relies on the assumption that the backbone GNNs are GA-MLPs. Generalizing our theoretical results to a broader range of GNNs is our future work.

4. EXPERIMENTS

4.1. EXPERIMENT SETTINGS

Datasets. We use 16 datasets, including Cora (Yang et al., 2016), Citeseer (Yang et al., 2016), Pubmed (Yang et al., 2016), DBLP (Bojchevski & Günnemann, 2018), Computers (Shchur et al., 2018), Photos (Shchur et al., 2018), CS (Shchur et al., 2018), Physics (Shchur et al., 2018), Cornell (Pei et al., 2019), Texas (Pei et al., 2019), Wisconsin (Pei et al., 2019), Chameleon (Rozemberczki et al., 2021), Squirrel (Rozemberczki et al., 2021), Film (Pei et al., 2019), Cornell5 (Lim et al., 2021), and Penn94 (Lim et al., 2021). For Cora, Citeseer, and Pubmed, we follow the dataset split from (Kipf & Welling, 2017). We randomly split the other datasets into 20/20/60% for training, validation, and test. Detailed statistics of the datasets are presented in the Appendix - Dataset Statistics. Baseline Methods and Metric. We use 6 baseline methods, including 3 classic GNNs: GCN (Kipf & Welling, 2017), SGC (Wu et al., 2019), and APPNP (Klicpera et al., 2018), and 3 adaptive GNNs: GPRGNN (Chien et al., 2021), FAGCN (Bo et al., 2021), and H2GCN (Zhu et al., 2020), which use specific designs to tackle graphs with low homophily. Thanks to the flexibility of our method, we equip the above baseline methods with our proposed ALT to validate its effectiveness. As ALT-local is more powerful than ALT-global, we mainly show the performance comparison with ALT-local (abbreviated as ALT). The comparison between ALT-local and ALT-global is presented in the ablation study. We use accuracy (ACC) as the metric and report the average accuracy with the standard deviation over 10 runs.

4.2. MAIN RESULTS

We present the performance comparison on heterophilic graphs in Table 1. First, on the heterophilic graphs, in general, our method ALT can significantly improve the performance of most of the existing GNNs, especially methods not originally designed for heterophilic graphs (e.g., GCN, SGC, and APPNP). On average, over 10% improvement is obtained across the heterophilic graphs. Second, for adaptive GNNs (e.g., GPRGNN, FAGCN, and H2GCN), the performance improvement on heterophilic graphs is not as significant as for low-pass filter GNNs. This is expected, since these methods have already dealt with heterophily to some extent. Nonetheless, we still gain 2-4% performance improvements averaged over all 8 heterophilic datasets. The performance comparison on homophilic graphs is presented in Table 2. We test 48 graph-GNN combinations, out of which 35 cases show improvements. It is worth noting that even though GCN, SGC, and APPNP are designed mainly for homophilic graphs, the proposed ALT is still able to significantly boost their performance on Computers by nearly 10%. Moreover, for each backbone GNN, the average gain of applying the proposed ALT over all 8 homophilic graphs is always positive. Most of the remaining cases bear very minor performance losses (12 out of 13 are below 0.5%). Thus, we conclude that ALT can retain or even boost the performance of the given backbone GNNs on homophilic graphs.

4.3. ABLATION STUDY AND HYPERPARAMETER STUDY

In this section, we present a systematic ablation study on the following datasets: Chameleon (Rozemberczki et al., 2021), Squirrel (Rozemberczki et al., 2021), Film (Pei et al., 2019), Computers (Shchur et al., 2018), Photos (Shchur et al., 2018), and CS (Shchur et al., 2018). Specifically, we have the following ablated versions: (1) ALT-local; (2) ALT-local with a low-pass filter augmenter (i.e., changing Eq. 5b to a two-layer SGC), named ALT-local-low; (3) ALT-local-concat, whose aggregation step (Eq. 3d) is instantiated by concatenation followed by an MLP; (4) ALT-global; and (5) vanilla backbone GNNs without our methods (named None). Results with GCN as the backbone are presented in Table 3, and results with SGC and APPNP as the backbones are presented in the Appendix - Additional Experimental Results. From the above results, we conclude that ALT-local has consistent advantages over all ablated versions. In addition, we provide a hyperparameter sensitivity study in the Appendix - Additional Experimental Results.

5. RELATED WORK

Graph Structure Learning. Graph structure learning aims to modify the given graph structure to improve the performance of downstream tasks. For instance, to boost message propagation, inserting virtual nodes is an effective approach (Gilmer et al., 2017; Li et al., 2017). For topology denoising, dropping some existing edges can improve model robustness (Wu et al., 2020; Luo et al., 2021) and eliminate redundant information from the input (Yu et al., 2020). Another line of research views the given graph as an optimization variable and updates it according to the performance of downstream node classifiers (e.g., LDS (Franceschi et al., 2019) and Gasoline (Xu et al., 2022)). Other works formulate the given graph as a random variable and infer its optimal parameters, including Bayesian GCNN (Zhang et al., 2019), GEN (Wang et al., 2021), and many more. Recently, Zhu et al. (2021b) provided a comprehensive survey on this topic. Graph Learning on Heterophilic Graphs. Heterophilic graphs are also known as disassortative graphs. Many message-passing-based GNNs suffer from performance degradation on heterophilic graphs, and several approaches have been developed in response. For example, Geom-GCN (Pei et al., 2019) and H2GCN (Zhu et al., 2020) expand the message-passing mechanism beyond the first-order neighbors. GPRGNN (Chien et al., 2021) and BernNet (He et al., 2021) set the weights of different propagation results as learnable parameters to work as an adaptive graph filter. FAGCN (Bo et al., 2021), GBK (Du et al., 2022), and ACM-GNN (Luan et al., 2021) explicitly mix two convolution kernels through attention-based mechanisms. Building on the above work, DMP (Yang et al., 2021) studies this problem at a finer granularity by introducing a feature-specific message-passing mechanism. Yan et al. (2021) reveal the connections between oversmoothing and network heterophily.
Other works that modify the propagation step of GNNs for heterophilic graphs include CPGNN (Zhu et al., 2021a), HOG-GCN (Wang et al., 2022b), and GloGNN (Li et al., 2022). Interestingly, Luan et al. (2021) and Ma et al. (2021) both report that there are cases where high heterophily does not hurt the performance of low-pass filter GNNs, which reveals further unexplored space for this problem. Zheng et al. (2022) recently presented a survey on this topic. To the best of the authors' knowledge, the only structure learning-based solution for addressing graph heterophily is WRGAT (Suresh et al., 2021), which improves graph homophily with a heuristic method. In comparison, our proposed framework is more flexible and theoretically solid.

6. CONCLUSION

In this paper, we propose ALT, a general framework for semi-supervised node classification on graphs beyond homophily. Our method introduces a novel structure learning-based augmenter to decompose the given graph. After that, a dual GNN module can be instantiated with most of the existing GNNs on the decomposed graphs. Systematic theoretical analysis shows that our proposed method can adaptively filter and modulate the graph signals, which is critical for addressing complex heterophilic connection patterns. Comprehensive empirical evaluation and ablation studies demonstrate that the proposed ALT obtains significant performance improvement for a wide range of GNN models on a variety of graph datasets with arbitrary homophily.

A REPRODUCIBILITY

Hardware. We implement ALT in PyTorch (https://pytorch.org/) and PyTorch Geometric (https://pytorch-geometric.readthedocs.io/en/latest/), using one NVIDIA Tesla V100 SXM2-32GB GPU.

Dataset statistics

The detailed statistics of datasets are presented in Table 4 and Table 5 . 

Detailed Experimental Settings

We obtain all the datasets from PyTorch Geometric (https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html); all of them are public. We follow the given dataset split for Cora, Citeseer, and Pubmed. For the remaining datasets, we randomly split them into 20/20/60% as the training, validation, and test sets. Notice that we do not follow the dataset split from the GPRGNN paper (Chien et al., 2021), as they manually assign the same number of training samples to each class; our dataset split is more practical. For all the GNNs (including the augmenter and backbone GNNs), we set the hidden dimension to 16 and the learning rate to 0.05. For all the backbone GNNs, the weight decay is set to 0.0005. For the augmenter GNN, the weight decay is searched in {0.005, 0.0005, 0.00005} and its ϵ is set to 0.5. We are still going through our internal review process for releasing the code, and we expect to be able to release it before the conference.
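For concreteness, the random 20/20/60% split described above can be sketched as follows. This is our own illustrative helper, not the paper's released code; in practice the returned index sets would be converted into boolean masks on a PyTorch Geometric `Data` object.

```python
import random

def random_split(num_nodes, train_ratio=0.2, val_ratio=0.2, seed=0):
    """Randomly split node indices into train/val/test sets (default 20/20/60%)."""
    rng = random.Random(seed)
    perm = list(range(num_nodes))
    rng.shuffle(perm)
    n_train = int(train_ratio * num_nodes)
    n_val = int(val_ratio * num_nodes)
    train = set(perm[:n_train])
    val = set(perm[n_train:n_train + n_val])
    test = set(perm[n_train + n_val:])
    return train, val, test

# Example: a graph with 1,000 nodes yields 200/200/600 disjoint index sets.
train, val, test = random_split(1000)
```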

B ADDITIONAL EXPERIMENTAL RESULTS

The ablation study results with SGC and APPNP backbones are presented in Tables 6 and 7.

We also provide a hyperparameter sensitivity study. Specifically, we study the sensitivity of ALT-local with respect to the amplitude of the high-pass filter of the augmenter GNN (i.e., ϵ from Eq. 5b), selecting GCN (Kipf & Welling, 2017) and GPRGNN (Chien et al., 2021) as backbone GNNs.

Proof. Here we prove the solution of Eq. 10a; the solution of Eq. 10b can be obtained in a similar way. For Eq. 10a, by decomposing the graph signal with the eigenvectors {u_i} of the normalized Laplacian L, we have h_1^{(t)} = Σ_i a_i^{(t)} u_i. As only h and a_i are functions of t, substituting this decomposition into Eq. 10a and using the facts that Lu_i = λ_i u_i and Iu_i = u_i, we have

Σ_i (da_i^{(t)}/dt) u_i = -Σ_i (wλ_i + (1-w)) a_i^{(t)} u_i.

Similarly, after filtering by ALT-global (Eq. 17b), a_i^{(0)} [e^{-(wλ_i+(1-w))t} - e^{-((1-w)λ_i+w)t} + η] denotes the amplitude of the λ_i-frequency signal. We know the signal before filtering (i.e., diffusion) is h^{(0)} = h_1^{(0)} = h_2^{(0)} = h_offset^{(0)} = Σ_i a_i^{(0)} u_i, and the amplitude of the λ_i-frequency signal before filtering is a_i^{(0)}. Hence, the filter response to frequency λ_i is

Φ(λ_i) = a_i^{(0)} [e^{-(wλ_i+(1-w))t} - e^{-((1-w)λ_i+w)t} + η] / a_i^{(0)}   (19a)
       = e^{-(wλ_i+(1-w))t} - e^{-((1-w)λ_i+w)t} + η.   (19b)

It is clear that when w > 0, Φ(λ_i) is a monotonically decreasing function, and when w < 0, Φ(λ_i) is monotonically increasing. With an appropriate η and different w, ALT-global can therefore be instantiated as either a low-pass filter or a high-pass filter.
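The monotonicity claim can be checked numerically. The sketch below (our own; the choices w = ±1, t = 1, η = 1 are illustrative, not from the paper) evaluates the filter response of Eq. 19b over the normalized-Laplacian spectrum λ ∈ [0, 2]:

```python
import math

def phi(lam, w, t=1.0, eta=1.0):
    """Filter response of ALT-global at frequency lam (Eq. 19b)."""
    return math.exp(-(w * lam + (1 - w)) * t) - math.exp(-((1 - w) * lam + w) * t) + eta

lams = [i * 0.1 for i in range(21)]  # normalized-Laplacian spectrum lies in [0, 2]

# w > 0 (here w = 1): response decreases with frequency -> low-pass behavior
low = [phi(l, w=1.0) for l in lams]
assert all(a > b for a, b in zip(low, low[1:]))

# w < 0 (here w = -1): response increases with frequency -> high-pass behavior
high = [phi(l, w=-1.0) for l in lams]
assert all(a < b for a, b in zip(high, high[1:]))
```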






Figure 1: Illustration of obtaining a filter with complementary filter characteristics. Given a filter (a), its reflected frequency response (b) with an offset (c) has complementary filter characteristics.

Figure 2: The proposed ALT-local.

The sensitivity experiments are conducted over the Cora (Yang et al., 2016), Citeseer (Yang et al., 2016), Chameleon (Rozemberczki et al., 2021), and Squirrel (Rozemberczki et al., 2021) datasets. Results are presented in Figure 3, from which we observe that the model performance is stable with respect to the selection of ϵ over all four datasets and both choices of the backbone GNN (i.e., GCN and GPRGNN).

Figure 3: Hyperparameter sensitivity of ALT with backbone GNN as (a) GCN and (b) GPRGNN.

The filtered graph signal can be presented as (here we use h_offset = h^{(0)})

h^{(t)} = h_1^{(t)} - h_2^{(t)} + η h_offset = Σ_i a_i^{(0)} [e^{-(wλ_i+(1-w))t} - e^{-((1-w)λ_i+w)t} + η] u_i.   (17b)

According to graph signal processing (Shuman et al., 2013), u_i denotes the graph signal with frequency λ_i.

Performance comparison (mean±std accuracy) on heterophilic graphs. The last column indicates the average performance boosting for a specific backbone GNN over all the datasets.

Performance comparison (mean±std accuracy (%)) on homophilic graphs. The last column indicates the average performance boosting for a specific backbone GNN over all the datasets.

Results of ablation study (Backbone GNN: GCN).

Dataset statistics of heterophilic graphs.

Dataset statistics of homophilic graphs.

The results in Tables 6 and 7 show that our best model ALT-local obtains consistent advantages; they are also consistent with the results presented in the main text, where the backbone GNN is GCN.

Results of ablation study (Backbone GNN: SGC).

Results of ablation study (Backbone GNN: APPNP).

As all the eigenvectors are orthogonal to each other, by multiplying both sides of the above equation with u_i^⊤ we have

da_i^{(t)}/dt = -(wλ_i + (1-w)) a_i^{(t)},

whose solution is a_i^{(t)} = a_i^{(0)} e^{-(wλ_i+(1-w))t}. Hence h_1^{(t)} = Σ_i a_i^{(0)} e^{-(wλ_i+(1-w))t} u_i, which completes the proof for the Eq. 10a case.
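The closed-form solution can be sanity-checked numerically. The following self-contained toy example (our own sketch, not from the paper's code; the graph, w, and h^{(0)} are illustrative) compares the spectral solution h_1^{(t)} = Σ_i a_i^{(0)} e^{-(wλ_i+(1-w))t} u_i against forward-Euler integration of dh_1/dt = -(wL + (1-w)I) h_1 on a two-node graph:

```python
import math

# Two-node graph with a single edge: normalized Laplacian L = [[1, -1], [-1, 1]],
# eigenpairs (lam=0, u=(1,1)/sqrt(2)) and (lam=2, u=(1,-1)/sqrt(2)).
w, t_end, n_steps = 0.7, 1.0, 100000
h0 = [1.0, 3.0]

s = math.sqrt(2.0)
u = [(1 / s, 1 / s), (1 / s, -1 / s)]
lams = [0.0, 2.0]
a0 = [sum(ui * hi for ui, hi in zip(u[i], h0)) for i in range(2)]  # a_i^(0) = u_i^T h(0)

# Closed form: h(t) = sum_i a_i^(0) * exp(-(w*lam_i + (1-w))*t) * u_i
closed = [sum(a0[i] * math.exp(-(w * lams[i] + (1 - w)) * t_end) * u[i][k]
              for i in range(2)) for k in range(2)]

# Forward-Euler integration of dh/dt = -(w L + (1-w) I) h
h = list(h0)
dt = t_end / n_steps
for _ in range(n_steps):
    Lh = [h[0] - h[1], h[1] - h[0]]  # L h for this two-node graph
    h = [h[k] - dt * (w * Lh[k] + (1 - w) * h[k]) for k in range(2)]

# The two trajectories agree up to discretization error.
assert all(abs(closed[k] - h[k]) < 1e-3 for k in range(2))
```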


In other words, it is common that U[k, i] ̸= 0. Therefore, it is easy to find a pair of nodes i and j such that U[l, i]U[k, j] ̸= 0, which gives a non-zero off-diagonal entry. Therefore, we have proved that if the edge weights are re-weighted non-uniformly, the off-diagonal entries of Φ can be non-zero, i.e., the GNN can act as a local filter.

E ANALYSIS OF ALT-GLOBAL FROM THE GRAPH DIFFUSION EQUATION (GDE) PERSPECTIVE

As claimed in Lemma 1, our proposed ALT-global can be an adaptive filter even if the given backbone GNNs only have fixed filters. Here, we prove this from the Graph Diffusion Equation (GDE) (Newman, 2018) perspective. Our proof focuses on the case where the diffusion matrix is the normalized adjacency matrix Ã = D^{-1/2} A D^{-1/2}, whose convolution kernel is fixed; other cases can be proved in similar ways.

Given graph signals H, the diffusion process can be presented as H^{(t+1)} = Ã H^{(t)}. Thus, we have H^{(t+1)} - H^{(t)} = -(I - Ã) H^{(t)}. In the GNN case, t > 0 denotes the GNN depth, and in the GDE context it denotes the diffusion time. Thus, if we set the time interval as ∆t and let it tend to zero, the graph diffusion dynamics can be presented as

dH^{(t)}/dt = -L H^{(t)},

where L = I - D^{-1/2} A D^{-1/2} is the normalized Laplacian matrix. As ALT-global re-weights all the edges into w Ã and (1-w) Ã, we have

dH_1^{(t)}/dt = -(wL + (1-w)I) H_1^{(t)},
dH_2^{(t)}/dt = -((1-w)L + wI) H_2^{(t)}.

Recap that ALT-global obtains its prediction matrix by combining the signals from the dual backbone GNNs and an offset MLP as Z = softmax(H_1 - H_2 + η H_offset). We keep the assumption that the dual backbone GNNs are both GA-MLPs (Chen et al., 2021) which share parameters with our offset MLP. Thus, we have H^{(t)} = H_1^{(t)} - H_2^{(t)} + η H_offset.

As we are analyzing the diffusion dynamics, there is no interaction between any two columns of the feature matrix. Hence, for brevity, we only show the analysis of a single feature (column) h. The dual GNNs' GDEs can be presented as follows:

dh_1^{(t)}/dt = -(wL + (1-w)I) h_1^{(t)},   (10a)
dh_2^{(t)}/dt = -((1-w)L + wI) h_2^{(t)}.   (10b)

Proposition 4. The solutions of Eq. 10a and Eq. 10b can be presented as h_1^{(t)} = Σ_i a_i^{(0)} e^{-(wλ_i+(1-w))t} u_i and h_2^{(t)} = Σ_i a_i^{(0)} e^{-((1-w)λ_i+w)t} u_i.
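To illustrate Proposition 4 together with the resulting filter response, the toy check below (our own sketch; the 3-node path graph, w, t, η, and h^{(0)} are illustrative) builds the dual diffusion outputs from the closed-form solutions, combines them as h_1 - h_2 + η·h_offset, and verifies that each frequency component of the output is scaled by Φ(λ_i) from Eq. 19b:

```python
import math

# Path graph 1-2-3: normalized-Laplacian eigenpairs, computed analytically
# (eigenvalues 0, 1, 2 with orthonormal eigenvectors u_0, u_1, u_2).
lams = [0.0, 1.0, 2.0]
u = [(0.5, math.sqrt(2) / 2, 0.5),
     (1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)),
     (0.5, -math.sqrt(2) / 2, 0.5)]

w, t, eta = 0.8, 1.0, 0.5
h0 = [1.0, -2.0, 0.5]
a0 = [sum(ui * hi for ui, hi in zip(u[i], h0)) for i in range(3)]  # a_i^(0)

# Dual diffusion outputs from Proposition 4, plus the offset branch h_offset = h0.
h1 = [sum(a0[i] * math.exp(-(w * lams[i] + (1 - w)) * t) * u[i][k]
          for i in range(3)) for k in range(3)]
h2 = [sum(a0[i] * math.exp(-((1 - w) * lams[i] + w) * t) * u[i][k]
          for i in range(3)) for k in range(3)]
out = [h1[k] - h2[k] + eta * h0[k] for k in range(3)]

# Each frequency component of the output equals Phi(lam_i) times a_i^(0) (Eq. 19b).
for i in range(3):
    a_out = sum(ui * oi for ui, oi in zip(u[i], out))  # projection of output on u_i
    phi_i = (math.exp(-(w * lams[i] + (1 - w)) * t)
             - math.exp(-((1 - w) * lams[i] + w) * t) + eta)
    assert abs(a_out - phi_i * a0[i]) < 1e-9
```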

