GRAPH CONTRASTIVE LEARNING UNDER HETEROPHILY: UTILIZING GRAPH FILTERS TO GENERATE GRAPH VIEWS Anonymous

Abstract

Graph Neural Networks have achieved tremendous success in (semi-)supervised tasks for which task-specific node labels are available. However, obtaining labels is expensive in many domains, specially as the graphs grow larger in size. Hence, there has been a growing interest in the application of self-supervised techniques, in particular contrastive learning (CL), to graph data. In general, CL methods work by maximizing the agreement between encoded augmentations of the same example, and minimizing agreement between encoded augmentations of different examples. However, we show that existing graph CL methods perform very poorly on graphs with heterophily, in which connected nodes tend to belong to different classes. First, we show that this is attributed to the ineffectiveness of existing graph augmentation methods. Then, we leverage graph filters to directly generate augmented graph views for graph CL under heterophily. In particular, instead of explicitly augmenting the graph topology and encoding the augmentations, we use a high-pass filter in the encoder to generate node representations only based on high-frequency graph signals. Then, we contrast the high-pass filtered representations with their low-pass counterparts produced by the same encoder, to generate representations. Our experimental results confirm that our proposed method, HLCL, outperforms state-of-the-art CL methods on benchmark graphs with heterophily, by up to 10%.

1. INTRODUCTION

Graph neural networks (GNNs) are powerful tools for learning graph-structured data in various domains, including social networks, biological compound structures, and citation networks (Kipf & Welling, 2016; Hamilton et al., 2017; Veličković et al., 2017) . In general, GNNs leverage the graph's adjacency matrix to update the node representations by aggregating information from their neighbors. This can be seen as a lowpass filter that smooths the graph signals and produces similar node representations (Nt & Maehara, 2019) . GNNs have achieved great success in supervised and semi-supervised learning, where task-specific labels are available. However, obtaining high-quality labels is very expensive in many domains, specially as graphs grow larger in size. This has motivated a recent body of work on self-supervised learning on graphs that learn the representations in an unsupervised manner (Velickovic et al., 2019; Peng et al., 2020; Qiu et al., 2020; Hassani & Khasahmadi, 2020; Zhu et al., 2020b) . Among self-supervised methods, Contrastive Learning (CL) has shown a great success by achieving comparable performance with its supervised counterparts (Chen et al., 2020) . Contrastive learning obtains representations by maximizing the mutual information between different augmented views of the same example, and minimizing agreement between differently augmented views of different examples. Despite being successful on graphs with homophily, where neighboring nodes tend to share the same label, existing graph CL methods cannot learn high-quality representations for graphs with heterophily, where connected nodes often belong to different classes Zhu et al. (2020b) . State-of-the-art graph CL methods work by contrasting the encoded node representations in two explicitly augmented graph views, generated by altering the graph topology or node features (Zhu et al., 2020c; 2021b;  Figure 1 : Our proposed HLCL framework leverages a high-pass graph filter in the encoder to generate a non-smooth graph view, and contrast it with its smooth counterparts generated by the same encoder. Velickovic et al., 2019; Thakoor et al., 2021) . Specifically, topology augmentation methods alter the graph structure by removing or adding nodes or edges. Feature augmentation methods alter the node features by masking particular columns, dropping features at random, or randomly shuffling the node features. The choice of the augmentation strategy has been shown to have a great influence on the performance of graph CL methods (Zhang & Ma, 2022; Zhu et al., 2021b) . In particular, topology augmentations such as edge dropping that produce sparser graphs are the most effective for graphs with homophily (Zhu et al., 2021b) . In this work, we first study the effect of existing graph augmentation methods on the performance of graph CL under heterophily. We show that topology and feature augmentation methods that can effectively boost the performance of graph CL under homophily do not provide any considerable advantage under heterophily. In particular, we experimentally confirm that for graphs with homophily, topology augmentation effectively decreases the homophily ratio, and feature augmentation changes the variance of features in a neighborhood. Hence, explicit graph augmentation techniques enable the low-pass GNN encoder to better distinguish the smooth node representations-that are similar in a neighborhood-by adding more diversity to the graph. However, such methods cannot effectively diversify and alter the homophily ratio or variance of features in graphs with heterophily. Indeed, for such graphs, incorporating the graph structure is crucial for generating high-quality representations that can distinguish dissimilar nodes from their neighborhood. Next, we address the above shortcoming by proposing an effective augmentation strategy for CL on graphs with heterophily, while ensuring high-quality representation learning on graphs with homophily. The key idea of our proposed method, HLCL, is to leverage a high-pass graph filter in a typical GNN encoder to directly generate non-smooth node representations, and contrast them with their smooth counterparts generated by the same encoder. More specifically, we use the normalized Lapcalian matrix as the high-pass filter in the GNN encoder to aggregate the node representations. The Laplacian matrix magnifies the differences in the node features in a neighborhood and makes the representations distinct. This is crucial for learning high-quality representation under heterophily. In contrast, a typical GNN encoder uses the normalized adjacency matrix as a low-pass filter to aggregate the node features with those of their neighbors, and enforce similar representations for the nodes in the same neighborhood. Maximizing the mutual information between the high-pass and low-pass filtered representations enables the encoder to learn rich representations on graphs with heterophily. Importantly, our method is complementary to explicit graph augmentation techniques, and such methods can be applied to further boost the performance, particularly on graphs with homophily. This allows our framework to achieve state-of-the-art under heterophily and comparable performance under homophily as we confirm by our experiments. We show the effectiveness of our HLCL framework through extensive experiments on graphs with heterophily and homophily for unsupervised representations learning under the linear evaluation protocol. Our results demonstrate that, on five popular public benchmark datasets, our framework can outperform existing graph CL methods by up to 10% for representation learning on graphs with heterophily, while ensuring a comparable performance to state-of-the-art graph CL methods on graphs with homophily.

2. RELATED WORK

(Semi-)supervised learning on graphs. In recent years, GNNs have become one of the most prominent tools for processing graph-structured data. In general, GNNs utilize the adjacency matrix to learn the node representations, by aggregating information within every node's neighborhood (Defferrard et al., 2016; Kipf & Welling, 2016) . Existing variants, including GraphSAGE (Hamilton et al., 2017) , Graph Attention (GAT) (Veličković et al., 2017 ), MixHop (Abu-El-Haija et al., 2019) , SGC (Nt & Maehara, 2019) , GAT (Velickovic et al., 2019) , and GIN (Xu et al., 2018) , learn a more general class of neighborhood mixing relationships, by aggregating weighted information within a multi-hop neighborhood of every node. GNNs can be generally seen as applying a fix, or a parametric and learnable (e.g. GAT) low-pass graph filter to graph signals. Those with trainable parameters can adapt to a wider range of frequency levels on different graphs. However, they still have a higher emphasis on lower-frequency signals and discard the high-frequency signals in a graph. While the aggregation operation makes GNNs powerful tools for semi-supervised learning, it often leads to over-smoothing issue, i.e., the learned node representations become indistinguishable in a neighborhood (Nt & Maehara, 2019) . As a result, typical GNNs and their variants have been long criticized on their inability to capture graph heterophily and poor generalization performance on heterogeneous dataset (Balcilar et al., 2020) , where most connected nodes are from different classes. (Semi-)supervised learning under heterophily. To address over-smoothing issue of GNNs, recent methods propose to use other types of aggregation that better fit graphs with heterophily. Geom-GCN uses geometric aggregation in place of the typical aggregation (Pei et al., 2020) , H 2 GCN uses several special model designs including separate aggregation and higher-neighborhood aggregation to train the model for handling graphs with heterophily, and CPGNN trains a compatibility matrix to model the heterophily level (Zhu et al., 2020a) . More recently, Wang et al. (2019) proposed to learn an aggregation filter for every graph from a set of based filters designed based on different ways of normalizing the adjacency matrix. Most recently, GGCN introduced degree corrections and signed message passing on GCN to address both oversmoothing problems and the model's poor performances on heterophily graphs (Yan et al., 2021) . Zhu et al. (2021a) analyzed and designed a uniform framework for GNNs propagations and proposed GNN-LF and GNN-HF that preserve information of different frequency separately by using different filtering kernels with learnable weights. FAGCN (Bo et al., 2021) and FBGNN (Luan et al., 2020) also leveraged the Laplacian matrix to capture the non-smooth signals discarded in typical GNNs. Two separate encoders are trained to capture the high-pass and low-pass graph signals separately, and the two outputs are combined with learnable parameters to balance their importance. Such methods achieve a superior performance under heterophily. However, learning how to combine the encoder outputs is highly sensitive to having high-quality labels. This makes such methods highly impractical for unsupervised contrastive learning. Contrastive learning on graphs. Self-supervised contrastive learning methods learn representations of data points by maximizing the mutual information between different views of the same data point, and minimizing agreement between differently augmented views of different examples (Bachman et al., 2019; Ye et al., 2019; Wu et al., 2018; Chen et al., 2020; Sohn et al., 2020) . For graphs, global graph-level and local node-level data are augmented and contrasted in different ways. DGI (Velickovic et al., 2019) and GMI (Peng et al., 2020) contrast graph and node representations within one augmented view of the original graph. More recent methods contrast global and local representations in two augmented views. GRAPHCL generates graph augmentations by subgraph sampling, node dropping, and edge perturbation and contrasts the augmented graph representations. GCC samples and contrasts subgraphs of the original graph (Qiu et al., 2020) . MVGRL leverages node diffusion to augment the graph and contrasts the node representations (Hassani & Khasahmadi, 2020) . More recently, contrasting the local node representations has been shown to achieve state-of-the-art. GRACE contrasts the node representations in two graph views augmented with feature masking and edge removal (Zhu et al., 2020c) . GCA extends this by dropping the less important edges and features, based on node centrality and feature importance metrics (Zhu et al., 2021c) . A thorough empirical study on the combinatorial effect of different augmentations has been conducted by Zhu et al. (2021b) . Due to the complexity of collecting negative samples in graph data, negative-samples-free contrastive objectives have been also studied. Among existing methods, BGRL that uses the Bootstrapping Latent loss (Thakoor et al., 2021) , and GBT uses Barlow Twins loss (Bielak et al., 2021) are the most successful. Existing graph CL methods explicitly augment the input graph and contrast the augmented graph representations obtained with low-pass GNN-based encoders. Hence, they perform poorly on graphs with heterophily. In contrast, we leverage low-pass and high-pass graph filters to directly obtain different graph views, without relying on explicit graph augmentation. This allows achieving state-of-the-art under heterophily.

3. CONTRASTIVE LEARNING VIA GRAPH FILTERS

Here, we introduce our proposed contrastive learning pipeline, HLCL. The key idea of our method is to generate a non-smooth augmented representation and contrast it with the smooth representation generated using the same encoder. A typical GNN filters smooth graph frequencies by aggregating the node representations with those of their neighbors. Hence, it results in similar representations for the nodes in a neighborhood. In contrast, the high-pass filter only preserves the high-pass frequencies. In doing so, it magnifies the dissimilarities between the nodes and diversifies the representations of nodes in a neighborhood. Contrasting the two filtered representations results in learning rich representations, particularly under heterophily.

3.1. PRELIMINARIES

We denote by G = (V, E) an undirected graph, where V = {v 1 , v 2 , . . . , v N } represents the node set, and E ⊆ V × V represents the edge set. We denote by A A A ∈ {0, 1} N ×N the symmetric adjacency matrix of graph G. That is, A A A ij = 1 if and only if (v i , v j ) ∈ E, and A A A ij = 0 otherwise. We also denote the feature matrix by X X X, where X X X i. ∈ R m is the feature vector of the i th node, and x x x ∈ R N is a column of the matrix and represents a graph signal. D D D is the degree matrix of the graph, with D D D ii = j A A A ij , and N i = {j : A A A ij = 1} is the neighborhood of node i. L L L is the Laplacian matrix of the graph, defined as L L L = D D D-A A A. The normalized Laplacian matrix is denoted by L L L sym = D D D -1 2 LD LD LD -1 2 , and the normalized adjacency matrix is defined in a similar fashion: A A A sym = D D D -1 2 AD AD AD -1 2 . Here, we use the renormalized version of the adjacency matrix  Â Â Âsym = D D D-1 2 Ã Ã Ã D D D-1 2 as introduced in (Kipf & λ i ∈ [0, 2) (Chung, 1997). For Â Â Âsym we have λ i ( Â Â Âsym ) ∈ (-1, 1] High-pass and Low-pass graph filters. Multiplication of Laplacian with a graph signal L L Lsym x x x = i λ i u u u i u u u T i x x x, acts as a filtering operation over x x x, adjusting the scale of the components of x x x in the frequency domain. The entries of every eigenvector, u u u i aligns with a cluster of connected nodes in the graph. For the Laplacian matrix, a smaller eigenvalue λ i corresponds to a smoother eigenvector u u u i , and corresponds to a larger cluster of connected nodes. On the other hand, a larger λ i corresponds to non-smooth eigenvectors u u u i , which identify smaller clusters of closely connected nodes in the graph. A Laplacian filter magnifies the part of the signal that aligns well with basis functions corresponding to large eigenvalues λ i ∈ (1, 2) and suppresses the part the signal that aligns with basis functions corresponding to small eigenvalues λ i ∈ [0, 1]. That means, for the small clusters of nodes that have a large alignment with u u u i corresponding to λ i > 1, the projection λ i u u u i u u u T i x x xamplifies x x x within the cluster and consequently magnifies the difference in x x x among the nodes within that cluster. On the other hand, for the larger clusters that align well with u u u i corresponding to λ i < 1, the projection λ i u u u i u u u T i x x x suppresses x x x within the cluster and reduces the differences in x x x among the nodes within that cluster. Hence the Laplacian matrices can be generally regarded as high-pass filters (Ekambaram, 2014) , that enlarge the differences in node features over small clusters, and smooths out the differences over larger clusters in the graph. On the other hand, affinity matrices, such as the normalized adjacency matrix, can be treated as low-pass filters (Nt & Maehara, 2019) , which suppress and filter out non-smooth components of the signals. This is because all of the eigenvalues of the affinity matrices are smaller than 1, i.e., λ i ∈ (-1, 1]. On the node level, left multiplying L L Lsym and Â Â Âsym filters with x x x can be understood as diversification and aggregation operations, respectively (Luan et al., 2020) : ( L L Lsym x x x) i = j∈Ni 1 D D D ii (x x x i -x x x j ), ( Â Â Âsym x x x) i = j∈Ni 1 D D D ii x x x j . High-pass filters highlight the differences between the nodes and their neighbors, making them distinguishable. On the other hand, low-pass filters aggregate the features of the nodes in a neighborhood and make them similar. Since L L Lsym + Â Â Âsym = I I I, both filters capture complementary information and their combination allow learning richer representations. Homophily Ratio. To quantify how likely nodes with similar labels are connected in the graph, homophily ratio, β, is proposed (Pei et al., 2020) , as follows: β = 1 |V | v∈V

Number of v's neighbors who have the same label as v Number of v's neighbors

For larger values of β, it is more likely that nodes with the same labels are connected together. As the nodes belonging to different classes often have different features, graphs with higher β are generally smoother.

3.2. GENERATING GRAPH VIEWS VIA GRAPH FILTERS

As discussed, the key idea of our HLCL framework is to leverages a high-pass filter to generate a diverse set of node representations, and contrast them with smooth and similar node representations generated by a low-pass filter. To do so, we leverage the normalized Laplacian matrix L L Lsym and adjacency matrix Â Â Âsym as the diversification and aggregation operations in Eq. ( 1). We note that other types of high-pass and low-pass filters can be used in a similar way in our framework. More specifically, we input the graph X X X into a graph encoder, and apply a high-pass filter F F F HP = L L Lsym and a low-pass filter F F F LP = Â Â Âsym to the encoder to generate high-pass node representations H H H H and low-pass node representations H H H L as follows: H H H l H = σ(F F F HP H H H l-1 H W W W l-1 ), H H H l L = σ(F F F LP H H H l-1 L W W W l-1 ), H H H 0 L = H H H 0 H = X X X. where H H H l L , and H H H l H are the low-pass filtered and high-pass filtered representations at layer l of the encoder, W W W l ∈ R d l ×d l-1 is the weight matrix in layer l of the encoder, and σ is the activation function. The high-pass filter F F F H filters out the low-frequency signals and preserves the high-frequency signals. In doing so, it captures the difference between features of every node with the features of the nodes in its neighborhood. Using a high-pass encoder within a multi-layer encoder iteratively captures the difference between features of the nodes in a multi-hop neighborhood of a node. Hence, it makes the representations of nodes that have different features from their neighbors distinct in their multi-hop neighborhood. On the other hand, the low-pass filter, F F F L , resembles a typical GNN, which only preserves the low-frequency signals by aggregating every node's features with the features of nodes in its immediate neighborhood.

Algorithm 1 Contrastive Learning with Graph Filters (HLCL)

1: for epoch=1,2,3,... do

2:

Input graph G into the shared graph Encoder f (•) 3: Generate high-frequency-signal node embeddings: H H H l H =σ(F F F HP H H H l-1 H W W W l-1 ) 4: Generate low-frequency-signal node embeddings: H H H l L = σ(F F F LP H H H l-1 L W W W l-1 ) 5: Compute the contrastive objective L with Eq. ( 5) 6: Update parameters by applying stochastic gradient ascent to maximize L 7: end for Using the low-pass filter within a multi-layer graph encoder, it iteratively aggregates features in a multi-hop neighborhood of every node to learn its representation. Hence, it smooths out the node representations and produces similar representations for the nodes within the same multi-hop neighborhood. While the low-pass filter is essential for learning good representations in smooth graphs, it cannot produce distinguished representations for graphs that are non-smooth, specially under heterophily. For such graphs, the high-pass filter is crucial to provide distinct representations. The combinations of both filters provides complementary information and allows learning both smooth and non-smooth components of the graphs simultaneously. Hence, it enables learning richer representations, particularly under heterophily. The high-pass and low-pass filtered representations can be obtained through message passing, according to Eq. ( 1). In particular, the high-pass filtered representations can be obtained by iteratively differentiating the representations of a node and those of its neighbors, and the low-pass filtered representations can be obtained by aggregating the node's representation with those of its neighbors. Formally: ĥ ĥ ĥl i = σ(W W W l-1 h h h l-1 i ), (h h h l i ) L = Σ j∈{Ni∪{i}} ( ĥ ĥ ĥl i + ĥ ĥ ĥl j ), (h h h l i ) H = Σ j∈{Ni∪{i}} ( ĥ ĥ ĥl i -ĥ ĥ ĥl j ). (3) Explicit graph augmentation. We note that the graph X X X can be optionally explicitly augmented before entering the encoder. While explicit augmentation do not provide a considerable benefit and can even be harmful under heterophily, it is crucial for graphs with homophily. We will study in details the effect of explicit augmentation under homophily and heterophily, in our experiments. Our HLCL framework is demonstrated in Fig. 1 . In Appendix, we study the effect of filters on a toy example.

3.3. THE CONTRASTIVE LEARNING FRAMEWORK

After generating the low-pass and high-pass filtered graph views, we pass them through a shared non-linear projection head and employ a contrastive objective that enforces the filtered representations of each node in the two views to agree with each other and can be discriminated from representations of other nodes. Non-linear projection head. Before contrasting the high-pass and low-pass filtered representations, we pass them to a non-linear projection head g(•), which maps the filtered views to another latent space where the contrastive loss is calculated. This has been shown to improve the performance (Chen et al., 2020; You et al., 2020; Zhu et al., 2020c; a) , and we also confirm its effect in our experiment. In our setting, we use a two-layer perceptron network to obtain z z z h = g(H H H H ) and z z z l = g(H H H L ). We use the projection head only during the contrastive training, and use the low-pass filtered representations H H H L in the downstream task. Contrastive loss. For every node i, its projected representation generated by the low-pass filter, z z z i l , is treated as the anchor, and the one generated by the high-pass filter, z z z i l , is treated as the corresponding positive sample. The other projected representations in the two views are treated as the negative samples. For the contrastive loss function, we use InfoNCE proposed in (Van den Oord et al., 2018) . Formally: l(z z z i l , z z z i h ) = log e sim(z z z i l ,z z z i h )/τ e sim(z z z i l ,z z z i h )/τ + k∈[N ], k̸ =i e sim(z z z i l ,z z z k h )/τ + k∈[N ], k̸ =i e sim(z z z i l ,z z z k l )/τ , in which sim(z z z i l , z z z i h ) is the cosine similarity between z z z i l and z z z i h , and τ is a temperature parameter. The second term in the denominator, k̸ =i e sim(z z z i l ,z z z k h ) , represents the inter-view negative pairs, i.e., between the low-pass view z z z i l of node i and the high-pass views of all the other nodes, and k̸ =i e sim(z z z i h ,z z z k h ) represents the intra-view negative pairs within the low-pass view. Since two views are symmetric, the loss for another view, l(z z z i h , z z z i l ), is defined in a similar fashion. The overall objective to be maximized is then defined as the average over all positive pairs. Formally, we maximize: L = 1 2N N i=1 [l(z z z i l , z z z i h ) + l(z z z i h , z z z l h )]. Effectively, maximizing the agreement between the low-pass and high-pass views pulls away the representation of nodes with different features from their neighborhood, and allows them to be distinguished from their neighbors. The pseudocode is illustrated in Alg. 1.

4. EXPERIMENTS

In this section, we first study and show the ineffectiveness of existing graph augmentation methods under heterophily. Then, we evaluate our proposed framework for node representation learning under linear probe. Datasets. We conduct experiments on five widely-used public benchmark datasets with different levels of homophily ratios, β. For graphs with homophily, we use the citation networks including Cora and Citeseer (Yang et al., 2016) . For graphs with heterophily, we use the Wikipideia network and the web page networks including Chameleon, Squirrel, and Texas (Rozemberczki et al., 2021; Pei et al., 2020) . Table 4 shows β, and the number of nodes, edges, and classes for the above datasets. We also report our results on a large-scale social network, namely Penn94 (Lim et al., 2021) . Details can be found in Tables 4 and 5 of the Appendix. HLCL setup. We employ a two-layer GCN (Kipf & Welling, 2016) as our low-pass encoder, and consider an additional high-pass channel with L L Lsym as the message passing filter to capture the high-frequency signals from the graphs. The high-pass channel generates augmented node representations to be contrasted with those encoded by the GCN. The high-pass channel shares the same weight parameters with the GCN, but generate embedding with different filters. The final embedding is generated by the GCN encoder.

4.1. GRAPH AUGMENTATION UNDER HETEROPHILY

First, we study existing data augmentation methods and their effectiveness for graph CL under heterophily. We use InfoNCE loss defined in Eq. ( 5) to contrast node representations in augmented graphs, using a standard 2-layer GCN as the graph encoder. We consider topology augmentation methods including Edge Removing (ER), Node Dropping (ND), Edge Adding (EA), Subgraph with Random Walk (RWS), and Personalized PageRank (PPR), as well as feature augmentation method including Feature Masking (FM) and Feature Dropping (FD) for generating different graph augmentations. We compare these methods with HLCL. Except for the proposed high-pass augmentation methods, we do not use any other explicit graph augmentation. We train the model 10 times with early-stopping, and report the average accuracy as the final result. For each run, we randomly select 10% nodes for training, 10% nodes for validation, 80% nodes for testing. Table 1 shows that our method achieves a remarkable boost of up to 10% compared to the best graph augmentation method. This confirms the necessity of utilizing the non-smooth graph components under heterophily. Note that even though topology and feature augmentation methods benefit the homophily graphs, they are highly ineffective for graph CL under heterophily. This can be explained from the perspective of Homophily ratio (β) and graph smoothness, which is represented here as neighborhood feature variance (α).As shown in figure 2 , augmentation methods that remove or add edges (ER, EA) decrease β of the graphs under homophily. On Cora with homophily, β drops rapidly given its initial high value, so it can generate a more diverse graph structures for aggregations among nodes that are more beneficial for the contrastive framework. However, for Chameleon with heterophily, β does not considerably change, diminishing the benefits of explicit augmentation. Similarly, for Feature augmentation methods that zero out feature entries, α decreases for both heterophily and homophily graphs, smoothing out the graphs. For homophily graphs that have smooth node neighborhoods, the change is not as rapid, and the masked features serve as dropout on the input layer, which improves the performance accordingly Zhu et al. (2021b) . On the other hand, feature masking decreases α greatly on Chameleon with heterophily.It smoothes out its non-smooth graph features in a deteriorated manner, hence outweights its benefits as a dropout layer.

4.2. SELF-SUPERVISED REPRESENTATION LEARNING

Next, we evaluate HLCL for self-supervised learning under linear probe We follow the evaluation protocol used in (Zhu et al., 2020c) . Models are first trained in a self-supervised manner without labels. Then, we fed the final node embeddings into a l 2 -regularized logistic regression classifier to fit the labeled data. We train the model 10 times with early-stopping, and report the average accuracy as the final result. For each run, we randomly select 10% nodes for training, 10% nodes for validation, and 80% nodes for testing. We consider traditional representation learning methods like DeepWalk (Perozzi et al., 2014) , deep learning methods including DGI (Velickovic et al., 2019) , BGRL (Thakoor et al., 2021) , and GRACE (Zhu et al., 2020c) , and other popular representation learning baselines like Raw Features and DeepWalk + Raw Features as baselines. We also include MixHop (Abu-El-Haija et al., 2019) in our experiments as the supervised learning baseline. Table 2 compares the performance of HLCL when used on its own, HLCL (no-aug), and when it is applied to the graphs that are augmented with Edge Removing (ER) on the graphs with homophily and Feature Masking (FM) on the graphs with heterophily. We see that HLCL shows a significant boost on graphs with heterophily and a comparable performance on the graphs with homophily. Even without explicit augmentation, HLCL surpasses other baselines on graphs with heterophily, which confirms the effectiveness of the high-pass channel. Importantly, applying HLCL to graphs that are explicitly augmented achieves a comparable performance under homophily. Thus, the high-pass filter does not harm the performance on such graphs. Comparing to supervised learning model such as MixHop trained in an end-to-end manner, HLCL achieves a comparable or superior performances. This, demonstrates the effectiveness of our design. To investigate the interactions between traditional graph augmentations and HLCL on graphs with different β, we perform ablation studies on explicit graph augmentation methods applied to both high-pass and the typical low-pass GCN channels. We include Edge Removing (ER), Node Dropping (ND), Edge Adding (EA), Feature Masking (FM), and Feature Dropout (FD) as graph augmentation methods. Table 3 shows that combined with HLCL, feature and topology augmentations considerably benefit the performance on graphs with homophily. Topology augmentation, however, may not be helpful under heterophily.

A APPENDIX

A.1 BENCHMARK DATASETS We conduct our experiments on five widely-used public benchmark datasets with different level of homophily ratios, β. We shows the specific statistics of the datasets including β, the number of nodes, edges, and classes, in table 4 In addition, to illustrate the scalability of our model, we run our models on realistic dataset like Penn94, proposed by Lim et al. (2021) . We train the model 3 times with early-stopping, and report the average accuracy as the final result. For each run, we randomly select 10% nodes for training, 10% nodes for validation, and 80% nodes for testing. We present our result in table 5. As shown, our method works better in large-scale dataset than other self-supervised learning methods as well. 

A.2 ABLATION STUDIES: HLCL WITH EXPLICIT AUGMENTATION

To investigate the interactions between traditional graph augmentations and HLCL on graphs with different β, we perform ablation studies on explicit graph augmentation methods applied to both high-pass and the typical low-pass GCN channels. We include Edge Removing (ER), Node Dropping (ND), Edge Adding (EA), Feature Masking (FM), and Feature Dropout (FD) as graph augmentation methods. We present our result in table 6. It is important to note that the encoder that is trained with contrasting with the high-pass view does not suffer from the over-smoothing problem, even without using the high-pass filter during the inference. This is because the parameters of the encoder are learned by contrasting the low-pass representation with high-pass representations. Hence, we do not need to rely on the high-pass filter during the inference. To illustrate the effectiveness of our structural design, we present an ablation study on using different filters or combinations of filters in our final encoder. The result is shown in table 7 . We see that on most datasets low-pass encoder achieves the highest accuracy with a large gap, and variation involving the high-pass encoder has worse performances across all datasets except Texas, which has a very low homophily ratio. This demonstrates the ineffectiveness of using high-pass filter directly in the representation encoder. In this section, we illustrate the importance of our contrastive structure. We show that the main accuracy gains on heterophily datasets are from contrasting the low-and high-pass filtered graph views, instead of the introduction of the high-pass filter alone. The high-pass filter alone cannot provide high-quality representations under heterophily. For different graphs, different combinations of low-pass and high-pass filters are required to provide good representations. Having access to labels, the best combination can be learned. However, this is indeed not possible without labels. We have shown that contrasting the low-and high-pass filtered graph views can provide high-quality representations in the unsupervised scenario. Here, We conduct a new ablation study by swapping the encoders from other graph CL methods with a high-pass encoder. The result is presented in 8. By replacing low-pass filter with high-pass filter, grph CL methods are able to achieve an better accuracy on some heterophily datasets like Squirrel and Texas. However, their performances deteriorate significantly under homophily cases, where aggregation among neighborhoods are desired. Our results show that when using high-pass or low-pass alone, the model cannot achieve good performance.

A.5 ILLUSTRATION EXAMPLES ON HP AND LP FILTER UNDER DIFFERENT HOMOPHILY RATIOS

To explain the effect of high-pass and low-pass filters and motivates using both filters in a contrastive way, we construct a simple graph with seven nodes. Nodes are separated into two classes, labeled as red and green as shown in the figure. First, we generate features vectors with 200 dimensions for nodes from red and green classes based on two Gaussian distributions, centered at -5 and 5 respectively with a standard deviation of 1. Then, we do graph aggregation or diversification with low-and high-pass filters separately, following Eq. (1). For clearer illustration, we do not project the feature vectors into other dimensions. We considered three different homophily ratios and ran the node features through high-pass and low-pass filters for a few iterations. We see that: (1) for high homophily ratio graph (Fig. 3 ), the low-pass filter reduces the variance of the features and pulls different classes towards each other, while the high-pass filter increases their variance and eventually pulls the classes away from each other. Hence, contrasting the two filtered representations prevents over-smoothing. The figure illustration is shown at Fig. 4 (2) for lower homophily ratio graph, where a part of the graph has homogeneous labels and the other part has heterogeneous labels (Fig. 5 ), the low-pass filter mainly reduces the variance of the features for the label-homogeneous part, while the high-pass filter unnecessarily mixes the class features. At the same time, the high-pass filter increases the variance of features for the label-heterogeneous part and pushes them away from the other class, while the low-pass filter unnecessarily mixes the class features. In this case, contrasting the two filters is crucial for learning good-quality representations. The figure illustration is shown at Fig. 6 (3) Finally, for very low homophily ratio graph (Fig. 7 ), the low-pass filter reduces the variance of the distributions and mixes the distributions of the features in the two classes, making them indistinguishable. On the other hand, the highpass filter pushes away the distributions of the features in different classes. By contrasting the high-pass and low-pass filtered features, HLCL is able to distinguish the feature distributions under heterophily. The figure illustration is shown at Fig. 8 



CONCLUSIONWe proposed HLCL, a contrastive learning framework that leverages a high-pass graph filter to generate nonsmooth augmented representations to be contrasted with their smooth counterparts, generated using the same encoder. This is particularly beneficial for graphs with heterophily, where existing graph CL methods are highly ineffective. Through extensive experiments, we demonstrated that our proposed framework achieves up to 10% boost in the performance for representation learning under the linear probe, for various graphs under heterophily. At the same time, it provides a comparable performance on graphs with homophily.



Welling, 2016), where Ã Ã Ã = A A A + I I I, D D D = D D D + I I I. Similarly, the renormalized Laplacian matrix is defined as L L Lsym = I I I -Â Â Âsym . L L Lsym is a real symmetric matrix, with orthonormal eigenvectors {u u u i } n l=1 ∈ R n , and corresponding eigenvalues

Figure 2: Left: Homophily ratio drops with the increasing edge removing ratio and edge adding ratio with Cora under homophily and stays approximately the same with Chameleon under heterophily. Right: neighborhood variance drops with the increasing feature masking ratio, with both Cora under homophily and Chameleon under heterophily, with the decrease more rapid on Chameleon.

Figure 3: Case 1: High homophily ratio graph

Figure 5: Case 2: A mixture of homophily and heterophily components

Existing graph augmentation methods are highly ineffective for graph CL under heterophily. In contrast, our HLCL method achieves up to 10% improvement over the best graph augmentation method.

HLCL achieves state-of-the-art under heterophily, and a comparable performance under homophily when combined with explicit graph augmentation.

Combining HLCL with different graph view augmentations. LP represents augmentation on lowpass channel only, HL represents augmentation on both channels. Both topology and feature augmentations are included. The full table 6 presented in the Appendix. Worst results are highlighted in gray.

Statistics for our datasets

HLCL achieves a noticeably higher accuracy on realistic large-scale dataset comparing to other GCL methods. HLCL outperforms SOTA by 8%.

The result on combining HLCL with different graph view augmentations. HP represents augmentation on high-pass channel only, LP represents augmentation on low-pass channel only, HL represents augmentation on both channels. Both topology and feature augmentations are included in the experiments.

HLCL with different graph filters in the final encoder. HP represents high-pass filter; LP represents low-pass filter; HP Concat LP represents using both high-pass filters and concatenate the output representations; HP Aggre LP represents using both high-pass filters and aggregate the output representations

Swapping low-pass encoders in GCL methods to high-pass encoders. Significant performance drops are highlighted in gray.

