CAN SINGLE-PASS CONTRASTIVE LEARNING WORK FOR BOTH HOMOPHILIC AND HETEROPHILIC GRAPH?

Abstract

Existing graph contrastive learning (GCL) typically requires two forward passes for a single instance to construct the contrastive loss. Despite its remarkable success, it is unclear whether such a dual-pass design is (theoretically) necessary. Moreover, empirical results have so far been limited to homophilic graph benchmarks. A natural question then arises: can we design a method that works for both homophilic and heterophilic graphs with a performance guarantee? To answer this, we analyze the concentration property of features obtained by neighborhood aggregation on both homophilic and heterophilic graphs, introduce the single-pass graph contrastive learning loss based on this property, and provide performance guarantees for the minimizer of the loss on downstream tasks. As a direct consequence of our analysis, we implement the Single-Pass Graph Contrastive Learning method (SP-GCL). Empirically, on 14 benchmark datasets with varying degrees of heterophily, the features learned by SP-GCL match or outperform existing strong baselines with significantly less computational overhead, which verifies the usefulness of our findings in real-world cases.

Under review as a conference paper at ICLR 2023

1. INTRODUCTION

Graph Neural Networks (GNNs) (Kipf & Welling, 2016a; Xu et al., 2018; Veličković et al., 2017; Hamilton et al., 2017) have demonstrated great power in various graph-related tasks, especially problems centered around node representation learning, such as node classification (Kipf & Welling, 2016a), edge prediction (Kipf & Welling, 2016b), and graph classification (Xu et al., 2018). Prior studies posit that the good performance of GNNs is largely attributable to the homophilic nature of graph data (Pei et al., 2020; Lim et al., 2021b; Zhu et al., 2020b; Abu-El-Haija et al., 2019; Chien et al., 2020; Li et al., 2021; Bo et al., 2021), i.e., linked nodes are likely to be from the same class in homophilic graphs, e.g., social networks and citation networks (McPherson et al., 2001). In contrast, in heterophilic graphs, on which existing GNNs might suffer a performance drop (Pei et al., 2020; Chien et al., 2020; Zhu et al., 2020b), similar nodes are often far apart (e.g., in dating networks the majority of people tend to connect with people of the opposite gender (Zhu et al., 2020b)). As a remedy, researchers have attempted to design new GNNs that generalize well on heterophilic graph data (Pei et al., 2020; Abu-El-Haija et al., 2019; Zhu et al., 2020a; Chien et al., 2020; Li et al., 2021; Bo et al., 2021).

For both homophilic and heterophilic graphs, GNNs, like other modern deep learning approaches, require a sufficient amount of labels for training to achieve decent performance. Meanwhile, Graph Contrastive Learning (GCL) (Xie et al., 2021), an approach for learning better representations without manual annotations, has recently attracted great attention. Existing GCL work can be roughly divided into two categories according to whether a graph augmentation is employed.

First, augmentation-based GCL (You et al., 2020; Peng et al., 2020; Hassani & Khasahmadi, 2020; Zhu et al., 2021a; b; 2020d; c; Thakoor et al., 2021) follows the initial exploration of contrastive learning in the visual domain (Chen et al., 2020; He et al., 2020) and involves pre-specified graph augmentations (Zhu et al., 2021a). Specifically, these methods encourage representations of the same node encoded from two augmented views to contain as little information as possible about the way the inputs are transformed, i.e., to be invariant to a set of manually specified transformations. Second, augmentation-free GCL (Lee et al., 2021; Xia et al., 2022) follows the recent bootstrapped framework (Grill et al., 2020); it constructs different views through two encoders with different updating strategies and pushes together the representations of the same node/class. In both categories, existing GCL methods typically require two graph forward passes, i.e., one forward pass for each augmented graph in augmentation-based GCL, or one for each encoder in augmentation-free GCL. Unfortunately, theoretical analysis and empirical observation (Liu et al., 2022; Wang et al., 2022a) show that previous GCL methods tend to capture low-frequency information, which limits the success of those methods to homophilic graphs. Therefore, in this paper, we ask the following question: Can one design a simple single-pass graph contrastive learning method that is effective on both homophilic and heterophilic graphs?

We provide an affirmative answer to this question both theoretically and empirically. First, we theoretically analyze the neighborhood aggregation mechanism on a homophilic/heterophilic graph and present the concentration property of the obtained features. By exploiting this property, we introduce the single-pass graph contrastive loss and show that its minimizer is equivalent to that of Matrix Factorization (MF) over a transformed graph whose edges are constructed based on the aggregated features. In turn, the conceptually introduced transformed graph helps us illustrate and derive a theoretical guarantee for the performance of the learned representations in the downstream node classification task. To verify our theoretical findings, we introduce a direct implementation of our analysis, Single-Pass Graph Contrastive Learning (SP-GCL). Experimental results show that SP-GCL achieves competitive performance on 8 homophilic graph benchmarks and outperforms state-of-the-art GCL algorithms on all 6 heterophilic graph benchmarks by a nontrivial margin. In addition, we analyze the computational complexity of SP-GCL and empirically demonstrate that it significantly reduces computational overhead. Together with extensive ablation studies, this verifies that the conclusions derived from our theoretical analysis are feasible for real-world cases.

Our contributions can be summarized as follows:
• We show the concentration property of representations obtained by neighborhood feature aggregation, which in turn inspires our novel single-pass graph contrastive learning loss. A direct consequence is a graph contrastive learning method, SP-GCL, that does not rely on graph augmentations.
• We provide a theoretical guarantee for the node embedding obtained by optimizing the graph contrastive learning loss in the downstream node classification task.
• Experimental results show that, without complex designs and compared with SOTA GCL methods, SP-GCL achieves competitive or better performance on 8 homophilic graph benchmarks and 6 heterophilic graph benchmarks, with significantly less computational overhead.

2. RELATED WORK

Graph neural network on heterophilic graph. Recently, heterophily has been recognized as an important issue for graph neural networks, first outlined by Pei et al. (2020). To make graph neural networks generalize well on heterophilic graphs, several efforts have been made from both the spatial and spectral perspectives (Pei et al., …

… works in Table 9 of Appendix D. Other works (Lee et al., 2021; Xia et al., 2022) try to get rid of the manual design of augmentation strategies, following the bootstrapped framework (Grill et al., 2020). They construct different views through two graph encoders updated with different strategies and push together the representations of the same node/class from different views. In both categories, existing GCL methods require two graph forward passes. Specifically, for augmentation-based GCL, two augmented views of the same graph are encoded separately by the same or two graph encoders; for augmentation-free GCL, the same graph is encoded by two different graph encoders. Both are prohibitively expensive for large graphs. Moreover, a theoretical analysis of the performance of GCL on downstream tasks is still lacking. Although several efforts have been made in the visual domain (Arora et al., 2019; Lee et al., 2020; Tosh et al., 2021; HaoChen et al., 2021), the analysis for image classification cannot be trivially extended to the graph setting, since the non-Euclidean graph structure is far more complex.

3. PRELIMINARY

Notation. Let $G = (\mathcal{V}, \mathcal{E})$ denote an undirected graph, where $\mathcal{V} = \{v_i\}_{i \in [N]}$ and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ denote the node set and the edge set, respectively. We denote the number of nodes and edges as $N$ and $E$, and the node labels as $y \in \mathbb{R}^{N}$, where $y_i \in [1, c]$ and $c \ge 2$ is the number of classes. The associated node feature matrix is denoted as $\mathbf{X} \in \mathbb{R}^{N \times F}$, where $\mathbf{x}_i \in \mathbb{R}^{F}$ is the feature of node $v_i \in \mathcal{V}$ and $F$ is the input feature dimension. We denote the adjacency matrix as $\mathbf{A} \in \{0, 1\}^{N \times N}$, where $\mathbf{A}_{ij} = 1$ if $(v_i, v_j) \in \mathcal{E}$, and the corresponding degree matrix as $\mathbf{D} = \mathrm{diag}(d_1, \dots, d_N)$ with $d_i = \sum_j \mathbf{A}_{i,j}$. Our objective is to learn, without supervision, a GNN encoder $f_\theta : \mathbf{X}, \mathbf{A} \to \mathbb{R}^{N \times K}$ that receives the node features and graph structure as input and produces low-dimensional node representations, i.e., $K \ll F$. The representations can benefit downstream supervised or semi-supervised tasks, e.g., node classification.

Homophilic and heterophilic graph. Various metrics have been proposed to measure the homophily degree of a graph. Here we adopt two representative metrics, namely, node homophily and edge homophily. The edge homophily (Zhu et al., 2020b) is the proportion of edges that connect two nodes of the same class:

$$h_{edge} = \frac{|\{(v_i, v_j) \in \mathcal{E} : y_i = y_j\}|}{E}.$$

The node homophily is defined as

$$h_{node} = \frac{1}{N} \sum_{v_i \in \mathcal{V}} \frac{|\{v_j \in \mathcal{N}(v_i) : y_j = y_i\}|}{|\mathcal{N}(v_i)|},$$

where $\mathcal{N}(v_i)$ is the set of neighbors of $v_i$; it evaluates the average proportion of edge-label consistency over all nodes. Both metrics lie in the range $[0, 1]$: a value close to 1 corresponds to strong homophily, while a value close to 0 indicates strong heterophily. By convention, we refer to a graph with a high homophily degree as a homophilic graph, and to a graph with a low homophily degree as a heterophilic graph. The homophily degrees of the graphs considered in this work are provided in Table 7 of Appendix A.1.
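To make the two homophily metrics described above concrete, the following is a minimal sketch in plain Python. The function names and the toy graph are illustrative assumptions, not part of the paper's code; nodes without neighbors are skipped in the node-level average, which is also an assumption of this sketch.

```python
from collections import defaultdict


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def node_homophily(edges, labels):
    """Average, over nodes with at least one neighbor, of the fraction of
    neighbors that share the node's label (isolated nodes are skipped,
    an assumption of this sketch)."""
    neighbors = defaultdict(list)
    for u, v in edges:  # undirected graph: record both directions
        neighbors[u].append(v)
        neighbors[v].append(u)
    ratios = [
        sum(labels[n] == labels[v] for n in nbrs) / len(nbrs)
        for v, nbrs in neighbors.items()
    ]
    return sum(ratios) / len(ratios)


# Toy example: the path 0-1-2-3 with labels [0, 0, 1, 1].
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
h_edge = edge_homophily(edges, labels)  # 2 of 3 edges join same-class nodes
h_node = node_homophily(edges, labels)  # per-node ratios: 1, 0.5, 0.5, 1
```

On this toy path, two of the three edges are intra-class, and the end nodes have only same-class neighbors while the two middle nodes have one neighbor of each class.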

4. THEORETICAL ANALYSES

In this section, we first show a property of node representations obtained through neighbor aggregation (Lemma 1). Based on this property, we then introduce the single-pass graph contrastive loss (Equation (5)), in which the contrastive pairs are constructed according to node similarity instead of graph augmentations. Theorem 1 shows the viability of selecting pairs through the node similarity computed from node features. We then bridge the graph contrastive loss and Matrix Factorization (Lemma 2). Finally, leveraging the analysis of matrix factorization, we obtain a performance guarantee for the embedding learned with SP-GCL in the downstream node classification task (Theorem 2).

4.1. ANALYSIS OF AGGREGATED FEATURES

Assumptions on graph data. To obtain analytic and conceptual insights into the aggregated features, we first describe the graph data we consider. We assume that the node features follow a Gaussian mixture model (Reynolds, 2009). For simplicity, we focus on the binary classification problem. Conditional on the (binary) label $y$ and a latent vector $\boldsymbol{\mu} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_F / F)$, where $\mathbf{I}_F \in \mathbb{R}^{F \times F}$ is the identity matrix, the features are governed by

$$\mathbf{x}_i = y_i \boldsymbol{\mu} + \frac{\mathbf{q}_i}{\sqrt{F}}, \qquad (1)$$

where the random variable $\mathbf{q}_i \in \mathbb{R}^{F}$ has independent standard normal entries and, with a slight abuse of notation, $y_i \in \{-1, 1\}$ represents the latent class. The features of nodes with class $y_i$ then follow the same distribution depending on $y_i$, i.e., $\mathbf{x}_i \sim P_{y_i}(\mathbf{x})$. Furthermore, we make an assumption about the neighborhood patterns: for node $i$, its neighbors' labels are independently sampled from a distribution $P(y_i)$.

Remark. The above assumption implies that the feature of a node depends on its label, and that the labels of its neighbors are generated from a distribution that depends only on the label of the central node; this covers both homophily and heterophily. With this assumption, we present Lemma 1, where $\mathbf{Z}$ denotes the embedding learned through neighbor aggregation and a linear projection, $\mathbf{Z}_i$ is the learned embedding with respect to input $\mathbf{x}_i$, and $\mathbf{W}$ denotes the linear weight.

Lemma 1 (Concentration Property of Aggregated Features) Consider a graph $G$ following the graph data assumption and Eq. (1). Then the expectation of the embedding is given by

$$\mathbb{E}[\mathbf{Z}_i] = \mathbf{W}\, \mathbb{E}_{y \sim P(y_i),\, \mathbf{x} \sim P_y(\mathbf{x})}[\mathbf{x}]. \qquad (2)$$

Furthermore, with probability at least $1 - \delta$ over the distribution for the graph, we have

$$\|\mathbf{Z}_i - \mathbb{E}[\mathbf{Z}_i]\|_2 \le \frac{\sigma^2_{\max}(\mathbf{W})\, F \log(2F/\delta)}{2 D_{ii} \|\mathbf{x}\|_{\psi_2}} \qquad (3)$$

and

$$\left|\mathbf{Z}_i^\top \mathbf{Z}_j - \mathbb{E}[\mathbf{Z}_i^\top \mathbf{Z}_j]\right| \le \frac{\sigma^2_{\max}(\mathbf{W}^\top \mathbf{W}) \log(2N^2/\delta)}{2 D^2 \|\mathbf{x}^2\|_{\psi_1}}, \qquad (4)$$

where $D_{ii}$ is the degree of node $v_i$, $D \equiv \min_i D_{ii}$, the sub-gaussian norm is $\|\mathbf{x}\|_{\psi_2} \equiv \min_i \|\mathbf{x}_{i,d}\|_{\psi_2}$, and the sub-exponential norm is $\|\mathbf{x}^2\|_{\psi_1} \equiv \min_i \|\mathbf{x}^2_{i,d}\|_{\psi_1}$ for $d \in [1, F]$. Here $\sigma_{\max}(\mathbf{W})$ denotes the largest singular value of $\mathbf{W}$.

Remark. Extending Theorem 1 in (Ma et al., 2021), the above lemma indicates that, for any graph following the graph data assumption: (i) in expectation, nodes with the same label have the same embedding (Equation (2)); (ii) the embeddings of nodes with the same label tend to concentrate in a certain area of the embedding space, which can be regarded as the inductive bias of the neighborhood aggregation mechanism under the graph data assumption (Equation (3)); (iii) the inner product of hidden representations is close to its expectation with high probability (Equation (4)); (iv) with commonly used initializations, e.g., Kaiming initialization (He et al., 2015) and LeCun initialization (LeCun et al., 2012), $\sigma^2_{\max}(\mathbf{W}^\top \mathbf{W})$ is bounded and the concentration is relatively tight. (The proof and a detailed discussion are in Appendix C.1 and C.5, respectively.)

The Gaussian mixture model of the features, the neighborhood-pattern model of the graph structure, and the linearization of the graph neural network are commonly adopted in recent theoretical analyses of GNNs (Deshpande et al., 2018; Baranwal et al., 2021; Ma et al., 2021), and their high-level conclusions hold empirically for a wide range of real-world cases. Nevertheless, we empirically verify the derived concentration property (Lemma 1) on multi-class real-world graph data with a non-linear graph neural network (Section 6.3 and Table 6), and we discuss empirical observations of the concentration property in other recent works (Wang et al., 2022b; Trivedi et al., 2022) in Appendix E.1.
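The concentration behavior stated in Lemma 1 can be illustrated with a small simulation of the graph data assumption. The sketch below is not the paper's experiment: it assumes mean aggregation over neighbors, takes the linear weight to be the identity, and picks the feature dimension, neighbor-label probability, and trial count arbitrarily. It only checks qualitatively that the aggregated feature of a node moves closer to its expectation as the node degree grows, including in a heterophilic setting where most neighbors carry the opposite label.

```python
import math
import random


def sample_feature(y, mu, rng):
    """Draw x = y * mu + q / sqrt(F), with q having standard normal entries."""
    F = len(mu)
    return [y * m + rng.gauss(0.0, 1.0) / math.sqrt(F) for m in mu]


def aggregation_deviation(degree, p_pos, mu, rng):
    """L2 distance between a mean-aggregated feature and its expectation.

    Neighbor labels are +1 with probability p_pos and -1 otherwise, so the
    expected aggregated feature is (2 * p_pos - 1) * mu when W is the identity
    (both the mean aggregator and W = I are assumptions of this sketch).
    """
    F = len(mu)
    agg = [0.0] * F
    for _ in range(degree):
        y = 1 if rng.random() < p_pos else -1
        x = sample_feature(y, mu, rng)
        agg = [a + xi / degree for a, xi in zip(agg, x)]
    expected = [(2 * p_pos - 1) * m for m in mu]
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(agg, expected)))


rng = random.Random(0)
F = 16
mu = [rng.gauss(0.0, 1.0) / math.sqrt(F) for _ in range(F)]  # mu ~ N(0, I_F / F)


def mean_deviation(degree, trials=200):
    """Average deviation over repeated draws of the neighborhood."""
    # p_pos = 0.2: a heterophilic pattern for a central node of class +1.
    return sum(aggregation_deviation(degree, 0.2, mu, rng) for _ in range(trials)) / trials


dev_low_degree = mean_deviation(5)
dev_high_degree = mean_deviation(100)
```

With these settings the average deviation at degree 100 should be clearly smaller than at degree 5, in line with the degree appearing in the denominator of the bound.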

4.2. SINGLE-PASS GRAPH CONTRASTIVE LOSS

To learn a more compact and linearly separable embedding space, we introduce the single-pass graph contrastive loss based on the property of aggregated features. Exploiting the concentration property explicitly, we regard nodes with a small distance in the embedding space as positive pairs and nodes with a large distance as negative pairs.

We formally define the positive and negative pairs to introduce the loss. We draw a node $v_i$ uniformly from the node set $\mathcal{V}$, $v_i \sim Uni(\mathcal{V})$, and draw the node $v_{i^+}$ uniformly from the set $S^i_{pos}$, which consists of the $K_{pos}$ nodes closest to node $v_i$. Concretely,

$$S^i_{pos} = \{v_i^1, v_i^2, \dots, v_i^{K_{pos}}\} = \arg\max_{v_j \in \mathcal{V}} \left(\mathbf{Z}_i^\top \mathbf{Z}_j,\, K_{pos}\right),$$

where $K_{pos} \in \mathbb{Z}^+$ and $\arg\max(\cdot, K_{pos})$ denotes the top-$K_{pos}$ selection operator. The two sampled nodes form a positive pair $(v_i, v_{i^+})$. Two nodes $v_j$ and $v_k$, sampled independently from the node set, are regarded as a negative pair $(v_j, v_k)$ (Arora et al., 2019). Following the insight of contrastive learning (Wang & Isola, 2020) that similar sample pairs should stay close to each other while dissimilar ones should be far apart, the Single-Pass Graph Contrastive Loss is defined as

$$\mathcal{L}_{SP\text{-}GCL} = -2\, \mathbb{E}_{\substack{v_i \sim Uni(\mathcal{V}) \\ v_{i^+} \sim Uni(S^i_{pos})}} \left[\mathbf{Z}_i^\top \mathbf{Z}_{i^+}\right] + \mathbb{E}_{\substack{v_j \sim Uni(\mathcal{V}) \\ v_k \sim Uni(\mathcal{V})}} \left[\left(\mathbf{Z}_j^\top \mathbf{Z}_k\right)^2\right]. \qquad (5)$$
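As a worked illustration of Equation (5), the sketch below evaluates the loss exactly on a tiny set of embeddings by averaging over all positive and negative pairs instead of sampling. The helper names are hypothetical, and excluding a node from its own positive set is an assumption of this sketch (the definition above does not state it).

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_positives(Z, i, k_pos):
    """Indices of the k_pos nodes most similar to node i.

    Node i itself is excluded, which is an assumption of this sketch.
    """
    scores = [(dot(Z[i], Z[j]), j) for j in range(len(Z)) if j != i]
    scores.sort(key=lambda s: -s[0])  # stable sort: ties keep index order
    return [j for _, j in scores[:k_pos]]


def sp_gcl_loss(Z, k_pos):
    """Exact value of -2 E[Z_i^T Z_i+] + E[(Z_j^T Z_k)^2] for embeddings Z."""
    n = len(Z)
    pos_term = 0.0
    for i in range(n):
        pos = top_k_positives(Z, i, k_pos)
        pos_term += sum(dot(Z[i], Z[j]) for j in pos) / len(pos)
    pos_term /= n
    # Negative pairs are two independent uniform draws, so (j, j) pairs count.
    neg_term = sum(dot(Z[j], Z[k]) ** 2 for j in range(n) for k in range(n)) / (n * n)
    return -2.0 * pos_term + neg_term


# Two tight, mutually orthogonal clusters of unit-norm embeddings.
Z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss = sp_gcl_loss(Z, k_pos=1)
```

For this toy input every positive pair has inner product 1, and half of the 16 ordered node pairs have squared inner product 1, so the loss evaluates to -2 + 0.5 = -1.5.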

4.3. PERFORMANCE GUARANTEE FOR LEARNING LINEAR CLASSIFIER

For convenience of analysis, we first introduce the concept of the transformed graph, which is constructed from the original graph and the selected positive pairs.

Definition 1 (Transformed Graph) Given the original graph $G$ and its node set $\mathcal{V}$, the transformed graph $\hat{G}$ has the same node set $\mathcal{V}$ but uses the selected positive pairs as its edge set, $\hat{\mathcal{E}} = \bigcup_i \{(v_i, v_i^k)\}_{k=1}^{K_{pos}}$.

Note that the transformed graph is formed by positive pairs selected based on aggregated features. Coupled with the Concentration Property of Aggregated Features (Lemma 1), the transformed graph tends to have a larger homophily degree than the original graph. We provide more empirical verification in Section 6.3. The transformed graph is illustrated in Figure 1. We denote the adjacency matrix of the transformed graph as $\hat{\mathbf{A}} \in \{0, 1\}^{N \times N}$, the number of edges as $\hat{E}$, and the symmetric normalized matrix as $\hat{\mathbf{A}}_{sym} = \hat{\mathbf{D}}^{-1/2} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-1/2}$, where $\hat{\mathbf{D}} = \mathrm{diag}(\hat{d}_1, \dots, \hat{d}_N)$ and $\hat{d}_i = \sum_j \hat{\mathbf{A}}_{i,j}$. Correspondingly, we denote the symmetric normalized Laplacian as $\hat{\mathbf{L}}_{sym} = \mathbf{I} - \hat{\mathbf{A}}_{sym} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$. Here $\mathbf{U} = [\mathbf{u}_1, \dots, \mathbf{u}_N] \in \mathbb{R}^{N \times N}$, where $\mathbf{u}_i \in \mathbb{R}^{N}$ denotes the $i$-th eigenvector of $\hat{\mathbf{L}}_{sym}$, and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \dots, \lambda_N)$ is the corresponding eigenvalue matrix; $\lambda_1$ and $\lambda_N$ are the smallest and largest eigenvalues, respectively. We then show that optimizing a model with the contrastive loss (Equation (5)) is equivalent to matrix factorization over the transformed graph, as stated in the following lemma.

Lemma 2 Denote the learnable embedding for matrix factorization as $\mathbf{F} \in \mathbb{R}^{N \times K}$, and let $\mathbf{F}_i = F_\psi(v_i)$. Then the matrix factorization loss $\mathcal{L}_{mf}$ is equivalent to the contrastive loss, Equation (5), up to an additive constant:

$$\mathcal{L}_{mf}(\mathbf{F}) = \left\|\hat{\mathbf{A}}_{sym} - \mathbf{F}\mathbf{F}^\top\right\|_F^2 = \mathcal{L}_{SP\text{-}GCL} + \mathrm{const}. \qquad (6)$$

The above lemma bridges graph contrastive learning and graph matrix factorization, and therefore allows us to provide a performance guarantee for SP-GCL by leveraging the power of matrix factorization. We leave the derivation to Appendix C.2. With Lemmas 1 and 2, we arrive at a conclusion about the expected value of the inner product of embeddings (more details in Appendix C.3).

Theorem 1 For a graph $G$ following the graph data assumption, when the optimum of the contrastive loss is achieved, i.e., $\mathcal{L}_{mf}(\mathbf{F}^*) = 0$, we have

$$\mathbb{E}[\mathbf{Z}_i^\top \mathbf{Z}_j \mid y_i = y_j] - \mathbb{E}[\mathbf{Z}_i^\top \mathbf{Z}_j \mid y_i \ne y_j] \ge 1 - \phi,$$

where $\phi = \mathbb{E}_{v_i, v_j \sim Uni(\mathcal{V})}\left[\hat{\mathbf{A}}_{i,j} \cdot \mathbb{1}[y_i \ne y_j]\right]$.

Remark. The theorem shows that the inner product of embeddings of nodes from the same class is larger than that of nodes from different classes. Moreover, $1 - \phi$, which indicates the probability that an edge connects two nodes from the same class, can be regarded as the edge homophily of the transformed graph. Therefore, the theorem implies that the larger the edge homophily of the transformed graph, the more compact the embeddings of nodes from the same class will be in the high-dimensional space.

Finally, with Theorem 1 and Lemma 1, we obtain a performance guarantee for node embeddings learned by SP-GCL with a linear classifier in the downstream task (more details in Appendix C.4).

Theorem 2 Let $f^*_{SP\text{-}GCL} \in \arg\min_{f : \mathcal{X} \to \mathbb{R}^{K}} \mathcal{L}_{SP\text{-}GCL}$ be a minimizer of the contrastive loss $\mathcal{L}_{SP\text{-}GCL}$. Then there exists a linear classifier $\mathbf{B}^* \in \mathbb{R}^{c \times K}$ with norm $\|\mathbf{B}^*\|_F \le 1/(1 - \lambda_K)$ such that, with probability at least $1 - \delta$,

$$\mathbb{E}_{v_i}\left\|\vec{y}_i - \mathbf{B}^* f^*_{SP\text{-}GCL}(v_i)\right\|_2^2 \le \frac{\phi}{\lambda_{K+1}} + \frac{\sigma^2_{\max}(\mathbf{W}^\top \mathbf{W}) \log(2N^2/\delta)}{2 D^2 \|\mathbf{x}^2\|_{\psi_1} \lambda^2_{K+1}},$$

where $\lambda_i$ denotes the $i$-th smallest eigenvalue of the symmetrically normalized Laplacian matrix of the transformed graph.
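To make Definition 1 and the left-hand side of Equation (6) concrete, the following sketch builds the adjacency matrix of a transformed graph from given positive sets, normalizes it symmetrically, and evaluates the matrix factorization loss for a candidate embedding. The function names are hypothetical, and symmetrizing the selected pairs into undirected edges is an assumption of this sketch. It does not verify the equivalence in Lemma 2; it only shows how the quantities are formed.

```python
import math


def transformed_adjacency(positive_sets, n):
    """Binary adjacency with an edge (i, j) for every selected positive pair.

    Pairs are symmetrized so the graph is undirected (an assumption here).
    """
    A = [[0.0] * n for _ in range(n)]
    for i, positives in positive_sets.items():
        for j in positives:
            A[i][j] = 1.0
            A[j][i] = 1.0
    return A


def sym_normalize(A):
    """Compute D^{-1/2} A D^{-1/2}, leaving zero entries untouched."""
    n = len(A)
    deg = [sum(row) for row in A]
    return [
        [A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def mf_loss(A_sym, F):
    """Squared Frobenius norm of A_sym - F F^T."""
    n = len(A_sym)
    total = 0.0
    for i in range(n):
        for j in range(n):
            recon = sum(a * b for a, b in zip(F[i], F[j]))
            total += (A_sym[i][j] - recon) ** 2
    return total


# Four nodes whose top-1 positives pair them up as (0, 1) and (2, 3).
positive_sets = {0: [1], 1: [0], 2: [3], 3: [2]}
A_sym = sym_normalize(transformed_adjacency(positive_sets, n=4))

# Candidate embedding: each pair shares one axis, scaled by a = sqrt(0.5).
a = math.sqrt(0.5)
F_emb = [[a, 0.0], [a, 0.0], [0.0, a], [0.0, a]]
loss_mf = mf_loss(A_sym, F_emb)
```

Each 2x2 diagonal block contributes four squared residuals of 0.25, so this candidate gives a loss of 2.0 on the toy graph.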

5. SINGLE-PASS GRAPH CONTRASTIVE LEARNING (SP-GCL)

As a direct consequence of our theory, we introduce Single-Pass Graph Contrastive Learning (SP-GCL) to verify our findings. Instead of relying on a graph augmentation function or an exponential moving average, our learning framework performs only a single forward pass, and the contrastive pairs are constructed based on the aggregated features. As we shall see, this exceedingly simple, theory-motivated method also yields better performance in practice than dual-pass graph contrastive learning methods on both homophilic and heterophilic graph benchmarks.

As the analysis revealed, for each class, the embeddings obtained from neighbor aggregation concentrate toward the expected embedding of that class. Inspired by this, we design the self-supervision signal based on the obtained embeddings and propose a novel single-pass graph contrastive learning framework, SP-GCL, which selects similar nodes as positive node pairs. As shown in Figure 2, in each iteration the proposed framework first encodes the graph with a graph encoder $f_\theta$, giving $\mathbf{H} = f_\theta(\mathbf{X}, \mathbf{A})$. Then a projection head with L2-normalization, $g_\omega$, projects the node embedding into the hidden representation $\mathbf{Z} = g_\omega(\mathbf{H})$. To scale SP-GCL to large graphs, the node pool $P$ is constructed from the $T$-hop neighborhood of $b$ nodes (the seed node set $S$) uniformly sampled from $\mathcal{V}$. For each seed node $v_i \in S$, the top-$K_{pos}$ nodes with the highest similarity in the node pool are selected as its positive set, denoted $S^i_{pos} = \arg\max_{v_j \in P}\left(\mathbf{Z}_i^\top \mathbf{Z}_j,\, K_{pos}\right)$, and $K_{neg}$ nodes are sampled from $\mathcal{V}$ to form the negative set $S^i_{neg} \subseteq \mathcal{V}$. Concretely, the framework is optimized with the following objective:

$$\hat{\mathcal{L}}_{SP\text{-}GCL} = -\frac{2}{N \cdot K_{pos}} \sum_{v_i \in \mathcal{V}} \sum_{v_{i^+} \in S^i_{pos}} \mathbf{Z}_i^\top \mathbf{Z}_{i^+} + \frac{1}{N \cdot K_{neg}} \sum_{v_j \in \mathcal{V}} \sum_{v_k \in S^j_{neg}} \left(\mathbf{Z}_j^\top \mathbf{Z}_k\right)^2.$$

Notably, the empirical contrastive loss is an unbiased estimate of Equation (5). The overall training procedure of SP-GCL is summarized in Algorithm 1.
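The node pool construction described above (seed nodes plus their $T$-hop neighborhoods) can be sketched with a multi-source breadth-first search. This is a minimal illustration in plain Python with a hypothetical adjacency-list representation and function name; it is not taken from the authors' implementation, and it includes the seed nodes themselves in the pool as an assumption.

```python
import random
from collections import deque


def t_hop_pool(adj, seeds, T):
    """Union of all nodes within T hops of any seed (seeds included)."""
    pool = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == T:  # do not expand beyond T hops
            continue
        for nb in adj[node]:
            if nb not in pool:
                pool.add(nb)
                frontier.append((nb, depth + 1))
    return pool


# Toy graph: the path 0-1-2-3-4-5 as an adjacency list.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

rng = random.Random(0)
seeds = rng.sample(sorted(adj), 2)      # b = 2 uniformly sampled seed nodes
pool = t_hop_pool(adj, seeds, T=1)      # candidate positives for these seeds
pool_from_zero = t_hop_pool(adj, [0], T=2)
```

Because all seeds enter the queue at depth 0, each node is first reached at its minimum distance from the seed set, so the pool contains exactly the nodes within $T$ hops of at least one seed. Positive selection then only needs inner products between each seed and the pool, rather than against all $N$ nodes.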
Although we have provided a theoretical discussion of the self-selection of positive pairs and of the proposed method, whether the method is effective and whether self-selection is feasible in real-world cases remain open. In the following section, we empirically verify the effectiveness of the method and the usefulness of our findings over a wide range of graph datasets.

6.1. PERFORMANCE ON HOMOPHILIC AND HETEROPHILIC GRAPH

The homophilic graph benchmarks have been studied in several previous works (Velickovic et al., 2019; Peng et al., 2020; Hassani & Khasahmadi, 2020; Zhu et al., 2020d; Thakoor et al., 2021; Lee et al., 2021). We reuse their configurations to compare SP-GCL with those methods, and leave the detailed description of the experimental setting to Appendix A and the implementation details and hyperparameter selection to Appendix A.4. The results are summarized in Table 1, in which the best performance achieved by self-supervised methods is marked in boldface.

Algorithm 1: Single-Pass Graph Contrastive Learning (SP-GCL).
Input: Graph neural network $f_\theta$, MLP projection head $g_\omega$, input adjacency matrix $\mathbf{A}$, node features $\mathbf{X}$, batch size $b$, number of hops $T$, number of positive nodes $K_{pos}$.
for epoch ← 1, 2, ... do
    1. Obtain the node embedding, $\mathbf{H} = f_\theta(\mathbf{X}, \mathbf{A})$.
    2. Obtain the hidden representation, $\mathbf{Z} = g_\omega(\mathbf{H})$.
    3. Sample $b$ nodes as the seed node set $S$ and construct the node pool $P$ from the $T$-hop neighbors of each node in $S$.
    4. Select the top-$K_{pos}$ similar nodes for every $v_i \in S$ to form the positive node set $S^i_{pos}$.
    5. Compute the contrastive objective with Eq. (5) and update the parameters by applying stochastic gradient.
end for
return Final model $f_\theta$.

Compared with augmentation-based and augmentation-free GCL methods, SP-GCL outperforms previous methods on 2 datasets and achieves competitive performance on the others, which shows the effectiveness of the single-pass contrastive loss on homophilic graphs. We further assess model performance on the heterophilic graph benchmarks employed in Pei et al. (2020) and Lim et al. (2021a). As shown in Table 2, SP-GCL achieves the best performance on all 6 heterophilic graphs by an evident margin. These results indicate that, instead of relying on augmentations that are sensitive to the graph type, SP-GCL works well over a wide range of real-world graphs (described in Appendix A.1) with different homophily degrees.

To illustrate the advantages of SP-GCL, we provide a brief comparison of the time and space complexities of SP-GCL, the previous strong contrastive method GCA (Zhu et al., 2021b), and the memory-efficient contrastive method BGRL (Thakoor et al., 2021). Consider a graph with $N$ nodes and $E$ edges, and a graph neural network (GNN) $f$ that computes node embeddings in time and space $O(N + E)$. BGRL performs four GNN computations per update step (the target and online encoders are each applied to both augmentations), plus a node-level projection; GCA performs two GNN computations (one for each augmentation), plus a node-level projection. Both methods backpropagate the learning signal twice (once for each augmentation), and we assume the backward pass to be approximately as costly as a forward pass. Both methods compute the augmented graphs by feature masking and edge masking on the fly, so the cost of computing augmentations is nearly the same. Thus the total time and space complexity per update step is $6C_{encoder}(E + N) + 4C_{proj}N + C_{prod}N + C_{aug}$ for BGRL and $4C_{encoder}(E + N) + 4C_{proj}N + C_{prod}N^2 + C_{aug}$ for GCA. $C_{prod}$ depends on the dimension of the node embedding, and we assume the node embeddings of all methods have the same size. Our proposed method employs only one GNN encoder; we compute the inner products of $b$ nodes to construct positive samples, and $K_{pos}$ and $K_{neg}$ inner products for the loss computation. For SP-GCL we therefore have $2C_{encoder}(E + N) + 2C_{proj}N + C_{prod}(K_{pos} + K_{neg})^2$.

We empirically measure the peak GPU memory usage of SP-GCL, GCA, and BGRL. For a fair comparison, we set the embedding size to 128 for all three methods on the four datasets and keep the other hyperparameters the same as in the main experiments. As shown in Table 3, the computational overhead of SP-GCL is much lower than that of the previous methods.
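The per-step cost expressions above can be compared numerically once values are plugged in. The sketch below simply evaluates the three formulas; the constants are unitless placeholders chosen for illustration (all set to 1, with the augmentation cost set to 0), so the resulting numbers indicate relative scaling only and are not measurements.

```python
def cost_bgrl(N, E, c_enc, c_proj, c_prod, c_aug):
    """6 C_enc (E + N) + 4 C_proj N + C_prod N + C_aug."""
    return 6 * c_enc * (E + N) + 4 * c_proj * N + c_prod * N + c_aug


def cost_gca(N, E, c_enc, c_proj, c_prod, c_aug):
    """4 C_enc (E + N) + 4 C_proj N + C_prod N^2 + C_aug."""
    return 4 * c_enc * (E + N) + 4 * c_proj * N + c_prod * N ** 2 + c_aug


def cost_sp_gcl(N, E, c_enc, c_proj, c_prod, k_pos, k_neg):
    """2 C_enc (E + N) + 2 C_proj N + C_prod (K_pos + K_neg)^2."""
    return 2 * c_enc * (E + N) + 2 * c_proj * N + c_prod * (k_pos + k_neg) ** 2


# Hypothetical graph size and placeholder constants (not measured values).
N, E = 1000, 5000
bgrl = cost_bgrl(N, E, c_enc=1, c_proj=1, c_prod=1, c_aug=0)
gca = cost_gca(N, E, c_enc=1, c_proj=1, c_prod=1, c_aug=0)
sp_gcl = cost_sp_gcl(N, E, c_enc=1, c_proj=1, c_prod=1, k_pos=5, k_neg=5)
```

Under these placeholder values the quadratic $C_{prod}N^2$ term dominates GCA, while the SP-GCL expression has no term that grows faster than linearly in $N$ and $E$.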

6.3. EMPIRICAL VERIFICATION FOR THE THEORETICAL ANALYSIS

Effect of embedding dimension (Scaling Behavior). We observe performance gains from scaling up models, as shown in Table 4. These observations align with our analysis (Theorem 2): a larger hidden dimension $K$ leads to better performance, since a larger $\lambda_{K+1}$ yields a smaller upper bound on the error. Coupled with the efficiency of SP-GCL (Section 6.2), our scalable approach allows for learning high-capacity models that generalize well under the same computational budget.

Homophily of transformed graph (Initial Stage).

The positive sampling depends largely on how the hidden representation is obtained. In other words, if starting from a "poor" initialization, the GNN encoder could yield false positive samples, since the inner product is not correctly evaluated at the beginning. Our analysis shows that the concentration property (Lemma 1) is relatively tight under commonly used initialization methods ($\sigma^2_{\max}(\mathbf{W}^\top \mathbf{W})$ in Equation (4) is bounded; see Appendix C.5), which indicates that the transformed graph formed by the selected positive pairs will have a large edge homophily degree. To check this empirically, we measure the edge homophily of the transformed graph at the beginning of training over 20 runs. The mean and standard deviation are reported in Table 6. The edge homophily of every transformed graph is larger than that of the original graph, with a small standard deviation, except for CiteSeer, on which a relatively high edge homophily (0.691) is still achieved. The result indicates that the initialization is "good" enough at the beginning to form a useful transformed graph and, in turn, supports the feasibility of Lemma 1 on real-world data.

Distance to class center and Node Coverage (Learning Process). We measure the change in the average cosine distance (1 − CosineSimilarity) between the node embeddings and the class-center embeddings during training. Specifically, the class-center embedding is the average of the node embeddings of the same class. As shown in Figure 3, the node embeddings concentrate on their corresponding class centers during training, which implies that the learned embedding space becomes more compact and the learning process is stable. Intuitively, coupled with the "good" initialization discussed above, SP-GCL iteratively leverages the inductive bias in the stable learning process and refines the embedding space.
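The distance-to-class-center measurement described above can be written in a few lines. The sketch below is an illustrative plain-Python version with hypothetical function names; it assumes no embedding or class center is the zero vector.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity (assumes neither vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance_to_class_center(Z, labels):
    """Average cosine distance between each embedding and its class mean."""
    dim = len(Z[0])
    sums, counts = {}, {}
    for z, y in zip(Z, labels):
        acc = sums.setdefault(y, [0.0] * dim)
        for d in range(dim):
            acc[d] += z[d]
        counts[y] = counts.get(y, 0) + 1
    centers = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
    return sum(cosine_distance(z, centers[y]) for z, y in zip(Z, labels)) / len(Z)


# Perfectly aligned classes: every embedding points along its class center.
compact = mean_distance_to_class_center(
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]], [0, 0, 1, 1]
)
# One class whose two members are orthogonal: center is [0.5, 0.5].
spread = mean_distance_to_class_center([[1.0, 0.0], [0.0, 1.0]], [0, 0])
```

A value near 0 means the embeddings of each class point in the same direction as their center; tracking this quantity over epochs gives a curve of the kind described for Figure 3.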
Furthermore, to study which nodes benefit during the learning process, we measure the Node Cover Ratio and the Overlapped Selection Ratio on four graph datasets. We denote the set of edges formed by the $K_{pos}$ positive selection at epoch $t$ as $e^t$ and define the Overlapped Selection Ratio as $\frac{|e^{t+1} \cap e^{t}|}{|e^{t+1}|}$. We also denote the set of positive nodes at epoch $t$ as $v^t$; the Node Cover Ratio at epoch $t$ is then defined as $\frac{|v^0 \cup v^1 \cup \cdots \cup v^t|}{N}$. Using the hyperparameters described in the previous section, we measure both ratios during training. The results, shown in Figure 4, indicate that all nodes benefit from the optimization procedure. We attribute the relatively small and non-increasing Overlapped Selection Ratio to batch training.
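The two ratios can be computed directly from the per-epoch selections. In the sketch below the function names are hypothetical, selected edges are represented as sets of node-index tuples, and the Node Cover Ratio is read as a cumulative count of every node selected as a positive in at least one epoch so far (a union over epochs); that reading is an assumption of this sketch, chosen because it matches the interpretation that coverage grows until all nodes are included.

```python
def overlapped_selection_ratio(edges_prev, edges_next):
    """Share of the edges selected at epoch t+1 that were also selected at t."""
    if not edges_next:
        raise ValueError("no edges selected at epoch t+1")
    return len(edges_next & edges_prev) / len(edges_next)


def node_cover_ratio(positive_nodes_per_epoch, num_nodes):
    """Share of nodes selected as a positive in at least one epoch so far.

    The cumulative union over epochs is an assumption of this sketch.
    """
    covered = set().union(*positive_nodes_per_epoch)
    return len(covered) / num_nodes


# Toy selections over two consecutive epochs.
e_t = {(0, 1), (2, 3)}
e_t_plus_1 = {(0, 1), (2, 4), (3, 4), (1, 2)}
osr = overlapped_selection_ratio(e_t, e_t_plus_1)

# Positive nodes chosen in epochs 0, 1 and 2 on a 5-node graph.
ncr = node_cover_ratio([{0, 1}, {1, 2}, {3}], num_nodes=5)
```

Here only one of the four edges selected at the later epoch was already present, and four of the five nodes have been selected at least once.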

Selected positive pairs (End of Training).

To further study the effect of the proposed method, we report the true positive ratio (TPR) of the selected pairs at the end of training over 20 runs. As shown in Table 5, the relatively large TPR and low variance suggest that the quality of the selected positive pairs is good and that the learning process is stable.

7. CONCLUSION

In this work, we first analyze the concentration property of embeddings obtained through neighborhood aggregation, which holds for both homophilic and heterophilic graphs. Exploiting the concentration property, we then introduce the single-pass graph contrastive loss. We theoretically show the equivalence between the contrastive objective and matrix factorization. Further, leveraging the analysis of matrix factorization, we provide a theoretical guarantee, in the downstream classification task, for the node embedding obtained by minimizing the contrastive loss. To verify the usefulness of our findings on real-world datasets, we implement the Single-Pass Graph Contrastive Learning framework (SP-GCL). Empirically, we show that SP-GCL outperforms or is competitive with SOTA methods on 8 homophilic graph benchmarks and 6 heterophilic graph benchmarks, with significantly less computational overhead. The empirical results verify the feasibility and effectiveness of our analysis in real-world cases. We leave the discussion of connections with existing observations, limitations, and future work to Appendix E.

8. REPRODUCIBILITY STATEMENT

To ensure that the results and conclusions of our paper are reproducible, we make the following efforts. Theoretically, we state the full set of assumptions and include complete proofs of our theoretical results in Section 4 and Appendix C. Experimentally, we provide our code and the instructions needed to reproduce the main experimental results, and we specify all training and implementation details in Section 6 and Appendix A. In addition, we run the experiments independently and report the mean and standard deviation.

A APPENDIX: EXPERIMENT SETTING

A.1 DATASET INFORMATION

We analyze the quality of representations learned by SP-GCL on transductive node classification benchmarks. Specifically, we evaluate the pretrained representations on 8 benchmark homophilic graph datasets, namely, Cora, Citeseer, Pubmed (Kipf & Welling, 2016a) and Wiki-CS, Amazon-Computers, Amazon-Photo, Coauthor-CS, Coauthor-Physics (Shchur et al., 2018), as well as 6 heterophilic graph datasets, namely, Chameleon, Squirrel (Rozemberczki et al., 2021), Actor (Pei et al., 2020), Twitch-DE, Twitch-gamers (Rozemberczki & Sarkar, 2021), and Genius (Lim et al., 2021b). The datasets are collected from real-world networks in different domains; their detailed statistics are summarized in Table 7. For the 8 homophilic graph datasets, we use the processed versions provided by PyTorch Geometric (Fey & Lenssen, 2019). Of the 6 heterophilic graph datasets, three (Chameleon, Squirrel, and Actor) are provided by the PyTorch Geometric library. The other three (Genius, Twitch-DE, and Twitch-gamers) can be obtained from the official GitHub repository, which also provides the standard splits for all 6 heterophilic graph datasets. These graph datasets are released under the MIT license and do not include personal identifiers. We do not foresee any form of privacy issue.

A.2 BASELINES

We consider representative baseline methods belonging to the following three categories: (1) traditional unsupervised graph embedding methods, including DeepWalk (Perozzi et al., 2014) and Node2Vec (Grover & Leskovec, 2016); (2) self-supervised learning algorithms with graph neural networks, including Graph Autoencoders (GAE, VGAE) (Kipf & Welling, 2016b), Deep Graph Infomax (DGI) (Velickovic et al., 2019), Graphical Mutual Information Maximization (GMI) (Peng et al., 2020), Multi-View Graph Representation Learning (MVGRL) (Hassani & Khasahmadi, 2020), Graph Contrastive Representation Learning (GRACE) (Zhu et al., 2020d), Graph Contrastive learning with Adaptive augmentation (GCA) (Zhu et al., 2021b), Bootstrapped Graph Latents (BGRL) (Thakoor et al., 2021), and Augmentation-Free Graph Representation Learning (AFGRL) (Lee et al., 2021); (3) supervised and semi-supervised learning methods, e.g., Multilayer Perceptron (MLP) and Graph Convolutional Networks (GCN) (Kipf & Welling, 2016a), which are trained in an end-to-end fashion.

A.3 EVALUATION PROTOCOL

We follow the evaluation protocol of previous state-of-the-art graph contrastive learning approaches. Specifically, for every experiment, we employ the linear evaluation scheme introduced by Velickovic et al. (2019), where each model is first trained in an unsupervised manner; the pretrained representations are then used to train and test a simple linear classifier. For datasets that come with standard train/valid/test splits, we evaluate the models on the public splits. For datasets without a standard split, e.g., Amazon-Computers, Amazon-Photo, Coauthor-CS, and Coauthor-Physics, we randomly split the nodes, with 10%/10%/80% of them selected for the training, validation, and test sets, respectively. For most datasets, we report the averaged test accuracy and standard deviation over 10 runs of classification. Following previous work (Lim et al., 2021a; b), we instead report the test ROC AUC on the genius and Twitch-DE datasets.
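The random 10%/10%/80% node split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `random_split`, the fixed seed, and the node count of 1000 are hypothetical choices made only for the example.

```python
import random


def random_split(num_nodes, train_frac=0.1, val_frac=0.1, seed=0):
    """Randomly partition node indices into train/validation/test sets.

    Hypothetical helper: the remaining nodes (80% by default) form the test set.
    """
    rng = random.Random(seed)
    indices = list(range(num_nodes))
    rng.shuffle(indices)
    n_train = round(train_frac * num_nodes)
    n_val = round(val_frac * num_nodes)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Example with an illustrative graph of 1000 nodes.
train_idx, val_idx, test_idx = random_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Repeating the split and the linear classification with different seeds, then averaging the test metric, gives the mean and standard deviation over runs that the protocol reports.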

A.4 IMPLEMENTATION DETAILS

Model architecture and hyperparameters. We employ a two-layer GCN (Kipf & Welling, 2016a) as the encoder for all methods. The propagation for a single GCN layer is given by

$$\mathrm{GCN}_i(X, A) = \sigma\left(\bar{D}^{-\frac{1}{2}} \bar{A} \bar{D}^{-\frac{1}{2}} X W_i\right),$$

where $\bar{A} = A + I$ is the adjacency matrix with self-loops, $\bar{D}$ is the degree matrix, $\sigma$ is a nonlinear activation function, such as ReLU, and $W_i$ is the learnable weight matrix of the $i$-th layer. Following previous work (Lim et al., 2021a; b), we use batch normalization within the graph encoder for the heterophilic graphs. The hyperparameter settings for all experiments are summarized in Table 8. We would like to release our code after acceptance.

Linear evaluation of embeddings. In the linear evaluation protocol, the final evaluation is over representations obtained from the pretrained model. When fitting the linear classifier on top of the frozen learned embeddings, the gradient does not flow back to the encoder. We optimize the one-layer linear classifier for 1000 epochs using Adam with a learning rate of 0.0005.

Hardware and software infrastructures. Our model is implemented with PyTorch Geometric 2.0.3 (Fey & Lenssen, 2019) and PyTorch 1.9.0 (Paszke et al., 2017). We conduct experiments on a computer server with four NVIDIA Tesla V100 SXM2 GPUs (with 32GB memory each) and twelve Intel Xeon Gold 6240R 2.40GHz CPUs.

We study the effect of top-K positive sampling on the node classification performance by measuring the classification accuracy and the corresponding standard deviation over a wide range of K. The results on WikiCS, Chameleon, Squirrel, Cora, and Photo are summarized in Figure 5. We observe that the performance achieved by SP-GCL is insensitive to the choice of K from 2 to 18.

Proof of Lemma 1. We first calculate the expectation of the aggregated embedding:

$$\mathbb{E}[f_\theta(x_i)] = \mathbb{E}\Big[W \sum_{j \in N(i)} \frac{1}{D_{ii}} x_j\Big] = W\, \mathbb{E}_{y \sim P_{y_i},\, x \sim P_y(x)}[x].$$

This equation is based on the graph data assumption that $x_j \sim P_{y_i}(x)$ for every $j$. Now we provide a concentration analysis.
Because each feature $x_i$ is a sub-Gaussian variable, by Hoeffding's inequality, with probability at least $1 - \delta'$ for each $d \in [1, F]$, we have

$$\Big|\frac{1}{D_{ii}} \sum_j \big(x_{j,d} - \mathbb{E}[x_{j,d}]\big)\Big| \le \sqrt{\frac{\log(2/\delta')}{2 D_{ii} \|x_{j,d}\|_{\psi_2}}}, \quad (11)$$

where $\|x_{j,d}\|_{\psi_2}$ is the sub-Gaussian norm of $x_{j,d}$. Furthermore, because the dimensions of $x_j$ are i.i.d., we have $\|x_j\|_{\psi_2} = \|x_{j,d}\|_{\psi_2}$ for $d \in [1, F]$. We then apply a union bound over the feature dimension $d$ by setting $\delta' = \delta / F$. With probability at least $1 - \delta$, we have

$$\Big|\frac{1}{D_{ii}} \sum_j \big(x_{j,d} - \mathbb{E}[x_{j,d}]\big)\Big| \le \sqrt{\frac{\log(2F/\delta)}{2 D_{ii} \|x\|_{\psi_2}}}. \quad (12)$$

Next, we use matrix perturbation theory:

$$\Big\|\frac{1}{D_{ii}} \sum_j \big(x_j - \mathbb{E}[x_j]\big)\Big\|_2 \le \sqrt{F}\, \max_{d} \Big|\frac{1}{D_{ii}} \sum_j \big(x_{j,d} - \mathbb{E}[x_{j,d}]\big)\Big| \le \sqrt{\frac{F \log(2F/\delta)}{2 D_{ii} \|x\|_{\psi_2}}}. \quad (13)$$

Finally, we plug the weight matrix into the inequality:

$$\big\|f_\theta(x_i) - \mathbb{E}[f_\theta(x_i)]\big\| \le \sigma_{\max}(W)\, \Big\|\frac{1}{D_{ii}} \sum_j \big(x_j - \mathbb{E}[x_j]\big)\Big\|_2, \quad (14)$$

where $\sigma_{\max}$ is the largest singular value of the weight matrix.

Next, we perform a concentration analysis for the inner product. We first write down the detailed expression for each pair $i, j$:

$$s_{i,j} \equiv x_i^\top W^\top W x_j. \quad (15)$$

We first bound $x_i^\top x_j$. Because $x_i$ and $x_j$ are independently sampled from an identical distribution, the product $x_i^\top x_j$ is sub-exponential. This can be seen from the Orlicz norm relation $\|x^2\|_{\psi_1} = (\|x\|_{\psi_2})^2$, where $\|x^2\|_{\psi_1}$ is the sub-exponential norm of $x^2$. Then, by Hoeffding's inequality for sub-exponential variables, with probability at least $1 - \delta$, we have

$$\Big|x_i^\top x_j - \mathbb{E}_{x_i \sim P_{y_i},\, x_j \sim P_{y_j}}[x_i^\top x_j]\Big| \le \sqrt{\frac{\sigma^2_{\max}(W^\top W) \log(2/\delta)}{2 \|x^2\|_{\psi_1}}}. \quad (16)$$

Because the aggregated feature is normalized by the degree of the corresponding node, we have, for each pair $i, j$,

$$\big|s_{i,j} - \mathbb{E}[s_{i,j}]\big| \le \sqrt{\frac{\log(2/\delta)\, \sigma^2_{\max}(W^\top W)}{2 \|x^2\|_{\psi_1} D_{ii} D_{jj}}} \le \sqrt{\frac{\sigma^2_{\max}(W^\top W) \log(2/\delta)}{2 \|x^2\|_{\psi_1} D^2}},$$

where $D = \min_i D_{ii}$ for $i \in [1, N]$. Finally, we apply a union bound over all pairs $i, j$.
Then with probability at least $1 - \delta$ we have

$$\big|Z_i^\top Z_j - \mathbb{E}[Z_i^\top Z_j]\big| \le \sqrt{\frac{\sigma^2_{\max}(W^\top W) \log(2N^2/\delta)}{2 D^2 \|x^2\|_{\psi_1}}}.$$

C.2 PROOF OF LEMMA 2

To prove this lemma, we first introduce the concept of the probability adjacency matrix. For the transformed graph $\hat{G}$, we denote its probability adjacency matrix as $\hat{W}$, in which $\hat{w}_{ij} = \frac{1}{E} \cdot \hat{A}_{ij}$. $\hat{w}_{ij}$ can be understood as the probability that two nodes have an edge, and the weights sum to 1 because the total probability mass is 1: $\sum_{i,j} \hat{w}_{ij} = 1$, for $v_i, v_j \in V$. The corresponding symmetric normalized matrix is $\hat{W}_{sym} = \hat{D}_w^{-1/2} \hat{W} \hat{D}_w^{-1/2}$, with $\hat{D}_w = \mathrm{diag}([\hat{w}_1, \ldots, \hat{w}_N])$, where $\hat{w}_i = \sum_j \hat{w}_{ij}$. We then introduce the Matrix Factorization Loss, which is defined as:

$$\min_{F \in \mathbb{R}^{N \times K}} L_{mf}(F) := \big\|\hat{A}_{sym} - F F^\top\big\|_F^2.$$

By the classical theory of low-rank approximation, the Eckart-Young-Mirsky theorem (Eckart & Young, 1936), any minimizer $\hat{F}$ of $L_{mf}(F)$ contains a scaling of the smallest eigenvectors of $L_{sym}$ (equivalently, the largest eigenvectors of $\hat{A}_{sym}$) up to a right transformation by some orthonormal matrix $R \in \mathbb{R}^{K \times K}$. We have $\hat{F} = F^* \cdot \mathrm{diag}\big([\sqrt{1 - \lambda_1}, \ldots, \sqrt{1 - \lambda_K}]\big) R$, where $F^* = [u_1, u_2, \cdots, u_K] \in \mathbb{R}^{N \times K}$. To prove Lemma 2, we first present Lemma 3.

Lemma 3 For the transformed graph, its probability adjacency matrix $\hat{W}$ and adjacency matrix $\hat{A}$ are equal after symmetric normalization, $\hat{W}_{sym} = \hat{A}_{sym}$.

Proof. For any two nodes $v_i, v_j \in V$ with $i \neq j$, we denote the element in the $i$-th row and $j$-th column of the matrix $\hat{W}_{sym}$ as $\hat{W}_{sym}^{ij}$. Then

$$\hat{W}_{sym}^{ij} = \frac{1}{\sqrt{\sum_k \hat{w}_{ik}} \sqrt{\sum_k \hat{w}_{kj}}} \cdot \frac{1}{E} \hat{A}_{ij} = \frac{1}{\sqrt{\sum_k \hat{A}_{ik}} \sqrt{\sum_k \hat{A}_{kj}}} \hat{A}_{ij} = \hat{A}_{sym}^{ij}. \quad (20)$$

Leveraging Lemma 3, we present the proof of Lemma 2.

Proof of Lemma 2. We start from the matrix factorization loss over $\hat{A}_{sym}$ to show the equivalence.
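$$\begin{aligned}
\big\|\hat{A}_{sym} - F F^\top\big\|_F^2 &= \big\|\hat{W}_{sym} - F F^\top\big\|_F^2 = \sum_{ij} \Big(\frac{\hat{w}_{ij}}{\sqrt{\hat{w}_i} \sqrt{\hat{w}_j}} - f_{mf}(v_i)^\top f_{mf}(v_j)\Big)^2 \\
&= \sum_{ij} \big(f_{mf}(v_i)^\top f_{mf}(v_j)\big)^2 - 2 \sum_{ij} \frac{\hat{w}_{ij}}{\sqrt{\hat{w}_i} \sqrt{\hat{w}_j}} f_{mf}(v_i)^\top f_{mf}(v_j) + \big\|\hat{W}_{sym}\big\|_F^2 \\
&= \sum_{ij} \hat{w}_i \hat{w}_j \Big[\Big(\frac{1}{\sqrt{\hat{w}_i}} \cdot f_{mf}(v_i)\Big)^\top \Big(\frac{1}{\sqrt{\hat{w}_j}} \cdot f_{mf}(v_j)\Big)\Big]^2 - 2 \sum_{ij} \hat{w}_{ij} \Big(\frac{1}{\sqrt{\hat{w}_i}} \cdot f_{mf}(v_i)\Big)^\top \Big(\frac{1}{\sqrt{\hat{w}_j}} \cdot f_{mf}(v_j)\Big) + C,
\end{aligned}$$

where $f_{mf}(v_i)$ is the $i$-th row of the embedding matrix $F$. The weight $\hat{w}_i$ can be understood as the node selection probability, which is proportional to the node degree. We can therefore define the corresponding sampling distribution as $P_{deg}$. If and only if $\sqrt{\hat{w}_i} \cdot F_\psi(v_i) = f_{mf}(v_i) = F_i$, we have:

$$\mathbb{E}_{v_i \sim P_{deg},\, v_j \sim P_{deg}}\Big[\big(F_\psi(v_i)^\top F_\psi(v_j)\big)^2\Big] - 2\, \mathbb{E}_{v_i \sim Uni(V),\, v_{i^+} \sim Uni(N(v_i))}\Big[F_\psi(v_i)^\top F_\psi(v_{i^+})\Big] + C,$$

where $N(v_i)$ denotes the neighbor set of node $v_i$ and $Uni(\cdot)$ is the uniform distribution over the given set. Because we construct the transformed graph by selecting the top-$K_{pos}$ nodes for each node, all nodes have the same degree. We can further simplify the objective as:

$$\mathbb{E}_{v_i \sim Uni(V),\, v_j \sim Uni(V)}\Big[\big(Z_i^\top Z_j\big)^2\Big] - 2\, \mathbb{E}_{v_i \sim Uni(V),\, v_{i^+} \sim Uni(S^i_{pos})}\Big[Z_i^\top Z_{i^+}\Big] + C.$$

Due to the node selection procedure, the factor $\sqrt{\hat{w}_i}$ is a constant and can be absorbed by the neural network $F_\psi$. Then, because $Z_i = F_\psi(v_i)$, we obtain Equation 23. Therefore, the minimizer of the matrix factorization loss is equivalent to the minimizer of the contrastive loss.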
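Lemma 3 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the 5-node adjacency matrix is a hypothetical toy graph, and the normalizing constant is taken to be the sum of all adjacency entries so that the probability weights sum to 1.

```python
import math


def sym_normalize(M):
    """Return D^{-1/2} M D^{-1/2}, where D = diag(row sums of M)."""
    n = len(M)
    row_sum = [sum(M[i]) for i in range(n)]
    return [[M[i][j] / math.sqrt(row_sum[i] * row_sum[j]) for j in range(n)]
            for i in range(n)]


# Toy symmetric adjacency matrix without isolated nodes (hypothetical example).
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]

# Probability adjacency matrix: scale A so that all weights sum to 1.
total = sum(sum(row) for row in A)
W = [[a / total for a in row] for row in A]

A_sym = sym_normalize(A)
W_sym = sym_normalize(W)

# Largest entrywise difference between the two normalized matrices.
max_gap = max(abs(A_sym[i][j] - W_sym[i][j])
              for i in range(len(A)) for j in range(len(A)))
print(max_gap)
```

The global scaling factor cancels inside the symmetric normalization, so the two matrices agree up to floating-point error, which is the content of Lemma 3.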

C.3 PROOF OF THEOREM 1

Proof. Now we provide the proof for the inner product of embeddings. In particular, when we can achieve near-zero contrastive learning loss, i.e., $L_{mf}(F) = 0$, the minimizer $F^*$ of $L_{mf}(F)$ contains a scaling of the smallest eigenvectors of $L_{sym}$ (equivalently, the largest eigenvectors of $\hat{A}_{sym}$), $F^* = [u_1, u_2, \cdots, u_K] \in \mathbb{R}^{N \times K}$, according to the Eckart-Young-Mirsky theorem (Eckart & Young, 1936). Recall that $y \in \{1, -1\}^N \subset \mathbb{R}^{N \times 1}$ is the label vector of all nodes. We then show that there is a constraint on the quadratic form with respect to the optimal classifier. According to graph spectral theory (Chung & Graham, 1997), the quadratic form

$$y^\top L_{sym}\, y = \frac{1}{2} \sum_{i,i'} (\hat{A}_{sym})_{i,i'} (y_i - y_{i'})^2$$

captures the amount of edges connecting different labels. Furthermore, suppose that the expected homophily over the distribution of graph features and labels, i.e., $y \sim P(y_i)$, $x \sim P_y(x)$, through similarity selection satisfies $\mathbb{E}[h_{edge}(\hat{G})] = 1 - \phi$. Here $\phi = \mathbb{E}_{v_i, v_j \sim Uni(V)}\big(\hat{A}_{i,j} \cdot \mathbb{1}[y_i \neq y_j]\big)$. Since we have defined $\phi$ as the density of edges that connect different labels, we can directly show that $y^\top L_{sym}\, y \le \phi N$. We then expand the above expression according to $L_{sym} = I - \hat{A}_{sym}$:

$$y^\top L_{sym}\, y = y^\top (I - \hat{A}_{sym})\, y = N - y^\top \hat{A}_{sym}\, y.$$

Next, we consider the situation in which we achieve the optimal solution of the contrastive loss, $Z = \arg\min L_{\text{SP-GCL}}$. Then we have $L_{mf}(F^*) = 0$, which implies that $\hat{A}_{sym} = F^* (F^*)^\top$. As analyzed in Section C.2, we have $Z = F^*$ in this case. Furthermore, we have

$$\begin{aligned}
y^\top L_{sym}\, y &= y^\top (I - \hat{A}_{sym})\, y = N - y^\top Z Z^\top y = N - y^\top \big(Z_i^\top Z_j\big)_{(i,j) \in N \times N}\, y \\
&= N - \Big(\sum_{i,j} Z_i^\top Z_j \big|_{y_i = y_j} - \sum_{i,j} Z_i^\top Z_j \big|_{y_i \neq y_j}\Big).
\end{aligned}$$

This leads to

$$\frac{1}{N} \Big(\sum_{i,j} Z_i^\top Z_j \big|_{y_i = y_j} - \sum_{i,j} Z_i^\top Z_j \big|_{y_i \neq y_j}\Big) = \mathbb{E}\big[Z_i^\top Z_j \mid y_i = y_j\big] - \mathbb{E}\big[Z_i^\top Z_j \mid y_i \neq y_j\big] \ge 1 - \phi.$$

Lemma 4 For a graph $G$, let $f^*_{mf} \in \arg\min_{f_{mf}: V \to \mathbb{R}^K} L_{mf}(F)$ be a minimizer of the matrix factorization loss $L_{mf}(F)$, where $F_i = f_{mf}(v_i)$. Then, for any label $y$, there exists a linear classifier $B^* \in \mathbb{R}^{c \times K}$ with norm $\|B^*\|_F \le 1/(1 - \lambda_K)$ such that

$$\mathbb{E}_{v_i} \big\|\vec{y}_i - B^* f^*_{mf}(v_i)\big\|_2^2 \le \frac{\phi^y}{\lambda_{K+1}},$$

where $\vec{y}_i$ is the one-hot embedding of the label of node $v_i$. The difference between labels of connected data points is measured by $\phi^y$,

$$\phi^y := \frac{1}{E} \sum_{v_i, v_j \in V} \hat{A}_{ij} \cdot \mathbb{1}[y_i \neq y_j].$$

Proof of Theorem 2. This proof is a direct summary of the lemmas established in the previous sections. By Lemma 2 and Lemma 4, we have

$$\mathbb{E}_{v_i} \big\|\vec{y}_i - B^* f^*_{gcl}(v_i)\big\|_2^2 \le \frac{\phi^y}{\lambda_{K+1}}, \quad (27)$$

where $\lambda_i$ is the $i$-th smallest eigenvalue of the Laplacian matrix $L_{sym} = I - \hat{A}_{sym}$. Note that $\phi^y$ in Lemma 4 equals $1 - h_{edge}$. We then apply Theorem 1 and Lemma 1 to $h_{edge}$ to conclude the proof of the bound on $\mathbb{E}_{v_i} \|\vec{y}_i - B^* f^*_{gcl}(v_i)\|_2^2$.

Recall that the concentration inequality for the inner product of embeddings in Lemma 1 is:

$$\big|Z_i^\top Z_j - \mathbb{E}[Z_i^\top Z_j]\big| \le \sqrt{\frac{\sigma^2_{\max}(W^\top W) \log(2N^2/\delta)}{2 D^2 \|x^2\|_{\psi_1}}}.$$

As shown by Lemma 1, the accuracy of using the graph encoder to obtain the similarity between two nodes is determined by the maximum singular value of the weight of the graph encoder. In other words, the quality of the transformed graph is guaranteed if the maximum singular value of the weight of the graph encoder is bounded. When initializing the network, we generally use Gaussian initialization, such as Kaiming and LeCun initialization (He et al., 2015; LeCun et al., 2012). Then $W^\top W$ is a Wishart matrix, and its limiting spectral density can be shown to be the Marchenko-Pastur distribution (Wigner, 1993), whose largest singular value is bounded when $K/F$ is bounded, where $W \in \mathbb{R}^{F \times K}$. As a result, the quality of the transformed graph can be guaranteed with high probability under the condition that $K/F$ is bounded. On the other hand, we assume that for node $i$, its neighbors' labels are independently sampled from a distribution $P(y_i)$. This means that the probability of an edge between two nodes is fixed given the labels. In this case, the degree scales linearly with the number of nodes $N$. In the large-graph limit, $D$ tends to infinity, and thus our bound becomes tighter.
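The role of the degree in the bound can be illustrated with a small simulation of mean aggregation. This is a sketch under simplifying assumptions that are not part of the paper: neighbor features are drawn i.i.d. from a unit-variance Gaussian around a hypothetical class mean, and the weight matrix is omitted, which by Eq. (14) would only rescale the deviation by at most $\sigma_{\max}(W)$.

```python
import math
import random


def mean_aggregation_deviation(degree, num_features, rng):
    """L2 distance between the mean of `degree` neighbor features and its expectation.

    Assumption: each neighbor feature is drawn i.i.d. from N(mu, I), a simple
    sub-Gaussian stand-in for the class-conditional distribution P_y(x).
    """
    mu = [1.0] * num_features  # hypothetical class mean
    agg = [0.0] * num_features
    for _ in range(degree):
        for d in range(num_features):
            agg[d] += rng.gauss(mu[d], 1.0)
    agg = [a / degree for a in agg]
    return math.sqrt(sum((a - m) ** 2 for a, m in zip(agg, mu)))


rng = random.Random(0)
num_features, trials = 8, 50
avg_dev = {}
for degree in (10, 100, 1000):
    devs = [mean_aggregation_deviation(degree, num_features, rng)
            for _ in range(trials)]
    avg_dev[degree] = sum(devs) / trials
    print(f"degree={degree:5d}  average deviation={avg_dev[degree]:.4f}")
```

In this toy setting the average deviation shrinks as the degree grows, which matches the qualitative statement that the bound tightens in the large-graph limit.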

| Method | Topology Aug. | Feature Aug. |
| DGI (Velickovic et al., 2019) | Node Shuffling | - |
| GMI (Peng et al., 2020) | Node Shuffling | - |
| MVGRL (Hassani & Khasahmadi, 2020) | Diffusion | - |
| GCC (Qiu et al., 2020) | Subgraph | - |
| GraphCL (You et al., 2020) | Multiple * | Feature Dropout |
| GRACE (Zhu et al., 2020d) | Edge Removing | Feature Masking |
| GCA (Zhu et al., 2021b) | Edge Removing | Feature Masking |
| BGRL (Grill et al., 2020) | Edge | |

In Lemma 1, we theoretically analyze the concentration property of aggregated features of graphs following the Graph Assumption (Section 4.1). The concentration property has also been empirically observed in some other recent works (Wang et al., 2022b; Trivedi et al., 2022). Specifically, in the Figure 1 of Wang et al. (2022b), a t-SNE visualization of node representations of the Amazon-Photo dataset shows that the representations obtained by a randomly initialized untrained SGC (Wu et al., 2019) model will concentrate to a certain area if they are from the same class. This observation also provides an explanation for the salient performance achieved by SP-GCL on the Amazon-Photo, as shown in Table 1.

E.2 LIMITATION AND FUTURE WORK

Our work is only a first step in understanding the possibility of single-pass graph contrastive learning and the corresponding performance guarantee on both homophilic and heterophilic graphs. There is still a lot of future work to be done. Below we indicate three questions that need to be addressed.

• Our theoretical framework is built upon the graph assumption (Section 4.1), in which we assume the neighbor pattern. The graph assumption includes the homophilic graph and the "benign" heterophilic graph. However, there could exist graphs in which the neighbor pattern is messy or arbitrarily distributed. Understanding graph contrastive learning on those graphs is still an open question.

• Recently, several graph neural networks (Zhu et al., 2020b; Chien et al., 2020; Luan et al., 2021; Li et al., 2022) have been proposed to work on heterophilic graphs. Whether further improvement can be achieved by combining our method with those GNNs is still an open question.

• To verify the effectiveness of our theoretical findings, we keep the implementation simple. Whether better performance can be achieved by involving more complex techniques still needs to be explored.



https://github.com/CUAI/Non-Homophily-Large-Scale



$h_{edge} = \frac{|\{(v_i, v_j) : (v_i, v_j) \in E \wedge y_i = y_j\}|}{E}$. And the node homophily (Pei et al., 2020) is defined as $h_{node} = \frac{1}{N} \sum_{v_i \in V} \frac{|\{v_j : (v_i, v_j) \in E \wedge y_i = y_j\}|}{|\{v_j : (v_i, v_j) \in E\}|}$.
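The two homophily measures above can be computed directly from an edge list. The sketch below is illustrative only: the function names, the undirected edge-list representation, and the 4-node toy graph are assumptions made for the example, and nodes without neighbors are skipped in the node homophily since their ratio is undefined.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def node_homophily(edges, labels):
    """Average, over nodes with neighbors, of the fraction of same-label neighbors."""
    neighbors = {}
    for u, v in edges:  # treat each edge as undirected
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    ratios = [
        sum(1 for n in nbrs if labels[n] == labels[v]) / len(nbrs)
        for v, nbrs in neighbors.items()
    ]
    return sum(ratios) / len(ratios)


# Toy 4-cycle with two classes (hypothetical example).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = [0, 0, 1, 1]
h_edge = edge_homophily(edges, labels)
h_node = node_homophily(edges, labels)
print(h_edge, h_node)
```

On this toy cycle, two of the four edges connect same-label nodes, and every node has one same-label neighbor out of two, so both measures evaluate to 0.5.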

Figure 1: Transformed Graph formed with Positive Pairs.

Figure 2: Overview of SP-GCL. The graph data is encoded by a graph neural network f θ and a following projection head g ω . The contrastive pairs are constructed based on the representation Z.

Figure 3: Average cosine distance between node embeddings and their corresponding class centers.

Figure 4: The Node Cover Ratio and Overlapped Selection Ratio.

Figure 5: The effect of top-K positive sampling on the Performance.

Figure 6: The effect of T -hop neighborhood on the Performance.

et al. (2021)  presented the following theoretical guarantee for the model learned with the matrix factorization loss.

Graph Contrastive Learning on Homophilic Graphs. The highest performance of unsupervised models is highlighted in boldface. OOM indicates Out-Of-Memory on a 32GB GPU.

Graph Contrastive Learning on Heterophilic Graphs. The highest performance of unsupervised models is highlighted in boldface. OOM indicates Out-Of-Memory on a 32GB GPU.

Computational requirements on a set of standard benchmark graphs. OOM indicates running out of memory on a 32GB GPU.

The performance of SP-GCL with different hidden dimension. The average accuracy over 10 runs is reported.

True positive ratio of selected edges at the end of training. The minimal, maximal, average, standard deviation of 20 runs are presented.

Edge homophily of the transformed graph at the initial stage. Homophily value of original graphs is shown in parentheses.

Statistics of datasets used in experiments.

Hyperparameter settings for all experiments.

Summary of graph augmentations used by representative GCL models. Multiple * denotes multiple augmentation methods including edge removing, edge adding, node dropping and subgraph induced by random walks.

