LINK PREDICTION WITH NON-CONTRASTIVE LEARNING

Abstract

Graph neural networks (GNNs) are prominent in the graph machine learning domain, owing to their strong performance across various tasks. A recent focal area is the space of graph self-supervised learning (SSL), which aims to derive useful node representations without labeled data. Notably, many state-of-the-art graph SSL approaches are contrastive methods, which use a combination of positive and negative samples to learn node representations. Owing to challenges in negative sampling (slowness and model sensitivity), recent literature introduced non-contrastive methods, which instead only use positive samples. Though such methods have shown promising performance in node-level tasks, their suitability for link prediction tasks, which are concerned with predicting link existence between pairs of nodes (and have broad applicability to recommendation systems contexts), is yet unexplored. In this work, we extensively evaluate the performance of existing non-contrastive methods for link prediction in both transductive and inductive settings. While most existing non-contrastive methods perform poorly overall, we find that, surprisingly, BGRL generally performs well in transductive settings. However, it performs poorly in the more realistic inductive settings, where the model has to generalize to links to/from unseen nodes. We find that non-contrastive models tend to overfit to the training graph and use this analysis to propose T-BGRL, a novel non-contrastive framework that incorporates cheap corruptions to improve the generalization ability of the model. This simple modification strongly improves inductive performance in 5/6 of our datasets, with up to a 120% improvement in Hits@50, all with comparable speed to other non-contrastive baselines and up to 14× faster than the best-performing contrastive baseline. Our work imparts interesting findings about non-contrastive learning for link prediction and paves the way for future researchers to further expand upon this area.

1. INTRODUCTION

Graph neural networks (GNNs) are ubiquitously used modeling tools for relational graph data, with widespread applications in chemistry (Chen et al., 2019; Guo et al., 2021; 2022a; Liu et al., 2022), forecasting and traffic prediction (Derrow-Pinion et al., 2021; Tang et al., 2020), recommendation systems (Ying et al., 2018b; He et al., 2020; Sankar et al., 2021; Tang et al., 2022; Fan et al., 2022), graph generation (You et al., 2018; Fan & Huang, 2019; Shiao & Papalexakis, 2021), and more. Given significant challenges in obtaining labeled data, one particularly exciting recent direction is the advent of graph self-supervised learning (SSL), which aims to learn representations useful for various downstream tasks without using explicit supervision besides available graph structure and node features (Zhu et al., 2020; Jin et al., 2021; Thakoor et al., 2022; Bielak et al., 2022). One prominent class of graph SSL approaches are contrastive methods (Jin et al., 2020). These methods typically utilize contrastive losses such as InfoNCE (Oord et al., 2018) or margin-based losses (Ying et al., 2018b) between node and negative sample representations. However, such methods usually require either many negative samples (Hassani & Ahmadi, 2020) or carefully chosen ones (Ying et al., 2018b; Yang et al., 2020), where the former results in a quadratic number of in-batch comparisons, and the latter is especially expensive on graphs since we often store the sparse adjacency matrix instead of its dense complement (Thakoor et al., 2022; Bielak et al., 2022). These drawbacks motivated the development of non-contrastive methods (Thakoor et al., 2022; Bielak et al., 2022; Zhang et al., 2021; Kefato & Girdzijauskas, 2021), based on advances in the image domain (Grill et al., 2020; Chen & He, 2021; Chen et al., 2020), which do not require negative samples and rely solely on augmentations.
This allows for a large speedup compared to their contrastive counterparts while maintaining strong performance (Bielak et al., 2022; Zhang et al., 2021). However, non-contrastive SSL methods are typically evaluated on node-level tasks, which are a more direct analog of image classification in the graph domain. In comparison, the link-level task (link prediction), which focuses on predicting link existence between pairs of nodes, is largely overlooked. This presents a critical gap in understanding: Are non-contrastive methods suitable for link prediction tasks? When do they (not) work, and why? This gap presents a huge opportunity, since link prediction is a cornerstone in the recommendation systems community (He et al., 2020; Zhang & Chen, 2019; Berg et al., 2017). Present Work. To this end, our work first performs an extensive evaluation of non-contrastive SSL methods in link prediction contexts to discover the impact of different augmentations, architectures, and non-contrastive losses. We evaluate all of the (to the best of our knowledge) currently existing non-contrastive methods: CCA-SSG (Zhang et al., 2021), Graph Barlow Twins (GBT) (Bielak et al., 2022), and Bootstrapped Graph Latents (BGRL) (Thakoor et al., 2022) (which has the same design as the independently proposed SelfGNN (Kefato & Girdzijauskas, 2021)). We also compare these methods against a baseline end-to-end GCN (Kipf & Welling, 2017) with cross-entropy loss, and two contrastive baselines: GRACE (Zhu et al., 2020), and a GCN trained with max-margin loss (Ying et al., 2018a). We evaluate the methods in the transductive setting and find that BGRL (Thakoor et al., 2022) greatly outperforms not only the other non-contrastive methods, but also GRACE, a strong augmentation-based contrastive model for node classification. Surprisingly, BGRL even performs on par with a margin-loss GCN (with the exception of 2/6 datasets).
However, in the more realistic inductive setting, which considers prediction between new edges and nodes at inference time, we observe a huge gap in performance between BGRL and a margin-loss GCN (ML-GCN). Upon investigation, we find that BGRL is unable to sufficiently push apart the representations of negative links from positive links when new nodes are introduced, owing to a form of overfitting. To address this, we propose T-BGRL, a novel non-contrastive method which uses a corruption function to generate cheap "negative" samples, without performing the expensive negative sampling step of contrastive methods. We show that it greatly reduces overfitting tendencies and outperforms existing non-contrastive methods across 5/6 datasets in the inductive setting. We also show that it maintains comparable speed with BGRL and is 14× faster than the margin-loss GCN on the Coauthor-Physics dataset.
Main Contributions. In short, our main contributions are as follows:
• To the best of our knowledge, this is the first work to explore link prediction with non-contrastive SSL methods.
• We show that, perhaps surprisingly, BGRL (an existing non-contrastive model) works well in transductive link prediction, with performance on par with contrastive baselines, implicitly behaving similarly to other contrastive models in pushing apart positive and negative node pairs.
• We show that non-contrastive SSL models underperform their contrastive counterparts in the inductive setting, and that they generalize poorly due to a lack of negative examples.
• Equipped with this understanding, we propose T-BGRL, a novel non-contrastive method that uses cheap "negative" samples to improve generalization. T-BGRL is simple to implement, very efficient when compared to contrastive methods, and improves on BGRL's inductive performance in 5/6 datasets, making it at or above par with the best contrastive baselines.

2. PRELIMINARIES

Notation. We denote a graph as G = (V, E), where V is the set of n nodes (i.e., n = |V|) and E ⊆ V × V is the set of edges. Let the node-wise feature matrix be denoted by X ∈ R^{n×f}, where f is the number of raw features, and let its i-th row x_i be the feature vector for the i-th node. Let A ∈ {0, 1}^{n×n} denote the binary adjacency matrix. We denote the graph's learned node representations as H ∈ R^{n×d}, where d is the size of the latent dimension and h_i is the representation for the i-th node. Let Y ∈ {0, 1}^{n×n} be the desired output for link prediction, as E and A may have validation and test edges masked off. Similarly, let Ŷ ∈ {0, 1}^{n×n} be the output predicted by the decoder for link prediction. Let ORC be a perfect oracle function for our link prediction task, i.e., ORC(A, X) = Y. Let NEIGH(u) = {v | (u, v) ∈ E ∨ (v, u) ∈ E}. Note that we use the terms "embedding" and "representation" interchangeably in this work.
GNNs for Link Prediction. Many new approaches have also been developed with the recent advent of graph neural networks (GNNs). A predominant paradigm is the use of node-embedding-based methods (Hamilton et al., 2017; Berg et al., 2017; Ying et al., 2018b; Zhao et al., 2022b). Node-embedding-based methods typically consist of an encoder H = ENC(A, X) and a decoder DEC(H). The encoder model is typically a message-passing-based graph neural network (GNN) (Kipf & Welling, 2017; Hamilton et al., 2017; Zhang et al., 2020). The message-passing iterations of a GNN for a node u can be described as follows:

h_u^{(k+1)} = UPDATE^{(k)}( h_u^{(k)}, AGGREGATE^{(k)}({h_v^{(k)}, ∀v ∈ NEIGH(u)}) )   (1)

where UPDATE and AGGREGATE are differentiable functions, and h_u^{(0)} = x_u. The decoder model is usually an inner product or an MLP applied on a concatenation or Hadamard product of the source and target learned node representations (Rendle et al., 2020; Wang et al., 2021).
Graph SSL.
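Before turning to the SSL definitions below, the message-passing iteration in Eqn. (1) can be made concrete with a small NumPy sketch. The mean AGGREGATE and the linear-plus-ReLU UPDATE here are illustrative assumptions, not the exact architecture used in any of the evaluated models:

```python
import numpy as np

def message_passing_step(A, H, W_self, W_neigh):
    """One GNN message-passing iteration (Eqn. 1): AGGREGATE is a mean over
    neighbor representations; UPDATE is a linear map of self and aggregated
    messages followed by ReLU (both choices are for illustration only)."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # avoid divide-by-zero
    agg = (A @ H) / deg                                  # mean over NEIGH(u)
    return np.maximum(H @ W_self + agg @ W_neigh, 0.0)   # UPDATE with ReLU
```

Stacking k such steps gives a k-layer encoder ENC(A, X) whose output H feeds the decoder.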
Below, we define a few terms used throughout our work, which help set the context for our discussion.
Definition 2.1 (Augmentation). An augmentation AUG⁺ is a label-preserving random transformation function AUG⁺ : (A, X) → (Ã, X̃) that does not change the oracle's expected value: E[ORC(AUG⁺(A, X))] = Y.
Definition 2.2 (Corruption). A corruption AUG⁻ is a label-altering random transformation AUG⁻ : (A, X) → (Ǎ, X̌) that changes the oracle's expected value: E[ORC(AUG⁻(A, X))] ≠ Y.
Definition 2.3 (Contrastive Learning). Contrastive methods select anchor samples (e.g., nodes) and then compare those samples to both positive samples (e.g., neighbors) and negative samples (e.g., non-neighbors) relative to those anchor samples.
Definition 2.4 (Non-Contrastive Learning). Non-contrastive methods select anchor samples, but only compare those samples to variants of themselves, without leveraging other samples in the dataset.
BGRL. While we examine the performance of all of the non-contrastive graph models, we focus our detailed analysis exclusively on BGRL (Thakoor et al., 2022) due to its superior performance in link prediction when compared to GBT (Bielak et al., 2022) and CCA-SSG (Zhang et al., 2021). BGRL consists of two encoders: the online encoder ENC_θ and the target encoder ENC_φ. BGRL also incorporates a predictor PRED (typically an MLP) and two augmentations, AUG⁺_1 and AUG⁺_2. A single training step of BGRL proceeds as follows: (a) we apply the augmentations: (Ã^(1), X̃^(1)) = AUG⁺_1(A, X) and (Ã^(2), X̃^(2)) = AUG⁺_2(A, X); (b) we perform forward propagation: H^(1) = ENC_θ(Ã^(1), X̃^(1)) and H^(2) = ENC_φ(Ã^(2), X̃^(2)); (c) we pass the online output through the predictor: Z = PRED(H^(1)); (d) we use the mean pairwise cosine distance between Z and H^(2) as the loss (see Eqn. 2); (e) ENC_θ is updated via backpropagation, and ENC_φ is updated via an exponential moving average (EMA) of ENC_θ.
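Step (e)'s EMA update can be sketched in a few lines. The parameter-dictionary representation and the decay value tau=0.99 are assumptions for illustration:

```python
import numpy as np

def ema_update(phi, theta, tau=0.99):
    """Step (e): the target encoder's parameters `phi` slowly track the
    online encoder's parameters `theta` via an exponential moving average.
    `phi` and `theta` map parameter names to arrays of matching shape."""
    return {name: tau * phi[name] + (1.0 - tau) * theta[name] for name in phi}
```

Because `phi` is never updated by gradient descent, the target branch provides a slowly moving regression target rather than a second set of trainable parameters.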
The BGRL loss is as follows:

L_BGRL = -(2/n) Σ_{i=0}^{n-1} (z_i · h_i^(2)) / (‖z_i‖ ‖h_i^(2)‖)   (2)

In the next section, we evaluate BGRL and other non-contrastive link prediction methods against contrastive baselines.
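Given the predictor outputs Z and target embeddings H^(2) stacked as rows, the BGRL loss is a negative mean cosine similarity, which can be sketched directly in NumPy:

```python
import numpy as np

def bgrl_loss(Z, H2):
    """BGRL loss (Eqn. 2): negative mean cosine similarity, scaled by 2,
    between predictor outputs z_i and target embeddings h_i^(2)."""
    num = (Z * H2).sum(axis=1)
    den = np.linalg.norm(Z, axis=1) * np.linalg.norm(H2, axis=1)
    return float(-2.0 * np.mean(num / den))
```

The loss reaches its minimum of -2 when every pair of rows is perfectly aligned.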

3. DO NON-CONTRASTIVE LEARNING METHODS PERFORM WELL ON LINK PREDICTION TASKS?

Several non-contrastive methods have been proposed and have shown effectiveness in node classification (Kefato & Girdzijauskas, 2021; Thakoor et al., 2022; Zhang et al., 2021; Bielak et al., 2022) . However, none of these methods evaluate or target link prediction tasks. We thus aim to answer the following questions: First, how well do these methods work for link prediction compared to existing contrastive/end-to-end baselines? Second, do they work equally well in both transductive and inductive settings? Finally, if they do work, why; if not, why not? Differences from Node Classification. Link prediction differs from node classification in several key aspects. First, we must consider the embedding of both the source and destination nodes. Second, we have a much larger set of candidates for the same graph-O(n 2 ) instead of O(n). Finally, in real applications, link prediction is usually treated as a ranking problem, where we want positive links to be ranked higher than negative links, rather than as a classification problem, e.g. in recommendation systems, where we want to retrieve the top-k most likely links (Cremonesi et al., 2010; Hubert et al., 2022) . We discuss this in more detail in Section 3.1 below. Given these differences, it is unclear if methods performing well on node classification naturally perform well on link prediction tasks. Ideal Link Prediction. What does it mean to perform well on link prediction? We clarify this point here. For some nodes u, v, w ∈ V, let (u, v) ∈ E and (u, w) ̸ ∈ E. Then, an ideal encoder for link prediction would have DIST(h u , h v ) < DIST(h u , h w ) for some distance function DIST. This idea is the core motivation behind margin-loss-based models (Ying et al., 2018a; Hamilton et al., 2017) .
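The ideal-encoder condition above amounts to a one-line check per triple; Euclidean distance is one illustrative choice of DIST:

```python
import numpy as np

def satisfies_ideal_ranking(h_u, h_v, h_w):
    """True iff DIST(h_u, h_v) < DIST(h_u, h_w), i.e., the positive pair
    (u, v) embeds closer to the anchor than the negative pair (u, w).
    Euclidean distance is used here purely for illustration."""
    return np.linalg.norm(h_u - h_v) < np.linalg.norm(h_u - h_w)
```

An encoder that satisfies this for most triples will rank true links above non-links, which is exactly what Hits@k-style evaluation rewards.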

3.1. EVALUATION

Datasets. We use datasets from three different domains: citation networks, co-authorship networks, and co-purchase networks. We use the Cora and Citeseer citation networks (Sen et al., 2008) , the Coauthor-CS and Coauthor-Physics co-authorship networks, and the Amazon-Computers and Amazon-Photos co-purchase networks (Shchur et al., 2018) . We include dataset statistics in Appendix A.1.

Metric.

Following work in the heterogeneous information network (Chen et al., 2018), knowledge graph (Lin et al., 2015), and recommendation systems (Cremonesi et al., 2010; Hubert et al., 2022) communities, we choose to use Hits@k over AUC-ROC metrics, since we often empirically prioritize ranking candidate links from a selected node context (e.g., ranking the probability that user A will buy item B, C, or D), as opposed to arbitrarily ranking a randomly chosen positive link over a randomly chosen negative one (e.g., whether user A buying item B is more likely than user C buying item D). We report Hits@50 (k = 50) to strike a balance between the smaller datasets like Cora and the larger datasets like Coauthor-Physics. However, for completeness of the evaluation, we also include AUC-ROC results in Appendix A.8.
Decoder. Since our goal is to evaluate the performance of the encoder, we use the same decoder for all of our experiments across all of the methods. The choice of decoder has also been previously studied (Wang et al., 2021; 2022), so we use the best-performing decoder: a Hadamard-product MLP. For a candidate link (u, v), we have Ŷ_{uv} = DEC(h_u * h_v), where * represents the Hadamard (element-wise) product and DEC is a two-layer MLP (with 256 hidden units) followed by a sigmoid. For the self-supervised methods, we first train the encoder and freeze its weights before training the decoder. As a contextual baseline, we also report results on an end-to-end GCN (E2E-GCN), for which we train the encoder and decoder jointly, backpropagating a binary cross-entropy loss on link existence.
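For reference, a common way to compute Hits@k (the OGB-style convention, which we assume here) counts a positive link as a hit if its score exceeds the k-th highest negative score:

```python
import numpy as np

def hits_at_k(pos_scores, neg_scores, k=50):
    """Fraction of positive links scored strictly above the k-th highest
    negative score (OGB-style Hits@k; assumed convention)."""
    pos_scores = np.asarray(pos_scores)
    neg_scores = np.asarray(neg_scores)
    if len(neg_scores) < k:
        return 1.0  # every positive trivially lands in the top k
    threshold = np.sort(neg_scores)[-k]  # k-th highest negative score
    return float(np.mean(pos_scores > threshold))
```

Unlike AUC-ROC, this metric only rewards positives that beat nearly all negatives, which matches the top-k retrieval use case described above.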

3.1.1. TRANSDUCTIVE EVALUATION

Transductive Setting. We first evaluate the performance of the methods in the transductive setting, where we train on G_train = (V, E_train) for E_train ⊂ E, validate on G_val = (V, E_val) for E_val ⊂ (E − E_train), and test on G_test = (V, E_test) for E_test = E − E_train − E_val. Note that the same nodes are present in training, validation, and testing. We also do not introduce any new edges at inference time; inference is performed on E_train.
Results. The results of our evaluation are shown in Table 1. As expected, the end-to-end GCN generally performs the best across all of the datasets. We also find that CCA-SSG and GBT similarly perform poorly relative to the other methods. This is intuitive, as neither method was designed for link prediction, and both were only evaluated for node classification in their respective papers. Surprisingly, however, BGRL outperforms the ML-GCN (the strongest contrastive baseline) on 3/6 of the datasets and performs similarly on 1 other (Cora). It also outperforms GRACE across all of the datasets.
[Figure 1: While BGRL and ML-GCN behave similarly, the ML-GCN does a better job of ensuring that positive/negative links are well separated. These scores are computed on Amazon-Photos.]
Understanding BGRL Performance. Interestingly, we find that BGRL exhibits similar behavior to the ML-GCN on many datasets, despite the BGRL loss function (see Equation (2)) not explicitly optimizing for this. Relative to an anchor node u, we can express the max-margin loss of the ML-GCN as follows:

L(u) = E_{v∼NEIGH(u)} E_{w∼V∖NEIGH(u)} [J(u, v, w)]   (3)

where J(u, v, w) is the margin ranking loss for an anchor u, positive sample v, and negative sample w:

J(u, v, w) = max{0, h_u · h_w − h_u · h_v + Δ}   (4)

and Δ is a hyperparameter for the size of the margin. This explicitly optimizes for the aforementioned ideal link prediction behavior (anchor-aware ranking of positive over negative links).
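The margin ranking loss can be sketched as follows. It is written so that the loss is zero once the positive pair (u, v) outscores the negative pair (u, w) by at least the margin, as in PinSAGE-style losses; the value delta=0.5 is illustrative:

```python
import numpy as np

def margin_ranking_loss(h_u, h_v, h_w, delta=0.5):
    """Margin ranking loss for anchor u, positive v, negative w: zero once
    the positive inner-product score beats the negative score by `delta`."""
    pos_score = float(h_u @ h_v)
    neg_score = float(h_u @ h_w)
    return max(0.0, neg_score - pos_score + delta)
```

Averaging this over sampled (v, w) pairs per anchor gives the per-node loss L(u).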
Despite these circumstances, Figure 1 shows that both BGRL and ML-GCN clearly separate positive and negative samples, although ML-GCN pushes them further apart. We provide some intuition on why this may occur in Appendix A.10. Why Does BGRL Not Collapse? The loss function for BGRL (see Equation (2)) is 0 when h_i^(2) = 0 or z_i = 0, i.e., the loss is minimized when the model produces all-zero outputs. While theoretically possible, this is clearly undesirable behavior, since it does not result in useful embeddings. We refer to this case as model collapse. It is not fully understood why non-contrastive models do not collapse, but several explanations with both theoretical and empirical grounding have been proposed in the image domain. We discuss this more in Appendix A.9. Consistent with the findings of Thakoor et al. (2022), we find that collapse does not occur in practice (with reasonable hyperparameter selection).

Conclusion.

We find that CCA-SSG and GBT generally perform poorly compared to contrastive baselines. Surprisingly, we find that BGRL generally performs well in the transductive setting by successfully separating positive and negative link distance distributions. However, this setting may not be representative of real-world problems. In the next section, we evaluate the methods in the more realistic inductive setting to see if this performance holds.

3.1.2. INDUCTIVE EVALUATION

Inductive Setting. While we observe some promising results in favor of non-contrastive methods (namely, BGRL) in the transductive setting, we note that this setting is not entirely realistic. In practice, we often have both new nodes and edges introduced at inference time, after our model is trained. For example, consider a social network upon which a model is trained at some time t_1 but is used for inference (for a GNN, this refers to the message-passing step) at time t_2, where new users and friendships have been added to the network in the interim. The goal of a model run at time t_2 is then to predict any new links at the new network state t_3 (although we assume there are no new nodes introduced at that step, since we cannot compute the embedding of a node without performing inference on it first). To simulate this setting, we first partition the graph into two sets of nodes: "observed" nodes (seen during training) and "unobserved" nodes (used only for inference and testing). We then withhold a portion of the edges at each of the time steps t_3, t_2, and t_1 to serve as testing-only, inference-only, and training-only edges, respectively. We describe this process in more detail in Appendix A.4.
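The node partition step can be sketched as follows; the 80/20 observed/unobserved split and the function name are assumptions for illustration (see Appendix A.4 for the actual procedure):

```python
import numpy as np

def split_observed_unobserved(num_nodes, observed_frac=0.8, seed=0):
    """Partition node ids into 'observed' nodes (seen during training) and
    'unobserved' nodes (used only for inference and testing).
    The split fraction is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    cut = int(observed_frac * num_nodes)
    return perm[:cut], perm[cut:]
```

Edges are then bucketed by which time step (t_1, t_2, t_3) they belong to, so that test edges can touch unobserved nodes while training edges never do.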

Results.

Table 2 shows that in the inductive setting, BGRL is outperformed by the contrastive ML-GCN on all datasets. It still outperforms CCA-SSG and GBT, but it is much less competitive in the inductive setting. We next ask: what accounts for this large difference in performance? Why Does BGRL Not Work Well in the Inductive Setting? One possible reason for the poor performance of BGRL in the inductive setting is that it is unable to correctly differentiate unseen positives from unseen negatives, i.e., it is overfitting on the training graph. Intuitively, this could happen due to a lack of negative samples-BGRL never pushes samples away from each other. We show that this is indeed the case in Figure 2 , where BGRL's negative link score distribution has heavy overlap with its positive link score distribution. We can also see this behavior in Figure 1 where the ML-GCN does a clearly better job of pushing positive/negative samples far apart, despite BGRL's surprising success. Naturally, improving the separation between these distributions increases the chance of a correct prediction. We investigate this hypothesis in Section 4 below and propose T-BGRL (Figure 3 ), a novel method to help alleviate this issue.

4. IMPROVING INDUCTIVE PERFORMANCE IN A NON-CONTRASTIVE FRAMEWORK

In order to reduce this systematic gap in performance between the ML-GCN (the best-performing contrastive model) and BGRL (the best-performing non-contrastive model), we observe that we need to push negative and positive node pair representations further apart. This way, pairs involving new nodes, introduced at inference time, have a higher chance of being classified correctly. Contrastive methods utilize negative sampling for this purpose, but we wish to avoid negative sampling owing to its high computational cost. In lieu of this, we propose a simple yet powerfully effective idea below.
Model Intuition. To emulate the effect of negative sampling without actually performing it, we propose Triplet-BGRL (T-BGRL). In addition to the two augmentations performed during standard non-contrastive SSL training, we add a corruption to function as a cheap negative sample. For each node, like BGRL, we minimize the distance between its representations across the two augmentations. However, taking inspiration from triplet-style losses (Hoffer & Ailon, 2014), we also maximize the distance between the augmentation and corruption representations.
Model Design. Ideally, this model should not only perform better than BGRL in the inductive setting, but should also have the same time complexity as BGRL. To meet these expectations, we design efficient, linear-time corruptions (the same asymptotic runtime as the augmentations). We also reuse the existing encoder ENC_φ to generate embeddings for the corrupted graph, so that T-BGRL does not have any additional parameters. Figure 3 illustrates the overall architecture of the proposed T-BGRL, and Algorithm 1 presents PyTorch-style pseudocode. Our new proposed loss function is as follows:

L_T-BGRL = (λ/n) Σ_{i=0}^{n-1} (z_i · ȟ_i) / (‖z_i‖ ‖ȟ_i‖)  −  ((1−λ)/n) Σ_{i=0}^{n-1} (z_i · h_i^(2)) / (‖z_i‖ ‖h_i^(2)‖)   (5)

where the first term is the new T-BGRL (repulsion) term, the second term is the BGRL loss, and λ is a hyperparameter controlling the repulsive force between augmentation and corruption representations.
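The loss in Eqn. (5) translates directly into a few lines of NumPy, given the predictor outputs Z, target embeddings H2, and corrupted-view embeddings H_corrupt stacked as rows (lam=0.1 is an illustrative value):

```python
import numpy as np

def cosine_rows(A, B):
    """Row-wise cosine similarity between two matrices of embeddings."""
    num = (A * B).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return num / den

def t_bgrl_loss(Z, H2, H_corrupt, lam=0.1):
    """T-BGRL loss (Eqn. 5): repel the corrupted view (first term) while
    attracting the two augmented views (BGRL term); `lam` trades them off."""
    repel = lam * cosine_rows(Z, H_corrupt).mean()
    attract = (1.0 - lam) * cosine_rows(Z, H2).mean()
    return float(repel - attract)
```

Setting lam = 0 recovers the BGRL loss exactly, so the corruption branch is a strict, cheap extension.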
[Algorithm 1: PyTorch-style pseudocode for T-BGRL.]
Corruption Choice. We experiment with several different corruption methods, but limit ourselves to linear-time corruptions in order to maintain the efficiency of BGRL. We find that SHUFFLEFEATRANDOMEDGE(A, X) = (Ǎ, X̌), where Ǎ ∼ {0, 1}^{n×n} is a random adjacency matrix and X̌ = SHUFFLEROWS(X), works the best. We describe each of the corruptions we experimented with in Appendix A.7.
Inductive Results. Table 2 shows that T-BGRL improves inductive performance over BGRL in 5/6 datasets, with very large improvements on the Cora and Citeseer datasets. The only dataset where BGRL outperforms T-BGRL is Amazon-Photos. However, this gap is much smaller (a 0.01 difference in Hits@50) than the improvements on the other datasets. We plot the scores output by the decoder for unseen negative pairs compared to those for unseen positive pairs in Figure 2. We can see that T-BGRL pushes apart unseen negative and positive pairs much better than BGRL.
Transductive Results. We also evaluate the performance of T-BGRL in the transductive setting to ensure that it does not significantly reduce performance when compared to BGRL. See Table 3 for the results.
Difference from Contrastive Methods. While our method shares some similarities with contrastive methods, we believe T-BGRL is strictly non-contrastive because it does not require the O(n²) sampling from the complement of the edge index used by contrastive methods. This is clearly shown in Figure 4, where T-BGRL and BGRL have similar runtimes and are much faster than GRACE and the ML-GCN. The corruption can be viewed as a "negative" augmentation, with the only difference being that it changes the expected label for each link. In fact, one of the corruptions that we consider, SPARSIFYFEATSPARSIFYEDGE, is essentially the same as the augmentations used by BGRL (except with a much higher drop probability). We discuss the other corruptions in Appendix A.7.
Scalability.
We evaluate the runtime of our model on different datasets. Figure 4 shows the running times to fully train a model for the different contrastive and non-contrastive methods over 5 runs. Note that we use a fixed 10,000 epochs for GRACE, CCA-SSG, GBT, BGRL, and T-BGRL, but use early stopping on the ML-GCN with a maximum of 1,000 epochs. We find that (i) T-BGRL is comparable to BGRL in runtime owing to efficient choices of corruptions, (ii) it is about 4.3× faster than GRACE on Amazon-Computers (the largest dataset which GRACE can run on), and (iii) it is 14× faster than the ML-GCN. CCA-SSG is the fastest of all the methods but performs the worst. As mentioned above, we do not compare with SEAL (Zhang & Chen, 2018).

5. RELATED WORK

Link Prediction. Link prediction is a longstanding graph machine learning task. Some traditional methods include (i) matrix (Menon & Elkan, 2011; Wang et al., 2020) or tensor factorization (Acar et al., 2009; Dunlavy et al., 2011) methods, which factor the adjacency and/or feature matrices to derive node representations that can predict links when equipped with inner products, and (ii) heuristic methods, which score node pairs based on neighborhood overlap (Yu et al., 2017; Zareie & Sakellariou, 2020; Philip et al., 2010). Several shallow graph embedding methods (Grover & Leskovec, 2016; Perozzi et al., 2014), which train node embeddings via random-walk strategies, have also been used for link prediction. In addition to the node-embedding-based GNN methods mentioned in Section 2, several works (Yin et al., 2022; Zhang & Chen, 2018; Hao et al., 2020) propose subgraph-based methods for this task, which aim to classify subgraphs around each candidate link. A few works focus on scalable link prediction with distillation (Guo et al., 2022b), decoder (Wang et al., 2022), and sketching designs (Chamberlain et al., 2022).
Graph SSL Methods. Most graph SSL methods can be categorized into contrastive and non-contrastive methods.
Contrastive learning has been applied to link prediction with margin-loss-based methods such as PinSAGE (Ying et al., 2018a) and GraphSAGE (Hamilton et al., 2017), where negative sample representations are pushed apart from positive sample representations. GRACE (Zhu et al., 2020) uses augmentation (Zhao et al., 2022a) during this negative sampling process to further increase the performance of the model. DGI (Veličković et al., 2018) leverages mutual information maximization between local patch and global graph representations. Some works (Ju et al., 2022; Jin et al., 2021) also explore using multiple contrastive pretext tasks for SSL. Several works (You et al., 2020; Lin et al., 2022) also focus on graph-level contrastive learning, via graph-level augmentations and careful negative selection. Recently, non-contrastive methods have been applied to graph representation learning. SelfGNN (Kefato & Girdzijauskas, 2021) and BGRL (Thakoor et al., 2022) use ideas from BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021) to propose graph frameworks that do not require negative sampling. We describe BGRL in depth in Section 2 above. Graph Barlow Twins (GBT) (Bielak et al., 2022) is adapted from the Barlow Twins model in the image domain (Zbontar et al., 2021) and uses cross-correlation to learn node representations with a shared encoder. CCA-SSG (Zhang et al., 2021) uses ideas from Canonical Correlation Analysis (CCA) (Hotelling, 1992) and Deep CCA (Andrew et al., 2013) for its loss function. These models are related in that Barlow Twins has been shown to be equivalent to kernel CCA (Balestriero & LeCun, 2022).

6. CONCLUSION

To our knowledge, this is the first work to study non-contrastive SSL methods and their performance on link prediction. We first evaluate several contrastive and non-contrastive graph SSL methods on link prediction tasks, and find that surprisingly, one popular non-contrastive method (BGRL) is able to perform well in the transductive setting. We also observe that BGRL struggles in the inductive setting, and identify that it has a tendency to overfit the training graph, indicating it fails to push positive and negative node pair representations far apart from each other. Armed with these insights, we propose T-BGRL, a simple but effective non-contrastive strategy which works by generating extremely cheap "negatives" by corrupting the original inputs. T-BGRL sidesteps the expensive negative sampling step evident in contrastive learning, while enjoying strong performance benefits. T-BGRL improves on BGRL's inductive performance in 5/6 datasets while achieving similar transductive performance, making it comparable to the best contrastive baselines, but with a 14× speedup over the best contrastive methods.

REPRODUCIBILITY STATEMENT

To ensure reproducibility, our source code is available online at https://github.com/snap-research/non-contrastive-link-prediction. The hyperparameters and instructions for reproducing all experiments are provided in the README.md file. Each reported result is the mean over 5 runs (retraining both the encoder and decoder). We used the Weights and Biases (Biewald, 2020) Bayesian optimizer for our experiments. We provide a sample configuration file to reproduce our sweeps, as well as the exact parameters used for the top T-BGRL runs shown in our tables. We used the reference GRACE and BGRL implementations but modified them for link prediction instead of node classification. We based our E2E-GCN on the OGB (Hu et al., 2020) implementation. We re-implemented CCA-SSG and GBT. The code for all of our implementations and modifications can be found at the link above.

A.6 FULL RESULTS

Table 5 shows the results of all the methods (including T-BGRL) in the transductive setting. [Table 5: Full transductive performance table (combination of Tables 1 and 3).]

A.7 CORRUPTIONS

In this work, we experiment with the following corruptions:
1. RANDOMFEATRANDOMEDGE: Randomly generate an adjacency matrix Ǎ and feature matrix X̌ with the same sizes as A and X, respectively. Note that Ǎ and A also have the same number of non-zero entries, i.e., the same number of edges.
2. SHUFFLEFEATRANDOMEDGE: Randomly shuffle the rows of X, and generate a random Ǎ with the same size as A. Note that Ǎ and A also have the same number of non-zero entries, i.e., the same number of edges.
3. SPARSIFYFEATSPARSIFYEDGE: Mask out a large percentage (we chose 95%) of the entries in X and A.
Of these corruptions, we find that SHUFFLEFEATRANDOMEDGE works the best across our experiments.
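Corruption 2 (SHUFFLEFEATRANDOMEDGE) can be sketched as follows; this dense version is for illustration, and a practical implementation would instead sample a random sparse edge index to stay linear-time:

```python
import numpy as np

def shuffle_feat_random_edge(A, X, seed=0):
    """SHUFFLEFEATRANDOMEDGE: permute the feature rows of X, and resample
    a random adjacency matrix with the same number of edges as A.
    Dense sketch for illustration; sparse sampling is used in practice."""
    rng = np.random.default_rng(seed)
    X_corrupt = X[rng.permutation(X.shape[0])]       # shuffle feature rows
    n, num_edges = A.shape[0], int(A.sum())
    A_corrupt = np.zeros_like(A)
    idx = rng.choice(n * n, size=num_edges, replace=False)
    A_corrupt.flat[idx] = 1                          # random edges, same count
    return A_corrupt, X_corrupt
```

Because both the row permutation and the edge resampling preserve sizes, the corrupted graph can be fed through the same encoder unchanged.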

A.8 AUC-ROC RESULTS

Here we include the area under the ROC curve for each of the different models under both the inductive and transductive settings. Note that we perform early stopping on the validation Hits@50 when training the link prediction model, not on the validation AUC-ROC.

A.9 WHY DOES BGRL NOT COLLAPSE?

The loss function for BGRL (see Equation (2)) is 0 when h_i^(2) = 0 or z_i = 0. While theoretically possible, this is clearly undesirable behavior since it does not result in useful embeddings. We refer to this case as the model collapsing. It is not fully understood why non-contrastive models do not collapse, but several explanations with both theoretical and empirical grounding have been proposed in the image domain. Chen & He (2021) showed that the SimSiam architecture requires both the predictor and the stop-gradient; this has also been shown to be true for BGRL. Tian et al. (2021) claim that the eigenspace of the predictor weights aligns with the correlation matrix of the online network, under the assumption of a one-layer linear encoder and a one-layer linear predictor. Wen & Li (2022) studied the case of a two-layer non-linear encoder with output normalization and found that the predictor is often only useful during the learning process, and often converges to the identity function. We did not observe this behavior with BGRL: the predictor is usually significantly different from the identity function.
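As a concrete reference for the collapse discussion, here is a minimal NumPy sketch of a BGRL-style negative-cosine objective (the function name and eps convention are ours; the paper's exact Equation (2) may differ in scaling). It illustrates the degenerate case above: at zero online predictions or zero target embeddings, the cosine term vanishes and the loss evaluates to 0 rather than its minimum.

```python
import numpy as np

def bgrl_loss(z, h, eps=1e-8):
    """BGRL-style negative cosine loss, averaged over nodes.

    z: online-network predictions, shape [n, d]
    h: target-network embeddings, shape [n, d]
    Returns -2 * cos(z_i, h_i) averaged over i; minimized (-2) when the
    two views align, and exactly 0 for all-zero (collapsed) embeddings.
    """
    num = 2.0 * (z * h).sum(axis=1)
    denom = np.linalg.norm(z, axis=1) * np.linalg.norm(h, axis=1) + eps
    return (-num / denom).mean()
```

Crucially, gradient descent never reaches the collapsed point in practice; the stop-gradient and predictor asymmetry discussed above appear to steer optimization away from it.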

A.10 HOW DOES BGRL PULL REPRESENTATIONS CLOSER TOGETHER?

Here we clarify the intuition behind BGRL pulling similar points together. To simplify this analysis, we assume that the predictor is the identity function, which Wen & Li (2022) found to be true in the image representation learning setting. Although we have not observed this in the graph setting, this assumption greatly simplifies our analysis, and we argue it is sufficient for understanding why BGRL works. Suppose we have three nodes: an anchor node u, a neighbor v, and a non-neighbor w. That is, we have (u, v) ∈ E, (u, w) ∉ E, and (v, w) ∉ E. Let u, v, w be the embeddings of u, v, w, respectively (e.g., u = ENC(u)).



Note that the definitions of these functions differ from the corruption functions in Zhu et al. (2020) (which we define as augmentations) and are instead similar to the corruption functions in Veličković et al. (2018). Self-GNN (Kefato & Girdzijauskas, 2021), which was published independently, also shares the same architecture. As such, we refer to these two methods as BGRL.



Figure 2: These plots show similarities between node embeddings on Citeseer. Left: distribution of similarity to non-neighbors for T-BGRL and BGRL. Right: distribution of similarity to neighbors for T-BGRL and BGRL. Note that the y-axis is on a logarithmic scale. T-BGRL clearly does a better job of ensuring that negative link representations are pushed far apart from those of positive links.

Figure 4: Total runtime comparison of different contrastive and non-contrastive methods. T-BGRL and BGRL have relatively similar runtimes and are significantly faster than the contrastive methods (GRACE and ML-GCN).

Figure 5: These plots show similarities between node embeddings on Coauthor-CS. Left: distribution of similarity to non-neighbors for T-BGRL and BGRL (closer to 0 is better). Right: distribution of similarity to neighbors for T-BGRL and BGRL (closer to 1 is better). Note that the y-axis is on a logarithmic scale. T-BGRL clearly does a better job of ensuring that negative link representations are pushed far apart from those of positive links.

Transductive performance of different link prediction methods. We bold the best-performing method and underline the second-best method for each dataset. BGRL consistently outperforms other non-contrastive methods and GRACE, and also outperforms ML-GCN, on 3/6 datasets.

Performance of various methods in the inductive setting. See Section 3.1.2 for an explanation of our inductive setting. Although we do not introduce T-BGRL until Section 4, we include the results here to save space.

Transductive performance of T-BGRL compared to ML-GCN and BGRL (same numbers as Table 1 above; full results in Table 5).

or other subgraph-based methods due to how slow they are during inference. SUREL (Yin et al., 2022) is ~250× slower, and SEAL (Zhang & Chen, 2018) is about ~3900× slower according to Yin et al. (2022). In conclusion, we find that T-BGRL is roughly as scalable as other non-contrastive methods, and much more scalable than existing contrastive methods.

Area under the ROC curve for the methods in the transductive setting.

AUC-ROC of various methods in the inductive setting. See Section 3.1.2 for an explanation of our inductive setting.

ACKNOWLEDGMENTS

UCR coauthors were partly supported by the National Science Foundation under CAREER grant no. IIS 2046086 and were also sponsored by the Combat Capabilities Development Command Army Research Laboratory under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding parties.

A APPENDIX

A.1 DATASET STATISTICS

A.2 HARDWARE DETAILS

We run all of our experiments on either NVIDIA P100 or V100 GPUs. We use machines with 12 virtual CPU cores and 24 GB of RAM for the majority of our experiments. We exclusively use V100s for our timing experiments. We ran our experiments on Google Cloud Platform.

A.3 TRANSDUCTIVE SETTING DETAILS

We use an 85/5/10 split for training/validation/testing data, following Zhang & Chen (2018) and Cai et al. (2020).

A.4 INDUCTIVE SETTING DETAILS

The inductive setting represents a more realistic setting than the transductive setting. For example, consider a social network upon which a model is trained at some time t1, but is used for inference (for a GNN, this refers to the message-passing step) at time t2, where new users and friendships have been added to the network in the interim. The goal of a model run at time t2 is then to predict any new links at a later network state t3 (although we assume there are no new nodes introduced at that step, since we cannot compute the embedding of a node without first performing inference on it).

To simulate this setting, we perform the following steps:

1. We withhold a portion of the edges (and the same number of disconnected node pairs) to use as testing-only edges.
2. We partition the graph into two sets of nodes: "observed" nodes (that we see during training) and "unobserved" nodes (that can only be seen during testing).
3. We mask out some edges to use as testing-only edges.
4. We mask out some edges to use as inference-only edges.
5. We mask out some edges to use as validation-only edges.
6. We mask out some edges to use as training-only edges.

As the test edges are sampled before the node split, there will be three kinds of them after the splitting: edges within observed nodes, edges between observed and unobserved nodes, and edges within unobserved nodes. For ease of data preparation, we use the same percentages for the test edge splitting, unobserved node splitting, and validation edge splitting. Specifically, we mask out 30% of the edges (at each of the above stages) on the small datasets (Cora and Citeseer), and 10% on all other datasets. We use a 30% split on the small datasets to ensure that we have a sufficient number of edges for testing and validation purposes.
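The splitting procedure above can be sketched schematically in NumPy. This is an illustrative sketch under our own naming and ordering conventions, not the exact data-preparation code (which handles disconnected node pairs and edge categorization as well); `frac` plays the role of the shared mask percentage (0.3 for Cora/Citeseer, 0.1 otherwise).

```python
import numpy as np

def inductive_split(edges, n_nodes, frac, rng):
    """Schematic inductive split.

    edges:   [m, 2] array of (src, dst) pairs
    n_nodes: number of nodes in the graph
    frac:    shared mask fraction used at each stage
    Returns a dict of disjoint edge subsets plus the unobserved node set.
    """
    m = len(edges)
    perm = rng.permutation(m)
    # Step 1/3: withhold a fraction of edges as testing-only edges.
    n_test = int(frac * m)
    test_idx, rest = perm[:n_test], perm[n_test:]
    # Step 2: partition nodes into observed / unobserved sets.
    nodes = rng.permutation(n_nodes)
    n_unobs = int(frac * n_nodes)
    unobserved = set(nodes[:n_unobs].tolist())
    # Steps 4-6: split the remaining edges into inference-only,
    # validation-only, and training-only edges.
    n_inf = int(frac * len(rest))
    n_val = int(frac * len(rest))
    inf_idx = rest[:n_inf]
    val_idx = rest[n_inf:n_inf + n_val]
    train_idx = rest[n_inf + n_val:]
    return {
        "test": edges[test_idx],
        "inference": edges[inf_idx],
        "valid": edges[val_idx],
        "train": edges[train_idx],
        "unobserved_nodes": unobserved,
    }
```

Because the test edges are drawn before the node partition, each returned test edge can afterwards be labeled as observed-observed, observed-unobserved, or unobserved-unobserved by checking its endpoints against `unobserved_nodes`.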

A.5 EXPERIMENTAL SETUP

To ensure that we fairly evaluate each model, we run a Bayesian hyperparameter sweep for 25 runs across each model-dataset combination, with the target metric being the validation Hits@50. Each run is the result of the mean averaged over 5 runs (retraining both the encoder and decoder).

Assuming homophily between the nodes, we have u · v > u · w. For ease of visualization, let us project the points into a 2D space. We then apply the two augmentations to u, producing ũ1 = AUG1(u) and ũ2 = AUG2(u). For the sake of simplicity, let us assume that we perform edge dropping and feature dropping with the same probability p (in practice, they may differ from each other). We represent the space of possible values for ũ1 and ũ2 as a circle of radius r centered at u, where r is controlled by p.


The BGRL loss is stated in Equation (2) above, but we rewrite it relative to our anchor u and with our assumption about the predictor: ℓ(u) = −2 · (z · h)/(∥z∥ ∥h∥), where z = ENC(ũ1) and h is the target-network embedding of ũ2. Note that v in this example lies within the space of possible augmentations, that is, v ∈ A, where A is the set of all possible values of AUG(u). This means that, as we repeat this process, we implicitly push u and v closer together, leading to distributions like those shown in Figure 1.

