LOCALIZED GRAPH CONTRASTIVE LEARNING

Abstract

Contrastive learning methods based on the InfoNCE loss are popular in node representation learning tasks on graph-structured data. However, their reliance on data augmentation and their quadratic computational complexity can lead to inconsistency and inefficiency problems. To mitigate these limitations, in this paper we introduce a simple yet effective contrastive model named Localized Graph Contrastive Learning (LOCAL-GCL for short). LOCAL-GCL consists of two key designs: 1) we fabricate the positive examples for each node directly from its first-order neighbors, which frees our method from reliance on carefully designed graph augmentations; 2) to improve the efficiency of contrastive learning on graphs, we devise a kernelized contrastive loss that can be approximately computed in linear time and space complexity with respect to the graph size. We provide theoretical analysis to justify the effectiveness and rationality of the proposed method. Experiments on datasets with various scales and properties demonstrate that, despite its simplicity, LOCAL-GCL achieves highly competitive performance on self-supervised node representation learning tasks on graphs.

1. INTRODUCTION

Self-supervised learning has achieved remarkable success in learning informative representations without using costly handcrafted labels (van den Oord et al., 2018; Devlin et al., 2019; Banville et al., 2021; He et al., 2020; Chen et al., 2020; Grill et al., 2020; Zhang et al., 2021; Gao et al., 2021). Among current self-supervised learning paradigms, multi-view contrastive methods (He et al., 2020; Chen et al., 2020; Gao et al., 2021) based on the InfoNCE loss (van den Oord et al., 2018) are the most widely adopted, owing to their solid theoretical foundations and strong empirical results. Generally, contrastive learning aims at maximizing the agreement between the latent representations of two views (e.g., generated through data augmentation) of the same input, which essentially maximizes the mutual information between the two representations (Poole et al., 2019). Inheriting the spirit of contrastive learning on vision tasks, similar methods have been developed for graphs and yield promising results on common node-level classification benchmarks (Velickovic et al., 2019; Hassani & Ahmadi, 2020; Zhu et al., 2020b; 2021). The challenge, however, is that prevailing contrastive learning methods rely on predefined augmentation techniques to generate positive pairs as informative training supervision. Unlike grid-structured data (e.g., images or sequences), it is non-trivial to define well-posed augmentation approaches for graph-structured data (Zhu et al., 2021; Zhang et al., 2021). The common practice of current methods resorts to random perturbation of input node features and graph structures (You et al., 2020), which might unexpectedly violate the underlying data generation process and change the semantic information (Lee et al., 2021). Such an issue acts as a bottleneck limiting the practical efficacy of contrastive methods on graphs.
Apart from this, the InfoNCE loss function computes all-pair distances among in-batch nodes as negative pairs for contrasting signals (Zhu et al., 2020b; 2021), which induces quadratic memory and time complexity with respect to the batch size. Given that the model is preferably trained in a full-graph manner (i.e., batch size = graph size), since graph structure information might be partially lost through mini-batch partitioning, this heavily constrains contrastive methods from scaling to large graphs. Some recent works seek negative-sample-free methods to resolve the scalability issue by harnessing asymmetric structures (Thakoor et al., 2021) or feature-level decorrelation objectives (Zhang et al., 2021). However, these methods either lack sufficient theoretical justification (Thakoor et al., 2021) or necessitate strong assumptions on the data distribution (Zhang et al., 2021). Moreover, although non-contrastive and free from negative sampling, they still require data augmentations to generate two views of the input graph. Some other works construct positive examples using the target's k-nearest neighbors (kNN) in the latent space (Dwibedi et al., 2021; Koohpayegani et al., 2021; Lee et al., 2021). Nonetheless, the computation of nearest neighbors can be cumbersome, time-consuming, and therefore hard to scale.

Presented Work. To cope with the dilemmas above, in this paper we introduce Localized Graph Contrastive Learning (LOCAL-GCL for short), a lightweight and augmentation-free contrastive model for self-supervised node-level representation learning on graphs. LOCAL-GCL benefits from two key designs. First, it does not rely on data augmentation to construct positive pairs; instead, inspired by the graph homophily theory (McPherson et al., 2001), it directly treats the first-order neighboring nodes as the positive examples of the target node.
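The quadratic cost discussed above is easy to see in code. Below is a minimal NumPy sketch of the vanilla InfoNCE loss over two views (the function name and shapes are our own illustration, not the paper's implementation): the all-pair similarity matrix it builds is N-by-N, so memory and time grow quadratically with the batch size.

```python
import numpy as np

def infonce_loss(z1, z2, tau=0.5):
    """Vanilla InfoNCE over two views of N nodes.

    z1, z2: (N, d) embeddings of the two views. The (N, N) similarity
    matrix below is the source of the quadratic memory/time cost.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                      # (N, N) all-pair similarities
    sim -= sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # positive pair of node i is node i in the other view (the diagonal)
    return -np.mean(np.diag(log_prob))
```

With a full-graph batch, N is the number of nodes, so the `sim` matrix alone can exhaust memory on large graphs, which motivates the kernelized loss introduced later.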
This not only increases the number of positive examples for each node but also frees our model from complicated data augmentations. Besides, the computation of positive pairs can be performed in linear time and space complexity w.r.t. the graph size, bringing no additional cost to the model. Second, to deal with the quadratic complexity curse of the contrastive loss (i.e., InfoNCE loss (van den Oord et al., 2018)), we propose a surrogate loss function in place of the negative term in the vanilla InfoNCE loss, which can be efficiently and accurately approximated in linear time and space complexity (Rahimi et al., 2007; Liu et al., 2020; Yu et al., 2016). This design greatly improves the efficiency of our model. We evaluate the proposed method on seven public node classification benchmarks of various scales. The empirical results demonstrate that, despite not using any graph augmentations, our method achieves state-of-the-art performance on six of the seven datasets. On the challenging Ogbn-Arxiv dataset, our method also gives competitive performance with much faster training compared with other scalable models. Experiments on three heterophilic graphs demonstrate that, besides homophilic graphs, LOCAL-GCL also performs well on graphs with low homophily ratios. We summarize the highlights of this paper as follows: 1) We introduce LOCAL-GCL, a simple model for contrastive learning on graphs, where positive examples are fabricated from the first-order neighborhood of each node. This frees node-level contrastive learning methods from unjustified graph augmentations. 2) To overcome the quadratic complexity curse of contrastive learning, we propose a kernelized contrastive loss computation that precisely approximates the original loss function within linear complexity w.r.t. the graph size. This significantly reduces the training time and memory cost of contrastive learning on large graphs.
3) Experimental results show that, without data augmentation or other cumbersome designs, LOCAL-GCL achieves highly competitive results on a variety of graphs of different scales and properties. Furthermore, LOCAL-GCL strikes a better balance between model performance and efficiency than other self-supervised methods.
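To illustrate how the two designs combine, here is a hypothetical NumPy sketch of such an objective (the function name, edge-list format, and bandwidth choice are our assumptions, not the paper's code): the positive term attracts each node to its first-order neighbors in O(|E|), while the all-pair negative sum is approximated with random Fourier features (Rahimi et al., 2007), exploiting that for unit-norm embeddings exp(z_i . z_j / tau) equals, up to a constant, a Gaussian kernel.

```python
import numpy as np

def local_gcl_loss(z, edges, tau=0.5, num_features=4096, seed=0):
    """Sketch of a neighbor-contrastive loss with a kernelized negative term.

    z: (N, d) node embeddings; edges: (src, dst) index arrays.
    Positive term: O(|E|) over first-order neighbor pairs.
    Negative term: for unit-norm z, exp(z_i . z_j / tau) =
    e^{1/tau} * exp(-||z_i - z_j||^2 / (2 tau)), a Gaussian kernel, so
    sum_{i,j} exp(z_i . z_j / tau) ~ e^{1/tau} * ||sum_i phi(z_i)||^2
    with random Fourier features phi, costing O(N * D) instead of O(N^2).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    src, dst = edges
    pos = -(z[src] * z[dst]).sum(axis=1).mean() / tau  # attract neighbors

    rng = np.random.default_rng(seed)
    d = z.shape[1]
    # Gaussian kernel with bandwidth sigma^2 = tau  =>  W ~ N(0, I / tau)
    W = rng.normal(scale=1.0 / np.sqrt(tau), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    phi = np.sqrt(2.0 / num_features) * np.cos(z @ W + b)
    s = phi.sum(axis=0)                                # (D,) sum of feature maps
    neg = np.log(np.exp(1.0 / tau) * (s @ s) / z.shape[0])
    return pos + neg
```

Increasing `num_features` trades computation for approximation accuracy; the estimate of the negative sum is unbiased, with error shrinking on the order of 1/sqrt(D).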

2. RELATED WORK

2.1. CONTRASTIVE REPRESENTATION LEARNING

Inspired by the great success of contrastive methods in learning image representations (van den Oord et al., 2018; Hjelm et al., 2019; Tian et al., 2020; He et al., 2020; Chen et al., 2020), recent endeavors develop similar strategies for node-level tasks in the graph domain (Velickovic et al., 2019; Hassani & Ahmadi, 2020; Zhu et al., 2020b; 2021). Among graph contrastive learning methods, the most popular are those based on the InfoNCE loss (van den Oord et al., 2018), owing to their conceptual simplicity and strong empirical performance. InfoNCE-based graph contrastive learning methods, including GRACE (Zhu et al., 2020b) and GCA (Zhu et al., 2021), aim to maximize the similarity of positive node-node (or graph-graph) pairs (e.g., two views generated via data augmentation) and minimize the similarity of negative ones (e.g., other nodes/graphs within the current batch). However, they require well-designed data augmentations that positively inform downstream tasks (Tsai et al., 2021). The quadratic complexity also limits their application to larger batch sizes/datasets.

