EFFICIENT BLOCK CONTRASTIVE LEARNING VIA PARAMETER-FREE META-NODE APPROXIMATION

Abstract

Contrastive learning has recently achieved remarkable success in many domains, including graphs. However, contrastive loss, especially for graphs, requires a large number of negative samples, which is unscalable and computationally prohibitive with quadratic time complexity. Sub-sampling is not optimal, and incorrect negative sampling leads to sampling bias. In this work, we propose a meta-node-based approximation technique that can (a) proxy all negative combinations, (b) in time quadratic in the number of clusters, (c) at graph level, not node level, and (d) exploit graph sparsity. By replacing node pairs with additive cluster pairs, we compute the negatives in cluster time at graph level. The resulting Proxy approximated meta-node Contrastive (PamC) loss, based on simple optimized GPU operations, captures the full set of negatives yet is efficient, with linear time complexity. By avoiding sampling, we effectively eliminate sample bias. We also meet the criterion for a larger number of samples, thus achieving block contrastiveness, which is proven to outperform pair-wise losses. We use learnt soft cluster assignments for the meta-node construction, avoiding possible heterophily and noise added during edge creation. Theoretically, we show that real-world graphs easily satisfy the conditions necessary for our approximation. Empirically, we show promising accuracy gains over state-of-the-art graph clustering on 6 benchmarks. Importantly, we gain substantially in efficiency: up to 2x in training time and over 5x in GPU memory reduction. The code is publicly available.

1. INTRODUCTION

Discriminative approaches based on contrastive learning have been outstandingly successful in practice (Guo et al., 2017; Wang & Isola, 2020), achieving state-of-the-art results (Chen et al., 2020a) or at times outperforming even supervised learning (Logeswaran & Lee, 2018; Chen et al., 2020b). Specifically in graph clustering, contrastive learning can outperform traditional convolution- and attention-based Graph Neural Networks (GNN) on speed and accuracy (Kulatilleke et al., 2022). While traditional objective functions encourage similar nodes to be closer in embedding space, their penalties do not guarantee separation of unrelated graph nodes (Zhu et al., 2021a). Differently, many modern graph embedding models (Hamilton et al., 2017; Kulatilleke et al., 2022) use contrastive objectives. These encourage representations of positive pairs to be similar, while pushing features of the negatives apart in embedding space (Wang & Isola, 2020). A typical deep model consists of a trainable encoder that generates positive and negative node embeddings for the contrastive loss (Zhu et al., 2021a). It has been shown that convolution is computationally expensive and may not be necessary for representation learning (Chen et al., 2020a). As the only requirement for contrastive loss is an encoder, researchers have recently been able to produce state-of-the-art results using simpler and more efficient MLP-based contrastive loss implementations (Hu et al., 2021; Kulatilleke et al., 2022). Thus, there is a rapidly expanding interest in, and scope for, contrastive loss based models.
We consider the following specific but popular (Hu et al., 2021; Kulatilleke et al., 2022) form of contrastive loss, where τ is the temperature parameter, γ_ij is the relationship between nodes i, j, and the loss for the i-th node is:

ℓ_i = -log [ Σ_{j=1}^{B} 1[j≠i] γ_ij · exp(sim(z_i, z_j) · τ) / Σ_{k=1}^{B} 1[k≠i] · exp(sim(z_i, z_k) · τ) ]

When no labels are present, sampling of positive and negative nodes plays a crucial role (Kipf & Welling, 2016) and is a key implementation detail in contrastive methods (Velickovic et al., 2019).

Motivation: Prior work has approximated the task in order to approximate the loss. SwAV (Caron et al., 2020) learns to predict a node prototype code of an augmented view from the other view. GRCCA (Zhang et al., 2021a) maps augmented graphs to prototypes using k-means for alignment. PCL (Li et al., 2020) assigns several prototypes of different granularity to an image, enforcing its representation to be more similar to its corresponding prototype. However, all these works use some form of data augmentation, which assumes that the task-relevant information is not significantly altered, and require computationally expensive operations. Wang & Isola (2020) identify alignment and uniformity as key properties of contrastive loss: alignment encourages encoders to assign similar features to similar samples; uniformity encourages a feature distribution that preserves maximal information. It is fair to assume that latent clusters are dissimilar. Even in the rare case of two identical cluster centers initially, one will usually change or drift apart. It is intuitive that cluster centers, like nodes, should be uniformly distributed in the hyperspace in order to preserve as much information of the data as possible. Uniformly distributing points on a hypersphere is defined as minimizing the total pairwise potential w.r.t. a certain kernel function and is well-studied (Wang & Isola, 2020).
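The loss above can be computed directly with dense similarity operations. The following is a minimal NumPy sketch of that formula; the function name and arguments are illustrative, not from the paper's released code. `gamma` carries the pairwise relationship weights γ_ij and the diagonal is masked out by the 1[j≠i] indicator:

```python
import numpy as np

def contrastive_loss(Z, gamma, tau=1.0):
    """Per-node loss l_i from the paper's formula (illustrative sketch).

    Z:     (B, d) L2-normalised embeddings, so Z @ Z.T is cosine similarity.
    gamma: (B, B) pairwise relationship weights gamma_ij.
    """
    B = Z.shape[0]
    sim = Z @ Z.T                        # sim(z_i, z_j) for unit vectors
    e = np.exp(sim * tau)
    mask = 1.0 - np.eye(B)               # indicator 1[j != i]
    num = (mask * gamma * e).sum(axis=1) # weighted positives/relations
    den = (mask * e).sum(axis=1)         # all B-1 other nodes as negatives
    return -np.log(num / den)            # vector of l_i, i = 1..B
```

Note that the denominator touches every node pair, which is the O(N²) cost the rest of the paper sets out to remove.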
Thus, we are naturally motivated to use the cluster centers as meta-nodes for negative contrast. By aggregation, all of a meta-node's constituent nodes can be affected. Thus, we avoid sampling, effectively eliminate sample bias, and also meet the criterion of a larger number of samples. Learned soft cluster assignments can avoid possible heterophily and add robustness to noise in edge construction. In this work, we propose a novel contrastive loss, PamC, which uses parameter-free proxy meta-nodes to approximate negative samples. Our approach indirectly uses the full set of negative samples and yet is efficient, with a time complexity of O(N). Not only does PamCGC, based on PamC, outperform or match previous work, but it is also simpler than any prior negative sample generation approach, faster, and uses relatively less GPU memory. It can be incorporated into any contrastive learning-based clustering model with minimal modifications, and works with diverse data, as we demonstrate using benchmark datasets from image, text and graph modalities.
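The meta-node idea can be sketched as follows, under assumptions of ours rather than the paper's exact formulation: with soft assignments P, each of the K cluster centers is a parameter-free weighted sum of its member nodes (C = Pᵀ Z), and each node is contrasted against the K meta-nodes instead of the N-1 other nodes, dropping the negative term from O(N²) to O(NK):

```python
import numpy as np

def metanode_negatives(Z, P, tau=1.0):
    """Proxy negative term via meta-nodes (illustrative sketch).

    Z: (N, d) L2-normalised node embeddings.
    P: (N, K) soft cluster assignments, rows summing to 1.
    """
    C = P.T @ Z                                        # (K, d) meta-nodes, parameter-free
    C = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-12)
    sim = Z @ C.T                                      # (N, K) node-vs-meta-node similarity
    return np.exp(sim * tau).sum(axis=1)               # proxy for the sum over all negatives
```

Each node's proxy term stands in for the denominator sum over all other nodes, so no negative sampling (and hence no sampling bias) is involved.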



Positive samples in graphs are typically connected by edges (Kulatilleke et al., 2021), similar to words in a sentence in language modelling (Logeswaran & Lee, 2018). Often data augmentation is used to generate positive samples; Chen et al. (2020b) used cropping, coloring, and blurring. However, it is harder to obtain negative samples. With no access to labels, negative counterparts are typically obtained via uniform sampling (Park et al., 2022), via synthesizing/augmenting (Chen et al., 2020b), or by adding noise. Also, in graphs, adjacency information can be exploited to derive negatives (Hu et al., 2021; Kulatilleke et al., 2022) for feature contrast. However, while graphs are particularly suited for contrastive learning, to be effective a large number of negative samples must be used (Wang & Isola, 2020) (e.g., 65536 in He et al. (2020)), along with larger batch sizes and longer training compared to supervised counterparts (Chen et al., 2020b). Prior work has used data-augmentation-based contrastive methods (Zhu et al., 2020; 2021b), negative samples using asymmetric structures (Thakoor et al., 2021), or avoided negative samples altogether via feature-level decorrelation (Zhang et al., 2021b). While Thakoor et al. (2021) and Zhang et al. (2021b) address complexity and scalability, as seen in Appendix Table 4, their performance can be further improved. Unlike in other domains, such as vision, negative sample generation brings only limited benefits to graphs (Chuang et al., 2020; Zhu et al., 2021a). To understand this phenomenon, observe the raw embedding of the USPS image dataset in the top row of Figure 7, which looks already clustered. A direct consequence of this is that graphs are more susceptible to sampling bias (Chuang et al., 2020; Zhu et al., 2021a). Thus, graph contrastive learning approaches suffer from insufficient negatives and the complex task of sample generation, in addition to the O(N²) time complexity required to contrast every negative node.
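The sampling bias mentioned above is easy to demonstrate numerically. The toy setup below (our illustration, not from the paper) samples negatives uniformly without label access; with c balanced classes, roughly 1/c of the "negatives" are false negatives from the anchor's own class:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=1000)  # 4 hypothetical balanced classes

def sample_negatives(i, m=64):
    """Uniform negative sampling with no label access (illustrative)."""
    cand = rng.choice(len(labels), size=m, replace=False)
    return cand[cand != i]              # exclude the anchor itself

neg = sample_negatives(0)
false_neg_rate = (labels[neg] == labels[0]).mean()
# Roughly a quarter of the sampled "negatives" share node 0's class:
# these false negatives are the sampling bias that contrasting against
# the full negative set (as PamC approximates) avoids.
```

Increasing m reduces variance but not the bias itself, which is why sub-sampling is not optimal.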

