EFFICIENT BLOCK CONTRASTIVE LEARNING VIA PARAMETER-FREE META-NODE APPROXIMATION

Abstract

Contrastive learning has recently achieved remarkable success in many domains, including graphs. However, contrastive loss, especially for graphs, requires a large number of negative samples, which is unscalable and computationally prohibitive with a quadratic time complexity. Sub-sampling is not optimal, and incorrect negative sampling leads to sampling bias. In this work, we propose a meta-node based approximation technique that can (a) proxy all negative combinations (b) in time quadratic in the number of clusters, (c) at graph level, not node level, and (d) exploit graph sparsity. By replacing node-pairs with additive cluster-pairs, we compute the negatives in cluster-time at graph level. The resulting Proxy approximated meta-node Contrastive (PamC) loss, based on simple optimized GPU operations, captures the full set of negatives, yet is efficient with a linear time complexity. By avoiding sampling, we effectively eliminate sample bias. We meet the criterion for a larger number of samples, thus achieving block-contrastiveness, which is proven to outperform pair-wise losses. We use learnt soft cluster assignments for the meta-node construction, and avoid possible heterophily and noise added during edge creation. Theoretically, we show that real-world graphs easily satisfy the conditions necessary for our approximation. Empirically, we show promising accuracy gains over state-of-the-art graph clustering on 6 benchmarks. Importantly, we gain substantially in efficiency: up to 2x in training time and over 5x in GPU memory reduction. The code is publicly available.

1. INTRODUCTION

Discriminative approaches based on contrastive learning have been outstandingly successful in practice (Guo et al., 2017; Wang & Isola, 2020), achieving state-of-the-art results (Chen et al., 2020a) or at times outperforming even supervised learning (Logeswaran & Lee, 2018; Chen et al., 2020b). Specifically in graph clustering, contrastive learning can outperform traditional convolution and attention-based Graph Neural Networks (GNN) on speed and accuracy (Kulatilleke et al., 2022). While traditional objective functions encourage similar nodes to be closer in embedding space, their penalties do not guarantee separation of unrelated graph nodes (Zhu et al., 2021a). In contrast, many modern graph embedding models (Hamilton et al., 2017; Kulatilleke et al., 2022) use contrastive objectives. These encourage representations of positive pairs to be similar, while pushing features of the negatives apart in embedding space (Wang & Isola, 2020). A typical deep model consists of a trainable encoder that generates positive and negative node embeddings for the contrastive loss (Zhu et al., 2021a). It has been shown that convolution is computationally expensive and may not be necessary for representation learning (Chen et al., 2020a). As the only requirement for contrastive loss is an encoder, researchers have recently been able to produce state-of-the-art results using simpler and more efficient MLP-based contrastive loss implementations (Hu et al., 2021; Kulatilleke et al., 2022). Thus, there is rapidly expanding interest in, and scope for, contrastive-loss-based models.
We consider the following specific but popular (Hu et al., 2021; Kulatilleke et al., 2022) form of contrastive loss, where τ is the temperature parameter, γ_ij is the relationship between nodes i and j, and the loss for the i-th node is:

$$\ell_i = -\log \frac{\sum_{j=1}^{B} \mathbb{1}_{[j \neq i]}\, \gamma_{ij} \exp\left(\mathrm{sim}(z_i, z_j) \cdot \tau\right)}{\sum_{k=1}^{B} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k) \cdot \tau\right)} \tag{1}$$

When no labels are present, sampling of positive and negative nodes plays a crucial role (Kipf & Welling, 2016) and is a key implementation detail in contrastive methods (Velickovic et al., 2019). Positive samples in graphs are typically connected by edges (Kulatilleke et al., 2021), similar to words in a sentence in language modelling (Logeswaran & Lee, 2018). Often data augmentation is used to generate positive samples; Chen et al. (2020b) used cropping, coloring and blurring. However, it is harder to obtain negative samples. With no access to labels, negative counterparts are typically obtained via uniform sampling (Park et al., 2022), via synthesizing/augmenting (Chen et al., 2020b) or by adding noise. Also, in graphs, adjacency information can be exploited to derive negatives (Hu et al., 2021; Kulatilleke et al., 2022) for feature contrastion. However, while graphs are particularly suited for contrastive learning, to be effective, a large number of negative samples must be used (Wang & Isola, 2020) (e.g., 65536 in He et al. (2020)), along with larger batch sizes and longer training compared to its supervised counterparts (Chen et al., 2020b). Prior work has used data augmentation-based contrastive methods (Zhu et al., 2020; 2021b), negative samples using asymmetric structures (Thakoor et al., 2021), or avoided negative samples altogether via feature-level decorrelation (Zhang et al., 2021b). While Thakoor et al. (2021) and Zhang et al. (2021b) address complexity and scalability, as seen in Appendix Table 4, their performance can be further improved.
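The per-node loss in Equation 1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: sim is taken to be cosine similarity, γ is a 0/1 adjacency matrix, and the function names and toy data are hypothetical rather than taken from the released code.

```python
import numpy as np

def cosine_similarity_matrix(z):
    """Pairwise cosine similarity between the rows of z (shape B x d)."""
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    return unit @ unit.T

def node_contrastive_loss(z, gamma, tau):
    """Per-node loss l_i of Equation 1 for a batch of B embeddings.

    gamma[i, j] weights the relationship between nodes i and j.
    Every node is assumed to have at least one positive, otherwise
    the numerator is zero and the logarithm diverges.
    """
    weights = np.exp(cosine_similarity_matrix(z) * tau)
    off_diagonal = 1.0 - np.eye(len(z))              # indicator 1[j != i]
    numerator = (off_diagonal * gamma * weights).sum(axis=1)
    denominator = (off_diagonal * weights).sum(axis=1)
    return -np.log(numerator / denominator)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))                          # toy embeddings
gamma = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)        # a path graph on 4 nodes
losses = node_contrastive_loss(z, gamma, tau=0.5)
losses_complete = node_contrastive_loss(z, np.ones((4, 4)), tau=0.5)
```

The denominator is the part that touches every other node in the batch, which is where the quadratic cost discussed in this section comes from. When every pair is a positive (the all-ones γ), numerator and denominator coincide and the loss vanishes.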
Unlike other domains, such as vision, negative sample generation brings only limited benefits to graphs (Chuang et al., 2020; Zhu et al., 2021a). To understand this phenomenon, observe the raw embedding of the USPS image dataset in the top row of Figure 7, which already looks clustered. A direct consequence of this is that graphs are more susceptible to sampling bias (Chuang et al., 2020; Zhu et al., 2021a). Thus, graph contrastive learning approaches suffer from insufficient negatives and the complex task of sample generation, in addition to the O(N^2) time complexity required to contrast every negative node. However, what contrastive loss exactly does remains largely a mystery (Wang & Isola, 2020). For example, the analysis of Arora et al. (2019), based on the assumption of latent classes, provides good theoretical insights, yet their explanation of representation quality dropping with a large number of negatives is inconsistent with experimental findings (Chen et al., 2020b). Contrastive loss is seen as maximizing mutual information (MI) between two views. Yet, contradictorily, a tighter bound on MI can lead to poor representations (Wang & Isola, 2020).

Motivation: Prior work has approximated the task in order to approximate the loss. SwAV (Caron et al., 2020) learns to predict a node prototype code of an augmented view from the other view. GRCCA (Zhang et al., 2021a) maps augmented graphs to prototypes using k-means for alignment. PCL (Li et al., 2020) assigns several prototypes of different granularity to an image, enforcing its representation to be more similar to its corresponding prototype. However, all these works use some form of data augmentation, which assumes that the task-relevant information is not significantly altered, and require computationally expensive operations.
Wang & Isola (2020) identify alignment and uniformity as key properties of contrastive loss: alignment encourages encoders to assign similar features to similar samples; uniformity encourages a feature distribution that preserves maximal information. It is fair to assume that latent clusters are dissimilar. Even with the rare possibility of two identical cluster centers initially, one will usually change or drift apart. It is intuitive that cluster centers should be uniformly distributed in the hyperspace, similar to nodes, in order to preserve as much information of the data as possible. Uniformly distributing points on a hyperspace is defined as minimizing the total pairwise potential w.r.t. a certain kernel function and is well-studied (Wang & Isola, 2020). Thus, we are naturally motivated to use the cluster centers as meta-nodes for negative contrastion. By aggregation, all of a meta-node's constituent nodes can be affected. Thus, we avoid sampling, effectively eliminate sample bias, and also meet the criterion of a larger number of samples. Learned soft cluster assignments can avoid possible heterophily and add robustness to noise in edge construction.

In this work, we propose a novel contrastive loss, PamC, which uses parameter-free proxy meta-nodes to approximate negative samples. Our approach indirectly uses the full set of negative samples and yet is efficient, with a time complexity of O(N). Not only does PamCGC, based on PamC, outperform or match previous work, but it is also simpler than any prior negative sample generation approach, faster, and uses relatively less GPU memory. It can be incorporated into any contrastive learning-based clustering model with minimal modifications, and works with diverse data, as we demonstrate using benchmark datasets from image, text and graph modalities.
To summarize, our main contributions are:
• We introduce an efficient, novel, parameter-free proxy, PamC, for negative sample approximation that is scalable, computationally efficient and able to include all samples. It works with diverse data, including graphs. We claim PamC is the first to implicitly use the whole graph with O(N) time complexity, in addition to further 3-fold gains.
• We provide theoretical proof and show that real-world graphs easily satisfy the necessary conditions, and that PamCGC is block-contrastive, which is known to outperform pair-wise losses.
• Extensive experiments on 6 benchmark datasets show that PamCGC, using the proposed PamC, is on par with or better than state-of-the-art graph clustering methods in accuracy, while achieving 2x training time and 5x GPU memory efficiency.

2. IMPLEMENTATION

First, we describe PamC, our parameter-free proxy to efficiently approximate the negative samples, as shown in Figure 1. Next, we introduce PamCGC, a self-supervised model based on PamC to simultaneously learn discriminative embeddings and clusters.

2.1. NEGATIVE SAMPLE APPROXIMATION BY META-NODE PROXIES

Contrastive loss makes positive or connected nodes closer and negative or unconnected nodes further away in the feature space (Kulatilleke et al., 2022). However, in order to be effective, all negative nodes need to be contrasted with x_i, which is computationally expensive. A cluster center is formed by combining all member nodes, and can be seen as an aggregated representation, or a proxy, of its compositional elements. Motivated by this, we use the cluster centers to enforce negative contrastion. Specifically, we contrast every meta-node cluster center μ̂_a with every other cluster center μ̂_b, where a ≠ b. Following Arora et al. (2019) and Chuang et al. (2020), we assume an underlying set of discrete latent classes C which represents semantic content, i.e., similar nodes x_i, x_j are in the same latent class μ̂. Thus, we derive our proxy for negative samples as:

$$\ell_{proxy} = \log \sum_{a=1}^{C} \sum_{b=1}^{C} \mathbb{1}_{[a \neq b]} \exp\left(\mathrm{sim}(\hat{\mu}_a, \hat{\mu}_b) \cdot \tau\right) \tag{2}$$

Note that ℓ_proxy contains no i or j terms, resulting in three-fold gains. Firstly, we replace the sum over i = 1, ..., N with a more efficient sum over a = 1, ..., C, where N ≫ C, typically by many orders of magnitude, in almost all datasets, as evident from Table 1. Secondly, ℓ_proxy is at graph level, with a time complexity of O(N), rather than at instance level, with O(N^2). Finally, given that real-world graphs (especially larger graphs) are sparse, a sparse implementation for the positives, using edge-lists, results in a third efficiency gain, which is only possible by not having to operate on the negatives explicitly. Note that a prerequisite of the proxy approximation is the availability of labels to construct the meta-node cluster centers μ̂, which we explain in the next section. Thus, the complete graph-level contrastive loss can be expressed as:

$$\ell_{Pcontrast} = -\frac{1}{N} \sum_{i=1}^{N} \log \sum_{j=1}^{N} \mathbb{1}_{[j \neq i]}\, \gamma_{ij} \exp\left(\mathrm{sim}(z_i, z_j) \cdot \tau\right) + \ell_{proxy} \tag{3}$$

Theoretical explanation. The standard contrastive loss uses Jensen-Shannon divergence, which yields a log 2 constant and vanishing gradients for disjoint distributions of positive and negative sampled pairs (Zhu et al., 2021a). However, in the proposed method, positive pairs are necessarily edge-linked (either explicitly or via influence (Kulatilleke et al., 2022)), and unlikely to be disjoint. Using meta-nodes for negatives, which are compositions of multiple nodes, lowers the possibility of disjointness. An algorithm that uses the average of the positive and negative samples in blocks as a proxy, instead of just one point, has a strictly better bound because Jensen's inequality gets tighter, and is superior to its element-wise contrastive counterparts (Arora et al., 2019). The computational and time cost is a direct consequence of node-level contrastion. Given N ≫ C, we circumvent the problem of large N by proposing a proxy-ed negative contrastive objective that operates directly at the cluster level.

Establishing mathematical guarantee: Assume node embeddings Z = {z_1, z_2, z_3, ..., z_N}, clusters µ = {µ_1, µ_2, ..., µ_C}, a label assignment operator label(z_i) such that

$$\mu_a = \sum_{i=1}^{N} \mathbb{1}_{[i \in \mathrm{label}(z_i) = a]} \cdot z_i,$$

a temperature hyperparameter τ, and

$$\mathrm{similarity}(i, j, z_i, z_j) = \mathrm{sim}(z_i, z_j) = \begin{cases} 0, & i = j \\ \dfrac{z_i \cdot z_j}{\lVert z_i \rVert \lVert z_j \rVert}, & i \neq j \end{cases} \tag{4}$$

Using sim(z_i, z_j) as the shorthand notation for similarity(i, j, z_i, z_j), the classic contrastive loss is:

$$loss_{NN} = \frac{1}{N} \sum_{i=1}^{N} \log \left( \sum_{j=1}^{N} \exp\left(\mathrm{sim}(i, j, z_i, z_j)\, \tau\right) \right) \tag{5}$$

Similarly, we can express the cluster-based contrastive loss, summing over the C clusters, as:

$$loss_{CC} = \frac{1}{C} \sum_{a=1}^{C} \log \left( \sum_{b=1}^{C} \exp\left(\mathrm{sim}(a, b, \mu_a, \mu_b)\, \tau\right) \right) \tag{6}$$

As 0 ≤ sim ≤ 1.0, we establish the condition for our inequality as:

$$\frac{loss_{NN}}{loss_{CC}} > \frac{\log(N)}{\log\left[1 + (C - 1) e^{\tau}\right]} \tag{7}$$

We provide the full derivation in Appendix A.1.
As C > 1 (a minimum of 2 is needed for a clustering), and log(x) for x > 0 is strictly increasing, N > 1 + (C - 1)e^τ is the necessary condition, which is easily satisfied for nearly all real-world datasets, as seen in Figure 2 for multiple temperatures τ. Thus, as loss_NN > loss_CC, loss_NN upper bounds loss_CC, the more efficient variant. Additionally, loss_CC benefits from block-contrastiveness (Arora et al., 2019), achieves a lower minimum and uses the fullest possible negative information. We also show, experimentally, that minimizing loss_CC results in effective, and sometimes better, representations for downstream tasks.
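To make the condition concrete, the short sketch below evaluates N > 1 + (C - 1)e^τ and the corresponding ratio log(N) / log[1 + (C - 1)e^τ] for a few settings. N = 3327 is the CITE node count quoted in Section 3.3; the cluster counts and temperatures are illustrative values chosen here, not figures from Table 1, and the function names are hypothetical.

```python
import math

def bound_ratio(num_nodes, num_clusters, tau):
    """log(N) / log(1 + (C - 1) * e^tau), the right-hand side of the inequality."""
    return math.log(num_nodes) / math.log(1 + (num_clusters - 1) * math.exp(tau))

def condition_holds(num_nodes, num_clusters, tau):
    """Necessary condition N > 1 + (C - 1) * e^tau."""
    return num_nodes > 1 + (num_clusters - 1) * math.exp(tau)

# N = 3327 (CITE); C and tau below are illustrative assumptions.
for num_clusters in (2, 6, 10):
    for tau in (0.25, 0.5, 1.0):
        print(num_clusters, tau,
              condition_holds(3327, num_clusters, tau),
              round(bound_ratio(3327, num_clusters, tau), 2))
```

Because the logarithm is strictly increasing and C > 1, the condition holds exactly when the ratio exceeds 1, so either function can be used as the check.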

2.2. CONSTRUCTING THE META-NODE CLUSTER CENTERS ( μ)

In order to derive the real cluster centers μ̂, which are distinct from the learnt cluster centers µ, we simply aggregate all the node embeddings z of a cluster using its label. Even with unlabeled data, label() can be realized using predicted soft labels. The intuition here is that, during backpropagation, the optimization process will update the constituent node embeddings, z, to incorporate negative distancing. Thus,

$$\hat{\mu}_c = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{[i \in \mathrm{label}(c)]}\, z_i, \tag{8}$$

where label(c) is either the ground-truth or the learnt soft labels. Accordingly, our proxy can equally be used in supervised and unsupervised scenarios and has wider general applicability as an improvement of the contrastive loss at large. Finally, Equation 8 can be realized with softmax() and mean() operations, which are well-optimized GPU primitives in any machine learning framework. We provide a reference PyTorch implementation.
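As a rough illustration of this construction and of the proxy loss ℓ_proxy from Section 2.1, the following NumPy sketch aggregates node embeddings into meta-node centers with softmax and mean operations and then contrasts every pair of distinct centers. It is not the reference PyTorch implementation mentioned above: the function names, the raw cluster scores fed to the softmax, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def meta_node_centers(z, assignment):
    """Mean over all N nodes of the assignment-weighted embeddings.

    z is N x d; assignment is N x C, one-hot for hard labels or
    row-stochastic for soft labels. Returns a C x d array.
    """
    return assignment.T @ z / len(z)

def proxy_loss(centers, tau):
    """log of the sum of exp(sim * tau) over ordered pairs a != b."""
    unit = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sim = unit @ unit.T                               # C x C cosine similarities
    off_diagonal = ~np.eye(len(centers), dtype=bool)  # indicator 1[a != b]
    return float(np.log(np.exp(sim[off_diagonal] * tau).sum()))

# Toy data: four 2-d embeddings and hypothetical cluster scores.
z = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
hard = np.eye(2)[[0, 0, 1, 1]]                 # one-hot ground-truth labels
scores = np.array([[4.0, 0.0], [5.0, 0.0], [0.0, 3.0], [0.0, 6.0]])
soft = softmax(scores)                         # stand-in for predicted soft labels

centers_hard = meta_node_centers(z, hard)      # rows: [1, 0] and [0, 1.5]
centers_soft = meta_node_centers(z, soft)
loss = proxy_loss(centers_hard, tau=0.5)       # orthogonal centers give log(2)
```

Since cosine similarity ignores vector length, the 1/N scaling of the centers does not change the proxy loss. The cost of `proxy_loss` depends only on the number of centers C, not on N.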

2.3. OBTAINING THE SOFT LABELS

Graph clustering is essentially unsupervised. Following prior work (…, 2022), we use probability-distribution-derived soft labels and a self-supervision mechanism for cluster enhancement. Specifically, we obtain soft cluster assignment probabilities q_iu for embedding z_i and cluster center µ_u. In order to handle differently scaled clusters and be computationally convenient (Wang et al., 2019), we use the Student's t-distribution (Maaten & Hinton, 2008) as a kernel for the similarity measurement between the embedding and centroid:

$$q_{iu} = \frac{\left(1 + \lVert z_i - \mu_u \rVert^2 / \eta\right)^{-\frac{\eta + 1}{2}}}{\sum_{u'} \left(1 + \lVert z_i - \mu_{u'} \rVert^2 / \eta\right)^{-\frac{\eta + 1}{2}}}, \tag{9}$$

where η is the number of degrees of freedom of the Student's t-distribution. Cluster centers µ are initialized by K-means on embeddings from the pre-trained AE. We use Q = [q_iu] as the distribution of the cluster assignments of all samples, and η = 1 for all experiments, following prior work. Nodes closer in embedding space to a cluster center have higher soft assignment probabilities in Q. A target distribution P that emphasizes the confident assignments is obtained by squaring and normalizing Q:

$$p_{iu} = \frac{q_{iu}^2 / \sum_i q_{iu}}{\sum_k \left(q_{ik}^2 / \sum_i q_{ik}\right)}, \tag{10}$$

where Σ_i q_iu is the soft cluster frequency of centroid u. Following Kulatilleke et al. (2022), we minimize the KL divergence between the Q and P distributions, which forces the current distribution Q to approach the more confident target distribution P. KL divergence updates models more gently and lessens severe disturbances on the embeddings (Bo et al., 2020). Further, it can accommodate both the structural and feature optimization targets of PamCGC.
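The two distributions above can be sketched directly from Equations 9 and 10. The NumPy code below is a minimal illustration with hypothetical function names and toy data; it assumes dense arrays and is not taken from the released implementation.

```python
import numpy as np

def soft_assignments(z, centers, eta=1.0):
    """Q: Student's t kernel between embeddings (N x d) and centroids (C x d)."""
    sq_dist = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    kernel = (1.0 + sq_dist / eta) ** (-(eta + 1.0) / 2.0)
    return kernel / kernel.sum(axis=1, keepdims=True)

def target_distribution(q):
    """P: square Q, divide by the soft cluster frequency, renormalise each row."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

# Toy example: two well-separated groups and their centroids.
z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
q = soft_assignments(z, centers)   # eta = 1, as in the experiments
p = target_distribution(q)
```

In this balanced toy case every row of P is more peaked than the corresponding row of Q, which is the sharpening effect the target distribution is meant to provide.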
We self-supervise cluster assignments [1] by using distribution Q to derive the target distribution P, which then supervises the distribution Q in turn by minimizing the KL divergence:

$$loss_{cluster} = KL(P \,\|\, Q) = \sum_i \sum_u p_{iu} \log \frac{p_{iu}}{q_{iu}} \tag{11}$$

The final proposed model, PamCGC, after incorporating the PamC contrastive objective with self-supervised clustering, where α > 0 controls structure incorporation and β > 0 controls cluster optimization, is:

$$L_{final} = \alpha\, \ell_{Pcontrast}(K, \tau) + \beta\, loss_{cluster} \tag{12}$$

For PamC, we only compute ∥z∥_2^2 and z_i · z_j for the actual positive edges E using sparse matrices, resulting in a time complexity O_+ = O(N E d_z), linear in the number of edges E, with d_z the embedding dimension. For the negatives, we use the meta-node based negatives, with O_- = O(C · C), where C is the number of meta-nodes. Note that, for real graphs, N ≫ C by many orders of magnitude. Thus, the overall time complexity is linearly related to the number of samples and edges.
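The clustering loss and the weighted final objective can be written down in a few lines. The sketch below is an assumption-laden illustration: the small epsilon guarding the logarithm, the function names, and all numeric values (including α and β) are hypothetical and not the settings used in the experiments.

```python
import numpy as np

def clustering_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over nodes i and clusters u.

    eps is a numerical guard added for this sketch only.
    """
    return float((p * np.log((p + eps) / (q + eps))).sum())

def final_loss(l_pcontrast, loss_cluster, alpha, beta):
    """L_final = alpha * l_Pcontrast + beta * loss_cluster."""
    return alpha * l_pcontrast + beta * loss_cluster

# Hypothetical soft assignments Q and sharper targets P for two nodes.
q = np.array([[0.6, 0.4], [0.3, 0.7]])
p = np.array([[0.8, 0.2], [0.1, 0.9]])

kl = clustering_loss(p, q)
total = final_loss(l_pcontrast=2.0, loss_cluster=kl, alpha=1.0, beta=0.1)
```

The divergence is zero only when Q already equals P, so the term keeps pulling Q toward the more confident targets.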

3. EXPERIMENTS

We evaluate PamCGC on transductive node clustering, comparing it to state-of-the-art self-supervised, contrastive and (semi-)supervised methods.

Datasets. Following Bo et al. (2020), Peng et al. (2021) and Kulatilleke et al. (2022), experiments are conducted on six common clustering benchmarks, which include one image dataset (USPS (Le Cun et al., 1990)), one sensor dataset (HHAR (Stisen et al., 2015)), one text dataset (REUT (Lewis et al., 2004)) and three citation graphs (ACM [2], CITE [4], and DBLP [3]). For the non-graph data, we use undirected k-nearest neighbours (KNN (Altman, 1992)) to generate the adjacency matrix A, following Bo et al. (2020) and Peng et al. (2021). Table 1 summarizes the datasets.

Baseline Methods. We compare with multiple models. K-means (Hartigan & Wong, 1979) is a classical clustering method using raw data. AE (Hinton & Salakhutdinov, 2006) applies K-means to deep representations learned by an auto-encoder. DEC (Xie et al., 2016) clusters data in a jointly optimized feature space. IDEC (Guo et al., 2017) enhances DEC by adding KL divergence-based reconstruction loss. The following models exploit graph structures during clustering: SVD (Golub & Reinsch, 1971) applies singular value decomposition to the adjacency matrix. DGI (Velickovic et al., 2019) learns embeddings by maximizing node MI with the graph. GAE (Kipf & Welling, 2016) combines convolution with the AE. ARGA (Pan et al., 2018) uses an adversarial regularizer to guide the embedding learning. The following deep graph clustering methods jointly optimize embeddings and graph clustering: DAEGC (Wang et al., 2019) uses an attentional neighbor-wise strategy and clustering loss. SDCN (Bo et al., 2020) couples DEC and GCN via a fixed delivery operator and uses feature reconstruction. AGCN (Peng et al., 2021) extends SDCN by adding an attention-based delivery operator and uses multi-scale information for cluster prediction.
CGC (Park et al., 2022) uses a multi-level, hierarchy-based contrastive loss. SCGC and SCGC* (Kulatilleke et al., 2022) use block contrastive loss with an AE and an MLP, respectively. The only difference between SCGC* and PamCGC is the novel PamC loss. As SCGC* is also the current state-of-the-art, it is used as the benchmark.

Evaluation Metrics. Following Bo et al. (2020) and Peng et al. (2021), we use Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and macro F1-score (F1) for evaluation. For each, larger values imply better clustering.
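For the non-graph datasets, the undirected KNN adjacency mentioned under Datasets can be built along the following lines. This is a sketch under stated assumptions only: it uses Euclidean distance and a dense distance matrix, whereas the cited works may use a different similarity measure, and the function name and toy points are hypothetical.

```python
import numpy as np

def knn_adjacency(x, k):
    """Symmetric 0/1 adjacency linking each sample to its k nearest
    neighbours by Euclidean distance, with self-matches excluded."""
    sq_norm = (x ** 2).sum(axis=1)
    sq_dist = sq_norm[:, None] + sq_norm[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(sq_dist, np.inf)             # a sample is not its own neighbour
    nearest = np.argsort(sq_dist, axis=1)[:, :k]  # N x k neighbour indices
    adj = np.zeros((len(x), len(x)), dtype=int)
    adj[np.repeat(np.arange(len(x)), k), nearest.ravel()] = 1
    return np.maximum(adj, adj.T)                 # undirected: keep an edge chosen by either end

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
adjacency = knn_adjacency(points, k=1)
```

The dense N x N distance matrix is acceptable for a toy example; for the dataset sizes used in the experiments a neighbour-search library would be the more practical choice.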

3.1. IMPLEMENTATION

The positive component of our loss only requires the actual connections and can be efficiently represented by sparse matrices. Further, the negative component of the loss is graph-based, not instance-based, and thus needs to be computed only once per epoch. Thus, by decoupling the negatives, our loss is inherently capable of batching and is trivially parallelizable. Computing the negative proxy, which involves only C · C terms, does not even require a GPU. For all timing and memory experiments, we replicate the exact same training loops, including internal evaluation metric calls, when measuring performance, for fair comparison. Our code will be made publicly available.
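A minimal sketch of the edge-list evaluation of the positive component, together with a count of how many similarity evaluations the decoupled loss needs, is given below. The function names are hypothetical, isolated nodes are simply skipped (an assumption of this sketch), and the edge and cluster counts in the usage note are illustrative rather than dataset statistics.

```python
import numpy as np

def positive_term(z, edges, gamma, tau):
    """-(1/N) * sum_i log sum_j gamma_ij * exp(sim(z_i, z_j) * tau),
    evaluated only on the listed edges.

    edges: (E, 2) integer array of directed pairs (i, j) with i != j.
    gamma: (E,) positive weight per edge.
    Nodes with no outgoing edge contribute nothing (sketch assumption).
    """
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    src, dst = edges[:, 0], edges[:, 1]
    sim = (unit[src] * unit[dst]).sum(axis=1)          # E similarities, not N * N
    per_node = np.zeros(len(z))
    np.add.at(per_node, src, gamma * np.exp(sim * tau))
    connected = per_node > 0
    return float(-np.log(per_node[connected]).sum() / len(z))

def pair_counts(num_nodes, num_edges, num_clusters):
    """Similarity evaluations: all node pairs vs. edges plus center pairs."""
    dense = num_nodes * (num_nodes - 1)
    proxied = num_edges + num_clusters * (num_clusters - 1)
    return dense, proxied

z = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
edges = np.array([[0, 1], [1, 0]])                     # node 2 is isolated
gamma = np.ones(len(edges))
value = positive_term(z, edges, gamma, tau=0.5)
```

For instance, `pair_counts(3327, 10000, 6)` compares a dense contrast over the 3327 CITE nodes with a hypothetical 10000 directed edges and 6 meta-nodes: roughly 11 million node pairs against about ten thousand evaluations.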

3.2. RESULTS

We show our hyperparameters in Table 1. Comparisons with the state of the art on graph and non-graph datasets are in Table 2 and Table 3, respectively. For the graph data, PamCGC is state-of-the-art for DBLP. A paired t-test shows the ACM and CITE results to be jointly best for SCGC* and PamCGC. In the non-graph results, PamCGC comes second best on the USPS image data. While results for HHAR are somewhat lagging, PamCGC is the best for REUT. Generally, we achieve better results on the natural graph datasets (ACM, DBLP and CITE), while being competitive on other modalities. We present the qualitative results in Appendix A.4.

3.3. PERFORMANCE

In Figure 3, we compare the GPU-based training time and GPU memory. Our model times also include the time taken for the cumulative influence computation. Across the datasets, PamCGC is superior, with 2.2x training-time and 5.3x GPU-memory savings. In particular, for the larger datasets USPS, HHAR and REUT, PamCGC uses 5.2x, 7.7x and 8.7x less GPU memory, respectively. Additionally, we used the CITE dataset (3327 nodes) to create synthetic nodes: for a scale factor n, we concatenate the nodes n times, along with the edge-lists.
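One plausible reading of this synthetic scaling is sketched below: the feature matrix is stacked n times and each copy of the edge list has its node indices shifted so that it refers to its own copy of the nodes. Whether the original experiments keep the copies disconnected in this way is an assumption, and the function name and toy graph are hypothetical.

```python
import numpy as np

def replicate_graph(features, edges, n):
    """Stack n copies of a graph; copy k has its node ids shifted by k * N."""
    num_nodes = len(features)
    big_features = np.tile(features, (n, 1))
    offsets = (np.arange(n) * num_nodes)[:, None, None]      # shape n x 1 x 1
    big_edges = (edges[None, :, :] + offsets).reshape(-1, 2)  # shape (n * E) x 2
    return big_features, big_edges

features = np.arange(6, dtype=float).reshape(3, 2)   # 3 toy nodes, 2 features each
edges = np.array([[0, 1], [1, 2]])
big_features, big_edges = replicate_graph(features, edges, n=3)
```

Both the node count and the edge count grow linearly with n, which makes the construction convenient for probing how time and memory scale with graph size.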

3.4. ABLATION STUDY

To investigate PamC's ability to generalize to other models, we incorporate it into the SDCN and AGCN models, modified for contrastive loss. We also carry out extensive experimentation to assess the behavior of the hyperparameters. PamC is robust to changes in hyperparameter values and performs best with a learning rate of 0.001, as shown in Appendix A.2. Further, PamC accuracy across all hyperparameter combinations is generally equal to or better than that of the less efficient non-proxy-ed contrastive loss variant, as seen in Appendix A.3.

3.5. FUTURE WORK

Our parameter-free proxy-ed contrastive loss uses the full positive edge information which, as some of our experiments have shown, is redundant. For example, USPS gives similar results with 40% of the positive edges removed. An algorithm to drop uninformative edges may result in further efficiency improvements, which we leave for future work. Although it is theoretically possible, it remains to be seen how our proxy-ed contrastive loss works with semi- or fully supervised data. Further study is needed to explore how hard cluster centers affect the optimization process.

4. CONCLUSION

In this work, we present an efficient parameter-free proxy approximation to incorporate negative samples in contrastive loss for joint clustering and representation learning. We eliminate sample bias, and achieve block-contrastiveness and O(N) time complexity. Our work is supported by theoretical proof and empirical results. We improve considerably over previous methods in accuracy, speed and memory usage. Our approach differs from prior self-supervised clustering in the proxy mechanism we use to incorporate all negative samples efficiently. The strength of this simple approach indicates that, despite the increased interest in graphs, effective contrastive learning remains relatively unexplored.

A APPENDIX

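A.1 PROOFS OF THEORETICAL RESULTS - DERIVATION OF EQUATION 7

Assume node embeddings Z = {z_1, z_2, z_3, ..., z_N}, clusters µ = {µ_1, µ_2, ..., µ_C}, a label assignment operator label(z_i) such that

$$\mu_a = \sum_{i=1}^{N} \mathbb{1}_{[i \in \mathrm{label}(z_i) = a]} \cdot z_i,$$

a hyperparameter τ related to the temperature in contrastive loss, and

$$\mathrm{similarity}(i, j, z_i, z_j) = \mathrm{sim}(z_i, z_j) = \begin{cases} 0, & i = j \\ \dfrac{z_i \cdot z_j}{\lVert z_i \rVert \lVert z_j \rVert}, & i \neq j \end{cases} \tag{13}$$

We use sim(z_i, z_j) as the shorthand notation for similarity(i, j, z_i, z_j), interchangeably, for brevity.

We begin with Equation 1, which is the popular form of contrastive loss (Hu et al., 2021; Kulatilleke et al., 2022). With τ as the temperature parameter and γ_ij the relationship between nodes i and j, the loss for the i-th node can be expanded as:

$$\ell_i = +\log \sum_{j=1}^{B} \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i, z_j)\, \tau\right) - \log \sum_{j=1}^{B} \mathbb{1}_{[j \neq i]}\, \gamma_{ij} \exp\left(\mathrm{sim}(z_i, z_j)\, \tau\right), \tag{14}$$

where the first part on the right corresponds to the negative node contrasting portion and the second portion contrasts the positives for node i. From Equation 14, we take the negative node contrasting portion and average it over all N nodes to obtain:

$$loss_{NN} = \frac{1}{N} \sum_{i=1}^{N} \log \left( \sum_{j=1}^{N} e^{\mathrm{sim}(i, j, z_i, z_j)\, \tau} \right) \tag{15}$$

Note that we use the more concise sim() and the compact e notation in place of exp() interchangeably. We expand Equation 15, with e^0 = 1 in cases where i = j, as:

$$\begin{aligned} loss_{NN} = \frac{1}{N} \Big[ &\log\left(1 + e^{\mathrm{sim}(z_1, z_2)\tau} + e^{\mathrm{sim}(z_1, z_3)\tau} + e^{\mathrm{sim}(z_1, z_4)\tau} + \dots + e^{\mathrm{sim}(z_1, z_N)\tau}\right) \\ +\, &\log\left(e^{\mathrm{sim}(z_2, z_1)\tau} + 1 + e^{\mathrm{sim}(z_2, z_3)\tau} + e^{\mathrm{sim}(z_2, z_4)\tau} + \dots + e^{\mathrm{sim}(z_2, z_N)\tau}\right) \\ +\, &\log\left(e^{\mathrm{sim}(z_3, z_1)\tau} + e^{\mathrm{sim}(z_3, z_2)\tau} + 1 + e^{\mathrm{sim}(z_3, z_4)\tau} + \dots + e^{\mathrm{sim}(z_3, z_N)\tau}\right) \\ +\, &\cdots \\ +\, &\log\left(e^{\mathrm{sim}(z_N, z_1)\tau} + e^{\mathrm{sim}(z_N, z_2)\tau} + e^{\mathrm{sim}(z_N, z_3)\tau} + e^{\mathrm{sim}(z_N, z_4)\tau} + \dots + 1\right) \Big] \end{aligned} \tag{16}$$

Similarly, we can express the cluster-based contrastive loss, summing over the C clusters, as:

$$loss_{CC} = \frac{1}{C} \sum_{a=1}^{C} \log \left( \sum_{b=1}^{C} e^{\mathrm{sim}(a, b, \mu_a, \mu_b)\, \tau} \right) \tag{17}$$

with the following expansion:

$$\begin{aligned} loss_{CC} = \frac{1}{C} \Big[ &\log\left(1 + e^{\mathrm{sim}(\mu_1, \mu_2)\tau} + e^{\mathrm{sim}(\mu_1, \mu_3)\tau} + e^{\mathrm{sim}(\mu_1, \mu_4)\tau} + \dots + e^{\mathrm{sim}(\mu_1, \mu_C)\tau}\right) \\ +\, &\log\left(e^{\mathrm{sim}(\mu_2, \mu_1)\tau} + 1 + e^{\mathrm{sim}(\mu_2, \mu_3)\tau} + e^{\mathrm{sim}(\mu_2, \mu_4)\tau} + \dots + e^{\mathrm{sim}(\mu_2, \mu_C)\tau}\right) \\ +\, &\log\left(e^{\mathrm{sim}(\mu_3, \mu_1)\tau} + e^{\mathrm{sim}(\mu_3, \mu_2)\tau} + 1 + e^{\mathrm{sim}(\mu_3, \mu_4)\tau} + \dots + e^{\mathrm{sim}(\mu_3, \mu_C)\tau}\right) \\ +\, &\cdots \\ +\, &\log\left(e^{\mathrm{sim}(\mu_C, \mu_1)\tau} + e^{\mathrm{sim}(\mu_C, \mu_2)\tau} + e^{\mathrm{sim}(\mu_C, \mu_3)\tau} + e^{\mathrm{sim}(\mu_C, \mu_4)\tau} + \dots + 1\right) \Big] \end{aligned} \tag{18}$$

If loss^min_NN > loss^max_CC, we have loss_NN / loss_CC > 1. Next, we show the conditions necessary for establishing this inequality. As 0 ≤ sim ≤ 1.0, we obtain the minimum using sim_min = 0:

$$loss^{min}_{NN} = \frac{1}{N} \Big[ \log\left(1 + e^0 + e^0 + \dots + e^0\right) + \dots + \log\left(1 + e^0 + e^0 + \dots + e^0\right) \Big] = \log\left(1 + (N - 1) e^0\right) = \log(N) \tag{19}$$

Similarly, we can obtain the maximum, using sim_max = 1.0:

$$loss^{max}_{CC} = \frac{1}{C} \Big[ \log\left(1 + e^{1 \cdot \tau} + e^{1 \cdot \tau} + \dots + e^{1 \cdot \tau}\right) + \dots + \log\left(1 + e^{1 \cdot \tau} + e^{1 \cdot \tau} + \dots + e^{1 \cdot \tau}\right) \Big] = \log\left[1 + (C - 1) e^{\tau}\right] \tag{20}$$

Combining Equation 19 and Equation 20, we establish the necessary condition for our inequality, Equation 7, as:

$$\frac{loss_{NN}}{loss_{CC}} > \frac{\log(N)}{\log\left[1 + (C - 1) e^{\tau}\right]} \tag{21}$$

This derivation is used in Section 2.1, where we show how the condition is almost always satisfied in real graphs. As a result, loss_NN upper bounds loss_CC. Note that a lower loss is better.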
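The two bounds used in this derivation can be checked numerically on synthetic data. The sketch below is only a sanity check under stated assumptions: embeddings are drawn with non-negative coordinates so that 0 ≤ sim ≤ 1 holds, cluster membership is assigned arbitrarily, and the function name and all sizes are hypothetical.

```python
import numpy as np

def block_loss(x, tau):
    """(1/n) * sum_i log sum_j exp(sim(i, j) * tau), with sim(i, i) = 0."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)          # similarity is defined as 0 when i = j
    return float(np.log(np.exp(sim * tau).sum(axis=1)).mean())

rng = np.random.default_rng(0)
tau, num_nodes, num_clusters = 0.5, 200, 4
z = np.abs(rng.normal(size=(num_nodes, 8)))       # non-negative, so 0 <= sim <= 1
labels = np.arange(num_nodes) % num_clusters      # arbitrary membership
centers = np.stack([z[labels == c].sum(axis=0) for c in range(num_clusters)])

loss_nn = block_loss(z, tau)
loss_cc = block_loss(centers, tau)
lower_nn = np.log(num_nodes)                                  # minimum of loss_NN
upper_cc = np.log(1 + (num_clusters - 1) * np.exp(tau))       # maximum of loss_CC
```

With these sizes, log(N) is about 5.3 while the upper bound on the cluster-level loss is about 1.8, so the node-level loss necessarily sits above the cluster-level one.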



[1] We follow Bo et al. (2020)'s use of the term 'self-supervised' to be consistent with the GCN training method.
[2] http://dl.acm.org/
[3] https://dblp.uni-trier.de
[4] http://citeseerx.ist.psu.edu/index



Figure 1: PamCGC jointly learns structure and clustering via probabilistic soft assignment, which is used to derive the real cluster centers μ̂, used as a proxy for negative samples. The grey dotted section outlines the training components. Cluster centroids µ are obtained by pre-training an AE for reconstruction. The red dotted section is our core contribution: we use μ̂ as an efficient approximation, computing centroid-pairs instead of node-pairs, achieve block-contrastiveness, and do so at graph level, not instance level.

Figure 2: Nodes N vs Clusters C with different τ temperature values. Grey surface shows the ratio = 1.0 inequality boundary. Generally, real world graphs satisfy the condition ratio > 1.0 easily. Best viewed in color.

2.4. COMPLEXITY ANALYSIS

Given input data dimension d and AE layer dimensions d_1, d_2, ..., d_L, following Kulatilleke et al. (2022), O_AE = O(N d^2 d_1^2 ... d_{L/2}^2) for PamCGC-AE. Assuming K clusters, from Equation 9, the time complexity is O_cluster = O(NK + N log N), following Xie et al. (2016).

For fair comparison, we use the same 500-500-2000-10 AE dimensions as in Guo et al. (2017), Bo et al. (2020), Peng et al. (2021) and Kulatilleke et al. (2022), and the same pre-training procedure, i.e., 30 epochs; a learning rate of 10^-3 for USPS, HHAR, ACM and DBLP, and 10^-4 for REUT and CITE; and a batch size of 256. We made use of the publicly available pre-trained AE from Bo et al. (2020). We use a once-computed edge-list for training, which is not needed during inference. For training, for each dataset, we initialize the cluster centers from K-means and repeat the experiments 10 times with 200 epochs to prevent extreme cases. We cite published accuracy results from Bo et al. (2020), Peng et al. (2021) and Kulatilleke et al. (2022) for other models.

Figure 3 (right) shows the scaled edges and nodes for scale factors 5, 10, 15, ..., 45, and the GPU memory and training time for 1 epoch on a Google Colab T4 GPU with 16GB memory. Without PamC, scale factors above 5 are not possible due to running out of memory. With PamC, a scale factor of 45 and beyond (about 150,000 nodes) is possible. The increase in GPU time and memory is linear, confirming the theoretical time complexity. We used CITE as it is a very common dataset. We used synthetic node creation to capture variation over node size. Appendix A.8 shows the GPU time breakdown. Appendix A.6 shows the CITE dataset results with PamC when scaled from 1 to 20 in steps of 1.

Figure 3: GPU performance from the PyTorch profiler on Google Colab with a T4 16GB GPU. Left: training time for 200 epochs. Center: memory utilization per epoch. Right: graph size vs. time and memory on synthetic CITE data per epoch; without PamC, the model runs out of memory after 17,000 nodes. With PamC, 150,000 nodes and over 18 million edges can be handled on the T4's 16GB. Note that SCGC* only differs from PamCGC by the latter's use of the novel proxy-ed PamC, to which we solely attribute the time and memory savings.

Figure 5: Ablation study on the hyperparameters. TAU=τ, ALPHA=α, ORDER=R and LR denotes learning rate. A hyperparameter with a higher and more condensed distribution represents its superiority over its counterpart. PamCGC is robust to τ, α and R, and is best with a learning rate of 0.001. Best viewed in color.

Figure 7: Visual comparison of embeddings; top: raw data, second row: after AE pre-training, third row: from SCGC*, and last row: from PamCGC*. Colors represent ground truth groups. Black squares, μ̂, are the approximated meta-nodes. Red dots, µ, are the cluster centroids.

We use UMAP (McInnes et al., 2018), in Figure 7, to get a visual understanding of the raw and learnt embedding spaces. Except for USPS, which is a distinct set of 0, ..., 9 handwritten digits (raw 1), we see that all other datasets produce quite indistinguishable clusters. Clustering is nearly non-existent in the (last 3) graph datasets. This clearly shows a characteristic difference in graph data, which can lead to high sampling bias. Note that μ̂ ≠ µ for any meta-node.

Table 1: Statistics of the datasets (left) and PamCGC hyperparameters (right).

Table 2: Clustering performance on the three graph datasets (mean±std). Best results are bold. Results reproduced from Bo et al. (2020); Peng et al. (2021); Kulatilleke et al. (2022); Park et al. (2022). SCGC (Kulatilleke et al., 2022) uses neighbor-based contrastive loss with an AE, while the SCGC* variant uses r-hop cumulative influence contrastive loss with an MLP, the same as our PamCGC.

Table 3: Clustering performance on the three non-graph datasets (mean±std). Best results are bold; second best is underlined. Results reproduced from Bo et al. (2020); Peng et al. (2021); Kulatilleke et al. (2022). SCGC (Kulatilleke et al., 2022) uses neighbor-based contrastive loss with an AE, while the SCGC* variant uses r-hop cumulative influence contrastive loss with an MLP, the same as our PamCGC.

Comparison of hyperparameters with and without PamC. TAU=τ, ALPHA=α, ORDER=R and LR denotes learning rate. A hyperparameter with a higher and more condensed distribution represents its superiority over its counterpart. PamC is generally better in accuracy for the majority of the hyperparameter combinations. Best viewed in color.

Results for the CITE dataset show that PamC is competitive in terms of accuracy. ‡Results reproduced from Zheng et al. (2022).

The GPU time breakdown for USPS dataset for 200 epochs on Colab T4 (16GB). The model forward figures (1.272 and 2.257) are different because the GPU is caching the results in the case of no PamC. During inference, these figures are identical.

