UNIFYING DATA-MODEL SPARSITY FOR CLASS-IMBALANCED GRAPH REPRESENTATION LEARNING

Abstract

To relieve the heavy computation cost of deep learning, models with more compact architectures have been proposed to achieve comparable performance. However, it is not only cumbersome model architectures but also the massiveness of the training data that adds to the computational burden. This problem is particularly accentuated in the graph learning field: on one hand, Graph Neural Networks (GNNs) trained on non-Euclidean graph data often incur relatively high time costs, due to the irregular density properties of graphs; on the other hand, the class imbalance that naturally accompanies graphs cannot be alleviated by the sheer volume of data, hindering GNNs' ability to generalize. To fully tackle the above issues, (i) theoretically, we introduce a hypothesis on the extent to which a subset of the training data can approximate the full dataset's learning effectiveness, which is further guaranteed by the distance between the gradients of the subset and of the full set; (ii) empirically, we discover that during the learning process of a GNN, some samples in the training dataset are more informative than others in providing gradients for updating model parameters. Moreover, the informative subset evolves as the training process proceeds. We refer to this observation as dynamic data sparsity. We also notice that a pruned sparse contrastive GNN model sometimes "forgets" the information provided by the informative subset, reflected in the large magnitudes of its losses on those samples. Motivated by the above findings, we develop a unified data-model dynamic sparsity framework named Graph Decantation (GraphDec) to address the above challenges. The key idea of GraphDec is to identify the informative subset dynamically during the training process by adopting sparse graph contrastive learning.
Extensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art baselines for the class-imbalanced graph/node classification tasks, with respect to classification accuracy and data usage efficiency.

1. INTRODUCTION

Graph representation learning (GRL) (Kipf & Welling, 2017) has shown remarkable power in dealing with non-Euclidean structured data (e.g., social networks, biochemical molecules, knowledge graphs). Graph neural networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018), as the current state of the art of GRL, have become essential in various graph mining applications. However, in many real-world scenarios, training on graph data often encounters two difficulties: class imbalance (Park et al., 2022) and massive data usage (Thakoor et al., 2021; Hu et al., 2020). First, class imbalance naturally exists in datasets from diverse practical domains, such as bioinformatics and social networks. GNNs are sensitive to this property and can be biased toward the dominant classes. This bias may mislead the learning process, resulting in underfitting of samples that really matter for the downstream tasks and, ultimately, poor test performance. Second, massive data usage requires GNNs to perform message passing over nodes of high degree, bringing about heavy computational burdens. Some of this computation is redundant, since not all neighbors are informative for learning task-related embeddings. Unlike regular data such as images or texts, the connectivity of irregular graph data invokes random memory access, which further slows down data readout. Accordingly, recent studies (Chen et al., 2021; Zhao et al., 2021; Park et al., 2022) have arisen to address class imbalance or massive data usage in graph data: (i) On one hand, to deal with class imbalance in node classification on graphs, GraphSMOTE (Zhao et al., 2021) generates new nodes for the minority classes to balance the training data. Improving upon GraphSMOTE, GraphENS (Park et al., 2022) further proposes an augmentation method that constructs an ego network to learn representations of the minority classes.
(ii) On the other hand, to alleviate massive data usage, (Eden et al., 2018; Chen et al., 2018) explore efficient data sampling policies that reduce the computational cost from the data perspective. From the model improvement perspective, some approaches design quantization-aware training and low-precision inference to reduce GNNs' operating costs on data. For example, GLT (Chen et al., 2021) applies the lottery ticket pruning technique (Frankle & Carbin, 2019) to simplify the graph data and the GNN model concurrently. Despite the progress made so far, existing methods fail to address class imbalance and computational burden together. Dealing with one may even exacerbate the other: when tackling data imbalance, the newly synthesized nodes in GraphSMOTE and GraphENS bring along extra computational burdens for the subsequent training process. While a compact model reduces the computational burden to some extent, we interestingly find that the pruned model easily "forgets" the minorities in class-imbalanced data, reflected in its performance being worse than the original model's. To investigate this observation, we study how each graph sample affects GNN training by taking a closer look at the gradients it exerts. Specifically, (i) in the early phases of training, we identify a small subset that provides the most informative supervisory signals, as measured by the magnitudes of the gradient norms (shown later in Figure 5); (ii) the informative subset evolves dynamically as the training process proceeds (as depicted later in Figure 3). Both phenomena prompt the hypothesis that the full training set's training effectiveness can be approximated, to some extent, by that of a dynamic subset. We further show that the effectiveness of this approximation is guaranteed by the distance between the gradients of the subset and of the full training set, as stated in Theorem 1.
Based on the above, we propose a novel method called Graph Decantation (GraphDec) to guide dynamic sparsity training from both the model and the data aspects. The principle behind GraphDec is shown in Figure 1. Since disadvantaged but informative samples tend to produce gradients of larger magnitude, GraphDec relies on the gradients directed by a dynamic sparse graph contrastive learning loss to identify informative subsets that approximate the full set's training effectiveness. This mechanism requires no supervised labels, while simultaneously training the primary GNN and pruning the sparse one. Specifically, at each epoch, our proposed framework scores the samples in the current training set and keeps only the k most informative samples for the next epoch. Additionally, the framework incorporates a data recycling process, which randomly recycles previously discarded samples (i.e., samples considered unimportant in earlier training epochs) by re-involving them in the current training process. As a result, the dynamically updated subset (i) supports the sparse GNN in learning relatively unbiased representations and (ii) approximates the full training set through the lens of Theorem 1. To summarize, our contributions in this work are: • We develop a novel framework, Graph Decantation, which leverages dynamic sparse graph contrastive learning on class-imbalanced graph data for efficient data usage. To the best of our knowledge, this is the first study to explore the dynamic sparsity property for class-imbalanced graphs. • We introduce cosine annealing to dynamically control the sizes of the sparse GNN model and the graph data subset, smoothing the training process. Meanwhile, we introduce data recycling to refresh the current data subset and avoid overfitting.
• Comprehensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art methods on both the class-imbalanced graph classification and class-imbalanced node classification tasks. Additional results show that GraphDec effectively identifies an informative subset dynamically across the training epochs.

2. RELATED WORK

Graph Contrastive Learning. Contrastive learning was first established for image tasks and has since received considerable attention in the field of graph representation learning (Chen et al., 2020). It uses instance-level identity as supervision and maximizes agreement between positive pairs in the hidden space in a contrastive manner (Velickovic et al., 2019; Hassani & Khasahmadi, 2020; You et al., 2020). Recent research in this area seeks to improve the efficacy of graph contrastive learning by uncovering harder views (Xu et al., 2021; You et al., 2021). However, the majority of available approaches consume a great deal of data. By identifying an informative subset of the entire dataset, our model avoids this issue. Training deep models with sparsity. Parameter pruning aimed at decreasing computational cost has been a popular topic, and many parameter-pruning strategies have been proposed to balance the trade-off between model performance and learning efficiency (Deng et al., 2020; Liu et al., 2019). Some belong to the static pruning category, where deep neural networks are pruned either by neurons (Han et al., 2015b; 2016) or by architectures (layers and filters) (He et al., 2017; Dong et al., 2017). In contrast, recent works propose dynamic pruning strategies in which different compact subnets are activated at each training iteration (Mocanu et al., 2018; Mostafa & Wang, 2019; Raihan & Aamodt, 2020). Another line of computational cost reduction lies in dataset sparsity (Karnin & Liberty, 2019; Mirzasoleiman et al., 2020; Paul et al., 2021). Recently, the sparsity property has also been used to improve model robustness (Chen et al., 2022; Fu et al., 2021). In this work, we attempt to accomplish dynamic sparsity in both the GNN model and the graph dataset simultaneously. Class-imbalanced learning on graphs.
Besides conventional node re-balancing methods, such as re-weighting samples (Zhao et al., 2021; Park et al., 2022) and oversampling (Zhao et al., 2021; Park et al., 2022), an early work (Zhou et al., 2018) characterizes rare classes through a curriculum strategy, while other previous works (Shi et al., 2020; Zhao et al., 2021; Park et al., 2022) tackle the class-imbalance issue by generating synthetic samples to re-balance the dataset. Compared to the node-level task, graph-level re-balancing is under-explored. A recent work (Wang et al., 2021) proposes to utilize neighboring signals to alleviate graph-level class imbalance. To the best of our knowledge, our proposed GraphDec is the first work to address class imbalance for both node-level and graph-level tasks.

3. METHODOLOGY

In this section, we first theoretically illustrate our graph sparse subset approximation hypothesis, which guides the design of GraphDec: continuously refining a compact training subset via dynamic graph contrastive learning. The presentation then covers the importance-ranking procedure for each sample, the smoothing of the subset refinement, and the regularization against overfitting. Relevant preliminaries on GNNs, graph contrastive learning, and network pruning are provided in Appendix B.

3.1. GRAPH SPARSE SUBSET APPROXIMATION HYPOTHESIS

We first introduce the key notations used in the method. Specifically, we denote the full graph dataset as $\mathcal{G}_F$, the graph data subset used to train the model as $\mathcal{G}_S$, the learning rate as $\alpha$, and the graph learning model parameters as $\theta$ (the optimal model parameters as $\theta^*$). The detailed proof of Theorem 1 is provided in Appendix A. According to Theorem 1, it is straightforward that we can minimize the gap between the models trained on the full graph dataset and on the graph data subset, i.e., $\mathcal{L}_{\mathcal{G}_S;\theta^{(t)}} - \mathcal{L}_{\theta^*}$, by reducing the distance between the gradients of the full graph dataset and of the graph subset, i.e., $\mathrm{Err}^{(t)}$. In other words, the optimized graph subset $\mathcal{G}_S^{(t)}$ is expected to approximate the gradients of the full graph dataset, and thereby exert minimal effect on the parameter updates. In contrast to GraphDec, data diet (Paul et al., 2021) is designed to identify the most influential data samples $\mathcal{G}_S$ (those with the largest gradients during the training phase) only at the early training stage and keep them involved in all further training, while permanently excluding the smaller-gradient samples in $\bar{\mathcal{G}}_S = \mathcal{G}_F - \mathcal{G}_S$. This one-shot selection, however, as we will show in the experiments (Section 4.5), does not always capture the most important samples across all training epochs. Specifically, the rankings of elements within a specific $\mathcal{G}_S$ might be relatively static, but those within the full graph dataset $\mathcal{G}_F$ are usually more dynamic, which implies that the gradient of the one-shot subset, $\nabla_{\theta^{(t)}} \mathcal{L}_{\mathcal{G}_S^{(t)};\theta^{(t)}}$, is unable to consistently approximate that of the full graph dataset, $\nabla_{\theta^{(t)}} \mathcal{L}_{\mathcal{G}_F;\theta^{(t)}}$, during training.
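To make the role of $\mathrm{Err}^{(t)}$ concrete, the sketch below measures how far a subset's averaged gradient is from the full set's. This is a toy NumPy illustration under our own naming (`gradient_error`, the toy tensors), not the paper's implementation:

```python
import numpy as np

def gradient_error(per_sample_grads, subset_idx):
    """Distance between the averaged gradient over a subset G_S and over the
    full set G_F -- the Err^(t) term that controls the approximation gap."""
    full_grad = per_sample_grads.mean(axis=0)              # gradient on G_F
    sub_grad = per_sample_grads[subset_idx].mean(axis=0)   # gradient on G_S^(t)
    return float(np.linalg.norm(full_grad - sub_grad))

# Toy example: 6 samples with 3-dimensional per-sample gradients.
rng = np.random.default_rng(0)
grads = rng.normal(size=(6, 3))

err_full = gradient_error(grads, list(range(6)))  # the full set itself: Err = 0
err_sub = gradient_error(grads, [0, 2, 4])        # a half-sized subset: Err > 0
```

A subset whose averaged gradient stays close to the full set's (small $\mathrm{Err}^{(t)}$) is, by Theorem 1, one whose training effect tracks full-set training closely.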

3.2. GRAPH DECANTATION

Inspired by Theorem 1, and to address the massive data usage in class-imbalanced graphs, we propose GraphDec to achieve competitive performance and efficient data usage simultaneously by dynamically filtering out the most influential data subset. The overall framework of GraphDec is illustrated in Figure 2. Each training epoch is summarized into four steps: (i) first, compute the gradient of each sample in the current subset $\mathcal{G}_S^{(t)}$ with the dynamic sparse graph contrastive learning model; (ii) rank the samples according to the $L_2$ norms of their gradients; (iii) keep the top-ranked samples, with the subset size decayed by cosine annealing; and (iv) randomly recycle samples from the recycle bin. The union of these recycled samples and the ones selected in step (iii) is used for model training in the $(t+1)$-th epoch. Each of the four steps is described in detail in the following content. Compute gradients by the dynamic sparse graph contrastive learning model. We adopt the mechanism of dynamic sparse graph contrastive learning to compute the gradients. The reason is twofold: (a) it scores the graph samples without the supervision of any label; (b) the pruning process is more sensitive in selecting informative samples, as verified in Appendix D. We omit the superscript $(t)$ on the dataset and model parameters for simplicity in the explanation of this step. Specifically, given a graph training set $\mathcal{G} = \{G_i\}_{i=1}^{N}$ as input, for each training sample $G_i$ we randomly generate two augmented graph views, $G_i^1$ and $G_i^2$, and feed them into the original GCN model $f_\theta(\cdot)$ and the sparse model $f_{\theta_p}(\cdot)$ pruned dynamically by the dynamic sparse pruner, respectively. The gradients are computed based on the outputs of the two GNN branches, directed by the contrastive learning loss signals. To obtain the pruned GNN model, the pruner keeps only the neural connections with the top-k largest weight magnitudes.
Specifically, the pruned parameters of the $l$-th GNN layer (i.e., $\theta^l$) are selected as follows:

$$\theta_p^l = \mathrm{TopK}(\theta^l, k); \quad k = \beta^{(t)} \times |\theta^l|, \quad (1)$$

where $\mathrm{TopK}(\theta^l, k)$ refers to the operation of selecting the top-$k$ largest elements of $\theta^l$, and $\beta^{(t)}$ is the fraction of the remaining neural connections, controlled by cosine annealing as follows:

$$\beta^{(t)} = \frac{\beta^{(0)}}{2}\left[1 + \cos\left(\frac{\pi t}{T}\right)\right], \quad t \in [1, T], \quad (2)$$

where $\beta^{(0)}$ is initialized as 1. In addition, we refresh $\theta_p^l$ every few epochs to reactivate neurons based on their gradients, following the formula below:

$$\mathcal{I}_{\theta_g^l} = \mathrm{argTopK}\left(\nabla_{\theta^l}\mathcal{L}_{\mathcal{G}_S;\theta}, k\right); \quad k = \beta^{(t)} \times |\theta^l|, \quad (3)$$

where $\mathrm{argTopK}$ returns the indices $\mathcal{I}_{\theta_g^l}$ of the top-$k$ largest elements, corresponding to the neurons $\theta_g^l$. To further elaborate, every few epochs we refresh the pruned parameters by $\theta_p^l \leftarrow \theta_p^l \cup \theta_g^l$, and the updated set is involved in the next iteration. After we obtain the pruned model, gradients are computed based on the contrastive learning loss between $f_\theta(G_i^1)$ and $f_{\theta_p}(G_i^2)$, and are saved for the subsequent ranking process. Rank graph samples according to their gradients' $L_2$ norms. In order to find the relative importance of the samples, we rank them by the gradients they induce, saved from the previous training epoch. Specifically, at the $t$-th training epoch, we score each sample by the $L_2$ norm of its gradient:

$$g(G_i) = \left\| \nabla_\theta \mathcal{L}\left(f_\theta(G_i^1), f_{\theta_p}(G_i^2)\right) \right\|_2, \quad (4)$$

where $\mathcal{L}$ is the popular InfoNCE loss (Van den Oord et al., 2018) in contrastive learning, taking the outputs of the two GNN branches as inputs. The gradient is calculated as follows:

$$\nabla_\theta \mathcal{L}\left(f_\theta(G_i^1), f_{\theta_p}(G_i^2)\right) = p_\theta(G_i^1) - p_{\theta_p}(G_i^2), \quad (5)$$

where $p_\theta(G_i^1)$ and $p_{\theta_p}(G_i^2)$ are the normalized model predictions, i.e., $p(\cdot) = S(f(\cdot))$ with $S(\cdot)$ the softmax or sigmoid function. The samples are ranked based on the values calculated by Eq. 4.
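The magnitude-based TopK pruning, the cosine-annealed keep fraction $\beta^{(t)}$, and the gradient-based reactivation above can be sketched as follows. This is a simplified NumPy illustration of unstructured magnitude pruning; the function names (`prune_layer`, `regrow`) are ours, and the released implementation may differ:

```python
import numpy as np

def keep_fraction(t, T, beta0=1.0):
    """Cosine-annealed fraction beta^(t) of neural connections kept at epoch t."""
    return beta0 / 2.0 * (1.0 + np.cos(np.pi * t / T))

def prune_layer(theta, t, T):
    """TopK magnitude pruning: keep the k largest-magnitude weights of a layer."""
    k = max(1, int(keep_fraction(t, T) * theta.size))
    flat = np.abs(theta).ravel()
    top_idx = np.argpartition(flat, -k)[-k:]     # indices of the top-k magnitudes
    mask = np.zeros(theta.size, dtype=bool)
    mask[top_idx] = True
    mask = mask.reshape(theta.shape)
    return theta * mask, mask

def regrow(mask, grads, k):
    """Reactivate the k pruned connections with the largest gradient magnitudes."""
    scores = np.where(mask, -np.inf, np.abs(grads)).ravel()
    back_idx = np.argpartition(scores, -k)[-k:]
    new_mask = mask.ravel().copy()
    new_mask[back_idx] = True
    return new_mask.reshape(mask.shape)

# Toy layer: at t = T/2, beta = 0.5, so 2 of the 4 weights survive.
theta = np.array([[0.9, -0.1], [0.05, -0.7]])
pruned, mask = prune_layer(theta, t=5, T=10)
# Periodic refresh: bring back the pruned weight with the largest gradient.
mask_refreshed = regrow(mask, np.array([[0.0, 0.8], [0.2, 0.0]]), k=1)
```

The prune/regrow pair mirrors dynamic sparse training: the mask shrinks on a cosine schedule, but high-gradient connections can re-enter the subnet at the refresh step.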
Decay the size of $\mathcal{G}_S$ by cosine annealing. To decrease the size of the subset as the training process proceeds, we use cosine annealing. As we will show in Figure 3 for the experiments, some graph samples with low importance scores at the early training stage may be highly scored again, given more patience, in later training epochs. Therefore, chunking the size of the sparse subset radically in one shot deprives such potential samples of the chance to inform the model at a later stage. To tackle this issue, we employ cosine annealing to gradually decrease the size of the subset:

$$|\mathcal{G}_S^{(t)}| = \frac{|\mathcal{G}|}{2}\left[1 + \cos\left(\frac{\pi t}{T}\right)\right], \quad t \in [1, T]. \quad (6)$$

Note that this process not only decreases the size of $\mathcal{G}_S$ smoothly and automatically, but also avoids the manual one-shot selection used in data diet (Paul et al., 2021). Recycle removed graph samples for the next training epoch. Finally, we update the elements of the $\mathcal{G}_S^{(t)}$ obtained in the previous step. Since currently low-scored samples may still have the potential to be highly scored in later training epochs, we randomly recycle a proportion of the removed samples and re-involve them in the training process. Specifically, the exploration rate $\epsilon$ controls the proportion of recycled data: $\epsilon \cdot |\mathcal{G}_S^{(t)}|$ samples are drawn at random from the removed pool and substituted into the training subset for the next epoch.
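One training epoch of the resulting decantation loop (score, rank, cosine-annealed shrink, recycle) might look like the following pure-Python sketch. The function names and the stand-in importance scores are illustrative, not the authors' implementation; `scores` plays the role of the gradient norms $g(G_i)$:

```python
import math
import random

def subset_size(t, T, n_full):
    """Cosine-annealed size of the kept subset |G_S^(t)|."""
    return max(1, int(n_full / 2.0 * (1.0 + math.cos(math.pi * t / T))))

def decant_epoch(kept, recycle_bin, scores, t, T, n_full, eps=0.1, seed=0):
    """One decantation epoch: rank kept samples by score, shrink the subset with
    cosine annealing, and re-involve a random eps-fraction of removed samples."""
    rng = random.Random(seed + t)
    ranked = sorted(kept, key=lambda i: scores[i], reverse=True)
    k = subset_size(t, T, n_full)
    new_kept, dropped = ranked[:k], ranked[k:]
    recycle_bin = sorted(set(recycle_bin) | set(dropped))
    n_back = min(len(recycle_bin), int(eps * k))      # exploration-rate budget
    recycled = rng.sample(recycle_bin, n_back) if n_back else []
    recycle_bin = [i for i in recycle_bin if i not in recycled]
    return new_kept + recycled, recycle_bin

# Toy run: 10 samples with fixed importance scores over epochs t = 1..3 of T = 10.
scores = {i: 1.0 / (1 + i) for i in range(10)}   # sample 0 is the most informative
kept, recycle_bin = list(range(10)), []
for t in range(1, 4):
    kept, recycle_bin = decant_epoch(kept, recycle_bin, scores, t, T=10, n_full=10)
```

In the real framework the scores are recomputed each epoch from the contrastive gradients, so the ranking (and hence the kept subset) evolves with training rather than staying fixed as in this toy run.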

4. EXPERIMENTS

In this section, we conduct extensive experiments to validate the effectiveness of our proposed model on both the graph and node classification tasks under imbalanced datasets. We also conduct an ablation study and an analysis of the informative subset's evolution to further demonstrate the effectiveness. Due to the space limit, additional analyses validating GraphDec's properties and resource costs are provided in Appendices D and E.

4.1. EXPERIMENTAL SETUP

Datasets. We validate our model on various graph benchmark datasets for the two classification tasks under the class-imbalanced data scenario. For the class-imbalanced graph classification task, we choose the seven evaluation datasets of the G2GNN paper (Wang et al., 2021), i.e., MUTAG, PROTEINS, D&D, NCI1, PTC-MR, DHFR, and REDDIT-B from (Morris et al., 2020). For the class-imbalanced node classification task, we choose the five datasets of the GraphENS paper (Park et al., 2022), i.e., Cora-LT, CiteSeer-LT, PubMed-LT (Sen et al., 2008), Amazon-Photo, and Amazon-Computers. Detailed descriptions of these datasets are provided in Appendix C.1. Baselines. We compare our model with a variety of baseline methods under different re-balancing schemes. For class-imbalanced graph classification, we consider three re-balancing schemes, i.e., vanilla (no re-balancing during training), up-sampling (Wang et al., 2021), and re-weighting (Wang et al., 2021). For each re-balancing scheme, we run three baseline methods: GIN (Xu et al., 2019), InfoGraph (Sun et al., 2019), and GraphCL (You et al., 2020). In addition, we adopt two versions of G2GNN (i.e., remove-edge and mask-node) (Wang et al., 2021) for in-depth comparison.
For class-imbalanced node classification, we consider eleven baseline methods: vanilla, SynFlow (Tanaka et al., 2020), BGRL (Thakoor et al., 2021), GRACE (Zhu et al., 2020), re-weighting (Japkowicz & Stephen, 2002), oversampling (Park et al., 2022), cRT (Kang et al., 2020), PC Softmax (Hong et al., 2021), DR-GCN (Shi et al., 2020), GraphSMOTE (Zhao et al., 2021), and GraphENS (Park et al., 2022). We adopt the Graph Convolutional Network (GCN) (Kipf & Welling, 2017) as the default architecture for all re-balancing methods. Further details about the baselines are given in Appendix C.2. Evaluation Metrics. To evaluate model performance, we choose F1-micro (F1-mi.) and F1-macro (F1-ma.) scores as the metrics for the class-imbalanced graph classification task, and accuracy (Acc.), balanced accuracy (bAcc.), and F1-macro (F1-ma.) for the node classification task. Experimental Settings. We adopt GCN (Kipf & Welling, 2017) as the GNN backbone of GraphDec for both tasks. In particular, we concatenate a two-layer GCN with a one-layer fully-connected layer for node classification, and add one extra average-pooling operator as the readout layer for graph classification. We follow (Wang et al., 2021) and (Park et al., 2022) in varying the imbalance ratios for the graph and node classification tasks, respectively. In addition, we take GraphCL (You et al., 2020) as the graph contrastive learning framework, and use cosine annealing to dynamically control the sparsity rates of the GNN model and the dataset. The target pruning ratio is set to 0.75 for the model and 1.0 for the dataset. After the contrastive pre-training, we take the GCN output logits as the input to a Support Vector Machine for fine-tuning. GraphDec is implemented in PyTorch and trained on an NVIDIA V100 GPU.
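For reference, the backbone just described (a two-layer GCN, an average-pooling readout, and a fully-connected head) can be sketched in NumPy. The propagation rule follows Kipf & Welling (2017); the weight shapes and function names here are illustrative, not taken from the released code:

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_backbone(A, X, W1, W2, W_fc):
    """Two GCN layers -> average-pooling readout -> fully-connected head."""
    A_norm = normalize_adj(A)
    H = np.maximum(A_norm @ X @ W1, 0.0)   # GCN layer 1 + ReLU
    H = np.maximum(A_norm @ H @ W2, 0.0)   # GCN layer 2 + ReLU
    h_graph = H.mean(axis=0)               # readout for the graph-level task
    return h_graph @ W_fc                  # logits fed to the classifier

# Toy input: a 3-node triangle graph, 2-d features, hidden width 4, 2 classes.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
X = rng.normal(size=(3, 2))
logits = gcn_backbone(A, X, rng.normal(size=(2, 4)),
                      rng.normal(size=(4, 4)), rng.normal(size=(4, 2)))
```

For the node-level task, the readout line is simply dropped and the fully-connected head is applied to each node representation in `H`.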

4.2. CLASS-IMBALANCED GRAPH CLASSIFICATION PERFORMANCE

The results for the graph classification task on class-imbalanced graph datasets are reported in Table 1, with the best performance in bold and runner-ups underlined. From the table, we find that GraphDec outperforms the baseline methods on both metrics across different datasets, while using only an average of 50% of the data and 50% of the model weights per round. Although a slight F1-micro gap exists on D&D between GraphDec and the best baseline G2GNN, this is understandable given that the graphs in D&D are significantly larger than those in other datasets, necessitating specialized designs for graph augmentations (e.g., the average graph size in terms of node number is 284.32 for D&D, but 39.02 and 17.93 for PROTEINS and MUTAG, respectively). However, on the same dataset, G2GNN achieves only 43.93 F1-macro while GraphDec reaches 44.01, which offsets the 2% difference in F1-micro and further demonstrates GraphDec's ability to learn effectively even on datasets of large graphs. Specifically, models trained under the vanilla setting perform the worst because they ignore the class imbalance. The up-sampling strategy improves performance, but introduces unnecessary additional data usage by sampling the minorities multiple times. Similarly, the re-weighting strategy tries to address the class-imbalance issue by assigning different weights to different samples; however, it requires labels to compute the weights and thus may not generalize well when labels are missing. G2GNN, the best baseline, obtains decent performance by exploiting rich supervisory signals from both globally and locally neighboring graphs. Finally, the proposed model, GraphDec, achieves the best performance due to its ability to capture dynamic data sparsity from both the model and the data perspectives. In addition, we rank the performance of GraphDec against the baseline methods on each dataset.
GraphDec ranks 1.00 and 1.14 on average across the two metrics, which further demonstrates its superiority. Notice that all existing methods utilize the entire dataset and all model weights, while GraphDec uses only half of the data and weights to achieve superior performance.

4.3. CLASS-IMBALANCED NODE CLASSIFICATION PERFORMANCE

For the class-imbalanced node classification task, we first evaluate GraphDec on three long-tailed citation graphs (i.e., Cora-LT, CiteSeer-LT, PubMed-LT) and report the results in Table 2. We find that GraphDec obtains the best performance across the different metrics compared to the baseline methods. GraphSMOTE and GraphENS achieve satisfactory performance by generating virtual nodes to enrich the involvement of the minorities. In comparison, GraphDec does not rely on synthetic virtual nodes to learn balanced representations, thereby avoiding that unnecessary computational cost. As in the class-imbalanced graph classification task of Section 4.2, GraphDec leverages only half of the data and weights to achieve the best performance, whereas all baselines perform worse even with the full dataset and weights. To validate the efficacy of the proposed model on real-world data, we also evaluate GraphDec on naturally class-imbalanced benchmark datasets (i.e., Amazon-Photo and Amazon-Computers). We see that GraphDec achieves the best performance on both datasets, which demonstrates our model's effectiveness on data sourced from different practical scenes.

4.4. ABLATION STUDY

Since GraphDec is a unified learning framework relying on multiple components (steps) to employ dynamic sparsity training from both the model and the dataset perspectives, we conduct an ablation study to validate each component. Specifically, GraphDec relies on four components to address data sparsity and imbalance: pruning samples by ranking gradients (GS), training with a sparse dataset (SS), using cosine annealing to reduce the dataset size (CAD), and recycling removed samples (RS); and four more to address model sparsity and data imbalance: pruning weights by ranking magnitudes (RM), using a sparse GNN (SG), using cosine annealing to progressively reduce the sparse GNN's size (CAG), and reactivating removed weights (RW). In addition, GraphDec employs self-supervision to calculate the gradient scores. The details of the model variants are provided in Appendix C.3. We analyze the contribution of each component by removing it independently. Experiments for both tasks are conducted comprehensively for an effective inspection. The results are shown in Table 3. From the table, we find that the performance drops after removing any component, demonstrating the effectiveness of each. In general, both mechanisms, addressing data and model sparsity respectively, contribute significantly to the overall performance, demonstrating their joint necessity in solving the sparsity problem. Self-supervision contributes similarly to the dynamic sparsity mechanisms, in that it enables the identification of informative data samples without label supervision. In the dataset dynamic sparsity mechanism, GS and CAD contribute the most, as the sparse GNN's discriminability identifies the hidden dynamic sparse subsets accurately and efficiently. Regarding the model dynamic sparsity mechanism, removing RM or SG leads to a significant performance drop, which demonstrates that they are the key components in training the dynamic sparse GNN from the full GNN model.
In particular, CAG stabilizes performance after model pruning and helps capture informative samples during decantation by assigning them greater gradient norms. Among these variants, the full GraphDec model achieves the best result in most cases, indicating the importance of combining the dynamic sparsity mechanisms of the two perspectives with the self-supervision strategy.

4.5. ANALYZING EVOLUTION OF SPARSE SUBSET BY SCORING ALL SAMPLES

To show GraphDec's capability to dynamically identify informative samples, we visualize the sparse subset evolution of data diet and GraphDec on the class-imbalanced NCI1 dataset in Figure 3. Specifically, we compute importance scores for 1000 graph samples, rank the samples according to their scores, and mark them with sample indices. From the upper plots of Figure 3, we find that data diet is unable to accurately identify the dynamically informative samples: once a data sample has been removed from the training list due to a low score, the model disregards it forever. However, the fact that a sample is currently unimportant does not imply that it will remain unimportant indefinitely, especially in the early training stage, when the model cannot yet detect the true importance of each sample; this results in the premature elimination of vital samples. Similarly, if a data sample is considered important in the early epochs (i.e., marked with a higher sample index), it is never removed in subsequent epochs. We therefore observe that data diet can only increase the scores of samples within the high index range (i.e., 500-1000), while ignoring samples within the low index range (i.e., < 500). In contrast, GraphDec (Figure 3, bottom) captures the dynamic importance of each sample regardless of its initial importance score. We see that samples of all indices have the opportunity to be considered important and therefore included in the training list. Correspondingly, GraphDec takes a broader range of data samples into account when shrinking the training list, while maintaining flexibility with respect to previous importance scores. Compared with the full GNN, our dynamic sparse GNN is more sensitive in recognizing informative data samples, as empirically verified in Figure 4.
Our dynamically pruned model assigns larger gradients to the minority classes than to the majority classes during contrastive training, while the full model assigns relatively uniform gradients to both. The proposed dynamically pruned model thus demonstrates its discriminative ability on the minority classes, which equips the GraphDec framework to resolve the class-imbalance issue.

6. CONCLUSION

In this paper, to take up the graph data imbalance challenge, we propose an efficient and effective method named Graph Decantation (GraphDec), by leveraging the dynamic sparse graph contrastive learning to dynamically identified a sparse-but-informative subset for model training, in which the sparse GNN encoder is dynamically sampled from a dense GNN, and its capability of identifying informative samples is used to rank and update the training data in each epoch. Extensive experiments demonstrate that GraphDec outperforms state-of-the-art baseline methods for both node classification and graph classification tasks in the class-imbalanced scenario. The analysis of the sparse informative samples' evolution further explains the superiority of GraphDec in identifying the informative subset among the training periods effectively. A PROOF OF THEOREM 1 We denote the full graph dataset as G F , the graph data subset used to train the model as G S , the learning rate as α, and the graph learning model parameters as θ (the optimal model parameters as θ ˚).  ∇ θ L G ptq S ;θ ptq pθ ptq q T pθ ptq ´θ˚q " 1 α ptq pθ ptq ´θpt`1q q T pθ ptq ´θ˚q , ∇ θ L G ptq S ;θ ptq pθ ptq q T pθ ptq ´θ˚q " 1 2α ptq ˆ› › ›θ ptq ´θpt`1q › › › 2 `› › ›θ ptq ´θ˚› › › 2 ´› › ›θ pt`1q ´θ˚› › › 2 ˙. (9) Since one update step θ ptq ´θpt`1q can be optimized by gradient multiplying with learning rate α ptq ∇ θ L G ptq S ;θ ptq pθ ptq q, we have: ∇ θ L G ptq S ;θ ptq pθ ptq q T pθ ptq ´θ˚q " 1 2α ptq ˆ› › ›α ptq ∇ θ L G ptq S ;θ ptq pθ ptq q › › › 2 `› › ›θ ptq ´θ˚› › › 2 ´› › ›θ pt`1q ´θ˚› › › 2 ˙. 
(10) Since ∇ θ L G ptq S ;θ ptq pθ ptq q T pθ ptq ´θ˚q can be represented as follows: T pθ ptq ´θ˚q " ∇ θ L G ptq S ;θ ptq pθ ptq q T pθ ptq ´θ˚q " ∇ θ L G ptq S ; 1 2α ptq ˆ› › ›α ptq ∇ θ L G ptq S ;θ ptq pθ ptq q › › › 2 `› › ›θ ptq ´θ˚› › › 2 ´› › ›θ pt`1q ´θ˚› › › 2 ˙(12) ∇ θ L G ptq S ;θ ptq T pθ ptq ´θ˚q " 1 2α ptq ˆ› › ›α ptq ∇ θ L G ptq S ;θ ptq pθ ptq q › › › 2 `› › ›θ ptq ´θ˚› › › 2 ´› › ›θ pt`1q ´θ˚› › › 2 ∇θ L G ptq S ;θ ptq pθ ptq q ´∇θ L G ptq S ;θ ptq ¯Tpθ ptq ´θ˚q . (13) We assume learning rate α ptq , t P r0, T ´1s is a constant value, then we have: T ´1 ÿ t"0 ∇ θ L G ptq S ;θ ptq T pθ ptq ´θ˚q " 1 2α › › ›θ p0q ´θ˚› › › 2 ´› › ›θ ptq ´θ˚› › › 2 `T ´1 ÿ t"0 p 1 2α › › ›α∇θL G ptq S ;θ ptq pθ ptq q › › › 2 q `T ´1 ÿ t"0 ˆ´∇ θ L G ptq S ;θ ptq pθ ptq q ´∇θ L G ptq S ;θ ptq ¯Tpθ ptq ´θ˚q ˙. Since we assume › › θ ptq ´θ˚› › 2 ě 0, then we have: T ´1 ÿ t"0 ∇ θ L G ptq S ;θ ptq T pθ ptq ´θ˚q ď 1 2α › › ›θ p0q ´θ˚› › › 2 `T ´1 ÿ t"0 p 1 2α › › ›α∇θL G ptq S ;θ ptq pθ ptq q › › › 2 q `T ´1 ÿ t"0 ˆ´∇ θ L G ptq S ;θ ptq pθ ptq q ´∇θ L G ptq S ;θ ptq ¯Tpθ ptq ´θ˚q ˙. (14) We assume loss L is convex and training loss L G ptq S ;θ ptq is lipschitz continuous with parameter σ. Then for convex function Lpθq, we have L G ptq S ;θ ptq ´Lθ ˚ď ∇ θ L G ptq S ;θ ptq T pθ ptq ´θ˚q . By combining this result with Equation 14, we get: T ´1 ÿ t"0 L G ptq S ;θ ptq ´Lθ ˚ď 1 2α › › ›θ p0q ´θ˚› › › 2 `T ´1 ÿ t"0 p 1 2α › › ›α∇θL G ptq S ;θ ptq pθ ptq q › › › 2 q `T ´1 ÿ t"0 ˆ´∇ θ L G ptq S ;θ ptq pθ ptq q ´∇θ L G ptq S ;θ ptq ¯Tpθ ptq ´θ˚q ˙. ( ) Since › › ›L G ptq S ;θ ptq pθq › › › ď σ, › › ›α∇θL G ptq S ;θ ptq pθ ptq q › › › ď σ, and we assume }θ ´θ˚} ď d, then we have: T ´1 ÿ t"0 L G ptq S ;θ ptq ´Lθ ˚ď d 2 2α `T ασ 2 2 `T ´1 ÿ t"0 d ´› › ›∇θL G ptq S ;θ ptq pθ ptq q ´∇θ L G ptq S ;θ ptq › › › ¯, T ´1 ÿ t"0 L G ptq S ;θ ptq ´Lθ ˚ď d 2 2αT `ασ 2 2 `T ´1 ÿ t"0 d T ´› › ›∇θL G ptq S ;θ ptq pθ ptq q ´∇θ L G ptq S ;θ ptq › › › ¯. 
Since $\min_t \big(L_{\mathcal{G}_F}(\theta^{(t)}) - L_{\mathcal{G}_F}(\theta^*)\big) \le \frac{1}{T}\sum_{t=0}^{T-1}\big(L_{\mathcal{G}_F}(\theta^{(t)}) - L_{\mathcal{G}_F}(\theta^*)\big)$, based on Equation (17), we have:

$$
\min_t \big( L_{\mathcal{G}_F}(\theta^{(t)}) - L_{\mathcal{G}_F}(\theta^*) \big)
\le \frac{d^{2}}{2\alpha T} + \frac{\alpha\sigma^{2}}{2}
+ \sum_{t=0}^{T-1} \frac{d}{T}\big\| \nabla_\theta L_{\mathcal{G}_S^{(t)}}(\theta^{(t)}) - \nabla_\theta L_{\mathcal{G}_F}(\theta^{(t)}) \big\|. \tag{18}
$$

We set the learning rate $\alpha = \frac{d}{\sigma\sqrt{T}}$ and then have:

$$
\min_t \big( L_{\mathcal{G}_F}(\theta^{(t)}) - L_{\mathcal{G}_F}(\theta^*) \big)
\le \frac{d\sigma}{\sqrt{T}}
+ \sum_{t=0}^{T-1} \frac{d}{T}\big\| \nabla_\theta L_{\mathcal{G}_S^{(t)}}(\theta^{(t)}) - \nabla_\theta L_{\mathcal{G}_F}(\theta^{(t)}) \big\|. \tag{19}
$$

B PRELIMINARIES: GNNS, GRAPH CONTRASTIVE LEARNING, NETWORK PRUNING

In this work, we denote a graph as $G = (V, E, X)$, where $V$ is the set of nodes, $E$ is the set of edges, and $X \in \mathbb{R}^{d}$ represents the node (and edge) attributes of dimension $d$. In addition, we represent the neighbor set of node $v \in V$ as $N_v$.

Graph Neural Networks. GNNs (Wu et al., 2020) learn node representations from the graph structure and node attributes. This process can be formulated as:

$$
h_v^{(l)} = \text{COMBINE}^{(l)}\Big( h_v^{(l-1)},\, \text{AGGREGATE}^{(l)}\big( \big\{ h_u^{(l-1)}, \forall u \in N_v \big\} \big) \Big), \tag{20}
$$

where $h_v^{(l)}$ denotes the representation of node $v$ at the $l$-th GNN layer; $\text{AGGREGATE}(\cdot)$ and $\text{COMBINE}(\cdot)$ are neighbor aggregation and combination functions, respectively; and $h_v^{(0)}$ is initialized with the node attribute $X_v$. We obtain the output representation of each node after repeating the process in Equation (20) for $L$ rounds. The representation of the whole graph, denoted as $h_G \in \mathbb{R}^{d}$, is obtained by applying a READOUT function to the final node representations learned above:

$$
h_G = \text{READOUT}\big( \big\{ h_v^{(L)} \mid \forall v \in V \big\} \big), \tag{21}
$$

where the READOUT function can be any permutation-invariant operation, such as summation or averaging.

Graph Contrastive Learning. Given a graph dataset $D = \{G_i\}$, graph contrastive learning (GCL) generates two augmented views $G_i^1$ and $G_i^2$ for each graph $G_i$. The goal of GCL is to map samples within positive pairs closer in the hidden space, while pushing those of negative pairs further apart. GCL methods are usually optimized by a contrastive loss. Taking the most popular InfoNCE loss (Oord et al., 2018) as an example, the contrastive loss is defined as:

$$
\mathcal{L}_{CL}(G_i^1, G_i^2) = -\log \frac{\exp\big(\text{sim}(z_{i,1}, z_{i,2})\big)}{\sum_{j=1, j\neq i}^{N} \exp\big(\text{sim}(z_{i,1}, z_{j,2})\big)}, \tag{22}
$$

where $z_{i,1} = f_\theta(G_i^1)$, $z_{i,2} = f_\theta(G_i^2)$, and $\text{sim}$ denotes the similarity function.
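A minimal sketch of the InfoNCE loss in Equation (22), in plain Python with cosine similarity as the `sim` function; this is an illustrative reading of the equation, not the authors' implementation:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors (the `sim` in Eq. (22))."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce_loss(z1, z2, i):
    """InfoNCE loss for anchor i: z1 and z2 hold the embeddings of the two
    augmented views; (z1[i], z2[i]) is the positive pair and (z1[i], z2[j]),
    j != i, are the negative pairs, following Eq. (22) as written (the
    denominator excludes the positive pair)."""
    pos = math.exp(cosine_sim(z1[i], z2[i]))
    neg = sum(math.exp(cosine_sim(z1[i], z2[j]))
              for j in range(len(z2)) if j != i)
    return -math.log(pos / neg)
```

In practice, GCL implementations usually also divide the similarities by a temperature hyperparameter before exponentiation; it is omitted here to stay term-by-term with Equation (22).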
Network Pruning. Given an over-parameterized deep neural network $f_\theta(\cdot)$ with weights $\theta$, network pruning is usually performed layer by layer. The pruning process of the $l$-th layer of $f_\theta(\cdot)$ can be formulated as follows:

$$
\theta^{l}_{\text{pruned}} = \text{TopK}(\theta^{l}, k), \quad k = \alpha \times |\theta^{l}|,
$$

where $\theta^{l}$ denotes the parameters in the $l$-th layer of $f_\theta(\cdot)$ and $\text{TopK}(\cdot, k)$ refers to the operation that selects the top-$k$ largest elements of $\theta^{l}$. A pre-defined sparsity rate $\alpha$ controls the fraction of parameters kept in the pruned layer $\theta^{l}_{\text{pruned}}$; that is, only the $k = \alpha \times |\theta^{l}|$ largest weights are kept. The pruning process is applied iteratively to the parameters of each layer of the deep neural network (Han et al., 2015a).
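The TopK operation above can be sketched for a single layer as magnitude pruning over a flat weight list; this is a minimal illustration (weights as a Python list rather than a tensor), not the authors' implementation:

```python
def prune_layer(theta, alpha):
    """Layer-wise magnitude pruning (TopK): keep the k = alpha * |theta|
    largest-magnitude weights of the layer and zero out the rest."""
    k = int(alpha * len(theta))
    # indices of the k largest-magnitude weights
    kept_idx = sorted(range(len(theta)),
                      key=lambda i: abs(theta[i]), reverse=True)[:k]
    kept = set(kept_idx)
    # surviving weights keep their values; pruned positions become 0.0
    return [w if i in kept else 0.0 for i, w in enumerate(theta)]
```

Iterative pruning then amounts to applying `prune_layer` to every layer, retraining, and repeating.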

C EXPERIMENTAL DETAILS C.1 DATASETS DETAILS

In this work, seven graph classification datasets and five node classification datasets are used to evaluate the effectiveness of our proposed model; we provide their detailed statistics in Table 4. For the graph classification datasets, we follow the imbalance setting of (Wang et al., 2021) to set the train/validation split as 25%/25% and vary the imbalance ratio from 5:5 (balanced) to 1:9 (imbalanced). The rest of each dataset is used as the test set. The specific imbalance ratio of each dataset is given after its name in Table 5. For the node classification datasets, we follow (Sen et al., 2008) to set the imbalance ratio of Cora, CiteSeer, and PubMed as 10. Besides, the settings of Amazon-Photo and Amazon-Computers are borrowed from (Park et al., 2022), where the imbalance ratio ρ is set as 82 and 244, respectively.
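The class-imbalanced splits described above can be sketched as subsampling minority classes until the majority-to-minority ratio reaches a target ρ; `make_imbalanced` is a hypothetical helper for illustration, not the authors' exact preprocessing:

```python
import random

def make_imbalanced(samples_by_class, minority_classes, rho):
    """Subsample each minority class so the imbalance ratio
    majority:minority equals rho (e.g., rho = 10 for the
    Cora/CiteSeer/PubMed setting above).

    samples_by_class: dict mapping class label -> list of sample ids.
    minority_classes: set of labels to be downsampled.
    """
    n_major = max(len(v) for v in samples_by_class.values())
    out = {}
    for c, items in samples_by_class.items():
        if c in minority_classes:
            n_minor = max(1, n_major // rho)  # keep at least one sample
            out[c] = random.sample(list(items), min(n_minor, len(items)))
        else:
            out[c] = list(items)
    return out
```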

C.2 BASELINE DETAILS

We compare our model with a variety of baseline methods using different rebalance methods:

Table 5: Imbalanced graph classification results. The numbers after each dataset name indicate the imbalance ratios of minority to majority categories. We report the macro F1-score and micro F1-score with standard errors; results are reported as mean ± std over 3 repetitions on each dataset. We bold the best performance.

E RESOURCE COST

To evaluate the proposed GraphDec's computational cost on a wide range of datasets, we report results in Table 7 covering three class-imbalanced node classification datasets (PubMed-LT, Cora-LT, CiteSeer-LT), three class-imbalanced graph classification datasets (MUTAG, PROTEINS, PTC MR), and four baselines (vanilla GCN, re-weight, re(/over)-sample, GraphCL). We run 200 epochs for each method and measure the training time (in seconds) on an NVIDIA GeForce RTX 3090 GPU, as reported in Table 7. All models are implemented in PyTorch Geometric (Fey & Lenssen, 2019). According to the results, GraphDec incurs less computational cost than prior methods. The following explains why augmentation doubles the number of input graphs without increasing the overall computation cost: (i) the augmentations we adopt (e.g., node dropping and edge dropping) reduce the size of the input graphs (i.e., the node number decreases by 25% and the edge number by 25-35%); (ii) during each epoch, GraphDec prunes the dataset so that only approximately 50% of the training data is used; (iii) GraphDec prunes the model weights, resulting in a lighter model that requires fewer computational resources; (iv) although augmentation doubles the number of input graphs, the additional views only consume forward computation without requiring a backward or weight-update step, thereby increasing the computation only marginally.
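Point (i) can be sketched as follows: node dropping removes a fraction of nodes together with their incident edges, so the augmented view is strictly smaller than the original graph. The edge-list representation and the 25% drop rate here are assumptions for illustration, not the exact implementation:

```python
import random

def drop_nodes(edges, num_nodes, p=0.25):
    """Node-dropping augmentation: remove a fraction p of nodes and all
    incident edges, shrinking the input graph. `edges` is a list of
    (u, v) pairs over node ids 0..num_nodes-1."""
    dropped = set(random.sample(range(num_nodes), int(p * num_nodes)))
    kept_nodes = [v for v in range(num_nodes) if v not in dropped]
    kept_edges = [(u, v) for (u, v) in edges
                  if u not in dropped and v not in dropped]
    return kept_nodes, kept_edges
```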



Figure 1: The principle of graph decantation. It decants data samples based on rankings of their gradient scores, and then uses them as the training set in the next epoch.

Figure 2: The overall framework of GraphDec: (i) The dynamic sparse graph contrastive learning model computes gradients for graph/node samples; (ii) The input samples are sorted according to their gradients; (iii) Part of the samples with the smallest gradients are thrown into the recycling bin; (iv) Part of the samples with the largest gradients in the current epoch and some sampled randomly from the recycling bin are jointly used as training input in the next epoch.

with respect to the contrastive learning loss; (ii) normalize the gradients and rank the corresponding graph/node samples in descending order of gradient magnitude; (iii) decay the number of samples from $|\mathcal{G}_S^{(t)}|$ to $|\mathcal{G}_S^{(t+1)}|$ with cosine annealing, where we keep only the top $(1-\epsilon)|\mathcal{G}_S^{(t+1)}|$ samples ($\epsilon$ is the exploration rate, which controls the ratio of samples randomly re-sampled from the recycling bin); the remaining samples are held in the recycling bin temporarily; (iv) finally, randomly re-sample $\epsilon|\mathcal{G}_S^{(t+1)}|$ samples from the recycling bin and use them jointly with the kept samples as the training input of the next epoch.
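Steps (ii)-(iv) can be sketched as follows; this is a simplified illustration of the decantation loop (gradient scores are given as a precomputed dict, and the function names are ours), not the authors' exact implementation:

```python
import math
import random

def cosine_anneal(n_full, n_final, t, T):
    """Cosine-annealed subset size at epoch t, decaying n_full -> n_final
    over T epochs (step (iii))."""
    return int(n_final + 0.5 * (n_full - n_final)
               * (1 + math.cos(math.pi * t / T)))

def decant(scores, current, recycle_bin, next_size, eps=0.1):
    """One decantation step: keep the top (1 - eps) * next_size samples by
    gradient score, move the rest to the recycle bin, and re-draw
    eps * next_size samples from the bin for exploration (step (iv))."""
    ranked = sorted(current, key=lambda s: scores[s], reverse=True)
    n_keep = int((1 - eps) * next_size)
    kept, removed = ranked[:n_keep], ranked[n_keep:]
    recycle_bin.extend(removed)
    n_explore = min(next_size - n_keep, len(recycle_bin))
    explored = random.sample(recycle_bin, n_explore)
    for s in explored:
        recycle_bin.remove(s)
    return kept + explored, recycle_bin
```

Each epoch would call `cosine_anneal` for the target subset size and then `decant` with the freshly computed gradient scores.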

Figure 3: Evolution of data samples' gradients computed by data diet (Paul et al., 2021) (upper figures) and our GraphDec (lower figures) on NCI1 data.

Figure 4: Results of data samples' gradients computed by full GNN model and our dynamic sparse GNN model on NCI1 data. Red dashed line: on the left side, points on the x-axis [0, 900] are majority class; on the right side, points on the x-axis [900, 1000] are minority class.


Figure 5: Results of data samples' gradients computed by full GNN model and our dynamic sparse GNN model on NCI1 data. Red dashed line: on the left side, points on the x-axis [0, 900] are majority class; on the right side, points on the x-axis [900, 1000] are minority class.


Class-imbalanced graph classification results. Numbers after each dataset name indicate imbalance ratios of minority to majority categories. Best/second-best results are in bold/underline.

Class-imbalanced node classification results. Best/second-best results are in bold/underline.

Ablation study results for both tasks. The four red rows represent removing four individual components from the data sparsity perspective; the four blue rows represent removing four individual components from the model sparsity perspective. Best results are in bold.

Meanwhile, we add a superscript to denote the model's parameters and the graph data subset at epoch $t$, i.e., $\theta^{(t)}$ and $\mathcal{G}_S^{(t)}$. Besides, we use $L_{\mathcal{G}_S^{(t)}}(\theta^{(t)})$ to indicate the loss of model $\theta^{(t)}$ over the graph subset $\mathcal{G}_S^{(t)}$ (and $L_{\mathcal{G}_F}(\theta^{(t)})$ its loss over the full set $\mathcal{G}_F$). The sparse graph subset approximation hypothesis states that the effectiveness of a model trained on $\mathcal{G}_S$ can approximate that of a model trained on $\mathcal{G}_F$. We formalize the hypothesis as follows:

Theorem 1. Assume the model's parameters at each epoch $t$ satisfy $\|\theta^{(t)} - \theta^*\|^{2} \le d^{2}$, where $d$ is a constant, and the loss function $L(\cdot)$ is convex. If the training loss is Lipschitz continuous with its gradients upper-bounded by $\sigma$, and $\alpha = \frac{d}{\sigma\sqrt{T}}$, then we have the following guarantee:

$$
\min_t \big( L_{\mathcal{G}_F}(\theta^{(t)}) - L_{\mathcal{G}_F}(\theta^*) \big)
\le \frac{d\sigma}{\sqrt{T}}
+ \sum_{t=0}^{T-1} \frac{d}{T}\big\| \nabla_\theta L_{\mathcal{G}_S^{(t)}}(\theta^{(t)}) - \nabla_\theta L_{\mathcal{G}_F}(\theta^{(t)}) \big\|.
$$

According to gradient descent, we have:

$$
\theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)}\nabla_\theta L_{\mathcal{G}_S^{(t)}}(\theta^{(t)}).
$$

Original dataset details for imbalanced graph classification and imbalanced node classification tasks.

Computational time comparisons.

ETHICS STATEMENT

We do not find that this work is directly related to any ethical risks to society. In general, we would like to see that imbalanced learning algorithms (including this work) are able to perform better on minority groups in real-world applications.

REPRODUCIBILITY STATEMENT

For the reproducibility of this study, we provide the source code for GraphDec in the supplementary materials. The datasets and baselines used in our experiments are described in Appendix C.1 and C.2. I. For imbalanced graph classification (Wang et al., 2021), four models are included as baselines in our work, listed as follows: (1) GIN (Xu et al., 2019), a popular supervised GNN backbone for graph tasks due to its powerful expressiveness on graph structure; (2) InfoGraph (Sun et al., 2019), an unsupervised graph learning framework that maximizes the mutual information between the whole graph and its local topology at different levels; (3) GraphCL (You et al., 2020), which learns unsupervised graph representations by maximizing the mutual information between the original graph and its augmented views; (4) G2GNN (Wang et al., 2021), a re-balanced GNN that utilizes additional supervisory signals from both neighboring graphs and the graphs themselves to alleviate the imbalance issue of graphs. II.
For imbalanced node classification, we consider eleven baseline methods in our work, including: (1) vanilla, denoting that we train a GCN normally without any extra rebalancing tricks; (2) re-weight (Japkowicz & Stephen, 2002), denoting that we use a cost-sensitive loss and re-weight the penalty of nodes in different classes; (3) oversampling (Park et al., 2022), denoting that we sample nodes of each class so that the number of samples in each class reaches that of the largest class; (4) cRT (Kang et al., 2020), a post-hoc correction method for decoupling output representations; (5) PC Softmax (Hong et al., 2021), another post-hoc correction method for decoupling output representations; (6) DR-GCN (Shi et al., 2020), which builds virtual minority nodes and forces their features to be close to the neighbors of a source minority node; (7) GraphSMOTE (Zhao et al., 2021), a pre-processing method that focuses on the input data and investigates the possibility of creating new nodes with minority features to balance the training data; (8) GraphENS (Park et al., 2022), which proposes a new augmentation method to construct an ego network from all nodes for learning minority representations; (9) SynFlow (Tanaka et al., 2020), a one-shot model pruning method with less reliance on data; (10) BGRL (Thakoor et al., 2021), a graph contrastive learning method that uses only simple augmentations and avoids the need to contrast with negative examples, making it scalable; (11) GRACE (Zhu et al., 2020), a graph contrastive learning method that generates two views by corrupting a graph and learns node embeddings by minimizing the distance between node embeddings across the two views. We use a Graph Convolutional Network (GCN) (Kipf & Welling, 2017) as the default architecture for all rebalance methods.

C.3 DETAILS OF GRAPHDEC VARIANTS

The details of the model variants are provided as follows: I. Specifically, GraphDec contains four components to address data sparsity and imbalance: (1) GS samples an informative data subset according to ranked gradients; (2) SS trains the model with the sparse dataset, correspondingly; (3) CAD uses cosine annealing to reduce the dataset size; (4) RS recycles removed samples, correspondingly. To investigate their individual effectiveness, we remove them in turn: (1) w/o GS randomly samples the subset from the full set; (2) w/o SS trains the GNN with the full set; (3) w/o CAD directly reduces the dataset to the target size, which is the same as data diet; (4) w/o RS discards the removed samples without re-sampling them from the recycling bin.

