UNIFYING DATA-MODEL SPARSITY FOR CLASS-IMBALANCED GRAPH REPRESENTATION LEARNING

Abstract

To relieve the massive computational cost of deep learning, models with more compact architectures have been proposed to achieve comparable performance. However, it is not only cumbersome model architectures but also the sheer volume of training data that contributes to the expensive computational burden. This problem is particularly accentuated in the graph learning field. On one hand, Graph Neural Networks (GNNs) trained on non-Euclidean graph data often incur relatively high time costs due to the irregular density of graphs; on the other hand, the class imbalance that naturally accompanies graph data cannot be alleviated simply by the volume of data, hindering GNNs' ability to generalize. To tackle both issues, (i) theoretically, we introduce a hypothesis on the extent to which a subset of the training data can approximate the learning effectiveness of the full dataset, with a guarantee given by the distance between the gradients of the subset and those of the full set; (ii) empirically, we discover that during the learning process of a GNN, some samples in the training dataset are more informative than others in providing gradients for model parameter updates, and that this informative subset evolves as training proceeds. We refer to this observation as dynamic data sparsity. We also notice that a pruned sparse contrastive GNN model sometimes "forgets" the information provided by the informative subset, reflected in the large magnitudes of its losses on those samples. Motivated by these findings, we develop a unified data-model dynamic sparsity framework named Graph Decantation (GraphDec) to address the above challenges. The key idea of GraphDec is to identify the informative subset dynamically during training by adopting sparse graph contrastive learning.
Extensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art baselines on class-imbalanced graph and node classification tasks, in terms of both classification accuracy and data usage efficiency.

1. INTRODUCTION

Graph representation learning (GRL) (Kipf & Welling, 2017) has shown remarkable power in dealing with non-Euclidean structured data (e.g., social networks, biochemical molecules, knowledge graphs). Graph neural networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018), the current state of the art in GRL, have become essential in various graph mining applications. However, in many real-world scenarios, training on graph data encounters two difficulties: class imbalance (Park et al., 2022) and massive data usage (Thakoor et al., 2021; Hu et al., 2020). First, class imbalance naturally exists in datasets from diverse practical domains, such as bioinformatics and social networks. GNNs are sensitive to this property and can be biased toward the dominant classes; this bias may mislead the learning process, causing GNNs to underfit the samples that truly matter for the downstream task and ultimately to perform poorly at test time. Second, massive data usage requires a GNN to perform message passing over high-degree nodes, incurring heavy computational burdens. Some of this computation is redundant, since not all neighbors are informative for learning task-related embeddings. Moreover, unlike regular data such as images or text, the connectivity of irregular graph data invokes random memory access, which further slows down data readout.

Accordingly, recent studies (Chen et al., 2021; Zhao et al., 2021; Park et al., 2022) have arisen to address class imbalance or massive data usage in graph data. (i) On one hand, to deal with class imbalance in node classification, GraphSMOTE (Zhao et al., 2021) generates new nodes for the minority classes to balance the training data. Building on GraphSMOTE, GraphENS (Park et al., 2022) proposes a new augmentation method that constructs an ego network to learn representations of the minority classes. (ii) On the other hand, to alleviate massive data usage, Eden et al. (2018) and Chen et al. (2018) explore efficient data sampling policies that reduce computational cost from the data perspective. From the model perspective, some approaches design quantization-aware training and low-precision inference methods to reduce GNNs' operating costs. For example, GLT (Chen et al., 2021) applies the lottery ticket pruning technique (Frankle & Carbin, 2019) to simplify the graph data and the GNN model concurrently.

Despite the progress made so far, existing methods fail to address class imbalance and computational burden together; tackling one may even exacerbate the other. For instance, when addressing data imbalance, the newly synthesized nodes in GraphSMOTE and GraphENS add extra computational burden to the subsequent training process.

While a compact model reduces the computational burden to some extent, we find, interestingly, that the pruned model easily "forgets" the minority classes in class-imbalanced data, reflected in its worse performance compared with the original model. To investigate this observation, we study how each graph sample affects GNN training by taking a closer look at the gradients it exerts. Specifically, (i) in the early phase of training, we identify a small subset that provides the most informative supervisory signals, as measured by the magnitudes of the gradient norms (shown later in Figure 5); (ii) this informative subset evolves dynamically as training proceeds (depicted later in Figure 3). Both phenomena prompt the hypothesis that the training effectiveness of the full training set can be approximated, to some extent, by that of the dynamic subset. We further show that the quality of this approximation is guaranteed by the distance between the gradients of the subset and those of the full training set, as stated in Theorem 1.

Based on the above, we propose a novel method called Graph Decantation (GraphDec) to guide dynamic sparsity training from both the model and data perspectives. The principle behind GraphDec is shown in Figure 1. Since disadvantaged but informative samples tend to produce gradients of larger magnitude, GraphDec relies on the gradients directed by a dynamic sparse graph contrastive learning loss to identify informative subsets that approximate the training effectiveness of the full set. This mechanism requires no supervised labels, and it allows the primary GNN to be trained while the sparse one is pruned. Specifically, at each epoch, our framework scores the samples in the current training set and keeps only the k most informative samples for the next epoch.

Additionally, the framework incorporates a data recycling process, which randomly recycles previously discarded samples (i.e., samples considered unimportant in earlier training epochs) by re-involving them in the current training process. As a result, the dynamically updated subset (i) supports the sparse GNN in learning relatively unbiased representations and (ii) approximates the full training set through the lens of Theorem 1.

To summarize, our contributions in this work are:

• We develop a novel framework, Graph Decantation, which leverages dynamic sparse graph contrastive learning on class-imbalanced graph data for efficient data usage. To the best of our knowledge, this is the first study to explore the dynamic sparsity property for class-imbalanced graphs.

• We introduce cosine annealing to dynamically control the sizes of the sparse GNN model and the graph data subset, smoothing the training process. Meanwhile, we introduce data recycling to refresh the current data subset and avoid overfitting.

• Comprehensive experiments on multiple benchmark datasets demonstrate that GraphDec outperforms state-of-the-art methods on both class-imbalanced graph classification and class-imbalanced node classification tasks. Additional results show that GraphDec effectively identifies an informative subset across training epochs.
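To make the per-epoch decant-and-recycle step concrete, the following is a minimal sketch in plain Python. It assumes per-sample informativeness scores (e.g., gradient-norm magnitudes) have already been computed; the function name `decant_step` and the `recycle_ratio` parameter are illustrative conveniences, not part of the method's specification.

```python
import random

def decant_step(scores, pool_ids, discarded_ids, k, recycle_ratio=0.1, rng=random):
    """One decantation epoch: keep the k highest-scoring samples from the
    current pool, then randomly re-admit a few previously discarded ones."""
    # Rank the current pool by score (higher = more informative gradient).
    ranked = sorted(pool_ids, key=lambda i: scores[i], reverse=True)
    keep, drop = ranked[:k], ranked[k:]
    discarded = list(discarded_ids) + drop
    # Data recycling: re-involve a random fraction of discarded samples
    # so that samples deemed unimportant earlier get another chance.
    n_back = min(len(discarded), int(recycle_ratio * k))
    recycled = rng.sample(discarded, n_back)
    discarded = [i for i in discarded if i not in recycled]
    return keep + recycled, discarded
```

In a full training loop, `scores` would be refreshed every epoch from the sparse contrastive model's gradients, so the kept subset can change as the informative samples evolve.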

Figure 1: The principle of graph decantation. It decants data samples based on the rankings of their gradient scores and then uses the retained samples as the training set for the next epoch.
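For illustration, the cosine annealing used to shrink the data subset (and, analogously, the sparse model's parameter budget) can be sketched as follows. The function name and the endpoint handling are assumptions for exposition, not the paper's exact schedule.

```python
import math

def cosine_annealed_size(t, T, n_full, n_min):
    """Cosine-annealed budget: equals n_full at epoch 0 and decays
    smoothly to n_min by epoch T (constant thereafter)."""
    frac = 0.5 * (1.0 + math.cos(math.pi * min(t, T) / T))
    return int(round(n_min + (n_full - n_min) * frac))
```

Because the schedule's slope vanishes at both endpoints, the subset size changes gently at the start and end of training, which is what smooths the transition from full-data to sparse-data training.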

