UNREAL: UNLABELED NODES RETRIEVAL AND LABELING FOR HEAVILY-IMBALANCED NODE CLASSIFICATION

Abstract

Extremely skewed label distributions are common in real-world node classification tasks. If not handled appropriately, they significantly hurt the performance of GNNs on minority classes. Given its practical importance, a series of recent studies has been devoted to this challenge. Existing over-sampling techniques smooth the label distribution by generating "fake" minority nodes and synthesizing their features and local topology, largely ignoring the rich information carried by unlabeled nodes on the graph. Methods based on loss-function modification, on the other hand, re-weight different samples or adjust classification margins. Representative methods in this category need label information to estimate the distance of each node to its class center, which is unavailable for unlabeled nodes. In this paper, we propose UNREAL, an iterative over-sampling method. Its first key difference is that we add only unlabeled nodes rather than synthetic ones, which eliminates the challenge of feature and neighborhood generation. To decide which unlabeled nodes to add, we propose geometric ranking, which exploits unsupervised learning in the node embedding space to effectively calibrate pseudo-label assignment. Finally, we identify the issue of geometric imbalance in the embedding space and provide a simple metric to filter out geometrically imbalanced nodes. Extensive experiments on real-world benchmark datasets show that our method consistently and significantly outperforms current state-of-the-art methods across datasets with different imbalance ratios.

1. INTRODUCTION

Node classification is ubiquitous in real-world applications, ranging from malicious account detection (Mohammadrezaei et al., 2018) to fake news detection (Monti et al., 2019). Much real-world data comes with an imbalanced class distribution (Mohammadrezaei et al., 2018; Wang et al., 2020b). For instance, the proportion of malicious accounts in a social network is usually very small. A model trained on an imbalanced dataset is prone to be sub-optimal on under-represented classes. While GNNs have achieved superior performance on node classification, training a fair GNN model under highly imbalanced class distributions remains challenging. In malicious account detection, for example, GNN models easily overfit the samples from the rare class of malicious accounts (Liu et al., 2018; Zhao et al., 2021). The message passing scheme of GNNs makes the problem even more complex, since the samples can no longer be treated as i.i.d. Moreover, quantity imbalance is often coupled with topology imbalance (Chen et al., 2021), so it is difficult to extend existing techniques for i.i.d. data to relational data. Given its importance and unique characteristics, a group of recent studies has been devoted to the imbalanced node classification problem (Zhao et al., 2021; Shi et al., 2020; Chen et al., 2021; Park et al., 2021; Song et al., 2022). Over-sampling strategies are simple and effective for handling data imbalance. However, adapting them to graph data is non-trivial, since the topological information of newly synthesized nodes is not provided. Moreover, representative methods such as Chen et al. (2021) only use label information from the training set when computing their topology-imbalance metric. However, the training set is highly skewed in the first place, so information derived from it is less reliable, and the bias can propagate to later building blocks and hurt overall performance.
In this paper, we propose a novel imbalanced node classification method: unlabeled node retrieval and labeling (UNREAL). At a high level, UNREAL is an over-sampling approach; however, several distinct features set it apart from existing over-sampling techniques. First, motivated by the observation that abundant unlabeled nodes are available in a node classification scenario, we add only "real" nodes to the training set instead of synthesizing new minority nodes, which would introduce both additional noise and a large computational burden. Adding unlabeled nodes (together with their pseudo-labels) to the training set is a common technique for semi-supervised node classification and has proved highly effective for dealing with label sparsity (Li et al., 2018; Zhou et al., 2019; Sun et al., 2020; Wang et al., 2021c). Self-Training (ST) trains a GNN on the existing labeled data, then selects samples with high prediction confidence for each class from the unlabeled data and adds them to the training set. In imbalanced scenarios, however, ST cannot achieve satisfactory performance due to the bias in the original training set: predictions from a classifier trained on an imbalanced training set can be highly biased and contain a large portion of incorrect pseudo-labels. This drawback of ST is empirically verified in our experiments (see Section 3). We therefore propose a series of techniques to overcome this challenge, and our experimental results show that they are highly effective and outperform baselines by a large margin. Similar to Chen et al. (2021), we try to add nodes that are close to class centers to alleviate topology imbalance. To identify such nodes, we train the model on the training set and use the prediction confidence as the selection criterion, which we call confidence ranking.
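The confidence-ranking step described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the per-class quota `per_class_k`, and the assumption that the trained GNN exposes softmax probabilities are ours.

```python
import numpy as np

def confidence_ranking(probs, unlabeled_idx, per_class_k):
    """Rank unlabeled nodes by prediction confidence, per class.

    probs: (N, C) softmax outputs of a GNN trained on the labeled set.
    unlabeled_idx: indices of the unlabeled nodes.
    per_class_k: how many pseudo-labeled nodes to add per class
                 (typically larger for minority classes, to rebalance).
    Returns a list of (node_index, pseudo_label) pairs.
    """
    pseudo = probs[unlabeled_idx].argmax(axis=1)   # pseudo-labels
    conf = probs[unlabeled_idx].max(axis=1)        # confidence scores
    selected = []
    for c, k in enumerate(per_class_k):
        cand = np.where(pseudo == c)[0]
        top = cand[np.argsort(-conf[cand])[:k]]    # most confident first
        selected += [(int(unlabeled_idx[i]), c) for i in top]
    return selected
```

In an iterative setting, the selected pairs are appended to the training set and the model is retrained before the next round.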
However, the bias in the original training set results in unreliable predictions (Song et al., 2022), which inevitably hurts performance. We therefore introduce a key building block that utilizes the geometric structure of the embedding space to calibrate the bias in the prediction confidence. This idea is partially inspired by Kang et al. (2019), who hypothesize and verify empirically that the classifier is the only under-performing component of a model trained on an imbalanced training set. Thus, after the preliminary training step, we retrieve node embeddings from the output layer (before the classification layer) and use unsupervised clustering to rank nodes by their closeness to class centers, which we call geometric ranking. Given the two rankings, we apply information retrieval techniques to select the best unlabeled nodes to add. In practice, this procedure is applied iteratively for multiple rounds. We summarize our contributions as follows: 1) to the best of our knowledge, UNREAL is the first over-sampling approach that adds unlabeled nodes rather than synthetic ones for class-imbalanced node classification; 2) for unlabeled node selection, UNREAL is also the first to apply unsupervised methods in the embedding space to obtain complementary and less biased label predictions; 3) we introduce geometric ranking, which ranks nodes by the closeness of each node to its class center in the embedding space; 4) given the confidence and geometric rankings, information retrieval techniques are used to effectively select high-quality new samples; 5) we identify the geometric imbalance (GI) issue in the embedding space and propose a metric to measure GI and discard imbalanced nodes. We conduct comprehensive experiments on multiple benchmarks, including citation networks (Sen et al., 2008), an Amazon product co-purchasing network, and Flickr (Zeng et al., 2019).
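A sketch of geometric ranking and a GI-style score, under our own simplifying assumptions: class centers are taken as given (e.g., k-means centroids over the embeddings), closeness is Euclidean distance to the nearest center, and geometric imbalance is scored by the ratio of nearest to second-nearest center distances (a node nearly equidistant to two centers lies near a cluster boundary). The paper's exact formulation may differ.

```python
import numpy as np

def geometric_ranking(emb, centers):
    """Rank nodes by closeness to their nearest class center in embedding space.

    emb: (N, d) node embeddings taken before the classification layer.
    centers: (C, d) class centers, e.g., k-means centroids.
    Returns (geo_label, closeness, gi):
      geo_label - index of the nearest center (a label prediction that
                  ignores the possibly biased classifier),
      closeness - distance to that center (smaller = better candidate),
      gi        - nearest/second-nearest distance ratio in (0, 1];
                  values near 1 flag geometrically imbalanced nodes.
    """
    d = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)  # (N, C)
    order = np.argsort(d, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(emb))
    closeness = d[rows, nearest]
    gi = closeness / d[rows, second]
    return nearest, closeness, gi
```

Candidates would then be ranked by `closeness` within each `geo_label` class, and nodes with `gi` above a threshold discarded before merging with the confidence ranking.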
We also test the performance of UNREAL on several mainstream GNN architectures namely



GraphSMOTE (Zhao et al., 2021) extends the synthetic minority over-sampling technique (SMOTE) to graph data by synthesizing nodes in the embedding space and generating relational information using link prediction. Shi et al. (2020) use a generative model to synthesize nodes and smooth the label distribution. GraphENS (Park et al., 2021) synthesizes the whole ego network of a new sample by combining two different ego networks based on their similarity. Song et al. (2022) empirically observe that the performance of existing over-sampling approaches is easily degraded by (synthetic) minority nodes with high connectivity to other classes. To alleviate this issue, they modify the loss function (and thus the classification margin) based on various statistics of the true label distributions of target nodes and classes. Chen et al. (2021) term this phenomenon topology imbalance and propose to re-weight samples according to their distance to the classification boundary, where the distance is inferred via structural similarity and label information. As we note, both methods rely on ground-truth label information, which is unavailable for most nodes: Song et al. (2022) first train the model on the original training set and use its predictions to modify the loss function, while Chen et al. (2021) compute their topology-imbalance metric using only the labels in the training set.
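For concreteness, the core SMOTE step that GraphSMOTE inherits can be sketched as below. This is an illustrative reduction under our assumptions: only the embedding-space interpolation is shown, and the link predictor GraphSMOTE trains to wire synthetic nodes into the graph is omitted.

```python
import numpy as np

def smote_interpolate(minority_emb, n_new, rng=None):
    """Synthesize embeddings by interpolating each sampled minority node
    with its nearest minority neighbor (the classic SMOTE step)."""
    rng = np.random.default_rng(rng)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(minority_emb))
        x = minority_emb[i]
        # nearest neighbor among the *other* minority embeddings
        d = np.linalg.norm(minority_emb - x, axis=1)
        d[i] = np.inf
        nn = minority_emb[d.argmin()]
        lam = rng.random()                 # interpolation weight in [0, 1)
        new.append(x + lam * (nn - x))     # point on the segment x -> nn
    return np.stack(new)
```

Every synthetic point lies on a segment between two real minority embeddings, which is precisely why such nodes can inherit spurious connectivity to other classes when their neighborhoods are generated afterwards.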

