UNREAL: UNLABELED NODES RETRIEVAL AND LABELING FOR HEAVILY-IMBALANCED NODE CLASSIFICATION

Abstract

Extremely skewed label distributions are common in real-world node classification tasks. If not dealt with appropriately, they significantly hurt the performance of GNNs on minority classes. Given the practical importance of this problem, a series of recent studies has been devoted to it. Existing over-sampling techniques smooth the label distribution by generating "fake" minority nodes and synthesizing their features and local topology, largely ignoring the rich information carried by unlabeled nodes on graphs. On the other hand, methods based on loss function modification re-weight different samples or change classification margins. Representative methods in this category need label information to estimate the distance of each node to its class center, and this information is unavailable for unlabeled nodes. In this paper, we propose UNREAL, an iterative over-sampling method. The first key difference from existing over-sampling methods is that we add only unlabeled nodes instead of synthetic nodes, which eliminates the challenge of feature and neighborhood generation. To select which unlabeled nodes to add, we propose geometric ranking, which exploits unsupervised learning in the node embedding space to effectively calibrate pseudo-label assignment. Finally, we identify the issue of geometric imbalance in the embedding space and provide a simple metric to filter out geometrically imbalanced nodes. Extensive experiments on real-world benchmark datasets show that our method significantly and consistently outperforms current state-of-the-art methods across datasets and imbalance ratios.

1. INTRODUCTION

Node classification is ubiquitous in real-world applications, ranging from malicious account detection (Mohammadrezaei et al., 2018) to fake news detection (Monti et al., 2019). Many real-world datasets come with an imbalanced class distribution (Mohammadrezaei et al., 2018; Wang et al., 2020b). For instance, the proportion of malicious accounts in social networks is usually very small. A model trained on an imbalanced dataset is prone to be sub-optimal on under-represented classes. While GNNs have achieved superior performance on node classification, training a fair GNN model that handles highly imbalanced class distributions remains a challenging task. In malicious account detection, for example, GNN models would easily overfit the samples from the rare class of malicious accounts (Liu et al., 2018; Zhao et al., 2021). The message passing scheme of GNN models makes the problem even more complex, as the samples cannot be treated as i.i.d. samples. Moreover, quantity imbalance is often coupled with topology imbalance (Chen et al., 2021), and thus it is difficult to extend existing techniques for handling i.i.d. data to relational data. Given its importance and unique characteristics, a group of recent studies has been devoted to solving the imbalanced node classification problem (Zhao et al., 2021; Shi et al., 2020; Chen et al., 2021; Park et al., 2021; Song et al., 2022).

Over-sampling strategies are simple and effective for handling data imbalance. However, adapting them to graph data is non-trivial, since the topological information of newly synthesized nodes is not provided. GraphSMOTE (Zhao et al., 2021) extends the synthetic minority over-sampling technique (SMOTE) to graph data by synthesizing nodes in the embedding space and generating relation information using link prediction. Shi et al. (2020) use a generative model to generate nodes to smooth the label distribution. GraphENS (Park et al., 2021) 

