UNREAL: UNLABELED NODES RETRIEVAL AND LABELING FOR HEAVILY-IMBALANCED NODE CLASSIFICATION

Abstract

Extremely skewed label distributions are common in real-world node classification tasks. If not handled appropriately, such imbalance significantly hurts the performance of GNNs on minority classes. Due to its practical importance, a series of recent studies have been devoted to this challenge. Existing over-sampling techniques smooth the label distribution by generating "fake" minority nodes and synthesizing their features and local topology, which largely ignores the rich information of unlabeled nodes on graphs. Methods based on loss function modification, on the other hand, re-weight different samples or change classification margins. Representative methods in this category need label information to estimate the distance of each node to its class center, which is unavailable on unlabeled nodes. In this paper, we propose UNREAL, an iterative over-sampling method. The first key difference is that we only add unlabeled nodes instead of synthetic nodes, which eliminates the challenge of feature and neighborhood generation. To select which unlabeled nodes to add, we propose geometric ranking, which exploits unsupervised learning in the node embedding space to effectively calibrate pseudo-label assignment. Finally, we identify the issue of geometric imbalance in the embedding space and provide a simple metric to filter out geometrically imbalanced nodes. Extensive experiments on real-world benchmark datasets show that our method significantly and consistently outperforms current state-of-the-art methods on different datasets with different imbalance ratios.

1. INTRODUCTION

Node classification is ubiquitous in real-world applications, ranging from malicious account detection (Mohammadrezaei et al., 2018) to fake news detection (Monti et al., 2019). Many real-world datasets come with an imbalanced class distribution (Mohammadrezaei et al., 2018; Wang et al., 2020b). For instance, the proportion of malicious accounts in social networks is usually very small. A model trained on an imbalanced dataset is prone to be sub-optimal on under-represented classes. While GNNs have achieved superior performance on node classification, training a fair GNN model under highly-imbalanced class distributions remains a challenging task. In malicious account detection, for example, GNN models easily overfit the samples from the rare class of malicious accounts (Liu et al., 2018; Zhao et al., 2021). The message passing scheme of GNN models makes the problem even more complex, as the samples can no longer be treated as i.i.d. Moreover, quantity imbalance is often coupled with topology imbalance (Chen et al., 2021), and thus it is difficult to extend existing techniques for i.i.d. data to relational data. Given its importance and unique characteristics, a group of recent studies has been devoted to the imbalanced node classification problem (Zhao et al., 2021; Shi et al., 2020; Chen et al., 2021; Park et al., 2021; Song et al., 2022). Over-sampling strategies are simple and effective for handling data imbalance. However, it is a non-trivial task to adapt them to graph data, since the topological information of newly synthesized nodes is not provided. GraphSMOTE (Zhao et al., 2021) extends the synthetic minority over-sampling technique (SMOTE) to graph data by synthesizing nodes in the embedding space and generating relation information via link prediction. Shi et al. (2020) use a generative model to generate nodes to smooth the label distribution.
GraphENS (Park et al., 2021) synthesizes the whole ego network of a new sample by combining two different ego networks based on their similarity. It is empirically observed in (Song et al., 2022) that the performance of existing over-sampling approaches is easily affected by (synthetic) minority nodes with high connectivity to other classes. To alleviate this issue, Song et al. (2022) modify the loss function (and thus the classification margin) based on various statistics of the true label distributions of target nodes and classes. Chen et al. (2021) coin the term topology imbalance for this phenomenon, and propose to re-weight samples according to their distance to the classification boundary, where the distance is inferred via structural similarity and label information. As we note, both methods rely on ground-truth label information, which is not available for most nodes. Song et al. (2022) first train the model with the original training set and use the model predictions to modify the loss function, while Chen et al. (2021) only use label information from the training set when computing the topology imbalance metric. However, the training set is highly skewed in the first place, so information derived from it is less reliable, and the bias can spread to later building blocks, which hurts the overall performance. In this paper, we propose a novel imbalanced node classification method: unlabeled node retrieval and labeling (UNREAL). At a high level, UNREAL is an over-sampling based approach; however, several distinct features set our method apart from existing over-sampling techniques. First, motivated by the observation that abundant unlabeled nodes are available in a node classification scenario, instead of synthesizing new minority nodes, which brings in both additional noise and a large computational burden, we only add "real" nodes to the training set.
Adding unlabeled nodes (together with their pseudo-labels) to the training set is a commonly used technique for semi-supervised node classification, which has proven highly effective for dealing with label sparseness (Li et al., 2018; Zhou et al., 2019; Sun et al., 2020; Wang et al., 2021c). Self-Training (ST) trains a GNN on existing labeled data, then selects samples with high prediction confidence for each class from the unlabeled data and adds them to the training set. However, in imbalanced scenarios, ST cannot achieve satisfactory performance due to the bias in the original training set: predictions from a classifier trained on an imbalanced training set may be highly biased and contain a large portion of incorrect pseudo-labels. This drawback of ST is empirically verified in our experiments (see Section 3). We therefore propose a series of techniques to overcome this challenge, and our experimental results show these techniques are highly effective and outperform baselines by a large margin. Similar to (Chen et al., 2021), we try to add nodes that are close to class centers to alleviate topology imbalance. To identify such good nodes, we train the model with the training set and use the prediction confidence as the selection criterion, which we call confidence ranking. However, the bias in the original training set results in unreliable predictions (Song et al., 2022), which inevitably hurts performance. Therefore, we introduce a key building block which utilizes the geometric structure of the embedding space to calibrate the bias in the prediction confidence. This idea is partially inspired by the work of Kang et al. (2019), who hypothesize and verify empirically that the classifier is the only under-performing component of a model trained on an imbalanced training set.
Thus, after the preliminary training step, we retrieve node embeddings from the output layer (before the classification layer) and use unsupervised clustering methods to rank the closeness of nodes to their class centers, which we call geometric ranking. Given the two rankings, we apply information retrieval techniques to select the best unlabeled nodes to add. In practice, this procedure is applied iteratively over multiple rounds. We summarize our contributions as follows: 1) As far as we know, UNREAL is the first over-sampling approach to use unlabeled nodes rather than synthetic ones for class-imbalanced node classification; 2) for unlabeled node selection, UNREAL is also the first to apply unsupervised methods in the embedding space to obtain complementary and less biased label predictions; 3) we introduce geometric ranking, which ranks nodes according to the closeness of each node to its class center in the embedding space; 4) given the confidence and geometric rankings, information retrieval techniques are used to effectively select high-quality new samples; 5) we identify the Geometric Imbalance (GI) issue in the embedding space, and propose a metric to measure GI and discard imbalanced nodes. We conduct comprehensive experiments on multiple benchmarks, including citation networks (Sen et al., 2008), an Amazon product co-purchasing network (Sen et al., 2008), and Flickr (Zeng et al., 2019). We also test the performance of UNREAL on several mainstream GNN architectures, namely GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), and GraphSAGE (Hamilton et al., 2017). Experimental results demonstrate the superiority of our proposal, as UNREAL consistently outperforms existing state-of-the-art approaches by a large margin.

2.1. NOTATION AND DEFINITIONS

In this work, we mainly focus on the ubiquitous semi-supervised node classification setup. We are given an undirected and unweighted graph G = (V, E, L). Here, V is the node set and E is the edge set; L ⊂ V denotes the set of labeled nodes, so the set of unlabeled nodes is U = V − L; and X ∈ R^{n×f} is the feature matrix (where n = |V| is the number of nodes and f is the node feature dimension). We use A ∈ {0, 1}^{n×n} to denote the adjacency matrix and N(v) the set of 1-hop neighbors of node v. The labeled sets for all classes are denoted by (C_1, C_2, ..., C_k), where k is the number of different classes. We use the imbalance ratio, defined as ρ := max_i |C_i| / min_i |C_i|, to measure the level of imbalance in a dataset. We summarize the notation in a table in Appendix A.
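The imbalance ratio ρ can be computed directly from the per-class labeled counts. A minimal sketch (the example class counts are hypothetical):

```python
# rho = max_i |C_i| / min_i |C_i|, computed from a mapping of
# class label -> number of labeled nodes in that class.

def imbalance_ratio(class_counts):
    sizes = list(class_counts.values())
    return max(sizes) / min(sizes)

# Hypothetical step-imbalanced training set: two majority, two minority classes.
counts = {0: 200, 1: 200, 2: 20, 3: 20}
rho = imbalance_ratio(counts)  # 200 / 20 = 10.0
```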

2.2. MESSAGE PASSING NEURAL NETWORK FOR NODE CLASSIFICATION

In this section, we briefly introduce message passing neural networks (MPNNs). A standard MPNN consists of three components: a message function m_l, an information aggregation function θ_l, and a node feature update function ψ_l. The feature of each node is updated iteratively. Let h_v^{(l)} be the feature of node v in the l-th layer; in the (l+1)-th layer the feature is updated as h_v^{(l+1)} = ψ_l( h_v^{(l)}, θ_l({ m_l(h_v^{(l)}, h_u^{(l)}, e_{v,u}) | u ∈ N(v) }) ), where e_{v,u} is the edge weight between v and u. For the classic GCN model (Kipf & Welling, 2016), h_v^{(l+1)} is computed as h_v^{(l+1)} = Φ_l Σ_{u ∈ N(v) ∪ {v}} ( e_{v,u} / √(d̂_u d̂_v) ) h_u^{(l)}, where Φ_l is the parameter matrix of the l-th layer and d̂_v = 1 + Σ_{u ∈ N(v)} e_{v,u}. For node classification, a classification layer is concatenated after the last layer of the GNN.
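The GCN propagation step (before the learned linear map Φ_l) can be written as one normalized matrix product. A toy sketch with a 3-node path graph and unit edge weights; this illustrates the formula above, not the authors' implementation:

```python
import numpy as np

# One GCN aggregation step: h_v' = sum_{u in N(v) U {v}} e_{v,u}/sqrt(d_u d_v) h_u,
# with d_v = 1 + sum_{u in N(v)} e_{v,u}. All edge weights are 1 here.

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)        # unweighted adjacency of a path
A_hat = A + np.eye(3)                          # add self-loops
d = A_hat.sum(axis=1)                          # d_v = 1 + deg(v)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
H = np.array([[1.0], [2.0], [3.0]])            # node features, f = 1
H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H   # symmetric-normalized aggregation
```

A learnable layer would multiply `H_next` by Φ_l and apply a nonlinearity.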

3. PSEUDO-LABEL MISJUDGMENT AUGMENTATION PROBLEM IN IMBALANCED LEARNING

Since self-training adds pseudo-labels to the training set and trains the model iteratively, misjudgements in the early stages will cause the method to fail badly. We extensively investigate this issue of ST in imbalanced learning. Conventional ST-based methods are generally used to deal with sparse label distributions and improve model performance. However, the classifier bias that often arises in imbalanced scenarios has received little attention when these methods are applied directly to imbalanced learning. Here, we hypothesize that as the imbalance ratio of the dataset grows, the pseudo-labels obtained by ST-based methods become less credible, and at the same time the prediction confidence on unlabeled nodes is no longer reliable. We conduct comprehensive experimental studies to verify this hypothesis. Due to space constraints, we elaborate the experimental details and conclusions in Appendix B.

4. UNREAL

In this section, we provide the details of the proposed method. UNREAL iteratively adds unlabeled nodes (with predicted labels) to the training set and retrains the model. We propose three complementary techniques to enhance unlabeled node selection and labeling. More specifically, in Section 4.1 we describe the Dual Pseudo-tag Alignment Mechanism (DPAM) for effective node filtering, the key idea of which is to use unsupervised clustering in the embedding space to obtain a node ranking. In Section 4.2, we show how to combine the geometric ranking from DPAM with the confidence ranking to reorder unlabeled nodes according to their closeness to the class centers (Node-Reordering). Finally, in Section 4.3, we identify the issue of geometric imbalance (GI) and define a new metric to measure GI, which is then used to filter out nodes with high GI. The overall pipeline of UNREAL is illustrated in Figure 1. Our full algorithm is also provided in Appendix G (Algorithm 1).

4.1. DUAL PSEUDO-TAG ALIGNMENT MECHANISM FOR NODE FILTERING

UNREAL iteratively adds unlabeled nodes to the training set. In each iteration, we first train the GNN model using the current training set. In the early stages the training set remains imbalanced, so the model is likely to generate biased predictions. According to Kang et al. (2019), the embeddings learned by the model are still of high quality, even when it is trained on imbalanced data. Therefore, DPAM exploits the geometric structure of the embedding space and produces a candidate set of new samples. Let d be the embedding dimension. We use H^L ∈ R^{|L|×d} and H^U ∈ R^{|U|×d} to denote the embedding matrices of labeled and unlabeled nodes, respectively. Each row of an embedding matrix is the embedding of a node u (denoted h_u^L or h_u^U), which is considered a point in d-dimensional Euclidean space.
DPAM applies an unsupervised clustering algorithm f_cluster, which partitions the embeddings of unlabeled nodes into k′ clusters and produces k′ corresponding cluster centers, where k′ is usually larger than k, the number of classes: f_cluster(H^U) ⟹ {(K_1, c_1), (K_2, c_2), ..., (K_{k′}, c_{k′})} (2), where K_i is the i-th cluster and c_i is the i-th cluster center. We use vanilla k-means in our implementation. We also compute the embedding center of each class in the training set: c_i^train = M({h_u^L | y_u ∈ C_i}) (3). Since we use k-means in our experiments, M(·) is simply the mean function. We next assign a pseudo-label ỹ_i to each cluster K_i: ỹ_i = arg min_j distance(c_j^train, c_i). We then combine the clusters with the same pseudo-label m into Ũ_m, so that U = ∪_{m=1}^{k} Ũ_m. On the other hand, the GNN model gives each node u ∈ U a prediction ŷ_u, and we put the unlabeled nodes whose prediction is m into the set U_m, so that U = ∪_{m=1}^{k} U_m. Dual Pseudo-tag Alignment Mechanism (DPAM) The pseudo-labels produced by applying an unsupervised algorithm to the embeddings provide an alternative and potentially less biased prediction, which may compensate for the bias introduced by the imbalanced training set. At the same time, the overall accuracy of the unsupervised algorithm is inferior to supervised methods, so it is sub-optimal to rely solely on the pseudo-labels from clustering. As a result, DPAM only keeps unlabeled nodes whose two labels align, i.e., those belonging to the intersection of Ũ_m and U_m for each m ∈ {1, 2, ..., k}; each node in Ũ_m ∩ U_m gets the pseudo-label m. Due to space constraints, we defer the empirical studies on why DPAM works to Appendix D.1.
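The alignment step above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: precomputed 2-D embeddings and a trivial clustering assignment stand in for k-means, and all numeric values are hypothetical.

```python
import numpy as np

# DPAM sketch: pseudo-label each cluster with the class whose training-set
# embedding center c_train is nearest, then keep only unlabeled nodes whose
# cluster pseudo-label agrees with the GNN prediction.

def dpam(cluster_id, cluster_centers, class_centers, gnn_pred):
    # y_tilde[j] = arg min_i dist(c_train_i, c_j) for cluster j
    y_tilde = np.array([
        np.argmin(np.linalg.norm(class_centers - c, axis=1))
        for c in cluster_centers
    ])
    cluster_label = y_tilde[cluster_id]      # per-node clustering pseudo-label
    keep = cluster_label == gnn_pred         # dual pseudo-tag alignment
    return np.flatnonzero(keep), cluster_label

cluster_id = np.array([0, 1, 1])                     # k' = 2 clusters, 3 nodes
cluster_centers = np.array([[0.1, 0.0], [0.7, 0.75]])
class_centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # k = 2 classes
gnn_pred = np.array([0, 1, 0])                       # model disagrees on node 2
kept, labels = dpam(cluster_id, cluster_centers, class_centers, gnn_pred)
# node 2 is dropped: its cluster pseudo-label conflicts with the prediction
```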

4.2. NODE RE-ORDERING

Now DPAM has selected a pool of candidate nodes: Z = ∪_{m=1}^{k} (Ũ_m ∩ U_m). In this section, we present Node-Reordering, a method that re-orders nodes in Z according to the closeness of each node to its class center. Node-Reordering combines the geometric ranking from the unsupervised method with the confidence ranking from the model prediction. Geometric and confidence rankings Suppose u ∈ Ũ_m ∩ U_m, and let h_u^U be the embedding of u. We measure the distance between node u and its class center by δ_u = distance(h_u^U, c_m^train) (5), where c_m^train is the class center of class m (see equation 3). For each class m, we sort the nodes in Ũ_m ∩ U_m in increasing order of their distance to the class center, obtaining k sorted lists {S_1, S_2, ..., S_k}, which we call geometric rankings. On the other hand, for each node u ∈ Ũ_m ∩ U_m, we can get a classification confidence from the output of the classifier as follows: predictions = softmax(logits), confidence = max(predictions). Here, logits is the output of the neural network, usually a k-dimensional vector (k being the number of classes). The pseudo-label of u from the classifier is the index of the class with the highest prediction probability, and that probability is its confidence. We sort the nodes in Ũ_m ∩ U_m in decreasing order of their confidence and obtain another k sorted lists {T_1, T_2, ..., T_k}, which we call confidence rankings. Rank Biased Overlap In the fields of information retrieval and recommendation systems, a fundamental task is to measure the similarity between two rankings. Rank Biased Overlap (RBO) (Webber et al., 2010) compares two ranked lists and returns a numeric value between zero and one to quantify their similarity. An RBO value of zero indicates the lists are completely different, and an RBO of one means they are identical.
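The two rankings for a single class m can be computed as follows. A sketch with toy node ids, embeddings, and logits (all hypothetical):

```python
import numpy as np

# Geometric ranking S_m: sort candidates by increasing delta_u = dist(h_u, c_train_m).
# Confidence ranking T_m: sort by decreasing max softmax(logits).

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

nodes = np.array([10, 11, 12])                      # candidate node ids
H = np.array([[0.2, 0.1], [0.9, 0.8], [0.4, 0.5]])  # their embeddings
c_train_m = np.array([0.0, 0.0])                    # class-m center
logits = np.array([[2.0, 0.1], [0.5, 0.4], [3.0, 0.2]])

delta = np.linalg.norm(H - c_train_m, axis=1)
S_m = nodes[np.argsort(delta)]                      # geometric ranking
conf = softmax(logits).max(axis=1)
T_m = nodes[np.argsort(-conf)]                      # confidence ranking
```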
Node-Reordering For each class m, we calculate the RBO value between S_m and T_m and then use the RBO score as a weight to form a weighted combination of the two rankings. More specifically, we first compute r_m = RBO(S_m, T_m), and then compute N_m^New = max{r_m, 1 − r_m} · S_m + min{r_m, 1 − r_m} · T_m. We then select nodes according to the new ranking based on the values in N_m^New. Note that we always let the geometric ranking have the dominating influence in this step. Due to space constraints, an ablation analysis of Node-Reordering is presented in Appendix D.2.
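One way to read the combination above is as a weighted sum of rank positions, with the geometric list always receiving the larger weight max{r_m, 1 − r_m}. A sketch under that interpretation; the RBO here is truncated at the list length (the extrapolated tail of Webber et al.'s original definition is omitted), and the toy lists are hypothetical:

```python
# Truncated Rank Biased Overlap and the weighted rank combination of
# Node-Reordering. Lower combined score = earlier in the new ranking.

def rbo(S, T, p=0.9):
    score = 0.0
    for d in range(1, len(S) + 1):
        overlap = len(set(S[:d]) & set(T[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score

def node_reordering(S, T, p=0.9):
    r = rbo(S, T, p)
    w_s, w_t = max(r, 1 - r), min(r, 1 - r)   # geometric list dominates
    pos_s = {u: i for i, u in enumerate(S)}
    pos_t = {u: i for i, u in enumerate(T)}
    return sorted(S, key=lambda u: w_s * pos_s[u] + w_t * pos_t[u])

S_m = [10, 12, 11]   # geometric ranking (closest to center first)
T_m = [12, 10, 11]   # confidence ranking (most confident first)
new_order = node_reordering(S_m, T_m)
```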

4.3. GEOMETRIC IMBALANCE

In this section, we consider the issue of geometric imbalance (GI) in the embedding space, and define a simple and effective metric to measure GI.

Geometric Imbalance

In highly imbalanced scenarios, minority nodes often suffer from topology imbalance (Song et al., 2022; Chen et al., 2021), which means a node lies near the boundary between the minority class and a majority class. The geometric ranking and DPAM introduced above effectively alleviate this issue. However, when the class centers of a minority class and a majority class are very close in the embedding space, the problem may persist: we rank nodes only by their absolute distance to the centers, so nodes on the boundary of two close classes may still rank highly. We refer to this issue as geometric imbalance in the embedding space. We present a visualization illustrating geometric imbalance in Figure 9 (deferred due to space constraints). Discarding geometrically imbalanced nodes (DGI) Having identified the GI problem, we define a simple and natural metric to measure the degree of GI. According to equation 5, δ_u is the distance between the embedding of u and the center of the class to which u is assigned (i.e., the closest class center among all classes). Similarly, we define β_u as the distance between the embedding of u and the second closest class center. We have δ_u ≤ β_u for all u, and intuitively, if δ_u ≈ β_u, then u is likely to have a high degree of GI. We thus define the metric for measuring GI as GI_u = (β_u − δ_u) / δ_u. We refer to this metric as the GI index. The GI issue is more serious for nodes with a smaller GI index, so we set a threshold and discard all nodes whose GI index falls below it. We empirically verify the effectiveness of DGI; the results and analysis are provided in Appendix D.2.
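The GI index and the DGI filter can be sketched directly from the definition. Toy centers and embeddings; the threshold γ is a tunable hyperparameter, and the specific values here are hypothetical:

```python
import numpy as np

# GI_u = (beta_u - delta_u) / delta_u, where delta_u and beta_u are the
# distances from u's embedding to the closest and second closest class
# centers. Nodes with GI_u below a threshold gamma are discarded.

def gi_filter(H, centers, gamma):
    dists = np.linalg.norm(H[:, None, :] - centers[None, :, :], axis=2)
    dists.sort(axis=1)                       # per node: ascending distances
    delta, beta = dists[:, 0], dists[:, 1]
    gi = (beta - delta) / delta
    return np.flatnonzero(gi >= gamma), gi

centers = np.array([[0.0, 0.0], [2.0, 0.0]])
H = np.array([[0.2, 0.0],    # clearly in class 0: delta=0.2, beta=1.8
              [0.98, 0.0]])  # near the boundary: delta=0.98, beta=1.02
kept, gi = gi_filter(H, centers, gamma=0.5)
# only the first node survives the DGI filter
```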

4.4. SELECTING NEW NODES ITERATIVELY

As in self-training techniques, we select nodes to join the training set in several rounds, and in each round we retrain the model using the newly formed training set. In highly-imbalanced cases, we only add nodes from the minority classes. In this way, the label distribution of the training set is gradually smoothed, and the imbalance issues of minority nodes are alleviated, benefiting from the addition of high-quality new samples.
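The outer loop can be summarized as follows. This is a high-level sketch only: the body of each round simulates the effect of adding up to α minority-class nodes per round, standing in for the real retrain / DPAM / Node-Reordering / DGI pipeline stages, and all counts are hypothetical.

```python
# Iterative selection loop (Section 4.4): in each round, add up to `alpha`
# pseudo-labeled nodes per minority class until the distribution is smoothed.

def smooth_training_set(counts, pool, alpha, rounds):
    """counts: class -> #labeled nodes; pool: class -> #available candidates."""
    majority = max(counts.values())
    for _ in range(rounds):
        # real pipeline: retrain GNN, run DPAM, Node-Reordering, and DGI here
        for c in counts:
            need = majority - counts[c]        # only minority classes need nodes
            take = min(alpha, need, pool[c])
            counts[c] += take
            pool[c] -= take
    return counts

counts = smooth_training_set({0: 100, 1: 10}, {0: 50, 1: 500},
                             alpha=20, rounds=3)
# class 1 grows by 20 per round: 10 -> 30 -> 50 -> 70; class 0 is untouched
```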

5.1. EXPERIMENTAL SETUPS

Datasets We validate the advantages of our method on five benchmark datasets (i.e., Cora, CiteSeer, PubMed, Amazon-Computers, and Flickr) under different imbalance scenarios, adopting the step imbalance scheme of (Zhao et al., 2021; Park et al., 2021; Song et al., 2022) to construct class-imbalanced datasets. More specifically, we choose half of the classes as minority classes and convert randomly picked labeled nodes into unlabeled ones until the imbalance ratio of the training set reaches ρ. For Flickr, the training set in the public split is already imbalanced, so we use this split directly without any changes. For the three citation networks (Cora, CiteSeer, PubMed), we use the standard splits from Yang et al. (2016) as our initial splits when the imbalance ratio is 10 or 20. For larger imbalance ratios, 20 labeled nodes per class is not enough, so we use a random split as the initial split when creating imbalance ratios of 50 and 100. Detailed experimental settings such as the evaluation protocol and implementation details of our algorithm are described in Appendix F. Baselines We compare UNREAL with several classic techniques (cross-entropy loss with re-weighting (Japkowicz & Stephen, 2002), PC Softmax (Hong et al., 2021), and Balanced Softmax (Ren et al., 2020)) and state-of-the-art methods for imbalanced node classification, including GraphSMOTE (Zhao et al., 2021), GraphENS (Park et al., 2021), ReNode (Chen et al., 2021), and TAM (Song et al., 2022). Among them, GraphSMOTE and GraphENS are representative over-sampling methods for node classification, while ReNode and TAM are loss-modification approaches. For TAM, we test its performance when combined with different base models, including GraphENS, ReNode, and Balanced Softmax, following Song et al. (2022). The implementation details of the baselines are described in Appendix F.5.
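The step imbalance construction can be sketched as follows: keep the majority classes at n_max labeled nodes and shrink the chosen minority classes (half of all classes) to roughly n_max / ρ by moving randomly picked labeled nodes back to the unlabeled pool. The function name and example sizes are our own illustration, not the authors' code:

```python
import random

# Step imbalance sketch: the second half of the classes become minority
# classes with n_max // rho labeled nodes each (at least one).

def step_imbalance(labeled, k, rho, n_max, seed=0):
    """labeled: class -> list of labeled node ids (modified in place)."""
    rng = random.Random(seed)
    minority = list(range(k // 2, k))
    n_min = max(1, n_max // rho)
    removed = {}
    for c in minority:
        rng.shuffle(labeled[c])
        removed[c] = labeled[c][n_min:]   # returned to the unlabeled pool
        labeled[c] = labeled[c][:n_min]
    return labeled, removed

lab = {c: list(range(c * 20, c * 20 + 20)) for c in range(4)}
lab, rem = step_imbalance(lab, k=4, rho=10, n_max=20)
sizes = {c: len(v) for c, v in lab.items()}   # {0: 20, 1: 20, 2: 2, 3: 2}
```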

Experimental results under different imbalance ratios

In Table 1 and Table 2, we report the averaged balanced accuracy (bAcc.) and F1 score with standard errors for the baselines and UNREAL on four class-imbalanced node classification benchmark datasets under different imbalance ratios (ρ = 10, 20). The results clearly demonstrate the advantage of UNREAL. Our method consistently outperforms existing state-of-the-art approaches across four datasets, three base models, and two imbalance ratios (except for GraphSAGE on Amazon-Computers with imbalance ratio 10). In many cases the margin is significant. To evaluate performance on very skewed label distributions, we also test more imbalanced settings (ρ = 50, 100), and similarly, our method outperforms all other methods consistently and often by a notable margin. We remark that since GraphSMOTE (Zhao et al., 2021) synthesizes nodes within the minority class, it is not applicable when some classes contain only one node, which is the case for ρ = 20, 50, 100 in our setup. The results are presented in Appendix C.1.

Experimental results for naturally imbalanced datasets

We also validate our model on a naturally imbalanced dataset, Flickr. The split into training, validation, and test sets follows (Zeng et al., 2019), which yields an imbalance ratio of roughly ρ ≈ 10.8. We found that existing over-sampling methods use too much memory due to synthetic node generation and cannot handle Flickr on a 3090 GPU with 24GB of memory. These include GraphENS (Park et al., 2021), GraphSMOTE (Zhao et al., 2021), and ReNode (Chen et al., 2021). Due to space constraints, we provide the experimental results in Table 8.

5.3. ABLATION ANALYSIS

In this section, we conduct ablation studies to analyze the benefit of each component of our method. The results in Section 3 have already verified the necessity of unsupervised learning in the embedding space; thus, DPAM is applied in all compared methods here. We test the performance of three different ranking methods, namely confidence ranking, geometric ranking, and Node-Reordering (which combines the former two rankings with information retrieval techniques). Moreover, we test the effect of DGI, which aims to eliminate geometrically imbalanced nodes. As shown in Table 3, each component of our method brings performance improvements. In particular, in three out of the four settings in the table, Node-Reordering+DGI achieves the best F1 scores. In all cases, using geometric ranking is better than confidence ranking, which empirically verifies our hypothesis that the prediction confidence scores may contain bias and be less reliable.

6. RELATED WORK

Imbalanced learning Most real-world data is naturally imbalanced. The major challenge in imbalanced scenarios is how to train a fair model that is not biased toward majority classes. There are several commonly used approaches for alleviating this problem. Ensemble learning combines multiple models to obtain more balanced predictions (Freund & Schapire, 1997; Liu et al., 2008; Zhou et al., 2020; Wang et al., 2020a; Liu et al., 2020; Cai et al., 2021), while loss modification methods re-weight samples or adjust classification margins (Zhou & Liu, 2005; Tang et al., 2008; Cao et al., 2019; Tang et al., 2020; Xu et al., 2020; Ren et al., 2020; Wang et al., 2021b). Methods based on post-hoc correction compensate minority classes during the inference step, after model training is complete (Kang et al., 2019; Tian et al., 2020; Menon et al., 2020; Hong et al., 2021). Although these techniques have been widely applied to i.i.d. data, extending them to graph-structured data is non-trivial. Imbalanced learning in node classification Recently, a series of studies (Shi et al., 2020; Wang et al., 2020c; Zhao et al., 2021; Liu et al., 2021; Qu et al., 2021; Chen et al., 2021; Park et al., 2021; Song et al., 2022) explicitly tackle the challenges brought by the topological structure of graph data when handling imbalanced node classification. GraphSMOTE (Zhao et al., 2021) synthesizes minority nodes in the embedding space and generates their relation information via link prediction. To obtain label information of unlabeled nodes, TAM (Song et al., 2022) trains the model using the original imbalanced training set and takes the model predictions as proxies for ground-truth labels.

7. CONCLUSION

In this work, we observe that selecting unlabeled nodes, instead of generating synthetic nodes as in over-sampling based methods for imbalanced node classification, is much simpler and more effective. We propose a novel iterative unlabeled-node selection and retraining framework, which effectively selects high-quality new samples from the unlabeled set to smooth the label distribution of the training set. Moreover, we propose to exploit the geometric structure of the node embedding space to compensate for the bias in the model predictions. Extensive experimental results show that UNREAL consistently outperforms existing state-of-the-art approaches by large margins.

A NOTATION TABLE

Indices
n — the number of nodes, |V|
f — the node feature dimension
k — the number of different classes
k′ — the number of cluster centers in the embedding space
d — the dimension of the embedding space, i.e., of the last GNN layer
T — the number of rounds of node selection

Parameters
G — an undirected and unweighted graph
V — the node set of G
E — the edge set of G
X — the feature matrix of G, X ∈ R^{n×f}
L — the set of labeled nodes of G
A — the adjacency matrix of G, A ∈ {0, 1}^{n×n}
N(v) — the set of 1-hop neighbors of node v
U — the set of unlabeled nodes, U = V − L
C_i — the set of labeled nodes of class i
ρ — the imbalance ratio of a dataset, ρ := max_i |C_i| / min_i |C_i|
h_v^l — the feature of node v in the l-th layer
e_{v,u} — the edge weight between v and u
Φ_l — the parameter matrix of the l-th layer
H^L — the embedding matrix of labeled nodes, H^L ∈ R^{|L|×d}
H^U — the embedding matrix of unlabeled nodes, H^U ∈ R^{|U|×d}
h_u^L — the embedding of node u, if u ∈ L
h_u^U — the embedding of node u, if u ∈ U
K_i — the i-th cluster
c_i — the center of the i-th cluster
ỹ_i — the pseudo-label of cluster K_i
Ũ_m — the union of clusters with the same pseudo-label m
ŷ_u — the prediction for node u ∈ U given by the GNN model
U_m — the set of unlabeled nodes whose GNN prediction is m
Z — the pool of candidate nodes after DPAM, Z = ∪_{m=1}^{k} (Ũ_m ∩ U_m)
c_m^train — the class center of class m in the embedding space
S_i — the sorted lists of geometric rankings
T_i — the sorted lists of confidence rankings
r_m — the similarity between the two rankings, r_m = RBO(S_m, T_m)
δ_u — the distance between the embedding of u and the closest class center
β_u — the distance between the embedding of u and the second closest class center
γ — the threshold of DGI
p — the weight hyperparameter of RBO
α — the maximum number of nodes added per class in each round
η — the learning rate of the GNN model

Functions
m_l — the message function of MPNNs
θ_l — the information aggregation function
ψ_l — the node feature update function
f_cluster — an unsupervised clustering algorithm in the embedding space
M(·) — the mean function
f_g — the GNN model

B ADDITIONAL RESULTS OF PSEUDO-LABEL MISJUDGMENT AUGMENTATION PROBLEM

Here, we present the details and results of the experiments not reported in Section 3 due to space constraints. Experimental setup We first conduct experiments to test the accuracy of pseudo labels for unlabeled nodes on class-imbalanced graphs. ST models based on different GNN architectures are trained on four node classification benchmark datasets: Cora, CiteSeer, PubMed, and Amazon-Computers. We impose an imbalanced distribution on the four datasets following Zhao et al. (2021); Park et al. (2021); Song et al. (2022). The imbalance ratio ρ between the sizes of the most frequent class and the least frequent class is set to 1, 5, 10, 20, 50, and 100. We fix the architecture as a 2-layer GNN (i.e., GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), or GraphSAGE (Hamilton et al., 2017)) with 128 hidden dimensions and train models for 2000 epochs. We select the model by validation accuracy. We test the accuracy of the pseudo labels of unlabeled nodes newly added to the training set; more specifically, we separately examine 100 nodes joining the majority classes and 100 nodes joining the minority classes. We repeat each experiment five times and report the average results.

Pseudo-label Misjudgment Augmentation Problem

For ST under different imbalance scenarios, the accuracy of the pseudo labels of the unlabeled nodes selected into the minority classes and the majority classes of the training set, respectively, is reported in Figures 2, 3, 4, 5 and Table 5. We find that as ρ becomes larger, the accuracy of the pseudo labels of unlabeled nodes selected into the minority classes becomes lower; in other words, the influence of the classifier's bias becomes larger. This means that in an imbalanced scenario, the pseudo-labels given by the classifier are not credible. Similarly, we also believe that even when the pseudo-label of a node is accurate, the confidence given by the classifier is skewed, which means that we may put low-quality unlabeled nodes into the training set and neglect high-quality ones. For the unlabeled nodes selected into the majority classes, we find that as the degree of imbalance increases, the accuracy of their pseudo labels remains stable at a low level, which further confirms the classifier's bias problem. More importantly, regardless of whether majority-class or minority-class nodes are selected, UNREAL consistently outperforms ST. The specific performance of ST ST is a classic technique in semi-supervised learning for enhancing performance and robustness, e.g., Lee et al. (2013). However, as we have argued and verified above, for highly imbalanced data ST is unlikely to achieve optimal performance, as biased and untrustworthy predictions may bring low-quality nodes into the training set in the early stages. Our key idea to remedy this is to exploit the geometric structural information in the embedding space. In this section, we empirically verify the informativeness of geometric structures by comparing UNREAL with pure self-training schemes (ρ = 10, 20, 50, 100), comparing the F1-score (%) with the standard errors of ST and UNREAL.

Results

The number of nodes added in each round for each class is a hyperparameter, which we tune based on validation accuracy. We repeat each experiment five times and report the average results on the node classification benchmark datasets under different imbalance ratios in Figures 6, 7, and 8. Across different ratios, UNREAL consistently outperforms self-training by a large margin, and as the imbalance ratio increases, the gap in F1 scores between ST and our method widens. This shows that as the data imbalance becomes more severe, the performance of ST degrades more rapidly, which is likely due to noise introduced in early rounds.

C ADDITIONAL RESULTS IN DIFFERENT SCENARIOS

C.1 MORE RESULTS ON HIGHER IMBALANCE RATIOS

In this section, we show the performance of UNREAL in highly-imbalanced scenarios by constructing training sets with ρ = 50, 100 on the benchmark datasets, which is not presented in the main paper. The results are presented in Table 6 and Table 7. Our model is more robust on highly-imbalanced datasets with different architectures, namely GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), and GraphSAGE (Hamilton et al., 2017). UNREAL can deal with different degrees of imbalance and significantly outperforms other methods by a large margin. We observe from Table 6 and Table 7 that the performance of GraphENS (and GraphENS+TAM) degrades noticeably on highly-imbalanced datasets. In highly imbalanced scenarios, synthesizing or duplicating huge numbers of nodes based on the rare minority nodes in the training set is prone to overfitting the minority classes. On the other hand, BalancedSoftmax+TAM achieves overall better performance than GraphENS+TAM in these highly imbalanced scenarios.

C.2 RESULTS ON FLICKR

The Flickr dataset is naturally imbalanced and the training set in the public split is also imbalanced, so we directly evaluate the performance of all methods on the public split (Zeng et al., 2019). The results are presented in Table 8.

D ADDITIONAL ANALYSIS FOR EACH COMPONENT OF UNREAL

D.1 ADDITIONAL ANALYSIS FOR DPAM

In this section, we analyze why DPAM works. DPAM runs an unsupervised algorithm to obtain a pseudo label for each unlabeled node in the embedding space, and only unlabeled nodes whose pseudo labels and classifier predictions are aligned are put into the candidate pool. This effectively circumvents the bias problems of the classifier, such as pseudo-label misjudgment of unlabeled nodes and the selection of low-quality nodes into the training set based on skewed confidence rankings. To quantify the performance of DPAM, we conduct the experiments below.

Experimental setup

We use DPAM to filter the unlabeled nodes of the whole graph, and test the accuracy of the pseudo labels (the classifier's predictions) of the aligned node set U_in and the discarded node set U_out, respectively. DPAM based on different GNN architectures is trained on two node classification benchmark datasets, Cora and Amazon-Computers. We process the two datasets with a traditional imbalanced distribution following Zhao et al. (2021); Park et al. (2021); Song et al. (2022). The imbalance ratio ρ between the sizes of the most frequent and the least frequent class is set to 1, 5, 10, 20, 50, and 100. We fix the architecture as a 2-layer GNN (i.e., GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), or GraphSAGE (Hamilton et al., 2017)) with 128 hidden dimensions and train models for 2000 epochs. We select the model by validation accuracy. We measure the accuracy of pseudo labels for the unlabeled nodes that are filtered out and absorbed by DPAM, respectively. We repeat each experiment five times and present the average results.

Result

DPAM divides the unlabeled nodes of the whole graph into two parts, U_in and U_out. We verify the effect of DPAM by testing the accuracy of pseudo labels for these two parts of nodes. We observe that the pseudo-label accuracies of U_in and U_out differ greatly in different imbalanced scenarios. The pseudo-label accuracy of U_in is usually high while that of U_out is lower, which demonstrates the effectiveness of DPAM. We also observe that as ρ increases, both accuracies decrease, which again reflects the model bias caused by the imbalanced label distribution.
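For concreteness, a minimal sketch of the DPAM filtering step might look as follows, using scikit-learn's KMeans as the unsupervised algorithm. Function and argument names here are our own illustration, not the released implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def dpam(h_unlabeled, class_centers, clf_pred, k_prime=100, seed=0):
    """Dual Pseudo-tag Alignment Mechanism (sketch).

    h_unlabeled : (n, d) embeddings of the unlabeled nodes
    class_centers : (C, d) mean embedding of each labeled class
    clf_pred : (n,) classifier predictions for the unlabeled nodes
    Returns the indices of aligned (kept) and discarded nodes.
    """
    km = KMeans(n_clusters=k_prime, n_init=10, random_state=seed).fit(h_unlabeled)
    # Label each cluster with the class whose training center is closest.
    dist = np.linalg.norm(km.cluster_centers_[:, None, :]
                          - class_centers[None, :, :], axis=-1)   # (k', C)
    cluster_label = dist.argmin(axis=1)                           # (k',)
    node_pseudo = cluster_label[km.labels_]   # cluster pseudo-label per node
    # Keep only nodes whose two "pseudo tags" agree.
    aligned = np.flatnonzero(node_pseudo == clf_pred)
    discarded = np.flatnonzero(node_pseudo != clf_pred)
    return aligned, discarded
```

Only the nodes in `aligned` (i.e., U_in) enter the candidate pool; `discarded` corresponds to U_out.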

D.2 ADDITIONAL ANALYSIS FOR NODE-REORDERING AND DGI

In this section, we analyze why Node-Reordering and DGI work. With DPAM, we filter out a large portion of untrustworthy nodes and obtain a pool of candidate nodes. We then carefully hunt for high-quality nodes in the pool to add to the training set, which involves a priority issue. As mentioned before, we have already verified in Section 3 that the predictions and confidences given by the classifier are biased, resulting in low accuracy of the pseudo labels for nodes selected by ST in highly imbalanced scenarios. We obtain the geometric ranking according to the distance between unlabeled nodes and the class centers in the embedding space. Considering the influence of classifier bias on the confidence ranking, we believe the geometric ranking is more credible in the early rounds. At the same time, we take into account the suboptimal nature of the unsupervised algorithm: as the number of UNREAL rounds increases, the label distribution of the training set gradually becomes balanced, and the confidence given by the classifier becomes more reliable. Node-Reordering considers both the geometric ranking and the confidence ranking; specifically, it uses the similarity between the two rankings as a weight to reorder the priority of the nodes. To quantify the performance of Node-Reordering and DGI, we conduct the experiments below.
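The reordering step can be sketched as follows, with RBO (rank-biased overlap) truncated to the candidate list and each ranking interpreted as a list of node ids ordered best-first. This interpretation of how the two rankings are combined is our reading of Algorithm 1, not the released code:

```python
def rbo(s, t, p=0.98):
    """Truncated rank-biased overlap between two rankings (lists of ids)."""
    k = min(len(s), len(t))
    seen_s, seen_t, score = set(), set(), 0.0
    for d in range(1, k + 1):
        seen_s.add(s[d - 1])
        seen_t.add(t[d - 1])
        score += (p ** (d - 1)) * len(seen_s & seen_t) / d
    return (1 - p) * score

def reorder(geo_rank, conf_rank, p=0.98):
    """Combine geometric and confidence rankings (sketch of Node-Reordering).

    Both rankings must cover the same candidate set, best first.  A node's
    combined score is a weighted sum of its two rank positions, weighted by
    the RBO similarity r between the rankings: max{r, 1-r} goes to the
    geometric ranking and min{r, 1-r} to the confidence ranking.
    """
    r = rbo(geo_rank, conf_rank, p)
    w_geo, w_conf = max(r, 1 - r), min(r, 1 - r)
    pos_g = {u: i for i, u in enumerate(geo_rank)}
    pos_c = {u: i for i, u in enumerate(conf_rank)}
    score = {u: w_geo * pos_g[u] + w_conf * pos_c[u] for u in pos_g}
    return sorted(score, key=score.get)  # best (lowest combined rank) first
```

Note that this truncated RBO omits the extrapolation term of the full measure; it only illustrates how the similarity weight enters the reordering.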

Experimental setup

We conduct experiments to test the accuracy of pseudo labels for unlabeled nodes on class-imbalanced graphs. All model combinations based on different GNN architectures are trained on two node classification benchmark datasets, Cora and Amazon-Computers. We process the two datasets with a traditional imbalanced distribution following Zhao et al. (2021); Park et al. (2021); Song et al. (2022). The imbalance ratio ρ between the sizes of the most frequent and the least frequent class is set to 1, 5, 10, 20, 50, and 100. We fix the architecture as a 2-layer GNN (i.e., GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), or GraphSAGE (Hamilton et al., 2017)) with 128 hidden dimensions and train models for 2000 epochs. We select the model by validation accuracy. We measure the accuracy of pseudo labels for unlabeled nodes newly added to the minority classes of the training set. We repeat each experiment five times and present the average results.

Result

E HYPERPARAMETER SENSITIVITY ANALYSIS OF UNREAL

We investigate the sensitivity of performance to the number of clusters k′ of the K-means algorithm and the threshold γ of DGI in Figure 10. We observe that the performance gradually stabilizes when k′ takes extremely high values; on the other hand, when k′ takes extremely low values, the performance of UNREAL drops sharply. We believe that when k′ is too small, the pseudo labels given by the unsupervised algorithm contain more errors. We also observe that the performance gradually stabilizes when γ takes extremely low values. We believe this is because the DGI screening is too strict, which leads to the loss of some high-quality nodes; on the other hand, an extremely large γ introduces much noise into the training set.
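For reference, the GI-based filtering that γ controls can be sketched as follows. The helper name is hypothetical; δ (distance to a node's own class center) and β (distance to the second-closest center) follow the definitions in Algorithm 1:

```python
import numpy as np

def dgi_filter(h, centers, labels, gamma=0.5):
    """Discard geometrically imbalanced nodes (sketch).

    h : (n, d) node embeddings; centers : (C, d) class centers;
    labels : (n,) pseudo-label of each node, assumed to be its closest center.
    A node's GI index is (beta - delta) / delta; nodes whose GI index falls
    below the threshold gamma are discarded.
    """
    d = np.linalg.norm(h[:, None, :] - centers[None, :, :], axis=-1)  # (n, C)
    delta = d[np.arange(len(h)), labels]        # distance to own class center
    d_other = d.copy()
    d_other[np.arange(len(h)), labels] = np.inf
    beta = d_other.min(axis=1)                  # distance to second-closest center
    gi = (beta - delta) / delta
    keep = gi >= gamma                          # small GI = geometrically imbalanced
    return keep, gi
```

A node lying almost midway between two class centers gets a GI index near zero and is filtered out, while a node far closer to its own center than to any other is kept.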

F DETAILS OF THE EXPERIMENTAL SETUP

Here, we introduce the method of imbalanced datasets construction, evaluation protocol, and the details of our algorithm and baseline methods.

F.1 IMBALANCED DATASETS CONSTRUCTION

The detailed descriptions of the datasets are shown in Table 13. For each citation dataset with ρ = 10, 20, we follow the "public" split and randomly convert minority-class nodes to unlabeled nodes until the dataset reaches imbalance ratio ρ. For ρ = 50, 100, since there are not enough nodes per class in the public-split training set, we use randomly selected nodes as training samples; for the validation and test sets we still follow the public split. For the co-purchase network Amazon-Computers, we randomly select nodes as the training set in each repeated experiment, construct a random validation set with 30 nodes per class, and treat the remaining nodes as the test set. For Flickr, we follow the dataset split from Zeng et al. (2019). The details of the label distributions in the training sets of the five imbalanced benchmark datasets are in Table 14, and the label distributions over the full graphs are provided in Table 15.
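The ρ-controlled conversion of labeled minority nodes can be sketched as follows. This is a generic step-imbalance sketch in which half of the classes are treated as majority classes; the exact per-class schedule follows Zhao et al. (2021); Park et al. (2021) and may differ:

```python
import numpy as np

def make_imbalanced(train_mask, y, rho, n_major=20, seed=0):
    """Convert labeled minority-class nodes to unlabeled nodes until the
    ratio between the most and least frequent class equals rho (sketch).

    Half of the classes keep n_major labeled nodes; the other half keep
    n_major // rho, so rho = max class size / min class size.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    n_minor = max(n_major // rho, 1)
    new_mask = np.zeros_like(train_mask)
    for i, c in enumerate(classes):
        idx = np.flatnonzero(train_mask & (y == c))
        n_keep = n_major if i < len(classes) // 2 else n_minor
        keep = rng.choice(idx, size=min(n_keep, len(idx)), replace=False)
        new_mask[keep] = True   # nodes outside new_mask become unlabeled
    return new_mask
```

Validation and test masks are left untouched, matching the protocol above.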

F.2 DETAILS OF GNNS

We evaluate our method with three classic GNN architectures, namely GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), and GraphSAGE (Hamilton et al., 2017). The GNN consists of L = 1, 2, 3 layers, and each GNN layer is followed by a BatchNorm layer (momentum = 0.99) and a PReLU activation (He et al., 2015). For GAT, we adopt multi-head attention with 8 heads. We search for the best model on the validation set.
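As an illustration, a minimal 2-layer GCN backbone matching this description might look as follows. This is a plain-PyTorch sketch; the actual experiments use the PyTorch Geometric implementations:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Minimal GCN layer: H' = A_hat H W, where A_hat is the symmetrically
    normalized adjacency with self-loops (dense sketch)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x, adj):
        a = adj + torch.eye(adj.size(0))          # add self-loops
        d = a.sum(1).rsqrt()                      # D^{-1/2}
        a_hat = d[:, None] * a * d[None, :]
        return a_hat @ self.lin(x)

class GCN(nn.Module):
    """2-layer GCN with BatchNorm (momentum 0.99) and PReLU after the
    first layer, as described above."""
    def __init__(self, d_in, d_hid, n_cls):
        super().__init__()
        self.conv1 = GCNLayer(d_in, d_hid)
        self.bn = nn.BatchNorm1d(d_hid, momentum=0.99)
        self.act = nn.PReLU()
        self.conv2 = GCNLayer(d_hid, n_cls)

    def forward(self, x, adj):
        h = self.act(self.bn(self.conv1(x, adj)))  # h feeds DPAM's embedding space
        return self.conv2(h, adj)                  # class logits
```

The hidden representation h is what DPAM and the geometric ranking operate on, while the logits provide the classifier's predictions and confidences.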

F.4 IMPLEMENTATION DETAILS

In UNREAL, we employ the vanilla K-means algorithm as the unsupervised clustering method. The number of clusters k′ is chosen from {100, 300, 500, 700, 900} for Cora, CiteSeer, PubMed, and Amazon-Computers; for Flickr, k′ is selected among {1000, 2000, 3000, 5000}. For Cora, CiteSeer, PubMed, and Amazon-Computers, the number of training rounds T is tuned among {40, 60, 80, 100}; for Flickr, T is tuned among {40, 50, 60, 70}. We also introduce a hyperparameter α, the upper bound on the number of nodes added per class per round. The tuning range of α is {4, 6, 8, 10} for Cora, CiteSeer, and Amazon-Computers, and {64, 72, 80} for PubMed; for Flickr, α is selected among {30, 40, 50, 60}. The weight parameter p in RBO is selected among {0.5, 0.75, 0.98}, and the threshold γ in DGI is tuned among {0.25, 0.5, 0.75, 1.00}. For Flickr, we only add minority nodes to the training set in all iterations, i.e., we set α = 0 for the majority classes.

F.5 BASELINES

For GraphSMOTE (Zhao et al., 2021), we use the variant whose edge predictions are discrete-valued, which achieves superior performance over the other variants in most experiments. For the ReNode method (Chen et al., 2021), we search hyperparameters among the lower bound of the cosine annealing w_min ∈ {0.25, 0.5, 0.75} and the upper bound w_max ∈ {1.25, 1.5, 1.75}, following Chen et al. (2021). The PageRank teleport probability is fixed at α = 0.15, the default setting in the released code. For TAM (Song et al., 2022), we search for the best hyperparameters among the coefficient of the ACM term α ∈ {1.25, 1.5, 1.75}, the coefficient of the ADM term β ∈ {0.125, 0.25, 0.5}, and the minimum temperature of the class-wise temperature ϕ ∈ {0.8, 1.2}, following Song et al. (2022). The sensitivity to imbalance ratio of the class-wise temperature δ is fixed at 0.4 for all main experiments.
Following (Song et al., 2022) , we adopt warmup for 5 iterations since we utilize model prediction for unlabeled nodes.
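The tuning described above amounts to a grid search over the listed ranges; a minimal sketch over the UNREAL grids for Cora/CiteSeer/Amazon-Computers (the `evaluate` callback, which would return validation accuracy for a configuration, is hypothetical):

```python
from itertools import product

# Tuning grids taken from the text above.
GRID = {
    "k_prime": [100, 300, 500, 700, 900],   # K-means clusters
    "T": [40, 60, 80, 100],                 # training rounds
    "alpha": [4, 6, 8, 10],                 # nodes added per class per round
    "p": [0.5, 0.75, 0.98],                 # RBO weight
    "gamma": [0.25, 0.5, 0.75, 1.00],       # DGI threshold
}

def best_config(evaluate, grid):
    """Return the configuration maximizing evaluate(cfg)."""
    keys = list(grid)
    return max((dict(zip(keys, vals)) for vals in product(*grid.values())),
               key=evaluate)
```

In practice each evaluation is one full UNREAL run selected by validation accuracy, so the search is usually pruned rather than exhaustive.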

F.6 CONFIGURATION

All the algorithms and models are implemented in Python and PyTorch Geometric. Experiments are conducted on a server with an NVIDIA 3090 GPU (24 GB memory) and an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz. 

G MAIN ALGORITHM

Algorithm 1 UNREAL

Input: Imbalanced dataset (G = (V, E, L_0), y), feature matrix X, adjacency matrix A, unlabeled set U = V − L_0, number of rounds T to select nodes, size threshold α of nodes added per class per round, weight hyperparameter p of RBO, threshold γ of DGI, learning rate η, number of clusters k′, GNN model f_g, clustering algorithm f_cluster, and mean function M(·).

1: for i = 0 to T do
2:   Train f_g based on the current training set L_i := {C_1, ..., C_k}.
3:   Obtain the embedding matrices of the labeled and unlabeled node sets, H_L ∈ R^{|L|×d} and H_U ∈ R^{|U|×d}, and the predictions ŷ and confidences r from the classifier.
4:   % Step 1: Dual Pseudo-tag Alignment Mechanism (DPAM)
5:   f_cluster(H_U) ⇒ {K_1, c_1, K_2, c_2, ..., K_{k′}, c_{k′}}
6:   c_i^train = M({h_u^L | y_u ∈ C_i})
7:   Assign a label ỹ_m to each cluster K_m: ỹ_m = argmin_j distance(c_j^train, c_m).
8:   Put unlabeled nodes whose cluster has pseudo-label m into Ũ_m, so that U = ∪_{m=1}^{k} Ũ_m.
9:   Put unlabeled nodes whose prediction in ŷ is m into U_m, so that U = ∪_{m=1}^{k} U_m.
10:  …
11:  For each u ∈ Ũ_m ∩ U_m: δ_u = distance(h_u^L, c_m^train).
12:  Obtain geometric rankings {S_1, S_2, ..., S_k} based on δ, and confidence rankings {T_1, T_2, ..., T_k} based on r.
13:  N_m = max{r_m, 1 − r_m} · S_m + min{r_m, 1 − r_m} · T_m.
14:  Select nodes based on the rank of their values in N_m.
15:  …
16:  Obtain the distance between the embedding of u and the second-closest center as β_u, and compute the GI index of node u as (β_u − δ_u)/δ_u.
17:  if (β_u − δ_u)/δ_u < γ then
18:    Discard node u.



Figure 1: Overall pipeline of our UNREAL. Colored nodes denote labeled nodes. Parameters in the GNN Model and the classifier are trained together using the current training set.

synthesizes minority nodes in the embedding space by interpolating two minority nodes using the SMOTE (Chawla et al., 2002) algorithm, and infers the neighborhoods of new nodes with link prediction algorithms. ImGAGN (Qu et al., 2021) generates the features of minority nodes from all of the minority nodes according to a learned weight matrix, and synthesizes the neighborhoods of new nodes based on these weights. Qu et al. (2021) only consider binary classification, and it is computationally expensive to build a generator for each class in multi-class tasks. GraphENS (Park et al., 2021) works for multi-class node classification: it synthesizes the whole ego network for minority nodes by interpolating the ego networks of two nodes based on their similarity. Chen et al. (2021) identify topology imbalance as a main source of difficulty when handling imbalance in node classification; they propose ReNode, which mitigates topology imbalance by adjusting the weights of nodes according to their distance to class boundaries. TAM (Song et al., 2022) adjusts the scores of different classes in the softmax function based on local topology and label statistics.

Figure 2: The experimental results on Cora under different imbalance scenarios (ρ = 1, 5, 10, 20, 50, 100). We select 100 unlabeled nodes newly added to the training set through ST & UNREAL, and evaluate the performance of ST & UNREAL by testing the accuracy (%) with the standard errors of these nodes' pseudo labels. ST-Minor and UNREAL-Minor mean that we only test unlabeled nodes selected into the minority class, and ST-Major and UNREAL-Major mean that we only test unlabeled nodes selected into the majority class.

Figure 3: The experimental results on CiteSeer under different imbalance scenarios (ρ = 1, 5, 10, 20, 50, 100). We select 100 unlabeled nodes newly added to the training set through ST & UNREAL, and evaluate the performance of ST & UNREAL by testing the accuracy (%) with the standard errors of these nodes' pseudo labels. ST-Minor and UNREAL-Minor mean that we only test unlabeled nodes selected into the minority class, and ST-Major and UNREAL-Major mean that we only test unlabeled nodes selected into the majority class.

Figure 4: The experimental results on PubMed under different imbalance scenarios (ρ = 1, 5, 10, 20, 50, 100). We select 100 unlabeled nodes newly added to the training set through ST & UNREAL, and evaluate the performance of ST & UNREAL by testing the accuracy (%) with the standard errors of these nodes' pseudo labels. ST-Minor and UNREAL-Minor mean that we only test unlabeled nodes selected into the minority class, and ST-Major and UNREAL-Major mean that we only test unlabeled nodes selected into the majority class.

Figure 5: The experimental results on Amazon-Computers under different imbalance scenarios (ρ = 1, 5, 10, 20, 50, 100). We select 100 unlabeled nodes newly added to the training set through ST & UNREAL, and evaluate the performance of ST & UNREAL by testing the accuracy (%) with the standard errors of these nodes' pseudo labels. ST-Minor and UNREAL-Minor mean that we only test unlabeled nodes selected into the minority class, and ST-Major and UNREAL-Major mean that we only test unlabeled nodes selected into the majority class.

Figure 6: The experimental results on Cora under different imbalance scenarios (ρ = 10, 20, 50, 100). We compare the F1-score (%) with the standard errors of ST and UNREAL.

Figure 9: An elaboration of Geometric Imbalance and DGI. We use t-SNE to visualize all embeddings of nodes in the training set and part of the embeddings of unlabeled nodes.

Figure 10: Sensitivity analysis on Cora based on GCN. The two images show the performance change as the number of clusters k′ of K-means and the threshold γ of DGI increase, respectively.


Experimental results of our method UNREAL and other baselines on four class-imbalanced node classification benchmark datasets with ρ = 10. We report averaged balanced accuracy (bAcc., %) and F1-score (%) with the standard errors over 5 repetitions on three representative GNN architectures.

± 1.43 61.67 ± 1.59 38.72 ± 1.88 28.74 ± 3.21 65.64 ± 1.72 56.97 ± 3.17 80.01 ± 0.71 71.56 ± 0.81
Re-Weight 65.36 ± 1.15 64.97 ± 1.39 44.69 ± 1.78 38.61 ± 2.37 69.06 ± 1.84 64.08 ± 2.97 80.93 ± 1.30 73.99 ± 2.20
PC Softmax 68.04 ± 0.82 67.84 ± 0.81 50.18 ± 0.55 46.14 ± 0.14 72.46 ± 0.80 70.27 ± 0.94 81.54 ± 0.76 73.30 ± 0.51
BalancedSoftmax 69.98 ± 0.58 68.68 ± 0.55 55.52 ± 0.97 53.74 ± 1.42 73.73 ± 0.89 71.53 ± 1.06 81.46 ± 0.74 74.31 ± 0.51
GraphSMOTE 66.39 ± 0.56 65.49 ± 0.93 44.87 ± 1.12 39.20 ± 1.62 67.91 ± 0.64 62.68 ± 1.92 79.48 ± 0.47 72.63 ± 0.76
Renode 67.03 ± 1.41 67.16 ± 1.67 43.47 ± 2.22 37.52 ± 3.10 71.40 ± 1.42 67.27 ± 2.96 81.89 ± 0.77 73.13 ± 1.60
GraphENS 70.89 ± 0.71 70.90 ± 0.81 56.57 ± 0.98 55.29 ± 1.33 72.13 ± 1.04 70.72 ± 1.07 82.40 ± 0.39 74.26 ± 1.05
BalancedSoftmax+TAM 69.94 ± 0.45 69.54 ± 0.47 56.73 ± 0.71 56.15 ± 0.78 74.62 ± 0.97 72.25 ± 1.30 82.36 ± 0.67 72.94 ± 1.43
Renode+TAM 68.26 ± 1.84 68.11 ± 1.97 46.20 ± 1.17 39.96 ± 2.76 72.63 ± 2.03 68.28 ± 3.30 80.36 ± 1.19 72.51 ± 0.68
GraphENS+TAM 71.69 ± 0.36 72.14 ± 0.51 58.01 ± 0.68 56.32 ± 1.03 74.14 ± 1.42 72.42 ± 1.39 81.02 ± 0.99 70.78 ± 1.72
UNREAL 78.33 ± 1.04 76.44 ± 1.06 65.63 ± 1.38 64.94 ± 1.38 75.35 ± 1.41 73.65 ± 1.43 85.08 ± 0.38 75.27 ± 0.23
Vanilla 62.33 ± 1.56 61.82 ± 1.84 38.84 ± 1.13 31.25 ± 1.64 64.60 ± 1.64 55.24 ± 2.80 79.04 ± 1.60 70.00 ± 2.50
Re-Weight 66.87 ± 0.97 66.62 ± 1.13 45.47 ± 2.35 40.60 ± 2.98 68.10 ± 2.85 63.76 ± 3.54 80.38 ± 0.66 69.99 ± 0.76
PC Softmax 66.69 ± 0.79 66.04 ± 1.10 50.78 ± 1.66 48.56 ± 2.08 72.88 ± 0.83 71.09 ± 0.89 79.43 ± 0.94 71.33 ± 0.86
BalancedSoftmax 67.89 ± 0.36 67.96 ± 0.41 54.78 ± 1.25 51.83 ± 2.11 72.30 ± 1.20 69.30 ± 1.79 82.02 ± 1.19 72.94 ± 1.54
GraphSMOTE 66.71 ± 0.32 65.01 ± 1.21 45.68 ± 0.93 38.96 ± 0.97 67.43 ± 1.23 61.97 ± 2.54 79.38 ± 1.97 69.76 ± 2.31

Experimental results of our method UNREAL and other baselines on four class-imbalanced node classification benchmark datasets with ρ = 20. We report averaged balanced accuracy (bAcc., %) and F1-score (%) with the standard errors over 5 repetitions on three representative GNN architectures.

± 1.41 58.77 ± 1.95 43.38 ± 2.01 37.76 ± 2.12 70.81 ± 1.41 70.25 ± 1.30 71.16 ± 1.15 62.26 ± 0.87
BalancedSoftmax 62.05 ± 1.62 61.14 ± 1.71 47.89 ± 1.25 44.84 ± 1.35 69.91 ± 1.68 67.43 ± 1.73 72.91 ± 1.93 62.79 ± 0.98
Renode 59.52 ± 2.28 57.16 ± 2.47 37.21 ± 2.01 27.09 ± 3.17 64.56 ± 1.65 55.87 ± 2.83 69.34 ± 2.35 59.02 ± 1.67
GraphENS 64.52 ± 2.05 62.52 ± 1.84 43.74 ± 3.81 37.47 ± 4.21 69.00 ± 2.67 65.54 ± 3.54 71.78 ± 2.30 61.83 ± 1.75
BalancedSoftmax+TAM 63.30 ± 0.99 62.81 ± 1.18 49.34 ± 1.29 46.92 ± 1.39 71.17 ± 2.09 68.85 ± 2.90 65.59 ± 2.86 58.12 ± 1.22
Renode+TAM 61.32 ± 2.18 59.19 ± 2.64 39.85 ± 2.20 30.63 ± 2.63 66.28 ± 3.24 58.99 ± 3.04 65.81 ± 2.57 56.73 ± 1.62
GraphENS+TAM 65.78 ± 1.62 63.80 ± 1.79 44.81 ± 2.66 39.47 ± 3.54 70.33 ± 2.33 67.00 ± 3.25 73.55 ± 2.04 64.03 ± 1.32
UNREAL 79.10 ± 0.71 76.21 ± 0.58 55.11 ± 5.00 53.67 ± 5.51 72.54 ± 1.52 70.54 ± 1.91 83.19 ± 0.66 74.39 ± 0.89
Vanilla 54.61 ± 1.21 50.95 ± 1.90 37.36 ± 1.03 27.49 ± 1.41 62.04 ± 1.34 54.18 ± 1.73 62.70 ± 2.87 55.39 ± 2.69
Re-Weight 57.37 ± 0.61 55.30 ± 0.72 37.69 ± 1.20 27.92 ± 2.01 65.01 ± 2.69 58.34 ± 2.19 68.31 ± 2.06 60.45 ± 2.40
PC Softmax 59.25 ± 0.74 58.55 ± 0.81 42.77 ± 1.82 40.08 ± 1.82 70.55 ± 1.19 67.60 ± 1.59 70.57 ± 2.86 62.73 ± 2.69
BalancedSoftmax 61.93 ± 1.26 60.89 ± 1.36 43.64 ± 1.33 38.31 ± 1.13 69.89 ± 1.40 68.12 ± 0.78 68.45 ± 2.92 62.12 ± 3.10

Ablation analysis on different components

combines the results of multiple weak classifiers. Data re-sampling methods (Chawla et al., 2002; Han et al., 2005; Smith et al., 2014; Sáez et al., 2015; Kang et al., 2019; Wang et al., 2021a) smooth the label distribution in the training set by synthesizing or duplicating minority-class samples. A third class of approaches alleviates the imbalance problem by modifying the loss function, giving larger weights to minority classes or changing the margins of different classes.

Elaborated notation table of this paper.

Experimental
± 1.31 94.60 ± 4.92 61.60 ± 1.25 69.40 ± 2.96 73.80 ± 1.43 88.60 ± 1.27 66.00 ± 1.00 78.00 ± 3.39
ρ = 100 (minor) 57.00 ± 1.69 78.20 ± 2.47 65.80 ± 1.20 70.80 ± 3.11 75.40 ± 0.97 91.00 ± 3.43 70.00 ± 1.07 79.80 ± 3.03
± 3.36 97.40 ± 1.81 69.60 ± 3.62 71.60 ± 2.19 92.00 ± 5.70 96.20 ± 2.58 87.20 ± 0.83 99.60 ± 0.54
ρ = 100 (major) 90.20 ± 3.11 94.00 ± 2.70 68.80 ± 5.80 77.20 ± 1.97 94.00 ± 3.31 97.60 ± 1.51 94.60 ± 1.94 99.80 ± 1.64
94.20 ± 1.30 95.60 ± 3.43 69.60 ± 2.19 82.00 ± 1.35 91.80 ± 1.92 95.60 ± 3.97 99.40 ± 0.54 99.20 ± 0.83

Experimental results of our method UNREAL and other baselines on four class-imbalanced node classification benchmark datasets with ρ = 50. We report averaged balanced accuracy (bAcc., %) and F1-score (%) with the standard errors over 5 repetitions on three representative GNN architectures.

± 1.58 67.25 ± 1.27 53.43 ± 2.42 51.74 ± 2.80 77.20 ± 1.45 74.86 ± 0.99 81.74 ± 2.30 73.85 ± 2.68
Renode+TAM 63.93 ± 1.96 61.64 ± 2.71 48.17 ± 1.58 41.07 ± 2.34 69.63 ± 2.55 64.30 ± 3.51 80.55 ± 1.75 72.33 ± 1.63
GraphENS+TAM 65.05 ± 1.11 62.11 ± 1.98 45.03 ± 1.34 42.65 ± 1.94 69.74 ± 0.78 70.82 ± 0.63 81.69 ± 2.22 72.09 ± 1.75
UNREAL 75.62 ± 2.02 72.59 ± 2.13 59.97 ± 4.59 58.66 ± 5.20 78.55 ± 0.84 75.91 ± 0.81 85.54 ± 0.26 75.76 ± 0.13
Vanilla 53.90 ± 0.63 45.53 ± 0.89 36.48 ± 0.08 23.68 ± 0.16 60.16 ± 0.47 46.99 ± 0.58 72.42 ± 2.17 64.41 ± 2.68
Re-Weight 59.78 ± 1.92 56.69 ± 2.21 38.70 ± 2.23 29.38 ± 3.06 66.27 ± 0.68 57.34 ± 1.41 73.46 ± 3.07 67.00 ± 2.60
PC Softmax 59.44 ± 2.62 58.06 ± 2.69 43.13 ± 1.56 37.04 ± 2.07 70.86 ± 0.44 70.96 ± 0.54 77.21 ± 2.90 69.17 ± 2.89
BalancedSoftmax 64.71 ± 2.28 62.55 ± 2.61 51.89 ± 1.15 49.36 ± 1.52 70.94 ± 1.09 70.33 ± 0.99 77.49 ± 1.58 70.44 ± 2.33
Renode 63.81 ± 1.72 60.63 ± 2.26 41.60 ± 2.30 33.94 ± 4.60 70.35 ± 1.26 67.43 ± 0.01 72.39 ± 2.75 65.23 ± 3.35
GraphENS 64.52 ± 2.51 61.41 ± 3.15 45.23 ± 2.97 41.12 ± 4.23 69.66 ± 1.01 66.83 ± 0.94 78.36 ± 2.74 70.44 ± 2.51
BalancedSoftmax+TAM 68.05 ± 1.03 66.07 ± 1.14 54.28 ± 0.79 52.77 ± 0.97 75.65 ± 1.11 74.02 ± 1.44 78.86 ± 1.53 70.71 ± 2.04
Renode+TAM 64.40 ± 1.83 63.48 ± 2.83 43.54 ± 1.54 35.80 ± 2.43 71.23 ± 2.04 66.61 ± 4.31 76.07 ± 2.70 68.43 ± 2.68
GraphENS+TAM 65.33 ± 2.67 65.34 ± 2.53 48.00 ± 1.46 48.14 ± 1.43 71.50 ± 1.26 72.58 ± 1.07 80.02 ± 2.32 72.38 ± 2.47
UNREAL 77.07 ± 0.83 73.44 ± 1.05 57.70 ± 4.35 56.81 ± 4.67 79.41 ± 0.29 77.38 ± 0.39 86.06 ± 0.45 77.55 ± 0.71
Vanilla 53.02 ± 0.83 45.58 ± 1.30 38.81 ± 0.89 25.28 ± 0.51 61.41 ± 1.01 50.46 ± 2.47 56.53 ± 2.12 48.52 ± 2.75
Re-Weight 58.03 ± 0.81 54.32 ± 0.99 38.49 ± 1.34 30.41 ± 1.82 62.41 ± 0.90 51.37 ± 2.62 70.36 ± 2.21 61.52 ± 2.73
PC Softmax 62.33 ± 1.62 59.97 ± 1.98 41.79 ± 1.19 36.90 ± 0.84 69.58 ± 1.09 67.13 ± 0.95 73.53 ± 2.02 66.12 ± 3.19
BalancedSoftmax 64.57 ± 0.77 62.22 ± 0.82 41.84 ± 1.72 40.09 ± 1.04 70.43 ± 0.38 68.99 ± 0.99 73.27 ± 2.30 68.30 ± 1.97
Renode 61.35 ± 1.86 58.88 ± 2.53 40.37 ± 2.33 32.57 ± 3.62 67.54 ± 3.05 59.77 ± 5.30 70.46 ± 3.45 62.30 ± 4.40
GraphENS 63.95 ± 0.96 62.63 ± 2.12 41.99 ± 1.54 37.44 ± 2.43 66.07 ± 1.12 61.63 ± 1.82 76.21 ± 2.84 68.10 ± 2.56
BalancedSoftmax+TAM 65.97 ± 0.71 65.53 ± 0.88 52.89 ± 1.65 49.92 ± 1.83 71.11 ± 0.75 71.73 ± 0.79 73.12 ± 1.41 66.45 ± 1.04
Renode+TAM 62.79 ± 0.47 61.05 ± 0.82 43.04 ± 1.30 36.97 ± 1.92 71.79 ± 1.33 67.80 ± 2.45 74.55 ± 2.95 66.06 ± 2.16
GraphENS+TAM 65.98 ± 1.37 64.84 ± 1.13 49.54 ± 1.79 49.48 ± 1.70 73.24 ± 1.32 73.73 ± 1.14 80.75 ± 1.22 72.31 ± 0.95
UNREAL 76.04 ± 1.30 72.99 ± 1.25 58.70 ± 4.10 57.53 ± 4.59 75.27 ± 1.26 72.16 ± 1.50 82.03 ± 0.77 72.98 ± 0.52

Experimental results of our method UNREAL and other baselines on four class-imbalanced node classification benchmark datasets with ρ = 100.

± 2.12 62.30 ± 2.27 49.33 ± 1.12 44.58 ± 1.64 70.68 ± 0.92 69.15 ± 0.84 74.66 ± 0.86 66.28 ± 1.92
Renode 62.42 ± 0.90 60.08 ± 1.19 39.61 ± 2.66 30.13 ± 3.86 67.11 ± 1.12 61.09 ± 3.50 73.73 ± 2.26 64.47 ± 2.39
GraphENS 63.09 ± 0.97 61.20 ± 1.74 42.03 ± 1.88 36.71 ± 2.99 69.71 ± 1.87 63.47 ± 3.87 81.33 ± 1.66 72.83 ± 1.76
BalancedSoftmax+TAM 66.58 ± 1.53 64.56 ± 2.49 53.33 ± 1.06 50.15 ± 1.45 72.59 ± 2.06 72.22 ± 2.08 78.01 ± 1.06 71.02 ± 1.08
Renode+TAM 62.06 ± 2.08 60.72 ± 3.32 42.08 ± 1.88 33.19 ± 3.45 69.95 ± 1.01 65.99 ± 2.28 74.81 ± 3.29 67.48 ± 3.32
GraphENS+TAM 65.95 ± 2.25 63.88 ± 1.78 51.03 ± 1.51 50.49 ± 1.88 73.58 ± 2.01 72.44 ± 1.77 81.72 ± 1.08 72.31 ± 1.98
UNREAL 73.47 ± 2.31 68.30 ± 2.11 59.77 ± 2.98 58.92 ± 3.07 77.11 ± 0.59 74.03 ± 0.81 82.92 ± 2.94 73.11 ± 2.57

Experimental results of our method UNREAL and other baselines on Flickr. We report averaged balanced accuracy (bAcc., %) and F1-score (%) with the standard errors over 5 repetitions on three representative GNN architectures.

Experimental results of DPAM effectiveness on Cora with ρ = 1, 5, 10, 20, 50, 100. We report the accuracy (%) of the pseudo labels (predictions of the classifier) of the aligned node set U_in and the discarded node set U_out, respectively, averaged with standard errors over 5 repetitions on three representative GNN architectures. All, Labeled, and Unlabeled denote the numbers of all nodes, labeled nodes, and unlabeled nodes in the graph. Align, Out, Align-True, and Out-True denote the sizes of U_in and U_out, and the numbers of nodes with accurate pseudo labels in U_in and U_out, respectively.

As shown in Table 11 and Table 12, we verify the effectiveness of each component of UNREAL by testing the pseudo-label accuracy of the nodes selected by different model combinations: DPAM + confidence ranking (with or without DGI), DPAM + geometric ranking (with or without DGI), and DPAM + Node-Reordering (with or without DGI). It can be observed that in different imbalanced scenarios, each component of UNREAL (Node-Reordering & DGI) plays an important role, and the full combination significantly outperforms the other model combinations.

Analyzed experimental results of Node-Reordering and DGI on Cora with ρ = 1, 5, 10, 20, 50, 100. We select 100 unlabeled nodes newly added to the minority class of the training set through different method combinations, and evaluate the validity of Node-Reordering & DGI by testing the accuracy (%) with the standard errors of the pseudo labels of these nodes. We report averaged results over 5 repetitions on three representative GNN architectures.

The choices of hidden unit size for each layer are 64, 128, and 256.

F.3 EVALUATION PROTOCOL

We adopt the Adam (Kingma & Ba, 2014) optimizer with initial learning rate 0.01 or 0.005. Following Song et al. (2022), we use a scheduler that halves the learning rate if there is no decrease in validation loss for 100 consecutive epochs. All learnable parameters in the model adopt weight decay with rate 0.0005. For the first training iteration, we train the model for 200 epochs using the original training set for Cora, CiteSeer, PubMed, and Amazon-Computers; for Flickr, we train for 2000 epochs in the first iteration. We train models for 2000 epochs in the remaining iterations with the above optimizer and scheduler. The best models are selected based on validation accuracy. Early stopping is used with patience set to 300.
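This protocol can be sketched in PyTorch as follows. The `fit` helper is hypothetical, and the per-epoch validation losses stand in for a real train/validate step:

```python
import torch

def fit(model, val_losses, lr=0.01, weight_decay=5e-4, patience=300):
    """Training-loop skeleton matching the evaluation protocol above."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    # Halve the learning rate after 100 epochs without validation improvement.
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5,
                                                       patience=100)
    best, wait, best_state = float("inf"), 0, None
    for loss in val_losses:
        sched.step(loss)
        if loss < best:
            best, wait = loss, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            wait += 1
            if wait >= patience:    # early stopping, patience 300
                break
    return best, best_state
```

In the real loop each iteration would also run the optimizer over the (current, possibly augmented) training set before computing the validation loss.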

Label distributions in the training sets

Label distributions on the whole graphs

