TUNEUP: A TRAINING STRATEGY FOR IMPROVING GENERALIZATION OF GRAPH NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Despite many advances in Graph Neural Networks (GNNs), their training strategies simply focus on minimizing a loss over nodes in a graph. However, such simplistic training strategies may be sub-optimal as they neglect that certain nodes are much harder to make accurate predictions on than others. Here we present TUNEUP, a curriculum learning strategy for better training GNNs. Crucially, TUNEUP trains a GNN in two stages. The first stage aims to produce a strong base GNN. Such base GNNs tend to perform well on head nodes (nodes with large degrees) but less so on tail nodes (nodes with small degrees). The second stage of TUNEUP therefore focuses specifically on improving prediction on tail nodes. Concretely, TUNEUP synthesizes additional supervised tail-node data by dropping edges from head nodes and reusing the supervision on the original head nodes. TUNEUP then minimizes the loss over the synthetic tail nodes to finetune the base GNN. TUNEUP is a general training strategy that can be used with any GNN architecture and any loss, making it applicable to a wide range of prediction tasks. Extensive evaluation of TUNEUP on five diverse GNN architectures, three types of prediction tasks, and both inductive and transductive settings shows that TUNEUP significantly improves the performance of the base GNN on tail nodes, while often even improving the performance on head nodes, which together lead to up to 58.5% relative improvement in GNN predictive performance. Moreover, TUNEUP significantly outperforms its variants without the two-stage curriculum learning, existing graph data augmentation techniques, and other specialized methods for tail nodes.

1. INTRODUCTION

Graph Neural Networks (GNNs) are one of the most successful and widely used paradigms for representation learning on graphs, achieving state-of-the-art performance in a variety of prediction tasks, such as semi-supervised node classification (Kipf & Welling, 2017; Velickovic et al., 2018), link prediction (Hamilton et al., 2017; Kipf & Welling, 2016), and recommender systems (Ying et al., 2018; He et al., 2020). There has been a surge of work on improving GNN model architectures (Velickovic et al., 2018; Xu et al., 2019; 2018; Shi et al., 2020; Klicpera et al., 2019; Wu et al., 2019; Zhao & Akoglu, 2019; Li et al., 2019; Chen et al., 2020; Li et al., 2021) and task-specific losses (Kipf & Welling, 2016; Rendle et al., 2012; Verma et al., 2021; Huang et al., 2021). Despite all these advances, strategies for training a GNN on a given supervised loss remain largely simplistic. Existing work has focused on simply minimizing the given loss over nodes in a graph. While such a simplistic default strategy already gives a strong performance, the strategy may still be sub-optimal as it neglects that some nodes are much harder to make accurate predictions on than others. Consequently, a GNN trained with the default strategy may significantly under-perform on those hard nodes, resulting in overall sub-optimal predictive performance.

Figure 1: Degree-specific generalization performance of the base GNN and TUNEUP in the transductive setting (panels: semi-supervised node classification, link prediction, recommender systems). The x-axis represents the node degrees in the training graph, and the y-axis is the generalization performance averaged over nodes with the specific degrees. We see from the dotted blue curves that the base GNN tends to perform poorly on tail nodes, i.e., nodes with small degrees. Our TUNEUP (denoted by the solid orange curves) improves or at least maintains the base GNN performance on almost all node degrees. The improvement is more significant on tail nodes.

Here we present TUNEUP to better train a GNN on a given supervised loss. The key motivation behind TUNEUP is that GNNs tend to under-perform on tail nodes, i.e., nodes with a small number of neighbors (Liu et al., 2021). In practice, performing well on tail nodes is important since they are prevalent in real-world scale-free graphs (Clauset et al., 2009) and newly-arriving cold-start nodes (Lika et al., 2014). To better train a GNN on those hard-to-predict tail nodes, the key idea of TUNEUP is to use a curriculum learning strategy (Bengio et al., 2009); TUNEUP first trains a GNN to perform well on relatively easy head nodes, i.e., nodes with a large number of neighbors. It then proceeds to improve the performance on the hard tail nodes.

Specifically, TUNEUP uses the two-stage strategy to train a GNN. In the first stage, TUNEUP employs the default training strategy, i.e., simply minimizing the given supervised loss, to produce a strong base GNN to start with. The base GNN tends to perform well on head nodes, but poorly on tail nodes (see the dotted blue curves in Figure 1). To mitigate this issue, the second stage of TUNEUP focuses on improving the performance on the tail nodes. Specifically, TUNEUP synthesizes many additional tail node inputs by dropping edges from head nodes. TUNEUP then adds target supervision (e.g., class labels for node classification, edges for link prediction) on the synthetic tail nodes by reusing the supervision on the original head nodes (before dropping edges). Finally, TUNEUP finetunes the base GNN by minimizing the loss over the increased supervised tail node data. The dedicated training on the synthetic tail nodes allows the resulting GNN to perform much better on the real tail nodes, while often even improving the performance on head nodes. TUNEUP is simple to implement on top of the default training pipeline of GNNs, as shown in Algorithm 1.
Moreover, TUNEUP can be used to train any GNN model with any supervised loss, making it generally applicable to a broad range of node and edge-level prediction tasks. We extensively evaluate TUNEUP on a wide range of settings. We consider five diverse GNN architectures, three types of key prediction tasks (semi-supervised node classification, link prediction, and recommender systems) with a total of eight datasets, as well as both transductive (i.e., prediction on nodes seen during training) and inductive (i.e., prediction on new nodes never seen during training) settings. For the inductive setting, we additionally consider the challenging cold-start scenario (i.e., limited edge connectivity from new nodes) by randomly removing certain portions of edges from new nodes. Across all settings, TUNEUP produces consistent improvement on the generalization performance of GNNs. In the transductive setting, TUNEUP significantly improves the performance of base GNNs on tail nodes, while oftentimes even improving the performance on head nodes (see Figure 1 ). Moreover, our ablation study shows that the two-stage curriculum training strategy of TUNEUP is critical and gives significantly improved performance over its variant strategy without curriculum learning. Finally, we extensively compare our TUNEUP against recent graph augmentation techniques (Rong et al., 2020; Liu et al., 2022) and specialized methods for tail nodes (Liu et al., 2021; Zheng et al., 2022; Zhang et al., 2022; Kang et al., 2022) . Our TUNEUP outperforms all these methods in all settings, while being simpler and more general. Overall, our work demonstrates that training strategies can play an important role in improving generalization performance of GNNs.

2. GENERAL SETUP AND TUNEUP

TUNEUP is a curriculum learning strategy to train any GNN model with any supervised loss to solve node or edge-level prediction tasks over graphs. We first provide a general task setup for machine learning on graphs and review the default training strategy of GNNs to solve the task. We then present TUNEUP, which adds a few simple components to the default training strategy. Finally, we discuss assumptions TUNEUP exploits to improve generalization performance of GNNs and why TUNEUP even improves the performance on head nodes.

2.1. GENERAL TASK SETUP

We are given a graph $G = (V, E)$, with a set of nodes $V$ and edges $E$, potentially with some features associated with them. A GNN $F_\theta$, parameterized by $\theta$, takes the graph $G$ as input and makes prediction $\hat{Y}$ for the task of interest. The loss function $L$ measures the discrepancy between the GNN's prediction $\hat{Y}$ and the target supervision $Y$. In the default training, the GNN parameter $\theta$ is learned to minimize the loss $L(\hat{Y}, Y)$ using gradient descent. The setup is general enough to cover most graph machine learning scenarios. Below, we describe three representative scenarios under the general task setup, which we also consider in our experiments.

Semi-supervised node classification. The task is to predict class labels of unlabeled nodes given a small set of labeled nodes in a graph, which can be formalized as follows.
• Graph $G$: A graph with input node features.
• Supervision $Y$: Class labels of labeled nodes $V_{\text{labeled}} \subset V$.
• GNN $F_\theta$: A model that takes $G$ as input and predicts class probabilities over $V$.
• Prediction $\hat{Y}$: The GNN's prediction over $V_{\text{labeled}}$.
• Loss $L$: Cross-entropy loss.
Since input node features are available, the GNN $F_\theta$ can make not only transductive predictions, i.e., prediction over $V_{\text{unlabeled}} \equiv V \setminus V_{\text{labeled}}$, but also inductive predictions (Hamilton et al., 2017), i.e., prediction over new nodes $V_{\text{new}}$ that are not in $V$ but connected to $V$ via new edges $E_{\text{new}}$.

Link prediction. The task is to predict new links in a graph given existing links. We consider the node-centric formulation (You et al., 2021): given a source node, predict target nodes that the source node is linked to.
• Graph $G$: A graph with input node features.
• Supervision $Y$: Whether node $s \in V$ is linked to node $t \in V$ in $G$ (positive) or not (negative).
• GNN $F_\theta$: A model that takes $G$ as input and predicts the score for a pair of nodes $(s, t) \in V \times V$. Specifically, the model generates an embedding $z_v$ for each node $v \in V$ and uses an MLP over the concatenation of $z_s$ and $z_t$ to predict the score for the pair $(s, t)$ (He et al., 2017).
• Prediction $\hat{Y}$: The GNN's predicted scores over $V \times V$.
• Loss $L$: The Bayesian Personalized Ranking (BPR) loss (Rendle et al., 2012), which encourages the predicted score for the positive pair $(s, t_{\text{pos}})$ to be higher than that for the negative pair $(s, t_{\text{neg}})$ for each source node $s \in V$.
As input node features are available, the GNN $F_\theta$ can naturally make inductive link predictions by generating node embeddings on a new graph with new nodes and edges.

Recommender systems. A recommender system can be modeled as a bipartite graph between user nodes $V_{\text{user}}$ and item nodes $V_{\text{item}}$, where edges represent user-item interactions. The task is essentially link prediction, i.e., given a user node $u \in V_{\text{user}}$, predict a set of item nodes that $u$ is likely to interact with. In recommender systems, the most successful paradigm is collaborative filtering (Schafer et al., 2007), where shallow embeddings (learnable embeddings for each node) instead of input node features are used to achieve state-of-the-art performance (Wang et al., 2019; He et al., 2020). As input node features are not available in many public recommender system datasets anyway, we focus on the feature-less setting.
• Graph $G$: User-item bipartite graph without input node features.
• Supervision $Y$: Whether a user node $u$ has interacted with an item node $v$ in $G$ (positive) or not (negative).
• GNN $F_\theta$: A model that takes $G$ as input and predicts the score for a pair of nodes $(u, v) \in V_{\text{user}} \times V_{\text{item}}$. Following Wang et al. (2019), the GNN parameter $\theta$ contains the input shallow embeddings in addition to the original message passing GNN parameter. To produce the score for the pair of nodes $(u, v)$, we generate the user and item embeddings, $z_u$ and $z_v$, and take the inner product $z_u^\top z_v$ to compute the score (Wang et al., 2019).
• Prediction $\hat{Y}$: The GNN's predicted scores over $V_{\text{user}} \times V_{\text{item}}$.
• Loss $L$: The BPR loss (Rendle et al., 2012).
As we learn the shallow embedding for each node, it is non-trivial to make inductive predictions on new nodes. Therefore, we only consider the transductive setting for recommender systems.

Algorithm 1 TUNEUP. Compared to the default training of a GNN (L2-5), TUNEUP introduces the two-stage training and only adds two components (L8 and L12) that are straightforward to implement.
Given: GNN $F_\theta$, graph $G$, loss $L$, supervision $Y$, DropEdge ratio $\alpha$.
1: # First stage: Default training to obtain a base GNN.
2: while $\theta$ not converged do
3:     Make prediction $\hat{Y} = F_\theta(G)$.
4:     Compute loss $L(\hat{Y}, Y)$, compute gradient $\nabla_\theta L$, and update parameter $\theta$.
5: end while
6: # Set up for the second stage.
7: if task is semi-supervised node classification then
8:     Use $F_\theta$ to predict pseudo labels on non-isolated, unlabeled nodes. Add the pseudo labels into $Y$.
9: end if
10: # Second stage: Fine-tuning the base GNN with increased tail supervision.
11: while $\theta$ not converged do
12:     Synthesize tail nodes, i.e., randomly drop $\alpha$ of edges: $G \xrightarrow{\text{DropEdge}} \tilde{G}$.
13:     Make prediction $\hat{Y} = F_\theta(\tilde{G})$.
14:     Compute loss $L(\hat{Y}, Y)$, compute gradient $\nabla_\theta L$, and update parameter $\theta$.
15: end while
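As a rough illustration of the BPR loss used for link prediction and recommender systems, the sketch below computes the standard pairwise form, the mean of $-\log \sigma(s_{\text{pos}} - s_{\text{neg}})$ over (positive, negative) score pairs. The function name `bpr_loss` and the NumPy formulation are our own illustrative choices, not code from the paper; any regularization term is omitted.

```python
import numpy as np


def bpr_loss(pos_scores, neg_scores):
    """Mean pairwise BPR loss (illustrative helper, not the paper's code).

    pos_scores[i] is the predicted score of a positive pair (s, t_pos) and
    neg_scores[i] the score of a sampled negative pair (s, t_neg) for the
    same source node s.
    """
    diff = np.asarray(pos_scores, dtype=float) - np.asarray(neg_scores, dtype=float)
    # -log(sigmoid(x)) = log(1 + exp(-x)); logaddexp keeps this numerically stable.
    return float(np.mean(np.logaddexp(0.0, -diff)))
```

The loss shrinks as positive pairs are scored above their negatives, which is the ranking behavior described above.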

2.2. DEFAULT GNN TRAINING

Given the graph $G$, supervision $Y$, GNN $F_\theta$, its prediction $\hat{Y} = F_\theta(G)$, and the loss function $L(\hat{Y}, Y)$, it is straightforward to train the GNN $F_\theta$ using gradient descent to minimize the loss. The default training procedure of a GNN is described in L2-5 of Algorithm 1.

Remark on mini-batch training. In practice, the prediction $\hat{Y}$ and the loss computation $L(\hat{Y}, Y)$ can be made in a mini-batch manner for scalable training. For instance, in the case of semi-supervised node classification, we can predict and compute the loss on a mini-batch of labeled nodes (Hamilton et al., 2017; Zeng et al., 2020). In the case of link prediction and recommender systems, to compute the BPR loss, the score prediction only needs to be made over positive links and randomly sampled negative links. Moreover, the BPR loss can be computed in a mini-batch manner by subsampling the source nodes and keeping only one positive link per source node. These mini-batch training tricks are omitted from Algorithm 1 for simplicity, but should be implemented in practice. Our TUNEUP, which we explain next, is fully compatible with mini-batch training.
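The mini-batch trick for the BPR loss (subsample source nodes, keep one positive link per source, draw random negatives) can be sketched as follows. This is a minimal sketch under our own assumptions: the helper name `sample_bpr_batch`, the edge-array layout, and uniform negative sampling over all target ids are illustrative and are not specified by the paper.

```python
import numpy as np


def sample_bpr_batch(pos_edges, num_targets, batch_size, rng):
    """Sample (source, positive target, negative target) triples.

    pos_edges: integer array of shape [num_edges, 2] holding (source, target).
    Hypothetical helper; negatives are drawn uniformly and may occasionally
    coincide with a true positive, a common simplification.
    """
    pos_edges = np.asarray(pos_edges)
    sources = np.unique(pos_edges[:, 0])
    batch_sources = rng.choice(
        sources, size=min(batch_size, len(sources)), replace=False
    )
    # Shuffle edges, then take the first edge seen for each source:
    # this keeps exactly one random positive link per source node.
    shuffled = pos_edges[rng.permutation(len(pos_edges))]
    uniq_src, first_idx = np.unique(shuffled[:, 0], return_index=True)
    pos_of = dict(zip(uniq_src.tolist(), shuffled[first_idx, 1].tolist()))
    pos_targets = np.array([pos_of[s] for s in batch_sources.tolist()])
    neg_targets = rng.integers(0, num_targets, size=len(batch_sources))
    return batch_sources, pos_targets, neg_targets
```

Scores for the returned positive and negative pairs would then be fed to the BPR loss; only these pairs need to be scored in each step.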

2.3. TUNEUP

We are ready to present TUNEUP, which uses a two-stage curriculum learning strategy (Bengio et al., 2009) to better train a GNN. At a high level, TUNEUP first trains a GNN to perform well on relatively easy head nodes and then proceeds to finetune the GNN to also perform well on hard tail nodes. Specifically, in the first stage (L2-5 in Algorithm 1), TUNEUP uses the default training of GNNs to obtain a strong base GNN model. The base GNN model tends to perform well on head nodes, but poorly on tail nodes. To remedy this issue, in the second training stage, TUNEUP finetunes the base GNN with increased supervision on tail nodes (L7-L15 in Algorithm 1). TUNEUP increases the supervised tail node data in two steps: (1) synthesizing additional tail node inputs and (2) adding target supervision on the synthetic tail nodes, which we detail below.

(1) Synthesizing tail node inputs. TUNEUP synthesizes many additional tail nodes by removing edges from the head nodes. Specifically, in this work, we directly adopt DropEdge (Rong et al., 2020) for simplicity, where a certain portion (given by hyperparameter $\alpha$) of edges are randomly removed from the original graph $G$ to obtain $\tilde{G}$ (L12 in Algorithm 1). The resulting $\tilde{G}$ contains more nodes with low degrees, i.e., tail nodes, than the original graph $G$ does. Hence, the GNN sees more (synthetic) tail nodes as input during training. More advanced strategies to synthesize tail nodes (e.g., dropping a larger fraction of edges from head nodes) are left for future work.

(2) Adding supervision on the synthetic tail nodes. After synthesizing the tail node inputs, TUNEUP adds target supervision (e.g., class labels for node classification, edges for link prediction) on them so that the supervised loss can be computed over the synthetic tail nodes. For link prediction tasks, TUNEUP directly reuses the original edges $E$ in $G$ (before dropping) as the target supervision on the synthetic tail nodes. To illustrate the effectiveness of the approach, suppose we have a node $v$ with six neighbors in the original training graph $G$. After dropping $\alpha = 0.5$ of edges in L12 of Algorithm 1, this node becomes a synthetic tail node $\tilde{v}$ with three neighbors in $\tilde{G}$. Nevertheless, in the loss computation of L14, TUNEUP still reuses all original six edges from $v$ in $G$ as target supervision on this synthetic node $\tilde{v}$. Therefore, this synthetic tail node $\tilde{v}$ has twice as much edge supervision as any degree-three real tail node (in the original graph $G$) has.

Similarly, for semi-supervised node classification, TUNEUP can also reuse the target labels of labeled nodes in $G$ for synthetic tail nodes in $\tilde{G}$. Specifically, for a labeled node $v \in V_{\text{labeled}}$ with ground-truth class label $y_v$, TUNEUP can reuse $y_v$ for the corresponding synthetic tail node $\tilde{v}$ in $\tilde{G}$. However, in the semi-supervised setting, the number of labeled nodes $V_{\text{labeled}}$ is often very small, e.g., 1%-5% of all nodes $V$, limiting the amount of target label supervision TUNEUP can reuse. To resolve this issue, TUNEUP utilizes pseudo labels (Lee et al., 2013) in addition to the limited ground-truth labels on $V_{\text{labeled}}$. Specifically, TUNEUP applies the base GNN (obtained in the first training stage) over $G$ to predict pseudo labels on non-isolated (i.e., positive-degree) nodes in $V_{\text{unlabeled}}$. In practice, the pseudo labels do not need to be directly predicted by the base GNN, e.g., we can apply post-processing, such as label smoothing and C&S, to refine the pseudo labels. We leave this investigation to future work. TUNEUP then includes the pseudo labels as supervision $Y$ in the second stage (L8 in Algorithm 1). This significantly increases the size of the supervision $Y$, e.g., by a factor of ≈100 if only 1% of nodes are labeled. While the predicted pseudo labels are noisy in general, they are "best guesses" in the sense that the base GNN uses the full graph information $G$ to predict the labels. In the second stage, TUNEUP essentially forces the base GNN to maintain its best guesses given the sparser graph $\tilde{G}$ with limited neighborhood as input. This in turn allows the resulting GNN to make more accurate predictions on real tail nodes with limited neighborhood in the original graph $G$.
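The second stage can be sketched in a few lines. The sketch below is illustrative only: `drop_edges`, `degree`, and `tuneup_second_stage` are hypothetical helper names, the graph is reduced to a plain edge array, a fixed number of steps stands in for the convergence check, and `train_step` stands in for whatever prediction, loss, and parameter update a real GNN pipeline would perform (L13-14 of Algorithm 1).

```python
import numpy as np


def drop_edges(edges, alpha, rng):
    """Randomly remove a fraction alpha of edges (DropEdge-style sketch)."""
    edges = np.asarray(edges)
    num_keep = len(edges) - int(round(alpha * len(edges)))
    keep = rng.choice(len(edges), size=num_keep, replace=False)
    return edges[np.sort(keep)]


def degree(edges, node):
    """Degree of `node` when each undirected edge is stored once."""
    return int(np.sum(np.asarray(edges) == node))


def tuneup_second_stage(edges, supervision, train_step, alpha, num_steps, rng):
    """Finetune on freshly sparsified graphs while reusing the ORIGINAL supervision.

    `train_step(sparse_edges, supervision)` is an assumed callback, not an API
    from the paper.
    """
    for _ in range(num_steps):
        sparse_edges = drop_edges(edges, alpha, rng)  # L12: synthesize tail nodes
        train_step(sparse_edges, supervision)         # L13-14: predict, loss, update


# The six-neighbor example from the text: node 0 is linked to nodes 1..6.
rng = np.random.default_rng(0)
star = np.array([(0, i) for i in range(1, 7)])
sparse = drop_edges(star, alpha=0.5, rng=rng)
print(degree(sparse, 0), "input neighbors;", len(star), "supervision edges")
```

In this toy star graph, the synthetic tail node keeps three input neighbors while the loss still sees all six original edges, matching the example in the text. For semi-supervised node classification, `supervision` would additionally contain the pseudo labels predicted by the base GNN.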

2.4. ASSUMPTIONS TUNEUP EXPLOITS TO IMPROVE GENERALIZATION PERFORMANCE

As the no-free-lunch theorem suggests (Wolpert, 1996) , improving generalization performance involves exploiting additional assumptions on real-world prediction tasks, which may not be satisfied by all possible tasks. Here we discuss three key assumptions TUNEUP exploits to improve generalization performance, which are (approximately) satisfied by many real-world tasks, including our experimented benchmark datasets across three different task types in Section 4. Tail nodes can be synthesized by dropping edges from head nodes. This assumption holds for many real-world graph datasets, as head nodes often start off as tail nodes, e.g., well-cited paper nodes are not cited at the beginning in a paper citation network, and warm users (users with many item interactions) start off as cold-start users in recommender systems. Target supervision on head nodes can be reused for synthetic tail nodes. This assumption holds for tasks where prediction to be made on a given node is more or less a static property of the node. For instance, papers' subject areas in a paper citation network, products' categories in a product co-purchasing network, and users' taste in recommender systems stay (mostly) the same regardless of the number of edges we observe on the nodes. More edges benefit GNNs to make accurate predictions. TUNEUP assumes that more edges are useful for GNNs to make accurate predictions, and tail nodes are harder to predict due to the lack of edges. This assumption is likely to hold for many tasks as GNNs can aggregate more neighboring information with more edges, and is experimentally verified in Figure 1 .

2.5. WHY TUNEUP IMPROVES PERFORMANCE ON HEAD NODES

It is counter-intuitive that TUNEUP improves performance not only on tail nodes but also on head nodes, as seen in Figure 1. One reason may be that the best-performing GNNs for node/edge-level tasks, including those in our experiments, use (roughly) average-based schemes to aggregate neighboring node features (Hamilton et al., 2017; Kipf & Welling, 2017; Wu et al., 2019; Velickovic et al., 2018). With average-based GNNs, node embeddings obtained on the sparsified graph $\tilde{G}$ can be thought of as a noisy version of the node embeddings obtained on the full graph $G$. If the base GNN is finetuned to perform well on many realizations of the noisy embeddings (with different realizations of $\tilde{G}$ in L12 of Algorithm 1), then the resulting GNN would most likely still perform well on the noise-free embeddings (computed over the full graph $G$). Moreover, training with the noisy embeddings can even improve the generalization performance on head nodes. We leave in-depth empirical/theoretical investigation for future work.
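The "noisy version" intuition can be checked numerically for the simplest case: a single mean aggregation over one node's neighbors. The sketch below is our own toy illustration, not an experiment from the paper; it subsamples the neighbors of one node directly (rather than dropping edges of a whole graph), and the helper names are hypothetical.

```python
import numpy as np


def mean_aggregate(neighbor_feats):
    """Mean of neighbor features, as in a single mean-pooling layer."""
    return np.asarray(neighbor_feats, dtype=float).mean(axis=0)


def sparsified_mean(neighbor_feats, alpha, rng):
    """Mean over a random subset of neighbors after dropping a fraction alpha."""
    feats = np.asarray(neighbor_feats, dtype=float)
    num_keep = max(1, len(feats) - int(round(alpha * len(feats))))
    keep = rng.choice(len(feats), size=num_keep, replace=False)
    return feats[keep].mean(axis=0)


rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 4))   # a head node with 50 neighbors
full = mean_aggregate(feats)
draws = np.stack([sparsified_mean(feats, 0.5, rng) for _ in range(2000)])
# Each draw deviates from `full`, but the draws scatter around it.
print("max |average of draws - full mean|:", np.abs(draws.mean(axis=0) - full).max())
```

Each sparsified mean differs from the full-graph mean, but across many realizations the draws are centered on it, which is consistent with viewing them as noisy copies of the noise-free aggregate. Multi-layer GNNs with nonlinearities need not behave this cleanly.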

3. RELATED WORK

Data augmentation for GNNs. The second stage of TUNEUP can be regarded as data augmentation over graphs, on which there has been a rich body of work (Ding et al., 2022). Some are specifically designed for semi-supervised node classification (Zhao et al., 2021; Feng et al., 2020; Verma et al., 2021), while others are designed for recommender systems (Verma et al., 2021) and graphs with input node features (Liu et al., 2022). Different from these methods, TUNEUP is generally applicable to any prediction task over nodes and edges. The general nature of TUNEUP also allows it to be combined with any of the task-specific data augmentation techniques. As a general graph augmentation technique, Kong et al. (2020) proposed FLAG, which adversarially perturbs input node features (Shafahi et al., 2019). This is complementary to TUNEUP, which perturbs the edge connectivity of the graph. DropEdge (Rong et al., 2020) randomly drops edges from graphs. It was originally developed to overcome the over-smoothing issue of GNNs (Li et al., 2018) in semi-supervised node classification. In contrast, in this work, we adopt DropEdge as a way to synthesize additional tail node inputs for a wide range of prediction tasks over graphs. Methodologically, TUNEUP is distinct from DropEdge in that it employs the two-stage curriculum learning strategy and uses pseudo labels to add supervision on the synthetic tail nodes, both of which are important to yield substantially better performance than the original DropEdge.

Curriculum learning for GNNs. A few works have explored curriculum learning for GNNs. Wang et al. (2021) developed a curriculum learning approach for graph classification, while our work focuses on node/edge-level prediction tasks. Ying et al. (2018) presented a curriculum learning approach for negative sampling in link prediction, and Li et al. (2022) developed a curriculum learning approach for tackling imbalanced class labels in node classification. TUNEUP is complementary to both of these approaches while being more broadly applicable to any node/edge-level prediction task.

Pre-training GNNs. Pre-training GNNs has attracted huge attention (Veličković et al., 2019; Hu et al., 2020b; Qiu et al., 2020; Hu et al., 2020c; You et al., 2020b). This line of work develops task-agnostic strategies to pre-train a GNN such that the resulting GNN can be finetuned with task-specific supervised losses to improve performance on diverse downstream tasks. Our work focuses on the downstream stage and presents a strategy for training a GNN on a task-specific supervised loss.

Specialized methods for tail nodes. Recently, many methods have been developed for improving generalization performance of GNNs on tail nodes (Liu et al., 2021; Zheng et al., 2022; Kang et al., 2022; Zhang et al., 2022). These methods require augmenting a GNN with tail-node-specific architectural components, while our work does not require any architectural modification and focuses purely on a strategy for training a GNN that performs well on both tail and head nodes.

4. EXPERIMENTS

Here we extensively evaluate TUNEUP under a wide range of settings. We consider five diverse GNN models and test them on the three prediction tasks described in Section 2.1 for three different predictive settings: transductive, inductive, and cold-start inductive predictions.

4.1. EXPERIMENTAL SETTINGS

Here we describe our experimental settings and datasets for evaluating TUNEUP. We noticed that the standardized experimental protocols by Hu et al. (2020a) and Wang et al. (2019) are not suitable for evaluating TUNEUP because (1) inductive prediction (cold-start) settings are not provided and (2) datasets are heavily pre-processed to eliminate tail nodes, which are the focus of this work (e.g., recommender system benchmarks are processed with the 10-core algorithm to eliminate the cold-start users and items (Wang et al., 2019)). We therefore take their original realistic graph datasets and split them ourselves to create the realistic inductive (cold-start) prediction setting as well as the realistic transductive setting with tail nodes. The dataset statistics are summarized in Table 5 in Appendix. Below, we describe the split and the datasets for each task type.

Semi-supervised node classification. From all nodes in the original dataset, we randomly selected 95% of the nodes and used their induced subgraph as the graph $G = (V, E)$ to train GNNs. The remaining 5% of the nodes, $V_{\text{new}}$, are used for inductive prediction. Within $V$, 10% and 2% of the nodes are used as labeled nodes $V_{\text{labeled}}$ for arxiv and products, respectively. One half of $V_{\text{labeled}}$ is used for computing the loss for supervised training, and the other half is used as the transductive validation set for tuning hyper-parameters. For the evaluation metric, we used the standard classification accuracy. For the transductive performance, we report the accuracy on the unlabeled nodes $V_{\text{unlabeled}} \equiv V \setminus V_{\text{labeled}}$, while for the inductive performance, we report the accuracy on $V_{\text{new}}$. For the inductive prediction, we also consider the cold-start scenario, where certain portions (30%, 60%, and 90%) of edges are randomly removed from the new nodes. We used the following two datasets in our experiments.
• arxiv (Hu et al., 2020a): Given a paper citation network, the task is to predict the subject areas of the papers. Each paper has abstract words as its feature.
• products (Hu et al., 2020a): Given a product co-purchasing network, the task is to predict the categories of the products. Each product has the product description as its feature.

Link prediction. We follow the standard link prediction evaluation (Zhang & Chen, 2018; You et al., 2021) and randomly split the edges in the original graph into training and validation edges with the ratio of 50%/50%. We follow the same protocol as semi-supervised node classification to obtain nodes for transductive and inductive settings. For the evaluation metric, we used the recall@50 averaged over nodes (Wang et al., 2019), where the positive target nodes are scored among all negative nodes. For the transductive performance, we report the recall@50 computed over validation edges within $V$, while for the inductive setting, we report the recall@50 over validation edges from $V_{\text{new}}$. For the inductive setting, we also consider the cold-start scenario. We used the following three datasets in our experiments.
• flickr (Zeng et al., 2020): Given an incomplete image-image common-property (e.g., same geographic location, same gallery, comments by the same user, etc.) network, the task is to predict the new common-property links between images. Each image has its description as its feature.
• ppi (Chandak et al., 2022): Given an incomplete protein-protein interaction network, the task is to predict new interactions. Each protein feature is generated with the ESM protein language model (Rives et al., 2021) applied to the protein sequence.
• arxiv (Hu et al., 2020a): Given an incomplete paper citation network, the task is to predict the additional citation links. Each paper has words in its abstract as its feature.

Recommender systems. For recommender systems, we notice that widely-used benchmark datasets are heavily processed to eliminate all tail nodes, e.g., via the 10-core algorithm (Wang et al., 2019). For example, with the conventional 80%/20% train/validation split, the median number of training interactions per user is 17, 26, and 27 for gowalla, yelp2018, and amazon-book, respectively, which clearly does not reflect the realistic use case that involves many cold-start users and items (Lika et al., 2014). To reflect the realistic use case, we use a small training edge ratio on top of the existing benchmark datasets. Specifically, we randomly split the edges in the original graph into training and validation edges with a 10%/90% ratio. For the evaluation metric, we used the recall@50 averaged over users (Wang et al., 2019; He et al., 2020). For the transductive performance, we report the recall@50 computed over the validation edges. We do not consider the inductive setting for recommender systems. We used the following three datasets in our experiments.
• gowalla (Liang et al., 2016; Wang et al., 2019): Given a user-location check-in bipartite graph, the task is to predict new check-ins of users.
• yelp2018 (Wang et al., 2019): Given a user-restaurant review graph, the task is to predict new reviews by users.
• amazon-book (He & McAuley, 2016; Wang et al., 2019): Given user-product reviews, the task is to predict new reviews by users.
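For concreteness, the recall@K metric averaged over source nodes (or users) can be sketched as below. This is a minimal sketch under our own assumptions: `recall_at_k` is a hypothetical helper, scores are given as a dense matrix, and any masking of training edges from the candidate set is assumed to have been applied to `scores` beforehand.

```python
import numpy as np


def recall_at_k(scores, positives, k=50):
    """Recall@k averaged over sources that have at least one positive.

    scores:    array [num_sources, num_targets] of predicted scores.
    positives: list of sets; positives[i] holds the held-out target ids of source i.
    """
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.shape[1])
    # Indices of the k highest-scoring targets per source (order within top-k is irrelevant).
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    recalls = []
    for row, pos in zip(topk, positives):
        if not pos:
            continue  # sources without held-out positives are skipped
        recalls.append(len(pos.intersection(row.tolist())) / len(pos))
    return float(np.mean(recalls))


scores = [[0.9, 0.1, 0.8, 0.2],
          [0.1, 0.9, 0.2, 0.3]]
positives = [{0, 1}, {1}]
print(recall_at_k(scores, positives, k=2))  # → 0.75
```

In the toy call, the first source recovers one of its two positives and the second recovers its only positive, giving (0.5 + 1.0) / 2 = 0.75.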

4.2. BASELINES AND ABLATIONS

We compare our TUNEUP against the following strong baselines.
• Base: Trains a GNN with the default strategy, i.e., L2-5 of Algorithm 1. The accuracy of pseudo labels coincides with the accuracy of the base GNN.
• DropEdge (Rong et al., 2020): Randomly drops edges during training, i.e., L11-15 of Algorithm 1.
• Local augmentation (LocalAug) (Liu et al., 2022): Uses a conditional generative model to generate neighboring node features and uses them as additional input to a GNN.
• ColdBrew (Zheng et al., 2022): Distills head node embeddings computed by the base GNN into an MLP. Uses the resulting MLP to obtain higher-quality tail node embeddings.
• GraphLessNN (Zhang et al., 2022): Distills the pseudo labels predicted by the base GNN into an MLP. Uses the resulting MLP to make predictions.
• Tail-GNN (Liu et al., 2021): Adds a tail-node-specific component inside the original GNN.
• RAWLS-GCN (Kang et al., 2022): Modifies the GCN's adjacency matrix to be doubly-stochastic (i.e., all rows and columns sum to 1).
Note that GraphLessNN is only applicable to node classification. LocalAug and ColdBrew require input node features to be available; hence, they are not applicable to recommender systems. RAWLS-GCN is only applicable to the GCN architecture.

In addition to the existing baselines, we consider the following three direct ablations of TUNEUP.
• TUNEUP w/o curriculum: Interleaves the first stage prediction (L3 in Algorithm 1) and the second stage prediction (L12-13 in Algorithm 1) in every parameter update. It is close to TUNEUP except that it does not follow the two-stage curriculum learning strategy.
• TUNEUP w/o syn-tails: No L12 in Algorithm 1.
• TUNEUP w/o pseudo-labels: No L8 in Algorithm 1.

4.3. GNN MODEL ARCHITECTURES

We mainly experimented with two classical yet strong GNN models: the mean-pooling variant of GraphSAGE (or SAGE for short) (Hamilton et al., 2017) and GCN (Kipf & Welling, 2017). In Table 4, we additionally experimented with the max- and sum-pooling variants of GraphSAGE as well as the Graph Attention Network (GAT) (Velickovic et al., 2018) to demonstrate the applicability of TUNEUP to diverse GNN architectures. In total, we have five diverse GNN architectures that cover representative aggregation schemes (i.e., mean, renormalized-mean (Kipf & Welling, 2017), max, sum, and attention) that many recent advanced GNN architectures are based on (Corso et al., 2020; Shi et al., 2021; You et al., 2020b; Wu et al., 2019; Rossi et al., 2020; Li et al., 2018; You et al., 2020a).

4.4. HYPER-PARAMETERS

We used 3-layer GNNs and the Adam optimizer (Kingma & Ba, 2015) for all GNN models and datasets, which we found to perform well in our preliminary experiments. For all methods, we performed early stopping and tuned the hyper-parameters based on the transductive validation performance. We used the resulting models for both transductive and inductive prediction. We tuned the drop edge ratio α over {0.25, 0.5, 0.75} for all datasets. We repeated all experiments with 5 different training seeds and report the mean and the standard deviation. More details are described in Appendix A.
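The selection protocol above can be sketched as follows. This is an illustration only: the helper names, the validation numbers, and the use of the sample standard deviation are assumptions, since the text does not state which estimator was used or how early stopping was implemented.

```python
import statistics

ALPHA_GRID = (0.25, 0.5, 0.75)  # drop edge ratios searched in the paper


def best_epoch(val_curve):
    """Early stopping: index of the epoch with the highest validation score."""
    return max(range(len(val_curve)), key=lambda i: val_curve[i])


def select_alpha(val_by_alpha):
    """Pick the drop edge ratio with the highest validation score."""
    return max(val_by_alpha, key=val_by_alpha.get)


def mean_and_std(scores):
    """Mean and sample standard deviation over training seeds (assumed estimator)."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical validation numbers, for illustration only.
val_curve = [0.60, 0.68, 0.71, 0.70]
val_by_alpha = {0.25: 0.70, 0.5: 0.74, 0.75: 0.72}
test_scores_over_seeds = [0.70, 0.72, 0.71, 0.69, 0.73]

print("stop at epoch", best_epoch(val_curve))
print("selected alpha", select_alpha(val_by_alpha))
print("mean, std", mean_and_std(test_scores_over_seeds))
```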

4.5. RESULTS

We first compare TUNEUP against the base GNNs that are trained with the default strategy. The last rows of Tables 1, 2, and 3 highlight the relative improvement of TUNEUP over the base GNNs. TUNEUP improves over the base GNNs across the transductive settings, giving up to 1.9%, 58.5%, and 21.8% relative improvement in semi-supervised node classification, link prediction, and recommender systems, respectively. TUNEUP gives even larger improvement in the challenging cold-start inductive prediction setting, yielding up to 26.5% and 80.1% relative improvement on node classification and link prediction, respectively. In the Appendix, we provide Tables 7, 8, 10, and 11 to show the full results on cold-start inductive prediction with different ratios of edges removed from new nodes. The larger the ratio is, the more cold-start the setting becomes. We observe that TUNEUP provides a larger relative gain at larger edge removal ratios, demonstrating its effectiveness in the highly cold-start prediction setting.

We also analyze the degree-specific generalization improvement and highlight the results in Figure 1. The full results (two GNN architectures times the eight datasets) are available in Figures 2, 3, and 4 in the Appendix. Across all datasets, architectures, and node degrees, TUNEUP produces consistent improvement over the base GNNs. Not surprisingly, the improvement is most significant on tail nodes.

Finally, we compare TUNEUP against the strong baselines and ablation methods described in Section 4.2. Tables 1, 2, and 3 show the results. We summarize our findings below.

• TUNEUP outperforms the graph augmentation methods (DropEdge and LocalAug) as well as the specialized methods for tail nodes (ColdBrew, GraphLessNN, and Tail-GNN), establishing its superior performance against the existing strong baseline methods.
• TUNEUP outperforms TUNEUP w/o curriculum, which highlights the importance of the two-stage curriculum learning strategy in TUNEUP.
• TUNEUP also outperforms TUNEUP w/o syn-tails and TUNEUP w/o pseudo-labels, which suggests that both of the ablated components are necessary for TUNEUP to achieve its high performance.
• On semi-supervised node classification (Table 1), TUNEUP w/o syn-tails, i.e., conventional semi-supervised training with pseudo labels (Lee et al., 2013), gives limited improvement over the base GNN. In contrast, TUNEUP trains the GNN to predict pseudo labels from limited neighborhoods, which gives significant improvement over the base GNN. Moreover, TUNEUP significantly outperforms DropEdge, suggesting the importance of using DropEdge together with pseudo labels.
• On link prediction (Table 2), DropEdge (TUNEUP without the first stage) already gives significant performance improvement over the base GNN, implying the unrealized potential of DropEdge on this task, beyond node classification (Rong et al., 2020). Nonetheless, TUNEUP still gives consistent improvement over DropEdge, suggesting the benefit of the two-stage training.
• On recommender systems (Table 3), TUNEUP is the only method that produced significantly better performance than the base GNN. DropEdge and TUNEUP w/o curriculum even gave worse performance than the base GNN. This is possibly because jointly learning the GNN and the shallow embeddings is hard without the two-stage training.
• Overall, TUNEUP, despite its simplicity, is the only method that yielded consistent improvement across the three prediction tasks.
• From Table 4, we see that TUNEUP improves the performance of the five diverse GNN architectures. Although the performance improvement with the sum aggregation is limited for semi-supervised node classification, the sum aggregation gave poor base GNN performance in the first place, due to its poor inductive bias and unstable training (Wu et al., 2019; Hamilton et al., 2017).
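The relative improvement and the degree-specific analysis reported above can be computed along the following lines. This is a hedged sketch: the bucket boundaries, the function names, and all numbers are hypothetical illustrations, not the values or the binning behind Figure 1.

```python
from collections import defaultdict


def relative_improvement(base, new):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


def accuracy_by_degree(degrees, correct, upper_bounds):
    """Accuracy per degree bucket.

    degrees:      node degrees, one per evaluated node.
    correct:      1 if the prediction for that node is correct, else 0.
    upper_bounds: sorted inclusive upper bounds of the buckets; nodes above
                  the last bound fall into one extra "head" bucket.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for degree, ok in zip(degrees, correct):
        bucket = next(
            (i for i, bound in enumerate(upper_bounds) if degree <= bound),
            len(upper_bounds),
        )
        totals[bucket] += 1
        hits[bucket] += int(ok)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Hypothetical predictions for seven nodes; bucket 0 holds the tail nodes.
degrees = [1, 2, 2, 5, 8, 30, 40]
base_correct = [0, 0, 1, 1, 1, 1, 1]
tuned_correct = [1, 0, 1, 1, 1, 1, 1]

base_acc = accuracy_by_degree(degrees, base_correct, upper_bounds=[2, 10])
tuned_acc = accuracy_by_degree(degrees, tuned_correct, upper_bounds=[2, 10])
for bucket in base_acc:
    gain = relative_improvement(base_acc[bucket], tuned_acc[bucket])
    print(f"bucket {bucket}: {base_acc[bucket]:.3f} -> {tuned_acc[bucket]:.3f} ({gain:+.1f}%)")
```

In this toy example the whole gain is concentrated in the lowest-degree bucket, which mirrors the qualitative pattern described for tail nodes.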

5. CONCLUSIONS

In this paper, we presented TUNEUP, a curriculum learning strategy to train a GNN to improve its generalization performance. TUNEUP first trains a GNN to produce a strong base GNN that performs well on easy head nodes. It then proceeds to improve the prediction over hard tail nodes by finetuning the base GNN with additional synthetic tail nodes. TUNEUP is a general strategy that can be used to train any GNN model with any supervised loss. Through extensive experiments, we demonstrated the effectiveness of TUNEUP on a wide range of settings, including five GNN architectures, three types of prediction tasks, and both transductive and inductive settings. Overall, our work suggests that training strategies matter in improving generalization of GNNs and can be complementary to advances in model architectures and task-specific losses.

A DETAILS OF HYPER-PARAMETERS

Here we present the details of the hyper-parameters we used in our experiments.

Semi-supervised node classification. We used a hidden dimensionality of 256 and 64 for arxiv and products, respectively. We trained GNNs in a full-batch manner, and for products, we used the reduced dimensionality of 64 so that the entire graph fits into the limited GPU memory of 45GB. Mini-batch training is left for future work. We used 1500 epochs for both the default training and finetuning. The learning rate is set to 0.001.

Link prediction. We used a hidden dimensionality of 256 for all datasets. We added L2 regularization on the node embeddings and tuned its weight for each dataset and GNN architecture. For both the default training and finetuning, we used 1000 epochs and a learning rate of 0.0001.

Recommender systems. We used a shallow embedding dimensionality of 64 and a hidden embedding dimensionality of 256. As in link prediction, we added L2 regularization to the node embeddings and tuned its weight for each dataset and GNN architecture. For the default training, we trained the model for 2000 epochs with an initial learning rate of 0.001, which is multiplied by 0.1 at the 1000th and 1500th epochs. For finetuning, we used 500 epochs with a learning rate of 0.0001. For training strategies without curriculum learning, we used the same configuration as the default training.
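The step schedule used for the default training on recommender systems can be written as a small function. This sketch assumes 0-indexed epochs and that the decay takes effect from the milestone epoch onward; the function name and the indexing convention are assumptions, as the text does not specify them.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(1000, 1500), gamma=0.1):
    """Learning rate at a given 0-indexed epoch under a multi-step decay.

    The rate starts at base_lr and is multiplied by gamma once for every
    milestone that has been reached.
    """
    num_decays = sum(epoch >= milestone for milestone in milestones)
    return base_lr * gamma ** num_decays


# Learning rate at a few epochs of the 2000-epoch default training.
for epoch in (0, 999, 1000, 1499, 1500, 1999):
    print(epoch, learning_rate(epoch))
```

In PyTorch, a schedule of this shape is what `torch.optim.lr_scheduler.MultiStepLR` provides; whether the experiments used that class is not stated.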



Figure 2: Degree-specific generalization performance of the base GNN and TUNEUP in transductive semi-supervised node classification. The evaluation metric is classification accuracy.

Figure 3: Degree-specific generalization performance of the base GNN and TUNEUP in transductive link prediction. The evaluation metric is recall@50.

Figure 4: Degree-specific generalization performance of the base GNN and TUNEUP in transductive recommender systems. The evaluation metric is recall@50.
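For reference, the recall@50 metric used for link prediction and recommender systems can be sketched per node (or per user) as below. The normalization by the number of relevant items is one common convention and is an assumption here; the function name and the toy ranking are hypothetical.

```python
def recall_at_k(ranked_items, relevant_items, k=50):
    """Fraction of the relevant items that appear among the top-k ranked items.

    ranked_items:   candidate items sorted by predicted score, best first.
    relevant_items: held-out positive items for this node or user.
    """
    relevant = set(relevant_items)
    if not relevant:
        raise ValueError("recall is undefined without relevant items")
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)


# Toy ranking: one of the three held-out items is ranked in the top 3.
print(recall_at_k([5, 2, 9, 7], relevant_items=[2, 7, 8], k=3))
```

The dataset-level score would then be the average of this quantity over the evaluated nodes or users.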

Semi-supervised node classification performance with GraphSAGE as the backbone architecture. The metric is classification accuracy. For the "Inductive (cold)", 90% of edges are randomly removed from new nodes. For the results with other edge removal ratios, refer to Table 7 in the Appendix. Refer to Table 6 in the Appendix for the performance with GCN, where a similar trend is observed.

Link prediction performance with GraphSAGE as the backbone architecture. The metric is recall@50. For the "Inductive (cold)", 60% of edges are randomly removed from new nodes. For other edge removal ratios, refer to Table 10 in the Appendix, where TUNEUP consistently outperforms the baselines. Refer to Table 9 in the Appendix for the performance with GCN, where we see a similar trend.

Transductive performance on the recommender systems datasets. The metric is recall@50.

The improvement with TUNEUP over the base GNNs for diverse GNN model architectures. We used the same set of datasets as Figure 1. † For semi-supervised node classification, GAT ran Out-Of-Memory (OOM) on the products dataset, so we report the performance on arxiv instead.

Statistics of nodes used for the transductive evaluation. For link prediction and the recommender system graphs (user-item bipartite graphs), we only evaluate on nodes/users that have at least one edge in the validation set.

Semi-supervised node classification performance with GCN as the backbone architecture. The evaluation metric is classification accuracy. For the "Inductive (cold)", 90% of edges are randomly removed from new nodes. For the results with other edge removal ratios, refer to Table 8 in the Appendix.

Cold-start inductive node classification performance with GraphSAGE as the backbone architecture. The larger the edge removal ratio is, the more cold-start the prediction task becomes. The evaluation metric is classification accuracy.

Cold-start inductive node classification performance with GCN as the backbone architecture. The larger the edge removal ratio is, the more cold-start the prediction task becomes. The evaluation metric is classification accuracy.

Link prediction performance with GCN as the backbone architecture. The evaluation metric is recall@50. For the "Inductive (cold)", 60% of edges are randomly removed from new nodes. For the results with other edge removal ratios, refer to Table 11 in the Appendix.

Cold-start inductive link prediction performance with GraphSAGE. The evaluation metric is recall@50. The larger the edge removal ratio is, the more cold-start the prediction task becomes.

Cold-start inductive link prediction performance with GCN. The evaluation metric is recall@50. The larger the edge removal ratio is, the more cold-start the prediction task becomes.

1272±0.0022 0.1048±0.0016 0.0661±0.0021 0.1437±0.0046 0.1241±0.0039 0.0754±0.0053 0.1852±0.0015 0.1356±0.0017 0.0557±0.0023
TUNEUP w/o syn-tails 0.1166±0.0021 0.0883±0.0015 0.0284±0.0010 0.1110±0.0078 0.0916±0.0053 0.0347±0.0036 0.1737±0.0008 0.1195±0.0014 0.0368±0.0010
TUNEUP (ours) 0.1308±0.0021 0.1080±0.0027 0.0661±0.0028 0.1714±0.0051 0.1430±0.0073 0.0786±0.0090 0.1933±0.0018 0.1440±0.0017 0.0627±0.0015

