DIP-GNN: DISCRIMINATIVE PRE-TRAINING OF GRAPH NEURAL NETWORKS

Abstract

Graph neural network (GNN) pre-training methods have been proposed to enhance the power of GNNs. Specifically, a GNN is first pre-trained on a large-scale unlabeled graph and then fine-tuned on a separate small labeled graph for downstream applications, such as node classification. One popular pre-training method is to mask out a proportion of the edges, and a GNN is trained to recover them. However, such a generative method suffers from graph mismatch. That is, the masked graph input to the GNN deviates from the original graph. To alleviate this issue, we propose DiP-GNN (Discriminative Pre-training of Graph Neural Networks). Specifically, we train a generator to recover identities of the masked edges, and simultaneously, we train a discriminator to distinguish the generated edges from the original graph's edges. The discriminator is subsequently used for downstream fine-tuning. In our pre-training framework, the graph seen by the discriminator better matches the original graph because the generator can recover a proportion of the masked edges. Extensive experiments on large-scale homogeneous and heterogeneous graphs demonstrate the effectiveness of the proposed framework. Our code will be publicly available.

1. INTRODUCTION

Graph neural networks (GNNs) have achieved superior performance in various applications, such as node classification (Kipf & Welling, 2017), knowledge graph modeling (Schlichtkrull et al., 2018) and recommendation systems (Ying et al., 2018). To enhance the power of GNNs, generative pre-training methods have been developed (Hu et al., 2020b). During the pre-training stage, a GNN incorporates topological information by training on a large-scale unlabeled graph in a self-supervised manner. Then, the pre-trained model is fine-tuned on a separate small labeled graph for downstream applications. Generative GNN pre-training is akin to masked language modeling in language model pre-training (Devlin et al., 2019). That is, for an input graph, we first randomly mask out a proportion of the edges, and then a GNN is trained to recover the original identities of the masked edges.

One major drawback of this approach is graph mismatch: the input graph to the GNN deviates from the original one because a considerable number of edges are dropped. This changes the graph's topological information, e.g., node connectivity. Consequently, the learned node embeddings may not be desirable.

To mitigate this issue, we propose DiP-GNN (Discriminative Pre-training of Graph Neural Networks). In DiP-GNN, we simultaneously train a generator and a discriminator. The generator is trained similarly to existing generative pre-training approaches: it seeks to recover the masked edges and outputs a reconstructed graph. Subsequently, the reconstructed graph is fed to the discriminator, which predicts whether each edge resides in the original graph (i.e., a true edge) or is wrongly constructed by the generator (i.e., a fake edge). After pre-training, we fine-tune the discriminator on downstream tasks. Figure 1 illustrates our training framework. Note that our work is related to Generative Adversarial Nets (GAN; Goodfellow et al., 2014), and detailed discussions are presented in Section 3.4. We remark that similar approaches have been used in natural language processing (Clark et al., 2020). However, we identify the graph mismatch problem (see Section 4.5), which is specific to graph-related applications and is not observed in natural language processing.

The proposed framework is more advantageous than generative pre-training because the reconstructed graph fed to the discriminator better matches the original graph than the masked graph fed to the generator does. Consequently, the discriminator can learn better node embeddings. This better alignment arises because the generator recovers masked edges during pre-training; we observe that nearly 40% of the missing edges can be recovered. We remark that in our framework, the graph fed to the generator has missing edges, while the graph fed to the discriminator contains wrong edges, since the generator may make erroneous predictions. However, we empirically find that missing edges hurt more than wrong ones, making discriminative pre-training more desirable (see Section 4.5 in the experiments).

We demonstrate the effectiveness of DiP-GNN on large-scale homogeneous and heterogeneous graphs. Results show that the proposed method significantly outperforms existing generative pre-training and self-supervised learning approaches.
For example, on the homogeneous Reddit dataset (Hamilton et al., 2017), which contains 230k nodes, we obtain a 1.1-point improvement in F1 score; and on the heterogeneous OAG-CS graph (Tang et al., 2008), which contains 1.1M nodes, we obtain a 2.8-point improvement in MRR on the paper-field prediction task.
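To make the training framework concrete, below is a minimal, self-contained sketch of one pre-training step in PyTorch. Everything here is illustrative and reflects our own assumptions rather than the paper's implementation: `ToyEncoder` is a stand-in for a real GNN encoder, and the dot-product edge scorer and the function name `dip_gnn_step` are hypothetical.

```python
# A minimal sketch of a DiP-GNN-style pre-training step (assumptions noted above).
import torch
import torch.nn.functional as F

class ToyEncoder(torch.nn.Module):
    """Placeholder encoder: maps node features to node embeddings.
    A real GNN encoder would aggregate over `edge_index`."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = torch.nn.Linear(in_dim, hid_dim)

    def forward(self, x, edge_index):
        return torch.relu(self.lin(x))

def dip_gnn_step(x, edge_index, generator, discriminator, mask_ratio=0.15):
    num_edges = edge_index.size(1)
    # 1) Mask out a proportion of the edges.
    perm = torch.randperm(num_edges)
    n_mask = int(mask_ratio * num_edges)
    masked, kept = perm[:n_mask], perm[n_mask:]
    visible = edge_index[:, kept]
    # 2) Generator: recover the tail node of each masked edge from its head.
    h_gen = generator(x, visible)
    heads = edge_index[0, masked]
    logits = h_gen[heads] @ h_gen.t()                 # [n_mask, num_nodes]
    gen_loss = F.cross_entropy(logits, edge_index[1, masked])
    # Sample tails to build the reconstructed graph (no gradient flows
    # through the discrete sampling step).
    tails = torch.distributions.Categorical(logits=logits.detach()).sample()
    recon = torch.cat([visible, torch.stack([heads, tails])], dim=1)
    # 3) Discriminator: classify each edge as original (1) vs. generated (0).
    # A generated edge that coincides with the true masked edge counts as original.
    h_dis = discriminator(x, recon)
    scores = (h_dis[recon[0]] * h_dis[recon[1]]).sum(-1)
    labels = torch.cat([torch.ones(kept.numel(), device=x.device),
                        (tails == edge_index[1, masked]).float()])
    dis_loss = F.binary_cross_entropy_with_logits(scores, labels)
    return gen_loss, dis_loss
```

In practice the two losses would be combined into a joint objective, e.g. `loss = gen_loss + lam * dis_loss` for some weight `lam`, and only the discriminator is kept for downstream fine-tuning.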

2. BACKGROUND

⋄ Graph Neural Networks. Graph neural networks compute a node's representation by aggregating information from the node's neighbors. Concretely, for a multi-layer GNN, the feature vector $h_v^{(k)}$ of node $v$ at the $k$-th layer is computed as
$$a_v^{(k)} = \mathrm{Aggregate}\big(\{h_u^{(k-1)} : \forall u \in \mathrm{Neighbor}(v)\}\big), \quad h_v^{(k)} = \mathrm{Combine}\big(a_v^{(k)}, h_v^{(k-1)}\big),$$
where $\mathrm{Neighbor}(v)$ denotes all the neighbor nodes of $v$. Various implementations of Aggregate(•) and Combine(•) have been proposed for both homogeneous (Defferrard et al., 2016; Kipf & Welling, 2017; Velickovic et al., 2018; Xu et al., 2019) and heterogeneous graphs (Schlichtkrull et al., 2018; Wang et al., 2019; Zhang et al., 2019; Hu et al., 2020c).

⋄ Graph Neural Network Pre-Training. Previous unsupervised learning methods leverage the graph's proximity (Tang et al., 2015) or information gathered by random walks (Perozzi et al., 2014; Grover & Leskovec, 2016; Dong et al., 2017; Qiu et al., 2018). However, the learned embeddings cannot be transferred to unseen nodes, limiting the methods' applicability. Other unsupervised learning algorithms adopt contrastive learning (Hassani & Ahmadi, 2020; Qiu et al., 2020; Zhu et al., 2020; 2021; You et al., 2020; 2021). That is, we generate two views of the same graph, and then maximize agreement of node representations in the two views. However, our experiments reveal that these methods do not scale well to extremely large graphs with millions of nodes.
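To make the Aggregate/Combine pattern above concrete, here is a minimal sketch of a mean-aggregation layer in PyTorch. The class name `MeanAggLayer` is our own, and this is an illustrative example rather than any of the cited architectures.

```python
# A minimal mean-aggregation GNN layer following the Aggregate/Combine pattern.
import torch

class MeanAggLayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.combine = torch.nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, edge_index):
        src, dst = edge_index                 # edges point from src to dst
        agg = torch.zeros_like(h)
        agg.index_add_(0, dst, h[src])        # sum neighbor features per node
        deg = torch.zeros(h.size(0), device=h.device)
        deg.index_add_(0, dst, torch.ones_like(dst, dtype=h.dtype))
        a = agg / deg.clamp(min=1).unsqueeze(-1)              # Aggregate: mean
        return torch.relu(self.combine(torch.cat([a, h], -1)))  # Combine
```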



Figure 1: Illustration of DiP-GNN. From left to right: the original graph; the graph with two masked edges (dashed lines); the reconstructed graph produced by the generator (generated edges shown as dashed red lines); the discriminator labels each edge as [G] (generated) or [O] (original), with two incorrect labels shown in red.

There are also pre-training methods that extract graph-level representations, i.e., models are trained on a large number of small graphs instead of a single large graph. For example, Hu et al. (2020a) propose pre-training methods that operate on both the graph and node levels; and InfoGraph (Sun et al., 2020) proposes to maximize the mutual information between graph representations and representations of the graphs' sub-structures. In this work, we focus on pre-training GNNs on a single large graph instead of multiple small graphs.

