BETTER WITH LESS: DATA-ACTIVE PRE-TRAINING OF GRAPH NEURAL NETWORKS

Anonymous

Abstract

Recently, pre-training graph neural networks (GNNs) has become an active research area, used to learn transferable knowledge from unlabeled data for downstream tasks. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training samples and graph datasets do not necessarily lead to better performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: fewer, but carefully chosen, data are fed into a GNN model to enhance pre-training. This novel pipeline, called the data-active graph pre-training (APT) framework, is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the model's confidence level in the data. When fed with the chosen data, the pre-training model grasps an initial understanding of the new, unseen data while attempting to retain the knowledge learnt from previous data. The integration and interaction of these two components thus form a unified framework, in which graph pre-training is performed in a progressive way. Experimental results show that the proposed APT framework obtains an efficient pre-training model with fewer training data and better downstream performance.

1. INTRODUCTION

Pre-training Graph Neural Networks (GNNs) shows the potential to be an attractive and competitive strategy for learning graph representations without costly labels. However, its transferability is guaranteed only if the pre-training datasets come from the same or a similar domain as the downstream tasks Hu et al. (2019; 2020b). In view of this, contemporary research takes an affirmative answer to the following question almost for granted: Is a massive amount of input data really necessary, or even beneficial, for pre-training GNNs?

Two simple experiments regarding the number of training samples and graph datasets cast doubt on this positive answer. The first observation is that scaling up pre-training samples does not result in a one-model-fits-all increase in downstream performance (see the first row of Figure 1). Second, we observe that adding input graphs (while fixing the sample size) does not improve, and sometimes even deteriorates, the generalization of the pre-trained model (see the second row of Figure 1). Furthermore, even when the number of input graphs (the horizontal coordinate) is fixed, the performance of the model pre-trained on different combinations of inputs varies dramatically; see the standard deviation in blue.

As our first contribution, we identify the curse of big data phenomenon in graph pre-training: more training samples and graph datasets do not necessarily lead to better downstream performance. Therefore, instead of training on massive data, it is more appealing to choose some samples and graphs wisely for pre-training. Without knowledge of the downstream tasks, however, the difficulty lies in designing new criteria for selecting the input data for the pre-training model. To fill this gap, we propose a novel graph selector that provides the most instructive data for the model. The criteria in the graph selector include predictive uncertainty and graph properties.
Predictive uncertainty is introduced to measure the model's level of confidence (or certainty) in the data. On the other hand, some graphs are more informative and representative than others due to their inherent structure; to this end, several fundamental graph properties also guide the selection process. Given the selected input data, we take full advantage of the predictive uncertainty as a proxy for the model's capability during the training phase. Instead of swallowing the data as a whole, the pre-training model is encouraged to learn from the data in a progressive way.

We start with a natural question: What does cross-domain graph pre-training actually learn? Previous studies argue that the semantic meaning associated with structural patterns is transferable. For example, in both citation networks and social networks, the closed triangle structure is interpreted as a stable relationship, while the open triangle indicates an unstable one.
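As a concrete toy illustration of such a structural pattern (our own sketch; the paper does not prescribe this computation), one can census the closed versus open triangles in a small graph:

```python
from itertools import combinations

def triangle_census(edges):
    """Count closed triangles (all three edges among a node triple are
    present) and open triangles (exactly two of the three are present)."""
    adj = {frozenset(e) for e in edges}
    nodes = {u for e in edges for u in e}
    closed = opened = 0
    for a, b, c in combinations(sorted(nodes), 3):
        present = sum(frozenset(p) in adj
                      for p in ((a, b), (b, c), (a, c)))
        if present == 3:
            closed += 1
        elif present == 2:
            opened += 1
    return closed, opened

# A 4-node graph: one closed triangle (0,1,2) plus a pendant edge (2,3),
# which creates two open triangles through node 2.
print(triangle_census([(0, 1), (1, 2), (0, 2), (2, 3)]))  # → (1, 2)
```

The ratio of closed to open triangles (i.e., transitivity) is exactly the kind of domain-agnostic structural statistic that such semantic arguments rely on.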



A number of works study graph pre-training under the assumption that the pre-training and downstream data share a domain You et al. (2020a;b); Hu et al. (2020c); Li et al. (2021); Lu et al. (2021); Sun et al. (2021). When we have no knowledge of the downstream task, an encouraging yet largely unexplored research direction is pre-training GNNs on cross-domain data Qiu et al. (2020); Hafidi et al. (2020). Taking graphs from multiple domains as input, graph pre-training is able to learn the transferable structural patterns in graphs (when some semantic meanings are present), or to obtain the capability of discriminating these patterns. With diverse and varied cross-domain data, the success of a graph pre-training model is often attributed to the massive amount of unlabeled training data, a well-established fact for pre-training in computer vision Girshick et al. (2014); Donahue et al. (2014); He et al. (2020) and natural language processing Mikolov et al. (2013); Devlin et al. (2019).

Figure 1: Top row: the effect of scaling up the sample size (log scale) on downstream performance for a group of GCC models Qiu et al. (2020) under different configurations (the graphs used for pre-training are fixed to all eleven pre-training datasets in Table 3, and the samples are drawn according to the backbone pre-training model's sampling strategy). The results for different downstream graphs (and tasks) are presented in separate panels. To better show the trend, we fit a curve to the best-performing models (i.e., the convex hull fit, as in Abnar et al. (2022)). Bottom row: the effect of scaling up the number of graph datasets on downstream performance based on GCC. For each fixed horizontal coordinate, we run 5 trials, each with a randomly chosen combination of input graphs. The shaded area indicates the standard deviation over the 5 trials. See Appendix D for more observations on other graph pre-training models and detailed settings.

After learning from a certain amount of training data, the predictive uncertainty gives feedback on what kind of data the model has the least knowledge of. The pre-training model can then reinforce itself on highly uncertain data in subsequent training iterations. Putting these together, we propose a data-active graph pre-training (APT) framework, which integrates the graph selector and the pre-training model into a unified whole. The two components actively cooperate with each other: the graph selector recognizes the most instructive data for the model, and, equipped with this intelligent selector, the pre-training model is well trained and in turn provides better guidance to the graph selector.

The rest of the paper is organized as follows. In §2 we review existing work on the basic graph pre-training framework commonly used for training on cross-domain graph data. In §3 we describe in detail the proposed data-active graph pre-training (APT) paradigm. §4 contains numerical experiments, which demonstrate the superiority of APT on different downstream tasks, especially when the test and training graphs come from different domains. Lastly, we also discuss the applicable scope of our pre-trained model.

2 BASIC GRAPH PRE-TRAINING FRAMEWORK

This section reviews the basic framework of cross-domain graph pre-training commonly used in the related literature. The backbone of our graph pre-training model also follows this framework, using GCC Qiu et al. (2020) as an instantiation. In principle, GCC can be substituted by any encoder suitable for training on cross-domain graphs.
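The selector–model feedback loop described above can be mocked up in a few lines (a minimal sketch under our own assumptions: the top-k selection rule and the multiplicative uncertainty decay stand in for the actual selection criteria and gradient updates, which the paper defines differently):

```python
def apt_loop(init_scores, iterations=3, k=2, decay=0.5):
    """Progressive pre-training sketch (toy): each round the selector
    picks the k graphs with the highest predictive uncertainty, the
    model 'trains' on them (here: their uncertainty simply decays),
    and the refreshed scores feed back into the next round."""
    scores = dict(init_scores)
    for _ in range(iterations):
        # selector: the graphs the model is least confident about
        chosen = sorted(scores, key=scores.get, reverse=True)[:k]
        for g in chosen:            # stands in for a model update
            scores[g] *= decay      # model grows more confident on seen data
    return scores

# Five graphs with initial (hypothetical) uncertainty estimates; after
# three rounds the most uncertain graphs have been revisited repeatedly.
final = apt_loop({"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.3, "e": 0.1})
print(final)
```

Note how graph "e", on which the model is already confident, is never selected, while "a" and "b" are revisited until other graphs overtake them in uncertainty; this is the progressive, self-reinforcing behavior the framework aims for.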

