BETTER WITH LESS: DATA-ACTIVE PRE-TRAINING OF GRAPH NEURAL NETWORKS

Anonymous

Abstract

Recently, pre-training of graph neural networks (GNNs) has become an active research area and is used to learn transferable knowledge for downstream tasks from unlabeled data. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training samples and graph datasets do not necessarily lead to better performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: few, but carefully chosen, data are fed into a GNN model to enhance pre-training. This novel pre-training pipeline, called the data-active graph pre-training (APT) framework, is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the model's confidence in the data. In turn, when fed with the chosen data, the pre-training model grasps an initial understanding of the new, unseen data while attempting to retain the knowledge learnt from previous data. The integration and interaction of these two components thus form a unified framework, in which graph pre-training is performed in a progressive way. Experimental results show that the proposed APT framework obtains an efficient pre-training model with fewer training data and better downstream performance.
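To make the selector's role concrete, the following is a minimal sketch of the data-selection step, not the paper's actual implementation. All names here (`predictive_uncertainty`, `property_score`, `select_graphs`, the density proxy, and the mixing weight `alpha`) are hypothetical illustrations: the sketch only assumes that each candidate graph is scored by a mix of model feedback (uncertainty) and inherent graph properties, and that the top-scoring graphs are fed to pre-training.

```python
def predictive_uncertainty(model, graph):
    # Hypothetical proxy: the model exposes a confidence in [0, 1] for graphs
    # it has seen; unseen graphs default to confidence 0, i.e. uncertainty 1.
    # In APT, this feedback would come from the pre-training model itself.
    return 1.0 - model["confidence"].get(graph["name"], 0.0)

def property_score(graph):
    # Hypothetical stand-in for "inherent graph properties": edge density as a
    # crude proxy for structural richness (the paper's actual criteria differ).
    n, m = graph["num_nodes"], graph["num_edges"]
    return 2.0 * m / (n * (n - 1)) if n > 1 else 0.0

def select_graphs(model, candidates, k, alpha=0.5):
    # Score each candidate by a weighted mix of predictive uncertainty
    # (feedback from the model) and inherent properties, then keep the
    # top-k most representative and instructive graphs.
    scored = sorted(
        candidates,
        key=lambda g: alpha * predictive_uncertainty(model, g)
                      + (1 - alpha) * property_score(g),
        reverse=True,
    )
    return scored[:k]
```

Under this sketch, a graph the model is already confident about is deprioritized even if structurally rich, while a dense, unseen graph scores highest, matching the progressive pre-training intuition of feeding the model data it has not yet mastered.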

1. INTRODUCTION

Pre-training Graph Neural Networks (GNNs) shows the potential to be an attractive and competitive strategy for learning graph representations without costly labels. However, its transferability is guaranteed only if the pre-training datasets come from the same or a similar domain as the downstream task Hu et al. (2019; 2020b); You et al. (2020a;b); Hu et al. (2020c); Li et al. (2021); Lu et al. (2021); Sun et al. (2021). When we have no knowledge of the downstream task, an encouraging yet largely unexplored research direction is pre-training GNNs on cross-domain data Qiu et al. (2020); Hafidi et al. (2020). Taking graphs from multiple domains as input, graph pre-training is able to learn the transferable structural patterns in graphs (when some semantic meanings are present), or to obtain the capability of discriminating these patterns. With diverse cross-domain data, the success of a graph pre-training model is often attributed to the massive amount of unlabeled training data, a fact well established for pre-training in computer vision Girshick et al. (2014); Donahue et al. (2014); He et al. (2020) and natural language processing Mikolov et al. (2013); Devlin et al. (2019). In view of this, contemporary research rarely questions the following issue: is a massive amount of input data really necessary, or even beneficial, for pre-training GNNs?

Two simple experiments regarding the number of training samples and graph datasets, however, cast doubt on the affirmative answer to this question. First, we observe that scaling up pre-training samples does not result in a one-model-fits-all increase in downstream performance (see the first row of Figure 1). Second, adding input graphs (while fixing the sample size) does not improve, and sometimes even deteriorates, the generalization of the pre-trained model (see the second row of Figure 1). Furthermore, even when the number of input graphs (the horizontal coordinate) is fixed, the performance of models pre-trained on different combinations of inputs varies dramatically; see the standard deviation in blue. As our first contribution, we identify the curse of big data phenomenon in graph pre-training: more training samples and graph datasets do not necessarily lead to better downstream performance.

