BETTER WITH LESS: DATA-ACTIVE PRE-TRAINING OF GRAPH NEURAL NETWORKS

Anonymous

Abstract

Recently, pre-training graph neural networks (GNNs) has become an active research area: transferable knowledge is learnt from unlabeled data and then applied to downstream tasks. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training samples and graph datasets do not necessarily lead to better performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: fewer, but carefully chosen, data are fed into a GNN model to enhance pre-training. This novel pre-training pipeline, called the data-active graph pre-training (APT) framework, is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as the predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the confidence level of the model in the data. When fed with the chosen data, the pre-training model grasps an initial understanding of the new, unseen data, and at the same time attempts to remember the knowledge learnt from previous data. The integration and interaction between these two components thus form a unified framework, in which graph pre-training is performed in a progressive way. Experimental results show that the proposed APT framework obtains an efficient pre-training model with fewer training data and better downstream performance.

This section reviews the basic framework of cross-domain graph pre-training commonly used in the related literature. The backbone of our graph pre-training model also follows this framework, and uses GCC Qiu et al. (2020) as an instantiation. In principle, GCC can be substituted by any encoder suitable for training on cross-domain graphs.
We start with a natural question: what does cross-domain graph pre-training actually learn? Previous studies argue that the semantic meaning associated with structural patterns is transferable. For example, in both citation networks and social networks, the closed triangle structure is interpreted as a stable relationship, while the open triangle indicates an unstable relationship.

We provide an open-source implementation of our model APT at https://github.com/anonymous-APT-ai/Anonymous-APT-code. Hyperparameters necessary for reproducing the experiments can be found in §4.1, Appendix D and Appendix F. Users can run APT on their own datasets.

1. INTRODUCTION

Pre-training Graph Neural Networks (GNNs) shows the potential to be an attractive and competitive strategy for learning graph representations without costly labels. However, its transferability is guaranteed only if the pre-training datasets come from the same or a similar domain as the downstream task Hu et al. (2019; 2020b); You et al. (2020a;b); Hu et al. (2020c); Li et al. (2021); Lu et al. (2021); Sun et al. (2021). When we have no knowledge of the downstream task, an encouraging yet largely unexplored research direction is pre-training GNNs on cross-domain data Qiu et al. (2020); Hafidi et al. (2020). Taking graphs from multiple domains as the input, graph pre-training is able to learn transferable structural patterns (when some semantic meanings are present), or to obtain the capability of discriminating these patterns. With diverse cross-domain data, the success of a graph pre-training model is often attributed to the massive amount of unlabeled training data, a well-established fact for pre-training in computer vision Girshick et al. (2014); Donahue et al. (2014); He et al. (2020) and natural language processing Mikolov et al. (2013); Devlin et al. (2019). In view of this, contemporary research takes the answer to the following question almost for granted: is a massive amount of input data really necessary, or even beneficial, for pre-training GNNs? However, two simple experiments regarding the number of training samples and graph datasets cast doubt on the positive answer. The first observation is that scaling pre-training samples does not result in a one-model-fits-all increase in downstream performance (see the first row of Figure 1). Second, we observe that adding input graphs (while fixing the sample size) does not improve, and sometimes even deteriorates, the generalization of the pre-trained model (see the second row of Figure 1).
Furthermore, even if the number of input graphs (the horizontal coordinate) is fixed, the performance of the model pre-trained on different combinations of inputs varies dramatically; see the standard deviation in blue. As the first contribution, we identify the curse of big data phenomenon in graph pre-training: more training samples and graph datasets do not necessarily lead to better downstream performance.

(Figure 1 caption, continued: the samples are taken from the backbone pre-training model according to its sampling strategy. The results for different downstream graphs (and tasks) are presented in separate figures. To better show the trend, we fit a curve to the best-performing models (i.e., the convex hull fit, as in Abnar et al. (2022)). Bottom row: the effect of scaling up the number of graph datasets on downstream performance, based on GCC. For each fixed horizontal coordinate we run 5 trials, each with a randomly chosen combination of input graphs; the shaded area indicates the standard deviation over the 5 trials. See Appendix D for observations on other graph pre-training models and detailed settings.)

Therefore, instead of training on massive data, it is more appealing to choose wisely some samples and graphs for pre-training. However, without knowledge of the downstream tasks, the difficulty lies in designing new criteria for selecting the input data of the pre-training model. To fill this gap, we propose a novel graph selector that provides the most instructive data for the model. The criteria in the graph selector include predictive uncertainty and graph properties. Predictive uncertainty is introduced to measure the model's level of confidence (or certainty) in the data. On the other hand, some graphs are more informative and representative than others, owing to their inherent structure. To this end, several fundamental properties of graphs also aid the selection process.
Given the selected input data, we take full advantage of the predictive uncertainty as a proxy for the model's capability during the training phase. Instead of swallowing the data as a whole, the pre-training model is encouraged to learn from the data in a progressive way. After learning a certain amount of training data, the predictive uncertainty gives feedback on what kind of data the model has the least knowledge of. The pre-training model is then able to reinforce itself on highly uncertain data in subsequent training iterations. Putting these together, we propose a data-active graph pre-training (APT) framework, which integrates the graph selector and the pre-training model into a unified framework. The two components actively cooperate with each other: the graph selector recognizes the most instructive data for the model, and, equipped with this intelligent selector, the pre-training model is well trained and in turn provides better guidance for the graph selector. The rest of the paper is organized as follows. In §2 we review the basic graph pre-training framework commonly used for training on cross-domain graph data. In §3 we describe in detail the proposed data-active graph pre-training (APT) paradigm. §4 contains numerical experiments, which demonstrate the superiority of APT on different downstream tasks, especially when the test and training graphs come from different domains. Lastly, we discuss the applicable scope of our pre-trained model.

2. BASIC GRAPH PRE-TRAINING FRAMEWORK

When data comes from other domains like molecular networks, the semantic meaning can be quite different. Nevertheless, we argue that the distinction between different structural patterns is still transferable. Taking the same example, the closed and open triangles might yield different interpretations in molecular networks (unstable vs. stable in terms of chemical properties) from those in social networks (stable vs. unstable in terms of social relationships), but the distinction between these two structures remains the same because they indicate opposite (or contrastive) semantic meanings. Therefore, cross-domain pre-training either learns representative structural patterns (when semantic meanings are present) or, more importantly, obtains the capability of distinguishing these patterns. This observation in graph pre-training is not only very different from that in other areas (e.g., computer vision and natural language processing), but may also explain why graph pre-training is effective, especially when downstream information is absent. In the hope of learning the transferable structural patterns, or the ability to distinguish them, the cross-domain graph pre-training model is fed with a collection of input graphs (possibly from different domains), and the learnt model, denoted by f_θ (or simply f if the parameter θ is clear from context), maps a node to a low-dimensional representation. Unaware of the specific downstream task as well as task-specific labels, one should design a self-supervised task for the pre-training model. Such self-supervised information for a node is usually hidden in its neighborhood pattern, and thus the structure of its ego network is often used as the transferable pattern. Naturally, subgraph instances sampled from the same ego network Γ_i are considered similar, while those sampled from different ego networks are rendered dissimilar.
Therefore, the pre-training model attempts to capture the similarities (and dissimilarities) between subgraph instances, and such a self-supervised task is called the subgraph instance discrimination task. More specifically, given a subgraph instance ζ_i from an ego network Γ_i centered at node v_i, as well as its representation x_i = f(ζ_i), the model f aims to encourage higher similarity between x_i and the representation of another subgraph instance ζ_i^+ sampled from the same ego network. This can be done by minimizing, e.g., the InfoNCE loss Oord et al. (2018):

$$\mathcal{L}_i = -\log \frac{\exp\!\left(x_i^\top f(\zeta_i^+)/\tau\right)}{\exp\!\left(x_i^\top f(\zeta_i^+)/\tau\right) + \sum_{\zeta_i' \in \Omega_i^-} \exp\!\left(x_i^\top f(\zeta_i')/\tau\right)}, \qquad (1)$$

where Ω_i^- is a collection of subgraph instances sampled from different ego networks Γ_j (j ≠ i), and τ is a temperature hyper-parameter. Here the inner product is used as the similarity measure between two instances. One common strategy for sampling these subgraph instances is random walks on graphs, as used in GCC Qiu et al. (2020), but other sampling methods as well as other loss functions are also valid.
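To make the objective concrete, the loss in Eq. (1) can be sketched in a few lines of plain Python (a minimal sketch with toy list embeddings; the interface is ours, not that of the GCC codebase):

```python
import math

def info_nce(x_i, pos, negs, tau=0.07):
    """InfoNCE loss of Eq. (1) for one query subgraph instance.

    x_i:  representation of the query instance (a plain list of floats)
    pos:  representation f(zeta_i^+) of the positive instance
    negs: representations of instances drawn from other ego networks
    tau:  temperature hyper-parameter
    """
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    pos_term = math.exp(dot(x_i, pos) / tau)
    neg_terms = sum(math.exp(dot(x_i, n) / tau) for n in negs)
    return -math.log(pos_term / (pos_term + neg_terms))
```

The loss shrinks as the positive pair becomes more similar to the query than the negatives are, which is exactly the discrimination behaviour described above.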

3. DATA-ACTIVE GRAPH PRE-TRAINING

In this section we present the proposed APT framework for cross-domain graph pre-training; the overall pipeline is illustrated in Figure 2. The APT framework consists of two major components, a graph selector and a graph pre-training model. The technical core is the interaction between these two components: the graph selector feeds suitable data to the pre-training model, and the graph pre-training model learns from the carefully chosen data. The feedback of the pre-training model in turn helps select the data tailored to the model's needs. The rest of this section is organized as follows. We describe the graph selector in §3.1 and the graph pre-training model in §3.2. The overall pre-training and fine-tuning strategy is presented in §3.3.

3.1. GRAPH SELECTOR

In view of the curse of big data phenomenon, it is more appealing to carefully choose data well suited for graph pre-training than to train on a massive amount of data. Conventionally, the criterion of suitable data, or the contribution of a data point to the model, is defined based on the output predictions on downstream tasks Goodfellow et al. (2016). In graph pre-training, where downstream information is absent, new selection criteria or guidelines are needed to provide effective instructions for the model. Here we introduce two kinds of selection criteria, originating from different points of view, to help select suitable data for pre-training. The predictive uncertainty measures the model's understanding of certain data, and thus helps select the least certain data points for the current model. In addition to this measure of the model's ability, some inherent properties of graphs can be used to assess the level of representativeness or informativeness of a given graph. Predictive uncertainty. The notion of predictive uncertainty can be explained via an illustrative example, shown in part (a) of the graph selector component in Figure 2.
Consider a query subgraph instance ζ_i (denoted by "?" in Figure 2) from the ego network Γ_i in a graph G. If the pre-training model cannot tell its similar instance ζ_i^+ (denoted by "+") from a dissimilar instance ζ_i^- ∈ Ω_i^- (denoted by "−"), we say that the current model is uncertain about the query instance ζ_i. Therefore, the contrastive loss in Eq. (1) comes in handy as a natural measure of the predictive uncertainty of the instance ζ_i: φ_uncertain(ζ_i) = L_i. Accordingly, the predictive uncertainty of a graph G (i.e., the graph-level predictive uncertainty) is defined as φ_uncertain(G) = (1/M) Σ_{i=1}^{M} L_i, where M is the number of subgraph instances queried in this graph. The proposed selection process is different from the strategies used in curriculum learning Bengio et al. (2009).
Predictive uncertainty encourages the model to learn the more difficult (uncertain) graphs and samples first, while in curriculum learning, the easiest samples are fed first. The choice of the difficult-first order is intuitive in our case; see also Appendix G for empirical evidence. Graph properties. As we see above, the predictive uncertainty measures the model's ability to distinguish (or identify) a given graph (or subgraph instance). However, predictive uncertainty is sometimes misleading, especially when the chosen graph (or subgraph) happens to be an outlier of the entire data collection. Hence learning solely from the most uncertain data might not improve the overall performance or, even worse, may lead to overfitting. The inherent properties of the graph turn out to be equally important as a selection criterion for graph pre-training. Intuitively, it is preferable to choose those graphs that are good by themselves: those with a better structure, or those containing more information. We therefore introduce five inherent graph properties (network entropy, density, average degree, degree variance and scale-free exponent) to help select better data points for pre-training. All these properties exhibit a strong correlation with downstream performance, which is empirically verified and presented in part (b) of the graph selector component in Figure 2. The choice of these properties also has an intuitive explanation; here we discuss the intuition behind the network entropy as an example. The use of network entropy is inspired by the sampling methods used in most cross-domain graph pre-training models (see e.g., Qiu et al. (2020); Hafidi et al. (2020)): random walks started at a node are employed to construct a subgraph instance as the model input. Random walks can also be used to compute the amount of information contained in a graph.
In particular, the amount of information contained in the move from node v_i to node v_j is −log P_ij Cover & Thomas (1999), where P is the transition matrix. Thus the network entropy of a connected graph G = (V, E) can be defined as the expected information of individual transitions over the random walk process Burda et al. (2009):

$$\phi_{\mathrm{entropy}} = \langle -\log P_{ij} \rangle_P = -\sum_i \pi_i \sum_j P_{ij} \log P_{ij}, \qquad (2)$$

where π is the stationary distribution of the random walk and ⟨·⟩_P denotes the expectation of a random variable according to P. Network entropy (2) is in general difficult to calculate, but for a connected unweighted graph, P_ij = 1/d_i and π = (1/2|E|) d (where d_i is the degree of node v_i ∈ V and d = (d_1, d_2, . . .) is the degree vector). Then the network entropy (2) reduces to

$$\phi_{\mathrm{entropy}} = \frac{1}{2|E|} \sum_{i=1}^{N} d_i \log d_i, \qquad (3)$$

where N = |V| is the total number of nodes in G. In this case, the network entropy of a graph depends solely on its degree distribution and is straightforward to compute. Although the definition of network entropy originates from random walks on graphs, it is still useful in graph pre-training even when the sampling of subgraph instances does not depend on random walks. Here we provide another intuitive explanation of network entropy, from coding theory. Network entropy can be viewed as the entropy rate of a random walk, and it is known that the entropy rate is the expected number of bits per symbol required to describe a stochastic process Cover & Thomas (1999). Similarly, the network entropy can be interpreted as the expected number of "words" needed to describe the graph. Thus, intuitively, the larger the network entropy is, the more information the graph contains. As a final remark on network entropy, the connectivity assumption does not limit the usefulness of Eq. (3) in our case. For disconnected input graphs, we can simply compute the network entropy of the largest connected component, since for most real-world networks the largest connected component contains most of the information Easley & Kleinberg (2010).
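Since Eq. (3) depends only on the degree sequence, it is cheap to evaluate; below is a minimal sketch (assuming `degrees` lists the node degrees of a connected graph, e.g., its largest connected component):

```python
import math

def network_entropy(degrees):
    """Network entropy of a connected unweighted graph, Eq. (3):
    (1 / 2|E|) * sum_i d_i * log d_i, where sum(degrees) = 2|E|."""
    two_E = sum(degrees)
    return sum(d * math.log(d) for d in degrees if d > 0) / two_E
```

For instance, the complete graph K4 (degrees 3, 3, 3, 3) has entropy log 3, higher than that of a 4-node star (degrees 3, 1, 1, 1), matching the intuition that denser graphs carry more information.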
Alternatively, we can also take several of the largest connected components from the graph and treat them separately as connected graphs. Furthermore, the other four graph properties, i.e., density, average degree, degree variance and scale-free exponent, are closely related to the network entropy. Figure 3 presents a clear correlation between the network entropy and the other four graph properties, and provides some illustrative graphs. (These example graphs are generated by the configuration model proposed in Newman (2003), and Appendix E contains more results on real-world networks.) Intuitively, graphs with higher network entropy contain a larger amount of information, and so do graphs with larger density, higher average degree, higher degree variance, or a smaller scale-free exponent. The connections between all five graph properties can be theoretically justified, and the motivations for choosing these properties can be found in Appendix A. A detailed empirical justification of these properties and the corresponding pre-training performance is included in Appendix E. Time-adaptive selection strategy. The proposed predictive uncertainty and the five graph properties together act as a powerful indicator of a graph's goodness. Thus the selection of a graph can be formulated as the following optimization problem:

$$\text{maximize} \quad \mathcal{J}(G) = \gamma_t\, \hat{\phi}_{\mathrm{uncertain}} + (1-\gamma_t)\, \mathrm{MEAN}\!\left(\hat{\phi}_{\mathrm{entropy}}, \hat{\phi}_{\mathrm{density}}, \hat{\phi}_{\mathrm{avg\,deg}}, \hat{\phi}_{\mathrm{deg\,var}}, -\hat{\phi}_{\alpha}\right), \qquad (4)$$

where the optimization variable is the graph G to be selected; φ̂_uncertain, φ̂_entropy, φ̂_density, φ̂_avg deg, φ̂_deg var and φ̂_α are the z-score normalized values of the graph-level predictive uncertainty, network entropy, density, average degree, degree variance and scale-free exponent of graph G, respectively; γ_t ∈ [0, 1] is a parameter trading off the predictive uncertainty against the graph properties; and t is the iteration counter. Note that the pre-training model has learnt nothing at the beginning, so we initialize γ_0 = 0.
The balance between the predictive uncertainty and the inherent graph properties ensures that the selected graph is a good supplement to the current pre-training model as well as an effective representative of the entire data distribution. We shall also note that, at the beginning of pre-training, the outputs of the model are not accurate enough to guide data selection, so the parameter γ_t should be set smaller so that the graph properties play the leading role. As the training phase proceeds, the graph selector gradually pays more attention to the feedback φ_uncertain from the model via a larger value of γ_t. Therefore, the parameter γ_t is called the time-adaptive parameter, and is set to be a random variable depending on the time t. In this work, we take γ_t from a Beta distribution, γ_t ∼ Beta(1, β_t), where β_t decreases over time (training iterations). Finally, after a graph is selected, we can further choose subgraph instances with high predictive uncertainty for training, rather than feed the model with random subgraph samples. Connections and differences with hard example mining. Hard example mining identifies difficult examples with the help of label information (2008), which cannot be adapted to pre-training with unlabeled data.
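The time-adaptive selection rule can be sketched as follows (a minimal sketch; the statistic names and interface are ours). Consistent with the initialization γ_0 = 0, the uncertainty term is weighted by γ_t, so graph properties dominate early and the model's feedback takes over as β_t decreases:

```python
import math
import random

def zscore(values):
    """z-score normalize one statistic across the candidate graphs."""
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mu) / sd for v in values]

def select_graph(stats, t, beta_t):
    """Score each candidate graph in the spirit of Eq. (4) and return the
    index of the best one. `stats` is a list of dicts with keys 'uncertain',
    'entropy', 'density', 'avg_deg', 'deg_var', 'alpha' (names are ours)."""
    # gamma_0 = 0; afterwards gamma_t ~ Beta(1, beta_t) grows as beta_t shrinks.
    gamma_t = 0.0 if t == 0 else random.betavariate(1, beta_t)
    cols = {k: zscore([s[k] for s in stats]) for k in stats[0]}
    best, best_score = 0, float("-inf")
    for i in range(len(stats)):
        props = (cols["entropy"][i] + cols["density"][i] + cols["avg_deg"][i]
                 + cols["deg_var"][i] - cols["alpha"][i]) / 5.0
        score = gamma_t * cols["uncertain"][i] + (1.0 - gamma_t) * props
        if score > best_score:
            best, best_score = i, score
    return best
```

At t = 0 the draw of γ_t is skipped entirely, so the first selection is driven purely by the inherent graph properties.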

3.2. GRAPH PRE-TRAINING MODEL

The graph pre-training model takes the input graphs and samples one by one and enhances itself in a progressive manner. However, such a sequential training process does not guarantee that the model remembers the contributions of all previous input data. As shown by the orange curve in Figure 4, a previously learnt graph exhibits a larger predictive uncertainty as the training phase proceeds. This empirical result indicates that the knowledge or information contained in previous input data is forgotten, or overwritten by newly incoming data. This phenomenon, called catastrophic forgetting Kirkpatrick et al. (2017), was first noticed in continual learning and is also identified here. Intuitively, when the training data is taken in a one-by-one manner, the learnt parameters cater to the newly incoming data at the expense of the older data points. One remedy for this issue is adding a proximal term to the objective. The additional proximal term (i.e., the regularization) guarantees the proximity between the new parameters and the model parameters learnt from previous graphs. Therefore, the final loss function for our pre-training model in APT is

$$\mathcal{L}(\theta) = \sum_i \mathcal{L}_i(\theta) + \sum_{j=1}^{k} \sum_m \frac{\lambda_j}{2} F_m^{(j)} \left\| \theta_m - \theta_m^{(j)} \right\|^2, \qquad (5)$$

where L_i is given in Eq. (1), the summation in the first term is taken over the subgraph instances sampled from the new input graph, k is the number of previously learnt graphs, θ^(j) denotes the model parameters learnt after the first j graphs, and the λ_j's are trade-off parameters between the knowledge learnt from new data and that from previous data. Typically, the trade-off parameters {λ_j} form a non-decreasing sequence, i.e., λ_1 ≤ λ_2 ≤ · · · ≤ λ_k. Inspired by the elastic weight consolidation (EWC) algorithm Kirkpatrick et al. (2017), we take F^(j) to be the Fisher information matrix of θ^(j), where F_m^(j) is its m-th diagonal element and m indexes the model parameters.
When F^(j) is set to the identity matrix, the second term degenerates to an L2 regularization (which serves as one of our variants). Note that the proximal term in Eq. (5) is absent when the first input graph is introduced to the model.
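The loss in Eq. (5) can be sketched with plain dictionaries standing in for parameter tensors (a simplification; argument names are ours, and a real implementation would operate on the model's parameter tensors):

```python
def apt_loss(new_losses, theta, prev_thetas, fishers, lambdas):
    """APT loss of Eq. (5): instance losses on the new graph plus an
    EWC-style proximal penalty toward previously learnt parameters.

    new_losses:  contrastive losses L_i on the newly selected graph
    theta:       current parameters {name: value}
    prev_thetas: snapshots theta^(j) taken after each previous graph
    fishers:     diagonal Fisher estimates F^(j), {name: value}
    lambdas:     non-decreasing trade-off weights lambda_j
    """
    loss = sum(new_losses)
    for theta_j, fisher_j, lam in zip(prev_thetas, fishers, lambdas):
        for m in theta:  # (lambda_j / 2) * F_m^(j) * (theta_m - theta_m^(j))^2
            loss += 0.5 * lam * fisher_j[m] * (theta[m] - theta_j[m]) ** 2
    return loss
```

Setting every `fisher_j[m]` to 1 recovers the plain L2 variant (APT-L2), and with no previous graphs the penalty vanishes, as noted above.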

3.3. TRAINING AND FINE-TUNING

Integrating the graph selector and the pre-training model forms the entire APT framework, and the overall algorithm is presented in Appendix B. After the training phase, the APT framework returns a pre-trained GNN model, which can then be applied to various downstream tasks from a wide spectrum of domains. In the so-called freezing mode, the pre-trained model output by APT is directly applied to downstream tasks, without any changes to its parameters. Alternatively, the fine-tuning mode uses the pre-trained graph encoder as an initialization, and offers the flexibility of training the graph encoder and the downstream classifier together in an end-to-end manner.

4. EXPERIMENTS

In the experiments, we pre-train our model on the data provided by the graph selector, and then evaluate the transferability of the pre-trained model on multiple unseen graphs from different domains on node classification and graph classification tasks. Lastly, we discuss the applicable scope of our pre-trained model. Additional experiments can be found in Appendix G, including an analysis of training time, a sensitivity analysis of hyper-parameters, and an ablation study on various combinations of graph properties.

4.1. EXPERIMENTAL SETUP

Datasets. The datasets for pre-training and testing, together with their detailed statistics, are listed in Appendix C. The pre-training datasets are collected from different domains, including social networks, citation networks, and movie collaboration networks. We then evaluate the pre-trained models on 13 real-world graphs, including large-scale datasets with millions of edges from the Open Graph Benchmark Hu et al. (2020a). Some of them come from a domain similar to pre-training (like citation networks), while most come from totally unseen domains (like web networks, transportation networks, protein networks and others). Baselines. We comprehensively evaluate our model against the following baselines for node classification and graph classification tasks, respectively. For node classification, unsupervised node representation methods (2017) are used as baselines, and the learned representations are fed into logistic regression (as most baselines do). For graph classification, we take graph2vec Narayanan et al. (2017), InfoGraph Sun et al. (2020), DGCNN Zhang et al. (2018) and GIN Xu et al. (2019) as baselines, and feed the representations into an SVM classifier (as most baselines do). For both tasks, we also compare our model with (1) Random, where random vectors are generated as representations; (2) GraphCL You et al. (2020a), a GNN pre-training scheme based on contrastive learning with augmentations; (3) JOAO You et al.
(2021), a GNN pre-training scheme that can automatically select data augmentations; (4) GCC Qiu et al. (2020), the state-of-the-art cross-domain graph pre-training model (i.e., the version of our model without the data selection scheme, trained on all pre-training data). GCC, GraphCL and JOAO are trained on the entire collected input data, and the suffix (rand, fine-tune) indicates that the model is trained from scratch. We also include 4 variants of our model: (1) APT-G, which removes the graph-property criteria from the graph selector; (2) APT-P, which removes the predictive-uncertainty criterion from the graph selector; (3) APT-R, which removes the regularization w.r.t. old knowledge in Eq. (5); (4) APT-L2, which degenerates the second term in Eq. (5) to an L2 regularization. Experimental settings. In the training phase, we iteratively select graphs for pre-training until the predictive uncertainty of all candidate graphs is below 3.5. For each selected graph, we choose the samples with predictive uncertainty higher than 3. We use M = 500 query subgraph instances per graph when measuring its predictive uncertainty. The time-adaptive parameter γ_t in Eq. (4) is drawn from γ_t ∼ Beta(1, β_t), where β_t = 3 × 0.995^t. We set the trade-off parameter λ_j = 10 for all j for APT-L2, and λ_j = 500 for APT. The total number of iterations is 100. We adopt GCC as the backbone pre-training model with its default hyper-parameters. Other pre-training models like GraphCL could also serve as the backbone, but we do not report them due to the non-ideal performance of GraphCL. In the fine-tuning phase, we select logistic regression or SVM as the downstream classifier and adopt the same setting as GCC. See Appendix F for more details. Node classification. Table 1 presents the micro F1 scores of different methods over 10 unseen graphs from a wide spectrum of domains on the node classification task.
We observe that our model beats the graph pre-training competitor by an average of 9.94% and 17.83% under the freezing and fine-tuning modes, respectively. This suggests that instead of pre-training on all the collected graphs (as GCC does), it is better to choose the subset of graphs best suited for pre-training (as our model APT does). Moreover, compared with the traditional models without pre-training, the performance gain of our model is also evident.

Graph classification.

The micro F1 scores on unseen test data for the graph classification task are summarized in Table 2. In particular, our model is on average 7.2% and 1.3% better than the graph pre-training backbone model under the freezing and fine-tuning modes, respectively. Interestingly, we find that the variants of APT perform well on graph classification, indicating that a version with a simpler architecture can be applied in practice while still achieving good results. Analysis of the selected graphs. The graphs sequentially selected by our graph selector are uillinois, soc-sign0811, msu, michigan, wiki-vote, soc-sign0902 and dblp. To further analyze why these graphs are chosen, we present their detailed structural properties in Table 4 in Appendix C. We first observe that uillinois, michigan and msu have the largest values of MEAN(φ̂_entropy, φ̂_density, φ̂_avg deg, φ̂_deg var, −φ̂_α), while dblp has the smallest. This shows that both criteria, the graph properties and the predictive uncertainty, play an important role in data selection. Moreover, it is also interesting that wiki-vote is the smallest graph among all the pre-training graphs, yet it still contributes to the performance. This observation again verifies the curse of big data phenomenon in graph pre-training.

4.3. DISCUSSION: SCOPE OF APPLICATION

The transferability of the pre-trained model comes from the learnt representative structural patterns and the ability to distinguish these patterns (as discussed in §2). Therefore, our pre-training model is more suitable for datasets where the target (e.g., labels) is correlated with subgraph patterns or structural properties (e.g., motifs, triangles, betweenness, stars). For example, for node classification on heterophilous graphs (e.g., wisconsin, cornell), our model performs very well because in these graphs, nodes with the same label are not directly connected, but share similar structural properties and behavior (or role, position). On the contrary, graphs with strong homophily (like cora, pubmed, ogbarxiv and ogbproteins) may not benefit too much from our model. A similar observation can be made on graph classification: our model can also benefit graphs whose labels have a strong relationship with their structure, like molecular, chemical, and protein networks (e.g., dd in our experiments) Vishwanathan et al. (2010); Gardiner et al. (2000). Han et al. (2022). One of the technical cores is to design appropriate data augmentations like attribute masking, edge perturbation, node dropping, diffusion, etc., which are performed either on node attributes or on the whole graph structure. Thus they only achieve transferability on graphs from similar (or the same) domains, or the downstream task is restricted to graph classification. With the purpose of learning transferable patterns across different domains, some works take subgraph sampling as augmentation, such that the transferable (sampled) subgraph patterns can be captured during pre-training Qiu et al. (2020); You et al. (2020a; 2021). However, these existing works only focus on how to design the pre-training model, rather than how to select data for pre-training. Our paper first points out the necessity of selecting data, and fills the gap of a data selection strategy in graph pre-training.
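Subgraph sampling of this kind is typically implemented with random walk with restart over an ego network. A minimal sketch on an adjacency-list graph; the restart probability and step budget are illustrative defaults, not the hyper-parameters of any cited model:

```python
import random

def rwr_subgraph(adj, start, restart_p=0.5, max_steps=100):
    """Sample one subgraph instance by random walk with restart.

    `adj` is a dict mapping node -> list of neighbors. Returns the set
    of visited nodes; the induced subgraph serves as one pre-training
    instance (ego-network style augmentation).
    """
    visited = {start}
    cur = start
    for _ in range(max_steps):
        if random.random() < restart_p or not adj[cur]:
            cur = start  # restart: jump back to the ego node
        else:
            cur = random.choice(adj[cur])
            visited.add(cur)
    return visited
```

Because the walk only ever follows edges from the ego node outward, the visited set is always a connected neighborhood of `start`.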

6. CONCLUSION

In this paper, we observe that big data is not a necessity for pre-training GNNs. This motivates us to wisely choose suitable graphs and samples for pre-training rather than training on a massive amount of data. Without any knowledge of the downstream tasks, we propose a novel graph selector to provide the most instructive data for the model. The pre-training model is then encouraged to learn from the data in a progressive way and reinforce itself on newly selected data. We integrate the graph selector and the graph pre-training model into a unified framework, and form a data-active graph pre-training (APT) paradigm. The two components in APT mutually boost each other's capability. Extensive experimental results show that the proposed APT framework can enhance model capability with fewer input data. We here theoretically show some connections between the proposed network entropy and typical structural properties. For the theoretical analysis, we consider connected, unweighted and undirected graphs, whose network entropy depends solely on the degree distribution (see Eq. (3)). Considering a random graph G with a fixed node set, we suppose that the degree of any node v_i independently follows a distribution p, which is a common setting in random graph theory Gómez-Gardenes & Latora (2008). Then the expected network entropy of G is

⟨H(G)⟩ = (1 / 2|E|) Σ_i ⟨d_i log d_i⟩ = ⟨d log d⟩ / ⟨d⟩,   (6)

where every d_i (and d) is an independent random variable following the distribution p. Now we are ready to discuss the connection between the network entropy ⟨H(G)⟩ and some typical graph properties (i.e., the average degree ⟨d⟩, the degree variance Var(d) and the scale-free exponent α). Average degree. Given that the function x log x is convex in x, Jensen's inequality gives ⟨H(G)⟩ ≥ ⟨d⟩ log⟨d⟩ / ⟨d⟩ = log⟨d⟩. It is clear that the (log) average degree is a lower bound of the network entropy.
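The entropy formula and its lower bound can be checked numerically on a toy degree sequence; a minimal sketch using the natural logarithm, with 2|E| equal to the degree sum of an undirected graph:

```python
import math

def network_entropy(degrees):
    """Network entropy H(G) = sum_i d_i log d_i / (2|E|),
    where 2|E| = sum_i d_i for an unweighted undirected graph."""
    two_e = sum(degrees)
    return sum(d * math.log(d) for d in degrees if d > 0) / two_e

# Jensen's inequality: H(G) >= log of the average degree.
degrees = [1, 2, 2, 3, 4, 8]
assert network_entropy(degrees) >= math.log(sum(degrees) / len(degrees))
```

For a regular graph (all degrees equal to d) the bound is tight: the entropy collapses to log d exactly.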
Based on our discussion in §3.1, we conclude that when used for pre-training, an input graph with a higher average degree would in general result in better performance of the pre-trained model. Degree variance. The Taylor expansion of ⟨d log d⟩ in Eq. (6) at ⟨d⟩ gives

⟨H(G)⟩ = log⟨d⟩ + Var(d) / (2⟨d⟩^2) + o(1/⟨d⟩^2),

where Var(d) is the variance of d. We find that log⟨d⟩ is exactly the zeroth-order term in the expansion. When the average degree is fixed, the network entropy and the degree variance Var(d) are positively correlated. This in turn implies a positive correlation between degree variance and the test performance of the model. Scale-free exponent. Most real-world networks exhibit an interesting scale-free property (i.e., only a few nodes have high degrees), and thus the degree distribution often follows a power-law distribution. That is, we can write the degree distribution as p(x) ∼ x^{−α}, where α is called the scale-free exponent. For a real-world network, the scale-free exponent α is usually larger than 2 Clauset et al. (2009). Suppose the degrees of a random graph G with N nodes follow a power-law distribution p(x) = Cx^{−α}, where C is a normalization constant. When α > 2, we approximately have Gómez-Gardenes & Latora (2008)

⟨H(G)⟩ = 1 / (α − 2), if N → ∞.

Clearly, a smaller scale-free exponent α results in a higher network entropy. Remark 1 (Connection between network entropy and typical structural properties). High network entropy arises in graphs with typical structural characteristics such as large average degree, large degree variance, and scale-free structure with a low scale-free exponent. Besides the above theoretical analysis, the motivation for choosing density, average degree, degree variance and scale-free exponent is similar to that of network entropy. Intuitively, graphs with larger average degree and higher density have more interactions among the nodes, thus providing more topological information to graph pre-training.
Also, the larger the diversity of node degrees, the more diverse the subgraph samples. The diversity of node degrees can be measured by the degree variance and the scale-free exponent. (A smaller scale-free exponent indicates a relatively longer tail of the degree distribution, i.e., the degree distribution spreads out wider.)
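The scale-free exponent α entering the criterion can be estimated from a degree sequence with the standard maximum-likelihood estimator of Clauset et al. (2009). The sketch below uses the continuous approximation with d_min fixed to the minimum observed degree; whether the paper uses this exact estimator is an assumption:

```python
import math

def scale_free_exponent(degrees, d_min=None):
    """Continuous MLE of the power-law exponent (Clauset et al., 2009):
    alpha = 1 + n / sum(log(d_i / d_min)), over the tail d_i >= d_min."""
    d_min = d_min if d_min is not None else min(degrees)
    tail = [d for d in degrees if d >= d_min]
    return 1.0 + len(tail) / sum(math.log(d / d_min) for d in tail)
```

On synthetic degrees drawn from p(x) ∝ x^{−α} via inverse-transform sampling, the estimator recovers α up to sampling noise.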

B ALGORITHM

The overall algorithm for APT is given in Algorithm 1. Given a collection of graphs G = {G_1, . . . , G_N} from various domains, APT aims to pre-train a better generalist GNN (i.e., the pre-training model) on wisely chosen graphs and samples. Our APT pipeline involves the following steps. (i) At the beginning, the graph selector chooses a graph for pre-training according to the graph properties (line 1). (ii) Given the chosen graph, the graph selector chooses the subgraph samples in this graph whose predictive uncertainty is higher than T_s (line 3). (iii) The selected samples are then fed into the model for pre-training until the predictive uncertainty of the chosen graph falls below T_g or the number of training iterations on this graph reaches F (lines 4-5). (iv) The model's feedback in turn helps select the most needed graph based on predictive uncertainty and graph properties, until the predictive uncertainty of every candidate graph is low enough (lines 6-7). The last three steps are repeated until the iteration number reaches a pre-set maximum value T (which can be considered as the total iteration number required to train on all selected graphs).

Algorithm 1 Overall algorithm for APT.
Input: A collection of graphs G = {G_1, . . . , G_N},
1: Choose a graph G* from G according to the graph properties, and G ← G \ {G*}.
2: for t = 1, . . . , T do
3:     Sample instances with predictive uncertainty higher than T_s from G* via the graph selector.
4:     Update model parameters θ ← θ − µ∇_θ L(θ).
5:     if ϕ_uncertain(G*) < T_g or the model has been trained on G* for F iterations then
6:         Update the trade-off parameter γ_t ∼ Beta(1, β_t).
7:         Choose a graph G* from G, and G ← G \ {G*}.

(2) The time complexity of GNN encoder propagation depends on the architecture of the backbone GNN; we denote it as X here. (3) The time complexity of the contrastive loss is O(B^2 D) Li et al. (2022). (4) Sample selection is conducted by choosing the samples with high contrastive loss (the loss is computed beforehand), which costs O(B). (5) Graph selection costs O(|G|M^2 D), where M is the number of samples needed to compute the predictive uncertainty of a graph, and |G| is the number of graphs that have not been selected. This step is executed in only a few epochs (around 6% in our current model), so we ignore its time overhead in graph selection. Therefore, the overall time complexity of APT in each batch is O(B|V|^3 + X + B^2 D + B).
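The pipeline of Algorithm 1 can be sketched as a training loop. The selector and model interfaces below (`pick`, `sample`, `uncertainty`, `update_gamma`, `step`) are hypothetical names standing in for the components described above, not the original implementation:

```python
def apt_train(graphs, model, selector, T, F, T_g, T_s):
    """Data-active pre-training loop (simplified sketch of Algorithm 1)."""
    g = selector.pick(graphs)            # line 1: choose by graph properties
    graphs.remove(g)
    epochs_on_g = 0
    for t in range(T):                   # line 2: main loop over T iterations
        samples = [s for s in selector.sample(g)
                   if selector.uncertainty(s) > T_s]        # line 3
        model.step(samples)              # line 4: theta <- theta - mu * grad L
        epochs_on_g += 1
        if selector.uncertainty(g) < T_g or epochs_on_g >= F:  # line 5
            selector.update_gamma(t)     # line 6: gamma_t ~ Beta(1, beta_t)
            if not graphs:
                break                    # no candidate graph left
            g = selector.pick(graphs)    # line 7: move to the next graph
            graphs.remove(g)
            epochs_on_g = 0
    return model
```

With stub components, the loop visits graphs one by one and stops once the candidate pool is exhausted or T iterations are reached.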

C DATASET DETAILS

The graph datasets for pre-training and testing in this paper are collected from a wide spectrum of domains (see Table 3 for an overview). The considerations behind the choice of pre-training and test graphs are as follows. When selecting pre-training data, we hope that the graph size is at least hundreds of thousands, so as to contain enough information for pre-training. When selecting test data, we hope that: (1) some test data are in the same domain as the pre-training data, and some are cross-domain, so as to comprehensively evaluate our model's in-domain and cross-domain transferability (accordingly, the in-domain test data are selected from the movie and citation types, and the other test data are cross-domain); (2) the size of the test graphs scales from hundreds to millions. Regarding the pre-training datasets, arxiv, dblp and patents-main are citation networks collected from Bonchi et al. (2012), Yang & Leskovec (2012) and Hall et al. (2001), respectively. imdb is a movie network from Rossi & Ahmed (2015). As for the social networks, soc-sign0902 and soc-sign0811 are collected from Leskovec et al. (2009), wiki-vote is from Leskovec et al. (2010), academia is from Fire et al. (2011), and michigan, msu and uillinois are from Traud et al. (2012). Regarding the test datasets, we collect the protein networks dd and ogbproteins from Dobson & Doig (2003) and Hu et al. (2020a). The image network msrc-21 is from Neumann et al. (2016). The movie network imdb-binary is from Yanardag & Vishwanathan (2015). The citation networks cora, pubmed and ogbarxiv are from McCallum et al. (2000), Namata et al. (2012) and Hu et al. (2020a). The web networks cornell and wisconsin are collected from Pei et al. (2019). The transportation network brazil is from Ribeiro et al. (2017), and dd242, dd68 and dd687 are from Rossi & Ahmed (2015). The detailed graph properties of the pre-training data and test data are presented in Table 4 and Table 5, respectively.
2020) model with different model configurations (i.e., the number of GNN layers is set to 3, 4 and 5, respectively), when pre-trained on all training graphs listed in Table 3 and evaluated on different test graphs (annotated in the upper left corner of each figure) under the freezing setting. Note that GCC and GraphCL are the only two pre-training models that can be adopted in the cross-domain setting. For each experiment, we calculate the mean and standard deviation over 10 evaluation results of the downstream task with random training/testing splits. The observations on the GCC and GraphCL models can be found in Figure 5 and Figure 6, respectively. The downstream results on different test data are presented in separate rows. The figures in the left three columns present the effect of scaling up the number of graphs on the downstream performance under the different model configurations (i.e., numbers of GNN layers). We first pre-train the model with only two input graphs, and the result is plotted as a dotted line. The largest standard deviation among the results w.r.t. different graph lists is also marked by the blue arrow. The figures in the last column illustrate the effect of scaling up the sample size (log scale) on the performance. Table 6: The values of the parameters for fitting the curve according to the function f(x) = a_1 ln x / x^{a_2} + a_3 (a_1, a_2, a_3 > 0), based on the points in the last column of Figure 5 and Figure 6. The results indicate that network entropy, density, average degree and degree variance exhibit a clear positive correlation with the performance, while the scale-free exponent presents a clearly negative relation with the performance. On the contrary, some other graph properties, including clique number, transitivity, degree assortativity and average clustering coefficient, exhibit little or no correlation with the performance.
Therefore, the favorable properties of network entropy, density, average degree, degree variance and the scale-free exponent of a real graph are able to characterize the contribution of a graph to pre-training. 

G ADDITIONAL EXPERIMENTAL RESULTS

Effects of hyper-parameters {λ_j}. The hyper-parameters {λ_j} trade off the knowledge learnt from new data against that from previous data in Eq. (5). We simply set λ_1 = λ_2 = · · · = λ_k. We use the dataset dd242 as an example to find suitable values of the hyper-parameter under the L2 and EWC regularization settings respectively, and present the results here for reference (see Figure 9). Clearly, a too small or too large λ deteriorates the performance. Thus, an appropriate value of λ is preferred, so that the graph pre-training model can learn from new data while remembering previous knowledge. We leave varying {λ_j} over j as future work. Effects of hyper-parameters F, T_g, T_s. Our model training involves three hyper-parameters F, T_g and T_s, where F controls the largest number of epochs spent training on each graph, T_g is the predictive uncertainty threshold for moving to a new graph, and T_s is the predictive uncertainty threshold for choosing training samples. We use grid search over F ∈ {4, 5, 6}, T_g ∈ {3, 3.5, 4} and T_s ∈ {1, 2, 3} to study their roles in pre-training. F remains at 5 while studying T_g and T_s, T_g remains at 3.5 while studying F and T_s, and T_s remains at 2 while studying F and T_g. Figure 10 presents the effect of these parameters. We find that if the value of F is too small or that of T_g is too large, the model cannot learn sufficient knowledge from each graph, leading to suboptimal results. Too large F or too small T_g also leads to poor performance. This indicates that instead of training on one graph for a long period, it is better to switch among graphs in different domains to gain diverse and comprehensive knowledge. Regarding the hyper-parameter T_s, we observe that a large T_s leaves the model too few training samples to learn from, while a small T_s fails to select the most uncertain and representative samples; both cases thus achieve suboptimal performance.
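The one-at-a-time sweep above can also be written as a full grid search. A minimal sketch; `evaluate` is a hypothetical callable returning the downstream micro F1 for one configuration, and a full product search is a superset of the paper's protocol:

```python
from itertools import product

def grid_search(evaluate, Fs=(4, 5, 6), Tgs=(3, 3.5, 4), Tss=(1, 2, 3)):
    """Return the (F, T_g, T_s) triple with the best downstream score.

    Note: the paper fixes two hyper-parameters (F=5, T_g=3.5, T_s=2)
    while varying the third; the exhaustive search here covers all
    combinations instead.
    """
    best, best_cfg = float("-inf"), None
    for F, Tg, Ts in product(Fs, Tgs, Tss):
        score = evaluate(F, Tg, Ts)
        if score > best:
            best, best_cfg = score, (F, Tg, Ts)
    return best_cfg, best
```

Plugging in a toy objective peaked at the paper's defaults recovers (5, 3.5, 2).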
The choice of β_t, its alternatives, and ablation study. At the beginning of pre-training, the model is less accurate and needs more guidance from graph properties. We therefore set γ_t larger at the beginning and gradually decrease it. To simplify this process, we follow Cai et al. (2017) and use the exponential formula β_t = c_1 − c_2^t to set the expectation of γ_t to be strictly decreasing (where γ_t ∼ Beta(1, β_t)). The parameters c_1 and c_2 in the exponential formula β_t = c_1 − c_2^t are suggested as 1.005 and 0.995 in Cai et al. (2017). We simply perform grid search on c_1 in {1.005, 3, 5}; see the effect of c_1 in Figure 11. We then illustrate that the choice of the decay function of β_t is robust. The initial value β_1 is set the same as ours. While there is no universally better decay function, the performance of our method is not significantly impacted by the choice of decay function, and our performance is better than the baselines in most cases regardless of the specific choice. Impact of combining the five graph properties. As a further experimental analysis, we study the strategies of utilizing only one graph property in Table 10 and Table 11. We find that the five properties used in our model are all indispensable, and the most important one probably varies across tasks and datasets. That is why we choose to combine all graph properties. Moreover, these case studies may provide some clues on how to select pre-training graphs when some knowledge of the downstream task is known. For example, if the downstream dataset is extremely dense (like imdb-binary), the density property dominates among the selection criteria (such that the probability of encountering very dense out-of-distribution samples during testing can be reduced). If the entropy of the downstream dataset is very high (like brazil), it is perhaps better to choose graphs with high entropy for pre-training.
But still, when the downstream task is unknown, using the combination of the five metrics often leads to the most satisfactory and robust results. The justification of the input graphs' learning order. Table 12 reveals that the downstream performance can be affected by the learning order of the input training graphs. With the guidance of the graph selector, the pre-training model is encouraged to first learn the graphs and samples with higher predictive uncertainty and graph property scores. This learning order accomplishes better downstream performance than the reverse or a random one. The choice of the "difficult" data. Among all the data, "difficult" samples contribute the most to the loss function, and thus they yield gradients with large magnitude. Comparatively, training with easy samples may suffer from inefficiency and poor performance, as these data points produce gradients with magnitudes close to zero Huang et al. (2016); Sohn (2016). In addition, learning from difficult samples has been proven to accelerate convergence and enhance the expressive power of the learnt representations Suh et al. (2019); Schroff et al. (2015). For our model, the importance of learning from difficult samples is also justified empirically, as shown in Table 13. Training time. As empirically noted in Table 14:
• The time spent on inference over all graphs during graph selection (the main time cost of graph selection) accounts for only 3.95% and 3.87% of the total time under APT-L2 and APT, respectively. Note that this step is executed in only a few epochs (around 6% in our current model), if and only if the condition in line 5 of Algorithm 1 is satisfied.
• The time cost of the L2 regularization term accounts for only 0.08% of the total time, and that of the EWC regularization term for only 0.45%, calculated as the runtime gap between the models with and without the regularization term.
Note that the regularization term is imposed on the first two layers of the GNN encoder, which account for only 12.4% of the total number of parameters. The efficiency of our model is due to the much smaller number of carefully selected training graphs and samples at each epoch. In addition, the number of parameters in our model is 190,544, which is of the same order of magnitude as classical GNNs like GraphSAGE and GraphSAINT, and is relatively small among models in the open graph benchmark Hu et al. (2020b).

Table 13: The comparison of learning from easy samples and learning from difficult samples in our pipeline (APT-L2 (freeze)) on node classification. Micro F1 is reported in the table.
(Under the setting of learning from easy samples, we replace ϕ_uncertain with −ϕ_uncertain in Eq. (4), and only sample instances with predictive uncertainty lower than T_s.)
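The difficult-sample criterion reduces to thresholding the per-instance loss (reused as the predictive uncertainty, so selection adds only O(B)); a minimal sketch, with the sign flip for the easy-sample ablation:

```python
def select_difficult(samples, losses, T_s, easy=False):
    """Keep instances whose per-sample loss (predictive uncertainty)
    exceeds the threshold T_s; easy samples with near-zero loss yield
    near-zero gradients and are dropped. With easy=True, the criterion
    is inverted, mimicking the ablation that replaces phi_uncertain
    with -phi_uncertain."""
    if easy:
        return [s for s, l in zip(samples, losses) if l < T_s]
    return [s for s, l in zip(samples, losses) if l > T_s]
```

For instance, with losses [0.1, 3.2, 5.0] and T_s = 3, only the last two samples survive under the default (difficult) setting.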



Figure 1: Top row: The effect of scaling up sample size (log scale) on the downstream performance based on a group of GCCs Qiu et al. (2020) under different configurations (the graphs used for pre-training are kept as all eleven pre-training datasets in Table 3, and the samples are taken by the backbone pre-training model according to its sampling strategy). The results for different downstream graphs (and tasks) are presented in separate figures. To better show the changing trend, we fit a curve to the best performing models (i.e., the convex hull fit, as Abnar et al. (2022) does). Bottom row: The effect of scaling up the number of graph datasets on the downstream performance based on GCC. For a fixed horizontal coordinate, we run 5 trials. For each trial, we randomly choose a combination of input graphs. The shaded area indicates the standard deviation over the 5 trials. See Appendix D for more observations on other graph pre-training models and detailed settings.


Figure 2: Overview of the proposed data-active graph pre-training paradigm. The graph selector provides the graphs and samples suitable for pre-training, while the graph pre-training model learns from the incoming data in a progressive way and in turn better guides the selection process. In the graph selector component, Part (a) provides an illustrating example of the predictive uncertainty, and Part (b) plots the Pearson correlation between the properties of the input graph and the performance of the pre-trained model (using this graph) on different unseen test datasets (see Appendix E for other properties that exhibit little or no correlation with performance).

Figure 3: The illustrative graphs (from bottom left to top right) with increasing network entropy and the other four graph properties.

Figure 4: Predictive uncertainty of a learnt graph ("michigan") versus training epoch.

For node classification tasks, ProNE Zhang et al. (2019), DeepWalk Perozzi et al. (2014), struc2vec Ribeiro et al. (2017), DGI Velickovic et al. (2019), GAE Kipf et al. (2016), and GraphSAGE Hamilton et al. (2017) are adopted as baselines.

Pre-training in CV and NLP. Initially, the CV community benefited from models like Vision Transformers Liu et al. (2021), MLP-Mixers Tolstikhin et al. (2021) and ResNets He et al. (2016), which are supervised pre-trained on large-scale image data. To take full advantage of massive unlabeled data, the NLP community adapts self-supervised learning models like Transformer-based encoders Vaswani et al. (2017); Radford & Narasimhan (2018); Devlin et al. (2019) for language pre-training. When pre-training in CV and NLP, researchers find that scaling up the pre-training data size results in better or saturating downstream performance Tan & Le (2019); Kaplan et al. (2020); El-Nouby et al. (2021); Abnar et al. (2022); Raffel et al. (2020). However, this is not true in graph pre-training. In this paper we argue that adding input graphs or pre-training samples does not necessarily improve, and sometimes even deteriorates, the downstream performance. In view of the above phenomenon in CV and NLP pre-training, data selection is not an active research direction there. The only related research we notice focuses on domain-specific pre-training models, which select pre-training data most similar to the downstream domain Cui et al. (2018); Beltagy et al. (2019); Dai et al. (2019; 2020); Yan et al. (2020); Lee et al. (2020); Chakraborty et al. (2022). The assumption of knowledge of the downstream domain differs from the cross-domain graph pre-training in our paper, and thus data selection in CV/NLP pre-training is not that relevant to the current work. Graph pre-training. Taking inspiration from pre-training in CV and NLP, recent efforts have shed light on pre-training GNNs. Initially, some unsupervised graph representation learning methods can be used for graph pre-training Tang et al. (2015); He et al. (2016); Grover & Leskovec (2016); Narayanan et al. (2017); Ribeiro et al. (2017); Donnat et al. (2018); Zhang et al. (2019); Hamilton et al. (2017).
They are designed based on the neighborhood similarity assumption and thus cannot generalize to unseen nodes and graphs. Later, a line of graph self-supervised learning methods can also be treated as graph pre-training; they fall into two categories: graph generative models and graph contrastive models. Graph generative models capture universal graph patterns by recovering certain parts of the input graph (e.g., masked structure or attributes) Kipf et al. (2016); Wang et al. (2017); Hu et al. (2020c); Cui et al. (2020); Hou et al. (2022). These works rely on specific domain knowledge, for example, the node/edge/attribute types should be the same, which makes them difficult to transfer across different types of graphs. On the other hand, graph contrastive models maximize the agreement between positive pairs and minimize that between negative pairs Velickovic et al. (2019); Hu et al. (2020b); You et al. (2020a); Zhu et al. (2020); Hassani & Khasahmadi (2020); Sun et al. (2020); Li et al. (2021); Lu et al. (2021); Sun et al. (2021); Zhu et al. (2021b); Xu et al. (2021); Zhu et al. (2021a); Lee et al. (2022); Zeng & Xie (2021); Zhang et al. (2021b);

The time complexity of our model mainly consists of five components: data augmentation, GNN encoder propagation, contrastive loss, sample selection and graph selection. Suppose the maximal number of nodes in a subgraph instance is |V|, the batch size is B, and the representation dimension is D. (1) As for the data augmentation, the time complexity of random walk with restart is at least O(B|V|^3) Xia et al. (2019).

Figure 5: Additional observations of the curse of big data phenomenon, performed on different GCC pre-training models.

E EMPIRICAL STUDY OF GRAPH PROPERTIES

Additional properties for Part (b) in Figure 2. In Figure 7, we plot the Pearson correlation between the graph properties of the graph used in pre-training (shown on the y-axis) and the performance of the pre-trained model using this graph on different unseen test datasets (shown on the x-axis). Note that the pre-training is performed on each of the input training graphs (in Table 3) via GCC.


Figure 7: Pearson correlation between the structural features of the graph used in pre-training and the performance of the pre-trained model (using this graph) on different unseen test datasets.

Figure 9: Performance of our model on dd242 w.r.t varying {λ j }.

Figure 10: Performance of our model on dd242 w.r.t varying F, T g , T s .

Figure 11: Performance of APT-L2 (freeze) w.r.t varying c 1 .

Time comparison: pre-training vs. training from scratch. Using a pre-trained model can significantly reduce the time required relative to training from scratch. The reason is that the weights of the pre-trained model have already been placed close to appropriate and reasonable values; thus the model converges faster during fine-tuning on test data. As shown in Figure 12, compared to a regular GNN model (e.g., GIN), our model yields a speedup of 4.7× on average (measured by the ratio of the training time of GIN to the fine-tuning time of APT). Based on the above analysis, we conclude that pre-training is beneficial in both effectiveness and efficiency.

Figure 12: The running time of our model and the basic GNN model on graph classification task. Our model achieves a speedup of 4.7× on average compared with GIN.

Hard example mining learns from the examples that contribute the most to model training, and has been widely applied in computer vision, natural language processing and recommender systems Simo-Serra et al. (2014); Loshchilov & Hutter (2015); Shrivastava et al. (2016); Krishnan et al. (2020). Our usage of predictive uncertainty for choosing graphs is conceptually similar to hard example mining. However, existing approaches for hard sample mining cannot be directly applied to our setting, which has the following two requirements. (1) The chosen instances should follow a joint distribution that reflects the topologi-

Table 1: Micro F1 scores of different models in the node classification task. The column "A.R." reports the average rank of each model. An asterisk (*) denotes the best result on each dataset, and bold numbers denote the best result among graph pre-training models in the freezing or fine-tuning setting. The notation "/" means out of memory or no convergence for more than three days. We also find that some proximity-based models like ProNE enforce neighboring nodes to share similar representations; thus they perform well on graphs with strong homophily rather than weak homophily.

Table 2: Micro F1 of different models in the graph classification task.

Yanqiao Zhu, Yichen Xu, Qiang Liu, and Shu Wu. An empirical study of graph contrastive learning. In NeurIPS D&B, 2021a.

Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Graph contrastive learning with adaptive augmentation. In WWW, 2021b.

A THEORETICAL CONNECTION BETWEEN NETWORK ENTROPY AND TYPICAL GRAPH PROPERTIES

Many interesting graph structural properties from basic graph theory give rise to a graph with high network entropy Lynn et al. (2020).

Input: maximal period F of training one graph, trade-off parameter γ_t = 0, hyperparameters {β_t}, the learning rate μ, the predictive uncertainty threshold T_g of moving to a new graph, the predictive uncertainty threshold T_s of choosing training samples, and the maximum iteration number T.
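A minimal sketch of how these inputs might drive the data-active loop: train on the current graph with only the uncertain (hard) samples, and move to a new graph once the model is confident on it or the period F expires. All callables (`sample`, `uncertainty`, `train_step`, `select_next_graph`) are hypothetical placeholders, not the paper's actual API.

```python
# Hypothetical sketch of the data-active pre-training loop; the helper
# callables are caller-supplied stand-ins for the actual APT components.
def apt_loop(graphs, sample, uncertainty, train_step, select_next_graph,
             F=6, T_g=2.0, T_s=3.0, T=100):
    g = graphs[0]
    steps_on_g = 0
    history = []  # graphs visited, in order (for inspection)
    for t in range(T):
        history.append(g)
        # Keep only the candidate samples the model is uncertain about.
        batch = [x for x in sample(g) if uncertainty(x) > T_s]
        train_step(batch)
        steps_on_g += 1
        # Move on once the model is confident on g, or the period F expires.
        if uncertainty(g) < T_g or steps_on_g >= F:
            g = select_next_graph(graphs)   # the graph selector
            steps_on_g = 0
    return history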

Datasets for pre-training and testing, where * denotes the average statistic of multiple graphs under graph classification setting. |V | and |E| denote the number of nodes and the number of edges in a graph, respectively.

Detailed structural properties of the pre-training datasets, where avg properties equals MEAN(φ_entropy, φ_density, φ_avg deg, φ_deg var, −φ_α) in Eq. (4), nei_2 denotes the average number and standard deviation of 2-hop neighbors, and |V| and |E| denote the number of nodes and the number of edges in a graph, respectively.
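The structural quantities behind this caption can be sketched as below. The degree-distribution entropy stands in for φ_entropy, and the scale-free exponent uses the standard continuous MLE α = 1 + n / Σ ln(d / d_min); both are common choices, but the exact normalizations of Eq. (4) are not reproduced here, so the `score` is only illustrative.

```python
import math
from collections import Counter

# Sketch of the structural properties combined into `avg properties`.
# Normalizations from Eq. (4) are omitted; the power-law exponent uses
# the standard continuous MLE, which may differ from the paper's choice.
def graph_properties(n, edges):
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    degrees = [deg.get(i, 0) for i in range(n)]
    m = len(edges)
    density = 2 * m / (n * (n - 1))
    avg_deg = 2 * m / n
    deg_var = sum((d - avg_deg) ** 2 for d in degrees) / n
    # Degree-distribution entropy as a proxy for network entropy.
    cnt = Counter(degrees)
    entropy = -sum((c / n) * math.log(c / n) for c in cnt.values())
    pos = [d for d in degrees if d > 0]
    d_min = min(pos)
    alpha = 1 + len(pos) / max(sum(math.log(d / d_min) for d in pos), 1e-9)
    score = (entropy + density + avg_deg + deg_var - alpha) / 5
    return {"entropy": entropy, "density": density, "avg_deg": avg_deg,
            "deg_var": deg_var, "alpha": alpha, "score": score}

# Triangle: every node has degree 2, so variance is 0 and density is 1.
props = graph_properties(3, [(0, 1), (1, 2), (0, 2)])
print(props["avg_deg"], props["density"])  # 2.0 1.0
```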

Detailed structural properties of the test datasets, where nei_2 denotes the average number and standard deviation of 2-hop neighbors, and the numbers with * denote the average statistics of multiple graphs under the graph classification setting. |V|, |E| and |G| denote the number of nodes in a graph, the number of edges in a graph, and the number of graphs in graph classification datasets, respectively.

Micro F1 of APT-L2 (freeze) with different decay functions in the node classification task.

Micro F1 of APT-L2 (freeze) with different decay functions in the graph classification task.

Table 8 and Table 9 below show the effect of linear decay, step decay, and exponential decay on β_t. (The functions for linear decay and step decay are designed as β_t = 2.001 + 0.004t and β_t = 2.005 + floor(t/20), respectively.)
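The two schedules given in the text can be written directly as functions of the iteration t. The exponential schedule's constants are not stated here, so those in `beta_exp` are purely illustrative assumptions.

```python
import math

# Linear and step schedules are taken verbatim from the text; the
# exponential schedule's constants are illustrative assumptions.
def beta_linear(t):
    return 2.001 + 0.004 * t

def beta_step(t):
    return 2.005 + math.floor(t / 20)

def beta_exp(t, base=2.0, rate=0.01):   # hypothetical parameters
    return base + (math.exp(rate * t) - 1)

print(beta_linear(0), beta_step(0))  # 2.001 2.005
```

Note that the step schedule is piecewise constant, jumping by 1 every 20 iterations, while the linear schedule grows smoothly by 0.004 per iteration.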

As shown in Table 14, the total training time of APT-L2 and APT is 18321.39 seconds and 18592.01 seconds, respectively (including the time consumed by graph selection and the regularization term), while the competitive graph pre-training model GCC takes 40161.68 seconds for the same number of training epochs on the same datasets.

The effect of different graph properties on downstream performance (micro F1 is reported) under APT-L2 (fine-tune) in the node classification task. The last row is our strategy of combining all the graph properties, and each of the first five rows is the strategy of utilizing only one graph property.

Scale-free exponent: 79.70(2.71) 24.94(0.68) 17.26(0.63) 12.03(1.41) 64.77(2.31) 51.37(2.70) 45.18(0.52) 50.84(0.26)
Our combination: 78.75(1.63) 24.62(0.90) 17.83(1.35) 12.26(0.78) 67.04(1.50) 52.94(1.95) 47.48(0.46) 51.25(0.21)

The effect of different graph properties on downstream performance (micro F1 is reported) under APT-L2 (fine-tune) in the graph classification task.

The effect of the input graphs' learning order on downstream performance (micro F1 is reported) under the freezing mode in the node classification task. The first row is the order learned by APT-L2, and the second and third rows are the reverse and a random permutation of the first row, respectively.

[row label and leading value truncated] …45) 14.38(0.53) 11.76(1.04) 9.90(0.64) 50.65(1.84) 48.09(1.72) 35.74(0.42) 46.03(0.17)
Learning from difficult samples (ours): 69.82(2.32) 16.79(0.88) 12.68(0.81) 10.34(1.12) 55.11(1.74) 48.76(2.20) 34.27(0.43) 46.21(0.15)

Training time (sec) comparison between our model and GCC. All the models are trained under the same number of epochs, which is set as 100 in practice. (The difference in time cost of inference on all graphs is due to different runs.)


The explanation of the convex hull fit. To better show the changing trend, the blue curve in the last column of Figure 5 and Figure 6 is fitted to the convex hull of the points. The convex hull is proposed to capture the performance of a randomized classifier made by choosing pre-training models with different probabilities Abnar et al. (2022). We first introduce the concept of a randomized classifier. Given two classifiers with training sample size and downstream performance c_1 = (c_1^sz, c_1^ds) and c_2 = (c_2^sz, c_2^ds), a randomized classifier can be made to choose the first classifier with probability p and the second with probability 1 − p. The output of the randomized classifier is then p·c_1 + (1 − p)·c_2, a convex combination of c_1 and c_2. All the points on this convex combination can be obtained by choosing different p. Extending this notion to the case of multiple classifiers, we can consider the output of such a randomized classifier to be a convex combination of the outputs of its endpoints Abnar et al. (2022). All the points on the convex hull are achievable. Therefore, the output of the randomized classifier is equivalent to the convex hull of our trained classifiers' performance.

In our experiments, we use the upper hull of the convex hull of the model performances, i.e., the highest downstream performance for every given sample size. Such a convex hull fit has been shown to be robust to the density of the points in each figure Abnar et al. (2022).

A final remark is that our observations on different downstream datasets do not exhibit a one-model-fits-all trend. We therefore fit a curve of the form f(x) = a_1 ln x / x^{a_2} + a_3 (a_1, a_2, a_3 > 0) to the best performing models (i.e., the convex hull fit discussed above). The fitted parameters a_1, a_2 and a_3 of each curve are given in Table 6.

Additional real-world example for Figure 3.
In Figure 8 , we provide a real-world example of how network entropy correlates with four typical structural properties (in red), as well as the performance of the pre-trained model on test graphs (in blue). Numerical experiments again support our explanation (or intuition) of their strong correlation. 
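The upper-hull step of the convex hull fit can be sketched with a standard monotone-chain scan over (sample size, performance) points; this is a generic implementation of the idea, not the paper's plotting code.

```python
# Sketch of the convex-hull fit: keep, for each sample size, only the
# points on the upper hull of (size, performance) pairs, i.e. what a
# randomized classifier mixing trained models can achieve.
def upper_hull(points):
    """Monotone-chain upper convex hull; points are (size, perf) pairs."""
    pts = sorted(set(points))
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop the middle point if it lies on or below the chord.
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

pts = [(1, 0.5), (2, 0.9), (3, 0.6), (4, 1.0)]
print(upper_hull(pts))  # [(1, 0.5), (2, 0.9), (4, 1.0)]
```

The curve f(x) = a_1 ln x / x^{a_2} + a_3 would then be fitted (e.g., by nonlinear least squares) to the retained hull points only.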

F IMPLEMENTATION DETAILS

The numbers reported in all the experiments are the mean and standard deviation over 10 evaluation runs of the downstream task with random training/testing splits. For each downstream dataset, we consistently use 90% of the data as the training set and 10% as the testing set. We conduct all experiments on a single Linux machine with an Intel Xeon Gold 5118 (128G memory) and a GeForce GTX Tesla P4 (8GB memory). Our code is available at https://github.com/anonymous-APT-ai/Anonymous-APT-code.

Implementations of our model. The regularization on the model weights in Eq. (5) is applied to the first 2 layers of GIN. The maximal period F of training one graph is 6, the maximum iteration number T is 100, and the predictive uncertainty thresholds T_s and T_g are set to 3 and 2, respectively. The selected instances are sampled from 20,000 instances each epoch. Since the pre-training model is unable to provide precise predictive uncertainty in the initial training stage, the model is warmed up over the first 20 iterations. Since we adopt GCC as the backbone pre-training model, the other settings are the same as in GCC.

Implementations of baselines. We compare against several graph representation learning methods. For implementation, we directly adopt their public source code and most of their default hyperparameters. The key parameter settings and code links can be found in Table 7.

We here discuss two advantages of using the model loss (i.e., the InfoNCE loss) as predictive uncertainty to select samples. First, the InfoNCE loss is exactly the objective function of our model, so we are in effect selecting the samples with the greatest contributions to the objective function (i.e., the samples with the greatest InfoNCE loss). Such a strategy has been shown to accelerate convergence and enhance the discriminative power of the learned representations [1] [2] [3] [4].
Second, as the loss function of our model, the InfoNCE loss is already computed during training, so no additional computation is needed in the data selection phase.
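The selection rule above can be sketched as follows: compute the per-sample InfoNCE loss −log(exp(s⁺/τ) / Σ_j exp(s_j/τ)) from a positive similarity and a list of negative similarities, then keep the k samples with the largest loss. The similarity scores and the helper names are placeholders, not the paper's implementation.

```python
import math

# Sketch of InfoNCE loss as predictive uncertainty. `pos_sim` is the
# query-positive similarity, `neg_sims` the query-negative similarities;
# all values here are illustrative placeholders.
def info_nce(pos_sim, neg_sims, tau=0.07):
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)                        # stabilize the log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / tau - log_z)

def select_hardest(samples, k):
    """samples: list of (pos_sim, neg_sims); returns indices of top-k loss."""
    losses = [info_nce(p, n) for p, n in samples]
    return sorted(range(len(samples)), key=lambda i: -losses[i])[:k]

samples = [(0.9, [0.1, 0.2]),   # easy: positive clearly wins
           (0.3, [0.4, 0.5]),   # hard: negatives score higher
           (0.8, [0.7, 0.6])]   # borderline
print(select_hardest(samples, 1))  # [1]
```

Because these losses are the training objective itself, they come for free from the forward pass, which is exactly the second advantage discussed above.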

