MULTI-TASK SELF-SUPERVISED GRAPH NEURAL NETWORKS ENABLE STRONGER TASK GENERALIZATION

Abstract

Self-supervised learning (SSL) for graph neural networks (GNNs) has attracted increasing attention from the graph machine learning community in recent years, owing to its capability to learn performant node embeddings without costly label information. One weakness of conventional SSL frameworks for GNNs is that they learn through a single philosophy, such as mutual information maximization or generative reconstruction. When applied to various downstream tasks, these frameworks rarely perform equally well on every task, because one philosophy may not span the extensive knowledge required for all tasks. To enhance task generalization, as an important first step toward fundamental graph models, we introduce PARETOGNN, a multi-task SSL framework for node representation learning over graphs. Specifically, PARETOGNN is self-supervised by multiple pretext tasks observing multiple philosophies. To reconcile different philosophies, we explore a multiple-gradient descent algorithm, such that PARETOGNN actively learns from every pretext task while minimizing potential conflicts. We conduct comprehensive experiments over four downstream tasks (i.e., node classification, node clustering, link prediction, and partition prediction), and our proposal achieves the best overall performance across tasks on 11 widely adopted benchmark datasets. Moreover, we observe that learning from multiple philosophies enhances not only task generalization but also single-task performance, demonstrating that PARETOGNN achieves better task generalization via the disjoint yet complementary knowledge learned from different philosophies.

1. INTRODUCTION

Graph-structured data is ubiquitous in the real world (McAuley et al., 2015; Hu et al., 2020). To model the rich underlying knowledge in graphs, graph neural networks (GNNs) have been proposed and have achieved outstanding performance on various tasks, such as node classification (Kipf & Welling, 2016a; Hamilton et al., 2017), link prediction (Zhang & Chen, 2018; Zhao et al., 2022b), and node clustering (Bianchi et al., 2020; You et al., 2020b). These tasks form the archetypes of many real-world applications, such as recommendation systems (Ying et al., 2018; Fan et al., 2019) and predictive user behavior models (Pal et al., 2020; Zhao et al., 2021a; Zhang et al., 2021a). Existing graph learning frameworks are mostly narrow experts, with effectiveness guaranteed on only one or two tasks. However, given a graph learning framework, its promising performance on one task may not (and usually does not) translate to competitive results on other tasks. Consistent task generalization across various tasks and datasets is a significant and well-studied research topic in other domains (Wang et al., 2018; Yu et al., 2020). Results from natural language processing (Radford et al., 2019; Sanh et al., 2021) and computer vision (Doersch & Zisserman, 2017; Ni et al., 2021) have shown that models enhanced by self-supervised learning (SSL) over multiple pretext tasks observing diverse philosophies can achieve strong task generalization and learn intrinsic patterns that are transferable to multiple downstream tasks. Intuitively, SSL over multiple pretext tasks greatly reduces the risk of overfitting (Baxter, 1997; Ruder, 2017), because learning intrinsic patterns that well address difficult pretext tasks is non-trivial for only one set of parameters.
Moreover, gradients from multiple objectives regularize the learning model against extracting task-irrelevant information (Ren & Lee, 2018; Ravanelli et al., 2020), so that the model can learn multiple views of one training sample. Nonetheless, current state-of-the-art graph SSL frameworks are mostly built on a single pretext task with a single philosophy, such as mutual information maximization (Velickovic et al., 2019; Zhu et al., 2020; Thakoor et al., 2022), whitening decorrelation (Zhang et al., 2021b), and generative reconstruction (Hou et al., 2022). Though these methods achieve promising results in some circumstances, they usually do not retain competitive performance on all downstream tasks across different datasets. For example, DGI (Velickovic et al., 2019), grounded on mutual information maximization, excels at the partition prediction task but underperforms on node classification and link prediction. Besides, GRAPHMAE (Hou et al., 2022), based on feature reconstruction, achieves strong performance on datasets with powerful node features (e.g., where graph topology can be inferred simply from node features (Zhang et al., 2021d)), but suffers when node features are less informative, as we empirically demonstrate in this work. To bridge this research gap, we ask: how can we combine multiple philosophies to enhance task generalization for SSL-based GNNs? A very recent work, AUTOSSL (Jin et al., 2022), explores this direction by reconciling different pretext tasks through learned weights in a joint loss function, such that node-level pseudo-homophily is promoted. This approach has two major drawbacks: (i) not all downstream tasks benefit from the homophily assumption. In the experimental results of Jin et al. (2022), we observe key pretext tasks (e.g., DGI, based on mutual information maximization) being assigned zero weight.
However, our empirical study shows that the philosophies behind these neglected pretext tasks are essential for the success of some downstream tasks, and this phenomenon prevents GNNs from achieving better task generalization. (ii) In reality, many graphs do not follow the homophily assumption (Pei et al., 2019; Ma et al., 2021). Arguably, applying such an inductive bias to heterophilous graphs is contradictory and might yield sub-optimal performance. In this work, we adopt a different perspective: we remove the reliance on the graph's or task's alignment with homophily assumptions while self-supervising GNNs with multiple pretext tasks. During the self-supervised training of our proposed method, given a single graph encoder, all pretext tasks are simultaneously optimized and dynamically coordinated. We reconcile pretext tasks by dynamically assigning weights that promote Pareto optimality (Désidéri, 2012), such that the graph encoder actively learns knowledge from every pretext task while minimizing conflicts. We call our method PARETOGNN. Overall, our contributions are summarized as follows: • We investigate the problem of task generalization on graphs in a more rigorous setting, where a good SSL-based GNN should perform well not only over different datasets but also on multiple distinct downstream tasks simultaneously. We evaluate state-of-the-art graph SSL frameworks in this setting and unveil their sub-optimal task generalization. • To enhance task generalization, as an important first step toward fundamental graph models, we design five simple and scalable pretext tasks according to philosophies proven effective in the SSL literature and propose PARETOGNN, a multi-task SSL framework for GNNs.
PARETOGNN is simultaneously self-supervised by these pretext tasks, which are dynamically reconciled to promote Pareto optimality, such that the graph encoder actively learns knowledge from every pretext task while minimizing potential conflicts. • We evaluate PARETOGNN along with 7 state-of-the-art SSL-based GNNs on 11 acknowledged benchmarks over 4 downstream tasks (i.e., node classification, node clustering, link prediction, and partition prediction). Our experiments show that PARETOGNN improves the overall performance by up to +5.3% over the state-of-the-art SSL-based GNNs. Besides, we observe that PARETOGNN achieves state-of-the-art single-task performance, showing that PARETOGNN achieves better task generalization via the disjoint yet complementary knowledge learned from different philosophies.

2. MULTI-TASK SELF-SUPERVISED LEARNING VIA PARETOGNN

In this section, we illustrate our proposed multi-task self-supervised learning framework for GNNs, namely PARETOGNN. Figure 1 provides an overview: sub-graphs sampled from the input graph are transformed by task-specific augmentations $\mathcal{T}_1(\cdot), \dots, \mathcal{T}_K(\cdot)$, encoded by a shared GNN encoder, and the per-task gradients $\nabla_{\theta_g}\mathcal{L}_1, \nabla_{\theta_g}\mathcal{L}_2, \dots, \nabla_{\theta_g}\mathcal{L}_K$ are reconciled to promote Pareto optimality.

2.1. MULTI-TASK GRAPH SELF-SUPERVISED LEARNING

PARETOGNN is a general framework for multi-task self-supervised learning over graphs. We regard the full graph G as the data source; for each task, PARETOGNN is self-supervised by sub-graphs sampled from G, followed by task-specific augmentations (i.e., T k (•)). The rationale behind the exploration of sub-graphs is two-fold. Firstly, graph sampling is naturally a type of augmentation (Zeng et al., 2019) that enlarges the diversity of the training data. Secondly, modeling over sub-graphs is more memory-efficient, which is especially important in the multi-task scenario. In this work, we design five simple pretext tasks spanning three high-level philosophies: generative reconstruction, whitening decorrelation, and mutual information maximization. However, we note that PARETOGNN is not limited to the current learning objectives, and incorporating other philosophies is a straightforward extension. The three high-level philosophies and their corresponding five pretext tasks are as follows: • Generative reconstruction. Recent studies (Zhang et al., 2021d; Hou et al., 2022) have demonstrated that node features contain rich information that highly correlates with the graph topology. To encode node features into the representations derived by PARETOGNN, we mask the features of a random batch of nodes, forward the masked graph through the GNN encoder, and reconstruct the masked node features given the node representations of their local sub-graphs (Hou et al., 2022). Furthermore, we conduct a similar reconstruction process for links between connected nodes to retain pair-wise topological knowledge (Zhang & Chen, 2018). Feature and topology reconstruction are denoted as FeatRec and TopoRec, respectively. • Whitening decorrelation.
SSL based on whitening decorrelation has gained tremendous attention, owing to its capability of learning representative embeddings without prohibitively expensive negative pairs or offline encoders (Ermolov et al., 2021; Zbontar et al., 2021). We adapt the same philosophy to graph SSL by independently augmenting the same sub-graph into two views, then minimizing the distance between the same nodes in the two views while enforcing the feature-wise covariance of all nodes to equal the identity matrix. We denote this pretext task as RepDecor. • Mutual information maximization. Maximizing the mutual information between two corrupted views of the same target has been shown to capture intrinsic patterns, as demonstrated by deep infomax-based methods (Bachman et al., 2019; Velickovic et al., 2019) and contrastive learning methods (Hassani & Khasahmadi, 2020; Zhu et al., 2020). We maximize the local-global mutual information by minimizing the distance between the graph-level representation of the intact sub-graph and its node representations, while maximizing the distance between the former and corrupted node representations. Besides, we maximize the local sub-graph mutual information by maximizing the similarity between the representations of two views of the sub-graph entailed by the same anchor node, while minimizing the similarities between representations of sub-graphs entailed by different anchor nodes. The pretext tasks based on node-graph mutual information and node-subgraph mutual information are denoted as MI-NG and MI-NSG, respectively. Technical details and objective formulations of these five tasks are provided in Appendix B. As described above, pretext tasks under different philosophies capture distinct dimensions of the same graph. Empirically, we observe that simply combining all pretext SSL tasks with a weighted summation can sometimes already lead to better task generalization over various downstream tasks and datasets.
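To make the whitening-decorrelation pretext task concrete, the following is a minimal sketch of a RepDecor-style loss in the spirit described above (invariance between views plus a feature-wise covariance pulled toward the identity). The function name, the standardization details, and the trade-off weight `lam` are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of the whitening-decorrelation objective (RepDecor-style).
# Tensor names, shapes, and the `lam` weight are illustrative assumptions.
import torch

def rep_decor_loss(z1: torch.Tensor, z2: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    """z1, z2: [N, d] node representations of two augmented views of one sub-graph."""
    n, d = z1.shape
    # Standardize each view feature-wise.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    # Invariance: pull representations of the same node in the two views together.
    invariance = ((z1 - z2) ** 2).sum()
    # Decorrelation: push the feature-wise covariance toward the identity matrix.
    c1 = (z1.T @ z1) / n
    c2 = (z2.T @ z2) / n
    eye = torch.eye(d)
    decorrelation = ((c1 - eye) ** 2).sum() + ((c2 - eye) ** 2).sum()
    return invariance + lam * decorrelation
```

In practice `z1` and `z2` would come from the shared GNN encoder applied to two independently augmented sub-graphs; here the loss is written over generic tensors so the objective itself is visible.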
Though promising, according to our empirical studies, such a multi-task self-supervised GNN falls short on some downstream tasks compared with the best-performing experts on those tasks. This phenomenon indicates that, with the weighted summation, there exist potential conflicts between different SSL tasks, which has also been empirically shown by previous works from other domains such as computer vision (Sener & Koltun, 2018; Chen et al., 2018).

2.2. MULTI-TASK GRAPH SSL PROMOTING PARETO OPTIMALITY

To mitigate the aforementioned problem and simultaneously optimize multiple SSL tasks, we can derive the following empirical loss minimization formulation:

$$\min_{\theta_g, \theta_1, \dots, \theta_K} \sum_{k=1}^{K} \alpha_k \cdot \mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k), \qquad (1)$$

where $\alpha_k$ is the task weight for the $k$-th SSL task, computed according to pre-defined heuristics. For instance, AUTOSSL (Jin et al., 2022) derives task weights that promote pseudo-homophily. Though such a formulation is intuitively reasonable for some graphs and tasks, heterophilous graphs, which are not negligible in the real world, directly contradict the homophily assumption. Moreover, not all downstream tasks benefit from the homophily assumption, which we later validate in the experiments. Hence, it is non-trivial to come up with a unified heuristic that suits all graphs and downstream tasks. In addition, a weighted summation of multiple SSL objectives might cause undesirable behaviors, such as performance instabilities entailed by conflicting objectives or different gradient scales (Chen et al., 2018; Kendall et al., 2018). Therefore, we take an alternative approach and formulate this problem as multi-objective optimization with a vector-valued loss $L$:

$$\min_{\theta_g, \theta_1, \dots, \theta_K} L(\mathcal{G}, \theta_g, \theta_1, \dots, \theta_K) = \min_{\theta_g, \theta_1, \dots, \theta_K} \big[ \mathcal{L}_1(\mathcal{G}; \mathcal{T}_1, \theta_g, \theta_1), \dots, \mathcal{L}_K(\mathcal{G}; \mathcal{T}_K, \theta_g, \theta_K) \big]^{\top}. \qquad (2)$$

The focus of multi-objective optimization is approaching Pareto optimality (Désidéri, 2012), which in the multi-task SSL setting can be defined as follows:

Definition 1 (Pareto Optimality). A solution $(\theta_g^\star, \theta_1^\star, \dots, \theta_K^\star)$ is Pareto optimal if and only if there does not exist a solution that dominates it. $(\theta_g^\star, \theta_1^\star, \dots, \theta_K^\star)$ dominates $(\hat{\theta}_g, \hat{\theta}_1, \dots, \hat{\theta}_K)$ if for every SSL task $k$, $\mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \hat{\theta}_g, \hat{\theta}_k) \geq \mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g^\star, \theta_k^\star)$, and $L(\mathcal{G}, \hat{\theta}_g, \hat{\theta}_1, \dots, \hat{\theta}_K) \neq L(\mathcal{G}, \theta_g^\star, \theta_1^\star, \dots, \theta_K^\star)$.
In other words, if a self-supervised GNN is Pareto optimal, it is impossible to further optimize any SSL task without sacrificing the performance of at least one other SSL task. Finding the Pareto optimal model is not sensible if there exists a set of parameters that can easily fit all SSL tasks (i.e., no matter how different SSL tasks are reconciled, such a model approaches Pareto optimality where every SSL task is perfectly fitted). However, this is rarely the case, because solving all difficult pretext tasks is non-trivial for only one set of parameters. By promoting Pareto optimality, PARETOGNN is enforced to learn intrinsic patterns applicable to a number of pretext tasks, which further enhances the task generalization across various downstream tasks.
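The dominance relation in Definition 1 is easy to operationalize. The following is a small illustrative helper (the function name and list-of-losses representation are our own, not from the paper) that checks whether one solution's per-task losses Pareto-dominate another's, with lower loss being better:

```python
def dominates(losses_a, losses_b):
    """Return True iff solution a Pareto-dominates solution b: a is no worse on
    every SSL task and strictly better on at least one (lower loss is better)."""
    no_worse = all(la <= lb for la, lb in zip(losses_a, losses_b))
    strictly_better = any(la < lb for la, lb in zip(losses_a, losses_b))
    return no_worse and strictly_better
```

A solution is Pareto optimal exactly when no candidate in the feasible set makes this predicate true against it.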

2.3. PARETO OPTIMALITY BY MULTIPLE GRADIENT DESCENT ALGORITHM

To obtain the Pareto optimal parameters, we explore the Multiple Gradient Descent Algorithm (MGDA) (Désidéri, 2012) and adapt it to the multi-task SSL setting. Specifically, MGDA leverages the saddle-point test and theoretically proves that a solution (i.e., the combined gradient descent direction, or the task weight assignment in our case) satisfying the saddle-point test gives a descent direction that improves all tasks and eventually approaches Pareto optimality. We further elaborate on the saddle-point test for the shared parameters $\theta_g$ and task-specific parameters $\theta_k$ in Appendix G. In our multi-task SSL scenario, the optimization problem can be formulated as:

$$\min_{\alpha_1, \dots, \alpha_K} \Big\| \sum_{k=1}^{K} \alpha_k \cdot \nabla_{\theta_g} \mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k) \Big\|_F, \quad \text{s.t.} \ \sum_{k=1}^{K} \alpha_k = 1 \ \text{and} \ \forall k \ \alpha_k \geq 0, \qquad (3)$$

where $\nabla_{\theta_g} \mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k) \in \mathbb{R}^{1 \times |\theta_g|}$ refers to the gradient of the GNN encoder's parameters w.r.t. the $k$-th SSL task. PARETOGNN can be trained using Equation (1) with the task weights derived from the above optimization. As shown in Figure 1, optimizing the above objective essentially finds the descent direction with the minimum norm within the convex hull defined by the gradient directions of the SSL tasks. Hence, the solution to Equation (3) is straightforward when $K = 2$ (i.e., only two gradient descent directions are involved). If the squared norm of one gradient is no larger than their inner product, the solution is simply the gradient with the smaller norm (i.e., $(\alpha_1 = 0, \alpha_2 = 1)$ or vice versa). Otherwise, $\alpha_1$ can be calculated in one step by deriving the descent direction perpendicular to the line connecting the two gradients:

$$\alpha_1 = \frac{\big( \nabla_{\theta_g} \mathcal{L}_2(\mathcal{G}; \mathcal{T}_2, \theta_g, \theta_2) - \nabla_{\theta_g} \mathcal{L}_1(\mathcal{G}; \mathcal{T}_1, \theta_g, \theta_1) \big) \, \nabla_{\theta_g} \mathcal{L}_2(\mathcal{G}; \mathcal{T}_2, \theta_g, \theta_2)^{\top}}{\big\| \nabla_{\theta_g} \mathcal{L}_2(\mathcal{G}; \mathcal{T}_2, \theta_g, \theta_2) - \nabla_{\theta_g} \mathcal{L}_1(\mathcal{G}; \mathcal{T}_1, \theta_g, \theta_1) \big\|_F^2}. \qquad (4)$$

When $K > 2$, we minimize the quadratic form $\alpha (\nabla_{\theta_g} L)(\nabla_{\theta_g} L)^{\top} \alpha^{\top}$, where $\nabla_{\theta_g} L \in \mathbb{R}^{K \times |\theta_g|}$ refers to the vertically concatenated gradients w.r.t.
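The two-task case (Eq. 4) can be sketched directly. Below, `g1` and `g2` stand for the flattened shared-encoder gradients of the two SSL tasks (an assumption for illustration; the paper works with $\nabla_{\theta_g}\mathcal{L}_1$ and $\nabla_{\theta_g}\mathcal{L}_2$), and the function returns the weight $\alpha_1$ of the min-norm convex combination:

```python
# Sketch of the two-task min-norm solution; g1, g2 are flattened task gradients.
import torch

def min_norm_pair(g1: torch.Tensor, g2: torch.Tensor) -> float:
    """Return alpha_1 such that alpha_1*g1 + (1-alpha_1)*g2 has minimum norm
    over the segment between g1 and g2."""
    # Endpoint cases: one gradient's squared norm is no larger than the inner
    # product, so the min-norm point is that gradient alone.
    if g1 @ g2 >= g1 @ g1:
        return 1.0  # g1 is the smaller gradient; use it alone
    if g1 @ g2 >= g2 @ g2:
        return 0.0  # g2 is the smaller gradient; use it alone
    # Interior case: project the origin onto the segment between g1 and g2.
    diff = g2 - g1
    return float((diff @ g2) / (diff @ diff))
```

For two orthogonal gradients of equal norm, this yields $\alpha_1 = 0.5$, i.e., the balanced average of the two descent directions.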
$\theta_g$ for all SSL tasks, and $\alpha \in \mathbb{R}^{1 \times K}$ is the vector of task weight assignments such that $\|\alpha\|_1 = 1$. Inspired by the Frank-Wolfe algorithm (Jaggi, 2013), we iteratively solve this quadratic problem as a special case of Equation (4). Specifically, we first initialize every element of $\alpha$ to $1/K$, and we increment the weight of the task (denoted as $t$) whose descent direction correlates least with the current combined descent direction (i.e., $\sum_{k=1}^{K} \alpha_k \cdot \nabla_{\theta_g} \mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k)$). The step size $\eta$ of this increment can be calculated using the idea of Equation (4), where we replace $\nabla_{\theta_g} \mathcal{L}_2(\mathcal{G}; \mathcal{T}_2, \theta_g, \theta_2)$ with $\sum_{k=1}^{K} \alpha_k \cdot \nabla_{\theta_g} \mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k)$ and replace $\nabla_{\theta_g} \mathcal{L}_1(\mathcal{G}; \mathcal{T}_1, \theta_g, \theta_1)$ with $\nabla_{\theta_g} \mathcal{L}_t(\mathcal{G}; \mathcal{T}_t, \theta_g, \theta_t)$. One iteration of solving this quadratic problem is formulated as:

$$\alpha := (1 - \eta) \cdot \alpha + \eta \cdot e_t, \quad \text{s.t.} \ \eta = \frac{\bar{\nabla}_{\theta_g} \big( \bar{\nabla}_{\theta_g} - \nabla_{\theta_g} \mathcal{L}_t(\mathcal{G}; \mathcal{T}_t, \theta_g, \theta_t) \big)^{\top}}{\big\| \bar{\nabla}_{\theta_g} - \nabla_{\theta_g} \mathcal{L}_t(\mathcal{G}; \mathcal{T}_t, \theta_g, \theta_t) \big\|_F^2}, \qquad (5)$$

where $t = \arg\min_r \sum_{i=1}^{K} \alpha_i \cdot \nabla_{\theta_g} \mathcal{L}_i(\mathcal{G}; \mathcal{T}_i, \theta_g, \theta_i) \cdot \nabla_{\theta_g} \mathcal{L}_r(\mathcal{G}; \mathcal{T}_r, \theta_g, \theta_r)^{\top}$ and $\bar{\nabla}_{\theta_g} = \sum_{k=1}^{K} \alpha_k \cdot \nabla_{\theta_g} \mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k)$. Here $e_t$ is a one-hot vector with the $t$-th element equal to 1. The optimization described in Equation (5) iterates until $\eta$ is smaller than a small constant $\xi$ or the number of iterations reaches the pre-defined number $\gamma$. Furthermore, the above task reconciliation of PARETOGNN satisfies the following theorem.

Theorem 1. Assume that $\alpha (\nabla_{\theta_g} L)(\nabla_{\theta_g} L)^{\top} \alpha^{\top}$ is $\beta$-smooth. Then $\alpha$ converges to the optimal point at a rate of $O(1/\gamma)$, and $\alpha$ is at most $4\beta / (\gamma + 1)$ away from the optimal solution.

The proof of Theorem 1 is provided in Appendix A. According to this theorem and our empirical observations, the optimization process in Equation (5) terminates and delivers well-approximated results with $\gamma$ set to an affordable value (e.g., $\gamma = 100$).
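The Frank-Wolfe-style iteration above can be sketched as follows. `G` is an assumed `[K, P]` matrix stacking the K flattened shared-encoder gradients, and `gamma` / `xi` are the iteration cap and the small stopping constant from the text; the clamping of the step size to [0, 1] is a standard detail we assume here.

```python
# A sketch of the iterative min-norm solver of Eq. (5); G stacks the K
# per-task shared-encoder gradients as rows. gamma and xi are the stopping
# hyper-parameters described in the text.
import torch

def frank_wolfe_weights(G: torch.Tensor, gamma: int = 100, xi: float = 1e-4) -> torch.Tensor:
    K = G.shape[0]
    alpha = torch.full((K,), 1.0 / K)       # initialize every weight to 1/K
    for _ in range(gamma):
        g_bar = alpha @ G                   # current combined descent direction
        t = int(torch.argmin(G @ g_bar))    # task least correlated with g_bar
        diff = g_bar - G[t]
        denom = diff @ diff
        if denom <= 0:                      # g_bar already coincides with G[t]
            break
        eta = float(torch.clamp((g_bar @ diff) / denom, 0.0, 1.0))
        if eta < xi:                        # step too small: converged
            break
        alpha = (1 - eta) * alpha           # Eq. (5): move toward e_t
        alpha[t] += eta
    return alpha
```

For two orthogonal, equal-norm gradients the solver keeps the balanced weights (1/2, 1/2); for two parallel gradients it puts all weight on the smaller one.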

3.1. EXPERIMENTAL SETTING

Datasets. We conduct comprehensive experiments on 11 real-world benchmark datasets extensively explored by the graph community. They include 8 homophilous graphs, namely Wiki-CS, Pubmed, Amazon-Photo, Amazon-Computer, Coauthor-CS, Coauthor-Physics, ogbn-arxiv, and ogbn-products (McAuley et al., 2015; Yang et al., 2016; Hu et al., 2020), as well as 3 heterophilous graphs, namely Chameleon, Squirrel, and Actor (Tang et al., 2009; Rozemberczki et al., 2021). Beyond graph homophily, this list of datasets covers graphs with other distinctive characteristics (i.e., from thousands of nodes to millions, and from feature dimensions in the hundreds to almost ten thousand), to fully evaluate task generalization under different scenarios. A detailed description of these datasets can be found in Appendix C. Downstream Tasks and Evaluation Metrics. We evaluate all models on the four most commonly used downstream tasks: node classification, node clustering, link prediction, and partition prediction, whose performance is quantified by accuracy, normalized mutual information (NMI), area under the receiver operating characteristic curve (AUC), and accuracy, respectively, following the same evaluation protocols as previous works (Tian et al., 2014; Kipf & Welling, 2016a; Zhang & Chen, 2018). Evaluation Protocol. For all downstream tasks, we follow the standard linear-evaluation protocol on graphs (Velickovic et al., 2019; Jin et al., 2022; Thakoor et al., 2022), where the parameters of the GNN encoder are frozen during inference and only logistic regression models (for node classification, link prediction, and partition prediction) or K-Means models (for node clustering) are trained to conduct the downstream tasks. For datasets with available public splits (i.e., ogbn-arxiv and ogbn-products), we utilize the given public splits for the evaluations on node classification, node clustering, and partition prediction.
For the other datasets, we use a random 10%/10%/80% train/validation/test split, following the same setting as explored in other literature. To evaluate link prediction on large graphs, where permuting all possible edges is prohibitively expensive, we randomly sample 210,000, 30,000, and 60,000 edges for training, validation, and testing. For medium-scale graphs, we use a random 70%/10%/20% split, following the same standard as explored by Zhang & Chen (2018) and Zhao et al. (2022b). To prevent label leakage in link prediction, we evaluate it with a separate, identically configured model trained with the testing and validation edges removed. Labels for partition prediction are induced by METIS partitioning (Karypis & Kumar, 1998), and we use 10 partitions for each dataset. All reported performance is averaged over 10 independent runs with different random seeds. Baselines. We compare the performance of PARETOGNN with 7 state-of-the-art self-supervised GNNs, including DGI (Velickovic et al., 2019), GRACE (Zhu et al., 2020), MVGRL (Hassani & Khasahmadi, 2020), AUTOSSL (Jin et al., 2022), BGRL (Thakoor et al., 2022), CCA-SSG (Zhang et al., 2021b), and GRAPHMAE (Hou et al., 2022). These baselines are experts in at least one of the philosophies of our pretext tasks, and comparing PARETOGNN with them demonstrates the improvement brought by multi-task self-supervised learning as well as by promoting Pareto optimality. Hyper-parameters. To ensure a fair comparison, for all models we use GNN encoders with the same architecture (i.e., GCN encoders with the same number of layers), fix the hidden dimension of the GNN encoders, and utilize the recommended settings provided by the authors. Detailed configurations of other hyper-parameters for PARETOGNN are given in Appendix D.

3.2. PERFORMANCE GAIN FROM THE MULTI-TASK SELF-SUPERVISED LEARNING

We conduct experiments on our pretext tasks by individually evaluating their performance as well as their task generalization, as shown in Table 1. We first observe that no single-task model simultaneously achieves competitive performance on every downstream task for all datasets, demonstrating that knowledge learned through a single philosophy does not suffice for strong and consistent task generalization. Models trained by a single pretext task alone are narrow experts (i.e., delivering satisfactory results on only a few tasks or datasets), and their expertise does not translate to strong and consistent task generalization across various downstream tasks and datasets. For instance, TopoRec achieves promising performance on link prediction (i.e., an average rank of 3.7) but falls short on all other tasks (i.e., ranked 5.9, 5.3, and 5.9). Similarly, MI-NSG performs reasonably well on partition prediction (i.e., an average rank of 3.0) but underperforms on link prediction (i.e., an average rank of 4.4). However, comparing them with the model trained by combining all pretext tasks through a weighted summation (i.e., w/o Pareto), we observe that the latter achieves both stronger task generalization and better single-task performance. The model w/o Pareto achieves an average rank of 2.4 on the average performance, which is 1.2 ranks higher than the best single-task model. This phenomenon indicates that multi-task self-supervised GNNs indeed enable stronger task generalization. Multiple objectives regularize the learning model against extracting redundant information, so that the model learns multiple complementary views of the given graphs. Besides, multi-task training also improves the performance on the individual downstream tasks by 1.5, 0.7, 0.3, and 0.6 ranks, respectively.
In some cases (e.g., node clustering on PUBMED and CHAMELEON, or link prediction on WIKI.CS and CO.CS), we observe large performance margins between the best-performing single-task models and the vanilla multi-task model w/o Pareto, indicating that there exist potential conflicts between different SSL tasks. PARETOGNN mitigates these performance margins by promoting Pareto optimality, which enforces the learning model to capture intrinsic patterns applicable to a number of pretext tasks while minimizing potential conflicts. As shown in Table 1, PARETOGNN is top-ranked on both the average metric and the individual downstream tasks, demonstrating strong task generalization as well as promising single-task performance. Specifically, PARETOGNN achieves an outstanding average rank of 1.0 on the average performance over the four downstream tasks. As for the performance on individual tasks, PARETOGNN achieves average ranks of 1.7, 1.6, 1.0, and 1.2 on the four individual downstream tasks, outperforming the corresponding best baselines by 0.2, 1.2, 1.3, and 1.1 ranks, respectively. This demonstrates that promoting Pareto optimality not only improves task generalization across various tasks and datasets but also enhances single-task performance in multi-task self-supervised learning.

3.3. PERFORMANCE COMPARED WITH OTHER UNSUPERVISED BASELINES

We compare PARETOGNN with 7 state-of-the-art SSL frameworks for graphs, as shown in Table 2. Similar to our previous observations, no SSL baseline simultaneously achieves competitive performance on every downstream task for all datasets. Specifically, from the perspective of individual tasks, CCA-SSG performs well on node classification but underperforms on link prediction and partition prediction. Besides, DGI works well on partition prediction but does not perform as well on other tasks. From the perspective of datasets, we observe that AUTOSSL performs well on homophilous datasets but not on heterophilous ones, demonstrating that the assumption of reconciling tasks by promoting graph homophily is not applicable to all graphs. PARETOGNN achieves a competitive average rank of 1.0 on the average performance, significantly outperforming the runner-ups by 2.7 ranks. Moreover, PARETOGNN achieves average ranks of 1.8, 1.9, 1.0, and 1.2 on the four individual tasks, outrunning the best-performing baselines by 0.9, 2.3, 3.3, and 1.6 ranks, respectively, which further proves the strong task generalization and single-task performance of PARETOGNN.

3.4. PERFORMANCE WHEN SCALING TO LARGER DIMENSIONS

To evaluate the scalability of PARETOGNN, we conduct experiments along two dimensions: the graph size and the model size. We expect our proposal to retain its strong task generalization when applied to large graphs, and to deliver even stronger performance as the model dimension scales up, compared with single-task frameworks of equally large dimensions, since the multi-task SSL setting enables the model to learn more. The results are shown below. From Figure 2, we observe that the task generalization of PARETOGNN improves with the model dimension, indicating that PARETOGNN is capable of learning more, compared with single-task models like BGRL, whose performance saturates with less-parameterized models (e.g., a hidden dimension of 128 for AM.COMP. and 256 for ARXIV). Moreover, from Table 3, we notice that graph size is not a limiting factor for the strong task generalization of our proposal. Specifically, PARETOGNN outperforms the runner-ups by 2.5 on the rank of the average performance.

4. RELATED WORKS

Graph Neural Networks. Graph neural networks (GNNs) are powerful learning frameworks for extracting representative information from graphs (Kipf & Welling, 2016a; Veličković et al., 2017; Hamilton et al., 2017; Xu et al., 2018b; Klicpera et al., 2019; Xu et al., 2018a; Fan et al., 2022; Zhang et al., 2019). They aim at mapping input nodes into low-dimensional vectors, which can be further utilized for either graph-level or node-level tasks. Most GNNs explore a layer-wise message passing scheme (Gilmer et al., 2017; Ju et al., 2022a), where a node iteratively extracts information from its first-order neighbors. They are applied in many real-world applications, such as predictive user behavior modeling (Zhang et al., 2021c; Wen et al., 2022), molecular property prediction (Zhang et al., 2021e; Guo et al., 2021), and question answering (Ju et al., 2022b). Self-supervised Learning for GNNs. For node-level tasks, current state-of-the-art graph SSL frameworks are mostly built on a single pretext task with a single philosophy (You et al., 2020b), such as mutual information maximization (Velickovic et al., 2019; Zhu et al., 2020; Hassani & Khasahmadi, 2020; Thakoor et al., 2022), whitening decorrelation (Zhang et al., 2021b), and generative reconstruction (Kipf & Welling, 2016b; Hou et al., 2022). For graph-level tasks, previous works explore mutual information maximization to encourage different augmented views of the same graph to share similar representations (You et al., 2020a; Xu et al., 2021; You et al., 2021; Li et al., 2022; Zhao et al., 2022a). Multi-Task Self-supervised Learning. Multi-task SSL is broadly explored in the computer vision (Lu et al., 2020; Doersch & Zisserman, 2017; Ren & Lee, 2018; Yu et al., 2020; Ni et al., 2021) and natural language processing (Wang et al., 2018; Radford et al., 2019; Sanh et al., 2021; Ravanelli et al., 2020) fields.
For the graph community, AutoSSL (Jin et al., 2022) explores a multi-task setting where tasks are reconciled in a way that promotes graph homophily (Zhao et al., 2021b). Besides, Hu et al. (2019) focus on graph-level tasks and pre-train GNNs in separate stages. In these frameworks, tasks are reconciled according to weighted summation or pre-defined heuristics.

5. CONCLUSION

We study the problem of task generalization for SSL-based GNNs in a more rigorous setting and demonstrate that their promising performance on one or two tasks usually does not translate into good task generalization across various downstream tasks and datasets. In light of this, we propose PARETOGNN to enhance task generalization via multi-task self-supervised learning. Specifically, PARETOGNN is self-supervised by multiple pretext tasks observing multiple philosophies, which are reconciled by a multiple-gradient descent algorithm promoting Pareto optimality. Through our extensive experiments, we show that multi-task SSL indeed enhances task generalization. Aided by our proposed task reconciliation, PARETOGNN further enlarges the margin by actively learning from multiple tasks while minimizing potential conflicts. Compared with 7 state-of-the-art SSL-based GNNs, PARETOGNN is top-ranked on the average performance. Besides stronger task generalization, PARETOGNN achieves better single-task performance, demonstrating that disjoint yet complementary knowledge from different philosophies is learned through multi-task SSL.

A PROOF TO THEOREM 1

Here we re-state Theorem 1 before diving into its proof:

Theorem 1. Assume that $\alpha (\nabla_{\theta_g} L)(\nabla_{\theta_g} L)^{\top} \alpha^{\top}$ is $\beta$-smooth. Then $\alpha$ converges to the optimal point at a rate of $O(1/\gamma)$, and $\alpha$ is at most $4\beta / (\gamma + 1)$ away from the optimal solution.

Proof. Let $\phi(\alpha)$ denote $\alpha (\nabla_{\theta_g} L)(\nabla_{\theta_g} L)^{\top} \alpha^{\top}$. If $\phi(\cdot)$ is $\beta$-smooth, we have:

$$\| \nabla\phi(\alpha) - \nabla\phi(\alpha') \| \leq \beta \cdot \| \alpha - \alpha' \|, \qquad (6)$$

from which we can also derive a quadratic upper bound:

$$\phi(\alpha') \leq \phi(\alpha) + \nabla\phi(\alpha)^{\top} (\alpha' - \alpha) + \frac{\beta}{2} \cdot \| \alpha' - \alpha \|^2, \qquad (7)$$

where $\alpha'$ refers to the weight combination after one iteration from $\alpha$. Combining Equations (5) and (7), we have:

$$\phi(\alpha') - \phi(\alpha) \leq \nabla\phi(\alpha)^{\top} (\alpha' - \alpha) + \frac{\beta}{2} \cdot \| \alpha' - \alpha \|^2 \leq \eta \cdot \nabla\phi(\alpha)^{\top} (e_t - \alpha) + \frac{\beta}{2} \eta^2 \cdot R^2, \qquad (8)$$

where $R = \sup_{\alpha_1, \alpha_2 \in \mathcal{X}} \| \alpha_1 - \alpha_2 \|$ refers to the diameter of the domain of $\alpha (\nabla_{\theta_g} L)(\nabla_{\theta_g} L)^{\top} \alpha^{\top}$.
In our case, R = √ 2 since ||α|| = 1. The derivation above is valid because α ′ = (1 -η) • α + η • e t , as shown in Equation (5). Since t = arg min r K i=1 α i • ∇ θg L i (G; T i , θ g , θ i ) • ∇ θg L r (G; T r , θ g , θ r ) ⊺ , we have: ϕ(α ′ ) -ϕ(α) ≤ η • ∇ϕ(α) ⊺ • (α * -α) + β • η 2 , ≤ η • ϕ(α * ) -ϕ(α) + β • η 2 , where α * is the unknown optimal solution. By rearranging terms in the above inequality, we have: ϕ(α ′ ) -ϕ(α * ) ≤ (1 -η) • ϕ(α) -ϕ(α * ) + β • η 2 . (10) The left hand side of the above equation is the distance between the updated solution and the optimal solution (we denote this distance as δ ′ = ϕ(α ′ ) -ϕ(α * )). And on the right hand side, inside the parenthesis we have the distance between the previous solution and the optimal solution (denoting this distance as δ = ϕ(α) -ϕ(α * )) as: δ ′ ≤ (1 -η) • δ + β • η 2 , where in our case, η = ∇θg • ∇θg -∇ θg Lt(G;Tt,θg,θt) ⊺ ∇θg -∇ θg Lt(G;Tt,θg,θt) F . In this proof, we assume η = 2 γt+1 , where γ t refers to the index of the current iteration. Such an assumption is a relaxed version of our formulation upon η, hence any proof that holds for this assumption also holds for our case. Next, we show δ γ ≤ 4β γ+1 in Theorem 1 through proof-by-induction: It's straight-forward and easy to validate that our derivation stands for the base case γ = 2, where δ 2 ≤ 4β 3 . Here we show that it also holds for γ + 1: δ γ+1 ≤ (1 -η) • δ γ + β • η 2 , ≤ (1 - 2 γ + 1 ) • 4β γ + 1 + β • ( 2 γ + 1 ) 2 , = γ -1 γ + 1 • 4β γ + 1 + 4β (γ + 1) 2 , = 4β γ + 1 • γ γ + 1 , ≤ 4β γ + 1 (12)
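The Frank-Wolfe-style update analyzed above is easy to sketch in code. The snippet below is our own minimal illustration, not the paper's implementation: `min_norm_weights` is a hypothetical name, the task gradients are passed as rows of a matrix, and we use the relaxed step size $\eta = 2/(\gamma_t + 1)$ from the proof rather than the paper's line-search formula.

```python
import numpy as np

def min_norm_weights(grads, n_iter=100):
    """Frank-Wolfe iterations toward the minimum-norm point in the convex
    hull of per-task gradients. `grads` is a (K, P) array whose k-th row is
    the gradient of task k w.r.t. the shared parameters."""
    K = grads.shape[0]
    G = grads @ grads.T                 # (K, K) Gram matrix of task gradients
    alpha = np.full(K, 1.0 / K)         # start from uniform task weights
    for it in range(1, n_iter + 1):
        # t = argmin_r sum_i alpha_i <grad_i, grad_r>, i.e. argmin of (G @ alpha)
        t = int(np.argmin(G @ alpha))
        e_t = np.zeros(K)
        e_t[t] = 1.0
        eta = 2.0 / (it + 1)            # relaxed step size 2 / (gamma_t + 1)
        alpha = (1 - eta) * alpha + eta * e_t
    return alpha
```

With two directly conflicting gradients the weights converge to the balanced point where the combined gradient vanishes; with orthogonal gradients they converge to the uniform blend, matching the $O(1/\gamma)$ rate above.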

B DESCRIPTION OF THE PRETEXT TASKS

In this section, we demonstrate the design of our five proposed pretext tasks: two based on generative reconstruction (i.e., FeatRec and TopoRec), one based on whitening decorrelation (i.e., RepDecor), and two based on mutual information maximization (i.e., MI-NG and MI-NSG). We first explain the graph convolution operation proposed by Kipf & Welling (2016a). Its key mechanism is layer-wise message passing, where a node iteratively extracts information from its first-order neighbors; information from multi-hop neighbors can be captured through stacked convolution layers. Specifically, at the $l$-th layer, this process is formulated as $H^{l+1} = \sigma(A \cdot H^l \cdot W^l)$, where $H^0 = X$, $A \in \{0,1\}^{N \times N}$ is the adjacency matrix of the input graph, $\sigma(\cdot)$ refers to the non-linear activation function, $W^l \in \mathbb{R}^{d_l \times d_{l+1}}$ refers to the learnable parameters of the $l$-th layer, and $d_l$ and $d_{l+1}$ are the hidden dimensions at these two consecutive layers, respectively. The graph encoder $f_g(\cdot; \theta_g): \mathcal{G} \rightarrow \mathbb{R}^{N \times d}$ of PARETOGNN is constructed from stacked graph convolution layers as $f_g(\mathcal{G}; \theta_g) = H^L = \sigma(A \cdot H^{L-1} \cdot W^{L-1})$, where $L$ stands for the number of layers in the encoder of PARETOGNN and $\theta_g = \{W^l\}_{l=0}^{L-1}$. As briefly described in Section 2.1, we regard the full graph $\mathcal{G}$ as the data source; for each task, PARETOGNN is self-supervised by sub-graphs sampled from $\mathcal{G}$, followed by task-specific augmentations (i.e., $\mathcal{T}_t(\cdot)$). The graph sampling strategy is fairly straightforward: for pretext task $t$, we select the sub-graph constituted by nodes within $k_t$ hops of $N_t$ randomly selected seed nodes, where $k_t$ and $N_t$ are two task-specific hyper-parameters. The graph augmentation operations we explore include feature masking, edge dropping, and node dropping. For simplicity of notation, we unify graph sampling and graph augmentation into a single operation $\mathcal{T}_t(\cdot)$.
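The layer-wise message passing above can be sketched in a few lines of NumPy. This is a hedged illustration under our own simplifications: we use ReLU as $\sigma(\cdot)$ and the plain adjacency matrix from the text (the Kipf & Welling formulation additionally adds self-loops and degree-normalizes $A$); `gcn_layer` and `encoder` are hypothetical names.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H_{l+1} = sigma(A @ H_l @ W_l),
    with ReLU standing in for the non-linearity sigma."""
    return np.maximum(A @ H @ W, 0.0)

def encoder(A, X, weights):
    """Stacked layers f_g(G; theta_g), with theta_g = {W_0, ..., W_{L-1}}
    passed as a list; H_0 = X."""
    H = X
    for W in weights:
        H = gcn_layer(A, H, W)
    return H
```

Each layer maps the $N \times d_l$ representation to $N \times d_{l+1}$, so an $L$-layer stack mixes information from $L$-hop neighborhoods.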
Task-specific hyper-parameters for graph augmentation and sub-graph sampling are covered in Appendix D.
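The sub-graph sampling and augmentation operator $\mathcal{T}_t(\cdot)$ described above can be sketched as follows. This is our own stdlib-only illustration, not the paper's implementation: the function names, the dict-based graph representation, and the drop rates are all assumptions for exposition.

```python
import random

def sample_subgraph(adj, seeds, k):
    """Collect all nodes within k hops of the seed nodes (the sub-graph
    sampler inside T_t). `adj` maps a node id to its neighbor list."""
    frontier, visited = set(seeds), set(seeds)
    for _ in range(k):
        frontier = {v for u in frontier for v in adj[u]} - visited
        visited |= frontier
    return visited

def augment(edges, feats, p_edge=0.2, p_feat=0.3, rng=None):
    """Task-specific augmentation: edge dropping plus feature masking.
    The drop probabilities here are illustrative, not tuned values."""
    rng = rng or random.Random(0)
    kept = [e for e in edges if rng.random() > p_edge]
    masked = {v: [0.0 if rng.random() < p_feat else x for x in xs]
              for v, xs in feats.items()}
    return kept, masked
```

In practice a library sampler (e.g., DGL's k-hop or GraphSAINT-style samplers) would replace the BFS loop, but the control flow is the same: sample around $N_t$ seeds, then apply the task's augmentations.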

B.1 GENERATIVE RECONSTRUCTION

Feature reconstruction, denoted as FeatRec, builds on the high-level idea from Zhang et al. (2021d), which shows that topological information can be inferred purely from the node features. To exploit this inductive bias, following the implementation of GraphMAE (Hou et al., 2022), we mask the node features and forward the masked graph through $f_g(\cdot; \theta_g)$. Then we re-mask the previously masked nodes and feed the resulting graph to a convolution-based decoder, formulated as $\tilde{X}' = A' \cdot \big(f_g(\mathcal{G}'; \theta_g) \odot M\big) \cdot W_{\text{Dec}}$, where $\odot$ refers to the Hadamard product, $A'$ is the adjacency matrix of the sampled sub-graph $\mathcal{G}' \sim \mathcal{T}_{\text{FeatRec}}(\mathcal{G})$, $M \in \{0,1\}^{N' \times d}$ is the feature mask matrix whose rows equal 1 if their corresponding nodes are targeted for reconstruction, and $W_{\text{Dec}} \in \mathbb{R}^{d \times D}$ is the parameter matrix of the feature decoder. The objective for FeatRec is formulated as
$$\mathcal{L}_{\text{FeatRec}} = \frac{\|\tilde{X}' \odot \tilde{M} - X' \odot \tilde{M}\|_F}{\|X' \odot \tilde{M}\|_F},$$
where $\tilde{M} \in \{0,1\}^{N' \times D}$ is a mask matrix defined similarly to $M$ but with a different dimension, and $X' \in \mathbb{R}^{N' \times D}$ is the feature matrix of the sampled sub-graph $\mathcal{G}' \sim \mathcal{T}_{\text{FeatRec}}(\mathcal{G})$.

Topological reconstruction, denoted as TopoRec, aims at capturing the pair-wise relationships between connected nodes. Given a sampled sub-graph $\mathcal{G}' \sim \mathcal{T}_{\text{TopoRec}}(\mathcal{G})$, we randomly select $B$ pairs of nodes $\mathcal{V}^+ = \{(i,j) \mid A'_{i,j} = 1\}$ and another $B$ pairs of nodes $\mathcal{V}^- = \{(i,j) \mid A'_{i,j} = 0\}$. The connection between two nodes $i$ and $j$ is measured by a logit calculated as $P_{\text{TopoRec}}(i,j) = \sigma\big((f_g(\mathcal{G}'; \theta_g)[i] \odot f_g(\mathcal{G}'; \theta_g)[j]) \cdot W_{\text{Topo}}\big)$, where $[\cdot]$ refers to the indexing operation and $W_{\text{Topo}} \in \mathbb{R}^{d \times 1}$ is the parameter vector. The objective of TopoRec maximizes $P_{\text{TopoRec}}$ for pairs in $\mathcal{V}^+$ and minimizes it for pairs in $\mathcal{V}^-$, formulated as a binary cross-entropy loss:
$$\mathcal{L}_{\text{TopoRec}} = -\frac{1}{2B}\Big(\sum_{(i,j)\in\mathcal{V}^+} \log\big(P_{\text{TopoRec}}(i,j)\big) + \sum_{(i,j)\in\mathcal{V}^-} \log\big(1 - P_{\text{TopoRec}}(i,j)\big)\Big).$$
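The two reconstruction objectives can be illustrated with a short NumPy sketch. This is a simplified stand-in under our own assumptions (the sampling, masking, and decoder steps are omitted; `featrec_loss` and `toporec_logit` are hypothetical names): it shows only the masked scaled-Frobenius error and the per-edge probability.

```python
import numpy as np

def featrec_loss(X_rec, X, mask):
    """FeatRec-style error: ||(X_rec - X) * M||_F / ||X * M||_F,
    computed only over masked (reconstruction-target) entries."""
    diff = (X_rec - X) * mask
    return np.linalg.norm(diff) / np.linalg.norm(X * mask)

def toporec_logit(H, i, j, w_topo):
    """TopoRec-style edge probability: sigmoid((h_i * h_j) @ w_topo),
    where H holds the encoder's node representations."""
    z = (H[i] * H[j]) @ w_topo
    return 1.0 / (1.0 + np.exp(-z))
```

A perfect reconstruction yields zero FeatRec loss, and the TopoRec probability is pushed toward 1 for sampled positive pairs and 0 for negative pairs via binary cross-entropy.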
For Chameleon, Actor, and Squirrel, the datasets are downloaded from the official repository of Geom-GCN (Pei et al., 2019) foot_2.

Hardware and software configurations. We conduct experiments on a server with one RTX3090 GPU with 24 GB VRAM, an AMD Ryzen 3990X CPU, and 128 GB RAM. The software we use includes DGL 1.9.0 and PyTorch 1.11.0.

Baseline implementation. For the baseline models that we compare PARETOGNN with, we explore the implementations provided by the following code repositories:
• DGI (Velickovic et al., 2019): https://github.com/dmlc/dgl/tree/master/examples/pytorch/dgi
• GraphMAE (Hou et al., 2022): https://github.com/THUDM/GraphMAE
We sincerely appreciate the authors of these works for open-sourcing their valuable code and the researchers at DGL for providing reliable implementations of these models.

E TRAINING TIME AND MEMORY CONSUMPTION

Scalability w.r.t. the graph dimensions is well handled by our utilization of sampled sub-graphs and is experimentally verified by PARETOGNN's strong performance over large graphs, as shown in Table 3. On top of this, we also measure the training time and GPU memory consumption to give a direct empirical understanding of PARETOGNN's overhead, as shown in Table 6. For AUTOSSL, we notice that the calculation of the pseudo-homophily is extremely slow because this process cannot enjoy GPU acceleration; the GPU remains mostly idle during the training of AUTOSSL. Though PARETOGNN is not as efficient as BGRL when the graphs are small-scale, for large graphs such as OGBN-PRODUCTS all methods require sampling strategies, and in this case the efficiency of PARETOGNN is on par with that of BGRL: BGRL learns from one large (though sampled) graph, while PARETOGNN learns from multiple relatively small graphs, which entails similar computational overheads.

F PERFORMANCE OF INDIVIDUAL TASKS ON LARGE GRAPHS

In Table 3, we demonstrate the task generalization of all models over large graphs (i.e., ogbn-arxiv and ogbn-products), quantified by the average performance over the four downstream tasks. Here we provide additional experimental results on every individual task, as shown in Table 7. We notice that the graph dimension is not a limiting factor for the strong task generalization of our proposal. Besides the conclusion drawn in Section 3.4, where PARETOGNN outperforms the runner-ups by 2.5 in rank of the average performance calculated over the four tasks, we also observe strong single-task performance on some tasks, demonstrating that PARETOGNN achieves better task generalization via the disjoint yet complementary knowledge learned from different philosophies.

G SADDLE-POINT TEST FOR PARAMETERS IN PARETOGNN

In PARETOGNN, the saddle-point test conditions (Désidéri, 2012) for the shared GNN encoder (i.e., $f_g(\cdot;\theta_g)$) and the task-specific heads (i.e., $f_k(\cdot;\theta_k)$) are defined as follows:

• For $\theta_g$, there exists an $\alpha$ such that every element of $\alpha$ is greater than or equal to 0, $\|\alpha\| = 1$, and $\sum_{k=1}^{K} \alpha_k \cdot \nabla_{\theta_g}\mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k) = 0$.
• For $\{\theta_k\}_{k=1}^{K}$, we have $\nabla_{\theta_k}\mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k) = 0$ for every $k$.

According to MGDA, whenever these conditions are not met, the minimum-norm element of the convex hull of the task gradients gives a descent direction that improves all tasks, which enhances task generalization while minimizing potential conflicts.
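For two tasks, the $\theta_g$ condition can be checked in closed form. The snippet below is our own illustration (the function name and the $K = 2$ restriction are assumptions, not part of PARETOGNN): it finds the convex combination of the two gradients with minimum norm and reports whether it approximately vanishes, i.e., whether the shared encoder is Pareto-stationary.

```python
import numpy as np

def pareto_stationary(g1, g2, tol=1e-6):
    """Check the theta_g saddle-point condition for K = 2 tasks.
    Minimizing ||a*g1 + (1-a)*g2|| over a in [0, 1] has the closed-form
    minimizer a = <g2 - g1, g2> / ||g1 - g2||^2, clipped to [0, 1]."""
    diff = g1 - g2
    denom = diff @ diff
    a = 0.5 if denom == 0 else float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    d = a * g1 + (1 - a) * g2
    return np.linalg.norm(d) < tol, a
```

Directly opposing gradients admit a vanishing convex combination (a stationary point), whereas orthogonal gradients leave a nonzero minimum-norm direction, which MGDA then follows as a common descent direction.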

H CONNECTIONS TO OTHER PARETO LEARNING FRAMEWORKS

This section is greatly inspired by valuable comments from the reviewers of this paper foot_3. We sincerely appreciate the reviewers' endeavors to help us further refine this paper. Task reconciliation by Pareto learning is mostly explored in the Computer Vision community (Lin et al., 2019; Mahapatra & Rajan, 2020; Chen et al., 2021). In their settings, there usually exist one main task and multiple auxiliary tasks. These works explore Pareto learning with preference vectors to enforce the learning models to maintain good performance on the main task while extracting as much information from the auxiliary tasks as possible. Pareto learning with user-defined preference vectors is sensible in their cases, as they do not want the auxiliary tasks to hinder the learning of the main task (i.e., a preference for the main task over the auxiliary tasks). For example, PSST (Chen et al., 2021) enhances the model's few-shot learning capability (i.e., the main task) by optimizing the model with the main task as well as additional pretext tasks (i.e., the auxiliary tasks). In this case, while maintaining competitive performance on the main task, PSST also regularizes the learning model against extracting task-irrelevant information (Ren & Lee, 2018; Ravanelli et al., 2020) by training with auxiliary tasks, which empirically further improves performance on the main task. However, our scenario is completely different from theirs, because we do not want one task governing the others (i.e., all tasks are main tasks). The task reconciliation proposed in our paper is not biased toward any task. If preference vectors were explored, some pretext tasks would inevitably be jeopardized due to the definition of Pareto optimality (i.e., Definition 1). Such a bias over the pretext tasks would cause the self-supervised GNNs not to perform equally well on every downstream task and dataset, hurting the overall average performance and task generalization, which are the main focuses of this paper.
To empirically validate that utilizing a preference vector contradicts our goal, we conduct experiments on PARETOMTL (Lin et al., 2019) with five preference vectors (i.e., five vectors with different preferences over the five tasks), as shown in Table 8. We observe that multi-task self-supervised learning via Pareto learning with a preference vector most of the time does not even outperform the vanilla weighted summation (i.e., w/o Pareto), due to the bias introduced by the preference vectors. Our proposed PARETOGNN consistently outperforms variants based on PARETOMTL with different preferences. Though variants with preference vectors sometimes approach the performance of PARETOGNN (e.g., PARETOMTL with MI-NSG on WIKI.CS), arriving at this result requires a grid search over all possible preference vectors, which is inefficient when heuristics such as prior knowledge are not available. Including preferences in multi-task learning can be very helpful when one task or a set of training tasks needs to be highlighted, as demonstrated by the experiments in PARETOMTL. But this is not our case, because to achieve strong task generalization for GNNs, the philosophies contained in all pretext tasks are equally important. Our implementation of PARETOMTL comes from its official GitHub repository foot_4.



DGL: https://www.dgl.ai
foot_1: https://ogb.stanford.edu
foot_2: https://github.com/graphdml-uiuc-jlu/geom-gcn
foot_3: Detailed reviews are available at https://openreview.net/forum?id=1tHAZRqftM
foot_4: https://github.com/Xi-L/ParetoMTL



Figure 1: PARETOGNN is simultaneously self-supervised by K SSL tasks. All SSL tasks share the same GNN encoder and have their own projection heads (i.e., $f_k(\cdot;\theta_k)$) and augmentations (i.e., $\mathcal{T}_k(\cdot)$), such as node or edge dropping, feature masking, and graph sampling. To dynamically reconcile all tasks, we assign weights to tasks such that the combined descent direction has the minimum norm in the convex hull, which promotes Pareto optimality and further enhances task generalization.

…simultaneously to enhance the task generalization. Specifically, given a graph $\mathcal{G}$ with $N$ nodes and their corresponding $D$-dimensional input features $X \in \mathbb{R}^{N \times D}$, PARETOGNN learns a GNN encoder $f_g(\cdot;\theta_g): \mathcal{G} \rightarrow \mathbb{R}^{N \times d}$ parameterized by $\theta_g$ that maps every node in $\mathcal{G}$ to a $d$-dimensional vector (s.t. $d \ll N$). The resulting node representations should retain competitive performance across various downstream tasks without any update on $\theta_g$. With $K$ self-supervised tasks, we consider the loss function for the $k$-th SSL task as $\mathcal{L}_k(\mathcal{G}; \mathcal{T}_k, \theta_g, \theta_k): \mathcal{G} \rightarrow \mathbb{R}^+$, where $\mathcal{T}_k$ refers to the graph augmentation function required for the $k$-th task, and $\theta_k$ refers to the task-specific parameters of the $k$-th task (e.g., an MLP projection head and/or a GNN decoder). In PARETOGNN, all SSL tasks are dynamically reconciled by promoting Pareto optimality, where the norm of the combined gradient w.r.t. the parameters of our GNN encoder $\theta_g$ is minimized in the convex hull. Such gradients guarantee a descent direction toward Pareto optimality, which enhances task generalization while minimizing potential conflicts.

Performance and task generalization of our proposed SSL pretext tasks. w/o Pareto stands for combining all the objectives via vanilla weighted summation. RANK refers to the average rank among all variants given an evaluation metric. Bold indicates the best performance and underline indicates the runner-up, with standard deviations as subscripts.

Performance and task generalization of PARETOGNN as well as the state-of-the-art unsupervised baselines. OOM stands for out-of-memory on a RTX3090 GPU with 24 GB memory.

Table 3: Task generalization on large graphs. (*: Graphs are sampled by GRAPHSAINT (Zeng et al., 2019) to match the memory of others due to OOM.) Individual tasks are reported in Appendix F.

Table 4: Dataset statistics.

The hyper-parameters for PARETOGNN across all datasets are listed in Table 5.

Table 5: Hyper-parameters used for PARETOGNN. SAINT stands for the sampling strategy proposed in GRAPHSAINT (Zeng et al., 2019); we use its node version.


Table 6: Training time and memory consumption of PARETOGNN. *: the model is trained on sub-graphs with dimensions matching the maximum GPU memory (i.e., 24 GB).

Table 7: The performance and task generalization of PARETOGNN as well as state-of-the-art unsupervised baselines over large graphs. (*: Graphs are sampled by GRAPHSAINT (Zeng et al., 2019) to match the memory of others due to OOM.)

Table 8: The task generalization of multi-task learning with weighted summation (i.e., w/o Pareto), PARETOGNN, and five PARETOMTL variants with preferences favoring different tasks.

ACKNOWLEDGMENTS

This work is partially supported by the NSF under grants IIS-2209814, IIS-2203262, IIS-2214376, IIS-2217239, OAC-2218762, CNS-2203261, CNS-2122631, CMMI-2146076, and the NIJ 2018-75-CX-0032. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any funding agencies.

ETHICS STATEMENT

We observe no ethical concern entailed by our proposal, but we note that both ethical and unethical applications based on graphs may benefit from the stronger task generalization of our work. Care should be taken to ensure socially positive and beneficial results of machine learning algorithms.

REPRODUCIBILITY STATEMENT

Our code is publicly available at https://github.com/jumxglhf/ParetoGNN. The hyperparameters and other variables required to reproduce our experiments are described in Appendix D.


B.2 WHITENING DECORRELATION

Representation decorrelation, denoted as RepDecor, encourages similarity between the representations of the same nodes in two independently augmented sub-graphs. During this process, to prevent the node representations from collapsing into a trivial solution, the covariance matrix of each sub-graph's representations is enforced to be an identity matrix, so that the knowledge learned by each dimension of the hidden space is orthogonal to the others (Ermolov et al., 2021; Zbontar et al., 2021; Zhang et al., 2021b). Given two sub-graphs $\mathcal{G}'_1, \mathcal{G}'_2 \sim \mathcal{T}_{\text{RepDecor}}(\mathcal{G})$ constituted by the same seed nodes but augmented differently, with node representations $Z_1 = f_g(\mathcal{G}'_1;\theta_g)$ and $Z_2 = f_g(\mathcal{G}'_2;\theta_g)$, the objective of RepDecor is formulated as
$$\mathcal{L}_{\text{RepDecor}} = \|Z_1 - Z_2\|_F^2 + \alpha\big(\|Z_1^\intercal Z_1 - I_{d\times d}\|_F^2 + \|Z_2^\intercal Z_2 - I_{d\times d}\|_F^2\big),$$
where the first term encourages node similarity and the second term regularizes the solution against collapsing, $I_{d\times d}$ is the square identity matrix with dimension $d \times d$, and $\alpha$ refers to a pre-defined balancing term (i.e., we use an $\alpha$ of 1e-3 across all datasets).
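The RepDecor objective described above can be sketched with NumPy. This is a hedged illustration under our assumptions (`repdecor_loss` is a hypothetical name, and the exact normalization in the paper's implementation may differ): an invariance term pulls the two views of the same nodes together, while a whitening penalty pushes each view's $d \times d$ covariance toward the identity.

```python
import numpy as np

def repdecor_loss(Z1, Z2, alpha=1e-3):
    """Invariance term ||Z1 - Z2||_F^2 plus a decorrelation penalty
    ||Z^T Z - I||_F^2 on each view, preventing collapse to a trivial
    solution by decorrelating the hidden dimensions."""
    d = Z1.shape[1]
    I = np.eye(d)
    invariance = np.linalg.norm(Z1 - Z2) ** 2
    decorrelation = (np.linalg.norm(Z1.T @ Z1 - I) ** 2
                     + np.linalg.norm(Z2.T @ Z2 - I) ** 2)
    return invariance + alpha * decorrelation
```

Identical, already-whitened views incur zero loss, whereas collapsed representations (all rows equal) are penalized by the decorrelation term even though the invariance term vanishes.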

B.3 MUTUAL INFORMATION MAXIMIZATION

Mutual information between nodes and the whole graph, denoted as MI-NG, enables the graph encoder to learn coarse graph-level knowledge. Specifically, given a sampled sub-graph $\mathcal{G}' \sim \mathcal{T}_{\text{MI-NG}}(\mathcal{G})$, we first corrupt $\mathcal{G}'$ into $\mathcal{G}''$ by feature shuffling (Velickovic et al., 2019). Then we extract the hidden graph-level representation of $\mathcal{G}'$ by graph mean pooling (Xu et al., 2018a), and enforce the representations of nodes in $\mathcal{G}'$ to be similar to the pooled representation while nodes in $\mathcal{G}''$ are pushed far away from it. This pretext task allows the graph encoder to capture the perturbation brought by the topological change (i.e., feature shuffling). With $s = \text{POOL}(f_g(\mathcal{G}';\theta_g))$, the objective of MI-NG is formulated as a binary cross-entropy:
$$\mathcal{L}_{\text{MI-NG}} = -\frac{1}{N'}\sum_{i=1}^{N'}\Big(\log \sigma\big((f_g(\mathcal{G}';\theta_g)[i] \,\|\, s) \cdot W_{\text{MI-NG}}\big) + \log\big(1 - \sigma\big((f_g(\mathcal{G}'';\theta_g)[i] \,\|\, s) \cdot W_{\text{MI-NG}}\big)\big)\Big),$$
where $\text{POOL}(\cdot)$ refers to the graph mean pooling function (Xu et al., 2018a), $\|$ is the horizontal concatenation operation, and $W_{\text{MI-NG}} \in \mathbb{R}^{2d \times 1}$ is the parameter vector.

Mutual information between nodes and their sub-graphs, denoted as MI-NSG, enables the graph encoder to learn fine-grained graph-level knowledge. Unlike MI-NG, which enforces mutual information between node representations and the graph-level representation, MI-NSG maximizes the mutual information between the independently augmented sub-graphs entailed by the same anchor nodes, which learns fine-grained knowledge compared with MI-NG. Specifically, given two sub-graphs $\mathcal{G}'_1, \mathcal{G}'_2 \sim \mathcal{T}_{\text{MI-NSG}}(\mathcal{G})$ constituted by the same seed nodes but augmented differently, with $Z_1 = f_g(\mathcal{G}'_1;\theta_g)$ and $Z_2 = f_g(\mathcal{G}'_2;\theta_g)$, the objective of MI-NSG is formulated as a variant of InfoNCE (Chen et al., 2020):
$$\mathcal{L}_{\text{MI-NSG}} = -\frac{1}{N'}\sum_{i=1}^{N'}\log\frac{\exp\big(\mathcal{F}(Z_1[i], Z_2[i])/\tau\big)}{\sum_{j=1}^{N'}\exp\big(\mathcal{F}(Z_1[i], Z_2[j])/\tau\big)},$$
where $\mathcal{F}$ is the similarity metric, $\exp(\cdot)$ stands for the exponential function, and $\tau$ is the temperature hyper-parameter used to control the sharpness of the similarity distribution (i.e., we explore a $\tau$ of 0.1 across all datasets).
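The MI-NSG objective is essentially InfoNCE over paired node representations, which can be sketched as below. This is our own illustration, with cosine similarity standing in for the metric $\mathcal{F}$ and a mean reduction over nodes; `info_nce` is a hypothetical name.

```python
import numpy as np

def info_nce(Z1, Z2, tau=0.1):
    """InfoNCE-style loss: the i-th node of view 1 should be most similar
    to the i-th node of view 2 among all N candidates. Rows are
    L2-normalized so the dot product is cosine similarity."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    sims = Z1 @ Z2.T / tau                             # (N, N) similarity logits
    logits = sims - sims.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # -log softmax on the diagonal
```

When the two views agree (matching rows are the nearest neighbors), the loss approaches zero; misaligned views incur a large penalty, which is the signal that drives the encoder toward augmentation-invariant, fine-grained representations.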

C DATASET DESCRIPTION

We evaluate our proposed PARETOGNN as well as unsupervised SSL-based GNNs on 11 real-world datasets spanning various fields such as citation networks and merchandise networks. Their statistics are shown in Table 4. For Wiki-CS, Pubmed, Amazon-Photo, Amazon-Computer, Coauthor-CS, and Coauthor-Physics, we use the API from the Deep Graph Library (DGL) to load the datasets. For ogbn-arxiv and ogbn-products, we use the API from the Open Graph Benchmark (OGB) foot_1.

