RELATIONAL CURRICULUM LEARNING FOR GRAPH NEURAL NETWORKS

Abstract

Graph neural networks have achieved great success in representing structured data and performing downstream tasks such as node classification. The key idea is to recursively propagate and aggregate information along the edges of a given graph topology. However, edges in real-world graphs often have varying degrees of difficulty, and some edges may even be irrelevant or noisy for downstream tasks. Therefore, existing graph neural network models may learn suboptimal representations because they usually treat every edge in a given graph topology equally. On the other hand, curriculum learning, which mimics the human learning principle of learning data samples in a meaningful order, has been shown to improve the generalization ability and robustness of representation learners by gradually proceeding from easy to more difficult samples during training. Unfortunately, most existing curriculum learning strategies are designed for independent data samples and cannot be trivially generalized to handle data with dependencies. To address these issues, in this paper we propose a novel curriculum learning method for structured data that leverages the varying underlying difficulties of data dependencies to improve the quality of learned representations. Specifically, we design a learning strategy that gradually incorporates edges in a given graph topology into training according to their difficulty, from easy to hard, where the degree of difficulty is measured by a self-supervised learning paradigm. We demonstrate the strength of our proposed method in improving the generalization ability and robustness of learned representations through extensive experiments on nine synthetic datasets and nine real-world datasets with different commonly used graph neural network models as backbones.

1. INTRODUCTION

Learning powerful representations of data samples with dependencies has become a core paradigm for understanding underlying network mechanisms and performing a variety of downstream tasks such as social network analysis (Wasserman et al., 1994) and recommendation systems (Ying et al., 2018; Fan et al., 2019). As a class of state-of-the-art representation learning methods, graph neural networks (GNNs) have received increasing attention in recent years due to their ability to jointly model data samples and their dependencies in an end-to-end fashion. Typically, GNNs treat data samples as nodes and their dependencies as edges, and then follow a neighborhood aggregation scheme to learn sample representations by recursively transforming and aggregating information from neighboring samples. On the other hand, inspired by cognitive science studies (Elman, 1993; Rohde & Plaut, 1999) showing that humans benefit from learning basic (easy) concepts first and advanced (hard) concepts later, curriculum learning (CL) (Bengio et al., 2009; Kumar et al., 2010) suggests training a machine learning model on easy data samples first and then gradually introducing harder samples according to a designed pace, where the difficulty of samples is usually measured by their training loss. Many previous studies have shown that this easy-to-hard learning strategy can effectively improve the generalization ability of the model (Bengio et al., 2009; Jiang et al., 2018; Han et al., 2018; Gong et al., 2016; Shrivastava et al., 2016; Weinshall et al., 2018). Furthermore, previous studies (Jiang et al., 2018; Han et al., 2018; Gong et al., 2016) have shown that CL strategies can increase the robustness of the model against noisy training samples.
An intuitive explanation is that, in CL settings, noisy data samples correspond to harder samples, and the CL learner spends less time on these harder (noisy) samples, thereby achieving better generalization performance. However, existing CL algorithms are typically designed for independent data samples (e.g., images), while designing effective CL strategies for data samples with dependencies remains largely underexplored. In dependent data, not only can the difficulty of learning individual data samples vary, but so can the difficulty of perceiving the dependencies between samples. For example, in social networks, connections between users with similar characteristics, such as geographical location or interests, are usually considered easy edges because their formation mechanisms are well expected. Connections from unrelated users, such as advertisers or even bots, can usually be considered hard, as they are not well expected or are even noisy: the likelihood that such connections positively contribute to a downstream task, e.g., community detection, is relatively low. Therefore, since previous CL strategies indicate that an easy-to-hard learning sequence on data samples can improve learning performance, an intuitive question is whether a similar strategy on data dependencies, iteratively involving edges from easy to hard during learning, can also be beneficial. Unfortunately, there is no trivial way to directly generalize existing CL strategies on independent data to handle data dependencies, due to several unique challenges: (1) Difficulty in designing a feasible principle for selecting edges by properly quantifying how well they are expected. Existing CL studies on independent data often use supervised, computable metrics (e.g., training loss) to quantify sample difficulty, but quantifying how well the dependencies between data samples are expected, for which no supervision is available, is challenging.
(2) Difficulty in designing an appropriate pace for gradually involving edges based on the model's status. As in the human learning process, the model should ideally retain a certain degree of freedom to adjust the curriculum according to its own learning status, yet it is extremely hard to design a general pacing policy suitable for different real-world scenarios. (3) Difficulty in ensuring convergence and a numerically stable optimization process. Since GNN models propagate and aggregate messages over the edges, the discrete changes in the set of propagated messages introduced by a CL strategy make it harder to find optimal parameters. To address the aforementioned challenges, in this paper we propose a novel CL algorithm named Relational Curriculum Learning (RCL) to improve the generalization ability of representation learners on data with dependencies. To address the first challenge, we propose a self-supervised learning approach to select the K easiest edges, i.e., those best expected by the model. Specifically, we jointly learn the node-level prediction task and estimate how well the edges are expected by measuring the relations between learned node embeddings. Second, to design an appropriate learning pace for gradually involving more edges in training, we formulate the learning process as a concise optimization model under the self-paced learning framework (Kumar et al., 2010), which lets the model gradually increase the number K of selected edges according to its own status. Third, to ensure convergence when optimizing the model, we propose a proximal optimization algorithm with a theoretical convergence guarantee and an edge reweighting scheme that smooths the structure transition. Finally, we demonstrate the superior performance of RCL compared to state-of-the-art methods through extensive experiments on both synthetic and real-world datasets.
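As a concrete illustration of the overall idea (a sketch, not the paper's exact formulation), the snippet below scores each edge by how well it is "expected" from the current node embeddings, using a simple inner-product link score as a stand-in for the self-supervised difficulty measure, and retains only the K easiest edges, with K growing over training according to a pacing function; all function and variable names here are our own, hypothetical choices.

```python
import numpy as np

def edge_scores(Z, edges):
    """Score each edge (u, v) by how well it is 'expected' from the
    learned node embeddings Z. An inner-product link score serves as
    an illustrative proxy; a higher score means an easier edge."""
    return np.array([Z[u] @ Z[v] for u, v in edges])

def select_k_easiest(Z, edges, k):
    """Keep the k edges with the highest expectedness scores."""
    order = np.argsort(-edge_scores(Z, edges))
    return [edges[i] for i in order[:k]]

def pacing(epoch, total_epochs, n_edges, start_frac=0.25):
    """Grow the number K of retained edges from a small initial
    fraction of the edge set up to the full edge set."""
    frac = min(1.0, start_frac + (1.0 - start_frac) * epoch / total_epochs)
    return max(1, int(n_edges * frac))

# Toy graph: nodes 0/1 and 2/3 form two similar pairs, while the
# cross-pair edge (0, 2) is poorly "expected" and therefore hard.
Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
edges = [(0, 1), (2, 3), (0, 2)]
k = pacing(epoch=0, total_epochs=10, n_edges=len(edges))  # small K early on
easy_edges = select_k_easiest(Z, edges, k)
```

In a full training loop, harder edges would only enter the message passing once the model has fitted the easier structure; the paper additionally smooths these discrete structural changes with an edge reweighting scheme.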

2. RELATED WORK

Curriculum learning (CL). Bengio et al. (2009) first proposed the idea of CL in the context of machine learning, aiming to improve model performance by gradually including samples from easy to hard during training. Self-paced learning (Kumar et al., 2010) measures the difficulty of samples by their training loss, addressing the limitation of earlier work in which sample difficulties were determined by prior heuristic rules; as a result, the model can adjust its curriculum according to its own training status. Subsequent works (Jiang et al., 2015; 2014; Zhou et al., 2020) proposed further supervised measurement metrics for determining curricula, for example the diversity of samples (Jiang et al., 2014) or the consistency of model predictions (Zhou et al., 2020). Meanwhile, many empirical and theoretical studies have sought to explain, from different perspectives, why CL leads to generalization improvements. For example, MentorNet (Jiang et al., 2018) and Co-teaching (Han et al., 2018) empirically found that a CL strategy can achieve better generalization performance when the given training data are noisy. Gong et al. (2016) provided a theoretical explanation of the denoising mechanism: CL learners waste less time on noisy samples because these are considered harder.
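For reference, the alternating minimization in self-paced learning admits a simple closed-form solution for the sample-inclusion variables under the hard regularizer of Kumar et al. (2010): a sample is included (v_i = 1) exactly when its current training loss falls below an age parameter lambda, which is raised as training proceeds. A minimal sketch (variable names are our own):

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Closed-form v-step of self-paced learning with the hard
    regularizer: include sample i (v_i = 1) iff its loss < lambda."""
    return (losses < lam).astype(float)

# Early in training lambda is small, so only low-loss (easy) samples
# receive nonzero weight; raising lambda admits harder samples.
losses = np.array([0.2, 1.5, 0.4, 2.0])
v_early = self_paced_weights(losses, lam=1.0)  # -> [1., 0., 1., 0.]
v_late = self_paced_weights(losses, lam=3.0)   # -> [1., 1., 1., 1.]
```

The model's weights are then updated on the selected samples only, and the two steps alternate as lambda grows, which is the pacing mechanism RCL adapts from samples to edges.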

