COLDEXPAND: SEMI-SUPERVISED GRAPH LEARNING IN COLD START

Abstract

Most real-world graphs are dynamic and eventually face the cold start problem. A fundamental question is how new cold nodes acquire initial information so that they can be integrated into the existing graph. Here we postulate the cold start problem as a fundamental issue in graph learning and propose a new learning setting, "Expanded Semi-supervised Learning." In expanded semi-supervised learning we extend the original semi-supervised learning setting even to new cold nodes that are disconnected from the graph. To this end, we propose the ColdExpand model, which classifies cold nodes based on link prediction trained with multiple objectives. We experimentally show that by adding an additional objective to an existing link prediction method, our method outperforms the baselines in both expanded semi-supervised link prediction (by up to 24%) and node classification (by up to 15%). To the best of our knowledge, this is the first study to address the expansion of semi-supervised learning to unseen nodes.

1. INTRODUCTION

Graph-based semi-supervised learning has attracted much attention thanks to its applicability to real-world problems. For example, a social network is graph-structured data in which the people in the network are nodes and the relationships between them are edges: two people are friends, share posts, etc. With this structural information, we can infer unknown attributes of a person (node) from the information of the people they are connected to (i.e., semi-supervised node classification). In retail applications, customers and products can be viewed as heterogeneous nodes, and edges between customers and products can represent purchase relationships. Such a graph captures the spending habits of each customer, and we can recommend a product to a user by inferring the likelihood of a connection between the user and the product (i.e., semi-supervised link prediction). Recent progress on Graph Neural Networks (GNNs) (Bruna et al., 2013; Kipf & Welling, 2016a; Gilmer et al., 2017; Veličković et al., 2018; Jia et al., 2019) allows us to effectively exploit the expressive power of graph-structured data and to solve various graph-related tasks. Early GNN methods tackled the semi-supervised node classification task, in which all nodes of the graph must be labeled when only a small subset of nodes is labeled, achieving satisfactory performance (Zhou et al., 2004). Link prediction is another graph-related task that has received comparatively less attention in the GNN literature. In the link prediction task, the goal is to estimate the likelihood of a connection between two nodes given node features and topological structure. Link prediction can be used in recommendation tasks (Chen et al., 2005; Sarwar et al., 2001; Lika et al., 2014; Li & Chen, 2013; Berg et al., 2017) or graph completion tasks (Kazemi & Poole, 2018; Zhang & Chen, 2018).
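As a concrete illustration of the link prediction task (not the method proposed in this paper), a common approach is to score candidate node pairs with a decoder over learned node embeddings; the following minimal sketch uses an inner-product decoder, where the embeddings would come from a trained GNN encoder:

```python
import numpy as np

def link_score(z_u, z_v):
    """Estimate the likelihood of an edge (u, v) from node embeddings.

    Illustrative inner-product decoder: sigmoid(z_u . z_v).
    In practice, z_u and z_v are produced by a trained GNN encoder.
    """
    return 1.0 / (1.0 + np.exp(-np.dot(z_u, z_v)))

# Toy embeddings: a pair of similar nodes should score higher
# than a pair of dissimilar nodes.
z_a = np.array([1.0, 0.5])
z_b = np.array([0.9, 0.6])    # similar to z_a
z_c = np.array([-1.0, -0.5])  # dissimilar to z_a
```

Here `link_score` returns a probability-like value in (0, 1); a cold node with no edges contributes no topological signal to such an encoder, which is precisely the failure mode discussed next.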
Most work on semi-supervised graph learning and link prediction assumes a static graph, that is, the structural information is at least partially observable for every node. In the real world, however, new users or items can be added (as nodes) without any topological information (Gope & Jain, 2017). This is also referred to as the cold start problem: a new node is presented to an existing graph without a single edge. In contrast to the warm start case, in which at least some topological information is provided, the cold start problem is an extreme case in which no topological information is available at all. In this setting, previous semi-supervised learning algorithms cannot propagate information to the cold nodes. Even though the cold start problem is an extreme setting, it occurs frequently in real-world data. In this paper, we postulate the cold start problem as a fundamental issue in graph learning and propose a new learning setting, expanded semi-supervised learning. In expanded semi-supervised learning we extend the original semi-supervised learning setting even to new cold nodes that are disconnected from the graph, as shown in Figure 1. To this end, we propose ColdExpand, a method that uses a multi-task learning strategy to alleviate the cold start problem that typical graph learning methods face. We experimentally show that adding an additional objective to an existing link prediction method improves link prediction performance on every benchmark dataset, by up to 24%. We also show that our method can expand semi-supervised node classification even to unseen, cold nodes. To the best of our knowledge, this is the first study to extend semi-supervised learning methods to unseen nodes. In the next section, we briefly review related work. In Section 3, we formally define the problem. Finally, in Section 4, we present our ColdExpand model, followed by our experimental setup and the corresponding results.

2. RELATED WORK

2.1. SEMI-SUPERVISED NODE CLASSIFICATION

Semi-supervised learning is a combination of unsupervised and supervised learning in which true labels are only partially given. Semi-supervised learning algorithms aim to improve a model's performance by using unlabeled data points on top of labeled ones to better approximate the underlying marginal data distribution (Van Engelen & Hoos, 2020). One of the most popular settings is graph-based semi-supervised node classification, in which all nodes must be classified when only a few node labels are given. In order to smooth the given subset of labels throughout the graph, various methods have been studied to represent nodes effectively. DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and node2vec (Grover & Leskovec, 2016) were early deep-learning-based methods that targeted node classification by learning latent representations from truncated random walks. However, these models fail to share parameters between nodes, making learning inefficient. Kipf & Welling (2016a) introduced Graph Convolutional Networks (GCNs), which use an efficient layer-wise propagation rule derived from a first-order approximation of spectral graph convolution. By limiting the spectral convolution to first order, GCNs not only lightened the computational cost of the operation but also alleviated the over-fitting problem that previous spectral methods suffered from. A GCN is typically formed by stacking convolutional layers, the number of which determines how many hops of neighbors the convolution considers. Gilmer et al. (2017) presented Message Passing Neural Networks (MPNNs) as a general form of the spatial convolution operation and treated GCN as a specific kind of message passing process. In MPNNs, information between nodes is delivered directly along edges without recourse to the spectral domain.
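The layer-wise propagation rule of Kipf & Welling (2016a) can be written as H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l)), where Ã = A + I adds self-loops and D̃ is the degree matrix of Ã. A minimal dense NumPy sketch of a single such layer (for illustration only; real implementations use sparse operations and learned weights):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: (n, n) adjacency matrix of the graph,
    H: (n, d_in) node feature/representation matrix,
    W: (d_in, d_out) weight matrix (learned in practice).
    """
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)                      # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)     # ReLU activation
```

Stacking k such layers aggregates information from k-hop neighborhoods, which is exactly why a cold node with an empty neighborhood receives no propagated signal.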



Figure 1: Left: Original semi-supervised learning. Right: Expanded semi-supervised learning. Nodes with solid outlines are labeled nodes and nodes with dotted outlines are unlabeled nodes. Expanded semi-supervised learning extends the original semi-supervised setting even to new nodes that are disconnected from the graph. By adding unseen cold nodes to training, we can not only extend learning to cold nodes but also use their node features to better approximate the marginal distribution of the original graph data.

