GRAPH NEURAL NETWORKS AS MULTI-VIEW LEARNING

Abstract

Graph Neural Networks (GNNs) have demonstrated powerful representation capability in semi-supervised node classification. In this task, there are often three types of information: graph structure, node features, and node labels. Existing GNNs usually leverage node features and graph structure through feature transformation and aggregation, trained end-to-end with node labels. In this paper, we change our perspective by considering these three types of information as three views of nodes. This perspective motivates us to design a new GNN framework as multi-view learning, which enables alternating optimization instead of end-to-end training and results in significantly improved computation and memory efficiency. Extensive experiments under different settings demonstrate the effectiveness and efficiency of the proposed method.

1. INTRODUCTION

Graphs are a fundamental data structure that encodes pairwise relationships between entities in a wide variety of domains (Wu et al., 2019b; Ma & Tang, 2021). Semi-supervised node classification is one of the most crucial tasks on graphs: given the graph structure, node features, and labels on a subset of nodes, the task aims to predict the labels of the remaining nodes. In recent years, Graph Neural Networks (GNNs) have proven to be powerful in semi-supervised node classification (Gilmer et al., 2017; Kipf & Welling, 2016; Velickovic et al., 2017).

Existing GNN models provide different architectures to leverage both graph structure and node features. Coupled GNNs, such as GCN (Kipf & Welling, 2016) and GAT (Velickovic et al., 2017), couple feature transformation and propagation to combine node features and graph structure in each layer. Decoupled GNNs, such as APPNP (Klicpera et al., 2018), first transform node features and then propagate the transformed features over the graph structure for multiple steps. Meanwhile, GNN models such as Graph-MLP (Hu et al., 2021) extract the graph structure as a regularization term when integrating it with node features. Nevertheless, the majority of the aforementioned GNNs utilize node labels only via the loss function for end-to-end training.

In essence, existing GNNs exploit three types of information to facilitate semi-supervised node classification. This understanding motivates us to change our perspective by considering these three types of information as three views of nodes, so that the design of GNN models can be treated as multi-view learning. The advantages of this new perspective are multi-fold. First, we can follow the key steps of multi-view learning methods to design GNNs by investigating (1) how to capture node information from each view and (2) how to fuse information from the three views. This offers tremendous flexibility to develop new GNN models.
Second, multi-view learning has been extensively studied (Xu et al., 2013), and this large body of literature can open new doors for advancing GNN models. To demonstrate the potential of this new perspective, following a traditional multi-view learning method (Xia et al., 2010), we introduce a shared latent variable to explore these three views simultaneously in a multi-view learning framework for graph neural networks (MULTIVIEW4GNN). The proposed framework can be conveniently optimized in an alternating way, which remarkably alleviates the computational and memory inefficiency of end-to-end GNNs. Extensive experiments under different settings demonstrate that MULTIVIEW4GNN achieves comparable or even better performance than end-to-end trained GNNs, especially when the labeling rate is low, while enjoying significantly better computation and memory efficiency.

2. PRELIMINARIES

We use bold upper-case letters such as X to denote matrices; X_i denotes the i-th row of X, and X_ij denotes the element in the i-th row and j-th column. We use bold lower-case letters such as x to denote vectors. The Frobenius norm and trace of a matrix X are defined as ∥X∥_F = (Σ_ij X_ij²)^{1/2} and tr(X) = Σ_i X_ii. Let G = (V, E) be a graph, where V is the node set and E is the edge set. N_i denotes the neighborhood node set of node v_i. The graph can be represented by an adjacency matrix A ∈ R^{n×n}, where A_ij > 0 indicates that there exists an edge between nodes v_i and v_j in G, and A_ij = 0 otherwise. Let D = diag(d_1, d_2, ..., d_n) be the degree matrix, where d_i = Σ_j A_ij is the degree of node v_i. The graph Laplacian matrix is defined as L = D − A. We define the normalized adjacency matrix as Ã = D^{−1/2} A D^{−1/2} and the normalized Laplacian matrix as L̃ = I − Ã. Furthermore, suppose that each node is associated with a d-dimensional feature vector x, and let X = [x_1, x_2, ..., x_n]^⊤ ∈ R^{n×d} denote the feature matrix.
In this work, we focus on the node classification task on graphs. Given a graph G = {A, X} and a partial set of labels Y_L = {y_1, y_2, ..., y_l} for the node set V_L = {v_1, v_2, ..., v_l}, where y_i ∈ R^C is a one-hot vector over C classes, our goal is to predict the labels of the unlabeled nodes. The labels of graph G can also be represented as a label matrix Y ∈ R^{n×C}, where Y_i = y_i if v_i ∈ V_L and Y_i = 0 if v_i ∈ V_U. The subscripts U and L denote the sets of unlabeled and labeled nodes, respectively.
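As a concrete illustration of the notation above, the matrices D, Ã = D^{−1/2} A D^{−1/2}, and L̃ = I − Ã can be computed directly from a dense adjacency matrix. The following minimal NumPy sketch is our own illustration (the function name is ours, not from the paper), assuming an undirected graph with no isolated nodes:

```python
import numpy as np

def normalized_graph_matrices(A):
    """Given a symmetric adjacency matrix A, return the degree matrix D,
    the normalized adjacency A_tilde = D^{-1/2} A D^{-1/2}, and the
    normalized Laplacian L_tilde = I - A_tilde."""
    d = A.sum(axis=1)                       # node degrees d_i = sum_j A_ij
    D = np.diag(d)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # assumes no isolated nodes (d_i > 0)
    A_tilde = d_inv_sqrt @ A @ d_inv_sqrt
    L_tilde = np.eye(A.shape[0]) - A_tilde
    return D, A_tilde, L_tilde

# A triangle graph: every node connected to the other two.
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
D, A_tilde, L_tilde = normalized_graph_matrices(A)
```

Note that L̃ is symmetric positive semi-definite, and for a regular graph the constant vector lies in its null space, which is what makes tr(F^⊤ L̃ F) a natural smoothness penalty later on.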

2.1. MULTI-VIEW LEARNING FOR GNNS

For the node classification task, we take a new perspective that considers the node features X, graph structure A, and node labels Y as three views of the nodes, and model graph neural networks as multi-view learning. In particular, we need to jointly model each view and integrate the three views. To achieve this goal, we introduce a latent variable F, inspired by a traditional multi-view learning method (Xia et al., 2010). The loss function can then be written as

arg min_{F,Θ} L = λ_1 D_X(X, F) + D_A(A, F) + λ_2 D_Y(Y_L, F_L),   (1)

where F is the latent variable shared by the three views, and D_X(·, ·), D_A(·, ·), and D_Y(·, ·) are functions that explore the node features, graph structure, and node labels, respectively. These functions can contain parameters, which we denote as Θ. The hyper-parameters λ_1 and λ_2 balance the contributions of the three views.

One major advantage of the multi-view learning perspective is that it enables immense flexibility in designing GNN models. Specifically, based on Eq. (1), there are numerous possible designs for D_X(·, ·), D_A(·, ·), and D_Y(·, ·). Examples are shown below:

• D_X maps the node features X to F. In practice, we can first transform X before mapping. Thus, feature transformation methods can be applied, including traditional methods such as PCA (Collins et al., 2001; Shen, 2009) and SVD (Godunov et al., 2021), and deep methods such as MLPs and self-attention (Vaswani et al., 2017). We also have various choices for the mapping function, such as Multi-Dimensional Scaling (MDS) (Hout et al., 2013), which preserves the pairwise distances between X and F, or any other distance measurement. In this work, we set the dimensions of the latent variable F as R^{n×C}, so that F can be considered a soft pseudo-label matrix.
The following designs are then chosen for these functions: (i) for D_X, we use an MLP with parameters Θ to encode the features of node i as MLP(X_i; Θ), and adopt the Euclidean distance to map to F_i as ∥MLP(X_i; Θ) − F_i∥²₂; (ii) for D_A, Laplacian smoothness is imposed to constrain the distance between a node's pseudo labels F_i and those of its neighbors as

D_A(A, F) = tr(F^⊤ L̃ F) = (1/2) Σ_{i,j} A_ij ∥F_i/√d_i − F_j/√d_j∥²₂,

where L̃ is the normalized Laplacian matrix.

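Under the concrete choices just described, one evaluation of the objective in Eq. (1) is straightforward to write down. The sketch below is our own illustration, not the paper's implementation: a single linear layer stands in for the MLP encoder, the label term uses a squared error, and all names are assumptions.

```python
import numpy as np

def multiview_loss(X, F, Y, labeled_mask, Theta, L_tilde, lam1=1.0, lam2=1.0):
    """Evaluate L = lam1*D_X + D_A + lam2*D_Y for the designs (i)-(ii):
    a linear stand-in for MLP(X; Theta), Laplacian smoothness tr(F^T L F),
    and a squared-error label term on the labeled rows only."""
    H = X @ Theta                                   # stand-in for MLP(X; Theta)
    d_X = np.sum((H - F) ** 2)                      # feature view: ||MLP(X)-F||_F^2
    d_A = np.trace(F.T @ L_tilde @ F)               # structure view: smoothness
    d_Y = np.sum((F[labeled_mask] - Y[labeled_mask]) ** 2)  # label view
    return lam1 * d_X + d_A + lam2 * d_Y
```

Since L̃ is positive semi-definite and the other two terms are squared norms, the loss is non-negative, and it vanishes exactly when F matches the encoder output, is smooth over the graph, and agrees with the labels on the labeled nodes.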
• D_A aims to impose constraints on the latent variable F using the graph structure. Traditional graph regularization techniques can be employed: for instance, Laplacian regularization (Yin et al., 2016) guides a node i's representation F_i to be similar to those of its neighbors, while Locally Linear Embedding (LLE) (Roweis & Saul, 2000) forces F_i to be reconstructed from its neighbors. Moreover, modern deep graph learning methods can be applied, such as graph embedding methods (Perozzi et al., 2014; Grover & Leskovec, 2016) and Graph Contrastive Learning (Zhu et al., 2020; Hu et al., 2021), which implicitly encode node similarity and dissimilarity.

• D_Y establishes the connection between the latent variable F_L and the ground-truth node labels Y_L for the labeled nodes. It can be any classification loss function, such as the Mean Squared Error or the Cross Entropy loss.
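An objective of this shape lends itself to the alternating optimization mentioned in the abstract: fix Θ and take gradient steps on the shared latent variable F, then fix F and update Θ. The sketch below is a minimal illustration under assumptions of ours (a linear layer in place of the MLP, squared-error label loss, plain gradient descent with illustrative step sizes), not the paper's actual training procedure:

```python
import numpy as np

def alternating_fit(X, Y, labeled_mask, L_tilde, lam1=1.0, lam2=1.0,
                    outer_steps=50, lr_F=0.1, lr_T=0.01, seed=0):
    """Alternately update the pseudo-label matrix F and the encoder
    parameters Theta for the objective
    lam1*||X Theta - F||_F^2 + tr(F^T L F) + lam2*||(F - Y)_L||_F^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    C = Y.shape[1]
    Theta = 0.01 * rng.standard_normal((d, C))
    F = np.zeros((n, C))
    M = labeled_mask.astype(float)[:, None]       # 1 for labeled rows, else 0
    for _ in range(outer_steps):
        H = X @ Theta
        # F-step: gradient w.r.t. F (L_tilde is symmetric, so the
        # smoothness term contributes 2 * L_tilde @ F).
        grad_F = 2 * lam1 * (F - H) + 2 * (L_tilde @ F) + 2 * lam2 * M * (F - Y)
        F -= lr_F * grad_F
        # Theta-step: only the feature-view term depends on Theta.
        grad_T = 2 * lam1 * X.T @ (X @ Theta - F)
        Theta -= lr_T * grad_T
    return F, Theta
```

Each subproblem here is a simple least-squares update over one block of variables, which is where the memory advantage over end-to-end backpropagation comes from: no computation graph over stacked propagation layers needs to be stored.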

