DUAL GRAPH COMPLEMENTARY NETWORK

Abstract

As a powerful representation learning method for graph data, graph neural networks (GNNs) have become popular for tackling graph analytic problems. Although many attempts have been made in the literature to find strategies for extracting better embeddings of target nodes, few consider the issue from a comprehensive perspective. Most current GNNs employ a single method that extracts a certain kind of feature well, while other, equally important features are ignored. In this paper, we develop a novel dual graph complementary network (DGCN) to learn representations complementarily. We use two different branches whose inputs are the same, composed of structure and feature information, and we enforce a complementary relationship between the two branches. Extensive experiments show that DGCN outperforms state-of-the-art methods on five public benchmark datasets.

1. INTRODUCTION

Although many attempts have been made in the literature to find better strategies for learning target node representations, the feature extraction capabilities of most methods are still far from optimal, especially when only a small amount of data is labeled. In fact, compared with the expensive and laborious acquisition of labeled data, unlabeled data is much easier to obtain. Therefore, how to learn more useful representations with limited label information is a key direction of representation learning research. Methods for this problem, commonly referred to as semi-supervised learning, essentially assume that similar points have similar outputs; by exploiting this consistency of the data, they can make full use of the rich information in unsupervised data.

In the real world, it is common to have data with a specific topological structure, usually called graph data, where the structure is expressed as connections between nodes. By aggregating the features of each node's neighborhood and performing appropriate linear transformations, graph neural networks (GNNs) convert graph data into a low-dimensional, compact, and continuous feature space. Nevertheless, most GNNs consider only a single aggregation strategy, which is counter-intuitive: in social networks, for example, the relationships between people are very complex, yet most traditional GNNs consider only a single kind of connection between nodes and ignore other implicit information.

In this paper, we focus on learning node representations with GNNs in a semi-supervised way. Although there are already many graph-based semi-supervised learning methods (Kipf & Welling, 2016; Yang et al., 2016; Khan & Blumenstock, 2019), most of them capture only a single relationship between nodes, so some information in the unsupervised data is usually ignored.
To overcome this problem, we develop a novel dual graph complementary network (DGCN) to extract information from both the feature and the topology space. The intuition behind our method is learning based on disagreement: network performance is largely related to the quality of the graph, which usually emphasizes the relevance of one attribute of the instances. Since we do not know in advance which attribute matters most, we consider both in the model design. Compared with traditional GNN-based methods, we perform two different aggregation strategies that emphasize different attributes, one from the perspective of node features and the other from the topological structure. To further exploit implicit information, we employ two networks with different structures to extract embeddings from the input features, so that node information can be propagated in different ways. The supervised loss ℓ_sup and the diversity constraint ℓ_div are then used to guide training. The two branches extract common information from the topology and feature spaces, and by utilizing the disagreements between them, the model can gain information that a single branch might ignore. To prove the effectiveness of our method, we conducted experiments on five public benchmark datasets. The contributions of our work are summarized as follows:
• We propose a novel dual graph complementary network (DGCN) to fuse complementary information, which uses different graphs to aggregate nodes that are similar in certain attributes in a complementary way.
• By comparing with algorithms that also use more than one graph, we show that our complementary architecture can extract richer information.
• Through extensive evaluation on multiple datasets, we demonstrate the effectiveness of DGCN over state-of-the-art baselines.

2. RELATED WORK

2.1. SEMI-SUPERVISED LEARNING

Semi-supervised learning targets the case of insufficient data labels. Let $X \in \mathbb{R}^{n \times d}$ be the features of the input nodes and $Y = [y_{ij}] \in \mathbb{R}^{n \times k}$ be the label matrix, where $k$ is the number of classes and $y_{ij}$ indicates that the $i$-th node belongs to the $j$-th class. The data points are split into labeled and unlabeled points; accordingly, $x_L$ and $x_U$ denote the features of a labeled and an unlabeled instance, respectively, and only the ground-truth labels of the labeled nodes are available. The main objective of semi-supervised learning is to extract supervised information from the labeled dataset while adequately utilizing the data distribution information contained in $X$. Semi-supervised learning algorithms fall into four categories:
1. Self-training semi-supervised learning (Lee, 2013): it uses high-confidence pseudo-labels to expand the label set. Ideally, it can continuously improve network performance, but it is usually limited by the quality of the pseudo-labels.
2. Graph-based semi-supervised learning: it propagates information between instances along the edges of a graph, typically in a transductive fashion; its performance mainly depends on the aggregation algorithm.
3. Low-density separation methods (Joachims, 1999): they assume that the decision hyperplane is consistent with the data distribution, so it should pass through sparse regions of the data.
4. Pre-training semi-supervised learning, such as autoencoders (Vincent et al., 2008; Rifai et al., 2011): the model is trained on a reconstruction error and then fine-tuned using labeled data. However, semi-supervised tasks prefer information related to the data distribution rather than all the information in the samples.
In this paper, we mainly focus on graph-based semi-supervised learning.
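As a concrete illustration of the graph-based category, the classic label-propagation scheme can be sketched in a few lines. The following is a hypothetical NumPy sketch (the function name, the damping parameter, and the toy chain graph are illustrative, not from any benchmark):

```python
import numpy as np

def label_propagation(A, Y, labeled_mask, alpha=0.9, iters=50):
    """Propagate labels over a graph; Y holds one-hot rows for labeled nodes.

    Each step mixes a node's label distribution with its neighbors',
    then clamps the known labels back to their ground truth.
    """
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1)                # row-normalized transition matrix
    F = Y.copy().astype(float)
    for _ in range(iters):
        F = alpha * P @ F + (1 - alpha) * Y   # diffuse, then re-inject seeds
        F[labeled_mask] = Y[labeled_mask]     # clamp labeled nodes
    return F.argmax(axis=1)

# Toy chain graph 0-1-2-3: node 0 labeled class 0, node 3 labeled class 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Y = np.zeros((4, 2)); Y[0, 0] = 1; Y[3, 1] = 1
mask = np.array([True, False, False, True])
pred = label_propagation(A, Y, mask)          # each unlabeled node takes the nearest seed's class
```

Here the two interior nodes inherit the class of the closer labeled endpoint, which is exactly the "similar points have similar outputs" assumption at work.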

2.2. GRAPH-BASED SEMI-SUPERVISED LEARNING

In addition to features, graph-based semi-supervised learning methods (Kipf & Welling, 2016) exploit the topological edge connections between different instances. For many datasets, the graph is given as part of the data; if a dataset's features do not contain relationships between samples, a graph can also be constructed by measuring the similarity between instance features (Zhu et al., 2003). In effect, the graph is a measure of whether instances are closely connected. According to this graph, information can be exchanged between instances, so that the information of unlabeled data is effectively utilized. Network performance is largely related to the quality of the graph: when the attributes emphasized in the graph do not match the expectations of the task objective, misjudgments often result, and it is usually difficult to know in advance which attributes really matter. Traditional graph-based semi-supervised learning methods usually use a single graph for node aggregation, which emphasizes a single attribute; when that attribute does not match the task goal, it misleads training instead.

3. DGCN ARCHITECTURE

In this section, we will present the overall framework of DGCN, see Fig. 1 . The main idea of DGCN is that information exchange under the control of graphs emphasizing different attributes can extract more abundant features. To this end, we use two branches to extract information from two inputs at the same time. The node features of these two inputs are the same, the only difference is the graphs that control the information exchange. In addition, in order to further expand the difference between branches, we use a diversity loss ℓ div . 

3.1. NOTATION & PROBLEM STATEMENT

Let $G = (V, A, X)$ be an undirected graph, where $V$ is the set of nodes, composed of unlabeled nodes $V_u$ and labeled nodes $V_l$ with $n_u$ and $n_l$ nodes respectively; $n = n_l + n_u$ is the total number of nodes. $A = [a_{ij}] \in \mathbb{R}^{n \times n}$ is the adjacency matrix: $a_{ij} = 1$ indicates that node $i$ and node $j$ are closely related in some attribute, and otherwise $a_{ij} = 0$.
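As a minimal sketch of this notation, the snippet below builds a symmetric adjacency matrix for a hypothetical 5-node undirected graph and splits the node set into $V_l$ and $V_u$ (the edge list and label indices are purely illustrative):

```python
import numpy as np

# Hypothetical undirected graph with n = 5 nodes; a_ij = 1 iff i and j are connected.
edges = [(0, 1), (0, 2), (1, 2), (3, 4)]
n = 5
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1                 # undirected: symmetric entries

labeled = np.array([0, 3])                # indices of V_l
unlabeled = np.setdiff1d(np.arange(n), labeled)  # V_u
n_l, n_u = len(labeled), len(unlabeled)   # n = n_l + n_u
```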

3.2. BRANCHES

In order to capture different characteristics in the two branches (also called viewers), we use a different network structure for each branch: GCN (Kipf & Welling, 2016) and GAT (Veličković et al., 2017). Given a graph $G = (V, A, X)$, both GCN and GAT extract richer features at a vertex by aggregating the features of the vertices in its neighborhood (Li et al., 2019). The node representation of the $l$-th layer, $H^{l}$, can thus be defined by:

$H^{l} = \mathrm{Update}(\mathrm{Aggregate}(H^{l-1}, \Theta_{agg}^{l}), \Theta_{update}^{l})$,  (1)

where $\Theta_{agg}^{l}$ and $\Theta_{update}^{l}$ are the learnable weights of the aggregation and update functions of the $l$-th layer, respectively, and the initial $H^{0} = X$. The aggregation and update functions are the essential components of GNNs, and the features extracted by different aggregation functions naturally differ. We therefore take advantage of two different networks, GCN and GAT, to obtain node representations. The node features output by the $l$-th GCN layer can be expressed as:

$H^{l} = \sigma\big(\tilde{D}^{-\frac{1}{2}}(A + I)\tilde{D}^{-\frac{1}{2}} H^{l-1} W^{l}\big)$,  (2)

where $I \in \mathbb{R}^{n \times n}$ is the identity matrix, $A + I$ adds self-loops to the graph, $\tilde{D}$ is the diagonal degree matrix of $A + I$, and $\sigma(\cdot)$ is the activation function. As equation 2 shows, GCN aggregates neighbor features weighted by the entries of the symmetrically normalized Laplacian. Next, we introduce GAT, which uses an attention mechanism to compute neighbor weights. Through a learnable coefficient vector $a$, GAT assigns a learnable weight to each neighbor of a node. For node $i$, the weight $\alpha_{ij}$ between it and its neighbor $j$ can be expressed as:

$\alpha_{ij} = \dfrac{\exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\|\, W h_j]\big)\big)}{\sum_{k \in N_i} \exp\big(\mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\|\, W h_k]\big)\big)}$,  (3)

where $\cdot^{\top}$ is the transposition operation and $\|$ denotes concatenation. The forward propagation of node $i$ in the $l$-th layer can then be represented as:

$h_{l,i} = \big\|_{m=1}^{M}\, \sigma\Big(\sum_{j \in N_i} \alpha_{l,ij}^{m} W_{l}^{m} h_{l-1,j}\Big)$,  (4)

where $h_{l,i}$ is the embedding of node $i$ in the $l$-th layer, $M$ is the number of independent attention heads, $\sigma$ is the activation function of GAT, and $\alpha_{ij}^{m}$ are the normalized attention coefficients computed by the $m$-th attention head (see equation 3). As equation 4 shows, the weights GAT assigns to a node's neighbors are learnable, so adaptive weights can be given to different neighbors. Both methods take the existence of a connection between nodes as the premise of aggregation, and each has its own advantages and disadvantages: the former accounts for the relationship between nodes through the normalized propagation matrix but cannot learn neighbor weights dynamically, while the latter assigns dynamic weights to neighbors but ignores the influence of a node's degree on aggregation. Therefore, using these two branches, we can extract more complementary features from the input.
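The two aggregation rules can be sketched in NumPy as follows. This is an illustrative, single-head sketch of equations 2 and 3 (the helper names are ours; a real implementation would use a GNN library):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^{-1/2} (A+I) D^{-1/2} H W), as in eq. 2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))          # degree matrix of A + I
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0)

def gat_weights(h_i, neighbor_feats, W, a):
    """Attention coefficients alpha_ij of eq. 3 for one node, single head."""
    z_i = W @ h_i
    scores = []
    for h_j in neighbor_feats:
        e = a @ np.concatenate([z_i, W @ h_j])      # a^T [W h_i || W h_j]
        scores.append(e if e > 0 else 0.2 * e)      # LeakyReLU
    scores = np.array(scores, dtype=float)
    exp = np.exp(scores - scores.max())             # numerically stable softmax
    return exp / exp.sum()

# Toy usage: two connected nodes for GCN, one node with two neighbors for GAT.
A = np.array([[0, 1], [1, 0]])
out = gcn_layer(A, np.eye(2), np.eye(2))
alphas = gat_weights(np.ones(2), [np.ones(2), np.zeros(2)], np.eye(2), np.ones(4))
```

The GCN weights are fixed by the graph degrees, while `alphas` depend on the (learnable) vector `a`, which is exactly the static-versus-dynamic distinction discussed above.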

3.3. FORWARD PROPAGATION

In this subsection, we introduce the inputs used by the network and the forward propagation strategy. In order to consider different attributes during aggregation, we train with different graphs but the same features. The graph-structured datasets used in our experiments have two kinds of characteristics: the features of the instance itself, which are not affected by other instances, and attributes that reflect its relationships to other instances. For example, the ACM dataset (Wang et al., 2019) contains 3025 papers and has two such properties: a bag-of-words vector indicating which keywords occur in a paper, and an indication of which papers are written by the same author. Obviously, if we aggregate solely on whether papers share an author, we inevitably ignore the cases where the same author has written papers of different types and where papers of the same type belong to different authors, and thus mistakenly aggregate articles of different categories. Therefore, we also construct a graph based on the other attribute of the dataset, the bag-of-words features, so that information can be transferred between instances with similar keywords. To measure the similarity of the instances' features, we compute the cosine similarity between the features of all instances:

$s_{ij} = \dfrac{x_i \cdot x_j}{|x_i|\,|x_j|}$,  (5)

where $s_{ij}$ denotes the cosine similarity between the feature $x_i$ of node $i$ and the feature $x_j$ of node $j \in V$, with $j \neq i$. For node $i$, we choose the $t$ largest $s_{ij}$ and take the corresponding nodes $j$ as the neighbors of $i$; then, if $j$ is a neighbor of node $i$, $i$ is also a neighbor of node $j$. In this way we obtain a new graph constructed from the features. We use $A_1$ and $A_2$ to denote the inherent graph structure of the data and the graph constructed from the features, respectively.
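The construction of the feature graph $A_2$ can be sketched as follows; this is a NumPy illustration of equation 5 together with the $t$-largest-similarity selection and symmetrization (the helper name and toy features are ours):

```python
import numpy as np

def cosine_knn_graph(X, t=2):
    """Feature graph sketch: connect each node to the t nodes with the
    largest cosine similarity (eq. 5), then symmetrize so that neighbor
    relations are mutual."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    S = Xn @ Xn.T                      # s_ij = cosine similarity
    np.fill_diagonal(S, -np.inf)       # exclude self-similarity (j != i)
    n = X.shape[0]
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in np.argsort(S[i])[-t:]:   # indices of the t largest s_ij
            A[i, j] = 1
    return np.maximum(A, A.T)          # if j neighbors i, i neighbors j

# Toy bag-of-words-like features: two pairs of near-duplicate rows.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A2 = cosine_knn_graph(X, t=1)          # links each near-duplicate pair
```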
Therefore, by feeding $A_1$ and $A_2$ to each branch, we obtain four different outputs. According to equation 2 and equation 4, the forward propagation of DGCN can be represented as:

$H^{gcn}_{v,l} = \sigma\big(\tilde{D}_v^{-\frac{1}{2}}(A_v + I)\tilde{D}_v^{-\frac{1}{2}} H^{gcn}_{v,l-1}\,\Theta_{v,l}\big)$,  (6)

$h^{gat}_{v,l,i} = \big\|_{m=1}^{M}\, \tilde{\sigma}\Big(\sum_{j \in N_i} \alpha^{m}_{v,l,ij} W^{m}_{v,l} h^{gat}_{v,l-1,j}\Big)$,  (7)

where $v = 1$ indicates that the input graph is $A_1$ and $v = 2$ that it is $A_2$; $\sigma$ and $\tilde{\sigma}$ are the activation functions; $\alpha^{m}_{v,l,ij}$ are the normalized attention coefficients; and $\Theta_{v,l}$ and $W^{m}_{v,l}$ are the weights of the linear transformations. For the GAT branch, $h^{gat}_{v,l,i}$ denotes the representation of node $i$ in the $l$-th layer when the input graph is $A_v$; similarly, $H^{gcn}_{v,l}$ is the $l$-th-layer embedding matrix of the GCN branch when the input graph is $A_v$. For these four embeddings, we first use an attention mechanism to combine the embeddings generated by the different graphs within the same branch:

$H^{gcn}_{c} = att(H^{gcn}_{1,l} \,\|\, H^{gcn}_{2,l})$,  (8)

$H^{gat}_{c} = att(H^{gat}_{1,l} \,\|\, H^{gat}_{2,l})$.  (9)

Then we apply the attention mechanism again to combine $H^{gcn}_{1,l}$, $H^{gcn}_{2,l}$, $H^{gat}_{1,l}$, $H^{gat}_{2,l}$, $H^{gcn}_{c}$, and $H^{gat}_{c}$. Through these two attention operations, we can dynamically assign weights to the different embeddings and find the attributes that best match the task goal.
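The $att(\cdot)$ operator is not fully specified above; a plausible sketch is a per-node softmax over the views' embeddings. Everything in the snippet below is an illustrative assumption: the query vector `q` stands in for the learnable attention parameters, and the `tanh` scoring is one common choice, not necessarily the paper's:

```python
import numpy as np

def attention_fuse(embeddings, q):
    """Fuse per-node embeddings from several views (a hypothetical att(.) sketch).

    Scores each view's embedding of a node against a shared query vector q,
    softmaxes the scores per node over the views, and returns the weighted sum.
    """
    E = np.stack(embeddings)                         # (views, n, d)
    scores = np.tanh(E) @ q                          # (views, n) attention logits
    scores = scores - scores.max(axis=0)             # per-node stable softmax
    w = np.exp(scores) / np.exp(scores).sum(axis=0)  # (views, n) weights
    return (w[..., None] * E).sum(axis=0)            # (n, d) fused embedding

# Toy usage: fuse two random "view" embeddings of 4 nodes.
H1 = np.random.default_rng(0).normal(size=(4, 3))
H2 = np.random.default_rng(1).normal(size=(4, 3))
H_fused = attention_fuse([H1, H2], q=np.ones(3))
```

Because the per-node weights are a convex combination over views, each fused row lies elementwise between the corresponding rows of the input embeddings.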

3.4. LOSS FUNCTIONS OF DGCN

The objective function of DGCN consists of two parts: the supervised loss ℓ sup and the diversity loss ℓ div .

3.4.1. SUPERVISED LOSS

Given a graph $G = (V, A, X)$ with $V = V_l \cup V_u$, the labels corresponding to $V_l$ are $Y_l$. In order to utilize the supervision information, we use the cross-entropy loss to guide training:

$\ell_{sup} = -\sum_{i \in V_l} \sum_{j=1}^{k} y_{ij} \ln p_{ij}$,  (10)

where $y_{ij}$ is the ground-truth label of node $i$, $p_{ij}$ is the model's predicted probability, and $k$ is the number of classes.
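Equation 10, restricted to the labeled set, can be sketched directly (a NumPy illustration; `P`, `Y`, and the mask are toy values):

```python
import numpy as np

def supervised_loss(P, Y, labeled_mask):
    """Cross-entropy of eq. 10 over labeled nodes only.

    P: (n, k) predicted class probabilities, Y: (n, k) one-hot labels.
    """
    eps = 1e-12                                     # avoid log(0)
    per_node = -(Y * np.log(P + eps)).sum(axis=1)   # -sum_j y_ij ln p_ij
    return per_node[labeled_mask].sum()             # sum over i in V_l

P = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
Y = np.array([[1, 0], [0, 1], [0, 1]])
mask = np.array([True, False, True])   # only nodes 0 and 2 are labeled
loss = supervised_loss(P, Y, mask)     # node 1 contributes nothing
```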

3.4.2. DIVERSITY LOSS

In order to further expand the differences between the branches and capture richer node features, we add a diversity constraint on $H^{gcn}_{c}$ and $H^{gat}_{c}$. First, we use $L_2$-normalization to normalize $H^{gcn}_{c}$ and $H^{gat}_{c}$ output by the attention module; the normalized results are $\hat{H}^{gcn}$ and $\hat{H}^{gat}$, respectively. Using these results, we capture the similarity between the node embeddings of the two branches:

$\hat{s}_{ij} = \hat{H}^{gcn}_{i} \cdot \hat{H}^{gat}_{j}$.  (11)

Then the diversity loss can be defined by:

$\ell_{div} = \dfrac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \hat{s}_{ij}$,  (12)

where $n$ is the number of nodes. Through this diversity constraint, we expand the difference between the branches so that they learn complementary features. Our final optimization objective can therefore be expressed as:

$\ell_{total} = (1 - \gamma)\,\ell_{sup} + \gamma\,\ell_{div}$,  (13)

where $\gamma$ is the weight of the disparity constraint term. Using this objective function, we optimize the proposed model through back propagation and learn the node embeddings for classification.

4. EXPERIMENTS

4.1. DATASETS

To evaluate the effectiveness of DGCN, we test it on several semi-supervised classification benchmarks. Following the experimental setup of Wang et al. (2020), we evaluate on five datasets:
• ACM (Wang et al., 2019): extracted from the ACM database; nodes represent papers, an edge indicates that the two connected papers belong to the same author, and the features are bag-of-words representations of the papers' keywords.
• UAI2010 (Wang et al., 2018): this dataset has 3067 nodes and 19 classes.
• Citeseer (Kipf & Welling, 2016): 3312 papers divided into six categories, with citation links between papers; the feature is the bag-of-words representation of each paper.
• BlogCatalog (Meng et al., 2019): a social network from the BlogCatalog website, where nodes are bloggers and edges are their social relationships. Node attributes are short user-provided descriptions of their blogs, and the labels are the topic categories provided by the authors, divided into 6 classes.
• Flickr (Meng et al., 2019): this network is built from the profile and relation data of Flickr users.
We treat each user as a node and the relationship between two users as an edge; the labels represent the interest groups of the users. Detailed statistics of the datasets are shown in Table 1.

4.2. BASELINES

We compare with several state-of-the-art baselines to verify the effectiveness of the proposed DGCN:
• DeepWalk (Perozzi et al., 2014) is a random-walk-based network embedding method that learns features by treating truncated random walks in a graph as the equivalent of sentences.
• LINE (Tang et al., 2015) is a large-scale embedding method that preserves both the local and the global network structure.
• GCN (Kipf & Welling, 2016) is a variant of convolutional neural networks that aggregates the information of neighboring nodes to obtain node features.
• kNN-GCN has the same network structure as GCN, but uses the aforementioned feature graph $A_2$ (see Section 3.3).
• GAT (Veličković et al., 2017) is a graph-attention-based method that assigns different weights to nodes during aggregation.
• DEMO-Net (Wu et al., 2019) assumes that nodes with the same degree share the same graph convolution and formulates feature aggregation as a multi-task learning problem according to node degree.
• MixHop (Abu-El-Haija et al., 2019) learns neighborhood mixing relationships by repeatedly mixing the feature representations of neighbors at various distances.
• AM-GCN (Wang et al., 2020) extracts embeddings from node features, topological structures, and their combinations, and uses an attention mechanism to learn adaptive importance weights for the embeddings.

4.3. RESULTS

We train the DGCN network described in Section 3 on the five public datasets and evaluate prediction accuracy on a test set of 1,000 labeled examples; all experiments are optimized with the Adam optimizer. The model uses GCN and GAT branches with two layers each, and the quantitative results are shown in Table 2.
• As Table 2 shows, DGCN exceeds the baselines on most datasets, which demonstrates the effectiveness of our method. On most datasets, DGCN outperforms AM-GCN, which uses two graphs, as well as GCN, kNN-GCN, and GAT, which use a single graph, indicating that DGCN captures more of the information that matches the task objectives. In addition, compared with AM-GCN, which also learns from different graphs, DGCN learns better node embeddings through its complementary learning mechanism.
• The main difference between GCN and kNN-GCN is that they use the structure graph and the cosine graph, respectively. For a given dataset, the graph more relevant to its classification goal (i.e., whose edges are more likely to connect nodes of the same class) performs better. On UAI2010, BlogCatalog, and Flickr, kNN-GCN is significantly better than GCN, while the opposite holds on the other two datasets. This means that on UAI2010, BlogCatalog, and Flickr the cosine graph is closer to the classification target than the structure graph.

5. CONCLUSION

In this paper, aiming at the problem of semi-supervised graph node classification, we propose a novel dual graph complementary network (DGCN), which uses graphs that emphasize different attributes of the input to guide the aggregation process. In addition, to further capture richer information, we use two different branches to perform feature learning separately, with a disparity constraint between them to further expand their difference. However, using only the diversity loss may retain too much unnecessary redundant information, which can interfere with the truly important information. Therefore, our future work will try to emphasize the common attributes in the embeddings while expanding the differences between branches.



Figure 1: The framework of the DGCN network. The original dataset contains a graph and node features. First, the node features are used to construct another graph; then viewer 1 and viewer 2 observe the two graphs at the same time, producing the latent features $H^{gcn}_{1,l}$, $H^{gcn}_{2,l}$, $H^{gat}_{1,l}$, and $H^{gat}_{2,l}$. Finally, we fuse the GCN views and the GAT views respectively to obtain $H^{gcn}_{c}$ and $H^{gat}_{c}$.


Table 1: Statistics of the datasets. See Section 4.1 for details.

Table 2: Experimental results (%) on the node classification task. L/C denotes the number of labeled nodes per class.

The extensive experiments on several datasets further demonstrate the effectiveness of our DGCN algorithm. In the future, we will also study how to measure the correlation between graphs and the training objective, and use it to further enrich our model.

