TOPOTER: UNSUPERVISED LEARNING OF TOPOLOGY TRANSFORMATION EQUIVARIANT REPRESENTATIONS

Abstract

We present Topology Transformation Equivariant Representation (TopoTER) learning, a general paradigm for the unsupervised learning of node representations of graph data, with wide applicability to Graph Convolutional Neural Networks (GCNNs). We formalize the TopoTER from an information-theoretic perspective, maximizing the mutual information between topology transformations and node representations before and after the transformations. We derive that maximizing this mutual information can be relaxed to minimizing the cross entropy between the applied topology transformation and its estimation from node representations. In particular, we sample a subset of node pairs from the original graph and flip the edge connectivity between each pair to transform the graph topology. We then self-train a representation encoder to learn node representations by reconstructing the topology transformations from the feature representations of the original and transformed graphs. In experiments, we apply the TopoTER to downstream node and graph classification tasks, and results show that the TopoTER outperforms state-of-the-art unsupervised approaches.

1. INTRODUCTION

Graphs provide a natural and efficient representation for non-Euclidean data, such as brain networks, social networks, citation networks, and 3D point clouds. Graph Convolutional Neural Networks (GCNNs) (Bronstein et al., 2017) have been proposed to generalize CNNs to learn representations from non-Euclidean data, enabling significant advances in various applications such as node classification (Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2019a) and graph classification (Xu et al., 2019b). However, most existing GCNNs are trained in a supervised fashion, requiring a large amount of labeled data for network training. This limits the applications of GCNNs, since it is often costly to collect adequately labeled data, especially on large-scale graphs. This motivates the proposed research to learn graph feature representations in an unsupervised fashion, which enables the discovery of intrinsic graph structures and thus adapts to various downstream tasks.

Auto-Encoders (AEs) and Generative Adversarial Networks (GANs) are the two most representative unsupervised learning methods. Based on AEs and GANs, many approaches have sought to learn transformation equivariant representations (TERs) to further improve the quality of unsupervised representation learning. The underlying assumption is that representations equivarying to transformations are able to encode the intrinsic structures of data, such that the transformations can be reconstructed from the representations before and after transformations (Qi et al., 2019b). Learning TERs traces back to Hinton's seminal work on learning transformation capsules (Hinton et al., 2011), and embodies a variety of methods developed for Euclidean data (Kivinen & Williams, 2011; Sohn & Lee, 2012; Schmidt & Roth, 2012; Skibbe, 2013; Lenc & Vedaldi, 2015; Gens & Domingos, 2014; Dieleman et al., 2015; 2016; Zhang et al., 2019; Qi et al., 2019a). Further, Gao et al.
(2020) extend transformation equivariant representation learning to the non-Euclidean domain, formalizing Graph Transformation Equivariant Representation (GraphTER) learning by auto-encoding node-wise transformations in an unsupervised fashion. Nevertheless, only transformations on node features are explored, while the underlying graph may vary implicitly. The graph topology, which is crucial in unsupervised graph representation learning, has not been fully explored yet.

To this end, we propose Topology Transformation Equivariant Representation (TopoTER) learning to infer unsupervised graph feature representations by estimating topology transformations. Instead of transforming node features as in the GraphTER, the proposed TopoTER studies transformation equivariant representation learning by transforming the graph topology, i.e., adding or removing edges to perturb the graph structure. The same input signals are then attached to the resultant graph topologies, resulting in different graph representations. This provides an insight into how the same input signals associated with different graph topologies lead to equivariant representations, enabling the fusion of node features and graph topology in GCNNs. Formally, we formulate the TopoTER from an information-theoretic perspective, aiming to maximize the mutual information between topology transformations and the feature representations of the original and transformed graphs. We derive that maximizing this mutual information can be relaxed to minimizing the cross entropy between the applied topology transformations and their estimation from the learned representations of the graph data under these transformations. Specifically, given an input graph and its associated node features, we first sample a subset of node pairs from the graph and flip the edge connectivity between each pair at a given perturbation rate, leading to a transformed graph with the attached node features.
Then, we design a graph-convolutional auto-encoder architecture, where the encoder learns node-wise representations over the original and transformed graphs respectively, and the decoder predicts the topology transformations of edge connectivity from both representations by minimizing the cross entropy between the applied and estimated transformations. Experimental results demonstrate that the proposed TopoTER model outperforms state-of-the-art unsupervised models, and at times even achieves results comparable to (semi-)supervised approaches on node classification and graph classification tasks. Our main contributions are summarized as follows.

• We propose Topology Transformation Equivariant Representation (TopoTER) learning to infer expressive node feature representations in an unsupervised fashion, which characterize the intrinsic structures of graphs and the associated features by exploring transformations of the connectivity topology.

• We formulate the TopoTER from an information-theoretic perspective, maximizing the mutual information between feature representations and topology transformations, which can be relaxed to cross entropy minimization between the applied transformations and their prediction in an end-to-end graph-convolutional auto-encoder architecture.

• Experiments demonstrate that the proposed TopoTER model outperforms state-of-the-art unsupervised methods in both node classification and graph classification.

2. RELATED WORK

Graph Auto-Encoders. Graph Auto-Encoders (GAEs) are among the most representative unsupervised methods. GAEs encode graph data into a feature space via an encoder and reconstruct the input graph data from the encoded feature representations via a decoder. GAEs are often used to learn network embeddings and graph generative distributions (Wu et al., 2020). For network embedding learning, GAEs learn the feature representation of each node by reconstructing graph structural information, such as the graph adjacency matrix (Kipf & Welling, 2016) and the positive pointwise mutual information (PPMI) matrix (Cao et al., 2016; Wang et al., 2016). For graph generation, some methods generate the nodes and edges of a graph alternately (You et al., 2018), while other methods output an entire graph at once (Simonovsky & Komodakis, 2018; Ma et al., 2018; De Cao & Kipf, 2018).

Graph Contrastive Learning. An important paradigm called contrastive learning aims to train an encoder to be contrastive between the representations of positive samples and negative samples. Recent contrastive learning frameworks can be divided into two categories (Liu et al., 2020): context-instance contrast and context-context contrast. Context-instance contrast focuses on modeling the relationships between the local features of a sample and its global context representation. Deep InfoMax (DIM) (Hjelm et al., 2018) first proposes to maximize the mutual information between a local patch and its global context through a contrastive learning task. Deep Graph InfoMax (DGI) (Velickovic et al., 2019) proposes to learn node-level feature representations by extending DIM to graph-structured data, while InfoGraph (Sun et al., 2020a) uses mutual information maximization for unsupervised representation learning on entire graphs. Peng et al. (2020) propose a Graphical Mutual Information (GMI) approach to maximize the mutual information of both features and edges between inputs and outputs. In contrast to context-instance methods, context-context contrast studies the relationships between the global representations of different samples. M3S (Sun et al., 2020b) adopts a self-supervised pre-training paradigm as in DeepCluster (Caron et al., 2018) for better semi-supervised prediction in GCNNs. Graph Contrastive Coding (GCC) (Qiu et al., 2020) designs the pre-training task as subgraph instance discrimination in and across networks to empower graph neural networks to learn intrinsic structural representations.

Figure 1: An example of graphs before and after a topology transformation ∆A.

Transformation Equivariant Representation Learning. Many approaches have sought to learn transformation equivariant representations.
Learning transformation equivariant representations has been advocated in Hinton's seminal work on learning transformation capsules. Following this, a variety of approaches have been proposed to learn transformation equivariant representations (Gens & Domingos, 2014; Dieleman et al., 2015; 2016; Cohen & Welling, 2016; Lenssen et al., 2018). To generalize to generic transformations, Zhang et al. (2019) propose to learn unsupervised feature representations via Auto-Encoding Transformations (AET) by estimating transformations from the learned feature representations of both the original and transformed images, while Qi et al. (2019a) extend the AET from an information-theoretic perspective by maximizing a lower bound of the mutual information between transformations and representations. Wang et al. (2020) extend the AET to Generative Adversarial Networks (GANs) for unsupervised image synthesis and representation learning. Gao et al. (2020) introduce the GraphTER model that extends the AET to graph-structured data, formalized by auto-encoding node-wise transformations in an unsupervised manner. de Haan et al. (2020) propose Gauge Equivariant Mesh CNNs, which generalize GCNNs to apply anisotropic gauge equivariant kernels. Fuchs et al. (2020) introduce a self-attention mechanism specifically for 3D point cloud data, which adheres to equivariance constraints, improving robustness to nuisance transformations.

3.1. PRELIMINARY

We consider an undirected graph G = {V, E, A} composed of a node set V of cardinality |V| = N and an edge set E of cardinality |E| = M connecting the nodes. A is a real symmetric N × N adjacency matrix that encodes the graph structure, where a_{i,j} = 1 if there exists an edge (i, j) between nodes i and j, and a_{i,j} = 0 otherwise. A graph signal refers to data that reside on the nodes of a graph G, denoted by X ∈ R^{N×C}, with the i-th row representing the C-dimensional signal on the i-th node of V.
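The definitions above can be made concrete with a toy example. This is a minimal NumPy sketch, not taken from the paper; the particular graph and signal values are arbitrary.

```python
import numpy as np

# Toy undirected graph with N = 4 nodes and M = 3 edges (a path graph).
# A is the symmetric N x N adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

# Graph signal: a C = 2 dimensional feature vector on each of the N nodes,
# one row per node, X in R^{N x C}.
X = np.arange(8, dtype=float).reshape(4, 2)

assert (A == A.T).all()   # undirected graph => symmetric adjacency
M = int(A.sum()) // 2     # each undirected edge appears twice in A
```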

3.2. TOPOLOGY TRANSFORMATION

We define the topology transformation t as adding or removing edges from the original edge set E of graph G. This can be done by sampling, i.i.d., a switch parameter σ_{i,j} as in (Velickovic et al., 2019), which determines whether to flip edge (i, j) in the adjacency matrix. Assuming a Bernoulli distribution B(p), where p denotes the probability of each edge being modified, we draw a random symmetric matrix Σ = {σ_{i,j}}_{N×N} from B(p), i.e., Σ ∼ B(p). We then acquire the perturbed adjacency matrix as

Ã = A ⊕ Σ,

where ⊕ is the exclusive OR (XOR) operation. This strategy produces a transformed graph through the topology transformation t, i.e., Ã = t(A). Here, an edge perturbation probability of p = 0 yields a non-transformed adjacency matrix, which is the special case of the identity transformation on A. The transformed adjacency matrix Ã can also be written as the sum of the original adjacency matrix A and a topology perturbation matrix ∆A:

Ã = A + ∆A,

where ∆A = {δa_{i,j}}_{N×N} encodes the perturbation of edges, with δa_{i,j} ∈ {−1, 0, 1}. As shown in Fig. 1, when δa_{i,j} = 0, the edge between node i and node j remains unchanged (black solid lines); when δa_{i,j} = −1 or 1, the edge between node i and node j is removed (orange dotted lines) or added (blue solid lines), respectively.
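The XOR-based perturbation above can be sketched in a few lines of NumPy. This is an illustrative implementation under our own assumptions (sampling only the upper triangle to keep the graph undirected), not the authors' code.

```python
import numpy as np

def topology_transform(A, p, rng):
    """Flip each node pair's connectivity i.i.d. with probability p:
    draw a symmetric switch matrix Sigma ~ Bernoulli(p) and take
    A_tilde = A XOR Sigma. Returns A_tilde and Delta_A = A_tilde - A,
    whose entries lie in {-1, 0, 1}."""
    N = A.shape[0]
    # Sample only the strict upper triangle, then mirror it, so the
    # perturbed graph stays undirected (symmetric adjacency).
    upper = np.triu(rng.random((N, N)) < p, k=1).astype(A.dtype)
    Sigma = upper + upper.T
    A_t = np.bitwise_xor(A, Sigma)   # flip the selected entries
    delta = A_t - A                  # topology perturbation matrix
    return A_t, delta

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
A_t, dA = topology_transform(A, p=0.5, rng=rng)
```

Note that p = 0 leaves the adjacency matrix untouched, matching the identity-transformation special case in the text.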

3.3. THE FORMULATION OF TOPOTER

Definition 1 Given a pair of graph signal and adjacency matrix (X, A), and a pair of graph signal and transformed adjacency matrix (X, Ã) obtained by a topology transformation t(·), a function E(·) is transformation equivariant if it satisfies

E(X, Ã) = E(X, t(A)) = ρ(t)[E(X, A)],

where ρ(t)[·] is a homomorphism of the transformation t in the representation space.

Let us denote H = E(X, A) and H̃ = E(X, Ã). We seek to learn an encoder E : (X, A) → H; (X, Ã) → H̃ that maps both the original and transformed samples to representations {H, H̃} equivariant to the sampled transformation t, whose information can thus be inferred from the representations via a decoder D : (H̃, H) → ∆A as much as possible. From an information-theoretic perspective, this requires that (H, ∆A) jointly contain all necessary information about H̃. A natural choice to formalize the topology transformation equivariance is thus the mutual information I(H, ∆A; H̃) between (H, ∆A) and H̃. The larger the mutual information, the more knowledge about ∆A can be inferred from the representations {H, H̃}. Hence, we propose to learn topology transformation equivariant representations by maximizing this mutual information:

max_θ I(H, ∆A; H̃), (4)

where θ denotes the parameters of the auto-encoder network. Nevertheless, it is difficult to compute the mutual information directly. Instead, we derive that maximizing the mutual information can be relaxed to minimizing a cross entropy, as described in the following theorem.

Theorem 1 The maximization of the mutual information I(H, ∆A; H̃) can be relaxed to the minimization of the cross entropy H(p‖q) between the probability distributions p(∆A, H̃, H) and q(∆A | H̃, H):

min_θ H(p(∆A, H̃, H) ‖ q(∆A | H̃, H)) = min_θ −E_{p(∆A, H̃, H)} log q(∆A | H̃, H). (5)

Proof By the chain rule of mutual information, we have

I(H, ∆A; H̃) = I(∆A; H̃ | H) + I(H; H̃) ≥ I(∆A; H̃ | H).

Thus the mutual information I(∆A; H̃ | H) is a lower bound of I(H, ∆A; H̃), attained with equality when I(H; H̃) = 0. Therefore, we relax the objective to maximizing this lower bound between the transformed representation H̃ and the topology transformation ∆A:

I(∆A; H̃ | H) = H(∆A | H) − H(∆A | H̃, H), (6)

where H(∆A | H) = H(∆A) is a constant, since the transformation ∆A is sampled independently of the representation H. Maximizing Eq. (6) thus amounts to minimizing the conditional entropy H(∆A | H̃, H), which is upper bounded by the joint entropy H(∆A, H̃, H).

We next introduce a conditional probability distribution q(∆A | H̃, H) to approximate the intractable posterior p(∆A | H̃, H) with an estimated ∆A. By the definition of the Kullback-Leibler divergence, we have

H(∆A | H̃, H) ≤ H(∆A, H̃, H) = H(p) = H(p‖q) − D_KL(p‖q) ≤ H(p‖q),

where D_KL(p‖q) denotes the Kullback-Leibler divergence between p and q, which is non-negative, and H(p‖q) is the cross entropy between p and q. Thus, Eq. (6) is relaxed to minimizing the cross entropy as an upper bound, which yields Eq. (5). Hence, we relax the maximization problem in Eq. (4) to the optimization in Eq. (5).

Based on Theorem 1, we train the decoder D to learn the distribution q(∆A | H̃, H) so as to estimate the topology transformation ∆A from the encoded {H̃, H}, where the input pairs of original and transformed graph representations {H̃, H}, as well as the ground-truth target ∆A, can be sampled tractably from the factorization p(∆A, H̃, H) = p(∆A)p(H)p(H̃ | ∆A, H). This allows us to minimize the cross entropy between p(∆A, H̃, H) and q(∆A | H̃, H) as in Eq. (5) with training triplets (H̃, H; ∆A) drawn from this tractable factorization. Hence, we formulate the TopoTER as the joint optimization of the representation encoder E and the transformation decoder D.

3.4. THE ALGORITHM

We design a graph-convolutional auto-encoder network for the TopoTER learning, as illustrated in Fig. 2. Given a graph signal X associated with a graph G = {V, E, A}, the proposed unsupervised learning algorithm for the TopoTER consists of three steps: 1) topology transformation, which samples and perturbs some edges of E to acquire a transformed adjacency matrix Ã; 2) representation encoding, which extracts the feature representations of the graph signal before and after the topology transformation; 3) transformation decoding, which estimates the topology transformation parameters from the learned feature representations. We elaborate on the three steps as follows.

Topology Transformation. We randomly sample a subset of node pairs for topology perturbation, i.e., adding or removing edges, which not only enables characterizing local graph structures at various scales, but also reduces the number of edge transformation parameters to estimate for computational efficiency. In practice, in each iteration of training, we take the set S_1 of all connected node pairs and randomly sample an equal-sized subset S_0 of disconnected node pairs, i.e.,

S_0 = {(i, j) | a_{i,j} = 0}, S_1 = {(i, j) | a_{i,j} = 1},

where |S_0| = |S_1| = M. Next, we randomly split S_0 and S_1 into two disjoint subsets each, i.e.,

S_i = S_i^(1) ∪ S_i^(2), S_i^(1) ∩ S_i^(2) = ∅, |S_i^(1)| = r · |S_i|, i ∈ {0, 1}, (9)

where r is the edge perturbation rate. Then, for each node pair (i, j) in S_0^(1) and S_1^(1), we flip the corresponding entry in the original graph adjacency matrix: if a_{i,j} = 0, we set ã_{i,j} = 1; otherwise, we set ã_{i,j} = 0. For each node pair (i, j) in S_0^(2) and S_1^(2), we keep the original connectivity unchanged, i.e., ã_{i,j} = a_{i,j}. This leads to the transformed adjacency matrix Ã, as well as the sampled transformation parameters obtained by accessing ∆A at the positions (i, j) in S_0 and S_1. We can categorize the sampled topology transformation parameters into four types:

1. add an edge between a disconnected node pair, i.e., {t : a_{i,j} = 0 → ã_{i,j} = 1, (i, j) ∈ S_0^(1)};
2. delete the edge between a connected node pair, i.e., {t : a_{i,j} = 1 → ã_{i,j} = 0, (i, j) ∈ S_1^(1)};
3. keep the disconnection between node pairs in S_0^(2), i.e., {t : a_{i,j} = 0 → ã_{i,j} = 0, (i, j) ∈ S_0^(2)};
4. keep the connection between node pairs in S_1^(2), i.e., {t : a_{i,j} = 1 → ã_{i,j} = 1, (i, j) ∈ S_1^(2)}.

Thus, we cast the problem of estimating the transformation parameters in ∆A from (H̃, H) as a classification problem over the transformation parameter types. The proportion of the four types is r : r : (1 − r) : (1 − r).

Representation Encoder. We train an encoder E : (X, A) → E(X, A) to encode the feature representation of each node in the graph. As demonstrated in Fig. 2, we leverage GCNNs with shared weights to extract the feature representations of the graph signal. Taking the GCN (Kipf & Welling, 2017) as an example, the graph convolution is defined as

H = E(X, A) = D^{−1/2}(A + I)D^{−1/2} X W, (10)

where D is the degree matrix of A + I, W ∈ R^{C×F} is a learnable parameter matrix, and H = [h_1, ..., h_N] ∈ R^{N×F} denotes the node-wise feature matrix with F output channels. Similarly, the node features of the transformed counterpart are computed with the shared weights W:

H̃ = E(X, Ã) = D̃^{−1/2}(Ã + I)D̃^{−1/2} X W = D̃^{−1/2}(A + I)D̃^{−1/2} X W + D̃^{−1/2} ∆A D̃^{−1/2} X W, (11)

where D̃ is the degree matrix of Ã + I. We thus acquire the feature representations H and H̃ of the graph signal before and after the topology transformation.

Transformation Decoder. Comparing Eq. (10) and Eq. (11), the prominent difference between H and H̃ lies in the second term of Eq. (11) featuring ∆A. This enables us to train a decoder D : (H̃, H) → ∆A to estimate the topology transformation from the joint representations before and after the transformation. We first take the difference between the extracted feature representations before and after the transformation along the feature channels:

∆H = H̃ − H = [δh_1, ..., δh_N] ∈ R^{N×F}.
Thus, we can predict the topology transformation between node i and node j from the node-wise feature difference ∆H by constructing the edge representation

e_{i,j} = exp{−(δh_i − δh_j) ⊙ (δh_i − δh_j)} / ‖exp{−(δh_i − δh_j) ⊙ (δh_i − δh_j)}‖_1 ∈ R^F, ∀(i, j) ∈ S_0 ∪ S_1,

where ⊙ denotes the Hadamard product of two vectors to capture the feature representation, and ‖·‖_1 is the ℓ_1-norm of a vector for normalization. The edge representation e_{i,j} of nodes i and j is then fed into several linear layers for the prediction of the topology transformation,

ŷ_{i,j} = softmax(linear(e_{i,j})), ∀(i, j) ∈ S_0 ∪ S_1, (14)

where softmax(·) is the activation function. According to Eq. (5), the entire auto-encoder network is trained by minimizing the cross entropy

L = −E_{(i,j)∈S_0∪S_1} Σ_{f=0}^{3} y_{i,j}^{(f)} log ŷ_{i,j}^{(f)},

where f ∈ {0, 1, 2, 3} denotes the transformation type, y is the ground-truth binary indicator (0 or 1) for each transformation parameter type, and ŷ is the predicted probability.
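The decoder computation above can be sketched in NumPy. This is an illustrative single-linear-layer version under our own assumptions; the weight matrix W and bias b stand in for the (unspecified) linear layers, and the function names are ours.

```python
import numpy as np

def edge_logits(H, H_t, pairs, W, b):
    """For each sampled pair (i, j): take the node-wise difference
    Delta_H = H_tilde - H, form the l1-normalized Hadamard edge feature
    e_ij, and map it through one linear layer + softmax to a distribution
    over the four transformation types."""
    dH = H_t - H                          # feature difference
    probs = []
    for i, j in pairs:
        d = dH[i] - dH[j]
        e = np.exp(-d * d)                # element-wise (Hadamard) feature
        e = e / np.abs(e).sum()           # l1 normalization
        z = e @ W + b                     # linear layer (F -> 4)
        z = np.exp(z - z.max())
        probs.append(z / z.sum())         # softmax over the 4 types
    return np.stack(probs)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the ground-truth transformation types.
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))           # original representations
H_t = rng.standard_normal((5, 8))         # transformed representations
W = rng.standard_normal((8, 4))           # hypothetical decoder weights
b = np.zeros(4)
probs = edge_logits(H, H_t, [(0, 1), (2, 3)], W, b)
loss = cross_entropy(probs, np.array([0, 3]))
```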

4. EXPERIMENTS

4.1. NODE CLASSIFICATION

Datasets. We adopt three citation networks to evaluate our model: Cora, Citeseer, and Pubmed (Sen et al., 2008), where nodes correspond to documents and edges represent citations. We follow the standard train/test split in (Kipf & Welling, 2017) to conduct the experiments.

Implementation Details. In this task, the auto-encoder network is trained via the Adam optimizer with a learning rate of 10^{-4}. We use the same early stopping strategy as DGI (Velickovic et al., 2019) on the observed training loss, with a patience of 20 epochs. We deploy one Simple Graph Convolution (SGC) layer (Wu et al., 2019) as our encoder, with the order of the adjacency matrix set to 2; we study the influence of the order in Appendix A. A LeakyReLU activation with a negative slope of 0.1 is employed after the SGC layer. As in DGI (Velickovic et al., 2019), we set the number of output channels to F = 512 for the Cora and Citeseer datasets, and to 256 for the Pubmed dataset due to memory limitations. After the encoder, we use one linear layer to classify the transformation types. We set the edge perturbation rate in Eq. (9) to r = {0.7, 0.4, 0.7} for Cora, Citeseer, and Pubmed, respectively; an analysis of the edge perturbation rate is presented in Appendix B. During the training of the classifier, the SGC layer in the encoder extracts graph feature representations with its weights frozen, followed by one linear layer that maps the features to classification scores.

Experimental Results. We compare the proposed method with five unsupervised methods: the node embedding method DeepWalk, two graph auto-encoders GAE and VGAE (Kipf & Welling, 2016), and two contrastive learning methods DGI (Velickovic et al., 2019) and GMI (Peng et al., 2020). Additionally, we report the results of Raw Features and DeepWalk+Features (Perozzi et al., 2014) under the same settings.
For fair comparison, the results of all other unsupervised methods are reproduced using the same encoder architecture as the TopoTER, except DeepWalk and Raw Features. We report the mean classification accuracy (with standard deviation) on the test nodes for all methods after 50 runs of training. As reported in Tab. 1, the TopoTER outperforms all competing unsupervised methods on the three datasets. Further, the proposed unsupervised method achieves performance comparable to semi-supervised results, significantly closing the gap between unsupervised and semi-supervised methods. Moreover, we compare the proposed TopoTER with the two contrastive learning methods DGI and GMI in terms of model complexity, as reported in Tab. 2. The number of parameters in our model is smaller than that of DGI and less than half of that of GMI, which shows that the TopoTER model is lightweight.

4.2. GRAPH CLASSIFICATION

Implementation Details. In this task, the entire network is trained via the Adam optimizer with a batch size of 64 and a learning rate of 10^{-3}. For the encoder architecture, we follow the same encoder settings as in the released code of InfoGraph (Sun et al., 2020a), i.e., three Graph Isomorphism Network (GIN) layers (Xu et al., 2019b) with batch normalization. We also use one linear layer to classify the transformation types, and set the sampling rate to r = 0.5 for all datasets. During the evaluation stage, the entire encoder is frozen to extract node-level feature representations, which go through a global add-pooling layer to acquire global features. We then use LIBSVM to map these global features to classification scores. We adopt the same procedure as previous works (Sun et al., 2020a) for a fair comparison, report the 10-fold cross-validation accuracy, and repeat the experiments five times.

Experimental Results.
We take six graph kernel approaches for comparison: Random Walk (RW) (Gärtner et al., 2003), Shortest Path Kernel (SP) (Borgwardt & Kriegel, 2005), Graphlet Kernel (GK) (Shervashidze et al., 2009), Weisfeiler-Lehman Sub-tree Kernel (WL) (Shervashidze et al., 2011), Deep Graph Kernels (DGK) (Yanardag & Vishwanathan, 2015), and Multi-Scale Laplacian Kernel (MLG) (Kondor & Pan, 2016). Aside from graph kernel methods, we also compare with three unsupervised graph-level representation learning methods: node2vec (Grover & Leskovec, 2016), sub2vec (Adhikari et al., 2018), and graph2vec (Narayanan et al., 2017), as well as one contrastive learning method: InfoGraph (Sun et al., 2020a). The experimental results of unsupervised graph classification are presented in Tab. 3. The proposed TopoTER outperforms all unsupervised baseline methods on the first five datasets, and achieves comparable results on the remaining dataset. The proposed approach also reaches the performance of supervised methods at times, validating the effectiveness of the TopoTER model.

5. CONCLUSION

We propose Topology Transformation Equivariant Representation (TopoTER) learning for unsupervised representation learning on graph data. By maximizing the mutual information between topology transformations and feature representations before and after transformations, the TopoTER enforces the encoder to learn intrinsic graph feature representations that contain sufficient information about the structures under the applied topology transformations. We apply the TopoTER model to node classification and graph classification tasks, and results demonstrate that the TopoTER outperforms state-of-the-art unsupervised approaches and at times reaches the performance of supervised methods.

A EXPERIMENTS ON DIFFERENT ORDERS OF THE ADJACENCY MATRIX

As presented in Sec. 3.2, we perturb 1-hop neighborhoods via the proposed topology transformations, possibly leading to significant changes in the graph topology. This increases the difficulty of predicting the topology transformations with a one-layer GCN (Kipf & Welling, 2017), which only aggregates 1-hop neighborhood information. Therefore, we employ one Simple Graph Convolution (SGC) layer (Wu et al., 2019) with order k as our encoder E(·), whose output feature representations aggregate multi-hop neighborhood information. Formally, the SGC layer is defined as

H = E(X, A) = (D^{−1/2}(A + I)D^{−1/2})^k X W,

where D is the degree matrix of A + I, W ∈ R^{C×F} is a learnable parameter matrix, and k is the order of the normalized adjacency matrix. To study the influence of different orders of the adjacency matrix, we train five models with orders from 1 to 5 on the node classification task. Fig. 3 presents the node classification accuracy under different orders of the adjacency matrix for TopoTER and DGI, respectively. The proposed TopoTER achieves the best classification performance at k = {4, 2, 3} on the three datasets, respectively. When k = 1, our model still achieves reasonable results, although it is difficult to predict the topology transformations from 1-hop neighborhood information alone; when k > 1, the proposed TopoTER outperforms DGI by a large margin on the Cora and Pubmed datasets, and achieves comparable results to DGI on the Citeseer dataset. This is because DGI adopts feature shuffling to generate negative samples, which is insufficient for learning contrastive feature representations when aggregating multi-hop neighborhood information, while TopoTER takes advantage of the multi-hop neighborhood information to predict the topology transformations, leading to improved performance.
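The order-k SGC propagation above can be sketched as follows. This is a minimal NumPy illustration of the formula, not the paper's training code; the graph, signal, and weights are arbitrary placeholders.

```python
import numpy as np

def sgc_encode(X, A, W, k):
    """Order-k SGC: H = (D^{-1/2}(A + I)D^{-1/2})^k X W, so each node
    aggregates its k-hop neighborhood before one shared linear map W."""
    N = A.shape[0]
    A_hat = A + np.eye(N)                  # add self-loops
    d = A_hat.sum(axis=1)                  # degrees of A + I
    S = A_hat / np.sqrt(np.outer(d, d))    # D^{-1/2}(A + I)D^{-1/2}
    H = X
    for _ in range(k):                     # k propagation steps
        H = S @ H
    return H @ W                           # single linear map

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.standard_normal((4, 3))            # N = 4 nodes, C = 3 channels
W = rng.standard_normal((3, 2))            # C x F weight matrix
H1 = sgc_encode(X, A, W, k=1)
H2 = sgc_encode(X, A, W, k=2)
```

Note that k = 0 reduces to the plain linear map X W, while larger k widens the receptive field without adding parameters, which is what makes multi-hop aggregation cheap here.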



Figure 3: Node classification accuracies under different orders of the adjacency matrix on the Cora, Citeseer, and Pubmed datasets.

Figure 2: The architecture of the proposed TopoTER.

Table 1: Node classification accuracies (with standard deviation) in percentage on three datasets. X, A, Y denote the input data, adjacency matrix, and labels, respectively.

Table 2: Model size comparison of DGI, GMI, and the proposed TopoTER.

Table 3: Graph classification accuracies (with standard deviation) in percentage on six datasets. ">1 Day" indicates that the computation exceeds 24 hours; "OOM" indicates an out-of-memory error.

B EXPERIMENTS ON DIFFERENT EDGE PERTURBATION RATES

Further, we evaluate the influence of the edge perturbation rate in Eq. (9) on the node classification task. We choose 11 edge perturbation rates from 0.0 to 1.0 at intervals of 0.1 to train the proposed TopoTER. We use one SGC layer as our encoder E(·), with the order of the adjacency matrix set to 1. As presented in Fig. 4, the blue solid line with error bars shows the classification accuracy of our TopoTER under different edge perturbation rates. We also provide the classification accuracy on feature representations from a randomly initialized encoder E(·), denoted as Random Init., which serves as a lower bound on the performance. The classification performance is best when the graph is perturbed at a moderate edge perturbation rate, e.g., r = {0.6, 0.5, 0.6} for the Cora, Citeseer, and Pubmed datasets, respectively. When the edge perturbation rate is r = 0.0, the unsupervised training task of the TopoTER degenerates to link prediction, which cannot take advantage of predicting topology transformations; when the edge perturbation rate is r = 1.0, the TopoTER still achieves reasonable classification results, which shows the stability of our model under high edge perturbation rates. Meanwhile, the proposed TopoTER outperforms Random Init. by a large margin, which validates the effectiveness of the proposed unsupervised training strategy.

