SLAPS: SELF-SUPERVISION IMPROVES STRUCTURE LEARNING FOR GRAPH NEURAL NETWORKS

Abstract

Graph neural networks (GNNs) work well when the graph structure is provided. However, this structure may not always be available in real-world applications. One solution to this problem is to infer the latent structure and then apply a GNN to the inferred graph. Unfortunately, the space of possible graph structures grows super-exponentially with the number of nodes and so the available node labels may be insufficient for learning both the structure and the GNN parameters. In this work, we propose the Simultaneous Learning of Adjacency and GNN Parameters with Self-supervision, or SLAPS, a method that provides more supervision for inferring a graph structure. This approach consists of training a denoising autoencoder GNN in parallel with the task-specific GNN. The autoencoder is trained to reconstruct the initial node features given noisy node features as well as a structure provided by a learnable graph generator. We explore the design space of SLAPS by comparing different graph generation and symmetrization approaches. A comprehensive experimental study demonstrates that SLAPS scales to large graphs with hundreds of thousands of nodes and outperforms several models that have been proposed to learn a task-specific graph structure on established benchmarks.

1. INTRODUCTION

Graph representation learning has grown rapidly and found applications in domains where data points define a graph (Chami et al., 2020; Kazemi et al., 2020). Graph neural networks (GNNs) (Scarselli et al., 2008) have been a key component of the success of research in this area. Following the success of graph convolutional networks (GCNs) (Kipf & Welling, 2017) on semi-supervised node classification, several other GNN variants have been proposed for different prediction tasks on graphs (Hamilton et al., 2017; Veličković et al., 2018; Gilmer et al., 2017; Battaglia et al., 2018), and the power of these models has been studied theoretically (Xu et al., 2019; Sato, 2020). GNNs take as input a set of node features and an adjacency matrix corresponding to the graph structure and, for each node, output an embedding that captures not only the initial features of the node but also the features and embeddings of its neighbors. The performance of GNNs depends heavily on the quality of the input graph structure and deteriorates substantially when the structure is noisy (see Zügner et al., 2018; Dai et al., 2018; Fox & Rajamanickam, 2019). The need for both node features and a clean graph structure impedes the applicability of GNNs in domains where one has access to a set of nodes and their features but not to their underlying graph structure, or has access only to a noisy structure. Examples of such domains include brain signal classification (Jang et al., 2019), computer-aided diagnosis (Cosmo et al., 2020), analysis of computer programs (Johnson et al., 2020), and particle reconstruction (Qasim et al., 2019). In this paper, we address this limitation by developing a model that learns the GNN parameters and an adjacency matrix simultaneously.
Since the number of possible graph structures grows super-exponentially with the number of nodes (Stanley, 1973) and obtaining node labels is typically costly, the available labels may not suffice for learning both the GNN parameters and an adjacency matrix, especially for semi-supervised node classification. Our main contribution is to supplement the classification task with a self-supervised task that helps learn a high-quality adjacency matrix. Our self-supervision approach masks some input features (or adds noise to them) and trains a separate GNN that updates the adjacency matrix in such a way that the masked (or noisy) features can be recovered. Introducing this self-supervision adds the inductive bias that a graph structure suitable for predicting the node features is also suitable for predicting the node labels. We experiment with several classification datasets. For datasets with a graph structure, we feed only the node features to our model. The model operates on the node features and an adjacency matrix that is learned simultaneously from data. We compare our model with different classes of methods: some that do not use a graph structure for predicting labels, some that use a fixed k-nearest-neighbors (kNN) graph built with a chosen similarity metric, and some that initialize the graph with kNN but then revise it throughout training. We show that our model consistently outperforms these methods and that the self-supervised task is key to its high performance. As an additional contribution, we provide an implementation for simultaneous structure and parameter learning that scales to graphs with hundreds of thousands of nodes.

2. RELATED WORK

Existing methods that relate to this work can be grouped into the following categories. Similarity Graph: One approach for inferring a graph structure is to select a similarity metric and set the edge weight between two nodes to their similarity (Roweis & Saul, 2000; Tenenbaum et al., 2000). To obtain a sparse structure, one may create a kNN similarity graph, connect only pairs of nodes whose similarity surpasses a predefined threshold, or sample edges. As an example, Gidaris & Komodakis (2019) create a (fixed) kNN graph using the cosine similarity of the node features. Wang et al. (2019b) extend this idea by creating a fresh graph in each layer of the GNN based on the node embedding similarities in that layer, as opposed to fixing a graph based solely on the initial features. Instead of choosing a single similarity metric, Halcrow et al. (2020) fuse several (potentially weak) measures of similarity. The quality of the predictions of these methods depends heavily on the choice of the similarity metric(s) and on the value of k for the kNN graph or the similarity threshold. Furthermore, designing an appropriate similarity metric may not be straightforward in some applications. Fully-connected Graph: Another approach is to assume a fully-connected graph and employ GNN variants such as graph attention networks (Veličković et al., 2018; Zhang et al., 2018) or the transformer (Vaswani et al., 2017), which infer the graph structure via an attention mechanism or using additional information. This approach has been used in computer vision (e.g., Suhail & Sigal, 2019), natural language processing (e.g., Zhu et al., 2019), and few-shot learning (e.g., Garcia & Bruna, 2017), where there are not many nodes. The complexity of this approach, however, grows rapidly with the number of nodes, making it applicable only to small graphs with a few thousand nodes and not scalable to the datasets we use in our experiments.
Learnable Graph: Instead of computing a similarity graph on the initial features, one may use a graph generator with learnable parameters. Li et al. (2018b) create a fully-connected graph based on a bilinear similarity function with learnable parameters. A common approach is to learn to project the nodes to a latent space where node similarities correspond to edge weights. Wu et al. (2018) project the nodes to a latent space by learning weights for each of the input features. Cosmo et al. (2020) and Qasim et al. (2019) use a multi-layer perceptron for the projection. Yu et al. (2020) use a GNN that projects the nodes into a latent space using the initial node features as well as an initial graph structure, aiming to provide a revised graph structure to the task-specific GNN. Franceschi et al. (2019) propose a model named LDS with a bi-level optimization setup for simultaneously learning the GNN parameters and a full adjacency matrix. Yang et al. (2019) update the input adjacency matrix based on the inductive bias that nodes belonging to the same class should be connected and nodes belonging to different classes should be disconnected. Chen et al. (2020) propose an iterative approach that alternates between projecting the nodes to a latent space and constructing an adjacency matrix from the latent representations. In our experiments, we compare with several approaches from this category. Leveraging Domain Knowledge: In applications where specific domain knowledge is available, one may leverage it to guide the model toward learning specific structures. For example, Johnson et al. (2020) leverage abstract syntax trees and regular languages in learning graph structures of Python programs that aid reasoning for downstream tasks. Jin et al.
(2020b) train GNNs that are robust to adversarial attacks by learning a cleaned version of the input poisoned adjacency matrix, using the domain knowledge that clean adjacency matrices are often sparse and low-rank and exhibit feature smoothness along connected nodes. Proposed Method: Our model falls within the learnable-graph category: we use graph generators with learnable parameters to infer the adjacency matrix. We supplement the training with a self-supervised objective to increase the amount of supervision for learning a graph structure. The self-supervised objective is generic and can be combined with many of the models described above. It is in the same vein as the auxiliary tasks used in multi-task learning in computer vision, natural language processing, and reinforcement learning (see, e.g., Jaderberg et al., 2016; Liebel & Körner, 2018; Alonso & Plank, 2016). The self-supervised task is inspired by the successful training procedures of several recent language models such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019b). Similar self-supervision techniques have also been employed for GNNs (Hu et al., 2020b;c; Jin et al., 2020a; You et al., 2020; Zhu et al., 2020). While we employ similar self-supervision techniques, our work differs from this line of work in that we use self-supervision for learning a graph structure, whereas the above methods use it to learn better (and, in some cases, transferable) GNN parameters. Specifically, we adopt the multi-task learning framework of You et al. (2020) with two differences: (1) we do not use shared parameters for the task-specific and self-supervised GNNs, and (2) instead of using a fixed adjacency matrix provided as input, we allow both GNNs to provide gradients to a generator that learns to generate a graph structure suitable for the downstream task.

3. BACKGROUND AND NOTATION

We use lowercase letters to denote scalars, bold lowercase letters to denote vectors, and bold uppercase letters to denote matrices. $I$ represents an identity matrix. For a vector $v$, we represent its $i$-th element as $v_i$. For a matrix $M$, we represent the $i$-th row as $M_i$ and the element at the $i$-th row and $j$-th column as $M_{ij}$. For an attributed graph, we use $n$, $m$, and $f$ to represent the number of nodes, edges, and features respectively, and denote the graph as $G = \{V, A, X\}$, where $V = \{v_1, \dots, v_n\}$ is a set of nodes, $A \in \mathbb{R}^{n \times n}$ is a (sparse) adjacency matrix with $A_{ij}$ indicating the weight of the edge from $v_i$ to $v_j$ ($A_{ij} = 0$ implies no edge), and $X \in \mathbb{R}^{n \times f}$ is a matrix whose rows correspond to node features or attributes. The degree matrix $D$ of a graph $G$ is a diagonal matrix with $D_{ii} = \sum_j A_{ij}$. Graph convolutional networks (GCNs) are a powerful variant of GNNs. For a graph $G = \{V, A, X\}$ with degree matrix $D$, layer $l$ of the GCN architecture is defined as $H^{(l)} = \sigma(\hat{A} H^{(l-1)} W^{(l)})$, where $\hat{A}$ is a normalized adjacency matrix, $H^{(l-1)} \in \mathbb{R}^{n \times d_{l-1}}$ contains the node representations at layer $l-1$ with $H^{(0)} = X$, $W^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ is a weight matrix, $\sigma$ is an activation function such as ReLU, and $H^{(l)} \in \mathbb{R}^{n \times d_l}$ contains the updated node embeddings. For undirected graphs, where the adjacency matrix is symmetric, $\hat{A} = D^{-\frac{1}{2}}(A + I)D^{-\frac{1}{2}}$ is the row-and-column normalized adjacency with self-loops; for directed graphs, where the adjacency matrix is not necessarily symmetric, $\hat{A} = D^{-1}(A + I)$ is the row-normalized adjacency with self-loops.
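The layer update and the two normalization schemes above can be sketched in NumPy. This is a minimal illustration with dense matrices, not the authors' implementation; function names are ours, and a real implementation would use sparse operations.

```python
import numpy as np

def normalize_adjacency(A, symmetric=True):
    """Add self-loops, then normalize: D^{-1/2}(A+I)D^{-1/2} for undirected
    graphs, or D^{-1}(A+I) (row normalization) for directed graphs."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)  # degrees of A + I, so d >= 1 everywhere
    if symmetric:
        d_inv_sqrt = 1.0 / np.sqrt(d)
        return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_hat / d[:, None]

def gcn_layer(A_norm, H, W):
    """One GCN layer with ReLU: H_out = ReLU(A_norm @ H @ W)."""
    return np.maximum(A_norm @ H @ W, 0.0)
```

With row normalization, each row of the resulting matrix sums to one, so each node averages over its neighborhood (including itself).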

4. SLAPS: SIMULTANEOUS LEARNING OF ADJACENCY MATRIX AND GNN PARAMETERS WITH SELF-SUPERVISION

We break our model into four components: (1) generator, (2) adjacency processor, (3) classifier, and (4) self-supervision. The generator takes the node features as input and generates a (possibly sparse, non-normalized, and non-symmetric) matrix $\tilde{A} \in \mathbb{R}^{n \times n}$. $\tilde{A}$ is then fed into the adjacency processor, which outputs $A \in \mathbb{R}^{n \times n}$, a normalized and, in some cases, symmetric version of $\tilde{A}$. The classifier is a GNN that receives $A$ as well as the node features as input and classifies the nodes into a set of predefined classes. The self-supervision component is a GNN that receives noisy features and the generated adjacency as input and aims to denoise the features. Figure 1 illustrates the different components; in what follows, we describe each component in more detail.

4.1. GENERATOR

The generator is a function $G: \mathbb{R}^{n \times f} \to \mathbb{R}^{n \times n}$ with parameters $\theta_G$ that takes the node features $X$ as input and produces $\tilde{A} \in \mathbb{R}^{n \times n}$ as output. We consider the following two generators. Full Parameterization (FP): In this case, $\theta_G$ is a single matrix in $\mathbb{R}^{n \times n}$ and the generator function is defined as $\tilde{A} = G_{FP}(X; \theta_G) = \theta_G$. That is, the generator ignores the input node features and directly optimizes the adjacency matrix. The disadvantages of this generator include adding $n^2$ parameters to the model, which limits scalability and makes the model susceptible to overfitting, and not being applicable to inductive settings where, at test time, predictions are to be made for nodes unseen during training. This generator is similar to the one proposed by Franceschi et al. (2019), except that they treat each element of $\tilde{A}$ as the parameter of a Bernoulli distribution and sample graph structures from these distributions.

MLP-kNN:

In this case, $\theta_G$ corresponds to the weights of a multi-layer perceptron (MLP) and $\tilde{A} = G_{MLP}(X; \theta_G) = \mathrm{kNN}(\mathrm{MLP}(X))$, where $\mathrm{MLP}: \mathbb{R}^{n \times f} \to \mathbb{R}^{n \times f'}$ is an MLP producing a matrix $X'$ of updated node representations and $\mathrm{kNN}: \mathbb{R}^{n \times f'} \to \mathbb{R}^{n \times n}$ produces a sparse matrix. Let $M \in \mathbb{R}^{n \times n}$ with $M_{ij} = 1$ if $v_j$ is among the top $k$ similar nodes to $v_i$ and $0$ otherwise, and let $S \in \mathbb{R}^{n \times n}$ with $S_{ij} = \mathrm{Sim}(X'_i, X'_j)$ for some differentiable similarity function $\mathrm{Sim}$ (we used cosine similarity in our experiments). Then $\tilde{A} = \mathrm{kNN}(X') = M \odot S$, where $\odot$ denotes the Hadamard (element-wise) product. Since $S$ is computed from $X'$, gradients flow to the elements of $X'$ (and consequently to the weights of the MLP) through $S$. In the backward pass of our model, we compute gradients only with respect to those elements $S_{ij}$ for which $M_{ij} = 1$; the gradient with respect to the other elements is $0$. With this formulation, in the forward pass one can first compute the matrix $M$ using an off-the-shelf k-nearest-neighbors algorithm and then compute the similarities in $S$ only for pairs of nodes with $M_{ij} = 1$. Unlike FP, this generator can be used in the inductive setting.
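The forward pass of the MLP-kNN generator can be sketched as follows. This is a dense NumPy sketch under our own naming; a single linear layer stands in for the MLP, and an efficient implementation would compute $S$ only at the entries selected by $M$.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities S with S[i, j] = Sim(X_i, X_j)."""
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-8, None)
    return Xn @ Xn.T

def knn_mask(S, k):
    """M[i, j] = 1 iff v_j is among the top-k most similar nodes to v_i."""
    M = np.zeros_like(S)
    topk = np.argsort(-S, axis=1)[:, :k]  # indices of the k largest per row
    M[np.repeat(np.arange(S.shape[0]), k), topk.ravel()] = 1.0
    return M

def mlp_knn_generator(X, W, k):
    """Sketch of MLP-kNN: X' = MLP(X) (one linear layer W stands in for the
    MLP), then A_tilde = M ⊙ S. In the real model, only entries of S with
    M[i, j] = 1 receive gradients."""
    S = cosine_similarity_matrix(X @ W)
    return knn_mask(S, k) * S
```

Note the resulting $\tilde{A}$ is generally not symmetric, since top-k membership is not mutual; the adjacency processor of Section 4.2 handles that.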

Smart initialization:

In our experiments, we found the initialization of the generator parameters $\theta_G$ to be important. Let $A_{kNN}$ denote the adjacency matrix created by applying a kNN function to the initial node features. One smart initialization of $\theta_G$ is one for which the generator produces $A_{kNN}$ before training starts (i.e., $\tilde{A} = A_{kNN}$ initially). Such an initialization is trivial for the FP generator (initialize $\theta_G$ to $A_{kNN}$) but may not be straightforward for the MLP-kNN generator. To enable initializing the MLP-kNN generator so that it generates $A_{kNN}$ before training starts, we consider two variants. In one, hereafter referred to simply as MLP, we keep the input dimension the same throughout the layers. In the other, hereafter referred to as MLP-D, we use MLPs with diagonal weight matrices (i.e., all off-diagonal parameters of the weight matrices are zero). For both variants, we initialize the weight matrices in $\theta_G$ to the identity matrix, ensuring that the output of the MLP is initially identical to its input and that the kNN graph built on these outputs equals $A_{kNN}$. MLP-D can be thought of as assigning a different weight to each feature before computing node similarities. Alternatively, one may use other MLP variants and pre-train the weights to output $A_{kNN}$ before the main training starts.
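A quick check of the identity initialization, as a sketch under the assumption that the input features are non-negative (as with binary bag-of-words vectors), so that the ReLU between layers acts as the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 4))  # non-negative features, e.g. bag-of-words counts

# MLP variant: square weight matrices initialized to the identity.
W1, W2 = np.eye(4), np.eye(4)
mlp_out = np.maximum(X @ W1, 0.0) @ W2  # ReLU is a no-op on non-negative input

# MLP-D variant: diagonal weight matrices, i.e. one weight per feature,
# initialized to ones.
w = np.ones(4)
mlp_d_out = X * w

# Both variants initially reproduce the input, so a kNN graph built on the
# outputs coincides with A_kNN built on X.
```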

4.2. ADJACENCY PROCESSOR

The output $\tilde{A}$ of the generator may have both positive and negative values and may be non-symmetric and non-normalized. To ensure all values of the adjacency are non-negative and to make it symmetric and normalized, we apply the following function to $\tilde{A}$:

$A = D^{-\frac{1}{2}} \left( \frac{P(\tilde{A}) + P(\tilde{A})^T}{2} \right) D^{-\frac{1}{2}}$ (1)

Here $P$ is a function with a non-negative range. In our experiments, when using an MLP generator, we apply the ReLU function to the elements of $\tilde{A}$. When using the fully parameterized (FP) generator, applying ReLU causes a gradient-flow problem: any edge whose value in $\tilde{A}$ becomes less than or equal to zero stops receiving gradient updates. For this reason, for FP we apply the ELU function to the elements of $\tilde{A}$ and then add 1. The sub-expression $\frac{P(\tilde{A}) + P(\tilde{A})^T}{2}$ makes the resulting matrix symmetric. To see why we take the mean of $P(\tilde{A})$ and $P(\tilde{A})^T$, assume $\tilde{A}$ is generated by $G_{MLP}$. If $v_j$ is among the $k$ most similar nodes to $v_i$ and vice versa, the strength of the connection between $v_i$ and $v_j$ remains the same. If, however, $v_j$ is among the $k$ most similar nodes to $v_i$ but $v_i$ is not among the top $k$ for $v_j$, taking the average of the similarities reduces the strength of the connection between $v_i$ and $v_j$. Finally, once we have a symmetric adjacency with non-negative values, we compute the degree matrix $D$ of $\frac{P(\tilde{A}) + P(\tilde{A})^T}{2}$ and normalize by multiplying it on the left and right by $D^{-\frac{1}{2}}$.
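Equation 1 and the two choices of $P$ can be sketched in NumPy (a minimal dense sketch; function names are ours):

```python
import numpy as np

def p_relu(A_tilde):
    """P for the MLP generator: elementwise ReLU."""
    return np.maximum(A_tilde, 0.0)

def p_elu_plus_one(A_tilde):
    """P for the FP generator: ELU + 1, which is strictly positive and keeps
    gradients flowing even for non-positive entries."""
    return np.where(A_tilde > 0, A_tilde + 1.0, np.exp(A_tilde))

def process_adjacency(A_tilde, P=p_relu):
    """Equation 1: A = D^{-1/2} ((P(A~) + P(A~)^T) / 2) D^{-1/2}."""
    A_sym = 0.5 * (P(A_tilde) + P(A_tilde).T)
    d = A_sym.sum(axis=1)  # degree matrix of the symmetrized adjacency
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return A_sym * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

A one-sided edge of weight 2 becomes a symmetric edge of weight 1 after averaging, illustrating how the mean downweights non-mutual kNN edges.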

4.3. CLASSIFIER

The classifier is a function $GNN_C: \mathbb{R}^{n \times f} \times \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times |C|}$ with parameters $\theta_{GNN_C}$. It takes the node features $X$ and the generated adjacency $A$ as input and provides, for each node, the logits for each class; $C$ is the set of classes and $|C|$ the number of classes. We use a two-layer GCN for which $\theta_{GNN_C} = \{W^{(1)}, W^{(2)}\}$ and define our classifier as $GNN_C(A, X; \theta_{GNN_C}) = A \, \mathrm{ReLU}(A X W^{(1)}) W^{(2)}$, but other GNN variants can be used as well (recall that $A$ is normalized). The training loss $L_C$ for the classification task is computed by taking the softmax of the logits to produce a probability distribution for each node and then computing the cross-entropy loss.
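A sketch of the classifier and its loss (our own NumPy rendering; the loss is computed over the labeled training nodes only, which the paper implies but does not spell out here):

```python
import numpy as np

def gcn_classifier(A, X, W1, W2):
    """Two-layer GCN classifier: logits = A @ ReLU(A @ X @ W1) @ W2,
    where A is the (already normalized) generated adjacency."""
    return A @ np.maximum(A @ X @ W1, 0.0) @ W2

def classification_loss(logits, labels, train_idx):
    """Softmax cross-entropy L_C over the labeled (training) nodes."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[train_idx, labels[train_idx]].mean()
```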

4.4. ADDING SELF-SUPERVISION

As explained in Section 1, in many domains the number of labeled nodes may be insufficient for learning both the structure and the GNN parameters from data. To increase the amount of supervision for learning the structure, we propose a self-supervised approach based on denoising autoencoders (Vincent et al., 2008). Let $GNN_{DAE}: \mathbb{R}^{n \times f} \times \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times f}$ be a GNN that takes node features as well as a normalized adjacency produced by a generator as input and outputs updated node features with the same dimensionality as the input. We train $GNN_{DAE}$ to receive a noisy version $\tilde{X}$ of the features $X$ as input and produce the denoised features as output. Let $idx$ denote the indices of the elements of $X$ to which we have added noise, and $X_{idx}$ the values at these indices. The aim of the training procedure is to minimize:

$L_{DAE} = L(X_{idx}, GNN_{DAE}(\tilde{X}, A; \theta_{GNN_{DAE}})_{idx})$ (2)

where $A$ is the generated adjacency matrix and $L$ is a loss function. For datasets whose features are binary vectors, in each epoch $idx$ consists of $r$ percent of the indices of $X$ whose values are ones and $r\eta$ percent of the indices whose values are zeros, both selected uniformly at random. Both $r$ and $\eta$ (the negative ratio) are hyperparameters. In this case, we add noise by setting the ones at the selected indices to zeros, and $L$ is the binary cross-entropy loss. For datasets with continuous features, $idx$ consists of $r$ percent of the indices of $X$ selected uniformly at random in each epoch; we add noise either by replacing the values at $idx$ with zeros or by adding independent Gaussian noise to each of the features, and $L$ is the mean-squared-error loss. To understand why the proposed self-supervision is crucial, let us consider a scenario for training the model in Figure 1 (or a model with a similar architecture) but without the self-supervised task.
As training proceeds, assume that two unlabeled nodes $v_i$ and $v_j$ are not directly connected to any labeled node. Then, since a two-layer GCN makes predictions for the nodes based on their two-hop neighbors, the edge between $v_i$ and $v_j$ receives no supervision. Figure 2 provides an example of such a scenario. As a quantitative example, in the original structures and train/validation/test splits of Cora and Citeseer, 80.4% and 89.9% of the nodes, and consequently 64.6% and 80.8% of pairs of nodes, are not connected to any labeled/train nodes. If no supervision is provided for some edges in the graph, after training the existence and weights of these edges may end up being set effectively at random (or remain at their initialization values), which may be problematic at test time. With the self-supervised task, however, although these edges may receive no supervision from the main task (i.e., from $GNN_C$), the supervision provided by the self-supervised task (i.e., from $GNN_{DAE}$) helps learn appropriate weights for them.
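The per-epoch masking scheme for binary features described above can be sketched as follows (a minimal NumPy sketch; helper names are ours):

```python
import numpy as np

def sample_mask_indices(X, r, eta, rng):
    """Pick a fraction r of the one-entries of X and a fraction r*eta of the
    zero-entries, uniformly at random; resampled every epoch."""
    ones, zeros = np.argwhere(X == 1), np.argwhere(X == 0)
    sel_ones = ones[rng.choice(len(ones), int(r * len(ones)), replace=False)]
    sel_zeros = zeros[rng.choice(len(zeros), int(r * eta * len(zeros)), replace=False)]
    return np.concatenate([sel_ones, sel_zeros], axis=0)

def corrupt(X, idx):
    """Noise model for binary features: zero out the selected entries. Only
    the selected ones actually change; the selected zeros stay zero and serve
    as negative targets for the binary cross-entropy loss."""
    X_noisy = X.copy()
    X_noisy[idx[:, 0], idx[:, 1]] = 0.0
    return X_noisy
```

The denoising loss of Equation 2 would then compare the reconstruction of $GNN_{DAE}$ against the original values of $X$ at these indices.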

4.5. SLAPS

Our final model, dubbed SLAPS, is trained to minimize $L = L_C + \lambda L_{DAE}$, where $L_C$ is the classification loss, $L_{DAE}$ is the denoising autoencoder loss (see Equation 2), and $\lambda$ is a hyperparameter controlling the relative importance of the two losses. To verify the merit of $GNN_{DAE}$ for learning an adjacency matrix in isolation, we also consider a variant of SLAPS named SLAPS_2s that is trained in two stages. We first train the $GNN_{DAE}$ model by minimizing the loss in Equation 2; note that this loss depends on the parameters $\theta_G$ of the generator and the parameters $\theta_{GNN_{DAE}}$ of the denoising autoencoder. After every $t$ epochs of training, we fix the adjacency matrix, train a classifier with the fixed adjacency matrix, and measure classification accuracy on the validation set. We select the epoch whose adjacency provides the best validation accuracy for the classifier. Note that in SLAPS_2s, the adjacency matrix is trained only based on $GNN_{DAE}$.
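The joint objective and the SLAPS_2s selection step reduce to two small helpers (our own names, shown only to make the two training regimes concrete):

```python
import numpy as np

def slaps_loss(loss_c, loss_dae, lam):
    """Joint SLAPS objective: L = L_C + lambda * L_DAE."""
    return loss_c + lam * loss_dae

def select_adjacency_epoch(val_accuracies):
    """SLAPS_2s model selection: among checkpoints taken every t epochs, pick
    the one whose fixed adjacency yields the best validation accuracy."""
    return int(np.argmax(val_accuracies))
```

Setting `lam = 0` recovers purely supervised structure learning, the ablation examined in the experiments.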

5. EXPERIMENTS

Baselines: We compare our proposal to several baselines with different properties. The first baseline is a multi-layer perceptron (MLP), which does not take the graph structure into account. We also compare against MLP-GAM* (Stretcu et al., 2019), which learns a fully-connected graph structure and uses it to supplement the loss function of the MLP toward predicting similar labels for neighboring nodes. Similar to Franceschi et al. (2019), we also consider a baseline named kNN-GCN, where we create a kNN graph based on the node features and feed this graph to a GCN; the graph structure remains fixed in this approach. We also compare with baselines that learn the graph structure from data, including LDS (Franceschi et al., 2019), GRCN (Yu et al., 2020), DGCNN (Wang et al., 2019b), and IDGL (Chen et al., 2020). We feed a kNN graph to the models requiring an initial graph structure.

Datasets:

We use three established benchmarks in the GNN literature, namely Cora, Citeseer, and Pubmed (Sen et al., 2008), as well as a newly released node classification dataset named ogbn-arxiv (Hu et al., 2020a) that is orders of magnitude larger than the other three and more challenging due to its more realistic split of the data into train, validation, and test sets. For all datasets, we feed only the node features to the models, not the graph structure. Following Franceschi et al. (2019), we also experiment with several classification (non-graph) datasets available in scikit-learn (Pedregosa et al., 2011), including Wine, Cancer, Digits, and 20News. Dataset statistics can be found in the Appendix. For Cora and Citeseer, the LDS model uses the train data for learning the parameters of its classification GCN, half of the validation set for learning the parameters of the adjacency matrix (in their bi-level optimization setup, these are treated as hyperparameters), and the other half of the validation set for early stopping and tuning the other hyperparameters. Besides experimenting with the original setups of these two datasets, we also consider a setup that is closer (although not identical) to that of LDS: we use the train set and half of the validation set for training and the other half of the validation set for early stopping and hyperparameter tuning. We name the modified versions Cora390 and Citeseer370, respectively, where the number following the dataset name corresponds to the number of labels used for training. We follow a similar procedure for the scikit-learn datasets.

5.1. COMPARATIVE RESULTS

The results of SLAPS and the baselines on the node classification benchmarks are reported in Table 1. Considering only the baselines first, we see that kNN-GCN significantly outperforms MLP on Cora and Citeseer but underperforms it on Pubmed and ogbn-arxiv. This shows the importance of the similarity metric and of the graph structure fed into the GCN, as a low-quality structure can harm model performance. LDS outperforms MLP, but its fully parameterized adjacency matrix results in memory issues on Pubmed and ogbn-arxiv. As for GRCN, the original paper shows that it can revise a good initial adjacency matrix and provide a substantial boost in performance; however, as evidenced by the results, when the initial graph structure is poor, GRCN's performance becomes on par with kNN-GCN. IDGL is the best-performing baseline, but its iterative nature makes it slow to train and test. SLAPS consistently outperforms the baselines on all datasets, in some cases by large margins. Among the generators, the winner is dataset-dependent, with MLP-D mostly outperforming MLP on datasets with many features and MLP outperforming MLP-D on datasets with few features. Using the software publicly released by the authors, all baselines that learn a graph structure fail on ogbn-arxiv; our implementation is the first that generalizes to such large graphs. Table 2 reports the results for the scikit-learn datasets and compares with LDS and IDGL. On three out of four datasets, SLAPS outperforms these two baselines. Among the datasets on which we can train SLAPS with the FP generator, 20News has the largest number of nodes; on this dataset, we observed that the FP generator suffers from overfitting and produces weaker results than the other generators due to its large number of parameters. Although SLAPS_2s does not use the node labels in learning an adjacency matrix, it outperforms kNN-GCN (8.4% improvement when using an FP generator).
With an FP generator, SLAPS_2s even achieves competitive performance with SLAPS; this is mainly because FP does not leverage the supervision provided by $GNN_C$ toward learning generalizable patterns that can be used for nodes other than those in the training set. These results corroborate the effectiveness of the self-supervised task for learning an adjacency matrix. Moreover, the results show that learning the adjacency using both self-supervision and the task-specific node labels results in higher predictive accuracy. Symmetrization: To symmetrize the adjacency, in Equation 1 we took the average of $P(\tilde{A})$ and $P(\tilde{A})^T$. Here we also consider two other choices: (1) $\max(P(\tilde{A}), P(\tilde{A})^T)$, and (2) not symmetrizing the adjacency (i.e., using $P(\tilde{A})$). Figure 3(d) compares these three choices on Cora and Citeseer with an MLP generator (other generators produced similar results). On both datasets, symmetrizing the adjacency provides a performance boost. Compared to mean symmetrization, max symmetrization performs slightly worse. This may be because max symmetrization does not distinguish between the case where $v_i$ and $v_j$ are both among the $k$ most similar nodes of each other and the case where only one of them is among the $k$ most similar nodes of the other. Analyzing the learned adjacency: Many graph-based semi-supervised classification models are based on the cluster assumption, according to which nearby nodes are more likely to share the same label (Chapelle & Zien, 2005). To verify the quality of the adjacency matrix learned by SLAPS, for every pair of nodes in the test set we compute the odds of the two nodes sharing the same label as a function of the normalized weight of the edge connecting them; Figure 3(e) shows these odds for different weight intervals. For both Cora and Citeseer, nodes connected with higher edge weights are more likely to share the same label than nodes with lower or zero edge weights.
As a specific example, when $A_{ij} \geq 0.1$, $v_i$ and $v_j$ are almost 2.5 times more likely to share the same label on Cora and almost 2.0 times more likely on Citeseer. Note that SLAPS may connect nodes based on a different criterion than the one used in the original datasets, so the learned adjacencies do not necessarily resemble the original structures. Implementation details: Validation accuracy alone could not identify the best set of hyperparameters, so instead we used the validation cross-entropy loss. We fixed the maximum number of epochs to 2000. We use two-layer GCNs for both $GNN_C$ and $GNN_{DAE}$, as well as for the baselines, and two-layer MLPs throughout the paper (for the experiments on ogbn-arxiv, although the original paper uses models with three layers and batch normalization after each layer, for consistency with our other experiments we used two layers and removed the normalization). We used two learning rates, one for $GNN_C$ and one for the other parameters of the model, each tuned from the set {0.01, 0.001}. We added dropout layers with dropout probability 0.5 after the first layer of the GNNs. We also added dropout to the adjacency matrix, with values tuned from the set {0.25, 0.5}. We set the hidden dimension of $GNN_C$ to 32 for all datasets except ogbn-arxiv, for which we set it to 256. We used cosine similarity for building the kNN graphs and tuned the value of k from the set {15, 20, 30}. We tuned $\lambda$ (which controls the relative importance of the two losses) from the set {0.1, 1, 10, 100}. The code of our experiments will be available upon acceptance of the paper. For GRCN (Yu et al., 2020), DGCNN (Wang et al., 2019b), and IDGL (Chen et al., 2020), we used the code released by the authors and tuned the hyperparameters as suggested in the original papers. The results of LDS (Franceschi et al., 2019) are taken directly from the original paper. All results for our model and the baselines are averaged over 10 runs; we report the mean and standard deviation.
Dataset statistics: The statistics of the datasets used in the experiments can be found in Table 3 . 



Footnote 1: While this problem may be alleviated to some extent by increasing the number of layers of the GCN, deeper GCNs typically provide inferior results due to issues such as oversmoothing (see, e.g., Li et al., 2018a; Oono & Suzuki, 2020).

Footnote 2: The generator used in this experiment is MLP; other generators produced similar results.

6. CONCLUSION

In this paper, we proposed a model for learning the parameters of a graph neural network and the graph structure of the nodes simultaneously. We showed the effectiveness of our model using a comprehensive set of experiments and analyses. In the future, we would like to explore more sophisticated graph generation models (e.g., GraphRNN (You et al., 2018) and GNF (Liu et al., 2019a)) and extend our approach to applications with a temporal aspect, where node features are observed over time but their connections are not provided as input.



Figure 2: The dashed edge receives no supervision when training a two-layer GCN as it is not in the two-hop neighborhood of any labeled node.
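The limitation depicted in Figure 2 can be made concrete with a small sketch. In a 2-layer GCN trained with a supervised loss, an edge influences the output at a labeled node only if one of its endpoints lies within one hop of that node; `supervised_edges` is a hypothetical helper (not from the paper) that computes which edges receive any gradient signal:

```python
import numpy as np

def supervised_edges(adj, labeled, num_layers=2):
    """Return the edges whose weights receive gradient when a
    num_layers-layer GCN is trained with a supervised loss.

    adj: (n, n) 0/1 adjacency; labeled: iterable of labeled node indices.
    An edge (u, v) contributes to a labeled node's output iff one of its
    endpoints is within num_layers - 1 hops of some labeled node.
    """
    n = adj.shape[0]
    reach = np.zeros(n, dtype=bool)
    reach[list(labeled)] = True                 # 0 hops: the labeled nodes
    frontier = reach.copy()
    for _ in range(num_layers - 1):             # expand to num_layers - 1 hops
        frontier = (adj @ frontier) > 0
        reach |= frontier
    return {(u, v) for u in range(n) for v in range(n)
            if adj[u, v] and (reach[u] or reach[v])}
```

On a path graph 0-1-2-3-4 with only node 0 labeled, the edge (3, 4) falls outside the two-hop neighborhood of any labeled node and receives no supervision, mirroring the dashed edge in the figure.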

Figure 3(b) shows the performance of SLAPS on Cora and Citeseer with different values of λ. When λ = 0, corresponding to removing self-supervision, the model performance is somewhat poor. As soon as λ becomes positive, both models see a large boost in performance, showing that self-supervision is crucial to the high performance of SLAPS. Increasing λ further provides larger boosts until it becomes so large that the self-supervision loss dominates the classification loss and the performance deteriorates. Note that with λ = 0, SLAPS with the MLP generator becomes a variant of the model proposed by Cosmo et al. (2020), but with a different similarity function.

Importance of k in kNN: Figure 3(c) shows the performance of SLAPS on Cora for three graph generators as a function of k in kNN. For all three cases, the value of k plays a major role in model performance. The FP generator is the least sensitive because in FP, k only affects the initialization of the adjacency matrix; the model can then change the number of neighbors of each node. For MLP and MLP-D, however, the number of neighbors of each node remains close to k (though not necessarily equal, as the adjacency processor can add or remove some edges), making these two generators more sensitive to k. For larger values of k, the extra flexibility of the MLP generator enables removing some of the unwanted edges through the function P, or reducing their weights, so MLP is less sensitive to large values of k than MLP-D.
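The role of λ described above amounts to a simple weighted sum of the two losses. A minimal sketch (the function name is illustrative, not from the paper's code):

```python
def slaps_objective(classification_loss, dae_loss, lam):
    """Combined SLAPS objective (sketch): the task cross-entropy plus the
    denoising-autoencoder reconstruction loss weighted by lambda.

    lam = 0 disables self-supervision entirely; a very large lam lets the
    DAE term dominate the classification term, which is where performance
    was observed to deteriorate.
    """
    return classification_loss + lam * dae_loss
```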

Figure 3: The performance of SLAPS (a) compared to SLAPS_2s on Cora with different generators, (b) with the MLP graph generator on Cora and Citeseer as a function of λ, (c) with different graph generators on Cora as a function of k in kNN, and (d) on Cora and Citeseer with different adjacency symmetrizations. (e) The odds of two nodes in the test set sharing the same label as a function of the edge weights learned by SLAPS.

Figure 1: Overview of SLAPS. At the top, a generator receives the node features and produces a non-symmetric, non-normalized adjacency having (potentially) both positive and negative values (Section 4.1). The adjacency processor makes the values positive, and symmetrizes and normalizes the adjacency (Section 4.2). The resulting adjacency and the node features go into GNN_C, which predicts the node classes (Section 4.3). At the bottom, noise is added to the node features; GNN_DAE receives the noisy features together with the generated adjacency and is trained to reconstruct the original features, providing self-supervision for learning the adjacency.
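The adjacency processor's three steps can be sketched as follows. This is an illustrative implementation under stated assumptions: the non-negative activation (ELU + 1) and the averaging symmetrization are example choices, and the paper compares several symmetrization schemes:

```python
import numpy as np

def process_adjacency(a_raw, eps=1e-8):
    """Sketch of the adjacency processor: make the generator's raw outputs
    non-negative, symmetrize, and degree-normalize the adjacency.

    a_raw: (n, n) raw generator output, possibly with negative entries.
    """
    # 1) Non-negative activation: ELU(x) + 1 maps every value above zero.
    a = np.where(a_raw > 0, a_raw + 1.0, np.exp(a_raw))
    # 2) Symmetrize by averaging A with its transpose.
    a = 0.5 * (a + a.T)
    # 3) Symmetric degree normalization D^{-1/2} A D^{-1/2}.
    deg = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg + eps)
    return (a * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
```

The output is a symmetric, non-negative, normalized adjacency suitable as input to both GNN_C and GNN_DAE.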

Results of SLAPS and the baselines on established node classification benchmarks. † indicates results have been taken from Franceschi et al. (2019). ‡ indicates results have been taken from Stretcu et al. (2019). Bold and underlined values indicate best and second-best mean performances respectively. OOM indicates out of memory.

Results on classification datasets. † indicates results have been taken from Franceschi et al. (2019). Bold and underlined values indicate best and second-best mean performances respectively.

To provide more insight into the value provided by the self-supervision task on the learned adjacency, we conduct experiments with SLAPS_2s. Recall from Section 4.5 that in SLAPS_2s, the adjacency is learned only based on the self-supervision task, and the node labels are only used for early stopping, hyperparameter tuning, and training GNN_C. Figure 3(a) shows the performance of SLAPS and SLAPS_2s on Cora and compares them with kNN-GCN. Although SLAPS_2s

Dataset statistics.

