HOW TO FIND YOUR FRIENDLY NEIGHBORHOOD: GRAPH ATTENTION DESIGN WITH SELF-SUPERVISION

Abstract

The attention mechanism in graph neural networks is designed to assign larger weights to important neighbor nodes for better representation. However, what graph attention learns is not well understood, particularly when graphs are noisy. In this paper, we propose a self-supervised graph attention network (SuperGAT), an improved graph attention model for noisy graphs. Specifically, we exploit two attention forms compatible with a self-supervised task to predict edges, whose presence and absence contain the inherent information about the importance of the relationships between nodes. By encoding edges, SuperGAT learns more expressive attention in distinguishing mislinked neighbors. We find that two graph characteristics influence the effectiveness of attention forms and self-supervision: homophily and average degree. Thus, our recipe provides guidance on which attention design to use when those two graph characteristics are known. Our experiment on 17 real-world datasets demonstrates that our recipe generalizes to 15 of them, and our models designed by the recipe show improved performance over baselines.

1. INTRODUCTION

Graphs are widely used in various domains, such as social networks, biology, and chemistry. Since their patterns are complex and irregular, learning to represent graphs is challenging (Bruna et al., 2014; Henaff et al., 2015; Defferrard et al., 2016; Duvenaud et al., 2015; Atwood & Towsley, 2016). Recently, graph neural networks (GNNs) have shown a significant performance improvement by generating features of the center node by aggregating those of its neighbors (Zhou et al., 2018; Wu et al., 2020). However, real-world graphs are often noisy with connections between unrelated nodes, and this causes GNNs to learn suboptimal representations. Graph attention networks (GATs) (Veličković et al., 2018) adopt self-attention to alleviate this issue. Similar to attention in sequential data (Luong et al., 2015; Bahdanau et al., 2015; Vaswani et al., 2017), graph attention captures the relational importance of a graph, in other words, the degree of importance of each of the neighbors to represent the center node. GATs have shown performance improvements in node classification, but they are inconsistent in the degree of improvement across datasets, and there is little understanding of what graph attention actually learns.

Hence, there is still room for graph attention to improve, and we start by assessing and learning the relational importance for each graph via self-supervised attention. We leverage edges that explicitly encode information about the importance of relations provided by a graph. If nodes i and j are linked, they are more relevant to each other than to other nodes, and if nodes i and j are not linked, they are not important to each other. Although conventional attention is trained without direct supervision, if we have prior knowledge about what to attend to, we can use it to supervise attention (Knyazev et al., 2019; Yu et al., 2017).
Specifically, we exploit a self-supervised task, using the attention value as input to predict the likelihood that an edge exists between nodes. To encode edges in graph attention, we first analyze what graph attention learns and how it relates to the presence of edges. In this analysis, we focus on two commonly used attention mechanisms, GAT's original single-layer neural network (GO) and dot-product (DP), as building blocks of our proposed model, self-supervised graph attention network (SuperGAT). We observe that DP attention performs better than GO attention at predicting links from attention values. On the other hand, GO attention outperforms DP attention in capturing label-agreement between a target node and its neighbors. Based on our analysis, we propose two variants of SuperGAT, scaled dot-product (SD) and mixed GO and DP (MX), to emphasize the strengths of GO and DP.

Then, which graph attention models the relational importance best and produces the best node representations? We find that it depends on the average degree and homophily of the graph. We generate synthetic graph datasets with various degrees and homophily, and analyze how the choice of attention affects node classification performance. Based on this result, we propose a recipe to design graph attention with edge self-supervision that works most effectively for given graph characteristics. We conduct experiments on a total of 17 real-world datasets and demonstrate that our recipe can be generalized across them. In addition, we show that models developed by our method improve performance over baselines.

We present the following contributions. First, we present models with self-supervised attention using edge information. Second, we analyze the classic attention forms GO and DP using label-agreement and link prediction tasks, and this analysis reveals that GO is better at label agreement and DP at link prediction.
Third, we propose a recipe to design graph attention based on homophily and average degree, and confirm its validity through experiments on real-world datasets. We make our code available for future research (https://github.com/dongkwan-kim/SuperGAT).

2. RELATED WORK

Deep neural networks are actively studied in modeling graphs, for example the graph convolutional networks (Kipf & Welling, 2017), which approximate spectral graph convolution (Bruna et al., 2014; Defferrard et al., 2016). A representative non-spectral work is the graph attention networks (GATs) (Veličković et al., 2018), which model relations in graphs using the self-attention mechanism (Vaswani et al., 2017). Similar to attention in sequence data (Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017), variants of attention in graph neural networks (Thekumparampil et al., 2018; Zhang et al., 2018; Wang et al., 2019a; Gao & Ji, 2019; Zhang et al., 2020; Hou et al., 2020) are trained without direct supervision. Our work is motivated by studies that improve attention's expressive power by giving direct supervision (Knyazev et al., 2019; Yu et al., 2017). Specifically, we employ a self-supervised task to predict edge presence from attention value. This is in line with two branches of recent GNN research: self-supervision and graph structure learning.

Recent studies about self-supervised learning for GNNs propose tasks leveraging the inherent information in the graph structure: clustering, partitioning, context prediction after node masking, and completion after attribute masking (Hu et al., 2020b; Hui et al., 2020; Sun et al., 2020; You et al., 2020). To the best of our knowledge, ours is the first study to analyze self-supervised learning of graph attention with edge information. Our self-supervised task is similar to link prediction (Liben-Nowell & Kleinberg, 2007), which is a well-studied problem and has recently been tackled by neural networks (Zhang & Chen, 2017; 2018). Our DP attention to predict links is motivated by the graph autoencoder (GAE) (Kipf & Welling, 2016) and its extensions (Pan et al., 2018; Park et al., 2019), which reconstruct edges by applying a dot-product decoder to node representations.
Graph structure learning is an approach to learn the underlying graph structure while jointly learning downstream tasks (Jiang et al., 2019; Franceschi et al., 2019; Klicpera et al., 2019; Stretcu et al., 2019; Zheng et al., 2020). Since real-world graphs often have noisy edges, encoding structure information contributes to learning better representations. However, recent models with graph structure learning suffer from high memory and computational complexity. Some studies target all spaces where edges can exist, so they require $O(|V|^2)$ space and computational complexity (Jiang et al., 2019; Franceschi et al., 2019). Others using iterative training (or co-training) between the GNNs and the structure learning model are time-intensive in training (Franceschi et al., 2019; Stretcu et al., 2019). We moderate this problem using graph attention, which consists of parallelizable operations, and our model is built on it without additional parameters. Our model learns attention values that are predictive of edges, and this can be seen as a new paradigm of learning the graph structure.

3. MODEL

In this section, we review the original GAT (Veličković et al., 2018) and then describe our self-supervised GAT (SuperGAT) models.

Figure 1: Overview of attention mechanism of SuperGATs: GO, DP, MX, and SD. Blue circles ($e_{ij}$) represent the unnormalized attention before softmax and red diamonds ($\phi_{ij}$) indicate the probability of an edge between nodes $i$ and $j$. The attention mechanism of the original GAT (Veličković et al., 2018) is in the dashed rectangle.

Notation For a graph $G = (V, E)$, $N$ is the number of nodes and $F^l$ is the number of features at layer $l$. A graph attention layer takes a set of features $H^l = \{h_1^l, \dots, h_N^l\}$, $h_i^l \in \mathbb{R}^{F^l}$, as input and produces output features $H^{l+1} = \{h_1^{l+1}, \dots, h_N^{l+1}\}$. To compute $h_i^{l+1}$, the model multiplies $H^l$ by the weight matrix $W^{l+1} \in \mathbb{R}^{F^{l+1} \times F^l}$, linearly combines the features of the first-order neighbors of node $i$ (including itself), $j \in \mathcal{N}_i \cup \{i\}$, by attention coefficients $\alpha_{ij}^{l+1}$, and finally applies a non-linear activation $\rho$. That is,

$$h_i^{l+1} = \rho\Big( \sum_{j \in \mathcal{N}_i \cup \{i\}} \alpha_{ij}^{l+1} W^{l+1} h_j^l \Big). \quad (1)$$

We can compute $\alpha_{ij}^{l+1} = \mathrm{softmax}_j(\mathrm{LReLU}(e_{ij}^{l+1}))$ by normalizing $e_{ij}^{l+1} = a_e(W^{l+1} h_i^l, W^{l+1} h_j^l)$ with softmax on $\mathcal{N}_i \cup \{i\}$ under leaky ReLU activation (Maas et al., 2013), where $a_e$ is a function of the form $\mathbb{R}^{F^{l+1}} \times \mathbb{R}^{F^{l+1}} \to \mathbb{R}$.

Graph Attention Forms Among two widely used attention mechanisms, the original GAT (GO) computes the coefficients by a single-layer feed-forward network parameterized by $a^{l+1} \in \mathbb{R}^{2F^{l+1}}$. The other is the dot-product (DP) attention (Luong et al., 2015; Vaswani et al., 2017), motivated by prior work on node representation learning; it adopts the same mathematical expression as the link prediction score (Tang et al., 2015; Kipf & Welling, 2016):

$$e_{ij,GO}^{l+1} = (a^{l+1})^\top \big[ W^{l+1} h_i^l \,\|\, W^{l+1} h_j^l \big] \quad \text{and} \quad e_{ij,DP}^{l+1} = (W^{l+1} h_i^l)^\top \cdot W^{l+1} h_j^l. \quad (2)$$
From now on, we refer to GATs that use GO and DP attention as GAT$_{GO}$ and GAT$_{DP}$, respectively.
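The two attention forms can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (which uses PyTorch Geometric); the helper names, the toy features, and the leaky ReLU slope of 0.2 are assumptions made for illustration only.

```python
import math

def matvec(W, h):
    """Multiply a weight matrix W (list of rows) by a feature vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x, slope=0.2):
    # slope=0.2 is an assumed value; the text only says "leaky ReLU".
    return x if x > 0 else slope * x

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def e_go(a, wh_i, wh_j):
    """GO: inner product of a with the concatenation [Wh_i || Wh_j]."""
    return sum(a_k * v for a_k, v in zip(a, wh_i + wh_j))

def e_dp(wh_i, wh_j):
    """DP: dot product of the two transformed feature vectors."""
    return sum(x * y for x, y in zip(wh_i, wh_j))

def attention(i, neighbors, H, W, a=None):
    """Normalized attention of node i over N_i and itself.

    Uses GO attention if the vector `a` is given, otherwise DP attention.
    """
    targets = [i] + [j for j in neighbors if j != i]
    wh = {k: matvec(W, H[k]) for k in targets}
    if a is not None:
        scores = [e_go(a, wh[i], wh[j]) for j in targets]
    else:
        scores = [e_dp(wh[i], wh[j]) for j in targets]
    alphas = softmax([leaky_relu(s) for s in scores])
    return dict(zip(targets, alphas))

# Toy example: 3 nodes, F^l = 3 input features, F^{l+1} = 2 (made-up numbers).
H = {0: [1.0, 0.0, 1.0], 1: [0.0, 1.0, 1.0], 2: [1.0, 1.0, 0.0]}
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [0.5, -0.5, 0.25, 0.25]  # length 2 * F^{l+1}

alpha_go = attention(0, [1, 2], H, W, a)
alpha_dp = attention(0, [1, 2], H, W)
print("GO:", alpha_go)
print("DP:", alpha_dp)
```

Both variants share the same softmax normalization; only the scoring function differs, which is why GO needs the extra parameter vector `a` and DP does not.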

Self-supervised Graph Attention Network

We propose SuperGAT with the idea of guiding attention with the presence or absence of an edge between a node pair. We exploit the link prediction task to self-supervise attention with labels from edges: for a pair $i$ and $j$, the label is 1 if an edge exists and 0 otherwise. We introduce $a_\phi$ with sigmoid $\sigma$ to infer the probability $\phi_{ij}$ of an edge between $i$ and $j$:

$$a_\phi : \mathbb{R}^F \times \mathbb{R}^F \to \mathbb{R} \quad \text{and} \quad \phi_{ij} = P((j, i) \in E) = \sigma(a_\phi(W h_i, W h_j)).$$

We employ four types (GO, DP, SD, and MX) of SuperGAT based on GO and DP attention. When $a_\phi$ has the same form as $a_e$ in GAT$_{GO}$ and GAT$_{DP}$, we name the models SuperGAT$_{GO}$ and SuperGAT$_{DP}$, respectively. For more advanced versions, we describe SuperGAT$_{SD}$ (Scaled Dot-product) and SuperGAT$_{MX}$ (Mixed GO and DP) by the unnormalized attention $e_{ij}$ and the probability $\phi_{ij}$ that an edge exists between $i$ and $j$.

$$\text{SuperGAT}_{SD}: \quad e_{ij,SD} = e_{ij,DP} / \sqrt{F}, \quad \phi_{ij,SD} = \sigma(e_{ij,SD}). \quad (3)$$

$$\text{SuperGAT}_{MX}: \quad e_{ij,MX} = e_{ij,GO} \cdot \sigma(e_{ij,DP}), \quad \phi_{ij,MX} = \sigma(e_{ij,DP}). \quad (4)$$

SuperGAT$_{SD}$ divides the dot-product of nodes by the square root of the dimension, as in Transformer (Vaswani et al., 2017). This prevents a few large values from dominating the entire attention after softmax. SuperGAT$_{MX}$ multiplies GO attention by DP attention passed through a sigmoid. The motivation for this form comes from the gating mechanism of Gated Recurrent Units (Cho et al., 2014). Since DP attention with the sigmoid represents the probability of an edge, it can softly drop neighbors that are not likely linked while implicitly assigning importance to the remaining nodes.

Training samples are a set of edges $E$ and the complementary set $E^c = (V \times V) \setminus E$. However, if the number of nodes is large, it is not efficient to use all possible negative cases in $E^c$. So, we use negative sampling as in training word or graph embeddings (Mikolov et al., 2013; Tang et al., 2015; Grover & Leskovec, 2016), arbitrarily choosing a total of $p_n \cdot |E|$ negative samples $E^-$ from $E^c$, where the negative sampling ratio $p_n \in \mathbb{R}^+$ is a hyperparameter.
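As an illustration of the negative sampling step, the following sketch draws $p_n \cdot |E|$ pairs from $E^c$ by rejection sampling. The function name, the treatment of edges as ordered pairs, the rounding of $p_n \cdot |E|$, and the choice of rejection sampling are assumptions for this example, not details taken from the text.

```python
import random

def sample_negative_edges(num_nodes, edges, p_n, seed=0):
    """Draw round(p_n * |E|) distinct ordered pairs from (V x V) minus E.

    Rejection sampling is an illustrative choice: it is cheap when the graph
    is sparse, because most random pairs are not edges.
    """
    rng = random.Random(seed)
    edge_set = set(edges)
    target = round(p_n * len(edge_set))
    available = num_nodes * num_nodes - len(edge_set)
    if target > available:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < target:
        pair = (rng.randrange(num_nodes), rng.randrange(num_nodes))
        if pair not in edge_set:
            negatives.add(pair)
    return sorted(negatives)

# Toy graph: a path 0 - 1 - 2 stored as ordered pairs, plus two isolated nodes.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
negatives = sample_negative_edges(num_nodes=5, edges=edges, p_n=0.5)
print(negatives)
```

With 4 positive pairs and $p_n = 0.5$, the sketch returns 2 negative pairs, none of which is in `edges`.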
SuperGAT requires the graph to be sparse enough to provide a sufficiently large number of negative samples (i.e., $|V \times V| \gg |E|$), but this is generally not a problem because most real-world graphs are sparse (Chung, 2010). We define the optimization objective of layer $l$ as a binary cross-entropy loss $\mathcal{L}_E^l$,

$$\mathcal{L}_E^l = -\frac{1}{|E \cup E^-|} \sum_{(j,i) \in E \cup E^-} \Big( \mathbb{1}_{(j,i)=1} \cdot \log \phi_{ij}^l + \mathbb{1}_{(j,i)=0} \cdot \log\big(1 - \phi_{ij}^l\big) \Big), \quad (5)$$

where $\mathbb{1}_{\cdot}$ is an indicator function. We use a subset of $E \cup E^-$ sampled by probability $p_e \in (0, 1]$ (also a hyperparameter) at each training iteration for a regularization effect from randomness.

Finally, we combine the cross-entropy loss on node labels ($\mathcal{L}_V$), the self-supervised graph attention losses for all $L$ layers ($\mathcal{L}_E^l$), and the L2 regularization loss, with mixing coefficients $\lambda_E$ and $\lambda_2$:

$$\mathcal{L} = \mathcal{L}_V + \lambda_E \cdot \sum_{l=1}^{L} \mathcal{L}_E^l + \lambda_2 \cdot \|W\|_2. \quad (6)$$

We use the same form of multi-head attention as in GAT and take the mean of each head's attention value before the sigmoid to compute $\phi_{ij}$. Note that SuperGAT has the same time and space complexity as GAT. To compute $\mathcal{L}_E^l$ for one head, we need additional operations on the order of $O(F^l \cdot |E \cup E^-|)$, and we do not need extra parameters.
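The SD and MX forms (Eqs. 3 and 4) and the losses (Eqs. 5 and 6) can be sketched on scalar inputs as follows. This is a simplified illustration with hypothetical function names; clamping $\phi$ by a small `eps` before taking logarithms is an assumption added for numerical safety and is not described in the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def supergat_sd(e_dp, num_features):
    """Eq. 3: scaled dot-product attention and its edge probability."""
    e_sd = e_dp / math.sqrt(num_features)
    return e_sd, sigmoid(e_sd)

def supergat_mx(e_go, e_dp):
    """Eq. 4: GO attention gated by the DP edge probability."""
    phi = sigmoid(e_dp)
    return e_go * phi, phi

def edge_loss(phis, labels, eps=1e-12):
    """Eq. 5: mean binary cross-entropy over sampled pairs (label 1 = edge)."""
    total = 0.0
    for phi, y in zip(phis, labels):
        phi = min(max(phi, eps), 1.0 - eps)  # assumed clamp to avoid log(0)
        total += y * math.log(phi) + (1 - y) * math.log(1.0 - phi)
    return -total / len(phis)

def total_loss(loss_v, layer_edge_losses, weight_norm, lambda_e, lambda_2):
    """Eq. 6: node loss + weighted edge losses of all layers + L2 term."""
    return loss_v + lambda_e * sum(layer_edge_losses) + lambda_2 * weight_norm

# Toy values (made up): one positive and one negative pair in each of 2 layers.
e_sd, phi_sd = supergat_sd(e_dp=4.0, num_features=16)
e_mx, phi_mx = supergat_mx(e_go=2.0, e_dp=0.0)
loss_e = edge_loss([0.9, 0.2], [1, 0])
loss = total_loss(loss_v=1.0, layer_edge_losses=[loss_e, loss_e],
                  weight_norm=2.0, lambda_e=2.0, lambda_2=0.1)
print(e_sd, phi_sd, e_mx, phi_mx, loss_e, loss)
```

Note that in the MX form a pair with $e_{ij,DP} = 0$ has edge probability 0.5, so its GO score is halved, which is the "soft drop" behavior described above.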

4. EXPERIMENTS

Our primary research objective is to design graph attentions that are effective with edge self-supervision. To do this, we pose four specific research questions. We first analyze what basic graph attentions (GO and DP) learn (RQ1 and 2) and how that can be improved with edge self-supervision (RQ3 and 4). We describe each research question and the corresponding experiment design below.

RQ1. Does graph attention learn label-agreement? First, we evaluate what the graph attentions of GAT$_{GO}$ and GAT$_{DP}$ learn without edge supervision. For this, we present a ground-truth of relational importance and a metric to assess graph attention against it. Wang et al. (2019a) showed that node representations in a connected component converge to the same value in deep GATs. If there is an edge between nodes with different labels, then it will be hard to distinguish the two corresponding labels with a GAT of sufficiently many layers; that is, ideal attention should give all weights to label-agreed neighbors. In that sense, we choose label-agreement between nodes as the ground-truth of importance.

We compare label-agreement and graph attention based on the Kullback-Leibler divergence of the normalized attention $\alpha_k = [\alpha_{kk}, \alpha_{k1}, \dots, \alpha_{kJ}]$ with the label agreement distribution for the center node $k$ and its neighbors 1 to $J$. The label agreement distribution $\ell_k = [\ell_{kk}, \ell_{k1}, \dots, \ell_{kJ}]$ is defined by

$$\ell_{kj} = \hat{\ell}_{kj} \Big/ \sum_s \hat{\ell}_{ks}, \quad \hat{\ell}_{kj} = 1 \text{ (if } k \text{ and } j \text{ have the same label) or } 0 \text{ (otherwise)}. \quad (7)$$

We employ the KL divergence in Eq. 8, whose value becomes small when attention captures the label-agreement between a node and its neighbors well.

$$\mathrm{KLD}(\alpha_k, \ell_k) = \sum_{j \in \mathcal{N}_k \cup \{k\}} \alpha_{kj} \log(\alpha_{kj} / \ell_{kj}) \quad (8)$$

RQ2. Is graph attention predictive of edge presence? To evaluate how well edge information is encoded in SuperGAT, we conduct link prediction experiments with SuperGAT$_{GO}$ and SuperGAT$_{DP}$ using $\phi_{ij}$ of the last layer as a predictor. We measure the performance by AUC over multiple runs.
Since link prediction performance depends on the mixing coefficient $\lambda_E$ in Eq. 6, we adopt multiple $\lambda_E \in \{10^{-3}, 10^{-2}, \dots, 10^{3}\}$. We train with an incomplete set of edges, and test with the missing edges and the same number of negative samples. At the same time, node classification performance is measured with the same settings to see how learning edge presence affects node classification.

RQ3. Which graph attention should we use for given graphs? The above two research questions explore what different graph attention learns with or without supervision of edge presence. Then, which graph attention is effective among them for given graphs? We hypothesize that different graph attention will have different abilities to model graphs under various homophily and average degree. We choose these two properties among various graph statistics because they determine the quality and quantity of labels in our self-supervised task. From the perspective of supervised learning of graph attention with edge labels, the learning result depends on how noisy labels are (i.e., how low the homophily is) and how many labels exist (i.e., how high the average degree is). So, we generate 144 synthetic graphs (Section 4.1) by controlling 9 levels of homophily (0.1 to 0.9) and 16 levels of average degree (1 to 100), and perform the node classification task in the transductive setting with GCN, GAT$_{GO}$, SuperGAT$_{SD}$, and SuperGAT$_{MX}$.

In RQ3, there are also practical reasons to use the average degree and homophily out of many graph properties (e.g., diameter, degree sequence, degree distribution, average clustering coefficient). First, the graph property should be efficiently computable even for large graphs. Second, there should be an algorithm that can generate graphs by controlling only the property of interest. Third, the property should be a scalar value because if the synthetic graph space is too wide, it would be impossible to conduct an experiment with sufficient coverage.
Average degree and homophily satisfy the above conditions and are suitable for our experiment, unlike some of the other graph properties.

RQ4. Does design choice based on RQ3 generalize to real-world datasets? Experiments on synthetic datasets provide an understanding of graph attention models' performance, but they are oversimplified versions of real-world graphs. Can design choices from synthetic datasets be generalized to real-world datasets, considering the more complex structures and rich features in real-world graphs? To answer this question, we conduct experiments on 17 real-world datasets with various average degrees (1.8 to 35.8) and homophily (0.16 to 0.91), and compare them with the synthetic graph experiments in RQ3.
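The label-agreement metric of RQ1 (Eqs. 7 and 8) can be sketched for a single center node as below. One caveat is an assumption of this sketch rather than something stated in the text: Eq. 8 is undefined when $\ell_{kj} = 0$ and $\alpha_{kj} > 0$, so the example clamps $\ell_{kj}$ from below by a small `eps`. Function names and toy values are hypothetical.

```python
import math

def label_agreement(center_label, labels):
    """Eq. 7: normalized indicator of sharing the center node's label.

    `labels` lists the labels of the center node itself and its neighbors,
    so at least one entry agrees and the normalizer is never zero.
    """
    raw = [1.0 if lab == center_label else 0.0 for lab in labels]
    total = sum(raw)
    return [r / total for r in raw]

def kld(alpha, ell, eps=1e-10):
    """Eq. 8: sum of alpha * log(alpha / ell), with ell clamped by eps.

    The eps clamp is an assumption of this sketch; terms with alpha = 0
    contribute nothing.
    """
    total = 0.0
    for a, l in zip(alpha, ell):
        if a > 0:
            total += a * math.log(a / max(l, eps))
    return total

# Center node with label "a"; entries are [itself, neighbor 1, neighbor 2].
ell = label_agreement("a", ["a", "a", "b"])
uniform = [1 / 3, 1 / 3, 1 / 3]
focused = [0.45, 0.45, 0.10]
print(ell, kld(uniform, ell), kld(focused, ell))
```

Attention that concentrates on label-agreed neighbors (`focused`) yields a smaller divergence than uniform attention, which matches the reading of Eq. 8 given in RQ1.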

4.1. DATASETS

Real-world datasets We use a total of 17 real-world datasets (Cora, CiteSeer, PubMed, Cora-ML, Cora-Full, DBLP, ogbn-arxiv, CS, Physics, Photo, Computers, Wiki-CS, Four-Univ, Chameleon, Crocodile, Flickr, and PPI) in diverse domains (citation, co-authorship, co-purchase, web page, and biology) and scales (2k to 169k nodes). We try to use their original settings as much as possible. To verify research questions 1 and 2, we choose four classic benchmarks: Cora, CiteSeer, and PubMed in the transductive setting, and PPI in the inductive setting. See appendix A.1 for detailed descriptions, splits, statistics (including degree and homophily), and references.

Synthetic datasets

We generate random partition graphs of $n$ nodes per class and $c$ classes (Fortunato, 2010), using the NetworkX library (Hagberg et al., 2008). A random partition graph is a graph of communities controlled by two probabilities, $p_{in}$ and $p_{out}$. If two nodes have the same class label, they are connected with probability $p_{in}$; otherwise, they are connected with probability $p_{out}$. To generate a graph with an average degree of $d_{avg} = n \cdot \delta$, we choose $p_{in}$ and $p_{out}$ such that $p_{in} + (c - 1) \cdot p_{out} = \delta$. The input features of nodes are sampled from overlapping multi-Gaussian distributions (Abu-El-Haija et al., 2019). We set $n$ to 500 and $c$ to 10, and choose $d_{avg}$ between 1 and 100 and $p_{in}$ from $\{0.1\delta, 0.2\delta, \dots, 0.9\delta\}$. We use 20 samples per class for training, 500 for validation, and 1000 for test.
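The relation between the target graph characteristics and the two probabilities can be made concrete with a short sketch. It solves $p_{in} + (c - 1) \cdot p_{out} = \delta$ with $\delta = d_{avg} / n$ and $p_{in} = h \cdot \delta$ for a homophily level $h$; the function name is hypothetical and the defaults follow the values of $n$ and $c$ given above.

```python
def partition_probabilities(d_avg, homophily, n=500, c=10):
    """Return (p_in, p_out) for a target average degree and homophily level.

    Solves p_in + (c - 1) * p_out = delta with delta = d_avg / n and
    p_in = homophily * delta.
    """
    if not 0.0 <= homophily <= 1.0:
        raise ValueError("homophily must lie in [0, 1]")
    delta = d_avg / n
    p_in = homophily * delta
    p_out = (delta - p_in) / (c - 1)
    return p_in, p_out

# Example: average degree 10 and homophily 0.5 with n = 500, c = 10.
p_in, p_out = partition_probabilities(d_avg=10, homophily=0.5)
print(p_in, p_out)
```

These values could then be passed to a generator such as NetworkX's `random_partition_graph(sizes, p_in, p_out)` with `sizes = [n] * c`; whether this exact function was used is an assumption, since the text names only the library.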

4.2. EXPERIMENTAL SET-UP

We follow the experimental set-up of GAT with minor adjustments. All parameters are initialized by Glorot initialization (Glorot & Bengio, 2010) and optimized by Adam (Kingma & Ba, 2014). We apply L2 regularization, dropout (Srivastava et al., 2014) to features and attention coefficients, and early stopping on validation loss and accuracy. We use ELU (Clevert et al., 2016) as a non-linear activation $\rho$. Unless specified, we employ a two-layer SuperGAT with $F = 8$ features and $K = 8$ attention heads (total 64 features). All models are implemented in PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey & Lenssen, 2019). See appendix A.5 for detailed model and hyperparameter configurations.

Baselines For all datasets, we compare our model against representative graph neural models: graph convolutional network (GCN) (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), and graph attention network (GAT) (Veličković et al., 2018). Furthermore, for Cora, CiteSeer, and PubMed, we choose recent graph neural architectures that learn aggregation coefficients (or discrete structures) over edges: constrained graph attention network (CGAT) (Wang et al., 2019a), graph learning-convolutional network (GLCN) (Jiang et al., 2019), learning discrete structure (LDS) (Franceschi et al., 2019), graph agreement model (GAM) (Stretcu et al., 2019), and NeuralSparse (NS in short) (Zheng et al., 2020). For PPI, we use CGAT as an additional baseline.

5. RESULTS

This section describes the experimental results which answer the research questions in Section 4. We include a qualitative analysis of attention, quantitative comparisons on node classification and link prediction, and recipes of graph attention design. The results of the sensitivity analysis of important hyperparameters are in appendix B.5.

Does graph attention learn label-agreement? GO learns label-agreement better than DP. We draw box plots of the KL divergence between attention and label agreement distributions of two-layer and four-layer GATs with GO and DP attention for the Cora dataset in Figure 2. We see similar patterns in other datasets and place their plots in appendix B.3. At the rightmost of each subfigure, we draw the KLD distribution when uniform attention is given to all neighborhoods. Note that the maximum value of KLD of each node is different since the degree of nodes is different. Also, the KLD distribution shows a long-tail shape like the degree distribution of real-world graphs.

There are three observations regarding the distributions of KLD. First, we observe that the KLD distribution of GO attention shows a pattern similar to the uniform attention for all citation datasets. This implies that trained GO attention is similar to the uniform distribution, which is in line with previously reported results in the case of entropy (Wang et al., 2019b). Second, KLD values of DP attention tend to be larger than those of GO attention for the last layer, resulting in bigger long-tails. This mismatch between the learned distribution of DP attention and the label agreement distribution suggests that DP attention does not learn label-agreement in the neighborhood. Third, the deeper the model (more than two layers), the larger the KLD value of DP attention in the last layer. This is because the variance of DP attention increases as the layer gets deeper, as explained below in Proposition 1.

Proposition 1.
For the $(l+1)$-th GAT layer, if $W$ and $a$ are independently and identically drawn from zero-mean uniform distributions with variance $\sigma_w^2$ and $\sigma_a^2$ respectively, assuming that parameters are independent of input features $h^l$ and elements of $h^l$ are independent of each other,

$$\mathrm{Var}\big[e_{ij,GO}^{l+1}\big] = 2 F^{l+1} \sigma_w^2 \sigma_a^2 \, \mathbb{E}\big(\|h^l\|_2^2\big) \quad \text{and} \quad \mathrm{Var}\big[e_{ij,DP}^{l+1}\big] \ge F^{l+1} \sigma_w^4 \Big( \frac{4}{5} \mathbb{E}\big[ \big((h_i^l)^\top h_j^l\big)^2 \big] + \mathrm{Var}\big((h_i^l)^\top h_j^l\big) \Big). \quad (9)$$

The proof is given in appendix B.1. While the variance of GO depends on the norm of features only, the variance of DP depends on the expectation of the square of the input's dot-product and the variance of the input's dot-product. As GAT layers are stacked, the more the features of $i$ and $j$ correlate with each other, the larger the input's dot-product will be. After DP attention is normalized by softmax, which intensifies the larger values among them, normalized DP attention attends to only a small portion of the neighbors and learns a biased representation.

Is graph attention predictive of edge presence? DP predicts edge presence better than GO. In Figure 3, we report the mean AUC over multiple runs (5 for PPI and 10 for others) for link prediction (red lines) and node classification (gray lines). As the mixing coefficient $\lambda_E$ increases, the link prediction score increases in all datasets and attentions. This is a natural result considering that $\lambda_E$ is the weight factor of the self-supervised graph attention loss ($\mathcal{L}_E$ in Equation 6). For three out of four datasets, DP attention outperforms GO for link prediction over the entire range of $\lambda_E$ in our experiment. Surprisingly, even for small $\lambda_E$, DP attention shows around 80 AUC, much higher than the performance of GO attention. PPI is an exception where GO attention shows higher performance for small $\lambda_E$ than DP, but the difference is slight. The results of this experiment demonstrate that DP attention is more suitable than GO attention for encoding edges. This figure also includes node classification performance.
For all datasets except PubMed, we observe a trade-off between node classification and link prediction; that is, node classification performance decreases in SuperGAT$_{GO}$ and SuperGAT$_{DP}$ as $\lambda_E$ increases and thus link prediction performance increases. PubMed also shows a decrease in performance at the largest $\lambda_E$ we have tested. This implies that it is hard to learn the relational importance from edges by simply optimizing graph attention for link prediction.

Which graph attention should we use for given graphs? It depends on homophily and average degree of the graph. In Figure 4, we draw the mean test accuracy gains (over 5 runs) against GAT$_{GO}$ as the average degree increases from 1 to 100, for different values of homophily, on 64 synthetic graphs with GCN, SuperGAT$_{SD}$, and SuperGAT$_{MX}$. See full results in appendix B.2. We define homophily $h$ as the average ratio of neighbors with the same label as the center node (Pei et al., 2020). That is,

$$h = \frac{1}{|V|} \sum_{i \in V} \Big( \sum_{j \in \mathcal{N}_i} \mathbb{1}_{l(i) = l(j)} \Big/ |\mathcal{N}_i| \Big),$$

where $l(i)$ is the label of node $i$. The expectation of homophily for random partition graphs is analytically $p_{in} / \delta$, and we simply adopt this value to label the homophily of graphs in Figure 4.

We make the following observations from this figure. First, if the homophily is low ($\le 0.2$), SuperGAT$_{SD}$ performs best among models because DP attention tends to focus on a small number of neighbors. This result empirically confirms what we analytically found in Proposition 1. Second, even when homophily is low, the performance gain of SuperGAT against GAT increases as the average degree increases to a certain level (around 10), meaning relation modeling can benefit from self-supervision if there are sufficiently many edges providing supervision.
This is in agreement with a prior study of label noise for deep neural networks, which finds that the absolute amount of data with correct labels affects the learning quality more than the ratio between data with noisy and correct labels (Rolnick et al., 2017). Third, if the average degree and homophily are high enough, there is no difference between all models, including GCNs. If there are more correct edges beyond a certain amount, we can learn fine representations without self-supervision. Most importantly, if the average degree is not too low or high and homophily is above 0.2, SuperGAT$_{MX}$ performs better than or similar to SuperGAT$_{SD}$. This implies that we can take advantage of both GO attention to learn label-agreement and DP attention to learn edge presence by mixing GO and DP. Note that many of the real-world graphs belong to this range of graph characteristics. The results on synthetic graphs imply that understanding the graph domain should precede the design of graph attention. That is, by knowing the average degree and homophily of the graphs, we can choose the optimal graph attention in our design space.

Does design choice based on RQ3 generalize to real-world datasets? It does for 15 of 17 real-world datasets. In Figure 5, we plot the best-performing graph attention for synthetic graphs with square points in the plane of average degree and homophily. The size is the performance gain of SuperGAT against GAT, and the color indicates the best model. If the difference is not statistically significant (p-value ≥ .05) between GAT and SuperGAT, and between SuperGAT$_{MX}$ and SuperGAT$_{SD}$, we mark the points as GAT-Any and SuperGAT-Any, respectively. We call this plot a recipe since it recommends the optimal attention for graphs in a specific region. Now we map the results of 17 real-world datasets in Figure 5 according to their average degree and homophily.
Average degree and homophily can be found in appendix A.1, and experimental results of graph attention models are summarized in Tables 1, 2 and 3. We report the mean and standard deviation of performance over multiple seeds (30 for graphs with more than 50k nodes and 100 for others). We mark the unpaired t-test results of SuperGAT against GAT$_{GO}$ with asterisks.

We find that the graph attention recipe based on synthetic experiments can generalize across real-world graphs. PPI and Four-Univ are surrounded by squares of SuperGAT$_{SD}$ at the bottom of the plane. Wiki-CS, located in the SuperGAT-Any region, also shows no difference in performance between SuperGATs. Nine datasets, for which SuperGAT$_{MX}$ shows the highest performance, are located in the MX regions or within the margin of two squares. Note that there are two MX regions: lower-middle average degree (2.5 to 7.5) and high homophily (0.8 to 0.9), and upper-middle average degree (7.5 to 50) and lower-middle homophily (0.3 to 0.5). There are five datasets with no significant performance change across graph attention. CiteSeer, Photo, and Computers are within a margin of one square from the GAT-Any region; however, Flickr and Crocodile are in the SuperGAT-Any region. To find out the cause of this irregularity, we examine the distribution of degree and per-node homophily (appendix A.4). We observe a more complex mixture distribution of homophily and average degree in Flickr and Crocodile, and this seems to be equivalent to mixing graphs of different characteristics, resulting in results inconsistent with our attention design recipe.

Comparison with baselines For a total of 17 datasets, SuperGAT outperforms GCN for 13 datasets, GAT for 12 datasets, and GraphSAGE for 16 datasets. Interestingly, for CS, Physics, Cora-ML, and Flickr, in which our model performs worse than GCN, GAT also cannot surpass GCN.
It is not yet known when the degree-normalized aggregation of GCN outperforms the attention-based aggregation, and more research is needed to figure out how to embed the degree information into graph attention.

Tables 2 and 3 show performance comparisons between SuperGAT and recent GNNs for Cora, CiteSeer, PubMed, and PPI. Our model performs better for CiteSeer (0.6%p) and PubMed (3.4%p) than GLCN, which gives regularization to all relations in a graph. GCN + NS (NeuralSparse) performs better than our model for CiteSeer (1.5%p) but not for Cora (0.6%p). CGAT modified for semi-supervised learning shows lower performance than GAT. Although LDS and GAM, which use iterative training, show better performance (except for LDS on Cora), these models require significantly more computation for the iterative training. For example, compared to our model, GCN + GAM needs 34 times more training time for Cora, 72 times for CiteSeer, and 82 times for PubMed. See appendices A.7 and B.4 for the experimental set-up and the result of the wall-clock time analysis.

A EXPERIMENTAL SET-UP

A.1 REAL-WORLD DATASET

In this section, we describe details (including nodes, edges, features, labels, and splits) of real-world datasets. We report statistics of real-world datasets in Tables 4 and 5. For multi-label graphs (PPI), we extend the homophily to the average of the ratio of shared labels on neighbors over nodes, i.e.,

$$h = \frac{1}{|V|} \sum_{i \in V} \Big( \sum_{j \in \mathcal{N}_i} |C_i \cap C_j| \Big/ \big(|\mathcal{N}_i| \cdot |C|\big) \Big),$$

where $C_i$ is the set of labels for node $i$ and $C$ is the set of all labels.
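Both homophily definitions (the single-label one used for the recipe and the multi-label extension above) can be computed with a short sketch. The adjacency-dictionary input format and the function names are assumptions for illustration; counting a node without neighbors as contributing 0 is also an assumption, since the formulas leave that case undefined.

```python
def homophily(neighbors, labels):
    """Average ratio of neighbors sharing the center node's label.

    neighbors: dict mapping each node to a list of its neighbors.
    labels: dict mapping each node to its single label.
    Nodes without neighbors contribute 0 (an assumption of this sketch).
    """
    total = 0.0
    for i, nbrs in neighbors.items():
        if nbrs:
            same = sum(1 for j in nbrs if labels[j] == labels[i])
            total += same / len(nbrs)
    return total / len(neighbors)

def multilabel_homophily(neighbors, label_sets, all_labels):
    """Multi-label extension: shared labels normalized by |N_i| * |C|."""
    total = 0.0
    for i, nbrs in neighbors.items():
        if nbrs:
            shared = sum(len(label_sets[i] & label_sets[j]) for j in nbrs)
            total += shared / (len(nbrs) * len(all_labels))
    return total / len(neighbors)

# Toy star graph centered at node 0 (made-up labels).
neighbors = {0: [1, 2], 1: [0], 2: [0]}
labels = {0: "a", 1: "a", 2: "b"}
label_sets = {0: {"x", "y"}, 1: {"x"}, 2: {"y", "z"}}

h_single = homophily(neighbors, labels)
h_multi = multilabel_homophily(neighbors, label_sets, {"x", "y", "z"})
print(h_single, h_multi)
```

In the toy graph, the per-node ratios are 1/2, 1, and 0 for the single-label case, and 1/3 for every node in the multi-label case.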

A.1.1 CITATION NETWORK

We use a total of 7 citation network datasets. Nodes are documents, and edges are citations. The task for all citation network datasets is to classify each paper's topic.

Cora, CiteSeer, PubMed We use three benchmark datasets for semi-supervised node classification tasks in the transductive setting (Sen et al., 2008; Yang et al., 2016). The features of the nodes are bag-of-words representations of documents. We follow the train/validation/test split of previous work (Kipf & Welling, 2017). We use 20 samples per class for training, 500 samples for validation, and 1000 samples for the test.

Cora-ML, Cora-Full, DBLP These are other citation network datasets from Bojchevski & Günnemann (2018). Node features are bag-of-words representations of documents. For Cora-Full, which has more than 5000 features, we reduce the dimension to 500 by performing PCA. We use the split setting in Shchur et al. (2018): 20 samples per class for training, 30 samples per class for validation, and the rest for the test.

ogbn-arxiv The ogbn-arxiv dataset is a recently proposed large-scale citation network (Hu et al., 2020a; Wang et al., 2020). Nodes represent arXiv papers, edges indicate citations between papers, and node features are mean vectors of skip-gram word embeddings of their titles and abstracts. We use the public split by publication dates provided by the original paper.

CS, Physics

The CS and Physics datasets are co-author networks in each domain (Shchur et al., 2018). Nodes are authors, and edges indicate whether two authors co-authored a paper. Node features are paper keywords from each author's papers, and we reduce the original dimensions (6805 and 8415) to 500 using PCA. The split is the 20-per-class/30-per-class/rest split from Shchur et al. (2018). The goal of this task is to classify each author's field of study.

A.1.3 AMAZON CO-PURCHASE

Photo, Computers The Photo and Computers datasets are parts of the Amazon co-purchase graph (McAuley et al., 2015; Shchur et al., 2018). Nodes are goods, edges indicate whether two goods are frequently purchased together, and node features are a bag-of-words representation of product reviews. The split is the 20-per-class/30-per-class/rest split from Shchur et al. (2018). The task is to classify the categories of goods.

A.1.4 WEB PAGE NETWORK

Wiki-CS The Wiki-CS dataset is a network of computer science related pages in Wikipedia (Mernyei & Cangea, 2020). Nodes represent articles about computer science, and edges represent hyperlinks between articles. The features of nodes are mean vectors of GloVe word embeddings of articles. There are 20 standard splits, and we experiment with five random seeds for each split (100 runs in total). The task is to classify the main category of articles.

Chameleon, Crocodile These datasets are Wikipedia page networks about specific topics, Chameleon and Crocodile (Rozemberczki et al., 2019). Nodes are articles, and edges are mutual [...]

Four-Univ The Four-Univ dataset is a web page network from computer science departments of diverse universities (Craven et al., 1998). Nodes are web pages, edges are hyperlinks between them, and node features are TF-IDF vectors of the web pages' contents. There are five graphs: one for each of four universities (Cornell, Texas, Washington, and Wisconsin) and a miscellaneous graph from other universities. As the original authors suggested (footnote 3), we use three graphs of universities and a miscellaneous graph for training, another graph for validation (Cornell), and the remaining graph for the test (Texas). Classification labels are types of web pages (student, faculty, staff, department, course, and project).

A.1.5 FLICKR

The Flickr dataset is a graph of images from Flickr (McAuley & Leskovec, 2012; Zeng et al., 2020). Nodes are images, and edges indicate whether two images share common properties such as geographic location, gallery, and users who commented. Node features are a bag-of-words representation of images. We use the labels and split from Zeng et al. (2020). For labels, they construct seven classes by manually merging 81 image tags.

A.1.6 PROTEIN-PROTEIN INTERACTION

The protein-protein interaction (PPI) dataset (Zitnik & Leskovec, 2017; Hamilton et al., 2017; Subramanian et al., 2005) is a well-known benchmark in the inductive setting. Each graph is given for a human tissue; nodes are proteins, node features are biological signatures such as genes, and edges represent interactions between proteins. The dataset consists of 20 training graphs, two validation graphs, and two test graphs. This dataset has multi-labels of gene ontology sets.

A.2 LINK PREDICTION

For link prediction, we split 5% and 10% of edges for validation and test set, respectively. We fix the negative edges for the test set and sample negative edges for the training set at each iteration. 
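The split described above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the helper names `split_edges` and `sample_negative_edges`, the seeds, and the toy path graph are assumptions, and edges are assumed to be undirected pairs stored as `(min, max)`.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def split_edges(edges: List[Edge], val_ratio: float = 0.05,
                test_ratio: float = 0.10, seed: int = 0):
    """Shuffle edges and hold out 5% for validation and 10% for test."""
    rng = random.Random(seed)
    shuffled = edges[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    n_test = int(len(shuffled) * test_ratio)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def sample_negative_edges(num_nodes: int, existing: Set[Edge], k: int,
                          rng: random.Random) -> List[Edge]:
    """Draw k distinct node pairs that are not edges.

    Assumes `existing` stores pairs as (min, max) and that the graph is
    sparse enough for k non-edges to exist.
    """
    negatives: Set[Edge] = set()
    while len(negatives) < k:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u == v:
            continue
        pair = (min(u, v), max(u, v))
        if pair not in existing:
            negatives.add(pair)
    return sorted(negatives)


# Hypothetical toy graph: a path over 41 nodes (40 edges).
toy_edges = [(i, i + 1) for i in range(40)]
train_e, val_e, test_e = split_edges(toy_edges)
# Test negatives are drawn once with a fixed seed and then reused.
test_neg = sample_negative_edges(41, set(toy_edges), k=len(test_e),
                                 rng=random.Random(1))
```

Training negatives would instead be re-drawn at every iteration by calling `sample_negative_edges` with a generator whose state advances between calls.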

A.3 SYNTHETIC DATASET

To the best of our knowledge, our synthetic datasets are not used in recent literature. Therefore, we give some small examples of synthetic datasets to show qualitatively how the average degree and homophily vary according to $\delta$ and $p_{in}$. Specifically, we draw 2D t-SNE plots of node features and edges in Figure 6.

A.4 DISTRIBUTION OF DEGREE AND HOMOPHILY OF DATASETS

In Figure 7, we draw kernel density estimation plots of per-node homophily and degree of nodes in real-world datasets. We define per-node homophily as the ratio of neighbors with the same label as the center node, that is, $h_i = \frac{1}{|N_i|} \sum_{j \in N_i} \mathbb{1}_{l(i) = l(j)}$. Note that we define homophily as $h = \frac{1}{|V|} \sum_{i \in V} h_i$ in Section 5. We focus more on the part outside of where the degree is 1 (0 with log scale) and the per-node homophily is 0. These are incorrectly connected leaf nodes, and they do not significantly affect learning the overall graph representation. In most datasets, only the largest mode exists, or there are some small modes around it. However, in Flickr and Crocodile, we can observe that the interval between modes is wide. More specifically, Crocodile's modes are in the areas of (high degree, low per-node homophily) and (low degree, high per-node homophily), and Flickr's modes cover most homophily values at a specific degree. Note that we can regard a mixture of distributions as a mixture of different sub-graphs. We argue that this is the reason our recipe does not fit these two datasets.
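As a small worked example of the two definitions in this subsection, the sketch below computes per-node homophily and its average. The function names, the dictionary-based graph format, and the toy labels are assumptions for illustration; nodes without neighbors are skipped, which is also an assumption since the ratio is undefined for them.

```python
from typing import Dict, Hashable, List


def per_node_homophily(neighbors: Dict[int, List[int]],
                       labels: Dict[int, Hashable]) -> Dict[int, float]:
    """h_i = (number of neighbors sharing node i's label) / |N_i|."""
    return {
        i: sum(labels[j] == labels[i] for j in nbrs) / len(nbrs)
        for i, nbrs in neighbors.items()
        if nbrs  # assumption: skip isolated nodes
    }


def graph_homophily(neighbors: Dict[int, List[int]],
                    labels: Dict[int, Hashable]) -> float:
    """h = mean of h_i over the nodes for which h_i is defined."""
    h = per_node_homophily(neighbors, labels)
    return sum(h.values()) / len(h)


# Hypothetical toy graph: a path 1 - 0 - 2 - 3 with two classes.
toy_neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
h_nodes = per_node_homophily(toy_neighbors, toy_labels)
h_graph = graph_homophily(toy_neighbors, toy_labels)
```

Here nodes 0 and 2 each have one agreeing neighbor out of two, and the two leaf nodes agree with their only neighbor, so the graph-level value is 0.75.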

A.5 MODEL & HYPERPARAMETER CONFIGURATIONS

Model Since we experiment with a large number of datasets, we keep almost the same configuration across datasets. We do not use other methods such as residual connections, deeper layers, batch normalization, edge augmentation, or more hidden features, although previous studies confirm that these techniques contribute to performance improvement. For example, prior work has shown f1-score close to 100 for PPI. To clearly see the difference between the various graph attention designs, we intentionally keep a simple model configuration.

Hyperparameter For real-world datasets, we tune two hyperparameters (mixing coefficients $\lambda_2$ and $\lambda_E$) by Bayesian optimization for the mean performance of 3 random seeds. We choose the negative sampling ratio $p_n$ from {0.3, 0.5, 0.7, 0.9} and the edge sampling ratio $p_e$ from {0.6, 0.8, 1.0}. We fix the dropout probability to 0.0 for PPI, 0.2 for ogbn-arxiv, and 0.6 for others. We set the learning rate to 0.05 (ogbn-arxiv), 0.01 (PubMed, PPI, Wiki-CS, Photo, Computers, CS, Physics, Crocodile, Cora-Full, DBLP), 0.005 (Cora, CiteSeer, Cora-ML, Chameleon), and 0.001 (Four-Univ). For ogbn-arxiv, we set the number of features per head to 16 and the number of heads in the last layer to one; otherwise, we use eight features per head and eight heads in the last layer. For synthetic datasets, we choose $\lambda_E$ from $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^{2}\}$ and $\lambda_2$ from $\{10^{-7}, 10^{-5}, 10^{-3}\}$. We fix the learning rate to 0.01, the dropout probability to 0.2, $p_n$ to 0.5, and $p_e$ to 0.8 for all synthetic graphs. Table 6 describes hyperparameters for SuperGAT on four real-world datasets. For other datasets and experiments, please see the code (./SuperGAT/args.yaml).

Table 7 rows (accuracy on Cora, CiteSeer, PubMed):
(model label missing)                      88.2 ± 0.3   78.9 ± 0.2   87.4 ± 0.3
CGAT (Our Impl.: 2-layer w/ 64-features)   88.9 ± 0.3   78.9 ± 0.2   86.9 ± 0.2
SuperGAT MX (2-layer w/ 64-features)       88.7 ± 0.2   79.1 ± 0.2   87.0 ± 0.1
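The fixed settings listed in this subsection can be collected into a small lookup. This is only a sketch: the dictionary layout and the name `base_config` are hypothetical and are not taken from the released `args.yaml`; the numeric values are the ones stated in the text, and datasets for which the text gives no learning rate raise an error instead of receiving a guessed value.

```python
from typing import Dict, List

# Learning rates as stated in the text, keyed by value.
LEARNING_RATES: Dict[float, List[str]] = {
    0.05: ["ogbn-arxiv"],
    0.01: ["PubMed", "PPI", "Wiki-CS", "Photo", "Computers", "CS",
           "Physics", "Crocodile", "Cora-Full", "DBLP"],
    0.005: ["Cora", "CiteSeer", "Cora-ML", "Chameleon"],
    0.001: ["Four-Univ"],
}
DROPOUT_OVERRIDES = {"PPI": 0.0, "ogbn-arxiv": 0.2}
DEFAULT_DROPOUT = 0.6
# Candidate values for the two sampling ratios on real-world datasets.
SEARCH_SPACE = {"p_n": [0.3, 0.5, 0.7, 0.9], "p_e": [0.6, 0.8, 1.0]}


def base_config(dataset: str) -> Dict[str, float]:
    """Return the fixed (non-tuned) settings for a real-world dataset."""
    for lr, names in LEARNING_RATES.items():
        if dataset in names:
            break
    else:
        raise KeyError(f"no learning rate listed for {dataset!r}")
    is_arxiv = dataset == "ogbn-arxiv"
    return {
        "lr": lr,
        "dropout": DROPOUT_OVERRIDES.get(dataset, DEFAULT_DROPOUT),
        "features_per_head": 16 if is_arxiv else 8,
        "last_layer_heads": 1 if is_arxiv else 8,
    }


cora_cfg = base_config("Cora")
arxiv_cfg = base_config("ogbn-arxiv")
```

The mixing coefficients $\lambda_2$ and $\lambda_E$ are deliberately absent here because they are tuned per dataset rather than fixed.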
A.6 CGAT IMPLEMENTATION

CGAT (Wang et al., 2019a) has two auxiliary losses: the graph structure based constraint $L_g$ and the class boundary constraint $L_b$. We borrow their notation for this section: $V$ as a set of nodes, $N_i$ as a set of one-hop neighbors, $N_i^{+}$ as a set of neighbors that share labels, $N_i^{-}$ as a set of neighbors that do not share labels, $\zeta_{\cdot}$ as a margin between attention values, and $\phi(v_i, v_k)$ as unnormalized attention value.

$$L_g = \sum_{i \in V} \sum_{j \in N_i \setminus N_i^{-}} \sum_{k \in V \setminus N_i} \max\left(0, \phi(v_i, v_k) + \zeta_g - \phi(v_i, v_j)\right) \qquad (10)$$

$$L_b = \sum_{i \in V} \sum_{j \in N_i^{+}} \sum_{k \in N_i^{-}} \max\left(0, \phi(v_i, v_k) + \zeta_b - \phi(v_i, v_j)\right) \qquad (11)$$

Since label information is included in these two losses, they are difficult to use in semi-supervised settings that provide few labeled samples. In fact, the CGAT paper conducts experiments in full-supervised settings; that is, all nodes except validation and test nodes are used in training. So, we only use $L_g$ modified for semi-supervised learning.

$$L_g^{SSL} = \sum_{i \in V} \sum_{j \in N_i} \sum_{k \in V \setminus N_i} \max\left(0, \phi(v_i, v_k) + \zeta_g - \phi(v_i, v_j)\right) \qquad (12)$$

With $L_c$, the multi-class cross-entropy on node labels, CGAT's optimization objective is

$$L = L_c + \lambda_g L_g + \lambda_b L_b, \qquad (13)$$

and our modified CGAT's loss is

$$L = L_c + \lambda_g L_g^{SSL}. \qquad (14)$$

In addition to losses, CGAT proposes top-k softmax and node importance based negative sampling (NINS). Top-k softmax picks the nodes with the top-k attention values among neighbors. NINS adopts importance sampling when choosing negative sample nodes. Since the code for CGAT has not been released, we implement our own version. In all experiments in our paper, we use only the modified loss (Equation 14) and top-k softmax due to the training and implementation complexity of NINS. For PPI, even though we do not assume a semi-supervised setting, we use the same loss because we could not accurately implement the multi-label cases for Equations 10 and 11 from CGAT's description alone.
To verify our implementation, we report the results of a full-supervised setting with the original loss (Equation 13), as in the CGAT paper, in Table 7. Our implementation of CGAT shows almost the same performance as reported in the original paper. In addition, SuperGAT and CGAT show similar performance in the full-supervised setting. Note that the original paper employs two hidden layers with hidden dimensions of 32 for Cora and 64 for CiteSeer, and three hidden layers with hidden dimensions of 32 for PubMed, whereas the models in our experiments are all two-layer with 64 features.
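The semi-supervised graph structure loss $L_g^{SSL}$ from this section is a sum of hinge terms over (node, neighbor, non-neighbor) triples, which the following sketch makes explicit. It is an illustrative brute-force version under stated assumptions, not the implementation used in the experiments: the function name, the callable `phi`, and the toy scores are hypothetical, and the set $V \setminus N_i$ is taken literally, so it includes node $i$ itself unless $i$ is listed as its own neighbor.

```python
from typing import Callable, Dict, List


def ssl_graph_structure_loss(neighbors: Dict[int, List[int]],
                             phi: Callable[[int, int], float],
                             margin: float) -> float:
    """Sum of max(0, phi(i, k) + margin - phi(i, j)) over all
    i in V, j in N_i, and k in V \\ N_i."""
    nodes = list(neighbors)
    loss = 0.0
    for i in nodes:
        nbr_set = set(neighbors[i])
        non_neighbors = [k for k in nodes if k not in nbr_set]
        for j in neighbors[i]:
            for k in non_neighbors:
                loss += max(0.0, phi(i, k) + margin - phi(i, j))
    return loss


# Hypothetical toy graph: nodes 0 and 1 are linked, node 2 is isolated.
toy_neighbors = {0: [1], 1: [0], 2: []}


def toy_phi(i: int, k: int) -> float:
    # Toy unnormalized attention: 1.0 on edges, 0.0 elsewhere.
    return 1.0 if k in toy_neighbors[i] else 0.0


loss_small_margin = ssl_graph_structure_loss(toy_neighbors, toy_phi, 0.5)
loss_large_margin = ssl_graph_structure_loss(toy_neighbors, toy_phi, 1.5)
```

With a margin of 0.5 every linked pair already beats every unlinked pair by enough, so the loss is zero; with a margin of 1.5 each of the four triples contributes 0.5. The triple loop is quadratic in the number of nodes, so a practical version would sample the non-neighbors $k$.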

A.7 WALL-CLOCK TIME EXPERIMENTAL SET-UP

To demonstrate our model's efficiency, we measure the mean wall-clock time of the entire training process over three runs using a single GPU (GeForce GTX 1080Ti). We compare our model with GAT (Veličković et al., 2018) and GAM (Stretcu et al., 2019). GAT is the basic model using a simpler attention mechanism than ours, and GAM is the state-of-the-art model using co-training with the auxiliary model. For GAT and SuperGAT, we use our implementation (including hyperparameter settings) in PyTorch (Paszke et al., 2019). For GAM, we adopt the TensorFlow (Abadi et al., 2015) code from the authors (footnote 5) and choose the GCN + GAM model, which showed the best performance. We retain the default settings in the code but use the hyperparameters reported in the paper, if possible. With this setting, GCN + GAM on PubMed does not finish within 24 hours; therefore, we manually early-stop the training at the best accuracy.
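A measurement of this kind can be sketched with the standard library alone. The helper name `mean_wall_clock` and the dummy workload are assumptions for illustration; on a GPU, asynchronous kernels would additionally need a device synchronization (for example `torch.cuda.synchronize()` in PyTorch) before each clock reading, which is omitted here.

```python
import time
from statistics import mean
from typing import Callable, List


def mean_wall_clock(train_fn: Callable[[], None], runs: int = 3) -> float:
    """Run a full training procedure `runs` times and average the
    elapsed wall-clock time in seconds."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        train_fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Hypothetical stand-in for a training run.
call_log: List[int] = []


def dummy_training() -> None:
    call_log.append(sum(range(10_000)))


elapsed = mean_wall_clock(dummy_training, runs=3)
```

Ratios such as "×34 more training time" then follow by dividing the mean time of one model by that of another.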

B.1 PROOF OF PROPOSITION

Proposition 2. For the $(l+1)$-th GAT layer, if $W$ and $a$ are independently and identically drawn from zero-mean uniform distributions with variance $\sigma_w^2$ and $\sigma_a^2$ respectively, assuming that parameters are independent of input features $h^l$ and elements of $h^l$ are independent of each other,

$$\mathrm{Var}[e^{l+1}_{ij,GO}] = 2F^{l+1}\sigma_w^2\sigma_a^2\,E\left(\lVert h^l\rVert_2^2\right) \quad\text{and}\quad \mathrm{Var}[e^{l+1}_{ij,DP}] \ge F^{l+1}\sigma_w^4\left(\tfrac{4}{5}E\left[\left((h^l_i)^\top h^l_j\right)^2\right] + \mathrm{Var}\left((h^l_i)^\top h^l_j\right)\right) \qquad (15)$$

Proof. Let $h' = W h^l$, then $h'_{i,k} = \sum_{r=1}^{F^{l+1}} W_{kr} h^l_{i,r}$. Note that

$$e^{l+1}_{ij,GO} = a^\top [h'_i \,\Vert\, h'_j] \quad\text{and}\quad e^{l+1}_{ij,DP} = (h'_i)^\top h'_j \qquad (16)$$

First, we compute $E(a^2)$ and $E(h'^2)$.

$$E(a^2) = \mathrm{Var}(a) + E(a)^2 = \sigma_a^2 \qquad (17)$$
$$E(h'^2_k) = E\left[\left(\sum_{r=1}^{F^{l+1}} W_{kr} h^l_{\cdot,r}\right)^2\right] \qquad (18)$$
$$= E\left[\sum_{r=1}^{F^{l+1}} W_{kr}^2 (h^l_{\cdot,r})^2\right] \qquad (19)$$
$$= E\left(W^2_{k,\cdot}\right) E\left[\sum_{r=1}^{F^{l+1}} (h^l_{\cdot,r})^2\right] \qquad (20)$$
$$= \sigma_w^2\, E\left(\lVert h^l\rVert_2^2\right) \qquad (21)$$

For the variance of $e^{l+1}_{ij,GO}$,

$$\mathrm{Var}(e^{l+1}_{ij,GO}) = \mathrm{Var}\left(a^\top [h'_i \,\Vert\, h'_j]\right) \qquad (22)$$
$$= \mathrm{Var}\left(\sum_{r=1}^{F^{l+1}} \left(a_r h'_{i,r} + a_{r+F^{l+1}} h'_{j,r}\right)\right) \qquad (23)$$
$$= 2F^{l+1}\,\mathrm{Var}(a h') \qquad (24)$$
$$= 2F^{l+1}\left(E(a^2)E(h'^2) - E(a)^2 E(h')^2\right) \qquad (25)$$
$$= 2F^{l+1} E(a^2) E(h'^2) \qquad (26)$$
$$= 2F^{l+1}\sigma_a^2\sigma_w^2\, E\left(\lVert h^l\rVert_2^2\right) \qquad (27)$$

Now we compute $E(h'_i h'_j)$ and $E(h'^2_i h'^2_j)$.

$$E(h'_{i,k} h'_{j,k}) = E\left[\left(\sum_{r=1}^{F^{l+1}} W_{kr} h^l_{i,r}\right)\left(\sum_{r=1}^{F^{l+1}} W_{kr} h^l_{j,r}\right)\right] \qquad (28)$$
$$= E\left[\sum_{r=1}^{F^{l+1}} W_{kr}^2\, h^l_{i,r} h^l_{j,r}\right] \qquad (29)$$
$$= E\left(W^2_{k,\cdot}\right) E\left[\sum_{r=1}^{F^{l+1}} h^l_{i,r} h^l_{j,r}\right] \qquad (30)$$
$$= \sigma_w^2\, E\left((h^l_i)^\top h^l_j\right) \qquad (31)$$

$$E(h'^2_{i,k} h'^2_{j,k}) \qquad (32)$$
$$= E\left[\left(\sum_{r=1}^{F^{l+1}} W_{kr} h^l_{i,r}\right)^2\left(\sum_{r=1}^{F^{l+1}} W_{kr} h^l_{j,r}\right)^2\right] \qquad (33)$$
$$= E\left(W^4_{k,\cdot}\right) E\left[\sum_{r=1}^{F^{l+1}} (h^l_{i,r} h^l_{j,r})^2\right] + E\left(W^2_{k,\cdot}\right)^2 E\left[\sum_{s \ne t}^{F^{l+1}} \left((h^l_{i,s} h^l_{j,t})^2 + 2 (h^l_{i,s} h^l_{j,s})(h^l_{i,t} h^l_{j,t})\right)\right] \qquad (34)$$
$$= E\left(W^4_{k,\cdot}\right) E\left[\sum_{r=1}^{F^{l+1}} (h^l_{i,r} h^l_{j,r})^2\right] + E\left(W^2_{k,\cdot}\right)^2 E\left[\sum_r (h^l_{i,r})^2 \sum_r (h^l_{j,r})^2 - \sum_r (h^l_{i,r} h^l_{j,r})^2\right] \qquad (35)$$
$$\quad + 2E\left(W^2_{k,\cdot}\right)^2 E\left[\left(\sum_r h^l_{i,r} h^l_{j,r}\right)^2 - \sum_r \left(h^l_{i,r} h^l_{j,r}\right)^2\right] \qquad (36)$$
$$= \tfrac{9}{5}\sigma_w^4\, E\left[\sum_{r=1}^{F^{l+1}} (h^l_{i,r} h^l_{j,r})^2\right] + \sigma_w^4\, E\left[\lVert h^l_i\rVert_2^2 \lVert h^l_j\rVert_2^2 - 3\sum_{r=1}^{F^{l+1}} (h^l_{i,r} h^l_{j,r})^2\right] + 2\sigma_w^4\, E\left[\left((h^l_i)^\top h^l_j\right)^2\right] \qquad (37)$$
$$= \sigma_w^4\, E\left[\lVert h^l_i\rVert_2^2 \lVert h^l_j\rVert_2^2 - \tfrac{6}{5}\sum_{r=1}^{F^{l+1}} (h^l_{i,r} h^l_{j,r})^2 + 2\left((h^l_i)^\top h^l_j\right)^2\right] \qquad (38)$$
$$= \sigma_w^4\, E\left[\tfrac{4}{5}\left((h^l_i)^\top h^l_j\right)^2 + \tfrac{1}{10}\sum_{s \ne t}\left(\left(h^l_{i,s} h^l_{j,t} + h^l_{i,t} h^l_{j,s}\right)^2 + 8\left(h^l_{i,s} h^l_{j,t}\right)^2\right)\right] + \sigma_w^4\, E\left[\left((h^l_i)^\top h^l_j\right)^2\right] \qquad (39)$$

Note that for the zero-mean uniform distribution $U(-u, u)$ with variance $\sigma_w^2$,

$$E(W^2_{\cdot,\cdot}) = \mathrm{Var}(W_{\cdot,\cdot}) + E(W_{\cdot,\cdot})^2 = \sigma_w^2 \qquad (40)$$
$$\sigma_w^2 = \tfrac{1}{12}\left(u - (-u)\right)^2 = \tfrac{1}{3}u^2 \qquad (41)$$
$$E(W^4_{\cdot,\cdot}) = \frac{1}{2u}\int_{-u}^{u} x^4\, dx \qquad (42)$$
$$= \tfrac{1}{5}u^4 = \tfrac{9}{5}\sigma_w^4 \qquad (43)$$

For the variance of $e^{l+1}_{ij,DP}$,

$$\mathrm{Var}(e^{l+1}_{ij,DP}) = \mathrm{Var}\left((h'_i)^\top h'_j\right) \qquad (44)$$
$$= \mathrm{Var}\left(\sum_{r=1}^{F^{l+1}} h'_{i,r} h'_{j,r}\right) \qquad (45)$$
$$= F^{l+1}\,\mathrm{Var}(h'_i h'_j) \qquad (46)$$
$$= F^{l+1}\left(E\left((h'_i h'_j)^2\right) - E(h'_i h'_j)^2\right) \qquad (47)$$
$$= F^{l+1}\sigma_w^4\, E\left[\tfrac{4}{5}\left((h^l_i)^\top h^l_j\right)^2 + \tfrac{1}{10}\sum_{s \ne t}\left(\left(h^l_{i,s} h^l_{j,t} + h^l_{i,t} h^l_{j,s}\right)^2 + 8\left(h^l_{i,s} h^l_{j,t}\right)^2\right)\right] \qquad (48)$$
$$\quad + F^{l+1}\sigma_w^4\left(E\left[\left((h^l_i)^\top h^l_j\right)^2\right] - E\left((h^l_i)^\top h^l_j\right)^2\right) \qquad (49)$$
$$\ge F^{l+1}\sigma_w^4\left(\tfrac{4}{5}E\left[\left((h^l_i)^\top h^l_j\right)^2\right] + \mathrm{Var}\left((h^l_i)^\top h^l_j\right)\right) \qquad (50)$$

Table 8 (surviving rows). Rows are average degree, columns are homophily 0.1 to 0.9, and entries are mean ± standard deviation. A "?" marks a row whose average-degree label is missing, and "[truncated]" marks a row that is cut off.

(model label missing)
Avg. degree  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
?     25.4 ± 0.6  29.2 ± 0.5  37.1 ± 0.5  47.6 ± 1.0  58.1 ± 0.9  73.1 ± 2.8  85.3 ± 2.6  92.7 ± 1.1  97.9 ± 0.5
12.5  24.0 ± 0.4  27.6 ± 0.5  39.6 ± 0.8  48.9 ± 0.4  65.8 ± 3.1  77.8 ± 2.4  88.5 ± 1.5  95.0 ± 0.9  99.2 ± 0.0
15    22.1 ± 0.5  28.2 ± 0.6  44.0 ± 0.7  53.0 ± 0.5  69.9 ± 4.7  79.1 ± 2.0  92.2 ± 1.8  98.0 ± 0.4  99.6 ± 0.3
20    24.3 ± 0.6  30.9 ± 0.7  41.9 ± 0.9  58.1 ± 1.2  75.0 ± 2.9  86.7 ± 1.1  96.1 ± 0.3  98.9 ± 0.1  99.8 ± 0.1
25    26.6 ± 1.0  31.1 ± 0.3  44.9 ± 0.2  62.4 ± 0.7  80.1 ± 2.6  91.0 ± 1.5  97.5 ± 0.5  99.5 ± 0.2  100.0 ± 0.0
32.5  26.3 ± 0.7  33.8 ± 0.7  51.8 ± 0.8  67.5 ± 2.9  83.5 ± 1.9  95.7 ± 0.3  98.4 ± 0.2  100.0 ± 0.0  100.0 ± 0.0
40    23.7 ± 0.5  34.2 ± 0.5  53.4 ± 0.4  72.8 ± 1.5  87.6 ± 0.9  96.1 ± 0.6  99.7 ± 0.1  99.9 ± 0.0  100.0 ± 0.0
50    25.0 ± 1.1  36.4 ± 1.0  55.1 ± 0.5  82.7 ± 3.0  91.2 ± 0.7  98.6 ± 0.4  99.8 ± 0.0  99.9 ± 0.0  100.0 ± 0.0
75    25.9 ± 0.6  40.6 ± 0.6  68.0 ± 1.0  87.9 ± 2.6  97.1 ± 0.6  99.6 ± 0.1  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0
100   25.1 ± 0.5  42.0 ± 0.9  72.4 ± 1.4  92.6 ± 0.4  98.6 ± 0.2  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0. [truncated]

(model label missing)
Avg. degree  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
?     31.9 ± 1.6  34.6 ± 1.6  49.5 ± 0.9  62.0 ± 0.7  77.2 ± 1.5  85.9 ± 0.3  94.4 ± 0.9  98.5 ± 0.5  99.4 ± 0.1
20    34.4 ± 1.8  38.3 ± 1.6  54.1 ± 1.9  70.0 ± 1.1  83.9 ± 1.0  93.7 ± 0.1  97.5 ± 0.3  99.0 ± 0.3  99.7 ± 0.0
25    35.8 ± 2.0  42.0 ± 2.4  57.7 ± 0.6  77.7 ± 0.9  87.9 ± 1.2  95.0 ± 0.7  98.7 ± 0.6  99.6 ± 0.3  99.9 ± 0.1
32.5  37.4 ± 1.2  44.7 ± 1.1  66.5 ± 1.9  79.9 ± 1.2  91.4 ± 1.4  98.0 ± 0.5  99.0 ± 0.3  99.8 ± 0.1  100.0 ± 0.0
40    37.5 ± 2.0  45.1 ± 0.9  66.5 ± 1.1  85.7 ± 1.5  93.5 ± 1.0  97.6 ± 0.6  99.5 ± 0.1  99.9 ± 0.1  99.9 ± 0.1
50    38.7 ± 1.8  49.5 ± 2.3  68.9 ± 2.5  89.5 ± 0.8  96.0 ± 0.8  99.0 ± 0.4  99.7 ± 0.2  99.8 ± 0.1  100.0 ± 0.0
75    39.6 ± 3.0  53.5 ± 1.7  77.6 ± 2.6  92.8 ± 2.3  98.0 ± 1.2  99.5 ± 0.3  99.8 ± 0.2  100.0 ± 0.0  100.0 ± 0.0
100   41.3 ± 1.2  56.6 ± 1.4  81.1 ± 3.5  95.9 ± 1.2  99.0 ± 0.4  99.8 ± 0.1  99.9 ± 0.1  100.0 ± 0.1  100.0 ± 0.0

SuperGAT SD
Avg. degree  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
1     38.5 ± 1.0  40.5 ± 1.5  39.5 ± 1.6  42.3 ± 0.9  44.0 ± 0.5  47.1 ± 0.5  50.1 ± 1.0  53.0 ± 0.3  55.1 ± 1.0
1.5   36.7 ± 1.2  37.8 ± 1.8  42.0 ± 0.4  42.9 ± 0.8  45.0 ± 0.7  47.8 ± 0.5  52.9 ± 0.7  54. [truncated]
?     37.8 ± 0.8  42.1 ± 0.9  56.9 ± 0.9  70.0 ± 0.5  86.1 ± 0.8  95.6 ± 0.3  99.1 ± 0.1  99.6 ± 0.2  99.7 ± 0.1
25    40.0 ± 1.0  48.8 ± 1.2  59.5 ± 0.5  78.7 ± 0.2  90.7 ± 0.2  97.5 ± 0.1  99.4 ± 0.1  100.0 ± 0.0  99.9 ± 0.1
32.5  39.7 ± 1.1  48.5 ± 0.7  69.4 ± 0.4  82.7 ± 0.5  94.3 ± 0.2  99.3 ± 0.1  99.7 ± 0.1  99.9 ± 0.1  99.9 ± 0.0
40    44.2 ± 1.1  48.7 ± 1.3  69.2 ± 0.7  88.3 ± 0.4  97.1 ± 0.1  99.7 ± 0.1  99.9 ± 0.0  99.8 ± 0.2  100.0 ± 0.0
50    44.3 ± 0.7  53.2 ± 0.6  73.4 ± 1.2  91.2 ± 0.4  97.7 ± 0.2  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0
75    44.8 ± 0.7  56.1 ± 0.9  82.7 ± 0.6  95.7 ± 0.2  99.8 ± 0.2  99.9 ± 0.1  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0
100   43.1 ± 1.3  60.7 ± 0.7  87.4 ± 0.3  98.3 ± 0.1  99.9 ± 0.0  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0

SuperGAT MX
Avg. degree  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
1     33.4 ± 0.7  36.3 ± 0.8  35.4 ± 0.7  40.5 ± 0.3  43.4 ± 0.6  47.0 ± 0.8  50.0 ± 0.5  54.2 ± 1.1  56.4 ± 0.4
1.5   28.8 ± 0.7  31.5 ± 1.0  36.0 ± 0.8  41.1 ± 1.1  41.3 ± 0.6  47.5 ± 0.8  52.3 ± 0.7  54.8 ± 0.5  63.4 ± 0.3
2.5   26.6 ± 0.9  30.5 ± 1.3  33.6 ± 0.8  42.7 ± 1.5  44.7 ± 0.5  52.2 ± 0.5  56.5 ± 0.8  66.7 ± 0.8  74.5 ± 0.7
3.5   27.7 ± 0.9  33.9 ± 1.7  37.9 ± 0.7  42.4 ± 0.6  49.7 ± 0.8  54. [truncated]
?     36.4 ± 1.3  39.9 ± 2.1  53.8 ± 1.5  68.5 ± 0.8  81.0 ± 0.4  90.2 ± 1.0  96.1 ± 0.9  99.2 ± 0.2  99.6 ± 0.1
20    35.2 ± 3.1  41.0 ± 2.1  62.3 ± 1.2  73.9 ± 0.9  86.3 ± 0.6  95.2 ± 1.3  98.6 ± 0.7  99.6 ± 0.1  99.9 ± 0.2
25    36.8 ± 1.9  46.6 ± 1.7  62.7 ± 1.1  81.9 ± 1.7  90.7 ± 0.8  96.8 ± 1.5  99.3 ± 0.3  99.8 ± 0.2  100.0 ± 0.0
32.5  36.7 ± 0.8  46.8 ± 0.5  68.9 ± 0.6  83.8 ± 0.7  93.7 ± 1.0  98.4 ± 0.7  99.8 ± 0.1  100.0 ± 0.0  100.0 ± 0.0
40    39.8 ± 2.4  48.8 ± 1.7  72.1 ± 0.8  87.5 ± 1.7  96.6 ± 0.8  98.9 ± 0.3  99.9 ± 0.1  100.0 ± 0.0  100.0 ± 0.0
50    42.1 ± 1.3  53.6 ± 2.3  75.1 ± 1.9  92.7 ± 1.0  97.1 ± 0.9  99.6 ± 0.2  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0
75    41.9 ± 1.1  55.9 ± 2.4  80.8 ± 1.2  95.0 ± 1.1  99.5 ± 0.2  99.9 ± 0.1  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0
100   39.6 ± 2.0  59.2 ± 1.8  84.2 ± 3.1  96.9 ± 0.7  99.7 ± 0.1  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0  100.0 ± 0.0
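As a numerical sanity check of the GO-attention variance in Proposition 2 (B.1), the sketch below compares a Monte Carlo estimate with the closed form. It rests on assumptions made for the illustration: the input features $h_i$ and $h_j$ are held fixed rather than random, so $2E(\lVert h^l\rVert_2^2)$ is replaced by $\lVert h_i\rVert_2^2 + \lVert h_j\rVert_2^2$; all function names and the toy vectors are hypothetical.

```python
import random
from typing import List


def simulate_go_variance(h_i: List[float], h_j: List[float], f_out: int,
                         u_w: float, u_a: float,
                         trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo variance of e_GO = a^T [W h_i || W h_j] with
    W ~ U(-u_w, u_w) and a ~ U(-u_a, u_a), all entries independent."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        e = 0.0
        for _ in range(f_out):
            # One row of W yields one coordinate of h'_i and of h'_j.
            w_row = [rng.uniform(-u_w, u_w) for _ in h_i]
            hp_i = sum(w * x for w, x in zip(w_row, h_i))
            hp_j = sum(w * x for w, x in zip(w_row, h_j))
            e += rng.uniform(-u_a, u_a) * hp_i + rng.uniform(-u_a, u_a) * hp_j
        samples.append(e)
    m = sum(samples) / trials
    return sum((s - m) ** 2 for s in samples) / (trials - 1)


def closed_form_go_variance(h_i: List[float], h_j: List[float], f_out: int,
                            u_w: float, u_a: float) -> float:
    """F * sigma_w^2 * sigma_a^2 * (||h_i||^2 + ||h_j||^2), using
    sigma^2 = u^2 / 3 for U(-u, u)."""
    var_w, var_a = u_w ** 2 / 3.0, u_a ** 2 / 3.0
    sq_norms = sum(x * x for x in h_i) + sum(x * x for x in h_j)
    return f_out * var_w * var_a * sq_norms


toy_h_i = [1.0, -2.0, 0.5, 0.0]
toy_h_j = [0.5, 0.5, -1.0, 2.0]
estimated = simulate_go_variance(toy_h_i, toy_h_j, f_out=3, u_w=1.0, u_a=1.0)
expected = closed_form_go_variance(toy_h_i, toy_h_j, f_out=3, u_w=1.0, u_a=1.0)
```

The two values should agree up to sampling noise, which reflects the step from (24) to (26): with zero-mean $a$, the variance of each product reduces to $E(a^2)E(h'^2)$ and the cross terms vanish.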

B.2 FULL RESULT OF SYNTHETIC GRAPH EXPERIMENTS

In Table 8, we report all results of synthetic graph experiments. We experiment on a total of 144 synthetic graphs, controlling 9 homophily values (0.1, 0.2, ..., 0.9) and 16 average degrees (1, 1.5, 2.5, 3.5, 5, 7.5, 10, 12.5, 15, 20, 25, 32.5, 40, 50, 75, 100).

B.5 SENSITIVITY ANALYSIS OF HYPER-PARAMETERS

We analyze the sensitivity of the mixing coefficient of losses $\lambda_E$, the negative sampling ratio $p_n$, and the edge sampling ratio $p_e$. We plot mean node classification performance (over 5 runs) against each hyperparameter in Figures 9, 10, and 11, respectively. We use the best model for each dataset: SuperGAT MX for citation networks and SuperGAT SD for PPI. For $\lambda_E$, there is a specific range that maximizes test performance in all datasets. Performance on PPI is the largest when $\lambda_E$ is $10^{-3}$, but the difference is relatively small compared to the other datasets. We observe that there is an optimal level of edge supervision for each dataset, and using too large a $\lambda_E$ degrades node classification performance. For $p_n$, using too many negative samples decreases performance. The optimal number of negative samples differs across datasets, and all are less than the number of positive samples ($p_n < 1.0$). Note that as $p_n$ increases, the required GPU memory also increases. When $p_n = 5.0$, the model and data for PPI could not fit on a single GPU (GeForce GTX 1080Ti). When $p_e$ changes, the performance also changes, but the pattern differs across datasets. For Cora and PubMed, the performance against $p_e$ follows a convex curve. Performance for CiteSeer generally decreases as $p_e$ increases, but there are intervals over which the performance change is nearly zero. In the case of PPI, there are no noticeable changes against $p_e$.
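To make the roles of the two sampling ratios concrete, the sketch below shows one plausible reading of them. It is an assumption-laden illustration, not the released implementation: here $p_n$ scales the number of negative edges relative to the number of positive edges (consistent with the remark that $p_n < 1.0$ means fewer negatives than positives), and $p_e$ is the probability of keeping each labeled edge in a given training iteration; the function names are hypothetical.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


def num_negative_samples(num_edges: int, p_n: float) -> int:
    """Number of negative edges to draw, assumed to be p_n * |E|."""
    return int(round(p_n * num_edges))


def subsample_supervision(positives: List[Edge], negatives: List[Edge],
                          p_e: float, rng: random.Random
                          ) -> List[Tuple[Edge, int]]:
    """Keep each labeled edge (1 = present, 0 = absent) with
    probability p_e for the current training iteration."""
    labeled = [(e, 1) for e in positives] + [(e, 0) for e in negatives]
    return [pair for pair in labeled if rng.random() < p_e]


# Hypothetical toy data: 10 positive edges and p_n = 0.5.
toy_pos = [(i, i + 1) for i in range(10)]
k_neg = num_negative_samples(len(toy_pos), p_n=0.5)
toy_neg = [(i, i + 5) for i in range(k_neg)]
kept_all = subsample_supervision(toy_pos, toy_neg, 1.0, random.Random(0))
kept_none = subsample_supervision(toy_pos, toy_neg, 0.0, random.Random(0))
```

Under this reading, memory use grows with $p_n$ because the number of attention scores that must be computed for negative edges grows with it.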



CONCLUSION

We proposed novel graph neural architecture designs to self-supervise graph attention following the input graph's characteristics. We first assessed what graph attention is learning and analyzed the effect of edge self-supervision on link prediction and node classification performance. This analysis showed that two widely used attention mechanisms (original GAT and dot-product) have difficulty encoding label-agreement and edge presence simultaneously. To address this problem, we suggested several graph attention forms that balance these two factors and argued that graph attention should be designed depending on the input graph's average degree and homophily. Our experiments demonstrated that our graph attention recipe generalizes across various real-world datasets such that the models designed according to the recipe outperform other baseline models.

Footnote 1: Since CGAT uses node labels in the loss function, it is difficult to use it in semi-supervised learning. So, we modify its auxiliary loss for SSL. See appendix A.6 for details.
Footnote 2: https://docs.dgl.ai/en/latest/tutorials/models/1_gnn/9_gat.html
Footnote 3: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
Footnote 4: https://github.com/samihaija/mixhop/blob/master/data/synthetic/make_x.py
Footnote 5: https://github.com/tensorflow/neural-structured-learning/tree/master/research/gam



Figure 2: Distribution of KL divergence between normalized attention and label-agreement on all nodes and layers for Cora dataset (Left: two-layer GAT, Right: four-layer GAT).

Figure 3: Test performance on node classification and link prediction for GO and DP attentions against the mixing coefficient $\lambda_E$. We report accuracy (Cora, CiteSeer, PubMed) and micro f1-score (PPI) for node classification, and AUC for link prediction.


In this figure, we can observe that the average degree ($d_{avg} = n \cdot \delta$) increases as $\delta$ increases, and homophily ($h = p_{in} / \delta$) increases as $p_{in}$ increases. Note that these are raw input features sampled from the 2D Gaussian distribution, not learned node representations. We use the code from the prior work (Abu-El-Haija et al., 2019; footnote 4) and apply normalization by standard score. We choose $\delta$ from {0.025, 0.2} and $p_{in}$ from {0.1δ, 0.5δ, 0.9δ}, and fix $n = 100$ and $c = 5$.
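The two relations stated here, $d_{avg} = n \cdot \delta$ and $h = p_{in} / \delta$, can be inverted to obtain generator parameters for a target average degree and homophily. The sketch below does only that arithmetic; the function name and the grid construction are illustrative assumptions, with $n = 100$ taken from the example setting above, and the sketch does not generate graphs.

```python
from typing import List, Tuple


def synthetic_params(avg_degree: float, homophily: float,
                     n: int = 100) -> Tuple[float, float]:
    """Invert d_avg = n * delta and h = p_in / delta."""
    delta = avg_degree / n
    p_in = homophily * delta
    return delta, p_in


# Grid of the synthetic experiments: 9 homophily values x 16 degrees.
homophily_values: List[float] = [k / 10 for k in range(1, 10)]
degree_values: List[float] = [1, 1.5, 2.5, 3.5, 5, 7.5, 10, 12.5,
                              15, 20, 25, 32.5, 40, 50, 75, 100]
grid = [synthetic_params(d, h) for d in degree_values
        for h in homophily_values]

delta_example, p_in_example = synthetic_params(2.5, 0.9)
```

For instance, a target average degree of 2.5 with homophily 0.9 gives $\delta = 0.025$ and $p_{in} = 0.9\delta$, which matches one of the Figure 6 panels.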

Figure 6: t-SNE plots of node features and edges for synthetic graph examples. Hyperparameters are $\delta$ ∈ {0.025 (Top), 0.2 (Bottom)} and $p_{in}$ ∈ {0.1δ (Left), 0.5δ (Center), 0.9δ (Right)}.

Figure 7: Kernel density estimate plot of the distribution of degree and per-node homophily in real-world graphs.

Figure 11: Test performance on node classification against the edge sampling ratio $p_e$ for SuperGAT MX (Cora, CiteSeer, PubMed) and SuperGAT SD (PPI).

Table 1: Summary of classification accuracies of GCN, GraphSAGE, GAT, SuperGAT SD, and SuperGAT MX for real-world datasets (30 runs for ogbn-arxiv and Flickr, and 100 runs for others).

Table 2: Summary of classification accuracies with 100 random seeds for Cora, CiteSeer, and PubMed. We mark with daggers (†) the reprinted results from the respective papers.

Table 4: Average degree and homophily of real-world graphs.

Table 5: Statistics of the real-world datasets.

Table 6: Hyperparameters for experiments on real-world datasets.

Table 7: Summary of classification accuracies with 10 random seeds for Cora, CiteSeer, and PubMed in the full-supervised setting. We mark with asterisks the reprinted results from the respective papers.

Table 8: Summary of classification accuracies (of 5 runs) for synthetic datasets.

Figure 9: Test performance on node classification against the mixing coefficient $\lambda_E$ for SuperGAT MX (Cora, CiteSeer, PubMed) and SuperGAT SD (PPI).

Figure 10: Test performance on node classification against the negative sampling ratio $p_n$ for SuperGAT MX (Cora, CiteSeer, PubMed) and SuperGAT SD (PPI).

ACKNOWLEDGMENTS

This research was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921)

annex

In Figure 8, we draw box plots of the KL divergence between the attention distribution and the label-agreement distribution for all nodes and layers of two-layer and four-layer GATs. As shown in the paper, DP attention captures label-agreement less well than GO attention. Also, this phenomenon becomes stronger as the layer goes down.

B.4 WALL-CLOCK TIME RESULT

In Table 9, we report the mean wall-clock time (over three runs) of training GAT, GAM, and SuperGAT MX. In SuperGAT, we find that negative sampling of edges is the bottleneck of training, so we additionally implement SuperGAT MX + MPNS, which employs multi-processing when sampling negative edges. There are three observations in this experiment. First, GCN + GAM is highly time-intensive in the training stage (×53.9 to ×328.1 versus GAT) for all datasets. Second, compared to GAT, our model needs ×2.7 more training time for Cora and ×7.2 for PubMed, and we reduce the time by applying multi-processing to negative sampling (×1.8 for Cora and ×4.0 for PubMed). Third, for CiteSeer, SuperGAT MX finishes faster than GAT because of faster convergence and fewer epochs.

