Learnable Graph Convolutional Attention Networks

Abstract

Existing Graph Neural Networks (GNNs) compute the message exchange between nodes by either aggregating uniformly (convolving) the features of all the neighboring nodes, or by applying a non-uniform score (attending) to the features. Recent works have shown the strengths and weaknesses of the resulting GNN architectures, respectively, GCNs and GATs. In this work, we aim at exploiting the strengths of both approaches to their full extent. To this end, we first introduce the graph convolutional attention layer (CAT), which relies on convolutions to compute the attention scores. Unfortunately, as in the case of GCNs and GATs, we show that there exists no clear winner among the three (neither theoretically nor in practice), as their performance directly depends on the nature of the data (i.e., of the graph and features). This result brings us to the main contribution of our work, the learnable graph convolutional attention network (L-CAT): a GNN architecture that automatically interpolates between GCN, GAT and CAT in each layer, by adding two scalar parameters. Our results demonstrate that L-CAT is able to efficiently combine different GNN layers along the network, outperforming competing methods on a wide range of datasets, and resulting in a more robust model that reduces the need for cross-validation.

1. Introduction

In recent years, Graph Neural Networks (GNNs) (Scarselli et al., 2008) have become ubiquitous in machine learning, emerging as the standard approach in many settings. For example, they have been successfully applied to tasks such as topic prediction in citation networks (Sen et al., 2008); molecule prediction (Gilmer et al., 2017); and link prediction in recommender systems (Wu et al., 2020a). These applications typically make use of message-passing GNNs (Gilmer et al., 2017), whose idea is fairly simple: in each layer, nodes are updated by aggregating the information (messages) coming from their neighboring nodes. Depending on how this aggregation is implemented, we can define different types of GNN layers. Two important and widely adopted layers are graph convolutional networks (GCNs) (Kipf & Welling, 2017), which uniformly average the neighboring information; and graph attention networks (GATs) (Velickovic et al., 2018), which instead perform a weighted average based on an attention score between receiver and sender nodes. More recently, a number of works have shown the strengths and limitations of both approaches from a theoretical (Fountoulakis et al., 2022; Baranwal et al., 2021; 2022) and an empirical (Knyazev et al., 2019) point of view. These results show that their performance depends on the nature of the data at hand (i.e., the graph and the features), so the standard approach is to select between GCNs and GATs via computationally demanding cross-validation. In this work, we aim to exploit the benefits of both the convolution and attention operations in the design of GNN architectures. To this end, we first introduce a novel graph convolutional attention layer (CAT), which extends existing attention layers by taking the convolved features as inputs of the score function. Following Fountoulakis et al. (2022), we rely on a contextual stochastic block model to theoretically compare the GCN, GAT, and CAT architectures.
Our analysis shows that, unfortunately, there is no free lunch among these three GNN architectures: their performance, as expected, is fully data-dependent. This motivates the main contribution of the paper, the learnable graph convolutional attention network (L-CAT): a novel GNN which, in each layer, automatically interpolates between the three operations by introducing only two scalar parameters. As a result, L-CAT is able to learn the proper operation to apply at each layer, combining different layer types in the same GNN architecture while removing the need to cross-validate, a process that was prohibitively expensive prior to this work. Our extensive empirical analysis demonstrates the capabilities of L-CAT on a wide range of datasets, outperforming existing baseline GNNs in terms of both performance and robustness to input noise and network initialization.

2. Preliminaries

Assume as input an undirected graph $G = (V, E)$, where $V = [n]$ denotes the set of vertices of the graph, and $E \subseteq V \times V$ the set of edges. Each node $i \in [n]$ is represented by a $d$-dimensional feature vector $X_i \in \mathbb{R}^d$, and the goal is to produce a set of predictions $\{\hat{y}_i\}_{i=1}^n$. To this end, a message-passing GNN layer yields a representation $\tilde{h}_i \in \mathbb{R}^{d'}$ for each node $i$ by collecting and aggregating the information from each of its neighbors into a single message, and using the aggregated message to update its representation from the previous layer, $h_i \in \mathbb{R}^d$. For the purposes of this work, we can define this operation as follows:

$$\tilde{h}_i = f(h'_i) \quad \text{where} \quad h'_i \overset{\text{def}}{=} \sum_{j \in \mathcal{N}^*_i} \gamma_{ij} W_v h_j,$$

where $\mathcal{N}^*_i$ is the set of neighbors of node $i$ (including $i$ itself), $W_v \in \mathbb{R}^{d' \times d}$ is a learnable matrix, $f$ is an elementwise function, and the $\gamma_{ij} \in [0, 1]$ are coefficients such that $\sum_j \gamma_{ij} = 1$ for each node $i$. Letting the input features be $h^0_i = X_i$, and the predictions be $\hat{y}_i = h^L_i$, we can readily define a message-passing GNN (Gilmer et al., 2017) as a sequence of $L$ layers as defined above.

Depending on the way the coefficients $\gamma_{ij}$ are computed, we identify different GNN flavors. Graph convolutional networks (GCNs) (Kipf & Welling, 2017) are simple yet effective. In short, GCNs compute the average of the messages, i.e., they assign the same coefficient $\gamma_{ij} = 1/|\mathcal{N}^*_i|$ to every neighbor:

$$\tilde{h}_i = f(h'_i) \quad \text{where} \quad h'_i \overset{\text{def}}{=} \frac{1}{|\mathcal{N}^*_i|} \sum_{j \in \mathcal{N}^*_i} W_v h_j.$$

Graph attention networks take a different approach. Instead of assigning a fixed value to each coefficient $\gamma_{ij}$, they compute it as a function of the sender and receiver nodes. A general formulation for these models can be written as follows:

$$\tilde{h}_i = f(h'_i) \quad \text{where} \quad h'_i \overset{\text{def}}{=} \sum_{j \in \mathcal{N}^*_i} \gamma_{ij} W_v h_j \quad \text{and} \quad \gamma_{ij} \overset{\text{def}}{=} \frac{\exp(\Psi(h_i, h_j))}{\sum_{\ell \in \mathcal{N}^*_i} \exp(\Psi(h_i, h_\ell))}.$$
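To make the GCN aggregation concrete, the following is a minimal NumPy sketch of a single layer with uniform coefficients $\gamma_{ij} = 1/|\mathcal{N}^*_i|$ and self-loops. The function and variable names (e.g., `gcn_layer`, `A_star`) are ours for illustration, not from the paper.

```python
import numpy as np

def gcn_layer(A, H, W_v, f=np.tanh):
    """One GCN layer: uniformly average the neighbor messages W_v h_j over N*_i."""
    n = A.shape[0]
    A_star = A + np.eye(n)                    # N*_i includes node i itself
    deg = A_star.sum(axis=1, keepdims=True)   # |N*_i|
    gamma = A_star / deg                      # gamma_ij = 1/|N*_i| for j in N*_i
    H_prime = gamma @ (H @ W_v.T)             # h'_i = sum_j gamma_ij W_v h_j
    return f(H_prime)                         # elementwise nonlinearity f

# Tiny example: a path graph 0-1-2, mapping 2-d features to 3-d representations.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 2))
W_v = rng.normal(size=(3, 2))
out = gcn_layer(A, H, W_v)
print(out.shape)  # (3, 3): one d'-dimensional representation per node
```

Note that each row of `gamma` sums to one, matching the constraint $\sum_j \gamma_{ij} = 1$ above.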
Here, $\Psi(h_i, h_j) \overset{\text{def}}{=} \alpha(W_q h_i, W_k h_j)$ is known as the score function (or attention architecture), and provides a score value between the messages $h_i$ and $h_j$ (or, more generally, between a learnable mapping of the messages). From these scores, the (attention) coefficients are obtained by normalizing them, such that $\sum_j \gamma_{ij} = 1$. We can find different attention layers in the literature and, throughout this work, we focus on the original GAT (Velickovic et al., 2018) and its extension GATv2 (Brody et al., 2022):

$$\text{GAT:} \quad \Psi(h_i, h_j) = \mathrm{LeakyReLU}\left(a^\top [W_q h_i \,\|\, W_k h_j]\right), \qquad \text{GATv2:} \quad \Psi(h_i, h_j) = a^\top \mathrm{LeakyReLU}\left(W_q h_i + W_k h_j\right), \quad (5)$$

where the learnable parameters are now the attention vector $a$, and the matrices $W_q$, $W_k$, and $W_v$. Following previous work (Velickovic et al., 2018; Brody et al., 2022), we assume that these matrices are coupled, i.e., $W_q = W_k = W_v$. Note that the difference between the
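The two score functions in Eq. (5) can be sketched as follows. This is an illustrative single-head NumPy implementation with coupled matrices ($W_q = W_k = W$, as assumed above); the helper names (`score_gat`, `attention_coeffs`, etc.) are ours, not from the paper.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def score_gat(a, W, h_i, h_j):
    """GAT: Psi = LeakyReLU(a^T [W h_i || W h_j])."""
    return leaky_relu(a @ np.concatenate([W @ h_i, W @ h_j]))

def score_gatv2(a, W, h_i, h_j):
    """GATv2: Psi = a^T LeakyReLU(W h_i + W h_j)."""
    return a @ leaky_relu(W @ h_i + W @ h_j)

def attention_coeffs(score_fn, a, W, H, neighbors_i, i):
    """Softmax-normalized coefficients gamma_ij over the neighborhood N*_i."""
    scores = np.array([score_fn(a, W, H[i], H[j]) for j in neighbors_i])
    scores -= scores.max()  # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 2))    # 4 nodes with 2-d features
W = rng.normal(size=(3, 2))    # maps features to 3-d
a_gat = rng.normal(size=6)     # GAT: concatenation doubles the dimension
a_v2 = rng.normal(size=3)      # GATv2: same dimension as W h
gamma = attention_coeffs(score_gatv2, a_v2, W, H, [0, 1, 2], 0)
print(gamma.sum())  # coefficients are normalized to sum to 1
```

The key structural difference visible here is the order of operations: GAT applies the nonlinearity after the dot product with $a$, whereas GATv2 applies it before, which is what makes its attention strictly more expressive (Brody et al., 2022).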

