THE GRAPH LEARNING ATTENTION MECHANISM: LEARNABLE SPARSIFICATION WITHOUT HEURISTICS

Abstract

Graph Neural Networks (GNNs) are local aggregators that derive their expressive power from their sensitivity to network structure. However, this sensitivity comes at a cost: noisy edges degrade performance. In response, many GNNs include edge-weighting mechanisms that scale the contribution of each edge in the aggregation step. However, to account for neighborhoods of varying size, node-embedding mechanisms must normalize these edge weights across each neighborhood. As such, the impact of noisy edges cannot be eliminated without removing those edges altogether. Motivated by this issue, we introduce the Graph Learning Attention Mechanism (GLAM): a drop-in, differentiable structure learning layer for GNNs that separates the distinct tasks of structure learning and node embedding. In contrast to existing graph learning approaches, GLAM does not require exogenous structural regularizers or edge-selection heuristics to learn optimal graph structures. In experiments on citation and co-purchase datasets, we demonstrate that our approach can match state-of-the-art semi-supervised node classification accuracies while inducing an order of magnitude greater sparsity than existing graph learning methods.

1. INTRODUCTION

Local interactions govern the properties of nearly all complex systems, from protein folding and cellular proliferation to group dynamics and financial markets (Stocker et al., 1996; Doyle et al., 1997; Mathur, 2006; Özgür, 2011; Jiang et al., 2014). When modeling such systems, representing interactions explicitly in the form of a graph can improve model performance dramatically, at both the local and global level. Graph Neural Networks (GNNs) are designed to operate on such graph-structured data and have quickly become state of the art in a host of structured domains (Wu et al., 2019). However, GNN models rely heavily on the provided graph structures representing meaningful relations, for example, the bonds between atoms in a molecule (Fang et al., 2022). Additionally, to generate useful node embeddings, GNNs employ permutation-invariant neighborhood aggregation functions, which implicitly assume that neighborhoods satisfy certain homogeneity properties (Zhu et al., 2020). If noisy edges are introduced, or if the neighborhood assumptions are not met, GNN performance suffers. To address both issues simultaneously, many GNNs include mechanisms for learning edge weights that scale the influence of neighboring nodes' features in the aggregation step. The Graph Attention Network (GAT) (Veličković et al., 2017), for example, adapts the standard attention mechanism (Vaswani et al., 2017) to the graph setting, learning attention coefficients between adjacent nodes in a graph rather than between tokens in a sequence. As we will show in Section 3, the demands of edge weighting (or structure learning) inherently conflict with those of node embedding, and edge-weighting mechanisms that are coupled to node-embedding mechanisms cannot eliminate the negative impact of noisy edges on their own. In this paper, we introduce a method for separating the distinct tasks of structure learning and node embedding in GNNs.
Our method takes the form of a structure learning layer that can be placed in front of existing GNN layers to learn task-informed graph structures that optimize performance on the downstream task. Our primary contributions are as follows:

1. We introduce a principled framework for considering the inherent conflicts between structure learning and node embedding.

2. Motivated by this framework, we introduce the Graph Learning Attention Mechanism (GLAM): a layer that, when used alongside GNNs, separates the distinct tasks of structure learning and node embedding. In addition to enabling GAT models to meet or exceed state-of-the-art performance on semi-supervised node classification tasks, the GLAM layer induces an order of magnitude greater sparsity than other structure learning methods (Luo et al., 2020; Ye & Ji, 2021; Chen et al., 2020; Franceschi et al., 2019; Shang et al., 2021; Miao et al., 2022). In contrast to existing structure learning methods, GLAM does not employ edge-selection heuristics or exogenous structural regularizers, nor does it otherwise modify the existing loss function to accommodate the structure learning task. This makes it simpler to apply in existing GNN pipelines, as there is no need to modify carefully crafted, domain-specific objective functions. Our approach is also scalable and generalizes to the inductive setting, as it does not rely on optimizing a fixed adjacency matrix.

2. PRELIMINARIES

As our method takes inspiration from the original GAT, we begin by reviewing the mechanism by which the GAT layer generates edge weights, as well as how those edge weights are used to aggregate neighborhood information. Understanding this mechanism is important to understanding both our conceptual framework (Section 3) and the GLAM layer (Section 4).

Graph attention networks learn attention scores $e_{ij}$ for all edges between nodes $i$ and $j$, $j \in N_i$, where $N_i$ is the one-hop neighborhood of node $i$. These attention scores represent the importance of the features on node $j$ to node $i$ and are computed as

$$e_{ij} = \mathrm{LeakyReLU}\left(\vec{a}^{\,T} \left[W_{GAT}\vec{h}_i \,\|\, W_{GAT}\vec{h}_j\right]\right) \quad (1)$$

where $\vec{h}_i \in \mathbb{R}^F$ are node feature vectors, $\|$ denotes vector concatenation, $W_{GAT} \in \mathbb{R}^{F' \times F}$ is a shared linear transformation that maps input features into higher-level representations, and $\vec{a} \in \mathbb{R}^{2F'}$ are the learnable attention weights, which take the form of a single-layer feedforward neural network.

To ensure the attention scores are comparable across neighborhoods of varying size, they are normalized into attention coefficients $\alpha_{ij}$ using a softmax activation:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})} \quad (2)$$

For stability and expressivity, the mechanism is extended to multi-head attention, and the outputs of the $K$ heads in the final layer are aggregated by averaging:

$$\vec{h}'_i = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in N_i} \alpha^k_{ij} W^k_{GAT} \vec{h}_j\right) \quad (3)$$

In the next section, we explain why the normalization procedure in Eq. 2, while crucial for node embedding, is an impediment to structure learning.
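The single-head computation in Eqs. 1 and 2 can be sketched in a few lines of numpy. This is an illustrative re-implementation, not the authors' code; the neighborhood representation (a dict from each node to its neighbor list) and all names are our own.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_attention(h, neighbors, W, a, slope=0.2):
    """Single-head GAT attention coefficients (Eqs. 1-2).

    h:         (N, F)  input node feature vectors h_i
    neighbors: dict mapping node i -> list of neighbor indices (N_i)
    W:         (F', F) shared linear transformation W_GAT
    a:         (2F',)  learnable attention weight vector
    Returns a dict mapping (i, j) -> alpha_ij.
    """
    z = h @ W.T  # transformed features W_GAT h_i, shape (N, F')
    alpha = {}
    for i, nbrs in neighbors.items():
        # raw attention scores e_ij over the neighborhood of i (Eq. 1)
        e = np.array([leaky_relu(a @ np.concatenate([z[i], z[j]]), slope)
                      for j in nbrs])
        # neighborhood-wise softmax normalization (Eq. 2)
        w = np.exp(e - e.max())
        for j, c in zip(nbrs, w / w.sum()):
            alpha[(i, j)] = c
    return alpha
```

Note that, by construction, the coefficients within each neighborhood sum to one regardless of neighborhood size, which is precisely the property Section 3 takes issue with.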

3. CONFLICTING DEMANDS: NODE EMBEDDING VS. STRUCTURE LEARNING

At first glance, it may seem we could address the structure learning problem by simply thresholding the existing GAT attention coefficients α_ij. However, due to the need for neighborhood-wise normalization and permutation-invariant aggregation, this approach falls short. As a motivating example, consider the following: if we add three random edges per node to the standard Cora dataset (McCallum et al., 2000), then train a GAT to perform semi-supervised node classification, we get
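One way to see why thresholding normalized coefficients is problematic: because the α_ij in each neighborhood sum to one, their scale depends on neighborhood size, so no single threshold is meaningful across the whole graph. A minimal numpy sketch (our own illustration, not from the paper):

```python
import numpy as np

def softmax(e):
    w = np.exp(e - e.max())
    return w / w.sum()

# Identical raw scores e_ij, but neighborhoods of different size:
small = softmax(np.zeros(2))   # two-edge neighborhood    -> each alpha = 0.50
large = softmax(np.zeros(20))  # twenty-edge neighborhood -> each alpha = 0.05
# A fixed threshold (say 0.1) would keep every edge in the small
# neighborhood and drop every edge in the large one, even though
# every raw score is identical and no edge is more informative.
```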

