DEFORMABLE GRAPH TRANSFORMER

Abstract

Transformer-based models have recently shown success in representation learning on graph-structured data beyond natural language processing and computer vision. However, this success has been limited to small-scale graphs due to the drawbacks of full dot-product attention on graphs, such as quadratic complexity with respect to the number of nodes and message aggregation from an enormous number of irrelevant nodes. To address these issues, we propose the Deformable Graph Transformer (DGT), which performs sparse attention over dynamically selected relevant nodes to efficiently handle large-scale graphs with linear complexity in the number of nodes. Specifically, our framework first constructs multiple node sequences under various criteria to capture both structural and semantic proximity. Then, combined with our learnable Katz Positional Encodings, sparse attention is applied to the node sequences to learn node representations at a significantly reduced computational cost. Extensive experiments demonstrate that our DGT achieves superior performance on 7 graph benchmark datasets with 2.5∼449 times less computational cost than Transformer-based graph models with full attention.

1. INTRODUCTION

Transformer (Vaswani et al., 2017) has proven its effectiveness in modeling sequential data in various tasks such as natural language understanding (Devlin et al., 2019; Yang et al., 2019; Brown et al., 2020) and speech recognition (Zhang et al., 2020; Gulati et al., 2020). Beyond sequential data, recent works have successfully generalized the Transformer to various computer vision tasks such as image classification (Dosovitskiy et al., 2021; Liu et al., 2021; Yang et al., 2021), object detection (Carion et al., 2020; Zhu et al., 2021; Song et al., 2021), and 3D shape classification (Zhao et al., 2021). Inspired by the success of Transformer-based models, recent efforts have applied the Transformer to graph domains by injecting graph structural information through structural encodings (Ying et al., 2021; Dwivedi & Bresson, 2020; Mialon et al., 2021; Kreuzer et al., 2021), achieving the best performance on various graph-related tasks. However, while they have shown their superiority on small-scale graphs, most existing Transformer-based graph models have difficulty learning representations on large-scale graphs. Since Transformer-based graph models perform self-attention by treating each node as an input token, their computational cost is quadratic in the number of input nodes, which is prohibitive on large-scale graphs. In addition, unlike graph neural networks that aggregate messages from local neighborhoods, Transformer-based graph models aggregate messages globally from numerous nodes. Consequently, on large-scale graphs, the flood of messages from falsely correlated nodes often overwhelms the information from relevant nodes. As a result, Transformer-based graph models often exhibit poor generalization performance.
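The quadratic-versus-linear cost gap described above can be made concrete with a back-of-the-envelope count of query-key dot products. The node count N, head dimension d, and number of sampled keys K below are illustrative values, not figures from the paper:

```python
# Rough multiply-add count for the attention score computation alone
# (projections and softmax ignored).
# Full self-attention compares every query with every key: O(N^2 * d).
# Sparse attention with K sampled keys per query costs only O(N * K * d).
N = 100_000   # nodes in a large graph (illustrative)
d = 64        # attention head dimension (illustrative)
K = 32        # sampled key-value pairs per query (illustrative)

full_cost = N * N * d        # query-key dot products, full attention
sparse_cost = N * K * d      # query-key dot products, sparse attention

print(f"full:   {full_cost:.2e} multiply-adds")
print(f"sparse: {sparse_cost:.2e} multiply-adds")
print(f"ratio:  {full_cost / sparse_cost:.0f}x")  # ratio is simply N / K
```

Since the ratio is N/K, the saving grows linearly with graph size, which is why sparse key selection matters most on large-scale graphs.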
A simple method to address these issues is masked attention, where the key-value pairs are restricted to the neighborhoods of each query node (Dwivedi & Bresson, 2020). However, since masked attention has a fixed, small receptive field, it struggles to learn representations on large-scale graphs that require a large receptive field. In this paper, we propose a novel Transformer for graphs, named Deformable Graph Transformer (DGT), that performs sparse attention with a small set of key-value pairs adaptively selected by considering both semantic and structural proximity. Specifically, our approach first generates multiple node sequences for each query node under diverse sorting criteria such as the Personalized PageRank (PPR) score, BFS order, and feature similarity. Then, our Deformable Graph Attention (DGA), the key module of DGT, dynamically adjusts offsets to choose key-value pairs on the generated node sequences and learns representations from the selected pairs. In addition, we present simple and effective positional encodings to capture structural information. Motivated by the Katz index (Katz, 1953), a measure of connectivity between nodes, we design Katz Positional Encoding (Katz PE) to incorporate structural similarity and distance between nodes on a graph into the attention. Our extensive experiments show that DGT achieves the best performance on 7 benchmark datasets and outperforms existing Transformer-based graph models on all 8 datasets at a significantly reduced computational cost. Our contributions are as follows: (1) We propose Deformable Graph Transformer (DGT), which performs sparse attention with a reduced number of keys and values for learning node representations, significantly improving the scalability and expressive power of Transformer-based graph models.
(2) We design a deformable attention mechanism for graph-structured data, Deformable Graph Attention (DGA), that flexibly attends to a small set of relevant nodes based on various types of proximity between nodes. (3) We present learnable positional encodings, named Katz PE, that improve the expressive power of Transformer-based graph models by incorporating structural similarity and distance between nodes based on the Katz index (Katz, 1953). (4) We validate the effectiveness of the Deformable Graph Transformer with extensive experimental results showing that DGT achieves the best performance on 7 graph benchmark datasets with 2.5∼449 times less computational cost than Transformer-based graph models with full attention.
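The pipeline behind contributions (1) and (2) can be illustrated with a minimal sketch: sort nodes into a sequence by PPR relevance to a query node, sample K key-value pairs at continuous offsets along that sequence via linear interpolation, and attend over only those K pairs. The function names, the fixed offsets, the power-iteration PPR, and the single-sequence, single-head setup are our assumptions for exposition; the actual DGA learns the offsets from the query features and combines multiple node sequences:

```python
import numpy as np

def ppr_scores(adj, query, alpha=0.15, iters=50):
    """Personalized PageRank w.r.t. one query node, by power iteration."""
    n = adj.shape[0]
    P = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1)  # row-stochastic
    e = np.zeros(n); e[query] = 1.0        # restart distribution at the query
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * pi @ P
    return pi

def deformable_attention(adj, feats, query, offsets, w_q, w_k, w_v):
    """Attend to K nodes sampled at continuous offsets along a PPR-sorted
    node sequence; linear interpolation between adjacent sequence positions
    keeps the sampling differentiable w.r.t. the offsets."""
    order = np.argsort(-ppr_scores(adj, query))   # node sequence for this query
    seq = feats[order]                            # (n, d) features in PPR order
    lo = np.floor(offsets).astype(int)            # interpolate seq at offsets
    hi = np.minimum(lo + 1, len(seq) - 1)
    frac = (offsets - lo)[:, None]
    keys = (1 - frac) * seq[lo] + frac * seq[hi]  # (K, d) sampled features
    q = feats[query] @ w_q                        # scaled dot-product attention
    k, v = keys @ w_k, keys @ w_v                 # over only K keys and values
    logits = k @ q / np.sqrt(q.shape[0])
    attn = np.exp(logits - logits.max()); attn /= attn.sum()
    return attn @ v                               # (d,) output for the query

rng = np.random.default_rng(0)
n, d = 12, 8
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0               # toy path graph 0-1-...-11
X = rng.standard_normal((n, d))
offsets = np.array([0.0, 1.5, 3.25, 6.0])        # stand-in for learned offsets
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = deformable_attention(A, X, query=0, offsets=offsets, w_q=Wq, w_k=Wk, w_v=Wv)
print(out.shape)  # (8,)
```

Note that the cost per query is O(K·d) rather than O(n·d): the sequence ordering is computed once, and attention touches only the K sampled positions.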

2. RELATED WORKS

Graph Neural Networks. Graph Neural Networks have become the de facto standard approach for various graph-related tasks (Kipf & Welling, 2017; Hamilton et al., 2017; Wu et al., 2019; Xu et al., 2018; Gilmer et al., 2017). Motivated by the success of attention, several works apply attention mechanisms to graph neural networks (Rong et al., 2020; Veličković et al., 2018; Brody et al., 2022; Kim & Oh, 2021). GAT (Veličković et al., 2018) and GATv2 (Brody et al., 2022) adaptively aggregate messages from neighborhoods with an attention scheme. However, these works often show poor performance on heterophilic graphs due to their homophily assumption that nodes within a small neighborhood have similar attributes and potentially the same labels. Recent works (Abu-El-Haija et al., 2019; Pei et al., 2020; Zhu et al., 2020; Park et al., 2022) have therefore been proposed to extend message aggregation beyond a few-hop neighborhood to cope with both homophilic and heterophilic graphs. H2GCN (Zhu et al., 2020) separates input features from aggregated features to preserve the information in the input features. Deformable GCN (Park et al., 2022) improves the flexibility of convolution by performing deformable convolution.

Transformer-based Graph Models. Recently, several works (Ying et al., 2021; Dwivedi & Bresson, 2020; Mialon et al., 2021; Kreuzer et al., 2021; Wu et al., 2021) have adopted the Transformer architecture for learning on graphs. Graphormer (Ying et al., 2021) and GT (Dwivedi & Bresson, 2020) build upon the standard Transformer architecture by incorporating structural information of graphs into the dot-product self-attention. However, these approaches, which we will refer to as 'graph Transformers' for brevity, are not suitable for large-scale graphs.
This is because referencing numerous key nodes for each query node is prohibitively costly, and the noisy features from irrelevant nodes hinder the attention module from learning a proper function. Although restricting the attention scope to local neighbors is a simple remedy that reduces the computational complexity, it fails to capture long-range dependencies, which are crucial for large-scale or heterophilic graphs. To mitigate the shortcomings of existing Transformer-based graph models, we propose DGT, equipped with deformable sparse attention that dynamically selects relevant nodes, to efficiently learn powerful representations on both homophilic and heterophilic graphs with significantly improved scalability.

Sparse Transformers in Other Domains. Transformer (Vaswani et al., 2017) and its variants have achieved performance improvements in various domains such as natural language processing (Devlin et al., 2019; Brown et al., 2020) and computer vision (Dosovitskiy et al., 2021; Carion et al., 2020). However, these models require quadratic space and time complexity, which is especially problematic with long input sequences. Recent works (Choromanski et al., 2021; Jaegle et al., 2021; Kitaev et al., 2020) have studied this issue and proposed various efficient Transformer architectures. (Choromanski et al., 2021; Xiong et al., 2021) study the low-rank approximation for attention to

