ARE MORE LAYERS BENEFICIAL TO GRAPH TRANSFORMERS?

Abstract

Although going deep has proven successful in many neural architectures, existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers suffer from a bottleneck when improving performance by increasing depth. Our further analysis reveals that deep graph transformers are limited by the vanishing capacity of global attention, which restricts the graph transformer from focusing on critical substructures and obtaining expressive features. To this end, we propose a novel graph transformer model named DeepGraph that explicitly employs substructure tokens in the encoded representation and applies local attention on related nodes to obtain substructure-based attention encoding. Our model enhances the ability of global attention to focus on substructures and promotes the expressiveness of the representations, addressing the limitation of self-attention as the graph transformer deepens. Experiments show that our method unblocks the depth limitation of graph transformers and achieves state-of-the-art performance on various graph benchmarks with deeper models.

1. INTRODUCTION

Transformers have recently gained rapid attention in modeling graph-structured data (Zhang et al., 2020; Dwivedi & Bresson, 2020; Maziarka et al., 2020; Ying et al., 2021; Chen et al., 2022). Compared to graph neural networks, graph transformers employ a global attention mechanism that enables information passing between all nodes, which is advantageous for learning long-range dependencies in graph structures (Alon & Yahav, 2020). In a transformer, graph structure information can be flexibly encoded into node features (Kreuzer et al., 2021) or attention (Ying et al., 2021) by a variety of methods with strong expressiveness, avoiding the inherent limitations of encoding paradigms that pass information along graph edges. Global attention (Bahdanau et al., 2015) also enables an explicit focus on essential parts among the nodes to model crucial substructures in the graph.

However, graph transformers in current studies are usually shallow, i.e., fewer than 12 layers. Scaling depth has been proven to increase the capacity of neural networks exponentially (Poole et al., 2016), and empirically improves transformer performance in natural language processing (Liu et al., 2020a; Bachlechner et al., 2021) and computer vision (Zhou et al., 2021). Graph neural networks also benefit from greater depth when properly designed (Chen et al., 2020a; Liu et al., 2020b; Li et al., 2021). However, it is still not clear whether the capability of graph transformers on graph tasks can be strengthened by increasing model depth. We therefore conduct experiments and find that current graph transformers encounter a bottleneck when improving performance by increasing depth: further deepening hurts performance once the model exceeds 12 layers, which appears to be the upper limit of current graph transformer depth, as Figure 1 (left) shows. In this work, we aim to answer why more self-attention layers become a disadvantage for graph transformers, and how to address these issues with a proper model design.
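To make the global-attention paradigm described above concrete, the following is a minimal NumPy sketch, not any specific model's implementation, of single-head attention over all graph nodes with a structural bias added to the attention logits, in the spirit of distance-based attention encodings such as Graphormer's (Ying et al., 2021). The function name and the scalar `bias_per_hop` (standing in for a learnable per-distance bias) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(h, dist, bias_per_hop=-1.0):
    """Single-head global self-attention over all graph nodes.

    h:    (n, d) node features.
    dist: (n, n) pairwise shortest-path hop counts.
    A bias proportional to graph distance is added to the logits, so
    every node attends to every other node while the attention is
    still informed of the graph structure.
    """
    n, d = h.shape
    scores = h @ h.T / np.sqrt(d)          # queries/keys share weights for brevity
    scores = scores + bias_per_hop * dist  # structural bias on the logits
    attn = softmax(scores, axis=-1)        # each row sums to 1
    return attn @ h                        # (n, d) updated node features
```

Because attention is computed over all node pairs rather than along edges, distant nodes exchange information in a single layer, which is the long-range advantage noted above.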
Self-attention (Bahdanau et al., 2015) makes a leap in model capacity by dynamically concentrating on critical parts (Chen et al., 2022), i.e., substructures of a graph, and obtaining particular features. Substructures are basic intrinsic features of graph data, widely used in data analysis (Yu et al., 2020) and machine learning (Shervashidze et al., 2009), as well as in graph model interpretability (Miao et al., 2022). Although the self-attention module appears to be very beneficial for automatically learning important substructure features in the graph, our analysis indicates that this ability vanishes as depth grows, restricting deeper graph transformers from learning useful structure features. Specifically, we focus on the influence of attention on different substructures, which we find decreases after each self-attention layer. In consequence, it is difficult for deep models to autonomously learn effective attention patterns over substructures and obtain expressive graph substructure features.

We further propose a graph transformer model named DeepGraph with a simple but effective method to enhance the substructure encoding ability of deeper graph transformers. The proposed model explicitly introduces a local attention mechanism on substructures by employing additional substructure tokens in the model representation and applying local attention to nodes related to those substructures. Our method not only introduces substructure-based attention to encourage the model to focus on substructure features, but also enlarges the attention capacity both theoretically and empirically, which improves the expressiveness of representations learned on substructures.

In summary, our contributions are as follows:

• We present the bottleneck of graph transformers' performance as depth increases, illustrating the depth limitation of current graph transformers. We study this bottleneck from the perspective of attention capacity decaying with layers, both theoretically and empirically, and demonstrate the difficulty for deep models to learn effective attention patterns over substructures and obtain informative graph substructure features.
• Based on the above finding, we propose a simple yet effective local attention mechanism based on substructure tokens, promoting focus on local substructure features in deeper graph transformers and improving the expressiveness of learned representations.

• Experiments show that our method unblocks the depth limitation of graph transformers and achieves state-of-the-art results on standard graph benchmarks with deeper models.
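The substructure-token idea can be pictured as an attention mask over an extended token sequence. The sketch below is our own illustrative simplification, not DeepGraph's actual implementation; the function name and its arguments are hypothetical. Each substructure (e.g., a sampled ring or path) gets an extra token appended after the node tokens, and that token exchanges attention only with its member nodes, while node tokens keep full global attention among themselves.

```python
import numpy as np

def substructure_attention_mask(n_nodes, substructures):
    """Boolean attention mask for node tokens plus substructure tokens.

    n_nodes:       number of node tokens.
    substructures: list of node-index lists, one per substructure token.
    Returns a (n_nodes + k, n_nodes + k) mask where True means
    "may attend".
    """
    n_sub = len(substructures)
    size = n_nodes + n_sub
    mask = np.zeros((size, size), dtype=bool)
    mask[:n_nodes, :n_nodes] = True            # global attention among nodes
    for k, nodes in enumerate(substructures):
        t = n_nodes + k                        # index of this substructure token
        mask[t, t] = True                      # token attends to itself
        for v in nodes:
            mask[t, v] = True                  # local attention: token -> members
            mask[v, t] = True                  # and members -> token
    return mask
```

Such a mask would typically be supplied to a standard attention layer (e.g., as an additive mask of zeros and negative infinities on the logits), so the substructure tokens aggregate features only from their local neighborhoods.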

2. RELATED WORK

Graph transformers The transformer with self-attention has been the mainstream method in natural language processing (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019), and has also proven competitive for images in computer vision (Dosovitskiy et al., 2020). Pure transformers lack relational information between tokens and need position encoding to convey structure. Recent works apply transformers to graph tasks by designing a variety of structure encoding techniques. Some works embed structure information into graph nodes via methods such as Laplacian vectors, random walks, or other features (Zhang et al., 2020; Dwivedi & Bresson, 2020; Kreuzer et al., 2021; Kim et al., 2022; Wu et al., 2021). Other works introduce structure information into attention via graph distance, path embedding, or features encoded by a GNN (Park et al., 2022; Maziarka et al., 2020; Ying et al., 2021; Chen et al., 2022; Mialon et al., 2021; Choromanski et al., 2022). Still other works use a transformer as a module of the whole model (Bastos et al., 2022; Guo et al., 2022).



Figure 1: Left: Performance on the ZINC dataset of different graph transformers at varying depths. Our DeepGraph successfully scales up the depth, while the baselines cannot. (Lower is better.) Right: Layer attention capacity on substructures with depth.

