ARE MORE LAYERS BENEFICIAL TO GRAPH TRANSFORMERS?

Abstract

Although going deep has proven successful in many neural architectures, existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers suffer from a bottleneck when improving performance by increasing depth. Our further analysis reveals that deep graph transformers are limited by the vanishing capacity of global attention, which restricts the model from focusing on critical substructures and obtaining expressive features. To address this limitation, we propose a novel graph transformer model named DeepGraph that explicitly employs substructure tokens in the encoded representation and applies local attention to related nodes to obtain substructure-based attention encodings. Our model enhances the ability of global attention to focus on substructures and promotes the expressiveness of the representations, addressing the limitation of self-attention as the graph transformer deepens. Experiments show that our method removes the depth bottleneck of graph transformers and achieves state-of-the-art performance across various graph benchmarks with deeper models.
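To make the substructure-token and local-attention idea more concrete, below is a minimal, hypothetical PyTorch sketch (not the released DeepGraph implementation): it appends one token per sampled substructure to the node tokens and builds a boolean mask so that each substructure token attends only to its member nodes, while ordinary node tokens keep global attention. The helper name `build_tokens_and_mask`, the mean-pooled token initialization, and the example substructures are illustrative assumptions.

```python
import torch
import torch.nn as nn

def build_tokens_and_mask(x, substructures):
    """Append one token per substructure and build a boolean attention mask.

    x: [num_nodes, dim] node features.
    substructures: list of lists of node indices (e.g. sampled subgraphs).
    Returns (tokens [num_nodes + num_subs, dim], mask [T, T]) where True
    entries are *blocked*, following the nn.MultiheadAttention convention.
    """
    num_nodes, dim = x.shape
    num_subs = len(substructures)
    total = num_nodes + num_subs

    # Substructure tokens initialized as the mean of their member nodes (an assumption).
    sub_tokens = torch.stack([x[idx].mean(dim=0) for idx in substructures])
    tokens = torch.cat([x, sub_tokens], dim=0)

    # Node tokens keep global attention (their rows stay unmasked);
    # each substructure token attends only to its member nodes and itself.
    mask = torch.zeros(total, total, dtype=torch.bool)
    for i, idx in enumerate(substructures):
        row = num_nodes + i
        mask[row, :] = True                       # block everything ...
        mask[row, torch.tensor(idx)] = False      # ... except member nodes
        mask[row, row] = False                    # ... and the token itself
    return tokens, mask


# Usage example with random features and two hypothetical substructures.
dim, heads = 16, 4
x = torch.randn(6, dim)                           # 6 nodes
substructures = [[0, 1, 2], [3, 4, 5]]            # e.g. two sampled subgraphs

tokens, mask = build_tokens_and_mask(x, substructures)
attn = nn.MultiheadAttention(dim, heads, batch_first=True)
out, _ = attn(tokens.unsqueeze(0), tokens.unsqueeze(0), tokens.unsqueeze(0),
              attn_mask=mask)
print(out.shape)                                  # torch.Size([1, 8, 16])
```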

1. INTRODUCTION

Transformers have recently attracted considerable attention for modeling graph-structured data (Zhang et al., 2020; Dwivedi & Bresson, 2020; Maziarka et al., 2020; Ying et al., 2021; Chen et al., 2022). Compared to graph neural networks, graph transformers employ a global attention mechanism that enables information passing between all nodes, which is advantageous for learning long-range dependencies in graph structures (Alon & Yahav, 2020). In transformers, graph structure information can be flexibly encoded into node features (Kreuzer et al., 2021) or attention (Ying et al., 2021) by a variety of methods with strong expressiveness, avoiding the inherent limitations of encoding paradigms that pass information along graph edges. Global attention (Bahdanau et al., 2015) also enables an explicit focus on essential nodes to model crucial substructures in the graph.

Graph transformers in current studies are usually shallow, i.e., fewer than 12 layers. Scaling depth has been shown to increase the capacity of neural networks exponentially (Poole et al., 2016) and to empirically improve transformer performance in natural language processing (Liu et al., 2020a; Bachlechner et al., 2021) and computer vision (Zhou et al., 2021). Graph neural networks also benefit from more depth when properly designed (Chen et al., 2020a; Liu et al., 2020b; Li et al., 2021). However, it remains unclear whether the capability of graph transformers on graph tasks can be strengthened by increasing model depth. We therefore conduct experiments and find that current graph transformers encounter a bottleneck when improving performance by increasing depth: further deepening hurts performance once the model exceeds 12 layers, which appears to be the upper limit of current graph transformer depth, as Figure 1 (left) shows.

In this work, we aim to answer why more self-attention layers become a disadvantage for graph transformers, and how to address this issue with proper model design. Self-attention (Bahdanau et al., 2015) makes a leap in model capacity by dynamically concentrating on critical parts

