GRAPH MLP-MIXER

Abstract

Graph Neural Networks (GNNs) have shown great potential in the field of graph representation learning. Standard GNNs define a local message-passing mechanism which propagates information over the whole graph domain by stacking multiple layers. This paradigm suffers from two major limitations, over-squashing and poor long-range dependencies, which can be solved using global attention but only at the cost of quadratic complexity. In this work, we consider an alternative approach to overcome these structural limitations while keeping a low complexity cost. Motivated by the recent MLP-Mixer architecture introduced in computer vision, we propose to generalize this network to graphs. The resulting model, namely Graph MLP-Mixer, can capture long-range dependencies without over-squashing or high complexity, thanks to the mixer layer applied to graph patches extracted from the original graph. As a result, this architecture exhibits promising results when comparing standard GNNs against Graph MLP-Mixers on benchmark graph datasets.

In this section, we review the main classes of GNNs with their advantages and limitations. Then, we introduce the ViT/MLP-Mixer architectures from computer vision which have motivated us to design a new graph network architecture.



Weisfeiler-Leman GNNs (WL-GNNs). One of the major limitations of MP-GNNs is their inability to distinguish (simple) non-isomorphic graphs. This limited expressivity can be formally analyzed with the Weisfeiler-Leman graph isomorphism test (Weisfeiler & Leman, 1968), as first proposed in Xu et al. (2019); Morris et al. (2019). Later on, Maron et al. (2018) introduced a general class of k-order WL-GNNs that can be proved to universally represent any class of k-WL graphs (Maron et al., 2019; Chen et al., 2019). But to achieve such expressivity, this class of GNNs requires using k-tuples of nodes, with memory and speed complexities of O(N^k), where N is the number of nodes and k ≥ 3. Although the complexity can be reduced to O(N^2) and O(N^3) respectively (Maron et al., 2019; Chen et al., 2019; Azizian & Lelarge, 2020), it is still computationally costly compared to the linear complexity O(E) of MP-GNNs, which often reduces to O(N) for real-world graphs that exhibit sparse structures such as molecules, knowledge graphs, transportation networks and gene regulatory networks, to name a few. In order to reduce the memory and speed complexities of WL-GNNs while keeping high expressivity, several works have focused on designing graph networks from their sub-structures, such as sub-graph isomorphism (Bouritsas et al., 2022), sub-graph routing mechanisms (Alsentzer et al., 2020), cellular WL sub-graphs (Bodnar et al., 2021), expressive sub-graphs (Bevilacqua et al., 2021; Frasca et al., 2022), rooted sub-graphs (Zhang & Li, 2021) and k-hop egonet sub-graphs (Zhao et al., 2021a).

Graph Positional Encoding (PE). Another aspect of the limited expressivity of GNNs is their inability to recognize simple graph structures such as cycles or cliques, which are often present in molecules and social graphs (Chen et al., 2020). One can resort to k-order WL-GNNs with k equal to the cycle/clique length, but at the high complexity O(N^k).
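To make the 1-WL limitation concrete, the following is a minimal sketch (not any paper's reference implementation) of 1-WL color refinement, applied to a classic failure case involving cycles: a 6-cycle and two disjoint triangles are both 2-regular, so the test assigns them identical color histograms even though they are not isomorphic.

```python
import numpy as np

def wl_colors(adj, rounds=3):
    """1-WL color refinement: repeatedly relabel each node by its own
    color together with the multiset of its neighbours' colors, then
    return the sorted color histogram of the graph."""
    n = adj.shape[0]
    colors = [0] * n  # uniform initial coloring
    for _ in range(rounds):
        signatures = [
            (colors[i], tuple(sorted(colors[j] for j in range(n) if adj[i, j])))
            for i in range(n)
        ]
        # compress signatures into small integer color ids
        table = {s: k for k, s in enumerate(sorted(set(signatures)))}
        colors = [table[s] for s in signatures]
    return sorted(colors)

# Two non-isomorphic 6-node graphs: a 6-cycle vs. two disjoint triangles.
cycle6 = np.zeros((6, 6), dtype=int)
for i in range(6):
    cycle6[i, (i + 1) % 6] = cycle6[(i + 1) % 6, i] = 1
two_triangles = np.zeros((6, 6), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]:
    two_triangles[a, b] = two_triangles[b, a] = 1

print(wl_colors(cycle6) == wl_colors(two_triangles))  # → True: 1-WL cannot tell them apart
```

Any MP-GNN whose expressivity is bounded by 1-WL produces the same embedding for both graphs, which is exactly the kind of failure that positional encodings aim to fix.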
An alternative approach is to add positional encoding to the graph nodes. It was proved in Murphy et al. (2019); Loukas (2020) that unique and equivariant PE increases the representation power of any MP-GNN while keeping the linear complexity. This theoretical result was applied with great empirical success by Murphy et al. (2019) with index PE, Dwivedi et al. (2020); Dwivedi & Bresson (2021); Kreuzer et al. (2021); Lim et al. (2022) with Laplacian eigenvectors, and Li et al. (2020a); Dwivedi et al. (2021) with k-step Random Walk. All these graph PEs lead to GNNs strictly more powerful than the 1-WL test, which seems to be enough expressivity in practice (Zopf, 2022). However, none of the PEs proposed for graphs can provide a global position of the nodes that is unique, equivariant and distance sensitive. This is due to the fact that a canonical positioning of nodes does not exist for arbitrary graphs, as there is no notion of up, down, left and right on graphs. For example, any embedding coordinate system like graph Laplacian eigenvectors (Belkin & Niyogi, 2003) can flip the up-down and left-right directions and would still be a valid PE. This introduces ambiguities that the GNNs must (learn to) be invariant to with respect to the graph or PE symmetries. A well-known example is given by the eigenvectors: there exist 2^k possible sign flips for k eigenvectors, all of which must be learned by the network.

Over-Squashing. Standard MP-GNNs require L layers to propagate information from one node to its L-hop neighborhood. This implies that the receptive field of GNNs can grow exponentially, for example as O(2^L) for binary tree graphs. This causes over-squashing: information from the exponentially-growing receptive field is compressed into a fixed-length vector by the aggregation mechanism (Alon & Yahav, 2020; Topping et al., 2022). Consequences of over-squashing are overfitting and poor long-range node interactions, as relevant information cannot travel without being disturbed.
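The sign ambiguity of Laplacian-eigenvector PEs described above can be seen in a few lines of numpy. The sketch below (an illustration, not any model's implementation) computes the first non-trivial Laplacian eigenvectors of a small path graph; flipping the sign of any column yields an equally valid PE, so a network consuming k such eigenvectors faces 2^k equivalent inputs.

```python
import numpy as np

def lap_pe(adj, k=2):
    """First k non-trivial eigenvectors of the unnormalized graph
    Laplacian L = D - A, used as node positional encodings."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    eigvals, eigvecs = np.linalg.eigh(lap)  # ascending eigenvalues
    return eigvecs[:, 1:k + 1]  # skip the constant (zero-eigenvalue) vector

# 5-node path graph: 0 - 1 - 2 - 3 - 4
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1

pe = lap_pe(adj, k=2)
# Each eigenvector is only defined up to sign: negating a column gives
# another valid PE for the same graph, one of 2^k = 4 equivalent choices.
flipped = pe * np.array([-1.0, 1.0])
```

Both `pe` and `flipped` satisfy the same eigenvector equations, which is exactly the ambiguity the network must learn to be invariant to.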
Over-squashing has been well-known since recurrent neural networks (Hochreiter & Schmidhuber, 1997), which led to the development of the (self- and cross-)attention mechanisms, first for the translation task (Bahdanau et al., 2014; Vaswani et al., 2017) and then for more general natural language processing (NLP) tasks (Devlin et al., 2018; Brown et al., 2020). Transformer architectures are the most elaborate networks that leverage attention. Attention is a simple but powerful mechanism that solves over-squashing and long-range dependencies by making "everything connected to everything", but it also trades linear complexity for quadratic complexity. Inspired by the great successes of Transformers in NLP and computer vision (CV), several works have proposed to generalize the Transformer architecture to graphs, achieving competitive or superior performance against standard MP-GNNs. We highlight the most recent research works in the next paragraph.

Graph Transformers. Graph Transformer (Dwivedi & Bresson, 2021) generalizes Transformers to graphs, with graph Laplacian eigenvectors as node PE, and incorporates the graph structure into the permutation-invariant attention function. SAN and LSPE (Kreuzer et al., 2021; Dwivedi et al., 2021) further improve with PEs learned from Laplacian and random walk operators. GraphiT (Mialon et al., 2021) encodes relative PEs derived from diffusion kernels into the attention mechanism. GraphTrans (Wu et al., 2021b) and SAT (Chen et al., 2022) add Transformers on top of standard GNN layers. Graphormer (Ying et al., 2021) introduces three structural encodings, with great success on large molecular benchmarks. GPS (Rampášek et al., 2022) categorizes the different types of PE and puts forward a hybrid MPNN+Transformer architecture. We refer to Min et al. (2022) for an overview of graph-structured Transformers.
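The quadratic cost shared by these architectures can be seen in a generic numpy sketch of full self-attention over node features (a simplified illustration, not any particular model's implementation; weight shapes and the random initialization are placeholders): the score matrix is N × N no matter how sparse the underlying graph is.

```python
import numpy as np

def dense_node_attention(h, seed=0):
    """Single-head self-attention over node features h (N x d):
    every node attends to every node, so memory and time are O(N^2)
    regardless of the number of graph edges."""
    n, d = h.shape
    rng = np.random.default_rng(seed)
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(d)  # N x N score matrix: the bottleneck
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

h = np.random.default_rng(1).standard_normal((8, 4))  # 8 nodes, 4 features
out = dense_node_attention(h)
# the score matrix holds 8 * 8 = 64 entries even if the graph has only 7 edges
```

This is the trade-off the next paragraph refers to: long-range interactions without over-squashing, at O(N^2) instead of O(E).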
Generally, most Graph Transformer architectures address the problems of over-squashing and limited long-range dependencies in GNNs, but they also significantly increase the complexity from O(E) to O(N^2), resulting in a computational bottleneck.

ViT and MLP-Mixer. Transformers have gained remarkable success in CV and NLP, most notably with architectures like ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018). The success of Transformers has long been attributed to the attention mechanism (Vaswani et al., 2017), which is able to model long-range dependencies as it does not suffer from over-squashing. But recently, this prominent line of networks has been challenged by more cost-efficient alternatives. A novel family of models based on the MLP-Mixer introduced by Tolstikhin et al. (2021) has emerged and gained recognition for its simplicity and its efficient implementation. Overall, MLP-Mixer replaces the attention module with multi-layer perceptrons (MLPs), which are also not affected by over-squashing.
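The core idea of the mixer layer can be sketched in a few lines of numpy (a simplified illustration of Tolstikhin et al. (2021); LayerNorm is omitted and all weights are random placeholders): token mixing applies an MLP across patches via a transpose, letting every patch exchange information with every other patch without attention, and channel mixing applies an MLP within each patch.

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU, applied row-wise."""
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_layer(x, tok_w1, tok_w2, ch_w1, ch_w2):
    """One simplified MLP-Mixer layer on a patch table x
    (num_patches x channels), with residual connections."""
    x = x + mlp(x.T, tok_w1, tok_w2).T  # token mixing: across patches
    x = x + mlp(x, ch_w1, ch_w2)        # channel mixing: within each patch
    return x

rng = np.random.default_rng(0)
p, c, hidden = 6, 4, 8  # 6 patches, 4 channels, hidden width 8
x = rng.standard_normal((p, c))
out = mixer_layer(
    x,
    rng.standard_normal((p, hidden)), rng.standard_normal((hidden, p)),
    rng.standard_normal((c, hidden)), rng.standard_normal((hidden, c)),
)
```

Because the token-mixing MLP connects all patches in one step, long-range interactions need neither deep stacking nor an N × N attention matrix, which is the property Graph MLP-Mixer inherits by applying such layers to graph patches.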




