GRAPH MLP-MIXER

Abstract

Graph Neural Networks (GNNs) have shown great potential in the field of graph representation learning. Standard GNNs define a local message-passing mechanism which propagates information over the whole graph domain by stacking multiple layers. This paradigm suffers from two major limitations, over-squashing and poor long-range dependencies, which can be solved with global attention but at the price of quadratic computational complexity. In this work, we consider an alternative approach to overcome these structural limitations while keeping a low complexity cost. Motivated by the recent MLP-Mixer architecture introduced in computer vision, we propose to generalize this network to graphs. This GNN model, namely Graph MLP-Mixer, can make long-range connections without over-squashing or high complexity, thanks to the mixer layer applied to the graph patches extracted from the original graph. As a result, this architecture exhibits promising results when comparing standard GNNs vs. Graph MLP-Mixers on benchmark graph datasets.

In this section, we review the main classes of GNNs with their advantages and limitations. Then, we introduce the ViT/MLP-Mixer architectures from computer vision which have motivated us to design a new graph network architecture.



Weisfeiler-Leman GNNs (WL-GNNs). One of the major limitations of MP-GNNs is their inability to distinguish (simple) non-isomorphic graphs. This limited expressivity can be formally analyzed with the Weisfeiler-Leman graph isomorphism test (Weisfeiler & Leman, 1968), as first proposed in Xu et al. (2019); Morris et al. (2019). Later on, Maron et al. (2018) introduced a general class of k-order WL-GNNs that can be proved to universally represent any class of k-WL graphs (Maron et al., 2019; Chen et al., 2019). But to achieve such expressivity, this class of GNNs requires using k-tuples of nodes with memory and speed complexities of O(N^k), with N being the number of nodes and k ≥ 3. Although the complexity can be reduced to O(N^2) and O(N^3) respectively (Maron et al., 2019; Chen et al., 2019; Azizian & Lelarge, 2020), it is still computationally costly compared to the linear complexity O(E) of MP-GNNs, which often reduces to O(N) for real-world graphs that exhibit sparse structures s.a. molecules, knowledge graphs, transportation networks and gene regulatory networks, to name a few. In order to reduce the memory and speed complexities of WL-GNNs while keeping high expressivity, several works have focused on designing graph networks from their sub-structures s.a. sub-graph isomorphism (Bouritsas et al., 2022), sub-graph routing mechanisms (Alsentzer et al., 2020), cellular WL sub-graphs (Bodnar et al., 2021), expressive sub-graphs (Bevilacqua et al., 2021; Frasca et al., 2022), rooted sub-graphs (Zhang & Li, 2021) and k-hop egonet sub-graphs (Zhao et al., 2021a).

Graph Positional Encoding (PE). Another aspect of the limited expressivity of GNNs is their inability to recognize simple graph structures s.a. cycles or cliques, which are often present in molecules and social graphs (Chen et al., 2020). We can consider k-order WL-GNNs with value k equal to the length of the cycle/clique, but with high complexity O(N^k).
An alternative approach is to add positional encoding (PE) to the graph nodes. It was proved in Murphy et al. (2019); Loukas (2020) that unique and equivariant PE increases the representation power of any MP-GNN while keeping the linear complexity. This theoretical result was applied with great empirical success by Murphy et al. (2019) with index PE, Dwivedi et al. (2020); Dwivedi & Bresson (2021); Kreuzer et al. (2021); Lim et al. (2022) with Laplacian eigenvectors, and Li et al. (2020a); Dwivedi et al. (2021) with k-step random walks. All these graph PEs lead to GNNs strictly more powerful than the 1-WL test, which seems to be enough expressivity in practice (Zopf, 2022). However, none of the PEs proposed for graphs can provide a global position of the nodes that is unique, equivariant and distance-sensitive. This is due to the fact that a canonical positioning of nodes does not exist for arbitrary graphs, as there is no notion of up, down, left and right on graphs. For example, any embedding coordinate system like graph Laplacian eigenvectors (Belkin & Niyogi, 2003) can flip up-down or left-right directions and still be a valid PE. This introduces ambiguities for GNNs, which are required to (learn to) be invariant with respect to the graph or PE symmetries. A well-known example is given by the eigenvectors: there exist 2^k possible sign flips for k eigenvectors, all of which must be handled by the network.

Over-Squashing. Standard MP-GNNs require L layers to propagate information from one node to its L-hop neighborhood. This implies that the receptive field size for GNNs can grow exponentially, for example as O(2^L) for binary tree graphs. This causes over-squashing: information from the exponentially-growing receptive field is compressed into a fixed-length vector by the aggregation mechanism (Alon & Yahav, 2020; Topping et al., 2022).
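Returning to the PE discussion above, the sign ambiguity of Laplacian eigenvectors can be checked in a few lines. This is a minimal numpy sketch on an assumed 4-cycle graph, not code from the paper:

```python
import numpy as np

# 4-cycle graph 0-1-2-3-0: both v and -v are equally valid eigenvectors,
# so a GNN using Laplacian PE must (learn to) be invariant to sign flips.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A                  # combinatorial graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues sorted ascending
v = eigvecs[:, 1]                          # first non-trivial eigenvector

# Both v and -v satisfy L v = lambda v, hence both are valid PEs.
assert np.allclose(L @ v, eigvals[1] * v)
assert np.allclose(L @ (-v), eigvals[1] * (-v))
```

With k eigenvectors used as PE, each column carries this independent ambiguity, giving the 2^k sign configurations mentioned above.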
Consequences of over-squashing are overfitting and poor long-range node interactions, as relevant information cannot travel without being disturbed. Over-squashing has been well known since recurrent neural networks (Hochreiter & Schmidhuber, 1997), which led to the development of the (self- and cross-)attention mechanisms, first for the translation task (Bahdanau et al., 2014; Vaswani et al., 2017) and then for more general natural language processing (NLP) tasks (Devlin et al., 2018; Brown et al., 2020). Transformer architectures are the most elaborate networks that leverage attention. Attention is a simple but powerful mechanism that solves over-squashing and long-range dependencies by making "everything connected to everything", but it also requires trading linear complexity for quadratic complexity. Inspired by the great successes of Transformers in NLP and computer vision (CV), several works have proposed to generalize the Transformer architecture to graphs, achieving competitive or superior performance against standard MP-GNNs. We highlight the most recent research works in the next paragraph.

Graph Transformers. Graph Transformers (Dwivedi & Bresson, 2021) generalize Transformers to graphs, with graph Laplacian eigenvectors as node PE, and incorporate the graph structure into the permutation-invariant attention function. SAN and LSPE (Kreuzer et al., 2021; Dwivedi et al., 2021) further improve with PEs learned from Laplacian and random walk operators. GraphiT (Mialon et al., 2021) encodes relative PE derived from diffusion kernels into the attention mechanism. GraphTrans (Wu et al., 2021b) and SAT (Chen et al., 2022) add Transformers on top of standard GNN layers. Graphormer (Ying et al., 2021) introduces three structural encodings, with great success on large molecular benchmarks. GPS (Rampášek et al., 2022) categorizes the different types of PE and puts forward a hybrid MPNN+Transformer architecture. We refer to Min et al.
(2022) for an overview of graph-structured Transformers. Generally, most Graph Transformer architectures address the problems of over-squashing and limited long-range dependencies in GNNs, but they also significantly increase the complexity from O(E) to O(N^2), resulting in a computational bottleneck.

ViT and MLP-Mixer. Transformers have gained remarkable success in CV and NLP, most notably with architectures like ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018). The success of Transformers has long been attributed to the attention mechanism (Vaswani et al., 2017), which is able to model long-range dependencies as it does not suffer from over-squashing. But recently, this prominent line of networks has been challenged by more cost-efficient alternatives. A novel family of models based on the MLP-Mixer introduced by Tolstikhin et al. (2021) has emerged and gained recognition for its simplicity and its efficient implementation. Overall, MLP-Mixer replaces the attention module with multi-layer perceptrons (MLPs), which are also not affected by over-squashing and poor long-range dependencies. The original architecture is simple (Tolstikhin et al., 2021); it takes image patches (or tokens) as inputs, encodes them with a linear layer (equivalent to a convolutional layer over the image patches), and updates their representations with a series of feed-forward layers applied alternately to image patches (or tokens) and features. Follow-up variants investigate different mixing operations, such as ResMLP (Touvron et al., 2021), gMLP (Liu et al., 2021), and DynaMixer (Wang et al., 2022).
These plain networks can perform competitively with state-of-the-art vision Transformers, which tends to indicate that attention is not the only important inductive bias; other elements, like the general architecture of Transformers with patch embedding, residual connections and layer normalization, and carefully-curated data augmentation techniques, seem to play essential roles as well (Yu et al., 2022).

Main Objective. Motivated by the MLP-Mixer introduced in CV, our goal is to investigate a generalization of this architecture from grids to graphs. The motivation is clear; MLP-Mixer offers a low-cost alternative to ViT for images, avoiding the quadratic complexity of the attention mechanism while keeping long-range interactions. We wish to transfer these advantages to GNNs. Our contributions are as follows.
• We identify the key challenges to generalize MLP-Mixer from images to graphs.
• We design a new GNN, namely Graph MLP-Mixer, that is not limited by over-squashing and poor long-distance dependencies while keeping the linear complexity of MP-GNNs.
• We report extensive experiments to analyze the proposed GNN architecture with several datasets from Benchmarking GNNs (Dwivedi et al., 2020) and the Open Graph Benchmark (OGB) (Hu et al., 2020).
• Our approach forms a bridge between CV, NLP and graphs under a unified architecture, which can potentially benefit cross-domain collaborations to design better networks.

2. GENERALIZATION CHALLENGES

In the following, we list the main questions when adapting MLP-Mixer from images to graphs.

(1) How to define and extract graph patches/tokens? One notable geometrical property that distinguishes graph-structured data from regularly structured data, such as images and sequences, is that there does not in general exist a canonical grid to embed graphs. As shown in Table 1, images are supported by a regular lattice, which can easily be split into multiple grid-like patches (also referred to as tokens) of the same size via fast pixel reordering. However, graph data is irregular: the number of nodes and edges typically differs between graphs. Hence, graphs cannot be uniformly divided into similar patches across all examples in the dataset. Finally, the extraction process for graph patches cannot be uniquely defined given the lack of a canonical graph embedding. This raises the questions of how we identify meaningful graph tokens, and how we extract them quickly.

(2) How to encode graph patches into a vectorial representation? Since images can be reshaped into patches of the same size, they can be linearly encoded with an MLP, or equivalently with a convolutional layer with kernel size and stride values equal to the patch size. However, graph patches are not all the same size: they have variable topological structure with different numbers of nodes, edges and connectivity. Another important difference is the absence of a unique node ordering for graphs, which constrains the process to be invariant to node re-indexing for generalization purposes. In summary, we need a process that can transform graph patches into a fixed-length vectorial representation for arbitrary subgraph structures while being permutation invariant. GNNs are naturally designed to perform such transformations, and as such will be used to encode graph patches.

(3) How to preserve positional information for nodes and graph patches?
As shown in Table 1, image patches in the sequence have implicit positions since image data is always ordered the same way due to its unique embedding in the Euclidean space. For instance, the image patch at the upper-left corner is always the first one in the sequence and the image patch at the bottom-right corner is the last one. On this basis, the token mixing operation of the MLP-Mixer is able to fuse information from the same patch positions. However, graphs are naturally not aligned and the set of graph patches is therefore unordered. We face a similar issue when we consider the positions of nodes within each graph patch. In images, the pixels in each patch are always ordered the same way; in contrast, nodes in graph tokens are naturally unordered.

(4) How to reduce over-fitting for Graph MLP-Mixer? Most MLP variants (Tolstikhin et al., 2021; Touvron et al., 2021; Wang et al., 2022) first pre-train on large-scale datasets, and then fine-tune on downstream tasks, coupled with a rich set of data augmentation and regularization techniques, e.g. cropping, random horizontal flipping, RandAugment (Cubuk et al., 2020), mixup (Zhang et al., 2017), etc. While data augmentation has drawn much attention in CV and NLP, graph data augmentation methods are not yet as effective, despite interest and works on this topic (Zhao et al., 2021b). Variable numbers of nodes, edges and connectivity make graph augmentation challenging. Thus, how do we augment graph-structured data given this nature of graphs?

We summarize the differences between the standard MLP-Mixer and Graph MLP-Mixer in Table 1.

3.1. OVERVIEW

The basic architecture is illustrated in Figure 1. The goal of this section is to detail the choices we made to implement each component of the architecture. On the whole, these choices lead to a simple framework that provides good performance in practice.

Notation. Let G = (V, E) be a graph with V being the set of nodes and E the set of edges. The graph has N = |V| nodes and E = |E| edges. The connectivity of the graph is represented by the adjacency matrix A ∈ R^{N×N}. The node features of node i are denoted by h_i, while the features for an edge between nodes i and j are denoted by e_ij. Let {V_1, ..., V_P} be the node partition, P be the pre-defined number of patches, and G_i = (V_i, E_i) be the induced subgraph of G with all the nodes in V_i and all the edges whose endpoints belong to V_i. Let h_G be the graph-level vectorial representation and y_G be the graph-level target, which can be a discrete variable for a graph classification problem, or a scalar for a graph regression task.

The patch extraction module partitions graphs into overlapping patches. The patch embedding module transforms these graph patches into corresponding token representations, which are fed into a sequence of mixer layers to generate the output tokens. A global average pooling layer followed by a fully-connected layer is finally used for prediction. Each mixer layer is a residual network that alternates between a token mixer applied to all patches, and a channel mixer applied to each patch independently.
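The overall pipeline described above can be sketched end-to-end. Every helper below is a toy stand-in, an assumption for illustration rather than the paper's implementation; the real modules are detailed in the following subsections:

```python
import numpy as np

# Toy stand-ins for the real modules (assumptions, not the paper's code).
def extract_patches(node_feats, P):            # Sec. 3.2: METIS in the paper
    return np.array_split(node_feats, P)       # here: naive node split

def encode_patch(patch, W):                    # Sec. 3.3: GNN in the paper
    return patch.mean(0) @ W                   # here: mean + linear map

def mixer_layer(X):                            # Sec. 3.5: token/channel mixing
    return X + np.tanh(X.mean(0, keepdims=True))  # here: trivial residual mix

def forward(node_feats, P=4, L=2, d=8):
    rng = np.random.default_rng(0)
    W = rng.normal(size=(node_feats.shape[1], d))
    X = np.stack([encode_patch(p, W) for p in extract_patches(node_feats, P)])
    for _ in range(L):
        X = mixer_layer(X)
    h_G = X.mean(0)                            # global average pooling
    return h_G                                 # fed to a final MLP head

h = forward(np.random.default_rng(1).normal(size=(12, 5)))
assert h.shape == (8,)
```

The skeleton makes the data flow explicit: patches → patch tokens → mixer layers → pooled graph representation.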

3.2. PATCH EXTRACTION

When generalizing MLP-Mixer to graphs, the first step is to extract patches. This extraction is straightforward for images. Indeed, all image data x ∈ R^{H×W×C} are defined on a regular grid with the same fixed resolution (H, W), where H and W are respectively the height and the width, and C is the number of channels. Hence, all images can easily be reshaped into a sequence of flattened patches x_p ∈ R^{P×(R^2 C)}, where (R, R) is the resolution of each image patch, and P = HW/R^2 is the resulting number of patches, see Table 1. Unlike images with fixed resolution, extracting graph patches is more challenging. Generally, graphs have different sizes, i.e. numbers of nodes, and therefore cannot be uniformly divided like image data. Additionally, meaningful sub-graphs must be identified, in the sense that the nodes and edges composing a patch must share similar semantics or information, s.a. a community of friends sharing a biking interest in a social network. As such, a graph patch extraction process must satisfy the following conditions: (1) the same extraction algorithm can be applied to any arbitrary graph, (2) the nodes in the sub-graph patch must be more closely connected to each other than to those outside the patch, and (3) the extraction must be fast, that is at most linear w.r.t. the number of edges, i.e. O(E).

Graph partitioning algorithms have been studied for decades (Buluç et al., 2016) given their importance in identifying meaningful clusters. Mathematically, graph partitioning is known to be NP-hard (Chung, 1997). Approximations are thus required. A graph clustering algorithm with one of the best trade-offs between accuracy and speed is METIS (Karypis & Kumar, 1998), which partitions a graph into a pre-defined number of clusters/patches such that the number of within-cluster links is much higher than the number of between-cluster links, in order to better capture good community structure. Given these properties, we select METIS as our graph patch extraction algorithm.
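For contrast with graphs, the image-side patch flattening x_p ∈ R^{P×(R^2 C)} described above is a single reshape. This is a minimal numpy sketch with assumed toy dimensions:

```python
import numpy as np

H, W, C, R = 8, 8, 3, 4          # toy 8x8x3 image, patch resolution 4x4
x = np.random.rand(H, W, C)
P = (H * W) // (R * R)           # number of patches P = HW / R^2 = 4

# Split the regular grid into non-overlapping RxR patches, flatten each.
patches = (x.reshape(H // R, R, W // R, R, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(P, R * R * C))
assert patches.shape == (4, 48)  # P x (R^2 C)
```

No such one-liner exists for graphs, which is exactly why METIS is needed.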
However, METIS is limited to finding non-overlapping clusters, as visualized in Figure 1. In this example, METIS partitions the graph into four non-overlapping parts, i.e. {1, 2, 3}, {4, 5, 6}, {7, 8, 9} and {10, 11, 12}, resulting in 5 edge cuts. Unlike images, extracting non-overlapping patches can imply losing important edge information, i.e. the cut edges, and thus decreasing the predictive performance, as we will observe experimentally. To overcome this issue and to retain all original edges, we allow graph patches to overlap with each other. For example in Figure 1, if the source and destination nodes of an edge are not in the same patch, we assign both nodes to the patches they belong to. As such, node 3 and node 4 are in two different patches, here the blue and the red one, but are connected with each other. After our overlapping adjustment, these two nodes belong to both the blue and red patches. This practice is equivalent to expanding the graph patches to the one-hop neighbourhood of all nodes in that patch. Formally, METIS is first applied to partition a graph into P non-overlapping patches {V_1, ..., V_P} such that V = V_1 ∪ ... ∪ V_P and V_i ∩ V_j = ∅, ∀i ≠ j. Then, patches are expanded to their one-hop neighbourhood in order to preserve the information of between-patch links and make use of all graph edges: V_i ← V_i ∪ (∪_{j∈V_i} N_1(j)), where N_k(j) denotes the k-hop neighbourhood of node j.
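The one-hop patch expansion just described can be sketched in a few lines of plain Python. The initial non-overlapping partition would come from METIS; a toy partition is hard-coded here to keep the example self-contained:

```python
# Toy graph and an assumed METIS-style non-overlapping partition.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 3), (4, 6)]
partition = [{1, 2, 3}, {4, 5, 6}]

# Build the 1-hop neighbourhoods N_1(j).
nbrs = {}
for u, v in edges:
    nbrs.setdefault(u, set()).add(v)
    nbrs.setdefault(v, set()).add(u)

# V_i <- V_i ∪ (∪_{j in V_i} N_1(j)): after expansion, every cut edge has
# both endpoints inside at least one common patch, so no edge is lost.
patches = [set(p) | set().union(*(nbrs[j] for j in p)) for p in partition]
# patches == [{1, 2, 3, 4}, {3, 4, 5, 6}]
```

Note how the cut edge (3, 4) is now fully contained in both expanded patches, which is the point of the overlapping adjustment.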

3.3. PATCH ENCODER

For images, patch encoding can be done with a simple linear transformation given the fixed resolution of all image patches. This operation is fast and well-defined. For graphs, the patch encoder network must be able to handle complex data structures, i.e. be invariant to index permutation, cope with heterogeneous neighborhoods and variable patch sizes, define convolution on graphs, and be expressive enough to differentiate graph isomorphisms. As a result, the graph patch encoder is a GNN, whose architecture is designed to transform a graph token G_p into a fixed-size representation x_{G_p} ∈ R^d in 3 steps.

Step 1. Raw node and edge linear embedding. The input node features α_i ∈ R^{d_n} and edge features β_ij ∈ R^{d_e} are linearly projected into d-dimensional hidden features:

h_i^0 = U^0 α_i + u^0 ∈ R^d ; e_ij^0 = V^0 β_ij + v^0 ∈ R^d, (1)

where U^0 ∈ R^{d×d_n}, V^0 ∈ R^{d×d_e} and u^0, v^0 ∈ R^d are learnable parameters.

Step 2. Graph convolutional layers with (favorite) GNN. We apply a series of L convolutions to improve the patch representation of node and edge features:

h_i^{ℓ+1} = f_h(h_i^ℓ, {h_j^ℓ | j ∈ N(i)}, e_ij^ℓ), with h_i^{ℓ+1}, h_i^ℓ ∈ R^d,
e_ij^{ℓ+1} = f_e(h_i^ℓ, h_j^ℓ, e_ij^ℓ), with e_ij^{ℓ+1}, e_ij^ℓ ∈ R^d,

where ℓ is the layer index, the functions f_h and f_e (with learnable parameters) define the specific GNN architecture, and N(i) is the neighborhood of node i.

Step 3. Pooling and readout. The final step produces a fixed-size vector representation by mean pooling all node vectors in G_p and applying a small MLP to get the patch embedding x_{G_p}.

The patch encoder is a GNN, and thus has the same potential limitations of over-squashing and poor long-range dependencies. However, these problems become prominent only for large graphs; for small patch graphs, they do not really exist (or are negligible).
Indeed, in practice, the mean number of nodes and the mean diameter for graph patches are around 3.2 and 1.8 respectively for molecular datasets and around 12.0 and 2.7 for image datasets, see Table 9 .
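A minimal numpy sketch of the three encoder steps above, with a plain mean-aggregation convolution standing in for the "favorite" GNN f_h (edge features and f_e are omitted for brevity; all shapes and the toy patch are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_n, d = 4, 8                                  # raw / hidden feature dims
alpha = rng.normal(size=(5, d_n))              # 5 nodes in the toy patch
A = np.array([[0,1,1,0,0], [1,0,1,0,0], [1,1,0,1,0],
              [0,0,1,0,1], [0,0,0,1,0]], float)  # patch adjacency

# Step 1: linear embedding h_i^0 = U^0 alpha_i + u^0.
U0, u0 = rng.normal(size=(d_n, d)), np.zeros(d)
h = alpha @ U0 + u0

# Step 2: L = 2 mean-aggregation convolutions (stand-in for f_h).
deg = A.sum(1, keepdims=True)
for _ in range(2):
    h = np.tanh(h + (A @ h) / deg)             # mean over N(i) + residual

# Step 3: mean pooling + small MLP -> fixed-size patch embedding x_{G_p}.
W = rng.normal(size=(d, d))
x_patch = np.tanh(h.mean(0) @ W)
assert x_patch.shape == (d,)
```

The mean pooling in Step 3 is what guarantees a fixed-length output for patches of any size, and it is permutation invariant by construction.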

3.4. POSITIONAL INFORMATION

Regular grids offer a natural implicit arrangement for the sequence of image patches and for the pixels inside the image patches. However, such orderings of nodes and patches do not exist for general graphs. This lack of positional information reduces the expressivity of the network. Hence, we explicitly use one absolute PE for the patch nodes and one relative PE for the graph patches.

Node PE. Input node features in Eq. 1 are augmented with p_i ∈ R^K: h_i^0 = T^0 p_i + U^0 α_i + u^0 ∈ R^d, where T^0 ∈ R^{d×K} is a learnable matrix. The benefits of different PEs are dataset dependent. We follow the strategy in Rampášek et al. (2022) that uses the random-walk structural encoding (RWSE) (Dwivedi et al., 2021) for molecular data and Laplacian eigenvector encodings (Dwivedi et al., 2020) for image superpixels. Since Laplacian eigenvectors are defined up to sign flips, the sign of the eigenvectors is randomly flipped during training.

Patch PE. Relative positional information between the graph patches can be computed from the original graph adjacency matrix A ∈ R^{N×N} and the clusters {V_1, ..., V_P} extracted by METIS in Section 3.2. Specifically, we capture relative positional information via the 'coarsened adjacency matrix' A_P ∈ R^{P×P} over the patch graphs: (A_P)_ij = |V_i ∩ V_j| = Cut(V_i, V_j), where Cut(V_i, V_j) = Σ_{k∈V_i} Σ_{l∈V_j} A_kl is the standard graph cut operator which counts the number of edges connecting cluster V_i and cluster V_j. We observe that the matrix A_P is sparse, as it only connects patches that are neighbors in the original graph. This can cause poor long-distance interactions. To avoid this situation, we can simply smooth out the adjacency matrix A_P with any graph diffusion process. In this work, we select the n-step random walk diffusion process:

A_P^D = (D^{-1} A_P)^n ∈ R^{P×P}, (5)

where D is the degree matrix of A_P.

3.5. MIXER LAYER

For images, the original mixer layer in Tolstikhin et al. (2021) is a simple network that alternates channel and token mixing steps.
The token mixing step is performed over the token dimension, while the channel mixing step is carried out over the channel dimension. These two interleaved steps enable information fusion among tokens and channels. The simplicity of the mixer layer has been of great importance in understanding that the self-attention mechanism in ViT is not the only critical component for good performance on visual classification tasks. It has also led to a significant reduction in computational cost with little or no sacrifice in performance. Indeed, the self-attention mechanism in ViT requires O(P^2) memory and O(P^2) computation, while the mixer layer in MLP-Mixer needs O(P) memory and O(P) computation.

We modify the original mixer layer to introduce positional information between graph tokens. Let X ∈ R^{P×d} be the patch embeddings {x_{G_1}, ..., x_{G_P}}. The graph mixer layer can be expressed as

U = X + (W_2 σ(W_1 LayerNorm(A_P^D X))) ∈ R^{P×d} (token mixer),
Y = U + (W_4 σ(W_3 LayerNorm(U)^T))^T ∈ R^{P×d} (channel mixer),

where A_P^D ∈ R^{P×P} is the patch PE from Eq. 5, σ is a GELU nonlinearity (Hendrycks & Gimpel, 2016), LayerNorm(·) is layer normalization (Ba et al., 2016), and W_1 ∈ R^{d_s×P}, W_2 ∈ R^{P×d_s}, W_3 ∈ R^{d_c×d}, W_4 ∈ R^{d×d_c} are learnable matrices. We generate the final graph-level representation by mean pooling all the non-empty patches:

h_G = Σ_p m_p x_{G_p} / Σ_p m_p ∈ R^d,

where m_p is a binary variable with value 1 for non-empty patches and value 0 for empty patches (since graphs have variable sizes, small graphs can produce empty patches). Finally, we apply a small MLP to get the graph-level target: y_G = MLP(h_G).

3.6. DATA AUGMENTATION

MLP-Mixer architectures are known to be strong over-fitters (Liu et al., 2021). In order to reduce this effect, we perform data augmentation of graph patches. At each epoch, we randomly drop a few edges before running METIS partitioning, to produce more diverse partitions.
This data augmentation process is very fast, as METIS graph clustering only amounts to a small portion of the data preparation time, thus adding little extra cost to the training process.
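Putting Sections 3.4 and 3.5 together, here is a minimal numpy sketch of the patch PE of Eq. 5 followed by one graph mixer layer; the toy graph, random weights and dimensions are all assumptions for illustration:

```python
import numpy as np

# Patch PE (Eq. 5): coarsened adjacency from cut edges, then n-step RW diffusion.
A = np.zeros((6, 6))
for u, v in [(0,1), (1,2), (2,3), (3,4), (4,5), (0,2), (3,5)]:
    A[u, v] = A[v, u] = 1.0
patches = [[0, 1, 2], [2, 3], [3, 4, 5]]       # P = 3 toy clusters
P = len(patches)
A_P = np.zeros((P, P))
for i in range(P):
    for j in range(P):
        if i != j:
            A_P[i, j] = A[np.ix_(patches[i], patches[j])].sum()  # Cut(V_i, V_j)
A_P_D = np.linalg.matrix_power(np.diag(1 / A_P.sum(1)) @ A_P, 2)  # n = 2 steps

# One graph mixer layer: token mixer across patches, channel mixer per patch.
def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d, d_s, d_c = 8, 16, 32
X = rng.normal(size=(P, d))                         # patch embeddings x_{G_p}
W1, W2 = rng.normal(size=(d_s, P)), rng.normal(size=(P, d_s))
W3, W4 = rng.normal(size=(d_c, d)), rng.normal(size=(d, d_c))

U = X + W2 @ gelu(W1 @ layer_norm(A_P_D @ X))       # token mixer
Y = U + (W4 @ gelu(W3 @ layer_norm(U).T)).T         # channel mixer
assert Y.shape == (P, d)
```

Both mixing steps cost O(P) in the number of patches, versus the O(P^2) self-attention of a ViT-style layer, which is the complexity argument made above.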

4. EXPERIMENTS

Graph Benchmark Datasets. We conduct extensive experiments to investigate the proposed method. From Benchmarking GNNs (Dwivedi et al., 2020), we test on ZINC, MNIST and CIFAR10. From the Open Graph Benchmark (OGB) (Hu et al., 2020), we test on MolHIV, MolTOX21, and MolPCBA. Summary statistics of the datasets are reported in Table 6 and Appendix A.1.

Extraction Step: Study of # patches. We observe in Figure 5 that increasing the number of patches generally improves performance, which is consistent with computer vision (Dosovitskiy et al., 2020; Tolstikhin et al., 2021). We set the number of patches to 16/32 by default. The resulting graph patches are of small size; they typically contain 3-12 nodes with a diameter of 2-3, see Table 9.

Extraction Step: Study of k-hop extension. Figure 2 is consistent with our intuition that extracting non-overlapping patches implies losing important edge information. We further expand graph patches to their k-hop neighbourhood. Performance first increases and then flattens out or begins to decrease, at k = 3 for ZINC and k = 1/2 for MolHIV.

Encoder Step: GNN-based patch encoder. We evaluate the effect of various GNN models as patch encoders in Table 2, including GCN (Kipf & Welling, 2017), GatedGCN (Bresson & Laurent, 2017), GINE (Hu et al., 2019) and Graph Transformer (Shi et al., 2020). We find that our GNN-MLP-Mixer architecture matches or outperforms existing GNNs across different datasets and patch encoders. These promising results demonstrate the generic nature of the proposed Graph MLP-Mixer architecture, which can be applied to any MP-GNN in practice.

Encoder Step: GNN-free patch encoder. We also investigate a GNN-free patch encoder. For each patch, we embed all node and edge features as bags of nodes and edges, then average and read out, where all transformations are based exclusively on MLPs, see Eq. 9. Interestingly, this GNN-free MLP-Mixer produces good results (last row of Table 2), which seems to imply that the GNN encoder is not critical in this architecture. Even more excitingly, further development of Graph MLP-Mixer may not need to use specialized GNN libraries like DGL (Wang et al., 2019) or PyG (Fey & Lenssen, 2019a) to achieve competitive performance with standard MP-GNNs.

Encoder Step: Updated node encoding. We consider a more expressive GNN-MLP-Mixer by updating the node representation with the patch representation coming from the mixer layer. We call this improved version GNN-MLP-Mixer* and present the details in Appendix A.9.

Positional Information. We show the effects of the two kinds of positional encoding in Figure 3. First, we observe a significant drop in performance when either node PE or patch PE is removed. Besides, the performance drop of models without node PE is greater for ZINC than for MolHIV. This difference can be explained by the fact that ZINC features are purely atom and bond descriptors, whereas MolHIV features contain additional information, e.g. whether an atom is in a ring, among others.
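A minimal sketch of a GNN-free patch encoder of the kind described above: each patch is treated as a bag of node features, embedded with an MLP, and mean-read-out, with no message passing and no graph library. Dimensions and weights are toy assumptions, not Eq. 9 itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d_n, d = 4, 8
W1, W2 = rng.normal(size=(d_n, d)), rng.normal(size=(d, d))

def encode_patch(node_feats):          # node_feats: (num_nodes, d_n)
    h = np.tanh(node_feats @ W1)       # per-node MLP embedding
    return np.tanh(h.mean(0) @ W2)     # permutation-invariant mean readout

x_p = encode_patch(rng.normal(size=(5, d_n)))  # 5-node toy patch
assert x_p.shape == (8,)
```

Because the readout is a mean over nodes, the encoder handles patches of any size and is invariant to node re-indexing, the two requirements identified in Section 2.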
Excerpt of the state-of-the-art comparison (Table 3); each row lists results across the five benchmark datasets, with '-' for values not reported:

GraphiT (Mialon et al., 2021): 0.202 ± 0.011, -, -, -, -
Graphormer (Ying et al., 2021): 0.122 ± 0.006, -, -, -, -
GPS (Rampášek et al., 2022): 0.070 ± 0.004, 0.7880 ± 0.0101, 0.2907 ± 0.0028, 0.6562 ± 0.0115, 0.2515 ± 0.0012
SAN (Kreuzer et al., 2021): 0.139 ± 0.006, 0.7775 ± 0.0061, 0.2765 ± 0.0042, 0.6439 ± 0.0075, 0.2545 ± 0.0012
GraphTrans (Wu et al., 2021b): -, -, 0.2761 ± 0.0029, -, -
GNN-AK+ (Zhao et al., 2021a): 0.080 ± 0.001, 0.7961 ± 0.0119, 0.2930 ± 0.0044, 0.6480 ± 0.0089, 0.2736 ± 0.0007
SUN (Frasca et al., 2022): 0

Mixer Layer: MLP vs. Attention. In Table 4, we replace MLP-Mixer layers with standard Transformer layers while keeping the rest of the components the same. The performance of MLP-Mixer is surprisingly better than the latter despite a lower complexity. Full results are reported in Table 13.

Data Augmentation

State-Of-The-Art.

Expressivity. We experimentally show that Graph MLP-Mixer is strictly more powerful than 1-WL, and not less powerful than 3-WL. Although graph PEs s.a. Laplacian eigenvectors (Belkin & Niyogi, 2003) or k-step random walk PE (Li et al., 2020a; Dwivedi et al., 2021) cannot guarantee to distinguish all non-isomorphic graphs, it was shown in Dwivedi et al. (2021) that they can distinguish non-isomorphic graphs for which the 1-WL test fails. As a consequence, Graph MLP-Mixer is strictly more powerful than 1-WL. We experimentally validate this property by running Graph MLP-Mixer on the highly symmetric Circulant Skip Link (CSL) dataset of Murphy et al. (2019) in Table 16, which requires GNNs to be strictly more expressive than the 1-WL test to succeed. Besides, Graph MLP-Mixer reaches perfect accuracy on SR25 (Balcilar et al., 2021). The SR25 dataset contains 15 strongly regular graphs with 25 nodes each, and no model at most as powerful as the 3-WL test can distinguish them.

Long Range Graph Benchmark (LRGB). We evaluate our models and compare with the baselines on the LRGB (Dwivedi et al., 2022) with 2 graph-level datasets, i.e., Peptides-func and Peptides-struct, that arguably require long-range information reasoning to achieve strong performance in the given tasks. As shown in Table 2 and Table 5, Graph MLP-Mixer performs significantly better than the baselines, especially on Peptides-struct. The improvement can be explained by the nature of these datasets, and is consistent with the empirical findings in Dwivedi et al. (2022) that simple instances of local MP-GNNs perform poorly on these datasets. More information is provided in Table 15.

5. CONCLUSION

In this work, we have proposed a novel GNN model directly inspired by the ViT/MLP-Mixer architectures in computer vision and presented promising results on benchmark graph datasets. Future work will focus on further exploring graph networks with the inductive biases of graph tokens and Transformer-like architectures in order to solve fundamental node- and link-prediction tasks, potentially without the need for specialized GNN libraries.

A EXPERIMENTAL DETAILS

A.1 DATASETS DESCRIPTION

CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. CSL has 150 graphs. Each CSL graph is a 4-regular graph with edges connected to form a cycle and containing skip links between nodes.

SR25 is another synthetic dataset used to empirically verify the expressive power of Graph MLP-Mixer. SR25 (Balcilar et al., 2021) has 15 strongly regular graphs (on which 3-WL fails) with 25 nodes each. SR25 is translated into a 15-way classification problem with the goal of mapping each graph into a different class.

ZINC (Dwivedi et al., 2020) is a subset (12K) of the molecular graphs (250K) from a free database of commercially-available compounds (Irwin et al., 2012). These molecular graphs contain between 9 and 37 nodes. Each node represents a heavy atom (28 possible atom types) and each edge represents a bond (3 possible types). The task is to regress a molecular property known as the constrained solubility. The dataset comes with a predefined 10K/1K/1K train/validation/test split.

MNIST and CIFAR10 (Dwivedi et al., 2020) are derived from classical image classification datasets by constructing an 8-nearest-neighbor graph of SLIC superpixels for each image. The resulting graphs are of sizes 40-75 nodes for MNIST and 85-150 nodes for CIFAR10. The 10-class classification tasks and standard dataset splits follow the original image classification datasets, i.e., 55K/5K/10K train/validation/test graphs for MNIST and 45K/5K/10K for CIFAR10. These datasets are sanity checks, as we expect most GNNs to perform close to 100% for MNIST and well enough for CIFAR10.

MolTOX21 and MolHIV (Hu et al., 2020) are molecular property prediction datasets adopted from MoleculeNet (Szklarczyk et al., 2019). All the molecules are pre-processed using RDKit (Landrum et al., 2006). Each graph represents a molecule, where nodes are atoms, and edges are chemical bonds.
Input node features are 9-dimensional, containing the atomic number and chirality, as well as additional atom features such as the formal charge and whether the atom is in a ring. The datasets come with predefined scaffold splits based on the molecules' two-dimensional structural frameworks, i.e., 6K/0.78K/0.78K train/validation/test graphs for MolTOX21 and 32K/4K/4K for MolHIV.

MolPCBA (Hu et al., 2020) is another real-world molecular graph classification benchmark. MolTOX21 and MolHIV are of small and medium scale with 7.8K and 41.1K graphs respectively, whereas MolPCBA is of large scale with 437.9K graphs and applies a similar scaffold splitting procedure. It consists of multiple, extremely skewed (only 1.4% positivity) molecular classification tasks, and employs the Average Precision (AP) over them as its metric.

Peptides-func and Peptides-struct (Dwivedi et al., 2022) are derived from 15,535 peptides with a total of 2.3 million nodes retrieved from SATPdb (Singh et al., 2016). Both datasets use the same set of graphs but differ in their prediction tasks. These graphs are constructed such that long-range interaction (LRI) reasoning is required to achieve strong performance on a given task. Concretely, they are larger graphs, with on average 150.94 nodes per graph and an average graph diameter of 56.99. They are thus better suited to benchmarking graph Transformers and other expressive GNNs intended to capture LRI.

Distributions of the graph sizes. We plot the distributions of the graph sizes (i.e., the number of nodes in each data sample) of these datasets in Figure 4.

A.2 HYPERPARAMETERS

We follow the benchmarking protocol introduced in Dwivedi et al. (2020) based on PyTorch (Paszke et al., 2019) and PyG (Fey & Lenssen, 2019b). We use the Adam (Kingma & Ba, 2014) optimizer with the default settings of β1 = 0.9, β2 = 0.999, and ϵ = 1e-8 for both the baselines and our methods. We run each experiment 4 times and report the mean ± s.d. performance. Detailed hyperparameters are provided in Table 7 and Table 8.
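For concreteness, the reduce-on-plateau learning-rate protocol used in this study (halve the rate when the validation loss does not improve for 10 epochs, stop once the rate falls below 1 × 10^-5, as detailed later in this section) can be sketched in pure Python. This is an illustrative sketch, not the released training code; the function name and the loss trace are invented for the example:

```python
def run_schedule(val_losses, lr=1e-3, factor=0.5, patience=10, min_lr=1e-5):
    """Sketch of a ReduceLROnPlateau-style schedule: halve the learning
    rate when the validation loss has not improved for `patience` epochs,
    and stop training once the rate falls below `min_lr`.
    `val_losses` is the per-epoch validation loss trace; the function
    returns the learning rate that was in effect at each epoch."""
    best, wait, trace = float("inf"), 0, []
    for loss in val_losses:
        if lr < min_lr:           # training stops once the rate drops below 1e-5
            break
        trace.append(lr)
        if loss < best:           # improvement: reset the patience counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:  # plateau: reduce the learning rate by half
                lr *= factor
                wait = 0
    return trace
```

In real training this logic is what `torch.optim.lr_scheduler.ReduceLROnPlateau` provides; the sketch only makes the stopping behaviour explicit.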

A.3 PATCH EXTRACTION

Study of # patches. We observe in Figure 5 that increasing the number of patches generally improves performance. Patch Size and Diameter. We set the number of patches to 16/32 by default. The resulting graph patches are of small size, as shown in Table 9 . 

A.4 PATCH ENCODER

Baselines and GNN-based patch encoder. We use GCN (Kipf & Welling, 2017), GatedGCN (Bresson & Laurent, 2017), GINE (Hu et al., 2019) and Graph Transformer (Shi et al., 2020) directly, which also serve as the patch encoders of Graph MLP-Mixer to assess its general uplift effect. Hyperparameters and model configurations are described in Table 7 and Table 8.

GNN-free patch encoder. The GNN-free patch encoder reported in Table 2 is an all-MLP architecture. For each graph patch, we embed all node features (bag of nodes) and all edge features (bag of edges), then average and readout, based exclusively on MLPs:

h_{G_p} = MLP_3( Σ_{i ∈ V_p} MLP_1(h_i) + Σ_{e_ij ∈ E_p} MLP_2(e_ij) )

A.5 DATA AUGMENTATION

A.11 OVER-SQUASHING

We examine over-squashing experimentally with the synthetic TreeNeighbour dataset from Alon & Yahav (2020) and a real-world long-range dataset borrowed from Dwivedi et al. (2022) in Figure 6 and Table 17. For the synthetic TreeNeighbour dataset, we run experiments with GCN, GGCN and Graph MLP-Mixer, with the number of layers set to twice the tree depth and a hidden size of 128; we observe that the GNNs have no difficulty over-fitting this dataset. In Alon & Yahav (2020), the authors use a number of layers equal to the tree depth + 1 and a small hidden size of 32, which does not provide enough learning capacity to the networks and thus under-fits the dataset. We confirmed this observation with the real-world long-range graph dataset Peptides-func from Dwivedi et al. (2022), whose large graphs have a mean diameter of 57. The same result occurs: GCN, GGCN and Graph MLP-Mixer are all able to over-fit the dataset at almost 100%.



Figure 1: The basic architecture of the proposed Graph MLP-Mixer. Graph MLP-Mixer consists of a patch extraction module, a patch embedding module, a mixer layer, a global average pooling, and a classifier head. The patch extraction module partitions graphs into overlapping patches. The patch embedding module transforms these graph patches into corresponding token representations, which are fed into a sequence of mixer layers to generate the output tokens. A global average pooling layer followed by a fully-connected layer is finally used for prediction. Each Mixer Layer is a residual network that alternates between a Token Mixer applied to all patches, and a Channel Mixer applied to each patch independently.

where d_s and d_c are the tunable hidden widths in the token-mixing and channel-mixing MLPs, and are set following Tolstikhin et al. (2021).
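To make the two mixing steps concrete, one mixer layer over the patch-token matrix X ∈ R^{P×d} can be sketched in NumPy. This is a deliberate simplification (ReLU in place of GELU, LayerNorm without learnable parameters, random toy weights), not the released implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # parameter-free LayerNorm over the last axis
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def mlp(x, W1, W2):
    # two-layer MLP; ReLU stands in for the paper's GELU for brevity
    return np.maximum(x @ W1, 0.0) @ W2

def mixer_layer(X, Wt1, Wt2, Wc1, Wc2):
    """One mixer layer on patch tokens X in R^{P x d}: a residual token
    mixer across patches (per channel, hidden width d_s), followed by a
    residual channel mixer across channels (per patch, hidden width d_c)."""
    X = X + mlp(layer_norm(X).T, Wt1, Wt2).T  # token mix: (d,P)@(P,ds)@(ds,P), then back
    X = X + mlp(layer_norm(X), Wc1, Wc2)      # channel mix: (P,d)@(d,dc)@(dc,d)
    return X

# toy usage with random weights; ds, dc are the tunable hidden widths above
rng = np.random.default_rng(0)
P, d, ds, dc = 4, 8, 16, 32
X = rng.normal(size=(P, d))
out = mixer_layer(X,
                  rng.normal(size=(P, ds)), rng.normal(size=(ds, P)),
                  rng.normal(size=(d, dc)), rng.normal(size=(dc, d)))
```

Note that with all weights set to zero the layer reduces to the identity, which makes the residual structure easy to verify.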

Augmentation. The proposed data augmentation (DA) generates new graph patches with METIS at each epoch, while no DA means patches are generated only once at the initial epoch and then reused throughout training.

representations are updated using both the outputs of the MP-GNN layer and the MLP-Mixer layer:

ĥ^{l+1}_{i,p} = M^{l+1}(h^{l+1}_{i,p}) + N^{l+1}(Y^{l+1}_p)

The ĥ^{l+1}_{i,p} serve as the input of the next layer, and the iterative process goes back to the first step above.

A.10 COMPLEXITY ANALYSIS

For each graph G = (V, E), with N = |V| being the number of nodes and E = |E| being the number of edges, the METIS patch extraction takes O(E) runtime complexity and outputs graph patches {G_1, ..., G_P}, with P being the pre-defined number of patches. Accordingly, we denote each graph patch as G_p = (V_p, E_p), with N_p = |V_p| being the number of nodes and E_p = |E_p| being the number of edges in G_p. After our one-hop overlapping adjustment, the total numbers of nodes and edges over all patches satisfy (N_U = Σ_p N_p) ≤ 2N and (E_U = Σ_p E_p) ≤ 2E, respectively. Assuming the base GNN has O(E) runtime complexity, our patch embedding module has O(E_U) runtime complexity. For the Mixer Layer, which operates on the P patch tokens, the complexity is O(P).

Figure 6: Train Accuracy across problem radius (tree depth) in the NEIGHBORSMATCH problem (Alon & Yahav, 2020).

Figure 7 presents the generalization performance of GNNs on the TreeNeighbour dataset, which is specifically designed to synthetically analyze long-range dependencies.

Figure 7: Test Accuracy across problem radius (tree depth) in the NEIGHBORSMATCH problem (Alon & Yahav, 2020).

Differences between MLP-Mixer components for images and graphs.

Effect of the k-hop extension. We expand graph patches extracted by METIS to the k-hop neighbourhood of all nodes in that patch. 0-hop means non-overlapping patches without extension.
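The k-hop extension amounts to a breadth-first expansion from the patch's node set. A pure-Python sketch under the assumption of a simple adjacency-dict graph representation (the actual code operates on PyG data structures, so names here are illustrative):

```python
from collections import deque

def k_hop_expand(patch_nodes, adj, k):
    """Expand a patch (as produced by METIS) to the k-hop neighbourhood
    of all its nodes via BFS. k = 0 returns the original, non-overlapping
    patch. `adj` maps each node to an iterable of its neighbours."""
    seen = set(patch_nodes)
    frontier = deque((v, 0) for v in patch_nodes)
    while frontier:
        v, dist = frontier.popleft()
        if dist == k:          # reached the hop limit; do not expand further
            continue
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                frontier.append((u, dist + 1))
    return seen

# toy path graph 0-1-2-3-4, one patch covering nodes {0, 1}
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

On this path graph, expanding the patch {0, 1} by one hop adds node 2; by two hops it also adds node 3.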

Table 3 presents the ablation results. First, it is clear that DA improves performance.

Effect of positional information. We study the effects of node PE and patch PE by removing each of them in turn from our model while keeping the other components unchanged.

Effect of data augmentation (DA): ✗ means no DA and ✓ means DA.

Mixer Layer vs. Transformer Layer.

Comparison of our best results from Table 2 with the state-of-the-art GTs (missing values from the literature are indicated with '-'). For ZINC, all models have approximately 500k parameters.

Table 5 presents the SOTA Graph Transformer (GT) models. To ensure a fair comparison, we did not include Graphormer (Ying et al., 2021), which achieved the top score on MolHIV after pre-training on a large dataset of 3.8M graphs. Overall, we observe that our Graph MLP-Mixer model achieves competitive performance without making use of the fully-connected attention mechanism, using solely low-cost mixer operations. Besides, we are more efficient in model parameters and training time. A full comparison including the number of training parameters, memory and training time is provided in Table 14 and Table 15 in the appendix.

The learning rate is reduced by half if the validation loss does not improve after 10 epochs. Training stops when the learning rate reaches 1 × 10^-5. We use 4 GNN layers for the patch encoder and 4 Mixer layers by default. For the benchmarking datasets from Dwivedi et al. (2020), the dropout is set to 0. For the OGB datasets from Hu et al. (2020), we tune the dropout over {0, 0.1, 0.2, 0.3, 0.4, 0.5}.

Summary statistics of the datasets used in this study.

Model hyperparameters for four datasets from Dwivedi et al. (2020).

Model hyperparameters for three datasets from OGB (Hu et al., 2020).

Summary statistics of graph patches for different datasets.

Study of the k-hop neighbourhood on ZINC and MolHIV, corresponding to Fig. 2.

METIS vs. random partitioning. We study the benefit that METIS can provide over random graph partitioning. For random graph partitioning, nodes are randomly assigned to a pre-defined number of patches. For both METIS and random partitioning, we re-generate graph patches at each epoch. Table 11 shows that using METIS as the graph partitioning algorithm consistently gives better performance than random node partitioning, which matches our intuition that the nodes and edges composing a patch should share similar semantics or information. Nevertheless, it is interesting to see that random graph partitioning is still able to achieve reasonable results, which shows that the performance of the model is not solely supported by the quality of the patches.
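The random baseline is simple to state precisely; a pure-Python sketch (the function name is illustrative, and a real pipeline would redraw the seed each epoch for augmentation):

```python
import random

def random_partition(num_nodes, num_patches, seed=0):
    """Random-baseline partitioning: assign each node to one of
    `num_patches` patches uniformly at random, ignoring the graph
    structure entirely (unlike METIS, which groups connected nodes)."""
    rng = random.Random(seed)
    patches = [[] for _ in range(num_patches)]
    for v in range(num_nodes):
        patches[rng.randrange(num_patches)].append(v)
    return patches
```

Every node lands in exactly one patch, so the patches form a partition of the node set; overlap would be introduced afterwards by the one-hop adjustment.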

Comparison of METIS vs. random graph partitioning.

provides the full results of Table 3.

A.6 MIXER LAYER

In Table 13, GNN-Trans-Encoder is the model whose MLP-Mixer layers are replaced with the same number of standard Transformer encoder layers; the rest of the architecture is identical.


For the benchmark datasets from Dwivedi et al. (2020), all models have approximately 500k parameters. For MolHIV and MolPCBA, there is no upper limit on the number of parameters. To enable a fair comparison, we set the batch size to 128 for ZINC and MolHIV and to 256 for MolPCBA, for all the SOTA GT models and ours, and run all experiments on the same machine. For SUN (Frasca et al., 2022), due to its huge memory consumption, we use a small batch size of 8 with gradient accumulation; any batch size larger than 8 leads to out-of-memory (OOM) errors.
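Gradient accumulation emulates a large effective batch with small micro-batches, presumably here 16 micro-batches of size 8 to match the effective batch size of 128 used elsewhere (an assumption, not stated in the protocol). A framework-agnostic sketch of the logic; all names are illustrative, and real training code would use a deep-learning framework's optimizer:

```python
def accumulated_step(micro_batches, grad_fn, apply_update, accum_steps):
    """Sum gradients over `accum_steps` micro-batches, then perform one
    parameter update with the averaged gradient, exactly as if the
    micro-batches had been processed as one large batch (for a loss
    that averages over examples). `grad_fn` maps a micro-batch to a
    gradient vector (a list of floats); `apply_update` consumes the
    averaged gradient. Any trailing remainder is dropped in this sketch."""
    acc = None
    for step, batch in enumerate(micro_batches, 1):
        g = grad_fn(batch)
        acc = g if acc is None else [a + b for a, b in zip(acc, g)]
        if step % accum_steps == 0:
            apply_update([a / accum_steps for a in acc])  # averaged gradient
            acc = None
```

With accum_steps = 16 and micro-batches of size 8, each call to `apply_update` corresponds to one optimizer step at an effective batch size of 128.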

Model                    CSL Accuracy ↑     SR25 Accuracy ↑
GCN-MLP-Mixer*           1.0000 ± 0.0000    1.0000 ± 0.0000
GatedGCN-MLP-Mixer*      1.0000 ± 0.0000    1.0000 ± 0.0000
GINE-MLP-Mixer*          0.9800 ± 0.0183    1.0000 ± 0.0000
GraphTrans-MLP-Mixer*    1.0000 ± 0.0000    1.0000 ± 0.0000

Table 16: Results for the CSL (Murphy et al., 2019) and SR25 (Balcilar et al., 2021) datasets.

A.9 GNN-MLP-MIXER*

Like MLP-Mixer in computer vision, GNN-MLP-Mixer is a sequential two-step process. The first step embeds the nodes contained in the graph patches with a MP-GNN and pools the node embeddings together to generate a patch representation. The second step combines the patch representations with a mixer layer. However, unlike in the original MLP-Mixer, meaningful node representations are difficult to produce due to the high variability of graphs. To improve the expressiveness of GNN-MLP-Mixer, we propose an iterative two-step process that alternately updates the representations of nodes and patches, as follows.

First, the node and edge representations are updated with a MP-GNN applied to each graph patch G_p = (V_p, E_p) separately and independently, where l is the layer index, p is the patch index, i, j denote nodes, N(i) is the neighborhood of node i, and the functions f_h and f_e (with learnable parameters) define the specific MP-GNN architecture.

Second, a fixed-size vector representation of the patch G_p is produced by mean-pooling all node vectors in the patch, followed by a MLP.

Third, the patches, represented by X^{l+1} = {x^{l+1}_1, ..., x^{l+1}_P} ∈ R^{P×d}, are processed with a MLP-Mixer layer.
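Under strong simplifying assumptions, one round of this alternating update can be sketched in NumPy. The stand-ins here are deliberate: mean-aggregation message passing for f_h (edge features and f_e are omitted), mean pooling plus a single linear map for the patch MLP, single matrices for token and channel mixing, and linear maps M, N for the combination ĥ = M(h) + N(Y_p). Every name and weight shape is illustrative, not the paper's implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def iterate_once(H, patches, adj, Wg, Wp, Wt, Wc, Wm, Wn):
    """One iteration of the alternating node/patch update.
    H: (N, d) node features; patches: list of node-index lists;
    adj: node -> neighbour list. Wt is (P, P); all other W* are (d, d)."""
    d = H.shape[1]
    # Step 1: message passing inside each patch, separately and independently
    H1 = np.zeros_like(H)
    for nodes in patches:
        inside = set(nodes)
        for i in nodes:
            nbrs = [j for j in adj[i] if j in inside]
            agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(d)
            H1[i] = relu((H[i] + agg) @ Wg)
    # Step 2: one token per patch via mean pooling + linear map
    X = np.stack([relu(H1[nodes].mean(axis=0) @ Wp) for nodes in patches])
    # Step 3: mixer layer over the P patch tokens
    Y = X + Wt @ X   # token mixing across patches
    Y = Y + Y @ Wc   # channel mixing per patch
    # Combine MP-GNN and mixer outputs: h_hat = M(h) + N(Y_p)
    # (a node shared by two overlapping patches keeps the last update here)
    H_next = np.zeros_like(H)
    for p, nodes in enumerate(patches):
        for i in nodes:
            H_next[i] = H1[i] @ Wm + Y[p] @ Wn
    return H_next
```

Stacking several such iterations, with the output node features feeding the next round's message passing, gives the iterative process described above.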

