MOTIF-INDUCED GRAPH NORMALIZATION

Abstract

Graph Neural Networks (GNNs) have emerged as a powerful class of learning architectures for graph-structured data. However, existing GNNs typically follow the neighborhood aggregation scheme and ignore the structural characteristics of node-induced subgraphs, which limits their expressiveness on downstream tasks at both the graph and node levels. In this paper, we strive to strengthen the general discriminative capability of GNNs by devising a dedicated plug-and-play normalization scheme, termed Motif-induced Normalization (MotifNorm), that explicitly accounts for the intra-connection information within each node-induced subgraph. To this end, we embed motif-induced structural weights at the beginning and the end of the standard BatchNorm, and incorporate graph instance-specific statistics for better distinguishability. We also provide a theoretical analysis showing that MotifNorm helps alleviate the over-smoothing issue, which is conducive to designing deeper GNNs. Experimental results on eight popular benchmarks, covering graph-, node-, and link-level property prediction, demonstrate the effectiveness of the proposed method. Our code is made available in the supplementary material.

1. INTRODUCTION

In recent years, Graph Neural Networks (GNNs) have emerged as the mainstream deep learning architecture for analyzing irregular samples in which information is represented as graphs; they usually employ a message-passing aggregation mechanism to encode node features from local neighborhood representations (Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2019; Yang et al., 2020b; Hao et al., 2021; Dwivedi et al., 2022b). As a powerful class of graph-relevant networks, these architectures have shown encouraging performance in various domains such as cell clustering (Li et al., 2022; Alghamdi et al., 2021), chemical prediction (Tavakoli et al., 2022; Zhong et al., 2022), social networks (Bouritsas et al., 2022; Dwivedi et al., 2022b), traffic networks (Bui et al., 2021; Li & Zhu, 2021), combinatorial optimization (Schuetz et al., 2022; Cappart et al., 2021), and power grids (Boyaci et al., 2021; Chen et al., 2022a). However, the commonly used message-passing mechanism, i.e., aggregating representations from neighborhoods, limits the capability of GNNs to address the subtree-isomorphism phenomenon prevalent in the real world (Wijesinghe & Wang, 2022). As shown in Figure 1(a), the subgraphs S_v1 and S_v2 induced by v_1 and v_2 are subtree-isomorphic, which reduces GNNs' expressivity for both graph-level and node-level prediction: (1) Graph-level: straightforward neighborhood aggregation, which ignores the characteristics of node-induced subgraphs, makes subtree-isomorphic cases completely indistinguishable, so the expressivity of GNNs is bottlenecked by the Weisfeiler-Leman (WL) test (Weisfeiler & Leman, 1968).
(2) Node-level: under over-smoothing (illustrated in Figure 1(b)), the smoothing among the root representations of subtree-isomorphic substructures becomes even worse when similar neighborhood representations are aggregated without structural characteristics being considered. In this paper, we strive to develop a general framework that compensates for the ignored characteristics of node-induced structures, improving graph expressivity over prevalent message-passing GNNs on various downstream tasks, e.g., graph, node, and link prediction. Driven by the fact that deep models usually follow the CNA architecture, i.e., a stack of convolution, normalization, and activation, where the normalization module generally follows the GNN convolution operation, we focus on developing a more expressive, generalized normalization scheme to enhance the discriminative ability of various GNN architectures. The question is thus: "how can we design such a powerful and general normalization module with the characteristics of node-induced substructures embedded?" To address this challenge, this paper devises an innovative normalization mechanism, termed Motif-induced Normalization (MotifNorm), that explicitly considers the intra-connection information in each node-induced subgraph (i.e., motif (Leskovec, 2021)) and embeds the resulting structural factors into the normalization module to improve the expressive power of GNNs. Specifically, we first empirically disentangle standard normalization into two stages, i.e., the centering & scaling (CS) and affine transformation (AT) operations. We then mine the intra-connection information in node-induced subgraphs and develop two elaborated strategies, termed representation calibration and representation enhancement, to embed the obtained structural information into the CS and AT operations.
Eventually, we demonstrate through extensive experimental analysis that MotifNorm generally improves GNNs' expressivity on graph, node, and link prediction tasks. In sum, the contributions of this work can be summarized as follows:
• Driven by the conjecture that a more expressive normalization with rich graph-structure power can generally strengthen GNN performance, we develop a novel normalization scheme, termed MotifNorm, to embed structural information into GNN aggregation.
• We develop two elaborated strategies, i.e., representation calibration and representation enhancement, tailored to embed the motif-induced structural factor into the CS and AT operations, establishing MotifNorm for GNNs.
• We provide extensive experimental analysis on eight popular benchmarks across various domains, covering graph, node, and link property prediction, demonstrating that the proposed model is efficient and consistently yields encouraging results. It is worth mentioning that MotifNorm maintains computational simplicity, which benefits model training on highly complicated tasks.

2. RELATED WORKS

In this section, we briefly introduce the existing normalization architectures for GNNs, which are commonly specific to the type of downstream task, i.e., graph-level or node-level. Graph-level Normalization. Node-level Normalization. This type of mechanism rescales node representations to alleviate the over-smoothing issue (as shown in Figure 1(b)). Yang et al. (2020a) design the MeanNorm trick to improve GCN training by interpreting the effect of mean subtraction as approximating the Fiedler vector. Zhou et al. (2021) scale each node's features by its own standard deviation and propose a variance-controlling technique, termed NodeNorm, to address the variance inflammation caused by GNNs. To constrain the pairwise node distance, Zhao & Akoglu (2020b) introduce PairNorm to prevent node embeddings from over-smoothing on the node classification task. Furthermore, Liang et al. (2022) design ResNorm from the perspective of the long-tailed degree distribution and add a shift operation in a low-cost manner to alleviate over-smoothing. However, these approaches are usually task-specific in GNN architectures, which means they do not always contribute significantly across downstream tasks. Furthermore, these normalizations ignore the characteristics of node-induced substructures, which hamper GNN performance on various downstream tasks.

3. MOTIF-INDUCED GRAPH NORMALIZATION

In this section, we begin by giving the preliminaries of our proposed MotifNorm and, along the way, introduce notations and a basic definition of motif-induced information. Let G = (V_G, E_G) denote an undirected graph with n vertices and m edges, where V_G = {v_1, v_2, ..., v_n} is the set of vertices and E_G is an unordered set of edges. N(v) = {u ∈ V_G | (v, u) ∈ E_G} denotes the neighbor set of vertex v, and its neighborhood subgraph S_v is induced by Ñ(v) = N(v) ∪ {v}; it contains all edges in E_G that have both endpoints in Ñ(v). As shown in Figure 1(a), S_v1 and S_v2 are the induced subgraphs of v_1 and v_2. Their structural information is defined as follows.
Definition 1. The motif-induced information ξ(S_vi) denotes the structural weight of the substructure S_vi, i.e., the node-induced subgraph (Leskovec, 2021) of v_i, where ξ is formulated as ξ(S_vi) = ϕ(|E_Svi|) ψ(|V_Svi|), with |·| denoting cardinality. Here ϕ(|E_Svi|) = 2|E_Svi| / (|V_Svi| · (|V_Svi| − 1)) and ψ(|V_Svi|) = |V_Svi|² refer to the density and the power of S_vi, respectively. The density term ϕ emphasizes edge information when two different subgraphs are subtree-isomorphic, whereas ψ, on the contrary, emphasizes node power.
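To make Definition 1 concrete, ξ(S_v) can be computed directly from an adjacency list. Below is a minimal plain-Python sketch; the helper names and the toy graphs are ours for illustration, not from the released code, and we assume v has at least one neighbor.

```python
def induced_subgraph(adj, v):
    """Closed neighborhood N~(v) = N(v) ∪ {v} and the edges among it."""
    nodes = set(adj[v]) | {v}
    edges = {(a, b) for a in nodes for b in adj[a] if b in nodes and a < b}
    return nodes, edges

def motif_induced_weight(adj, v):
    """xi(S_v) = phi(|E|) * psi(|V|): subgraph density times node power."""
    nodes, edges = induced_subgraph(adj, v)
    n_v, m_v = len(nodes), len(edges)
    density = 2 * m_v / (n_v * (n_v - 1))   # phi
    power = n_v ** 2                         # psi
    return density * power

# Two roots with the same degree but different intra-connections get
# different weights (adjacency given as a dict of neighbor sets):
adj_a = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}   # triangle: S_0 has 3 edges
adj_b = {0: {1, 2}, 1: {0}, 2: {0}}         # star: S_0 has 2 edges
print(motif_induced_weight(adj_a, 0))  # 9.0  (density 1.0 * power 9)
print(motif_induced_weight(adj_b, 0))  # 6.0  (density 2/3 * power 9)
```

This is exactly the distinction that plain degree-based aggregation misses: both roots have degree 2, but ξ separates them through the edges among their neighbors.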

3.1. PRELIMINARY

Batch normalization can be empirically divided into two stages, i.e., centering & scaling (CS) and affine transformation (AT) operations. For the input features H ∈ R^{n×d}, the CS and AT stages follow:

CS: H_CS = (H − E(H)) / √(D(H) + ε),    AT: H_AT = H_CS ⊙ γ + β,

where ⊙ is the dot product with the broadcast mechanism, E(H) and D(H) denote the mean and variance statistics, and γ, β ∈ R^{1×d} are the learned scale and bias factors. In this work, we aim to embed structural weights into BatchNorm, and thus take a batch of m graphs as an example to give the notations. The stacked motif-induced weights are M_G = ξ(S_{V_G}) = [ξ(S_{V_G1}); ξ(S_{V_G2}); ...; ξ(S_{V_Gm})] = [ξ(S_v1), ..., ξ(S_vn)] ∈ R^{n×1}. The segment summation-normalization is M_SN = [F(ξ(S_{V_G1})); F(ξ(S_{V_G2})); ...; F(ξ(S_{V_Gm}))] ∈ R^{n×1}, where F(ξ(S_{V_Gi})) = ξ(S_{V_Gi}) / Σ ξ(S_{V_Gi}) ∈ R^{|V_Gi|×1} denotes the summation-normalization within each graph.
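The two-stage decomposition can be illustrated in a few lines of plain Python. This is a didactic sketch over Python lists rather than the batched tensor implementation; `eps` plays the role of ε.

```python
import math

def centering_scaling(H, eps=1e-5):
    """CS stage: per-channel standardization over all n rows."""
    n, d = len(H), len(H[0])
    mean = [sum(row[j] for row in H) / n for j in range(d)]
    var = [sum((row[j] - mean[j]) ** 2 for row in H) / n for j in range(d)]
    return [[(H[i][j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(d)]
            for i in range(n)]

def affine_transform(H_cs, gamma, beta):
    """AT stage: per-channel learned scale and shift."""
    return [[h * g + b for h, g, b in zip(row, gamma, beta)] for row in H_cs]

H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
H_cs = centering_scaling(H)
H_at = affine_transform(H_cs, gamma=[1.0, 1.0], beta=[0.0, 0.0])
# with identity affine parameters, each channel of H_at is (near)
# zero-mean, unit-variance
```

Splitting BatchNorm this way is what lets MotifNorm attach one structural operation before CS and another inside AT.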

3.2. MOTIFNORM FOR GNNS

The commonly used message-passing aggregations are node-specific, i.e., they ignore the characteristics of node-induced subgraphs, which limits the expressive capability of GNNs on various downstream tasks. We therefore present a generalized graph normalization framework, termed MotifNorm, to compensate for the structural characteristics ignored by GNNs, and develop two elaborated strategies: representation calibration (RC) and representation enhancement (RE).
Representation Calibration (RC). Before the CS stage, we calibrate the inputs by injecting the motif-induced weights and incorporating graph instance-specific statistics into the representations, which balances the distribution differences while embedding structural information. For the input feature H ∈ R^{n×d}, RC is formulated as:

RC: H_RC = H + w_RC ⊙ H_SA ⊙ (M_RC 1_d^T),

where 1_d is the d-dimensional all-one column vector and ⊙ denotes the dot product operation. w_RC ∈ R^{1×d} is a learned weight parameter. M_RC = M_SN ⊙ M_G ∈ R^{n×1} is the calibration factor for RC, which is explained in detail in Appendix A.3. H_SA ∈ R^{n×d} is the segment averaging of H, obtained by sharing the average node features of each graph with its nodes, where each individual graph is called a segment in the DGL implementation (Wang et al., 2019).
Representation Enhancement (RE). Right after the CS operation, the node features H_CS are constrained into a fixed variance range and distinctive information is slightly weakened. Thus, we design the RE operation to embed motif-induced structural information into the AT stage to enhance the final representations. RE is formulated as follows:

RE: H_RE = H_CS ⊙ Pow(M_RE, w_RE),

where w_RE ∈ R^{1×d} is a learned weight parameter and Pow(·) is the power function.
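The three building blocks above can be sketched in plain Python. The function names are ours, and the actual implementation operates on batched DGL tensors; here features are n×d lists, `segment_ids` assigns each node to its graph, and the motif factors are passed in precomputed.

```python
def segment_average(H, segment_ids):
    """H_SA: share each graph's mean node feature with all of its nodes."""
    d = len(H[0])
    sums, counts = {}, {}
    for row, s in zip(H, segment_ids):
        acc = sums.setdefault(s, [0.0] * d)
        for j in range(d):
            acc[j] += row[j]
        counts[s] = counts.get(s, 0) + 1
    return [[sums[s][j] / counts[s] for j in range(d)] for s in segment_ids]

def representation_calibration(H, H_sa, m_rc, w_rc):
    """RC: H + w_RC ⊙ H_SA ⊙ (M_RC 1_d^T), per node i and channel j."""
    return [[H[i][j] + w_rc[j] * H_sa[i][j] * m_rc[i]
             for j in range(len(H[0]))] for i in range(len(H))]

def representation_enhancement(H_cs, m_re, w_re):
    """RE: H_CS ⊙ Pow(M_RE, w_RE), channel-wise exponent on the motif factor."""
    return [[H_cs[i][j] * (m_re[i][j] ** w_re[j])
             for j in range(len(H_cs[0]))] for i in range(len(H_cs))]

H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
H_sa = segment_average(H, [0, 0, 1])   # nodes 0,1 in graph 0; node 2 in graph 1
```

Note that with w_RC = 0 and w_RE = 0 both operations reduce to the identity, so MotifNorm can learn to fall back towards plain BatchNorm behavior.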
To imitate the affine weights in AT for each channel, we perform the segment summation-normalization on the calibration factor M_RC and repeat it over d columns to obtain the enhancement factor M_RE ∈ R^{n×d}, which ensures the column signatures of Pow(M_RE, w_RE) − 1 are consistent.
Expressivity Analysis. MotifNorm with the injected RC and RE operations, compensating GNNs with the structural characteristics of subgraphs, can generally improve graph expressivity as follows: (1) Graph-level: for graph prediction tasks, MotifNorm supplies the structural information needed to distinguish the subtree-isomorphic cases that 1-WL cannot recognize, which can make GNNs more expressive than the 1-WL test. Specifically, an arbitrary GNN equipped with MotifNorm is more expressive than the 1-WL test in distinguishing k-regular graphs. (2) Node-level: the previously ignored structural information strengthens the node representations, which benefits downstream recognition tasks. Furthermore, MotifNorm with injected structural weights helps alleviate the over-smoothing issue, which is analyzed in Theorem 1 below. (3) Training stability: the RC operation helps stabilize model training, making the normalization operation less reliant on the running means and balancing the distribution differences by considering graph instance-specific statistics; this is analyzed in Proposition 1 below.
Theorem 1. MotifNorm helps alleviate the over-smoothing issue.
Proof. Given two extremely similar embeddings of u and v (i.e., ‖H_u − H_v‖_2 ≤ ε), assume for simplicity that ‖H_u‖_2 = ‖H_v‖_2 = 1, ‖w_RC‖_2 ≥ c, and that the motif-induced information scores of u and v differ by a considerable margin, ‖(M_RC 1_d^T)_u − (M_RC 1_d^T)_v‖_2 ≥ 2ε/c.
We can obtain

‖(H_u + (w_RC ⊙ (M_RC 1_d^T))_u ⊙ (H_u + H_v)/2) − (H_v + (w_RC ⊙ (M_RC 1_d^T))_v ⊙ (H_u + H_v)/2)‖_2
≥ −‖H_u − H_v‖_2 + ‖((w_RC ⊙ (M_RC 1_d^T))_u − (w_RC ⊙ (M_RC 1_d^T))_v) ⊙ (H_u + H_v)/2‖_2
≥ −ε + ‖w_RC‖_2 · ‖(M_RC 1_d^T)_u − (M_RC 1_d^T)_v‖_2 · ‖(H_u + H_v)/2‖_2
≥ −ε + 2ε = ε,

where the subscripts u, v denote the u-th and v-th rows of a matrix in R^{n×d}. This inequality demonstrates that our RC operation can differentiate two nodes by a margin of ε even when their node embeddings become extremely similar after L GNN layers. Similarly, the RE operation can also differentiate the embeddings; the theoretical analysis is provided in Appendix A.1.
Proposition 1. The RC operation helps stabilize model training.
Proof. The complete proof is provided in Appendix A.2.
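The margin argument in Theorem 1 can be checked numerically on a toy two-node segment: two unit-norm embeddings that are almost identical become clearly separated once RC injects their motif scores. The scores m_u, m_v and the unit weight w_RC below are hypothetical values chosen for illustration.

```python
import math

def l2(x):
    return math.sqrt(sum(t * t for t in x))

# Two nearly identical unit-norm node embeddings, about 0.01 apart.
H_u = [1.0, 0.0]
theta = 0.01
H_v = [math.cos(theta), math.sin(theta)]
eps = l2([a - b for a, b in zip(H_u, H_v)])

# Distinct motif-induced calibration scores for u and v (hypothetical).
m_u, m_v = 0.8, 0.2
w_rc = [1.0, 1.0]
H_sa = [(a + b) / 2 for a, b in zip(H_u, H_v)]   # segment average of {u, v}

rc_u = [h + w * s * m_u for h, w, s in zip(H_u, w_rc, H_sa)]
rc_v = [h + w * s * m_v for h, w, s in zip(H_v, w_rc, H_sa)]
gap = l2([a - b for a, b in zip(rc_u, rc_v)])

print(f"before RC: {eps:.4f}  after RC: {gap:.4f}")  # gap is much larger
```

The distance grows by roughly |m_u − m_v| · ‖H_SA‖, mirroring the role of the motif-score margin in the proof.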

3.3. THE IMPLEMENTATION OF MOTIFNORM

We merge the RE operation into AT for a simpler formulation. Given the input feature H ∈ R^{n×d}, MotifNorm is written as:

RC: H_RC = H + w_RC ⊙ H_SA ⊙ (M_RC 1_d^T),
CS: H_CS = (H_RC − E(H_RC)) / √(D(H_RC) + ε),
AT: H_AT = H_CS ⊙ (γ + P)/2 + β,

where P = Pow(M_RE, w_RE) and H_AT is the output of MotifNorm. To this end, we add the RC and RE operations at the beginning and end of the original BatchNorm layer to strengthen the expressive power after the GNN convolution. The additional RC and RE operations are dot products in R^{n×d}, so their time complexity is O(nd).
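Putting the pieces together, a complete MotifNorm forward pass (RC → CS → AT with RE merged) can be sketched as follows. This is a didactic plain-Python rendering of the formulas above, not the released modules/norm/motifnorm.py; features are n×d lists and m_rc holds the precomputed calibration factors M_RC = M_SN ⊙ M_G.

```python
import math

def motifnorm_forward(H, m_rc, segment_ids, w_rc, w_re, gamma, beta, eps=1e-5):
    """One MotifNorm pass. segment_ids gives each node's graph id
    (a 'segment' in the DGL sense)."""
    n, d = len(H), len(H[0])

    # H_SA: broadcast each graph's mean feature back to its nodes
    sums, cnt = {}, {}
    for row, s in zip(H, segment_ids):
        acc = sums.setdefault(s, [0.0] * d)
        for j in range(d):
            acc[j] += row[j]
        cnt[s] = cnt.get(s, 0) + 1
    H_sa = [[sums[s][j] / cnt[s] for j in range(d)] for s in segment_ids]

    # RC: H + w_RC ⊙ H_SA ⊙ (M_RC 1_d^T)
    H_rc = [[H[i][j] + w_rc[j] * H_sa[i][j] * m_rc[i] for j in range(d)]
            for i in range(n)]

    # CS: standardize each channel over the batch
    mu = [sum(r[j] for r in H_rc) / n for j in range(d)]
    var = [sum((r[j] - mu[j]) ** 2 for r in H_rc) / n for j in range(d)]
    H_cs = [[(H_rc[i][j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(d)]
            for i in range(n)]

    # M_RE: per-graph summation-normalization of m_rc, shared by all d channels
    seg_sum = {}
    for m, s in zip(m_rc, segment_ids):
        seg_sum[s] = seg_sum.get(s, 0.0) + m
    m_re = [m / seg_sum[s] for m, s in zip(m_rc, segment_ids)]

    # AT with RE merged: H_CS ⊙ (gamma + Pow(M_RE, w_RE)) / 2 + beta
    return [[H_cs[i][j] * (gamma[j] + m_re[i] ** w_re[j]) / 2 + beta[j]
             for j in range(d)] for i in range(n)]

out = motifnorm_forward([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                        m_rc=[0.5, 0.3, 0.2], segment_ids=[0, 0, 1],
                        w_rc=[0.0, 0.0], w_re=[0.0, 0.0],
                        gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With w_RC = 0 and w_RE = 0, the RE factor is 1 and the output reduces to H_CS ⊙ (γ + 1)/2 + β, i.e., standard-normalized features up to the averaged affine weight.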

4. EXPERIMENTS

To demonstrate the effectiveness of the proposed MotifNorm in different GNNs, we conduct experiments on three types of graph tasks, including graph-, node-, and link-level prediction.
Baseline Methods. We compare our MotifNorm to various types of normalization baselines for GNNs, including BatchNorm (Ioffe & Szegedy, 2015), UnityNorm (Chen et al., 2022c), GraphNorm (Cai et al., 2021), ExpreNorm (Dwivedi et al., 2022a) for graph prediction, and GroupNorm (Zhou et al., 2020), PairNorm (Zhao & Akoglu, 2020a), MeanNorm (Yang et al., 2020a), NodeNorm (Zhou et al., 2021) for all three tasks. More details on the datasets, baselines, and experimental setups are provided in Appendix B.1. In the following experiments, we aim to answer the questions: (i) Can MotifNorm improve the expressivity for the graph isomorphism test, in particular going beyond 1-WL on k-regular graphs? (Section 4.1) (ii) Can MotifNorm help alleviate the over-smoothing issue? (Section 4.2) (iii) Can MotifNorm generalize to various graph tasks? (Section 4.3)

4.1. EXPERIMENTAL ANALYSIS ON GRAPH ISOMORPHISM TEST

IMDB-BINARY is a well-known graph isomorphism test dataset consisting of various k-regular graphs, which has become a commonly used benchmark for evaluating the expressivity of GNNs. To make the training, validation, and test sets follow the same distribution as closely as possible, we adopt a hierarchical dataset-splitting strategy based on the structural statistics of the graphs (a more detailed description is provided in Appendix B.1). For the graph isomorphism test, the Graph Isomorphism Network (GIN) (Xu et al., 2019) is known to be as powerful as 1-WL. Notably, GIN consists of a neighborhood aggregation operation and a multi-layer perceptron (MLP), which motivates a comparison experiment: we compare a one-layer MLP + MotifNorm with a one-layer GIN to directly demonstrate MotifNorm's expressivity in distinguishing k-regular graphs. As illustrated in Figure 2, the vanilla MLP cannot capture any structural information and performs poorly, while the proposed MotifNorm successfully improves the performance of the MLP and even exceeds the vanilla GIN. Furthermore, Table 2 provides quantitative comparison results, where GSN (Bouritsas et al., 2022) and GraphSNN (Wijesinghe & Wang, 2022) are two recent popular methods achieving higher expressivity than 1-WL. From these results, the one-layer MLP with MotifNorm outperforms the one-layer GIN, GSN, and GraphSNN. Moreover, commonly used GNNs equipped with MotifNorm, e.g., GCN and GAT, achieve higher ROC-AUC than GIN when the number of layers is set to 4. GIN with MotifNorm achieves better performance and even goes beyond GSN and GraphSNN. Furthermore, MotifNorm can further enhance the expressivity of GSN and GraphSNN. More detailed results on IMDB-BINARY are provided in Appendix B.2.

4.2. EXPERIMENTAL ANALYSIS ON THE OVER-SMOOTHING ISSUE

To show the effectiveness of MotifNorm in alleviating the over-smoothing issue in GNNs, we visualize the first three categories of the Cora dataset in 2D space for better illustration. We select PairNorm, NodeNorm, MeanNorm, GroupNorm, and BatchNorm for comparison and set the number of layers to 32. Figures 3(a)∼3(f) show the t-SNE visualizations of the different normalization techniques; none of them suffers from the over-smoothing issue as severely as GraphSage with BatchNorm (shown in Figure 1(b)). However, MotifNorm better separates the different categories into distinct clusters, i.e., the other normalization methods may lose discriminant information and leave the representations entangled. Furthermore, we provide quantitative results using three metrics: accuracy, intra-class distance, and inter-class distance. In detail, we set the number of layers from 2 to 32 with a step size of 2 and plot the line charts in Figures 4(a)∼4(c). Figure 4(a) shows accuracy with respect to the number of layers, which directly demonstrates the superiority of MotifNorm as GNNs go deeper. To characterize the disentangling capability of the different normalizations, we plot the intra-class and inter-class distances as the layers increase in Figures 4(b) and 4(c). As shown in these two figures, MotifNorm obtains a lower intra-class distance and a higher inter-class distance, meaning that it enjoys better disentangling ability. We provide more t-SNE visualizations for different layer numbers and different GNN backbones in Appendix B.3.
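The paper does not spell out the exact distance formulas, so the sketch below assumes the common centroid-based definitions: intra-class distance as the mean distance of samples to their class centroid, and inter-class distance as the mean pairwise distance between class centroids.

```python
import math

def class_distances(embeddings, labels):
    """Centroid-based intra-class and inter-class distances (assumed defs)."""
    d = len(embeddings[0])
    cent, cnt = {}, {}
    for x, y in zip(embeddings, labels):
        c = cent.setdefault(y, [0.0] * d)
        for j in range(d):
            c[j] += x[j]
        cnt[y] = cnt.get(y, 0) + 1
    for y in cent:
        cent[y] = [v / cnt[y] for v in cent[y]]

    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    intra = sum(dist(x, cent[y]) for x, y in zip(embeddings, labels)) / len(labels)
    keys = sorted(cent)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    inter = sum(dist(cent[a], cent[b]) for a, b in pairs) / len(pairs)
    return intra, inter

intra, inter = class_distances([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]],
                               [0, 0, 1, 1])
print(intra, inter)  # tight clusters far apart: low intra, high inter
```

Under these definitions, a lower intra-class and higher inter-class distance corresponds to the better-disentangled clusters reported for MotifNorm.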

4.3. MORE COMPARISONS ON THE OTHER SIX DATASETS

For the graph prediction task, we compare normalizations on ogbg-moltoxcast, ogbg-molhiv, and ZINC using GCN and GAT as backbones, where ZINC is a graph regression dataset. For node and link property prediction, we conduct experiments on one social network dataset (Pubmed), one protein-protein association network dataset (ogbn-proteins), and a collaboration network between authors (ogbl-collab) using GCN as the backbone. The details are as follows.

4.4. ABLATION STUDY AND DISCUSSION

To explain the superior performance of MotifNorm, we perform extensive ablation studies to evaluate the contributions of its two key components, i.e., the representation calibration (RC) and representation enhancement (RE) operations. First, Figure 5 shows the ablation performance of GCN on ogbg-moltoxcast. Figures 5(a) and 5(b) show the ROC-AUC results with respect to training epochs when the number of layers is set to 4; both RC and RE improve the classification performance. Moreover, comparing these two figures, RC, which is plugged in at the beginning of BatchNorm with graph instance-specific statistics embedded, performs better than RE in terms of recognition results. To further explore the significance of RC and RE at different layers, we compute the mean values of w_RC and w_RE, which are visualized in Figure 5(c). As can be seen from the mean statistics of w_RC and w_RE across a 32-layer GCN, their absolute values become larger as the network goes deeper (especially in the last few layers), indicating that the structural information becomes increasingly critical with the number of layers. To evaluate the additional cost of the RC and RE operations, we compare runtime, parameters, and memory for GCN (l = 4) with BatchNorm and MotifNorm on the ogbg-molhiv dataset; runtime and memory are measured on an NVIDIA A40. The cost information is provided in Table 7.
Discussion. The main contribution of this work is a more expressive normalization module that can be plugged into any GNN architecture. Unlike existing normalization methods, which are usually task-specific and lack substructure information, the proposed method explicitly considers subgraph information to strengthen graph expressivity across various graph tasks.
In particular, for the task of graph classification, MotifNorm extends GNNs beyond the 1-WL test in distinguishing k-regular graphs. On the other hand, when the number of GNNs' layers becomes larger, MotifNorm can help alleviate the over-smoothing problem and meanwhile maintain better discriminative power for the node-relevant predictions.

5. CONCLUSION

In this paper, we introduce a higher-expressivity normalization architecture with abundant graph structure-specific information embedded to generally improve GNNs' expressivity and representation power on various graph tasks. We first empirically disentangle standard normalization into two stages, i.e., the centering & scaling (CS) and affine transformation (AT) operations, and then develop two skillful strategies to embed subgraph structural information into the CS and AT operations. Finally, we provide a theoretical analysis to support that MotifNorm can extend GNNs beyond the 1-WL test in distinguishing k-regular graphs and explain why it helps alleviate the over-smoothing issue when GNNs go deeper. Experimental results on eight popular benchmarks show that our method is highly efficient and can generally improve the performance of GNNs on various graph tasks.

A THEOREM ANALYSIS

This section provides the corresponding proofs to support the theorems in the main text.
A.1 PROOF FOR THEOREM 1
Theorem 1. MotifNorm helps alleviate the over-smoothing issue.
Proof. Given two extremely similar embeddings of u and v (i.e., ‖H_u − H_v‖_2 ≤ ε), assume for simplicity that ‖H_u‖_2 = ‖H_v‖_2 = 1, ‖w_RC‖_2 ≥ c_1, and that the motif-induced information scores of u and v differ by a considerable margin, ‖(M_RC 1_d^T)_u − (M_RC 1_d^T)_v‖_2 ≥ 2ε/c_1. We can obtain

‖(H_u + (w_RC ⊙ (M_RC 1_d^T))_u ⊙ (H_u + H_v)/2) − (H_v + (w_RC ⊙ (M_RC 1_d^T))_v ⊙ (H_u + H_v)/2)‖_2
≥ −‖H_u − H_v‖_2 + ‖((w_RC ⊙ (M_RC 1_d^T))_u − (w_RC ⊙ (M_RC 1_d^T))_v) ⊙ (H_u + H_v)/2‖_2
≥ −ε + ‖w_RC‖_2 · ‖(M_RC 1_d^T)_u − (M_RC 1_d^T)_v‖_2 · ‖(H_u + H_v)/2‖_2
≥ −ε + 2ε = ε,

where the subscripts u, v denote the u-th and v-th rows of a matrix in R^{n×d}. This inequality demonstrates that our RC operation can differentiate two nodes by a margin of ε even when their node embeddings become extremely similar after L GNN layers. Similarly, by assuming ‖Pow(M_RE, w_RE)_u‖_2 ≤ c_2 and ‖Pow(M_RE, w_RE)_u − Pow(M_RE, w_RE)_v‖_2 ≥ (1 + c_2) · ε, one can prove that the RE operation differentiates the embeddings with motif-induced information:

‖Pow(M_RE, w_RE)_u ⊙ H_u − Pow(M_RE, w_RE)_v ⊙ H_v‖_2
= ‖Pow(M_RE, w_RE)_u ⊙ (H_u − H_v) + (Pow(M_RE, w_RE)_u − Pow(M_RE, w_RE)_v) ⊙ H_v‖_2
≥ −‖Pow(M_RE, w_RE)_u ⊙ (H_u − H_v)‖_2 + ‖(Pow(M_RE, w_RE)_u − Pow(M_RE, w_RE)_v) ⊙ H_v‖_2
≥ −c_2 · ε + (1 + c_2) · ε = ε.

The proof is complete.

A.2 PROOF FOR PROPOSITION 1

Proposition 1. The RC operation helps stabilize model training.
Proof. The RC operation is formulated as RC: H_RC = H + w_RC ⊙ H_SA ⊙ (M_RC 1_d^T), where 1_d is the d-dimensional all-one column vector. Here H_SA introduces the current graph's instance-specific information, i.e., the mean representation of each graph, and w_RC is a learnable weight balancing mini-batch and instance-specific statistics. Assume the number of nodes in each graph is the same. The expectation of the input features after RC, i.e., E(H_RC), can be represented as

E(H_RC) = (1 + w_RC ⊙ (M_RC 1_d^T)) ⊙ E(H).

Let us consider the centering operation of normalization for the original input H and for the feature matrix H_RC after the RC operation, respectively:

H_Center-In = H − E(H),    H_Center-RC = H_RC − E(H_RC),

where H_Center-In and H_Center-RC denote the centering operation applied to H and H_RC. To compare the two centered features, we compute

H_Center-RC − H_Center-In = (H_RC − E(H_RC)) − (H − E(H))
= H + w_RC ⊙ (M_RC 1_d^T) ⊙ H_SA − (1 + w_RC ⊙ (M_RC 1_d^T)) ⊙ E(H) − (H − E(H))
= w_RC ⊙ (M_RC 1_d^T) ⊙ (H_SA − E(H)).

Here, we ignore the affine transformation operation and assume that values larger than the running mean, which are kept after the following activation layer, carry the important information of the representations, and vice versa. In the case w_RC > 0, when H_SA > E(H), more important information tends to be preserved, and vice versa. In the case w_RC < 0, when H_SA > E(H), the noisy features tend to be weakened, and vice versa. A similar analysis for BatchNorm2D is provided in (Gao et al., 2021); interested readers may refer to it for details. The main difference is that MotifNorm aims to embed structural information to compensate for the ignored characteristics of node-induced subgraphs, while RBN is proposed to address distribution differences. The proof is complete.

A.3 DESIGN OF THE REPRESENTATION CALIBRATION FACTOR

Here, we discuss the design of the representation calibration factor M_RC = M_SN ⊙ M_G, which is a normalization of the motif-induced weights M_G. If we directly adopted the original weights, which may contain many large and unequal values, training would oscillate; thus, normalization is essential for M_G. However, if we only performed the summation-normalization within each graph (i.e., M_SN), it could not distinguish two graphs with the same nodes but different degrees, e.g., the four graphs in Figure 6, where every weight would become 1/8. We therefore design the above normalization technique to strengthen the motif power for the representation calibration operation.
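This effect is easy to reproduce: in a vertex-transitive k-regular graph, every node gets the same ξ, so M_SN alone is uniform (1/8 for 8 nodes) regardless of k, while M_RC = M_SN ⊙ M_G still differs. The sketch below uses two 8-node circulant graphs as stand-ins for the graphs in Figure 6; ξ is the same weight as in Definition 1.

```python
def xi(adj, v):
    """Motif-induced weight: density(S_v) * |V(S_v)|^2 (Definition 1)."""
    nodes = set(adj[v]) | {v}
    m = sum(1 for a in nodes for b in adj[a] if b in nodes and a < b)
    n = len(nodes)
    return (2 * m / (n * (n - 1))) * n ** 2

def circulant(n, offsets):
    """Circulant graph: node i connects to i±o (mod n) for each offset o."""
    return {i: {(i + o) % n for o in offsets} | {(i - o) % n for o in offsets}
            for i in range(n)}

g1 = circulant(8, [1])       # 8-node cycle, 2-regular
g2 = circulant(8, [1, 2])    # 8-node circulant, 4-regular

for g in (g1, g2):
    weights = [xi(g, v) for v in g]                   # M_G (per node)
    m_sn = [w / sum(weights) for w in weights]        # summation-normalization
    m_rc = [s * w for s, w in zip(m_sn, weights)]     # M_RC = M_SN ⊙ M_G
    print(m_sn[0], m_rc[0])
# m_sn is 1/8 for every node of BOTH graphs, so it cannot tell them apart;
# m_rc differs because the raw motif weights xi differ between the graphs.
```

The rescaling by M_G restores exactly the degree/density information that per-graph normalization washes out.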

B EXPERIMENTAL DETAILS B.1 MORE DETAILS OF BENCHMARK DATASETS AND BASELINE METHODS

Benchmark Datasets. For the graph property prediction task, we select the IMDB-BINARY, ogbg-moltoxcast, ogbg-molhiv, and ZINC datasets. IMDB-BINARY is a k-regular dataset for binary classification, meaning each node has degree k. ogbg-moltoxcast is collected for multi-task classification, where the number of tasks is 617. ogbg-molhiv is a molecule dataset used for binary classification, but the output dimension of the end-to-end GNN is 1 because its metric is ROC-AUC. ZINC is a real-world molecule dataset; in this paper, we follow (Dwivedi et al., 2022a) and use ZINC for the graph regression task. These graph prediction datasets are from (Morris et al., 2020; Hu et al., 2020; Irwin et al., 2012), respectively. For node-level prediction, we select benchmark datasets including Cora, Pubmed, and ogbn-proteins; the first two are social networks and the last is a protein-protein association network. For the evaluation of link property prediction, we select the ogbl-collab dataset. These node and link prediction datasets are from (Kipf & Welling, 2017; Hu et al., 2020), respectively. More detailed dataset information is provided in Table 8.
Baseline Methods. To evaluate our proposed MotifNorm module, we compare it with other normalization methods adopted in GNNs for various graph tasks, including BatchNorm (Ioffe & Szegedy, 2015), GraphNorm (Cai et al., 2021), ExpreNorm (Dwivedi et al., 2022a) for graph property prediction, and PairNorm (Zhao & Akoglu, 2020a), NodeNorm (Zhou et al., 2021), MeanNorm (Yang et al., 2020a), GroupNorm (Zhou et al., 2020) for node and link property prediction. Part of these normalization methods is provided in (Chen et al., 2022b); the other source codes are provided by the authors.
For the backbone GNNs, we consider the most popular message-passing architectures, such as GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), GIN (Xu et al., 2019), SGC (Wu et al., 2019), and GraphSage (Hamilton et al., 2017). We compare our MotifNorm with all the above normalization modules. For the network architectures, we follow the CNA design, i.e., convolution, normalization, and activation. In this paper, we do not adopt tricks such as DropEdge (Huang et al., 2020; Rong et al., 2020) or residual connections (Xu et al., 2018; Li et al., 2019; Liu et al., 2020).
Experiment Setting. For the different datasets, we provide more detailed statistics in Table 8. The embedding dimension of each hidden layer on all datasets is set to 128. We optimize the GNN architectures using torch.optim.lr_scheduler.ReduceLROnPlateau with a patience of 10 or 15 steps to reduce the learning rate. The learning rate is 1e-3 for graph classification and 1e-2 for node and link prediction. When the learning rate drops to 1e-5, training is terminated. More detailed experimental settings are provided in Table 9. In particular, we split the IMDB-BINARY dataset into a train-valid-test format using a hierarchical scheme. In detail, we first segment the dataset according to edge density into 10 sets (i.e., edge density ∈ [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]) and then sort the graphs within each set by average degree. Finally, we select samples in each segment at a fixed step size as validation and test samples. The statistics for splitting the validation and test sets of Label-0 and Label-1 on IMDB-BINARY are provided in Table 10. By adopting this splitting scheme, the distribution differences among the train, validation, and test sets are weakened (experimental results support this, though without a theoretical basis at present). The splitting details are implemented in datasets/dgl_imdb_dataset.py.
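The described splitting procedure can be sketched as follows. This is a simplification of the released script: we return a single held-out pool rather than separate validation and test sets, and the step size is illustrative.

```python
def hierarchical_split(graphs, step=10):
    """Hierarchical split sketch: bucket graphs by edge density into 10
    segments, sort each segment by average degree, then hold out every
    `step`-th graph. `graphs` is a list of (graph_id, edge_density, avg_degree)."""
    buckets = [[] for _ in range(10)]
    for gid, density, avg_deg in graphs:
        idx = min(int(density * 10), 9)   # density assumed in [0, 1]
        buckets[idx].append((avg_deg, gid))

    train, held_out = [], []
    for bucket in buckets:
        for rank, (_, gid) in enumerate(sorted(bucket)):
            (held_out if rank % step == step - 1 else train).append(gid)
    return train, held_out

# 20 toy graphs, all in the lowest-density bucket, avg degree = graph id
graphs = [(i, 0.05, float(i)) for i in range(20)]
train, held_out = hierarchical_split(graphs, step=10)
print(held_out)  # every 10th graph by degree rank is held out
```

Because held-out graphs are drawn at a fixed stride from each density/degree stratum, the three splits cover the same structural range, which is the stated goal of the scheme.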
To reproduce the comparison results using a single layer of MLP and GIN, the dropout is set to 0.0, and the learning rate is warmed up from 0.0 to 1e-3 over the first 50 epochs. When the number of layers equals 4, the dropout at the input layer is selected from {0.3, 0.4, 0.5} (i.e., -init_dp in the code), and that of the hidden layers is set to 0.5. To draw Figure 2, we remove the warmup operation for the learning rate (i.e., remove -lr_warmup in the shell files). Implementation. MotifNorm needs to preprocess the motif-induced weights into the datasets and then load this information to embed structural information into the node representations. Especially for node-relevant classification, node representations need to have the same power before MotifNorm; thus, we perform ℓ2 normalization to ensure their power is consistent. The two scripts are at datasets/preprocess.py and modules/norm/motifnorm.py.

B.3 MORE COMPARISONS ON THE OVER-SMOOTHING ISSUE

This section provides illustrations of GCN (Kipf & Welling, 2017) and GraphSage (Hamilton et al., 2017) with six existing normalizations on the Cora dataset. Given the node embeddings X = [x_1, x_2, ..., x_n]^T ∈ R^{n×d}, where x_i ∈ R^d denotes the embedding of node i, the existing normalization methods for GNNs, including PairNorm (Zhao & Akoglu, 2020a), NodeNorm (Zhou et al., 2021), MeanNorm (Yang et al., 2020a), GroupNorm (Zhou et al., 2020), and BatchNorm (Ioffe & Szegedy, 2015), are depicted as follows:

PairNorm (Zhao & Akoglu, 2020a): x̃_i = x_i − (1/n) Σ_{i=1}^n x_i,  PairNorm(x̃_i, s) = s · x̃_i / √((1/n) Σ_{i=1}^n ‖x̃_i‖²_2).
NodeNorm (Zhou et al., 2021): NodeNorm(x_i, p) = x_i / std(x_i)^{1/p}.
MeanNorm (Yang et al., 2020a): MeanNorm(x^(i)) = x^(i) − E[x^(i)].
BatchNorm (Ioffe & Szegedy, 2015): BatchNorm(X) = (X − E(X)) / √(D(X) + ε) ⊙ γ + β.
GroupNorm (Zhou et al., 2020): GroupNorm(X; G, λ) = X + λ · Σ_{g=1}^G BatchNorm(X̃_g), where X̃_g = softmax(X · U)[:, g] ⊙ X.

For MotifNorm, please refer to Eq. (5). Here x^(i) ∈ R^n denotes the i-th column of X; s in PairNorm is a hyperparameter controlling the average pairwise variance, and we choose s = 1 in our case; p in NodeNorm denotes the normalization order, and our paper uses p = 2. The t-SNE illustrations of the different normalizations embedded in GCN and GraphSage are provided in Figures 9 and 10, respectively, where the numbers of layers are 4, 16, and 32. Figure 11 shows the over-smoothing phenomenon with increasing layers using GraphSage, where the number of layers runs from 16 to 32 with a step size of 2.
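For reference, two of these baselines are simple enough to state in a few lines of plain Python. These are didactic sketches of the formulas above, not the authors' implementations.

```python
import math

def pairnorm(X, s=1.0):
    """PairNorm: center rows, then rescale so the mean squared row norm is s^2."""
    n, d = len(X), len(X[0])
    mu = [sum(r[j] for r in X) / n for j in range(d)]
    Xc = [[r[j] - mu[j] for j in range(d)] for r in X]
    scale = math.sqrt(sum(sum(v * v for v in r) for r in Xc) / n)
    return [[s * v / scale for v in r] for r in Xc]

def nodenorm(x, p=2):
    """NodeNorm: divide a node's features by std(x)^(1/p)."""
    d = len(x)
    mu = sum(x) / d
    std = math.sqrt(sum((v - mu) ** 2 for v in x) / d)
    return [v / std ** (1 / p) for v in x]

Xp = pairnorm([[1.0, 1.0], [3.0, 3.0]])
# mean squared row norm of Xp is 1.0 (= s^2), as PairNorm requires
```

Both operate per node or per batch without any structural input, which is precisely the gap MotifNorm targets.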



Figure 1: (a) The illustration of the subtree-isomorphic phenomenon, where the two subgraphs S_{v_1} and S_{v_2} are induced by root nodes v_1 and v_2 with the same degree k = 4, but the connection information among their neighborhoods differs. (b) The t-SNE illustration of the over-smoothing issue on the Cora dataset when the GraphSage depth is up to 20 layers. Here, we show the first three categories for visualization.
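A classic (not paper-specific) way to see this limitation of plain neighborhood aggregation: the 3-regular graphs K_{3,3} and the triangular prism receive identical node embeddings under sum aggregation with constant input features, even though one is triangle-free and the other contains two triangles. The sketch below assumes NumPy only:

```python
import numpy as np

def adj(edges, n):
    # Build a symmetric adjacency matrix from an edge list.
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return A

# Two non-isomorphic 3-regular graphs on 6 nodes: K_{3,3} (triangle-free)
# and the triangular prism (two triangles).
k33 = adj([(0, 3), (0, 4), (0, 5), (1, 3), (1, 4), (1, 5),
           (2, 3), (2, 4), (2, 5)], 6)
prism = adj([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5),
             (0, 3), (1, 4), (2, 5)], 6)

def sum_aggregate(A, layers=3):
    # Plain neighborhood aggregation with constant initial features:
    # every layer replaces a node's feature by the sum over its neighbors.
    h = np.ones((A.shape[0], 1))
    for _ in range(layers):
        h = A @ h
    return h

def triangle_count(A):
    # trace(A^3) counts closed 3-walks; each triangle contributes 6 of them.
    return round(np.trace(A @ A @ A) / 6)

h_k33, h_prism = sum_aggregate(k33), sum_aggregate(prism)
# h_k33 and h_prism coincide, yet the two graphs differ structurally.
```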

Figure 2: Learning curves of one-layer MLP, GIN, MLP + MotifNorm and GIN + MotifNorm on IMDB-BINARY dataset with various k-regular graphs.

Figures 3(a)∼3(f) show the t-SNE visualization of different normalization techniques. We can find that none of them suffers from the over-smoothing issue as severely as GraphSage with BatchNorm (shown in Figure 1(b)); however, MotifNorm can better distinguish the different categories.

Figure 3: The t-SNE visualization of node representations using GCN with different normalization methods on the Cora dataset. Panels: MotifNorm, GroupNorm, BatchNorm, NodeNorm, MeanNorm, PairNorm.

Figure 5: Ablation study of the Representation Calibration (RC) and Representation Enhancement (RE) operations in MotifNorm on the ogbg-moltoxcast dataset. Here we use GCN as the basic backbone to conduct the ablation study.

Figure 6: The illustration of four k-regular graphs with the same nodes but different structures. When directly performing the summation-normalization operation on the motif-induced information, all weights will be equal to 1/8.
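The 1/8 claim can be checked numerically: on any k-regular graph with eight nodes, summation-normalization collapses the per-node counts to the uniform weight 1/8 regardless of the structure. Below is an illustrative sketch (using node degrees as a stand-in for the motif-induced counts; the helper names are ours):

```python
import numpy as np

def cycle_edges(nodes, n=8):
    # Adjacency matrix (on n nodes) of a cycle through the given node list.
    A = np.zeros((n, n))
    for i in range(len(nodes)):
        u, v = nodes[i], nodes[(i + 1) % len(nodes)]
        A[u, v] = A[v, u] = 1.0
    return A

def summation_normalize(counts):
    # Summation-normalization: divide each node's count by the graph total.
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# Two structurally different 2-regular graphs on 8 nodes:
# one 8-cycle vs. two disjoint 4-cycles.
c8 = cycle_edges(list(range(8)))
two_c4 = cycle_edges([0, 1, 2, 3]) + cycle_edges([4, 5, 6, 7])

w_c8 = summation_normalize(c8.sum(axis=1))          # all weights 1/8
w_two_c4 = summation_normalize(two_c4.sum(axis=1))  # also all 1/8
```

Both graphs yield the same uniform weight vector, so a purely summation-normalized weighting cannot tell the two structures apart.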

Figure 7: Learning curves of one-layer MLP, GIN, MLP + MotifNorm and GIN + MotifNorm on IMDB-BINARY dataset (without learning rate warmup).

Figure 8: Learning curves of one-layer GIN, GIN + BatchNorm and GIN + MotifNorm on IMDB-BINARY dataset with learning rate warmup (i.e., the curves of the reported scores in Table 2).

Figure 9: The t-SNE visualization using GCN with different normalization methods on Cora dataset.

Figure 10: The t-SNE visualization using GraphSage with different normalization methods on Cora dataset.

E(H) and D(H) denote the mean and variance statistics, and γ, β ∈ R^{1×d} are the learned scale and bias factors. In this work, we aim to embed structural weights into BatchNorm, and thus take a batch of graphs as an example to introduce the notation.
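For intuition, here is a minimal NumPy sketch of the batch-level statistics over a batch of graphs (the helper name batch_norm_nodes is ours, not from the released code): node features of all graphs in the batch are stacked row-wise, so E(H) and D(H) are shared across graph instances:

```python
import numpy as np

def batch_norm_nodes(H, gamma, beta, eps=1e-5):
    # Normalize each feature channel over all nodes of all graphs in the
    # batch, then apply the learned scale gamma and bias beta (shape (1, d)).
    mean = H.mean(axis=0, keepdims=True)
    var = H.var(axis=0, keepdims=True)
    return (H - mean) / np.sqrt(var + eps) * gamma + beta

# A batch of two graphs with 3 and 2 nodes (d = 4): node features are
# stacked row-wise into one (5, 4) matrix, so E(H) and D(H) are computed
# jointly over both graph instances.
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(size=(3, 4)), rng.normal(size=(2, 4))])
out = batch_norm_nodes(H, gamma=np.ones((1, 4)), beta=np.zeros((1, 4)))
```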

The statistics of the benchmark datasets under different graph-structured tasks.

Experimental results on IMDB-BINARY dataset with various k-regular graphs. The best results under different backbones are highlighted with boldface.

Experimental results of different normalization methods on the graph prediction task. We use GCN, GAT as the backbones. The best results on each dataset are highlighted with boldface.

Method      l=4    l=16   l=32  |  l=4    l=16   l=32  |  l=4    l=16   l=32
GraphNorm   60.53  52.79  53.22 |  75.30  73.86  64.03 |  0.576  1.254  1.537
BatchNorm   63.31  53.39  53.24 |  76.07  76.87  73.74 |  0.585  0.624  0.643
UnityNorm   63.47  58.76  57.13 |  75.91  76.19  75.46 |  0.563  0.621  0.777
ExpreNorm   65.56  57.65  57.60 |  76.99  72.24  72.56 |  0.555  0.562  1.451
MotifNorm   66.57  64.04  58.26 |  77.36  77.08  76.70 |  0.495  0.517  0.522

The comparison results of different normalization methods on the node prediction and link prediction tasks using GCN as the backbone. The best results are highlighted with boldface.

Method      l=4    l=16   l=32  |  l=4    l=16   l=32  |  l=4    l=16   l=32
NoNorm      76.16  54.67  45.58 |  69.16  63.24  63.15 |  35.38  22.11  15.24
PairNorm    74.25  56.24  55.13 |  69.28  63.15  63.00 |  31.26  23.22  14.69
NodeNorm    76.02  40.87  41.18 |  70.17  63.50  63.23 |  27.48  08.48  08.28
MeanNorm    76.05  73.40  65.34 |  69.14  63.05  62.40 |  33.28  22.56  16.16
GroupNorm   76.19  63.55  54.84 |  70.25  62.74  63.63 |  35.28  27.41  20.27
BatchNorm   75.62  48.88  43.28 |  69.96  67.36  63.86 |  47.57  26.14  21.68
MotifNorm   77.08  76.66  67.81 |  71.69  68.66  68.05 |  51.65  50.01  47.65

Comparisons with empirical tricks on ogbg-molhiv and ZINC datasets.

Comparisons with empirical tricks on the ogbn-proteins and ogbl-collab datasets. Firstly, we adopt the vanilla GNN model without any tricks (e.g., residual connections, etc.), and provide the hyperparameter settings in Appendix B.1. According to the mean results (w.r.t. 10 different seeds) shown in Table 3 and Table 4, we can conclude that MotifNorm generally improves the graph expressivity of GNNs on the graph prediction task and helps alleviate the over-smoothing issue as the number of layers increases. Furthermore, we provide more comparison experiments using GIN and GraphSage as backbones in Appendix B.4. The results in Table 5 and Table 6 demonstrate that MotifNorm preserves its superiority in the graph, node and link prediction tasks compared with other existing normalizations.

The cost comparisons between BatchNorm and MotifNorm.

The statistics of the 8 benchmark datasets.

The detailed experimental settings of GNNs on various graph-structured tasks.

The split statistics for IMDB-BINARY.

Experimental results on IMDB-BINARY dataset with various k-regular graphs. The best results under different backbones are highlighted with boldface.

Experimental results of different normalization methods on the graph prediction task. We use GCN, GAT and GIN as the basic backbones and set the number of layers as 4, 16 and 32. The best results on each dataset are highlighted with boldface.

Backbone   Method      l=4    l=16   l=32  |  l=4    l=16   l=32  |  l=4    l=16   l=32
GCN        PairNorm    60.69  59.45  55.07 |  74.06  72.13  62.25 |  0.573  0.569  0.597
           NodeNorm    61.94  55.58  49.90 |  74.80  57.75  57.64 |  0.625  1.332  1.547
           MeanNorm    63.21  60.42  54.76 |  74.42  72.39  60.59 |  0.602  0.637  0.695
           GroupNorm   61.58  59.48  56.73 |  76.66  71.84  66.38 |  0.641  0.673  0.737
           GraphNorm   60.78  53.75  53.36 |  75.59  65.55  66.49 |  0.592  0.655  1.547
           BatchNorm   63.39  59.73  53.47 |  76.11  76.62  74.21 |  0.573  0.611  0.655
           ExpreNorm   64.97  57.91  57.82 |  76.05  76.75  72.36 |  0.564  0.570  0.646
           MotifNorm   66.92  65.19  63.40 |  77.29  77.71  75.99 |  0.489  0.524  0.523
GAT        NoNorm      62.61  50.84  50.12 |  76.71  57.38  50.64 |  0.714  1.541  1.547
           GraphNorm   60.53  52.79  53.22 |  75.30  73.86  64.03 |  0.576  1.254  1.537
           BatchNorm   63.31  53.39  53.24 |  76.07  76.87  73.74 |  0.585  0.624  0.643
           ExpreNorm   65.56  57.65  57.60 |  76.99  72.24  72.56 |  0.555  0.562  1.451
           MotifNorm   66.57  64.04  58.26 |  77.36  77.08  76.70 |  0.495  0.517  0.522
GIN        NoNorm      62.19  56.38  54.83 |  76.33  69.70  58.87 |  0.496  0.520  1.069
           GraphNorm   62.44  54.95  55.72 |  76.55  66.00  67.01 |  0.462  1.203  1.446
           BatchNorm   63.72  58.67  55.56 |  76.62  70.28  66.82 |  0.477  0.516  1.153
           ExpreNorm   65.98  57.80  56.56 |  76.23  69.97  70.96 |  0.438  0.482  1.157
           MotifNorm   66.65  63.01  57.46 |  77.38  73.03  72.89 |  0.410  0.458  0.902

Experimental results of different normalization methods on the node prediction task and link prediction task. We use GCN and GraphSage as the basic backbones and set the number of layers as 4, 16 and 32. The best results on each dataset are highlighted with boldface.

Backbone   Method      l=4    l=16   l=32  |  l=4    l=16   l=32  |  l=4    l=16   l=32
GCN        NoNorm      76.16  54.67  45.58 |  69.16  63.24  63.15 |  35.38  22.11  15.24
           PairNorm    74.25  56.24  55.13 |  69.28  63.15  63.00 |  31.26  23.22  14.69
           NodeNorm    76.02  40.87  41.18 |  70.17  63.50  63.23 |  27.48  08.48  08.28
           MeanNorm    76.05  73.40  65.34 |  69.14  63.05  62.40 |  33.28  22.56  16.16
           GroupNorm   76.19  63.55  54.84 |  70.25  62.74  63.63 |  35.28  27.41  20.27
           BatchNorm   75.62  48.88  43.28 |  69.96  67.36  63.86 |  47.57  26.14  21.68
           MotifNorm   77.08  76.66  67.81 |  71.69  68.66  68.05 |  51.65  50.01  47.65
GraphSage  NoNorm      76.94  40.65  41.67 |  66.05  60.56  60.47 |  25.27  02.08  00.00
           PairNorm    72.78  53.02  45.90 |  62.29  60.53  60.32 |  41.72  16.88  12.44
           NodeNorm    77.22  40.64  40.64 |  64.48  62.63  61.89 |  19.74  02.57  02.62
           MeanNorm    76.68  58.70  47.48 |  63.69  61.03  52.06 |  46.17  21.54  13.16
           GroupNorm   76.83  40.42  43.49 |  68.09  61.58  60.60 |  45.43  23.98  15.43
           BatchNorm   75.49  45.11  42.74 |  63.75  62.96  61.54 |  47.05  23.01  14.89
           MotifNorm   77.48  76.54  73.68 |  67.81  67.02  66.07 |  52.31  48.94  48.39

